<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Avthar Sewrathan</title>
    <description>The latest articles on Forem by Avthar Sewrathan (@avthars).</description>
    <link>https://forem.com/avthars</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F242937%2Fbed60e22-a765-412d-a226-14d139264c4d.jpeg</url>
      <title>Forem: Avthar Sewrathan</title>
      <link>https://forem.com/avthars</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/avthars"/>
    <language>en</language>
    <item>
      <title>🚀 pgai Vectorizer: Automate AI Embeddings With One SQL Command in PostgreSQL</title>
      <dc:creator>Avthar Sewrathan</dc:creator>
      <pubDate>Tue, 29 Oct 2024 13:31:18 +0000</pubDate>
      <link>https://forem.com/tigerdata/pgai-vectorizer-automate-ai-embeddings-with-one-sql-command-in-postgresql-11kp</link>
      <guid>https://forem.com/tigerdata/pgai-vectorizer-automate-ai-embeddings-with-one-sql-command-in-postgresql-11kp</guid>
      <description>&lt;p&gt;&lt;em&gt;Learn how to automate AI embedding creation using the PostgreSQL you know and love.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Managing embedding workflows for AI systems like RAG, search, and AI agents can be a hassle: juggling multiple tools, setting up complex pipelines, and spending hours syncing data, especially if you aren't an ML or AI expert. But it doesn’t have to be that way.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md" rel="noopener noreferrer"&gt;pgai Vectorizer&lt;/a&gt;, now in Early Access, you can automate vector embedding creation, keep embeddings automatically in sync as your data changes, and experiment with different AI models -- all with a single SQL command. No extra tools, no complex setups -- just PostgreSQL doing the heavy lifting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a vectorizer to embed data in the blogs table&lt;/span&gt;
&lt;span class="c1"&gt;-- Use Open AI text-embedding-3-small model&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_vectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'public.blogs'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text-embedding-3-small'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;chunking&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunking_recursive_character_text_splitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'content'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What pgai Vectorizer Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding creation with SQL:&lt;/strong&gt; generate vector embeddings from multiple text columns with just one command, streamlining a key part of your AI workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic sync:&lt;/strong&gt; embeddings update as your data changes—no manual intervention needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick model switching:&lt;/strong&gt; test different AI models instantly using SQL—no data reprocessing required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test and roll out:&lt;/strong&gt; compare models and chunking techniques, A/B test, and roll out updates with confidence and without downtime.&lt;/li&gt;
&lt;/ul&gt;
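
&lt;p&gt;For illustration, once a vectorizer like the one above is running, the generated embeddings can be queried with plain SQL. The sketch below assumes the vectorizer's default destination view (&lt;code&gt;blogs_embedding&lt;/code&gt;), pgvector's distance operator, and pgai's &lt;code&gt;ai.openai_embed&lt;/code&gt; helper -- check the pgai docs for the exact names in your setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch: semantic search over the auto-generated embedding view
-- (view and column names assume the vectorizer defaults)
SELECT chunk
FROM blogs_embedding
ORDER BY embedding &amp;lt;=&amp;gt; ai.openai_embed('text-embedding-3-small', 'What is AI?')
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;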

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a1wkn7uugpjmg7dsdp9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8a1wkn7uugpjmg7dsdp9.png" alt="pgai Vectorizer system architecture –  Pgai Vectorizer automatically creates and updates embeddings from a source data table through the use of work queues and configuration tables housed in PostgreSQL, while embeddings are created in an external worker that interacts with embedding services like the OpenAI API.&amp;lt;br&amp;gt;
" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's an example of testing the RAG output of two different embedding models using pgai Vectorizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Vectorizer using OpenAI text-embedding-3-small&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_vectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="s1"&gt;'public.blogs'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;destination&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'blogs_embedding_small'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text-embedding-3-small'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;chunking&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunking_recursive_character_text_splitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'content'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;formatting&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;formatting_python_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Title: $title&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;URL: $url&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;Content: $chunk'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Vectorizer using OpenAI text-embedding-3-large&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_vectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="s1"&gt;'public.blogs'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;destination&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'blogs_embedding_large'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text-embedding-3-large'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- Note different dimensions&lt;/span&gt;
   &lt;span class="n"&gt;chunking&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunking_recursive_character_text_splitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'content'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;formatting&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;formatting_python_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Title: $title&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;URL: $url&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;Content: $chunk'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Compare results from the two vectorizers on the same RAG query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
   &lt;span class="s1"&gt;'text-embedding-3-small'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;generate_rag_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="s1"&gt;'What is AI?'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="s1"&gt;'public.blogs_embedding_small'&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
   &lt;span class="s1"&gt;'text-embedding-3-large'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;generate_rag_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="s1"&gt;'What is AI?'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="s1"&gt;'public.blogs_embedding_large'&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Built to Scale
&lt;/h2&gt;

&lt;p&gt;As your datasets grow, pgai Vectorizer scales with you. It automatically optimizes search performance with vector indexes (like HNSW and StreamingDiskANN) once you exceed 100,000 vectors. You’re in control—define chunking and formatting rules to tailor your embeddings to your needs.&lt;/p&gt;

&lt;p&gt;Here's an example of an advanced vectorizer configuration, with an ANN index created once the table exceeds 100k rows and custom chunking for HTML content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="c1"&gt;-- Advanced vectorizer configuration&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_vectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="s1"&gt;'public.blogs'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;destination&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'blogs_embedding_recursive'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text-embedding-3-small'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="c1"&gt;-- automatically create a StreamingDiskANN index when table has 100k rows&lt;/span&gt;
   &lt;span class="n"&gt;indexing&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexing_diskann&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_rows&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storage_layout&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'memory_optimized'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="c1"&gt;-- apply recursive chunking with specified settings for HTML content&lt;/span&gt;
   &lt;span class="n"&gt;chunking&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunking_recursive_character_text_splitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="s1"&gt;'content'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;chunk_overlap&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="c1"&gt;-- HTML-aware separators, ordered from highest to lowest precedence&lt;/span&gt;
       &lt;span class="n"&gt;separator&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
           &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;/article&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;-- Split on major document sections&lt;/span&gt;
           &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;/div&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;-- Split on div boundaries&lt;/span&gt;
           &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;/section&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;/p&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;-- Split on paragraphs&lt;/span&gt;
           &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;br&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;-- Split on line breaks&lt;/span&gt;
           &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;/li&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- Split on list items&lt;/span&gt;
           &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'. '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- Fall back to sentence boundaries&lt;/span&gt;
           &lt;span class="s1"&gt;' '&lt;/span&gt;          &lt;span class="c1"&gt;-- Last resort: split on spaces&lt;/span&gt;
       &lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;formatting&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;formatting_python_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'title: $title url: $url $chunk'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Try pgai Vectorizer Today (Early Access)
&lt;/h2&gt;

&lt;p&gt;For companies like MarketReader, pgai Vectorizer has already made AI development faster and more efficient:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“pgai Vectorizer streamlines our AI workflow, from embedding creation to real-time syncing, making AI development faster and simpler -- all in PostgreSQL.” — Web Begole, CTO at MarketReader, an AI Financial Insights Company&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're ready to start building, we are hosting a &lt;a href="https://dev.to/challenges/pgai"&gt;Dev Challenge&lt;/a&gt; with our partners at Ollama, all about building AI apps with open-source software. We're excited to see what the community builds with PostgreSQL and pgai Vectorizer!&lt;/p&gt;

&lt;p&gt;Save time and effort. Focus less on embeddings and more on building your next killer AI app. Try pgai Vectorizer free today: &lt;a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer-quick-start.md" rel="noopener noreferrer"&gt;get it on GitHub&lt;/a&gt; or fully managed on &lt;a href="https://console.cloud.timescale.com/signup/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=vectorlaunch&amp;amp;" rel="noopener noreferrer"&gt;Timescale Cloud&lt;/a&gt; (free for a limited time during Early Access).&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>postgres</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Build More Accurate Grafana Trend Lines: Give Perspective with Series-Override</title>
      <dc:creator>Avthar Sewrathan</dc:creator>
      <pubDate>Thu, 30 Apr 2020 19:58:42 +0000</pubDate>
      <link>https://forem.com/tigerdata/how-to-build-more-accurate-grafana-trend-lines-give-perspective-with-series-override-9i8</link>
      <guid>https://forem.com/tigerdata/how-to-build-more-accurate-grafana-trend-lines-give-perspective-with-series-override-9i8</guid>
      <description>&lt;h2&gt;
  
  
  Problem: Skewed Trends Due to Differences in Data Scale
&lt;/h2&gt;

&lt;p&gt;Many times, we want to plot two variables on the same graph (a useful feature of viz tools like &lt;a href="https://grafana.com" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;) but run into one big problem: the scale of one variable distorts the trend line of the other.&lt;/p&gt;

&lt;p&gt;Case in point is this graph I put together to track COVID-19 cases and deaths in the USA:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frm5wlf4d6p7d3sgb6f6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frm5wlf4d6p7d3sgb6f6e.png" alt="COVID cases and deaths on same Y axis" width="606" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the scale of the total cases makes the trend line for deaths look flat, even though deaths are actually growing rapidly, as shown by the graph below, which plots only COVID-19-related deaths:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqgrc28o0gq56ya6a5rsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqgrc28o0gq56ya6a5rsd.png" alt="COVID US Deaths plotted by itself" width="597" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Viewing two related data points in one graph is extremely useful for creating informationally dense dashboards and comparing related variables, but distorted trends can have large consequences - whether that means viewing the COVID fatality situation more optimistically than we should, or misreading the relationship between our eCommerce site's unique visitors and session crashes.&lt;/p&gt;

&lt;p&gt;We need a way to more accurately represent the trends of both variables while still plotting them on the same graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution: Two Y Axes!
&lt;/h2&gt;

&lt;p&gt;The solution is to use a different Y axis for each variable on our graph. Continuing with my COVID-19 example, this means one for the total cases variable and one for the total deaths variable, as shown in the graph below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F59b03lxhoz9eonsnogz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F59b03lxhoz9eonsnogz7.png" alt="COVID Deaths and Cases plotted on different Y-axes" width="604" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we use two Y axes, one for COVID-19 total cases, on the left, and one for total deaths, on the right. &lt;/p&gt;

&lt;p&gt;Each axis has its own scale, allowing us to more accurately see the growth of each trend line without the scale of one variable (e.g., total volume of reported cases) impacting how another variable (e.g., the growing number of deaths) appears.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself: Implementation in Grafana with Series Override
&lt;/h2&gt;

&lt;p&gt;In this post, I'll show you how to use Grafana’s series override feature to implement two Y axes (and, thus, solve our two-trend line problem).&lt;/p&gt;

&lt;p&gt;We’ll use the example of charting the spread of COVID-19 cases and deaths in the USA, but the concepts apply to any dataset you’d like to visualize in Grafana. We’ll get our COVID-19 data from &lt;a href="https://github.com/nytimes/covid-19-data" rel="noopener noreferrer"&gt;the New York Times’ public dataset&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;To replicate the graph I’ll create in the following steps, you’ll need: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://grafana.com" rel="noopener noreferrer"&gt;Grafana instance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;TimescaleDB database, loaded with the NYT COVID-19 data.&lt;/li&gt;
&lt;li&gt;PostgreSQL datasource, with TimescaleDB enabled, connected to your Grafana instance. See &lt;a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=advocacy-apr-2020&amp;amp;utm_content=grafana-viz-doc" rel="noopener noreferrer"&gt;here&lt;/a&gt; to get this set up. &lt;/li&gt;
&lt;li&gt;Grafana panel with Graph visualization using the PostgreSQL database with the COVID data as the data source. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Create two series
&lt;/h3&gt;

&lt;p&gt;Plotting multiple series in one panel is a handy Grafana feature. Let’s create two series, one for COVID-19 cases and the other for COVID-19 deaths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nv"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_cases&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deaths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_deaths&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;states&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how we alias the summed cases and deaths as total_cases and total_deaths, respectively; Grafana uses these aliases as the series names, which we’ll reference in the override below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Modify our visualization to add a second Y axis
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdxmub3wkfmem2v6laath.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdxmub3wkfmem2v6laath.png" alt="series override configuration settings in Grafana" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, navigate to the visualization panel (pictured above) and select the &lt;code&gt;Add series override&lt;/code&gt; button. &lt;/p&gt;

&lt;p&gt;Next, we select the name of the series we'd like to override, “total_deaths”, from the drop-down menu. To associate the series with the second Y axis, we select the ‘plus’ button and then select Y-Axis 2, as shown below: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fygvophv46auyskiwxbcr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fygvophv46auyskiwxbcr.png" alt="How to find Y-axis 2 in Grafana series override settings" width="375" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we navigate down to the Axes section, we see &lt;code&gt;Left Y&lt;/code&gt; and &lt;code&gt;Right Y&lt;/code&gt;, where we customize the units and scale for each axis. &lt;/p&gt;

&lt;p&gt;In our case, we’ll leave the units as &lt;code&gt;short&lt;/code&gt; and the scale as &lt;code&gt;linear&lt;/code&gt;, since those defaults work for the scalar quantities in our COVID dataset.&lt;/p&gt;

&lt;p&gt;Finally, we save the graph and refresh. We should now see both variables, total cases and deaths, plotted on the same graph, but with differently scaled axes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnelk5simidx2lgoz96wl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnelk5simidx2lgoz96wl.jpg" alt="Before and After Series Override" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that we can now clearly see how quickly COVID-19 deaths in the USA are growing, which was difficult to discern in the original graph, where deaths and total COVID-19 cases shared a single Y axis.&lt;/p&gt;

&lt;p&gt;That’s it! We’ve successfully created a graph with two Y axes, using series-override!&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;p&gt;Found this tutorial useful? Here are two more resources to help you build Grafana dashboards like a pro: &lt;/p&gt;

&lt;h3&gt;
  
  
  #1 Grafana Webinar
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.timescale.com/webinar/guide-to-grafana-101-getting-started-with-alerts/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=advocacy-apr-2020&amp;amp;utm_content=grafana-101-webinar-2-signup" rel="noopener noreferrer"&gt;Join me on May 20 at 10am PT/1pm ET/4pm GMT&lt;/a&gt; where I’ll demo how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use alerts effectively when monitoring metrics in Grafana&lt;/li&gt;
&lt;li&gt;Define alert rules for your panels and dashboards&lt;/li&gt;
&lt;li&gt;Configure different notification channels, like Slack and email&lt;/li&gt;
&lt;li&gt;Take my demo and customize it for your project, team, or organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ll focus on code and step-by-step live demos – and my dashboarding-expert colleagues and I will be available to answer questions throughout the session, plus share ample resources and technical documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  #2 All-in-One Grafana Tutorial
&lt;/h3&gt;

&lt;p&gt;We’ve compiled all our tutorials, tips, and tricks for visualizing PostgreSQL data in Grafana into one doc. You’ll find everything from how to create visuals for Prometheus metrics to how to visualize geospatial data using a World Map. Check it out &lt;a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=advocacy-apr-2020&amp;amp;utm_content=grafana-viz-doc" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>sql</category>
      <category>tutorial</category>
      <category>devops</category>
    </item>
    <item>
      <title>Devopsdays NYC 2020 Demo, Open Space Recap &amp; More</title>
      <dc:creator>Avthar Sewrathan</dc:creator>
      <pubDate>Wed, 18 Mar 2020 22:13:23 +0000</pubDate>
      <link>https://forem.com/tigerdata/devopsdays-nyc-2020-demo-open-space-recap-more-3n92</link>
      <guid>https://forem.com/tigerdata/devopsdays-nyc-2020-demo-open-space-recap-more-3n92</guid>
      <description>&lt;p&gt;&lt;strong&gt;Learn about the latest devopsdays event, get our demo, answers to community questions, and more.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(This post was originally published on the &lt;a href="https://blog.timescale.com/blog/devopsdays-nyc-2020-demo-open-space-recap-more/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;Timescale Blog&lt;/a&gt; on March 13, 2020.)&lt;/p&gt;

&lt;p&gt;We recently attended the NYC installment of the &lt;a href="https://devopsdays.org/about" rel="noopener noreferrer"&gt;devopsdays event series&lt;/a&gt; (thank you to the local organizers and volunteers!), where we met with community members interested in all things monitoring, infrastructure, software development, and CI/CD.&lt;/p&gt;

&lt;p&gt;Given the cancellation of many industry events to ensure public safety and mitigate COVID-19’s spread (&lt;a href="https://blog.timescale.com/blog/charting-the-spread-of-covid-19-using-timescale/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;check out our blog post if you’re interested in monitoring it yourself&lt;/a&gt;), we’re sharing a bit about our recent experience – what we learned, what we demoed, and what we spoke about – to bring the event experience to the wider community. &lt;/p&gt;

&lt;h1&gt;
  
  
  The Demo
&lt;/h1&gt;

&lt;p&gt;During the event, I demoed how to use TimescaleDB as a long-term store for Prometheus metrics - combining Prometheus, TimescaleDB, and Grafana to monitor a piece of critical infrastructure (in this case, a database). This sort of create-your-own flexibility and customization is becoming more and more common in the conversations I have with developers, and this demo allows you to create a monitoring stack that suits your needs, without adding significant costs.&lt;/p&gt;

&lt;p&gt;Why this scenario? I was inspired by one of our Timescale Cloud customers, who uses TimescaleDB to store and analyze their Prometheus metrics. They told us how it not only saves them money and disk space but also lets them keep their data around and spot trends over longer time periods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=wm9T7lWCgpE" rel="noopener noreferrer"&gt;&lt;em&gt;See the demo in action below:&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wm9T7lWCgpE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;You’ll notice a Grafana dashboard visualizing metrics, with TimescaleDB as the data source powering the dashboard. I focused on the basic monitoring metrics below, but if you try it yourself, you can customize the setup and add metrics that give you deeper insight (e.g., query latency, queries per second, open locks, cache hits, etc.):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage&lt;/li&gt;
&lt;li&gt;Service status&lt;/li&gt;
&lt;li&gt;Disk usage (%)&lt;/li&gt;
&lt;li&gt;Database connections&lt;/li&gt;
&lt;li&gt;Memory usage (%)&lt;/li&gt;
&lt;li&gt;Network status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To replicate the demo, follow these tutorials on &lt;a href="https://docs.timescale.com/latest/tutorials/prometheus-adapter/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;how to store Prometheus metrics in Timescale&lt;/a&gt; and &lt;a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;how to use Timescale as a datasource to power Grafana dashboards&lt;/a&gt;.&lt;/p&gt;
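&lt;p&gt;Once the metrics land in TimescaleDB, querying them is plain SQL. Here’s a rough sketch, assuming the default &lt;code&gt;metrics&lt;/code&gt; view that pg_prometheus creates (with &lt;code&gt;time&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt;, and &lt;code&gt;labels&lt;/code&gt; columns); the metric name is just an illustrative example:&lt;/p&gt;

```sql
-- Sketch: average of a CPU metric per 5-minute bucket over the last day.
-- Assumes pg_prometheus's default "metrics" view; the metric name
-- 'node_cpu_seconds_total' is an illustrative example, not part of the demo.
SELECT time_bucket('5 minutes', time) AS bucket,
       avg(value) AS avg_value
FROM metrics
WHERE name = 'node_cpu_seconds_total'
  AND time > NOW() - INTERVAL '1 day'
GROUP BY bucket
ORDER BY bucket;
```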

&lt;h1&gt;Open Space: DevOps &amp;amp; Data&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2FPnjWSJ68_HlMi7aioGnNKxJWVApCpwOC8L5ATZlNlaztfEJzEgdaNu9djwqTaxE12N0VRG-tFbUZh_jLakf5o-4hm1wLLy8tiyIMkwfr3S92_ra3IWQuxt8pGByfiItFtZ7XPRkS" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh5.googleusercontent.com%2FPnjWSJ68_HlMi7aioGnNKxJWVApCpwOC8L5ATZlNlaztfEJzEgdaNu9djwqTaxE12N0VRG-tFbUZh_jLakf5o-4hm1wLLy8tiyIMkwfr3S92_ra3IWQuxt8pGByfiItFtZ7XPRkS" alt="Mat leading Open Space on DevOps and Data" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://devopsdays.org/open-space-format/" rel="noopener noreferrer"&gt;Devopsdays “Open Spaces” are a (wonderful) concept&lt;/a&gt; similar to an unconference format: there’s a block of time scheduled for any attendees to discuss topics of their choosing with other interested attendees. Simply propose a topic to the audience that you’d like to discuss for 30 mins and other attendees can pick and choose which sessions they’d like to attend.&lt;/p&gt;

&lt;p&gt;Fellow Timescaler &lt;a href="https://twitter.com/cevianNY" rel="noopener noreferrer"&gt;Matvey Arye&lt;/a&gt; and I hosted an Open Space session about DevOps and data; other sessions covered topics ranging from negotiating pay and other soft skills to DevOps at small companies and DevOps within a particular cloud ecosystem (AWS, Microsoft Azure, Google Cloud, etc.).&lt;/p&gt;

&lt;p&gt;In our session, we heard stories, best practices, and the ways developers from all industries and areas think about the DevOps data they collect.&lt;/p&gt;

&lt;h3&gt;A few highlights and commonalities&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Teams are moving away from managing infrastructure themselves and toward managed services&lt;/strong&gt; (as one person put it: “One of the key criteria when we select a new tool is that we want one less thing to manage”).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps at certain companies can be a lonely and isolating job.&lt;/strong&gt; To remedy that, folks mentioned that they’d joined (and recommend!) a few Slack workspaces: &lt;a href="https://o11y.slack.com" rel="noopener noreferrer"&gt;O11y.slack.com&lt;/a&gt;, &lt;a href="https://signup.hangops.com" rel="noopener noreferrer"&gt;HangOps&lt;/a&gt; and &lt;a href="http://www.coffeeops.org" rel="noopener noreferrer"&gt;Coffee Ops&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data is becoming increasingly central in how teams fuel their post-mortem problem analysis.&lt;/strong&gt; Developers collect data about critical incidents, search for patterns in what’s causing them, and correlate this information with how it impacts clients or users.&lt;/p&gt;

&lt;p&gt;One team’s best practice and advice (they manage a massive consumer messaging app): Take snapshots of high load periods. This way, you get more detailed information to use for planning and to calibrate for the following years. In this team’s case, the New Year’s Eve timeframe is when they see the highest number of messages sent across their global user base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes, as always, was a hot topic.&lt;/strong&gt; Two common pain points stood out (and are things that we can relate to as we &lt;a href="https://blog.timescale.com/blog/new-helm-charts-for-deploying-timescaledb-on-kubernetes/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;build our Kubernetes deployment and multi-node offerings&lt;/a&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Visibility into what’s happening inside clusters and pods. Someone summed it up with, “I don’t just want to know my pod is offline, I want to know what was going on inside it.” We couldn’t agree more.&lt;/li&gt;
&lt;li&gt;Aggregating observability data across clusters to simplify things for Ops teams who handle metrics from multiple application teams.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;Questions &amp;amp; Conversations&lt;/h1&gt;

&lt;p&gt;To me, the best part of any conference is the hallway conversations and hearing what community members are keen to learn. As a company, we’re help-first, so, in the spirit of helping, here are a few questions I heard again and again that may be relevant as you get up and running, or do more advanced things, with TimescaleDB:&lt;/p&gt;

&lt;h3&gt;How does TimescaleDB perform at scale?&lt;/h3&gt;

&lt;p&gt;TimescaleDB scales up well within a single node, and also offers scale-out capabilities if you use our &lt;a href="https://docs.timescale.com/clustering/getting-started/scaling-out/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;multi-node beta&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In our internal benchmarks on standard cloud VMs, we regularly test TimescaleDB to 10+ billion rows while sustaining insert rates of 100-200k rows per second (1-2 million metric inserts per second). On more powerful hardware, we’ve seen users scale a single-node setup to 500 billion rows of data while sustaining 400k row inserts per second. To learn more about how TimescaleDB is architected to achieve this scale, see this &lt;a href="https://blog.timescale.com/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;blog explainer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And, in our internal tests, a multi-node beta setup with 9 nodes achieved an insert rate of over &lt;a href="https://blog.timescale.com/blog/building-a-distributed-time-series-database-on-postgresql/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;12 million metrics per second&lt;/a&gt; (and you can read more about our multi-node benchmarking &lt;a href="https://blog.timescale.com/blog/achieving-optimal-query-performance-with-a-distributed-time-series-database-on-postgresql/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;What’s the role of a long-term data store? What types of things does this allow me to do?&lt;/h3&gt;

&lt;p&gt;In order to keep Prometheus simple and easy to operate, its creators intentionally left out some of the scaling features developers typically need. Prometheus stores data locally within the instance and is not replicated. While having both compute and data storage on one node makes it easier to operate, it also makes it harder to scale and ensure high availability.&lt;/p&gt;

&lt;p&gt;More specifically, this means Prometheus data isn’t arbitrarily scalable or durable in the face of disk or node outages.&lt;/p&gt;

&lt;p&gt;Simply put, Prometheus isn’t designed to be a long-term metrics store. However, its creators also made Prometheus extremely extensible, and, thus, you can use TimescaleDB to store metrics for longer periods of time, which helps with capacity planning and system calibration. This combination also enables &lt;a href="https://blog.timescale.com/blog/prometheus-ha-postgresql-8de68d19b6f5/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;high availability&lt;/a&gt; and provides &lt;a href="https://blog.timescale.com/blog/sql-nosql-data-storage-for-prometheus-devops-monitoring-postgresql-timescaledb-time-series-3cde27fd1e07/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;advanced capabilities and features&lt;/a&gt;, such as full SQL, joins and replication (things not available in Prometheus). To learn more, see &lt;a href="https://docs.timescale.com/latest/tutorials/prometheus-adapter/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;why use TimescaleDB and Prometheus&lt;/a&gt;.&lt;/p&gt;
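&lt;p&gt;As a taste of what “full SQL and joins” buys you: because the metrics sit in a relational database, you can join them against your business tables. A hypothetical sketch (the &lt;code&gt;deployments&lt;/code&gt; table and the metric name are made up for illustration):&lt;/p&gt;

```sql
-- Sketch: memory usage in the 30 minutes around each deployment,
-- to eyeball whether releases correlate with memory spikes.
-- The "deployments" table and metric name are hypothetical examples.
SELECT d.deployed_at,
       time_bucket('1 minute', m.time) AS bucket,
       avg(m.value) AS avg_memory_bytes
FROM deployments d
JOIN metrics m
  ON m.time BETWEEN d.deployed_at - INTERVAL '30 minutes'
             AND d.deployed_at + INTERVAL '30 minutes'
WHERE m.name = 'node_memory_Active_bytes'
GROUP BY d.deployed_at, bucket
ORDER BY d.deployed_at, bucket;
```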

&lt;h3&gt;How do I use TimescaleDB and Prometheus? Do I have to use any special connectors?&lt;/h3&gt;

&lt;p&gt;Check out the demo :). I suggest using TimescaleDB as a remote read and write store for Prometheus metrics, whether they come from internal infrastructure or your public-facing e-commerce website. Since TimescaleDB extends Postgres, you use the &lt;a href="https://github.com/timescale/pg_prometheus" rel="noopener noreferrer"&gt;pg_prometheus extension&lt;/a&gt; for Postgres and our &lt;a href="https://github.com/timescale/prometheus-postgresql-adapter" rel="noopener noreferrer"&gt;prometheus_postgresql_adapter&lt;/a&gt;, and you’re ready to get started.&lt;/p&gt;
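&lt;p&gt;On the Prometheus side, the wiring is a short config change. A minimal sketch, assuming the adapter is running locally on its default port (9201):&lt;/p&gt;

```yaml
# prometheus.yml (fragment) -- send samples to, and read them back from,
# the prometheus-postgresql-adapter. localhost:9201 assumes the adapter's
# default listen address; adjust for your deployment.
remote_write:
  - url: "http://localhost:9201/write"
remote_read:
  - url: "http://localhost:9201/read"
```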

&lt;p&gt;Whatever works with Postgres works with TimescaleDB. So, if you want to connect to &lt;a href="https://docs.timescale.com/latest/using-timescaledb/visualizing-data/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;viz tools&lt;/a&gt; (like &lt;a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; or &lt;a href="https://docs.timescale.com/latest/tutorials/visualizing-time-series-data-in-tableau/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;Tableau&lt;/a&gt;), ingest data from sources like &lt;a href="https://blog.timescale.com/blog/create-a-data-pipeline-with-timescaledb-and-kafka/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt;, or insert and analyze data using your favorite programming language (like Python or Go), just use one of the many connectors and libraries in the Postgres ecosystem.&lt;/p&gt;

&lt;h1&gt;Want to learn more?&lt;/h1&gt;

&lt;p&gt;Thank you again to the devopsdays NYC team for your work to pull off such an interactive, fun, and community-first event! We’ll definitely be attending as future events are announced (virtually or otherwise).&lt;/p&gt;

&lt;p&gt;In the meantime, those resources once more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=wm9T7lWCgpE" rel="noopener noreferrer"&gt;Demo Video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Tutorials: &lt;a href="https://docs.timescale.com/latest/tutorials/prometheus-adapter/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://docs.timescale.com/latest/tutorials/tutorial-grafana/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...and, in the event you’d like to see an advanced version of this demo and/or are keen to join some #remote-friendly events, you can join me on March 25 at 12 pm ET for &lt;a href="https://www.timescale.com/webinar/how-to-analyze-your-prometheus-data-in-sql-3-queries-you-need-to-know/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;&lt;strong&gt;“How to Analyze Your Prometheus Data in SQL: 3 Queries You Need to Know.”&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I’ll focus on code and showing vs. telling: You’ll learn how to write custom SQL queries to analyze infrastructure monitoring metrics and create Grafana visualizations to see trends, and I’ll answer any questions that you may have.&lt;/li&gt;
&lt;li&gt;Interested? &lt;a href="https://www.timescale.com/webinar/how-to-analyze-your-prometheus-data-in-sql-3-queries-you-need-to-know/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;Sign up here&lt;/a&gt;. You’ll receive the recording and resources shortly following the session, so &lt;a href="https://www.timescale.com/webinar/how-to-analyze-your-prometheus-data-in-sql-3-queries-you-need-to-know/?utm_source=devto-devopsday2020&amp;amp;utm_medium=blog&amp;amp;utm_campaign=mar-2020-advocacy" rel="noopener noreferrer"&gt;register&lt;/a&gt; even if you can’t attend live.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>eventsinyourcity</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
