<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anthony Accomazzo</title>
    <description>The latest articles on Forem by Anthony Accomazzo (@acco).</description>
    <link>https://forem.com/acco</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F541290%2F2cd6a15a-715e-4dcc-aae1-fe715bf013a8.jpeg</url>
      <title>Forem: Anthony Accomazzo</title>
      <link>https://forem.com/acco</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/acco"/>
    <language>en</language>
    <item>
      <title>All the ways to react to changes in Supabase</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Mon, 09 Sep 2024 22:01:09 +0000</pubDate>
      <link>https://forem.com/acco/all-the-ways-to-react-to-changes-in-supabase-ndb</link>
      <guid>https://forem.com/acco/all-the-ways-to-react-to-changes-in-supabase-ndb</guid>
      <description>&lt;p&gt;&lt;a href="https://supabase.com/" rel="noopener noreferrer"&gt;Supabase&lt;/a&gt; makes it easy for your frontend to react to changes in your database via its Realtime feature. But outside the frontend, there's lots of reasons your application might want to react to changes in your database. You might need to trigger side effects, like sending users an email, alerting admins about a change, or invalidating a cache. Or you might need to capture a log of changes for compliance or debugging purposes.&lt;/p&gt;

&lt;p&gt;Realtime is just one way to respond to changes in your Supabase project. In this post, we'll explore the options available. Hopefully I can help you choose the right solution for your needs.&lt;/p&gt;

&lt;h2&gt;Database triggers&lt;/h2&gt;

&lt;p&gt;Postgres triggers execute a specified function in response to certain database events.&lt;/p&gt;

&lt;p&gt;Triggers run as part of the row lifecycle: you can write functions in PL/pgSQL and have Postgres invoke them whenever a row is inserted, updated, or deleted. They're a powerful way to chain database changes together.&lt;/p&gt;

&lt;p&gt;For example, here's a trigger that maintains a search index (&lt;code&gt;search_index&lt;/code&gt;) whenever an &lt;code&gt;article&lt;/code&gt; is changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;replace&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;update_search_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tg_op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'insert'&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;tg_op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'update'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;search_index&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;conflict&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt;
        &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;elsif&lt;/span&gt; &lt;span class="n"&gt;tg_op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'delete'&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;search_index&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;record_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;language&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="n"&gt;maintain_search_index&lt;/span&gt;
&lt;span class="k"&gt;after&lt;/span&gt; &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;each&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;update_search_index&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's really neat about triggers is that they run in the same transaction as the change that fired them. So, if a trigger fails to execute, the whole transaction is rolled back. That can give your system some strong guarantees. (Said another way, triggers give you "exactly-once processing".)&lt;/p&gt;

&lt;p&gt;This means triggers are great for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintaining derived tables (like search indexes)&lt;/li&gt;
&lt;li&gt;Populating column defaults (where &lt;code&gt;default&lt;/code&gt; doesn't cut it)&lt;/li&gt;
&lt;li&gt;Creating audit logs of changes&lt;/li&gt;
&lt;li&gt;Enforcing business rules&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;When are triggers not a good fit?&lt;/h3&gt;

&lt;p&gt;What makes triggers great also makes them weak for certain use cases.&lt;/p&gt;

&lt;p&gt;First, they can impact database performance. A great way to eke out more performance from Postgres is to batch operations. But row-level triggers fire once per row, which can blunt the benefits of batching.&lt;/p&gt;
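One way to soften the row-by-row cost is a statement-level trigger with transition tables, which runs once per statement and sees every affected row at once. A sketch, reusing the hypothetical `articles`/`search_index` tables from the trigger example:

```sql
-- Runs once per INSERT statement, no matter how many rows it inserts.
create or replace function update_search_index_batch() returns trigger as $$
begin
    insert into search_index (record_id, content)
    select id, to_tsvector('english', title || ' ' || body)
    from new_rows
    on conflict (record_id) do update
    set content = excluded.content;
    return null;
end;
$$ language plpgsql;

create trigger maintain_search_index_batch
after insert on articles
referencing new table as new_rows
for each statement execute function update_search_index_batch();
```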

&lt;p&gt;If you're not careful, one insert can lead to a cascade of changes across your database. Naturally, the more tables Postgres has to visit to make your insert possible, the slower those inserts will become.&lt;/p&gt;

&lt;p&gt;Second, Postgres triggers are relatively easy to write thanks to tools like Claude Sonnet. But they're hard to test and debug. PL/pgSQL isn't the most ergonomic language, and triggers aren't the most ergonomic runtime. With some database clients, one of the only debugging tools is sprinkling &lt;code&gt;raise exception 'here!'&lt;/code&gt; throughout your trigger functions. This can be a headache.&lt;/p&gt;

&lt;p&gt;Third, and perhaps most obviously, Postgres triggers are limited to your database runtime. They can only interact with the tables in your database.&lt;/p&gt;

&lt;p&gt;Unless...&lt;/p&gt;

&lt;h2&gt;Database Webhooks&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://supabase.com/docs/guides/database/webhooks" rel="noopener noreferrer"&gt;Database Webhooks&lt;/a&gt; in Supabase allow your database to interface with the outside world. With the &lt;code&gt;pg_net&lt;/code&gt; extension, you can trigger HTTP requests to external services when database changes occur. The &lt;code&gt;pg_net&lt;/code&gt; extension is asynchronous, which means database changes will not be blocked during long-running requests.&lt;/p&gt;

&lt;p&gt;Here's an example of a Database Webhook that fires whenever a row is inserted into or updated in the &lt;code&gt;orders&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;replace&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;post_inserted_order&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="k"&gt;language&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="c1"&gt;-- calls net.http_post function&lt;/span&gt;
    &lt;span class="c1"&gt;-- sends request to webhook.site&lt;/span&gt;
    &lt;span class="n"&gt;perform&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s1"&gt;'https://api.example.com/my/webhook/handler'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;jsonb_build_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;'user_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s1"&gt;'{"Content-Type": "application/json", "Authorization": "Bearer my_secret"}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="n"&gt;inserted_order_webhook&lt;/span&gt;
&lt;span class="k"&gt;after&lt;/span&gt; &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;each&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;post_inserted_order&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Database Webhooks make Postgres triggers a whole lot more powerful. You can send webhooks directly to workflow tools or to non-JS services in your stack. You can use them to trigger serverless functions, like Supabase Edge Functions.&lt;/p&gt;

&lt;p&gt;You can use Database Webhooks to move complex triggers from PL/pgSQL to your application code. For example, you could use a Database Webhook to notify your app of a recently placed order. Then your app could run a series of follow-up SQL queries to modify other tables as necessary. While you &lt;em&gt;could&lt;/em&gt; do this with a plain database trigger, this approach lets you write the logic in your domain language, where you can easily unit test and debug.&lt;/p&gt;

&lt;h3&gt;When are Database Webhooks not a good fit?&lt;/h3&gt;

&lt;p&gt;While Database Webhooks allow you to move more business logic into your application code, the setup process will still take some trial and error. I recommend getting your requests to work first by running them directly in Supabase's SQL Editor (e.g. run &lt;code&gt;select net.http_post(...)&lt;/code&gt; – &lt;code&gt;perform&lt;/code&gt; is only valid inside PL/pgSQL). Then, once you're confident that you're shaping your requests the right way, you can embed the call into your Postgres trigger.&lt;/p&gt;
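For example, a one-off test in the SQL Editor might look like this (the URL, body, and headers are placeholders; `net._http_response` is the table where `pg_net` records responses):

```sql
-- Fire a test request directly (hypothetical URL and headers):
select net.http_post(
    url := 'https://api.example.com/my/webhook/handler',
    body := '{"test": true}'::jsonb,
    headers := '{"Content-Type": "application/json"}'::jsonb
) as request_id;

-- pg_net is async; check the response a few seconds later:
select status_code, content
from net._http_response
order by created desc
limit 5;
```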

&lt;p&gt;Second, unlike Postgres triggers, &lt;code&gt;pg_net&lt;/code&gt; calls are async. That's good, because it means there's little performance overhead. But it's also bad, because &lt;code&gt;pg_net&lt;/code&gt; offers at-most-once delivery. If something goes wrong or your webhook endpoint is down, the notification might be lost for good. Supabase will store the error in a dedicated table for 6 hours, but won't automatically retry the webhook.&lt;/p&gt;

&lt;p&gt;Third, there are &lt;a href="https://github.com/supabase/pg_net/issues/86" rel="noopener noreferrer"&gt;some reports&lt;/a&gt; of &lt;code&gt;pg_net&lt;/code&gt; failing to make requests after your database transaction volume surpasses a certain threshold.&lt;/p&gt;

&lt;p&gt;So, while Database Webhooks expand the possibilities of triggers, they're not a replacement. You'll want to continue to use triggers for those critical in-database workflows where you 100% can't miss a change.&lt;/p&gt;

&lt;h2&gt;Realtime Subscriptions&lt;/h2&gt;

&lt;p&gt;Realtime is Supabase's flagship feature for reacting to database changes. It allows both client and server applications to subscribe to changes in your database tables and receive updates in real-time.&lt;/p&gt;

&lt;p&gt;First, be sure to turn Realtime on for your table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwg17mgi2qgqx7keei8ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwg17mgi2qgqx7keei8ax.png" alt="A dialog modal, turning on realtime for a table" width="400" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, you can create subscriptions. Here, we specify a subscription for &lt;code&gt;INSERT&lt;/code&gt; operations on &lt;code&gt;orders&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Listen to inserts&lt;/span&gt;
&lt;span class="nx"&gt;supabase&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;default&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;postgres_changes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;INSERT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;public&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;orders&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Received change&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Realtime is easy to use, and the same client interface works on both the frontend and the backend.&lt;/p&gt;

&lt;p&gt;Unlike Database Webhooks, Realtime is a pub/sub system. You can use it to broadcast changes to many clients, which is great for building reactive interfaces. And clients can even broadcast their own messages, making Realtime a powerful tool for building collaborative features.&lt;/p&gt;

&lt;p&gt;Compared to Database Webhooks, I find Realtime a bit easier to work with, in part because it's well supported in Supabase's console and JavaScript client.&lt;/p&gt;

&lt;h3&gt;When is Realtime not a good fit?&lt;/h3&gt;

&lt;p&gt;Like Database Webhooks, Realtime messages have at-most-once delivery guarantees. So it's not a good fit when you absolutely must react to every change. You need to be comfortable with the fact that messages can be dropped (for example, if your Node app wasn't subscribed at the moment of the change).&lt;/p&gt;

&lt;p&gt;And while Database Webhooks are well suited to triggering side effects and async workers, Realtime may not be. With webhooks, you know your request was routed to at most one worker, so only one worker will field it. But with broadcast systems like Realtime, multiple workers might pick a message up. So if you wanted to use Realtime to, say, send an email, that could result in some undesirable situations: multiple workers hear about the change and each send an email. (You can try to mitigate this with private channels, but how do you mutex message handlers on deploys?)&lt;/p&gt;

&lt;h2&gt;Listen/Notify&lt;/h2&gt;

&lt;p&gt;Postgres' built-in pub/sub mechanism, Listen/Notify, is a simple way to broadcast events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- in one session&lt;/span&gt;
&lt;span class="k"&gt;listen&lt;/span&gt; &lt;span class="n"&gt;my_channel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- in another session&lt;/span&gt;
&lt;span class="k"&gt;notify&lt;/span&gt; &lt;span class="n"&gt;my_channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'something happened!'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can call &lt;code&gt;NOTIFY&lt;/code&gt; within trigger functions to alert listeners of changes.&lt;/p&gt;
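For example, a trigger function can call `pg_notify` to broadcast each change as JSON. A sketch (the channel name and `orders` table are illustrative; note that notification payloads are capped at 8000 bytes):

```sql
-- Broadcast each change on the 'order_changes' channel as JSON.
create or replace function notify_order_change() returns trigger as $$
begin
    perform pg_notify(
        'order_changes',
        json_build_object('op', tg_op, 'id', new.id)::text
    );
    return new;
end;
$$ language plpgsql;

create trigger orders_notify
after insert or update on orders
for each row execute function notify_order_change();
```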

&lt;p&gt;However, I don't think it's the best fit for Supabase projects. First, Listen/Notify works with neither the Supabase JS client nor Supabase cloud's connection pooler. More importantly, everything that Listen/Notify can do, Realtime can do better.&lt;/p&gt;

&lt;h2&gt;Sequin&lt;/h2&gt;

&lt;p&gt;We felt there was a gap in the option space for Supabase, which is why we built &lt;a href="https://github.com/sequinstream/sequin" rel="noopener noreferrer"&gt;Sequin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Sequin is an open source tool that pairs nicely with Supabase. Unlike Database Webhooks or Realtime broadcasts, Sequin is designed to deliver messages with exactly-once processing guarantees. After connecting Sequin to your Supabase database, you select which table you want to monitor and filter down to which changes you care about:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqc3f9w61z5dni55zicc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqc3f9w61z5dni55zicc.png" alt="Sequin console, setting up consumer, indicating which table, operations, and WHERE filters you want on changes" width="800" height="1139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You then tell Sequin where to send change event webhooks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnrx4cvbdr39evv541sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnrx4cvbdr39evv541sg.png" alt="Configuring an HTTP endpoint for Sequin to send events to, complete with base url, path, and headers" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike Database Webhooks, if your servers are down or your functions return errors, Sequin will keep retrying the message (with backoff). So you get retries, replays, and a great debugging experience.&lt;/p&gt;

&lt;p&gt;Sequin comes as a &lt;a href="https://sequinstream.com/" rel="noopener noreferrer"&gt;cloud offering&lt;/a&gt; so it's easy to get up and running.&lt;/p&gt;

&lt;h3&gt;When is Sequin not a good fit?&lt;/h3&gt;

&lt;p&gt;For really simple use cases that a 5-line PL/pgSQL trigger can handle, Sequin is probably too heavyweight. Same if your Database Webhook is fire-and-forget – you won't need Sequin's delivery guarantees.&lt;/p&gt;

&lt;p&gt;Sequin's also not a good fit for the pub/sub use cases Realtime serves. Because Sequin offers exactly-once processing, it only delivers each message to a single worker at a time.&lt;/p&gt;

&lt;h2&gt;Choosing the right approach&lt;/h2&gt;

&lt;p&gt;To recap, here's when you might consider each approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Triggers&lt;/strong&gt; are great for maintaining order and consistency in your database. Ideally, your triggers are not too complicated and you don't have too many of them. If your table has high write volume, be mindful of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database Webhooks&lt;/strong&gt; are good for quick fire-and-forget use cases, like POST'ing a Slack notification for your team or sending an analytics event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Realtime&lt;/strong&gt; can help you build a differentiating client experience. You can use it to build a reactive client that updates immediately when data changes in the database, or to power features like presence in collaborative editing tools. You can also use Realtime where you might otherwise reach for a pub/sub system like Simple Notification Service (SNS), to broadcast events to backend services, as long as you're OK with those services missing some events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequin&lt;/strong&gt; lets you write robust event-driven workflows with exactly-once processing guarantees. It's a more powerful, easier-to-work-with version of Database Webhooks. It's great for critical workflows like sending emails, updating user data in your CRM, invalidating caches, and syncing data. You can even use Sequin in place of a queuing system like SQS or Kafka.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>webdev</category>
      <category>supabase</category>
    </item>
    <item>
      <title>What was that commit? Searching GitHub with OpenAI embeddings</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Fri, 20 Oct 2023 22:52:47 +0000</pubDate>
      <link>https://forem.com/acco/what-was-that-commit-searching-github-with-openai-embeddings-5hnm</link>
      <guid>https://forem.com/acco/what-was-that-commit-searching-github-with-openai-embeddings-5hnm</guid>
      <description>&lt;p&gt;We ran into a situation the other day that was all too familiar: we needed to write some code that I &lt;em&gt;knew&lt;/em&gt; we’ve written before. We wanted to serialize and deserialize an Elixir struct into a Postgres &lt;code&gt;jsonb&lt;/code&gt; column. Although we’d solved this before, the module had long been deleted, so it was lingering somewhere in our git history.&lt;/p&gt;

&lt;p&gt;We didn’t remember what the module was called, or any other identifying details about the implementation or the commit.&lt;/p&gt;

&lt;p&gt;After racking my brain and scraping through &lt;code&gt;git reflog&lt;/code&gt;, we eventually found it. But we realized that simple text search through our git history was too limiting.&lt;/p&gt;

&lt;p&gt;It dawned on us that we wanted to perform not a &lt;em&gt;literal string&lt;/em&gt; search but a &lt;em&gt;semantic&lt;/em&gt; search.&lt;/p&gt;

&lt;p&gt;This seemed like the kind of problem that embeddings were designed to solve. So, we set out to build the tool.&lt;/p&gt;

&lt;h3&gt;Embeddings&lt;/h3&gt;

&lt;p&gt;An &lt;em&gt;&lt;a href="https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture"&gt;embedding&lt;/a&gt;&lt;/em&gt; is a vector representation of data. A vector representation is a series of floats, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;-0&lt;/span&gt;.016741209, 0.019078454, 0.017176045, &lt;span class="nt"&gt;-0&lt;/span&gt;.028046958, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Embeddings help capture the relatedness of text, images, video, or other data. With that relatedness, you can search, cluster, and classify.&lt;/p&gt;

&lt;p&gt;For example, you can generate embeddings for the two strings “I committed my changes to GitHub” and “I pushed the commit to remote.” A literal text comparison would find few substring matches between the two. But an embeddings-powered similarity comparison would rank them as highly related: the two sentences describe practically the same activity.&lt;/p&gt;

&lt;p&gt;In contrast, “I’m committed to remote” has many of the same words. But it would rank as not very related. The words “commit” and “remote” are referring to completely different things!&lt;/p&gt;
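That relatedness is typically measured with cosine similarity between the two vectors. A minimal sketch, using toy 3-dimensional vectors (real embeddings have on the order of 1,536 dimensions):

```javascript
// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 2, 3], [1, 2, 3])); // 1 (identical)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // 0 (orthogonal)
```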

&lt;h3&gt;How to create embeddings?&lt;/h3&gt;

&lt;p&gt;There are lots of ways to create embeddings. The easiest solution is to rely on a third-party vendor like &lt;a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings"&gt;OpenAI&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.openai.com/v1/embeddings &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "input": "Your text string goes here",
    "model": "text-embedding-ada-002"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI accepts batched input too: set &lt;code&gt;input&lt;/code&gt; to an array of strings to get back one embedding per string.&lt;/p&gt;

&lt;h3&gt;Workflow&lt;/h3&gt;

&lt;p&gt;To power a GitHub search tool, we first needed embeddings for all our GitHub data. That meant creating string representations of each object and retrieving embeddings via OpenAI’s API.&lt;/p&gt;

&lt;p&gt;For example, for Pull Requests, we just concatenated the &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; fields to make the string for embeddings. For commits, we only needed the &lt;code&gt;commit&lt;/code&gt; message.&lt;/p&gt;

&lt;p&gt;Then, to search across these embeddings, the user types in a search query. We convert &lt;em&gt;the search query&lt;/em&gt; into an embedding. With both the search query and the GitHub objects represented as embeddings, we can perform a similarity search.&lt;/p&gt;

&lt;h3&gt;Using Postgres&lt;/h3&gt;

&lt;p&gt;When generating GitHub embeddings, we need to store them somewhere. This is what a vector database is designed to do: be a repository for your embedding vectors and allow you to perform efficient queries with them.&lt;/p&gt;

&lt;p&gt;Fortunately, Postgres has a vector extension, &lt;code&gt;pgvector&lt;/code&gt;. This is great because it means you don’t have to add an entirely new data store to your stack. With &lt;code&gt;pgvector&lt;/code&gt;, Postgres can work with vector data like embeddings, and it’s performant enough for plenty of workflows like ours.&lt;/p&gt;

&lt;p&gt;To add &lt;code&gt;pgvector&lt;/code&gt; to your database, you just need to run a single command &lt;sup id="fnref1"&gt;1&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;extension&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To build our solution, we knew that we needed to both generate embeddings for all &lt;em&gt;current&lt;/em&gt; GitHub data and dynamically generate embeddings for all &lt;em&gt;new&lt;/em&gt; GitHub data going forward. That is, we’d need to run some kind of backfill to generate embeddings for all current Pull Requests and Issues, and then set up a process to monitor inserts and updates to these objects to keep the embeddings up-to-date.&lt;/p&gt;

&lt;p&gt;Using Sequin, we pulled all GitHub objects into Postgres. So Pull Requests in GitHub → &lt;code&gt;pull_requests&lt;/code&gt; in our database and Issues → &lt;code&gt;issues&lt;/code&gt;. We could then run a one-off process that paginated through the table like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;github&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pull_request&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;asc&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;offset&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="k"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, for each batch of records, we fetched embeddings with an API request to OpenAI. We decided to store embeddings in a separate table, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table github_embedding.commit (
  id text references github.commit(id) on delete cascade,
  embedding vector(1536) not null
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
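
&lt;p&gt;As these tables grow, you may also want an index to keep similarity queries fast. This is a hedged sketch using &lt;code&gt;pgvector&lt;/code&gt;’s &lt;code&gt;ivfflat&lt;/code&gt; index with the cosine distance operator class (cosine distance is the operator we use for search later in this post):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- optional: approximate nearest-neighbor index for cosine distance
create index on github_embedding.commit using ivfflat (embedding vector_cosine_ops);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;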



&lt;p&gt;Batch jobs like this work fine for backfilling data. We knew we could get away with running the task once a day to generate embeddings for new or updated records.&lt;/p&gt;

&lt;p&gt;But we wanted our search tool to work with the freshest data possible. We didn’t want to have a big time delay between activity in GitHub and results in the search tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating embeddings on insert or update
&lt;/h2&gt;

&lt;p&gt;In order to generate embeddings for GitHub objects whenever they were created or updated, we needed a way to find out about these events.&lt;/p&gt;

&lt;p&gt;In situations like this, developers often consider Postgres' &lt;a href="https://www.postgresql.org/docs/current/sql-notify.html"&gt;listen/notify protocol&lt;/a&gt;. It's fast to get started with and works great. But, &lt;a href="https://blog.sequin.io/all-the-ways-to-capture-changes-in-postgres/"&gt;notify events are ephemeral&lt;/a&gt;, so delivery is at-most-once. That means there's a risk of missing notifications, and therefore of there being holes in your data.&lt;/p&gt;

&lt;p&gt;The other option was to use Sequin’s &lt;a href="https://docs.sequin.io/events"&gt;events&lt;/a&gt;. Along with a sync to Postgres, Sequin provides an event stream. Sequin publishes events like “GitHub Pull Request deleted” or “GitHub Commit upserted” to a serverless Kafka stream associated with your sync.&lt;/p&gt;

&lt;p&gt;You don’t have to use Kafka to interface with the event stream. There are options to use a simple HTTP interface or to have events POSTed to an endpoint you choose (webhooks).&lt;/p&gt;

&lt;p&gt;Events contain the ID and collection of the affected record, as well as the payload of the record itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"collection"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pull_request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"079013db-8b17-44cd-8528-f5e68fc61333"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;“activity_date”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="mi"&gt;2023-09-12&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Add GitHub embeddings [ … ] "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make events work, we just needed to set up an event listener. The listener implements a callback function that derives a string value from the record by concatenating and stringifying fields. Then, it makes a request to OpenAI to get the embedding. Finally, it upserts the embedding into the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="nv"&gt;@impl&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;handle_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Jason&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="p"&gt;%{&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;

  &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;
    &lt;span class="ss"&gt;input:&lt;/span&gt; &lt;span class="n"&gt;get_embedding_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="ss"&gt;model:&lt;/span&gt; &lt;span class="s2"&gt;"text-embedding-ada-002"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="no"&gt;Req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="ss"&gt;url:&lt;/span&gt; &lt;span class="s2"&gt;"https://api.openai.com/v1/embeddings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;headers:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Bearer &amp;lt;&amp;lt;secret&amp;gt;&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="ss"&gt;json:&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;%{&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[%{&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;

  &lt;span class="n"&gt;upsert_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;get_embedding_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;pull_request&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;GitHub&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;PullRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;select:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:body&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:body&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="no"&gt;End&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;upsert_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;pull_request&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="p"&gt;%&lt;/span&gt;&lt;span class="no"&gt;GitHub&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;PullRequest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;id:&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;GitHub&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;PullRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changeset&lt;/span&gt;&lt;span class="p"&gt;(%{&lt;/span&gt; &lt;span class="ss"&gt;embedding:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;MyApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Repo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;on_conflict:&lt;/span&gt; &lt;span class="ss"&gt;:replace_all&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;conflict_target:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# handle other collection types here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the backfill done and an event handler in place, we had up-to-date database tables of GitHub embeddings. With that foundation, we were ready to build our tool!&lt;/p&gt;

&lt;h2&gt;
  
  
  A Postgres query for finding matches
&lt;/h2&gt;

&lt;p&gt;With your embeddings set up in Postgres, you’re ready to create a mechanism for querying them.&lt;/p&gt;

&lt;p&gt;Supabase has a &lt;a href="https://supabase.com/blog/openai-embeddings-postgres-vector"&gt;great post&lt;/a&gt; on embeddings in Postgres. I’ve adapted their similarity query below. You can use the cosine distance operator (&lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt;) provided by &lt;code&gt;pgvector&lt;/code&gt; to determine similarity. Here’s a query that grabs a list of pull requests above a &lt;code&gt;match_threshold&lt;/code&gt;, ordered from most similar to least similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="n"&gt;pull_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pull_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pull_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_pull_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="n"&gt;searchEmbedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;github_sequin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pull_request&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pull_request&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;github_embedding_sequin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pull_request&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;embedding_pull_request&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;pull_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_pull_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="c1"&gt;-- match threshold set to 0.75, you can change it&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_pull_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="n"&gt;searchEmbedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="c1"&gt;-- match count set to 5, you can change it&lt;/span&gt;
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The search tool
&lt;/h2&gt;

&lt;p&gt;With our data model and search function squared away, we were ready to build our tool.&lt;/p&gt;

&lt;p&gt;When the user enters a query, we first convert their search query into an embedding using OpenAI. Then, we use the SQL query above to find the GitHub objects that are the closest match.&lt;/p&gt;

&lt;p&gt;Below is a simple example of this tool in action: a search for Pull Requests that mention “serialize and deserialize structs into jsonb ecto”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rJDbGKj7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.sequin.io/content/images/2023/10/image1-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rJDbGKj7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.sequin.io/content/images/2023/10/image1-1.png" alt="image1.png" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left is the list of the top 5 PRs that matched, sorted by similarity descending. On the right is a preview of the selected PR.&lt;/p&gt;

&lt;p&gt;Note that this is not a literal string match. The search query says “serialize and deserialize,” but the PR contains “serializes/deserializes.” The PR also doesn’t mention &lt;code&gt;jsonb&lt;/code&gt;, just JSON.&lt;/p&gt;

&lt;p&gt;Because of embeddings, we found the exact PR we were looking for, and with only a vague idea of what we were looking for!&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;The tool is very effective when the search query has some substance to it (several words) and your PRs do as well. Naturally, if a PR or issue is very light on content, it’s harder to match.&lt;/p&gt;

&lt;p&gt;In fact, PRs or issues with very little text content can match too frequently for the wrong things. So, you may consider adding a clause that filters out GitHub objects whose fields don’t meet some minimum length.&lt;/p&gt;
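
&lt;p&gt;For instance, assuming the same tables as elsewhere in this post, such a filter might look like this (the 64-character cutoff is an arbitrary placeholder, not a recommendation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- hedged sketch: skip PRs that are too light on text content to match meaningfully
select id, title, body
from github.pull_request
where length(coalesce(title, '') || ' ' || coalesce(body, '')) &amp;gt; 64;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;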

&lt;p&gt;Remember, you’re not describing what you’re looking for. You’re writing text that you think will be a match for a description found in a PR or an issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further exploration
&lt;/h2&gt;

&lt;p&gt;Now that we have our first workflow around embeddings, we’re starting to think up other ideas.&lt;/p&gt;

&lt;p&gt;For example, how could we expand search to cover commit bodies and diffs? Will embeddings work well if we’re describing the code inside of a commit (vs. matching descriptions in the PRs and issues around the code)?&lt;/p&gt;

&lt;p&gt;Can we power roll-ups off this data? For example, imagine a weekly summary that describes what got committed (vs. just listing PRs). Or reports like “cluster analysis” that tell the team how our time broke down between fixing bugs and shipping new features.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;code&gt;pgvector&lt;/code&gt; is included in most of the latest distributions of Postgres. ↩&lt;/p&gt;

&lt;p&gt;If you're on AWS RDS, be sure you upgrade to Postgres 15.2+ to get access to the &lt;code&gt;vector&lt;/code&gt; extension.&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Storing Salesforce embeddings with pgvector and OpenAI</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Tue, 10 Oct 2023 17:11:59 +0000</pubDate>
      <link>https://forem.com/acco/storing-salesforce-embeddings-with-pgvector-and-openai-16kc</link>
      <guid>https://forem.com/acco/storing-salesforce-embeddings-with-pgvector-and-openai-16kc</guid>
      <description>&lt;p&gt;We use Salesforce as the hub for all customer data. We pipe notes, call transcripts, and email conversations into Salesforce.&lt;/p&gt;

&lt;p&gt;We thought it would be cool to build tooling on top of Salesforce that helped us with product roadmap and direction. We receive feedback and great ideas from our customers all the time. How could we make it easy to see suggested features? To follow up with the right customers after shipping a request? And to spot recurring themes from our customer conversations?&lt;/p&gt;

&lt;p&gt;Just building a simple search tool isn’t enough. A standard search query matches on literal words in a string. So, if we wanted to pull up all the notes where a customer requested that Sequin support an integration, we’d have to brute force with a number of search strings, like “request,” “support,” “add integration.” Not only is this tedious, but it’s unlikely to work.&lt;/p&gt;

&lt;p&gt;This is where embeddings come in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embeddings
&lt;/h3&gt;

&lt;p&gt;An &lt;em&gt;&lt;a href="https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture"&gt;embedding&lt;/a&gt;&lt;/em&gt; is a vector representation of data. A vector representation is a series of floats, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;-0&lt;/span&gt;.016741209, 0.019078454, 0.017176045, &lt;span class="nt"&gt;-0&lt;/span&gt;.028046958, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Embeddings help capture the relatedness of text, images, video, or other data. With that relatedness, you can search, cluster, and classify.&lt;/p&gt;

&lt;p&gt;Embeddings enable an advanced search over our customers’ notes. First, we can generate embeddings for all notes. Then, we can perform a query using another embedding. The user types in a query, such as “want sequin to support ecommerce service.” We can take that query, turn &lt;em&gt;it&lt;/em&gt; into an embedding, and compare its relatedness to the embeddings of all the notes.&lt;/p&gt;

&lt;p&gt;To generate embeddings, you’ll want to rely on a third-party vendor. You can use APIs like &lt;a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings"&gt;OpenAI’s&lt;/a&gt; to generate them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embeddings search tool
&lt;/h3&gt;

&lt;p&gt;In this post, I’ll show you how to build a tool that will let you do an embeddings search across your Salesforce data. This will let you search semantically instead of literally. You can use a tool like this to filter and find feedback or product suggestions.&lt;/p&gt;

&lt;p&gt;You can create embeddings on any Salesforce object, like Case, Task, Note or a custom object that your team uses. In the examples below, I’ll use popular Salesforce objects interchangeably.&lt;/p&gt;

&lt;p&gt;This post assumes you already have Salesforce &lt;a href="https://docs.sequin.io/integrations/salesforce/setup"&gt;set up with Sequin&lt;/a&gt; to sync Salesforce objects to your Postgres database. You should also have an OpenAI account set up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prepare your database
&lt;/h2&gt;

&lt;p&gt;To prepare your database, first add the &lt;code&gt;pgvector&lt;/code&gt; extension &lt;sup id="fnref1"&gt;1&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;extension&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a separate schema, &lt;code&gt;salesforce_embedding&lt;/code&gt;, for your embedding data &lt;sup id="fnref2"&gt;2&lt;/sup&gt;. In your queries, you’ll join your embedding tables to your Salesforce tables.&lt;/p&gt;

&lt;p&gt;Here’s an example of creating an embedding table for Salesforce tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;salesforce_embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;references&lt;/span&gt; &lt;span class="n"&gt;salesforce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;cascade&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this post, I’ll show you how to use OpenAI’s &lt;code&gt;text-embedding-ada-002&lt;/code&gt; model. That model generates embeddings with 1536 dimensions, hence the &lt;code&gt;1536&lt;/code&gt; parameter above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generate embeddings on insert or update
&lt;/h2&gt;

&lt;p&gt;You'll first set up your app to generate embeddings for Salesforce records as they're inserted or updated. Then, I'll show you how to backfill embeddings for your existing records.&lt;/p&gt;

&lt;p&gt;You have two options for finding out about new or updated Salesforce objects in your database.&lt;/p&gt;

&lt;p&gt;You can use Postgres' &lt;a href="https://www.postgresql.org/docs/current/sql-notify.html"&gt;listen/notify protocol&lt;/a&gt;. It's fast to get started with and works great. But, &lt;a href="https://blog.sequin.io/all-the-ways-to-capture-changes-in-postgres/"&gt;notify events are ephemeral&lt;/a&gt;, so delivery is at-most-once. That means there's a risk of missing notifications, and therefore of there being holes in your data.&lt;/p&gt;

&lt;p&gt;Along with a sync to Postgres, Sequin provisions an &lt;a href="https://docs.sequin.io/events"&gt;event stream for you&lt;/a&gt;. Sequin will publish events to a &lt;a href="https://nats.io"&gt;NATS&lt;/a&gt; stream associated with your sync, &lt;code&gt;sequin-[sync_id]&lt;/code&gt; (e.g. &lt;code&gt;sequin-sync_1a107d79&lt;/code&gt;). You can write a function in your app that listens for these events and updates the &lt;code&gt;embedding&lt;/code&gt; column for the Salesforce object that was updated or inserted. Notably, unlike listen/notify, the NATS stream is durable so you get at-least-once delivery guarantees.&lt;/p&gt;

&lt;p&gt;The NATS team maintains client libraries in over 40 languages. Here's the skeleton for a listener for Salesforce upsert events in Elixir:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;defmodule&lt;/span&gt; &lt;span class="no"&gt;MyApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Sequin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;SalesforceStreamConsumer&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="kn"&gt;use&lt;/span&gt; &lt;span class="no"&gt;Jetstream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;PullConsumer&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;start_link&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="no"&gt;Jetstream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;PullConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_link&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;__MODULE__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="nv"&gt;@impl&lt;/span&gt; &lt;span class="no"&gt;PullConsumer&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;connection_name:&lt;/span&gt; &lt;span class="ss"&gt;:gnat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;stream_name:&lt;/span&gt; &lt;span class="s2"&gt;"sequin-sync_1a107d79"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;consumer_name:&lt;/span&gt; &lt;span class="s2"&gt;"my_app_sf_upserts"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="nv"&gt;@impl&lt;/span&gt; &lt;span class="no"&gt;PullConsumer&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;handle_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="c1"&gt;# TODO&lt;/span&gt;
    &lt;span class="c1"&gt;# event handling code goes here&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;init/1&lt;/code&gt;, you specify the stream name as well as a name for your &lt;a href="https://docs.nats.io/nats-concepts/jetstream/consumers"&gt;consumer&lt;/a&gt;. &lt;code&gt;handle_message/2&lt;/code&gt; is the function that handles each event on the stream. In this case, that means &lt;code&gt;handle_message/2&lt;/code&gt; will be invoked every time a Salesforce object is inserted or updated.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;consumer_name&lt;/code&gt; for this module is &lt;code&gt;my_app_sf_upserts&lt;/code&gt;. I’ll show you in a moment how to register this consumer with NATS.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;handle_message/2&lt;/code&gt;, you make an API request to OpenAI. The body specifies the input for the embedding and the model to use. For the input, you’ll want to use a different field or combination of fields for each collection. So, you can implement a &lt;code&gt;get_embedding_input/2&lt;/code&gt; clause for each collection you care about. The following example handles one table, &lt;code&gt;task&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="nv"&gt;@impl&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;handle_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Jason&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="p"&gt;%{&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;

  &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;
    &lt;span class="ss"&gt;input:&lt;/span&gt; &lt;span class="n"&gt;get_embedding_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="ss"&gt;model:&lt;/span&gt; &lt;span class="s2"&gt;"text-embedding-ada-002"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="no"&gt;Req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="ss"&gt;url:&lt;/span&gt; &lt;span class="s2"&gt;"https://api.openai.com/v1/embeddings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;headers:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Bearer &amp;lt;&amp;lt;secret&amp;gt;&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="ss"&gt;json:&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;%{&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[%{&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;

  &lt;span class="n"&gt;upsert_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;get_embedding_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;Salesforce&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;select:&lt;/span&gt; &lt;span class="ss"&gt;:description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fetch!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="no"&gt;End&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;upsert_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="p"&gt;%&lt;/span&gt;&lt;span class="no"&gt;Salesforce&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;TaskEmbedding&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;id:&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Salesforce&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;TaskEmbedding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changeset&lt;/span&gt;&lt;span class="p"&gt;(%{&lt;/span&gt; &lt;span class="ss"&gt;embedding:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;MyApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Repo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;on_conflict:&lt;/span&gt; &lt;span class="ss"&gt;:replace_all&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;conflict_target:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# handle other collection types here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the end of &lt;code&gt;handle_message/2&lt;/code&gt; is a call to &lt;code&gt;upsert_embedding/3&lt;/code&gt;, which upserts the record into the database. The example shows handler functions for the Task collection; you can add handler functions for whichever collections you want embeddings for.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This example does not handle issues with the OpenAI API gracefully. A more robust solution would have some error handling around that call.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, register this consumer you just wrote with your NATS event stream. You can filter on only upserted events (you don’t want your handler to be invoked for &lt;code&gt;deleted&lt;/code&gt; events):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nats consumer add &lt;span class="nt"&gt;--pull&lt;/span&gt; &lt;span class="nt"&gt;--deliver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all &lt;span class="nt"&gt;--creds&lt;/span&gt; /path/to/your.creds sequin-sync_1a107d79 ghola &lt;span class="nt"&gt;--filter&lt;/span&gt; sequin.sync_1a107d79.salesforce.&lt;span class="k"&gt;*&lt;/span&gt;.upserted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This example uses &lt;a href="https://docs.nats.io/using-nats/nats-tools/nats_cli"&gt;NATS cli&lt;/a&gt;, which is nice for one-off commands like this one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With this listener deployed, when a record is inserted, your consumer will populate its &lt;code&gt;embedding&lt;/code&gt; column. And when a record is updated, your consumer will regenerate its &lt;code&gt;embedding&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;The next step is to backfill all the records with &lt;code&gt;null&lt;/code&gt; values for &lt;code&gt;embedding&lt;/code&gt; in the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backfill the &lt;code&gt;embedding&lt;/code&gt; column for existing records
&lt;/h2&gt;

&lt;p&gt;You have two primary options for backfilling the &lt;code&gt;embedding&lt;/code&gt; column:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a batch job&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can write a one-off batch job that paginates through your table and kicks off API calls to fetch the embeddings for each record.&lt;/p&gt;

&lt;p&gt;You can paginate through each table like this &lt;sup id="fnref3"&gt;3&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;salesforce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;asc&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;offset&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can send multiple strings at once to OpenAI’s embedding API. After grabbing a set of rows, here’s how you might fetch the embeddings for those records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;fetch_and_upsert_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;get_embedding_input&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;
    &lt;span class="ss"&gt;input:&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;model:&lt;/span&gt; &lt;span class="s2"&gt;"text-embedding-ada-002"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="no"&gt;Req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="ss"&gt;url:&lt;/span&gt; &lt;span class="s2"&gt;"https://api.openai.com/v1/embeddings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;headers:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Bearer &amp;lt;&amp;lt;secret&amp;gt;&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="ss"&gt;json:&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;&amp;amp;1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;upsert_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;get_embedding_input&lt;/span&gt;&lt;span class="p"&gt;(%&lt;/span&gt;&lt;span class="no"&gt;Salesforce&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# … write other `get_embedding_input/1` clauses&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;upsert_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Enum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zip_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;%{&lt;/span&gt;
      &lt;span class="ss"&gt;id:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="ss"&gt;embedding:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="no"&gt;MyApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Repo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insert_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="no"&gt;SalesforceTaskEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;on_conflict:&lt;/span&gt; &lt;span class="ss"&gt;:replace_all&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;conflict_target:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
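
&lt;p&gt;To drive the batch job, you can page through the table and hand each page to &lt;code&gt;fetch_and_upsert_rows/1&lt;/code&gt;. This is a sketch, not a drop-in implementation; it assumes &lt;code&gt;Ecto.Query&lt;/code&gt; is imported, and the batch size is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;defp backfill_tasks(offset \\ 0, batch_size \\ 1_000) do
  # Page through salesforce.task in a stable order, mirroring the SQL above
  rows =
    Salesforce.Task
    |&amp;gt; order_by(asc: :id)
    |&amp;gt; limit(^batch_size)
    |&amp;gt; offset(^offset)
    |&amp;gt; MyApp.Repo.all()

  unless rows == [] do
    fetch_and_upsert_rows(rows)
    backfill_tasks(offset + batch_size, batch_size)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;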



&lt;p&gt;&lt;strong&gt;Use a Sequin job&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, you can have Sequin handle the record pagination and event generation for you. This lets you reuse your existing event handling code to backfill your table.&lt;/p&gt;

&lt;p&gt;You can kick-off a backfill of your events stream via the &lt;a href="https://app.sequin.io"&gt;Sequin console&lt;/a&gt;. Sequin will paginate your Postgres tables and fill your stream with events that have the same shape as the update and insert events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"id"&lt;/span&gt;: &lt;span class="s2"&gt;"note-8hUjsk2p"&lt;/span&gt;, &lt;span class="s2"&gt;"table_name"&lt;/span&gt;: “note”, &lt;span class="o"&gt;{&lt;/span&gt; “data”: &lt;span class="o"&gt;[&lt;/span&gt; … &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you don’t have any other consumers listening to the &lt;code&gt;sequin.sync_1a107d79.salesforce.*.upserted&lt;/code&gt; topic, you can reuse this topic for the backfill &lt;sup id="fnref4"&gt;4&lt;/sup&gt;. You can backfill each of your collections, like &lt;code&gt;task&lt;/code&gt; and &lt;code&gt;account&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a Postgres query for finding matches
&lt;/h2&gt;

&lt;p&gt;With your embeddings set up in Postgres, you’re ready to create a mechanism for querying them.&lt;/p&gt;

&lt;p&gt;Supabase has a &lt;a href="https://supabase.com/blog/openai-embeddings-postgres-vector"&gt;great post&lt;/a&gt; on embeddings in Postgres. I’ve adapted their similarity query below. You can use the cosine distance operator (&lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt;) provided by pgvector to determine similarity. Here’s a query that grabs a list of tasks over a &lt;code&gt;match_threshold&lt;/code&gt;, ordered by most similar to least similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
  &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;salesforce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;salesforce_embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;embedding_task&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;match_threshold&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="n"&gt;match_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
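
&lt;p&gt;Since &lt;code&gt;query_embedding&lt;/code&gt; and &lt;code&gt;match_threshold&lt;/code&gt; are parameters, you may want to package the query the way the Supabase post does: as a Postgres function you can call from your application. A sketch, assuming the schema above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;create or replace function match_tasks (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (id text, description text, similarity float)
language sql stable
as $$
  select
    task.id,
    task.description,
    1 - (embedding_task.embedding &amp;lt;=&amp;gt; query_embedding) as similarity
  from salesforce.task as task
  join salesforce_embedding.task as embedding_task on task.id = embedding_task.id
  where 1 - (embedding_task.embedding &amp;lt;=&amp;gt; query_embedding) &amp;gt; match_threshold
  order by similarity desc
  limit match_count;
$$;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;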



&lt;h2&gt;
  
  
  Build the tool
&lt;/h2&gt;

&lt;p&gt;With your data model and search function squared away, you can build your tool.&lt;/p&gt;

&lt;p&gt;When the user enters a query, you’ll first convert their search query into an embedding using OpenAI. Then, you’ll use the SQL query above to find the Salesforce objects that are the closest match.&lt;/p&gt;
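&lt;p&gt;In code, that flow might look like the following sketch. The request mirrors the one in &lt;code&gt;handle_message/2&lt;/code&gt;; &lt;code&gt;find_matching_tasks/1&lt;/code&gt; is a hypothetical helper that runs the similarity SQL with the query embedding bound as a parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;def search_tasks(query) do
  # Embed the search query with the same model used for the records
  {:ok, resp} =
    Req.post("https://api.openai.com/v1/embeddings",
      headers: [{"Authorization", "Bearer &amp;lt;&amp;lt;secret&amp;gt;&amp;gt;"}],
      json: %{input: query, model: "text-embedding-ada-002"}
    )

  %{"data" =&amp;gt; [%{"embedding" =&amp;gt; query_embedding}]} = resp.body

  # Hypothetical helper: runs the similarity query above, binding
  # query_embedding and a match_threshold as parameters
  find_matching_tasks(query_embedding)
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;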

&lt;p&gt;As a simple demonstration of this tool, we searched for Notes that mention a SaaS platform that a customer or prospect is hoping we add to Sequin.&lt;/p&gt;

&lt;p&gt;Note that the word “integrated” didn’t appear at all in our filter query, yet we still found a match for “interest in seeing ServiceNow integrated into Sequin…”&lt;/p&gt;

&lt;p&gt;This strategy works great for shorter text fields. But it will break down with longer call notes, Intercom conversations, and extended email threads. In those situations, it’s often not enough to find the matching record. You also want to know where in that record the match occurred.&lt;/p&gt;

&lt;p&gt;To address this, we sliced the text fields of our Salesforce objects into smaller, overlapping “windows.” That way, we could compare each of these smaller embeddings to our query embedding and identify regions of high similarity.&lt;/p&gt;

&lt;p&gt;To achieve this, you can split your objects across multiple embedding records. Your table could look something like this, with an added &lt;code&gt;idx&lt;/code&gt; column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;salesforce_embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;references&lt;/span&gt; &lt;span class="n"&gt;salesforce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;cascade&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;idx&lt;/code&gt; (or index) is the window index. One Salesforce object could be split over an arbitrary number of embedding records, according to whatever window size seems to work best for your application.&lt;/p&gt;
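&lt;p&gt;One way to produce those windows is to slide over the text in fixed-size, overlapping steps. This is a sketch; the window size and stride are illustrative and worth tuning for your data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;defp windows(text, size \\ 400, stride \\ 200) do
  # Each window starts `stride` graphemes after the previous one, so
  # consecutive windows share (size - stride) graphemes of context.
  # The returned index maps to the `idx` column above.
  text
  |&amp;gt; String.graphemes()
  |&amp;gt; Enum.chunk_every(size, stride, [])
  |&amp;gt; Enum.map(&amp;amp;Enum.join/1)
  |&amp;gt; Enum.with_index()
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;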

&lt;p&gt;In the application, you’ll display the relevant windows that scored highest in similarity. That will let the user easily see the sentences or paragraphs that are a match. Clicking on the window can bring them to the whole Note, but at the specific location that was a high match.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing back to Salesforce
&lt;/h3&gt;

&lt;p&gt;As we were filtering and reading through Salesforce Tasks and Notes, we realized that, in addition to search, we wanted two more pieces of functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ability to rate objects on a scale of 1-5, according to how deep or insightful the conversation was.&lt;/li&gt;
&lt;li&gt;The ability to tag notes based on product themes, recurring problems, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Sequin’s &lt;a href="https://docs.sequin.io/writes"&gt;write support&lt;/a&gt;, this update is trivial. You can add custom fields to your objects (like &lt;code&gt;Rating__c&lt;/code&gt; and &lt;code&gt;Tags__c&lt;/code&gt;). Then, you can make write requests back to Salesforce like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;update_tags&lt;/span&gt;&lt;span class="p"&gt;(%&lt;/span&gt;&lt;span class="no"&gt;Salesforce&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;task&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Salesforce&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update_changeset&lt;/span&gt;&lt;span class="p"&gt;(%{&lt;/span&gt; &lt;span class="ss"&gt;tags__c:&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Repo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update!&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Changes are applied synchronously to Salesforce’s API, so if there’s a validation error it will be raised in your code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Consolidating customer feedback and call notes into one location is only half the battle. The next piece is creating tools and workflows that let you use this information to guide your product and keep customers in the loop while doing so.&lt;/p&gt;

&lt;p&gt;Embeddings are a powerful tool for achieving this. You can use a machine to help you find similar notes and cluster ideas. With a little effort, you can build your own tool, which gives you far more power and flexibility than you’d find using Salesforce AI.&lt;/p&gt;

&lt;p&gt;Your team will need to centralize their notes to make this work great, however! In a future post, I’ll detail the strategies we use for making data capture easy (e.g. drop a call note into Slack). Subscribe to get notified when we write that post.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;code&gt;pgvector&lt;/code&gt; is included in most of the latest distributions of Postgres.&lt;br&gt;
If you're on AWS RDS, be sure you upgrade to Postgres 15.2+ to get access to the &lt;code&gt;vector&lt;/code&gt; extension. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;You can mix and match fields from different tables to generate embeddings. To start, you can keep it simple and generate embeddings that correspond to a single Salesforce object. For most objects, you’ll probably choose to create an embedding for just one field. For example, you don't need to create an embedding for the whole Note object, just the &lt;code&gt;body&lt;/code&gt; field. ↩&lt;/p&gt;

&lt;p&gt;A few tables might warrant creating a blended embedding with more than one field. For example, Tasks have both a &lt;code&gt;subject&lt;/code&gt; and a &lt;code&gt;description&lt;/code&gt;. You can concatenate the two fields together into a newline-separated string, and generate the embedding on that.&lt;/p&gt;

&lt;p&gt;In the future, you can blend more fields or objects together to let you build on your data in novel ways.&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Normally a pagination strategy like this wouldn't be safe unless IDs were auto-incrementing. But this will work fine in all situations, because we don't care if we miss records that are inserted mid-pagination -- those are being handled by our event handler above! ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;If you need to, you can use a different topic name for this populator, e.g. &lt;code&gt;jobs.backfill_sf_embeddings.[collection]&lt;/code&gt;. You’ll just need to register a different consumer, as each consumer can only be subscribed to one topic. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>webdev</category>
      <category>database</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>All the ways to capture changes in Postgres</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Tue, 19 Sep 2023 22:57:55 +0000</pubDate>
      <link>https://forem.com/acco/all-the-ways-to-capture-changes-in-postgres-5ekf</link>
      <guid>https://forem.com/acco/all-the-ways-to-capture-changes-in-postgres-5ekf</guid>
      <description>&lt;p&gt;Working with data at rest is where Postgres shines. But what about when you need data in motion? What about when you need to trigger a workflow based on changes to a table? Or you need to stream the data in Postgres to another data store, system, or service in real-time?&lt;/p&gt;

&lt;p&gt;Fortunately, Postgres comes with a lot of options to make this happen. In this post, I’ll lay them all out. I’ll also give you an idea of which are easy to do, which are more robust, and how to make the right choice for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Listen/Notify
&lt;/h2&gt;

&lt;p&gt;Perhaps the simplest approach is to use Postgres' interprocess communication feature, Listen/Notify. Listen/Notify is an implementation of the publish-subscribe pattern.&lt;/p&gt;

&lt;p&gt;With Listen/Notify, a Postgres session (or connection) can "listen" to a particular channel for notifications. Activity in the database or other sessions can "notify" that channel. Whenever a notification is sent to a channel, all sessions listening to that channel receive the notification instantly.&lt;/p&gt;

&lt;p&gt;You can try Listen/Notify for yourself by opening two &lt;code&gt;psql&lt;/code&gt; sessions.&lt;/p&gt;

&lt;p&gt;In session 1, you can set up your listener:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;listen&lt;/span&gt; &lt;span class="n"&gt;my_channel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;LISTEN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in session 2, you can publish to that channel with a message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;notify&lt;/span&gt; &lt;span class="n"&gt;my_channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hey there!'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;NOTIFY&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;notify&lt;/span&gt; &lt;span class="n"&gt;my_channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'is this thing on?'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;NOTIFY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the listener process received the messages right away, &lt;code&gt;psql&lt;/code&gt; won't print them automatically. To get it to print the messages it's received so far, you just need to run any query. For example, you can send an empty one like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;listen&lt;/span&gt; &lt;span class="n"&gt;my_channel&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;LISTEN&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;Asynchronous&lt;/span&gt; &lt;span class="n"&gt;notification&lt;/span&gt; &lt;span class="nv"&gt;"my_channel"&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="nv"&gt;"hey there!"&lt;/span&gt; &lt;span class="n"&gt;received&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;PID&lt;/span&gt; &lt;span class="mi"&gt;80019&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;Asynchronous&lt;/span&gt; &lt;span class="n"&gt;notification&lt;/span&gt; &lt;span class="nv"&gt;"my_channel"&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="nv"&gt;"is this thing on?"&lt;/span&gt; &lt;span class="n"&gt;received&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;PID&lt;/span&gt; &lt;span class="mi"&gt;80019&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(This is a quirk of &lt;code&gt;psql&lt;/code&gt;. The Postgres client library in your preferred programming language will deliver notifications to your subscriber immediately, without requiring a query.)&lt;/p&gt;

&lt;p&gt;To use Listen/Notify to capture changes, you can set up a trigger. For example, here's an &lt;code&gt;after&lt;/code&gt; trigger that sends along the payload of the record that changed as JSON via Notify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;replace&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;notify_trigger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;declare&lt;/span&gt;
  &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
  &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_build_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TG_TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'action'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TG_OP&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;perform&lt;/span&gt; &lt;span class="n"&gt;pg_notify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'table_changes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;language&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="n"&gt;my_trigger&lt;/span&gt;
&lt;span class="k"&gt;after&lt;/span&gt; &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;each&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;notify_trigger&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
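&lt;p&gt;On the application side, consuming those notifications mostly amounts to parsing the JSON payload. Here's a minimal sketch in Python (the function name is illustrative; the payload shape matches the trigger above):&lt;/p&gt;

```python
import json

def handle_notification(payload: str) -> dict:
    """Parse the JSON string sent by notify_trigger() on the 'table_changes' channel."""
    # The trigger builds a payload like:
    # {"table": "my_table", "id": 42, "action": "UPDATE"}
    return json.loads(payload)

change = handle_notification('{"table": "my_table", "id": 42, "action": "UPDATE"}')
print(change["table"], change["id"], change["action"])
```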



&lt;h3&gt;
  
  
  Downsides
&lt;/h3&gt;

&lt;p&gt;Listen/Notify is simple and powerful, but has some notable downsides.&lt;/p&gt;

&lt;p&gt;First, as a pub-sub mechanism, it has "at most once" delivery semantics. Notifications are transient; a listener needs to be listening to a channel when notifications are published. When a listener subscribes to a channel, it will only receive notifications from that moment forward. This also means that if a network issue causes a listening session to disconnect even briefly, it will miss any notifications published in the interim.&lt;/p&gt;

&lt;p&gt;Second, the payload size limit is 8000 bytes. If the message exceeds this size, the &lt;code&gt;notify&lt;/code&gt; command will fail. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;
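&lt;p&gt;One way to stay under the limit is to check the encoded size before calling &lt;code&gt;notify&lt;/code&gt; and fall back to sending just a pointer to the row. A sketch of that guard (the fallback shape and names are assumptions, not a standard):&lt;/p&gt;

```python
import json

NOTIFY_LIMIT = 8000  # Postgres' payload limit, in bytes

def safe_payload(record: dict) -> str:
    """Return the full record as JSON if it fits, else just a pointer to it."""
    full = json.dumps(record)
    if len(full.encode("utf-8")) > NOTIFY_LIMIT:
        # Too big: send only identifying fields and let the listener re-fetch the row.
        return json.dumps({"table": record["table"], "id": record["id"], "truncated": True})
    return full

small = safe_payload({"table": "contacts", "id": 7, "body": "hi"})
big = safe_payload({"table": "contacts", "id": 7, "body": "x" * 20000})
print(len(small.encode()), len(big.encode()))
```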

&lt;p&gt;As such, Listen/Notify is solid for basic change detection needs, but you'll probably find it does not serve more sophisticated needs well. However, it can complement other strategies (like "poll the table") nicely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Poll the table
&lt;/h2&gt;

&lt;p&gt;The simplest &lt;em&gt;robust&lt;/em&gt; way to capture changes is to poll the table directly. Here, you need each table to have an &lt;code&gt;updated_at&lt;/code&gt; column or similar that is updated whenever the row changes. (You can use a trigger for this.) A combination of &lt;code&gt;updated_at&lt;/code&gt; &lt;a href="https://blog.sequin.io/whats-changed-in-your-api/"&gt;and &lt;code&gt;id&lt;/code&gt;&lt;/a&gt; serves as your cursor. In this setup, your application logic that polls the table handles storing and maintaining the cursor.&lt;/p&gt;

&lt;p&gt;In addition to polling the table, you can use a Notify subscription to inform your application that a record has been inserted or modified. Postgres' notifications are ephemeral, so this should only serve as an optimization on top of polling.&lt;/p&gt;
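&lt;p&gt;The polling loop itself is keyset pagination over the &lt;code&gt;(updated_at, id)&lt;/code&gt; cursor. Here's a minimal in-memory sketch of the cursor logic (the table contents and function names are illustrative):&lt;/p&gt;

```python
from datetime import datetime, timedelta

BASE = datetime(2023, 9, 1)
# Simulated table: each row has an id and an updated_at timestamp.
table = [{"id": i, "updated_at": BASE + timedelta(seconds=i)} for i in range(1, 6)]

def poll_changes(cursor, limit=2):
    """Fetch rows past the (updated_at, id) cursor; return (rows, new_cursor).

    The SQL equivalent is roughly:
      select * from my_table
      where (updated_at, id) > (:cursor_updated_at, :cursor_id)
      order by updated_at, id
      limit :limit
    """
    batch = sorted(
        (r for r in table if (r["updated_at"], r["id"]) > cursor),
        key=lambda r: (r["updated_at"], r["id"]),
    )[:limit]
    if batch:
        # Advance the cursor to the last row we saw.
        cursor = (batch[-1]["updated_at"], batch[-1]["id"])
    return batch, cursor

cursor = (BASE, 0)
batch1, cursor = poll_changes(cursor)
batch2, cursor = poll_changes(cursor)
print([r["id"] for r in batch1], [r["id"] for r in batch2])
```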

&lt;h3&gt;
  
  
  Downsides
&lt;/h3&gt;

&lt;p&gt;This approach has two downsides.&lt;/p&gt;

&lt;p&gt;The first is that you can't detect when a row is deleted. There's no way to "see" the missing row in the table.&lt;/p&gt;

&lt;p&gt;One remediation is to have a Postgres trigger fire on deletes, and store the &lt;code&gt;id&lt;/code&gt; (and whatever other columns you want) in a separate table: e.g. &lt;code&gt;deleted_contacts&lt;/code&gt;. Then, your application can poll that table to discover deletes instead.&lt;/p&gt;

&lt;p&gt;The second downside is that you don't get diffs. You know this record was updated since you last polled the table, but you don't know &lt;em&gt;what&lt;/em&gt; was updated on the record.&lt;/p&gt;

&lt;p&gt;Maybe deletes aren't a big deal for your use case or you don't care about diffs. If so, polling the table is a reasonable and simple solution for tracking changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replication (WAL)
&lt;/h2&gt;

&lt;p&gt;Postgres supports streaming replication to other Postgres databases. In streaming replication, Postgres sends the WAL stream over a network connection from the primary to a replica. The standby servers pull these WAL records and replay them to stay in sync with the primary.&lt;/p&gt;

&lt;p&gt;Streaming replication was built for streaming changes to other Postgres servers. But you can use it to capture changes for your application too.&lt;/p&gt;

&lt;p&gt;You first create a replication slot, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt;
&lt;span class="n"&gt;pg_create_logical_replication_slot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;your_slot_name&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;output_plugin&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;output_plugin&lt;/code&gt; specifies which plugin Postgres should use to decode WAL changes. Postgres comes with a few built-in plugins. &lt;code&gt;pgoutput&lt;/code&gt; is the default. It formats the output in the binary format that subscribing Postgres servers expect. &lt;code&gt;test_decoding&lt;/code&gt; is a simple output plugin that provides human-readable output of the changes to the WAL.&lt;/p&gt;

&lt;p&gt;The most popular output plugin not built into Postgres is &lt;code&gt;wal2json&lt;/code&gt;. It does what it says on the tin. JSON will be a lot easier for you to consume from an application than Postgres' binary format.&lt;/p&gt;
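&lt;p&gt;To give a sense of what your application would consume, here's a sketch that flattens a &lt;code&gt;wal2json&lt;/code&gt;-style message into per-row change dicts (the sample message approximates &lt;code&gt;wal2json&lt;/code&gt;'s default output; check the plugin's docs for the exact shape):&lt;/p&gt;

```python
import json

def parse_wal2json(raw: str):
    """Flatten a wal2json message into a list of per-row change dicts."""
    message = json.loads(raw)
    changes = []
    for change in message["change"]:
        # Pair up column names with their values (deletes may omit them).
        row = dict(zip(change.get("columnnames", []), change.get("columnvalues", [])))
        changes.append({"kind": change["kind"], "table": change["table"], "row": row})
    return changes

sample = '''{"change": [{"kind": "insert", "schema": "public", "table": "contacts",
  "columnnames": ["id", "email"], "columntypes": ["integer", "text"],
  "columnvalues": [1, "ada@example.com"]}]}'''
print(parse_wal2json(sample))
```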

&lt;p&gt;After creating your replication slot, you can start it and consume from it. Working with replication slots uses a different part of the Postgres protocol than standard queries. But many client libraries have functions that help you work with replication slots.&lt;/p&gt;

&lt;p&gt;For example, this is how you consume WAL messages in the &lt;code&gt;psycopg2&lt;/code&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_replication&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'your_slot_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;consume_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;acknowledge_to_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;acknowledge_to_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Process the message (msg) here
&lt;/span&gt;    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="c1"&gt;# Acknowledge the message
&lt;/span&gt;    &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flush_lsn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wal_end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the client is responsible for ack'ing WAL messages that it has received. In this way, the replication slot behaves like a message queue such as SQS.&lt;/p&gt;

&lt;p&gt;Instead of consuming from the WAL directly, you can use tools like Debezium to do this for you. Debezium will consume the WAL from Postgres and stream those changes to a variety of sinks, including Kafka or NATS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Downsides
&lt;/h3&gt;

&lt;p&gt;Using Postgres' replication facilities to capture changes is a robust solution. The biggest downside is complexity. Replication slots and the replication protocol are less familiar to most developers than the "standard" parts (i.e. tables and queries).&lt;/p&gt;

&lt;p&gt;Along with this complexity is a decrease in clarity. If replication breaks, lags, or otherwise isn't working as expected, it can be a bit trickier to debug than the other solutions outlined here.&lt;/p&gt;

&lt;p&gt;Another aspect worth mentioning is that replication slots may require tweaking &lt;code&gt;postgresql.conf&lt;/code&gt;. For example, you may need to adjust parameters like &lt;code&gt;max_wal_senders&lt;/code&gt; and &lt;code&gt;max_replication_slots&lt;/code&gt;. So you'll need total access to the database to implement this solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capture changes in an audit table
&lt;/h2&gt;

&lt;p&gt;In this approach, you set up a separate table for logging changes, e.g. &lt;code&gt;changelog&lt;/code&gt;. That table contains columns describing each modification, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;action&lt;/code&gt;: Was this an &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, or &lt;code&gt;delete&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;old&lt;/code&gt;: A jsonb of the record before the mutation. Blank for inserts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;values&lt;/code&gt;: A jsonb of the changed fields. Blank for deletes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inserted_at&lt;/code&gt;: Time the change occurred.&lt;/li&gt;
&lt;/ul&gt;
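&lt;p&gt;For updates, the &lt;code&gt;values&lt;/code&gt; column only needs the fields that actually changed. Computing that diff is simple; here's a sketch (the data is illustrative):&lt;/p&gt;

```python
def changed_fields(old: dict, new: dict) -> dict:
    """Return only the fields whose values differ between old and new."""
    return {k: v for k, v in new.items() if old.get(k) != v}

old = {"id": 1, "email": "ada@old.com", "name": "Ada"}
new = {"id": 1, "email": "ada@new.com", "name": "Ada"}
print(changed_fields(old, new))
```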

&lt;p&gt;To set this up, you need to create a trigger function that inserts into this table every time a change occurs. Then, you need to create triggers on all the tables you care about to invoke that trigger function.&lt;/p&gt;

&lt;p&gt;Here's an example of what that trigger function might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace function changelog_trigger() returns trigger as $$
declare
  action text;
  table_name text;
  transaction_id bigint;
  timestamp timestamp;
  old_data jsonb;
  new_data jsonb;
begin
  action := lower(TG_OP::text);
  table_name := TG_TABLE_NAME::text;
  transaction_id := txid_current();
  timestamp := current_timestamp;

  if TG_OP = 'DELETE' then
    old_data := to_jsonb(OLD.*);
  elseif TG_OP = 'INSERT' then
    new_data := to_jsonb(NEW.*);
  elseif TG_OP = 'UPDATE' then
    old_data := to_jsonb(OLD.*);
    new_data := to_jsonb(NEW.*);
  end if;

  insert into changelog (action, table_name, transaction_id, timestamp, old_data, new_data) 
  values (action, table_name, transaction_id, timestamp, old_data, new_data);

  return null;
end;
$$ language plpgsql;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After setting up a way to capture changes, you need to figure out how to consume them.&lt;/p&gt;

&lt;p&gt;There are a lot of different ways to do this. One way is to treat the &lt;code&gt;changelog&lt;/code&gt; as a queue. Your application workers can pull changes from this table. You'll probably want to ensure that changes are processed ~exactly once. You can use the &lt;code&gt;for update skip locked&lt;/code&gt; feature in Postgres to do this. For example, your workers can open a transaction and grab a chunk of &lt;code&gt;changelog&lt;/code&gt; entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;changelog&lt;/span&gt; 
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; 
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;skip&lt;/span&gt; &lt;span class="n"&gt;locked&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, other workers running that query will not receive this "locked" block of rows. After your worker processes the records, it can delete them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;delete from changelog 
where id in (list_of_processed_record_ids);

commit;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Downsides
&lt;/h3&gt;

&lt;p&gt;This approach is similar to using a replication slot, but more manual. The trigger function and table design I've outlined might work to start. But you'd likely need to make tweaks before deploying at scale in production. &lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The advantage over replication slots is that it's all "standard" Postgres. Instead of an opaque replication slot, you have an easy-to-query Postgres table. And you don't need access to &lt;code&gt;postgresql.conf&lt;/code&gt; to make this work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Foreign data wrappers
&lt;/h2&gt;

&lt;p&gt;Foreign data wrappers (FDWs) are a Postgres feature that allow you to both read from and write to external data sources from your Postgres database.&lt;/p&gt;

&lt;p&gt;The most notable and widely supported extension built on FDWs is &lt;code&gt;postgres_fdw&lt;/code&gt;. With &lt;code&gt;postgres_fdw&lt;/code&gt;, you can connect two Postgres databases and create something like a &lt;a href="https://www.postgresql.org/docs/current/sql-createview.html"&gt;view&lt;/a&gt; in one Postgres database that references a table in another Postgres database. Under the hood, you're turning one Postgres database into a client and the other into a server. When you make queries against foreign tables, the client database sends the queries to the server database via Postgres' &lt;a href="https://www.postgresql.org/docs/current/protocol-flow.html"&gt;wire protocol&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Using FDWs to capture changes is an unusual strategy. I wouldn't recommend it outside very specific situations.&lt;/p&gt;

&lt;p&gt;One situation where FDWs could make sense is if you're capturing changes in one Postgres database in order to write them to another Postgres database. Perhaps you use one database for accounting and another for your application. You can skip the intermediary change capture steps and use &lt;code&gt;postgres_fdw&lt;/code&gt; to go from database to database.&lt;/p&gt;

&lt;p&gt;Here's an example trigger that ensures the status for a given account (identified by &lt;code&gt;email&lt;/code&gt;) is in sync across two databases. This assumes the other database's &lt;code&gt;account&lt;/code&gt; table has already been imported as a foreign table under the &lt;code&gt;foreign_app_database&lt;/code&gt; schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace function cancel_subscription()
  returns trigger as $$
declare
  account_status text;
begin
  if (new.status = 'cancelled' or new.status = 'suspended') then
    account_status := 'cancelled';

    update foreign_app_database.account
    set status = account_status
    where email = new.email;
  end if;

  return new;
end;
$$ language plpgsql;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to &lt;code&gt;postgres_fdw&lt;/code&gt;, you can create and load your own foreign data wrappers into your Postgres database.&lt;/p&gt;

&lt;p&gt;That means you could create a foreign data wrapper that posts changes to an internal API. Unlike the other change detection strategies in this list, because you'd write to the API inside your commit, your API would have the ability to reject the change and roll back the commit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Downsides
&lt;/h3&gt;

&lt;p&gt;Foreign data wrappers are a fun and powerful Postgres feature. But they'll rarely be your best option for capturing changes. You're probably not trying to replicate changes from one Postgres database to another. And while writing your own foreign data wrapper from scratch &lt;a href="https://github.com/supabase/wrappers"&gt;has gotten easier&lt;/a&gt;, it's probably the biggest lift on this list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There are lots of options for capturing changes in Postgres. Depending on your use case, some options are clearly better than others. In sum:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Listen/Notify is great for non-critical event capture, prototyping, or optimizing polling.&lt;/li&gt;
&lt;li&gt;Polling for changes is a fine, straightforward solution for simple use cases.&lt;/li&gt;
&lt;li&gt;Replication is probably your best bet for a robust solution. If that’s too difficult or opaque, then perhaps the audit table is a good middle-ground.&lt;/li&gt;
&lt;li&gt;Finally, foreign data wrappers solve a need you’re unlikely to have.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We examined all of these options for our own change capture requirements, and unfortunately none of them met our complex (and niche) needs. So, we ended up needing to build a Postgres proxy 😅 You can &lt;a href="https://blog.sequin.io/we-had-no-choice-but-to-build-a-postgres-proxy/"&gt;read more about that here&lt;/a&gt;.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Note the payload size includes the channel name, which like all Postgres identifiers can be up to 64 bytes in size. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;One example issue that comes to mind: should there be a timeout for how long workers can have changes checked out? ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>postgres</category>
      <category>webdev</category>
    </item>
    <item>
      <title>We had no choice but to build a Postgres proxy</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Tue, 19 Sep 2023 22:40:56 +0000</pubDate>
      <link>https://forem.com/acco/we-had-no-choice-but-to-build-a-postgres-proxy-528d</link>
      <guid>https://forem.com/acco/we-had-no-choice-but-to-build-a-postgres-proxy-528d</guid>
      <description>&lt;p&gt;We knew building a database proxy would be hard. We wanted to find &lt;em&gt;any&lt;/em&gt; other way to achieve our mission. But alas, after looking at all the options, only one solution remained standing.&lt;/p&gt;

&lt;p&gt;Below, I'll share our journey to the inescapable conclusion.&lt;/p&gt;




&lt;h2&gt;
  
  
  Perhaps async writes will be good enough?
&lt;/h2&gt;

&lt;p&gt;When we started Sequin, we had a one-way sync from third-party APIs to Postgres databases. Our hypothesis was that when working with third-party APIs, reading your data from a database is way easier than reading it from the API. You can use SQL or your favorite ORM, and you don't have to worry about rate limits, latency, or availability.&lt;/p&gt;

&lt;p&gt;Reading through the database worked so well that we wanted to see if we could make writes work through the database too.&lt;/p&gt;

&lt;p&gt;So we added database writes. We'd monitor your database for changes and send those changes to the API. The process ran &lt;em&gt;async&lt;/em&gt;, completing just a couple of seconds after you committed your write.&lt;/p&gt;

&lt;p&gt;After seeing customers adopt and scale with async writes, it was confirmed: writing back to the API via SQL is amazing. But the async part was causing a lot of problems.&lt;/p&gt;

&lt;p&gt;For example, you write to Postgres. Let’s say you’re updating the email on a Salesforce contact. The write succeeds. But it's totally unknown to you if and when that change will make it to the API. Inserting, updating, or deleting a record in Postgres is like creating a job or a "promise" that you hope will resolve to a successful API write in the future.&lt;/p&gt;

&lt;p&gt;The API is ultimately the source of truth. You need the API to approve your mutation. Writes are where &lt;a href="https://acko.net/blog/apis-are-about-policy/"&gt;APIs enforce their policy&lt;/a&gt;. You want to know about validation errors when they happen – like if the contact you’re updating has an invalid email – so you can handle them in code.&lt;/p&gt;

&lt;p&gt;When developing on an API, an async experience like this is tough. You craft your mutation in your Postgres client and commit it. Then, you have to go check somewhere else to monitor the progress of your request – be it the audit table or a dashboard somewhere.&lt;/p&gt;

&lt;p&gt;Furthermore, this approach means changes can originate in two places and that you have two nodes that can drift apart.&lt;/p&gt;

&lt;p&gt;We were removing the HTTP API but replacing it with a classically hard distributed systems problem. For example: if a change fails API validation, do we roll it back? What if there were other changes stacked on top of this one? Or what if drift occurred between the time the developer committed the Postgres change and when we made the subsequent API request?&lt;/p&gt;

&lt;h2&gt;
  
  
  Synchronous, but at what cost?
&lt;/h2&gt;

&lt;p&gt;We grew weary of async writes. It felt like we were close, but hadn't found the winning solution yet.&lt;/p&gt;

&lt;p&gt;Ideally, we wanted the API to continue to act as the validation layer for commits. And for the API to be the source of truth, with Postgres as a simple follower database.&lt;/p&gt;

&lt;p&gt;We wanted synchronous writes. But inside of Postgres' transactional model, we didn't see a way to make this happen.&lt;/p&gt;

&lt;p&gt;So, we began exploring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;p&gt;Our requirements were driven by two goals: writes should be synchronous, so errors would be easy to see and handle, and we should be compatible with all Postgres clients, including ORMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, and &lt;code&gt;delete&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Support for &lt;code&gt;returning&lt;/code&gt; clauses. &lt;code&gt;returning&lt;/code&gt; clauses are often necessary for inserts, where you need to do something with the row you just inserted. And indeed several ORMs rely on these clauses to operate. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;A commit must translate to a single API request. This was the simplest way to avoid weird inconsistent state. &lt;sup id="fnref2"&gt;2&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;Errors must happen during the commit. If the operation fails, the user should receive a Postgres error.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notably, we decided that batch operations were a "nice to have." Many transactional workflows operate on only a single row at a time. Batch operations would be most common in one-off workflows. If we had to give them up for synchronous writes, we would.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option: Synchronous replication
&lt;/h2&gt;

&lt;p&gt;Postgres supports streaming replication to other Postgres databases. In streaming replication, Postgres sends the WAL stream over a network connection from the primary to a replica.&lt;/p&gt;

&lt;p&gt;When streaming replication is set to synchronous mode, the primary will wait for any or all replicas to confirm they committed the data.&lt;/p&gt;

&lt;p&gt;Instead of streaming the WAL to another Postgres database, we could stream the WAL synchronously to our application server. Instead of committing the changes to a database, it would attempt to commit them to the API. If it failed to do so, it could raise an error, which would trickle up to and break the transaction.&lt;/p&gt;

&lt;p&gt;However, this wasn't going to meet our requirements.&lt;/p&gt;

&lt;p&gt;Let's start by considering a success case: the customer inserts a record into the database, the new record streams to us through the WAL, we commit the new record to the API, and the API accepts the insert.&lt;/p&gt;

&lt;p&gt;We now need to update the database with the record returned by the API. Importantly, the API response body includes the record’s API ID. It also may contain other fields not sent in our request, like calculated fields or timestamps.&lt;/p&gt;

&lt;p&gt;In synchronous replication, we can only update the row with the result from the API &lt;em&gt;after&lt;/em&gt; the commit has happened. That’s because another process is responsible for writing the changes back:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Ew5H_fZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.sequin.io/content/images/2023/09/CleanShot-2023-09-18-at-09.50.57%402x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Ew5H_fZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.sequin.io/content/images/2023/09/CleanShot-2023-09-18-at-09.50.57%402x.png" alt="Synchronous replication architecture" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So while we hear about the commit as it’s happening – and can interrupt the commit if it fails – we still can't fit our whole operation neatly into the commit. We have to wait for the commit to finish – and therefore for the row to become available/unlocked – before we can update it with the API response.&lt;/p&gt;

&lt;p&gt;This means we can’t meet two requirements.&lt;/p&gt;

&lt;p&gt;The first is that there is no way for us to support a &lt;code&gt;returning&lt;/code&gt; clause. The row needs to be modified &lt;em&gt;before&lt;/em&gt; it's committed if you want to reflect the updated row in the fields returned to the client. You can only do that in a "before each row" trigger or in a rewrite rule.&lt;/p&gt;

&lt;p&gt;The second issue is related: &lt;em&gt;when&lt;/em&gt; the record will be updated with the API response is really indeterminate! If the client can't rely on a &lt;code&gt;returning&lt;/code&gt; clause, they may opt to do a read-after-write: after writing the record, immediately read it. But again, because the update is not happening coincident with the commit, there's no telling if that subsequent read will "beat" whatever process we have writing back the API's response to the row.&lt;/p&gt;

&lt;p&gt;In addition, in the failure case where the API rejects the changes, we weren't confident we could craft the right Postgres errors to percolate up to the client. (Unconfirmed, as we'd already eliminated this option.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Option: Foreign data wrappers
&lt;/h2&gt;

&lt;p&gt;Foreign data wrappers were another serious contender.&lt;/p&gt;

&lt;p&gt;Foreign data wrappers (FDWs) are a Postgres feature that allow you to both read from and write to external data sources. The architecture that they model felt very similar to what we were building: the data source you're writing to doesn't live in your database, it lives over the wire (the API). This was encouraging.&lt;/p&gt;

&lt;p&gt;While you can build your own FDWs, most cloud providers do not let you load arbitrary extensions into their managed databases. This was the first rub: in order to support our numerous customers on AWS' or GCP's managed Postgres databases, we couldn't create our own foreign data wrapper extension.&lt;/p&gt;

&lt;p&gt;Instead we’d need to use Postgres’ built-in FDW, &lt;a href="https://www.postgresql.org/docs/current/postgres-fdw.html"&gt;&lt;code&gt;postgres_fdw&lt;/code&gt;&lt;/a&gt;. With &lt;code&gt;postgres_fdw&lt;/code&gt;, you can connect two Postgres databases and create something like a &lt;a href="https://www.postgresql.org/docs/current/sql-createview.html"&gt;view&lt;/a&gt; in one Postgres database that references a table in another Postgres database.&lt;/p&gt;

&lt;p&gt;These foreign tables behave exactly like local tables. You can &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, and &lt;code&gt;join&lt;/code&gt; all the same.&lt;/p&gt;
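&lt;p&gt;For reference, here's roughly what the client-side setup looks like with &lt;code&gt;postgres_fdw&lt;/code&gt;. The host, credentials, and table names are illustrative, not a real deployment:&lt;/p&gt;

```sql
-- enable the built-in foreign data wrapper
create extension if not exists postgres_fdw;

-- point it at a server that speaks the Postgres wire protocol
create server api_server
  foreign data wrapper postgres_fdw
  options (host 'proxy.example.com', port '5432', dbname 'api');

create user mapping for current_user
  server api_server
  options (user 'api_user', password 'secret');

-- a "foreign table" that behaves like a local table but lives over the wire
create foreign table orders (
  order_id integer,
  product_name text,
  quantity integer
) server api_server options (schema_name 'public', table_name 'orders');
```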

&lt;p&gt;When you set up &lt;code&gt;postgres_fdw&lt;/code&gt;, under the hood you're turning one Postgres database into a client and the other into a server. When you make queries against foreign tables, the client database sends the queries to the server database via Postgres' &lt;a href="https://www.postgresql.org/docs/current/protocol-flow.html"&gt;wire protocol&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To make &lt;code&gt;postgres_fdw&lt;/code&gt; work, we'd set up a Postgres wire-protocol-compliant server. That server would act as a fake Postgres database. By fitting into the model of the standard &lt;code&gt;postgres_fdw&lt;/code&gt;, we'd have the widest compatibility:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xGdtLF9F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.sequin.io/content/images/2023/09/CleanShot-2023-09-18-at-09.51.30%402x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xGdtLF9F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.sequin.io/content/images/2023/09/CleanShot-2023-09-18-at-09.51.30%402x.png" alt="Foreign data wrappers architecture" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like synchronous replication, we had to find a way to use &lt;code&gt;postgres_fdw&lt;/code&gt; inside of Postgres' standard commit flow to deliver the experience we wanted. And like synchronous replication, we ran into limitations.&lt;/p&gt;

&lt;p&gt;The most notable limitation was with &lt;code&gt;postgres_fdw&lt;/code&gt; itself.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;update&lt;/code&gt; and &lt;code&gt;delete&lt;/code&gt; queries, the client Postgres sends the query as-is to the server. This makes sense – the client Postgres doesn't store any of the rows. So when you run an &lt;code&gt;update&lt;/code&gt; or &lt;code&gt;delete&lt;/code&gt;, it has to fully delegate the operation to the server. This is exactly what we wanted: the client Postgres database proxies these queries to our server, giving us full control over how they are executed.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;insert&lt;/code&gt; queries were a different story. In specific situations, &lt;code&gt;postgres_fdw&lt;/code&gt; does not support batch inserts. The biggest drag is that it does not support batch inserts when a &lt;a href="https://github.com/postgres/postgres/blob/master/contrib/postgres_fdw/postgres_fdw.c#L2010-L2017"&gt;&lt;code&gt;returning&lt;/code&gt; clause is specified&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In these situations, the query doesn't fail (which for our purposes would be preferable). Instead, &lt;code&gt;postgres_fdw&lt;/code&gt; will rewrite the batch insert, turning it into multiple single row inserts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- original query sent by the client…&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Product A'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Product B'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- is split into two queries before being sent to the foreign Postgres server&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Product A'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Product B'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is unfortunate because both the user and foreign server are blind to the fact that their batch insert is actually being translated into a bunch of serial inserts. Likewise, on the server, when we receive an insert we have no way of knowing that another is about to follow as part of a batch.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;postgres_fdw&lt;/code&gt;, all operations happen inside of a transaction. So, for batch inserts, you might think: can't we just "ack" each inserted row as it's received, storing it in memory, and then, at the end of the transaction, write all the rows to the API? But then we'd violate our requirement to fully support &lt;code&gt;returning&lt;/code&gt; clauses – because our &lt;em&gt;only&lt;/em&gt; opportunity to affect the row returned to the client is when we receive each individual insert query. We can't return dummy data for each insert query, then at the end of the transaction say: "never mind all those rows I just sent you – here are the &lt;em&gt;real&lt;/em&gt; rows you should return."&lt;/p&gt;

&lt;p&gt;Ideally, we'd be able to detect when a customer was attempting to make a batch insert with a &lt;code&gt;returning&lt;/code&gt; clause and return a helpful error. But that’s not possible.&lt;/p&gt;

&lt;p&gt;So, foreign data wrappers in the general sense wouldn't work for us because we can't install our own FDWs on managed databases. And using &lt;code&gt;postgres_fdw&lt;/code&gt; felt clever, but put us downstream of an extension that we had little control over.&lt;/p&gt;

&lt;p&gt;We briefly surveyed other options, including far-out projects like &lt;a href="https://github.com/pramsey/pgsql-http"&gt;pgsql-http&lt;/a&gt;. But no matter what we looked at, it was clear: we couldn't do what we needed to do behind the database (synchronous replication). And we couldn't do what we needed to do inside the database (FDWs).&lt;/p&gt;

&lt;p&gt;We'd need to get in front of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Landing on the Postgres proxy
&lt;/h2&gt;

&lt;p&gt;To get in front of the database, we'd need to build a proxy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jKin4BlD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.sequin.io/content/images/2023/09/CleanShot-2023-09-18-at-09.53.19%402x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jKin4BlD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.sequin.io/content/images/2023/09/CleanShot-2023-09-18-at-09.53.19%402x.png" alt="Postgres proxy architecture" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It felt like the biggest lift, but also came with the biggest guarantee that we could get the experience we wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We’d be able to support &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, and &lt;code&gt;delete&lt;/code&gt;, including batches.&lt;/li&gt;
&lt;li&gt;We’d be able to fully support &lt;code&gt;returning&lt;/code&gt; clauses, returning the response that we received from the API after performing the mutation request.&lt;/li&gt;
&lt;li&gt;We’d have full control over the Postgres errors that we returned to the client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To pull this off, we’d need to add an additional requirement, which was that the proxy would need to work seamlessly with all Postgres clients. That meant adhering to Postgres' wire protocol.&lt;/p&gt;

&lt;p&gt;A standard Postgres proxy like Pgbouncer doesn't need to know much about Postgres' wire protocol beyond authentication. After a client establishes a connection to Pgbouncer, Pgbouncer opens connections to the Postgres database. These connections reside in a pool. When a client sends a statement or transaction, Pgbouncer checks out a connection from the pool and uses that connection for the duration of the statement or transaction.&lt;/p&gt;

&lt;p&gt;But once a client's connection is paired with a database connection, Pgbouncer doesn't need to know much about what's going on. The proxy simply passes TCP packets back and forth between the two. All the while, it's looking for one particular message from the server to the client (the &lt;code&gt;ReadyForQuery&lt;/code&gt; message). When it sees that message, it knows whatever the client and server were working on is completed, and it's able to release the database connection back into the pool.&lt;/p&gt;
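&lt;p&gt;For the curious, backend messages in the wire protocol are framed as a one-byte type followed by a four-byte big-endian length (which counts itself). Here's a minimal sketch of how a pooler might scan the server-to-client stream for &lt;code&gt;ReadyForQuery&lt;/code&gt; – illustrative only, ignoring partial reads and the untyped startup packet:&lt;/p&gt;

```python
import struct

def iter_messages(buf):
    """Yield (msg_type, payload) pairs from a stream of backend messages."""
    offset = 0
    while len(buf) - offset >= 5:
        msg_type = buf[offset:offset + 1]
        # the four-byte length includes itself but not the type byte
        (length,) = struct.unpack_from(">I", buf, offset + 1)
        payload = buf[offset + 5:offset + 1 + length]
        yield msg_type, payload
        offset += 1 + length

def is_ready_for_query(msg_type, payload):
    # status byte: b"I" idle, b"T" in a transaction, b"E" failed transaction
    return msg_type == b"Z" and len(payload) == 1

# example stream: CommandComplete ("C") followed by ReadyForQuery ("Z", idle)
stream = (
    b"C" + struct.pack(">I", 15) + b"INSERT 0 1\x00"
    + b"Z" + struct.pack(">I", 5) + b"I"
)
```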

&lt;p&gt;We thought that, to achieve our goal, our proxy might not need to know much about what was going on either.&lt;/p&gt;

&lt;p&gt;Boy, were we wrong.&lt;/p&gt;

&lt;p&gt;As we expanded our proxy to cover the surface area of Postgres operations, our proxy had to become more and more fluent in the Postgres protocol.&lt;/p&gt;

&lt;p&gt;Eventually, our proxy became a fluent Postgres server. The specification of the Postgres protocol is concise, leaving room for interpretation. In that room, client quirks have blossomed, and our proxy had to adapt to handle all of them. &lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The proxy also had to become a fluent Postgres client. We have to inject our own queries into the client/server dance to capture changes.&lt;/p&gt;

&lt;p&gt;Just like a Postgres server or client, our proxy keeps an internal state machine for all connections to ensure we know precisely where we are in a statement or transaction flow. We know the state of the client connection and of the server connection and what we need to do to safely progress to the next state with each.&lt;/p&gt;

&lt;p&gt;(More on our proxy's design in a future post!)&lt;/p&gt;

&lt;h3&gt;
  
  
  The experience we always wanted
&lt;/h3&gt;

&lt;p&gt;While it was a journey to decide to build the proxy and another journey to build it, we ended up with a solution that gives us much more control. That meant building the experience we'd been looking for.&lt;/p&gt;

&lt;p&gt;When you make an &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, or &lt;code&gt;delete&lt;/code&gt; to a Sequin-synced table, we're able to check the batch size of your query. If you're operating on more records than we can modify in a single API request, we'll return an error. Otherwise, we'll commit the changes to the API. If the API request succeeds, we'll commit your changes to the database and complete your query – including fulfilling your &lt;code&gt;returning&lt;/code&gt; clause if you had one. If the API request fails, we'll rollback the changes in your database and return a helpful Postgres error to your client.&lt;/p&gt;

&lt;p&gt;Some 80%+ of the operations we all perform on APIs are just standard CRUD. SQL or your favorite ORM is a great interface for CRUD, and far easier and faster to use than an HTTP API. It's such a cool experience, and we love seeing customers' reactions every time they get to play with it.&lt;/p&gt;

&lt;p&gt;If you’re curious to give it a whirl for yourself, &lt;a href="https://app.sequin.io"&gt;sign-up for a free trial&lt;/a&gt;. Otherwise, be sure to subscribe to our blog to catch future posts where we detail how our Postgres proxy works.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;The row the database returns must reflect the row we upsert &lt;em&gt;after&lt;/em&gt; getting a response from the API. That means it will be fully populated, with a canonical ID and the like. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;For example, imagine if it took us 5 API requests to perform all the mutations in a single commit. What happens if the third API request fails validation? The commit was only partially committed to the API, but in Postgres it's all-or-nothing. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;One example: some clients send an empty &lt;code&gt;SimpleQuery&lt;/code&gt; message as a keep alive. An empty &lt;code&gt;SimpleQuery&lt;/code&gt; is strange. A &lt;code&gt;Sync&lt;/code&gt; is better suited for this purpose. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>webdev</category>
      <category>postgres</category>
    </item>
    <item>
      <title>How we built our sync on HubSpot's API</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Tue, 28 Mar 2023 16:58:39 +0000</pubDate>
      <link>https://forem.com/acco/how-we-built-our-sync-on-hubspots-api-3gpi</link>
      <guid>https://forem.com/acco/how-we-built-our-sync-on-hubspots-api-3gpi</guid>
      <description>&lt;p&gt;HubSpot is a leading platform for marketing, sales, and customer service. For companies that use HubSpot heavily, many operational workflows originate in the platform. Critical data about their customers pass in and out of HubSpot and their core services.&lt;/p&gt;

&lt;p&gt;Like other CRMs, HubSpot allows for a lot of customization. With &lt;em&gt;custom objects&lt;/em&gt;, you can create tables in HubSpot that match your domain. With &lt;em&gt;associations&lt;/em&gt;, you can connect related objects together in meaningful ways, such as linking contacts to their respective companies or deals.&lt;/p&gt;

&lt;p&gt;This means organizations often "blur the line" between their database and their HubSpot data. Records often need to move seamlessly from HubSpot to their application – and back again.&lt;/p&gt;

&lt;p&gt;HubSpot has an API to facilitate reading data from and writing changes to their platform.&lt;/p&gt;

&lt;p&gt;One of the most common use cases of an API is &lt;a href="https://dev.to__GHOST_URL__/whats-changed-in-your-api/"&gt;figuring out what's changed&lt;/a&gt;. As a developer, you need to find out what happened in an upstream API so that you can make updates to your local data or trigger side effects.&lt;/p&gt;

&lt;p&gt;Figuring out what's changed in HubSpot is challenging because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no &lt;a href="https://dev.to__GHOST_URL__/events-not-webhooks/"&gt;&lt;code&gt;/events&lt;/code&gt; endpoint&lt;/a&gt; that lists changes.&lt;/li&gt;
&lt;li&gt;There are no &lt;a href="https://developers.hubspot.com/docs/api/webhooks#webhook-subscriptions"&gt;webhooks&lt;/a&gt; for custom objects.&lt;/li&gt;
&lt;li&gt;There is limited queryability around associations.&lt;/li&gt;
&lt;li&gt;Query support for deleted objects is incomplete.&lt;/li&gt;
&lt;li&gt;The Search API is eventually consistent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;HubSpot is one of the sources we sync at Sequin. Detecting changes (often called change data capture) is the backbone of any sync process. So, we had to figure out how to overcome these challenges to detect changes in HubSpot to power our sync.&lt;/p&gt;

&lt;p&gt;In this post, I'll break our strategy down. Our sync process consists of three primary parts, which I'll step through in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schema introspection&lt;/strong&gt;, where we determine the list of syncable objects from HubSpot and their schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backfilling&lt;/strong&gt;, where we paginate over the entire HubSpot instance to grab historical data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syncing&lt;/strong&gt;, where we poll HubSpot for changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Introspecting the schema
&lt;/h2&gt;

&lt;p&gt;HubSpot has some reflection APIs you can use to &lt;a href="https://developers.hubspot.com/docs/api/crm/crm-custom-objects"&gt;list the objects and custom objects in a HubSpot instance&lt;/a&gt;. You can also describe the fields of those objects.&lt;/p&gt;

&lt;p&gt;To list the schemas for all the custom objects in your instance, call &lt;code&gt;GET /crm/v3/schemas&lt;/code&gt;. The response contains all the info you'll need to determine how to parse JSON objects for use in your code or database. It contains the full list of &lt;code&gt;properties&lt;/code&gt; for each object and the type of each property (e.g. &lt;code&gt;text&lt;/code&gt; or &lt;code&gt;datetime&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "results": [
      {
          "labels": { "singular": "Car", "plural": "Cars" },
          "requiredProperties": ["year", "vin", "model", "make"],
          "searchableProperties": ["year", "vin", "model", "make"],
          "primaryDisplayProperty": "model",
          "secondaryDisplayProperties": ["make"],
          "archived": false,
          "id": "7171718",
          "fullyQualifiedName": "p82918_car",
          "properties": [
            {
              "updatedAt": "2022-05-26T00:13:43.786Z",
              "createdAt": "2022-05-26T00:05:17.903Z",
              "name": "color",
              "label": "Color",
              "type": "string",
              "fieldType": "text",
              "description": "",
              "groupName": "car_information",
              "options": [],
              "updatedUserId": "44561081",
              "displayOrder": -1,
              "calculated": false,
              "externalOptions": false,
              "archived": false,
              "hasUniqueValue": false,
              "hidden": false,
              "modificationMetadata": {
                "archivable": true,
                "readOnlyDefinition": false,
                "readOnlyValue": false
              },
              ...
      }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Curiously, only custom objects are listed here, not core HubSpot objects. To get the schema for core objects like &lt;code&gt;Company&lt;/code&gt; or &lt;code&gt;Contact&lt;/code&gt;, you'll need to call the dedicated schema endpoints for those objects. Here's the list of those objects and endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Company&lt;/code&gt; @ &lt;code&gt;/crm/v3/schemas/companies&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Contact&lt;/code&gt; @ &lt;code&gt;/crm/v3/schemas/contacts&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Deal&lt;/code&gt; @ &lt;code&gt;/crm/v3/schemas/deals&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Line item&lt;/code&gt; @ &lt;code&gt;/crm/v3/schemas/line_items&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Product&lt;/code&gt; @ &lt;code&gt;/crm/v3/schemas/products&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Quote&lt;/code&gt; @ &lt;code&gt;/crm/v3/schemas/quotes&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Ticket&lt;/code&gt; @ &lt;code&gt;/crm/v3/schemas/tickets&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, some objects don't seem to be present anywhere. For example, there is no schema endpoint for a HubSpot Pipeline. For these objects, your best bet is to determine their schema definition manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backfilling
&lt;/h2&gt;

&lt;p&gt;A very common workflow is &lt;strong&gt;backfilling&lt;/strong&gt; or paginating over an entire dataset. Backfilling is integral to setting up a sync on top of an API – you'll need to pull in the data that was created prior to your sync going live.&lt;/p&gt;

&lt;p&gt;We decided to use HubSpot's Search API for backfilling. This felt like the best way to paginate through the history in a stable way that would ensure we didn't miss anything.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Search API
&lt;/h3&gt;

&lt;p&gt;To use the Search API for stable pagination, you need three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A query filter&lt;/li&gt;
&lt;li&gt;The list of properties you want returned&lt;/li&gt;
&lt;li&gt;A sort parameter&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You'll send them along in the JSON body of your Search API request like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /crm/v3/objects/{object_name}/search

{
  filterGroups: [ /* ... */ ],
  properties: [ /* ... */ ],
  sorts: [ /* ... */ ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;filterGroups&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HubSpot's query filters match a specific format. At Sequin, we use a query filter that looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"filters": [
    {
      "propertyName": "createdate",
      "operator": "GTE",
      "value": {cursor}
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This essentially constructs a query that looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;createdate &amp;gt;= {cursor}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;propertyName&lt;/code&gt; is the field that you want to filter on. Here, we specify the &lt;code&gt;createdate&lt;/code&gt; field. Unfortunately, not every object has a &lt;code&gt;createdate&lt;/code&gt; field – for many non-custom objects (but not all), this field is &lt;code&gt;hs_createdate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;operator&lt;/code&gt; specifies &lt;code&gt;GTE&lt;/code&gt; or "greater than or equal." And &lt;code&gt;value&lt;/code&gt; is the timestamp we're comparing to. We start with &lt;code&gt;0&lt;/code&gt;, then use the latest &lt;code&gt;createdate&lt;/code&gt; in the response to compute our next &lt;code&gt;cursor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;properties&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By default, the Search API returns only a few properties on the objects it returns. You'll need to specify which properties you want back in your response. There doesn't appear to be a limit on the size of this parameter, so you can just set &lt;code&gt;properties&lt;/code&gt; to a list of all properties on the object.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;sorts&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For backfills, we use a &lt;code&gt;sorts&lt;/code&gt; parameter like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "propertyName": "createdate",
  "direction": "ASCENDING"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Again, &lt;code&gt;propertyName&lt;/code&gt; will vary based on the object – many objects instead call this property &lt;code&gt;hs_createdate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Together, these three parameters in our Search API request ensure that we're able to paginate the full table of each HubSpot object without missing any records.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This strategy alone suffers from one fatal flaw: &lt;strong&gt;you can get stuck&lt;/strong&gt;. If you get back a page of HubSpot records that all have the same &lt;code&gt;createdate&lt;/code&gt;, you can't move forward with a &lt;code&gt;GTE&lt;/code&gt; operator – you have nothing to increment your cursor with! So, we do some extra filtering and sorting that incorporates the ID of records as well to "get through" the deadlock.&lt;/p&gt;

&lt;p&gt;This is usually only a problem on the busiest HubSpot instances. If absolute completeness isn't a requirement for you, you can instead use a &lt;code&gt;GT&lt;/code&gt; operator: you might miss objects that share a timestamp, but that operator will never get you stuck in the stream.&lt;/p&gt;
&lt;/blockquote&gt;
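&lt;p&gt;One possible shape for that tiebreaker – a sketch, not our exact production filter – takes advantage of the fact that HubSpot ORs &lt;code&gt;filterGroups&lt;/code&gt; together and ANDs the filters inside each group, using the record's &lt;code&gt;hs_object_id&lt;/code&gt; to break ties:&lt;/p&gt;

```python
def tiebreak_filter_groups(cursor_ts, last_id, created_field="createdate"):
    """Build filterGroups that can advance past a page of identical timestamps."""
    return [
        # strictly newer records...
        {"filters": [
            {"propertyName": created_field, "operator": "GT", "value": cursor_ts},
        ]},
        # ...or records with the same timestamp but a higher id
        {"filters": [
            {"propertyName": created_field, "operator": "EQ", "value": cursor_ts},
            {"propertyName": "hs_object_id", "operator": "GT", "value": last_id},
        ]},
    ]
```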

&lt;h3&gt;
  
  
  Data type issues in the Search API
&lt;/h3&gt;

&lt;p&gt;We've seen some strange values come back from HubSpot's Search API. Once in a while, we'll get back an integer in a &lt;code&gt;date&lt;/code&gt; field or a string in a &lt;code&gt;numeric&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We're still not sure where these come from. We try to recover the original type if we can, but usually our system ends up needing to just throw the values out. They're rare enough that we haven't been able to get to the bottom of what causes them to appear.&lt;/p&gt;

&lt;h2&gt;
  
  
  Syncing changes
&lt;/h2&gt;

&lt;p&gt;Backfilling is a one-off process that we run when a sync is first established or when a re-sync is kicked off (for example, when a customer adds a new column to their synced table).&lt;/p&gt;

&lt;p&gt;Syncing changes is far more critical – and tricky. While our sync &lt;em&gt;could&lt;/em&gt; just use the backfill process over and over again to keep HubSpot and Postgres in-sync, this would be wildly inefficient. Large HubSpot instances can take many hours to backfill.&lt;/p&gt;

&lt;p&gt;Instead, we need to query HubSpot in such a way that we can efficiently see what's changed from one request to the next.&lt;/p&gt;

&lt;p&gt;As noted in the beginning of this post, this is uniquely challenging with HubSpot. Change detection is not standardized across the API. For example, webhooks are only supported on &lt;a href="https://developers.hubspot.com/docs/api/webhooks#webhook-subscriptions"&gt;six objects&lt;/a&gt;. And some changes are not possible to detect at all.&lt;/p&gt;

&lt;p&gt;We use the Search API to detect creates and updates. We use the standard list objects endpoint for deletes, where supported. And we use a combination of endpoints to create a synthetic change stream for associations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Querying the Search API for changes
&lt;/h3&gt;

&lt;p&gt;The Search API is the primary API for finding out what's changed in HubSpot.&lt;/p&gt;

&lt;p&gt;As mentioned in "Backfilling," we use three parameters in our search requests: &lt;code&gt;properties&lt;/code&gt;, &lt;code&gt;filterGroups&lt;/code&gt;, and &lt;code&gt;sorts&lt;/code&gt;. Our pagination strategy for syncing changes is similar to backfilling: use a timestamp to page through the changes. Except we're filtering and sorting on a different property:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"filters": [
    {
      // See note below on this property
      "propertyName": "lastmodifieddate",
      "operator": "GTE",
      "value": {cursor}
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Again, the &lt;code&gt;propertyName&lt;/code&gt; to use here is not standard across objects – for some objects, it's &lt;code&gt;hs_lastmodifieddate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With this method, our system only needs to process records that were recently created or updated. We use &lt;code&gt;GTE&lt;/code&gt; so we don't accidentally skip records in situations where multiple records were created at the same time. (This means we'll always "see" one record twice – once in request &lt;code&gt;n&lt;/code&gt; and again in request &lt;code&gt;n+1&lt;/code&gt; – but the inefficiency is worth the consistency guarantee.)&lt;/p&gt;
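&lt;p&gt;To make this concrete, here's a minimal Python sketch of the cursor-advance logic (the flat record shape and the &lt;code&gt;advance_cursor&lt;/code&gt; helper are our illustration, not part of any HubSpot SDK – real Search API responses nest these fields under &lt;code&gt;properties&lt;/code&gt;):&lt;/p&gt;

```python
def advance_cursor(records, cursor, seen_ids, prop="lastmodifieddate"):
    """Process one page of Search API results, sorted ascending by `prop`.

    Because we filter with GTE, records sitting exactly on the cursor
    reappear in the next request; `seen_ids` deduplicates them.
    Returns (unseen_records, new_cursor, new_seen_ids).
    """
    unseen = [r for r in records if r["id"] not in seen_ids]
    if not records:
        return unseen, cursor, seen_ids
    new_cursor = max(r[prop] for r in records)
    # Only ids at the new boundary can repeat in the next request. Note
    # that if a full page shares one timestamp, the cursor can't advance
    # (the "stuck" caveat for highly active accounts).
    boundary_ids = {r["id"] for r in records if r[prop] == new_cursor}
    return unseen, new_cursor, boundary_ids
```

Each poll feeds the previous call's cursor and boundary ids back in, so the boundary record is fetched twice but processed once.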

&lt;p&gt;One caveat with this method is that, like our backfill, it can get stuck in certain situations with highly active HubSpot accounts. (See "Backfilling.")&lt;/p&gt;

&lt;p&gt;With this strategy, you can detect &lt;strong&gt;creates&lt;/strong&gt; and &lt;strong&gt;updates&lt;/strong&gt; for &lt;strong&gt;objects&lt;/strong&gt; and &lt;strong&gt;custom objects&lt;/strong&gt;. You can't detect deletes. And associations are not supported in the Search API at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Detecting deletes
&lt;/h3&gt;

&lt;p&gt;A common limitation of APIs is that while they let you see changes to records that still exist, there's no such change detection for records that have been deleted: those records are no longer present anywhere for you to "see."&lt;/p&gt;

&lt;p&gt;HubSpot has this limitation as well, but not for all objects. Fortunately, for standard objects, they provide a way to retrieve deletes.&lt;/p&gt;

&lt;p&gt;To detect deletes (or "archived objects"), you need to use the standard list objects endpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /crm/v3/objects/{object}?after={cursor}&amp;amp;archived=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can paginate this endpoint with the &lt;code&gt;after&lt;/code&gt; cursor. We get this in the &lt;code&gt;paging.next.after&lt;/code&gt; property from the previous response. &lt;code&gt;archived=true&lt;/code&gt; is how we tell HubSpot to give us deleted objects.&lt;/p&gt;
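&lt;p&gt;A minimal sketch of that pagination loop, with the HTTP call abstracted behind a hypothetical &lt;code&gt;get_page&lt;/code&gt; callable so the cursor-following logic stands alone:&lt;/p&gt;

```python
def iter_archived(get_page):
    """Yield archived (deleted) records by following `paging.next.after`.

    `get_page(after)` stands in for GET /crm/v3/objects/{object} with
    archived=true and an optional `after` param, returning the decoded
    JSON body.
    """
    after = None
    while True:
        body = get_page(after)
        yield from body.get("results", [])
        after = body.get("paging", {}).get("next", {}).get("after")
        if after is None:
            return
```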

&lt;p&gt;You might ask: If we can use this endpoint for grabbing archived objects, why can't we use this to do regular change detection as well? Can't we just paginate &lt;em&gt;this&lt;/em&gt; endpoint to see which records have been created or updated, as opposed to the Search API?&lt;/p&gt;

&lt;p&gt;Our concern is stability. We can't confirm the sort order of this endpoint, or whether its result set is stable between requests. When an endpoint is dynamic (things are being updated, created, and deleted between requests), a limit/offset pagination strategy makes it easy to accidentally skip records.&lt;/p&gt;

&lt;p&gt;We can't use the Search API for archived objects, though. And archived objects are stable, so this strategy is safe.&lt;/p&gt;

&lt;p&gt;A huge bummer, however, is that &lt;a href="https://community.hubspot.com/t5/APIs-Integrations/Paging-through-deleted-objects-is-not-yet-supported-for-custom/td-p/627505"&gt;custom objects do not support the &lt;code&gt;archived&lt;/code&gt; flag&lt;/a&gt;. And as we'll see, there's no way to retrieve archived associations either – but nothing is standard about associations anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Associations
&lt;/h2&gt;

&lt;p&gt;Associations are a key feature of HubSpot. On top of associating any object to another, you can add labels to your associations to create rich networks of objects.&lt;/p&gt;

&lt;p&gt;However, associations are the most challenging part of HubSpot's API to sync. They are not supported to the same degree as objects in the API. You can't use the search API to find out which associations have been updated. And associations don't have &lt;code&gt;created_at&lt;/code&gt; or &lt;code&gt;updated_at&lt;/code&gt; timestamps on them.&lt;/p&gt;

&lt;p&gt;We're hopeful HubSpot will improve their support for accessing associations. In the meantime, in order to sync them, you're forced to do a full sweep: you have to paginate through every association on a regular interval, inserting new associations and updating existing ones as you go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using the list objects API to fetch associations
&lt;/h3&gt;

&lt;p&gt;There are two endpoints you can use to access associations.&lt;/p&gt;

&lt;p&gt;The first option is to make calls to the standard list objects API. You can include the &lt;code&gt;associations&lt;/code&gt; parameter, which will have HubSpot return the associations you list for each object returned:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /crm/v3/objects/companies?associations=CONTACT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The second option is to use the batch associations API. This endpoint will list up to 1,000 associations per object per call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /crm/v4/objects/{objectType}/{objectId}/associations/{toObjectType}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We ended up using both endpoints in combination for our sync.&lt;/p&gt;

&lt;p&gt;We use the standard list objects API to fetch a list of 100 objects and the associations for each of those objects. There's a catch, though: HubSpot only returns 100 associations per object. That means if you have an object with a lot of associations, you have to further paginate that object. If that object has thousands of associations, that could mean tens of API requests just to retrieve all the associations for that one object!&lt;/p&gt;

&lt;p&gt;So we then follow up that request with zero or more batch association requests. For each object in our first request that hit the 100-association cap, we make a batch association request to grab the rest. That lets us see 1,000 associations for that object at a time versus just 100.&lt;/p&gt;

&lt;p&gt;As long as the count of associations per object follows a power law (only ~10% of objects have thousands of associations), this works reasonably well.&lt;/p&gt;
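&lt;p&gt;Sketched in Python, the combined strategy looks like this (&lt;code&gt;fetch_batch&lt;/code&gt;, standing in for the v4 batch associations endpoint, is a hypothetical helper):&lt;/p&gt;

```python
def collect_associations(objects_page, fetch_batch):
    """Merge the two endpoints' results into {object_id: associations}.

    The list-objects call already returned up to 100 associations per
    object; for objects that hit that cap (and so may be truncated), we
    follow up with the batch endpoint, which returns up to 1,000 per call.
    """
    associations = {}
    for obj in objects_page:
        assocs = obj.get("associations", [])
        if len(assocs) == 100:  # at the cap: assume truncation
            assocs = fetch_batch(obj["id"])
        associations[obj["id"]] = assocs
    return associations
```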

&lt;h3&gt;
  
  
  Mitigating inefficiency
&lt;/h3&gt;

&lt;p&gt;The full sweep strategy is very inefficient. Sync time increases linearly with the count of associations. And any given request contains mostly associations we've seen before, meaning we're spending a lot of network and CPU retrieving and parsing JSON that contains no new information.&lt;/p&gt;

&lt;p&gt;We want to contain this inefficiency as much as possible. The impact would be even worse if we blindly sent associations through the rest of our pipeline. If the database also had to process every association record we saw, we'd be paying the cost in multiple areas.&lt;/p&gt;

&lt;p&gt;So to mitigate and isolate the inefficiency, we use an in-memory cache. The cache is essentially a set of md5 fingerprints, one per association, that tells us whether we've seen a given association before.&lt;/p&gt;
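&lt;p&gt;A minimal sketch of such a cache (the exact fingerprinting scheme here is our illustration – any stable serialization of the association works):&lt;/p&gt;

```python
import hashlib

class AssociationCache:
    """In-memory md5 fingerprints of associations we've already emitted.

    `add` returns True only the first time an association is seen, so
    unchanged associations can be dropped before they reach the rest of
    the pipeline.
    """

    def __init__(self):
        self.fingerprints = set()

    @staticmethod
    def fingerprint(assoc):
        # Serialize fields in a stable order before hashing
        raw = "|".join(str(assoc[key]) for key in sorted(assoc))
        return hashlib.md5(raw.encode()).hexdigest()

    def add(self, assoc):
        fp = self.fingerprint(assoc)
        if fp in self.fingerprints:
            return False
        self.fingerprints.add(fp)
        return True
```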

&lt;h3&gt;
  
  
  Deletes
&lt;/h3&gt;

&lt;p&gt;Our in-memory cache is helpful for another purpose: association deletes.&lt;/p&gt;

&lt;p&gt;HubSpot provides no way to detect that an association has been deleted. So, we use a "mark-and-sweep" strategy in our sync.&lt;/p&gt;

&lt;p&gt;As we're paging through associations, we keep a record of all the associations that we've seen in this sync cycle. When we reach the end, we know which associations we should keep – and which we should flush. We can then issue a &lt;code&gt;delete&lt;/code&gt; command to the database to drop the associations we saw in the last sync cycle but not in this one.&lt;/p&gt;
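&lt;p&gt;The sweep itself reduces to a set difference between cycles:&lt;/p&gt;

```python
def sweep(previous_seen, current_seen):
    """Mark-and-sweep for association deletes.

    Anything present in the last full sweep but absent from this one was
    deleted in HubSpot, so it should be deleted from the database too.
    Returns (to_delete, seen_set_to_carry_into_the_next_cycle).
    """
    return previous_seen - current_seen, current_seen
```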

&lt;h2&gt;
  
  
  Eventual consistency
&lt;/h2&gt;

&lt;p&gt;While the Search API is powerful and does most of what we want for syncing objects and custom objects, it has one major flaw: it's eventually consistent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to__GHOST_URL__/finding-and-fixing-eventual-consistency-with-stripe-events/"&gt;As we've written about&lt;/a&gt;, eventual consistency can be the bane of any integration. When an API is eventually consistent, you can miss creates, updates, and deletes without knowing it. Eventual consistency issues are hard to debug and hard to mitigate.&lt;/p&gt;

&lt;p&gt;In HubSpot's Search API, objects can arrive out-of-order. To illustrate, I'll use a human-readable form of HubSpot's &lt;code&gt;filterGroups&lt;/code&gt; syntax, along with human-readable timestamps without a date.&lt;/p&gt;

&lt;p&gt;Let's say you make a request to the search endpoint at exactly &lt;code&gt;12:00:10&lt;/code&gt; with a cursor from just 10 seconds before, &lt;code&gt;12:00:00&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"lastmodifieddate" ≥ 12:00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This should return all records in the system that have been updated since &lt;code&gt;12:00:00&lt;/code&gt;. But we've observed that HubSpot can return a response like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {
    "id": "some-id-1",
    "lastmodifieddate": "12:00:07"
    // ...
  },
  {
    "id": "some-id-2",
    "lastmodifieddate": "12:00:08",
    // ...
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Seeing this response, we assume that just two records have been updated since &lt;code&gt;12:00:00&lt;/code&gt;. For our next cursor, we can use &lt;code&gt;12:00:08&lt;/code&gt;. We grab that cursor, and continue on our way.&lt;/p&gt;

&lt;p&gt;The problem: there was actually &lt;em&gt;another record&lt;/em&gt; that was created/updated between &lt;code&gt;12:00:00&lt;/code&gt; and &lt;code&gt;12:00:08&lt;/code&gt;. It's just not being returned by the Search API yet.&lt;/p&gt;

&lt;p&gt;We can confirm this a minute later by repeating our original request, again asking for records created/updated after &lt;code&gt;12:00:00&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// one minute later, run this again
"lastmodifieddate" ≥ 12:00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then we'll see a record appear this time, with a &lt;code&gt;lastmodifieddate&lt;/code&gt; prior to &lt;code&gt;12:00:08&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  // New record appears!
  {
    "id": "some-id-0",
    "lastmodifieddate": "12:00:06"
    // ...
  },
  {
    "id": "some-id-1",
    "lastmodifieddate": "12:00:07"
    // ...
  },
  {
    "id": "some-id-2",
    "lastmodifieddate": "12:00:08",
    // ...
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because when we initially made this request we updated our cursor to &lt;code&gt;12:00:08&lt;/code&gt; to make our next request, &lt;strong&gt;we will never see this update.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Objects arriving out-of-order to a stream like this is no good. It means we can't increment our cursor with confidence, as we're not sure if a given page that we've retrieved has "settled" yet.&lt;/p&gt;

&lt;p&gt;We've reached out to HubSpot for comment on this and will update this post when we hear back. Our guess is that the eventual consistency emerges because they're streaming changes from their database to a second store for search querying (like Elastic), and doing so in a way that does not guarantee strict ordering of changes.&lt;/p&gt;

&lt;p&gt;We're still determining how long it can take the Search API to reach consistency. The safest option is to avoid cursoring up until the present. If you run your pagination a few minutes behind, you'll significantly reduce the chances you'll miss any changes.&lt;/p&gt;
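&lt;p&gt;One way to sketch that "stay behind the present" rule in Python (the five-minute lag is an assumption on our part, not a documented settling time):&lt;/p&gt;

```python
LAG_SECONDS = 300  # assumed settling window; tune once you've measured drift

def safe_cursor(page_max_ts, now):
    """Cap the next cursor so it never advances to within LAG_SECONDS of
    the present, giving the Search API time to settle before we move
    past a given timestamp. Timestamps are epoch seconds."""
    return min(page_max_ts, now - LAG_SECONDS)
```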

&lt;p&gt;For our sync, we wanted to continue to propagate changes in as close to real-time as possible. But consistency is of utmost importance to us. So we run two sync processes for each object that we sync: one on the "bleeding edge" of changes and another several minutes behind to catch stragglers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This means changes can arrive out-of-order to our customers' databases, but this doesn't seem to be an issue. It's rare that it really matters that this contact was inserted before that one. What's most important is that both contacts make it into your database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Rate limits
&lt;/h2&gt;

&lt;p&gt;One final consideration when architecting a sync on HubSpot: rate limits.&lt;/p&gt;

&lt;p&gt;The syncing strategies we've outlined can consume a ton of rate limit. You'll need to evaluate how this consumption interacts with any other integrations you have set up for your HubSpot account.&lt;/p&gt;

&lt;p&gt;Fortunately, OAuth apps benefit from their own dedicated rate limit. This means that our solution can interact with our customers' HubSpot instances without interfering with any other integrations they may have configured to access their HubSpot API.&lt;/p&gt;

&lt;p&gt;If you encounter rate limit issues with your own syncing process, you can explore options such as the "API add-on," which substantially increases the number of requests available for your HubSpot account. Alternatively, you can configure your app as a private OAuth app, granting it a separate pool of requests to draw from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;HubSpot's API poses some unique challenges for syncing. It's frustrating that their webhook support is so limited. If webhooks were supported comprehensively, we could use that to power the real-time nature of our sync. And then we could use polling the search API as a "sweep" operation to ensure we didn't miss anything. We could run that poll way behind the present to avoid eventual consistency issues.&lt;/p&gt;

&lt;p&gt;Further, everyone would benefit if HubSpot better supported querying associations through their API. Associations are so vital to a CRM like HubSpot – they're the &lt;em&gt;R&lt;/em&gt; in CRM! Improved access to associations would make things far easier for the developer. And would mean syncs like ours would impose far less load on their system. Sending countless gigabytes of unnecessary JSON through both of our systems feels tragically inefficient.&lt;/p&gt;

&lt;p&gt;Finally, the differentiated treatment between objects and custom objects adds needless friction to adoption. Both in little areas, like not returning every object in the &lt;code&gt;/schemas&lt;/code&gt; endpoint. And in big areas, like not being able to detect deletes for custom objects.&lt;/p&gt;

&lt;p&gt;All that said, with a sync in place, having your HubSpot data in your local datastore provides a ton of benefits. You abstract API complexity away from the rest of your app. Remote calls become local ones, reducing cognitive overhead around guarding for failures or availability issues. You don't have to think about rate limits. Reads are lightning fast. And when your data is in a database, it's far more queryable: you can perform complex queries and analysis easily.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>api</category>
    </item>
    <item>
      <title>Finding and fixing eventual consistency with Stripe events</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Thu, 23 Mar 2023 16:40:02 +0000</pubDate>
      <link>https://forem.com/acco/finding-and-fixing-eventual-consistency-with-stripe-events-a6i</link>
      <guid>https://forem.com/acco/finding-and-fixing-eventual-consistency-with-stripe-events-a6i</guid>
      <description>&lt;p&gt;At &lt;a href="https://sequin.io"&gt;Sequin&lt;/a&gt;, the backbone of our syncing infrastructure is polling. This is because polling provides stronger consistency guarantees than webhooks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.sequin.io/events-not-webhooks/"&gt;As we've written about&lt;/a&gt;, when you use webhooks, you give up some control: webhooks are ephemeral. If your service is down or you mishandle a webhook you receive, you're out of luck. You're also at the whims of the webhook provider. They might drop a webhook altogether, meaning you'll never have a chance to process it.&lt;/p&gt;

&lt;p&gt;Polling is not without its challenges, however. Besides the complexity of maintaining polling infrastructure, the hardest part about polling is &lt;em&gt;cursoring&lt;/em&gt; or paging through a stream of events. When cursoring through API items, you need to traverse the list in such a way that you don't miss any items. (And, ideally, you don't &lt;em&gt;repeat&lt;/em&gt; items often either, as that's inefficient.)&lt;/p&gt;

&lt;p&gt;Cursoring is surprisingly hard, as most APIs &lt;a href="https://blog.sequin.io/whats-changed-in-your-api/"&gt;don't make it easy to see what's changed in them&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cursoring becomes extra hard if the API you're querying is &lt;em&gt;eventually consistent&lt;/em&gt;. In an API, eventual consistency means that your result set is not stable – the results you get from a request can change the next time you make the same request. This adds a lot of complexity, as you have to write defensive code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stripe events
&lt;/h2&gt;

&lt;p&gt;Stripe is one of the rare API providers that has thoughtful solutions for change detection. They have a &lt;a href="https://stripe.com/docs/api/events"&gt;dedicated &lt;code&gt;/events&lt;/code&gt; endpoint&lt;/a&gt; where they publish most of the changes that happen in their system. Examples include an event for every time a customer is created, a subscription is updated, or a new payment goes through.&lt;/p&gt;

&lt;p&gt;We've been happy consumers of Stripe's &lt;code&gt;/events&lt;/code&gt; and want to see more endpoints like it across other APIs.&lt;/p&gt;

&lt;p&gt;However, due to the demanding nature of our real-time sync, we poll the &lt;code&gt;/events&lt;/code&gt; endpoint frequently – multiple times per second. This means we're susceptible to even the slightest eventual consistency issues. And, indeed, we found a situation with the &lt;code&gt;/events&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;I'll give some background on the &lt;code&gt;/events&lt;/code&gt; endpoint, discuss the issue we encountered, then tell you how we're mitigating the issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Paginating events
&lt;/h3&gt;

&lt;p&gt;Most Stripe objects have a &lt;code&gt;created&lt;/code&gt; property. This property is a Unix timestamp in seconds.&lt;/p&gt;

&lt;p&gt;As a result, there are many Stripe events that will share the same &lt;code&gt;created&lt;/code&gt; timestamp in a given Stripe account. For example, certain Stripe operations cause many Stripe records to be created at the same time. When a customer signs up for your service and starts a new subscription, Stripe creates a bunch of objects like a &lt;code&gt;customer&lt;/code&gt; and &lt;code&gt;subscription&lt;/code&gt; for that customer.&lt;/p&gt;

&lt;p&gt;Normally, if we were cursoring Stripe's API with a &lt;code&gt;created&lt;/code&gt; timestamp, this could be a problem. For example, consider this simplified HTTP query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET api.stripe.com/v1/events?createdAfter=${cursor}&amp;amp;limit=100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;created &amp;gt; cursor&lt;/code&gt; would be a problem because we could easily skip other events created at the same timestamp. Likewise, this could be a problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET api.stripe.com/v1/events?createdAtOrAfter=${cursor}&amp;amp;limit=100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, using &lt;code&gt;created &amp;gt;= cursor&lt;/code&gt; we'd have the potential of &lt;em&gt;getting stuck&lt;/em&gt; on a page where every event had the same &lt;code&gt;created&lt;/code&gt; timestamp – there would be no way for us to move forward.&lt;/p&gt;

&lt;p&gt;Fortunately, Stripe lets us cursor by the event's ID. We can make a request to get some stream of Stripe events, like this (for brevity, I'll include just the &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;created&lt;/code&gt; properties of each event):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET api.stripe.com/v1/events?ending_before=evt_1MoCivKddDnm8ttlZ19ZW52C

{
  "data": [
         {
              "created": 1679422378,
              "id": "evt_1Mo9g6KddDnm8ttlWtWxdDBt"
              # ...
            }
            {
              "created": 1679422373,
              "id": "evt_1Mo9g1KddDnm8ttlitN7Jl38"
            # ...
            }
            {
              "created": 1679422371,
              "id": "evt_1Mo9fzKddDnm8ttlPDyDsUez"
            # ...
            }
            {
              "created": 1679422292,
              "id": "evt_1Mo9eiKddDnm8ttlip9sgaB5"
            # ...
            }
            {
              "created": 1679422292,
              "id": "evt_3MgXZgKddDnm8ttl09DxjuvM"
            # ...
            }
        ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The list of events is returned sorted by &lt;code&gt;created&lt;/code&gt; descending. So, the most recent event in the list is on top. Assuming we're paginating through the stream from oldest → newest, to continue pagination, we'd pluck the event ID at the top (&lt;code&gt;evt_1Mo9g6KddDnm8ttlWtWxdDBt&lt;/code&gt;) and send that along as our &lt;code&gt;ending_before&lt;/code&gt; to continue forward.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One odd thing to note is that the event IDs themselves are &lt;em&gt;not&lt;/em&gt; strictly ordered. Note that the last event in the list begins with &lt;code&gt;evt_3&lt;/code&gt; which is "greater than" the event above it, &lt;code&gt;evt_1&lt;/code&gt;. We'll discuss this more in a bit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Missing events
&lt;/h3&gt;

&lt;p&gt;We had some customers report missing items in their sync. This kicked off an investigation. We logged every request and response to and from Stripe. We then ran audits comparing the state of our synced database over time to the state of Stripe's API.&lt;/p&gt;

&lt;p&gt;When our audits caught a missing item in our database – say, a missing Stripe &lt;code&gt;subscription&lt;/code&gt; – we had the full trail of evidence to determine how we got there.&lt;/p&gt;

&lt;p&gt;Our investigation revealed: the &lt;code&gt;/events&lt;/code&gt; endpoint is eventually consistent!&lt;/p&gt;

&lt;h2&gt;
  
  
  Eventually consistent &lt;code&gt;/events&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the behavior we observed: We make a request to Stripe with some event ID, say &lt;code&gt;evt_0&lt;/code&gt;. We get back a list of 3 events. For brevity, I'll just include the &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;created&lt;/code&gt; properties of each event. To make the &lt;code&gt;created&lt;/code&gt; timestamps easier to read, I've formatted them into human-readable strings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {
      "created": "12:07:00",
      "id": "evt_3"
      # ...
    },
    {
      "created": "12:05:00",
      "id": "evt_2"
    # ...
    },
    {
      "created": "12:00:00",
      "id": "evt_1"
    # ...
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given this response, our next cursor becomes &lt;code&gt;evt_3&lt;/code&gt;. So, we make that request and get back the following events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {
      "created": "12:07:01",
      "id": "evt_7"
      # ...
    },
    {
      "created": "12:07:01",
      "id": "evt_6"
    # ...
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Problem is, at the &lt;code&gt;12:07:00&lt;/code&gt; timestamp, &lt;code&gt;evt_3&lt;/code&gt; wasn't the only event that occurred. There are two other events, &lt;code&gt;evt_4&lt;/code&gt; and &lt;code&gt;evt_5&lt;/code&gt; which were not present in the first response. For some reason, when we used &lt;code&gt;evt_3&lt;/code&gt; to get our second response, the stream started at &lt;code&gt;evt_6&lt;/code&gt; – which occurred at &lt;code&gt;12:07:01&lt;/code&gt;, the second after the batch of events took place.&lt;/p&gt;

&lt;p&gt;We can see this play out in our historic request/response logs. &lt;strong&gt;Yet, when we replay the request later with &lt;code&gt;evt_3&lt;/code&gt;, we &lt;em&gt;do&lt;/em&gt; get back &lt;code&gt;evt_4&lt;/code&gt; and &lt;code&gt;evt_5&lt;/code&gt; in the response!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This suggests there's something eventually consistent about Stripe's &lt;code&gt;/events&lt;/code&gt; API. If we paginate through the endpoint using Event IDs, we're subject to skipping events. And because we query Stripe's &lt;code&gt;/events&lt;/code&gt; endpoint multiple times per second, we're especially vulnerable to this issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is this happening?
&lt;/h3&gt;

&lt;p&gt;We're not sure why this is happening. We've confirmed it can happen when events are created in the same second, but haven't ruled out it happening in other situations.&lt;/p&gt;

&lt;p&gt;One theory we have: some Stripe event IDs are prefixed with &lt;code&gt;evt_3xxx&lt;/code&gt; and others with &lt;code&gt;evt_1xxx&lt;/code&gt;. These IDs seem to correspond to which object the event is enveloping. For example, events for &lt;code&gt;payment_intent&lt;/code&gt; and &lt;code&gt;charge&lt;/code&gt; always have an event ID beginning with &lt;code&gt;evt_3xxx&lt;/code&gt;. It's possible these objects are generated in a separate system with its own ID generator, which could explain events reaching the &lt;code&gt;/events&lt;/code&gt; endpoint out-of-order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;To mitigate this issue, we're changing our cursoring logic. After receiving a response, to determine our cursor for the subsequent request, we follow a simple algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If the &lt;code&gt;created&lt;/code&gt; value on the latest event is more than 5 seconds in the past, update to use that cursor.&lt;/li&gt;
&lt;li&gt;Otherwise, &lt;em&gt;do not update the cursor&lt;/em&gt;. Instead, use the same cursor we just used in our next request.&lt;/li&gt;
&lt;/ol&gt;
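&lt;p&gt;Sketched in Python, following the description above (the page is sorted newest-first, and the five-second threshold matches step 1):&lt;/p&gt;

```python
def next_cursor(events, current_cursor, now, hold_seconds=5):
    """Pick the cursor for the next /events request.

    `events` is one page sorted newest-first, each with an epoch-seconds
    `created`. If the newest event is less than `hold_seconds` old, keep
    the current cursor and re-read the page next time; otherwise advance
    to the newest event's id.
    """
    if not events:
        return current_cursor
    newest = events[0]
    if now - newest["created"] > hold_seconds:
        return newest["id"]
    return current_cursor
```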

&lt;p&gt;This means we'll "see" the same events over several requests. And for very busy Stripe &lt;code&gt;/events&lt;/code&gt; endpoints, it could mean we add a few seconds of latency, as we might always be running just a tad behind the present. But the improved consistency guarantee is worth it.&lt;/p&gt;

&lt;p&gt;Without knowing the root cause, we can't be sure how much mitigation we'll need to resolve this issue. We'll update this post after we've run this algorithm in production for a bit and had a chance to measure drift.&lt;/p&gt;

&lt;p&gt;In general, finding out what's changed in an API is &lt;a href="https://blog.sequin.io/whats-changed-in-your-api/"&gt;an extremely common&lt;/a&gt; requirement for engineering teams. Eventual consistency makes this task very difficult. When designing your API, consider how you can use strategies that will make your API consistent and predictable.&lt;/p&gt;

</description>
      <category>api</category>
      <category>stripe</category>
    </item>
    <item>
      <title>How we sync Stripe to Postgres</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Fri, 09 Jul 2021 00:12:43 +0000</pubDate>
      <link>https://forem.com/acco/how-we-sync-stripe-to-postgres-c15</link>
      <guid>https://forem.com/acco/how-we-sync-stripe-to-postgres-c15</guid>
      <description>&lt;p&gt;At Sync Inc, we replicate APIs to Postgres databases in real-time. We want to provide our customers with the experience of having direct, row-level access to their data from third-party platforms, like Stripe and Airtable.&lt;/p&gt;

&lt;p&gt;Stripe's API is great, so we knew our Stripe support would have to be very fast and reliable on day one. In order to replace API reads, our sync needs to create a database that is truly &lt;strong&gt;a second source of truth&lt;/strong&gt;. This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;We need to backfill all data&lt;/strong&gt;: On initial sync, we load in all historical data for a given Stripe account into the target Postgres database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support virtual objects&lt;/strong&gt;: Stripe has a few "virtual" objects, like "upcoming invoices." These objects are in constant flux until they are created (e.g., an upcoming invoice becomes an invoice). You have to fetch these one-off, as there's no place to paginate them. They don't even have primary keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We provide a /wait endpoint&lt;/strong&gt;: As you'll see, customers can call the &lt;code&gt;/wait&lt;/code&gt; endpoint on Sync Inc after writing changes to Stripe. This endpoint will return a &lt;code&gt;200&lt;/code&gt; when we've confirmed the database is completely up-to-date. This means they can read from their database after writing to Stripe and know it's consistent.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Two primary sync strategies
&lt;/h3&gt;

&lt;p&gt;Our Stripe sync orbits around Stripe's &lt;code&gt;events&lt;/code&gt; endpoint. This endpoint serves the same purpose as a replication slot on a database. It contains a list of all create/update/delete events that have happened for a given account on Stripe.&lt;/p&gt;

&lt;p&gt;Each event contains a full payload of the affected record. We can use this event stream to effectively playback all changes to a Stripe account.&lt;/p&gt;

&lt;p&gt;However, as you might expect, the endpoint does not contain an unbounded list of all events in a Stripe account ever. It contains data from the last 30 days.&lt;/p&gt;

&lt;p&gt;So, this means that when our customers start up a new replica Postgres database, we need to &lt;em&gt;backfill&lt;/em&gt; it with historical Stripe data first. Backfilling just means paginating each and every endpoint back to the beginning of the Stripe account.&lt;/p&gt;

&lt;p&gt;What we ended up with was two distinct sync processes: Our &lt;em&gt;backfill&lt;/em&gt; process and our &lt;em&gt;events polling&lt;/em&gt; process.&lt;/p&gt;

&lt;p&gt;We run the backfill process first to build up the initial database. Then we run the events polling process continuously over the lifetime of the database to keep it in sync.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sync process: The backfill
&lt;/h3&gt;

&lt;p&gt;During the backfill, we need to paginate through the full history of each endpoint on Stripe.&lt;/p&gt;

&lt;p&gt;Given the breadth of the Stripe API, the backfill poses a few challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We need to make requests to dozens of API endpoints.&lt;/li&gt;
&lt;li&gt;Then, for each endpoint, we have to convert the JSON response into a structure that's ready to insert into Postgres.&lt;/li&gt;
&lt;li&gt;Further, each response can contain several layers of nested children. Those children can be &lt;em&gt;lists&lt;/em&gt; of children which are, in turn, paginateable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was a great excuse to use Elixir's &lt;a href="https://github.com/dashbitco/broadway"&gt;Broadway&lt;/a&gt;. A Broadway pipeline consists of one &lt;em&gt;producer&lt;/em&gt; and one or more &lt;em&gt;workers&lt;/em&gt;. The producer is in charge of producing jobs. The workers consume and work those jobs, each working in parallel. Broadway gives us a few things out of the box:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A worker queue with back-pressure.&lt;/li&gt;
&lt;li&gt;We can dynamically scale the number of workers in the pipeline based on the amount of work to do.&lt;/li&gt;
&lt;li&gt;We can easily rate limit the volume of work we process per unit of time. We tuned this to stay well below Stripe's API quota limit.&lt;/li&gt;
&lt;li&gt;A "message" construct with acknowledge/fail behaviors. This made things like retry logic trivial.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our case, the queue of work the producer maintains is a list of &lt;em&gt;pages&lt;/em&gt; to be processed. A page is the combination of an endpoint and the current cursor for that endpoint. Here's a small example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;queue = [
  {"/v1/customers", "cur9sjkxi1x"},
  {"/v1/invoices", "cur0pskoxiq1"},
  # ...
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To configure throughput, we just instantiate Broadway with a few parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;options = [
  producer: [
    module: BackfillProducer,
    rate_limiting: [
      allowed_messages: 50,
      interval: :timer.seconds(1)
    ]
  ],
  processors: [
    default: [
      concurrency: 50,
      max_demand: 1
    ]
  ]
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;rate_limiting&lt;/code&gt; setting is all we need to ensure we process no more than 50 pages per second. This leaves a comfy 50 requests per second left over in a customer's Stripe quota.&lt;/p&gt;

&lt;p&gt;Under &lt;code&gt;processors&lt;/code&gt;, we specify that we want up to 50 concurrent workers and that each may request one unit of work (in our case, a page) at a time.&lt;/p&gt;

&lt;p&gt;So, to kick off the sync, the backfill producer's queue is seeded with all Stripe endpoints (and &lt;code&gt;nil&lt;/code&gt; cursors). Our workers check out a page and fetch it. Each page contains up to 100 objects, and each of those objects can contain a list of paginatable children. As such, the worker's first job is to populate all objects in the page completely.&lt;/p&gt;
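&lt;p&gt;To illustrate, here's a minimal Python sketch of that "filling out" step. This isn't our actual implementation (which is Elixir); the field names mirror Stripe's list-object shape (&lt;code&gt;object&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;has_more&lt;/code&gt;, &lt;code&gt;url&lt;/code&gt;), and &lt;code&gt;fetch_page&lt;/code&gt; is a hypothetical HTTP helper:&lt;/p&gt;

```python
# Sketch: "fill out" one page of Stripe objects by paginating any nested list
# children to completion before the object is handed off for parsing.
# fetch_page is a hypothetical HTTP helper, not a real client method.

def fill_out(obj, fetch_page):
    for key, value in obj.items():
        if isinstance(value, dict) and value.get("object") == "list":
            items = list(value.get("data", []))
            # Stripe paginates with a cursor: the id of the last item seen.
            cursor = items[-1]["id"] if value.get("has_more") and items else None
            while cursor is not None:
                page = fetch_page(value["url"], starting_after=cursor)
                items.extend(page["data"])
                cursor = page["data"][-1]["id"] if page.get("has_more") else None
            # Recurse: children can themselves contain paginatable lists.
            obj[key] = [fill_out(item, fetch_page) for item in items]
    return obj
```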

&lt;p&gt;Once we have a list of "filled out" objects, we parse and insert them. We use a large JSON object which maps object types and their fields to tables and columns in Postgres. We benefit greatly from the fact that every Stripe object contains an &lt;code&gt;object&lt;/code&gt; field which identifies what the entity is.&lt;/p&gt;
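&lt;p&gt;A toy sketch of that mapping idea in Python; the table and column names here are hypothetical, not our real mapping:&lt;/p&gt;

```python
# Hypothetical mapping from Stripe's "object" discriminator to a target table
# and column list; the real mapping covers dozens of object types.
OBJECT_TO_TABLE = {
    "customer": ("customers", ["id", "email", "name"]),
    "invoice": ("invoices", ["id", "customer", "status"]),
}

def to_row(stripe_obj):
    # Every Stripe payload carries an "object" field telling us what it is.
    table, columns = OBJECT_TO_TABLE[stripe_obj["object"]]
    return table, {col: stripe_obj.get(col) for col in columns}
```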

&lt;h3&gt;
  
  
  Sync process: new events
&lt;/h3&gt;

&lt;p&gt;After the backfill completes, it's time to switch to processing events for the indefinite lifetime of the sync. But we need a smooth hand-off between the two, otherwise we risk missing a change.&lt;/p&gt;

&lt;p&gt;To facilitate the hand-off, before the backfill begins we make a request to &lt;code&gt;/events&lt;/code&gt; to grab the most recent cursor. After the backfill completes, we first catch up on all events that occurred while we were backfilling. Once those are processed, the database is up-to-date, and it's time to poll &lt;code&gt;/events&lt;/code&gt; indefinitely.&lt;/p&gt;
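&lt;p&gt;In sketch form, the hand-off looks roughly like this (all three callbacks are illustrative stand-ins, not our actual API):&lt;/p&gt;

```python
# Sketch of the backfill-to-events hand-off. The cursor is captured *before*
# the backfill starts, so nothing that changes mid-backfill can be missed.
def initial_sync(latest_event_id, run_backfill, process_events_since):
    cursor = latest_event_id()        # newest event before we start
    run_backfill()                    # may take a long time
    # Catch up on events that occurred during the backfill; the returned
    # cursor is where the indefinite /events polling loop picks up.
    return process_events_since(cursor)
```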

&lt;p&gt;We poll the &lt;code&gt;/events&lt;/code&gt; endpoint every 500ms, continuously checking whether there's anything new to process. This is how we can promise "sub-second" lag.&lt;/p&gt;

&lt;p&gt;We log sync completions to a Postgres table. We use the "polymorphic embed" pattern, where each log entry contains a JSON &lt;code&gt;payload&lt;/code&gt; that can take one of several shapes. For example, our "Stripe backfill complete" log looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "kind": "stripe_backfill_complete",
  "row_count": 1830,
  "last_event_before_backfill": "evt_1J286oDXGuvRIWUJKfUqKpsJ"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we boot a Stripe sync process, it checks the sync logs table for the most recent completed sync for this database. Our sync manager then knows what kind of sync process we need to boot and the initial state of that process.&lt;/p&gt;
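&lt;p&gt;Conceptually, that boot decision looks something like this sketch (the log payload follows the example above; everything else is illustrative):&lt;/p&gt;

```python
# Sketch: pick the next sync step from the most recent sync-log entry.
def next_sync_step(latest_log):
    if latest_log is None:
        return ("backfill", None)  # fresh database: start from scratch
    if latest_log["kind"] == "stripe_backfill_complete":
        # Backfill done: start polling from the pre-backfill event cursor.
        return ("poll_events", latest_log["last_event_before_backfill"])
    raise ValueError("unknown sync log kind")
```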

&lt;h3&gt;
  
  
  What about webhooks?
&lt;/h3&gt;

&lt;p&gt;When one hears about a "real-time" or "evented" API integration, the first API primitive that leaps to mind is "webhooks."&lt;/p&gt;

&lt;p&gt;But webhooks come with a few challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You can't go down&lt;/strong&gt;: Senders typically retry undelivered webhooks with some exponential back-off. But the guarantees are often loose or unclear. And the last thing your system probably needs after recovering from a disaster is a deluge of backed-up webhooks to handle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're counting on sender deliverability&lt;/strong&gt;: When polling, the only real barrier between you and the latest data is a possible caching layer. With webhooks, senders will typically have some sort of queue or "outbox" that their workers work through. Queues like this are subject to back-pressure. This opens you up to your sync slowing down if your sender's queue backs up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A redundant system&lt;/strong&gt;: Webhooks are not something we can rely on exclusively for a syncing operation like this, so they'll always need to be complemented by a polling system. We have to poll to backfill the database after it initializes. And we may have to poll after recovering from downtime or after fixing a bug in our webhook handling logic.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In general, I have this suspicion that a system that relies purely on webhooks to stay in sync is bound to fail. All it takes is for one webhook to get dropped, on either the receiving end or sending end. With no other backup mechanisms in place, you risk a record in your database being out of sync indefinitely.&lt;/p&gt;

&lt;p&gt;Luckily, it turns out that with an &lt;code&gt;/events&lt;/code&gt; endpoint to poll, webhooks are not necessary. The trick is to just poll it frequently enough to get as close to real-time as possible! What's great is that you can use the same sync system to get a change made milliseconds ago or to catch up on all changes that happened during unexpected downtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "wait" endpoint
&lt;/h3&gt;

&lt;p&gt;Our databases are read-only. Customers make writes to a platform's API, so that those writes can go through a platform's validation stack. Then, those changes flow down to their database. This is the &lt;a href="https://acco.io/read-from-dbs"&gt;&lt;em&gt;one-way data flow&lt;/em&gt; we advocate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To make our Stripe database a true second source of truth, we need a final pillar. We need to enable "read after writes," or the guarantee that if you make a write to Stripe's API, that write will be reflected in your database in a subsequent read. While our Stripe sync is &lt;em&gt;fast&lt;/em&gt;, the architecture at present leaves open a race condition: You can make a write to Stripe then query your database before the change has propagated.&lt;/p&gt;

&lt;p&gt;The simplest way to overcome this race condition is to sleep for one second before any subsequent reads. This should work almost all the time. But we wanted to provide something even more robust.&lt;/p&gt;

&lt;p&gt;Customers can instead call our "wait" endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET &amp;lt;https://api.syncinc.so/api/stripe/wait/:id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This endpoint holds the request open until we've confirmed that your Stripe database is up-to-date. When it is, the request returns a &lt;code&gt;200&lt;/code&gt;. You can now make a subsequent read to your database with confidence.&lt;/p&gt;
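&lt;p&gt;One plausible way to implement such an endpoint (a sketch under assumed names, not our actual code) is to capture the newest event id at request time and hold the request open until the sync has processed past it:&lt;/p&gt;

```python
# Sketch: block until the sync's processed events include the newest event
# that existed when the request arrived. Both callbacks are illustrative.
import time

def wait_until_synced(latest_event_id, processed_event_ids,
                      timeout_s=10.0, poll_s=0.05):
    target = latest_event_id()            # newest event as of the request
    for _ in range(int(timeout_s / poll_s)):
        if target in processed_event_ids():
            return 200                    # replica caught up: safe to read
        time.sleep(poll_s)
    return 504                            # gave up waiting
```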

&lt;h3&gt;
  
  
  Coming up
&lt;/h3&gt;

&lt;p&gt;With backfills, support for virtual objects like "upcoming invoices," and sub-second sync times, we provide a true Postgres replica with all your Stripe data.&lt;/p&gt;

&lt;p&gt;We still have a lot of work to do to make the developer experience around this database great. There are 82 tables to wrap one's head around (!!) and robust ORM support is a must for many developers. Now that the foundations of our sync are in place, stay tuned for updates to the overall experience.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sync</category>
      <category>database</category>
      <category>api</category>
    </item>
    <item>
      <title>Our Airtable sync process, layer by layer</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Thu, 01 Jul 2021 17:22:51 +0000</pubDate>
      <link>https://forem.com/acco/our-airtable-sync-process-layer-by-layer-58l9</link>
      <guid>https://forem.com/acco/our-airtable-sync-process-layer-by-layer-58l9</guid>
      <description>&lt;p&gt;At &lt;a href="https://syncinc.so/"&gt;Sync Inc&lt;/a&gt;, we replicate APIs to Postgres databases in real-time. We want to provide our customers with the experience of having direct, row-level access to their data from third-party platforms, like Stripe and Airtable.&lt;/p&gt;

&lt;p&gt;After the concept of Sync Inc crystallized, we knew we had to get something to market quickly. But it turns out that most APIs are not designed to service real-time replication. The first API we supported – Airtable – is no exception.&lt;/p&gt;

&lt;p&gt;Airtable's API presents some unique challenges, which I'll touch on first. Then I'll describe how we built a minimum-viable sync process that was fast and reliable enough to get us our first customers. From there, you'll see how we iteratively built and improved our sync process layer-by-layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges
&lt;/h3&gt;

&lt;p&gt;With Airtable's API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's no way to figure out what's changed&lt;/li&gt;
&lt;li&gt;The schema can change at any time&lt;/li&gt;
&lt;li&gt;Throughput is low&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Breaking these challenges down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No way to figure out what's changed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's no way around it: we have to sweep every table, all the time.&lt;/p&gt;

&lt;p&gt;For Airtable, it's difficult to see what's changed in a base. First, you have to make requests &lt;em&gt;against each individual table&lt;/em&gt;. Given that Airtable's API is limited to 5 requests/second, if you have more than a few tables our dream of sub-second lag time becomes difficult to attain.&lt;/p&gt;

&lt;p&gt;Second, deletes are very common in Airtable. And yet, there's no way to easily tell from the API what's been deleted – you have to check every row.&lt;/p&gt;

&lt;p&gt;The last curveball is Airtable's inconsistent treatment of the &lt;code&gt;last_updated&lt;/code&gt; field for records. For example, changes to computed fields do not affect this timestamp. So, we can't poll for changes to them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The schema can change at any time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Airtable has a flexible schema. Users can add, drop, rename, and change the type of columns at any time. This means the shape of the data we receive from the API is constantly changing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput is low&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Airtable's responses will contain a maximum of 100 records. And we can only make 5 requests per second. This, of course, means we can only process a maximum of 500 records per second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteratively building up our sync
&lt;/h3&gt;

&lt;p&gt;We launched with the minimum viable version of our sync process. Once we had something that worked, we began to layer on sync optimizations that made the sync faster and more efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lowest-common denominator: the rebuild sync&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the Airtable schema can change at any time and we have to sweep the full Airtable API on each sync run, the lowest-common denominator was what we call "the rebuild sync." In the rebuild sync, we perform every single operation necessary to instantiate a Postgres database and bring it to parity with the data in Airtable.&lt;/p&gt;

&lt;p&gt;To support this process, our database is split into two schemas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;public&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;public_swap&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our customers read from the &lt;code&gt;public&lt;/code&gt; schema. The &lt;code&gt;public_swap&lt;/code&gt; or "swap schema" is where the sync takes place during a rebuild.&lt;/p&gt;

&lt;p&gt;At the top of each sync cycle, we initialize all the tables in &lt;code&gt;public_swap&lt;/code&gt; as defined by the Airtable base's current schema. We then pull all the records for each table from Airtable and insert them into the tables in the swap schema.&lt;/p&gt;

&lt;p&gt;At the end of the sync cycle, we open up a database transaction. Inside that transaction, we drop every table in &lt;code&gt;public&lt;/code&gt; and "promote" every table in &lt;code&gt;public_swap&lt;/code&gt; to &lt;code&gt;public&lt;/code&gt;.&lt;/p&gt;
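&lt;p&gt;A sketch of what that promotion transaction could look like (table names illustrative; Postgres's &lt;code&gt;alter table ... set schema&lt;/code&gt; does the "promote"):&lt;/p&gt;

```python
# Sketch: generate the SQL for one atomic promote of the swap schema.
def promote_sql(tables):
    stmts = ["begin;"]
    for t in tables:
        stmts.append(f"drop table if exists public.{t};")
        stmts.append(f"alter table public_swap.{t} set schema public;")
    stmts.append("commit;")
    return "\n".join(stmts)
```

Because the drop and the schema move happen in one transaction, readers see either the old tables or the new ones, never a half-promoted state.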

&lt;p&gt;From our customers' point of view, it looks like their &lt;code&gt;public&lt;/code&gt; schema suddenly receives all updates from Airtable, all at once, at discrete intervals.&lt;/p&gt;

&lt;p&gt;We decided it made sense to start with doing this rebuild every sync cycle for a couple of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can use the same process for syncing for the first time or the millionth time.&lt;/li&gt;
&lt;li&gt;Using migrations would be tricky. When the schema changes, we'd have to map out all the paths that were taken to get from schema A → schema B. We avoid this entirely by just rebuilding all the tables from scratch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sync to &lt;code&gt;public&lt;/code&gt;, only rebuild on schema changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you might imagine, there were a few immediate problems with running a full rebuild on every sync cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We were rebuilding the tables in the swap schema and "promoting" them to public on each sync run. Promoting is an expensive operation. We had to drop all the tables in &lt;code&gt;public&lt;/code&gt;. This meant we were constantly marking Postgres tuples for deletion, keeping the vacuuming functionality very busy.&lt;/li&gt;
&lt;li&gt;It increased the max lag time of a given row. This is because a row first had to get synced to the swap schema, and then wait for the rest of the rows to get synced to the swap schema &lt;em&gt;before&lt;/em&gt; being promoted to the public schema.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, for our database this meant high write IOPs and high network utilization. For our customers, it meant a suboptimal max lag time.&lt;/p&gt;

&lt;p&gt;The next level of our sync was to sync directly to the &lt;code&gt;public&lt;/code&gt; schema. To pull this off, we needed to incorporate a few changes.&lt;/p&gt;

&lt;p&gt;First, we needed a way to &lt;em&gt;upsert&lt;/em&gt; records from Airtable right to the &lt;code&gt;public&lt;/code&gt; schema. Here's an example of what the upsert statement looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;created_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
      &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;conflict&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;
      &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;
      &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few key components of this upsert statement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;on conflict (id)&lt;/code&gt; is what switches Postgres from performing an insert to performing an update. When updating, the &lt;code&gt;set&lt;/code&gt; clause is used instead. The &lt;code&gt;set&lt;/code&gt; clause here is just a re-iteration of the mapping that the &lt;code&gt;insert&lt;/code&gt; clause makes between columns and their values.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;where ... is distinct from&lt;/code&gt; is a key clause. Without it, the upsert would perform a write operation for every single row in the statement, regardless of whether anything changed. With it, we trade a write-heavy workload for a read-heavy one, which is more efficient for Postgres. Furthermore, for customers using replication slots, this means we'll only generate a WAL entry when something has actually been created or updated.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that we're putting the database in charge of determining what's changed. We're constantly sending rows to Postgres, and Postgres is diffing those rows with what it has stored. If there are changes, then it performs a write.&lt;/p&gt;

&lt;p&gt;This upsert logic lets us write directly to &lt;code&gt;public&lt;/code&gt;. But, we still need to trigger a full rebuild if the base's schema is modified in some way (eg a table is added or a column is changed).&lt;/p&gt;

&lt;p&gt;So, at the beginning of each sync we need to check to see if the Airtable schema has changed since our last sync. We pull the schema, hash it, then compare that hash with the hash of the last build. If the hashes are the same, we sync right to &lt;code&gt;public&lt;/code&gt;. If the hashes are different, we kick off a full rebuild.&lt;/p&gt;
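&lt;p&gt;The check itself is simple. A sketch in Python (we hash a canonical serialization of the schema so key ordering can't produce false rebuilds):&lt;/p&gt;

```python
# Sketch: detect schema changes by comparing hashes of the pulled schema.
import hashlib
import json

def schema_hash(schema):
    # Canonical JSON so key order can't produce spurious "changes".
    blob = json.dumps(schema, sort_keys=True).encode()
    return hashlib.md5(blob).hexdigest()

def needs_rebuild(current_schema, last_hash):
    return schema_hash(current_schema) != last_hash
```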

&lt;p&gt;This sync process was a big step up from the naive first implementation. For our customers, this reduced the lag time of a given row significantly. For Postgres, we traded high write IOPs for high read IOPs, which meant our database could handle greater load. It also effectively eliminated the fraction of time a table is in a write lock, removing all kinds of weird, intermittent performance hiccups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-memory fingerprinting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Layer two of our sync process bought us considerably more room on our Postgres database. But there was one last major piece of low-hanging fruit.&lt;/p&gt;

&lt;p&gt;While more optimal than the constant rebuild, the upsert statement above still places a heavy burden on the Postgres database. Every sync cycle, Postgres is given the responsibility of diffing each batch of rows we pull from Airtable with what it has stored. Read IOPs and network were both still high.&lt;/p&gt;

&lt;p&gt;We knew it would be very beneficial to hoist this diffing work up to our workers.&lt;/p&gt;

&lt;p&gt;Traditionally, this is the moment I'd reach for Redis. But sharing memory between worker instances is precisely where Elixir/OTP shines. We can keep a map in-memory of &lt;code&gt;%{ "record_id" =&amp;gt; "record_hash" }&lt;/code&gt;, where &lt;code&gt;record_hash&lt;/code&gt; is an MD5 hash of all the key/value pairs of a given record. When we request 100 records from Airtable, we can compare their latest hashes with what we have in memory. Then, we only perform an upsert of the records that have new hashes. After Postgres confirms the write succeeded, we write the latest hashes to our in-memory cache.&lt;/p&gt;

&lt;p&gt;The algorithm to employ this in-memory hash is pretty straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hash the incoming records we just pulled from Airtable&lt;/li&gt;
&lt;li&gt;Diff the incoming hashes with the hashes we have in-memory&lt;/li&gt;
&lt;li&gt;Upsert all records that have changed into Postgres&lt;/li&gt;
&lt;li&gt;Update the in-memory hash&lt;/li&gt;
&lt;/ol&gt;
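&lt;p&gt;The steps above can be sketched in a few lines of Python (our real implementation is Elixir; the names here are illustrative):&lt;/p&gt;

```python
# Sketch: fingerprint each record, diff against the in-memory map, upsert
# only what changed, and update the map only after the write succeeds.
import hashlib
import json

def sync_batch(records, hashes, upsert):
    changed = []
    for rec in records:
        fingerprint = hashlib.md5(
            json.dumps(rec, sort_keys=True).encode()
        ).hexdigest()
        if hashes.get(rec["id"]) != fingerprint:
            changed.append((rec, fingerprint))
    if changed:
        upsert([rec for rec, _ in changed])       # step 3
        # Only after Postgres confirms the write do we update the cache
        # (step 4), so a failed upsert is simply retried next cycle.
        for rec, fingerprint in changed:
            hashes[rec["id"]] = fingerprint
    return len(changed)
```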

&lt;p&gt;By keeping the upsert from our second layer intact, we're resilient to failures in this pipeline. If, say, the update step (4) fails, no big deal: those records will just be re-upserted on the next sync, which means a little extra work for Postgres. And if we're unable to upsert to Postgres at all (eg Postgres is down), we don't update our in-memory hashes, so we try the insert again on the next go-around.&lt;/p&gt;

&lt;p&gt;This particular improvement had a massive impact. Our read IOPs and network utilization each dropped by over 90%:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--chjCKrba--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d28zu5rt0yrs1wggfumu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--chjCKrba--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d28zu5rt0yrs1wggfumu.png" alt="graph of db performance improvements"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There was an uptick in load on our workers, as they were now doing all the diffing. But this was more than offset by the reduction in time they spent waiting for Postgres.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support read-after-writes with a proxy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At this point, our sync is efficient and we're able to approach the theoretical replication limit given the limitations of the Airtable API.&lt;/p&gt;

&lt;p&gt;However, we felt ourselves missing an answer for &lt;em&gt;writes&lt;/em&gt;. We believe in a one-way data flow: reads are best made in SQL, but for writes you'll want to make an API request so you can properly handle eg validation errors. But, even if the replication lag is only a few seconds, any code that performs a &lt;em&gt;read-after-write&lt;/em&gt; will not work.&lt;/p&gt;

&lt;p&gt;For example, we have some customers that are powering their React web applications with an Airtable back-end. Let's say they have a workflow to let users update their email address. The user clicks an "Edit profile" button, and is prompted for their new email address. After updating, they're sent back to their profile page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2kVsBWkf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87ktek14ztj5luqjp1vq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2kVsBWkf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87ktek14ztj5luqjp1vq.png" alt="flow of user profile edit flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the problem: when they save their new email address, that triggers an API PATCH request to Airtable on behalf of that user. But there's a race condition: when the user is redirected back to his or her profile page, the React app re-fetches their user record from Postgres. There's no guarantee, though, that the updated user record (with the new email) has propagated via Sync Inc to their Postgres database. So the React app grabs and displays the stale email, confusing the user.&lt;/p&gt;

&lt;p&gt;So we built a proxy that lets users write through to Airtable. Functionally, it's a reverse proxy: you just prepend the proxy's hostname to whatever Airtable API request you want to make, eg:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PATCH https://proxy.syncinc.so/api.airtable.com/v0/appF0qbTS1QiA025N/Users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy attempts to write the change to Airtable first. If the Airtable write succeeds, the change is written to the database &lt;em&gt;and then&lt;/em&gt; the requestor receives a 200. This means that the change that was just written is guaranteed to be present in any subsequent database read.&lt;/p&gt;
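&lt;p&gt;The ordering is the whole trick. As a sketch (handler and callback names illustrative), the proxy does something like:&lt;/p&gt;

```python
# Sketch: the 200 is only returned after both Airtable and the replica
# database have the change, so a subsequent read can't be stale.
def proxy_patch(airtable_patch, db_upsert, record):
    resp = airtable_patch(record)      # forward to Airtable first
    if resp["status"] != 200:
        return resp                    # Airtable rejected it: pass through
    db_upsert(resp["body"])            # then write to the replica...
    return resp                        # ...and only then ack the caller
```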

&lt;h3&gt;
  
  
  Layer by layer
&lt;/h3&gt;

&lt;p&gt;Our wisest move was to build the initial sync to serve the lowest common denominator, even if it was slow and inefficient. As long as we were reliable, delivered on the replication time we estimated in our console, and were resilient to eg schema changes, our customers were happy.&lt;/p&gt;

&lt;p&gt;Reflecting back on our discussions before launch, a lot of our original assumptions about what our sync would need were wrong. Luckily, we time-boxed the initial build and got our MVP sync (the rebuild sync) to market as fast as possible. From there, servicing customers and learning about their needs helped inform the shape of each subsequent layer and its priority. Notice how we slowly layered on performance optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The rebuild sync: High write IOPs, high network, and high vacuuming.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Sync to &lt;code&gt;public&lt;/code&gt;: We first reduced vacuuming by syncing directly to &lt;code&gt;public&lt;/code&gt; and keeping tables between syncs.&lt;/li&gt;
&lt;li&gt;Then we added &lt;code&gt;where ... is distinct from&lt;/code&gt;. This traded high write IOPs for high read IOPs and got rid of our vacuuming problem. We still had high network.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Fingerprinting: By hoisting the diffing logic up to our workers, we solved both read IOPs and network on our database.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>startup</category>
      <category>airtable</category>
      <category>postgres</category>
      <category>syncing</category>
    </item>
    <item>
      <title>What is RDS Proxy? Exploring through benchmarks</title>
      <dc:creator>Anthony Accomazzo</dc:creator>
      <pubDate>Wed, 30 Dec 2020 03:13:27 +0000</pubDate>
      <link>https://forem.com/acco/testing-rds-proxy-with-benchmarks-57f0</link>
      <guid>https://forem.com/acco/testing-rds-proxy-with-benchmarks-57f0</guid>
      <description>&lt;p&gt;I was surprised when a customer wrote in to explain that he noticed - for simple queries - the Airtable API was about as performant as querying his &lt;a href="https://syncinc.so/?utm_source=blog&amp;amp;utm_medium=post&amp;amp;utm_campaign=rds"&gt;Sync Inc&lt;/a&gt;-provisioned Airtable database.&lt;/p&gt;

&lt;p&gt;We'd expect a database query to be at least an order of magnitude faster than an API query. So I had to investigate: What was the explanation for this?&lt;/p&gt;

&lt;p&gt;Airtable's API is pretty consistent and responsive, with a mean and median request time of 350ms:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jeWNTi2a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/airtable-api.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jeWNTi2a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/airtable-api.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is about what you'd expect from a standard list request against a third-party API with 100 items returned per request.&lt;/p&gt;

&lt;p&gt;So how could a database query get up to 350ms? A little digging, and we discovered that the customer was querying his database via a Lambda function. This meant opening a connection to his Sync Inc db before each query. A solid lead.&lt;/p&gt;

&lt;p&gt;I wanted to see how much overhead the connection times were adding. So I wrote some simple benchmark functions in Go.&lt;/p&gt;

&lt;p&gt;The timing for the benchmarking is handled by this simple function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;timeTrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s took %s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can log how long a function execution took in Go using this function in combination with &lt;code&gt;defer&lt;/code&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;benchmarkQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;timeTrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"benchmarkQuery"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;funcUnderTest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, the arguments to &lt;code&gt;timeTrack()&lt;/code&gt; are evaluated immediately, meaning the first argument is set to &lt;code&gt;time.Now()&lt;/code&gt; - the time &lt;code&gt;benchmarkQuery()&lt;/code&gt; was invoked.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;defer&lt;/code&gt; waits to &lt;em&gt;invoke&lt;/em&gt; &lt;code&gt;timeTrack()&lt;/code&gt; until right before the function &lt;code&gt;benchmarkQuery()&lt;/code&gt; returns. Which, in this case, will be after the for loop executing &lt;code&gt;funcUnderTest()&lt;/code&gt; &lt;code&gt;count&lt;/code&gt; times has completed. Neat trick.&lt;/p&gt;
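&lt;p&gt;Here's a minimal, standalone sketch of that evaluation order (my own illustration, not part of the benchmark code): the deferred call's argument is captured the moment &lt;code&gt;defer&lt;/code&gt; executes, but the call itself fires just before the function returns:&lt;/p&gt;

```go
package main

import "fmt"

// deferOrder shows that a deferred call's arguments are evaluated
// immediately, while the call itself runs just before the function
// returns. The named return value lets the deferred func append
// after the rest of the body has finished.
func deferOrder() (seen []int) {
	x := 1
	defer func(captured int) { seen = append(seen, captured) }(x) // captures x == 1 here
	x = 2
	seen = append(seen, x) // records the updated value first
	return
}

func main() {
	fmt.Println(deferOrder()) // [2 1]: the captured 1 arrives last
}
```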

&lt;p&gt;With a timing function in place, I wrote out some benchmarks. We want to benchmark two different kinds of behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A function that opens a connection &lt;em&gt;on every loop&lt;/em&gt; and then makes a request&lt;/li&gt;
&lt;li&gt;A function that opens a connection &lt;em&gt;then&lt;/em&gt; loops and makes requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This should help us understand how much time is spent establishing the connection and how much time is spent executing the query.&lt;/p&gt;

&lt;p&gt;Here's what the function that opens a connection every loop looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;benchmarkOpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;timeTrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"benchmarkOpen"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;openDb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT id from purchase_orders limit(100);"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Got error: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the one that opens a connection then makes a bunch of queries on that connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;benchmarkQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;timeTrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"benchmarkQuery"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;openDb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT id from purchase_orders limit(100);"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Got error: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Note: It's important to run &lt;code&gt;rows.Close()&lt;/code&gt;, otherwise the connection is not immediately released - which means Go's &lt;code&gt;database/sql&lt;/code&gt; will default to just opening a new connection for the next request, defeating this test.)&lt;/p&gt;

&lt;p&gt;Using these two functions, we're prepared to find out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What kind of impact does re-creating a db connection on each run have on the wall time of a one-off execution?&lt;/li&gt;
&lt;li&gt;Does using RDS proxy have an effect on that wall time?&lt;/li&gt;
&lt;li&gt;What are the basic load characteristics of a db.r5.large getting hammered with new connection requests? How does RDS Proxy help it perform?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This inquiry is motivated by the nature of Lambda functions, which the customer uses. Indeed, Amazon recently released RDS Proxy to address precisely this architecture:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Many applications, including those built on modern serverless architectures, can have a large number of open connections to the database server, and may open and close database connections at a high rate, exhausting database memory and compute resources. Amazon RDS Proxy allows applications to pool and share connections established with the database, improving database efficiency and application scalability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  RDS Proxy gotchas
&lt;/h3&gt;

&lt;p&gt;Before we get into the benchmarks, I want to touch on the gotchas that sucked up a lot of my time. (I guess after all these years I haven't learned my lesson: The AWS UX is just not going to help you use their products. You have to read the user guide from end-to-end.)&lt;/p&gt;

&lt;p&gt;Here are the two that tripped me up big time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. RDS Proxy only supports specific database engine versions&lt;/strong&gt; – and the latest RDS Postgres is not one of them.&lt;/p&gt;

&lt;p&gt;The first time I went to set up a new RDS Proxy, I didn't see any available databases in the dropdown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vVURHWPW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/empty-dropdown.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vVURHWPW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/empty-dropdown.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After much trial and error, I ended up spinning up a new Aurora db. At last, I saw &lt;em&gt;something&lt;/em&gt; populate the dropdown.&lt;/p&gt;

&lt;p&gt;It wasn't until much later I learned that Postgres 12 - the engine I use everywhere - just isn't supported yet.&lt;/p&gt;

&lt;p&gt;(What would have been nice: Showing me all my RDS databases in that drop-down, but just &lt;em&gt;greying out the ones that are not supported by Proxy&lt;/em&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. You can only connect to your RDS Proxy from inside &lt;em&gt;the same VPC&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one was a time sink. There are few things I dread more than trying to connect to something on AWS and getting a timeout. There are about a half-dozen layers (security groups, IGWs, subnets) that all have to be lined up &lt;em&gt;just so&lt;/em&gt; to get a connection online. The bummer is that there's no easy way to debug &lt;em&gt;where&lt;/em&gt; a given connection has failed.&lt;/p&gt;

&lt;p&gt;So, the first wrench was when I discovered - after much trial and error - that I couldn't connect to my RDS Proxy from my laptop.&lt;/p&gt;

&lt;p&gt;(How about a small bone, right next to "Proxy endpoint," that lets me know the limitations associated with this endpoint?)&lt;/p&gt;

&lt;h3&gt;
  
  
  Now that we're up and connected
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;RDS vs RDS Proxy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With my RDS Proxy 101 hard-won, we're ready to run some benchmarks.&lt;/p&gt;

&lt;p&gt;I copied the benchmark binary up to an EC2 server co-located with the RDS database and RDS proxy.&lt;/p&gt;

&lt;p&gt;Per above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;benchmarkOpen&lt;/code&gt; opens a connection and then runs a query on each loop (sequential)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;benchmarkQuery&lt;/code&gt; opens a connection first, then each loop is a query on that connection (also sequential)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what I got running this against the RDS database directly, 1,000 iterations each, with a 10s pause in between:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# RDS direct, same datacenter&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;./main
2020/12/25 00:08:14 Starting benchmarks...
2020/12/25 00:08:19 benchmarkOpen took 5.846447324s
2020/12/25 00:08:30 benchmarkQuery took 504.32703ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;benchmarkOpen&lt;/code&gt; - ~5.85ms per loop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;benchmarkQuery&lt;/code&gt; - ~0.500ms per loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference suggests the connection handshake costs about 5.3ms per open (5.85ms - 0.50ms), which isn't bad.&lt;/p&gt;

&lt;p&gt;This benchmark runs too fast to register on any RDS graphs. But if I loop the benchmark to sustain it over a couple minutes, I can see a healthy connection spike for the first test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--apz8Y0o3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/db-direct-graphs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--apz8Y0o3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/db-direct-graphs.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I recognize that in &lt;code&gt;benchmarkOpen&lt;/code&gt;, instead of deferring &lt;code&gt;db.Close()&lt;/code&gt; I could be nicer to my database and close it immediately. But I kind of like the DB slow boil.&lt;/p&gt;
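&lt;p&gt;To make the slow boil concrete, here's a toy model (my own illustration, not the benchmark code) of what deferring &lt;code&gt;Close()&lt;/code&gt; inside a loop does: nothing is released until the enclosing function returns, so every resource is held at once:&lt;/p&gt;

```go
package main

import "fmt"

// resource stands in for a database connection.
type resource struct{ id int }

var openCount int

func acquire(id int) *resource { openCount++; return &resource{id} }
func (r *resource) Close()     { openCount-- }

// deferInLoop defers Close inside the loop, so nothing is released
// until the function returns: all n resources are held simultaneously.
func deferInLoop(n int) (peak int) {
	for i := 0; i < n; i++ {
		r := acquire(i)
		defer r.Close()
		if openCount > peak {
			peak = openCount
		}
	}
	return
}

// closeEagerly releases each resource before acquiring the next,
// so at most one is ever open.
func closeEagerly(n int) (peak int) {
	for i := 0; i < n; i++ {
		r := acquire(i)
		if openCount > peak {
			peak = openCount
		}
		r.Close()
	}
	return
}

func main() {
	fmt.Println(deferInLoop(5))  // 5: every "connection" still open at loop end
	fmt.Println(closeEagerly(5)) // 1: each released before the next opens
}
```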

&lt;p&gt;Let's move on to the RDS Proxy. Remember, RDS Proxy has to be in the same VPC as the entity calling it as well as the database. So all three are co-located:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# RDS Proxy, same datacenter&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;./main
2020/12/25 00:15:21 Starting benchmarks...
2020/12/25 00:15:28 benchmarkOpen took 7.094809555s
2020/12/25 00:15:39 benchmarkQuery took 1.239892867s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Comparing to the first benchmark: the Proxy adds about 0.74ms of overhead to each query (1.24ms - 0.50ms). As we'd expect, connection openings also take a bit longer - about 1.25ms more per loop (7.09ms - 5.85ms).&lt;/p&gt;

&lt;p&gt;And, sure enough, I can't get the graph to budge. It sustains about 100 connections, with no real CPU spike, with the benchmark sustained over several minutes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D2XrPIAg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/rds-with-proxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D2XrPIAg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/rds-with-proxy.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll see soon how all this performance translates to Lambda, but the motivation for RDS Proxy is clear: &lt;strong&gt;It's less about the client, more about the database&lt;/strong&gt;. RDS Proxy is just a layer of indirection for our database connections. It doesn't speed up establishing new connections - in fact, we pay a small tax on connection opening. It just saves our database from the thundering herd.&lt;/p&gt;
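&lt;p&gt;One way to picture that indirection (a toy sketch, not RDS Proxy's actual implementation): a pool sits between many clients and a fixed number of server-side slots, so the database never sees the full herd no matter how many clients pile on:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// simulate fans `clients` goroutines through a pool of `poolSize`
// connection slots and reports the peak number held simultaneously -
// the most connections the "database" ever sees.
func simulate(clients, poolSize int) int {
	pool := make(chan struct{}, poolSize)
	var mu sync.Mutex
	var current, peak int
	var wg sync.WaitGroup
	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			pool <- struct{}{} // borrow a slot; blocks once poolSize are out
			mu.Lock()
			current++
			if current > peak {
				peak = current
			}
			mu.Unlock()
			// ... the query would run here ...
			mu.Lock()
			current--
			mu.Unlock()
			<-pool // hand the slot back
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := simulate(100, 10)
	fmt.Println("100 clients, peak connections at the database:", peak) // never more than 10
}
```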

&lt;p&gt;&lt;strong&gt;Cross-region&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because we're here, I'm curious: What happens if we run this benchmark cross-region, from US-East-1 to US-West-2?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# RDS - different datacenter&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;./main
2020/12/25 00:11:12 Starting benchmarks...
2020/12/25 00:15:53 benchmarkOpen took 4m41.607514892s
2020/12/25 00:17:15 benchmarkQuery took 1m11.767803425s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Different story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;benchmarkOpen&lt;/code&gt; - ~281.6ms per loop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;benchmarkQuery&lt;/code&gt; - ~71.77ms per loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connections take about 200ms, presumably because they involve a lot of back and forth that require coast-to-coast travel.&lt;/p&gt;
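&lt;p&gt;A quick back-of-the-envelope check (the RTT figure is an assumption, not a measurement): a us-east-1 to us-west-2 round trip runs on the order of 70ms, and a fresh Postgres session costs roughly one round trip for the TCP handshake plus a couple more for the startup and authentication exchange:&lt;/p&gt;

```go
package main

import "fmt"

func main() {
	// Assumed numbers for illustration only.
	rttMs := 70.0     // approximate us-east-1 <-> us-west-2 round trip
	roundTrips := 3.0 // TCP handshake + Postgres startup/auth (more with TLS)
	fmt.Printf("estimated connection setup: ~%.0fms\n", rttMs*roundTrips)
	// In the same ballpark as the measured gap (~281.6ms - ~71.8ms per loop).
}
```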

&lt;p&gt;The difference is over a full order of magnitude. At this point, opening a database connection and making a query starts to have performance characteristics similar to querying Airtable's API.&lt;/p&gt;

&lt;p&gt;Indeed, the customer that opened this inquiry operates out of US-East-1. Bingo.&lt;/p&gt;

&lt;p&gt;Opening connections is certainly a factor here. But RDS Proxy won't help, because as we've learned it's not intended to reduce client-side connection opening costs. And even if it did, we couldn't use it cross-region!&lt;/p&gt;

&lt;p&gt;So co-locating the database in the customer's data center is the move, and will give their team performance an API can only dream of.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarking Lambda
&lt;/h3&gt;

&lt;p&gt;Lambda got us into this mess, so let's do an end-to-end benchmark with it. Our direct-to-RDS test will route like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;API Gateway -&amp;gt; Lambda -&amp;gt; RDS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And our RDS Proxy test will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;API Gateway -&amp;gt; Lambda -&amp;gt; RDS Proxy -&amp;gt; RDS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Lambda function call will open a new database connection and issue a single query. I just adapted the function above to perform one open and one query, rather than doing so inside a &lt;code&gt;for&lt;/code&gt; loop.&lt;/p&gt;

&lt;p&gt;I'll run a benchmark locally on my computer that will call the API Gateway endpoint. Because I'm running locally, I'm able to save a little work by using Go's native benchmarking facilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BenchmarkRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://[REDACTED]/run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Received non-200 response. Continuing"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first test routes to RDS directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-benchtime&lt;/span&gt; 1000x
goos: darwin
goarch: amd64
pkg: github.com/acco/rds-proxy-bench
BenchmarkRequest-8          1000     128684974 ns/op
PASS
ok      github.com/acco/rds-proxy-bench 130.351s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, 128ms per request.&lt;/p&gt;

&lt;p&gt;Now using RDS Proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-benchtime&lt;/span&gt; 1000x
goos: darwin
goarch: amd64
pkg: github.com/acco/rds-proxy-bench
BenchmarkRequest-8          1000     118576056 ns/op
PASS
ok      github.com/acco/rds-proxy-bench 120.529s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;118ms per request.&lt;/p&gt;

&lt;p&gt;We're not impressed until we look at the graphs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gXdt7cG9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/proxy-lambda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gXdt7cG9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.syncinc.so/rds-proxy/proxy-lambda.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first big spike is the first benchmark, hitting RDS directly. The second non-spike is the second benchmark, routed through the proxy.&lt;/p&gt;

&lt;p&gt;The end-to-end Lambda benchmark takes it all home: The purpose of RDS Proxy is foremost to tame connections (and related load) on the database. Any happy gains on the &lt;em&gt;client&lt;/em&gt; side will be a result of a relieved database.&lt;/p&gt;

</description>
      <category>database</category>
      <category>serverless</category>
      <category>aws</category>
      <category>lambda</category>
    </item>
  </channel>
</rss>
