Forem: Ian Macartney

Work Stealing: Load-balancing for compute-heavy tasks

Ian Macartney — Wed, 24 Jul 2024 00:17:26 +0000

For fast, light-weight workloads, you can often get away with a small number of powerful machines. Even when load isn’t distributed evenly, a single backlogged machine won’t noticeably impact a user’s experience.

However, when your app requires heavy operations, such as running requests on an LLM, transcoding a video, or intensive cryptography, you need a better strategy for handling concurrency.

Requests that monopolize many CPU or GPU cores require more machines, as each machine is able to handle less parallelism. When you factor in slow requests, a single backlogged machine can cause significant delays and p95 performance degradation, even if the overall system has spare capacity.

So how can you distribute the load across many workers?

tl;dr

In this post I’ll explain the “work stealing”¹ strategy for task distribution and why you should consider it for workloads that:

Take significant time to execute.
Do not share resources well, such as GPU-intensive computation.
Prioritize throughput and utilization over average-case latency.
Run locally, behind a NAT, or are otherwise not discoverable from a web server.

Overview

We will look at two strategies for managing resource-intensive workloads:

Push-based routing: a load balancer decides where to send requests and waits for a response from the worker, which it then returns to the client.
Pull-based “work stealing”: an incoming request is put on one or more queues from which workers pull. They publish results which can be included in the response to the original request, or pulled from the client via a subscription, allowing the original response to return early. Multiple clients can subscribe to the result.

One way to think about this is ordering food at a restaurant.

A push-based approach would assign you to a chef when you walked in the door, as a load balancer forwards a request. You’d wait for all other parties assigned to the chef to be served, hoping there aren’t many time-intensive dishes ahead of you, and wondering if anyone else was lucky enough to be assigned to an idle chef. If you left the restaurant for any reason, you’d be re-assigned when you came back in, losing your place in line.

A pull-based approach is more similar to getting an order number. All guests have their orders taken when they walk in, and chefs work on the next order as they become available. You can walk around with your order number, check in on its status, or even cancel your order if it hasn’t been started. It’s more efficient for the chefs, but it requires writing things down and having a way to notify you when your food is ready, since you might not be standing next to the chef waiting.

As a concrete code example, I recently put out a demo of distributed LLM computing: llama farm, where requests to the website that require llama3 are farmed out to workers. I can run these workers from the command line on a spare laptop, in containers hosted on fly.io, or even from browsers using web-llm. The repo is an implementation of “work stealing,” which enables these llama workers to pull and process jobs at their discretion without exposing a port to http requests or requiring service discovery. Read this post about the implementation, or check out the code.

To learn more about work stealing and how it compares to more traditional load balancing, read on. For the sake of this article, I’ll use the example of processing LLM requests, but the techniques naturally extend to any high-latency or hardware-intensive workload.

Do I need this?

Note: This decision assumes you are controlling your own infrastructure. If you are using a cloud service, such as using Replicate to serve and scale your models, you don’t have to worry about this - you are paying them to make these decisions and scale transparently. Most of the time this is the right way to start.

Some reasons you might benefit from scaling your own infrastructure:

Controlling data: If you are unwilling to send data to a third party LLM, you can run your own machines and know how your data is being handled and used. This is especially important if you have data governance requirements preventing it from leaving a private network.
Controlling costs: Cloud providers allow you to scale more granularly at a higher per-request cost. By deciding when and how many machines you run, you control your scale.
- Note: I say controlling rather than reducing because, until you utilize your machines well, this is unlikely to save you money. In the case of llama farm, however, we avoid paying for dedicated GPUs altogether by leveraging existing idle hardware.
Controlling latency: By controlling the routing and prioritization of requests, you can ensure tighter bounds on latency than you may get from a cloud provider, which is likely sharing resources with other customers and may not expose a mechanism for you to prioritize or cancel requests. Note: you’ll need to decide how to absorb spikes of traffic. Options include:
- Over-provisioning (or auto-scaling) your hardware to accept additional load.
- Shedding load by rejecting requests (often with a 429 status code) and relying on clients to retry later.
- Accepting high latency during these periods, ideally isolated to low-priority traffic.

Push-based routing

Traditionally, the web works via pushing, or sending, requests. A request (usually HTTP) gets routed to a machine based on its IP address. For compute-intensive tasks, a client typically hits an API endpoint, which doesn’t do the CPU-intensive operation itself but rather makes its own request to a pool of dedicated workers. Forwarding the work to other machines isolates the API server’s resources so it is available to serve other requests, while also allowing you to scale the workers on use-case-specific hardware, such as machines with GPUs, separately from the web servers. The API server returns the (potentially streamed) results to the client.

Benefits:

Serverless hosting. On platforms where you only pay for the duration of a request, you can avoid running the machine between requests. For a worker to pull requests, it needs to be running continuously or auto-scaled by a monitoring service.
Standard. It is easier to reason about latency, errors, and work attribution for a traditional request. By comparison, when a worker pulls work and publishes results, it is no longer within the call graph of the original request.
Stateless. When you hold open the client request and return the result directly from a worker, you don’t have to persist any state if you don’t want to.

Challenges:

Load balancing needs to keep track of workers.
- You have to guess which backend to send work to, or poll every worker for their state.
- When a backend starts or stops, something needs to update, whether it’s Consul, kube-proxy, ELB, or otherwise. To stop a worker without incurring failures, you need to prevent the load balancer from sending new requests and then finish existing ones.
- These updates can fail or take some time.
- All workers need to be discoverable and exposed to inbound http traffic. To run a worker on your local machine, you could use a service like Tunnelmole or ngrok to proxy traffic, which exposes you to public internet traffic.
Isolated queues: if a worker has too many requests, it can only queue or reject.
- Requests might not be started in the order they were received, and higher priority requests may be queued behind lower priority ones.
  - Often the queueing happens in the TCP socket connection, which can’t distinguish application-layer details, such as request priority or expected duration.
- Per-machine queues can cause high tail latency in distributed systems.
  - Some workers might be idle while others have a backlog of slow requests.
HTTP connection lifecycle: the API request needs to hold open both incoming and outgoing connections for the duration of the work.
- If the client loses the connection, the operation can’t easily resume. Even with sticky connections, an API server could come up on a new machine during a deploy.
- This can result in low CPU utilization on the API server. If this is a serverless function, you may be paying for this idle time.
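
To make the "isolated queues" challenge concrete, here is a minimal simulation, assuming two workers and an illustrative list of job durations that I made up. It compares a round-robin balancer, which commits each job to a worker on arrival, with a single shared queue where the next idle worker takes the next job. The function names are hypothetical and this is a sketch, not a model of any particular load balancer.

```typescript
// Durations (in seconds) of jobs, in arrival order. Illustrative numbers only.
const jobs = [10, 1, 10, 1, 10, 1];

// Push-based: a round-robin balancer assigns each job to a worker up front,
// so each worker drains its own private queue.
function roundRobinMakespan(durations: number[], workers: number): number {
  const busyUntil = new Array<number>(workers).fill(0);
  durations.forEach((duration, i) => {
    busyUntil[i % workers] += duration;
  });
  return Math.max(...busyUntil);
}

// Pull-based: whichever worker frees up first takes the next job
// from a single shared FIFO queue.
function sharedQueueMakespan(durations: number[], workers: number): number {
  const busyUntil = new Array<number>(workers).fill(0);
  for (const duration of durations) {
    // Find the worker that becomes idle earliest (ties go to the lowest index).
    let next = 0;
    for (let w = 1; w < workers; w++) {
      if (busyUntil[w] < busyUntil[next]) next = w;
    }
    busyUntil[next] += duration;
  }
  return Math.max(...busyUntil);
}

console.log(roundRobinMakespan(jobs, 2)); // time until the last job finishes
console.log(sharedQueueMakespan(jobs, 2));
```

With these numbers, round-robin sends every slow job to the same worker, which stays backlogged long after the other worker has gone idle. The shared queue finishes the same work sooner because no worker sits idle while jobs are waiting.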

Pull-based work stealing

Compared to push-based, in a pull-based system, workers take on or “steal” work when they have capacity, and then publish the result. To see an implementation of work stealing, check out this post and this GitHub repo where I implement it for LLM-powered group chat:

Implementing work stealing with a reactive database

llama-farm-chat GitHub repo

Benefits

Optimizing throughput with consistent concurrency.
- In non-user-facing workloads you want to utilize machines as efficiently as possible, which is especially common in AI applications where you want to crawl large amounts of data to generate embeddings. Instead of controlling how fast you push work and to which machines, having a large work queue consumed by workers that you can dynamically spin up allows for optimal utilization.
- For user-facing workloads, during spikes in load the API server knows how much work is in flight and can decide whether to reject or enqueue the work, as well as whether to re-order or cancel existing jobs based on priority.
No load-balancing or service discovery.
- Workers can come and go without updating anything - they simply start requesting work.
- Workers only make outbound requests: they can safely run behind a NAT.
No isolated queues: workers don’t accumulate their own backlog.
- By sharing a global queue, performance (latency) is more uniform and can be FIFO or globally prioritized.
- Workers decide when to take on work, and how much.
- To stop, they finish their requests and don’t request more.
Multiplexed subscriptions: clients can start jobs, disconnect, and re-subscribe to results. In fact, many clients can be subscribed to the result, since it is persisted outside of the scope of an active http request.
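
As a rough illustration of the shape of a pull-based system, here is an in-memory sketch of a shared queue. It is not the llama farm implementation: the `WorkQueue` class and its `enqueue`, `claim`, `complete`, and `get` methods are names I'm inventing for this example, and a real system would persist jobs in a database rather than an array.

```typescript
type Job = {
  id: number;
  prompt: string;
  priority: number;
  status: "pending" | "inProgress" | "done";
  workerId?: string;
  result?: string;
};

class WorkQueue {
  private jobs: Job[] = [];
  private nextId = 1;

  // Called by the API server when a request arrives.
  enqueue(prompt: string, priority = 0): number {
    const id = this.nextId++;
    this.jobs.push({ id, prompt, priority, status: "pending" });
    return id;
  }

  // Called by a worker when it has spare capacity. Returns the pending job
  // with the highest priority (FIFO within a priority), or null if idle.
  claim(workerId: string): Job | null {
    let best: Job | null = null;
    for (const job of this.jobs) {
      if (job.status !== "pending") continue;
      if (best === null || job.priority > best.priority) best = job;
    }
    if (best === null) return null;
    best.status = "inProgress";
    best.workerId = workerId;
    return best;
  }

  // Called by the worker to publish its result.
  complete(id: number, result: string): void {
    const job = this.jobs.find((j) => j.id === id);
    if (!job || job.status !== "inProgress") {
      throw new Error(`Job ${id} is not in progress`);
    }
    job.status = "done";
    job.result = result;
  }

  // Any client can look up a job later, even after reconnecting.
  get(id: number): Job | undefined {
    return this.jobs.find((j) => j.id === id);
  }
}
```

Note that the worker never accepts an inbound connection: it only calls `claim` and `complete`, which is what lets it run behind a NAT. To stop, it simply stops calling `claim`.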

Challenges

Serverless hosting ecosystem: workers need to be subscribed and are harder to dynamically wake up, compared to serverless hosting models that wake on incoming HTTP requests.
Failures are harder to detect: the worker needs to periodically let the server know it’s still working. With “push” HTTP requests, failure can be detected automatically by the connection closing.
Additional overhead: every request is persisted and flows through a subscription mechanism such as a pub/sub service, or database queries in Convex.
- If the request is otherwise fast, the additional latency might be noticeable. It will affect the "average case" or "p50" performance.
- If the requests are frequent and don’t otherwise require much database bandwidth, the overhead of tracking requests might be noticeable.

Making the call: my experience

While pull-based solutions have a lot of benefits, this decision is highly sensitive to your application’s needs. I’ll contextualize this with my own experience deciding between push- and pull-based solutions for task distribution.

I used to run the team at Dropbox responsible for generating previews of user documents. If you’ve ever used the Dropbox website and looked through images, watched a video, or looked at a pdf preview of a Microsoft Office document, that file was processed by the system my team built and maintained. We thought deeply about load balancing, caching, and reliability. One fun statistic: if you removed the cache and processed the full file for every user’s request, it would amount to processing over one exabyte of data per day.

When we were re-architecting it, we considered both push & pull.

Why we wanted a pull-based solution:

System utilization and maximizing throughput. We had different classes of services optimized for different operations - video transcoding, windows emulation for office documents, etc. These machines knew their capabilities, and we were excited about a workflow where a machine could take on different types and quantities of work based on its available memory and CPU utilization.
Absorbing and shedding load. There were occasionally spikes of load that meant either failing requests, or saturating each service’s http queue and driving up latency for all users. With a queue we could have had more control over which requests we dropped and which requests we could continue to prioritize.
Avoids service discovery. Keeping track of which machines to route to introduces many opportunities for failure, especially during deployments:
1. When a backend dies, how soon will service discovery adapt?
2. When a machine comes online, how soon will it be discovered?
3. When service discovery fails (it does), how well can you keep serving traffic?

Why we ended up with a push-based solution:

Our database wasn’t reactive: determining from the backend when a job was available, or detecting when it was finished from the API server would have involved additional infrastructure.
Predictability: our infrastructure was based around discrete HTTP requests that flowed into gRPC services, not WebSockets and subscriptions. This influenced what tooling was readily available.
High volume, low latency: although occasional requests took minutes, most of our traffic was very high volume and low latency (~5ms). Incurring a database write per request would have overwhelmed the database, and a pub/sub subscription was too heavyweight.
Uniformity: there was a case for doing pull-based requests only for heavy operations like video transcoding, but we were a small team and didn’t want to maintain two sets of infrastructure.
Monitoring: we wanted to track latency and success statistics in a centralized place close to the request. Splitting the status across multiple metrics reported from different services and machines would have complicated our monitoring setup.

I don’t regret making that decision at the time. However, I do think some things have changed since then that enable the work stealing pattern for modern apps, such as reactive databases.

Why reactive databases change the game

One big challenge with workers pulling work is connecting the result back to the client. If you want to return it in the client’s original request, the worker needs to know which API server to send the work to, and get the result to the right thread or process waiting for the request. Or it needs to leverage a pub/sub system where the worker is subscribed only for its own results. This “return address” problem ends up requiring a lot of nuance to get right at scale. For example, how long should the pub/sub system wait before dropping the message?

Since using Convex, I’ve come to appreciate separating data flow between queries, which are read-only, side-effect-free, consistent views of data, and mutations which are read-write transactions. This is an increasingly common separation, and greatly simplifies how you reason about data moving through a system, including for work stealing.

A client subscribes to a view of data with a query. In the case of a chat app, it subscribes to recent messages in a channel. Whenever those messages are updated, regardless of who updated them, it gets a fresh view of the data.
The work can be submitted to the API server (Convex in my case) whether it’s within the context of a client request or not. In the case of my chat app, the original request creates a placeholder message, and submits a job to fill out the message. When the worker generates and submits the message — whether it’s a partial update or the final result — all the API endpoint needs to know is what record(s) to update in the database.
The communication channel — how users end up seeing the message — is the same as the application’s transactional data persistence: the reactive database. There’s never a case where a client receives a result but the result was never persisted, or where the result was recorded but the client missed the update.
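
The flow above can be sketched with a toy in-memory store. This is not Convex's API: `ReactiveMessages` and its methods are hypothetical stand-ins, with `subscribe` playing the role of a query subscription and `insert` / `patch` playing the role of mutations.

```typescript
type Message = { id: number; channel: string; body: string; done: boolean };
type Listener = (messages: Message[]) => void;

class ReactiveMessages {
  private messages: Message[] = [];
  private subscriptions: { channel: string; listener: Listener }[] = [];
  private nextId = 1;

  // "Query": the listener gets the current view immediately and a fresh
  // view after every change to that channel, regardless of who made it.
  subscribe(channel: string, listener: Listener): () => void {
    const subscription = { channel, listener };
    this.subscriptions.push(subscription);
    listener(this.view(channel));
    return () => {
      this.subscriptions = this.subscriptions.filter((s) => s !== subscription);
    };
  }

  // "Mutation": the original request inserts a placeholder message.
  insert(channel: string, body: string): number {
    const id = this.nextId++;
    this.messages.push({ id, channel, body, done: false });
    this.notify(channel);
    return id;
  }

  // "Mutation": a worker submits a partial or final result for a record.
  patch(id: number, body: string, done: boolean): void {
    const message = this.messages.find((m) => m.id === id);
    if (!message) throw new Error(`No message with id ${id}`);
    message.body = body;
    message.done = done;
    this.notify(message.channel);
  }

  private view(channel: string): Message[] {
    return this.messages
      .filter((m) => m.channel === channel)
      .map((m) => ({ ...m }));
  }

  private notify(channel: string): void {
    for (const s of this.subscriptions) {
      if (s.channel === channel) s.listener(this.view(channel));
    }
  }
}
```

The worker only needs the record id to call `patch`. It never needs a "return address" for the client, because every subscriber sees the update through the same store that persists it.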

I was pleasantly surprised how quickly this pattern came together for llama farm, and am excited to see what novel architectures this enables.

Summary

In this post we compared push-based load balancing with pull-based work stealing, as ways of distributing resource-intensive workloads. While the former is the traditional strategy, the latter brings a lot of benefits, provided you are able to separate your reads and writes.

Next steps:

To learn more about optimizing for latency, I recommend reading the paper “The Tail at Scale” if you haven’t read it already.
To see an implementation of work stealing, read this post about implementing llama farm, or read the code here.
Learn more about how Convex works here.

If you’re curious where the term “work stealing” comes from, it is a scheduling algorithm for parallel computing, specifically when there is one queue of tasks per process where another process can “steal” tasks (threads to execute) while idle. In our example, we simplify it to have one queue for the sake of explanation. At high scale, this technique involves dividing requests into multiple queues and having each worker interact primarily with one or a subset of them, stealing work from other queues when idle. Let me know in Discord if you'd be interested in an article on doing this at scale. ↩

Supercharge `npm run dev` with package.json scripts

Ian Macartney — Wed, 24 Jul 2024 00:07:51 +0000

npm run dev is the standard for "run my website locally," but how does it work? How can we expand its functionality? In this post we'll look at:

How to configure what npm run dev does.
How to decompose complex commands into granular units.
How to run multiple commands in parallel.
How to run pre-requisites without losing normal Ctrl-C behavior.
How to add seed data (if none exists) when starting up a Convex backend.

As a motivating example, here are some scripts defined in the convex-helpers example app. We'll cover what each piece does:

  "scripts": {
    "dev": "npm-run-all --parallel dev:backend dev:frontend",
    "build": "tsc && vite build",
    "dev:backend": "convex dev",
    "dev:frontend": "vite",
    "predev": "convex dev --until-success",
    "test": "vitest"
  },

How and where they're defined

npm run runs commands that are defined in your package.json in your project's workspace. These commands are often pre-configured when you start your repo from a command like npm create vite@latest with commands for:

dev: Run a development environment. This often includes auto-reloading the UI when files change. For Vite this is vite and for Next.js it is next dev.
build: Build the website for deployment. This will generally compile and bundle all your html, css, and javascript. For Vite this is vite build and Next.js is next build.
test: Run tests - if you're using Jest, it's just "test": "jest" or vitest for Vitest.

Here's a basic example from Next.js:

// in package.json
{
// ...
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint"
  },
//...
}

Here you can run npm run dev or npm run lint etc.

You can learn more about npm run in the docs.

Why use scripts?

It's a fair question why one would put commands that are already so simple into package scripts. Why not just call jest or vite or next build? There are a few good reasons:

You can save the default parameters for commands so you don't have to remember or document the "standard" way of starting something. We'll see below how you can configure it to chain commands and run others in parallel.
It allows you to easily run commands that are installed by npm but not globally accessible from your shell (terminal).¹ When you install things like npm install -D vitest, it installs vitest into node_modules/.bin.² You can't run vitest directly in your shell,³ but you can have a config like: "scripts": { "test": "vitest" } and npm run test will run vitest.
It always runs with the root of the package folder as the "current directory" even if you're in a subdirectory. So you can define a script like "foo": "./myscript.sh" and it will always look for myscript.sh in the package root (in the same directory as package.json). Note: you can access the current directory where it was called via the INIT_CWD environment variable.
You can reference variables in the package.json easily when the script is run from npm run. For instance, you can access the "version" of your package with the npm_package_version environment variable, like process.env.npm_package_version in js or $npm_package_version in a script.
If you have multiple workspaces (many directories with their own package.json configured into a parent package.json with a "workspaces" config), you can run the same command in all workspaces with npm test --workspaces or one with npm run lint --workspace apps/web.
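
As a small illustration of the environment variables mentioned above, here is a sketch of a helper that a script launched via npm run could use. The `describeInvocation` function is a made-up name, and it takes the environment as a parameter so the example stays self-contained. In a real Node script you would pass `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Summarize what npm told the script about how it was launched.
function describeInvocation(env: Env): string {
  // Set from the "version" field of package.json when run via `npm run`.
  const version = env["npm_package_version"] ?? "unknown";
  // The directory the user was in when they ran `npm run`, even though
  // the script itself executes from the package root.
  const calledFrom = env["INIT_CWD"] ?? "unknown";
  return `version ${version}, invoked from ${calledFrom}`;
}

console.log(
  describeInvocation({ npm_package_version: "1.2.3", INIT_CWD: "/repo/src" }),
);
```

Running the same script directly, outside of npm run, would leave both variables unset, which is why the sketch falls back to "unknown".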

Does `npm run dev` work with yarn / pnpm / bun?

Yes! Even if you install your dependencies with another package manager, you can still run your package scripts with npm.

yarn # similar to `npm install`
npm run dev # still works!

You don't have to remember that npm run dev maps to yarn dev (or yarn run dev). The same goes for npx: npx convex dev works regardless of what package manager you used to install things.

Running commands in parallel

There are a couple packages you can use to run commands concurrently:⁴

We'll just look at npm-run-all here. Consider our example:

  "scripts": {
    "dev": "npm-run-all --parallel dev:backend dev:frontend",
    "dev:backend": "convex dev",
    "dev:frontend": "vite",
  },

This defines three scripts.

npm run dev:backend runs convex dev.
npm run dev:frontend runs vite.
npm run dev runs both convex dev and vite in parallel via npm-run-all.

Both outputs are streamed out, and doing Ctrl-C will interrupt both scripts.

predev? postbuild?

You can specify commands to run before (pre) or after (post) another command (say, X) by naming your command preX or postX. In the example:

  "scripts": {
    "dev": "npm-run-all --parallel dev:backend dev:frontend",
    "dev:backend": "convex dev",
    "dev:frontend": "vite",
    "predev": "convex dev --until-success",
  },

This will run convex dev --until-success before the "dev" command, which is npm-run-all --parallel dev:backend dev:frontend.

Chaining with "&&"

For those used to shell scripting, you can run two commands in sequence if the previous one succeeds with commandA && commandB. This works on both Windows and Unix (mac / linux).

However, there are a couple of advantages to just using pre-scripts:

You can run either step on its own: npm run dev --ignore-scripts skips the "predev" script, and npm run predev runs only the "predev" step.
The Ctrl-C behavior is more predictable in my experience. In different shell environments, doing Ctrl-C (which sends an interrupt signal to the current process) would sometimes kill the first script but still run the second script. After many attempts we decided to switch to "predev" as the pattern.

Run interactive steps first

For Convex, when you first run npx convex dev (or npm run dev with the above scripts), it will ask you to log in if you aren't already, and ask you to set up your project if one isn't already set up. This is great, but interactive commands that update the output text don't work well when the output is being streamed by multiple commands at once. This is the motivation for running npx convex dev --until-success before npx convex dev.

convex dev syncs your functions and schema whenever it doesn't match what you have deployed, watching for file changes.
The --until-success flag syncs your functions and schema only until it succeeds once, telling you what to fix if something is wrong and retrying automatically until it succeeds or you Ctrl-C it.
By running npx convex dev --until-success, we can go through the login, project configuration, and an initial sync, all before trying to start up the frontend and backend.
The initial sync is especially helpful if it catches issues like missing environment variables which need to be set before your app can function.
This way the frontend doesn't start until the backend is ready to handle requests with the version of functions it expects.

Seeding data on startup

If you change your "predev" command for Convex to include --run it will run a server-side function before your frontend has started.

  "scripts": {
      //...
    "predev": "convex dev --until-success --run init",
        //...
  },

The --run init command will run a function that is the default export in convex/init.ts. You could also have run --run myFolder/myModule:myFunction. See docs on naming here. See this post on seeding data for details, but the gist is that you can define an internalMutation that checks if the database is empty and, if so, inserts a collection of records for testing / setup purposes.
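
The "seed only if empty" check can be sketched as below. To keep the example runnable on its own, it uses a hypothetical `Table` interface and an in-memory implementation instead of the Convex database API. In a Convex project the same logic would live inside the internalMutation and use the database handle it receives. The seed records are made up.

```typescript
// Hypothetical stand-in for a database table, not the Convex API.
interface Table<T> {
  first(): T | null;
  insert(doc: T): void;
}

class InMemoryTable<T> implements Table<T> {
  private docs: T[] = [];
  first(): T | null {
    return this.docs.length > 0 ? this.docs[0] : null;
  }
  insert(doc: T): void {
    this.docs.push(doc);
  }
  count(): number {
    return this.docs.length;
  }
}

type User = { name: string };

// Illustrative seed data.
const seedUsers: User[] = [{ name: "Ada" }, { name: "Grace" }];

// Returns true if it inserted seed data, false if data already existed.
function seedIfEmpty(users: Table<User>): boolean {
  if (users.first() !== null) return false;
  for (const user of seedUsers) users.insert(user);
  return true;
}
```

Because the function checks before inserting, it is safe to run on every startup: the second and later runs are no-ops.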

tsc?

If you use TypeScript, you can run a type check / compile your TypeScript files with a bare tsc. If your tsconfig.json is configured to emit types, it will write out the types. If not, it will just validate the types. This is great to do as part of the build, so you don't build anything that has type errors. This is why the above example did:

    "build": "tsc && vite build",

How to pass arguments?

If you want to pass arguments to a command, for instance passing arguments to your testing command to specify what test to run, you can pass them after a -- to separate the command from the argument. Technically you don't need -- if your arguments are positional instead of --prefixed, but it doesn't hurt to always do it in case you forget which to do it for.

npm run test -- --testNamePattern="pattern"

Summary

We looked at some ways of using package.json scripts to simplify our workflows. Who knew how much power could rest behind a simple npm run dev? Looking at our original example:

  "scripts": {
    "dev": "npm-run-all --parallel dev:backend dev:frontend",
    "build": "tsc && vite build",
    "dev:backend": "convex dev",
    "dev:frontend": "vite",
    "predev": "convex dev --until-success",
    "test": "vitest"
  },

dev runs the frontend and backend in parallel, after predev.
build does type checking via tsc before building the static site.
dev:backend continuously deploys the backend functions to your development environment as you edit files.
dev:frontend runs a local frontend server that auto-reloads as you edit files.
predev runs before dev and does an initial deployment, handling login, configuration, and an initial sync as necessary.
test uses Vitest to run tests. Note: npm test is a built-in shorthand for npm run test, as are a few other commands such as npm start, but they're special cases. npm run test is the habit I suggest.

The way your shell finds which command to run when you type npm is to check the shell's PATH environment variable (on unix machines anyways). You can see your own with echo "$PATH". It checks all the places specified in $PATH and uses the first one. ↩
Technically you can override & specify where npm installs binaries. ↩
If you really want to, you can run npm exec vitest, npx vitest for short, ./node_modules/.bin/vitest directly, or add node_modules/.bin to your PATH. ↩
Some people use a bare & to run one task in the background, but that is not supported on Windows, and interrupting one command won't necessarily kill the other. ↩

Convert your .json array to a .jsonl (JSON Lines)

Ian Macartney — Wed, 24 Jul 2024 00:05:57 +0000

JSON Lines is a file format that stores one JSON object per line in a file. This is more scalable than storing it as a JSON array, since most JSON parsers require loading the full stringified array into memory before they can parse it. With JSON Lines, your code can load one line at a time. If you don't hold onto each object (for instance, you do some operation and save it elsewhere), your code can theoretically read an infinitely long file without running out of memory. It also allows you to easily read a stream of data.
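
Here is a minimal sketch of reading JSON Lines incrementally. It assumes the data arrives as a sequence of string chunks (as it would from a file or network stream) and that chunk boundaries can fall in the middle of a line. The `parseJsonLines` name is my own.

```typescript
// Yields one parsed object per line, holding only the current
// partial line in memory rather than the whole input.
function* parseJsonLines(chunks: Iterable<string>): Generator<unknown> {
  let buffer = "";
  for (const chunk of chunks) {
    buffer += chunk;
    let newline = buffer.indexOf("\n");
    while (newline !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line !== "") yield JSON.parse(line);
      newline = buffer.indexOf("\n");
    }
  }
  // Handle a final line that has no trailing newline.
  const last = buffer.trim();
  if (last !== "") yield JSON.parse(last);
}

// Chunks deliberately split in the middle of objects.
const chunks = ['{"foo":"b', 'ar"}\n{"easy_as"', ":123}\n"];
for (const record of parseJsonLines(chunks)) {
  console.log(record);
}
```

Doing the same with a JSON array would require either buffering the entire input or a specialized streaming parser.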

To make these files, each object needs to be on its own line, and the file extension is .jsonl (note the trailing l).

For Convex import, we allow you to specify either format: a .json or .jsonl. However, we limit the .json size to 8MB. So one question that comes up is how can you transform your .json file into .jsonl?

Using `jq` to convert .json to .jsonl

jq is a common tool for working with JSON from the command line.

Here's the one-liner to convert a .json file into .jsonl using jq:

jq -c '.[]' ./mydata.json > mydata.jsonl

Example

Input: ./mydata.json

[
    {
        "foo": "bar"
    },
    {
        "easy_as": 123
    }
]

Output: ./mydata.jsonl

{"foo":"bar"}
{"easy_as":123}
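
If jq isn't available, the same conversion can be sketched in TypeScript. This is an assumption about how you might do it, not part of the Convex tooling, and `jsonArrayToJsonLines` is a name I made up. It parses the whole array in memory, so it only suits files that fit comfortably in RAM.

```typescript
// Convert the text of a JSON array into JSON Lines text:
// one compact JSON value per line.
function jsonArrayToJsonLines(jsonText: string): string {
  const parsed: unknown = JSON.parse(jsonText);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a top-level JSON array");
  }
  return parsed.map((item) => JSON.stringify(item) + "\n").join("");
}

const input = `[
    { "foo": "bar" },
    { "easy_as": 123 }
]`;
console.log(jsonArrayToJsonLines(input));
```

JSON.stringify without an indent argument never emits newlines, so each element is guaranteed to land on exactly one line.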

Implementing Rate Limiting with only two numbers

Ian Macartney — Wed, 24 Jul 2024 00:03:03 +0000

There’s a lot of buzz about rate limiting, in part because of the cost exposure associated with LLM workloads and other resource-intensive patterns. Especially for apps that have a freemium model or any use that isn’t correlated with revenue, a single user can have a meaningful impact on your costs if they’re allowed to fire off thousands of requests.

There’s a host of advice and services that can help you solve this problem for different applications, but I’d like to show you how simple it can be to implement when you have fast access to a database with strong ACID guarantees¹.

Specifically, in this post I’m going to look at implementing the following, storing just two numbers per rate limit.

Token bucket: Enable limiting the overall request rate for a sliding window, while also accommodating bounded bursts of traffic after a period of inactivity. For example, the number of requests should be limited to X per hour plus up to Y more “rollover minutes” from previous hours. Tokens become available continuously over time. This is what I recommend for most use-cases.
Fixed window: For example, in each hour there can be no more than X requests to some third-party service, to avoid hitting their rate limits. When the hour is up, all tokens are available again. We’ll also discuss using “jitter” to avoid thundering herds.
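
To give a feel for the fixed window idea, here is a simplified sketch that stores two numbers: the tokens remaining in the current window (`value`) and when that window started (`ts`). This is my own illustration and not the convex-helpers implementation: the config shape and function name are assumptions, it simply resets the count at each window boundary, and it leaves out jitter.

```typescript
type WindowState = { value: number; ts: number };
type WindowConfig = { rate: number; period: number }; // `rate` tokens per `period` ms
type WindowResult = { ok: boolean; state: WindowState; retryAt?: number };

function consumeFixedWindow(
  state: WindowState | null,
  config: WindowConfig,
  now: number,
  count = 1,
): WindowResult {
  // With no stored state, start a full window now.
  let { value, ts } = state ?? { value: config.rate, ts: now };
  // If one or more whole windows have passed, advance to the current
  // window and make all tokens available again.
  const elapsedWindows = Math.floor((now - ts) / config.period);
  if (elapsedWindows > 0) {
    value = config.rate;
    ts += elapsedWindows * config.period;
  }
  if (value < count) {
    // Not enough tokens: the earliest useful retry is the next window.
    return { ok: false, state: { value, ts }, retryAt: ts + config.period };
  }
  return { ok: true, state: { value: value - count, ts } };
}
```

The caller would store the returned state after a successful call. Because the window boundaries are derived from `ts` rather than from a timer, no cron job is needed to reset the counter.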

For those who just want to use a library, I made one (code here) available in convex-helpers. You can see examples below.

For the sake of this article, when I refer to “tokens” it is using the mental model of rate limiting where you are granted a certain number of tokens per some time period, and when a request successfully “consumes” them it can proceed. I’ll say a “debt” or “deficit” when tokens are over-consumed, as we’ll see later in exploring reservations.

What is application-layer rate limiting

In this article, we are going to talk about application-layer rate limiting. Specifically, the controls you have available when you are aware of a user and the operation they are trying to perform, rather than the networking layer, which is the responsibility domain of the hosting platform provider. To prevent a request from being made in the first place, you can also look into client-side throttling, such as single-flighting requests.

In practice, application-layer rate limiting is the most useful and only falls down during extreme load, such as a distributed denial of service (DDoS) attack, which thankfully is extremely rare. More commonly there is a small number of users, whether malicious or otherwise, who are consuming more resources than you expected. The cost of such "attacks" is often incurred in expensive requests to auto-scaling third party services, such as hosted LLMs for AI apps.²

Benefits of these implementations

The rate limits discussed here have these properties:

Efficient storage and compute: in particular, it doesn’t require crons or storage that scales with load. Each rate limit (a combination of a name and a “key”) stores two numbers and does simple math.
Transactional evaluation: You can make multiple decisions using multiple rate limits and be assured that they’ll all be consumed or none will be (if you roll back by throwing an exception, for instance).
Fairness via opt-in credit “reservation”: By using reserve: true below, I’ll show how you can pre-allocate tokens and schedule work that doesn’t require client backoff, and doesn’t starve large requests.
Opt-in “rollover” allowance: Allowing clients that have been under-consuming resources to accumulate tokens that “roll over” to the next period up to some limit, so they can service bursts of traffic while limiting average usage by a token bucket.
Deterministic: The results of these approaches will give you concrete bounds on usage, do not rely on probability, and will not “drift” over time. They also can determine the next time a retry could succeed.
Fail closed: If the system is under heavy load, it won’t fling open the gates for the traffic to overwhelm other services and cause cascading failure, as other “fail open” solutions can. This is an easy property to satisfy, since the application database is being used for the rate limit. If the application database is unavailable, continuing to serve the request is unlikely to succeed anyway. Failing open makes more sense when adding infrastructure or services that could introduce a single point of failure (SPOF), such as a single-host in-memory service.

The algebra of rate limits

Here is how you calculate rate limits, using just two numbers: value and ts. Here is the Convex database schema for it:

rateLimits: defineTable({
  name: v.string(),
  key: v.optional(v.string()), // undefined is singleton
  value: v.number(), // can go negative if capacity is reserved ahead of time
  ts: v.number(),
}).index("name", ["name", "key"]),

Each of the following approaches has some basic error checking omitted for brevity, such as rejecting a request for more units than the maximum possible, which could never succeed.

See below for modifications to accommodate reserving tokens.

Token bucket

Instead of a traditional approach to a sliding window in which there are discrete jumps in value based on past events, we can use a token bucket to provide similar benefits, with a much more efficient storage and runtime footprint.

We model tokens as being continuously provided, with a capacity defined in the config. If your configured rate is 10 in a period of a minute and you use 5 tokens, they will be fully restored in 30 seconds. Since we model it continuously, one credit will be restored every six seconds. So you could be consuming one credit every six seconds, five every thirty seconds, or ten every minute. When you don’t use all your tokens, they can accumulate up to the defined capacity amount. With capacity, you can use more than the normal rate for a period of time, resting assured that all of that capacity was accumulated during idle time. Because the tokens are issued at a fixed rate, the overall credit consumption is bound to that rate. If it isn’t set, capacity defaults to rate.

Config:

export type TokenBucketRateLimit = {
  kind: "token bucket";
  rate: number;
  period: number;
  capacity?: number; // defaults to rate
  maxReserved?: number;
};

The core calculation is:

const now = Date.now();
const elapsed = now - state.ts;

ts = now;

value = Math.min(
  state.value + elapsed * config.rate / config.period,
  config.capacity ?? config.rate
) - (args.count ?? 1);

We keep track of the last time we calculated the state of the bucket as ts and the value at that time.
We calculate the current value since then, capping it at the capacity of the bucket. If there is no capacity configured, we default to the rate. So if you allow 10 per second and don't specify a capacity, you can have up to 10 tokens in the bucket.
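To make the arithmetic concrete, here is a minimal sketch of the refill step on its own, using the 10-per-minute example from above. The names tokensAvailable, BucketState and BucketConfig are made up for illustration and are not part of the library.

```typescript
// Hypothetical types for illustration; they mirror the two stored numbers.
type BucketState = { value: number; ts: number };
type BucketConfig = { rate: number; period: number; capacity?: number };

// Tokens available at `now`, before anything is consumed.
function tokensAvailable(state: BucketState, config: BucketConfig, now: number): number {
  const elapsed = now - state.ts;
  const max = config.capacity ?? config.rate;
  return Math.min(state.value + (elapsed * config.rate) / config.period, max);
}

const MINUTE = 60_000;
const bucketConfig: BucketConfig = { rate: 10, period: MINUTE };
// Five of the ten tokens were consumed at t = 0.
const bucketState: BucketState = { value: 5, ts: 0 };

const after6s = tokensAvailable(bucketState, bucketConfig, 6_000); // 6: one token restored
const after30s = tokensAvailable(bucketState, bucketConfig, 30_000); // 10: fully restored
const after5min = tokensAvailable(bucketState, bucketConfig, 5 * MINUTE); // 10: capped at capacity
console.log({ after6s, after30s, after5min });
```

Note that nothing runs in the background here: the value is only recomputed when someone asks, which is why no cron is needed.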

Full code, including handling reservations:

const now = Date.now();
// Fetch the existing value & ts from the database, if present.
// If the key is undefined, it will fetch the shared value for the name.
const existing = await db.query("rateLimits")
  .withIndex("name", (q) => q.eq("name", name).eq("key", key))
  .unique();
// If there isn't a capacity defined, default to the rate
const max = config.capacity ?? config.rate;
const consuming = args.count ?? 1;

// Start of token-bucket-specific code
// Default to the maximum available right now.
const state = existing ?? { value: max, ts: now };
const elapsed = now - state.ts;
const rate = config.rate / config.period; // I apologize for the rate naming.
// The current value is whatever accumulated since the last evaluation up to max.
const value = Math.min(state.value + elapsed * rate, max) - consuming;
const ts = now;
let retryAt = undefined;
if (value < 0) { // not enough capacity currently (consuming was already subtracted)
  retryAt = now + -value / rate;
  // End of token-bucket-specific code
  if (!args.reserve || (config.maxReserved && -value > config.maxReserved)) {
    return { ok: false, retryAt };
  }
}
if (existing) {
  await db.patch(existing._id, { value, ts });
} else {
  const { name, key } = args;
  await db.insert("rateLimits", { value, ts, name, key });
}
return { ok: true, retryAt };

Some things to keep in mind:

If you are sensitive to bursts of traffic, for instance if you’re using a third party API that has a hard rate limit cap, note that with this approach you could have as many as rate + capacity requests in a single window: the accumulated capacity could all be consumed at the start of the window, and then the tokens issued during the window could be consumed before it ends. For these scenarios, you can:
- Use fixed window to divide capacity into fixed windows.
- Set capacity and rate to both be half of the third party hard cap, so worst case you use a burst of half, then use the accumulated amount before the window is over. The downside is your average consumption is limited to half of what’s available.
- Set capacity to be smaller than rate, which allows using more steady-state bandwidth, so long as the requests are small and somewhat frequent. For example if you had an hourly budget of 70 tokens, you could have a rate of 60 and a capacity of 10. You could consume one token per minute, or 5 tokens every five minutes, but if you didn’t consume anything for 15 minutes, you’d only have accumulated 10 tokens.
- Set capacity to zero and always use reserve and scheduling to perfectly space requests based on their needs. You lose the benefits of accumulated bandwidth, and all requests will suffer some delay, so only use this for time-insensitive workloads.
If many clients are waiting for the same rate limit to have bandwidth, and you aren’t using the reserve technique, you should add some jitter to retryAt before returning it to a client, so each client retries at a different time.

Fixed window

When you need your rate limiting windows to be rigid, you can use this more traditional approach, where tokens are issued at distinct intervals and can be used during that interval. We also extend it with an optional capacity configuration to allow unused tokens to accumulate, allowing us to control the overall usage while accommodating traffic that isn’t consistent.

Config:

export type FixedRateLimit = {
  kind: "fixed window";
  rate: number;
  period: number;
  capacity?: number; // defaults to rate
  maxReserved?: number; // read by the reservation check in the full code
  start?: number;
};

The core calculation for value at a given time is:

const elapsedWindows = Math.floor((Date.now() - state.ts) / config.period);

ts = state.ts + elapsedWindows * config.period;

value = Math.min(
  state.value + config.rate * elapsedWindows, 
  config.capacity ?? config.rate
) - (args.count ?? 1);

We use ts to mark the start of the window, and don’t update it until we are consuming resources in a more recent window, at which point we add tokens for each window that started since then.
value holds the tokens available at that timestamp.
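As a rough sketch of how the window bookkeeping plays out, here is the same calculation as a pure function with a few sample timestamps. advanceWindow and the type names are hypothetical, and the config values are just chosen to make the numbers easy to follow.

```typescript
// Hypothetical types for illustration only.
type WindowState = { value: number; ts: number };
type WindowConfig = { rate: number; period: number; capacity?: number };

// Move the stored state forward to the window containing `now`,
// adding `rate` tokens for each window that started since `state.ts`.
function advanceWindow(state: WindowState, config: WindowConfig, now: number): WindowState {
  const elapsedWindows = Math.floor((now - state.ts) / config.period);
  const max = config.capacity ?? config.rate;
  return {
    ts: state.ts + elapsedWindows * config.period,
    value: Math.min(state.value + config.rate * elapsedWindows, max),
  };
}

const HOUR = 3_600_000;
const windowConfig: WindowConfig = { rate: 100, period: HOUR, capacity: 150 };
// 20 tokens are left in the window that started at t = 0.
const windowState: WindowState = { value: 20, ts: 0 };

const sameWindow = advanceWindow(windowState, windowConfig, 0.9 * HOUR); // { ts: 0, value: 20 }
const nextWindow = advanceWindow(windowState, windowConfig, 1.5 * HOUR); // { ts: HOUR, value: 120 }
const muchLater = advanceWindow(windowState, windowConfig, 5.2 * HOUR); // { ts: 5 * HOUR, value: 150 }
console.log({ sameWindow, nextWindow, muchLater });
```

Unlike the token bucket, nothing is added partway through a window: at 0.9 hours the value is still 20.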

Full code, including reservations:

const now = Date.now();
// Fetch the existing value & ts from the database, if present.
// If the key is undefined, it will fetch the shared value for the name.
const existing = await db.query("rateLimits")
  .withIndex("name", (q) => q.eq("name", name).eq("key", key))
  .unique();
// If there isn't a capacity defined, default to the rate
const max = config.capacity ?? config.rate;
const consuming = args.count ?? 1;

// Start of fixed-window-specific code
const state = existing ?? {
  // If there wasn't a value or start time, default to a random time.
  ts: config.start ?? (Math.random() * config.period),
  value: max, // start at full capacity
};
const elapsedWindows = Math.floor((Date.now() - state.ts) / config.period);
// Add value for each elapsed window
const value = Math.min(state.value + config.rate * elapsedWindows, max) - consuming;
// Move ts forward to the start of this window
const ts = state.ts + elapsedWindows * config.period;
let retryAt = undefined;
if (value < 0) {
  const windowsNeeded = Math.ceil(-value / config.rate);
  retryAt = ts + config.period * windowsNeeded;

  // End of fixed-window-specific code
  if (!args.reserve || (config.maxReserved && -value > config.maxReserved)) {
    return { ok: false, retryAt };
  }
}
if (existing) {
  await db.patch(existing._id, { value, ts });
} else {
  const { name, key } = args;
  await db.insert("rateLimits", { value, ts, name, key });
}
return { ok: true, retryAt };

start is the offset from 0 UTC. You might set it, for example, to align the start of the period with midnight in the PDT timezone. This is handy for aesthetically aligning requests with a user’s midnight or starting on the hour or on the minute, but if the rate limit will see a lot of concurrent usage, this can lead to many clients all waiting until midnight to fire off requests and causing a thundering herd. For these situations you should add jitter, or omit start. If you don’t provide start, it will use the “key” to assign a random time as the start time.

Note: if you allow for capacity, the maximum number of tokens used in a given period will be capacity, whereas the token bucket implementation above could use a maximum capacity + rate for a single period (worst case). This makes fixed windows a good fit for maximizing third party API limits, and capacity a nice feature if the third party has an accommodation for “burst” traffic.

Reserving tokens

When you hit a rate limit with these implementations, the rate limiter knows the next time it could plausibly handle your request. However, by that time a smaller request could have come along and consumed tokens, further delaying the larger request. If you are sure you want to eventually serve a request, it’s more efficient to pre-allocate the work and schedule its execution.³

Reserving tokens provides three useful properties:

Fairness: By reserving capacity ahead of time, larger requests can allocate capacity without retrying until enough tokens accumulate.
Fire-and-forget: When you require a client to retry an operation later, there’s a chance the client won't be around - e.g. if a user clicks away or refreshes a website. If you know you eventually want to take some action, you can schedule execution for later and free the client from the responsibility of retrying.
“Perfect” scheduling: Using retries, especially when you add jitter, prevents you from fully utilizing available resources. With reservations, you are given advance authorization to run your operation at the exact time.

The implementation for both strategies is relatively straightforward. Both allow the available tokens to go negative, up to some configurable limit (by default unlimited).

For example, say you had 3 tokens and a request came in for 5.

You updated the token count to -2 and responded that the work should happen later - specifically when 2 tokens would have been added.
The caller schedules their work to happen at that later time. It is important to schedule it and not just run the request right away - otherwise it's just equivalent to a higher burst capacity.
When the scheduled function runs later, it doesn't need to check rate limits because it's already been approved.
Later on, when another function call checks the rate limit, it calculates how many tokens to add based on the elapsed time and adds it to the -2 value before deciding whether there are enough tokens for the new call.
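The walk-through above can be sketched as a pure function over the stored state, using a token bucket that issues one token per second. This is a simplified illustration rather than the library's code: consumeTokens and the type names are made up, and persistence is replaced by returning the next state.

```typescript
// Hypothetical types for illustration only.
type TokenState = { value: number; ts: number };
type TokenConfig = { rate: number; period: number; capacity?: number; maxReserved?: number };
type ConsumeResult = { ok: boolean; retryAt?: number; next: TokenState };

function consumeTokens(
  prev: TokenState,
  config: TokenConfig,
  now: number,
  count: number,
  reserve: boolean,
): ConsumeResult {
  const max = config.capacity ?? config.rate;
  const refilled = prev.value + ((now - prev.ts) * config.rate) / config.period;
  const value = Math.min(refilled, max) - count;
  if (value >= 0) return { ok: true, next: { value, ts: now } };
  // Not enough tokens: work out when the deficit will have been repaid.
  const retryAt = now + (-value * config.period) / config.rate;
  const overReserved = !!config.maxReserved && -value > config.maxReserved;
  if (!reserve || overReserved) {
    return { ok: false, retryAt, next: prev }; // refused: nothing is persisted
  }
  return { ok: true, retryAt, next: { value, ts: now } }; // approved for later
}

const SECOND = 1000;
const oneTokenPerSecond: TokenConfig = { rate: 1, period: SECOND, capacity: 10 };
const threeTokens: TokenState = { value: 3, ts: 0 };

// 3 tokens available and a request for 5 arrives, with reserve: true.
const reservation = consumeTokens(threeTokens, oneTokenPerSecond, 0, 5, true);
// ok: true, next.value: -2, retryAt: 2000 (run the work two seconds from now)

// One second later, a request for 1 token without reserve sees -2 + 1 = -1 and is refused.
const smallRequest = consumeTokens(reservation.next, oneTokenPerSecond, 1 * SECOND, 1, false);
// ok: false, retryAt: 3000, state left at { value: -2, ts: 0 }
console.log(reservation, smallRequest);
```

The second call shows the fairness property: the small request can't jump ahead of the reserved one, because the deficit has to be repaid first.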

Passing reserve: true is optional. By default the library will refuse to go negative.

See the below example for usage.

Using rate limits for common operations

To make the rate limit tradeoffs concrete, let’s consider how we’d use rate limiting in our application. I made a library that provides a simple rate limiting API to consume, check, and reset limits. This library is Convex-specific, but it will serve as a representative example of how application-layer rate limits might look.

In general, the distinctions here will be between “global” rate limits, for which there is one per application for a given rate limit “name,” and ones where each distinct “key” is rate limited independently.

The examples below assume a flow like:

A client calls a mutation to take some action, which may involve a rate-limited behavior. For those unfamiliar with Convex, a mutation runs server-side and encapsulates a database transaction, providing Serializable isolation and Optimistic Concurrency Control with automatic retries.
The mutation checks the rate limit before taking the action, and if it fails it returns the time when the client should retry.
Alternatively, the client could call an action (a non-transactional non-deterministic general-purpose environment, akin to a normal API endpoint) which could then call a mutation before taking some action.

Configuration can be centralized or provided at the call-site. If you use a limit from more than one place, defining them centrally is best, which will produce type-safe functions auto-completing your rate limit names.

import { defineRateLimits } from "convex-helpers/server/rateLimit";

const { rateLimit, checkRateLimit, resetRateLimit } = defineRateLimits({
  createAThing: { kind: "token bucket", rate: 3, period: HOUR },
  makeAThirdPartyRequest: { kind: "fixed window", rate: 100, period: MINUTE },
});

rateLimit is what you call to consume resources. It will return whether it succeeded.
checkRateLimit will do the same thing as rateLimit, but return the result without consuming any resources, as a way to tell whether it would have failed.
resetRateLimit will reset a given rate limit.

export const doAThing = mutation({
  args: { email: v.string() },
  handler: async (ctx, args) => {
    const { ok, retryAt } = await rateLimit(ctx, { name: "createAThing" });
    if (!ok) return { retryAt };
    await doTheThing(ctx, args.email);
  },
});

ok is whether it successfully consumed the resource and the operation should proceed.
retryAt is when it would have succeeded in the future, which can be used by a client to decide when to retry. We’ll discuss “jitter” later which is important here if it’s highly contended.
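On the client side, one way to act on retryAt is a small retry loop. The sketch below is framework-agnostic and entirely hypothetical: callWithRetry and the Outcome type are not part of the library, and the call argument stands in for however your client invokes the mutation.

```typescript
// Hypothetical shape for illustration: either the call went through, or we were told when to retry.
type Outcome<T> = { done: true; result: T } | { done: false; retryAt: number };

const defaultSleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function callWithRetry<T>(
  call: () => Promise<Outcome<T>>,
  maxAttempts: number = 3,
  jitterMs: number = 1000,
  sleep: (ms: number) => Promise<void> = defaultSleep,
  now: () => number = Date.now,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outcome = await call();
    if (outcome.done) return outcome.result;
    if (attempt === maxAttempts) break;
    // Wait until the server-suggested time, plus a random offset so that
    // clients that were refused together don't all come back together.
    const wait = Math.max(0, outcome.retryAt - now()) + Math.random() * jitterMs;
    await sleep(wait);
  }
  throw new Error("Still rate limited after " + maxAttempts + " attempts");
}
```

The sleep and now parameters are only there so the loop can be tested without real timers. In an app you would likely also surface the wait to the user rather than retrying silently.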

The only setup required is to add the rate limiting table to your schema in convex/schema.ts:

import { rateLimitTables } from "./rateLimit.js";

export default defineSchema({
  ...rateLimitTables,
  exampleOtherTable: defineTable({}),
  // other tables
});

From here on, the mutation context will be assumed. We’ll also define these constants to make it more readable:

const SECOND = 1000; // ms
const MINUTE = 60 * SECOND;
const HOUR = 60 * MINUTE;
const DAY = 24 * HOUR;

Failed logins

This will allow 10 failed requests in an hour. Because it’s a bucket implementation, the user will be able to try 10 times immediately if they haven’t tried in an hour, and then can try again every 6 minutes afterwards.

const { checkRateLimit, rateLimit, resetRateLimit } = defineRateLimits({
  failedLogins: { kind: "token bucket", rate: 10, period: HOUR },
});

Using the functions to manage failed logins:

await checkRateLimit(ctx, { name: "failedLogins", key: userId, throws: true });
const success = await logInAttempt(ctx, userId, otp);
if (success) {
  // If we successfully logged in, stop limiting us in the future
  await resetRateLimit(ctx, { name: "failedLogins", key: userId });
} else {
  const { retryAt } = await rateLimit(ctx, { name: "failedLogins", key: userId });
  return { retryAt }; // So the client can indicate to the user when to try again
}

throws is a convenience to have it throw when ok is false instead of returning the values. It works for rateLimit and checkRateLimit.

Account creation via global limit

To prevent a flood of spam accounts, you can set a global limit on signing up for a free trial. This limits sign-ups to an average of 100 per hour.

await rateLimit(ctx, { 
  name: "freeTrialSignUp",
  config: { kind: "token bucket", rate: 100, period: HOUR },
  throws: true,
});

config: The configuration is inlined if you didn’t define it with defineRateLimits.

Note: this is a deterrent for spammers, but means that during a flood of attempts, other users will be impacted. See below for tips on authenticating anonymous users.

Sending messages per user

const { ok, retryAt } = await rateLimit(ctx, {
  name: "sendMessage",
  key: userId,
  config: { kind: "token bucket", rate: 10, period: MINUTE, capacity: 20 },
});

key will isolate the rate limiting, in this case to be a per-user limit.
capacity here allows accumulating up to 20 unused tokens so a bursty minute won’t fail. See above for details on the implementation.

Making LLM requests with reserved capacity

If you’re staying on the OpenAI free tier, you can ensure you don’t go above their rate limits. When you have too many requests, you can schedule them to run later, once enough capacity is available.

const { ok, retryAt } = await rateLimit(ctx, {
  name: "chatCompletion",
  count: numTokensInRequest,
  reserve: true,
});
if (!ok) return { retryAt }; // There were too many reserved already.
if (retryAt) { // We need to wait until later, but we've reserved the tokens.
  // Spread the request across 10s in case of many reservations.
  const withJitter = retryAt + (Math.random() * 10 * SECOND);
  await ctx.scheduler.runAt(withJitter, internal.llms.generateCompletion, args);
} else { // We can run it immediately. Scheduled for after this DB transaction.
  await ctx.scheduler.runAfter(0, internal.llms.generateCompletion, args);
}

count can decrease the number of tokens by a custom amount. By default it’s 1.
reserve: true instructs it to allow a token deficit if there isn’t enough capacity, provided we are willing to schedule our work for the future time when it would have had enough capacity. See above for more details.
maxReserved defines how many tokens it should allow to be reserved before refusing to set aside tokens.

Reservations work with either rate limiting approach.

Jitter: introducing randomness to avoid thundering herds

If we tell all clients to retry just when the next window starts, we are inviting what’s called a “thundering herd” which is about how it sounds. When too many users show up at once, it can cause network congestion, database contention, and consume other shared resources at an unnecessarily high rate. Instead we can return a random time within the next period to retry. Hopefully this is infrequent. This technique is referred to as adding “jitter.”

A simple implementation could look like:

const withJitter = retryAt + (Math.random() * period);

For the fixed window, we also introduce randomness by picking the start time of the window (from which all subsequent windows are based) randomly if config.start wasn’t provided. This helps prevent all clients from flooding requests at midnight and paging your on-call.
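To get a feel for what jitter buys you, here is a small, self-contained sketch that compares 1000 clients retrying at exactly retryAt with the same clients spread over ten seconds. The helper names and the ten-second spread are assumptions for illustration.

```typescript
// Hypothetical helper: spread retries uniformly over `spreadMs` after `retryAt`.
function withJitter(retryAt: number, spreadMs: number, random: () => number = Math.random): number {
  return retryAt + random() * spreadMs;
}

// Count how many retries land in the busiest one-second bucket.
function busiestSecond(times: number[]): number {
  const counts = new Map<number, number>();
  for (const t of times) {
    const bucket = Math.floor(t / 1000);
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return Math.max(0, ...Array.from(counts.values()));
}

const sharedRetryAt = 1_700_000_000_000; // every client was told the same time
const naive = Array.from({ length: 1000 }, () => sharedRetryAt);
const jittered = Array.from({ length: 1000 }, () => withJitter(sharedRetryAt, 10_000));

console.log(busiestSecond(naive)); // 1000: everyone arrives in the same second
console.log(busiestSecond(jittered)); // varies per run, on the order of 100
```

The trade-off is latency: with a ten-second spread, the average client waits an extra five seconds, so pick a spread that matches how contended the limit is.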

Scaling rate limits with shards

As your usage grows, you’ll want to think about scalability for your rate limiting.

If you are using per-user keys, using each key won’t conflict with the others. However, if you use global rate limits, or a single key might have hundreds of requests per second, you should shard the rate limit by dividing the capacity into multiple rate limits.

For example, if you’re trying to limit the overall (global) number of tokens sent to an LLM API, you could make 10 rate limits, each at 1/10th the bandwidth. When you go to use the rate limit, you can set the key to be one of 10 random values, such as 0...9.

If you’re already using a key (for instance per-team), you can use the same approach by appending a random number to the existing key. All of the keys are just strings at the end of the day.

await rateLimit(ctx, { key: teamId + shard, name: "myLimit" });

This will decrease your ability to maximize resources, however, since you could get unlucky and try a rate limiter that has gotten more requests. To mitigate this, we can leverage The Power of Two Choices in Randomized Load Balancing by using checkRateLimit on two shards and using the one with a higher value.

const status1 = await checkRateLimit(ctx, { key: teamId + shard1, name: "myLimit" });
const status2 = await checkRateLimit(ctx, { key: teamId + shard2, name: "myLimit" });
const shard = (status1.ok && status2.ok ?
  (status1.value > status2.value ? shard1 : shard2)
  : status1.ok ? shard1
  : status2.ok ? shard2
  : status1.retryAt < status2.retryAt ? shard1 : shard2);
await rateLimit(ctx, { key: teamId + shard, name: "myLimit", throws: true });
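The snippets above use shard, shard1 and shard2 without showing where they come from. Here is one way they might be picked, as a sketch: the helper names, the shard count of 10, and the ":" separator are my assumptions, not something the library provides.

```typescript
const NUM_SHARDS = 10; // assumed shard count; each shard gets 1/10th of the budget

// A single random shard in the range 0...NUM_SHARDS-1.
function randomShard(numShards: number = NUM_SHARDS, random: () => number = Math.random): number {
  return Math.floor(random() * numShards);
}

// Two different shards for the "power of two choices" comparison.
// Requires numShards >= 2.
function twoDistinctShards(
  numShards: number = NUM_SHARDS,
  random: () => number = Math.random,
): [number, number] {
  const first = randomShard(numShards, random);
  // Step forward by 1...numShards-1 so the second shard can never equal the first.
  const offset = 1 + Math.floor(random() * (numShards - 1));
  return [first, (first + offset) % numShards];
}

// A separator keeps "team1" + shard 0 from colliding with a team named "team10".
function shardKey(baseKey: string, shard: number): string {
  return `${baseKey}:${shard}`;
}

const [shardA, shardB] = twoDistinctShards();
console.log(shardKey("team123", shardA), shardKey("team123", shardB));
```

Picking two distinct shards matters: comparing a shard against itself gives you no extra information.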

What if I rely on multiple rate limits?

If your operation requires consuming multiple rate limits, you can run into issues if you aren’t careful. You could consume resources that you don’t end up using. This can even deadlock if two operations acquire resources in different orders, for instance:

Request A takes 5 units of x and fails to take 10 units of y, so returns to the client that it should retry later.
Request B takes 5 units of y (say there were only 5 units) and fails to take 10 units of x.
Request A retries, taking another 5 units of x which had accumulated, and fails to take y.
etc.

Here are two strategies to handle this:

Use checkRateLimit ahead of time for each rate limit you’ll depend on, and only continue to consume them if they are all satisfied, otherwise return the largest retryAt value so the client doesn’t retry before it’d plausibly be accepted.
Roll back the transaction by throwing an exception instead of returning. When an exception is thrown, database writes are not committed, so any rate limits the request already consumed will be reset to their previous values. When a rate limit fails, no state is persisted by the library.
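The first strategy might look roughly like the sketch below. It is written against a made-up Limiter interface rather than the library's functions, so check and consume here are stand-ins for checkRateLimit and rateLimit bound to a ctx, and it assumes everything runs inside one transaction.

```typescript
// Hypothetical types: stand-ins for the library's check/consume calls.
type LimitStatus = { ok: boolean; retryAt?: number };
type LimitRequest = { name: string; count: number };
type Limiter = {
  check: (req: LimitRequest) => Promise<LimitStatus>;
  consume: (req: LimitRequest) => Promise<LimitStatus>;
};

async function consumeAll(limiter: Limiter, requests: LimitRequest[]): Promise<LimitStatus> {
  // Check every limit first, without consuming anything.
  const statuses = await Promise.all(requests.map((req) => limiter.check(req)));
  const failed = statuses.filter((s) => !s.ok);
  if (failed.length > 0) {
    // Return the latest retryAt so the client doesn't come back too early.
    const retryAt = Math.max(...failed.map((s) => s.retryAt ?? 0));
    return { ok: false, retryAt };
  }
  // All checks passed, so consuming should succeed within the same transaction.
  for (const req of requests) {
    const status = await limiter.consume(req);
    if (!status.ok) {
      // Throwing rolls back anything consumed so far in a transactional environment.
      throw new Error("Rate limit " + req.name + " changed between check and consume");
    }
  }
  return { ok: true };
}
```

In the deadlock example above, Request A would check both x and y, see that y can't be satisfied, and return without touching x.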

To pass information back to the client, you can use ConvexError. This is what the library does if you specify throws: true:
```
if (args.throws) {
  throw new ConvexError({
    kind: "RateLimited",
    name: args.name,
    retryAt,
  });
}
```
Note: this might not be the maximum value of retryAt for your request, just the first it ran into. Hopefully your application rarely hits rate limits, so it’s ok for the client to retry later and need to wait again.

In general, I’d advise you to consolidate all of your rate limits into a single transaction (mutation) rather than calling out to multiple mutations from an action as you go, if all of the rate limits need to be satisfied for the operation to succeed.

Authenticating anonymous users

It sounds like an oxymoron, but there are a few strategies for authenticating users that haven’t logged in or authenticated using traditional methods.

Client-generated session ID (optimistic): This ID is generated in the browser and sent with requests to identify the user. This is only meaningful protection if you authorize the session ID using a strategy below, since a malicious user could generate new ones for each request.
Associate the session ID with an IP (lossy): You can have the client make a one-time HTTP request to an API endpoint that associates the session ID provided with an IP. You then rate-limit behaviors based on IP. This is handy, but ultimately a flawed approach, since many real users may share a virtual IP exposed by their ISP.
Use a Captcha or similar to authorize the session ID (robust): This is what I’d recommend. Anonymous users submit a captcha to prove they’re not bots, and associate the successful captcha with their session ID to be authorized to do any operation. At this point, you can rate limit their session ID as if it were a userId.

Summary

We looked at implementing application-layer rate limiting to help limit operations, either globally or specific to a user or other key. We looked at implementing both a token bucket and fixed window limits using just two numbers.

This is enabled by having an environment that:

Provides transactional guarantees to avoid race conditions with reading & writing values.
Automatically retries conflicting transactions.
Schedules work transactionally, to allow handling multiple rate limits independently and roll back everything if any of them fail.
Has fast access to a database with indexed lookups.

Beware: if you plan to implement these with Postgres or similar, watch out for read-modify-write behavior where, by default, you are exposed to data races, even within a transaction.

We also looked at adding the ability to reserve capacity ahead of time to ensure fairness, as well as handle bursts of traffic while still maintaining an overall average limit.

As always, let me know in our Discord what you think and what else you’d like to see out of the library.

Specifically having serializable isolation is really useful for this use case. “Read committed” isolation (the default for Postgres and other SQL variants) is vulnerable to race conditions for the code in this article, even in a transaction unless you jump through some hoops. And if you figure out how to do it for Drizzle without adding read locks for every row in a transaction, this person could use some help. ↩
For an article about a load balancing strategy that helps control costs and optimizes for throughput, check out my recent article on work stealing. ↩
The Convex scheduler is transactional within mutations. If the mutation throws an exception, no database writes will happen and no functions will be scheduled. ↩

Streaming HTTP Responses using fetch

Ian Macartney — Tue, 23 Jul 2024 23:56:37 +0000

This post looks at the JavaScript Streams API, which lets you make a fetch HTTP call and receive a streaming response in chunks. That allows a client to start responding to a server response more quickly and build UIs like ChatGPT.

As a motivating example, we’ll implement a function to handle the streaming LLM response from OpenAI (or any server using the same HTTP streaming API), using no npm dependencies, just the built-in fetch. The full code is here, including retries with exponential backoff, embeddings, non-streaming chat, and simpler APIs for interacting with chat completions and embeddings.

If you’re interested in seeing how to also return an HTTP stream to clients, check out this post.

Full example code

Here’s the full example. We’ll look at each piece below:

async function createChatCompletion(body: ChatCompletionCreateParams) {
  // Making the request
  const baseUrl = process.env.LLM_BASE_URL || "https://api.openai.com";
  const response = await fetch(baseUrl + "/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer " + process.env.LLM_API_KEY,
    },
    body: JSON.stringify(body),
  });
  // Handling errors
  if (!response.ok) {
    const error = await response.text();
    throw new Error(`Failed (${response.status}): ${error}`);
  }
  if (!body.stream) { // the non-streaming case
    return response.json();
  }
  const stream = response.body;
  if (!stream) throw new Error("No body in response");
  // Returning an async iterator
  return {
    [Symbol.asyncIterator]: async function* () {
      for await (const data of splitStream(stream)) {
        // Handling the OpenAI HTTP streaming protocol
        if (data.startsWith("data:")) {
          const json = data.substring("data:".length).trimStart();
          if (json.startsWith("[DONE]")) {
            return;
          }
          yield JSON.parse(json);
        }
      }
    },
  };
}

// Reading the stream  
async function* splitStream(stream: ReadableStream<Uint8Array>) {
  const reader = stream.getReader();
  let lastFragment = "";
  try {
    while (true) {
      const { value, done } = await reader.read();
      if (done) {
        // Flush the last fragment now that we're done
        if (lastFragment !== "") {
          yield lastFragment;
        }
        break;
      }
      const data = new TextDecoder().decode(value);
      lastFragment += data;
      const parts = lastFragment.split("\n\n");
      // Yield all except for the last part
      for (let i = 0; i < parts.length - 1; i += 1) {
        yield parts[i];
      }
      // Save the last part as the new last fragment
      lastFragment = parts[parts.length - 1];
    }
  } finally {
    reader.releaseLock();
  }
}

See the code here for a version that has nice typed overloads for streaming & non-streaming parameter variants, along with retries and other improvements.

The rest of the post is about understanding what this code does.

Making the request

This part is actually very easy. A streaming HTTP response comes from a normal HTTP request:

const baseUrl = process.env.LLM_BASE_URL || "https://api.openai.com";
const response = await fetch(baseUrl + "/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + process.env.LLM_API_KEY,
  },
  body: JSON.stringify(body),
});

The HTTP headers are sent up per usual, and you don’t have to set anything in particular to enable streaming. You can still leverage regular caching headers for HTTP streaming.

Handling errors

The story around errors on the client side is a little unfortunate for HTTP streaming. The upside is that for HTTP streaming, the client gets status codes right away in the initial response and can detect failure there. The downside to the HTTP protocol is that if the server returns success but then breaks mid-stream, there isn’t anything at the protocol level that will tell the client that the stream was interrupted. We’ll see below how OpenAI encodes an “all done” sentinel at the end to work around this.

if (!response.ok) {
  const error = await response.text();
  throw new Error(`Failed (${response.status}): ${error}`);
}

Reading the stream

In order to read an HTTP streaming response, the client can use the response.body property which is a ReadableStream allowing you to iterate over the chunks as they come in from the server using the .getReader() method.¹

const reader = response.body.getReader();
try {
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      const text = new TextDecoder().decode(value);
      //... do something with the chunk
    }
} finally {
  reader.releaseLock();
}

This handles every bit of data that we get back, but for the OpenAI HTTP protocol we are expecting the data to be JSON separated by newlines, so instead we will split up the response body and “yield” each line as they’re completed. We buffer the in-progress line into lastFragment and only return full lines that have been separated by two newlines:

// stream here is response.body
async function* splitStream(stream: ReadableStream<Uint8Array>) {
  const reader = stream.getReader();
  let lastFragment = "";
  try {
    while (true) {
      const { value, done } = await reader.read();
      if (done) {
        // Flush the last fragment now that we're done
        if (lastFragment !== "") {
          yield lastFragment;
        }
        break;
      }
      const data = new TextDecoder().decode(value);
      lastFragment += data;
      const parts = lastFragment.split("\n\n");
      // Yield all except for the last part
      for (let i = 0; i < parts.length - 1; i += 1) {
        yield parts[i];
      }
      // Save the last part as the new last fragment
      lastFragment = parts[parts.length - 1];
    }
  } finally {
    reader.releaseLock();
  }
}

If this function* and yield syntax is unfamiliar to you, just treat function* as a function that can return multiple things in a loop, and yield as the way of returning something multiple times from a function.
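To see the buffering on its own, here is a minimal sketch that applies the same logic to in-memory strings instead of a network stream. The splitChunks helper is hypothetical (it is not part of the original code) and only mirrors the lastFragment handling above:

```typescript
// Hypothetical helper: the same buffering as splitStream, but synchronous
// and over plain strings, so it is easy to experiment with.
function* splitChunks(
  chunks: Iterable<string>,
  separator = "\n\n"
): Generator<string> {
  let lastFragment = "";
  for (const chunk of chunks) {
    lastFragment += chunk;
    const parts = lastFragment.split(separator);
    // Everything except the final part is a complete message.
    for (let i = 0; i < parts.length - 1; i += 1) {
      yield parts[i];
    }
    // The final part may still be incomplete, so keep buffering it.
    lastFragment = parts[parts.length - 1];
  }
  // Flush whatever is left once the input ends.
  if (lastFragment !== "") {
    yield lastFragment;
  }
}

// A message split across two chunks is still yielded as one piece.
const splitExample = Array.from(
  splitChunks(['data: {"a"', ':1}\n\ndata: ', "[DONE]\n\n"])
);
```

Here splitExample ends up with two entries, one per complete message, even though the chunk boundaries fall in the middle of each.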

You can then loop over this splitStream function like:

for await (const data of splitStream(response.body)) {
  // data here is a full line of text. For OpenAI, it might look like
  // "data: {...some json object...}" or "data: [DONE]" at the end
}

If this "for await" syntax throws you off, it's using what’s called an “async iterator” - like a regular iterator you’d use with a for loop, but every time it gets the next value, it’s awaited.

For our example, when we’ve gotten some text from OpenAI and we’re waiting for more, the for loop will wait until splitStream yields another value, which will happen when await reader.read() returns a value that finishes one or more lines of text.

Next up we’ll look at another way of returning an async iterator that isn’t a function like splitStream, so a caller can use a “for await” loop to iterate over this data.

Returning an async iterator

Now that we have an async iterator returning full lines of text, we could just return splitStream(response.body), but we want to intercept each of the lines and transform them, while still letting the caller of our function iterate.

The approach is similar to the async function* syntax above. Here we’ll return an object the caller can iterate over directly, instead of an async generator function that has to be called first to get one. In TypeScript terms, the returned object is an AsyncIterable rather than an AsyncGenerator. An object becomes async-iterable by having a method under a certain key: Symbol.asyncIterator.²

      return {
        [Symbol.asyncIterator]: async function* () {
          for await (const data of splitStream(stream)) {
            //handle the data
            yield data;
          }
        },
      };

This is useful when you want to return something different from the data coming from splitStream. Every time a new line comes in from the streaming HTTP request, splitStream will yield it, this function will receive it in data and can do something before yielding it to its caller.
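As a self-contained illustration of that pattern, the sketch below wraps a stand-in source and transforms each value on the way through. The names fakeLines, stripPrefix and collectAll are made up for this example; fakeLines plays the role of splitStream:

```typescript
// Stand-in for splitStream: an async generator that yields a few lines.
async function* fakeLines(): AsyncGenerator<string> {
  yield "data: one";
  yield "data: two";
}

// Hypothetical wrapper: returns an object with a Symbol.asyncIterator method
// that transforms each value before handing it to the caller.
function stripPrefix(source: AsyncIterable<string>): AsyncIterable<string> {
  return {
    [Symbol.asyncIterator]: async function* () {
      for await (const line of source) {
        yield line.replace(/^data:\s*/, "");
      }
    },
  };
}

// Small helper that drains any async iterable into an array.
async function collectAll(source: AsyncIterable<string>): Promise<string[]> {
  const out: string[] = [];
  for await (const value of source) {
    out.push(value);
  }
  return out;
}
```

A caller can use `for await (const line of stripPrefix(fakeLines()))` without knowing whether it received a generator or a hand-built object.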

Next we’ll look at how to interpret this data specifically in the case of OpenAI’s streaming chat completion API.

Handling the OpenAI HTTP streaming protocol

The OpenAI response protocol is a series of lines that start with data: or event:, but we’ll just handle the data responses, since that’s the useful part for chat completions. There’s a sentinel of [DONE] if the stream is done, otherwise it’s just JSON.

for await (const data of splitStream(stream)) {
  if (data.startsWith("data:")) {
    const json = data.substring("data:".length).trimStart();
    if (json.startsWith("[DONE]")) {
      return;
    }
    yield JSON.parse(json);
  } else {
    console.debug("Unexpected data:", data);
  }
}
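Because the protocol itself won’t report a stream that was cut off, the sentinel can also double as a completeness check. The following is a sketch under that assumption, not code from the original project; parseDataLine and collectChunks are hypothetical helpers that work on already-split lines:

```typescript
type ParsedLine =
  | { kind: "chunk"; value: unknown }
  | { kind: "done" }
  | { kind: "ignored"; raw: string };

// Classify one line using the same rules as the loop above.
function parseDataLine(line: string): ParsedLine {
  if (!line.startsWith("data:")) {
    return { kind: "ignored", raw: line };
  }
  const payload = line.substring("data:".length).trim();
  if (payload === "[DONE]") {
    return { kind: "done" };
  }
  return { kind: "chunk", value: JSON.parse(payload) };
}

// Hypothetical helper: returns the parsed chunks plus whether the
// "[DONE]" sentinel was seen before the lines ran out.
function collectChunks(lines: string[]): {
  chunks: unknown[];
  complete: boolean;
} {
  const chunks: unknown[] = [];
  for (const line of lines) {
    const parsed = parseDataLine(line);
    if (parsed.kind === "done") {
      return { chunks, complete: true };
    }
    if (parsed.kind === "chunk") {
      chunks.push(parsed.value);
    }
  }
  // No sentinel: the stream ended early and the result may be truncated.
  return { chunks, complete: false };
}
```

A caller could treat `complete: false` as a signal to show a retry option or mark the message as partial.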

Bringing it all together

Now that you understand HTTP streaming, you can feel confident working directly with streaming APIs without relying on SDKs or libraries. This allows you to hide latency, as your UI can immediately start updating, without consuming more bandwidth with multiple requests. You can use the above function like you would with the official openai npm package:

  const response = await createChatCompletion({
    model: "llama3",
    messages: [...your messages...],
    stream: true,
  });
  for await (const chunk of response) {
    if (chunk.choices[0].delta?.content) {
      console.log(chunk.choices[0].delta.content);
    }
  }

See the code here, which also includes some utility functions that make this even easier by pre-configuring the model and extracting the .choices[0].delta.content:

const response = await chatStream(messages);
for await (const content of response) {
  console.log(content);
}

Before you copy the code, try to implement it yourself as an exercise in async functions.

More resources

For information about returning HTTP streaming data from your own server endpoint, check out this post on AI Chat with HTTP Streaming that both streams data from OpenAI (or similar) to your server and simultaneously streams it down to a client, while doing custom logic as it goes (such as saving chunks to a database).
The MDN docs, as always, are great. Beyond the links above, here’s a guide on the readable streams API that shows how to connect a readable stream to an <img> tag to stream in an image request. Note: this guide uses response.body as an async iterator, but currently that is not widely implemented and not in the TypeScript types.

Note: you can only have one reader of the stream at a time, so you generally don’t call .getReader() multiple times - you probably want .tee() in that case, and if you do want to call .getReader() again for some reason, make sure the first reader has called .releaseLock() first. ↩
If you aren’t familiar with Symbol, it’s a way to have keys in an object that aren’t strings or numbers. That way they don’t conflict if you added a key named asyncIterator. You could access the function with myIterator[Symbol.asyncIterator](). ↩

Multiple apps on a single domain hosted on sub-paths

Ian Macartney — Tue, 23 Jul 2024 23:52:14 +0000

tl;dr It's possible to deploy multiple apps on the same domain by serving each at a different sub-path, using a simple vercel.json configuration, and custom build commands.

Even a static SPA React app with client-side routing can be configured to be served at a custom path.

When developing my llama farm chat demo, I had been developing it like a normal app, with the index at / on http://localhost:5173/, but then needed to host the app entirely under the /llama-farm sub-path at labs.convex.dev/llama-farm, since labs.convex.dev also hosts auth, Ents, a million checkboxes clone and other independent projects. It uses a client-side router (React Router), which needs all routes to get redirected to the same index.html, and each page needs to get served relative to that subpath, without littering the codebase with a prefix for every route.

In this post I'll cover:

Hosting your Vite-based React app with client-side router under a sub-path on Vercel.
Hosting multiple apps on subpaths within a single project.
Configuring multiple sub-paths for the same app in a single vercel.json.

Serving on a subpath

To change my app to work on a sub-path on Vercel, the four changes that I needed to make were:

Setting the public base path by specifying --base=/llama-farm in my vite build command.
Setting build.outDir by passing --outDir=dist/llama-farm so it would create the assets at the relative path Vercel expected. You can see the full build command below.
Setting the basename field for React Router on createBrowserRouter to { basename: "/llama-farm" }. In order to test locally at / I made this an environment variable VITE_BASEPATH that I only set in Vercel. You can see my code here.
Adding a rewrites configuration to my vercel.json file:

{
  "rewrites": [
    {
      "source": "/llama-farm(.*)",
      "destination": "/llama-farm"
    }
  ]
}

On Vercel my build command ended up looking like:

vite build --base=$VITE_BASEPATH --outDir=dist$VITE_BASEPATH

I set the VITE_BASEPATH environment variable to /llama-farm so it was shared between the build command and React Router config.
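On the React Router side, one way to share that variable is to normalize it before passing it as the basename. This is a sketch rather than the code from my repo: resolveBasename is a hypothetical helper, and the fallback to / when the variable is unset is an assumption that matches testing locally at the root.

```typescript
// Hypothetical helper: turn an optional VITE_BASEPATH value into a basename.
// Falls back to "/" when the variable is unset, e.g. in local development.
function resolveBasename(raw: string | undefined): string {
  if (!raw || raw === "/") {
    return "/";
  }
  // Ensure a single leading slash and no trailing slash.
  const withLeading = raw.startsWith("/") ? raw : `/${raw}`;
  return withLeading.replace(/\/+$/, "");
}

// In a Vite app this could be wired up as (assumption, shown as a comment
// so the snippet runs on its own):
//   createBrowserRouter(routes, {
//     basename: resolveBasename(import.meta.env.VITE_BASEPATH),
//   });
```

Normalizing in one place means the same value works whether it was entered as llama-farm, /llama-farm or /llama-farm/ in the project settings.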

Adding in Convex

One more layer is if our app uses Convex for the backend, in which case we can also deploy our Convex backend at the same time. For this, we wrap our build command with npx convex deploy --cmd '<build command here>'.¹ This will build the Vite app with the VITE_CONVEX_URL set, which tells the frontend where to find the corresponding backend. This is especially useful for preview deploys which each have their own backend. As an additional safeguard, if the build fails it won't deploy your backend.

The full updated build command:

npx convex deploy --cmd 'vite build --base=$VITE_BASEPATH --outDir=dist$VITE_BASEPATH'

Now my chat app is being served on llama-farm-chat.vercel.app/llama-farm. In the next section I'll connect it to the app running labs.convex.dev so I can access it on labs.convex.dev/llama-farm.

Serving multiple subpaths from another project

To serve my app from the labs.convex.dev app, I'll edit the vercel.json for the Convex Labs project to handle redirecting and rewriting requests to my app.

redirects will serve assets from the same relative path, but from elsewhere. In my case the .js and .css files are in the llama-farm/assets/ directory.
rewrites will take into account the rewrites in my llama farm project, both for the base url /llama-farm and any subpaths. These will all get rewritten to /llama-farm which will serve the index.html file there.

{
  "redirects": [
    {
      "source": "/llama-farm/assets/:filePath",
      "destination": "https://llama-farm-chat.vercel.app/llama-farm/assets/:filePath",
      "permanent": true
    },
    //... other projects
  ],
  "rewrites": [
    {
      "source": "/llama-farm",
      "destination": "https://llama-farm-chat.vercel.app/llama-farm"
    },
    {
      "source": "/llama-farm/:match*",
      "destination": "https://llama-farm-chat.vercel.app/llama-farm/:match*"
    },
    //... other projects
  ]
}

You can read more about redirects and rewrites in the Vercel docs.

Configuring multiple subpaths

You might want to build the same project and host it on different sub-paths on different domains. For example, I also host my app at llamafarm.chat.

I'll start by saying there are many Vercel community discussions about how using environment variables to configure vercel.json is not supported. The recommendation is to use edge function middleware or a Next.js config. For this app I'd rather not pay for serverless functions or migrate it to Next.js.

However, I have a configuration that is working for me. I've found that I can put both rewrite configs into one vercel.json and both apps will route correctly.

vercel.json:

{
  "rewrites": [
    {
      "source": "/llama-farm(.*)",
      "destination": "/llama-farm"
    },
    {
      "source": "/(.*)",
      "destination": "/"
    }
  ]
}

On labs.convex.dev/llama-farm, it makes sense that it will only ever match against incoming urls starting with /llama-farm since the labs.convex.dev project only rewrites those urls.
However, I'm a bit confused why llamafarm.chat correctly routes /llama-farm to /index.html instead of looking for /llama-farm. My guess is that it can't find any /llama-farm/index.html at build time, so ignores the first rule for that project. If you know more, please reach out! I'm just happy that it works.

Summary

We looked at hosting a client-routed SPA (in my case a Vite app using React Router) on a base path, serving multiple apps on the same project, and configuring the rewrite paths for multiple environments in a single vercel.json.

I hope it helps you out! Come chat in our community Discord if you have any comments or feedback.

Note when pointing multiple Vercel projects at one Convex backend: I choose to only deploy the Convex backend from one project, and just do the frontend-only deploy from the other. For the one that is only deploying the frontend, I set the CONVEX_SITE_URL environment variable instead of the CONVEX_DEPLOY_KEY. This means that all frontend preview deployments for this project are talking to the prod backend. ↩

GPT Streaming With Persistent Reactivity

Ian Macartney — Mon, 10 Jul 2023 22:31:14 +0000

ChatGPT-powered experiences feel snappier when the responses show up incrementally. Instead of waiting for the full response before showing the user anything, streaming the text in allows them to start reading immediately.

OpenAI exposes a streaming API for chat completions. But how do you manage a streaming request, when you have a server between the client and OpenAI? You might be tempted to use HTTP streaming end to end - both from the client to the server and the server to OpenAI. However, there’s another way that comes with some big benefits. Spoiler: it’s possible to use a database as a layer of reactivity that separates client request lifecycles from server requests. Don’t worry if that doesn’t make sense yet - we’ll take it one step at a time.

This post will look at working with streams with OpenAI’s beta v4.0.0 Node SDK. Beyond just getting streaming for a single user, we’ll look at an approach that enables:

Persisting the response even if the user closes their browser.
Multiplayer chat, including streaming multiple ChatGPT messages at once.
Resuming a stream when a user refreshes their browser mid-stream.
Streaming to multiple users at once.
Implementing custom stream granularity, such as only updating on full words or sentences, rather than on each token.

To do this, we’ll use Convex to store the messages and make the request to OpenAI. This code is on GitHub for you to clone and play with.

Persisting messages

Let’s say we have a chat app, like the one pictured in the gif above. We want to store the messages from each user, as well as messages populated by responses from OpenAI. First let’s look at how data is stored (2), assuming a client sends a message (1).

When a user sends a message, we immediately commit it to the database, so they’re correctly ordered by creation time. This code is executed on the server:

export const send = mutation({
  args: { body: v.string(), author: v.string() },
  handler: async ({ db, scheduler }, { body, author }) => {
    // Save our message to the DB.
    await db.insert("messages", { body, author });

    if (body.indexOf("@gpt") !== -1) {
      // ...see below
    }
  }
});

This mutation saves the message to the database. When the user wants a response from the GPT model (by adding “@gpt” to the message), we will:

Store a placeholder message to update later.
Make a streaming request to OpenAI in an asynchronous background function.
Progressively update the message as the response streams in.

By running the streaming request asynchronously (versus blocking in a user request), we can interact with ChatGPT and save the data to the database even if the client has closed their browser. It also allows us to run many requests in parallel, from the same or multiple users.

We also run it asynchronously because, in Convex, mutations are pure transactions and as such can’t do non-deterministic things like making API requests. In order to talk to third-party services, we can use an action. Actions are non-transactional serverless functions that can talk to third-party services. We trigger the background job to call ChatGPT and update the message body by scheduling the action like so:

// ...when the user wants to send a message to OpenAI's GPT model
const messages = // fetch recent messages to send as context
// Insert a message with a placeholder body.
const messageId = await db.insert("messages", {
  author: "ChatGPT",
  body: "...",
});
// Schedule an action that calls ChatGPT and updates the message.
await scheduler.runAfter(0, internal.openai.chat, { messages, messageId });

We schedule it for zero milliseconds later, similar to doing setTimeout(fn, 0). The message writing and action scheduling happens transactionally in a mutation, so we will only run the action if the messages are successfully committed to the database.

When the action wants to update the body of a message as the streaming results come in, it can invoke an update mutation with the messageId from above:

export const update = internalMutation({
  args: { messageId: v.id("messages"), body: v.string() },
  handler: async ({ db }, { messageId, body }) => {
    await db.patch(messageId, { body });
  },
});

Note: An internalMutation is just a mutation that isn’t exposed as part of the public API. Next we’ll look at the code that calls this update function.

Convex has end-to-end reactivity, so when we update the messages in the database, the UI automatically updates. See below what it looks like to reactively query data.

Streaming with the OpenAI node SDK

Streaming is currently available in the beta version of OpenAI’s node SDK. To install it:

npm install openai@4.0.0-beta.2

This is very similar to previous releases of the openai package, with a few nuances you can read about here.

The internal.openai.chat action we referenced above will live in convex/openai.ts - see the full code here. One important detail is that it needs to run in the node runtime to support some dependencies in the openai package, which means it needs to have "use node"; as the first line in the file.

"use node";
import { OpenAI } from "openai";
import { internalAction } from "./_generated/server";
//...
type ChatParams = {
  messages: Doc<"messages">[];
  messageId: Id<"messages">;
};
export const chat = internalAction({
  handler: async ({ runMutation }, { messages, messageId }: ChatParams) => {
    //...Create and handle a stream request

Creating a stream request

// inside the chat function in convex/openai.ts
const apiKey = process.env.OPENAI_API_KEY!;
const openai = new OpenAI({ apiKey });

const stream = await openai.chat.completions.create({
  model: "gpt-3.5-turbo", // "gpt-4" also works, but is so slow!
  stream: true,
  messages: [
    {
      role: "system",
      content: "You are a terse bot in a group chat responding to q's.",
    },
    ...messages.map(({ body, author }) => ({
      role:
        author === "ChatGPT" ? "assistant" : "user",
      content: body,
    })),
  ],
});
//...handling the stream

The main difference here from previous openai versions (aside from their new simplified client configuration) is passing stream: true. This changes the return format, which unfortunately does not currently provide token usage as the non-streaming version does. I hope this is fixed in a future release, as keeping track of token usage is useful to know how different users or features are affecting your costs.

Handling the stream

The API exposed by the openai SDK makes handling the stream very easy. We use an async iterator to handle each chunk, appending it to the body and updating the message body with everything we’ve received so far:

let body = "";
for await (const part of stream) {
  if (part.choices[0].delta?.content) {
    body += part.choices[0].delta.content;
    await runMutation(internal.messages.update, {
      messageId,
      body,
    });
  }
}

Note that here we’re updating the message every time the body updates, but we could implement custom granularity by deciding when to call runMutation, such as on word breaks or at the end of full sentences.
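As one possible take on that granularity, the sketch below only emits a new body when the accumulated text ends in sentence punctuation, plus a final flush at the end. The sentenceSnapshots helper and its punctuation rule are assumptions for illustration; in the action you would call runMutation for each snapshot instead of collecting them.

```typescript
// Hypothetical helper: given the content deltas from the stream, yield the
// accumulated body only at sentence boundaries, and once more at the end.
function* sentenceSnapshots(deltas: Iterable<string>): Generator<string> {
  let body = "";
  let lastEmitted = "";
  for (const delta of deltas) {
    body += delta;
    // Treat ".", "!" or "?" (optionally followed by whitespace) as a boundary.
    if (/[.!?]\s*$/.test(body) && body !== lastEmitted) {
      lastEmitted = body;
      yield body;
    }
  }
  // Make sure a trailing partial sentence is not lost.
  if (body !== lastEmitted) {
    yield body;
  }
}

const snapshots = Array.from(
  sentenceSnapshots(["Hel", "lo.", " How", " are", " you?"])
);
```

For these five deltas, snapshots holds two bodies instead of five, so the database is written twice rather than once per token.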

This action allows us to stream messages from OpenAI to our server function and into the database. But how does this translate to clients updating in real time? Next, let’s see how the client reactively updates as messages are created and updated.

Client “streaming” via subscriptions

After the previous sections, you might be surprised how little is required to get the client to show live updating messages. I put streaming in quotes since we aren’t using HTTP streaming here - instead, we’re just using the reactivity provided out-of-the-box by Convex.

On the client, we use the useQuery hook, which calls the api.messages.list server function in the messages module, which we’ll see in a second. This hook will give us an updated list of messages every time a message is added or modified. This is a special property of a Convex query: it tracks the database requests, and when any of the data is changed it:

Invalidates the query cache (which is managed transparently by Convex).
Recomputes the result.
Pushes the new data over a WebSocket to all subscribed clients.

export default function App() {
  const messages = useQuery(api.messages.list);
  ...
  return (
    ...
    {messages?.map((message) => (
      <article key={message._id}>
        <div>{message.author}</div>
        <p>{message.body}</p>
      </article>
    ))}

Because this query is decoupled from the HTTP streaming response from OpenAI, multiple browsers can be subscribed to updates as messages change. And if a user refreshes or restarts their browser, it will just pick up the latest results of the query.

On the server, this is the query that grabs the most recent 100 messages:

export const list = query({
  handler: async ({ db }): Promise<Doc<"messages">[]> => {
    // Grab the most recent messages.
    const messages = await db.query("messages").order("desc").take(100);
    // Reverse the list so that it's in chronological order.
    // Alternatively, return it reversed and flip the order via flex-direction.
    return messages.reverse();
  },
});

Convex is doing some magic under the hood. If any message is inserted or updated in the database that would match this query - for instance if a new message is added or one of the most recent 100 messages is edited - then it will automatically re-execute this query (if there are any clients subscribed to it via useQuery). If the results differ, it will push the new results over a WebSocket to the clients, which will trigger an update to the components using useQuery for that query.

To give you a sense of performance, list takes ~17ms and update takes ~7ms for me on the server, so the total latency between a new token coming from OpenAI and a new set of messages being sent to the client is very fast. The gifs in this article are real recordings, not sped up.

Summary

We looked at how to stream ChatGPT responses into Convex, allowing clients to watch the responses, without the flakiness of browser-based HTTP streaming requests. The full code is available here. Let us know in Discord what you think!

Extra Credit 🤓

Beyond what’s covered here, it would be easy to extend this demo to:

Store whether a message has finished streaming by storing a boolean on the message updated at the end of the stream.
Add error handling, to mark a message as failed if the stream fails. See this post for an example of updating a message in the case of failure.
Schedule a function to serve as a watchdog, that marks a message as timed out if it hasn’t finished within a certain timeframe, just in case the action failed. See this post for more details, as well as other patterns for background jobs.
Organize the messages by thread or user, using indexes.
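To make the first three ideas concrete, here is a rough sketch of the decision a watchdog could make. Everything in it is hypothetical: the demo’s messages only store author and body, so the status and startedAt fields and the watchdogStatus function are assumptions about how you might extend the schema.

```typescript
// Hypothetical statuses for a streamed message.
type StreamStatus = "streaming" | "done" | "failed" | "timedOut";

// Hypothetical message shape; the demo's schema only has author and body.
interface StreamedMessage {
  body: string;
  status: StreamStatus;
  startedAt: number; // milliseconds since epoch
}

// What a watchdog, scheduled to run some time after the request started,
// could decide when it looks at the message.
function watchdogStatus(
  message: StreamedMessage,
  now: number,
  timeoutMs: number
): StreamStatus {
  // Only a message that is still streaming can time out.
  if (message.status !== "streaming") {
    return message.status;
  }
  return now - message.startedAt >= timeoutMs ? "timedOut" : "streaming";
}
```

The watchdog itself would be a scheduled mutation that reads the message, calls something like this, and patches the status if it changed.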

Using Pinecone and Embeddings

Ian Macartney — Mon, 10 Jul 2023 21:34:11 +0000

Looking to implement semantic search or add on-demand context to a GPT prompt so it doesn’t just make shit up (as much)? Pinecone and Convex are a good match when you’re looking to build an app that leverages embeddings and also has user data. Pinecone stores and queries over vectors efficiently, and Convex stores relational & document data with strong transaction guarantees and nifty end-to-end data reactivity.

Let’s walk through how this shakes out in practice. If you want to see some code you can play around with, check out this GitHub repo where you can add your own data and compare it and search over it using Convex and Pinecone.

High-level user flow

To start, what’s an example? With Pinecone and Convex, you can have a flow like this:

A user submits a question and starts subscribing to the question’s results. Under the hood, Convex has stored the question in a document and kicked off an asynchronous action. If this question has been asked before, it might re-use previous results.
The action creates an embedding using a service like OpenAI or Cohere. It can persist this embedding for posterity or to be able to search for similar questions.
The action uses the embedding to query Pinecone (or any vector store) for related documents, products, or whatever your embeddings represent.
The action stores the results in the question document, which automatically reflows to update the user’s client with the new data - potentially returning materialized data pulled from other documents in the database associated with the results.
If this is part of a broader chain of operations, it might use the related documents to compose a prompt to an LLM like ChatGPT, using both the related documents and the question to get a more contextual answer.

A word on streaming updates

At every step, updates written to the Convex database will update all subscribed clients. Unlike raw HTTP streaming, Convex subscriptions can be trivially consumed by multiple clients in parallel and are resilient to network issues or page refreshes. The data received will be from a consistent snapshot of the database state, making it easier to reason about correctness.

Many of the code snippets below can be found in this GitHub repo which you’re welcome to play around with using your own data and API keys. If you are desperate for a hosted playground, let me know in Discord!

Adding data to Convex and Pinecone

Depending on the application, you may have a large mostly-static corpus of data, or be continually adding data — which I’ll refer to as a source below. The process looks something like the following:

Break up your source into bite-sized chunks.

This helps limit how much data you pass (embedding models have context limits), as well as make the embedding more targeted. You could imagine an embedding of this whole post might not rank as highly against “How do you add data to Convex and Pinecone?” as an embedding that just covered this section.

To do this, you can split it yourself or use a library like LangChain’s RecursiveCharacterTextSplitter:

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: ChunkSize,
});
const splitTexts = await textSplitter.createDocuments([pageContent]);
const chunks = splitTexts.map((chunk) => ({
  text: chunk.pageContent,
  lines: chunk.metadata.loc.lines,
}));

You can tune the size and algorithm to your own needs. One tip is to add some overlap, so each chunk has some text from the previous and next sections.
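To show what overlap means in practice, here is a deliberately simple character-based splitter. It is a sketch for illustration only, not LangChain’s algorithm (which tries to split on natural boundaries), and the chunkWithOverlap name and parameters are made up:

```typescript
// Hypothetical splitter: fixed-size character chunks where each chunk
// repeats the last `overlap` characters of the previous one.
function chunkWithOverlap(
  text: string,
  chunkSize: number,
  overlap: number
): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const overlapExample = chunkWithOverlap("abcdefghij", 4, 1);
```

With a chunk size of 4 and an overlap of 1, each chunk in overlapExample starts with the last character of the one before it.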

Store the source in the database

The Convex database is a great place to store all the metadata that your app will want. For embeddings based on text, you’ll likely even want to store the chunk of text in the database, so you can quickly access it to return as part of queries for a client or as part of a pipeline. For larger data, like video, it makes more sense to store the data in file storage.

Importantly, you should not store the text chunk directly in Pinecone metadata, as it can quickly fill up the index because all metadata is indexed by default in Pinecone.

async function addSource(
  db: DatabaseWriter,
  name: string,
  chunks: { text: string; lines: { from: number; to: number } }[]
) {
  const sourceId = await db.insert("sources", {
    name,
    chunkIds: [],
    saved: false,
  });
  const chunkIds = await Promise.all(
    chunks.map(({ text, lines }, chunkIndex) =>
      db.insert("chunks", {
        text,
        sourceId,
        chunkIndex,
        lines,
      })
    )
  );
  await db.patch(sourceId, { chunkIds });
  return (await db.get(sourceId))!;
}

There are a few things to note here:

I’m both saving a back-reference from chunks to sources, as well as a forward reference from a source to many chunks. As discussed in this post on database relationships, this is a way to get quick access in both directions without having to define an extra index when you have a small number of relations (1024 technically but 100 is my rule of thumb).
I’m saving an empty array at first, then patching it with the chunk IDs once I insert them. Convex generates unique IDs on insertion. At this time you can’t pre-allocate or specify custom primary IDs for documents.
I’m creating the source with saved: false - we’ll update this once we’ve saved the embeddings into Pinecone. This allows the client to know the insertion status, as well as help with transient failures, which we’ll see later on.
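The two-way references from the first point can be pictured with plain in-memory maps. This is only an illustration of the data shape: the maps stand in for the Convex tables, and chunksForSource and sourceForChunk are hypothetical helpers, not functions from the repo.

```typescript
// Hypothetical in-memory stand-ins for the "sources" and "chunks" tables.
interface SourceDoc {
  _id: string;
  chunkIds: string[];
}
interface ChunkDoc {
  _id: string;
  sourceId: string;
  text: string;
}

const sourcesById = new Map<string, SourceDoc>();
const chunksById = new Map<string, ChunkDoc>();

// Forward reference: source -> its chunks, via direct ID lookups.
function chunksForSource(sourceId: string): ChunkDoc[] {
  const source = sourcesById.get(sourceId);
  if (!source) return [];
  return source.chunkIds
    .map((id) => chunksById.get(id))
    .filter((chunk): chunk is ChunkDoc => chunk !== undefined);
}

// Back-reference: chunk -> the source it came from.
function sourceForChunk(chunkId: string): SourceDoc | undefined {
  const chunk = chunksById.get(chunkId);
  return chunk ? sourcesById.get(chunk.sourceId) : undefined;
}
```

Both directions are a handful of ID lookups, which is why no extra index is needed while the number of chunks per source stays small.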

Kick off a background action

Mutations in Convex are transactions, so they are prohibited from having non-transactional side effects like calling a third-party service. A Convex action is non-transactional and can talk to the outside world. But how do you “call” an action from a mutation if the mutation can’t have side effects? A pattern I really like is to schedule the action from the mutation, so that it executes after the mutation commits:

await scheduler.runAfter(0, api.sources.addEmbedding, {
  source,
  texts: chunks.map(({ text }) => text),
});

Thanks to Convex’s strong transaction guarantees, the action is only invoked if the mutation successfully commits, so you’ll never have an action running for a source that doesn’t exist.

Create an embedding

From our action, we can fetch embeddings. See this post for more information on what embeddings are. See the code for fetchEmbeddingBatch here.

const { embeddings } = await fetchEmbeddingBatch(texts);

Upsert into Pinecone

Adding data into Pinecone is a straightforward operation. “Upsert” for those unfamiliar is an update if the specified id already exists, otherwise it inserts.

await upsertVectors(
  "chunks", // namespace
  source.chunkIds.map((id, chunkIndex) => ({
    id,
    values: embeddings[chunkIndex],
    metadata: { sourceId: source._id, textLen: texts[chunkIndex].length },
  }))
);

Tips:

We aren’t including much metadata here - in general, you should only store metadata that you might want to use to limit Pinecone queries - such as keywords, categories, or in this case text length¹.
We’re re-using the Convex document ID for the pinecone vector. This isn’t required—you could make up your own ID and store that in the Convex document—but I find it very handy. Results of Pinecone queries, without returning metadata, can be used directly with db.get which is wicked fast. It also means you can fetch or delete the Pinecone vector for a given chunk, without storing an extra ID.
I used the table name as the Pinecone namespace for convenience, so queries for chunks wouldn’t return vectors for other data. This isn’t required but helped me with organization and naming fatigue.

Tip: Use the @pinecone-database/pinecone Pinecone client for the best experience in Convex.

There are two action runtimes in Convex: our optimized runtime, and a generic node environment. When possible I prefer using the optimized runtime, so I can keep the actions in the same file as the queries and mutations, along with some performance benefits. However, our runtime doesn’t support all npm libraries. Thankfully the pinecone package doesn’t depend on any incompatible packages and just uses the fetch standard under the hood. This is also why I prefer using fetch and the OpenAI HTTP API directly above. See here for more information on runtimes.

Mark the source as “saved”

All we need to do to notify the frontend that the data has been saved is to update the source. Any queries that reference the source document will be updated with the latest data automatically.

await runMutation(api.sources.patch, {
  id: source._id,
  patch: { saved: true, totalTokens, embeddingMs },
});

At this point, our data is in Convex and an embedding vector is saved in Pinecone.

Extensions

Beyond saving chunks of the source, you might also consider:

Adding an embedding of a summary of the whole source.
Adding a hierarchy of searches, where you could separately search for a category of documents and then provide that category as a metadata filter in a later query.
Namespacing or otherwise segmenting user data so you never accidentally leak context between users.

Searching for sources

Similarly to inserting data, to do a semantic search over your documents, you can:

Insert the search into a table of searches. If there’s already an identical search, you could even decide to re-use those results. This is handy for iterating on large pipelines and keeping latency and costs low.
```
const searchId = await db.insert("searches", { input, count });
```

Kick off an action transactionally.

await scheduler.runAfter(0, api.searches.search, {
  input,
  searchId,
  topK: count,
});

Create an embedding of the search. Aside: I have a hunch there’s a lot of opportunity here for ways of transforming the raw search into a better text input for the embedding.
```
const { embedding } = await fetchEmbedding(input);
```

Use the Pinecone query to find nearby vectors representing chunks.

const { matches } = await pinecone.query({
  queryRequest: {
    namespace: "chunks",
    topK,
    vector: embedding,
  },
});

Update the search results by running a mutation.

await runMutation(api.searches.patch, {
  id: searchId,
  patch: {
    relatedChunks,
  },
});

Optional: store the search embedding in Pinecone if you want to be able to search semantically over searches themselves!

Note: you could just do steps 2-4 directly in an action if you don’t care about keeping a cache and storing the search vector.

Returning results to the client

The client can subscribe to the search document’s ID:

const results = useQuery(api.searches.semanticSearch, { searchId });

The query looks for the search and returns the related chunks along with their source’s name:

export const semanticSearch = query(
  async ({ db }, { searchId }: { searchId: Id<"searches"> }) => {
    const search = (await db.get(searchId))!;
    if (!search.relatedChunks) return null;
    return pruneNull(
      await Promise.all(
        search.relatedChunks.map(async ({ id, score }) => {
          const chunk = await db.get(id);
          if (!chunk) return null;
          const source = await db.get(chunk.sourceId);
          return { ...chunk, score, sourceName: source!.name };
        })
      )
    );
  }
);

This is parallelized by using Promise.all and calls to db.get are cached.
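
The pruneNull helper isn’t shown in this post. A minimal sketch, assuming it simply drops null entries from a list and narrows the element type, could look like this:

```typescript
// Hypothetical implementation: the post uses pruneNull without defining it.
// Drops null entries and narrows the element type from T | null to T.
function pruneNull<T>(list: (T | null)[]): T[] {
  return list.filter((item): item is T => item !== null);
}

// Example: a chunk that was deleted after the search ran is skipped.
const kept = pruneNull([{ score: 0.9 }, null, { score: 0.7 }]);
console.log(kept.length); // 2
```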

Summary

In this post, we looked at using Pinecone and Embeddings in Convex. One natural extension of this is to then use the sources as part of a GPT prompt template, but I’ll leave that for a future post. Let us know in Discord what you think and what you’d like to see next!

At the Pinecone hackathon, there was a discussion of semantic rankings sometimes behaving oddly. In my case, when I searched over a corpus of famous poems for “what is the meaning of life?” one of the top hits was a “hello world” dummy text I had added. One participant mentioned that a useful filter, after listing a plethora of sophisticated ranking strategies, was to just exclude text shorter than 200 characters. Intuitively this makes some sense, as short phrases probably have higher semantic variance. ↩
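
The length filter mentioned in this footnote can be sketched as a post-processing step over search results. The Match shape and the dropShortMatches name are assumptions for illustration, not part of the Pinecone client:

```typescript
// Hypothetical shape of a search result joined with its chunk text.
type Match = { id: string; score: number; text: string };

// Drop matches whose text is shorter than minChars, keeping the ranking order.
function dropShortMatches(matches: Match[], minChars = 200): Match[] {
  return matches.filter((match) => match.text.length >= minChars);
}

const sampleMatches: Match[] = [
  { id: "a", score: 0.91, text: "hello world" },
  { id: "b", score: 0.88, text: "x".repeat(250) },
];
console.log(dropShortMatches(sampleMatches).map((m) => m.id)); // ["b"]
```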

The Magic of Embeddings

Ian Macartney — Tue, 06 Jun 2023 16:00:00 +0000

How similar are the strings “I care about strong ACID guarantees” and “I like transactional databases”? While there are a number of ways we could compare these strings (syntactically or grammatically, for instance), one powerful thing AI models give us is the ability to compare these semantically, using something called embeddings. Given a model, such as OpenAI’s text-embedding-ada-002, I can tell you that the aforementioned two strings have a similarity of 0.784, and are more similar than “I care about strong ACID guarantees” and “I like MongoDB” 😛. With embeddings, we can do a whole suite of powerful things:¹

Search (where results are ranked by relevance to a query string)
Clustering (where text strings are grouped by similarity)
Recommendations (where items with related text strings are recommended)
Anomaly detection (where outliers with little relatedness are identified)
Diversity measurement (where similarity distributions are analyzed)
Classification (where text strings are classified by their most similar label)

This article will look at working with raw OpenAI embeddings.

What is an embedding?

An embedding is ultimately a list of numbers that describe a piece of text, for a given model. In the case of OpenAI’s model, it’s always a 1,536-element-long array of numbers. Furthermore, for OpenAI, the numbers are all between -1 and 1, and if you treat the array as a vector in 1,536-dimensional space, it has a magnitude of 1 (i.e. it’s “normalized to length 1” in linear algebra lingo).
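
To make “normalized to length 1” concrete, here is a small sketch that computes a vector’s magnitude and rescales it. The helper names are made up for illustration; OpenAI’s embeddings already arrive normalized, so you wouldn’t need to do this yourself for that model.

```typescript
// Euclidean length (magnitude) of a vector.
function magnitude(vector: number[]): number {
  return Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
}

// Scale a vector so its magnitude is 1. (Not meaningful for an all-zero vector.)
function normalize(vector: number[]): number[] {
  const vectorLength = magnitude(vector);
  return vector.map((val) => val / vectorLength);
}

const unit = normalize([3, 4]);
console.log(unit); // [0.6, 0.8]
console.log(magnitude(unit)); // approximately 1
```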

On a conceptual level, you can think of each number in the array as capturing some aspect of the text. Two arrays are considered similar to the degree that they have similar values in each element in the array. You don’t have to know what any of the individual values correspond to—that’s both the beauty and the mystery of embeddings—you just need to compare the resulting arrays. We’ll look at how to compute this similarity below.

Depending on what model you use, you can get wildly different arrays, so it only makes sense to compare arrays that come from the same model. It also means that different models may disagree about what is similar. You could imagine one model being more sensitive to whether the string rhymes. You could fine-tune a model for your specific use case, but I’d recommend starting with a general-purpose one, for similar reasons as to why you’d generally pick ChatGPT over fine-tuned text generation models.

It’s beyond the scope of this post, but it’s worth mentioning that while we’re just looking at text embeddings here, there are also models to turn images and audio into embeddings, with similar implications.

How do I get an embedding?

There are a few models to turn text into an embedding. To use a hosted model behind an API, I’d recommend OpenAI, and that’s what we’ll be using in this article. For open-source options, you can check out all-MiniLM-L6-v2 or all-mpnet-base-v2.

Assuming you have an API key in your environment variables, you can get an embedding via a simple fetch:

export async function fetchEmbedding(text: string) {
  const result = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + process.env.OPENAI_API_KEY,
    },
    body: JSON.stringify({
      model: "text-embedding-ada-002",
      input: [text],
    }),
  });
  const jsonresults = await result.json();
  return jsonresults.data[0].embedding;
}

For efficiency, I’d recommend fetching multiple embeddings at once in a batch.

export async function fetchEmbeddingBatch(texts: string[]) {
  const result = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + process.env.OPENAI_API_KEY,
    },
    body: JSON.stringify({
      model: "text-embedding-ada-002",
      // Pass the array itself; the API accepts a list of input strings.
      input: texts,
    }),
  });
  const jsonresults = await result.json();
  const allembeddings = jsonresults.data as {
    embedding: number[];
    index: number;
  }[];
  // Sort ascending by index so the results line up with the input order.
  allembeddings.sort((a, b) => a.index - b.index);
  return allembeddings.map(({ embedding }) => embedding);
}
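
If you have more text than you want to send in one request, you could split it into batches first and call fetchEmbeddingBatch once per batch. A minimal sketch; inBatches is a hypothetical helper, and the batch size is whatever you choose rather than a documented API limit:

```typescript
// Split a list into consecutive batches of at most batchSize items.
function inBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

console.log(inBatches(["a", "b", "c", "d", "e"], 2));
// [["a", "b"], ["c", "d"], ["e"]]
```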

Where should I store it?

Once you have an embedding vector, you’ll likely want to do one of two things with it:

Use it to search for similar strings (i.e. search for similar embeddings).
Store it to be searched against in the future.

If you plan to store thousands of vectors, I’d recommend using a dedicated vector database like Pinecone. This allows you to quickly find nearby vectors for a given input, without having to compare against every vector every time. Stay tuned for a future post on using Pinecone alongside Convex.

If you don’t have many vectors, however, you can just store them directly in a normal database. In my case, if I want to suggest Stack posts similar to a given post or search, I only need to compare against fewer than 100 vectors, so I can just fetch them all and compare them in a matter of milliseconds using the Convex database.

How should I store an embedding?

If you’re storing your embeddings in Pinecone, stay tuned for a dedicated post on it, but the short answer is you configure a Pinecone “Index” and store some metadata along with the vector, so when you get results from Pinecone you can easily re-associate them with your application data. For instance, you can store the document ID for a row that you want to associate with the vector.

If you’re storing the embedding in Convex, I’d advise storing it as a binary blob rather than a JavaScript array of numbers. Convex advises against storing arrays longer than 1024 elements. We can achieve this by converting it into a Float32Array pretty easily in JavaScript:

const numberList = await fetchEmbedding(inputText); // number[]
const floatArray = Float32Array.from(numberList); // Float32Array
const floatBytes = floatArray.buffer; // ArrayBuffer
// Save floatBytes to the DB
// Later, after you read the bytes back out:
const arrayAgain = new Float32Array(bytesFromDB); // Float32Array
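
To see the round trip end to end, here is a self-contained sketch using a made-up 1,536-element array in place of a real embedding. It also shows the trade-off: each element takes 4 bytes, and values can differ very slightly from the original 64-bit numbers.

```typescript
// Stand-in for an embedding returned by the API (made-up values).
const numberList: number[] = Array.from({ length: 1536 }, (_, i) => i / 1536);

// Pack into 4 bytes per element.
const floatBytes = Float32Array.from(numberList).buffer;
console.log(floatBytes.byteLength); // 6144 (1536 * 4)

// Unpack after reading the bytes back.
const arrayAgain = new Float32Array(floatBytes);
console.log(arrayAgain.length); // 1536

// Float32 keeps roughly 7 significant digits, so expect tiny rounding differences.
const maxError = numberList.reduce(
  (max, val, i) => Math.max(max, Math.abs(val - arrayAgain[i])),
  0
);
console.log(maxError < 1e-6); // true
```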

You can represent the embedding as a field in a table in your schema:

vectors: defineTable({
  float32Buffer: v.bytes(),
  textId: v.id("texts"),
}),

In this case, I store the vector alongside an ID of a document in the “texts” table.

How to compare embeddings in JavaScript

If you’re looking to compare two embeddings from OpenAI without using a vector database, it’s very simple. There are a few ways of comparing vectors, including Euclidean distance, dot product, and cosine similarity. Thankfully, because OpenAI normalizes all the vectors to be length 1, they will all give the same rankings! With a simple dot product you can get a similarity score ranging from -1 (opposite) to 1 (incredibly similar). There are optimized libraries to do it, but for my purposes, this simple function suffices:

/**
 * Compares two vectors by doing a dot product.
 *
 * Assuming both vectors are normalized to length 1, it will be in [-1, 1].
 * @returns [-1, 1] based on similarity. (1 is the same, -1 is the opposite)
 */
export function compare(vectorA: Float32Array, vectorB: Float32Array) {
  return vectorA.reduce((sum, val, idx) => sum + val * vectorB[idx], 0);
}
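
To see why the measures agree for unit-length vectors: cosine similarity divides the dot product by both magnitudes (both 1 here), and the squared Euclidean distance works out to 2 - 2 * (dot product), so a higher dot product always means a smaller distance. A small sketch with made-up two-dimensional unit vectors:

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, val, idx) => sum + val * b[idx], 0);
}

function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, val, idx) => sum + (val - b[idx]) ** 2, 0));
}

// Made-up unit vectors at 0, 60, and 90 degrees.
const target = [1, 0];
const nearby = [0.5, Math.sqrt(3) / 2];
const distant = [0, 1];

for (const other of [nearby, distant]) {
  const similarity = dot(target, other);
  const distance = euclideanDistance(target, other);
  // For unit vectors, distance squared equals 2 - 2 * similarity (up to rounding).
  console.log(similarity, distance ** 2, 2 - 2 * similarity);
}
```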

Example

In this example, let’s make a function (a Convex query in this case) that returns all of the vectors and their similarity scores in order based on some query vector, assuming a table of vectors as we defined above, and the compare function we just defined.

export const compareTo = query(async ({ db }, { vectorId }) => {
  const target = await db.get(vectorId);
  const targetArray = new Float32Array(target.float32Buffer);
  const vectors = await db.query("vectors").collect();
  const scores = await Promise.all(
    vectors
      .filter((vector) => !vector._id.equals(vectorId))
      .map(async (vector) => {
        const score = compare(
          targetArray,
          new Float32Array(vector.float32Buffer)
        );
        return { score, textId: vector.textId, vectorId: vector._id };
      })
  );
  return scores.sort((a, b) => b.score - a.score);
});

Summary

In this post, we looked at embeddings, why they’re useful, and how we can store and use them in Convex. I’ll be making more posts on working with embeddings, including chunking long input into multiple embeddings and using Pinecone alongside Convex soon. Let us know in our Discord what you think!

Copied from OpenAI’s guide ↩

Building a Full-Stack ChatGPT app

Ian Macartney — Mon, 05 Jun 2023 21:11:38 +0000

In this post, we’ll walk through putting together a full-stack chat app, and add some features.

We’ll use React and Vite, though any React framework works. We’ll use Convex as the backend, where our server-side functions will run, and where we’ll store our app’s data.

To see this in action, the code is here, and a running version is here (for now). You can clone it and run it yourself with a few configurations in the README, but read on to see a step by step guide on building your own. Also, the published version uses the "authed" branch in the repo, in case you want to see how simple it is to add auth.

To run this yourself, you’ll need to make an OpenAI account and get an API key.

If you’re familiar with Convex, you can skip ahead to step 2.

0. Bootstrap a Vite React app

If you don’t already have an app, we can make one:

npm create vite@latest

I picked convex-chatgpt as the project name, React as the framework, and JavaScript as the variant.

At this point, if we run:

npm install
npm run dev

we have a locally running webapp.

Let’s change src/App.jsx to list messages and have a form to submit messages:

App.jsx

import {useState} from "react";
import "./App.css";

function App() {
  const messages = [
    {author: "user", body: "Hello, world"},
  ];
  const sendMessage = body =>
    console.log("Trying to send: " + body);
  const [newMessageText, setNewMessageText] =
    useState("");

  return (
    <div className="App">
      {messages.map((message, i) => (
        <p key={i}>
          <span>{message.author}: </span>
          <span style={{ whiteSpace: "pre-wrap" }}>
            {message.body ?? "..."}
          </span>
        </p>
      ))}
      <form onSubmit={(e) => {
        e.preventDefault();
        setNewMessageText("");
        sendMessage(newMessageText);
      }}>
        <input
          value={newMessageText}
          onChange={e => setNewMessageText(e.target.value)}
          placeholder="Write a message…"
        />
        <input type="submit" value="Send" disabled={!newMessageText} />
      </form>
    </div>
  );
}

export default App;

At this point, we have an app that shows a static list of messages and logs to the console when trying to send a message.

1. Add Convex

This is similar to the Convex quickstart. In a new terminal (leave the other one running npm run dev):

npm install convex
npx convex init

If you haven’t used Convex before, you’ll be prompted to log in, create an account, etc. I picked convex-chatgpt as the project name.

Let’s add the ability to send messages and list messages. We will add functions that run in Convex (on a server) that will read and write to the database with a query and mutation.

Add convex/messages.js:

import { query, mutation } from "./_generated/server";

export const list = query(async ({ db }) => {
  return await db.query("messages").collect();
});

export const send = mutation(async ({ db }, { body }) => {
  await db.insert("messages", {
    body,
    author: "user",
  });
  const botMessageId = await db.insert("messages", {
    author: "assistant",
  });
});

In a separate terminal from the one running npm run dev, run npx convex dev. This will deploy these functions to your new Convex backend, and as you edit the functions, it will automatically re-deploy. This will ask you to save the deployment URL into your .env and .env.local files: these point at your Convex backends for your production and development deployments. We’ll just be working with the development deployment.

To access Convex from inside React, you’ll need to add a <ConvexProvider> context at the top level of your app. Edit main.jsx:

...
import { ConvexProvider, ConvexReactClient } from "convex/react";

const convex = new ConvexReactClient(import.meta.env.VITE_CONVEX_URL);

ReactDOM.createRoot(document.getElementById("root")).render(
  <React.StrictMode>
    <ConvexProvider client={convex}>
      <App />
    </ConvexProvider>
  </React.StrictMode>,
);

We can then use these functions from App.jsx:

function App() {
  const messages = useQuery("messages:list") || [];
  const sendMessage = useMutation("messages:send");
    ...

Great! Now we have messages being written to and read from the database. Try it out! Those new to Convex will be surprised to see that adding new messages will automatically result in the useQuery("messages:list") hook returning new messages. This is part of the magic of Convex. Learn more about it here.

You can check out your data in the dashboard: npx convex dashboard.

You may note that the query and mutation are named by the filename:function. See more about this here.

You’ll notice every time we send a message, there’s a “…” message from the assistant. Next let’s update that message with a reply from ChatGPT.

2. Send a message to the ChatGPT API

So far we’ve been working with a query and mutation, which are Convex functions that interact with the database. In order to interact with external services, we need to do that in an “action” - which is a Convex function that isn’t inside a deterministic environment and isn’t part of a database transaction. This frees us to do things with side effects, such as making requests to OpenAI’s API. We’ll do this in a few steps.

Fetching messages to send to ChatGPT

In our send mutation, we can grab the latest 21 messages to send to ChatGPT using a database query:

export const send = mutation(async ({ db }, { body }) => {
  //... insert messages to the table
  const messages = await db
    .query("messages")
    .order("desc")
    .filter((q) => q.neq(q.field("body"), undefined))
    .take(21);
  messages.reverse();
  return { messages, botMessageId };
});

This orders messages in descending order (sorted by creation time unless you use a different index), filters out messages with no body (for instance, the new placeholder assistant message), and takes the first 21 (which are the 21 most recent messages). It then reverses the list so it’s ordered in ascending time order. I picked 21 so that we’ll have 10 pairs of user/bot messages, followed by the latest prompt. It returns the messages to be used as input to the chat completion API, which we'll add next.
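
The same selection logic on a plain array, as a sketch for intuition only. The real query runs against the database; the StoredMessage type, the createdAt field, and the recentContext name are hypothetical:

```typescript
type StoredMessage = { author: string; body?: string; createdAt: number };

// Newest first, skip messages without a body, keep `limit`, then flip back to
// oldest first so the conversation reads in order.
function recentContext(messages: StoredMessage[], limit = 21): StoredMessage[] {
  return [...messages]
    .sort((a, b) => b.createdAt - a.createdAt)
    .filter((message) => message.body !== undefined)
    .slice(0, limit)
    .reverse();
}

const chatHistory: StoredMessage[] = [
  { author: "user", body: "Hi", createdAt: 1 },
  { author: "assistant", body: "Hello!", createdAt: 2 },
  { author: "user", body: "What is Convex?", createdAt: 3 },
  { author: "assistant", createdAt: 4 }, // placeholder, no body yet
];
console.log(recentContext(chatHistory, 2).map((m) => m.body));
// ["Hello!", "What is Convex?"]
```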

Creating the action

We’ll use the openai npm package. You can install it with:

npm install openai

Make a new file: convex/openai.js:

"use node";
import { Configuration, OpenAIApi } from "openai";
import { action } from "./_generated/server";

// Placeholder system prompt, hard-coded for now.
const instructions = "You are a helpful assistant.";

export const chat = action(async ({ runMutation }, { body }) => {
  const { messages, botMessageId } = await runMutation("messages:send", { body });
  const fail = async (reason) => {
    throw new Error(reason);
  };
  // Grab the API key from environment variables
  // Specify this in your dashboard: `npx convex dashboard`
  const apiKey = process.env.OPENAI_API_KEY;
  if (!apiKey) {
    await fail("Add your OPENAI_API_KEY as an env variable");
  }
  const configuration = new Configuration({ apiKey });
  const openai = new OpenAIApi(configuration);

  const openaiResponse = await openai.createChatCompletion({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content: instructions,
      },
      ...messages.map(({ body, author }) => ({
        role: author,
        content: body,
      })),
    ],
  });
  if (openaiResponse.status !== 200) {
    await fail("OpenAI error: " + openaiResponse.statusText);
  }
  // Named `reply` because `body` is already taken by the function argument.
  const reply = openaiResponse.data.choices[0].message.content;
  console.log("Response: " + reply);
});

This will first send the messages with the send mutation we modified, getting in return the list of messages and message ID to update with the bot’s message. It will then make a request to the gpt-3.5-turbo model, passing in one system message with instructions (hard-coded for now), followed by each message. We’re turning our body & author fields into “role” and “content”. See their docs here for more details on the API.
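
The field mapping on its own, as a standalone sketch. The type names and the toChatMessages helper are hypothetical, and the instructions string is a placeholder:

```typescript
type StoredMessage = { author: "user" | "assistant"; body: string };
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Prepend the system instructions, then map each stored message to the
// { role, content } shape the chat completion API expects.
function toChatMessages(
  instructions: string,
  messages: StoredMessage[]
): ChatMessage[] {
  return [
    { role: "system", content: instructions },
    ...messages.map(
      ({ author, body }): ChatMessage => ({ role: author, content: body })
    ),
  ];
}

console.log(
  toChatMessages("You are a helpful assistant.", [
    { author: "user", body: "Hello, world" },
  ])
);
// [
//   { role: "system", content: "You are a helpful assistant." },
//   { role: "user", content: "Hello, world" }
// ]
```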

Add your API key to the dashboard

To authenticate requests to OpenAI, you’ll need to add your API key to the Convex dashboard, so your action (which runs on Convex’s servers) has access to it. Go to your dashboard with npx convex dashboard and on the left panel go into Settings and add the key with the name OPENAI_API_KEY. While you’re there, you can add it as an env variable to both your production and development deployments by toggling the dropdown in the left panel.

Triggering the action from the UI

We can change our call to useMutation("messages:send") to

const sendMessage = useAction("openai:chat");

Note the action is addressed by the path to the file (openai) and the function (chat). The API is the same (taking one argument, body), so it’s a drop-in replacement!

At this point, if you run it, it will send a request to ChatGPT and console.log the response. Convex forwards function logs to the browser’s console, though you can also see them in the dashboard to prove to yourself it’s running in the cloud. Now let’s show the response in the UI.

Updating the message with the reply

In order to get the response into the chat message, we’ll want to update the empty placeholder message we added. Treat each call to a mutation as a transaction. Once we called our first mutation, the messages were committed to the database and the UI could show the new messages before doing the slow call to OpenAI. Once we get the response, we update the bot’s message and the UI will update automatically via the messages:list query. We’ll add a new mutation in the convex/messages.js file:

// An `internalMutation` can only be called from other server functions.
// Import it from "./_generated/server" alongside `query` and `mutation`.
export const update = internalMutation(async ({ db }, { messageId, patch }) => {
  await db.patch(messageId, patch);
});

This patches the specified message. You could pass in {body: "hi"} to change the message’s body to “hi,” for example. Since the first mutation returned the botMessageId, we can use that from the action in convex/openai.js:

export const chat = action(async ({ runMutation }, { body }) => {
  const { messages, botMessageId } = await runMutation("messages:send", { body });

  // Call OpenAI

  await runMutation("messages:update", {
    messageId: botMessageId, 
    patch: {
      body: openaiResponse.data.choices[0].message.content,
      // Track how many tokens we're using for various messages
      usage: openaiResponse.data.usage,
      updatedAt: Date.now(),
      // How long it took OpenAI
      ms: Number(openaiResponse.headers["openai-processing-ms"]),
    }
  });
});

While we’re here, we can store a bunch of interesting data into the message along with the body. Convex’s database is flexible enough to store strings, numbers, JavaScript objects, Maps, etc. while also letting you nail down a schema later to get TypeScript types, autocomplete, etc.
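
One caveat with the timing field: if the openai-processing-ms header is ever missing, Number(undefined) yields NaN. A defensive sketch, assuming the headers are available as a plain object of strings as in the code above; parseProcessingMs is a hypothetical helper:

```typescript
// Returns the processing time in milliseconds, or undefined when the header
// is missing or not a number.
function parseProcessingMs(
  headers: Record<string, string | undefined>
): number | undefined {
  const raw = headers["openai-processing-ms"];
  if (raw === undefined) return undefined;
  const ms = Number(raw);
  return Number.isFinite(ms) ? ms : undefined;
}

console.log(parseProcessingMs({ "openai-processing-ms": "412" })); // 412
console.log(parseProcessingMs({})); // undefined
```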

🎉 Now you have ChatGPT replying to your messages!

Summary

In this post, we made a full-stack web app to chat with OpenAI’s ChatGPT API. Let us know in Discord what you think, and if you’d like to see some extensions:

Customizing the identity you’re talking to.
Moderating the identities and messages to prevent harmful content.
Creating new message threads. (See the code for how it's done!)
