<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sivagurunathan Velayutham</title>
    <description>The latest articles on Forem by Sivagurunathan Velayutham (@sivagurunathanv).</description>
    <link>https://forem.com/sivagurunathanv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F466653%2F887999ab-4630-483f-8377-d635b8e32be5.webp</url>
      <title>Forem: Sivagurunathan Velayutham</title>
      <link>https://forem.com/sivagurunathanv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sivagurunathanv"/>
    <language>en</language>
    <item>
      <title>Beyond Round Robin: Building a Token-Aware Load Balancer for LLMs</title>
      <dc:creator>Sivagurunathan Velayutham</dc:creator>
      <pubDate>Thu, 12 Feb 2026 07:53:04 +0000</pubDate>
      <link>https://forem.com/sivagurunathanv/-beyond-round-robin-building-a-token-aware-load-balancer-for-llms-29i7</link>
      <guid>https://forem.com/sivagurunathanv/-beyond-round-robin-building-a-token-aware-load-balancer-for-llms-29i7</guid>
      <description>&lt;p&gt;In my &lt;a href="https://www.linkedin.com/posts/activity-7421967760563400704--3o9?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAABU_xkgBm5F2FNEN9O0OowmM_jnfZVYR6a0" rel="noopener noreferrer"&gt;previous experiment&lt;/a&gt;, I was trying to find the best model for a given task. The approach was to send the same request to multiple LLM models in parallel and return whichever responded first. Users got faster responses, but every request burned GPU cycles across multiple servers, most of which went to waste.&lt;/p&gt;

&lt;p&gt;That raised an obvious question: instead of racing backends against each other, what if the load balancer could pick the right one upfront?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Load Balancing Breaks Down for LLMs
&lt;/h2&gt;

&lt;p&gt;Standard load balancers route traffic using Round Robin, Least Connections, or health-based metrics. These strategies assume requests have roughly equal cost. That assumption breaks with LLMs.&lt;/p&gt;

&lt;p&gt;A 10-token prompt ("Translate 'hello' to French") and a 4,000-token prompt ("Analyze this codebase") both count as one connection. Least Connections will happily stack three heavy prompts on one server while another sits idle. The result is head-of-line blocking on the overloaded node, and wasted capacity elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection count is not a proxy for computational cost. Token count is.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight
&lt;/h2&gt;

&lt;p&gt;LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens sequentially). Prefill time scales directly with input token count. A 4,000-token prompt consumes significantly more GPU time during prefill than a 10-token one.&lt;/p&gt;

&lt;p&gt;If the balancer can estimate token count before routing, it can maintain a running total of in-flight tokens per backend and route to the node with the lowest total. It is the same least-loaded pattern used throughout distributed systems, but with tokens as the metric instead of connections. The algorithm becomes: pick the backend where &lt;code&gt;current_in_flight_tokens + new_request_tokens&lt;/code&gt; is lowest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;I built this as an L7 reverse proxy in Go, sitting between clients and a cluster of LLM backends.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueb8h12jvilyu1qmakt6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueb8h12jvilyu1qmakt6.png" alt="Token Aware Load balancer" width="800" height="692"&gt;&lt;/a&gt;mermaid&lt;/p&gt;

&lt;p&gt;The request lifecycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intercept&lt;/strong&gt; the incoming JSON body and extract the prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenize&lt;/strong&gt; using a tiktoken-compatible encoder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route&lt;/strong&gt; to the backend with the lowest in-flight token count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increment&lt;/strong&gt; that backend's token counter before proxying&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward&lt;/strong&gt; the request through &lt;code&gt;httputil.ReverseProxy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decrement&lt;/strong&gt; the counter once the backend responds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I chose Go because &lt;code&gt;net/http&lt;/code&gt;, &lt;code&gt;httputil.ReverseProxy&lt;/code&gt;, and &lt;code&gt;sync/atomic&lt;/code&gt; cover almost everything needed here. The only external dependency is &lt;code&gt;tiktoken-go&lt;/code&gt; for tokenization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Body-Read Problem
&lt;/h2&gt;

&lt;p&gt;In Go, &lt;code&gt;r.Body&lt;/code&gt; is an &lt;code&gt;io.ReadCloser&lt;/code&gt;. It can only be read once. The balancer needs to read it for tokenization and still forward the original payload to the backend.&lt;/p&gt;

&lt;p&gt;The fix: read the body into a &lt;code&gt;[]byte&lt;/code&gt;, run the tokenizer against that slice, then reassign &lt;code&gt;r.Body&lt;/code&gt; with &lt;code&gt;io.NopCloser(bytes.NewReader(body))&lt;/code&gt;. The downstream proxy sees an intact body.&lt;/p&gt;

&lt;p&gt;This is a well-known concern in any L7 proxy that inspects payloads, but it is easy to overlook when you are building one for the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Separating Middleware from RoundTripper
&lt;/h2&gt;

&lt;p&gt;The token-aware load balancer splits its logic across two layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Middleware&lt;/strong&gt; (&lt;code&gt;http.Handler&lt;/code&gt; wrapper) handles request validation, error responses (400, 503), and stores the computed token count in the request context. Anything that might reject a request lives here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RoundTripper&lt;/strong&gt; (&lt;code&gt;http.RoundTripper&lt;/code&gt; implementation) handles transport-level concerns: setting the destination URL and managing the token counter lifecycle. The decrement happens after the backend response is received, which maps naturally to the &lt;code&gt;RoundTrip&lt;/code&gt; call boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;I ran both strategies against the same setup: three backend servers, each simulating LLM compute time by sleeping in proportion to the input token count (with ±20% jitter to mimic real variance). Three payload sizes were used: small (~30ms), large (~2750ms), and huge (~7500ms). Traffic is mixed, with each request randomly picking a payload size.&lt;/p&gt;

&lt;h3&gt;
  
  
  High Contention (50% heavy, 50% small, concurrency=30, 60 requests)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Round Robin&lt;/th&gt;
&lt;th&gt;Token Aware&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.58s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.27s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-12%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8.60s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.78s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-10%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Heavy Workload (80% heavy, 20% small, concurrency=5, 60 requests)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Round Robin&lt;/th&gt;
&lt;th&gt;Token Aware&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.45s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.20s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8.67s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.57s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gains are most visible under high contention. At concurrency=30, average latency drops 12% and P90 drops 10%. The reason is straightforward: small requests no longer get stuck behind heavy ones because the balancer routes by computational weight, not connection count.&lt;/p&gt;

&lt;p&gt;A 12% improvement across three simulated backends is a floor, not a ceiling. Real workloads with wider token variance and higher concurrency would likely amplify the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is a simplified implementation. Production systems would need health checks with automatic backend removal, streaming (SSE) support with per-chunk token tracking, output token estimation for more accurate load prediction, and observability through Prometheus or equivalent.&lt;/p&gt;

&lt;p&gt;The code is on &lt;a href="https://github.com/SivagurunathanV/token-aware-balancer" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>go</category>
      <category>ai</category>
      <category>webdev</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>How I Built a Claude Router with Structured Concurrency and Virtual Threads</title>
      <dc:creator>Sivagurunathan Velayutham</dc:creator>
      <pubDate>Tue, 27 Jan 2026 16:45:00 +0000</pubDate>
      <link>https://forem.com/sivagurunathanv/how-i-built-a-claude-router-with-structured-concurrency-and-virtual-threads-49jh</link>
      <guid>https://forem.com/sivagurunathanv/how-i-built-a-claude-router-with-structured-concurrency-and-virtual-threads-49jh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently I read &lt;a href="https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d" rel="noopener noreferrer"&gt;Netflix's blog post on Virtual Threads&lt;/a&gt; and how they improved their backend system performance. This led me to explore how Virtual Threads and StructuredTaskScope work internally. In this post, I'll explain Virtual Threads, then show how to use them in a practical project with benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Threads
&lt;/h3&gt;

&lt;p&gt;Before diving into VThreads, let's take a step back and understand Threads.&lt;/p&gt;

&lt;p&gt;One of the standard textbook definitions is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Threads were light weight process running along with your application process.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down that statement.&lt;/p&gt;

&lt;p&gt;First part is "Light weight process" which means they have less memory foot print than regular process. Threads were typically stored in Stack (temp memory) once the lifetime of thread is reached, the associated memory will be released. Second part "along with your application process" - All threads were still managed by the main application process. Although there's a gotcha: there's a possibility of thread leaks if process didn't clean up threads properly.&lt;br&gt;
All threads internally mapped to scheduler inside OS, where each thread wake up and execute the task and return. OS handle the heavy lifting of how scheduling should happen i.e RoundRobin, Priority etc.&lt;/p&gt;

&lt;p&gt;Each platform thread consumes ~1MB of stack memory. With 200 threads, that's 200MB just for thread stacks. Under high load, request 201 must wait even though threads 1-200 are just sitting idle waiting for I/O responses.&lt;/p&gt;

&lt;p&gt;The main drawback shows up in I/O-intensive applications: each thread blocks until its I/O response comes back, sitting idle and wasting the thread resource. Think of a web server handling one request per thread. Under high load, the number of parallel requests your system can serve is capped by the maximum number of threads the operating system supports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Traditional thread-per-request model&lt;/span&gt;
&lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newFixedThreadPool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;submit&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// This thread is BLOCKED during the entire HTTP call&lt;/span&gt;
        &lt;span class="nc"&gt;HttpResponse&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// ~100ms wait&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Request 201-1000 must WAIT - all 200 threads blocked on I/O!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Virtual Threads
&lt;/h3&gt;

&lt;p&gt;To address this thread resource contention, JDK 21 introduced Virtual Threads through Project Loom. Think of them as an abstraction: virtual threads are managed by the JVM and mounted onto actual platform threads (the normal threads managed by the OS). Instead of being bound by OS constraints, the JVM maintains the virtual threads itself. With full control, the JVM can pause a virtual thread and resume it when its I/O operation completes (either success or failure).&lt;/p&gt;

&lt;p&gt;This raises an important question: how does the JVM know when to pause and resume threads?&lt;br&gt;
When a virtual thread hits a blocking call, it "parks" (the JDK's term) until it is unparked or interrupted. Under the hood, the JVM suspends the thread's continuation (a snapshot of its stack) and frees the platform (OS) thread. When the operation completes, the virtual thread resumes from that exact snapshot.&lt;/p&gt;

&lt;p&gt;The JDK has a common set of blocking API calls (like socket read/write). When the JVM detects one of these calls on a virtual thread, it reroutes it through a non-blocking API (epoll on Linux), stores the virtual thread's stack and local variables, and frees the platform/OS thread to run other virtual threads. Once the operation unblocks, the JVM restores the virtual thread's stack and resumes execution. This lets the JVM multiplex many virtual threads onto a limited set of platform threads.&lt;/p&gt;

&lt;p&gt;To move to virtual threads, the code change in JDK21+ is simply to switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: Platform threads (1:1 with OS) - each thread ~1MB&lt;/span&gt;
&lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newFixedThreadPool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// After: Virtual threads (M:N with OS) - each virtual thread ~1KB&lt;/span&gt;
&lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newVirtualThreadPerTaskExecutor&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: virtual threads don't block OS threads. A virtual thread waiting for I/O is just a Java object on the heap (~1KB), not a blocked OS thread (~1MB). The JVM unmounts the virtual thread from its carrier thread, freeing the carrier to run other virtual threads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured Concurrency
&lt;/h3&gt;

&lt;p&gt;StructuredTaskScope (finalized in Java 25) enforces a simple rule: &lt;strong&gt;tasks cannot outlive their scope&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traditional concurrency has fundamental issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thread leaks&lt;/strong&gt;: Tasks can outlive the method that created them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual cancellation&lt;/strong&gt;: Must remember to cancel remaining tasks on partial failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex cleanup&lt;/strong&gt;: try/catch/finally blocks become unwieldy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;StructuredTaskScope solves this with structured lifetime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructuredTaskScope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Joiner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;awaitAllSuccessfulOrThrow&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Subtask&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;userTask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fork&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fetchUser&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="nc"&gt;Subtask&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Orders&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ordersTask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fork&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fetchOrders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Wait for all&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Dashboard&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userTask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ordersTask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Built-in Joiner strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;awaitAllSuccessfulOrThrow()&lt;/code&gt; - Wait for all tasks to complete, fail if any fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;anySuccessfulResultOrThrow()&lt;/code&gt; - Return first successful result and cancel rest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's take a real world example. With the rise in LLM usage, a common use case is choosing the right model for the right task.&lt;/p&gt;

&lt;p&gt;Imagine building an intelligent LLM router that takes a prompt and routes to the best model. For finding the fastest model, we'll use a racing pattern: send requests to all models simultaneously, return whichever responds first, and cancel the rest. Each race result gets recorded by a metrics collector, tracking win rates and latency per model. Over time, the router learns which model consistently wins and starts routing directly to it, skipping unnecessary API calls.&lt;/p&gt;

&lt;p&gt;Claude offers three model tiers: &lt;strong&gt;Haiku&lt;/strong&gt; (fastest, cheapest), &lt;strong&gt;Sonnet&lt;/strong&gt; (balanced), and &lt;strong&gt;Opus&lt;/strong&gt; (most capable, slowest).&lt;/p&gt;

&lt;p&gt;For this project, I built a simple HTTP server using Javalin that exposes a &lt;code&gt;/chat&lt;/code&gt; endpoint. When a request comes in, the router races all three models, returns the fastest response, and tracks metrics. The server runs on Java 25 with virtual threads enabled.&lt;/p&gt;

&lt;p&gt;Let's look at the core racing logic. Without StructuredTaskScope, the code is messy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;LLMResponse&lt;/span&gt; &lt;span class="nf"&gt;raceModelsTraditional&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LLMRequest&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newVirtualThreadPerTaskExecutor&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;[]&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()];&lt;/span&gt;
    &lt;span class="nc"&gt;AtomicBoolean&lt;/span&gt; &lt;span class="n"&gt;completed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AtomicBoolean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Create futures for each model&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;supplyAsync&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CancellationException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Race already won"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;executeModel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;},&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Wait for first successful result&lt;/span&gt;
        &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;anyOf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;anyOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;LLMResponse&lt;/span&gt; &lt;span class="n"&gt;winner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;anyOf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TimeUnit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SECONDS&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Manually cancel remaining futures&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isDone&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;cancel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;winner&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TimeoutException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Cancel all on timeout&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;cancel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handleError&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handleError&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;shutdown&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With StructuredTaskScope, the same racing logic collapses to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;LLMResponse&lt;/span&gt; &lt;span class="nf"&gt;raceModels&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LLMRequest&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructuredTaskScope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;Joiner&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;anySuccessfulResultOrThrow&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="c1"&gt;// Fork concurrent tasks&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fork&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;executeModel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Wait for first success - others auto-cancelled&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handleError&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No manual cancellation. No thread leaks. No forgotten cleanup.&lt;/p&gt;

&lt;p&gt;With the router implemented, I wanted to see if virtual threads actually deliver on their promise. I ran the server under load and compared both approaches.&lt;/p&gt;
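The "one-line change" under test is just the choice of executor. Here is a minimal sketch of the swap the benchmark compares; `handleWith` is an illustrative helper, not the repo's actual server code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecutorSwap {
    // Run one task on the given executor and return its result.
    static String handleWith(ExecutorService executor) throws Exception {
        try {
            return executor.submit(() -> "handled").get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Platform threads: a bounded pool of OS threads
        System.out.println(handleWith(Executors.newFixedThreadPool(200)));

        // Virtual threads: the one-line change the benchmark compares
        System.out.println(handleWith(Executors.newVirtualThreadPerTaskExecutor()));
    }
}
```

Under load the fixed pool caps the server at 200 blocked requests, while the virtual-thread executor gives every request its own cheap thread, which is where the throughput and latency gains come from.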

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Virtual Threads vs Platform Threads
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;10,000 requests, 1,000 concurrency&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Virtual&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;1,530 req/s&lt;/td&gt;
&lt;td&gt;3,078 req/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 Latency&lt;/td&gt;
&lt;td&gt;475ms&lt;/td&gt;
&lt;td&gt;103ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;1,276ms&lt;/td&gt;
&lt;td&gt;420ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Racing Router Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P50 Latency&lt;/td&gt;
&lt;td&gt;96ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HAIKU Win Rate&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Savings&lt;/td&gt;
&lt;td&gt;95.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The router automatically identified HAIKU as fastest and transitioned to single-model mode after 500 races.&lt;/p&gt;

&lt;p&gt;For detailed benchmarks, see the &lt;a href="https://github.com/SivagurunathanV/claude-router" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Java 25's structured concurrency changes how we write concurrent code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Threads&lt;/strong&gt;: One-line change, 2x throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StructuredTaskScope&lt;/strong&gt;: Safe task lifecycle, automatic cancellation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Racing pattern&lt;/strong&gt;: Complex manual code becomes simple with cleanup built-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A word of caution&lt;/strong&gt;: Virtual threads aren't a silver bullet. Watch out for &lt;em&gt;pinning&lt;/em&gt;, where a virtual thread gets stuck on its carrier thread and can't unmount. Historically this happened when blocking inside a &lt;code&gt;synchronized&lt;/code&gt; block; JEP 491 (JDK 24) removed that limitation, but pinning still occurs when blocking inside native code called via JNI. A pinned virtual thread behaves like a platform thread, losing its scalability benefits. On recent JDKs you can monitor pinning with the &lt;code&gt;jdk.VirtualThreadPinned&lt;/code&gt; JFR event; on older JDKs, prefer &lt;code&gt;ReentrantLock&lt;/code&gt; over &lt;code&gt;synchronized&lt;/code&gt; and run with &lt;code&gt;-Djdk.tracePinnedThreads=short&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;a href="https://github.com/SivagurunathanV/claude-router" rel="noopener noreferrer"&gt;github.com/SivagurunathanV/claude-router&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>backend</category>
      <category>llm</category>
      <category>virtualthreads</category>
    </item>
    <item>
      <title>When Your Database Goes Down for 25+ Minutes: Building a Survival Cache</title>
      <dc:creator>Sivagurunathan Velayutham</dc:creator>
      <pubDate>Mon, 29 Dec 2025 21:20:38 +0000</pubDate>
      <link>https://forem.com/sivagurunathanv/-when-your-database-goes-down-for-25-minutes-building-a-survival-cache-7bc</link>
      <guid>https://forem.com/sivagurunathanv/-when-your-database-goes-down-for-25-minutes-building-a-survival-cache-7bc</guid>
      <description>&lt;p&gt;In microservice architectures, config services are critical infrastructure. They store feature flags, API endpoints, and runtime settings that services query constantly on startup, during requests, when auto-scaling. Most are backed by a database with aggressive caching. Everything works beautifully, until your database goes down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the nightmare scenario:&lt;/strong&gt; Your cache has a 5-minute TTL. Your database outage lasts 25+ minutes. At the 5-minute mark, cache entries start expiring. Services start failing. New instances can't bootstrap. Your availability drops to zero.&lt;/p&gt;

&lt;p&gt;This is the story of building a cache that survives prolonged database outages by persisting stale data to disk and the hard lessons learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Everyone tells you to cache your database. "Just use Redis!" "Throw some Caffeine in there!" And they're right for normal operations.&lt;/p&gt;

&lt;p&gt;But here's what the tutorials don't cover: &lt;strong&gt;What happens when your cache expires during a prolonged outage?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The failure sequence looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T+0 min&lt;/strong&gt;: Database goes down. Cache still serving traffic (100% hit rate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+5 min&lt;/strong&gt;: First cache entries expire. Cache misses start happening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+6 min&lt;/strong&gt;: Cache miss → try database → timeout. Service starts returning errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+10 min&lt;/strong&gt;: Most cache entries expired. Availability plummets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+15 min&lt;/strong&gt;: Auto-scaling spins up new instances. They can't fetch configs. Immediate crash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+25 min&lt;/strong&gt;: Database finally recovers. You've been down for 20 minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The traditional solution is replication: Aurora multi-region, DynamoDB global tables, all that good stuff. But replication has its own problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: You're running duplicate infrastructure 24/7 for failure scenarios that happen 2-3 times per year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: Cross-region replication, failover logic, data consistency concerns, network latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial protection&lt;/strong&gt;: Regional outages still take you down. Replication lag can be seconds to minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There had to be a simpler approach.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Insight: Stale Data Beats No Data
&lt;/h2&gt;

&lt;p&gt;Here's the controversial take that changed everything: &lt;strong&gt;For read-heavy config services, serving 10-minute-old data during an outage is infinitely better than serving nothing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about what your config service actually stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Feature flags&lt;/strong&gt;: Don't change every second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service endpoints&lt;/strong&gt;: Relatively stable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API rate limits&lt;/strong&gt;: Rarely updated mid-incident&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing rules&lt;/strong&gt;: Can tolerate brief staleness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sure, you might serve a feature flag that was disabled 5 minutes ago. But that's better than taking down your entire service because the config is unreachable.&lt;/p&gt;

&lt;p&gt;The question became: &lt;em&gt;How do I serve stale data when my cache is empty and my database is unavailable?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer: &lt;strong&gt;Persist cache evictions to local disk.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: The Three-Tier Survival Strategy
&lt;/h2&gt;

&lt;p&gt;I built what I call a "tier cache"—three layers of defense against database failures:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw19bg7o19o1bx7alxi3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw19bg7o19o1bx7alxi3v.png" alt="Architecture" width="800" height="1453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normal Operation Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request comes in → check L1 (memory)&lt;/li&gt;
&lt;li&gt;Cache hit (99% of the time) → return immediately in ~2.5μs&lt;/li&gt;
&lt;li&gt;Cache miss → fetch from L2 (database)&lt;/li&gt;
&lt;li&gt;Write to L1 for fast access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronously write to L3 (disk)&lt;/strong&gt; for outage protection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Outage Operation Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request comes in → check L1 (memory)&lt;/li&gt;
&lt;li&gt;Cache miss → try L2 (database) → connection timeout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fall back to L3 (disk)&lt;/strong&gt; → serve stale data&lt;/li&gt;
&lt;li&gt;Service stays alive with degraded data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The key innovation&lt;/strong&gt;: Every cache eviction gets persisted to disk. When the database is unreachable, we serve from this stale disk cache. It's not perfect data, but it keeps services running.&lt;/p&gt;
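The two flows above can be sketched as a single lookup path. This is an illustrative stand-in, not the repo's code: a map plays the role of RocksDB, and the write-through to L3 happens inline here rather than on eviction.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the L1 -> L2 -> L3 read path; names are illustrative.
public class TierLookup {
    final Map<String, String> l1 = new ConcurrentHashMap<>(); // memory cache
    final Map<String, String> l3 = new ConcurrentHashMap<>(); // stands in for RocksDB
    volatile boolean databaseUp = true;

    String fetchFromDatabase(String key) {
        if (!databaseUp) throw new IllegalStateException("DB down");
        return "value-from-db";
    }

    Optional<String> get(String key) {
        String hit = l1.get(key);
        if (hit != null) return Optional.of(hit);        // L1 hit: ~microseconds
        try {
            String fresh = fetchFromDatabase(key);       // L2: the source of truth
            l1.put(key, fresh);
            l3.put(key, fresh);                          // persist for outage scenarios
            return Optional.of(fresh);
        } catch (Exception dbDown) {
            return Optional.ofNullable(l3.get(key));     // L3: stale but alive
        }
    }
}
```

The important property is the catch branch: a database failure degrades a miss into a stale read instead of an error, but only for keys that were cached at some point before the outage.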

&lt;h2&gt;
  
  
  Why RocksDB?
&lt;/h2&gt;

&lt;p&gt;My first instinct was simple file serialization. Why not just dump everything to JSON?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;File&lt;/span&gt; &lt;span class="n"&gt;cacheFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cache-backup.json"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;objectMapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheFile&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cacheData&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked great for 100 entries in my test. Then I tried 10,000 realistic config objects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File size&lt;/strong&gt;: 45MB of verbose JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write time&lt;/strong&gt;: 280ms (blocking the cache)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read time&lt;/strong&gt;: 380ms (sequential scan to find one key)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Completely unusable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I needed something that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read individual keys fast&lt;/strong&gt; without scanning the entire file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compress data&lt;/strong&gt; since config JSON is highly repetitive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle writes efficiently&lt;/strong&gt; without blocking cache operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survive crashes&lt;/strong&gt; without losing all data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After researching embedded databases, RocksDB emerged as the clear winner:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression&lt;/strong&gt;: My 45MB JSON dump compressed to ~8MB with LZ4 (5.6x reduction). Real-world compression varies by data patterns, typically in the 2-4x range.&lt;/p&gt;
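The repetitiveness claim is easy to verify yourself. A stand-alone sketch using the JDK's `Deflater` (RocksDB uses LZ4 internally; this just demonstrates how compressible config-style JSON is):

```java
import java.util.zip.Deflater;

public class CompressDemo {
    // Compress the input fully and return the compressed byte count.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64]; // large enough for one pass
        int n = deflater.deflate(buf);
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        // 1,000 near-identical config entries, like a real config dump
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("{\"flag\":\"feature-").append(i)
              .append("\",\"enabled\":true,\"rolloutPercent\":50},");
        }
        byte[] json = sb.toString().getBytes();
        System.out.printf("raw=%d compressed=%d%n", json.length, compressedSize(json));
    }
}
```

Repeated keys and values mean the compressed output is a small fraction of the raw size, which is why a compressing store beats plain JSON files here.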

&lt;p&gt;&lt;strong&gt;Fast random reads&lt;/strong&gt;: Log-Structured Merge (LSM) tree design optimized for key-value lookups. 10-50μs to fetch any key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-optimized&lt;/strong&gt;: Writes go to memory first, then flush to disk in batches. No blocking on individual writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Battle-tested&lt;/strong&gt;: Powers production systems at Facebook, LinkedIn, Netflix. If it's good enough for them, it's good enough for my config service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crash safety&lt;/strong&gt;: Write-Ahead Logging (WAL) ensures durability even if the process crashes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RocksDBDiskStore&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;AutoCloseable&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;RocksDB&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;RocksDBDiskStore&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;RocksDBException&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;RocksDB&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;loadLibrary&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="nc"&gt;Options&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Options&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCreateIfMissing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCompressionType&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LZ4_COMPRESSION&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMaxOpenFiles&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setWriteBufferSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 8MB buffer&lt;/span&gt;

        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RocksDB&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Disk Management Built-In
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;: The disk store runs a configurable background cleanup thread:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// From RocksDBDiskStore.java&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleanupDuration&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newSingleThreadScheduledExecutor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RocksDB-Cleanup"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDaemon&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scheduleAtFixedRate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cleanup&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;cleanupDuration&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;cleanupDuration&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;unit&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This daemon thread runs periodic cleanup to prevent unbounded disk growth. You configure the cleanup frequency when initializing the disk store, ensuring L3 doesn't consume all server disk space over time.&lt;/p&gt;
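What a `cleanup()` pass can look like, sketched with an in-memory stand-in (the real store would iterate RocksDB and delete expired keys; these names are illustrative, not the repo's API):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of retention-based cleanup: drop entries persisted longer ago than
// a configured window. A map stands in for RocksDB here.
public class RetentionCleanup {
    record Entry(Instant writtenAt, String json) {}

    final Map<String, Entry> store = new ConcurrentHashMap<>();
    final Duration retention;

    RetentionCleanup(Duration retention) {
        this.retention = retention;
    }

    void save(String key, String json, Instant now) {
        store.put(key, new Entry(now, json));
    }

    // Remove everything older than the retention window.
    void cleanup(Instant now) {
        store.entrySet().removeIf(e ->
            Duration.between(e.getValue().writtenAt(), now).compareTo(retention) > 0);
    }
}
```

The retention window is the real tuning knob: it bounds both disk usage and how stale the data you serve during an outage can get.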

&lt;h2&gt;
  
  
  Cache Eviction: The Secret Sauce
&lt;/h2&gt;

&lt;p&gt;The clever part is &lt;em&gt;when&lt;/em&gt; data gets written to RocksDB. I don't persist every cache write—that would be wasteful. Instead, I persist on &lt;strong&gt;cache eviction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Caffeine's removal listener is the key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Caffeine&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;maximumSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxSize&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;expireAfterWrite&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;evictionListener&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;diskStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// write to RocksDB        &lt;/span&gt;
            &lt;span class="o"&gt;})&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When does eviction happen?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Time-based expiry&lt;/strong&gt;: Entry sits unused for X minutes → TTL expires → eviction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size-based eviction&lt;/strong&gt;: Cache hits 10,000 entries → least recently used gets evicted&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why this approach is efficient:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hot data stays in memory&lt;/strong&gt;: Frequently accessed configs never touch disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold data gets archived&lt;/strong&gt;: When a config entry expires from L1, it gets persisted to L3 for outage scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eviction-triggered persistence&lt;/strong&gt;: Data is written to disk when evicted from memory, not on every cache operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During normal operations&lt;/strong&gt;: L3 is write-mostly, read-rarely. The database is healthy, so cache misses go to L2, not L3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During outages&lt;/strong&gt;: L3 becomes read-heavy. Cache misses can't reach L2 (database down), so they fall back to L3 for stale data.&lt;/p&gt;

&lt;p&gt;This design means your disk isn't constantly thrashing with writes—it only persists data that's already being evicted from memory anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarking: Does This Actually Work?
&lt;/h2&gt;

&lt;p&gt;I built a test harness to simulate realistic failure scenarios. Here are the results that convinced me this approach works:&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 1: Long Outage Resilience (25-min database failure)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;: 10K cache entries, 5-min TTL, simulated database outage at T+0&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time Elapsed&lt;/th&gt;
&lt;th&gt;Tier Cache&lt;/th&gt;
&lt;th&gt;EhCache (disk)&lt;/th&gt;
&lt;th&gt;Caffeine Only&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key finding&lt;/strong&gt;: The tier cache maintained availability &lt;strong&gt;for previously-cached keys&lt;/strong&gt; by serving from L3 (RocksDB) after L1 expired. This assumes every requested key was previously cached, which matches typical production read patterns; newly added configs or never-requested keys won't be in L3 and will still fail.&lt;/p&gt;

&lt;p&gt;Why did EhCache fail? Its disk persistence is designed for overflow, not outage recovery. When the cache expires, it tries to fetch from the database (which is down) rather than serving stale disk data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test 2: Normal Operation Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;: Database healthy, measuring latency for cache operations&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Tier Cache&lt;/th&gt;
&lt;th&gt;EhCache&lt;/th&gt;
&lt;th&gt;Caffeine&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit (memory)&lt;/td&gt;
&lt;td&gt;2.50 μs&lt;/td&gt;
&lt;td&gt;6.31 μs&lt;/td&gt;
&lt;td&gt;2.74 μs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache miss (DB up)&lt;/td&gt;
&lt;td&gt;1.2 ms&lt;/td&gt;
&lt;td&gt;1.3 ms&lt;/td&gt;
&lt;td&gt;1.1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk fallback&lt;/td&gt;
&lt;td&gt;19.11 μs&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Important clarification&lt;/strong&gt;: The "cache miss" numbers include network round-trip (mocked) to the database. The "disk fallback" is what happens when the DB is down—we serve from RocksDB instead.&lt;/p&gt;

&lt;p&gt;During normal operations, tier cache performs nearly identically to vanilla Caffeine. The disk layer only matters during outages.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test 3: Write Throughput Under Memory Pressure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;: 50K writes with 10K cache size limit (heavy eviction)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Total Time&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;vs Baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Caffeine Only&lt;/td&gt;
&lt;td&gt;37 ms&lt;/td&gt;
&lt;td&gt;1,351,351/s&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier Cache&lt;/td&gt;
&lt;td&gt;140 ms&lt;/td&gt;
&lt;td&gt;357,143/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EhCache&lt;/td&gt;
&lt;td&gt;201 ms&lt;/td&gt;
&lt;td&gt;248,756/s&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;This is the cost.&lt;/strong&gt; Async disk persistence reduces write throughput by ~74%. Every eviction triggers a disk write, and under heavy churn, this adds up.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;This is a learning project, not production-ready code. Here are the real limitations you need to understand:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. The Cold Start Problem
&lt;/h3&gt;

&lt;p&gt;New instances start with empty RocksDB. During an outage, they have no stale data to serve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Auto-scaling spins up a new pod → L1 empty → L2 down → L3 empty → requests fail.&lt;/p&gt;

&lt;p&gt;My benchmarks showed 100% availability, but that assumed warm caches. Real-world availability during outages depends on whether instances have previously cached the requested keys.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Single Node Limitation
&lt;/h3&gt;

&lt;p&gt;Each instance maintains its own local RocksDB. In a distributed deployment with multiple instances, each has different stale data based on what it personally cached. Request routing becomes non-deterministic—the same config key might return different values depending on which instance handles the request.&lt;/p&gt;

&lt;p&gt;This isn't a bug to fix; it's a fundamental architectural choice. Local disk persistence trades consistency for simplicity. Solving this requires either accepting eventual consistency or moving to distributed storage like Redis, which defeats the "simple local cache" design goal.&lt;/p&gt;
&lt;h2&gt;
  
  
  When Should You Actually Use This?
&lt;/h2&gt;

&lt;p&gt;This project demonstrates caching patterns and outage resilience strategies. Based on the architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Appropriate for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-node applications&lt;/li&gt;
&lt;li&gt;Systems where eventual consistency across instances is acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not appropriate for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-instance production deployments requiring consistency&lt;/li&gt;
&lt;li&gt;Applications needing strong consistency guarantees&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The full implementation is available at &lt;strong&gt;&lt;a href="https://github.com/SivagurunathanV/tier-cache" rel="noopener noreferrer"&gt;github.com/SivagurunathanV/tier-cache&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick start:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/SivagurunathanV/tier-cache
&lt;span class="nb"&gt;cd &lt;/span&gt;tier-cache
./gradlew &lt;span class="nb"&gt;test&lt;/span&gt;    &lt;span class="c"&gt;# Run test suite&lt;/span&gt;
./gradlew run     &lt;span class="c"&gt;# Interactive demo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you're building something similar:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start simple (JSON files) and profile before over-engineering&lt;/li&gt;
&lt;li&gt;Measure your actual outage frequency and duration&lt;/li&gt;
&lt;li&gt;Calculate the real cost of downtime vs. infrastructure&lt;/li&gt;
&lt;li&gt;Test with realistic failure scenarios, not just happy paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key improvements for production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement write coalescing (batch evictions)&lt;/li&gt;
&lt;li&gt;Add circuit breakers and error handling&lt;/li&gt;
&lt;li&gt;Build comprehensive observability&lt;/li&gt;
&lt;li&gt;Test cold start and multi-instance scenarios&lt;/li&gt;
&lt;/ul&gt;
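&lt;p&gt;Write coalescing, the first item above, could look roughly like this (hypothetical names; a real implementation would flush each batch through a RocksDB WriteBatch): evictions only enqueue, and a background flush drains them in groups, so N evictions cost roughly N / batchSize disk writes instead of N:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of write coalescing for evictions. Hypothetical names.
public class CoalescingWriter {
    private final BlockingQueue<Map.Entry<String, String>> pending = new LinkedBlockingQueue<>();
    private final Map<String, String> disk = new ConcurrentHashMap<>(); // stands in for RocksDB
    private int batchFlushes = 0; // number of batched disk writes performed

    // Evictions just enqueue; no disk I/O on the hot path.
    public void onEvict(String key, String value) {
        pending.add(Map.entry(key, value));
    }

    // Drain up to batchSize queued evictions and persist them as one batch
    // (one WriteBatch in a real RocksDB-backed implementation).
    public int flushOnce(int batchSize) {
        List<Map.Entry<String, String>> batch = new ArrayList<>(batchSize);
        pending.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            for (Map.Entry<String, String> e : batch) {
                disk.put(e.getKey(), e.getValue());
            }
            batchFlushes++;
        }
        return batch.size();
    }

    public int batchFlushes() { return batchFlushes; }

    public String readStale(String key) { return disk.get(key); }
}
```

&lt;p&gt;A real version would run the flush on a timer or when the queue reaches a threshold; the sketch exposes it as a method so the batching is easy to observe.&lt;/p&gt;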

&lt;p&gt;I'd love to hear about your failure survival strategies. What patterns have kept your services alive during database outages? What trade-offs have you made?&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/SivagurunathanV/tier-cache" rel="noopener noreferrer"&gt;Full source code and tests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rocksdb.org/" rel="noopener noreferrer"&gt;RocksDB documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ben-manes/caffeine" rel="noopener noreferrer"&gt;Caffeine cache library&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>database</category>
      <category>caching</category>
      <category>java</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
