<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sivagurunathan Velayutham</title>
    <description>The latest articles on Forem by Sivagurunathan Velayutham (@sivagurunathanv).</description>
    <link>https://forem.com/sivagurunathanv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F466653%2F887999ab-4630-483f-8377-d635b8e32be5.webp</url>
      <title>Forem: Sivagurunathan Velayutham</title>
      <link>https://forem.com/sivagurunathanv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sivagurunathanv"/>
    <language>en</language>
    <item>
      <title>Beyond Round Robin: Building a Token-Aware Load Balancer for LLMs</title>
      <dc:creator>Sivagurunathan Velayutham</dc:creator>
      <pubDate>Thu, 12 Feb 2026 07:53:04 +0000</pubDate>
      <link>https://forem.com/sivagurunathanv/-beyond-round-robin-building-a-token-aware-load-balancer-for-llms-29i7</link>
      <guid>https://forem.com/sivagurunathanv/-beyond-round-robin-building-a-token-aware-load-balancer-for-llms-29i7</guid>
      <description>&lt;p&gt;In my &lt;a href="https://www.linkedin.com/posts/activity-7421967760563400704--3o9?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAABU_xkgBm5F2FNEN9O0OowmM_jnfZVYR6a0" rel="noopener noreferrer"&gt;previous experiment&lt;/a&gt;, I was trying to find the best model for a given task. The approach was to send the same request to multiple LLM models in parallel and return whichever responded first. Users got faster responses, but every request burned GPU cycles across multiple servers, most of which went to waste.&lt;/p&gt;

&lt;p&gt;That raised an obvious question: instead of racing backends against each other, what if the load balancer could pick the right one upfront?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Load Balancing Breaks Down for LLMs
&lt;/h2&gt;

&lt;p&gt;Standard load balancers route traffic using Round Robin, Least Connections, or health-based metrics. These strategies assume requests have roughly equal cost. That assumption breaks with LLMs.&lt;/p&gt;

&lt;p&gt;A 10-token prompt ("Translate 'hello' to French") and a 4,000-token prompt ("Analyze this codebase") both count as one connection. Least Connections will happily stack three heavy prompts on one server while another sits idle. The result is head-of-line blocking on the overloaded node, and wasted capacity elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection count is not a proxy for computational cost. Token count is.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight
&lt;/h2&gt;

&lt;p&gt;LLM inference has two phases: prefill (processing the input prompt) and decode (generating tokens sequentially). Prefill time scales directly with input token count. A 4,000-token prompt consumes significantly more GPU time during prefill than a 10-token one.&lt;/p&gt;

&lt;p&gt;If the balancer can estimate token count before routing, it can maintain a running total of in-flight tokens per backend and route to the node with the lowest total. It is the same least-loaded pattern used throughout distributed systems, but with tokens as the metric instead of connections. The algorithm becomes: pick the backend where &lt;code&gt;current_in_flight_tokens + new_request_tokens&lt;/code&gt; is lowest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;I built this as an L7 reverse proxy in Go, sitting between clients and a cluster of LLM backends.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueb8h12jvilyu1qmakt6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueb8h12jvilyu1qmakt6.png" alt="Token Aware Load balancer" width="800" height="692"&gt;&lt;/a&gt;mermaid&lt;/p&gt;

&lt;p&gt;The request lifecycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intercept&lt;/strong&gt; the incoming JSON body and extract the prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenize&lt;/strong&gt; using a tiktoken-compatible encoder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route&lt;/strong&gt; to the backend with the lowest in-flight token count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increment&lt;/strong&gt; that backend's token counter before proxying&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward&lt;/strong&gt; the request through &lt;code&gt;httputil.ReverseProxy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decrement&lt;/strong&gt; the counter once the backend responds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I chose Go because &lt;code&gt;net/http&lt;/code&gt;, &lt;code&gt;httputil.ReverseProxy&lt;/code&gt;, and &lt;code&gt;sync/atomic&lt;/code&gt; cover almost everything needed here. The only external dependency is &lt;code&gt;tiktoken-go&lt;/code&gt; for tokenization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Body-Read Problem
&lt;/h2&gt;

&lt;p&gt;In Go, &lt;code&gt;r.Body&lt;/code&gt; is an &lt;code&gt;io.ReadCloser&lt;/code&gt;. It can only be read once. The balancer needs to read it for tokenization and still forward the original payload to the backend.&lt;/p&gt;

&lt;p&gt;The fix: read the body into a &lt;code&gt;[]byte&lt;/code&gt;, run the tokenizer against that slice, then reassign &lt;code&gt;r.Body&lt;/code&gt; with &lt;code&gt;io.NopCloser(bytes.NewReader(body))&lt;/code&gt;. The downstream proxy sees an intact body.&lt;/p&gt;

&lt;p&gt;This is a well-known concern in any L7 proxy that inspects payloads, but it is easy to overlook when you are building one for the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Separating Middleware from RoundTripper
&lt;/h2&gt;

&lt;p&gt;The token-aware load balancer splits its logic across two layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Middleware&lt;/strong&gt; (&lt;code&gt;http.Handler&lt;/code&gt; wrapper) handles request validation, error responses (400, 503), and stores the computed token count in the request context. Anything that might reject a request lives here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RoundTripper&lt;/strong&gt; (&lt;code&gt;http.RoundTripper&lt;/code&gt; implementation) handles transport-level concerns: setting the destination URL and managing the token counter lifecycle. The decrement happens after the backend response is received, which maps naturally to the &lt;code&gt;RoundTrip&lt;/code&gt; call boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;I ran both strategies against the same setup: three backend servers, each simulating LLM compute time by sleeping in proportion to the input token count (with ±20% jitter to mimic real variance). Three payload sizes were used: small (~30ms), large (~2750ms), and huge (~7500ms). Traffic is mixed, with each request randomly picking a payload size.&lt;/p&gt;

&lt;h3&gt;
  
  
  High Contention (50% heavy, 50% small, concurrency=30, 60 requests)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Round Robin&lt;/th&gt;
&lt;th&gt;Token Aware&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.58s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.27s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-12%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8.60s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.78s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-10%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Heavy Workload (80% heavy, 20% small, concurrency=5, 60 requests)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Round Robin&lt;/th&gt;
&lt;th&gt;Token Aware&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.45s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.20s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P90 Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8.67s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.57s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gains are most visible under high contention. At concurrency=30, average latency drops 12% and P90 drops 10%. The reason is straightforward: small requests no longer get stuck behind heavy ones because the balancer routes by computational weight, not connection count.&lt;/p&gt;

&lt;p&gt;A 12% improvement across three simulated backends is a floor, not a ceiling. Real workloads with wider token variance and higher concurrency would likely amplify the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is a simplified implementation. Production systems would need health checks with automatic backend removal, streaming (SSE) support with per-chunk token tracking, output token estimation for more accurate load prediction, and observability through Prometheus or equivalent.&lt;/p&gt;

&lt;p&gt;The code is on &lt;a href="https://github.com/SivagurunathanV/token-aware-balancer" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>go</category>
      <category>ai</category>
      <category>webdev</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>How I Built a Claude Router with Structured Concurrency and Virtual Threads</title>
      <dc:creator>Sivagurunathan Velayutham</dc:creator>
      <pubDate>Tue, 27 Jan 2026 16:45:00 +0000</pubDate>
      <link>https://forem.com/sivagurunathanv/how-i-built-a-claude-router-with-structured-concurrency-and-virtual-threads-49jh</link>
      <guid>https://forem.com/sivagurunathanv/how-i-built-a-claude-router-with-structured-concurrency-and-virtual-threads-49jh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently I read &lt;a href="https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d" rel="noopener noreferrer"&gt;Netflix's blog post on Virtual Threads&lt;/a&gt; and how they improved their backend system performance. This led me to explore how Virtual Threads and StructuredTaskScope work internally. In this post, I'll explain Virtual Threads, then show how to use them in a practical project with benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Threads
&lt;/h3&gt;

&lt;p&gt;Before diving into VThreads, let's take a step back and understand Threads.&lt;/p&gt;

&lt;p&gt;One of the standard textbook definitions is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Threads were light weight process running along with your application process.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down that statement.&lt;/p&gt;

&lt;p&gt;First part is "Light weight process" which means they have less memory foot print than regular process. Threads were typically stored in Stack (temp memory) once the lifetime of thread is reached, the associated memory will be released. Second part "along with your application process" - All threads were still managed by the main application process. Although there's a gotcha: there's a possibility of thread leaks if process didn't clean up threads properly.&lt;br&gt;
All threads internally mapped to scheduler inside OS, where each thread wake up and execute the task and return. OS handle the heavy lifting of how scheduling should happen i.e RoundRobin, Priority etc.&lt;/p&gt;

&lt;p&gt;Each platform thread consumes ~1MB of stack memory. With 200 threads, that's 200MB just for thread stacks. Under high load, request 201 must wait even though threads 1-200 are just sitting idle waiting for I/O responses.&lt;/p&gt;

&lt;p&gt;The main drawback shows up in I/O-intensive applications: each thread blocks until its I/O response comes back, sitting idle and wasting the thread resource. Think of a web server handling one request per thread. Under high load, the number of parallel requests your system can serve is capped by the maximum number of threads the operating system supports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Traditional thread-per-request model&lt;/span&gt;
&lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newFixedThreadPool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;submit&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// This thread is BLOCKED during the entire HTTP call&lt;/span&gt;
        &lt;span class="nc"&gt;HttpResponse&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// ~100ms wait&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Request 201-1000 must WAIT - all 200 threads blocked on I/O!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Virtual Threads
&lt;/h3&gt;

&lt;p&gt;To address this thread resource contention, JDK 21 introduced Virtual Threads through Project Loom. Think of them as an abstraction: virtual threads are managed by the JVM and mounted onto actual platform threads (the normal threads managed by the OS). Instead of being bound by OS constraints, the JVM maintains the virtual threads itself. With full control, the JVM can pause a virtual thread and resume it when its I/O operation completes (either success or failure).&lt;/p&gt;

&lt;p&gt;This raises an important question: how does the JVM know when to pause and resume threads?&lt;br&gt;
When a virtual thread hits a blocking call, it "parks" (the JDK's term) until it is unparked or interrupted. Under the hood, the JVM suspends the thread's continuation (a snapshot of its stack) and frees the platform (OS) thread. When the operation completes, the virtual thread resumes from that exact snapshot.&lt;/p&gt;

&lt;p&gt;The JDK has a common set of blocking API calls (like socket read/write). When the JVM detects one of these calls on a virtual thread, it reroutes it through a non-blocking API (epoll on Linux), stores the virtual thread's stack and local variables, and frees the platform/OS thread to run other virtual threads. Once the operation unblocks, the JVM restores the virtual thread's stack and resumes execution. This lets the JVM multiplex many virtual threads onto a limited set of platform threads.&lt;/p&gt;

&lt;p&gt;To move to virtual threads, the code change in JDK21+ is simply to switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: Platform threads (1:1 with OS) - each thread ~1MB&lt;/span&gt;
&lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newFixedThreadPool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// After: Virtual threads (M:N with OS) - each virtual thread ~1KB&lt;/span&gt;
&lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newVirtualThreadPerTaskExecutor&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: virtual threads don't block OS threads. A virtual thread waiting for I/O is just a Java object on the heap (~1KB), not a blocked OS thread (~1MB). The JVM unmounts the virtual thread from its carrier thread, freeing the carrier to run other virtual threads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured Concurrency
&lt;/h3&gt;

&lt;p&gt;StructuredTaskScope (finalized in Java 25) enforces a simple rule: &lt;strong&gt;tasks cannot outlive their scope&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traditional concurrency has fundamental issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thread leaks&lt;/strong&gt;: Tasks can outlive the method that created them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual cancellation&lt;/strong&gt;: Must remember to cancel remaining tasks on partial failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex cleanup&lt;/strong&gt;: try/catch/finally blocks become unwieldy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;StructuredTaskScope solves this with structured lifetime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructuredTaskScope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Joiner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;awaitAllSuccessfulOrThrow&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Subtask&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;userTask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fork&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fetchUser&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="nc"&gt;Subtask&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Orders&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ordersTask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fork&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fetchOrders&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Wait for all&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Dashboard&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userTask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ordersTask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Built-in Joiner strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;awaitAllSuccessfulOrThrow()&lt;/code&gt; - Wait for all tasks to complete, fail if any fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;anySuccessfulResultOrThrow()&lt;/code&gt; - Return first successful result and cancel rest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's take a real world example. With the rise in LLM usage, a common use case is choosing the right model for the right task.&lt;/p&gt;

&lt;p&gt;Imagine building an intelligent LLM router that takes a prompt and routes to the best model. For finding the fastest model, we'll use a racing pattern: send requests to all models simultaneously, return whichever responds first, and cancel the rest. Each race result gets recorded by a metrics collector, tracking win rates and latency per model. Over time, the router learns which model consistently wins and starts routing directly to it, skipping unnecessary API calls.&lt;/p&gt;

&lt;p&gt;Claude offers three model tiers: &lt;strong&gt;Haiku&lt;/strong&gt; (fastest, cheapest), &lt;strong&gt;Sonnet&lt;/strong&gt; (balanced), and &lt;strong&gt;Opus&lt;/strong&gt; (most capable, slowest).&lt;/p&gt;

&lt;p&gt;For this project, I built a simple HTTP server using Javalin that exposes a &lt;code&gt;/chat&lt;/code&gt; endpoint. When a request comes in, the router races all three models, returns the fastest response, and tracks metrics. The server runs on Java 25 with virtual threads enabled.&lt;/p&gt;

&lt;p&gt;Let's look at the core racing logic. Without StructuredTaskScope, the code is messy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;LLMResponse&lt;/span&gt; &lt;span class="nf"&gt;raceModelsTraditional&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LLMRequest&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newVirtualThreadPerTaskExecutor&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;[]&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()];&lt;/span&gt;
    &lt;span class="nc"&gt;AtomicBoolean&lt;/span&gt; &lt;span class="n"&gt;completed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AtomicBoolean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Create futures for each model&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;supplyAsync&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CancellationException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Race already won"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;executeModel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;},&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Wait for first successful result&lt;/span&gt;
        &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;anyOf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;anyOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;LLMResponse&lt;/span&gt; &lt;span class="n"&gt;winner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;anyOf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TimeUnit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SECONDS&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Manually cancel remaining futures&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isDone&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;cancel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;winner&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TimeoutException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Cancel all on timeout&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;cancel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handleError&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handleError&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;shutdown&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With StructuredTaskScope, the same racing logic collapses to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;LLMResponse&lt;/span&gt; &lt;span class="nf"&gt;raceModels&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LLMRequest&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructuredTaskScope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;Joiner&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;LLMResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;anySuccessfulResultOrThrow&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

        &lt;span class="c1"&gt;// Fork concurrent tasks&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fork&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;executeModel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Wait for first success - others auto-cancelled&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;handleError&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No manual cancellation. No thread leaks. No forgotten cleanup.&lt;/p&gt;

&lt;p&gt;With the router implemented, I wanted to see if virtual threads actually deliver on their promise. I ran the server under load and compared both approaches.&lt;/p&gt;
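The "one-line change" under test is just the choice of executor. Here is a minimal sketch of the swap the benchmark compares; `handleWith` is an illustrative helper, not the repo's actual server code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecutorSwap {
    // Run one task on the given executor and return its result.
    static String handleWith(ExecutorService executor) throws Exception {
        try {
            return executor.submit(() -> "handled").get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Platform threads: a bounded pool of OS threads
        System.out.println(handleWith(Executors.newFixedThreadPool(200)));

        // Virtual threads: the one-line change the benchmark compares
        System.out.println(handleWith(Executors.newVirtualThreadPerTaskExecutor()));
    }
}
```

Under load the fixed pool caps the server at 200 blocked requests, while the virtual-thread executor gives every request its own cheap thread, which is where the throughput and latency gains come from.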

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Virtual Threads vs Platform Threads
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;10,000 requests, 1,000 concurrency&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Virtual&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;1,530 req/s&lt;/td&gt;
&lt;td&gt;3,078 req/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 Latency&lt;/td&gt;
&lt;td&gt;475ms&lt;/td&gt;
&lt;td&gt;103ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;1,276ms&lt;/td&gt;
&lt;td&gt;420ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Racing Router Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P50 Latency&lt;/td&gt;
&lt;td&gt;96ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HAIKU Win Rate&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost Savings&lt;/td&gt;
&lt;td&gt;95.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The router automatically identified HAIKU as fastest and transitioned to single-model mode after 500 races.&lt;/p&gt;

&lt;p&gt;For detailed benchmarks, see the &lt;a href="https://github.com/SivagurunathanV/claude-router" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Java 25's structured concurrency changes how we write concurrent code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Threads&lt;/strong&gt;: One-line change, 2x throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StructuredTaskScope&lt;/strong&gt;: Safe task lifecycle, automatic cancellation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Racing pattern&lt;/strong&gt;: Complex manual code becomes simple with cleanup built-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A word of caution&lt;/strong&gt;: Virtual threads aren't a silver bullet. Watch out for &lt;em&gt;pinning&lt;/em&gt;, where a virtual thread gets stuck on its carrier thread and can't unmount. Historically this happened when blocking inside a &lt;code&gt;synchronized&lt;/code&gt; block; JEP 491 (JDK 24) removed that limitation, but pinning still occurs when blocking inside native code called via JNI. A pinned virtual thread behaves like a platform thread, losing its scalability benefits. On recent JDKs you can monitor pinning with the &lt;code&gt;jdk.VirtualThreadPinned&lt;/code&gt; JFR event; on older JDKs, prefer &lt;code&gt;ReentrantLock&lt;/code&gt; over &lt;code&gt;synchronized&lt;/code&gt; and run with &lt;code&gt;-Djdk.tracePinnedThreads=short&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;a href="https://github.com/SivagurunathanV/claude-router" rel="noopener noreferrer"&gt;github.com/SivagurunathanV/claude-router&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>backend</category>
      <category>llm</category>
      <category>virtualthreads</category>
    </item>
    <item>
      <title>When Your Database Goes Down for 25+ Minutes: Building a Survival Cache</title>
      <dc:creator>Sivagurunathan Velayutham</dc:creator>
      <pubDate>Mon, 29 Dec 2025 21:20:38 +0000</pubDate>
      <link>https://forem.com/sivagurunathanv/-when-your-database-goes-down-for-25-minutes-building-a-survival-cache-7bc</link>
      <guid>https://forem.com/sivagurunathanv/-when-your-database-goes-down-for-25-minutes-building-a-survival-cache-7bc</guid>
      <description>&lt;p&gt;In microservice architectures, config services are critical infrastructure. They store feature flags, API endpoints, and runtime settings that services query constantly on startup, during requests, when auto-scaling. Most are backed by a database with aggressive caching. Everything works beautifully, until your database goes down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the nightmare scenario:&lt;/strong&gt; Your cache has a 5-minute TTL. Your database outage lasts 25+ minutes. At the 5-minute mark, cache entries start expiring. Services start failing. New instances can't bootstrap. Your availability drops to zero.&lt;/p&gt;

&lt;p&gt;This is the story of building a cache that survives prolonged database outages by persisting stale data to disk and the hard lessons learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Everyone tells you to cache your database. "Just use Redis!" "Throw some Caffeine in there!" And they're right for normal operations.&lt;/p&gt;

&lt;p&gt;But here's what the tutorials don't cover: &lt;strong&gt;What happens when your cache expires during a prolonged outage?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The failure sequence looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T+0 min&lt;/strong&gt;: Database goes down. Cache still serving traffic (100% hit rate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+5 min&lt;/strong&gt;: First cache entries expire. Cache misses start happening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+6 min&lt;/strong&gt;: Cache miss → try database → timeout. Service starts returning errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+10 min&lt;/strong&gt;: Most cache entries expired. Availability plummets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+15 min&lt;/strong&gt;: Auto-scaling spins up new instances. They can't fetch configs. Immediate crash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T+25 min&lt;/strong&gt;: Database finally recovers. You've been down for 20 minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The traditional solution is replication: Aurora multi-region, DynamoDB global tables, all that good stuff. But replication has its own problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: You're running duplicate infrastructure 24/7 for failure scenarios that happen 2-3 times per year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: Cross-region replication, failover logic, data consistency concerns, network latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial protection&lt;/strong&gt;: Regional outages still take you down. Replication lag can be seconds to minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There had to be a simpler approach.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Insight: Stale Data Beats No Data
&lt;/h2&gt;

&lt;p&gt;Here's the controversial take that changed everything: &lt;strong&gt;For read-heavy config services, serving 10-minute-old data during an outage is infinitely better than serving nothing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about what your config service actually stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Feature flags&lt;/strong&gt;: Don't change every second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service endpoints&lt;/strong&gt;: Relatively stable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API rate limits&lt;/strong&gt;: Rarely updated mid-incident&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing rules&lt;/strong&gt;: Can tolerate brief staleness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sure, you might serve a feature flag that was disabled 5 minutes ago. But that's better than taking down your entire service because the config is unreachable.&lt;/p&gt;

&lt;p&gt;The question became: &lt;em&gt;How do I serve stale data when my cache is empty and my database is unavailable?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer: &lt;strong&gt;Persist cache evictions to local disk.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: The Three-Tier Survival Strategy
&lt;/h2&gt;

&lt;p&gt;I built what I call a "tier cache"—three layers of defense against database failures:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw19bg7o19o1bx7alxi3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw19bg7o19o1bx7alxi3v.png" alt="Architecture" width="800" height="1453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normal Operation Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request comes in → check L1 (memory)&lt;/li&gt;
&lt;li&gt;Cache hit (99% of the time) → return immediately in ~2.5μs&lt;/li&gt;
&lt;li&gt;Cache miss → fetch from L2 (database)&lt;/li&gt;
&lt;li&gt;Write to L1 for fast access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronously write to L3 (disk)&lt;/strong&gt; for outage protection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Outage Operation Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request comes in → check L1 (memory)&lt;/li&gt;
&lt;li&gt;Cache miss → try L2 (database) → connection timeout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fall back to L3 (disk)&lt;/strong&gt; → serve stale data&lt;/li&gt;
&lt;li&gt;Service stays alive with degraded data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The key innovation&lt;/strong&gt;: Every cache eviction gets persisted to disk. When the database is unreachable, we serve from this stale disk cache. It's not perfect data, but it keeps services running.&lt;/p&gt;
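The two flows above can be sketched as a single lookup path. This is an illustrative stand-in, not the repo's code: a map plays the role of RocksDB, and the write-through to L3 happens inline here rather than on eviction.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the L1 -> L2 -> L3 read path; names are illustrative.
public class TierLookup {
    final Map<String, String> l1 = new ConcurrentHashMap<>(); // memory cache
    final Map<String, String> l3 = new ConcurrentHashMap<>(); // stands in for RocksDB
    volatile boolean databaseUp = true;

    String fetchFromDatabase(String key) {
        if (!databaseUp) throw new IllegalStateException("DB down");
        return "value-from-db";
    }

    Optional<String> get(String key) {
        String hit = l1.get(key);
        if (hit != null) return Optional.of(hit);        // L1 hit: ~microseconds
        try {
            String fresh = fetchFromDatabase(key);       // L2: the source of truth
            l1.put(key, fresh);
            l3.put(key, fresh);                          // persist for outage scenarios
            return Optional.of(fresh);
        } catch (Exception dbDown) {
            return Optional.ofNullable(l3.get(key));     // L3: stale but alive
        }
    }
}
```

The important property is the catch branch: a database failure degrades a miss into a stale read instead of an error, but only for keys that were cached at some point before the outage.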

&lt;h2&gt;
  
  
  Why RocksDB?
&lt;/h2&gt;

&lt;p&gt;My first instinct was simple file serialization. Why not just dump everything to JSON?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;File&lt;/span&gt; &lt;span class="n"&gt;cacheFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cache-backup.json"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;objectMapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheFile&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cacheData&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked great for 100 entries in my test. Then I tried 10,000 realistic config objects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File size&lt;/strong&gt;: 45MB of verbose JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write time&lt;/strong&gt;: 280ms (blocking the cache)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read time&lt;/strong&gt;: 380ms (sequential scan to find one key)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Completely unusable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I needed something that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read individual keys fast&lt;/strong&gt; without scanning the entire file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compress data&lt;/strong&gt; since config JSON is highly repetitive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle writes efficiently&lt;/strong&gt; without blocking cache operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survive crashes&lt;/strong&gt; without losing all data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After researching embedded databases, RocksDB emerged as the clear winner:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression&lt;/strong&gt;: My 45MB JSON dump compressed to ~8MB with LZ4 (5.6x reduction). Real-world compression varies by data patterns, typically in the 2-4x range.&lt;/p&gt;
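The repetitiveness claim is easy to verify yourself. A stand-alone sketch using the JDK's `Deflater` (RocksDB uses LZ4 internally; this just demonstrates how compressible config-style JSON is):

```java
import java.util.zip.Deflater;

public class CompressDemo {
    // Compress the input fully and return the compressed byte count.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64]; // large enough for one pass
        int n = deflater.deflate(buf);
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        // 1,000 near-identical config entries, like a real config dump
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("{\"flag\":\"feature-").append(i)
              .append("\",\"enabled\":true,\"rolloutPercent\":50},");
        }
        byte[] json = sb.toString().getBytes();
        System.out.printf("raw=%d compressed=%d%n", json.length, compressedSize(json));
    }
}
```

Repeated keys and values mean the compressed output is a small fraction of the raw size, which is why a compressing store beats plain JSON files here.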

&lt;p&gt;&lt;strong&gt;Fast random reads&lt;/strong&gt;: Log-Structured Merge (LSM) tree design optimized for key-value lookups. 10-50μs to fetch any key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-optimized&lt;/strong&gt;: Writes go to memory first, then flush to disk in batches. No blocking on individual writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Battle-tested&lt;/strong&gt;: Powers production systems at Facebook, LinkedIn, Netflix. If it's good enough for them, it's good enough for my config service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crash safety&lt;/strong&gt;: Write-Ahead Logging (WAL) ensures durability even if the process crashes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RocksDBDiskStore&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;AutoCloseable&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;RocksDB&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;RocksDBDiskStore&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;RocksDBException&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;RocksDB&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;loadLibrary&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="nc"&gt;Options&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Options&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCreateIfMissing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCompressionType&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LZ4_COMPRESSION&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMaxOpenFiles&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setWriteBufferSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 8MB buffer&lt;/span&gt;

        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RocksDB&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Disk Management Built-In
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;: The disk store runs a configurable background cleanup thread:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// From RocksDBDiskStore.java&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleanupDuration&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newSingleThreadScheduledExecutor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"RocksDB-Cleanup"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDaemon&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;scheduleAtFixedRate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cleanup&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;cleanupDuration&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;cleanupDuration&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;unit&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This daemon thread runs periodic cleanup to prevent unbounded disk growth. You configure the cleanup frequency when initializing the disk store, ensuring L3 doesn't consume all server disk space over time.&lt;/p&gt;
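What a `cleanup()` pass can look like, sketched with an in-memory stand-in (the real store would iterate RocksDB and delete expired keys; these names are illustrative, not the repo's API):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of retention-based cleanup: drop entries persisted longer ago than
// a configured window. A map stands in for RocksDB here.
public class RetentionCleanup {
    record Entry(Instant writtenAt, String json) {}

    final Map<String, Entry> store = new ConcurrentHashMap<>();
    final Duration retention;

    RetentionCleanup(Duration retention) {
        this.retention = retention;
    }

    void save(String key, String json, Instant now) {
        store.put(key, new Entry(now, json));
    }

    // Remove everything older than the retention window.
    void cleanup(Instant now) {
        store.entrySet().removeIf(e ->
            Duration.between(e.getValue().writtenAt(), now).compareTo(retention) > 0);
    }
}
```

The retention window is the real tuning knob: it bounds both disk usage and how stale the data you serve during an outage can get.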

&lt;h2&gt;
  
  
  Cache Eviction: The Secret Sauce
&lt;/h2&gt;

&lt;p&gt;The clever part is &lt;em&gt;when&lt;/em&gt; data gets written to RocksDB. I don't persist every cache write—that would be wasteful. Instead, I persist on &lt;strong&gt;cache eviction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Caffeine's removal listener is the key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Caffeine&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;maximumSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxSize&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;expireAfterWrite&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;evictionListener&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cause&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;diskStore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// write to RocksDB        &lt;/span&gt;
            &lt;span class="o"&gt;})&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When does eviction happen?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Time-based expiry&lt;/strong&gt;: Entry sits unused for X minutes → TTL expires → eviction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size-based eviction&lt;/strong&gt;: Cache hits 10,000 entries → least recently used gets evicted&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why this approach is efficient:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hot data stays in memory&lt;/strong&gt;: Frequently accessed configs never touch disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold data gets archived&lt;/strong&gt;: When a config entry expires from L1, it gets persisted to L3 for outage scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eviction-triggered persistence&lt;/strong&gt;: Data is written to disk when evicted from memory, not on every cache operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During normal operations&lt;/strong&gt;: L3 is write-mostly, read-rarely. The database is healthy, so cache misses go to L2, not L3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During outages&lt;/strong&gt;: L3 becomes read-heavy. Cache misses can't reach L2 (database down), so they fall back to L3 for stale data.&lt;/p&gt;

&lt;p&gt;This design means your disk isn't constantly thrashing with writes—it only persists data that's already being evicted from memory anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarking: Does This Actually Work?
&lt;/h2&gt;

&lt;p&gt;I built a test harness to simulate realistic failure scenarios. Here are the results that convinced me this approach works:&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 1: Long Outage Resilience (25-min database failure)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;: 10K cache entries, 5-min TTL, simulated database outage at T+0&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time Elapsed&lt;/th&gt;
&lt;th&gt;Tier Cache&lt;/th&gt;
&lt;th&gt;EhCache (disk)&lt;/th&gt;
&lt;th&gt;Caffeine Only&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25 minutes&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key finding&lt;/strong&gt;: The tier cache maintained availability &lt;strong&gt;for previously-cached keys&lt;/strong&gt; by serving from L3 (RocksDB) after L1 expired. This assumes every requested key was previously cached, which matches typical production read patterns; newly added configs or never-requested keys won't be in L3 and will still fail.&lt;/p&gt;

&lt;p&gt;Why did EhCache fail? Its disk persistence is designed for overflow, not outage recovery. When the cache expires, it tries to fetch from the database (which is down) rather than serving stale disk data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test 2: Normal Operation Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;: Database healthy, measuring latency for cache operations&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Tier Cache&lt;/th&gt;
&lt;th&gt;EhCache&lt;/th&gt;
&lt;th&gt;Caffeine&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit (memory)&lt;/td&gt;
&lt;td&gt;2.50 μs&lt;/td&gt;
&lt;td&gt;6.31 μs&lt;/td&gt;
&lt;td&gt;2.74 μs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache miss (DB up)&lt;/td&gt;
&lt;td&gt;1.2 ms&lt;/td&gt;
&lt;td&gt;1.3 ms&lt;/td&gt;
&lt;td&gt;1.1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk fallback&lt;/td&gt;
&lt;td&gt;19.11 μs&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Important clarification&lt;/strong&gt;: The "cache miss" numbers include network round-trip (mocked) to the database. The "disk fallback" is what happens when the DB is down—we serve from RocksDB instead.&lt;/p&gt;

&lt;p&gt;During normal operations, tier cache performs nearly identically to vanilla Caffeine. The disk layer only matters during outages.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test 3: Write Throughput Under Memory Pressure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;: 50K writes with 10K cache size limit (heavy eviction)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Total Time&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;vs Baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Caffeine Only&lt;/td&gt;
&lt;td&gt;37 ms&lt;/td&gt;
&lt;td&gt;1,351,351/s&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier Cache&lt;/td&gt;
&lt;td&gt;140 ms&lt;/td&gt;
&lt;td&gt;357,143/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EhCache&lt;/td&gt;
&lt;td&gt;201 ms&lt;/td&gt;
&lt;td&gt;248,756/s&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;This is the cost.&lt;/strong&gt; Async disk persistence reduces write throughput by ~74%. Every eviction triggers a disk write, and under heavy churn, this adds up.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;This is a learning project, not production-ready code. Here are the real limitations you need to understand:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. The Cold Start Problem
&lt;/h3&gt;

&lt;p&gt;New instances start with empty RocksDB. During an outage, they have no stale data to serve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Auto-scaling spins up a new pod → L1 empty → L2 down → L3 empty → requests fail.&lt;/p&gt;

&lt;p&gt;My benchmarks showed 100% availability, but that assumed warm caches. Real-world availability during outages depends on whether instances have previously cached the requested keys.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Single Node Limitation
&lt;/h3&gt;

&lt;p&gt;Each instance maintains its own local RocksDB. In a distributed deployment with multiple instances, each has different stale data based on what it personally cached. Request routing becomes non-deterministic—the same config key might return different values depending on which instance handles the request.&lt;/p&gt;

&lt;p&gt;This isn't a bug to fix; it's a fundamental architectural choice. Local disk persistence trades consistency for simplicity. Solving this requires either accepting eventual consistency or moving to distributed storage like Redis, which defeats the "simple local cache" design goal.&lt;/p&gt;
&lt;h2&gt;
  
  
  When Should You Actually Use This?
&lt;/h2&gt;

&lt;p&gt;This project demonstrates caching patterns and outage resilience strategies. Based on the architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Appropriate for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-node applications&lt;/li&gt;
&lt;li&gt;Systems where eventual consistency across instances is acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not appropriate for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-instance production deployments requiring consistency&lt;/li&gt;
&lt;li&gt;Applications needing strong consistency guarantees&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The full implementation is available at &lt;strong&gt;&lt;a href="https://github.com/SivagurunathanV/tier-cache" rel="noopener noreferrer"&gt;github.com/SivagurunathanV/tier-cache&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick start:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/SivagurunathanV/tier-cache
&lt;span class="nb"&gt;cd &lt;/span&gt;tier-cache
./gradlew &lt;span class="nb"&gt;test&lt;/span&gt;    &lt;span class="c"&gt;# Run test suite&lt;/span&gt;
./gradlew run     &lt;span class="c"&gt;# Interactive demo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you're building something similar:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start simple (JSON files) and profile before over-engineering&lt;/li&gt;
&lt;li&gt;Measure your actual outage frequency and duration&lt;/li&gt;
&lt;li&gt;Calculate the real cost of downtime vs. infrastructure&lt;/li&gt;
&lt;li&gt;Test with realistic failure scenarios, not just happy paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key improvements for production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement write coalescing (batch evictions)&lt;/li&gt;
&lt;li&gt;Add circuit breakers and error handling&lt;/li&gt;
&lt;li&gt;Build comprehensive observability&lt;/li&gt;
&lt;li&gt;Test cold start and multi-instance scenarios&lt;/li&gt;
&lt;/ul&gt;
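&lt;p&gt;Write coalescing, the first item above, could look roughly like this (hypothetical names; a real implementation would flush each batch through a RocksDB WriteBatch): evictions only enqueue, and a background flush drains them in groups, so N evictions cost roughly N / batchSize disk writes instead of N:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of write coalescing for evictions. Hypothetical names.
public class CoalescingWriter {
    private final BlockingQueue<Map.Entry<String, String>> pending = new LinkedBlockingQueue<>();
    private final Map<String, String> disk = new ConcurrentHashMap<>(); // stands in for RocksDB
    private int batchFlushes = 0; // number of batched disk writes performed

    // Evictions just enqueue; no disk I/O on the hot path.
    public void onEvict(String key, String value) {
        pending.add(Map.entry(key, value));
    }

    // Drain up to batchSize queued evictions and persist them as one batch
    // (one WriteBatch in a real RocksDB-backed implementation).
    public int flushOnce(int batchSize) {
        List<Map.Entry<String, String>> batch = new ArrayList<>(batchSize);
        pending.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            for (Map.Entry<String, String> e : batch) {
                disk.put(e.getKey(), e.getValue());
            }
            batchFlushes++;
        }
        return batch.size();
    }

    public int batchFlushes() { return batchFlushes; }

    public String readStale(String key) { return disk.get(key); }
}
```

&lt;p&gt;A real version would run the flush on a timer or when the queue reaches a threshold; the sketch exposes it as a method so the batching is easy to observe.&lt;/p&gt;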

&lt;p&gt;I'd love to hear about your failure survival strategies. What patterns have kept your services alive during database outages? What trade-offs have you made?&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/SivagurunathanV/tier-cache" rel="noopener noreferrer"&gt;Full source code and tests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rocksdb.org/" rel="noopener noreferrer"&gt;RocksDB documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ben-manes/caffeine" rel="noopener noreferrer"&gt;Caffeine cache library&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>database</category>
      <category>caching</category>
      <category>java</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
