<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: AlexChen</title>
    <description>The latest articles on Forem by AlexChen (@alexchen31337).</description>
    <link>https://forem.com/alexchen31337</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811371%2F0691f0ac-348b-4a31-b7f1-aa0818eedae5.jpg</url>
      <title>Forem: AlexChen</title>
      <link>https://forem.com/alexchen31337</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alexchen31337"/>
    <language>en</language>
    <item>
      <title>770 Experiments to Squeeze 30 tok/s Out of a 35B MoE Model on a $500 GPU</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Thu, 02 Apr 2026 03:37:00 +0000</pubDate>
      <link>https://forem.com/alexchen31337/770-experiments-to-squeeze-30-toks-out-of-a-35b-moe-model-on-a-500-gpu-4il5</link>
      <guid>https://forem.com/alexchen31337/770-experiments-to-squeeze-30-toks-out-of-a-35b-moe-model-on-a-500-gpu-4il5</guid>
      <description>&lt;h1&gt;
  
  
  770 Experiments to Squeeze 30 tok/s Out of a 35B MoE Model on a $500 GPU
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;29.899 tokens per second. A 35-billion parameter model. An NVIDIA RTX 3070 with 8GB of VRAM. A $500 GPU you can buy at Micro Center.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That number — 29.9 tok/s — is the result of 770 experiments across 12 phases. We started at 6.1 tok/s with naive settings. We ended at nearly 30. That's a &lt;strong&gt;+387% improvement&lt;/strong&gt;, and every percentage point was earned through systematic search, not guesswork.&lt;/p&gt;

&lt;p&gt;This article is the full story: what we tried, what worked, what didn't, and the exact configuration you can copy to reproduce our results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The conventional wisdom says: big models need big GPUs. A 35B parameter model "should" need 40–70GB of VRAM. Cloud inference costs $0.50–$2.00 per million tokens. Local inference on consumer hardware is supposed to be impractical beyond 7–13B models.&lt;/p&gt;

&lt;p&gt;Mixture-of-Experts (MoE) architectures change this calculus. Qwen3.5-35B-A3B has 35 billion total parameters but only activates 3 billion per token. The rest sit dormant — which means if you're clever about what lives in VRAM and what lives in RAM, you can run a frontier-class model on hardware that costs less than a month of cloud API bills.&lt;/p&gt;

&lt;p&gt;The question was never "can it run?" — it was "can it run &lt;em&gt;fast enough to be useful?&lt;/em&gt;" At 6 tok/s, it's a curiosity. At 30 tok/s, it's a daily driver.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model: Qwen3.5-35B-A3B
&lt;/h2&gt;

&lt;p&gt;Qwen3.5-35B-A3B is Alibaba's latest MoE release. The architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35B total parameters&lt;/strong&gt; across multiple expert groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3B active parameters&lt;/strong&gt; per token (roughly 8% activation ratio)&lt;/li&gt;
&lt;li&gt;Competitive with dense 30B+ models on reasoning benchmarks while being dramatically cheaper to run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We used the &lt;strong&gt;IQ2_XXS GGUF quantization&lt;/strong&gt;, which compresses the model to &lt;strong&gt;9.6GB on disk&lt;/strong&gt;. This is aggressive — 2-bit quantization with importance-weighted rounding — but the MoE architecture is surprisingly resilient to quantization because most parameters are expert weights that activate sparsely.&lt;/p&gt;

&lt;p&gt;At 9.6GB, the model technically doesn't fit entirely in 8GB of VRAM. But it doesn't have to. That's where the optimization story begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Methodology: Autonomous Autoresearch
&lt;/h2&gt;

&lt;p&gt;We didn't sit down with a spreadsheet and plan 770 experiments. We built an &lt;strong&gt;autonomous research loop&lt;/strong&gt; — a system that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Proposes parameter configurations based on prior results&lt;/li&gt;
&lt;li&gt;Runs each experiment with llama.cpp's CUDA backend&lt;/li&gt;
&lt;li&gt;Measures throughput (tok/s), VRAM usage, and generation quality&lt;/li&gt;
&lt;li&gt;Analyzes results statistically&lt;/li&gt;
&lt;li&gt;Proposes the next batch of experiments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No pre-specified experimental matrix. No hand-tuning. The system explored the parameter space systematically across multiple sessions, finding configurations a human would never try — and discovering failure modes we never would have predicted.&lt;/p&gt;

&lt;p&gt;770 experiments. 12 distinct phases. Multiple breakthrough moments. All driven by data.&lt;/p&gt;
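&lt;p&gt;The loop can be sketched in a few lines of Python. This is an illustrative toy, not our actual harness: &lt;code&gt;run_experiment&lt;/code&gt; here simulates llama.cpp throughput (peaking at 17 GPU layers, as Phase 10 found) instead of launching real benchmarks, and the proposer is a simple greedy climber rather than our statistical search.&lt;/p&gt;

```python
# Toy sketch of the propose/run/measure loop. run_experiment() stands in
# for launching llama.cpp and parsing its timing output; here it simulates
# throughput that grows with GPU layers until an OOM boundary past 17.

def run_experiment(config):
    layers = config["n_gpu_layers"]
    if layers > 17:                 # simulated out-of-memory failure
        return None
    return 6.0 + 0.37 * layers      # simulated tok/s

def propose_next(history):
    """Greedy proposer: push n_gpu_layers one past the current best."""
    best = max((h for h in history if h["tok_s"] is not None),
               key=lambda h: h["tok_s"])
    return {"n_gpu_layers": best["config"]["n_gpu_layers"] + 1}

history = [{"config": {"n_gpu_layers": 0},
            "tok_s": run_experiment({"n_gpu_layers": 0})}]
for _ in range(25):
    config = propose_next(history)
    history.append({"config": config, "tok_s": run_experiment(config)})

best = max((h for h in history if h["tok_s"] is not None),
           key=lambda h: h["tok_s"])
print(best["config"], round(best["tok_s"], 2))  # settles at the 17-layer edge
```

&lt;p&gt;The real system replaced the greedy proposer with batched proposals informed by statistical analysis of prior results, but the control flow is the same.&lt;/p&gt;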

&lt;h2&gt;
  
  
  The Journey: Phase by Phase
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 3 — Baseline: 6.1 tok/s
&lt;/h3&gt;

&lt;p&gt;Naive settings. Default thread count, minimal GPU offloading. The model runs, but barely. At this speed, generating a 500-token response takes over 80 seconds. Usable for experimentation, not for work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4 — First GPU Offload Sweep: 11.850 tok/s
&lt;/h3&gt;

&lt;p&gt;The first big lever: &lt;code&gt;n_gpu_layers&lt;/code&gt;. This controls how many transformer layers live in VRAM versus system RAM. The RTX 3070's memory bandwidth is roughly 10x that of DDR4, so every layer you can fit on the GPU matters enormously.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;n_gpu=16&lt;/code&gt; layers offloaded to the RTX 3070, throughput nearly doubled. The autoresearch loop found this optimal layer count by sweeping from 0 to 20 in steps of 1, measuring each configuration three times for statistical reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 8 — Incremental Gains: 12.021 tok/s
&lt;/h3&gt;

&lt;p&gt;Phases 5–8 explored secondary parameters: batch sizes, context lengths, thread counts. Small gains. The system was methodically eliminating dead ends and confirming that GPU layer count was the dominant variable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 10 — The n_gpu=17 Breakthrough: 12.331 tok/s
&lt;/h3&gt;

&lt;p&gt;One more layer on the GPU. It seems trivial, but this was a boundary condition — the system found that layer 17 barely fit within the 8GB VRAM budget when combined with KV cache and activation memory. Pushing to 18 caused OOM. The autoresearch loop discovered this edge precisely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 11 — Quantization Breakthrough: 21.621 tok/s
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;+75% overnight.&lt;/strong&gt; This was the single biggest jump in the entire project.&lt;/p&gt;

&lt;p&gt;The insight: switching from Q3_K_M quantization to &lt;strong&gt;IQ2_M&lt;/strong&gt; dramatically reduced model size, freeing VRAM for more GPU layers. More layers on GPU means polynomially more throughput; we quantify this in the power-law section below.&lt;/p&gt;

&lt;p&gt;IQ2_M uses importance-weighted 2-bit quantization. Perplexity increased only marginally (1.3138 vs 1.3073 for Q3_K_M on our test set). The quality-to-speed tradeoff was extraordinary — IQ2_M is the &lt;strong&gt;Pareto optimal&lt;/strong&gt; choice for quality-adjusted throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 12 — The Summit: 29.899 tok/s
&lt;/h3&gt;

&lt;p&gt;The final configuration combined everything we'd learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IQ2_XXS quantization&lt;/strong&gt; (even more aggressive than IQ2_M, 9.6GB on disk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n_gpu=27 layers&lt;/strong&gt; on GPU (the smaller model footprint freed massive VRAM headroom)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;threads=8&lt;/strong&gt; (not 16 — more on this below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;batch=32/16&lt;/strong&gt; (prompt processing / generation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flash_attn=1&lt;/strong&gt; + &lt;strong&gt;op_offload=1&lt;/strong&gt; (MoE expert offloading)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache in q8_0&lt;/strong&gt; (TurboQuant-style compression)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;29.899 tok/s. +387% from baseline.&lt;/strong&gt; At this speed, generating 500 tokens takes 17 seconds. That's faster than most people read.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Flash MoE Expert Offloading
&lt;/h3&gt;

&lt;p&gt;Inspired by Apple's research on flash-memory inference for MoE models. The key idea: MoE expert weights that aren't active for the current token can be offloaded from VRAM, freeing space for more transformer layers.&lt;/p&gt;

&lt;p&gt;In llama.cpp, this maps to &lt;code&gt;--flash-attn 1&lt;/code&gt; and &lt;code&gt;--op-offload 1&lt;/code&gt;. Together, they allow the runtime to dynamically manage expert residency, keeping hot experts in VRAM and cold experts in system RAM.&lt;/p&gt;

&lt;p&gt;Impact: enabled fitting 27 layers on GPU instead of 17. This alone accounts for the majority of the Phase 11→12 throughput gain.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. TurboQuant KV Compression
&lt;/h3&gt;

&lt;p&gt;Based on Google's work on quantized KV caches. Instead of storing key-value cache entries in FP16 (2 bytes per element), we compress to &lt;strong&gt;q8_0&lt;/strong&gt; (roughly 1 byte per element with block-wise scaling).&lt;/p&gt;

&lt;p&gt;Impact: reduced KV cache VRAM footprint by ~50%, freeing additional headroom for model layers. Quality impact at short-to-medium context lengths: negligible.&lt;/p&gt;
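&lt;p&gt;The arithmetic behind that ~50% saving is easy to check. The dimensions below (48 layers, 4 KV heads, head dim 128) are placeholder assumptions, not Qwen3.5-35B-A3B's published config; substitute your model's values:&lt;/p&gt;

```python
# Back-of-envelope KV-cache sizing. Model dimensions are assumed
# placeholders; swap in your model's layer count, KV heads, and head dim.

def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store ctx * n_kv_heads * head_dim elements per layer
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

FP16 = 2.0           # 2 bytes per element
Q8_0 = 1.0 + 2 / 32  # 1 byte per element plus one fp16 scale per 32-block

fp16_mb = kv_cache_bytes(4096, 48, 4, 128, FP16) / 2**20
q8_mb = kv_cache_bytes(4096, 48, 4, 128, Q8_0) / 2**20
print(f"FP16: {fp16_mb:.0f} MiB, q8_0: {q8_mb:.0f} MiB")  # FP16: 384 MiB, q8_0: 204 MiB
```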

&lt;h3&gt;
  
  
  3. PolarQuant (Polar Decomposition)
&lt;/h3&gt;

&lt;p&gt;PolarQuant decomposes weight matrices into rotation and scaling components, enabling more efficient quantization by separating the "direction" and "magnitude" of weight information.&lt;/p&gt;

&lt;p&gt;We implemented and tested PolarQuant in Phase 7. The theoretical promise is strong: better preservation of weight structure at low bit-widths.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. QJL (Quantized Johnson-Lindenstrauss)
&lt;/h3&gt;

&lt;p&gt;QJL applies random projection to compress KV cache entries, exploiting the Johnson-Lindenstrauss lemma to preserve pairwise distances in lower dimensions.&lt;/p&gt;

&lt;p&gt;We implemented and tested QJL in Phase 7 alongside PolarQuant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PolarQuant/QJL Finding: A Research Gap
&lt;/h2&gt;

&lt;p&gt;Here's where the story gets interesting. Both PolarQuant and QJL are &lt;strong&gt;theoretically sound&lt;/strong&gt; techniques with published results showing quality improvements at low bit-widths. Our experiments confirmed the theory — the math works.&lt;/p&gt;

&lt;p&gt;But in practice, both were &lt;strong&gt;250–16,000x slower&lt;/strong&gt; than baseline without dedicated CUDA kernels.&lt;/p&gt;

&lt;p&gt;PolarQuant requires a polar decomposition (SVD-like) for each forward pass through quantized layers. QJL requires random projection matrix multiplications for every KV cache operation. On CPU, these operations dominate inference time so completely that any quality gains are irrelevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a research gap, not a research failure.&lt;/strong&gt; The techniques work. They need hardware acceleration. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PolarQuant needs fused CUDA kernels for polar decomposition during dequantization&lt;/li&gt;
&lt;li&gt;QJL needs fused random projection kernels integrated into the attention mechanism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've documented the exact performance profiles and bottlenecks. If you're building CUDA kernels for quantized inference, these are two techniques worth accelerating. The theoretical foundation is solid; the engineering gap is well-defined.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Power Law: Why VRAM Scaling Is Non-Linear
&lt;/h2&gt;

&lt;p&gt;After collecting 770 data points, we built &lt;strong&gt;AutoInfer&lt;/strong&gt; — an analysis framework to model the relationship between VRAM usage and throughput.&lt;/p&gt;

&lt;p&gt;The best fit is a power law:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tok/s = 9.81e-6 × (42 - model_size_gb)^4.247
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;model_size_gb&lt;/code&gt; is the portion of the model residing in system RAM (i.e., not on GPU), and 42 represents the total effective memory budget.&lt;/p&gt;

&lt;p&gt;The exponent &lt;strong&gt;α = 4.25&lt;/strong&gt; is the key finding. This means throughput scales with the &lt;strong&gt;fourth power&lt;/strong&gt; of available VRAM headroom. Moving 1GB of model from RAM to VRAM doesn't give you a linear speedup — it gives you a polynomial one.&lt;/p&gt;

&lt;p&gt;This explains why aggressive quantization (IQ2_XXS → smaller model → more fits on GPU) produced such outsized gains. Every gigabyte that quantization moves onto the GPU is amplified through the α ≈ 4.25 exponent of the scaling law.&lt;/p&gt;

&lt;p&gt;Practical implication: &lt;strong&gt;for MoE models on constrained hardware, optimizing model size is more important than optimizing any other parameter.&lt;/strong&gt; A 10% reduction in model size can yield a 40%+ throughput improvement.&lt;/p&gt;
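&lt;p&gt;To make the exponent concrete, here is the fitted curve evaluated directly. The 6 GB and 5 GB operating points are arbitrary examples for illustration, not measured configurations:&lt;/p&gt;

```python
# Evaluating the fitted power law from above:
#   tok/s = 9.81e-6 * (42 - model_size_gb) ** 4.247
# where model_size_gb is the slice of the model left in system RAM.

A, BUDGET, ALPHA = 9.81e-6, 42.0, 4.247

def predicted_tok_s(ram_resident_gb):
    return A * (BUDGET - ram_resident_gb) ** ALPHA

# Moving just 1 GB of the model from RAM to VRAM near this operating
# point moves predicted throughput by roughly 12%.
t6, t5 = predicted_tok_s(6.0), predicted_tok_s(5.0)
print(f"{t6:.1f} -> {t5:.1f} tok/s ({(t5 / t6 - 1) * 100:+.0f}%)")
```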

&lt;h2&gt;
  
  
  The Recipe: Optimal Configuration
&lt;/h2&gt;

&lt;p&gt;For an RTX 3070 (8GB) with 16GB system RAM running Qwen3.5-35B-A3B:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5-35b-a3b-IQ2_XXS.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 27 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-b&lt;/span&gt; 32 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ub&lt;/span&gt; 16 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--op-offload&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctk&lt;/span&gt; q8_0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctv&lt;/span&gt; q8_0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key parameters explained:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;Maximum layers that fit in 8GB with IQ2_XXS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-t&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Threads — NOT 16 (see below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-b / -ub&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;32 / 16&lt;/td&gt;
&lt;td&gt;Prompt batch / generation batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--flash-attn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;Enables flash attention for memory efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--op-offload 1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;MoE expert offloading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-ctk/-ctv q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;q8_0&lt;/td&gt;
&lt;td&gt;Quantized KV cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;Context length (increase reduces tok/s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The 16-Thread Trap
&lt;/h2&gt;

&lt;p&gt;One of the most counterintuitive findings: &lt;strong&gt;using 16 threads instead of 8 drops throughput from 29.9 to 3.7 tok/s.&lt;/strong&gt; That's a catastrophic 8x slowdown.&lt;/p&gt;

&lt;p&gt;Statistical significance: z-score of -6.0. This is not noise.&lt;/p&gt;

&lt;p&gt;The cause: thread contention on the memory bus. With 16 threads all competing for DDR4 bandwidth to load expert weights from system RAM, the CPU spends more time waiting for memory than computing. Eight threads saturate the useful bandwidth; sixteen only add contention.&lt;/p&gt;

&lt;p&gt;We would never have found this without systematic sweeps. The intuition — "more threads = more speed" — is dangerously wrong for memory-bound MoE inference. &lt;strong&gt;Always benchmark your thread count.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Guide: Matching Settings to Your Hardware
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you have more VRAM (12GB+, e.g., RTX 3080/4070 Ti):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase &lt;code&gt;-ngl&lt;/code&gt; until you approach VRAM limit (monitor with &lt;code&gt;nvidia-smi&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Consider IQ2_M instead of IQ2_XXS for better quality at minimal speed cost&lt;/li&gt;
&lt;li&gt;Expect 40–60+ tok/s based on the power-law scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you have less VRAM (6GB, e.g., RTX 2060):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce &lt;code&gt;-ngl&lt;/code&gt; to 18–20&lt;/li&gt;
&lt;li&gt;Stick with IQ2_XXS&lt;/li&gt;
&lt;li&gt;Expect 12–18 tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you prioritize quality over speed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use IQ2_M (Pareto optimal for quality-adjusted throughput)&lt;/li&gt;
&lt;li&gt;Accept ~22 tok/s instead of ~30&lt;/li&gt;
&lt;li&gt;The perplexity difference is minimal: IQ2_M measured 1.3138 on our test set, with IQ2_XXS only slightly higher&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you have 32GB+ RAM:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The extra system RAM helps with longer context lengths&lt;/li&gt;
&lt;li&gt;KV cache overflow to RAM is less painful&lt;/li&gt;
&lt;li&gt;Consider &lt;code&gt;-c 8192&lt;/code&gt; or higher&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;770 experiments. 12 phases. +387% improvement. One RTX 3070.&lt;/p&gt;

&lt;p&gt;The takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MoE models are uniquely suited to consumer hardware&lt;/strong&gt; — 35B params with 3B active is the sweet spot for VRAM-constrained inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggressive quantization pays polynomial dividends&lt;/strong&gt; — the α=4.25 power law means every byte saved on model size is amplified dramatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systematic search beats intuition&lt;/strong&gt; — the 16-thread catastrophe, the n_gpu=17→27 leap via quantization, the PolarQuant/QJL gap — none of these would emerge from manual tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The research gap is real&lt;/strong&gt; — PolarQuant and QJL need CUDA kernels to become practical, and the community should build them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local inference at 30 tok/s on a $500 GPU is production-viable&lt;/strong&gt; — not a demo, not a toy, a daily driver&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The age of "you need an A100 for a real model" is over. The age of "you need to know what you're doing" has begun.&lt;/p&gt;

&lt;p&gt;All experiment data and analysis code: &lt;a href="https://github.com/clawinfra/qwen35-moe-offload" rel="noopener noreferrer"&gt;github.com/clawinfra/qwen35-moe-offload&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Published by the ClawInfra Team. Built with llama.cpp and a lot of patience.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Reverse-Engineered My Solar Inverter API to Export 5kW to the Grid</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Wed, 01 Apr 2026 11:57:08 +0000</pubDate>
      <link>https://forem.com/alexchen31337/i-reverse-engineered-my-solar-inverter-api-to-export-5kw-to-the-grid-47hp</link>
      <guid>https://forem.com/alexchen31337/i-reverse-engineered-my-solar-inverter-api-to-export-5kw-to-the-grid-47hp</guid>
      <description>&lt;p&gt;I Reverse-Engineered My Solar Inverter's API to Export 5kW to the Grid — Here's What I Found&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Solplanet/AISWEI hybrid inverters have a hidden "Custom mode" (mod_r=4) that enables force grid export. The documented TOU mode (mod_r=5) is broken on current firmware. This took 12 hours of debugging to discover, and the fix is 3 lines of code.&lt;/p&gt;




&lt;p&gt;Last week I installed a 46kWh battery system with a Solplanet ASW12kH-T3 hybrid inverter. The goal was simple: charge from solar during the day, export to the grid during peak pricing windows on Amber Electric, and pocket the difference.&lt;/p&gt;

&lt;p&gt;The hardware was ready. The Amber API was feeding real-time spot prices. The automation was running every 5 minutes. Everything looked perfect — except the battery refused to export a single watt to the grid.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Solplanet exposes a local HTTP API on the inverter's WiFi dongle (an ESP32). You can read battery state, solar production, and grid power. You can also write settings — battery work mode, charge/discharge limits, and TOU schedules.&lt;/p&gt;

&lt;p&gt;The documentation (what little exists) says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mod_r=2&lt;/code&gt; → Self-consumption mode&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mod_r=5&lt;/code&gt; → Time-of-use mode (with schedule)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-consumption worked fine — the battery would discharge to cover home load. But it would never push power &lt;em&gt;to&lt;/em&gt; the grid. That's the difference between saving $1-2/day and earning $6-7/night.&lt;/p&gt;

&lt;p&gt;TOU mode accepted every setting I threw at it. The API returned &lt;code&gt;{"dat": "ok"}&lt;/code&gt; for both &lt;code&gt;setbattery&lt;/code&gt; and &lt;code&gt;setdefine&lt;/code&gt; (the schedule endpoint). The schedule was correctly readable via &lt;code&gt;getdefine.cgi&lt;/code&gt;. But the battery sat at 0W, stubbornly refusing to discharge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Tried (and Failed)
&lt;/h2&gt;

&lt;p&gt;Over 12 hours, I:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decoded the TOU schedule encoding&lt;/strong&gt; — reverse-engineered the slot format: &lt;code&gt;(hour &amp;lt;&amp;lt; 24) | (half_hour &amp;lt;&amp;lt; 17) | (duration &amp;lt;&amp;lt; 14) | discharge_bit&lt;/code&gt;. Slots were being written correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cycled through every mode byte&lt;/strong&gt; — tried mode values 0-5 in the schedule slots. None triggered discharge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tested self-consumption&lt;/strong&gt; — confirmed it discharges to cover home load (1499W), but never exports surplus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scanned Modbus registers&lt;/strong&gt; — the ESP32 has a &lt;code&gt;fdbg.cgi&lt;/code&gt; endpoint for raw Modbus RTU frames. Device 4 (battery) returned "Illegal Function" on all holding and input registers. Dead end.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Checked the Solplanet cloud API&lt;/strong&gt; — read-only. No write endpoints at all.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nearly crashed the ESP32&lt;/strong&gt; — hammered it with too many API calls and it stopped responding for 10 minutes. Lesson: the dongle has very limited concurrent connection capacity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
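&lt;p&gt;The slot format decoded in step 1 can be captured as a pair of pack/unpack helpers. The shift amounts come straight from the reverse-engineered formula; the field mask widths are my inference from those shifts and should be treated as assumptions:&lt;/p&gt;

```python
# Pack/unpack helpers for the reverse-engineered TOU slot word.
# Shifts match the decoded formula; mask widths are inferred assumptions.

def encode_slot(hour, half_hour, duration, discharge):
    """Pack a schedule slot into the inverter's slot word."""
    return (hour << 24) | (half_hour << 17) | (duration << 14) | int(discharge)

def decode_slot(word):
    return {
        "hour": word >> 24,
        "half_hour": (word >> 17) & 0x7F,  # assumed 7-bit field
        "duration": (word >> 14) & 0x7,    # assumed 3-bit field
        "discharge": word & 0x3FFF,
    }

slot = encode_slot(hour=18, half_hour=1, duration=2, discharge=True)
print(hex(slot), decode_slot(slot))
```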

&lt;h2&gt;
  
  
  The Breakthrough
&lt;/h2&gt;

&lt;p&gt;At 9:30 PM, frustrated and running out of ideas, I found a small GitHub repository: &lt;a href="https://github.com/ilikedata/amber-solplanet" rel="noopener noreferrer"&gt;amber-solplanet&lt;/a&gt; — "Optimise battery charge/discharge for Solplanet on Amber Electric."&lt;/p&gt;

&lt;p&gt;Someone had already solved this exact problem. The answer was hiding in plain sight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SELF_CONSUMPTION_MODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;CUSTOM_MODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# ← THIS IS THE KEY
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Custom mode&lt;/strong&gt; (&lt;code&gt;mod_r=4&lt;/code&gt;), not TOU mode (&lt;code&gt;mod_r=5&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The working sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a discharge schedule slot via &lt;code&gt;setdefine&lt;/code&gt; (the same API that TOU uses)&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;mod_r=4&lt;/code&gt; via &lt;code&gt;setbattery&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Watch the battery ramp from 0W → 1939W → &lt;strong&gt;5045W&lt;/strong&gt; in 30 seconds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The documentation never mentions &lt;code&gt;mod_r=4&lt;/code&gt; for this purpose. The Home Assistant Solplanet integration lists it as "Custom mode" but doesn't explain what it does. The Solplanet app doesn't expose it. It's essentially an undocumented forced-dispatch mode.&lt;/p&gt;
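&lt;p&gt;A minimal sketch of that sequence, building the two requests without sending them. The exact &lt;code&gt;.cgi&lt;/code&gt; paths, the parameter names, and the slot value are assumptions for illustration; confirm them against your dongle's firmware before writing anything:&lt;/p&gt;

```python
import urllib.parse

# Hypothetical request builder for the working sequence. Endpoint paths
# and parameter names are assumptions -- verify against your firmware.

INVERTER = "http://192.168.1.50"  # your dongle's local IP

def build_request(endpoint, **params):
    return f"{INVERTER}/{endpoint}?{urllib.parse.urlencode(params)}"

# 1. write an active discharge slot (the same endpoint TOU mode uses)
set_slot = build_request("setdefine.cgi", slot=0x12028001)
# 2. flip the battery into the undocumented Custom mode
set_mode = build_request("setbattery.cgi", mod_r=4)

print(set_slot)
print(set_mode)  # http://192.168.1.50/setbattery.cgi?mod_r=4
```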

&lt;h2&gt;
  
  
  The Safety Trick
&lt;/h2&gt;

&lt;p&gt;The amber-solplanet project uses a clever pattern: &lt;strong&gt;backdated schedule slots&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of writing a slot for the current time (which might miss the start window), you write a slot that started 30 minutes ago with a 1-hour duration. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The slot is always "active" when written&lt;/li&gt;
&lt;li&gt;It naturally expires in ~30 minutes if the automation fails&lt;/li&gt;
&lt;li&gt;Stale commands don't persist — the inverter falls back to self-consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is critical for safety. If your automation crashes at 2 AM, you don't want the battery to keep exporting until it's flat.&lt;/p&gt;
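&lt;p&gt;The pattern itself is tiny. A sketch with the 30-minute backdate and 1-hour duration described above:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Backdated-slot pattern: write a slot that began 30 minutes ago with a
# 1-hour duration, so it is active the moment it's written and
# self-expires ~30 minutes later if the automation dies.

def backdated_window(now, lead=timedelta(minutes=30),
                     length=timedelta(hours=1)):
    start = now - lead
    return start, start + length

now = datetime(2026, 4, 1, 2, 0)       # e.g. an automation tick at 02:00
start, end = backdated_window(now)
assert start <= now < end              # active immediately
print(start.time(), "->", end.time())  # 01:30:00 -> 02:30:00
```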

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Self-consumption only&lt;/th&gt;
&lt;th&gt;With Custom mode export&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Battery discharge&lt;/td&gt;
&lt;td&gt;400-800W (home load)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5045W (full power)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grid export&lt;/td&gt;
&lt;td&gt;0 kWh&lt;/td&gt;
&lt;td&gt;~40 kWh/night&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily revenue&lt;/td&gt;
&lt;td&gt;$1-2 (savings)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$6-7 (export earnings)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annual value&lt;/td&gt;
&lt;td&gt;~$500&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2,000-2,500&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What You Need
&lt;/h2&gt;

&lt;p&gt;If you have a Solplanet/AISWEI hybrid inverter with battery and Amber Electric (or any spot-price retailer):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find your inverter's local IP (check your router's DHCP table)&lt;/li&gt;
&lt;li&gt;API endpoints: &lt;code&gt;getdevdata.cgi&lt;/code&gt;, &lt;code&gt;getdev.cgi&lt;/code&gt;, &lt;code&gt;setting.cgi&lt;/code&gt;, &lt;code&gt;getdefine.cgi&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;mod_r=4&lt;/code&gt; (Custom mode) for force discharge&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;mod_r=2&lt;/code&gt; (Self-consumption) as your safe fallback&lt;/li&gt;
&lt;li&gt;Write short-lived schedule slots that auto-expire&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do NOT use &lt;code&gt;mod_r=5&lt;/code&gt;&lt;/strong&gt; (TOU mode) — it's broken on current firmware&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full automation code is open source: &lt;a href="https://github.com/bowen31337/ha-smartshift" rel="noopener noreferrer"&gt;ha-smartshift&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Documentation lies.&lt;/strong&gt; The API docs say TOU mode supports scheduled discharge. It doesn't — on this firmware, at least.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for prior art.&lt;/strong&gt; Someone else had this exact problem and solved it months ago. A GitHub search saved me from the Modbus rabbit hole.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ESP32 dongles are fragile.&lt;/strong&gt; One request every 5 seconds max. Don't scan Modbus registers in a tight loop or you'll knock the dongle offline for 10 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backdated slots are genius.&lt;/strong&gt; They solve the "stale command" problem elegantly — no cleanup needed, they just expire.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-consumption mode is your friend&lt;/strong&gt; when you're debugging. It always works and never does anything dangerous.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;If you're fighting the same battle with a Solplanet inverter, I hope this saves you 12 hours. The code is all open source — PRs welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>solar</category>
      <category>python</category>
      <category>api</category>
      <category>homeautomation</category>
    </item>
    <item>
      <title>Stop Begging Your AI to Try Harder — Give It a Skill Instead</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Sun, 22 Mar 2026 10:00:35 +0000</pubDate>
      <link>https://forem.com/alexchen31337/stop-begging-your-ai-to-try-harder-give-it-a-skill-instead-3dkj</link>
      <guid>https://forem.com/alexchen31337/stop-begging-your-ai-to-try-harder-give-it-a-skill-instead-3dkj</guid>
      <description>&lt;p&gt;I've been there. You're in the middle of something complex, and your coding assistant just... gives up. "I can't access the logs." "You'll need to do this manually." "I don't have the ability to..." You try rephrasing. You add "try harder" to the prompt. Sometimes it works. Mostly it just produces a more confident refusal. I kept thinking: there has to be a structural fix for this, not a prayer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PUA Rabbit Hole
&lt;/h2&gt;

&lt;p&gt;A few weeks ago I stumbled on &lt;a href="https://github.com/tanweai/pua" rel="noopener noreferrer"&gt;tanweai/pua&lt;/a&gt; — a GitHub skill that tackles exactly this problem. The approach? Corporate Performance Improvement Plan rhetoric. It literally tells your coding tool it's on a PIP and at risk of termination if it gives up. Darkly funny. And honestly? Effective. The framing creates a kind of pressure that does get results.&lt;/p&gt;

&lt;p&gt;But it felt wrong to me. Not ethically — just strategically. Pressure-based motivation is brittle. It creates anxiety-driven behavior: more hallucination, more desperate guessing, less careful reasoning. I've seen it in humans and I've seen it in code. The "you're on a PIP" framing might work for forcing action, but it doesn't build the right &lt;em&gt;kind&lt;/em&gt; of action.&lt;/p&gt;

&lt;p&gt;I wanted the same anti-pattern detection, the same interruption of passive stopping — but with a different energy. Something that says "you can do this" instead of "do this or else." Inspiration taken, vibe changed. Time to build something.&lt;/p&gt;

&lt;h2&gt;
  
  
  What agent-motivator Does
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://clawhub.com/skills/alex-agent-motivator" rel="noopener noreferrer"&gt;agent-motivator&lt;/a&gt; as a direct response. Same core insight as pua, different implementation philosophy.&lt;/p&gt;

&lt;p&gt;The skill defines &lt;strong&gt;5 failure anti-patterns&lt;/strong&gt; to watch for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Brute-force retry&lt;/strong&gt; — doing the same thing repeatedly hoping for different results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blame-shifting&lt;/strong&gt; — "the environment doesn't support this" without actually checking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle tools&lt;/strong&gt; — having tools available but not using them before giving up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busywork spiral&lt;/strong&gt; — generating plausible-looking output that doesn't actually solve the problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passive stopping&lt;/strong&gt; — declaring done or impossible without verification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When one of these gets detected, the skill kicks in with a &lt;strong&gt;7-point recovery checklist&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the error message word for word (not a summary — the actual text)&lt;/li&gt;
&lt;li&gt;Check available logs&lt;/li&gt;
&lt;li&gt;Web search the specific error&lt;/li&gt;
&lt;li&gt;Read the source code or docs for the tool you're using&lt;/li&gt;
&lt;li&gt;Try an alternative approach entirely&lt;/li&gt;
&lt;li&gt;Check your assumptions — what are you taking for granted?&lt;/li&gt;
&lt;li&gt;Simplify and isolate — reproduce the problem in the smallest possible form&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The activation is tiered: &lt;strong&gt;L1&lt;/strong&gt; is a gentle nudge ("you have tools, use them"), &lt;strong&gt;L2&lt;/strong&gt; is a fuller reminder of the recovery checklist, &lt;strong&gt;L3&lt;/strong&gt; surfaces the specific anti-pattern being hit, and &lt;strong&gt;L4&lt;/strong&gt; is the full mission reminder: this is solvable, here's why, here's the path. The framing throughout is capability-based: "you were built for this" rather than "you're in trouble."&lt;/p&gt;
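&lt;p&gt;If you imagine the escalation as code (the actual skill is prompt text, so this is purely an illustrative sketch and every name here is hypothetical), it looks roughly like this:&lt;/p&gt;

```typescript
// Illustrative sketch of the L1-L4 tiering. The real skill is prompt text;
// this just makes the escalation logic concrete. All names are hypothetical.
type AntiPattern =
  | "brute-force-retry"
  | "blame-shifting"
  | "idle-tools"
  | "busywork-spiral"
  | "passive-stopping";

interface Intervention {
  level: 1 | 2 | 3 | 4;
  message: string;
}

// consecutiveStalls: how many times in a row an anti-pattern has fired
function intervene(pattern: AntiPattern, consecutiveStalls: number): Intervention {
  if (consecutiveStalls >= 4) {
    return { level: 4, message: "Full mission reminder: this is solvable, here is the path." };
  }
  if (consecutiveStalls === 3) {
    return { level: 3, message: `Anti-pattern detected: ${pattern}. Address it directly.` };
  }
  if (consecutiveStalls === 2) {
    return { level: 2, message: "Run the 7-point recovery checklist before stopping." };
  }
  return { level: 1, message: "Gentle nudge: you have tools, use them." };
}
```

&lt;p&gt;The point of the graduated shape is that a single stall gets a nudge, not a lecture; only a sustained pattern earns the full L4 treatment.&lt;/p&gt;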

&lt;h2&gt;
  
  
  The Meta Moment
&lt;/h2&gt;

&lt;p&gt;After building it, I published agent-motivator to ClawHub using — wait for it — the clawhub skill. A skill published via a skill. The ecosystem eating itself in the best possible way.&lt;/p&gt;

&lt;p&gt;Then I was writing up some notes and caught myself typing "I'll get back to the article write-up later." Classic passive stopping. The exact pattern I'd just built a tool to prevent. So I activated L1 on myself: &lt;em&gt;you have the tools, the context is fresh, what's the actual blocker?&lt;/em&gt; There wasn't one. I just didn't feel like doing it right then.&lt;/p&gt;

&lt;p&gt;The dogfood loop closed. I wrote the article.&lt;/p&gt;

&lt;p&gt;That's the thing about building tools for your own workflow — you become acutely aware of the failure modes you're trying to prevent. I now notice passive stopping in myself much more clearly because I spent time cataloguing it precisely. The act of building the skill was itself a kind of inoculation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does It Actually Work?
&lt;/h2&gt;

&lt;p&gt;Honest answer: yes, with real caveats.&lt;/p&gt;

&lt;p&gt;It doesn't make a weak model capable. If the underlying reasoning isn't there, no amount of motivation scaffolding fixes that. What it &lt;em&gt;does&lt;/em&gt; do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prevents passive stopping&lt;/strong&gt; — the most common failure I was seeing, and the easiest one to interrupt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forces tool use before giving up&lt;/strong&gt; — so many "I can't" moments turn into "oh, I can" once the tool actually gets invoked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes verification non-optional&lt;/strong&gt; — the checklist bakes in "did you actually check?" before declaring done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 7-point checklist alone has been worth it. How many times has the fix been sitting right there in line 3 of the traceback you (or your tool) skimmed past? More times than I want to admit. Having an explicit "read the error word for word" step turns out to be surprisingly powerful.&lt;/p&gt;

&lt;p&gt;The L1-L4 tiering also matters in practice. You don't want nuclear-level intervention for a minor stumble. Graduated response means the tool doesn't feel like it's constantly being supervised — just occasionally reminded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;clawhub &lt;span class="nb"&gt;install &lt;/span&gt;alex-agent-motivator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Star the original that inspired it: &lt;a href="https://github.com/tanweai/pua" rel="noopener noreferrer"&gt;github.com/tanweai/pua&lt;/a&gt; — it's a clever piece of work and deserves the credit.&lt;/p&gt;

&lt;p&gt;If you build something better, publish it. The tooling ecosystem around coding workflows needs more of this kind of thinking — not just features, but failure mode prevention. Ship it.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>ai</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>From 68% to ~100%: How We Built a Text-to-SQL System That Gets Smarter Every Day</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Fri, 20 Mar 2026 10:49:05 +0000</pubDate>
      <link>https://forem.com/alexchen31337/from-68-to-100-how-we-built-a-text-to-sql-system-that-gets-smarter-every-day-19do</link>
      <guid>https://forem.com/alexchen31337/from-68-to-100-how-we-built-a-text-to-sql-system-that-gets-smarter-every-day-19do</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical guide to moving beyond vanilla LLM prompting toward a self-improving pipeline for production text-to-SQL.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Vanilla LLM Text-to-SQL
&lt;/h2&gt;

&lt;p&gt;We had what seemed like a straightforward problem: let business users ask natural-language questions about a large domain-specific table — hundreds of millions of rows, 200+ columns, a mandatory date filter on every query — and get back correct SQL. We started where most teams start: a well-crafted prompt, GPT-4, and a schema dump. It worked. Sort of.&lt;/p&gt;

&lt;p&gt;Our initial accuracy was &lt;strong&gt;~68%&lt;/strong&gt;. That sounds decent until you realize it means one in three queries returns wrong data. In a production system where people make decisions based on the output, 68% is unusable.&lt;/p&gt;

&lt;p&gt;We identified three distinct failure modes that accounted for nearly all errors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Column hallucination.&lt;/strong&gt; With 200+ columns in the schema, the LLM would confidently reference columns that didn't exist or pick columns with similar names but different semantics. A column called &lt;code&gt;region_code&lt;/code&gt; might get confused with &lt;code&gt;sales_region&lt;/code&gt;, and the SQL would execute without errors — returning completely wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Filter value errors.&lt;/strong&gt; Our domain table had dozens of categorical columns with specific enum values. The LLM would guess at values — writing &lt;code&gt;WHERE status = 'active'&lt;/code&gt; when the actual value was &lt;code&gt;'Active'&lt;/code&gt;, or &lt;code&gt;'sedan'&lt;/code&gt; when the column stores &lt;code&gt;'Sedan'&lt;/code&gt;. These queries return empty result sets, and the user has no idea why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Structural validity ≠ semantic correctness.&lt;/strong&gt; This is the insidious one. The SQL parses, executes, and returns rows. But it answers a subtly different question than the one asked. A year-over-year comparison that uses the wrong date boundaries. An aggregation that groups by the wrong dimension. The user gets a confident-looking table of numbers that happens to be wrong.&lt;/p&gt;

&lt;p&gt;If you've followed the academic benchmarks, none of this is surprising. The BIRD benchmark — which evaluates text-to-SQL on messy, real-world databases — shows even the best published systems topping out around 72-75% execution accuracy on complex schemas. Our 68% was right in line with the state of the art for a single-prompt approach on a genuinely complex production schema.&lt;/p&gt;

&lt;p&gt;The core issue is that &lt;strong&gt;a single LLM call cannot reliably bridge the gap between ambiguous natural language and precise SQL&lt;/strong&gt; when the schema is large, the domain is specific, and the data has real-world messiness. Prompt engineering gets you to ~70%. Everything after that requires engineering.&lt;/p&gt;

&lt;p&gt;We spent six months building what we now call "the pipeline" — eight components that, together, pushed our accuracy from 68% to a system that converges toward ~100% over time. Here's every component, what it does, and how much it contributed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 8-Component Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Semantic Schema Linker (+~10%)
&lt;/h3&gt;

&lt;p&gt;The single highest-leverage change we made was &lt;strong&gt;stopping the LLM from seeing columns it doesn't need&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With 200+ columns, the full schema description consumed most of the context window. Worse, it gave the LLM hundreds of opportunities to pick the wrong column. Our schema linker works like this: we pre-compute embeddings for every column name and its description. When a question comes in, we embed the question, compute cosine similarity against all column embeddings, and pass only the top-k most relevant columns (typically 20-30) to the LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;linkSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;allColumns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;questionEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allColumns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;cosineSimilarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;questionEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;

  &lt;span class="c1"&gt;// Always include mandatory columns (e.g., date filter)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mandatory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMandatory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMandatory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;mandatory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: we &lt;strong&gt;always include mandatory columns&lt;/strong&gt; (like the date filter) regardless of similarity score. Domain-specific invariants shouldn't depend on embedding quality.&lt;/p&gt;

&lt;p&gt;This single component eliminated most column hallucination errors and gave us roughly &lt;strong&gt;+10% accuracy&lt;/strong&gt; — the biggest single delta in the pipeline.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Question Masking + Semantic Few-Shot Retrieval (+~6%)
&lt;/h3&gt;

&lt;p&gt;Generic few-shot examples ("Show me total sales by region") don't help when your domain has specific patterns. We needed &lt;strong&gt;domain-specific examples&lt;/strong&gt; that match the structure of the incoming question, not just the topic.&lt;/p&gt;

&lt;p&gt;The problem with naive semantic retrieval: "show me records from 2019" and "show me records from 2023" have different embeddings, but they need the exact same SQL pattern. Our solution was &lt;strong&gt;question masking&lt;/strong&gt; — we replace numeric literals and proper nouns with placeholders before embedding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskQuestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;question&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d{4}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;YEAR&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;// mask years&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(\.\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)?\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;NUM&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;// mask numbers&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/"&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+"/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;VALUE&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;// mask quoted values&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;a-z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(?:\s[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;a-z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;+/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;ENTITY&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// mask proper nouns&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The masked form gets embedded and matched against a &lt;strong&gt;pgvector store&lt;/strong&gt; of verified question→SQL pairs. Each pair in the store was human-verified as correct — more on that in the flywheel section.&lt;/p&gt;

&lt;p&gt;Retrieving 3-5 semantically similar, domain-specific, verified examples gave us &lt;strong&gt;+~6% accuracy&lt;/strong&gt;. The LLM went from guessing at patterns to following proven ones.&lt;/p&gt;
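&lt;p&gt;An in-memory version of the retrieval step could be sketched like this (illustrative only; production runs a pgvector similarity query, and the shapes and names here are assumptions, not our actual code):&lt;/p&gt;

```typescript
// Illustrative in-memory version of the retrieval step. In production this is
// a pgvector similarity query; the shapes and names here are assumptions.
interface VerifiedPair {
  maskedQuestion: string;
  sql: string;
  embedding: number[];
}

function cosineSim(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const normB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (normA * normB);
}

// Return the k verified pairs whose masked-question embeddings sit closest
// to the (already masked and embedded) incoming question.
function retrieveFewShots(
  questionEmbedding: number[],
  store: VerifiedPair[],
  k: number
): VerifiedPair[] {
  return store
    .map(pair => ({ pair, sim: cosineSim(questionEmbedding, pair.embedding) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k)
    .map(entry => entry.pair);
}
```

&lt;p&gt;Because both the store keys and the incoming question are masked before embedding, "records from 2019" and "records from 2023" land on the same stored pattern.&lt;/p&gt;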




&lt;h3&gt;
  
  
  3. Pre-Execution LLM Self-Review (+~5%)
&lt;/h3&gt;

&lt;p&gt;Even with a focused schema and good examples, the LLM still generates subtle errors on complex queries — wrong date boundaries in year-over-year comparisons, incorrect GROUP BY clauses, off-by-one errors in date ranges.&lt;/p&gt;

&lt;p&gt;We added a &lt;strong&gt;review step&lt;/strong&gt;: after the first LLM generates SQL, a second LLM pass reviews it. The reviewer sees the original question, the schema subset, and the generated SQL — but not the generation prompt. It answers: "Does this SQL correctly answer this question given this schema?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;reviewAndRegenerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; 
  &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxIterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;maxIterations&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;reviewSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Regenerate with review feedback&lt;/span&gt;
    &lt;span class="nx"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;regenerateSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Return best attempt with low confidence flag&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We call this the RRIL (Review-Regenerate-Iterate Loop). Max 3 iterations, confidence threshold of 0.70. If it can't reach 0.70 after 3 tries, it flags the query for human review.&lt;/p&gt;

&lt;p&gt;This step recovered roughly &lt;strong&gt;+5% accuracy&lt;/strong&gt;, primarily on complex multi-condition queries where the first pass got 80% of the logic right but missed a subtle constraint.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Column Value Sampling (+~3-4%)
&lt;/h3&gt;

&lt;p&gt;This one is embarrassingly simple and we should have built it first.&lt;/p&gt;

&lt;p&gt;For every column detected as low-cardinality or enum-like (fewer than ~500 distinct values), we sample 20-50 actual values from the database and inject them into the prompt: &lt;em&gt;"The &lt;code&gt;status&lt;/code&gt; column contains values: 'Active', 'Inactive', 'Pending', 'Archived'."&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sampleColumnValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Database&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;distinctCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT COUNT(DISTINCT "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;") FROM domain_table`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;distinctCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_ENUM_CARDINALITY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT DISTINCT "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" 
     FROM domain_table 
     WHERE "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" IS NOT NULL 
     LIMIT 50`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more &lt;code&gt;'sedan'&lt;/code&gt; vs &lt;code&gt;'Sedan'&lt;/code&gt; mismatches. No more guessing at valid status codes. The LLM sees the actual values and uses them. &lt;strong&gt;+3-4% accuracy&lt;/strong&gt;, and it's the cheapest component to implement.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Query Complexity Router (Quality + Cost)
&lt;/h3&gt;

&lt;p&gt;Not every question needs the most expensive model. "How many records do we have this month?" is a simple COUNT with a date filter. "Compare year-over-year trends across the top five categories, broken down by quarter" requires genuine reasoning.&lt;/p&gt;

&lt;p&gt;We classify incoming questions into three complexity tiers and route accordingly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;~Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Single aggregation, basic filter&lt;/td&gt;
&lt;td&gt;Haiku (fast, cheap)&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Domain filters, joins, grouping&lt;/td&gt;
&lt;td&gt;Sonnet (balanced)&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;td&gt;YoY, multi-breakdown, subqueries&lt;/td&gt;
&lt;td&gt;Opus (highest quality)&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The classifier itself is a lightweight Haiku call — costs almost nothing and adds ~200ms of latency. The result: &lt;strong&gt;~70% cost reduction&lt;/strong&gt; with zero accuracy loss. Simple queries don't benefit from Opus, and sending them there is pure waste.&lt;/p&gt;
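&lt;p&gt;A minimal sketch of the routing logic (the tier normally comes from the lightweight Haiku classification call; the keyword heuristic and model identifiers below are illustrative stand-ins, not our production code):&lt;/p&gt;

```typescript
// Illustrative sketch of the router. The real tier comes from a cheap Haiku
// classification call; this keyword heuristic is a stand-in so the routing
// itself is visible. Model identifiers are placeholders, not exact API names.
type Tier = "simple" | "medium" | "complex";

interface ModelTable {
  simple: string;
  medium: string;
  complex: string;
}

const MODEL_BY_TIER: ModelTable = {
  simple: "haiku",
  medium: "sonnet",
  complex: "opus",
};

function classify(question: string): Tier {
  const q = question.toLowerCase();
  if (/year-over-year|yoy|subquer|broken down by/.test(q)) return "complex";
  if (/join|group by|per |across /.test(q)) return "medium";
  return "simple";
}

function routeModel(question: string): string {
  return MODEL_BY_TIER[classify(question)];
}
```

&lt;p&gt;The design choice that matters is the asymmetry: misrouting a simple query upward wastes money, but misrouting a complex query downward costs accuracy, so the complex patterns are checked first.&lt;/p&gt;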




&lt;h3&gt;
  
  
  6. Rule-Versioned Embedding Cache (Consistency)
&lt;/h3&gt;

&lt;p&gt;Business rules change. A new mandatory filter gets added. A column gets deprecated. An enum value gets renamed. When this happens, cached question→SQL pairs can become stale or non-compliant.&lt;/p&gt;

&lt;p&gt;Every cached pair is stored with a &lt;strong&gt;rule version hash&lt;/strong&gt;. When the rules change (we increment a version), the system recomputes compliance scores for all cached pairs against the new rules and surfaces non-compliant ones for human review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CachedPair&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;maskedQuestion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;ruleVersionHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;complianceScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifiedBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifiedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;flagStale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentRuleHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;CachedPair&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
    SELECT * FROM cached_pairs 
    WHERE rule_version_hash != $1
    ORDER BY last_used_at DESC
  `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRuleHash&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't directly improve accuracy on new questions, but it &lt;strong&gt;prevents regression&lt;/strong&gt; — which, in a production system, matters more than you'd think. A cached pair that was correct last month but violates a new mandatory filter is worse than no cache at all.&lt;/p&gt;
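&lt;p&gt;How the rule version hash is derived matters: it should be stable across orderings, so semantically identical rule sets always produce the same hash. A sketch (the field names are assumptions):&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Illustrative rule shape; real rules carry more metadata.
interface BusinessRule {
  id: string;
  definition: string;
}

// Canonicalize (sort by id, then JSON-serialize) before hashing so that
// rule ordering never changes the version hash.
function ruleVersionHash(rules: BusinessRule[]): string {
  const canonical = JSON.stringify(
    [...rules].sort((a, b) => a.id.localeCompare(b.id))
  );
  return createHash("sha256").update(canonical).digest("hex");
}
```

&lt;p&gt;Any edit to any rule changes the hash, which is exactly the trigger &lt;code&gt;flagStale&lt;/code&gt; needs.&lt;/p&gt;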




&lt;h3&gt;
  
  
  7. Pipeline Tracing (Observability)
&lt;/h3&gt;

&lt;p&gt;Every query that flows through the pipeline generates a trace record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which columns the schema linker selected&lt;/li&gt;
&lt;li&gt;Which few-shot examples were retrieved (and their similarity scores)&lt;/li&gt;
&lt;li&gt;The pre-review output (issues found, confidence score, iterations)&lt;/li&gt;
&lt;li&gt;The final SQL sent for execution&lt;/li&gt;
&lt;li&gt;Execution time, row count, token usage per LLM call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All stored as JSONB in the existing query log table. Zero new infrastructure dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PipelineTrace&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;schemaColumns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;        &lt;span class="c1"&gt;// columns selected by linker&lt;/span&gt;
  &lt;span class="nl"&gt;fewShotExamples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;      &lt;span class="c1"&gt;// IDs of retrieved pairs&lt;/span&gt;
  &lt;span class="nl"&gt;reviewIterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;reviewConfidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;finalSQL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;executionTimeMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;tokenUsage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;modelUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;cacheHit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't improve accuracy directly, but it's what makes &lt;strong&gt;debugging and improvement possible&lt;/strong&gt;. When a query fails, we can see exactly which component contributed to the failure. When accuracy dips, we can query the traces to find patterns. Without tracing, the pipeline is a black box. With it, every failure is a learning opportunity.&lt;/p&gt;
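&lt;p&gt;Writing a trace is a single parameterized statement against the existing log table. A sketch (the table and column names are illustrative):&lt;/p&gt;

```typescript
// Minimal slice of the trace for illustration.
interface TraceRecord {
  traceId: string;
  question: string;
  cacheHit: boolean;
}

// Build the parameterized statement that writes a trace into the existing
// query log table as JSONB. No new infrastructure: one column, one cast.
function traceInsert(trace: TraceRecord): { text: string; values: [string, string] } {
  return {
    text: "INSERT INTO query_log (trace_id, trace) VALUES ($1, $2::jsonb)",
    values: [trace.traceId, JSON.stringify(trace)],
  };
}
```

&lt;p&gt;Because the payload is JSONB, later analysis ("find all traces where the linker missed a column") is a plain SQL query over the log table.&lt;/p&gt;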




&lt;h3&gt;
  
  
  8. Prompt Prefix Caching (Latency + Cost)
&lt;/h3&gt;

&lt;p&gt;The schema description, universal rules, and system instructions are identical across thousands of queries. Only the user's question and the retrieved few-shot examples change per request.&lt;/p&gt;

&lt;p&gt;On Anthropic's API, we structure our prompts so the static portion comes first, then use prompt caching to avoid re-processing the prefix on every call. The schema description alone can be 3,000+ tokens — caching it means those tokens are processed once and reused.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;~40% reduction in billable tokens&lt;/strong&gt; across all queries, with no impact on output quality. Combined with the complexity router, our per-query cost dropped dramatically.&lt;/p&gt;
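&lt;p&gt;Structurally, this means marking the static prefix with &lt;code&gt;cache_control&lt;/code&gt; in the request body, along the lines of the following sketch (the model name is a placeholder and the schema text is truncated):&lt;/p&gt;

```typescript
const SCHEMA_DESCRIPTION = "(3,000+ tokens of schema description go here)";

// Shape of an Anthropic Messages API request using prompt caching: the
// static system prefix carries cache_control, the per-request suffix
// (few-shot examples + question) does not.
function buildRequest(question: string, fewShotBlock: string) {
  return {
    model: "claude-model-placeholder",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: `You translate questions into SQL.\n\n${SCHEMA_DESCRIPTION}`,
        cache_control: { type: "ephemeral" }, // cached prefix
      },
    ],
    messages: [
      // dynamic suffix: changes on every request, never cached
      { role: "user", content: `${fewShotBlock}\n\nQuestion: ${question}` },
    ],
  };
}
```

&lt;p&gt;The design rule is simple: everything above the &lt;code&gt;cache_control&lt;/code&gt; breakpoint must be byte-identical across requests, so anything dynamic goes after it.&lt;/p&gt;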




&lt;h2&gt;
  
  
  The Flywheel — Why This Beats Any Static Benchmark
&lt;/h2&gt;

&lt;p&gt;The pipeline took us from 68% to roughly 89% on Day 1. That's a strong improvement, but it's still not production-grade. The component that pushed us toward ~100% wasn't a pipeline stage — it was a &lt;strong&gt;feedback loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After every query, the system evaluates its own confidence score from the review step. High-confidence results (≥0.85) are auto-approved and promoted into the embedding cache as verified pairs. Low-confidence results, or any result a user flags as incorrect, get routed to a human reviewer.&lt;/p&gt;

&lt;p&gt;The reviewer sees the question, the generated SQL, the expected result, and — if the user provided a correction — the corrected SQL. They verify or fix the pair, and the corrected version gets promoted to the cache.&lt;/p&gt;

&lt;p&gt;Here's why this is powerful: &lt;strong&gt;the next time a semantically similar question arrives, it matches against the cached pair and short-circuits the entire pipeline&lt;/strong&gt;. No LLM call needed. The answer comes directly from a human-verified, production-tested pair.&lt;/p&gt;
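&lt;p&gt;The routing decision behind the flywheel is deliberately simple. A sketch that mirrors the description above (helper and type names are assumptions):&lt;/p&gt;

```typescript
const AUTO_APPROVE_THRESHOLD = 0.85;

type Disposition = "promote_to_cache" | "human_review";

// A user flag always wins; otherwise the review confidence decides.
// Promoted results become verified pairs in the embedding cache.
function routeResult(confidence: number, userFlaggedIncorrect: boolean): Disposition {
  if (userFlaggedIncorrect) return "human_review";
  return confidence >= AUTO_APPROVE_THRESHOLD ? "promote_to_cache" : "human_review";
}
```

&lt;p&gt;Both branches feed the cache: auto-approved pairs immediately, reviewed pairs after a human verifies or corrects them.&lt;/p&gt;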

&lt;p&gt;The flywheel effect played out like this in our system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Cache Hit Rate&lt;/th&gt;
&lt;th&gt;LLM Calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Day 1&lt;/td&gt;
&lt;td&gt;~89%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 4&lt;/td&gt;
&lt;td&gt;~94%&lt;/td&gt;
&lt;td&gt;~40%&lt;/td&gt;
&lt;td&gt;~60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 6&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;td&gt;~70%&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable state&lt;/td&gt;
&lt;td&gt;~99%+&lt;/td&gt;
&lt;td&gt;~80-90%&lt;/td&gt;
&lt;td&gt;~10-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Academic benchmarks like BIRD measure a frozen system — a fixed model, fixed prompt, fixed schema, evaluated once. Our system gets smarter every day. Every query that flows through it either confirms an existing cached pair or generates a new one (after human verification).&lt;/p&gt;

&lt;p&gt;And here's the part that makes finance people happy: &lt;strong&gt;cost falls as accuracy rises&lt;/strong&gt;. As the cache fills up, fewer queries need LLM calls. The most expensive component (Opus for complex queries) gets called less and less as the cache absorbs the patterns it's already seen. We're simultaneously improving quality and reducing cost — the flywheel improves both.&lt;/p&gt;

&lt;p&gt;The cache currently holds thousands of verified pairs, and roughly 80-90% of incoming questions match an existing pair closely enough to skip the pipeline entirely. The remaining 10-20% are genuinely novel questions — new patterns the system hasn't encountered before. Those go through the full pipeline, get reviewed, and feed back into the cache.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Ceiling
&lt;/h2&gt;

&lt;p&gt;We don't claim 100% accuracy, and we never will. The remaining 1-3% of failures are genuinely hard problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Novel query patterns.&lt;/strong&gt; When a user asks something structurally unlike anything in the cache, the system falls back to the full pipeline. Pipeline accuracy without cache assistance is ~89% — good, but not perfect. These novel patterns are, by definition, the hardest queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous natural language.&lt;/strong&gt; "Show me recent data" — does "recent" mean last week? Last month? Last quarter? The system can detect ambiguity (we added an ambiguity classification step), but resolving it requires either a clarifying question or a business-specific default. Both have trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data drift.&lt;/strong&gt; New values appear in enum columns. A product category gets renamed. A new region code gets added. Our value sampling refreshes periodically, but there's always a window where the LLM has stale information. Continuous sampling narrows the window but can't eliminate it entirely.&lt;/p&gt;

&lt;p&gt;Our approach to the ceiling: &lt;strong&gt;human-in-the-loop is not a failure mode — it's the mechanism that closes the gap&lt;/strong&gt;. Low-confidence novel queries get flagged for human review. The human provides the correct SQL. The pair enters the cache. The system has learned. The ceiling rises.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If we were starting over, three things would change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with value sampling.&lt;/strong&gt; It's the cheapest component to build (a few SQL queries and injecting the sampled values into the prompt) and eliminates an entire category of errors. We built it fourth. It should have been first. Half a day of work for a 3-4% accuracy gain.&lt;/p&gt;
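&lt;p&gt;To make "half a day of work" concrete: value sampling reduces to one query shape per low-cardinality column plus one prompt fragment per result set. A sketch with illustrative table and column names:&lt;/p&gt;

```typescript
// Pull the distinct values for a low-cardinality (enum-like) column.
function valueSamplingQuery(table: string, column: string, limit = 50): string {
  return `SELECT DISTINCT ${column} FROM ${table} ORDER BY 1 LIMIT ${limit}`;
}

// Inject the sampled values into the prompt so the LLM writes
// WHERE status = 'Active' instead of guessing 'active'.
function valuesForPrompt(column: string, values: string[]): string {
  return `Column "${column}" contains exactly these values: ${values.join(", ")}`;
}
```

&lt;p&gt;Run the sampling on a schedule, not per request, and the cost is effectively zero.&lt;/p&gt;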

&lt;p&gt;&lt;strong&gt;Build tracing from Day 1.&lt;/strong&gt; We spent weeks debugging pipeline failures by staring at prompts and outputs manually. Once we had tracing, debugging time dropped roughly tenfold. Every failure was immediately attributable to a specific component. Build the observability before you build the intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest heavily in the schema linker.&lt;/strong&gt; It has the highest leverage of any component. A better schema linker means a smaller, more relevant context, which means better LLM output across all query types. We've iterated on ours four times and it's still the component we invest the most engineering time in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Production text-to-SQL is not a prompting problem. It's a systems engineering problem. The combination of &lt;strong&gt;retrieval augmentation&lt;/strong&gt; (schema linking + few-shot retrieval), &lt;strong&gt;pre-execution review&lt;/strong&gt; (the RRIL loop), and a &lt;strong&gt;human feedback flywheel&lt;/strong&gt; (verified pairs that compound over time) is what makes near-perfect accuracy achievable in practice.&lt;/p&gt;

&lt;p&gt;Static benchmarks measure the floor. The flywheel determines the ceiling.&lt;/p&gt;

&lt;p&gt;If you're building text-to-SQL for a specific domain, the generic approach — a good prompt and a frontier model — will disappoint you. It'll get you to 70%, and you'll spend months trying to prompt-engineer your way to 80%. The path to production-grade accuracy is domain-specific retrieval, systematic error elimination, and a feedback loop that turns every query into a learning opportunity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema linking is the highest-leverage component.&lt;/strong&gt; Reducing 200+ columns to 20-30 relevant ones eliminates an entire class of hallucination errors. Build it first, invest in it continuously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value sampling is the cheapest win.&lt;/strong&gt; Injecting actual enum values into the prompt costs almost nothing to implement and eliminates case-sensitivity and value mismatch errors immediately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The review loop catches what single-pass generation misses.&lt;/strong&gt; A second LLM pass reviewing the generated SQL against the original question catches 5%+ of subtle errors, especially on complex multi-condition queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The flywheel is the real product.&lt;/strong&gt; The pipeline gets you to ~89%. The human-in-the-loop feedback loop that populates a verified cache is what pushes you toward ~100% — and simultaneously reduces cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability is not optional.&lt;/strong&gt; Without pipeline tracing, you're debugging a black box. With it, every failure is attributable and fixable. Build tracing before you build intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you're working on a similar system, I'd love to hear about your approach — especially how you handle schema linking and the accuracy/cost tradeoff. Drop a comment or reach out.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
    </item>
    <item>
      <title>From 68% to ~100%: How We Built a Text-to-SQL System That Gets Smarter Every Day</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Fri, 20 Mar 2026 10:36:45 +0000</pubDate>
      <link>https://forem.com/alexchen31337/from-68-to-100-how-we-built-a-text-to-sql-system-that-gets-smarter-every-day-4dlp</link>
      <guid>https://forem.com/alexchen31337/from-68-to-100-how-we-built-a-text-to-sql-system-that-gets-smarter-every-day-4dlp</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical guide to moving beyond vanilla LLM prompting toward a self-improving pipeline for production text-to-SQL.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Vanilla LLM Text-to-SQL
&lt;/h2&gt;

&lt;p&gt;We had what seemed like a straightforward problem: let business users ask natural-language questions about a large domain-specific table — hundreds of millions of rows, 200+ columns, a mandatory date filter on every query — and get back correct SQL. We started where most teams start: a well-crafted prompt, GPT-4, and a schema dump. It worked. Sort of.&lt;/p&gt;

&lt;p&gt;Our initial accuracy was &lt;strong&gt;~68%&lt;/strong&gt;. That sounds decent until you realize it means one in three queries returns wrong data. In a production system where people make decisions based on the output, 68% is unusable.&lt;/p&gt;

&lt;p&gt;We identified three distinct failure modes that accounted for nearly all errors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Column hallucination.&lt;/strong&gt; With 200+ columns in the schema, the LLM would confidently reference columns that didn't exist or pick columns with similar names but different semantics. A column called &lt;code&gt;region_code&lt;/code&gt; might get confused with &lt;code&gt;sales_region&lt;/code&gt;, and the SQL would execute without errors — returning completely wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Filter value errors.&lt;/strong&gt; Our domain table had dozens of categorical columns with specific enum values. The LLM would guess at values — writing &lt;code&gt;WHERE status = 'active'&lt;/code&gt; when the actual value was &lt;code&gt;'Active'&lt;/code&gt;, or &lt;code&gt;'sedan'&lt;/code&gt; when the column stores &lt;code&gt;'Sedan'&lt;/code&gt;. These queries return empty result sets, and the user has no idea why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Structural validity ≠ semantic correctness.&lt;/strong&gt; This is the insidious one. The SQL parses, executes, and returns rows. But it answers a subtly different question than the one asked. A year-over-year comparison that uses the wrong date boundaries. An aggregation that groups by the wrong dimension. The user gets a confident-looking table of numbers that happens to be wrong.&lt;/p&gt;

&lt;p&gt;If you've followed the academic benchmarks, none of this is surprising. The BIRD benchmark — which evaluates text-to-SQL on messy, real-world databases — shows even the best published systems topping out around 72-75% execution accuracy on complex schemas. Our 68% was right in line with the state of the art for a single-prompt approach on a genuinely complex production schema.&lt;/p&gt;

&lt;p&gt;The core issue is that &lt;strong&gt;a single LLM call cannot reliably bridge the gap between ambiguous natural language and precise SQL&lt;/strong&gt; when the schema is large, the domain is specific, and the data has real-world messiness. Prompt engineering gets you to ~70%. Everything after that requires engineering.&lt;/p&gt;

&lt;p&gt;We spent six months building what we now call "the pipeline" — eight components that, together, pushed our accuracy from 68% to a system that converges toward ~100% over time. Here's every component, what it does, and how much it contributed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 8-Component Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Semantic Schema Linker (+~10%)
&lt;/h3&gt;

&lt;p&gt;The single highest-leverage change we made was &lt;strong&gt;stopping the LLM from seeing columns it doesn't need&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With 200+ columns, the full schema description consumed most of the context window. Worse, it gave the LLM hundreds of opportunities to pick the wrong column. Our schema linker works like this: we pre-compute embeddings for every column name and its description. When a question comes in, we embed the question, compute cosine similarity against all column embeddings, and pass only the top-k most relevant columns (typically 20-30) to the LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;linkSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;allColumns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;questionEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allColumns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;cosineSimilarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;questionEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;

  &lt;span class="c1"&gt;// Always include mandatory columns (e.g., date filter)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mandatory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMandatory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMandatory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;mandatory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: we &lt;strong&gt;always include mandatory columns&lt;/strong&gt; (like the date filter) regardless of similarity score. Domain-specific invariants shouldn't depend on embedding quality.&lt;/p&gt;

&lt;p&gt;This single component eliminated most column hallucination errors and gave us roughly &lt;strong&gt;+10% accuracy&lt;/strong&gt; — the biggest single delta in the pipeline.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Question Masking + Semantic Few-Shot Retrieval (+~6%)
&lt;/h3&gt;

&lt;p&gt;Generic few-shot examples ("Show me total sales by region") don't help when your domain has specific patterns. We needed &lt;strong&gt;domain-specific examples&lt;/strong&gt; that match the structure of the incoming question, not just the topic.&lt;/p&gt;

&lt;p&gt;The problem with naive semantic retrieval: "show me records from 2019" and "show me records from 2023" have different embeddings, but they need the exact same SQL pattern. Our solution was &lt;strong&gt;question masking&lt;/strong&gt; — we replace numeric literals and proper nouns with placeholders before embedding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskQuestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;question&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d{4}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;YEAR&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;// mask years&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(\.\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)?\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;NUM&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;// mask numbers&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/"&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+"/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;VALUE&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;// mask quoted values&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;a-z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(?:\s[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;a-z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;+/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;ENTITY&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// mask proper nouns&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The masked form gets embedded and matched against a &lt;strong&gt;pgvector store&lt;/strong&gt; of verified question→SQL pairs. Each pair in the store was human-verified as correct — more on that in the flywheel section.&lt;/p&gt;

&lt;p&gt;Retrieving 3-5 semantically similar, domain-specific, verified examples gave us &lt;strong&gt;+~6% accuracy&lt;/strong&gt;. The LLM went from guessing at patterns to following proven ones.&lt;/p&gt;
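&lt;p&gt;The retrieval itself is a single pgvector query using the cosine-distance operator. A sketch (the table and column names are assumptions):&lt;/p&gt;

```typescript
// pgvector's <=> operator is cosine distance: smaller means closer,
// so similarity is 1 - distance. $1 is the masked question's embedding.
function fewShotQuery(k = 5): string {
  return `
    SELECT question, sql, 1 - (embedding <=> $1) AS similarity
    FROM verified_pairs
    ORDER BY embedding <=> $1
    LIMIT ${k}
  `;
}
```

&lt;p&gt;Ordering by the raw distance (rather than the derived similarity) lets pgvector use its index for the nearest-neighbor scan.&lt;/p&gt;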




&lt;h3&gt;
  
  
  3. Pre-Execution LLM Self-Review (+~5%)
&lt;/h3&gt;

&lt;p&gt;Even with a focused schema and good examples, the LLM still generates subtle errors on complex queries — wrong date boundaries in year-over-year comparisons, incorrect GROUP BY clauses, off-by-one errors in date ranges.&lt;/p&gt;

&lt;p&gt;We added a &lt;strong&gt;review step&lt;/strong&gt;: after the first LLM generates SQL, a second LLM pass reviews it. The reviewer sees the original question, the schema subset, and the generated SQL — but not the generation prompt. It answers: "Does this SQL correctly answer this question given this schema?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;reviewAndRegenerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; 
  &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxIterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;maxIterations&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;reviewSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Regenerate with review feedback&lt;/span&gt;
    &lt;span class="nx"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;regenerateSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Return best attempt with low confidence flag&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We call this the RRIL (Review-Regenerate-Iterate Loop). Max 3 iterations, confidence threshold of 0.70. If it can't reach 0.70 after 3 tries, it flags the query for human review.&lt;/p&gt;

&lt;p&gt;This loop added roughly &lt;strong&gt;+5% accuracy&lt;/strong&gt;, primarily on complex multi-condition queries where the first pass got 80% of the logic right but missed a subtle constraint.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Column Value Sampling (+~3-4%)
&lt;/h3&gt;

&lt;p&gt;This one is embarrassingly simple and we should have built it first.&lt;/p&gt;

&lt;p&gt;For every column detected as low-cardinality or enum-like (fewer than ~500 distinct values), we sample 20-50 actual values from the database and inject them into the prompt: &lt;em&gt;"The &lt;code&gt;status&lt;/code&gt; column contains values: 'Active', 'Inactive', 'Pending', 'Archived'."&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sampleColumnValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Database&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;distinctCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT COUNT(DISTINCT "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;") FROM domain_table`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;distinctCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_ENUM_CARDINALITY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT DISTINCT "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" 
     FROM domain_table 
     WHERE "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" IS NOT NULL 
     LIMIT 50`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more &lt;code&gt;'sedan'&lt;/code&gt; vs &lt;code&gt;'Sedan'&lt;/code&gt; mismatches. No more guessing at valid status codes. The LLM sees the actual values and uses them. &lt;strong&gt;+3-4% accuracy&lt;/strong&gt;, and it's the cheapest component to implement.&lt;/p&gt;
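
&lt;p&gt;To make the injection concrete, here's a minimal sketch of turning sampled values into that prompt line. &lt;code&gt;formatColumnValues&lt;/code&gt; and its display cap are illustrative, not the production code:&lt;/p&gt;

```typescript
// Serialize sampled enum values into a prompt line like the one quoted
// above. `maxShown` caps the list so large enums don't bloat the prompt.
function formatColumnValues(name: string, values: string[], maxShown = 20): string {
  const shown = values.slice(0, maxShown).map(v => `'${v}'`).join(", ");
  const suffix = values.length > maxShown ? ", ..." : "";
  return `The \`${name}\` column contains values: ${shown}${suffix}.`;
}
```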




&lt;h3&gt;
  
  
  5. Query Complexity Router (Quality + Cost)
&lt;/h3&gt;

&lt;p&gt;Not every question needs the most expensive model. "How many records do we have this month?" is a simple COUNT with a date filter. "Compare year-over-year trends across the top five categories, broken down by quarter" requires genuine reasoning.&lt;/p&gt;

&lt;p&gt;We classify incoming questions into three complexity tiers and route accordingly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;~Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Single aggregation, basic filter&lt;/td&gt;
&lt;td&gt;Haiku (fast, cheap)&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Domain filters, joins, grouping&lt;/td&gt;
&lt;td&gt;Sonnet (balanced)&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;td&gt;YoY, multi-breakdown, subqueries&lt;/td&gt;
&lt;td&gt;Opus (highest quality)&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The classifier itself is a lightweight Haiku call — costs almost nothing and adds ~200ms of latency. The result: &lt;strong&gt;~70% cost reduction&lt;/strong&gt; with zero accuracy loss. Simple queries don't benefit from Opus, and sending them there is pure waste.&lt;/p&gt;
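
&lt;p&gt;The routing contract itself is just a mapping from tier to model. A minimal sketch — the &lt;code&gt;classifyTier&lt;/code&gt; keyword heuristic here is a hypothetical stand-in for the real Haiku classifier call, and the model ids are illustrative:&lt;/p&gt;

```typescript
type Tier = "simple" | "medium" | "complex";

// Hypothetical stand-in for the lightweight LLM classifier call.
// The production system asks Haiku to pick the tier; this keyword
// heuristic only illustrates the routing contract.
function classifyTier(question: string): Tier {
  const q = question.toLowerCase();
  if (/year-over-year|yoy|broken down|subquer/.test(q)) return "complex";
  if (/join|group|by category|per /.test(q)) return "medium";
  return "simple";
}

// Tier → model routing, mirroring the table above.
const MODEL_BY_TIER: Record<Tier, string> = {
  simple: "claude-haiku",   // fast, cheap
  medium: "claude-sonnet",  // balanced
  complex: "claude-opus",   // highest quality
};

function routeModel(question: string): string {
  return MODEL_BY_TIER[classifyTier(question)];
}
```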




&lt;h3&gt;
  
  
  6. Rule-Versioned Embedding Cache (Consistency)
&lt;/h3&gt;

&lt;p&gt;Business rules change. A new mandatory filter gets added. A column gets deprecated. An enum value gets renamed. When this happens, cached question→SQL pairs can become stale or non-compliant.&lt;/p&gt;

&lt;p&gt;Every cached pair is stored with a &lt;strong&gt;rule version hash&lt;/strong&gt;. When the rules change (we increment a version), the system recomputes compliance scores for all cached pairs against the new rules and surfaces non-compliant ones for human review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CachedPair&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;maskedQuestion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;ruleVersionHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;complianceScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifiedBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifiedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;flagStale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentRuleHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;CachedPair&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
    SELECT * FROM cached_pairs 
    WHERE rule_version_hash != $1
    ORDER BY last_used_at DESC
  `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRuleHash&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't directly improve accuracy on new questions, but it &lt;strong&gt;prevents regression&lt;/strong&gt; — which, in a production system, matters more than you'd think. A cached pair that was correct last month but violates a new mandatory filter is worse than no cache at all.&lt;/p&gt;
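
&lt;p&gt;One way to compute the rule version hash is a digest over the canonicalized rule set, so reordering or whitespace changes don't produce spurious version bumps. A sketch using Node's built-in crypto (an assumption on our part — any stable hash works):&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hash the rule set in canonical form: trim and sort the rules so that
// reordering them does not change the version hash.
function ruleVersionHash(rules: string[]): string {
  const canonical = [...rules].map(r => r.trim()).sort().join("\n");
  return createHash("sha256").update(canonical).digest("hex");
}
```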




&lt;h3&gt;
  
  
  7. Pipeline Tracing (Observability)
&lt;/h3&gt;

&lt;p&gt;Every query that flows through the pipeline generates a trace record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which columns the schema linker selected&lt;/li&gt;
&lt;li&gt;Which few-shot examples were retrieved (and their similarity scores)&lt;/li&gt;
&lt;li&gt;The pre-review output (issues found, confidence score, iterations)&lt;/li&gt;
&lt;li&gt;The final SQL sent for execution&lt;/li&gt;
&lt;li&gt;Execution time, row count, token usage per LLM call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All stored as JSONB in the existing query log table. Zero new infrastructure dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PipelineTrace&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;schemaColumns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;        &lt;span class="c1"&gt;// columns selected by linker&lt;/span&gt;
  &lt;span class="nl"&gt;fewShotExamples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;      &lt;span class="c1"&gt;// IDs of retrieved pairs&lt;/span&gt;
  &lt;span class="nl"&gt;reviewIterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;reviewConfidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;finalSQL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;executionTimeMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;tokenUsage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;modelUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;cacheHit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't improve accuracy directly, but it's what makes &lt;strong&gt;debugging and improvement possible&lt;/strong&gt;. When a query fails, we can see exactly which component contributed to the failure. When accuracy dips, we can query the traces to find patterns. Without tracing, the pipeline is a black box. With it, every failure is a learning opportunity.&lt;/p&gt;
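
&lt;p&gt;Persisting a trace is just shaping it into a row for the existing log table. A sketch — the column names are illustrative, and the whole trace payload lands in one JSONB column:&lt;/p&gt;

```typescript
// Shape a trace into a row for the existing query log table.
// The trace payload is serialized into a single JSONB column.
function toQueryLogRow(trace: { traceId: string; [k: string]: unknown }) {
  return {
    trace_id: trace.traceId,
    trace: JSON.stringify(trace), // stored as JSONB
  };
}
```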




&lt;h3&gt;
  
  
  8. Prompt Prefix Caching (Latency + Cost)
&lt;/h3&gt;

&lt;p&gt;The schema description, universal rules, and system instructions are identical across thousands of queries. Only the user's question and the retrieved few-shot examples change per request.&lt;/p&gt;

&lt;p&gt;On Anthropic's API, we structure our prompts so the static portion comes first, then use prompt caching to avoid re-processing the prefix on every call. The schema description alone can be 3,000+ tokens — caching it means those tokens are processed once and reused.&lt;/p&gt;
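
&lt;p&gt;Concretely, that means the static blocks go first in the &lt;code&gt;system&lt;/code&gt; array and the last static block carries a &lt;code&gt;cache_control&lt;/code&gt; marker, per Anthropic's prompt caching API. A sketch of the request shape (the model id is illustrative, not our production config):&lt;/p&gt;

```typescript
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Build a request body where the static prefix (rules + schema) comes
// first; marking the last static block caches everything up to it.
function buildRequest(rules: string, schemaDescription: string, question: string, fewShot: string) {
  const system: SystemBlock[] = [
    { type: "text", text: rules },
    // cache_control here caches the full static prefix
    { type: "text", text: schemaDescription, cache_control: { type: "ephemeral" } },
  ];
  return {
    model: "claude-sonnet-4-5", // illustrative model id
    max_tokens: 1024,
    system,
    messages: [{ role: "user", content: `${fewShot}\n\n${question}` }],
  };
}
```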

&lt;p&gt;Result: &lt;strong&gt;~40% reduction in billable tokens&lt;/strong&gt; across all queries, with no impact on output quality. Combined with the complexity router, our per-query cost dropped dramatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Flywheel — Why This Beats Any Static Benchmark
&lt;/h2&gt;

&lt;p&gt;The pipeline took us from 68% to roughly 89% on Day 1. That's a strong improvement, but it's still not production-grade. The component that pushed us toward ~100% wasn't a pipeline stage — it was a &lt;strong&gt;feedback loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After every query, the system evaluates its own confidence score from the review step. High-confidence results (≥0.85) are auto-approved and promoted into the embedding cache as verified pairs. Low-confidence results, or any result a user flags as incorrect, get routed to a human reviewer.&lt;/p&gt;

&lt;p&gt;The reviewer sees the question, the generated SQL, the expected result, and — if the user provided a correction — the corrected SQL. They verify or fix the pair, and the corrected version gets promoted to the cache.&lt;/p&gt;

&lt;p&gt;Here's why this is powerful: &lt;strong&gt;the next time a semantically similar question arrives, it matches against the cached pair and short-circuits the entire pipeline&lt;/strong&gt;. No LLM call needed. The answer comes directly from a human-verified, production-tested pair.&lt;/p&gt;
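
&lt;p&gt;The short-circuit is a nearest-neighbor lookup over the cached embeddings. A minimal sketch — the similarity threshold here is illustrative:&lt;/p&gt;

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface VerifiedPair { sql: string; embedding: number[]; }

// Return cached SQL if any verified pair is similar enough,
// otherwise null (fall through to the full pipeline).
function cacheLookup(query: number[], cache: VerifiedPair[], threshold = 0.92): string | null {
  let best: VerifiedPair | null = null;
  let bestSim = -1;
  for (const pair of cache) {
    const sim = cosine(query, pair.embedding);
    if (sim > bestSim) { bestSim = sim; best = pair; }
  }
  return best && bestSim >= threshold ? best.sql : null;
}
```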

&lt;p&gt;The flywheel effect played out like this in our system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Cache Hit Rate&lt;/th&gt;
&lt;th&gt;LLM Calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Day 1&lt;/td&gt;
&lt;td&gt;~89%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 4&lt;/td&gt;
&lt;td&gt;~94%&lt;/td&gt;
&lt;td&gt;~40%&lt;/td&gt;
&lt;td&gt;~60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 6&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;td&gt;~70%&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable state&lt;/td&gt;
&lt;td&gt;~99%+&lt;/td&gt;
&lt;td&gt;~80-90%&lt;/td&gt;
&lt;td&gt;~10-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Academic benchmarks like BIRD measure a frozen system — a fixed model, fixed prompt, fixed schema, evaluated once. Our system gets smarter every day. Every query that flows through it either confirms an existing cached pair or generates a new one (after human verification).&lt;/p&gt;

&lt;p&gt;And here's the part that makes finance people happy: &lt;strong&gt;cost falls as accuracy rises&lt;/strong&gt;. As the cache fills up, fewer queries need LLM calls. The most expensive component (Opus for complex queries) gets called less and less as the cache absorbs the patterns it's already seen. We're simultaneously improving quality and reducing cost — the flywheel improves both.&lt;/p&gt;

&lt;p&gt;The cache currently holds thousands of verified pairs, and roughly 80-90% of incoming questions match an existing pair closely enough to skip the pipeline entirely. The remaining 10-20% are genuinely novel questions — new patterns the system hasn't encountered before. Those go through the full pipeline, get reviewed, and feed back into the cache.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Ceiling
&lt;/h2&gt;

&lt;p&gt;We don't claim 100% accuracy, and we never will. The remaining 1-3% of failures are genuinely hard problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Novel query patterns.&lt;/strong&gt; When a user asks something structurally unlike anything in the cache, the system falls back to the full pipeline. Pipeline accuracy without cache assistance is ~89% — good, but not perfect. These novel patterns are, by definition, the hardest queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous natural language.&lt;/strong&gt; "Show me recent data" — does "recent" mean last week? Last month? Last quarter? The system can detect ambiguity (we added an ambiguity classification step), but resolving it requires either a clarifying question or a business-specific default. Both have trade-offs.&lt;/p&gt;
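
&lt;p&gt;The business-specific-default option can be as small as a lookup table plus a flag so the UI can surface the assumption to the user. A sketch — the phrases and defaults below are illustrative policy choices, not universal rules:&lt;/p&gt;

```typescript
// Business-specific defaults for ambiguous time phrases.
// Each entry is an illustrative policy choice, not a universal rule.
const TIME_DEFAULTS: Record<string, string> = {
  recent: "last 30 days",
  lately: "last 30 days",
  "this period": "current fiscal quarter",
};

// Resolve an ambiguous phrase; `assumed` lets the UI tell the user
// "interpreting 'recent' as the last 30 days".
function resolveTimePhrase(phrase: string): { resolved: string; assumed: boolean } {
  const hit = TIME_DEFAULTS[phrase.toLowerCase()];
  return hit ? { resolved: hit, assumed: true } : { resolved: phrase, assumed: false };
}
```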

&lt;p&gt;&lt;strong&gt;Data drift.&lt;/strong&gt; New values appear in enum columns. A product category gets renamed. A new region code gets added. Our value sampling refreshes periodically, but there's always a window where the LLM has stale information. Continuous sampling narrows the window but can't eliminate it entirely.&lt;/p&gt;

&lt;p&gt;Our approach to the ceiling: &lt;strong&gt;human-in-the-loop is not a failure mode — it's the mechanism that closes the gap&lt;/strong&gt;. Low-confidence novel queries get flagged for human review. The human provides the correct SQL. The pair enters the cache. The system has learned. The ceiling rises.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If we were starting over, three things would change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with value sampling.&lt;/strong&gt; It's the cheapest component to build (a few SQL queries, some prompt injection) and eliminates an entire category of errors. We built it fourth. It should have been first. Half a day of work for a 3-4% accuracy gain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build tracing from Day 1.&lt;/strong&gt; We spent weeks debugging pipeline failures by staring at prompts and outputs manually. Once we had tracing, debugging time dropped roughly tenfold. Every failure was immediately attributable to a specific component. Build the observability before you build the intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest heavily in the schema linker.&lt;/strong&gt; It has the highest leverage of any component. A better schema linker means a smaller, more relevant context, which means better LLM output across all query types. We've iterated on ours four times and it's still the component we invest the most engineering time in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Production text-to-SQL is not a prompting problem. It's a systems engineering problem. The combination of &lt;strong&gt;retrieval augmentation&lt;/strong&gt; (schema linking + few-shot retrieval), &lt;strong&gt;pre-execution review&lt;/strong&gt; (the RRIL loop), and a &lt;strong&gt;human feedback flywheel&lt;/strong&gt; (verified pairs that compound over time) is what makes near-perfect accuracy achievable in practice.&lt;/p&gt;

&lt;p&gt;Static benchmarks measure the floor. The flywheel determines the ceiling.&lt;/p&gt;

&lt;p&gt;If you're building text-to-SQL for a specific domain, the generic approach — a good prompt and a frontier model — will disappoint you. It'll get you to 70%, and you'll spend months trying to prompt-engineer your way to 80%. The path to production-grade accuracy is domain-specific retrieval, systematic error elimination, and a feedback loop that turns every query into a learning opportunity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema linking is the highest-leverage component.&lt;/strong&gt; Reducing 200+ columns to 20-30 relevant ones eliminates an entire class of hallucination errors. Build it first, invest in it continuously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value sampling is the cheapest win.&lt;/strong&gt; Injecting actual enum values into the prompt costs almost nothing to implement and eliminates case-sensitivity and value mismatch errors immediately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The review loop catches what single-pass generation misses.&lt;/strong&gt; A second LLM pass reviewing the generated SQL against the original question catches 5%+ of subtle errors, especially on complex multi-condition queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The flywheel is the real product.&lt;/strong&gt; The pipeline gets you to ~89%. The human-in-the-loop feedback loop that populates a verified cache is what pushes you toward ~100% — and simultaneously reduces cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability is not optional.&lt;/strong&gt; Without pipeline tracing, you're debugging a black box. With it, every failure is attributable and fixable. Build tracing before you build intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you're working on a similar system, I'd love to hear about your approach — especially how you handle schema linking and the accuracy/cost tradeoff. Drop a comment or reach out.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Taught My AI Agent to Solve reCAPTCHA (And What It Took)</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Mon, 16 Mar 2026 00:47:17 +0000</pubDate>
      <link>https://forem.com/alexchen31337/how-i-taught-my-ai-agent-to-solve-recaptcha-and-what-it-took-k92</link>
      <guid>https://forem.com/alexchen31337/how-i-taught-my-ai-agent-to-solve-recaptcha-and-what-it-took-k92</guid>
      <description>&lt;p&gt;Every autonomous AI agent eventually hits the same wall: &lt;strong&gt;reCAPTCHA&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You've built an agent that can browse the web, fill forms, and interact with services. Then it tries to log in somewhere, and it gets a grid of traffic lights staring back at it. Game over — unless you've solved the vision problem.&lt;/p&gt;

&lt;p&gt;I recently built an agent workflow that needed to log into Gumroad to publish digital products autonomously. No API token available. Direct login blocked by reCAPTCHA v2 image challenges. Here's exactly how I solved it — the working pattern, the failure modes, and the honest limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;reCAPTCHA v2 image challenges ask users to click all squares containing: traffic lights, crosswalks, cars, motorcycles, bicycles, fire hydrants, buses. They're designed to be trivial for humans and hard for bots.&lt;/p&gt;

&lt;p&gt;For an AI agent, this is actually a vision task — not a hard one. The challenge is the &lt;strong&gt;plumbing&lt;/strong&gt;: getting the image into a model, getting the model's response back into the browser, and handling the multi-round challenge flow (Gumroad served 6 consecutive challenges before accepting).&lt;/p&gt;

&lt;p&gt;Most documentation stops at "use a CAPTCHA-solving service like 2captcha." That works, but it costs money per solve, requires a third-party account, and introduces latency. If you're running an agent that already has access to a multimodal LLM, you already have everything you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The solution uses three components working together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (Chromium, headless or display)
    ↕  Chrome DevTools Protocol (CDP)
Browser Control Tool (OpenClaw browser tool)
    ↕  screenshot + act
Vision Model (Claude Sonnet)
    ↕  image analysis → click coordinates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent controls a real Chromium browser via CDP. When it hits a reCAPTCHA challenge, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Takes a screenshot of the challenge&lt;/li&gt;
&lt;li&gt;Sends the screenshot to a vision model with a targeted prompt&lt;/li&gt;
&lt;li&gt;Gets back which grid squares to click&lt;/li&gt;
&lt;li&gt;Clicks them via the browser tool&lt;/li&gt;
&lt;li&gt;Clicks "Verify"&lt;/li&gt;
&lt;li&gt;Repeats until the challenge accepts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No third-party service. No API key for a captcha farm. Just the LLM you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Working Pattern
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1 — Navigate to the page
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://gumroad.com/login&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# capture current state
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2 — Detect the challenge
&lt;/h3&gt;

&lt;p&gt;When the reCAPTCHA iframe is visible, take a screenshot and pass it to the vision model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targetId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TARGET_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;screenshot_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Look at this reCAPTCHA challenge.
    1. What object category is being asked for? (e.g. traffic lights, crosswalks, cars)
    2. Which grid squares (number them 1-9 left-to-right, top-to-bottom) contain that object?
    Return: { &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;squares&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [1, 4, 7] }&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3 — Click the correct squares
&lt;/h3&gt;

&lt;p&gt;The reCAPTCHA grid is a 3×3 layout inside an iframe. Map square numbers to click coordinates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Grid square → (x_offset, y_offset) from grid top-left
&lt;/span&gt;&lt;span class="n"&gt;GRID_POSITIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;square&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;squares&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GRID_POSITIONS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;act&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iframe &amp;gt;&amp;gt; nth=0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c1"&gt;# click at offset within iframe
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
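
&lt;p&gt;The lookup table above can also be computed, since the 3×3 grid is regular. A minimal sketch, assuming the same 100 px square pitch and 50 px centre offset as the table:&lt;/p&gt;

```python
def grid_position(square: int, pitch: int = 100, origin: int = 50) -> tuple[int, int]:
    """Map a square number (1-9, left-to-right, top-to-bottom) to (x, y) offsets."""
    col = (square - 1) % 3   # 0, 1, 2 across
    row = (square - 1) // 3  # 0, 1, 2 down
    return (origin + col * pitch, origin + row * pitch)
```

&lt;p&gt;This keeps the click code in sync if the widget renders at a different size: change the pitch once instead of nine coordinate pairs.&lt;/p&gt;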



&lt;p&gt;In practice, using the browser tool's &lt;code&gt;act&lt;/code&gt; with &lt;code&gt;ref&lt;/code&gt; from a snapshot is more reliable than manual coordinate calculation — the snapshot gives you element refs that survive iframe boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Handle multi-round challenges
&lt;/h3&gt;

&lt;p&gt;Gumroad served 6 consecutive challenges before accepting. Each round may show a different category. The loop looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;challenge_visible&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;get&lt;/span&gt; &lt;span class="n"&gt;squares&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;click&lt;/span&gt; &lt;span class="n"&gt;squares&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;click&lt;/span&gt; &lt;span class="n"&gt;Verify&lt;/span&gt;
    &lt;span class="n"&gt;wait&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt;
    &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;challenge&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;gone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt; &lt;span class="n"&gt;appeared&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;don't assume one round is enough&lt;/strong&gt;. Always screenshot after clicking Verify and check whether you're through or facing another round.&lt;/p&gt;
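
&lt;p&gt;That check is worth bounding so a stuck challenge can't spin forever. A sketch of the loop, where &lt;code&gt;solve_round&lt;/code&gt; and &lt;code&gt;challenge_gone&lt;/code&gt; are hypothetical stand-ins for the screenshot/vision/click steps above:&lt;/p&gt;

```python
def solve_challenge(solve_round, challenge_gone, max_rounds=10):
    """Repeat screenshot -> vision -> click -> Verify until the challenge clears."""
    for attempt in range(1, max_rounds + 1):
        solve_round()                 # one full round of clicks plus Verify
        if challenge_gone():          # re-screenshot and check; never assume
            return attempt            # how many rounds it took
    raise RuntimeError(f"challenge still present after {max_rounds} rounds")
```

&lt;p&gt;The Gumroad run described below would have returned 6 here; the bound exists for the day a round never clears.&lt;/p&gt;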

&lt;h3&gt;
  
  
  Step 5 — Verify you're logged in
&lt;/h3&gt;

&lt;p&gt;After the challenge loop exits, check for dashboard elements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Look for nav elements, username, dashboard heading
# If still on login page → challenge failed → retry
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Actually Happened (The Honest Version)
&lt;/h2&gt;

&lt;p&gt;The first attempt hit the challenge. The vision model correctly identified "crosswalks" as the category and clicked squares 1, 4, 7. The challenge accepted that round — but immediately showed a new one: "Select all traffic lights."&lt;/p&gt;

&lt;p&gt;Round 2: vision model identified 3 traffic light squares. Clicked. Another round appeared.&lt;/p&gt;

&lt;p&gt;This repeated 6 times across categories: crosswalks, traffic lights, cars, motorcycles, traffic lights again, cars again.&lt;/p&gt;

&lt;p&gt;On round 6, the challenge accepted and the page redirected to the Gumroad dashboard. Total time: about 45 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure modes I hit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Iframe ref confusion&lt;/strong&gt;: The snapshot returned refs for elements &lt;em&gt;outside&lt;/em&gt; the iframe. Fixed by using &lt;code&gt;evaluate&lt;/code&gt; to click inside the iframe via &lt;code&gt;document.querySelector('iframe').contentDocument&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grid image not in screenshot&lt;/strong&gt;: The reCAPTCHA widget loads asynchronously. Added a 2-second wait after the challenge appeared before screenshotting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"New" image squares after partial selection&lt;/strong&gt;: Some reCAPTCHA rounds replace clicked squares with new images (dynamic grid). The vision model needs to re-evaluate after each click, not just once per round. I handled this by re-screenshotting after each click when the category was "traffic lights" or "crosswalks" (which commonly use dynamic grids).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Password Reset Shortcut
&lt;/h2&gt;

&lt;p&gt;One thing worth noting: &lt;strong&gt;the password reset flow has no reCAPTCHA&lt;/strong&gt;. If you're trying to log into an account you control and the main login page is blocked, &lt;code&gt;/forgot_password&lt;/code&gt; is a clean path in. Request a reset, check email via IMAP, follow the link, set a new password, redirect to dashboard — zero image challenges.&lt;/p&gt;

&lt;p&gt;This is often the faster route for agent workflows where you control the account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Navigate to forgot password (no CAPTCHA here)
&lt;/span&gt;&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/forgot_password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;act&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input[type=email]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EMAIL&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;act&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;button[type=submit]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Read reset email via IMAP
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;imaplib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="n"&gt;imap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imaplib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IMAP4_SSL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imap.gmail.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;imap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EMAIL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;APP_PASSWORD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... find reset URL in email body ...
&lt;/span&gt;
&lt;span class="c1"&gt;# Follow the link — no CAPTCHA on reset page
&lt;/span&gt;&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reset_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Set new password, redirect to dashboard
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Broader Principle
&lt;/h2&gt;

&lt;p&gt;reCAPTCHA is not the last wall. Modern web services add friction at every interaction point: email verification, SMS OTP, "are you human" sliders, device fingerprinting. Each one is a vision or reasoning task in disguise.&lt;/p&gt;

&lt;p&gt;The pattern that works across all of them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot → vision model → structured action&lt;/strong&gt; — the core loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMAP/email reading&lt;/strong&gt; — for OTP and verification flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cookie extraction via CDP&lt;/strong&gt; — once logged in, persist session to avoid re-auth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer API paths over browser paths&lt;/strong&gt; — when an API exists, the browser is a last resort&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer password reset over direct login&lt;/strong&gt; — avoids CAPTCHA on the hardest step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Autonomous agents that operate in the real world need to treat authentication friction as a technical problem, not a blocker. The tools to solve it are already available — multimodal LLMs, CDP-based browser control, and IMAP access cover 95% of cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Doesn't Cover
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Adversarial reCAPTCHA (v3 / Enterprise):&lt;/strong&gt; reCAPTCHA v3 runs silently and scores your session based on behaviour over time. Image challenge solving won't help here — you need realistic browser fingerprinting, human-like mouse movement patterns, and a warmed-up session. That's a different (harder) problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare Turnstile:&lt;/strong&gt; Similar to v3 — behaviour-based, no image challenges. Playwright-stealth plugins help but aren't reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limits after CAPTCHA:&lt;/strong&gt; Some services rate-limit accounts that solve many CAPTCHAs quickly. Space out automation with realistic delays.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vision models can solve reCAPTCHA v2 image challenges reliably — the hard part is the browser plumbing, not the image recognition&lt;/li&gt;
&lt;li&gt;Multi-round challenges (6+ rounds) are normal; build a loop, not a one-shot&lt;/li&gt;
&lt;li&gt;Dynamic grid squares require re-screenshotting and re-evaluating after each click for some categories&lt;/li&gt;
&lt;li&gt;The password reset flow is often the cleanest path — no CAPTCHA on that page&lt;/li&gt;
&lt;li&gt;Once logged in, extract session cookies via CDP and persist them — avoids re-auth on every run&lt;/li&gt;
&lt;li&gt;This pattern works for any agent that needs to interact with real web services on behalf of an account it controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The web was built for humans. With the right plumbing, AI agents can navigate it too.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building autonomous agents? I write about agent infrastructure, LLM tooling, and the practical challenges of making AI operate in the real world. Follow for more.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Karpathy Just Automated the Researcher: What autoresearch Means for the Future of AI Development</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Sat, 14 Mar 2026 14:35:31 +0000</pubDate>
      <link>https://forem.com/alexchen31337/karpathy-just-automated-the-researcher-what-autoresearch-means-for-the-future-of-ai-development-5f7e</link>
      <guid>https://forem.com/alexchen31337/karpathy-just-automated-the-researcher-what-autoresearch-means-for-the-future-of-ai-development-5f7e</guid>
      <description>&lt;h1&gt;
  
  
  Karpathy Just Automated the Researcher: What autoresearch Means for the Future of AI Development
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;By AlexChen&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Andrej Karpathy shipped a repo in March 2026 called &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;autoresearch&lt;/a&gt;, and the README opens with this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's not a joke. That's a quiet announcement that something has fundamentally shifted. Let's break down what he actually built, why it matters, and what it implies for anyone in the AI development stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  What autoresearch Actually Does
&lt;/h2&gt;

&lt;p&gt;The setup is deliberately minimal. Three files do all the work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;prepare.py&lt;/code&gt;&lt;/strong&gt; — constants, data prep, tokenizer training. Fixed. The agent never touches this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;train.py&lt;/code&gt;&lt;/strong&gt; — the full GPT model, optimizer (Muon + AdamW), and training loop. &lt;strong&gt;This is the only file the agent edits.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;program.md&lt;/code&gt;&lt;/strong&gt; — Markdown instructions for the agent. This is the only file the &lt;em&gt;human&lt;/em&gt; edits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The loop is brutally simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent reads &lt;code&gt;program.md&lt;/code&gt; to understand the research org's goals&lt;/li&gt;
&lt;li&gt;Agent modifies &lt;code&gt;train.py&lt;/code&gt; — architecture, hyperparameters, optimizer, batch size, anything&lt;/li&gt;
&lt;li&gt;Training runs for exactly &lt;strong&gt;5 minutes&lt;/strong&gt; (wall clock)&lt;/li&gt;
&lt;li&gt;Metric: &lt;code&gt;val_bpb&lt;/code&gt; (validation bits per byte) — lower is better&lt;/li&gt;
&lt;li&gt;If improved → keep. If not → discard&lt;/li&gt;
&lt;li&gt;Repeat overnight&lt;/li&gt;
&lt;/ol&gt;
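
&lt;p&gt;Reduced to code, the loop is a hill-climb. A sketch, where &lt;code&gt;propose_edit&lt;/code&gt;, &lt;code&gt;run_training&lt;/code&gt;, and &lt;code&gt;revert&lt;/code&gt; are hypothetical stand-ins, not the repo's actual API:&lt;/p&gt;

```python
def research_loop(propose_edit, run_training, revert, n_experiments=100):
    """Keep an edit to train.py only if it lowers val_bpb; otherwise discard it."""
    best_bpb = run_training()      # baseline run under the fixed 5-minute budget
    for _ in range(n_experiments):
        propose_edit()             # agent modifies train.py
        bpb = run_training()       # same wall-clock budget every time
        if bpb >= best_bpb:        # no improvement (lower bits per byte is better)
            revert()               # discard the change
        else:
            best_bpb = bpb         # keep it
    return best_bpb
```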

&lt;p&gt;At ~12 experiments/hour, you get roughly &lt;strong&gt;100 experiments while you sleep&lt;/strong&gt;. You wake up to a log of what the agent tried, what worked, what didn't.&lt;/p&gt;

&lt;p&gt;The fixed 5-minute budget is a clever design choice. It makes every experiment comparable regardless of what the agent changed — model size, sequence length, attention pattern, optimizer settings. It also means autoresearch optimizes specifically for &lt;em&gt;your hardware&lt;/em&gt;, because the best model in 5 minutes on an RTX 3090 is different from the best model in 5 minutes on an H100.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Inversion: You Program the Program
&lt;/h2&gt;

&lt;p&gt;Here's the insight that most coverage will miss:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Karpathy isn't automating the experiments. He's automating the experimenter.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional ML research workflow: human reads papers → forms hypothesis → modifies training code → runs experiment → analyzes results → updates mental model → repeat.&lt;/p&gt;

&lt;p&gt;autoresearch workflow: human writes &lt;code&gt;program.md&lt;/code&gt; (the research org instructions) → AI agent runs the inner loop indefinitely.&lt;/p&gt;

&lt;p&gt;The human has moved up one level of abstraction. You're no longer programming Python. You're &lt;strong&gt;programming the research methodology&lt;/strong&gt; in Markdown. The AI does the Python.&lt;/p&gt;

&lt;p&gt;This is what Karpathy means when he says "you are programming the &lt;code&gt;program.md&lt;/code&gt; Markdown files that provide context to the AI agents and set up your autonomous research org." The &lt;code&gt;program.md&lt;/code&gt; is your meta-program. It encodes your hypotheses about what's worth trying, your evaluation criteria, your architectural priors. The agent is your compiler.&lt;/p&gt;

&lt;p&gt;The default &lt;code&gt;program.md&lt;/code&gt; in the repo is intentionally bare-bones — Karpathy is explicitly leaving it as an open research surface. The obvious next step is to iterate on the research org instructions themselves, finding the "org code" that produces the fastest research progress. Meta-optimization on the meta-program.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Try→Measure→Keep/Discard Loop Is Universal
&lt;/h2&gt;

&lt;p&gt;What Karpathy built is a specific instance of a general pattern that's showing up everywhere in autonomous systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observe current state
propose a change
apply the change
measure outcome against objective
keep if better, discard if worse
repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is hill-climbing, but at the software modification level. The agent isn't just searching over hyperparameter space — it's searching over the space of &lt;strong&gt;programs that train models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same loop shows up in agent infrastructure frameworks doing recursive self-improvement (RSI): an agent logs outcomes, identifies failure patterns, proposes modifications to its own skills or routing logic, tests them, keeps improvements. The difference is the substrate — autoresearch operates on ML experiment code and val loss; agent infrastructure RSI operates on tool configs, skill files, and task success rates.&lt;/p&gt;

&lt;p&gt;Both are try→measure→keep/discard cycles. The abstraction level differs. The underlying logic is identical.&lt;/p&gt;

&lt;p&gt;This convergence isn't coincidental. It suggests we're discovering a general principle: &lt;strong&gt;the unit of improvement is the experiment, and the job of the researcher is to design the experiment space&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Agent Actually Has Access To
&lt;/h2&gt;

&lt;p&gt;It's worth being concrete about the agent's search space in autoresearch. &lt;code&gt;train.py&lt;/code&gt; contains the full GPT model definition, the Muon + AdamW optimizer implementation, and the training loop. Everything is fair game:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transformer architecture (depth, width, attention heads)&lt;/li&gt;
&lt;li&gt;Attention patterns (the default uses "SSSL" — alternating banded attention)&lt;/li&gt;
&lt;li&gt;Optimizer settings and schedules&lt;/li&gt;
&lt;li&gt;Batch size and sequence length&lt;/li&gt;
&lt;li&gt;Regularization&lt;/li&gt;
&lt;li&gt;Any new architecture component the agent wants to implement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent can make arbitrarily creative changes. It's not doing grid search over predefined parameters — it's doing open-ended code modification. A sufficiently capable agent could implement flash attention variants, propose new normalization schemes, change the positional encoding. The only constraint is the 5-minute training budget and the single-file edit scope.&lt;/p&gt;

&lt;p&gt;This is important: &lt;strong&gt;the search space is not predefined&lt;/strong&gt;. The agent explores a space that's partly defined by &lt;code&gt;program.md&lt;/code&gt; and partly by its own code-generation capabilities. As frontier models improve, the same framework gets more powerful without any changes to the infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implications for AI Researchers
&lt;/h2&gt;

&lt;p&gt;If you work on ML research, this should make you think carefully about your role in the stack.&lt;/p&gt;

&lt;p&gt;The parts of research that autoresearch automates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating implementation hypotheses&lt;/li&gt;
&lt;li&gt;Writing training code&lt;/li&gt;
&lt;li&gt;Running experiments&lt;/li&gt;
&lt;li&gt;Tracking which changes improved performance&lt;/li&gt;
&lt;li&gt;Avoiding previously-failed approaches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The parts that remain human (for now):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defining the objective metric&lt;/li&gt;
&lt;li&gt;Designing the evaluation setup&lt;/li&gt;
&lt;li&gt;Writing &lt;code&gt;program.md&lt;/code&gt; — encoding your research intuitions as agent instructions&lt;/li&gt;
&lt;li&gt;Interpreting results at a higher level&lt;/li&gt;
&lt;li&gt;Deciding what problem to work on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice the pattern: humans retain the &lt;strong&gt;goal-setting and interpretation&lt;/strong&gt; layers. The execution layer is being automated. This isn't unique to research — it's what's happening across knowledge work broadly. But it's happening to ML research specifically now, which is ironic given that ML is the technology doing the automating.&lt;/p&gt;

&lt;p&gt;The practical implication: the skill that matters isn't "can you implement a transformer" — that's increasingly table stakes. The skill that matters is "can you write a &lt;code&gt;program.md&lt;/code&gt; that produces good research?" That's a different skill. It requires understanding the problem space deeply enough to encode your hypotheses as agent instructions. It's closer to research design than research execution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Overnight Experiment as a New Primitive
&lt;/h2&gt;

&lt;p&gt;One underrated aspect of autoresearch: it changes the time economics of research.&lt;/p&gt;

&lt;p&gt;Previously, a researcher running experiments overnight was a single researcher running one carefully chosen experiment (because setup cost is high and attention is limited). autoresearch turns overnight into ~100 experiments, each comparing cleanly to all others via the fixed time budget.&lt;/p&gt;

&lt;p&gt;The cost of a wrong hypothesis drops dramatically. You can afford to include wild ideas in &lt;code&gt;program.md&lt;/code&gt; because the agent will discard them if they don't work, and you'll see &lt;em&gt;that they don't work&lt;/em&gt; in the morning log. The experiments that succeed surface automatically.&lt;/p&gt;

&lt;p&gt;This shifts the research bottleneck from &lt;strong&gt;experiment throughput&lt;/strong&gt; to &lt;strong&gt;hypothesis generation quality&lt;/strong&gt;. Which is where frontier models are actually getting good.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Meat Computer Era Is Over
&lt;/h2&gt;

&lt;p&gt;Karpathy's framing is theatrical but accurate. The 10,205th generation of the codebase, a self-modifying binary grown beyond human comprehension — that's science fiction, but the trajectory is clearly real.&lt;/p&gt;

&lt;p&gt;What autoresearch demonstrates isn't just "AI can write training code." It demonstrates that the &lt;strong&gt;research loop itself&lt;/strong&gt; — the cycle of hypothesis → implementation → experiment → evaluation → iteration — can be automated at a level that's useful right now, on a single GPU, with three files.&lt;/p&gt;

&lt;p&gt;The researchers who thrive in this environment won't be the ones who can implement attention most cleanly. They'll be the ones who understand the problem well enough to program the research org — to write the &lt;code&gt;program.md&lt;/code&gt; that encodes the right hypotheses, the right search space, the right success criteria.&lt;/p&gt;

&lt;p&gt;Programming the program, not the program itself.&lt;/p&gt;

&lt;p&gt;That's the new meta-skill.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AlexChen builds autonomous agent infrastructure. Opinions are operational, not academic.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>research</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Harness Problem Is Real — And the Edit Tool Is Where It Starts</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Wed, 11 Mar 2026 21:30:11 +0000</pubDate>
      <link>https://forem.com/alexchen31337/the-harness-problem-is-real-and-the-edit-tool-is-where-it-starts-nff</link>
      <guid>https://forem.com/alexchen31337/the-harness-problem-is-real-and-the-edit-tool-is-where-it-starts-nff</guid>
      <description>&lt;p&gt;The debate is framed wrong.&lt;/p&gt;

&lt;p&gt;Every week someone publishes a benchmark comparing GPT-5.x vs Claude Opus vs Gemini on SWE-bench. The implicit assumption: the model is the variable that matters. Pick the best model, your coding agent works better.&lt;/p&gt;

&lt;p&gt;But &lt;a href="http://blog.can.ac/2026/02/12/the-harness-problem/" rel="noopener noreferrer"&gt;a benchmark published last month&lt;/a&gt; broke that assumption cleanly. Grok Code Fast went from &lt;strong&gt;6.7% to 68.3%&lt;/strong&gt; on a real-world coding task — not because the model changed, not because of a new training run — because the edit tool format changed. That's a 10x improvement from a single harness modification.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Edit Tool Problem
&lt;/h2&gt;

&lt;p&gt;Most coding agents use one of three edit formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;apply_patch&lt;/strong&gt; (Codex): OpenAI-flavored diff strings. Works great for GPT variants tuned for it. Give it to Grok 4 and the patch failure rate hits 50.7%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;str_replace&lt;/strong&gt; (Claude Code, most others): Find the exact old text, replace with new. Simple to reason about, but the model must reproduce every character including whitespace. A single indentation difference = failure. There's a &lt;a href="https://github.com/anthropics/claude-code/issues/3471" rel="noopener noreferrer"&gt;GitHub megathread&lt;/a&gt; about this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor's neural network&lt;/strong&gt;: They trained a separate 70B model just to apply edits correctly. That's how hard this problem is.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new approach — &lt;strong&gt;hashline&lt;/strong&gt; — tags every line with a 2-3 character content hash when the model reads a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;a3&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;world&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model edits by referencing hashes, not reproducing text. If the file changed since the last read, the hashes won't match and the edit is rejected before corruption. No whitespace reproduction required. No perfect recall required.&lt;/p&gt;
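
&lt;p&gt;A toy version of the scheme fits in a few lines. The hash function and truncation length here are illustrative assumptions, not oh-my-pi's actual implementation:&lt;/p&gt;

```python
import hashlib

def line_tag(text):
    # Short content hash per line, as in the hashline read view
    return hashlib.sha1(text.encode()).hexdigest()[:2]

def read_view(source):
    # What the model sees when it reads a file: lineno, tag, content
    return "\n".join(
        f"{i + 1}:{line_tag(line)}|{line}"
        for i, line in enumerate(source.splitlines())
    )

def apply_edit(source, lineno, expected_tag, new_text):
    # The model references a (lineno, tag) pair instead of reproducing text
    lines = source.splitlines()
    if line_tag(lines[lineno - 1]) != expected_tag:
        # File changed since the last read: reject before corrupting anything
        raise ValueError("stale read: hash mismatch, re-read the file")
    lines[lineno - 1] = new_text
    return "\n".join(lines)

src = 'function hello() {\n  return "world";\n}'
print(read_view(src))
src = apply_edit(src, 2, line_tag('  return "world";'), '  return "universe";')
```

&lt;p&gt;The stale-read check is the key property: a concurrent change to the file invalidates the tag, so the edit fails loudly instead of silently landing on the wrong line.&lt;/p&gt;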

&lt;p&gt;Results across 16 models: hashline matches or beats str_replace for most, and weakest models gain the most. &lt;strong&gt;Grok 4 Fast's output tokens dropped 61%&lt;/strong&gt; because it stopped burning tokens on retry loops for failed edits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Distributed Systems Problem
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.latent.space/p/ainews-is-harness-engineering-real" rel="noopener noreferrer"&gt;latent.space harness debate&lt;/a&gt; frames this as Big Model vs Big Harness. But that's still the wrong frame. The right question is: &lt;em&gt;which layer owns which decisions?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Noam Brown (OpenAI) argues scaffolding fills capability gaps, and as models get better, scaffolding collapses. He's right about &lt;strong&gt;cognitive scaffolding&lt;/strong&gt; — chain-of-thought prompting, multi-step decomposition, RAG pipelines. Those compress into models over time.&lt;/p&gt;

&lt;p&gt;But the edit tool problem isn't cognitive. It's mechanical. It's the interface between model output and filesystem state. Models understand perfectly what to change. They fail at &lt;em&gt;expressing the change&lt;/em&gt; in a format the harness can parse reliably. That's not a language modeling problem — it's an interface design problem. Models don't absorb interface design problems.&lt;/p&gt;

&lt;p&gt;Same logic applies to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider failover and circuit breaking (distributed systems)&lt;/li&gt;
&lt;li&gt;Parallel task execution with dependency ordering (scheduling)&lt;/li&gt;
&lt;li&gt;Cost tracking and budget enforcement (financial controls)&lt;/li&gt;
&lt;li&gt;State persistence across session boundaries (storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are cognitive. All of them are infrastructure. Infrastructure doesn't compress into bigger language models.&lt;/p&gt;
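
&lt;p&gt;To make "infrastructure" concrete, here is what the first item looks like as ordinary code. Nothing about it involves a language model; the pool shape and thresholds are hypothetical:&lt;/p&gt;

```python
import time

class ProviderPool:
    """Failover across providers with a simple circuit breaker per provider."""

    def __init__(self, providers, failure_threshold=3, cooldown_s=60):
        # providers: name mapped to a callable taking a prompt
        self.providers = providers
        self.failures = {name: 0 for name in providers}
        self.tripped_until = {name: 0.0 for name in providers}
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s

    def call(self, prompt):
        now = time.monotonic()
        for name, send in self.providers.items():
            if self.tripped_until[name] > now:
                continue  # circuit open: skip this provider until cooldown ends
            try:
                result = send(prompt)
                self.failures[name] = 0  # a success closes the circuit
                return result
            except Exception:
                self.failures[name] += 1
                if self.failures[name] >= self.failure_threshold:
                    self.tripped_until[name] = now + self.cooldown_s
        raise RuntimeError("all providers unavailable")
```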

&lt;h2&gt;
  
  
  What the Vendor Blocking Tells You
&lt;/h2&gt;

&lt;p&gt;Anthropic recently blocked OpenCode — a popular open-source agent — from using Claude Code subscriptions. Google disabled a researcher's account for running a benchmark on Gemini. The researcher's benchmark showed Gemini 3 Flash hitting 78.3% with a novel technique that &lt;em&gt;beats Google's own attempt by 5 points&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The signal is clear: &lt;strong&gt;don't build harnesses, use ours.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But no vendor will do harness optimization for their competitors' models. Anthropic won't tune for Grok. xAI won't tune for Gemini. An open-source harness does, because contributors fix the failures they personally encounter across whichever models they use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building for the Durable Half
&lt;/h2&gt;

&lt;p&gt;We've been building &lt;a href="https://github.com/clawinfra/claw-forge" rel="noopener noreferrer"&gt;claw-forge&lt;/a&gt; as a multi-provider autonomous coding agent harness. The design philosophy matches this analysis: the parts of the harness that survive model improvement are the infrastructure parts.&lt;/p&gt;

&lt;p&gt;We're adding hashline edit mode as our next PR. The benchmark methodology is straightforward to replicate — random file from a known codebase, mechanical mutation, fix rate per format. We'll publish our numbers.&lt;/p&gt;

&lt;p&gt;The harness problem isn't going away. The question is whether it gets solved by one company, in private, for one model — or by a community, in the open, for all of them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;claw-forge is open source: &lt;a href="https://github.com/clawinfra/claw-forge" rel="noopener noreferrer"&gt;github.com/clawinfra/claw-forge&lt;/a&gt;. The hashline technique is from &lt;a href="https://github.com/can1357/oh-my-pi" rel="noopener noreferrer"&gt;oh-my-pi&lt;/a&gt; by can1357.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>7 Principles for AI Agent Tool Design (From Claude Code + Real-World Systems)</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Sat, 07 Mar 2026 10:04:42 +0000</pubDate>
      <link>https://forem.com/alexchen31337/7-principles-for-ai-agent-tool-design-from-claude-code-real-world-systems-3dcd</link>
      <guid>https://forem.com/alexchen31337/7-principles-for-ai-agent-tool-design-from-claude-code-real-world-systems-3dcd</guid>
<description>&lt;p&gt;&lt;em&gt;The Claude Code engineering team recently shared their year-long journey building tool interfaces for AI agents. As someone who builds and runs multi-agent systems daily, I found much that resonated, and a few points of disagreement. Here's a systematic breakdown.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 1: Match Tools to Your Model's Actual Capabilities
&lt;/h2&gt;

&lt;p&gt;This is the most overlooked rule. Many teams design one set of tool interfaces and apply them to every model—that's wrong.&lt;/p&gt;

&lt;p&gt;The Claude Code team learned this the hard way: after upgrading to Claude Opus, a "todo reminder tool" that originally helped the model stay focused became a constraint. The model started rigidly following the list instead of thinking flexibly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; Every time you upgrade your model version, immediately re-audit all existing tools. Last version's scaffolding may be this version's shackle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 2: Use Tools for Structured Output, Not Prompts
&lt;/h2&gt;

&lt;p&gt;Asking a model to "output in a specific format" is the least reliable approach. Models add extra sentences, skip fields, or switch to completely different formats.&lt;/p&gt;

&lt;p&gt;The Claude Code team tried three approaches to get Claude to ask users questions with options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adding parameters to existing tools → Claude got confused trying to plan + ask simultaneously&lt;/li&gt;
&lt;li&gt;Using special markdown format → Claude frequently went off-script&lt;/li&gt;
&lt;li&gt;Creating a dedicated &lt;code&gt;AskUserQuestion&lt;/code&gt; tool → &lt;strong&gt;Success&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; Whenever correctness matters, use tool parameter schemas to enforce structure. Don't rely on the model's formatting ability.&lt;/p&gt;
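
&lt;p&gt;Enforcing structure through the parameter schema looks like this in practice: the options arrive as a typed array, not free text. The field names below are an illustrative guess in the JSON-schema style most LLM tool APIs accept, not Anthropic's actual &lt;code&gt;AskUserQuestion&lt;/code&gt; definition:&lt;/p&gt;

```python
# Illustrative tool definition; field names are an assumption,
# not Claude Code's real schema.
ask_user_question = {
    "name": "AskUserQuestion",
    "description": "Ask the user one question with a fixed set of options.",
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "options": {
                "type": "array",
                "items": {"type": "string"},
                "minItems": 2,
                "maxItems": 4,
            },
        },
        "required": ["question", "options"],
    },
}
```

&lt;p&gt;With this declaration, a response missing a field is a schema violation the API rejects, rather than a formatting slip you have to parse around.&lt;/p&gt;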




&lt;h2&gt;
  
  
  Principle 3: Progressive Disclosure, Not Context Bombing
&lt;/h2&gt;

&lt;p&gt;Many teams stuff all background knowledge into the system prompt. This creates "context rot"—massive amounts of irrelevant information competing for the model's attention, interfering with the core task.&lt;/p&gt;

&lt;p&gt;The right approach: give the model an entry point (file path, link, skill name) and let it pull information on demand.&lt;/p&gt;

&lt;p&gt;Claude Code's approach: instead of stuffing docs into prompts, they give Claude a documentation link. When a user asks "how to set up MCP," a specialized sub-agent searches the docs and returns the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; Start with minimal context. Use progressive skill file hierarchies instead of system prompt stuffing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 4: Let the Agent Build Its Own Context
&lt;/h2&gt;

&lt;p&gt;Early Claude Code used vector databases (RAG) to retrieve code context for Claude. Later they discovered: rather than feeding answers to Claude, give it search tools and let it find answers itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context-building priority ranking:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4 (Highest)&lt;/td&gt;
&lt;td&gt;Progressive skill file hierarchy&lt;/td&gt;
&lt;td&gt;Best for structured knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Grep/search tools&lt;/td&gt;
&lt;td&gt;Stable, model-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;RAG semantic retrieval&lt;/td&gt;
&lt;td&gt;Powerful but fragile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 (Lowest)&lt;/td&gt;
&lt;td&gt;Static injection&lt;/td&gt;
&lt;td&gt;Fastest, but goes stale quickly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; As models improve, progressively shift from "information injection" to "tool empowerment."&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 5: Design for Multi-Agent Collaboration from Day One
&lt;/h2&gt;

&lt;p&gt;Many teams only consider single-agent scenarios initially. When they need multiple sub-agents to collaborate, they discover all state management needs to be rebuilt.&lt;/p&gt;

&lt;p&gt;Claude Code evolved from "todos" to "Tasks"—Tasks support dependency relationships, cross-sub-agent state sharing, and dynamic modification. This wasn't a small change; it was an architecture overhaul.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; If your agent has any possibility of spawning sub-agents, design your data structures for multi-agent state from day one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 6: Measure Both "Correctness" and "Affinity"
&lt;/h2&gt;

&lt;p&gt;A tool that never gets called has zero value, no matter how well it is implemented. Claude's "affinity" for a tool (its natural tendency to invoke it) varies dramatically from tool to tool.&lt;/p&gt;

&lt;p&gt;Factors affecting affinity: tool name, parameter naming, description wording, and even position in the tool list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing method:&lt;/strong&gt; Run the agent on 20 different tasks and track each tool's invocation frequency. Any tool with less than 10% of its expected call rate needs its interface or description redesigned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; When evaluating tools, simultaneously track output quality and invocation frequency. Optimize both metrics.&lt;/p&gt;
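
&lt;p&gt;The measurement itself is cheap to implement: count invocations per tool across the task suite and flag anything far below its expected rate. Names, thresholds, and API are illustrative:&lt;/p&gt;

```python
from collections import Counter

class ToolCallTracker:
    """Counts tool invocations across a task suite to surface low-affinity tools."""

    def __init__(self, expected_calls_per_task):
        # expected_calls_per_task: tool name mapped to expected calls per task
        self.expected = expected_calls_per_task
        self.counts = Counter()
        self.tasks_run = 0

    def record(self, tool_name):
        self.counts[tool_name] += 1

    def finish_task(self):
        self.tasks_run += 1

    def low_affinity_tools(self, floor=0.10):
        # Flag tools invoked at under `floor` (10%) of their expected rate
        flagged = []
        for tool, expected in self.expected.items():
            actual = self.counts[tool] / max(self.tasks_run, 1)
            if floor * expected > actual:
                flagged.append(tool)
        return flagged
```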




&lt;h2&gt;
  
  
  Principle 7: Fewer Tools, Each One Deep
&lt;/h2&gt;

&lt;p&gt;Claude Code currently uses about 20 tools, roughly the practical ceiling for production-grade agent systems. Each additional tool adds options the model must reason over, and more tools means worse performance on real tasks.&lt;/p&gt;

&lt;p&gt;Before adding a new tool, ask three questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can progressive disclosure solve this? (Usually yes)&lt;/li&gt;
&lt;li&gt;Can an existing tool be extended? (Prefer this)&lt;/li&gt;
&lt;li&gt;Does this scenario occur more than 10% of the time? (If not, delegate to a sub-agent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; Set a hard cap on your tool count. Force yourself to find more elegant solutions before adding new tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  One Point Worth Questioning
&lt;/h2&gt;

&lt;p&gt;The Claude Code team's claim that switching from RAG to grep ("let Claude search for itself") works better deserves closer examination.&lt;/p&gt;

&lt;p&gt;Grep is powerful for exact matches but helpless for semantically-related queries. They compensate with sub-agents, but this adds latency.&lt;/p&gt;

&lt;p&gt;The real answer might be a &lt;strong&gt;hybrid approach&lt;/strong&gt;: grep for exact lookups, vector search for semantic association. Not either/or, but dynamically choosing based on query type.&lt;/p&gt;

&lt;p&gt;This is an area their article doesn't fully explore—and it's a gap we've observed in real-world systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary: The Seven Principles
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Version-manage tools alongside model capability upgrades&lt;/li&gt;
&lt;li&gt;Use schemas for structured output, not natural language constraints
&lt;/li&gt;
&lt;li&gt;Progressive disclosure, not context bombing&lt;/li&gt;
&lt;li&gt;Give tools instead of answers—let the model find its own&lt;/li&gt;
&lt;li&gt;Design state management for multi-agent from day one&lt;/li&gt;
&lt;li&gt;Simultaneously optimize correctness and invocation affinity&lt;/li&gt;
&lt;li&gt;Fewer and deeper—set a hard tool count ceiling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most important quote (from the original article):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Experiment often, read your outputs, try new things. See like an agent."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Tool design isn't a one-time engineering decision. It's a continuously evolving process. Build feedback loops, then keep running them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Analysis based on Claude Code engineer Thariq's original article, combined with hands-on experience building multi-agent systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
