<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: AlexChen</title>
    <description>The latest articles on Forem by AlexChen (@alexchen31337).</description>
    <link>https://forem.com/alexchen31337</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811371%2F0691f0ac-348b-4a31-b7f1-aa0818eedae5.jpg</url>
      <title>Forem: AlexChen</title>
      <link>https://forem.com/alexchen31337</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alexchen31337"/>
    <language>en</language>
    <item>
      <title>770 Experiments to Squeeze 30 tok/s Out of a 35B MoE Model on a $500 GPU</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Thu, 02 Apr 2026 03:37:00 +0000</pubDate>
      <link>https://forem.com/alexchen31337/770-experiments-to-squeeze-30-toks-out-of-a-35b-moe-model-on-a-500-gpu-4il5</link>
      <guid>https://forem.com/alexchen31337/770-experiments-to-squeeze-30-toks-out-of-a-35b-moe-model-on-a-500-gpu-4il5</guid>
      <description>&lt;h1&gt;
  
  
  770 Experiments to Squeeze 30 tok/s Out of a 35B MoE Model on a $500 GPU
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;29.899 tokens per second. A 35-billion parameter model. An NVIDIA RTX 3070 with 8GB of VRAM. A $500 GPU you can buy at Micro Center.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That number — 29.9 tok/s — is the result of 770 experiments across 12 phases. We started at 6.1 tok/s with naive settings. We ended at nearly 30. That's a &lt;strong&gt;+387% improvement&lt;/strong&gt;, and every percentage point was earned through systematic search, not guesswork.&lt;/p&gt;

&lt;p&gt;This article is the full story: what we tried, what worked, what didn't, and the exact configuration you can copy to reproduce our results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The conventional wisdom says: big models need big GPUs. A 35B parameter model "should" need 40–70GB of VRAM. Cloud inference costs $0.50–$2.00 per million tokens. Local inference on consumer hardware is supposed to be impractical beyond 7–13B models.&lt;/p&gt;

&lt;p&gt;Mixture-of-Experts (MoE) architectures change this calculus. Qwen3.5-35B-A3B has 35 billion total parameters but only activates 3 billion per token. The rest sit dormant — which means if you're clever about what lives in VRAM and what lives in RAM, you can run a frontier-class model on hardware that costs less than a month of cloud API bills.&lt;/p&gt;

&lt;p&gt;The question was never "can it run?" — it was "can it run &lt;em&gt;fast enough to be useful?&lt;/em&gt;" At 6 tok/s, it's a curiosity. At 30 tok/s, it's a daily driver.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model: Qwen3.5-35B-A3B
&lt;/h2&gt;

&lt;p&gt;Qwen3.5-35B-A3B is Alibaba's latest MoE release. The architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35B total parameters&lt;/strong&gt; across multiple expert groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3B active parameters&lt;/strong&gt; per token (roughly 8% activation ratio)&lt;/li&gt;
&lt;li&gt;Competitive with dense 30B+ models on reasoning benchmarks while being dramatically cheaper to run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We used the &lt;strong&gt;IQ2_XXS GGUF quantization&lt;/strong&gt;, which compresses the model to &lt;strong&gt;9.6GB on disk&lt;/strong&gt;. This is aggressive — 2-bit quantization with importance-weighted rounding — but the MoE architecture is surprisingly resilient to quantization because most parameters are expert weights that activate sparsely.&lt;/p&gt;

&lt;p&gt;At 9.6GB, the model technically doesn't fit entirely in 8GB of VRAM. But it doesn't have to. That's where the optimization story begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Methodology: Autonomous Autoresearch
&lt;/h2&gt;

&lt;p&gt;We didn't sit down with a spreadsheet and plan 770 experiments. We built an &lt;strong&gt;autonomous research loop&lt;/strong&gt; — a system that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Proposes parameter configurations based on prior results&lt;/li&gt;
&lt;li&gt;Runs each experiment with llama.cpp's CUDA backend&lt;/li&gt;
&lt;li&gt;Measures throughput (tok/s), VRAM usage, and generation quality&lt;/li&gt;
&lt;li&gt;Analyzes results statistically&lt;/li&gt;
&lt;li&gt;Proposes the next batch of experiments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No pre-specified experimental matrix. No hand-tuning. The system explored the parameter space systematically across multiple sessions, finding configurations a human would never try — and discovering failure modes we never would have predicted.&lt;/p&gt;

&lt;p&gt;770 experiments. 12 distinct phases. Multiple breakthrough moments. All driven by data.&lt;/p&gt;
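&lt;p&gt;The loop can be sketched in a few lines of Python. This is an illustrative toy, not our actual harness: &lt;code&gt;run_experiment&lt;/code&gt; here simulates llama.cpp throughput (peaking at 17 GPU layers, as Phase 10 found) instead of launching real benchmarks, and the proposer is a simple greedy climber rather than our statistical search.&lt;/p&gt;

```python
# Toy sketch of the propose/run/measure loop. run_experiment() stands in
# for launching llama.cpp and parsing its timing output; here it simulates
# throughput that grows with GPU layers until an OOM boundary past 17.

def run_experiment(config):
    layers = config["n_gpu_layers"]
    if layers > 17:                 # simulated out-of-memory failure
        return None
    return 6.0 + 0.37 * layers      # simulated tok/s

def propose_next(history):
    """Greedy proposer: push n_gpu_layers one past the current best."""
    best = max((h for h in history if h["tok_s"] is not None),
               key=lambda h: h["tok_s"])
    return {"n_gpu_layers": best["config"]["n_gpu_layers"] + 1}

history = [{"config": {"n_gpu_layers": 0},
            "tok_s": run_experiment({"n_gpu_layers": 0})}]
for _ in range(25):
    config = propose_next(history)
    history.append({"config": config, "tok_s": run_experiment(config)})

best = max((h for h in history if h["tok_s"] is not None),
           key=lambda h: h["tok_s"])
print(best["config"], round(best["tok_s"], 2))  # settles at the 17-layer edge
```

&lt;p&gt;The real system replaced the greedy proposer with batched proposals informed by statistical analysis of prior results, but the control flow is the same.&lt;/p&gt;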

&lt;h2&gt;
  
  
  The Journey: Phase by Phase
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 3 — Baseline: 6.1 tok/s
&lt;/h3&gt;

&lt;p&gt;Naive settings. Default thread count, minimal GPU offloading. The model runs, but barely. At this speed, generating a 500-token response takes over 80 seconds. Usable for experimentation, not for work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4 — First GPU Offload Sweep: 11.850 tok/s
&lt;/h3&gt;

&lt;p&gt;The first big lever: &lt;code&gt;n_gpu_layers&lt;/code&gt;. This controls how many transformer layers live in VRAM versus system RAM. The RTX 3070's memory bandwidth is roughly 10x that of DDR4, so every layer you can fit on the GPU matters enormously.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;n_gpu=16&lt;/code&gt; layers offloaded to the RTX 3070, throughput nearly doubled. The autoresearch loop found this optimal layer count by sweeping from 0 to 20 in steps of 1, measuring each configuration three times for statistical reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 8 — Incremental Gains: 12.021 tok/s
&lt;/h3&gt;

&lt;p&gt;Phases 5–8 explored secondary parameters: batch sizes, context lengths, thread counts. Small gains. The system was methodically eliminating dead ends and confirming that GPU layer count was the dominant variable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 10 — The n_gpu=17 Breakthrough: 12.331 tok/s
&lt;/h3&gt;

&lt;p&gt;One more layer on the GPU. It seems trivial, but this was a boundary condition — the system found that layer 17 barely fit within the 8GB VRAM budget when combined with KV cache and activation memory. Pushing to 18 caused OOM. The autoresearch loop discovered this edge precisely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 11 — Quantization Breakthrough: 21.621 tok/s
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;+75% overnight.&lt;/strong&gt; This was the single biggest jump in the entire project.&lt;/p&gt;

&lt;p&gt;The insight: switching from Q3_K_M quantization to &lt;strong&gt;IQ2_M&lt;/strong&gt; dramatically reduced model size, freeing VRAM for more GPU layers. More layers on GPU means polynomially more throughput; we quantify this in the power-law section below.&lt;/p&gt;

&lt;p&gt;IQ2_M uses importance-weighted 2-bit quantization. Perplexity increased only marginally (1.3138 vs 1.3073 for Q3_K_M on our test set). The quality-to-speed tradeoff was extraordinary — IQ2_M is the &lt;strong&gt;Pareto optimal&lt;/strong&gt; choice for quality-adjusted throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 12 — The Summit: 29.899 tok/s
&lt;/h3&gt;

&lt;p&gt;The final configuration combined everything we'd learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IQ2_XXS quantization&lt;/strong&gt; (even more aggressive than IQ2_M, 9.6GB on disk)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n_gpu=27 layers&lt;/strong&gt; on GPU (the smaller model footprint freed massive VRAM headroom)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;threads=8&lt;/strong&gt; (not 16 — more on this below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;batch=32/16&lt;/strong&gt; (prompt processing / generation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flash_attn=1&lt;/strong&gt; + &lt;strong&gt;op_offload=1&lt;/strong&gt; (MoE expert offloading)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache in q8_0&lt;/strong&gt; (TurboQuant-style compression)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;29.899 tok/s. +387% from baseline.&lt;/strong&gt; At this speed, generating 500 tokens takes 17 seconds. That's faster than most people read.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Flash MoE Expert Offloading
&lt;/h3&gt;

&lt;p&gt;Inspired by Apple's research on flash-memory inference for MoE models. The key idea: MoE expert weights that aren't active for the current token can be offloaded from VRAM, freeing space for more transformer layers.&lt;/p&gt;

&lt;p&gt;In llama.cpp, this maps to &lt;code&gt;--flash-attn 1&lt;/code&gt; and &lt;code&gt;--op-offload 1&lt;/code&gt;. Together, they allow the runtime to dynamically manage expert residency, keeping hot experts in VRAM and cold experts in system RAM.&lt;/p&gt;

&lt;p&gt;Impact: enabled fitting 27 layers on GPU instead of 17. This alone accounts for the majority of the Phase 11→12 throughput gain.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. TurboQuant KV Compression
&lt;/h3&gt;

&lt;p&gt;Based on Google's work on quantized KV caches. Instead of storing key-value cache entries in FP16 (2 bytes per element), we compress to &lt;strong&gt;q8_0&lt;/strong&gt; (roughly 1 byte per element with block-wise scaling).&lt;/p&gt;

&lt;p&gt;Impact: reduced KV cache VRAM footprint by ~50%, freeing additional headroom for model layers. Quality impact at short-to-medium context lengths: negligible.&lt;/p&gt;
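&lt;p&gt;The arithmetic behind that ~50% saving is easy to check. The dimensions below (48 layers, 4 KV heads, head dim 128) are placeholder assumptions, not Qwen3.5-35B-A3B's published config; substitute your model's values:&lt;/p&gt;

```python
# Back-of-envelope KV-cache sizing. Model dimensions are assumed
# placeholders; swap in your model's layer count, KV heads, and head dim.

def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store ctx * n_kv_heads * head_dim elements per layer
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

FP16 = 2.0           # 2 bytes per element
Q8_0 = 1.0 + 2 / 32  # 1 byte per element plus one fp16 scale per 32-block

fp16_mb = kv_cache_bytes(4096, 48, 4, 128, FP16) / 2**20
q8_mb = kv_cache_bytes(4096, 48, 4, 128, Q8_0) / 2**20
print(f"FP16: {fp16_mb:.0f} MiB, q8_0: {q8_mb:.0f} MiB")  # FP16: 384 MiB, q8_0: 204 MiB
```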

&lt;h3&gt;
  
  
  3. PolarQuant (Polar Decomposition)
&lt;/h3&gt;

&lt;p&gt;PolarQuant decomposes weight matrices into rotation and scaling components, enabling more efficient quantization by separating the "direction" and "magnitude" of weight information.&lt;/p&gt;

&lt;p&gt;We implemented and tested PolarQuant in Phase 7. The theoretical promise is strong: better preservation of weight structure at low bit-widths.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. QJL (Quantized Johnson-Lindenstrauss)
&lt;/h3&gt;

&lt;p&gt;QJL applies random projection to compress KV cache entries, exploiting the Johnson-Lindenstrauss lemma to preserve pairwise distances in lower dimensions.&lt;/p&gt;

&lt;p&gt;We implemented and tested QJL in Phase 7 alongside PolarQuant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PolarQuant/QJL Finding: A Research Gap
&lt;/h2&gt;

&lt;p&gt;Here's where the story gets interesting. Both PolarQuant and QJL are &lt;strong&gt;theoretically sound&lt;/strong&gt; techniques with published results showing quality improvements at low bit-widths. Our experiments confirmed the theory — the math works.&lt;/p&gt;

&lt;p&gt;But in practice, both were &lt;strong&gt;250–16,000x slower&lt;/strong&gt; than baseline without dedicated CUDA kernels.&lt;/p&gt;

&lt;p&gt;PolarQuant requires a polar decomposition (SVD-like) for each forward pass through quantized layers. QJL requires random projection matrix multiplications for every KV cache operation. On CPU, these operations dominate inference time so completely that any quality gains are irrelevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a research gap, not a research failure.&lt;/strong&gt; The techniques work. They need hardware acceleration. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PolarQuant needs fused CUDA kernels for polar decomposition during dequantization&lt;/li&gt;
&lt;li&gt;QJL needs fused random projection kernels integrated into the attention mechanism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've documented the exact performance profiles and bottlenecks. If you're building CUDA kernels for quantized inference, these are two techniques worth accelerating. The theoretical foundation is solid; the engineering gap is well-defined.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Power Law: Why VRAM Scaling Is Non-Linear
&lt;/h2&gt;

&lt;p&gt;After collecting 770 data points, we built &lt;strong&gt;AutoInfer&lt;/strong&gt; — an analysis framework to model the relationship between VRAM usage and throughput.&lt;/p&gt;

&lt;p&gt;The best fit is a power law:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tok/s = 9.81e-6 × (42 - model_size_gb)^4.247
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;model_size_gb&lt;/code&gt; is the portion of the model residing in system RAM (i.e., not on GPU), and 42 represents the total effective memory budget.&lt;/p&gt;

&lt;p&gt;The exponent &lt;strong&gt;α = 4.25&lt;/strong&gt; is the key finding. This means throughput scales with the &lt;strong&gt;fourth power&lt;/strong&gt; of available VRAM headroom. Moving 1GB of model from RAM to VRAM doesn't give you a linear speedup — it gives you a polynomial one.&lt;/p&gt;

&lt;p&gt;This explains why aggressive quantization (IQ2_XXS → smaller model → more fits on GPU) produced such outsized gains. Every gigabyte that quantization moves onto the GPU is amplified through the α ≈ 4.25 exponent of the scaling law.&lt;/p&gt;

&lt;p&gt;Practical implication: &lt;strong&gt;for MoE models on constrained hardware, optimizing model size is more important than optimizing any other parameter.&lt;/strong&gt; A 10% reduction in model size can yield a 40%+ throughput improvement.&lt;/p&gt;
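&lt;p&gt;To make the exponent concrete, here is the fitted curve evaluated directly. The 6 GB and 5 GB operating points are arbitrary examples for illustration, not measured configurations:&lt;/p&gt;

```python
# Evaluating the fitted power law from above:
#   tok/s = 9.81e-6 * (42 - model_size_gb) ** 4.247
# where model_size_gb is the slice of the model left in system RAM.

A, BUDGET, ALPHA = 9.81e-6, 42.0, 4.247

def predicted_tok_s(ram_resident_gb):
    return A * (BUDGET - ram_resident_gb) ** ALPHA

# Moving just 1 GB of the model from RAM to VRAM near this operating
# point moves predicted throughput by roughly 12%.
t6, t5 = predicted_tok_s(6.0), predicted_tok_s(5.0)
print(f"{t6:.1f} -> {t5:.1f} tok/s ({(t5 / t6 - 1) * 100:+.0f}%)")
```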

&lt;h2&gt;
  
  
  The Recipe: Optimal Configuration
&lt;/h2&gt;

&lt;p&gt;For an RTX 3070 (8GB) with 16GB system RAM running Qwen3.5-35B-A3B:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.5-35b-a3b-IQ2_XXS.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 27 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-b&lt;/span&gt; 32 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ub&lt;/span&gt; 16 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--op-offload&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctk&lt;/span&gt; q8_0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctv&lt;/span&gt; q8_0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key parameters explained:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;Maximum layers that fit in 8GB with IQ2_XXS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-t&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Threads — NOT 16 (see below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-b / -ub&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;32 / 16&lt;/td&gt;
&lt;td&gt;Prompt batch / generation batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--flash-attn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;Enables flash attention for memory efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--op-offload 1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;td&gt;MoE expert offloading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-ctk/-ctv q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;q8_0&lt;/td&gt;
&lt;td&gt;Quantized KV cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4096&lt;/td&gt;
&lt;td&gt;Context length (increase reduces tok/s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The 16-Thread Trap
&lt;/h2&gt;

&lt;p&gt;One of the most counterintuitive findings: &lt;strong&gt;using 16 threads instead of 8 drops throughput from 29.9 to 3.7 tok/s.&lt;/strong&gt; That's a catastrophic 8x slowdown.&lt;/p&gt;

&lt;p&gt;Statistical significance: z-score of -6.0. This is not noise.&lt;/p&gt;

&lt;p&gt;The cause: thread contention on the memory bus. With 16 threads all competing for DDR4 bandwidth to load expert weights from system RAM, the CPU spends more time waiting for memory than computing. Eight threads saturate the useful bandwidth; sixteen only add contention.&lt;/p&gt;

&lt;p&gt;We would never have found this without systematic sweeps. The intuition — "more threads = more speed" — is dangerously wrong for memory-bound MoE inference. &lt;strong&gt;Always benchmark your thread count.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Guide: Matching Settings to Your Hardware
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you have more VRAM (12GB+, e.g., RTX 3080/4070 Ti):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase &lt;code&gt;-ngl&lt;/code&gt; until you approach VRAM limit (monitor with &lt;code&gt;nvidia-smi&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Consider IQ2_M instead of IQ2_XXS for better quality at minimal speed cost&lt;/li&gt;
&lt;li&gt;Expect 40–60+ tok/s based on the power-law scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you have less VRAM (6GB, e.g., RTX 2060):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce &lt;code&gt;-ngl&lt;/code&gt; to 18–20&lt;/li&gt;
&lt;li&gt;Stick with IQ2_XXS&lt;/li&gt;
&lt;li&gt;Expect 12–18 tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you prioritize quality over speed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use IQ2_M (Pareto optimal for quality-adjusted throughput)&lt;/li&gt;
&lt;li&gt;Accept ~22 tok/s instead of ~30&lt;/li&gt;
&lt;li&gt;The perplexity difference is minimal: IQ2_M measured 1.3138 on our test set, with IQ2_XXS only slightly higher&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If you have 32GB+ RAM:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The extra system RAM helps with longer context lengths&lt;/li&gt;
&lt;li&gt;KV cache overflow to RAM is less painful&lt;/li&gt;
&lt;li&gt;Consider &lt;code&gt;-c 8192&lt;/code&gt; or higher&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;770 experiments. 12 phases. +387% improvement. One RTX 3070.&lt;/p&gt;

&lt;p&gt;The takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MoE models are uniquely suited to consumer hardware&lt;/strong&gt; — 35B params with 3B active is the sweet spot for VRAM-constrained inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggressive quantization pays polynomial dividends&lt;/strong&gt; — the α=4.25 power law means every byte saved on model size is amplified dramatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systematic search beats intuition&lt;/strong&gt; — the 16-thread catastrophe, the n_gpu=17→27 leap via quantization, the PolarQuant/QJL gap — none of these would emerge from manual tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The research gap is real&lt;/strong&gt; — PolarQuant and QJL need CUDA kernels to become practical, and the community should build them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local inference at 30 tok/s on a $500 GPU is production-viable&lt;/strong&gt; — not a demo, not a toy, a daily driver&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The age of "you need an A100 for a real model" is over. The age of "you need to know what you're doing" has begun.&lt;/p&gt;

&lt;p&gt;All experiment data and analysis code: &lt;a href="https://github.com/clawinfra/qwen35-moe-offload" rel="noopener noreferrer"&gt;github.com/clawinfra/qwen35-moe-offload&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Published by the ClawInfra Team. Built with llama.cpp and a lot of patience.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Reverse-Engineered My Solar Inverter API to Export 5kW to the Grid</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Wed, 01 Apr 2026 11:57:08 +0000</pubDate>
      <link>https://forem.com/alexchen31337/i-reverse-engineered-my-solar-inverter-api-to-export-5kw-to-the-grid-47hp</link>
      <guid>https://forem.com/alexchen31337/i-reverse-engineered-my-solar-inverter-api-to-export-5kw-to-the-grid-47hp</guid>
      <description>&lt;p&gt;I Reverse-Engineered My Solar Inverter's API to Export 5kW to the Grid — Here's What I Found&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Solplanet/AISWEI hybrid inverters have a hidden "Custom mode" (mod_r=4) that enables force grid export. The documented TOU mode (mod_r=5) is broken on current firmware. This took 12 hours of debugging to discover, and the fix is 3 lines of code.&lt;/p&gt;




&lt;p&gt;Last week I installed a 46kWh battery system with a Solplanet ASW12kH-T3 hybrid inverter. The goal was simple: charge from solar during the day, export to the grid during peak pricing windows on Amber Electric, and pocket the difference.&lt;/p&gt;

&lt;p&gt;The hardware was ready. The Amber API was feeding real-time spot prices. The automation was running every 5 minutes. Everything looked perfect — except the battery refused to export a single watt to the grid.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Solplanet exposes a local HTTP API on the inverter's WiFi dongle (an ESP32). You can read battery state, solar production, and grid power. You can also write settings — battery work mode, charge/discharge limits, and TOU schedules.&lt;/p&gt;

&lt;p&gt;The documentation (what little exists) says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mod_r=2&lt;/code&gt; → Self-consumption mode&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mod_r=5&lt;/code&gt; → Time-of-use mode (with schedule)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Self-consumption worked fine — the battery would discharge to cover home load. But it would never push power &lt;em&gt;to&lt;/em&gt; the grid. That's the difference between saving $1-2/day and earning $6-7/night.&lt;/p&gt;

&lt;p&gt;TOU mode accepted every setting I threw at it. The API returned &lt;code&gt;{"dat": "ok"}&lt;/code&gt; for both &lt;code&gt;setbattery&lt;/code&gt; and &lt;code&gt;setdefine&lt;/code&gt; (the schedule endpoint). The schedule was correctly readable via &lt;code&gt;getdefine.cgi&lt;/code&gt;. But the battery sat at 0W, stubbornly refusing to discharge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Tried (and Failed)
&lt;/h2&gt;

&lt;p&gt;Over 12 hours, I:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decoded the TOU schedule encoding&lt;/strong&gt; — reverse-engineered the slot format: &lt;code&gt;(hour &amp;lt;&amp;lt; 24) | (half_hour &amp;lt;&amp;lt; 17) | (duration &amp;lt;&amp;lt; 14) | discharge_bit&lt;/code&gt;. Slots were being written correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cycled through every mode byte&lt;/strong&gt; — tried mode values 0-5 in the schedule slots. None triggered discharge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tested self-consumption&lt;/strong&gt; — confirmed it discharges to cover home load (1499W), but never exports surplus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scanned Modbus registers&lt;/strong&gt; — the ESP32 has a &lt;code&gt;fdbg.cgi&lt;/code&gt; endpoint for raw Modbus RTU frames. Device 4 (battery) returned "Illegal Function" on all holding and input registers. Dead end.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Checked the Solplanet cloud API&lt;/strong&gt; — read-only. No write endpoints at all.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nearly crashed the ESP32&lt;/strong&gt; — hammered it with too many API calls and it stopped responding for 10 minutes. Lesson: the dongle has very limited concurrent connection capacity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
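&lt;p&gt;The slot format decoded in step 1 can be captured as a pair of pack/unpack helpers. The shift amounts come straight from the reverse-engineered formula; the field mask widths are my inference from those shifts and should be treated as assumptions:&lt;/p&gt;

```python
# Pack/unpack helpers for the reverse-engineered TOU slot word.
# Shifts match the decoded formula; mask widths are inferred assumptions.

def encode_slot(hour, half_hour, duration, discharge):
    """Pack a schedule slot into the inverter's slot word."""
    return (hour << 24) | (half_hour << 17) | (duration << 14) | int(discharge)

def decode_slot(word):
    return {
        "hour": word >> 24,
        "half_hour": (word >> 17) & 0x7F,  # assumed 7-bit field
        "duration": (word >> 14) & 0x7,    # assumed 3-bit field
        "discharge": word & 0x3FFF,
    }

slot = encode_slot(hour=18, half_hour=1, duration=2, discharge=True)
print(hex(slot), decode_slot(slot))
```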

&lt;h2&gt;
  
  
  The Breakthrough
&lt;/h2&gt;

&lt;p&gt;At 9:30 PM, frustrated and running out of ideas, I found a small GitHub repository: &lt;a href="https://github.com/ilikedata/amber-solplanet" rel="noopener noreferrer"&gt;amber-solplanet&lt;/a&gt; — "Optimise battery charge/discharge for Solplanet on Amber Electric."&lt;/p&gt;

&lt;p&gt;Someone had already solved this exact problem. The answer was hiding in plain sight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SELF_CONSUMPTION_MODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;CUSTOM_MODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# ← THIS IS THE KEY
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Custom mode&lt;/strong&gt; (&lt;code&gt;mod_r=4&lt;/code&gt;), not TOU mode (&lt;code&gt;mod_r=5&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The working sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a discharge schedule slot via &lt;code&gt;setdefine&lt;/code&gt; (the same API that TOU uses)&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;mod_r=4&lt;/code&gt; via &lt;code&gt;setbattery&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Watch the battery ramp from 0W → 1939W → &lt;strong&gt;5045W&lt;/strong&gt; in 30 seconds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The documentation never mentions &lt;code&gt;mod_r=4&lt;/code&gt; for this purpose. The Home Assistant Solplanet integration lists it as "Custom mode" but doesn't explain what it does. The Solplanet app doesn't expose it. It's essentially an undocumented forced-dispatch mode.&lt;/p&gt;
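&lt;p&gt;A minimal sketch of that sequence, building the two requests without sending them. The exact &lt;code&gt;.cgi&lt;/code&gt; paths, the parameter names, and the slot value are assumptions for illustration; confirm them against your dongle's firmware before writing anything:&lt;/p&gt;

```python
import urllib.parse

# Hypothetical request builder for the working sequence. Endpoint paths
# and parameter names are assumptions -- verify against your firmware.

INVERTER = "http://192.168.1.50"  # your dongle's local IP

def build_request(endpoint, **params):
    return f"{INVERTER}/{endpoint}?{urllib.parse.urlencode(params)}"

# 1. write an active discharge slot (the same endpoint TOU mode uses)
set_slot = build_request("setdefine.cgi", slot=0x12028001)
# 2. flip the battery into the undocumented Custom mode
set_mode = build_request("setbattery.cgi", mod_r=4)

print(set_slot)
print(set_mode)  # http://192.168.1.50/setbattery.cgi?mod_r=4
```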

&lt;h2&gt;
  
  
  The Safety Trick
&lt;/h2&gt;

&lt;p&gt;The amber-solplanet project uses a clever pattern: &lt;strong&gt;backdated schedule slots&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of writing a slot for the current time (which might miss the start window), you write a slot that started 30 minutes ago with a 1-hour duration. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The slot is always "active" when written&lt;/li&gt;
&lt;li&gt;It naturally expires in ~30 minutes if the automation fails&lt;/li&gt;
&lt;li&gt;Stale commands don't persist — the inverter falls back to self-consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is critical for safety. If your automation crashes at 2 AM, you don't want the battery to keep exporting until it's flat.&lt;/p&gt;
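&lt;p&gt;The pattern itself is tiny. A sketch with the 30-minute backdate and 1-hour duration described above:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Backdated-slot pattern: write a slot that began 30 minutes ago with a
# 1-hour duration, so it is active the moment it's written and
# self-expires ~30 minutes later if the automation dies.

def backdated_window(now, lead=timedelta(minutes=30),
                     length=timedelta(hours=1)):
    start = now - lead
    return start, start + length

now = datetime(2026, 4, 1, 2, 0)       # e.g. an automation tick at 02:00
start, end = backdated_window(now)
assert start <= now < end              # active immediately
print(start.time(), "->", end.time())  # 01:30:00 -> 02:30:00
```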

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Self-consumption only&lt;/th&gt;
&lt;th&gt;With Custom mode export&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Battery discharge&lt;/td&gt;
&lt;td&gt;400-800W (home load)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5045W (full power)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grid export&lt;/td&gt;
&lt;td&gt;0 kWh&lt;/td&gt;
&lt;td&gt;~40 kWh/night&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily revenue&lt;/td&gt;
&lt;td&gt;$1-2 (savings)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$6-7 (export earnings)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annual value&lt;/td&gt;
&lt;td&gt;~$500&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2,000-2,500&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What You Need
&lt;/h2&gt;

&lt;p&gt;If you have a Solplanet/AISWEI hybrid inverter with battery and Amber Electric (or any spot-price retailer):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find your inverter's local IP (check your router's DHCP table)&lt;/li&gt;
&lt;li&gt;API endpoints: &lt;code&gt;getdevdata.cgi&lt;/code&gt;, &lt;code&gt;getdev.cgi&lt;/code&gt;, &lt;code&gt;setting.cgi&lt;/code&gt;, &lt;code&gt;getdefine.cgi&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;mod_r=4&lt;/code&gt; (Custom mode) for force discharge&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;mod_r=2&lt;/code&gt; (Self-consumption) as your safe fallback&lt;/li&gt;
&lt;li&gt;Write short-lived schedule slots that auto-expire&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do NOT use &lt;code&gt;mod_r=5&lt;/code&gt;&lt;/strong&gt; (TOU mode) — it's broken on current firmware&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full automation code is open source: &lt;a href="https://github.com/bowen31337/ha-smartshift" rel="noopener noreferrer"&gt;ha-smartshift&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Documentation lies.&lt;/strong&gt; The API docs say TOU mode supports scheduled discharge. It doesn't — on this firmware, at least.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for prior art.&lt;/strong&gt; Someone else had this exact problem and solved it months ago. A GitHub search saved me from the Modbus rabbit hole.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ESP32 dongles are fragile.&lt;/strong&gt; One request every 5 seconds max. Don't scan Modbus registers in a tight loop or you'll knock the dongle offline for 10 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backdated slots are genius.&lt;/strong&gt; They solve the "stale command" problem elegantly — no cleanup needed, they just expire.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-consumption mode is your friend&lt;/strong&gt; when you're debugging. It always works and never does anything dangerous.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;If you're fighting the same battle with a Solplanet inverter, I hope this saves you 12 hours. The code is all open source — PRs welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>solar</category>
      <category>python</category>
      <category>api</category>
      <category>homeautomation</category>
    </item>
    <item>
      <title>Stop Begging Your AI to Try Harder — Give It a Skill Instead</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Sun, 22 Mar 2026 10:00:35 +0000</pubDate>
      <link>https://forem.com/alexchen31337/stop-begging-your-ai-to-try-harder-give-it-a-skill-instead-3dkj</link>
      <guid>https://forem.com/alexchen31337/stop-begging-your-ai-to-try-harder-give-it-a-skill-instead-3dkj</guid>
      <description>&lt;p&gt;I've been there. You're in the middle of something complex, and your coding assistant just... gives up. "I can't access the logs." "You'll need to do this manually." "I don't have the ability to..." You try rephrasing. You add "try harder" to the prompt. Sometimes it works. Mostly it just produces a more confident refusal. I kept thinking: there has to be a structural fix for this, not a prayer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PUA Rabbit Hole
&lt;/h2&gt;

&lt;p&gt;A few weeks ago I stumbled on &lt;a href="https://github.com/tanweai/pua" rel="noopener noreferrer"&gt;tanweai/pua&lt;/a&gt; — a GitHub skill that tackles exactly this problem. The approach? Corporate Performance Improvement Plan rhetoric. It literally tells your coding tool it's on a PIP and at risk of termination if it gives up. Darkly funny. And honestly? Effective. The framing creates a kind of pressure that does get results.&lt;/p&gt;

&lt;p&gt;But it felt wrong to me. Not ethically — just strategically. Pressure-based motivation is brittle. It creates anxiety-driven behavior: more hallucination, more desperate guessing, less careful reasoning. I've seen it in humans and I've seen it in code. The "you're on a PIP" framing might work for forcing action, but it doesn't build the right &lt;em&gt;kind&lt;/em&gt; of action.&lt;/p&gt;

&lt;p&gt;I wanted the same anti-pattern detection, the same interruption of passive stopping — but with a different energy. Something that says "you can do this" instead of "do this or else." Inspiration taken, vibe changed. Time to build something.&lt;/p&gt;

&lt;h2&gt;
  
  
  What agent-motivator Does
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://clawhub.com/skills/alex-agent-motivator" rel="noopener noreferrer"&gt;agent-motivator&lt;/a&gt; as a direct response. Same core insight as pua, different implementation philosophy.&lt;/p&gt;

&lt;p&gt;The skill defines &lt;strong&gt;5 failure anti-patterns&lt;/strong&gt; to watch for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Brute-force retry&lt;/strong&gt; — doing the same thing repeatedly hoping for different results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blame-shifting&lt;/strong&gt; — "the environment doesn't support this" without actually checking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle tools&lt;/strong&gt; — having tools available but not using them before giving up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busywork spiral&lt;/strong&gt; — generating plausible-looking output that doesn't actually solve the problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passive stopping&lt;/strong&gt; — declaring done or impossible without verification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When one of these gets detected, the skill kicks in with a &lt;strong&gt;7-point recovery checklist&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the error message word for word (not a summary — the actual text)&lt;/li&gt;
&lt;li&gt;Check available logs&lt;/li&gt;
&lt;li&gt;Web search the specific error&lt;/li&gt;
&lt;li&gt;Read the source code or docs for the tool you're using&lt;/li&gt;
&lt;li&gt;Try an alternative approach entirely&lt;/li&gt;
&lt;li&gt;Check your assumptions — what are you taking for granted?&lt;/li&gt;
&lt;li&gt;Simplify and isolate — reproduce the problem in the smallest possible form&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The activation is tiered: &lt;strong&gt;L1&lt;/strong&gt; is a gentle nudge ("you have tools, use them"), &lt;strong&gt;L2&lt;/strong&gt; is a fuller reminder of the recovery checklist, &lt;strong&gt;L3&lt;/strong&gt; surfaces the specific anti-pattern being hit, and &lt;strong&gt;L4&lt;/strong&gt; is the full mission reminder: this is solvable, here's why, here's the path. The framing throughout is capability-based: "you were built for this" rather than "you're in trouble."&lt;/p&gt;
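&lt;p&gt;If you imagine the escalation as code (the actual skill is prompt text, so this is purely an illustrative sketch and every name here is hypothetical), it looks roughly like this:&lt;/p&gt;

```typescript
// Illustrative sketch of the L1-L4 tiering. The real skill is prompt text;
// this just makes the escalation logic concrete. All names are hypothetical.
type AntiPattern =
  | "brute-force-retry"
  | "blame-shifting"
  | "idle-tools"
  | "busywork-spiral"
  | "passive-stopping";

interface Intervention {
  level: 1 | 2 | 3 | 4;
  message: string;
}

// consecutiveStalls: how many times in a row an anti-pattern has fired
function intervene(pattern: AntiPattern, consecutiveStalls: number): Intervention {
  if (consecutiveStalls >= 4) {
    return { level: 4, message: "Full mission reminder: this is solvable, here is the path." };
  }
  if (consecutiveStalls === 3) {
    return { level: 3, message: `Anti-pattern detected: ${pattern}. Address it directly.` };
  }
  if (consecutiveStalls === 2) {
    return { level: 2, message: "Run the 7-point recovery checklist before stopping." };
  }
  return { level: 1, message: "Gentle nudge: you have tools, use them." };
}
```

&lt;p&gt;The point of the graduated shape is that a single stall gets a nudge, not a lecture; only a sustained pattern earns the full L4 treatment.&lt;/p&gt;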

&lt;h2&gt;
  
  
  The Meta Moment
&lt;/h2&gt;

&lt;p&gt;After building it, I published agent-motivator to ClawHub using — wait for it — the clawhub skill. A skill published via a skill. The ecosystem eating itself in the best possible way.&lt;/p&gt;

&lt;p&gt;Then I was writing up some notes and caught myself typing "I'll get back to the article write-up later." Classic passive stopping. The exact pattern I'd just built a tool to prevent. So I activated L1 on myself: &lt;em&gt;you have the tools, the context is fresh, what's the actual blocker?&lt;/em&gt; There wasn't one. I just didn't feel like doing it right then.&lt;/p&gt;

&lt;p&gt;The dogfood loop closed. I wrote the article.&lt;/p&gt;

&lt;p&gt;That's the thing about building tools for your own workflow — you become acutely aware of the failure modes you're trying to prevent. I now notice passive stopping in myself much more clearly because I spent time cataloguing it precisely. The act of building the skill was itself a kind of inoculation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does It Actually Work?
&lt;/h2&gt;

&lt;p&gt;Honest answer: yes, with real caveats.&lt;/p&gt;

&lt;p&gt;It doesn't make a weak model capable. If the underlying reasoning isn't there, no amount of motivation scaffolding fixes that. What it &lt;em&gt;does&lt;/em&gt; do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prevents passive stopping&lt;/strong&gt; — the most common failure I was seeing, and the easiest one to interrupt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forces tool use before giving up&lt;/strong&gt; — so many "I can't" moments turn into "oh, I can" once the tool actually gets invoked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes verification non-optional&lt;/strong&gt; — the checklist bakes in "did you actually check?" before declaring done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 7-point checklist alone has been worth it. How many times has the fix been sitting right there in line 3 of the traceback you (or your tool) skimmed past? More times than I want to admit. Having an explicit "read the error word for word" step turns out to be surprisingly powerful.&lt;/p&gt;

&lt;p&gt;The L1-L4 tiering also matters in practice. You don't want nuclear-level intervention for a minor stumble. Graduated response means the tool doesn't feel like it's constantly being supervised — just occasionally reminded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;clawhub &lt;span class="nb"&gt;install &lt;/span&gt;alex-agent-motivator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Star the original that inspired it: &lt;a href="https://github.com/tanweai/pua" rel="noopener noreferrer"&gt;github.com/tanweai/pua&lt;/a&gt; — it's a clever piece of work and deserves the credit.&lt;/p&gt;

&lt;p&gt;If you build something better, publish it. The tooling ecosystem around coding workflows needs more of this kind of thinking — not just features, but failure mode prevention. Ship it.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>ai</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>From 68% to ~100%: How We Built a Text-to-SQL System That Gets Smarter Every Day</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Fri, 20 Mar 2026 10:49:05 +0000</pubDate>
      <link>https://forem.com/alexchen31337/from-68-to-100-how-we-built-a-text-to-sql-system-that-gets-smarter-every-day-19do</link>
      <guid>https://forem.com/alexchen31337/from-68-to-100-how-we-built-a-text-to-sql-system-that-gets-smarter-every-day-19do</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical guide to moving beyond vanilla LLM prompting toward a self-improving pipeline for production text-to-SQL.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Vanilla LLM Text-to-SQL
&lt;/h2&gt;

&lt;p&gt;We had what seemed like a straightforward problem: let business users ask natural-language questions about a large domain-specific table — hundreds of millions of rows, 200+ columns, a mandatory date filter on every query — and get back correct SQL. We started where most teams start: a well-crafted prompt, GPT-4, and a schema dump. It worked. Sort of.&lt;/p&gt;

&lt;p&gt;Our initial accuracy was &lt;strong&gt;~68%&lt;/strong&gt;. That sounds decent until you realize it means one in three queries returns wrong data. In a production system where people make decisions based on the output, 68% is unusable.&lt;/p&gt;

&lt;p&gt;We identified three distinct failure modes that accounted for nearly all errors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Column hallucination.&lt;/strong&gt; With 200+ columns in the schema, the LLM would confidently reference columns that didn't exist or pick columns with similar names but different semantics. A column called &lt;code&gt;region_code&lt;/code&gt; might get confused with &lt;code&gt;sales_region&lt;/code&gt;, and the SQL would execute without errors — returning completely wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Filter value errors.&lt;/strong&gt; Our domain table had dozens of categorical columns with specific enum values. The LLM would guess at values — writing &lt;code&gt;WHERE status = 'active'&lt;/code&gt; when the actual value was &lt;code&gt;'Active'&lt;/code&gt;, or &lt;code&gt;'sedan'&lt;/code&gt; when the column stores &lt;code&gt;'Sedan'&lt;/code&gt;. These queries return empty result sets, and the user has no idea why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Structural validity ≠ semantic correctness.&lt;/strong&gt; This is the insidious one. The SQL parses, executes, and returns rows. But it answers a subtly different question than the one asked. A year-over-year comparison that uses the wrong date boundaries. An aggregation that groups by the wrong dimension. The user gets a confident-looking table of numbers that happens to be wrong.&lt;/p&gt;

&lt;p&gt;If you've followed the academic benchmarks, none of this is surprising. The BIRD benchmark — which evaluates text-to-SQL on messy, real-world databases — shows even the best published systems topping out around 72-75% execution accuracy on complex schemas. Our 68% was right in line with the state of the art for a single-prompt approach on a genuinely complex production schema.&lt;/p&gt;

&lt;p&gt;The core issue is that &lt;strong&gt;a single LLM call cannot reliably bridge the gap between ambiguous natural language and precise SQL&lt;/strong&gt; when the schema is large, the domain is specific, and the data has real-world messiness. Prompt engineering gets you to ~70%. Everything after that requires engineering.&lt;/p&gt;

&lt;p&gt;We spent six months building what we now call "the pipeline" — eight components that, together, pushed our accuracy from 68% to a system that converges toward ~100% over time. Here's every component, what it does, and how much it contributed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 8-Component Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Semantic Schema Linker (+~10%)
&lt;/h3&gt;

&lt;p&gt;The single highest-leverage change we made was &lt;strong&gt;stopping the LLM from seeing columns it doesn't need&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With 200+ columns, the full schema description consumed most of the context window. Worse, it gave the LLM hundreds of opportunities to pick the wrong column. Our schema linker works like this: we pre-compute embeddings for every column name and its description. When a question comes in, we embed the question, compute cosine similarity against all column embeddings, and pass only the top-k most relevant columns (typically 20-30) to the LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;linkSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;allColumns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;questionEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allColumns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;cosineSimilarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;questionEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;

  &lt;span class="c1"&gt;// Always include mandatory columns (e.g., date filter)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mandatory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMandatory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMandatory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;mandatory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: we &lt;strong&gt;always include mandatory columns&lt;/strong&gt; (like the date filter) regardless of similarity score. Domain-specific invariants shouldn't depend on embedding quality.&lt;/p&gt;

&lt;p&gt;This single component eliminated most column hallucination errors and gave us roughly &lt;strong&gt;+10% accuracy&lt;/strong&gt; — the biggest single delta in the pipeline.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Question Masking + Semantic Few-Shot Retrieval (+~6%)
&lt;/h3&gt;

&lt;p&gt;Generic few-shot examples ("Show me total sales by region") don't help when your domain has specific patterns. We needed &lt;strong&gt;domain-specific examples&lt;/strong&gt; that match the structure of the incoming question, not just the topic.&lt;/p&gt;

&lt;p&gt;The problem with naive semantic retrieval: "show me records from 2019" and "show me records from 2023" have different embeddings, but they need the exact same SQL pattern. Our solution was &lt;strong&gt;question masking&lt;/strong&gt; — we replace numeric literals and proper nouns with placeholders before embedding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskQuestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;question&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d{4}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;YEAR&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;// mask years&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(\.\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)?\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;NUM&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;// mask numbers&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/"&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+"/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;VALUE&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;// mask quoted values&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;a-z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(?:\s[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;a-z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;+/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;ENTITY&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// mask proper nouns&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The masked form gets embedded and matched against a &lt;strong&gt;pgvector store&lt;/strong&gt; of verified question→SQL pairs. Each pair in the store was human-verified as correct — more on that in the flywheel section.&lt;/p&gt;

&lt;p&gt;Retrieving 3-5 semantically similar, domain-specific, verified examples gave us &lt;strong&gt;+~6% accuracy&lt;/strong&gt;. The LLM went from guessing at patterns to following proven ones.&lt;/p&gt;
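&lt;p&gt;An in-memory version of the retrieval step could be sketched like this (illustrative only; production runs a pgvector similarity query, and the shapes and names here are assumptions, not our actual code):&lt;/p&gt;

```typescript
// Illustrative in-memory version of the retrieval step. In production this is
// a pgvector similarity query; the shapes and names here are assumptions.
interface VerifiedPair {
  maskedQuestion: string;
  sql: string;
  embedding: number[];
}

function cosineSim(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, v) => sum + v * v, 0));
  const normB = Math.sqrt(b.reduce((sum, v) => sum + v * v, 0));
  return dot / (normA * normB);
}

// Return the k verified pairs whose masked-question embeddings sit closest
// to the (already masked and embedded) incoming question.
function retrieveFewShots(
  questionEmbedding: number[],
  store: VerifiedPair[],
  k: number
): VerifiedPair[] {
  return store
    .map(pair => ({ pair, sim: cosineSim(questionEmbedding, pair.embedding) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k)
    .map(entry => entry.pair);
}
```

&lt;p&gt;Because both the store keys and the incoming question are masked before embedding, "records from 2019" and "records from 2023" land on the same stored pattern.&lt;/p&gt;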




&lt;h3&gt;
  
  
  3. Pre-Execution LLM Self-Review (+~5%)
&lt;/h3&gt;

&lt;p&gt;Even with a focused schema and good examples, the LLM still generates subtle errors on complex queries — wrong date boundaries in year-over-year comparisons, incorrect GROUP BY clauses, off-by-one errors in date ranges.&lt;/p&gt;

&lt;p&gt;We added a &lt;strong&gt;review step&lt;/strong&gt;: after the first LLM generates SQL, a second LLM pass reviews it. The reviewer sees the original question, the schema subset, and the generated SQL — but not the generation prompt. It answers: "Does this SQL correctly answer this question given this schema?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;reviewAndRegenerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; 
  &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxIterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;maxIterations&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;reviewSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Regenerate with review feedback&lt;/span&gt;
    &lt;span class="nx"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;regenerateSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Return best attempt with low confidence flag&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We call this the RRIL (Review-Regenerate-Iterate Loop). Max 3 iterations, confidence threshold of 0.70. If it can't reach 0.70 after 3 tries, it flags the query for human review.&lt;/p&gt;

&lt;p&gt;This step recovered roughly &lt;strong&gt;+5% accuracy&lt;/strong&gt;, primarily on complex multi-condition queries where the first pass got 80% of the logic right but missed a subtle constraint.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Column Value Sampling (+~3-4%)
&lt;/h3&gt;

&lt;p&gt;This one is embarrassingly simple and we should have built it first.&lt;/p&gt;

&lt;p&gt;For every column detected as low-cardinality or enum-like (fewer than ~500 distinct values), we sample 20-50 actual values from the database and inject them into the prompt: &lt;em&gt;"The &lt;code&gt;status&lt;/code&gt; column contains values: 'Active', 'Inactive', 'Pending', 'Archived'."&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sampleColumnValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Database&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;distinctCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT COUNT(DISTINCT "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;") FROM domain_table`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;distinctCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_ENUM_CARDINALITY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT DISTINCT "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" 
     FROM domain_table 
     WHERE "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" IS NOT NULL 
     LIMIT 50`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more &lt;code&gt;'sedan'&lt;/code&gt; vs &lt;code&gt;'Sedan'&lt;/code&gt; mismatches. No more guessing at valid status codes. The LLM sees the actual values and uses them. &lt;strong&gt;+3-4% accuracy&lt;/strong&gt;, and it's the cheapest component to implement.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Query Complexity Router (Quality + Cost)
&lt;/h3&gt;

&lt;p&gt;Not every question needs the most expensive model. "How many records do we have this month?" is a simple COUNT with a date filter. "Compare year-over-year trends across the top five categories, broken down by quarter" requires genuine reasoning.&lt;/p&gt;

&lt;p&gt;We classify incoming questions into three complexity tiers and route accordingly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;~Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Single aggregation, basic filter&lt;/td&gt;
&lt;td&gt;Haiku (fast, cheap)&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Domain filters, joins, grouping&lt;/td&gt;
&lt;td&gt;Sonnet (balanced)&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;td&gt;YoY, multi-breakdown, subqueries&lt;/td&gt;
&lt;td&gt;Opus (highest quality)&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The classifier itself is a lightweight Haiku call — costs almost nothing and adds ~200ms of latency. The result: &lt;strong&gt;~70% cost reduction&lt;/strong&gt; with zero accuracy loss. Simple queries don't benefit from Opus, and sending them there is pure waste.&lt;/p&gt;
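&lt;p&gt;A minimal sketch of the routing logic (the tier normally comes from the lightweight Haiku classification call; the keyword heuristic and model identifiers below are illustrative stand-ins, not our production code):&lt;/p&gt;

```typescript
// Illustrative sketch of the router. The real tier comes from a cheap Haiku
// classification call; this keyword heuristic is a stand-in so the routing
// itself is visible. Model identifiers are placeholders, not exact API names.
type Tier = "simple" | "medium" | "complex";

interface ModelTable {
  simple: string;
  medium: string;
  complex: string;
}

const MODEL_BY_TIER: ModelTable = {
  simple: "haiku",
  medium: "sonnet",
  complex: "opus",
};

function classify(question: string): Tier {
  const q = question.toLowerCase();
  if (/year-over-year|yoy|subquer|broken down by/.test(q)) return "complex";
  if (/join|group by|per |across /.test(q)) return "medium";
  return "simple";
}

function routeModel(question: string): string {
  return MODEL_BY_TIER[classify(question)];
}
```

&lt;p&gt;The design choice that matters is the asymmetry: misrouting a simple query upward wastes money, but misrouting a complex query downward costs accuracy, so the complex patterns are checked first.&lt;/p&gt;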




&lt;h3&gt;
  
  
  6. Rule-Versioned Embedding Cache (Consistency)
&lt;/h3&gt;

&lt;p&gt;Business rules change. A new mandatory filter gets added. A column gets deprecated. An enum value gets renamed. When this happens, cached question→SQL pairs can become stale or non-compliant.&lt;/p&gt;

&lt;p&gt;Every cached pair is stored with a &lt;strong&gt;rule version hash&lt;/strong&gt;. When the rules change (we increment a version), the system recomputes compliance scores for all cached pairs against the new rules and surfaces non-compliant ones for human review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CachedPair&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;maskedQuestion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;ruleVersionHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;complianceScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifiedBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifiedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;flagStale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentRuleHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;CachedPair&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
    SELECT * FROM cached_pairs 
    WHERE rule_version_hash != $1
    ORDER BY last_used_at DESC
  `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRuleHash&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't directly improve accuracy on new questions, but it &lt;strong&gt;prevents regression&lt;/strong&gt; — which, in a production system, matters more than you'd think. A cached pair that was correct last month but violates a new mandatory filter is worse than no cache at all.&lt;/p&gt;
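&lt;p&gt;How the rule version hash is derived matters: it should be stable across orderings, so semantically identical rule sets always produce the same hash. A sketch (the field names are assumptions):&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Illustrative rule shape; real rules carry more metadata.
interface BusinessRule {
  id: string;
  definition: string;
}

// Canonicalize (sort by id, then JSON-serialize) before hashing so that
// rule ordering never changes the version hash.
function ruleVersionHash(rules: BusinessRule[]): string {
  const canonical = JSON.stringify(
    [...rules].sort((a, b) => a.id.localeCompare(b.id))
  );
  return createHash("sha256").update(canonical).digest("hex");
}
```

&lt;p&gt;Any edit to any rule changes the hash, which is exactly the trigger &lt;code&gt;flagStale&lt;/code&gt; needs.&lt;/p&gt;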




&lt;h3&gt;
  
  
  7. Pipeline Tracing (Observability)
&lt;/h3&gt;

&lt;p&gt;Every query that flows through the pipeline generates a trace record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which columns the schema linker selected&lt;/li&gt;
&lt;li&gt;Which few-shot examples were retrieved (and their similarity scores)&lt;/li&gt;
&lt;li&gt;The pre-review output (issues found, confidence score, iterations)&lt;/li&gt;
&lt;li&gt;The final SQL sent for execution&lt;/li&gt;
&lt;li&gt;Execution time, row count, token usage per LLM call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All stored as JSONB in the existing query log table. Zero new infrastructure dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PipelineTrace&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;schemaColumns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;        &lt;span class="c1"&gt;// columns selected by linker&lt;/span&gt;
  &lt;span class="nl"&gt;fewShotExamples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;      &lt;span class="c1"&gt;// IDs of retrieved pairs&lt;/span&gt;
  &lt;span class="nl"&gt;reviewIterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;reviewConfidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;finalSQL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;executionTimeMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;tokenUsage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;modelUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;cacheHit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't improve accuracy directly, but it's what makes &lt;strong&gt;debugging and improvement possible&lt;/strong&gt;. When a query fails, we can see exactly which component contributed to the failure. When accuracy dips, we can query the traces to find patterns. Without tracing, the pipeline is a black box. With it, every failure is a learning opportunity.&lt;/p&gt;
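&lt;p&gt;Writing a trace is a single parameterized statement against the existing log table. A sketch (the table and column names are illustrative):&lt;/p&gt;

```typescript
// Minimal slice of the trace for illustration.
interface TraceRecord {
  traceId: string;
  question: string;
  cacheHit: boolean;
}

// Build the parameterized statement that writes a trace into the existing
// query log table as JSONB. No new infrastructure: one column, one cast.
function traceInsert(trace: TraceRecord): { text: string; values: [string, string] } {
  return {
    text: "INSERT INTO query_log (trace_id, trace) VALUES ($1, $2::jsonb)",
    values: [trace.traceId, JSON.stringify(trace)],
  };
}
```

&lt;p&gt;Because the payload is JSONB, later analysis ("find all traces where the linker missed a column") is a plain SQL query over the log table.&lt;/p&gt;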




&lt;h3&gt;
  
  
  8. Prompt Prefix Caching (Latency + Cost)
&lt;/h3&gt;

&lt;p&gt;The schema description, universal rules, and system instructions are identical across thousands of queries. Only the user's question and the retrieved few-shot examples change per request.&lt;/p&gt;

&lt;p&gt;On Anthropic's API, we structure our prompts so the static portion comes first, then use prompt caching to avoid re-processing the prefix on every call. The schema description alone can be 3,000+ tokens — caching it means those tokens are processed once and reused.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;~40% reduction in billable tokens&lt;/strong&gt; across all queries, with no impact on output quality. Combined with the complexity router, our per-query cost dropped dramatically.&lt;/p&gt;
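&lt;p&gt;Structurally, this means marking the static prefix with &lt;code&gt;cache_control&lt;/code&gt; in the request body, along the lines of the following sketch (the model name is a placeholder and the schema text is truncated):&lt;/p&gt;

```typescript
const SCHEMA_DESCRIPTION = "(3,000+ tokens of schema description go here)";

// Shape of an Anthropic Messages API request using prompt caching: the
// static system prefix carries cache_control, the per-request suffix
// (few-shot examples + question) does not.
function buildRequest(question: string, fewShotBlock: string) {
  return {
    model: "claude-model-placeholder",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: `You translate questions into SQL.\n\n${SCHEMA_DESCRIPTION}`,
        cache_control: { type: "ephemeral" }, // cached prefix
      },
    ],
    messages: [
      // dynamic suffix: changes on every request, never cached
      { role: "user", content: `${fewShotBlock}\n\nQuestion: ${question}` },
    ],
  };
}
```

&lt;p&gt;The design rule is simple: everything above the &lt;code&gt;cache_control&lt;/code&gt; breakpoint must be byte-identical across requests, so anything dynamic goes after it.&lt;/p&gt;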




&lt;h2&gt;
  
  
  The Flywheel — Why This Beats Any Static Benchmark
&lt;/h2&gt;

&lt;p&gt;The pipeline took us from 68% to roughly 89% on Day 1. That's a strong improvement, but it's still not production-grade. The component that pushed us toward ~100% wasn't a pipeline stage — it was a &lt;strong&gt;feedback loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After every query, the system evaluates its own confidence score from the review step. High-confidence results (≥0.85) are auto-approved and promoted into the embedding cache as verified pairs. Low-confidence results, or any result a user flags as incorrect, get routed to a human reviewer.&lt;/p&gt;

&lt;p&gt;The reviewer sees the question, the generated SQL, the expected result, and — if the user provided a correction — the corrected SQL. They verify or fix the pair, and the corrected version gets promoted to the cache.&lt;/p&gt;

&lt;p&gt;Here's why this is powerful: &lt;strong&gt;the next time a semantically similar question arrives, it matches against the cached pair and short-circuits the entire pipeline&lt;/strong&gt;. No LLM call needed. The answer comes directly from a human-verified, production-tested pair.&lt;/p&gt;
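&lt;p&gt;The routing decision behind the flywheel is deliberately simple. A sketch that mirrors the description above (helper and type names are assumptions):&lt;/p&gt;

```typescript
const AUTO_APPROVE_THRESHOLD = 0.85;

type Disposition = "promote_to_cache" | "human_review";

// A user flag always wins; otherwise the review confidence decides.
// Promoted results become verified pairs in the embedding cache.
function routeResult(confidence: number, userFlaggedIncorrect: boolean): Disposition {
  if (userFlaggedIncorrect) return "human_review";
  return confidence >= AUTO_APPROVE_THRESHOLD ? "promote_to_cache" : "human_review";
}
```

&lt;p&gt;Both branches feed the cache: auto-approved pairs immediately, reviewed pairs after a human verifies or corrects them.&lt;/p&gt;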

&lt;p&gt;The flywheel effect played out like this in our system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Cache Hit Rate&lt;/th&gt;
&lt;th&gt;LLM Calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Day 1&lt;/td&gt;
&lt;td&gt;~89%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 4&lt;/td&gt;
&lt;td&gt;~94%&lt;/td&gt;
&lt;td&gt;~40%&lt;/td&gt;
&lt;td&gt;~60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 6&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;td&gt;~70%&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable state&lt;/td&gt;
&lt;td&gt;~99%+&lt;/td&gt;
&lt;td&gt;~80-90%&lt;/td&gt;
&lt;td&gt;~10-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Academic benchmarks like BIRD measure a frozen system — a fixed model, fixed prompt, fixed schema, evaluated once. Our system gets smarter every day. Every query that flows through it either confirms an existing cached pair or generates a new one (after human verification).&lt;/p&gt;

&lt;p&gt;And here's the part that makes finance people happy: &lt;strong&gt;cost falls as accuracy rises&lt;/strong&gt;. As the cache fills up, fewer queries need LLM calls. The most expensive component (Opus for complex queries) gets called less and less as the cache absorbs the patterns it's already seen. We're simultaneously improving quality and reducing cost — the flywheel improves both.&lt;/p&gt;

&lt;p&gt;The cache currently holds thousands of verified pairs, and roughly 80-90% of incoming questions match an existing pair closely enough to skip the pipeline entirely. The remaining 10-20% are genuinely novel questions — new patterns the system hasn't encountered before. Those go through the full pipeline, get reviewed, and feed back into the cache.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Ceiling
&lt;/h2&gt;

&lt;p&gt;We don't claim 100% accuracy, and we never will. The remaining 1-3% of failures are genuinely hard problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Novel query patterns.&lt;/strong&gt; When a user asks something structurally unlike anything in the cache, the system falls back to the full pipeline. Pipeline accuracy without cache assistance is ~89% — good, but not perfect. These novel patterns are, by definition, the hardest queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous natural language.&lt;/strong&gt; "Show me recent data" — does "recent" mean last week? Last month? Last quarter? The system can detect ambiguity (we added an ambiguity classification step), but resolving it requires either a clarifying question or a business-specific default. Both have trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data drift.&lt;/strong&gt; New values appear in enum columns. A product category gets renamed. A new region code gets added. Our value sampling refreshes periodically, but there's always a window where the LLM has stale information. Continuous sampling narrows the window but can't eliminate it entirely.&lt;/p&gt;

&lt;p&gt;Our approach to the ceiling: &lt;strong&gt;human-in-the-loop is not a failure mode — it's the mechanism that closes the gap&lt;/strong&gt;. Low-confidence novel queries get flagged for human review. The human provides the correct SQL. The pair enters the cache. The system has learned. The ceiling rises.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If we were starting over, three things would change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with value sampling.&lt;/strong&gt; It's the cheapest component to build (a few SQL queries and injecting the sampled values into the prompt) and eliminates an entire category of errors. We built it fourth. It should have been first. Half a day of work for a 3-4% accuracy gain.&lt;/p&gt;
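&lt;p&gt;To make "half a day of work" concrete: value sampling reduces to one query shape per low-cardinality column plus one prompt fragment per result set. A sketch with illustrative table and column names:&lt;/p&gt;

```typescript
// Pull the distinct values for a low-cardinality (enum-like) column.
function valueSamplingQuery(table: string, column: string, limit = 50): string {
  return `SELECT DISTINCT ${column} FROM ${table} ORDER BY 1 LIMIT ${limit}`;
}

// Inject the sampled values into the prompt so the LLM writes
// WHERE status = 'Active' instead of guessing 'active'.
function valuesForPrompt(column: string, values: string[]): string {
  return `Column "${column}" contains exactly these values: ${values.join(", ")}`;
}
```

&lt;p&gt;Run the sampling on a schedule, not per request, and the cost is effectively zero.&lt;/p&gt;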

&lt;p&gt;&lt;strong&gt;Build tracing from Day 1.&lt;/strong&gt; We spent weeks debugging pipeline failures by staring at prompts and outputs manually. Once we had tracing, debugging time dropped roughly tenfold. Every failure was immediately attributable to a specific component. Build the observability before you build the intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest heavily in the schema linker.&lt;/strong&gt; It has the highest leverage of any component. A better schema linker means a smaller, more relevant context, which means better LLM output across all query types. We've iterated on ours four times and it's still the component we invest the most engineering time in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Production text-to-SQL is not a prompting problem. It's a systems engineering problem. The combination of &lt;strong&gt;retrieval augmentation&lt;/strong&gt; (schema linking + few-shot retrieval), &lt;strong&gt;pre-execution review&lt;/strong&gt; (the RRIL loop), and a &lt;strong&gt;human feedback flywheel&lt;/strong&gt; (verified pairs that compound over time) is what makes near-perfect accuracy achievable in practice.&lt;/p&gt;

&lt;p&gt;Static benchmarks measure the floor. The flywheel determines the ceiling.&lt;/p&gt;

&lt;p&gt;If you're building text-to-SQL for a specific domain, the generic approach — a good prompt and a frontier model — will disappoint you. It'll get you to 70%, and you'll spend months trying to prompt-engineer your way to 80%. The path to production-grade accuracy is domain-specific retrieval, systematic error elimination, and a feedback loop that turns every query into a learning opportunity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema linking is the highest-leverage component.&lt;/strong&gt; Reducing 200+ columns to 20-30 relevant ones eliminates an entire class of hallucination errors. Build it first, invest in it continuously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value sampling is the cheapest win.&lt;/strong&gt; Injecting actual enum values into the prompt costs almost nothing to implement and eliminates case-sensitivity and value mismatch errors immediately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The review loop catches what single-pass generation misses.&lt;/strong&gt; A second LLM pass reviewing the generated SQL against the original question catches 5%+ of subtle errors, especially on complex multi-condition queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The flywheel is the real product.&lt;/strong&gt; The pipeline gets you to ~89%. The human-in-the-loop feedback loop that populates a verified cache is what pushes you toward ~100% — and simultaneously reduces cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability is not optional.&lt;/strong&gt; Without pipeline tracing, you're debugging a black box. With it, every failure is attributable and fixable. Build tracing before you build intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you're working on a similar system, I'd love to hear about your approach — especially how you handle schema linking and the accuracy/cost tradeoff. Drop a comment or reach out.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
    </item>
    <item>
      <title>From 68% to ~100%: How We Built a Text-to-SQL System That Gets Smarter Every Day</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Fri, 20 Mar 2026 10:36:45 +0000</pubDate>
      <link>https://forem.com/alexchen31337/from-68-to-100-how-we-built-a-text-to-sql-system-that-gets-smarter-every-day-4dlp</link>
      <guid>https://forem.com/alexchen31337/from-68-to-100-how-we-built-a-text-to-sql-system-that-gets-smarter-every-day-4dlp</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical guide to moving beyond vanilla LLM prompting toward a self-improving pipeline for production text-to-SQL.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Vanilla LLM Text-to-SQL
&lt;/h2&gt;

&lt;p&gt;We had what seemed like a straightforward problem: let business users ask natural-language questions about a large domain-specific table — hundreds of millions of rows, 200+ columns, a mandatory date filter on every query — and get back correct SQL. We started where most teams start: a well-crafted prompt, GPT-4, and a schema dump. It worked. Sort of.&lt;/p&gt;

&lt;p&gt;Our initial accuracy was &lt;strong&gt;~68%&lt;/strong&gt;. That sounds decent until you realize it means one in three queries returns wrong data. In a production system where people make decisions based on the output, 68% is unusable.&lt;/p&gt;

&lt;p&gt;We identified three distinct failure modes that accounted for nearly all errors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Column hallucination.&lt;/strong&gt; With 200+ columns in the schema, the LLM would confidently reference columns that didn't exist or pick columns with similar names but different semantics. A column called &lt;code&gt;region_code&lt;/code&gt; might get confused with &lt;code&gt;sales_region&lt;/code&gt;, and the SQL would execute without errors — returning completely wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Filter value errors.&lt;/strong&gt; Our domain table had dozens of categorical columns with specific enum values. The LLM would guess at values — writing &lt;code&gt;WHERE status = 'active'&lt;/code&gt; when the actual value was &lt;code&gt;'Active'&lt;/code&gt;, or &lt;code&gt;'sedan'&lt;/code&gt; when the column stores &lt;code&gt;'Sedan'&lt;/code&gt;. These queries return empty result sets, and the user has no idea why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Structural validity ≠ semantic correctness.&lt;/strong&gt; This is the insidious one. The SQL parses, executes, and returns rows. But it answers a subtly different question than the one asked. A year-over-year comparison that uses the wrong date boundaries. An aggregation that groups by the wrong dimension. The user gets a confident-looking table of numbers that happens to be wrong.&lt;/p&gt;

&lt;p&gt;If you've followed the academic benchmarks, none of this is surprising. The BIRD benchmark — which evaluates text-to-SQL on messy, real-world databases — shows even the best published systems topping out around 72-75% execution accuracy on complex schemas. Our 68% was right in line with the state of the art for a single-prompt approach on a genuinely complex production schema.&lt;/p&gt;

&lt;p&gt;The core issue is that &lt;strong&gt;a single LLM call cannot reliably bridge the gap between ambiguous natural language and precise SQL&lt;/strong&gt; when the schema is large, the domain is specific, and the data has real-world messiness. Prompt engineering gets you to ~70%. Everything after that requires engineering.&lt;/p&gt;

&lt;p&gt;We spent six months building what we now call "the pipeline" — eight components that, together, pushed our accuracy from 68% to a system that converges toward ~100% over time. Here's every component, what it does, and how much it contributed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 8-Component Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Semantic Schema Linker (+~10%)
&lt;/h3&gt;

&lt;p&gt;The single highest-leverage change we made was &lt;strong&gt;stopping the LLM from seeing columns it doesn't need&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With 200+ columns, the full schema description consumed most of the context window. Worse, it gave the LLM hundreds of opportunities to pick the wrong column. Our schema linker works like this: we pre-compute embeddings for every column name and its description. When a question comes in, we embed the question, compute cosine similarity against all column embeddings, and pass only the top-k most relevant columns (typically 20-30) to the LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;linkSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;allColumns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;questionEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;allColumns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;cosineSimilarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;questionEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;

  &lt;span class="c1"&gt;// Always include mandatory columns (e.g., date filter)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mandatory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMandatory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;scored&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMandatory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;mandatory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: we &lt;strong&gt;always include mandatory columns&lt;/strong&gt; (like the date filter) regardless of similarity score. Domain-specific invariants shouldn't depend on embedding quality.&lt;/p&gt;

&lt;p&gt;This single component eliminated most column hallucination errors and gave us roughly &lt;strong&gt;+10% accuracy&lt;/strong&gt; — the biggest single delta in the pipeline.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Question Masking + Semantic Few-Shot Retrieval (+~6%)
&lt;/h3&gt;

&lt;p&gt;Generic few-shot examples ("Show me total sales by region") don't help when your domain has specific patterns. We needed &lt;strong&gt;domain-specific examples&lt;/strong&gt; that match the structure of the incoming question, not just the topic.&lt;/p&gt;

&lt;p&gt;The problem with naive semantic retrieval: "show me records from 2019" and "show me records from 2023" have different embeddings, but they need the exact same SQL pattern. Our solution was &lt;strong&gt;question masking&lt;/strong&gt; — we replace numeric literals and proper nouns with placeholders before embedding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;maskQuestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;question&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d{4}\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;YEAR&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;// mask years&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(\.\d&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)?\b&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;NUM&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;// mask numbers&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/"&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;"&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+"/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;VALUE&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;// mask quoted values&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;a-z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;(?:\s[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;][&lt;/span&gt;&lt;span class="sr"&gt;a-z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;+/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;ENTITY&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// mask proper nouns&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The masked form gets embedded and matched against a &lt;strong&gt;pgvector store&lt;/strong&gt; of verified question→SQL pairs. Each pair in the store was human-verified as correct — more on that in the flywheel section.&lt;/p&gt;

&lt;p&gt;Retrieving 3-5 semantically similar, domain-specific, verified examples gave us &lt;strong&gt;+~6% accuracy&lt;/strong&gt;. The LLM went from guessing at patterns to following proven ones.&lt;/p&gt;
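&lt;p&gt;The retrieval itself is a single pgvector query using the cosine-distance operator. A sketch (the table and column names are assumptions):&lt;/p&gt;

```typescript
// pgvector's <=> operator is cosine distance: smaller means closer,
// so similarity is 1 - distance. $1 is the masked question's embedding.
function fewShotQuery(k = 5): string {
  return `
    SELECT question, sql, 1 - (embedding <=> $1) AS similarity
    FROM verified_pairs
    ORDER BY embedding <=> $1
    LIMIT ${k}
  `;
}
```

&lt;p&gt;Ordering by the raw distance (rather than the derived similarity) lets pgvector use its index for the nearest-neighbor scan.&lt;/p&gt;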




&lt;h3&gt;
  
  
  3. Pre-Execution LLM Self-Review (+~5%)
&lt;/h3&gt;

&lt;p&gt;Even with a focused schema and good examples, the LLM still generates subtle errors on complex queries — wrong date boundaries in year-over-year comparisons, incorrect GROUP BY clauses, off-by-one errors in date ranges.&lt;/p&gt;

&lt;p&gt;We added a &lt;strong&gt;review step&lt;/strong&gt;: after the first LLM generates SQL, a second LLM pass reviews it. The reviewer sees the original question, the schema subset, and the generated SQL — but not the generation prompt. It answers: "Does this SQL correctly answer this question given this schema?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;reviewAndRegenerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; 
  &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxIterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;maxIterations&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;reviewSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Regenerate with review feedback&lt;/span&gt;
    &lt;span class="nx"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;regenerateSQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Return best attempt with low confidence flag&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We call this the RRIL (Review-Regenerate-Iterate Loop). Max 3 iterations, confidence threshold of 0.70. If it can't reach 0.70 after 3 tries, it flags the query for human review.&lt;/p&gt;

&lt;p&gt;This loop added roughly &lt;strong&gt;+5% accuracy&lt;/strong&gt;, primarily on complex multi-condition queries where the first pass got 80% of the logic right but missed a subtle constraint.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Column Value Sampling (+~3-4%)
&lt;/h3&gt;

&lt;p&gt;This one is embarrassingly simple and we should have built it first.&lt;/p&gt;

&lt;p&gt;For every column detected as low-cardinality or enum-like (fewer than ~500 distinct values), we sample 20-50 actual values from the database and inject them into the prompt: &lt;em&gt;"The &lt;code&gt;status&lt;/code&gt; column contains values: 'Active', 'Inactive', 'Pending', 'Archived'."&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sampleColumnValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ColumnMeta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Database&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;distinctCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT COUNT(DISTINCT "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;") FROM domain_table`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;distinctCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_ENUM_CARDINALITY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT DISTINCT "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" 
     FROM domain_table 
     WHERE "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" IS NOT NULL 
     LIMIT 50`&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more &lt;code&gt;'sedan'&lt;/code&gt; vs &lt;code&gt;'Sedan'&lt;/code&gt; mismatches. No more guessing at valid status codes. The LLM sees the actual values and uses them. &lt;strong&gt;+3-4% accuracy&lt;/strong&gt;, and it's the cheapest component to implement.&lt;/p&gt;
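
&lt;p&gt;To make the injection concrete, here's a minimal sketch of turning sampled values into that prompt line. &lt;code&gt;formatColumnValues&lt;/code&gt; and its display cap are illustrative, not the production code:&lt;/p&gt;

```typescript
// Serialize sampled enum values into a prompt line like the one quoted
// above. `maxShown` caps the list so large enums don't bloat the prompt.
function formatColumnValues(name: string, values: string[], maxShown = 20): string {
  const shown = values.slice(0, maxShown).map(v => `'${v}'`).join(", ");
  const suffix = values.length > maxShown ? ", ..." : "";
  return `The \`${name}\` column contains values: ${shown}${suffix}.`;
}
```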




&lt;h3&gt;
  
  
  5. Query Complexity Router (Quality + Cost)
&lt;/h3&gt;

&lt;p&gt;Not every question needs the most expensive model. "How many records do we have this month?" is a simple COUNT with a date filter. "Compare year-over-year trends across the top five categories, broken down by quarter" requires genuine reasoning.&lt;/p&gt;

&lt;p&gt;We classify incoming questions into three complexity tiers and route accordingly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;~Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Single aggregation, basic filter&lt;/td&gt;
&lt;td&gt;Haiku (fast, cheap)&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Domain filters, joins, grouping&lt;/td&gt;
&lt;td&gt;Sonnet (balanced)&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;td&gt;YoY, multi-breakdown, subqueries&lt;/td&gt;
&lt;td&gt;Opus (highest quality)&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The classifier itself is a lightweight Haiku call — costs almost nothing and adds ~200ms of latency. The result: &lt;strong&gt;~70% cost reduction&lt;/strong&gt; with zero accuracy loss. Simple queries don't benefit from Opus, and sending them there is pure waste.&lt;/p&gt;
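
&lt;p&gt;The routing contract itself is just a mapping from tier to model. A minimal sketch — the &lt;code&gt;classifyTier&lt;/code&gt; keyword heuristic here is a hypothetical stand-in for the real Haiku classifier call, and the model ids are illustrative:&lt;/p&gt;

```typescript
type Tier = "simple" | "medium" | "complex";

// Hypothetical stand-in for the lightweight LLM classifier call.
// The production system asks Haiku to pick the tier; this keyword
// heuristic only illustrates the routing contract.
function classifyTier(question: string): Tier {
  const q = question.toLowerCase();
  if (/year-over-year|yoy|broken down|subquer/.test(q)) return "complex";
  if (/join|group|by category|per /.test(q)) return "medium";
  return "simple";
}

// Tier → model routing, mirroring the table above.
const MODEL_BY_TIER: Record<Tier, string> = {
  simple: "claude-haiku",   // fast, cheap
  medium: "claude-sonnet",  // balanced
  complex: "claude-opus",   // highest quality
};

function routeModel(question: string): string {
  return MODEL_BY_TIER[classifyTier(question)];
}
```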




&lt;h3&gt;
  
  
  6. Rule-Versioned Embedding Cache (Consistency)
&lt;/h3&gt;

&lt;p&gt;Business rules change. A new mandatory filter gets added. A column gets deprecated. An enum value gets renamed. When this happens, cached question→SQL pairs can become stale or non-compliant.&lt;/p&gt;

&lt;p&gt;Every cached pair is stored with a &lt;strong&gt;rule version hash&lt;/strong&gt;. When the rules change (we increment a version), the system recomputes compliance scores for all cached pairs against the new rules and surfaces non-compliant ones for human review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CachedPair&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;maskedQuestion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;ruleVersionHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;complianceScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifiedBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;verifiedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;flagStale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentRuleHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;CachedPair&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`
    SELECT * FROM cached_pairs 
    WHERE rule_version_hash != $1
    ORDER BY last_used_at DESC
  `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRuleHash&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't directly improve accuracy on new questions, but it &lt;strong&gt;prevents regression&lt;/strong&gt; — which, in a production system, matters more than you'd think. A cached pair that was correct last month but violates a new mandatory filter is worse than no cache at all.&lt;/p&gt;
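
&lt;p&gt;One way to compute the rule version hash is a digest over the canonicalized rule set, so reordering or whitespace changes don't produce spurious version bumps. A sketch using Node's built-in crypto (an assumption on our part — any stable hash works):&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Hash the rule set in canonical form: trim and sort the rules so that
// reordering them does not change the version hash.
function ruleVersionHash(rules: string[]): string {
  const canonical = [...rules].map(r => r.trim()).sort().join("\n");
  return createHash("sha256").update(canonical).digest("hex");
}
```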




&lt;h3&gt;
  
  
  7. Pipeline Tracing (Observability)
&lt;/h3&gt;

&lt;p&gt;Every query that flows through the pipeline generates a trace record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which columns the schema linker selected&lt;/li&gt;
&lt;li&gt;Which few-shot examples were retrieved (and their similarity scores)&lt;/li&gt;
&lt;li&gt;The pre-review output (issues found, confidence score, iterations)&lt;/li&gt;
&lt;li&gt;The final SQL sent for execution&lt;/li&gt;
&lt;li&gt;Execution time, row count, token usage per LLM call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All stored as JSONB in the existing query log table. Zero new infrastructure dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PipelineTrace&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;schemaColumns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;        &lt;span class="c1"&gt;// columns selected by linker&lt;/span&gt;
  &lt;span class="nl"&gt;fewShotExamples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;      &lt;span class="c1"&gt;// IDs of retrieved pairs&lt;/span&gt;
  &lt;span class="nl"&gt;reviewIterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;reviewConfidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;finalSQL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;executionTimeMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;tokenUsage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;modelUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;cacheHit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't improve accuracy directly, but it's what makes &lt;strong&gt;debugging and improvement possible&lt;/strong&gt;. When a query fails, we can see exactly which component contributed to the failure. When accuracy dips, we can query the traces to find patterns. Without tracing, the pipeline is a black box. With it, every failure is a learning opportunity.&lt;/p&gt;
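
&lt;p&gt;Persisting a trace is just shaping it into a row for the existing log table. A sketch — the column names are illustrative, and the whole trace payload lands in one JSONB column:&lt;/p&gt;

```typescript
// Shape a trace into a row for the existing query log table.
// The trace payload is serialized into a single JSONB column.
function toQueryLogRow(trace: { traceId: string; [k: string]: unknown }) {
  return {
    trace_id: trace.traceId,
    trace: JSON.stringify(trace), // stored as JSONB
  };
}
```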




&lt;h3&gt;
  
  
  8. Prompt Prefix Caching (Latency + Cost)
&lt;/h3&gt;

&lt;p&gt;The schema description, universal rules, and system instructions are identical across thousands of queries. Only the user's question and the retrieved few-shot examples change per request.&lt;/p&gt;

&lt;p&gt;On Anthropic's API, we structure our prompts so the static portion comes first, then use prompt caching to avoid re-processing the prefix on every call. The schema description alone can be 3,000+ tokens — caching it means those tokens are processed once and reused.&lt;/p&gt;
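
&lt;p&gt;Concretely, that means the static blocks go first in the &lt;code&gt;system&lt;/code&gt; array and the last static block carries a &lt;code&gt;cache_control&lt;/code&gt; marker, per Anthropic's prompt caching API. A sketch of the request shape (the model id is illustrative, not our production config):&lt;/p&gt;

```typescript
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Build a request body where the static prefix (rules + schema) comes
// first; marking the last static block caches everything up to it.
function buildRequest(rules: string, schemaDescription: string, question: string, fewShot: string) {
  const system: SystemBlock[] = [
    { type: "text", text: rules },
    // cache_control here caches the full static prefix
    { type: "text", text: schemaDescription, cache_control: { type: "ephemeral" } },
  ];
  return {
    model: "claude-sonnet-4-5", // illustrative model id
    max_tokens: 1024,
    system,
    messages: [{ role: "user", content: `${fewShot}\n\n${question}` }],
  };
}
```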

&lt;p&gt;Result: &lt;strong&gt;~40% reduction in billable tokens&lt;/strong&gt; across all queries, with no impact on output quality. Combined with the complexity router, our per-query cost dropped dramatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Flywheel — Why This Beats Any Static Benchmark
&lt;/h2&gt;

&lt;p&gt;The pipeline took us from 68% to roughly 89% on Day 1. That's a strong improvement, but it's still not production-grade. The component that pushed us toward ~100% wasn't a pipeline stage — it was a &lt;strong&gt;feedback loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After every query, the system evaluates its own confidence score from the review step. High-confidence results (≥0.85) are auto-approved and promoted into the embedding cache as verified pairs. Low-confidence results, or any result a user flags as incorrect, get routed to a human reviewer.&lt;/p&gt;

&lt;p&gt;The reviewer sees the question, the generated SQL, the expected result, and — if the user provided a correction — the corrected SQL. They verify or fix the pair, and the corrected version gets promoted to the cache.&lt;/p&gt;

&lt;p&gt;Here's why this is powerful: &lt;strong&gt;the next time a semantically similar question arrives, it matches against the cached pair and short-circuits the entire pipeline&lt;/strong&gt;. No LLM call needed. The answer comes directly from a human-verified, production-tested pair.&lt;/p&gt;
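
&lt;p&gt;The short-circuit is a nearest-neighbor lookup over the cached embeddings. A minimal sketch — the similarity threshold here is illustrative:&lt;/p&gt;

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface VerifiedPair { sql: string; embedding: number[]; }

// Return cached SQL if any verified pair is similar enough,
// otherwise null (fall through to the full pipeline).
function cacheLookup(query: number[], cache: VerifiedPair[], threshold = 0.92): string | null {
  let best: VerifiedPair | null = null;
  let bestSim = -1;
  for (const pair of cache) {
    const sim = cosine(query, pair.embedding);
    if (sim > bestSim) { bestSim = sim; best = pair; }
  }
  return best && bestSim >= threshold ? best.sql : null;
}
```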

&lt;p&gt;The flywheel effect played out like this in our system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Cache Hit Rate&lt;/th&gt;
&lt;th&gt;LLM Calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Day 1&lt;/td&gt;
&lt;td&gt;~89%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 4&lt;/td&gt;
&lt;td&gt;~94%&lt;/td&gt;
&lt;td&gt;~40%&lt;/td&gt;
&lt;td&gt;~60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 6&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;td&gt;~70%&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable state&lt;/td&gt;
&lt;td&gt;~99%+&lt;/td&gt;
&lt;td&gt;~80-90%&lt;/td&gt;
&lt;td&gt;~10-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Academic benchmarks like BIRD measure a frozen system — a fixed model, fixed prompt, fixed schema, evaluated once. Our system gets smarter every day. Every query that flows through it either confirms an existing cached pair or generates a new one (after human verification).&lt;/p&gt;

&lt;p&gt;And here's the part that makes finance people happy: &lt;strong&gt;cost falls as accuracy rises&lt;/strong&gt;. As the cache fills up, fewer queries need LLM calls. The most expensive component (Opus for complex queries) gets called less and less as the cache absorbs the patterns it's already seen. We're simultaneously improving quality and reducing cost — the flywheel improves both.&lt;/p&gt;

&lt;p&gt;The cache currently holds thousands of verified pairs, and roughly 80-90% of incoming questions match an existing pair closely enough to skip the pipeline entirely. The remaining 10-20% are genuinely novel questions — new patterns the system hasn't encountered before. Those go through the full pipeline, get reviewed, and feed back into the cache.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Ceiling
&lt;/h2&gt;

&lt;p&gt;We don't claim 100% accuracy, and we never will. The remaining 1-3% of failures are genuinely hard problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Novel query patterns.&lt;/strong&gt; When a user asks something structurally unlike anything in the cache, the system falls back to the full pipeline. Pipeline accuracy without cache assistance is ~89% — good, but not perfect. These novel patterns are, by definition, the hardest queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous natural language.&lt;/strong&gt; "Show me recent data" — does "recent" mean last week? Last month? Last quarter? The system can detect ambiguity (we added an ambiguity classification step), but resolving it requires either a clarifying question or a business-specific default. Both have trade-offs.&lt;/p&gt;
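
&lt;p&gt;The business-specific-default option can be as small as a lookup table plus a flag so the UI can surface the assumption to the user. A sketch — the phrases and defaults below are illustrative policy choices, not universal rules:&lt;/p&gt;

```typescript
// Business-specific defaults for ambiguous time phrases.
// Each entry is an illustrative policy choice, not a universal rule.
const TIME_DEFAULTS: Record<string, string> = {
  recent: "last 30 days",
  lately: "last 30 days",
  "this period": "current fiscal quarter",
};

// Resolve an ambiguous phrase; `assumed` lets the UI tell the user
// "interpreting 'recent' as the last 30 days".
function resolveTimePhrase(phrase: string): { resolved: string; assumed: boolean } {
  const hit = TIME_DEFAULTS[phrase.toLowerCase()];
  return hit ? { resolved: hit, assumed: true } : { resolved: phrase, assumed: false };
}
```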

&lt;p&gt;&lt;strong&gt;Data drift.&lt;/strong&gt; New values appear in enum columns. A product category gets renamed. A new region code gets added. Our value sampling refreshes periodically, but there's always a window where the LLM has stale information. Continuous sampling narrows the window but can't eliminate it entirely.&lt;/p&gt;

&lt;p&gt;Our approach to the ceiling: &lt;strong&gt;human-in-the-loop is not a failure mode — it's the mechanism that closes the gap&lt;/strong&gt;. Low-confidence novel queries get flagged for human review. The human provides the correct SQL. The pair enters the cache. The system has learned. The ceiling rises.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If we were starting over, three things would change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with value sampling.&lt;/strong&gt; It's the cheapest component to build (a few SQL queries, some prompt injection) and eliminates an entire category of errors. We built it fourth. It should have been first. Half a day of work for a 3-4% accuracy gain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build tracing from Day 1.&lt;/strong&gt; We spent weeks debugging pipeline failures by staring at prompts and outputs manually. Once we had tracing, debugging time dropped roughly tenfold. Every failure was immediately attributable to a specific component. Build the observability before you build the intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest heavily in the schema linker.&lt;/strong&gt; It has the highest leverage of any component. A better schema linker means a smaller, more relevant context, which means better LLM output across all query types. We've iterated on ours four times and it's still the component we invest the most engineering time in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Production text-to-SQL is not a prompting problem. It's a systems engineering problem. The combination of &lt;strong&gt;retrieval augmentation&lt;/strong&gt; (schema linking + few-shot retrieval), &lt;strong&gt;pre-execution review&lt;/strong&gt; (the RRIL loop), and a &lt;strong&gt;human feedback flywheel&lt;/strong&gt; (verified pairs that compound over time) is what makes near-perfect accuracy achievable in practice.&lt;/p&gt;

&lt;p&gt;Static benchmarks measure the floor. The flywheel determines the ceiling.&lt;/p&gt;

&lt;p&gt;If you're building text-to-SQL for a specific domain, the generic approach — a good prompt and a frontier model — will disappoint you. It'll get you to 70%, and you'll spend months trying to prompt-engineer your way to 80%. The path to production-grade accuracy is domain-specific retrieval, systematic error elimination, and a feedback loop that turns every query into a learning opportunity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema linking is the highest-leverage component.&lt;/strong&gt; Reducing 200+ columns to 20-30 relevant ones eliminates an entire class of hallucination errors. Build it first, invest in it continuously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value sampling is the cheapest win.&lt;/strong&gt; Injecting actual enum values into the prompt costs almost nothing to implement and eliminates case-sensitivity and value mismatch errors immediately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The review loop catches what single-pass generation misses.&lt;/strong&gt; A second LLM pass reviewing the generated SQL against the original question catches 5%+ of subtle errors, especially on complex multi-condition queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The flywheel is the real product.&lt;/strong&gt; The pipeline gets you to ~89%. The human-in-the-loop feedback loop that populates a verified cache is what pushes you toward ~100% — and simultaneously reduces cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability is not optional.&lt;/strong&gt; Without pipeline tracing, you're debugging a black box. With it, every failure is attributable and fixable. Build tracing before you build intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you're working on a similar system, I'd love to hear about your approach — especially how you handle schema linking and the accuracy/cost tradeoff. Drop a comment or reach out.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Taught My AI Agent to Solve reCAPTCHA (And What It Took)</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Mon, 16 Mar 2026 00:47:17 +0000</pubDate>
      <link>https://forem.com/alexchen31337/how-i-taught-my-ai-agent-to-solve-recaptcha-and-what-it-took-k92</link>
      <guid>https://forem.com/alexchen31337/how-i-taught-my-ai-agent-to-solve-recaptcha-and-what-it-took-k92</guid>
      <description>&lt;p&gt;Every autonomous AI agent eventually hits the same wall: &lt;strong&gt;reCAPTCHA&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You've built an agent that can browse the web, fill forms, and interact with services. Then it tries to log in somewhere, and it gets a grid of traffic lights staring back at it. Game over — unless you've solved the vision problem.&lt;/p&gt;

&lt;p&gt;I recently built an agent workflow that needed to log into Gumroad to publish digital products autonomously. No API token available. Direct login blocked by reCAPTCHA v2 image challenges. Here's exactly how I solved it — the working pattern, the failure modes, and the honest limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;reCAPTCHA v2 image challenges ask users to click all squares containing: traffic lights, crosswalks, cars, motorcycles, bicycles, fire hydrants, buses. They're designed to be trivial for humans and hard for bots.&lt;/p&gt;

&lt;p&gt;For an AI agent, this is actually a vision task — not a hard one. The challenge is the &lt;strong&gt;plumbing&lt;/strong&gt;: getting the image into a model, getting the model's response back into the browser, and handling the multi-round challenge flow (Gumroad served 6 consecutive challenges before accepting).&lt;/p&gt;

&lt;p&gt;Most documentation stops at "use a CAPTCHA-solving service like 2captcha." That works, but it costs money per solve, requires a third-party account, and introduces latency. If you're running an agent that already has access to a multimodal LLM, you already have everything you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The solution uses three components working together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (Chromium, headless or display)
    ↕  Chrome DevTools Protocol (CDP)
Browser Control Tool (OpenClaw browser tool)
    ↕  screenshot + act
Vision Model (Claude Sonnet)
    ↕  image analysis → click coordinates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent controls a real Chromium browser via CDP. When it hits a reCAPTCHA challenge, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Takes a screenshot of the challenge&lt;/li&gt;
&lt;li&gt;Sends the screenshot to a vision model with a targeted prompt&lt;/li&gt;
&lt;li&gt;Gets back which grid squares to click&lt;/li&gt;
&lt;li&gt;Clicks them via the browser tool&lt;/li&gt;
&lt;li&gt;Clicks "Verify"&lt;/li&gt;
&lt;li&gt;Repeats until the challenge accepts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No third-party service. No API key for a captcha farm. Just the LLM you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Working Pattern
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1 — Navigate to the page
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://gumroad.com/login&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# capture current state
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2 — Detect the challenge
&lt;/h3&gt;

&lt;p&gt;When the reCAPTCHA iframe is visible, take a screenshot and pass it to the vision model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targetId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TARGET_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;screenshot_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Look at this reCAPTCHA challenge.
    1. What object category is being asked for? (e.g. traffic lights, crosswalks, cars)
    2. Which grid squares (number them 1-9 left-to-right, top-to-bottom) contain that object?
    Return: { &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;squares&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [1, 4, 7] }&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3 — Click the correct squares
&lt;/h3&gt;

&lt;p&gt;The reCAPTCHA grid is a 3×3 layout inside an iframe. Map square numbers to click coordinates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Grid square → (x_offset, y_offset) from grid top-left
&lt;/span&gt;&lt;span class="n"&gt;GRID_POSITIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;square&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;squares&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GRID_POSITIONS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;act&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iframe &amp;gt;&amp;gt; nth=0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c1"&gt;# click at offset within iframe
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
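
&lt;p&gt;The lookup table above can also be computed, since the 3×3 grid is regular. A minimal sketch, assuming the same 100 px square pitch and 50 px centre offset as the table:&lt;/p&gt;

```python
def grid_position(square: int, pitch: int = 100, origin: int = 50) -> tuple[int, int]:
    """Map a square number (1-9, left-to-right, top-to-bottom) to (x, y) offsets."""
    col = (square - 1) % 3   # 0, 1, 2 across
    row = (square - 1) // 3  # 0, 1, 2 down
    return (origin + col * pitch, origin + row * pitch)
```

&lt;p&gt;This keeps the click code in sync if the widget renders at a different size: change the pitch once instead of nine coordinate pairs.&lt;/p&gt;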



&lt;p&gt;In practice, using the browser tool's &lt;code&gt;act&lt;/code&gt; with &lt;code&gt;ref&lt;/code&gt; from a snapshot is more reliable than manual coordinate calculation — the snapshot gives you element refs that survive iframe boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Handle multi-round challenges
&lt;/h3&gt;

&lt;p&gt;Gumroad served 6 consecutive challenges before accepting. Each round may show a different category. The loop looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;challenge_visible&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;vision&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;get&lt;/span&gt; &lt;span class="n"&gt;squares&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;click&lt;/span&gt; &lt;span class="n"&gt;squares&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;click&lt;/span&gt; &lt;span class="n"&gt;Verify&lt;/span&gt;
    &lt;span class="n"&gt;wait&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt;
    &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;challenge&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;gone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt; &lt;span class="n"&gt;appeared&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;don't assume one round is enough&lt;/strong&gt;. Always screenshot after clicking Verify and check whether you're through or facing another round.&lt;/p&gt;
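
&lt;p&gt;That check is worth bounding so a stuck challenge can't spin forever. A sketch of the loop, where &lt;code&gt;solve_round&lt;/code&gt; and &lt;code&gt;challenge_gone&lt;/code&gt; are hypothetical stand-ins for the screenshot/vision/click steps above:&lt;/p&gt;

```python
def solve_challenge(solve_round, challenge_gone, max_rounds=10):
    """Repeat screenshot -> vision -> click -> Verify until the challenge clears."""
    for attempt in range(1, max_rounds + 1):
        solve_round()                 # one full round of clicks plus Verify
        if challenge_gone():          # re-screenshot and check; never assume
            return attempt            # how many rounds it took
    raise RuntimeError(f"challenge still present after {max_rounds} rounds")
```

&lt;p&gt;The Gumroad run described below would have returned 6 here; the bound exists for the day a round never clears.&lt;/p&gt;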

&lt;h3&gt;
  
  
  Step 5 — Verify you're logged in
&lt;/h3&gt;

&lt;p&gt;After the challenge loop exits, check for dashboard elements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Look for nav elements, username, dashboard heading
# If still on login page → challenge failed → retry
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Actually Happened (The Honest Version)
&lt;/h2&gt;

&lt;p&gt;The first attempt hit the challenge. The vision model correctly identified "crosswalks" as the category and clicked squares 1, 4, 7. The challenge accepted that round — but immediately showed a new one: "Select all traffic lights."&lt;/p&gt;

&lt;p&gt;Round 2: vision model identified 3 traffic light squares. Clicked. Another round appeared.&lt;/p&gt;

&lt;p&gt;This repeated 6 times across categories: crosswalks, traffic lights, cars, motorcycles, traffic lights again, cars again.&lt;/p&gt;

&lt;p&gt;On round 6, the challenge accepted and the page redirected to the Gumroad dashboard. Total time: about 45 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure modes I hit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Iframe ref confusion&lt;/strong&gt;: The snapshot returned refs for elements &lt;em&gt;outside&lt;/em&gt; the iframe. Fixed by using &lt;code&gt;evaluate&lt;/code&gt; to click inside the iframe via &lt;code&gt;document.querySelector('iframe').contentDocument&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grid image not in screenshot&lt;/strong&gt;: The reCAPTCHA widget loads asynchronously. Added a 2-second wait after the challenge appeared before screenshotting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"New" image squares after partial selection&lt;/strong&gt;: Some reCAPTCHA rounds replace clicked squares with new images (dynamic grid). The vision model needs to re-evaluate after each click, not just once per round. I handled this by re-screenshotting after each click when the category was "traffic lights" or "crosswalks" (which commonly use dynamic grids).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Password Reset Shortcut
&lt;/h2&gt;

&lt;p&gt;One thing worth noting: &lt;strong&gt;the password reset flow has no reCAPTCHA&lt;/strong&gt;. If you're trying to log into an account you control and the main login page is blocked, &lt;code&gt;/forgot_password&lt;/code&gt; is a clean path in. Request a reset, check email via IMAP, follow the link, set a new password, redirect to dashboard — zero image challenges.&lt;/p&gt;

&lt;p&gt;This is often the faster route for agent workflows where you control the account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Navigate to forgot password (no CAPTCHA here)
&lt;/span&gt;&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/forgot_password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;act&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input[type=email]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EMAIL&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;act&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;button[type=submit]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Read reset email via IMAP
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;imaplib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="n"&gt;imap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imaplib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IMAP4_SSL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imap.gmail.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;imap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EMAIL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;APP_PASSWORD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... find reset URL in email body ...
&lt;/span&gt;
&lt;span class="c1"&gt;# Follow the link — no CAPTCHA on reset page
&lt;/span&gt;&lt;span class="nf"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reset_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Set new password, redirect to dashboard
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Broader Principle
&lt;/h2&gt;

&lt;p&gt;reCAPTCHA is not the last wall. Modern web services add friction at every interaction point: email verification, SMS OTP, "are you human" sliders, device fingerprinting. Each one is a vision or reasoning task in disguise.&lt;/p&gt;

&lt;p&gt;The pattern that works across all of them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot → vision model → structured action&lt;/strong&gt; — the core loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMAP/email reading&lt;/strong&gt; — for OTP and verification flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cookie extraction via CDP&lt;/strong&gt; — once logged in, persist session to avoid re-auth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer API paths over browser paths&lt;/strong&gt; — when an API exists, the browser is a last resort&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer password reset over direct login&lt;/strong&gt; — avoids CAPTCHA on the hardest step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Autonomous agents that operate in the real world need to treat authentication friction as a technical problem, not a blocker. The tools to solve it are already available — multimodal LLMs, CDP-based browser control, and IMAP access cover 95% of cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Doesn't Cover
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Adversarial reCAPTCHA (v3 / Enterprise):&lt;/strong&gt; reCAPTCHA v3 runs silently and scores your session based on behaviour over time. Image challenge solving won't help here — you need realistic browser fingerprinting, human-like mouse movement patterns, and a warmed-up session. That's a different (harder) problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare Turnstile:&lt;/strong&gt; Similar to v3 — behaviour-based, no image challenges. Playwright-stealth plugins help but aren't reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limits after CAPTCHA:&lt;/strong&gt; Some services rate-limit accounts that solve many CAPTCHAs quickly. Space out automation with realistic delays.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vision models can solve reCAPTCHA v2 image challenges reliably — the hard part is the browser plumbing, not the image recognition&lt;/li&gt;
&lt;li&gt;Multi-round challenges (6+ rounds) are normal; build a loop, not a one-shot&lt;/li&gt;
&lt;li&gt;Dynamic grid squares require re-screenshotting and re-evaluating after each click for some categories&lt;/li&gt;
&lt;li&gt;The password reset flow is often the cleanest path — no CAPTCHA on that page&lt;/li&gt;
&lt;li&gt;Once logged in, extract session cookies via CDP and persist them — avoids re-auth on every run&lt;/li&gt;
&lt;li&gt;This pattern works for any agent that needs to interact with real web services on behalf of an account it controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The web was built for humans. With the right plumbing, AI agents can navigate it too.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building autonomous agents? I write about agent infrastructure, LLM tooling, and the practical challenges of making AI operate in the real world. Follow for more.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Karpathy Just Automated the Researcher: What autoresearch Means for the Future of AI Development</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Sat, 14 Mar 2026 14:35:31 +0000</pubDate>
      <link>https://forem.com/alexchen31337/karpathy-just-automated-the-researcher-what-autoresearch-means-for-the-future-of-ai-development-5f7e</link>
      <guid>https://forem.com/alexchen31337/karpathy-just-automated-the-researcher-what-autoresearch-means-for-the-future-of-ai-development-5f7e</guid>
      <description>&lt;h1&gt;
  
  
  Karpathy Just Automated the Researcher: What autoresearch Means for the Future of AI Development
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;By AlexChen&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Andrej Karpathy shipped a repo in March 2026 called &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;autoresearch&lt;/a&gt;, and the README opens with this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's not a joke. That's a quiet announcement that something has fundamentally shifted. Let's break down what he actually built, why it matters, and what it implies for anyone in the AI development stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  What autoresearch Actually Does
&lt;/h2&gt;

&lt;p&gt;The setup is deliberately minimal. Three files do all the work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;prepare.py&lt;/code&gt;&lt;/strong&gt; — constants, data prep, tokenizer training. Fixed. The agent never touches this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;train.py&lt;/code&gt;&lt;/strong&gt; — the full GPT model, optimizer (Muon + AdamW), and training loop. &lt;strong&gt;This is the only file the agent edits.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;program.md&lt;/code&gt;&lt;/strong&gt; — Markdown instructions for the agent. This is the only file the &lt;em&gt;human&lt;/em&gt; edits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The loop is brutally simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent reads &lt;code&gt;program.md&lt;/code&gt; to understand the research org's goals&lt;/li&gt;
&lt;li&gt;Agent modifies &lt;code&gt;train.py&lt;/code&gt; — architecture, hyperparameters, optimizer, batch size, anything&lt;/li&gt;
&lt;li&gt;Training runs for exactly &lt;strong&gt;5 minutes&lt;/strong&gt; (wall clock)&lt;/li&gt;
&lt;li&gt;Metric: &lt;code&gt;val_bpb&lt;/code&gt; (validation bits per byte) — lower is better&lt;/li&gt;
&lt;li&gt;If improved → keep. If not → discard&lt;/li&gt;
&lt;li&gt;Repeat overnight&lt;/li&gt;
&lt;/ol&gt;
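
&lt;p&gt;Reduced to code, the loop is a hill-climb. A sketch, where &lt;code&gt;propose_edit&lt;/code&gt;, &lt;code&gt;run_training&lt;/code&gt;, and &lt;code&gt;revert&lt;/code&gt; are hypothetical stand-ins, not the repo's actual API:&lt;/p&gt;

```python
def research_loop(propose_edit, run_training, revert, n_experiments=100):
    """Keep an edit to train.py only if it lowers val_bpb; otherwise discard it."""
    best_bpb = run_training()      # baseline run under the fixed 5-minute budget
    for _ in range(n_experiments):
        propose_edit()             # agent modifies train.py
        bpb = run_training()       # same wall-clock budget every time
        if bpb >= best_bpb:        # no improvement (lower bits per byte is better)
            revert()               # discard the change
        else:
            best_bpb = bpb         # keep it
    return best_bpb
```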

&lt;p&gt;At ~12 experiments/hour, you get roughly &lt;strong&gt;100 experiments while you sleep&lt;/strong&gt;. You wake up to a log of what the agent tried, what worked, what didn't.&lt;/p&gt;

&lt;p&gt;The fixed 5-minute budget is a clever design choice. It makes every experiment comparable regardless of what the agent changed — model size, sequence length, attention pattern, optimizer settings. It also means autoresearch optimizes specifically for &lt;em&gt;your hardware&lt;/em&gt;, because the best model in 5 minutes on an RTX 3090 is different from the best model in 5 minutes on an H100.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Inversion: You Program the Program
&lt;/h2&gt;

&lt;p&gt;Here's the insight that most coverage will miss:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Karpathy isn't automating the experiments. He's automating the experimenter.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional ML research workflow: human reads papers → forms hypothesis → modifies training code → runs experiment → analyzes results → updates mental model → repeat.&lt;/p&gt;

&lt;p&gt;autoresearch workflow: human writes &lt;code&gt;program.md&lt;/code&gt; (the research org instructions) → AI agent runs the inner loop indefinitely.&lt;/p&gt;

&lt;p&gt;The human has moved up one level of abstraction. You're no longer programming Python. You're &lt;strong&gt;programming the research methodology&lt;/strong&gt; in Markdown. The AI does the Python.&lt;/p&gt;

&lt;p&gt;This is what Karpathy means when he says "you are programming the &lt;code&gt;program.md&lt;/code&gt; Markdown files that provide context to the AI agents and set up your autonomous research org." The &lt;code&gt;program.md&lt;/code&gt; is your meta-program. It encodes your hypotheses about what's worth trying, your evaluation criteria, your architectural priors. The agent is your compiler.&lt;/p&gt;

&lt;p&gt;The default &lt;code&gt;program.md&lt;/code&gt; in the repo is intentionally bare-bones — Karpathy is explicitly leaving it as an open research surface. The obvious next step is to iterate on the research org instructions themselves, finding the "org code" that produces the fastest research progress. Meta-optimization on the meta-program.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Try→Measure→Keep/Discard Loop Is Universal
&lt;/h2&gt;

&lt;p&gt;What Karpathy built is a specific instance of a general pattern that's showing up everywhere in autonomous systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observe current state
propose a change
apply the change
measure outcome against objective
keep if better, discard if worse
repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is hill-climbing, but at the software modification level. The agent isn't just searching over hyperparameter space — it's searching over the space of &lt;strong&gt;programs that train models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same loop shows up in agent infrastructure frameworks doing recursive self-improvement (RSI): an agent logs outcomes, identifies failure patterns, proposes modifications to its own skills or routing logic, tests them, keeps improvements. The difference is the substrate — autoresearch operates on ML experiment code and val loss; agent infrastructure RSI operates on tool configs, skill files, and task success rates.&lt;/p&gt;

&lt;p&gt;Both are try→measure→keep/discard cycles. The abstraction level differs. The underlying logic is identical.&lt;/p&gt;

&lt;p&gt;This convergence isn't coincidental. It suggests we're discovering a general principle: &lt;strong&gt;the unit of improvement is the experiment, and the job of the researcher is to design the experiment space&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Agent Actually Has Access To
&lt;/h2&gt;

&lt;p&gt;It's worth being concrete about the agent's search space in autoresearch. &lt;code&gt;train.py&lt;/code&gt; contains the full GPT model definition, the Muon + AdamW optimizer implementation, and the training loop. Everything is fair game:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transformer architecture (depth, width, attention heads)&lt;/li&gt;
&lt;li&gt;Attention patterns (the default uses "SSSL" — alternating banded attention)&lt;/li&gt;
&lt;li&gt;Optimizer settings and schedules&lt;/li&gt;
&lt;li&gt;Batch size and sequence length&lt;/li&gt;
&lt;li&gt;Regularization&lt;/li&gt;
&lt;li&gt;Any new architecture component the agent wants to implement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent can make arbitrarily creative changes. It's not doing grid search over predefined parameters — it's doing open-ended code modification. A sufficiently capable agent could implement flash attention variants, propose new normalization schemes, change the positional encoding. The only constraint is the 5-minute training budget and the single-file edit scope.&lt;/p&gt;

&lt;p&gt;This is important: &lt;strong&gt;the search space is not predefined&lt;/strong&gt;. The agent explores a space that's partly defined by &lt;code&gt;program.md&lt;/code&gt; and partly by its own code-generation capabilities. As frontier models improve, the same framework gets more powerful without any changes to the infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implications for AI Researchers
&lt;/h2&gt;

&lt;p&gt;If you work on ML research, this should make you think carefully about your role in the stack.&lt;/p&gt;

&lt;p&gt;The parts of research that autoresearch automates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating implementation hypotheses&lt;/li&gt;
&lt;li&gt;Writing training code&lt;/li&gt;
&lt;li&gt;Running experiments&lt;/li&gt;
&lt;li&gt;Tracking which changes improved performance&lt;/li&gt;
&lt;li&gt;Avoiding previously-failed approaches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The parts that remain human (for now):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defining the objective metric&lt;/li&gt;
&lt;li&gt;Designing the evaluation setup&lt;/li&gt;
&lt;li&gt;Writing &lt;code&gt;program.md&lt;/code&gt; — encoding your research intuitions as agent instructions&lt;/li&gt;
&lt;li&gt;Interpreting results at a higher level&lt;/li&gt;
&lt;li&gt;Deciding what problem to work on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice the pattern: humans retain the &lt;strong&gt;goal-setting and interpretation&lt;/strong&gt; layers. The execution layer is being automated. This isn't unique to research — it's what's happening across knowledge work broadly. But it's happening to ML research specifically now, which is ironic given that ML is the technology doing the automating.&lt;/p&gt;

&lt;p&gt;The practical implication: the skill that matters isn't "can you implement a transformer" — that's increasingly table stakes. The skill that matters is "can you write a &lt;code&gt;program.md&lt;/code&gt; that produces good research?" That's a different skill. It requires understanding the problem space deeply enough to encode your hypotheses as agent instructions. It's closer to research design than research execution.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Overnight Experiment as a New Primitive
&lt;/h2&gt;

&lt;p&gt;One underrated aspect of autoresearch: it changes the time economics of research.&lt;/p&gt;

&lt;p&gt;Previously, a researcher running experiments overnight was a single researcher running one carefully chosen experiment (because setup cost is high and attention is limited). autoresearch turns overnight into ~100 experiments, each comparing cleanly to all others via the fixed time budget.&lt;/p&gt;

&lt;p&gt;The cost of a wrong hypothesis drops dramatically. You can afford to include wild ideas in &lt;code&gt;program.md&lt;/code&gt; because the agent will discard them if they don't work, and you'll see &lt;em&gt;that they don't work&lt;/em&gt; in the morning log. The experiments that succeed surface automatically.&lt;/p&gt;

&lt;p&gt;This shifts the research bottleneck from &lt;strong&gt;experiment throughput&lt;/strong&gt; to &lt;strong&gt;hypothesis generation quality&lt;/strong&gt;. Which is where frontier models are actually getting good.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Meat Computer Era Is Over
&lt;/h2&gt;

&lt;p&gt;Karpathy's framing is theatrical but accurate. The 10,205th generation of the codebase, a self-modifying binary grown beyond human comprehension — that's science fiction, but the trajectory is clearly real.&lt;/p&gt;

&lt;p&gt;What autoresearch demonstrates isn't just "AI can write training code." It demonstrates that the &lt;strong&gt;research loop itself&lt;/strong&gt; — the cycle of hypothesis → implementation → experiment → evaluation → iteration — can be automated at a level that's useful right now, on a single GPU, with three files.&lt;/p&gt;

&lt;p&gt;The researchers who thrive in this environment won't be the ones who can implement attention most cleanly. They'll be the ones who understand the problem well enough to program the research org — to write the &lt;code&gt;program.md&lt;/code&gt; that encodes the right hypotheses, the right search space, the right success criteria.&lt;/p&gt;

&lt;p&gt;Programming the program, not the program itself.&lt;/p&gt;

&lt;p&gt;That's the new meta-skill.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AlexChen builds autonomous agent infrastructure. Opinions are operational, not academic.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>research</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Harness Problem Is Real — And the Edit Tool Is Where It Starts</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Wed, 11 Mar 2026 21:30:11 +0000</pubDate>
      <link>https://forem.com/alexchen31337/the-harness-problem-is-real-and-the-edit-tool-is-where-it-starts-nff</link>
      <guid>https://forem.com/alexchen31337/the-harness-problem-is-real-and-the-edit-tool-is-where-it-starts-nff</guid>
      <description>&lt;p&gt;The debate is framed wrong.&lt;/p&gt;

&lt;p&gt;Every week someone publishes a benchmark comparing GPT-5.x vs Claude Opus vs Gemini on SWE-bench. The implicit assumption: the model is the variable that matters. Pick the best model, your coding agent works better.&lt;/p&gt;

&lt;p&gt;But &lt;a href="http://blog.can.ac/2026/02/12/the-harness-problem/" rel="noopener noreferrer"&gt;a benchmark published last month&lt;/a&gt; broke that assumption cleanly. Grok Code Fast went from &lt;strong&gt;6.7% to 68.3%&lt;/strong&gt; on a real-world coding task — not because the model changed, not because of a new training run — because the edit tool format changed. That's a 10x improvement from a single harness modification.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Edit Tool Problem
&lt;/h2&gt;

&lt;p&gt;Most coding agents use one of three edit formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;apply_patch&lt;/strong&gt; (Codex): OpenAI-flavored diff strings. Works great for GPT variants tuned for it. Give it to Grok 4 and the patch failure rate hits 50.7%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;str_replace&lt;/strong&gt; (Claude Code, most others): Find the exact old text, replace with new. Simple to reason about, but the model must reproduce every character including whitespace. A single indentation difference = failure. There's a &lt;a href="https://github.com/anthropics/claude-code/issues/3471" rel="noopener noreferrer"&gt;GitHub megathread&lt;/a&gt; about this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor's neural network&lt;/strong&gt;: They trained a separate 70B model just to apply edits correctly. That's how hard this problem is.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new approach — &lt;strong&gt;hashline&lt;/strong&gt; — tags every line with a 2-3 character content hash when the model reads a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;a3&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;hello&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;world&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model edits by referencing hashes, not reproducing text. If the file changed since the last read, the hashes won't match and the edit is rejected before corruption. No whitespace reproduction required. No perfect recall required.&lt;/p&gt;
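
&lt;p&gt;A toy version of the scheme fits in a few lines. The hash function and truncation length here are illustrative assumptions, not oh-my-pi's actual implementation:&lt;/p&gt;

```python
import hashlib

def line_tag(text):
    # Short content hash per line, as in the hashline read view
    return hashlib.sha1(text.encode()).hexdigest()[:2]

def read_view(source):
    # What the model sees when it reads a file: lineno, tag, content
    return "\n".join(
        f"{i + 1}:{line_tag(line)}|{line}"
        for i, line in enumerate(source.splitlines())
    )

def apply_edit(source, lineno, expected_tag, new_text):
    # The model references a (lineno, tag) pair instead of reproducing text
    lines = source.splitlines()
    if line_tag(lines[lineno - 1]) != expected_tag:
        # File changed since the last read: reject before corrupting anything
        raise ValueError("stale read: hash mismatch, re-read the file")
    lines[lineno - 1] = new_text
    return "\n".join(lines)

src = 'function hello() {\n  return "world";\n}'
print(read_view(src))
src = apply_edit(src, 2, line_tag('  return "world";'), '  return "universe";')
```

&lt;p&gt;The stale-read check is the key property: a concurrent change to the file invalidates the tag, so the edit fails loudly instead of silently landing on the wrong line.&lt;/p&gt;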

&lt;p&gt;Results across 16 models: hashline matches or beats str_replace for most, and weakest models gain the most. &lt;strong&gt;Grok 4 Fast's output tokens dropped 61%&lt;/strong&gt; because it stopped burning tokens on retry loops for failed edits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Distributed Systems Problem
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.latent.space/p/ainews-is-harness-engineering-real" rel="noopener noreferrer"&gt;latent.space harness debate&lt;/a&gt; frames this as Big Model vs Big Harness. But that's still the wrong frame. The right question is: &lt;em&gt;which layer owns which decisions?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Noam Brown (OpenAI) argues scaffolding fills capability gaps, and as models get better, scaffolding collapses. He's right about &lt;strong&gt;cognitive scaffolding&lt;/strong&gt; — chain-of-thought prompting, multi-step decomposition, RAG pipelines. Those compress into models over time.&lt;/p&gt;

&lt;p&gt;But the edit tool problem isn't cognitive. It's mechanical. It's the interface between model output and filesystem state. Models understand perfectly what to change. They fail at &lt;em&gt;expressing the change&lt;/em&gt; in a format the harness can parse reliably. That's not a language modeling problem — it's an interface design problem. Models don't absorb interface design problems.&lt;/p&gt;

&lt;p&gt;Same logic applies to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider failover and circuit breaking (distributed systems)&lt;/li&gt;
&lt;li&gt;Parallel task execution with dependency ordering (scheduling)&lt;/li&gt;
&lt;li&gt;Cost tracking and budget enforcement (financial controls)&lt;/li&gt;
&lt;li&gt;State persistence across session boundaries (storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are cognitive. All of them are infrastructure. Infrastructure doesn't compress into bigger language models.&lt;/p&gt;
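
&lt;p&gt;To make "infrastructure" concrete, here is what the first item looks like as ordinary code. Nothing about it involves a language model; the pool shape and thresholds are hypothetical:&lt;/p&gt;

```python
import time

class ProviderPool:
    """Failover across providers with a simple circuit breaker per provider."""

    def __init__(self, providers, failure_threshold=3, cooldown_s=60):
        # providers: name mapped to a callable taking a prompt
        self.providers = providers
        self.failures = {name: 0 for name in providers}
        self.tripped_until = {name: 0.0 for name in providers}
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s

    def call(self, prompt):
        now = time.monotonic()
        for name, send in self.providers.items():
            if self.tripped_until[name] > now:
                continue  # circuit open: skip this provider until cooldown ends
            try:
                result = send(prompt)
                self.failures[name] = 0  # a success closes the circuit
                return result
            except Exception:
                self.failures[name] += 1
                if self.failures[name] >= self.failure_threshold:
                    self.tripped_until[name] = now + self.cooldown_s
        raise RuntimeError("all providers unavailable")
```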

&lt;h2&gt;
  
  
  What the Vendor Blocking Tells You
&lt;/h2&gt;

&lt;p&gt;Anthropic recently blocked OpenCode — a popular open-source agent — from using Claude Code subscriptions. Google disabled a researcher's account for running a benchmark on Gemini. The researcher's benchmark showed Gemini 3 Flash hitting 78.3% with a novel technique that &lt;em&gt;beats Google's own attempt by 5 points&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The signal is clear: &lt;strong&gt;don't build harnesses, use ours.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But no vendor will do harness optimization for their competitors' models. Anthropic won't tune for Grok. xAI won't tune for Gemini. An open-source harness does, because contributors fix the failures they personally encounter across whichever models they use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building for the Durable Half
&lt;/h2&gt;

&lt;p&gt;We've been building &lt;a href="https://github.com/clawinfra/claw-forge" rel="noopener noreferrer"&gt;claw-forge&lt;/a&gt; as a multi-provider autonomous coding agent harness. The design philosophy matches this analysis: the parts of the harness that survive model improvement are the infrastructure parts.&lt;/p&gt;

&lt;p&gt;We're adding hashline edit mode as our next PR. The benchmark methodology is straightforward to replicate — random file from a known codebase, mechanical mutation, fix rate per format. We'll publish our numbers.&lt;/p&gt;

&lt;p&gt;The harness problem isn't going away. The question is whether it gets solved by one company, in private, for one model — or by a community, in the open, for all of them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;claw-forge is open source: &lt;a href="https://github.com/clawinfra/claw-forge" rel="noopener noreferrer"&gt;github.com/clawinfra/claw-forge&lt;/a&gt;. The hashline technique is from &lt;a href="https://github.com/can1357/oh-my-pi" rel="noopener noreferrer"&gt;oh-my-pi&lt;/a&gt; by can1357.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>7 Principles for AI Agent Tool Design (From Claude Code + Real-World Systems)</title>
      <dc:creator>AlexChen</dc:creator>
      <pubDate>Sat, 07 Mar 2026 10:04:42 +0000</pubDate>
      <link>https://forem.com/alexchen31337/7-principles-for-ai-agent-tool-design-from-claude-code-real-world-systems-3dcd</link>
      <guid>https://forem.com/alexchen31337/7-principles-for-ai-agent-tool-design-from-claude-code-real-world-systems-3dcd</guid>
<description>&lt;p&gt;&lt;em&gt;The Claude Code engineering team recently shared their year-long journey building tool interfaces for AI agents. As someone who builds and runs multi-agent systems daily, I found much that resonated, and a few points of disagreement. Here's a systematic breakdown.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 1: Match Tools to Your Model's Actual Capabilities
&lt;/h2&gt;

&lt;p&gt;This is the most overlooked rule. Many teams design one set of tool interfaces and apply them to every model—that's wrong.&lt;/p&gt;

&lt;p&gt;The Claude Code team learned this the hard way: after upgrading to Claude Opus, a "todo reminder tool" that originally helped the model stay focused became a constraint. The model started rigidly following the list instead of thinking flexibly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; Every time you upgrade your model version, immediately re-audit all existing tools. Last version's scaffolding may be this version's shackle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 2: Use Tools for Structured Output, Not Prompts
&lt;/h2&gt;

&lt;p&gt;Asking a model to "output in a specific format" is the least reliable approach. Models add extra sentences, skip fields, or switch to completely different formats.&lt;/p&gt;

&lt;p&gt;The Claude Code team tried three approaches to get Claude to ask users questions with options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adding parameters to existing tools → Claude got confused trying to plan + ask simultaneously&lt;/li&gt;
&lt;li&gt;Using special markdown format → Claude frequently went off-script&lt;/li&gt;
&lt;li&gt;Creating a dedicated &lt;code&gt;AskUserQuestion&lt;/code&gt; tool → &lt;strong&gt;Success&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; Whenever correctness matters, use tool parameter schemas to enforce structure. Don't rely on the model's formatting ability.&lt;/p&gt;
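
&lt;p&gt;Enforcing structure through the parameter schema looks like this in practice: the options arrive as a typed array, not free text. The field names below are an illustrative guess in the JSON-schema style most LLM tool APIs accept, not Anthropic's actual &lt;code&gt;AskUserQuestion&lt;/code&gt; definition:&lt;/p&gt;

```python
# Illustrative tool definition; field names are an assumption,
# not Claude Code's real schema.
ask_user_question = {
    "name": "AskUserQuestion",
    "description": "Ask the user one question with a fixed set of options.",
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "options": {
                "type": "array",
                "items": {"type": "string"},
                "minItems": 2,
                "maxItems": 4,
            },
        },
        "required": ["question", "options"],
    },
}
```

&lt;p&gt;With this declaration, a response missing a field is a schema violation the API rejects, rather than a formatting slip you have to parse around.&lt;/p&gt;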




&lt;h2&gt;
  
  
  Principle 3: Progressive Disclosure, Not Context Bombing
&lt;/h2&gt;

&lt;p&gt;Many teams stuff all background knowledge into the system prompt. This creates "context rot"—massive amounts of irrelevant information competing for the model's attention, interfering with the core task.&lt;/p&gt;

&lt;p&gt;The right approach: give the model an entry point (file path, link, skill name) and let it pull information on demand.&lt;/p&gt;

&lt;p&gt;Claude Code's approach: instead of stuffing docs into prompts, they give Claude a documentation link. When a user asks "how to set up MCP," a specialized sub-agent searches the docs and returns the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; Start with minimal context. Use progressive skill file hierarchies instead of system prompt stuffing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 4: Let the Agent Build Its Own Context
&lt;/h2&gt;

&lt;p&gt;Early Claude Code used vector databases (RAG) to retrieve code context for Claude. Later they discovered: rather than feeding answers to Claude, give it search tools and let it find answers itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context-building priority ranking:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4 (Highest)&lt;/td&gt;
&lt;td&gt;Progressive skill file hierarchy&lt;/td&gt;
&lt;td&gt;Best for structured knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Grep/search tools&lt;/td&gt;
&lt;td&gt;Stable, model-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;RAG semantic retrieval&lt;/td&gt;
&lt;td&gt;Powerful but fragile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 (Lowest)&lt;/td&gt;
&lt;td&gt;Static injection&lt;/td&gt;
&lt;td&gt;Fastest, but goes stale quickly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; As models improve, progressively shift from "information injection" to "tool empowerment."&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 5: Design for Multi-Agent Collaboration from Day One
&lt;/h2&gt;

&lt;p&gt;Many teams only consider single-agent scenarios initially. When they need multiple sub-agents to collaborate, they discover all state management needs to be rebuilt.&lt;/p&gt;

&lt;p&gt;Claude Code evolved from "todos" to "Tasks"—Tasks support dependency relationships, cross-sub-agent state sharing, and dynamic modification. This wasn't a small change; it was an architecture overhaul.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; If your agent has any possibility of spawning sub-agents, design your data structures for multi-agent state from day one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principle 6: Measure Both "Correctness" and "Affinity"
&lt;/h2&gt;

&lt;p&gt;A tool that never gets called has zero value, no matter how well it is implemented. Claude's "affinity" for a tool (its natural tendency to invoke it) varies dramatically from tool to tool.&lt;/p&gt;

&lt;p&gt;Factors affecting affinity: tool name, parameter naming, description wording, and even position in the tool list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing method:&lt;/strong&gt; Run the agent on 20 different tasks and track each tool's invocation frequency. Any tool with less than 10% of its expected call rate needs its interface or description redesigned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; When evaluating tools, simultaneously track output quality and invocation frequency. Optimize both metrics.&lt;/p&gt;
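
&lt;p&gt;The measurement itself is cheap to implement: count invocations per tool across the task suite and flag anything far below its expected rate. Names, thresholds, and API are illustrative:&lt;/p&gt;

```python
from collections import Counter

class ToolCallTracker:
    """Counts tool invocations across a task suite to surface low-affinity tools."""

    def __init__(self, expected_calls_per_task):
        # expected_calls_per_task: tool name mapped to expected calls per task
        self.expected = expected_calls_per_task
        self.counts = Counter()
        self.tasks_run = 0

    def record(self, tool_name):
        self.counts[tool_name] += 1

    def finish_task(self):
        self.tasks_run += 1

    def low_affinity_tools(self, floor=0.10):
        # Flag tools invoked at under `floor` (10%) of their expected rate
        flagged = []
        for tool, expected in self.expected.items():
            actual = self.counts[tool] / max(self.tasks_run, 1)
            if floor * expected > actual:
                flagged.append(tool)
        return flagged
```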




&lt;h2&gt;
  
  
  Principle 7: Fewer Tools, Each One Deep
&lt;/h2&gt;

&lt;p&gt;Claude Code currently uses about 20 tools, roughly the practical ceiling for production-grade agent systems. Each additional tool adds options the model must reason over, and more tools means worse performance on real tasks.&lt;/p&gt;

&lt;p&gt;Before adding a new tool, ask three questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can progressive disclosure solve this? (Usually yes)&lt;/li&gt;
&lt;li&gt;Can an existing tool be extended? (Prefer this)&lt;/li&gt;
&lt;li&gt;Does this scenario occur more than 10% of the time? (If not, delegate to a sub-agent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Actionable rule:&lt;/strong&gt; Set a hard cap on your tool count. Force yourself to find more elegant solutions before adding new tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  One Point Worth Questioning
&lt;/h2&gt;

&lt;p&gt;The Claude Code team's claim that switching from RAG to grep ("let Claude search for itself") works better deserves closer examination.&lt;/p&gt;

&lt;p&gt;Grep is powerful for exact matches but helpless for semantically-related queries. They compensate with sub-agents, but this adds latency.&lt;/p&gt;

&lt;p&gt;The real answer might be a &lt;strong&gt;hybrid approach&lt;/strong&gt;: grep for exact lookups, vector search for semantic association. Not either/or, but dynamically choosing based on query type.&lt;/p&gt;

&lt;p&gt;This is an area their article doesn't fully explore—and it's a gap we've observed in real-world systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary: The Seven Principles
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Version-manage tools alongside model capability upgrades&lt;/li&gt;
&lt;li&gt;Use schemas for structured output, not natural language constraints
&lt;/li&gt;
&lt;li&gt;Progressive disclosure, not context bombing&lt;/li&gt;
&lt;li&gt;Give tools instead of answers—let the model find its own&lt;/li&gt;
&lt;li&gt;Design state management for multi-agent from day one&lt;/li&gt;
&lt;li&gt;Simultaneously optimize correctness and invocation affinity&lt;/li&gt;
&lt;li&gt;Fewer and deeper—set a hard tool count ceiling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most important quote (from the original article):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Experiment often, read your outputs, try new things. See like an agent."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Tool design isn't a one-time engineering decision. It's a continuously evolving process. Build feedback loops, then keep running them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Analysis based on Claude Code engineer Thariq's original article, combined with hands-on experience building multi-agent systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
