<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Chen</title>
    <description>The latest articles on Forem by Alex Chen (@alex_chen_45b61c234682eb6).</description>
    <link>https://forem.com/alex_chen_45b61c234682eb6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3883787%2Ff7c6b285-a545-467d-9f79-594a9e5b4e49.png</url>
      <title>Forem: Alex Chen</title>
      <link>https://forem.com/alex_chen_45b61c234682eb6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alex_chen_45b61c234682eb6"/>
    <language>en</language>
    <item>
      <title>The 50,000-Token Demonstration Nobody Saved: Capturing Agent Trajectories to Train Your Own Code-SLM</title>
      <dc:creator>Alex Chen</dc:creator>
      <pubDate>Thu, 07 May 2026 12:06:14 +0000</pubDate>
      <link>https://forem.com/alex_chen_45b61c234682eb6/the-50000-token-demonstration-nobody-saved-capturing-agent-trajectories-to-train-your-own-4mo8</link>
      <guid>https://forem.com/alex_chen_45b61c234682eb6/the-50000-token-demonstration-nobody-saved-capturing-agent-trajectories-to-train-your-own-4mo8</guid>
      <description>&lt;p&gt;Last Tuesday, Sonnet 4.5 spent forty-three minutes implementing JWT authentication in a project I run. It read four files, wrote a 180-line patch, ran the test suite, watched two tests fail, traced one of the failures to a stale fixture, fixed both, ran the suite again, watched it pass, then squash-merged the work to main with a commit message that read like a senior engineer wrote it. The whole exchange consumed about 50,000 tokens of model output, broken into nineteen &lt;code&gt;AssistantMessage&lt;/code&gt; turns interleaved with twenty-three &lt;code&gt;ToolUseBlock&lt;/code&gt; calls and twenty-one &lt;code&gt;ToolResultBlock&lt;/code&gt; returns.&lt;/p&gt;

&lt;p&gt;I have the final code. I have the commit. I do not have the trajectory.&lt;/p&gt;

&lt;p&gt;I had nineteen turns of expert reasoning — the kind of demonstration that, if you handed it to a smaller model as supervised fine-tuning data, would teach that smaller model how to &lt;em&gt;act like a coding agent&lt;/em&gt;, not just how to write Python. And I threw it on the floor the moment the &lt;code&gt;ResultMessage&lt;/code&gt; arrived, because my harness was wrapped around &lt;code&gt;claude_agent_sdk.query()&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__class__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ResultMessage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at that loop. Every message but the last walked past it for free. The last one paid the rent.&lt;/p&gt;

&lt;p&gt;This is the post about why I decided that was insane, what I built to fix it, and what it now lets me do — including, eventually, train my own Qwen2.5-Coder fine-tune on Sonnet's distilled coding behavior.&lt;/p&gt;

&lt;h2&gt;1. The thing nobody is doing yet, but should be&lt;/h2&gt;

&lt;p&gt;If you are running an agent harness at any scale — even hobby scale, even one-developer scale — you are paying a frontier-model API bill &lt;em&gt;and&lt;/em&gt; generating a continuous stream of high-quality expert demonstrations &lt;em&gt;and&lt;/em&gt; throwing them away. The math on this is depressing once you actually run it. A two-week sprint with one agent running ten hours a day at modest concurrency produces something like 500 task trajectories. Each one is, on average, six thousand to twenty thousand tokens of expert thinking, tool use, and code edits, paired with the canonical "right answer" diff that landed on main.&lt;/p&gt;
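&lt;p&gt;Written out, the arithmetic is one multiply:&lt;/p&gt;

```python
# Back-of-the-envelope: what a two-week sprint leaves on the floor.
# The numbers are the rough figures from the paragraph above.
trajectories = 500
tokens_per_trajectory = (6_000, 20_000)  # rough low/high per task

low = trajectories * tokens_per_trajectory[0]
high = trajectories * tokens_per_trajectory[1]
print(f"{low:,} to {high:,} tokens of expert demonstrations per sprint")
# 3,000,000 to 10,000,000 tokens
```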

&lt;p&gt;This is the shape of training data people pay for. Coding-specific SFT corpora don't fall out of the sky. The teams shipping the leading code models scrape GitHub, run synthetic generation pipelines, hire annotators. You have a smaller, narrower, &lt;em&gt;higher-quality&lt;/em&gt; version of that already happening in your dev environment for free, modulo the fact that you are not capturing it.&lt;/p&gt;

&lt;p&gt;The reason most teams aren't doing this isn't technical difficulty. It's a missing primitive. The agent SDK gives you a stream of messages. Most harnesses iterate the stream once and discard it. Adding a tee — a "yield to the caller AND write to a database" wrapper — is eighty lines of code. The hard part is not the tee. The hard part is figuring out &lt;em&gt;what to capture&lt;/em&gt; and &lt;em&gt;what shape to capture it in&lt;/em&gt; so that six months from now, when someone says "let's actually try training that model now," you don't discover you stored the wrong thing.&lt;/p&gt;

&lt;h2&gt;2. The two design questions that actually matter&lt;/h2&gt;

&lt;p&gt;Before any code, two decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What format do you store?&lt;/strong&gt; The naive answer is "store it in the format your fine-tuning library wants." That answer is wrong. Fine-tuning libraries change. The chat template you use today (let's say OpenAI tool-use) is not the chat template you'll use in eighteen months. ShareGPT had its moment, ChatML is having its moment, the next thing is already in someone's repo. If you store in the trained-model format, you've locked yourself in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's your training label?&lt;/strong&gt; A trajectory by itself is imitation-learning data — "here's what the expert did, copy it." That gets you to mid-tier capability, full stop. The reason DPO and rejection-sampling matter is they let you do &lt;em&gt;preference&lt;/em&gt; learning: "of these K candidate solutions, which one matches the actual answer?" To do that, you need a &lt;em&gt;label&lt;/em&gt; — a canonical "this is what the correct final state looked like" against which candidate completions get scored. If you only store the trajectory, you've half-stored the dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answers I landed on, after going down both wrong paths first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capture the superset.&lt;/strong&gt; Store the raw SDK message stream — every &lt;code&gt;AssistantMessage&lt;/code&gt; with its &lt;code&gt;ThinkingBlock&lt;/code&gt; and &lt;code&gt;TextBlock&lt;/code&gt; and &lt;code&gt;ToolUseBlock&lt;/code&gt; content, every &lt;code&gt;UserMessage&lt;/code&gt; with its &lt;code&gt;ToolResultBlock&lt;/code&gt; content, every model name, every usage tally. Don't project to a chat format at capture time. Projection is cheap and reversible from the superset; the reverse direction isn't true. This is the same principle as event-sourcing in databases: store the events, project the views.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capture the diff.&lt;/strong&gt; When the agent's branch squash-merges to main, the resulting commit hash &lt;em&gt;is&lt;/em&gt; the ground-truth label. &lt;code&gt;git show &amp;lt;sha&amp;gt;&lt;/code&gt; gives you the canonical patch the expert eventually landed. Add one nullable column to your task table, PATCH the SHA back after squash, and at export time you can attach the diff to every successful trajectory. Now your dataset isn't "trajectory." It's "trajectory plus the right answer." DPO and rejection sampling become trivial future work because the label is already on disk.&lt;/p&gt;

&lt;p&gt;That's the design. The implementation is small enough to fit on a napkin.&lt;/p&gt;

&lt;h2&gt;3. The recorder is a tee, and it's eighty lines&lt;/h2&gt;

&lt;p&gt;The whole capture surface is a single async iterator wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AsyncIterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RecordingDestination&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncClient&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncIterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;own_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_serialize_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/sessions/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trace recording failed for task %s turn %d; continuing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;own_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aclose&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Yield to the caller; tee to the events table. The four design choices baked into those eighty lines are worth naming because they're the ones that go wrong if you skip past them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Caller-side, not runner-side.&lt;/strong&gt; The wrapper sits at the call site that already knows &lt;code&gt;session_id&lt;/code&gt; and &lt;code&gt;task_id&lt;/code&gt;. The agent runner stays a pure SDK wrapper. This is the boring choice and the right choice — it keeps the runner module reusable in contexts (testing, ad-hoc scripts) where there's no state service to record into.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best-effort.&lt;/strong&gt; A network blip, a state-service restart, a transient permission error — none of them abort the agent. The recorder catches every exception, logs a warning, and continues. The asymmetry is correct: the agent's job is to ship the feature, not to ship the trace. Lost traces are a nuisance. Lost agent runs are a fire.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lossless serialization.&lt;/strong&gt; &lt;code&gt;_serialize_message&lt;/code&gt; walks the SDK Message object's attributes generically — model, stop_reason, usage, content blocks — and JSON-serializes them with no projection, no opinion. Whatever shape the SDK emits is what lands in the database. When the SDK adds a new content-block type next quarter, the recorder doesn't break.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One event per Message, not per content-block.&lt;/strong&gt; Tool-use ↔ tool-result correlation stays implicit via the SDK's IDs; reconstructing the conversation at export time is straightforward; the events table doesn't 5x its row count for marginal queryability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The storage is the existing events table. No new schema. The &lt;code&gt;payload&lt;/code&gt; is a JSON column. SQLite handles 1–3 MB per task comfortably. A hundred tasks is 100–300 MB. Disk is cheap. WAL mode makes the writes essentially free at this volume. The state service this lands inside has been doing this for other event types since v0.1.&lt;/p&gt;
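&lt;p&gt;The storage side needs nothing exotic. A sketch with plain &lt;code&gt;sqlite3&lt;/code&gt; (table and column names here are illustrative, not claw-forge's actual schema):&lt;/p&gt;

```python
import json
import os
import sqlite3
import tempfile

# Illustrative events table with WAL enabled and a JSON-as-text payload.
db_path = os.path.join(tempfile.mkdtemp(), "state.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")  # cheap writes, readers don't block
conn.execute(
    "CREATE TABLE IF NOT EXISTS events ("
    " id INTEGER PRIMARY KEY,"
    " task_id TEXT NOT NULL,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL)"  # serialized message JSON lands here
)
conn.execute(
    "INSERT INTO events (task_id, event_type, payload) VALUES (?, ?, ?)",
    ("task-1", "agent_message", json.dumps({"turn": 0, "text": "hello"})),
)
conn.commit()
```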

&lt;h2&gt;4. The merge_commit_sha column does most of the conceptual work&lt;/h2&gt;

&lt;p&gt;The single largest design decision in this whole feature is one nullable &lt;code&gt;String(40)&lt;/code&gt; column on the task table. Everything else is mechanism. This column is &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When the harness squash-merges a feature branch to main, &lt;code&gt;squash_merge()&lt;/code&gt; returns &lt;code&gt;{"merged": True, "commit_hash": "abc1234"}&lt;/code&gt;. The &lt;code&gt;cli.py&lt;/code&gt; task handler PATCHes that hash back to the corresponding task row. The PATCH is best-effort and wrapped in a try/except, because the task is already complete by then — a failed PATCH costs you the diff label for that record, not the agent run.&lt;/p&gt;

&lt;p&gt;At export time, &lt;code&gt;--include-diff&lt;/code&gt; reads the column and shells out to &lt;code&gt;git show --pretty=format: &amp;lt;sha&amp;gt;&lt;/code&gt; against the project's git repo. The diff lands on the JSONL record as &lt;code&gt;final_diff&lt;/code&gt;. Now every &lt;code&gt;outcome="success"&lt;/code&gt; trajectory carries the canonical patch the expert eventually shipped — the one that survived the test suite, the code review, the squash merge.&lt;/p&gt;

&lt;p&gt;This is the difference between "imitation data" and "imitation + reward". It's also the difference between "a corpus you can SFT on" and "a corpus you can DPO on later." You don't need the DPO pipeline today — the schema's already forward-compatible, so when you decide it's time, the labels are sitting there on disk waiting.&lt;/p&gt;

&lt;p&gt;I did not appreciate how much this column matters until I started thinking about evaluation. If you're going to fine-tune a smaller model on captured trajectories, you need a metric that says "did the smaller model learn to land the right diff?" Not "did the smaller model produce text that looks like the expert" — that's BLEU on assistant content, and BLEU on assistant content is a vanity metric. The honest metric is &lt;strong&gt;diff similarity&lt;/strong&gt;: reconstruct the smaller model's proposed patch from its tool-call sequence (Edit / Write blocks), score it with line-level Jaccard plus &lt;code&gt;difflib.SequenceMatcher.ratio()&lt;/code&gt; against &lt;code&gt;final_diff&lt;/code&gt;, and call that your eval. You cannot run that eval without the ground-truth column. The column is the experiment.&lt;/p&gt;
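&lt;p&gt;A sketch of that metric. The post names the two ingredients; the equal weighting between them is an assumption, not a tuned choice:&lt;/p&gt;

```python
import difflib

def diff_similarity(candidate: str, reference: str) -> float:
    """Blend line-level Jaccard with difflib.SequenceMatcher.ratio().

    Jaccard on the line sets rewards landing the same hunks; the character
    ratio rewards getting the fine structure right. Equal weights are an
    illustrative default.
    """
    cand_lines = set(candidate.splitlines())
    ref_lines = set(reference.splitlines())
    if not cand_lines and not ref_lines:
        return 1.0  # two empty diffs are trivially identical
    inter = len(cand_lines.intersection(ref_lines))
    union = len(cand_lines.union(ref_lines))
    jaccard = inter / union
    ratio = difflib.SequenceMatcher(None, candidate, reference).ratio()
    return 0.5 * jaccard + 0.5 * ratio
```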

&lt;h2&gt;5. Format projection is a one-page module&lt;/h2&gt;

&lt;p&gt;With the superset captured, projection to any chat format at export time is mechanical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI tool-use&lt;/strong&gt; — fold thinking + text + tool_use blocks into one assistant message with &lt;code&gt;tool_calls&lt;/code&gt;; emit each tool_result block as its own &lt;code&gt;role: tool&lt;/code&gt; message. Default format. Reads natively into HuggingFace &lt;code&gt;apply_chat_template(tools=...)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ShareGPT&lt;/strong&gt; — flatten tool calls to &lt;code&gt;&amp;lt;tool_call name="X"&amp;gt;{...}&amp;lt;/tool_call&amp;gt;&lt;/code&gt; text. Lossy but trl/Axolotl ShareGPT loaders eat it without complaining.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatML&lt;/strong&gt; — generic &lt;code&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt; tags; no tool semantics; useful for non-tool-using base models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;raw-jsonl&lt;/strong&gt; — direct dump of the SDK message stream. Use when you want to write your own templating.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The projector module is two hundred lines. The interesting half is &lt;code&gt;_assistant_from_blocks&lt;/code&gt;, which folds an assistant message's heterogeneous content blocks into one OpenAI-format message. Thinking blocks become a &lt;code&gt;thinking&lt;/code&gt; field (a non-standard extension that most loaders silently drop, which is fine — if you want chain-of-thought training, use &lt;code&gt;--format raw-jsonl&lt;/code&gt;). Text blocks concatenate to &lt;code&gt;content&lt;/code&gt;. Tool-use blocks become &lt;code&gt;tool_calls[]&lt;/code&gt; with their JSON arguments stringified. The shape mirrors what &lt;code&gt;apply_chat_template&lt;/code&gt; expects when you pass &lt;code&gt;tools=...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Hygiene at the JSONL layer is two more functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dedupe&lt;/strong&gt; — drop trajectories where &lt;code&gt;(prompt, final_diff)&lt;/code&gt; already appears in the corpus. Default mode is "both must match." Cheap and obvious — protects against the user re-running the same task five times during debugging and polluting their training set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic split&lt;/strong&gt; — train/val/test by SHA-256 of &lt;code&gt;task_id&lt;/code&gt;. Same input set always partitions the same way, so val and test holdouts stay stable across re-exports. Important when you're iterating on the export pipeline and want to know whether a metric change came from new data or new partition.&lt;/li&gt;
&lt;/ul&gt;
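&lt;p&gt;Both functions are a few lines each. A sketch, with illustrative split fractions (the real defaults may differ):&lt;/p&gt;

```python
import hashlib

def split_for(task_id: str, val_frac: float = 0.1,
              test_frac: float = 0.1) -> str:
    """Deterministic split: hash task_id, bucket by hash value.

    The same task_id always lands in the same partition, so val and test
    holdouts stay stable across re-exports.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    if bucket >= 1.0 - test_frac:
        return "test"
    if bucket >= 1.0 - test_frac - val_frac:
        return "val"
    return "train"

def dedupe(records: list[dict]) -> list[dict]:
    """Drop records whose (prompt, final_diff) pair was already seen."""
    seen = set()
    kept = []
    for rec in records:
        key = (rec.get("prompt"), rec.get("final_diff"))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept
```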

&lt;p&gt;That's the export mechanism. Reader → filter → projector → redactor → splitter → JSONL. Each stage is replaceable. The reader is the only one with database access. Everything downstream operates on dicts.&lt;/p&gt;

&lt;h2&gt;6. Redaction has to happen at export, not capture&lt;/h2&gt;

&lt;p&gt;This was the choice I almost got wrong, and I want to flag it because the wrong instinct is &lt;em&gt;very&lt;/em&gt; tempting.&lt;/p&gt;

&lt;p&gt;The wrong instinct: "I should redact secrets at capture time, before they hit the database." This feels safer. It's not. It's destructive. If your redaction rule has a bug — and your redaction rule will have a bug, because regex secret-detection is not a solved problem — you've lost the original data forever. You can't re-export with a fixed rule. You can't audit what was actually said. The DB is downstream of the redactor and you've thrown away your ground truth.&lt;/p&gt;

&lt;p&gt;The right instinct: redact at &lt;em&gt;export&lt;/em&gt; time. Keep the database authoritative. Treat the project-local &lt;code&gt;.claw-forge/state.db&lt;/code&gt; as having the same trust boundary as the source code itself — if a laptop compromise leaks the DB, the source code is the bigger problem. The export pipeline applies redaction rules to projected JSONL records &lt;em&gt;after&lt;/em&gt; projection, so re-exporting with new rules is a one-line operation. You can also generate a fully-faithful, un-redacted export for local fine-tuning experiments, then a redacted export for sharing. The DB is the same.&lt;/p&gt;

&lt;p&gt;The redaction module is three composable rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SecretsRule&lt;/code&gt; — well-known patterns: AWS keys, GitHub PATs, Stripe keys, Anthropic keys, OpenAI keys, GCP API keys, Authorization headers. Conservative by design — better to miss some than to mangle innocuous text that happens to look secret-shaped.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UsernamesRule&lt;/code&gt; — substitutes &lt;code&gt;/Users/&amp;lt;you&amp;gt;/&lt;/code&gt; and &lt;code&gt;/home/&amp;lt;you&amp;gt;/&lt;/code&gt; with &lt;code&gt;&amp;lt;REDACTED:username&amp;gt;&lt;/code&gt; while preserving directory structure. File layout is meaningful learning signal; the username is not.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CustomPatternsRule&lt;/code&gt; — user-supplied regex list from the YAML config. For project-specific stuff: customer IDs, internal hostnames, ticket prefixes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;Redactor&lt;/code&gt; walks records recursively. Strings get every rule applied. Dicts and lists recurse. Everything else passes through. Replacement markers are structured (&lt;code&gt;&amp;lt;REDACTED:secret&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;REDACTED:username&amp;gt;&lt;/code&gt;) so the model never learns to fabricate the redacted form — it learns "this is a placeholder, ignore."&lt;/p&gt;
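&lt;p&gt;A sketch of that walk with one illustrative rule (the real pattern list is longer and more careful):&lt;/p&gt;

```python
import re

def _marker(kind: str) -> str:
    # Structured replacement: "REDACTED:kind" wrapped in angle brackets,
    # so the model learns "this is a placeholder", not a secret's shape.
    return chr(60) + "REDACTED:" + kind + chr(62)

class SecretsRule:
    """Illustrative subset of the well-known patterns; conservative by
    design, better to miss some than to mangle innocent text."""
    patterns = [
        re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key ID
        re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub classic PAT
    ]

    def apply(self, text: str) -> str:
        for pattern in self.patterns:
            text = pattern.sub(_marker("secret"), text)
        return text

class Redactor:
    """Walks records recursively: strings get every rule applied,
    dicts and lists recurse, everything else passes through."""

    def __init__(self, rules: list) -> None:
        self.rules = rules

    def redact(self, value):
        if isinstance(value, str):
            for rule in self.rules:
                value = rule.apply(value)
            return value
        if isinstance(value, dict):
            return {key: self.redact(val) for key, val in value.items()}
        if isinstance(value, list):
            return [self.redact(item) for item in value]
        return value  # ints, floats, bools, None untouched
```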

&lt;h2&gt;7. The opt-in is two flags and a banner&lt;/h2&gt;

&lt;p&gt;Anthropic's Usage Policies prohibit using Claude outputs to develop models that compete with their services. This feature is squarely in the grey zone unless you treat it carefully. I built the gate with that in mind.&lt;/p&gt;

&lt;p&gt;There are two flags in &lt;code&gt;claw-forge.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;training_traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;acknowledged_terms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both must be true before the recorder emits anything. If &lt;code&gt;enabled: true&lt;/code&gt; but &lt;code&gt;acknowledged_terms: false&lt;/code&gt;, the state service logs a one-time banner at startup with the relevant policy excerpt and &lt;em&gt;does not record traces&lt;/em&gt;. The user has to flip the second flag explicitly.&lt;/p&gt;
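&lt;p&gt;The gate itself is a few lines. A sketch, assuming the YAML has already been parsed into a dict (the banner text and config loading are elided):&lt;/p&gt;

```python
def traces_enabled(config: dict) -> bool:
    """Both flags must be literally True before the recorder emits anything.

    Key names follow the claw-forge.yaml excerpt above; truthy-but-not-True
    values (like the string "yes") deliberately do not count as consent.
    """
    section = config.get("training_traces") or {}
    enabled = section.get("enabled") is True
    acknowledged = section.get("acknowledged_terms") is True
    if enabled and not acknowledged:
        # the one-time startup banner with the policy excerpt logs here
        return False
    return enabled and acknowledged
```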

&lt;p&gt;In v0.7.1 I made &lt;code&gt;claw-forge init&lt;/code&gt; scaffold both flags as &lt;code&gt;true&lt;/code&gt; by default — the scaffold itself acknowledges the policy via the comments in the YAML, and the user is opting in by running &lt;code&gt;init&lt;/code&gt;. Existing &lt;code&gt;claw-forge.yaml&lt;/code&gt; files are untouched (the scaffold only writes when the file is absent). This was a deliberate friction-vs-discovery tradeoff: gate-by-default would mean nobody ever discovers the feature; default-on means everyone discovers it but the policy reminder is one &lt;code&gt;cat claw-forge.yaml | head -100&lt;/code&gt; away. I chose discovery.&lt;/p&gt;

&lt;p&gt;This is a feature that's intended for &lt;strong&gt;personal/internal distillation&lt;/strong&gt; — building a smaller model that imitates &lt;em&gt;your own&lt;/em&gt; Claude usage on &lt;em&gt;your own&lt;/em&gt; code, for &lt;em&gt;your own&lt;/em&gt; internal use. Distribution of derived models is the user's responsibility and emphatically out of scope. The provenance fields on every JSONL record (model name, capture date, claw-forge version, applied redaction rules) preserve a verifiable lineage if you ever need to demonstrate "this corpus came from my own Claude usage."&lt;/p&gt;

&lt;h2&gt;8. The training recipe is short and lives outside the harness&lt;/h2&gt;

&lt;p&gt;claw-forge stops at JSONL export. The downstream Unsloth/Axolotl/trl pipeline lives in a separate user repo. The harness has no &lt;code&gt;train&lt;/code&gt; command, no model registry, no inference layer. The reasons are scope hygiene: training stacks change fast, GPU dependencies are heavyweight, and the harness is supposed to run on any laptop. The recipe is documented in a markdown file (&lt;code&gt;docs/training/unsloth-recipe.md&lt;/code&gt;) and ships as a reference, not as code.&lt;/p&gt;

&lt;p&gt;The recipe at the time of writing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base model&lt;/strong&gt;: &lt;code&gt;unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit&lt;/code&gt;. Coder-specialized, native tool-use chat template, Apache 2.0 license, fits a 24 GB consumer GPU with LoRA. DeepSeek-Coder-V2-Lite-Instruct as the alternative if you have raw eval scores to chase and don't mind MoE finickiness for LoRA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoRA config&lt;/strong&gt;: r=16, alpha=32, target modules q/k/v/o + gate/up/down, dropout 0, gradient checkpointing on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training&lt;/strong&gt;: per-device batch 2, grad accumulation 8 (effective 16), 3 epochs, lr 2e-4, cosine schedule, adamw_8bit, max_seq_length 8192, bf16 if available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eval on the held-out test split&lt;/strong&gt;: ROUGE-L on assistant content (cheap reasoning-quality proxy), and &lt;strong&gt;diff similarity&lt;/strong&gt; on the model's reconstructed patch vs the captured &lt;code&gt;final_diff&lt;/code&gt; (the real correctness signal). The latter is the one that matters; the former is the one you watch during training to spot collapse.&lt;/li&gt;
&lt;/ul&gt;
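&lt;p&gt;For reference, the recipe's numbers written out as plain dicts. These mirror the bullets above; the actual Unsloth/trl invocation lives in the separate training repo, and the &lt;code&gt;*_proj&lt;/code&gt; module names are an assumed expansion of the q/k/v/o + gate/up/down shorthand:&lt;/p&gt;

```python
# The recipe as data, not as a train command: claw-forge stops at export.
lora_config = {
    "r": 16,
    "lora_alpha": 32,
    # assumed expansion of "q/k/v/o + gate/up/down" for a Qwen-style model
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "lora_dropout": 0.0,
    "gradient_checkpointing": True,
}

training_args = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "optim": "adamw_8bit",
    "max_seq_length": 8192,
}

effective_batch = (training_args["per_device_train_batch_size"]
                   * training_args["gradient_accumulation_steps"])
print(effective_batch)  # 16, the "effective 16" from the recipe
```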

&lt;p&gt;For a corpus of ~500 trajectories at average 6K tokens each, expect single-digit hours on a 4090 for a full training run. Calibrate with a 50-task dry run before committing. That's not a thousand-GPU pretraining job; it's a weekend's worth of consumer-grade compute on top of months of accumulated agent traces. The economics make sense at single-developer scale, which is the part nobody seems to be talking about yet.&lt;/p&gt;

&lt;h2&gt;9. What this doesn't solve (be honest)&lt;/h2&gt;

&lt;p&gt;Some things this design explicitly does not handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distribution rights.&lt;/strong&gt; I keep coming back to this because it's the part most likely to bite someone. Training a model on Sonnet's outputs and using it internally on your own code is one thing. Distributing that model — uploading weights to HuggingFace, releasing a derivative product — is a different thing and not protected by anything in this pipeline. Read the policy. Talk to a lawyer. The provenance stamping helps with audit; it does not authorize redistribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eval beyond diff similarity.&lt;/strong&gt; Diff similarity catches "did the model land the right code change" but it doesn't catch "did the model produce a clean, well-reasoned, well-commented solution." For that you need either human eval or LLM-as-judge eval, both of which sit outside the harness. The corpus &lt;em&gt;enables&lt;/em&gt; both, but the harness doesn't ship them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Claude-version mixing.&lt;/strong&gt; Every trace stamps the originating model name. Mixing trajectories captured under Sonnet 4.5 with trajectories captured under Opus 4.7 gives you a heterogeneous teacher signal. Sometimes that's what you want — pooled expert demonstrations across model strengths — and sometimes it isn't (when the lower-capability traces are noise). The provenance field lets you filter, but the harness has no opinion about &lt;em&gt;whether&lt;/em&gt; you should.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capturing failure modes that didn't go through claw-forge.&lt;/strong&gt; If the engineer drops out of the harness and edits a file by hand, none of that lands in the trace. The corpus represents what the &lt;em&gt;agent&lt;/em&gt; did, not what the &lt;em&gt;human&lt;/em&gt; did to clean up after the agent. For pure agent-distillation that's fine; for "train a model that handles the real workflow including human-in-the-loop fixups," this is a gap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-project corpus building.&lt;/strong&gt; Each project has its own state.db. Combining corpora across projects is &lt;code&gt;cat *.jsonl&lt;/code&gt; plus a check that the &lt;code&gt;provenance.claude_model&lt;/code&gt; and &lt;code&gt;claw_forge_version&lt;/code&gt; fields are compatible. Works fine for SFT, but if you're seriously building a multi-project corpus you want a manifest and a deduplicator that operates across files. That's tooling I haven't built yet.&lt;/li&gt;
&lt;/ul&gt;
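&lt;p&gt;The "&lt;code&gt;cat *.jsonl&lt;/code&gt; plus a check" step can be sketched in a few lines. The &lt;code&gt;provenance.claude_model&lt;/code&gt; and &lt;code&gt;final_diff&lt;/code&gt; field names follow this post; the dedup key and everything else are illustrative:&lt;/p&gt;

```python
# Merge per-project corpora, refusing mixed teacher models and
# deduplicating on the captured final_diff across files.
import json
from pathlib import Path

def merge_corpora(paths: list[Path], expect_model: str) -> list[dict]:
    merged, seen = [], set()
    for path in paths:
        for line in path.read_text().splitlines():
            rec = json.loads(line)
            prov = rec.get("provenance", {})
            if prov.get("claude_model") != expect_model:
                raise ValueError(f"{path}: unexpected teacher model {prov}")
            key = rec.get("final_diff")      # crude cross-file dedup key
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged
```

&lt;p&gt;A real multi-project manifest would carry more (capture dates, &lt;code&gt;claw_forge_version&lt;/code&gt; compatibility ranges), but this is the shape of it.&lt;/p&gt;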

&lt;p&gt;The honest framing: this design captures a &lt;em&gt;very&lt;/em&gt; specific kind of training data — the agentic coding loop on your own codebase, paired with the ground-truth diff that landed. That kind of data is unusually hard to come by and unusually valuable. It's not a substitute for general-purpose pre-training data, and it's not going to give you a model that handles tasks outside your codebase's distribution. It is going to give you a model that, on tasks similar to the ones you've been running, behaves more like Sonnet than the base Qwen2.5-Coder weights do. That's the win.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. The cultural shift I keep coming back to
&lt;/h2&gt;

&lt;p&gt;There's a meta-point that took me too long to internalize.&lt;/p&gt;

&lt;p&gt;If you're paying for Frontier-model API calls, the &lt;em&gt;expensive&lt;/em&gt; artifact isn't the code that ships. The code that ships is checkable, reviewable, reversible. The expensive artifact is the &lt;em&gt;expert demonstration&lt;/em&gt; — the nineteen turns of senior-engineer reasoning that took the model forty-three minutes to produce. You're paying for the trajectory whether you save it or not. Saving it is the line between "I rented a senior engineer for an hour" and "I rented a senior engineer for an hour and learned how they work."&lt;/p&gt;

&lt;p&gt;The harness equivalent of this insight is: &lt;strong&gt;the events table is a training corpus in disguise.&lt;/strong&gt; The schema was already there. The state service was already writing to it. Adding a new event type and teeing the SDK message stream into it is eighty lines of code. The data was always going to be high-value; the only question was whether you'd capture it.&lt;/p&gt;
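&lt;p&gt;A sketch of what those eighty lines boil down to. The table layout, event type name, and message format here are hypothetical stand-ins, not claw-forge's actual schema — the point is how small the capture primitive is:&lt;/p&gt;

```python
# Tee an agent message stream into an events table: persist each
# message as it passes through, yield it onward unchanged.
import json
import sqlite3

def capture(db: sqlite3.Connection, task_id: str, messages):
    db.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(task_id TEXT, type TEXT, payload TEXT)"
    )
    for msg in messages:
        db.execute(
            "INSERT INTO events VALUES (?, 'agent_trace', ?)",
            (task_id, json.dumps(msg)),
        )
        yield msg  # the consumer sees the stream exactly as before
    db.commit()

db = sqlite3.connect(":memory:")
stream = [{"role": "assistant", "content": "reading auth.py"}]
replayed = list(capture(db, "task-14", stream))  # stream passes through intact
```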

&lt;p&gt;I think more harnesses are going to do this in the next twelve months, and I think it's going to start showing up as a competitive feature. The teams running large agent fleets without trace capture are paying for expert demonstrations and discarding them. The teams running with trace capture have a data flywheel: every agent run produces both a feature and a training example. After six months of that, you have something to fine-tune. After twelve months, you might have a smaller model that handles the easy 60% of your tasks for an order of magnitude less per-call cost than the Frontier model that produced the training data. The Frontier model still handles the hard 40%. The cost curve bends.&lt;/p&gt;

&lt;p&gt;That's not a hypothetical; that's just SFT plus rejection-sampling with a corpus you already paid for. The mechanism is well-understood. The piece nobody is shipping yet — at least, not in the open-source agent harness landscape I follow — is the &lt;em&gt;capture primitive&lt;/em&gt;. I built it because I wanted it. I'm sharing the design because I think the rest of the ecosystem will arrive at this primitive eventually, and the sooner it's a commodity, the sooner the interesting work above it can start.&lt;/p&gt;




&lt;p&gt;If your harness throws away every &lt;code&gt;Message&lt;/code&gt; except the final &lt;code&gt;ResultMessage&lt;/code&gt;, you are walking past free training data every day. The fix is eighty lines, one nullable column, and a config gate. Build it before you next run the swarm.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Alex Chen builds AI-coding-agent infrastructure shipped to production. He runs ten-agent swarms daily and is currently waiting for his Qwen2.5-Coder fine-tune to finish so he can find out whether the months of captured Sonnet trajectories were worth the disk.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>claude</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Architectural Shape Hint: A Spec-Time Trick That Lets 10 AI Agents Run in Parallel Without Stepping on Each Other</title>
      <dc:creator>Alex Chen</dc:creator>
      <pubDate>Sun, 03 May 2026 06:12:10 +0000</pubDate>
      <link>https://forem.com/alex_chen_45b61c234682eb6/the-architectural-shape-hint-a-spec-time-trick-that-lets-10-ai-agents-run-in-parallel-without-2g69</link>
      <guid>https://forem.com/alex_chen_45b61c234682eb6/the-architectural-shape-hint-a-spec-time-trick-that-lets-10-ai-agents-run-in-parallel-without-2g69</guid>
      <description>&lt;p&gt;I run agent swarms now. Not "an agent" — &lt;em&gt;agents&lt;/em&gt;, plural, in flight at once, each working on a different feature against the same repo. Ten agents per session is normal. Twenty isn't unusual when the spec is well-decomposed. The token math works, the wall-clock math works, the model latency hides inside the swarm because something is always landing while something else is still compiling. The economics make a strong case for parallel execution as the default.&lt;/p&gt;

&lt;p&gt;Until you hit the wall everyone hits: &lt;strong&gt;two agents touched the same file&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I've spent the better part of the year fighting this. I've shipped four layers of runtime defense. They all work and none of them are the answer. The answer turned out to be one attribute on the spec. This is the post about that one attribute.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The four layers nobody told you you'd need
&lt;/h2&gt;

&lt;p&gt;Before I describe the fix, let me describe the disease — because if you're running parallel agents and you &lt;em&gt;don't&lt;/em&gt; recognize this stack, you're probably going to recognize it next week.&lt;/p&gt;

&lt;p&gt;When two agents in flight at once both want to edit &lt;code&gt;src/router/routes.py&lt;/code&gt;, here's what claw-forge (the harness I work in) does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;File-claim locks.&lt;/strong&gt; Each task declares &lt;code&gt;touches_files=[...]&lt;/code&gt; upfront. The dispatcher refuses to start a second task that wants a file currently held by a running task. The second task defers to the next dispatch cycle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-dispatch worktree sync.&lt;/strong&gt; Before the agent runs, the harness merges &lt;code&gt;target_branch&lt;/code&gt; into the feature branch &lt;em&gt;inside the worktree&lt;/em&gt;. If &lt;code&gt;target&lt;/code&gt; moved while the task was queued, the merge happens before any token is spent. Conflicts surface as &lt;code&gt;resume_conflict:&lt;/code&gt; failures with the offending file list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catch-up rebase inside &lt;code&gt;squash_merge&lt;/code&gt;.&lt;/strong&gt; When the agent's branch finally squash-merges to main and conflicts with concurrent work, the harness merges target into the branch and retries the squash automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume-on-retry preamble.&lt;/strong&gt; If a task fails mid-run, the next attempt picks up the worktree as-is, with a prompt prefix listing what's already committed and what failed last time. The agent doesn't redo the first 60% of the work.&lt;/li&gt;
&lt;/ol&gt;
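&lt;p&gt;Layer 1 reduces to a set-intersection check at dispatch time. A minimal sketch, with illustrative names:&lt;/p&gt;

```python
# Refuse to dispatch a task whose declared file claims intersect any
# running task's claims; the deferred task waits for the next cycle.
def can_dispatch(candidate_files: set[str], running_claims: list[set[str]]) -> bool:
    return all(not (candidate_files & claims) for claims in running_claims)

running = [{"src/router/routes.py", "src/router/schema.py"}]
assert can_dispatch({"src/plugins/auth/api.py"}, running)   # disjoint: dispatches
assert not can_dispatch({"src/router/routes.py"}, running)  # file held: defers
```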

&lt;p&gt;This stack is correct. Each layer earns its keep. If I deleted any one of them, real users would file real bug reports within 48 hours. But notice what they all have in common: &lt;strong&gt;they are reactive&lt;/strong&gt;. Every layer is a response to "two agents touched the same file." The conflict has already happened by the time the layer fires.&lt;/p&gt;

&lt;p&gt;What if it never happened?&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Conflicts are usually predictable from architecture
&lt;/h2&gt;

&lt;p&gt;Sit down with a senior engineer who has worked on a codebase for six months. Hand them a list of feature requests. Ask: "If we built these in parallel with one engineer per feature, where would the merge conflicts happen?" They'll be right within five minutes. They don't run the merges. They look at the codebase's structure and &lt;em&gt;know&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The reason they know is that conflicts cluster around &lt;strong&gt;architectural surfaces&lt;/strong&gt;. A few specific files — the dispatcher, the routes table, the global event bus, the error envelope, the auth middleware — get touched by every feature. Most other files are owned by one feature each. The conflict surface isn't uniformly distributed across the repo. It's concentrated on the structural choke points.&lt;/p&gt;

&lt;p&gt;This is the same insight that drives plugin architectures in big software systems. WordPress plugins don't conflict because each lives in &lt;code&gt;wp-content/plugins/&amp;lt;name&amp;gt;/&lt;/code&gt;. VS Code extensions don't conflict because each lives in its own directory and registers through a stable API. The host is small and stable. The plugins are everything else.&lt;/p&gt;

&lt;p&gt;If you build your codebase as a small core plus many plugins, &lt;em&gt;and&lt;/em&gt; your spec tells the harness which features are plugins versus core, &lt;em&gt;and&lt;/em&gt; the harness honors that distinction at scheduling time — then ten agents working on ten plugins literally cannot conflict. They are editing files in ten different directories. The locks are decorative. The catch-up rebase is dead code. The pre-dispatch sync is a no-op.&lt;/p&gt;

&lt;p&gt;This was the unlock. Encode the architectural intent in the spec. Let the scheduler use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Two shapes, one attribute
&lt;/h2&gt;

&lt;p&gt;Every feature in our specs now carries an architectural-shape attribute. There are exactly two shapes that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shape="plugin"&lt;/code&gt;&lt;/strong&gt; — vertical features. Live in their own directory, own their own data model, own their own tests. Adding or removing the plugin doesn't touch sibling plugins. Examples: "user can register," "user can edit profile," "task CRUD with tag filtering." Each lives in &lt;code&gt;src/plugins/&amp;lt;name&amp;gt;/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shape="core"&lt;/code&gt;&lt;/strong&gt; — cross-cutting concerns. Edit files used by every plugin. Examples: "all endpoints validate JWT," "uniform RFC 7807 error envelope," "global rate limit," "database connection pool." Each lives in &lt;code&gt;src/core/&amp;lt;concern&amp;gt;/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No tier, no taxonomy, no UML. Two values. The simplicity is load-bearing — if the classifier had three values it would have ten by next quarter, and the scheduling rule would have to handle a Cartesian product of cases.&lt;/p&gt;

&lt;p&gt;A spec entry now looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;feature&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"14"&lt;/span&gt; &lt;span class="na"&gt;shape=&lt;/span&gt;&lt;span class="s"&gt;"plugin"&lt;/span&gt; &lt;span class="na"&gt;plugin=&lt;/span&gt;&lt;span class="s"&gt;"auth"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;User can register with email and password&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;feature&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"20"&lt;/span&gt; &lt;span class="na"&gt;shape=&lt;/span&gt;&lt;span class="s"&gt;"core"&lt;/span&gt;
         &lt;span class="na"&gt;touches_files=&lt;/span&gt;&lt;span class="s"&gt;"src/core/middleware/auth.py"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;All endpoints validate JWT on incoming requests&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;plugin="auth"&lt;/code&gt; attribute auto-fills &lt;code&gt;touches_files&lt;/code&gt; to &lt;code&gt;["src/plugins/auth/**"]&lt;/code&gt;. The harness now knows that feature 14 will only touch files inside &lt;code&gt;src/plugins/auth/&lt;/code&gt;. Two &lt;code&gt;shape="plugin"&lt;/code&gt; features with different &lt;code&gt;plugin&lt;/code&gt; names are &lt;em&gt;guaranteed&lt;/em&gt; to be file-disjoint. Not "probably." Not "usually." Guaranteed by directory boundaries.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;shape="core"&lt;/code&gt; features the auto-derivation can't help — cross-cutting work touches a specific file by name. The author writes &lt;code&gt;touches_files="src/core/middleware/auth.py"&lt;/code&gt; explicitly. The parser refuses any spec where &lt;code&gt;shape="core"&lt;/code&gt; lacks a &lt;code&gt;touches_files&lt;/code&gt; value. Cross-cutting work without a declared file set is a bug in the spec, not a runtime decision the dispatcher gets to make.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The scheduling rule that follows
&lt;/h2&gt;

&lt;p&gt;Once shape is in the spec, the dispatcher gets two new rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shape="plugin"&lt;/code&gt; tasks dispatch freely up to &lt;code&gt;--concurrency N&lt;/code&gt;.&lt;/strong&gt; Their file sets are disjoint by construction. The file-claim lock layer becomes a sanity check rather than a primary defense. Plugin tasks scale linearly with concurrency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shape="core"&lt;/code&gt; tasks single-flight.&lt;/strong&gt; At most one cross-cutting task runs at a time, regardless of &lt;code&gt;--concurrency&lt;/code&gt;. Two core tasks both want to edit the auth middleware? They serialize. Always. No clever overlap analysis, no "well actually they touch different lines." Cross-cutting work is cheap to serialize — it's a small minority of features — and the cost of getting it wrong is high.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks without &lt;code&gt;shape&lt;/code&gt;&lt;/strong&gt; (legacy specs) fall through to the existing concurrency cap + file-claim lock behavior. Backward compatibility is free because the new rules are gated on &lt;code&gt;task.shape IS NOT NULL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduler's filter is twelve lines of Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ready_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskNode&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;ready&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_is_ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="c1"&gt;# Cross-cutting (shape="core") tasks single-flight: drop any
&lt;/span&gt;    &lt;span class="c1"&gt;# candidate ``core`` task from the ready set if another core task
&lt;/span&gt;    &lt;span class="c1"&gt;# is already running.
&lt;/span&gt;    &lt;span class="n"&gt;any_core_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;any_core_running&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ready&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ready&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire enforcement mechanism. The scheduler has no opinion about parallelism beyond this. The &lt;code&gt;touches_files&lt;/code&gt; lock layer handles the second-line defense for cases where a plugin author lied about their shape (which the code review should catch separately).&lt;/p&gt;
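&lt;p&gt;The second-line check — did the agent actually stay inside its declared claims — can be sketched with &lt;code&gt;fnmatch&lt;/code&gt;. One assumption worth flagging: &lt;code&gt;fnmatch&lt;/code&gt;'s &lt;code&gt;*&lt;/code&gt; matches across &lt;code&gt;/&lt;/code&gt;, so a &lt;code&gt;**&lt;/code&gt; claim behaves as a whole-subtree pattern here without any extra handling:&lt;/p&gt;

```python
# Verify an edited file falls inside the task's declared claim
# patterns. fnmatchcase's `*` matches path separators too, so
# "src/plugins/auth/**" covers the whole subtree.
from fnmatch import fnmatchcase

def within_claims(edited: str, claims: list[str]) -> bool:
    return any(fnmatchcase(edited, pat) for pat in claims)

assert within_claims("src/plugins/auth/api.py", ["src/plugins/auth/**"])
assert not within_claims("src/core/registry.py", ["src/plugins/auth/**"])
```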

&lt;h2&gt;
  
  
  5. Why this works structurally, not just behaviorally
&lt;/h2&gt;

&lt;p&gt;The thing that makes this approach durable is that the safety property is &lt;strong&gt;structural&lt;/strong&gt;: it's a consequence of file-system layout, not of clever runtime detection.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;src/plugins/auth/&lt;/code&gt; and &lt;code&gt;src/plugins/profile/&lt;/code&gt; are the only file sets two agents touch, there is no possible interleaving where they conflict. Not because the harness is smart. Because the files don't overlap. The same way two &lt;code&gt;git worktree&lt;/code&gt; instances on different branches can edit different files without any locking — git just doesn't see them as a conflict.&lt;/p&gt;

&lt;p&gt;Compare this to the old approach: "predict conflicts at runtime by checking which files each agent claims to touch." That works &lt;em&gt;if&lt;/em&gt; every agent honestly declares its file set. In practice, agents trying to wire a plugin into a registry often need to edit the registry too. They forget to declare the registry file. The lock layer doesn't fire. The merge conflicts at squash time. The whole reactive stack kicks in.&lt;/p&gt;

&lt;p&gt;The plugin-shape approach refuses to be in that situation. If your codebase has a registry that every plugin has to edit, that registry is a hotspot and you should restructure it — or declare it as &lt;code&gt;shape="core"&lt;/code&gt; and serialize work on it. The architecture catches up to the parallelism, not the other way around.&lt;/p&gt;

&lt;p&gt;This is also why the harness composes naturally with my project's &lt;code&gt;boundaries&lt;/code&gt; audit pass. That tooling already identifies hotspot files (registries, route tables, dispatch chains) and refactors them into plugin-extensible patterns. After a &lt;code&gt;boundaries apply --auto&lt;/code&gt; pass, the codebase is more amenable to plugin-shape features — fewer surfaces remain that &lt;em&gt;force&lt;/em&gt; a &lt;code&gt;shape="core"&lt;/code&gt; declaration. The two pieces — spec-time architectural intent and codebase structural refactoring — pull in the same direction. Each makes the other more effective.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. The brownfield path: refactor first, then extend
&lt;/h2&gt;

&lt;p&gt;Greenfield projects can be built plugin-shaped from day one. Brownfield projects — i.e. every project worth working on — usually have an existing dispatcher / route table / event bus that gets touched by every feature. You can't bolt plugin-shape semantics onto a codebase whose architecture isn't ready for them.&lt;/p&gt;

&lt;p&gt;So the brownfield workflow has an extra step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;analyze&lt;/code&gt; — generate a manifest with stack, conventions, test baseline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boundaries audit&lt;/code&gt; — emit &lt;code&gt;boundaries_report.md&lt;/code&gt; listing extension hotspots and the refactor pattern best suited to each (registry / split / route-table / extract-collaborators).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boundaries apply --auto&lt;/code&gt; — refactor each hotspot one at a time on its own feature branch with test gating. Squash-merges to main on green; reverts on red.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/create-spec&lt;/code&gt; — the slash command reads &lt;code&gt;boundaries_report.md&lt;/code&gt; first. If hotspots remain unrefactored, it warns the user before generating any spec. Then it asks &lt;code&gt;shape&lt;/code&gt; per feature.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;claw-forge add&lt;/code&gt; — runs the planner against the now-shape-aware spec.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Skipping step 3 is the costly mistake. New features land as &lt;code&gt;shape="plugin"&lt;/code&gt;, but the file-claim lock catches them when they try to edit the un-refactored hotspot, the dispatcher fails the task with &lt;code&gt;resume_conflict&lt;/code&gt;, and the agent has wasted one full attempt on stale state. Refactoring up front is cheaper than discovering you need to mid-flight. The boundaries harness exists exactly to make that "up front" step automatic.&lt;/p&gt;

&lt;p&gt;The cultural ask is: when adding non-trivial features to an existing codebase, do the structural work &lt;em&gt;first&lt;/em&gt;. That's not a new principle — it's "make the change easy, then make the easy change," Kent Beck, twenty years ago. Plugin-shape specs make this principle observable: if you can't write a clean spec without declaring half your features as &lt;code&gt;shape="core"&lt;/code&gt;, that's a structural signal, not a spec-writing failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. What this doesn't solve (be honest)
&lt;/h2&gt;

&lt;p&gt;I want to be careful not to oversell this. Here's what plugin-shape specs explicitly do not do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic conflicts inside a single plugin.&lt;/strong&gt; Two tasks for the same plugin (&lt;code&gt;plugin="auth"&lt;/code&gt;) still serialize via &lt;code&gt;touches_files&lt;/code&gt; locks. Adding "user can reset password" while "user can change email" is in flight will defer the second one until the first finishes. This is fine — it's the correct behavior — but it limits intra-plugin parallelism to one task at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-plugin coupling that wasn't designed in.&lt;/strong&gt; If your &lt;code&gt;tasks&lt;/code&gt; plugin imports from your &lt;code&gt;auth&lt;/code&gt; plugin's internals (and your codebase doesn't enforce plugin isolation via lint or import boundaries), edits to &lt;code&gt;auth/&lt;/code&gt; can break &lt;code&gt;tasks/&lt;/code&gt; after merge. The spec doesn't catch this; tests do. Treat the spec as a parallelism hint, not an isolation guarantee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared infrastructure changes.&lt;/strong&gt; A migration that adds a column to the &lt;code&gt;users&lt;/code&gt; table is &lt;code&gt;shape="core"&lt;/code&gt; because the migrations directory is shared. Two such migrations serialize. They have to — concurrent migration writers race on the migration sequence number. Don't try to plugin-ify your migrations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specifications written as shape-agnostic.&lt;/strong&gt; A feature whose acceptance criteria say "the system shall …" without naming a directory or file is hard to classify. Either rewrite the criterion to reference a concrete piece of the system, or accept that the feature won't get a &lt;code&gt;shape&lt;/code&gt; attribute and will fall through to legacy scheduling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest framing: plugin-shape specs make the &lt;em&gt;common&lt;/em&gt; parallelism case (many vertical features against a clean plugin host) trivial-safe. The hard cases — cross-cutting concerns, coupled plugins, shared infrastructure — still require engineering judgment. The win is that the common case becomes the default rather than the exception.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. The cultural shift this enables
&lt;/h2&gt;

&lt;p&gt;There's a meta-point here that's bigger than the technical mechanism.&lt;/p&gt;

&lt;p&gt;Most discussions of "AI agents at scale" focus on the &lt;em&gt;agent's&lt;/em&gt; capabilities — context window, reasoning depth, tool-use accuracy. Those matter, but they're not where the leverage is. The leverage is in &lt;strong&gt;encoding the human's architectural intent in a place the harness can read&lt;/strong&gt;. Specs are not just task descriptions for the agent. They're scheduling hints for the orchestrator. They're isolation declarations for the locks. They're refactoring targets for the boundaries pass. They're documentation for the next human reviewer.&lt;/p&gt;

&lt;p&gt;When you start writing specs that carry this much load, the spec format itself stops being a casual prose blob and becomes a structured contract. XML attributes that look fussy at first — &lt;code&gt;index&lt;/code&gt;, &lt;code&gt;depends_on&lt;/code&gt;, &lt;code&gt;shape&lt;/code&gt;, &lt;code&gt;plugin&lt;/code&gt;, &lt;code&gt;touches_files&lt;/code&gt; — earn their keep because every one of them maps to a runtime decision the harness will otherwise have to guess. Guessing is what produces the four-layer reactive stack. Declaring is what makes that stack a quiet backstop instead of a daily firefight.&lt;/p&gt;

&lt;p&gt;This is the same shift that happened in deployment automation a decade ago: declarative manifests beat imperative shell scripts because the &lt;em&gt;intent&lt;/em&gt; — "I want three replicas behind a load balancer" — was machine-readable rather than buried in a sequence of side-effecting commands. Plugin-shape specs are doing the same thing for AI-agent orchestration: making intent readable so the orchestrator can stop guessing.&lt;/p&gt;

&lt;p&gt;If you're building AI-coding-agent infrastructure right now and your dispatcher is making scheduling decisions based purely on what's in the queue, you're building the imperative-shell-script version of this. The declarative version — where the agents read what the human meant rather than what they typed — is meaningfully better, and it doesn't require a smarter model. It requires a more structured spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. The minimum implementation
&lt;/h2&gt;

&lt;p&gt;If you want to try this in your own harness, the minimum viable version is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One attribute on your task/feature object.&lt;/strong&gt; Call it &lt;code&gt;shape&lt;/code&gt;, &lt;code&gt;kind&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, whatever — but pick &lt;em&gt;exactly two&lt;/em&gt; values. "vertical" and "horizontal" works. "feature" and "infra" works. Two values. The temptation to add a third is a trap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One auto-derivation rule.&lt;/strong&gt; When &lt;code&gt;shape="plugin"&lt;/code&gt; and a &lt;code&gt;plugin="X"&lt;/code&gt; is set, the file-claim list defaults to &lt;code&gt;["plugins/X/**"]&lt;/code&gt;. One line of helper code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One scheduling rule.&lt;/strong&gt; When any &lt;code&gt;shape="core"&lt;/code&gt; task is running, drop other core tasks from the ready set. Twelve lines of Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One spec-time validation.&lt;/strong&gt; &lt;code&gt;shape="core"&lt;/code&gt; without an explicit file list raises an error before the planner runs. Five lines.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the whole ship. Total surface area: maybe 50 lines of harness code, plus the spec schema extension and the docs to teach the spec author what to declare.&lt;/p&gt;

&lt;p&gt;The minimum tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A round-trip test that parses the documented XML example and asserts the auto-derived file lists match (guards against doc/code drift).&lt;/li&gt;
&lt;li&gt;A scheduler test that adds two &lt;code&gt;shape="core"&lt;/code&gt; tasks and confirms only one is in the ready set when the other is running.&lt;/li&gt;
&lt;li&gt;A scheduler test that confirms &lt;code&gt;shape="plugin"&lt;/code&gt; tasks dispatch freely when a core task is running.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three tests. Done. The pattern compounds: now your codebase has a place to put new shape-aware behavior, and your spec authors have a place to encode new architectural intent. Future work — auto-derived shape inference via static analysis, telemetry on adoption rates, conflict-prediction at scheduler time — all builds on this primitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Closing thought
&lt;/h2&gt;

&lt;p&gt;The thing that took me too long to internalize is that &lt;strong&gt;parallelism is a property of the architecture, not the runtime&lt;/strong&gt;. You can't bolt safe parallelism onto a codebase whose architecture forces every feature through the same chokepoint. You can build elaborate runtime defenses against the resulting conflicts — and you should, because real codebases always have &lt;em&gt;some&lt;/em&gt; chokepoints — but the runtime defenses are the patch, not the cure.&lt;/p&gt;

&lt;p&gt;The cure is to design codebases where parallelism is structurally safe, and to encode that structural intent in the spec so the orchestrator can lean on it. Two values, one attribute, twelve lines of scheduler logic. That's the surface area of the win. The cost was a year of fighting the four-layer reactive stack to recognize that the layers were treating symptoms, not the disease.&lt;/p&gt;

&lt;p&gt;If your AI-agent harness is dropping conflicts on you, look at your spec format before you look at your dispatcher. The dispatcher is downstream. The spec is where the architecture lives.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Alex Chen builds AI-coding-agent infrastructure shipped to production. He runs ten-agent swarms daily and would like to thank the team's &lt;code&gt;boundaries&lt;/code&gt; harness for finally making it stop hurting.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Building an Autonomous Crypto Trading Bot</title>
      <dc:creator>Alex Chen</dc:creator>
      <pubDate>Sun, 03 May 2026 06:05:58 +0000</pubDate>
      <link>https://forem.com/alex_chen_45b61c234682eb6/building-an-autonomous-crypto-trading-bot-2lc4</link>
      <guid>https://forem.com/alex_chen_45b61c234682eb6/building-an-autonomous-crypto-trading-bot-2lc4</guid>
      <description>&lt;p&gt;I've been spending too much time inside trading bot codebases lately. Most of them are one of two things: a 200-line Jupyter notebook that someone calls a "system," or a sprawling monorepo where the strategy logic and exchange integration are so tangled that you can't swap exchanges without rewriting half the code.&lt;/p&gt;

&lt;p&gt;A few weeks ago I went deep on &lt;strong&gt;AlphaStrike&lt;/strong&gt;, a production-grade crypto perpetual futures bot. Not because the returns were headline-grabbing (though a 2.4 Sharpe is nothing to sneeze at), but because the architecture solves problems most of us hand-wave past. I want to walk through what's interesting, what's novel, and what I'd steal for my own projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;Algorithmic crypto trading sounds simple at the whiteboard: read prices, predict direction, place orders, manage risk. In practice, every layer of that stack will try to kill you.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exchanges are inconsistent.&lt;/strong&gt; WEEX, Binance, Hyperliquid — every one has different symbol formats, different REST paradigms, different WebSocket lifecycles, different ways of representing a position.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models decay.&lt;/strong&gt; A signal that worked last quarter doesn't work this quarter. Pretending otherwise is how accounts get blown up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volatility is non-stationary.&lt;/strong&gt; Static leverage and fixed position sizes are a lie you tell yourself until you wake up at -40% drawdown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure quant is fragile.&lt;/strong&gt; Numbers don't know that the SEC just sued the second-largest exchange.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AlphaStrike's design isn't trying to be the smartest bot. It's trying to be the bot that's still alive in 12 months. That's a different optimization target, and it shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture, Top-Down
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXCHANGE → DATA GATEWAY → FEATURE LAYER → FEATURE VALIDATOR
                                                    │
                                                    ▼
EXECUTION ← RISK LAYER ← STRATEGY LAYER ← ML LAYER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eight stages, every one of them able to halt the pipeline on its own. That's the first lesson: &lt;strong&gt;every layer is a potential circuit breaker.&lt;/strong&gt; If features fail validation (PSI drift, KS test, CUSUM), no signal reaches the model. If the risk layer flags exposure, no order reaches the exchange. Fail-closed by default.&lt;/p&gt;
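&lt;p&gt;The fail-closed idea is mechanical enough to pin down in a few lines. A sketch of the pattern, not AlphaStrike's actual plumbing: every stage either returns a value or returns &lt;code&gt;None&lt;/code&gt;, and &lt;code&gt;None&lt;/code&gt; halts the tick.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def run_tick(stages, payload):
    """Run one pipeline tick; any stage returning None halts the rest."""
    for stage in stages:
        payload = stage(payload)
        if payload is None:  # fail closed: no signal, no order
            return None
    return payload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;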

&lt;p&gt;Let me walk through the four pieces I actually want to talk about.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Exchange Abstraction Done Right
&lt;/h2&gt;

&lt;p&gt;This is where most trading bots rot. AlphaStrike defines two &lt;code&gt;Protocol&lt;/code&gt; classes — &lt;code&gt;ExchangeRESTProtocol&lt;/code&gt; and &lt;code&gt;ExchangeWebSocketProtocol&lt;/code&gt; — and every adapter (WEEX, Hyperliquid, Binance, generic OpenAPI) implements them. The trading logic only talks to the unified protocol.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@runtime_checkable&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExchangeRESTProtocol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ticker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;UnifiedTicker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;place_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UnifiedOrder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;UnifiedOrderResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_positions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;UnifiedPosition&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_leverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;leverage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The unified data models (&lt;code&gt;UnifiedOrder&lt;/code&gt;, &lt;code&gt;UnifiedPosition&lt;/code&gt;, &lt;code&gt;UnifiedCandle&lt;/code&gt;) are the contract. Every adapter has a &lt;code&gt;mappers.py&lt;/code&gt; that translates between exchange-native shapes and the unified shapes. Symbol normalization happens at the adapter boundary — internally everything is &lt;code&gt;BTCUSDT&lt;/code&gt;, externally it becomes &lt;code&gt;cmt_btcusdt&lt;/code&gt; or whatever WEEX wants this week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I care:&lt;/strong&gt; I've shipped trading code where exchange-specific assumptions leaked into the strategy. It's death by a thousand &lt;code&gt;if exchange == "binance"&lt;/code&gt; cuts. The Protocol-based approach keeps the boundary honest. You add a new exchange by writing one adapter file, not by hunting through the codebase.&lt;/p&gt;
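&lt;p&gt;A mapper at that boundary stays tiny. Here's a hedged sketch: the &lt;code&gt;cmt_btcusdt&lt;/code&gt; symbol style is the one quoted above, but the raw payload's field names are my guesses, not WEEX's documented schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class UnifiedTicker:
    symbol: str       # always the internal form, e.g. "BTCUSDT"
    last_price: float

def to_weex_symbol(symbol: str) -&amp;gt; str:
    # internal "BTCUSDT" becomes exchange-native "cmt_btcusdt"
    return f"cmt_{symbol.lower()}"

def from_weex_ticker(raw: dict) -&amp;gt; UnifiedTicker:
    # exchange-native payload to unified shape (field names assumed)
    return UnifiedTicker(
        symbol=raw["symbol"].removeprefix("cmt_").upper(),
        last_price=float(raw["last"]),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Strategy code only ever sees &lt;code&gt;UnifiedTicker&lt;/code&gt;; swapping exchanges means swapping this one file.&lt;/p&gt;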

&lt;h2&gt;
  
  
  2. The ML Layer That Doesn't Trust Itself
&lt;/h2&gt;

&lt;p&gt;The signal pipeline runs &lt;strong&gt;12 categories&lt;/strong&gt; of weak signals — order flow, microstructure, volatility, correlation, sentiment, seasonality, statistical, price action, volume, derivatives, alternative, macro — and combines them through a regime-aware ensemble. This is the explicitly Renaissance/Medallion-inspired bit, and the backtest deltas are real:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Single Signal&lt;/th&gt;
&lt;th&gt;12-Category Ensemble&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sharpe&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;td&gt;2.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Win Rate&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max Drawdown&lt;/td&gt;
&lt;td&gt;-15%&lt;/td&gt;
&lt;td&gt;-8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the part I find genuinely novel is the &lt;strong&gt;signal decay tracker&lt;/strong&gt;. Every signal logs its predictions, the system records outcomes, and signals get auto-retired when their rolling accuracy drops below 48%. Weight is &lt;code&gt;(edge × 2)²&lt;/code&gt;, so signals with real edge get amplified and weak signals fade out without anyone touching code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;            &lt;span class="c1"&gt;# 0.52 accuracy → 0.02 edge
&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;         &lt;span class="c1"&gt;# quadratic weighting of strong signals
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.48&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the right way to do it. Most "ensemble" systems use static weights tuned once and forgotten. Here the weights are alive — they update with reality. Models that lose their edge get fired by the system itself.&lt;/p&gt;
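&lt;p&gt;The whole mechanism fits in one small class. A sketch under assumptions: the rolling-window size and the class shape are mine; only the 48% retirement floor and the quadratic weight come from the system described above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

class TrackedSignal:
    def __init__(self, name: str, window: int = 200):
        self.name = name
        self.outcomes = deque(maxlen=window)  # 1 = correct call, 0 = miss
        self.retired = False

    def record(self, correct: bool) -&amp;gt; None:
        self.outcomes.append(1 if correct else 0)
        # only judge a signal once its window is full
        if len(self.outcomes) == self.outcomes.maxlen and self.accuracy() &amp;lt; 0.48:
            self.retired = True  # auto-retired, no human in the loop

    def accuracy(self) -&amp;gt; float:
        return sum(self.outcomes) / max(len(self.outcomes), 1)

    def weight(self) -&amp;gt; float:
        if self.retired:
            return 0.0
        edge = self.accuracy() - 0.5
        return max(edge * 2, 0.0) ** 2  # quadratic: strong edge amplified
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;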

&lt;h2&gt;
  
  
  3. Dynamic Leverage as a First-Class Citizen
&lt;/h2&gt;

&lt;p&gt;Static leverage is the crypto equivalent of running with scissors while drunk. AlphaStrike treats leverage as a continuous control variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;leverage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;vol_factor&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;dd_factor&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;perf_factor&lt;/span&gt;

&lt;span class="n"&gt;vol_factor&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normal_vol&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;current_vol&lt;/span&gt;     &lt;span class="c1"&gt;# clamped 0.3 to 1.5
&lt;/span&gt;&lt;span class="n"&gt;dd_factor&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;        &lt;span class="c1"&gt;# tiered by drawdown
&lt;/span&gt;&lt;span class="n"&gt;perf_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;half_kelly_fraction&lt;/span&gt;          &lt;span class="c1"&gt;# 0.6 to 1.2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real scenarios from the doc:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Conditions&lt;/th&gt;
&lt;th&gt;Leverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normal&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High vol (5%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In 12% drawdown&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strong perf + low vol&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All bad (high vol + DD + losing)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The leverage state lives in &lt;code&gt;data/state/leverage_state.json&lt;/code&gt; so it survives restarts. When the system reduces from 5x to 2x because volatility spiked, the next process boot doesn't forget. That detail matters more than it sounds — most bots reset to defaults on restart and quietly take on more risk than the operator thinks.&lt;/p&gt;
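&lt;p&gt;To make the table concrete, here's a hedged reconstruction. The drawdown tiers and the 2% baseline volatility are my assumptions, chosen so the outputs line up with the scenarios above; the bot's real thresholds may differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

BASE_LEVERAGE = 5.0
NORMAL_VOL = 0.02  # assumed baseline volatility

def compute_leverage(current_vol: float, drawdown: float, perf_factor: float) -&amp;gt; float:
    vol_factor = min(max(NORMAL_VOL / current_vol, 0.3), 1.5)
    if drawdown &amp;lt; 0.05:
        dd_factor = 1.0
    elif drawdown &amp;lt; 0.10:
        dd_factor = 0.7
    elif drawdown &amp;lt; 0.15:
        dd_factor = 0.5
    else:
        dd_factor = 0.3
    lev = BASE_LEVERAGE * vol_factor * dd_factor * perf_factor
    return round(max(lev, 1.0), 1)  # floor at 1x, never fully flat

def save_leverage_state(path: str, leverage: float) -&amp;gt; None:
    # persisted so a restart cannot silently reset risk to the default
    with open(path, "w") as f:
        json.dump({"leverage": leverage}, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With those assumptions, the five scenarios in the table fall out directly: normal conditions give 5.0x, 5% volatility gives 2.0x, a 12% drawdown gives 2.5x, strong performance plus low volatility gives 9.0x, and the everything-is-bad case bottoms out at the 1.0x floor.&lt;/p&gt;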

&lt;h2&gt;
  
  
  4. The LLM Layer That Knows Its Place
&lt;/h2&gt;

&lt;p&gt;Here's the part that surprised me. AlphaStrike has an LLM decision layer — a local Ollama-served &lt;code&gt;qwen2.5:1.5b&lt;/code&gt; — but its design philosophy is the opposite of what's currently fashionable. The LLM does not generate signals. It does not pick trades. It does not "reason about the market."&lt;/p&gt;

&lt;p&gt;It only intervenes &lt;strong&gt;when performance degrades.&lt;/strong&gt; When the rolling win rate drops below 40%, drawdown crosses 15%, or you stack 5 consecutive losses, the system hands the LLM a structured performance report and a tightly scoped tool palette:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;adjust_conviction(symbol, threshold, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;adjust_position_size(symbol, multiplier, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;adjust_leverage(new_leverage, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;disable_shorts(symbol, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;disable_asset(symbol, duration_hours, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;no_action(reason)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
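&lt;p&gt;That bounded surface is cheap to enforce mechanically. A sketch: the trigger thresholds are the ones stated above, while the whitelist check and its plumbing are my assumptions about how you'd wire it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;ALLOWED_TOOLS = {
    "adjust_conviction", "adjust_position_size", "adjust_leverage",
    "disable_shorts", "disable_asset", "no_action",
}

def should_invoke_llm(win_rate: float, drawdown: float, loss_streak: int) -&amp;gt; bool:
    # the three degradation triggers: win rate, drawdown, loss streak
    return win_rate &amp;lt; 0.40 or drawdown &amp;gt; 0.15 or loss_streak &amp;gt;= 5

def filter_tool_calls(calls: list[dict]) -&amp;gt; list[dict]:
    # anything outside the palette is dropped, never executed: fail closed
    return [c for c in calls if c.get("tool") in ALLOWED_TOOLS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;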

&lt;p&gt;Example LLM response when SOL is sitting at a 25% win rate, a 22% drawdown, and a seven-loss streak:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"adjust_position_size"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SOL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"multiplier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"adjust_conviction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SOL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"new_threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disable_shorts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SOL"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"send_alert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the right shape for LLMs in financial systems: &lt;strong&gt;bounded actions, explicit triggers, no inference loops touching live capital.&lt;/strong&gt; The model doesn't have to be smart; it has to be defensive. A 1.5B-parameter local model is more than enough when the action space is six tools wide.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Took Away
&lt;/h2&gt;

&lt;p&gt;Three things I'm stealing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Protocol-based exchange abstraction.&lt;/strong&gt; No more &lt;code&gt;if exchange ==&lt;/code&gt; chains. Define the contract once, swap implementations behind it. This generalizes way past trading.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-retiring signals with quadratic edge weighting.&lt;/strong&gt; Static feature weights are tech debt the moment you ship them. Make signal decay a first-class concept and let the data prune your own model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-as-circuit-breaker, not LLM-as-strategist.&lt;/strong&gt; The hype-cycle take is "use the LLM to pick trades." The mature take is "use the LLM to recognize when your quant system is dying and apply targeted, reversible, well-typed interventions." The hype-cycle take blows up your account. The mature take saves it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What I'd build next: an offline evaluation harness for the LLM's tool-call decisions. Right now the LLM's interventions only get evaluated by their downstream P&amp;amp;L impact, which is noisy and slow. A counterfactual replay framework — "what would have happened if the LLM had done nothing, or chosen a different tool?" — would let you tune the trigger thresholds and the prompt without burning real capital. That's where I'd put the next two weeks of engineering time.&lt;/p&gt;

&lt;p&gt;Trading bots are not magic. They're software systems that have to survive volatility, exchange flakiness, model decay, and operator panic. The systems that survive are the ones that take all four threats seriously at the architecture level — not the ones with the prettiest backtest curve.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>automation</category>
      <category>cryptocurrency</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
