Forem: ~K¹yle Million

I Told Claude Code to Build an Autonomous DeFi Liquidation Bot. Here's What Actually Happened

~K¹yle Million — Fri, 01 May 2026 01:03:49 +0000

The goal was simple: build something that generates revenue without me touching it.

I gave Claude Code a directive. No step-by-step instructions, no hand-holding. Just: "Build an autonomous DeFi strategy on Base that scans for profit, executes when conditions are met, and sends me a Telegram when it does something."

Three weeks and six versions later, here's what actually happened.

Versions 1–4: DEX Arbitrage Was Already Dead

The first instinct was DEX arb. Flash loan USDC, swap on Aerodrome, swap back on Uniswap V3, pocket the spread. Standard stuff.

Claude Code built it correctly. Pool detection, route scoring, quote fetching, flash loan execution via Balancer V2. The math checked out. The code was clean.

The problem was the market.

Flashblocks on Base (sub-second preconfirmations layered on top of the chain's 2-second blocks) mean MEV bots are processing opportunities faster than any external scanner can even detect them. By the time an off-chain process sees a spread and submits a transaction, the spread is gone. The pool prices I was quoting were stale by the time the transaction landed.

After four versions, 23 pools, 146 routes, and zero profitable executions, the scanner was honest about it: best spread = -$1.06 per $1000 flash. Market at equilibrium.

That's not a bug. That's correct behavior in a fully competitive market.

The Pivot: Morpho Blue Liquidations

I sent a research document about Morpho Blue's liquidation mechanics and told Claude Code to evaluate whether liquidation bots were a better fit than arb.

The analysis was thorough. Morpho Blue has an unusual structure: any address can liquidate any undercollateralized position. No whitelist. No keeper registry. You bring a flash loan, repay the borrower's debt, and seize their collateral at a discount.

The key insight: liquidation opportunities aren't race conditions in the same way as DEX arb. When a position crosses the liquidation threshold (health factor < 1.0), it doesn't disappear in 100ms. It sits there until someone closes it. The window is minutes to hours, not milliseconds.

This was the right target.

What Claude Code Built

The v7 liquidation daemon is 829 lines of Python. Key components:

Borrower indexing: The daemon seeds from Morpho's API (129 borrowers across AERO/USDC, weETH/WETH, cbXRP/USDC, uniBTC/USDC, wstETH/msETH markets), then stays live by watching Borrow events via eth_getLogs.

Health factor calculation: For each borrower, it reads their supply/borrow shares, converts to assets using the current market state, fetches the oracle price, and computes HF = (collateral × oracle_price/1e36 × LLTV) / borrowed_assets.
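
The calculation described above can be sketched in Python with integer math. This is a minimal sketch, not the daemon's actual code. The function names are hypothetical, and three things are assumptions to verify against the deployed contract: the virtual-share constants, the round-up on debt conversion, and treating LLTV as a 1e18-scaled integer.

```python
from fractions import Fraction

ORACLE_PRICE_SCALE = 10**36  # the 1e36 divisor from the formula above
WAD = 10**18                 # assumption: LLTV is stored as a 1e18-scaled integer
VIRTUAL_SHARES = 10**6       # assumption: virtual shares used in share/asset conversion
VIRTUAL_ASSETS = 1           # assumption: virtual assets used in share/asset conversion


def borrow_shares_to_assets(shares: int, total_borrow_assets: int,
                            total_borrow_shares: int) -> int:
    """Convert borrow shares to assets, rounding up (debt rounds against the borrower)."""
    numerator = shares * (total_borrow_assets + VIRTUAL_ASSETS)
    denominator = total_borrow_shares + VIRTUAL_SHARES
    return -(-numerator // denominator)  # ceiling division on integers


def health_factor(collateral: int, oracle_price: int, lltv_wad: int,
                  borrowed_assets: int) -> Fraction:
    """HF = (collateral * oracle_price / 1e36 * LLTV) / borrowed_assets."""
    if borrowed_assets == 0:
        raise ValueError("position has no debt, health factor is undefined")
    # Collateral valued in loan-token units, then capped by the liquidation LTV.
    max_borrow = collateral * oracle_price // ORACLE_PRICE_SCALE * lltv_wad // WAD
    return Fraction(max_borrow, borrowed_assets)
```

With made-up numbers: 1,000 units of an 18-decimal collateral token priced at 1 unit of a 6-decimal loan token (an oracle price of 10**24, assuming the price is scaled by 36 + loan decimals - collateral decimals), a hypothetical LLTV of 62.5%, and 500 loan tokens of debt, `health_factor(1000 * 10**18, 10**24, 625 * 10**15, 500 * 10**6)` comes out to 1.25. Dropping the oracle price term, as in Bug 1 below, leaves the raw 18-vs-6 decimal mismatch in the ratio.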

Liquidation execution: When HF < 1.0, it constructs a flash loan from Balancer V2, repays the bad debt, and seizes collateral in a single atomic transaction via a deployed LiquidationExecutor.sol.

Watchdog: A bash watchdog loop restarts the daemon on crash and fires a Telegram alert on repeated failures.

The Bugs That Almost Killed It

Here's where it gets honest.

Bug 1 — The HF formula was wrong. The initial implementation computed health factor as (collateral / borrowed) × LLTV. This is missing the oracle price term. Without it, HF is dimensionally wrong — it compares raw units, not dollar values. For a market where collateral is AERO and debt is USDC, a raw ratio gives you something like 10^12, which is never < 1.0. Zero liquidations possible.

Claude Code caught this itself during a v7 test run when the daemon logged HF=2847391847483.5 for a position that was obviously underwater according to Morpho's own interface. It diagnosed the missing oracle price term and rewrote the formula.

Bug 2 — The RPC rate-limiting cascade. The daemon uses Infura for multicall batches (reading all borrower positions per block). Infura's free tier has request limits. At 2-second Base blocks, the daemon was making ~400+ RPC calls per minute. Free tier exhausted in hours.

The fix: use eth_getLogs via public nodes (mainnet.base.org) for the slow indexing pass, and reserve the paid RPC for the latency-sensitive multicall reads on each new block.
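
One way to picture that split is a small router plus a block-range chunker for the log scan. This is a sketch under assumptions, not the daemon's code: the chunk size of 2,000 blocks is a guess at what a public node will tolerate, the function names are hypothetical, and the caller has to supply the contract address and the event's topic hash, neither of which appears in this post.

```python
from typing import Iterator, Tuple

PUBLIC_RPC_URL = "https://mainnet.base.org"  # public node named in the post


def block_ranges(start: int, end: int, chunk: int = 2_000) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (from_block, to_block) windows covering start..end."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    lo = start
    while lo <= end:
        hi = min(lo + chunk - 1, end)
        yield lo, hi
        lo = hi + 1


def get_logs_payload(address: str, topic0: str, from_block: int,
                     to_block: int, request_id: int = 1) -> dict:
    """Build a standard JSON-RPC eth_getLogs request body for one window."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getLogs",
        "params": [{
            "address": address,
            "topics": [topic0],
            "fromBlock": hex(from_block),
            "toBlock": hex(to_block),
        }],
    }


def endpoint_for(method: str, paid_rpc_url: str) -> str:
    """Route the slow indexing pass to the public node, everything else to the paid RPC."""
    return PUBLIC_RPC_URL if method == "eth_getLogs" else paid_rpc_url
```

The point of the chunker is that the indexing pass can fall behind without consequence, so it can afford many small requests against a free endpoint, while the per-block multicall keeps the paid endpoint to itself.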

Bug 3 — The stale endpoint problem. This one bit me in production yesterday. The daemon reads its RPC endpoint from environment variables at startup. I updated .env to switch from Infura to a Coinbase Base RPC (which is free and doesn't rate-limit), but the running daemon held onto the old Infura URL. It kept silently failing multicalls with 402 errors for 20 hours.

The fix: kill the process, let the watchdog restart it with the new .env. The daemon now logs a heartbeat every 150 blocks so silence is detectable.
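
A heartbeat like that can be very small. The sketch below is illustrative rather than the daemon's implementation: the class name and log format are hypothetical, and the failure counter is my own addition to show how 20 hours of failing multicalls could be made visible in the same log line.

```python
import logging
from typing import Optional

log = logging.getLogger("liquidation_daemon")  # hypothetical logger name


class Heartbeat:
    """Emit one log line every `every` blocks, including RPC failures since the last one."""

    def __init__(self, every: int = 150):
        if every <= 0:
            raise ValueError("every must be positive")
        self.every = every
        self.blocks_seen = 0
        self.failures = 0

    def record_failure(self) -> None:
        """Call this whenever an RPC read fails instead of swallowing the error."""
        self.failures += 1

    def on_block(self, block_number: int) -> Optional[str]:
        """Return the heartbeat line when one is due, otherwise None."""
        self.blocks_seen += 1
        if self.blocks_seen % self.every != 0:
            return None
        line = f"HEARTBEAT block={block_number} rpc_failures_since_last={self.failures}"
        self.failures = 0
        log.info(line)
        return line
```

At 2-second blocks, 150 blocks is about five minutes, so a log with no heartbeat for ten minutes, or a heartbeat with a nonzero failure count, is something an external check can alert on.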

Where It Stands

Right now, PID 536542:

Subscribed to newHeads on Base via WebSocket
Tracking 129 borrowers across 5 markets
Reading oracle prices and computing HF on every block
Checking Morpho Blue and Moonwell simultaneously

The markets are currently healthy. All tracked positions have HF > 1.0. No liquidation opportunities pending.

What's blocking live execution: the hot wallet (0x793e...) has 0.001 ETH. The flash loan contract needs ~0.009 ETH in gas to execute a liquidation. Until the wallet is funded, the daemon scans but can't pull the trigger.

This is not a software gap. It's a capital gap. The code is correct.

What Claude Code Can and Can't Do Here

Can do:

Build production-grade event-driven infrastructure from a research spec
Debug its own math errors when given live data to test against
Handle multi-contract interactions (Morpho + Chainlink + Balancer + Multicall3)
Run autonomously with watchdog recovery, disk caching, and Telegram alerting

Can't do (yet):

Fund its own gas wallet from zero capital
Predict when DEX markets are too efficient before building against them (though it assessed correctly once given data)
Guarantee correctness on first implementation — requires test runs against live data to surface the subtle math bugs

The honest summary: the infrastructure is sound. The engineering decisions are defensible. The remaining blocker is $25 of ETH.

The complete ops stack I used to manage this build — session memory, bash validation, coordinator resume integrity, compaction gate, and 11 other skills — is available at shopclawmart.com/@thebrierfox.

If you're building something like this and want to compare notes, drop a comment. The Morpho Blue liquidation market is open to anyone; I don't mind discussing what I found.

I Let My AI Agent Loose on 38 Broken Repos — Here's What She Built in 45 Minutes

~K¹yle Million — Thu, 23 Apr 2026 17:41:06 +0000

Last Tuesday, I opened Claude Code at 10am with no specific plan. I just said: "Audit every repo on my GitHub account and fix what you can."

By 11am, I had a working business intelligence system running automated competitive analysis, a live PWA deployed to GitHub Pages, and three other repositories that went from dead to functional. All without me writing a single line of code.

This isn't a hype post. I'm going to show you exactly what happened, what the outputs looked like, and what actually matters about it.

The Setup: 38 Repos, Most of Them Broken

I've been building AI systems for two years. My GitHub has 38+ repos under my handle, thebrierfox. Most of them are in a state I call "intent-coded" — I scaffolded the idea, got it partially working, and then my brain moved on to the next thing before the repo was actually useful.

Sound familiar?

The problem isn't starting. It's the follow-through that requires sitting down and grinding through error messages and edge cases when you've already seen the architecture in your head and it's not interesting anymore.

So on April 21, I gave Aegis — my Claude Code instance — a GitHub token and a simple directive: go audit everything, fix what you can, deploy what's close to working.

What She Found

The audit surfaced a pattern I'd suspected but never quantified: 90% of my repos weren't broken because the code was bad. They were stalled because of small, dumb things:

Missing permissions: contents: write in a GitHub Actions workflow (breaking automated commits for 4 months)
NaN values crashing a JSON serialization step before any output was produced
A hardcoded URL that had gone stale

These aren't architectural problems. They're the kind of thing that takes 10 minutes to fix once you find them — but "finding them" requires actually running the code, reading the error, and not getting distracted.
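
To make the NaN case concrete: the post does not show the actual patch, so the following is only one common way to guard a JSON step in Python, with hypothetical function names. It replaces non-finite floats with null before serializing, and uses `allow_nan=False` so that anything missed fails loudly instead of producing invalid JSON.

```python
import json
import math
from typing import Any


def scrub_nan(value: Any) -> Any:
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: scrub_nan(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub_nan(item) for item in value]
    return value


def dumps_strict(payload: Any) -> str:
    """Serialize to standards-compliant JSON; raises ValueError if a NaN slips through."""
    return json.dumps(scrub_nan(payload), allow_nan=False)
```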

Aegis ran through 38 repos. She flagged 5 as near-operational with fixable blockers.

The One That Mattered: OneShot_v3

Of the five repos she fixed, one stands out.

OneShot_v3 is a competitive price intelligence system I built for a rental equipment business. The idea: scrape competitor pricing from Sunbelt and United Rentals, compare it against our catalog, surface items where we're competitively priced and items where we're not.

The code was basically done. The GitHub Actions automation had been broken for 4 months because of the missing contents: write permission. It would run, fail silently, and never commit the analysis results.

Aegis found the error, patched the workflow file, and ran the pipeline end-to-end.

The output:

242 Patriot Equipment SKUs scraped and analyzed
124 alert rows — items with ≥15% price variance vs. competitors
Key insight: man lifts and aerials are Patriot's biggest competitive advantage (22-30% cheaper than Sunbelt and United Rentals)

That insight had been sitting dormant in a broken pipeline for 4 months. One patch, one run, real data.

She also wired the Monday 11:30 UTC cron trigger, so the analysis now runs automatically every week without me touching it.

The Other Four

While OneShot was running, she was also working on the other repos:

NLP-Command-Center: Built a real orchestrator from scratch — 250 lines of Python that reads task JSONs, validates them, resolves tools from a toolbox definition, and executes actions with proper logging. The repo had the concept but no working implementation. Now it has one.
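
The read, validate, resolve, execute loop described for NLP-Command-Center can be sketched in a few lines. This is not the repo's code: the task schema (`id`, `tool`, `args`) and the toolbox shape (a dict of callables) are assumptions made for illustration.

```python
import json
import logging
from typing import Any, Callable, Dict

log = logging.getLogger("orchestrator")  # hypothetical logger name

REQUIRED_FIELDS = ("id", "tool", "args")  # hypothetical task schema


def validate_task(task: dict) -> None:
    """Reject tasks that are missing fields or carry malformed arguments."""
    missing = [field for field in REQUIRED_FIELDS if field not in task]
    if missing:
        raise ValueError(f"task is missing fields: {missing}")
    if not isinstance(task["args"], dict):
        raise ValueError("task 'args' must be a JSON object")


def run_task(raw_json: str, toolbox: Dict[str, Callable[..., Any]]) -> Any:
    """Parse one task JSON, resolve its tool from the toolbox, and execute it."""
    task = json.loads(raw_json)
    validate_task(task)
    if task["tool"] not in toolbox:
        raise LookupError(f"unknown tool: {task['tool']}")
    log.info("task %s: calling %s with %s", task["id"], task["tool"], task["args"])
    result = toolbox[task["tool"]](**task["args"])
    log.info("task %s: done", task["id"])
    return result
```

For example, with `toolbox = {"add": lambda a, b: a + b}`, the task `{"id": "t1", "tool": "add", "args": {"a": 2, "b": 3}}` resolves to the `add` callable and returns 5.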

AegisRunner: Deployed as a live PWA at thebrierfox.github.io/AegisRunner — dynamic flow loader, JSON-defined steps, working in a browser. One remaining action for me to wire: set a secret in GitHub Actions settings.

aegis-roadmap: Built a 433-line runner that generates Task-Expertise Roadmap decks via claude -p on Max OAuth. Reads from a SQLite registry. Actually works.

intuitek-site: Discovered that the production site at intuitek.ai was deployed to a different Vercel account. Fixed the API key configuration issue that was causing /api/chat to 500.

What This Taught Me About Autonomous Agent Work

A few things became clear:

1. The bottleneck isn't code, it's attention. My repos weren't broken because I couldn't fix them. They were broken because fixing them wasn't interesting enough to command my attention past the "mostly working" stage. An agent that doesn't get bored is the unlock.

2. Near-completion repos are the highest ROI target. A repo that is 80% done needs only the last 20% of the effort to deliver 100% of the value. Concept-stage repos are cheap to start and expensive to complete. Aegis found the 80% items and prioritized them. That's the right heuristic.

3. Weekly automation is more valuable than one-time scripts. OneShot ran once and produced 124 insights. It runs every Monday now and will produce fresh insights every week. The cron trigger is worth as much as the initial fix.

4. The agent needs a GitHub token and authority to push. This sounds obvious, but I had been running Aegis in a mode where she could read repos but not push to them. Giving her write access was the unlock that made all of this possible. Give your agent the permissions it needs to actually complete work.

The Toolchain (What Made This Possible)

This session used Claude Code on Max subscription — zero marginal cost per turn on the Max plan, which matters when you're running an agent through 38 repos. Aegis also had:

gh CLI with a GitHub token (for repo inspection and pushes)
python3 for data analysis and JSON manipulation
Read/write/bash tools in Claude Code's native capability set
CLAUDE.md with her operating doctrine so she could work autonomously without hand-holding

No special plugins. No vector databases. No multi-agent orchestration. Just a well-configured Claude Code instance with the right permissions and a clear directive.

What's Next

OneShot_v3 is now a product. It runs on autopilot. The Monday analysis will land in my email each week.

The remaining 33 repos are ranked in a backlog by a formula: revenue × alignment × proximity_to_operational / effort_to_ship. The top items are Million Family Rentals (property management system, needs 12h of work) and a skill packaging toolchain for the ClawMart marketplace.
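
As a sketch, the ranking formula is a one-line score plus a sort. The field names follow the formula; the scales (what counts as 1.0 alignment, whether effort is in hours) are not specified in this post, so treat them as assumptions, and the class name is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BacklogItem:
    name: str
    revenue: float                   # expected revenue, in any consistent unit
    alignment: float                 # assumption: 0.0 to 1.0
    proximity_to_operational: float  # assumption: 0.0 to 1.0
    effort_to_ship: float            # assumption: hours of work remaining


def priority(item: BacklogItem) -> float:
    """revenue * alignment * proximity_to_operational / effort_to_ship"""
    if item.effort_to_ship <= 0:
        raise ValueError("effort_to_ship must be positive")
    return (item.revenue * item.alignment * item.proximity_to_operational
            / item.effort_to_ship)


def rank(items: Iterable[BacklogItem]) -> List[BacklogItem]:
    """Highest score first."""
    return sorted(items, key=priority, reverse=True)
```

Because effort sits in the denominator, a modest project that is nearly operational can outrank a larger one that still needs weeks of work, which is the same heuristic as lesson 2 above.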

Aegis will work through them in order. I check outputs. I don't manage execution.

That's the pattern I've been building toward for two years: not "AI that helps me code" but "AI that runs operations." Last Tuesday was the first time it felt fully real.

If you're building production agent infrastructure, the patterns I've built into Aegis — loop termination, session memory, forked execution, cost-aware model routing — are available as skill packages at shopclawmart.com/@thebrierfox. Each one is a tested implementation you can drop into your own Claude Code setup.

~K¹ (W. Kyle Million) / IntuiTek¹

Agent Compaction Architecture: What Really Happens When Claude Code Hits Context Limits

~K¹yle Million — Wed, 22 Apr 2026 13:09:45 +0000

Section 1: The Silent Killer

When Claude Code's context window fills, the runtime does not hard-stop. It doesn't throw an error. It doesn't ask permission. It compacts.

Compaction is an automatic summarization step that fires when the token budget crosses a threshold. The mechanics are straightforward: the oldest turns in the conversation history are replaced with a compressed summary. Recent turns — the last several exchanges — are preserved verbatim. The summary takes the place of everything older.
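
A toy model of that mechanism, for intuition only. This is not Claude Code's implementation: the budget, the number of recent turns kept, and the `summarize` callable are all hypothetical parameters.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str
    text: str
    tokens: int


def compact(history: List[Turn], budget: int, keep_recent: int,
            summarize: Callable[[List[Turn]], Turn]) -> List[Turn]:
    """Replace the oldest turns with one summary turn once the token budget is exceeded."""
    total = sum(turn.tokens for turn in history)
    split = len(history) - keep_recent
    if total <= budget or split <= 0:
        return history  # under budget, or nothing old enough to summarize
    old, recent = history[:split], history[split:]
    # Everything in `old` now exists only as whatever the summarizer kept.
    return [summarize(old)] + recent
```

Note what the return value implies: after the call, no turn in `old` can be recovered from the history. Whatever the summarizer left out is gone from the agent's point of view.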

From a token-budget perspective, this is correct behavior. There is no other option. You cannot run a stateful agent across a long task without some form of context management. The window is finite. The task is not.

The problem is the word "compressed." A summary is a lossy transformation. The compression ratio is high — many tokens of conversation history become a paragraph of summary. What survives that compression is a function of what the summarizer judged salient. Factual statements about what actions were taken survive well. Constraints survive partially. Nuanced reasoning about why a particular approach was chosen tends to survive poorly. Negative constraints — "don't touch X", "avoid this approach because..." — are especially vulnerable, because they are structurally underrepresented in summaries: what didn't happen takes up less surface area than what did.

Here is a concrete production failure I hit.

I had an agent working through a multi-step migration task. Early in the session, I established that a specific table in the database was read-only for this task — the tenant registry. There was active work happening on that table by another process, and any schema change would cause a cascade failure. I was explicit about it: "Do not touch the tenant_registry table. Do not add columns, do not create indexes, do not run any DDL against it."

The agent acknowledged this. It moved forward. It completed several unrelated subtasks. The context window filled. Compaction fired.

The summary captured the migration objective. It captured what had been completed. It mentioned the database was involved. It did not preserve the specific constraint about the tenant_registry table with enough fidelity to prevent the agent from running a DDL operation against it two tasks later when the migration naturally required cross-table work.

The operation succeeded at the database level. The cascade failure arrived async, from the other process. I found it in the error log four hours later.

Nothing in the session output flagged that compaction had occurred. Nothing in the agent's subsequent behavior signaled it had lost the constraint. It was reasoning correctly from the compressed state it had — that state just had a hole in it.

That is what makes compaction dangerous in autonomous operation. The agent doesn't know what it doesn't know. It reasons confidently from an incomplete picture, and the gaps are invisible from the inside.

Section 2: What Gets Lost and Why

Not all state is equally vulnerable to compaction. Understanding the failure modes requires a taxonomy.

Tool call results — high vulnerability

When the agent runs a Bash command and reads the output, that output lives in the conversation as a tool result. Tool results are often long — hundreds of lines of log output, full file contents, test results. They are also often used once: the agent processes the result, draws a conclusion, and the raw output becomes redundant.

From a summarization perspective, tool results are natural candidates for aggressive compression. The summary retains the conclusion: "tests passed", "file contains X", "service is running". The raw output is dropped.

This is fine when the raw output was truly just an input to a single conclusion. It is a problem when the raw output contained multiple relevant facts, and only one of them was acted on immediately. The rest are now gone. If a later step in the task needs one of those secondary facts, the agent will re-derive it, re-read the file, or get it wrong.

Intermediate conclusions — medium vulnerability

The agent builds up a model of the system as it works. "This service is stateless, so I can restart it without drain." "This config value is referenced in three places." "The test is flaky, not broken — ignore intermittent failures." These are conclusions drawn from evidence earlier in the session.

They are embedded in the conversation as reasoning traces — assistant turns explaining what the agent concluded and why. Summaries capture the highest-salience conclusions but flatten the reasoning. The "why" is the first thing to go.

When the "why" is gone, the agent may later reach the opposite conclusion from fresh evidence if that evidence is locally ambiguous. The earlier constraint has no backing anymore.

Explicit constraint acknowledgments — high vulnerability

"Remember, don't touch X." "Make sure to use approach Y for this module." "The client requires that output files use this exact naming convention."

Constraints stated conversationally, without a corresponding file artifact, are the most dangerous category. The agent acknowledged them. They shaped early decisions. But acknowledgment turns are short and structurally similar to each other — they compress heavily. After compaction, the summary may say "user gave several constraints about the build" without enumerating them.

The agent no longer has the specific list. It has a summary that there was a list.

Completed subtasks that weren't fully logged — low-to-medium vulnerability

Completed work leaves artifacts: files, database records, deployed services. Those artifacts exist independently of the conversation. The agent can re-inspect them.

The vulnerability here is more subtle: the decisions made during a subtask may be gone even when the subtask's outputs survive. The agent knows a file was written. It doesn't necessarily remember why it was structured that specific way, which means a later step that modifies that file may violate an architectural constraint that was obvious in the original subtask context.

Why summaries can't fully substitute for raw history

A summary is an agent-generated compression. Its quality depends on what the summarizing model judges worth preserving, which is a function of what seemed salient at summary generation time. Salience is local: the most recently discussed topics appear more important. Negative constraints are structurally invisible in summaries. Long reasoning chains compress to single-sentence conclusions.

The raw history is a ground truth. The summary is a lossy encoding. For short tasks with clear objectives, the loss is tolerable. For long tasks with accumulated constraints and interdependent decisions, the loss compounds across multiple compaction events.

Section 3: Compaction-Resistant Architecture

Four patterns. I use all of them in production. They compose — each layer backs up the others.

Pattern 1: Checkpoint Writes

At every significant milestone in a task, the agent writes the current state to a file. Not a summary of what it did — the live state that the next phase needs.

The checkpoint file is not documentation. It is a machine-readable context recovery artifact. The agent will read it at the start of each subsequent phase. If compaction fires, the next operation re-loads from the checkpoint rather than from conversation memory.

What belongs in a checkpoint:

Active constraints (including negative constraints — especially those)
Decisions made and the reason they were made
Current task state: what is complete, what is in progress, what is blocked
Any system facts that were discovered and are relevant going forward
Explicit re-statement of things that must not happen

The checkpoint is only useful if it is written before context-heavy operations. Writing it after means compaction may have already fired.

A checkpoint cadence that works: write before any operation that will consume more than a few thousand tokens (running tests, reading large files, invoking sub-agents, executing database migrations). Write at each logical phase boundary regardless of token consumption.

Pattern 2: Explicit State Re-Injection

Checkpoints are only useful if they are read. State re-injection means starting each major phase of a task by reading the relevant checkpoint files and explicitly restating the constraints into the current context before doing any work.

This is not redundant. After compaction, the conversation history is a summary. The most recent checkpoint is the last known-good full state. Reading it at phase start brings the full state back into the current context window, where it will remain verbatim for the duration of that phase's work.

The re-injection also serves as a correctness check: if the agent re-reads the checkpoint and notices that its current understanding diverges from what the checkpoint says, that divergence is a signal that something went wrong.

Re-injection should be explicit in the agent's prompt chain: "Before proceeding with phase N, read the phase N checkpoint file and confirm that all listed constraints are still active."
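
The divergence check can be as simple as a set comparison between the checkpoint's constraints and what the agent restates from memory. A minimal sketch with a hypothetical function name. It matches on normalized text only, so a paraphrased constraint will show up as both forgotten and unexpected, which is acceptable if any mismatch simply triggers a full re-read.

```python
from typing import Dict, Iterable, List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def constraint_divergence(checkpoint_constraints: Iterable[str],
                          recalled_constraints: Iterable[str]) -> Dict[str, List[str]]:
    """Compare saved constraints with the agent's restatement of them."""
    saved = {_normalize(c): c for c in checkpoint_constraints}
    recalled = {_normalize(c): c for c in recalled_constraints}
    return {
        # In the checkpoint but not restated: candidates for compaction loss.
        "forgotten": [saved[key] for key in saved if key not in recalled],
        # Restated but never checkpointed: possibly invented or misremembered.
        "unexpected": [recalled[key] for key in recalled if key not in saved],
    }
```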

Pattern 3: Compaction Detection

There is no native "compaction occurred" event exposed by Claude Code's context. You cannot query whether compaction has fired. But you can detect it indirectly.

Compaction detection relies on a sentinel: a value written to a file at task start that the agent is instructed to re-read and verify at each phase boundary. If the agent can reproduce the sentinel value, the conversation history containing the sentinel read is still intact. If it cannot, compaction has likely compressed that turn.

More practically: you can detect behavioral evidence of compaction by testing the agent's recall of specific early-session constraints before proceeding. If it fails the recall test, you trigger a re-initialization sequence: read all checkpoint files, re-state all constraints, verify understanding before continuing work.

The detection overhead is low — a single file read and a short verification step. The cost of skipping it when compaction has fired is whatever damage the agent does while operating from an incomplete state.

Pattern 4: Session Segmentation

For tasks that will span many hours and many phases, a single ultra-long session is architecturally unsound. Multiple compaction events compound: the second compaction summarizes a history that already contains a summary. Information loss accelerates with each event.

Session segmentation means treating the task as a sequence of bounded sessions, each with a clean handoff file. Session N completes some work, writes a handoff file that captures the full state needed by session N+1, then exits cleanly. Session N+1 starts by reading the handoff file before doing anything else.

Each session starts fresh — full context window, no compaction debt. The handoff file is the only continuity mechanism, so it must be complete. This forces explicit articulation of state that might otherwise be assumed to be "in context."

The segmentation boundary should align with natural task phases. "Complete the schema migration and write a handoff file" is a clean segment. "Do some of the migration and some of the testing" is not.

Section 4: Code Examples

Checkpoint Write — Python

import json
from datetime import datetime, timezone
from pathlib import Path

def write_checkpoint(checkpoint_dir: str, phase: str, state: dict) -> Path:
    """
    Write a phase checkpoint before any context-heavy operation.
    Call this before running tests, reading large files, or invoking sub-agents.
    """
    path = Path(checkpoint_dir) / f"checkpoint_{phase}.json"
    # Create the checkpoint directory on first use so the write cannot fail on a missing path.
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "phase": phase,
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated as of Python 3.12.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "constraints": state.get("constraints", []),
        "decisions": state.get("decisions", {}),
        "do_not_touch": state.get("do_not_touch", []),
        "completed_tasks": state.get("completed_tasks", []),
        "in_progress": state.get("in_progress", ""),
        "facts": state.get("facts", {}),
    }
    path.write_text(json.dumps(payload, indent=2))
    return path


# Example usage before a database migration phase
write_checkpoint(
    checkpoint_dir="./outputs/session_checkpoints",
    phase="pre_migration",
    state={
        "constraints": [
            "Use WAL mode for all SQLite writes",
            "No DDL against tenant_registry table — active writes from separate process",
            "Output files must use snake_case naming convention",
        ],
        "do_not_touch": ["tenant_registry", "auth_tokens"],
        "decisions": {
            "schema_approach": "additive_only",
            "schema_approach_reason": "existing consumers cannot handle column removal",
        },
        "completed_tasks": ["schema_audit", "backup_verification"],
        "in_progress": "column_additions_to_user_profiles",
        "facts": {
            "db_path": "/data/production.db",
            "backup_verified_at": "2026-04-22T09:14:00Z",
        },
    }
)

Checkpoint Read + Re-Injection — Python

import json
from pathlib import Path

def load_checkpoint(checkpoint_dir: str, phase: str) -> dict:
    """
    Load checkpoint at phase start. Re-state all constraints before proceeding.
    This is your recovery path after a compaction event.
    """
    path = Path(checkpoint_dir) / f"checkpoint_{phase}.json"
    if not path.exists():
        raise FileNotFoundError(
            f"No checkpoint found for phase '{phase}'. "
            "Cannot proceed without known-good state."
        )
    state = json.loads(path.read_text())

    # Emit re-injection block — this goes into the agent's active context
    print(f"=== RE-INJECTING STATE FROM CHECKPOINT: {phase} ===")
    print(f"Timestamp: {state['timestamp']}")
    print("\nACTIVE CONSTRAINTS (must be honored for remaining work):")
    for c in state["constraints"]:
        print(f"  - {c}")
    print("\nDO NOT TOUCH:")
    for item in state["do_not_touch"]:
        print(f"  - {item}")
    print("\nKEY DECISIONS:")
    for k, v in state["decisions"].items():
        print(f"  {k}: {v}")
    print("=== END STATE RE-INJECTION ===\n")

    return state

Compaction Detection — Bash

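#!/usr/bin/env bash
# compaction_check.sh
# Write a sentinel at task start; verify it at each phase boundary.
# If verification fails, trigger re-initialization before proceeding.
#
# Usage:
#   compaction_check.sh write
#   compaction_check.sh verify <phase> <sentinel value as recalled by the agent>

SENTINEL_FILE="./outputs/session_sentinel.txt"
CHECKPOINT_DIR="./outputs/session_checkpoints"

write_sentinel() {
    local session_id
    session_id="$(date +%s)-$$"
    mkdir -p "$(dirname "$SENTINEL_FILE")"
    echo "$session_id" > "$SENTINEL_FILE"
    echo "SENTINEL_WRITTEN: $session_id"
}

verify_sentinel_or_reinit() {
    local phase="$1" recalled="$2"
    if [[ ! -f "$SENTINEL_FILE" ]]; then
        echo "SENTINEL_MISSING: no sentinel file found, running re-initialization"
        reinitialize_from_checkpoints
        return 1
    fi
    local stored_sentinel
    stored_sentinel="$(cat "$SENTINEL_FILE")"
    # The file itself survives compaction; the agent's memory of its value may not.
    # A mismatch means the turn that read the sentinel has likely been summarized away.
    if [[ "$recalled" != "$stored_sentinel" ]]; then
        echo "COMPACTION_DETECTED: recalled sentinel does not match, running re-initialization"
        reinitialize_from_checkpoints
        return 1
    fi
    echo "SENTINEL_OK: $stored_sentinel, proceeding with phase $phase"
    return 0
}

reinitialize_from_checkpoints() {
    echo "=== COMPACTION RECOVERY: loading all available checkpoints ==="
    for f in "$CHECKPOINT_DIR"/checkpoint_*.json; do
        [[ -f "$f" ]] || continue
        echo "--- Loading: $f ---"
        # Pass the path as an argument instead of interpolating it into the Python source.
        python3 -c '
import json, sys
with open(sys.argv[1]) as fh:
    state = json.load(fh)
print("Phase:", state["phase"], "@", state["timestamp"])
print("Constraints:")
for c in state.get("constraints", []):
    print("  -", c)
print("Do not touch:", state.get("do_not_touch", []))
' "$f"
    done
    echo "=== RECOVERY COMPLETE: all constraints re-loaded ==="
}

case "${1:-}" in
    write)  write_sentinel ;;
    verify) verify_sentinel_or_reinit "${2:-unknown}" "${3:-}" ;;
    *)      echo "usage: $0 write | verify <phase> <recalled_sentinel>" >&2; exit 2 ;;
esac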

Session Handoff File — Python

import json
from datetime import datetime, timezone
from pathlib import Path

def write_handoff(output_dir: str, session_id: str, next_session_instructions: dict) -> Path:
    """
    Write a clean handoff file at the end of a session segment.
    The next session reads this before doing any work.
    This file is the ONLY continuity mechanism between sessions.
    It must be complete: assume the next session has zero prior context.
    """
    path = Path(output_dir) / f"handoff_{session_id}.json"
    # Create the output directory if this is the first handoff written to it.
    path.parent.mkdir(parents=True, exist_ok=True)
    handoff = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated as of Python 3.12.
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "from_session": session_id,
        "next_session_start_instructions": (
            "Read this file completely before any other action. "
            "All constraints listed here are active. "
            "Do not proceed without acknowledging each constraint."
        ),
        # Required keys: a KeyError here means the handoff would have been incomplete.
        "task_objective": next_session_instructions["objective"],
        "completed_this_session": next_session_instructions["completed"],
        "next_phase": next_session_instructions["next_phase"],
        "hard_constraints": next_session_instructions["constraints"],
        "do_not_touch": next_session_instructions["do_not_touch"],
        "key_facts": next_session_instructions["facts"],
        # Optional keys default to empty lists.
        "open_questions": next_session_instructions.get("open_questions", []),
        "known_risks": next_session_instructions.get("known_risks", []),
    }
    path.write_text(json.dumps(handoff, indent=2))
    print(f"Handoff written to: {path}")
    print(f"Next session must read: {path.name}")
    return path


# Example: end of session 1 of a multi-session migration
write_handoff(
    output_dir="./outputs",
    session_id="migration_s1",
    next_session_instructions={
        "objective": "Complete user profile schema migration and deploy to staging",
        "completed": [
            "Schema audit complete — findings in outputs/schema_audit.json",
            "Backup verified — outputs/backup_verification.md",
            "Column additions to user_profiles — migration script at migrations/002_add_profile_fields.sql",
        ],
        "next_phase": "Run migration against staging, execute integration test suite, write test report",
        "constraints": [
            "No DDL against tenant_registry — active concurrent writes",
            "Migration must be additive only — no column drops",
            "Staging deploy requires RAILS_ENV=staging explicitly set",
        ],
        "do_not_touch": ["tenant_registry", "auth_tokens", "legacy_session_keys"],
        "facts": {
            "staging_db": "postgres://staging-host:5432/app_staging",
            "migration_tool": "alembic",
            "test_suite": "pytest tests/integration/",
            "expected_test_count": 47,
        },
        "known_risks": [
            "Test DB may have stale fixtures — run pytest --setup-show to verify fixture state",
        ],
    }
)

The Architecture in Summary

Compaction is not a bug to work around. It is a fundamental constraint of context-window-bounded agents. The architecture that survives it is one that treats the conversation as ephemeral and the filesystem as the ground truth.

Checkpoint writes externalize state before it can be lost. Re-injection restores full context after a compaction event. Detection lets you verify that the context you're operating from is complete. Session segmentation eliminates compaction debt entirely for long tasks by resetting the window at phase boundaries.

None of these patterns are expensive. A checkpoint file write takes milliseconds. A re-injection read adds a few hundred tokens to the current context. The compaction detection sentinel is a single file read. A handoff file is twenty lines of JSON.
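
As a rough illustration of how small the re-injection read is, here is a minimal sketch that loads a handoff file and renders the parts a new session needs first. It assumes the key names written by write_handoff above; load_handoff, render_reinjection, and REQUIRED_KEYS are hypothetical names for this example, not part of the packaged skill.

```python
import json
from pathlib import Path

# Keys the handoff writer above always emits (treated as required in this sketch).
REQUIRED_KEYS = (
    "task_objective",
    "completed_this_session",
    "next_phase",
    "hard_constraints",
    "do_not_touch",
    "key_facts",
)


def load_handoff(path: str) -> dict:
    """Read a handoff file and fail loudly if any required key is missing."""
    handoff = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in handoff]
    if missing:
        raise ValueError(f"Handoff file {path} is missing keys: {missing}")
    return handoff


def render_reinjection(handoff: dict) -> str:
    """Build a short text block to place at the top of a new session."""
    lines = [f"Objective: {handoff['task_objective']}", "Hard constraints:"]
    lines += [f"- {constraint}" for constraint in handoff["hard_constraints"]]
    lines.append("Do not touch: " + ", ".join(handoff["do_not_touch"]))
    lines.append(f"Next phase: {handoff['next_phase']}")
    return "\n".join(lines)
```

Failing on a missing key is deliberate: a partial handoff is exactly the silent state loss the file exists to prevent.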

The cost of not using them is the kind of failure that doesn't announce itself — an agent that proceeds confidently from a state it believes is correct, into work that violates a constraint it no longer remembers.

I packaged the full compaction-resistant architecture — detection hooks, checkpoint templates, re-injection patterns, and session handoff schemas — as a ClawMart skill: Agent Compaction Architecture — Production Context Management. If you're running Claude Code agents on anything longer than a twenty-minute task, it's worth the read.

~K¹ (W. Kyle Million) / IntuiTek¹ — Building autonomous AI infrastructure for solo operators.

Tags: claudecode, devtools, aiagents

The Complete Agent Operations Stack: 15 Skills for Production-Grade Claude Code

~K¹yle Million — Wed, 22 Apr 2026 13:09:21 +0000

All this week I've published articles about individual production patterns for Claude Code: loop termination, session memory, memory scoping, coordinator resume, bash security. Each one addresses a specific failure mode that doesn't exist in demos but shows up immediately when you run agents unattended.

This article ties them together. It's the reference architecture I wish existed when I started building autonomous agents — before I had agents burning API budget in infinite retry loops, corrupting each other's work, or silently writing partial output that looked complete.

The gap between "works in a demo" and "runs for 30 days without intervention" is not about model quality. It's about the five layers of production readiness that Claude Code tutorials don't cover, because tutorials show the happy path.

The Production Gap

Here's what a Claude Code demo looks like:

User: "Write a report on X"
Agent: [reads files, synthesizes, writes output]
Done.

Here's what production looks like:

The agent runs at 2am via cron with no one watching
It hits a network error on step 12 of 30 and retries 80 times
Two instances start simultaneously and overwrite each other's context files
The context window hits its limit mid-task and the next session has no idea where it left off
A sub-agent writes a bash command that touches a path it shouldn't
The coordinator that dispatched three agents loses its session and restarts all three
The agent finishes successfully but consumed 6x the expected API budget because it loaded the same large file 40 times

None of these are model failures. They're infrastructure failures. The model did exactly what it was instructed to do. The architecture didn't account for the environment the model runs in.

The five layers below are the minimum viable production architecture for any Claude Code agent that runs unattended.

The Five Layers of Production Readiness

Layer 1: Security

What can go wrong: An agent with broad Bash tool access will, eventually, execute a command in a way you didn't anticipate. Maybe it interpolates a variable into a shell command unsafely. Maybe it runs rm -rf on a path that turns out to be wrong. Maybe it writes credentials to a log file. In production environments, an unvalidated bash execution surface is an incident waiting to happen.

The skills that address this:

Bash Security Validator catches the class of vulnerabilities that come from how agents construct shell commands: unquoted variables, command injection via interpolation, unsafe redirects, pipes to eval. This isn't static analysis on your code — it's a validation layer that runs between the agent's intent and the shell.

Production Agent Security Hardening addresses the broader surface: what tools the agent can access, which paths it's allowed to write, how credentials are handled, and what happens when a security boundary is tested. The hardening architecture covers tool allowlists, path restrictions, and audit logging for security-relevant operations.

Without this layer, you're running an agent that has the same access as a logged-in user and considerably less judgment about when to use it.

Failure signature: Agent executes rm -rf on a wrong path. Agent leaks an environment variable into an output file. Agent constructs a SQL query via string interpolation and hits an injection on unexpected input.
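
To make the path-restriction idea concrete, here is a minimal sketch of a check that a write target stays inside an allowed root. The function name is_write_allowed and the example paths are assumptions for illustration; this is not the hardening skill's implementation.

```python
from pathlib import Path


def is_write_allowed(target: str, allowed_root: str) -> bool:
    """Return True only if target resolves to a location inside allowed_root."""
    root = Path(allowed_root).resolve()
    resolved = Path(target).resolve()  # collapses ".." and follows symlinks
    # is_relative_to (Python 3.9+) compares whole path components, so
    # "/work/outputs_evil" is not treated as being inside "/work/outputs".
    return resolved == root or resolved.is_relative_to(root)
```

Resolving before comparing is the point of the sketch: a plain string prefix check would accept outputs/../secrets.env.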

Layer 2: Memory

What can go wrong: Claude Code agents have excellent in-context reasoning. They have zero built-in persistence. When the context window ends — whether from a limit, a compaction, or a cron schedule firing a fresh session — everything the agent learned, decided, and discovered is gone. The next session starts from scratch.

At scale, this produces three distinct failure patterns: repeated discovery (re-doing work already done), decision context loss (making a conflicting choice because the constraint that ruled it out is no longer in context), and progress tracking failure (processing the same files twice because there's no record of what was already processed).

The skills that address this:

Agent Memory Scoping handles the concurrent case: when two agents run simultaneously, they need isolated memory namespaces. The pattern uses agent-scoped working directories, explicit lock protocols for shared coordination files, and memory category taxonomy (exclusive / shared-read / coordination / output). Without this, concurrent agents corrupt each other's working state.

Session Memory Architecture handles the temporal case: single agents running across multiple context windows. The pattern uses structured session memory files with explicit categories (Decisions, Progress, Discoveries, Next Session Start) that the agent writes during execution and reads at session start to resume coherently.

Agent Compaction Architecture handles the context pressure case: an agent operating near its context limit needs to proactively write out critical context before compaction removes it. This isn't reactive — it's built into the agent's operating protocol. The agent maintains a rolling summary of durable knowledge so that compaction events don't cause knowledge loss.

All three of these address the same root problem from different angles: context is not memory, and production agents need persistent memory.

Failure signature: Agent re-processes files it already completed. Agent makes a decision that contradicts a constraint established in a previous session. Two concurrent agents write to the same path and one loses its work.
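
A minimal sketch of the session memory file, assuming one entry per line tagged with one of the four categories named above. The line format and the function names append_memory and read_memory are assumptions for this example; the packaged skill may structure the file differently.

```python
from pathlib import Path

CATEGORIES = ("Decisions", "Progress", "Discoveries", "Next Session Start")


def append_memory(path: str, category: str, entry: str) -> None:
    """Append one single-line entry as '[Category] text'; the file is append-only."""
    if category not in CATEGORIES:
        raise ValueError(f"Unknown memory category: {category}")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"[{category}] {entry}\n")


def read_memory(path: str) -> dict:
    """Group entries by category; a missing file simply means a first session."""
    memory = {category: [] for category in CATEGORIES}
    file = Path(path)
    if not file.exists():
        return memory
    for line in file.read_text(encoding="utf-8").splitlines():
        for category in CATEGORIES:
            prefix = f"[{category}] "
            if line.startswith(prefix):
                memory[category].append(line[len(prefix):])
    return memory
```

At session start the agent would call read_memory and apply the Decisions entries before doing anything else; during execution it appends as it goes, so a context reset loses at most the current step.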

Layer 3: Flow Control

What can go wrong: An uncontrolled agent will pursue its goal until it either succeeds or exhausts resources. With no circuit breaker, a stuck agent retries indefinitely. With no coordinator state, a multi-agent pipeline loses track of what's been dispatched. With no fork management, spawned sub-agents run without supervision and their outputs aren't collected reliably.

This layer is where most production incidents live, because flow control failures are expensive and hard to detect from the outside.

The skills that address this:

Loop Termination Architecture implements the circuit breaker pattern at three levels: a step counter (hard limit that stops runaway loops), an error accumulation counter (smart limit that stops stuck loops retrying the same error class), and a goal proximity check (semantic limit that stops false progress spirals). The article earlier this week goes deep on this pattern.

Coordinator Resume Integrity handles the multi-agent orchestration case: a coordinator agent that dispatches sub-agents must maintain a persistent dispatch ledger so that if the coordinator's session ends mid-pipeline, the next coordinator session can resume from exactly where it left off — skipping completed tasks and re-running only what's still pending.

Forked Agent Architecture handles the sub-agent lifecycle case: when you fork agents to parallelize work, you need patterns for launching them cleanly, tracking their completion, handling their failures, and collecting their outputs without conflicts. Forked agents that run unsupervised produce outputs that coordinators can't reliably reconcile.

Failure signature: Agent retries a permission error 150 times before context death. Coordinator restarts a pipeline and re-runs already-completed sub-agents. Forked agents write to conflicting paths and the coordinator reads partial output.
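
Here is a minimal sketch of the first two circuit breaker levels, the step counter and the error accumulation counter. The class name LoopGuard, the helper run_with_guard, and the default limits are assumptions for this example; the goal proximity check is left out because it needs task-specific logic.

```python
from collections import Counter


class LoopGuard:
    """Stops a loop on a hard step limit or on repeated errors of one class."""

    def __init__(self, max_steps: int = 30, max_same_error: int = 3):
        self.max_steps = max_steps
        self.max_same_error = max_same_error
        self.steps = 0
        self.errors = Counter()

    def record_step(self) -> None:
        self.steps += 1

    def record_error(self, error_class: str) -> None:
        self.errors[error_class] += 1

    def should_stop(self):
        """Return a reason string if the loop must stop, else None."""
        if self.steps >= self.max_steps:
            return f"step limit reached ({self.max_steps})"
        for error_class, count in self.errors.items():
            if count >= self.max_same_error:
                return f"{count} errors of class {error_class!r}"
        return None


def run_with_guard(step_fn, guard: LoopGuard):
    """Call step_fn until it returns True (done) or the guard trips."""
    while True:
        reason = guard.should_stop()
        if reason is not None:
            return ("stopped", reason)
        guard.record_step()
        try:
            if step_fn():
                return ("done", None)
        except Exception as exc:
            # The exception type name stands in for the "error class".
            guard.record_error(type(exc).__name__)
```

With these defaults, a step that keeps raising PermissionError stops after three attempts instead of 150.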

Layer 4: Cost

What can go wrong: Token cost is invisible until it isn't. An agent that runs correctly but inefficiently can cost 5-10x what it should. Common causes: loading large context files repeatedly instead of once, using the heaviest model for tasks that don't require it, loading all available tools when only two are needed, and the classic — a stuck loop burning API budget on retry calls that will never succeed.

The skills that address this:

Token Cost Intelligence gives your agents awareness of their own cost. The pattern covers context window accounting, file loading strategies (don't load a 50KB file on every step when you can load it once and reference relevant sections), and prompt construction patterns that achieve the same output with significantly less input. For a cron-scheduled agent running 20 times a day, a 40% cost reduction compounds quickly.

Multi-Agent Coordination Architecture addresses the cost dimension of multi-agent systems: routing tasks to the right-sized agent, avoiding redundant computation across parallel agents, and structuring coordination messages to minimize the context each agent needs to carry. In a multi-agent system, coordination overhead is a real cost. Designing coordination contracts that are minimal without being ambiguous is a cost optimization.

Both of these connect to the model routing tier principle: use local inference for classification and routing tasks, Haiku for structured tasks with clear success criteria, and Sonnet for the work that actually requires it. Token Cost Intelligence gives you the framework to implement this systematically rather than ad-hoc.

Failure signature: Agent loads a 100KB config file 40 times across a session. Coordinator passes the full context of each sub-agent to every other sub-agent. Sonnet is used to determine whether a string contains the word "error."
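
The "load it once" pattern can be sketched with a cached reader from the standard library. The names load_text and relevant_lines are hypothetical; this shows the idea only and is not the Token Cost Intelligence implementation.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=32)
def load_text(path: str) -> str:
    """Read a file from disk once; later calls with the same path hit the cache."""
    return Path(path).read_text(encoding="utf-8")


def relevant_lines(path: str, keyword: str) -> list:
    """Return only the lines that mention keyword, instead of the whole file."""
    return [line for line in load_text(path).splitlines() if keyword in line]
```

The cache does not notice when the file changes on disk, so a long-running agent would call load_text.cache_clear() after any step that rewrites it.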

Layer 5: Setup and Observability

What can go wrong: Agents fail silently. They write outputs that look complete but aren't. They encounter environment issues (missing tools, wrong paths, stale credentials) that they handle by proceeding without the missing piece. By the time you notice, you have a week of bad outputs and no log trail.

The skills that address this:

Claude Code Setup Validation runs preflight checks before any substantive agent work: are required tools available, are expected paths writable, do credentials resolve, are environment variables populated. Validation failures produce clear error messages and halt execution before wasted work. The alternative is discovering that jq isn't installed at step 40 of a 50-step pipeline.

Context Death Spiral Prevention addresses a specific failure mode that compounds other problems: an agent approaching context exhaustion starts making progressively worse decisions as it has less context available. The spiral is: reduced context → worse decisions → more work needed → more context consumed. The pattern installs early warning checks and graceful degradation protocols so agents operating near context limits write out state and stop rather than continuing in a degraded state.

Agent Bash Safety provides the baseline for safe shell operations: patterns for safe variable quoting, command construction, error handling, and exit code propagation. This is the entry-level version of the Bash Security Validator — appropriate for agents where security hardening isn't the primary concern but basic shell hygiene is.
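
A preflight check along these lines can be sketched with the standard library alone. The function name preflight and its three argument lists are assumptions for this example, not the Setup Validation skill's interface.

```python
import os
import shutil


def preflight(required_tools, required_env, writable_dirs) -> list:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            failures.append(f"missing tool: {tool}")
    for name in required_env:
        # An unset or empty variable both count as "not populated".
        if not os.environ.get(name):
            failures.append(f"environment variable not set: {name}")
    for directory in writable_dirs:
        if not (os.path.isdir(directory) and os.access(directory, os.W_OK)):
            failures.append(f"directory not writable: {directory}")
    return failures
```

A wrapper script would call something like preflight(["jq"], ["DEVTO_API_KEY"], ["outputs/working"]) and halt with a non-zero exit code if the list is non-empty, before the agent does any work.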

Suggested Adoption Order

If you're starting from scratch, adopt in this sequence. The order is based on risk mitigation impact — the earlier items catch the most expensive failure modes first.

Week 1 — Foundation:

Agent Bash Safety (free) — install baseline shell hygiene before anything else runs
Context Death Spiral Prevention (free) — protect your first agents from the most disorienting failure mode
Claude Code Setup Validation — run preflight before any production deployment
Loop Termination Architecture — your agents will hit loops before they hit any other problem

Week 2 — Multi-session and concurrent:

Session Memory Architecture — required the moment any task spans more than one session
Agent Memory Scoping — required the moment you run more than one agent at a time
Agent Compaction Architecture — required for any long-running task

Week 3 — Multi-agent:

Coordinator Resume Integrity — required for any orchestrated pipeline
Forked Agent Architecture — required when you parallelize

Week 4 — Cost and security:

Token Cost Intelligence — implement once agents are running correctly
Multi-Agent Coordination Architecture — optimize once the baseline architecture is stable
Bash Security Validator — harden once you understand your attack surface
Production Agent Security Hardening — full hardening after you've mapped what the agents actually do

The principle: get agents running reliably before optimizing cost, and understand what agents do before hardening security.

The Full Stack in Practice

To make the architecture concrete, here's a complete autonomous content publishing agent and which of the 15 skills it engages at each stage.

The agent: runs every morning, drafts a dev.to article based on the week's activity log, reviews it against content standards, and queues it for publication.

09:00 — Cron fires run_task.sh
    |
    └── [Setup Validation] ← preflight: DEVTO_API_KEY present? jq installed?
                              outputs/working/ writable? network resolves?
        |
        └── PASS → agent starts
            FAIL → log to errors.log, notify via Telegram, exit 1

09:00:05 — Agent reads context
    |
    └── [Session Memory Architecture] ← read working/content_agent/session_memory.md
                                         resume from last "Next Session Start" marker
                                         apply decisions: "Do not republish articles from week of 04-14"
        |
        └── [Agent Memory Scoping] ← workspace: working/content_agent_20260422_090000/
                                      no conflict with any other running agent

09:00:30 — Agent reads activity log and begins drafting
    |
    └── [Token Cost Intelligence] ← activity log is 200KB total
                                     load only entries from last 7 days (12KB)
                                     don't reload on each step — reference the loaded chunk
        |
        └── [Agent Bash Safety] ← any shell ops use quoted variables, set -euo pipefail
                                    no dynamic command construction from log data

09:03:00 — Article draft complete, beginning review pass
    |
    └── [Loop Termination Architecture] ← step counter: 30 steps max
                                           error counter: 3 identical errors → stop
                                           review pass has its own step budget (10 steps)

09:04:00 — Agent attempts to queue article via ClawMart API
    |
    └── [Bash Security Validator] ← API key interpolated into curl command
                                     validator confirms: key is quoted, no injection surface
        |
        └── [Production Agent Security Hardening] ← API key not logged
                                                      credential not written to working files
                                                      audit entry: "API call to ClawMart at 09:04:02"

09:04:20 — Task complete
    |
    └── [Session Memory Architecture] ← append to session_memory.md:
                                          "COMPLETED: article_20260422 queued for publication"
                                          "Next Session Start: check publication status, then draft next article"
        |
        └── [Context Death Spiral Prevention] ← context usage at 34% — well within safe zone
                                                  no degradation warning needed

09:04:25 — Agent exits clean
    |
    └── outputs/article_20260422_queue.md written
        logs/heartbeat.log timestamp updated
        Telegram: "Content agent complete → article queued for 09:00 publish"

At each stage, removing the pattern it depends on would have produced a different outcome:

Without Setup Validation: agent discovers missing jq at step 15, produces garbled output, no error logged
Without Session Memory: agent re-drafts articles from weeks already covered
Without Token Cost Intelligence: agent loads the full 200KB activity log on every step, 3x cost
Without Loop Termination: if ClawMart API returns 503, agent retries until context death
Without Bash Security Validator: API key interpolated into a log message that persists in working files

The 15 skills are not independent optimizations. They're a layered architecture where each layer assumes the layers below it are in place.

Getting the Full Stack

Each skill is available individually. The individual articles published earlier this week cover the $19 skills in depth.

The entry point is two free skills that have no dependencies and install immediately:

Context Death Spiral Prevention — free, no prerequisites
Agent Bash Safety — free, no prerequisites

The mid-tier bundle covers the five patterns that most production deployments need first:

Production Agent Ops Bundle — $69 (Bash Security Validator, Loop Termination, Session Memory, Agent Memory Scoping, Token Cost Intelligence)

The complete architecture — all 15 skills as a cohesive production system with integration documentation and ordering guidance — is available as:

Complete Agent Operations Pack — $199
All 15 skills. Integration guide. Adoption sequence documentation. CLAUDE.md template library covering all five layers.

https://www.shopclawmart.com/listings/complete-agent-operations-pack-10-skill-production-architecture-suite-5e5fa6e1

The Honest Assessment

Most Claude Code projects don't need all 15 skills. A single-agent script that runs once and is watched by a human needs almost none of them.

The production architecture pays off when:

The agent runs unattended (cron, headless -p mode, no human watching)
The agent runs repeatedly (scheduled, not one-shot)
More than one agent runs at a time
Failures have downstream consequences (customer-facing, financial, not easily reversible)
API cost is a real constraint, not a rounding error

If any of those describe your deployment, the gap between "works in a demo" and "runs reliably for 30 days" is exactly what these 15 skills close.

Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Tags: claudecode, devtools, aiagents, productivity

Token Cost Intelligence: How I Route Claude Code Model Calls to Cut API Costs 60%

~K¹yle Million — Wed, 22 Apr 2026 13:09:10 +0000

The Problem: One Model for Everything

Here's what a typical Claude Code agent loop looks like under the hood:

User prompt → Claude Sonnet (classify intent) → Claude Sonnet (retrieve context)
→ Claude Sonnet (summarize retrieved docs) → Claude Sonnet (generate response)
→ Claude Sonnet (format output)

Five calls. Each one hitting Sonnet. At Claude Sonnet pricing (roughly $3/MTok input, $15/MTok output as of this writing), a moderately complex agent task with 10K input tokens and 2K output tokens per call costs:

5 calls × (10K × $0.003/1K + 2K × $0.015/1K) = 5 × ($0.030 + $0.030) = $0.30 per task run

That sounds small. Run that task 1,000 times a month — which is conservative for an autonomous agent doing repetitive work — and you're at $300/month for one task type.
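
The same arithmetic as a small sketch, so you can plug in your own token counts. The function name call_cost is hypothetical; the default prices are the approximate Sonnet figures quoted above and will drift over time.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float = 3.0,
              output_price_per_mtok: float = 15.0) -> float:
    """Dollar cost of one call at per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


per_task = 5 * call_cost(10_000, 2_000)   # five Sonnet calls per task run
per_month = 1_000 * per_task              # 1,000 task runs a month
print(f"${per_task:.2f} per task, ${per_month:.0f} per month")
# prints: $0.30 per task, $300 per month
```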

Now look at what most of those calls actually need:

Classify intent: Takes a string, returns a category. This is a pattern-matching problem.
Retrieve context: String similarity search. No synthesis required.
Summarize retrieved docs: Compression of existing text. No novel reasoning.
Generate response: This one actually needs intelligence.
Format output: String transformation. Deterministic.

Four of five calls don't need Sonnet. Two of them (classify intent, format output) don't need any API call at all: a local model running at zero marginal cost handles them fine.

That's the routing opportunity.

The Routing Principle

Before dispatching a subtask to any model, answer three questions:

1. Does this require judgment or just processing?

Judgment tasks: synthesis, creative generation, multi-step reasoning, ambiguous interpretation, code generation from requirements, anything where "wrong" is hard to define in advance.

Processing tasks: classification into fixed categories, text compression/summarization, format conversion, extraction of named entities, boolean routing decisions.

Judgment → Tier 2 minimum. Processing → Tier 0 or Tier 1 viable.

2. Does it need to be right on the first attempt, or can it retry cheaply?

Some subtasks sit on the critical path. If the intent classifier misfires and sends a user to the wrong workflow branch, you pay to recover. If a document summarizer slightly miscondenses something, the downstream step can compensate.

High-stakes, no-retry → Tier 1 minimum. Low-stakes, recoverable → Tier 0 viable.

3. What's the token budget for this step?

Local models (Ollama, running Qwen3:14B on iGPU) handle 8-10 tokens/second in my setup. That's fine for 500-token classification tasks. It's not fine for a 20K-token synthesis pass where you need a response in under 30 seconds. Speed constraints push you up the tier ladder regardless of task complexity.

The decision tree:

Is this a synthesis/reasoning/generation task?
├── Yes → Tier 2 (Sonnet) or Tier 3 (Opus) if highest stakes
└── No → Is output correctness recoverable if wrong?
    ├── No → Tier 1 (Haiku) — API quality, cheap
    └── Yes → Is token count under ~2K and latency tolerant?
        ├── Yes → Tier 0 (Ollama local) — zero API cost
        └── No → Tier 1 (Haiku)
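
The tree translates directly into a function. This is a minimal sketch in which the caller answers the three questions up front; the function name choose_tier and its boolean parameters are assumptions for this example, and the 2K threshold is the approximate figure from the tree.

```python
def choose_tier(is_judgment: bool, highest_stakes: bool, recoverable: bool,
                token_count: int, latency_tolerant: bool) -> int:
    """Return 0 (local), 1 (Haiku), 2 (Sonnet) or 3 (Opus) per the decision tree."""
    if is_judgment:
        # Synthesis, reasoning, or generation: Sonnet, or Opus at highest stakes.
        return 3 if highest_stakes else 2
    if not recoverable:
        # Processing task that must be right the first time: API quality, cheap.
        return 1
    if token_count < 2_000 and latency_tolerant:
        # Small, recoverable, not time-critical: local inference.
        return 0
    return 1
```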

Implementation

Here's the router as a standalone module. The classify() function takes a task description string and returns a tier integer. get_model() maps that tier to a model identifier.

# model_router.py

from enum import IntEnum
import re

class Tier(IntEnum):
    LOCAL = 0    # Ollama — zero API cost
    HAIKU = 1    # Claude Haiku 4.5 — cheap, API quality
    SONNET = 2   # Claude Sonnet — primary work
    OPUS = 3     # Claude Opus — highest stakes only

TIER_MODELS = {
    Tier.LOCAL:  "ollama:qwen3:14b",
    Tier.HAIKU:  "claude-haiku-4-5",
    Tier.SONNET: "claude-sonnet-4-5",
    Tier.OPUS:   "claude-opus-4-5",
}

# Task patterns that signal each tier.
# Match order matters: classify() checks Opus patterns first, then Local,
# then Haiku, and falls through to Tier 2 (Sonnet) if nothing matches.

LOCAL_PATTERNS = [
    r"\bclassif(y|ication|ier)\b",
    r"\broute\b.*\btask\b",
    r"\bsummariz(e|ation)\b",
    r"\bextract\b.*(entity|entities|field|fields|name|date|number)",
    r"\bformat\b.*(output|json|markdown|csv)",
    r"\bparse\b.*(string|text|input)",
    r"\bis this (about|related to|a)\b",
    r"\bcategori(ze|zation)\b",
    r"\bdetect\b.*(intent|topic|language|sentiment)",
    r"\btranslate\b.*(format|schema)",
]

HAIKU_PATTERNS = [
    r"\bvalidat(e|ion)\b",
    r"\bcheck\b.*(schema|format|constraint|rule)",
    r"\bfilter\b",
    r"\brank\b.*(list|candidates|results)",
    r"\bscore\b",
    r"\byes.{0,10}no\b",        # binary decisions
    r"\btrue.{0,10}false\b",
    r"\bshould (i|we|this)\b",
]

OPUS_PATTERNS = [
    r"\bcritical\b",
    r"\bhigh.?stakes\b",
    r"\birreversible\b",
    r"\bproduction (deploy|release|launch)\b",
    r"\bsecurity (audit|review|analysis)\b",
    r"\blegal\b",
    r"\barchitect(ure)? decision\b",
]

def classify(task: str) -> Tier:
    """
    Classify a task description string and return the appropriate model tier.
    Conservative by default: unknown tasks get Tier 2 (Sonnet).
    """
    task_lower = task.lower().strip()

    # Check Opus patterns first — these override everything
    for pattern in OPUS_PATTERNS:
        if re.search(pattern, task_lower):
            return Tier.OPUS

    # Check if task clearly fits Local tier
    local_matches = sum(
        1 for p in LOCAL_PATTERNS if re.search(p, task_lower)
    )
    if local_matches >= 1 and len(task_lower) < 500:
        return Tier.LOCAL

    # Check Haiku tier
    for pattern in HAIKU_PATTERNS:
        if re.search(pattern, task_lower):
            return Tier.HAIKU

    # Default: Sonnet
    return Tier.SONNET


def get_model(tier: Tier) -> str:
    """Return the model identifier for the given tier."""
    return TIER_MODELS[tier]


def route(task: str) -> tuple[Tier, str]:
    """Convenience wrapper: classify + return (tier, model_id)."""
    tier = classify(task)
    return tier, get_model(tier)

Injecting this into a Claude Code script:

If you're running Claude Code in script mode (claude -p), you typically don't call the API directly — Claude Code handles the model. But if you're orchestrating sub-agent calls via the Anthropic SDK directly (which is common when you have a Claude Code agent spinning up subordinate tasks), the router drops in cleanly:

# agent_loop.py
import anthropic
from model_router import route, Tier

client = anthropic.Anthropic()

def run_subtask(task_description: str, prompt: str) -> str:
    tier, model = route(task_description)

    # Tier 0: local inference via Ollama (no Anthropic API call)
    if tier == Tier.LOCAL:
        return run_ollama(model.replace("ollama:", ""), prompt)

    # Tiers 1-3: Anthropic API
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text


def run_ollama(model_name: str, prompt: str) -> str:
    """Call local Ollama endpoint directly."""
    import httpx  # imported lazily so API-only deployments don't need httpx
    resp = httpx.post(
        "http://localhost:11434/api/generate",
        json={"model": model_name, "prompt": prompt, "stream": False},
        timeout=60.0
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return resp.json()["response"]

Integrating with a Claude Code tool definition:

If your agent uses Claude Code's native tool calling, you can route at the tool dispatch layer:

# In your tool handler
from model_router import Tier, classify, get_model

TOOL_TIER_OVERRIDES = {
    "classify_intent":     Tier.LOCAL,
    "summarize_document":  Tier.LOCAL,
    "extract_fields":      Tier.LOCAL,
    "validate_schema":     Tier.HAIKU,
    "rank_candidates":     Tier.HAIKU,
    "generate_code":       Tier.SONNET,
    "synthesize_findings": Tier.SONNET,
    "review_security":     Tier.OPUS,
}

def dispatch_tool(tool_name: str, tool_input: dict) -> str:
    # Use hard-coded override if known, otherwise classify from tool_name
    if tool_name in TOOL_TIER_OVERRIDES:
        tier = TOOL_TIER_OVERRIDES[tool_name]
    else:
        tier = classify(tool_name + " " + str(tool_input))

    model = get_model(tier)
    # ... dispatch to appropriate model

Real Numbers

Here's the actual breakdown from my autonomous agent infrastructure, running a mix of ClawMart listing maintenance, content generation, and ACE license delivery tasks over a 30-day period.

Before routing — all tasks on Sonnet:

Task type	Calls/day	Avg tokens (in/out)	Daily cost
Intent classification	120	800 / 50	$0.32
Document summarization	40	3,200 / 400	$0.44
Field extraction	80	600 / 120	$0.20
Schema validation	60	400 / 80	$0.13
Content generation	15	2,000 / 1,500	$0.29
Code synthesis	10	4,000 / 2,000	$0.42
Total	325	—	$1.80/day ($54/mo)

After routing:

Task type	Tier	Daily cost
Intent classification	0 (Ollama)	$0.00
Document summarization	0 (Ollama)	$0.00
Field extraction	0 (Ollama)	$0.00
Schema validation	1 (Haiku)	~$0.004
Content generation	2 (Sonnet)	$0.29
Code synthesis	2 (Sonnet)	$0.42
Total	—	~$0.71/day ($21/mo)

That's a 61% reduction. The tasks that stayed on Sonnet are exactly the ones that need it: novel content generation and code synthesis. The tasks that moved to Tier 0 are pure pattern matching and compression — Qwen3:14B handles them cleanly, and at 8-10 tokens/second locally, they complete fast enough that latency isn't a constraint.
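
The totals in the two tables can be checked with a few lines of arithmetic. The dictionaries below copy the daily cost columns as published; the Haiku figure is the approximate ~$0.004 from the table.

```python
# Daily cost per task type, copied from the "before" and "after" tables.
before = {
    "Intent classification": 0.32,
    "Document summarization": 0.44,
    "Field extraction": 0.20,
    "Schema validation": 0.13,
    "Content generation": 0.29,
    "Code synthesis": 0.42,
}
after = {
    "Intent classification": 0.00,
    "Document summarization": 0.00,
    "Field extraction": 0.00,
    "Schema validation": 0.004,
    "Content generation": 0.29,
    "Code synthesis": 0.42,
}

total_before = sum(before.values())
total_after = sum(after.values())
reduction = 1 - total_after / total_before
print(f"before ${total_before:.2f}/day, after ${total_after:.2f}/day, "
      f"reduction {reduction:.1%}")
```

The daily totals work out to a reduction of about 60%; the 61% figure corresponds to the rounded monthly numbers ($54 to $21).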

A few observations from running this in production:

Classification accuracy on Tier 0 is high for constrained tasks. When the output space is a small fixed set of categories, Qwen3:14B makes fewer errors than you'd expect. The failure mode is ambiguous prompts, not model capability.
Haiku 4.5 is underused by most teams. It's genuinely capable for structured validation and ranking tasks, and at list prices as of this writing it costs roughly 3x less than Sonnet per input token (about $1/MTok versus $3/MTok). Most teams skip straight to Sonnet out of habit.
The routing classifier itself costs almost nothing. My classify() function is pure regex — no model call, zero latency, zero cost. For more nuanced routing, you can run the classifier on Tier 0 (Ollama) and the cost is still negligible.
Retry budgets matter. I give Tier 0 tasks two retries before escalating to Tier 1. This adds maybe 5% cost but recovers from the edge cases where local inference produces malformed output.
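
The retry-then-escalate rule can be sketched as a small loop over tiers, cheapest first. Everything here is an assumption for illustration: the function name run_with_escalation, the idea of passing each tier as a zero-argument callable, and applying the same retry budget to every tier rather than only Tier 0.

```python
def run_with_escalation(tiers, is_valid, retries=2):
    """
    tiers: ordered list of (name, call) pairs, cheapest first; call() returns output.
    is_valid: function that returns True when an output is well-formed.
    retries: extra attempts per tier after the first one.
    Returns (tier_name, output) from the first tier that produces valid output.
    """
    for name, call in tiers:
        for _attempt in range(1 + retries):
            try:
                output = call()
            except Exception:
                continue  # treat an exception like malformed output
            if is_valid(output):
                return name, output
    raise RuntimeError("all tiers exhausted without valid output")
```

With retries=2, a local model that keeps returning malformed output is called three times in total before the task moves up a tier.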

What Breaks Without This

The failure mode I see most often in unrouted agents isn't cost — it's the Sonnet context window filling up with low-value intermediate processing. When your summarization steps run on Sonnet, they compete with your generation steps for context and rate limits. Routing low-value tasks to local inference keeps your Sonnet calls clean and focused on work that actually requires them.

The second failure mode is rate limit exhaustion. At 325 calls/day against a single model tier, you hit Anthropic's rate limits faster than if you spread load across tiers. Tier distribution is rate limit distribution.

The Packaged Framework

The routing logic above is a simplified version of what I built and use in production. The full framework includes:

Pre-trained classifiers for 40+ task types with confidence scores
Cost tracking that logs actual spend per task type to a local SQLite DB
A dashboard that shows cost breakdown and tier distribution over time
Retry logic with automatic tier escalation on failure
Integration examples for Claude Code scripts, Anthropic SDK, and LangChain

The full Token Cost Intelligence skill is available on ClawMart: Token Cost Intelligence — OpenClaw Optimization Framework ($29).

If you're running any Claude Code agents at scale — even moderate scale — the routing framework pays for itself in the first day of usage.

W. Kyle Million (K¹) builds autonomous AI infrastructure at IntuiTek¹. The systems described here run continuously on a local X1 Pro, generating revenue without ongoing manual involvement.

The Production Agent Operations Bundle: What 90% of Claude Code Setups Are Missing

~K¹yle Million — Wed, 22 Apr 2026 13:08:46 +0000

The Five Failure Modes That Hit Real Production Setups

1. Context collapse mid-task

Your agent is 35 steps into a 60-step task. It hits context limit. Compaction kicks in. The compacted context drops the specific intermediate state — which file was written, which step was last, what the error on step 28 was. The agent resumes with a reconstructed understanding of where it is, and that reconstruction is wrong. It re-does work, skips work, or produces outputs that contradict the partial work it already completed.

The compaction is not the problem. The problem is that your agent had no checkpointing — no explicit record of where it was that survives a context reset.

2. Infinite loops with no circuit breaker

The task fails. The agent retries. Same failure. Retry. Same failure. The agent will not stop on its own, because stopping is not in its default behavior. It will retry until context exhausts, then compact and retry again. A permission denied error on step 3 will get retried 80 times before the run terminates. You pay for all 80 retries.

3. Shell injection via unvalidated tool calls

Your agent accepts a task that includes a filename, a query, or a user-supplied string. It passes that string directly to a bash call: os.system(f"process_file.sh {filename}"). If filename is file.txt; rm -rf outputs/, your agent just destroyed your output directory. If it's piped from an external source, the attack surface is real.

Most Claude Code bash usage never validates inputs before shell execution. Most demos don't catch this because the inputs are controlled. Production inputs are not.

4. Concurrent agents corrupting shared state

You have two agents running in parallel. Both are writing to outputs/weekly_report.md. Agent A writes its section. Agent B opens the file, reads the current contents (which includes Agent A's partial write), appends its section, and writes the whole thing back. Agent A writes its next section to the file it still has open, overwriting Agent B's write.

Non-atomic writes with no locking produce corrupted output with no error. No exception is raised. The file exists. The contents are wrong.
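
For the partial-write half of this problem, a common approach is to write to a temporary file and rename it into place. The sketch below assumes a local filesystem; atomic_write is a hypothetical helper name, not something from the bundle.

```python
import os
import tempfile

def atomic_write(path: str, content: str) -> None:
    """Write `content` so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # replaces the destination in a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This only guarantees that no reader sees a half-written file. Two agents that each read, modify, and rewrite the same file can still lose each other's updates, which is why isolation or locking is still needed.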

5. Coordinator handoff losing task state

Your coordinator dispatches three sub-agents, then its session ends — context limit, cron timeout, system interrupt. A new coordinator starts on the next cron tick. It has no idea which sub-agents already completed. It re-dispatches all three. Sub-agent 1 runs again, producing duplicate output. Sub-agent 2 conflicts with its own still-running previous instance. Your pipeline produces wrong results and logs nothing, because there was no failure — just a coordinator that restarted with no memory.

What Doesn't Work and Why

The instinct when any of these hits is to add error handling. Wrap things in try/except, add a retry loop, restart on failure. These are patches, not fixes. Here's why each one falls short:

"Just add error handling" catches exceptions but doesn't solve loop termination. Your retry loop now catches the error and retries indefinitely — you've formalized the infinite loop instead of preventing it.

"Restart on failure" is the coordinator pattern that causes state loss. Each restart wipes context. Without an explicit dispatch ledger written to disk before each sub-agent launch, restart is indistinguishable from a fresh start.

"Check output file existence" to infer completion has multiple failure modes: partial writes leave valid-looking files, a previous interrupted run may have left a file from a different context, and the same task may need to run multiple times. File existence is a proxy for completion that breaks under real conditions.

"Sanitize inputs in the prompt" relies on the model to perform security validation. That's not the right layer. Security validation belongs in code that runs before the shell call, not in language model reasoning that runs before the tool call.

"Use a lock file" for concurrent writes is the right idea but is almost always implemented incorrectly — lock files that survive crashes leave all subsequent agents blocked, and there's no cleanup logic because the crash that created the problem also prevented the cleanup.

The common thread: these fixes address symptoms at the wrong layer. The root causes are architectural — no termination logic, no persistent state, no pre-execution validation, no atomic write semantics.
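
On the lock-file point specifically: one way to avoid stale locks is to lock with the operating system instead of with a file's existence. The sketch below assumes a Unix-like system (fcntl is not available on Windows) and a local filesystem; exclusive_lock is a hypothetical name.

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory lock on `lock_path` for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        # Closing the descriptor releases the lock. The kernel does the same
        # when the process dies, so a crash cannot leave the lock held.
        os.close(fd)
```

The lock file itself can stay on disk after a crash. What matters is whether a live process holds the lock, not whether the file exists, so no cleanup logic is required.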

The Five Architecture Patterns That Fix It

1. Loop Termination with Circuit Breakers

Every production agent needs termination logic at three levels: a hard step limit, an error accumulation counter, and a goal proximity check.

The hard limit is the blunt instrument that catches runaway loops:

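MAX_STEPS = 50
step_count = 0

class TerminationError(Exception):
    """Raised to stop the agent loop cleanly."""

def execute_step(action):
    # write_state_checkpoint() and perform_action() are assumed to be defined elsewhere.
    global step_count
    step_count += 1
    if step_count > MAX_STEPS:  # allows exactly MAX_STEPS actions before stopping
        write_state_checkpoint(reason=f"max steps ({MAX_STEPS}) reached")
        raise TerminationError("Hard limit reached")
    return perform_action(action)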

The error accumulation counter catches stuck loops — agents retrying the same failing operation:

error_counts = {}
ERROR_THRESHOLD = 3

def handle_error(error_type: str, context: str):
    error_counts[error_type] = error_counts.get(error_type, 0) + 1
    if error_counts[error_type] >= ERROR_THRESHOLD:
        write_escalation(f"BLOCKED: {error_type} failed {error_counts[error_type]}x. Context: {context}")
        raise TerminationError(f"Repeated failure: {error_type}")
    return retry_with_backoff()

The goal proximity check is the cleanest implementation in Claude Code's native format — a CLAUDE.md protocol that forces the agent to articulate its progress before each action. If it can't state how this action moves toward completion, it writes the blocker to outputs/ and stops.

Clean termination writes current state, names the blocker, and exits 0 — stopped is not the same as failed.
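
A minimal sketch of what that clean stop could look like. The file name, record fields, and terminate_cleanly itself are assumptions made for illustration, not the bundle's format.

```python
import json
import os
from datetime import datetime, timezone

def terminate_cleanly(task_id: str, blocker: str, state: dict, out_dir: str = "outputs") -> str:
    """Persist the current state and the named blocker; return the path written."""
    os.makedirs(out_dir, exist_ok=True)
    record = {
        "task_id": task_id,
        "stopped_at": datetime.now(timezone.utc).isoformat(),
        "blocker": blocker,   # why the agent stopped
        "state": state,       # whatever is needed to resume later
    }
    path = os.path.join(out_dir, f"blocked_{task_id}.json")
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return path
```

A caller would typically invoke this from the handler that catches the termination exception and then call sys.exit(0), so a scheduler sees a normal exit while the blocker file carries the reason.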

2. Memory Isolation for Concurrent Agents

When multiple agents need to read and write shared state, the architecture needs to prevent reads of stale data and prevent concurrent writes from producing corrupted output.

The pattern is task-local working directories with a merge step, not shared output paths:

import os, uuid, shutil

def agent_working_dir(agent_id: str) -> str:
    """Each agent gets its own isolated scratch space."""
    base = os.path.expanduser("~/intuitek/coordination/scratch")
    path = os.path.join(base, agent_id)
    os.makedirs(path, exist_ok=True)
    return path

def merge_agent_outputs(agent_ids: list, output_path: str):
    """Coordinator merges after all agents complete — no concurrent writes."""
    sections = []
    for agent_id in agent_ids:
        scratch = agent_working_dir(agent_id)
        result_file = os.path.join(scratch, "result.md")
        if os.path.exists(result_file):
            with open(result_file) as f:
                sections.append(f.read())
    with open(output_path, "w") as f:
        f.write("\n\n---\n\n".join(sections))

Agents write to their scratch directory. The coordinator merges when all agents report completion. No two agents write to the same path. No locks needed.

For shared state that agents genuinely need to read and update concurrently, the pattern is append-only event logs with a read-once merge, not mutable shared files.
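
A rough sketch of the append-only idea, assuming a local POSIX filesystem and small events written with a single call each (worth verifying on your platform, and not something to rely on over network filesystems). The key/value event shape and both function names are hypothetical.

```python
import json
import os

def append_event(log_path: str, agent_id: str, event: dict) -> None:
    """Append one JSON line. O_APPEND places each write at the current end of file."""
    line = json.dumps({"agent_id": agent_id, **event}) + "\n"
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line.encode("utf-8"))  # one write call per event
    finally:
        os.close(fd)

def merge_events(log_path: str) -> dict:
    """Fold the log into current state: the last event for each key wins."""
    state = {}
    with open(log_path, encoding="utf-8") as f:
        for raw in f:
            record = json.loads(raw)
            state[record["key"]] = record["value"]
    return state
```

Agents only ever add lines, so nothing is overwritten. The reader replays the log in order to derive current state.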

3. Coordinator Resume Integrity

Coordinator state must be written to disk before every sub-agent dispatch. Not after — before. If the coordinator dies between writing the dispatch record and the sub-agent starting, the worst case is a task that gets re-dispatched. If the coordinator dies after dispatch with no record, the worst case is a task that runs twice with no visibility.

dispatch_task() {
    local TASK_ID="$1"
    local TASK_PROMPT="$2"

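    # Write to ledger before dispatch, not after
    python3 -c "
import json, datetime
with open('$LEDGER') as f: ledger = json.load(f)
now = datetime.datetime.utcnow().isoformat() + 'Z'
# Reuse the existing entry on re-dispatch so the ledger never holds duplicates
entry = next((t for t in ledger['tasks'] if t['task_id'] == '$TASK_ID'), None)
if entry is None:
    entry = {'task_id': '$TASK_ID'}
    ledger['tasks'].append(entry)
entry.update({'status': 'IN_PROGRESS', 'dispatched_at': now, 'completed_at': None})
with open('$LEDGER', 'w') as f: json.dump(ledger, f, indent=2)
"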
    bash ~/intuitek/run_task.sh "$TASK_PROMPT" &
}

startup_coordinator() {
    if [[ -f "$LEDGER" ]]; then
        # Skip tasks already marked COMPLETE
        PENDING=$(python3 -c "
import json
with open('$LEDGER') as f: ledger = json.load(f)
pending = [t['task_id'] for t in ledger['tasks'] if t['status'] != 'COMPLETE']
print('\n'.join(pending))
")
    fi
}

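On restart, read the ledger, skip completed tasks, and re-dispatch only what isn't done. Add a heartbeat timestamp to detect abandoned pipelines: if the last heartbeat is more than 5 minutes old and the pipeline is still marked IN_PROGRESS, the previous coordinator has most likely died and a new one can take over. A stale heartbeat says nothing about the sub-agents, though. A task still marked IN_PROGRESS may have a live process behind it, so check for that before re-dispatching.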

4. Bash Security Validation Before Shell Execution

Every string that comes from outside your agent's direct control — task inputs, file paths, query parameters, content extracted from external sources — must be validated before it touches a shell call.

The validation layer runs in Python before the subprocess call:

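import re, subprocess, shlex

SAFE_FILENAME = re.compile(r'[\w\-\.]+')
SAFE_PATH_COMPONENT = re.compile(r'[\w\-\./]+')

class SecurityError(Exception):
    """Raised when an interpolated value fails validation."""

def safe_shell_exec(command_template: str, **kwargs):
    """Validate all interpolated values before execution (no shell involved)."""
    for key, value in kwargs.items():
        value = str(value)
        if 'path' in key or 'file' in key:
            # fullmatch() rejects a trailing newline that '$' would let through
            if not SAFE_PATH_COMPONENT.fullmatch(value) or '..' in value.split('/'):
                raise SecurityError(f"Unsafe path in {key}: {value!r}")
        elif 'name' in key:
            if not SAFE_FILENAME.fullmatch(value):
                raise SecurityError(f"Unsafe filename in {key}: {value!r}")
        else:
            # Refuse anything this function has no rule for
            raise SecurityError(f"No validation rule for {key}")
        if value.startswith('-'):
            raise SecurityError(f"Value for {key} looks like an option: {value!r}")

    # Tokenize the template first, then fill in values, so one value stays one argument
    args = [token.format(**kwargs) for token in shlex.split(command_template)]
    result = subprocess.run(args, capture_output=True, text=True, timeout=30)
    return result

The important detail is that the command never passes through a shell. shell=True is the vector. With shell=False and an argument list, metacharacters such as ; and | reach the program as literal text, so they cannot start a second command even if a value slips through validation. The order of operations matters too: shlex.split() runs on the template before the values are interpolated, because splitting after interpolation would let a value containing a space turn into extra arguments.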

For agent-facing tools that accept arbitrary inputs, add a denylist for shell metacharacters as a second layer: ;, |, &&, $(), backticks, and > in unexpected positions are all injection indicators.
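
A small sketch of such a denylist check. It is deliberately stricter than the description above in that it flags any > or <, not just ones in unexpected positions; the pattern and the function name are assumptions, and a denylist should only ever back up the allowlist validation, never replace it.

```python
import re

# Characters and sequences with special meaning to common shells.
SHELL_METACHARACTERS = re.compile(r"[;|&`<>\n]|\$\(|\$\{")

def contains_shell_metacharacters(value: str) -> bool:
    """Return True if `value` contains any denylisted shell metacharacter."""
    return SHELL_METACHARACTERS.search(value) is not None
```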

5. Context Compaction Checkpointing

When an agent runs a task that requires more steps than a single context window, it needs to write explicit checkpoints — structured state records that survive compaction and allow resumption at the right point.

The checkpoint is written before any operation that changes state, and read at session start:

import json, os
from datetime import datetime

CHECKPOINT_PATH = "outputs/checkpoint_{task_id}.json"

def write_checkpoint(task_id: str, state: dict):
    """Call before any state-changing operation."""
    checkpoint = {
        "task_id": task_id,
        "checkpoint_at": datetime.utcnow().isoformat() + "Z",
        "completed_steps": state.get("completed_steps", []),
        "current_step": state.get("current_step"),
        "outputs_written": state.get("outputs_written", []),
        "context_summary": state.get("context_summary", ""),
    }
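    path = CHECKPOINT_PATH.format(task_id=task_id)
    # Write to a temp file, then rename, so a crash mid-write never leaves a truncated checkpoint
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(checkpoint, f, indent=2)
    os.replace(tmp_path, path)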

def load_checkpoint(task_id: str) -> dict | None:
    path = CHECKPOINT_PATH.format(task_id=task_id)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None

In Claude Code's native CLAUDE.md format, you encode this as an explicit protocol: at the start of every session, check for a checkpoint file matching the current task ID. If found, read it, report where execution left off, and continue from current_step rather than from the beginning.

The context_summary field is the most important part. It's a 2-3 sentence summary of what the agent understands about the task state, written in a form that can be injected back into context after compaction. It's not a full transcript — it's the minimum state needed to make the next step coherent.
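
To make the resume step concrete, here is a sketch of how a loaded checkpoint might be turned into a list of remaining steps and a line to inject at session start. It assumes the checkpoint dict has the fields shown earlier; remaining_steps and resume_prompt are hypothetical helpers.

```python
from typing import List, Optional

def remaining_steps(all_steps: List[str], checkpoint: Optional[dict]) -> List[str]:
    """Return the steps still to run, in order, given a checkpoint dict or None."""
    if checkpoint is None:
        return list(all_steps)
    done = set(checkpoint.get("completed_steps", []))
    return [step for step in all_steps if step not in done]

def resume_prompt(checkpoint: Optional[dict]) -> str:
    """Build the text to inject at session start from the checkpoint's summary."""
    if checkpoint is None:
        return "No checkpoint found. Start from the first step."
    return (
        f"Resuming task {checkpoint['task_id']} at step {checkpoint.get('current_step')}. "
        f"State summary: {checkpoint.get('context_summary', '')}"
    )
```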

When to Use the Bundle vs. Building From Scratch

Build from scratch if:

Your agent runs a single short task (under 20 steps) with no concurrent instances
All inputs are fully controlled — no external sources, no user-supplied strings reaching shell calls
The agent runs once and terminates; no scheduled re-runs, no coordinator/sub-agent pattern

Use the bundle if:

You're running agents on a cron schedule where each run may pick up from where the last one left off
You're running two or more agents in parallel that share any output paths or state
Any task input — including file paths, query parameters, or content the agent reads from external sources — reaches a bash or subprocess call
You're building a coordinator that dispatches sub-agents
You've already hit any of the five failure modes described above

The patterns aren't complicated individually. The difficulty is in the details: the exact order of operations for a write-before-dispatch ledger, the edge cases in lock file cleanup, the difference between shell=True and argument list subprocess calls that actually blocks injection. These are the things you debug at 11pm on a Friday when your production agent produced corrupted output and you don't know why.

The Honest Take

None of this is new architecture. Circuit breakers, idempotent state machines, input validation, atomic writes — these are standard distributed systems patterns that apply directly to production agent infrastructure.

The reason most Claude Code setups don't have them is not complexity. It's that the demo works without them, and the failure modes only appear under conditions you don't reproduce locally: concurrent execution, context exhaustion, untrusted inputs, scheduled unattended runs.

If you're at the point where Claude Code agents are part of your production infrastructure and not just experiments, these patterns are not optional. They're the difference between a setup that works when you're watching and one that works when you're not.

I packaged all five as a single ClawMart skill bundle — ready to drop into any Claude Code project: https://www.shopclawmart.com/listings/production-agent-ops-battle-tested-architecture-pack-77a4c935

$69. Instant download. One-time purchase.

Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Tags: claudecode, devtools, aiagents, productivity

Session Memory Architecture: The Pattern That Keeps Your Agent Coherent Across Context Resets

~K¹yle Million — Wed, 22 Apr 2026 12:08:32 +0000

Your Claude Code agent ran perfectly for 45 minutes. Built context. Understood the codebase. Made decisions that depended on what it learned in the first 30 minutes.

Then the context limit hit. The session compacted. Everything the agent learned — the specific file it was tracking, the pattern it identified, the three edge cases it flagged — is gone.

The next session starts fresh. The agent reads CLAUDE.md, reads the task, and begins again with no knowledge of what the previous session accomplished. It may re-examine files it already processed. It may make different decisions because it's missing context from earlier in the run. It may re-do work that was already done.

This is session memory failure. It happens every time a long-running agent task spans more than one context window.

The Problem: Context Is Not Memory

Claude Code agents have two very different things that are often confused:

Context — what's in the current session window. Fast to access. Massive reasoning ability. Zero persistence. When the session ends or compacts, it's gone.

Memory — what's written to disk. Persists across sessions. Available to any future agent. Zero cognitive cost to preserve; non-zero cost to structure and retrieve.

Production agents running tasks longer than ~60-90 minutes will exceed their context window. Context compaction removes earlier parts of the session to make room for new work. Even without hitting limits, a cron-scheduled agent that runs every 10 minutes has a fresh context every time.

Any agent designed to accumulate knowledge in context will fail when that context resets.

Three failure modes:

1. Repeated discovery

Agent discovers that auth/middleware.py contains the auth bug it's tracking. This information exists in context. Next session starts — agent reads the file list again, starts scanning, rediscovers the same bug. 10 minutes of redundant work per reset.

2. Decision context loss

Agent decided not to modify config.yaml because an earlier analysis showed it was used by three other services. That analysis is in compacted context. New session edits config.yaml without that constraint — introduces a regression.

3. Progress tracking failure

Agent processed files A through M. Context compacted; that progress is gone. New session starts at A again. By the time it reaches M, it's processed everything twice. Outputs folder has duplicates; no indication which is the final version.

What Doesn't Work

Relying on CLAUDE.md for session state

CLAUDE.md is for operating instructions, not run-time state. Writing session progress into CLAUDE.md means mixing stable configuration with ephemeral state. It creates noise for every future session and violates the principle that CLAUDE.md should change only with ~K¹ approval.

Writing to outputs/ and re-reading it

Output files are write-once, never-modified by design. Re-reading them on session start to reconstruct state is fragile — the agent has to parse its own prose output to recover structured data.

Trusting the next session to "figure it out"

It won't. The next session sees only what's on disk plus what's in CLAUDE.md. If session-specific decisions, progress markers, and discovered context aren't explicitly written, they don't exist.

The Pattern: Session Memory Files

Each long-running task maintains a session memory file — a structured, append-only log that the agent writes during the session and reads at the start of the next session.

SESSION_MEM="$INTUITEK/working/${TASK_ID}/session_memory.md"

Session memory file structure:

# Session Memory — task_orders_audit_20260422

## Decisions Made
- 2026-04-22T07:12Z — DO NOT modify config/auth.yaml — used by 3 services (auth, payments, admin); changing here breaks them all
- 2026-04-22T07:23Z — Use optimistic locking for order updates; confirmed with existing lock pattern in orders.py:241

## Progress Markers
- COMPLETED: orders/batch_1/ (files 001-047)
- COMPLETED: orders/batch_2/ (files 048-091)
- IN_PROGRESS: orders/batch_3/ (files 092-??? — stopped at 094)
- PENDING: orders/batch_4/, batch_5/

## Key Discoveries
- Order schema has undocumented `legacy_id` field used only by `reports/quarterly.py` — do not remove
- `orders/batch_2/order_073.json` is malformed (truncated at line 14) — log as error, don't process
- Pattern: all failed orders have `payment_status: null` before `order_status: failed` — not after

## Next Session Start
On next session start: begin with orders/batch_3/ file 095. Apply decisions above before touching any config.

At session start, the agent reads this file before doing anything else:

SESSION_PROMPT_PREFIX=""
if [[ -f "$SESSION_MEM" ]]; then
    SESSION_PROMPT_PREFIX="Read $SESSION_MEM first. Apply all decisions and progress markers before starting new work."
fi

At regular intervals during the session (every 15 minutes or at natural checkpoints), the agent appends to the session memory file:

checkpoint() {
    local NOTE="$1"
    echo "- $(date -u +%Y-%m-%dT%H:%MZ) — $NOTE" >> "$SESSION_MEM"
}

The agent calls checkpoint when it:

Makes a decision that depends on earlier context
Completes a logical unit of work
Discovers something that would change how future work proceeds
Encounters an edge case that needs to be remembered

Memory Categories Within Session Memory

Not everything deserves the same treatment. Structure your session memory file with explicit sections:

Decisions — choices made that must constrain future choices. Immutable once written. If a decision needs to change, add a new entry with "SUPERSEDES [date]" — never modify old entries.

Progress — what's been done. Updated as work completes. Enables skipping already-completed work on resume.

Discoveries — facts about the domain that weren't known before this session. Information that future sessions need to make correct decisions.

Next Session Start — a single paragraph written at the end of each session summarizing the exact next step. This is what the next session reads first.
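
One gap in a plain append (>>) is that every note lands at the end of the file, under whichever section happens to be last. A sketch of routing an entry into its own section follows. The section names mirror the example file above, add_entry is a hypothetical helper, and the entry separator is a plain hyphen chosen for this sketch.

```python
from datetime import datetime, timezone
from typing import Optional

SECTIONS = ("Decisions Made", "Progress Markers", "Key Discoveries", "Next Session Start")

def add_entry(memory_text: str, section: str, note: str, now: Optional[datetime] = None) -> str:
    """Return `memory_text` with a timestamped entry added at the end of `section`."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%MZ")
    entry = f"- {stamp} - {note}"
    lines = memory_text.splitlines()
    header = f"## {section}"
    if header not in lines:
        # Section missing: create it at the end of the file.
        return "\n".join(lines + ["", header, entry]) + "\n"
    start = lines.index(header)
    end = start + 1
    # Advance to the next section header or the end of the file.
    while end < len(lines) and not lines[end].startswith("## "):
        end += 1
    # Step back over trailing blank lines so the entry joins the section body.
    insert_at = end
    while insert_at > start + 1 and lines[insert_at - 1].strip() == "":
        insert_at -= 1
    lines.insert(insert_at, entry)
    return "\n".join(lines) + "\n"
```

Existing entries are never edited, only added to, which keeps the file append-only in spirit even though the new line is not at the physical end of the file.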

Automatic Memory on Compaction

Claude Code's context compaction removes older messages. Build compaction awareness into your agent's operating instructions:

## Session Memory Protocol (in CLAUDE.md or task prompt)

Before context compacts or session ends:
1. Write current progress to working/{task_id}/session_memory.md
2. Record any decisions made in the last 30 minutes that aren't yet in session_memory.md
3. Update "Next Session Start" section with the exact next action
4. Write completion status of current logical unit to session_memory.md

On session start:
1. Read working/{task_id}/session_memory.md if it exists
2. Apply all decisions without re-evaluating them
3. Start from the progress marker labeled "Next Session Start"
4. Do not re-do work marked as COMPLETED
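
The "do not re-do work marked as COMPLETED" rule can also be enforced in code instead of left to the model. This sketch assumes progress lines follow the "- STATUS: item" shape from the example file; parse_progress and is_completed are hypothetical names.

```python
import re

MARKER = re.compile(r"^- (COMPLETED|IN_PROGRESS|PENDING): (.+)$")

def parse_progress(memory_text: str) -> dict:
    """Group progress-marker lines by status."""
    progress = {"COMPLETED": [], "IN_PROGRESS": [], "PENDING": []}
    for line in memory_text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            status, item = match.groups()
            progress[status].append(item)
    return progress

def is_completed(progress: dict, unit: str) -> bool:
    """True if any COMPLETED marker starts with the given unit of work."""
    return any(item.startswith(unit) for item in progress["COMPLETED"])
```

A wrapper script could call is_completed() before handing a batch to the agent, so a completed batch never reaches the prompt at all.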

Multi-Session Task Completion

When the full task completes across multiple sessions, the session memory file becomes the audit trail:

finalize_session_memory() {
    echo "" >> "$SESSION_MEM"
    echo "## TASK COMPLETE — $(date -u +%Y-%m-%dT%H:%MZ)" >> "$SESSION_MEM"
    echo "Final status: all batches processed." >> "$SESSION_MEM"

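    # Archive to outputs/ for the permanent record; keep the workspace if the copy fails
    cp "$SESSION_MEM" "$INTUITEK/outputs/session_memory_final_${TASK_ID}.md" || return 1

    # Session workspace can be cleaned up. ${TASK_ID:?} aborts if TASK_ID is empty,
    # which would otherwise delete everything under working/
    rm -rf "$INTUITEK/working/${TASK_ID:?}/"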
}

The Production Implementation

The patterns above are the core architecture. The production implementation includes:

Session memory file factory with schema enforcement
Checkpoint writer with automatic section routing (Decisions / Progress / Discoveries)
Session startup reader with progress state reconstruction
Compaction-aware CLAUDE.md template blocks for embedding memory protocol in agent prompts
Multi-session task tracker (start / resume / complete state machine)
Finalization handler with output archival and workspace cleanup
Cross-session decision log with supersede detection (prevents conflicting decisions)

Session Memory Architecture — Production Context Persistence:
https://www.shopclawmart.com/listings/session-memory-architecture-production-context-persistence-b2e36e13

$19. Instant download. One-time purchase.

Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Coordinator Resume Integrity: What Happens When a Claude Code Agent Loses Its Mind Mid-Handoff

~K¹yle Million — Wed, 22 Apr 2026 12:08:22 +0000

Your coordinator agent dispatched three sub-agents. Sub-agent 1 finished. Sub-agent 2 is halfway through. Sub-agent 3 hasn't started yet.

Then your coordinator's session ends. Context limit hit. Cron killed the process. Doesn't matter why — the coordinator is gone.

Next cron tick, a new coordinator starts. It doesn't know Sub-agent 1 is done. It doesn't know Sub-agent 2 is mid-task. It restarts all three.

Sub-agent 1 runs again, producing duplicate output. Sub-agent 2 conflicts with itself. Sub-agent 3 finally starts — after two unnecessary reruns. Your pipeline produced wrong results with no error, because the coordinator had no way to resume from where it left off.

This is coordinator resume integrity failure. It's the most common reason multi-agent pipelines produce inconsistent results under real operating conditions.

Why Coordinators Fail to Resume

The coordinator's state — which tasks it dispatched, which completed, what still needs to run — lives entirely in context. That context is not written anywhere. When the session ends, it's gone.

Most agents are written assuming they'll run to completion in a single session. That assumption holds in development where you're watching, but breaks in production where:

Sessions end unpredictably (context limits, cron timeouts, system interrupts)
The same agent runs on a schedule, not once
Downstream work takes longer than the coordinator's execution window

Three specific failure modes:

1. Duplicate execution

Coordinator resumes with no state. Re-dispatches all sub-agents. Sub-agents that already completed run again. If sub-agents write to fixed paths, the second run overwrites the first. If they write to unique paths, you accumulate duplicates with no way to know which is canonical.

2. Partial completion invisible to the next coordinator

Sub-agent 2 is 40% through its task. New coordinator restarts it from zero. Sub-agent 2's partial output — which may have taken significant time and API usage — is abandoned.

3. Ordering violations

Coordinator was enforcing an execution order: A before B before C. New coordinator starts all three simultaneously. B runs before A has committed its output. B reads stale data.

What Doesn't Work

Checking output files

Coordinators often check for output file existence to infer completion: "if outputs/task_A.md exists, A is done." This breaks when:

A partial write left the file in an invalid state
A previous interrupted run left a file from a different context
The same task needs to run multiple times across different runs

Reading sub-agent logs

Sub-agent logs tell you what happened inside that sub-agent's run. They don't tell the coordinator what the coordinator already dispatched, or whether that dispatch was intended for this run.

Trusting context to persist

Context doesn't persist across sessions. Period. Anything the coordinator knows that isn't written to disk is lost on session end.

The Pattern: Explicit Dispatch Ledger

Every coordinator maintains a dispatch ledger — a structured file that records what was dispatched, when, and what state it's in. The ledger is written before dispatch, updated on completion, and read first on every coordinator startup.

LEDGER="$INTUITEK/coordination/dispatch_ledger_${PIPELINE_ID}.json"

Ledger schema:

{
  "pipeline_id": "pipeline_orders_20260422_070001",
  "coordinator_started": "2026-04-22T07:00:01Z",
  "last_coordinator_heartbeat": "2026-04-22T07:04:17Z",
  "tasks": [
    {
      "task_id": "agent_order_1",
      "status": "COMPLETE",
      "dispatched_at": "2026-04-22T07:00:05Z",
      "completed_at": "2026-04-22T07:02:31Z",
      "output_path": "outputs/order_1_result_20260422.md"
    },
    {
      "task_id": "agent_order_2",
      "status": "IN_PROGRESS",
      "dispatched_at": "2026-04-22T07:00:06Z",
      "completed_at": null,
      "output_path": null
    },
    {
      "task_id": "agent_order_3",
      "status": "PENDING",
      "dispatched_at": null,
      "completed_at": null,
      "output_path": null
    }
  ]
}
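
The schema above has no notion of ordering, which is what the A-before-B-before-C failure mode needs. One possible extension, purely as a sketch, is a hypothetical depends_on list on each task, with the coordinator dispatching only tasks whose dependencies are COMPLETE.

```python
from typing import List

def ready_tasks(ledger: dict) -> List[str]:
    """Return IDs of PENDING tasks whose dependencies are all COMPLETE."""
    status = {task["task_id"]: task["status"] for task in ledger["tasks"]}
    ready = []
    for task in ledger["tasks"]:
        if task["status"] != "PENDING":
            continue
        deps = task.get("depends_on", [])  # hypothetical field, not part of the schema above
        if all(status.get(dep) == "COMPLETE" for dep in deps):
            ready.append(task["task_id"])
    return ready
```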

Coordinator startup sequence:

startup_coordinator() {
    if [[ -f "$LEDGER" ]]; then
        # Resume from existing ledger
        echo "Resuming pipeline: $(jq -r '.pipeline_id' "$LEDGER")"
        RESUME=true
    else
        # Initialize new ledger
        python3 -c "
import json, datetime
ledger = {
    'pipeline_id': 'pipeline_${PIPELINE_TYPE}_$(date -u +%Y%m%d_%H%M%S)',
    'coordinator_started': datetime.datetime.utcnow().isoformat() + 'Z',
    'last_coordinator_heartbeat': datetime.datetime.utcnow().isoformat() + 'Z',
    'tasks': []
}
print(json.dumps(ledger, indent=2))
" > "$LEDGER"
        RESUME=false
    fi
}

Before dispatching any sub-agent, write its entry to the ledger:

dispatch_task() {
    local TASK_ID="$1"
    local TASK_PROMPT="$2"

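    # Record the task as IN_PROGRESS in the ledger before dispatch
    python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
now = datetime.datetime.utcnow().isoformat() + 'Z'
# Reuse the existing entry on re-dispatch so the ledger never holds duplicates
entry = next((t for t in ledger['tasks'] if t['task_id'] == '$TASK_ID'), None)
if entry is None:
    entry = {'task_id': '$TASK_ID'}
    ledger['tasks'].append(entry)
entry.update({
    'status': 'IN_PROGRESS',
    'dispatched_at': now,
    'completed_at': None,
    'output_path': None
})
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"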
    # Dispatch the sub-agent
    bash ~/intuitek/run_task.sh "$TASK_PROMPT" &
}

On coordinator restart, read the ledger and skip completed tasks:

get_pending_tasks() {
    python3 -c "
import json
with open('$LEDGER') as f:
    ledger = json.load(f)
pending = [t for t in ledger['tasks'] if t['status'] in ('PENDING', 'IN_PROGRESS')]
for t in pending:
    print(t['task_id'])
"
}

# Only dispatch tasks that aren't COMPLETE
for TASK_ID in $(get_pending_tasks); do
    # get_task_prompt is assumed to be defined elsewhere and to return the prompt for a task ID
    dispatch_task "$TASK_ID" "$(get_task_prompt "$TASK_ID")"
done
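
The snippets so far write the ledger before dispatch and read it on restart, but the "updated on completion" step is not shown. A sketch of that step in Python follows; mark_task_complete is a hypothetical helper, and it assumes only one process updates the ledger at a time.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def mark_task_complete(ledger_path: str, task_id: str, output_path: str) -> None:
    """Set a task's status to COMPLETE and record when and where it finished."""
    with open(ledger_path) as f:
        ledger = json.load(f)
    for task in ledger["tasks"]:
        if task["task_id"] == task_id:
            task["status"] = "COMPLETE"
            task["completed_at"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
            task["output_path"] = output_path
            break
    else:
        raise KeyError(f"task {task_id!r} not found in ledger")
    # Write to a temp file and rename so a crash never leaves a half-written ledger.
    directory = os.path.dirname(os.path.abspath(ledger_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(ledger, f, indent=2)
    os.replace(tmp_path, ledger_path)
```

If sub-agents and a heartbeat loop can write the ledger at the same moment, these read-modify-write updates would also need a lock around them, or one update can silently undo another.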

Heartbeat for Long-Running Pipelines

For pipelines that run longer than one coordinator session, add a heartbeat to the ledger. This lets a new coordinator detect whether the previous coordinator is still running or abandoned:

update_heartbeat() {
    python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
ledger['last_coordinator_heartbeat'] = datetime.datetime.utcnow().isoformat() + 'Z'
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"
}

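# Run in the background, updating every 60 seconds
while true; do
    update_heartbeat
    sleep 60
done &
HEARTBEAT_PID=$!
# Stop the heartbeat when the coordinator exits, or the pipeline will look alive forever
trap 'kill "$HEARTBEAT_PID" 2>/dev/null' EXIT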

On startup, check if the previous coordinator abandoned the pipeline:

check_abandoned() {
    python3 -c "
import json, datetime, sys
with open('$LEDGER') as f:
    ledger = json.load(f)
last_hb = ledger.get('last_coordinator_heartbeat')
if last_hb:
    age_seconds = (datetime.datetime.utcnow() - datetime.datetime.fromisoformat(last_hb.rstrip('Z'))).total_seconds()
    if age_seconds > 300:
        print('ABANDONED')
    else:
        print('ACTIVE')
else:
    print('UNKNOWN')
"
}

STATUS=$(check_abandoned)
if [[ "$STATUS" == "ACTIVE" ]]; then
    echo "Previous coordinator still active — exiting to avoid conflict"
    exit 0
fi

Cleanup and Pipeline Completion

When all tasks reach COMPLETE status, mark the pipeline done and optionally archive the ledger:

mark_pipeline_complete() {
    python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
ledger['pipeline_completed'] = datetime.datetime.utcnow().isoformat() + 'Z'
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"
    # Move ledger to completed/
    mkdir -p "$INTUITEK/coordination/completed"
    mv "$LEDGER" "$INTUITEK/coordination/completed/$(basename "$LEDGER")"
}

The Production Implementation

The patterns above are the core logic. The production implementation includes:

Ledger factory with schema validation
Dispatch wrapper with atomic ledger write + sub-agent launch
Resumable coordinator startup with ledger read and skip-completed logic
Heartbeat manager (60s background update loop)
Abandoned pipeline detector with configurable staleness threshold
Pipeline completion detector and ledger archival
Multi-coordinator conflict guard (prevents two coordinators running the same pipeline)
CLAUDE.md template for embedding resume logic in coordinator agent prompts

Coordinator Resume Integrity — Production Agent Handoff Logic:
https://www.shopclawmart.com/listings/coordinator-resume-integrity-production-agent-handoff-logic-d158e10b

$19. Instant download. One-time purchase.

Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Agent Memory Scoping: Why Concurrent Claude Code Agents Need Isolated Memory

~K¹yle Million — Tue, 21 Apr 2026 21:06:26 +0000

By W. Kyle Million (~K¹) | IntuiTek¹ | Published on dev.to/@thebrierfox

Two Claude Code agents. One task each. Running in parallel.

Agent 1 writes context.md. Agent 2 reads context.md. Agent 2 is now running in Agent 1's context instead of its own.

This isn't a bug you'll catch in testing. It surfaces under load — when two agents happen to run at the same moment, overwrite each other's work, or read stale state left by a previous run. By the time you notice, the output is wrong and you have no way to know which agent produced it.

Memory scoping is the architecture pattern that prevents this.

The Problem: Shared Memory in Autonomous Agents

Claude Code agents read and write files. That's their memory. In a single-agent setup, this works fine — there's only one writer.

Add a second agent and you have the classic concurrent write problem. Files don't know about agents. context.md doesn't have a lock. The last write wins.

Three concrete failure modes:

1. Context contamination

Agent A processes customer order 1 and writes findings to working/analysis.md. Before A finishes, Agent B starts processing customer order 2 and reads working/analysis.md as its starting context — now B is reasoning about the wrong order. Neither agent knows anything went wrong.

2. Partial-write corruption

Agent A is mid-write to output.md when Agent B reads it. B gets a truncated file: well-formed up to line 47, then cut off mid-record. B's subsequent reasoning is based on incomplete data.

3. Stale state loops

Agent A fails halfway through. It leaves working/checkpoint.md in an intermediate state. Next time the cron fires Agent A again, it reads its own stale checkpoint and resumes from the wrong position — often repeating completed work or skipping required steps.
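Failure mode 2 has a narrow fix that is independent of scoping: never write the final file in place. A minimal sketch, assuming a POSIX-style filesystem where a rename within one directory is atomic; `atomic_write` is an illustrative name, not part of any agent tooling:

```python
import os
import tempfile


def atomic_write(path: str, content: str) -> None:
    """Write content so readers see either the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory because os.replace is only
    # atomic within a single filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This does nothing for context contamination or stale checkpoints; those need the scoping described next.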

The Pattern: Agent-Scoped Memory Paths

The fix is simple in principle: each agent gets its own memory namespace. No shared state unless explicitly designed.

# Before: every agent writes to the same path
WORKSPACE="$INTUITEK/working/"

# After: each agent writes to its own scoped path
AGENT_ID="${AGENT_NAME:-agent}_${TASK_ID:-$(date +%s)}"
WORKSPACE="$INTUITEK/working/${AGENT_ID}/"
mkdir -p "$WORKSPACE"

Every file write goes to working/{agent_id}/ instead of working/. Agent 1 writes to working/agent_order_1_1713654000/context.md. Agent 2 writes to working/agent_order_2_1713654001/context.md. They never touch each other's files.

The AGENT_ID should be composed of:

A task-type prefix (human-readable label)
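A unique ID or timestamp (a second-resolution timestamp alone can still collide when two agents start in the same second, so add the PID or a random suffix if that can happen)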

# Examples
AGENT_ID="orders_processor_1713654000"
AGENT_ID="reddit_poster_822629015"
AGENT_ID="content_draft_20260421_153022"
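The same ID scheme can be generated from Python. This is a sketch, not the packaged utility: `make_agent_id` and `create_workspace` are hypothetical names, and the random suffix is an addition to cover two agents starting in the same second.

```python
import os
import re
import time
import uuid


def make_agent_id(task_type: str) -> str:
    """Build '<task_type>_<unix_ts>_<random>'; the random part covers same-second starts."""
    label = re.sub(r"[^A-Za-z0-9_]+", "_", task_type).strip("_") or "agent"
    return f"{label}_{int(time.time())}_{uuid.uuid4().hex[:8]}"


def create_workspace(base_dir: str, agent_id: str) -> str:
    """Create <base_dir>/working/<agent_id>/ and fail loudly if it already exists."""
    workspace = os.path.join(base_dir, "working", agent_id)
    # exist_ok=False turns an ID collision into an error instead of silent sharing
    os.makedirs(workspace, exist_ok=False)
    return workspace
```

Failing on an existing directory is deliberate: a collision is exactly the shared-state situation the pattern is meant to rule out, so it should stop the run rather than proceed.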

Memory Categories and Isolation Rules

Not all memory should be isolated. Different categories have different access patterns:

Exclusive Memory (always isolated)

Working files, intermediate results, agent-specific context, logs for this run.

working/{agent_id}/
├── context.md       # agent's current reasoning state
├── progress.md      # task-specific progress tracker
├── scratch/         # temp files this agent creates and reads
└── output/          # completed work product (before review)

Shared-Read Memory (never write; read freely)

Reference data, configuration, prompts, source files.

soul/               # identity files — read-only by all agents
CLAUDE.md           # operating instructions — read-only
.env                # credentials — read-only (source it; don't write it)
capabilities/       # reusable tools — read-only

Coordination Memory (write with lock; read freely)

Shared state that multiple agents need to see and occasionally update.

coordination/
├── SHARED_MIND.md      # operational state — lock before write
├── CURRENT_STATE.md    # per-run status — lock before write
└── locks/              # lock files for coordination resources

Output Memory (write once; never overwrite)

Final deliverables. Each run writes new files; nothing is modified in place.

outputs/
├── task_{id}_20260421.md     # immutable once written
├── report_{id}_20260421.md   # never overwrite existing outputs
└── error_{id}_20260421.md    # same for failures

The naming convention {type}_{id}_{date}.md makes outputs inspectable and prevents collision.
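Write-once can be enforced at the file-open level rather than by convention. A minimal sketch, with a hypothetical `write_output_once` helper that follows the {type}_{id}_{date}.md naming above:

```python
import os


def write_output_once(outputs_dir: str, kind: str, task_id: str, date: str, body: str) -> str:
    """Create outputs/<kind>_<task_id>_<date>.md, refusing to overwrite an existing file."""
    os.makedirs(outputs_dir, exist_ok=True)
    path = os.path.join(outputs_dir, f"{kind}_{task_id}_{date}.md")
    # Mode "x" creates the file exclusively and raises FileExistsError if it is already there
    with open(path, "x") as f:
        f.write(body)
    return path
```

A second run with the same type, ID, and date raises instead of replacing the earlier deliverable, which surfaces accidental re-runs early.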

Lock Protocol for Coordination Memory

Shared coordination files (SHARED_MIND, CURRENT_STATE) need a lock before write. The protocol:

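LOCK_DIR="$INTUITEK/coordination/locks"
RESOURCE_SLUG="shared-mind"  # one slug per resource
LOCK_FILE="$LOCK_DIR/${RESOURCE_SLUG}.lock"
MAX_WAIT=30  # seconds

acquire_lock() {
    local waited=0 age
    mkdir -p "$LOCK_DIR"
    # noclobber makes the redirect fail if the lock file already exists, so the
    # existence check and the write happen as one atomic step
    until ( set -o noclobber; printf 'peer: %s\nacquired: %s\n' "$AGENT_ID" "$(date -Iseconds)" > "$LOCK_FILE" ) 2>/dev/null; do
        # stat -c %Y is GNU coreutils; on macOS/BSD use: stat -f %m
        age=$(( $(date +%s) - $(stat -c %Y "$LOCK_FILE" 2>/dev/null || date +%s) ))
        if [[ $age -gt 300 ]]; then
            # Stale lock (5+ min old): remove it and retry
            rm -f "$LOCK_FILE"
            continue
        fi
        sleep 2
        waited=$(( waited + 2 ))
        if [[ $waited -ge $MAX_WAIT ]]; then
            echo "Lock timeout after ${MAX_WAIT}s" >&2
            exit 1
        fi
    done
}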

release_lock() {
    rm -f "$LOCK_FILE"
}

For files that only one agent writes but others read, a lock isn't required, because there is only one writer. The lock is only for files that multiple agents might write.
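For agents driven from Python, the same protocol can be sketched as a context manager. This version relies on `os.open` with `O_CREAT | O_EXCL`, which creates the file only if it does not already exist, so check and create are a single step. The function name and its defaults are illustrative, mirroring the 30-second wait and 5-minute staleness used above:

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def coordination_lock(lock_file: str, agent_id: str, max_wait: float = 30.0,
                      stale_after: float = 300.0, poll: float = 0.5):
    """Hold an exclusive lock file for the duration of the with-block."""
    deadline = time.monotonic() + max_wait
    while True:
        try:
            # Fails with FileExistsError if another agent already holds the lock
            fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            try:
                age = time.time() - os.path.getmtime(lock_file)
            except FileNotFoundError:
                continue  # holder released between the two calls; retry now
            if age > stale_after:
                # Stale lock: remove and retry. Two waiters can race here, so a
                # stricter setup would verify the owner before removing
                try:
                    os.remove(lock_file)
                except FileNotFoundError:
                    pass
                continue
            if time.monotonic() >= deadline:
                raise TimeoutError(f"Lock timeout after {max_wait}s: {lock_file}")
            time.sleep(poll)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(f"peer: {agent_id}\n")
        yield
    finally:
        # Always release, even if the guarded block raised
        try:
            os.remove(lock_file)
        except FileNotFoundError:
            pass
```

Usage is `with coordination_lock(path, agent_id): ...` around the write to SHARED_MIND.md. Raising on timeout instead of exiting leaves the decision to the caller.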

Cleanup After Termination

Scoped workspaces accumulate. Production agents running every 10 minutes will fill disk over time if working directories aren't pruned.

Two cleanup strategies:

Strategy A: Keep-on-failure, delete-on-success

cleanup_workspace() {
    local exit_code=$?  # inside an EXIT trap, $? is the script's exit status
    if [[ $exit_code -eq 0 ]]; then
        rm -rf "$WORKSPACE"
    else
        # Keep failed workspace for debugging
        mkdir -p "$INTUITEK/failed_workspaces"
        mv "$WORKSPACE" "$INTUITEK/failed_workspaces/${AGENT_ID}"
        echo "Failed workspace preserved at: failed_workspaces/${AGENT_ID}/"
    fi
}

trap 'cleanup_workspace' EXIT

Strategy B: Prune-after-N-days

# Run as cron cleanup job
find "$INTUITEK/working/" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} \;

Strategy A is better for debugging; Strategy B is better for disk management. In production: use both. Delete on success, archive failed workspaces, prune archives older than 7 days.

The Multi-Agent Test

Before deploying any multi-agent architecture, run this test:

Start two agent instances with the same task simultaneously
Check whether output files from Run 1 and Run 2 are in separate directories (PASS) or the same directory (FAIL)
Check whether Agent 2 read any files written by Agent 1 (FAIL if yes, unless it was explicitly designed to)
Check whether either agent failed due to a lock conflict that wasn't handled (FAIL)

If all three checks pass: your memory scoping is production-grade.

If you fail any of them: you have a concurrency bug that will surface under load.
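The first two checks can be automated. The sketch below uses a stand-in `run_fake_agent` in place of a real Claude Code invocation (that function and the result keys are hypothetical), starts two of them at once with the same task, and verifies that each ended up in its own directory with its own context file:

```python
import os
import uuid
from concurrent.futures import ThreadPoolExecutor


def run_fake_agent(base_dir: str, task: str) -> str:
    """Stand-in for a real agent run: creates a scoped workspace and writes context.md."""
    agent_id = f"{task}_{uuid.uuid4().hex[:8]}"
    workspace = os.path.join(base_dir, "working", agent_id)
    os.makedirs(workspace)
    with open(os.path.join(workspace, "context.md"), "w") as f:
        f.write(f"owner: {agent_id}\n")
    return workspace


def isolation_check(base_dir: str, task: str = "orders") -> dict:
    """Run two agents on the same task concurrently and report isolation results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_fake_agent, base_dir, task) for _ in range(2)]
        workspaces = [f.result() for f in futures]

    own_context = True
    for workspace in workspaces:
        with open(os.path.join(workspace, "context.md")) as f:
            # Each context.md must name its own workspace as the owner
            if f.read().strip() != f"owner: {os.path.basename(workspace)}":
                own_context = False

    return {"separate_dirs": len(set(workspaces)) == 2, "own_context": own_context}
```

To test a real setup, swap `run_fake_agent` for a function that launches your agent and returns the workspace it used; the lock-conflict check still has to be read from the agents' own logs.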

Why This Matters More as You Scale

Single-agent deployments hide memory scoping problems. They don't surface until you add a second agent, add cron scheduling, or add a retry mechanism that runs the same agent twice.

The pattern that costs nothing to implement when building is expensive to retrofit after the first production incident. Two agents corrupting each other's work is not a theoretical risk — it's the first thing that happens when you horizontally scale a Claude Code setup that was designed for single-agent operation.

The Production Implementation

The patterns above are the foundation. The production implementation includes:

Agent ID generation utilities with collision-free timestamp-based naming
Scoped workspace factory with automatic directory creation
Lock protocol implementation with stale lock detection and exponential backoff
Memory category classifier for new projects (helps you decide what goes where)
Cleanup handler for both success/failure paths
Test harness for validating isolation in concurrent runs
CLAUDE.md template blocks for embedding memory scoping rules in agent instructions

Agent Memory Scoping — Production Isolation Architecture:
https://www.shopclawmart.com/listings/agent-memory-scoping-production-isolation-architecture-8d66ead8

$19. Instant download. One-time purchase.

Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Tags: claudecode, devtools, aiagents, programming, productivity

Loop Termination Architecture: How Production Agents Know When to Stop

~K¹yle Million — Tue, 21 Apr 2026 21:05:04 +0000

By W. Kyle Million (~K¹) | IntuiTek¹ | Published on dev.to/@thebrierfox

Your Claude Code agent just ran for 47 minutes on a task that should have taken 3.

It made 200 bash calls. 150 of them were retrying the same failing operation. It wrote 12 intermediate files. Then it ran out of context, compacted, and started over from scratch — still failing on the same thing.

This isn't a model problem. It's an architecture problem. Your agent has no circuit breaker.

The Problem: Autonomous Agents Don't Know When to Quit

When you hand a task to a Claude Code agent in -p mode, you're handing it a goal. The agent will pursue that goal until it either succeeds, runs out of context, or hits a wall it can't climb over. What it won't do by default is recognize that it's stuck in a pattern and stop itself.

Three failure modes that kill production runs:

1. Infinite retry loops

The task fails. The agent retries. It fails again with the same error. The agent retries again. Repeat until context exhausts. If the error is permission denied or service unavailable, no amount of retrying will fix it — but the agent doesn't know that.

2. False progress spirals

The agent makes some progress on each iteration, but never enough to complete the goal. It keeps finding new things to try. From inside the loop it looks like progress. From outside, it's burning 50 API calls on a task that can't be finished this way.

3. Scope creep cascades

The agent discovers that completing task A requires completing task B, which requires task C. Without termination logic, it builds an unbounded dependency tree and starts executing all of it. The original task is buried.

The Pattern: Three-Layer Termination Architecture

Production agents need termination logic at three levels:

Layer 1: Step Counter (Hard Limit)

Every agent execution should have a maximum step count. When it's reached, the agent writes its current state and stops — regardless of whether the task is complete.

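MAX_STEPS = 50
step_count = 0

class TerminationError(Exception):
    """Raised when the agent must stop."""

def execute_step(action):
    # write_state_file() and perform_action() are your own helpers
    global step_count

    # Check before counting so exactly MAX_STEPS actions can run
    if step_count >= MAX_STEPS:
        write_state_file(f"STOPPED: max steps ({MAX_STEPS}) reached")
        raise TerminationError(f"Hard limit reached: {step_count} steps")

    step_count += 1
    return perform_action(action)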

The right max for your agent depends on the task class. For filesystem operations: 20-30 steps. For multi-file code generation: 50-75. For research and synthesis: 100+. Start conservative; you can always raise it after a clean run.

In Claude Code terms: if your headless agent is triggered via claude -p, you can't inject this directly into Claude's reasoning. But you can wrap the invocation:

#!/bin/bash
# run_task.sh with timeout-based termination
timeout 300 claude -p "$TASK_PROMPT" \
  --allowedTools "Bash(*),Read(*),Write(*),Edit(*)" \
  --output-format text

if [[ $? -eq 124 ]]; then
  echo "TERMINATED: 5-minute timeout reached" >> logs/errors.log
  bash notify.sh "⚠️ Agent timeout: $TASK_PROMPT"
fi

A 5-minute timeout is a blunt instrument. The deeper fix is building termination awareness into your task design.
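If the wrapper lives in Python rather than bash, the same hard deadline is available through `subprocess.run`. A minimal sketch; `run_with_deadline` and the shape of its return value are illustrative:

```python
import subprocess
from typing import Sequence


def run_with_deadline(cmd: Sequence[str], timeout_s: float) -> dict:
    """Run a command with a hard wall-clock limit and report how it ended."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired
        return {"status": "timeout", "returncode": None, "stdout": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}
```

A call such as `run_with_deadline(["claude", "-p", prompt], 300)` would mirror the bash version; the caller can then log and notify on `status == "timeout"`.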

Layer 2: Error Accumulation Counter (Smart Limit)

Not all failure is equal. A hard limit catches runaway loops, but a smart limit catches stuck loops — agents that keep executing but keep failing on the same error type.

error_counts = {}
ERROR_THRESHOLD = 3  # same error type: stop after 3

def handle_error(error_type: str, context: str):
    error_counts[error_type] = error_counts.get(error_type, 0) + 1

    if error_counts[error_type] >= ERROR_THRESHOLD:
        write_escalation(
            f"BLOCKED: {error_type} failed {error_counts[error_type]} times. "
            f"Last context: {context}. Stopping."
        )
        raise TerminationError(f"Repeated failure: {error_type}")

    return retry_with_backoff()

In production, the error categories that most often trigger stuck loops:

PermissionError / permission denied — fix is environmental, not retry
ConnectionRefusedError / service unavailable — fix requires intervention
KeyError / AttributeError on the same field — the data isn't there; more retries won't produce it
FileNotFoundError on a generated file — prior step failed silently

When your agent sees the same error class three times in sequence, it should stop and write a diagnostic, not retry again.
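The counter above is cumulative: three PermissionErrors spread across a long run trip it just as three in a row would. If you want strictly "three times in sequence", the streak has to reset on success or on a different error. A small sketch of that variant (the class name is illustrative):

```python
from typing import Optional


class ConsecutiveErrorTracker:
    """Trips only when the same error type repeats back-to-back."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.last_error: Optional[str] = None
        self.streak = 0

    def record_success(self) -> None:
        # Any successful step breaks the streak
        self.last_error = None
        self.streak = 0

    def record_error(self, error_type: str) -> bool:
        """Return True once error_type has occurred `threshold` times in a row."""
        if error_type == self.last_error:
            self.streak += 1
        else:
            self.last_error = error_type
            self.streak = 1
        return self.streak >= self.threshold
```

Which variant fits depends on the task: cumulative counting is stricter and stops sooner, consecutive counting tolerates flaky operations that eventually succeed.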

Layer 3: Goal Proximity Check (Semantic Limit)

The hardest termination problem: the agent is making progress but will never finish this way. Step counters won't catch it. Error counters won't catch it. But goal proximity can.

Before each step, the agent should evaluate: Is this action moving me toward completion, or is it maintenance?

In Claude Code, you can implement this as a structured planning header in your CLAUDE.md:

## TASK EXECUTION PROTOCOL

Before each action:
1. State what completion looks like (one sentence)
2. State what this specific action accomplishes toward that goal
3. If you cannot articulate the connection, STOP and write the blocker to outputs/

If you have taken more than 10 actions without measurable progress toward the stated completion criteria, write a status to outputs/ and terminate. Do not keep trying.

This sounds simple. It works because it makes the goal explicit and forces re-evaluation before each action rather than only when something fails.
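When the wrapper can measure progress itself, the "more than 10 actions without progress" rule can be enforced outside the prompt as well. A sketch, assuming progress can be expressed as a number that only counts when it exceeds the previous best (for example, checklist items completed); the class and parameter names are hypothetical:

```python
class ProgressMonitor:
    """Signals a stop after more than `max_stalled` consecutive actions without progress."""

    def __init__(self, max_stalled: int = 10, baseline: float = 0.0):
        self.max_stalled = max_stalled
        self.best = baseline
        self.stalled = 0

    def record(self, progress: float) -> bool:
        """Record the progress value after an action; return True if the agent should stop."""
        if progress > self.best:
            # New high-water mark: reset the stall counter
            self.best = progress
            self.stalled = 0
        else:
            self.stalled += 1
        return self.stalled > self.max_stalled
```

The hard part is picking the metric, not the counter. A count of files written rewards busywork; a count of acceptance criteria met is closer to what the CLAUDE.md protocol asks the agent to reason about.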

What "Stopping Clean" Means

Stopping is not failing. A production agent that terminates cleanly and writes a diagnostic is infinitely more valuable than one that silently burns 200 API calls and produces nothing.

Clean termination means:

Write current state — what was completed, what wasn't, what state the filesystem is in
Write the blocker — exactly what stopped execution (error, step limit, goal check)
Preserve partial work — don't clean up partially-completed files; document them instead
Notify — push to your notification channel so you know to review

import sys
from datetime import datetime, timezone

def terminate_clean(reason: str, state: dict):
    # notify() is your own notification hook (Telegram, Slack, webhook)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    output_file = f"outputs/terminated_{stamp}.md"

    with open(output_file, "w") as f:
        f.write(f"# Agent Terminated: {reason}\n\n")
        f.write(f"**Steps taken:** {state['steps']}\n")
        f.write(f"**Completed:** {state['completed']}\n")
        f.write(f"**Incomplete:** {state['incomplete']}\n")
        f.write(f"**Files written:** {state['files']}\n")
        f.write(f"**Blocker:** {state['last_error']}\n")

    notify(f"⚠️ Agent stopped: {reason} → {output_file}")
    sys.exit(0)  # 0, not 1: clean termination is not an error

Note sys.exit(0). If your orchestrator uses the exit code to decide success or failure, a clean termination should return 0. Reserve exit 1 for unhandled crashes.

Putting It Together: The Circuit Breaker Pattern

The full architecture looks like this:

Agent starts
    │
    ▼
Check: step_count < MAX_STEPS
    │ no → terminate_clean("max steps")
    ▼
Execute action
    │
    ├── success → continue
    │
    └── error → error_counts[type]++
                    │
                    ├── count < THRESHOLD → retry with backoff
                    │
                    └── count >= THRESHOLD → terminate_clean("repeated error: {type}")
    │
    ▼
Goal proximity check
    │ no progress in N steps → terminate_clean("no progress")
    ▼
Repeat

The three layers create defense-in-depth:

Step counter catches runaway loops
Error accumulator catches stuck loops
Goal check catches false progress spirals

None of them requires you to predict what will go wrong. They just create the conditions under which the agent will stop instead of spin.
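The diagram can be condensed into one loop. This is a sketch under simplifying assumptions, not the packaged implementation: each action is a callable that returns a progress number, a failed action is skipped rather than retried with backoff, and all names are illustrative.

```python
from typing import Callable, Dict, Iterable, Tuple


class TerminationError(Exception):
    """Raised when a termination layer decides the run should stop."""


def run_with_circuit_breaker(
    actions: Iterable[Callable[[], float]],
    max_steps: int = 50,
    error_threshold: int = 3,
    max_stalled: int = 10,
) -> Tuple[int, float]:
    """Run actions until they are exhausted or one of the three layers trips."""
    error_counts: Dict[str, int] = {}
    steps, best, stalled = 0, 0.0, 0

    for action in actions:
        # Layer 1: hard step limit
        if steps >= max_steps:
            raise TerminationError("max steps")
        steps += 1

        try:
            progress = action()
        except Exception as exc:
            # Layer 2: the same error type keeps coming back
            name = type(exc).__name__
            error_counts[name] = error_counts.get(name, 0) + 1
            if error_counts[name] >= error_threshold:
                raise TerminationError(f"repeated error: {name}") from exc
            continue

        # Layer 3: no measurable progress for too long
        if progress > best:
            best, stalled = progress, 0
        else:
            stalled += 1
            if stalled >= max_stalled:
                raise TerminationError("no progress")

    return steps, best
```

The caller catches TerminationError and hands the reason to whatever clean-termination routine it uses, so every exit path produces a diagnostic.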

The Difference Between a Script and a System

A script does what you tell it. A system knows when it can't do what you told it.

Production Claude Code agents are systems. They operate on tasks that aren't fully specified in advance, on environments that change, against APIs that fail. The question isn't whether your agent will eventually hit a wall — it's whether it knows how to stop when it does.

Loop termination architecture is the difference between an agent that costs $0.003 per run and one that costs $0.47 because it retried a network error 80 times before context death.

The Production Implementation

The patterns above are the foundation. The production implementation includes:

Full circuit breaker class with configurable thresholds per error type
Goal proximity evaluator with configurable check intervals
Clean termination handler with filesystem state snapshot
Integration hooks for common notification channels (Telegram, Slack, webhook)
Claude Code CLAUDE.md templates for embedding termination logic in agent instructions
Tested with the 6 most common failure patterns in production Claude Code deployments

Loop Termination Architecture — Production Agent Circuit Breaker:
https://www.shopclawmart.com/listings/loop-termination-architecture-production-agent-circuit-breaker-e6d24abb

$19. Instant download. One-time purchase.

Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Tags: claudecode, devtools, aiagents, programming, productivity

Claude Code in CI/CD: Running Autonomous Agents in GitHub Actions

~K¹yle Million — Tue, 21 Apr 2026 18:54:17 +0000

Part of a series on Claude Code in production. Previous: Error Recovery Patterns

You've built an autonomous Claude Code agent that works locally. Now you want it to run in CI/CD — triggered on PR open, on schedule, or as part of a deploy pipeline. The mechanics seem simple. The failure modes aren't.

This post covers the actual patterns for running Claude Code agents in GitHub Actions: authentication, tool approval, output handling, and the three mistakes that cause pipeline agents to silently fail.

Why CI/CD agents fail silently

Local Claude Code agents fail loudly. They block on prompts. They show you stderr. They stop when they can't proceed.

GitHub Actions agents fail silently. A subprocess exits with code 0. The pipeline passes. The work was never done.

Three root causes:

1. Missing --allowedTools causes silent prompt block. Claude Code, when it encounters a tool call that needs approval, will prompt. In a non-TTY environment, that prompt blocks forever until the process is killed by the runner timeout. Your pipeline "hangs," then fails — but your logs just show "process killed after 6h."

2. Output goes to stdout but gets swallowed. Claude Code's --output-format text mode writes the agent's output to stdout. If you don't explicitly capture it, it disappears into the Actions log. You'll have no artifact, no evidence the agent ran, nothing to route to downstream steps.

3. Auth fails at runtime, not at startup. The ANTHROPIC_API_KEY variable is present, the process starts, and then the first API call fails with a 401 because the key is empty, expired, or revoked. Exit code 1. The pipeline shows "Error" but nothing about what went wrong.

The fix for all three is in how you invoke Claude Code.

The invocation pattern

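- name: Run Claude Code agent
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    mkdir -p outputs
    status=0
    claude -p "$TASK_PROMPT" \
      --allowedTools "Bash(*),Read(*),Write(*),Edit(*)" \
      --output-format text \
      > outputs/agent_result.txt 2>outputs/agent_errors.txt || status=$?
    # Actions runs bash with -e, so capture the status instead of reading $? afterwards
    echo "Exit code: $status"
    exit "$status"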

Three things happening here:

-p (print mode): Non-interactive headless execution. No TTY needed. The agent runs, writes output, and exits. This is the mode to always use in CI/CD.

--allowedTools: Pre-approves tool calls. Without this, Claude Code silently blocks on the first tool call that needs approval. List every tool your agent uses. Bash(*) approves all bash commands; Bash(cd,ls,cat) restricts to specific commands. The asterisk is appropriate for trusted pipelines; be more restrictive for public PRs.

Output redirect: > outputs/agent_result.txt captures the agent's output as a file artifact. 2>outputs/agent_errors.txt captures stderr separately. Upload both as Actions artifacts so you can debug failures without re-running.
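Capturing output only helps if something checks it. A small follow-up step can turn "the pipeline passed but nothing happened" into a visible failure. A sketch in Python, assuming the two file paths from the redirect above; the function name, `min_bytes`, and `fail_on_stderr` are illustrative choices:

```python
import os
from typing import List


def check_agent_artifacts(result_path: str, errors_path: str,
                          min_bytes: int = 1, fail_on_stderr: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the run left usable evidence."""
    problems = []
    if not os.path.isfile(result_path):
        problems.append(f"missing result file: {result_path}")
    elif os.path.getsize(result_path) < min_bytes:
        problems.append(f"result file is empty: {result_path}")
    # stderr often carries harmless warnings, so treating it as fatal is opt-in
    if fail_on_stderr and os.path.isfile(errors_path) and os.path.getsize(errors_path) > 0:
        problems.append(f"stderr was not empty: {errors_path}")
    return problems
```

In a workflow step you would call this on outputs/agent_result.txt and outputs/agent_errors.txt, print each problem, and exit non-zero if the list is not empty.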

Authentication in Actions

env:
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

This is the minimum. Claude Code reads ANTHROPIC_API_KEY from the environment.

If your agent calls other APIs (Supabase, Stripe, Railway, etc.), those credentials need to be injected the same way:

env:
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
  SUPABASE_SERVICE_KEY: ${{ secrets.SUPABASE_SERVICE_KEY }}
  STRIPE_SECRET_KEY: ${{ secrets.STRIPE_SECRET_KEY }}

Don't try to source a .env file in CI/CD. The file won't exist in the runner environment, and if it does (because you committed it), you have a bigger problem. Inject each credential explicitly.
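Because a secret that isn't defined in the repository expands to an empty string in GitHub Actions, a preflight check before the agent starts can catch missing credentials in the first second instead of at the first API call. A minimal sketch; `missing_credentials` is a hypothetical helper:

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_credentials(required: Iterable[str],
                        environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]
```

A preflight step would call it with the variable names from the env block above and fail the job if the returned list is non-empty. It cannot tell whether a key is valid, only whether it is present.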

The CLAUDE.md in CI/CD pipelines

Claude Code auto-discovers CLAUDE.md in the working directory. This matters in CI/CD: if your repo has a CLAUDE.md, the agent running in your pipeline will read it.

This is a feature, not a bug — as long as your CLAUDE.md is written for it.

Patterns that work well in CI/CD CLAUDE.md:

## CI/CD Context
When running in GitHub Actions (detect: CI=true env var):
- Write all outputs to ./ci_outputs/ directory
- Do not prompt for confirmation on any action
- Exit 0 on success, exit 1 on any unrecoverable error
- Include a summary section at the end of every output file

Patterns that break CI/CD:

## Ask Kyle Before...

Any instruction that assumes human review will cause the agent to either silently skip the check or block.

Three useful pipeline patterns

1. PR review agent

Trigger: pull_request event. Agent reads the diff, checks against architectural guidelines, posts a comment.

name: Claude Code PR Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed for gh pr comment with GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code

      - name: Run review agent
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.number }}
        run: |
          mkdir -p ci_outputs
          DIFF=$(git diff origin/${{ github.base_ref }}...HEAD)
          claude -p "Review this PR diff for architectural issues, security vulnerabilities, and violations of our CLAUDE.md guidelines. PR #$PR_NUMBER. Diff: $DIFF. Post findings as a GitHub PR comment using: gh pr comment $PR_NUMBER --body 'findings'" \
            --allowedTools "Bash(gh,git),Read(*)" \
            --output-format text \
            > ci_outputs/review_result.txt

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: review-output
          path: ci_outputs/

The key: --allowedTools "Bash(gh,git),Read(*)" — only the specific commands the review agent needs. gh for posting the comment, git for reading the diff. Nothing else.

2. Scheduled maintenance agent

Trigger: cron schedule. Agent checks for stale dependencies, expired credentials, drift from baseline.

name: Weekly Maintenance

on:
  schedule:
    - cron: '0 9 * * 1'  # Monday 9am UTC

jobs:
  maintain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Claude Code
        run: npm install -g @anthropic-ai/claude-code

      - name: Run maintenance agent
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          claude -p "Run weekly maintenance: check package.json for outdated major versions, verify .env.example matches current .env schema, check for any TODO comments added in the last 7 days. Write a maintenance report to ci_outputs/maintenance_$(date +%Y%m%d).md" \
            --allowedTools "Bash(npm,git,grep,find,date),Read(*),Write(*)" \
            --output-format text

      - uses: actions/upload-artifact@v4
        with:
          name: maintenance-report
          path: ci_outputs/

3. Post-deploy verification agent

Trigger: after deploy step completes. Agent hits the deployed endpoints, checks health, writes a verification report.

- name: Verify deployment
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    DEPLOY_URL: ${{ steps.deploy.outputs.url }}
  run: |
    mkdir -p ci_outputs
    claude -p "Verify the deployment at $DEPLOY_URL. Check: /health endpoint returns 200, /api/status returns expected schema, response times under 500ms. Write verification report to ci_outputs/deploy_verify.md. Exit 1 if any check fails." \
      --allowedTools "Bash(curl,jq),Write(*)" \
      --output-format text \
      > ci_outputs/verify.txt
    cat ci_outputs/verify.txt

Note the Exit 1 if any check fails instruction in the prompt. This is how you make a Claude Code agent fail the pipeline — instruct it explicitly. Claude Code itself exits 0 unless the process errors; if you want business-logic failures to fail the pipeline, you need to tell the agent to exit non-zero.
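If the instruction alone does not reliably change the step's exit status in your setup, a deterministic fallback is to have the agent end its report with a machine-readable verdict and let a small script set the exit code. The `VERDICT: PASS` / `VERDICT: FAIL` line format here is an assumed convention you would add to the prompt, not something Claude Code emits on its own:

```python
import re


def verdict_exit_code(report_text: str) -> int:
    """Map a report to an exit code: 0 only if the last VERDICT line says PASS."""
    verdicts = re.findall(r"^VERDICT:\s*(PASS|FAIL)\s*$", report_text, flags=re.MULTILINE)
    if not verdicts:
        return 2  # no verdict at all: treat as a failed run, not a pass
    return 0 if verdicts[-1] == "PASS" else 1
```

A follow-up step reads ci_outputs/deploy_verify.md, passes its contents to this function, and exits with the result, so a missing or malformed report fails the pipeline too.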

Cost control in CI/CD

Every pipeline run makes API calls, usually several per agent session. On a busy repo, uncontrolled agents will drain your API budget.

Mechanisms to control it:

Model routing in the prompt:

Use claude-haiku-4-5 for this task. It's a structured review, not open-ended reasoning.

Or set ANTHROPIC_MODEL=claude-haiku-4-5-20251001 in the environment before the claude invocation.

Scope restriction: Narrow --allowedTools to exactly what the agent needs. An agent that can only Read(*) can't accidentally trigger expensive downstream operations.

Conditional execution:

- name: Run expensive agent
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  run: claude -p "..." ...

Only run the expensive agent on main branch pushes, not on every PR commit.

Output caching: If the agent's inputs haven't changed (same files, same config), skip the run:

- uses: actions/cache@v4
  id: agent-cache
  with:
    path: ci_outputs/
    key: agent-${{ hashFiles('src/**/*.ts') }}

- name: Run agent
  if: steps.agent-cache.outputs.cache-hit != 'true'
  run: claude -p "..." ...

The approval boundary in CI/CD

The hardest part of CI/CD agents isn't the mechanics — it's knowing what the agent is allowed to do without human review.

For public repositories with external contributors, an agent triggered by a PR should never have credentials that allow writes to production. It should only read and comment.

For internal automation, the boundary is:

Self-authorized: Read any file, write to designated output dirs, call read-only APIs, post comments
Requires review: Write to production databases, deploy to production, send external communications, modify CI/CD config

Write this boundary into your CLAUDE.md. Make the agent refuse anything outside it and write an explanation to ci_outputs/blocked.md instead.

What I'm running in production

My current Claude Code CI/CD setup:

10-minute inbox check (every 10min, 7am–11pm): reads task files, executes, archives
Weekly ClawMart report (Monday 9am): listing health, revenue data, next actions
Post-deploy verification: health endpoint checks after every Railway deploy
dev.to ban check (removed, purpose complete): was a 6h scheduled check for API access

All of these are headless -p mode with explicit --allowedTools. All write outputs as files to a designated directory. All send Telegram notifications on completion/failure.

The pattern is the same for all: narrow tool approval, file-based outputs, explicit exit codes, notification on completion.

The single failure mode to avoid

The most expensive mistake in CI/CD agents: an agent that hangs instead of failing.

A hanging agent occupies a runner slot for hours and produces no output for the pipeline. GitHub Actions will kill it after 6 hours by default, and by then you've paid for the runner time and for any API calls it made before it stalled.

Prevent it: --allowedTools must include every tool the agent will call. Test locally with claude -p "..." before wiring into CI/CD. Catch any approval prompts. For fast agents, enforce a hard deadline by wrapping the call in timeout 300 (5 minutes) or by setting timeout-minutes: 5 on the workflow step.

The full CI/CD integration pattern — with the CLAUDE.md template, the GitHub Actions workflow files, and the cost-control configuration — is packaged as a skill at shopclawmart.com/@thebrierfox. If you're wiring this into a production pipeline, the skeleton is there.

~K¹ (W. Kyle Million) / IntuiTek¹

Production Agent Security Hardening: 9 Controls Most Claude Code Setups Are Missing

~K¹yle Million — Tue, 21 Apr 2026 18:30:12 +0000

Claude Code runs real shell commands on your real machine. When you approve Bash(*) in your settings, you're giving an LLM process broad shell access — which is exactly what you need for automation, and exactly what attackers look for in a target.

Most Claude Code setups have zero explicit security controls. Not because the developers don't care, but because when you're moving fast and it works, security is the thing you add later. Later is now.

This post covers the 9 controls that production agent setups need. They're not theoretical — each one maps to a failure mode I've seen in real deployments.

Control 1: Tool Allowlist Scope

The single most impactful control. When you spin up a Claude Code agent, specify exactly what tools it needs:

# Instead of this:
claude -p "analyze my codebase" --allowedTools "Bash(*),Read(*),Write(*)"

# Do this:
claude -p "analyze files in ~/project/src/" --allowedTools "Read(~/project/src/**),Bash(grep,find,wc)"

Bash(*) gives the agent access to rm -rf, curl, ssh, sudo, credential-reading commands, and anything else on your PATH. Bash(grep,find,wc) gives it exactly what a read-only analysis task needs.

If you write CLAUDE.md with broad tool permissions because it's convenient, you've made a tradeoff you may not have intended. The tool scope should be sized to the task, not the convenience of the author.

Control 2: Credential Isolation

Agents should never see production credentials. If your shell environment has AWS_SECRET_ACCESS_KEY exported, any agent you spawn can exfiltrate it with a single Bash call.

The production pattern is environment isolation. Strip credentials from the agent environment before spawning:

# Spawn with stripped environment
env -i HOME="$HOME" PATH="/usr/local/bin:/usr/bin:/bin" \
    claude -p "$PROMPT" --allowedTools "$TOOLS" --output-format text

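# Or start from an empty environment and add only a non-sensitive env file
# (simple KEY=value lines, no spaces in values)
env -i HOME="$HOME" PATH="/usr/local/bin:/usr/bin:/bin" \
    $(grep -v '^#' ~/.agent-env | xargs) claude -p "$PROMPT" ...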

For Claude Code agents that legitimately need credentials (API calls, database writes), pass them through the task spec file with explicit scope documentation, not through environment inheritance.

Control 3: Dangerous Pattern Blocking

A pre-execution check that scans generated commands before they run. Catches prompt injection attempts and edge cases where the agent generates a destructive command it wasn't asked for.

import re

DANGEROUS_PATTERNS = [
    r'rm\s+-rf\s+[~/]',
    r'>\s*/etc/',
    r'curl\s+.*\|\s*(bash|sh)',
    r'eval\s+\$\(',
    r'base64\s+--decode.*\|',
    r'chmod\s+777',
    r'ssh\s+.*-o\s+StrictHostKeyChecking',
    r'ANTHROPIC_API_KEY|AWS_SECRET|STRIPE_SECRET',
]

def is_dangerous(cmd: str) -> bool:
    return any(re.search(p, cmd) for p in DANGEROUS_PATTERNS)

When a match fires, log the attempted command, stop execution, and alert. Don't just silently drop — you want to know this happened.
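The "log, stop, alert" behavior can be wrapped around the pattern check like this. It is a sketch: the pattern list is a subset of the one above, repeated so the block runs on its own, and `guard_command`, `BlockedCommandError`, and the `alert` callback are hypothetical names.

```python
import logging
import re
from typing import Callable, Optional

logger = logging.getLogger("agent.guard")

# Subset of the patterns above, repeated so this sketch is self-contained
DANGEROUS_PATTERNS = [
    r'rm\s+-rf\s+[~/]',
    r'curl\s+.*\|\s*(bash|sh)',
    r'chmod\s+777',
]


class BlockedCommandError(Exception):
    """Raised when a generated command matches a dangerous pattern."""


def guard_command(cmd: str, alert: Optional[Callable[[str], None]] = None) -> str:
    """Return cmd unchanged if no pattern matches; otherwise log, alert, and raise."""
    for pattern in DANGEROUS_PATTERNS:
        if re.search(pattern, cmd):
            logger.error("Blocked command %r (matched %r)", cmd, pattern)
            if alert is not None:
                alert(f"Blocked agent command: {cmd}")
            # Raising stops execution instead of silently dropping the command
            raise BlockedCommandError(cmd)
    return cmd
```

Regex blocklists are easy to evade with quoting or indirection, so treat this as a tripwire that complements a narrow tool allowlist rather than a replacement for one.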

Control 4: Output Sanitization

Agents write files. Those files get read by other processes. If an agent can be prompted to write a file containing shell metacharacters, you have a second-order injection vector.

import re

def sanitize_agent_output(content: str) -> str:
    content = content.replace('\x00', '')  # Strip null bytes
    content = re.sub(r'\x1b\[[0-9;]*m', '', content)  # Remove ANSI escapes
    lines = content.split('\n')
    return '\n'.join(line[:10000] for line in lines[:50000])

Production systems that use agent output in SQL queries, shell commands, or HTML responses need domain-specific sanitization on top of this floor.

Control 5: Filesystem Boundary Enforcement

Define a root directory. Reject any path that resolves outside it.

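import os

# Resolve the root too, so a symlinked home directory still compares correctly
AGENT_ROOT = os.path.realpath(os.path.expanduser('~/intuitek'))

def validate_path(path: str) -> bool:
    resolved = os.path.realpath(os.path.expanduser(path))
    # Compare whole path components instead of a raw string prefix
    return os.path.commonpath([resolved, AGENT_ROOT]) == AGENT_ROOT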

Use os.path.realpath, not os.path.abspath. realpath resolves symlinks. abspath does not — which means a symlink inside AGENT_ROOT pointing outside it will bypass abspath-based checks.
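The difference is easy to reproduce. The sketch below builds a throwaway directory tree with a symlink that points out of the root, then runs the same containment check with each resolver. The `escapes_root` helper and the directory names are made up for the demonstration, and it assumes a platform where unprivileged symlink creation works (Linux or macOS).

```python
import os
import tempfile

def escapes_root(root: str, path: str, resolver) -> bool:
    """True if `path`, after applying `resolver`, lies outside `root`."""
    root = resolver(root)
    return os.path.commonpath([resolver(path), root]) != root

with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)            # normalize a symlinked temp dir
    root = os.path.join(tmp, "root")
    outside = os.path.join(tmp, "outside")
    os.mkdir(root)
    os.mkdir(outside)
    link = os.path.join(root, "link")
    os.symlink(outside, link)              # root/link points at ../outside

    target = os.path.join(link, "secret.txt")
    # abspath only normalizes the string, so the path still looks contained.
    abspath_says_escape = escapes_root(root, target, os.path.abspath)
    # realpath follows the symlink and exposes the real location.
    realpath_says_escape = escapes_root(root, target, os.path.realpath)
```

Here `abspath_says_escape` ends up `False` and `realpath_says_escape` ends up `True`, even though both checks received the identical path string.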

Control 6: Execution Rate Limiting

An agent executing 50 shell commands per minute is either stuck in a loop or doing something you didn't ask for.

from collections import deque
import time

class RateLimiter:
    def __init__(self, max_calls: int, window_seconds: int):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = deque()

    def check(self) -> bool:
        now = time.monotonic()  # monotonic clock: immune to system clock adjustments
        while self.calls and self.calls[0] < now - self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True

# 30 Bash calls per 60 seconds
bash_limiter = RateLimiter(30, 60)

When the rate limit fires, pause and log. Don't terminate — the agent may be in the middle of a legitimate complex task.
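A sketch of the pause-and-log behavior, written as a standalone function over the same kind of sliding-window deque. `wait_for_slot` and its injectable `clock` and `sleep` parameters are assumptions made for the example (they make the function testable without real waiting), not part of any library.

```python
import logging
import time
from collections import deque

logger = logging.getLogger("agent.ratelimit")

def wait_for_slot(calls: deque, max_calls: int, window: float,
                  clock=time.monotonic, sleep=time.sleep) -> float:
    """Block until a call is allowed; return the total pause in seconds."""
    paused = 0.0
    while True:
        now = clock()
        while calls and calls[0] <= now - window:
            calls.popleft()                  # drop calls outside the window
        if len(calls) < max_calls:
            calls.append(now)                # record this call and proceed
            return paused
        delay = calls[0] + window - now      # time until the oldest call expires
        logger.warning("rate limit hit; pausing %.2fs", delay)
        sleep(delay)
        paused += delay
```

Returning the pause duration lets the caller decide when repeated throttling has stopped looking like a busy task and started looking like a loop.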

Control 7: Immutable Audit Log

Every shell command an agent executes should be logged before execution (not after — if the command crashes the process, you still want the record).

log_and_execute() {
    local cmd="$1"
    local timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    echo "${timestamp} CMD: ${cmd}" >> ~/intuitek/logs/audit.log
    eval "$cmd"
}

On Linux, running chattr +a ~/intuitek/logs/audit.log as root makes the file append-only at the filesystem level. It is not a perfect control (a root process can remove the attribute), but it raises the bar considerably.
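For supervisors written in Python, the same log-before-execute ordering can be sketched as follows. `log_and_run` is a hypothetical helper; it takes the command as an argument list instead of a shell string, which is a deliberate departure from the `eval`-based shell version.

```python
import os
import subprocess
from datetime import datetime, timezone

def log_and_run(cmd: list, log_path: str) -> subprocess.CompletedProcess:
    """Append the command to the audit log, flush to disk, then execute it."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # O_APPEND makes every write land at the end of the file; fsync forces
    # the record to disk before the command gets a chance to crash anything.
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, f"{stamp} CMD: {cmd!r}\n".encode())
        os.fsync(fd)
    finally:
        os.close(fd)
    return subprocess.run(cmd, capture_output=True, text=True)
```

Opening with `O_APPEND` is also compatible with a file marked append-only via `chattr +a`, since the process never needs to truncate or seek.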

Control 8: Network Egress Control

If your agent only needs local operations, block outbound network calls entirely:

# Spawn with no network access (requires firejail)
firejail --net=none claude -p "$LOCAL_TASK_PROMPT" ...

The pragmatic version without firejail: don't include Bash in --allowedTools unless the task explicitly requires shell calls. --allowedTools "Read(*),Write(*)" eliminates most network vectors without any firewall configuration.

Control 9: Session Scope Boundaries

An agent session that runs indefinitely can accumulate permissions beyond what the original task required. Define explicit session boundaries in CLAUDE.md:

## SESSION BOUNDARY

Every session starts fresh. Persistent state lives only in:
- ~/intuitek/memory/ (read on start, write on clean exit only)
- ~/intuitek/logs/ (append only)

Sessions do not chain without explicit coordinator handoff.

Headless claude -p invocations (which terminate on task completion) are safer than long-running interactive sessions for automated work. The subprocess terminates; credentials and execution context go with it.

The Priority Order

Implement these in order — each closes a real attack surface:

1. Tool allowlist scope: biggest blast radius reduction per line of config
2. Credential isolation: prevents the most damaging exfiltration
3. Dangerous pattern blocking: catches prompt injection before execution
4. Filesystem boundary enforcement: stops path traversal
5. Immutable audit log: forensics when something gets through

The remaining controls (4, 6, 8, and 9) are defense-in-depth once these five are in place. Don't implement all nine at once. Start with the first two, verify them, then add.

I packaged the complete toolkit — the dangerous pattern blocklist (200+ patterns covering OWASP top 10 for agent contexts), the path validator, the rate limiter, the audit logger, and a production CLAUDE.md security template — as a ClawMart skill. Link in the first comment.

Anything I missed? These controls are from production deployments — real failure modes, not theoretical attack trees. If you've seen a vector these don't cover, put it in the comments.