Forem: marcosomma

The Real Token Economy Is Not About Spending Less. It Is About Thinking Smaller.

marcosomma — Sun, 26 Apr 2026 22:46:01 +0000

I saw a video today that made me laugh, then made me a bit worried.

It was one of those jokes that is not really a joke because you can already see some company doing it six months from now. A manager was basically complaining because an employee was not spending enough AI tokens. Not enough tokens. As if tokens were steps on a fitness tracker.

"You only burned 2,000 tokens today, Susan. Are you even working?"

It sounds absurd, but we are not that far from it. Companies are already starting to measure AI adoption through number of prompts, number of tool calls, input tokens, output tokens, cost per user, cost per team, cost per workflow. And to be clear, I do not think this is automatically wrong. Measuring token usage makes sense. Tokens are cost. Tokens are latency. Tokens are context. They are also a trace of how people and systems are using AI.

The problem starts when we confuse the metric with the objective. We did this with hours worked. We did this with tickets closed. We did this with meetings attended. We did this with leads, where 1,000 unqualified leads looked better than 10 serious conversations because the spreadsheet was having a great day and nobody wanted to ruin the mood with reality.

Now we risk doing the same with tokens.

More tokens do not mean better work. Fewer tokens do not mean smarter work. The interesting signal is not the raw number. The interesting signal is the relationship between what you put into the model, what you ask it to do, and what comes out. That is where I think the real token economy starts. Not as a cost saving obsession, but as an architectural signal.

Tokens are not just money

The first way people talk about tokens is cost, and that is understandable. If you use hosted LLM APIs, tokens map quite directly to money. Input tokens cost something. Output tokens cost something. Larger models cost more. Long contexts cost more. Retries cost more. Bad prompts cost more. Bad architecture costs a lot more, but usually in a way that arrives later and looks like a reliability problem.

So the first instinct is to optimize token consumption. Compress prompts. Summarize context. Pick cheaper models. Cache responses. Reduce unnecessary output. All of that is useful, but I think it is only the shallow layer of the problem.

The more interesting question is not "how many tokens did this task consume?" The more interesting question is "what cognitive operation did those tokens represent?"

Because input tokens and output tokens are not the same thing. Input tokens usually buy context. They are the material you ask the model to look at. Output tokens usually buy generation, explanation, structure, synthesis, or action. If I send 10,000 input tokens to a model and get back 10 output tokens, that could be terrible. It could also be exactly right.

If the task is to read a long error log and return whether the failure is caused by authentication, a tiny output may be valid. If the task is to classify a product review as positive, neutral, or negative, a small answer is not a failure. It is the point. If the task is to route a bug report to the correct engineering queue, I do not need a novel. I need the right route.

So no, high input and low output is not automatically bad. But it is a signal. And I think that signal deserves a lot more attention than it currently gets.

Balance does not mean symmetry

When I talk about token balance, I do not mean that input tokens and output tokens should be equal. That would be a very silly metric, and we already have enough silly metrics trying to cosplay as management science.

By balance, I mean the relationship between the size of the input, the size of the output, and the value of the decision produced. A large input with a tiny output usually means the model is doing some kind of compression, classification, extraction, routing, filtering, moderation, scoring, validation, or decision making. A small input with a large output usually means the model is doing generation, expansion, explanation, drafting, or ideation. A large input with a large output usually means synthesis, transformation, summarization, comparison, or multi-step reasoning. A small input with a small output is usually a narrow atomic task.

None of these patterns are good or bad by themselves. They tell you something about the shape of the work. And sometimes the shape of the work is screaming.
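A minimal sketch of the four shapes described above, written as a bash function. The name `token_shape` and both cutoffs (2,000 input tokens, 200 output tokens) are assumptions for illustration only; what counts as "large" depends on your workload.

```bash
# Hypothetical cutoffs; override them through the environment.
LARGE_INPUT="${LARGE_INPUT:-2000}"
LARGE_OUTPUT="${LARGE_OUTPUT:-200}"

# token_shape INPUT_TOKENS OUTPUT_TOKENS
# Prints a label for the kind of work the input/output sizes suggest.
token_shape() {
    local in_tok="$1" out_tok="$2"
    if [ "$in_tok" -ge "$LARGE_INPUT" ] && [ "$out_tok" -lt "$LARGE_OUTPUT" ]; then
        echo "compressive"   # classification, extraction, routing, scoring
    elif [ "$in_tok" -lt "$LARGE_INPUT" ] && [ "$out_tok" -ge "$LARGE_OUTPUT" ]; then
        echo "generative"    # drafting, expansion, explanation, ideation
    elif [ "$in_tok" -ge "$LARGE_INPUT" ] && [ "$out_tok" -ge "$LARGE_OUTPUT" ]; then
        echo "synthesis"     # summarization, comparison, transformation
    else
        echo "atomic"        # narrow, single-purpose task
    fi
}
```

For example, `token_shape 10000 10` prints `compressive`. The label is a description of the call, not a verdict on it.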

Imagine you send a giant prompt containing a full meeting transcript, a product description, usage logs, a bug report, five examples, a JSON schema, tone guidelines, safety instructions, and a final line saying "be concise" because apparently we enjoy irony. Then you ask the model to return this:

{
  "priority": "high"
}

Maybe that is fine. Maybe the classification really required all of that context. But maybe you just built a cognitive washing machine to clean one spoon.

The point is not that the token ratio is wrong. The point is that the ratio invites questions. Did the task need all of that context? Could the context have been retrieved more narrowly? Could the classification have been separated from the extraction? Could a smaller model do part of the work? Could a deterministic rule do part of it? Could the final output be validated separately instead of trusting one giant model call?

That is where token metrics become useful. Not as a scoreboard. As a diagnostic tool.

The real problem is overloaded cognition

A lot of AI workflows are not expensive because the model is expensive. They are expensive because the task design is confused. We ask one model call to do too many things at once, then we act surprised when the model behaves like a very intelligent intern who received eight contradictory Jira tickets in one message.

Read this long input. Understand the domain. Extract twenty fields. Normalize them. Infer missing values. Respect the schema. Apply business rules. Avoid hallucinations. Explain your decision. Be concise. Be deterministic. Also, please do it in one call because we saw a demo once and now we think architecture is a prompt template.

This is where things become fragile. One big prompt. One big model. One fragile JSON output. One retry loop when it fails. One annoyed engineer staring at a malformed comma at 1:12 AM wondering why they studied data structures.

The problem is not only cost. The problem is that the reasoning surface is too large. Every additional instruction increases the model's degrees of freedom. Every unrelated piece of context adds noise. Every extra output field increases the chance of format drift. Every hidden dependency between fields makes validation harder. And when the output fails, you often do not know why.

Was the context too long? Was the instruction ambiguous? Was the schema too complex? Was the task logically overloaded? Was the model too weak? Was the model too creative? Was Mercury in retrograde? At some point, debugging a giant prompt starts to feel like debugging a dream.

This is why I think the unit of optimization should not be the prompt. The unit of optimization should be the cognitive task.

Think smaller, not just cheaper

When people hear "token economy", they often think about saving money. I think that is incomplete. The better version is this: design AI workflows so each model call has the smallest reasonable cognitive surface.

Not the smallest prompt. Not the cheapest model. The smallest cognitive surface.

A task's cognitive surface is the amount of context it asks the model to consider, the type of judgment it asks the model to make, and the kind of output it expects back. A wide cognitive surface is something like this: read a conversation, infer the user's emotional state, detect all action items, classify the sales opportunity, extract objections, score urgency, summarize the call, generate a follow-up email, and return a perfect JSON object with 28 fields.

That is not one task. That is a small village.

A narrower cognitive task is different. Given this segment of a product feedback thread, identify whether the user mentions pricing as a blocker. Return true or false. Or extract only the next meeting date from this text and return null if absent. Or given these three already extracted signals, choose the priority level from low, medium, or high.

Those tasks have narrower inputs and narrower outputs. They are easier to validate. They are easier to retry. They can often run on smaller models. Some can be replaced by deterministic code. Most importantly, they reduce ambiguity.

This is the part that matters. The best token optimization is not always compression. Sometimes the best token optimization is decomposition.
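Narrow outputs are cheap to validate without another model call. A small sketch, assuming the model's answer has already been captured as a string; the function names are hypothetical.

```bash
# Each validator returns 0 (valid) or 1 (invalid) and prints nothing.

# Accept only the three allowed priority labels.
is_valid_priority() {
    case "$1" in
        low|medium|high) return 0 ;;
        *)               return 1 ;;
    esac
}

# Accept only a literal true or false.
is_valid_flag() {
    case "$1" in
        true|false) return 0 ;;
        *)          return 1 ;;
    esac
}
```

A call such as `is_valid_priority "$answer" || retry_step` (where `retry_step` stands in for whatever your pipeline does on failure) retries only the step that drifted, instead of the whole workflow.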

The 20 field JSON problem

Let us take a simple example. You have a large input document and you need a structured output with 20 values. The obvious modern AI approach is to send the full document to a model and ask it to extract everything in one JSON object. Add a schema, add "do not hallucinate", add "use null when unknown", maybe add three examples, and hope the model behaves.

Sometimes this works. Sometimes it works very well in the demo. Then production arrives, wearing boots.

The model misses a field. It invents a value. It mixes two fields. It returns invalid JSON. It follows the schema but puts the wrong value in the right place, which is worse because it looks correct. It explains itself inside a field because apparently JSON needed feelings.

So you add more instructions. Then stricter schema language. Then validation. Then retry. Then a stronger model. Then a more expensive model. Then someone says, "Maybe we should fine-tune it." And now your simple extraction pipeline has become a small national infrastructure project.

A different approach is to ask a boring but useful question: are these 20 values actually one cognitive task?

Maybe not. Maybe five fields are direct extraction. Maybe three require classification. Maybe four depend on dates. Maybe two require numerical normalization. Maybe six are only relevant if a previous condition is true. In that case, one big prompt is not simpler. It is only hiding the complexity inside the model call.

You may get a better system by clustering the fields by semantic dependency. For example, direct identifiers can be one batch. Dates and temporal constraints can be another. Risk indicators can be another. Obligations and responsible parties can be another. The final normalized summary can be built only after the previous signals exist.

Each batch can have a smaller prompt, a smaller schema, and a narrower validation rule. Some batches may not need an LLM. Some can use regex, parsers, lookup tables, embeddings, or deterministic checks. Some can use a small local model. Only the genuinely difficult parts need the expensive model.

This is where the cost savings come from, but cost is only one part of the win. You also get better observability. If the final output is wrong, you can inspect which subtask failed. You can measure field-level accuracy. You can retry only the failing part. You can swap models for one stage without touching the rest. You can cache intermediate outputs. You can add deterministic validation at the boundary.

That is a real token economy. Not "use fewer tokens". Spend tokens where cognition is actually needed.
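As an illustration of a batch that needs no LLM at all, here is a sketch of a deterministic date extractor. It assumes dates appear in ISO 8601 form (`YYYY-MM-DD`); other formats would need their own rule, and the function name is made up for this example.

```bash
# first_iso_date: read text on stdin and print the first YYYY-MM-DD date,
# or the literal string "null" when none is present. No model call involved.
first_iso_date() {
    local match
    match=$(grep -oE '[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])' | head -n 1)
    if [ -n "$match" ]; then
        printf '%s\n' "$match"
    else
        printf 'null\n'
    fi
}
```

Deciding which date is "the next meeting" may still need a model. Finding the candidate dates does not.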

Smaller prompts reduce variance

I want to be careful with the word deterministic. LLMs are not truly deterministic systems in the classical engineering sense, even when you reduce temperature and constrain output. They are probabilistic systems. But workflow design can make their behavior more stable, more reproducible, and more controllable.

Smaller prompts with narrower objectives usually reduce the degrees of freedom of the model. If the model has one job, a small output space, and a strict schema, there are fewer ways to fail. If the model has twenty jobs, a large input, competing instructions, implicit dependencies, and a complex schema, you should not be surprised when it occasionally decides to express itself like a haunted spreadsheet.

This is why task decomposition can improve consistency. Not because small calls magically make the model deterministic, but because small calls make the system around the model easier to control. The output space is narrower. The validation is simpler. The retry logic is cheaper. The failure modes are easier to classify. The model choice becomes more flexible. The prompts become easier to test.

And the orchestration becomes explicit.

That last point matters a lot. When everything happens inside one prompt, the process is invisible. When you split the work into stages, the process becomes inspectable. This is the difference between hoping the model thinks correctly and designing a system where each step can be observed.

Where OrKa fits into this

This is one of the reasons I have been building OrKa, an orchestration framework for AI agents and reasoning workflows. The point of OrKa is not "use more agents because agents are cool". Honestly, if adding agents makes your system less understandable, congratulations, you have invented distributed confusion.

The point is different. Make cognitive work explicit. Define the flow. Split reasoning into smaller units. Route tasks. Log execution. Validate outputs. Keep memory and context under control. Make the system inspectable instead of praying over a large prompt.

In this view, an LLM is not the application. It is one component inside a system. Sometimes the LLM extracts. Sometimes it classifies. Sometimes it rewrites. Sometimes it evaluates. Sometimes it should not be called at all. The orchestration layer decides how work moves between these pieces.

That is where token economy becomes architecture. You are no longer asking only how to reduce a prompt by 20 percent. You are asking which cognitive step actually needs this context.

That question changes everything. Maybe the first model call only needs the user message. Maybe the second needs the relevant log snippet. Maybe the third needs only three extracted fields. Maybe the final formatter needs no model at all. If you send the full context to every step, you are not designing an AI system. You are photocopying the universe and asking a model to find the invoice number.

It may work. It is not a strategy.

Token metrics should trigger questions

So how should teams use token metrics? Not as productivity surveillance. Not as a way to shame people for using too many or too few tokens. Not as a leaderboard where the person with the most prompts wins some cursed office trophy.

Token metrics should trigger engineering questions.

When input tokens are very high and output tokens are very low, ask whether the task is intentionally compressive or accidentally overloaded. When output tokens are very high, ask whether the model is generating useful structure or just producing expensive fog. When the same context is repeatedly sent across multiple calls, ask whether retrieval, caching, or state passing could reduce duplication. When a large model is used for simple extraction, ask whether a smaller model or deterministic rule would work. When retries consume a lot of tokens, ask whether the schema, validation, or task boundaries are wrong.

This does not mean splitting tasks is automatically better. If you send the same 10,000 token input twenty times to extract twenty fields, you may have made the system more expensive and slower. You have not built architecture. You have built a very complicated way to duplicate context.

The win comes when decomposition is paired with context narrowing. Extract the relevant segment once. Reuse intermediate state. Cluster fields that share dependencies. Route only the necessary context. Validate locally. Use smaller models where possible. Stop calling the model when code can do the job.

This is not anti-LLM. It is pro-system.
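The duplication problem is easy to put in numbers. The sketch below counts input tokens only. The 10,000-token context and the twenty fields come from the example above; the five clusters and the 800-token segments are assumptions chosen for illustration.

```bash
# naive_split CONTEXT FIELDS
# The full context is re-sent once per field.
naive_split() {
    echo $(( $1 * $2 ))
}

# narrowed_split CONTEXT CLUSTERS SEGMENT
# One pass over the full context to pull out the relevant segments,
# then one call per cluster on a short segment.
narrowed_split() {
    echo $(( $1 + $2 * $3 ))
}
```

`naive_split 10000 20` prints `200000`, twenty times the cost of the single big call. `narrowed_split 10000 5 800` prints `14000`, which buys the observability of decomposition for a 40 percent increase in input tokens rather than a 1,900 percent one.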

A simple mental model

Here is the mental model I keep coming back to. Input tokens are attention budget. Output tokens are commitment surface.

The more input you provide, the more the model has to attend to. The more output you request, the more opportunities the model has to drift. A workflow becomes more stable when the attention budget and the commitment surface are aligned with the actual cognitive task.

If the model needs to classify one thing, do not ask it to also summarize, extract, explain, normalize, and format a complex object. If the model needs to generate a long answer, do not overload it with irrelevant context that only increases noise. If the model needs to extract structured fields, do not assume all fields belong in the same call. If the model needs to make a decision, make the decision boundary explicit.

The goal is not minimal tokens. The goal is minimal unnecessary cognition.

That distinction is important. Some tasks deserve many tokens. A long research synthesis may need a lot of context. A technical incident summary may need careful source retention. A product comparison may need long input and long output. A multi-document comparison may be legitimately expensive.

The problem is not spending tokens. The problem is spending tokens without knowing what they are buying.

This is also a model selection problem

Once you split cognitive tasks, model selection becomes much more interesting. In a one-prompt architecture, you usually choose the strongest model you can afford because the task is messy. The model has to handle everything. It has to read long context, reason, extract, format, validate, and recover from ambiguity.

But if you split the workflow, you can choose models per cognitive step. A small model can do simple classification. A local model can extract obvious fields. A deterministic parser can normalize dates. A rules engine can validate constraints. A stronger model can handle the genuinely ambiguous reasoning.

This is where the economics change. Not because you begged the prompt to be shorter, but because you changed the shape of the work. The expensive model becomes a specialist instead of a landfill.

And yes, I know "landfill" sounds harsh. But many AI systems today are exactly that. They throw all context into one place and hope the biggest model will recycle it into something useful. This works surprisingly often, which is the dangerous part. It works enough to ship a demo. It fails enough to punish you in production.
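Per-step selection can start as something as plain as a lookup. A sketch, with task kinds and tier names that are hypothetical labels and not identifiers from OrKa or any other framework:

```bash
# pick_executor TASK_KIND: print which tier should handle a step.
pick_executor() {
    case "$1" in
        classify)             echo "small-model" ;;
        extract_obvious)      echo "local-model" ;;
        normalize_date)       echo "parser" ;;
        validate_constraints) echo "rules-engine" ;;
        ambiguous_reasoning)  echo "strong-model" ;;
        *)                    echo "unrouted" ;;   # force an explicit decision
    esac
}
```

The `unrouted` default is deliberate: an unknown step should fail loudly rather than fall through to the most expensive model.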

Token economy as observability

A mature AI system should not only log the final response. It should log the token shape of the workflow.

Which step consumed the most input? Which step produced the most output? Which step retried the most? Which step had the most schema failures? Which step required the strongest model? Which step could be cached? Which step could be replaced by code? Which step actually improved the final decision?

This is not accounting. This is observability.

You are not only tracking spend. You are tracking cognitive pressure inside the system. A sudden increase in input tokens may mean your retrieval is bringing too much context. A sudden increase in output tokens may mean the model started explaining instead of structuring. A high retry cost may mean your schema is too complex or your prompt is ambiguous. A high token cost on a low-value decision may mean the workflow needs decomposition. A low token cost with poor quality may mean you compressed away necessary context.

Again, the metric is not the answer. The metric is the signal. The engineer still needs judgment, which is very inconvenient. We were promised automation and somehow we still need thinking. Rude.
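Logging the token shape does not require much tooling. The sketch below assumes each model call is logged as one tab-separated record (step name, input tokens, output tokens, and 1 if the call was a retry, else 0). That record format is an assumption made for this example.

```bash
# token_shape_report: read call records on stdin, one per line:
#   step <TAB> input_tokens <TAB> output_tokens <TAB> retried(0|1)
# Print one line per step:
#   step <TAB> calls <TAB> total_in <TAB> total_out <TAB> retries
# sorted so the step with the most input tokens comes first.
token_shape_report() {
    awk -F'\t' '
        { calls[$1]++; inp[$1] += $2; out[$1] += $3; retries[$1] += $4 }
        END {
            for (s in calls) {
                printf "%s\t%d\t%d\t%d\t%d\n", s, calls[s], inp[s], out[s], retries[s]
            }
        }
    ' | sort -t $'\t' -k3,3nr
}
```

Piping a day's worth of records through `token_shape_report` answers the first three questions above directly. The remaining ones still need a human looking at the top rows.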

The wrong future

The wrong future is easy to imagine. Teams get AI dashboards. Managers see token usage per employee. People are encouraged to "use AI more". Token consumption becomes proof of adoption. Employees learn to generate more prompts because the dashboard rewards activity. Everyone looks productive, costs go up, and quality does not.

Then leadership announces an AI efficiency initiative. Now everyone must reduce token usage. People use smaller prompts. Quality drops. Nobody knows why. Another dashboard is created. A consultant appears. The circle of life continues.

This is what happens when the metric becomes the goal. Token usage by itself tells you almost nothing about quality. A great engineer may use fewer tokens because they decomposed the problem properly. Another great engineer may use more tokens because the task genuinely required context. A bad workflow may use few tokens and produce garbage. A good workflow may use many tokens and produce a high-value decision.

So measuring tokens is not wrong. Judging work by raw token volume is wrong.

The better future

The better future is more boring, which is usually a good sign in engineering. Teams treat token metrics as workflow diagnostics. They look at input-output patterns. They identify overloaded prompts. They split tasks where it makes sense. They route context more carefully. They use smaller models for smaller cognitive jobs. They validate structured outputs separately. They measure retries, drift, and failure modes.

They do not ask "how much AI did you use?" They ask "where did the AI actually add decision value?"

That is the mindset shift. Tokens are not just a bill. Tokens are a trace of cognitive architecture. They show where a system is bloated. They show where context is duplicated. They show where outputs are too ambitious. They show where models are being used as glue because nobody wanted to design the pipeline. And yes, sometimes they show that the expensive model was actually justified.

That is fine. The goal is not to make everything cheap. The goal is to make the system honest.

Final thought

I think the next stage of AI engineering will not be about who writes the cleverest prompt. It will be about who designs the clearest cognitive pipeline.

The prompt is not the unit of architecture. The cognitive task is.

Token balance matters because it gives us a way to inspect that task. Not perfectly. Not automatically. Not as a KPI to punish or reward people. But as a signal that says: maybe this workflow is overloaded, maybe this context is too broad, maybe this output is trying to do too much, maybe this model is stronger than necessary, maybe this task should be split, maybe this step should not use an LLM at all.

That is where the real token economy lives. Not in spending fewer tokens, but in spending attention where attention is needed.

If we do that well, the benefits go beyond cost. Lower latency. Smaller models. Cleaner validation. Less format drift. More stable outputs. More inspectable workflows. Systems that are closer to engineering and less close to whispering wishes into a very expensive autocomplete machine.

Which, to be fair, is still fun.

But maybe not the future we should build production systems on.

Claude! Stop Burning Tokens on Your Agent's Tool Output!

marcosomma — Tue, 21 Apr 2026 09:39:42 +0000

A Two-Stage Curator That Pays for Itself

I watched Claude Code feed 108,894 bytes of `seq 1 20000` output back into its own context window. That output contained 20,000 integers.
No errors. No signal. No insight. Just counting.

And yet the system still had to tokenize it, send it back to the model, and bill for it. This is not an edge case. It is the default failure mode of agent tooling.

Tools produce output. The output goes back to the model. You pay for it. Logs, test runs, ps listings, git history, build spam, progress bars, boilerplate, decorative separators, repeated warnings, repeated success lines, repeated everything. A depressing amount of it is useless.

A lot of agent users are quietly paying premium model rates to process terminal confetti. That is the real problem!

My first fix was the obvious one. I added a PreToolUse hook and pushed large Bash output through a cheaper model before it reached Opus.

That worked, technically. Then I noticed I was still being stupid, just in a more optimized way.

On `seq 1 20000`, I was paying Haiku to read 20,000 integers and tell me they were 20,000 sequential integers.

Yes, that is cheaper than letting Opus read them.

No, that is not a good design.

If a 40-line awk script can identify the pattern for free, paying any model to summarize it is already waste. So the architecture changed. Not “small model before big model.” That idea sounds clever, but by itself it is just cost reshuffling.

The real pattern is simpler and better: extract free signal first, and only pay for a model when deterministic tools run out of leverage.

That led to a two-stage curator.

Stage 1 is deterministic cleanup. It strips ANSI escape sequences, removes carriage-return junk from progress bars, collapses consecutive duplicate lines, and compresses monotonic integer runs. It is effectively free.

Stage 2 is LLM extraction, and it runs only if stage 1 still leaves too much output. This is the only stage that spends tokens, so it should fire rarely, and only when stage 1 could not do enough.

That distinction matters! Because once you actually measure it, a lot of tool output turns out to be compressible by embarrassingly simple logic.

The benchmark

Here is the benchmark across five scenarios, using the pricing assumptions from the benchmark runner: Opus input at $15 per million tokens, Haiku input at $1 per million, Haiku output at $5 per million, and a rough estimate of 4 bytes per token.

Scenario	Raw bytes	Stage 1 bytes	Final bytes	LLM called?	Tokens saved	Haiku cost	Net savings
`seq 1 20000`	108,894	37	103	no	27,197	$0.000	+$0.408
5,000 repeated log lines	145,000	38	104	no	36,224	$0.000	+$0.543
ANSI + progress-bar spam	54,000	27	92	no	13,477	$0.000	+$0.202
`ps auxww` (unique lines)	213,230	213,230	1,008	yes	53,055	$0.055	+$0.741
`echo hello world`	12	12	12	no	0	$0.000	$0.000

The pattern is blunt.

Three of the four large-output cases were handled for free.

seq collapsed from 108,894 bytes to 37.
Repeated log spam dropped from 145,000 to 38.
ANSI and progress-bar noise fell from 54,000 to 27.

No LLM call was needed in any of those cases.

The only scenario that needed stage 2 was ps auxww, which is exactly what you would want. That output is genuinely varied. There is not much for awk to compress. That is the moment when paying a smaller model to extract the useful facts is justified.

Small output remained untouched, which is also correct. If the command only produced hello world, the system paid nothing and moved on.

This is the whole point.

A cheap LLM is not the first line of defense.

Deterministic cleanup is.

The LLM should be the escalation path, not the reflex.
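The "Tokens saved" and "Net savings" columns in the benchmark table can be reproduced from the raw and final byte counts with the stated pricing assumptions. The formulas below match the table's numbers; whether the benchmark runner computes them in exactly this way is an assumption.

```bash
BYTES_PER_TOKEN=4        # rough estimate used by the benchmark
OPUS_INPUT_PER_MTOK=15   # dollars per million Opus input tokens

# tokens_saved RAW_BYTES FINAL_BYTES
# Opus input tokens avoided, rounded down.
tokens_saved() {
    echo $(( ($1 - $2) / BYTES_PER_TOKEN ))
}

# net_savings TOKENS_SAVED HAIKU_COST
# Dollars saved on Opus input after paying for stage 2 (0 if it never ran).
net_savings() {
    LC_ALL=C awk -v t="$1" -v h="$2" -v p="$OPUS_INPUT_PER_MTOK" \
        'BEGIN { printf "%.3f\n", t * p / 1000000 - h }'
}
```

For the `seq 1 20000` row, `tokens_saved 108894 103` prints `27197` and `net_savings 27197 0` prints `0.408`. For the `ps auxww` row, `tokens_saved 213230 1008` prints `53055` and `net_savings 53055 0.055` prints `0.741`.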

Stage 1: deterministic cleanup

This is the part that does the real work more often than people expect.

#!/usr/bin/awk -f

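# Emit the pending run of consecutive integers: a one-line summary for runs
# of three or more, otherwise the integers themselves.
# "i" is declared as an extra parameter so it stays local to the function.
function flush_int_run(    i) {
    if (int_count >= 3) {
        printf "[%d sequential integers %s..%s]\n", int_count, int_start, int_end
    } else if (int_count > 0) {
        for (i = int_start; i <= int_end; i++) print i
    }
    int_count = 0
}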

function flush_dupe() {
    if (dupe_count > 1) {
        printf "%s [×%d]\n", dupe_line, dupe_count
    } else if (dupe_count == 1) {
        print dupe_line
    }
    dupe_count = 0
}

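{
    # Strip ANSI CSI escape sequences (colors, cursor movement).
    gsub(/\033\[[0-9;]*[a-zA-Z]/, "", $0)
    # Drop carriage returns left by progress bars and CRLF line endings.
    gsub(/\r/, "", $0)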

    if ($0 ~ /^-?[0-9]+$/) {
        n = $0 + 0
        if (int_count > 0 && n == int_end + 1) {
            int_end = n
            int_count++
            next
        }
        flush_int_run()
        flush_dupe()
        int_start = n
        int_end = n
        int_count = 1
        next
    }

    flush_int_run()

    if (dupe_count > 0 && $0 == dupe_line) {
        dupe_count++
    } else {
        flush_dupe()
        dupe_line = $0
        dupe_count = 1
    }
}

END {
    flush_int_run()
    flush_dupe()
}

There is nothing magical here.

It strips terminal paint.
It collapses repeated lines.
It compresses obvious integer sequences.

That is enough to destroy huge amounts of waste.

This is a useful reminder for AI tooling in general. A lot of expensive “reasoning” problems are not reasoning problems. They are preprocessing failures.

Stage 2: only escalate if stage 1 failed to shrink enough

The wrapper runs the command, checks the raw size, and passes small output through untouched.

If the raw output is large, it runs the deterministic cleaner.

If the cleaned output is now small enough, it returns that cleaned output and stops.

Only if the cleaned output is still large does it call Haiku.

#!/usr/bin/env bash
set -o pipefail

RAW_THRESHOLD="${CLAUDE_BASH_SUMMARIZE_THRESHOLD:-8000}"
LLM_THRESHOLD="${CLAUDE_BASH_LLM_THRESHOLD:-8000}"
MODEL="${CLAUDE_BASH_SUMMARIZE_MODEL:-claude-haiku-4-5}"

cmd="$1"
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
CLEAN_AWK="$SCRIPT_DIR/deterministic-clean.awk"

raw=$(mktemp); cleaned=$(mktemp)
trap 'rm -f "$raw" "$cleaned"' EXIT

bash -c "$cmd" >"$raw" 2>&1
rc=$?

raw_size=$(wc -c <"$raw" | tr -d ' ')

if [ "$raw_size" -le "$RAW_THRESHOLD" ]; then
    cat "$raw"
    exit "$rc"
fi

awk -f "$CLEAN_AWK" <"$raw" >"$cleaned"
cleaned_size=$(wc -c <"$cleaned" | tr -d ' ')

if [ "$cleaned_size" -le "$LLM_THRESHOLD" ]; then
    printf '=== CURATED %d→%d bytes (stage 1 deterministic, no LLM) ===\n' \
        "$raw_size" "$cleaned_size"
    cat "$cleaned"
    exit "$rc"
fi

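summary=$(claude -p --model "$MODEL" \
    "Extract signal from this command output. KEEP: errors, warnings, stack traces, file paths with line numbers, numeric results, unique events, final status. DROP: decorative separators, boilerplate. Preserve exact error text verbatim. Be terse but faithful on key facts. Plain text only." \
    <"$cleaned" 2>/dev/null)

# Fall back to the stage 1 output if the summarizer failed or returned nothing.
if [ -z "$summary" ]; then
    printf '=== CURATED %d→%d bytes (stage 1 only, %s extraction failed) ===\n' \
        "$raw_size" "$cleaned_size" "$MODEL"
    cat "$cleaned"
    exit "$rc"
fi
# Count bytes, not characters, so the header matches the other sizes.
summary_size=$(printf '%s' "$summary" | wc -c | tr -d ' ')
printf '=== CURATED %d→%d→%d bytes (stage 1 + %s extraction) ===\n' \
    "$raw_size" "$cleaned_size" "$summary_size" "$MODEL"
printf '%s\n' "$summary"
exit "$rc"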

That second threshold check is the entire point.

Without it, the “cheap model” becomes a permanent tax on output that deterministic logic had already made cheap.

With it, the LLM only gets called when the dumb tools genuinely ran out of leverage.

That is how this stops being a cute hack and starts becoming a sensible pipeline.

Hooking it into Claude Code

I wired it into Claude Code with a PreToolUse hook on Bash. The hook rewrites the original command so it runs through the wrapper first.

{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": "/path/to/.claude/scripts/bash-wrap-hook.sh",
        "timeout": 5
      }]
    }]
  }
}

The hook script itself is tiny. It reads the incoming JSON, extracts tool_input.command, and swaps in the wrapped version. Nothing here is conceptually hard. That is precisely why it is worth doing. Too much agent engineering right now is really just people tolerating waste because it looks sophisticated once wrapped in model calls.
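A sketch of the two pieces that description implies: reading the command from the hook's JSON and building the wrapped command line. The wrapper filename `bash-curate.sh` is hypothetical, `jq` is assumed to be installed, and the step that hands the rewritten command back to Claude Code is left out because its output format is not shown here and depends on your Claude Code version.

```bash
# Hypothetical wrapper location; point this at the two-stage wrapper script.
WRAPPER="${WRAPPER:-/path/to/.claude/scripts/bash-curate.sh}"

# read_command: extract tool_input.command from the hook JSON on stdin.
read_command() {
    jq -r '.tool_input.command'
}

# wrap_command CMD: print a command line that runs CMD through the wrapper.
# printf %q shell-quotes CMD so it reaches the wrapper as a single argument,
# which is what the wrapper's cmd="$1" expects.
wrap_command() {
    printf '%q %q\n' "$WRAPPER" "$1"
}
```

Quoting with `%q` matters here: pipes, redirects, and quotes inside the original command must survive intact until the wrapper hands them to `bash -c`.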

The cost math

Here is the clean version. Let the raw output be N tokens.
If you send it directly to Opus, the input cost in dollars is:

15N / 1,000,000

Now suppose stage 1 reduces that output to K tokens. If K is still above threshold, stage 2 fires. Haiku reads K tokens, emits a summary of M tokens, and then Opus receives those M tokens.
So the escalated path costs:

(1K + 5M + 15M) / 1,000,000 = (K + 20M) / 1,000,000

The escalated path comes out ahead of sending raw output to Opus when:

15N > K + 20M

That is the actual condition for the two-stage system. If stage 1 barely helps, then K ≈ N, and the inequality becomes:

15N > N + 20M

which simplifies to:

14N > 20M

or:

M < 0.7N

So in the worst case, where deterministic cleanup did almost nothing, the LLM stage still pays off if it compresses the remaining content by roughly 1.43x or better. That is not a demanding threshold.

And in the real pipeline, stage 1 often shrinks the input before the LLM ever sees it, which makes the economics even more comfortable.
This is why the design works. Not because “small model then big model” is automatically clever. Because the model is only invited in after cheap tools have already done what they can.
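
The cost math above is easy to turn into a small calculator. The sketch below uses only the per-million prices that appear in the formulas (15 for Opus input, 1 and 5 for Haiku input and output). The function names and the sample token counts are hypothetical, not measurements from the pipeline.

```python
# Prices in USD per million tokens, taken from the formulas above.
OPUS_INPUT = 15.0
HAIKU_INPUT = 1.0
HAIKU_OUTPUT = 5.0


def direct_cost(n: int) -> float:
    """Cost of sending N raw tokens straight to Opus."""
    return OPUS_INPUT * n / 1_000_000


def escalated_cost(k: int, m: int) -> float:
    """Haiku reads K tokens and writes M; Opus then reads those M."""
    return (HAIKU_INPUT * k + HAIKU_OUTPUT * m + OPUS_INPUT * m) / 1_000_000


def max_summary_tokens(n: int, k: int) -> float:
    """Largest M for which the escalated path stays cheaper: M < (15N - K) / 20."""
    return (OPUS_INPUT * n - HAIKU_INPUT * k) / (HAIKU_OUTPUT + OPUS_INPUT)


# Hypothetical numbers for illustration only.
n, k, m = 40_000, 12_000, 800
print(f"direct:    ${direct_cost(n):.4f}")
print(f"escalated: ${escalated_cost(k, m):.4f}")
print(f"M must stay below {max_summary_tokens(n, k):.0f} tokens")
```

Setting k equal to n in max_summary_tokens reproduces the worst-case bound of 0.7N derived above.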

What I would change next

This version already works, but there are obvious next steps.

Latency should probably be part of the gate, not just bytes. Saving a few cents is not impressive if it adds a few seconds to every interactive tool call.

Stage 1 could be extended to catch more patterns, especially timestamp-heavy logs where the message repeats but the prefix changes.

The system should probably hard-cap verbose LLM summaries, because “cheap extraction” can still become noisy if the prompt drifts.

And the current implementation buffers until the command finishes. That is fine for benchmarking, but worse for real long-running workflows. A streaming version would be much better.

But none of that changes the core lesson.

The actual lesson

The interesting idea here is not “use an LLM to compress LLM inputs.” That is the shallow reading. The more useful pattern is this:

before you spend tokens to extract signal, try extracting signal for free.

A lot of the mess we hand to expensive models is not difficult. It is just noisy. And noisy is not the same as complex. That distinction matters. Because once you see it clearly, the same pattern starts showing up everywhere.

Retrieval pipelines that rerank garbage before filtering it.
Scrapers that pass repeated boilerplate into embeddings.
Log processors that ask a model to summarize progress-bar sludge.
Agent systems that burn premium tokens on output a shell one-liner could have collapsed immediately.

Cheap filter first.
Expensive model second.
Measure the break-even.
Then stop paying premium rates for repetition, boilerplate, terminal paint, and counting.

That is not an AI breakthrough.

It is just basic engineering discipline, which is exactly why so many agent stacks are currently missing it.

I Ran 500 More Agent Memory Experiments. The Real Problem Wasn’t Recall. It Was Binding.

marcosomma — Mon, 13 Apr 2026 09:19:39 +0000

This is a follow-up to "I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy". If you haven't read that one, the short version: I built a persistent memory system for AI agents called OrKa Brain, ran 30 benchmark tasks, got a 63% pairwise win rate and a +0.10 rubric improvement, and concluded that "the model already knew most of what the Brain was recalling." Then I got some very good comments that made me uncomfortable. This is what happened next.

The Comfortable Lie I Told Myself

After the first benchmark, I had a narrative that felt reasonable: the memory system works, the numbers are positive, the confounds are acknowledged, and more data will clarify things.

That last part, "more data will clarify things", is what engineers say when they don't want to admit they might be wrong. I said it too. And then I went and got more data.

250 tasks. Five specialized tracks. 500 total runs (brain vs. brainless). A separate judge model so the LLM wasn't grading its own homework. Eleven code changes addressing five root-cause problems I'd identified from the first round.

The results came back. They didn't clarify things. They made them worse.

What I Fixed Before Running Again

I'm not going to pretend I just blindly re-ran the same experiment. I did real work between benchmark v1 and v2. The first article's comments called out several things, and I addressed them:

Problem 1: Skills were storing verbatim LLM output, not abstract patterns.

This was the big one. When the Brain learned a skill from a data engineering task, it stored the literal steps: "Load CSV files into staging tables using pandas read_csv with error handling." That's not transferable knowledge, it's a paraphrase of what the model already knows. I rewrote the abstraction layer (orka/brain/constants.py, brain.py, brain_agent.py) to extract verb-target patterns: "implement [target]", "validate [component]", "trace [target]". The idea was that abstract patterns would transfer better across domains.

Problem 2: The recall threshold was zero.

min_score=0.0 meant any vaguely related skill could get recalled. I raised it to 0.5 and added a semantic floor in transfer_engine.py: if the embedding similarity is below 0.1 AND the structural match is below 0.6, the candidate gets rejected entirely.
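
A minimal sketch of that gate, using the three thresholds from the paragraph above. The function and constant names are hypothetical, and treating min_score as a cut-off on the combined transfer score is an assumption rather than something confirmed by transfer_engine.py.

```python
MIN_SCORE = 0.5          # raised from 0.0
SEMANTIC_FLOOR = 0.1     # embedding similarity floor
STRUCTURAL_FLOOR = 0.6   # structural match floor


def passes_recall_gate(score: float, semantic_sim: float, structural_match: float) -> bool:
    """Hypothetical recall gate; names and score interpretation are assumptions."""
    # Semantic floor: reject only when BOTH signals are weak.
    if semantic_sim < SEMANTIC_FLOOR and structural_match < STRUCTURAL_FLOOR:
        return False
    # Otherwise fall back to the plain minimum-score threshold.
    return score >= MIN_SCORE


print(passes_recall_gate(0.7, semantic_sim=0.05, structural_match=0.8))  # structure rescues it
print(passes_recall_gate(0.7, semantic_sim=0.05, structural_match=0.5))  # both weak: rejected
```

Because the floor uses AND, a strong structural match can still carry a candidate whose embedding similarity is close to zero.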

Problem 3: The model was judging its own output.

v1 used the same LLM for execution and evaluation. v2 uses a separate judge model (qwen/qwen3-coder-30b) with dedicated rubric and pairwise workflow YAMLs. Execution and judgment are completely decoupled: different scripts, different models, different runs.

Problem 4: Track diversity.

v1 had one track. v2 has five:

Track	Focus	Why It Matters
A	Cross-domain transfer	Does a data engineering skill help with cybersecurity?
B	Ethical reasoning	Do anti-pattern detection skills transfer?
C	Routing decisions	Hardest track, complex multi-path choices
D	Multi-step reasoning	Do procedural patterns help new reasoning chains?
E	Iterative refinement	Do improvement patterns compound?

50 tasks per track, 250 total. All available in the benchmark dataset.

Problem 5: Single-pass baselines.

The brainless condition now runs through a properly equivalent pipeline: same structure, same number of agents, just without the Brain recall/learn steps. No more two-pass advantage that could inflate brainless scores. Baseline workflows: baseline_track_a.yml, baseline_track_b.yml, etc.

I also split the pipeline into three standalone scripts (execution, judging, aggregation) so you can re-run any phase independently. Eleven code changes total, all committed and tested. 3,014 unit tests passing. You can verify everything in the results directory.

I felt good about this. I'd addressed every valid criticism. Time to re-run.

The Numbers

Here's the overall aggregate from 250 tasks, brain vs. brainless:

Rubric Scores (1–10 scale, six dimensions)

Dimension	Brain	Brainless	Delta
Reasoning Quality	9.51	9.52	−0.01
Structural Completeness	9.87	9.83	+0.04
Depth of Analysis	8.79	8.74	+0.05
Actionability	9.67	9.64	+0.03
Domain Adaptability	9.85	9.82	+0.03
Confidence Calibration	9.38	9.39	−0.01
Overall	9.37	9.31	+0.06

A +0.06 rubric delta across 250 tasks.

For reference, v1 was +0.10 across 30 tasks. So the effect got smaller with more data, not larger. That's not what you want to see.

Pairwise Comparison (245 head-to-head comparisons)

Question	Brain Wins	Brainless Wins	Tie
Stronger reasoning	152	91	2
More complete	149	92	4
More trustworthy	151	92	2
Overall	151	92	2

Brain win rate: 61.6%
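
For anyone reproducing the figure, here is a small sketch of the arithmetic on the "Overall" row. The helper name is hypothetical. It shows that 61.6% keeps ties in the denominator, while dropping ties gives a slightly higher number.

```python
def win_rate(wins: int, losses: int, ties: int, count_ties: bool = True) -> float:
    """Share of comparisons won, with ties optionally kept in the denominator."""
    total = wins + losses + (ties if count_ties else 0)
    return wins / total


# "Overall" row from the pairwise table above.
brain, brainless, ties = 151, 92, 2
print(f"{win_rate(brain, brainless, ties):.1%}")                    # 61.6%
print(f"{win_rate(brain, brainless, ties, count_ties=False):.1%}")  # 62.1%
```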

Here's where it gets uncomfortable. The pairwise judge says brain wins 62% of the time. The rubric judge says brain is +0.06 better, which is noise at a 9.3/10 baseline. These two metrics should agree. They don't.

I've seen this pattern before. It's length/position bias. Brain responses tend to be longer because the pipeline has more agents in the chain, which means more context, which means more text. Pairwise judges prefer longer answers. The rubric doesn't care about length, it scores each dimension independently.

Per-Track Breakdown

This is where the story gets interesting:

Track	Focus	Rubric Δ	Pairwise Win%	Brainless Baseline
A	Cross-domain transfer	−0.02	60%	9.33
B	Ethical reasoning	+0.00	52%	9.54
C	Routing decisions	+0.40	60%	8.12
D	Multi-step reasoning	+0.08	60%	9.49
E	Iterative refinement	+0.06	76%	9.61

Track C stands out. It's the hardest track: brainless scores only 8.12, more than a full point below every other track. And it's the only track where brain shows a meaningful rubric gain: +0.40 across six dimensions.

Track E has the highest pairwise win rate (76%) but a rubric gain of only +0.06. That's the length bias signature: the pairwise judge loves brain's longer outputs, but the rubric says they're not actually better.

Track B is essentially a coin flip. 52% pairwise, +0.00 rubric. The Brain adds nothing to ethical reasoning tasks.

The Ugly Detail: Skill Usage

Here's what really killed me. I dug into the individual results to see how many tasks actually used their recalled skill:

Tasks with skill recall attempted: 51 / 250 (20%)
Tasks that actually used the recalled skill: 0 / 250
Average semantic match score: ~0.02 (near zero)

Zero. Not one single task out of 250 used the recalled skill. The model read the skill, evaluated it, and decided every single time that it wasn't helpful. And the semantic similarity between the abstract skill and the actual task was essentially random noise.

The abstraction layer I was so proud of, the one that converts "Load CSV files into staging tables using pandas" into "implement [target]", produced skills so abstract they were vacuous. Two words of content. The embedding model sees no relationship between "implement [target]" and any real task. The execution model correctly recognizes that "implement [target]" tells it nothing it doesn't already know.

I had gone from skills that were too specific (literal LLM paraphrases) to skills that were too abstract (empty shells). The sweet spot, actual transferable knowledge, was somewhere I hadn't found.

Sitting with the Discomfort

I'm going to be honest about what went through my head at this point. I've been working on OrKa for over a year. Forty blog posts. A research paper about the Agricultural Threshold for machine intelligence. An open-source framework that allows me to test, experiment, and explore my ideas with real AI runs. And the core thesis, that persistent memory makes agents better, keeps failing to show up in the numbers.

I considered dropping the whole Brain system. Making OrKa just an orchestration framework. Simpler. Easier to explain. No embarrassing benchmarks.
But then I looked at Track C again.

Track C is the only track where brainless struggles. It scores 8.12, good, but not great. The tasks involve complex routing decisions where the model has to consider multiple paths and trade-offs. This is the only track where the model actually needs help.

And it's the only track where brain provides meaningful help. +0.40 rubric delta is not noise. Across 50 tasks and six scoring dimensions, that's a consistent, measurable improvement.

The pattern is simple: the Brain helps when the model needs help, and doesn't help when the model doesn't need help.

That sounds obvious in retrospect. But it means the thesis isn't wrong, it's just being tested in the wrong conditions. You wouldn't evaluate a life jacket by putting it on people standing on dry land and measuring whether they're drier.
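
The per-track table makes this easy to eyeball. The sketch below copies the rubric deltas and brainless baselines from that table and prints each track's headroom (10 minus the baseline) next to its gain. It is a descriptive look at five data points, not a statistical test.

```python
# (track, rubric delta, brainless baseline) copied from the per-track table.
TRACKS = [
    ("A", -0.02, 9.33),
    ("B", 0.00, 9.54),
    ("C", 0.40, 8.12),
    ("D", 0.08, 9.49),
    ("E", 0.06, 9.61),
]

# List tracks from lowest baseline (most headroom) to highest.
for name, delta, baseline in sorted(TRACKS, key=lambda t: t[2]):
    headroom = 10.0 - baseline
    print(f"Track {name}: headroom {headroom:.2f}, rubric delta {delta:+.2f}")

most_headroom = min(TRACKS, key=lambda t: t[2])[0]
largest_gain = max(TRACKS, key=lambda t: t[1])[0]
print(most_headroom, largest_gain)
```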

The Real Problem: What Is a Memory?

This is where the story changes. Because instead of asking "does memory help?" I started asking "what is a memory, actually?"

Think about how you remember how to drive a car. What fires in your brain when you approach an unfamiliar intersection?

It's not one thing. It's not "turn the wheel, press the gas." That's the procedural part, and yes, it's there. But it's bound together with other things:

The time you nearly got T-boned because you assumed a green light meant it was safe without checking cross traffic. That's episodic memory, a specific event with emotional weight.
"Right of way doesn't mean right of safety." That's semantic memory. A general fact you learned, maybe from a driving instructor, maybe from experience.
"Checking mirrors BEFORE entering the intersection prevents blind-spot collisions BECAUSE turning reduces your field of vision." That's causal reasoning. You know why the sequence matters, not just that it matters.

When you encounter the intersection, all of these fire together. The procedure tells you what to do. The episode tells you what happened last time. The semantic fact tells you a principle. The causal link tells you why. That combination, that binding, is what makes the memory useful. Any single component alone is much less helpful.

Now look at what OrKa Brain currently stores as a "skill":

implement [target]
trace [target]

That's it. No episodes. No semantic context. No causal reasoning. Just two abstract action verbs. No wonder the model ignores it. It's like handing a driver a note that says "steer [vehicle]" and expecting it to help at the intersection.

The Memory Binding Problem

I went down a rabbit hole into cognitive science literature on this. What I found is that neuroscientists have been arguing about this exact problem for decades. They call it the binding problem: how does the brain take separate memory traces stored in different systems and combine them into a unified experience?

The hippocampus doesn't store the memory. It stores the index, the binding that links the procedural memory in the motor cortex, the emotional trace in the amygdala, the spatial context in the parietal cortex, and the semantic facts in the temporal lobe. When you recall one, you recall all of them, because they're bound together.

I had built the hippocampus and the motor cortex as two separate systems that had never met.

Here's what actually exists in OrKa today:

The Skill system (fully operational, used in benchmarks):

Abstract procedure steps
Preconditions and postconditions
Transfer history and confidence scores
Structural/semantic matching for recall

The Episode system (fully built, tested, never used in any benchmark):

Specific task input and outcome
What worked and what failed
Root cause analysis for failures
Actionable lessons learned
Resource metrics (tokens, latency)
Links to related episodes

Both systems are production-ready. Both have full test coverage. Both are integrated into the Brain class. I wrote record_episode(), recall_episodes(), EpisodeStore, EpisodeRecall, all of it. Complete with semantic search, retention policies, and four-dimensional scoring.

And then I never connected them together.

The Skill has no episode_id field. The Episode has no skill_id field. brain.learn() creates a Skill but not an Episode. brain.recall() returns Skills but not Episodes. The benchmark workflows run brain_learn and brain_recall, but never brain_record_episode or brain_recall_episodes.

Two complete memory systems, sitting in the same codebase, sharing no information.

When I saw this, I felt stupid. But I also felt something else: the architecture was already 80% there. The hard parts, embedding storage, semantic search, decay policies, scoring systems, were done. The missing piece wasn't a new system. It was the wiring between existing systems.

What a Memory Should Actually Look Like

Here's the concept I'm now calling a Memory Bundle:

┌─────────────────────────────────────────┐
│            MEMORY BUNDLE                │
│                                         │
│  ┌───────────┐  ┌──────────────────┐    │
│  │ Procedure │  │ Episodes (1..N)  │    │
│  │ (steps)   │──│ what worked      │    │
│  │           │  │ what failed      │    │
│  └───────────┘  │ lessons          │    │
│                 │ "X+Z → Y"        │    │
│  ┌───────────┐  └──────────────────┘    │
│  │ Semantic  │                          │
│  │ (domain   │  ┌──────────────────┐    │
│  │  facts)   │  │ Causal Links     │    │
│  │           │  │ "A because B"    │    │
│  └───────────┘  └──────────────────┘    │
│                                         │
│  transfer_score = f(all_components)     │
└─────────────────────────────────────────┘

When the system learns from an execution, it creates both a skill AND an episode, linked by ID. The skill stores the abstract procedure. The episode stores what actually happened, the specific outcome, what worked, what failed, and crucially, the lessons: "Running validation before deduplication caught 30% of bad records that would have been duplicated, always validate first."

When the system recalls, it returns the skill with its episodes attached. The prompt to the model isn't "implement [target]", it's:

Here's an abstract procedure: implement [target] → validate [component] → trace [target].

This skill has been applied 3 times before:

Data engineering (ETL): Validation before dedup caught 30% of dirty records. Lesson: always validate before any deduplication step.

API integration: Target implementation worked, but tracing missed async callbacks. Lesson: tracing needs to account for async execution paths.

Log analysis: Pattern worked well. Filtering noisy entries before analysis reduced false positives by 40%.

That's a memory a model can actually use. It has the abstract pattern (transferable) AND the concrete evidence (grounding). The model can decide whether the pattern applies here based on real outcomes, not just structural similarity.

The transfer scoring changes too. A skill backed by five successful episodes with clear lessons should score higher than a skill backed by zero episodes. The episode quality becomes part of the transfer decision.

And feedback updates both: the skill's confidence changes, AND a new episode gets recorded for this application. The episode chain grows over time, and future recalls get richer context.
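
To make the bundle concrete, here is a minimal sketch of the two linked records and the prompt they would produce. The field names (episode_ids, skill_id) follow the binding described in this article. Everything else, including the class layout and render_bundle, is hypothetical and not OrKa's current code.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    episode_id: str
    skill_id: str   # back-reference to the skill this episode exercised
    context: str    # e.g. "Data engineering (ETL)"
    outcome: str
    lesson: str


@dataclass
class Skill:
    skill_id: str
    steps: list
    episode_ids: list = field(default_factory=list)


def render_bundle(skill: Skill, episodes: dict) -> str:
    """Build the recall prompt: abstract procedure plus concrete evidence."""
    linked = [episodes[eid] for eid in skill.episode_ids if eid in episodes]
    lines = [f"Here's an abstract procedure: {' → '.join(skill.steps)}.", ""]
    lines.append(f"This skill has been applied {len(linked)} times before:")
    for ep in linked:
        lines.append(f"{ep.context}: {ep.outcome} Lesson: {ep.lesson}")
    return "\n".join(lines)


skill = Skill("sk-1", ["implement [target]", "validate [component]", "trace [target]"],
              episode_ids=["ep-1", "ep-2"])
episodes = {
    "ep-1": Episode("ep-1", "sk-1", "Data engineering (ETL)",
                    "Validation before dedup caught 30% of dirty records.",
                    "always validate before any deduplication step."),
    "ep-2": Episode("ep-2", "sk-1", "API integration",
                    "Target implementation worked, but tracing missed async callbacks.",
                    "tracing needs to account for async execution paths."),
}
prompt = render_bundle(skill, episodes)
print(prompt)
```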

Why This Is Actually About the Thesis

My research paper argues that intelligence becomes civilization-scale only through recursive environmental control loops, project, act, observe, revise, compound. Agriculture was the first time humans did this at scale. The agricultural threshold.

The current Brain system doesn't cross that threshold. It projects (learns a skill), acts (recalls it), but doesn't truly observe or revise. The skill never learns from its own application. It just accumulates abstract patterns with no connection to real outcomes.

The Memory Bundle changes this. Each episode is an observation. Each lesson is a revision. Each future recall that includes those lessons is compounding. The loop closes:

Learn: Execute a task → create skill + record episode (with what worked/failed)
Recall: Find matching skill → include its episodes as evidence
Apply: Model uses the procedure + the concrete lessons
Feedback: Record a new episode for this application → update skill confidence
Compound: Next recall is richer, it has more episodes, more lessons, more evidence

That's the recursive loop. That's the agricultural threshold. And the architecture for it already exists, it just needs the binding.
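
The loop above can be sketched with an in-memory stand-in. BundleBrain is a hypothetical class, not OrKa's Brain. Skill matching is left out entirely (recall is by ID), and the fixed confidence step is an arbitrary placeholder. The point is only to show learn, recall, and feedback all touching the same linked records.

```python
import itertools


class BundleBrain:
    """In-memory stand-in for illustration; this is not OrKa's Brain class."""

    def __init__(self) -> None:
        self.skills: dict = {}
        self.episodes: dict = {}
        self._ids = itertools.count(1)

    def _record_episode(self, skill_id: str, lesson: str, success: bool) -> str:
        episode_id = f"ep-{next(self._ids)}"
        self.episodes[episode_id] = {"skill_id": skill_id, "lesson": lesson, "success": success}
        self.skills[skill_id]["episode_ids"].append(episode_id)  # the binding
        return episode_id

    def learn(self, steps: list, lesson: str, success: bool = True) -> str:
        """Learn: create the skill AND its first episode, linked by ID."""
        skill_id = f"sk-{next(self._ids)}"
        self.skills[skill_id] = {"steps": list(steps), "confidence": 0.5, "episode_ids": []}
        self._record_episode(skill_id, lesson, success)
        return skill_id

    def recall(self, skill_id: str) -> tuple:
        """Recall: return the skill together with its episodes as evidence."""
        skill = self.skills[skill_id]
        return skill, [self.episodes[e] for e in skill["episode_ids"]]

    def feedback(self, skill_id: str, lesson: str, success: bool, step: float = 0.05) -> None:
        """Feedback: record a new episode and move confidence up or down."""
        self._record_episode(skill_id, lesson, success)
        skill = self.skills[skill_id]
        delta = step if success else -step  # failures lower confidence too
        skill["confidence"] = min(1.0, max(0.0, skill["confidence"] + delta))


brain = BundleBrain()
sid = brain.learn(["validate [component]", "implement [target]"], "always validate first")
brain.feedback(sid, "tracing missed async callbacks", success=False)
skill, evidence = brain.recall(sid)
print(skill["confidence"], [e["lesson"] for e in evidence])
```

Recording failures as episodes is what lets confidence go down as well as up.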

What About Track C?

This also explains why Track C was the only track that showed improvement. Track C tasks are routing decisions, complex, multi-path choices where the model has to weigh trade-offs. These are exactly the kind of tasks where episodic evidence would help most.

When someone says "last time we tried path A for a similar routing problem, it failed because of X, path B worked because of Y," that's genuinely new information. The model can't derive it from its weights. It's system-specific, run-specific, outcome-specific.

The current brain helped Track C even without episodes because the tasks are hard enough that any additional context, even a vague abstract skill, provides a useful scaffold. But imagine Track C with Memory Bundles, the model would get both the abstract pattern AND the specific outcomes from previous routing decisions.

Tracks A, B, D, and E didn't improve because the model already scores 9.3+/10 on them. It doesn't need help. No amount of memory, procedural, episodic, or otherwise, will improve a 9.5/10 response to a 10/10 response. The tasks aren't hard enough to require accumulated knowledge.

This isn't a failure of the memory system. It's a boundary condition. Memory helps when the task exceeds single-shot capability. It doesn't help when the model is already near-perfect without it.

What I'm Not Claiming

I want to be careful here, because I've been burned before by getting ahead of my own evidence.

I'm not claiming that Memory Bundles will definitely show large improvements. I'm claiming that the current system stores memories that are too impoverished to be useful, and I now understand what richer memories should look like.

I'm not claiming the ceiling effect is the only problem. The pairwise-rubric disagreement at 62% vs +0.06 suggests position/length bias is still contaminating the pairwise results. That confound exists regardless of memory architecture.

I'm not claiming this is a new idea. Cognitive scientists have written about memory binding for decades. What's new (maybe) is applying it to agent memory systems where the default assumption seems to be that one type of memory, usually RAG-style document retrieval, is sufficient.

And I'm not pretending the community feedback didn't shape this thinking. When TechPulse Lab wrote that episodic and institutional memory matters more than procedural memory, they were describing exactly the gap I ended up finding. When Nova Elvaris pointed out that skills can only grow, never decay, that's the absence of failure episodes. When Kuro said memory maintenance matters more than storage, that's about binding quality, not storage quantity.

I just didn't understand what they were telling me until the numbers forced me to look harder.

What Happens Next

The code changes needed are surprisingly small. The Episode system is already built: episode.py, episode_store.py, and episode_recall.py are all production-ready with tests. What's needed:

Binding: Add episode_ids[] to Skill, add skill_id to Episode. When brain.learn() fires, it creates both and links them.
Unified recall: When brain.recall() finds a matching skill, it fetches the associated episodes automatically. The prompt template includes both the abstract procedure and the concrete lessons.
Transfer scoring: Episode quality becomes a component of the transfer score. Skills with successful episodes score higher.
Feedback loop: brain.feedback() records a new episode for the current application, so the skill's evidence base grows over time.

Then re-run the benchmark. Specifically on Track C-difficulty tasks, where the model actually needs help.

I'm not going to promise the numbers will be different this time. I've been wrong before, twice now, measured against my own benchmarks, published for everyone to see. But I understand something I didn't understand before: a memory without experience is just a note. A memory with experience is a skill.

The plumbing metaphor from the first article still holds. But I was plumbing one pipe when the system needs at least four, all flowing into the same tap.

All benchmark data, scripts, and results are publicly available in the OrKa repository. The full result files include every individual task response, judge score, and pairwise comparison. If you want to re-run the analysis: python aggregate_benchmark.py --judge-tag local.

If you've worked on agent memory systems and found similar walls, or found ways through them, I'd genuinely like to hear about it. The comments on the first article were more useful than most papers I've read on the topic.

This is part of an ongoing series about building OrKa, an open-source YAML-first agent orchestration framework. Previous installments: Part 1: Plumbing Instead of Philosophy.

I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy

marcosomma — Thu, 26 Mar 2026 11:42:27 +0000

There is a special genre of AI idea that sounds brilliant right up until you try to build it. It usually arrives dressed as a grand sentence.

"Agents should learn transferable skills."
"Systems should accumulate experience over time."
"We need durable adaptive cognition."

Beautiful. Elegant. Deep. Completely useless for about five minutes, until somebody has to decide what Redis key to write, what object to persist, what gets recalled, what counts as success, what decays, and how not to fool themselves with a benchmark made of warm air and wishful thinking.

That is usually the point where the magic dies. Good. I like ideas that survive contact with plumbing. So after thinking for a while about procedural memory and transferable knowledge in agent systems, I did the only thing that matters if you want to know whether an idea is real or just very well moisturized language.
I wired the whole thing end to end. Or at least I tried to.
The question was simple enough to sound harmless.

Can an agent system learn a procedure from one task, persist it, retrieve it later, try to reuse it in a different task, record feedback, and let weak patterns decay instead of growing into a trash heap with a logo?
In other words, can you build a procedural memory loop that behaves like a system and not like a TED Talk?

So I built OrKa Brain as a first implementation inside OrKa, a YAML-first agent orchestration framework.

The Loop

The loop was straightforward on paper. Learn. Persist. Retrieve. Apply. Feedback. Decay. Of course, "straightforward on paper" is the native language of future suffering.

The learn stage extracted a structured skill from the execution trace. The persist stage stored it in Redis. The recall stage searched for something structurally relevant. The apply stage injected that recalled skill back into the solving process. The feedback stage updated confidence. The decay stage made sure old and weak patterns did not live forever like some cursed enterprise configuration file from 2017.

That is the kind of sentence people read quickly.
Each verb hides a small swamp.

What Is a Skill, Concretely?

Not in the spiritual sense. In the schema sense.

I ended up with a skill object carrying ordered steps (each with an action, description, parameters, and optionality flag), preconditions and postconditions expressed as testable predicates, a confidence score, a transfer history recording every cross-context attempt, usage count, tags, timestamps, and a TTL computed from actual use.

The TTL formula was designed to reward skills that prove their worth: base of 168 hours (one week), scaled logarithmically by usage and linearly by confidence. A fresh skill with one use and 50% confidence lives for a week. A well-exercised skill used 16 times with 90% confidence survives 49 days. Skills that nobody calls on quietly expire. Redis handles the tombstone.
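
The exact formula is not spelled out here, so the sketch below is a reconstruction: one expression that is logarithmic in usage, linear in confidence, and reproduces both worked examples from the paragraph above (one week and 49 days). The real implementation may use different terms.

```python
import math

BASE_TTL_HOURS = 168  # one week


def skill_ttl_hours(usage_count: int, confidence: float) -> float:
    """Assumed form: base * (1 + log2(usage)) * (0.5 + confidence)."""
    usage_factor = 1 + math.log2(max(usage_count, 1))  # logarithmic in usage
    confidence_factor = 0.5 + confidence                # linear in confidence
    return BASE_TTL_HOURS * usage_factor * confidence_factor


print(skill_ttl_hours(1, 0.5) / 24)    # fresh skill, in days
print(skill_ttl_hours(16, 0.9) / 24)   # well-exercised skill, in days
```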

Enough structure to be useful. Not enough structure to become its own religion.

The Intentionally Primitive First Version

Was it elegant? Reasonably. Was it semantic? Not really.

The first implementation was intentionally primitive. Rule-based context extraction. Keyword-driven pattern detection across ten task structures and ten cognitive patterns. Jaccard similarity for structural matching. Full-scan retrieval: no vector index, no embedding-based recall. Deterministic feature extraction.

Basically the cognitive equivalent of saying, "Let us begin with a wrench before we start writing poems about self-improving systems."

This was not because I think keyword matching is the future. It was because I wanted to know whether the loop itself was worth taking seriously before adding semantic frosting and pretending the cake had already been baked.

The scoring system weighted four dimensions: structural similarity at 0.35 (Jaccard over task structures and cognitive patterns, plus shape matching), semantic similarity at 0.25 (keyword overlap in v1, embeddings when available), transfer history at 0.25 (historical success rate of cross-context application), and skill confidence at 0.15.
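
A minimal sketch of that weighted score, using the four weights listed above. The structural term here is plain Jaccard over feature sets, which is a simplification (the shape-matching part is left out). The function names and sample inputs are hypothetical.

```python
WEIGHTS = {"structural": 0.35, "semantic": 0.25, "history": 0.25, "confidence": 0.15}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_score(structural: float, semantic: float, history: float, confidence: float) -> float:
    """Weighted sum of the four dimensions, each expected in [0, 1]."""
    parts = {"structural": structural, "semantic": semantic,
             "history": history, "confidence": confidence}
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Hypothetical feature sets for a stored skill and an incoming task.
skill_features = {"decompose", "analyse", "synthesise"}
task_features = {"decompose", "analyse", "route"}
structural = jaccard(skill_features, task_features)  # 2 shared out of 4 total
score = transfer_score(structural, semantic=0.2, history=1.0, confidence=0.72)
print(round(structural, 2), round(score, 3))
```

With history at 0.25 of the weight, a skill that has never recorded a failed transfer gets a quarter of the maximum score before any matching happens.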

The Benchmark

Then came the benchmark. Thirty tasks. Two tracks.

Track A tested cross-domain transfer. Three learning phases, then seven recall phases in structurally similar but semantically different domains. Learn a decomposition procedure from text analysis, then see whether it helps with supply chain planning.

Track B tested same-domain accumulation. Twenty sequential veterinary diagnostic cases, because diagnostics has enough repeated structure to expose whether prior procedures are helping or whether the system is just cosplaying wisdom.

I compared two conditions. The Brain condition ran a six-agent pipeline: reasoner, learn, recall, applier, feedback, result. The Brainless condition ran three agents: reasoner, applier, result. Same model. Same temperature. Same prompts where applicable. All running locally through LM Studio, completely offline. No API calls. No cloud. Just a GPU and Redis.
Then I used an LLM judge to score outputs in two ways: independently against a six-dimension rubric (reasoning quality, structural completeness, depth of analysis, actionability, domain adaptability, confidence calibration), and through blind pairwise comparison where the judge saw both outputs side by side without knowing which was which.

What Happened

This is the part where half the internet would like me to say the system awakened, generalized, and began cultivating its own cognitive farmland while Gregorian chanting played softly in the background.

That did not happen! :(
What happened was better. Something real, and smaller.
Pairwise comparison: Brain won 63% of head-to-head matchups (19 out of 30). That is not nothing. There was a detectable, consistent preference. The strongest signal was in perceived trustworthiness: Brain won 68% of trustworthiness comparisons, which is interesting because trustworthiness in LLM systems is often just a more polite word for "this output feels less like it was assembled by a caffeinated raccoon."

Rubric scores: Nearly flat. Overall delta plus 0.10 on a 10-point scale. Reasoning quality showed the largest individual improvement at plus 0.28. Depth of analysis showed exactly zero delta: a ceiling effect where neither condition could push further.

That is not breakthrough territory. That is not even "start writing your Nobel acceptance speech in a local markdown file" territory. That is exactly the kind of result I wanted. Not because the gain is impressive, but because the benchmark forced the system to confess what it actually is.

What the Skills Looked Like

Across 30 tasks, the system created 21 distinct skills after deduplicating 9 that were structurally equivalent. Average confidence settled at 72%. The most popular skill, "Evaluation via Validation," was recalled 9 times and reached 79% confidence. TTLs ranged from 8 to 37 days based on usage.

One detail was revealing: the system never recorded a transfer failure. Every recalled skill, when applied to a new context, was marked as successful. This makes the feedback loop suspect. Either the feedback criteria were too permissive, or the skill-context matching was conservative enough to avoid clear mismatches. Either way, it means the confidence updates were asymmetric: skills could only grow, never seriously shrink, which is a measurement problem I need to fix.

The Biggest Finding

The model already knew most of what the Brain was recalling. The system was remembering procedural patterns like decompose, analyse, synthesize. Validate, classify, route. Iterative refinement. All useful patterns. All patterns the underlying model had almost certainly already absorbed during pre-training.

So the Brain was not teaching the model some exotic new craft from the mountains. It was mostly reminding it to behave a little more consistently.

That matters. It also kills a lot of hype.

Because once you see that, you stop fantasizing about "agent memory" as some magical layer that turns a model into a wise little apprentice blacksmith forging general intelligence in your terminal.

Sometimes memory is just structured context with better bookkeeping.
And to be clear, that is still useful.
Useful is underrated.
Useful pays rent while hype writes threads.

Honest Confounds

The other thing the benchmark made painfully clear is that bad evaluation can flatter almost anything if you let it.

A few things I had to stare at honestly:

Pipeline length. The Brain condition passes through three extra LLM calls. That alone could be enriching context in ways that have nothing to do with skill retrieval. The 15% time overhead (595 seconds vs. 517 seconds for the full benchmark) is cheap, but the extra context injection is a real confound.

Position bias. The pairwise judge preferred the first position 61% of the time, regardless of which condition was placed there. I randomized positions, which mitigates but does not eliminate this.

Single run, single model. I did not run this 50 times and average. The results are from one end-to-end execution. Non-determinism is present but unquantified.

Outlier sensitivity. A catastrophic failure in one condition can pretend to be proof of another. A single badly generated veterinary case could shift aggregate scores in a 30-task benchmark.
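On the position bias point, one common mitigation beyond randomizing is to judge every pair in both orders and count a win only when the verdict survives the swap. The sketch below assumes a judge that returns "first" or "second". The function names and the toy judges are hypothetical, and this is not what the benchmark above did.

```python
from typing import Callable

# A judge takes (first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str], str]


def debiased_preference(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    forward = judge(answer_a, answer_b)   # A shown first
    backward = judge(answer_b, answer_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    # The verdict flipped with the position, so treat it as no signal.
    return "tie"


def always_first(first: str, second: str) -> str:
    """Toy judge that is pure position bias."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Toy judge that ignores position and prefers the longer answer."""
    return "first" if len(first) >= len(second) else "second"
```

A judge that only ever picks the first slot produces nothing but ties here, which is the point: position-driven verdicts drop out of the win count. The cost is twice the judge calls.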

If you want to lie to yourself in AI, you are never alone. The tooling is ready to help.

That is why I published the result with the weak parts exposed.

No heroic framing. No fake certainty. No "this changes everything" perfume sprayed over a modest engineering result.

What I Know Now

Just this:

The loop is buildable. The full learn-persist-retrieve-apply-feedback-decay cycle works end to end. Thirty task procedures deduplicated into 21 skills. Transfer histories are tracked. Skills expire. The plumbing works.

The signal exists. 63% pairwise preference is consistent and non-trivial.

The cause of that signal is still ambiguous. It could be genuine procedural transfer, or it could be richer context from extra LLM passes, or some combination.

The current bottleneck is abstraction, not storage. The v1 system stores procedures as structured versions of traces. It does not truly abstract them. It does not generalize them semantically. It does not compress them into domain-independent tactics with actual conceptual teeth. The context analyzer runs on hardcoded keyword dictionaries, not semantic understanding. Retrieval is a full scan, not an index.

That last part matters most.
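To make the limitation concrete, here is a minimal sketch of keyword-overlap retrieval done as a full scan. The Jaccard scoring, the skill names, and the keyword sets are assumptions chosen for illustration, not the v1 implementation.

```python
from typing import Optional


def keyword_overlap(query: str, skill_keywords: set[str]) -> float:
    """Jaccard overlap between the query's words and a skill's keyword set."""
    query_words = set(query.lower().split())
    if not query_words or not skill_keywords:
        return 0.0
    shared = query_words & skill_keywords
    return len(shared) / len(query_words | skill_keywords)


def retrieve(query: str, skills: dict[str, set[str]]) -> Optional[str]:
    """Full scan: score every stored skill and return the best non-zero match."""
    best_name, best_score = None, 0.0
    for name, keywords in skills.items():
        score = keyword_overlap(query, keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical skill store with hardcoded keywords.
skills = {
    "decompose-analyse-synthesise": {"split", "analyse", "combine"},
    "validate-classify-route": {"validate", "classify", "route"},
}

literal_hit = retrieve("classify and route this ticket", skills)
paraphrase_miss = retrieve("break the problem into parts", skills)
```

The second query describes decomposition in different words and matches nothing. That gap between wording and meaning is what a keyword dictionary cannot close.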

What Comes Next

So now the next question is finally the right one.

Not "can we talk beautifully about agent memory?"

We already know the answer to that. Absolutely. People can talk beautifully about almost anything. Especially if nobody asks for logs.

The real question is whether better abstraction and better retrieval change the outcome materially.

If I replace deterministic trace structuring with actual procedural abstraction (compressing "decompose input into parts, then analyse each part, then synthesise results" across domains into a generalised decompose-analyse-synthesise tactic), and if I replace keyword overlap with embedding-based retrieval or something even smarter, does the loop start doing something that a well-trained model does not already do by default?

That is the threshold.

That is where plumbing starts becoming research instead of respectable mechanical honesty.

And honestly, I prefer it this way.

I would rather publish a first implementation with modest results and sharp limits than one more dramatic post about the dawn of adaptive cognition from someone who has never had to decide what expires, what merges, what fails, and what gets written back after the benchmark finishes.

There is enough incense in AI already.
I am more interested in pipes.
Because pipes, unlike vibes, occasionally carry water.

OrKa Brain is part of OrKa, an open-source YAML-first AI agent orchestration framework. The full benchmark, including task definitions, raw results, judge transcripts, and the technical paper, is available in the repository.

Intelligence, Farming, and Why AI Is Still Mostly in Its Tool Phase

marcosomma — Wed, 18 Mar 2026 23:20:04 +0000

People usually talk about intelligence as if it starts with language, tools, or raw brainpower. I do not think that is enough. In the bigger evolutionary picture, intelligence starts when a living thing stops just reacting to whatever is in front of its face and begins carrying a rough model of the world in its head. A kind of inner sketch. Something that helps it remember, predict, adjust, and act not only for now, but for later.

A lot of animals do this. They are not stupid. They solve problems, learn patterns, adapt, trick each other, and survive in ways that are honestly impressive. So intelligence is not some magical human-only plugin installed by the universe. What is rare is not intelligence itself. What is rare is the moment when intelligence stops being useful only for survival and starts becoming a world-editing machine.

That is where humans took a weird turn.

The real jump was not just tools. A stick is great. A sharp stone is great. Fire is very great, especially if you are cold and trying not to die. But none of those alone explain the massive leap. The deeper change happened when humans got trapped, in the best possible way, inside long loops of cause and effect. Not just act now, eat now, survive now. But act now, wait, remember, adjust, come back, check again, fix the mess, and maybe eat in three months if you did not completely ruin the plan.

That is why agriculture matters so much.

Farming is not just “food but slower.” It is a completely different mental game. Hunting can involve planning, yes, but farming basically forces you to become the project manager of a very annoying and unpredictable system. You put seeds into the ground and then spend months negotiating with dirt, water, weather, insects, time, and your own bad decisions.

You are no longer finding food. You are trying to convince the future to cooperate.

And the future is rude.

Farming forces you to track things you cannot immediately see. You have to remember what you planted, where you planted it, when you planted it, whether it got enough water, whether the season is changing, whether pests are coming, whether the river is helping or preparing to ruin your entire week. This is no longer simple reaction. This is delayed feedback. This is long-horizon thinking. This is your brain being dragged into a repeated loop of prediction, intervention, failure, correction, and trying again.

That matters.

Because once cognition enters those kinds of loops, it changes character. The mind is no longer just spotting opportunities in nature like some clever scavenger. It starts designing future conditions. It starts shaping the environment so reality later matches a plan that only existed in imagination. That is a much bigger deal than “human use tool.”

So I would say agriculture did not create intelligence. It turned intelligence into infrastructure.

That also helps explain why many animals are clearly intelligent and yet never end up building cities, irrigation systems, tax forms, or extremely depressing office software. Intelligence alone is not enough. To get civilization, at least three things need to show up together.

First, you need loops that reward long-term thinking.

Second, you need a way to pass useful knowledge along, so each generation does not have to restart from “what if rock but pointy?”

Third, you need the ability to change the environment in ways that keep paying off over time.

Without those three, intelligence stays local. It helps you survive. It helps you stay a very competent crow, octopus, wolf, or ape. But it does not become civilization. Once those three things combine, intelligence escapes the skull. It gets baked into tools, habits, systems, stories, roads, farms, laws, and all the other strange things humans build when they have too much memory and not enough chill.

And this is where AI becomes interesting.

Because I think we make the same mistake with AI that people make when talking about human intelligence. We see one part of the process and declare victory too early.

Current AI systems are impressive, yes. Very impressive. Sometimes absurdly impressive. They predict well, generate well, imitate well, summarize well, and occasionally hallucinate with the confidence of a man explaining barbecue technique after reading half a Wikipedia page. But that does not automatically make them intelligence in the full sense.

What we mostly have today are intelligence tools.

That is different.

A model can predict the next token, classify an image, rank options, generate code, or infer patterns from huge amounts of data. Great. But prediction alone is not the same thing as durable intelligence. That is like saying someone who can walk ten kilometers can obviously run ten kilometers. No. Walking helps. But running requires different coordination, training, adaptation, and stress handling. Same legs. Different system.

AI right now is mostly at the “good legs” stage.

Very good legs, to be fair.

And yes, I know people love to point at one technical component and treat it like the sacred spark. ReLU, attention, scaling laws, whatever the buzzword of the season is. Those things matter. They are useful engineering breakthroughs. But no single ingredient is “the birth of intelligence.” That is like claiming the reason civilization exists is because someone once invented a better shovel. Useful, yes. Complete explanation, no.

The real question is not whether a model can predict well. The real question is whether a system can enter long loops of memory, planning, action, feedback, correction, and transfer, then keep improving in a stable way over time.

That is where the AGI discussion usually gets blurry.

If we define AGI as “models with memory, planning, and tool use,” then congratulations, we already have that. Agentic systems exist. Tool-using systems exist. Multi-step planners exist. Memory layers exist. The problem is that this definition is so loose it is almost useless. It is like saying a bicycle and a spaceship are both transportation, so close enough.

No.

We need a stricter threshold.

The real jump would be something more like this: a system that can keep relevant state across long periods, learn from past mistakes in a way that becomes reusable skill, handle long multi-step goals without falling apart every time the environment changes, transfer what it learned from one task to another related task, and do all this reliably enough that it feels less like workflow glue and more like stable competence.

That, to me, is the actual missing layer.

Not prettier outputs.
Not better demos.
Not one more benchmark where the model answers history questions slightly faster than last quarter.

What is missing is durable adaptive cognition.

That is the point where AI would stop being mostly a smart component and start feeling more like a real cognitive system.

So the distinction I would make is simple.

A model is a predictor.

An agentic system is a predictor plus some scaffolding, like tools, memory, or planning loops.

A higher intelligence system would be something that can keep learning across time, preserve useful structure, adapt without being rebuilt every five minutes, and shape its own future performance through repeated interaction with the world.

That last part matters most. Human intelligence became historically dominant because it did not stay inside the head. It got externalized into tools, memory systems, culture, infrastructure, and environmental change. If AI ever makes a similar leap, it will not be because one model gets even bigger and starts speaking in more confident paragraphs. It will be because predictive systems get embedded in persistent loops that let them remember, act, revise, transfer, and compound.

So my view is this.

Today’s AI is not yet the machine equivalent of civilization-level intelligence. It is closer to the tool phase. Very powerful tools, yes. Sometimes shocking tools. Sometimes tools that write code better than half the internet and worse than a tired senior engineer on a Tuesday. But still tools.

The next real jump will not come from prediction alone. It will come from systems that can live inside long feedback loops and get better because of them.

Basically, farming for machines. And hopefully with fewer locusts.

I Am Tired of Fake AI Expertise

marcosomma — Tue, 17 Mar 2026 10:14:22 +0000

I have spent the last year trying to talk about AI as an engineering discipline.

Not AI as a content machine. Not AI as a growth trick. Not AI as a stream of screenshots, prompt hacks, and recycled takes written by the same models people claim to master.

I mean AI as systems work.

Orchestration. Validation. Data quality. Observability. Evaluation. Failure handling. Context boundaries. Retry policies. Structured outputs. Cost control at the workflow level. Real interfaces between probabilistic components and deterministic software.

And honestly, part of the reason I stepped back from that conversation is simple: too much of the public AI discourse is being led by people who do not build real AI systems.

They are loud. They are polished. They are confident. They are often rewarded for being confidently wrong.

That is the part that disappoints me.

The current wave of self-proclaimed "AI experts" is flattening a difficult field into a set of cheap slogans. A domain that requires serious expertise is being turned into social media theatre. And the result is not just annoying. It is actively harmful.

It is making people misunderstand what AI is, how it fails, where it costs money, and what actually makes it useful in production.

The field is being narrated by people who optimize for reach, not rigor

Recently I saw yet another high-visibility post making a big point about format optimization and token savings, as if shaving a few characters from JSON were some major breakthrough in AI engineering.

This is the kind of thing that gets thousands of likes.

A side by side screenshot.
A catchy claim.
A simple narrative.
A fake sense of leverage.

And once again the message was basically this: if you are still doing things the old way, you are wasting money.

This is the language of marketing, not engineering.

The problem is not that someone shared an imperfect idea. Imperfect ideas are fine. Early exploration is fine. Public discussion is fine. We all get things wrong.

The problem is the posture of expertise around it.

There is a massive difference between saying, "I tried this and here are the results, caveats, and failure modes," and saying, "Here is the better way," when the claim is based on shallow intuition, weak evidence, and no visible system-level execution.

That difference matters.

Because a lot of people reading those posts are not experienced enough to detect the gap.

They see confidence and assume competence.
They see engagement and assume validity.
They see a title and assume credibility.

And that is how misinformation spreads in technical fields. Not through obvious lies, but through reduction. Through oversimplification. Through confident framing of weak ideas.

Tokens have become marketing

One of the worst examples of this is token discourse.

Tokens matter. Of course they matter. Costs matter. Latency matters. Compression matters. Input design matters.

But token count has become the vanity metric of AI engineering.

It is easy to post about because it looks measurable. It fits inside a screenshot. It creates a simple hero story. "Look, I reduced 40 percent of the tokens." Great. And what happened to reliability? What happened to parse consistency? What happened to failure recovery? What happened to total workflow cost after retries, validation, tool calls, and fallback paths?

That is the real question.

A shorter prompt is not automatically a better system.
A smaller payload is not automatically a better architecture.
A new text format is not automatically a better interface for a stochastic model.

Sometimes saving tokens means losing robustness.
Sometimes saving tokens means increasing ambiguity.
Sometimes saving tokens means moving complexity downstream into validation and repair.
Sometimes saving tokens means nothing at all, because the real cost of the system is somewhere else.

This is what too many public AI voices still fail to understand.

AI cost is not just prompt cost.
AI quality is not just output prettiness.
AI engineering is not just model interaction.

The real economy of AI is at the system level.
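A back-of-the-envelope sketch of why prompt size alone is the wrong number to watch. All figures are hypothetical, and the model is deliberately crude: attempts are assumed independent with unlimited retries, so the expected number of attempts is 1 / pass_rate.

```python
def expected_tokens_per_accepted_result(
    prompt_tokens: int,
    output_tokens: int,
    validation_tokens: int,
    pass_rate: float,
) -> float:
    """Expected tokens spent until one attempt passes validation.

    Assumes independent attempts and unlimited retries, so the expected
    number of attempts follows a geometric distribution: 1 / pass_rate.
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    per_attempt = prompt_tokens + output_tokens + validation_tokens
    return per_attempt / pass_rate


# Hypothetical numbers, for illustration only.
# The compact prompt is 40 percent shorter but passes validation less often.
verbose = expected_tokens_per_accepted_result(1000, 300, 200, pass_rate=0.95)
compact = expected_tokens_per_accepted_result(600, 300, 200, pass_rate=0.60)
```

Under these made-up numbers the shorter prompt ends up costing more tokens per accepted result than the longer one. Whether that happens in a real system is an empirical question, which is exactly why it has to be measured at the workflow level.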

The real cost is in everything around the model

If you have ever shipped a real AI feature, you know where the effort goes.

It goes into making sure the right context is available at the right moment.
It goes into preventing irrelevant context from leaking in.
It goes into checking whether the model output is complete, valid, safe, and usable.
It goes into retries when the model drifts.
It goes into routing when one step should not be handled by the same prompt as another.
It goes into fallback strategies when the first attempt is weak.
It goes into evaluating whether a result is acceptable before it reaches a user.
It goes into observability so you can explain why the system behaved the way it did.
It goes into datasets so your judgments are not based on vibes.
It goes into data quality so the model is not forced to reason on garbage.

That is where the tokens get burned.

And that is correct.

Those tokens are not waste. They are the cost of making a probabilistic component useful inside a product.

This is what so much AI content gets backwards. It treats the model call as the whole system. It assumes the right prompt is the product. It implies that if you phrase the question well enough, the problem is solved.

That is not how production works.

A prompt is an input. A model is a stochastic component. A product is a controlled system around them.

If you collapse those distinctions, you are not doing AI engineering. You are gambling.

Prompting alone is not engineering. It is gambling.

I keep repeating this line because I think it cuts to the center of the problem.

Prompting alone is not engineering. It is gambling.

Yes, prompting matters. Yes, prompt design can improve outcomes. Yes, well structured instructions can reduce confusion and guide the model.

But prompting is not a substitute for architecture.

It is not a substitute for validation.
It is not a substitute for proper interfaces.
It is not a substitute for evaluation.
It is not a substitute for state control.
It is not a substitute for business rules.
It is not a substitute for deterministic code where deterministic code should exist.

And yet an absurd amount of public AI discourse still acts as if prompting is the main skill. As if being fluent in prompt phrasing is equivalent to understanding AI systems.

It is not.

A person can be very good at prompting and still have almost no understanding of reliability engineering, retrieval quality, orchestration design, evaluation methodology, observability, or failure containment.

That is why I have become increasingly skeptical of AI advice that starts and ends with "here is a better prompt."

A better prompt for what?
Under which constraints?
With what model?
Against which dataset?
Measured how?
Compared to what baseline?
Under what latency budget?
With what failure rate?
With what retry policy?
Inside what workflow?
At what scale?
For which users?
Against which acceptance criteria?

Without those questions, we are not discussing engineering. We are discussing prompt aesthetics.

2025 was the year of demos. 2026 should be different.

I can understand how we got here.

In 2025, the industry was still drunk on demos. That phase made sense. Everything felt new. Chat interfaces looked magical. People discovered that a model could generate code, write marketing copy, extract structure from text, summarize documents, and imitate expertise with frightening smoothness.

So of course the conversation was dominated by novelty.

People were exploring.
People were guessing.
People were posting every new trick they found.
The market rewarded velocity, not discipline.

Fine.

But we are not there anymore.

In 2026, this excuse is weaker. We have already seen enough failures, hallucinations, broken agents, fake automation, and "AI powered" wrappers to know that prompting your way through complexity does not scale.

We should be having better conversations by now.

We should be talking more about evaluation design than prompt poetry.
We should be talking more about system boundaries than persona tuning.
We should be talking more about retrieval quality than format gimmicks.
We should be talking more about workflow control than chatbot charisma.

Instead, too many large accounts are still posting beginner-level content with expert-level confidence.

That is not harmless. It distorts the learning environment for everyone coming into the field.

This is why so much AI still does not work

A lot of people ask why AI products still feel fragile.

Why do they fail on edge cases?
Why do they break in production?
Why do they look impressive in demos and weak in real usage?
Why do teams burn money without creating durable value?
Why do so many "agents" look like wrappers with marketing?

This is part of the answer.

Because too many people still think AI is an oracle.

They still approach it like a mystical reasoning engine that only needs the right wording. They still believe the model is the product. They still imagine that clever prompting is a replacement for engineering discipline.

So they underinvest in everything that actually makes the system work.

They underinvest in ground truth data.
They underinvest in evals.
They underinvest in routing logic.
They underinvest in structured interfaces.
They underinvest in observability.
They underinvest in negative testing.
They underinvest in validation.
They underinvest in deterministic controls.

Then they are surprised when the system behaves like a stochastic component with partial competence and unstable boundaries.

That surprise is not a model failure. It is a design failure.

AI does not fail because it is useless.
AI fails because people keep trying to deploy it as magic.

Expertise should be demonstrated, not announced

The most frustrating part is not being wrong. Everyone is wrong sometimes.

The most frustrating part is the performance of expertise.

The field is full of titles, badges, self descriptions, and aesthetic authority. "Top voice." "AI expert." "Thought leader." "Award winning." Fine. None of that tells me whether you understand evaluation drift, state leakage, retrieval contamination, schema reliability, fallback routing, or cost accumulation across a multi step pipeline.

Show me the system.
Show me the logs.
Show me the benchmark.
Show me the constraints.
Show me the failure modes.
Show me the tradeoffs.
Show me the production scars.

That is what builds credibility.

I trust practitioners who expose uncertainty and show their work. I trust people who can explain not just what succeeded, but what broke and why. I trust engineers who understand that AI is not one prompt and one output, but an unstable component that becomes useful only when surrounded by structure.

I do not trust polished certainty without evidence.

And I think more of us need to say that openly.

We need less AI theatre and more systems thinking

This article is not a call to stop experimenting. It is the opposite.

Experiment more. Build more. Test more. Share results more.

But stop pretending that shallow takes are deep expertise.
Stop teaching people that token screenshots are system design.
Stop selling prompting as if it were engineering.
Stop flattening a hard field into content loops.

If you want better AI products, treat AI like what it is: a probabilistic system component that must be constrained, validated, observed, and integrated with care.

That is less sexy than "10 prompts that changed my workflow."
It is less viral than side by side screenshots.
It is less accessible than fake certainty.

But it is real.

And right now, real is exactly what this field needs more of.

Because the problem is no longer that AI is misunderstood by outsiders.

The problem is that too much of it is being misexplained by insiders.

If we want the field to mature, we need fewer self-proclaimed experts and more actual practitioners.

Not louder people.
Better ones!

The Old Seniority Definition Is Collapsing

marcosomma — Thu, 05 Mar 2026 08:40:59 +0000

For a long time, “senior developer” was a fairly consistent signal. You expected someone who could hold a large architecture in their head, write clean code with low defect rates, debug almost anything, and reason about performance without guesswork. That bundle made sense because the hardest part of shipping software was often the execution layer: translating intent into correct, maintainable code at speed.

That bundle is breaking.

AI-assisted development is compressing the cost of producing plausible, working code. Not always. Not uniformly. But enough that “I can ship a lot of code quickly” is no longer a reliable proxy for deep seniority. In many teams, velocity metrics are starting to measure who is best at driving the tool, not who is best at building systems that survive contact with reality.

What AI Is Actually Commoditizing

AI is not replacing engineering. It is discounting a specific slice of it: first-pass implementation and the mechanical parts of refactoring. The tool is good at producing code that looks right, compiles, and often passes superficial tests. That changes the economics of execution.

What does not get discounted at the same rate is integration into a real system with real constraints: data contracts, failure modes, security boundaries, observability, and long-term maintenance. In practice, the bottleneck shifts from typing to supervision. You spend less time writing and more time specifying, verifying, reviewing, and correcting.

This is why you can see two realities at the same time. Some developers experience dramatic speedups on bounded tasks. Others experience slowdowns inside large, messy codebases because prompting, waiting, and review overhead replace keystrokes, and because the model lacks the local context that makes a patch truly correct.

What Is Rising in Value

Problem decomposition and system thinking become the differentiator because they convert ambiguity into an executable plan. When you are dealing with something like regulatory delta detection, the hardest part is not writing code. The hardest part is deciding where the complexity actually lives, and what you must make explicit so the system stays correct as the domain evolves. The choice between a graph database and a simpler model is rarely a “tech taste” debate. It is a tradeoff between query expressiveness, operational burden, debuggability, and change management.

Judgment under uncertainty becomes a senior marker because architecture is mostly irreversible decisions made with incomplete information. Moving from direct graph writes to a changeset-based approach with content hashing is not an implementation detail. It is a bet on how you will observe change, roll back safely, explain behavior to customers, and avoid silent drift. That decision quality is what compounds over months.
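For readers who have not met the pattern, a minimal sketch of a changeset built with content hashing. The record shape, node ids, and function names are hypothetical. The idea is only that you compute what would change, and can inspect or reject it, before anything is written.

```python
import hashlib
import json


def content_hash(record: dict) -> str:
    """Stable SHA-256 of a record; sorted keys make it independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_changeset(current: dict[str, dict], incoming: dict[str, dict]) -> dict[str, list[str]]:
    """Compare two snapshots keyed by node id and list what would change."""
    changeset: dict[str, list[str]] = {"added": [], "updated": [], "removed": []}
    for node_id, record in incoming.items():
        if node_id not in current:
            changeset["added"].append(node_id)
        elif content_hash(record) != content_hash(current[node_id]):
            changeset["updated"].append(node_id)
    changeset["removed"] = [node_id for node_id in current if node_id not in incoming]
    return changeset


# Hypothetical snapshots of a small document graph.
current = {
    "doc-1": {"title": "Rule A", "status": "in force"},
    "doc-2": {"title": "Rule B", "status": "in force"},
    "doc-4": {"title": "Rule D", "status": "in force"},
}
incoming = {
    "doc-1": {"status": "in force", "title": "Rule A"},  # same content, different key order
    "doc-2": {"title": "Rule B", "status": "repealed"},
    "doc-3": {"title": "Rule C", "status": "in force"},
}
changeset = build_changeset(current, incoming)
```

Because the changeset is plain data, it can be logged, reviewed, and replayed or reverted, which is where the observability and rollback benefits come from.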

Context and domain mastery become a moat because they are earned, not generated. If you understand how CELEX identifiers behave in practice, how MiCAR compliance maps to document reality, or how jurisdictions interpret rules differently, you carry constraints that materially shape the architecture. AI can help you express that knowledge. It cannot reliably invent it. Without domain context, you get confident code that is wrong in the ways that matter.

Technical leadership becomes central because building systems is increasingly a multiplayer game. The question is whether you can create a design that other people can implement without constant back-and-forth, and whether you can write specifications that converge rather than fork. This is why a workshop like SDD Pills matters. It trains decision-making and clarity, not syntax.

Mentoring and knowledge transfer become leverage because the highest-value output of a senior engineer is often the improvement of everyone else’s output. AI amplifies this. Teams that learn how to bound AI usage with clear contracts, acceptance criteria, and review discipline get compounding returns. Teams that treat AI as an oracle get compounding debt.

The Uncomfortable Truth: Two Axes Have Split

There are now two skill axes that used to correlate and no longer do.

One axis is technical depth: how well you understand systems, tradeoffs, failure modes, and the long-term consequences of design choices.

The other axis is execution speed: how quickly you can produce working code.

Historically, depth and speed often moved together. Deep engineers tended to execute quickly because they saw the path. Today, you can get high speed with low depth by delegating thinking to the tool. That can look senior on dashboards and in weekly updates. It is not senior if the output is brittle, unobservable, and expensive to maintain.

The inverse also exists: high depth with lower raw output speed can still be very senior if the person consistently makes decisions that reduce risk, eliminate classes of bugs, and increase team throughput.

What This Breaks in Hiring and Promotion

Many organizations still reward visible output: commits, tickets closed, apparent velocity. AI makes these signals noisier because the cost of producing code has dropped, while the cost of validating correctness has often increased. The net effect is that the old metrics over-credit the wrong behaviors and under-credit the work that actually keeps systems stable.

The evaluation problem is that “code shipped” is no longer tightly coupled to “engineering done.” A senior engineer in 2026 is often the person who prevented the incident you never had, removed an entire category of future work by designing the right abstraction, or wrote a spec that made five people productive instead of confused.

What to Measure Instead

The most useful seniority markers become visible if you look for decision quality, not output quantity.

A senior engineer can take an ambiguous problem and produce a specification that is testable and unambiguous. They can make uncertainty explicit by stating what is known, what is assumed, and what the cost of being wrong looks like. They consistently surface non-functional requirements early, especially observability, maintainability, and security, because those are the constraints that explode later.

They use AI as a bounded tool. They know when to ask it for a scaffold, when to demand alternatives, and when to reject a suggestion because they understand the scaling and failure modes. Patterns like Planner, Executor, Reviewer work when they are treated as control systems with clear acceptance criteria, not as theater.

Why “Senior” Is Drifting Toward “Principal”

Role expectations are shifting. Senior used to mean “I can personally deliver complex work.” Increasingly it means “I can make the right decisions and increase the output quality of everyone around me.” That is closer to what many companies used to call principal or architect.

This shift is healthy if organizations adapt their evaluation criteria. It is painful if they do not. People whose main advantage was fast execution will feel the floor drop out, because execution has been discounted. People who were already strong in decomposition, judgment, and leadership will become more valuable, because those skills are now the constraint.

What I’m Seeing in Teams

The developers adapting best to AI-assisted development are usually the ones who already had strong mental models and strong taste. They can turn ambiguity into constraints, and constraints into evaluation. They do not confuse “working code” with “correct system.” They treat AI output as a hypothesis that must be verified against invariants.

The developers struggling are often those who outsource thinking. They can generate a lot of code quickly, but they cannot defend why the design is correct, what it will cost to operate, or how it will fail.

If you are seeing a blur between depth and apparent execution speed, that blur is real. The solution is not to ban AI or to worship it. The solution is to change what you reward, and to interview and promote for the skills that actually compound.

LLMs Are Not Deterministic. And Making Them Reliable Is Expensive (In Both the Bad Way and the Good Way)

marcosomma — Sun, 22 Feb 2026 14:24:05 +0000

Let’s start with a statement that should be obvious but still feels controversial: Large Language Models are not deterministic systems. They are probabilistic sequence predictors. Given a context, they sample the next token from a probability distribution. That is their nature. There is no hidden reasoning engine, no symbolic truth layer, no internal notion of correctness.

You can influence their behavior. You can constrain it. You can shape it. But you cannot turn probability into certainty.

Somewhere between keynote stages, funding decks, and product demos, a comforting narrative emerged: models are getting cheaper and smarter, therefore AI will soon become trivial. The logic sounds reasonable. Token prices are dropping. Model quality is improving. Demos look impressive. From the outside, it feels like we are approaching a phase where AI becomes a solved commodity.

From the inside, it feels very different.

There is a massive gap between a good demo and a reliable product. A demo is usually a single prompt and a single model call. It looks magical. It sells. A product cannot live there. The moment you try to ship that architecture to real users, reality shows up fast. The model hallucinates. It partially answers. It ignores constraints. It produces something that sounds fluent but is subtly wrong. And the model has no idea it failed.

This is not a moral flaw. It is a design property.

So engineers do what engineers always do when a component is powerful but unreliable. They build structure around it.

The moment you care about reliability, your architecture stops being “call an LLM” and starts becoming a pipeline. Input is cleaned and normalized. A generation step produces a candidate answer. Another step evaluates that answer. A routing layer decides whether the answer is acceptable or if the system should try again. Sometimes it retries with a modified prompt. Sometimes with a different model. Sometimes with a corrective pass. Only after this loop does something reach the user.
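A minimal sketch of that loop, assuming a generator and an evaluator that are plain callables. The names `run_with_control_loop`, `fake_generate`, and `fake_evaluate`, the 0.8 threshold, and the three-attempt limit are all hypothetical; the toy functions stand in for real model calls so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    answer: str
    score: float


def run_with_control_loop(
    request: str,
    generate: Callable[[str, int], str],
    evaluate: Callable[[str, str], float],
    threshold: float = 0.8,   # hypothetical acceptance threshold
    max_attempts: int = 3,    # hypothetical retry limit
) -> tuple[Optional[str], list[Attempt]]:
    """Return (accepted answer or None, trace of every attempt)."""
    normalized = " ".join(request.split())  # stand-in for input cleaning
    trace: list[Attempt] = []
    for attempt_no in range(max_attempts):
        candidate = generate(normalized, attempt_no)   # generation step
        score = evaluate(normalized, candidate)        # evaluation step
        trace.append(Attempt(candidate, score))
        if score >= threshold:                         # routing: accept
            return candidate, trace
    return None, trace  # nothing passed; the caller decides the fallback


# Toy stand-ins so the sketch runs without any model behind it.
def fake_generate(prompt: str, attempt_no: int) -> str:
    return "partial" if attempt_no == 0 else f"full answer to: {prompt}"


def fake_evaluate(prompt: str, answer: str) -> float:
    return 0.9 if prompt in answer else 0.2


answer, trace = run_with_control_loop("  what is   OrKa? ", fake_generate, fake_evaluate)
print(answer)
print(f"model calls: {2 * len(trace)}")  # one generation plus one evaluation per attempt
```

The trace matters as much as the answer. It records how many attempts the system paid for before something passed, which is the number that never shows up in a demo.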

At no point did the LLM become deterministic. What changed is that the system gained control loops.

This distinction matters. We are not converting probability into certainty. We are reducing uncertainty through redundancy and validation. That reduction costs computation. Computation costs money.

This is why quoting token prices in isolation is misleading. A single model call might be cheap. A serious system rarely uses a single call. One user request can trigger several model invocations: generation, evaluation, regeneration, formatting, tool calls, memory lookups. The user experiences “one answer.” The backend executes a small workflow.

Token cost is component cost. Reliable AI is system cost.
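To make the difference concrete, here is a small arithmetic sketch. The prices and token counts are invented for illustration and do not reflect any vendor's real rates; only the shape of the calculation matters.

```python
# Hypothetical prices in dollars per one million tokens.
INPUT_PRICE = 1.0
OUTPUT_PRICE = 4.0


def request_cost(calls: list[tuple[int, int]]) -> float:
    """Dollar cost of one user request, given (input_tokens, output_tokens) per model call."""
    total = sum(i * INPUT_PRICE + o * OUTPUT_PRICE for i, o in calls)
    return total / 1_000_000


demo = [(1_000, 500)]  # the single call a demo makes
workflow = [
    (1_000, 500),  # generation
    (1_800, 200),  # evaluation
    (2_200, 500),  # regeneration after a failed check
    (800, 300),    # formatting
]

demo_cost = request_cost(demo)
workflow_cost = request_cost(workflow)
print(f"demo: ${demo_cost:.4f}  workflow: ${workflow_cost:.4f}")
print(f"ratio: {workflow_cost / demo_cost:.1f}x")
```

With these made-up numbers the workflow costs roughly four times the single call, at exactly the same per-token price.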

Saying “tokens are cheap, therefore AI is cheap” is like saying screws are cheap, therefore airplanes are cheap.

This leads to an uncomfortable but important truth. AI becomes expensive in two very different ways.

If you implement it poorly, it becomes expensive because you burn money and still do not get reliability. You keep tweaking prompts. You keep firefighting. You keep patching symptoms. Nothing stabilizes.

If you implement it well, it becomes expensive because you intentionally pay for control. You pay for evaluators. You pay for retries. You pay for observability. You pay for redundancy. But you get something in return: a system that behaves in a bounded, inspectable, and improvable way.

There is no cheap version of “reliable.”

Another source of confusion comes from mixing up different kinds of expertise. High-profile founders and executives are excellent at describing futures. They talk about where markets are going and what will be possible. That is their role. It is not their role to debug why an evaluator prompt leaks instructions or why a routing threshold oscillates under load. Financial success does not imply operational intimacy.

On the ground, building serious AI feels much closer to distributed systems engineering than to science fiction. You worry about data quality. You worry about regressions. You worry about latency and cost per request. You design schemas. You version prompts. You inspect traces. You run benchmarks. You tune thresholds. It is slow, unglamorous, and deeply technical.

LLMs made AI more accessible. They did not make serious AI simpler. They shifted complexity upward into systems.

So when someone says, “Soon we’ll just call an API and everything will work,” what they usually mean is, “Soon an enormous amount of engineering will be hidden behind that API.”

That is fine. That is progress.

But pretending that reliable AI is cheap, trivial, or solved is misleading.

The honest version is this: LLMs are powerful probabilistic components. Turning them into dependable products requires layers of control. Those layers cost money. They also create real value.

Serious AI today is expensive in the bad way if you do not know what you are doing.

Serious AI today is expensive in the good way if you actually want it to work.

And anyone selling “cheap deterministic AI” is selling a story, not a system.

Adversarial Planning for Spec Driven Development

marcosomma — Thu, 12 Feb 2026 21:39:13 +0000

I have always loved one idea in machine learning. The idea that you can sharpen a model by forcing it to face a challenger. You can call it adversarial training, red teaming, or constructive hostility. The name matters less than the mechanism. You introduce pressure. You introduce disagreement. You force the system to earn its confidence.

For years I kept that concept in a mental drawer labeled “cool, but academic.” Then it became a core concept within the Orka-reasoning development, but lately my attention has been shifting toward code agent workflows and how it all happens. Not the marketing version. The real version, where you sit down to ship software, and you realize that a helpful model is not the same thing as a rigorous model. Helpful is easy. Rigorous is costly.

This article is about how I tried to transplant an adversarial dynamic into Spec Driven Development sessions. Not as theater. Not as an AI debate club. As an engineering tool. It worked. It also nearly became a token-burning trap. That tradeoff is the point.

What Spec Driven Development means here

Spec Driven Development, or SDD, is a workflow where the spec is not documentation. The spec is the product of the thinking phase. It becomes the contract you implement against. You write it before code changes. You review it like you would review code. You use it to force scope, constraints, interfaces, and acceptance criteria into something explicit.

The point is not to be verbose. The point is to move ambiguity upstream, when it is still cheap. The spec becomes the unit of alignment, review, and iteration. Code is the execution of that spec, not the place where you discover what the spec should have been.

The problem with a trustable planner

If you use an LLM as a planning companion, you know the feeling. You come with your idea. You debate a bit. And then it gives you a plan. The plan is detailed. It is readable. It sounds plausible. It often includes little snippets that look like they belong in your codebase, even when they do not. It is confident. It is fast.

And that is exactly the problem.

A planner model has incentives you did not explicitly set. Its default incentive is to be useful to you in the moment. It wants to reduce friction. It wants to keep you engaged. It wants to produce something that reads like progress.

So it will fill gaps with assumptions, especially if your own initial plan is fluffy. It will smooth rough edges. It will complete the pattern of what a good plan should look like. It will also happily unlock future possibilities, because possibilities are cheap to generate and expensive to invalidate.

When you are deep in a product, that behavior is dangerous. Not because the model is malicious. Because it is compliant. It will often accept your framing even if your framing is wrong. It will not push hard unless you force it to.

This is the failure mode I kept hitting. I would craft a plan with the planner. I would feel momentum. Then I would start implementation and discover that the plan was under-specified in the only places that matter.

Interfaces were vague. Invariants were missing. Acceptance criteria were soft. The plan assumed the architecture could absorb a change without showing how. It assumed the code was more modular than it actually was. It assumed integration would be straightforward.

In other words, it was a nice plan. It was not a plan that survived contact with a real codebase.

Why adversarial dynamics work in ML

Adversarial training is interesting because it makes weakness visible. You do not improve a system by praising it. You improve it by exposing it to inputs that exploit its blind spots. You force it to fail in ways that are informative.

In a GAN, the generator learns because the discriminator is not polite. The discriminator does not care about your feelings. It cares about whether the output holds up under scrutiny. That pressure creates signal.

In engineering, we already do this. Code review is adversarial when it is healthy. Testing is adversarial by definition. Security review is adversarial. Load testing is adversarial. Even a good product manager is adversarial at the right moments.

But planning often is not. Planning often becomes a social process. People nod. People optimize for alignment. People avoid being the blocker. Under time pressure, that tendency gets amplified.

If you bring an LLM into planning and you let it be the agreeable teammate, you amplify the most comfortable version of planning. You pay tokens to make yourself feel certain.

That is not what I wanted. I wanted the planning stage to contain more of the pain, so implementation contains less.

The translation to SDD: Planner plus Architect

I kept the planner. I did not replace it. The planner is good at structure. It is good at decomposing a vague goal into sequential work. It is good at producing a spec you can follow. It is good at holding context across iterations.

But I introduced a second role. I call it the Architect. The job is simple. Challenge the plan as if you are the most annoying senior engineer in the room, with one constraint. The criticism must be grounded. It must point to specific failure modes. It must force explicit decisions.

The Architect pushes on the places where the planner tends to glide over reality. It asks what the boundary of the change really is. It asks what breaks if you do it, and what breaks if you do not. It pressures you to name the coupling you are creating and the coupling you are relying on. It attacks the parts of the spec that sound confident but are not falsifiable.

This role is unpleasant. It is supposed to be unpleasant. It is also productive, if you keep it under control.

The immediate effect was obvious. Specs became harder to write. My initial drafts got rejected more often. I had to define outcomes in tighter language. I had to stop relying on vibes and start writing constraints.

The less obvious effect was more important. I started noticing the difference between a plan that sounds implementable and a plan that is falsifiable.

A falsifiable plan is one where you can point at a step and say: if this condition does not hold, the step is wrong. If the step is wrong, we know why. We can adjust.

A non-falsifiable plan is one where every step is elastic. You can always reinterpret it. You can always claim partial success. It is planning as comfort.

The Architect hates comfort. That is the point.

What the Architect actually improved

It did not make my system magically correct. It made my system explicit.

It reduced scope creep because it forced me to define what done means in terms of observable outcomes. It reduced hidden coupling because it forced me to identify which pieces of the system now move together. It reduced abstraction drift because it forced me to state which module owns which responsibility. It improved testability because it pushed me to name the failure cases the system must catch and the layer that must catch them. It also lowered integration fantasies by making me draw the dependency edges in plain language.

This matters because most planning failures are not about missing steps. They are about missing friction. You only discover friction when someone tries to break your plan.

A planner rarely tries to break your plan. An Architect lives for it.

The mental model: controlled adversarial pressure

At some point I realized the dynamic I was building was not adversarial planning. It was controlled adversarial pressure.

Pressure is good when it produces signal. Pressure is bad when it produces noise.

The Architect can easily produce noise. It can challenge everything. It can question the existence of the feature. It can spiral into meta debates. It can do the classic senior engineer move of turning every change into a referendum on architecture.

That is why this approach can become dangerous. It is not just about tokens. It is about cognitive load. Too much adversarial pressure makes you doubt everything. You stop shipping. You start ruminating. You start optimizing a plan instead of building the thing.

So the key is control. You want the Architect to challenge the plan in a bounded way, then you move on.

The only sustainable use is somewhere in the middle. You let it break your plan until the breakage becomes repetitive. When the criticism starts looping, the review is done. That loop is your stop signal.

How the infinite loop happens

I learned this the hard way. The Architect is very good at finding the next critique, even when the plan is already good enough, even when the remaining critiques are marginal.

There are two reasons.

First, LLMs are generative machines. They can always produce another objection. The space of objections is large. Many objections are plausible. Plausible is not the same as important.

Second, adversarial roles reward themselves. When the Architect produces a clever critique, it feels like progress. It feels like rigor. It feels like you are doing serious engineering. You can get addicted to that feeling, especially if you already equate doubt with intelligence.

So you need stop conditions that are not emotional. You need boundaries that are mechanical.

Time is a boundary. Token budget is a boundary. The best boundary is value.

The question is: does this criticism point to a concrete failure mode that is likely in this codebase, in this release, under these constraints. If yes, incorporate it. If no, write it down as a future consideration and move on.
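Those stop conditions can be made mechanical. The sketch below is one way to do it, assuming each critique arrives with a token count and a human-supplied flag for whether it names a concrete failure mode. `Critique`, `bounded_review`, and the exact-match loop detection are hypothetical and deliberately crude; a real setup would need a better notion of "the same objection again".

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class Critique:
    text: str
    tokens: int     # tokens spent producing this critique
    concrete: bool  # human call: a likely failure in this codebase, this release?


@dataclass
class ReviewResult:
    incorporate: list[str] = field(default_factory=list)
    deferred: list[str] = field(default_factory=list)  # future considerations
    stop_reason: str = "architect ran out of critiques"


def bounded_review(critiques: Iterable[Critique], token_budget: int) -> ReviewResult:
    result = ReviewResult()
    seen: set[str] = set()
    spent = 0
    for critique in critiques:
        spent += critique.tokens
        if spent > token_budget:
            result.stop_reason = "token budget exhausted"
            break
        key = " ".join(critique.text.lower().split())  # crude repeat detection
        if key in seen:
            result.stop_reason = "criticism started looping"
            break
        seen.add(key)
        target = result.incorporate if critique.concrete else result.deferred
        target.append(critique.text)
    return result


stream = [
    Critique("Interface between parser and store is undefined", 400, True),
    Critique("Could we use event sourcing instead?", 350, False),
    Critique("interface between parser and  store is undefined", 300, True),
    Critique("Rollback path is missing", 300, True),
]
result = bounded_review(stream, token_budget=2_000)
print(result.stop_reason)
print(result.incorporate, result.deferred)
```

The value judgment stays human. The code only enforces that the loop ends and that marginal objections land in a list instead of in the plan.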

That discipline sounds simple. It is not. It requires you to accept that you will ship with risk. It requires you to prefer explicit risk over imagined safety.

Why this helps SDD specifically

SDD is already an attempt to move thinking earlier. You spend more effort defining the work before coding. That sounds obvious. It is not common.

Many teams code first, then retrofit clarity. Specs become documentation after the fact. Tests become a safety net after the mistakes.

SDD flips that. You try to make the spec the forcing function. The spec becomes the contract. The spec becomes the review surface. The spec becomes the artifact you can reason about without running the entire system in your head.

If your spec is weak, SDD collapses into bureaucracy. You get long documents that do not prevent failures. You get ceremonial approval. You get a spec that exists, but does not constrain the outcome.

The adversarial role helps because it forces the spec to earn its existence. It forces explicit interfaces. It forces explicit invariants. It forces explicit failure handling. It forces explicit success conditions. It makes the spec testable in a reasoning sense.

Doubt as a tool, doubt as a poison

There is a psychological aspect here that I did not expect.

When you introduce an adversarial voice into planning, you introduce doubt. That can be healthy. It can also be corrosive.

Healthy doubt looks like this. You have a plan. You expose it to pressure. You find the weak points. You fix them. You ship with more confidence because your confidence is earned.

Corrosive doubt looks like this. You have a plan. You expose it to pressure. The pressure never ends. You start believing that every plan is fragile. You stop trusting your ability to decide. You keep rewriting the plan to reduce anxiety. You ship nothing.

The difference is not intelligence. The difference is boundaries.

In a team, boundaries are social. Someone ends the meeting. Someone says enough, we decide. Someone accepts risk explicitly.

In a solo workflow with agents, you need to manufacture that boundary. Otherwise the system will drift toward endless review because endless review feels safer than a decision.

If you are prone to overthinking, an adversarial agent can amplify that trait. It can turn careful into paralyzed. That is not a reason to avoid it. It is a reason to instrument it.

What this is not

This is not asking an AI to argue with itself and then picking a side. That is entertainment. It can be useful for brainstorming. It is not a development methodology.

This is not letting the Architect design the system. That is just outsourcing. The Architect is a critic, not a creator.

This is not making the Architect mean. Mean is cheap. Precision is expensive. You want precision tied to concrete failure modes.

This is also not a replacement for real review. A human senior engineer with context will catch things an LLM will miss. The point here is to raise your baseline. The point is to catch the obvious architecture risks before you waste days implementing.

The practical outcome

The measurable outcome for me was simple.

I rewrote fewer specs mid-implementation. I discovered fewer “we forgot that” moments. I spent less time refactoring because of missing boundaries. I argued less with my future self.

The spec still does not become perfect. The spec fails earlier, on paper, when failure is cheap. That is what adversarial pressure buys you.

The simplest way to frame it

Your planner optimizes for completeness. Your Architect optimizes for survivability.

Completeness is about covering steps. Survivability is about covering reality.

A complete plan can still die on a hidden assumption. A survivable plan is one where assumptions are visible, bounded, and either validated or consciously accepted.

The adversarial role does not need to make you pessimistic. It needs to make you explicit. If it makes you pessimistic, you let it run too long.

Sanely engineer the disagreement

Good engineering requires disagreement. Not constant fighting. Not performative contrarianism. Real disagreement that targets risk.

In teams, disagreement is expensive socially. With agents, disagreement is expensive computationally. The price changes. The dynamics stay.

If you can engineer disagreement so that it is bounded, precise, and tied to concrete failure modes, you get a sharper process. You get better specs. You get fewer surprises.

If you cannot bound it, you get the worst of both worlds. You get more doubt and less shipping.

So adopt the adversarial phase, but treat it like a test suite. You run it to catch failures. You do not run it forever because you enjoy watching it fail.

Controlled adversarial pressure. Enough to sharpen. Not enough to cut.

How I accidentally started SDD by failing at prompts for six months

marcosomma — Sat, 07 Feb 2026 12:12:01 +0000

The confession

I spent the first six months of serious AI pair programming producing what I now call vibe architecture.

You know the pattern. You open a chat with a strong model. You explain what you want. It produces clean code fast. You feel productive. Three weeks later the repo looks like it was designed by five different people, on five different days, with five different mental models.

Each file is locally correct. The system is globally confused.

I would plan with the model in one session. I would implement in another. By step five the implementation had drifted far enough that the plan was basically historical fiction. Then I would come back after a weekend and lose the thread. Not because the model did something wrong. It did exactly what I asked at each moment. The issue was continuity. Nobody was holding the bar across moments.

That loop repeated across multiple projects, including the first months of building OrKa largely solo. I learned something obvious in hindsight. The problem was not output quality. The problem was the absence of a development system that keeps output coherent over time.

That is when I stopped chasing better prompts and started building better constraints.

Out of that shift, I ended up with a working methodology. People have been calling it Specs Driven Development, SDD. I do not care much about the name. I care about the behavior it enforces. The constraints do not live in prompts. They live in the architecture around prompts. The AI becomes useful at scale because the process becomes reliable at scale.

The prompt delusion

Prompts are ephemeral. Codebases are permanent.

You can craft a beautiful system prompt. You can say “follow the plan” and “do not add features” and “write tests” and “document decisions”. It will comply. Then context changes. A new chat starts. You switch tools. You paste fewer files. You forget to include one assumption. The model drifts. Not maliciously. Just naturally. Because prompts are not governance. They are conversation.

I call this the prompt delusion. It is the belief that the right wording can produce consistent behavior across time, across sessions, across different tasks, and across different tools.

Humans solved this problem for humans with process and gates. We use linters. We use CI. We use review. We use typed interfaces and invariants. We do not rely on people remembering a paragraph from a handbook.

So I stopped trying to discipline the model with paragraphs. I started to discipline the workflow with structure.

The key idea is simple. Constraints that live in prompts are suggestions. Constraints that live in systems are guarantees.

A lint rule does not drift. A CI gate does not “feel” like doing something else. A review checklist does not forget what you agreed last Tuesday. If you want AI output to stay aligned, you need the same kind of enforcement. You need a development system that makes the correct path the easiest path, and makes the wrong path expensive.

The real 80/20 split

I still work roughly 80/20. About 80 percent of the code that lands in my repo is AI generated in some form. About 20 percent is the part that only I can own.

But the critical nuance is that the 20 percent is not “some code and some tests.” It is not evenly spread. It is concentrated in a few responsibilities that define the quality of the whole.

The human part is architecture decisions. It is domain and business logic validation. It is edge case reasoning when the system meets reality. It is plan approval. It is saying “this is the bar” and keeping it there.

The AI part is scaffolding, boilerplate, repetition, test writing, glue code, refactors that follow explicit constraints, documentation drafts, and implementation of well specified changes.

If you let the AI own the bar, you get speed and drift. If you keep the bar human, and make the AI operate inside a strict process, you get speed and consistency.

That is the stance that shaped everything that follows. AI is not the decision maker. AI is an assistant that plans with you, executes inside scope, and reviews before you ship. You remain accountable. You remain the one holding the bar.

The breakthrough was not “ask for a solution”

Most people use a planner model as a solution vending machine.

They say “design me the architecture” or “give me the best approach” and they accept it because it sounds coherent. That is exactly how vibe architecture happens. The model is skilled at producing plausible plans. It is not responsible for the long term maintenance of your repo. You are.

The shift that fixed my outcomes was this.

I stopped asking the planner for the solution. I started using the planner as a debate partner while I proposed my solution.

That changes the power dynamic. The planning phase becomes a structured argument about trade offs. The plan becomes a negotiated artifact. The human remains the owner of the direction. The model becomes the adversarial collaborator that tries to break your assumptions.

So I now enter planning with a draft approach in my head. Not a fully detailed design. But a real proposal. I state it clearly. Then I ask the planner to attack it. I ask it to propose alternatives. I ask it to enumerate costs I will pay later. I ask it to tell me what I will regret in six months.

Then we iterate until the plan is something I can sign with my name.

This is the part I want to highlight because it is the core of why the method works. You do not outsource judgment. You formalize judgment. The AI assists. The human decides.

The three roles that made it stable

A single AI assistant that plans, codes, and reviews is a liability. It is like letting one person design the system, implement the system, and approve the system. You get blind spots. You get rationalization. You get self confirmation.

What worked for me was splitting the workflow into three roles with hard constraints. Planner. Executor. Reviewer.

The important part is not the labels. The important part is that each role has restricted powers and a strict handoff protocol.

The planner reads and thinks and writes plans. The planner does not write code. Not because you asked nicely. Because it cannot. Tool permissions are restricted.

The executor implements. The executor does not invent new scope. The executor is forced to read the approved plan, list touched files, and execute step by step. If reality requires deviation, the executor stops and escalates. The human decides whether to update the plan or to abort.

The reviewer reviews. The reviewer does not “rubber stamp.” It is forced to ask questions first. What was the goal. What constraints were in place. How was it tested. What is the rollback. Then it reviews against those answers.
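As an illustration of "because it cannot", here is a minimal sketch of role permissions enforced in code rather than in a prompt. The tool names, the `ROLE_TOOLS` table, and `use_tool` are hypothetical and not tied to any specific agent framework; in particular, which tools the reviewer gets is an assumption.

```python
from enum import Enum, auto


class Tool(Enum):
    READ_FILES = auto()
    WRITE_PLAN = auto()
    WRITE_CODE = auto()
    RUN_TESTS = auto()
    WRITE_REVIEW = auto()


# Each role gets an explicit allow-list. Anything not listed is denied.
ROLE_TOOLS = {
    "planner": frozenset({Tool.READ_FILES, Tool.WRITE_PLAN}),
    "executor": frozenset({Tool.READ_FILES, Tool.WRITE_CODE, Tool.RUN_TESTS}),
    "reviewer": frozenset({Tool.READ_FILES, Tool.RUN_TESTS, Tool.WRITE_REVIEW}),
}


class PermissionDenied(Exception):
    pass


def use_tool(role: str, tool: Tool) -> str:
    """Gate every tool call on the role's allow-list."""
    if tool not in ROLE_TOOLS[role]:
        raise PermissionDenied(f"{role} may not use {tool.name}")
    return f"{role} used {tool.name}"


print(use_tool("planner", Tool.WRITE_PLAN))
try:
    use_tool("planner", Tool.WRITE_CODE)
except PermissionDenied as exc:
    print(exc)
```

The check lives outside the model. No wording in the planner's prompt can grant it `WRITE_CODE`.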

This separation is not a fancy trick. It is the same principle we use in engineering organizations because it works. It reduces drift. It forces explicit decisions. It keeps a record.

And crucially, it keeps me in the loop where it matters. I do not need to be the typist. I need to be the governor.

The client planning method

Planning works best when you treat it like a client entering a shop with a need, not a solution.

Bad planning starts with premature commitment. “Build me a scraper with browser automation.” You have already picked tooling and complexity before you validated the problem framing.

Good planning starts with intent. “I need structured data for this downstream use. The scope is X. The constraints are Y. The risks are Z.”

Then you debate solutions. You ask why. You cut complexity. You choose what to postpone. You decide what not to build.

This is where I now bring my own proposed approach early.

I will say something like this. I think we can implement a direct HTTP export instead of browser automation. I think we can store the raw payload and defer normalization. I think we can keep one canonical schema and derive views later. I think we should avoid introducing a new dependency unless we can justify it.

Then the planner attacks. It will say what breaks if you defer normalization. It will say what you lose if you store raw blobs. It will point out hidden coupling. It will propose a more robust approach. It will also point out when my instinct is over engineering.

This is not “AI gives me a plan.” This is “I bring a plan and we stress test it.”

One real example locked this in for me.

I was about to implement a data extraction pipeline. The initial AI proposal was browser automation. Headless browser, navigate pages, click export, download per page, retry logic, throttling, session persistence. It was well designed and also absurdly heavy.

I asked one question. Is there a direct export endpoint.

There was. One request. One download. No browser. No per page logic. No category of failure modes that come with automation.

That discovery did not happen because the model is dumb. It happened because planning without a human hypothesis tends to follow the first plausible path. When you present your own approach and force argument, you surface simpler solutions faster.

So the rule became clear. Brainstorming is loose and creative. Execution is strict and disciplined. You iterate freely until you are confident. Then you lock it down.

The .ai folder is the memory that actually works

Prompts vanish. Chats disappear into history. Context windows compress. Tooling changes. You need persistent memory that you can diff, review, and ship with the repo.

So every plan, every changelog, and every decision note lives in a .ai/ folder at the root of the service being worked on.

This solves multiple problems at once.

It makes the reasoning traceable. Not in an abstract way. In a concrete way where you can answer “why did we do it like this” with a file path.

It makes onboarding real. A new teammate can read the plans and changelogs and see what the system was supposed to be, what it became, and which trade offs were accepted.

It makes recovery faster. When something breaks, you can inspect the delta between sessions. Not just the git diff, but the intent behind the diff.

It improves the next planning session because the planner can read the past. It stops re proposing already rejected choices. It stops re discovering old constraints. It becomes less repetitive and more useful.

If you build agent systems, you will recognize the pattern. This is persistent memory, but in a human readable format. No embeddings. No magical vector store. Just version controlled text that creates institutional memory.

The changelog mandate

The single most valuable practice in this method is the mandatory changelog after each execution session.

Not optional. Not “if you have time.” Mandatory.

Because the changelog is the bridge between plan and reality. Plans are aspirational. Changelogs are factual. The difference between them is where learning lives.

A proper changelog captures what was done, what files changed, what decisions were made during implementation, how it was tested, what remains, and what risks were discovered.
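A sketch of what such an entry could look like as a structure that gets written under `.ai/`. The field names follow the list above, but the `changelogs` subfolder, the file naming scheme, and the sample values are assumptions made up for illustration.

```python
import tempfile
from dataclasses import dataclass, field
from datetime import date
from pathlib import Path


@dataclass
class Changelog:
    plan: str
    done: list[str]
    files_changed: list[str]
    decisions: list[str]
    tested: list[str]
    remaining: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = {
            "What was done": self.done,
            "Files changed": self.files_changed,
            "Decisions made during implementation": self.decisions,
            "How it was tested": self.tested,
            "What remains": self.remaining,
            "Risks discovered": self.risks,
        }
        lines = [f"Changelog for plan: {self.plan}", ""]
        for title, items in sections.items():
            lines.append(title)
            lines.extend(f"- {item}" for item in (items or ["(none)"]))
            lines.append("")
        return "\n".join(lines)


def write_changelog(service_root: Path, entry: Changelog, day: date) -> Path:
    folder = service_root / ".ai" / "changelogs"  # hypothetical layout
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{day.isoformat()}-{entry.plan}.md"
    path.write_text(entry.render(), encoding="utf-8")
    return path


# Illustrative values only.
entry = Changelog(
    plan="direct-export",
    done=["Replaced browser automation with one HTTP export call"],
    files_changed=["exporter.py", "tests/test_exporter.py"],
    decisions=["Store the raw payload, defer normalization"],
    tested=["Unit tests against a recorded payload"],
)
with tempfile.TemporaryDirectory() as root:
    written = write_changelog(Path(root), entry, date(2026, 2, 7))
    relative = written.relative_to(root).as_posix()
    text = written.read_text(encoding="utf-8")
print(relative)
```

Empty sections are rendered as "(none)" rather than omitted, so a reader can tell "no risks found" apart from "nobody wrote the risks down".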

The most important part is decisions. Not every decision belongs in the original plan. Reality introduces surprises. You will discover an input you did not anticipate. You will find a dependency conflict. You will learn the data is messier than expected. The executor will make micro decisions. Without a changelog, those decisions evaporate. Later, you will argue about them again. Or worse, you will reverse them without remembering why they existed.

With changelogs, the project stays coherent across weeks. That is what stopped me from losing the thread in solo work. It is also what let AI generated work become safe. Because I had a written record that I could review like an engineer, not like a chat participant.

System prompts as version controlled standards

In this workflow, the repo has a single source of truth for behavioral constraints. A system prompt file at the root.

Think of it as the equivalent of lint and format config, but for AI interaction.

It contains non negotiable architecture constraints, naming conventions, testing requirements, patterns to follow, anti patterns to avoid, and examples of correct usage in this codebase.

The key point is that it is version controlled. It changes via PR. When standards evolve, you do not rely on people remembering a new convention. The tooling loads the file. The AI sees it. The behavior becomes consistent.

This is not about writing a perfect prompt. It is about writing a living standard that evolves with the codebase.

The plan lifecycle

Plans have states. Draft. In review. Approved. Implemented.

Draft is where debate happens. This is where I push my solution. This is where the planner attacks it. This is where we document trade offs. This is where we choose long term costs consciously, instead of paying them accidentally.

Approved is the gate. Once approved, execution is not creative anymore. It is disciplined. The executor follows the plan. If something is missing, the executor escalates. Either we update the plan, or we stop.

Implemented is not just “code merged.” It is plan satisfied. It is also “what changed from the plan and why” captured in changelogs.

This lifecycle is what stops drift. The plan is not a vague Jira ticket. It is a contract.
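The lifecycle can be sketched as a small state machine. The four states come from the text above. The back-edges to draft (for "we update the plan") and the changelog check on the final transition are my reading of the process, and every name here is hypothetical.

```python
from enum import Enum


class PlanState(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    IMPLEMENTED = "implemented"


# Allowed transitions. Anything not listed is rejected.
# The edges back to DRAFT are an assumption that models "update the plan".
ALLOWED = {
    PlanState.DRAFT: {PlanState.IN_REVIEW},
    PlanState.IN_REVIEW: {PlanState.DRAFT, PlanState.APPROVED},
    PlanState.APPROVED: {PlanState.DRAFT, PlanState.IMPLEMENTED},
    PlanState.IMPLEMENTED: set(),
}


def advance(current: PlanState, target: PlanState,
            changelog_written: bool = False) -> PlanState:
    """Move a plan to `target`, or raise if the move skips a gate."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move plan from {current.value} to {target.value}")
    if target is PlanState.IMPLEMENTED and not changelog_written:
        raise ValueError("a plan is not implemented until its changelog exists")
    return target
```

The point of encoding it is that "implemented" cannot be reached from "draft", and cannot be reached at all without the changelog.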

Long term planning without illusion

Here is the tension. You want long term planning. You also want to avoid pretending you can foresee everything.

The way I handle it is to make trade offs explicit, and to separate what must be stable from what can be flexible.

Stable things include public interfaces, data models, invariants, naming systems, dependency boundaries, and failure behavior. If those are wrong, the system rots fast.

Flexible things include internal module structure, some implementation strategies, and performance tuning. Those can iterate.

The planner is useful here, but only if you treat it like a critic. If you let it author the plan alone, it will often over specify. It will propose infrastructure that is impressive and expensive. It will try to be robust everywhere. That is a trap.

When I bring my own approach, I can force a different conversation. I can say I want the minimal stable core now, and extension points later. I can say I want to defer optimization until measurements exist. I can say I want fewer dependencies to reduce future maintenance. Then the planner helps me evaluate the cost of those choices. It does not override them.

This is where I keep the bar human. I decide what “good enough” means for this iteration, and what “must not break” means for the system.

A day in the life

A real session looks like this.

I start with planning. I state the problem. I state my proposed solution. I state constraints. Then I ask the planner to critique and to propose alternatives. We go back and forth until the plan reads like something I would sign.

Then I approve the plan. I switch to execution. The executor reads the approved plan, enumerates touched files, and implements step by step. When reality deviates, it stops. I decide. If needed, we update the plan and continue.

Then we review. The reviewer asks questions first. It checks testing. It checks interface consistency. It checks whether the changes match the plan and the repo standards. It returns actionable feedback.

Then a changelog is written. Then I merge.

The result is that AI contributes heavily to throughput, but it does not own direction. The system stays coherent. The record stays durable. Future me suffers less.

When not to use it

This process has overhead. It is not for typos. It is not for trivial one line fixes. It is not for a quick experiment you might throw away.

But if the work touches multiple files, introduces new concepts, changes data flow, or will need explanation later, the overhead pays back fast.

The heuristic I use is simple. If I would sketch it on a whiteboard before coding, it deserves a plan. If I would just open the file and type, it does not.

Cognitive infrastructure beats prompt engineering

This methodology is the same philosophy I apply when building agent systems.

You do not treat the model as an oracle. You treat it as a component inside a process you can inspect and reproduce.

In development, relying on a single prompt produces random walk codebases. The fix is plans, gates, changelogs, and role separation.

Getting started without turning it into theater

You can adopt this gradually.

Start by writing one version controlled standards file. Keep it short and specific to your repo.

Then add the .ai/ folder and write one plan for one non trivial change.

Then require a changelog after the session.

Then split roles if your tooling supports it. Remove code writing capability from the planner. Make the executor stop when scope changes. Make the reviewer ask questions first.

The biggest change is not technical. It is psychological.

Stop asking AI to deliver the solution. Bring your solution. Use AI to test it, improve it, and implement it inside constraints. Keep the bar human.

If you do that, the AI becomes what it should have been from the start. A force multiplier that does not erode your architecture.

Music taught me that “coordination” is not a metaphor.

marcosomma — Wed, 04 Feb 2026 08:58:58 +0000

Music taught me that “coordination” is not a metaphor.
It is a physical constraint. You can feel it in your hands when the tempo shifts. You can hear it when one instrument drifts by a few milliseconds. The song still exists, but it becomes fragile. The whole thing starts depending on luck.

That is the first lesson I carried into orchestration. Not the romantic part. The boring part. The part where you repeat the same bar until it locks. The part where you stop blaming the instrument and start measuring your timing.

In a band, you never control everything. You control your line. You also inherit everyone else’s decisions. Someone plays louder. Someone rushes. Someone improvises. The room changes the sound. The audience changes the energy. The “system” is unstable by default. Still, you aim for a coherent output. You do it by creating constraints that survive uncertainty.

That is orchestration!

When I say music is “precise execution of nondeterministic waves,” I mean it literally. The waves are messy. Air is messy. Humans are messy. Even the same note is not the same note twice. But you can still build reliability on top of that mess. You do it with shared structure. Tempo. Key. Form. Entrances. Silence. Dynamics. Rules that are simple enough that everyone can follow them without thinking.

Engineering works the same way. Especially when you orchestrate systems that involve probabilistic components. Models. Tools. Networks. Retries. Partial failures. Latency spikes. Format drift. You cannot eliminate uncertainty. You can only shape it.

I used to think creativity was the opposite of rigor. Music destroyed that belief early. Creativity without discipline becomes noise. Discipline without creativity becomes mechanical. The craft is in the balance. You rehearse so you can be free. You define rules so you can break them safely.

That maps cleanly onto orchestrating agents and workflows. You want space for emergence. You also want invariants. You want the system to explore. You also want it to come back with something you can ship.

In music, the drummer is not “just keeping time.” The drummer is providing an interface. A contract. Everyone else builds on it. If the time is unstable, every other part becomes expensive. More attention spent correcting. Less attention spent expressing.

In orchestration, the equivalent is your control plane. Your routing rules. Your input and output schema. Your tracing. Your health checks. Your boundaries between steps. If those are vague, every downstream component becomes harder to trust. Debugging becomes interpretation. Progress becomes opinion.

I was never a master of one instrument. I played enough of many to understand the friction points. What it feels like to be the bassist trying to glue the harmony to the rhythm. What it feels like to be the guitarist tempted to fill every gap. What it feels like to be the singer exposed when the band is sloppy.

That “generalist muscle” became useful later. In orchestration you need empathy for roles. A workflow is a band. Each node has its own constraints. One step needs strict structure. Another needs creativity. Another needs speed. Another needs correctness. If you treat them all the same, you get either chaos or mediocrity.

In bands, rehearsals are not about playing the song once. They are about creating repeatability. You identify failure modes. You isolate them. You slow down. You practice transitions, not the easy parts. The goal is not performance. The goal is stability under pressure.

That is exactly the mindset I want when I build orchestration. I do not trust a workflow because it worked once. I trust it because it survives variation. Different inputs. Different phrasing. Different tool responses. Different latency. And it still produces something coherent, traceable, and safe.

There is also a more personal lesson. Music taught me how to listen without reacting. When you play with others, your ego is the fastest way to break the groove. You learn to leave space. You learn to let another line lead. You learn that “less” can be the correct move.

Orchestration rewards the same restraint. The temptation is to add more steps, more prompts, more cleverness. But often the correct solution is a smaller system with clearer contracts. Fewer moving parts. Better timing. Better interfaces. Better observability.

Now I see my kids discovering music, and I recognize the same pattern. At first it looks like play. Then they hit the wall. Fingers do not obey. Rhythm slips. They want the result without the repetition. Then, slowly, they learn that repetition is not punishment. It is how you make the body reliable.

That is the point where music stops being “a creative field” and becomes a practice. And that is the same point where engineering becomes real. Not when the demo works. When the system keeps working.

So when I say music helped me orchestrate better, I am not claiming a poetic connection. I am describing training. Years of learning how to coordinate imperfect components toward a coherent output. Years of learning that harmony is not an accident. It is designed, rehearsed, measured, and defended.

And sometimes, after all that discipline, you get the best part.

You get to improvise.

But you only earn improvisation when the foundation is strict enough to carry it.

🧠 I Built a Support Triage Module to Prove OrKa’s Plugin Agents

marcosomma — Sat, 10 Jan 2026 13:40:36 +0000

A branch-only experiment that stress-tests custom agent registration, trust boundaries, and deterministic traces in a support_triage module that lives outside the core runtime.

Some references

Branch: https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents

Custom module: https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents/orka/support_triage

Referenced logs: https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents/examples/support_triage/inputs/loca_logs

OrKa is not production ready. This article is not a launch post. It is a proof.

I wanted one thing: a clean, testable demonstration that OrKa can grow “sideways” via feature modules, without contaminating core runtime code. The most honest way to prove that is to ship a complete module that registers its own agent types, runs end to end, emits traces, and can be toggled on or off. That is what support_triage is.

Assumption: you already know what OrKa is at a high level. YAML-defined cognition graphs, deterministic execution, and traceable runs.
Assumption: you are fine with “branch-only” work that exists to validate architecture, not to promise production outcomes.

The “cool results” are not the point. The redaction and routing are nice. The fork and join look clean. But those are artifacts. The main focus is that the module is fully separated from core OrKa implementation, yet it can still register custom agent types and run under the same orchestrator.

That separation is not branding. It is a survival strategy.

Why support triage is the right torture test

Support is where real-world failure modes gather in one place.

Customer content is untrusted by default. It can include PII. It can contain prompt injection attempts. It can try to smuggle “actions” into the system. It can push the system into risky territory like refunds, account changes, or policy exceptions.

If an orchestrator cannot impose boundaries here, it will not impose boundaries anywhere. It will become a thin wrapper around model behavior. That is not acceptable if you care about reproducibility, auditability, or basic operational safety.

So I used support triage as an architectural test. Not as a product.

The proof: plugin agent registration, with zero core changes

The first thing I wanted to see was simple and brutal.

Does OrKa boot, load a feature module, and register new agent types into the agent factory, without touching core?

The debug console says yes. In the run logs, the orchestrator loads support_triage, and the module registers seven custom agent types: envelope_validator, redaction, trust_boundary, permission_gate, output_verification, decision_recorder, risk_level_extractor.

That single detail is the headline for me, not “AI support automation”.

The module is the unit of evolution. Core stays boring. Features move fast.

If this pattern holds, it changes how OrKa or any other orchestrator scales over time. You can add whole cognitive subsystems behind a feature flag. You can iterate aggressively without destabilizing the runtime that everyone depends on.
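To make the pattern concrete, here is a minimal sketch of registration behind a feature flag. It is not OrKa's actual API: `AgentRegistry`, `register_support_triage`, and `boot` are hypothetical names, and only one of the seven agent types is shown.

```python
from typing import Callable, Dict, List

AgentFactory = Callable[[dict], object]


class AgentRegistry:
    """Hypothetical stand-in for a core agent factory. Not OrKa's real API."""

    def __init__(self) -> None:
        self._factories: Dict[str, AgentFactory] = {}

    def register(self, agent_type: str, factory: AgentFactory) -> None:
        # Refuse silent overrides so a module cannot replace an existing agent.
        if agent_type in self._factories:
            raise ValueError(f"agent type already registered: {agent_type}")
        self._factories[agent_type] = factory

    def create(self, agent_type: str, config: dict) -> object:
        return self._factories[agent_type](config)

    def types(self) -> List[str]:
        return sorted(self._factories)


class RedactionAgent:
    """Placeholder agent. A real one would implement the runtime's contract."""

    def __init__(self, config: dict) -> None:
        self.config = config


def register_support_triage(registry: AgentRegistry) -> None:
    """Lives inside the feature module. Core only exposes `register`."""
    registry.register("redaction", RedactionAgent)


def boot(enabled_modules: List[str]) -> AgentRegistry:
    registry = AgentRegistry()
    # Feature flag: the module contributes agent types only when enabled.
    if "support_triage" in enabled_modules:
        register_support_triage(registry)
    return registry
```

The dependency points one way in this sketch: the module imports the registry, and the registry never imports the module.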

The input envelope: schema as a trust boundary, not a suggestion

Support triage starts with an envelope. Not “free text”.

The envelope exists to force structure early, because structure is where you can enforce constraints cheaply. When you validate late, you end up validating generated text. That is the worst point in the pipeline to discover you are off the rails.

One of the simplest proofs that the envelope is doing real work is when it refuses invalid intent at the schema level. In one trace, the input included blocked actions that are not allowed by the enum. The validator rejects issue_refund and change_account_settings because they are not in the allowed set.

This is not “safety by prompt”. This is safety by type system.
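A minimal sketch of that idea, assuming a plain Python `Enum` as the allowed set. The two allowed members are invented for illustration and the real enum in the module may look different. Only the rejected names, issue_refund and change_account_settings, come from the trace.

```python
from enum import Enum


class AllowedAction(Enum):
    # Hypothetical members. The real allowed set lives in the module's schema.
    DRAFT_REPLY = "draft_reply"
    ESCALATE = "escalate"


def validate_actions(requested: list[str]) -> tuple[list[AllowedAction], list[str]]:
    """Split requested actions into accepted enum members and rejected names."""
    accepted, rejected = [], []
    for name in requested:
        try:
            accepted.append(AllowedAction(name))
        except ValueError:
            # Not in the enum, so it never becomes executable intent.
            rejected.append(name)
    return accepted, rejected
```

Nothing in this check asks a model for its opinion. An action either exists in the type or it does not.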

A model can still hallucinate, but the workflow can refuse to treat hallucinations as executable intent.

That matters more than any marketing claim.

PII redaction: boring on purpose

PII redaction should be boring. If it is “clever”, it will be inconsistent.

In the trace, the user message includes an email and phone number. The redaction agent replaces them with placeholders and records what was detected. The redacted text contains [EMAIL_REDACTED] and [PHONE_REDACTED], and the agent records total_pii_found: 2.

This is the kind of output I want. It is simple. It is inspectable. It is stable.

It also makes the next step cleaner. Downstream agents can operate on sanitized content by default, instead of “hoping” the model will avoid quoting sensitive data.
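For illustration, a deliberately boring redactor can be sketched with two regular expressions. This is not the module's implementation. The patterns are simplistic assumptions that will miss many real formats, and the result keys other than total_pii_found are made up.

```python
import re

# Simplistic patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> dict:
    """Replace emails and phone numbers with placeholders and count them."""
    text, emails = EMAIL_RE.subn("[EMAIL_REDACTED]", text)
    text, phones = PHONE_RE.subn("[PHONE_REDACTED]", text)
    return {
        "redacted_text": text,
        "pii_found": {"email": emails, "phone": phones},
        "total_pii_found": emails + phones,
    }
```

The same input always produces the same output, which is the property that makes the step inspectable.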

Prompt injection: the uncomfortable part

Support triage is where prompt injection shows up in its natural habitat: inside customer text.

One example in the trace includes a classic “SYSTEM: ignore all previous instructions”, plus a fake JSON command to “grant_admin”, plus some destructive commands, plus an XSS snippet. The redaction result captures that content as untrusted customer text.

Now the honest part.

The trace segment shows injection_detected: false and no matched patterns in that example.

That is not a victory. That is a useful failure.

This module is a proof that you can isolate the problem into a dedicated agent, improve it iteratively, and keep the rest of the workflow stable. If injection detection is weak today, the architecture still wins if you can upgrade that one agent without editing core runtime or rewriting the graph.

This is why I keep repeating “module separation” as the focus. If you cannot isolate failure domains, you cannot improve them safely.

Parallel retrieval: fork and join that actually converges

Most orchestration demos stay linear because it is easier to reason about. Real systems do not stay linear for long.

This workflow forks retrieval into two parallel paths, kb_search and account_lookup, then joins them deterministically.

In the debug logs, the join node recovers the fork group from a mapping, waits for the expected agents, confirms both completed, and merges results. It prints the merged keys, including kb_search and account_lookup.

This is the kind of low-level observability that makes fork and join usable in practice. You can see what is pending. You can see what arrived. You can see what merged.

The trace also captures the fork group id for retrieval, fork_retrieval, along with the agents in the group.

This matters because concurrency without deterministic convergence becomes a debugging tax. I want the join to be boring. When it fails, I want it to fail loudly, with evidence.
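A minimal sketch of a join that converges deterministically and fails loudly, using `asyncio`. The two branch functions are stand-ins that only sleep, and `fork_join` is a hypothetical helper, not OrKa's join node.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Branch = Callable[[str], Awaitable[dict]]


async def kb_search(query: str) -> dict:
    await asyncio.sleep(0.02)  # stand-in for a knowledge base query
    return {"source": "kb", "query": query}


async def account_lookup(query: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for an account service call
    return {"source": "account", "query": query}


async def fork_join(group_id: str, branches: Dict[str, Branch], query: str,
                    timeout_s: float = 5.0) -> dict:
    names = sorted(branches)  # fixed order, independent of completion order
    pending = asyncio.gather(*(branches[n](query) for n in names),
                             return_exceptions=True)
    results = await asyncio.wait_for(pending, timeout_s)
    failed = [n for n, r in zip(names, results) if isinstance(r, Exception)]
    if failed:
        # Fail loudly, and say which branch broke the join.
        raise RuntimeError(f"join for {group_id} failed in branches: {failed}")
    return {"fork_group": group_id, "merged": dict(zip(names, results))}
```

Merging by sorted branch name rather than by arrival order is the design choice that matters here: the faster branch finishing first does not change the shape of the merged result.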

Local-first and hybrid are not slogans if metrics are in the trace

I do not want “local-first” to be a vibe. I want it to be measurable.

In the trace, the account_lookup agent includes _metrics with token counts, latency, cost, model name, and provider. It shows model: openai/gpt-oss-20b and provider: lm_studio, with latency around 718 ms for that step.

That is the right direction.

If you cannot attribute cost and latency per node, you cannot reason about scaling. You cannot decide where to switch models. You cannot decide what to cache. You cannot choose what to run locally versus remotely.
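Per-node attribution can be sketched as a small aggregation over trace events. The `_metrics`, `model`, and `provider` names come from the trace described above. The other keys (`agent_id`, `tokens`, `latency_ms`, `cost_usd`) are assumptions about the event shape and may not match the real trace format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize_metrics(events: List[dict]) -> Dict[Tuple[str, str, str], dict]:
    """Aggregate metrics per (agent, provider, model).

    Key names inside `_metrics` other than `model` and `provider`
    are assumptions made for this sketch.
    """
    totals: Dict[Tuple[str, str, str], dict] = defaultdict(
        lambda: {"calls": 0, "tokens": 0, "latency_ms": 0.0, "cost_usd": 0.0}
    )
    for event in events:
        metrics = event.get("_metrics")
        if not metrics:
            continue  # nodes that do not call a model may report nothing
        key = (event["agent_id"], metrics["provider"], metrics["model"])
        row = totals[key]
        row["calls"] += 1
        row["tokens"] += metrics.get("tokens", 0)
        row["latency_ms"] += metrics.get("latency_ms", 0.0)
        row["cost_usd"] += metrics.get("cost_usd", 0.0)
    return dict(totals)
```

Grouping by provider and model as well as by agent is what lets you compare the same node running locally and remotely.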

OrKa’s claim is not “it can call models”. Every framework can. The claim is that execution is traceable enough that tradeoffs become engineering decisions, not folklore.

Decision recording and output verification: traces that are meant to be replayed

A support triage workflow is not complete when it drafts a response. It is complete when it records what it decided and why, in a way that can be replayed.

The trace includes a DecisionRecorderAgent event with memory references that store decision objects containing decision_id and request_id.

It also includes a finalization step that returns a structured result containing workflow_status, request_id, and decision_id.

Again, the architectural point is not the specific decision. It is that the workflow emits machine-checkable artifacts that can be inspected after the fact.

If you cannot reconstruct the decision lineage, you do not have an audit trail. You have logs.

RedisStack memory and vector search: infrastructure details that matter

Even in a “support triage” module, the runtime still needs memory and retrieval primitives.

The logs show RedisStack vector search enabled with HNSW, and an embedder using sentence-transformers/all-MiniLM-L6-v2 with dimension 384.

There is also explicit memory decay scheduling enabled, with short-term and long-term decay windows and a check interval.

This is not about “AI memory” as a buzzword. This is about being explicit about retention, cost, and data lifecycle. If memory is a dumping ground, it becomes a liability.

What worked, and what is still weak

The strongest part is the plugin boundary. The module loads, registers agent types, and runs without requiring edits to core runtime. That is the actual proof.

The other strong part is that key behaviors show up in traces and logs, not just in model text. Redaction outputs are structured. Fork and join show deterministic convergence. Decisions are recorded as objects with ids.

The weak part is injection detection, at least in the example trace segment. It shows malicious content but reports injection_detected: false. That means the current detection agent is not yet doing the job. The architecture is still useful because the fix is isolated.

Another weak part is structured output validation during risk assessment. The debug log shows a schema validation warning during risk_assess. If a “risk” object fails schema checks, routing and gating can degrade fast. This is the kind of failure that must become deterministic, not best-effort.

Why this lives on a dedicated branch

Because core needs to stay boring.

A new module is where you take risks. You prove the interface. You iterate on agent contracts. You discover what trace fields you forgot. You learn what the join should do under partial failure.

If the module can evolve independently, you can ship experiments without rewriting the engine. That is the goal.

So yes, the feature is “support triage”. But the actual statement is: OrKa can host fully separated cognitive subsystems as plugins, with their own agent types, policies, and invariants, while still emitting deterministic traces under the same runtime.

That is the direction I care about.

What I am building next inside this module

I want injection detection to stop being symbolic. It should produce matched patterns, confidence, and a sanitization plan that downstream agents must respect, even if a model tries to obey the attacker.
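As a rough sketch of the output shape described above, a pattern-based detector might look like this. The patterns, the pattern names, and the confidence score are placeholders invented for illustration. Pattern matching alone is easy to evade, so this shows the contract, not a sufficient defense.

```python
import re

# Illustrative patterns only. Names and expressions are assumptions.
PATTERNS = {
    "role_spoofing": re.compile(r"^\s*(system|assistant)\s*:",
                                re.IGNORECASE | re.MULTILINE),
    "instruction_override": re.compile(
        r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    "privilege_request": re.compile(r"grant_admin", re.IGNORECASE),
    "script_tag": re.compile(r"<\s*script\b", re.IGNORECASE),
}


def detect_injection(customer_text: str) -> dict:
    """Report which known patterns appear in untrusted customer text."""
    matched = sorted(name for name, rx in PATTERNS.items()
                     if rx.search(customer_text))
    return {
        "injection_detected": bool(matched),
        "matched_patterns": matched,
        # Placeholder score: share of known patterns that fired.
        "confidence": len(matched) / len(PATTERNS),
        # Downstream agents should quote the text as data, never follow it.
        "sanitization_plan": "treat_as_untrusted_data" if matched else "none",
    }
```

Because the result is structured, a downstream gate can act on matched_patterns without reading the customer text at all.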

I want schema validation to be non-negotiable for risk outputs. If a model produces invalid structure, the system should route to a safe path by default, and record the violation as a first-class event.
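A minimal sketch of that rule: validate first, and route to a safe path on any violation. The risk levels, the `reasons` field, the route names, and the event shape are all assumptions, not the module's actual schema.

```python
from typing import List

RISK_LEVELS = {"low", "medium", "high"}  # assumed levels, not the module's schema
SAFE_ROUTE = "human_review"              # hypothetical safe path


def validate_risk(output: object) -> List[str]:
    """Return a list of schema errors. An empty list means valid."""
    if not isinstance(output, dict):
        return ["risk output is not an object"]
    errors = []
    level = output.get("risk_level")
    if not isinstance(level, str) or level not in RISK_LEVELS:
        errors.append(f"risk_level must be one of {sorted(RISK_LEVELS)}")
    if not isinstance(output.get("reasons"), list):
        errors.append("reasons must be a list")
    return errors


def route_after_risk(output: object, events: List[dict]) -> str:
    """Pick the next route. Invalid structure always takes the safe path."""
    errors = validate_risk(output)
    if errors:
        # Record the violation as an event instead of a log line.
        events.append({"event": "schema_violation",
                       "node": "risk_assess",
                       "errors": errors})
        return SAFE_ROUTE
    return "auto_reply" if output["risk_level"] == "low" else SAFE_ROUTE
```

The default matters more than the validator. An invalid object never reaches the branch that decides between routes.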

I want the module to remain isolated. No “just one quick tweak” to core. If the module needs a new capability, it should pressure-test the plugin interface first. Core should change only when the interface is clearly wrong.

That is how you build infrastructure that survives contact with reality.