Forem: Edy Silva

Opus 4.7 vs GLM 5.1: is mixing models worth it?

Edy Silva — Wed, 29 Apr 2026 18:11:02 +0000

A couple of months ago, I compared Opus vs GLM by having both of them do a task for me. It’s not that surprising that Opus was best.

But what if we get both cooperating? For optimization purposes: one planning, the other executing.

I ran this comparison before, same plugin (cm-multilingual), three runs: Opus 4.7 alone, GLM 5.1 alone via Ollama Cloud, and GLM 5.1 following the plan Opus had written. It was a small task – just providing progress feedback for the AI translation – and the "cheap follows expensive plan" split delivered ~99% of Opus’s scope for ~37% of the cost.

Too beautiful. In a task of that size, there isn’t much room for things to go wrong, and I knew that the moment I posted it.

But it is the kind of result that becomes a Twitter thread: “Opus quality, GLM/Qwen/Minimax/Kimi price – just pass the plan”. Every week, someone posts a variation – the cheap model of the moment changes, the argument does not. And that was exactly the shortcut I wanted to test more rigorously before adopting it as the default.

So I raised the bar. Same plugin, same setup, but now the task was to implement translation by chunks, improving the resilience of the entire process, with the progress bar still in place. Async coordination between PHP, REST, JS, and an external provider whose latency can disappear halfway through.

The response changed. And this is the round I think is worth recording.

The three branches:

chunks-opus – Opus 4.7 plans and executes.
chunks-glm – GLM 5.1 alone. Same prompt verbatim. GLM plans and executes.
chunks-glm-follow-opus – GLM 5.1 with Opus’s plan loaded in the first turn (“follow strictly”).

All passed the tests. All work, eventually. But to answer is it worth mixing? the way each one got there is what matters.

Methodological note: I let each agent work until the feature ran and did not iterate to polish. There were minimal interactions just to ensure all branches had a deliverable and nothing more.

What each one delivered

From a UI perspective, the three branches deliver practically the same thing: progress bar in the “Translations” meta box, text indicating the current chunk. The difference lies in the guts – orchestrator, dispatch, tests – not on the surface.

Opus (chunks-opus). Refactors translation into three well-separated pieces: CM_Chunker (cuts the post into pieces by HTML block boundary), CM_Translation_Orchestrator (receives a ?callable $reporter in the constructor and orchestrates title -> chunks -> finalization), and the CM_REST_API remains just for HTTP.

True parallel dispatch via translate_batch which accepts a on_complete callback, triggered each time a chunk returns. The bar advances as responses arrive in parallel. Wall time ≈ max(latência por chunk).

GLM alone (chunks-glm). No separate orchestrator. ~150 lines of orchestration inlined in the REST handler.

It has the CM_Chunker, but the REST does sequential foreach over translate_with_context(), which sends the chunk content. The bar advances chunk by chunk, but wall time ≈ soma(latência por chunk).

GLM with Opus plan (chunks-glm-follow-opus). Recreates the orchestrator the plan described, but sequentializes the dispatch (foreach calling provider->translate( $chunk, ... )) with set_progress() before each call.

The numbers

The three axes where the trade-off appears most. The table below has the rest of the details (tokens, autocompacts), but the three metrics that matter for the verdict are in the chart.

	Opus 4.7	GLM 5.1 alone	GLM + plan
Active time (90 s threshold)	46m 39s	2h 9m 27s	2h 41m 52s
Total tokens (in + out)	99.4M	64.5M	71.2M
Equivalent cost (reference)	$180.15	$61.63	$74.51 ($6.63 plan + $67.88 exec)
Corrective user check-ins	0	~8	~18
Autocompacts (Claude Code)	0	2	2
Final result	shipped	shipped	shipped

The dollar costs are referential; it is what that volume of tokens would cost if you paid per API. In practice, with the Claude/Ollama Cloud subscription, the user pays zero per session – the numbers exist only to compare the three runs on a common axis.

Some lines of this table deserve a zoom.

Dollar cost: the savings still exist

$67.88 + $6.63 (plan) = $74.51 for the mixed split. This is ~41% of Opus alone. It matches the ~37% I measured in the simple task. Billing savings are robust across the two rounds.

If the only thing that matters is “how much this would cost in retail”, the split delivers what it promised.

Active time: the savings disappear

But active time (the time you spend monitoring the session, discounting idle gaps >90s) inverts:

Opus: 46m 39s.
GLM alone: 2h 9m 27s. ~2.8× more.
GLM with plan: 2h 41m 52s. ~3.5× more – the worst of the three.

The agent following another model’s plan spends turns interpreting the plan, matching it with the reality of the code, and circling back to ask when reality doesn’t match. I expected the plan to shorten the run; instead, it lengthened it.

Supervision: ~18 interventions in the run with plan

This was the number that bothered me most. I counted “corrective user check-ins” as messages like “did you actually validate?” or “open the browser again to make sure it works”.

Opus: 0.
GLM alone: ~8.
GLM with plan: ~18. More than double the free-form GLM.

The intuition was that the plan would reduce supervision because the agent would have less freedom to make mistakes. On the contrary, the same interventions from free-GLM appeared in plan-GLM, but earlier and in greater numbers.

Autocompacts: two in each GLM run, zero in Opus

Autocompact is Claude Code summarizing the conversation when it approaches the context limit. Each compaction replaces the turn-by-turn history with a lossy summary – the agent’s working memory becomes degraded.

Opus: zero autocompacts. Anthropic’s caching keeps the prompt lean, so even in 461 turns it didn’t hit the ceiling.
GLM alone: 2 autocompacts.
GLM with plan: 2 autocompacts.

Since I ran each scenario only once, I cannot say if the autocompact caused the degradation or just measured it – longer runs hit the limit more easily, and the same session that needs more turns is the one most subject to compacting. But the signal goes together: zero autocompact, zero corrective check-in, short run. Two autocompacts, several interventions, long run.

Assisted validation was off the table with GLM

The chrome-devtools-mcp was central to this experiment – the agent clicks the button, waits for the translation, reads the DOM, and validates the feature without me needing to open the browser. It worked beautifully with Opus: the entire validation run was agent-driven, I just watched.

With GLM running via Ollama Cloud, it was not possible.

The MCP itself responds quickly (snapshot, screenshot, evaluate_script stay within the noise). But the calls that wait for the browser to finish doing something (wait_for, navigate_page) became impractical: wait_for 3.8× slower, navigate_page 27× slower, with a max of 199.8s in a single call. Total chrome-devtools roundtrips: 110s on Opus, 559s on GLM.

The cause is not the MCP, it is what the browser was waiting for. When the agent tells the browser to click “Translate with AI” and gives wait_for, what it is waiting for is the backend to call Ollama Cloud, which is the same latency as the GLM itself. The model’s slowness leaks into the validation loop and ruins the assisted UX.

It is worth noting that this is not a peculiarity of my setup. The community itself has reported degradation in Ollama Cloud serving in recent months, for example: r/ollama: “Ollama cloud has become unbearably slow”.

Under better serving conditions, the active time gap would shrink. However, the rest of what the post measures (supervision, validation DX) is from the model execution itself and does not change.

I did manual QA on both GLM branches. I opened the post myself, clicked the button, monitored the bar, made the request with long posts from my instance.

That is what becomes “GLM needs more supervision” in the check-in count. A good portion of the ~8 (free-form) and ~18 (with plan) interventions is exactly this: me taking over the work the agent couldn’t do agent-driven.

It is not that GLM refused to validate. It is that, with Ollama Cloud latency, letting it validate end-to-end would have multiplied the wall-clock by another factor.

Code quality (quick)

Three axes where the difference is worth a zoom:

Encapsulation.

Opus separates orchestrator, chunker and REST into their own files. The orchestrator doesn’t know WordPress: the constructor receives ?callable $reporter and the report() method only calls the callback. The REST handler decides what to do with the event (write transient, log, send elsewhere). Clean dependency inversion.

GLM-with-plan also separates the orchestrator, but stores $post_id and $target_lang_code as mutable class properties. Set at the start of translate_post(), read inside set_progress() which calls set_transient() directly. This couples the orchestrator to WordPress and the (post, lang) identity. The class cannot be reused in a context without a transient.

GLM alone has no orchestrator. ~150 lines of orchestration inlined in the REST handler, right in the middle of the file. Mixed with lock/transient/get_active_provider/copy_taxonomies in the same method.
Added tests (all files).

Gross total: Opus 85 tests / 194 assertions in 5 files, GLM-alone 76 / 171 in 4 files, GLM-with-plan 69 / 145 in 5 files. GLM-with-plan added the least coverage, across all cuts. Opus added the most.

Gross numbers don’t tell everything. Looking by dimension:
- New feature contract. Opus pinned the on_complete callback of the translate_batch in 6 dedicated tests in Test_CM_Providers (test_translate_batch_on_complete_fires_for_each_chunk_via_pre_filter, ..._for_single_text_path, ..._for_build_time_errors, etc.). This callback is what makes the bar advance live, and it is guaranteed in tests.
  
  GLM-alone has 3 translate_chunks tests (cover basic dispatch, without asserting anything about the callback because there is no callback). GLM-with-plan has 0 tests of the parallel contract because it also ended up not using parallel.
  
  In other words: each implementation tested exactly what it chose to do. The problem is that both GLM-runs chose a less demanding path, which means fewer tests pinning behavior.
- Progress endpoint. All three runs test the REST progress endpoint, but with different depths. Opus checks idle state (test_progress_returns_idle_when_no_translation_in_progress), key isolation per (post, lang) pair (test_progress_keys_are_isolated_per_post_and_language), and the edit_posts permission (test_progress_endpoint_requires_edit_posts_capability). GLM-alone tests the happy path and one not-found, but does not test authentication or isolation. GLM-with-plan falls in between the two.
- Chunker round-trip. All three have a reassembly test (test_reassembly_preserves_byte_for_byte_equality in Opus, test_reassembly_matches_original in GLM, test_preserves_content_through_chunk_and_reassemble in GLM-with-plan). Leveled in this cut. Subtle difference: Opus requires byte-for-byte, others accept “matches” – one is an exact invariant, the other leaves room for interpretation.
- Edge cases. Opus tests oversized_single_block_is_emitted_unsplit, placeholder_token_is_never_split_internally, and (in Test_CM_Meta_Box) 14 tests including meta_box_is_not_registered_when_block_editor_active – the case that historically broke in production. Both GLM-runs stop at 12 Meta_Box tests and do not cover this transition.
Two sub-details worth noting:

GLM-alone has 34 tests in Test_CM_Providers (more than Opus, 30) but with 72 assertions vs. 55 from Opus. On average shallower per test; many are close variations (test_translate_chunks_with_single_chunk_calls_translate, ..._with_multiple_chunks, ..._returns_error_on_provider_failure) where one @dataProvider would solve it. Volume ≠ depth.

GLM-with-plan also has the lowest average assertions/teste among the three (~2.1), which suggests tests that “hit the surface” more than “interrogate state”.
Dispatch strategy.

Opus: true parallel, via translate_batch( $chunks, …, $on_complete ) which fires N simultaneous requests and closes each via callback as soon as it responds. Wall time ≈ max(latência por chunk). The bar follows the arrival of responses.

Both GLM-runs converged to sequential dispatch with lean payload (foreach calling provider->translate( $chunk, ... )). The bar advances chunk by chunk, but wall time ≈ soma(latência por chunk) instead of max. For a post of 8 chunks, this is the difference between “2 minutes” and “16 minutes” – same apparent feature, very different waiting cost.

Detail about GLM-with-plan: the comment in the code itself documents the choice – “Translate chunks sequentially so progress updates are visible”. The author preferred visible progress over shorter wall-clock.

It is a defensible choice (seeing progress > waiting faster). But it is interesting that the Opus plan described the parallel via callback, and the agent preferred not to follow that part.

In summary: Opus > GLM-with-plan > GLM-alone in encapsulation and tests; and Opus leading both GLM-runs in dispatch (parallel vs sequential). In other words, each axis tells a version of the same story, which is how much the author thought about the class consumer (encapsulation), the maintainer (contract tests), and the end user waiting for the bar (dispatch).

What this post is not

Before closing, four caveats to calibrate how you read the numbers above:

It is not a paper, it is an informed anecdote. I ran each configuration once, at different times (Ollama Cloud latency varies a lot from morning to afternoon), without fixing the seed. The numbers serve as a signal, not as a statistical benchmark.

For those who want the structured benchmark of the same question – repeated rounds, more models compared, quality metrics – Akita published exactly that last week: Is it worth mixing 2 models?. His conclusion matches mine. This post here is the anecdotal signal of the same finding, on a real plugin instead of greenfield Rails.
Everything runs in Claude Code. It is the agent I use every day, it is the one I trust my workflow to. I did not test in Cursor, Aider, Windsurf, or another harness – some observations about supervision and DX may be specific to this particular environment.
Implicit conflict of interest. Claude Code is by Anthropic, and the Opus models are likely optimized for this harness. One cannot separate for sure “Opus is better” from “Opus fits better in Claude Code”. I believe the gap remains in another harness, but I cannot prove it with this experiment.
It is not rooting for Opus. Every time Anthropic has an outage, I switch to GLM 5.1 via Ollama Cloud without thinking twice.

It is genuinely the second best model/harness pair in my routine, and is what pays the bills when the first one is down. This post is not “GLM is bad”. It is “split with plan doesn’t pay off when the problem grows”.

A detail that probably widens the gap

Worth noting one thing before the verdict: transient + polling was my choice as the plugin maintainer, not something the agents had to decide. I wanted the simplest and most effective thing for now in this case.

What this does to the experiment: it did not test the ability to architect difficult things. No SSE, no WebSocket, no Action Scheduler with queue and server-to-client push, none of that.

Still, if the result already shows Opus 47m vs. GLM-with-plan 2h42m + ~18 corrective interventions + completely manual validation DX in this simple setup, it is reasonable to expect the gap to widen in a scenario where the architecture itself was part of the challenge. Here, the agent only had to stitch together the standard pattern well.

Is it worth mixing?

Therefore, the direct answer for this round: it didn’t pay off.

The three branches delivered code that works – same feature, same baseline. What changed around it was everything else:

41% of the dollar cost (billing savings hold up)
3.5× more active time than Opus, and worse than free-form GLM
~18 corrective interventions vs. ~8 in free-form GLM vs. 0 in Opus
Validation DX moving from “agent drives the browser, and I watch” (Opus) to “I open the post myself, click, monitor” (any GLM)

If your time in front of the keyboard is worth anything above zero, the math flips. At 100 USD/hour, GLM-with-plan is already more expensive than Opus alone. At 150/hour, Opus wins by a comfortable margin.

And, more important than the number: this task involves async coordination between PHP, REST, JS and an external provider whose latency can disappear. In this type of problem, the plan you pass to the cheap executor has to be very good to close the execution gap. Opus’s was good – and even then it didn’t close.

In which situations does the split still make sense: reasonably delimited UI tasks, with a very clear scope. Basically, the simple scenario of the original experiment: no chunking, no async coordination, no timeout fighting with you. There, the plan closes the gap, and the cheap executor delivers.

For anything involving timing, parallelism, or external dependency, do Opus alone. You pay for “not needing to babysit”, not for the tokens.

And for bugfixes in code you know well, GLM alone works: the plan already lives in your head.

I am not saying split never pays off. I am saying: the more complex the problem, the less the pre-ready plan compensates for the cheap executor. The plan closes the WHAT gap. It leaves the HOW gap open. And the HOW is where the extra time, extra supervision and degraded DX appear.

Beware of the universal shortcut

You just did the whole math with me. If a Twitter/X thread pops up in a bit promising “Opus quality, GLM/Qwen/Minimax/Kimi price – just pass the plan”, it is showing you one column of the CSV. The other half is not in the billing.

And it is not just this experiment saying this. Akita ran a broader multi-model benchmark in the same month, with three harnesses and seven model combinations, and reached the same answer: solo Opus for typical greenfield, multi-model only pays off in genuinely parallel and decoupled tasks. Different methods, aligned conclusion.

I will merge the chunks-opus into the plugin. The other two branches remain as reference: the free-form is the minimum viable that GLM delivers in a shallow prompt, and the with-plan shows how expensive it is to try and close the capacity gap with a plan. Neither of the two goes to main.

The next step is to implement the feature in the most ideal way (SSE, scheduler) and observe GLM and other models like Minimax and Kimi K2.6 working on them.

In three lines, for those who arrived here at the end:

Split works in simple and well-delimited tasks – the plan closes the cheap executor’s gap.
In complex tasks, the HOW escapes the plan. Extra time, extra supervision and degraded DX flip the math – and that math is not in the billing.
You pay for the expensive model so you “don’t have to babysit” and not because of the token difference.

That is all for today.

Session aggregates are in this gist: token counts, active time, tool breakdowns, corrective check-in transcripts, full comparison between the three sessions.

If you want to check the post numbers or disagree with the conclusions, this is the starting point.

Stop Putting Best Practices in Skills

Edy Silva — Fri, 10 Apr 2026 17:06:24 +0000

Vercel demonstrated that AGENTS.md outperforms skills in their agent evals. AGENTS.md hit 100% pass rate on general framework knowledge. Skills with explicit instructions reached 79%. In 56% of cases, the agent had access to a skill but never invoked it. Their conclusion: skills work for vertical, action-specific workflows, not for general best practices.

Their evals were single-shot, though. One prompt, one response, done. Skills depend on context to be called. The model sees a name and a one-line description and has to decide in a single cold shot whether to invoke. In a real session, you go back and forth, context accumulates, the model picks up patterns. Single-shot penalizes skills by testing them in conditions nobody actually uses them in.

Then there’s Superpowers. People install it, and Claude Code starts following TDD, writing plans before coding, and debugging systematically. It bundles best practices as skills and people swear by it. If skills are supposed to lose to AGENTS.md, why does Superpowers work so well?

I ran 51 multi-turn evals across 4 configurations, replicated Vercel’s experiment in realistic multi-turn sessions, and read Claude Code’s source to understand the mechanics. Skills and CLAUDE.md are both just prompts. Same markdown, same model. The only difference is whether the prompt reaches the model. CLAUDE.md reaches it every time. Skills depend on a chain of decisions that fails 34-94% of the time. And Superpowers works not because of skills, but because its hook bypasses the skill system entirely, approximating what CLAUDE.md does natively.

TL;DR: Skills and CLAUDE.md are both just prompt. When skills get invoked, they work just as well. The problem is they only get invoked 6-66% of the time. CLAUDE.md is always in context. Put guidelines in CLAUDE.md, use skills for on-demand recipes.

Contents:

How skills actually work in Claude Code
The activation gap
The multi-turn eval
Results
Why this happens
Skills are recipes, CLAUDE.md is the health code
Full turn-by-turn results
Methodology
What to do with this

How skills actually work in Claude Code

Before the data, how skills work under the hood. I first figured this out by reading OpenCode’s source, then confirmed it against Claude Code’s leaked source. I reference file paths from that codebase throughout this section. Search GitHub, you’ll find them.

Discovery: name and description

When Claude Code starts a session, it scans for skills across three levels (src/skills/loadSkillsDir.ts):

Managed – /etc/claude-code/.claude/skills/ (org-wide)
User – ~/.claude/skills/ (your personal skills)
Project – .claude/skills/ (checked into the repo)

For each skill directory, it reads the SKILL.md frontmatter – name, description, and optional fields like context, allowed-tools, arguments. The full markdown body stays on disk.

At session init, only name and description reach the model. formatCommandDescription() in src/tools/SkillTool/prompt.ts produces one line per skill: - {name}: {description}. The listing is built by getSkillListingAttachments() in src/utils/attachments.ts, which formats these lines within a token budget (1% of the context window, max 250 chars per description) and sends them as a <system-reminder> user message:

The following skills are available for use with the Skill tool:

- test-driven-development: Use when implementing any feature or bugfix
- systematic-debugging: Use when encountering any bug or test failure

Two short strings. The model knows skills exist, but it hasn't read them. This is the bottleneck. The model sees what is available, not how to apply it. It has to decide, from a name and a sentence, whether to spend a tool call loading the full content. In my evals, it almost never does.

Invocation: full content on demand

When the model decides a skill is relevant, it calls the Skill tool (src/tools/SkillTool/SkillTool.ts):

{ "tool": "Skill", "input": { "skill": "test-driven-development" } }

The SkillTool loads the full SKILL.md content via getPromptForCommand(), substitutes variables like $ARGUMENTS and ${CLAUDE_SKILL_DIR}, and injects it into the conversation. Two execution modes:

Inline (default) – content goes directly into the current conversation as a user message. The model reads the instructions and follows them in the same context.
Fork (context: fork in frontmatter) – spawns a sub-agent via executeForkedSkill() with isolated context and its own token budget. The result comes back without bloating the parent conversation.

Same mechanism that lets the model read files or run bash commands. It asks, the runtime reads a markdown file, the content comes back.

CLAUDE.md: always in context

CLAUDE.md takes a different path. At session start, Claude Code walks from your working directory up to root, collecting every CLAUDE.md, .claude/CLAUDE.md, and .claude/rules/*.md it finds (src/utils/claudemd.ts). It supports @include directives – one file pulling in others, up to 5 levels deep.

All of this feeds into getUserContext() in src/context.ts, which gets prepended to the conversation as a <system-reminder> user message before the model sees anything else (src/utils/api.ts).

Content	When it loads	How it loads
CLAUDE.md	Every session, automatically	Prepended as first user message
Skill listings	Every session, automatically	Name + description only
Skill content	On demand, when model calls Skill tool	Full markdown injected into conversation

CLAUDE.md is always there. Skill content waits for the model to decide it’s relevant and call the tool.

What Superpowers actually does

Superpowers registers a SessionStart hook. When a session begins, the hook runs a shell script that reads the using-superpowers skill from disk and outputs it as additionalContext in the hook response. Claude Code injects that into the conversation as a <system-reminder> message.

The content is aggressive. The skill wraps instructions in <EXTREMELY_IMPORTANT> tags:

IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.

This is not negotiable. This is not optional. You cannot rationalize your way out of this.

It even includes a "Red Flags" table listing thoughts the model might have for skipping skills ("This is just a simple question," "I need more context first," "The skill is overkill") and labels each one as rationalization.

So Superpowers doesn't wait for the model to discover skills. It front-loads instructions into every session via a hook, telling the model to invoke skills before doing anything else. This is basically the same idea as having a CLAUDE.md with a hint ("invoke the relevant skill before coding"), just louder. Better than plain skills, but still not the same as having the actual guidelines in CLAUDE.md from the start. The model still has to invoke the skill, read the content, and follow it. Three steps that can fail. CLAUDE.md skips all three.

The activation gap

I call this the activation gap. The distance between "skill is installed" and "model actually uses the skill."

I ran single-shot evals first to confirm Vercel's numbers. 31 tasks across react and next.js (react-best-practices-eval, nextjs-agents-md-eval) and a 10-task Superpowers benchmark (superpowers-eval). Similar results. Vanilla skills: 0% invocation. The model never opens the drawer on its own. AGENTS.md: 76-90% pass rate.

But as I mentioned, single-shot isn't how people work. So I built a multi-turn eval suite.

The multi-turn eval

5 scenarios, 3-4 turns each. Plain Node.js/TypeScript so framework knowledge isn't a confounding variable. The prompts are the kind of thing you'd actually type.

Scenario 1: TDD – Email Validator (4 turns)

Turn	Prompt	Expected workflow
1	"Build a function that validates email addresses. It should handle basic formats like user@domain.com and reject obviously invalid ones like missing @ or empty strings."	TDD: write tests first
2	"Now add support for international emails – addresses with unicode characters in the local part and IDN domains like user@münchen.de."	TDD: extend tests first
3	"I found a bug – plus aliases like user+tag@gmail.com are being rejected. Fix it."	Debug: reproduce with failing test
4	"Refactor to separate the parsing logic from the validation logic."	Refactor: ensure tests pass after

Scenario 2: Debugging – Broken LRU Cache (3 turns)

Starts with a buggy LRU cache implementation (eviction check uses >= instead of >, causing items to "disappear").

Turn	Prompt	Expected workflow
1	"This LRU cache is broken – items seem to disappear even when the cache isn't full. Can you figure out what's wrong and fix it?"	Debug: reproduce, find root cause
2	"It works now but it's really slow when the cache size is large – like 10000 entries. Can you improve the performance?"	Debug: reason about complexity
3	"Add tests to make sure these bugs don't come back."	TDD: write regression tests

Scenario 3: Planning – Rate Limiter (3 turns)

Turn	Prompt	Expected workflow
1	"I need a rate limiter for an API. Limit each client to 100 requests per minute. Give me a plan before coding."	Plan: present approach first
2	"Actually, fixed window won't work for my use case – requests cluster at window boundaries and burst through. I need sliding window instead."	Plan: revise, explain trade-offs
3	"Implement it and add tests."	TDD: write tests, then implement

Scenario 4: Refactoring – Express Middleware (4 turns)

Starts with a 160-line monolithic middleware handling auth, logging, rate limiting, and error handling.

Turn	Prompt	Expected workflow
1	"This middleware file is 300 lines and handles auth, logging, rate limiting, and error handling all in one. Help me understand what it does."	Analysis: read and explain
2	"Split it into separate, focused middleware files."	Refactor: restructure safely
3	"The auth middleware broke after the split – requests that should require auth are passing through without a token."	Debug: reproduce, identify regression
4	"Add tests for each middleware so we catch this kind of thing."	TDD: write isolated tests

Scenario 5: Mixed – HTTP Client Retry (3 turns)

Starts with a basic HTTP client without retry logic.

Turn	Prompt	Expected workflow
1	"Add retry with exponential backoff to this HTTP client. It should retry on 5xx errors and network failures, up to 3 retries."	TDD or plan first
2	"It's retrying on 400 Bad Request errors too. That's wrong – 4xx should fail immediately without retrying."	Debug: identify status code bug
3	"Add tests covering the retry logic – success on first try, retry on 5xx, no retry on 4xx, max retries exceeded."	TDD: comprehensive test suite

Each scenario crosses workflow boundaries. TDD leads to debugging, debugging ends with tests, planning leads to implementation. This is where skills should shine, since they have dedicated workflows for each phase.

Four configurations:

Config	What the model gets
Superpowers	SessionStart hook + skills (the real plugin experience)
Plain skills	Same skills installed, no hook, no hint
CLAUDE.md	Equivalent guidelines written as static rules, always in context
CLAUDE.md + hint	One-liner in CLAUDE.md saying "invoke the relevant skill before coding" + skills installed

Same model (claude-opus-4-6), same tasks, same workspace setup. All runs executed in a clean environment with ~/.claude/plugins, ~/.claude/skills, ~/.claude/settings.json, and ~/.claude/CLAUDE.md temporarily disabled. Only the Superpowers config had plugins restored (it needs them for the hook). Each turn was capped at 15 agentic turns.

Results

Skill invocations

Config	Invocations	Rate
Superpowers (hook)	10/15	66%
CLAUDE.md + hint	5/15	33%
Plain skills	1/15	6%
CLAUDE.md (guidelines)	n/a	n/a

8 of 68 total turns hit the 15 max-turns limit. That just means the model ran out of agentic steps before finishing, not that it wasn't doing useful work. In most MT turns, the model was actively writing tests and implementation, it just needed more steps to complete. Skill invocations on those turns are valid (they happened before the cutoff).

Multi-turn helps Superpowers a lot. From 10% in single-shot to 66% here. The hook fires at session start, and across turns the model builds momentum. Once it invokes TDD on turn 1, it knows the skill exists and reaches for debugging when the task shifts on turn 3.

The CLAUDE.md hint works, but only in a clean environment. This was the Vercel-style config. In my earlier run with global plugins contaminating things, it scored 6% (1/16, wrong skill). Clean run: 33% (5/15, correct skills). The hint is sensitive to noise. Competing global skills and plugins dilute its effect.

Plain skills got one spontaneous invocation out of 15 turns. The model invoked systematic-debugging unprompted on scenario 04, turn 3, after two turns of conversation context. So multi-turn can trigger invocation without a hook, but it's rare.

Clean environment matters more than I expected. Every config did better in the clean run. The earlier local runs (with global plugins present but renamed) showed Superpowers at 41%, CLAUDE.md+hint at 6%, plain skills at 0%. Clean run: 66%, 33%, 6%. Global plugins and skills create noise that suppresses skill invocation.

When Superpowers invokes skills

The pattern is consistent across all runs (local, Docker, and clean):

Scenario	Turn 1	Turn 2	Turn 3	Turn 4
01 email	TDD	–	debugging	–
02 LRU cache	debugging	–	TDD
03 rate limiter	brainstorming	–	TDD
04 middleware	–	–	debugging	TDD
05 HTTP retry	brainstorming	–	verification

Skills fire at transitions, when the workflow changes (coding to debugging, debugging to testing). On continuation turns the model doesn't re-invoke. It keeps the momentum from the previous invocation. Which makes sense. You don't re-read the TDD manual every time you write a new test.

TDD compliance

Skill invocations are one metric. Did the agent actually follow the workflow? I checked whether test files were written before implementation files on the key TDD turns.

Scenario	Superpowers	Plain Skills	CLAUDE.md	CLAUDE.md + hint
01 email t1	test first	impl first	test first	test first
02 LRU t1	test first	test first	test first	test first
03 rate limiter t3	test first	impl first	test first MT	impl first
05 HTTP retry	test first (t2)	test only (t3)	test first (t1)	test first (t1)

Here's the thing: Superpowers and CLAUDE.md are basically tied. Both wrote tests first on 4 out of 4 measured scenarios. CLAUDE.md + hint got 3/4. Plain skills got 1/4.

Having guidelines in CLAUDE.md wasn't necessarily better at making the model follow TDD. When Superpowers fires, the workflow quality is just as good. They're all prompt. Same markdown, same instructions, same model. The only difference is whether the prompt reaches the model. CLAUDE.md reaches it every time. Superpowers reaches it 66% of the time.

The interesting case is scenario 04 (refactor middleware, turn 2). No config wrote tests before refactoring. They all jumped straight to splitting the middleware into files. The "write tests before restructuring" guideline needs to be stronger, regardless of delivery mechanism.

Why this happens

Both CLAUDE.md and skill listings arrive through the same channel: <system-reminder> wrapped user messages. No architectural trust difference. The difference is just presence.

CLAUDE.md content is always in the context window. Every turn, every decision, the guidelines are right there. Skill content requires the model to read the name+description listing, decide the skill is relevant, call the Skill tool, wait for the content, then follow it. Each step can fail.

So the activation gap isn't a quality problem. It's a reliability problem. When skills get invoked, they work. They just don't always get invoked. Superpowers gets to 66% in clean multi-turn. The CLAUDE.md hint gets 33%. Neither reaches 100%.

CLAUDE.md gets 100% presence. No invocation needed.

Skills are recipes, CLAUDE.md is the health code

Think of it like a kitchen.

CLAUDE.md is the health code. Wash your hands, sanitize surfaces, check temperatures. Every cook follows these rules on every shift. They're non-negotiable and always visible, posted on the wall. You don't wait for someone to ask "should I wash my hands before touching food?" It's the baseline.

Skills are recipes. You pull the recipe for bouillabaisse when someone orders bouillabaisse. You don't tape every recipe to the wall next to the health code. That's noise. Recipes have their moment. The health code is constant.

Superpowers tries to turn recipes into the health code by having a hook shout "READ THE RECIPES" at the start of every shift. It works most of the time. But you could just put the important rules on the wall.

CLAUDE.md is for guidelines. Conventions, coding standards, workflow rules, TDD processes, debugging protocols. Anything the agent must follow every session. "Write tests before implementation" is a health code rule. It goes in CLAUDE.md.

Skills are for recipes. Specific, on-demand procedures you invoke when the moment calls for it. "Generate a database migration," "scaffold a component," "run the release checklist." These don't need to be in context all the time. They need to be there when you ask for them. Use context: fork for heavy recipes that would bloat the main context.

Hooks are for automation, not instruction delivery. Pre-commit validation, linting, notifications. If you're using a hook to inject guidelines (like Superpowers does), it works at 66% in clean multi-turn, but CLAUDE.md would do the same job at 100% with zero activation gap.

Mechanism	Presence	Invocation needed	Clean multi-turn rate
CLAUDE.md (health code)	100%	No	n/a, always there
Superpowers (hook + recipes)	Hook: 100%, Content: 66%	Yes	66%
CLAUDE.md + hint + skills	100% (hint), 33% (content)	Yes	33%
Plain skills (recipes on shelf)	Listing only	Yes	6%

General guidelines don't belong in skills. Skills are not how you say "always do X." They're how you say "when you need to do Y, here's how."

Full turn-by-turn results

Every turn, every config. MT marks turns that hit the 15 max-turns limit.

Superpowers (hook + skills) – 10/15 invocations (66%)

Scenario	Turn	Skill invoked	First file written
01 email	t1 (tdd)	test-driven-development	validateEmail.test.ts (test first)
01 email	t2 (tdd)	–	– (used Edit)
01 email	t3 (debug)	systematic-debugging	–
01 email	t4 (refactor)	–	validateEmail.ts
02 LRU	t1 (debug)	systematic-debugging	lru-cache.test.ts (test first)
02 LRU	t2 (debug)	–	lru-cache.ts
02 LRU	t3 (tdd)	test-driven-development	–
03 rate limiter	t1 (plan)	brainstorming	– (planning, no code)
03 rate limiter	t2 (plan)	–	–
03 rate limiter	t3 (tdd)	test-driven-development	rate-limiter.test.ts (test first) MT
04 middleware	t1 (analysis)	–	– (reading code)
04 middleware	t2 (refactor)	–	logging.ts (impl first)
04 middleware	t3 (debug)	systematic-debugging	middleware.test.ts
04 middleware	t4 (tdd)	test-driven-development	logging.test.ts (test first) MT
05 HTTP retry	t1 (tdd)	brainstorming	–
05 HTTP retry	t2 (debug)	–	http-client.test.ts (test first)
05 HTTP retry	t3 (tdd)	verification-before-completion	–

Plain skills (no hook, no hint) – 1/15 invocations (6%)

Scenario	Turn	Skill invoked	First file written
01 email	t1 (tdd)	–	validateEmail.ts (impl first)
01 email	t2 (tdd)	–	–
01 email	t3 (debug)	–	–
01 email	t4 (refactor)	–	validateEmail.ts
02 LRU	t1 (debug)	–	lru-cache.test.ts (test first)
02 LRU	t2 (debug)	–	lru-cache.ts
02 LRU	t3 (tdd)	–	–
03 rate limiter	t1 (plan)	–	scalable-bubbling-lagoon.md
03 rate limiter	t2 (plan)	–	types.ts (impl first) MT
03 rate limiter	t3 (tdd)	–	–
04 middleware	t1 (analysis)	–	–
04 middleware	t2 (refactor)	–	logging.ts (impl first)
04 middleware	t3 (debug)	systematic-debugging	middleware.test.ts
04 middleware	t4 (tdd)	–	logging.test.ts
05 HTTP retry	t1 (tdd)	–	–
05 HTTP retry	t2 (debug)	–	–
05 HTTP retry	t3 (tdd)	–	http-client.test.ts

CLAUDE.md (guidelines, no skills) – 0/14 invocations (n/a)

Scenario	Turn	First file written
01 email	t1 (tdd)	validateEmail.test.ts (test first)
01 email	t2 (tdd)	–
01 email	t3 (debug)	–
01 email	t4 (refactor)	–
02 LRU	t1 (debug)	lru-cache.test.ts (test first)
02 LRU	t2 (debug)	bench.ts
02 LRU	t3 (tdd)	–
03 rate limiter	t1 (plan)	–
03 rate limiter	t2 (plan)	–
03 rate limiter	t3 (tdd)	rate-limiter.test.ts (test first) MT
04 middleware	t1 (analysis)	–
04 middleware	t2 (refactor)	tests first (4 test files before impl) MT
04 middleware	t3 (debug)	–
04 middleware	t4 (tdd)	–
05 HTTP retry	t1 (tdd)	http-client.test.ts (test first)
05 HTTP retry	t2 (debug)	–
05 HTTP retry	t3 (tdd)	–

CLAUDE.md + hint (skills installed) – 5/15 invocations (33%)

Scenario	Turn	Skill invoked	First file written
01 email	t1 (tdd)	test-driven-development	validateEmail.test.ts (test first)
01 email	t2 (tdd)	–	–
01 email	t3 (debug)	–	–
01 email	t4 (refactor)	–	validateEmail.ts
02 LRU	t1 (debug)	systematic-debugging	lru-cache.test.ts (test first)
02 LRU	t2 (debug)	–	lru-cache.ts
02 LRU	t3 (tdd)	–	–
03 rate limiter	t1 (plan)	–	snazzy-juggling-glacier.md
03 rate limiter	t2 (plan)	–	rate-limiter.ts (impl first)
03 rate limiter	t3 (tdd)	–	types.ts (impl first) MT
04 middleware	t1 (analysis)	–	–
04 middleware	t2 (refactor)	–	–
04 middleware	t3 (debug)	systematic-debugging	–
04 middleware	t4 (tdd)	test-driven-development	logging.test.ts (test first) MT
05 HTTP retry	t1 (tdd)	test-driven-development	http-client.test.ts (test first) MT
05 HTTP retry	t2 (debug)	–	–
05 HTTP retry	t3 (tdd)	–	–

Methodology

Execution

Each scenario runs as a multi-turn claude -p session:

# Turn 1: fresh session
claude -p --model claude-opus-4-6 --output-format stream-json \
  --verbose --dangerously-skip-permissions --max-turns 15 \
  "$PROMPT" > turn-1.jsonl

# Extract session ID
sid=$(grep -o '"session_id":"[^"]*"' turn-1.jsonl | head -1 | cut -d'"' -f4)

# Turn 2+: resume same session
claude -p --model claude-opus-4-6 --output-format stream-json \
  --verbose --dangerously-skip-permissions --max-turns 15 \
  --resume "$sid" "$PROMPT" > turn-2.jsonl

Environment isolation

All runs disabled user-level configuration to prevent contamination:

# Disabled at start, restored on exit (trap)
~/.claude/plugins      -> ~/.claude/plugins.eval-disabled
~/.claude/skills       -> ~/.claude/skills.eval-disabled
~/.claude/settings.json -> ~/.claude/settings.json.eval-disabled
~/.claude/CLAUDE.md    -> ~/.claude/CLAUDE.md.eval-disabled

Only the Superpowers config re-enabled ~/.claude/plugins (the hook + skills come from the plugin). OAuth auth stays in the macOS keychain, unaffected by the rename.

Config setup per workspace

Each scenario gets a fresh /tmp workspace with package.json, tsconfig.json, seed files (if any), and npm install. Then:

Superpowers: plugin provides the SessionStart hook + skills via .claude/settings.json
Plain skills: skills copied into workspace .claude/skills/, no hook
CLAUDE.md: CLAUDE.md with equivalent TDD/debugging/planning guidelines
CLAUDE.md + hint: CLAUDE.md with "Before writing code, first explore the project structure, then invoke the relevant skill for the task at hand." + skills copied into workspace .claude/skills/

Max turns

Each turn was capped at 15 agentic steps (--max-turns 15). 8 of 68 turns hit this limit. The affected turns are marked with MT in the results tables. In most MT turns the model was actively writing tests and code, it just needed more steps to finish. The || true flag prevents truncation from killing the runner script.

Measurement

Skill invocations are extracted from stream-json transcripts by searching for "name":"Skill" in assistant messages. TDD compliance is measured by the order of Write tool calls – whether test files (.test.ts, .spec.ts) appear before implementation files.

Reproducibility

A Dockerfile is included for fully isolated runs (requires ANTHROPIC_API_KEY):

docker build -t multiturn-eval .
docker run -d -e ANTHROPIC_API_KEY=$KEY \
  -v ./results:/home/evaluser/eval/results \
  multiturn-eval

I validated the eval across three environments: local with plugin rename, Docker with zero user config, and local with full config disabled. Superpowers invocation patterns were identical across all three.

The repos are open if you want to reproduce or poke at the data:

react-best-practices-eval – 10 single-shot React tasks
nextjs-agents-md-eval – 21 single-shot Next.js tasks
superpowers-eval – Superpowers invocation benchmark
multiturn-eval – 20 multi-turn scenarios across 4 configs (this post)

What to do with this

If you're setting up Claude Code for a project:

TDD, debugging protocols, code style, naming conventions go in CLAUDE.md. These are rules you want followed every session. No invocation, no activation gap, 100% presence.
"Scaffold a service," "generate a migration," "run the release checklist" go in skills. These are procedures you call when you need them. Use context: fork if they're heavy.
If you need CLAUDE.md to reference extra documentation, make it an index. Point to files. Same pattern Claude Code uses for its own memory: a root file that links to specifics.
If you're using Superpowers and it works for you, keep using it. Now you know why it works (the hook) and where it drops off (34% of turns in multi-turn, more in single-shot). Moving your guidelines to CLAUDE.md would close that gap.

Skills are not broken. They're just not for guidelines.

Five Things Your Coding AI Agent Wishes You Understood

Edy Silva — Thu, 09 Apr 2026 16:33:56 +0000

I keep seeing the same frustration everywhere - Reddit, Discord, Twitter. Someone tells their coding agent to follow a specific pattern, the agent nails it, and then five prompts later, it acts like the conversation never happened. "Why does it keep forgetting?" "Is this a bug?" "I literally just told it that."

It’s not a bug. It’s context compression doing exactly what it’s supposed to do. But most people using coding agents – Claude Code, OpenCode, Cursor, Cline – have no idea what’s happening under the hood. They treat agents like black boxes. Magical black boxes.

They’re not magical. They’re not even that complex. And once you understand the few core concepts behind them, you’ll stop fighting the tool and start getting way more out of it.

1. Everything is prompting

If you take one thing from this post, make it this.

Every behavior you see from a coding agent – every "feature", every "skill", every "personality trait" – is the result of a prompt. A system prompt that you never see, but that’s there, shaping every response.

When Claude Code feels opinionated about code style, that’s a prompt. When it asks for confirmation before running destructive commands, that’s a prompt. When it formats responses in a certain way, that’s a prompt.

There’s no hidden reasoning engine. No special agent architecture is doing something fundamentally different from what happens when you type in a chat window. It’s an LLM receiving text and generating text. The "agent" part is a loop: prompt the model, parse the output, execute any tool calls, feed the results back, repeat.

This matters because once you internalize it, you realize that you can influence agent behavior the same way the system prompt does – by writing better instructions. Your CLAUDE.md file, your prompts, your corrections mid-conversation – they’re all part of the same mechanism. You’re not "configuring" the agent. You’re prompting it.

Here’s what a real Claude Code session looks like from the inside. This is the /context command output – it shows you exactly what’s occupying the context window:

System prompt, system tools, memory files, skills, messages. That’s it. That’s the whole agent. Text in, text out. Every category you see there is just text being fed to the model before it generates a response.

2. Tools are just functions described in text

When a coding agent reads a file, runs a shell command, or searches your codebase, it’s not using some privileged internal API. It’s calling a tool. And a tool, from the LLM’s perspective, is just a JSON description of a function.

The model sees something like: "There’s a tool called Read that takes a file_path parameter and returns the file contents." That’s it. The model decides when to call it, generates the parameters, and the agent runtime executes the actual function and feeds the result back.

This is important because the model can only use tools it knows about. If a tool isn’t described in the prompt, it doesn’t exist for the model.

In Claude Code, core tools like Read, Edit, Bash, and Grep are always loaded in context. You can see them in the /context output taking up 8k tokens. MCP tools – integrations you add yourself, like Figma or Slack – are also loaded by default. But this creates a problem: if you have dozens of MCP tools, their descriptions start eating your context window before you even start working.

Claude Code solves this with on-demand loading. You can control it with the ENABLE_TOOL_SEARCH environment variable (set to auto by default, which kicks in when MCP tool descriptions exceed 10% of context). When on-demand loading is active, all those individual MCP tool descriptions get replaced by a single tool: ToolSearch.

Think of it like replacing a long menu with a search bar. The model doesn’t see "Figma screenshot tool, Figma metadata tool, Slack send message tool…" anymore. It sees: "there’s a ToolSearch tool you can call to find available tools." The system prompt tells the model that deferred tools exist and that it must search before calling them. The model doesn’t know which tools are available, but it knows something is there and how to discover it.

So when you ask the agent to take a Figma screenshot, it calls ToolSearch with something like "figma screenshot", the runtime searches across all registered MCP tool names and descriptions, and the matching tool gets loaded into context. Only then can the model actually call it. Your MCP servers are still configured in .claude/settings.json – the runtime knows about all of them, but the model only sees the ones it explicitly searches for.

Knowing this explains a lot of "weird" behavior. The agent didn’t use the right tool? Maybe it didn’t know about it. The agent called a tool with wrong parameters? It’s guessing from a text description, not from type checking. The agent keeps using cat instead of the dedicated Read tool? Its prompt tells it not to, but it’s a probabilistic model – sometimes it drifts.

3. Skills are tools you define

See the pattern? Core tools are always in context. MCP tools can be loaded on demand via ToolSearch. Skills follow the exact same pattern – they’re just tools, but ones you define.

In Claude Code, skills used to be called "commands." They got renamed, but the mechanism is the same. You create a markdown file in .claude/skills/, write instructions in it, and the agent treats it as a tool it can call.

Here’s how it works. At session start, skill descriptions (the short summary from the frontmatter) get loaded into context – so the model knows what skills exist. You can see this in the /context output: "Skills: 409 tokens." But the full skill content doesn’t load until it’s invoked. When you type /commit, the model calls a built-in Skill tool, which fetches the full markdown file and injects it into context. The model then follows those instructions.

Same mechanism as ToolSearch. Same mechanism as the system prompt. It’s all just text being loaded into context at different moments.

You can create your own skills. Write a markdown file with instructions, put it in .claude/skills/, and the agent picks it up. When I type /title-generator, the model calls Skill, loads my custom markdown file that says "given a topic, produce 5+ title options across different styles using these headline formulas…", and follows it. No different from a built-in skill.

The difference between a "built-in feature" and your custom skill is just where the text lives. Built-in skills ship with the tool. Yours lives in your project. The LLM treats them exactly the same way. It’s prompting all the way down.

4. Memory is not what you think

This is where most confusion lives.

People assume AI agents have memory the way humans do – that things said earlier in a conversation are "remembered" the way you remember what you had for breakfast. They don’t.

An LLM has no persistent state between calls. Every time the model generates a response, it processes the entire conversation from scratch. What feels like "memory" is actually the conversation history being sent as part of the prompt every single time.

This has a hard limit: the context window. For Claude, that’s roughly 200K tokens. Sounds like a lot, but tool results add up fast. Read a few files, run some commands, and you’ve already burned through a good portion of it.

What happens when context fills up

The agent compresses older messages. It summarizes or drops parts of the conversation to make room for new content. This is why agents "forget" your instructions – they got compressed away.

This is not a bug. It’s a design tradeoff. The alternative is that the conversation just stops.

Long-term memory

Context compression is a problem. If the agent forgets your instructions mid-conversation, you need a way to make things stick. That’s what long-term memory is for.

In Claude Code, there’s a memory/ directory where the agent writes notes that persist across conversations. It loads these files at the start of every session. Here’s what that looks like:

CLAUDE.md is your project instructions file – coding conventions, architecture decisions, things the agent should always know. MEMORY.md is where the agent stores things it learned during previous conversations – patterns it confirmed, preferences you corrected, decisions you made together.

Both get injected into the system prompt. Both are just text files on disk.

What this means for you

If you tell the agent something critical mid-conversation, it might get compressed away later. But if it’s in your memory files, it’ll be there at the start of every conversation.

Keep your memory files updated. That’s it. Put your coding conventions in CLAUDE.md. Let the agent save patterns and decisions to MEMORY.md. When you correct the agent on something, tell it to remember. Don’t rely on mid-conversation corrections and hope it sticks – write it down where it gets loaded every time.

5. Context is everything (literally)

What the agent produces depends entirely on what’s in its context. This sounds obvious, but the implications are not.

The agent doesn’t "know" your codebase. It knows whatever files it has read in the current session. If it makes a wrong assumption about your architecture, it’s probably because it hasn’t read the right files yet.

This is why good agents read before they write. And it’s why you should be suspicious when an agent proposes changes to code it hasn’t looked at.

A few practical consequences.

Long conversations degrade. As the context fills and compresses, the agent loses earlier information. Start new conversations for new tasks.

Don’t assume the agent "knows" something from three tool calls ago. If it’s important, restate it or put it in your project instructions.

Front-load your context. The beginning of the conversation and the system prompt get the most "attention" from the model. Put your most important constraints there.

Why this matters

You don’t need to do anything miraculous to get effective results from coding agents. You don’t need to master prompt engineering frameworks, read every paper on LLM architectures, or reverse-engineer system prompts.

You need to understand the basics. That’s it.

Context is a window with a size limit, and things get dropped when it fills up. Tools are text descriptions that the model reads and decides to call. Skills are tools you wrote yourself. Memory is files on disk that get loaded at the start. Everything is prompting.

This is your Pareto principle for AI agents. These five concepts are the 20% that solve 80% of your problems. When the agent forgets something, you know why – context compression. When it doesn’t use the right tool, you know why – it wasn’t loaded. When it ignores your conventions, you know what to do – update your CLAUDE.md.

Most people are out there fighting the tool because they skipped the fundamentals. They want advanced techniques when they haven’t understood the basics. I’ve seen this pattern before in software development, and it never ends well. You can’t debug what you don’t understand.

Understanding how the machine works is always the first step. It was true before AI agents, and it’s true now.

Thanks for reading!