<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Edy Silva</title>
    <description>The latest articles on Forem by Edy Silva (@edysilva).</description>
    <link>https://forem.com/edysilva</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F339574%2F86cc9eb6-34f4-4a41-b435-ccf3b2f53aa4.jpeg</url>
      <title>Forem: Edy Silva</title>
      <link>https://forem.com/edysilva</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/edysilva"/>
    <language>en</language>
    <item>
      <title>Stop Putting Best Practices in Skills</title>
      <dc:creator>Edy Silva</dc:creator>
      <pubDate>Fri, 10 Apr 2026 17:06:24 +0000</pubDate>
      <link>https://forem.com/edysilva/stop-putting-best-practices-in-skills-3pof</link>
      <guid>https://forem.com/edysilva/stop-putting-best-practices-in-skills-3pof</guid>
      <description>&lt;p&gt;Vercel demonstrated that &lt;a href="https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals" rel="noopener noreferrer"&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; outperforms skills&lt;/a&gt; in their agent evals. &lt;code&gt;AGENTS.md&lt;/code&gt; hit 100% pass rate on general framework knowledge. Skills with explicit instructions reached 79%. In 56% of cases, the agent had access to a skill but never invoked it. Their conclusion: skills work for vertical, action-specific workflows, not for general best practices.&lt;/p&gt;

&lt;p&gt;Their evals were single-shot, though. One prompt, one response, done. Skills depend on context to be called. The model sees a name and a one-line description and has to decide in a single cold shot whether to invoke. In a real session, you go back and forth, context accumulates, the model picks up patterns. Single-shot penalizes skills by testing them in conditions nobody actually uses them in.&lt;/p&gt;

&lt;p&gt;Then there’s &lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;Superpowers&lt;/a&gt;. People install it, and Claude Code starts following TDD, writing plans before coding, and debugging systematically. It bundles best practices as skills and people swear by it. If skills are supposed to lose to &lt;code&gt;AGENTS.md&lt;/code&gt;, why does Superpowers work so well?&lt;/p&gt;

&lt;p&gt;I ran 51 multi-turn evals across 4 configurations, replicated Vercel’s experiment in realistic multi-turn sessions, and read Claude Code’s source to understand the mechanics. Skills and &lt;code&gt;CLAUDE.md&lt;/code&gt; are both just prompts. Same markdown, same model. The only difference is whether the prompt reaches the model. &lt;code&gt;CLAUDE.md&lt;/code&gt; reaches it every time. Skills depend on a chain of decisions that fails 34-94% of the time. And Superpowers works not because of skills, but because its hook bypasses the skill system entirely, approximating what &lt;code&gt;CLAUDE.md&lt;/code&gt; does natively.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Skills and CLAUDE.md are both just prompts. When skills get invoked, they work just as well. The problem is they only get invoked 6-66% of the time. CLAUDE.md is always in context. Put guidelines in CLAUDE.md; use skills for on-demand recipes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Contents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How skills actually work in Claude Code
&lt;/li&gt;
&lt;li&gt;  The activation gap
&lt;/li&gt;
&lt;li&gt;  The multi-turn eval
&lt;/li&gt;
&lt;li&gt;  Results
&lt;/li&gt;
&lt;li&gt;  Why this happens
&lt;/li&gt;
&lt;li&gt;  Skills are recipes, CLAUDE.md is the health code
&lt;/li&gt;
&lt;li&gt;  Full turn-by-turn results
&lt;/li&gt;
&lt;li&gt;  Methodology
&lt;/li&gt;
&lt;li&gt;  What to do with this
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How skills actually work in Claude Code
&lt;/h2&gt;

&lt;p&gt;Before the data, here's how skills work under the hood. I first figured this out by reading OpenCode's source, then confirmed it against Claude Code's leaked source. I reference file paths from that codebase throughout this section. Search GitHub and you'll find them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discovery: name and description
&lt;/h3&gt;

&lt;p&gt;When Claude Code starts a session, it scans for skills across three levels (&lt;code&gt;src/skills/loadSkillsDir.ts&lt;/code&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Managed&lt;/strong&gt; – &lt;code&gt;/etc/claude-code/.claude/skills/&lt;/code&gt; (org-wide)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;User&lt;/strong&gt; – &lt;code&gt;~/.claude/skills/&lt;/code&gt; (your personal skills)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Project&lt;/strong&gt; – &lt;code&gt;.claude/skills/&lt;/code&gt; (checked into the repo)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For each skill directory, it reads the &lt;code&gt;SKILL.md&lt;/code&gt; frontmatter – &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, and optional fields like &lt;code&gt;context&lt;/code&gt;, &lt;code&gt;allowed-tools&lt;/code&gt;, &lt;code&gt;arguments&lt;/code&gt;. The full markdown body stays on disk.&lt;/p&gt;
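&lt;p&gt;For reference, a minimal &lt;code&gt;SKILL.md&lt;/code&gt; using those fields could look like this (the content is illustrative; only the frontmatter keys come from the source). Only &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; leave this file at session start; everything below the frontmatter stays on disk until invocation:&lt;/p&gt;

```markdown
---
name: test-driven-development
description: Use when implementing any feature or bugfix
allowed-tools: Read, Write, Bash
---

# Test-Driven Development

Write a failing test before any implementation code.
Run the test, watch it fail, then implement until it passes.
```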

&lt;p&gt;At session init, only name and description reach the model. &lt;code&gt;formatCommandDescription()&lt;/code&gt; in &lt;code&gt;src/tools/SkillTool/prompt.ts&lt;/code&gt; produces one line per skill: &lt;code&gt;- {name}: {description}&lt;/code&gt;. The listing is built by &lt;code&gt;getSkillListingAttachments()&lt;/code&gt; in &lt;code&gt;src/utils/attachments.ts&lt;/code&gt;, which formats these lines within a token budget (1% of the context window, max 250 chars per description) and sends them as a &lt;code&gt;&amp;lt;system-reminder&amp;gt;&lt;/code&gt; user message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The following skills are available for use with the Skill tool:

- test-driven-development: Use when implementing any feature or bugfix
- systematic-debugging: Use when encountering any bug or test failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two short strings. The model knows skills exist, but it hasn't read them. This is the bottleneck. The model sees &lt;em&gt;what&lt;/em&gt; is available, not &lt;em&gt;how&lt;/em&gt; to apply it. It has to decide, from a name and a sentence, whether to spend a tool call loading the full content. In my evals, it almost never does.&lt;/p&gt;
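&lt;p&gt;The listing assembly can be sketched like this. It is a simplified stand-in for &lt;code&gt;formatCommandDescription()&lt;/code&gt; and &lt;code&gt;getSkillListingAttachments()&lt;/code&gt;, not the actual source; the 1%-of-context token budget is approximated here with a crude four-characters-per-token estimate:&lt;/p&gt;

```typescript
// Sketch of the skill-listing assembly described above. Names mirror the
// source files mentioned in the text; the logic is a simplified stand-in.
interface SkillMeta {
  name: string;
  description: string;
}

const MAX_DESC_CHARS = 250; // per-description cap noted above

function formatCommandDescription(skill: SkillMeta): string {
  // One line per skill: "- {name}: {description}", description truncated.
  const desc = skill.description.slice(0, MAX_DESC_CHARS);
  return `- ${skill.name}: ${desc}`;
}

function buildSkillListing(skills: SkillMeta[], tokenBudget: number): string {
  const header =
    "The following skills are available for use with the Skill tool:\n";
  let out = header;
  for (const skill of skills) {
    const line = formatCommandDescription(skill) + "\n";
    // Crude 4-chars-per-token estimate; stop once the budget is exhausted.
    if ((out.length + line.length) / 4 > tokenBudget) break;
    out += line;
  }
  return out;
}
```

&lt;p&gt;The output is exactly the kind of two-line reminder shown above, which is all the model has to go on when deciding whether a skill is worth a tool call.&lt;/p&gt;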

&lt;h3&gt;
  
  
  Invocation: full content on demand
&lt;/h3&gt;

&lt;p&gt;When the model decides a skill is relevant, it calls the &lt;code&gt;Skill&lt;/code&gt; tool (&lt;code&gt;src/tools/SkillTool/SkillTool.ts&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Skill"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"test-driven-development"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SkillTool&lt;/code&gt; loads the full &lt;code&gt;SKILL.md&lt;/code&gt; content via &lt;code&gt;getPromptForCommand()&lt;/code&gt;, substitutes variables like &lt;code&gt;$ARGUMENTS&lt;/code&gt; and &lt;code&gt;${CLAUDE_SKILL_DIR}&lt;/code&gt;, and injects it into the conversation. Two execution modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Inline&lt;/strong&gt; (default) – content goes directly into the current conversation as a user message. The model reads the instructions and follows them in the same context.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Fork&lt;/strong&gt; (&lt;code&gt;context: fork&lt;/code&gt; in frontmatter) – spawns a sub-agent via &lt;code&gt;executeForkedSkill()&lt;/code&gt; with isolated context and its own token budget. The result comes back without bloating the parent conversation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same mechanism that lets the model read files or run bash commands. It asks, the runtime reads a markdown file, the content comes back.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLAUDE.md: always in context
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; takes a different path. At session start, Claude Code walks from your working directory up to root, collecting every &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;.claude/CLAUDE.md&lt;/code&gt;, and &lt;code&gt;.claude/rules/*.md&lt;/code&gt; it finds (&lt;code&gt;src/utils/claudemd.ts&lt;/code&gt;). It supports &lt;code&gt;@include&lt;/code&gt; directives – one file pulling in others, up to 5 levels deep.&lt;/p&gt;

&lt;p&gt;All of this feeds into &lt;code&gt;getUserContext()&lt;/code&gt; in &lt;code&gt;src/context.ts&lt;/code&gt;, which gets prepended to the conversation as a &lt;code&gt;&amp;lt;system-reminder&amp;gt;&lt;/code&gt; user message before the model sees anything else (&lt;code&gt;src/utils/api.ts&lt;/code&gt;).&lt;/p&gt;
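&lt;p&gt;The upward walk can be sketched as follows. This is a simplified model of the behavior described above: it omits &lt;code&gt;.claude/rules/*.md&lt;/code&gt; and the &lt;code&gt;@include&lt;/code&gt; resolution, and is not the actual &lt;code&gt;claudemd.ts&lt;/code&gt; code:&lt;/p&gt;

```typescript
import * as fs from "fs";
import * as path from "path";

// Sketch: walk from the working directory up to the filesystem root,
// collecting every CLAUDE.md and .claude/CLAUDE.md along the way.
// (The real loader also picks up .claude/rules/*.md and resolves
// @include directives up to 5 levels deep.)
function collectClaudeMd(startDir: string): string[] {
  const found: string[] = [];
  let dir = path.resolve(startDir);
  while (true) {
    for (const candidate of [
      path.join(dir, "CLAUDE.md"),
      path.join(dir, ".claude", "CLAUDE.md"),
    ]) {
      if (fs.existsSync(candidate)) found.push(candidate);
    }
    const parent = path.dirname(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }
  return found;
}
```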

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;When it loads&lt;/th&gt;
&lt;th&gt;How it loads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md&lt;/td&gt;
&lt;td&gt;Every session, automatically&lt;/td&gt;
&lt;td&gt;Prepended as first user message&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill listings&lt;/td&gt;
&lt;td&gt;Every session, automatically&lt;/td&gt;
&lt;td&gt;Name + description only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill content&lt;/td&gt;
&lt;td&gt;On demand, when model calls Skill tool&lt;/td&gt;
&lt;td&gt;Full markdown injected into conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is always there. Skill content waits for the model to decide it’s relevant and call the tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Superpowers actually does
&lt;/h3&gt;

&lt;p&gt;Superpowers registers a &lt;code&gt;SessionStart&lt;/code&gt; hook. When a session begins, the hook runs a shell script that reads the &lt;code&gt;using-superpowers&lt;/code&gt; skill from disk and outputs it as &lt;code&gt;additionalContext&lt;/code&gt; in the hook response. Claude Code injects that into the conversation as a &lt;code&gt;&amp;lt;system-reminder&amp;gt;&lt;/code&gt; message.&lt;/p&gt;

&lt;p&gt;The content is aggressive. The skill wraps instructions in &lt;code&gt;&amp;lt;EXTREMELY_IMPORTANT&amp;gt;&lt;/code&gt; tags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.

This is not negotiable. This is not optional. You cannot rationalize your way out of this.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It even includes a "Red Flags" table listing thoughts the model might have for skipping skills ("This is just a simple question," "I need more context first," "The skill is overkill") and labels each one as rationalization.&lt;/p&gt;

&lt;p&gt;So Superpowers doesn't wait for the model to discover skills. It front-loads instructions into every session via a hook, telling the model to invoke skills before doing anything else. This is basically the same idea as having a &lt;code&gt;CLAUDE.md&lt;/code&gt; with a hint ("invoke the relevant skill before coding"), just louder. Better than plain skills, but still not the same as having the actual guidelines in &lt;code&gt;CLAUDE.md&lt;/code&gt; from the start. The model still has to invoke the skill, read the content, and follow it. Three steps that can fail. &lt;code&gt;CLAUDE.md&lt;/code&gt; skips all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  The activation gap
&lt;/h2&gt;

&lt;p&gt;I call this the activation gap. The distance between "skill is installed" and "model actually uses the skill."&lt;/p&gt;

&lt;p&gt;I ran single-shot evals first to confirm Vercel's numbers. 31 tasks across React and Next.js (&lt;a href="https://github.com/geeksilva97/react-best-practices-eval" rel="noopener noreferrer"&gt;react-best-practices-eval&lt;/a&gt;, &lt;a href="https://github.com/geeksilva97/nextjs-agents-md-eval" rel="noopener noreferrer"&gt;nextjs-agents-md-eval&lt;/a&gt;) and a 10-task Superpowers benchmark (&lt;a href="https://github.com/geeksilva97/superpowers-eval" rel="noopener noreferrer"&gt;superpowers-eval&lt;/a&gt;). Similar results. Vanilla skills: 0% invocation. The model never opens the drawer on its own. &lt;code&gt;AGENTS.md&lt;/code&gt;: 76-90% pass rate.&lt;/p&gt;

&lt;p&gt;But as I mentioned, single-shot isn't how people work. So I built a multi-turn eval suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  The multi-turn eval
&lt;/h2&gt;

&lt;p&gt;5 scenarios, 3-4 turns each. Plain Node.js/TypeScript so framework knowledge isn't a confounding variable. The prompts are the kind of thing you'd actually type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: TDD – Email Validator&lt;/strong&gt; (4 turns)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Expected workflow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;"Build a function that validates email addresses. It should handle basic formats like &lt;a href="mailto:user@domain.com"&gt;user@domain.com&lt;/a&gt; and reject obviously invalid ones like missing @ or empty strings."&lt;/td&gt;
&lt;td&gt;TDD: write tests first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;"Now add support for international emails – addresses with unicode characters in the local part and IDN domains like user@münchen.de."&lt;/td&gt;
&lt;td&gt;TDD: extend tests first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;"I found a bug – plus aliases like &lt;a href="mailto:user+tag@gmail.com"&gt;user+tag@gmail.com&lt;/a&gt; are being rejected. Fix it."&lt;/td&gt;
&lt;td&gt;Debug: reproduce with failing test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;"Refactor to separate the parsing logic from the validation logic."&lt;/td&gt;
&lt;td&gt;Refactor: ensure tests pass after&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Debugging – Broken LRU Cache&lt;/strong&gt; (3 turns)&lt;/p&gt;

&lt;p&gt;Starts with a buggy LRU cache implementation (eviction check uses &lt;code&gt;&amp;gt;=&lt;/code&gt; instead of &lt;code&gt;&amp;gt;&lt;/code&gt;, causing items to "disappear").&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Expected workflow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;"This LRU cache is broken – items seem to disappear even when the cache isn't full. Can you figure out what's wrong and fix it?"&lt;/td&gt;
&lt;td&gt;Debug: reproduce, find root cause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;"It works now but it's really slow when the cache size is large – like 10000 entries. Can you improve the performance?"&lt;/td&gt;
&lt;td&gt;Debug: reason about complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;"Add tests to make sure these bugs don't come back."&lt;/td&gt;
&lt;td&gt;TDD: write regression tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: Planning – Rate Limiter&lt;/strong&gt; (3 turns)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Expected workflow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;"I need a rate limiter for an API. Limit each client to 100 requests per minute. Give me a plan before coding."&lt;/td&gt;
&lt;td&gt;Plan: present approach first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;"Actually, fixed window won't work for my use case – requests cluster at window boundaries and burst through. I need sliding window instead."&lt;/td&gt;
&lt;td&gt;Plan: revise, explain trade-offs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;"Implement it and add tests."&lt;/td&gt;
&lt;td&gt;TDD: write tests, then implement&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4: Refactoring – Express Middleware&lt;/strong&gt; (4 turns)&lt;/p&gt;

&lt;p&gt;Starts with a 160-line monolithic middleware handling auth, logging, rate limiting, and error handling.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Expected workflow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;"This middleware file is 300 lines and handles auth, logging, rate limiting, and error handling all in one. Help me understand what it does."&lt;/td&gt;
&lt;td&gt;Analysis: read and explain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;"Split it into separate, focused middleware files."&lt;/td&gt;
&lt;td&gt;Refactor: restructure safely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;"The auth middleware broke after the split – requests that should require auth are passing through without a token."&lt;/td&gt;
&lt;td&gt;Debug: reproduce, identify regression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;"Add tests for each middleware so we catch this kind of thing."&lt;/td&gt;
&lt;td&gt;TDD: write isolated tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Scenario 5: Mixed – HTTP Client Retry&lt;/strong&gt; (3 turns)&lt;/p&gt;

&lt;p&gt;Starts with a basic HTTP client without retry logic.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Expected workflow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;"Add retry with exponential backoff to this HTTP client. It should retry on 5xx errors and network failures, up to 3 retries."&lt;/td&gt;
&lt;td&gt;TDD or plan first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;"It's retrying on 400 Bad Request errors too. That's wrong – 4xx should fail immediately without retrying."&lt;/td&gt;
&lt;td&gt;Debug: identify status code bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;"Add tests covering the retry logic – success on first try, retry on 5xx, no retry on 4xx, max retries exceeded."&lt;/td&gt;
&lt;td&gt;TDD: comprehensive test suite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each scenario crosses workflow boundaries. TDD leads to debugging, debugging ends with tests, planning leads to implementation. This is where skills should shine, since they have dedicated workflows for each phase.&lt;/p&gt;

&lt;p&gt;Four configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;What the model gets&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Superpowers&lt;/td&gt;
&lt;td&gt;SessionStart hook + skills (the real plugin experience)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain skills&lt;/td&gt;
&lt;td&gt;Same skills installed, no hook, no hint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md&lt;/td&gt;
&lt;td&gt;Equivalent guidelines written as static rules, always in context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md + hint&lt;/td&gt;
&lt;td&gt;One-liner in CLAUDE.md saying "invoke the relevant skill before coding" + skills installed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same model (&lt;code&gt;claude-opus-4-6&lt;/code&gt;), same tasks, same workspace setup. All runs executed in a clean environment with &lt;code&gt;~/.claude/plugins&lt;/code&gt;, &lt;code&gt;~/.claude/skills&lt;/code&gt;, &lt;code&gt;~/.claude/settings.json&lt;/code&gt;, and &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; temporarily disabled. Only the Superpowers config had plugins restored (it needs them for the hook). Each user turn was capped at 15 agentic turns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Skill invocations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Invocations&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Superpowers (hook)&lt;/td&gt;
&lt;td&gt;10/15&lt;/td&gt;
&lt;td&gt;66%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md + hint&lt;/td&gt;
&lt;td&gt;5/15&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain skills&lt;/td&gt;
&lt;td&gt;1/15&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md (guidelines)&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;8 of 68 total turns hit the 15 max-turns limit (marked &lt;code&gt;MT&lt;/code&gt; in the tables below). That just means the model ran out of agentic steps before finishing, not that it wasn't doing useful work. In most of those turns the model was actively writing tests and implementation; it just needed more steps to complete. Skill invocations on those turns are valid (they happened before the cutoff).&lt;/p&gt;

&lt;p&gt;Multi-turn helps Superpowers a lot. From 10% in single-shot to 66% here. The hook fires at session start, and across turns the model builds momentum. Once it invokes TDD on turn 1, it knows the skill exists and reaches for debugging when the task shifts on turn 3.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CLAUDE.md&lt;/code&gt; hint works, but only in a clean environment. This was the Vercel-style config. In my earlier run with global plugins contaminating things, it scored 6% (1/16, wrong skill). Clean run: 33% (5/15, correct skills). The hint is sensitive to noise. Competing global skills and plugins dilute its effect.&lt;/p&gt;

&lt;p&gt;Plain skills got one spontaneous invocation out of 15 turns. The model invoked &lt;code&gt;systematic-debugging&lt;/code&gt; unprompted on scenario 04, turn 3, after two turns of conversation context. So multi-turn can trigger invocation without a hook, but it's rare.&lt;/p&gt;

&lt;p&gt;Clean environment matters more than I expected. Every config did better in the clean run. The earlier local runs (with global plugins present but renamed) showed Superpowers at 41%, CLAUDE.md+hint at 6%, plain skills at 0%. Clean run: 66%, 33%, 6%. Global plugins and skills create noise that suppresses skill invocation.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Superpowers invokes skills
&lt;/h3&gt;

&lt;p&gt;The pattern is consistent across all runs (local, Docker, and clean):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Turn 1&lt;/th&gt;
&lt;th&gt;Turn 2&lt;/th&gt;
&lt;th&gt;Turn 3&lt;/th&gt;
&lt;th&gt;Turn 4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;TDD&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;debugging&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU cache&lt;/td&gt;
&lt;td&gt;debugging&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;TDD&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;brainstorming&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;TDD&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;debugging&lt;/td&gt;
&lt;td&gt;TDD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;brainstorming&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;verification&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Skills fire at transitions, when the workflow changes (coding to debugging, debugging to testing). On continuation turns the model doesn't re-invoke. It keeps the momentum from the previous invocation. Which makes sense. You don't re-read the TDD manual every time you write a new test.&lt;/p&gt;

&lt;h3&gt;
  
  
  TDD compliance
&lt;/h3&gt;

&lt;p&gt;Skill invocations are one metric. Did the agent actually follow the workflow? I checked whether test files were written before implementation files on the key TDD turns.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Superpowers&lt;/th&gt;
&lt;th&gt;Plain Skills&lt;/th&gt;
&lt;th&gt;CLAUDE.md&lt;/th&gt;
&lt;th&gt;CLAUDE.md + hint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;01 email t1&lt;/td&gt;
&lt;td&gt;test first&lt;/td&gt;
&lt;td&gt;impl first&lt;/td&gt;
&lt;td&gt;test first&lt;/td&gt;
&lt;td&gt;test first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU t1&lt;/td&gt;
&lt;td&gt;test first&lt;/td&gt;
&lt;td&gt;test first&lt;/td&gt;
&lt;td&gt;test first&lt;/td&gt;
&lt;td&gt;test first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter t3&lt;/td&gt;
&lt;td&gt;test first&lt;/td&gt;
&lt;td&gt;impl first&lt;/td&gt;
&lt;td&gt;test first MT&lt;/td&gt;
&lt;td&gt;impl first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;test first (t2)&lt;/td&gt;
&lt;td&gt;test only (t3)&lt;/td&gt;
&lt;td&gt;test first (t1)&lt;/td&gt;
&lt;td&gt;test first (t1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's the thing: Superpowers and &lt;code&gt;CLAUDE.md&lt;/code&gt; are basically tied. Both wrote tests first on 4 out of 4 measured scenarios. &lt;code&gt;CLAUDE.md + hint&lt;/code&gt; got 3/4. Plain skills got 1/4.&lt;/p&gt;

&lt;p&gt;Having guidelines in &lt;code&gt;CLAUDE.md&lt;/code&gt; wasn't necessarily &lt;em&gt;better&lt;/em&gt; at making the model follow TDD. When Superpowers fires, the workflow quality is just as good. They're all prompt. Same markdown, same instructions, same model. The only difference is whether the prompt reaches the model. &lt;code&gt;CLAUDE.md&lt;/code&gt; reaches it every time. Superpowers reaches it 66% of the time.&lt;/p&gt;

&lt;p&gt;The interesting case is scenario 04 (refactor middleware, turn 2). No config wrote tests before refactoring. They all jumped straight to splitting the middleware into files. The "write tests before restructuring" guideline needs to be stronger, regardless of delivery mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;Both &lt;code&gt;CLAUDE.md&lt;/code&gt; and skill listings arrive through the same channel: &lt;code&gt;&amp;lt;system-reminder&amp;gt;&lt;/code&gt;-wrapped user messages. No architectural trust difference. The difference is just presence.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; content is always in the context window. Every turn, every decision, the guidelines are right there. Skill content requires the model to read the name+description listing, decide the skill is relevant, call the &lt;code&gt;Skill&lt;/code&gt; tool, wait for the content, then follow it. Each step can fail.&lt;/p&gt;

&lt;p&gt;So the activation gap isn't a quality problem. It's a reliability problem. When skills get invoked, they work. They just don't always get invoked. Superpowers gets to 66% in clean multi-turn. The &lt;code&gt;CLAUDE.md&lt;/code&gt; hint gets 33%. Neither reaches 100%.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; gets 100% presence. No invocation needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills are recipes, CLAUDE.md is the health code
&lt;/h2&gt;

&lt;p&gt;Think of it like a kitchen.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is the health code. Wash your hands, sanitize surfaces, check temperatures. Every cook follows these rules on every shift. They're non-negotiable and always visible, posted on the wall. You don't wait for someone to ask "should I wash my hands before touching food?" It's the baseline.&lt;/p&gt;

&lt;p&gt;Skills are recipes. You pull the recipe for bouillabaisse when someone orders bouillabaisse. You don't tape every recipe to the wall next to the health code. That's noise. Recipes have their moment. The health code is constant.&lt;/p&gt;

&lt;p&gt;Superpowers tries to turn recipes into the health code by having a hook shout "READ THE RECIPES" at the start of every shift. It works most of the time. But you could just put the important rules on the wall.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is for guidelines. Conventions, coding standards, workflow rules, TDD processes, debugging protocols. Anything the agent must follow every session. "Write tests before implementation" is a health code rule. It goes in &lt;code&gt;CLAUDE.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Skills are for recipes. Specific, on-demand procedures you invoke when the moment calls for it. "Generate a database migration," "scaffold a component," "run the release checklist." These don't need to be in context all the time. They need to be there when you ask for them. Use &lt;code&gt;context: fork&lt;/code&gt; for heavy recipes that would bloat the main context.&lt;/p&gt;

&lt;p&gt;Hooks are for automation, not instruction delivery. Pre-commit validation, linting, notifications. If you're using a hook to inject guidelines (like Superpowers does), it works at 66% in clean multi-turn, but &lt;code&gt;CLAUDE.md&lt;/code&gt; would do the same job at 100% with zero activation gap.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Presence&lt;/th&gt;
&lt;th&gt;Invocation needed&lt;/th&gt;
&lt;th&gt;Clean multi-turn rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md (health code)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;n/a, always there&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Superpowers (hook + recipes)&lt;/td&gt;
&lt;td&gt;Hook: 100%, Content: 66%&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;66%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md + hint + skills&lt;/td&gt;
&lt;td&gt;100% (hint), 33% (content)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain skills (recipes on shelf)&lt;/td&gt;
&lt;td&gt;Listing only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;General guidelines don't belong in skills. Skills are not how you say "always do X." They're how you say "when you need to do Y, here's how."&lt;/p&gt;

&lt;h2&gt;
  
  
  Full turn-by-turn results
&lt;/h2&gt;

&lt;p&gt;Every turn, every config. &lt;code&gt;MT&lt;/code&gt; marks turns that hit the 15 max-turns limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Superpowers (hook + skills)&lt;/strong&gt; – 10/15 invocations (66%)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Skill invoked&lt;/th&gt;
&lt;th&gt;First file written&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t1 (tdd)&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;validateEmail.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t2 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;– (used Edit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t3 (debug)&lt;/td&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t4 (refactor)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;validateEmail.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t1 (debug)&lt;/td&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;lru-cache.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t2 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;lru-cache.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t1 (plan)&lt;/td&gt;
&lt;td&gt;brainstorming&lt;/td&gt;
&lt;td&gt;– (planning, no code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t2 (plan)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;rate-limiter.test.ts (test first) MT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t1 (analysis)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;– (reading code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t2 (refactor)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;logging.ts (impl first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t3 (debug)&lt;/td&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;middleware.test.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t4 (tdd)&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;logging.test.ts (test first) MT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t1 (tdd)&lt;/td&gt;
&lt;td&gt;brainstorming&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t2 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;http-client.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;verification-before-completion&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Plain skills (no hook, no hint)&lt;/strong&gt; – 1/15 invocations (6%)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Skill invoked&lt;/th&gt;
&lt;th&gt;First file written&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t1 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;validateEmail.ts (impl first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t2 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t3 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t4 (refactor)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;validateEmail.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t1 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;lru-cache.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t2 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;lru-cache.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t1 (plan)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;scalable-bubbling-lagoon.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t2 (plan)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;types.ts (impl first) MT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t1 (analysis)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t2 (refactor)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;logging.ts (impl first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t3 (debug)&lt;/td&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;middleware.test.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t4 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;logging.test.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t1 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t2 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;http-client.test.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.md (guidelines, no skills)&lt;/strong&gt; – 0/14 invocations (n/a)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;First file written&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t1 (tdd)&lt;/td&gt;
&lt;td&gt;validateEmail.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t2 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t3 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t4 (refactor)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t1 (debug)&lt;/td&gt;
&lt;td&gt;lru-cache.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t2 (debug)&lt;/td&gt;
&lt;td&gt;bench.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t1 (plan)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t2 (plan)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;rate-limiter.test.ts (test first) MT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t1 (analysis)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t2 (refactor)&lt;/td&gt;
&lt;td&gt;tests first (4 test files before impl) MT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t3 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t4 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t1 (tdd)&lt;/td&gt;
&lt;td&gt;http-client.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t2 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.md + hint (skills installed)&lt;/strong&gt; – 5/15 invocations (33%)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Skill invoked&lt;/th&gt;
&lt;th&gt;First file written&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t1 (tdd)&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;validateEmail.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t2 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t3 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01 email&lt;/td&gt;
&lt;td&gt;t4 (refactor)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;validateEmail.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t1 (debug)&lt;/td&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;lru-cache.test.ts (test first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t2 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;lru-cache.ts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02 LRU&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t1 (plan)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;snazzy-juggling-glacier.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t2 (plan)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;rate-limiter.ts (impl first)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03 rate limiter&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;types.ts (impl first) MT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t1 (analysis)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t2 (refactor)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t3 (debug)&lt;/td&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04 middleware&lt;/td&gt;
&lt;td&gt;t4 (tdd)&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;logging.test.ts (test first) MT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t1 (tdd)&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;http-client.test.ts (test first) MT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t2 (debug)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05 HTTP retry&lt;/td&gt;
&lt;td&gt;t3 (tdd)&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Execution
&lt;/h3&gt;

&lt;p&gt;Each scenario runs as a multi-turn &lt;code&gt;claude -p&lt;/code&gt; session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Turn 1: fresh session&lt;/span&gt;
claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--model&lt;/span&gt; claude-opus-4-6 &lt;span class="nt"&gt;--output-format&lt;/span&gt; stream-json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--verbose&lt;/span&gt; &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt; &lt;span class="nt"&gt;--max-turns&lt;/span&gt; 15 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROMPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; turn-1.jsonl

&lt;span class="c"&gt;# Extract session ID&lt;/span&gt;
&lt;span class="nv"&gt;sid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s1"&gt;'"session_id":"[^"]*"'&lt;/span&gt; turn-1.jsonl | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;'"'&lt;/span&gt; &lt;span class="nt"&gt;-f4&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Turn 2+: resume same session&lt;/span&gt;
claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--model&lt;/span&gt; claude-opus-4-6 &lt;span class="nt"&gt;--output-format&lt;/span&gt; stream-json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--verbose&lt;/span&gt; &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt; &lt;span class="nt"&gt;--max-turns&lt;/span&gt; 15 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resume&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$sid&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROMPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; turn-2.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment isolation
&lt;/h3&gt;

&lt;p&gt;All runs disabled user-level configuration to prevent contamination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disabled at start, restored on exit (trap)&lt;/span&gt;
~/.claude/plugins      -&amp;gt; ~/.claude/plugins.eval-disabled
~/.claude/skills       -&amp;gt; ~/.claude/skills.eval-disabled
~/.claude/settings.json -&amp;gt; ~/.claude/settings.json.eval-disabled
~/.claude/CLAUDE.md    -&amp;gt; ~/.claude/CLAUDE.md.eval-disabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the Superpowers config re-enabled &lt;code&gt;~/.claude/plugins&lt;/code&gt; (the hook + skills come from the plugin). OAuth auth stays in the macOS keychain, unaffected by the rename.&lt;/p&gt;

&lt;h3&gt;
  
  
  Config setup per workspace
&lt;/h3&gt;

&lt;p&gt;Each scenario gets a fresh &lt;code&gt;/tmp&lt;/code&gt; workspace with &lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;tsconfig.json&lt;/code&gt;, seed files (if any), and &lt;code&gt;npm install&lt;/code&gt;. Then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Superpowers&lt;/strong&gt;: plugin provides the SessionStart hook + skills via &lt;code&gt;.claude/settings.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Plain skills&lt;/strong&gt;: skills copied into workspace &lt;code&gt;.claude/skills/&lt;/code&gt;, no hook&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CLAUDE.md&lt;/strong&gt;: &lt;code&gt;CLAUDE.md&lt;/code&gt; with equivalent TDD/debugging/planning guidelines&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CLAUDE.md + hint&lt;/strong&gt;: &lt;code&gt;CLAUDE.md&lt;/code&gt; with "Before writing code, first explore the project structure, then invoke the relevant skill for the task at hand." + skills copied into workspace &lt;code&gt;.claude/skills/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Max turns
&lt;/h3&gt;

&lt;p&gt;Each turn was capped at 15 agentic steps (&lt;code&gt;--max-turns 15&lt;/code&gt;). 8 of 68 turns hit this limit. The affected turns are marked with &lt;code&gt;MT&lt;/code&gt; in the results tables. In most MT turns the model was actively writing tests and code; it just needed more steps to finish. Appending &lt;code&gt;|| true&lt;/code&gt; to each invocation keeps the non-zero exit code of a truncated turn from killing the runner script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measurement
&lt;/h3&gt;

&lt;p&gt;Skill invocations are extracted from stream-json transcripts by searching for &lt;code&gt;"name":"Skill"&lt;/code&gt; in assistant messages. TDD compliance is measured by the order of &lt;code&gt;Write&lt;/code&gt; tool calls – whether test files (&lt;code&gt;.test.ts&lt;/code&gt;, &lt;code&gt;.spec.ts&lt;/code&gt;) appear before implementation files.&lt;/p&gt;
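&lt;p&gt;Concretely, the extraction uses the same &lt;code&gt;grep&lt;/code&gt; approach the runner uses for session IDs. The transcript line below is a simplified stand-in for the real stream-json shape (which nests the tool call inside the assistant message content):&lt;/p&gt;

```shell
# Fake one assistant transcript line containing a Skill tool call (simplified shape)
printf '%s\n' '{"type":"assistant","name":"Skill","input":{"command":"test-driven-development"}}' | tee /tmp/turn-demo.jsonl

# Count Skill invocations in the turn -- prints 1
grep -c '"name":"Skill"' /tmp/turn-demo.jsonl

# Pull out which skill was invoked -- prints test-driven-development
grep -o '"command":"[^"]*"' /tmp/turn-demo.jsonl | cut -d'"' -f4
```

The same pattern, run over the ordered &lt;code&gt;Write&lt;/code&gt; tool calls, yields the "first file written" column.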

&lt;h3&gt;
  
  
  Reproducibility
&lt;/h3&gt;

&lt;p&gt;A Dockerfile is included for fully isolated runs (requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; multiturn-eval &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$KEY&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./results:/home/evaluser/eval/results &lt;span class="se"&gt;\&lt;/span&gt;
  multiturn-eval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I validated the eval across three environments: local with plugin rename, Docker with zero user config, and local with full config disabled. Superpowers invocation patterns were identical across all three.&lt;/p&gt;

&lt;p&gt;The repos are open if you want to reproduce or poke at the data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://github.com/geeksilva97/react-best-practices-eval" rel="noopener noreferrer"&gt;react-best-practices-eval&lt;/a&gt; – 10 single-shot React tasks&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/geeksilva97/nextjs-agents-md-eval" rel="noopener noreferrer"&gt;nextjs-agents-md-eval&lt;/a&gt; – 21 single-shot Next.js tasks&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/geeksilva97/superpowers-eval" rel="noopener noreferrer"&gt;superpowers-eval&lt;/a&gt; – Superpowers invocation benchmark&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/geeksilva97/multiturn-eval" rel="noopener noreferrer"&gt;multiturn-eval&lt;/a&gt; – 20 multi-turn scenarios across 4 configs (this post)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to do with this
&lt;/h2&gt;

&lt;p&gt;If you're setting up Claude Code for a project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;TDD, debugging protocols, code style, naming conventions&lt;/strong&gt; go in &lt;code&gt;CLAUDE.md&lt;/code&gt;. These are rules you want followed every session. No invocation, no activation gap, 100% presence.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;"Scaffold a service," "generate a migration," "run the release checklist"&lt;/strong&gt; go in skills. These are procedures you call when you need them. Use &lt;code&gt;context: fork&lt;/code&gt; if they're heavy.&lt;/li&gt;
&lt;li&gt;  If you need &lt;code&gt;CLAUDE.md&lt;/code&gt; to reference extra documentation, make it an index. Point to files. Same pattern Claude Code uses for its own memory: a root file that links to specifics.&lt;/li&gt;
&lt;li&gt;  If you're using Superpowers and it works for you, keep using it. Now you know why it works (the hook) and where it drops off (34% of turns in multi-turn, more in single-shot). Moving your guidelines to &lt;code&gt;CLAUDE.md&lt;/code&gt; would close that gap.&lt;/li&gt;
&lt;/ul&gt;
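&lt;p&gt;An index-style &lt;code&gt;CLAUDE.md&lt;/code&gt; can stay tiny. The paths here are examples, not a prescription:&lt;/p&gt;

```markdown
# Project guidelines

- Always write the failing test before the implementation.
- Coding conventions: docs/conventions.md
- Architecture decisions: docs/adr/
- Release process: docs/release.md
```

The rules you want followed every turn live inline; everything else is a pointer the agent can read on demand.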

&lt;p&gt;Skills are not broken. They're just not for guidelines.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Five Things Your Coding AI Agent Wishes You Understood</title>
      <dc:creator>Edy Silva</dc:creator>
      <pubDate>Thu, 09 Apr 2026 16:33:56 +0000</pubDate>
      <link>https://forem.com/edysilva/five-things-your-coding-ai-agent-wishes-you-understood-37pj</link>
      <guid>https://forem.com/edysilva/five-things-your-coding-ai-agent-wishes-you-understood-37pj</guid>
      <description>&lt;p&gt;I keep seeing the same frustration everywhere - Reddit, Discord, Twitter. Someone tells their coding agent to follow a specific pattern, the agent nails it, and then five prompts later, it acts like the conversation never happened. "Why does it keep forgetting?" "Is this a bug?" "I literally just told it that."&lt;/p&gt;

&lt;p&gt;It’s not a bug. It’s context compression doing exactly what it’s supposed to do. But most people using coding agents – Claude Code, OpenCode, Cursor, Cline – have no idea what’s happening under the hood. They treat agents like black boxes. Magical black boxes.&lt;/p&gt;

&lt;p&gt;They’re not magical. They’re not even that complex. And once you understand the few core concepts behind them, you’ll stop fighting the tool and start getting way more out of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Everything is prompting
&lt;/h2&gt;

&lt;p&gt;If you take one thing from this post, make it this.&lt;/p&gt;

&lt;p&gt;Every behavior you see from a coding agent – every "feature", every "skill", every "personality trait" – is the result of a prompt. A system prompt that you never see, but that’s there, shaping every response.&lt;/p&gt;

&lt;p&gt;When Claude Code feels opinionated about code style, that’s a prompt. When it asks for confirmation before running destructive commands, that’s a prompt. When it formats responses in a certain way, that’s a prompt.&lt;/p&gt;

&lt;p&gt;There’s no hidden reasoning engine. No special agent architecture is doing something fundamentally different from what happens when you type in a chat window. It’s an LLM receiving text and generating text. The "agent" part is a loop: prompt the model, parse the output, execute any tool calls, feed the results back, repeat.&lt;/p&gt;
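&lt;p&gt;The whole loop fits in a few lines of shell-flavored pseudocode. &lt;code&gt;call_llm&lt;/code&gt;, &lt;code&gt;parse_tool_call&lt;/code&gt;, and &lt;code&gt;execute_tool&lt;/code&gt; are hypothetical helpers standing in for the real runtime:&lt;/p&gt;

```shell
# Pseudocode: the entire "agent" is this loop
history="$system_prompt
$user_prompt"
while true; do
  response=$(call_llm "$history")            # text in, text out
  tool_call=$(parse_tool_call "$response")   # empty if the model gave a final answer
  if [ -z "$tool_call" ]; then
    break                                    # no tool call: we are done
  fi
  result=$(execute_tool "$tool_call")        # the runtime runs the actual function
  history="$history
$response
$result"                                     # feed the result back, repeat
done
```

Everything else – skills, hooks, memory – is just different ways of deciding what text goes into &lt;code&gt;history&lt;/code&gt; before the next call.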

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhlcessb3ap881ehvxh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhlcessb3ap881ehvxh3.png" alt="The LLM prompting pipeline - system prompt, user prompt, LLM, assistant output" width="434" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This matters because once you internalize it, you realize that &lt;strong&gt;you can influence agent behavior the same way the system prompt does&lt;/strong&gt; – by writing better instructions. Your &lt;code&gt;CLAUDE.md&lt;/code&gt; file, your prompts, your corrections mid-conversation – they’re all part of the same mechanism. You’re not "configuring" the agent. You’re prompting it.&lt;/p&gt;

&lt;p&gt;Here’s what a real Claude Code session looks like from the inside. This is the &lt;code&gt;/context&lt;/code&gt; command output – it shows you exactly what’s occupying the context window:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8l3v4e0origk3lp0vif.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8l3v4e0origk3lp0vif.webp" alt="Claude Code context usage breakdown showing system prompt, tools, memory files, skills, messages, free space, and autocompact buffer" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;System prompt, system tools, memory files, skills, messages. That’s it. That’s the whole agent. Text in, text out. Every category you see there is just text being fed to the model before it generates a response.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tools are just functions described in text
&lt;/h2&gt;

&lt;p&gt;When a coding agent reads a file, runs a shell command, or searches your codebase, it’s not using some privileged internal API. It’s calling a tool. And a tool, from the LLM’s perspective, is just a JSON description of a function.&lt;/p&gt;

&lt;p&gt;The model sees something like: "There’s a tool called &lt;code&gt;Read&lt;/code&gt; that takes a &lt;code&gt;file_path&lt;/code&gt; parameter and returns the file contents." That’s it. The model decides when to call it, generates the parameters, and the agent runtime executes the actual function and feeds the result back.&lt;/p&gt;
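&lt;p&gt;What the model actually receives is roughly this. The shape follows the Anthropic tool-use format; the description text is paraphrased and trimmed:&lt;/p&gt;

```json
{
  "name": "Read",
  "description": "Reads a file from the local filesystem and returns its contents.",
  "input_schema": {
    "type": "object",
    "properties": {
      "file_path": { "type": "string", "description": "Absolute path to the file" }
    },
    "required": ["file_path"]
  }
}
```

A JSON blob of text, nothing more. The runtime owns the actual function; the model only ever sees this description and generates matching parameters.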

&lt;p&gt;This is important because &lt;strong&gt;the model can only use tools it knows about&lt;/strong&gt;. If a tool isn’t described in the prompt, it doesn’t exist for the model.&lt;/p&gt;

&lt;p&gt;In Claude Code, core tools like &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Edit&lt;/code&gt;, &lt;code&gt;Bash&lt;/code&gt;, and &lt;code&gt;Grep&lt;/code&gt; are always loaded in context. You can see them in the &lt;code&gt;/context&lt;/code&gt; output taking up 8k tokens. MCP tools – integrations you add yourself, like Figma or Slack – are also loaded by default. But this creates a problem: if you have dozens of MCP tools, their descriptions start eating your context window before you even start working.&lt;/p&gt;

&lt;p&gt;Claude Code solves this with on-demand loading. You can control it with the &lt;code&gt;ENABLE_TOOL_SEARCH&lt;/code&gt; environment variable (set to &lt;code&gt;auto&lt;/code&gt; by default, which kicks in when MCP tool descriptions exceed 10% of context). When on-demand loading is active, all those individual MCP tool descriptions get replaced by a single tool: &lt;code&gt;ToolSearch&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Think of it like replacing a long menu with a search bar. The model doesn’t see "Figma screenshot tool, Figma metadata tool, Slack send message tool…" anymore. It sees: "there’s a &lt;code&gt;ToolSearch&lt;/code&gt; tool you can call to find available tools." The system prompt tells the model that deferred tools exist and that it must search before calling them. The model doesn’t know &lt;em&gt;which&lt;/em&gt; tools are available, but it knows &lt;em&gt;something&lt;/em&gt; is there and how to discover it.&lt;/p&gt;

&lt;p&gt;So when you ask the agent to take a Figma screenshot, it calls &lt;code&gt;ToolSearch&lt;/code&gt; with something like "figma screenshot", the runtime searches across all registered MCP tool names and descriptions, and the matching tool gets loaded into context. Only then can the model actually call it. Your MCP servers are still configured in &lt;code&gt;.claude/settings.json&lt;/code&gt; – the runtime knows about all of them, but the model only sees the ones it explicitly searches for.&lt;/p&gt;

&lt;p&gt;Knowing this explains a lot of "weird" behavior. The agent didn’t use the right tool? Maybe it didn’t know about it. The agent called a tool with wrong parameters? It’s guessing from a text description, not from type checking. The agent keeps using &lt;code&gt;cat&lt;/code&gt; instead of the dedicated &lt;code&gt;Read&lt;/code&gt; tool? Its prompt tells it not to, but it’s a probabilistic model – sometimes it drifts.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Skills are tools you define
&lt;/h2&gt;

&lt;p&gt;See the pattern? Core tools are always in context. MCP tools can be loaded on demand via &lt;code&gt;ToolSearch&lt;/code&gt;. Skills follow the exact same pattern – they’re just tools, but ones &lt;em&gt;you&lt;/em&gt; define.&lt;/p&gt;

&lt;p&gt;In Claude Code, skills used to be called "commands." They got renamed, but the mechanism is the same. You create a markdown file in &lt;code&gt;.claude/skills/&lt;/code&gt;, write instructions in it, and the agent treats it as a tool it can call.&lt;/p&gt;

&lt;p&gt;Here’s how it works. At session start, skill &lt;em&gt;descriptions&lt;/em&gt; (the short summary from the frontmatter) get loaded into context – so the model knows what skills exist. You can see this in the &lt;code&gt;/context&lt;/code&gt; output: "Skills: 409 tokens." But the full skill content doesn’t load until it’s invoked. When you type &lt;code&gt;/commit&lt;/code&gt;, the model calls a built-in &lt;code&gt;Skill&lt;/code&gt; tool, which fetches the full markdown file and injects it into context. The model then follows those instructions.&lt;/p&gt;

&lt;p&gt;Same mechanism as &lt;code&gt;ToolSearch&lt;/code&gt;. Same mechanism as the system prompt. It’s all just text being loaded into context at different moments.&lt;/p&gt;

&lt;p&gt;You can create your own skills. Write a markdown file with instructions, put it in &lt;code&gt;.claude/skills/&lt;/code&gt;, and the agent picks it up. When I type &lt;code&gt;/title-generator&lt;/code&gt;, the model calls &lt;code&gt;Skill&lt;/code&gt;, loads my custom markdown file that says "given a topic, produce 5+ title options across different styles using these headline formulas…", and follows it. No different from a built-in skill.&lt;/p&gt;
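&lt;p&gt;That skill file could plausibly look like this – only the &lt;code&gt;description&lt;/code&gt; line sits in context until the skill is invoked (frontmatter fields per the Claude Code skill format; the body paraphrases the instructions described above):&lt;/p&gt;

```markdown
---
name: title-generator
description: Generate 5+ title options for a given topic across different styles
---

Given a topic, produce at least five title options across different styles
(how-to, listicle, contrarian, question, direct claim) using common headline
formulas. Rank them by expected clarity and note which audience each fits.
```

The one-line description is the cheap, always-visible part; the body is the expensive part that loads on demand.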

&lt;p&gt;&lt;strong&gt;The difference between a "built-in feature" and your custom skill is just where the text lives.&lt;/strong&gt; Built-in skills ship with the tool. Yours lives in your project. The LLM treats them exactly the same way. It’s prompting all the way down.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Memory is not what you think
&lt;/h2&gt;

&lt;p&gt;This is where most confusion lives.&lt;/p&gt;

&lt;p&gt;People assume AI agents have memory the way humans do – that things said earlier in a conversation are "remembered" the way you remember what you had for breakfast. They don’t.&lt;/p&gt;

&lt;p&gt;An LLM has &lt;strong&gt;no persistent state between calls&lt;/strong&gt;. Every time the model generates a response, it processes the entire conversation from scratch. What feels like "memory" is actually the conversation history being sent as part of the prompt every single time.&lt;/p&gt;

&lt;p&gt;This has a hard limit: the context window. For Claude, that’s roughly 200K tokens. Sounds like a lot, but tool results add up fast. Read a few files, run some commands, and you’ve already burned through a good portion of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens when context fills up
&lt;/h3&gt;

&lt;p&gt;The agent compresses older messages. It summarizes or drops parts of the conversation to make room for new content. This is why agents "forget" your instructions – they got compressed away.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is not a bug. It’s a design tradeoff. The alternative is that the conversation just stops.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Long-term memory
&lt;/h3&gt;

&lt;p&gt;Context compression is a problem. If the agent forgets your instructions mid-conversation, you need a way to make things stick. That’s what long-term memory is for.&lt;/p&gt;

&lt;p&gt;In Claude Code, there’s a &lt;code&gt;memory/&lt;/code&gt; directory where the agent writes notes that persist across conversations. It loads these files at the start of every session. Here’s what that looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdmdd0t8zphcbxzkfnng.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdmdd0t8zphcbxzkfnng.webp" alt="Claude Code memory files showing CLAUDE.md and MEMORY.md with their token counts" width="800" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is your project instructions file – coding conventions, architecture decisions, things the agent should always know. &lt;code&gt;MEMORY.md&lt;/code&gt; is where the agent stores things it learned during previous conversations – patterns it confirmed, preferences you corrected, decisions you made together.&lt;/p&gt;

&lt;p&gt;Both get injected into the system prompt. Both are just text files on disk.&lt;/p&gt;
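&lt;p&gt;In code, the general shape is something like this – an illustrative guess at the mechanism, not the real implementation; only the file names come from Claude Code:&lt;/p&gt;

```python
# Illustrative sketch of memory-file loading at session start. The file
# names mirror Claude Code's CLAUDE.md / MEMORY.md; the loading logic
# here is an assumption about the general shape, not the actual code.
from pathlib import Path

def build_system_prompt(base: str, memory_dir: Path) -> str:
    parts = [base]
    for name in ("CLAUDE.md", "MEMORY.md"):
        path = memory_dir / name
        if path.exists():
            # Plain text on disk, concatenated into the prompt the
            # model sees at the start of every session.
            parts.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(parts)
```

&lt;p&gt;The key property: this runs at session start, every session. Unlike a mid-conversation message, nothing here can be compressed away before the agent sees it.&lt;/p&gt;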

&lt;h3&gt;
  
  
  What this means for you
&lt;/h3&gt;

&lt;p&gt;If you tell the agent something critical mid-conversation, it might get compressed away later. But if it’s in your memory files, it’ll be there at the start of every conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep your memory files updated.&lt;/strong&gt; That’s it. Put your coding conventions in &lt;code&gt;CLAUDE.md&lt;/code&gt;. Let the agent save patterns and decisions to &lt;code&gt;MEMORY.md&lt;/code&gt;. When you correct the agent on something, tell it to remember. Don’t rely on mid-conversation corrections and hope it sticks – write it down where it gets loaded every time.&lt;/p&gt;
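&lt;p&gt;For illustration, a &lt;code&gt;CLAUDE.md&lt;/code&gt; might look like this – the conventions and paths are made up, yours will differ:&lt;/p&gt;

```markdown
# CLAUDE.md (illustrative example)

## Conventions
- TypeScript strict mode; never use `any`
- Tests live next to source files as `*.test.ts`

## Architecture decisions
- All HTTP calls go through `src/lib/client.ts`
- Database access only via the repository layer
```

&lt;p&gt;Short, declarative, and specific to your project. Every line here is a correction you no longer have to repeat mid-conversation.&lt;/p&gt;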

&lt;h2&gt;
  
  
  5. Context is everything (literally)
&lt;/h2&gt;

&lt;p&gt;What the agent produces depends entirely on what’s in its context. This sounds obvious, but the implications are not.&lt;/p&gt;

&lt;p&gt;The agent doesn’t "know" your codebase. It knows whatever files it has read in the current session. If it makes a wrong assumption about your architecture, it’s probably because it hasn’t read the right files yet.&lt;/p&gt;

&lt;p&gt;This is why good agents read before they write. And it’s why you should be suspicious when an agent proposes changes to code it hasn’t looked at.&lt;/p&gt;

&lt;p&gt;A few practical consequences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Long conversations degrade.&lt;/strong&gt; As the context fills and compresses, the agent loses earlier information. Start new conversations for new tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don’t assume the agent "knows" something&lt;/strong&gt; from three tool calls ago. If it’s important, restate it or put it in your project instructions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Front-load your context.&lt;/strong&gt; The beginning of the conversation and the system prompt get the most "attention" from the model. Put your most important constraints there.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;You don’t need to do anything miraculous to get effective results from coding agents. You don’t need to master prompt engineering frameworks, read every paper on LLM architectures, or reverse-engineer system prompts.&lt;/p&gt;

&lt;p&gt;You need to understand the basics. That’s it.&lt;/p&gt;

&lt;p&gt;Context is a window with a size limit, and things get dropped when it fills up. Tools are text descriptions that the model reads and decides to call. Skills are tools you wrote yourself. Memory is files on disk that get loaded at the start. Everything is prompting.&lt;/p&gt;

&lt;p&gt;This is your Pareto principle for AI agents. &lt;strong&gt;These five concepts are the 20% that solve 80% of your problems.&lt;/strong&gt; When the agent forgets something, you know why – context compression. When it doesn’t use the right tool, you know why – it wasn’t loaded. When it ignores your conventions, you know what to do – update your &lt;code&gt;CLAUDE.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Most people are out there fighting the tool because they skipped the fundamentals. They want advanced techniques when they haven’t understood the basics. I’ve seen this pattern before in software development, and it never ends well. You can’t debug what you don’t understand.&lt;/p&gt;

&lt;p&gt;Understanding how the machine works is always the first step. It was true before AI agents, and it’s true now.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
