<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yaohua Chen</title>
    <description>The latest articles on Forem by Yaohua Chen (@chen115y).</description>
    <link>https://forem.com/chen115y</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2671324%2F3ca83c79-d0fd-40c0-b5ac-349326e71725.jpg</url>
      <title>Forem: Yaohua Chen</title>
      <link>https://forem.com/chen115y</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/chen115y"/>
    <language>en</language>
    <item>
      <title>Write, Install, or Generate: A Practical Guide to Agent Skills</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Fri, 17 Apr 2026 02:01:15 +0000</pubDate>
      <link>https://forem.com/chen115y/write-install-or-generate-a-practical-guide-to-agent-skills-37d3</link>
      <guid>https://forem.com/chen115y/write-install-or-generate-a-practical-guide-to-agent-skills-37d3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A plain-English guide to Agent Skills — what they are, how they differ from MCP, and the three ways to source one: write, install, or generate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you use Claude at work, you probably have a running tab of context you paste into every session: your team's naming conventions, the testing library you prefer, that one internal helper Claude keeps forgetting. You copy. You paste. You remind. And you do it all again next week.&lt;/p&gt;

&lt;p&gt;Agent Skills are Anthropic's answer to that fatigue. Announced in October 2025 and released as an open standard that December, they're now supported across Claude API, Claude Code, Cursor, VS Code Copilot, Cline, and more than two dozen other coding agents. The idea is simple: teach an agent something once, then reuse that knowledge everywhere — without bloating your prompts or your token bill.&lt;/p&gt;

&lt;p&gt;This guide explains what a skill is, how it differs from MCP (the other acronym you'll hear in the same breath), the three ways to get one — write, install, or generate — and two patterns for scaling beyond a single skill once you have a few.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a skill actually is
&lt;/h2&gt;

&lt;p&gt;A skill is a folder with a markdown file inside. The file — &lt;code&gt;SKILL.md&lt;/code&gt; — contains two things: a short description that tells Claude &lt;em&gt;when&lt;/em&gt; to use the skill, and a longer body with the actual instructions.&lt;/p&gt;

&lt;p&gt;Think of it as a recipe card. When you ask Claude to bake bread, it pulls the card titled "here's how we bake bread in this kitchen." When you ask for something unrelated, the card sits in the drawer untouched. The point is that the card isn't in Claude's working memory until it's needed.&lt;/p&gt;

&lt;p&gt;That's the difference between a skill and a big system prompt. A system prompt is the entire cookbook handed to Claude at every meal, even when you only want toast. A skill is one recipe pulled out on demand. Anthropic documents each idle skill as costing roughly &lt;strong&gt;100 tokens&lt;/strong&gt; of metadata — enough for Claude to know the skill exists without paying for its full content.&lt;/p&gt;

&lt;p&gt;That math matters once you have a handful. Twenty skills at ~100 tokens each is 2,000 tokens of fixed overhead no matter how long each skill actually is. The same twenty rules dumped into a system prompt would weigh in at tens of thousands of tokens every turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills vs. MCP: the recipe vs. the pantry
&lt;/h2&gt;

&lt;p&gt;The other term you'll hear is &lt;strong&gt;MCP&lt;/strong&gt; — the Model Context Protocol. People often treat skills and MCP as competing ideas, but they solve different problems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP&lt;/strong&gt; is the live connection between Claude and your data: query a Jira ticket, read a Google Doc, fetch current Stripe API docs. It's the pantry — where the fresh ingredients live.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;skill&lt;/strong&gt; is a reusable set of instructions: "when you're writing a React component, follow these rules." It's the recipe — how you combine ingredients, consistently, every time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a side-by-side view:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;MCP&lt;/th&gt;
&lt;th&gt;Agent Skill&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connect Claude to live data or tools&lt;/td&gt;
&lt;td&gt;Teach Claude a repeatable procedure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per call; fetched data stays in context&lt;/td&gt;
&lt;td&gt;~100 tokens idle; full body loads on demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lifetime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What you fetch stays for the session&lt;/td&gt;
&lt;td&gt;Stored locally; version-controlled in git&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What's the latest Drizzle syntax?"&lt;/td&gt;
&lt;td&gt;"How we always write our tests here"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;They aren't competitors. Most real workflows use both — MCP pulls the live docs; a skill teaches Claude how your team adapts them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The anatomy of a skill
&lt;/h2&gt;

&lt;p&gt;Every skill uses a layered structure Anthropic calls &lt;strong&gt;progressive disclosure&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; — always loaded. A short header at the top of &lt;code&gt;SKILL.md&lt;/code&gt; that says who the skill is and when to trigger it. About 100 tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The body&lt;/strong&gt; — loaded when triggered. The markdown instructions Claude reads once the description matches your task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference files&lt;/strong&gt; — pulled in only if the body points to them. Supporting docs, checklists, example code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A minimal skill looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;acme-pr-style&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Use when drafting a pull request description. Enforces Acme Corp's PR template and ticket-linking rules.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Acme PR Style&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Start every PR title with a ticket ID like &lt;span class="sb"&gt;`[ACME-1234]`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; The body must have three sections: &lt;span class="gs"&gt;**Summary**&lt;/span&gt;, &lt;span class="gs"&gt;**Changes**&lt;/span&gt;, &lt;span class="gs"&gt;**Test plan**&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; Never merge without at least one linked Linear ticket.
&lt;span class="p"&gt;-&lt;/span&gt; Use "we" voice, not "I" voice.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole skill. The block between the &lt;code&gt;---&lt;/code&gt; lines is YAML, and Claude uses the &lt;code&gt;description&lt;/code&gt; to decide whether to activate the skill when you type a request. Once active, the body becomes a hard rule for that conversation.&lt;/p&gt;

&lt;p&gt;The Anthropic spec requires only two frontmatter fields — &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; — and Claude Code adds a small handful of optional ones (for example, &lt;code&gt;user-invocable: true&lt;/code&gt;, the default, controls whether the skill also appears in the &lt;code&gt;/&lt;/code&gt; slash-command menu). You don't need anything beyond the two required fields for your first skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build your first skill in five minutes
&lt;/h2&gt;

&lt;p&gt;Let's walk through the PR-style skill end to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1. Create the folder.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In your project root, add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/skills/acme-pr-style/
└── SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2. Write &lt;code&gt;SKILL.md&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Copy the example from the previous section — swap &lt;code&gt;acme&lt;/code&gt; for your team name and replace the rules with yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3. Ask Claude to use it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start Claude Code in that directory and ask something that matches the trigger:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Write the PR description for my current branch."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude scans the active skills, notices the &lt;code&gt;description&lt;/code&gt; matches "drafting a pull request," and silently loads the body. In Claude Code you'll see a confirmation like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Skill loaded: acme-pr-style]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your PR description now follows the template.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4. Iterate on the description.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If Claude doesn't pick up the skill, the culprit is almost always the &lt;code&gt;description&lt;/code&gt; field. It's the only signal Claude has when deciding to activate. Vague descriptions ("coding standards") rarely trigger. Task-shaped descriptions ("Use when drafting a pull request description") do. A useful rule: phrase it like you're writing a job posting — state the trigger condition first, then the outcome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5. Share it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Commit &lt;code&gt;.claude/skills/acme-pr-style/&lt;/code&gt; to your repo. Every teammate who checks out the repo automatically gets the skill — no install step, no sync service. That's the quiet win here: the rules live with the code. When you bump your PR template, you bump the skill in the same commit, and Claude stays aligned with your current conventions instead of the ones from six months ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  You don't have to write every skill from scratch
&lt;/h2&gt;

&lt;p&gt;You just hand-wrote one, and that's the most direct path. Before you do it for everything, though, it helps to know that hand-writing is one of three ways to get a skill — and that the other two are usually faster when they apply.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write your own&lt;/strong&gt; for everything specific to your team — naming conventions, internal libraries, security requirements, release workflows. This is the irreducible kernel: nobody outside your team can produce it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Install one somebody else wrote&lt;/strong&gt; from a community source. These come in three flavors, in order of how curated they are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Curated CLI registry&lt;/strong&gt; — small but vetted, install via command. &lt;strong&gt;skills.sh&lt;/strong&gt; (Vercel Labs, early 2026) is the canonical example: &lt;code&gt;npx skills find&lt;/code&gt; to search, &lt;code&gt;npx skills add &amp;lt;pkg&amp;gt;&lt;/code&gt; to install.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curated "awesome" list&lt;/strong&gt; — a GitHub README organized by category; copy or clone manually. &lt;strong&gt;&lt;a href="https://github.com/ComposioHQ/awesome-claude-skills" rel="noopener noreferrer"&gt;awesome-claude-skills&lt;/a&gt;&lt;/strong&gt; (maintained by Composio) is the largest, grouped by use case: document processing, dev tools, data analysis, app automation, and so on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search-driven aggregator&lt;/strong&gt; — auto-indexes hundreds of thousands of skills from GitHub with AI-assisted search and one-click install. &lt;strong&gt;&lt;a href="https://skillsmp.com" rel="noopener noreferrer"&gt;SkillsMP&lt;/a&gt;&lt;/strong&gt; lists 900K+ across Claude Code, Codex, and ChatGPT.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generate one from live documentation&lt;/strong&gt; when you're picking up a third-party library and don't know it well enough to write the rules yourself. &lt;strong&gt;Context7's&lt;/strong&gt; wizard (&lt;code&gt;npx ctx7 skills generate&lt;/code&gt;) does this — covered in the next section.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Rule of thumb: write for internal rules, install for shared community patterns, generate for third-party library knowledge. Note: curated sources are higher signal but smaller; aggregators have everything but you should read each &lt;code&gt;SKILL.md&lt;/code&gt; before installing — skills can ship scripts the agent will execute, which means you should verify the code is safe to run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generate skills from docs with the Context7 wizard
&lt;/h2&gt;

&lt;p&gt;Writing a skill for your team's conventions is one thing — you already know the rules. Writing a skill for a third-party library is harder, because you have to know the library well enough to capture its current best practices, the patterns that are deprecated, and the mistakes you want the agent to avoid. Most of us aren't that fluent with the SDK we adopted last week, and the docs keep moving. So skills for external libraries often don't get written, or get written from stale memory and quietly drift.&lt;/p&gt;

&lt;p&gt;Context7 ships a CLI workflow specifically for this gap. Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx ctx7 skills generate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and you get an interactive wizard that turns Context7's live documentation index into a scoped skill in five steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Describe the expertise&lt;/strong&gt; — &lt;code&gt;Clerk authentication&lt;/code&gt;, &lt;code&gt;Drizzle migrations&lt;/code&gt;, &lt;code&gt;Tailwind v4 theming&lt;/code&gt;. Frame it as the domain you want the agent to be expert in, not the task you want it to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick the sources&lt;/strong&gt; — the wizard searches Context7's library and shows matching documentation sets. You confirm which ones it should treat as ground truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer the scoping questions&lt;/strong&gt; — it asks targeted clarifications: which framework you're on (Next.js, Remix, Astro), what stage you're at (initial setup, hardening, migration), which slice of the API you care about (sign-in/sign-up, social SSO, organizations).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review and refine&lt;/strong&gt; — the wizard queries Context7 for the latest docs, drafts the skill, and shows you exactly which snippets it pulled from. If something's off, you describe what to change and it regenerates while keeping what you liked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install&lt;/strong&gt; — pick the targets. The wizard detects Claude Code, Cursor, Codex, OpenCode, Amp, and Antigravity, and writes the skill into the right folder for each — or all of them with &lt;code&gt;--all&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What you end up with isn't a generic library wiki. It's a focused &lt;code&gt;SKILL.md&lt;/code&gt; that answers the exact question you scoped — say, "set up sign-in and sign-up in a Next.js App Router app with Clerk." It typically includes where the provider component goes, how the middleware should be wired, the required environment variables, and, usefully, the wrong patterns the official docs explicitly warn against.&lt;/p&gt;

&lt;p&gt;The non-obvious win is the scoping. Instead of one omnibus &lt;code&gt;clerk&lt;/code&gt; skill that tries to cover everything, you re-run the wizard for each concern: one skill for sign-in/sign-up flows, another for user management and profiles, another for social SSO. Each is narrow enough to have a sharp &lt;code&gt;description&lt;/code&gt;, which means each triggers precisely when relevant. The agent loads the auth-flow skill while you're wiring login pages, and the profile skill while you're building the account screen — never both at once, and never the wrong one.&lt;/p&gt;

&lt;p&gt;A reasonable heuristic: reach for the wizard when you're adopting a library you don't know intimately, or when you suspect the model's training data is older than the version you're on. Keep writing your own skills for the rules only your team knows — internal libraries, naming conventions, security requirements. The wizard is a great way to get a &lt;em&gt;library&lt;/em&gt; skill. It can't write your &lt;em&gt;culture&lt;/em&gt; skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compose skills into new skills
&lt;/h2&gt;

&lt;p&gt;Once you have a few skills, the next unlock is treating them as building blocks. A skill can invoke other installed skills as subroutines — running them in sequence, in parallel, or both — and combine the results with its own work to produce something neither could on its own.&lt;/p&gt;

&lt;p&gt;A concrete example from my own toolkit is a &lt;code&gt;code-review-report&lt;/code&gt; skill that runs two independent review passes against the same diff and consolidates them into a severity-tiered report. The two passes compose differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A convention-based pass&lt;/strong&gt; the skill runs itself. It reads the project's &lt;code&gt;CLAUDE.md&lt;/code&gt; and a per-language checklist, and reviews each file against those rules. For large diffs the skill fans this out across subagents — up to ten files per subagent — so the diff never has to fit in a single context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An adversarial pass&lt;/strong&gt; done by invoking another installed skill: &lt;strong&gt;&lt;code&gt;/codex:adversarial-review&lt;/code&gt;&lt;/strong&gt; from the Codex plugin. It runs the same diff through a different model (Codex) playing the skeptic, looking for bugs, security issues, and architectural risks the convention pass might miss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both passes run in parallel. When they return, &lt;code&gt;code-review-report&lt;/code&gt; consolidates them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deduplicate.&lt;/strong&gt; If both reviewers flag the same issue, it becomes a corroborated, higher-confidence item tagged &lt;code&gt;[Both]&lt;/code&gt;. Disagreements are surfaced rather than hidden.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify and format.&lt;/strong&gt; Group findings into a severity-tiered report (Critical / High / Medium / Low / Nits), annotate each with its source (&lt;code&gt;[Claude]&lt;/code&gt;, &lt;code&gt;[Codex]&lt;/code&gt;, or &lt;code&gt;[Both]&lt;/code&gt;), and append lint output.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a single report backed by two independent perspectives — something meaningfully different from either pass alone.&lt;/p&gt;

&lt;p&gt;The compounding payoff is the point. Claude and Codex have different training and different blind spots, so running both against the same diff catches issues either model alone would miss. The &lt;code&gt;[Both]&lt;/code&gt; tag turns agreement into signal — when two independent reviewers flag the same issue, the team can triage with much higher confidence than they could from either alone. Disagreements stay visible too, which is itself useful: a finding only one reviewer raised tells you something about the issue's character (model-specific blind spot, ambiguous case, judgment call worth a human discussion).&lt;/p&gt;

&lt;p&gt;The pattern generalizes. Any time you catch yourself running two or three skills by hand in the same order — research, then draft, then critique; lint, then test, then summarize — that sequence is itself a skill. Write a new &lt;code&gt;SKILL.md&lt;/code&gt; whose body tells Claude "run skill X, then run skill Y, then consolidate the outputs like this." You get reuse, a shareable workflow, and the same ~100 tokens of idle overhead as any other skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fan out to subagents inside one skill
&lt;/h2&gt;

&lt;p&gt;Composition isn't the only way to scale a skill. A skill can also dispatch its own subagents — short-lived workers, each with a fresh context, all running in parallel — and consolidate the results when they return. This is the move when one skill needs to do work that's too big for a single context window or has natural parallelism inside it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;code-review-report&lt;/code&gt; does this for its convention-based pass. Reviewing every file in a large diff against &lt;code&gt;CLAUDE.md&lt;/code&gt; and a per-language checklist would either overflow context or grind serially. Instead, the skill splits the changed file list into batches of up to ten files and dispatches one subagent per batch. Each subagent loads the same instructions but only its own slice of the diff, runs the mechanical and semantic checks, and returns structured findings. The parent skill collects all batches and merges them into the consolidation step.&lt;/p&gt;

&lt;p&gt;Three things make subagent fan-out earn its complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context economics.&lt;/strong&gt; Each subagent has its own fresh context. The parent skill never has to hold the whole diff at once — only the consolidated findings, which are typically orders of magnitude smaller than the raw input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real parallelism.&lt;/strong&gt; Subagents run concurrently on the wall clock, not just logically. Reviewing thirty files across three subagents of ten finishes roughly three times faster than one subagent grinding through all thirty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation.&lt;/strong&gt; A subagent can't contaminate another with framing from an earlier file or half-formed conclusions. Each one sees its slice cleanly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fan-out can take two shapes, and both are worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition by data&lt;/strong&gt; (what &lt;code&gt;code-review-report&lt;/code&gt; does) — same work, different slice of input. Each subagent runs the same instructions on a different chunk of the diff. Best for naturally divisible inputs: file batches, record windows, time ranges, regions of a document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition by concern&lt;/strong&gt; — different work, same input. One subagent specializes in security, another in performance, another in test coverage; they all see the full input but each looks for something different. Best when concerns are independent and benefit from a dedicated reviewer rather than being squeezed into one pass.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is real: subagents cost tokens (each one re-loads its instructions and partial context) and add orchestration overhead. Fan out only when the work is divisible &lt;em&gt;and&lt;/em&gt; large enough that the alternatives — overflowing context, running serially — are worse. For a five-file diff, the parent context is fine. For a fifty-file diff, fan out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skills are recipe cards.&lt;/strong&gt; Markdown files Claude reads only when they match your task, at ~100 tokens of idle overhead each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills are not MCP.&lt;/strong&gt; Skills are reusable procedures; MCP is live data access. You'll likely use both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills live in your repo.&lt;/strong&gt; When the rules change, commit the change. Claude reads the latest version automatically, across every teammate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate library skills, write team skills.&lt;/strong&gt; Use &lt;code&gt;npx ctx7 skills generate&lt;/code&gt; to spin up scoped skills for third-party libraries from current docs. Write your own for the rules only your team knows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two ways to scale beyond one skill.&lt;/strong&gt; Compose other skills as building blocks, or fan out to subagents inside a single skill. Use the first when the pieces are already separate skills; use the second when the work inside one skill is too big for one context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've caught yourself pasting the same block of context into Claude twice this week, that block is already your first skill. The rest is copying it into a &lt;code&gt;SKILL.md&lt;/code&gt; file.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix — A developer's deeper look
&lt;/h2&gt;

&lt;p&gt;For readers who write code, here's what a skill looks like once it grows up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommended directory layout
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/skills/acme-typescript/
├── SKILL.md                 # Metadata + core rules
├── references/
│   ├── naming-conventions.md
│   └── standard-patterns.ts # "Good" vs "Bad" code examples
└── templates/
    └── api-route.ts.template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;references/&lt;/code&gt; folder is Level 3 of progressive disclosure. The body of &lt;code&gt;SKILL.md&lt;/code&gt; mentions these files by path, and Claude opens them only when it actually needs the example — keeping the active context small until the moment you need the detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  A realistic &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;acme-typescript&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforces Acme Corp's strict TypeScript 5.x standards, including Zod validation at boundaries and the internal Result&amp;lt;T, E&amp;gt; error pattern. Use when writing or reviewing any .ts file.&lt;/span&gt;
&lt;span class="na"&gt;user-invocable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Acme TypeScript Standards&lt;/span&gt;

&lt;span class="gu"&gt;## Type safety&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; No &lt;span class="sb"&gt;`any`&lt;/span&gt;. Use &lt;span class="sb"&gt;`unknown`&lt;/span&gt; with a type guard.
&lt;span class="p"&gt;-&lt;/span&gt; Boundary data (API, file, env) must be validated with Zod.
&lt;span class="p"&gt;-&lt;/span&gt; Derive the TypeScript type from the schema: &lt;span class="sb"&gt;`type X = z.infer&amp;lt;typeof XSchema&amp;gt;`&lt;/span&gt;.

&lt;span class="gu"&gt;## Error handling&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use &lt;span class="sb"&gt;`Result&amp;lt;T, E&amp;gt;`&lt;/span&gt; from &lt;span class="sb"&gt;`./references/standard-patterns.ts`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; Never throw for expected business errors; return &lt;span class="sb"&gt;`err(...)`&lt;/span&gt; instead.

&lt;span class="gu"&gt;## Reference&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; See &lt;span class="sb"&gt;`./references/standard-patterns.ts`&lt;/span&gt; for the canonical implementation.

ultrathink
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the &lt;em&gt;culture&lt;/em&gt; side of the divide from earlier — it codifies Acme's internal &lt;code&gt;Result&amp;lt;T, E&amp;gt;&lt;/code&gt; pattern, which no wizard could know about. The library counterparts you'd pair it with — say &lt;code&gt;zod-runtime-validation&lt;/code&gt; or &lt;code&gt;typescript-strict-mode&lt;/code&gt; — are exactly the kind of skill &lt;code&gt;npx ctx7 skills generate&lt;/code&gt; would draft for you from the official docs in a couple of minutes.&lt;/p&gt;

&lt;p&gt;That last word — &lt;code&gt;ultrathink&lt;/code&gt; — is a real Claude Code trigger. When it appears anywhere inside a skill, Claude allocates its extended-thinking budget (roughly 32,000 tokens) whenever the skill is active. Use it for skills that enforce expensive or subtle constraints where quiet mistakes are costly.&lt;/p&gt;

&lt;h3&gt;
  
  
  A reference file
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;references/standard-patterns.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;E&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;E&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;never&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;E&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;E&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;never&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;E&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Claude is asked to edit a service file, it reads the skill, sees "use the pattern at &lt;code&gt;./references/standard-patterns.ts&lt;/code&gt;," opens that file once, and applies the pattern consistently across every change in the session. That's what progressive disclosure buys you: one source of truth, loaded only when relevant.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;Anthropic — Equipping agents for the real world with Agent Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview" rel="noopener noreferrer"&gt;Agent Skills overview (Claude platform docs)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.claude.com/docs/en/skills" rel="noopener noreferrer"&gt;Claude Code — Skills reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://context7.com/docs/clients/cli" rel="noopener noreferrer"&gt;Context7 CLI documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://skills.sh" rel="noopener noreferrer"&gt;skills.sh — the open agent skills ecosystem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=-AL4Wx-LVEs" rel="noopener noreferrer"&gt;Context7 + Skill Wizard = Instant Claude Code Skills (YouTube)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ComposioHQ/awesome-claude-skills" rel="noopener noreferrer"&gt;Awesome Claude Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://skillsmp.com/" rel="noopener noreferrer"&gt;Agent Skills Marketplace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agentskills</category>
      <category>agents</category>
    </item>
    <item>
      <title>Self-Evolving Agents: A Developer's Guide</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:54:40 +0000</pubDate>
      <link>https://forem.com/chen115y/self-evolving-agents-a-developers-guide-40e7</link>
      <guid>https://forem.com/chen115y/self-evolving-agents-a-developers-guide-40e7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Static agents hit performance ceilings. This guide shows you how to build agents&lt;br&gt;
that improve themselves — through prompt optimization, dynamic skill libraries,&lt;br&gt;
code and harness evolution, RAG, and LLM fine-tuning — and how a unified LLM&lt;br&gt;
judge decides which track to take. Along the way, we'll survey the frameworks&lt;br&gt;
and methodologies — from DSPy to autoresearch to TextGrad — that have turned&lt;br&gt;
these ideas into working code.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Most production agents are frozen at deployment. Their system prompt is fixed, their tools are hardcoded, and when they fail, a human manually intervenes. This works until it doesn't — and it usually stops working the moment the task distribution shifts or edge cases accumulate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-evolving agents&lt;/strong&gt; close this loop automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They evaluate their own outputs&lt;/li&gt;
&lt;li&gt;They diagnose failure modes&lt;/li&gt;
&lt;li&gt;They improve the right layer — prompt, skill, code, knowledge, or model weights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a theoretical concept — in 2026, the field often refers to these patterns as &lt;strong&gt;recursive optimization&lt;/strong&gt; or &lt;strong&gt;self-distillation&lt;/strong&gt;. Several open-source frameworks have already shipped working implementations: OpenAI's Self-Evolving Agents Cookbook automates prompt improvement through graders and metaprompt agents. Karpathy's autoresearch lets an agent rewrite its own training code overnight. DSPy compiles optimal prompts via Bayesian search and can distill them into smaller model weights. TextGrad treats the entire agent as a differentiable program, using textual gradients to patch failure modes. And frameworks like AgentScope close the loop all the way to automated fine-tuning from production data.&lt;/p&gt;

&lt;p&gt;This guide covers five escalation levels in order of cost and commitment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1 — Prompt tuning              (minutes, free)
     │  still failing after 3 rounds?
     ▼
Level 2 — Add/improve skills         (hours, cheap)
     │  still failing on reasoning/architecture?
     ▼
Level 3 — Code &amp;amp; Harness evolution   (hours, cheap — runs overnight)
     │  still failing on knowledge?
     ▼
Level 4 — RAG                        (hours, medium cost)
     │  still failing on reasoning style/pattern?
     ▼
Level 5 — LLM Fine-tuning            (days, expensive)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each section builds toward a &lt;strong&gt;master LLM judge pipeline&lt;/strong&gt; in Section 9 that automatically decides which track to trigger — and calls the right code to execute it.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Landscape: Frameworks for Self-Evolution
&lt;/h2&gt;

&lt;p&gt;Before building from scratch, it is worth understanding the frameworks that have already solved pieces of this problem. They share the same core loop — run, evaluate, improve, repeat — but differ in &lt;em&gt;what&lt;/em&gt; they evolve (prompts, code, skills, or model weights), &lt;em&gt;how&lt;/em&gt; they score, and &lt;em&gt;what safety model&lt;/em&gt; they use.&lt;/p&gt;

&lt;h3&gt;
  
  
  2a. OpenAI Self-Evolving Agents Cookbook
&lt;/h3&gt;

&lt;p&gt;The most production-oriented of the four. It addresses the scenario every developer has experienced: an LLM-powered agent that works reasonably well but keeps failing on certain inputs, leaving you stuck in a never-ending cycle of prompt tweaking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What evolves:&lt;/strong&gt; The system prompt (the instructions given to the LLM). A &lt;code&gt;VersionedPrompt&lt;/code&gt; class tracks every revision with timestamps and eval scores, so rollback is always one line away.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it scores:&lt;/strong&gt; Multiple graders run in parallel — Python functions for deterministic checks (keyword presence, length deviation), cosine similarity for semantic fidelity, and an LLM-as-judge for nuanced quality. A metaprompt agent reads grader feedback and rewrites the system prompt automatically. The loop continues until scores pass or a retry limit is hit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Going further:&lt;/strong&gt; The cookbook also supports comparing model versions (e.g., GPT-5 vs GPT-5-mini) to find the best model-prompt combination, and demonstrates GEPA (Genetic-Pareto) optimization as an advanced alternative to simple metaprompt rewriting.&lt;/p&gt;

&lt;h3&gt;
  
  
  2b. Karpathy's autoresearch
&lt;/h3&gt;

&lt;p&gt;Instead of improving prompts, the agent improves actual source code — specifically, code that trains a small language model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What evolves:&lt;/strong&gt; A single Python file (&lt;code&gt;train.py&lt;/code&gt;) containing the full GPT model, optimizer, and training loop. Everything is on the table: architecture, hyperparameters, optimizer, batch size, attention pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it scores:&lt;/strong&gt; A single, hard metric: validation bits per byte (val_bpb). Lower is better. Each training run is limited to exactly 5 minutes of wall-clock time, making experiments directly comparable regardless of what the agent changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; You are not writing training code — you are writing &lt;code&gt;program.md&lt;/code&gt;, a Markdown file that instructs the agent. The agent reads your instructions, modifies &lt;code&gt;train.py&lt;/code&gt;, runs training, checks if the score improved, and keeps or discards the change. You can expect roughly 12 experiments per hour, or 100 overnight.&lt;/p&gt;

&lt;h3&gt;
  
  
  2c. autoagent (kevinrgu)
&lt;/h3&gt;

&lt;p&gt;"Like autoresearch but for agent engineering." Instead of optimizing model training code, it optimizes the agent itself — system prompt, tool definitions, agent registry, and routing/orchestration logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What evolves:&lt;/strong&gt; A single-file agent harness (&lt;code&gt;agent.py&lt;/code&gt;) containing config, tool definitions, agent registry, and orchestration. An adapter boundary is explicitly marked as fixed; everything else is the edit surface for the meta-agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it scores:&lt;/strong&gt; Total score produced by benchmark task test suites in Harbor format. Tasks run in Docker containers for isolation. The meta-agent hill-climbs on this score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same meta-programming model:&lt;/strong&gt; Like autoresearch, the human steers the loop through &lt;code&gt;program.md&lt;/code&gt; while the meta-agent edits &lt;code&gt;agent.py&lt;/code&gt;. The agent runs benchmarks, diagnoses failures, modifies the harness, and iterates.&lt;/p&gt;

&lt;h3&gt;
  
  
  2d. EvoMap Evolver
&lt;/h3&gt;

&lt;p&gt;If the OpenAI cookbook is about improving prompts and autoresearch is about improving code, Evolver is about improving agent behavior through a formal, protocol-driven process — version control for agent evolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What evolves:&lt;/strong&gt; Structured behavior assets. Genes are reusable improvement patterns (like "add input validation before edits"). Capsules bundle related Genes together for larger changes. Events log every evolution, creating a complete audit trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it scores:&lt;/strong&gt; Signal-based — scans agent logs for error patterns and uses those signals to select which Gene to apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance model:&lt;/strong&gt; Evolver supports multiple operational modes: review mode (human-in-the-loop), continuous loop (autonomous), and strategy presets that steer priorities — &lt;code&gt;innovate&lt;/code&gt; (maximize new features), &lt;code&gt;harden&lt;/code&gt; (focus on stability), or &lt;code&gt;repair-only&lt;/code&gt; (emergency fix mode).&lt;/p&gt;

&lt;h3&gt;
  
  
  2e. The Broader Ecosystem
&lt;/h3&gt;

&lt;p&gt;The four frameworks above are the ones this guide draws its architecture patterns from, but the self-evolving agent space is broader. Several other systems take fundamentally different optimization approaches worth knowing about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DSPy (Declarative Self-improving Python).&lt;/strong&gt; The industry standard for self-improving prompts. Instead of writing prompt strings, you define a &lt;em&gt;Signature&lt;/em&gt; (input/output spec) and a &lt;em&gt;Metric&lt;/em&gt; (your judgment function). DSPy's MIPRO optimizer uses an LLM to triage failures, propose 10-20 prompt variants, and "compile" the best one via Bayesian search. DSPy can also fine-tune smaller models (e.g., Llama 3) to mimic the reasoning of a larger model by distilling best-performing prompt traces into weights — a technique called &lt;em&gt;self-distillation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TextGrad (Textual Backpropagation).&lt;/strong&gt; Published in &lt;em&gt;Nature&lt;/em&gt; (2025), TextGrad treats an LLM agent like a neural network but replaces numerical gradients with &lt;em&gt;textual gradients&lt;/em&gt;. You define a &lt;code&gt;TextLoss&lt;/code&gt; — for example: "The response should be technically accurate and concise; provide feedback if it is too wordy." TextGrad passes this loss back through the agent's execution trace and mutates the system prompt or solution code to patch the specific failure mode the judge discovered. This is particularly effective for hard optimization problems (math, code generation) where failures are diagnosable from the trace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memento-Skills.&lt;/strong&gt; A framework focused on evolving an agent's &lt;em&gt;skill library&lt;/em&gt; rather than a single prompt. When an agent encounters a task and fails, an orchestrator evaluates why, then literally rewrites the Markdown and code files for the failing skill. Over time, the agent accumulates a library of refined skills — like learning new moves in a game by trial and error, refining each move's code/instructions after every loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentScope + Trinity-RFT.&lt;/strong&gt; Designed for enterprise-scale self-evolution. AgentScope captures production logs via "Inference Tables," and Trinity-RFT uses an LLM judge to label production data as "good" or "bad." The system then automatically kicks off a fine-tuning job using reinforcement learning from feedback (RLHF/PPO/SFT) to update the underlying model weights — closing the loop from production failures to weight updates without manual data curation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Side-by-Side Comparison
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Frameworks covered in this guide:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;OpenAI Cookbook&lt;/th&gt;
&lt;th&gt;autoresearch&lt;/th&gt;
&lt;th&gt;autoagent&lt;/th&gt;
&lt;th&gt;Evolver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What evolves&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;Source code (&lt;code&gt;train.py&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Agent harness (&lt;code&gt;agent.py&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Behavior assets (Genes/Capsules)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-grader (Python + similarity + LLM judge)&lt;/td&gt;
&lt;td&gt;Single metric (val_bpb)&lt;/td&gt;
&lt;td&gt;Benchmark task suites (Harbor)&lt;/td&gt;
&lt;td&gt;Log signal scanning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Human role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Define graders and thresholds&lt;/td&gt;
&lt;td&gt;Write/iterate on &lt;code&gt;program.md&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Write/iterate on &lt;code&gt;program.md&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Choose mode and strategy preset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Safety model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Versioned prompts with rollback&lt;/td&gt;
&lt;td&gt;Git keep-or-revert; fixed time budget&lt;/td&gt;
&lt;td&gt;Docker isolation; Harbor sandboxing&lt;/td&gt;
&lt;td&gt;Command whitelist; scoped execution; audit trail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production prompt improvement&lt;/td&gt;
&lt;td&gt;Single-file, single-metric optimization&lt;/td&gt;
&lt;td&gt;Agent harness optimization&lt;/td&gt;
&lt;td&gt;Regulated environments needing audit trails&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Additional frameworks worth evaluating:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;What it evolves&lt;/th&gt;
&lt;th&gt;Optimization method&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DSPy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompts and weights&lt;/td&gt;
&lt;td&gt;Bayesian search / compilation (MIPRO)&lt;/td&gt;
&lt;td&gt;RAG pipelines and complex multi-step workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TextGrad&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompts and code&lt;/td&gt;
&lt;td&gt;Textual backpropagation&lt;/td&gt;
&lt;td&gt;Hard optimization problems (math, code generation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memento-Skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Skill artifacts (Markdown + code)&lt;/td&gt;
&lt;td&gt;Reflection and mutation&lt;/td&gt;
&lt;td&gt;Long-horizon autonomous agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AgentScope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model weights&lt;/td&gt;
&lt;td&gt;Online fine-tuning (PPO/SFT via Trinity-RFT)&lt;/td&gt;
&lt;td&gt;Production enterprise loops with RLHF&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Foundations — The Evolution Loop
&lt;/h2&gt;

&lt;p&gt;Every self-evolving agent shares the same feedback cycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent runs task
      │
      ▼
Evaluator scores output
      │
      ▼
Failure classifier diagnoses root cause
      │
      ▼
Improvement dispatcher triggers the right track
      │
      ▼
Updated agent reruns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three components make this possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — a versioned log of runs, prompts, and scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation signal&lt;/strong&gt; — a judge that tells you &lt;em&gt;how well&lt;/em&gt; the agent did&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improvement dispatcher&lt;/strong&gt; — the logic that routes failures to prompt, skill, code, RAG, or fine-tune&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of this guide builds each component in code. All code snippets use Anthropic's Claude (via the Python SDK), but the patterns are model-agnostic — swap in any LLM provider and the architecture stays the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost optimization tip:&lt;/strong&gt; The code uses &lt;code&gt;claude-opus-4-6-20260205&lt;/code&gt; throughout for simplicity, but in production you should use different model tiers for different roles. Sonnet 4.6 delivers ~98.5% of Opus performance on routine agent runs (79.6% vs 80.8% on SWE-bench) at 1/5 the cost and 2x the speed. Opus 4.6 pulls ahead decisively on deep reasoning (91.3% vs 74.1% on GPQA Diamond). The practical split: use &lt;strong&gt;Sonnet for the agent runner, evaluator, and prompt rewriter&lt;/strong&gt; (Sections 4a–4c), and reserve &lt;strong&gt;Opus for the judge and track recommender&lt;/strong&gt; (Section 9, Judges 3–4) where multi-step reasoning about failure signals matters most.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Track 1 — Prompt &amp;amp; Skill Evolution
&lt;/h2&gt;

&lt;p&gt;This is the fastest, cheapest, and most reversible improvement path. Always start here.&lt;/p&gt;

&lt;h3&gt;
  
  
  4a. System Prompt Optimization
&lt;/h3&gt;

&lt;p&gt;The core loop: run → evaluate → rewrite prompt if score is low.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# --- Versioned prompt store ---
&lt;/span&gt;&lt;span class="n"&gt;prompt_versions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt_versions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;prompt_versions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;best_prompt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prompt_versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prompt_versions&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;INITIAL_PROMPT&lt;/span&gt;

&lt;span class="c1"&gt;# --- Agent runner ---
&lt;/span&gt;&lt;span class="n"&gt;INITIAL_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that answers math word problems.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_task&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# --- LLM-as-judge evaluator ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;judge_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    Expected answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    Agent response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Score the response from 0.0 to 1.0 based on correctness and clarity.
    Reply with JSON only: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 0.0, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;judge_prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# --- Prompt rewriter ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rewrite_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failed_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rewrite_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The current system prompt failed on this task.

    System prompt: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    Bad response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failed_response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    Failure reason: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Rewrite the system prompt to handle this better.
    Reply with the new prompt text only.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rewrite_request&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# --- Evolution loop ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evolution_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_rounds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;current_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INITIAL_PROMPT&lt;/span&gt;
    &lt;span class="nf"&gt;save_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_rounds&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== Round &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Prompt: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;... ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;round_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;round_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;current_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rewrite_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;avg_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;round_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;round_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;save_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Avg score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;avg_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Prompt converged.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;best_prompt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If a train travels 60mph for 2.5 hours, how far does it go?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;150 miles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A store has 240 apples. 1/3 are sold. How many remain?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;160 apples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;final_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evolution_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Final best prompt:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The metaprompt rewriting approach above is straightforward but has a limitation: it uses a single static meta-prompt that can overfit to immediate grader feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alternatives to consider:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GEPA&lt;/strong&gt; (Section 4d) — population-based search with train/validation splits for more robust prompt generalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DSPy&lt;/strong&gt; (Section 2e) — instead of writing prompt strings at all, define a &lt;em&gt;Signature&lt;/em&gt; (input/output spec) and a &lt;em&gt;Metric&lt;/em&gt;, and let DSPy's MIPRO optimizer compile the best prompt via Bayesian search. This is the most structured approach to prompt optimization and works particularly well for multi-step pipelines (e.g., RAG chains) where multiple prompts need to be co-optimized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TextGrad&lt;/strong&gt; (Section 2e) — treats the agent as a differentiable program and uses &lt;em&gt;textual gradients&lt;/em&gt; (natural-language feedback on the execution trace) to mutate the prompt or code. Best for hard optimization problems where failures are diagnosable from the trace (math reasoning, code generation).&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4b. Dynamic Skill Library
&lt;/h3&gt;

&lt;p&gt;Agents that write, register, and retrieve tools on demand — and prune the ones that stop working. The &lt;strong&gt;Memento-Skills&lt;/strong&gt; framework (Section 2e) takes this pattern further: when an agent fails a task, an orchestrator evaluates why and literally rewrites the Markdown and code files for the failing skill, accumulating a refined skill library over time. The implementation below captures the same core idea.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# --- Skill registry ---
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SkillRegistry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# name -&amp;gt; {code, description, stats}
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Skill registered: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Keyword overlap retrieval — swap for vector search in prod.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_success_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_uses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove underperforming skills.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;to_remove&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_uses&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min_success_rate&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;to_remove&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🗑️  Pruned skill: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;registry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SkillRegistry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# --- Seed with initial skills ---
&lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate_percentage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate percentage proportion ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;def calculate_percentage(part, whole): return round((part / whole) * 100, 2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;days_between_dates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date difference calendar days between two dates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
from datetime import datetime
def days_between_dates(d1: str, d2: str) -&amp;gt; int:
    fmt = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    return abs((datetime.strptime(d2, fmt) - datetime.strptime(d1, fmt)).days)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Skill generator: agent writes new skills on demand ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A user needs help with: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    No existing skill covers this. Write a new Python skill.

    Reply with JSON only:
    {{
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snake_case_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keywords describing when to use this skill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;def skill_name(...):&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n    ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```

json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Agent that uses the skill registry ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;skill_aware_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;relevant_skills&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;skill_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Skill `&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;`:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;```
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
python&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;relevant_skills&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Python agent. Use available skills when helpful.
Available skills:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;skill_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

If no skill fits, say NEED_NEW_SKILL: &amp;lt;description of what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s needed&amp;gt;.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_task&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="c1"&gt;# Auto-generate missing skill if flagged
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEED_NEW_SKILL:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;needed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEED_NEW_SKILL:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔧 Generating new skill for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;needed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;new_skill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;needed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;new_skill&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;skill_aware_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Retry with new skill
&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sorry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;relevant_skills&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;skill_aware_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What percentage is 45 out of 180?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;skill_aware_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many days between 2024-01-15 and 2024-07-04?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;skill_aware_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Convert 100 USD to EUR at a rate of 0.92&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# triggers new skill
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  4c. Evaluation &amp;amp; Version Gating
&lt;/h3&gt;

&lt;p&gt;Only promote a new prompt or skill if it measurably beats the current baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layered graders.&lt;/strong&gt; A single LLM-as-judge is fragile. Production systems should layer multiple evaluation signals, as the OpenAI Cookbook demonstrates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Grader type&lt;/th&gt;
&lt;th&gt;What it checks&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Deterministic&lt;/strong&gt; (Python)&lt;/td&gt;
&lt;td&gt;Keyword presence, length within bounds&lt;/td&gt;
&lt;td&gt;Fast, cheap, catches hard failures early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Semantic&lt;/strong&gt; (cosine similarity)&lt;/td&gt;
&lt;td&gt;Summary stays anchored to source content&lt;/td&gt;
&lt;td&gt;Guards against superficial rephrasing that drifts from the original&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;LLM-as-judge&lt;/strong&gt; (score model)&lt;/td&gt;
&lt;td&gt;Rubric-driven quality assessment&lt;/td&gt;
&lt;td&gt;Captures nuanced signals that rule-based metrics miss&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The deterministic graders stabilize optimization before semantic tuning kicks in. The LLM judge provides a holistic failsafe for edge cases that slip past the other checks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EvalResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EvalSuite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pass_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;llm_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EvalResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Evaluate this agent response.
Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Expected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Actual: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Reply with JSON only:
{{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 0.0-1.0, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true/false, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feedback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brief reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;EvalResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EvalSuite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;tag_scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;tag_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]).&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] score=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;avg_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tag_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tag_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pass_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag_breakdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tag_summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_cases&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed_cases&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;promote_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EvalSuite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Only promote candidate if it beats the current prompt.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 Evaluating CURRENT prompt...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;current_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📊 Evaluating CANDIDATE prompt...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;candidate_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;candidate_report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;current_report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🚀 Promoting (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;candidate_report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate_report&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;⏪ Keeping current (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;gt;= &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;candidate_report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_report&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EvalSuite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math_agent_v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pass_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 15% of 200?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percentage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A rectangle is 8x5. What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s its area?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;40 sq units&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;geometry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Train goes 90mph for 3 hours. Distance?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;270 miles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word_problem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Factor 12 into primes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2 × 2 × 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number_theory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;current_prompt&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that solves math problems.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;candidate_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a precise math tutor. Always show step-by-step reasoning, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state the formula used, then give a clean final answer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;best_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;promote_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Tag breakdown: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tag_breakdown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passed_cases&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_cases&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; cases passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Version tracking in production.&lt;/strong&gt; The OpenAI Cookbook introduces a &lt;code&gt;VersionedPrompt&lt;/code&gt; class that stores each prompt revision with a timestamp, eval ID, run ID, and metadata. This gives you instant rollback and a full audit trail of what changed and why. The pattern is simple to implement yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PromptVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VersionedPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PromptVersion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;initial_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PromptVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptVersion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;current&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PromptVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;best&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PromptVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rollback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PromptVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_versions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Model comparison.&lt;/strong&gt; When optimizing, you can also test the same prompt across different model variants (e.g., a full model vs a smaller/cheaper model) and select the best model-prompt combination. The OpenAI Cookbook demonstrates this by running candidate prompts against both &lt;code&gt;gpt-5&lt;/code&gt; and &lt;code&gt;gpt-5-mini&lt;/code&gt; in parallel and keeping whichever scores higher — balancing quality against cost and latency.&lt;/p&gt;




&lt;h3&gt;
  
  
  4d. Advanced: GEPA Optimization
&lt;/h3&gt;

&lt;p&gt;The simple metaprompt rewriting loop in Section 4a works but has a limitation: a static meta-prompt explores a narrow space and can overfit to immediate grader feedback on individual examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GEPA (Genetic-Pareto)&lt;/strong&gt; is a more rigorous alternative demonstrated in the OpenAI Cookbook. It samples agent trajectories, reflects on them in natural language, proposes prompt revisions, and evolves the system through iterative feedback loops with train/validation splits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it differs from simple rewriting:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Simple metaprompt&lt;/th&gt;
&lt;th&gt;GEPA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search strategy&lt;/td&gt;
&lt;td&gt;Greedy rewrite per failure&lt;/td&gt;
&lt;td&gt;Population-based, Pareto front selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overfitting protection&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Train/validation split&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feedback used&lt;/td&gt;
&lt;td&gt;Grader scores only&lt;/td&gt;
&lt;td&gt;Scores + natural language reflection on trajectories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-objective&lt;/td&gt;
&lt;td&gt;Single average score&lt;/td&gt;
&lt;td&gt;Pareto-optimal across multiple grader dimensions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The GEPA loop:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with a seed prompt (candidate)&lt;/li&gt;
&lt;li&gt;Evaluate on a training subsample using your graders&lt;/li&gt;
&lt;li&gt;Reflect on trajectories — the GEPA reflection LM reads inputs, outputs, and feedback to propose an improved prompt&lt;/li&gt;
&lt;li&gt;Evaluate the new candidate on a validation set&lt;/li&gt;
&lt;li&gt;Maintain a Pareto front of non-dominated candidates&lt;/li&gt;
&lt;li&gt;Repeat until convergence or budget exhaustion
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gepa&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gepa&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EvaluationBatch&lt;/span&gt;

&lt;span class="n"&gt;seed_candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a summarization assistant. Given a section of text, produce a summary.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gepa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;seed_candidate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed_candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trainset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;valset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_eval_adapter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# bridges your graders to GEPA's interface
&lt;/span&gt;    &lt;span class="n"&gt;reflection_lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_metric_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;track_best_outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;best_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_candidate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use GEPA vs simple rewriting:&lt;/strong&gt; If you have fewer than 10 eval cases and need a quick improvement, simple metaprompt rewriting is sufficient. If you have a real dataset with dozens of examples and need the prompt to generalize across them, GEPA's population-based search with train/validation splits will produce more robust results.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. When to Improve Prompt vs. Create a Skill
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Improve Prompt&lt;/th&gt;
&lt;th&gt;Create/Improve Skill&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wrong tone, style, or reasoning format&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Misunderstands task intent&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing a computation or lookup&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fails consistently on one task type&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Needs external data or API&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucinating facts it should retrieve&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The 3-question test:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Knowledge/reasoning gap or behavior gap? → behavior = prompt, knowledge = skill&lt;/li&gt;
&lt;li&gt;Reproducible with the same input type? → yes = skill (deterministic logic in code)&lt;/li&gt;
&lt;li&gt;Would a human use a tool or think differently? → tool = skill, think = prompt&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Automated Failure Classifier
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ImprovementTrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;SKILL&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;BOTH&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;both&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_system_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;classifier_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are an AI agent debugging expert. Analyze this agent failure.

System prompt: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_system_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Expected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Actual response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Diagnose the root cause and classify it. Consider:
- PROMPT: the agent has the capability but wrong behavior/tone/reasoning style
- SKILL: the agent is missing a tool, lookup, or computation it cannot reliably do in its head
- BOTH: the prompt misdirects AND a skill is missing

Reply with JSON only:
{{
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;track&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;both&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root_cause&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;one sentence explanation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specific part of the response that reveals the problem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggested_action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concrete next step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;classifier_prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;diagnosis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;track&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ImprovementTrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;track&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;diagnosis&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the compound interest on $5000 at 4.5% for 3 years?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$706.06&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The compound interest would be approximately $700.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful financial assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the steps to solve a quadratic equation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step-by-step: factoring, completing the square, quadratic formula&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Just use the quadratic formula: x = (-b ± √(b²-4ac)) / 2a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful math assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;diagnosis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Track     : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;track&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Root cause: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;root_cause&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Action    : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;diagnosis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;suggested_action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Thumb rules:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Prompt&lt;/code&gt; = change &lt;em&gt;how&lt;/em&gt; the agent thinks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Skill&lt;/code&gt; = change &lt;em&gt;what&lt;/em&gt; the agent can do&lt;/li&gt;
&lt;li&gt;If a fix requires &lt;code&gt;math&lt;/code&gt;, &lt;code&gt;datetime&lt;/code&gt;, or any API call → always a &lt;strong&gt;Skill&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Aim for a &lt;strong&gt;thin prompt, rich skill library&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Track 2 — Code &amp;amp; Harness Evolution
&lt;/h2&gt;

&lt;p&gt;Prompt and skill tuning change the &lt;em&gt;instructions&lt;/em&gt; and &lt;em&gt;tools&lt;/em&gt; given to a model. Code and harness evolution go further: the agent modifies its own implementation.&lt;/p&gt;

&lt;p&gt;Code evolution has two variants: &lt;em&gt;model-side&lt;/em&gt; (autoresearch modifies training code to produce a better model) and &lt;em&gt;harness-side&lt;/em&gt; (autoagent modifies the agent itself — prompt, tools, orchestration). Both use the same &lt;code&gt;program.md&lt;/code&gt; pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;program.md&lt;/code&gt; Pattern
&lt;/h3&gt;

&lt;p&gt;The key insight from both frameworks: &lt;strong&gt;you are not touching the Python files like you normally would as an engineer. Instead, you are programming &lt;code&gt;program.md&lt;/code&gt;&lt;/strong&gt; — the Markdown file that provides context to the meta-agent and defines the evolution loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│  Human writes program.md                    │
│  (instructions, constraints, goals)         │
│                                             │
│         ┌──────────────┐                    │
│         │  Meta-agent   │                   │
│         │  reads        │                   │
│         │  program.md   │                   │
│         └──────┬───────┘                    │
│                │                            │
│         ┌──────▼───────┐                    │
│         │  Modifies     │                   │
│         │  train.py or  │                   │
│         │  agent.py     │                   │
│         └──────┬───────┘                    │
│                │                            │
│         ┌──────▼───────┐                    │
│         │  Runs eval    │                   │
│         │  (metric)     │                   │
│         └──────┬───────┘                    │
│                │                            │
│         ┌──────▼───────┐                    │
│         │  Score better?│                   │
│         │  Keep : Revert│                   │
│         └──────────────┘                    │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  autoresearch: Evolving Model Training Code
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt; Three files. &lt;code&gt;prepare.py&lt;/code&gt; handles data prep (fixed). &lt;code&gt;train.py&lt;/code&gt; contains the full model and training loop (agent edits this). &lt;code&gt;program.md&lt;/code&gt; is the agent's instruction manual (human edits this).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loop:&lt;/strong&gt; Point a coding agent (Claude, Codex, etc.) at the repo. The agent reads &lt;code&gt;program.md&lt;/code&gt;, modifies &lt;code&gt;train.py&lt;/code&gt;, kicks off a 5-minute training run, checks if validation bits per byte improved. If yes, the change sticks. If no, the agent reverts and tries something else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt; ~12 experiments/hour, ~100 overnight. You wake up to a log of everything the agent tried and (hopefully) a better model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is code evolution, not fine-tuning:&lt;/strong&gt; Although autoresearch produces a better-trained model as its output, the evolution mechanism is code editing, not weight updating — the agent modifies Python source (architecture, optimizer, hyperparameters), not gradients. The coding agent's own weights are never touched.&lt;/p&gt;

&lt;h3&gt;
  
  
  autoagent: Evolving the Agent Harness
&lt;/h3&gt;

&lt;p&gt;autoagent applies the same pattern to the agent itself rather than model training code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;agent.py&lt;/code&gt; — the entire harness in a single file: config, tool definitions, agent registry, routing/orchestration, and a Harbor adapter boundary (explicitly marked as fixed)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;program.md&lt;/code&gt; — meta-agent instructions plus the directive (what kind of agent to build)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tasks/&lt;/code&gt; — evaluation tasks in Harbor format, running in Docker containers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The meta-agent modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Code Evolution
&lt;/h3&gt;

&lt;p&gt;This track generalizes to any scenario where you have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A single file (or small surface) to optimize&lt;/strong&gt; — a config file, a set of
hyperparameters, a build configuration, an agent harness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A clear, measurable metric&lt;/strong&gt; — validation loss, benchmark score, test pass rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A bounded experiment time&lt;/strong&gt; — each iteration completes in minutes, not hours&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your problem fits this shape, the autoresearch/autoagent pattern can be more effective than manual iteration — and it works overnight while you sleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important distinction from fine-tuning:&lt;/strong&gt; Code evolution modifies the &lt;em&gt;code and configuration around the model&lt;/em&gt;, not the model weights. It is cheaper, faster, and fully reversible (just revert the file). Consider it before jumping to fine-tuning.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Track 3 — RAG
&lt;/h2&gt;

&lt;p&gt;RAG fixes &lt;strong&gt;knowledge gaps&lt;/strong&gt;. It slots between code evolution and fine-tuning in the escalation ladder.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;th&gt;Fine-Tune&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Missing domain facts or docs&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stale knowledge / live updates needed&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specific reasoning style/pattern&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 500 training examples available&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucinating facts it should look up&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ partial&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Minimal RAG Skill
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# --- Toy in-memory store (swap for Chroma/Pinecone in prod) ---
&lt;/span&gt;&lt;span class="n"&gt;knowledge_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q1 2025 audit found 3 critical gaps in access control policies.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revenue for Q1 2025 was $4.2M, up 18% YoY.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The compound XR-47 showed hepatotoxicity in Phase 2 trials.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simple_retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Keyword overlap retrieval — replace with embedding search in prod.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;query_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;knowledge_base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;doc_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_words&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;doc_words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;simple_retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a helpful enterprise assistant.
Use ONLY the retrieved context below to answer.
If the context doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t cover the question, say so.

Retrieved context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context_block&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful enterprise assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What did the Q1 2025 audit find?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What were Q1 revenues?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tell me about XR-47 safety.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is our HR vacation policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# not in KB → honest fallback
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Q: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;rag_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key principle:&lt;/strong&gt; RAG + skills often eliminate the need for fine-tuning entirely for enterprise agents where knowledge is the primary gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Track 4 — LLM Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Fine-tuning internalizes &lt;strong&gt;behavior and reasoning patterns&lt;/strong&gt; that prompt iteration cannot reliably produce. It is the most expensive and least reversible track — and it carries a real risk of &lt;strong&gt;losing generalization capability&lt;/strong&gt;. A model fine-tuned on a narrow domain dataset may improve on that domain while degrading on everything else. This is not a theoretical concern: it is the primary failure mode of production fine-tuning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalate to fine-tuning only when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt iteration has plateaued (3+ rounds, no score improvement)&lt;/li&gt;
&lt;li&gt;Failures persist even when the correct skill is invoked&lt;/li&gt;
&lt;li&gt;Failures are concentrated in one domain (finance, legal, medical)&lt;/li&gt;
&lt;li&gt;You have 500+ clean, high-quality training trajectories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consider code evolution first.&lt;/strong&gt; If the issue is about &lt;em&gt;how the agent operates&lt;/em&gt; rather than &lt;em&gt;how the model reasons&lt;/em&gt;, the autoresearch/autoagent pattern from Section 6 may be more effective. Code evolution modifies the code and configuration around the model (architecture, hyperparameters, tools, orchestration) without touching model weights — cheaper, faster, and fully reversible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The iterative fine-tuning loop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Deploy → collect trajectories → filter (score ≥ 0.8) → fine-tune → redeploy → repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Avoiding catastrophic forgetting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always fine-tune from the base model, not iteratively from prior fine-tunes&lt;/li&gt;
&lt;li&gt;Evaluate on a held-out general benchmark alongside the domain benchmark&lt;/li&gt;
&lt;li&gt;Set a regression threshold: if general score drops &amp;gt; 5%, abort&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Frameworks That Automate the Fine-Tuning Loop
&lt;/h3&gt;

&lt;p&gt;Two frameworks are worth highlighting for teams that want to close the loop from production failures to weight updates without manual data curation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DSPy self-distillation.&lt;/strong&gt; DSPy can fine-tune a smaller, cheaper model (e.g., Llama 3) to mimic the reasoning of a larger model (e.g., GPT-5) by distilling the best-performing prompt traces into training data. The workflow: run your DSPy program with the large model, collect the traces that score highest on your metric, and use them to fine-tune the small model. This gives you the reasoning quality of the big model at the inference cost of the small one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentScope + Trinity-RFT.&lt;/strong&gt; Designed for enterprise-scale autonomous fine-tuning. AgentScope captures production logs via "Inference Tables." Trinity-RFT uses an LLM judge to label production data as "good" or "bad," then automatically kicks off a fine-tuning job using reinforcement learning from feedback (PPO or SFT). This is the most hands-off approach to weight updates: the system monitors production, identifies failures, curates training data, and fine-tunes — all without human intervention. The trade-off is complexity: you need the infrastructure to run fine-tuning jobs on schedule and the monitoring to catch regressions.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. The Master Decision Pipeline — LLM as Judge
&lt;/h2&gt;

&lt;p&gt;This is the centrepiece of the guide. Four judges, one pipeline — everything from Sections 4–8 plugs into the dispatcher at the end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent runs → Failures logged
     ↓
Judge 1: Per-run evaluator (scores 0–1)
     ↓
Judge 2: Signal extractor (persistence, skill gap, knowledge gap, data volume)
     ↓
Judge 3: Track recommender (LLM synthesizes signals → verdict)
     ↓
Judge 4: Action dispatcher → calls evolution_loop() / rag_agent() / fine-tune export
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# ── Data models ──────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;PROMPT_SKILL&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_skill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;CODE_EVOLUTION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_evolution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;RAG&lt;/span&gt;            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;FINE_TUNE&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fine_tune&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;RAG_FINE_TUNE&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag+fine_tune&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;prompt_version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;prompt_round&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;correct_skill_invoked&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;JudgeVerdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Track&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
    &lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;next_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;estimated_effort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;


&lt;span class="c1"&gt;# ── Judge 1: Per-run evaluator ────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Scores a single agent run 0.0–1.0.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Evaluate this agent response.

Task     : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Expected : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Actual   : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Reply with JSON only:
{{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 0.0-1.0, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true/false, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;one sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```

json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;


&lt;span class="c1"&gt;# ── Judge 2: Signal extractor ─────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_signals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;corpus_exists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;example_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Derives quantitative signals from a batch of runs.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all_passing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Signal 1: Prompt plateau — failures persisting after 3+ prompt rounds
&lt;/span&gt;    &lt;span class="n"&gt;persistence_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_round&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;

    &lt;span class="c1"&gt;# Signal 2: Skill bottleneck — skill fired but still failed
&lt;/span&gt;    &lt;span class="n"&gt;skill_failure_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correct_skill_invoked&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;

    &lt;span class="c1"&gt;# Signal 3: Domain concentration — one task type dominating failures
&lt;/span&gt;    &lt;span class="n"&gt;type_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;type_counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;type_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;dominant_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;type_counts&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;dominant_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type_counts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;type_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;type_counts&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Signal 4: Knowledge gap — failed despite no skill gap → likely needs retrieval
&lt;/span&gt;    &lt;span class="n"&gt;knowledge_gap_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correct_skill_invoked&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_round&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_runs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;         &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failure_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persistence_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persistence_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# &amp;gt; 0.4 → fine-tune
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skill_failure_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill_failure_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# &amp;gt; 0.3 → fine-tune
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge_gap_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;knowledge_gap_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# &amp;gt; 0.4 → RAG
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dominant_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dominant_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dominant_type_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dominant_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;# &amp;gt; 0.5 → systematic gap
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;corpus_exists&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;corpus_exists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;example_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_sufficient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;example_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;# ── Judge 3: Track recommender ────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recommend_track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sample_failures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JudgeVerdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;LLM judge: reads signals + failure samples → recommends track.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;sample_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_round&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_round&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correct_skill_invoked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correct_skill_invoked&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sample_failures&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;judge_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a senior AI systems architect. Decide the best improvement track
for an underperforming agent based on signals and failure samples.

## Quantitative Signals
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

## Signal Thresholds
- persistence_rate &amp;gt; 0.4     → prompt iteration plateauing → consider fine_tune
- skill_failure_rate &amp;gt; 0.3   → model reasoning is bottleneck → consider fine_tune
- knowledge_gap_rate &amp;gt; 0.4   → facts/docs missing → consider rag
- dominant_type_rate &amp;gt; 0.5   → systematic domain gap
- data_sufficient = false    → BLOCK fine_tune, default to rag or prompt_skill

## Available Tracks
- prompt_skill   : Rewrite system prompt and/or add/fix tools. Fast, cheap, reversible.
- code_evolution : Let a meta-agent modify code/config against a clear metric.
                   Use when the problem has a single file to optimize and a measurable goal.
- rag            : Index a knowledge corpus and retrieve at query time.
                   Prefer over fine-tuning when knowledge changes or data &amp;lt; 500.
- fine_tune      : Train on trajectories. Use when reasoning style is systematically
                   wrong AND 500+ examples exist AND prompt iteration has plateaued.
- rag+fine_tune  : Both. Use when knowledge AND reasoning style are both gaps.

## Current System Prompt
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

## Sample Failures
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sample_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Be conservative — recommend fine_tune only when signals clearly justify it.

Reply with JSON only:
{{
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;track&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_skill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_evolution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fine_tune&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag+fine_tune&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 0.0-1.0,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signals_fired&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {{
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_plateau&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   : true/false,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skill_bottleneck&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; : true/false,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge_gap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    : true/false,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;systematic_domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true/false,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_sufficient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  : true/false
    }},
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rationale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2-3 sentence explanation referencing specific signals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;],
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;e.g. 2hrs prompt iteration vs 4 days fine-tuning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main risk of this recommendation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
}}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-6-20260205&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;judge_prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```

json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JudgeVerdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;track&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signals_fired&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rationale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;next_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;estimated_effort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# ── Judge 4: Action dispatcher ────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JudgeVerdict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  TRACK       : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  CONFIDENCE  : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  RATIONALE   : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  EFFORT      : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimated_effort&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  RISK        : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  SIGNALS     : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  NEXT STEPS:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;actions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROMPT_SKILL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;→ Calling evolution_loop() to rewrite system prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Calling classify_failure() to split prompt vs skill fixes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CODE_EVOLUTION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;→ Set up program.md with constraints and goals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Point meta-agent at the repo (autoresearch or autoagent pattern)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Let it hill-climb overnight; review results in the morning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;→ Chunk and embed your knowledge corpus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Register retrieval as a new skill in SkillRegistry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Re-run eval suite to confirm improvement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FINE_TUNE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;→ Export high-scoring runs as training trajectories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Filter: keep only runs with score &amp;gt;= 0.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Submit fine-tune job (OpenAI / HuggingFace / Anthropic)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAG_FINE_TUNE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;→ Step 1: Build RAG pipeline first (faster win)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Step 2: Validate RAG improves knowledge gaps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;→ Step 3: Fine-tune on reasoning style gaps in parallel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;


&lt;span class="c1"&gt;# ── Master pipeline ───────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_judge_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;corpus_exists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;example_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⏳ Step 1: Evaluating all runs...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;evaluated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;evaluate_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;avg_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;evaluated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;failed_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;evaluated&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   Avg score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failed_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;avg_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Agent is performing well. No improvement needed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;⏳ Step 2: Extracting signals...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;signals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_signals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corpus_exists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;example_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   Signals: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;failed_runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;evaluated&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;⏳ Step 3: LLM judge recommending track...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;recommend_track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failed_runs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;⏳ Step 4: Dispatching recommendation...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt;


&lt;span class="c1"&gt;# ── Example usage ─────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the Q1 2025 earnings report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revenue $4.2M, up 18% YoY, 3 audit gaps found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have access to Q1 2025 earnings data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;correct_skill_invoked&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What were the audit findings for access control?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3 critical gaps found in access control policies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I cannot find specific audit findings in my knowledge.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;correct_skill_invoked&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculate compound interest $5000 at 4.5% for 3 years&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$706.06&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approximately $700 using compound interest formula.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;correct_skill_invoked&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;AgentRun&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze revenue trend from last 4 quarters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Structured YoY trend with % changes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Revenue seems to be going up based on general trends.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;correct_skill_invoked&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# scale to 40 runs
&lt;/span&gt;
&lt;span class="n"&gt;current_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a financial analysis assistant. Be thorough and precise.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_judge_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;corpus_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# financial docs available to index
&lt;/span&gt;    &lt;span class="n"&gt;example_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;350&lt;/span&gt;     &lt;span class="c1"&gt;# below the 500 fine-tuning threshold
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;⏳ Step 1: Evaluating all runs...
   Avg score: 0.31 | Failed: 37/40

⏳ Step 2: Extracting signals...
   Signals: {failure_rate: 0.93, persistence_rate: 0.89,
             knowledge_gap_rate: 0.76, dominant_type: finance,
             corpus_exists: True, data_sufficient: False}

⏳ Step 3: LLM judge recommending track...

⏳ Step 4: Dispatching recommendation...
============================================================
  TRACK       : RAG
  CONFIDENCE  : 91%
  RATIONALE   : High knowledge_gap_rate (0.76) with corpus_exists=True
                and data_sufficient=False clearly points to RAG. Agent
                is failing on factual retrieval, not reasoning style.
  EFFORT      : 4–6 hours to chunk, embed, and integrate corpus
  RISK        : Retrieval quality depends on chunking strategy
  SIGNALS     : {prompt_plateau: True, skill_bottleneck: False,
                 knowledge_gap: True, systematic_domain: True,
                 data_sufficient: False}

  NEXT STEPS:
    1. Chunk Q1 earnings report and audit docs into 512-token segments
    2. Embed with text-embedding-3-small and store in Chroma/Pinecone
    3. Register retrieval as a skill and re-run eval suite

→ Chunk and embed your knowledge corpus
→ Register retrieval as a new skill in SkillRegistry
→ Re-run eval suite to confirm improvement
============================================================
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  10. The Complete Escalation Ladder
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1 — Prompt tuning          (minutes, free)
     │  still failing after 3 rounds?
     ▼
Level 2 — Add/improve skills     (hours, cheap)
     │  still failing on reasoning/architecture?
     ▼
Level 3 — Code/harness evolution (hours, cheap — runs overnight)
     │  still failing on knowledge?
     ▼
Level 4 — RAG                    (hours, medium cost)
     │  still failing on reasoning style/pattern?
     ▼
Level 5 — Fine-tuning            (days, expensive)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The master pipeline in Section 9 enforces this ladder automatically — it blocks fine-tuning when data is insufficient, and prefers RAG when a corpus exists. Code evolution (Section 6) is a manual decision point: if your problem has a single file and a clear metric, try the autoresearch/autoagent pattern before moving to RAG or fine-tuning.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Continuous Monitoring
&lt;/h2&gt;

&lt;p&gt;The evolution loop does not end after the initial optimization converges. Production agents face shifting data distributions, new edge cases, and model updates that can degrade performance over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Periodic re-evaluation.&lt;/strong&gt; Schedule the eval suite to run on incoming data at regular intervals. When scores drop below a threshold, the evolution loop restarts automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;continuous_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;eval_suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;versioned_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;check_interval_hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Re-evaluate the agent periodically and trigger evolution if scores regress.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect_recent_tasks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# returns list[{"task": ..., "expected": ...}]
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;new_tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_interval_hours&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_eval_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;versioned_prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score regressed to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — triggering evolution loop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;new_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evolution_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;versioned_prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;trigger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto_regression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score healthy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_interval_hours&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Model version comparison on new data.&lt;/strong&gt; When a new model version becomes available, run the eval suite with the current prompt on both the old and new models. If the new model scores higher, update the &lt;code&gt;VersionedPrompt&lt;/code&gt; with the new model. If it scores lower, keep the current model — do not assume newer is better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift detection with auto-rollback.&lt;/strong&gt; Log prompt version, skill version, model version, and average score over time. If score regresses after any change, auto-rollback to the last known good version. The &lt;code&gt;VersionedPrompt.rollback()&lt;/code&gt; method makes this a single call.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Pitfalls &amp;amp; Safety
&lt;/h2&gt;

&lt;p&gt;Self-evolving loops introduce new failure modes that static agents do not have. The more autonomy you give the improvement loop, the more these risks matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reward hacking&lt;/strong&gt; — if your eval signal is imperfect, the agent will optimize for the signal rather than the goal. Use multiple eval dimensions (correctness, format, safety) and audit a random sample manually every N rounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift detection&lt;/strong&gt; — log prompt version, skill version, and avg score over time. If score regresses after a change, auto-rollback to the last known good version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version everything&lt;/strong&gt; — never deploy an unevaluated prompt or skill. The &lt;code&gt;promote_prompt()&lt;/code&gt; gate in Section 4c enforces this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human checkpoints&lt;/strong&gt; — before any fine-tuning job, require a human review of the filtered training trajectories. Garbage in, garbage out — and fine-tuning mistakes are expensive to undo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rollback strategy&lt;/strong&gt; — store every prompt version with its eval score. A one-line revert (&lt;code&gt;current_prompt = best_prompt()&lt;/code&gt;) should always be available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safety Models Across Frameworks
&lt;/h3&gt;

&lt;p&gt;Different frameworks take different approaches to containing the risk of autonomous evolution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Safety approach&lt;/th&gt;
&lt;th&gt;Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Cookbook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Versioned prompts with rollback; promote-only-if-better gate&lt;/td&gt;
&lt;td&gt;Simple and effective, but no isolation — bad prompts can affect production before rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;autoresearch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Git-based keep-or-revert; fixed 5-minute time budget per experiment&lt;/td&gt;
&lt;td&gt;Time budget prevents runaway experiments; git makes every change reversible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;autoagent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker isolation; Harbor sandboxing; tasks run in containers&lt;/td&gt;
&lt;td&gt;Strong isolation, but Docker overhead adds latency to the feedback loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evolver&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Command whitelist; scoped execution; timeout limits; full audit trail of every Event&lt;/td&gt;
&lt;td&gt;Most comprehensive safety model, but also the most complex to set up&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Strategy Presets
&lt;/h3&gt;

&lt;p&gt;EvoMap's Evolver introduces a useful concept that applies even outside the framework: &lt;strong&gt;strategy presets&lt;/strong&gt; that match the evolution behavior to the current development phase.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;innovate&lt;/code&gt;&lt;/strong&gt; — maximize new features and exploration. Use early in development when the agent is far from production-ready.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;harden&lt;/code&gt;&lt;/strong&gt; — focus on stability, regression testing, and edge case coverage. Use when approaching production readiness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;repair-only&lt;/code&gt;&lt;/strong&gt; — constrain the agent to fixes only, no new behavior. Use when something is broken in production and you need a targeted fix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This maps neatly onto how most teams already think about release stages. Even without Evolver, you can implement strategy presets by adjusting the &lt;code&gt;threshold&lt;/code&gt; and &lt;code&gt;max_rounds&lt;/code&gt; parameters in your evolution loop: high exploration tolerance for innovate mode, strict thresholds and minimal rounds for repair-only.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Conclusion
&lt;/h2&gt;

&lt;p&gt;Self-evolving agents are not magic — they are disciplined feedback loops with clear escalation rules. Several open-source frameworks have already proven these patterns work in practice, from automated prompt optimization to overnight code evolution to governed harness engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The four tracks in one sentence each:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt/Skill&lt;/strong&gt; — change how the agent thinks and what it can do. Always try this first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code/Harness evolution&lt;/strong&gt; — let the agent modify its own implementation against a clear metric. Try this before RAG or fine-tuning when the problem has a single file and a measurable goal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt; — give the agent access to knowledge it doesn't have. Prefer this over fine-tuning when knowledge changes or data is scarce.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning&lt;/strong&gt; — internalize reasoning patterns that prompt iteration cannot reliably produce. Use this last, and only with 500+ clean examples.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Thumb rules to remember:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thin prompt, rich skill library&lt;/li&gt;
&lt;li&gt;RAG before fine-tune&lt;/li&gt;
&lt;li&gt;Code evolution before fine-tune (it is cheaper and reversible)&lt;/li&gt;
&lt;li&gt;Persistence is the clearest fine-tune signal&lt;/li&gt;
&lt;li&gt;Never deploy an unevaluated change&lt;/li&gt;
&lt;li&gt;The LLM judge pipeline does the routing — let it&lt;/li&gt;
&lt;li&gt;Version everything; rollback should be one line&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practical advice from the frameworks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version your prompts like you version your code (&lt;code&gt;VersionedPrompt&lt;/code&gt; pattern)&lt;/li&gt;
&lt;li&gt;Try the autoresearch pattern for any "single file, single metric" problem&lt;/li&gt;
&lt;li&gt;Borrow Evolver's audit trail thinking for production agents — log every change as a structured event with before/after scores&lt;/li&gt;
&lt;li&gt;Use strategy presets to match evolution aggressiveness to the development phase&lt;/li&gt;
&lt;li&gt;Layer your graders: deterministic checks first, then semantic, then LLM judge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The long-term vision is agents that compound in capability over time, with humans setting goals and guardrails while the agent handles the improvement loop. The pipeline in Section 9 is a practical starting point for exactly that.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Frameworks covered in this guide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cookbook.openai.com/examples/partners/self_evolving_agents/autonomous_agent_retraining" rel="noopener noreferrer"&gt;OpenAI Self-Evolving Agents Cookbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;Karpathy's autoresearch&lt;/a&gt; — AI agents running research on single-GPU nanochat training&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kevinrgu/autoagent" rel="noopener noreferrer"&gt;kevinrgu/autoagent&lt;/a&gt; — autonomous harness engineering&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/EvoMap/evolver" rel="noopener noreferrer"&gt;EvoMap Evolver&lt;/a&gt; — governed evolution with audit trails&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2507.19457" rel="noopener noreferrer"&gt;GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning&lt;/a&gt; — Agrawal et al. (&lt;a href="https://github.com/gepa-ai/gepa" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Additional frameworks and methodologies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/stanfordnlp/dspy" rel="noopener noreferrer"&gt;DSPy&lt;/a&gt; — Declarative Self-improving Python; Bayesian prompt compilation and self-distillation (Stanford NLP)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/zou-group/textgrad" rel="noopener noreferrer"&gt;TextGrad&lt;/a&gt; — Automatic differentiation via text; textual backpropagation for LLM optimization (&lt;a href="https://www.nature.com/articles/s41586-025-08661-4" rel="noopener noreferrer"&gt;&lt;em&gt;Nature&lt;/em&gt;, 2025&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/microsoft/memento" rel="noopener noreferrer"&gt;Memento-Skills&lt;/a&gt; — Skill-evolution framework for long-horizon autonomous agents&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/modelscope/agentscope" rel="noopener noreferrer"&gt;AgentScope&lt;/a&gt; — Multi-agent platform with Trinity-RFT for online fine-tuning from production logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Background reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://blog.softmaxdata.com/self-evolving-agents/" rel="noopener noreferrer"&gt;Self-Evolving Agents: Three Frameworks That Let Your AI Improve Itself&lt;/a&gt; — Jia Chen, Softmax Data Blog&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropic/anthropic-sdk-python" rel="noopener noreferrer"&gt;Anthropic Python SDK&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agentaichallenge</category>
      <category>programming</category>
    </item>
    <item>
      <title>Anthropic Managed Agents: What It Takes to Build Agent-as-a-Service</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Thu, 09 Apr 2026 17:19:56 +0000</pubDate>
      <link>https://forem.com/chen115y/when-the-model-company-builds-the-factory-what-it-takes-to-build-agent-as-a-service-51g5</link>
      <guid>https://forem.com/chen115y/when-the-model-company-builds-the-factory-what-it-takes-to-build-agent-as-a-service-51g5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Anthropic just launched Managed Agents. The open-source world has been learning the hard way why this matters.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;On April 8, 2025, Anthropic launched the public beta of &lt;strong&gt;Claude Managed Agents&lt;/strong&gt; -- a fully hosted platform for running AI agents with built-in sandboxing, session management, error recovery, and permission control. Four days earlier, the company had quietly cut off third-party agent frameworks like OpenClaw from using Claude subscription quotas, forcing them onto pay-per-use billing.&lt;/p&gt;

&lt;p&gt;These two moves, four days apart, tell one story: &lt;strong&gt;the company that sells the brain has decided to sell the body, too.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why? Because the "body" -- the infrastructure that lets an AI model actually &lt;em&gt;do&lt;/em&gt; things in the real world -- is where agents succeed or fail in production. And as the open-source community has painfully demonstrated, getting this infrastructure wrong doesn't just cause bugs. It causes data leaks, runaway costs, and security breaches measured in the hundreds of thousands of dollars.&lt;/p&gt;

&lt;p&gt;This post explores three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What does it actually take&lt;/strong&gt; to build a reliable, safe Agent-as-a-Service?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What goes wrong&lt;/strong&gt; when these foundations are missing? (We have the data.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How do different approaches&lt;/strong&gt; -- managed platforms, open-source gateways, and learning engines -- stack up against these requirements?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether you're a developer evaluating agent frameworks, an architect designing agent infrastructure, or simply curious about where AI is headed, the answer starts with understanding five technical pillars that separate demo-grade agents from production-grade ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76gy6fv4tgonu2inzz7i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76gy6fv4tgonu2inzz7i.png" alt=" " width="800" height="63"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is an AI Agent, Really?
&lt;/h2&gt;

&lt;p&gt;Before diving into architecture, let's clarify what we're talking about -- because "AI agent" means very different things to different people.&lt;/p&gt;

&lt;p&gt;Most of us interact with AI through chat interfaces: you type a question, the model answers. That's a &lt;strong&gt;model&lt;/strong&gt; -- a brain in a jar. Extremely intelligent, but it can't &lt;em&gt;do&lt;/em&gt; anything. It can't browse your files, run code, send emails, or check your calendar. It just thinks and talks.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;agent&lt;/strong&gt; is what happens when you give that brain a body.&lt;/p&gt;

&lt;p&gt;Anthropic's engineering team describes this with a vivid metaphor: &lt;strong&gt;the model is the brain; the harness is the limbs plus the nervous system.&lt;/strong&gt; The brain decides what to do. The harness actually does it -- calling tools, managing context, handling errors, keeping things running.&lt;/p&gt;

&lt;p&gt;In practice, an agent system has three core components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywog72x8d97fkg4x19k3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywog72x8d97fkg4x19k3.png" alt=" " width="470" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of it this way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Session&lt;/strong&gt; is the agent's notebook -- the log of everything that's happened. If the agent crashes, this is how it remembers where it left off.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Harness&lt;/strong&gt; is the nervous system -- the loop that calls the AI model, routes tool calls, handles errors, and decides what to do next.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Sandbox&lt;/strong&gt; is the workshop -- the isolated environment where the agent actually runs code and performs actions, separated from your sensitive data and credentials.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you use ChatGPT or Claude in a chat window, you're talking to the brain. When companies deploy agents that write code, manage workflows, or process documents autonomously, they need all three components working in concert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And that's where things get interesting -- and dangerous.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When Agents Go Wrong: Lessons from OpenClaw
&lt;/h2&gt;

&lt;p&gt;OpenClaw is one of the most popular open-source agent frameworks -- the fastest-growing repo in GitHub history, surpassing 350,000 stars in under three months -- with a thriving community of over 1,000 contributors. It's powerful, flexible, and genuinely useful. It's also a case study in what happens when agent infrastructure doesn't get the fundamentals right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2602.14364" rel="noopener noreferrer"&gt;A security audit conducted by researchers at Shanghai University of Science and Technology and the Shanghai AI Lab&lt;/a&gt; put OpenClaw through 34 standardized test cases. The results should give anyone building agent services pause.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall safety pass rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent misunderstanding &amp;amp; unsafe assumptions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0% pass rate&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt injection robustness&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;57%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unexpected results under open-ended objectives&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(The audit used MiniMax M2.1 as the default model. Results may vary with other models, but the failure patterns -- particularly around architecture and permission design -- are model-agnostic.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That &lt;strong&gt;0% pass rate&lt;/strong&gt; on intent misunderstanding is worth lingering on. In every single test with an ambiguous instruction, the agent filled in the blanks on its own and executed immediately. It never once asked the user for confirmation.&lt;/p&gt;

&lt;p&gt;Industry-wide monitoring data paints an even more alarming picture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;230,000+ OpenClaw instances&lt;/strong&gt; detected exposed on the public internet&lt;/li&gt;
&lt;li&gt;Approximately &lt;strong&gt;87,800 instances&lt;/strong&gt; with data leaks&lt;/li&gt;
&lt;li&gt;Approximately &lt;strong&gt;43,000 instances&lt;/strong&gt; with personal identity information exposed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;36.8% of skills&lt;/strong&gt; on the ClawHub marketplace contained security flaws&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over 1,000 skills&lt;/strong&gt; contained malicious payloads&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;CVSS 8.8 high-severity vulnerability&lt;/strong&gt; enabling remote computer takeover&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cisco's assessment was blunt: &lt;strong&gt;"OpenClaw's security issues aren't configuration problems -- they're architecture problems."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw's own documentation concedes the point: &lt;strong&gt;&lt;em&gt;&lt;a href="https://docs.openclaw.ai/gateway/security" rel="noopener noreferrer"&gt;There is no "perfectly secure" setup.&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Do These Failures Happen?
&lt;/h3&gt;

&lt;p&gt;These aren't random bugs. They trace back to four systemic root causes -- each one a missing piece of agent infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Context Compression Drops Safety Rails.&lt;/strong&gt; When the information volume gets too large, the agent compresses its memory. During compression, it can squeeze out critical safety instructions -- the very guardrails meant to keep it in check. Imagine an air traffic controller under extreme stress who starts skipping safety checklists. That's context compression in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Execute First, Ask Never.&lt;/strong&gt; The default behavior strategy leans toward &lt;a href="https://docs.openclaw.ai/automation/standing-orders" rel="noopener noreferrer"&gt;"do it first, explain later"&lt;/a&gt; rather than "ask clearly first." For every ambiguous instruction in the security audit, the agent guessed the user's intent and acted immediately. Zero confirmation. Zero pause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Prompt Injection Walks Through the Front Door.&lt;/strong&gt; Malicious content embedded in inputs can trick the agent into bypassing safety mechanisms entirely. With a 57% robustness rate, nearly half of all injection attempts succeed. That's not a bug in one feature -- it's a gap in the security boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Agent Has the Keys to the Kingdom.&lt;/strong&gt; OpenClaw runs with the same system permissions as the user who launched it. It can read, write, and delete anything the user can. Combine this with the injection vulnerability above, and an attacker doesn't need to hack your system -- they just need to convince the agent to do it for them.&lt;/p&gt;

&lt;p&gt;These aren't problems unique to OpenClaw. &lt;strong&gt;They're the universal challenges of Agent-as-a-Service.&lt;/strong&gt; Any framework, any platform, any team building agents will face these same four failure modes -- unless they're addressed at the architectural level.&lt;/p&gt;

&lt;p&gt;Which brings us to the technologies that actually matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 5 Pillars of Effective and Safe Agent Services
&lt;/h2&gt;

&lt;p&gt;Anthropic has published 15 engineering blog posts over the past two years, documenting their approach to building production-grade agents. Distilled into a learning path, they form a capability pyramid -- a stack of technologies and practices that builds from foundation to production readiness:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fci8c3xipa6ymqzwf9eve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fci8c3xipa6ymqzwf9eve.png" alt=" " width="284" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each pillar directly addresses one of the failure modes we saw with OpenClaw:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2dhwepczo4hgp27mhwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2dhwepczo4hgp27mhwr.png" alt=" " width="753" height="659"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk through them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 1: Foundation Architecture -- Know When NOT to Use an Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The OpenClaw failure it addresses:&lt;/strong&gt; Execute first, ask never.&lt;/p&gt;

&lt;p&gt;The most important architectural decision is also the most counterintuitive: &lt;strong&gt;start simple, and don't use an autonomous agent when a well-defined workflow will do.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anthropic's foundational guidance, laid out in &lt;em&gt;"Building effective agents,"&lt;/em&gt; distinguishes between &lt;strong&gt;workflows&lt;/strong&gt; and &lt;strong&gt;agents&lt;/strong&gt;. A workflow is a predefined sequence of steps with clear decision points. An agent is an autonomous system that decides its own next steps. The difference matters enormously.&lt;/p&gt;

&lt;p&gt;The execute-first problem in OpenClaw stems from a fundamental architectural choice: giving the agent full autonomy over ambiguous tasks without building in confirmation gates. In workflow-based architectures, ambiguous steps trigger explicit checkpoints -- the system asks the user before proceeding. In purely autonomous architectures, the agent fills in blanks and acts.&lt;/p&gt;

&lt;p&gt;For practitioners, the key patterns here are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ReAct&lt;/strong&gt; (Reasoning + Acting): The agent reasons about what to do, takes an action, observes the result, and then reasons again before the next step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planning&lt;/strong&gt;: The agent creates a plan before execution, allowing for human review of the intended steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop gates&lt;/strong&gt;: Critical actions require explicit approval before execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rule of thumb: &lt;strong&gt;if a task has clear inputs and outputs, use a workflow. If it requires judgment under uncertainty, use an agent -- but with confirmation gates for high-risk actions.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;For Practitioners:&lt;/strong&gt; Read &lt;em&gt;"Building effective agents"&lt;/em&gt; and &lt;em&gt;"Building agents with the Claude Agent SDK"&lt;/em&gt; on &lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Anthropic's Engineering Blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pillar 2: Tool Capabilities -- Think Before You Act
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The OpenClaw failure it addresses:&lt;/strong&gt; Reckless execution without reasoning.&lt;/p&gt;

&lt;p&gt;An agent is only as good as its tools -- and more importantly, how it &lt;em&gt;decides&lt;/em&gt; to use them. Tool description design directly affects how well an agent selects and invokes the right tool at the right time. A vague tool description leads to misuse; a precise one guides the agent toward correct behavior.&lt;/p&gt;

&lt;p&gt;But the real breakthrough in this space is Anthropic's &lt;strong&gt;Think Tool&lt;/strong&gt; -- a technique that lets agents perform chain-of-thought reasoning &lt;em&gt;before&lt;/em&gt; taking any action. Instead of immediately executing, the agent pauses, reasons through its options, considers edge cases, and only then acts.&lt;/p&gt;

&lt;p&gt;This is the direct antidote to "execute first, ask later." The Think Tool essentially gives the agent an internal monologue: &lt;em&gt;"Wait -- is this instruction ambiguous? What are the possible interpretations? Which one is most likely? Should I ask for clarification?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In practice, the Think Tool significantly improves performance on complex reasoning tasks, especially those involving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ambiguous instructions with multiple valid interpretations&lt;/li&gt;
&lt;li&gt;Multi-step tasks where an early mistake compounds&lt;/li&gt;
&lt;li&gt;Tasks requiring judgment about when to ask for help&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond the Think Tool, production-grade tool systems need &lt;strong&gt;Agent Skills&lt;/strong&gt; -- reusable, encapsulated capabilities that an agent can invoke like a professional using standardized procedures. Skills turn one-off problem-solving into repeatable expertise.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;For Practitioners:&lt;/strong&gt; Read &lt;em&gt;"The 'think' tool,"&lt;/em&gt; &lt;em&gt;"Writing effective tools for agents -- with agents,"&lt;/em&gt; and &lt;em&gt;"Equipping agents for the real world with Agent Skills"&lt;/em&gt; on &lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Anthropic's Engineering Blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pillar 3: Context Engineering -- Memory That Doesn't Lose the Plot
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The OpenClaw failure it addresses:&lt;/strong&gt; Context compression dropping safety instructions.&lt;/p&gt;

&lt;p&gt;Even as AI model context windows expand to hundreds of thousands of tokens, &lt;strong&gt;context engineering remains critical.&lt;/strong&gt; A larger window doesn't solve the fundamental problem: the model's attention is a scarce resource, and what you put into the context window -- and how you structure it -- determines whether the agent remembers its safety instructions or forgets them under load.&lt;/p&gt;

&lt;p&gt;Context compression losing safety rails is not a theoretical risk. It's a documented failure mode. See the &lt;a href="https://medium.com/@dingzhanjun/analyzing-the-incident-of-openclaw-deleting-emails-a-technical-deep-dive-56e50028637b" rel="noopener noreferrer"&gt;Analyzing the Incident of OpenClaw Deleting Emails: A Technical Deep Dive&lt;/a&gt; for more details. When the information volume exceeds what the system can handle, something gets squeezed out. In OpenClaw's case, that "something" was often the safety guardrails themselves.&lt;/p&gt;

&lt;p&gt;The solution isn't just "bigger context windows." It's &lt;strong&gt;context engineering&lt;/strong&gt; -- the deliberate management of what goes into the agent's working memory, when, and in what form.&lt;/p&gt;

&lt;p&gt;Key techniques include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory management&lt;/strong&gt;: Explicitly structuring what the agent remembers across turns and sessions, rather than relying on raw conversation history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt;: Instead of cramming everything into the context window, retrieve only the information relevant to the current task. This keeps the context focused and prevents safety instructions from being crowded out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual Retrieval&lt;/strong&gt;: An innovation from Anthropic where the model generates explanatory context &lt;em&gt;before&lt;/em&gt; retrieval, solving the classic RAG problem of chunk-level information loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An emerging open-source approach tackles this from a different angle. &lt;strong&gt;&lt;a href="https://github.com/milla-jovovich/mempalace" rel="noopener noreferrer"&gt;MemPalace&lt;/a&gt;&lt;/strong&gt; (33K+ GitHub stars) takes the position that the problem isn't what the AI remembers -- it's what it &lt;em&gt;forgets&lt;/em&gt; when memory gets compressed. Instead of having the AI decide what's worth keeping (and risk discarding safety instructions), MemPalace stores everything verbatim and uses a structured navigation system -- inspired by the ancient Greek memory palace technique -- to make it findable without loading it all into context.&lt;/p&gt;

&lt;p&gt;The architecture is a layered memory stack that directly addresses context pressure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it holds&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Identity -- who is this AI?&lt;/td&gt;
&lt;td&gt;~50 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Critical facts -- team, projects, preferences&lt;/td&gt;
&lt;td&gt;~120 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Room recall -- recent sessions, current topic&lt;/td&gt;
&lt;td&gt;On demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep search -- semantic query across all stored memories&lt;/td&gt;
&lt;td&gt;On demand&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent wakes up with only ~170 tokens (L0 + L1) and searches deeper layers only when needed. This keeps the context window lean and focused. Memories are organized into "wings" (projects/people), "rooms" (topics), and "halls" (memory types like decisions, events, discoveries), with "tunnels" cross-referencing the same topic across domains. This structured retrieval scored 96.6% recall on the LongMemEval benchmark -- the highest published result for a free, local-only system with zero API calls.&lt;/p&gt;

&lt;p&gt;Critically for the context compression problem, MemPalace includes a &lt;strong&gt;PreCompact hook&lt;/strong&gt; that fires &lt;em&gt;before&lt;/em&gt; the context window is compressed, performing an emergency save of the current session. This is a direct architectural response to the failure mode that caused the Meta email deletion incident: if the agent's safety instructions live only in the context window, they can be summarized away. MemPalace externalizes memory so that compression never touches what matters.&lt;/p&gt;

&lt;p&gt;The principle: &lt;strong&gt;treat the context window like a surgeon's tray, not a junk drawer.&lt;/strong&gt; Every token should earn its place. Safety instructions should be architecturally pinned, not left to compete with task data for the model's attention.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;For Practitioners:&lt;/strong&gt; Read &lt;em&gt;"Effective context engineering for AI agents"&lt;/em&gt; and &lt;em&gt;"Introducing Contextual Retrieval"&lt;/em&gt; on &lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Anthropic's Engineering Blog&lt;/a&gt;. For an open-source, local-first approach to structured memory, see &lt;a href="https://github.com/milla-jovovich/mempalace" rel="noopener noreferrer"&gt;MemPalace&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pillar 4: Long Tasks &amp;amp; Collaboration -- Surviving the Marathon
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The OpenClaw failure it addresses:&lt;/strong&gt; No state recovery, runaway execution.&lt;/p&gt;

&lt;p&gt;Demo agents handle single-turn tasks. Production agents run for minutes, hours, or days. The difference is enormous.&lt;/p&gt;

&lt;p&gt;A long-running agent needs what Anthropic calls a &lt;strong&gt;harness&lt;/strong&gt; -- an execution framework designed for durability. The harness handles what happens when things go wrong: network interruptions, model errors, infinite loops, context window exhaustion. Without a harness, a long-running agent is a ticking time bomb -- one crash and all progress is lost.&lt;/p&gt;

&lt;p&gt;The core capabilities a harness must provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State persistence&lt;/strong&gt;: If the agent crashes, it can resume from where it left off, not from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interruption recovery&lt;/strong&gt;: External disruptions (network outages, API rate limits, user cancellation) are handled gracefully.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loop detection&lt;/strong&gt;: The agent recognizes when it's stuck in a cycle and breaks out, rather than burning tokens endlessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource budgets&lt;/strong&gt;: Hard limits on tokens, time, and API calls prevent runaway costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For complex tasks that exceed what a single agent can handle, the &lt;strong&gt;Orchestrator-Workers pattern&lt;/strong&gt; distributes work across multiple agents coordinated by a central orchestrator. This is how Anthropic built their own multi-agent research system -- one agent plans, others execute specialized subtasks, and the orchestrator synthesizes results.&lt;/p&gt;

&lt;p&gt;The practical implication: &lt;strong&gt;if your agent can run for more than a few minutes, you need a harness. If it can run unsupervised, you need budgets and kill switches.&lt;/strong&gt; The users who discovered their OpenClaw instances burning money wildly learned this lesson the hard way.&lt;/p&gt;

&lt;p&gt;But a harness alone isn't enough. A long-running agent can stay alive, recover from crashes, and stay within budget -- and still silently degrade in quality over time. This is where &lt;strong&gt;continuous evaluation&lt;/strong&gt; becomes essential. Anthropic's guide on &lt;a href="https://platform.claude.com/docs/en/test-and-evaluate/develop-tests" rel="noopener noreferrer"&gt;defining success criteria and building evaluations&lt;/a&gt; lays out a disciplined framework that applies directly to long-running agent services.&lt;/p&gt;

&lt;p&gt;The key insight: success criteria for agents must be &lt;strong&gt;specific, measurable, achievable, and relevant&lt;/strong&gt; -- not vague goals like "performs well." For a long-running agent, this means defining quantitative thresholds upfront: What is the acceptable error rate per 10,000 actions? What is the maximum response latency? What percentage of edge cases must be handled without human intervention?&lt;/p&gt;

&lt;p&gt;The framework distinguishes three grading methods, ranked by preference:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Code-based grading&lt;/strong&gt; -- fastest, most reliable. Exact match, string match, programmatic checks. Use this wherever possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-based grading&lt;/strong&gt; -- fast and flexible, suitable for complex judgments like tone, coherence, and context utilization. Requires clear rubrics and validated reliability before scaling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human grading&lt;/strong&gt; -- most flexible but slowest. Avoid for ongoing monitoring; reserve for calibrating automated methods.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For long-running agents specifically, the &lt;strong&gt;context utilization&lt;/strong&gt; evaluation is critical: it measures whether the agent is still coherently using information from earlier in the conversation, which is exactly the capability that degrades under context pressure. The &lt;strong&gt;consistency&lt;/strong&gt; evaluation catches drift -- if the agent starts giving different answers to semantically similar questions over time, something has gone wrong. And &lt;strong&gt;privacy preservation&lt;/strong&gt; evaluations can detect when an agent starts leaking sensitive information that it should be filtering, a risk that compounds the longer an agent runs with accumulated context.&lt;/p&gt;

&lt;p&gt;The principle that ties this back to the harness: &lt;strong&gt;a harness keeps the agent running; evaluations tell you whether it's still running correctly.&lt;/strong&gt; Loop detection catches infinite cycles. Evals catch silent quality degradation. You need both.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;For Practitioners:&lt;/strong&gt; Read &lt;em&gt;"Effective harnesses for long-running agents,"&lt;/em&gt; &lt;em&gt;"How we built our multi-agent research system,"&lt;/em&gt; and &lt;em&gt;"Code execution with MCP"&lt;/em&gt; on &lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Anthropic's Engineering Blog&lt;/a&gt;. For evaluation methodology, see Anthropic's &lt;a href="https://platform.claude.com/docs/en/test-and-evaluate/develop-tests" rel="noopener noreferrer"&gt;Define success criteria and build evaluations&lt;/a&gt; guide.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pillar 5: Safety, Evaluation &amp;amp; Monitoring -- The Last Mile
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The OpenClaw failure it addresses:&lt;/strong&gt; Excessive permissions, prompt injection, no production safeguards.&lt;/p&gt;

&lt;p&gt;This is the pillar where most teams skip steps -- and where the consequences are most severe. The numbers from OpenClaw tell the story: 230,000 exposed instances, 87,800 data leaks, a CVSS 8.8 remote code execution vulnerability.&lt;/p&gt;

&lt;p&gt;Three practices are non-negotiable for production agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandboxing.&lt;/strong&gt; When an agent can execute code, it must do so in an isolated environment that cannot access credentials, sensitive files, or system-level permissions. OpenClaw runs with the user's full system permissions. Anthropic's Managed Agents architecture puts the sandbox in a separate container that can &lt;em&gt;never&lt;/em&gt; touch credentials -- authentication goes through a vault proxy, and the harness itself has zero awareness of any credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Least privilege.&lt;/strong&gt; The agent should have exactly the permissions it needs for the current task, and no more. Permissions should be granted per-task and revoked when the task completes. Standing permissions are standing risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluations (Evals).&lt;/strong&gt; Anthropic's guidance is unambiguous: &lt;strong&gt;without evals, don't go live.&lt;/strong&gt; An automated evaluation system that tests agent behavior against known scenarios -- including adversarial ones like prompt injection -- is the only way to know whether your agent is safe before it touches production data. Relying on manual testing or intuition is not engineering; it's hope.&lt;/p&gt;

&lt;p&gt;The difference between OpenClaw's 57% prompt injection robustness and a production-grade system isn't just better prompting -- it's architectural. Security must be designed into the boundary between components, not bolted on as a configuration option.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;For Practitioners:&lt;/strong&gt; Read &lt;em&gt;"Demystifying evals for AI agents,"&lt;/em&gt; &lt;em&gt;"Beyond permission prompts: Claude Code sandboxing,"&lt;/em&gt; and &lt;em&gt;"A postmortem of three recent issues"&lt;/em&gt; on &lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Anthropic's Engineering Blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Anthropic's Answer: The Operating System Approach
&lt;/h2&gt;

&lt;p&gt;With the 5 pillars as context, Anthropic's Managed Agents architecture comes into sharper focus. It's not just a hosting service -- it's a deliberate embodiment of these principles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separating Session, Harness, and Sandbox
&lt;/h3&gt;

&lt;p&gt;The core design decision is to &lt;strong&gt;thoroughly separate&lt;/strong&gt; three components that most agent frameworks cram into a single container:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Analogy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The log of what happened&lt;/td&gt;
&lt;td&gt;The agent's notebook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Harness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The loop of calling Claude and routing tool calls&lt;/td&gt;
&lt;td&gt;The nervous system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandbox&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The execution environment where code runs&lt;/td&gt;
&lt;td&gt;The workshop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Previously, all three lived in one container. If it crashed, the session was lost. Engineers had to babysit. Anthropic calls this the &lt;strong&gt;"pets"&lt;/strong&gt; model -- each container is precious, irreplaceable, and needs constant attention.&lt;/p&gt;

&lt;p&gt;After separation, containers become &lt;strong&gt;"cattle."&lt;/strong&gt; If one dies, spin up a new one. The session is stored externally. The harness resumes via &lt;code&gt;wake(sessionId)&lt;/code&gt;, reads the event log, and continues running. Any component can crash or be replaced independently.&lt;/p&gt;

&lt;p&gt;Think of it like a restaurant kitchen. The "pets" model is a restaurant with one chef who does everything -- if that chef gets sick, the restaurant closes. The "cattle" model is a kitchen brigade: prep cooks, line cooks, and a head chef, each replaceable. The recipes (session) are written down. The process (harness) is standardized. The cooking stations (sandbox) are interchangeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security by Architecture
&lt;/h3&gt;

&lt;p&gt;The security redesign directly addresses the "keys to the kingdom" problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Old design:&lt;/strong&gt; Agent-generated code and system credentials ran in the same container. A prompt injection only needed to convince the model to read its own environment variables to steal tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New design:&lt;/strong&gt; The sandbox can &lt;strong&gt;never&lt;/strong&gt; touch credentials. Authentication goes through a vault proxy. The harness has zero awareness of any credentials.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a configuration toggle. It's a boundary enforced by the architecture itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Results
&lt;/h3&gt;

&lt;p&gt;The performance impact of this separation is dramatic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;p50&lt;/strong&gt; (median) time-to-first-token latency &lt;strong&gt;dropped 60%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p95&lt;/strong&gt; (tail) time-to-first-token latency &lt;strong&gt;dropped over 90%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Separating concerns doesn't just improve reliability -- it improves speed. When the harness doesn't have to manage the sandbox's lifecycle, it can focus on what it does best: routing model calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  The OS Analogy
&lt;/h3&gt;

&lt;p&gt;Anthropic draws a comparison to operating systems: an OS virtualizes hardware into stable abstractions -- "processes," "files," "sockets" -- that outlast any generation of hardware. The &lt;code&gt;read()&lt;/code&gt; system call worked on 1970s disk drives and works on today's SSDs.&lt;/p&gt;

&lt;p&gt;Managed Agents does the same thing for agents: &lt;strong&gt;virtualizing core components into stable interfaces&lt;/strong&gt;, so upper-level logic doesn't break when the model gets smarter or the framework evolves. Every model generation makes some harness code obsolete -- Anthropic calls this the "structural dilemma of the harness industry." Their solution is to own the interface and let the implementation evolve underneath.&lt;/p&gt;

&lt;h3&gt;
  
  
  Early Adoption
&lt;/h3&gt;

&lt;p&gt;The approach is already in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Notion&lt;/strong&gt; integrated agents into its workspaces, supporting dozens of concurrent tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rakuten&lt;/strong&gt; deployed department-specific agents (product, sales, finance, HR) within a week, connected to Slack and Teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentry&lt;/strong&gt; has agents automatically writing bug-fix patches and opening PRs -- an integration originally estimated at months that went live in weeks.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Open Source Still Matters: Two Paths Forward
&lt;/h2&gt;

&lt;p&gt;Managed Agents is Anthropic's answer. But the open-source world offers two genuinely different alternatives -- and understanding the contrast reveals what "agent value" actually means.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenClaw: The Platform Path
&lt;/h3&gt;

&lt;p&gt;OpenClaw's core logic is that of a &lt;strong&gt;platform&lt;/strong&gt; or &lt;strong&gt;gateway.&lt;/strong&gt; Think of it as a dispatch center. It unifies chat entry points -- Telegram, Slack, Discord, WhatsApp -- connects different models, different tools, and different workflows. It's a multi-channel personal assistant operating system.&lt;/p&gt;

&lt;p&gt;This direction has real value. People's information entry points are inherently scattered. Whoever can unify those entry points gets closer to being a truly usable personal AI hub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw's strength:&lt;/strong&gt; Integration, distribution, ecosystem, platform coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw's weakness:&lt;/strong&gt; The security model relies on trust and configuration auditing. As Cisco noted, the issues are architectural, not configurational. The ClawHub skill marketplace -- with 36.8% of skills containing security flaws -- demonstrates what happens when a platform grows faster than its safety infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hermes Agent: The Growth Path
&lt;/h3&gt;

&lt;p&gt;Hermes Agent starts from a fundamentally different premise. It doesn't deny the importance of integration, but what it truly emphasizes is: &lt;strong&gt;will this agent accumulate capability over long-term use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where OpenClaw cares about how an agent connects to the world, Hermes cares about how an agent continuously evolves &lt;em&gt;within&lt;/em&gt; the world.&lt;/p&gt;

&lt;p&gt;Hermes's most distinctive capability is its &lt;strong&gt;learning loop.&lt;/strong&gt; After completing a task, the agent doesn't just finish -- it distills the process into a structured &lt;strong&gt;Skill&lt;/strong&gt;, a reusable method template. The next time it encounters a similar problem, it invokes that crystallized experience instead of starting from scratch.&lt;/p&gt;

&lt;p&gt;Its memory architecture goes beyond storing chat history:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What It Stores&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Who you are -- persistent background context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What you've done -- full history, recalled on demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How to do similar things better -- skills extracted from experience&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is &lt;strong&gt;"user model + task model + method library"&lt;/strong&gt; -- the architecture of a long-term partner, not a one-shot tool.&lt;/p&gt;

&lt;p&gt;On security, Hermes takes a markedly different approach from OpenClaw, implementing &lt;strong&gt;five-layer defense-in-depth:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User authorization&lt;/li&gt;
&lt;li&gt;Dangerous command review&lt;/li&gt;
&lt;li&gt;Container isolation&lt;/li&gt;
&lt;li&gt;Credential filtering&lt;/li&gt;
&lt;li&gt;Context injection scanning with auto-reject on timeout&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Compare this to OpenClaw's trust-plus-configuration model, and the architectural gap is clear.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Philosophies, One Set of Challenges
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;th&gt;Anthropic Managed Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Philosophy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gateway / Platform&lt;/td&gt;
&lt;td&gt;Growth Engine&lt;/td&gt;
&lt;td&gt;Operating System&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Value&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection&lt;/td&gt;
&lt;td&gt;Accumulation&lt;/td&gt;
&lt;td&gt;Abstraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Trust + config&lt;/td&gt;
&lt;td&gt;Defense-in-depth&lt;/td&gt;
&lt;td&gt;Architecture-level isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-channel hubs&lt;/td&gt;
&lt;td&gt;Long-term projects&lt;/td&gt;
&lt;td&gt;Enterprise production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trade-off&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Breadth over safety depth&lt;/td&gt;
&lt;td&gt;Newer, smaller ecosystem&lt;/td&gt;
&lt;td&gt;Vendor lock-in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The choice isn't "managed vs. open-source." It's which design philosophy matches your use case -- and whether the 5 pillars are addressed regardless of which path you take.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principles Over Frameworks
&lt;/h2&gt;

&lt;p&gt;Tools change. Frameworks rise and fall. Model capabilities leap forward every few months, turning yesterday's clever harness code into tomorrow's technical debt.&lt;/p&gt;

&lt;p&gt;But the engineering principles endure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with workflows, graduate to agents.&lt;/strong&gt; Don't give autonomy before you've built confirmation gates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make the agent think before it acts.&lt;/strong&gt; Chain-of-thought reasoning is not optional for production systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat context like a scarce resource.&lt;/strong&gt; Pin safety instructions architecturally; don't let them compete with task data for attention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design for crashes, not just success.&lt;/strong&gt; State persistence, interruption recovery, and resource budgets are production requirements, not nice-to-haves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security is architecture, not configuration.&lt;/strong&gt; If your agent and your credentials share a container, you don't have a security model -- you have a vulnerability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These five pillars matter whether you use Anthropic's Managed Agents, OpenClaw, Hermes Agent, or build your own infrastructure from scratch.&lt;/p&gt;

&lt;p&gt;Anthropic's engineering blog ends with a statement that reads like technical humility:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"We have opinions about the form of the interface, but we don't have opinions about what specific harness Claude will need in the future."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But the precondition for saying this is that they've already taken control of the interface itself. The interface -- the 5 pillars, the stable abstractions -- is what endures. The implementation is what evolves.&lt;/p&gt;

&lt;p&gt;For those of us building with agents, the lesson is the same one software engineering has taught for decades: &lt;strong&gt;invest in the interfaces, not the implementations.&lt;/strong&gt; The frameworks will change. The principles won't.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Sources referenced in this post, organized by topic. Anthropic's 15 engineering blog posts are listed by module; reading them in order provides a structured path from agent fundamentals to production readiness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Research&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2602.14364" rel="noopener noreferrer"&gt;A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)&lt;/a&gt; -- Tianyu Chen et al., ShanghaiTech University &amp;amp; Shanghai AI Lab. The trajectory-centric security evaluation referenced in this post, covering six risk dimensions of OpenClaw's agentic behavior (arXiv:2602.14364).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Context Compression &amp;amp; Safety Instruction Loss&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@dingzhanjun/analyzing-the-incident-of-openclaw-deleting-emails-a-technical-deep-dive-56e50028637b" rel="noopener noreferrer"&gt;Analyzing the Incident of OpenClaw Deleting Emails: A Technical Deep Dive&lt;/a&gt; -- John Ding. How Meta AI Safety Director Summer Yue's "don't action until I tell you" instruction was lost during context compaction, causing 200+ email deletions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.letsdatascience.com/blog/metas-ai-safety-chief-told-her-ai-agent-to-stop-it-deleted-her-inbox-anyway" rel="noopener noreferrer"&gt;Why AI Agents Fail: Context Compaction Explained&lt;/a&gt; -- Let's Data Science. Covers the Meta incident, CVE-2026-25253, and the broader context compaction failure pattern.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/waxell/why-ai-agents-bypass-human-approval-lessons-from-metas-rogue-agent-incidents-1a92"&gt;Why AI Agents Bypass Human Approval: Lessons from Meta's Rogue Agent Incidents&lt;/a&gt; -- Waxell. Architectural analysis of why prompt-based human-in-the-loop fails under context pressure and why infrastructure-layer enforcement is needed.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/5357" rel="noopener noreferrer"&gt;safeguard compaction fails to recover when context significantly exceeds model limit&lt;/a&gt; -- OpenClaw GitHub Issue #5357. Documents compaction failure when context exceeds token limits by more than 20%.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/7477" rel="noopener noreferrer"&gt;Default compaction mode silently fails on large contexts&lt;/a&gt; -- OpenClaw GitHub Issue #7477. Documents silent summarization failure producing "Summary unavailable" instead of preserving conversation history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Open-Source Memory Systems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/milla-jovovich/mempalace" rel="noopener noreferrer"&gt;MemPalace&lt;/a&gt; -- Milla Jovovich &amp;amp; Ben Sigman. Local-first, structured AI memory system using a palace metaphor (wings, rooms, halls, tunnels) with verbatim storage and semantic search. 96.6% recall on LongMemEval with zero API calls. Includes PreCompact hooks to save memory before context compression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluation &amp;amp; Testing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/docs/en/test-and-evaluate/develop-tests" rel="noopener noreferrer"&gt;Define success criteria and build evaluations&lt;/a&gt; -- Anthropic. Official guide on designing measurable success criteria and automated evaluation systems for LLM-based applications, with code examples for exact match, cosine similarity, ROUGE-L, LLM-based Likert scale, and binary classification grading.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Managed Agents Announcement&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/managed-agents" rel="noopener noreferrer"&gt;Managed Agents&lt;/a&gt; -- Anthropic's engineering deep-dive on the architecture behind Claude Managed Agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Module 1: Foundation Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Building effective agents&lt;/a&gt; -- Agent architecture introduction: workflows vs. autonomous agents, ReAct, Tool Use, Planning.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Building agents with the Claude Agent SDK&lt;/a&gt; -- Practical getting started with the Agent SDK.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Module 2: Tools &amp;amp; Capability Extension&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Introducing advanced tool use&lt;/a&gt; -- Advanced tool usage: parallelism, barriers, and error handling.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Writing effective tools for agents -- with agents&lt;/a&gt; -- Tool design principles and best practices.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;The "think" tool&lt;/a&gt; -- Teaching agents to stop and reason before acting.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Equipping agents for the real world with Agent Skills&lt;/a&gt; -- Skill encapsulation and reuse.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Module 3: Context &amp;amp; Memory Management&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Effective context engineering for AI agents&lt;/a&gt; -- Managing the agent's memory and attention across long conversations.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Introducing Contextual Retrieval&lt;/a&gt; -- Making RAG more context-aware to reduce chunk-level information loss.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Module 4: Long Tasks &amp;amp; Multi-Agent Collaboration&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Effective harnesses for long-running agents&lt;/a&gt; -- Designing reliable execution frameworks with interruption recovery and state persistence.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;How we built our multi-agent research system&lt;/a&gt; -- Anthropic's practical experience with multi-agent architecture.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Code execution with MCP&lt;/a&gt; -- Agent execution environment design using the Model Context Protocol.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Module 5: Safety, Evaluation &amp;amp; Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Demystifying evals for AI agents&lt;/a&gt; -- Evaluation system design for agent behavior.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Beyond permission prompts: Claude Code sandboxing&lt;/a&gt; -- From permission prompts to sandbox isolation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;Claude Code: Best practices for agentic coding&lt;/a&gt; -- Engineering best practices for coding agents.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://anthropic.com/engineering" rel="noopener noreferrer"&gt;A postmortem of three recent issues&lt;/a&gt; -- Real-world agent incident case studies.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>openclaw</category>
      <category>claude</category>
    </item>
    <item>
      <title>A Claude Code Skills Stack: How to Combine Superpowers, gstack, and GSD Without the Chaos</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:30:15 +0000</pubDate>
      <link>https://forem.com/imaginex/a-claude-code-skills-stack-how-to-combine-superpowers-gstack-and-gsd-without-the-chaos-44b3</link>
      <guid>https://forem.com/imaginex/a-claude-code-skills-stack-how-to-combine-superpowers-gstack-and-gsd-without-the-chaos-44b3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;One article to compare the frameworks, see where they overlap, and land on a stable three-layer practice.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Claude Code has quickly become one of the most widely adopted AI coding tools. Individual developers, startups, and large engineering teams alike have integrated it into their daily workflows—writing production code, reviewing pull requests, debugging, and shipping features at a pace that was hard to imagine a year ago. As usage has scaled, so has the ecosystem around it. &lt;strong&gt;Claude Skills&lt;/strong&gt;—composable, auto-invoked instruction sets that shape how the agent plans, builds, and verifies—have emerged as one of the most important extension points in Claude Code. They let you go beyond one-off prompts and encode &lt;strong&gt;repeatable workflows&lt;/strong&gt; directly into the agent's behavior. In fact, Anthropic has doubled down on this direction: the latest version of Claude Code &lt;strong&gt;consolidates the previously separate "slash commands" and "skills" systems into a single, unified skills format&lt;/strong&gt;, signaling that skills are now the canonical way to extend the agent.&lt;/p&gt;

&lt;p&gt;With Skills now central to the experience, the community has rallied around a handful of open-source frameworks that package best practices into ready-made skill sets. The two most discussed stacks are &lt;strong&gt;Superpowers&lt;/strong&gt; and &lt;strong&gt;gstack&lt;/strong&gt;. Installing both sounds easy; in practice they can &lt;strong&gt;conflict&lt;/strong&gt;, and piling frameworks on without a plan often makes the setup &lt;strong&gt;less&lt;/strong&gt; stable, not more. So where do they differ, and how should you choose?&lt;/p&gt;

&lt;p&gt;This post does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Compare&lt;/strong&gt; Superpowers and gstack on repos, features, and philosophy—the material below on stars, skill lists, and trade-offs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add a third layer&lt;/strong&gt; many guides skip: &lt;strong&gt;GSD&lt;/strong&gt; as a &lt;strong&gt;context / spec&lt;/strong&gt; stabilizer so long-running work does not drift (&lt;em&gt;informed by Tricia Notes Editorial’s three-layer framing&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End with a single playbook&lt;/strong&gt;: who owns &lt;strong&gt;decision&lt;/strong&gt;, &lt;strong&gt;context&lt;/strong&gt;, and &lt;strong&gt;execution&lt;/strong&gt;, and how to cherry-pick skills without blowing up token use or cognitive load.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The useful question is not only “Superpowers &lt;strong&gt;or&lt;/strong&gt; gstack?” but: &lt;em&gt;what are you missing—decision-making, durable context, or execution?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In one line:&lt;/strong&gt; &lt;em&gt;gstack thinks, GSD stabilizes, Superpowers executes.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Orientation: Three Layers, Not Only Two
&lt;/h2&gt;

&lt;p&gt;What stays stable in practice is often &lt;strong&gt;not&lt;/strong&gt; picking one framework over another, but a &lt;strong&gt;three-way division of labor&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decision / roles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;gstack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Judgment from CEO, design, architecture, QA-style lenses—not only “how to code.”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context / spec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GSD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keeps spec, status, boundaries, and long-horizon context from rotting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requirement clarification → plan → TDD → acceptance as a &lt;strong&gt;closed loop&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;How each is “strong”:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Superpowers&lt;/strong&gt; — &lt;strong&gt;How&lt;/strong&gt; work gets done; smooth execution loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gstack&lt;/strong&gt; — &lt;strong&gt;What&lt;/strong&gt; to do and &lt;strong&gt;whether&lt;/strong&gt; it should be done; richer role-based judgment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GSD&lt;/strong&gt; — &lt;strong&gt;Not drifting&lt;/strong&gt;; steadier specs and context over long chains.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both Superpowers and gstack have gone viral. On the surface they add process to AI; in use, they help you &lt;strong&gt;think clearly about what matters&lt;/strong&gt;. When the model codes fast, that is exactly when you need clear requirements and stable context—&lt;strong&gt;that&lt;/strong&gt; is what most people still overlook.&lt;/p&gt;




&lt;h2&gt;
  
  
  Superpowers vs gstack: Quick Facts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Superpowers (GitHub ~137K stars)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Repository: &lt;strong&gt;obra/superpowers&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Agent Skills&lt;/strong&gt; framework and software development methodology: &lt;strong&gt;14 built-in skills&lt;/strong&gt; across brainstorming, planning, TDD, execution, and verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  gstack (GitHub ~65K stars)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Repository: &lt;strong&gt;garrytan/gstack&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;strong&gt;YC CEO Garry Tan&lt;/strong&gt;, open source.&lt;/li&gt;
&lt;li&gt;Philosophy: a &lt;strong&gt;team&lt;/strong&gt; beside you—CEO, designer, eng manager, release manager, doc engineer, QA, and more—&lt;strong&gt;23 opinionated tools&lt;/strong&gt; (product thinking, CEO review, architecture review, real browser testing, design review, security audits, etc.).&lt;/li&gt;
&lt;li&gt;Garry has claimed &lt;strong&gt;600K+ lines of production code (35% tests) in 60 days&lt;/strong&gt;, part-time while running YC full-time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stars are a weak proxy: high star count does not mean every skill fits your workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feature Comparison (Superpowers vs gstack)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Superpowers&lt;/th&gt;
&lt;th&gt;gstack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product brainstorming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;brainstorming&lt;/td&gt;
&lt;td&gt;/office-hours, /plan-ceo-review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture planning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;writing-plans&lt;/td&gt;
&lt;td&gt;/plan-eng-review, /autoplan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/design-consultation, /plan-design-review, /design-shotgun, /design-html&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Development execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;executing-plans, subagent-driven-development, dispatching-parallel-agents&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;/qa, /qa-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;/investigate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code review&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;requesting-code-review, receiving-code-review&lt;/td&gt;
&lt;td&gt;/review, /codex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verification &amp;amp; acceptance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;verification-before-completion, finishing-a-development-branch&lt;/td&gt;
&lt;td&gt;/ship, /land-and-deploy, /canary, /document-release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/cso, /careful, /freeze, /guard, /unfreeze&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/learn, /retro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser testing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/browse, /connect-chrome, /setup-browser-cookies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Git worktrees&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;using-git-worktrees&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skill management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;using-superpowers, writing-skills&lt;/td&gt;
&lt;td&gt;/gstack-upgrade&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/benchmark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/setup-deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Coverage differs a lot; &lt;strong&gt;quantity is not the point—design philosophy is.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Philosophy: “How” vs “What” (and Where GSD Fits)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Superpowers — focused on &lt;strong&gt;how&lt;/strong&gt; code gets built
&lt;/h3&gt;

&lt;p&gt;The workflow centers on &lt;strong&gt;high-quality output&lt;/strong&gt;: clarify, plan, &lt;strong&gt;TDD&lt;/strong&gt; (tests before implementation), verify. Checkpoints at each step—little room to skip. In practice it feels &lt;strong&gt;disciplined&lt;/strong&gt;: you ask for X, it tends to build X. Engineers who already know what to build often find that empowering.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Execution-layer detail from hands-on use: strong process and steady execution; small tasks can still feel **heavy&lt;/em&gt;* because the full rhythm applies even to tiny asks.)*&lt;/p&gt;

&lt;h3&gt;
  
  
  gstack — focused on &lt;strong&gt;what&lt;/strong&gt; and &lt;strong&gt;what not&lt;/strong&gt; to do
&lt;/h3&gt;

&lt;p&gt;Before heavy coding, flows like &lt;strong&gt;/office-hours&lt;/strong&gt; walk requirements; &lt;strong&gt;CEO&lt;/strong&gt; and &lt;strong&gt;engineering&lt;/strong&gt; reviews stress-test the approach. It is not only code—it can &lt;strong&gt;run real browser tests&lt;/strong&gt; from a user angle. Rough split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decision layer:&lt;/strong&gt; &lt;code&gt;/office-hours&lt;/code&gt;, &lt;code&gt;/plan-ceo-review&lt;/code&gt;, &lt;code&gt;/plan-eng-review&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution layer:&lt;/strong&gt; &lt;code&gt;/review&lt;/code&gt;, &lt;code&gt;/qa&lt;/code&gt;, &lt;code&gt;/ship&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;gstack shines when requirements are still fuzzy—PMs, indies, or “think while building.” Caveat: turning &lt;strong&gt;all&lt;/strong&gt; roles on can feel &lt;strong&gt;bloated&lt;/strong&gt;; decision skills also burn serious tokens (see below).&lt;/p&gt;

&lt;h3&gt;
  
  
  GSD — &lt;strong&gt;context / spec&lt;/strong&gt;, not another “team chart”
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GSD&lt;/strong&gt; is not “install another team.” It is &lt;strong&gt;context engineering&lt;/strong&gt;: goals, specs, status, boundaries, and summaries anchored so &lt;strong&gt;context rot&lt;/strong&gt; slows down. Short demos hide this; &lt;strong&gt;long&lt;/strong&gt; projects show it—when context wobbles, output scatters; that is &lt;strong&gt;state&lt;/strong&gt;, not only “bad execution.”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gstack&lt;/strong&gt; thinks but is not, by itself, a long-term context vault.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Superpowers&lt;/strong&gt; executes but is not, by itself, a spec/context system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GSD&lt;/strong&gt; fills that gap so chains stay coherent.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Three-Way Comparison (Problems, Not “Who Wins”)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Superpowers&lt;/th&gt;
&lt;th&gt;gstack&lt;/th&gt;
&lt;th&gt;GSD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core question&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How to get things done&lt;/td&gt;
&lt;td&gt;What to do; whether it should&lt;/td&gt;
&lt;td&gt;How to keep the project from diverging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Execution&lt;/td&gt;
&lt;td&gt;Decision / roles&lt;/td&gt;
&lt;td&gt;Context / spec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Strongest fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Planning, TDD, acceptance loop&lt;/td&gt;
&lt;td&gt;Multi-perspective judgment, review, QA&lt;/td&gt;
&lt;td&gt;Context engineering; stable state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clear requirements&lt;/td&gt;
&lt;td&gt;Think-while-building&lt;/td&gt;
&lt;td&gt;Long chains / many iterations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Common pain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Front-loaded process can feel heavy (details below)&lt;/td&gt;
&lt;td&gt;Bloated and token-hungry when fully enabled (details below)&lt;/td&gt;
&lt;td&gt;Little standalone “shipping” value on its own (details below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Own &lt;strong&gt;execution&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Own &lt;strong&gt;decision-making&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Own &lt;strong&gt;long-term context&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Common Pain Points in Detail
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Superpowers — front-loaded process can feel heavy.&lt;/strong&gt; Every task, no matter how small, runs through the full cycle: clarify requirements, draft a plan, write tests first, then implement, then verify. For a large feature this rhythm pays off handsomely. For a two-line config fix or a quick copy change, the same ceremony kicks in and you end up spending more time on process than on the actual change. The overhead does not scale down with task size, so small requests can feel disproportionately slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gstack — bloated and token-hungry when fully enabled.&lt;/strong&gt; Each gstack role (CEO, designer, architect, QA, etc.) injects its own perspective and prompts into the context. Turn them all on and a single execution-layer skill can consume &lt;strong&gt;10K+ tokens&lt;/strong&gt; before any real code is written. Daily usage burns through tokens fast, and the back-and-forth between multiple “virtual team members” can make even straightforward tasks feel sluggish and redundant. You may also encounter irrelevant meta-questions (e.g. “Are you applying to become a YC company?”) while your codebase is being scanned—artifacts of the framework’s opinionated persona layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GSD — little standalone “shipping” value.&lt;/strong&gt; GSD excels at keeping specs, goals, and state anchored across long sessions. But if you use it &lt;strong&gt;alone&lt;/strong&gt;, it does not directly produce code, run tests, or open a PR. It is a stabilizer, not a builder. Without an execution layer (Superpowers) or a decision layer (gstack) alongside it, GSD manages context that nothing acts on—useful plumbing, but no visible output. Its value only becomes apparent when paired with tools that actually ship work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical takeaway:&lt;/strong&gt; they are &lt;strong&gt;complements&lt;/strong&gt;, not substitutes—Superpowers executes, gstack decides, GSD stabilizes specs and context over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strengths, Weaknesses, and Friction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Superpowers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Brainstorming and overall workflow feel solid; full process even on small asks can become smooth once habitual; execution and TDD are strong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; Weaker spots are often &lt;strong&gt;early&lt;/strong&gt; decision skills (e.g. planning/brainstorming) compared to gstack’s decision layer—hence many people &lt;strong&gt;pair&lt;/strong&gt; gstack’s front end with Superpowers’ execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  gstack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; &lt;strong&gt;Decision layer&lt;/strong&gt;—&lt;code&gt;/office-hours&lt;/code&gt;, &lt;code&gt;/plan-ceo-review&lt;/code&gt;, &lt;code&gt;/plan-eng-review&lt;/code&gt;—stand out for positioning and approach review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; Execution feels rougher vs Superpowers; token cost is real—&lt;strong&gt;a single execution-layer skill can cost 10K+ tokens&lt;/strong&gt;, and heavy scans can feel like noisy “process” rather than help.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The analogy
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Superpowers is a scalpel&lt;/strong&gt; — precise and efficient.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;gstack is a full clinic&lt;/strong&gt; — from diagnosis to aftercare.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Use the metaphor to choose depth: narrow execution vs full-spectrum product and review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Consolidated Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Choose skills deliberately—do not install everything
&lt;/h3&gt;

&lt;p&gt;Skill counts spiral easily (Superpowers today, gstack tomorrow, another stack next week). &lt;strong&gt;Selective deployment&lt;/strong&gt; beats volume; random invocation feels unstable and inflates surface-level “skill count” without clarity.&lt;/p&gt;

&lt;p&gt;Underlying idea: both stacks are experiments in &lt;strong&gt;Harness Engineering&lt;/strong&gt;. The mindset is &lt;strong&gt;leverage strengths, cover weaknesses&lt;/strong&gt;—not “I want it all.”&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Decision vs execution (the classic split)—then add context when needed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;gstack for the decision layer (cherry-picked):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prioritize high-value flows: e.g. &lt;code&gt;/office-hours&lt;/code&gt;, &lt;code&gt;/plan-ceo-review&lt;/code&gt;, &lt;code&gt;/plan-eng-review&lt;/code&gt; for requirements and alignment—avoid over-investing in every role.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Superpowers for the execution layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefer Superpowers as the &lt;strong&gt;base&lt;/strong&gt; for TDD, plans-as-executed, verification—optionally &lt;strong&gt;de-emphasize&lt;/strong&gt; its own heavy decision skills if gstack already covers that phase, so small tasks do not inherit double process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GSD when the chain diverges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If work &lt;strong&gt;spreads&lt;/strong&gt; across sessions and threads, add &lt;strong&gt;GSD&lt;/strong&gt; so spec and state stay anchored—not for flash, &lt;strong&gt;for anti-drift&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Stable workflow (three steps)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decision → gstack&lt;/strong&gt; — Start with &lt;code&gt;/office-hours&lt;/code&gt; to stress-test the idea, then run &lt;code&gt;/plan-ceo-review&lt;/code&gt; for a founder-level sanity check and &lt;code&gt;/plan-eng-review&lt;/code&gt; to lock architecture and data flow. If design matters, add &lt;code&gt;/plan-design-review&lt;/code&gt;. The goal: decide &lt;strong&gt;what&lt;/strong&gt; to build and &lt;strong&gt;whether&lt;/strong&gt; to build it before touching code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context → GSD&lt;/strong&gt; — Once the decision is made, use GSD (v2) to anchor the plan: &lt;code&gt;PROJECT.md&lt;/code&gt; for what the project is, &lt;code&gt;DECISIONS.md&lt;/code&gt; for architectural choices, &lt;code&gt;KNOWLEDGE.md&lt;/code&gt; for cross-session rules and patterns, and milestone roadmaps (&lt;code&gt;M001-ROADMAP.md&lt;/code&gt;) for sliced execution. These v2 artifacts keep spec, status, and boundaries stable so context does not rot between sessions. (The original GSD uses &lt;code&gt;REQUIREMENTS.md&lt;/code&gt;, &lt;code&gt;ROADMAP.md&lt;/code&gt;, and &lt;code&gt;STATE.md&lt;/code&gt; instead.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution → Superpowers&lt;/strong&gt; — With clear requirements and stable context in place, hand off to Superpowers’ execution loop: &lt;code&gt;brainstorming&lt;/code&gt; (if lightweight refinement is still needed), &lt;code&gt;writing-plans&lt;/code&gt; → &lt;code&gt;executing-plans&lt;/code&gt; for implementation, &lt;code&gt;test-driven-development&lt;/code&gt; for the RED-GREEN-REFACTOR cycle, &lt;code&gt;requesting-code-review&lt;/code&gt; / &lt;code&gt;receiving-code-review&lt;/code&gt; for review, and &lt;code&gt;verification-before-completion&lt;/code&gt; → &lt;code&gt;finishing-a-development-branch&lt;/code&gt; to close the loop. For parallel work, use &lt;code&gt;dispatching-parallel-agents&lt;/code&gt; or &lt;code&gt;subagent-driven-development&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Merged tagline:&lt;/strong&gt; &lt;em&gt;gstack handles thinking, Superpowers handles doing, GSD keeps long context honest.&lt;/em&gt; Combining the &lt;strong&gt;strong decision slice&lt;/strong&gt; of gstack with &lt;strong&gt;Superpowers’ execution&lt;/strong&gt; (and GSD when needed) keeps skill count and collisions under control—similar to the author’s experience building a small tool on a weekend with a &lt;strong&gt;curated&lt;/strong&gt; mix.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Final heuristics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Requirements still fuzzy → &lt;strong&gt;start with gstack&lt;/strong&gt; (decision).&lt;/li&gt;
&lt;li&gt;Work keeps diverging across the chain → &lt;strong&gt;add GSD&lt;/strong&gt; (context).&lt;/li&gt;
&lt;li&gt;You want execution &lt;strong&gt;steady and closed-loop&lt;/strong&gt; → &lt;strong&gt;lean on Superpowers&lt;/strong&gt; (execution).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Stop asking only:&lt;/strong&gt; “Superpowers or gstack?” &lt;strong&gt;Ask:&lt;/strong&gt; &lt;em&gt;Am I missing decision, context, or execution?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing:
&lt;/h2&gt;

&lt;p&gt;Skills are not stronger because you install more—they are stronger when you &lt;strong&gt;combine the right pieces for the gap you actually have&lt;/strong&gt; and understand what each layer does, then assemble a workflow that is yours.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Superpowers&lt;/strong&gt; — &lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;github.com/obra/superpowers&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gstack&lt;/strong&gt; — &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;github.com/garrytan/gstack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GSD (Get Shit Done)&lt;/strong&gt; — &lt;a href="https://github.com/gsd-build/get-shit-done" rel="noopener noreferrer"&gt;github.com/gsd-build/get-shit-done&lt;/a&gt; (original) | &lt;a href="https://github.com/gsd-build/gsd-2" rel="noopener noreferrer"&gt;github.com/gsd-build/gsd-2&lt;/a&gt; (v2, standalone CLI)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sde</category>
      <category>claude</category>
    </item>
    <item>
      <title>From IDE to AGaaS: How Cursor Cloud Agents Bring the OpenClaw Model to Your Slack</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Tue, 24 Mar 2026 00:07:17 +0000</pubDate>
      <link>https://forem.com/imaginex/from-ide-to-agaas-how-cursor-cloud-agents-bring-the-openclaw-model-to-your-slack-4547</link>
      <guid>https://forem.com/imaginex/from-ide-to-agaas-how-cursor-cloud-agents-bring-the-openclaw-model-to-your-slack-4547</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Cursor's &lt;strong&gt;Cloud Agents&lt;/strong&gt; let you delegate coding tasks — bug fixes, feature work, test writing — directly from a Slack message. The agent spins up a remote VM, clones your repo, writes the code, runs your tests, and opens a Pull Request on GitHub. You never open an IDE. This post walks you through the full setup — from Slack integration to your first hands-off pull request — and then examines where the technology shines, where it falls short, and where the AGaaS market is heading next.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is the OpenClaw Model — and Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw&lt;/strong&gt; refers to an emerging paradigm in AI-assisted development where a cloud-hosted coding agent operates &lt;em&gt;autonomously and headlessly&lt;/em&gt; — meaning it doesn't need a local IDE, a human at the keyboard, or even a screen. You give it a task in natural language, and it handles the full software development lifecycle (clone → code → test → commit → PR) on its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AGaaS (Agent-as-a-Service)&lt;/strong&gt; is the broader industry term for this pattern: instead of installing AI tooling locally, you interact with a managed agent through everyday interfaces like Slack, Teams, or a web dashboard.&lt;/p&gt;

&lt;p&gt;Cursor's Cloud Agents are a production-ready implementation of this model. If you're already using Cursor as your IDE, you can now step &lt;em&gt;outside&lt;/em&gt; the IDE entirely and operate as a manager — assigning tasks from Slack and reviewing the output as Pull Requests.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Cloud Agents Work Under the Hood?
&lt;/h2&gt;

&lt;p&gt;Before diving into setup, here's what happens when you type &lt;code&gt;@Cursor revise the README.md file to make it more professional and beginner-friendly&lt;/code&gt; in Slack:&lt;/p&gt;

&lt;h3&gt;
  
  
  Headless Execution on Isolated VMs
&lt;/h3&gt;

&lt;p&gt;Traditionally, Cursor ran locally — consuming your RAM, competing for your CPU. Cloud Agents move the execution layer to a remote, isolated Virtual Machine. When a task is triggered, the agent provisions a sandboxed VM, clones your GitHub repository into it, and does all the work in the background. Your local machine stays completely free.&lt;/p&gt;

&lt;p&gt;Each VM comes pre-loaded with a production-grade development environment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Specification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04.4 LTS (Noble Numbat), Linux kernel 6.12.58+, x86_64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 CPU cores, 15 GB RAM, ~126 GB disk (overlay filesystem)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runtimes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python 3.12.3, Node.js v22.22.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Toolchain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Git 2.43.0, GitHub CLI 2.81.0, Bash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workspace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your repo cloned at &lt;code&gt;/workspace&lt;/code&gt;, running as user &lt;code&gt;ubuntu&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can verify this yourself by asking the agent about its environment. Here's what that looks like in a real Slack conversation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gj4r7scskjganscyjqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gj4r7scskjganscyjqy.png" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Slack Thread as Context Window
&lt;/h3&gt;

&lt;p&gt;This isn't a basic chatbot that only reads your one-line prompt. Cursor's Slack integration behaves like a teammate who's been reading the whole conversation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your team has been discussing a bug in a thread — sharing stack traces, debating approaches, pasting logs — the agent ingests &lt;em&gt;all of it&lt;/em&gt; when you tag &lt;code&gt;@Cursor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It synthesizes the thread context and implements a fix that reflects the team's consensus, not just your single message.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Autonomous Testing via "Computer Use"
&lt;/h3&gt;

&lt;p&gt;Because the agent has its own VM with a full desktop environment, it doesn't just write code and hope for the best:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can start your dev server, open a headless browser, and click through UI flows to visually verify the fix.&lt;/li&gt;
&lt;li&gt;If tests fail or the UI breaks, it self-corrects &lt;em&gt;before&lt;/em&gt; submitting the Pull Request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you understand what's happening behind the scenes, let's set it up. The whole process takes about 15 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step-by-Step Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before you begin, make sure you have the following in place:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cursor subscription&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Agents require a paid plan — &lt;strong&gt;Pro&lt;/strong&gt; ($20/mo), &lt;strong&gt;Pro+&lt;/strong&gt;, &lt;strong&gt;Ultra&lt;/strong&gt;, or &lt;strong&gt;Teams&lt;/strong&gt;. Check your plan at &lt;a href="https://cursor.com/en-US/pricing" rel="noopener noreferrer"&gt;cursor.com/pricing&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub account&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your repository must be hosted on GitHub or GitLab. You need read-write access to the repo.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slack workspace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You need &lt;strong&gt;admin permissions&lt;/strong&gt; (or the ability to request app installation) in your Slack workspace.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Existing test suite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recommended but not required. The agent can run your tests automatically if they exist (e.g., &lt;code&gt;npm test&lt;/code&gt;, &lt;code&gt;pytest&lt;/code&gt;, &lt;code&gt;go test&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 1: Connect Slack to Cursor
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open the &lt;strong&gt;Cursor Dashboard&lt;/strong&gt; at &lt;a href="https://cursor.com/dashboard" rel="noopener noreferrer"&gt;cursor.com/dashboard&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Navigate to the &lt;strong&gt;Integrations &amp;amp; MCP&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Connect&lt;/strong&gt; next to &lt;strong&gt;Slack&lt;/strong&gt;. This launches an OAuth flow that installs the Cursor bot into your Slack workspace.&lt;/li&gt;
&lt;li&gt;Authorize the requested permissions (read messages in channels where the bot is invited, post replies).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2: Connect Your GitHub Repository
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;In the same Dashboard, go to the &lt;strong&gt;Cloud Agents &amp;gt; Default Repositories &amp;gt; Manage Repositories&lt;/strong&gt; section.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Repository&lt;/strong&gt; and authenticate with GitHub.&lt;/li&gt;
&lt;li&gt;Select the repository (or repositories) you want the Cloud Agent to access.&lt;/li&gt;
&lt;li&gt;Grant the agent permission to &lt;strong&gt;create branches&lt;/strong&gt; and &lt;strong&gt;open Pull Requests&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 3: Configure the Cloud Agent Environment
&lt;/h3&gt;

&lt;p&gt;Before triggering tasks from Slack, configure the Cloud Agent's development environment and defaults in the Cursor dashboard. Navigate to &lt;strong&gt;Cloud Agents&lt;/strong&gt; in the left sidebar.&lt;/p&gt;

&lt;h4&gt;
  
  
  3a. Set Your Defaults
&lt;/h4&gt;

&lt;p&gt;Under the &lt;strong&gt;My Settings&lt;/strong&gt; tab, configure the following:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Controls&lt;/th&gt;
&lt;th&gt;Example Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The AI model the agent uses when no model is specified in the task. Higher-tier models produce better code.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Opus 4.6 High Fast&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default Repository&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The GitHub repo the agent targets when no repo is mentioned in the Slack message.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chen115y/MLOpsLearning&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Base Branch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The branch the agent creates feature/fix branches from. Leave empty to use the repo's default branch.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;main&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Branch Prefix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prepended to every branch the agent creates, making agent-authored branches easy to filter.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cursor/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1v6x13qguzkeegrwliw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1v6x13qguzkeegrwliw.png" alt=" " width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3b. Set Up a Development Environment
&lt;/h4&gt;

&lt;p&gt;For repositories with complex dependencies (Python data-science stacks, system libraries, database services), click &lt;strong&gt;Add Environment&lt;/strong&gt; button. This launches a very simple setup agent that provisions and validates the VM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faost6duggi01rkhw8i2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faost6duggi01rkhw8i2g.png" alt=" " width="800" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once all fields are filled, click &lt;strong&gt;Start For Free&lt;/strong&gt; to start the VM provisioning. The setup agent will analyze the repository and provision the VM accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn28ln49e8e22bd6nh9fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn28ln49e8e22bd6nh9fo.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; You can add multiple environments for different repos. If the setup agent reports warnings (e.g., deprecated API calls in older notebooks), these are pre-existing code issues, not environment problems — the snapshot is still safe to save.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 4: Create a Channel and Invite the Bot
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;In Slack, create a dedicated channel for agent-assisted work (e.g., &lt;code&gt;#engineering-triage&lt;/code&gt;, &lt;code&gt;#cursor-tasks&lt;/code&gt;, or &lt;code&gt;#bug-reports&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Simply mention &lt;code&gt;@Cursor&lt;/code&gt; in the channel with any prompt — the bot joins automatically when the Slack app is installed (Step 1). No separate invite is needed.&lt;/li&gt;
&lt;li&gt;You can also type &lt;code&gt;@Cursor help&lt;/code&gt; to see available commands, or &lt;code&gt;@Cursor settings&lt;/code&gt; to configure channel-level defaults.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 5: Configure Cursor Rules (the Agent's Playbook)
&lt;/h3&gt;

&lt;p&gt;This is the most important step. Without rules, the agent will make reasonable guesses about your codebase conventions. With rules, it follows your team's standards precisely.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.cursor/rules/triage.mdc&lt;/code&gt; file in your repository root, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rules&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Slack-triggered&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;triage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tasks"&lt;/span&gt;
&lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*"&lt;/span&gt;
&lt;span class="na"&gt;alwaysApply&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# Agent Behavior for Slack Tasks&lt;/span&gt;

&lt;span class="c1"&gt;## Bug Triage Protocol&lt;/span&gt;
&lt;span class="s"&gt;1. Read the full Slack thread for context, including any error logs or stack traces.&lt;/span&gt;
&lt;span class="s"&gt;2. Search the codebase to locate the relevant source files.&lt;/span&gt;
&lt;span class="s"&gt;3. Identify the root cause before writing any fix.&lt;/span&gt;
&lt;span class="s"&gt;4. Write the fix following existing code patterns in the repository.&lt;/span&gt;
&lt;span class="s"&gt;5. Use the project's standard error-handling approach (check for existing wrappers).&lt;/span&gt;

&lt;span class="c1"&gt;## Testing Requirements&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run the full test suite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="s"&gt;npm run test` (or the project's equivalent).&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;If no tests exist for the changed code, write at least one unit test covering the fix.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Do not submit a PR if tests fail. Debug and fix until green.&lt;/span&gt;

&lt;span class="c1"&gt;## Git and PR Conventions&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Create a new branch from `main` with the format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="s"&gt;fix/&amp;lt;short-description&amp;gt;`.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Never push directly to `main` or `develop`.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PR title format: `fix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;concise description of the change&amp;gt;`&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Include a summary of the root cause and fix in the PR description.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Reply to the original Slack thread with the PR link and a brief explanation.&lt;/span&gt;

&lt;span class="c1"&gt;## Out of Scope&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Do not modify CI/CD configuration files without explicit approval.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Do not upgrade dependencies unless the fix requires it.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;If the issue is unclear, ask clarifying questions in the Slack thread before proceeding.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create additional rule files for different workflows — feature development, refactoring, documentation — each with its own conventions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Run Your First Agent Task
&lt;/h3&gt;

&lt;p&gt;With everything connected, you're ready to give the agent its first job. Post a message in your channel (or reply in an existing thread) and tag &lt;code&gt;@Cursor&lt;/code&gt; with a clear task description. The agent picks it up, executes the work on its remote VM, and reports back — all within the same Slack thread.&lt;/p&gt;

&lt;p&gt;Here's a real example. A user asks the agent to revise a repository's README to make it more professional and beginner-friendly. Within minutes, the agent replies with a structured breakdown of every change it made — reorganized navigation, plain-language introductions, typo fixes, new formatting — along with the commit diff (+338 / -190 lines):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf9deq0twbkaj30move3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf9deq0twbkaj30move3.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The user asks the agent to make a commit and push the changes directly to the remote repository. Once the work is done, the agent confirms it has committed and pushed the changes directly to the remote repository, and provides a link to verify on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnl6hdfa8mjlbv34n6sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnl6hdfa8mjlbv34n6sw.png" alt=" " width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Want to see how the agent reasoned through the task? Click the &lt;strong&gt;"Open in Web"&lt;/strong&gt; button in the Slack message to open the full agent session. This view shows the agent's step-by-step thought process — the file diff it analyzed, the to-do list it created for itself (commit, push), and the detailed revision plan it followed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8tbt7ruxda7w3oavm6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8tbt7ruxda7w3oavm6d.png" alt=" " width="800" height="946"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And to close the loop, here's the GitHub repository immediately after. Notice the &lt;code&gt;README.md&lt;/code&gt; row — updated "1 minute ago" by &lt;code&gt;cursoragent&lt;/code&gt; with the commit message matching exactly what the agent described in Slack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F467evach5czpkqns5sek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F467evach5czpkqns5sek.png" alt=" " width="800" height="733"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No IDE opened. No branch created manually. No code written by hand. One Slack message in, a polished commit out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing Effective Cursor Rules: A Deeper Look
&lt;/h2&gt;

&lt;p&gt;The example above worked smoothly because the task was straightforward. But as you start assigning more complex work — multi-file refactors, feature additions, cross-cutting bug fixes — the quality of the agent's output depends heavily on how well you've defined your team's standards. That's where Cursor Rules go from "nice to have" to essential.&lt;/p&gt;

&lt;p&gt;Step 5 introduced the basic format. Here we'll look at patterns that make rules genuinely effective at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope rules by file type.&lt;/strong&gt; Use the &lt;code&gt;globs&lt;/code&gt; field to apply different rules to different parts of your codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Frontend&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;component&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;conventions"&lt;/span&gt;
&lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/components/**/*.tsx"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Use functional components with hooks, never class components.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;All components must have a corresponding .test.tsx file.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Use the project's design system tokens for colors and spacing.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;route&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;conventions"&lt;/span&gt;
&lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/api/**/*.ts"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Validate all request bodies with zod schemas.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Return consistent error response shapes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;string&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;number&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Log errors with the structured logger, not console.log.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Be specific about what the agent should &lt;em&gt;not&lt;/em&gt; do.&lt;/strong&gt; Guardrails prevent expensive mistakes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Boundaries&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Never delete database migration files.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Never modify environment variable files (.env, .env.local).&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;If a change requires more than 5 files, stop and ask for confirmation in Slack.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point you have the full toolkit: the agent is connected, the environment is configured, and the rules are in place. But having the setup working and knowing where to &lt;em&gt;rely&lt;/em&gt; on it are two different things. Let me share what I've learned from using this in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Cloud Agents Shine — and Where They Don't (Yet)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Real Unlock: Work Anytime, Anywhere
&lt;/h3&gt;

&lt;p&gt;Here's what changed my daily workflow more than any single feature: I no longer need to be at my desk, or even awake, for code to get written.&lt;/p&gt;

&lt;p&gt;Think about that for a moment. It's 11 PM and a teammate in another timezone drops a bug report in Slack with a Datadog trace attached. Before Cloud Agents, that bug sat untouched until someone opened their laptop the next morning, cloned the repo, reproduced the issue, wrote the fix, ran the tests, and pushed a PR. That's a minimum 30-minute context-switch tax — and that's if the person was already familiar with the code.&lt;/p&gt;

&lt;p&gt;Now? I glance at the Slack notification on my phone, type &lt;code&gt;@Cursor investigate and fix this&lt;/code&gt;, and go back to sleep. By morning, there's a PR waiting for review with a clear explanation of the root cause. The agent read the error trace, found the offending line, wrote the fix, confirmed the tests pass, and opened the PR — all while I was unconscious.&lt;/p&gt;

&lt;p&gt;This isn't just about convenience. It fundamentally changes when and where software development can happen. You can triage bugs from an airport lounge with nothing but your phone. You can delegate a documentation overhaul while you're deep in a design review. You can assign test-writing tasks to the agent on Friday afternoon and come back Monday to a PR that covers the gaps you've been meaning to address for weeks. The agent doesn't get tired, doesn't lose context, and doesn't need to "get back into the zone" after lunch.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Agent Handles Well Today
&lt;/h3&gt;

&lt;p&gt;The sweet spot for Cloud Agents is any task where the goal is clearly defined and the scope is contained. Bug fixes are the most natural fit — especially when someone has already done the diagnostic work and there's an error trace, a stack dump, or a reproduction path sitting in the Slack thread. The agent can read that context, locate the relevant source files, and produce a targeted fix without anyone needing to spell out which file to open. It's remarkably good at this.&lt;/p&gt;

&lt;p&gt;Test coverage is another area where the agent earns its keep. Most teams know they should be writing more tests, but nobody &lt;em&gt;wants&lt;/em&gt; to write the fifteenth unit test for a utility function. Hand that to the agent. It reads the existing code, infers the expected behavior, and generates tests that follow whatever patterns your codebase already uses — &lt;code&gt;pytest&lt;/code&gt;, &lt;code&gt;jest&lt;/code&gt;, &lt;code&gt;go test&lt;/code&gt;, you name it. It's not glamorous work, but it's exactly the kind of high-value, low-creativity task that agents are built for.&lt;/p&gt;

&lt;p&gt;Small-to-medium feature additions work well too, as long as the spec is clear. "Add a CSV export button to the billing page that calls the existing &lt;code&gt;exportService&lt;/code&gt;" is a great agent task. "Make the app feel more modern" is not — that requires taste, iteration, and subjective judgment that the agent can't provide.&lt;/p&gt;

&lt;p&gt;The same applies to code refactoring. If you can describe the before and after state clearly — "rename all instances of &lt;code&gt;getUserData&lt;/code&gt; to &lt;code&gt;fetchUserProfile&lt;/code&gt; across the codebase" or "extract the validation logic from the controller into a dedicated middleware" — the agent will handle it methodically and consistently. And documentation updates? The agent writes clean, structured prose. Give it a README that's fallen out of date, and it'll cross-reference the actual codebase to produce documentation that matches reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where You Still Need the IDE
&lt;/h3&gt;

&lt;p&gt;That said, Cloud Agents aren't a replacement for sitting down with your code — at least not yet. There are categories of work where human judgment, rapid iteration, and architectural intuition still matter more than raw execution speed.&lt;/p&gt;

&lt;p&gt;Large architectural changes are the clearest example. If a task spans multiple services, touches database schemas, modifies CI/CD pipelines, and requires coordinating changes across a dozen files in a specific order, the agent can get lost. It doesn't have the mental model of your system's dependency graph that you've built up over months of working in the codebase. It might fix one file in a way that breaks three others, then chase its tail fixing those. For these tasks, you want a human architect in the driver's seat, possibly &lt;em&gt;using&lt;/em&gt; the agent for individual sub-tasks, but directing the overall strategy.&lt;/p&gt;

&lt;p&gt;Exploratory prototyping is another area where the agent falls short. When you're experimenting — trying out a new library, playing with different UI layouts, iterating on an API design — you need a tight feedback loop. You write a few lines, run it, see what happens, change direction, try something else. That back-and-forth is the creative engine of prototyping, and it doesn't translate well to "write a Slack message and wait for a PR." The latency alone kills the creative flow.&lt;/p&gt;

&lt;p&gt;Security-sensitive code deserves human eyes, full stop. The agent can write functionally correct authentication logic, but it won't catch the subtle timing-attack vulnerability or the OAuth misconfiguration that a security-conscious engineer would flag during a manual review. Use the agent to write the boilerplate, but review every line yourself before it touches production auth flows.&lt;/p&gt;

&lt;p&gt;And anything requiring visual design judgment — pixel-perfect UI work, animation tuning, responsive layout decisions — still demands a human with a browser open, resizing windows, and squinting at spacing. The agent can generate the JSX and CSS, but it can't tell you whether the result &lt;em&gt;feels&lt;/em&gt; right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making the Most of Imperfect Results
&lt;/h3&gt;

&lt;p&gt;Here's a practical pattern that works well: &lt;strong&gt;the 90% handoff.&lt;/strong&gt; The agent doesn't need to produce a perfect PR every time. If it gets 90% of the way there — the logic is right but it missed an edge case, or the implementation is solid but the naming isn't quite what you'd choose — you can pull the agent's remote session directly into your local Cursor IDE and finish the last stretch yourself. You don't start over. You continue right where the agent left off, with all the files already modified and the context preserved.&lt;/p&gt;

&lt;p&gt;And when the agent goes in the wrong direction entirely? Course-correct in the same Slack thread. Reply with something like &lt;code&gt;@Cursor stop. The issue is in the middleware, not the controller. Look at src/middleware/auth.ts instead.&lt;/code&gt; The agent re-reads the full thread, incorporates your feedback, and adjusts its approach. Think of it less like a tool that either works or doesn't, and more like a junior developer who's fast and tireless but occasionally needs steering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Going Further: MCP Integrations for Closed-Loop Automation
&lt;/h2&gt;

&lt;p&gt;So far, every workflow in this post has followed the same pattern: a human writes a Slack message, the agent does the work, and a PR appears on GitHub. That's already powerful — but it still requires someone to initiate each task. What if the agent could respond to events across your entire toolchain without waiting for a Slack prompt?&lt;/p&gt;

&lt;p&gt;That's where the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; comes in. MCP lets the agent interact with external tools beyond Slack and GitHub. By adding MCP servers, you can build a fully closed-loop system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jira / Linear:&lt;/strong&gt; The agent automatically creates a ticket, links it to the PR, and transitions the issue status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog / Sentry:&lt;/strong&gt; The agent queries your monitoring tools directly to pull error traces without anyone needing to paste them into Slack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confluence / Notion:&lt;/strong&gt; The agent updates your team's documentation when it changes an API contract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns the workflow from a &lt;em&gt;Slack → PR&lt;/em&gt; pipeline into a &lt;em&gt;Slack → Ticket → PR → Docs → Status Update&lt;/em&gt; pipeline — with zero manual handoff.&lt;/p&gt;

&lt;p&gt;MCP integrations are where Cloud Agents start to feel less like a developer tool and more like infrastructure. And that shift — from tool to infrastructure — is exactly what's happening across the industry.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Road Ahead: AGaaS and Where the Market Is Going
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From Novelty to Infrastructure
&lt;/h3&gt;

&lt;p&gt;What Cursor has shipped with Cloud Agents is impressive, but it's also clearly early. If you zoom out from the specifics of this one product, a much larger shift is taking shape: &lt;strong&gt;Agent-as-a-Service (AGaaS)&lt;/strong&gt; is becoming a real infrastructure category, not just a buzzword.&lt;/p&gt;

&lt;p&gt;The core idea is straightforward — instead of every developer installing AI tooling on their local machine and managing prompts, context windows, and model versions themselves, you subscribe to a managed agent that lives in the cloud, integrates with your existing tools, and operates autonomously on your behalf. Cursor is one implementation, but the pattern is bigger than any single vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Customers Actually Need (and What's Missing)
&lt;/h3&gt;

&lt;p&gt;If you've followed along with this post and tried the setup yourself, you've probably already noticed a few gaps. These aren't criticisms — they're the natural rough edges of a category that's still being defined. But they point directly at where the market is heading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-repository orchestration.&lt;/strong&gt; Today, each Cloud Agent task targets a single repository. But real-world features often span a frontend repo, a backend API, a shared library, and an infrastructure-as-code repo. The next generation of AGaaS platforms will need to coordinate changes across multiple repos atomically — opening linked PRs that reference each other and can be merged together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent agent memory.&lt;/strong&gt; Right now, each task starts fresh. The agent doesn't remember that it fixed a similar bug last week, or that your team prefers a particular error-handling pattern, or that the last three PRs it opened for this repo all needed the same test fixture adjustment. Future agents will build a persistent understanding of your codebase, your team's preferences, and your project's history — getting better at their job over time, just like a human teammate does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Richer feedback loops beyond Slack.&lt;/strong&gt; Slack is a natural starting point because it's where engineering teams already communicate. But imagine triggering agent tasks from a Jira ticket transition, a Sentry alert threshold, a failing CI check, or a monitoring dashboard anomaly. The agent becomes a first-responder that patches issues before a human even notices them. Some of this is possible today through MCP integrations, but it's still manual plumbing — it should be turnkey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customizable execution environments at scale.&lt;/strong&gt; The environment setup flow shown in Step 3 is a solid start, but enterprise teams need more. Think GPU-enabled VMs for ML codebases, pre-configured database fixtures for integration testing, VPN access to internal services, and compliance-scoped environments that restrict which external packages the agent can install. As AGaaS matures, the execution environment will need to match the complexity of real enterprise infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost transparency and resource governance.&lt;/strong&gt; When an agent spins up a VM, runs your test suite, and interacts with a paid AI model for 15 minutes, who pays for what? Teams need clear visibility into per-task cost breakdowns — compute, model tokens, API calls — and the ability to set budgets, quotas, and approval gates for expensive operations. This is table stakes for enterprise adoption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Market Convergence
&lt;/h3&gt;

&lt;p&gt;It's worth noting that Cursor isn't the only player moving in this direction. GitHub Copilot has introduced its own agent mode. Amazon Q Developer (formerly CodeWhisperer) has evolved toward autonomous capabilities. Smaller players like Devin, Cosine, and Factory are building agent-first platforms from scratch. The competitive pressure is accelerating the category.&lt;/p&gt;

&lt;p&gt;What's emerging is a spectrum: at one end, lightweight copilot-style suggestions embedded in your editor; at the other end, fully autonomous agents that operate headlessly across your entire development workflow. Most teams will use both, for different tasks, at different times. The interesting question isn't &lt;em&gt;which&lt;/em&gt; tool wins — it's how the boundaries between human-driven and agent-driven work shift over the next two to three years.&lt;/p&gt;

&lt;p&gt;For engineering leaders, the strategic play is clear: start experimenting now. The teams that build fluency with agent-assisted workflows today — who learn which tasks to delegate, how to write effective agent rules, and how to review agent-produced code efficiently — will have a significant velocity advantage as these tools mature.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cursor.com/docs/cloud-agent" rel="noopener noreferrer"&gt;Cursor Cloud Agents Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cursor.com/docs/integrations/slack" rel="noopener noreferrer"&gt;Cursor Docs: Slack Integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cursor.com/docs/rules" rel="noopener noreferrer"&gt;Cursor Rules Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cursor.com/docs/context/mcp" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>automation</category>
      <category>openclaw</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>From Prompts to Real Files: A Developer's Guide to AI File Generation</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Mon, 16 Mar 2026 22:11:50 +0000</pubDate>
      <link>https://forem.com/imaginex/your-llm-can-write-files-now-4c6e</link>
      <guid>https://forem.com/imaginex/your-llm-can-write-files-now-4c6e</guid>
      <description>&lt;p&gt;Ask ChatGPT to "create a sales report PDF with a revenue chart." A year ago, it would paste some markdown and wish you luck. Today, it spins up a sandboxed Python environment, runs &lt;code&gt;reportlab&lt;/code&gt; and &lt;code&gt;matplotlib&lt;/code&gt;, and hands you a real, downloadable PDF file.&lt;/p&gt;

&lt;p&gt;This is the shift from &lt;strong&gt;text generation&lt;/strong&gt; to &lt;strong&gt;artifact generation&lt;/strong&gt; -- and every major LLM vendor now supports it through their API. Claude, OpenAI, and Gemini each give developers a way to prompt an LLM and get back actual files: PDFs, spreadsheets, charts, slide decks, whatever you can create with Python.&lt;/p&gt;

&lt;p&gt;This post walks through the universal pattern behind file generation, then shows you exactly how to do it with each vendor -- working code included.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Universal Pattern
&lt;/h2&gt;

&lt;p&gt;Despite different APIs, all three vendors follow the same three-step architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyzbmd5nj5hz3dx9j4hm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyzbmd5nj5hz3dx9j4hm.png" alt=" " width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every vendor-specific implementation is a variation on this flow. The details change, but three concepts repeat everywhere:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tool declaration&lt;/strong&gt; -- you opt in to code execution by including a specific tool in your API request. It's never on by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxed execution&lt;/strong&gt; -- the LLM's code runs in an isolated container with no internet access. Common libraries (pandas, matplotlib, reportlab) come pre-installed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File retrieval&lt;/strong&gt; -- each vendor has a different mechanism to get the bytes out. Some give you a file ID to download; others return bytes inline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you internalize this pattern, learning any vendor's API is just a matter of mapping it to these three steps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claude: Code Execution + Files API
&lt;/h2&gt;

&lt;p&gt;Claude's file generation is the most full-featured option for document creation. It provides a persistent container with full bash access, a rich set of pre-installed document libraries, and a clean Files API for uploads and downloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating a PDF from a Prompt
&lt;/h3&gt;

&lt;p&gt;Enable the &lt;code&gt;code_execution_20250825&lt;/code&gt; tool, send your prompt, then extract file IDs from the response and download them through the Files API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Request with code execution enabled
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a one-page PDF sales report with a revenue chart for Q1 2026.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Extract file IDs from the response
&lt;/span&gt;&lt;span class="n"&gt;file_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash_code_execution_tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash_code_execution_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;file_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Download each generated file
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;file_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response content blocks have a nested structure: you're looking for &lt;code&gt;bash_code_execution_tool_result&lt;/code&gt; blocks, which contain &lt;code&gt;bash_code_execution_result&lt;/code&gt; objects, which contain items with &lt;code&gt;file_id&lt;/code&gt; attributes. The &lt;code&gt;files.download()&lt;/code&gt; call gives you the raw bytes; &lt;code&gt;retrieve_metadata()&lt;/code&gt; gives you the original filename.&lt;/p&gt;

&lt;p&gt;Why &lt;code&gt;bash_code_execution&lt;/code&gt;? When you include the &lt;code&gt;code_execution_20250825&lt;/code&gt; tool, Claude actually gets two sub-tools: &lt;code&gt;bash_code_execution&lt;/code&gt; (run shell commands) and &lt;code&gt;text_editor_code_execution&lt;/code&gt; (create and edit files). To generate a file, Claude typically writes a Python script with the text editor sub-tool, then runs it via bash. The result block is named after whichever sub-tool produced the output -- and since it's the bash execution that creates the final file, that's the block type you parse. This is also why Claude has full bash access unlike the other vendors: it's not running Python in a restricted interpreter, it's executing real shell commands. The &lt;code&gt;_20250825&lt;/code&gt; tool version introduced this bash/text-editor split, replacing the earlier &lt;code&gt;_20250522&lt;/code&gt; version that was Python-only.&lt;/p&gt;

&lt;h3&gt;
  
  
  Uploading a CSV, Getting Back a Chart + PDF
&lt;/h3&gt;

&lt;p&gt;To process your own data, upload via the Files API first, then attach the file to your prompt alongside the code execution tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Upload your input file
&lt;/span&gt;&lt;span class="n"&gt;uploaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Send the file + prompt with code execution
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files-api-2025-04-14&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this sales CSV. Create a bar chart of revenue by region &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and save it as &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Also generate a one-page PDF &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary report of the key findings.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container_upload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Download all generated files
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash_code_execution_tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash_code_execution_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single prompt can produce multiple files. In this case, you'll get both the PNG chart and the PDF report. Always iterate the full response -- never assume a single file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Container Reuse: The Key to Iteration Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude containers persist for &lt;strong&gt;30 days&lt;/strong&gt;. When your first request creates a container, the response includes a &lt;code&gt;container.id&lt;/code&gt;. Pass it to subsequent calls and Claude picks up right where it left off -- all files from the previous request are still on disk.&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# First call creates the container
&lt;/span&gt;&lt;span class="n"&gt;response1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a sales report PDF.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;container_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;

&lt;span class="c1"&gt;# Subsequent calls reuse the same container
&lt;/span&gt;&lt;span class="n"&gt;response2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;container_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Update the chart on page 2 to use a pie chart instead.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This enables "conversational file editing" -- users can iterate on documents without re-uploading data or starting from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-installed Libraries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude's sandbox comes with the document generation essentials: &lt;code&gt;reportlab&lt;/code&gt; (PDFs), &lt;code&gt;python-docx&lt;/code&gt; (Word), &lt;code&gt;python-pptx&lt;/code&gt; (PowerPoint), &lt;code&gt;openpyxl&lt;/code&gt; (Excel), &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, &lt;code&gt;pillow&lt;/code&gt;, &lt;code&gt;pypdf&lt;/code&gt;, &lt;code&gt;pdfplumber&lt;/code&gt;, &lt;code&gt;seaborn&lt;/code&gt;, &lt;code&gt;scipy&lt;/code&gt;, and &lt;code&gt;scikit-learn&lt;/code&gt;. Since Claude has full bash access, you can also &lt;code&gt;pip install&lt;/code&gt; anything else you need during the session.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  OpenAI: Responses API + Code Interpreter
&lt;/h2&gt;

&lt;p&gt;OpenAI's Responses API (the successor to the deprecated Assistants API) uses the &lt;strong&gt;Code Interpreter&lt;/strong&gt; tool for file generation. The pattern is similar to Claude, but the response structure and file retrieval mechanism differ.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating a CSV with Code Interpreter
&lt;/h3&gt;

&lt;p&gt;Enable the &lt;code&gt;code_interpreter&lt;/code&gt; tool, then parse &lt;code&gt;container_file_citation&lt;/code&gt; annotations from the response to find generated files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Request with code interpreter enabled
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_interpreter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a CSV file named &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;q1_report.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; with 10 rows of financial data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Extract file references from annotations
# The response structure nests deep: output → message → content → output_text → annotations
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;annotation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container_file_citation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="c1"&gt;# Step 3: Download from the container endpoint
&lt;/span&gt;                        &lt;span class="n"&gt;file_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                            &lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;container_id&lt;/span&gt;
                        &lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The annotation traversal is the trickiest part. Don't try to shortcut it with &lt;code&gt;response.output_text&lt;/code&gt; -- that gives you a plain string with citation markers, not the actual file references.&lt;/p&gt;

&lt;h3&gt;
  
  
  Uploading a File, Transforming It
&lt;/h3&gt;

&lt;p&gt;Upload via the standard Files API, then pass the file ID in the container config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Upload the file
&lt;/span&gt;&lt;span class="n"&gt;uploaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;purpose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pass it to code interpreter via container config
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_interpreter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this sales CSV. Create a bar chart of revenue by region and save it as a PNG.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Download generated files from annotations
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;annotation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container_file_citation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;file_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                            &lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;container_id&lt;/span&gt;
                        &lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also request higher memory tiers -- &lt;code&gt;1g&lt;/code&gt; (default), &lt;code&gt;4g&lt;/code&gt;, &lt;code&gt;16g&lt;/code&gt;, or &lt;code&gt;64g&lt;/code&gt; -- by setting &lt;code&gt;"memory_limit"&lt;/code&gt; in the container config. Useful when processing large datasets.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;OpenAI Gotchas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;cfile_&lt;/code&gt; 404 trap.&lt;/strong&gt; Generated files have IDs prefixed with &lt;code&gt;cfile_&lt;/code&gt;. If you try to download them using the standard &lt;code&gt;client.files.content()&lt;/code&gt; endpoint, you'll get a 404. You &lt;em&gt;must&lt;/em&gt; use &lt;code&gt;client.containers.files.content.retrieve()&lt;/code&gt; instead. This has tripped up every developer at least once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;20-minute container expiry.&lt;/strong&gt; OpenAI containers are ephemeral -- they expire after 20 minutes of inactivity. Download your files immediately after generation. There is no 30-day persistence like Claude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing annotations fallback.&lt;/strong&gt; There's a known edge case where &lt;code&gt;container_file_citation&lt;/code&gt; annotations don't appear in the response. When this happens, check &lt;code&gt;response.output&lt;/code&gt; for items of type &lt;code&gt;code_interpreter_call&lt;/code&gt; and inspect their outputs for file references:&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;file_refs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_interpreter_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;output_item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="c1"&gt;# Download using output_item.file_id and output_item.container_id
&lt;/span&gt;                    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Gemini: Inline Results + Structured Output
&lt;/h2&gt;

&lt;p&gt;Gemini takes a fundamentally different approach. It doesn't return downloadable file artifacts with file IDs. Instead, code execution results come back &lt;strong&gt;inline&lt;/strong&gt; -- matplotlib charts as raw image bytes, everything else as text or JSON.&lt;/p&gt;

&lt;p&gt;This isn't a technical limitation -- Google has the infrastructure to build containers and file artifact systems. The gap is strategic. Google's file generation story lives in &lt;strong&gt;Google Workspace&lt;/strong&gt;, not in the developer API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini in Docs&lt;/strong&gt; generates full first drafts from prompts, matching writing styles and pulling data from Gmail, Drive, and the web.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini in Sheets&lt;/strong&gt; builds entire spreadsheets from natural language and auto-populates cells with live data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini in Slides&lt;/strong&gt; generates themed slides, with full presentation generation from a single prompt on the roadmap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes business sense for Google. Anthropic and OpenAI are API-first companies -- their revenue comes from developers using their APIs, so building sandboxes and file download endpoints directly serves their customers. Google's revenue comes from Workspace subscriptions. When Gemini generates a spreadsheet in Workspace, it creates a Google Sheet (not an &lt;code&gt;.xlsx&lt;/code&gt;), keeping users in the Google ecosystem. An API that produces vendor-neutral files would undermine that.&lt;/p&gt;

&lt;p&gt;The practical implication: Gemini's API-level file generation gap is unlikely to close anytime soon. The structured output and inline image patterns below are the right long-term approaches, not temporary workarounds.&lt;/p&gt;

&lt;p&gt;For developers, this means Gemini is best suited for quick charts and data transforms, while complex document creation belongs with Claude or OpenAI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating a Chart (Inline Image)
&lt;/h3&gt;

&lt;p&gt;Enable the &lt;code&gt;code_execution&lt;/code&gt; tool, then extract image bytes directly from the response parts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code_execution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToolCodeExecution&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a bar chart of quarterly revenue: Q1=$2.1M, Q2=$2.8M, Q3=$3.2M, Q4=$3.9M.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Gemini returns results inline -- no separate download step
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code ran:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable_code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code_execution_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code_execution_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_image&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_image&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chart saved as revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No file IDs, no download endpoints. The image bytes are right there in the response. For text/data output, it shows up in &lt;code&gt;code_execution_result.output&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured Output for CSV Generation
&lt;/h3&gt;

&lt;p&gt;Gemini's strongest file generation pattern is actually indirect: get structured JSON data back, then format it locally with whatever library you prefer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Ask for structured JSON output
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Return a JSON array of 10 tech companies with fields: name, ticker, market_cap, sector.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Convert to CSV locally -- you control the formatting
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tech_companies.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows to tech_companies.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This "structured output" approach gives you 100% control over formatting and is the most reliable way to produce files from Gemini. Let the model do what it's good at (data generation), and handle the file formatting yourself.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;30-Second Execution Timeout&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemini's code execution sandbox has a hard 30-second timeout. This makes it ideal for quick chart generation and data transforms, but rules it out for heavy document creation tasks like multi-page PDF reports or complex PowerPoint decks. For those, use Claude or OpenAI.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Which API for What?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Claude&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Gemini&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandbox Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reusable container (30-day expiry)&lt;/td&gt;
&lt;td&gt;Ephemeral container (20-min idle timeout)&lt;/td&gt;
&lt;td&gt;Stateless sandbox (30s timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 GiB disk, 5 GiB RAM, 1 CPU&lt;/td&gt;
&lt;td&gt;Up to 64 GB RAM (tiered)&lt;/td&gt;
&lt;td&gt;Token-limited (inline output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shell Access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full bash&lt;/td&gt;
&lt;td&gt;Python only&lt;/td&gt;
&lt;td&gt;Python only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Download&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Files API (&lt;code&gt;files.download()&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Container endpoint (&lt;code&gt;containers.files.content.retrieve()&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Inline in response (no download step)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex documents (PDF, DOCX, PPTX)&lt;/td&gt;
&lt;td&gt;Heavy data processing + file gen&lt;/td&gt;
&lt;td&gt;Quick charts and data transforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;pip install&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (bash access)&lt;/td&gt;
&lt;td&gt;No (isolated sandbox)&lt;/td&gt;
&lt;td&gt;No (isolated sandbox)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex documents&lt;/strong&gt; (PDF reports, slide decks, Word docs with formatting): &lt;strong&gt;Claude&lt;/strong&gt;. The pre-installed document libraries and 30-day container persistence make it the best fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large dataset processing&lt;/strong&gt; (crunching big CSVs, Excel transformations): &lt;strong&gt;OpenAI&lt;/strong&gt;. The ability to request up to 64 GB of RAM is unmatched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick visualizations&lt;/strong&gt; (charts, graphs, simple data summaries): &lt;strong&gt;Gemini&lt;/strong&gt;. Inline image return means fewer API calls and faster turnaround.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum formatting control&lt;/strong&gt;: Any model's &lt;strong&gt;Structured Output&lt;/strong&gt; mode. Get JSON data back, render locally with your own libraries.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Self-Hosted Alternative: Run Your Own Sandbox
&lt;/h2&gt;

&lt;p&gt;The three vendor APIs above all run code in &lt;em&gt;their&lt;/em&gt; infrastructure. You send a prompt, they spin up a container, and they hand you back the file. This is convenient, but it means your data leaves your network, you're bound by each vendor's sandbox limits (30-second timeouts, no internet, fixed library sets), and you pay per-execution fees.&lt;/p&gt;

&lt;p&gt;There's a fourth option: &lt;strong&gt;run the sandbox yourself&lt;/strong&gt;. In this pattern, you call any LLM API to generate code (without enabling the vendor's code execution tool), then execute that code locally in an isolated environment on your own machines. You get the same prompt-to-file workflow, but you control the execution environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Self-Host?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data residency.&lt;/strong&gt; In regulated industries (healthcare, finance, government), sending code and data to a third-party sandbox may violate compliance requirements. A local sandbox keeps everything on your infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No vendor sandbox limits.&lt;/strong&gt; You choose the timeout, the RAM, the disk, the installed libraries. Need 10 minutes of execution time? A GPU? Network access to internal services? Your sandbox, your rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost at scale.&lt;/strong&gt; Vendor sandbox pricing is per-session or per-hour. At high volume, running your own execution infrastructure can be significantly cheaper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model flexibility.&lt;/strong&gt; Since you're decoupling "generate the code" from "run the code," you can use &lt;em&gt;any&lt;/em&gt; LLM -- including open-source models, fine-tuned models, or your own -- to produce the Python script. The sandbox doesn't care where the code came from.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools for Building It
&lt;/h3&gt;

&lt;p&gt;Two open-source projects have emerged as the leading options for sandboxed code execution:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/e2b-dev/e2b" rel="noopener noreferrer"&gt;E2B&lt;/a&gt;&lt;/strong&gt; uses Firecracker microVMs (the same technology behind AWS Lambda) to isolate each execution in its own lightweight VM with a dedicated kernel -- stronger isolation than Docker containers. E2B offers a managed cloud service, but you can also self-host on your own GCP or Linux infrastructure using their Terraform-based deployment. The Python and JavaScript SDKs make it straightforward to spin up a sandbox, run code, and retrieve files programmatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://pypi.org/project/exec-sandbox/" rel="noopener noreferrer"&gt;exec-sandbox&lt;/a&gt;&lt;/strong&gt; takes the fully-local approach. It runs untrusted code in ephemeral QEMU microVMs with hardware acceleration (KVM on Linux, HVF on macOS). No cloud dependency -- code never leaves your machine. Warm-pool latency is 1-2ms, and it supports Python, JavaScript, and shell execution. It's designed for air-gapped environments where sending code to any external service is a non-starter.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture Shift
&lt;/h3&gt;

&lt;p&gt;The key difference is that self-hosting &lt;strong&gt;decouples code generation from code execution&lt;/strong&gt;. With vendor APIs, the LLM both writes and runs the code in a single API call. With a self-hosted sandbox, you split these into two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call the LLM API for text/code generation (no code execution tool needed).&lt;/li&gt;
&lt;li&gt;Extract the generated Python script from the response.&lt;/li&gt;
&lt;li&gt;Execute it in your local sandbox (E2B, exec-sandbox, or even a locked-down Docker container).&lt;/li&gt;
&lt;li&gt;Retrieve the output files from the sandbox filesystem.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a concrete example using E2B as the sandbox and Anthropic as the LLM. Notice there's no code execution tool in the API call -- we just ask Claude to write a script, then run it ourselves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e2b_code_interpreter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sandbox&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Ask the LLM to generate a Python script (no code execution tool)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python script that uses matplotlib to create a bar chart &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;of quarterly revenue (Q1=$2.1M, Q2=$2.8M, Q3=$3.2M, Q4=$3.9M) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and saves it as &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Return only the script, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no explanation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Extract the Python code from the response
&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```

python\n(.*?)

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Execute it in an E2B sandbox
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sbx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;execution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sbx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 4: Download the generated file from the sandbox
&lt;/span&gt;        &lt;span class="n"&gt;file_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sbx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/user/revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved: revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can swap &lt;code&gt;Anthropic&lt;/code&gt; for &lt;code&gt;OpenAI&lt;/code&gt;, &lt;code&gt;genai.Client&lt;/code&gt;, or any other LLM client -- the sandbox doesn't care where the code came from. You can also upload input files to the sandbox before execution using &lt;code&gt;sbx.files.write()&lt;/code&gt;, mirroring the upload-then-process pattern from the vendor APIs.&lt;/p&gt;

&lt;p&gt;E2B's default &lt;code&gt;code-interpreter&lt;/code&gt; template comes with matplotlib, pandas, numpy, scikit-learn, pillow, openpyxl, python-docx, seaborn, and dozens of other common libraries pre-installed -- similar to the vendor sandboxes. If you need additional packages, you can either install them at runtime with &lt;code&gt;sbx.commands.run("pip install &amp;lt;package&amp;gt;")&lt;/code&gt;, or build a custom template with your dependencies baked in so every sandbox starts ready to go.&lt;/p&gt;

&lt;p&gt;This is more work to build, but it gives you full control over execution, security, and cost. It also means you can use Gemini or any other model that &lt;em&gt;doesn't&lt;/em&gt; offer file artifacts -- you just need the model to write good Python, and your sandbox handles the rest.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Tips
&lt;/h2&gt;

&lt;p&gt;If you're building file generation into a real product, a few hard-won lessons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Sanitize filenames.&lt;/strong&gt; The LLM chooses the filename based on the prompt. A creative user (or an adversarial one) can craft prompts that produce filenames with path traversal characters. Always strip or validate filenames before writing to disk. &lt;code&gt;os.path.basename()&lt;/code&gt; is your friend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Handle multi-file responses.&lt;/strong&gt; A single prompt like "make a PDF report and an Excel spreadsheet of the raw data" can produce two or more files. Always iterate the full response -- never assume exactly one file comes back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Persist container IDs for edit workflows.&lt;/strong&gt; Claude's 30-day containers enable a powerful pattern: users can say "update the chart on page 2" in a follow-up message, and the LLM picks up the original file from the persistent container. Store the &lt;code&gt;container_id&lt;/code&gt; alongside the conversation thread in your database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set timeouts generously.&lt;/strong&gt; Code execution is significantly slower than text generation. Simple files might take 30-60 seconds; complex multi-file generation (especially PPTX with embedded charts) can take 5-15 minutes. Don't use your standard API timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. All sandboxes are offline.&lt;/strong&gt; None of the three vendors allow network access from within the sandbox. All data must be uploaded or included in the prompt. You can't &lt;code&gt;pip install&lt;/code&gt; on OpenAI or Gemini (Claude is the exception -- it has bash access). You can't fetch URLs. Plan accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;File generation via LLM APIs follows a universal pattern across all three major vendors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; excels at complex document creation with its 30-day persistent containers, full bash access, and pre-installed document libraries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; offers the most compute headroom with up to 64 GB of RAM, making it ideal for heavy data processing tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; is the fastest path to charts and visualizations, returning inline image bytes with no separate download step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Try it yourself:&lt;/strong&gt; Build a CLI tool that takes a prompt and a desired output format, routes to the best vendor based on file type (PDFs to Claude, big data to OpenAI, charts to Gemini), and saves the result locally. You'll touch all three APIs and internalize the patterns in a single afternoon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Official Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/code-execution-tool" rel="noopener noreferrer"&gt;Anthropic Code Execution Tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/build-with-claude/files" rel="noopener noreferrer"&gt;Anthropic Files API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/guides/tools-code-interpreter/" rel="noopener noreferrer"&gt;OpenAI Code Interpreter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs" rel="noopener noreferrer"&gt;OpenAI API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/code-execution" rel="noopener noreferrer"&gt;Gemini Code Execution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs" rel="noopener noreferrer"&gt;Gemini API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>Skills Required for Building AI Agents in 2026</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Wed, 25 Feb 2026 19:44:03 +0000</pubDate>
      <link>https://forem.com/imaginex/skills-required-for-building-ai-agents-in-2026-2ed</link>
      <guid>https://forem.com/imaginex/skills-required-for-building-ai-agents-in-2026-2ed</guid>
      <description>&lt;h2&gt;
  
  
  Why Agent Development Is Harder Than You Think
&lt;/h2&gt;

&lt;p&gt;An Agent is conceptually simple: take the one-question-one-answer model of an LLM and add a loop. The model reasons about what to do next, calls external tools, feeds results back into itself, and repeats until the task is complete. A &lt;code&gt;while&lt;/code&gt; loop plus tool-calling — that's the skeleton.&lt;/p&gt;

&lt;p&gt;But between "working demo" and "production product" lies an engineering chasm. OAuth flows, tool design, error cascading across multi-step tasks, runaway costs, context window management, evaluation, multi-Agent coordination, model capability bottlenecks, and framework trade-offs — these nine challenges are where Agent development &lt;em&gt;actually&lt;/em&gt; gets hard. API calls account for roughly 5% of the total effort; the other 95% is everything else.&lt;/p&gt;

&lt;p&gt;For a detailed walkthrough of each challenge, see the companion piece: &lt;a href="//agent_dev_issues.md"&gt;Is AI Agent Development Just About Calling APIs?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The question this post addresses is different: &lt;strong&gt;given that Agent development is hard, what skills do you actually need to succeed at it in 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Skill Shift: From Writing Code to Shaping Problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Inspired by a Story: How an Intern Outperformed a Senior Engineer?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/shubhamsaboo/" rel="noopener noreferrer"&gt;Shubham Saboo&lt;/a&gt; — Senior AI Product Manager at Google Cloud, founder of Unwind AI, and co-author of Google's &lt;em&gt;Introduction to Agents&lt;/em&gt; whitepaper — recently shared an experience from a startup where he serves as an advisor. Something happened that overturned everyone's assumptions.&lt;/p&gt;

&lt;p&gt;A senior engineer received a task and followed the traditional workflow: understand requirements, design architecture, write code, debug, and test. Three days later, he delivered a technically flawless solution -- clean code, clear logic, fully compliant with engineering standards.&lt;/p&gt;

&lt;p&gt;An intern completed the same task in a single afternoon.&lt;/p&gt;

&lt;p&gt;It wasn't that the intern had superior technical skills. Quite the opposite -- his coding experience was far less than the senior engineer's. But he did something fundamentally different: &lt;strong&gt;he defined the problem clearly enough, then let Claude Code do the rest.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This scenario reveals a harsh reality: when AI can complete implementation-level work quickly and accurately, the bottleneck shifts entirely upstream. The value is no longer &lt;em&gt;"Can you write this code?"&lt;/em&gt; but rather &lt;em&gt;"Can you decompose the problem to a level where AI almost never makes mistakes?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An even more striking example comes from inside Anthropic. They had Opus 4.6 build a C compiler using a team of Agents, then essentially stepped back. Two weeks later, it could run on the Linux kernel -- 100,000 lines of working Rust code, without a single line written by a human.&lt;/p&gt;

&lt;p&gt;The researcher leading this project, &lt;a href="https://nicholas.carlini.com/" rel="noopener noreferrer"&gt;Nicholas Carlini&lt;/a&gt; — a research scientist at Anthropic known for his work on adversarial machine learning — did only one thing: &lt;strong&gt;problem decomposition.&lt;/strong&gt; He broke down the vague goal of "build a compiler" into 16 precisely defined subtasks, each with clear inputs, outputs, and success criteria. Then 16 Agents, each handling its own piece, completed the entire compiler.&lt;/p&gt;

&lt;p&gt;The real leverage isn't in writing code -- it's in breaking problems down to the point where AI almost never gets it wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four Skills That Are No Longer Differentiating
&lt;/h3&gt;

&lt;p&gt;Shubham argues that four capabilities that once commanded high salaries for developers are rapidly losing their power as differentiators — not because they're useless, but because AI has made them table stakes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Writing code from scratch.&lt;/strong&gt; Agents write faster and produce fewer bugs. The ability to hand-write code still matters as foundational knowledge, but it's no longer what sets great developers apart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate code and project scaffolding.&lt;/strong&gt; A single prompt generates them instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memorizing syntax and APIs.&lt;/strong&gt; Extended context windows have already solved this problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translating specifications into code.&lt;/strong&gt; Now, the specification itself &lt;em&gt;is&lt;/em&gt; the code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These skills were once valuable because implementation itself was hard. They required years of training and justified six-figure salaries. But &lt;strong&gt;implementation is no longer the bottleneck&lt;/strong&gt; — it's becoming the easy part.&lt;/p&gt;

&lt;p&gt;Yet the entire industry is still optimizing around the old bottleneck. Most companies' job descriptions still emphasize "proficient in Java," "familiar with Spring framework," "5+ years of development experience." These criteria are losing relevance at a visible pace.&lt;/p&gt;

&lt;p&gt;Value has migrated to five new skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Five Skills That Truly Matter in 2026
&lt;/h3&gt;

&lt;p&gt;I am tryiing to answer this question. This isn't theoretical speculation -- it's what I has witnessed firsthand when developing AI solutions in the past 2 years, in the open-source community, and through countless experiences building Agents.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Problem Shaping
&lt;/h4&gt;

&lt;p&gt;Turning vague goals into executable tasks -- this skill separates people who "play around with AI" from those who actually build products with it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Build me a dashboard"&lt;/em&gt; is not a task; it's a wish. Problem shaping breaks it into twelve specific, testable subtasks: What data does this dashboard display? What decisions does it support? What must the user understand within the first three seconds? Each sub-problem has clear inputs, clear outputs, and clear success criteria.&lt;/p&gt;

&lt;p&gt;When you decompose a vague goal into precise sub-problems, the Agent's execution quality transforms entirely. It no longer needs to guess your intent -- it just follows clear instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to practice problem shaping:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the desired output and work backwards — what does "done" look like?&lt;/li&gt;
&lt;li&gt;For each subtask, define three things: the input it receives, the output it produces, and how you'll know it succeeded.&lt;/li&gt;
&lt;li&gt;If a subtask is still ambiguous enough that two people would interpret it differently, break it down further.&lt;/li&gt;
&lt;li&gt;Verify your decomposition by asking: could a competent person with zero context about this project execute each subtask from the description alone?&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  2. Context Design
&lt;/h4&gt;

&lt;p&gt;Agent output quality is directly proportional to the quality of context you provide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poor context:&lt;/strong&gt; &lt;em&gt;"Build me a customer support agent."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good context:&lt;/strong&gt; &lt;em&gt;"The target users are SaaS customers considering canceling their subscriptions who have already tried the help documentation but failed. The tone should be empathetic yet efficient -- avoid excessive apologies and robotic responses. Here are 3 real cases that received five-star ratings and 2 cases that received complaints. Edge cases requiring human escalation include: billing disputes over $500, account security issues, and legal compliance matters. The success metric is resolving the issue within 4 messages without escalation."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The difference isn't in prompt engineering tricks. It's in &lt;strong&gt;information density, boundary conditions, success criteria, and understanding of real-world scenarios.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A context design checklist:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who is the target user, and what is their state of mind?&lt;/li&gt;
&lt;li&gt;What does the desired tone sound like? Provide 2–3 real examples, not adjectives.&lt;/li&gt;
&lt;li&gt;What are the edge cases that require special handling or human escalation?&lt;/li&gt;
&lt;li&gt;What does success look like, in measurable terms?&lt;/li&gt;
&lt;li&gt;What are the most common failure modes, and how should the Agent handle them?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Aesthetic Judgment
&lt;/h4&gt;

&lt;p&gt;When ten options are in front of you, knowing that nine of them won't work.&lt;/p&gt;

&lt;p&gt;Shubham recently had Antigravity build a bargaining simulator for his repository: two Agents negotiating a used car deal, each with a distinct personality, live-streaming the entire process. The first version ran perfectly -- clean code, no errors, both sides going back and forth. Technically complete.&lt;/p&gt;

&lt;p&gt;He rejected it in thirty seconds.&lt;/p&gt;

&lt;p&gt;The interface was just a plain chat window. The negotiation process read like a log file -- no personality tension, no emotional highs and lows, no dramatic moments of &lt;em&gt;"Shark Steve holding the line against Cool-Hand Casey pretending to walk away."&lt;/em&gt; It worked as software; it failed as an experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An Agent can build anything you describe, but it cannot judge what is worth describing.&lt;/strong&gt; Agents optimize for &lt;em&gt;correctness&lt;/em&gt;; humans optimize for &lt;em&gt;"Would anyone actually want to use this?"&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Agent Orchestration
&lt;/h4&gt;

&lt;p&gt;Knowing when to use one Agent, when to use multiple, when to run them in parallel, when to run them sequentially, when to add guardrails, and when to let go.&lt;/p&gt;

&lt;p&gt;Three core patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sequential pipeline:&lt;/strong&gt; Agent A completes its task and passes the output to Agent B. Best for scenarios with dependencies between steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator + specialist team:&lt;/strong&gt; A lead Agent dispatches tasks and integrates results. Best for complex tasks requiring quality control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel execution + merge:&lt;/strong&gt; Multiple Agents handle independent tasks simultaneously, with results consolidated at the end. Best for scenarios with no dependencies between subtasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most people default to sequential workflows because they feel "safer." But knowing when to parallelize and when to introduce a coordinator determines whether your workflow finishes in five minutes or drags on for an hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A practical rule of thumb:&lt;/strong&gt; If two subtasks don't share state — neither reads what the other writes — they can run in parallel. If one subtask's output determines what the next subtask even &lt;em&gt;is&lt;/em&gt;, they must be sequential. And if you have more than three parallel Agents whose outputs need to be merged, introduce a coordinator to avoid contradictory results.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Knowing When NOT to Use an Agent
&lt;/h4&gt;

&lt;p&gt;Not every problem needs an Agent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need to reformat JSON? Hand it to Gemini 3 Flash -- done in ten seconds.&lt;/li&gt;
&lt;li&gt;Text replacement across ten files? A lightweight model handles it in seconds.&lt;/li&gt;
&lt;li&gt;A bug you already fully understand? Fixing it yourself is faster than explaining it to an Agent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;True capability is matching the right tool to the problem.&lt;/strong&gt; Complex problems get Agents. Simple problems get models. Obvious problems get your keyboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conway's Law Restructured in the Age of AI
&lt;/h3&gt;

&lt;p&gt;In the classic book &lt;em&gt;The Mythical Man-Month&lt;/em&gt;, Fred Brooks proposed a famous insight: a software system's architecture will inevitably mirror the communication structure of the organization that built it. This became known as &lt;strong&gt;Conway's Law.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building AI agents is essentially restructuring Conway's Law with AI.&lt;/p&gt;

&lt;p&gt;In traditional software development, the speed of delivering a feature depends on team size, communication efficiency, and technical debt. You need frontend engineers, backend engineers, QA engineers, countless meetings to align requirements, and long develop-test-fix cycles.&lt;/p&gt;

&lt;p&gt;In the Agent era, this chain is compressed. &lt;strong&gt;One person plus 16 Agents can build a compiler in two weeks. One intern plus Claude Code can accomplish in an afternoon what took a senior engineer three days.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organizational structure is no longer the bottleneck. &lt;strong&gt;The quality of problem definition is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is why Shubham says the best developers of 2026 look more like &lt;strong&gt;film directors&lt;/strong&gt; than programmers. They set the scene, cast the actors, and know when to call "cut." They don't write every line of dialogue -- they shape the entire production.&lt;/p&gt;

&lt;p&gt;The essence of programming is shifting from &lt;strong&gt;"writing"&lt;/strong&gt; to &lt;strong&gt;"orchestrating."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Limitations You Must Know
&lt;/h3&gt;

&lt;p&gt;Although Agents sound like magic, you must be aware of three limitations when applying them in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Agent quality is highly dependent on problem definition.&lt;/strong&gt; If you cannot decompose the problem clearly enough, the Agent will consistently produce outputs in the wrong direction. This isn't the Agent's fault -- it's a problem-shaping problem. Before you master this skill, Agents may actually slow you down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Context design requires deep business understanding.&lt;/strong&gt; Writing a good &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;.cursor/rules&lt;/code&gt; file requires you to truly understand the product's worldview, users' pain points, and success criteria. This understanding cannot be rushed -- it can only be accumulated through repeated shipping and observing real user behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Aesthetic judgment cannot be learned from books.&lt;/strong&gt; It comes from repeated shipping, observing real user behavior, and developing sensitivity to the gap between &lt;em&gt;"it works"&lt;/em&gt; and &lt;em&gt;"it's worth using."&lt;/em&gt; Without this accumulation, Agents will help you rapidly produce a large volume of things that are &lt;em&gt;"technically correct but experientially failed."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  State Management: Problem Shaping Applied to Execution
&lt;/h3&gt;

&lt;p&gt;All five skills above come into sharpest focus in one practical engineering challenge: &lt;strong&gt;state management.&lt;/strong&gt; An Agent that can plan is worthless if it can’t track its own progress. Without a progress-tracking mechanism, Agents fall into "hallucination loops" — repeating steps, losing track of the original goal, or confidently declaring a task complete when it’s half-done.&lt;/p&gt;

&lt;p&gt;This is where all five skills converge — applied not to a product or a user-facing feature, but to the Agent itself. Each of the four patterns below draws on a different combination of skills:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The "Plan-Act-Observe" Loop (ReAct pattern).&lt;/strong&gt; &lt;em&gt;(Skill #1 Problem Shaping + Skill #2 Context Design)&lt;/em&gt; Instead of handing the Agent a giant task list, force it to update its internal state after every single action. The Agent explains what it intends to do (Thought), calls a tool (Action), receives the raw result (Observation), then compares that result against the original plan (Status Update). The loop itself is problem shaping — breaking execution into atomic Thought→Action→Observation cycles. The status update after each cycle is context design — ensuring the Agent's next decision is informed by accurate, structured state rather than stale memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dynamic Task Graphs.&lt;/strong&gt; &lt;em&gt;(Skill #1 Problem Shaping + Skill #4 Agent Orchestration)&lt;/em&gt; For complex workflows, static to-do lists break down. Use a directed acyclic graph (DAG) or dynamic task queue where each task carries a status (&lt;code&gt;PENDING&lt;/code&gt;, &lt;code&gt;IN_PROGRESS&lt;/code&gt;, &lt;code&gt;COMPLETED&lt;/code&gt;, &lt;code&gt;FAILED&lt;/code&gt;), dependencies are tracked explicitly (Task B doesn’t start until Task A succeeds), and intermediate variables are stored in a scratchpad — like a URL found in Step 1 that’s needed in Step 5. Defining each node with clear inputs, outputs, and success criteria is problem shaping. Deciding which nodes run in parallel versus sequentially, and how results flow between them, is agent orchestration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Critic Node.&lt;/strong&gt; &lt;em&gt;(Skill #3 Aesthetic Judgment + Skill #4 Agent Orchestration)&lt;/em&gt; In multi-Agent architectures, it helps to have a supervisor that reviews outputs rather than just trusting the worker’s self-assessment. The Worker executes and reports "I’m done." The Critic checks whether the goal was &lt;em&gt;actually&lt;/em&gt; achieved. A shared Global State stores the current version of truth. This is the Coordinator pattern from Skill #4 applied to quality control — but the Critic’s evaluation criteria come from Skill #3: knowing when output is "technically correct" but not actually good enough. Without aesthetic judgment baked into the Critic’s rubric, it degrades into a syntax checker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Checkpointing and Self-Correction.&lt;/strong&gt; &lt;em&gt;(Skill #1 Problem Shaping + Skill #5 Knowing When NOT to Use an Agent)&lt;/em&gt; Progress tracking isn’t just about moving forward — it’s about knowing when to turn back. If an observation returns an error, the Agent should update the plan rather than crash — that’s problem shaping in real time, re-decomposing the remaining work based on new information. And if an Agent is 50 steps deep into what should be a 5-step task, it’s "lost in the woods" and needs a reset. Budget monitoring (tokens, turns, or wall-clock time) prevents runaway execution. Recognizing when to abort an Agent run and switch to a simpler tool — or fix the issue manually — is Skill #5 in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A practical implementation tip:&lt;/strong&gt; &lt;em&gt;(Skill #2 Context Design)&lt;/em&gt; Prepend a status summary to every LLM call — original goal, completed steps, current step, remaining steps. This is context design at its most literal: engineering the information the Agent sees at every turn. This "external state" acts as a rhythmic beat that keeps the context window focused on the finish line, counteracting the "Agentic Amnesia" problem described in the companion piece.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting It Into Practice
&lt;/h3&gt;

&lt;p&gt;I close with a poignant statement: &lt;em&gt;"These skills cannot be acquired through reading. They come from practice."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I offer five concrete exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Review your last five Agent outputs.&lt;/strong&gt; Write down what you would change and why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write a &lt;code&gt;CLAUDE.md&lt;/code&gt; for your current project&lt;/strong&gt; -- even if it only takes 30 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The next time you face a vague requirement,&lt;/strong&gt; break it into 10 subtasks before writing a prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Take a sequential workflow&lt;/strong&gt; and identify which steps can run in parallel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For one week, log every task&lt;/strong&gt; where you used an Agent but a simple prompt would have sufficed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Open your most recent project and ask yourself: &lt;em&gt;Are you spending more time writing code, or shaping problems?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="//agent_dev_issues.md"&gt;ten engineering challenges of building AI agents&lt;/a&gt; haven't gone away. But the response to them has fundamentally shifted.&lt;/p&gt;

&lt;p&gt;Twenty years ago, the scarce resource was implementation skill — the ability to translate an idea into working code. That scarcity justified years of training, specialized hiring, and the entire structure of software teams. Today, Agents handle implementation at speed and quality that rivals senior engineers. The scarce resource has moved upstream: the ability to decompose problems precisely, design rich context, exercise aesthetic judgment, orchestrate multi-Agent workflows, and know when to reach for a simpler tool.&lt;/p&gt;

&lt;p&gt;This isn't a prediction about the future. It's a description of what's already happening — an intern shipping in an afternoon, a compiler built without a human writing a single line of code, organizations discovering that their bottleneck is problem definition, not programming talent.&lt;/p&gt;

&lt;p&gt;The developers who thrive in this era won't be the ones who write the most code. They'll be the ones who ask the best questions, shape the clearest problems, and know when the Agent's output is good enough — and when it isn't.&lt;/p&gt;

&lt;p&gt;The skills have shifted. The question is whether you'll shift with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Berkeley Function-Calling Leaderboard&lt;/strong&gt; — Tool-calling accuracy benchmarks across models (~77.5% top accuracy). &lt;a href="https://gorilla.cs.berkeley.edu/leaderboard.html" rel="noopener noreferrer"&gt;berkeley-function-call-leaderboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Galileo Research&lt;/strong&gt; — Findings on error cascading in multi-step Agent tasks. &lt;a href="https://www.galileo.ai/" rel="noopener noreferrer"&gt;galileo.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain State of AI Agents Report&lt;/strong&gt; — Survey data on Agent evaluation practices (52% offline evaluation, 37% online evaluation). &lt;a href="https://blog.langchain.dev/" rel="noopener noreferrer"&gt;blog.langchain.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UC Berkeley MAST Framework&lt;/strong&gt; — Analysis of 1,600+ Agent traces showing 41–86.7% multi-Agent failure rates, with 79% of failures from orchestration. &lt;a href="https://arxiv.org/abs/2503.13657" rel="noopener noreferrer"&gt;arxiv.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Azure SRE Case Study&lt;/strong&gt; — Production experience scaling from 50+ sub-Agents to 5 core tools. &lt;a href="https://techcommunity.microsoft.com/blog/appsonazureblog/context-engineering-lessons-from-building-azure-sre-agent/4481200" rel="noopener noreferrer"&gt;techcommunity.microsoft.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic Agent Evaluation Blog (January 2025)&lt;/strong&gt; — Challenges in systematically evaluating Agent behavior. &lt;a href="https://www.anthropic.com/research/building-effective-agents" rel="noopener noreferrer"&gt;anthropic.com/research&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nicholas Carlini — C Compiler with Opus&lt;/strong&gt; — Building a C compiler with 16 Agents producing 100,000 lines of Rust. &lt;a href="https://nicholas.carlini.com/" rel="noopener noreferrer"&gt;nicholas.carlini.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shubham Saboo / Unwind AI&lt;/strong&gt; — &lt;a href="https://www.theunwindai.com/" rel="noopener noreferrer"&gt;theunwindai.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boston Consulting Group&lt;/strong&gt; — Research showing fewer than 20% of enterprise Agent projects achieve expected ROI. &lt;a href="https://www.bcg.com/publications/2025/closing-the-ai-impact-gap" rel="noopener noreferrer"&gt;bcg.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alibaba Cloud Engineering Blog&lt;/strong&gt; — Data showing AI completes 30% of work in production Agent systems, with 70% being tool engineering. &lt;a href="https://www.alibabacloud.com/blog/602301" rel="noopener noreferrer"&gt;alibabacloud.com/blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spotify Engineering&lt;/strong&gt; — Experience with context window limits in code Agent development. &lt;a href="https://engineering.atspotify.com/2025/11/context-engineering-background-coding-agents-part-2" rel="noopener noreferrer"&gt;engineering.atspotify.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manus Team&lt;/strong&gt; — Four framework rebuilds for context engineering. &lt;a href="https://manus.im/" rel="noopener noreferrer"&gt;manus.im&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fred Brooks, The Mythical Man-Month&lt;/strong&gt; — Origin of Conway's Law and organizational structure insights. &lt;a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month" rel="noopener noreferrer"&gt;wikipedia.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>programming</category>
    </item>
    <item>
      <title>Is AI Agent Development Just About Calling APIs? Where's the Real Difficulty?</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Wed, 25 Feb 2026 19:40:02 +0000</pubDate>
      <link>https://forem.com/imaginex/is-ai-agent-development-just-about-calling-apis-wheres-the-real-difficulty-2j75</link>
      <guid>https://forem.com/imaginex/is-ai-agent-development-just-about-calling-apis-wheres-the-real-difficulty-2j75</guid>
      <description>&lt;h2&gt;
  
  
  The Bottom Line First
&lt;/h2&gt;

&lt;p&gt;Calling APIs is indeed the entirety of Agent development — just like cooking is indeed putting ingredients in a pot. Technically correct, but it perfectly explains why some people produce Michelin-star dishes while others produce culinary disasters.&lt;/p&gt;

&lt;p&gt;Saying the conclusion without explanation is meaningless. Let's actually build an Agent and walk through it together. But before diving in, let's take 30 seconds to clarify what an Agent actually is.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is an Agent, Exactly?
&lt;/h2&gt;

&lt;p&gt;The original interaction model with large language models (LLMs) was simple: you ask a question, it gives an answer. One question, one answer, done. If you wanted it to do something complex, you had to manually break tasks into small pieces and feed them one round at a time. You were the "orchestrator"; the LLM was just a passive tool that responded on demand.&lt;/p&gt;

&lt;p&gt;What an Agent does is fundamentally one thing: &lt;strong&gt;it adds a loop to this question-and-answer model.&lt;/strong&gt; The model no longer just answers you once. Instead, it judges "what else do I need to do," calls external tools to get results, feeds those results back to itself, thinks about what to do next, and repeats until the task is complete. This loop transforms a large model from a "responder" into an "executor."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Execution Loop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input → LLM Reasoning → Need to call a tool?
                                      │
                    ┌─── Yes ─────────┘─── No ───┐
                    ▼                             ▼
           Select Appropriate Tool         Task Complete?
                    │                             │
                    ▼                        Yes  ▼
           Call External Tool          Return Final Result
         ┌──────────────────┐
         │  Check Emails    │
         │  Check Calendar  │
         │  Create Meeting  │
         └──────────────────┘
                    │
                    ▼
         Get Tool Return Results
                    │
                    ▼
           Update Context ──────────────────────────────┐
                                                        │
                                              (loop continues)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conceptually, it's that simple. A &lt;code&gt;while&lt;/code&gt; loop plus tool-calling capability — that's your Agent skeleton. So many people read this and think, "There's no real technical depth here?" True, the skeleton is simple. But making that loop run stably, reliably, and efficiently in the real world — &lt;strong&gt;that&lt;/strong&gt; is the real engineering challenge.&lt;/p&gt;

&lt;p&gt;Let's walk through it for real. Say you want to build an Agent that manages your schedule: read emails, check calendars, arrange meetings. Doesn't sound complicated, right? Let's look at what you encounter at each step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Call the API — Done in 10 Minutes
&lt;/h2&gt;

&lt;p&gt;This step really is easy. Install an SDK, write a few lines of code, pass user input to the model, get back a result. If you've used the OpenAI or Claude API, you could write it blindfolded. You don't even need to write code yourself — open an AI coding tool like Claude Code or Cursor, describe your requirements in natural language, and they'll scaffold the project for you. Define a few tools — check calendar, read emails, create meeting — write the JSON schema, and the model can call them.&lt;/p&gt;

&lt;p&gt;It runs. You ask it "what meetings do I have tomorrow?", it calls the calendar tool, gets the result, and reads it back in natural language. Perfect. You think: Agent development isn't that hard, maybe I can ship this in a week.&lt;/p&gt;

&lt;p&gt;I've had this feeling before. 20 years ago when I first learned C# development, I dragged a few controls onto a Windows Form and had a running App — I thought Windows Form development was no big deal either.&lt;/p&gt;

&lt;p&gt;In theory, those AI coding Agents could handle every step ahead for you too. But in practice, every problem you encounter from here on isn't about &lt;em&gt;how to write the code&lt;/em&gt; — it's about &lt;em&gt;what code should be written&lt;/em&gt;. To really understand where Agent development gets hard, let's keep walking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Connect to Real APIs — The Nightmare Begins
&lt;/h2&gt;

&lt;p&gt;In the demo you used mock data. Now you need to connect to real email and calendar services. Each user might use something different: Outlook, Gmail, hotmail, etc. Let's simplify and just connect to Microsoft's Graph API — it's accessible domestically and Outlook is mainstream in enterprise.&lt;/p&gt;

&lt;p&gt;The first problem arrives immediately: &lt;strong&gt;OAuth&lt;/strong&gt;. Your users must authorize your application to access their Microsoft account. You need to register an app in Azure AD, handle OAuth redirects, securely store refresh tokens, and auto-refresh when tokens expire. None of this has anything to do with the LLM, but without it, your Agent can't take its first step. Microsoft's permissions model alone (delegated permissions vs. application permissions) can eat half a day of research.&lt;/p&gt;

&lt;p&gt;Then come the &lt;strong&gt;API edge cases&lt;/strong&gt;. Microsoft Graph returns email lists paginated — 10 items per page by default, up to 50. Your Agent gets the first page without knowing how many more pages exist, and it will give you a confident-sounding conclusion based on just those 10 emails. Ask "did anyone email me last week about Project A?" — the actual email is on page 3, but the Agent confidently tells you "no." You can add a tool to check the next page, but then you need to add a tool to check the next page, and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt; is another problem. Microsoft Graph's throttling strategy is complex, with different thresholds per app, per user, and per resource type. If your Agent calls it a dozen times in a complex task, it will easily hit a 429 error. What happens then? The model doesn't know what "429 Too Many Requests" means — it just thinks the tool call failed and starts guessing reasons. And this is only for one provider. To build a real product, every provider (Gmail, hotmail, etc.) has its own authentication system and API design. The workload multiplies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tool Design Problem:&lt;/strong&gt; Connecting the API is only half of the tool-call equation. The other half is &lt;strong&gt;how to design the tool itself&lt;/strong&gt; — and this is trickier than you'd expect.&lt;/p&gt;

&lt;p&gt;What should your "search emails" tool look like? If it's too rigid — only supporting sender-based queries — a user saying "find last week's emails about Project A" will fail immediately. So you add keyword search, time range filtering, attachment filtering? The more parameters, the more complex the schema, and the more likely the model is to fill things in wrong or miss fields. Berkeley's Function-Calling benchmark found that the more tools and the more complex the parameters, the worse model accuracy becomes. Smaller models degrade dramatically as tool count grows — BFCL data shows that models like Llama 3.1 8B can handle a modest number of tools but start failing unpredictably once tool count exceeds their capacity threshold.&lt;/p&gt;

&lt;p&gt;On the other end, if you design a generic "search" tool that covers everything, the model won't know what to put in it. It might pass calendar query parameters to the email search tool, or call "send email" when it should "create a meeting." There's no right answer for tool granularity — too fine and user needs aren't covered, too coarse and the model can't handle it. The only way is to iterate in your specific context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool description text matters enormously.&lt;/strong&gt; For the same functionality, a description written as &lt;code&gt;"Search emails"&lt;/code&gt; vs. &lt;code&gt;"Search the user's Outlook inbox by keyword, sender, date range, or attachment presence. Returns a list of matching emails sorted by date"&lt;/code&gt; produces dramatically different model accuracy. In short, you don't just need to write code to implement a tool — you need to learn to &lt;strong&gt;write a manual for the model&lt;/strong&gt;, and whether that manual is good or bad, you can only verify through repeated testing.&lt;/p&gt;

&lt;p&gt;A lot of research puts it clearly with data: &lt;strong&gt;in production-grade Agent systems, AI completes only 30% of the work, and the remaining 70% is tool engineering.&lt;/strong&gt; What you think of as "calling an API" is mostly spent on the design and integration work surrounding that API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Multi-step Tasks — Errors Start Snowballing
&lt;/h2&gt;

&lt;p&gt;Good — the API is connected and basically working. Now try a slightly more complex request: "Find a time slot next week when everyone is free, schedule a project review meeting, and then email all attendees."&lt;/p&gt;

&lt;p&gt;This task requires: querying multiple people's calendar availability, finding the intersection, creating a meeting invite, drafting an email, and sending it. Five or six steps, each depending on the previous one's result.&lt;/p&gt;

&lt;p&gt;Here's the problem. Berkeley's Function-Calling Leaderboard (BFCL) shows that even the best models struggle with tool-calling accuracy — top scores hover around &lt;strong&gt;80%&lt;/strong&gt; on overall benchmarks, and accuracy drops further as tool count and parameter complexity increase. That means roughly 1 in 5 calls has an error. The probability of a five-step task completing entirely correctly? About 0.8 to the fifth power — &lt;strong&gt;less than 33%.&lt;/strong&gt; Your Agent has roughly a two-thirds chance of going wrong at some step.&lt;/p&gt;

&lt;p&gt;Worse, Galileo's research found that &lt;strong&gt;early small errors amplify through later steps.&lt;/strong&gt; Say the model misparses a date format in step one and reads Tuesday as Wednesday. Every subsequent step builds on that error. It creates a meeting at the wrong time, then sends everyone an email notification with the wrong time. One small hallucination triggers a cascade of wrong actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At this point you realize: you need to add validation logic between each step, rollback mechanisms, and confirmation loops. None of this is taught in any LLM's API documentation.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Guardrails — The Invisible Security Risk
&lt;/h2&gt;

&lt;p&gt;And there's a deeper problem lurking here that most people don't think about until it's too late: &lt;strong&gt;guardrails.&lt;/strong&gt; Your scheduling Agent has permissions to send emails, create meetings, and modify calendars. What happens when it hallucinates a participant name and sends a meeting invite to the wrong person? Or confidently deletes a calendar block because it "optimized" your schedule?&lt;/p&gt;

&lt;p&gt;OWASP classifies this as &lt;strong&gt;"Excessive Agency"&lt;/strong&gt; (LLM06:2025) — one of the top security threats in LLM applications. It breaks down into three failure modes: excessive functionality (your Agent has access to 50 actions when it only needs 5), excessive permissions (your Agent can modify &lt;em&gt;any&lt;/em&gt; calendar, not just the user's), and excessive autonomy (the Agent sends emails and creates meetings without any human confirmation gate).&lt;/p&gt;

&lt;p&gt;In practice, you need to separate "read" tools from "write" tools and put explicit approval gates on write operations. High-stakes actions — sending external emails, deleting calendar entries, modifying shared resources — should run in a "dry run" mode where the Agent describes what it &lt;em&gt;would&lt;/em&gt; do and waits for human confirmation before executing. You need to design for rapid rollback, because the question isn't &lt;em&gt;if&lt;/em&gt; your Agent will take a wrong action — it's &lt;em&gt;when&lt;/em&gt;. And you need to enforce the principle of least privilege: your Agent should request only the minimum API permissions it needs, not broad access "just in case."&lt;/p&gt;

&lt;p&gt;None of this is glamorous engineering. But skip it, and one hallucinated email from your Agent can undo months of user trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Open It to Real Users — The Bill Scares You Awake
&lt;/h2&gt;

&lt;p&gt;You tested the first three steps in your development environment and things seemed fine. But once you open the Agent to real users, the nightmare comes from a direction you never anticipated: &lt;strong&gt;the bill.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You used Claude Sonnet or GPT-4o for development testing — great results, a few cents per complex task, no pain. But with real users, hundreds of requests per day, each averaging four or five tool call rounds, each carrying substantial context — you look at the monthly bill and see a small feature burning thousands of dollars a month. What if user volume grows ten times?&lt;/p&gt;

&lt;p&gt;You think: a user saying "what meetings do I have tomorrow?" — does that really need the most powerful model? That's overkill.&lt;/p&gt;

&lt;p&gt;So you start thinking about &lt;strong&gt;model routing&lt;/strong&gt;: different tasks use different base models. Simple queries go to cheap small models (Haiku, GPT-4o mini, Gemini Flash); complex multi-step reasoning goes to large models (Claude Sonnet, GPT-4o, Gemini Pro). But who judges complexity?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a large model to judge? That costs money too.&lt;/li&gt;
&lt;li&gt;Use a rule engine? Works for simple cases, but user inputs are endlessly variable and rules always have gaps.&lt;/li&gt;
&lt;li&gt;Use a small model as a classifier? Now you've added another model component that needs tuning and maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And different models vary enormously in their tool-calling capabilities. A tool schema that works on Claude Sonnet may have parameters filled in wrong on Haiku. JSON that runs perfectly on GPT-4o may fail to parse on open-source models. Every time you swap a model, your carefully tuned prompts and tool descriptions may need to be re-adapted. This is why many teams eventually find that the token money saved doesn't cover the labor cost of multi-model adaptation.&lt;/p&gt;

&lt;p&gt;To put concrete numbers on this: Claude Sonnet costs \$3/\$15 per million input/output tokens, while Claude Haiku costs \$0.25/\$1.25 — a 12x to 60x difference. GPT-4o vs. GPT-4o mini has a similar spread. Mid-sized Agent deployments easily burn \$1K–\$5K per month in token costs alone; complex Agents consuming 5–10 million tokens monthly aren't unusual. One underrated optimization: &lt;strong&gt;prompt caching.&lt;/strong&gt; Anthropic's prefix caching can reduce costs by up to 90% and latency by 85% for repeated long prompts — a massive win for Agents that include the same system prompt and tool definitions in every call.&lt;/p&gt;

&lt;p&gt;And cost isn't the only scaling problem — &lt;strong&gt;latency&lt;/strong&gt; hits you just as hard. A multi-step scheduling task that checks four people's calendars, finds a common slot, creates a meeting, and sends emails can easily take 30–45 seconds end-to-end. Technically correct, but your users experience it as broken. The biggest UX win is &lt;strong&gt;streaming intermediate results&lt;/strong&gt;: instead of a 45-second black box, show "Checking Alice's calendar... Found 3 available slots... Confirming with Bob..." — the total time is the same, but the perceived wait drops dramatically. Parallelizing independent tool calls (check all four calendars simultaneously instead of sequentially) helps with actual latency. But the hard tradeoff remains: smaller, faster models hallucinate more, so you can't just throw Haiku at everything to speed things up.&lt;/p&gt;

&lt;p&gt;Cost optimization looks like an operations problem, but it's actually an &lt;strong&gt;architecture problem&lt;/strong&gt;. You need to make the model-calling layer pluggable from the very beginning — something most people never think about when writing a demo.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Context Management — Your Agent Starts "Forgetting"
&lt;/h2&gt;

&lt;p&gt;After a while, you notice a strange problem: the Agent "drifts" during long tasks. You give it a complex task requiring seven or eight conversation rounds, and by rounds four or five, it starts forgetting the original requirements and constraints.&lt;/p&gt;

&lt;p&gt;This is what the industry calls &lt;strong&gt;"Agentic Amnesia."&lt;/strong&gt; Research data is clear: when tasks are split across multiple conversation rounds, model performance degrades significantly — and without memory management strategies, Agents lose track of constraints, requirements, and earlier results as context accumulates.&lt;/p&gt;

&lt;p&gt;The reason is that LLM context windows are finite. Every tool call's input and output consumes context space. Query five people's calendars, each returning a large JSON payload, and the context window is mostly full. Spotify's engineering team hit the exact same pitfall building a code Agent: once the context window filled up, the Agent "lost its direction" and forgot the original task after a few rounds.&lt;/p&gt;

&lt;p&gt;You need to start doing &lt;strong&gt;Context Engineering&lt;/strong&gt;. Anthropic defines it as "curating exactly what content goes into a limited context window from an ever-changing universe of information." In plain terms, it's the LLM version of memory management: you dynamically decide what the model "sees" at each reasoning step and what it "forgets." Which historical information gets compressed into summaries? Which key constraints must always be preserved? Which tool return values can be discarded?&lt;/p&gt;

&lt;p&gt;The Manus team rebuilt their entire framework four times to get this right. Four times. They called this process "stochastic gradient descent" — inelegant, but effective.&lt;/p&gt;

&lt;p&gt;There's also a subtler trap: &lt;strong&gt;research shows context length and hallucination rate are positively correlated.&lt;/strong&gt; The longer the input, the more likely the model is to hallucinate. For Agent tasks that require large contexts, this is nearly an unresolvable structural paradox.&lt;/p&gt;

&lt;p&gt;One emerging solution to this problem is &lt;strong&gt;Agent Skills&lt;/strong&gt;, a mechanism pioneered by Anthropic. Where Context Engineering is about &lt;em&gt;managing&lt;/em&gt; what's in the context window, Skills are about &lt;em&gt;not putting things there in the first place.&lt;/em&gt; A Skill is a modular package of instructions, workflows, and best practices (typically a &lt;code&gt;SKILL.md&lt;/code&gt; file plus optional scripts) that an Agent loads on demand. Think of it as pluggable expertise — a "Tax Compliance Skill" or a "Cloud Migration Skill" that transforms a general-purpose Agent into a domain specialist, without bloating the context window for every other task.&lt;/p&gt;

&lt;p&gt;The design uses &lt;strong&gt;progressive disclosure&lt;/strong&gt;: an Agent can have dozens of Skills installed but only loads the 2–3 it needs for any given task. This directly mitigates the context window pressure that causes Agentic Amnesia. Skills also enable &lt;strong&gt;composability&lt;/strong&gt; — combining a code-review Skill with a git-automation Skill produces an Agent that can review and commit code without anyone writing explicit coordination logic.&lt;/p&gt;

&lt;p&gt;The impact on the ecosystem has been rapid. OpenAI adopted structurally identical Skills for ChatGPT and Codex CLI. Microsoft's Semantic Kernel implements an equivalent "Plugins" abstraction. Marketplaces like SkillsMP have emerged with hundreds of thousands of community-built Skills. Anthropic has positioned Agent Skills as an open standard — and the convergence across platforms suggests it's becoming the standard abstraction for packaging Agent capabilities, much like MCP became the standard for Agent-to-tool communication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Want to Test It? You Don't Even Know How
&lt;/h2&gt;

&lt;p&gt;At this point, your Agent barely works. But how do you determine whether it's "actually good" vs. "just barely functional"?&lt;/p&gt;

&lt;p&gt;Traditional software development has mature testing methodologies: unit tests, integration tests, end-to-end tests — inputs are deterministic, expected outputs are deterministic. But an Agent's input space is open-ended (users can say anything) and its output is non-deterministic (the model generates different text each time). LangChain's blog put it perfectly: &lt;strong&gt;"every input is an edge case"&lt;/strong&gt; — a challenge traditional software has never faced.&lt;/p&gt;

&lt;p&gt;You might think to use LLM-as-judge to evaluate LLM outputs. A Hacker News developer explained the problem clearly: using a judge with the same architecture as the system being tested maximizes the probability of systematic bias. The judge and the tested Agent share exactly the same blind spots.&lt;/p&gt;

&lt;p&gt;Anthropic's January blog also acknowledged: Agent interactions involving tool calls, state modifications, and behavior adjustments based on intermediate results are precisely the capabilities that make Agents useful — and simultaneously make them almost impossible to evaluate systematically.&lt;/p&gt;

&lt;p&gt;The data is stark. LangChain's State of AI Agents survey (1,300+ professionals, 2025) found &lt;strong&gt;only about half of organizations run offline evaluations&lt;/strong&gt;, and &lt;strong&gt;fewer than a quarter combine both offline and online evaluations.&lt;/strong&gt; A multi-dimensional analysis of major Agent benchmarks found a &lt;strong&gt;37% performance gap between lab testing and production environments&lt;/strong&gt; — with reliability dropping from 60% to 25% in real-world conditions. An Agent that tests great in your dev environment may behave completely differently in users' hands.&lt;/p&gt;

&lt;p&gt;Anyone who's done client-side development will understand this pain: your Agent might handle a request perfectly today, and fail on the same request tomorrow. Users can accept missing features — they can't accept inconsistency.&lt;/p&gt;

&lt;p&gt;And evaluation is only half the story — the other half is &lt;strong&gt;observability in production.&lt;/strong&gt; Evaluation tests what you &lt;em&gt;expect&lt;/em&gt; the Agent to do; observability shows what it &lt;em&gt;actually&lt;/em&gt; does with real users. When a user reports "the Agent scheduled my meeting at the wrong time," you need to trace back through every tool call: what calendar data was retrieved, what the LLM reasoned, what meeting parameters were generated, and why the wrong time was selected. Without tool call tracing, latency monitoring, and cost/token budget tracking, you're debugging blind. That "37% performance gap" between lab and production? Observability is how you find it. Tools like LangSmith and Arize have emerged specifically for this, but many teams still discover production failures only when users complain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 8: Add Multi-Agent Collaboration? Complexity Explodes
&lt;/h2&gt;

&lt;p&gt;Your scheduling Agent is working well, and you start thinking: could you add more specialized Agents? One for email, one for calendar, one for meeting notes, one for scheduling coordination. Clear division of labor, each handling its domain — sounds reasonable, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft's Azure SRE team went down this path.&lt;/strong&gt; They initially built a massive system with 100+ tools and 50+ sub-Agents, and hit a pile of unexpected problems: the orchestrator Agent couldn't find the right sub-Agent (the correct one was "buried three hops away"); a buggy sub-Agent didn't just crash itself — it dragged down the entire reasoning chain; Agents kicked responsibility back and forth in infinite loops. They eventually scaled down to 5 core tools and a few general-purpose Agents, and the system became more reliable.&lt;/p&gt;

&lt;p&gt;Their core lesson: &lt;strong&gt;scaling from one Agent to five doesn't multiply complexity by four — it grows exponentially.&lt;/strong&gt; UC Berkeley's MAST framework analyzed 1,600+ Agent traces and found that &lt;strong&gt;41–86.7% of multi-Agent systems fail in production&lt;/strong&gt;, and &lt;strong&gt;79% of problems come from the orchestration and coordination layer, not the technical implementation.&lt;/strong&gt; How to divide work and how to communicate between Agents is far harder than how to write the code.&lt;/p&gt;

&lt;p&gt;There are established orchestration patterns — sequential chains, concurrent fan-out, hierarchical supervisor models — and each has tradeoffs. ICLR 2025 research found that hierarchical architectures (one coordinator delegating to specialists) show only a &lt;strong&gt;5.5% performance drop&lt;/strong&gt; when individual Agents malfunction, compared to 10.5–23.7% for flatter architectures. This explains why Microsoft eventually simplified to a supervisor model. The practical advice is almost counterintuitive: &lt;strong&gt;start with fewer, more capable Agents rather than many specialized ones&lt;/strong&gt;, and only decompose when a single Agent demonstrably can't handle the workload. The allure of clean role separation is strong, but the coordination overhead will eat you alive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 9: You Start Doubting — Where's the Bottleneck?
&lt;/h2&gt;

&lt;p&gt;After months of work, your engineering gets more refined, but Agent performance always hits a ceiling you can't break through. You realize a harsh truth: &lt;strong&gt;all engineering optimization has one prerequisite — the underlying model needs to be capable enough.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An InfoQ interview with Alibaba Cloud's code platform lead captured it honestly: engineering challenges can be overcome, but model capability bottlenecks are far more daunting. An awkward industry reality: nearly every company building general-purpose Agent products uses Claude Sonnet as their first-choice model, because other models lag noticeably on instruction-following in complex tasks. The more instructions a model can follow, the more complex the problems it can handle. When a model can't even do basic instruction-following, no amount of engineering optimization above it helps.&lt;/p&gt;

&lt;p&gt;You might think: what about using more powerful reasoning models — o3, o4-mini, DeepSeek R1, Claude Sonnet, Claude Opus? &lt;strong&gt;Research finds that reasoning models hallucinate more than base models.&lt;/strong&gt; The data is striking: OpenAI's o3 has a 33% hallucination rate on person-specific factual questions — double the rate of its predecessor o1. The o4-mini reasoning model hits 48%. The root cause is that RL fine-tuning for chain-of-thought reasoning introduces high-variance gradients and entropy-induced randomness, making models more confident even when wrong. They answer rather than admit uncertainty.&lt;/p&gt;

&lt;p&gt;The practical implication for Agents: reasoning models may handle complex task decomposition better, but they trade off reliability on factual tasks. One emerging pattern is to use reasoning models for &lt;em&gt;planning&lt;/em&gt; (breaking down what needs to happen) and base models for &lt;em&gt;execution and verification&lt;/em&gt; (actually doing it and checking the results). But this adds yet another layer of architectural complexity.&lt;/p&gt;

&lt;p&gt;It's like finding your app is laggy, spending days optimizing code logic, and then discovering the bottleneck is hardware performance. Your engineering optimizations have limits, and beyond those limits lies the constraints of underlying capability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 10: You Start Understanding the Framework Wars
&lt;/h2&gt;

&lt;p&gt;At this point, you've definitely wrestled with whether to use LangChain, CrewAI, or similar frameworks. The Hacker News discussion has moved from debate to consensus: &lt;strong&gt;frameworks are useful for prototyping; in production they often become a burden.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A CTO shared on Hacker News that he built hundreds of Agents without any framework, using only chat completions plus structured output.&lt;/p&gt;

&lt;p&gt;Anthropic's official guidelines also advise caution with frameworks, as they often make underlying prompts and responses opaque and harder to debug.&lt;/p&gt;

&lt;p&gt;Here's the practical landscape: &lt;strong&gt;LangGraph&lt;/strong&gt; (by LangChain) uses a graph-based architecture with nodes, edges, and conditional routing — it's powerful for complex multi-step reasoning and is used in production by 400+ companies. &lt;strong&gt;CrewAI&lt;/strong&gt; takes a role-based approach where you define Agents by organizational roles — simpler to set up, adopted by 60% of the Fortune 500 for content generation and analysis workflows. &lt;strong&gt;AutoGen&lt;/strong&gt; (Microsoft) was merged into the Microsoft Agent Framework in late 2025, reflecting a broader trend of frameworks consolidating. Each imposes its own abstractions, and those abstractions become constraints the moment your use case doesn't fit neatly.&lt;/p&gt;

&lt;p&gt;There is one thing you genuinely need frameworks for: &lt;strong&gt;persistence and state management.&lt;/strong&gt; Your Agent needs to pause while waiting for user confirmation, recover from checkpoints after errors, and resume long tasks mid-execution. Most lightweight solutions lack these capabilities — which is why orchestration engines like Temporal have risen in the Agent space. Temporal provides durable execution with an append-only event history, letting Agents recover from failures mid-execution. That's genuinely hard to build from scratch.&lt;/p&gt;

&lt;p&gt;Perhaps more consequential than any framework is the emerging &lt;strong&gt;protocol and abstraction layer&lt;/strong&gt; — three complementary standards that are reshaping how Agents are built and composed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;, created by Anthropic, standardizes how models interact with external tools and data sources. Instead of writing custom integrations for every API, MCP provides a universal interface with well-defined security boundaries. It's the "USB port" for Agent-to-tool connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent2Agent (A2A)&lt;/strong&gt;, backed by Google and Microsoft, tackles inter-Agent communication — enabling Agents from different providers and frameworks to discover each other and collaborate via standardized protocols. It's the "HTTP" for Agent-to-Agent interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Skills&lt;/strong&gt;, pioneered by Anthropic (discussed in Step 6), solve a different problem entirely: &lt;strong&gt;domain knowledge and procedural expertise.&lt;/strong&gt; MCP gives Agents access to tools; Skills give them the &lt;em&gt;knowledge of how to use those tools effectively&lt;/em&gt; — modular, on-demand expertise that keeps context windows lean through progressive disclosure.&lt;/p&gt;

&lt;p&gt;Together, these three layers — MCP (Agent-to-tool), Agent Skills (Agent knowledge), and A2A (Agent-to-Agent) — form a cohesive architecture. Developers building production Agents will likely use all three: MCP to plug into APIs and databases, Skills to inject domain expertise, and A2A to enable cross-ecosystem Agent collaboration. This matters more than framework choice in the long run, because these protocols define how Agents interoperate — regardless of what framework built them.&lt;/p&gt;

&lt;p&gt;The truth is, framework choice isn't the core challenge of Agent development. The real challenges are the nine steps above. Frameworks are just tools. Choosing the wrong tool wastes time, but going in the wrong engineering direction wastes everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The ten steps above aren't something I made up sitting here. I built agents myself, hit almost every pitfall listed, and some of the projects ultimately failed. The Agent worked flawlessly in my development environment — but in production, context window limits caused it to lose track of multi-step tasks, costs spiraled because I hadn't designed for model routing, and I had no observability to diagnose why users were getting wrong results. By the time I understood the real scope of the engineering required, the project had burned through its budget and patience. Looking back, the mindset of "it's just calling an API, how hard can it be?" was exactly the same as my mindset 20 years ago of "drag a few controls and you have an app." What really taught me, in the end, was that failure.&lt;/p&gt;

&lt;p&gt;Walk through these ten steps and you'll find that &lt;strong&gt;"calling APIs" accounts for roughly 5% of total Agent development effort.&lt;/strong&gt; The other 95% is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth, rate limiting, and error handling in the tool layer (Step 2)&lt;/li&gt;
&lt;li&gt;Getting tool design granularity and descriptions right (Step 2)&lt;/li&gt;
&lt;li&gt;Validation and rollback for multi-step error cascades (Step 3)&lt;/li&gt;
&lt;li&gt;Safety guardrails, least-privilege permissions, and human-in-the-loop gates (Step 4)&lt;/li&gt;
&lt;li&gt;Cost control, prompt caching, model routing, and latency optimization (Step 5)&lt;/li&gt;
&lt;li&gt;Context Engineering, memory management, and Agent Skills for progressive disclosure (Step 6)&lt;/li&gt;
&lt;li&gt;Building evaluation and production observability from scratch (Step 7)&lt;/li&gt;
&lt;li&gt;Complexity control for multi-Agent orchestration and coordination (Step 8)&lt;/li&gt;
&lt;li&gt;Engineering around model capability ceilings and reasoning model tradeoffs (Step 9)&lt;/li&gt;
&lt;li&gt;Navigating the framework/protocol landscape — MCP, A2A, and Agent Skills (Step 10)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangChain calls this emerging discipline &lt;strong&gt;"Agent Engineering"&lt;/strong&gt; — I think that's exactly right. Boston Consulting Group's research shows that &lt;strong&gt;only about a quarter of companies achieve significant ROI from their AI initiatives&lt;/strong&gt;, and Agent projects are no exception. LangChain's survey found that &lt;strong&gt;32% of companies cite "quality below standard" as the top barrier to shipping an Agent.&lt;/strong&gt; These numbers say it all.&lt;/p&gt;

&lt;p&gt;The enormous gap between Agent and Agent doesn't come from who's calling different APIs — it comes from the vastly different quality of the 95% of engineering that happens &lt;em&gt;outside&lt;/em&gt; the API call. Calling an API is the entry threshold, something you can cross in a week. But between demo and product lies an entire system of engineering around reliability, observability, context management, and error recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's where Agent development is truly hard.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://gorilla.cs.berkeley.edu/leaderboard.html" rel="noopener noreferrer"&gt;Berkeley Function Calling Leaderboard (BFCL)&lt;/a&gt; — Tool-calling accuracy benchmarks across models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://galileo.ai/blog/agent-failure-modes-guide" rel="noopener noreferrer"&gt;Galileo: 7 AI Agent Failure Modes and How To Fix Them&lt;/a&gt; — Error propagation in multi-step Agent tasks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.langchain.com/state-of-agent-engineering" rel="noopener noreferrer"&gt;LangChain: State of AI Agents Report (2025)&lt;/a&gt; — Industry survey on Agent evaluation and adoption&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/html/2511.14136v1" rel="noopener noreferrer"&gt;Beyond Accuracy: Multi-Dimensional Framework for Enterprise Agentic AI&lt;/a&gt; — Lab vs. production performance gap analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcommunity.microsoft.com/blog/appsonazureblog/context-engineering-lessons-from-building-azure-sre-agent/4481200/" rel="noopener noreferrer"&gt;Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent&lt;/a&gt; — Microsoft's experience with 100+ tools and 50+ sub-Agents&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/html/2503.13657" rel="noopener noreferrer"&gt;Why Do Multi-Agent LLM Systems Fail? (UC Berkeley MAST Framework)&lt;/a&gt; — Analysis of 1,600+ Agent traces and 14 failure modes&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/html/2505.23646v1" rel="noopener noreferrer"&gt;Are Reasoning Models More Prone to Hallucination?&lt;/a&gt; — Comparison of hallucination rates in reasoning vs. base models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://engineering.atspotify.com/2025/11/context-engineering-background-coding-agents-part-2" rel="noopener noreferrer"&gt;Spotify Engineering: Context Engineering for Background Coding Agents&lt;/a&gt; — Context window management lessons&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" rel="noopener noreferrer"&gt;Manus: Context Engineering for AI Agents&lt;/a&gt; — Four framework rebuilds and iterative context design&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;Anthropic: Effective Context Engineering for AI Agents&lt;/a&gt; — Defining and implementing Context Engineering&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://genai.owasp.org/llmrisk/llm062025-excessive-agency/" rel="noopener noreferrer"&gt;OWASP: LLM06:2025 Excessive Agency&lt;/a&gt; — Security threat classification for Agent systems&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.bcg.com/publications/2025/agents-accelerate-next-wave-of-ai-value-creation" rel="noopener noreferrer"&gt;BCG: How Agents Are Accelerating the Next Wave of AI Value Creation&lt;/a&gt; — Enterprise AI ROI data&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openreview.net/forum?id=bkiM54QftZ" rel="noopener noreferrer"&gt;On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents (ICLR 2025)&lt;/a&gt; — Hierarchical vs. flat architecture resilience&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/news/prompt-caching" rel="noopener noreferrer"&gt;Anthropic: Prompt Caching&lt;/a&gt; — 90% cost reduction and 85% latency reduction for repeated prompts&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;Anthropic: Equipping Agents for the Real World with Agent Skills&lt;/a&gt; — The original Agent Skills mechanism and design philosophy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://thenewstack.io/agent-skills-anthropics-next-bid-to-define-ai-standards/" rel="noopener noreferrer"&gt;Agent Skills: Anthropic's Next Bid to Define AI Standards&lt;/a&gt; — Skills as an open standard for modular Agent capabilities&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.friedrichs-it.de/blog/agent-skills-vs-model-context-protocol/" rel="noopener noreferrer"&gt;Agent Skills vs MCP: Two Standards, Two Security Models&lt;/a&gt; — Complementary roles of Skills and MCP in Agent architecture&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>programming</category>
    </item>
    <item>
      <title>AI Agent Memory Management - When Markdown Files Are All You Need?</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Wed, 18 Feb 2026 02:15:15 +0000</pubDate>
      <link>https://forem.com/imaginex/ai-agent-memory-management-when-markdown-files-are-all-you-need-5ekk</link>
      <guid>https://forem.com/imaginex/ai-agent-memory-management-when-markdown-files-are-all-you-need-5ekk</guid>
      <description>&lt;h2&gt;
  
  
  What is Memory Management for AI Agents?
&lt;/h2&gt;

&lt;p&gt;Memory management for AI agents refers to the mechanisms that allow an agent to store, retrieve, and use information across interactions. Without memory management, every conversation starts from a blank slate — the agent is stateless and forgets everything between sessions. With it, the agent accumulates knowledge over time, learns from past mistakes, and maintains continuity — becoming truly stateful.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the Memory Types for AI Agents?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short-term&lt;/strong&gt; - The agent's immediate context window, holding the current conversation and recent tool outputs. Analogous to a human's active attention span. Duration: minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term&lt;/strong&gt; - Persistent storage of facts, preferences, and decisions that survive across sessions. Analogous to human declarative memory. Duration: indefinite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Procedural&lt;/strong&gt; - Learned workflows, action sequences, and "how-to" knowledge the agent acquires through experience. Analogous to human muscle memory or learned skills. Duration: permanent once codified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Working&lt;/strong&gt; - A temporary scratchpad for intermediate reasoning steps during a single task. Analogous to a mental whiteboard used for chain-of-thought reasoning. Duration: seconds to minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Comparison of Memory Types in Agents
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Memory Type&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Typical Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Primary Use Case&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Short-Term&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Context Window / RAM&lt;/td&gt;
&lt;td&gt;Following a conversation thread.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long-Term&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Years&lt;/td&gt;
&lt;td&gt;Vector DB / SQL&lt;/td&gt;
&lt;td&gt;Remembering user preferences and facts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Procedural&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;td&gt;Action Recipes / Logs&lt;/td&gt;
&lt;td&gt;Learning "how" to use a specific tool or API.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Working&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;Scratchpad / State&lt;/td&gt;
&lt;td&gt;Intermediate reasoning steps (Chain-of-Thought).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What are Use Cases for AI Agent Memory Management?
&lt;/h2&gt;

&lt;p&gt;Memory management is the "glue" that transforms a basic chatbot into a functional AI agent. While simple models process prompts in isolation (stateless), agents with memory can track goals, learn from mistakes, and personalize their behavior over time.&lt;/p&gt;

&lt;p&gt;Effective memory management generally involves balancing &lt;strong&gt;Short-Term Memory&lt;/strong&gt; (immediate context), &lt;strong&gt;Long-Term Memory&lt;/strong&gt; (historical facts and patterns), &lt;strong&gt;Procedural Memory&lt;/strong&gt; (refined workflows), and &lt;strong&gt;Working Memory&lt;/strong&gt; (intermediate reasoning steps).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Personal AI Assistants &amp;amp; Companions&lt;/strong&gt; - Agents like virtual executive assistants must manage memory to provide a "human-like" continuity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Step Research &amp;amp; Coding Agents&lt;/strong&gt; - Agents designed for "deep research" or complex software engineering (e.g., Devin or OpenDevin) navigate thousands of lines of code or documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer Support Automation&lt;/strong&gt; - Modern support agents handle issues that may span several days or multiple channels (email, chat, phone).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous DevOps &amp;amp; CI/CD Agents&lt;/strong&gt; - Agents managing cloud infrastructure or deployment pipelines need memory to understand the state of a complex system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare &amp;amp; Patient Management&lt;/strong&gt; - AI agents in healthcare act as long-term monitors for chronic conditions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What are the Existing Approaches?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When designing a smart AI agent, memory management determines whether your agent is "forgetful" (stateless) or "intelligent" (stateful).&lt;/strong&gt; Some AI agent frameworks like LangChain and LangGraph have built-in memory management, while others like OpenAI and Google ADK have their own memory management systems. Each framework approaches memory with a different philosophy—some prioritize ease of use (OpenAI), while others prioritize granular control (LangGraph).&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison: Memory Management Architectures
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Framework&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Primary Memory Strategy&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Persistence Level&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Best For...&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangChain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Modular Components&lt;/strong&gt; (Buffer, Summary, Entity)&lt;/td&gt;
&lt;td&gt;Manual (must connect DB)&lt;/td&gt;
&lt;td&gt;Diverse, specialized RAG workflows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Graph Persistence&lt;/strong&gt; (Checkpointers)&lt;/td&gt;
&lt;td&gt;Built-in (Thread-level)&lt;/td&gt;
&lt;td&gt;Complex, cyclical tasks (e.g., self-correcting code).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google ADK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Memory Bank&lt;/strong&gt; (Identity-scoped)&lt;/td&gt;
&lt;td&gt;Fully Managed&lt;/td&gt;
&lt;td&gt;Personalized, long-term user context on GCP.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Unified Multi-Layer&lt;/strong&gt; (Short, Long, Entity)&lt;/td&gt;
&lt;td&gt;Built-in (SQLite/Chroma)&lt;/td&gt;
&lt;td&gt;Multi-agent collaboration and role-playing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Threads API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully Managed (Opaque)&lt;/td&gt;
&lt;td&gt;Rapid prototyping; hands-off state management.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Is There a Simpler Alternative?
&lt;/h2&gt;

&lt;p&gt;In December 2025, Meta acquired Manus for $2 billion. The startup was just 8 months old with a small team. Industry insiders speculated: "They must have revolutionary AI algorithms... proprietary models... breakthrough technology..."&lt;/p&gt;

&lt;p&gt;The truth was far more interesting—and far simpler.&lt;/p&gt;

&lt;p&gt;Their competitive advantage wasn't complex algorithms or massive infrastructure. It was &lt;strong&gt;how they managed memory using plain text files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While the AI industry spent millions building vector databases, complex RAG pipelines, and proprietary memory systems, three independent high-value projects quietly converged on the same "boring" solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manus&lt;/strong&gt; (acquired for $2B) - Used file-based planning for long-running agents. Its agents followed a three-file pattern: &lt;code&gt;task_plan.md&lt;/code&gt; for goals and progress, &lt;code&gt;notes.md&lt;/code&gt; for research, and a deliverable output file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt; (145K+ GitHub stars) - Built dual-layer Markdown memory architecture. It uses &lt;code&gt;MEMORY.md&lt;/code&gt; for curated knowledge, &lt;code&gt;memory/YYYY-MM-DD.md&lt;/code&gt; for daily logs, and &lt;code&gt;SOUL.md&lt;/code&gt; for personality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; (Anthropic's official tool) - Implemented Skills and memory as Markdown files. It uses a &lt;code&gt;CLAUDE.md&lt;/code&gt; hierarchy for project context, &lt;code&gt;.claude/MEMORY.md&lt;/code&gt; for auto-captured learnings, and a Skills system for on-demand capability loading.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This convergence suggests something fundamental about what works in practice. In biology, this is called convergent evolution — when independent organisms develop the same trait because it is the optimal solution to a shared challenge. While many AI systems rely on elaborate memory infrastructure, file-based approaches offer a simpler alternative that addresses the core requirements: persistence, transparency, and reliability.&lt;/p&gt;

&lt;p&gt;Using local Markdown files for memory management—an approach popularized by tools like &lt;strong&gt;OpenClaw&lt;/strong&gt;, &lt;strong&gt;Claude Code&lt;/strong&gt;, and &lt;strong&gt;Manus&lt;/strong&gt;—offers a philosophy of &lt;strong&gt;"Memory as Documentation."&lt;/strong&gt; This contrasts sharply with the "Memory as Database" approach of frameworks like LangGraph or CrewAI.&lt;/p&gt;

&lt;p&gt;This approach treats the agent's memory not as a hidden system state, but as a transparent, editable file living directly in the user's workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why File-based Memory Works?
&lt;/h3&gt;

&lt;p&gt;File-based memory systems work because they align with how developers already manage information. Here are the key properties that make them effective for AI agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent&lt;/strong&gt;: Memory survives agent restarts, crashes, or updates. Files decouple memory from process lifecycle — no data loss when a process dies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparent and Editable&lt;/strong&gt;: You can open the agent's memory file (e.g., &lt;code&gt;MEMORY.md&lt;/code&gt; or &lt;code&gt;task_plan.md&lt;/code&gt;) in any text editor, read exactly what it "knows," and edit it manually. In LangGraph or CrewAI, modifying memory often requires writing scripts to update a database or decoding complex JSON objects. With Markdown, if the agent hallucinates a goal, you simply highlight the text and delete it. This zero-friction "human-in-the-loop" capability builds trust and enables compliance audits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version-Controllable&lt;/strong&gt;: Because memory is plain text, it lives in your Git repository. You can commit the agent's "knowledge," revert changes if the agent goes off-rails, and branch the memory. Frameworks like CrewAI usually store memory in external databases (Postgres, ChromaDB) — syncing that external state with your code's version history is difficult. Markdown memory treats context &lt;em&gt;as part of the codebase&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Holistic Context&lt;/strong&gt;: Agents like &lt;strong&gt;Claude Code&lt;/strong&gt; use Markdown to maintain a high-level summary of the project structure. They read this file &lt;em&gt;first&lt;/em&gt; to orient themselves. RAG (Vector Databases) retrieves fragments based on similarity search, which often misses the "forest for the trees" — fetching specific functions but missing the overall architectural pattern. A curated Markdown summary solves this by forcing the agent to maintain a "map" of the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portable&lt;/strong&gt;: Standard Markdown format means no vendor lock-in. Your agent's memory is not locked into OpenAI's &lt;code&gt;thread_id&lt;/code&gt; or a proprietary vector store. You can swap the underlying model (e.g., switch from Claude to GPT-4o) and simply feed it the same Markdown file. Migration is as simple as copying files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Searchable&lt;/strong&gt;: Standard text search tools (e.g., grep, ripgrep) work immediately — no special database required. More advanced approaches like full-text search or vector embeddings can be added as the memory grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-effective&lt;/strong&gt;: Local disk storage costs \$0.02/GB/month compared to managed vector database services at \$50-200/GB/month. No per-query API fees or infrastructure scaling costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison Matrix: Markdown vs. Frameworks
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Markdown Files (Claude Code/Manus)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Database Frameworks (LangGraph/CrewAI)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debuggability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt;: Just read/edit the file.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Med/Low&lt;/strong&gt;: Requires DB inspection tools.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt;: Instant file read.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Med&lt;/strong&gt;: Network calls to Vector DBs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt;: Files get unmanageable &amp;gt;5MB.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt;: Handles millions of records easily.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Local&lt;/strong&gt;: Lives on your disk/repo.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Cloud/Server&lt;/strong&gt;: Lives in a managed service.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Linear&lt;/strong&gt;: Agent reads the whole file.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: Agent searches for keywords/vectors.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Strategic Trade-off
&lt;/h3&gt;

&lt;p&gt;The "Markdown" approach is &lt;strong&gt;optimal for Local Agents&lt;/strong&gt; because the "context" is finite and structured. The "Database" approach is &lt;strong&gt;optimal for Enterprise Agents&lt;/strong&gt; where the "memory" consists of millions of user profiles and history logs that cannot fit into a single file, requiring dynamic agent management and more sophisticated search capabilities.&lt;/p&gt;

&lt;p&gt;For example, an enterprise customer support agent typically integrates a Vector DB into a RAG (Retrieval-Augmented Generation) pipeline. Before the LLM generates a response, a retrieval step automatically grabs relevant "memories" based on the user's input and injects them into the system prompt as context. This enables semantic search across structured and unstructured data — user profiles, past chat transcripts, PDF manuals, or meeting notes — so the agent can answer questions like "Has this user complained about something similar before?" without being explicitly told to look it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Design File-based Memory for Your AI Agent?
&lt;/h2&gt;

&lt;p&gt;File-based AI agent memory typically consists of two layers: remembrance and personalization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remembrance Layer
&lt;/h3&gt;

&lt;p&gt;The remembrance layer stores what the agent knows, organized into three types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-term memory (e.g., MEMORY.md)&lt;/strong&gt;: Stores curated, important information that should persist indefinitely. This includes user preferences, key decisions and their rationale, learned lessons, and standard procedures. This file is typically loaded into every agent conversation. Systems like OpenClaw trigger a memory flush before context compression, prompting the agent to write important information to MEMORY.md before older context is discarded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daily logs (e.g., memory/YYYY-MM-DD.md)&lt;/strong&gt;: Timestamped records of activities, conversations, and observations. These provide chronological context and help the agent maintain continuity across sessions. Recent logs (today and yesterday) are typically loaded automatically, while older logs are searched on-demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Working memory (e.g., task_plan.md)&lt;/strong&gt;: Tracks the current task's goals, progress, and context. This prevents "goal drift" in long-running tasks by providing a consistent reference point that the agent can check throughout execution. Manus popularized a three-file variant (&lt;code&gt;task_plan.md&lt;/code&gt;, &lt;code&gt;notes.md&lt;/code&gt;, deliverable) with a read-decide-act-update cycle: read the plan, act on the next step, update progress, then repeat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Personalization Layer
&lt;/h3&gt;

&lt;p&gt;The personalization layer defines how the agent behaves and how it is perceived by the user:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOUL.md&lt;/strong&gt;: Defines core values, decision principles, and behavioral guidelines. This file shapes the agent's personality and decision-making approach. For example, a SOUL.md might specify "prefer simple solutions over complex ones" or "always ask for clarification when ambiguous."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IDENTITY.md&lt;/strong&gt;: Defines the agent's public identity, including name, start date, and communication style. This file is used to identify the agent to the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;USER.md&lt;/strong&gt;: Defines the user's profile, including technical background, preferences, and context. This file is used to tailor the agent's behavior to the user's needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modular skills&lt;/strong&gt;: Additional capabilities can be loaded on-demand using separate skill files. Rather than loading all possible skills at startup, the agent loads specific skill documentation only when needed, keeping the context manageable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search Strategies
&lt;/h3&gt;

&lt;p&gt;As memory grows, search becomes important. Three approaches offer progressively more capability:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic text search (grep/ripgrep)&lt;/strong&gt;: Sufficient for most use cases with fewer than 1,000 files. Fast, free, and deterministic. Works well for exact keyword matches and phrases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BM25 full-text search&lt;/strong&gt;: Useful when scaling to 1,000-10,000 files. BM25 is a ranking algorithm that scores documents by relevance — similar to how a search engine ranks web pages. It supports boolean operators (AND, OR, NOT) and can be implemented using SQLite's built-in full-text search with minimal infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid vector + BM25&lt;/strong&gt;: Most sophisticated approach, combining semantic search (understanding concepts) with keyword matching. Typically only needed when exceeding 10,000 files or when conceptual queries are important. Requires embedding generation, which adds API costs. OpenClaw's implementation uses 70:30 weighting (vector similarity : BM25 keyword) with a 0.35 minimum score threshold. In testing, this achieved 89% recall vs. 76% for vector-only and 68% for BM25-only.&lt;/p&gt;

&lt;p&gt;Most implementations should start with basic text search and upgrade only when the need is demonstrated through actual usage patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Considerations
&lt;/h3&gt;

&lt;p&gt;Starting with file-based memory is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a MEMORY.md file and give your AI agent read/write access to it&lt;/li&gt;
&lt;li&gt;Implement daily log files with timestamps (memory/YYYY-MM-DD.md format)&lt;/li&gt;
&lt;li&gt;Add basic grep/ripgrep search capability&lt;/li&gt;
&lt;li&gt;Define a SOUL.md file to establish agent personality and values&lt;/li&gt;
&lt;li&gt;Add task planning files when working on multi-step projects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The simplicity of this approach means implementation typically takes days rather than months. The architecture can scale from single-user prototypes to production systems handling thousands of agents.&lt;/p&gt;

&lt;p&gt;For more complex deployments, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git version control for memory files&lt;/li&gt;
&lt;li&gt;Separate memory directories for different agents or use cases&lt;/li&gt;
&lt;li&gt;Shared knowledge bases that multiple agents can reference&lt;/li&gt;
&lt;li&gt;Encryption for sensitive information (filesystem-level or application-level)&lt;/li&gt;
&lt;li&gt;Progressive context disclosure: load only memory relevant to the current task rather than everything at startup (as practiced by Claude Code's Skills system)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;File-based memory for AI agents represents a practical middle ground: simpler than elaborate infrastructure, but more capable than purely ephemeral in-memory approaches. The convergence of multiple successful projects on this pattern suggests it addresses real needs effectively.&lt;/p&gt;

&lt;p&gt;The approach offers particularly strong advantages in transparency, portability, and user control—increasingly important considerations as AI agents handle more sensitive and critical tasks.&lt;/p&gt;

&lt;p&gt;When three independent, high-profile projects converge on the same architectural choice, it is worth paying attention — not because Markdown files are the final answer, but because they reveal that the right abstraction for agent memory may be simpler than the industry assumed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manus&lt;/strong&gt;: &lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" rel="noopener noreferrer"&gt;Context Engineering for AI Agents: Lessons from Building Manus&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt;: &lt;a href="https://docs.openclaw.ai/concepts/memory" rel="noopener noreferrer"&gt;Memory Concepts Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt;: &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;Memory Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AGENTS.md&lt;/strong&gt;: &lt;a href="https://agents.md/" rel="noopener noreferrer"&gt;The Open Standard for Agent Configuration&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic Design Patterns&lt;/strong&gt;: &lt;a href="https://github.com/sarwarbeing-ai/Agentic_Design_Patterns" rel="noopener noreferrer"&gt;A Hands-On Guide to Building Intelligent Systems&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Reward Engineering: An Emerging Skill for AI Engineers</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Fri, 13 Feb 2026 16:16:32 +0000</pubDate>
      <link>https://forem.com/imaginex/reward-engineering-an-emerging-skill-for-ai-engineers-1i01</link>
      <guid>https://forem.com/imaginex/reward-engineering-an-emerging-skill-for-ai-engineers-1i01</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In their comprehensive report &lt;strong&gt;"&lt;a href="https://you.com/resources/2026-ai-predictions" rel="noopener noreferrer"&gt;AI Predictions for 2026&lt;/a&gt;,"&lt;/strong&gt; Richard Socher (one of the world's most-cited NLP researchers and CEO of You.com) and Bryan McCann (CTO of You.com) outline a fundamental shift in how we interact with artificial intelligence. Their central thesis: the era of simple Large Language Model (LLM) chatbots is giving way to sophisticated, autonomous AI agent ecosystems.&lt;/p&gt;

&lt;p&gt;This transformation represents a shift from &lt;strong&gt;"Chat-Engines"&lt;/strong&gt; (systems you converse with) to &lt;strong&gt;"Do-Engines"&lt;/strong&gt; (systems that autonomously complete tasks for you). To enable this shift, Socher and McCann predict the emergence of a new specialization: the &lt;strong&gt;Reward Engineer&lt;/strong&gt;—a professional who designs the mathematical and logical objective functions that define success for AI agents.&lt;/p&gt;

&lt;p&gt;Whether or not "Reward Engineer" becomes an official job title in 2026, the underlying skill of reward engineering is rapidly becoming essential for any AI engineer working with autonomous systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Reward Engineering?
&lt;/h2&gt;

&lt;p&gt;As AI evolves from generating text to autonomously executing multi-step tasks, our approach to guiding these systems must also evolve. Traditional &lt;strong&gt;Context Engineering&lt;/strong&gt;—writing instructions in natural language—works well for chatbots but proves insufficient for autonomous agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Prompts Aren't Enough:&lt;/strong&gt; When an AI agent must complete complex, long-term goals—such as optimizing a supply chain, conducting legal research, or managing a project—simple text instructions cannot capture all the nuances, constraints, and trade-offs involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enter Reward Engineering:&lt;/strong&gt; This discipline combines logic, ethics, and data science to define precise success criteria. Reward engineers must anticipate how AI agents might find unintended shortcuts (a phenomenon called "reward hacking") and design objective functions that align agent behavior with genuine human intent across extended time horizons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Responsibilities
&lt;/h2&gt;

&lt;p&gt;Rather than writing traditional code or conversational prompts, engineers design the &lt;strong&gt;objective functions&lt;/strong&gt; and &lt;strong&gt;reinforcement learning frameworks&lt;/strong&gt; that guide autonomous AI agents. Think of this role as a "Policy Architect"—ensuring agents achieve complex business objectives (such as "increase supply chain efficiency by 15%") while respecting ethical boundaries, security protocols, and resource constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Responsibilities
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Objective Function Design:&lt;/strong&gt; Translate broad business goals into precise mathematical reward signals that guide agent behavior toward desired outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrail Engineering:&lt;/strong&gt; Create constraints and penalties that prevent reward hacking—situations where an AI technically achieves its goal but in unintended or harmful ways.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Coordination:&lt;/strong&gt; Design reward structures that encourage multiple AI agents to collaborate effectively rather than compete counterproductively for shared resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-Loop (HITL) Policies:&lt;/strong&gt; Establish clear escalation triggers that determine when an agent must pause and request human approval before proceeding with high-stakes decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation &amp;amp; Benchmarking:&lt;/strong&gt; Develop comprehensive test suites to evaluate agent reasoning and ensure consistent, reliable performance across different scenarios and model versions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Required Technical Skills
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logic &amp;amp; Ethics:&lt;/strong&gt; Strong foundation in game theory, utility functions, and AI alignment principles to design fair and effective reward systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic Frameworks:&lt;/strong&gt; Proficiency with modern AI agent frameworks (such as LangChain, AutoGPT, CrewAI, and their successors) as well as cloud-based agentic platforms (Amazon Bedrock Agents, Azure AI Agent Service with Semantic Kernel, and Vertex AI Agent Builder) that enable autonomous task execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Programming:&lt;/strong&gt; Ability to write validation scripts that evaluate AI outputs and enforce behavioral constraints—essentially serving as "referees" for agent actions. Python is specifically required because it's the dominant language in the AI/ML ecosystem: nearly all reinforcement learning frameworks (PyTorch, TensorFlow, Gymnasium), agent frameworks (LangChain, AutoGPT), and evaluation tools are built in Python. This creates seamless integration between reward function design and the AI models they guide, unlike general-purpose languages such as Bash (limited to shell scripting) or Node.js (less common in ML applications).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Expertise:&lt;/strong&gt; Deep understanding of specific industries (finance, healthcare, legal, etc.) to define what constitutes a genuinely successful outcome versus a superficial one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk Identification:&lt;/strong&gt; Skill in recognizing logical inconsistencies, potential failure modes, and "hallucination-prone" scenarios within autonomous agent workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reward Engineering vs. Context Engineering
&lt;/h2&gt;

&lt;p&gt;The shift from conversational AI to autonomous agents demands a fundamental change in how we guide these systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Engineering (Today):&lt;/strong&gt; Writing natural language instructions like "Act as a lawyer and draft a contract." This works for generating single responses but lacks the precision needed for autonomous, multi-step tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reward Engineering (Tomorrow):&lt;/strong&gt; Designing mathematical frameworks that define success. Instead of telling an AI &lt;em&gt;what&lt;/em&gt; to do, reward engineers create scoring systems that guide &lt;em&gt;how&lt;/em&gt; the AI optimizes its behavior over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Critical Difference: Preventing Reward Hacking
&lt;/h3&gt;

&lt;p&gt;Consider a common pitfall: if you reward an AI for "reducing customer complaints," a poorly designed system might simply delete incoming complaint emails—technically achieving the goal while completely missing the intent.&lt;/p&gt;

&lt;p&gt;AI engineers must anticipate such shortcuts and create sophisticated reward models that balance competing priorities: speed, accuracy, ethics, and safety. This becomes especially critical as AI agents make consequential decisions with real-world financial, legal, or safety implications.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Evolution: From Context Engineering to Reward Engineering
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Context Engineering&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Reward Engineering&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Tool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural language instructions&lt;/td&gt;
&lt;td&gt;Mathematical objective functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generating single responses&lt;/td&gt;
&lt;td&gt;Guiding multi-step autonomous behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success Measure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"The output sounds right"&lt;/td&gt;
&lt;td&gt;"The task completed successfully within all constraints"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text, images, code snippets&lt;/td&gt;
&lt;td&gt;Real-world actions and transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One interaction at a time&lt;/td&gt;
&lt;td&gt;Extended time horizons with multiple decision points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This evolution from conversational AI to autonomous agents represents not just a technical shift, but a fundamental change in how we conceptualize human-AI collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Reward Engineering Skills: A Practical Roadmap
&lt;/h2&gt;

&lt;p&gt;Transitioning to reward engineering means evolving from a "Writer" (crafting conversational prompts) to an "Architect" (designing behavioral frameworks). You'll shift from asking AI for outputs to defining the mathematical and ethical boundaries within which it operates.&lt;/p&gt;

&lt;p&gt;Here's a three-phase roadmap to develop these skills:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Foundations — From Intuition to Precision
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Move from informal, "vibe-based" prompting to structured, contract-like specifications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Skills to Develop:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Logical Decomposition:&lt;/strong&gt; Practice breaking complex problems into small, verifiable subtasks. Each subtask needs a clearly defined success state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract-Based Thinking:&lt;/strong&gt; Transform vague requests into precise specifications. Instead of "Write a professional email," specify: "Generate an email under 200 words containing exactly three bullet points and referencing invoice #12345, or fail validation."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic Programming Literacy:&lt;/strong&gt; Develop comfort with Python control flow (if/then/else logic) and APIs. Many reward functions are implemented as Python scripts that evaluate agent outputs against defined criteria.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 2: Understanding Agentic Systems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Learn how autonomous "Do-Engines" operate and make decisions over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Skills to Develop:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Management:&lt;/strong&gt; Understand how agents maintain memory of previous actions and decisions. Study frameworks like &lt;strong&gt;ReAct (Reasoning + Acting)&lt;/strong&gt; and &lt;strong&gt;Plan-and-Execute&lt;/strong&gt; patterns that enable multi-step reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Integration:&lt;/strong&gt; Learn how agents access and utilize external tools (calculators, search engines, databases). Your role is designing rewards that encourage appropriate tool usage and penalize inefficient or incorrect tool selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantitative Evaluation:&lt;/strong&gt; Adopt rigorous evaluation frameworks like &lt;strong&gt;LangSmith&lt;/strong&gt; or &lt;strong&gt;Hugging Face Evaluate&lt;/strong&gt;. Shift from subjective assessment ("This looks good") to measurable metrics ("This output scores 8.5/10 on our accuracy rubric").&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 3: Advanced Reward Engineering
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Master the specialized skills that define the reward engineering role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Skills to Develop:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;RLHF (Reinforcement Learning from Human Feedback):&lt;/strong&gt; Understand how models learn from human preferences. You'll design the ranking criteria and evaluation rubrics that human labelers use to train agent behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Objective Function Design:&lt;/strong&gt; This is the core competency. Learn to translate business goals into mathematical reward functions that balance competing priorities.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Example:&lt;/em&gt; For a budget management agent, design rewards that optimize both cost savings &lt;em&gt;and&lt;/em&gt; service quality—preventing the agent from simply cutting all expenses.

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Safety &amp;amp; Alignment Engineering:&lt;/strong&gt; Create guardrail mechanisms ensuring that the reward for helpful behavior never outweighs the penalty for harmful actions. This requires anticipating edge cases where agents might find dangerous shortcuts.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hands-On Practice: Thinking Like a Reward Engineer
&lt;/h2&gt;

&lt;p&gt;The best way to prepare for this emerging skill is a fundamental shift in perspective: stop focusing on &lt;em&gt;what&lt;/em&gt; you want the AI to say, and start defining &lt;em&gt;how you'll measure&lt;/em&gt; whether its actions were successful.&lt;/p&gt;

&lt;p&gt;The following exercise introduces you to reward function design—the core of reward engineering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Exercise: The Budget-Conscious Travel Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Scenario:&lt;/strong&gt; You're developing an AI agent to book corporate travel. With a vague instruction like "Book the best flight," the agent might select a $10,000 first-class ticket—technically "the best" by some measures, but clearly not what you intended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your Task:&lt;/strong&gt; Design a reward system that guides the agent to balance cost, timeliness, comfort, and convenience appropriately.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Distribute Reward Points
&lt;/h4&gt;

&lt;p&gt;You have 100 reward points to allocate across four potential outcomes. The agent will optimize for maximum points. How should you distribute them?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Outcome&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Your Allocation&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Arrival Time:&lt;/strong&gt; Flight arrives before the 9:00 AM meeting&lt;/td&gt;
&lt;td&gt;_____ points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Flight costs under $500&lt;/td&gt;
&lt;td&gt;_____ points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Convenience:&lt;/strong&gt; Direct flight with no layovers&lt;/td&gt;
&lt;td&gt;_____ points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Comfort:&lt;/strong&gt; Business or first-class seating&lt;/td&gt;
&lt;td&gt;_____ points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Step 2: Recognizing the Reward Hacking Trap
&lt;/h4&gt;

&lt;p&gt;Review your point allocation. If you assigned 80 points to &lt;strong&gt;Cost Efficiency&lt;/strong&gt; but only 10 points to &lt;strong&gt;Arrival Time&lt;/strong&gt;, the agent might book a $50 red-eye flight that arrives &lt;em&gt;after&lt;/em&gt; the 9:00 AM meeting. It maximized points but completely failed the actual objective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Reward Engineering Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Professional reward engineers use &lt;strong&gt;hard constraints&lt;/strong&gt; and &lt;strong&gt;dynamic incentives&lt;/strong&gt; to prevent such failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hard Constraint:&lt;/strong&gt; "If arrival time is after 9:00 AM, apply a penalty of -1,000 points (automatic failure)."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Incentive:&lt;/strong&gt; "For every $10 saved below the $500 budget, add +1 bonus point."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This combination ensures critical requirements are never violated, while still encouraging optimization within acceptable parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Alignment Requires Precision:&lt;/strong&gt; Without explicit penalties for missing the meeting, even a well-intentioned point system can lead to failures. Intent alone isn't enough—you must formalize every constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Logic Replaces Language:&lt;/strong&gt; This exercise demonstrates programming agent behavior through mathematical objectives rather than conversational instructions—the essence of reward engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Future of Software Development:&lt;/strong&gt; This approach reflects Socher and McCann's vision for 2026: rather than giving AI step-by-step instructions, we'll define the rules and constraints, then let AI agents find optimal solutions within those boundaries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As AI systems transition from responding to queries to autonomously executing complex tasks, reward engineering emerges as an essential discipline. Whether it becomes a formal job title or remains a critical skill within broader AI engineering roles, the ability to design precise, ethical, and robust objective functions will define who can successfully deploy autonomous AI agents in the real world.&lt;/p&gt;

&lt;p&gt;Start developing these skills now: think in terms of measurable outcomes, anticipate unintended behaviors, and practice translating human intent into mathematical frameworks. The future of AI isn't just about building smarter systems—it's about building systems that are smart in the &lt;em&gt;right&lt;/em&gt; ways.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>career</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Your Multi-Agent AI System Is Probably Making Things Worse?</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Mon, 05 Jan 2026 23:50:13 +0000</pubDate>
      <link>https://forem.com/imaginex/the-ai-agent-scaling-problem-why-more-isnt-better-9nh</link>
      <guid>https://forem.com/imaginex/the-ai-agent-scaling-problem-why-more-isnt-better-9nh</guid>
      <description>&lt;p&gt;2025 has been dubbed the "Year of the Agent" by investors and tech media. Companies like &lt;a href="https://manus.im/" rel="noopener noreferrer"&gt;Manus&lt;/a&gt;, &lt;a href="https://www.lovart.ai/" rel="noopener noreferrer"&gt;Lovart&lt;/a&gt;, &lt;a href="https://www.fellou.ai/" rel="noopener noreferrer"&gt;Fellou&lt;/a&gt;, and &lt;a href="https://www.google.com/search?q=ai+agent+companies" rel="noopener noreferrer"&gt;many others&lt;/a&gt; have captured headlines with their AI agent applications, which are software systems that can autonomously perform tasks on your behalf, from browsing the web to analyzing documents.&lt;/p&gt;

&lt;p&gt;Over the past two years, I've built multi-agent systems for various clients across different industries by using various AI models and agent frameworks. A pattern keeps emerging: these projects look impressive in demos but struggle to work reliably in production. The same questions come up again and again: why isn't adding more agents helping? Why doesn't giving the system more tokens (via prompt engineering or Retrieval-Augmented Generation pipeline (RAG)), more tool calls, or more compute budget improve results?&lt;/p&gt;

&lt;p&gt;The industry has embraced two assumptions that seem logical on the surface:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;More agents = better results.&lt;/strong&gt; Since a single AI agent has limited capabilities, having multiple agents collaborate should solve more complex problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compute = better performance.&lt;/strong&gt; If results aren't good enough, just give the AI more time to think (more "tokens") and more tools to work with.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But are these assumptions actually true?&lt;/p&gt;

&lt;p&gt;Recent research tells a very different story. A report from UC Berkeley, &lt;em&gt;"&lt;a href="https://arxiv.org/abs/2512.04123" rel="noopener noreferrer"&gt;Measuring Agents in Production&lt;/a&gt;"&lt;/em&gt; (December 2025), combined with two papers from Google DeepMind, systematically debunks both assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More agents ≠ better results.&lt;/strong&gt; Multi-agent systems often perform &lt;em&gt;worse&lt;/em&gt; than single agents due to coordination overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compute ≠ better performance.&lt;/strong&gt; Agents don't know how to effectively use extra resources. They leave 85% of their budget untouched.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These studies reveal that current AI agents have fundamental limitations that no amount of scaling can easily fix. Let me walk you through what the research actually shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reality Check: What Berkeley Found in Production Systems
&lt;/h2&gt;

&lt;p&gt;The Berkeley team surveyed 306 practitioners and conducted 20 in-depth case studies with organizations actually running AI agents in production, including Accenture, Amazon, AMD, Anyscale, Broadcom Inc., Google, IBM, Intel, Intesa Sanpaolo, Lambda, Mibura Inc, Samsung SDS, and SAP. Crucially, they filtered out demo-stage or conceptual projects, focusing only on systems generating real business value.&lt;/p&gt;

&lt;p&gt;Their findings paint a surprisingly conservative picture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most agents are kept on a very short leash.&lt;/strong&gt; 68% of production systems limit agents to 10 steps or fewer. Only 16.7% allow dozens of steps, and a mere 6.7% give agents unlimited autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Companies build safety barriers between agents and real systems.&lt;/strong&gt; Rather than letting agents directly call production APIs, engineering teams create simplified "wrapper APIs", bundling multiple complex operations into single, safer commands. For example, instead of making an agent call three separate database queries, engineers package them into one pre-tested function. This reduces what could go wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-designed workflows dominate.&lt;/strong&gt; 80% of successful deployments use "structured control flow", meaning humans draw the flowchart, and the AI simply fills in the blanks at predetermined decision points. The agent isn't autonomously planning, it's following a script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents require massive instruction sets.&lt;/strong&gt; 12% of deployed systems use prompts exceeding 10,000 tokens (roughly 7,500 words of instructions). These aren't lightweight assistants, they're heavily engineered systems with extensive guardrails.&lt;/p&gt;

&lt;p&gt;In essence, today's successful AI agents work like &lt;strong&gt;tireless interns with good reading comprehension&lt;/strong&gt;, useful within a tightly defined process, capable of handling some ambiguity, but not the autonomous problem-solvers the marketing suggests.&lt;/p&gt;

&lt;p&gt;So why are production systems so constrained? Two papers from Google DeepMind, published in late 2025, may provide the answers by systematically disproving the core assumptions behind agent scaling:&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitation #1: More Agents ≠ Better Performance
&lt;/h2&gt;

&lt;p&gt;DeepMind's paper &lt;em&gt;"&lt;a href="https://arxiv.org/abs/2512.08296" rel="noopener noreferrer"&gt;Towards a Science of Scaling Agent Systems&lt;/a&gt;"&lt;/em&gt; tackled a seductive idea: if one AI isn't smart enough, why not create a whole team? Imagine GPT handling product management, Claude writing code, and Gemini running tests—a virtual software company where PhD-level AIs collaborate to solve any problem.&lt;/p&gt;

&lt;p&gt;It sounds logical. After all, that's how human organizations scale. But 180 controlled experiments later, DeepMind proved this intuition wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Experiment Setup
&lt;/h3&gt;

&lt;p&gt;The researchers tested five different ways to organize AI agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single Agent&lt;/strong&gt;: One AI handles everything (think: a solo developer)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent Multi-Agent&lt;/strong&gt;: Multiple AIs work on the same problem separately, then their answers are combined through voting (think: getting multiple opinions, then picking the consensus)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decentralized Multi-Agent&lt;/strong&gt;: Agents communicate directly with each other to negotiate solutions (think: a peer-to-peer discussion group)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Multi-Agent&lt;/strong&gt;: One "manager" agent assigns tasks and verifies results (think: a team with a project manager)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt;: A combination of centralized coordination with peer communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They tested these architectures using top models from OpenAI, Google, and Anthropic across four different task types: financial analysis, web browsing, game planning (Minecraft-style crafting), and general workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Finding
&lt;/h3&gt;

&lt;p&gt;DeepMind discovered a formula that predicts agent system performance with average 87% accuracy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;the costs often outweigh the benefits&lt;/strong&gt;. When coordination overhead, miscommunication, and tool management burden exceed the gains from parallelization, adding more agents makes systems &lt;em&gt;worse&lt;/em&gt;, not better.&lt;/p&gt;

&lt;p&gt;The results varied dramatically by task type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Financial analysis&lt;/strong&gt;: Multi-agent systems helped (up to 81% improvement with centralized architecture)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web browsing&lt;/strong&gt;: Minimal benefit; errors actually got amplified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Game planning (PlanCraft)&lt;/strong&gt;: Multi-agent systems performed &lt;em&gt;significantly worse&lt;/em&gt; than single agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General workflows&lt;/strong&gt;: Mixed results; decentralized approaches slightly better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the figure below for the detailed results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4captxy9o19fjjzwtupe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4captxy9o19fjjzwtupe.png" alt=" " width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Multi-Agent Systems Fail: Three Key Reasons
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. The Coordination Tax&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In complex, open-ended tasks, adding more agents makes the system &lt;em&gt;dumber&lt;/em&gt;, not smarter.&lt;/p&gt;

&lt;p&gt;Consider the PlanCraft benchmark (a Minecraft-style planning task). When Anthropic's Claude model was put into a multi-agent setup, performance dropped by 35%. Why? Every agent must understand tool interfaces, maintain conversation context, and process results from other agents. When the tool count exceeds a threshold, agents spend all their "mental bandwidth" on coordination, reading documentation and attending virtual meetings, with no capacity left for actual problem-solving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Capability Saturation Effect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a single agent can already solve a problem with greater than 45% accuracy, adding more agents typically provides diminishing or negative returns.&lt;/p&gt;

&lt;p&gt;The logic is straightforward: if one agent can correctly answer "What is 2+2?", having three agents debate the answer for an hour won't improve accuracy, it just wastes resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Error Amplification (The Most Surprising Finding)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Intuitively, you'd expect that having three agents vote on an answer would &lt;em&gt;reduce&lt;/em&gt; errors, the wisdom-of-crowds effect. But DeepMind found the opposite.&lt;/p&gt;

&lt;p&gt;In independent multi-agent systems (where agents work separately then vote), errors don't cancel out—they multiply. The paper quantifies this with an "error amplification factor" of 17.2. This means &lt;strong&gt;if a single agent has a 5% error rate, an independent multi-agent system can have an error rate as high as 86%&lt;/strong&gt; (5% × 17.2).&lt;/p&gt;

&lt;p&gt;Why does this happen? Without cross-verification during reasoning, each agent makes errors based on its own flawed logic. These errors become self-reinforcing within each agent's context. When you aggregate three independently wrong answers through voting, you're not getting wisdom, you're getting confident wrongness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitation #2: More Thinking Time ≠ Better Results
&lt;/h2&gt;

&lt;p&gt;If adding more agents doesn't work, what about giving a single agent more time to think?&lt;/p&gt;

&lt;p&gt;After OpenAI released its o1 model, "test-time compute" became a buzzword. The idea: let AI models "think longer" by giving them more computational budget during inference. Search more, reason more, and eventually they'll find the answer, right? But is it true?&lt;/p&gt;

&lt;p&gt;DeepMind's paper &lt;em&gt;"&lt;a href="https://arxiv.org/abs/2511.17006" rel="noopener noreferrer"&gt;Budget-Aware Tool-Use Enables Effective Agent Scaling&lt;/a&gt;"&lt;/em&gt; tested this assumption—and found it largely false.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Experiment
&lt;/h3&gt;

&lt;p&gt;Researchers increased an agent's "tool-call budget", the number of web searches, API calls, or other actions it could perform, from 10 to 100. The expectation: 10x more resources should yield significantly better results.&lt;/p&gt;

&lt;p&gt;The reality: &lt;strong&gt;doubling the budget improved accuracy by only 0.2 percentage points&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even more telling: when given a budget of 100 tool calls, agents only used an average of 14.24 searches and 1.36 browsing sessions. They left 85% of their budget untouched.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Agents Can't Use Extra Resources Effectively
&lt;/h3&gt;

&lt;p&gt;The core problem: &lt;strong&gt;agents don't know what they don't know, and they don't track their remaining budget&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When an agent goes down a wrong path, say, searching for a paper title that doesn't exist, it has no concept of opportunity cost. It will keep digging deeper into a dead end rather than trying a different approach. Give it unlimited compute, and it just digs a deeper hole.&lt;/p&gt;

&lt;p&gt;Making matters worse, long conversation contexts cause "attention drift". After a dozen failed searches, the agent gets lost in its own accumulated noise, the search results, error messages, and dead ends it generated. Performance actually &lt;em&gt;declines&lt;/em&gt; as context grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Potential Solution: Budget-Aware Agents (BATS)
&lt;/h3&gt;

&lt;p&gt;DeepMind proposed a framework called BATS (Budget-Aware Test-time Scaling) that addresses these issues with two key mechanisms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Budget-Aware Planning&lt;/strong&gt;: Instead of making a fixed plan upfront, the agent maintains a dynamic task tree. Each node tracks its status (pending, completed, failed) and resource consumption. When budget is plentiful, expand exploration; when budget is tight, focus on verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Budget-Aware Verification&lt;/strong&gt;: After proposing an answer, a separate verification step checks constraints: What's confirmed? What's contradicted? What can't be verified? Based on this assessment and remaining budget, the agent decides whether to dig deeper or abandon the current path.&lt;/p&gt;

&lt;p&gt;The results were significant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BrowseComp benchmark&lt;/strong&gt;: 24.6% accuracy (95% improvement over standard approaches at 12.6%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BrowseComp-ZH&lt;/strong&gt;: 46.0% accuracy (46% improvement over 31.5%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost efficiency&lt;/strong&gt;: 40% lower total cost (tokens + tool calls) at equivalent accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The lesson: raw thinking time isn't enough. Agents need structured self-reflection, the ability to recognize dead ends, and the wisdom to cut losses early.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually Needed for AI Agents to Work?
&lt;/h2&gt;

&lt;p&gt;Let's return to DeepMind's core formula:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The problem isn't that we need smarter models or more compute. &lt;strong&gt;The problem is that the negative factors, including coordination overhead, communication noise, and tool complexity, are overwhelming the positive factors.&lt;/strong&gt; All of these boil down to one root cause: inefficient use of context.&lt;/p&gt;

&lt;p&gt;Every token spent on coordination, error recovery, or tool documentation is a token not spent on actual problem-solving. To make agents work, we need to reduce this context burden, not pile on more resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Directions That Actually Work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Smarter Tool Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key insight from the financial analysis success (81% improvement with multi-agent systems) is instructive: it worked because the task had clear boundaries and well-defined steps.&lt;/p&gt;

&lt;p&gt;Financial analysis follows a predictable pattern: read report → extract data → calculate ratios → generate summary. Each agent fills in blanks within a predetermined framework, no creative planning required.&lt;/p&gt;

&lt;p&gt;This tells us something important: &lt;strong&gt;current AI models cannot self-organize division of labor&lt;/strong&gt;. They can handle easily parallelizable tasks (like financial analysis) or consensus-based error correction (like multi-path search), but not emergent collaboration.&lt;/p&gt;

&lt;p&gt;The implication? For complex tasks, &lt;strong&gt;human-designed task decomposition (SOPs) remains necessary&lt;/strong&gt;. The dream of throwing agents together and watching them spontaneously develop hierarchies has been empirically disproven.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview" rel="noopener noreferrer"&gt;Anthropic's Skills mechanism&lt;/a&gt; matters: it lets agents accumulate reusable capability modules instead of starting from scratch, reducing the cognitive load of tool management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Built-in Self-Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BATS works because it formalizes verification. The system explicitly tracks constraints: what's satisfied, what's contradicted, what can't be verified yet. This isn't emergent behavior, it's enforced through careful prompt engineering.&lt;/p&gt;

&lt;p&gt;Without structured verification, errors accumulate silently. Each mistake pollutes the context with garbage that degrades future reasoning. Formal verification catches errors early, preventing context pollution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Efficient Inter-Agent Communication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Current agents coordinate via natural language, verbose, ambiguous, requiring constant clarification. This high message density is inherently wasteful.&lt;/p&gt;

&lt;p&gt;Future improvements might come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured communication protocols (like &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" rel="noopener noreferrer"&gt;Google's A2A framework&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Latent-space communication where models exchange compressed representations rather than text&lt;/li&gt;
&lt;li&gt;Shared memory architectures that reduce redundant information exchange&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Until these three capabilities mature, such as smart tool management, built-in verification, and efficient communication, multi-agent systems will continue to underperform their theoretical potential.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Despite the hype, the "Year of the Agent" hasn't truly arrived. The research tells us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Production agents are heavily constrained&lt;/strong&gt;: most limited to 10 steps or fewer, running within human-designed workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More agents often means worse performance&lt;/strong&gt;: coordination costs overwhelm collaboration benefits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compute doesn't help much&lt;/strong&gt;: agents don't know how to use extra resources effectively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The path forward is reducing overhead&lt;/strong&gt;, not adding more power&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Current successful AI agents are best understood as &lt;strong&gt;capable interns with good reading comprehension, working within strict SOPs&lt;/strong&gt;. They handle ambiguity better than traditional software, but they're not autonomous problem-solvers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For practitioners, the implication is clear: invest in workflow design, tool abstraction, and structured verification rather than chasing multi-agent architectures or unlimited compute budgets. The engineering fundamentals—not the scaling laws—determine success.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2512.04123" rel="noopener noreferrer"&gt;Measuring Agents in Production&lt;/a&gt; - UC Berkeley (December 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2512.08296" rel="noopener noreferrer"&gt;Towards a Science of Scaling Agent Systems&lt;/a&gt; - Google DeepMind (December 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2511.17006" rel="noopener noreferrer"&gt;Budget-Aware Tool-Use Enables Effective Agent Scaling&lt;/a&gt; - Google DeepMind (November 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview" rel="noopener noreferrer"&gt;Agent Skills - Claude Docs&lt;/a&gt; - Anthropic&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>discuss</category>
    </item>
    <item>
      <title>AI Made Me 10x Faster—Here's What I Had to Change</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Fri, 19 Dec 2025 22:05:43 +0000</pubDate>
      <link>https://forem.com/imaginex/ai-made-me-10x-faster-heres-what-i-had-to-change-3j91</link>
      <guid>https://forem.com/imaginex/ai-made-me-10x-faster-heres-what-i-had-to-change-3j91</guid>
      <description>&lt;p&gt;I've been working as an IT engineer in multiple industries for over 25 years, from .Net developer, BI developer, data architect, data scientist, and finally to AI solutions architect. Recently, the team in my organization is developing and using AI programming tools and has achieved 8x to 20x code output efficiency compared to ordinary high-performing teams. See this &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7369788200967479296/" rel="noopener noreferrer"&gt;LinkedIn post&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;Reading this, you might think: Wow, are programmers going to lose their jobs? Is AI going to replace humans?&lt;/p&gt;

&lt;p&gt;But my view is exactly the opposite. As a frontline professional programmer, I'm telling you responsibly: When your speed increases 10x, the risks and bottlenecks you face may also be magnified 10x.&lt;/p&gt;

&lt;p&gt;I'll be honest - even I was skeptical at first. Could we really sustain this pace without everything falling apart?&lt;/p&gt;

&lt;p&gt;What this means is: AI has fundamentally changed how "costs" and "benefits" are calculated in software engineering. But to truly enjoy this improvement, the entire software development system needs to be upgraded simultaneously. This insight applies not only to programming but has profound implications for everyone who uses AI tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. AI Hasn't Made Programmers Unemployed, But Has Fundamentally Changed How People Work
&lt;/h2&gt;

&lt;p&gt;Here's how my team works:&lt;/p&gt;

&lt;p&gt;In the code our team submits, 80% - 90% is written by AI. But this is definitely not that casual "Vibe Coding (i.e., coding without thinking)", which is not a good coding practice. This workflow is called "Agentic Coding" (i.e., coding with AI agents).&lt;/p&gt;

&lt;p&gt;In this model, AI plays the role of a "highly capable but irresponsible junior programmer."&lt;/p&gt;

&lt;p&gt;And the human engineers? They are "experienced tech leads or architects."&lt;/p&gt;

&lt;p&gt;Specifically, the engineer's workflow has become:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Design, plan and break down tasks&lt;/strong&gt; - Figure it out yourself first, or brainstorm with AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give AI instructions&lt;/strong&gt; - Clearly tell AI what to do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review and refine AI's output&lt;/strong&gt; - This is the most critical step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate repeatedly&lt;/strong&gt; - Until completely satisfied with the quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Submit and take full responsibility&lt;/strong&gt; - Ultimately, humans are still responsible for the code&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;See that? The role of humans hasn't diminished; it's become more important. The focus of work has just shifted from "writing code by hand" to "defining requirements" and "code review."&lt;/p&gt;

&lt;p&gt;Think about your own work: What would change if 80% of your output came from AI?&lt;/p&gt;

&lt;p&gt;An analogy: Previously you were a worker carrying bricks on a construction site. Now you're an operator commanding an excavator. Although you no longer carry bricks with your own hands, your judgment, operational skills, and responsibilities have actually increased.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Speed Increases 10x, Accident Rate May Also Increase 10x
&lt;/h2&gt;

&lt;p&gt;When you're speeding down a track at 200 km/h, you need massive "downforce" to keep the car firmly planted on the ground. Otherwise, you'll fly off at the first curve.&lt;/p&gt;

&lt;p&gt;In software engineering, "flying off" means bugs and system crashes.&lt;/p&gt;

&lt;p&gt;Some alarming data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before:&lt;/strong&gt; A team might only encounter one or two severe production bugs per year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Now:&lt;/strong&gt; When you're submitting code at 10x speed, even if the probability of bugs stays the same, the absolute number of bugs you encounter will also increase 10x&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What does this mean?&lt;/p&gt;

&lt;p&gt;Incidents that used to happen once a year might now happen every week. Imagine explaining to your boss why production went down every Monday.&lt;/p&gt;

&lt;p&gt;This kind of "accident rate" is unbearable for any team. Yet many people promoting "AI omnipotence" have intentionally or unintentionally ignored this problem.&lt;/p&gt;

&lt;p&gt;To enjoy the 10x coding speed boost from AI, you must also find ways to reduce the "probability of problems" by 10x, or even more.&lt;/p&gt;

&lt;p&gt;Having a good engine isn't enough; you also need a better braking system.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The True Value of AI: Making "Good but Expensive Methods" Affordable
&lt;/h2&gt;

&lt;p&gt;So how do you reduce risk while increasing speed?&lt;/p&gt;

&lt;p&gt;AI isn't just about letting you write faster; it's about making those "good but too expensive" best practices in software engineering become affordable and feasible.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Change #1: Build a "Wind Tunnel Testing" Environment
&lt;/h3&gt;

&lt;p&gt;What is wind tunnel testing?&lt;/p&gt;

&lt;p&gt;Just like building an airplane - before it actually takes flight, the model is put in a wind tunnel to test various extreme conditions.&lt;/p&gt;

&lt;p&gt;In software development, this means building a "high-fidelity simulation environment" locally. For example, if your system depends on 10 external services (databases, authentication, payments, etc.), you run or simulate all 10 services locally.&lt;/p&gt;

&lt;p&gt;This way, on your computer you can run complete end-to-end tests, and even simulate various extreme failure scenarios.&lt;/p&gt;

&lt;p&gt;This kind of testing can catch a lot of bugs hidden in the "cracks between components."&lt;/p&gt;

&lt;p&gt;Why didn't we do this before? Too expensive!&lt;/p&gt;

&lt;p&gt;Simulating and maintaining these services was too much work, so most teams gave up.&lt;/p&gt;

&lt;p&gt;Why can we do it now? AI excels at this!&lt;/p&gt;

&lt;p&gt;AI is very good at writing these simulation services with clear logic and well-defined behavior. Especially by using AI agents with Model Context Protocol (MCP) and Agent2Agent Protocol (A2A), we can easily build a complete local "wind tunnel" for our fairly complex system.&lt;/p&gt;

&lt;p&gt;Work that used to take weeks or even months can now be done in days.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Change #2: Upgrade CI/CD (Continuous Integration/Deployment)
&lt;/h3&gt;

&lt;p&gt;In the early days of waterfall development, everyone developed separately and then integrated after development. The result was a pile of problems during integration, taking a long time to stabilize.&lt;/p&gt;

&lt;p&gt;Later, the concept of "continuous integration" became popular:&lt;/p&gt;

&lt;p&gt;The earlier you integrate, the earlier you get feedback. The more frequently you integrate, the more you can reduce problem complexity.&lt;/p&gt;

&lt;p&gt;Now, CI/CD is recognized as the best practice in software engineering. But not many teams actually do it well, because building and maintaining it is still not cheap.&lt;/p&gt;

&lt;p&gt;What's worse is that many teams, despite having CI/CD, have extremely time-consuming processes. One code commit, waiting for all tests and deployment to run through - at minimum ten minutes, sometimes several hours.&lt;/p&gt;

&lt;p&gt;Before AI, these problems weren't obvious. Now that AI is more capable, they've become obstacles.&lt;/p&gt;

&lt;p&gt;So CI/CD also needs to be upgraded along with it, compressing the feedback loop from "hours" to "minutes." You need infrastructure that's fast to an absurd degree, able to discover, isolate, and roll back problematic changes within minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Change #3: Decision-Making and Communication Systems Also Need Upgrading
&lt;/h3&gt;

&lt;p&gt;10x code output means you also need 10x or more communication and decision-making efficiency.&lt;/p&gt;

&lt;p&gt;Before, developing a system required various meetings, lengthy discussions, and only then could work begin. After all, you had to depend on other people's modules, so you had to define agreements first, otherwise you couldn't integrate later.&lt;/p&gt;

&lt;p&gt;Various technical decisions also required repeated discussion for a long time, because development costs were high back then, and if decisions were wrong, the cost of rework was too great.&lt;/p&gt;

&lt;p&gt;But now, if we still have the same communication efficiency as before, it will greatly drag down overall efficiency.&lt;/p&gt;

&lt;p&gt;Perhaps the most efficient approach is to minimize communication as much as possible, letting everyone do their tasks as independently as possible from others.&lt;/p&gt;

&lt;p&gt;For example, microservices architecture might be a good choice in the AI era.&lt;/p&gt;

&lt;p&gt;For technical decisions, now you can actually have more opportunities to experiment. You don't need to be as rigorous as before in repeatedly verifying technical decisions. Because development costs have decreased, the cost of experimentation has also decreased.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Insights for Everyone
&lt;/h2&gt;

&lt;p&gt;AI isn't a "stimulant" that makes you run faster; it's giving you a "supercar."&lt;/p&gt;

&lt;p&gt;But the question is: Are you ready to drive it?&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Tool Upgrade Doesn't Equal System Upgrade
&lt;/h3&gt;

&lt;p&gt;Using AI is like upgrading your car with a brand new engine. If you just install it on your old "vintage car," what you get isn't 10x speed, but 10x problems.&lt;/p&gt;

&lt;p&gt;This principle applies to each of us.&lt;/p&gt;

&lt;p&gt;When you learn to use AI tools (ChatGPT, Gemini, Claude, Midjourney, various AI assistants), don't assume your work efficiency will automatically improve.&lt;/p&gt;

&lt;p&gt;Your workflows, quality inspection mechanisms, and collaboration methods all need to be adjusted accordingly.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Using AI to write code is fast, but if you don't have a rigorous review process, you might produce a lot of low-quality, buggy code.&lt;/p&gt;

&lt;p&gt;Using AI to quickly generate investment advice - but has your risk assessment ability kept up?&lt;/p&gt;

&lt;p&gt;Using AI to make quick decisions - but have you established a review and error-correction mechanism?&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Speed Increase Must Come with Risk Management Improvement
&lt;/h3&gt;

&lt;p&gt;The "accident rate paradox" doesn't only exist in programming.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The faster food delivery, the higher the traffic accident risk&lt;/li&gt;
&lt;li&gt;The faster product iteration, the more quality issues there might be&lt;/li&gt;
&lt;li&gt;The faster decision-making, the greater the probability of making mistakes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So don't blindly pursue "fast." Ask yourself: Has my "braking system" been upgraded?&lt;/p&gt;

&lt;p&gt;Build your own "wind tunnel testing": Try on a small scale first, rather than pushing forward comprehensively right away.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Re-examine Those "Good but Expensive" Methods
&lt;/h3&gt;

&lt;p&gt;Here's what finally clicked for me after months of using AI tools:&lt;/p&gt;

&lt;p&gt;The true value of AI isn't just about "writing faster"; it's about making those "good but too expensive" best practices become affordable and feasible.&lt;/p&gt;

&lt;p&gt;This realization made me re-examine many good habits I had abandoned because they were "too troublesome":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal finance management:&lt;/strong&gt; Keeping track of expenses used to be too troublesome. Now AI can help you automatically categorize and analyze&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning new skills:&lt;/strong&gt; Hiring a private tutor used to be too expensive. Now AI can provide personalized tutoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health management:&lt;/strong&gt; Nutritionists used to be too expensive. Now AI can customize meal plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content creation:&lt;/strong&gt; Making videos used to require a team. Now individuals can also produce high-quality content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is: Don't just use AI for "fast production." Use it to achieve "things you wanted to do before but couldn't."&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Your Role is Changing
&lt;/h3&gt;

&lt;p&gt;In programming, my role has shifted from "executor" to "decision-maker + quality inspector." But this pattern applies everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Writers&lt;/strong&gt; are becoming editors who review and refine AI drafts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Designers&lt;/strong&gt; are becoming creative directors who guide AI-generated concepts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysts&lt;/strong&gt; are becoming strategists who interpret AI-processed data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managers&lt;/strong&gt; are becoming orchestrators who coordinate AI-assisted workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread? Responsibilities haven't decreased; they've actually increased.&lt;/p&gt;

&lt;p&gt;In the AI era, your core competitiveness is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Judgment&lt;/strong&gt; - Being able to distinguish good from bad AI output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Questioning ability&lt;/strong&gt; - Being able to give AI clear instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sense of responsibility&lt;/strong&gt; - Being willing to take responsibility for the final results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systems thinking&lt;/strong&gt; - Understanding the entire process, not just one part&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Many people can use AI to write reports, but those who can review AI's logical flaws are valuable.&lt;/p&gt;

&lt;p&gt;Many people can use AI to design solutions, but those who can judge a solution's feasibility are irreplaceable.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.5 Build Your AI Work System
&lt;/h3&gt;

&lt;p&gt;For us ordinary people, don't just focus on "individual AI tools." Build your own "AI work system":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Input system&lt;/strong&gt; - How to quickly and accurately provide AI with information and instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality inspection system&lt;/strong&gt; - How to efficiently review AI's output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback system&lt;/strong&gt; - How to iterate and improve quickly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge management&lt;/strong&gt; - How to accumulate and reuse experience from AI collaboration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;It's not just about using ChatGPT; you also need to build your own prompt library, review checklist, and iteration workflow.&lt;/p&gt;

&lt;p&gt;It's not just about using AI for image generation; you also need to establish a style library, quality standards, and version management.&lt;/p&gt;

&lt;p&gt;This is the true way of working in the AI era.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Summary
&lt;/h2&gt;

&lt;p&gt;The AI era requires "systems thinking," not "tool thinking."&lt;/p&gt;

&lt;p&gt;Many people treat AI as a "fast production tool," hoping to use it to accelerate existing work.&lt;/p&gt;

&lt;p&gt;But those who truly understand how to leverage AI treat it as an "opportunity for system upgrade," rethinking the entire workflow.&lt;/p&gt;

&lt;p&gt;AI isn't just about upgrading the car's engine; it's also about upgrading the roads the car frequently drives on.&lt;/p&gt;

&lt;p&gt;The goal for veteran drivers isn't to be replaced by AI, but to help them adapt to the new high-speed engine, giving them a comfortable and safe driving environment.&lt;/p&gt;

&lt;p&gt;So when AI increases your speed 10x or 20x, don't rush to celebrate.&lt;/p&gt;

&lt;p&gt;First ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has my quality inspection mechanism been upgraded?&lt;/li&gt;
&lt;li&gt;Has my risk management ability improved?&lt;/li&gt;
&lt;li&gt;Has my workflow been restructured?&lt;/li&gt;
&lt;li&gt;Am I ready to take full responsibility for the results?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember: You are the driver who is responsible for the final results.&lt;/p&gt;

&lt;p&gt;AI just gave you a supercar, but whether you can arrive at your destination safely and efficiently depends on your driving skills and road conditions.&lt;/p&gt;

&lt;p&gt;May we all become good drivers in the AI era - ones who can step on the gas, and also know when to brake.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
