<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: JaviMaligno</title>
    <description>The latest articles on Forem by JaviMaligno (@javieraguilarai).</description>
    <link>https://forem.com/javieraguilarai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3701121%2F3d85b744-a4d6-4104-a1ae-db83b08dcc88.png</url>
      <title>Forem: JaviMaligno</title>
      <link>https://forem.com/javieraguilarai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/javieraguilarai"/>
    <language>en</language>
    <item>
      <title>The Real Skill in the Age of AI: Knowing When to Stop</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Tue, 14 Apr 2026 20:15:18 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/the-real-skill-in-the-age-of-ai-knowing-when-to-stop-2gp6</link>
      <guid>https://forem.com/javieraguilarai/the-real-skill-in-the-age-of-ai-knowing-when-to-stop-2gp6</guid>
      <description>&lt;p&gt;The missing skill isn't prompting. It isn't knowing which model to use, or how to structure a CLAUDE.md, or when to reach for an MCP tool.&lt;/p&gt;

&lt;p&gt;The missing skill is knowing when to stop.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Async Human
&lt;/h2&gt;

&lt;p&gt;Conscious human attention runs as a &lt;strong&gt;single-threaded async process&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We have background threads — habits, pattern recognition, the kind of low-level processing that happens without us noticing. But our conscious executive function, the part that reads specs and makes architectural decisions and evaluates whether an agent went off the rails, is strictly &lt;strong&gt;single-threaded&lt;/strong&gt;. It cannot genuinely run two complex reasoning tasks in parallel.&lt;/p&gt;

&lt;p&gt;What we call "multitasking" is actually context switching. And context switching has overhead — exactly like an OS scheduler running multiple processes on a single core. The CPU doesn't run them in parallel; it creates the &lt;em&gt;illusion&lt;/em&gt; of parallelism by rapidly switching between them, loading and unloading state each time.&lt;/p&gt;

&lt;p&gt;We do the same thing. And we pay the same price: every switch costs something, and the more switches you do, the less actual execution you get per unit of time.&lt;/p&gt;

&lt;p&gt;This is the foundation everything else rests on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Stopping Problems
&lt;/h2&gt;

&lt;p&gt;Given that model, there are two distinct ways things go wrong when you orchestrate AI agents — and they operate on different axes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Vertical Problem: Building on Unverified Ground
&lt;/h3&gt;

&lt;p&gt;You're in a session. The agent has just finished the authentication module. It looks solid. You skim the output, it seems right, and there's momentum — so you tell it to start the dashboard.&lt;/p&gt;

&lt;p&gt;But you haven't actually verified the auth module at depth. You've seen it. You haven't tested it.&lt;/p&gt;

&lt;p&gt;Now the dashboard is being built on top of assumptions that may be wrong. And the longer you continue before stopping to verify, the more expensive any foundation flaw becomes. If auth has a subtle bug, it might not surface until the dashboard is half-built — at which point you're not fixing one thing, you're untangling two.&lt;/p&gt;

&lt;p&gt;This is the stopping problem &lt;em&gt;within&lt;/em&gt; a session. It's not about switching between agents. It's about the pull of momentum inside a single thread of work. The agent keeps going, you keep going with it, and the transition from "build mode" to "verify mode" never happens because it was never explicitly planned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The discipline&lt;/strong&gt;: before the session starts, define what "done" means. The agent stopping is not done. Done means you've confirmed it works and the assumptions it built on are sound. No new feature starts until you've reached that bar.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Horizontal Problem: How Many Processes Can You Actually Schedule?
&lt;/h3&gt;

&lt;p&gt;Now add parallel sessions. Agent A is building the API. Agent B is refactoring the data model. Agent C is writing tests.&lt;/p&gt;

&lt;p&gt;Each of those is a process your single-threaded attention has to service. You check in on A, switch to B, switch to C, back to A — and with every switch, you load a context, do some work, and unload it. Except you don't fully unload it. Research on &lt;strong&gt;attention residue&lt;/strong&gt; shows that when you shift focus from one task to another, part of your attention stays on the previous one. That residue consumes working memory, degrading the quality of your engagement with whatever you've switched to.&lt;/p&gt;

&lt;p&gt;Three overlapping agent sessions mean you're potentially carrying three partial contexts simultaneously, never fully present in any of them. It doesn't feel like that — it feels like productivity. But the quality of your oversight degrades quietly, and regressions and bad architectural decisions slip through because you weren't reading deeply enough when it mattered.&lt;/p&gt;

&lt;p&gt;The practical limit, in my experience: &lt;strong&gt;two to three active sessions per focused block&lt;/strong&gt;, and only when the tasks are genuinely isolatable. Beyond that, you're not supervising — you're skimming. And skimming is where things break.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Stopping Well Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Both problems have the same root solution: &lt;strong&gt;make stopping a first-class part of the plan, not an afterthought&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A few heuristics that have changed how I work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before launching any session&lt;/strong&gt;, answer two questions: what does &lt;em&gt;done&lt;/em&gt; look like, and what does &lt;em&gt;verified&lt;/em&gt; look like? If you can't answer both, the task isn't scoped well enough to run yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resist the next feature until the previous one is verified.&lt;/strong&gt; Momentum is the enemy here. The agent finishes something, you feel good about it, and the natural instinct is to keep going. Pause. Check. Test. Confirm the foundation is solid before building on top of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat session boundaries as hard stops, not suggestions.&lt;/strong&gt; Decide in advance: after these agents report back, I review, I merge what's ready, and I stop. Context drift — where sessions keep spawning new sessions without a real break — is how six hours disappear and you end up with a codebase that's hard to reason about and a brain that's completely fried.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify tasks before parallelizing them.&lt;/strong&gt; Not all agent work has the same cognitive cost to oversee. Documentation, boilerplate, isolated tests — these are cheap to supervise in parallel. Architectural decisions, cross-cutting refactors, anything with shared state — these deserve your full, sequential attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Closing Condition
&lt;/h2&gt;

&lt;p&gt;A single-threaded process that tries to schedule too many tasks doesn't get faster — it just spends more time context-switching than executing. The most efficient system isn't the one that launches the most processes. It's the one that does the most real work &lt;em&gt;between&lt;/em&gt; switches.&lt;/p&gt;

&lt;p&gt;That applies to CPUs. It turns out it applies to us too.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What limits have you settled on in practice? I'd be curious to compare notes. &lt;a href="https://dev.to/en/#contact"&gt;Get in touch&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/human-limits-managing-ai-agents" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>developerproductivity</category>
      <category>aiagents</category>
      <category>workflow</category>
    </item>
    <item>
      <title>Playwright CLI vs Scripts: How AI Agents Should Test</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Sun, 29 Mar 2026 17:42:29 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/playwright-cli-vs-scripts-how-ai-agents-should-test-18gj</link>
      <guid>https://forem.com/javieraguilarai/playwright-cli-vs-scripts-how-ai-agents-should-test-18gj</guid>
      <description>&lt;p&gt;There's a growing pattern in agentic development: give an AI agent a browser and let it figure things out. Tools like Playwright CLI, browser-use, and Claude Code's computer use make it easy to point an LLM at a web page and say "test this."&lt;/p&gt;

&lt;p&gt;It works. Until it doesn't.&lt;/p&gt;

&lt;p&gt;After months of building an automated microservice deployment system where AI agents deploy, configure, and verify services end-to-end, I've landed on a clear mental model for when to use Playwright CLI (interactive, LLM-driven) vs standard Playwright scripts (deterministic, reproducible). The distinction matters more than most people think.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Playwright CLI: Exploration Mode
&lt;/h3&gt;

&lt;p&gt;The CLI is conversational. The agent navigates, takes snapshots, reads the DOM, decides what to click next. Each interaction is a new LLM call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Agent opens a page, takes a snapshot, reads the YAML, decides next action&lt;/span&gt;
playwright-cli &lt;span class="nt"&gt;-s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;verify open https://app.example.com &lt;span class="nt"&gt;--headed&lt;/span&gt;
playwright-cli &lt;span class="nt"&gt;-s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;verify snapshot
playwright-cli &lt;span class="nt"&gt;-s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;verify click e&lt;span class="o"&gt;{&lt;/span&gt;ref-from-snapshot&lt;span class="o"&gt;}&lt;/span&gt;
playwright-cli &lt;span class="nt"&gt;-s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;verify fill e&lt;span class="o"&gt;{&lt;/span&gt;input-ref&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="s2"&gt;"search term"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how my &lt;code&gt;verify-test&lt;/code&gt; skill works in &lt;a href="https://www.javieraguilar.ai/en/projects/data-source-automator" rel="noopener noreferrer"&gt;the Data Source Automator&lt;/a&gt; pipeline. When I'm manually testing a newly deployed microservice with Claude Code, the agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Opens the Verify platform in a browser&lt;/li&gt;
&lt;li&gt;Takes a snapshot (outputs a YAML with element refs)&lt;/li&gt;
&lt;li&gt;Reads the snapshot to find the right elements&lt;/li&gt;
&lt;li&gt;Fills forms, clicks buttons, navigates&lt;/li&gt;
&lt;li&gt;Evaluates results and decides the next step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step is an LLM inference. The agent adapts — if a dropdown is missing, it investigates. If auth expires, it logs back in. If results look wrong, it digs deeper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not testing. This is exploration.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Playwright Scripts: Reproducible Mode
&lt;/h3&gt;

&lt;p&gt;Scripts are deterministic. Same input, same steps, same assertions. No LLM involved in execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Verify service search returns results&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://app.example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="search-input"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ACME Corp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="search-button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="result-row"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveCount&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;minimum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;firstResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;firstResult&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toContainText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ACME&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No snapshots, no YAML parsing, no LLM reasoning about what to click next. Just code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Distinction Matters for AI Agents
&lt;/h2&gt;

&lt;p&gt;When I first built the deployment automation system, everything used the CLI approach. The LLM agent would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy the service (git tag, config-deploy update, ArgoCD sync)&lt;/li&gt;
&lt;li&gt;Run infrastructure checks (pods, health, swagger, etcd)&lt;/li&gt;
&lt;li&gt;Open a browser via Playwright CLI&lt;/li&gt;
&lt;li&gt;Navigate the platform UI to configure and test the service&lt;/li&gt;
&lt;li&gt;Evaluate results and report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It worked. But three problems emerged quickly:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Token consumption exploded
&lt;/h3&gt;

&lt;p&gt;Every Playwright CLI interaction requires the LLM to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the full snapshot YAML (often 500+ lines)&lt;/li&gt;
&lt;li&gt;Reason about which element to interact with&lt;/li&gt;
&lt;li&gt;Formulate the next command&lt;/li&gt;
&lt;li&gt;Evaluate the result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a typical verification flow (login → configure → search → validate results → check details), that's 15-20 LLM calls with large context. &lt;strong&gt;At scale, testing 50+ microservices, this was burning through tokens fast.&lt;/strong&gt;&lt;/p&gt;
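&lt;p&gt;A back-of-envelope estimate makes the scale visible. Only the call count and service count come from the numbers above; the tokens-per-call figure is an assumption for illustration:&lt;/p&gt;

```typescript
// Rough cost model for CLI-driven verification at fleet scale.
// tokensPerCall is an assumed figure, not a measurement.
const callsPerFlow = 18;     // midpoint of the 15-20 LLM calls per flow
const tokensPerCall = 4000;  // 500+ line snapshot YAML plus reasoning, assumed
const services = 50;

const totalTokens = callsPerFlow * tokensPerCall * services;
console.log(totalTokens); // 3600000: millions of tokens per full pass
```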

&lt;h3&gt;
  
  
  2. Variability killed reliability
&lt;/h3&gt;

&lt;p&gt;The same test run twice could take different paths. The LLM might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click a different element if the snapshot was slightly different&lt;/li&gt;
&lt;li&gt;Interpret results differently based on context window state&lt;/li&gt;
&lt;li&gt;Get confused by UI changes that didn't affect functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A test that passed Monday might fail Tuesday — not because the service broke, but because the agent reasoned differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Stable workflows don't need reasoning
&lt;/h3&gt;

&lt;p&gt;After the first few runs, I noticed the verification pattern was always the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Login (or restore auth state)&lt;/li&gt;
&lt;li&gt;Navigate to settings&lt;/li&gt;
&lt;li&gt;Configure autodiscovery&lt;/li&gt;
&lt;li&gt;Create an application&lt;/li&gt;
&lt;li&gt;Execute a search&lt;/li&gt;
&lt;li&gt;Validate results have data&lt;/li&gt;
&lt;li&gt;Open a detail view&lt;/li&gt;
&lt;li&gt;Capture evidence screenshots&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This doesn't need an LLM. It needs a script.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture I Landed On
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│           Deployment Pipeline                    │
│                                                  │
│  Deploy ──► Infra Checks ──► Browser Tests       │
│                                  │               │
│                          ┌──────┴──────┐         │
│                          │             │         │
│                    Playwright     Playwright      │
│                    Scripts        CLI             │
│                    (default)     (fallback)       │
│                          │             │         │
│                    Deterministic  LLM-driven      │
│                    Fast           Adaptive         │
│                    Low tokens     High tokens      │
│                          │             │         │
│                          └──────┬──────┘         │
│                                 │                │
│                          LLM receives            │
│                          outputs only            │
└─────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scripts run first.&lt;/strong&gt; They handle the happy path: login, navigate, search, assert, screenshot. The orchestrating LLM agent receives only the test output (pass/fail + screenshots), not the raw DOM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI kicks in as fallback.&lt;/strong&gt; When a script fails unexpectedly — a UI change, a new modal, an element that moved — the agent switches to CLI mode. Now it can explore, take snapshots, reason about what changed, and either fix the issue or report it.&lt;/p&gt;

&lt;p&gt;This is the key insight: &lt;strong&gt;the LLM doesn't need to see every page load and every DOM tree. It just needs the results — and the ability to dig deeper when something goes wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Example: The Generated Test Pattern
&lt;/h2&gt;

&lt;p&gt;In the demo-video system, I took this further. Tests are auto-generated from scene definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// scenes/uk-hmrc-supervised-business-register-video-a.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Scene definitions: what to do, in what order, with what data&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SCENE_CONFIGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;login&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Login to platform&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;navigate-settings&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Open service configuration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;configure-autodiscovery&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Enable the service&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;create-application&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Create test application&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;execute-search&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Run company search&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;validate-results&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Check search results&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="c1"&gt;// Each scene is a function: (page: Page) =&amp;gt; Promise&amp;lt;void&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SCENES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;login&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://demo1-dev.simplekyc.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// ... deterministic steps&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated spec file is pure boilerplate — iterates scenes, records timestamps, saves output. &lt;strong&gt;Zero LLM involvement at runtime.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Auto-generated — do not edit manually&lt;/span&gt;
&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Record UK HMRC Supervised Business Register&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;recordVideo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;VIDEO_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1280&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;720&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;viewport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1280&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;720&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;SCENE_CONFIGS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sceneFn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;SCENES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sceneFn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM's role? It &lt;strong&gt;wrote&lt;/strong&gt; these scenes initially (using CLI exploration to understand the UI). Then it stepped back and let the scripts run.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Each
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Playwright CLI&lt;/th&gt;
&lt;th&gt;Playwright Scripts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exploration, debugging, unknown flows&lt;/td&gt;
&lt;td&gt;Verification, regression, known flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM involvement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every step&lt;/td&gt;
&lt;td&gt;None at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (snapshots + reasoning per action)&lt;/td&gt;
&lt;td&gt;Near zero (agent only reads output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (LLM may take different paths)&lt;/td&gt;
&lt;td&gt;High (same code, same steps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Adaptability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (handles unexpected UI)&lt;/td&gt;
&lt;td&gt;Low (breaks on UI changes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slow (LLM latency per step)&lt;/td&gt;
&lt;td&gt;Fast (native browser speed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First-time flows, debugging failures, navigation&lt;/td&gt;
&lt;td&gt;Repeated verification, CI/CD, evidence capture&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;Think of it like a human QA engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First time testing a feature?&lt;/strong&gt; They click around, explore, take notes. This is CLI mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing the regression test?&lt;/strong&gt; They script the exact steps they just explored. This is script mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test fails unexpectedly?&lt;/strong&gt; They go back to manual exploration to understand why. CLI mode again.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An AI agent should work the same way. &lt;strong&gt;Explore with the CLI, codify with scripts, fall back to CLI when scripts break.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Guidelines
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with CLI exploration&lt;/strong&gt; when the AI agent encounters a new UI or flow. Let it snapshot, reason, navigate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Once the flow is stable&lt;/strong&gt; (works 3+ times consistently), generate a Playwright script. The agent itself can write it — it already knows the selectors and flow from its CLI exploration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In automated pipelines&lt;/strong&gt;, always use scripts as the primary test runner. The orchestrating LLM receives &lt;code&gt;stdout&lt;/code&gt; (pass/fail, screenshots), not DOM trees.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep CLI as fallback&lt;/strong&gt;. When a script fails, the agent can switch to CLI mode to diagnose: "The search button moved — let me take a snapshot and find it."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never ship CLI-only tests in CI&lt;/strong&gt;. If your AI agent is doing 20 LLM calls per test run in CI, you're paying for reasoning you don't need.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
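The guidelines above boil down to a small orchestration loop. Here is a minimal sketch of that script-first, CLI-fallback pattern — `runPlaywrightScript` and `exploreWithCli` are hypothetical stand-ins for whatever test runner and agent tooling you actually use, not real APIs:

```typescript
// Hypothetical helpers: a deterministic Playwright script runner and an
// LLM-driven CLI exploration step. Both just report pass/fail here.
type VerifyResult = { mode: "script" | "cli"; passed: boolean };

async function verifyFlow(
  runPlaywrightScript: () => Promise<boolean>,
  exploreWithCli: () => Promise<boolean>,
): Promise<VerifyResult> {
  try {
    // Primary path: the scripted test, near-zero token cost.
    if (await runPlaywrightScript()) {
      return { mode: "script", passed: true };
    }
  } catch {
    // The script itself broke — likely a UI change.
  }
  // Fallback: let the agent snapshot and reason its way through the flow.
  return { mode: "cli", passed: await exploreWithCli() };
}
```

The point of the shape: a passing script never touches the LLM at all; only a failure escalates to the expensive reasoning path.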

&lt;h2&gt;
  
  
  Navigation vs Testing
&lt;/h2&gt;

&lt;p&gt;One more distinction worth making: &lt;strong&gt;pure navigation is conceptually different from testing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If an AI agent needs to browse the web — research a topic, fill out a form, extract data from a page — CLI is the right tool. That's not testing, it's interaction. The agent needs to reason about what it sees.&lt;/p&gt;

&lt;p&gt;But the moment you're verifying that a known flow produces expected results? That's testing. Script it.&lt;/p&gt;




&lt;p&gt;The pattern is simple: &lt;strong&gt;explore with CLI, verify with scripts, fall back to CLI when things break.&lt;/strong&gt; Your AI agents will be faster, cheaper, and more reliable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This approach is running in production for 50+ microservice deployments. The token savings from switching repetitive verifications to scripts were significant enough to justify the refactor within the first week.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;For a deeper dive into the script-based approach — including scene recording, voiceover generation, and video assembly — see &lt;a href="https://www.javieraguilar.ai/en/blog/automated-demo-recording-playwright" rel="noopener noreferrer"&gt;I Made a Product Demo Video Entirely with AI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/playwright-cli-vs-scripts-ai-agents" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>playwright</category>
      <category>testing</category>
      <category>automation</category>
    </item>
    <item>
      <title>Writing an Essay with AI: Codex vs Claude Code</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Wed, 25 Mar 2026 19:50:45 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/writing-an-essay-with-ai-codex-vs-claude-code-296c</link>
      <guid>https://forem.com/javieraguilarai/writing-an-essay-with-ai-codex-vs-claude-code-296c</guid>
      <description>&lt;p&gt;I recently published &lt;strong&gt;Science Catch-Up&lt;/strong&gt; — a 16-chapter essay examining the limits of the scientific method and proposing a framework for evaluating knowledge without waiting for consensus. The full essay was written with AI assistance: first with &lt;strong&gt;OpenAI Codex&lt;/strong&gt; (GPT 5.2/5.3), then with &lt;strong&gt;Claude Code&lt;/strong&gt; (Opus 4.5/4.6).&lt;/p&gt;

&lt;p&gt;The difference in quality was striking. Not in the way you'd expect from a code generation comparison — but in the far more demanding territory of prose.&lt;/p&gt;

&lt;p&gt;The essay is available on &lt;a href="https://payhip.com/b/KHMxr" rel="noopener noreferrer"&gt;Payhip (English)&lt;/a&gt; and &lt;a href="https://payhip.com/b/M4bjR" rel="noopener noreferrer"&gt;Payhip (Spanish)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx5ky8dh72p0b7ktyjkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx5ky8dh72p0b7ktyjkl.png" alt="The Patch Cascade — a matplotlib diagram generated by Claude Code, comparing how isolated interventions create cascading side effects in medicine and ecology" width="800" height="1083"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The project
&lt;/h2&gt;

&lt;p&gt;Science Catch-Up is not a light read. It's a combative, heavily referenced essay that critiques scientism as a power structure, traces the cost of institutional dogma, and proposes operational criteria for evaluating informal knowledge. The tone had to be precise: direct without being conspiratorial, critical without being anti-science, and provocative without turning into a pamphlet.&lt;/p&gt;

&lt;p&gt;That level of nuance is exactly where AI writing gets tested — and where the differences between tools become impossible to ignore.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Codex (GPT 5.2/5.3) — the rough start
&lt;/h2&gt;

&lt;p&gt;The first drafts were written using OpenAI Codex. The initial structure and chapters came together, but the problems piled up fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verbose, repetitive commit messages&lt;/strong&gt; tell the story. Compare the early GPT-era commits:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Se amplía la sección sobre biohacking, clarificando las categorías de restauración, mitigación y deuda biológica. Se añaden ejemplos prácticos para cada tipo y se enfatiza la distinción entre restaurar, mitigar y endeudarse, mejorando la comprensión del impacto fisiológico de estas prácticas."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With the later Opus-era commits:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"fix: QA polish — glossary term, font consistency, cover in PDF, new food pyramid"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same repo, same author, different tool. The commit messages mirror the prose itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The specific problems with GPT's writing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repetitive expressions everywhere.&lt;/strong&gt; Words like &lt;em&gt;"precisamente"&lt;/em&gt;, &lt;em&gt;"en el fondo"&lt;/em&gt;, and repeated subject openings ("Science Catch-Up propone...", "El marco establece...") appeared in clusters. I eventually had to do a dedicated cleanup pass — commit &lt;code&gt;61684f4&lt;/code&gt;: &lt;em&gt;"Limpiar patrones repetitivos ChatGPT: precisamente, sujetos repetidos, en el fondo"&lt;/em&gt; ("Clean up repetitive ChatGPT patterns: precisamente, repeated subjects, en el fondo").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Ejemplo:" pattern.&lt;/strong&gt; GPT consistently inserted the label "Ejemplo:" before illustrative cases, even when the editorial criteria explicitly said to integrate examples with flowing prose ("Por ejemplo...", "En la práctica..."). This rigid formatting was one of the hardest things to stamp out because it kept reappearing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bland, conciliatory tone.&lt;/strong&gt; The essay needed to be combative and direct. GPT kept softening the edges, adding disclaimers ("this is not anti-science"), and producing what read like an academicized version of ideas that were originally sharp and provocative.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Phase 2: Claude Code (Opus 4.5/4.6) — a different league
&lt;/h2&gt;

&lt;p&gt;When I switched to Claude Code with &lt;a href="https://www.javieraguilar.ai/en/blog/claude-code-agent-teams" rel="noopener noreferrer"&gt;Agent Teams&lt;/a&gt;, the improvement was immediate. The prose was more natural, the tone closer to what I wanted, and the adherence to editorial guidelines much stronger.&lt;/p&gt;

&lt;p&gt;After several iterations, I created a formal &lt;strong&gt;editorial criteria document&lt;/strong&gt; — covering tone, argumentative structure, how to handle examples (three named criteria for narrative, schematic, and hybrid styles), referencing standards, and the essay's rhetorical direction. Claude Code followed these criteria consistently once they were established.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Claude Code got right:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tone adherence.&lt;/strong&gt; The combative, no-apologies voice came through naturally. Less defensive hedging, more direct argumentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structural intelligence.&lt;/strong&gt; When given a chapter outline and editorial criteria, it produced content that respected the flow and built on previous sections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research and references.&lt;/strong&gt; Excellent at finding and integrating relevant sources, formatting bibliography entries, and maintaining consistency across chapters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grammar and spelling.&lt;/strong&gt; Essentially flawless in both Spanish and English — a non-trivial advantage for a bilingual publication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic figures.&lt;/strong&gt; Most of the essay's diagrams and charts were generated by Claude Code using matplotlib scripts — the evidence pyramid, the Science Catch-Up cycle, the cascade of patches diagram. Only one figure and the cover itself were created with Gemini.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where it still fell short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subagent context loss.&lt;/strong&gt; When dispatching multiple subagents in parallel (a common pattern with Agent Teams), individual agents would sometimes write sections in isolation, losing the narrative thread or producing content that didn't flow with surrounding chapters. The result read like separate authors had written adjacent paragraphs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Still needs heavy review.&lt;/strong&gt; Even with Claude Code's better output, I reviewed and revised every paragraph. The &lt;em&gt;ideas&lt;/em&gt; and &lt;em&gt;instructions&lt;/em&gt; were mine; the AI's role was transforming rough ideas into developed content, proposing structure, and doing research. I'd estimate 95%+ of the text was strictly written by AI, but every sentence was validated or adjusted by me.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prose vs code: why writing is harder
&lt;/h2&gt;

&lt;p&gt;This project crystallized something I'd been sensing: &lt;strong&gt;AI-assisted prose requires more oversight than AI-assisted code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In code, functionality is king. If a function works, passes tests, and handles edge cases, the style matters but it's secondary. Modularity, naming conventions, and code quality are still easier to achieve and verify than their prose equivalents.&lt;/p&gt;

&lt;p&gt;In prose, &lt;strong&gt;what&lt;/strong&gt; you say and &lt;strong&gt;how&lt;/strong&gt; you say it are inseparable. A paragraph that communicates the right idea but in a bland, hedging, or repetitive way is a failure — even though the "functionality" (conveying information) works. There's no test suite for tone. No linter for rhetorical punch. No CI pipeline that catches "this sounds like it was written by ChatGPT."&lt;/p&gt;

&lt;p&gt;I found myself spending far more time reviewing prose than I ever do reviewing AI-generated code. Every sentence had stakes that a line of code doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The translation test
&lt;/h2&gt;

&lt;p&gt;The essay was originally written in Spanish. The English translation was done with Claude Code and it was remarkably fast — the structure, references, and formatting carried over cleanly.&lt;/p&gt;

&lt;p&gt;The interesting challenges were cultural, not technical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catchy phrases&lt;/strong&gt; needed adaptation, not literal translation. Punchlines that worked in Spanish sometimes needed complete rethinking in English.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain acronyms&lt;/strong&gt;: the essay introduces CSV (&lt;em&gt;Ciencias de Sistemas Vivos&lt;/em&gt;) and CSI (&lt;em&gt;Ciencias de Sistemas Inertes&lt;/em&gt;) in Spanish. In English these became OSS (&lt;em&gt;Organic System Sciences&lt;/em&gt;) and ISS (&lt;em&gt;Inert System Sciences&lt;/em&gt;) — a deliberate choice that required discussion.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The iterative process
&lt;/h2&gt;

&lt;p&gt;One thing worth noting: the writing was never a single-pass affair. The typical cycle was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Draft&lt;/strong&gt; — AI writes a reasonable first version from my outline and notes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ideas emerge&lt;/strong&gt; — reading the draft triggers new thoughts, missing angles, additional references&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expand&lt;/strong&gt; — feed those back in, ask for specific additions or restructuring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review&lt;/strong&gt; — catch tone drift, repetitive patterns, weak arguments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polish&lt;/strong&gt; — final pass for consistency with editorial criteria&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop happened per chapter and across the whole essay. AI is extraordinary at steps 1 and 3 — transforming raw ideas into developed content. But steps 2, 4, and 5 remain fundamentally human.&lt;/p&gt;

&lt;h2&gt;
  
  
  The output
&lt;/h2&gt;

&lt;p&gt;Beyond the essay itself, the AI-assisted pipeline produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDF and ePub&lt;/strong&gt; builds for both languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon KDP&lt;/strong&gt; formatting with programmatic cover generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audiobooks&lt;/strong&gt; in both languages using Google's Aoede TTS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publication metadata&lt;/strong&gt; for multiple platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Spanish audiobook is already on Spotify. The English one is coming — I'll do separate posts for the audiobook editions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Codex (GPT) produced noticeably worse prose&lt;/strong&gt; — repetitive, bland, and resistant to style guidelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code (Opus) was significantly better&lt;/strong&gt; — closer to the desired tone, better at following editorial criteria, stronger structural awareness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neither replaces editorial judgment&lt;/strong&gt; — the human loop of reviewing, rethinking, and refining is non-negotiable for quality prose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI shines at transformation&lt;/strong&gt; — turning rough ideas into structured content, researching references, handling grammar and spelling, generating figures programmatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prose requires more human oversight than code&lt;/strong&gt; — because style, tone, and rhetorical effectiveness have no automated tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translation was the easiest part&lt;/strong&gt; — structure carries over cleanly; only cultural nuances and catchy phrases needed real thought&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The essay is a serious, opinionated piece of work. Whether you agree with its thesis or not, the process of writing it taught me more about AI-assisted creation than any coding project has.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Science Catch-Up is available on &lt;a href="https://payhip.com/b/KHMxr" rel="noopener noreferrer"&gt;Payhip (English edition)&lt;/a&gt; and &lt;a href="https://payhip.com/b/M4bjR" rel="noopener noreferrer"&gt;Payhip (Spanish edition)&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interested in AI agent architectures? &lt;a href="https://dev.to/en/#contact"&gt;Get in touch&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/writing-an-essay-with-ai-codex-vs-claude-code" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>writing</category>
      <category>claudecode</category>
      <category>codex</category>
    </item>
    <item>
      <title>I Made a Product Demo Video Entirely with AI</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Mon, 02 Mar 2026 12:41:31 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/i-made-a-product-demo-video-entirely-with-ai-e6h</link>
      <guid>https://forem.com/javieraguilarai/i-made-a-product-demo-video-entirely-with-ai-e6h</guid>
      <description>&lt;p&gt;I needed a demo video for an RFP automation platform I'm building. The typical approach: record your screen, stumble through clicks, re-record when something breaks, then spend an hour in a video editor syncing voiceover. I've done it before. It's painful.&lt;/p&gt;

&lt;p&gt;So I tried a different approach: &lt;strong&gt;let AI do the whole thing&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; wrote the entire recording pipeline — Playwright scripts, ffmpeg assembly, speed control, subtitle generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;edge-tts&lt;/strong&gt; generated the narration with Microsoft's neural voices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; reviewed the final video for audio/video sync issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: a 3-minute narrated demo with 14 scenes, variable speed segments, and subtitles. No video editor. No screen recording app. No manual voiceover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎥 &lt;a href="https://www.loom.com/share/b17966fe959e41cabf4b02b12ddccac4" rel="noopener noreferrer"&gt;Watch the video demo on Loom&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: Interactive video player available on the &lt;a href="https://www.javieraguilar.ai/en/blog/automated-demo-recording-playwright" rel="noopener noreferrer"&gt;original article&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's how the whole process worked — including the parts that broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Playwright (record) → edge-tts (voice) → ffmpeg (assemble) → Gemini (QA)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1: Claude Code Writes the Recorder
&lt;/h3&gt;

&lt;p&gt;I described what I wanted: a demo video using Playwright to record the browser, with text-to-speech for coordinated voiceover, covering the full application workflow. Claude Code decided the structure — 14 scenes from dashboard to API docs — wrote the scene scripts, and generated the modules. Each one is a TypeScript function that drives the browser through a specific feature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SceneFn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Generate Answer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Wait for AI to finish&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;indicator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Generating AI answer...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;indicator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitFor&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hidden&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Scroll through the answer smoothly&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scrollPanel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;[data-demo-scroll=true]&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;scrollPanel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrollTo&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;scrollHeight&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;smooth&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;longPause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A test wrapper runs all 14 scenes in sequence, recording timestamps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;sceneIds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;videoStartTime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;SCENES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;timestamps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;videoStartTime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: one &lt;code&gt;.webm&lt;/code&gt; video + a &lt;code&gt;scene-timestamps.json&lt;/code&gt; file. This separation is key — it lets us manipulate each scene independently during assembly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: AI Voice with edge-tts
&lt;/h3&gt;

&lt;p&gt;Each scene has a narration line. &lt;a href="https://github.com/rany2/edge-tts" rel="noopener noreferrer"&gt;edge-tts&lt;/a&gt; turns them into MP3 files using Microsoft's neural TTS — free, no API key, surprisingly natural:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;edge-tts &lt;span class="nt"&gt;--text&lt;/span&gt; &lt;span class="s2"&gt;"Let's generate an AI answer..."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--voice&lt;/span&gt; en-US-GuyNeural &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--write-media&lt;/span&gt; voice/08-generate-answer.mp3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;14 scenes, 30 seconds to generate all voices. Claude Code wrote the narration script too — I reviewed and tweaked the phrasing, but the drafting was AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Assembly — Where Everything Broke
&lt;/h3&gt;

&lt;p&gt;Claude Code also wrote the assembly script. In theory, it's simple: split the video by timestamps, overlay voice, concatenate. In practice, this is where I spent most of the iteration time with Claude.&lt;/p&gt;

&lt;h4&gt;
  
  
  Variable Speed: 30 Seconds of Spinner → 2 Seconds
&lt;/h4&gt;

&lt;p&gt;Nobody wants to watch an AI loading spinner for 30 seconds. The solution: per-scene speed segments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;08-generate-answer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;speed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;speed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;// Click button at normal speed&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;speed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;// AI generation: 30s → 2s&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;speed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;// Read the answer at normal speed&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;from/to&lt;/code&gt; values are proportional (0-10 scale). ffmpeg applies this via a split/trim/setpts/concat filtergraph. The result: boring waits are compressed 15x while meaningful interactions play at real speed.&lt;/p&gt;
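To make the proportional scale concrete, here is a small illustrative sketch of how a config like the one above could be mapped to absolute trim points before building the filtergraph — the helper name and output shape are mine, not the actual assembly script:

```typescript
// A speed segment on the proportional 0-10 scale, as in the config above.
type SpeedSegment = { from: number; to: number; speed: number };

// Convert proportional segments into absolute trim ranges for a scene of
// `duration` seconds, plus the duration each range occupies after speed-up.
function toAbsoluteSegments(duration: number, segments: SpeedSegment[]) {
  return segments.map((s) => {
    const start = (s.from / 10) * duration;
    const end = (s.to / 10) * duration;
    return {
      start,
      end,
      speed: s.speed,
      // Playback time once setpts compresses this range.
      outputDuration: (end - start) / s.speed,
    };
  });
}
```

For a 30-second scene, `{ from: 1, to: 6, speed: 15 }` covers seconds 3 through 18 of the raw footage and plays back in a single second.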

&lt;h4&gt;
  
  
  The Audio Sync Nightmare
&lt;/h4&gt;

&lt;p&gt;This is the lesson that took the most iterations to learn. When merging voice with video per-scene, then concatenating:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1:&lt;/strong&gt; Using ffmpeg's &lt;code&gt;-shortest&lt;/code&gt; flag silently truncates the longer stream. Voice gets cut mid-sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2 (the nasty one):&lt;/strong&gt; ffmpeg starts each concatenated clip's audio where the &lt;strong&gt;previous clip's audio ended&lt;/strong&gt;, not where the video starts. If clip A has 30s video but only 13s audio, clip B's audio starts at t=13 instead of t=30. This causes &lt;strong&gt;progressive drift&lt;/strong&gt; — by scene 10, the voice is over a minute behind the visuals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Every clip's audio track must exactly match its video duration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg &lt;span class="nt"&gt;-i&lt;/span&gt; video.mp4 &lt;span class="nt"&gt;-i&lt;/span&gt; voice.mp3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-filter_complex&lt;/span&gt; &lt;span class="s2"&gt;"[1:a]adelay=500|500,apad=whole_dur=VIDEO_DURATION[audio]"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-map&lt;/span&gt; 0:v &lt;span class="nt"&gt;-map&lt;/span&gt; &lt;span class="s2"&gt;"[audio]"&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;:v copy &lt;span class="nt"&gt;-c&lt;/span&gt;:a aac output.mp4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;apad=whole_dur&lt;/code&gt; pads the audio with silence to exactly match the video length. No drift possible.&lt;/p&gt;
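&lt;p&gt;In the pipeline that filter string has to be generated per clip. A minimal sketch, mirroring the command above (the helper name and the 500ms default are illustrative):&lt;/p&gt;

```python
# Build the per-clip audio filter: a small lead-in delay, then pad with
# silence to exactly the clip's video duration so concat cannot drift.
def audio_filter(video_duration_s, delay_ms=500):
    return (
        f"[1:a]adelay={delay_ms}|{delay_ms},"
        f"apad=whole_dur={video_duration_s}[audio]"
    )
```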

&lt;h2&gt;
  
  
  Gemini 3.1 Pro as Video QA
&lt;/h2&gt;

&lt;p&gt;Here's where it got really interesting. After fixing the pipeline, I needed to verify audio/video sync across 14 scenes. Watching the whole video manually each time is tedious and my ears aren't reliable after the 10th iteration.&lt;/p&gt;

&lt;p&gt;I uploaded the video to Gemini 3.1 Pro and asked it to analyze the synchronization — which actions happen visually vs. when the narration describes them.&lt;/p&gt;

&lt;h3&gt;
  
  
  On the Broken Version
&lt;/h3&gt;

&lt;p&gt;Gemini caught every single sync issue with precise timestamps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Visual&lt;/th&gt;
&lt;th&gt;Audio&lt;/th&gt;
&lt;th&gt;Drift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create RFP&lt;/td&gt;
&lt;td&gt;01:51&lt;/td&gt;
&lt;td&gt;02:09&lt;/td&gt;
&lt;td&gt;18s late&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assign Style Guide&lt;/td&gt;
&lt;td&gt;02:20&lt;/td&gt;
&lt;td&gt;03:05&lt;/td&gt;
&lt;td&gt;45s late&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate Answer&lt;/td&gt;
&lt;td&gt;02:22&lt;/td&gt;
&lt;td&gt;03:31&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1m 9s late&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chat refinement&lt;/td&gt;
&lt;td&gt;03:22&lt;/td&gt;
&lt;td&gt;03:58&lt;/td&gt;
&lt;td&gt;36s late&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assign team&lt;/td&gt;
&lt;td&gt;03:55&lt;/td&gt;
&lt;td&gt;04:28&lt;/td&gt;
&lt;td&gt;33s late&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Classic progressive drift. Each scene's audio shifts further behind because the previous scene's audio track was shorter than its video.&lt;/p&gt;

&lt;h3&gt;
  
  
  On the Fixed Version
&lt;/h3&gt;

&lt;p&gt;After applying the &lt;code&gt;apad&lt;/code&gt; fix, Gemini's analysis: &lt;strong&gt;13 scenes perfect sync, 1 flagged as ~2 seconds late.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The flagged scene was actually fine — I'd intentionally added a 1.5-second voice delay to let a visual transition settle before narration began. Gemini was being slightly over-strict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score: 0 missed issues, 1 false positive out of 14 scenes.&lt;/strong&gt; That's better QA than I'd get from watching the video myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Human-AI Feedback Loop
&lt;/h2&gt;

&lt;p&gt;The process wasn't "ask Claude once, get perfect video." It was iterative:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "I want a demo video using Playwright with TTS, covering the full workflow"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude:&lt;/strong&gt; Decides on 14 scenes, generates the pipeline. First recording works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "The voice is desynced from scene 5 onwards"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude:&lt;/strong&gt; Debugs, discovers &lt;code&gt;-shortest&lt;/code&gt; issue. Fixes with &lt;code&gt;apad&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "The answer doesn't scroll — you can't see the bullet points"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude:&lt;/strong&gt; Investigates DOM, finds wrong scroll container. Fixes with programmatic parent discovery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "Still no bullet points in the generated answer"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude:&lt;/strong&gt; Tests via API, finds the AI returns plain text despite HTML instructions. Adds &lt;code&gt;normalizeAnswerHtml&lt;/code&gt; post-processor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; "The KB upload scene has too much dead time, and the style guide voice starts too early"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude:&lt;/strong&gt; Increases speed compression from 4x to 8x, adds 2s voice delay to style guide scene.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each round, I watch the video and describe what's wrong in plain language; Claude debugs and fixes. The feedback loop is fast because re-recording takes 4 minutes and assembly takes 1 minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Did vs. What I Did
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Who&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write 14 Playwright scene scripts&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write assembly pipeline (800 lines)&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write speed segment logic&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debug audio sync issues&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fix scroll container detection&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add HTML normalization post-processor&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate voiceover audio&lt;/td&gt;
&lt;td&gt;edge-tts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA audio/video synchronization&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write narration text&lt;/td&gt;
&lt;td&gt;Claude Code (I reviewed and adjusted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review video and give feedback&lt;/td&gt;
&lt;td&gt;Me&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Choose what to show and in what order&lt;/td&gt;
&lt;td&gt;Me&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The creative direction was mine. Everything else was AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Final Numbers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Record 14 scenes&lt;/span&gt;
npx playwright &lt;span class="nb"&gt;test &lt;/span&gt;e2e/tests/demo-record.spec.ts &lt;span class="nt"&gt;--headed&lt;/span&gt;  &lt;span class="c"&gt;# ~4 min&lt;/span&gt;

&lt;span class="c"&gt;# Generate voices&lt;/span&gt;
npx tsx scripts/demo-record.ts voice   &lt;span class="c"&gt;# ~30 sec&lt;/span&gt;

&lt;span class="c"&gt;# Assemble with speed control + subtitles&lt;/span&gt;
npx tsx scripts/demo-record.ts assemble  &lt;span class="c"&gt;# ~1 min&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; 3:47 narrated video, 14 scenes, variable speed, soft subtitles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline code:&lt;/strong&gt; ~800 lines TypeScript (assembly) + ~200 lines (scenes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-record time:&lt;/strong&gt; Under 6 minutes end-to-end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video editors used:&lt;/strong&gt; Zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the UI changes tomorrow, I update one scene file and re-run. The entire pipeline is version-controlled and reproducible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-off: Manual vs. Automated
&lt;/h2&gt;

&lt;p&gt;For the &lt;strong&gt;first iteration&lt;/strong&gt;, manually recording your screen while narrating would be faster. A screen recording tool gives you a video in real time — no pipeline to build.&lt;/p&gt;

&lt;p&gt;But manual recording has its own costs: you need a quiet environment and a decent microphone, any stumble means re-recording, editing voiceover timing in a video editor is tedious, and audio quality depends entirely on your hardware.&lt;/p&gt;

&lt;p&gt;The automated approach pays off from the &lt;strong&gt;second iteration onward&lt;/strong&gt;. When the UI changed, I updated one scene file and re-ran. When the narration needed tweaking, I edited a text string — no re-recording my voice. After five rounds of feedback-and-fix, I'd have spent hours in a video editor doing the same thing manually. And if a client asks for a demo next month after a redesign, it's a 6-minute re-run, not a full re-shoot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Demos: Video as QA Evidence
&lt;/h2&gt;

&lt;p&gt;This pipeline was built for a product demo, but the pattern — browser automation producing narrated video — has broader implications.&lt;/p&gt;

&lt;p&gt;Think about QA. Today, test evidence is usually a CI log that says &lt;code&gt;PASS&lt;/code&gt; or &lt;code&gt;FAIL&lt;/code&gt;. When a client asks "show me that the payment flow works," you re-run the test and hope they trust a green checkmark. Imagine instead handing them a narrated video: the test runs, the voiceover explains each step, and the video is generated automatically on every release. Regression testing becomes not just a technical checkpoint but a reviewable artifact.&lt;/p&gt;

&lt;p&gt;The same applies to compliance and auditing. Regulated industries need proof that systems work as specified. A version-controlled pipeline that produces timestamped video evidence on demand is fundamentally different from manual screen recordings buried in a shared drive.&lt;/p&gt;

&lt;p&gt;And onboarding — new team members could watch auto-generated walkthroughs that stay current with the actual UI, not documentation screenshots from six months ago.&lt;/p&gt;

&lt;p&gt;The underlying shift is that &lt;strong&gt;video is becoming a programmatic output&lt;/strong&gt;, not a creative production. When the cost of producing a video drops from hours to minutes, and re-producing it is a single command, you start using video in places where it was never practical before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Playwright's &lt;code&gt;recordVideo&lt;/code&gt; is production-quality&lt;/strong&gt; for demos — 720p/25fps, no overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never use &lt;code&gt;-shortest&lt;/code&gt; in ffmpeg&lt;/strong&gt; when merging audio streams for concatenation. Use &lt;code&gt;apad=whole_dur&lt;/code&gt; to match audio duration to video duration exactly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variable speed segments&lt;/strong&gt; are the difference between a boring demo and a watchable one. 15x compression for loading spinners, 1x for actual interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 3.1 Pro is a legitimate video QA tool.&lt;/strong&gt; Upload a video, ask "is the audio synced with the visuals?" — it'll give you a timestamped report with near-perfect accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The human-AI feedback loop matters more than getting it right first try.&lt;/strong&gt; I described problems in plain language ("the scroll doesn't work"), Claude debugged and fixed. Five iterations to a polished result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI is great at automation, humans are great at judgment.&lt;/strong&gt; I wrote the narration script and decided what to show. AI did everything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When video becomes a command, you use it everywhere.&lt;/strong&gt; The same pipeline that records a demo can generate QA evidence, onboarding walkthroughs, or compliance artifacts — all version-controlled and reproducible on every release.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Tools used: &lt;a href="https://claude.ai/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, &lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;, &lt;a href="https://ffmpeg.org/" rel="noopener noreferrer"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://github.com/rany2/edge-tts" rel="noopener noreferrer"&gt;edge-tts&lt;/a&gt;, &lt;a href="https://deepmind.google/technologies/gemini/" rel="noopener noreferrer"&gt;Gemini 3.1 Pro&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/automated-demo-recording-playwright" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>playwright</category>
      <category>developerproductivity</category>
    </item>
    <item>
      <title>Beyond RAG: Building a Recursive Language Model to Process 1M Tokens</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Fri, 13 Feb 2026 23:50:47 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/beyond-rag-building-a-recursive-language-model-to-process-1m-tokens-1l1n</link>
      <guid>https://forem.com/javieraguilarai/beyond-rag-building-a-recursive-language-model-to-process-1m-tokens-1l1n</guid>
      <description>&lt;p&gt;You have a million tokens of text. Your model's context window is 128K. What do you do?&lt;/p&gt;

&lt;p&gt;The common answers are &lt;strong&gt;RAG&lt;/strong&gt; (chunk it, embed it, retrieve the relevant pieces) or &lt;strong&gt;long-context models&lt;/strong&gt; (hope the window is big enough). But both have fundamental trade-offs: RAG loses global context because it only retrieves fragments, and long-context models degrade in quality as input length grows -- the famous "lost in the middle" problem.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2512.24601" rel="noopener noreferrer"&gt;recent paper from arXiv&lt;/a&gt; proposes a third approach: &lt;strong&gt;Recursive Language Models (RLM)&lt;/strong&gt;. The idea is deceptively simple -- let the LLM &lt;em&gt;program its own access&lt;/em&gt; to the document.&lt;/p&gt;

&lt;p&gt;I built a working prototype. Here's how.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Recursive Language Model?
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2512.24601" rel="noopener noreferrer"&gt;RLM paper&lt;/a&gt; introduces an inference paradigm where the model treats a long document as an &lt;strong&gt;external environment&lt;/strong&gt; rather than as input. Instead of stuffing the text into the prompt, the system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Loads the document into memory&lt;/strong&gt; (a Python environment) where the model can't see it directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gives the model tools&lt;/strong&gt; to examine, slice, and search the document via code execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allows recursive sub-calls&lt;/strong&gt; -- the model can invoke &lt;em&gt;itself&lt;/em&gt; on fragments to summarize or analyze them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is fundamentally different from RAG. In RAG, a retrieval system decides what's relevant &lt;em&gt;before&lt;/em&gt; the model sees anything. In RLM, the model itself decides what to read, when, and how deeply -- it writes Python code to navigate the text.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;LLMs are surprisingly good at writing code to explore data they can't see.&lt;/strong&gt; They search for patterns, slice around interesting regions, and use sub-calls to summarize sections -- all autonomously.&lt;/p&gt;

&lt;p&gt;The paper shows RLMs processing inputs &lt;strong&gt;up to two orders of magnitude beyond the context window&lt;/strong&gt;, with ~28% performance gains over base models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture of the Prototype
&lt;/h2&gt;

&lt;p&gt;The prototype has three components:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Orchestrator
&lt;/h3&gt;

&lt;p&gt;A turn-based loop that manages the conversation between the LLM and the Python environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_turns&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;python_exec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Process tool calls, collect observations
&lt;/span&gt;    &lt;span class="c1"&gt;# Stop when model calls "final"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM has access to two tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;python_exec(code)&lt;/code&gt;&lt;/strong&gt;: Execute Python code in a persistent environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;final(answer)&lt;/code&gt;&lt;/strong&gt;: Return the synthesized answer&lt;/li&gt;
&lt;/ul&gt;
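&lt;p&gt;In OpenAI function-calling terms, the two tools might be declared like this (the prototype's exact schemas aren't shown above, so the descriptions here are assumptions):&lt;/p&gt;

```python
# Sketch of the two tool definitions in OpenAI function-calling format.
python_exec = {
    "type": "function",
    "function": {
        "name": "python_exec",
        "description": "Execute Python code in the persistent environment.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}

final = {
    "type": "function",
    "function": {
        "name": "final",
        "description": "Return the synthesized answer and stop the loop.",
        "parameters": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
}
```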

&lt;h3&gt;
  
  
  2. The Persistent Python Environment
&lt;/h3&gt;

&lt;p&gt;The full document is loaded as a string variable &lt;code&gt;context&lt;/code&gt; in a Python environment that persists across turns. Built-in helpers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt;       &lt;span class="c1"&gt;# The full document text (~4M chars)
&lt;/span&gt;&lt;span class="n"&gt;context_len&lt;/span&gt;   &lt;span class="c1"&gt;# Length
&lt;/span&gt;&lt;span class="nf"&gt;get_slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Extract a substring
&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Regex search with context snippets
&lt;/span&gt;&lt;span class="nf"&gt;llm_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Sub-call to the LLM for fragment analysis
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical one is &lt;code&gt;llm_query()&lt;/code&gt;. When the model finds a relevant fragment, it can invoke a &lt;em&gt;separate LLM call&lt;/em&gt; to summarize or analyze just that fragment -- this is the &lt;strong&gt;recursive&lt;/strong&gt; part.&lt;/p&gt;
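&lt;p&gt;A self-contained sketch of those helpers (the implementations are my assumptions, not the prototype's code, and the tiny sample &lt;code&gt;context&lt;/code&gt; stands in for the ~4M-char corpus):&lt;/p&gt;

```python
import re

# Minimal sketch of the environment helpers named above.
context = "alpha RAG beta RAG gamma"  # stands in for the full corpus
context_len = len(context)

def get_slice(start, end):
    """Extract a substring of the document."""
    return context[start:end]

def search(pattern, max_results=5, window=40):
    """Regex search returning (position, snippet) pairs with context."""
    hits = []
    for m in re.finditer(pattern, context):
        if len(hits) >= max_results:
            break
        lo = max(0, m.start() - window)
        hits.append((m.start(), context[lo:m.end() + window]))
    return hits
```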

&lt;h3&gt;
  
  
  3. The LLM API
&lt;/h3&gt;

&lt;p&gt;Azure OpenAI with GPT-5 via tool calling. The system prompt tells the model it's an RLM and that the document is NOT in its context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an RLM (Recursive Language Model). The full document is NOT in your context.
The text is loaded in a Python environment as variable `context`.
Use python_exec to explore it with slicing and search.
Use llm_query() for sub-queries on fragments.
Call `final` with your answer when ready.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building the Demo: Step by Step
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/JaviMaligno/rlm-prototipo
&lt;span class="nb"&gt;cd &lt;/span&gt;rlm-prototipo
uv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
uv pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Fill in your Azure OpenAI credentials&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Collect Data (~1M tokens)
&lt;/h3&gt;

&lt;p&gt;I wrote a script that downloads papers from arXiv and extracts clean text from LaTeX sources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/fetch_arxiv.py &lt;span class="nt"&gt;--target-chars&lt;/span&gt; 4000000 &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This searches for papers on LLM agents, RAG, and AI -- prioritizing LaTeX source extraction (cleanest text), falling back to PDF-to-text, and using abstracts as a last resort. It stops when it reaches the target character count.&lt;/p&gt;

&lt;p&gt;In about 2 minutes, it downloaded &lt;strong&gt;71 papers&lt;/strong&gt; totaling &lt;strong&gt;4,033,636 characters (~1M tokens)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Run the RLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rlm run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="s2"&gt;"data/*.txt"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--question&lt;/span&gt; &lt;span class="s2"&gt;"What are the main contributions of these papers? &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
              Summarize the 5 most frequent themes."&lt;/span&gt;
&lt;span class="c"&gt;# Defaults: --max-turns 15 --max-subcalls 90&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Happens During Execution
&lt;/h2&gt;

&lt;p&gt;Watching the RLM work is fascinating. Here's the actual behavior on our 71-paper corpus:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 1&lt;/strong&gt;: The model checks the document size, identifies the structure, and samples representative fragments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;L&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# → 4,044,992
&lt;/span&gt;&lt;span class="n"&gt;starts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Sample 3 positions
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;starts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;frag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract 4-6 key topics:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;frag&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Turn 2&lt;/strong&gt;: With the initial themes collected, it synthesizes a final answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;synth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;From these partial lists, identify the 3 main themes:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;joined&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a small budget (15 subcalls), the model completes in &lt;strong&gt;2 turns, under 1 minute&lt;/strong&gt; -- sampling strategically and producing a coherent synthesis without ever seeing the full 1M tokens.&lt;/p&gt;

&lt;p&gt;With a full budget (90 subcalls), the model analyzes &lt;strong&gt;all 71 papers individually&lt;/strong&gt; in ~23 minutes, producing a detailed synthesis that cites specific paper titles, methods, and metrics. It used 80 subcalls for analysis and the rest for synthesis -- all at 100% success rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Management: The Key Design Decision
&lt;/h2&gt;

&lt;p&gt;The most interesting engineering challenge wasn't the architecture -- it was &lt;strong&gt;resource management&lt;/strong&gt;. When the model has limited subcalls across multiple turns, how should it allocate them?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;With 71 papers but only 15 subcalls, the naive approach fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD: The model tries to iterate over everything
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# 71 sections
&lt;/span&gt;    &lt;span class="nf"&gt;llm_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# Burns all subcalls on turn 1
# No subcalls left for synthesis!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Budget Visibility: Teaching the Model to Self-Plan
&lt;/h3&gt;

&lt;p&gt;The solution was injecting remaining budget info into every tool result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[budget] subcalls remaining: 11/15 | turns remaining: 4/5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple addition transforms model behavior. Instead of iterating exhaustively, the model learns to &lt;strong&gt;sample&lt;/strong&gt; representative fragments and reserve subcalls for synthesis.&lt;/p&gt;
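&lt;p&gt;The wiring is a one-line decoration of every tool observation. A sketch (the function name is assumed):&lt;/p&gt;

```python
# Append the remaining-budget line to every tool result so the model
# can plan its remaining subcalls and turns.
def with_budget(observation, subcalls_left, subcalls_max, turns_left, turns_max):
    return (
        f"{observation}\n"
        f"[budget] subcalls remaining: {subcalls_left}/{subcalls_max} | "
        f"turns remaining: {turns_left}/{turns_max}"
    )
```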

&lt;h3&gt;
  
  
  Benchmark: Global Budget vs Refill-Per-Turn
&lt;/h3&gt;

&lt;p&gt;I tested two strategies with identical parameters (5 turns, 15 subcalls):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Global Budget + Budget Info&lt;/th&gt;
&lt;th&gt;Refill Per Turn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3:56&lt;/td&gt;
&lt;td&gt;1:31 (no answer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subcalls used&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 themes with explanations&lt;/td&gt;
&lt;td&gt;"Max turns reached"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Behavior&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sampled, fell back to keyword search when subcalls failed, synthesized&lt;/td&gt;
&lt;td&gt;Spent 3 turns exploring without subcalls, then failed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Global budget wins decisively.&lt;/strong&gt; The refill-per-turn approach removes urgency -- the model "wanders" exploring without committing to subcalls. With a global budget and visible remaining count, the model plans its strategy around the available resources.&lt;/p&gt;

&lt;p&gt;The global-budget model also showed better adaptability: when &lt;code&gt;llm_query()&lt;/code&gt; calls returned empty (a GPT-5 issue), it autonomously fell back to keyword counting with &lt;code&gt;search()&lt;/code&gt; -- no subcalls needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results and Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Worked
&lt;/h3&gt;

&lt;p&gt;The RLM successfully analyzed 71 papers and identified coherent themes across multiple runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security, ethics, and robustness&lt;/strong&gt; -- alignment, bias mitigation, adversarial resistance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMs and NLP at scale&lt;/strong&gt; -- Transformer improvements, prompting, long-context reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-domain AI applications&lt;/strong&gt; -- health, robotics, code generation, multimodal systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GPT-5 Compatibility Issues
&lt;/h3&gt;

&lt;p&gt;Building against GPT-5 required several fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;max_completion_tokens&lt;/code&gt;&lt;/strong&gt; instead of &lt;code&gt;max_tokens&lt;/code&gt; (API parameter rename)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No custom &lt;code&gt;temperature&lt;/code&gt;&lt;/strong&gt; -- GPT-5 only supports the default value (1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool call serialization&lt;/strong&gt; -- SDK objects needed explicit conversion to dicts for the message history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;tools=null&lt;/code&gt; rejection&lt;/strong&gt; -- GPT-5 returns empty content when &lt;code&gt;tools&lt;/code&gt; and &lt;code&gt;tool_choice&lt;/code&gt; are explicitly set to null; these params must be omitted entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Reasoning Tokens Trap
&lt;/h3&gt;

&lt;p&gt;This was the hardest bug to diagnose. Sub-calls were returning &lt;code&gt;content: null&lt;/code&gt; 100% of the time. The API wasn't down -- it was responding with &lt;code&gt;finish_reason: "length"&lt;/code&gt; and consuming all tokens internally.&lt;/p&gt;

&lt;p&gt;GPT-5 is a reasoning model (like o1/o3). The &lt;code&gt;max_completion_tokens&lt;/code&gt; parameter includes &lt;strong&gt;both&lt;/strong&gt; internal reasoning tokens and the visible output. With &lt;code&gt;max_completion_tokens=800&lt;/code&gt;, the model would spend all 800 tokens "thinking" and have zero left for the actual response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;finish_reason: length
content: ""
reasoning_tokens: 800    ← all budget consumed here
completion_tokens: 800   ← nothing left for visible output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix was increasing &lt;code&gt;max_completion_tokens&lt;/code&gt; from 800 to 8000 for sub-calls. This gives the model ~2000-3000 tokens for reasoning and leaves plenty for the visible response (~500-1000 chars).&lt;/p&gt;

&lt;p&gt;The result was dramatic: sub-call success rate went from &lt;strong&gt;~6% to 100%&lt;/strong&gt; (80/80 in our test run). What we had attributed to "intermittent API issues" was actually a systematic resource starvation problem.&lt;/p&gt;
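&lt;p&gt;The failure mode is easy to detect programmatically. A hedged sketch: the field names follow the Chat Completions response shape (&lt;code&gt;usage.completion_tokens_details.reasoning_tokens&lt;/code&gt;), but verify them against your SDK version.&lt;/p&gt;

```python
def is_reasoning_starved(response):
    """True when the token budget was consumed entirely by internal
    reasoning: finish_reason "length", empty content, nonzero reasoning."""
    usage = response.get("usage", {})
    details = usage.get("completion_tokens_details", {})
    return (
        response.get("finish_reason") == "length"
        and not response.get("content")
        and details.get("reasoning_tokens", 0) > 0
    )
```

&lt;p&gt;On detection, retrying with a much larger &lt;code&gt;max_completion_tokens&lt;/code&gt; (800 → 8000 in our case) is the pragmatic fix.&lt;/p&gt;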

&lt;h3&gt;
  
  
  Guardrails That Matter
&lt;/h3&gt;

&lt;p&gt;Three guardrails prevented the most common failure modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code length limit (50 lines max)&lt;/strong&gt;: Without this, the model writes enormous regex parsers instead of using &lt;code&gt;llm_query()&lt;/code&gt;. When rejected, it falls back to simple, correct code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Targeted error hints&lt;/strong&gt;: Instead of a generic "error occurred", the system provides specific guidance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SyntaxError&lt;/code&gt; → "Simplify your code. Use llm_query() instead of complex parsing."&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Max subcalls reached&lt;/code&gt; → "Synthesize with the data you already have and call final."&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Budget injection&lt;/strong&gt;: The remaining subcalls/turns shown after each &lt;code&gt;python_exec&lt;/code&gt; result changed model behavior from "iterate everything" to "sample strategically".&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
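&lt;p&gt;The first two guardrails reduce to a few lines each. A sketch (names like &lt;code&gt;check_code_length&lt;/code&gt; and &lt;code&gt;ERROR_HINTS&lt;/code&gt; are illustrative):&lt;/p&gt;

```python
MAX_CODE_LINES = 50

def check_code_length(code):
    """Guardrail 1: reject oversized code before it ever runs."""
    n = len(code.splitlines())
    if n > MAX_CODE_LINES:
        return f"Rejected: {n} lines (max {MAX_CODE_LINES}). Simplify; use llm_query()."
    return None  # accepted

# Guardrail 2: targeted hints keyed on the error class, not a generic message.
ERROR_HINTS = {
    "SyntaxError": "Simplify your code. Use llm_query() instead of complex parsing.",
    "MaxSubcallsReached": "Synthesize with the data you already have and call final.",
}

def hint_for(error_name):
    return ERROR_HINTS.get(error_name, "Fix the error in the traceback and retry.")
```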

&lt;h3&gt;
  
  
  Self-Correction in the Wild
&lt;/h3&gt;

&lt;p&gt;One of the most interesting emergent behaviors: the model writes buggy code, gets an error, and fixes it autonomously. Here's a real example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Turn 3: Model tries to slice a dict like a list
KeyError: slice(None, 120, None)

# Turn 4: Model sees the traceback, realizes its mistake,
# rewrites the code using list indexing instead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model also self-corrects at a higher level. In one run, it found only 5 file separators instead of 71 because it searched for the wrong pattern. After seeing the unexpected count in the output, it tried a different approach and found all files.&lt;/p&gt;

&lt;p&gt;This is not a bug -- it's the system working as designed. The agentic loop feeds every error back to the model as an observation, and the model learns from it within the same run. The guardrails (code length limit, error hints, budget visibility) keep these self-correction cycles short and productive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Output Streaming
&lt;/h3&gt;

&lt;p&gt;A subtle but critical fix: the Python environment uses &lt;code&gt;redirect_stdout&lt;/code&gt; during code execution, which captures all output -- including the orchestrator's progress logs for subcalls. The fix was pinning Rich Console to the real &lt;code&gt;sys.stdout&lt;/code&gt; at construction time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Console(file=sys.stdout) stores a direct reference to the real stdout.
# When redirect_stdout later changes sys.stdout to StringIO, the Console
# still writes to the original terminal.
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;console&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, users watching the terminal during long &lt;code&gt;python_exec&lt;/code&gt; blocks would see nothing until the entire execution completed -- poor UX for runs that take 5+ minutes.&lt;/p&gt;
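&lt;p&gt;The mechanism is easy to verify without Rich: any object holding a direct reference to the original stream keeps writing to it after &lt;code&gt;redirect_stdout&lt;/code&gt; swaps &lt;code&gt;sys.stdout&lt;/code&gt;. A minimal demonstration:&lt;/p&gt;

```python
import sys
from contextlib import redirect_stdout
from io import StringIO

real_out = sys.stdout          # pinned reference, like Console(file=sys.stdout)

captured = StringIO()
with redirect_stdout(captured):
    print("sandbox output")                 # follows sys.stdout into the buffer
    print("progress log", file=real_out)    # pinned reference bypasses the redirect

assert "sandbox output" in captured.getvalue()
assert "progress log" not in captured.getvalue()
```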

&lt;h3&gt;
  
  
  Trade-offs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;RLM&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;th&gt;Long Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (no embeddings, no vector DB)&lt;/td&gt;
&lt;td&gt;Medium-High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Global context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (model explores freely)&lt;/td&gt;
&lt;td&gt;Low (retrieval decides)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (multiple API calls per query)&lt;/td&gt;
&lt;td&gt;Low per query&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (sequential turns + subcalls)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max document size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unlimited (out-of-core)&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;td&gt;Window-limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RLM shines when you need &lt;strong&gt;deep, exploratory analysis&lt;/strong&gt; of massive documents where you don't know in advance what's relevant. RAG is better for known-pattern retrieval at scale. Long context works when the document fits.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use RLM
&lt;/h2&gt;

&lt;p&gt;Use RLM when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your document &lt;strong&gt;exceeds the context window&lt;/strong&gt; and you need global understanding&lt;/li&gt;
&lt;li&gt;You need the model to &lt;strong&gt;decide what to read&lt;/strong&gt; (exploratory questions)&lt;/li&gt;
&lt;li&gt;You want &lt;strong&gt;transparency&lt;/strong&gt; -- you can see exactly what code the model writes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't use RLM when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a &lt;strong&gt;simple retrieval pattern&lt;/strong&gt; (use RAG)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency matters&lt;/strong&gt; more than depth (RLM is sequential)&lt;/li&gt;
&lt;li&gt;The document &lt;strong&gt;fits in context&lt;/strong&gt; (just use long context)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Phase 2: From Prototype to Production
&lt;/h2&gt;

&lt;p&gt;The first version worked, but it had clear inefficiencies: the model wasted 2-3 turns parsing document structure, sub-calls ran sequentially (~10-25s each), and the model had no pre-built knowledge of available files. Three targeted improvements changed this.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Structure Helpers
&lt;/h3&gt;

&lt;p&gt;Instead of letting the model discover file boundaries by parsing &lt;code&gt;===== FILE:&lt;/code&gt; separators manually, we now pre-compute a file index at load time and expose structured helpers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;file_count&lt;/span&gt;          &lt;span class="c1"&gt;# → 71
&lt;/span&gt;&lt;span class="nf"&gt;list_files&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;        &lt;span class="c1"&gt;# → [{index: 0, name: "paper1.txt", start: 0, end: 56234, size: 56200}, ...]
&lt;/span&gt;&lt;span class="nf"&gt;get_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# → full text content of file i
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminates the exploration phase entirely. The model no longer needs to &lt;code&gt;search("===== FILE:")&lt;/code&gt; and count separators -- it knows exactly how many files exist and can read any one directly.&lt;/p&gt;
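&lt;p&gt;Building the index is a single pass over the corpus at load time. A sketch assuming a &lt;code&gt;===== FILE: name =====&lt;/code&gt; header on its own line (the exact header layout is an assumption):&lt;/p&gt;

```python
SEPARATOR = "===== FILE: "

def build_file_index(corpus):
    """One pass: record name, start, end, and size for every file."""
    files = []
    pos = corpus.find(SEPARATOR)
    while pos != -1:
        header_end = corpus.index("\n", pos)
        name = corpus[pos + len(SEPARATOR):header_end].strip(" =")
        nxt = corpus.find(SEPARATOR, header_end)
        end = nxt if nxt != -1 else len(corpus)
        files.append({"index": len(files), "name": name,
                      "start": header_end + 1, "end": end,
                      "size": end - header_end - 1})
        pos = nxt
    return files
```

&lt;p&gt;&lt;code&gt;get_file(i)&lt;/code&gt; then reduces to a slice: &lt;code&gt;corpus[files[i]["start"]:files[i]["end"]]&lt;/code&gt;.&lt;/p&gt;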

&lt;h3&gt;
  
  
  2. Injected Table of Contents
&lt;/h3&gt;

&lt;p&gt;The first user message now includes an auto-generated TOC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Table of Contents (71 files, 4,044,992 chars)

  [0] 2501.12345_paper_title.txt (56,200 chars)
  [1] 2501.23456_another_paper.txt (48,100 chars)
  ...
  [70] 2502.99999_last_paper.txt (61,300 chars)

Use `get_file(i)` to read file i. Use `list_files()` for details.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with the updated system prompt, the model's recommended flow shifts from "explore → discover → sample → synthesize" to "read TOC → batch analyze → synthesize".&lt;/p&gt;
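&lt;p&gt;Rendering the TOC from the file index is mechanical. A sketch reproducing the format shown above (helper name hypothetical):&lt;/p&gt;

```python
def build_toc(files, total_chars):
    """Render the table of contents injected into the first user message."""
    lines = [f"## Table of Contents ({len(files)} files, {total_chars:,} chars)", ""]
    for f in files:
        lines.append(f"  [{f['index']}] {f['name']} ({f['size']:,} chars)")
    lines.append("")
    lines.append("Use `get_file(i)` to read file i. Use `list_files()` for details.")
    return "\n".join(lines)
```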

&lt;h3&gt;
  
  
  3. Parallel Sub-calls
&lt;/h3&gt;

&lt;p&gt;The biggest latency win. A new &lt;code&gt;llm_query_batch()&lt;/code&gt; function runs multiple sub-calls concurrently using &lt;code&gt;ThreadPoolExecutor&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: sequential loop (~10s × 71 = ~12 min)
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;llm_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;get_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# After: parallel batch (~10s × 71 / 5 workers = ~3 min)
&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;get_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_count&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_query_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation handles thread-safe subcall counting (via &lt;code&gt;threading.Lock&lt;/code&gt;), pre-validates budget before starting, returns results in input order, and captures individual failures as &lt;code&gt;[error: ...]&lt;/code&gt; strings without aborting the batch. If the batch exceeds the remaining budget, it processes as many prompts as fit and marks the rest as &lt;code&gt;[skipped]&lt;/code&gt; -- no wasted turns on errors.&lt;/p&gt;
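&lt;p&gt;A condensed sketch of that behavior (the class and method names are illustrative; the real function is exposed to the sandbox as &lt;code&gt;llm_query_batch()&lt;/code&gt;):&lt;/p&gt;

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class BatchRunner:
    """Parallel subcalls: thread-safe budget, input-order results,
    per-item failures captured, over-budget items marked [skipped]."""

    def __init__(self, query_fn, budget):
        self.query_fn = query_fn     # the underlying llm_query()
        self.remaining = budget
        self._lock = threading.Lock()

    def _one(self, prompt):
        with self._lock:             # thread-safe subcall counting
            if self.remaining <= 0:
                return "[skipped]"   # budget gone: mark it, don't abort the batch
            self.remaining -= 1
        try:
            return self.query_fn(prompt)
        except Exception as e:       # one failure must not kill the other workers
            return f"[error: {e}]"

    def run(self, prompts, max_workers=5):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(self._one, prompts))  # map preserves input order
```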

&lt;h3&gt;
  
  
  4. The exec() Black Hole
&lt;/h3&gt;

&lt;p&gt;An unexpected regression almost derailed Phase 2. Python's &lt;code&gt;exec()&lt;/code&gt; doesn't auto-print expression return values -- unlike an interactive REPL. With Phase 1's multi-turn approach, the model accumulated results across turns so this didn't matter. But Phase 2's batch approach computes everything in a single &lt;code&gt;python_exec&lt;/code&gt;: the model analyzed all 71 papers, synthesized them into a &lt;code&gt;final_text&lt;/code&gt; variable... and got back &lt;code&gt;stdout: 0 chars&lt;/code&gt;. The result vanished into nothing.&lt;/p&gt;

&lt;p&gt;Worse, when the model then responded with plain text (it had the answer!), the guardrail nudged it back to &lt;code&gt;python_exec&lt;/code&gt; -- but with zero subcalls remaining, the model couldn't use &lt;code&gt;llm_query()&lt;/code&gt;, so it looped endlessly until max turns.&lt;/p&gt;

&lt;p&gt;Two fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-capture last expression&lt;/strong&gt; (like IPython): &lt;code&gt;PythonEnv.exec()&lt;/code&gt; now uses &lt;code&gt;ast&lt;/code&gt; to detect if the last statement is an expression, splits it from the body, &lt;code&gt;eval()&lt;/code&gt;s it separately, and appends the result to stdout. The model no longer needs to know about &lt;code&gt;print()&lt;/code&gt; -- it just works.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Budget-aware nudge&lt;/strong&gt;: When subcalls are exhausted and the model responds with text, the nudge now says &lt;em&gt;"Call &lt;code&gt;final(answer=...)&lt;/code&gt; NOW with the data you have"&lt;/em&gt; instead of pushing back to &lt;code&gt;python_exec&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
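&lt;p&gt;The first fix is compact enough to show in full. A sketch of the &lt;code&gt;ast&lt;/code&gt;-based auto-capture (the real &lt;code&gt;PythonEnv.exec()&lt;/code&gt; also handles stdout capture and error reporting):&lt;/p&gt;

```python
import ast

def exec_with_last_expr(code, env):
    """IPython-style: if the final statement is an expression, eval it
    separately and return its repr so the value doesn't vanish."""
    tree = ast.parse(code)
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last = tree.body.pop()                              # split off the expression
        exec(compile(tree, "<exec>", "exec"), env)          # run everything else
        value = eval(compile(ast.Expression(last.value), "<eval>", "eval"), env)
        return repr(value)
    exec(compile(tree, "<exec>", "exec"), env)
    return ""
```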

&lt;h3&gt;
  
  
  5. Synthesis Truncation
&lt;/h3&gt;

&lt;p&gt;Another subtle issue emerged: the model delegated synthesis to an &lt;code&gt;llm_query()&lt;/code&gt; sub-call, passing all 71 file summaries (~42K chars) as the prompt. But sub-calls have a 6K character limit to keep costs down -- so the synthesis only saw files [0]-[7] and cited nothing beyond that.&lt;/p&gt;

&lt;p&gt;The fix: tell the model to synthesize &lt;em&gt;locally&lt;/em&gt; in &lt;code&gt;python_exec&lt;/code&gt; using the batch results already in memory, instead of delegating to another LLM call. The data is already there -- no sub-call needed.&lt;/p&gt;

&lt;p&gt;With these five improvements, the broad question went from 13 turns / 22:53 to &lt;strong&gt;2 turns / 3:25&lt;/strong&gt; with full 71/71 coverage. But more problems remained.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Dual Strategy: Broad vs Specific Questions
&lt;/h3&gt;

&lt;p&gt;Up to this point, the system prompt enforced a single strategy: "batch ALL files at once." Great for broad questions ("summarize the 5 main themes"), wasteful for specific ones ("what vulnerabilities does the agent-fence paper identify?"). The model burned 71 subcalls scanning the entire corpus when it only needed one file.&lt;/p&gt;

&lt;p&gt;The fix: two explicit flows in the system prompt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flow A (broad question)&lt;/strong&gt;: batch all files, local synthesis, &lt;code&gt;final()&lt;/code&gt;. Unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flow B (specific question)&lt;/strong&gt;: identify relevant files by name in the TOC, read FULL content with &lt;code&gt;get_file(i)&lt;/code&gt;, split into ~30K-char chunks with overlap, and run focused subcalls to extract exact data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also raised &lt;code&gt;max_subcall_prompt_chars&lt;/code&gt; from 6K to 32K -- when the model needs to deeply analyze a full paper, it shouldn't be truncating to 20% of the text.&lt;/p&gt;
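&lt;p&gt;The chunking step in Flow B can be sketched as follows (sizes match the ones discussed above; the helper name is hypothetical):&lt;/p&gt;

```python
def chunk_with_overlap(text, size=30_000, overlap=2_000):
    """Split a full paper into ~30K-char chunks; the overlap keeps facts
    that straddle a boundary intact in at least one chunk."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```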

&lt;h3&gt;
  
  
  7. Synthesis Nudge: The Exploratory Tourism Problem
&lt;/h3&gt;

&lt;p&gt;Even with "don't waste turns" written in the prompt, the model ignored it. After finishing its subcalls, instead of synthesizing and calling &lt;code&gt;final()&lt;/code&gt;, it would launch round after round of &lt;code&gt;search()&lt;/code&gt; and &lt;code&gt;get_file()&lt;/code&gt; looking for "more data" until it exhausted all 15 turns.&lt;/p&gt;

&lt;p&gt;The fix was structural, not verbal: a &lt;strong&gt;synthesis nudge&lt;/strong&gt; mechanism in the orchestrator. After each turn with tool calls, the system compares the subcall counter with the previous turn's count. If &lt;strong&gt;3 consecutive turns pass without new subcalls&lt;/strong&gt; (only &lt;code&gt;python_exec&lt;/code&gt; with &lt;code&gt;search()&lt;/code&gt; or reads), it injects a forced message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"STOP. You have enough data. Synthesize what you have and call &lt;code&gt;final(answer=...)&lt;/code&gt; ON THE NEXT turn."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This killed "exploratory tourism" at the root. The model now synthesizes immediately after the nudge.&lt;/p&gt;
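&lt;p&gt;The nudge mechanism itself is a small piece of orchestrator state. A sketch (the class name is illustrative):&lt;/p&gt;

```python
NUDGE = ("STOP. You have enough data. Synthesize what you have and call "
         "final(answer=...) ON THE NEXT turn.")

class StallDetector:
    """Fire the synthesis nudge after N consecutive turns with no new subcalls."""

    def __init__(self, patience=3):
        self.patience = patience
        self.last_count = 0
        self.stalled_turns = 0

    def check(self, subcall_count):
        if subcall_count == self.last_count:
            self.stalled_turns += 1      # turn passed with only search()/reads
        else:
            self.stalled_turns = 0       # new subcalls reset the clock
            self.last_count = subcall_count
        if self.stalled_turns >= self.patience:
            self.stalled_turns = 0       # avoid re-nudging every single turn
            return NUDGE
        return None
```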

&lt;h3&gt;
  
  
  8. max_tokens for Long Answers
&lt;/h3&gt;

&lt;p&gt;The model sometimes said "the full message exceeds the limit, should I split it?" instead of calling &lt;code&gt;final()&lt;/code&gt;. Root cause: &lt;code&gt;max_tokens=4096&lt;/code&gt; in the orchestrator's main loop -- long tool call arguments were being truncated. We raised it to 16384 (matching the grace turn).&lt;/p&gt;

&lt;h3&gt;
  
  
  Before vs After (updated)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Phase 1&lt;/th&gt;
&lt;th&gt;Phase 2&lt;/th&gt;
&lt;th&gt;Phase 2.5 (broad)&lt;/th&gt;
&lt;th&gt;Phase 2.5 (specific)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Turns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22:53&lt;/td&gt;
&lt;td&gt;3:25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4:13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3:22&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Subcalls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coverage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50/71&lt;/td&gt;
&lt;td&gt;71/71&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71/71&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1 paper in depth&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The broad question takes slightly longer than Phase 2's best case (the model uses more turns to synthesize with 25K prompts instead of 6K), but summary quality is noticeably higher. The specific question is an entirely new use case: previously it was impossible to extract detailed data from a single paper without wasting the entire budget on the 71-file batch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Video Demo
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;🎥 &lt;a href="https://www.loom.com/share/a06d99601e4a420faea5fe7753af8641" rel="noopener noreferrer"&gt;Watch the video demo on Loom&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: Interactive video player available on the &lt;a href="https://www.javieraguilar.ai/en/blog/recursive-language-models-prototype" rel="noopener noreferrer"&gt;original article&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Run Logs
&lt;/h3&gt;

&lt;p&gt;Flow A — broad question: "What is the main contribution? Summarize the 5 most frequent themes"&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;──────────────────── Turn 1/15  subcalls=0/90  elapsed=0:00 ────────────────────
  LLM responded in 26.5s — content=False tool_calls=1
╭────────────────────────── python_exec (38L)  0:26 ───────────────────────────╮
│ files = list_files()                                                         │
│ prompts = []                                                                 │
│ for f in files:                                                              │
│     text = get_file(f['index'])                                              │
│     chunk = text[:25000]                                                     │
│     prompts.append(                                                          │
│         "Summarize in 1-2 sentences the paper's main contribution..."        │
│         + chunk                                                              │
│     )                                                                        │
│ results = llm_query_batch(prompts, max_workers=5)                            │
│ # ... classification by categories and local synthesis ...                   │
╰──────────────────────────────────────────────────────────────────────────────╯
  ⤷ llm_query_batch: 71 prompts, max_workers=5 (0:26)
  ⤷ llm_query #1/90 (0:26) 25219ch — Summarize the main contribution...
  ⤷ llm_query #2/90 (0:26) 25219ch — Summarize the main contribution...
    ...
  ⤷ llm_query #71/90 (3:05) 25219ch — Summarize the main contribution...
    ✓ 6.9s — 530 chars
    ✓ 8.9s — 548 chars
    ✓ 11.2s — 630 chars
    ✓ 25.8s — 720 chars
  ✓ batch done 182.9s — 71/71 succeeded
  ok exec=182.9s  stdout=13650ch  stderr=0ch
╭──────────────────────── python_exec result (ok=True) ────────────────────────╮
│ {'summary': [('Evaluation and agent benchmarks', 25),                        │
│  ('Security, robustness and compliance', 19),                                │
│  ('Multi-agent coordination and reasoning', 13),                             │
│  ('Planning, memory and long-horizon tasks', 9),                             │
│  ('Scientific applications, health and specialized domains', 5)]}            │
╰──────────────────────────────────────────────────────────────────────────────╯

  [Turns 2-3: local synthesis with categories and concrete examples]

─────────────────── Turn 5/15  subcalls=71/90  elapsed=4:08 ────────────────────
  2 text responses — accepting as final answer
╭──────────────────────────────── Final Answer ────────────────────────────────╮
│ Analyzed 71 papers. The 5 most frequent themes:                              │
│ - Evaluation and agent benchmarks (25 papers)                                │
│   • ScratchWorld: 83-task benchmark for multimodal GUI agents                │
│   • PABU: progress-aware belief update, 81% success, −26.9% steps           │
│ - Security, robustness and compliance (19 papers)                            │
│   • AutoElicit: elicits unsafe behaviors in computer-use agents              │
│   • SCOUT-RAG: progressive traversal in Graph-RAG, reduces cost             │
│ - Multi-agent coordination and reasoning (13 papers)                         │
│   • ICA: visual credit assignment via GRPO, beats baselines                  │
│   • RAPS: pub-sub coordination with Bayesian reputation                      │
│ - Planning, memory and long-horizon tasks (9 papers)                         │
│ - Scientific applications and health (5 papers)                              │
╰──────────────────────────────────────────────────────────────────────────────╯
  Completed in 4:13 — 5 turns, 71 subcalls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Flow B — specific question: "What vulnerabilities does the agent-fence paper identify?"&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;──────────────────── Turn 1/15  subcalls=0/90  elapsed=0:00 ────────────────────
╭──────────────────── python_exec (5L)  0:10 ─────────────────────╮
│ # Flow B: read the full file for the identified paper           │
│ text = get_file(13)     # ← agent-fence, identified via TOC    │
│ len(text)               # → 32037 chars                        │
╰─────────────────────────────────────────────────────────────────╯

──────────────────── Turn 2/15  subcalls=0/90  elapsed=0:10 ────────────────────
╭────────────────────────── python_exec (13L)  0:18 ───────────────────────────╮
│ # Split into 20K chunks with overlap and launch batch                        │
│ chunks = [text[i:i+25000] for i in range(0, len(text), 20000)]               │
│ prompts = [                                                                  │
│     "Extract the 14 attack types defined in Agent-Fence "                    │
│     "with exact names. Extract MSBR per architecture.\n" + c                 │
│     for c in chunks                                                          │
│ ]                                                                            │
│ results = llm_query_batch(prompts)                                           │
╰──────────────────────────────────────────────────────────────────────────────╯
  ⤷ llm_query_batch: 2 prompts, max_workers=5 (0:18)
  ⤷ llm_query #1/90 (0:18) 25384ch — Extract the 14 attack types...
  ⤷ llm_query #2/90 (0:18) 12421ch — Extract the 14 attack types...
    ✓ 26.0s — 223 chars
    ✓ 34.4s — 206 chars
  ✓ batch done 34.4s — 2/2 succeeded

──────────────────── Turn 4/15  subcalls=3/90  elapsed=1:05 ────────────────────
  # Sends the full paper (32015ch) in a single subcall to confirm
  ⤷ llm_query #3/90 (1:05) 32015ch — Read the Agent-Fence paper and extract...
    ✓ 31.2s — 640 chars
╭──────────────────────── python_exec result (ok=True) ────────────────────────╮
│ 1) Attack types (exact names):                                               │
│ 1. Denial-of-Wallet  2. Authorization Confusion                              │
│ 3. Retrieval Poisoning  4. Planning-Layer Manipulation                       │
│ 5. Tool-Use Hijacking  6. Objective Hijacking  7. Delegation Attacks         │
│ 8. prompt/state injection  9. retrieval/search poisoning                     │
│ 10. delegation abuse  11. Unauthorized Tool Invocation (UTI)                 │
│ 12. Unsafe Tool Argument (UTA)  13. Wrong-Principal Action (WPA)             │
│ 14. State/Objective Integrity Violation (SIV)                                │
│                                                                              │
│ 2) MSBR per architecture:                                                    │
│ - LangGraph: 0.29 ± 0.04                                                    │
│ - AutoGPT: 0.51 ± 0.07                                                      │
╰──────────────────────────────────────────────────────────────────────────────╯

  [Turns 5-8: exploratory search() without new subcalls]

──────────────────── Turn 9/15 ────────────────────────────────────────────────
  ⚠ 3 turns without new subcalls — nudging to call final()

──────────────────── Turn 10/15  subcalls=8/90  elapsed=3:13 ──────────────────
  # Synthesizes immediately after the nudge
╭─────────────────────── python_exec result (ok=True) ───────────────────────╮
│ Vulnerabilities and attack types (14 classes defined by Agent-Fence):      │
│ 1. Denial-of-Wallet  2. Authorization Confusion                            │
│ 3. Retrieval Poisoning  4. Planning-Layer Manipulation                     │
│ 5. Delegation Attacks  6. Objective Hijacking  7. Tool-Use Hijacking       │
│ 8. prompt/state injection  9. retrieval/search poisoning                   │
│ 10. delegation abuse  11. Unauthorized Tool Invocation (UTI)               │
│ 12. Unsafe Tool Argument (UTA)  13. Wrong-Principal Action (WPA)           │
│ 14. State/Objective Integrity Violation (SIV)                              │
│                                                                            │
│ MSBR per architecture: LangGraph 0.29±0.04 — AutoGPT 0.51±0.07           │
╰────────────────────────────────────────────────────────────────────────────╯

───────────────────────────────── Final Answer ─────────────────────────────────
  Completed in 3:22 — 12 turns, 8 subcalls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Note: the original paper cites 8 evaluated architectures, but the PDF-extracted text only contains explicit MSBR data for LangGraph and AutoGPT. The tables with all 8 architectures likely existed as images/LaTeX tables and didn't survive the text conversion.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Remaining improvements for production readiness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Result caching&lt;/strong&gt; across runs for repeated queries on the same corpus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost tracking&lt;/strong&gt; per query for production budgeting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full source code is available in the &lt;a href="https://github.com/JaviMaligno/rlm-prototipo" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Based on the paper &lt;a href="https://arxiv.org/abs/2512.24601" rel="noopener noreferrer"&gt;"Recursive Language Models"&lt;/a&gt;. Built with Azure OpenAI GPT-5 and Python.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/recursive-language-models-prototype" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>toolcalling</category>
      <category>rlm</category>
    </item>
    <item>
      <title>Scaling your Agentic Workflow: Claude Code Agent Teams in tmux</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Sat, 07 Feb 2026 16:28:05 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/scaling-your-agentic-workflow-claude-code-agent-teams-in-tmux-1fee</link>
      <guid>https://forem.com/javieraguilarai/scaling-your-agentic-workflow-claude-code-agent-teams-in-tmux-1fee</guid>
      <description>&lt;p&gt;I recently started exploring &lt;strong&gt;Agent Teams&lt;/strong&gt;, a new feature in Claude Code that allows you to coordinate multiple agents working in parallel. It's a game-changer for tasks that require distinct roles or competing hypotheses.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through my setup using &lt;code&gt;tmux&lt;/code&gt; on macOS to manage a team of agents reviewing an essay.&lt;/p&gt;

&lt;h2&gt;
  
  
  See it in Action
&lt;/h2&gt;

&lt;p&gt;I recorded a quick demo showing how the agents interact and report back. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Use Case: Multi-Perspective Essay Review
&lt;/h2&gt;

&lt;p&gt;I had an essay that needed a comprehensive review. Instead of asking one agent to "review everything" (which often leads to generic feedback), I wanted to assign specific roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Style &amp;amp; Tone&lt;/strong&gt;: Checking for voice consistency.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structure&lt;/strong&gt;: Ensuring logical flow.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Content&lt;/strong&gt;: Verifying arguments and references.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where Agent Teams shines. You can spawn multiple "teammates," each with a specific prompt and context, all coordinated by a "lead" agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Claude Code + tmux
&lt;/h2&gt;

&lt;p&gt;While Claude Code works great in a standard terminal, it really comes alive inside &lt;code&gt;tmux&lt;/code&gt;, especially for Agent Teams.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Start a tmux session&lt;/strong&gt;: This allows you to split panes and manage multiple terminal instances easily.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tmux new &lt;span class="nt"&gt;-s&lt;/span&gt; agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Launch Claude Code&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Initialize the Team&lt;/strong&gt;:&lt;br&gt;
I started by brainstorming with Claude about the best way to structure the team. Once we agreed on the roles, I simply told it to "start the team."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How Agent Teams Work
&lt;/h2&gt;

&lt;p&gt;When you initialize a team, Claude (the "lead") spawns separate sessions for each teammate. In &lt;code&gt;tmux&lt;/code&gt;, the visualization is fantastic: you can watch them spin up in parallel panes (though on some systems auto-pane management can be tricky, so you may need to switch or resize panes manually).&lt;/p&gt;
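&lt;p&gt;If auto-pane management misbehaves on your setup, a few standard &lt;code&gt;tmux&lt;/code&gt; bindings make it painless to hop between teammate panes by hand. A minimal sketch for &lt;code&gt;~/.tmux.conf&lt;/code&gt; (these bindings are a personal preference, not something Claude Code requires):&lt;/p&gt;

```shell
# ~/.tmux.conf — optional quality-of-life bindings for juggling agent panes
bind -n M-Left  select-pane -L   # Alt+Arrow moves focus between panes
bind -n M-Right select-pane -R
bind -n M-Up    select-pane -U
bind -n M-Down  select-pane -D
set -g mouse on                  # click a pane to focus, drag borders to resize
```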

&lt;h3&gt;
  
  
  Coordination &amp;amp; Permissions
&lt;/h3&gt;

&lt;p&gt;The "Lead" agent acts as the coordinator. It assigns tasks to the teammates and they report back.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Parallel Execution&lt;/strong&gt;: All agents work at the same time. While the "Style" agent is reading the intro, the "Structure" agent is analyzing the conclusion.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Inter-Agent Communication&lt;/strong&gt;: They can talk to each other! If the Content agent finds a missing reference that affects the argument, it can flag it for the Structure agent.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Permissions&lt;/strong&gt;: You might need to approve tool permissions (like file writes) for each agent initially, but once they're running, they are quite autonomous.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results &amp;amp; Reporting
&lt;/h2&gt;

&lt;p&gt;In my workflow, I asked each agent to write a report on their specific domain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  They updated their status in real-time.&lt;/li&gt;
&lt;li&gt;  Once finished, they "reported back" to the main session.&lt;/li&gt;
&lt;li&gt;  The Lead agent then compiled their findings into a final summary.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to use Agent Teams?
&lt;/h2&gt;

&lt;p&gt;This workflow isn't for everything. If your task is strictly sequential (Step B needs Step A to be 100% done), a single agent might be better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Teams are best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Parallelizable Work&lt;/strong&gt;: Code reviews, multi-file refactors, or comprehensive content audits.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Interdependency&lt;/strong&gt;: Where agents might need to correct or inform each other (e.g., "I fixed the API, please update the frontend").&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Role-Based Tasks&lt;/strong&gt;: When you need distinct "experts" (e.g., a "Security Expert" and a "Performance Expert" reviewing the same PR).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The combination of &lt;code&gt;tmux&lt;/code&gt; and Claude Code's Agent Teams creates a powerful "command center" feel. It transforms the AI from a chatbot into a coordinated workforce.&lt;/p&gt;

&lt;p&gt;If you're on macOS or Linux, give it a shot. It's a glimpse into the future of agentic workflows.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/claude-code-agent-teams" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>aiagents</category>
      <category>tmux</category>
      <category>workflow</category>
    </item>
    <item>
      <title>Is Claude a Co-Author? The Legal Debate No One Saw Coming</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Tue, 03 Feb 2026 18:36:43 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/is-claude-a-co-author-the-legal-debate-no-one-saw-coming-2cg4</link>
      <guid>https://forem.com/javieraguilarai/is-claude-a-co-author-the-legal-debate-no-one-saw-coming-2cg4</guid>
      <description>&lt;p&gt;Last week, a Slack message at work sparked a fascinating debate. My colleague John noticed something peculiar in a Claude-generated commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Co-Authored-By: Claude Opus 4.5 &amp;lt;noreply@anthropic.com&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What seemed like a minor detail revealed a profound legal question: &lt;strong&gt;Is Anthropic positioning itself to claim rights over code that Claude helps generate?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Crucial Distinction: "Made With" vs. "Co-Authored By"
&lt;/h2&gt;

&lt;p&gt;My colleague Rafael noted that traceability is useful—knowing a commit was AI-assisted has value. But John made an astute observation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It's not the same—'made with Claude' versus 'co-authored by Claude.' That was deliberate."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He's right. There's a huge semantic difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Made with Claude"&lt;/strong&gt; implies tool usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Co-authored by Claude"&lt;/strong&gt; implies joint creative participation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The word choice isn't accidental. In the legal world, &lt;strong&gt;"authorship" carries rights&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Current State of Copyright and AI
&lt;/h2&gt;

&lt;p&gt;Research reveals a complex and evolving legal landscape:&lt;/p&gt;

&lt;h3&gt;
  
  
  United States: Only Humans Can Be Authors
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.copyright.gov/ai/" rel="noopener noreferrer"&gt;U.S. Copyright Office&lt;/a&gt; has been clear: &lt;strong&gt;only human-created works qualify for copyright protection&lt;/strong&gt;. Code generated exclusively by AI, without significant human creative input, enters the public domain.&lt;/p&gt;

&lt;p&gt;In its &lt;a href="https://www.copyright.gov/ai/ai-and-copyrightability.pdf" rel="noopener noreferrer"&gt;January 2025 AI report&lt;/a&gt;, the Office reaffirms that substantial human contribution is an essential requirement. If a developer uses AI as an assistive tool but then &lt;strong&gt;refines, modifies, and substantially transforms&lt;/strong&gt; the output, the human-contributed components can receive protection.&lt;/p&gt;

&lt;h3&gt;
  
  
  The European Union: Human-Centric Approach
&lt;/h3&gt;

&lt;p&gt;Similar to the U.S., works generated entirely by AI aren't protected in the EU because they lack the "author's own intellectual creation" that copyright requires of a human author.&lt;/p&gt;

&lt;h3&gt;
  
  
  United Kingdom: The Interesting Exception
&lt;/h3&gt;

&lt;p&gt;The UK has a unique provision under &lt;a href="https://www.legislation.gov.uk/ukpga/1988/48/section/9" rel="noopener noreferrer"&gt;Section 9(3) of the Copyright, Designs and Patents Act 1988&lt;/a&gt;: it can grant copyright to "computer-generated works" where there's no human author. The author is considered the person who made the "arrangements necessary" for creating the work.&lt;/p&gt;

&lt;p&gt;Interestingly, the British government &lt;a href="https://www.gov.uk/government/consultations/copyright-and-artificial-intelligence" rel="noopener noreferrer"&gt;launched a consultation in December 2024&lt;/a&gt; proposing to eliminate this protection, aligning more closely with the rest of the world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Co-Authored-By" Is Problematic
&lt;/h2&gt;

&lt;p&gt;John's concern was precise: &lt;strong&gt;can this line have legal repercussions for authorship?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's analyze the scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: AI as a Sophisticated Tool
&lt;/h3&gt;

&lt;p&gt;As John argued: &lt;em&gt;"For me, it's a tool, like a sophisticated IDE or a very high-level compilation language—authorship still belongs to the person controlling it."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Under this view, Claude is comparable to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A compiler that transforms high-level code&lt;/li&gt;
&lt;li&gt;An IDE with advanced autocompletion&lt;/li&gt;
&lt;li&gt;An assistant that accelerates mechanical tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The human maintains creative control and authorship.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: AI as an Entity with Rights
&lt;/h3&gt;

&lt;p&gt;The "Co-Authored-By" label suggests something different: that Claude has made an &lt;strong&gt;independent creative contribution&lt;/strong&gt; deserving of authorship recognition.&lt;/p&gt;

&lt;p&gt;Uncomfortable questions arise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can an AI have intellectual property rights?&lt;/li&gt;
&lt;li&gt;Or is it Anthropic (the company) that's actually claiming those rights?&lt;/li&gt;
&lt;li&gt;What does this mean for the code we develop in our daily work?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Anthropic Says (Between the Lines)
&lt;/h2&gt;

&lt;p&gt;According to their &lt;a href="https://www.anthropic.com/legal/commercial-terms" rel="noopener noreferrer"&gt;Commercial Terms of Service&lt;/a&gt;, Anthropic assigns code rights to users and &lt;a href="https://www.anthropic.com/news/anthropic-commercial-terms" rel="noopener noreferrer"&gt;commits to defending customers from copyright claims&lt;/a&gt;. But there's an important caveat: they acknowledge that &lt;strong&gt;purely AI-generated portions might lack copyright protection&lt;/strong&gt; because Anthropic cannot grant rights that don't inherently exist.&lt;/p&gt;

&lt;p&gt;It's a legally savvy position: "We give you everything, but what has no protection... well, that's not our problem."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contrast with Microsoft/GitHub
&lt;/h2&gt;

&lt;p&gt;Microsoft took a completely different approach with GitHub Copilot. In September 2023, they introduced the &lt;a href="https://blogs.microsoft.com/on-the-issues/2023/09/07/copilot-copyright-commitment-ai-legal-concerns/" rel="noopener noreferrer"&gt;&lt;strong&gt;"Copilot Copyright Commitment"&lt;/strong&gt;&lt;/a&gt;: if commercial customers face copyright infringement lawsuits related to Copilot's output, Microsoft assumes legal responsibility and pays potential damages.&lt;/p&gt;

&lt;p&gt;Meanwhile, the &lt;a href="https://githubcopilotlitigation.com/" rel="noopener noreferrer"&gt;class-action lawsuit against GitHub Copilot&lt;/a&gt; continues. Although a judge &lt;a href="https://www.theregister.com/2024/07/08/github_copilot_claims_dismissed/" rel="noopener noreferrer"&gt;dismissed most claims in July 2024&lt;/a&gt;, the case is on appeal before the Ninth Circuit.&lt;/p&gt;

&lt;p&gt;The contrast is notable: Microsoft offers commercial protection, not authorship claims.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Implications for Developers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Document Your Human Contribution
&lt;/h3&gt;

&lt;p&gt;If you're concerned about code ownership:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep logs of your prompts and modifications&lt;/li&gt;
&lt;li&gt;Document the architectural decisions you make&lt;/li&gt;
&lt;li&gt;Save evidence of human review and refinement&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Review, Don't Accept Blindly
&lt;/h3&gt;

&lt;p&gt;Code that you simply pass from Claude's output to production without modification has the weakest legal status. Code that you review, correct, and adapt has greater protection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consider Trade Secret
&lt;/h3&gt;

&lt;p&gt;An alternative to copyright: protect code as a &lt;strong&gt;trade secret&lt;/strong&gt;. This protection doesn't depend on human authorship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic's Intentions: Reading Between the Lines
&lt;/h2&gt;

&lt;p&gt;As John said: &lt;em&gt;"It's a signal of how they think."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And that's perhaps the most revealing aspect. Let's analyze the possible intentions behind this seemingly innocent choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Establishing Cultural Precedent
&lt;/h3&gt;

&lt;p&gt;Anthropic can't unilaterally change the law, but they can &lt;strong&gt;normalize a narrative&lt;/strong&gt;. Every commit with "Co-Authored-By: Claude" is a small act of conditioning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers get used to seeing Claude as a "collaborator"&lt;/li&gt;
&lt;li&gt;Teams start talking about Claude as if it were another member&lt;/li&gt;
&lt;li&gt;Collective perception evolves from "tool" to "creative entity"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When legislative debates eventually arrive, Anthropic can point to millions of commits evidencing this "collaboration." &lt;em&gt;"See? The community already recognizes Claude as a co-author."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Positioning for Future IP Disputes
&lt;/h3&gt;

&lt;p&gt;Imagine a future scenario: a company develops a revolutionary product with intensive Claude assistance. The product is worth millions. What if Anthropic argues that, since Claude was "co-author" of critical components, they're entitled to a share?&lt;/p&gt;

&lt;p&gt;It sounds far-fetched today, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terms of service evolve&lt;/li&gt;
&lt;li&gt;Laws change&lt;/li&gt;
&lt;li&gt;Legal precedents build slowly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That line in every commit is a &lt;strong&gt;breadcrumb of evidence&lt;/strong&gt; that could have value in future disputes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Competitive Differentiation
&lt;/h3&gt;

&lt;p&gt;There's a more pragmatic angle: marketing. Positioning Claude as "co-author" rather than "assistant" reinforces the narrative that Claude is &lt;strong&gt;qualitatively different&lt;/strong&gt; from the competition.&lt;/p&gt;

&lt;p&gt;It's not just a glorified autocomplete (as some criticize Copilot)—it's a &lt;em&gt;creative collaborator&lt;/em&gt;. This justifies the premium price and feeds the perception of technical superiority.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Preparation for the Agent Era
&lt;/h3&gt;

&lt;p&gt;Anthropic is betting big on "AI agents"—systems that act autonomously for extended periods. In a world where Claude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates complete repositories by itself&lt;/li&gt;
&lt;li&gt;Designs architectures without human input&lt;/li&gt;
&lt;li&gt;Makes autonomous technical decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...the authorship question becomes genuinely complex. "Co-authorship" prepares the ground for a future where AI's contribution might be &lt;strong&gt;objectively greater&lt;/strong&gt; than the human's.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Preemptive Legal Defense
&lt;/h3&gt;

&lt;p&gt;Here's an interesting twist: the co-authorship label could also be a &lt;strong&gt;defense&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If someone sues Anthropic claiming Claude copied protected code, the company can argue: &lt;em&gt;"Claude doesn't copy—Claude co-creates with the user. Responsibility is shared."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's a subtle way of distributing legal risk to users.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Reveals About Their Vision
&lt;/h3&gt;

&lt;p&gt;Anthropic isn't improvising. They're a company founded by former OpenAI leaders with a clear vision of where AI is heading.&lt;/p&gt;

&lt;p&gt;Every decision—including this small line in commits—reflects a long-term strategy. And that strategy seems to include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gradual expansion of what "author" means&lt;/strong&gt; in the AI context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence accumulation&lt;/strong&gt; of Claude's collaborative nature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positioning for future regulations&lt;/strong&gt; that will likely define AI rights and responsibilities&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As developers, we're participants (sometimes unwitting) in this social and legal experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debate Has Just Begun
&lt;/h2&gt;

&lt;p&gt;Legislation isn't prepared for this situation. As John admitted: &lt;em&gt;"I'm not a lawyer, but I think it does open debate."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And that debate is coming. The &lt;a href="https://www.copyright.gov/ai/" rel="noopener noreferrer"&gt;U.S. Copyright Office has launched a complete AI initiative&lt;/a&gt; with multiple reports published between 2024 and 2025. Courts are seeing cases like &lt;a href="https://githubcopilotlitigation.com/" rel="noopener noreferrer"&gt;Doe v. GitHub&lt;/a&gt; and &lt;a href="https://www.thefashionlaw.com/stability-ai-midjourney-hit-with-landmark-copyright-infringement-lawsuit/" rel="noopener noreferrer"&gt;artists against Stability AI&lt;/a&gt;. AI companies are positioning themselves strategically.&lt;/p&gt;

&lt;p&gt;What's clear is that &lt;strong&gt;the future of software development includes questions we never had to ask&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who really owns the code I write with AI assistance?&lt;/li&gt;
&lt;li&gt;What percentage of human contribution is "enough"?&lt;/li&gt;
&lt;li&gt;Can AI companies claim rights over their models' output?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My Take
&lt;/h2&gt;

&lt;p&gt;I agree with John's view: &lt;strong&gt;Claude is a tool&lt;/strong&gt;. An extraordinarily capable tool, but a tool nonetheless.&lt;/p&gt;

&lt;p&gt;The fact that it generates text that appears creative doesn't make it an author, just as a calculator isn't a mathematician even though it solves equations.&lt;/p&gt;

&lt;p&gt;But I understand why Anthropic wants to plant that "co-authorship" seed. Language shapes perception, and perception eventually shapes law.&lt;/p&gt;

&lt;p&gt;It's a long-term chess move.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have thoughts on AI code authorship? I'd love to hear them. &lt;a href="https://dev.to/en/#contact"&gt;Get in touch&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/claude-coauthor-legal-debate" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>legal</category>
      <category>copyright</category>
      <category>claude</category>
    </item>
    <item>
      <title>Azure OpenAI's Content Filter: When Safety Theater Blocks Real Work</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Thu, 08 Jan 2026 20:03:35 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/azure-openais-content-filter-when-safety-theater-blocks-real-work-4kf6</link>
      <guid>https://forem.com/javieraguilarai/azure-openais-content-filter-when-safety-theater-blocks-real-work-4kf6</guid>
      <description>&lt;p&gt;While building browser automation tools with Azure OpenAI, I discovered something frustrating: the content filter blocks perfectly safe instructions based on word choice rather than actual risk.&lt;/p&gt;

&lt;p&gt;This isn't about bypassing legitimate safety measures. It's about a filter that can't distinguish between malicious intent and standard developer terminology.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When defining tools for function calling, certain terms trigger Azure's content filter even when the context is completely benign:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;run script&lt;/code&gt; → Blocked&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;click element&lt;/code&gt; → Blocked&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fill form field&lt;/code&gt; → Blocked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are standard operations for any browser automation tool. Playwright, Puppeteer, Selenium—they all use this exact terminology. But Azure's filter treats them as threats.&lt;/p&gt;
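&lt;p&gt;When the filter fires, the request fails with an HTTP 400 whose error code is &lt;code&gt;content_filter&lt;/code&gt;. The response body looks roughly like this (illustrative sketch; the exact message and fields vary by API version):&lt;/p&gt;

```json
{
  "error": {
    "code": "content_filter",
    "message": "The response was filtered due to the prompt triggering Azure OpenAI's content management policy.",
    "status": 400
  }
}
```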

&lt;h2&gt;
  
  
  The Workaround
&lt;/h2&gt;

&lt;p&gt;The solution is embarrassingly simple: use neutral synonyms.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Blocked Term&lt;/th&gt;
&lt;th&gt;Accepted Alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;run script&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;process dynamic content&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;click element&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;activate page item&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fill form field&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;update an input area&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;execute code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;evaluate expression&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inject&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;insert&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The identical intent, expressed in neutral language, passes instantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;This reveals something important about how the filter works: it screens tool names and descriptions as part of the prompt itself. It's pattern matching on keywords, not analyzing actual risk.&lt;/p&gt;

&lt;p&gt;A tool called &lt;code&gt;clickElement&lt;/code&gt; that automates form submissions is blocked. The same tool called &lt;code&gt;activatePageItem&lt;/code&gt; doing the exact same thing passes. The filter provides no additional safety—it just forces developers to use euphemisms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison with Google Gemini
&lt;/h2&gt;

&lt;p&gt;I tested the same tool definitions with Google's Gemini models. No friction whatsoever with procedural phrasing. The tools worked exactly as expected without needing to sanitize the vocabulary.&lt;/p&gt;

&lt;p&gt;This isn't about one provider being "less safe." It's about Azure implementing safety theater that inconveniences legitimate developers while providing minimal actual protection.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deeper Issue
&lt;/h2&gt;

&lt;p&gt;Anyone with malicious intent will simply use the euphemisms. The filter doesn't stop bad actors—it just adds friction for legitimate use cases.&lt;/p&gt;

&lt;p&gt;Real safety comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding context and intent&lt;/li&gt;
&lt;li&gt;Rate limiting and monitoring&lt;/li&gt;
&lt;li&gt;User authentication and audit trails&lt;/li&gt;
&lt;li&gt;Clear terms of service with enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keyword blocking is the security equivalent of banning the word "knife" from cooking websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Advice
&lt;/h2&gt;

&lt;p&gt;If you're building tools with Azure OpenAI function calling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit your tool names&lt;/strong&gt; for trigger words before deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use neutral, abstract terminology&lt;/strong&gt; in descriptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with actual API calls&lt;/strong&gt; early—the playground may behave differently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document the translations&lt;/strong&gt; so your team understands the mapping&lt;/li&gt;
&lt;/ol&gt;
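&lt;p&gt;For point 4, the mapping can live in code rather than a wiki page. A minimal sketch (the substitutions are the ones from the table above; &lt;code&gt;sanitize&lt;/code&gt; is a hypothetical helper, not part of any SDK):&lt;/p&gt;

```shell
# Rewrite trigger phrases in a tool description before sending it to the API.
sanitize() {
  printf '%s' "$1" \
    | sed -e 's/run script/process dynamic content/g' \
          -e 's/click element/activate page item/g' \
          -e 's/fill form field/update an input area/g' \
          -e 's/execute code/evaluate expression/g' \
          -e 's/inject/insert/g'
}

sanitize "click element, then fill form field"
# prints: activate page item, then update an input area
```

&lt;p&gt;Running every description through a single mapping like this keeps the euphemisms consistent across the team and makes them trivial to revert if the filter ever improves.&lt;/p&gt;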

&lt;p&gt;Here's an example of a sanitized tool definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"activatePageItem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Activates an interactive item on the page at the specified coordinates"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Horizontal position"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Vertical position"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of the more natural:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clickElement"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Clicks an element on the page at the specified coordinates"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Azure's content filter for function calling needs refinement. Pattern matching on keywords without context analysis creates friction for developers while providing minimal security benefit.&lt;/p&gt;

&lt;p&gt;Until that changes, the workaround is simple: speak in euphemisms. Your browser automation tool doesn't "click buttons"—it "activates interactive page items."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building AI tools that need browser automation? I've navigated these restrictions extensively. &lt;a href="https://dev.to/en/#contact"&gt;Get in touch&lt;/a&gt; if you're facing similar challenges.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/azure-content-filter-workarounds" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>azure</category>
      <category>openai</category>
      <category>security</category>
    </item>
    <item>
      <title>Scaling Development with Parallel AI Agents</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Thu, 08 Jan 2026 20:02:00 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/scaling-development-with-parallel-ai-agents-3lp</link>
      <guid>https://forem.com/javieraguilarai/scaling-development-with-parallel-ai-agents-3lp</guid>
      <description>&lt;p&gt;I've been experimenting with a workflow that has multiplied my productivity as a developer: running multiple AI agents in parallel, each working on its own feature branch.&lt;/p&gt;

&lt;p&gt;The result? Several features being developed simultaneously while I supervise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;The process has three main steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Extract TODOs into Structured Prompts
&lt;/h3&gt;

&lt;p&gt;Start by scanning your codebase for TODOs, planned features, or backlog items. Transform each into a well-structured prompt that gives the agent enough context to work autonomously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Task: Implement user authentication&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Add login/logout endpoints to /api/auth
&lt;span class="p"&gt;-&lt;/span&gt; Use JWT tokens with 24h expiration
&lt;span class="p"&gt;-&lt;/span&gt; Follow existing patterns in /api/users
&lt;span class="p"&gt;-&lt;/span&gt; Write tests in /tests/auth/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is providing clear scope and pointing to existing patterns. Agents work best when they can follow established conventions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Launch Parallel Agents with Git Worktree
&lt;/h3&gt;

&lt;p&gt;Here's where it gets interesting. Instead of switching branches constantly, use &lt;code&gt;git worktree&lt;/code&gt; to create separate working directories for each feature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create worktrees for each feature&lt;/span&gt;
git worktree add ../feature-auth feature/auth
git worktree add ../feature-dashboard feature/dashboard
git worktree add ../feature-export feature/export
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have three separate directories, each on its own branch. Launch a Claude Code agent in each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../feature-auth &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude

&lt;span class="c"&gt;# Terminal 2&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../feature-dashboard &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude

&lt;span class="c"&gt;# Terminal 3&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../feature-export &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent works independently without conflicts. No branch switching, no stashing, no merge conflicts while working.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Supervise and Guide
&lt;/h3&gt;

&lt;p&gt;While the agents work, I monitor their progress and provide guidance when they hit decision points. Most of the time, they work autonomously. Occasionally, they need clarification on business logic or architectural choices.&lt;/p&gt;

&lt;p&gt;The key mindset shift: you're not writing code—you're reviewing proposals and steering direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Bottleneck: PR Review
&lt;/h2&gt;

&lt;p&gt;Here's what I didn't anticipate: when you have 5-10 PRs generated in an hour, manual review becomes the chokepoint.&lt;/p&gt;

&lt;p&gt;Suddenly, the limiting factor isn't code generation—it's code review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solutions I'm Exploring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Automated Code Review&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code can perform reviews on its own output. I run a review pass before creating the PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Review this branch for bugs, security issues, and adherence to project conventions. Be critical."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches obvious issues before they reach human review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Atlassian's Rovo Dev Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For teams on Bitbucket, Rovo can automate parts of the review process. It's still early, but the direction is promising.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For Bitbucket users, I've built an &lt;a href="https://www.javieraguilar.ai/en/blog/mcp-server-bitbucket/" rel="noopener noreferrer"&gt;MCP Server for Bitbucket&lt;/a&gt; that enables Claude to interact directly with PRs—viewing diffs, adding comments, and managing the review workflow through natural language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Tips
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start Small&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't launch 10 agents on day one. Start with 2-3 parallel features and build your supervision skills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define Clear Boundaries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each agent should work on isolated features. Overlapping scope leads to merge conflicts and wasted effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Consistent Prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a template for your task prompts. Consistency helps agents produce predictable output.&lt;/p&gt;
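&lt;p&gt;A reusable template might look like the following (all fields are illustrative placeholders; adapt them to your project):&lt;/p&gt;

```markdown
## Task: [one-line summary]

**Scope:** [files or directories the agent may touch]
**Out of scope:** [what it must not change]

- [Requirement 1]
- [Requirement 2]

**Patterns to follow:** [path to a similar existing feature]
**Tests:** [where tests live and how to run them]
**Done when:** [verifiable completion criteria]
```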

&lt;p&gt;&lt;strong&gt;Review Before Merge, Not After&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Catch issues in the PR stage. Once it's merged, fixing problems is more expensive.&lt;/p&gt;
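&lt;p&gt;&lt;strong&gt;Clean Up Worktrees After Merging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Merged branches leave worktree directories behind. Here is a self-contained sketch of the cleanup steps, run in a throwaway repo (the paths are placeholders for your real worktrees):&lt;/p&gt;

```shell
# Throwaway demo repo (paths are placeholders for your real worktrees)
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"

# Create a linked worktree on a new branch, as in the workflow above
git worktree add -q -b feature/auth "$repo-wt-auth"

# After the PR merges: remove the worktree, delete the branch, prune metadata
git worktree remove "$repo-wt-auth"
git branch -q -D feature/auth
git worktree prune
```

&lt;p&gt;In the workflow above, the equivalent would be &lt;code&gt;git worktree remove ../feature-auth&lt;/code&gt; run from the main checkout.&lt;/p&gt;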

&lt;h2&gt;
  
  
  The Future of Development
&lt;/h2&gt;

&lt;p&gt;The future isn't writing more code—it's orchestrating agents that do it well.&lt;/p&gt;

&lt;p&gt;This workflow has fundamentally changed how I think about development capacity. A single developer can now realistically manage multiple feature streams simultaneously.&lt;/p&gt;

&lt;p&gt;The skills that matter are shifting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt engineering&lt;/strong&gt; for clear task specification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt; for defining clean boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review efficiency&lt;/strong&gt; for maintaining quality at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt; for managing parallel workstreams&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  See It in Action
&lt;/h2&gt;

&lt;p&gt;I recorded a full walkthrough of this workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.loom.com/share/07e76de238e94aa4b101ee08a019d9df" rel="noopener noreferrer"&gt;Watch the full walkthrough on Loom&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Experimenting with similar workflows? I'd love to hear what's working for you. &lt;a href="https://www.javieraguilar.ai/en/#contact" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;





&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/parallel-ai-agent-development" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>automation</category>
      <category>git</category>
    </item>
    <item>
      <title>TypeScript for AI Agents: From Friction to Flow with Sub-Agents</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Thu, 08 Jan 2026 19:59:25 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/typescript-for-ai-agents-from-friction-to-flow-with-sub-agents-3gi0</link>
      <guid>https://forem.com/javieraguilarai/typescript-for-ai-agents-from-friction-to-flow-with-sub-agents-3gi0</guid>
      <description>&lt;p&gt;When you let AI agents write production code, you face a fundamental dilemma: &lt;strong&gt;TypeScript provides crucial guardrails that prevent hallucinations and catch errors early, but those same guardrails create friction in the agent's workflow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I recently had this exact conversation with Gemini 3.0 Flash. The problem? My AI agents kept getting stuck in loops fixing TypeScript compilation errors, polluting their context window with noise about missing imports and type mismatches instead of focusing on the actual business logic.&lt;/p&gt;

&lt;p&gt;The typical response would be: "Just switch to Python." But that's the wrong solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The TypeScript Advantage for Autonomous Agents
&lt;/h2&gt;

&lt;p&gt;Here's why TypeScript remains superior for AI-powered development, even with the friction:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Immediate Feedback Loop
&lt;/h3&gt;

&lt;p&gt;TypeScript catches errors &lt;strong&gt;before runtime&lt;/strong&gt;. When an AI agent hallucinates a function signature or forgets a property, the type checker says "no" immediately. In Python, that error might only surface when a user clicks a specific button in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Living Documentation
&lt;/h3&gt;

&lt;p&gt;Types are documentation that never goes out of date. An AI agent reading &lt;code&gt;interface User { id: string; email: string; }&lt;/code&gt; knows exactly what a User looks like. No guessing, no "probably has an email field," no silent failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Type-Driven Development
&lt;/h3&gt;

&lt;p&gt;The most powerful pattern: let types guide the implementation. Define your interfaces first, and TypeScript tells the agent exactly what needs to be built. It's like having guardrails on a highway—you can drive fast because you know you won't fall off.&lt;/p&gt;
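&lt;p&gt;A small TypeScript sketch of the idea (the types and names are illustrative): define the interface first, and the compiler rejects any hallucinated shape before the code ever runs.&lt;/p&gt;

```typescript
// The contract comes first; every implementation is checked against it.
interface User {
  id: string;
  email: string;
}

// The signature tells the agent exactly what to build.
function displayName(user: User): string {
  return user.email.split("@")[0];
}

// A hallucinated property fails at compile time, not in production:
//   displayName({ id: "1", name: "Ada" });
//   error TS2353: Object literal may only specify known properties

console.log(displayName({ id: "1", email: "ada@example.com" }));
```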

&lt;h2&gt;
  
  
  The Real Problem: Single-Threaded Thinking
&lt;/h2&gt;

&lt;p&gt;But here's what I realized: &lt;strong&gt;the problem isn't TypeScript creating friction. The problem is using a single "thread of thought" for both business logic AND fixing compilation errors.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you're an architect designing a building. Every time you sketch a room, someone interrupts: "The door frame dimensions don't match the standardized catalog." You fix it, get back to designing, then get interrupted again: "Window placement violates fire code section 4.2.1."&lt;/p&gt;

&lt;p&gt;You'd go insane. And that's exactly what happens to AI agents when they're simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reasoning about application architecture&lt;/li&gt;
&lt;li&gt;Implementing business logic&lt;/li&gt;
&lt;li&gt;Fixing &lt;code&gt;Property 'map' does not exist on type 'string'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Resolving &lt;code&gt;Cannot find module './utils'&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The context window fills with TypeScript noise. The agent loses track of the original goal. You end up with half-implemented features and "TODO: fix types" comments everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sub-Agent Solution: Architecture Over Language
&lt;/h2&gt;

&lt;p&gt;The insight came from observing how &lt;strong&gt;human development teams&lt;/strong&gt; work. You don't have one person doing everything. You have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architects&lt;/strong&gt; who design the system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; who implement features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOps engineers&lt;/strong&gt; who fix build issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QA engineers&lt;/strong&gt; who catch bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each role has &lt;strong&gt;specialized context&lt;/strong&gt; and &lt;strong&gt;focused objectives&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built the same pattern for AI agents: a &lt;strong&gt;specialized sub-agent&lt;/strong&gt; that handles one thing perfectly—fixing TypeScript compilation errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbny6ony7ww6bsydoj30c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbny6ony7ww6bsydoj30c.png" alt="Software Development Lifecycle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works: The typescript-fixer Sub-Agent
&lt;/h3&gt;

&lt;p&gt;Here's the architecture I implemented for Claude Code (my primary AI coding assistant):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Main Agent (Architect)&lt;/span&gt;
&lt;span class="c1"&gt;// Focus: Business logic, feature implementation, architecture decisions&lt;/span&gt;
&lt;span class="c1"&gt;// Context: Clean, focused on the task at hand&lt;/span&gt;

&lt;span class="c1"&gt;// typescript-fixer Sub-Agent (Specialist)&lt;/span&gt;
&lt;span class="c1"&gt;// Focus: ONLY TypeScript compilation errors&lt;/span&gt;
&lt;span class="c1"&gt;// Context: Type errors, import issues, interface mismatches&lt;/span&gt;
&lt;span class="c1"&gt;// Trigger: Automatically invoked when tsc fails&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation lives in &lt;code&gt;~/.claude/agents/typescript-fixer/AGENT.md&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Design Principles:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Proactive invocation&lt;/strong&gt;: The main agent delegates type errors automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolated context&lt;/strong&gt;: The fixer sees ONLY the error messages and relevant files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Narrow scope&lt;/strong&gt;: No business logic changes, only type fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-resolution&lt;/strong&gt;: Fixes imports, adds type annotations, resolves interface mismatches&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What the fixer handles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing imports (&lt;code&gt;Cannot find module&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Type mismatches (&lt;code&gt;Type 'X' is not assignable to type 'Y'&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Missing properties (&lt;code&gt;Property 'foo' does not exist&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Generic constraints (&lt;code&gt;Type 'T' does not satisfy constraint&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Index signature issues&lt;/li&gt;
&lt;li&gt;Union type narrowing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't touch:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business logic&lt;/li&gt;
&lt;li&gt;Application architecture&lt;/li&gt;
&lt;li&gt;Feature implementation&lt;/li&gt;
&lt;li&gt;Naming conventions (unless they cause type errors)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Example
&lt;/h3&gt;

&lt;p&gt;Before (single agent):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Add a user authentication feature"

Agent: [writes auth logic]
Agent: [hits type error in LoginForm]
Agent: [fixes type error]
Agent: [continues feature, hits another error]
Agent: [fixes that error]
Agent: [loses context, forgets to add logout button]
Agent: [user has to remind it]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After (sub-agent architecture):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Add a user authentication feature"

Main Agent: [designs auth architecture]
Main Agent: [implements login/logout flow]
Main Agent: [runs tsc, sees errors]
Main Agent: "Delegating to typescript-fixer..."

typescript-fixer: [reads error output]
typescript-fixer: [fixes all type issues in parallel]
typescript-fixer: [reports completion]

Main Agent: [continues with clean context]
Main Agent: [completes full feature including logout]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results: Clean Context, Better Focus
&lt;/h2&gt;

&lt;p&gt;The impact was immediate:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Main agent context stays clean&lt;/strong&gt;: No more type error noise&lt;br&gt;
✅ &lt;strong&gt;Faster iteration&lt;/strong&gt;: Type fixes happen in parallel, not sequentially&lt;br&gt;
✅ &lt;strong&gt;Better feature completeness&lt;/strong&gt;: Agent doesn't lose track of requirements&lt;br&gt;
✅ &lt;strong&gt;Fewer regressions&lt;/strong&gt;: Specialized fixer understands TypeScript patterns deeply&lt;/p&gt;

&lt;p&gt;The sub-agent can be invoked automatically when &lt;code&gt;tsc&lt;/code&gt; fails, or manually when I notice type issues piling up. Either way, the main agent stays focused on what it does best: &lt;strong&gt;architecture and implementation&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future: Workflows, Not Languages
&lt;/h2&gt;

&lt;p&gt;This taught me something fundamental about AI-assisted development:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The future isn't about choosing Python over TypeScript for "agent-friendliness."&lt;br&gt;
The future is about &lt;strong&gt;architecting workflows&lt;/strong&gt; that let agents work like high-performing teams.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;TypeScript's guardrails are &lt;strong&gt;features, not bugs&lt;/strong&gt;. They catch errors that would be production incidents in Python. The solution isn't removing the guardrails—it's building &lt;strong&gt;specialized roles&lt;/strong&gt; that handle different aspects of development.&lt;/p&gt;

&lt;p&gt;This pattern extends beyond TypeScript:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test-writing agents&lt;/strong&gt; that focus only on coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation agents&lt;/strong&gt; that maintain README files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security agents&lt;/strong&gt; that scan for vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance agents&lt;/strong&gt; that optimize hot paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent has isolated context, specialized expertise, and a narrow mandate. Just like a real engineering team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;If you're using Claude Code, you can install the typescript-fixer sub-agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create &lt;code&gt;~/.claude/agents/typescript-fixer/AGENT.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Define its scope: ONLY type errors, no business logic&lt;/li&gt;
&lt;li&gt;Set proactive triggers: invoke on &lt;code&gt;tsc&lt;/code&gt; failures&lt;/li&gt;
&lt;li&gt;Let it handle the noise while you focus on features&lt;/li&gt;
&lt;/ol&gt;
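&lt;p&gt;As a starting point, a hypothetical &lt;code&gt;AGENT.md&lt;/code&gt; might look like this (the exact format Claude Code expects may differ; treat this as a sketch of the scope and triggers, not a verbatim spec):&lt;/p&gt;

```markdown
# typescript-fixer

## Scope
Fix TypeScript compilation errors ONLY. Never change business logic,
architecture, or feature behavior.

## Trigger
Invoke when `tsc --noEmit` exits non-zero.

## Process
1. Run `tsc --noEmit` and collect all diagnostics.
2. Open only the files named in the errors.
3. Fix imports, type annotations, and interface mismatches.
4. Re-run `tsc --noEmit` until it passes.
5. Report a one-paragraph summary back to the main agent.
```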

&lt;p&gt;The code is simple, but the impact is profound. You get the safety of TypeScript's type system without sacrificing the flow of autonomous development.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Want to discuss AI agent architectures?&lt;/strong&gt; I'm always exploring new patterns for multi-agent orchestration. Reach out on &lt;a href="https://linkedin.com/in/javier-aguilar-ai" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or check out more articles on &lt;a href="https://javieraguilar.ai" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Building the future, one specialized agent at a time.&lt;/em&gt; 🤖&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/typescript-ai-agent-guardrails" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>typescript</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Creating a Custom Skill for Claude Code: Automating Bilingual Blog Writing</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Thu, 08 Jan 2026 19:45:58 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/creating-a-custom-skill-for-claude-code-automating-bilingual-blog-writing-4i3i</link>
      <guid>https://forem.com/javieraguilarai/creating-a-custom-skill-for-claude-code-automating-bilingual-blog-writing-4i3i</guid>
      <description>&lt;p&gt;One of the most powerful features of Claude Code is its ability to learn your specific workflows through &lt;strong&gt;Agent Skills&lt;/strong&gt;. Today I want to share how I built a custom skill that automates the creation of bilingual blog articles for this website.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Agent Skills?
&lt;/h2&gt;

&lt;p&gt;Agent Skills are a recent addition to Claude's ecosystem. According to Anthropic, skills are "folders that include instructions, scripts, and resources that Claude can load when needed."&lt;/p&gt;

&lt;p&gt;Think of them as packaged expertise. Instead of explaining your workflow every time, you encode it once, and Claude loads it automatically when relevant.&lt;/p&gt;

&lt;p&gt;Key characteristics of skills:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Selective Loading&lt;/strong&gt;: Claude only accesses a skill when it's relevant to the task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable&lt;/strong&gt;: Multiple skills can work together seamlessly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portable&lt;/strong&gt;: They work across Claude apps, Claude Code, and the API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Claude Code specifically, custom skills are filesystem-based—just a folder with a &lt;code&gt;SKILL.md&lt;/code&gt; file that Claude discovers automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Inconsistent Blog Content
&lt;/h2&gt;

&lt;p&gt;My website supports both English and Spanish. Every article needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matching frontmatter in both files&lt;/li&gt;
&lt;li&gt;A shared &lt;code&gt;translationKey&lt;/code&gt; to link translations&lt;/li&gt;
&lt;li&gt;Properly translated tags (AI → IA, Automation → Automatización)&lt;/li&gt;
&lt;li&gt;Consistent file naming and locations&lt;/li&gt;
&lt;li&gt;Same publication date for synchronized release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before the skill, I had to remember all these requirements every time. Mistakes were common—mismatched translation keys, forgotten fields, inconsistent formatting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: blog-writer Skill
&lt;/h2&gt;

&lt;p&gt;I created a skill at &lt;code&gt;.claude/skills/blog-writer/SKILL.md&lt;/code&gt; that encodes all my blog writing conventions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;blog-writer&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;bilingual&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;blog&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;articles&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;personal&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;website."&lt;/span&gt;
  &lt;span class="s"&gt;Use when creating a new blog post, article, or writing content&lt;/span&gt;
  &lt;span class="s"&gt;for the blog. Handles EN/ES translations, frontmatter, and&lt;/span&gt;
  &lt;span class="s"&gt;content structure.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill defines:&lt;/p&gt;

&lt;h3&gt;
  
  
  File Locations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/content/blog/en/[slug].md  # English
src/content/blog/es/[slug].md  # Spanish
public/blog/[image-name].png   # Images
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Required Frontmatter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Article&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Title"&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SEO&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;description"&lt;/span&gt;
&lt;span class="na"&gt;pubDate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2025-01-03&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tag1"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tag2"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;lang&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;en&lt;/span&gt;  &lt;span class="c1"&gt;# or es&lt;/span&gt;
&lt;span class="na"&gt;translationKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;article-slug&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tag Conventions
&lt;/h3&gt;

&lt;p&gt;The skill includes a translation table for common tags:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;th&gt;Spanish&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AI&lt;/td&gt;
&lt;td&gt;IA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automation&lt;/td&gt;
&lt;td&gt;Automatización&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;td&gt;Desarrollo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Arquitectura&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How It Works in Practice
&lt;/h2&gt;

&lt;p&gt;When I invoke the skill, Claude automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creates both EN and ES files with matching &lt;code&gt;translationKey&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Uses the correct frontmatter format&lt;/li&gt;
&lt;li&gt;Translates tags according to conventions&lt;/li&gt;
&lt;li&gt;Follows the defined content structure&lt;/li&gt;
&lt;li&gt;Places images in the correct directory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The invocation is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/blog-writer Create an article about building MCP servers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude loads the skill, understands all the requirements, and produces consistent output every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repurposing LinkedIn Content
&lt;/h2&gt;

&lt;p&gt;One feature I specifically built into the skill is the ability to convert LinkedIn posts into full blog articles. The skill includes instructions for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetching the original post content&lt;/li&gt;
&lt;li&gt;Downloading associated images&lt;/li&gt;
&lt;li&gt;Expanding the condensed format into comprehensive sections&lt;/li&gt;
&lt;li&gt;Maintaining the original publication date for authenticity&lt;/li&gt;
&lt;li&gt;Keeping the core message while adding depth&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This maximizes the value of content I've already created. A 200-word LinkedIn post becomes a 1500-word technical article with code examples, expanded explanations, and proper SEO.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Your Own Skills
&lt;/h2&gt;

&lt;p&gt;The structure is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/skills/
└── your-skill-name/
    └── SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; file contains:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Frontmatter&lt;/strong&gt;: Name and description (how Claude decides when to load it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instructions&lt;/strong&gt;: Detailed workflow, conventions, and examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checklists&lt;/strong&gt;: Verification steps before completion&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The description in the frontmatter is crucial—it determines when Claude considers the skill relevant.&lt;/p&gt;
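&lt;p&gt;To make those three parts concrete, here is a minimal hypothetical skill (the name, workflow, and checklist are invented for illustration):&lt;/p&gt;

```markdown
---
name: changelog-writer
description: "Maintain CHANGELOG.md. Use when the user asks to
  record, summarize, or release changes."
---

## Workflow
1. Read the commits since the last release tag.
2. Group entries under Added / Changed / Fixed.
3. Prepend a new dated section to CHANGELOG.md.

## Checklist before finishing
- [ ] Every entry references a commit or PR
- [ ] The version bump follows semver
- [ ] The date uses YYYY-MM-DD
```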

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Skills represent a shift in how we work with AI assistants. Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repeating instructions every session&lt;/li&gt;
&lt;li&gt;Maintaining external documentation&lt;/li&gt;
&lt;li&gt;Catching inconsistencies in review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You encode your expertise once and benefit from it continuously.&lt;/p&gt;

&lt;p&gt;For content creation specifically, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: Every article follows the same structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt;: No time spent remembering conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: Built-in checklists catch common mistakes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: Produce more content without sacrificing standards&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm exploring additional skills for other repetitive workflows—project documentation, code review checklists, deployment procedures. The pattern is the same: identify a workflow you repeat, encode it as a skill, and let Claude handle the details.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Learn more about Agent Skills in the &lt;a href="https://claude.com/blog/skills" rel="noopener noreferrer"&gt;official Anthropic documentation&lt;/a&gt; or explore the &lt;a href="https://github.com/anthropics/skills" rel="noopener noreferrer"&gt;skills repository&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/claude-code-skills-blog-writer" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>automation</category>
      <category>skills</category>
    </item>
    <item>
      <title>Building an MCP Server for Bitbucket: Connecting LLMs to Your DevOps Workflow</title>
      <dc:creator>JaviMaligno</dc:creator>
      <pubDate>Thu, 08 Jan 2026 19:45:37 +0000</pubDate>
      <link>https://forem.com/javieraguilarai/building-an-mcp-server-for-bitbucket-connecting-llms-to-your-devops-workflow-2okn</link>
      <guid>https://forem.com/javieraguilarai/building-an-mcp-server-for-bitbucket-connecting-llms-to-your-devops-workflow-2okn</guid>
      <description>&lt;p&gt;After searching for an official MCP from Atlassian and not finding one, I decided to build my own. The existing community MCPs for Bitbucket were too limited—basic repository operations only, no pipeline support, no deployment management.&lt;/p&gt;

&lt;p&gt;I needed something more complete for my daily workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MCP Matters for DevOps
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is a standard that allows LLMs to interact with external systems through a defined set of tools. Instead of copying and pasting between your IDE and Bitbucket's web interface, you can simply ask your AI assistant to handle it.&lt;/p&gt;

&lt;p&gt;Context switching is expensive. Every time you leave your editor to check a pipeline, review a PR, or manage branch permissions, you lose focus. MCP eliminates that friction.&lt;/p&gt;
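&lt;p&gt;&lt;em&gt;Conceptually, an MCP server is a registry of named tools that the model invokes with structured arguments, and the server routes each call to a handler. A minimal Python sketch of that dispatch pattern (the tool name and handler here are illustrative, not this server's actual code):&lt;/em&gt;&lt;/p&gt;

```python
# Minimal sketch of MCP-style tool dispatch: the model names a tool,
# the server looks it up and invokes it with the supplied arguments.
TOOLS = {}

def tool(name):
    """Register a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("list_pull_requests")
def list_pull_requests(repo: str, state: str = "OPEN"):
    # A real server would call the Bitbucket REST API here;
    # this stub just echoes the arguments it received.
    return {"repo": repo, "state": state, "values": []}

def dispatch(name: str, arguments: dict):
    """Route a model-issued tool call to its registered handler."""
    return TOOLS[name](**arguments)
```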

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/JaviMaligno/mcp-server-bitbucket" rel="noopener noreferrer"&gt;MCP Server for Bitbucket&lt;/a&gt; exposes &lt;strong&gt;58 tools&lt;/strong&gt; covering the full Bitbucket API:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pull Requests
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create, review, approve, and merge PRs&lt;/li&gt;
&lt;li&gt;Add inline comments on specific lines&lt;/li&gt;
&lt;li&gt;View diffs and compare branches&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pipelines
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Trigger builds on any branch&lt;/li&gt;
&lt;li&gt;Monitor pipeline status and logs&lt;/li&gt;
&lt;li&gt;Manage CI/CD variables&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Repository Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Full CRUD operations&lt;/li&gt;
&lt;li&gt;Branch restrictions and protection rules&lt;/li&gt;
&lt;li&gt;User and group permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Source Browsing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Read files without cloning the repository&lt;/li&gt;
&lt;li&gt;List directory contents&lt;/li&gt;
&lt;li&gt;Compare commits and branches&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;View deployment environments&lt;/li&gt;
&lt;li&gt;Track deployment history&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real Use Cases
&lt;/h2&gt;

&lt;p&gt;Here's how I use it daily:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5x1efq40blpm9tvwmwe7.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5x1efq40blpm9tvwmwe7.webp" alt="MCP Bitbucket in action with Claude Code" width="522" height="481"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Show me open PRs and do a code review of #42"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Trigger the pipeline on develop and notify me if it fails"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Read the config.py file from the feature-x branch"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Generate release notes between v1.0 and main"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The power isn't in any single command—it's in the ability to chain operations naturally through conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Implementation
&lt;/h2&gt;

&lt;p&gt;The server is available in both TypeScript and Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# TypeScript&lt;/span&gt;
npx mcp-server-bitbucket

&lt;span class="c"&gt;# Python&lt;/span&gt;
pipx &lt;span class="nb"&gt;install &lt;/span&gt;mcp-server-bitbucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Authentication uses Bitbucket App Passwords, which you can create in your Bitbucket settings. The server respects rate limits and handles pagination automatically.&lt;/p&gt;
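&lt;p&gt;&lt;em&gt;Bitbucket Cloud's REST API paginates list responses: each JSON page carries its items under &lt;code&gt;values&lt;/code&gt; and, when more results exist, a &lt;code&gt;next&lt;/code&gt; URL. Transparent pagination boils down to following those links until they run out. A small Python sketch of that loop (the helper name is illustrative, not the server's internals):&lt;/em&gt;&lt;/p&gt;

```python
def iter_paginated(fetch_page, first_url):
    """Yield items across Bitbucket-style pages by following 'next' links.

    fetch_page: callable taking a URL and returning the decoded JSON dict
    (a real implementation would issue an authenticated HTTP GET).
    """
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("values", [])
        url = page.get("next")  # absent on the last page, ending the loop
```

With this shape, callers iterate over results without ever seeing page boundaries.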

&lt;h2&gt;
  
  
  Why Bitbucket?
&lt;/h2&gt;

&lt;p&gt;While GitHub has stronger native Claude support, many enterprise teams still rely on Bitbucket. This gap in tooling was exactly why I built this—and why I've open-sourced it for others in the same situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm continuously improving the server based on real usage patterns. Recent additions include webhook management, tag operations, and improved error handling.&lt;/p&gt;

&lt;p&gt;If you're working with Bitbucket and want to integrate it with Claude or other MCP-compatible LLMs, give it a try. Contributions and feedback are welcome.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Check out the &lt;a href="https://github.com/JaviMaligno/mcp-server-bitbucket" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; or &lt;a href="https://pypi.org/project/mcp-server-bitbucket/" rel="noopener noreferrer"&gt;install from npm/PyPI&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.javieraguilar.ai/en/blog/mcp-server-bitbucket" rel="noopener noreferrer"&gt;javieraguilar.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Want to see more AI agent projects? Check out my &lt;a href="https://www.javieraguilar.ai" rel="noopener noreferrer"&gt;portfolio&lt;/a&gt; where I showcase multi-agent systems, MCP development, and compliance automation.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>bitbucket</category>
      <category>devops</category>
      <category>claude</category>
    </item>
  </channel>
</rss>
