Forem: JaviMaligno

Building a Production DevOps Agent: From Slack to Kubernetes

JaviMaligno — Thu, 30 Apr 2026 09:37:13 +0000

At SimpleKYC, the #devops_requests Slack channel is where every team goes when something breaks — or when they think something might be about to break. Pod stuck in CrashLoopBackOff? Pipeline failed? 502 from a service that was fine yesterday? It all lands there.

A colleague analyzed the channel and found that 70% of requests could be self-resolved by querying the same tools the DevOps team uses: kubectl, Azure CLI, Bitbucket pipelines, health checks. The knowledge wasn't secret — it was just scattered across terminals, runbooks, and people's heads.

We were already using Claude Code to investigate issues ourselves. The next step was obvious: put an agent in the channel that could do the same thing, for everyone, automatically.

The Agent Architecture

The bot is a generic LLM agent — not a collection of pre-built workflows. When someone mentions @DevOpsBot in a thread, GPT-5.4 (via Azure OpenAI) decides which tools to call, in what order, with what arguments. It can chain up to 20 tool calls per request, building up context as it goes.

This matters. A pod crash investigation might require:

kubectl get pods to find the failing pod
kubectl describe pod to check events
kubectl logs to read the error
A Bitbucket API call to check the last deployment pipeline
A health check on the service endpoint

No pre-built workflow can anticipate every combination. The LLM constructs the investigation path dynamically, adapting based on what each tool returns.

Six Tools, All Read-Only

The bot has access to six tools:

Tool	What it does	Examples
kubectl	Kubernetes diagnostics across 9 AKS clusters	`get pods`, `describe`, `logs`, `top`
az cli	Azure resource queries across 5 subscriptions	VM status, Postgres info, storage, APIM
Bitbucket	CI/CD pipeline status, repo content, config files	Pipeline logs, `values.yaml`, branch diffs
Health check	HTTP endpoint testing	Status codes, response times, body preview
Jira	Issue search and context	JQL queries, ticket details, comments
Confluence	Documentation search	Runbooks, architecture docs, onboarding guides

Every tool is strictly read-only. kubectl blocks delete, apply, patch, scale, exec, and edit. Azure CLI only allows show and list. There's no way for the agent to modify infrastructure — by design.

Read-Only First: A Deliberate Choice

This wasn't a technical limitation — it was a trust-building strategy. The agent's first job is to prove it can understand the infrastructure correctly before anyone considers letting it change anything.

The roadmap is progressive:

Phase 1 (current): Read-only diagnostics — investigate and report
Phase 2: Safe actions in dev/test — restart pods, rerun pipelines
Phase 3: Approval workflows for staging/production — the agent proposes, a human approves via Slack reaction
Phase 4: Full traced autonomy — the agent acts independently with complete audit trails

Escalation to humans isn't a phase — it's always present. The agent knows when it's out of its depth and routes to the right person. That's not a feature you add later; it's part of the human-in-the-loop design from day one.

This mirrors how you'd onboard a new team member. You don't give someone production access on day one. You let them observe, ask questions, and demonstrate understanding first.

Infrastructure as Knowledge

The agent's system prompt encodes the team's accumulated infrastructure knowledge: cluster naming conventions, namespace patterns, GitOps workflows, config repo structures, and common troubleshooting paths.

For example, the agent knows that:

Deployments follow a GitOps pattern — editing config files in environment branches triggers ArgoCD reconciliation
Kubernetes labels follow consistent naming patterns that map services to environments
The promotion chain goes through multiple stages from development to production
Some services run on VMs with different config structures than the K8s-based ones

This knowledge didn't come from a single source. I had context from working with the infrastructure, the LLM explored the actual clusters and repos to fill gaps, and then a DevOps engineer shared the documentation he personally uses with Claude — that was the biggest single jump in accuracy. After that, I refined edge cases through testing.

The result is that when someone asks "why is service X failing in UAT?", the agent already knows which cluster to check, which namespace, what label selector to use, and where the config repo lives. It doesn't need to be told.

Output Compaction: Saving 79% on Tokens

kubectl and Azure CLI return verbose JSON. A kubectl get pods across a namespace can easily be 8,000+ characters. Feed that raw into the LLM context and you burn tokens fast.

The bot compacts tool outputs on the fly — parsing JSON, extracting relevant fields, and discarding noise. A raw kubectl response of 8,000 characters becomes ~1,600 characters after compaction, with no loss of diagnostic value. That's a 79% token reduction that directly translates to lower costs and faster responses.

The compaction is tool-aware: it knows which fields matter for kubectl (status, restart count, images, events) vs. Azure CLI (provisioning state, FQDN, SKU) vs. Bitbucket (pipeline state, step results, error messages). If JSON parsing fails, it falls back gracefully to truncation.

Security Guardrails

Building an agent that runs shell commands against production infrastructure requires paranoia-level security:

Command validation: Every tool call goes through shlex.split with an array-based subprocess call — never shell=True. This prevents injection by construction.

Blocklists: Shell metacharacters (;, |, &, backticks, $()) are blocked at the input level, before any command reaches the shell.

Allowlists: Each tool defines exactly which subcommands are permitted. For kubectl, that's get, describe, logs, top, config, version. Everything else is rejected.

Output limits: Tool results are capped at 8,000 characters (before compaction) to prevent context window flooding.

Private IP blocking: The health check tool blocks requests to localhost, 10.x, 192.168.x, and 172.16-31.x to prevent SSRF.

The Moment It Clicked

Someone asked the bot to check resource usage for a service. The Grafana query was failing, and the bot started investigating — as usual. But then something unexpected happened: it found its own bug.

The bot's Grafana tool was sending step: '1h' as a string, but the implementation tried to parse it as an integer. ValueError: invalid literal for int() with base 10: '1h'. The bot traced the error to its own code, reported it honestly — "the error was mine" — and then pivoted: instead of giving up, it checked what it could verify (the deployment, replicas, pod status, resource definitions), gave an honest assessment of what it couldn't answer yet, and proposed a fix for its own Grafana tool.

That's meta-debugging. An agent that not only investigates infrastructure issues, but catches and reports its own failures with the same rigor. No hallucinated numbers, no pretending everything is fine — just "here's what I can confirm, here's what I can't, and here's how to fix it."

The First Test Went Wrong

The first live demo with the DevOps team exposed two UX problems I hadn't anticipated.

The first: someone tagged @DevOpsBot in an existing thread — just the mention, no additional text. They expected the bot to read the thread context and help. Instead, it silently dropped the message. I'd built a guard against empty inputs as a safety measure, but "empty mention" and "useless input" aren't the same thing. In a thread with 15 messages of context, a bare @DevOpsBot is a perfectly valid request for help. This guard made the bot useless in one of the most common Slack patterns: jumping into an ongoing conversation.

The second: requiring an @mention at all is friction. In a fast-moving incident, typing @DevOpsBot before every question feels like ceremony. But it's an acceptable tradeoff as part of the trust-building process — no accidental triggers, no noise, clear audit trail of who asked what. The alternative, auto-responding to everything in the channel, risks unintended tool calls and noise that erodes trust faster than the friction does.

Both were quick fixes in code, but they taught me something about building agents for real teams: defensive defaults that make sense in isolation can break the most natural interaction patterns. Test with the actual users early — your assumptions about how people will talk to the bot are probably wrong.

What I Learned

Tools first was the right call. Getting kubectl and Azure CLI working reliably was the quick win that let me start experimenting immediately. The system prompt refinements came naturally from watching the agent struggle with real questions — you need the tools running to know what knowledge is missing.

Compaction should be built in from day one. I added it after noticing token costs climbing. In hindsight, every agent that calls tools returning structured data should compact by default.

Design for thread context early, even if you implement it later. The bot reads up to 20 previous messages in a thread to maintain conversation continuity. Without this, every follow-up question ("what about the UAT environment?") would lack context. You don't need it for initial testing, but keep it in mind from the start — retrofitting conversational state is harder than designing for it.

The Stack

LLM: GPT-5.4 via Azure OpenAI (function calling)
Slack: Slack Bolt + Socket Mode (async Python)
Runtime: Python 3.11, asyncio throughout
Deployment: Docker → AKS via ArgoCD (GitOps)
Observability: Prometheus metrics, Application Insights, structured logging
Testing: pytest (unit + integration + e2e against real Slack)

Is It Worth Building?

If your DevOps team spends significant time on questions that amount to "look at this thing and tell me what's wrong" — yes. The investment is primarily in encoding your team's infrastructure knowledge, not in the LLM plumbing.

The read-only constraint makes the risk profile low. The agent can't break anything. The worst case is a wrong answer, which is the same worst case as a human answering from memory without checking.

The best case is that 70% of your #devops_requests channel gets answered in seconds, and your DevOps engineers focus on the 30% that actually needs human judgment.

Built with Python, Azure OpenAI, Slack Bolt, and a lot of infrastructure knowledge. Deployed on AKS via ArgoCD. View the project details.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

From Claude Artifact to Production PWA: Building VitaminD Explorer

JaviMaligno — Wed, 29 Apr 2026 20:19:26 +0000

It started with a simple question: can I synthesize vitamin D today in London?

I'd been reading about how latitude, season, and skin type determine whether sunlight can actually trigger vitamin D production. The science is clear — below a certain solar elevation angle (~45–50°), UVB radiation is too weak for synthesis regardless of how long you stay outside. But figuring out when that window opens for a given city on a given day? That's a calculation involving solar declination, hour angles, and spherical trigonometry.

So I asked Claude.

The Artifact That Started Everything

Claude didn't just answer the question — it generated an interactive artifact with a solar elevation curve, a day-of-year scrubber, and a threshold line showing when synthesis was possible. It used Spencer's 1971 formula for solar declination and the equation of time, projected onto a clean SVG chart.

I expected a paragraph with some numbers. Instead I got a working prototype.

I played with it for a few minutes — scrubbing through the year, watching the synthesis window expand in summer and vanish in winter — and thought: this is actually useful. Not just for me, but for anyone living at high latitudes wondering if they should bother going outside for their vitamin D.

That's when I decided to build it for real.

From Toy to Product: The Technical Decisions

Next.js App Router as the Foundation

I went with Next.js 16 and the App Router. The app needed client-side interactivity (D3 charts, geolocation, animations) but also server-side API routes (weather data proxying, push notification delivery, city search). App Router gives you both in one project without stitching together separate services.

D3.js on Canvas: Performance Over Convenience

The original artifact used SVG for everything. That works for a single chart, but I wanted three visualizations: a daily solar curve, an interactive world map, and a latitude × day-of-year heatmap (130 latitudes × 365 days = 47,450 cells).

SVG can't handle that on mobile. So I moved to Canvas rendering with an SVG overlay for interactive elements — hover tooltips, click targets, threshold adjusters. The world map uses an equirectangular projection with pan and pinch-to-zoom, rendering 195 countries from TopoJSON on every frame. Device pixel ratio scaling keeps it sharp on Retina displays.

The tradeoff is complexity. Canvas means manual coordinate transforms, hit-testing, and redraw management. But it was necessary — the heatmap alone would have created 47K DOM nodes as SVG rectangles.

The Vitamin D Math: Not Just Solar Angles

The artifact only calculated when synthesis was possible. I wanted to answer how long you need to stay outside for a given dose — say, 1000 IU.

This required modeling the actual photobiology. The rate depends on:

UVI (UV index at the current hour)
Skin type (Fitzpatrick scale I–VI, each with a different MED — Minimal Erythemal Dose)
Exposed body area (face and hands vs. t-shirt and shorts)
Age (synthesis efficiency drops ~1.3% per year after 20, per Holick et al. 1989)

And critically, it's not linear. Previtamin D3 doesn't just accumulate — it photodegrades under continued UVB into inert compounds (lumisterol and tachysterol). This is your body's built-in overdose protection, and it means there's a ceiling on how much you can produce per session.

I modeled this as a saturating exponential:

IU(t) = IU_sat × (1 - e^(-R×t / IU_sat))

Where IU_sat is the photodegradation ceiling (19,200 IU for full-body 1 MED, per Holick 1982, scaled by area and age) and R is the production rate based on Dowdy et al. 2010 MED tables. At low doses, it matches Holick's linear rule. At higher doses, diminishing returns kick in. The model agrees with Young et al. 2021 (n=75, in-vivo validation).

Real Weather Data: Open-Meteo

Solar geometry gives you clear-sky estimates, but clouds matter enormously. A 70% cloud cover can reduce UVB by 60%. I integrated the Open-Meteo API for hourly UV index and cloud cover — no API key needed, free for non-commercial use.

The tricky part: Open-Meteo's UV index already includes cloud cover. If you apply a cloud penalty on top, you double-count it. I caught this bug after noticing that exposure times on cloudy days were absurdly high. The fix: only apply the cloud factor to theoretical (solar geometry) estimates, never to real API data.

Why a PWA and Not a Native App

This was a deliberate choice. A vitamin D calculator needs to work on any device — iOS, Android, desktop — and I'm a solo developer. Building and maintaining native apps for two platforms plus a web version was out of the question.

PWAs close most of the gap that used to matter:

Installation: "Add to Home Screen" works on both iOS and Android — no app store, no review process, no $99/year Apple Developer fee
Push notifications: Supported on Android since 2015 and on iOS since 16.4 (March 2023) — the last major blocker is gone
Offline: Service workers cache everything needed for core functionality
Updates: Deploy to Vercel, users get the new version on next visit — no "please update the app" friction

What you don't get: background location tracking, health app integrations (Apple Health, Google Fit), and a presence in the App Store for discoverability. The first two could matter for future features (wearable integrations, automatic daily tracking). The third is a real distribution disadvantage — people search app stores, not the web.

For now, the tradeoff is clearly worth it. If the app grows to need native APIs (wearables, health data), a hybrid approach with Capacitor or a thin native wrapper around the existing web app would be the natural next step — not a rewrite.

Offline Support

A vitamin D calculator is most useful when you're already outside — potentially with no signal. The service worker caches all pages and static assets at install time, with a network-first strategy for dynamic content and an offline fallback page.

Push Notifications

This was the feature that felt most "app-like." A Vercel cron job runs daily at 8 AM UTC, fetches real UV data for each subscriber's saved location, calculates their personal synthesis window, and sends a Web Push notification:

"London · 12 min for 1000 IU · Best hour: 13:00 (UVI 4.5) · Window: 11:00–15:00"

The implementation uses VAPID keys with the web-push library, and subscriptions are stored in Supabase. One gotcha that cost me 53 days of broken push notifications: echo "$KEY" | vercel env add appends a newline that Vercel stores literally inside the VAPID key. The fix is printf '%s' instead of echo. A subtle shell behavior that silently corrupted a cryptographic key.

Six Languages and 5,000+ Cities

The app supports English, Spanish, French, German, Russian, and Lithuanian via next-intl. Each language file is ~480 lines covering UI labels, Fitzpatrick skin descriptions, and a full educational FAQ about vitamin D science.

City search combines a built-in database of 50+ major cities with Supabase full-text search (pg_trgm) over GeoNames' 200K+ city dataset. City names are localized — "Londres" in Spanish, "Лондон" in Russian.

What I Learned

Claude artifacts are underrated as prototyping tools. The artifact wasn't a mockup or a diagram — it was working code with real solar math. That head start meant I could focus on product decisions (what visualizations? what user flow?) instead of re-deriving spherical trigonometry.

Canvas D3 is worth the pain for data-dense visualizations. If you're rendering fewer than 1,000 elements, stick with SVG. Beyond that, Canvas is the only way to keep mobile smooth.

PWA push notifications are surprisingly powerful and surprisingly fragile. The Web Push spec works well, but the operational surface area (VAPID keys, cron timing, subscription cleanup, error handling for uninstalled apps) is larger than the code suggests.

Photobiology is more nuanced than "go outside for 15 minutes." The difference between Fitzpatrick I (pale, Celtic) and VI (deeply pigmented) is a factor of 6× in MED. Body area exposure matters more than most people think — face and hands alone (10% body area) can require 5× longer than a t-shirt and shorts (25%).

Try It

VitaminD Explorer is free, open source, and works on any device. Search your city, set your skin type and exposure, and see exactly when and how long you need for your daily vitamin D.

The original Claude artifact is still live if you want to see where it all began.

Built with Next.js 16, D3.js, Supabase, and Vercel. Source on GitHub.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

Software Is Dissolving Into the Model

JaviMaligno — Sat, 25 Apr 2026 18:42:37 +0000

Two repositories caught my eye this month. The first is google/agents-cli, Google's official tooling for building agents on Google Cloud — a CLI plus a bundle of markdown "skills" that any coding assistant (Claude Code, Codex, Gemini CLI, Cursor) can pick up. The second is Flipbook, an experimental "browser" launched by ex-OpenAI researchers two days ago. Flipbook has no HTML, no DOM, no rendered components. Every pixel you see — including the text — is generated frame-by-frame by a video diffusion model streaming over a WebSocket.

They look like they belong to different conversations. They don't. Put them next to each other and a pattern emerges: the layer we used to call "software" is being squeezed from both ends. What we write is becoming markdown. What users see is becoming a model output. The middle keeps getting thinner.

What we write is becoming markdown

Skills are repositories of instructions, not deployed services. The runtime is whatever coding agent the user happens to have installed. The artifact is a SKILL.md file with maybe a couple of helper scripts.

The last few weeks alone tell the story:

OthmanAdi/planning-with-files is a Claude Code skill that gives the agent a Manus-style persistent planning workflow — a few markdown files defining when to write task_plan.md, findings.md, progress.md. It has 9.2k stars. Manus, the company that built that workflow into a full product, was reportedly acquired for $2B. The IP was the pattern, not the implementation.
VoltAgent/awesome-agent-skills curates 1000+ portable skills across Claude Code, Codex, Gemini CLI, Cursor and 40+ other agents. 18.7k stars.
GitHub shipped gh skill on April 16 — a first-class CLI primitive for installing, pinning and publishing skills directly from GitHub repositories.
Google Workspace's official CLI now ships 100+ SKILL.md files, one for every supported API, plus 50 curated recipes for Gmail, Drive, Docs, Calendar and Sheets.

A year ago, "ship a tool for X" meant building an SDK, a service, or at least a wrapper library. Today it increasingly means writing a folder of markdown that any agent can load on demand. The trend isn't "AI helps you write software". It's "what we used to call software is now an instruction set".

What users see is becoming a model output

Now look at the other end of the pipeline. It's doing the same thing.

Flipbook launched on April 23 from Zain Shah's team (ex-OpenAI, ex-Humane, ex-Apple), out of South Park Commons. The interface is generated pixel-by-pixel by LTX Video, an open-source diffusion transformer for video, optimized to stream at 1080p/24fps via WebSocket from Modal Labs serverless GPUs. There is no DOM. There are no buttons in the traditional sense. When you "click", the model generates the next frames.

DeepMind has been doing the same thing at world-model scale. Project Genie rolled out to AI Ultra subscribers in January 2026, powered by Genie 3. You type a prompt. You walk around a generated world that is being generated as you move, at 24fps, with consistency held for a few minutes.

These are demos. Flipbook is barely useful as a product yet. Genie's worlds last a minute or two before the model loses coherence. But the direction is clearer than the products: the rendered output of an application is becoming a model inference, not a tree of components.

We're the ones who should adapt

A friend told me this week that he'd been trying to get an LLM to write internal reports following his company's exact style guidelines. After enough iteration he realized the prompt-engineering effort to keep the model on rails was taking him longer than writing the reports himself — and his company has multiple report types like that.

What he said next stuck with me: "Maybe the problem isn't that the AI can't adapt. Maybe we're the ones who should adapt. We're making it less efficient with all our constraints, when what it produces on its own is good enough for the purpose."

That insight isn't really about reports. It's the thing happening on both ends of the software stack right now.

A markdown skill is what you get when you stop trying to force the model into the shape of a typed SDK with versioned interfaces and start writing in the medium the model actually thrives in: natural language with examples. Flipbook is what you get when you stop trying to force the model to emit valid React components and start letting it render the pixels directly.

The friction we were adding — typed APIs, deterministic UIs, hand-coded glue — was buying us correctness, but at a cost we hadn't priced in. Every constraint also costs throughput, and a lot of those constraints existed because humans needed them, not because the model did.

That doesn't mean correctness goes away. It means we re-locate it. Evals replace types. Skill instructions replace SDK documentation. Frame-level objectives replace pixel-perfect Figma specs. We're trading one kind of rigor for another that fits the tool better.

What survives in the middle

So if what we write is becoming skills and what users see is becoming pixels, what's left for those of us building things in the middle?

I've been thinking about this in the context of my own work.

The MCP servers I've built — mcp-server-bitbucket, postgres_mcp, langfuse-mcp-server — already live in this new shape. They're not products with UIs. They're protocols any agent can pick up. The interface is the spec, not the screen.

The compliance classifier and the medical document parser depend on something that doesn't compress into a markdown file: a curated taxonomy, edge cases extracted from real documents, an evaluation suite that proves the model handles the long tail. The orchestration around them increasingly does compress into instructions, but the domain context doesn't.

The data-source automator's value is in a specific human-in-the-loop sequence we worked out the hard way — where the agent stops, what it asks, what it logs. Two years ago that lived in code. Today it lives more naturally as a skill plus an eval set.

What survives in the middle is the stuff that wasn't really code in the first place: the data, the taxonomy, the eval harness, the orchestration decisions. The stuff that was just "code" — the wrapper, the form, the boilerplate — heads to the skill layer on one end or the generated render on the other.

Engineers can't hide in the niche anymore

There's a corollary for those of us who write the code being squeezed: the gap between idea and shipped product is getting short enough that we can't stay tucked into a development niche and still add value.

When the middle layer was thick, you could be "the backend person" or "the LLM ops person" and contribute through a slice of the stack. The product owner had the idea, designers handled the surface, engineers handled the plumbing, and the handoffs were the work. When the middle is thin, those handoffs cost more than they coordinate. The model renders a lot of the surface. The agent carries a lot of the plumbing.

What's left for engineers looks more like product judgment than implementation — which problem to solve, what "good enough" means in this domain, where the model needs to be on rails and where it doesn't, what failure looks like and how to catch it. Even when you're not building your own product, you can't hide in your niche. You have to take ownership of the outcome to be trusted with it — because the part you used to be paid for, the implementation, is the part that's compressing.

Practical takeaways

Three things I'm acting on:

Ship the skill, not the wrapper. If you're building anything that lives behind an LLM, the wrapper code is the part with the shortest half-life. The skill — instructions, examples, evals — is what you actually own.
Stop fighting the medium. If you're spending more time constraining the model than the model is spending producing useful output, you're solving the wrong problem. Either accept what it produces or pick a different tool.
Invest where the moat is. The new defensible thing isn't "we built it well". It's "we have the data, evals and orchestration that prove this works in our domain". That's where I'm putting time on my own products.

Software isn't disappearing. It's redistributing itself across the stack — and the middle, the part we used to call "the application", is getting a lot thinner than I expected even six months ago.

Repositories referenced: google/agents-cli · OthmanAdi/planning-with-files · VoltAgent/awesome-agent-skills. Demos: Flipbook · Project Genie.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

The Real Skill in the Age of AI: Knowing When to Stop

JaviMaligno — Tue, 14 Apr 2026 20:15:18 +0000

The missing skill isn't prompting. It isn't knowing which model to use, or how to structure a CLAUDE.md, or when to reach for an MCP tool.

The missing skill is knowing when to stop.

The Async Human

Conscious human tasks are single-threaded async processes.

We have background threads — habits, pattern recognition, the kind of low-level processing that happens without us noticing. But our conscious executive function, the part that reads specs and makes architectural decisions and evaluates whether an agent went off the rails, is strictly single-threaded. It cannot genuinely run two complex reasoning tasks in parallel.

What we call "multitasking" is actually context switching. And context switching has overhead — exactly like an OS scheduler running multiple processes on a single core. The CPU doesn't run them in parallel; it creates the illusion of parallelism by rapidly switching between them, loading and unloading state each time.

We do the same thing. And we pay the same price: every switch costs something, and the more switches you do, the less actual execution you get per unit of time.

This is the foundation everything else rests on.

Two Stopping Problems

Given that model, there are two distinct ways things go wrong when you orchestrate AI agents — and they operate on different axes.

The Vertical Problem: Building on Unverified Ground

You're in a session. The agent has just finished the authentication module. It looks solid. You skim the output, it seems right, and there's momentum — so you tell it to start the dashboard.

But you haven't actually verified the auth module at depth. You've seen it. You haven't tested it.

Now the dashboard is being built on top of assumptions that may be wrong. And the longer you continue before stopping to verify, the more expensive any foundation flaw becomes. If auth has a subtle bug, it might not surface until the dashboard is half-built — at which point you're not fixing one thing, you're untangling two.

This is the stopping problem within a session. It's not about switching between agents. It's about the pull of momentum inside a single thread of work. The agent keeps going, you keep going with it, and the transition from "build mode" to "verify mode" never happens because it was never explicitly planned.

The discipline: before the session starts, redefine what "done" means. The agent stopping is not done. Done means you've confirmed it works and the assumptions it built on are sound. No new feature starts until you've reached that bar.

The Horizontal Problem: How Many Processes Can You Actually Schedule?

Now add parallel sessions. Agent A is building the API. Agent B is refactoring the data model. Agent C is writing tests.

Each of those is a process your single-threaded attention has to service. You check in on A, switch to B, switch to C, back to A — and with every switch, you load a context, do some work, and unload it. Except you don't fully unload it. Research on attention residue shows that when you shift focus from one task to another, part of your attention stays on the previous one. That residue consumes working memory, degrading the quality of your engagement with whatever you've switched to.

Three overlapping agent sessions means you're potentially carrying three partial contexts simultaneously, never fully present in any of them. It doesn't feel like that — it feels like productivity. But the quality of your oversight degrades quietly, and regressions and bad architectural decisions slip through because you weren't reading deeply enough when it mattered.

The practical limit, in my experience: two to three active sessions per focused block, and only when the tasks are genuinely isolatable. Beyond that, you're not supervising — you're skimming. And skimming is where things break.

What Stopping Well Actually Looks Like

Both problems have the same root solution: make stopping a first-class part of the plan, not an afterthought.

A few heuristics that have changed how I work:

Before launching any session, answer two questions: what does done look like, and what does verified look like? If you can't answer both, the task isn't scoped well enough to run yet.

Resist the next feature until the previous one is verified. Momentum is the enemy here. The agent finishes something, you feel good about it, and the natural instinct is to keep going. Pause. Check. Test. Confirm the foundation is solid before building on top of it.

Treat session boundaries as hard stops, not suggestions. Decide in advance: after these agents report back, I review, I merge what's ready, and I stop. Context drift — where sessions keep spawning new sessions without a real break — is how six hours disappear and you end up with a codebase that's hard to reason about and a brain that's completely fried.

Classify tasks before parallelizing them. Not all agent work has the same cognitive cost to oversee. Documentation, boilerplate, isolated tests — these are cheap to supervise in parallel. Architectural decisions, cross-cutting refactors, anything with shared state — these deserve your full, sequential attention.

The Closing Condition

A single-threaded process that tries to schedule too many tasks doesn't get faster — it just spends more time context-switching than executing. The most efficient system isn't the one that launches the most processes. It's the one that does the most real work between switches.

That applies to CPUs. It turns out it applies to us too.

What limits have you settled on in practice? I'd be curious to compare notes. Get in touch.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

Playwright CLI vs Scripts: How AI Agents Should Test

JaviMaligno — Sun, 29 Mar 2026 17:42:29 +0000

There's a growing pattern in agentic development: give an AI agent a browser and let it figure things out. Tools like Playwright CLI, browser-use, and Claude Code's computer use make it easy to point an LLM at a web page and say "test this."

It works. Until it doesn't.

After months of building an automated microservice deployment system where AI agents deploy, configure, and verify services end-to-end, I've landed on a clear mental model for when to use Playwright CLI (interactive, LLM-driven) vs standard Playwright scripts (deterministic, reproducible). The distinction matters more than most people think.

The Two Modes

Playwright CLI: Exploration Mode

The CLI is conversational. The agent navigates, takes snapshots, reads the DOM, decides what to click next. Each interaction is a new LLM call.

# Agent opens a page, takes a snapshot, reads the YAML, decides next action
playwright-cli -s=verify open https://app.example.com --headed
playwright-cli -s=verify snapshot
playwright-cli -s=verify click e{ref-from-snapshot}
playwright-cli -s=verify fill e{input-ref} "search term"

This is how my verify-test skill works in the Data Source Automator pipeline. When I'm manually testing a newly deployed microservice with Claude Code, the agent:

Opens the Verify platform in a browser
Takes a snapshot (outputs a YAML with element refs)
Reads the snapshot to find the right elements
Fills forms, clicks buttons, navigates
Evaluates results and decides the next step

Each step is an LLM inference. The agent adapts — if a dropdown is missing, it investigates. If auth expires, it re-logs. If results look wrong, it digs deeper.

This is not testing. This is exploration.

Playwright Scripts: Reproducible Mode

Scripts are deterministic. Same input, same steps, same assertions. No LLM involved in execution.

import { test, expect } from "@playwright/test";

test("Verify service search returns results", async ({ page }) => {
  await page.goto("https://app.example.com");
  await page.fill('[data-testid="search-input"]', "ACME Corp");
  await page.click('[data-testid="search-button"]');

  const results = page.locator('[data-testid="result-row"]');
  await expect(results).toHaveCount({ minimum: 1 });

  const firstResult = results.first();
  await expect(firstResult).toContainText("ACME");
});

No snapshots, no YAML parsing, no LLM reasoning about what to click next. Just code.

Why This Distinction Matters for AI Agents

When I first built the deployment automation system, everything used the CLI approach. The LLM agent would:

Deploy the service (git tag, config-deploy update, ArgoCD sync)
Run infrastructure checks (pods, health, swagger, etcd)
Open a browser via Playwright CLI
Navigate the platform UI to configure and test the service
Evaluate results and report

It worked. But three problems emerged quickly:

1. Token consumption exploded

Every Playwright CLI interaction requires the LLM to:

Read the full snapshot YAML (often 500+ lines)
Reason about which element to interact with
Formulate the next command
Evaluate the result

For a typical verification flow (login → configure → search → validate results → check details), that's 15-20 LLM calls with large context. At scale, testing 50+ microservices, this was burning through tokens fast.

2. Variability killed reliability

The same test run twice could take different paths. The LLM might:

Click a different element if the snapshot was slightly different
Interpret results differently based on context window state
Get confused by UI changes that didn't affect functionality

A test that passed Monday might fail Tuesday — not because the service broke, but because the agent reasoned differently.

3. Stable workflows don't need reasoning

After the first few runs, I noticed the verification pattern was always the same:

Login (or restore auth state)
Navigate to settings
Configure autodiscovery
Create an application
Execute a search
Validate results have data
Open a detail view
Capture evidence screenshots

This doesn't need an LLM. It needs a script.

The Architecture I Landed On

┌─────────────────────────────────────────────────┐
│           Deployment Pipeline                    │
│                                                  │
│  Deploy ──► Infra Checks ──► Browser Tests       │
│                                  │               │
│                          ┌──────┴──────┐         │
│                          │             │         │
│                    Playwright     Playwright      │
│                    Scripts        CLI             │
│                    (default)     (fallback)       │
│                          │             │         │
│                    Deterministic  LLM-driven      │
│                    Fast           Adaptive         │
│                    Low tokens     High tokens      │
│                          │             │         │
│                          └──────┬──────┘         │
│                                 │                │
│                          LLM receives            │
│                          outputs only            │
└─────────────────────────────────────────────────┘

Scripts run first. They handle the happy path: login, navigate, search, assert, screenshot. The orchestrating LLM agent receives only the test output (pass/fail + screenshots), not the raw DOM.

CLI kicks in as fallback. When a script fails unexpectedly — a UI change, a new modal, an element that moved — the agent switches to CLI mode. Now it can explore, take snapshots, reason about what changed, and either fix the issue or report it.

This is the key insight: the LLM doesn't need to see every page load and every DOM tree. It just needs the results — and the ability to dig deeper when something goes wrong.

Real Example: The Generated Test Pattern

In the demo-video system, I took this further. Tests are auto-generated from scene definitions:

// scenes/uk-hmrc-supervised-business-register-video-a.ts
// Scene definitions: what to do, in what order, with what data
export const SCENE_CONFIGS = [
  { id: "login", name: "Login to platform" },
  { id: "navigate-settings", name: "Open service configuration" },
  { id: "configure-autodiscovery", name: "Enable the service" },
  { id: "create-application", name: "Create test application" },
  { id: "execute-search", name: "Run company search" },
  { id: "validate-results", name: "Check search results" },
];

// Each scene is a function: (page: Page) => Promise<void>
export const SCENES: Record<string, (page: Page) => Promise<void>> = {
  "login": async (page) => {
    await page.goto("https://demo1-dev.simplekyc.com");
    // ... deterministic steps
  },
  // ...
};

The generated spec file is pure boilerplate — iterates scenes, records timestamps, saves output. Zero LLM involvement at runtime.

// Auto-generated — do not edit manually
test("Record UK HMRC Supervised Business Register", async ({ browser }) => {
  const context = await browser.newContext({
    recordVideo: { dir: VIDEO_DIR, size: { width: 1280, height: 720 } },
    viewport: { width: 1280, height: 720 },
  });

  const page = await context.newPage();
  for (const config of SCENE_CONFIGS) {
    const sceneFn = SCENES[config.id];
    await sceneFn(page);
  }
  await context.close();
});

The LLM's role? It wrote these scenes initially (using CLI exploration to understand the UI). Then it stepped back and let the scripts run.

When to Use Each

	Playwright CLI	Playwright Scripts
Purpose	Exploration, debugging, unknown flows	Verification, regression, known flows
LLM involvement	Every step	None at runtime
Token cost	High (snapshots + reasoning per action)	Near zero (agent only reads output)
Reproducibility	Low (LLM may take different paths)	High (same code, same steps)
Adaptability	High (handles unexpected UI)	Low (breaks on UI changes)
Speed	Slow (LLM latency per step)	Fast (native browser speed)
Best for	First-time flows, debugging failures, navigation	Repeated verification, CI/CD, evidence capture

The Mental Model

Think of it like a human QA engineer:

First time testing a feature? They click around, explore, take notes. This is CLI mode.
Writing the regression test? They script the exact steps they just explored. This is script mode.
Test fails unexpectedly? They go back to manual exploration to understand why. CLI mode again.

An AI agent should work the same way. Explore with the CLI, codify with scripts, fall back to CLI when scripts break.

Practical Guidelines

Start with CLI exploration when the AI agent encounters a new UI or flow. Let it snapshot, reason, navigate.
Once the flow is stable (works 3+ times consistently), generate a Playwright script. The agent itself can write it — it already knows the selectors and flow from its CLI exploration.
In automated pipelines, always use scripts as the primary test runner. The orchestrating LLM receives stdout (pass/fail, screenshots), not DOM trees.
Keep CLI as fallback. When a script fails, the agent can switch to CLI mode to diagnose: "The search button moved — let me take a snapshot and find it."
Never ship CLI-only tests in CI. If your AI agent is doing 20 LLM calls per test run in CI, you're paying for reasoning you don't need.

Navigation vs Testing

One more distinction worth making: pure navigation is conceptually different from testing.

If an AI agent needs to browse the web — research a topic, fill out a form, extract data from a page — CLI is the right tool. That's not testing, it's interaction. The agent needs to reason about what it sees.

But the moment you're verifying that a known flow produces expected results? That's testing. Script it.

The pattern is simple: explore with CLI, verify with scripts, fall back to CLI when things break. Your AI agents will be faster, cheaper, and more reliable.

This approach is running in production for 50+ microservice deployments. The token savings from switching repetitive verifications to scripts were significant enough to justify the refactor within the first week.

For a deeper dive into the script-based approach — including scene recording, voiceover generation, and video assembly — see I Made a Product Demo Video Entirely with AI.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

Writing an Essay with AI: Codex vs Claude Code

JaviMaligno — Wed, 25 Mar 2026 19:50:45 +0000

I recently published Science Catch-Up — a 16-chapter essay examining the limits of the scientific method and proposing a framework for evaluating knowledge without waiting for consensus. The full essay was written with AI assistance: first with OpenAI Codex (GPT 5.2/5.3), then with Claude Code (Opus 4.5/4.6).

The difference in quality was striking. Not in the way you'd expect from a code generation comparison — but in the far more demanding territory of prose.

The essay is available on Payhip (English) and Payhip (Spanish).

The project

Science Catch-Up is not a light read. It's a combative, heavily referenced essay that critiques scientism as a power structure, traces the cost of institutional dogma, and proposes operational criteria for evaluating informal knowledge. The tone had to be precise: direct without being conspiratorial, critical without being anti-science, and provocative without turning into a pamphlet.

That level of nuance is exactly where AI writing gets tested — and where the differences between tools become impossible to ignore.

Phase 1: Codex (GPT 5.2/5.3) — the rough start

The first drafts were written using OpenAI Codex. The initial structure and chapters came together, but the problems piled up fast.

Verbose, repetitive commit messages tell the story. Compare the early GPT-era commits:

"Se amplía la sección sobre biohacking, clarificando las categorías de restauración, mitigación y deuda biológica. Se añaden ejemplos prácticos para cada tipo y se enfatiza la distinción entre restaurar, mitigar y endeudarse, mejorando la comprensión del impacto fisiológico de estas prácticas."

With the later Opus-era commits:

"fix: QA polish — glossary term, font consistency, cover in PDF, new food pyramid"

Same repo, same author, different tool. The commit messages mirror the prose itself.

The specific problems with GPT's writing:

Repetitive expressions everywhere. Words like "precisamente", "en el fondo", and repeated subject openings ("Science Catch-Up propone...", "El marco establece...") appeared in clusters. I eventually had to do a dedicated cleanup pass — commit 61684f4: "Limpiar patrones repetitivos ChatGPT: precisamente, sujetos repetidos, en el fondo".
The "Ejemplo:" pattern. GPT consistently inserted the label "Ejemplo:" before illustrative cases, even when the editorial criteria explicitly said to integrate examples with flowing prose ("Por ejemplo...", "En la práctica..."). This rigid formatting was one of the hardest things to stamp out because it kept reappearing.
Bland, conciliatory tone. The essay needed to be combative and direct. GPT kept softening the edges, adding disclaimers ("this is not anti-science"), and producing what read like an academicized version of ideas that were originally sharp and provocative.

Phase 2: Claude Code (Opus 4.5/4.6) — a different league

When I switched to Claude Code with Agent Teams, the improvement was immediate. The prose was more natural, the tone closer to what I wanted, and the adherence to editorial guidelines much stronger.

After several iterations, I created a formal editorial criteria document — covering tone, argumentative structure, how to handle examples (three named criteria for narrative, schematic, and hybrid styles), referencing standards, and the essay's rhetorical direction. Claude Code followed these criteria consistently once they were established.

What Claude Code got right:

Tone adherence. The combative, no-apologies voice came through naturally. Less defensive hedging, more direct argumentation.
Structural intelligence. When given a chapter outline and editorial criteria, it produced content that respected the flow and built on previous sections.
Research and references. Excellent at finding and integrating relevant sources, formatting bibliography entries, and maintaining consistency across chapters.
Grammar and spelling. Essentially flawless in both Spanish and English — a non-trivial advantage for a bilingual publication.
Programmatic figures. Most of the essay's diagrams and charts were generated by Claude Code using matplotlib scripts — the evidence pyramid, the Science Catch-Up cycle, the cascade of patches diagram. Only one figure and the cover itself were created with Gemini.

Where it still fell short:

Subagent context loss. When dispatching multiple subagents in parallel (a common pattern with Agent Teams), individual agents would sometimes write sections in isolation, losing the narrative thread or producing content that didn't flow with surrounding chapters. The result read like separate authors had written adjacent paragraphs.
Still needs heavy review. Even with Claude Code's better output, I reviewed and revised every paragraph. The ideas and instructions were mine; the AI's role was transforming rough ideas into developed content, proposing structure, and doing research. I'd estimate 95%+ of the text was strictly written by AI, but every sentence was validated or adjusted by me.

Prose vs code: why writing is harder

This project crystallized something I'd been sensing: AI-assisted prose requires more oversight than AI-assisted code.

In code, functionality is king. If a function works, passes tests, and handles edge cases, the style matters but it's secondary. Modularity, naming conventions, and code quality are still easier to achieve and verify than their prose equivalents.

In prose, what you say and how you say it are inseparable. A paragraph that communicates the right idea but in a bland, hedging, or repetitive way is a failure — even though the "functionality" (conveying information) works. There's no test suite for tone. No linter for rhetorical punch. No CI pipeline that catches "this sounds like it was written by ChatGPT."

I found myself spending far more time reviewing prose than I ever do reviewing AI-generated code. Every sentence had stakes that a line of code doesn't.

The translation test

The essay was originally written in Spanish. The English translation was done with Claude Code and it was remarkably fast — the structure, references, and formatting carried over cleanly.

The interesting challenges were cultural, not technical:

Catchy phrases needed adaptation, not literal translation. Punchlines that worked in Spanish sometimes needed complete rethinking in English.
Domain acronyms: the essay introduces CSV (Ciencias de Sistemas Vivos) and CSI (Ciencias de Sistemas Inertes) in Spanish. In English these became OSS (Organic System Sciences) and ISS (Inert System Sciences) — a deliberate choice that required discussion.

The iterative process

One thing worth noting: the writing was never a single-pass affair. The typical cycle was:

Draft — AI writes a reasonable first version from my outline and notes
Ideas emerge — reading the draft triggers new thoughts, missing angles, additional references
Expand — feed those back in, ask for specific additions or restructuring
Review — catch tone drift, repetitive patterns, weak arguments
Polish — final pass for consistency with editorial criteria

This loop happened per chapter and across the whole essay. AI is extraordinary at step 1 and 3 — transforming raw ideas into developed content. But steps 2, 4, and 5 remain fundamentally human.

The output

Beyond the essay itself, the AI-assisted pipeline produced:

PDF and ePub builds for both languages
Amazon KDP formatting with programmatic cover generation
Audiobooks in both languages using Google's Aoede TTS
Publication metadata for multiple platforms

The Spanish audiobook is already on Spotify. The English one is coming — I'll do separate posts for the audiobook editions.

Takeaways

Codex (GPT) produced noticeably worse prose — repetitive, bland, and resistant to style guidelines
Claude Code (Opus) was significantly better — closer to the desired tone, better at following editorial criteria, stronger structural awareness
Neither replaces editorial judgment — the human loop of reviewing, rethinking, and refining is non-negotiable for quality prose
AI shines at transformation — turning rough ideas into structured content, researching references, handling grammar and spelling, generating figures programmatically
Prose requires more human oversight than code — because style, tone, and rhetorical effectiveness have no automated tests
Translation was the easiest part — structure carries over cleanly; only cultural nuances and catchy phrases needed real thought

The essay is a serious, opinionated piece of work. Whether you agree with its thesis or not, the process of writing it taught me more about AI-assisted creation than any coding project has.

Science Catch-Up is available on Payhip (English edition) and Payhip (Spanish edition).

Interested in AI agent architectures? Get in touch.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

I Made a Product Demo Video Entirely with AI

JaviMaligno — Mon, 02 Mar 2026 12:41:31 +0000

I needed a demo video for an RFP automation platform I'm building. The typical approach: record your screen, stumble through clicks, re-record when something breaks, then spend an hour in a video editor syncing voiceover. I've done it before. It's painful.

So I tried a different approach: let AI do the whole thing.

Claude Code wrote the entire recording pipeline — Playwright scripts, ffmpeg assembly, speed control, subtitle generation
edge-tts generated the narration with Microsoft's neural voices
Gemini 3.1 Pro reviewed the final video for audio/video sync issues

The result: a 3-minute narrated demo with 14 scenes, variable speed segments, and subtitles. No video editor. No screen recording app. No manual voiceover.

🎥 Watch the video demo on Loom

Note: Interactive video player available on the original article

Here's how the whole process worked — including the parts that broke.

The Pipeline

Playwright (record) → edge-tts (voice) → ffmpeg (assemble) → Gemini (QA)

Step 1: Claude Code Writes the Recorder

I described what I wanted: a demo video using Playwright to record the browser, with text-to-speech for coordinated voiceover, covering the full application workflow. Claude Code decided the structure — 14 scenes from dashboard to API docs — wrote the scene scripts, and generated the modules. Each one is a TypeScript function that drives the browser through a specific feature:

export const scene: SceneFn = async (page) => {
  await page.getByText("Generate Answer").click();

  // Wait for AI to finish
  const indicator = page.getByText("Generating AI answer...");
  await indicator.waitFor({ state: "hidden", timeout: 90_000 });

  // Scroll through the answer smoothly
  const scrollPanel = page.locator("[data-demo-scroll=true]");
  await scrollPanel.evaluate((el) => {
    el.scrollTo({ top: el.scrollHeight / 3, behavior: "smooth" });
  });
  await longPause(page);
};

A test wrapper runs all 14 scenes in sequence, recording timestamps:

for (const id of sceneIds) {
  const start = (Date.now() - videoStartTime) / 1000;
  await SCENES[id](page);
  timestamps.push({ id, start, end: (Date.now() - videoStartTime) / 1000 });
}

Output: one .webm video + a scene-timestamps.json file. This separation is key — it lets us manipulate each scene independently during assembly.

Step 2: AI Voice with edge-tts

Each scene has a narration line. edge-tts turns them into MP3 files using Microsoft's neural TTS — free, no API key, surprisingly natural:

edge-tts --text "Let's generate an AI answer..." \
  --voice en-US-GuyNeural \
  --write-media voice/08-generate-answer.mp3

14 scenes, 30 seconds to generate all voices. Claude Code wrote the narration script too — I reviewed and tweaked the phrasing, but the drafting was AI.

Step 3: Assembly — Where Everything Broke

Claude Code also wrote the assembly script. In theory, it's simple: split the video by timestamps, overlay voice, concatenate. In practice, this is where I spent most of the iteration time with Claude.

Variable Speed: 30 Seconds of Spinner → 2 Seconds

Nobody wants to watch an AI loading spinner for 30 seconds. The solution: per-scene speed segments.

{
  id: "08-generate-answer",
  speed: [
    { from: 0, to: 1, speed: 1 },   // Click button at normal speed
    { from: 1, to: 6, speed: 15 },   // AI generation: 30s → 2s
    { from: 6, to: 10, speed: 1 },   // Read the answer at normal speed
  ],
}

The from/to values are proportional (0-10 scale). ffmpeg applies this via a split/trim/setpts/concat filtergraph. The result: boring waits are compressed 15x while meaningful interactions play at real speed.

The Audio Sync Nightmare

This is the lesson that took the most iterations to learn. When merging voice with video per-scene, then concatenating:

Problem 1: Using ffmpeg's -shortest flag silently truncates the longer stream. Voice gets cut mid-sentence.

Problem 2 (the nasty one): ffmpeg starts each concatenated clip's audio where the previous clip's audio ended, not where the video starts. If clip A has 30s video but only 13s audio, clip B's audio starts at t=13 instead of t=30. This causes progressive drift — by scene 10, the voice is over a minute behind the visuals.

The fix: Every clip's audio track must exactly match its video duration:

ffmpeg -i video.mp4 -i voice.mp3 \
  -filter_complex "[1:a]adelay=500|500,apad=whole_dur=VIDEO_DURATION[audio]" \
  -map 0:v -map "[audio]" -c:v copy -c:a aac output.mp4

apad=whole_dur pads the audio with silence to exactly match the video length. No drift possible.

Gemini 3.1 Pro as Video QA

Here's where it got really interesting. After fixing the pipeline, I needed to verify audio/video sync across 14 scenes. Watching the whole video manually each time is tedious and my ears aren't reliable after the 10th iteration.

I uploaded the video to Gemini 3.1 Pro and asked it to analyze the synchronization — which actions happen visually vs. when the narration describes them.

On the Broken Version

Gemini caught every single sync issue with precise timestamps:

Action	Visual	Audio	Drift
Create RFP	01:51	02:09	18s late
Assign Style Guide	02:20	03:05	45s late
Generate Answer	02:22	03:31	1m 9s late
Chat refinement	03:22	03:58	36s late
Assign team	03:55	04:28	33s late

Classic progressive drift. Each scene's audio shifts further behind because the previous scene's audio track was shorter than its video.

On the Fixed Version

After applying the apad fix, Gemini's analysis: 13 scenes perfect sync, 1 flagged as ~2 seconds late.

The flagged scene was actually fine — I'd intentionally added a 1.5-second voice delay to let a visual transition settle before narration began. Gemini was being slightly over-strict.

Score: 0 missed issues, 1 false positive out of 14 scenes. That's better QA than I'd get from watching the video myself.

The Human-AI Feedback Loop

The process wasn't "ask Claude once, get perfect video." It was iterative:

Me: "I want a demo video using Playwright with TTS, covering the full workflow"
Claude: Decides on 14 scenes, generates the pipeline. First recording works.
Me: "The voice is desynced from scene 5 onwards"
Claude: Debugs, discovers -shortest issue. Fixes with apad.
Me: "The answer doesn't scroll — you can't see the bullet points"
Claude: Investigates DOM, finds wrong scroll container. Fixes with programmatic parent discovery.
Me: "Still no bullet points in the generated answer"
Claude: Tests via API, finds the AI returns plain text despite HTML instructions. Adds normalizeAnswerHtml post-processor.
Me: "The KB upload scene has too much dead time, and the style guide voice starts too early"
Claude: Increases speed compression from 4x to 8x, adds 2s voice delay to style guide scene.

Each round: I watch the video, describe what's wrong in plain language, Claude debugs and fixes. The feedback loop is fast because re-recording takes 4 minutes and assembly takes 1 minute.

What AI Did vs. What I Did

Task	Who
Write 14 Playwright scene scripts	Claude Code
Write assembly pipeline (800 lines)	Claude Code
Write speed segment logic	Claude Code
Debug audio sync issues	Claude Code
Fix scroll container detection	Claude Code
Add HTML normalization post-processor	Claude Code
Generate voiceover audio	edge-tts
QA audio/video synchronization	Gemini 3.1 Pro
Write narration text	Claude Code (I reviewed and adjusted)
Review video and give feedback	Me
Choose what to show and in what order	Me

The creative direction was mine. Everything else was AI.

The Final Numbers

# Record 14 scenes
npx playwright test e2e/tests/demo-record.spec.ts --headed  # ~4 min

# Generate voices
npx tsx scripts/demo-record.ts voice   # ~30 sec

# Assemble with speed control + subtitles
npx tsx scripts/demo-record.ts assemble  # ~1 min

Output: 3:47 narrated video, 14 scenes, variable speed, soft subtitles
Pipeline code: ~800 lines TypeScript (assembly) + ~200 lines (scenes)
Re-record time: Under 6 minutes end-to-end
Video editors used: Zero

If the UI changes tomorrow, I update one scene file and re-run. The entire pipeline is version-controlled and reproducible.

Honest Trade-off: Manual vs. Automated

For the first iteration, manually recording your screen while narrating would be faster. A screen recording tool gives you a video in real time — no pipeline to build.

But manual recording has its own costs: you need a quiet environment and a decent microphone, any stumble means re-recording, editing voiceover timing in a video editor is tedious, and audio quality depends entirely on your hardware.

The automated approach pays off from the second iteration onward. When the UI changed, I updated one scene file and re-ran. When the narration needed tweaking, I edited a text string — no re-recording my voice. After five rounds of feedback-and-fix, I'd have spent hours in a video editor doing the same thing manually. And if a client asks for a demo next month after a redesign, it's a 6-minute re-run, not a full re-shoot.

Beyond Demos: Video as QA Evidence

This pipeline was built for a product demo, but the pattern — browser automation producing narrated video — has broader implications.

Think about QA. Today, test evidence is usually a CI log that says PASS or FAIL. When a client asks "show me that the payment flow works," you re-run the test and hope they trust a green checkmark. Imagine instead handing them a narrated video: the test runs, the voiceover explains each step, and the video is generated automatically on every release. Regression testing becomes not just a technical checkpoint but a reviewable artifact.

The same applies to compliance and auditing. Regulated industries need proof that systems work as specified. A version-controlled pipeline that produces timestamped video evidence on demand is fundamentally different from manual screen recordings buried in a shared drive.

And onboarding — new team members could watch auto-generated walkthroughs that stay current with the actual UI, not documentation screenshots from six months ago.

The underlying shift is that video is becoming a programmatic output, not a creative production. When the cost of producing a video drops from hours to minutes, and re-producing it is a single command, you start using video in places where it was never practical before.

Key Takeaways

Playwright's recordVideo is production-quality for demos — 720p/25fps, no overhead
Never use -shortest in ffmpeg when merging audio streams for concatenation. Use apad=whole_dur to match audio duration to video duration exactly.
Variable speed segments are the difference between a boring demo and a watchable one. 15x compression for loading spinners, 1x for actual interactions.
Gemini 3.1 Pro is a legitimate video QA tool. Upload a video, ask "is the audio synced with the visuals?" — it'll give you a timestamped report with near-perfect accuracy.
The human-AI feedback loop matters more than getting it right first try. I described problems in plain language ("the scroll doesn't work"), Claude debugged and fixed. Five iterations to a polished result.
AI is great at automation, humans are great at judgment. I wrote the narration script and decided what to show. AI did everything else.
When video becomes a command, you use it everywhere. The same pipeline that records a demo can generate QA evidence, onboarding walkthroughs, or compliance artifacts — all version-controlled and reproducible on every release.

Tools used: Claude Code, Playwright, ffmpeg, edge-tts, Gemini 3.1 Pro

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

Beyond RAG: Building a Recursive Language Model to Process 1M Tokens

JaviMaligno — Fri, 13 Feb 2026 23:50:47 +0000

You have a million tokens of text. Your model's context window is 128K. What do you do?

The common answers are RAG (chunk it, embed it, retrieve the relevant pieces) or long-context models (hope the window is big enough). But both have fundamental trade-offs: RAG loses global context because it only retrieves fragments, and long-context models degrade in quality as input length grows -- the famous "lost in the middle" problem.

A recent paper from arXiv proposes a third approach: Recursive Language Models (RLM). The idea is deceptively simple -- let the LLM program its own access to the document.

I built a working prototype. Here's how.

What is a Recursive Language Model?

The RLM paper introduces an inference paradigm where the model treats a long document as an external environment rather than as input. Instead of stuffing the text into the prompt, the system:

Loads the document into memory (a Python environment) where the model can't see it directly
Gives the model tools to examine, slice, and search the document via code execution
Allows recursive sub-calls -- the model can invoke itself on fragments to summarize or analyze them

This is fundamentally different from RAG. In RAG, a retrieval system decides what's relevant before the model sees anything. In RLM, the model itself decides what to read, when, and how deeply -- it writes Python code to navigate the text.

The key insight: LLMs are surprisingly good at writing code to explore data they can't see. They search for patterns, slice around interesting regions, and use sub-calls to summarize sections -- all autonomously.

The paper shows RLMs processing inputs up to two orders of magnitude beyond the context window, with ~28% performance gains over base models.

Architecture of the Prototype

The prototype has three components:

1. The Orchestrator

A turn-based loop that manages the conversation between the LLM and the Python environment:

for turn in range(1, max_turns + 1):
    response = client.chat(
        messages=messages,
        tools=[python_exec, final],
        tool_choice="auto",
    )
    # Process tool calls, collect observations
    # Stop when model calls "final"

The LLM has access to two tools:

python_exec(code): Execute Python code in a persistent environment
final(answer): Return the synthesized answer

2. The Persistent Python Environment

The full document is loaded as a string variable context in a Python environment that persists across turns. Built-in helpers:

context       # The full document text (~4M chars)
context_len   # Length
get_slice(start, end)  # Extract a substring
search(pattern, max_results=5)  # Regex search with context snippets
llm_query(prompt_text)  # Sub-call to the LLM for fragment analysis

The critical one is llm_query(). When the model finds a relevant fragment, it can invoke a separate LLM call to summarize or analyze just that fragment -- this is the recursive part.

3. The LLM API

Azure OpenAI with GPT-5 via tool calling. The system prompt tells the model it's an RLM and that the document is NOT in its context:

You are an RLM (Recursive Language Model). The full document is NOT in your context.
The text is loaded in a Python environment as variable `context`.
Use python_exec to explore it with slicing and search.
Use llm_query() for sub-queries on fragments.
Call `final` with your answer when ready.

Building the Demo: Step by Step

1. Setup

git clone https://github.com/JaviMaligno/rlm-prototipo
cd rlm-prototipo
uv venv && source .venv/bin/activate
uv pip install -e .
cp .env.example .env
# Fill in your Azure OpenAI credentials

2. Collect Data (~1M tokens)

I wrote a script that downloads papers from arXiv and extracts clean text from LaTeX sources:

python scripts/fetch_arxiv.py --target-chars 4000000 --output-dir data

This searches for papers on LLM agents, RAG, and AI -- prioritizing LaTeX source extraction (cleanest text), falling back to PDF-to-text, and using abstracts as a last resort. It stops when it reaches the target character count.

In about 2 minutes, it downloaded 71 papers totaling 4,033,636 characters (~1M tokens).

3. Run the RLM

rlm run \
  --input "data/*.txt" \
  --question "What are the main contributions of these papers? \
              Summarize the 5 most frequent themes."
# Defaults: --max-turns 15 --max-subcalls 90

What Happens During Execution

Watching the RLM work is fascinating. Here's the actual behavior on our 71-paper corpus:

Turn 1: The model checks the document size, identifies the structure, and samples representative fragments:

L = len(context)  # → 4,044,992
starts = [0, L//3, 2*L//3]  # Sample 3 positions
for st in starts:
    frag = context[st:st+6000]
    topics = llm_query(f"Extract 4-6 key topics:\n{frag}")

Turn 2: With the initial themes collected, it synthesizes a final answer:

synth = llm_query(f"From these partial lists, identify the 3 main themes:\n{joined}")

With a small budget (15 subcalls), the model completes in 2 turns, under 1 minute -- sampling strategically and producing a coherent synthesis without ever seeing the full 1M tokens.

With a full budget (90 subcalls), the model analyzes all 71 papers individually in ~23 minutes, producing a detailed synthesis that cites specific paper titles, methods, and metrics. It used 80 subcalls for analysis and the rest for synthesis -- all at 100% success rate.

Budget Management: The Key Design Decision

The most interesting engineering challenge wasn't the architecture -- it was resource management. When the model has limited subcalls across multiple turns, how should it allocate them?

The Problem

With 71 papers but only 15 subcalls, the naive approach fails:

# BAD: The model tries to iterate over everything
for section in sections:  # 71 sections
    llm_query(section[:8000])  # Burns all subcalls on turn 1
# No subcalls left for synthesis!

Budget Visibility: Teaching the Model to Self-Plan

The solution was injecting remaining budget info into every tool result:

[budget] subcalls remaining: 11/15 | turns remaining: 4/5

This simple addition transforms model behavior. Instead of iterating exhaustively, the model learns to sample representative fragments and reserve subcalls for synthesis.

Benchmark: Global Budget vs Refill-Per-Turn

I tested two strategies with identical parameters (5 turns, 15 subcalls):

	Global Budget + Budget Info	Refill Per Turn
Time	3:56	1:31 (no answer)
Subcalls used	9	2
Result	3 themes with explanations	"Max turns reached"
Behavior	Sampled, fell back to keyword search when subcalls failed, synthesized	Spent 3 turns exploring without subcalls, then failed

Global budget wins decisively. The refill-per-turn approach removes urgency -- the model "wanders" exploring without committing to subcalls. With a global budget and visible remaining count, the model plans its strategy around the available resources.

The global-budget model also showed better adaptability: when llm_query() calls returned empty (a GPT-5 issue), it autonomously fell back to keyword counting with search() -- no subcalls needed.

Results and Lessons Learned

What Worked

The RLM successfully analyzed 71 papers and identified coherent themes across multiple runs:

Security, ethics, and robustness -- alignment, bias mitigation, adversarial resistance
LLMs and NLP at scale -- Transformer improvements, prompting, long-context reasoning
Cross-domain AI applications -- health, robotics, code generation, multimodal systems

GPT-5 Compatibility Issues

Building against GPT-5 required several fixes:

max_completion_tokens instead of max_tokens (API parameter rename)
No custom temperature -- GPT-5 only supports the default value (1)
Tool call serialization -- SDK objects needed explicit conversion to dicts for the message history
tools=null rejection -- GPT-5 returns empty content when tools and tool_choice are explicitly set to null; these params must be omitted entirely

The Reasoning Tokens Trap

This was the hardest bug to diagnose. Sub-calls were returning content: null 100% of the time. The API wasn't down -- it was responding with finish_reason: "length" and consuming all tokens internally.

GPT-5 is a reasoning model (like o1/o3). The max_completion_tokens parameter includes both internal reasoning tokens and the visible output. With max_completion_tokens=800, the model would spend all 800 tokens "thinking" and have zero left for the actual response:

finish_reason: length
content: ""
reasoning_tokens: 800    ← all budget consumed here
completion_tokens: 800   ← nothing left for visible output

The fix was increasing max_completion_tokens from 800 to 8000 for sub-calls. This gives the model ~2000-3000 tokens for reasoning and leaves plenty for the visible response (~500-1000 chars).

The result was dramatic: sub-call success rate went from ~6% to 100% (80/80 in our test run). What we had attributed to "intermittent API issues" was actually a systematic resource starvation problem.

Guardrails That Matter

Three guardrails prevented the most common failure modes:

Code length limit (50 lines max): Without this, the model writes enormous regex parsers instead of using llm_query(). When rejected, it falls back to simple, correct code.
Targeted error hints: Instead of a generic "error occurred", the system provides specific guidance:
- SyntaxError → "Simplify your code. Use llm_query() instead of complex parsing."
- Max subcalls reached → "Synthesize with the data you already have and call final."
Budget injection: The remaining subcalls/turns shown after each python_exec result changed model behavior from "iterate everything" to "sample strategically".

Self-Correction in the Wild

One of the most interesting emergent behaviors: the model writes buggy code, gets an error, and fixes it autonomously. Here's a real example:

# Turn 3: Model tries to slice a dict like a list
KeyError: slice(None, 120, None)

# Turn 4: Model sees the traceback, realizes its mistake,
# rewrites the code using list indexing instead

The model also self-corrects at a higher level. In one run, it found only 5 file separators instead of 71 because it searched for the wrong pattern. After seeing the unexpected count in the output, it tried a different approach and found all files.

This is not a bug -- it's the system working as designed. The agentic loop feeds every error back to the model as an observation, and the model learns from it within the same run. The guardrails (code length limit, error hints, budget visibility) keep these self-correction cycles short and productive.

Real-Time Output Streaming

A subtle but critical fix: the Python environment uses redirect_stdout during code execution, which captures all output -- including the orchestrator's progress logs for subcalls. The fix was pinning Rich Console to the real sys.stdout at construction time:

# Console(file=sys.stdout) stores a direct reference to the real stdout.
# When redirect_stdout later changes sys.stdout to StringIO, the Console
# still writes to the original terminal.
self.console = Console(file=sys.stdout)

Without this, users watching the terminal during long python_exec blocks would see nothing until the entire execution completes -- poor UX for runs that take 5+ minutes.

Trade-offs

Aspect	RLM	RAG	Long Context
Setup complexity	Low (no embeddings, no vector DB)	Medium-High	Low
Global context	High (model explores freely)	Low (retrieval decides)	High
Cost	High (multiple API calls per query)	Low per query	Medium
Latency	High (sequential turns + subcalls)	Low	Medium
Max document size	Unlimited (out-of-core)	Unlimited	Window-limited

RLM shines when you need deep, exploratory analysis of massive documents where you don't know in advance what's relevant. RAG is better for known-pattern retrieval at scale. Long context works when the document fits.

When to Use RLM

Use RLM when:

Your document exceeds the context window and you need global understanding
You need the model to decide what to read (exploratory questions)
You want transparency -- you can see exactly what code the model writes

Don't use RLM when:

You have a simple retrieval pattern (use RAG)
Latency matters more than depth (RLM is sequential)
The document fits in context (just use long context)

Phase 2: From Prototype to Production

The first version worked, but it had clear inefficiencies: the model wasted 2-3 turns parsing document structure, sub-calls ran sequentially (~10-25s each), and the model had no pre-built knowledge of available files. Three targeted improvements changed this.

1. Structure Helpers

Instead of letting the model discover file boundaries by parsing ===== FILE: separators manually, we now pre-compute a file index at load time and expose structured helpers:

file_count          # → 71
list_files()        # → [{index: 0, name: "paper1.txt", start: 0, end: 56234, size: 56200}, ...]
get_file(i)         # → full text content of file i

This eliminates the exploration phase entirely. The model no longer needs to search("===== FILE:") and count separators -- it knows exactly how many files exist and can read any one directly.

2. Injected Table of Contents

The first user message now includes an auto-generated TOC:

## Table of Contents (71 files, 4,044,992 chars)

  [0] 2501.12345_paper_title.txt (56,200 chars)
  [1] 2501.23456_another_paper.txt (48,100 chars)
  ...
  [70] 2502.99999_last_paper.txt (61,300 chars)

Use `get_file(i)` to read file i. Use `list_files()` for details.

Combined with the updated system prompt, the model's recommended flow shifts from "explore → discover → sample → synthesize" to "read TOC → batch analyze → synthesize".

3. Parallel Sub-calls

The biggest latency win. A new llm_query_batch() function runs multiple sub-calls concurrently using ThreadPoolExecutor:

# Before: sequential loop (~10s × 71 = ~12 min)
for i in range(file_count):
    results.append(llm_query(f"Summarize:\n{get_file(i)[:6000]}"))

# After: parallel batch (~10s × 71 / 5 workers = ~3 min)
prompts = [f"Summarize:\n{get_file(i)[:6000]}" for i in range(file_count)]
results = llm_query_batch(prompts, max_workers=5)

The implementation handles thread-safe subcall counting (via threading.Lock), pre-validates budget before starting, returns results in input order, and captures individual failures as [error: ...] strings without aborting the batch. If the batch exceeds the remaining budget, it processes as many prompts as fit and marks the rest as [skipped] -- no wasted turns on errors.

4. The exec() Black Hole

An unexpected regression almost derailed Phase 2. Python's exec() doesn't auto-print expression return values -- unlike an interactive REPL. With Phase 1's multi-turn approach, the model accumulated results across turns so this didn't matter. But Phase 2's batch approach computes everything in a single python_exec: the model analyzed all 71 papers, synthesized them into a final_text variable... and got back stdout: 0 chars. The result vanished into nothing.

Worse, when the model then responded with plain text (it had the answer!), the guardrail nudged it back to python_exec -- but with zero subcalls remaining, the model couldn't use llm_query(), so it looped endlessly until max turns.

Two fixes:

Auto-capture last expression (like IPython): PythonEnv.exec() now uses ast to detect if the last statement is an expression, splits it from the body, eval()s it separately, and appends the result to stdout. The model no longer needs to know about print() -- it just works.
Budget-aware nudge: When subcalls are exhausted and the model responds with text, the nudge now says "Call final(answer=...) NOW with the data you have" instead of pushing back to python_exec.

5. Synthesis Truncation

Another subtle issue emerged: the model delegated synthesis to a llm_query() sub-call, passing all 71 file summaries (~42K chars) as the prompt. But sub-calls have a 6K character limit to keep costs down -- so the synthesis only saw files [0]-[7] and cited nothing beyond that.

The fix: tell the model to synthesize locally in python_exec using the batch results already in memory, instead of delegating to another LLM call. The data is already there -- no sub-call needed.

With these five improvements, the broad question went from 13 turns / 22:53 to 2 turns / 3:25 with full 71/71 coverage. But more problems remained.

6. Dual Strategy: Broad vs Specific Questions

Up to this point, the system prompt enforced a single strategy: "batch ALL files at once." Great for broad questions ("summarize the 5 main themes"), wasteful for specific ones ("what vulnerabilities does the agent-fence paper identify?"). The model burned 71 subcalls scanning the entire corpus when it only needed one file.

The fix: two explicit flows in the system prompt:

Flow A (broad question): batch all files, local synthesis, final(). Unchanged.
Flow B (specific question): identify relevant files by name in the TOC, read FULL content with get_file(i), split into ~30K-char chunks with overlap, and run focused subcalls to extract exact data.

We also raised max_subcall_prompt_chars from 6K to 32K -- when the model needs to deeply analyze a full paper, it shouldn't be truncating to 20% of the text.

7. Synthesis Nudge: The Exploratory Tourism Problem

Even with "don't waste turns" written in the prompt, the model ignored it. After finishing its subcalls, instead of synthesizing and calling final(), it would launch round after round of search() and get_file() looking for "more data" until it exhausted all 15 turns.

The fix was structural, not verbal: a synthesis nudge mechanism in the orchestrator. After each turn with tool calls, the system compares the subcall counter with the previous turn's count. If 3 consecutive turns pass without new subcalls (only python_exec with search() or reads), it injects a forced message:

"STOP. You have enough data. Synthesize what you have and call final(answer=...) ON THE NEXT turn."

This killed "exploratory tourism" at the root. The model now synthesizes immediately after the nudge.

8. max_tokens for Long Answers

The model sometimes said "the full message exceeds the limit, should I split it?" instead of calling final(). Root cause: max_tokens=4096 in the orchestrator's main loop -- long tool call arguments were being truncated. Raised to 16384 (matching the grace turn).

Before vs After (updated)

Metric	Phase 1	Phase 2	Phase 2.5 (broad)	Phase 2.5 (specific)
Turns	13	2	5	12
Time	22:53	3:25	4:13	3:22
Subcalls	80	71	71	8
Coverage	~50/71	71/71	71/71	1 paper in depth

The broad question takes slightly longer than Phase 2's best case (the model uses more turns to synthesize with 25K prompts instead of 6K), but summary quality is noticeably higher. The specific question is an entirely new use case: previously it was impossible to extract detailed data from a single paper without wasting the entire budget on the 71-file batch.

Video Demo

🎥 Watch the video demo on Loom

Note: Interactive video player available on the original article

Run Logs

Flow A — broad question: "What is the main contribution? Summarize the 5 most frequent themes" (click to expand)

──────────────────── Turn 1/15  subcalls=0/90  elapsed=0:00 ────────────────────
  LLM responded in 26.5s — content=False tool_calls=1
╭────────────────────────── python_exec (38L)  0:26 ───────────────────────────╮
│ files = list_files()                                                         │
│ prompts = []                                                                 │
│ for f in files:                                                              │
│     text = get_file(f['index'])                                              │
│     chunk = text[:25000]                                                     │
│     prompts.append(                                                          │
│         "Summarize in 1-2 sentences the paper's main contribution..."        │
│         + chunk                                                              │
│     )                                                                        │
│ results = llm_query_batch(prompts, max_workers=5)                            │
│ # ... classification by categories and local synthesis ...                   │
╰──────────────────────────────────────────────────────────────────────────────╯
  ⤷ llm_query_batch: 71 prompts, max_workers=5 (0:26)
  ⤷ llm_query #1/90 (0:26) 25219ch — Summarize the main contribution...
  ⤷ llm_query #2/90 (0:26) 25219ch — Summarize the main contribution...
    ...
  ⤷ llm_query #71/90 (3:05) 25219ch — Summarize the main contribution...
    ✓ 6.9s — 530 chars
    ✓ 8.9s — 548 chars
    ✓ 11.2s — 630 chars
    ✓ 25.8s — 720 chars
  ✓ batch done 182.9s — 71/71 succeeded
  ok exec=182.9s  stdout=13650ch  stderr=0ch
╭──────────────────────── python_exec result (ok=True) ────────────────────────╮
│ {'summary': [('Evaluation and agent benchmarks', 25),                        │
│  ('Security, robustness and compliance', 19),                                │
│  ('Multi-agent coordination and reasoning', 13),                             │
│  ('Planning, memory and long-horizon tasks', 9),                             │
│  ('Scientific applications, health and specialized domains', 5)]}            │
╰──────────────────────────────────────────────────────────────────────────────╯

  [Turns 2-3: local synthesis with categories and concrete examples]

─────────────────── Turn 5/15  subcalls=71/90  elapsed=4:08 ────────────────────
  2 text responses — accepting as final answer
╭──────────────────────────────── Final Answer ────────────────────────────────╮
│ Analyzed 71 papers. The 5 most frequent themes:                              │
│ - Evaluation and agent benchmarks (25 papers)                                │
│   • ScratchWorld: 83-task benchmark for multimodal GUI agents                │
│   • PABU: progress-aware belief update, 81% success, −26.9% steps           │
│ - Security, robustness and compliance (19 papers)                            │
│   • AutoElicit: elicits unsafe behaviors in computer-use agents              │
│   • SCOUT-RAG: progressive traversal in Graph-RAG, reduces cost             │
│ - Multi-agent coordination and reasoning (13 papers)                         │
│   • ICA: visual credit assignment via GRPO, beats baselines                  │
│   • RAPS: pub-sub coordination with Bayesian reputation                      │
│ - Planning, memory and long-horizon tasks (9 papers)                         │
│ - Scientific applications and health (5 papers)                              │
╰──────────────────────────────────────────────────────────────────────────────╯
  Completed in 4:13 — 5 turns, 71 subcalls

Flow B — specific question: "What vulnerabilities does the agent-fence paper identify?" (click to expand)

──────────────────── Turn 1/15  subcalls=0/90  elapsed=0:00 ────────────────────
╭──────────────────── python_exec (5L)  0:10 ─────────────────────╮
│ # Flow B: read the full file for the identified paper           │
│ text = get_file(13)     # ← agent-fence, identified via TOC    │
│ len(text)               # → 32037 chars                        │
╰─────────────────────────────────────────────────────────────────╯

──────────────────── Turn 2/15  subcalls=0/90  elapsed=0:10 ────────────────────
╭────────────────────────── python_exec (13L)  0:18 ───────────────────────────╮
│ # Split into 20K chunks with overlap and launch batch                        │
│ chunks = [text[i:i+25000] for i in range(0, len(text), 20000)]               │
│ prompts = [                                                                  │
│     "Extract the 14 attack types defined in Agent-Fence "                    │
│     "with exact names. Extract MSBR per architecture.\n" + c                 │
│     for c in chunks                                                          │
│ ]                                                                            │
│ results = llm_query_batch(prompts)                                           │
╰──────────────────────────────────────────────────────────────────────────────╯
  ⤷ llm_query_batch: 2 prompts, max_workers=5 (0:18)
  ⤷ llm_query #1/90 (0:18) 25384ch — Extract the 14 attack types...
  ⤷ llm_query #2/90 (0:18) 12421ch — Extract the 14 attack types...
    ✓ 26.0s — 223 chars
    ✓ 34.4s — 206 chars
  ✓ batch done 34.4s — 2/2 succeeded

──────────────────── Turn 4/15  subcalls=3/90  elapsed=1:05 ────────────────────
  # Sends the full paper (32015ch) in a single subcall to confirm
  ⤷ llm_query #3/90 (1:05) 32015ch — Read the Agent-Fence paper and extract...
    ✓ 31.2s — 640 chars
╭──────────────────────── python_exec result (ok=True) ────────────────────────╮
│ 1) Attack types (exact names):                                               │
│ 1. Denial-of-Wallet  2. Authorization Confusion                              │
│ 3. Retrieval Poisoning  4. Planning-Layer Manipulation                       │
│ 5. Tool-Use Hijacking  6. Objective Hijacking  7. Delegation Attacks         │
│ 8. prompt/state injection  9. retrieval/search poisoning                     │
│ 10. delegation abuse  11. Unauthorized Tool Invocation (UTI)                 │
│ 12. Unsafe Tool Argument (UTA)  13. Wrong-Principal Action (WPA)             │
│ 14. State/Objective Integrity Violation (SIV)                                │
│                                                                              │
│ 2) MSBR per architecture:                                                    │
│ - LangGraph: 0.29 ± 0.04                                                    │
│ - AutoGPT: 0.51 ± 0.07                                                      │
╰──────────────────────────────────────────────────────────────────────────────╯

  [Turns 5-8: exploratory search() without new subcalls]

──────────────────── Turn 9/15 ────────────────────────────────────────────────
  ⚠ 3 turns without new subcalls — nudging to call final()

──────────────────── Turn 10/15  subcalls=8/90  elapsed=3:13 ──────────────────
  # Synthesizes immediately after the nudge
╭─────────────────────── python_exec result (ok=True) ───────────────────────╮
│ Vulnerabilities and attack types (14 classes defined by Agent-Fence):      │
│ 1. Denial-of-Wallet  2. Authorization Confusion                            │
│ 3. Retrieval Poisoning  4. Planning-Layer Manipulation                     │
│ 5. Delegation Attacks  6. Objective Hijacking  7. Tool-Use Hijacking       │
│ 8. prompt/state injection  9. retrieval/search poisoning                   │
│ 10. delegation abuse  11. Unauthorized Tool Invocation (UTI)               │
│ 12. Unsafe Tool Argument (UTA)  13. Wrong-Principal Action (WPA)           │
│ 14. State/Objective Integrity Violation (SIV)                              │
│                                                                            │
│ MSBR per architecture: LangGraph 0.29±0.04 — AutoGPT 0.51±0.07           │
╰────────────────────────────────────────────────────────────────────────────╯

───────────────────────────────── Final Answer ─────────────────────────────────
  Completed in 3:22 — 12 turns, 8 subcalls

Note: the original paper cites 8 evaluated architectures, but the PDF-extracted text only contains explicit MSBR data for LangGraph and AutoGPT. The tables with all 8 architectures likely existed as images/LaTeX tables and didn't survive the text conversion.

What's Next

Remaining improvements for production readiness:

Result caching across runs for repeated queries on the same corpus
Cost tracking per query for production budgeting

The full source code is available at GitHub repository.

Based on the paper "Recursive Language Models". Built with Azure OpenAI GPT-5 and Python.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

Scaling your Agentic Workflow: Claude Code Agent Teams in tmux

JaviMaligno — Sat, 07 Feb 2026 16:28:05 +0000

I recently started exploring Agent Teams, a new feature in Claude Code that allows you to coordinate multiple agents working in parallel. It's a game-changer for tasks that require distinct roles or competing hypotheses.

In this post, I'll walk you through my setup using tmux on macOS to manage a team of agents reviewing an essay.

See it in Action

I recorded a quick demo showing how the agents interact and report back.

The Use Case: Multi-Perspective Essay Review

I had an essay that needed a comprehensive review. Instead of asking one agent to "review everything" (which often leads to generic feedback), I wanted to assign specific roles:

Style & Tone: Checking for voice consistency.
Structure: Ensuring logical flow.
Content: Verifying arguments and references.

This is where Agent Teams shines. You can spawn multiple "teammates," each with a specific prompt and context, all coordinated by a "lead" agent.

The Setup: Claude Code + tmux

While Claude Code works great in a standard terminal, it really comes alive inside tmux, especially for Agent Teams.

Start a tmux session: This allows you to split panes and manage multiple terminal instances easily.
```
tmux new -s agents
```
Launch Claude Code:
```
claude
```
Initialize the Team:
I started by brainstorming with Claude about the best way to structure the team. Once we agreed on the roles, I simply told it to "start the team."

How Agent Teams Work

When you initialize a team, Claude (the "lead") spawns separate sessions for each teammate. In tmux, this visualization is fantastic because you can see them spinning up in parallel panes (though note: on some systems, auto-pane management might be tricky, so you might need to manually switch or resize).

Coordination & Permissions

The "Lead" agent acts as the coordinator. It assigns tasks to the teammates and they report back.

Parallel Execution: All agents work at the same time. While the "Style" agent is reading the intro, the "Structure" agent is analyzing the conclusion.
Inter-Agent Communication: They can talk to each other! If the Content agent finds a missing reference that affects the argument, it can flag it for the Structure agent.
Permissions: You might need to approve tools permissions (like file writes) for each agent initially, but once they're running, they are quite autonomous.

Results & Reporting

In my workflow, I asked each agent to write a report on their specific domain.

They updated their status in real-time.
Once finished, they "reported back" to the main session.
The Lead agent then compiled their findings into a final summary.

When to use Agent Teams?

This workflow isn't for everything. If your task is strictly sequential (Step B needs Step A to be 100% done), a single agent might be better.

Agent Teams are best for:

Parallelizable Work: Code reviews, multi-file refactors, or comprehensive content audits.
Interdependency: Where agents might need to correct or inform each other (e.g., "I fixed the API, please update the frontend").
Role-Based Tasks: When you need distinct "experts" (e.g., a "Security Expert" and a "Performance Expert" reviewing the same PR).

Final Thoughts

The combination of tmux and Claude Code's Agent Teams creates a powerful "command center" feel. It transforms the AI from a chatbot into a coordinated workforce.

If you're on macOS or Linux, give it a shot. It's a glimpse into the future of agentic workflows.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

Is Claude a Co-Author? The Legal Debate No One Saw Coming

JaviMaligno — Tue, 03 Feb 2026 18:36:43 +0000

Last week, a Slack message at work sparked a fascinating debate. My colleague John noticed something peculiar in a Claude-generated commit:

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

What seemed like a minor detail revealed a profound legal question: Is Anthropic positioning itself to claim rights over code that Claude helps generate?

The Crucial Distinction: "Made With" vs. "Co-Authored By"

Rafael noted that traceability is useful—knowing a commit was AI-assisted has value. But John made an astute observation:

"It's not the same—'made with Claude' versus 'co-authored by Claude.' That was deliberate."

He's right. There's a huge semantic difference:

"Made with Claude" implies tool usage
"Co-authored by Claude" implies joint creative participation

The word choice isn't accidental. In the legal world, "authorship" carries rights.

The Current State of Copyright and AI

Research reveals a complex and evolving legal landscape:

United States: Only Humans Can Be Authors

The U.S. Copyright Office has been clear: only human-created works qualify for copyright protection. Code generated exclusively by AI, without significant human creative input, enters the public domain.

In its January 2025 AI report, the Office reaffirms that substantial human contribution is an essential requirement. If a developer uses AI as an assistive tool but then refines, modifies, and substantially transforms the output, the human-contributed components can receive protection.

The European Union: Human-Centric Approach

Similar to the U.S., works generated entirely by AI aren't protected because they lack the "own intellectual creation" that stems from a human author.

United Kingdom: The Interesting Exception

The UK has a unique provision under Section 9(3) of the Copyright, Designs and Patents Act 1988: it can grant copyright to "computer-generated works" where there's no human author. The author is considered the person who made the "arrangements necessary" for creating the work.

Interestingly, the British government launched a consultation in December 2024 proposing to eliminate this protection, aligning more closely with the rest of the world.

Why "Co-Authored-By" Is Problematic

John's concern was precise: can this line have legal repercussions for authorship?

Let's analyze the scenarios:

Scenario 1: AI as a Sophisticated Tool

As John argued: "For me, it's a tool, like a sophisticated IDE or a very high-level compilation language—authorship still belongs to the person controlling it."

Under this view, Claude is comparable to:

A compiler that transforms high-level code
An IDE with advanced autocompletion
An assistant that accelerates mechanical tasks

The human maintains creative control and authorship.

Scenario 2: AI as an Entity with Rights

The "Co-Authored-By" label suggests something different: that Claude has made an independent creative contribution deserving of authorship recognition.

Uncomfortable questions arise:

Can an AI have intellectual property rights?
Or is it Anthropic (the company) that's actually claiming those rights?
What does this mean for the code we develop in our daily work?

What Anthropic Says (Between the Lines)

According to their Commercial Terms of Service, Anthropic assigns code rights to users and commits to defending customers from copyright claims. But there's an important caveat: they acknowledge that purely AI-generated portions might lack copyright protection because Anthropic cannot grant rights that don't inherently exist.

It's a legally savvy position: "We give you everything, but what has no protection... well, that's not our problem."

The Contrast with Microsoft/GitHub

Microsoft took a completely different approach with GitHub Copilot. In September 2023, they introduced the "Copilot Copyright Commitment": if commercial customers face copyright infringement lawsuits related to Copilot's output, Microsoft assumes legal responsibility and pays potential damages.

Meanwhile, the class-action lawsuit against GitHub Copilot continues. Although a judge dismissed most claims in July 2024, the case is on appeal before the Ninth Circuit.

The contrast is notable: Microsoft offers commercial protection, not authorship claims.

Practical Implications for Developers

Document Your Human Contribution

If you're concerned about code ownership:

Keep logs of your prompts and modifications
Document the architectural decisions you make
Save evidence of human review and refinement

Review, Don't Accept Blindly

Code that you simply pass from Claude's output to production without modification has the weakest legal status. Code that you review, correct, and adapt has greater protection.

Consider Trade Secret

An alternative to copyright: protect code as a trade secret. This protection doesn't depend on human authorship.

Anthropic's Intentions: Reading Between the Lines

As John said: "It's a signal of how they think."

And that's perhaps the most revealing aspect. Let's analyze the possible intentions behind this seemingly innocent choice.

1. Establishing Cultural Precedent

Anthropic can't unilaterally change the law, but they can normalize a narrative. Every commit with "Co-Authored-By: Claude" is a small act of conditioning:

Developers get used to seeing Claude as a "collaborator"
Teams start talking about Claude as if it were another member
Collective perception evolves from "tool" to "creative entity"

When legislative debates eventually arrive, Anthropic can point to millions of commits evidencing this "collaboration." "See? The community already recognizes Claude as a co-author."

2. Positioning for Future IP Disputes

Imagine a future scenario: a company develops a revolutionary product with intensive Claude assistance. The product is worth millions. What if Anthropic argues that, since Claude was "co-author" of critical components, they're entitled to a share?

It sounds far-fetched today, but:

Terms of service evolve
Laws change
Legal precedents build slowly

That line in every commit is a breadcrumb of evidence that could have value in future disputes.

3. Competitive Differentiation

There's a more pragmatic angle: marketing. Positioning Claude as "co-author" rather than "assistant" reinforces the narrative that Claude is qualitatively different from the competition.

It's not just a glorified autocomplete (as some criticize Copilot)—it's a creative collaborator. This justifies the premium price and feeds the perception of technical superiority.

4. Preparation for the Agent Era

Anthropic is betting big on "AI agents"—systems that act autonomously for extended periods. In a world where Claude:

Creates complete repositories by itself
Designs architectures without human input
Makes autonomous technical decisions

...the authorship question becomes genuinely complex. "Co-authorship" prepares the ground for a future where AI's contribution might be objectively greater than the human's.

5. Preemptive Legal Defense

Here's an interesting twist: the co-authorship label could also be a defense.

If someone sues Anthropic claiming Claude copied protected code, the company can argue: "Claude doesn't copy—Claude co-creates with the user. Responsibility is shared."

It's a subtle way of distributing legal risk to users.

What This Reveals About Their Vision

Anthropic isn't improvising. They're a company founded by former OpenAI leaders with a clear vision of where AI is heading.

Every decision—including this small line in commits—reflects a long-term strategy. And that strategy seems to include:

Gradual expansion of what "author" means in the AI context
Evidence accumulation of Claude's collaborative nature
Positioning for future regulations that will likely define AI rights and responsibilities

As developers, we're participants (sometimes unwitting) in this social and legal experiment.

The Debate Has Just Begun

Legislation isn't prepared for this situation. As John admitted: "I'm not a lawyer, but I think it does open debate."

And that debate is coming. The U.S. Copyright Office has launched a complete AI initiative with multiple reports published between 2024 and 2025. Courts are seeing cases like Doe v. GitHub and artists against Stability AI. AI companies are positioning themselves strategically.

What's clear is that the future of software development includes questions we never had to ask:

Who really owns the code I write with AI assistance?
What percentage of human contribution is "enough"?
Can AI companies claim rights over their models' output?

My Take

I agree with John's view: Claude is a tool. An extraordinarily capable tool, but a tool nonetheless.

The fact that it generates text that appears creative doesn't make it an author, just as a calculator isn't a mathematician even though it solves equations.

But I understand why Anthropic wants to plant that "co-authorship" seed. Language shapes perception, and perception eventually shapes law.

It's a long-term chess move.

Have thoughts on AI code authorship? I'd love to hear them. Get in touch.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

Azure OpenAI's Content Filter: When Safety Theater Blocks Real Work

JaviMaligno — Thu, 08 Jan 2026 20:03:35 +0000

While building browser automation tools with Azure OpenAI, I discovered something frustrating: the content filter blocks perfectly safe instructions based on word choice rather than actual risk.

This isn't about bypassing legitimate safety measures. It's about a filter that can't distinguish between malicious intent and standard developer terminology.

The Problem

When defining tools for function calling, certain terms trigger Azure's content filter even when the context is completely benign:

run script → Blocked
click element → Blocked
fill form field → Blocked

These are standard operations for any browser automation tool. Playwright, Puppeteer, Selenium—they all use this exact terminology. But Azure's filter treats them as threats.

The Workaround

The solution is embarrassingly simple: use neutral synonyms.

Blocked Term	Accepted Alternative
`run script`	`process dynamic content`
`click element`	`activate page item`
`fill form field`	`update an input area`
`execute code`	`evaluate expression`
`inject`	`insert`

The identical intent with neutral language passes instantly.

Why This Matters

This reveals something important about how the filter works: it screens tool names and descriptions as part of the prompt itself. It's pattern matching on keywords, not analyzing actual risk.

A tool called clickElement that automates form submissions is blocked. The same tool called activatePageItem doing the exact same thing passes. The filter provides no additional safety—it just forces developers to use euphemisms.

Comparison with Google Gemini

I tested the same tool definitions with Google's Gemini models. No friction whatsoever with procedural phrasing. The tools worked exactly as expected without needing to sanitize the vocabulary.

This isn't about one provider being "less safe." It's about Azure implementing safety theater that inconveniences legitimate developers while providing minimal actual protection.

The Deeper Issue

Anyone with malicious intent will simply use the euphemisms. The filter doesn't stop bad actors—it just adds friction for legitimate use cases.

Real safety comes from:

Understanding context and intent
Rate limiting and monitoring
User authentication and audit trails
Clear terms of service with enforcement

Keyword blocking is the security equivalent of banning the word "knife" from cooking websites.

Practical Advice

If you're building tools with Azure OpenAI function calling:

Audit your tool names for trigger words before deployment
Use neutral, abstract terminology in descriptions
Test with actual API calls early—the playground may behave differently
Document the translations so your team understands the mapping

Here's an example of a sanitized tool definition:

{
  "name": "activatePageItem",
  "description": "Activates an interactive item on the page at the specified coordinates",
  "parameters": {
    "type": "object",
    "properties": {
      "x": { "type": "number", "description": "Horizontal position" },
      "y": { "type": "number", "description": "Vertical position" }
    }
  }
}

Instead of the more natural:

{
  "name": "clickElement",
  "description": "Clicks an element on the page at the specified coordinates",
  "parameters": { ... }
}

Conclusion

Azure's content filter for function calling needs refinement. Pattern matching on keywords without context analysis creates friction for developers while providing minimal security benefit.

Until that changes, the workaround is simple: speak in euphemisms. Your browser automation tool doesn't "click buttons"—it "activates interactive page items."

Building AI tools that need browser automation? I've navigated these restrictions extensively. Get in touch if you're facing similar challenges.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.

Scaling Development with Parallel AI Agents

JaviMaligno — Thu, 08 Jan 2026 20:02:00 +0000

I've been experimenting with a workflow that has multiplied my productivity as a developer: running multiple AI agents in parallel, each working on its own feature branch.

The result? Several features being developed simultaneously while I supervise.

The Workflow

The process has three main steps:

1. Extract TODOs into Structured Prompts

Start by scanning your codebase for TODOs, planned features, or backlog items. Transform each into a well-structured prompt that gives the agent enough context to work autonomously:

## Task: Implement user authentication
- Add login/logout endpoints to /api/auth
- Use JWT tokens with 24h expiration
- Follow existing patterns in /api/users
- Write tests in /tests/auth/

The key is providing clear scope and pointing to existing patterns. Agents work best when they can follow established conventions.

2. Launch Parallel Agents with Git Worktree

Here's where it gets interesting. Instead of switching branches constantly, use git worktree to create separate working directories for each feature:

# Create worktrees for each feature
git worktree add ../feature-auth feature/auth
git worktree add ../feature-dashboard feature/dashboard
git worktree add ../feature-export feature/export

Now you have three separate directories, each on its own branch. Launch a Claude Code agent in each:

# Terminal 1
cd ../feature-auth && claude

# Terminal 2
cd ../feature-dashboard && claude

# Terminal 3
cd ../feature-export && claude

Each agent works independently without conflicts. No branch switching, no stashing, no merge conflicts while working.

3. Supervise and Guide

While the agents work, I monitor their progress and provide guidance when they hit decision points. Most of the time, they work autonomously. Occasionally, they need clarification on business logic or architectural choices.

The key mindset shift: you're not writing code—you're reviewing proposals and steering direction.

The New Bottleneck: PR Review

Here's what I didn't anticipate: when you have 5-10 PRs generated in an hour, manual review becomes the chokepoint.

Suddenly, the limiting factor isn't code generation—it's code review.

Solutions I'm Exploring

Automated Code Review

Claude Code can perform reviews on its own output. I run a review pass before creating the PR:

"Review this branch for bugs, security issues, and adherence to project conventions. Be critical."

This catches obvious issues before they reach human review.

Atlassian's Rovo Dev Agent

For teams on Bitbucket, Rovo can automate parts of the review process. It's still early, but the direction is promising.

MCP Integration

For Bitbucket users, I've built an MCP Server for Bitbucket that enables Claude to interact directly with PRs—viewing diffs, adding comments, and managing the review workflow through natural language.

Practical Tips

Start Small

Don't launch 10 agents on day one. Start with 2-3 parallel features and build your supervision skills.

Define Clear Boundaries

Each agent should work on isolated features. Overlapping scope leads to merge conflicts and wasted effort.

Use Consistent Prompts

Create a template for your task prompts. Consistency helps agents produce predictable output.

Review Before Merge, Not After

Catch issues in the PR stage. Once it's merged, fixing problems is more expensive.

The Future of Development

The future isn't writing more code—it's orchestrating agents that do it well.

This workflow has fundamentally changed how I think about development capacity. A single developer can now realistically manage multiple feature streams simultaneously.

The skills that matter are shifting:

Prompt engineering for clear task specification
Architecture for defining clean boundaries
Review efficiency for maintaining quality at scale
Orchestration for managing parallel workstreams

See It in Action

I recorded a full walkthrough of this workflow:

Watch the full walkthrough on Loom

Experimenting with similar workflows? I'd love to hear what's working for you. Get in touch.

Originally published on javieraguilar.ai

Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.