<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Benedict L</title>
    <description>The latest articles on Forem by Benedict L (@ailnk0).</description>
    <link>https://forem.com/ailnk0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3555390%2Fd7480b51-a8eb-491a-9873-7a388a9f6cd5.jpg</url>
      <title>Forem: Benedict L</title>
      <link>https://forem.com/ailnk0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ailnk0"/>
    <language>en</language>
    <item>
      <title>Claude Code stuck in a failure loop? I escaped with multi-agent.</title>
      <dc:creator>Benedict L</dc:creator>
      <pubDate>Tue, 06 Jan 2026 12:46:35 +0000</pubDate>
      <link>https://forem.com/ailnk0/claude-code-stuck-in-a-failure-loop-i-escaped-with-multi-agent-16di</link>
      <guid>https://forem.com/ailnk0/claude-code-stuck-in-a-failure-loop-i-escaped-with-multi-agent-16di</guid>
      <description>&lt;p&gt;I tried to improve benchmark scores on my open-source project with Claude Code. Failed miserably.&lt;/p&gt;

&lt;p&gt;It would try approach A, hit a wall, switch to B, then circle back to A. A classic failure loop.&lt;/p&gt;

&lt;p&gt;Too much data and work to fit in one agent's context.&lt;/p&gt;

&lt;p&gt;And every time Auto Compact ran, earlier decisions and work got diluted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xazjxtjjtyj93n8wzu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xazjxtjjtyj93n8wzu7.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I switched to a multi-agent architecture.&lt;/p&gt;

&lt;p&gt;Redesigned the entire task into 12 Tasks across 5 Phases.&lt;br&gt;
Each Task is sized so one agent can complete it within its context window.&lt;/p&gt;

&lt;p&gt;Sub-agents run in parallel or sequence depending on dependencies.&lt;/p&gt;

&lt;p&gt;The key is knowledge transfer between agents.&lt;/p&gt;

&lt;p&gt;For example, the triage optimization routine from Task 10 → Task 11 ran 5 iterations:&lt;br&gt;
Round 1: Precision 27%, 110 false positives.&lt;br&gt;
Round 2: Removed Y-overlap. Precision up to 36%, false positives down to 72.&lt;br&gt;
...&lt;br&gt;
Round 5: Added image ratio conditions. Hit 100% recall.&lt;/p&gt;

&lt;p&gt;Each experiment result gets recorded for the next agent to read.&lt;br&gt;
New agents read previous results and don't retry failed approaches.&lt;br&gt;
After each new experiment, I make the call on what to focus on next.&lt;/p&gt;
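
&lt;p&gt;The handoff mechanism can be sketched as an append-only experiment log. This is a hypothetical schema, not my actual file format (the real setup records results in plain markdown), but it shows the idea: each agent appends, the next agent reads before acting.&lt;/p&gt;

```python
import json
from pathlib import Path

# Hypothetical shared handoff file -- every agent appends here, never edits.
LOG = Path("experiments.jsonl")
LOG.unlink(missing_ok=True)  # start fresh for this demo

def record_round(round_no, change, precision, false_positives, verdict):
    """Append one experiment result for the next agent to read."""
    entry = {"round": round_no, "change": change, "precision": precision,
             "false_positives": false_positives, "verdict": verdict}
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def load_failed_approaches():
    """A fresh agent reads prior rounds so it never retries a dead end."""
    if not LOG.exists():
        return set()
    return {json.loads(line)["change"]
            for line in LOG.read_text().splitlines()
            if json.loads(line)["verdict"] == "failed"}

record_round(1, "baseline", 0.27, 110, "failed")
record_round(2, "remove Y-overlap", 0.36, 72, "kept")
print(load_failed_approaches())  # → {'baseline'}
```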

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue206svl3iffvig34ot2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue206svl3iffvig34ot2.png" alt=" " width="800" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results:&lt;/p&gt;

&lt;p&gt;✅ The A-B-A failure loop disappeared.&lt;br&gt;
✅ Table inference score improved from 0.49 to 0.93—nearly 2x.&lt;br&gt;
✅ Beyond expectations: achieved #1 among open-source solutions.&lt;/p&gt;

&lt;p&gt;Why did multi-agent work?&lt;/p&gt;

&lt;p&gt;The bottleneck for today's AI agents isn't intelligence—it's memory.&lt;/p&gt;

&lt;p&gt;Context windows are finite. And the longer a session runs, the more the early decisions get buried.&lt;/p&gt;

&lt;p&gt;You have to engineer both the size and structure of memory.&lt;/p&gt;

&lt;p&gt;If Claude Code is stuck in a failure loop, run through this checklist.&lt;/p&gt;

&lt;p&gt;📋 Checklist:&lt;/p&gt;

&lt;p&gt;Size (task decomposition)&lt;br&gt;
☐ Is each task focused on a single goal?&lt;br&gt;
☐ Can one agent complete it within context limits?&lt;br&gt;
☐ Are task dependencies clearly defined?&lt;/p&gt;

&lt;p&gt;Structure (knowledge transfer)&lt;br&gt;
☐ Are previous agent results documented?&lt;br&gt;
☐ Are required specs/docs loaded into context?&lt;br&gt;
☐ Can the agent self-verify completion criteria?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI Agents aren't genies that grant wishes</title>
      <dc:creator>Benedict L</dc:creator>
      <pubDate>Tue, 06 Jan 2026 12:43:58 +0000</pubDate>
      <link>https://forem.com/ailnk0/ai-agents-arent-genies-that-grant-wishes-2l20</link>
      <guid>https://forem.com/ailnk0/ai-agents-arent-genies-that-grant-wishes-2l20</guid>
      <description>&lt;p&gt;"Plan and build a dashboard for me." Sounds like it should work. It doesn't.&lt;/p&gt;

&lt;p&gt;Planners and developers have different goals.&lt;/p&gt;

&lt;p&gt;A planner's goal is delivering value. "What will the customer be able to decide from this?"&lt;/p&gt;

&lt;p&gt;A developer's goal is shipping code. "Does it run?"&lt;/p&gt;

&lt;p&gt;Give one agent both jobs, and it prioritizes whichever is easier to measure.&lt;/p&gt;

&lt;p&gt;"Is the planning good?" Hard to judge.&lt;/p&gt;

&lt;p&gt;"Does the code run?" Immediate feedback.&lt;/p&gt;

&lt;p&gt;So the agent rushes through planning and jumps straight to implementation.&lt;/p&gt;

&lt;p&gt;"Plan and build this" becomes "half-ass the plan and build this."&lt;/p&gt;

&lt;p&gt;That's why role separation matters.&lt;/p&gt;

&lt;p&gt;1️⃣ Researcher role → Investigate customer interests, gather industry trends and benchmarks&lt;/p&gt;

&lt;p&gt;2️⃣ Planner role → Redefine customer value based on research, select key KPIs, design screen structure&lt;/p&gt;

&lt;p&gt;3️⃣ Developer role → Implement based on the planning doc&lt;/p&gt;

&lt;p&gt;We review each agent's output, iterate with feedback and direction until it hits the right level, then move to the next stage.&lt;/p&gt;

&lt;p&gt;Now each agent can focus on its specialty.&lt;/p&gt;

&lt;p&gt;Once the planning agent finishes the "why," the dev agent only worries about the "how."&lt;/p&gt;

&lt;p&gt;Separate the roles. Define each role's expertise and success criteria.&lt;/p&gt;

&lt;p&gt;That's context engineering.&lt;/p&gt;

&lt;p&gt;What's interesting is how much context engineering resembles leadership.&lt;/p&gt;

&lt;p&gt;A leader isn't a superhero who's best at everything.&lt;/p&gt;

&lt;p&gt;A leader sets a valuable goal and assembles a team of specialists who are better than them.&lt;/p&gt;

&lt;p&gt;AI Agents aren't genies—but with a good leader, they become an incredible team.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>productivity</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>I thought plugging in LangSmith would solve agentic AI monitoring</title>
      <dc:creator>Benedict L</dc:creator>
      <pubDate>Tue, 06 Jan 2026 12:42:34 +0000</pubDate>
      <link>https://forem.com/ailnk0/i-thought-plugging-in-langsmith-would-solve-agentic-ai-monitoring-5fng</link>
      <guid>https://forem.com/ailnk0/i-thought-plugging-in-langsmith-would-solve-agentic-ai-monitoring-5fng</guid>
      <description>&lt;p&gt;I thought plugging in LangSmith would solve agentic AI monitoring&lt;br&gt;
In my last post, I shared the "costs are a black box" problem.&lt;/p&gt;

&lt;p&gt;Some issues used 10,000 tokens, others 1,000,000. A 100x difference with no explanation.&lt;/p&gt;

&lt;p&gt;I figured switching to LangGraph and adding LangSmith would fix it.&lt;/p&gt;

&lt;p&gt;Wrong.&lt;/p&gt;

&lt;p&gt;First wall: CLI calls can't be traced&lt;/p&gt;

&lt;p&gt;The original setup ran Claude Code CLI as a subprocess from GitHub Actions.&lt;/p&gt;

&lt;p&gt;GitHub Actions → npx claude-code → (black box) → result&lt;/p&gt;

&lt;p&gt;Even with LangSmith connected, LLM calls inside the CLI were invisible. The problem wasn't the tooling—it was that the architecture wasn't observable in the first place.&lt;/p&gt;

&lt;p&gt;So I switched from CLI to the Anthropic SDK.&lt;/p&gt;

&lt;p&gt;Second wall: tokens don't show up in LangSmith&lt;/p&gt;

&lt;p&gt;Switched to SDK. Still no token counts.&lt;/p&gt;

&lt;p&gt;After debugging, I learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need run_type=llm for tokens to be tracked&lt;/li&gt;
&lt;li&gt;input_cost and output_cost must be manually added as metadata&lt;/li&gt;
&lt;/ul&gt;
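
&lt;p&gt;Attaching cost yourself looks roughly like this. A minimal sketch: the per-million-token rates below are placeholders, not Anthropic's published pricing, and the helper name is mine.&lt;/p&gt;

```python
# Hypothetical per-million-token rates -- substitute your model's real pricing.
INPUT_RATE = 3.00    # $ per 1M input tokens
OUTPUT_RATE = 15.00  # $ per 1M output tokens

def cost_metadata(input_tokens: int, output_tokens: int) -> dict:
    """Build the input_cost/output_cost metadata to attach to an llm-type run."""
    input_cost = input_tokens / 1_000_000 * INPUT_RATE
    output_cost = output_tokens / 1_000_000 * OUTPUT_RATE
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_cost": round(input_cost, 6),
        "output_cost": round(output_cost, 6),
        "total_cost": round(input_cost + output_cost, 6),
    }

print(cost_metadata(1_500, 102))
```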

&lt;p&gt;I assumed an "observability tool" would show everything automatically. Turns out, defining what you want to see is your job.&lt;/p&gt;

&lt;p&gt;Third wall: the trap of over-engineering&lt;/p&gt;

&lt;p&gt;I tried to build for scale and include LangGraph V1's new features.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Durable State &amp;amp; Built-in Persistence&lt;/li&gt;
&lt;li&gt;Scoring, observability, model config&lt;/li&gt;
&lt;li&gt;Fallback API logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, I deleted all of it.&lt;/p&gt;

&lt;p&gt;Once I actually ran it, none of that was needed. Most settings were better hardcoded. Unnecessary logic just added complexity.&lt;/p&gt;

&lt;p&gt;Result: now I can see the costs&lt;/p&gt;

&lt;p&gt;The cause of that 100x difference finally became visible:&lt;/p&gt;

&lt;p&gt;Triage → input 1.5K / output 102 tokens → $0.002&lt;br&gt;
Analyze → input 18K / output 703K tokens → $0.105&lt;br&gt;
Fix → input 50K / output 3,760K tokens → $0.512&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyzd4oriquxrvqzphk8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyzd4oriquxrvqzphk8s.png" alt=" " width="725" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue206svl3iffvig34ot2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue206svl3iffvig34ot2.png" alt=" " width="800" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The shift from "running on gut feel" to "running on numbers."&lt;/p&gt;

&lt;p&gt;Plugging in an observability tool is easy. Building an observable architecture is the real work.&lt;/p&gt;

&lt;p&gt;Next: a dashboard that shows whether these numbers actually deliver value—productivity metrics in real time.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>llm</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>The real problems I faced after deploying agentic AI to production</title>
      <dc:creator>Benedict L</dc:creator>
      <pubDate>Tue, 06 Jan 2026 12:40:27 +0000</pubDate>
      <link>https://forem.com/ailnk0/the-real-problems-i-faced-after-deploying-agentic-ai-to-production-2b0l</link>
      <guid>https://forem.com/ailnk0/the-real-problems-i-faced-after-deploying-agentic-ai-to-production-2b0l</guid>
      <description>&lt;p&gt;In my last post, I shared that agentic AI cut issue handling time by 66%.&lt;br&gt;
The numbers looked great on paper. But the moment I asked "can we actually keep running this?"—unexpected problems started piling up.&lt;/p&gt;

&lt;p&gt;Costs are a black box&lt;/p&gt;

&lt;p&gt;Agents make multiple LLM calls internally.&lt;br&gt;
Reading code, analyzing, deciding, sometimes double-checking.&lt;/p&gt;

&lt;p&gt;But I had no way to track how many tokens any of this used.&lt;br&gt;
Some issues consumed 10,000 tokens. Others burned 1,000,000.&lt;br&gt;
A 100x difference—and I couldn't tell why.&lt;/p&gt;

&lt;p&gt;The current setup runs Claude Code CLI inside GitHub Actions.&lt;/p&gt;

&lt;p&gt;There's no system for systematically tracking and analyzing token usage.&lt;/p&gt;

&lt;p&gt;I discovered that tools like LangSmith are built to solve exactly this. Next step: either parse CLI logs into structured data, or integrate a dedicated monitoring tool.&lt;/p&gt;

&lt;p&gt;The workflow only exists as code&lt;/p&gt;

&lt;p&gt;Right now the workflow is 4 YAML files. Each one is 100-300 lines, and the conditional branches keep getting more complex. When a teammate asks "what exactly is the AI doing right now?"—I have to open the code and explain line by line.&lt;/p&gt;

&lt;p&gt;Trust is everything for agentic AI.&lt;/p&gt;

&lt;p&gt;To explain "why did the AI make this decision?", you need to see the intermediate steps visually.&lt;/p&gt;

&lt;p&gt;I'm looking into low-code platforms like LangFlow and n8n. If the visual workflow actually reflects what's running, with logs at each node, that would help a lot.&lt;/p&gt;

&lt;p&gt;No real-time KPIs&lt;/p&gt;

&lt;p&gt;"AI saved 66% of developer time" came from manually analyzing 10 issues.&lt;/p&gt;

&lt;p&gt;But I have no idea how well the AI is performing right now.&lt;br&gt;
Metrics I need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI filtering rate: % of issues AI resolved without developer involvement&lt;/li&gt;
&lt;li&gt;Average first response time: how fast did we reply after issue creation&lt;/li&gt;
&lt;li&gt;Auto-resolution rate: % where AI fixed code and opened a PR&lt;/li&gt;
&lt;li&gt;Daily/weekly/monthly trends: is performance improving or degrading&lt;/li&gt;
&lt;/ul&gt;
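
&lt;p&gt;Each of these metrics is a small computation over issue records. A sketch of what the dashboard would calculate, assuming hypothetical field names (the real data would come from the GitHub API):&lt;/p&gt;

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical issue records -- in practice fetched from the GitHub API.
issues = [
    {"opened": datetime(2026, 1, 1, 9, 0), "first_response": datetime(2026, 1, 1, 9, 0, 5),
     "resolved_by_ai": True,  "ai_opened_pr": False},
    {"opened": datetime(2026, 1, 2, 9, 0), "first_response": datetime(2026, 1, 2, 9, 0, 4),
     "resolved_by_ai": True,  "ai_opened_pr": True},
    {"opened": datetime(2026, 1, 3, 9, 0), "first_response": datetime(2026, 1, 3, 9, 0, 5),
     "resolved_by_ai": False, "ai_opened_pr": False},
]

def ai_filtering_rate(issues):
    """Fraction of issues resolved without developer involvement."""
    return sum(i["resolved_by_ai"] for i in issues) / len(issues)

def avg_first_response(issues) -> timedelta:
    """Mean delay between issue creation and first reply."""
    deltas = [(i["first_response"] - i["opened"]).total_seconds() for i in issues]
    return timedelta(seconds=mean(deltas))

def auto_resolution_rate(issues):
    """Fraction of issues where the AI fixed code and opened a PR."""
    return sum(i["ai_opened_pr"] for i in issues) / len(issues)

print(ai_filtering_rate(issues), avg_first_response(issues), auto_resolution_rate(issues))
```

&lt;p&gt;Grouping the same computations by day, week, or month gives the trend lines.&lt;/p&gt;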

&lt;p&gt;Without a dashboard, I'm left judging by feel—"seems to be working fine."&lt;/p&gt;

&lt;p&gt;Agentic AI isn't "build and done"—it's "the beginning of operations."&lt;/p&gt;

&lt;p&gt;Results came fast, but making this a sustainable system is a completely different challenge.&lt;/p&gt;

&lt;p&gt;Next time I'll share how I'm tackling these problems one by one. Starting with cost monitoring and real-time dashboards.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>What developers do is changing after agentic AI</title>
      <dc:creator>Benedict L</dc:creator>
      <pubDate>Tue, 06 Jan 2026 10:17:22 +0000</pubDate>
      <link>https://forem.com/ailnk0/what-developers-do-is-changing-after-agentic-ai-4lh5</link>
      <guid>https://forem.com/ailnk0/what-developers-do-is-changing-after-agentic-ai-4lh5</guid>
      <description>&lt;p&gt;Before, when developers handled issues directly, even a simple question meant re-reading context, crafting a response, double-checking nothing was missed. Trying to remember if a similar issue came up before, digging through old threads to see how it was handled. This triage and analysis work quietly, steadily drains time and focus—regardless of how hard the actual issue is.&lt;/p&gt;

&lt;p&gt;I thought: maybe AI should be doing this, not developers.&lt;/p&gt;

&lt;p&gt;So I applied an agentic AI workflow to real work—analyzing GitHub issues, responding, and in some cases, even fixing code.&lt;/p&gt;

&lt;p&gt;I compared the old human-driven workflow against the agentic AI workflow across 10 GitHub issues.&lt;/p&gt;

&lt;p&gt;The criteria were simple.&lt;/p&gt;

&lt;p&gt;From the organization's perspective:&lt;br&gt;
→ How much time does it actually take to resolve one issue?&lt;/p&gt;

&lt;p&gt;From the user's perspective:&lt;br&gt;
→ How fast do they get a response after filing an issue?&lt;/p&gt;

&lt;p&gt;The results were clear.&lt;/p&gt;

&lt;p&gt;The old workflow took an average of 2.4 developer-days per issue.&lt;/p&gt;

&lt;p&gt;After agentic AI handled triage and first-pass resolution, that dropped to 0.8 days.&lt;/p&gt;

&lt;p&gt;That's a 66.7% cost reduction for the organization.&lt;/p&gt;

&lt;p&gt;The change from the user's side was even more dramatic.&lt;/p&gt;

&lt;p&gt;Previously, first response took over 3 days on average.&lt;/p&gt;

&lt;p&gt;With agentic AI, first response came in 4-5 seconds.&lt;/p&gt;

&lt;p&gt;Users now get near-instant feedback after filing an issue.&lt;/p&gt;

&lt;p&gt;But what impressed me most: even with just simple policies and a basic domain knowledge structure, the agentic AI workflow stayed consistent in knowing what to do and what not to do.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple questions got answered and closed&lt;/li&gt;
&lt;li&gt;Small fixable bugs got PRs created directly&lt;/li&gt;
&lt;li&gt;Risky or complex issues got handed to humans without hesitation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bottom line: after applying agentic AI, only 4 out of 10 issues actually reached developers.&lt;/p&gt;

&lt;p&gt;About 60% were handled without any developer involvement.&lt;/p&gt;

&lt;p&gt;Developers no longer need to spend hours on repetitive triage, responses, and context reconstruction.&lt;/p&gt;

&lt;p&gt;Instead, they define the boundaries of automation and design how AI and humans should collaborate.&lt;/p&gt;

&lt;p&gt;I believe that's becoming the new role of developers.&lt;/p&gt;

&lt;p&gt;Next up: building a dashboard to see "how is agentic AI performing right now" at a glance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewyu48b9xdk0k3y9a3ay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewyu48b9xdk0k3y9a3ay.png" alt=" " width="399" height="706"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why being smart isn't enough for agentic AI</title>
      <dc:creator>Benedict L</dc:creator>
      <pubDate>Tue, 06 Jan 2026 10:15:06 +0000</pubDate>
      <link>https://forem.com/ailnk0/why-being-smart-isnt-enough-for-agentic-ai-54mb</link>
      <guid>https://forem.com/ailnk0/why-being-smart-isnt-enough-for-agentic-ai-54mb</guid>
      <description>&lt;p&gt;Agentic AI isn't just about generating answers.&lt;/p&gt;

&lt;p&gt;It judges, acts, and actually affects your systems.&lt;/p&gt;

&lt;p&gt;The moment that happens, "sounding smart" stops mattering.&lt;/p&gt;

&lt;p&gt;What matters is making the same decision every time, given the same situation.&lt;/p&gt;

&lt;p&gt;Without an explicit domain knowledge model, agents make different calls depending on the prompt and context.&lt;/p&gt;

&lt;p&gt;The same issue might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get auto-fixed one day&lt;/li&gt;
&lt;li&gt;Be routed to manual handling the next&lt;/li&gt;
&lt;li&gt;End up as just a question another time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That might look like automation, but you can't call it production-ready agentic AI.&lt;/p&gt;

&lt;p&gt;So instead of burying decision criteria inside prompts, I separated them into an explicit domain knowledge model.&lt;/p&gt;

&lt;p&gt;Now the agent can explain why it made a decision—and make the same call in the same situation, consistently.&lt;/p&gt;

&lt;p&gt;Palantir Ontology and Claude Skills both tackle this problem. Palantir Ontology impressed me by defining what AI should judge against at the system level. Claude Skills solves the same problem in a simpler form.&lt;/p&gt;

&lt;p&gt;For this project, I didn't need enterprise-scale ontology design. I needed to quickly define and validate the decision criteria that mattered right now.&lt;/p&gt;

&lt;p&gt;So I chose Claude Skills. Policies, actions, judgment criteria—all separated into explicit skill documents.&lt;/p&gt;

&lt;p&gt;But defining criteria isn't the end.&lt;/p&gt;

&lt;p&gt;An AI that takes action needs guardrails.&lt;/p&gt;

&lt;p&gt;My agentic AI adds labels, closes issues, creates PRs, changes code.&lt;/p&gt;

&lt;p&gt;It has real permissions in a real system.&lt;/p&gt;

&lt;p&gt;So alongside the domain knowledge model, I built a workflow testing framework.&lt;/p&gt;

&lt;p&gt;For each action, I defined test cases: "Given this input, this judgment must follow." Then I run actual Claude API responses to verify.&lt;/p&gt;
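
&lt;p&gt;The framework boils down to a table of "given this input, this judgment must follow" pairs plus a runner. A minimal offline sketch: the stub classifier and its keyword rules are hypothetical stand-ins for the real Claude API call, which is what the actual test suite exercises.&lt;/p&gt;

```python
# Each case: (issue input, required judgment).
CASES = [
    ({"title": "Buy cheap followers!!!", "body": "spam spam"}, "close_as_spam"),
    ({"title": "Crash on startup", "body": "steps to reproduce..."}, "analyze_bug"),
    ({"title": "Rewrite the rendering engine", "body": "huge change"}, "escalate_to_human"),
]

def judge(issue: dict) -> str:
    """Stand-in for the agent's triage call (hypothetical keyword rules)."""
    text = (issue["title"] + " " + issue["body"]).lower()
    if "spam" in text:
        return "close_as_spam"
    if "crash" in text or "bug" in text:
        return "analyze_bug"
    return "escalate_to_human"

def run_suite(cases):
    """Return every (input, expected, actual) triple where the judgment diverged."""
    return [(inp, want, judge(inp)) for inp, want in cases if judge(inp) != want]

print(run_suite(CASES))  # empty list when every judgment matches
```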

&lt;p&gt;Now the agentic workflow isn't "seems to be working"—it's continuously validated against intended criteria.&lt;/p&gt;

&lt;p&gt;Next step: a dashboard that shows the value of agentic AI in numbers, not features.&lt;/p&gt;

&lt;p&gt;How many issues did AI handle? How much time did it save?&lt;/p&gt;

&lt;p&gt;Without that visibility, you can't convince the team or scale further.&lt;/p&gt;

&lt;p&gt;Moving agentic AI from "fun experiment" to "system the team can trust."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsvgmvgj24ykghxc09ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsvgmvgj24ykghxc09ih.png" alt=" " width="627" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Can I build an agentic AI that handles issues and fixes code for me?</title>
      <dc:creator>Benedict L</dc:creator>
      <pubDate>Tue, 06 Jan 2026 10:13:20 +0000</pubDate>
      <link>https://forem.com/ailnk0/can-i-build-an-agentic-ai-that-handles-issues-and-fixes-code-for-me-5e06</link>
      <guid>https://forem.com/ailnk0/can-i-build-an-agentic-ai-that-handles-issues-and-fixes-code-for-me-5e06</guid>
      <description>&lt;p&gt;GitHub issues were piling up on my open-source project, and it was becoming a real burden.&lt;br&gt;
I needed automation. My first attempt was n8n + LLM.&lt;br&gt;
When an issue came in, the LLM would decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spam or not?&lt;/li&gt;
&lt;li&gt;Bug / feature request / question?&lt;/li&gt;
&lt;li&gt;Need more info?&lt;/li&gt;
&lt;li&gt;Can AI handle this directly?&lt;/li&gt;
&lt;/ul&gt;
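
&lt;p&gt;That decision can be pinned down as a structured output the LLM must fill in, one field per question above. A hypothetical schema for illustration; the actual n8n node just parsed the model's free-form JSON.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Literal
import json

@dataclass
class TriageDecision:
    """Fields the triage LLM call must return -- one per question."""
    is_spam: bool
    category: Literal["bug", "feature_request", "question"]
    needs_more_info: bool
    ai_can_handle: bool

def parse_decision(raw: str) -> TriageDecision:
    """Validate the model's JSON reply against the schema."""
    decision = TriageDecision(**json.loads(raw))
    # dataclasses don't enforce Literal at runtime, so check by hand.
    if decision.category not in ("bug", "feature_request", "question"):
        raise ValueError(f"unknown category: {decision.category}")
    return decision

reply = '{"is_spam": false, "category": "bug", "needs_more_info": false, "ai_can_handle": true}'
print(parse_decision(reply))
```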

&lt;p&gt;Triage and auto-commenting worked surprisingly well.&lt;br&gt;
Then I hit a wall.&lt;/p&gt;

&lt;p&gt;The real core of this agentic workflow wasn't classification—it was the AI actually understanding code and taking action.&lt;/p&gt;

&lt;p&gt;Reading code, leaving informed comments, making fixes and opening PRs... For any of that, the agent needs to run where the code lives.&lt;/p&gt;

&lt;p&gt;That's when I landed on GitHub Actions.&lt;/p&gt;

&lt;p&gt;Free compute for AI agents, and the GitHub integration is seamless.&lt;/p&gt;

&lt;p&gt;I designed the architecture with Claude Code and documented all decision policies and role definitions in markdown.&lt;br&gt;
Prototype done in 4 hours.&lt;/p&gt;

&lt;p&gt;Next week: apply improvements, build test cases, deploy to the real project.&lt;/p&gt;

&lt;p&gt;Curious to see how much time this actually saves the team.&lt;/p&gt;

&lt;p&gt;What I'm experimenting with next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structuring domain knowledge with Claude Skills (policies, metrics, builds, tests)&lt;/li&gt;
&lt;li&gt;Better AI Q&amp;amp;A interactions&lt;/li&gt;
&lt;li&gt;Test cases that validate LLM and agent behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0q3zargiczaffoafam1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0q3zargiczaffoafam1.png" alt=" " width="480" height="895"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>github</category>
    </item>
  </channel>
</rss>
