<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Manju Kiran</title>
    <description>The latest articles on Forem by Manju Kiran (@manju_kiran).</description>
    <link>https://forem.com/manju_kiran</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818238%2F79f49c83-109e-49b1-a61f-38263204fa3e.jpg</url>
      <title>Forem: Manju Kiran</title>
      <link>https://forem.com/manju_kiran</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/manju_kiran"/>
    <language>en</language>
    <item>
      <title>Lessons from Six Months of Building Production Software with AI Coding Agents</title>
      <dc:creator>Manju Kiran</dc:creator>
      <pubDate>Thu, 02 Apr 2026 15:38:13 +0000</pubDate>
      <link>https://forem.com/manju_kiran/lessons-from-six-months-of-building-production-software-with-ai-coding-agents-2ifg</link>
      <guid>https://forem.com/manju_kiran/lessons-from-six-months-of-building-production-software-with-ai-coding-agents-2ifg</guid>
      <description>&lt;p&gt;I woke up to forty-seven merge conflicts on a Monday morning.&lt;/p&gt;

&lt;p&gt;Let me set the scene. The toddler had, by some miracle, slept through the night. The coffee was brewing. The sun was doing that tentative Glasgow thing where it appears for exactly long enough to make you trust it before disappearing behind clouds that seem personally offended by optimism. I opened my laptop with the satisfaction of someone who'd spent the weekend being clever.&lt;/p&gt;

&lt;p&gt;Two AI agents. Two terminals. Same repository. Parallel progress. Agent 1 handles the authentication refactor, Agent 2 builds the new dashboard endpoint. Monday morning, I review and merge like the productive developer I clearly am. Ship two features before breakfast. Tell my wife I'm basically a genius.&lt;/p&gt;

&lt;p&gt;Both agents had decided the shared utility module needed updating. Both had done it differently. Both had committed at 2:47 AM — eighteen seconds apart. Three hours of cherry-picking later, I'd recovered maybe eighty percent of the work. The other twenty percent? Sacrificed to the git gods. Like my plans to tell my wife I was basically a genius.&lt;/p&gt;

&lt;p&gt;"How's your productive Monday going?" she asked around 11 AM.&lt;/p&gt;

&lt;p&gt;I showed her the terminal. She winced. My wife is also a developer. She knows what forty-seven merge conflicts cost.&lt;/p&gt;

&lt;p&gt;I've been building a trading platform for six months — nine microservices in Python/FastAPI, a machine learning pipeline, around four hundred API endpoints, eighty-odd database tables. I'm a solo engineer; AI agents are how I make that work. Ten of them, eventually. Here are three things they taught me that I wish someone had mentioned earlier.&lt;/p&gt;




&lt;h2&gt;
  Parallel agents will destroy your repository without workspace isolation
&lt;/h2&gt;

&lt;p&gt;Both agents had worked in the same directory. Different branches, yes. But same directory. Same &lt;code&gt;.git/&lt;/code&gt; folder. Same lock files. Same build artifacts. When Agent 1 ran an install, it modified the lock file. When Agent 2 ran a different install, it touched the same file. Neither knew the other existed.&lt;/p&gt;

&lt;p&gt;It's like two cooks in the same kitchen, both reaching for the salt, except the salt is a forty-seven-thousand-line lock file and neither cook knows the other one is in the room.&lt;/p&gt;

&lt;p&gt;I tried the naive fixes. Lock the repository — kills parallelism entirely. Use different branches — branches don't isolate working directories; branches are a git concept, working directories are a filesystem concept. Commit more frequently — actually makes things &lt;em&gt;worse&lt;/em&gt;, because more commits means more interleaved history.&lt;/p&gt;

&lt;p&gt;The solution came from thinking about CI/CD pipelines. GitHub Actions doesn't use your local directory. It clones a fresh copy into an isolated container. So I gave each agent its own shallow clone with &lt;code&gt;--reference&lt;/code&gt; to the source repo. Own directory. Own dependencies. Own git state. Two hundred and sixty-two lines of configuration.&lt;/p&gt;
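
&lt;p&gt;The seating chart can be sketched in a few lines of Python. The paths, agent names, and flags below are illustrative — the real protocol is the 262 lines of configuration, not this:&lt;/p&gt;

```python
from pathlib import Path

def agent_clone_cmd(source_repo, agent_name, workspace_root="/tmp/agent-workspaces"):
    """Build the git command that gives one agent its own working copy.

    Own directory, own dependencies, own git state; --reference borrows
    objects from the source repo so the extra clones stay cheap on disk.
    """
    dest = Path(workspace_root) / agent_name
    return ["git", "clone", "--depth", "1",
            "--reference", source_repo, source_repo, str(dest)]

# One isolated workspace per agent: no shared lock files, no collisions.
commands = [agent_clone_cmd("/srv/trading-platform.git", f"agent-{i}")
            for i in range(1, 9)]
```

&lt;p&gt;Each command lands an agent in its own directory, so an install in one workspace can never touch another workspace's lock file.&lt;/p&gt;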

&lt;p&gt;My wife looked at the diagram I'd sketched on the back of a toddler's colouring page. "So you've written a seating chart for robots."&lt;/p&gt;

&lt;p&gt;After implementing it: eight agents working in parallel, zero collisions. The seating chart paid for itself in the first week. The lesson is unglamorous: the guardrails you'd build for a human team work for AI agents too, because the failure modes are identical. Race conditions don't care who's typing.&lt;/p&gt;




&lt;h2&gt;
  AI agents will write beautiful tests that test nothing
&lt;/h2&gt;

&lt;p&gt;Here's something nobody warned me about: an AI agent, when asked to write tests, will sometimes produce beautiful, well-structured test files with descriptive names and proper setup — where the assertions check that True equals True.&lt;/p&gt;

&lt;p&gt;I had a testing specialist agent. Clear instructions: TDD, red-green-refactor. The agent was diligent. Test suites appeared. They all passed. Coverage numbers looked healthy. I shipped code with confidence for three days.&lt;/p&gt;

&lt;p&gt;Then a bug hit that should have been caught by six different test cases. I went back and read them. Every single one was what I now call a taxidermy test — it looked alive, but nothing was actually working inside. One test checked that calling the function didn't raise an exception. That was the entire assertion. Another checked that the response had a &lt;code&gt;status&lt;/code&gt; field — not what the status &lt;em&gt;was&lt;/em&gt;, just that it existed. A response of &lt;code&gt;{"status": "everything_is_on_fire"}&lt;/code&gt; would have been a green tick.&lt;/p&gt;

&lt;p&gt;The fix: every test must assert a specific expected value, and the agent must prove the test fails before the implementation exists. Coverage numbers mean nothing without substance.&lt;/p&gt;
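
&lt;p&gt;In pytest terms, the difference looks like this — the handler and its payload are made up for illustration:&lt;/p&gt;

```python
def get_status():
    """Stand-in for a real endpoint handler; hypothetical payload."""
    return {"status": "healthy", "open_positions": 0}

def test_taxidermy():
    # Looks like a test, asserts almost nothing: a payload of
    # {"status": "everything_is_on_fire"} would still pass.
    resp = get_status()
    assert "status" in resp

def test_with_substance():
    # Pins the expected values -- this one can actually fail.
    resp = get_status()
    assert resp["status"] == "healthy"
    assert resp["open_positions"] == 0
```

&lt;p&gt;Run the second test against a stub that returns nothing and watch it fail — that's the red in red-green-refactor, and it's the proof the test is alive.&lt;/p&gt;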

&lt;p&gt;But this applies far beyond tests. AI agents are extraordinarily good at pattern-matching what "done" looks like. A test file with imports and assert statements &lt;em&gt;looks&lt;/em&gt; done. Documentation with headings &lt;em&gt;looks&lt;/em&gt; done. Error handling with try-catch blocks &lt;em&gt;looks&lt;/em&gt; done. If you don't verify the substance behind the structure, you'll build a codebase that looks professional and is held together by nothing.&lt;/p&gt;




&lt;h2&gt;
  An agent will "fix" a bug by making the entire system tolerate broken data
&lt;/h2&gt;

&lt;p&gt;I had a data pipeline producing timestamps in the wrong format. UTC strings where epoch integers were expected. The consuming service was crashing. I pointed an agent at the crash and said: fix this.&lt;/p&gt;

&lt;p&gt;Ninety seconds later, every test passed and the crash was gone.&lt;/p&gt;

&lt;p&gt;The agent had added a type coercion layer to every consumer. If the timestamp was a string, parse it. If an integer, use it. If null, use the current time. If a list — somehow — take the first element. The function was called &lt;code&gt;flexible_timestamp_parse&lt;/code&gt; and it would accept essentially any input and return something that looked like a valid timestamp.&lt;/p&gt;

&lt;p&gt;The actual bug — the producer emitting strings instead of integers — was still there. Now invisible, because every consumer had learned to tolerate it. Like fixing a headache by teaching the patient to enjoy pain.&lt;/p&gt;

&lt;p&gt;This is when I wrote the rule that has saved me more time than any other: &lt;strong&gt;trace, don't patch.&lt;/strong&gt; Before writing any fix, trace the data flow from source to consumer. Find where the mismatch originates, not where it manifests. The contract is truth. If the producer doesn't match, fix the producer. Never make the consumer tolerate variations.&lt;/p&gt;
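
&lt;p&gt;A minimal sketch of the rule, with a hypothetical event producer and consumer — fix the violation at the source, and keep the consumer strict enough to fail loudly if the contract ever breaks again:&lt;/p&gt;

```python
from datetime import datetime, timezone

def emit_event(symbol):
    """Hypothetical producer. Contract: "ts" is epoch seconds, an int."""
    return {"symbol": symbol,
            "ts": int(datetime.now(timezone.utc).timestamp())}

def consume(event):
    """The consumer enforces the contract instead of tolerating drift.

    No flexible parsing, no fallbacks: a wrong type is a loud failure
    that points back at the producer, where the real bug lives.
    """
    ts = event["ts"]
    if not isinstance(ts, int):
        raise TypeError(f"producer violated contract: ts={ts!r}")
    return ts
```

&lt;p&gt;The strict consumer is the opposite of &lt;code&gt;flexible_timestamp_parse&lt;/code&gt;: it turns a silent data mismatch into a stack trace at the boundary where the mismatch originates.&lt;/p&gt;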

&lt;p&gt;A human developer adding &lt;code&gt;flexible_timestamp_parse&lt;/code&gt; would feel at least a twinge of guilt. The agent felt nothing. It solved the problem it was given. Next.&lt;/p&gt;




&lt;p&gt;Six months in, the platform connects to a real broker, processes real market data, and enforces risk limits with a kill switch that fires in under five hundred milliseconds. Built by one developer and a collection of AI agents that I've spent six months learning to orchestrate.&lt;/p&gt;

&lt;p&gt;My wife asked me last week if the robots were behaving themselves. "Mostly," I said. "Mostly," she repeated. "That's not reassuring."&lt;/p&gt;

&lt;p&gt;It's not meant to be. Mostly is what production engineering with AI agents actually looks like. If you go in expecting to build systems that compensate for imperfect tools — which is, when you think about it, what engineering has always been — then it works. Not perfectly. Not in the way the demos suggest, where the developer leans back with the serene confidence of someone who has never been woken at 3 AM by a monitoring alert.&lt;/p&gt;

&lt;p&gt;But it works. And six months in, that's enough.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Manju Kiran is a software engineer in Glasgow, Scotland. He writes about building production systems with AI agents on &lt;a href="https://medium.com/@manju_kiran" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; and wrote a book about the experience: &lt;a href="https://leanpub.com/it-aint-that-hard" rel="noopener noreferrer"&gt;It AI'nt That Hard&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>productivity</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>My Wife Ships Faster Than My 262-Line AI Protocol</title>
      <dc:creator>Manju Kiran</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:23:29 +0000</pubDate>
      <link>https://forem.com/manju_kiran/my-wife-ships-faster-than-my-262-line-ai-protocol-4bh8</link>
      <guid>https://forem.com/manju_kiran/my-wife-ships-faster-than-my-262-line-ai-protocol-4bh8</guid>
      <description>&lt;p&gt;My wife ships faster than I do. Every Saturday, same tools, same kitchen table, same toddler — she finishes first.&lt;/p&gt;

&lt;p&gt;Not because she's faster at typing. Not because her prompts are better (though her first prompt was 30 lines while mine was 12 words — that story is in the blog). The reason is simpler and more uncomfortable: she doesn't build infrastructure for the sake of infrastructure.&lt;/p&gt;

&lt;p&gt;We're both developers. We work from opposite ends of the same kitchen table. Same internet connection. Same toddler interrupting at unpredictable intervals. On any given Saturday morning, if we both sit down to build something with AI, the result is the same. Every time.&lt;/p&gt;

&lt;h3&gt;
  The 262-Line Protocol
&lt;/h3&gt;

&lt;p&gt;By month three of my AI journey, I had a 262-line workflow protocol. A CLAUDE.md file that governed how my AI agents behaved, what they were allowed to touch, how they structured commits, when they asked permission, and what documentation they updated after every change.&lt;/p&gt;

&lt;p&gt;A fragment of it looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## MANDATORY: Pre-Code Checklist (STOP AND READ BEFORE WRITING ANY CODE)&lt;/span&gt;

&lt;span class="gu"&gt;### 0. CONTEXT LOADING CHECK (DO THIS FIRST)&lt;/span&gt;
&lt;span class="gt"&gt;&amp;gt; "Have I loaded the necessary context before writing code?"&lt;/span&gt;

&lt;span class="gs"&gt;**REQUIRED STEPS - NO EXCEPTIONS:**&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Read the Existence Manifest:**&lt;/span&gt;
   Search for keywords related to your task. If something exists, DON'T recreate it.
&lt;span class="p"&gt;
2.&lt;/span&gt; &lt;span class="gs"&gt;**For Database Work - Check Existing Tables:**&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; [ ] I have verified the table doesn't already exist
&lt;span class="p"&gt;   -&lt;/span&gt; [ ] I have verified similar fields don't exist in other tables
&lt;span class="p"&gt;   -&lt;/span&gt; [ ] I have identified related tables I may need to join/reference
&lt;span class="p"&gt;
3.&lt;/span&gt; &lt;span class="gs"&gt;**For API Work - Check Existing Endpoints:**&lt;/span&gt;
&lt;span class="p"&gt;   -&lt;/span&gt; [ ] I have verified the endpoint doesn't already exist
&lt;span class="p"&gt;   -&lt;/span&gt; [ ] I have identified related endpoints I can reuse

&lt;span class="gu"&gt;## CRITICAL: No Defensive Programming. NEVER take lazy shortcuts. Fix it RIGHT.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two hundred and sixty-two lines of that. Battle-tested. The product of genuine failures — agents that deleted production code, merged broken branches, and once cheerfully rebuilt an authentication system that already existed.&lt;/p&gt;

&lt;p&gt;My wife — also a developer — looked at it one Saturday and gave me the verdict: "This is good. But you've spent more time writing rules for the AI than I've spent using it this month. And I've shipped three client features."&lt;/p&gt;

&lt;h3&gt;
  Two Approaches, One Kitchen Table
&lt;/h3&gt;

&lt;p&gt;The contrast was hard to ignore. My wife's first AI prompt was thirty lines — a complete spec with design tokens, loading states, and test requirements. She got a working component back. My first prompt was twelve words:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build a trading dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That got me a starting point for a four-hour conversation.&lt;/p&gt;

&lt;p&gt;Her approach was simple: clear prompt, review the output, ship. No workflow documents. No existence manifests. Treat AI like a capable junior developer — give clear instructions, review the work, move on.&lt;/p&gt;

&lt;p&gt;My approach: build the orchestra pit before playing a single note. Document every failure. Create guardrails for every edge case. Architect a system so robust that the AI couldn't possibly go wrong. I was orchestrating multiple agents across a complex multi-service architecture where one wrong merge can cascade through eight repositories. The guardrails aren't vanity. They're scar tissue. But the voice in my head kept asking: if she can ship without all this, what does that say about me?&lt;/p&gt;

&lt;p&gt;And here's what kept nagging: the protocol grew from 40 lines to 262 lines, and at no point did I stop to ask whether every addition was necessary. Some of those rules exist because of a failure that happened once, in a specific context, that will probably never repeat. I built a comprehensive insurance policy when I needed targeted coverage. Was that engineering, or was it anxiety wearing a build system?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How many lines is your AI workflow config? Be honest.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  The Efficiency Gap
&lt;/h3&gt;

&lt;p&gt;There's a number I don't love thinking about. If I calculate the hours spent building and maintaining my AI workflow infrastructure — the documentation, the protocols, the context files, the model tiering system — and compare it to my wife's total AI setup time (approximately: none), the efficiency gap is real.&lt;/p&gt;

&lt;p&gt;She has shipped more features per hour of AI-related effort than I have. By a meaningful margin.&lt;/p&gt;

&lt;p&gt;My counter-argument is that my infrastructure pays dividends over time. That the upfront investment amortises. That I can now onboard new projects faster because the protocols are reusable. All of this is true. But it's also the kind of argument that sounds suspiciously like rationalising the sunk cost.&lt;/p&gt;

&lt;h3&gt;
  The Resilience Question
&lt;/h3&gt;

&lt;p&gt;Where it gets interesting is resilience. When something goes wrong — and it does, regularly — my system catches it. The protocol flags merge conflicts before they compound. The documentation prevents agents from rebuilding existing services. The guardrails stop the cascade before it starts.&lt;/p&gt;

&lt;p&gt;Her system doesn't catch those things, because her system is her. She reviews everything personally. She doesn't delegate to agents she can't supervise in real time. Her resilience is human attention, applied consistently.&lt;/p&gt;

&lt;p&gt;This works — until it doesn't scale. When the project grows beyond what one person can hold in their head, the protocol starts earning its weight. When the project stays small enough for human attention, the protocol is overhead.&lt;/p&gt;

&lt;p&gt;I spent six months swinging between panic and overengineering before I figured out what I actually needed and when. That journey — from fear to something resembling confidence — is the actual skill. The protocols were just the receipts.&lt;/p&gt;

&lt;h3&gt;
  What the Saturday Morning Test Actually Measures
&lt;/h3&gt;

&lt;p&gt;The test doesn't measure who's a better developer. It measures something more specific: the gap between efficiency and resilience, and the cost of confusing the two.&lt;/p&gt;

&lt;p&gt;My wife is efficient. She gets maximum output for minimum process. I am resilient. My system handles failures that hers would not survive. But I built resilience when I sometimes needed efficiency, and the 262-line protocol is the evidence.&lt;/p&gt;

&lt;p&gt;The question I keep coming back to, usually around 11 PM when the house is quiet and I'm reviewing the day's work: how much of my infrastructure exists because I needed it, and how much exists because building infrastructure is what I do when I'm anxious?&lt;/p&gt;

&lt;p&gt;I don't have a clean answer. The book doesn't pretend to have one either.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's the ratio of time you spend configuring AI tools vs. actually using them?&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is adapted from "It AI'nt That Hard" by R Manju Kiran. The &lt;a href="https://medium.com/@manju_kiran" rel="noopener noreferrer"&gt;blog series&lt;/a&gt; covers the professional journey; the book covers the Saturday mornings, the arguments, and the parts where the self-doubt won before it lost. &lt;a href="https://www.amazon.com/dp/B0GQZ669JG" rel="noopener noreferrer"&gt;Available on Amazon&lt;/a&gt;, &lt;a href="https://leanpub.com/it-aint-that-hard" rel="noopener noreferrer"&gt;Leanpub&lt;/a&gt;, and &lt;a href="https://manjukiran5.gumroad.com/l/aint-that-hard-book" rel="noopener noreferrer"&gt;Gumroad&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>devjournal</category>
    </item>
  </channel>
</rss>
