Forem: Praveen Govindaraj

The Prefix Bubble

Praveen Govindaraj — Thu, 21 May 2026 10:35:35 +0000

A mobile gadget shop in my location, three centuries of evidence, and the part of the AI debate that almost no one is having.

I walk down near my location after luch on a humid afternoon, I pass a mobile-phone shop. It is a modest place. It sells SIM cards, prepaid top-ups, refurbished phones, laptop accessories. The shopkeeper has been there for years.

A few days ago, the signboard above the door changed. The new name is Ai Gadgets.

There is no neural network in the building. There is no model, LLM or GenAI running anywhere on the premises. There is a counter, a glass case, a tired shopkeeper, and a freshly printed sign in a font that wants you to think of something other than what is actually being sold.

I took a photograph of it because it made me laugh. Then I wrote a short story with bit of research, comparing the rebrand to Long Blockchain Corp and the dot-com renames of 1999. The post was well received. It was also, I realised the next morning, only the first inch of the argument it had started to make.

The shopkeeper is not stupid. The shopkeeper is participating in a global ritual that is at least three hundred years old.

Every general-purpose technology produces a naming bubble. Every one. The label changes; the human does not.

A short history of prestige labels

Figure 1. Three centuries of prefix manias. The label changes; the human stays the same.

Look at the timeline above and a pattern becomes visible that any one cycle hides. South Sea, 1720. Canal Mania, 1825. Railway Mania, 1845. The Electric Belt era, the 1880s. The American auto boom, the 1900s. Radio Mania, the 1920s. Atomic this and Space-Age that in the 1950s. Bio- and Nano- prefixes in the 1980s. Dot-com in 1999. Blockchain in 2017. AI in 2024.

The intervals shorten as you approach the present, which is its own observation worth holding. But the shape repeats. Each of these episodes was, in its moment, treated as the singular event of its century. Each was preceded by a real technological breakthrough and accompanied by an explosion of operators who attached themselves to the label without participating in the breakthrough. Each ended, eventually, with a culling — and each left behind real infrastructure that compounded for the next fifty years.

The Carlota Perez framework, which I’ll come to in a moment, is the most rigorous attempt to name this regularity. But you don’t need a framework to see it. You need a long enough timeline. The shopkeeper at 28 Veerasamy Road is the 2026 instance of a phenomenon I could have photographed in 1845 London, 1899 New York, or 1920 Chicago, and would have photographed in every one of those places, with a different label on the sign each time.

The pattern is so consistent that its absence would be the historical anomaly.

1720: a company for nothing in particular

The earliest case in modern memory is the South Sea Bubble. England had just invented the joint-stock company. Speculators noticed that the word company, attached to almost anything, could pull money out of pockets that had previously kept it.

The most famous prospectus of the period offered shares in “a company for carrying on an undertaking of great advantage, but nobody to know what it is.” The promoter collected £2,000 in five hours and left town the same evening.

The magic word in 1720 was not a technology. It was a legal form — the joint-stock incorporation itself, less than a century old at the time, untested in most of its applications, and attached to a public imagination that had just glimpsed what it might do.

This matters because it tells us something about the underlying mechanism. The prefix is not specifically about science or engineering. The prefix is about whatever is the newest authoritative pattern of legitimacy. In 1720 it was incorporation. In 1845 it would be the railway. In 1899 it would be electricity. In 2026 it is artificial intelligence. The word changes. The function does not.

Isaac Newton, then in his late seventies and Master of the Royal Mint, lost most of his personal fortune in the South Sea Bubble. He had sold early at a profit, then re-bought near the peak. His reported confession afterwards was that he could “calculate the motion of heavenly bodies, but not the madness of people.”

The smartest man in England could not see the prefix coming. Neither, in their own time, can we.

1845: the railway that was never built

If you want one historical episode to project onto the current AI cycle, it is not the dot-com bust. It is the British Railway Mania of 1844 to 1847. Almost every element rhymes.

The Liverpool and Manchester Railway had opened in 1830 and proved that passenger rail was a real industrial revolution. By the early 1840s the Bank of England had cut interest rates, the Bubble Act had been repealed, and railway shares could be bought on a ten per cent deposit. By 1846, two hundred and sixty-three Acts of Parliament for new railway companies were passed in a single year, with nine thousand five hundred miles of proposed track.

About a third of these companies were never built. They either collapsed from poor financial planning, were absorbed by stronger competitors, or were outright fraudulent — vehicles for channelling investor capital into entirely other businesses. The press was complicit. There were fourteen bi-weekly railway papers at the peak, plus a daily morning and evening edition, mostly funded by the very promoters they covered.

George Hudson, the so-called Railway King, at his peak controlled over a thousand miles of track. He ran what was retrospectively recognised as a Ponzi-like scheme, paying dividends out of fresh capital. He died in poverty.

And yet — this is the part nobody wants to integrate — Railway Mania was not a waste. By 1850 Britain had a six-thousand-mile network that formed the spine of the world’s most advanced transport system. Roughly ninety per cent of the modern UK rail network was built during those mania years. The bubble was the financing mechanism for the deployment.

This is the lesson the AI cynic and the AI evangelist both miss. The bubble is real money being wasted on stupid signs — and the same bubble is laying the rails on which the next half-century will ride. The two facts are not in tension. They are the same fact, looked at from two different distances.

Saying “AI is a bubble” and “AI is the next industrial revolution” are the same statement. They are separated by about ten years and one crash.

1880: an electric belt for what ails you

Forty years after the railway crash, Edison and Tesla were household words. Electricity had moved from laboratory to lightbulb to street. The prefix of the new era was Electric.

And so we got the Pulvermacher Hydro-Electric Chain — a vinegar-soaked voltaic belt worn around the waist, which claimed to cure impotence, rheumatism, kidney pain, sciatica, dyspepsia, “weaknesses peculiar to men,” and most other diseases known to the Victorian gentleman. It was sold in the Sears Roebuck catalogue alongside guns, sewing machines, and laudanum. Tens of thousands of units moved.

Competitors swarmed: the German Electric Belt Company (which was, despite the name, based in New York), Dr. Sanden’s Electric Belt, Dr. Crystal’s, Dr. Horn’s, Owen’s, Heidelberg’s — and, hilariously, Edison’s, founded by Thomas Edison Junior, against whom his own father took out lawsuits.

The 1887 U.S. Congress had electric medical equipment wired directly into the Capitol Building so congressmen could be “treated” during sessions. The medical journals of the period are full of the same exasperated tone you find in modern AI-skeptic Twitter. The British Medical Journal called one prominent manufacturer an egregious quack. They were ignored for thirty years.

If you had walked into a New York drugstore in 1892, you would have seen electric corsets, electric brushes, electric tonics, electric baths, electric magnetic insoles. Real electricity existed. None of these products contained any. The label was doing all the work — exactly as “AI” does on a 2026 food-delivery app’s new menu-suggestion feature.

The label is the product. The product is the label. The actual technology is at most a coincidence.

Eighteen hundred companies, three winners

The automobile arrived around 1885 in Germany. By the early 1900s in the United States, over eighteen hundred separate car companies had been founded. Almost all are forgotten. Abbot-Detroit. Adams-Farwell. Armstrong Electric. The Berg. Three different companies called Courier. The 1914-only Motor Bob.

Henry Ford’s own first attempt — the Detroit Automobile Company, incorporated in August 1899 — went bankrupt in January 1901. Production problems. Quality problems. He came back later.

By 1929, the so-called Big Three — GM, Ford, Chrysler — were the only survivors at scale. The survival rate from the original cohort is roughly two-tenths of one per cent. The capital was not burned. It was redirected through a winnowing process that left an industry.

The instructive part is the naming. The survivors generally had personal names (Ford, Buick, Dodge, Chrysler, Olds) or geographic names (Detroit, Cadillac). The casualties disproportionately had prestige-tech names: Electric, Power, Motor, Automatic, Steel. The label-heavy companies died at a higher rate than the company-heavy companies.

I do not think this is an accident. I think it is a quiet signal of which founders were trying to participate in the technology and which were trying to participate in the prestige around it. The market sorted, with great cruelty, between the two.

The louder the prefix in the name, the shorter the company’s future.

The curve that keeps rhyming

The most rigorous frame for what we are watching is Carlota Perez’s Technological Revolutions and Financial Capital, published in 2002. Perez identifies five technological revolutions in the last two hundred and fifty years: the Industrial Revolution, Steam and Railways, Steel and Electricity, Oil and Mass Production, and Information and Telecommunications. Each follows the same four-phase surge.

Figure 2. Carlota Perez’s surge cycle, fitted to the present AI moment. The naming bubble peaks in Frenzy. Operators emerge in Synergy.

The phases are: Irruption (the technology arrives quietly, used by specialists), Frenzy (capital floods in, labels appear everywhere, valuations decouple from reality), Turning Point (a crash, a sorting event), Synergy (the survivors deploy the infrastructure at scale, labels drop, the operators take over), and Maturity (saturation, the next big bang waits in the wings).

The crucial point — which my LinkedIn post elided — is that the naming bubble is a feature of the Frenzy phase, not a sign that the underlying technology is fake. Steam was real. Electricity was real. Cars were real. Radio was real. The internet was real. And each of them had a Pulvermacher Belt era of its own.

Saying “AI is a bubble” and saying “AI is the next industrial revolution” are not opposing positions. They are the same position, looked at from two different points on the same curve. The bubble produces the infrastructure that the revolution will ride. The cynic and the evangelist are arguing across a temporal gap they have not noticed.

Where on the curve are we now? Probably mid-Frenzy, possibly approaching Turning Point. The shopkeeper near my local is one of the better leading indicators. When the prefix has reached the high-street signboard, you are not early.

The bubble is the financing mechanism for the deployment. Both halves are required.

The compressing half-life

Here is something that has changed, and matters. The half-life of each prestige prefix has shortened with every cycle, in rough proportion to the acceleration of media and capital flows.

Figure 3. The half-life of prestige labels has shortened with every cycle. Calibrate accordingly.

Railway lasted forty years as a viable prestige name in a company. Electric lasted about twenty-five. Radio about fifteen. Atomic about ten. Cyber about seven. Dot-com about five. Blockchain about three. The trend is not gentle. It is exponential decay.

What does this imply for AI? Plausibly, the prestige half-life of the label is forty months, not forty years. If you are building a career identity around the prefix — AI Strategist, AI Transformation Lead, Head of AI Anything or Head of Quantum — you are betting on a label whose viable shelf life is shorter than yoghurt.

The technology will last. The label will not. These are different things, and conflating them is exactly the cargo-cult error this whole essay is about.

If your identity is the prefix, your obsolescence runs on a faster clock than your competence.

Operators and name-changers

Here is the cleanest historical pattern, and the one most absent from the original LinkedIn post: the long-run winners almost never carried the prestige prefix in their name.

Watch who is still here in 2030 and not called “Acme AI Solutions.”

The table above is the historical record, written one cycle per row. The column on the left is the railway, electric, auto, dot-com, blockchain, and AI companies that put the era’s prefix on their letterhead and disappeared. The column on the right is the companies that did the same work without the prestige label, and which are still in operation, or whose successor entities are.

Ford, not Armstrong Electric. Amazon, not Pets.com. Coinbase, not Long Blockchain Corp. The Great Western Railway, not the Yorkshire and Berwick Promoters’ Consortium.

The signal is so consistent across centuries that I am willing to state it as something close to a law of the industry. Operators do not need the prefix. The prefix is a substitute for the operation. When you see it on a name, in any era, you are watching someone short the difference between what they are doing and what they want to be seen as doing.

Bet on the named, not the labelled. Ford, not Electric. The shopkeeper, not the strategist.

A test you can run on Monday

If you want to know whether the company you work for, or the role you are about to take, or the stock you are about to buy is participating in the substance of the cycle or merely the label of it, there is a five-minute test.

Take the entity’s name and strip the prestige prefix. Read what remains. Does the description still describe a business?

Ai Gadgets, with the prefix stripped, is Gadgets. That still describes a business. The shopkeeper is fine. He sells real phones to real people, and the prefix is decoration.

AI Transformation Lead, with the prefix stripped, is Transformation Lead. That, in most organisations, also describes a role. Probably not good one, but a recognisable one. Mostly fine.

Long Blockchain Corp, with the prefix stripped, is Long Corp. That describes nothing. Long what? The company name relied on the label so completely that removing the label collapsed the description. The market eventually noticed.

Pets.com, with the prefix stripped, is Pets. That is not a company, that is a category. The dot-com was load-bearing.

Run this on your own employer. Run it on your own title. Run it on every “AI-native” startup that crossed your inbox last quarter. Notice which descriptions hold and which collapse. The ones that collapse are the ones whose viability is the label, and whose half-life is whatever this cycle’s prefix-clock turns out to be.

The test costs nothing. It takes five minutes and a piece of paper. The fact that almost no one in a boardroom does it before signing the brand is the most damning thing you can say about the current state of strategic judgement.

If the description collapses when you remove the prefix, the prefix was the description.

Walk past the same Road

I want to close where I started, because the essay is really about one specific shopkeeper and what he has to teach a city’s worth of AI strategists.

Walk past the same Road on this sunday afternoon. The shopkeeper is selling a refurbished phone to a tourist. He has paid his rent. He has stocked his inventory. The signboard above his head reads Ai Gadgets and means nothing, and he knows it means nothing, and the tourist knows it means nothing, and the transaction proceeds anyway, because the label is doing exactly what labels have done since 1720: providing a small, harmless permission to participate.

One mile north, in a glass office tower, an AI Transformation Lead is preparing slides for an Architecture Review Board. He has never shipped a model. He has never read a Goodhart paper. He has never heard the name Pulvermacher. His org chart reports to an AI VP who reports to an AI CIO. None of them have read Perez. They believe they are at the beginning of something genuinely new.

They are at the middle of something very, very old.

The shopkeeper has skin in the game — his rent, his stock, his children’s school fees. If his Ai signboard pulls in one extra tourist a week, he has won. Litrally kind of less bet. If the wave breaks, he changes the sign, paints over the Ai, and the business continues. He is doing the same thing he was doing before. The prefix was always cosmetic; he knew that.

The AI XXXXXXX Lead has no skin in any game. If the wave breaks, he becomes a Quantum Transformation Lead by Q3. By Q4, a Post-AGI Strategic Officer. His résumé will absorb the next prefix without resistance, because the prefix was always the résumé.

The shopkeeper sells real phones. The strategist sells the prefix.

Bet on the shopkeeper he hedge less.

What falls between

Praveen Govindaraj — Thu, 21 May 2026 10:21:53 +0000

The seams between agents are where reality leaks in. Most teams don’t have a name for that yet

Photo by Joseph Frank on Unsplash

Fifth in a short series. The first four pieces were about what production-grade agentic systems require: an honest reckoning with cost (one), a plumbing layer (two), a tolerance for asymmetric risk (three), and a discipline of policy as code (four). This piece is about the place where multi-agent systems most often break, and the reason they break there. It is a place that has no name on most architecture diagrams, and that absence is part of the problem.

If you have ever watched a kitchen pass during dinner service — the long stainless-steel counter where finished plates wait for the runner — you will have noticed something that any cook can tell you, and that no diagram of a restaurant ever captures.

Most of the failures of a busy service do not happen in the cooking. They happen at the pass. A dish sits a minute too long under the heat lamp. A garnish goes on the wrong plate. A ticket gets placed under another ticket and forgotten. The cook did their job. The runner did their job. The thing that failed is the place between them. And the head chef, if they are any good, spends most of their attention not on the cooking but on that two-meter strip of metal where one person’s work becomes another’s.

Agentic systems have a pass. Almost nobody is watching it.

The places no diagram shows

Consider any production workflow that uses more than one agent — which, in 2026, means most of them. You will be shown a diagram. The diagram will have boxes. The boxes will be labelled with what each agent does. There will be arrows between the boxes, indicating that one agent’s output becomes the next one’s input. The diagram will look clean.

The boxes are not the problem. Each agent, taken alone, is usually fine. It has been tested. It has a known input shape and a known output shape. Engineers have prompted it, evaluated it, retried it under load. The box is the smallest possible unit of “this works.”

The arrows are where the work is. The arrows are also, on most diagrams, drawn as straight lines — implying instantaneous, lossless, well-defined transfer. They are not. They are little zones of indeterminacy. Things happen in those arrows that nobody scheduled, that no test covers, that no metric counts.

What happens? A field gets renamed in the upstream schema, but the downstream agent is still looking for the old name. A claim type that nobody trained on shows up, and the upstream extractor returns null in a field that the downstream scorer treats as zero. The policy version was bumped on Tuesday, but only the decide-step was updated; the score-step still applies the old thresholds and the system as a whole now violates the rule it was supposed to enforce. None of these are the agent’s fault. None of these would show up in any of the agents’ unit tests. They live in the arrows.

Press enter or click to view image in full size

I have come to believe this is the single most underweighted truth in agentic engineering today. The hype is about what each agent can do. The reality is about what happens between agents. We are spending almost all of our attention on the cooking and almost none on the pass.

what each agent actually sees

To understand why the seams are so dangerous, you have to be honest about how much an agent doesn’t see.

An agent has a context window. The context window is finite. The things in the context window are the things the agent can reason about. Everything else may as well not exist. This is true for the largest models in the world. It is true for the smallest. It is the ground truth of how language models work, and no amount of marketing about “shared context” or “unified memory” changes it.

When you compose three agents into a workflow, you are not creating a single intelligence with three faculties. You are creating three small, separate intelligences who each have access to a partial view of the world. The extractor knows about PDFs and OCR. The scorer knows about risk models and historical priors. The decision agent knows about approval ladders and audit format. They do not know each other’s domains. They do not, in general, see each other’s reasoning.

What they share is whatever you, the engineer, made an effort to put on both sides of the seam. By default, that is almost nothing. Often it is a single string — an ID that points back to a database row that nobody is loading the same way. The famous “shared context” of multi-agent systems is, in practice, a thin overlap, much smaller than the boxes in the diagram suggest.

This is the architectural lie at the heart of most multi-agent demos. The demo shows three agents working together on a task, and the audience reads “they’re collaborating.” What is actually happening is that one agent is producing some output, and another agent is taking some words from that output and treating them as inputs, and the second agent has no real understanding of what the first agent meant. They are not collaborating. They are passing notes under the desk.

This works fine when the notes are simple and the desk is small. It stops working when the notes are complex and the consequences of misreading them are large. Which is to say: it stops working precisely when you start putting agents in production.

The cost of an unclear handoff

To make this concrete, watch what happens when you let an agent summarise the state of the world in prose, and then hand that prose to another agent.

Here is an extractor agent producing a summary of an incoming claim. It is a fluent summary. It uses words like “straightforward” and “probably approve.” A human reading this would have a clear sense of what the extractor thinks. The fluency is exactly the trap. Because the summary is in prose, and the next agent in the workflow is a language model, the next agent will read the prose and form its own interpretation. And that interpretation is not deterministic.

Run the same prose through the same scoring agent ten times, on the same model, in the same hour, and you will get different scores. Not wildly different — the model is consistent enough at the surface level — but different in the ways that matter. Sometimes the scorer will pick up on “straightforward” and route to auto-approval. Sometimes it will pick up on “probably” and route to human review. Sometimes it will see “basement” and trigger a fraud check pattern that has nothing to do with what the extractor was trying to communicate.

Each of these is, taken alone, a defensible reading. The prose supports all three. The scorer is doing what it was asked to do — interpret the natural-language input. The problem is not the scorer. The problem is the handoff itself.

I have watched teams spend months tuning the prompts of their downstream agents to “be more consistent” without ever realising that the inconsistency is upstream of the prompt. The downstream agent is being asked to interpret an ambiguous artifact. It will, with great fluency, produce different interpretations on different runs. Tuning the agent does not fix this. The fix is to stop handing the agent prose

Typed seams

Here is what the same handoff looks like when it is treated as a contract instead of a chat message.

The handoff has a name. It has required fields, each with a type. It has a validator that runs at the moment of write, and again at the moment of read. If the upstream agent forgets a required field, the write fails and the workflow halts at that step — not three steps later, when the missing field surfaces as a strange downstream behaviour, but right there, in the agent that produced the bad output.

If the downstream agent tries to read a field the contract doesn’t include, that is a compile error. Not a runtime guess. The system refuses to ship a workflow that reads from a handoff that doesn’t define what is being read.

This sounds, to anyone who has shipped a piece of software in the last forty years, deeply unremarkable. Of course you would type your interfaces. Of course you would validate your messages. The IDL pattern is older than most engineers writing AI code. What is remarkable is how rare this is in agentic systems shipped today, and how much surprise teams express when they discover that introducing typed handoffs makes their multi-agent systems suddenly stable.

So write it down. Typed. Named. Versioned. With required fields and a validator that the runtime enforces. The cost of this is small — a few extra lines of declaration per agent boundary. The benefit is that you stop running a coin-toss workflow and start running an engineered one.

An aside, for anyone tempted to say “but the model writes better when you let it talk freely.” This is sometimes true at the level of a single response. It is almost never true at the level of a multi-agent workflow. Free-form prose between agents is a local optimisation that costs the system its global coherence. Pay the small price of a schema. The downstream stability is worth more than the upstream eloquence, by a factor of about a hundred.

the shared spine

Once you have typed handoffs at every seam, a question arises naturally: where do the handoffs live? Who owns them? Who governs them? What happens when one of them needs to change?

The answer is structural, and it is the part of multi-agent design that almost no one has a vocabulary for.

The agents in a working multi-agent system do not actually talk to each other. They talk to a shared spine — a registry that holds the typed handoffs, the audit record, the governance rules, and the coordination protocol. Each agent reads from the spine and writes to the spine. No agent has a direct line to another. The architecture is, to use a phrase from an older era, a hub-and-spoke. The hub is what makes the whole thing work.

This is not a trivial detail. It changes how the system behaves under stress. When you need to change a handoff, you change one declaration in one place, and the runtime can verify that all readers and all writers are still compatible. When you need to audit a decision, you read the spine; the trace is already there, in one record, by name. When you need to add a new agent, you connect it to the spine; you do not need to update three other agents to know about the new one.

Compare this to the alternative — agents passing prose summaries to each other directly, point to point — and you see the structural fragility immediately. The point-to-point system has n times n minus one possible failure modes between any pair of agents. The hub-and-spoke system has n failure modes, all centralised, all monitored, all governable. The point-to-point system grows in complexity quadratically. The hub-and-spoke system grows linearly. This difference compounds rapidly as soon as the workflow has more than three steps.

The tools — the agents — are interchangeable. You can swap one for another, upgrade one, retrain one, replace one with a deterministic rule. What you cannot swap is the bench. The bench is the architecture. It is where work meets work. It is the place where the whole system becomes governable, or doesn’t.

The teams that get this right are the teams whose multi-agent systems survive the second year of production. The teams that don’t will spend that second year discovering, painfully, that their dozen agents are actually a dozen unrelated systems sharing nothing but a Slack channel.

A test you can run on Monday

If you want to know whether your team has thought about this — whether you are building a workshop or a Slack channel — there is a simple diagnostic.

Pick any two adjacent agents in your production workflow. Find an engineer who works on the upstream agent. Find a different engineer who works on the downstream agent. Put them in a room. Ask each of them, separately, to write down — on paper, no looking — the exact list of fields that pass between their two agents, with types.

Compare the lists.

If the lists match, your seams are typed; you are running a workshop. If the lists don’t match, you have just found, in five minutes, the place where production will fail next quarter. Almost no team passes this test on the first try. Many teams fail it spectacularly — different field names, different types for the same field, fields one engineer thinks are required and the other thinks are optional, fields one team didn’t know existed.

The exercise costs nothing. It does not require a tool, a vendor, or a quarter of investment. It requires a piece of paper and twenty minutes. The fact that almost no team does this is the most damning thing you can say about the current state of agentic engineering practice.

The fix, when you discover the discrepancy, is not a meeting. The fix is to put the handoff in source — typed, named, versioned, validated. So that the next time those two engineers each write down the fields, they are writing the same thing, because they are both reading from the same declaration.

The workshop, not the army

I want to close with an observation about how we talk about multi-agent systems, because I think the language we use is part of why we keep getting the architecture wrong.

The dominant metaphor in agentic marketing, in 2026, is military. We talk about “swarms” of agents. We talk about “armies.” We talk about “agent teams” doing “missions.” The metaphor implies a kind of centralised command — a general handing down orders, a chain of command, a hierarchy of obedience. It also implies, more subtly, that the agents themselves are the locus of capability. The general gives the orders; the agents carry them out.

This metaphor is wrong, and the systems built under it are fragile in proportion to how seriously their architects took it.

The right metaphor is older, and quieter, and almost never used in the marketing. It is the workshop. A workshop has specialists. The specialists do not report to each other. They share a bench. The bench has tools laid out in a known place. There is a master craftsman who keeps the bench in order — sharpens the tools, organises the materials, knows what each specialist needs and ensures it is there. The work flows around the bench, not through any one person.

The locus of capability in a workshop is not the specialists. It is the bench. Replace any specialist with another, and the workshop continues. Lose the bench, and the workshop is finished, no matter how skilled the specialists are.

This is the architecture that survives. It is older than computing, older than industry, older than civilisation as we usually count it. It is the way human work has been coordinated, in every settled tradition, since people have had work to coordinate. The agentic moment is, at its best, a chance to apply this old wisdom to a new substrate. The agents are tools. The workflow is the work. The spine is the bench. Build the bench first.

The previous piece in this series ended with the line build it like it has to last.

I want to end this one with the practical corollary, the thing every old craftsman knows and every new engineer has to learn the hard way.

The series will pause here. The pieces describe a discipline that exists, that has been tested, and that is not particularly fashionable. Whether you find a platform that respects it, or build one, or stitch one together from open-source pieces, the test is the same: can two engineers draw the same handoff. The job is to be able to answer yes.

The Discipline We Forgot We Had

Praveen Govindaraj — Thu, 21 May 2026 10:04:46 +0000

Process automation is older than software. The teams that remember will be the ones still shipping.

                    Photo by Thao LEE on Unsplash

Fourth in a short series. The first piece argued that the gap between AI-agents-in-pilots and AI-agents-in-production is being closed by a quiet infrastructure rebuild. The second showed what that infrastructure looks like. The third was about the failure modes the dashboard does not show. This piece is the constructive counterpoint: what installing the missing discipline actually consists of, and why teams that install it will be the ones still shipping when the hype rotates.

An older engineer I know — the kind of person who has seen three industry waves come and go — told me, when I asked him what he thought about the agentic moment, that he was struck mostly by how much of the conversation he had heard before.

“Every five years,” he said, “we discover that the way to coordinate work in a large organisation is to write down what the work is. We rediscover this with a slightly different vocabulary every time. The vocabulary changes. The discovery is the same.”

I have been thinking about that for a few weeks, and I think he is right. There is a discipline of process automation that is older than software — older, in some forms, than electricity. We have been forgetting and rediscovering pieces of it for the better part of a century. The agentic hype, in particular, is convincing teams to skip the rediscovery this time around. Which is a problem, because the discipline is what keeps the wheels on.

This piece is about what the discipline actually is.

Five layers a production process needs

If you take an experienced operations engineer — someone who has shipped real workflows in regulated industries, and lived with what they shipped — and you ask them what makes the difference between a process that survives and a process that breaks, you will get a list that has been more or less the same since the 1970s.

Figure 1. The unfashionable list. Each layer holds the ones above it up.

At the bottom is audit completeness. Every decision the process made — by which agent, on what input, against which version of the rules, with what outcome — is captured in one place, by name. Not in three log files, not in two tools, not in a Slack channel. One record. The kind of thing you can hand to a regulator without rewriting it first.

Above that is declared structure. The process exists as a thing you can read and a thing you can run, and these are the same thing. An analyst reads the diagram. An engineer reads the source. They are looking at the same artifact. Nobody has to translate.

Above that is typed contracts. Each step has an input shape and an output shape, both checked. If the upstream step changes its output and the downstream step is not updated, the system refuses to run. It does not silently pass the wrong shape forward and let production discover it.

Above that is bounded scope. Each step in the process is allowed to touch a specific, declared set of tools, data, and external systems. Nothing else. If the score-the-claim step tries to send an email, the runtime refuses, regardless of how plausible the email is.

At the top is cost SLOs in the source. The economic envelope of the process — what it is allowed to cost per run, per day, per month — is part of the program. Not a dashboard the finance team monitors. Part of the program.

Each layer is a claim the next layer rests on. If the audit log is incomplete, the structure means nothing — you cannot demonstrate which version of the process actually ran. If the structure is undeclared, the contracts mean nothing — there is no shared understanding to type-check against. If the contracts are unenforced, the scope means nothing — agents will pass arguments to tools that those tools were never built to accept. And if the scope is unbounded, the cost SLO means nothing — the process can spend money in ways the budget never anticipated.

Most teams shipping AI agents in 2026 have, at best, two of these five. Often one. Often zero. They make up the difference with vigilance and pagers. It works for a while. It does not scale.

The two-language problem

To see why this matters, consider the most common shape of an enterprise AI deployment, in any industry, anywhere in the world.

Figure 2. The two-language problem. The more important your workflow, the more this gap will cost you.

On one side, in a git repository, lives the code. Engineers wrote it. It is tested, reviewed, deployed, instrumented. When it changes, the change is reviewed. When something goes wrong, you can git blame your way back to the moment.

On the other side, in a Confluence page or a PDF or a SharePoint folder, lives the policy. Compliance wrote it. Legal approved it. It is the document the regulator will ask for in the audit. It says what is required. It says what is forbidden. It is the law of the land for this workflow.

Between them is a gap, and into that gap fall every important failure your enterprise will have over the next three years.

The reason is mechanical. Code and policy are written in two different languages, lived in two different systems, owned by two different organisations. The two systems do not talk to each other. The code cannot verify the policy. The policy cannot constrain the code. When the policy changes, the change has to be translated into code by hand, by someone who hopefully understood both. When the code changes, the change has to be reviewed against the policy by hand, by someone who hopefully read both.

This works, in the sense that nothing immediately catches fire. It also has a one-hundred-percent failure rate over a long enough horizon. The policy and the code drift. Edge cases the policy contemplates are not in the code. Edge cases the code handles are not in the policy. The drift is silent until something happens — a regulator’s question, a customer’s complaint, a Tuesday — and you discover that what you do is not what you said you would do.

Every team that has run a workflow in production for more than two years knows this in their bones. They know that the audit went well last time because someone remembered to update the policy. They know that the small incident in March happened because the engineer didn’t read the latest version. They know that the next problem will happen because nobody can hold both documents in their head at the same time.

The fix is not better documentation. The fix is to stop having two artifacts.

The same thing, in one source

Imagine, for a moment, that the policy and the code lived in the same source file. Not next to each other. Not linked. The same source file.

Figure 3. What it looks like when the workflow, the policy, and the budget all share one source.

The workflow declaration sits next to a governance rule that constrains it. The governance rule sits next to a budget that bounds them both. All three are written in the same syntax. All three are read by the same compiler. All three are checked against each other before anything ships.

If the workflow changes in a way that violates the policy, the build fails. Not at runtime — at build time, before anything is deployed. If the budget is exceeded by the workflow’s expected cost, the build fails. If a step references a tool that the policy disallows, the build fails. The compiler is the contract between engineering and compliance. It cannot be bypassed by someone forgetting to update a Confluence page.

This is not a science fiction proposal. This is how every serious software discipline has worked for half a century. Type systems. Schema validation. Formal contracts in distributed systems. The compiler-as-discipline pattern is older than most of the people writing AI agents today. The novelty is applying it to process as well as code, to policy as well as logic, to cost as well as correctness.

The pushback I get when I describe this is always the same: “but our policy is too complicated to encode.” And it is true that natural-language policy is more expressive than any formal system. It is also true that natural-language policy is less defensible than any formal system. The choice is not between formal-but-impoverished and rich-but-vague. The choice is between enforced and unenforced. Most policies, in practice, are not as complex as they sound; they are written in legalese to defend their authors. The actual logic — what amounts get reviewed, by whom, under what conditions — fits in a few dozen lines of declarative source.

The teams that learn to do this will be the teams that ship faster, not slower. They will not need three meetings to understand whether a change is compliant. The compiler will tell them. They will not need a quarterly compliance review to find drift. There will be no drift to find.

Where the bug gets caught

There is a strange asymmetry in the cost of bugs that anyone who has worked in software for long enough will have an intuitive feel for, and that almost no one outside software has been told about explicitly.

Figure 4. The same bug, caught in three different places.

A bug caught at compile time costs essentially nothing. The author sees a red squiggle, fixes it, moves on. The bug never enters the system. There is no incident, no rollback, no post-mortem. Most engineers do not even count compile-time errors as bugs. They are part of the act of writing.

A bug caught at deployment time costs a deployment. Maybe an evening. Maybe a rollback and a hotfix. Annoying, but contained. The damage is internal — engineering pays the cost, the customer never sees it.

A bug that escapes into production costs whatever the bug actually does. If it sends a wire to the wrong account, it costs the wire plus the recovery plus the regulatory disclosure plus the trust. If it deletes a customer’s data, it costs the data plus the trust plus the lawsuit plus possibly the company. The cost in production is not bounded by anything. It is bounded by what the world makes of the bug.

Each layer outward is roughly an order of magnitude more expensive than the one inside it. This is not a precise number. The point is the shape: the cost of catching a bug is exponential in how late you catch it.

Most agentic systems being shipped today do not have a compile step at all. The platform interprets a configuration. The configuration is checked, in some places, but not as a whole. The checks are at runtime, at deployment time, sometimes only when production traffic hits a particular path. The cost shape of bugs in these systems is therefore the worst possible shape: most bugs are caught in production, where they are most expensive.

This is a choice the platforms make. It is a defensible choice for a research prototype. It is an indefensible choice for a workflow that processes claims, transfers money, or makes hiring decisions. The teams that pick the platform without a compile step will, on average, pay an order of magnitude more for their bugs than teams that pick the platform with one. Over a few years, this difference shows up in everything: cost-of-incidents, time-to-recover, regulatory standing, engineering morale.

It is one of those things that is not visible until it is. And then it is the only thing that matters.

Cost is a number the program knows about

Here is a small thing that is, when you sit with it, an enormous thing.

In every conventional engineering discipline — civil, mechanical, electrical, chemical — the cost envelope of a system is part of the system’s specification. A bridge has a cost. A circuit has a power budget. A process plant has a throughput target. These are not numbers a finance team monitors after construction. They are constraints the design has to honour, in the design phase, on paper, before anything is built.

In software, somehow, this got lost.

Figure 5. The budget is part of the program graph, not a sidecar.

Most software systems, including the agentic systems being built today, treat cost as an emergent property. The system runs. Bills accrue. A dashboard somewhere — or, in the worst cases, an invoice at the end of the month — reports what the system spent. The finance team adjusts. The engineering team apologises. Nobody is held accountable, because nobody wrote down what the cost was supposed to be in the first place.

The fix is to make cost a typed quantity in the program. A first-class construct. A budget node that the workflow references, that the runtime enforces, that the compiler validates. When a step is added that pushes the projected cost above the declared budget, the build fails. When a run starts to exceed its budget, it pauses, alerts, and refuses to continue without explicit approval. The runtime knows what the program is supposed to cost. The program is the source of truth.

This single change does more for FinOps discipline than every dashboard ever built. Because the question stops being “how much did we spend?” — a question that can only be answered after the fact, by an exhausted analyst with three browser tabs open — and becomes “what did we say we’d spend?” — a question with one answer, in the source.

If you want to know whether a team takes its production agents seriously, look for this. If their budget lives in a spreadsheet, they have a hobby. If their budget lives in the source they shipped, they have a system.

What the agentic hype is selling you

Now, with all of this in view, look honestly at what the major agent platforms are selling in 2026.

They are selling you a notebook with extra steps. They are selling you a configuration file in a YAML dialect that nobody compiles, that nobody type-checks, that nobody validates against your governance, that nobody bounds against your budget. They are selling you a beautiful drag-and-drop canvas that produces a workflow that lives only in their database, that you cannot version-control, that you cannot review in a pull request, that you cannot enforce a policy against.

They are selling you, in short, the absence of every layer in the diagram above.

This is not because the people who build these platforms are foolish. They are not. It is because the platforms were built for the demo era — where the goal is to make a thing that produces an impressive output in a controlled setting, fast. They are excellent at this. They are not built for the era we are now actually entering, where the goal is to run a thousand of these things a day in production, on workflows that touch real money and real people and that you have to defend in a hearing.

The platforms will, eventually, grow into this. Some of them are starting to. But the gap between what they ship today and what production-grade process automation requires is much larger than the marketing suggests, and it will be filled by teams who choose the unfashionable option.

The hype cycle has a predictable rhythm. The new thing is exciting because it skips the boring parts. Then production reveals that the boring parts were load-bearing. Then the boring parts get added back, one at a time, painfully, by the teams that survive long enough to need them. The teams that wait for the hype to install the boring parts on their behalf wait a long time, because hype does not install boring parts. Hype installs new features.

The teams that install the boring parts themselves, in advance — that pick the platform with the compiler, that demand the typed contracts, that put the budget in the source — those teams will be the ones still shipping in 2030. They will look, today, like they are moving slowly. In three years they will look like they were moving exactly the right speed.

A craft, not a product

I want to close on something that the previous three pieces have circled around without quite saying.

Process automation is a craft. It is older than the software industry. It is older than the computer industry. It has roots in the time-and-motion studies of the early twentieth century, and before that in the trade guilds, and before that in the apprenticeship traditions of every settled civilisation that ever had to coordinate work across more than one room.

Crafts have a particular shape. They have practitioners who get better with time. They have apprentices who are taught by the people who came before. They have standards that are enforced not by management but by the practitioners themselves, refusing to ship work that does not meet them. They have a sense of what is right and what is sloppy that does not need to be justified to outsiders, because the outsiders cannot tell, and the practitioners can.

The agentic moment is, among other things, a moment when this craft is being threatened by people who do not know it exists. They believe they are inventing process automation from scratch. They are not. They are reinventing it badly, with worse tools, in a hurry, for a market that does not yet know enough to demand otherwise.

The job of anyone who has been around long enough to remember is to keep the craft alive. To insist that the boring parts are load-bearing. To choose, at every fork, the platform that compiles over the platform that interprets, the system that types over the system that hopes, the source that is auditable over the source that is convenient. To pay the small short-term price for the large long-term sanity.

This is not a heroic posture. It is a workmanlike one. There is no glory in it, and there is no Twitter audience for it, and there is no conference circuit for it. There is only the quiet satisfaction, three years on, of looking at a system that is still shipping, that has not had its catastrophe, that the regulators have not flagged, that the customers have not left — and knowing that this is not luck. It is the result of discipline, applied early, when it was unfashionable.

The previous piece in this series ended with the line look at the shape.

I want to end this one with the older line that the discipline of engineering keeps coming back to, because every generation has to learn it again.

Build it like it has to last.

The series ends here, for now. If you found these useful, the previous pieces are linked at the top. The discipline they describe is older than I am, and will outlast all of us. The job is to remember it.

What Your Agent Will Cost You on a Tuesday

Praveen Govindaraj — Fri, 08 May 2026 06:43:05 +0000

A field guide to the fragilities the agent dashboard does not show.

Photo by Compagnons on Unsplash

Third in a short series. The first piece argued that the gap between AI-agents-in-pilots and AI-agents-in-production is being closed by an unglamorous infrastructure rebuild. The second piece showed what that infrastructure actually looks like — process registries, tool gates, audit trails, the boring stuff. This piece is about what the previous two were too polite to say.

I know an engineer who used to work in fraud detection. She told me, once, that the team was proud of how rarely they saw an alert. The dashboard was almost always green. Months would go by without a single high-severity event firing.

What she eventually realised was that the absence of alerts was not the absence of fraud. The absence of alerts was the absence of fraud that her detector was trained to find. The fraud kept happening. It had simply moved.

I have been thinking about that conversation a lot, lately, in the context of AI agents.

*The average is fine. It’s the variance that takes you out.
*
Here is what most agentic dashboards in 2026 look like. Success rate: 98.7%. P95 latency: 3.2 seconds, well under target. Cost per instance: four cents. The little arrows next to each number are pointing in the right direction. The deck is green. The CFO nods.

This is the friendly hump. It is real. It is also not the whole story.

Consider the actual shape of an agent’s payoffs.

Figure 1. The visible distribution and the part that takes you out.

The hump on the right is what the dashboard measures — the small, frequent wins. The mean of the whole distribution is positive. Looks great in the deck. But the dashboard does not show you the long, ugly tail on the left, and it especially does not show you the spike at the very edge — the one that goes off the chart at minus two million dollars on a Tuesday.

Most days, the agent saves you money. One day a year, it does something irreversible.

The mean does not capture this. Variance does not capture this. Standard deviation, ratio metrics, three-sigma confidence intervals — none of them capture this. All of them assume the world is symmetric. The agentic world is not symmetric. It is a world where the upside is bounded — you can save at most $200 a day; you cannot save $2M a day — and the downside is not. The agent really can lose you $2M in an afternoon if it deletes the wrong table or sends the wrong wire.

This is what statisticians call a fat-tailed distribution. The tech industry has been designing for the friendly hump for thirty years. The collision with this kind of risk shape was always going to be ugly.

A practical implication, free of charge: any KPI that summarises agent performance into a single number is, structurally, lying to you. Not because the people who built it are dishonest — most of them are excellent — but because the math itself does not reduce to a single number when the distribution has a fat tail. If your weekly review shows you the green KPI and not the loss histogram, you are looking at the wrong artifact.

What gets measured causes what gets ignored
There is a specific thing that happens when you put a dashboard up on a wall.

The numbers on the dashboard improve. The numbers not on the dashboard get worse. This is not a metaphor; it is a sociology law, observed often enough to have a name. When a measure becomes a target, it ceases to be a good measure.

I have watched it happen with agentic systems. A team measures success rate. Engineers, who are clever, learn to define “success” generously. The success rate goes up. Costs go up too, but those weren’t on the dashboard until later. By the time they appear, the team has six months of “success” history that is impossible to question without admitting the metric was wrong.

Or: a team measures cost per instance. Engineers learn to break a job into smaller instances. Cost per instance goes down. Total cost goes up. The dashboard is happy.

Or: a team measures user satisfaction. Engineers learn to time the survey for after a successful interaction. Satisfaction goes up. The angry users who churn before the survey arrives are invisible.

The agentic era is especially susceptible to this because the metrics are easier to game. An agent’s “success” is partially a matter of LLM-as-judge interpretation. An agent’s “cost” depends on which tokens you count. An agent’s “latency” is meaningful only if you fix the routing. Every degree of freedom in the metric is a place where reality and the dashboard part ways.

What you actually want to know is what the worst output looked like this week. Not the average. The worst.

Pull up the ten ugliest traces from production every Friday and read them. You will learn more in twenty minutes than from a quarter of summary statistics.

The cure becomes the disease
You might reasonably ask: if the tail risk is so awful, why not bolt on a bunch of safety gates? Approval workflows. Human checkpoints. Audit logs. Guardrails. The previous piece in this series catalogued them at length.

The answer, the one I want this piece to spend the most time on, is that interventions in complex systems often cause the harm they were meant to prevent. The medical word for this is iatrogenic — from the Greek iatros, “physician,” plus the suffix for causation. Doctors used to bleed patients. The patients got worse. The doctors interpreted the worsening as proof more bleeding was needed.

The agentic version of this is depressingly common.

Figure 2. The intervention becomes the cause.

A safety gate gets added to a workflow. The gate is real, well-intentioned, and at first glance reduces risk. The wait time on approvals starts to grow. As the wait time grows, the operations team — who are measured on throughput, not safety — starts to lean on approvers to sign off faster. The approvers, who are not domain experts and have a queue of forty other items, develop a habit of rubber-stamping. The rubber-stamped items contain the exact failures the gate was designed to catch. The gate now provides a false sense of safety while the system is actually less safe than before, because the people downstream have learned to trust it.

The gate caused the failure mode it was meant to prevent. This is not a hypothetical. Ask anyone who has worked in a regulated industry. The compliance team will tell you about the time the audit checkbox became the only thing that mattered, and the actual quality of the audit declined for five years until something broke loudly enough to reset.

The deeper point: every intervention you make has a second-order effect. You cannot reason about the first-order effect alone. If you add a gate, ask yourself what behaviour the gate will create downstream. If it creates pressure to bypass, you may have made things worse. If it creates a more thoughtful pause, you may have made things better. The same intervention, different outcomes — depending on the context the rest of the system provides.

Any safety mechanism the people downstream have learned to circumvent is, in practice, an unsafe mechanism.

Treat it as such. Remove it, redesign it, or replace it with something that creates the right kind of pressure rather than the wrong kind.

Three responses to the same wave
If the agent is going to face stress — edge cases, unexpected inputs, malformed responses, a tool that comes back garbled at 3 AM — then the right question is not how do we eliminate stress. The stress is the world. You do not get to eliminate it. You only get to choose how your system responds when it arrives.

There are three ways something can respond to stress, and they look identical for a long time before they don’t.

Figure 3. The three response curves. They diverge only when stress is high enough to matter.

The fragile system runs beautifully under normal load. Right up until it doesn’t. There is no warning, because the system was operating well within its capacity all along, and the failure mode is not a gradual degradation but a discontinuity. The agent that has never seen Cyrillic text in its training distribution will handle Latin characters perfectly until the day a Russian customer file shows up, and then it will produce something that looks confident and is very wrong.

The robust system, the one most engineers aim for, takes the stress and shrugs. It survives. It does not get better. It does not get worse. This is fine, but it is also a ceiling. A robust system in 2026 will be a robust system in 2030, but the world in 2030 will have moved.

The third kind — the kind that actually improves under stress — is the rare and worth-aiming-for one. Each edge case it encounters becomes part of its training data. Each failed run becomes a regression test. Each unexpected tool error tightens its retry logic. Stress is not a threat; stress is the input to its improvement.

Most agents in production today are the leftmost curve. They look the same as the others — until they don’t. The teams that operate them mistake the absence of failures for the presence of robustness. These are not the same thing.
A diagnostic, free of charge: if your team’s response to a near-miss is to file a Jira ticket and forget about it, you are running a fragile system. If the response is “let’s add this to the test suite,” you are running a robust one. If the response is “let’s update the agent’s prompt and the eval set and the regression tests in one PR, automatically generated from the trace,” you are running the third kind.

The barbell
Here is the strategy I recommend for any team trying to build agents that survive contact with reality. It is not original. The investing world has known about it for decades. It applies to engineering with no modification.

Figure 4. Two extremes, deliberately. The middle is where the demos live and the post-mortems get written.

Take everything your business does. Sort it into the things that matter (a wire transfer, a regulatory filing, a customer’s medical record) and the things that don’t (a rough draft, a tool suggestion, a search query summary).

For the things that matter, do not use an AI agent. Use a deterministic rule. A SQL query. A validation. A handwritten if-then. Boring. Predictable. Auditable. Coded by a human who can be fired if it goes wrong.

For the things that don’t matter, give the agent the most freedom you can. Open-ended exploration. Creative drafting. Multi-step research. The downside is bounded — at worst, the output is bad, and you throw it away. The upside is large, because the agent can do things no rule could anticipate.

What you must not do is the comfortable middle. The “we’ll let the agent decide, but with some guidelines” approach. The “human in the loop, but only sort of” approach. The “rules engine that calls an LLM that calls a rules engine” approach. The middle is where you get the worst of both: the unboundedness of an agent attached to outcomes that matter.

This middle, incidentally, is where most enterprise AI demos live. A loan officer agent that “advises” but the advice gets followed 95% of the time. A pricing agent that “suggests” but the suggestion ships unchanged. A hiring agent that “helps” but no human ever overrides it. These are not the bounded delegations they appear to be. They are the unbounded ones, dressed up in the language of constraint.

The barbell is harder to sell than the middle. The middle sounds reasonable in a meeting. The barbell sounds extreme. But extremity, in this domain, is closer to the shape of reality than reasonableness is. Reality is mostly small frequent events plus rare large ones. The barbell is the strategy that fits that shape.

What the dashboard does not show
I want to come back to the dashboard, because I have a specific complaint about it.

Most agentic dashboards in production are designed to show you what already broke. They are autopsies. The thing that takes down production next quarter is not on this quarter’s dashboard, because the dashboard only knows about failure modes it has already seen.

Figure 5. The KPIs are green. The substrate of incubating failure is not on the deck.

Above the line: the green KPIs that go in the leadership deck. Below the line: the dozens of small near-misses that nobody is paid to look at. The silent retries. The schema drifts. The hallucinated arguments to tool calls that happened to land on a parameter the API was lenient about. The timeouts that succeeded on retry. The user complaints that came in and were quietly closed because they did not match a known issue.

Each of these is, individually, beneath notice. Together, they are the substrate from which Tuesday’s catastrophe will incubate.

There is a specific kind of engineer who loves this stuff. They are the people who, when nothing is on fire, go looking for things that almost caught fire. They tail logs nobody asks them to tail. They run analyses on retry distributions. They set up alerts on metrics nobody else cares about. They are the reason your system has not yet had its Tuesday.

Most organisations, sadly, do not value these engineers. They are not promoted, because their work is invisible. The dashboards, you see, are still all green.

Find the engineer who is looking under the surface, and pay them more. Pay them visibly more. This comes after seeing myself in mirror.

Make it institutional. Or accept that you are betting your firm’s future on the assumption that this Tuesday will not be the wrong Tuesday.

Things that will not save you
A list, in no particular order, of things that are sold as solutions to agentic fragility but mostly are not:

More elaborate prompts. A prompt is a wish. The model is under no obligation to honour your wish. Wishes that worked yesterday will fail tomorrow when the model gets updated, the input distribution shifts, or the moon phase changes. Anything that depends on a prompt being followed cannot be a load-bearing safety mechanism.

Bigger models. A bigger model is more capable. It is also more capable of being wrong with confidence. The error mode of older models was obvious nonsense. The error mode of frontier models is plausible, well-formatted, internally consistent nonsense. The dangerous failures got prettier with scale, not rarer.

More guardrails. Guardrails work the way fences work. They are great until something climbs them, and then they make it harder to see what is happening on the other side. The thing about an adversary — and a misbehaving agent is, in effect, an adversary — is that an adversary will route around the guardrail you specifically built. Guardrails should be one layer in a defence-in-depth, not the whole defence.

Fine-tuning on incidents. You will fine-tune the model on this quarter’s incidents. The model will get better at this quarter’s incidents. Next quarter’s incidents will be different — that is what makes them next quarter’s incidents. Fine-tuning is rear-view-mirror engineering.

Audit logs. An audit log tells you what already happened. It does not stop the next thing from happening. It is a forensic tool, not a preventive one. Treat it accordingly.

A bigger model evaluating the smaller model. The fashionable architecture: a small fast model does the work, a big slow model judges. This works fine until the big slow model has the same blind spot as the small fast one — which it often does, because they share most of their training data. You have not added a check; you have added a correlated one. This is the same statistical mistake that made 2008 worse than it should have been.

A committee. Committees do not reduce risk. They distribute responsibility, which is a different thing. A failure of the committee is no one’s failure in particular, which means no one will be in a position to learn from it. The fastest-improving organisations have decisions owned by named individuals who pay a real price for being wrong.

This is a depressing list. I am sorry. The good news is that most of these are not necessary if you get the architecture right in the first place. The bad news is that getting the architecture right is itself difficult, and most of the people selling you these things have never had to do it.

Things that actually help
A shorter list. Hopefully more useful.

Skin in the game for the people building the system. If the team that ships the agent is also on the pager when it fails — at 3 AM, every time, no exceptions — they will design differently. If the team that ships the agent never feels the failure, they will optimise for the green dashboard and let the variance grow. Skin in the game is not a virtue; it is a feedback mechanism. Without it, no amount of process will produce safe systems.

Slow rollouts on irreversible actions. Reversible actions can be deployed quickly. Irreversible ones — anything involving money, customer records, sent communication — should roll out to 1% of traffic for two weeks, then 5%, then 20%, then 100%. The point is not the percentages. The point is that you give your detection systems enough exposure to find the failure modes before the blast radius is everyone.

Read the bad traces. Every Friday, every team that runs an agent in production should pull the ten worst traces from the past week and read them out loud. Out loud. Together. You will be amazed how many failure modes you spot when the bad output is in the room with you, and how few you spot in a Jira ticket.

Kill features that are not paying their way. The agentic world has a horrible habit of accumulating capabilities. Every new tool, every new connector, every new sub-agent expands the attack surface. Periodically — quarterly is fine — go through and kill the ones nobody is using. The system gets simpler. The blast radius shrinks. Nothing is lost, because nothing was being used.

Pay people to think about what could go wrong. Most engineering teams pay to ship features. The most expensive engineers, in any system that survives the long run, are the ones who do not ship features and instead think about what would happen if the features already shipped behaved badly. These engineers are unpopular. They are the reason your firm exists in five years.

That is the whole list. Not glamorous. Not new. Five practices, every one of them invented before the LLM was, all of them ignored by approximately 90% of the teams currently shipping agentic systems.

The practices that survive
A small thing to close on.

The practices that worked for previous generations of mission-critical software — version control, code review, slow rollouts, blameless post-mortems, on-call rotations, runbooks, capacity planning — all still work. They have survived for thirty years because they encode something true about how complex systems fail. Things that have lasted that long usually keep lasting. New things usually don’t.

The practices being invented today specifically for AI agents are mostly not in this category. Most of them are someone’s preferred opinion, untested by time. Some will turn out to be useful. Most will turn out to be cargo-culted variations on practices we already had, dressed up in new vocabulary and sold at a premium.

If you have to choose, in 2026, between an engineering practice that has been around since the 1990s and a “novel agentic safety framework” announced last quarter at a conference, bet on the older one. It will be wrong less often. And when it is wrong, it will be wrong in known ways, which is much safer than being wrong in surprising ones.

The previous piece in this series ended on the line that the boring infrastructure takes the 11% to 60%.

I want to add the obvious corollary: getting to 60% on a fragile foundation just means a 60% rate of getting to your Tuesday faster. The number on the dashboard is not the thing. The shape of the distribution is.

Look at the shape.

The series continues. The next piece will go into specific patterns — the ones that survive Tuesday, and the ones that don’t — and what tells them apart. If you found this useful, the previous two pieces are linked at the top.

The Plumbing Beneath the Magic

Praveen Govindaraj — Fri, 08 May 2026 06:29:22 +0000

Photo by Compagnons on Unsplash

What it actually takes to run an AI agent in production

This is a follow-up to “Process Automation in the Agentic Era.” The short version of that piece: 71% of organisations are using AI agents, only 11% have anything in production, and the gap is being closed by a quiet rebuild of the infrastructure layer underneath. This piece picks up where that one left off.

A friend who runs operations at an insurance company told me something a few weeks ago that has stayed with me.

“We don’t have an AI problem. We have a who-just-did-that problem.”

Their claims process used to be entirely human. A claim came in. A handler read it. A supervisor approved it. A check went out. If anything went wrong, you could trace it back to a person, a date, a signature.

Then they piloted an AI agent. Same workflow, mostly. The agent reads the claim, summarises it, recommends an action, escalates to a human on edge cases. It works beautifully — about 80% of the time. The other 20% is the problem. Not because the agent is wrong; because when the agent is right, nobody can tell how it got there. And when it is wrong, nobody can prove they would have caught it.

Her team did not roll back the pilot. But they did not expand it either. They got stuck — like most enterprises — in the no-man’s-land between interesting and trustworthy.

This is the plumbing problem. And to talk about it with some illustrations

What an agent actually does, in shape

Before we talk about the infrastructure, here is what a single agent doing a single task actually looks like — at least in the way most platforms model it today.

Figure 1. The bare-minimum picture of an agent task.

That is it. That is the whole picture. An input arrives. An agent — which is really a model holding a loop, allowed to call some tools — chews on it until it produces an output.

This is what most demos show you. And if you only had to run one of these, in a sandbox, with no auditor watching, you would be done. The model is smart enough. The tools work. Job over.

But a business does not run one of these. A business runs ten thousand of them per day, in dozens of variations, embedded in workflows that touch real money and real people. And once you do that, the simple picture above is wildly insufficient.

What was missing, and what is now being added

The agent in the box above does not know it is part of a process. It does not know its budget. It does not know whose approval is required before it can act. It does not know how to escalate. It does not know what to log. It does not know what its peers are doing. It does not know what to do if it succeeds — or fails — or just hangs.

All of these are the responsibility of the layer around the box. And that layer is what is being built right now.

Here is the simplest sketch of it.

Figure 2. The agent at the centre, surrounded by the infrastructure layer being built across the industry.

The agent is in the middle. Around it are six things, each of which used to not exist as a discrete concept in the AI agent world, and each of which is now being built — or has already been built — by every serious platform.

Reading clockwise from the top: there is the process orchestration — the thing that knows the workflow has eight steps and what order they go in. There is observability — every model call, every tool invocation, every decision recorded with timestamps. There is human checkpoints — formal places where the workflow pauses and waits for a named approver. There is the audit trail — a record built for the regulator, not the engineer. There is budgets and gates — economic guardrails that prevent a workflow from costing $40 when it should cost $4. And there is the process registry — versioned, signed artifacts that compliance can sign off on once and not have to re-review every Tuesday.

None of these are exotic. Every one of them existed in classical enterprise software. The trick — and the thing the new generation of platforms is figuring out — is reassembling them in a way that does not crush the agent’s productivity in the process.

The shape of a real workflow

Let us make this less abstract. Here is a real one — slightly stylised, but representative — from the kind of process my insurance friend deals with every day.

Figure 3. A claim-processing workflow with two agent tasks, a decision gate, and a human checkpoint for high-value claims.

A claim arrives. An agent extracts the structured data. A second agent runs a fraud check. Then there is a decision — is the claim over $50,000? If so, it pauses and waits for a human adjuster. If not, the agent proceeds to payout.

Notice what this picture has that the first one did not.
It has named steps. Each step has a place in a workflow that an analyst can read without seeing any code. It has a decision with explicit branching logic — not a probabilistic “the agent decides what to do next,” but a deterministic gate. It has a human checkpoint — formal, mandatory, gated by an authority threshold. And it has an end — a clear point at which the work is done.

This is not an AI invention. This is how every business process has been drawn for the last twenty years. The notation is called BPMN. It was standardised in 2011. The reason it is suddenly relevant is that this is the only notation that the people who own these processes — claims supervisors, compliance officers, auditors — can read. The agent went from being the whole story to being a participant in a story that pre-existed it.

The two checkpoints
If you take only one technical idea away from this piece, make it this one. There are two kinds of human-in-the-loop, and most platforms only ship one of them.

Figure 4. Process-level checkpoints gate a whole workflow step. Tool-level gates lock a single tool while the agent uses other tools freely.

The first is the process-level checkpoint — a whole step in the workflow pauses and waits for a human. The insurance claim over $50,000, paused for an adjuster. A loan over $750,000, paused for a senior underwriter. The contract change, paused for legal. These are old. Banks have done this for decades. AI agents inherit them naturally.

The second one is newer, and it is the one that actually lets you trust the agent to do unsupervised work. It is the tool-level gate — and the difference is subtle but enormous.

Imagine you give an agent ten tools. Read this database. Search this knowledge base. Send this email. Delete this customer record. You do not want the agent to ask before reading. You do not want it to ask before searching. You probably do not want it to ask before sending a routine confirmation email. But the delete? The delete you want gated. Always. No matter what the agent thinks. No matter how confident the model is.

The clever way platforms have started shipping this is to attach the gate to the tool, not the agent. The agent does its thing autonomously — until the moment it tries to invoke the gated tool, at which point a human approval request fires off and the agent freezes. The instant the human approves (or denies), the agent resumes. The agent does not even know there is a human in the loop; it just sees a tool call that took eleven minutes to return.

This sounds like a small distinction. It is not. The reason most enterprise AI pilots stall is that the only tool to constrain them was the prompt. You would write a prompt that said, in earnest English, “if you are about to delete a record, ask the user first.” And then sometimes the model would do that, and sometimes — at three in the morning, on Tuesday, after a context switch — it would not. And you would have a deleted record and an apologetic post-mortem.

A tool-level gate makes that conversation impossible. Not “please ask before deleting.” The delete tool will not function until a named human approves.

Mechanical. Auditable. Boring. Exactly what regulators want to see, and what most AI products still do not have.

What the user actually sees
The thing that strikes you, when you watch an experienced operations manager use one of the new platforms, is how unspectacular it looks.

The fancy AI demo where an agent books your trip is not, it turns out, what people want. What they want is something that looks like a project management dashboard from 2014. Rows of running instances. Status badges. Filters. A list of approvals waiting for you. A cost ticker that you can drill into. A search box.

The magic, when there is any, is that the rows are AI agents — not humans clicking buttons. But the interface is deliberately unmagical. Because the people who run these systems are not impressed by magic. They are impressed by predictability.

This is, in the end, the philosophical accommodation we are making.

In the autocomplete era we agreed that intelligence could flow through a person and still belong to them. In the agentic era, we are agreeing that intelligence can flow past a person — but only inside a structure that makes the flow visible, attributable, and reversible. The agent gets autonomy. The infrastructure gets accountability. Each gives up something the other needed.

The boring interface is not a failure of imagination. It is the price.

What is still missing
If I were betting on what the next two years bring, it would be these.

Process mining for agentic workflows. Right now we design the process and then run the agents through it. Soon we will do it backwards — let the agents run, mine the actual paths they took, and let the system propose the workflow that fits what is actually happening. Bottom-up rather than top-down. The classical BPM world has done this for years; the agentic world is about to inherit it.

Federation across enterprises. Today every platform has its own process registry. Tomorrow we will need cross-organisation registries — when an agent at your bank talks to an agent at my insurer, both processes need to interoperate without leaking data. The standards work has not started yet. It will.

Cost-aware autonomy. Today’s gates are static — “do not promote if cost rises by more than 20%.” Tomorrow they will be dynamic — “if you are approaching the budget envelope on this instance, switch from the expensive model to the cheap one and notify ops.” Agents that route themselves through cost-quality tradeoffs in real time. Vellum hints at this. Nobody ships it yet.

A genuinely shared visual language. BPMN is great at workflows but not at the agentic specifics — tool gates, knowledge pipelines, agent skills. Either BPMN gets extended (the OMG is glacial), or a sibling notation emerges. Either way, by 2028 we will have a way for an analyst at a bank, an engineer at a startup, and an auditor at a regulator to look at the same diagram and agree on what it says. We do not have that today.

Why this is worth caring about
It is tempting to read this and think: well, that is a lot of plumbing for a space that is mostly hype. Maybe it is true. Maybe agents fizzle and we go back to writing scripts.

But the thing my insurance friend said keeps coming back. We do not have an AI problem. We have a who-just-did-that problem.

That problem is not AI-specific. It is at least as old as the printing press. Every time intelligence becomes easier to produce and cheaper, the question of accountability gets harder, and a new layer of infrastructure has to be invented to answer it. Notaries. Signatures. Receipts. Audit logs. Version control. Every one of these was once a clever new thing; now they are invisible furniture.

The plumbing being built right now — process registries and tool gates and cost SLOs and trace replays — is just the next set of furniture. In ten years we will not talk about it any more than we talk about HTTPS today. It will simply be the layer that makes intelligence runnable in places where intelligence could not safely run before.

But it is the one that takes the 11% to 60%.

Note : Diagrams in this article are simplified for clarity. Real-world implementations involve more components, more edge cases, and more subtle interactions than any single illustration can capture. The principles are the same; the wiring is messier.

Why 89% of AI agents never reach production — and what’s quietly fixing that

Praveen Govindaraj — Fri, 08 May 2026 05:57:51 +0000

Process Automation in the Agentic Era

There’s a small, mostly unnoticed moment that happens millions of times a day in 2026.
You’re typing a message. Halfway through a sentence, the machine finishes your thought for you. Sometimes you accept the suggestion. Sometimes you don’t. Either way, when you press send, the message goes out under your name. Nobody asks who wrote it. You did. The machine helped. That’s how we’ve collectively decided to think about it.
It’s a small philosophical accommodation, this one. Quiet, almost invisible. We’ve agreed that intelligence can flow through a person — that a thought can come from somewhere else and still belong to us, as long as we read it before pressing send. As long as the human stays in the loop.
Now imagine the machine isn’t finishing your sentence. It’s running your loan underwriting. It’s processing your insurance claim. It’s triaging your medical complaint at three in the morning when the on-call doctor is asleep. The intelligence isn’t flowing through a human anymore. It’s flowing past one. And the question we papered over in the autocomplete era — who, exactly, just did that? — suddenly matters very much.
This is the question the AI industry has been quietly grappling with for the last twelve months. Not whether machines can think — we figured that out a while ago. The harder question: when a machine acts, who is acting? And what does the world need in place to make that question answerable?
There’s a number that captures the shape of this problem.
71% of organizations now use AI agents. Only 11% of agentic use cases reach production.
That gap — between trying an agent and trusting one — is where most of the interesting work in software is happening right now. It isn’t getting much airtime. There’s no flashy new model dropping every Tuesday. No viral demo of an agent booking a flight. Just a slow, methodical rebuilding of an entire layer of software that, until recently, no one was sure we needed.
It turns out we did.
The promise that keeps almost-arriving
For about three years now, the pitch for AI agents has been irresistible. You describe what you want in plain English. The agent figures out how to get it done — calls APIs, reads documents, drafts emails, makes decisions, escalates to humans when stuck. Software that thinks. Knowledge work that runs itself.
In demos, it’s magic. In production, it’s mostly a graveyard.
Klarna built one that genuinely works — handles two-thirds of customer support tickets. Ramp shipped a buyer agent that processes purchases. A handful of others have made it across. But for every Klarna, there are a hundred enterprises with a closet full of half-finished pilots. Agents that work brilliantly on Tuesday and inexplicably destroy a database on Thursday. Agents that pass the demo but fail the audit. Agents that the legal team won’t sign off on, that the finance team can’t budget for, that the operations team can’t monitor when something goes wrong.
The interesting question isn’t why this is happening. We mostly know why. The interesting question is what the industry is now doing about it — and the answer, surprisingly, is that we’re rediscovering a 25-year-old idea.
The standard nobody wanted that everyone now needs
If you worked in enterprise software in the early 2000s, you probably encountered something called BPMN — Business Process Model and Notation. It looked like a flowchart. It had circles for events, rectangles for tasks, diamonds for decisions, and lanes for who-does-what. Banks loved it. Insurance companies loved it. Hospitals loved it. Software people, mostly, did not.
For two decades, BPMN sat in the “boring enterprise” corner of software, alongside things like middleware and document management. The cool kids — and the AI startups especially — built workflow tools that ignored it entirely. Zapier connected apps. Make.com chained operations. n8n let you write JavaScript between nodes. Each had its own visual language. None of them were BPMN, and that felt fine, because BPMN was for compliance officers in suits.
Then something interesting happened in 2025. Camunda — the company that has spent twenty years quietly making BPMN tools — published a report on the state of agentic orchestration. It contained the 71%/11% number. It also contained an argument that the AI industry didn’t quite want to hear: the problem wasn’t that the agents weren’t smart enough. The problem was that there was no shared language between the people building agents and the people who had to live with them.
The compliance officer can’t read Python. The engineer doesn’t want to write a 40-page process document. The auditor needs to see the workflow before sign-off, but the workflow only exists in a giant prompt that was edited at 2 AM by a contractor. The legal team needs an artifact they can review. The operations team needs a diagram they can monitor. The finance team needs a budget they can attach to a process step.
BPMN, it turns out, was already designed for exactly this. Standard since 2011. Read by every BPM tool ever built. Approved by regulators in every major jurisdiction. The thing the AI agent industry was missing was the thing the BPMN industry had been holding for two decades.
So a quiet pivot started. Camunda began shipping AI agent capabilities directly into BPMN diagrams. Academic papers started appearing — “BPMN Assistant,” “H2A-BPMN,” “Mestro” — all asking the same question: can we use LLMs to generate BPMN diagrams, and have those diagrams orchestrate the LLMs back? The answer, it’s turning out, is yes.
What the new layer actually looks like
Eric Broda, who has been writing about this for a while, calls it “Agentic Process Automation” — APA, distinct from RPA (the robot-script tools of the 2010s) and from BPM (the heavyweight workflow suites that came before).
APA is not a product. It’s a runtime architecture. The pieces it requires, roughly:
A process manager that runs the workflow — knows which task is current, what state the data is in, when to escalate, when to retry, when to fail. Think of it as the conductor.
A process registry where workflows live as versioned, signed artifacts. Like a package registry, but for business processes. You can publish, you can subscribe, you can roll back.
A communications fabric — a normalized event stream so that when a task completes, every other system that cares (monitoring, billing, audit, notifications) hears about it in the same format. Without this, every agent integration becomes a custom mess.
An event normalizer that translates the dozen different event vocabularies (Claude’s tool_use, OpenAI's function_call, Anthropic's content_block_delta, vendor X's whatever) into one shared schema. Otherwise the auditor sees ten different log formats and can't reason about any of them.
And — this is the one that matters most — formal human-in-the-loop checkpoints. Not “the agent will ask if it’s unsure,” which is probabilistic and unreliable. Actual gates. The agent cannot execute past this point until a named human, with documented authority, approves with a justification, captured in an audit trail.
If that sounds like overkill for your weekend chatbot, that’s because it is. APA isn’t for the chatbot. It’s for the loan underwriting workflow. The claims process. The sanctions screening. The places where an agent making a wrong call costs money, breaks regulations, or hurts a person.
The convergence nobody is talking about
Here’s what’s strange. If you survey the agentic no-code tools shipping right now — and there are a lot of them — you find them all converging on the same set of patterns. Independently. From different starting points.
n8n started as a workflow automation tool for developers. In January 2026, it shipped tool-level human-in-the-loop gating — you can require explicit human approval before an AI agent invokes a specific tool, with the approval routed through Slack, email, or a web form. In March 2026, it shipped visual diff between workflow versions: side-by-side canvas rendering with changed nodes highlighted.
Dify started as an LLMOps platform. It now ships a “Knowledge Pipeline” — a separate visual canvas just for the data engineering side of RAG (parse → chunk → embed → index → retrieve), letting non-engineers configure how documents become context.
OpenAI launched Agent Builder in October 2025 — drag-and-drop canvas, inline evaluations on each node, version history, preview runs, exportable to SDK code. Sam Altman called it “Canva for building agents.”
Microsoft’s Copilot Studio has gone the furthest on governance. Wave 3 (March 2026) ships agent inventory queryable from Azure Resource Graph, an Activity Map showing what agents are accessing in real time, MCP allowlist policies that admins can enforce, and HITL via Outlook forms. Microsoft Defender and Purview wrap the whole thing.
Vellum focuses on what they call “cost SLO gates” — you set a rule that says “don’t promote this version to production if cost-per-resolved-ticket exceeds $0.08,” and the platform enforces it.
Pipedream lets you embed real code in any node, exposes 10,000+ API tools through a hosted MCP server, and syncs workflows to GitHub.
Tines, born in security operations, calls its visual builder “Storyboard” and treats every action as a HTTP block — schemas change all the time, so the abstraction is “make a request” not “use the Salesforce connector.”
Google’s Opal lets you describe what you want in natural language and generates a working workflow visually.
These tools come from radically different starting points. Different companies, different funding sources, different target users, different licenses. And yet, if you list what each one shipped in the last twelve months, the lists overlap so heavily it’s startling.
Sticky notes for annotation. Time-travel debuggers. AI copilots that generate workflows from descriptions. Versioning baked into the canvas. Cost dashboards. Tool-level HITL gating. Visual diff. Multi-environment promotion. Inline evals.
When ten independent teams arrive at the same set of features, that’s not coincidence. That’s the field discovering its shape.
The three people in every room
The shape, if you look closely, is defined by three people who have to coexist in front of a screen:
The business analyst. Probably draws BPMN diagrams in Visio today. Doesn’t code. Owns the requirements document. Has to walk into a meeting with compliance and the legal team and explain how a process works. Their job is on the line if a diagram gets approved that contains a regulatory violation.
The AI engineer. Comfortable in Python, lives in their IDE, debugs with print statements and trace replays. Wants to iterate on a prompt and see the difference. Doesn’t care what color the start event is. Cares deeply that they can roll back a bad deployment in five seconds.
The operations manager. Doesn’t author. Doesn’t code. Their world is dashboards. They get paged when something fails. They need to know which instance is stuck, why it’s stuck, who can unstick it, and how much it’s costing while it’s stuck. They sign the SLA.
Every successful platform serves all three. Every failed platform serves one and assumes the others will accommodate. The reason most AI agent tools have a low ceiling — and the reason BPM tools historically had a low floor for engineers — is that each was optimized for one persona at the expense of the others.
The new generation of tools is figuring out how to make a single product feel native to all three. Different default views, different terminology in tooltips, different keyboard shortcuts, different defaults — but the same underlying artifact. The analyst sees a flowchart. The engineer sees code. The ops manager sees a live monitor. All three are looking at the same process. None of them can corrupt what the other sees.
That sounds simple. It is technically the hardest problem in this category.
The parts that are hard, the parts that are interesting
If you peel back the visual polish on any of these tools, the hard part is always the same: keeping the diagram and the code in sync without losing information.
A BPMN diagram doesn’t capture everything an engineer cares about — token budgets, retry semantics, type signatures, OWASP compliance bindings. A piece of code doesn’t render itself as a flowchart that an analyst can read. So you need a third representation that both can be projected from. Most teams call this an “AST” — abstract syntax tree, the same kind of structure compilers use internally.
The interesting platforms (Inkeep is the explicit pioneer, but several others are converging on this) use the AST as canonical truth. The diagram is generated from it. The code is generated from it. An LLM that wants to add a step generates a patch against the AST, not against either projection. Round-trips work because no projection is authoritative — the structure underneath is.
This is the same architectural insight that made compilers good in the 1970s. It just took us four decades to apply it to business processes.
The other hard parts are mostly about cost and trust. AI agents are expensive to run, in ways that traditional automation isn’t. A workflow that costs $0.04 per execution today might cost $0.40 next month if you change the model. Without economic governance — budgets, SLOs, gates — you can’t deploy safely. Without observability — every tool call traced, every prompt logged, every decision attributable — you can’t audit. Without human-in-the-loop checkpoints with real authority semantics, you can’t pass compliance review.
These aren’t AI problems. They’re plumbing problems. They’re the kind of unglamorous work that makes the difference between a demo and a product.
Where this lands
The gap between 71% adoption and 11% production isn’t going to close because models get smarter. The models are already smart enough for most of what enterprises want to do. The gap will close because the layer between the model and the business — the layer that’s been missing for the last three years — is finally being built.
It will look like BPMN, because BPMN already solved the “talk to compliance” problem twenty years ago. It will look like a modern visual editor, because business analysts won’t read code. It will have an AI copilot, because nobody wants to drag rectangles for an hour. It will have versioning and cost gates and tool-level HITL, because you can’t deploy a $40K-per-month agent without economic governance.
It will, in other words, look like the thing that thirteen different teams are independently building right now.
When the dust settles, and the tools mature, and one or two of them become standards, we won’t talk about “agentic process automation” the way we talk about it now. It will just be how processes are automated. The same way we don’t really talk about “internet-enabled email” anymore. The infrastructure becomes invisible when it works.
That’s the boring, beautiful work happening underneath all the agent-demo theater. It’s not as fun to watch as a chatbot booking a flight. But it’s the reason your bank, your insurer, and your hospital might actually be running an AI agent five years from now without anyone noticing.
The 11% becomes 60% not because of a breakthrough. Because of plumbing.
Sources for this article include Camunda’s 2026 State of Agentic Orchestration report, Eric Broda’s writing on Agentic Mesh, the n8n release notes for January and March 2026, OpenAI’s AgentKit announcement (October 2025), Microsoft Copilot Studio Wave 3 (March 2026), Dify’s Knowledge Pipeline release, Vellum’s enterprise guide, and academic papers including BPMN Assistant (arXiv:2509.24592), H2A-BPMN (Springer LNCS 2026), and the Agentic BPM Manifesto (arXiv:2603.18916). Production case studies referenced: Klarna’s customer support agent, Ramp’s buyer agent, LY Corporation’s work assistant, Carlyle’s deployment metrics.

Build a Multi-Agent Data Pipeline in 50 Lines of Neam

Praveen Govindaraj — Wed, 01 Apr 2026 12:38:13 +0000

In this tutorial, you'll build a working multi-agent data pipeline using Neam, an agentic AI programming language. By the end, you'll have a DIO orchestrating five agents through a churn prediction workflow.

Step 1: Define Your Infrastructure Profile. This tells every agent where data lives and what compliance rules apply:

infrastructure_profile MyInfra {
    data_warehouse: {
        platform: "postgres",
        connection: env("DB_URL")
    },
    governance: { regulations: ["GDPR"] }
}

Step 2: Declare Your Agents. Each agent is a specialist. Note the budget constraints:

budget B { cost: 50.00, tokens: 500000 }

databa agent MyBA { provider: "openai",
    model: "gpt-4o", budget: B }
datascientist agent MyDS { provider: "openai",
    model: "gpt-4o", budget: B }
datatest agent MyDT { provider: "openai",
    model: "gpt-4o", budget: B }

Step 3: Wire Up the DIO. The orchestrator coordinates everything:

budget DioBudget { cost: 500.00, tokens: 2000000 }

dio agent MyDIO {
    mode: "hybrid",
    task: "Predict customer churn, identify drivers",
    infrastructure: MyInfra,
    agent_md: "./my_domain.agent.md",
    provider: "openai", model: "gpt-4o",
    budget: DioBudget
}

let result = dio_solve(MyDIO, task)
print(result)

Step 4: Create Your Agent.MD. This is the secret weapon — encode domain knowledge:

## @organization-context
Company: My E-Commerce Co
Scale: 500K customers, 5M orders

## @known-data-issues
- signup_date timezone drift before 2024-03
- Product ratings skew positive (self-reported)

## @agent-preferences
DataScientist: XGBoost for tabular, AUC-ROC metric

Run it: neam-cli run my_pipeline.neam

Welcome to Neam Ecosystem

Neam DIO: Orchestrate 14 AI Agents for Your Data Lifecycle

Praveen Govindaraj — Tue, 31 Mar 2026 06:30:49 +0000

85% of ML Projects Fail.

We Built 14 AI Agents to Fix That.

How the Neam Data Intelligent Orchestrator manages the entire data lifecycle — from requirements to production — with spec-driven agent coordination.

The Number That Should Trouble Every Data Leader
Here is a statistic that should keep every VP of Data awake at night: 85% of machine learning projects never reach production. Not 85% that deliver poor results. Eighty-five percent that never ship at all.

For every six ML initiatives your organization launches, five will consume budget, occupy engineers, generate excitement in steering committees — and then quietly die. The models sit in notebooks. The pipelines rot. The business case gets revisited “next quarter,” which is corporate shorthand for never.

This is not a technology problem. The algorithms work. The cloud scales. The tooling has never been better. The problem is organizational. It lives in the gaps between the business analyst who writes the requirements and the data engineer who builds the pipeline. Between the data scientist who trains the model and the MLOps engineer who deploys it.

These gaps have a name: handoff failures. And they are where data projects go to die.

The Shrimp Tank Insight
In mid-2019, a $70 shrimp tank in a Singapore shop made me rethink how systems should be designed. The shopkeeper explained: no water changes, no filter cleaning. The shrimp eat the vegetation, the vegetation grows back. A self-sustaining ecosystem. You buy it, you keep your hands clean. It just… lives.

💡 Key Insight

That question became the design philosophy behind the Neam DIO: 14 agents, each with a distinct role, each producing outputs that others consume, each making the system stronger simply by doing their job. A data ecosystem that, like that $70 shrimp tank, just… works.

Introducing the Data Intelligent Orchestrator (DIO)
The DIO is the central coordination layer of Neam’s Intelligent Data Organization. It is not a chatbot. It is not a prompt chain. It is a compiled, spec-driven orchestrator that coordinates 14 specialist AI agents across the complete data lifecycle.

The Four-Layer Architecture
LayerAgentsWhat They DoInfrastructureData Agent, ETL Agent, Migration AgentSource discovery, SQL-first warehousing, zero-downtime platform movesPlatformDataOps, Governance, Modeling, AnalystSRE for data, compliance enforcement, architecture intelligence, NL-to-SQLAnalyticalData-BA, DataScientist, Causal, DataTest, MLOpsRequirements, EDA-to-AutoML, causal reasoning, quality validation, production opsOrchestrationDIODynamic crew formation, RACI assignment, 8 auto-patterns, error recovery

Each agent has a defined personality, authority boundary, and trait-based capabilities. The Data-BA Agent is “inquisitive and traceability-obsessed.” The DataTest Agent is “skeptical, adversarial, never rubber-stamps.” The Causal Agent is “correlation-is-not-causation embodied.”

These are not marketing descriptions. They are system prompts compiled into bytecode.

How the DIO Actually Works

Step 1: Task Understanding
When a task arrives — say, “Predict which customers will churn in 90 days and identify the causal drivers” — the DIO classifies the intent and matches it against 8 pre-defined auto-patterns.

Step 2: Crew Formation

Not every task needs all 14 agents. The DIO scores each agent on four dimensions:

Capability match (40%) — Can this agent do the required work?
Cost efficiency (20%) — How much budget does it consume?
Infrastructure compatibility (20%) — Does it work with the declared platform?
Historical performance (20%) — How well has it performed on similar tasks?
For churn prediction, the DIO forms a crew of 7 agents and skips DataOps, Analyst, Modeling, and Migration entirely.

Step 3: RACI Delegation

Every sub-task gets a RACI assignment: who is Responsible (does the work), Accountable (owns the outcome — always the DIO), Consulted (provides input), and Informed (receives results).

Step 4: Execute with Quality Gates

The DataTest Agent — architecturally separated from all builder agents — must approve artifacts before they flow downstream. The agent that trains the model cannot be the agent that validates it. This is a trust boundary.

Step 5: Error Recovery

Retry → Fallback → Graceful Degradation → Human Escalation. Exhaust automated options before involving humans, but involve humans before producing incorrect results.

The Trait System

TraitWhat It MeansAgentsDataProducerCreates data artifactsData Agent, ETL, Migration, Data-BA, DataScientist, DeployDataConsumerReads artifacts from other agentsETL, Modeling, Analyst, DataScientist, Causal, MLOpsCausalReasonerPerforms causal inferenceCausal Agent (exclusively)QualityGatekeeperCan block downstream progressData Agent, DataOps, Governance, DataTest, MLOps

The Causal Agent: The Missing Role

SHAP values tell you which features were important to the model’s prediction. They do not tell you which features cause the outcome. The Causal Agent reveals that “support_ticket_resolution_time” is the actual driver — not “days_since_last_order.” One is chasing symptoms. The other identifies the lever you can actually pull.