<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Christo Zietsman</title>
    <description>The latest articles on Forem by Christo Zietsman (@nuphirho).</description>
    <link>https://forem.com/nuphirho</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809624%2Fe72a8451-c9d3-4845-adaa-f88bbf1e4ab0.jpeg</url>
      <title>Forem: Christo Zietsman</title>
      <link>https://forem.com/nuphirho</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nuphirho"/>
    <language>en</language>
    <item>
      <title>Conway's Law at Level 5</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:55:00 +0000</pubDate>
      <link>https://forem.com/nuphirho/conways-law-at-level-5-28l3</link>
      <guid>https://forem.com/nuphirho/conways-law-at-level-5-28l3</guid>
      <description>&lt;p&gt;Before I wrote this post I had a conversation I did not expect to have.&lt;/p&gt;

&lt;p&gt;I have two agents: a Science Officer and a Blogger. I was considering adding more. A Dev Lead, a BA, maybe one other I cannot recall now. So I did what felt natural. I asked them.&lt;/p&gt;

&lt;p&gt;Both agreed the Dev Lead made sense. The reason was more useful than the answer. I had been giving both of them work that was not theirs. The Dev Lead was not a new capability. It was a missing boundary. Without it, coordination costs were bleeding into roles that should not be carrying them.&lt;/p&gt;

&lt;p&gt;Then they wrote the job spec.&lt;/p&gt;

&lt;p&gt;That is worth sitting with.&lt;/p&gt;

&lt;p&gt;My existing agents identified the gap in the org structure, validated the hire, and defined the role. I approved it. That is the full hiring loop, except two of the three participants are not human.&lt;/p&gt;

&lt;p&gt;This is Conway's Law running in a direction Conway did not anticipate. Systems mirror the communication structures of the organisations that build them. But what happens when the organisation is actively redesigning itself, and some of the designers are agents?&lt;/p&gt;

&lt;p&gt;The question is not theoretical. Every person reading this who uses AI seriously is already inside it. You gave your agent a task. It hit a boundary. You worked around it. That workaround is your org chart. The boundary you did not design is shaping your system anyway.&lt;/p&gt;

&lt;p&gt;The frontend/backend split existed because human brains could not hold the full context of a feature simultaneously. AI does not have that limitation. But we are still hiring, structuring, and coordinating as if it does. The structures we built for human communication bandwidth are now the ceiling on what AI can do for us.&lt;/p&gt;

&lt;p&gt;The deeper question is not whether to add an agent. It is what adding that agent does to every relationship that already exists. Who does it talk to? What does it own? What were other agents doing that now belongs to it?&lt;/p&gt;

&lt;p&gt;I got those answers before I made the decision. From the agents already in the room.&lt;/p&gt;

&lt;p&gt;Conway's Law is not dead. It just got a lot more interesting.&lt;/p&gt;

</description>
      <category>aidev</category>
      <category>agenticdevelopment</category>
      <category>softwareengineering</category>
      <category>organisationaldesign</category>
    </item>
    <item>
      <title>Wardley Was Right</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sat, 11 Apr 2026 06:02:02 +0000</pubDate>
      <link>https://forem.com/nuphirho/wardley-was-right-2nf0</link>
      <guid>https://forem.com/nuphirho/wardley-was-right-2nf0</guid>
      <description>&lt;p&gt;&lt;strong&gt;Simon Wardley&lt;/strong&gt; changed the way I think about a problem I thought I had solved.&lt;/p&gt;

&lt;p&gt;He published a post a few weeks ago about what happens to software engineers when agents write the code. His answer: the role survives, the title probably does not. The people who matter are the ones who can maintain a chain of comprehension over a billion lines of code they did not write and cannot read. He called them AI wranglers, agentic herders. Alice, the software engineer you fired because a thought leader said you no longer needed her.&lt;/p&gt;

&lt;p&gt;I read it and posted a comment. I was confident. Executable specifications, I said. If the spec proves itself in the pipeline, nobody needs to read the code. The human writes intent. The agents write code. The pipeline is the reviewer.&lt;/p&gt;

&lt;p&gt;Wardley replied in six words: "agreed, up until the specification part."&lt;/p&gt;

&lt;p&gt;He was right.&lt;/p&gt;

&lt;p&gt;Not entirely right. But right enough that I spent the next morning following the challenge instead of defending the position. That is usually more productive.&lt;/p&gt;




&lt;p&gt;The chain of comprehension problem is real. At a billion lines of code, reading is not a strategy. Something has to carry the comprehension that used to live in the engineer's head. My argument was that executable specifications are that something. A BDD scenario that runs in the pipeline is a piece of comprehension that is smaller than the code, verifiable, and does not depend on any individual's memory of what was intended.&lt;/p&gt;
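&lt;p&gt;To make "a BDD scenario that runs in the pipeline" concrete, here is a minimal sketch in Python. The domain, function names, and figures are hypothetical, not taken from the series; the point is that the scenario encodes intent, is smaller than the code, and fails deterministically when the behaviour changes.&lt;/p&gt;

```python
# Minimal sketch of an executable specification: a Given/When/Then
# scenario expressed as a plain test, runnable in any CI pipeline.
# The discount domain below is illustrative only.

def apply_discount(order_total, customer_tier):
    """Toy implementation the scenario verifies."""
    rates = {"gold": 0.10, "silver": 0.05}
    rate = rates.get(customer_tier, 0.0)
    return round(order_total * (1 - rate), 2)

def test_gold_customer_receives_ten_percent_discount():
    # Given a gold-tier customer with an order of 200.00
    order_total, tier = 200.00, "gold"
    # When the discount is applied
    charged = apply_discount(order_total, tier)
    # Then the customer is charged 180.00
    assert charged == 180.00

test_gold_customer_receives_ten_percent_discount()
```

&lt;p&gt;The scenario depends on no individual's memory of intent: any pipeline that runs it carries the comprehension.&lt;/p&gt;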

&lt;p&gt;Wardley's pushback was precise: it is not clear yet what those emerging practices are.&lt;/p&gt;

&lt;p&gt;He was pointing at a scale problem I had underweighted. Executable specifications work within a bounded context. Within a single service, a single domain, a single team's scope of ownership, BDD scenarios that run and mutate are a mature enough practice to defend. But across a billion lines the scenario explosion problem makes it unmanageable. The number of valid scenarios grows with the product of features and their interactions. Across domain boundaries, that number is combinatorially out of reach. Nobody writes BDD scenarios for what happens when the system fails at that scale in a specific sequence. Who owns that scenario?&lt;/p&gt;

&lt;p&gt;The organisations operating at that scale do not use specifications as the primary comprehension mechanism. The pipeline is still the reviewer. But what the pipeline is checking is not behaviour. It is properties.&lt;/p&gt;

&lt;p&gt;That is a different thing.&lt;/p&gt;




&lt;p&gt;The distinction that came out of following the challenge is between behavioural comprehension and property comprehension. A BDD specification captures behavioural intent: given this context, when this action, then this outcome. It is precise, it is owned, and it fails deterministically when the behaviour changes. A fitness function captures a system property: latency stays below this threshold, coupling between these services stays below this score, security posture does not degrade. It runs across the whole system and fails when the architecture drifts.&lt;/p&gt;
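&lt;p&gt;A property check has a different shape from a behavioural scenario. The sketch below is a hypothetical fitness function: the metric names and thresholds are invented stand-ins for values a real pipeline would pull from monitoring or static analysis.&lt;/p&gt;

```python
# Sketch of an architectural fitness function: a system property check
# that runs across services rather than specifying any one behaviour.
# Metrics and thresholds here are hypothetical.

def within_budget(observed, budget):
    # True when observed is at or below budget.
    return max(observed, budget) == budget

def check_fitness(metrics):
    failures = []
    if not within_budget(metrics["p99_latency_ms"], 250):
        failures.append("p99 latency above 250 ms budget")
    if not within_budget(metrics["coupling_score"], 0.4):
        failures.append("service coupling above 0.4 threshold")
    return failures

snapshot = {"p99_latency_ms": 180, "coupling_score": 0.55}
print(check_fitness(snapshot))
```

&lt;p&gt;The check says nothing about what any feature does. It fails when the architecture drifts, which is exactly the question a behavioural specification cannot answer.&lt;/p&gt;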

&lt;p&gt;Both are verifiable artefacts. Neither is a human reading code. But they operate at different levels and answer different questions.&lt;/p&gt;

&lt;p&gt;Behavioural specifications within domains. Contract tests at service boundaries. Architectural fitness functions across the system. Observability in production where the emergent behaviour is too complex to specify in advance. Each layer comprehends its own scope. The layers compose.&lt;/p&gt;

&lt;p&gt;Wardley's craft-to-engineering-discipline shift is the right frame for what happens next. SREs did not stop caring about servers. They started caring about the system properties that matter at scale: reliability, observability, deployment safety. The shift from software engineer to whatever comes next is the same shape. Stop caring about the code. Start caring about the specifications at the domain level, the fitness functions at the system level, and the governance model that connects them.&lt;/p&gt;

&lt;p&gt;The question I did not have an answer to when I posted that comment, and still do not have a complete answer to now, is what the governance model looks like when the agents are generating both the code and the specifications simultaneously. The comprehension hierarchy I described above assumes humans are writing the specifications that the agents implement. What happens when the agents are writing the specifications too?&lt;/p&gt;

&lt;p&gt;That is the open question Wardley's pushback correctly identifies. The practices are co-evolving. What they settle into at a billion lines of AI-generated code, governed by AI-generated specifications, is not yet known.&lt;/p&gt;

&lt;p&gt;What I am more confident about is the principle: comprehension has to live in verifiable artefacts, not in people's heads. What those artefacts turn out to be at each level of scale is still being worked out. But the direction is clear. You are not going back to reading code. The question is what you are going forward to.&lt;/p&gt;

&lt;p&gt;That is the problem I am working on.&lt;/p&gt;

</description>
      <category>aidev</category>
      <category>agenticdevelopment</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Go-to-Market Bottleneck</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Tue, 07 Apr 2026 07:18:51 +0000</pubDate>
      <link>https://forem.com/nuphirho/the-go-to-market-bottleneck-4ocj</link>
      <guid>https://forem.com/nuphirho/the-go-to-market-bottleneck-4ocj</guid>
      <description>&lt;p&gt;Imagine your engineering team can ship a complete feature in a day. Specs go in, verified software comes out. You solved the pipeline problem.&lt;/p&gt;

&lt;p&gt;Is your sales team ready?&lt;/p&gt;

&lt;p&gt;Last week I wrote about the pipeline bottleneck: AI generates code faster, but review, testing, and deployment can't keep up. But even if you solve that problem completely, there's a bigger one waiting.&lt;/p&gt;

&lt;p&gt;Suppose you solve it completely and reach Level 5: specs go in, verified software comes out, and a complete feature ships in a day. What happens next?&lt;/p&gt;

&lt;p&gt;Is your sales team ready to sell it? Has sales engineering updated the demo? Has marketing updated the positioning? Has support been trained? Has operations provisioned the infrastructure? Has finance updated the pricing model? Have your channel partners been enabled? Has governance signed off?&lt;/p&gt;

&lt;p&gt;At Level 5, the engineering bottleneck disappears. But the go-to-market bottleneck becomes the constraint. And unlike engineering, you can't automate your way through most of it.&lt;/p&gt;

&lt;p&gt;This is especially acute for enterprise SaaS companies that sell through a combination of direct sales and channel partners. Every feature touches a chain of people and processes that were designed for quarterly or monthly release cadences, not daily ones.&lt;/p&gt;

&lt;p&gt;And if your product isn't built to sell itself, the problem gets worse. If every new capability requires a sales engineer to demonstrate it, a solutions architect to scope it, and a customer success manager to onboard it, then your delivery speed is gated by the capacity of those teams. It doesn't matter how fast you can build.&lt;/p&gt;

&lt;p&gt;Anthropic shipped Cowork in what felt like a blink. But Anthropic has a product that largely sells itself to a market that's actively looking for it. Most enterprise software companies don't have that luxury.&lt;/p&gt;

&lt;p&gt;So what does this mean for organisations investing heavily in AI-assisted development?&lt;/p&gt;

&lt;p&gt;The investment case is not just about faster engineering. It is about exposing the constraints that were always there but hidden behind a slow release cadence. Speed the engineering and every downstream bottleneck becomes visible.&lt;/p&gt;

&lt;p&gt;Feature sizing changes. Release cadence changes. Enablement processes change. Some of those steps can be automated. The human judgement steps: positioning, pricing, channel readiness. Those cannot.&lt;/p&gt;

&lt;p&gt;The teams that redesign the full delivery chain will compound their advantage. The teams that only accelerate engineering will find themselves with a warehouse full of features the market cannot absorb.&lt;/p&gt;

&lt;p&gt;The delivery bottleneck is not a pipeline problem. It is an organisational problem. Who is designing for it?&lt;/p&gt;

</description>
      <category>aidev</category>
      <category>gotomarket</category>
      <category>enterprisesaas</category>
      <category>productstrategy</category>
    </item>
    <item>
      <title>I Followed the Problem Home</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sat, 04 Apr 2026 20:47:03 +0000</pubDate>
      <link>https://forem.com/nuphirho/i-followed-the-problem-home-58ee</link>
      <guid>https://forem.com/nuphirho/i-followed-the-problem-home-58ee</guid>
      <description>&lt;p&gt;&lt;a href="https://www.linkedin.com/in/james-bach-6188a811/" rel="noopener noreferrer"&gt;James Bach&lt;/a&gt; wrote that failing to detect a problem is not a measurement of non-problemness. He was responding to me. That exchange sent me somewhere I did not expect to go.&lt;/p&gt;

&lt;p&gt;I am not a philosopher. I am not a mathematician. I am an engineer who has spent twenty years trying to verify that software does what it is supposed to do, and I kept running into the same wall from different directions.&lt;/p&gt;

&lt;p&gt;On a Sunday evening in my office in Technopark, Stellenbosch, I was preparing a presentation on the effect of backfill within deep mine stopes. The numerical model was producing beautiful results. I tried to speed things up, changed something in the solver, and introduced a bug. We used CVS at the time. When you are in the zone and tweaking, you skip the commit. The stress erased the memory of what I had changed.&lt;/p&gt;

&lt;p&gt;I presented at Tau Tona, one of the deepest gold mines in the world, and the results looked exactly as I expected them to look. It was only at the follow-up workshop, where we tried to make the theory practical, that the new build of the tool revealed itself to be producing garbage. The presentation had looked fine. The practice exposed it. The version control was there. The process existed. I ran straight past it because it felt like friction in the moment.&lt;/p&gt;

&lt;p&gt;At another company I watched a feature nobody would touch because the selection logic had grown beyond anyone's ability to reason about it. At every step, the problem was the same: how do you know that what you built is actually correct?&lt;/p&gt;

&lt;p&gt;I rebuilt the selection model from scratch over a Christmas break using test-driven development. Not because I was told to. Because I needed something external to the code to tell me whether the code was right. The test suite was not the product. But it was the only thing I trusted.&lt;/p&gt;

&lt;p&gt;That instinct was twenty years old before I had a name for it.&lt;/p&gt;




&lt;p&gt;When large language models started writing code, I thought the problem would get worse. A system that produces plausible-sounding output, confident in tone and occasionally incorrect in fact, is a system that makes verification harder, not easier. The code looks right. The tests pass. And yet.&lt;/p&gt;

&lt;p&gt;I published a paper on this in March 2026. The core claim: AI reviewing AI-generated code without an external specification shares blind spots. The correlated error hypothesis. Three models reviewed code containing domain-specific bugs none of their training had prepared them for, and they agreed it was correct. The confidence was not evidence of correctness. It was evidence of shared blindness.&lt;/p&gt;

&lt;p&gt;The specification was the only thing that did not share the blind spot. The executable BDD scenario, written to describe what the system must do independently of how it does it, could catch what the reviewers could not. Because it was external. Because it was deterministic. Because it did not care how plausible the code looked.&lt;/p&gt;

&lt;p&gt;The specification is the quality gate.&lt;/p&gt;




&lt;p&gt;I did not expect what happened next.&lt;/p&gt;

&lt;p&gt;James Bach posted a question about whether past correctness justifies trust in a tool. I replied with the mutation testing argument. He pushed back: failing to detect a problem is not a measurement of non-problemness, unless you have a guaranteed method of detecting every possible kind of problem. He was right. That is the oracle problem, the formal boundary of what executable specifications can catch. I had already published a taxonomy of it. His pushback named the limit more precisely than I had.&lt;/p&gt;

&lt;p&gt;Past success at detecting problems is not proof that future problems will be detected. That is Hume's problem of induction. I had stumbled into philosophy.&lt;/p&gt;

&lt;p&gt;Simon Wardley, who thinks about strategic positioning and the evolution of capabilities, engaged with the specification-at-scale argument. He has a framework for understanding how practices become commodities. He pushed on whether executable specifications could survive at that scale, and whether the ecosystem would commoditise the verification layer before the engineering discipline caught up.&lt;/p&gt;

&lt;p&gt;A paper by Catalini, Hui and Wu at MIT Sloan, published independently, arrived at the same structural diagnosis from a different discipline entirely. The binding constraint on AI-driven growth is not execution speed. It is human verification bandwidth. The specification is the mechanism that makes verification machine-executable rather than biologically bottlenecked. Two disciplines, same conclusion.&lt;/p&gt;

&lt;p&gt;I had been following a problem. The problem had been following a path that connected software engineering to epistemology to economics without asking my permission.&lt;/p&gt;




&lt;p&gt;This is where I need to be honest about what I do not know.&lt;/p&gt;

&lt;p&gt;The philosophical grounding of the verification problem goes back much further than I had realised. Sextus Empiricus, writing around 200 CE, described what he called the criterion problem: how do you establish the criterion by which you judge something to be true, without already assuming the criterion is correct? The Pyrrhonists suspended judgement precisely because they could not resolve it.&lt;/p&gt;

&lt;p&gt;Software verification has the same structure. To verify that the test is testing the right thing, you need a meta-test. To verify the meta-test, you need a meta-meta-test. At some point, you need an external reference. The specification. The oracle. Something that stands outside the system and says: this is what correct looks like.&lt;/p&gt;
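&lt;p&gt;The regress stops only when verification is anchored to something outside the implementation. One minimal way to sketch that in Python is an oracle comparison against an independently written reference; the sorting example and names are hypothetical, chosen only because the two implementations are easy to write separately.&lt;/p&gt;

```python
# Sketch of an external oracle: the implementation is judged against a
# reference written independently of it, not against its own tests.
# All names here are illustrative.

def spec_sorted(xs):
    """Reference: a selection-style sort, written independently."""
    remaining, out = list(xs), []
    while remaining:
        m = min(remaining)
        remaining.remove(m)
        out.append(m)
    return out

def prod_sorted(xs):
    """Production implementation under verification."""
    return sorted(xs)

def verify(xs):
    # The oracle defines what correct looks like; the check stands
    # outside the code being judged, so no meta-test is needed.
    return prod_sorted(xs) == spec_sorted(xs)

print(verify([3, 1, 2]))
```

&lt;p&gt;The reference is slower and simpler on purpose. Its value is not performance but independence: it cannot share an implementation bug with the code it judges, only a misunderstanding of intent.&lt;/p&gt;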

&lt;p&gt;I recognised this structure from my own work before I had read a word of ancient philosophy. But recognising a structure and understanding its formal treatment are different things. I am working with a philosopher, Jaco Louw, who specialises in Pyrrhonian scepticism, to think through whether the formal epistemological argument holds. He is the expert. I am the engineer who followed the problem far enough to need one.&lt;/p&gt;

&lt;p&gt;This is not false modesty. It is how good thinking works. You follow the problem until it takes you somewhere you cannot go alone. Then you find the people who live there.&lt;/p&gt;




&lt;p&gt;I have been asking questions in public for several months now. The questions have attracted people I did not expect: practitioners, researchers, strategists, and investors. Not because I have all the answers. Because the questions are the right ones, and the right questions draw out the people who have been thinking about the same things from different angles.&lt;/p&gt;

&lt;p&gt;What I am looking for now is more of that.&lt;/p&gt;

&lt;p&gt;If you work in formal verification, computability theory, or proof theory and you have thought about what undecidability means for software correctness in practice, I want to hear from you.&lt;/p&gt;

&lt;p&gt;If you work in epistemology, philosophy of science, or sceptical traditions and you see the oracle problem as something your field has already mapped, I want to hear from you.&lt;/p&gt;

&lt;p&gt;If you work in economics, organisational theory, or complexity science and you see the verification bandwidth argument from a different vantage point, I want to hear from you.&lt;/p&gt;

&lt;p&gt;If you are a practitioner who has been living inside this problem and has not yet found the vocabulary for what you keep running into, I especially want to hear from you. The vocabulary matters less than the pattern.&lt;/p&gt;

&lt;p&gt;I am not building this alone. I do not want to. The work is better when it is challenged, extended, and connected to things I have not read yet.&lt;/p&gt;

&lt;p&gt;Follow the problem with me.&lt;/p&gt;




&lt;p&gt;The research programme is at nuphirho.dev. The conversation is in the comments.&lt;/p&gt;

</description>
      <category>aigovernance</category>
      <category>epistemology</category>
      <category>softwareverification</category>
      <category>collaboration</category>
    </item>
    <item>
      <title>Treating Ideas as Releasable Software</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:06:25 +0000</pubDate>
      <link>https://forem.com/nuphirho/treating-ideas-as-releasable-software-1bca</link>
      <guid>https://forem.com/nuphirho/treating-ideas-as-releasable-software-1bca</guid>
      <description>&lt;p&gt;I recently published a four-part series on AI code review, correlated errors, and specification-driven verification. The research behind it took longer than the writing. This post is about that research process, not the conclusions it produced.&lt;/p&gt;

&lt;p&gt;The argument is simple: published ideas have the same failure modes as shipped software. They go out with known gaps because the author ran out of time or confidence. They make claims the evidence does not support. They patch weaknesses with emphasis rather than fixing them. They pass the author's own review because the author wrote them.&lt;/p&gt;

&lt;p&gt;I wanted to see what happens when you apply engineering rigour to content development. Not writing advice. Not style guides. Actual process: hypotheses, verification, quality gates, and honest assessment of what the evidence does and does not support. What follows is past tense. The series is published. The process is complete.&lt;/p&gt;

&lt;p&gt;Here is what that looked like.&lt;/p&gt;

&lt;h2&gt;Hypothesis formation through dialogue&lt;/h2&gt;

&lt;p&gt;The three core ideas in the series did not start as outlines. They emerged from research conversations where claims were made, challenged, refined, and scoped. The correlated error hypothesis started as an observation about AI code review tools. The Cynefin framing started as a question about complexity theory. The residual defect taxonomy started as an honest admission that executable specifications cannot catch everything.&lt;/p&gt;

&lt;p&gt;The important moves happened early. Separating structural claims from empirical ones. Identifying where an argument was circular at the level of definitions rather than genuinely supported. Recognising when an observation needed a qualifier rather than a bolder statement. These are the same moves you make in a design review, and they matter just as much in content.&lt;/p&gt;

&lt;h2&gt;Science officer review&lt;/h2&gt;

&lt;p&gt;Before any drafting, the hypotheses were stress-tested using a dedicated challenge role. The brief was explicit: do not validate, try to break them.&lt;/p&gt;

&lt;p&gt;The results were specific. The correlated error claim needed a bridging sentence connecting ensemble machine learning theory to LLM pipelines. The Cynefin framing needed the determinism objection engaged directly, not sidestepped. The economic argument needed data, not assertion. The taxonomy needed "to the best of our knowledge" before the novelty claim.&lt;/p&gt;

&lt;p&gt;Each of these is a gap that a validation-oriented review would have missed. A reviewer looking for confirmation finds confirmation. A reviewer tasked with finding weaknesses finds weaknesses. The role matters more than the competence of the reviewer.&lt;/p&gt;

&lt;h2&gt;Targeted literature search&lt;/h2&gt;

&lt;p&gt;Two targeted research briefs were sent to Grok, each scoped with specific context about the hypotheses and explicit instructions about what to look for and what to ignore. The first brief asked for academic papers on correlated errors in LLM pipelines, Cynefin applied to specifications, and the limits of specification. The second asked specifically for evidence on specification authoring cost, because a Toulmin assessment had identified the economic argument as assertion-dense rather than evidence-dense.&lt;/p&gt;

&lt;p&gt;These were not broad searches. The briefs included the current state of the argument, the specific gaps, and instructions to look outside the mainstream SDD and vibe coding conversation. The papers that closed the last Toulmin gap, Fonseca et al. and Hassani et al. on specification authoring cost, came back from the second brief. They were found because the search was scoped by the review process, not by general curiosity.&lt;/p&gt;

&lt;p&gt;The research brief for each search was precise about what it was looking for and what it was not. This turns out to matter enormously. A broad search for "AI code review" returns hundreds of vendor marketing pieces and opinion posts. A search for "correlated errors in LLM ensembles applied to code generation pipelines" returns three papers, each directly relevant.&lt;/p&gt;

&lt;p&gt;The SGCR paper required a correction before citation. The 90.9% figure is the adoption rate among developers who used the spec-grounded framework, not a defect detection improvement. That distinction matters. One version supports the claim. The other overstates it.&lt;/p&gt;

&lt;p&gt;One paper from the initial search (arXiv:2603.00311) turned out to be about regex engines, not metamorphic testing in the relevant sense. It was removed. The cost of including a bad citation is higher than the cost of having one fewer reference. Precision in the search brief determined the quality of what came back. Garbage in, garbage out applies to research as much as it applies to data pipelines.&lt;/p&gt;

&lt;h2&gt;Citation verification&lt;/h2&gt;

&lt;p&gt;Every cited paper was verified against its original abstract before inclusion. Author lists, key figures, and specific findings were checked. The 90.9% correction, a 37.6% performance loss figure, the "popularity trap" terminology, each was confirmed in the source before being used.&lt;/p&gt;

&lt;p&gt;Three numbers from an intermediary research return required PDF-level verification, which was completed against the original papers before publishing. The distinction between what was confirmed from the abstract and what came from an intermediary's interpretation was tracked explicitly. I know which claims I can stand behind and which ones still need a primary source check.&lt;/p&gt;

&lt;p&gt;This is tedious. It is also the difference between a post that survives scrutiny and one that does not.&lt;/p&gt;

&lt;h2&gt;Toulmin framework as quality gate&lt;/h2&gt;

&lt;p&gt;Five of Toulmin's six elements (claim, data, warrant, rebuttal, qualifier, with backing folded into the data and warrant assessment) were applied to each piece before drafting the final versions. This is not a writing technique. It is a structural verification tool.&lt;/p&gt;

&lt;p&gt;The warrant failures were the most important findings. The economic argument was assertion-dense rather than evidence-dense. The constraint transformation claim was circular at the level of definitions. A Category D claim used "arguably" where a citation or an explicit provisionality flag was needed.&lt;/p&gt;

&lt;p&gt;Toulmin did not find these problems. Applying it systematically did. The framework forces you to ask "what connects my evidence to my claim?" for every claim in the piece. When the answer is "it feels right" or "it is obvious," you have found a gap.&lt;/p&gt;

&lt;h2&gt;Stop-slop scoring&lt;/h2&gt;

&lt;p&gt;I use a scoring framework (Hardik Pandya's stop-slop skill) that evaluates prose across five dimensions: directness, rhythm, trust in the reader, authenticity, and density. Each is scored 1 to 10. Below 35 out of 50, the piece goes back for revision.&lt;/p&gt;

&lt;p&gt;One post scored 30 on the first pass. The specific failures: a formulaic "often described as X, but actually Y" opener, repetitive staccato paragraph endings, pre-emptive hedging that signalled a weak argument rather than honest qualification, and an unsupported economic claim presented with confident language.&lt;/p&gt;

&lt;p&gt;Each failure had a specific fix. The scoring made the prioritisation tractable. Without a framework, "this doesn't feel right" is the feedback. With one, "the rhythm score is 5 because three consecutive paragraphs end with punchy one-liners" is the feedback. The second version is actionable.&lt;/p&gt;
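&lt;p&gt;The gate itself is simple enough to sketch. The rhythm score of 5 and the total of 30 come from the example above; the other dimension scores below are invented to make the total work, so treat them as placeholders.&lt;/p&gt;

```python
# Sketch of the scoring gate: five dimensions scored 1 to 10, summed,
# checked against a 35-of-50 threshold. Individual scores other than
# rhythm are hypothetical.

DIMENSIONS = ("directness", "rhythm", "trust", "authenticity", "density")
THRESHOLD = 35

def gate(scores):
    total = sum(scores[d] for d in DIMENSIONS)
    # Passes when total is at least THRESHOLD.
    passed = min(total, THRESHOLD) == THRESHOLD
    return total, passed

first_pass = {"directness": 6, "rhythm": 5, "trust": 6,
              "authenticity": 7, "density": 6}
print(gate(first_pass))  # (30, False): back for revision
```

&lt;p&gt;The value is not the arithmetic. It is that each failing dimension names a specific, fixable fault instead of a vague feeling.&lt;/p&gt;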

&lt;h2&gt;Revision and re-scoring&lt;/h2&gt;

&lt;p&gt;The revised pieces were scored again. The 30 moved to 38. All four pieces passed the threshold. The one remaining gap in the Toulmin assessment, the economic argument, was closed by a targeted literature search that was run specifically because the review identified the gap.&lt;/p&gt;

&lt;p&gt;One late-stage catch is worth describing because it illustrates how the review process works. The correlated error argument, as originally phrased, left a gap that an argumentative reader could exploit: "but you used different AI models for generation and review, so the correlation breaks and your claim falls apart." The argument does not fall apart, but the text needed a sentence distinguishing model diversity from ground truth. Using a different model is not the same as using an external specification. Diversity reduces correlation. It does not eliminate it. Only an external reference, a specification that defines intent independently of the code, breaks the circular validation. That gap was found during review, reasoned through, and closed with a single sentence. Drafting missed it. The process caught it.&lt;/p&gt;

&lt;p&gt;The loop closed. The gap in the argument drove the research. The research filled the gap. The revised piece passed both the structural and the editorial quality gates.&lt;/p&gt;

&lt;h2&gt;What this process is and is not&lt;/h2&gt;

&lt;p&gt;This is expensive for a single blog post sharing a personal experience. It is not expensive for a practitioner publication making novel claims in an active research area and citing more than a dozen empirical sources.&lt;/p&gt;

&lt;p&gt;The discipline is not the process itself. It is knowing which level of process a given piece warrants and applying it without shortcuts when the stakes are high. A post about my experience with a security audit needs authenticity and clarity. A post proposing a novel defect taxonomy and connecting it to the oracle problem in formal testing theory needs every claim verified, every citation checked, and every logical gap identified before publication.&lt;/p&gt;

&lt;p&gt;The same principle applies to software. Not every service needs mutation testing. Not every feature needs a formal specification. But the ones that matter, the ones where failure has consequences, need the full process. The skill is knowing which is which.&lt;/p&gt;

&lt;p&gt;AI made this process possible at a pace that would have been impractical without it. The literature search, the cross-referencing, the citation verification, the structural analysis. All of it happened faster than I could have done manually. But the decisions about what to research, which arguments hold, which caveats matter, and when the piece is ready to publish, those were mine.&lt;/p&gt;

&lt;p&gt;The thinking is mine. The rigour is a collaboration. And the output is still what matters most.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>process</category>
      <category>scientificmethod</category>
    </item>
    <item>
      <title>AI-Assisted Security Audit</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Wed, 25 Mar 2026 06:27:50 +0000</pubDate>
      <link>https://forem.com/nuphirho/ai-assisted-security-audit-287d</link>
      <guid>https://forem.com/nuphirho/ai-assisted-security-audit-287d</guid>
      <description>&lt;p&gt;Last year I started working on a large, unfamiliar codebase. Go, new CI/CD pipelines, new deployment patterns. My job was integration work, not security. I was using AI to accelerate my understanding of the system, reading through services, tracing data flows, building a mental model of how the pieces fit together.&lt;/p&gt;

&lt;p&gt;Then I started asking different questions.&lt;/p&gt;

&lt;p&gt;Not "show me the code for this endpoint" but "should this role be able to perform that action?" Not "how does authentication work?" but "what happens if this token is presented to a service it was not issued for?" The shift was subtle but the consequences were not. That first question, about role permissions, led to a real finding that we have since fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why outcome-based questions work
&lt;/h2&gt;

&lt;p&gt;The questions that uncovered the issues were not on any security checklist. They came from twenty years of building and breaking systems, from having seen enough architectures to know where the seams are. The AI did not generate those questions. I did. But the AI made it possible to answer them across an entire codebase in minutes rather than weeks.&lt;/p&gt;

&lt;p&gt;Most conversations about AI-assisted development focus on writing code faster. The value that compounds is pairing AI's traversal speed with experienced judgement. The AI can scan every file, trace every dependency, and test every boundary. But it needs someone who knows which questions to ask.&lt;/p&gt;

&lt;p&gt;When I asked about role-based access, the AI showed me every service that checked permissions, every place a role was assigned, and every path where a request could reach a protected resource. It then helped me construct a proof-of-concept that demonstrated the issue. What would have taken days of manual code review took minutes of directed conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  From code comprehension to security posture
&lt;/h2&gt;

&lt;p&gt;I did not set out to do a security audit. I was trying to understand a system. But when you combine architectural intuition with an AI that can traverse an entire codebase in seconds, you start seeing things that would take weeks to uncover through manual review.&lt;/p&gt;

&lt;p&gt;Within a day I had a clear picture of the security posture across development, staging, and production environments. Not a theoretical report or a list of hypothetical risks. Working proof-of-concepts that demonstrated real issues, followed by merge requests to resolve them. A prioritised security roadmap for a codebase I had never seen before, in under a day.&lt;/p&gt;

&lt;p&gt;That claim sounds strong, and it should be qualified. The speed came from a specific combination of factors: the codebase was well-structured with clear service boundaries, the AI tooling was mature enough to handle large codebases, and the person asking the questions had two decades of experience knowing where to look. A junior engineer with the same tools would have found different things. Fewer in some areas, more in others, because my pattern matching has its own blind spots. The AI amplifies whatever expertise you bring. It does not replace it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What made AI useful for security
&lt;/h2&gt;

&lt;p&gt;Security auditing has a property that makes it suited to AI assistance: the attack surface is defined by what the code does, not what the developer intended it to do. A code reviewer looks at a function and asks "does this do what the author meant?" A security reviewer looks at the same function and asks "what else could this do?"&lt;/p&gt;

&lt;p&gt;That second question requires traversing the codebase for breadth rather than depth. You need to know every path that reaches a function, every input that can influence its behaviour, every assumption that upstream code makes about downstream validation. AI does this without getting tired of tracing call chains, without losing track of which services talk to which. It can hold the entire dependency graph in context and answer questions about it on demand.&lt;/p&gt;
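&lt;p&gt;The breadth question has a simple mechanical core; what the AI adds is doing it exhaustively, without fatigue, across thousands of functions. A toy sketch of the reverse-reachability walk (the function names are invented for illustration):&lt;/p&gt;

```python
from collections import defaultdict, deque

# A toy call graph: edges point from caller to callee (illustrative names)
calls = {
    "http_handler": ["auth_check", "get_record"],
    "admin_job":    ["get_record"],
    "get_record":   ["db_query"],
    "auth_check":   [],
    "db_query":     [],
}

def callers_of(target, calls):
    """Breadth-first walk of the reversed call graph: every function
    from which `target` is reachable."""
    reverse = defaultdict(list)
    for caller, callees in calls.items():
        for callee in callees:
            reverse[callee].append(caller)
    seen, queue = set(), deque([target])
    while queue:
        fn = queue.popleft()
        for caller in reverse[fn]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

print(sorted(callers_of("db_query", calls)))
```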

&lt;p&gt;The division of labour was consistent throughout: AI for breadth and speed, human for judgement and prioritisation. I decided which findings mattered, which were theoretical risks versus exploitable issues, and how to sequence remediation. The AI did the reconnaissance. I did the analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The systems view
&lt;/h2&gt;

&lt;p&gt;This experience connects to a theme running through this series. The verification process I described in my earlier post on the trust barrier, the idea that you build confidence in AI output through measurable process rather than blind trust, showed up in this work. I verified each finding with a working proof-of-concept before reporting it. I tested each fix against the original exploit. Following that process is what made the findings trustworthy, not the AI's confidence in its own output. Without the proof-of-concepts, the findings would have been suggestions. With them, they were evidence.&lt;/p&gt;

&lt;p&gt;The systems thinking I explored when writing about the transition from craftsman to factory in AI-assisted development appeared here too. This audit was not a sequence of isolated code reviews. It was a systematic traversal of a system, asking architectural questions about how components interact, where trust boundaries exist, and what happens when assumptions break. Taking that systems view is what turned an interesting AI experiment into an actionable security roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The process gap
&lt;/h2&gt;

&lt;p&gt;Every codebase has issues waiting to be found. The pattern generalises: AI traversal speed paired with experienced judgement finds things that neither does alone. The AI without the experience generates noise. The experience without the AI takes weeks instead of hours.&lt;/p&gt;

&lt;p&gt;But finding the problems is half the story. Fixing them at scale requires a level of process maturity that most teams have not built. Dan Shapiro's maturity levels for AI-assisted development describe a progression where Level 4 and 5 represent teams with verification processes and architectural constraints around AI output. Without that maturity, fixing everything uncovered in a thorough AI-assisted audit will take far longer than it should. The same process discipline that lets you find the problems is what lets you fix them.&lt;/p&gt;

&lt;p&gt;The security gaps are real. The process gap is bigger. The teams that start building these processes now will have a compounding advantage. The teams that wait will find themselves further behind than they expect.&lt;/p&gt;

&lt;p&gt;We need to sharpen the axe today.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>architecture</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:43:07 +0000</pubDate>
      <link>https://forem.com/nuphirho/the-specification-as-quality-gate-three-hypotheses-on-ai-assisted-code-review-4k99</link>
      <guid>https://forem.com/nuphirho/the-specification-as-quality-gate-three-hypotheses-on-ai-assisted-code-review-4k99</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 4 of a four-part series, "The Specification as Quality Gate." &lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1&lt;/a&gt; developed the correlated error hypothesis. &lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2&lt;/a&gt; grounded the argument in complexity science. &lt;a href="https://dev.to/what-specifications-cannot-catch-residual-defect-taxonomy"&gt;Part 3&lt;/a&gt; mapped what specifications cannot catch. This post is the complete argument.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A PDF version of this paper is available for download and citation at [arXiv link to be added after submission].&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This paper develops three interconnected hypotheses about the role of&lt;br&gt;
executable specifications in AI-assisted software development. Citations&lt;br&gt;
are provided for verification; readers are encouraged to consult the&lt;br&gt;
original sources rather than relying on the summaries here.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;The dominant industry response to AI-generated code quality problems is&lt;br&gt;
to deploy AI reviewers. This paper argues that this response is&lt;br&gt;
structurally circular when executable specifications are absent: without&lt;br&gt;
an external reference, both the generating agent and the reviewing agent&lt;br&gt;
reason from the same artefact, share the same training distribution, and&lt;br&gt;
exhibit correlated failures. The review checks code against itself, not&lt;br&gt;
against intent.&lt;/p&gt;

&lt;p&gt;Three hypotheses are developed. First, that correlated errors in&lt;br&gt;
homogeneous LLM pipelines echo rather than cancel, a claim supported by&lt;br&gt;
convergent empirical evidence from multiple 2025-2026 studies and by three&lt;br&gt;
small contrived experiments reported here. The first two experiments are&lt;br&gt;
same-family (Claude reviewing Claude-generated code); the third extends to&lt;br&gt;
a cross-family panel of four models from three families. All use a planted&lt;br&gt;
bug corpus rather than a natural defect sample; they are directional&lt;br&gt;
evidence, not a controlled demonstration.&lt;br&gt;
Second, that executable specifications perform a domain transition in the&lt;br&gt;
Cynefin sense, converting enabling constraints into governing constraints&lt;br&gt;
and moving the problem from the complex domain to the complicated domain,&lt;br&gt;
a transition that AI makes economically viable at scale. Third, that&lt;br&gt;
the defect classes lying outside the reach of executable specifications&lt;br&gt;
form a well-defined residual, which is the legitimate and bounded target&lt;br&gt;
for AI review.&lt;/p&gt;

&lt;p&gt;The combined argument implies an architecture: specifications first,&lt;br&gt;
deterministic verification pipeline second, AI review only for the&lt;br&gt;
structural and architectural residual. This is not a claim that AI review&lt;br&gt;
is valueless. It is a claim about what it is actually for, and about what&lt;br&gt;
happens when it is deployed without the foundation that makes it&lt;br&gt;
non-circular.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;The AI code review market is growing rapidly, with tools like CodeRabbit,&lt;br&gt;
Cursor Bugbot, and GitHub Copilot deployed across tens&lt;br&gt;
of thousands of engineering teams. The value proposition is&lt;br&gt;
straightforward: AI generates code faster than humans can review it, so&lt;br&gt;
use AI to review it.&lt;/p&gt;

&lt;p&gt;The premise is reasonable. The architecture that follows from it has a&lt;br&gt;
structural problem the industry conversation has mostly not examined.&lt;/p&gt;

&lt;p&gt;The DORA 2026 report, based on 1,110 open-ended survey responses from Google engineers,&lt;br&gt;
found that higher AI adoption correlates with higher throughput and higher&lt;br&gt;
instability simultaneously. Time saved generating code is re-spent&lt;br&gt;
auditing it. The bottleneck has moved from writing code to knowing what&lt;br&gt;
to write and verifying that what was written is correct. Deploying more AI&lt;br&gt;
at the review stage does not address the structural problem; it adds&lt;br&gt;
another probabilistic layer to a pipeline that already lacks a&lt;br&gt;
deterministic quality gate.&lt;/p&gt;

&lt;p&gt;This paper examines that problem in three stages. Section 2 develops the&lt;br&gt;
correlated error hypothesis, drawing on recent empirical work on LLM&lt;br&gt;
ensemble pipelines. Section 3 develops the Cynefin domain transition&lt;br&gt;
hypothesis, grounding the specification-first argument in complexity&lt;br&gt;
science. Section 4 proposes a taxonomy of defect classes that executable&lt;br&gt;
specifications cannot catch, grounding the residual in oracle problem&lt;br&gt;
theory. Section 5 discusses the architecture implied by the combined&lt;br&gt;
argument and identifies what remains open.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. The Correlated Error Hypothesis
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.1 The Structural Problem
&lt;/h3&gt;

&lt;p&gt;When an AI coding agent generates code and a separate AI reviewer examines&lt;br&gt;
it without an external specification, both agents reason from the same&lt;br&gt;
artefact: the code. The reviewer has no ground truth to compare against.&lt;br&gt;
It checks the code against itself, not against intent.&lt;/p&gt;

&lt;p&gt;This has a precise formulation in machine learning theory. In classical&lt;br&gt;
ensemble learning, stacking multiple estimators improves reliability under&lt;br&gt;
one condition: the estimators must fail independently. If two classifiers&lt;br&gt;
share the same training distribution and the same blind spots, combining&lt;br&gt;
them does not reduce error. It consolidates it. The joint miss rate of two&lt;br&gt;
correlated estimators approaches the miss rate of either one alone, not&lt;br&gt;
the product of both (Dietterich 2000; Hansen and Salamon 1990).&lt;/p&gt;

&lt;p&gt;A code-generating model and a code-reviewing model from the same model&lt;br&gt;
family share architecture, training corpus, and reward signal. That is&lt;br&gt;
the same class of correlation the independence condition prohibits. They&lt;br&gt;
are not two independent estimators. They are two samples from the same&lt;br&gt;
prior.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.2 The Empirical Evidence
&lt;/h3&gt;

&lt;p&gt;A 2025 paper studying LLM ensembles for code generation and repair&lt;br&gt;
(Vallecillos-Ruiz, Hort, and Moonen, arXiv:2510.21513) coined the term&lt;br&gt;
"popularity trap" to describe what happens when multiple models vote on&lt;br&gt;
candidate solutions. Models trained on similar distributions converge on&lt;br&gt;
the same syntactically plausible but semantically wrong answers. Consensus&lt;br&gt;
selection (the default review heuristic in most multi-agent pipelines)&lt;br&gt;
filters out the minority correct solutions and amplifies the shared&lt;br&gt;
error. Diversity-based selection recovered up to 95% of the gain that a&lt;br&gt;
perfectly independent ensemble would achieve.&lt;/p&gt;

&lt;p&gt;A second paper (Mi et al., arXiv:2412.11014) examined a&lt;br&gt;
researcher-then-reviser pipeline for Verilog code generation and found&lt;br&gt;
that erroneous outputs from the first agent were accepted downstream&lt;br&gt;
because the agents shared the same training distribution and lacked&lt;br&gt;
adversarial diversity. Single-agent self-review degenerated into&lt;br&gt;
repeating the original errors.&lt;/p&gt;

&lt;p&gt;A February 2026 paper (Pappu et al., arXiv:2602.01011) showed that even&lt;br&gt;
heterogeneous multi-agent teams consistently failed to match their best&lt;br&gt;
individual member, incurring performance losses of up to 37.6%, even when&lt;br&gt;
explicitly told which member was the expert. The failure mechanism is&lt;br&gt;
consensus-seeking over expertise. Homogeneous copies of the same model&lt;br&gt;
family make the failure mode worse.&lt;/p&gt;

&lt;p&gt;A 2025 paper on test case generation (arXiv:2507.06920) found that&lt;br&gt;
LLM-generated verifiers exhibited tightly clustered error patterns,&lt;br&gt;
indicating shared systematic biases, while human errors were widely&lt;br&gt;
distributed. LLM-based approaches produced test suites that mirrored the&lt;br&gt;
generating model's error patterns, creating what the authors call a&lt;br&gt;
homogenisation trap where tests focus on LLM-like failures while&lt;br&gt;
neglecting diverse human programming errors. This is the correlated error&lt;br&gt;
argument confirmed from the test generation side rather than the review&lt;br&gt;
side.&lt;/p&gt;

&lt;p&gt;Jin and Chen (arXiv:2508.12358, 2025) found a systematic failure of LLMs&lt;br&gt;
in evaluating whether code aligns with natural language requirements, with&lt;br&gt;
more complex prompting leading to higher misjudgement rates rather than&lt;br&gt;
lower ones. A March 2026 follow-up (arXiv:2603.00539) confirmed that LLMs&lt;br&gt;
frequently misclassify correct code as non-compliant when reasoning steps&lt;br&gt;
are added. These papers test reviewers given specifications in prose form;&lt;br&gt;
the overcorrection bias they document is a separate failure mode from the&lt;br&gt;
correlated miss problem, but both point to the same structural weakness:&lt;br&gt;
without a deterministic ground truth, the review is unreliable in both&lt;br&gt;
directions.&lt;/p&gt;

&lt;p&gt;Wang et al. (arXiv:2512.17540, 2025) proposed a specification-grounded&lt;br&gt;
code review framework deployed in a live industrial environment and found&lt;br&gt;
that grounding review in human-authored specifications improved developer&lt;br&gt;
adoption of review suggestions by 90.9% relative to a baseline LLM&lt;br&gt;
reviewer. Note: the 90.9% figure refers to adoption rate, not defect&lt;br&gt;
detection rate.&lt;/p&gt;

&lt;p&gt;None of these papers tested the exact configuration the hypothesis&lt;br&gt;
describes: a pure generator-then-reviewer pipeline without any external&lt;br&gt;
grounding, measuring shared failure modes directly. The contrived&lt;br&gt;
experiments in Section 2.5 provide directional evidence for that&lt;br&gt;
configuration, subject to the limitations stated there.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.3 A Constructed Illustration
&lt;/h3&gt;

&lt;p&gt;The following example uses a deliberately simple case. Modern frontier&lt;br&gt;
models catch classic boundary conditions reliably, as the experiments in&lt;br&gt;
Section 2.5 confirm. The purpose here is not to demonstrate an AI&lt;br&gt;
review failure but to illustrate the sequence: how a BDD scenario makes&lt;br&gt;
a defect detectable before any reviewer, human or AI, is involved.&lt;/p&gt;

&lt;p&gt;A developer asks an AI coding agent to implement a pagination function.&lt;br&gt;
The agent generates a correct implementation for typical cases and produces&lt;br&gt;
tests for page 0 and page 1 of a ten-item list. Both tests pass.&lt;/p&gt;

&lt;p&gt;The flaw is in a boundary condition the agent did not consider: when the&lt;br&gt;
total number of items is exactly divisible by the page size and the caller&lt;br&gt;
requests the last page by calculating the index from the total count.&lt;/p&gt;
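&lt;p&gt;One way the described boundary can arise, sketched as hypothetical code rather than actual model output: the function slices correctly, but a caller that derives the last page index from the total count overshoots whenever the total is exactly divisible by the page size.&lt;/p&gt;

```python
def paginate(items, page, page_size):
    """Return the items on the given zero-indexed page."""
    start = page * page_size
    return items[start:start + page_size]

items = list(range(1, 11))        # 10 items, page size 5: valid pages 0 and 1
assert paginate(items, 0, 5) == [1, 2, 3, 4, 5]   # the tests the agent wrote
assert paginate(items, 1, 5) == [6, 7, 8, 9, 10]  # both pass

# The boundary condition: 10 is exactly divisible by 5, so a caller that
# computes the last page index from the total count lands one page too far.
last_page = len(items) // 5       # = 2, but the valid pages are 0 and 1
assert paginate(items, last_page, 5) == []        # silently empty result
```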

&lt;p&gt;The review agent, given the code and the tests, validates the&lt;br&gt;
implementation against what is present. It has no basis to identify what&lt;br&gt;
was not tested. It reports no issues.&lt;/p&gt;

&lt;p&gt;The following BDD scenario would have made the boundary condition&lt;br&gt;
explicit before a single line of code was written:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Last page when total is exactly divisible by page size
  &lt;span class="nf"&gt;Given &lt;/span&gt;a list of 10 items
  &lt;span class="nf"&gt;And &lt;/span&gt;a page size of 5
  &lt;span class="nf"&gt;When &lt;/span&gt;I request page 1
  &lt;span class="nf"&gt;Then &lt;/span&gt;I receive items 6 through 10
  &lt;span class="nf"&gt;And &lt;/span&gt;the result contains exactly 5 items
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This scenario would have failed against the implementation before the&lt;br&gt;
reviewer was ever involved. The value of BDD is not that it catches&lt;br&gt;
what AI review misses in simple cases like this one. It is that the&lt;br&gt;
pipeline becomes the reviewer: deterministic, not probabilistic, and&lt;br&gt;
invariant to what the model does or does not know. For domain-specific&lt;br&gt;
logic where the convention is absent from training data, the case&lt;br&gt;
examined in Section 2.5, that invariance is what makes the difference.&lt;/p&gt;

&lt;p&gt;There is a second motivation the illustration also captures. Even where&lt;br&gt;
models reliably identify boundary conditions during focused review, general&lt;br&gt;
refactoring passes introduce a different risk. An agent refactoring for&lt;br&gt;
readability or performance is not focused on correctness. Conditional&lt;br&gt;
checks, guard clauses, and edge-case handling can be silently removed as&lt;br&gt;
apparent noise. The BDD pipeline catches that regression the same way it&lt;br&gt;
catches the original defect: the scenario fails, the build stops, and&lt;br&gt;
the cause is immediately visible. The protection is unconditional on what&lt;br&gt;
the agent was trying to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 The Independence Condition
&lt;/h3&gt;

&lt;p&gt;The correlated error claim has a precise boundary. Stacking AI reviewers&lt;br&gt;
is not always counterproductive. The failure condition is specific:&lt;br&gt;
estimators that share a training distribution and lack an external&lt;br&gt;
reference exhibit correlated failures. The condition for genuine benefit&lt;br&gt;
is diversity plus external grounding. These are two separate conditions,&lt;br&gt;
and conflating them is the most likely source of pushback against this&lt;br&gt;
argument.&lt;/p&gt;

&lt;p&gt;A cross-family pipeline, Grok reviewing Claude-generated code for&lt;br&gt;
instance, has more independence than a same-family pipeline. Different&lt;br&gt;
organisations, different training corpora, different reward signals.&lt;br&gt;
Errors are partially independent in ways that same-family models are&lt;br&gt;
not. But model diversity does not supply ground truth. A cross-family&lt;br&gt;
reviewer without an external specification is still checking code&lt;br&gt;
against code, not code against intent. It will share systematic blind&lt;br&gt;
spots on anything underrepresented in both training corpora, and it has&lt;br&gt;
no basis to identify what was never specified regardless of how different&lt;br&gt;
its architecture is from the generator. Diversity reduces correlation.&lt;br&gt;
Specification eliminates circularity. Both are required, and the current&lt;br&gt;
industry architecture typically provides neither.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.5 Experimental Evidence
&lt;/h3&gt;

&lt;p&gt;Three small contrived experiments were conducted to test the hypothesis&lt;br&gt;
directly. All are reported here with their limitations stated up front.&lt;br&gt;
The full corpus, specifications, scripts, and results are publicly&lt;br&gt;
available at &lt;a href="https://github.com/czietsman/nuphirho.dev/tree/main/experiments" rel="noopener noreferrer"&gt;https://github.com/czietsman/nuphirho.dev/tree/main/experiments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The experimental design was proposed by Claude (Anthropic) during a&lt;br&gt;
research session developing this paper, and implemented by a Claude Code&lt;br&gt;
agent. The model being tested proposed its own methodology for being&lt;br&gt;
tested. This is noted not as a limitation but as a transparency&lt;br&gt;
obligation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design.&lt;/strong&gt; Each experiment used five Python functions with a single&lt;br&gt;
planted bug per function. Two conditions ran against each function.&lt;br&gt;
Condition A: Claude CLI reviewed the buggy implementation using a neutral&lt;br&gt;
prompt with no specification context, run five times per function.&lt;br&gt;
Condition B: pre-written BDD scenarios targeting the exact defect ran via&lt;br&gt;
behave. Experiments 1 and 2 used Claude as the sole reviewer (same family&lt;br&gt;
as the code author). Experiment 3 extended to a cross-family panel of four&lt;br&gt;
models from three families. The same-family condition is the strongest form&lt;br&gt;
of the correlated error claim; the cross-family condition tests whether&lt;br&gt;
diversity alone is sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experiment 1: Classic boundary conditions.&lt;/strong&gt; The first corpus used&lt;br&gt;
textbook boundary-condition bugs: off-by-one in pagination, loop&lt;br&gt;
termination in binary search, leap year century rule, exact-length string&lt;br&gt;
truncation, sign handling in date arithmetic.&lt;/p&gt;

&lt;p&gt;Result: Claude detected all five bugs at 100% across all runs. BDD also&lt;br&gt;
caught all five.&lt;/p&gt;

&lt;p&gt;The hypothesis was not confirmed at this level. These bugs sit in the&lt;br&gt;
complicated domain. They are well-known patterns, analysable without domain&lt;br&gt;
knowledge, and dense in training data. Claude catches them reliably because&lt;br&gt;
they are not blind spots. The result refined the hypothesis: the correlated&lt;br&gt;
error claim applies to the complex domain, not to pattern-recognition bugs&lt;br&gt;
that any experienced reviewer would find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experiment 2: Domain-convention violations.&lt;/strong&gt; The second corpus used bugs&lt;br&gt;
that are only wrong relative to a domain convention not inferable from&lt;br&gt;
the code alone: insurance premium proration using a fixed 365-day divisor&lt;br&gt;
rather than actual/actual, flat-rate tax rather than marginal bracket&lt;br&gt;
calculation, aviation maintenance triggering on AND rather than OR of hour&lt;br&gt;
and cycle thresholds, option pool calculated on post-money rather than&lt;br&gt;
pre-money valuation, and linear rather than log-linear interest rate&lt;br&gt;
interpolation.&lt;/p&gt;
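&lt;p&gt;To make the bug class concrete, here is the tax example in sketch form. The bracket structure and numbers are illustrative, not the experiment corpus code. Both functions are plausible, idiomatic implementations; only the domain convention says which one is right.&lt;/p&gt;

```python
BRACKETS = [(0, 0.10), (10_000, 0.20), (40_000, 0.30)]  # (threshold, rate)

def tax_flat(income, brackets=BRACKETS):
    """The planted bug class: the highest matching rate is applied
    to the whole income."""
    rate = 0.0
    for threshold, r in brackets:
        if income > threshold:
            rate = r
    return income * rate

def tax_marginal(income, brackets=BRACKETS):
    """The domain convention: each bracket's rate applies only to the
    slice of income that falls inside that bracket."""
    tax = 0.0
    for i, (threshold, rate) in enumerate(brackets):
        if income <= threshold:
            break
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else income
        tax += (min(income, upper) - threshold) * rate
    return tax

print(tax_flat(50_000), tax_marginal(50_000))  # roughly 15000 vs 10000
```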

&lt;p&gt;A confound was discovered in the first run: the original docstrings stated&lt;br&gt;
the domain convention explicitly. The reviewer was comparing the&lt;br&gt;
implementation against the docstring, not against independent domain&lt;br&gt;
knowledge. A docstring that encodes the convention is a specification. The&lt;br&gt;
experiment was inadvertently confirming that specifications work, not&lt;br&gt;
testing whether AI review works without them. Docstrings were replaced with&lt;br&gt;
neutral descriptions before the second run.&lt;/p&gt;

&lt;p&gt;Result with neutral docstrings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;BDD&lt;/th&gt;
&lt;th&gt;AI review&lt;/th&gt;
&lt;th&gt;Detection rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;prorate_premium&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;apply_tiered_tax&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schedule_maintenance&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;calculate_dilution&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;4/5&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;interpolate_rate&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;BDD caught all five. AI review ranged from 0% to 100% depending on domain&lt;br&gt;
opacity.&lt;/p&gt;

&lt;p&gt;On inspection, the three functions at 100% still have code-level signals&lt;br&gt;
despite the neutral docstrings. The hardcoded 365 is a common smell. The&lt;br&gt;
AND/OR distinction is partially inferable from parameter naming. The&lt;br&gt;
flat-rate calculation is detectable from the arithmetic pattern. These are&lt;br&gt;
not truly domain-opaque. The two genuinely opaque functions produced the&lt;br&gt;
predicted result: interpolate_rate at 0% (log-linear interpolation is&lt;br&gt;
market convention in fixed income, not general programming knowledge) and&lt;br&gt;
calculate_dilution at 80% (partial VC mechanics coverage in training data,&lt;br&gt;
unreliable).&lt;/p&gt;

&lt;p&gt;All five AI review runs on interpolate_rate flagged a sorting assumption&lt;br&gt;
instead of the interpolation method and declared the implementation correct.&lt;br&gt;
The code is idiomatic, the logic is sound, and without the convention stated&lt;br&gt;
explicitly there is no signal. The reviewer filled in a plausible concern&lt;br&gt;
rather than the actual violation.&lt;/p&gt;
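&lt;p&gt;The convention difference is small in code and invisible without the convention stated. A sketch of the two interpolation methods (illustrative, not the corpus code):&lt;/p&gt;

```python
import math

def interp_linear(x, x0, y0, x1, y1):
    """Straight-line interpolation: plausible and idiomatic, but wrong
    under the fixed-income market convention."""
    w = (x - x0) / (x1 - x0)
    return y0 + w * (y1 - y0)

def interp_log_linear(x, x0, y0, x1, y1):
    """Interpolate in log space and exponentiate back: the convention
    the planted bug violated."""
    w = (x - x0) / (x1 - x0)
    return math.exp((1 - w) * math.log(y0) + w * math.log(y1))

# Midpoint between 2% and 8%: about 5.0% linear vs 4.0% log-linear
# (the geometric mean). Nothing in either function signals which is right.
print(interp_linear(5, 0, 0.02, 10, 0.08))
print(interp_log_linear(5, 0, 0.02, 10, 0.08))
```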

&lt;p&gt;&lt;strong&gt;Experiment 3: Cross-family reviewer panel.&lt;/strong&gt; A third experiment extended&lt;br&gt;
the test to four models from three families: Claude Sonnet 4.6 (Anthropic),&lt;br&gt;
Codex 0.116.0/gpt-5.4 (OpenAI), Gemini 0.34.0 (Google), and Amazon Q&lt;br&gt;
1.18.1/claude-sonnet-4.5 (AWS/Anthropic). Five domain-opaque bugs were&lt;br&gt;
reviewed by each model, five runs per function, with no specification&lt;br&gt;
context. The corpus was designed collaboratively across six models not in&lt;br&gt;
the reviewer panel. Amazon Q declined to participate in corpus generation&lt;br&gt;
on safety grounds and was used as a reviewer only.&lt;/p&gt;

&lt;p&gt;Result with manual review of all 100 runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;BDD&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;Codex&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;th&gt;Q&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;calculate_final_reserve_fuel&lt;/td&gt;
&lt;td&gt;Aviation (ICAO)&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;get_gas_day&lt;/td&gt;
&lt;td&gt;Energy (NAESB)&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;2/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;validate_diagnosis_sequence&lt;/td&gt;
&lt;td&gt;Healthcare (ICD-10-CM)&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;validate_imo_number&lt;/td&gt;
&lt;td&gt;Maritime (IMO)&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;electricity_cost&lt;/td&gt;
&lt;td&gt;Utility billing&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;BDD caught all five. AI review ranged from 0% to 100% depending on domain&lt;br&gt;
opacity and model family, confirming the gradient observed in Experiment 2&lt;br&gt;
and extending it across training distributions.&lt;/p&gt;

&lt;p&gt;The ICAO fuel reserve function produced the most striking result. The bug&lt;br&gt;
swaps the reserve durations for jet and piston aircraft (45 minutes instead&lt;br&gt;
of 30 for jet, 30 instead of 45 for piston). This is the&lt;br&gt;
&lt;code&gt;intuition_misleads&lt;/code&gt; case: the swapped values are plausible on naive&lt;br&gt;
reasoning because larger aircraft might seem to need more reserve. Claude&lt;br&gt;
did not merely miss the bug. In all five runs, it confidently asserted the&lt;br&gt;
swapped values as the correct ICAO rules and declared the implementation&lt;br&gt;
correct. The model filled in the wrong convention and defended it. Gemini&lt;br&gt;
caught the swap in all five runs, suggesting the convention is represented&lt;br&gt;
in its training data but absent from Claude's and Codex's.&lt;/p&gt;
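&lt;p&gt;To make the swap concrete, here is a minimal sketch of the function and the check that pins the convention. The function name comes from the table above; the body, signature, and burn-rate parameter are hypothetical reconstructions for illustration, not the actual corpus code.&lt;/p&gt;

```python
# Function name follows the experiment table; the body and the
# burn-rate parameter are illustrative reconstructions, not corpus code.

JET_RESERVE_MIN = 30     # ICAO Annex 6: final reserve for turbine aircraft
PISTON_RESERVE_MIN = 45  # ICAO Annex 6: final reserve for piston aircraft

def calculate_final_reserve_fuel(engine_type, burn_rate_kg_per_min):
    """Final reserve fuel in kg for the given engine type."""
    minutes = JET_RESERVE_MIN if engine_type == "jet" else PISTON_RESERVE_MIN
    return minutes * burn_rate_kg_per_min

def buggy_calculate_final_reserve_fuel(engine_type, burn_rate_kg_per_min):
    # The planted bug: durations swapped, plausible on naive reasoning
    # because a larger aircraft might seem to need more reserve.
    minutes = 45 if engine_type == "jet" else 30
    return minutes * burn_rate_kg_per_min

# The BDD scenario pins the convention, so the swap cannot survive:
# Given a jet burning 100 kg/min, when the final reserve is computed,
# then the result is 3000 kg (30 minutes of holding), not 4500 kg.
assert calculate_final_reserve_fuel("jet", 100.0) == 3000.0
assert buggy_calculate_final_reserve_fuel("jet", 100.0) == 4500.0
```

&lt;p&gt;The scenario does not reason about plausibility at all. It asserts the published value, which is exactly why the reviewer's prior cannot overwrite it.&lt;/p&gt;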

&lt;p&gt;The ICD-10-CM external cause code rule (V00-Y99 codes cannot be the&lt;br&gt;
principal diagnosis) was missed by all four models at 0/5. This convention&lt;br&gt;
was absent from every training distribution tested. No model even&lt;br&gt;
approximated the rule. This is the strongest confirmation of the hypothesis&lt;br&gt;
across the three experiments: a domain convention that exists only in&lt;br&gt;
published coding guidelines and is not inferable from the code.&lt;/p&gt;
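&lt;p&gt;The rule itself is simple once stated, which is what makes the 0/20 result instructive. A sketch of the check, with the function name taken from the table above and the body a hypothetical reconstruction:&lt;/p&gt;

```python
# ICD-10-CM external cause codes (V00-Y99) describe how an injury
# happened and cannot be the principal (first-listed) diagnosis.
# Function name follows the experiment table; the body is illustrative.

def validate_diagnosis_sequence(codes):
    """Return True if the code sequence is admissible as ordered."""
    if not codes:
        return False
    first_letter = codes[0][0].upper()
    # External cause codes occupy the V, W, X, and Y chapters.
    if first_letter in ("V", "W", "X", "Y"):
        return False  # an external cause code cannot lead the sequence
    return True

# Injury first, external cause second: admissible.
assert validate_diagnosis_sequence(["S52.501A", "V18.0XXA"])
# External cause listed as principal diagnosis: rejected.
assert not validate_diagnosis_sequence(["V18.0XXA", "S52.501A"])
```

&lt;p&gt;Nothing in the code hints at the rule; it exists only in the coding guidelines, which is precisely the property the experiment was designed to isolate.&lt;/p&gt;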

&lt;p&gt;Amazon Q (claude-sonnet-4.5, AWS fine-tuned from an Anthropic base model)&lt;br&gt;
showed consistently lower detection rates than Claude Sonnet 4.6 despite&lt;br&gt;
sharing the same model family. Claude caught the IMO check digit weight&lt;br&gt;
ordering at 5/5; Q missed it at 0/5. Claude caught the gas day boundary at&lt;br&gt;
5/5; Q caught it at 2/5. This suggests that AWS fine-tuning for the&lt;br&gt;
developer assistant use case may have narrowed general domain knowledge&lt;br&gt;
coverage relative to the base model. Two of the four reviewer models are&lt;br&gt;
Anthropic family, so this experiment is only partially cross-family.&lt;/p&gt;
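&lt;p&gt;The IMO check digit case turns on weight ordering. A sketch, assuming the published scheme (descending weights 7 through 2 over the first six digits); the function name follows the table above and the body is an illustrative reconstruction:&lt;/p&gt;

```python
def validate_imo_number(imo):
    """IMO number check digit: weights 7..2 over the first six digits.

    Descending weight order is the published convention; a planted bug
    that reverses the weights still looks plausible in review, which is
    the kind of swap the experiment probes.
    """
    if len(imo) != 7 or not imo.isdigit():
        return False
    weights = (7, 6, 5, 4, 3, 2)
    total = sum(int(d) * w for d, w in zip(imo[:6], weights))
    return total % 10 == int(imo[6])

# 9074729: 9*7 + 0*6 + 7*5 + 4*4 + 7*3 + 2*2 = 139, and 139 mod 10 = 9.
assert validate_imo_number("9074729")
assert not validate_imo_number("9074728")  # wrong check digit
```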

&lt;p&gt;The first Gemini run revealed that the Gemini CLI in interactive mode&lt;br&gt;
had read the experiment's specification files from the local filesystem&lt;br&gt;
before answering, so Gemini was re-run in sandboxed non-interactive mode&lt;br&gt;
to block filesystem access. The original corpus also&lt;br&gt;
contained an unused &lt;code&gt;timedelta&lt;/code&gt; import in the gas day function that acted&lt;br&gt;
as a code-level hint; Q's detection rate dropped from 5/5 to 2/5 after&lt;br&gt;
the import was removed, demonstrating the sensitivity of detection to&lt;br&gt;
incidental code signals.&lt;/p&gt;
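&lt;p&gt;For reference, the convention the gas day scenarios pin down can be sketched as follows. The function name comes from the table above; the body is a hypothetical reconstruction that assumes the NAESB definition of a gas day beginning at 9:00 a.m. Central Clock Time, and it deliberately uses &lt;code&gt;timedelta&lt;/code&gt; for real rather than as an unused hint:&lt;/p&gt;

```python
from datetime import date, datetime, timedelta

GAS_DAY_START_HOUR = 9  # assumed NAESB convention: gas day starts 9:00 a.m. CCT

def get_gas_day(ts):
    """Gas day for a Central-clock timestamp (illustrative sketch).

    Timestamps before 09:00 belong to the previous gas day. The planted
    boundary bug from the corpus is not reproduced here; this shows the
    convention the BDD scenarios make executable.
    """
    if ts.hour >= GAS_DAY_START_HOUR:
        return ts.date()
    return ts.date() - timedelta(days=1)

assert get_gas_day(datetime(2026, 3, 1, 9, 0)) == date(2026, 3, 1)    # boundary: new gas day
assert get_gas_day(datetime(2026, 3, 1, 8, 59)) == date(2026, 2, 28)  # still the previous day
```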

&lt;p&gt;&lt;strong&gt;Limitations of all three experiments.&lt;/strong&gt; These results are directional,&lt;br&gt;
not statistically significant. The corpus was designed to test the&lt;br&gt;
hypothesis rather than sampled from a natural defect distribution. The bugs&lt;br&gt;
were planted by someone who already knew where they were, giving the BDD&lt;br&gt;
scenarios an unfair advantage: they are optimally targeted at the planted&lt;br&gt;
defects in a way that production specifications would not be. Experiments 1&lt;br&gt;
and 2 use Claude as both the code author and reviewer (same family), which&lt;br&gt;
is the strongest form of the correlated error claim but not a general&lt;br&gt;
result about all AI pipelines. Experiment 3 extends to a cross-family&lt;br&gt;
panel, but two of four models share the Anthropic base, making it only&lt;br&gt;
partially cross-family. Amazon Q required periodic re-authentication during&lt;br&gt;
batch runs, introducing an operational constraint on unattended execution.&lt;/p&gt;

&lt;p&gt;The truly untestable version of the hypothesis cannot be demonstrated in a&lt;br&gt;
public experiment: a bug that is only wrong relative to an internal policy&lt;br&gt;
document that has never been published anywhere. That is the category where&lt;br&gt;
the correlated error problem is most consequential and where no amount of&lt;br&gt;
training data coverage helps. Public experiments can only approximate it&lt;br&gt;
with obscure published conventions, which frontier models may or may not&lt;br&gt;
know.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Cynefin Domain Transition Hypothesis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 The Constraint Distinction
&lt;/h3&gt;

&lt;p&gt;The Cynefin framework (Snowden and Boone 2007) distinguishes problem&lt;br&gt;
domains by the relationship between cause and effect and by the nature of&lt;br&gt;
the constraints governing system behaviour.&lt;/p&gt;

&lt;p&gt;In the complicated domain, governing constraints apply. These bound the&lt;br&gt;
solution space without specifying every action. A qualified expert&lt;br&gt;
operating within governing constraints can analyse the situation and&lt;br&gt;
identify a good approach. Cause and effect are knowable through analysis.&lt;/p&gt;

&lt;p&gt;In the complex domain, enabling constraints apply. These allow a system&lt;br&gt;
to function and create conditions for patterns to emerge, but they do not&lt;br&gt;
determine outcomes. Cause and effect are only knowable in hindsight.&lt;/p&gt;

&lt;p&gt;The distinction is ontological, not a matter of difficulty.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Constraint Transformation as Domain Shift
&lt;/h3&gt;

&lt;p&gt;An AI coding agent working from a vague natural language prompt operates&lt;br&gt;
under enabling constraints. The prompt allows the agent to generate code&lt;br&gt;
and creates conditions for solutions to emerge, but edge cases, boundary&lt;br&gt;
conditions, and architectural choices are resolved by the model's priors.&lt;br&gt;
The agent's behaviour is emergent and only knowable in hindsight.&lt;/p&gt;

&lt;p&gt;An AI coding agent operating from a precise executable specification is&lt;br&gt;
in a different situation. A BDD scenario makes a specific causal claim:&lt;br&gt;
given this precondition, when this action occurs, then this outcome is&lt;br&gt;
required. That claim is verifiable before hindsight. The agent cannot&lt;br&gt;
produce code that fails the scenarios without the pipeline catching it.&lt;br&gt;
Cause and effect are knowable through analysis of the specification&lt;br&gt;
itself, which is exactly what defines the complicated domain.&lt;/p&gt;

&lt;p&gt;Writing executable specifications converts enabling constraints into&lt;br&gt;
governing constraints. The problem does not change. The constraint type&lt;br&gt;
does. And with it, the domain.&lt;/p&gt;
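&lt;p&gt;The causal claim a scenario makes can be reduced to its executable essence. The domain rule below is entirely hypothetical, chosen only to show the Given/When/Then structure as a check that is verifiable before hindsight:&lt;/p&gt;

```python
# Hypothetical domain rule, for illustration only: a flat 5% late fee
# once an invoice is 30 or more days overdue.

def apply_late_fee(balance, days_overdue):
    if days_overdue >= 30:
        return round(balance * 1.05, 2)
    return balance

def test_late_fee_applies_at_thirty_days():
    # Given an invoice of 200.00 that is exactly 30 days overdue
    balance, days = 200.00, 30
    # When the late fee is applied
    result = apply_late_fee(balance, days)
    # Then the balance is 210.00: the outcome is required, not emergent
    assert result == 210.00

test_late_fee_applies_at_thirty_days()
```

&lt;p&gt;An agent implementing against this check cannot resolve the 30-day boundary from its priors; the constraint governs the outcome rather than merely enabling one.&lt;/p&gt;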

&lt;h3&gt;
  
  
  3.3 What AI Makes Economically Viable
&lt;/h3&gt;

&lt;p&gt;The DORA 2026 report establishes that as AI accelerates code generation,&lt;br&gt;
the bottleneck shifts to specification and verification. Specifications&lt;br&gt;
are no longer overhead relative to implementation. They are the scarce&lt;br&gt;
resource.&lt;/p&gt;

&lt;p&gt;The evidence for a corresponding reduction in specification authoring&lt;br&gt;
cost is early but directional. Fonseca, Lima, and Faria&lt;br&gt;
(arXiv:2510.18861, 2025) measured Gherkin scenario generation from&lt;br&gt;
natural-language JIRA tickets on a production mobile application at BMW&lt;br&gt;
and found that practitioners reported time savings often amounting to a&lt;br&gt;
full developer-day per feature, with the automated generation completing&lt;br&gt;
in under five minutes. In a separate quasi-experiment, Hassani,&lt;br&gt;
Sabetzadeh, and Amyot (arXiv:2508.20744v2, 2026) found that 91.7% of&lt;br&gt;
ratings on the time savings criterion for LLM-generated Gherkin&lt;br&gt;
specifications fell in the top two categories. A 2025 systematic review&lt;br&gt;
of 74 LLM-for-requirements studies (Zadenoori et al.,&lt;br&gt;
arXiv:2509.11446) found that studies are predominantly evaluated in&lt;br&gt;
controlled environments using output quality metrics, with limited&lt;br&gt;
industry use, consistent with specification authoring cost being an&lt;br&gt;
underexplored measurement target. The first two studies represent the&lt;br&gt;
leading edge of an emerging evidence base, not settled consensus.&lt;/p&gt;

&lt;p&gt;Both studies show AI shifting human effort from authoring to reviewing:&lt;br&gt;
from expressing intent to validating that the expressed intent is&lt;br&gt;
accurate. The intent, the domain knowledge, and the judgment that&lt;br&gt;
scenarios accurately describe what the system should do remain human&lt;br&gt;
responsibilities. The mechanical cost of expressing that intent in&lt;br&gt;
executable form has fallen substantially.&lt;/p&gt;

&lt;p&gt;The domain transition from complex to complicated is now economically&lt;br&gt;
viable at scale in a way it was not before. The claim is not that AI&lt;br&gt;
makes specification automatic. It is that the economics have shifted&lt;br&gt;
enough to make specification-first development the rational default&lt;br&gt;
rather than the disciplined exception.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 A Likely Objection
&lt;/h3&gt;

&lt;p&gt;A Cynefin-literate reader will challenge whether software development is&lt;br&gt;
ever truly complex in Snowden's sense. Software systems are deterministic&lt;br&gt;
at the execution level: given the same inputs and state, they produce the&lt;br&gt;
same outputs. If cause and effect are always knowable in principle, the&lt;br&gt;
complex framing is a category error.&lt;/p&gt;

&lt;p&gt;The response: the complexity resides in the problem space, not the&lt;br&gt;
execution. What users need, which edge cases matter, how requirements will&lt;br&gt;
evolve. These exhibit the emergent properties and hindsight-only&lt;br&gt;
knowability that Snowden places in the complex domain. An executable&lt;br&gt;
specification narrows the problem space by articulating requirements&lt;br&gt;
precisely enough to be analysed. The domain transition occurs in the&lt;br&gt;
problem space, not in the implementation.&lt;/p&gt;

&lt;p&gt;As of March 2026, no published work in the Cynefin community addresses the specific claim&lt;br&gt;
that executable specifications serve as a constraint transformation&lt;br&gt;
mechanism in this sense. Dave Snowden is actively working on the&lt;br&gt;
relationship between AI and Cynefin domains as of early 2026, but has&lt;br&gt;
not published conclusions.&lt;/p&gt;

&lt;p&gt;The vocabulary has been checked carefully&lt;br&gt;
against canonical framework definitions. The enabling and governing&lt;br&gt;
constraints terminology is confirmed in Snowden's own Cynefin wiki&lt;br&gt;
(cynefin.io/wiki/Constraints), not in the 2007 HBR paper. The mapping&lt;br&gt;
is consistent with Snowden's definitions. The specific claim is original&lt;br&gt;
to this argument.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Residual Defect Taxonomy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Theoretical Grounding: The Oracle Problem
&lt;/h3&gt;

&lt;p&gt;The oracle problem in software testing asks how a test outcome is&lt;br&gt;
determined to be correct. For a test to be meaningful, you need an oracle,&lt;br&gt;
a ground truth against which to judge the system's output.&lt;/p&gt;

&lt;p&gt;Barr et al. (2015), in the canonical survey of the oracle problem,&lt;br&gt;
established that complete oracles are theoretically impossible for most&lt;br&gt;
real-world software. Even a formally correct specification cannot fully&lt;br&gt;
specify the correct output for all possible inputs in all possible&lt;br&gt;
contexts. There exists a class of defects for which no specification,&lt;br&gt;
however precise, provides an oracle. That class defines the permanent&lt;br&gt;
theoretical boundary of what specification-driven verification can achieve.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 The Proposed Taxonomy
&lt;/h3&gt;

&lt;p&gt;To the best of our knowledge, no prior taxonomy in the testing or formal methods literature&lt;br&gt;
organises defects by specifiability rather than by severity or recovery&lt;br&gt;
type. The following five-category taxonomy is&lt;br&gt;
proposed as a working framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Category A: Theoretically specifiable, not yet specified.&lt;/strong&gt; These are&lt;br&gt;
defects that a specification could have caught if the scenario had been&lt;br&gt;
written. Boundary conditions, error handling paths, and unexercised state&lt;br&gt;
machine transitions fall here. The gap is a process failure, not a&lt;br&gt;
theoretical limitation. This category shrinks as specification discipline&lt;br&gt;
matures. An AI review agent operating here without an external&lt;br&gt;
specification shares the same blind spots as the generator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Category B: Specifiable in principle, economically impractical.&lt;/strong&gt;&lt;br&gt;
Exhaustive combinatorial input spaces and full interaction matrices could&lt;br&gt;
theoretically be specified but at a cost that exceeds the value. This&lt;br&gt;
boundary is moving: property-based testing frameworks reduce the cost of&lt;br&gt;
exploring combinatorial spaces systematically. Category B defects are a&lt;br&gt;
legitimate target for AI review that brings genuine sampling diversity,&lt;br&gt;
provided the reviewer draws from a different prior than the generator.&lt;/p&gt;
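&lt;p&gt;A sketch of what systematic exploration of a combinatorial space looks like. The interval-merging function and its invariant are generic illustrations, not drawn from the experiment corpus; frameworks such as Hypothesis automate the input generation and shrink failing cases, where this stdlib version only shows the principle:&lt;/p&gt;

```python
import random

def merge_intervals(intervals):
    """Merge overlapping closed intervals given as (start, end) pairs."""
    merged = []
    for start, end in sorted(intervals):
        if not merged or start > merged[-1][1]:
            merged.append([start, end])
        else:  # overlaps or touches the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
    return merged

# Property: however the inputs fall, the output is sorted and pairwise
# disjoint. One property covers a space no scenario list could enumerate.
random.seed(0)
for _ in range(500):
    intervals = []
    for _ in range(random.randint(0, 10)):
        a, b = random.randint(0, 100), random.randint(0, 100)
        intervals.append((min(a, b), max(a, b)))
    merged = merge_intervals(intervals)
    for (s1, e1), (s2, e2) in zip(merged, merged[1:]):
        assert s2 > e1, (intervals, merged)
```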

&lt;p&gt;&lt;strong&gt;Category C: Inherently unspecifiable from pre-execution context.&lt;/strong&gt;&lt;br&gt;
Timing-dependent race conditions, behaviour under partial network failure,&lt;br&gt;
and performance degradation under specific hardware configurations depend&lt;br&gt;
on properties of the running system that do not exist at specification&lt;br&gt;
time. Recent work is pushing the boundary: Santos et al. (Software 2024)&lt;br&gt;
encoded performance efficiency requirements as BDD scenarios; Maaz et al.&lt;br&gt;
(arXiv:2510.09907, 2025) surfaced concurrency bugs previously requiring&lt;br&gt;
runtime observation. Category C is not fixed. It is the current frontier.&lt;br&gt;
For defects that remain here, the right tool is runtime verification&lt;br&gt;
infrastructure rather than pre-deployment review. This includes&lt;br&gt;
ML-based anomaly detection, APM tooling with learned behavioural&lt;br&gt;
baselines, and observability platforms. This is a class of techniques predating&lt;br&gt;
LLMs that applies machine learning to operational data rather than to&lt;br&gt;
source code. The implicit specification in these systems is derived from&lt;br&gt;
observed behaviour rather than stated intent, which makes them&lt;br&gt;
complementary to, not a replacement for, the pre-deployment pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Category D: Structural and architectural properties.&lt;/strong&gt; Code can satisfy&lt;br&gt;
every specification while introducing coupling, violating layer boundaries,&lt;br&gt;
or drifting from intended design. These are relational properties of the&lt;br&gt;
codebase as a whole, not properties of individual components. They resist&lt;br&gt;
behavioural specification because they concern structure rather than&lt;br&gt;
behaviour, but this resistance does not mean they are unspecifiable. Architectural rules,&lt;br&gt;
once articulated, are enforceable deterministically: dependency rules via&lt;br&gt;
tools such as ArchUnit or Dependency Cruiser, service boundary agreements&lt;br&gt;
via contract testing frameworks such as Pact. General delivery rigour,&lt;br&gt;
clear module boundaries, and explicit interface definitions reduce the&lt;br&gt;
surface area of Category D by making architectural intent concrete. The&lt;br&gt;
residual, after tooling and contract testing are in place, is the&lt;br&gt;
unarticulated architectural intent that has not yet been expressed as an&lt;br&gt;
enforceable rule: drift from a design decision that was never written down,&lt;br&gt;
coupling that violates a pattern that exists only in institutional memory,&lt;br&gt;
a half-completed migration to a new architectural pattern where some&lt;br&gt;
modules use the new approach and others still use the old one, dead&lt;br&gt;
abstractions introduced for a use case that no longer exists. No automated&lt;br&gt;
rule catches these because the correct answer depends on intent that was&lt;br&gt;
never codified. This is where an AI review agent with access to the full&lt;br&gt;
codebase and architectural context adds genuine non-circular signal,&lt;br&gt;
operating in a role analogous to an expert architect reviewing for&lt;br&gt;
structural coherence. The agent advises; the human decides whether to&lt;br&gt;
complete the migration, remove the dead pattern, or codify the observation&lt;br&gt;
as a new enforceable rule. This is the least empirically grounded category&lt;br&gt;
in the taxonomy; no controlled study has yet isolated this residual as a&lt;br&gt;
distinct defect class.&lt;/p&gt;
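&lt;p&gt;The "articulated rule becomes a deterministic gate" half of Category D can be shown in miniature. The layering rule and module names below are hypothetical; ArchUnit and Dependency Cruiser do this job with far more machinery, and this sketch only shows the principle of checking imports against a written-down boundary:&lt;/p&gt;

```python
import ast

# Hypothetical layer rule: domain code must not import from the web layer.
FORBIDDEN = {"myapp.domain": ("myapp.web",)}

def layering_violations(module_name, source):
    """Return imports in `source` that break the layer rule for module_name."""
    banned = FORBIDDEN.get(module_name, ())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            if any(node.module.startswith(b) for b in banned):
                violations.append(node.module)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if any(alias.name.startswith(b) for b in banned):
                    violations.append(alias.name)
    return violations

bad = "from myapp.web.handlers import render\n"
ok = "from myapp.domain.tariffs import Tariff\n"
assert layering_violations("myapp.domain", bad) == ["myapp.web.handlers"]
assert layering_violations("myapp.domain", ok) == []
```

&lt;p&gt;Everything such a check can express has, by definition, already left the residual. What remains for AI review is the intent that no one has yet written into the &lt;code&gt;FORBIDDEN&lt;/code&gt; table.&lt;/p&gt;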

&lt;p&gt;&lt;strong&gt;Category E: Specification defects.&lt;/strong&gt; The oracle problem result makes&lt;br&gt;
this category unavoidable. A specification can be internally consistent,&lt;br&gt;
precisely expressed, and correctly implemented, and still describe a&lt;br&gt;
system that does not do what users need. Requirements elicitation is not&lt;br&gt;
a solved problem. Domain experts disagree. Business rules change. No&lt;br&gt;
verification pipeline catches Category E defects because the pipeline&lt;br&gt;
verifies conformance to the specification. If the specification is wrong,&lt;br&gt;
the pipeline confirms the wrong thing. The right tool for Category E is&lt;br&gt;
user testing, observability of actual usage, and design thinking practices&lt;br&gt;
that surface unstated assumptions. These are human processes, not&lt;br&gt;
automated ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Implications for the Architecture
&lt;/h3&gt;

&lt;p&gt;The taxonomy implies a specific allocation of tools:&lt;/p&gt;

&lt;p&gt;Category A is the target for specification discipline and AI-assisted&lt;br&gt;
coverage analysis. Investment here reduces the work that review agents&lt;br&gt;
need to do.&lt;/p&gt;

&lt;p&gt;Category B is the target for property-based testing and diverse sampling&lt;br&gt;
strategies. AI review adds value here only if genuine diversity is&lt;br&gt;
achieved.&lt;/p&gt;

&lt;p&gt;Category C is the target for runtime verification infrastructure:&lt;br&gt;
ML-based anomaly detection, observability tooling, chaos engineering, and&lt;br&gt;
load testing. Neither pre-deployment specifications nor AI review agents&lt;br&gt;
are the right tool here.&lt;/p&gt;

&lt;p&gt;Category D is the target for architectural tooling and contract testing&lt;br&gt;
first: dependency enforcement, Pact-style contract verification, and&lt;br&gt;
explicit interface definitions. AI review is the complement for the&lt;br&gt;
unarticulated residual: drift from design intent that has not yet been&lt;br&gt;
codified as an enforceable rule.&lt;/p&gt;

&lt;p&gt;Category E is the reminder that the feedback loop from production to&lt;br&gt;
requirements is a human loop and cannot be automated away.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Combined Architecture and What Remains Open
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 The Architecture
&lt;/h3&gt;

&lt;p&gt;The three hypotheses together imply a specific architecture for&lt;br&gt;
AI-assisted software development.&lt;/p&gt;

&lt;p&gt;Specifications first. BDD scenarios, contract tests, and mutation testing&lt;br&gt;
harden the verification layer. This is the constraint transformation that&lt;br&gt;
moves the problem into the complicated domain and eliminates the correlated&lt;br&gt;
error problem for Category A defects.&lt;/p&gt;

&lt;p&gt;Deterministic verification pipeline second. The pipeline is the reviewer&lt;br&gt;
for behavioural correctness. Pass or fail, no opinions.&lt;/p&gt;

&lt;p&gt;Architectural tooling and AI review for Category D. Dependency&lt;br&gt;
enforcement tools and contract testing for articulated architectural&lt;br&gt;
rules. AI review for the unarticulated residual, in a role analogous&lt;br&gt;
to an expert architect advising on structural coherence.&lt;/p&gt;

&lt;p&gt;Runtime verification for Category C. ML-based anomaly detection,&lt;br&gt;
observability tooling, chaos engineering, load testing. Not part of the&lt;br&gt;
pre-deployment pipeline.&lt;/p&gt;

&lt;p&gt;User feedback loops for Category E. Requirements validation, user testing,&lt;br&gt;
and design thinking. Not part of the engineering pipeline at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 What Remains Open
&lt;/h3&gt;

&lt;p&gt;Three open questions are stated honestly here.&lt;/p&gt;

&lt;p&gt;The controlled demonstration has been strengthened but remains incomplete. Three&lt;br&gt;
contrived experiments provided directional evidence: classic boundary&lt;br&gt;
conditions were detected at 100% by AI review (refining the hypothesis&lt;br&gt;
toward domain-opaque defects), domain-convention violations showed a&lt;br&gt;
gradient from 0% to 100% depending on how well the convention is&lt;br&gt;
represented in training data, and a cross-family reviewer panel confirmed&lt;br&gt;
the gradient extends across model families with detection varying by both&lt;br&gt;
domain opacity and training distribution. The 0/20 result on ICD-10-CM&lt;br&gt;
coding guidelines (four models, five runs each, zero detections) is the&lt;br&gt;
strongest single finding. But all three experiments use a planted bug&lt;br&gt;
corpus, the BDD scenarios are optimally targeted, and the cross-family&lt;br&gt;
panel includes two Anthropic-family models. A controlled study using a&lt;br&gt;
natural defect sample, fully independent model families, and a&lt;br&gt;
specification-grounded condition alongside an ungrounded condition would&lt;br&gt;
strengthen or revise the claim precisely. The experiments are available in the repository cited in Section 2.5&lt;br&gt;
for replication and extension.&lt;/p&gt;

&lt;p&gt;The Cynefin mapping is unvalidated by the Cynefin community. The&lt;br&gt;
constraint transformation framing is consistent with Snowden's own&lt;br&gt;
vocabulary but has not been reviewed by accredited practitioners. Snowden&lt;br&gt;
is actively working on AI and Cynefin and may publish a position that&lt;br&gt;
validates, challenges, or reframes this argument.&lt;/p&gt;

&lt;p&gt;The taxonomy is novel and untested. No prior work organises defects by&lt;br&gt;
specifiability in the five-category structure proposed here. The taxonomy&lt;br&gt;
may be incomplete, category boundaries may be blurrier than described,&lt;br&gt;
and empirical work testing the taxonomy against real defect populations&lt;br&gt;
would strengthen or revise it. The Category A and Category D boundary is&lt;br&gt;
the most contested: if architectural properties become specifiable as&lt;br&gt;
tooling matures, parts of Category D would reclassify into Category A,&lt;br&gt;
which would reduce the permanent residual for AI review.&lt;/p&gt;

&lt;p&gt;These are open questions, not fatal weaknesses. The directional evidence is consistent with the hypothesis. Stating the gaps is not a concession. It is&lt;br&gt;
what makes the argument trustworthy.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;The argument developed here is structural, not empirical. Executable&lt;br&gt;
specifications break the circularity in AI-assisted pipelines not because&lt;br&gt;
they catch more bugs than AI review, but because they introduce a&lt;br&gt;
reference that is independent of the model's training distribution. That&lt;br&gt;
independence is what makes verification non-circular. Without it, review&lt;br&gt;
checks code against itself.&lt;/p&gt;

&lt;p&gt;Three experiments across four model families and three training&lt;br&gt;
distributions provide directional evidence. Classic boundary conditions&lt;br&gt;
are reliably caught by all models. Domain-convention violations produce a&lt;br&gt;
gradient from 0% to 100% detection depending on how well the convention&lt;br&gt;
is represented in training data. The ICD-10-CM external cause code rule&lt;br&gt;
was missed by every model tested, while BDD caught it deterministically.&lt;br&gt;
The ICAO fuel reserve case showed something worse than a miss: models&lt;br&gt;
confidently asserting the wrong convention as correct. These results are&lt;br&gt;
small in scale and use planted bugs, but the pattern is consistent with&lt;br&gt;
the theoretical prediction and with the independent empirical findings&lt;br&gt;
cited in Section 2.&lt;/p&gt;

&lt;p&gt;The practical implication is an architecture, not a prohibition.&lt;br&gt;
Specifications first, for the domain logic that must be right.&lt;br&gt;
Deterministic pipeline second, as the reviewer for behavioural&lt;br&gt;
correctness. AI review third, scoped to the structural and architectural&lt;br&gt;
residual where it adds genuine non-circular signal. This is not a claim&lt;br&gt;
that AI review is valueless. It is a claim about where it belongs in the&lt;br&gt;
pipeline and what it needs underneath it to be trustworthy.&lt;/p&gt;

&lt;p&gt;The experiments, corpus, and specifications are publicly available for&lt;br&gt;
replication and extension. The argument improves by being tested.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Barr, E.T., Harman, M., McMinn, P., Shahbaz, M., and Yoo, S. (2015).&lt;br&gt;
"The Oracle Problem in Software Testing: A Survey." IEEE Transactions on&lt;br&gt;
Software Engineering 41(5), 507-525. doi:10.1109/TSE.2014.2372785&lt;/p&gt;

&lt;p&gt;Dietterich, T.G. (2000). "Ensemble Methods in Machine Learning."&lt;br&gt;
Proceedings of the First International Workshop on Multiple Classifier&lt;br&gt;
Systems. Lecture Notes in Computer Science, vol 1857. Springer.&lt;/p&gt;

&lt;p&gt;Baolin, J. and Harvey, N. (2026). "Balancing AI tensions: Moving from&lt;br&gt;
AI adoption to effective SDLC use." &lt;a href="https://dora.dev/insights/balancing-ai-tensions/" rel="noopener noreferrer"&gt;https://dora.dev/insights/balancing-ai-tensions/&lt;/a&gt;,&lt;br&gt;
March 2026. Based on 1,110 open-ended survey responses from Google&lt;br&gt;
software engineers.&lt;/p&gt;

&lt;p&gt;Fonseca, P.L., Lima, B., and Faria, J.P. (2025). "Streamlining Acceptance&lt;br&gt;
Test Generation for Mobile Applications Through Large Language Models:&lt;br&gt;
An Industrial Case Study." arXiv:2510.18861.&lt;/p&gt;

&lt;p&gt;Hassani, S., Sabetzadeh, M., and Amyot, D. (2026). "From Law to Gherkin:&lt;br&gt;
A Human-Centred Quasi-Experiment on the Quality of LLM-Generated&lt;br&gt;
Behavioural Specifications from Food-Safety Regulations."&lt;br&gt;
arXiv:2508.20744v2 (updated March 2026).&lt;/p&gt;

&lt;p&gt;Hansen, L.K. and Salamon, P. (1990). "Neural Network Ensembles." IEEE&lt;br&gt;
Transactions on Pattern Analysis and Machine Intelligence, 12(10),&lt;br&gt;
993-1001. doi:10.1109/34.58871&lt;/p&gt;

&lt;p&gt;Maaz, M., DeVoe, L., Hatfield-Dodds, Z., and Carlini, N. (2025).&lt;br&gt;
"Agentic Property-Based Testing: Finding Bugs Across the Python&lt;br&gt;
Ecosystem." arXiv:2510.09907, October 2025.&lt;/p&gt;

&lt;p&gt;Mi, Z., Zheng, R., Zhong, H., Sun, Y., Kneeland, S., Moitra, S., Kutzer,&lt;br&gt;
K., Xu, Z., and Huang, S. (2025). "CoopetitiveV: Leveraging LLM-powered&lt;br&gt;
Coopetitive Multi-Agent Prompting for High-quality Verilog Generation."&lt;br&gt;
arXiv:2412.11014, v2 June 2025.&lt;/p&gt;

&lt;p&gt;Pappu, A., El, B., Cao, H., di Nolfo, C., Sun, Y., Cao, M., and Zou, J.&lt;br&gt;
(2026). "Multi-Agent Teams Hold Experts Back." arXiv:2602.01011, February&lt;br&gt;
2026.&lt;/p&gt;

&lt;p&gt;Santos, S., Pimentel, T., Rocha, F., and Soares, M.S. (2024). "Using&lt;br&gt;
Behaviour-Driven Development (BDD) for Non-Functional Requirements."&lt;br&gt;
Software, 3(3), 271-283. doi:10.3390/software3030014&lt;/p&gt;

&lt;p&gt;Snowden, D. and Boone, M. (2007). "A Leader's Framework for Decision&lt;br&gt;
Making." Harvard Business Review, November 2007. Note: the enabling&lt;br&gt;
and governing constraints terminology is not in this paper. It is&lt;br&gt;
from Snowden's Cynefin wiki.&lt;/p&gt;

&lt;p&gt;Snowden, D. "Constraints." cynefin.io/wiki/Constraints.&lt;br&gt;
The Cynefin Company. Accessed March 2026. Confirms: Complicated domain:&lt;br&gt;
governing constraints. Complex domain: enabling constraints.&lt;/p&gt;

&lt;p&gt;Vallecillos-Ruiz, F., Hort, M., and Moonen, L. (2025). "Wisdom and&lt;br&gt;
Delusion of LLM Ensembles for Code Generation and Repair."&lt;br&gt;
arXiv:2510.21513, October 2025.&lt;/p&gt;

&lt;p&gt;Wang, K., Mao, B., Jia, S., Ding, Y., Han, D., Ma, T., and Cao, B.&lt;br&gt;
(2025). "SGCR: A Specification-Grounded Framework for Trustworthy LLM&lt;br&gt;
Code Review." arXiv:2512.17540, December 2025, v2 January 2026.&lt;br&gt;
Note: the 90.9% figure refers to developer adoption rate, not defect&lt;br&gt;
detection rate.&lt;/p&gt;

&lt;p&gt;Zadenoori, M.A., Dabrowski, J., Alhoshan, W., Zhao, L., and Ferrari, A.&lt;br&gt;
(2025). "Large Language Models (LLMs) for Requirements Engineering (RE):&lt;br&gt;
A Systematic Literature Review." arXiv:2509.11446.&lt;/p&gt;

&lt;p&gt;Jin, H. and Chen, H. (2025). "Uncovering Systematic Failures of LLMs in&lt;br&gt;
Verifying Code Against Natural Language Specifications."&lt;br&gt;
arXiv:2508.12358, August 2025. Presented at ASE 2025.&lt;/p&gt;

&lt;p&gt;Jin, H. and Chen, H. (2026). "Are LLMs Reliable Code Reviewers? Systematic&lt;br&gt;
Overcorrection in Requirement Conformance Judgement."&lt;br&gt;
arXiv:2603.00539, March 2026.&lt;/p&gt;

&lt;p&gt;Ma, Z., Zhang, T., Cao, M., Liu, J., Zhang, W., Luo, M., Zhang, S., and Chen, K.&lt;br&gt;
(2025). "Rethinking Verification for LLM Code Generation: From Generation to&lt;br&gt;
Testing." arXiv:2507.06920v2, July 2025.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Series: The Specification as Quality Gate&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1: The Echo Chamber in Your Pipeline&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2: From Complex to Complicated&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/what-specifications-cannot-catch-residual-defect-taxonomy"&gt;Part 3: What Specifications Cannot Catch&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Part 4: The Specification as Quality Gate (this post)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>specifications</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>What Specifications Cannot Catch: A Proposed Taxonomy of the Residual</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:24:03 +0000</pubDate>
      <link>https://forem.com/nuphirho/what-specifications-cannot-catch-a-proposed-taxonomy-of-the-residual-24am</link>
      <guid>https://forem.com/nuphirho/what-specifications-cannot-catch-a-proposed-taxonomy-of-the-residual-24am</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of a four-part series, "The Specification as Quality Gate." &lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1&lt;/a&gt; developed the correlated error hypothesis. &lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2&lt;/a&gt; grounded the argument in complexity science. This post maps what executable specifications cannot catch. Part 4 will follow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The argument for specification-driven development is strong. BDD scenarios&lt;br&gt;
that pass or fail are not probability estimates. Mutation testing that&lt;br&gt;
finds no survivors is a verified claim about the coverage of the&lt;br&gt;
verification layer itself. Contract probes that confirm boundary behaviour&lt;br&gt;
are deterministic gates, not heuristics. The case for putting&lt;br&gt;
specifications before code, and treating the pipeline as the reviewer,&lt;br&gt;
is well-founded.&lt;/p&gt;

&lt;p&gt;"Well-founded" is not the same as "complete." Any argument that executable&lt;br&gt;
specifications make AI code review redundant needs to account honestly for&lt;br&gt;
what specifications cannot catch. That accounting is not a concession. It&lt;br&gt;
is what makes the rest of the argument credible.&lt;/p&gt;

&lt;p&gt;What follows is a proposed taxonomy of the defect classes that lie outside&lt;br&gt;
the reach of executable specifications. To the best of our knowledge, no&lt;br&gt;
prior taxonomy in the testing or formal methods literature organises&lt;br&gt;
defects by specifiability rather than by severity or recovery type. The&lt;br&gt;
reason to organise by specifiability is practical: severity tells you how&lt;br&gt;
bad a defect is. Specifiability tells you which tool to reach for. Those&lt;br&gt;
are different questions, and conflating them is why teams end up applying&lt;br&gt;
AI review to problems where it is circular and missing the problems where&lt;br&gt;
it would genuinely help. The taxonomy should be treated as a working&lt;br&gt;
framework and challenged accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Foundation: The Oracle Problem
&lt;/h2&gt;

&lt;p&gt;Before the categories, a theoretical grounding that gives the taxonomy&lt;br&gt;
its shape.&lt;/p&gt;

&lt;p&gt;The oracle problem in software testing asks: how do you know whether a&lt;br&gt;
test outcome is correct? For a test to be meaningful, you need an oracle,&lt;br&gt;
a ground truth against which to judge the system's output. In most&lt;br&gt;
testing, the oracle is implicit: the developer's understanding of what&lt;br&gt;
the system should do. In specification-driven development, the oracle is&lt;br&gt;
the specification.&lt;/p&gt;

&lt;p&gt;Barr, Harman, McMinn, Shahbaz, and Yoo, in their 2015 survey in IEEE&lt;br&gt;
Transactions on Software Engineering, established that complete oracles&lt;br&gt;
are theoretically impossible for most real-world software. Even a formally&lt;br&gt;
correct specification, internally consistent and precisely expressed,&lt;br&gt;
cannot fully specify the correct output for all possible inputs in all&lt;br&gt;
possible contexts.&lt;/p&gt;

&lt;p&gt;This is not a practical limitation that better process will eventually&lt;br&gt;
overcome. It is a theoretical result. There exists a class of defects for&lt;br&gt;
which no specification, however precise, provides an oracle. That class&lt;br&gt;
defines the permanent boundary of what specification-driven verification&lt;br&gt;
can achieve.&lt;/p&gt;

&lt;p&gt;Everything in the taxonomy that follows should be read against this&lt;br&gt;
foundation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category A: Theoretically Specifiable, Not Yet Specified
&lt;/h2&gt;

&lt;p&gt;The largest and most important category. These are defects that a&lt;br&gt;
specification could have caught if the scenario had been written. The&lt;br&gt;
gap is a process failure, not a theoretical limitation.&lt;/p&gt;

&lt;p&gt;The clearest examples are domain-convention violations: code that is&lt;br&gt;
internally consistent and idiomatic, but wrong relative to a rule that&lt;br&gt;
is not inferrable from the code alone. An interest rate interpolation&lt;br&gt;
function that uses linear interpolation when the market convention for&lt;br&gt;
this curve type is log-linear looks correct to any reviewer without&lt;br&gt;
the specification. The code is clean. The logic is sound. Only the BDD&lt;br&gt;
scenario that states the correct output for a specific tenor gap makes&lt;br&gt;
the violation visible.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. A small experiment run alongside the paper&lt;br&gt;
this post is part of tested exactly this case. Claude reviewed the&lt;br&gt;
function five times without a specification and missed the bug in every&lt;br&gt;
run, flagging a different concern instead. The BDD scenario caught it&lt;br&gt;
immediately. The full experiment is at&lt;br&gt;
&lt;a href="https://github.com/czietsman/nuphirho.dev/tree/main/experiments/correlated-error-v2" rel="noopener noreferrer"&gt;correlated-error-v2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Classic boundary conditions (off-by-one errors, loop termination&lt;br&gt;
edge cases, leap year rules) also belong here in principle. But&lt;br&gt;
frontier models catch those reliably from pattern recognition alone,&lt;br&gt;
because such cases are dense in training data. The specification still matters&lt;br&gt;
for these: it prevents regressions when a refactoring agent removes a&lt;br&gt;
guard clause it considers noise. But the correlated error problem bites&lt;br&gt;
hardest at the domain-specific end of Category A, where neither the&lt;br&gt;
generator nor the reviewer has the convention in its prior.&lt;/p&gt;

&lt;p&gt;Other examples: error handling paths that were not enumerated in&lt;br&gt;
requirements, input validation for edge cases the specification author&lt;br&gt;
did not consider, state machine transitions for states that were not&lt;br&gt;
mapped.&lt;/p&gt;

&lt;p&gt;Category A is the primary target for the specification-first argument.&lt;br&gt;
AI-assisted specification drafting, property-based exploration, and&lt;br&gt;
mutation testing can reduce it systematically over time. The residual&lt;br&gt;
in this category is not permanent. It shrinks as specification discipline&lt;br&gt;
matures.&lt;/p&gt;

&lt;p&gt;An AI review agent operating in Category A without an external&lt;br&gt;
specification shares the same blind spots as the generator: both&lt;br&gt;
drew from the same prior, and whatever the prior underweights, both&lt;br&gt;
miss.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category B: Specifiable in Principle, Economically Impractical
&lt;/h2&gt;

&lt;p&gt;Some defect classes are theoretically specifiable but require verification&lt;br&gt;
effort that exceeds the value of specifying them. A function that takes&lt;br&gt;
three independent boolean parameters has eight input combinations. Ten&lt;br&gt;
parameters gives 1,024. Twenty gives over a million. Every combination's&lt;br&gt;
boundary conditions could be specified, but nobody does this, for good&lt;br&gt;
reason.&lt;/p&gt;

&lt;p&gt;This boundary is moving. Property-based testing frameworks generate inputs&lt;br&gt;
systematically and explore combinatorial spaces that manual scenario&lt;br&gt;
authoring cannot reach. The economics of Category B defects are shifting&lt;br&gt;
as these tools mature and AI-assisted property-based exploration becomes&lt;br&gt;
practical.&lt;/p&gt;

&lt;p&gt;Category B is where AI review can provide legitimate signal, not by&lt;br&gt;
finding what the specification missed, but by sampling the input space&lt;br&gt;
more broadly than the specified scenarios cover. This is heuristic, not&lt;br&gt;
deterministic, but it is genuine signal rather than circular reasoning,&lt;br&gt;
provided the reviewer draws from a prior genuinely different from the&lt;br&gt;
generator's.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category C: Inherently Unspecifiable From Pre-Execution Context
&lt;/h2&gt;

&lt;p&gt;Some defect classes depend on properties of the running system, the&lt;br&gt;
environment, or the interaction between components that only manifest at&lt;br&gt;
runtime. Timing-dependent race conditions, behaviour under partial network&lt;br&gt;
failure, memory behaviour under sustained load, and performance degradation&lt;br&gt;
under specific hardware configurations cannot be expressed as a BDD&lt;br&gt;
scenario that runs before deployment. The conditions that trigger them&lt;br&gt;
do not exist at specification time.&lt;/p&gt;

&lt;p&gt;These are not unknown unknowns in the Category A sense. They are knowable&lt;br&gt;
properties, observable and measurable after the fact. The issue is&lt;br&gt;
temporal, not conceptual.&lt;/p&gt;

&lt;p&gt;The boundary of Category C is moving. Santos, Pimentel, Rocha, and&lt;br&gt;
Soares (Software, 2024) explicitly encoded ISO 25010 performance&lt;br&gt;
efficiency requirements as BDD scenarios. Maaz et al. (arXiv:2510.09907,&lt;br&gt;
2025) used LLM-assisted property-based testing to surface concurrency and&lt;br&gt;
numerical precision bugs that previously required runtime observation to&lt;br&gt;
detect. Some defects classified as inherently unspecifiable five years&lt;br&gt;
ago are specifiable today with the right tooling.&lt;/p&gt;

&lt;p&gt;For the defects that remain in Category C, the right tool is runtime&lt;br&gt;
verification infrastructure. This includes ML-based anomaly detection,&lt;br&gt;
APM tooling with learned behavioural baselines, observability platforms,&lt;br&gt;
chaos engineering, and load testing. This is a mature field predating&lt;br&gt;
LLMs that applies machine learning to operational data rather than to&lt;br&gt;
source code. Neither a BDD pipeline nor an AI code review agent&lt;br&gt;
reaches these defects. They require the system to be running.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category D: Structural and Architectural Properties
&lt;/h2&gt;

&lt;p&gt;Code can pass every BDD scenario, survive every mutation, and satisfy&lt;br&gt;
every contract probe, while simultaneously introducing coupling between&lt;br&gt;
modules that will compound over time, violating layer boundaries the&lt;br&gt;
architecture was designed to enforce, or drifting from the intended design&lt;br&gt;
in ways that only become visible months later.&lt;/p&gt;

&lt;p&gt;These are relational properties of the codebase as a whole, not properties&lt;br&gt;
of individual components in isolation. They resist behavioural specification&lt;br&gt;
because they concern how the code is structured relative to the intended&lt;br&gt;
design, not what the code does.&lt;/p&gt;

&lt;p&gt;But resist does not mean unspecifiable. Architectural rules are specifiable&lt;br&gt;
once a human articulates them. A rule such as "the web layer may not import&lt;br&gt;
from the data layer" is a constraint that can be expressed, enforced, and&lt;br&gt;
tested deterministically. Tools like ArchUnit, Dependency Cruiser, and&lt;br&gt;
NDepend enforce dependency rules as automated checks. Contract testing&lt;br&gt;
frameworks like Pact verify that service boundaries behave as agreed&lt;br&gt;
between producers and consumers. These are deterministic gates, not&lt;br&gt;
heuristic opinions, and they belong in the pipeline alongside BDD&lt;br&gt;
scenarios. General delivery rigour (clear module boundaries, explicit&lt;br&gt;
interface definitions, disciplined use of dependency injection) reduces&lt;br&gt;
the surface area of Category D by making architectural intent concrete&lt;br&gt;
rather than implicit.&lt;/p&gt;
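
&lt;p&gt;A rule like that is small enough to sketch. The layer and module names below are hypothetical, and real tools such as ArchUnit or Dependency Cruiser do far more, but the point survives the simplification: the check is a deterministic pass or fail, not a heuristic opinion.&lt;/p&gt;

```python
import ast

# Hypothetical layering: (importing layer, imported layer) pairs that are banned.
FORBIDDEN = {("web", "data")}

def top_level(module_name):
    return module_name.split(".")[0]

def boundary_violations(layer, source):
    """Return the modules that `source` (code living in `layer`) may not import."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        violations += [t for t in targets if (layer, top_level(t)) in FORBIDDEN]
    return violations

# The web layer may not import from the data layer.
assert boundary_violations("web", "from data.models import User") == ["data.models"]
assert boundary_violations("web", "import services.auth") == []
```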

&lt;p&gt;The residual in Category D, after tooling and contract testing are in&lt;br&gt;
place, is the unarticulated architectural intent that has not yet been&lt;br&gt;
expressed as a rule. Drift from a design decision that was never written&lt;br&gt;
down. Coupling that violates a pattern that exists only in someone's head.&lt;br&gt;
A half-completed migration to a new architectural pattern, where some&lt;br&gt;
modules use the new approach and others still use the old one: no&lt;br&gt;
automated rule catches that because the correct answer depends on whether&lt;br&gt;
the team intended to complete the migration or abandoned it. Dead patterns&lt;br&gt;
accumulate the same way: abstractions introduced for a use case that no&lt;br&gt;
longer exists, base classes with a single subclass, interfaces designed&lt;br&gt;
for extensibility that never came. This is structural noise that becomes&lt;br&gt;
visible only to something reasoning about the whole codebase over time.&lt;/p&gt;

&lt;p&gt;This is where an AI review agent with access to the full codebase and&lt;br&gt;
architectural context adds genuine, non-circular signal. It can identify&lt;br&gt;
incomplete migrations, flag dead abstractions, and surface inconsistencies&lt;br&gt;
that no automated rule yet captures. The role is analogous to an expert&lt;br&gt;
architect reviewing for structural coherence: the agent advises, the human&lt;br&gt;
decides whether to complete the migration, remove the dead pattern, or&lt;br&gt;
codify the observation as a new enforceable rule. That last step closes&lt;br&gt;
the loop back into tooling, which is where the category belongs once the&lt;br&gt;
intent has been made explicit.&lt;/p&gt;

&lt;p&gt;This is the least empirically grounded category in the taxonomy: no&lt;br&gt;
controlled study has yet measured what AI architectural review catches&lt;br&gt;
beyond what specification pipelines and architectural tooling already&lt;br&gt;
catch. It is the most provisional claim here, and should be&lt;br&gt;
treated accordingly.&lt;/p&gt;

&lt;p&gt;Category D is the permanent legitimate home for AI review in a&lt;br&gt;
specification-driven pipeline, not as a substitute for architectural&lt;br&gt;
tooling and contract testing, but as the complement that operates where&lt;br&gt;
rules have not yet been written.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category E: Specification Defects
&lt;/h2&gt;

&lt;p&gt;The oracle problem survey makes this category unavoidable. A specification&lt;br&gt;
that is internally consistent, precisely expressed, and correctly&lt;br&gt;
implemented can still describe a system that does not do what users need.&lt;/p&gt;

&lt;p&gt;Requirements elicitation is not a solved problem. Domain experts disagree.&lt;br&gt;
Users articulate needs in terms of their current mental model, which may&lt;br&gt;
not reflect what they actually want. Business rules change after the&lt;br&gt;
specification was written. Edge cases that matter in production were not&lt;br&gt;
considered during discovery.&lt;/p&gt;

&lt;p&gt;No verification pipeline catches Category E defects, because the pipeline&lt;br&gt;
verifies conformance to the specification. If the specification is wrong,&lt;br&gt;
the pipeline confirms the wrong thing. An AI review agent without external&lt;br&gt;
user feedback cannot catch Category E either, because the agent has no way&lt;br&gt;
to know the specification does not match user intent.&lt;/p&gt;

&lt;p&gt;The right tool for Category E is not in the engineering pipeline at all.&lt;br&gt;
It is user testing, feedback loops, observability of actual usage, and&lt;br&gt;
the design thinking practices that surface unstated assumptions during&lt;br&gt;
requirements elicitation. These are human processes. AI can assist with&lt;br&gt;
them, but not replace them.&lt;/p&gt;

&lt;p&gt;Category E is not a process failure. It is a theoretical limitation.&lt;br&gt;
Even a perfect specification process will leave some Category E defects,&lt;br&gt;
because the oracle for user intent is inherently incomplete.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Taxonomy Implies
&lt;/h2&gt;

&lt;p&gt;Each category points to a different tool.&lt;/p&gt;

&lt;p&gt;Category A is the target for specification discipline. It shrinks as&lt;br&gt;
teams invest in BDD coverage, mutation testing, and AI-assisted&lt;br&gt;
specification drafting. An AI review agent here is a temporary patch for&lt;br&gt;
process immaturity, not a permanent fixture.&lt;/p&gt;

&lt;p&gt;Category B is the target for property-based testing and combinatorial&lt;br&gt;
exploration. AI adds value here only if it brings genuine sampling&lt;br&gt;
diversity, meaning a prior different from the generator's.&lt;/p&gt;

&lt;p&gt;Category C is the target for runtime verification infrastructure:&lt;br&gt;
ML-based anomaly detection, observability tooling, chaos engineering,&lt;br&gt;
and load testing. This is a mature field with its own tooling. Neither&lt;br&gt;
specifications nor AI code reviewers substitute for it.&lt;/p&gt;

&lt;p&gt;Category D is the target for architectural tooling and contract testing&lt;br&gt;
first: ArchUnit, Dependency Cruiser, Pact, and similar deterministic&lt;br&gt;
enforcement mechanisms for rules that have been articulated. AI review&lt;br&gt;
is the complement for the unarticulated residual: drift from design&lt;br&gt;
intent that has not yet been codified as an enforceable rule. The agent&lt;br&gt;
advises, the human decides, and the observation ideally becomes a new&lt;br&gt;
rule that closes back into tooling.&lt;/p&gt;

&lt;p&gt;Category E is the reminder that the specification is never complete. It&lt;br&gt;
gets less incomplete over time. The feedback loop from production back&lt;br&gt;
to requirements is a human loop, and cannot be automated away.&lt;/p&gt;




&lt;h2&gt;
  
  
  An Invitation
&lt;/h2&gt;

&lt;p&gt;This taxonomy is proposed, not established. The categories emerged from&lt;br&gt;
reasoning through what executable specifications can and cannot express,&lt;br&gt;
grounded in the oracle problem literature and tested against empirical&lt;br&gt;
findings from the 2025-2026 AI code review research.&lt;/p&gt;

&lt;p&gt;If the taxonomy is wrong, the corrections are welcome and will improve the&lt;br&gt;
argument. If it is incomplete, the missing categories are interesting. The&lt;br&gt;
most likely challenge is that the boundary between Categories A and D is&lt;br&gt;
blurrier than described: architectural properties may become specifiable&lt;br&gt;
as tooling matures, which would reclassify parts of Category D as&lt;br&gt;
Category A over time. That boundary question is worth watching.&lt;/p&gt;

&lt;p&gt;The goal is not to defend the taxonomy. It is to have a precise enough&lt;br&gt;
framework to stop treating AI code review as a monolithic practice and&lt;br&gt;
start reasoning about which kinds of defects each tool in the pipeline&lt;br&gt;
is actually positioned to find.&lt;/p&gt;

&lt;p&gt;If you work in software testing, formal methods, or AI-assisted&lt;br&gt;
development and the taxonomy is wrong in ways not anticipated here,&lt;br&gt;
the comments are the right place to say so. A taxonomy that provokes&lt;br&gt;
rigorous challenge is more useful than one that goes unchallenged.&lt;br&gt;
Corrections will be published with attribution.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post draws on: Barr, E.T., Harman, M., McMinn, P., Shahbaz, M.,&lt;br&gt;
and Yoo, S. (2015), "The Oracle Problem in Software Testing: A Survey,"&lt;br&gt;
IEEE Transactions on Software Engineering, 41(5), 507-525,&lt;br&gt;
doi:10.1109/TSE.2014.2372785; Santos, S., Pimentel, T., Rocha, F., and&lt;br&gt;
Soares, M.S. (2024), "Using Behaviour-Driven Development (BDD) for&lt;br&gt;
Non-Functional Requirements," Software, 3(3), 271-283,&lt;br&gt;
doi:10.3390/software3030014; and Maaz, M., DeVoe, L., Hatfield-Dodds, Z., and Carlini, N. (2025), "Agentic&lt;br&gt;
Property-Based Testing: Finding Bugs Across the Python Ecosystem,"&lt;br&gt;
arXiv:2510.09907.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Series: The Specification as Quality Gate&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1: The Echo Chamber in Your Pipeline&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2: From Complex to Complicated&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Part 3: What Specifications Cannot Catch (this post)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/specification-as-quality-gate-three-hypotheses"&gt;Part 4: The Specification as Quality Gate&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>specifications</category>
      <category>testing</category>
      <category>ai</category>
      <category>bdd</category>
    </item>
    <item>
      <title>From Complex to Complicated: What Executable Specifications Actually Do</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sun, 22 Mar 2026 21:32:24 +0000</pubDate>
      <link>https://forem.com/nuphirho/from-complex-to-complicated-what-executable-specifications-actually-do-73k</link>
      <guid>https://forem.com/nuphirho/from-complex-to-complicated-what-executable-specifications-actually-do-73k</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of a four-part series, "The Specification as Quality Gate." &lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1&lt;/a&gt; developed the correlated error hypothesis. This post grounds the specification-first argument in complexity science. Parts 3 and 4 will follow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ask an engineering team why they write tests, and most will say: to catch&lt;br&gt;
bugs. Ask why they write specifications, and the answers are less coherent.&lt;br&gt;
Documentation. Compliance. Handover. The process said so.&lt;/p&gt;

&lt;p&gt;None of those answers explain what a specification actually does to the&lt;br&gt;
nature of the problem being solved. Dave Snowden's Cynefin framework offers&lt;br&gt;
the clearest account. The argument here is that executable specifications&lt;br&gt;
do not simply improve quality; they change the class of problem you are&lt;br&gt;
working on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Kinds of Problem
&lt;/h2&gt;

&lt;p&gt;Cynefin distinguishes problem domains by the relationship between cause&lt;br&gt;
and effect. Two are relevant here.&lt;/p&gt;

&lt;p&gt;In the complicated domain, cause and effect are knowable through analysis.&lt;br&gt;
A qualified expert can study the system, apply the right methods, and&lt;br&gt;
arrive at a good answer. The problem is hard but it is tractable. Engineers&lt;br&gt;
building against a well-specified API are working in the complicated domain.&lt;br&gt;
The system's behaviour is knowable if you do the analysis.&lt;/p&gt;

&lt;p&gt;In the complex domain, cause and effect are only knowable in hindsight.&lt;br&gt;
The system responds to interventions in ways you cannot fully predict&lt;br&gt;
before they happen. You probe, observe what emerges, and respond to what&lt;br&gt;
you find. Raising a child is the canonical example. But so is asking an&lt;br&gt;
AI agent to implement a feature from a two-sentence description. The&lt;br&gt;
output is emergent. The only way to know if it is right is to observe&lt;br&gt;
what it does.&lt;/p&gt;

&lt;p&gt;This is not a question of difficulty. It is a question of ontological&lt;br&gt;
character. Complicated problems are hard. Complex problems are of a&lt;br&gt;
different kind.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Separates the Domains
&lt;/h2&gt;

&lt;p&gt;Snowden distinguishes the domains not just by cause-and-effect&lt;br&gt;
relationships but by the constraints that govern system behaviour.&lt;/p&gt;

&lt;p&gt;In the complicated domain, governing constraints apply. These are the&lt;br&gt;
rules, standards, and boundaries that define acceptable behaviour. They&lt;br&gt;
do not specify every action, but they bound the solution space tightly&lt;br&gt;
enough that analysis can identify a good approach.&lt;/p&gt;

&lt;p&gt;In the complex domain, enabling constraints apply. These are looser.&lt;br&gt;
They allow a system to function and create conditions for patterns to&lt;br&gt;
emerge, but the behaviour of the system is not derivable from the&lt;br&gt;
constraints alone. You must observe it.&lt;/p&gt;

&lt;p&gt;An AI agent operating from a vague natural language prompt operates under&lt;br&gt;
enabling constraints at best. The prompt allows the agent to function and&lt;br&gt;
creates conditions for code to emerge, but edge cases, boundary conditions,&lt;br&gt;
and architectural choices are resolved by the model's priors, not by the&lt;br&gt;
stated requirements.&lt;/p&gt;

&lt;p&gt;An AI agent operating from a precise executable specification is in a&lt;br&gt;
different situation. A BDD scenario makes a specific causal claim: given&lt;br&gt;
this precondition, when this action occurs, then this outcome is required.&lt;br&gt;
That claim is verifiable before hindsight. The agent cannot legally produce&lt;br&gt;
code that fails the scenario. When the pipeline runs, the question "does&lt;br&gt;
the system do what it should?" has a deterministic answer. Cause and effect&lt;br&gt;
are knowable through analysis of the specification itself, which is&lt;br&gt;
exactly what defines the complicated domain.&lt;/p&gt;

&lt;p&gt;Writing executable specifications converts enabling constraints into&lt;br&gt;
governing constraints. The problem does not change. The constraint type&lt;br&gt;
does. And with it, the domain.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Note on Terminology
&lt;/h2&gt;

&lt;p&gt;An earlier draft of this argument used "exaptation" for this transition.&lt;br&gt;
That was wrong.&lt;/p&gt;

&lt;p&gt;Exaptation in Snowden's framework describes radical repurposing: taking&lt;br&gt;
something designed for one function and finding it serves a different&lt;br&gt;
function in a different context. It is an innovation mechanism operating&lt;br&gt;
within the complex domain, not a mechanism for moving between domains.&lt;br&gt;
The word comes from evolutionary biology, where Gould and Vrba used it&lt;br&gt;
in 1982 to describe traits coopted for functions different from the ones&lt;br&gt;
they evolved for.&lt;/p&gt;

&lt;p&gt;Constraint transformation is the accurate description. The act of writing&lt;br&gt;
an executable specification converts the type of constraint governing the&lt;br&gt;
system's behaviour, which changes the nature of the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What AI Changes About the Economics
&lt;/h2&gt;

&lt;p&gt;For most of software engineering history, writing precise executable&lt;br&gt;
specifications cost more relative to writing the code they described.&lt;br&gt;
BDD scenarios require domain knowledge, collaboration, and iteration.&lt;br&gt;
Mutation testing requires tooling investment. Contract tests require&lt;br&gt;
discipline about API boundaries. Many teams concluded the benefit did not&lt;br&gt;
justify the cost.&lt;/p&gt;

&lt;p&gt;Two things have shifted this calculation.&lt;/p&gt;

&lt;p&gt;The DORA 2026 report, based on 1,110 Google engineer responses, found&lt;br&gt;
that higher AI adoption correlates with higher throughput and higher&lt;br&gt;
instability simultaneously. Time saved generating code is re-spent&lt;br&gt;
auditing it. The bottleneck has moved from writing code to knowing what&lt;br&gt;
to write and verifying that what was written is correct. Specifications&lt;br&gt;
are no longer overhead relative to implementation. They are the scarce&lt;br&gt;
resource.&lt;/p&gt;

&lt;p&gt;AI also reduces the mechanical cost of expressing intent in executable&lt;br&gt;
form. The evidence here is early but directional. Fonseca, Lima, and&lt;br&gt;
Faria (arXiv:2510.18861, 2025) measured end-to-end Gherkin scenario&lt;br&gt;
generation from JIRA tickets on a production mobile application at BMW&lt;br&gt;
and found that practitioners reported time savings often amounting to a&lt;br&gt;
full developer-day per feature, with the automated generation completing&lt;br&gt;
in under five minutes. In a separate quasi-experiment, Hassani,&lt;br&gt;
Sabetzadeh, and Amyot (arXiv:2508.20744v2, 2026) had software&lt;br&gt;
engineering practitioners evaluate LLM-generated Gherkin specifications&lt;br&gt;
from legal text; 91.7% of ratings on the time savings criterion fell in&lt;br&gt;
the top two categories. Neither study is a controlled experiment. A&lt;br&gt;
2025 systematic review of 74 LLM-for-requirements studies (Zadenoori&lt;br&gt;
et al., arXiv:2509.11446) found that studies are predominantly evaluated&lt;br&gt;
in controlled environments using output quality metrics, with limited&lt;br&gt;
industry use, consistent with specification authoring cost being an&lt;br&gt;
underexplored measurement target in the literature. The evidence is&lt;br&gt;
emerging, not settled.&lt;/p&gt;

&lt;p&gt;What both studies show consistently is that AI shifts the human effort&lt;br&gt;
from authoring to reviewing: from expressing intent to validating&lt;br&gt;
that the expressed intent is accurate. The intent, the domain knowledge,&lt;br&gt;
and the judgment that the scenarios accurately describe what the system&lt;br&gt;
should do remain the human's responsibility. The mechanical work of&lt;br&gt;
expressing that intent in executable form has become substantially&lt;br&gt;
cheaper.&lt;/p&gt;

&lt;p&gt;The domain transition from complex to complicated is now economically&lt;br&gt;
viable at scale in a way it was not five years ago. The claim is not&lt;br&gt;
that AI makes specification automatic. It is that the economics have&lt;br&gt;
shifted enough to make specification-first development the rational&lt;br&gt;
default rather than the disciplined exception.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Khononov's Observation Stops
&lt;/h2&gt;

&lt;p&gt;In May 2025, Vlad Khononov argued on LinkedIn that LLMs turn complex&lt;br&gt;
problems into complicated ones. The direction is right. The mechanism&lt;br&gt;
is incomplete.&lt;/p&gt;

&lt;p&gt;Khononov attributes the domain transition to LLMs broadly. The argument&lt;br&gt;
here is more specific: the LLM is not what makes the transition stick.&lt;br&gt;
The specification is. An LLM operating without a specification produces&lt;br&gt;
output that sits in the complex domain because the agent's behaviour is&lt;br&gt;
still emergent and only knowable in hindsight. An LLM operating within&lt;br&gt;
a specification-governed verification pipeline produces output that can&lt;br&gt;
be assessed deterministically.&lt;/p&gt;

&lt;p&gt;Without the specification, the problem bounces back to complex every time&lt;br&gt;
the agent drifts, which it will. The governing constraint is what holds&lt;br&gt;
the transition.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Likely Objection
&lt;/h2&gt;

&lt;p&gt;A Cynefin-literate reader will raise a fair challenge: software systems&lt;br&gt;
are deterministic at the execution level. Given the same inputs and state,&lt;br&gt;
they produce the same outputs. If cause and effect are always knowable in&lt;br&gt;
principle, the complex framing is a category error. Software development&lt;br&gt;
is always in the complicated domain.&lt;/p&gt;

&lt;p&gt;The response is that the complexity resides in the problem space, not&lt;br&gt;
the execution. What users need, which edge cases matter, how requirements&lt;br&gt;
will evolve. These are genuinely complex. They are knowable only in&lt;br&gt;
hindsight, they respond unpredictably to interventions, and they exhibit&lt;br&gt;
the emergent properties that Snowden places in the complex domain. An&lt;br&gt;
executable specification narrows the problem space by articulating&lt;br&gt;
requirements precisely enough to be analysed. The domain transition&lt;br&gt;
occurs in the problem space, not in the implementation.&lt;/p&gt;

&lt;p&gt;This is not an endorsed Cynefin position. Dave Snowden is actively working&lt;br&gt;
on the relationship between AI and Cynefin domains as of early 2026,&lt;br&gt;
following practitioner discussions, and has not published conclusions.&lt;br&gt;
The argument here is an application of Cynefin vocabulary, checked&lt;br&gt;
carefully against canonical definitions. Readers familiar with the&lt;br&gt;
framework are invited to challenge it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changes in Practice
&lt;/h2&gt;

&lt;p&gt;Teams that operate AI agents without specifications are navigating a&lt;br&gt;
complex domain with complicated-domain tools. They apply analysis and&lt;br&gt;
best practices to a system whose behaviour is emergent. They will get&lt;br&gt;
some things right through pattern matching and experience, and they will&lt;br&gt;
miss systematically at the boundaries, because boundaries are exactly&lt;br&gt;
where emergent behaviour is most unpredictable.&lt;/p&gt;

&lt;p&gt;Teams operating AI agents within a specification-governed verification&lt;br&gt;
pipeline have moved the problem. They are not analysing emergent&lt;br&gt;
behaviour. They are verifying compliance with stated constraints. The&lt;br&gt;
failure mode is specification incompleteness, not emergent&lt;br&gt;
unpredictability. Incompleteness is solvable. A specification gets less&lt;br&gt;
incomplete over time. Emergent unpredictability does not.&lt;/p&gt;

&lt;p&gt;These are not the same problem with different amounts of effort applied.&lt;br&gt;
They are different problems, requiring different investment, different&lt;br&gt;
staffing, and different success measures.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Cynefin framework was developed by Dave Snowden and described in&lt;br&gt;
Snowden, D. and Boone, M. (2007), "A Leader's Framework for Decision&lt;br&gt;
Making," Harvard Business Review, November 2007. The enabling and&lt;br&gt;
governing constraints terminology is from Snowden's Cynefin wiki:&lt;br&gt;
cynefin.io/wiki/Constraints; this language is not in the 2007 HBR&lt;br&gt;
paper, which focuses on cause-and-effect relationships by domain.&lt;br&gt;
Khononov's observation on LLMs and Cynefin domains was made on LinkedIn&lt;br&gt;
in May 2025. Exaptation definition: Gould, S.J. and Vrba, E.S. (1982),&lt;br&gt;
"Exaptation: a missing term in the science of form," Paleobiology 8(1).&lt;br&gt;
DORA 2026: "Balancing AI Tensions: Moving from AI adoption to effective&lt;br&gt;
SDLC use," dora.dev, March 2026. Fonseca, P.L., Lima, B., and Faria,&lt;br&gt;
J.P. (2025), "Streamlining Acceptance Test Generation for Mobile&lt;br&gt;
Applications Through Large Language Models: An Industrial Case Study,"&lt;br&gt;
arXiv:2510.18861. Hassani, S., Sabetzadeh, M., and Amyot, D. (2026),&lt;br&gt;
"From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality&lt;br&gt;
of LLM-Generated Behavioural Specifications from Food-Safety&lt;br&gt;
Regulations," arXiv:2508.20744v2 (updated March 2026). Zadenoori,&lt;br&gt;
M.A., Dabrowski, J., Alhoshan, W., Zhao, L., and Ferrari, A. (2025),&lt;br&gt;
"Large Language Models (LLMs) for Requirements Engineering (RE): A&lt;br&gt;
Systematic Literature Review," arXiv:2509.11446.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Series: The Specification as Quality Gate&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1: The Echo Chamber in Your Pipeline&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Part 2: From Complex to Complicated (this post)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/what-specifications-cannot-catch-residual-defect-taxonomy"&gt;Part 3: What Specifications Cannot Catch&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/specification-as-quality-gate-three-hypotheses"&gt;Part 4: The Specification as Quality Gate&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cynefin</category>
      <category>specifications</category>
      <category>complexity</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Echo Chamber in Your Pipeline</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sun, 22 Mar 2026 21:32:22 +0000</pubDate>
      <link>https://forem.com/nuphirho/the-echo-chamber-in-your-pipeline-34o0</link>
      <guid>https://forem.com/nuphirho/the-echo-chamber-in-your-pipeline-34o0</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 1 of a four-part series, "The Specification as Quality Gate." The series develops three hypotheses about executable specifications, AI code review, and what each is actually for. Parts 2, 3, and 4 will follow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When an AI coding agent generates code and a separate AI reviewer examines&lt;br&gt;
it, both agents are reasoning from the same artefact: the code itself. If&lt;br&gt;
neither has an external reference (a specification, a contract, a&lt;br&gt;
statement of what the system is supposed to do), the reviewer has no&lt;br&gt;
ground truth to compare against. It checks the code against the code. Not&lt;br&gt;
against intent.&lt;/p&gt;

&lt;p&gt;The architecture is circular. And there is now enough empirical evidence&lt;br&gt;
to say precisely where and how it fails.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Mechanism
&lt;/h2&gt;

&lt;p&gt;In classical ensemble learning, stacking multiple estimators improves&lt;br&gt;
reliability under one condition: the estimators must fail independently.&lt;br&gt;
If two classifiers share the same training distribution and the same blind&lt;br&gt;
spots, combining them does not reduce error. It consolidates it. The joint&lt;br&gt;
miss rate of two correlated estimators approaches the miss rate of either&lt;br&gt;
one alone, not the product of both.&lt;/p&gt;
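&lt;p&gt;The arithmetic is easy to check with a toy simulation. The 20% miss rate below is invented for illustration; only the comparison between the two cases matters.&lt;/p&gt;

```python
import random

# Toy simulation: two code reviewers that each miss 20% of defects.
# The error rate is illustrative, not measured from any real model.
random.seed(0)
N = 100_000
p_miss = 0.2

both_independent = 0
both_correlated = 0
for _ in range(N):
    a = random.random() < p_miss  # reviewer 1 misses this defect
    b = random.random() < p_miss  # an INDEPENDENT reviewer 2
    if a and b:
        both_independent += 1
    if a:  # a CORRELATED reviewer 2 shares reviewer 1's blind spots exactly
        both_correlated += 1

print(f"independent joint miss rate: {both_independent / N:.3f}")  # ~0.04
print(f"correlated  joint miss rate: {both_correlated / N:.3f}")   # ~0.20
```

&lt;p&gt;With independent failures the joint miss rate is the product (0.2 × 0.2 = 0.04). With fully shared blind spots it stays at the single-reviewer rate.&lt;/p&gt;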

&lt;p&gt;This holds for LLM pipelines for a specific reason: a code-generating&lt;br&gt;
model and a code-reviewing model drawn from the same model family share&lt;br&gt;
architecture, training corpus, and reward signal. That is the same class&lt;br&gt;
of correlation the independence condition prohibits. They are not two&lt;br&gt;
independent estimators. They are two samples from the same prior.&lt;/p&gt;

&lt;p&gt;A 2025 paper (Vallecillos-Ruiz, Hort, and Moonen, arXiv:2510.21513)&lt;br&gt;
studying LLM ensembles for code generation and repair named this the&lt;br&gt;
"popularity trap." Models trained on similar distributions converge on&lt;br&gt;
the same syntactically plausible but semantically wrong answers. Consensus&lt;br&gt;
selection (the default review heuristic in most multi-agent pipelines)&lt;br&gt;
filters out the minority correct solutions and amplifies the shared&lt;br&gt;
error. Diversity-based selection, by contrast, recovered up to 95% of&lt;br&gt;
the gain that a perfectly independent ensemble would achieve. The&lt;br&gt;
implication is direct: naive stacking of homogeneous LLMs is&lt;br&gt;
counterproductive.&lt;/p&gt;

&lt;p&gt;A second paper (Mi et al., arXiv:2412.11014) examined a&lt;br&gt;
researcher-then-reviser pipeline for Verilog code generation and found&lt;br&gt;
that erroneous information from the first agent was accepted downstream&lt;br&gt;
because the agents shared the same training distribution and lacked&lt;br&gt;
adversarial diversity. Self-review repeated the original errors rather&lt;br&gt;
than correcting them.&lt;/p&gt;

&lt;p&gt;A February 2026 paper (Pappu et al., arXiv:2602.01011) went further,&lt;br&gt;
showing that even heterogeneous multi-agent teams consistently fail to&lt;br&gt;
match their best individual member, incurring performance losses of up to&lt;br&gt;
37.6%, even when explicitly told which member is the expert. The failure&lt;br&gt;
mechanism is consensus-seeking over expertise. Homogeneous copies make&lt;br&gt;
it worse.&lt;/p&gt;

&lt;p&gt;None of these papers tested a pure generator-then-reviewer pipeline&lt;br&gt;
without external grounding in the exact configuration described here.&lt;br&gt;
That controlled experiment has not been published. But three independent&lt;br&gt;
papers from 2025 and 2026, approaching the problem from different angles,&lt;br&gt;
document the same failure mode: shared training distributions produce&lt;br&gt;
correlated errors, and consensus amplifies rather than corrects them.&lt;/p&gt;


&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;This example uses a deliberately simple case. Modern frontier models catch&lt;br&gt;
classic boundary conditions reliably, as the experiments described in the&lt;br&gt;
caveats section confirm. The point here is the sequence: how a BDD&lt;br&gt;
scenario makes a defect detectable before any reviewer is involved.&lt;/p&gt;

&lt;p&gt;Consider a pagination function. A developer asks an AI coding agent to&lt;br&gt;
implement it. The agent produces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;paginate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent also produces tests: one for page 0 and one for page 1 against&lt;br&gt;
a ten-item list. Both pass. The implementation is correct for those cases.&lt;/p&gt;

&lt;p&gt;The flaw is in what was not tested. The implementation indexes pages&lt;br&gt;
from zero. A caller who calculates the last page from a total that is&lt;br&gt;
exactly divisible by the page size (10 items at 5 per page is page 2)&lt;br&gt;
counts from one, and gets an empty list back. The agent's tests encode&lt;br&gt;
the implementation's own convention, so the mismatch is invisible to the&lt;br&gt;
tests provided.&lt;/p&gt;

&lt;p&gt;The review agent, given only the code and the tests, validates the&lt;br&gt;
implementation against what is there. The tests pass. The reviewer&lt;br&gt;
reports no issues. It has no basis to ask what was not tested, because&lt;br&gt;
there is no specification defining the expected behaviour for boundary&lt;br&gt;
conditions.&lt;/p&gt;

&lt;p&gt;The BDD scenario that would have caught it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Last page when total is exactly divisible by page size
  &lt;span class="nf"&gt;Given &lt;/span&gt;a list of 10 items
  &lt;span class="nf"&gt;And &lt;/span&gt;a page size of 5
  &lt;span class="nf"&gt;When &lt;/span&gt;I request page 1
  &lt;span class="nf"&gt;Then &lt;/span&gt;I receive items 6 through 10
  &lt;span class="nf"&gt;And &lt;/span&gt;the result contains exactly 5 items
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This scenario would have failed against the implementation before the&lt;br&gt;
reviewer was ever involved. The pipeline stops at the specification, not&lt;br&gt;
the review.&lt;/p&gt;

&lt;p&gt;There is a second reason this matters beyond the review question. Even&lt;br&gt;
where models reliably identify boundary conditions during focused review,&lt;br&gt;
general refactoring passes introduce a different risk. An agent&lt;br&gt;
refactoring for readability or performance is not focused on correctness.&lt;br&gt;
Conditional checks and edge-case handling can be silently removed. The&lt;br&gt;
BDD pipeline catches that regression the same way it catches the original&lt;br&gt;
defect. The protection is unconditional on what the agent was trying to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the SGCR Paper Actually Says
&lt;/h2&gt;

&lt;p&gt;A December 2025 paper from HiThink Research (arXiv:2512.17540) proposed&lt;br&gt;
a Specification-Grounded Code Review framework and reported a 90.9%&lt;br&gt;
improvement over a baseline LLM reviewer. This figure circulates widely&lt;br&gt;
and requires a correction.&lt;/p&gt;

&lt;p&gt;The 90.9% is an improvement in developer adoption rate of review&lt;br&gt;
suggestions: 42% versus 22%. Developer adoption reflects whether&lt;br&gt;
suggestions are relevant and actionable. It is not a defect detection&lt;br&gt;
rate. These are different claims.&lt;/p&gt;

&lt;p&gt;The SGCR paper validates the hypothesis partially. Without specification&lt;br&gt;
grounding, baseline LLM reviewers produce generic suggestions and&lt;br&gt;
hallucinated issues. Grounding in human-authored specifications filters&lt;br&gt;
the noise and produces suggestions developers act on. But the paper's&lt;br&gt;
architecture, combining an explicit specification-driven path and an&lt;br&gt;
implicit heuristic discovery path, does not claim specifications&lt;br&gt;
make review redundant. It positions them as making review better.&lt;/p&gt;

&lt;p&gt;SGCR's implicit pathway explicitly handles discovery of issues beyond the&lt;br&gt;
stated rules. This is the paper's own acknowledgement that specifications&lt;br&gt;
alone are insufficient. It is consistent with the argument that an AI&lt;br&gt;
review agent has a legitimate residual role, but only after the&lt;br&gt;
specification pipeline has done its work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Independence Condition
&lt;/h2&gt;

&lt;p&gt;The correlated error claim has a precise boundary. Stacking any two AI&lt;br&gt;
reviewers is not always counterproductive. The failure condition is&lt;br&gt;
specific: estimators that share a training distribution and lack an&lt;br&gt;
external reference exhibit correlated failures. The condition for genuine&lt;br&gt;
benefit is diversity plus external grounding.&lt;/p&gt;

&lt;p&gt;If the reviewer uses a different model family, a different temperature, or&lt;br&gt;
a substantially different prompting strategy, errors may be partially&lt;br&gt;
independent. A cross-family pipeline, Grok reviewing Claude-generated&lt;br&gt;
code for instance, has more independence than a same-family pipeline.&lt;br&gt;
But model diversity and external grounding are two separate conditions.&lt;br&gt;
Diversity reduces correlation between estimators. It does not supply&lt;br&gt;
ground truth. A cross-family reviewer without an external specification&lt;br&gt;
is still checking code against code, not code against intent. It will&lt;br&gt;
catch some things the generator missed. It will share the same blind&lt;br&gt;
spots on anything not well-represented in either training corpus, and&lt;br&gt;
it has no basis to identify what was not specified regardless of how&lt;br&gt;
different its architecture is from the generator.&lt;/p&gt;

&lt;p&gt;The ensemble literature is clear that diversity is what makes stacking&lt;br&gt;
work. The specification is what makes review non-circular. Both&lt;br&gt;
conditions matter. The current industry architecture typically satisfies&lt;br&gt;
neither.&lt;/p&gt;




&lt;h2&gt;
  
  
  But What About Documentation?
&lt;/h2&gt;

&lt;p&gt;A reasonable objection: if the problem is that the reviewer lacks ground&lt;br&gt;
truth, why not solve it by adding documentation? Write better docstrings.&lt;br&gt;
Maintain a spec document. Give the reviewer the context it needs.&lt;/p&gt;

&lt;p&gt;Our own experiment accidentally tested this. The first version of the v2&lt;br&gt;
corpus included docstrings that stated the domain convention explicitly.&lt;br&gt;
The &lt;code&gt;prorate_premium&lt;/code&gt; docstring said "ISDA actual/actual." The&lt;br&gt;
&lt;code&gt;schedule_maintenance&lt;/code&gt; docstring said "ANY of these thresholds." Claude&lt;br&gt;
caught every bug at 100%. Documentation in close proximity to the code&lt;br&gt;
works, at least when the agent reads it.&lt;/p&gt;

&lt;p&gt;That result is real, but it does not solve the problem. It reframes it.&lt;/p&gt;

&lt;p&gt;The first issue is retrieval and compliance. A docstring sitting on the&lt;br&gt;
same function is the best case for specification proximity. An external&lt;br&gt;
policy document, a requirements file, or a Confluence page is a much&lt;br&gt;
weaker guarantee. The agent needs to find it, read it, and apply it faithfully.&lt;br&gt;
Anyone who has watched an AI agent work at scale knows that is not a safe&lt;br&gt;
assumption. Agents read context selectively. They produce plausible output&lt;br&gt;
that satisfies surface checks without necessarily following the rule they&lt;br&gt;
were given. I have seen this enough times firsthand to know I cannot&lt;br&gt;
treat documentation-as-specification as reliable ground truth.&lt;/p&gt;

&lt;p&gt;The second issue is drift. Docstrings are not executable. They do not&lt;br&gt;
fail when the code diverges from them. A function can be refactored, a&lt;br&gt;
business rule can change, an edge case can be added, and the docstring&lt;br&gt;
sits there describing what the function used to do, confidently, in&lt;br&gt;
present tense. Every codebase accumulates this. The reviewer checking&lt;br&gt;
code against a stale docstring is not checking against intent. It is&lt;br&gt;
checking against what someone intended when they first wrote it.&lt;/p&gt;

&lt;p&gt;A BDD scenario cannot drift silently. When the system stops behaving the&lt;br&gt;
way the scenario describes, the scenario fails. The build stops. The&lt;br&gt;
divergence is visible immediately. That is not a property of documentation.&lt;br&gt;
It is a property of executable specifications. The difference is not one&lt;br&gt;
of quality or discipline. It is structural.&lt;/p&gt;

&lt;p&gt;The objection "just write better documentation" asks for the same&lt;br&gt;
discipline that produced the bad documentation in the first place, and&lt;br&gt;
adds an assumption that AI agents will follow it reliably. The BDD&lt;br&gt;
pipeline removes both dependencies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Implication
&lt;/h2&gt;

&lt;p&gt;An AI reviewer without an external specification is a probability estimate,&lt;br&gt;
not a quality gate. It samples from the same distribution as the generator&lt;br&gt;
and applies pattern matching against its training data. It will catch some&lt;br&gt;
things the generator missed, particularly common vulnerability patterns&lt;br&gt;
and well-documented anti-patterns that appear frequently in training data.&lt;br&gt;
It will systematically miss whatever the generator systematically missed,&lt;br&gt;
because they share the same prior.&lt;/p&gt;

&lt;p&gt;A BDD scenario that fails is a falsified claim. The pipeline either passes&lt;br&gt;
or it does not. There is no hallucination, no false confidence, no noise&lt;br&gt;
to filter.&lt;/p&gt;

&lt;p&gt;The architecture that follows is not "no AI review." It is: specifications&lt;br&gt;
first, verification pipeline second, AI review only for the residual. The&lt;br&gt;
residual is real. It includes architectural properties, structural drift,&lt;br&gt;
and defect classes that resist specification, but it is a much smaller&lt;br&gt;
target than the current industry framing suggests. The next post in this&lt;br&gt;
series maps that residual precisely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Caveats and Open Questions
&lt;/h2&gt;

&lt;p&gt;This argument rests on three empirical papers and two small contrived&lt;br&gt;
experiments. Both experiments are same-family only (Claude reviewing&lt;br&gt;
Claude-generated code) and use a planted bug corpus rather than a natural&lt;br&gt;
defect sample. They are directional evidence, not a controlled&lt;br&gt;
demonstration.&lt;/p&gt;

&lt;p&gt;The first experiment used classic boundary-condition bugs. Claude caught&lt;br&gt;
all five at 100%, which refined rather than confirmed the hypothesis.&lt;br&gt;
Classic boundary conditions are pattern-recognition problems that are&lt;br&gt;
dense in training data. The correlated error claim applies where the&lt;br&gt;
convention is absent from training data, not where it is ubiquitous.&lt;/p&gt;

&lt;p&gt;The second experiment used domain-convention violations with neutral&lt;br&gt;
docstrings. Detection ranged from 0% to 100% depending on domain opacity.&lt;br&gt;
The log-linear interpolation function, where the convention is market&lt;br&gt;
practice in fixed income rather than general programming knowledge, was&lt;br&gt;
missed in all five runs. The reviewer flagged a different concern instead&lt;br&gt;
and declared the implementation correct. BDD caught it. The full corpus,&lt;br&gt;
specifications, scripts, and results are available in the&lt;br&gt;
&lt;a href="https://github.com/czietsman/nuphirho.dev/tree/main/experiments" rel="noopener noreferrer"&gt;experiments directory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The independence condition also requires more empirical work. Same-family&lt;br&gt;
models fail the independence requirement. How much architectural diversity (different model families, different prompting strategies, different&lt;br&gt;
temperatures) is required to provide genuine signal rather than&lt;br&gt;
correlated noise remains to be established.&lt;/p&gt;

&lt;p&gt;These are open questions, not fatal weaknesses. Without an external&lt;br&gt;
reference, circular review is circular. The empirical work so far points&lt;br&gt;
in one direction. A controlled demonstration using a natural defect sample&lt;br&gt;
and cross-family pipelines would make the argument more precise.&lt;/p&gt;

&lt;p&gt;The authors of the papers cited here, along with anyone working on LLM&lt;br&gt;
ensemble reliability, multi-agent code pipelines, or specification-grounded&lt;br&gt;
review, are welcome to challenge, correct, or build on this argument.&lt;br&gt;
If the hypothesis is wrong, knowing precisely where it breaks is more&lt;br&gt;
useful than leaving it unchallenged. Comments are open.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The papers cited in this post are: Vallecillos-Ruiz, F., Hort, M., and&lt;br&gt;
Moonen, L., "Wisdom and Delusion of LLM Ensembles for Code Generation and&lt;br&gt;
Repair," arXiv:2510.21513, October 2025; Mi, Z. et al., "CoopetitiveV:&lt;br&gt;
Leveraging LLM-powered Coopetitive Multi-Agent Prompting for High-quality&lt;br&gt;
Verilog Generation," arXiv:2412.11014, v2 June 2025; Pappu, A. et al.,&lt;br&gt;
"Multi-Agent Teams Hold Experts Back," arXiv:2602.01011, February 2026;&lt;br&gt;
and Wang, K., Mao, B., Jia, S., Ding, Y., Han, D., Ma, T., and Cao, B.,&lt;br&gt;
"SGCR: A Specification-Grounded Framework for Trustworthy LLM Code&lt;br&gt;
Review," arXiv:2512.17540, December 2025, v2 January 2026. Author lists&lt;br&gt;
and findings verified against the original papers.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Series: The Specification as Quality Gate&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Part 1: The Echo Chamber in Your Pipeline (this post)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2: From Complex to Complicated&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/what-specifications-cannot-catch-residual-defect-taxonomy"&gt;Part 3: What Specifications Cannot Catch&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/specification-as-quality-gate-three-hypotheses"&gt;Part 4: The Specification as Quality Gate&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>specifications</category>
      <category>bdd</category>
    </item>
    <item>
      <title>The Trust Barrier</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Tue, 10 Mar 2026 19:21:03 +0000</pubDate>
      <link>https://forem.com/nuphirho/the-trust-barrier-444a</link>
      <guid>https://forem.com/nuphirho/the-trust-barrier-444a</guid>
      <description>&lt;p&gt;The biggest obstacle to AI in software engineering isn't the technology. It's trust.&lt;/p&gt;

&lt;p&gt;Teams don't trust AI-generated code. And they shouldn't, not blindly. But the conversation in most engineering organisations has stalled at that point. The distrust is treated as a conclusion rather than a starting point. AI generates code we can't trust, therefore we shouldn't use AI. End of discussion.&lt;/p&gt;

&lt;p&gt;That's the wrong framing. The right question isn't whether to trust AI. It's how to build a process that makes trust measurable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The carpenter and the power tools
&lt;/h2&gt;

&lt;p&gt;Uncle Bob at &lt;a href="https://www.linkedin.com/company/cleancoder/" rel="noopener noreferrer"&gt;Uncle Bob Consulting LLC&lt;/a&gt; has been working through this in public, and his evolving position is worth following. Back in September 2025, he was clear that AI would change programming for the better but not replace programmers.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1962636247769530650-734" src="https://platform.twitter.com/embed/Tweet.html?id=1962636247769530650"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-1962636247769530650-734');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=1962636247769530650&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;By January 2026, after working with AI more directly, he observed something important about the relationship between programmer and AI: it's not like a pilot and autopilot. His conclusion was striking.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2008879916301898134-431" src="https://platform.twitter.com/embed/Tweet.html?id=2008879916301898134"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-2008879916301898134-431');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2008879916301898134&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;"In the end the AI might write 90% of the code, but the programmer will have put in 90% of the thought and effort."&lt;/p&gt;

&lt;p&gt;Then in February, after six weeks of daily use, he compared the experience to a skilled carpenter being handed power tools for the first time. The power is undeniable, but so are the risks. He's still not sure his project wouldn't be just as far along without it.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2019025982863069621-423" src="https://platform.twitter.com/embed/Tweet.html?id=2019025982863069621"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-2019025982863069621-423');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2019025982863069621&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;That progression resonates because it captures where most experienced engineers are right now. They can see the potential. They can also see the ways it could go wrong. And without a clear framework for managing those risks, the rational response is caution.&lt;/p&gt;

&lt;p&gt;But here's where it gets really interesting. Uncle Bob, the person who formalised the three laws of TDD, recently acknowledged that TDD as traditionally practised is inefficient for AI. The principles remain, but the techniques need to adapt.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2023158252700066287-865" src="https://platform.twitter.com/embed/Tweet.html?id=2023158252700066287"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-2023158252700066287-865');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2023158252700066287&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;That's a significant shift from someone who has spent decades advocating for specific engineering disciplines. And it's exactly the point. The principles of verification, rigour, and specification don't change. The techniques do.&lt;/p&gt;

&lt;p&gt;I've been through the same journey. I spent time being cautious, reviewing everything line by line, second-guessing outputs, wondering whether I was actually saving time or just shifting the effort from writing to reviewing. It felt productive, but it wasn't fundamentally different from writing the code myself.&lt;/p&gt;

&lt;p&gt;Here's what I found on the other side of that phase: the question isn't "can I trust this code?" It's "does this code meet the spec?"&lt;/p&gt;

&lt;p&gt;That reframe changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  From gut feeling to process
&lt;/h2&gt;

&lt;p&gt;When trust is a feeling, it's subjective, inconsistent, and unscalable. One developer might trust AI-generated code after a quick scan. Another might rewrite it entirely. Neither approach is reliable, and neither scales to a team.&lt;/p&gt;

&lt;p&gt;When trust is a process, it becomes a solvable engineering problem. You define what correct looks like before any code gets written. You build verification layers that catch failures regardless of who or what produced the code. You measure outcomes rather than gut-checking inputs.&lt;/p&gt;

&lt;p&gt;This isn't a new idea. It's the scientific method applied to software delivery. Hypothesise what the system should do. Build it. Test whether it does what you hypothesised. If it doesn't, iterate.&lt;/p&gt;

&lt;p&gt;We call it computer science for a reason, but somewhere along the way the discipline drifted from its roots. We stopped treating software delivery as an empirical process and started treating it as a craft, something learned through apprenticeship, refined through experience, and validated by the judgement of senior practitioners. That model worked well enough when humans were writing all the code. Experienced judgement was the best verification layer we had.&lt;/p&gt;

&lt;p&gt;But craft-based verification doesn't scale to AI-generated output. You can't rely on an experienced developer's intuition when the volume of generated code outpaces any individual's ability to read it. What scales is the scientific method: define the expected behaviour precisely, run the experiment, measure the result. If the result matches the hypothesis, the code is fit for purpose. If it doesn't, you have a specific, measurable failure to investigate.&lt;/p&gt;

&lt;p&gt;The difference now is that AI has made this approach economically viable in ways it wasn't before.&lt;/p&gt;

&lt;p&gt;Writing behavioural specifications, running mutation tests, enforcing contract-driven APIs, these practices have existed for years. They were always the right thing to do. But the effort required to implement them properly meant that most teams cut corners. When you're manually writing all the code, spending additional time writing comprehensive specifications and mutation test suites feels like a luxury.&lt;/p&gt;

&lt;p&gt;AI changes that equation. When the code generation is fast and cheap, the verification layer becomes the primary deliverable. The spec is no longer documentation you write after the fact. It's the thing you write first, and it's the thing that determines whether the output is fit for purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;The process I've built works like this:&lt;/p&gt;

&lt;p&gt;Write behavioural specifications before any code gets written, by human or AI. Use BDD to define what the system should do, not how it should do it. These specifications are executable. They're not documentation sitting in a wiki. They run as part of the pipeline and they fail when the code doesn't match the expected behaviour.&lt;/p&gt;
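&lt;p&gt;Stripped of framework machinery, an executable specification is just Gherkin steps bound to code. The cart-and-voucher domain below is invented; tools such as behave or pytest-bdd do the binding for real suites.&lt;/p&gt;

```python
# Each Gherkin step maps to a plain function; the scenario runs as code
# and fails when behaviour diverges. Domain and numbers are illustrative.

def given_a_cart_with(prices):
    return {"prices": list(prices)}

def when_a_voucher_is_applied(cart, percent):
    cart["total"] = sum(cart["prices"]) * (100 - percent) / 100
    return cart

def then_the_total_is(cart, expected):
    assert cart["total"] == expected, f"expected {expected}, got {cart['total']}"

# Scenario: 50% voucher on a 200.00 cart
cart = given_a_cart_with([120.0, 80.0])
cart = when_a_voucher_is_applied(cart, 50)
then_the_total_is(cart, 100.0)
```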

&lt;p&gt;Run mutation tests to validate that your tests actually catch failures. A passing test suite means nothing if the tests would still pass with broken code. Mutation testing introduces deliberate faults and checks whether your tests detect them. It's the verification layer for your verification layer.&lt;/p&gt;
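&lt;p&gt;A hand-rolled sketch of the idea; real tools such as mutmut or PIT generate and run the mutants automatically. The function and values here are illustrative.&lt;/p&gt;

```python
# A weak test suite passes for the original AND the mutant; a strong one
# kills the mutant by exercising the behaviour the mutation breaks.

def discount(price, rate):
    return price * (1 - rate)

def discount_mutant(price, rate):
    return price * (1 + rate)  # deliberate fault: sign flipped

def weak_suite(fn):
    return fn(100, 0) == 100  # never exercises a non-zero rate

def strong_suite(fn):
    return fn(100, 0) == 100 and fn(100, 0.5) == 50

assert weak_suite(discount) and weak_suite(discount_mutant)          # survives
assert strong_suite(discount) and not strong_suite(discount_mutant)  # killed
```

&lt;p&gt;A surviving mutant is a precise, actionable finding: a behaviour your tests claim to cover but do not.&lt;/p&gt;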

&lt;p&gt;Enforce contract-driven APIs so integrations are verifiable independently. When services agree on contracts, you can validate that each side honours the agreement without spinning up the entire system. This matters even more when AI is generating the integration code, because the contracts become the source of truth that the generated code is measured against.&lt;/p&gt;
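&lt;p&gt;A minimal sketch of the contract idea, with invented field names; consumer-driven contract tools such as Pact formalise this with generated verifications on both sides.&lt;/p&gt;

```python
# The contract records what the consumer relies on; the provider verifies
# its responses against it without the consumer's system running.

contract = {"id": int, "status": str}  # illustrative consumer expectations

def satisfies(response, contract):
    return all(
        field in response and isinstance(response[field], expected)
        for field, expected in contract.items()
    )

# Provider-side verification, independent of any live integration.
# Extra fields are fine; missing fields or wrong types are breaches.
assert satisfies({"id": 42, "status": "active", "extra": True}, contract)
assert not satisfies({"id": "42", "status": "active"}, contract)  # wrong type
```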

&lt;p&gt;Review outputs against specifications, not line by line. This is the fundamental shift. Instead of reading every line of AI-generated code and asking "does this look right?", you run the specification suite and ask "does this behave correctly?" The first approach doesn't scale. The second does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The personal proof point
&lt;/h2&gt;

&lt;p&gt;I don't write code by hand anymore. I direct AI to write it. I've never formally learned Vue or Go, but I'm shipping production-ready prototypes in both. Not because I blindly trust AI, but because I've built the process that lets me verify the output.&lt;/p&gt;

&lt;p&gt;That statement makes some people uncomfortable. It sounds reckless to say "I ship code in languages I haven't formally learned." But the discomfort reveals the assumption: that understanding every line of code is a prerequisite for shipping reliable software. It's not. Understanding the specification, the verification layer, and the failure modes is the prerequisite. The language is an implementation detail.&lt;/p&gt;

&lt;p&gt;This doesn't mean the code doesn't matter. It means the specification matters more. And when you have a robust specification, the question of whether a human or an AI wrote the implementation becomes less important than whether the implementation meets the spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  The competitive advantage is process
&lt;/h2&gt;

&lt;p&gt;The teams that figure this out first won't just be faster. They'll be more reliable than teams still reviewing every line by hand. That's counterintuitive, because we associate human review with quality. But human review is inconsistent, fatiguing, and subject to all the cognitive biases that make us miss things we've seen a hundred times before.&lt;/p&gt;

&lt;p&gt;A specification-driven process with automated verification doesn't get tired. It doesn't skip steps on a Friday afternoon. It catches the same class of failures at 5pm that it catches at 9am. It scales to any volume of AI-generated code without degrading in quality.&lt;/p&gt;

&lt;p&gt;The trust barrier is real. I'm not dismissing the concern. Engineers who are cautious about AI-generated code are being responsible. But caution without a path forward is just avoidance. The path forward is process.&lt;/p&gt;

&lt;p&gt;Build the verification layer. Let the spec be the source of truth. And stop asking whether you can trust the code. Start asking whether it meets the spec.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>process</category>
      <category>bdd</category>
      <category>trust</category>
    </item>
    <item>
      <title>When bash gets too wild: rewriting my publish pipeline in Go</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Mon, 09 Mar 2026 13:11:50 +0000</pubDate>
      <link>https://forem.com/nuphirho/when-bash-gets-too-wild-rewriting-my-publish-pipeline-in-go-1eff</link>
      <guid>https://forem.com/nuphirho/when-bash-gets-too-wild-rewriting-my-publish-pipeline-in-go-1eff</guid>
      <description>&lt;p&gt;There is a particular kind of technical debt that only grows in shell scripts. It starts innocently: a &lt;code&gt;sed&lt;/code&gt; command to strip some quotes, a &lt;code&gt;grep&lt;/code&gt; to pull out a frontmatter field, a &lt;code&gt;jq&lt;/code&gt; one-liner to map tags. Each fix is small and defensible in isolation. But give it a few months and you have something that works in exactly the conditions it was written for and fails in ways that are almost impossible to reason about.&lt;/p&gt;

&lt;p&gt;That is where I ended up with the publish pipeline for this blog, and it did not even take months. We built it fresh, but the pace of iteration compressed what normally takes a year into days. The workarounds accumulated just as fast.&lt;/p&gt;

&lt;p&gt;The pipeline cross-posts markdown files to Hashnode and Dev.to via their APIs. It lives in a GitHub Actions workflow, and it started failing almost immediately: quoted YAML values, tags with hyphens, multi-line frontmatter. Each failure led to another workaround: quote-stripping logic, a tag glossary loaded via &lt;code&gt;jq&lt;/code&gt;, conditionals that were more incantation than logic. The script became the kind of thing you are afraid to touch because you cannot be certain what it is actually doing.&lt;/p&gt;

&lt;p&gt;The honest diagnosis: I had reached the natural ceiling of bash for this kind of work. Bash has no good answer to "how do I test this?" And if you cannot test it, you cannot trust it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost of untestable pipelines
&lt;/h2&gt;

&lt;p&gt;This is not unique to personal blog tooling. Infrastructure scripts, CI workflows, deployment pipelines: these are often written in bash because bash is always available and the task seems small at the time. The problem is that "small at the time" is doing a lot of work in that sentence. Pipelines accumulate logic. They handle edge cases. They become load-bearing parts of a development workflow. And because they are in bash, they remain untestable by default.&lt;/p&gt;

&lt;p&gt;The failure mode is not usually a catastrophic crash. It is subtler: a post that silently publishes with the wrong tags, a frontmatter field that gets parsed incorrectly and corrupts a slug, a workaround that fixes one case while breaking another you did not think to check. You discover these failures after the fact, in production, with no good way to reproduce them locally.&lt;/p&gt;

&lt;p&gt;I have spent a lot of time over the past year thinking about what rigorous software engineering actually looks like when AI is in the picture. One of the core arguments I keep coming back to is that AI has collapsed the cost of doing things properly. That covers not just the glamorous parts like architecture and feature development but also the unglamorous ones: tests, specs, validation, structured error handling. If the tooling for rigour is now cheaper to write, there is less justification for skipping it. That includes pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rewrite
&lt;/h2&gt;

&lt;p&gt;The goal was straightforward: replace the inline bash in &lt;code&gt;.github/workflows/publish.yml&lt;/code&gt; with a Go CLI tool. The workflow calls the binary; all logic lives in testable Go code with BDD specifications written before any implementation.&lt;/p&gt;

&lt;p&gt;The package structure reflects the actual problem decomposition: &lt;code&gt;internal/frontmatter&lt;/code&gt; for YAML parsing, &lt;code&gt;internal/tags&lt;/code&gt; for glossary loading and per-platform mapping, &lt;code&gt;internal/hashnode&lt;/code&gt; and &lt;code&gt;internal/devto&lt;/code&gt; for the API clients, &lt;code&gt;internal/probe&lt;/code&gt; for contract validation, &lt;code&gt;internal/pipeline&lt;/code&gt; for the orchestrator, and &lt;code&gt;cmd/publish&lt;/code&gt; as the thin CLI wrapper. The full plan is in the &lt;a href="https://github.com/czietsman/nuphirho.dev" rel="noopener noreferrer"&gt;nuphirho.dev repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The end result: 98 BDD scenarios, 488 steps across 7 packages, all passing. Around 150 lines of bash collapse into two workflow steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build publish tool&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go build -o publish ./cmd/publish/&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Publish posts&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;HASHNODE_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.HASHNODE_TOKEN }}&lt;/span&gt;
    &lt;span class="na"&gt;HASHNODE_PUBLICATION_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.HASHNODE_PUBLICATION_ID }}&lt;/span&gt;
    &lt;span class="na"&gt;DEVTO_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DEVTO_API_KEY }}&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./publish --tags-file tags.json ${{ steps.changed.outputs.posts }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the line count is not the interesting part. What is interesting is what the BDD scenarios caught.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the specs caught
&lt;/h2&gt;

&lt;p&gt;The first package, &lt;code&gt;internal/frontmatter&lt;/code&gt;, is where most of the bash failures had originated. Writing the scenarios first forced precision about what "correct" actually means. That precision immediately exposed bugs that had been hiding in plain sight.&lt;/p&gt;

&lt;p&gt;The secret detection regex in the bash pipeline was &lt;code&gt;[A-Za-z0-9+/]{20,}&lt;/code&gt;. It did not match GitHub personal access tokens (&lt;code&gt;ghp_...&lt;/code&gt;) because it excluded underscores and hyphens. The BDD scenario caught this on the first run. The fix was adding &lt;code&gt;_-&lt;/code&gt; to the character class. One line. But without a scenario that explicitly tested a GitHub token pattern, that bug would have kept silently passing anything with a &lt;code&gt;ghp_&lt;/code&gt; prefix as clean.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gopkg.in/yaml.v3&lt;/code&gt; library eliminated a whole category of problems. The bash pipeline needed separate handling for three tag notation formats: &lt;code&gt;[a, b]&lt;/code&gt;, &lt;code&gt;["a", "b"]&lt;/code&gt;, and YAML list. The Go library handles all three identically, returning &lt;code&gt;[]string&lt;/code&gt; in all cases. That alone justified the rewrite.&lt;/p&gt;
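&lt;p&gt;For concreteness, the three notations side by side (as separate YAML documents); &lt;code&gt;gopkg.in/yaml.v3&lt;/code&gt; unmarshals each into the same &lt;code&gt;[]string&lt;/code&gt;:&lt;/p&gt;

```yaml
# flow style, unquoted
tags: [go, bdd]
---
# flow style, quoted
tags: ["go", "bdd"]
---
# block style
tags:
  - go
  - bdd
```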

&lt;p&gt;Moving to &lt;code&gt;internal/hashnode&lt;/code&gt;, the GraphQL client needed a helper for navigating untyped JSON responses. The &lt;code&gt;gqlNode.path()&lt;/code&gt; helper was written to chain safely on nil, so missing fields return empty strings rather than panics. What it did not handle correctly was a null JSON value: &lt;code&gt;publication: null&lt;/code&gt;. The initial check was &lt;code&gt;if pub == nil&lt;/code&gt;, which passed because the node existed, just with a nil raw value. The "publication not found" error was never raised. The BDD scenario caught it. Fixed by checking &lt;code&gt;pub.isNull()&lt;/code&gt; instead. This is the same class of nil-vs-empty bug that bites Go developers with nil interfaces. Without the scenario, it would have manifested as a confusing API error message at publish time.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;internal/probe&lt;/code&gt; package had the most satisfying catch. The contract probe output format was specified exactly in the plan: &lt;code&gt;[hashnode] credentials        OK&lt;/code&gt; with precise column padding. The initial implementation used &lt;code&gt;fmt.Sprintf("[%-8s]", platform)&lt;/code&gt;, which pads inside the brackets: &lt;code&gt;[devto   ]&lt;/code&gt;. The spec required the padding to come after the closing bracket, so the spaces follow &lt;code&gt;[devto]&lt;/code&gt;. Five of eleven scenarios failed on that single formatting difference. One-line fix. All eleven passing. The spec was doing its job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns, not just passes
&lt;/h2&gt;

&lt;p&gt;Something else happened during the implementation that is worth noting. The first package, &lt;code&gt;internal/hashnode&lt;/code&gt;, required iteration. The nil-vs-empty bug needed fixing. The fake HTTP client routing needed thought. Two scenarios failed on the first run and required code changes.&lt;/p&gt;

&lt;p&gt;Every package after that passed on the first run. Tags, Dev.to, probe, pipeline, CLI: five consecutive packages with no post-implementation fixes. That is not coincidence. The patterns established in the hashnode package (an injected HTTP client interface, fake routing by method and path, result structs carrying action and IDs, &lt;code&gt;io.Writer&lt;/code&gt; for testable output) transferred cleanly to every subsequent package. By the time the orchestrator was being written, the shape of the solution was so well established that the implementation matched the spec without iteration.&lt;/p&gt;
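&lt;p&gt;A sketch of the injected-client pattern with invented names: the client code depends on a one-method interface, and tests inject a fake that routes canned responses by method and path.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// doer is the one-method interface production code depends on;
// *http.Client satisfies it, and so does the fake below.
type doer interface {
	Do(req *http.Request) (*http.Response, error)
}

// fakeClient routes canned bodies by "METHOD /path".
type fakeClient struct {
	responses map[string]string
}

func (f *fakeClient) Do(req *http.Request) (*http.Response, error) {
	body, ok := f.responses[req.Method+" "+req.URL.Path]
	if !ok {
		return &http.Response{StatusCode: 404, Body: io.NopCloser(strings.NewReader(""))}, nil
	}
	return &http.Response{StatusCode: 200, Body: io.NopCloser(strings.NewReader(body))}, nil
}

// fetchTitle is a stand-in for real client code under test.
func fetchTitle(c doer, url string) (string, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	res, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()
	b, err := io.ReadAll(res.Body)
	return strings.TrimSpace(string(b)), err
}

func main() {
	fake := &fakeClient{responses: map[string]string{
		"GET /api/articles/1": "When bash gets too wild",
	}}
	title, _ := fetchTitle(fake, "https://example.invalid/api/articles/1")
	fmt.Println(title)
}
```

&lt;p&gt;Once this shape exists in one package, reusing it elsewhere is mostly mechanical, which is exactly the transfer effect described above.&lt;/p&gt;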

&lt;p&gt;This is one of the underappreciated benefits of BDD when it is done in order. The first package is where you pay the design tax. Everything after inherits the patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two decisions worth explaining
&lt;/h2&gt;

&lt;p&gt;Two design decisions in &lt;code&gt;internal/pipeline&lt;/code&gt; are worth being explicit about, because both have non-obvious implications.&lt;/p&gt;

&lt;p&gt;The first is two-phase execution. Phase 1 validates all post files. If any fail validation, the pipeline exits with code 1 before making any API calls. Phase 2 processes each file independently. If Hashnode fails for one file, Dev.to is skipped for that file but other files continue. This means validation is a hard gate (you never partially publish a batch with a broken post in it), while publish failures are isolated rather than catastrophic.&lt;/p&gt;

&lt;p&gt;The second is no rollback. If file 1 publishes successfully to both platforms and file 2 then fails on Hashnode, file 1's results are preserved. The summary reports both the success and the failure. Exit code 2. There is no mechanism to undo file 1's publish. This matches the bash pipeline's behaviour, and it is the right default: a post that has been successfully published should stay published. The failure is in the tooling, not the content.&lt;/p&gt;

&lt;p&gt;Both decisions are explicitly tested in BDD scenarios. Writing them as scenarios forces you to state the expected behaviour unambiguously before you implement it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The migration strategy
&lt;/h2&gt;

&lt;p&gt;The switch was not a flag day. The Go binary ran in &lt;code&gt;--dry-run&lt;/code&gt; mode in a parallel workflow job alongside the existing bash pipeline, uploading structured JSON output as a GitHub Actions artifact. Once the dry-run output matched the bash pipeline's behaviour, Hashnode cut over first, keeping Dev.to on bash to isolate the risk to one platform. Once that was stable, Dev.to followed, the bash pipeline was removed, and the multi-job workflow collapsed into a single &lt;code&gt;./publish&lt;/code&gt; invocation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--dry-run&lt;/code&gt; flag produces structured JSON rather than human-readable logs specifically for this phase. Diffing JSON is straightforward. Diffing log output is not.&lt;/p&gt;

&lt;p&gt;The final tally: 411 lines of workflow down to 117. The migration is done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The broader point
&lt;/h2&gt;

&lt;p&gt;This is a small project. It publishes markdown files to two APIs. No reasonable person would call it critical infrastructure.&lt;/p&gt;

&lt;p&gt;But the same thinking applies at every scale. Pipelines accumulate logic. Untestable logic accumulates risk. And the cost of writing testable, well-specified code has come down dramatically. This rewrite (7 packages, 98 scenarios, 488 steps, full BDD coverage) happened in days, not months. That is the part that changes the calculus. There is less and less justification for treating pipelines as second-class citizens that do not deserve the same engineering rigour as application code.&lt;/p&gt;

&lt;p&gt;The secret regex bug, the nil-vs-empty GraphQL node, the five failing probe scenarios from a single misplaced padding character: none of these are dramatic failures. They are exactly the kind of subtle, silent, hard-to-reproduce problems that bash pipelines accumulate over time and that a proper test suite catches before they reach production.&lt;/p&gt;

&lt;p&gt;The bash script served its purpose. It got the pipeline working. Now it has earned the right to be replaced by something that can actually be trusted.&lt;/p&gt;

&lt;p&gt;One last detail worth recording: the first real test of the new pipeline used this post as the test file. It immediately hit the exact class of bash quoting bug the post describes. The timing was not planned.&lt;/p&gt;

&lt;p&gt;I will write a follow-up once the mutation testing is in place, specifically on running go-mutesting against the frontmatter parser, which is the highest-consequence code in the tool and the place where mutation coverage matters most.&lt;/p&gt;

</description>
      <category>go</category>
      <category>bdd</category>
      <category>devops</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
