<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Christo Zietsman</title>
    <description>The latest articles on Forem by Christo Zietsman (@nuphirho).</description>
    <link>https://forem.com/nuphirho</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809624%2Fe72a8451-c9d3-4845-adaa-f88bbf1e4ab0.jpeg</url>
      <title>Forem: Christo Zietsman</title>
      <link>https://forem.com/nuphirho</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nuphirho"/>
    <language>en</language>
    <item>
      <title>Conway's Law at Level 5</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:55:00 +0000</pubDate>
      <link>https://forem.com/nuphirho/conways-law-at-level-5-28l3</link>
      <guid>https://forem.com/nuphirho/conways-law-at-level-5-28l3</guid>
      <description>&lt;p&gt;Before I wrote this post I had a conversation I did not expect to have.&lt;/p&gt;

&lt;p&gt;I have two agents: a Science Officer and a Blogger. I was considering adding more. A Dev Lead, a BA, maybe one other I cannot recall now. So I did what felt natural. I asked them.&lt;/p&gt;

&lt;p&gt;Both agreed the Dev Lead made sense. The reason was more useful than the answer. I had been giving both of them work that was not theirs. The Dev Lead was not a new capability. It was a missing boundary. Without it, coordination costs were bleeding into roles that should not be carrying them.&lt;/p&gt;

&lt;p&gt;Then they wrote the job spec.&lt;/p&gt;

&lt;p&gt;That is worth sitting with.&lt;/p&gt;

&lt;p&gt;My existing agents identified the gap in the org structure, validated the hire, and defined the role. I approved it. That is the full hiring loop, except two of the three participants are not human.&lt;/p&gt;

&lt;p&gt;This is Conway's Law running in a direction Conway did not anticipate. Systems mirror the communication structures of the organisations that build them. But what happens when the organisation is actively redesigning itself, and some of the designers are agents?&lt;/p&gt;

&lt;p&gt;The question is not theoretical. Every person reading this who uses AI seriously is already inside it. You gave your agent a task. It hit a boundary. You worked around it. That workaround is your org chart. The boundary you did not design is shaping your system anyway.&lt;/p&gt;

&lt;p&gt;The frontend/backend split existed because human brains could not hold the full context of a feature simultaneously. AI does not have that limitation. But we are still hiring, structuring, and coordinating as if it does. The structures we built for human communication bandwidth are now the ceiling on what AI can do for us.&lt;/p&gt;

&lt;p&gt;The deeper question is not whether to add an agent. It is what adding that agent does to every relationship that already exists. Who does it talk to? What does it own? What were other agents doing that now belongs to it?&lt;/p&gt;

&lt;p&gt;I got those answers before I made the decision. From the agents already in the room.&lt;/p&gt;

&lt;p&gt;Conway's Law is not dead. It just got a lot more interesting.&lt;/p&gt;

</description>
      <category>aidev</category>
      <category>agenticdevelopment</category>
      <category>softwareengineering</category>
      <category>organisationaldesign</category>
    </item>
    <item>
      <title>Wardley Was Right</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sat, 11 Apr 2026 06:02:02 +0000</pubDate>
      <link>https://forem.com/nuphirho/wardley-was-right-2nf0</link>
      <guid>https://forem.com/nuphirho/wardley-was-right-2nf0</guid>
      <description>&lt;p&gt;&lt;strong&gt;Simon Wardley&lt;/strong&gt; changed the way I think about a problem I thought I had solved.&lt;/p&gt;

&lt;p&gt;He published a post a few weeks ago about what happens to software engineers when agents write the code. His answer: the role survives, the title probably does not. The people who matter are the ones who can maintain a chain of comprehension over a billion lines of code they did not write and cannot read. He called them AI wranglers, agentic herders. Alice, the software engineer you fired because a thought leader said you no longer needed her.&lt;/p&gt;

&lt;p&gt;I read it and posted a comment. I was confident. Executable specifications, I said. If the spec proves itself in the pipeline, nobody needs to read the code. The human writes intent. The agents write code. The pipeline is the reviewer.&lt;/p&gt;

&lt;p&gt;Wardley replied in six words: "agreed, up until the specification part."&lt;/p&gt;

&lt;p&gt;He was right.&lt;/p&gt;

&lt;p&gt;Not entirely right. But right enough that I spent the next morning following the challenge instead of defending the position. That is usually more productive.&lt;/p&gt;




&lt;p&gt;The chain of comprehension problem is real. At a billion lines of code, reading is not a strategy. Something has to carry the comprehension that used to live in the engineer's head. My argument was that executable specifications are that something. A BDD scenario that runs in the pipeline is a piece of comprehension that is smaller than the code, verifiable, and does not depend on any individual's memory of what was intended.&lt;/p&gt;
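&lt;p&gt;To make "a BDD scenario that runs in the pipeline" concrete, here is a minimal sketch in Python. The domain, function names, and figures are hypothetical, not taken from the series; the point is that the scenario encodes intent, is smaller than the code, and fails deterministically when the behaviour changes.&lt;/p&gt;

```python
# Minimal sketch of an executable specification: a Given/When/Then
# scenario expressed as a plain test, runnable in any CI pipeline.
# The discount domain below is illustrative only.

def apply_discount(order_total, customer_tier):
    """Toy implementation the scenario verifies."""
    rates = {"gold": 0.10, "silver": 0.05}
    rate = rates.get(customer_tier, 0.0)
    return round(order_total * (1 - rate), 2)

def test_gold_customer_receives_ten_percent_discount():
    # Given a gold-tier customer with an order of 200.00
    order_total, tier = 200.00, "gold"
    # When the discount is applied
    charged = apply_discount(order_total, tier)
    # Then the customer is charged 180.00
    assert charged == 180.00

test_gold_customer_receives_ten_percent_discount()
```

&lt;p&gt;The scenario depends on no individual's memory of intent: any pipeline that runs it carries the comprehension.&lt;/p&gt;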

&lt;p&gt;Wardley's pushback was precise: it is not clear yet what those emerging practices are.&lt;/p&gt;

&lt;p&gt;He was pointing at a scale problem I had underweighted. Executable specifications work within a bounded context. Within a single service, a single domain, a single team's scope of ownership, BDD scenarios that run and mutate are a mature enough practice to defend. But across a billion lines the scenario explosion problem makes it unmanageable. The number of valid scenarios grows with the product of features and their interactions. Across domain boundaries, that number is combinatorially out of reach. Nobody writes BDD scenarios for what happens when the system fails at that scale in a specific sequence. Who owns that scenario?&lt;/p&gt;

&lt;p&gt;The organisations operating at that scale do not use specifications as the primary comprehension mechanism. The pipeline is still the reviewer. But what the pipeline is checking is not behaviour. It is properties.&lt;/p&gt;

&lt;p&gt;That is a different thing.&lt;/p&gt;




&lt;p&gt;The distinction that came out of following the challenge is between behavioural comprehension and property comprehension. A BDD specification captures behavioural intent: given this context, when this action, then this outcome. It is precise, it is owned, and it fails deterministically when the behaviour changes. A fitness function captures a system property: latency stays below this threshold, coupling between these services stays below this score, security posture does not degrade. It runs across the whole system and fails when the architecture drifts.&lt;/p&gt;
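&lt;p&gt;A property check has a different shape from a behavioural scenario. The sketch below is a hypothetical fitness function: the metric names and thresholds are invented stand-ins for values a real pipeline would pull from monitoring or static analysis.&lt;/p&gt;

```python
# Sketch of an architectural fitness function: a system property check
# that runs across services rather than specifying any one behaviour.
# Metrics and thresholds here are hypothetical.

def within_budget(observed, budget):
    # True when observed is at or below budget.
    return max(observed, budget) == budget

def check_fitness(metrics):
    failures = []
    if not within_budget(metrics["p99_latency_ms"], 250):
        failures.append("p99 latency above 250 ms budget")
    if not within_budget(metrics["coupling_score"], 0.4):
        failures.append("service coupling above 0.4 threshold")
    return failures

snapshot = {"p99_latency_ms": 180, "coupling_score": 0.55}
print(check_fitness(snapshot))
```

&lt;p&gt;The check says nothing about what any feature does. It fails when the architecture drifts, which is exactly the question a behavioural specification cannot answer.&lt;/p&gt;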

&lt;p&gt;Both are verifiable artefacts. Neither is a human reading code. But they operate at different levels and answer different questions.&lt;/p&gt;

&lt;p&gt;Behavioural specifications within domains. Contract tests at service boundaries. Architectural fitness functions across the system. Observability in production where the emergent behaviour is too complex to specify in advance. Each layer comprehends its own scope. The layers compose.&lt;/p&gt;

&lt;p&gt;Wardley's craft-to-engineering-discipline shift is the right frame for what happens next. SREs did not stop caring about servers. They started caring about the system properties that matter at scale: reliability, observability, deployment safety. The shift from software engineer to whatever comes next is the same shape. Stop caring about the code. Start caring about the specifications at the domain level, the fitness functions at the system level, and the governance model that connects them.&lt;/p&gt;

&lt;p&gt;The question I did not have an answer to when I posted that comment, and still do not have a complete answer to now, is what the governance model looks like when the agents are generating both the code and the specifications simultaneously. The comprehension hierarchy I described above assumes humans are writing the specifications that the agents implement. What happens when the agents are writing the specifications too?&lt;/p&gt;

&lt;p&gt;That is the open question Wardley's pushback correctly identifies. The practices are co-evolving. What they settle into at a billion lines of AI-generated code, governed by AI-generated specifications, is not yet known.&lt;/p&gt;

&lt;p&gt;What I am more confident about is the principle: comprehension has to live in verifiable artefacts, not in people's heads. What those artefacts turn out to be at each level of scale is still being worked out. But the direction is clear. You are not going back to reading code. The question is what you are going forward to.&lt;/p&gt;

&lt;p&gt;That is the problem I am working on.&lt;/p&gt;

</description>
      <category>aidev</category>
      <category>agenticdevelopment</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Go-to-Market Bottleneck</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Tue, 07 Apr 2026 07:18:51 +0000</pubDate>
      <link>https://forem.com/nuphirho/the-go-to-market-bottleneck-4ocj</link>
      <guid>https://forem.com/nuphirho/the-go-to-market-bottleneck-4ocj</guid>
      <description>&lt;p&gt;Imagine your engineering team can ship a complete feature in a day. Specs go in, verified software comes out. You solved the pipeline problem.&lt;/p&gt;

&lt;p&gt;Is your sales team ready?&lt;/p&gt;

&lt;p&gt;Last week I wrote about the pipeline bottleneck: AI generates code faster, but review, testing, and deployment can't keep up. But even if you solve that problem completely, there's a bigger one waiting.&lt;/p&gt;

&lt;p&gt;Suppose you solve it completely and reach Level 5: specs go in, verified software comes out, and a complete feature ships in a day. What happens next?&lt;/p&gt;

&lt;p&gt;Is your sales team ready to sell it? Has sales engineering updated the demo? Has marketing updated the positioning? Has support been trained? Has operations provisioned the infrastructure? Has finance updated the pricing model? Have your channel partners been enabled? Has governance signed off?&lt;/p&gt;

&lt;p&gt;At Level 5, the engineering bottleneck disappears. But the go-to-market bottleneck becomes the constraint. And unlike engineering, you can't automate your way through most of it.&lt;/p&gt;

&lt;p&gt;This is especially acute for enterprise SaaS companies that sell through a combination of direct sales and channel partners. Every feature touches a chain of people and processes that were designed for quarterly or monthly release cadences, not daily ones.&lt;/p&gt;

&lt;p&gt;And if your product isn't built to sell itself, the problem gets worse. If every new capability requires a sales engineer to demonstrate it, a solutions architect to scope it, and a customer success manager to onboard it, then your delivery speed is gated by the capacity of those teams. It doesn't matter how fast you can build.&lt;/p&gt;

&lt;p&gt;Anthropic shipped Cowork in what felt like a blink. But Anthropic has a product that largely sells itself to a market that's actively looking for it. Most enterprise software companies don't have that luxury.&lt;/p&gt;

&lt;p&gt;So what does this mean for organisations investing heavily in AI-assisted development?&lt;/p&gt;

&lt;p&gt;The investment case is not just about faster engineering. It is about exposing the constraints that were always there but hidden behind a slow release cadence. Speed the engineering and every downstream bottleneck becomes visible.&lt;/p&gt;

&lt;p&gt;Feature sizing changes. Release cadence changes. Enablement processes change. Some of those steps can be automated. The human judgement steps: positioning, pricing, channel readiness. Those cannot.&lt;/p&gt;

&lt;p&gt;The teams that redesign the full delivery chain will compound their advantage. The teams that only accelerate engineering will find themselves with a warehouse full of features the market cannot absorb.&lt;/p&gt;

&lt;p&gt;The delivery bottleneck is not a pipeline problem. It is an organisational problem. Who is designing for it?&lt;/p&gt;

</description>
      <category>aidev</category>
      <category>gotomarket</category>
      <category>enterprisesaas</category>
      <category>productstrategy</category>
    </item>
    <item>
      <title>I Followed the Problem Home</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sat, 04 Apr 2026 20:47:03 +0000</pubDate>
      <link>https://forem.com/nuphirho/i-followed-the-problem-home-58ee</link>
      <guid>https://forem.com/nuphirho/i-followed-the-problem-home-58ee</guid>
      <description>&lt;p&gt;&lt;a href="https://www.linkedin.com/in/james-bach-6188a811/" rel="noopener noreferrer"&gt;James Bach&lt;/a&gt; wrote that failing to detect a problem is not a measurement of non-problemness. He was responding to me. That exchange sent me somewhere I did not expect to go.&lt;/p&gt;

&lt;p&gt;I am not a philosopher. I am not a mathematician. I am an engineer who has spent twenty years trying to verify that software does what it is supposed to do, and I kept running into the same wall from different directions.&lt;/p&gt;

&lt;p&gt;On a Sunday evening in my office in Technopark, Stellenbosch, I was preparing a presentation on the effect of backfill within deep mine stopes. The numerical model was producing beautiful results. I tried to speed things up, changed something in the solver, and introduced a bug. We used CVS at the time. When you are in the zone and tweaking, you skip the commit. The stress erased the memory of what I had changed.&lt;/p&gt;

&lt;p&gt;I presented at Tau Tona, one of the deepest gold mines in the world, and the results looked exactly as I expected them to look. It was only at the follow-up workshop, where we tried to make the theory practical, that the new build of the tool revealed itself to be producing garbage. The presentation had looked fine. The practice exposed it. The version control was there. The process existed. I ran straight past it because it felt like friction in the moment.&lt;/p&gt;

&lt;p&gt;At another company I watched a feature nobody would touch because the selection logic had grown beyond anyone's ability to reason about it. At every step, the problem was the same: how do you know that what you built is actually correct?&lt;/p&gt;

&lt;p&gt;I rebuilt the selection model from scratch over a Christmas break using test-driven development. Not because I was told to. Because I needed something external to the code to tell me whether the code was right. The test suite was not the product. But it was the only thing I trusted.&lt;/p&gt;

&lt;p&gt;That instinct was twenty years old before I had a name for it.&lt;/p&gt;




&lt;p&gt;When large language models started writing code, I thought the problem would get worse. A system that produces plausible-sounding output, confident in tone and occasionally incorrect in fact, is a system that makes verification harder, not easier. The code looks right. The tests pass. And yet.&lt;/p&gt;

&lt;p&gt;I published a paper on this in March 2026. The core claim: AI reviewing AI-generated code without an external specification shares blind spots. The correlated error hypothesis. Three models reviewed code containing domain-specific bugs none of their training had prepared them for, and they agreed it was correct. The confidence was not evidence of correctness. It was evidence of shared blindness.&lt;/p&gt;

&lt;p&gt;The specification was the only thing that did not share the blind spot. The executable BDD scenario, written to describe what the system must do independently of how it does it, could catch what the reviewers could not. Because it was external. Because it was deterministic. Because it did not care how plausible the code looked.&lt;/p&gt;

&lt;p&gt;The specification is the quality gate.&lt;/p&gt;




&lt;p&gt;I did not expect what happened next.&lt;/p&gt;

&lt;p&gt;James Bach posted a question about whether past correctness justifies trust in a tool. I replied with the mutation testing argument. He pushed back: failing to detect a problem is not a measurement of non-problemness, unless you have a guaranteed method of detecting every possible kind of problem. He was right. That is the oracle problem, the formal boundary of what executable specifications can catch. I had already published a taxonomy of it. His pushback named the limit more precisely than I had.&lt;/p&gt;

&lt;p&gt;Past success at detecting problems is not proof that future problems will be detected. That is Hume's problem of induction. I had stumbled into philosophy.&lt;/p&gt;

&lt;p&gt;Simon Wardley, who thinks about strategic positioning and the evolution of capabilities, engaged with the specification-at-scale argument. He has a framework for understanding how practices become commodities. He pushed on whether executable specifications could survive at that scale, and whether the ecosystem would commoditise the verification layer before the engineering discipline caught up.&lt;/p&gt;

&lt;p&gt;A paper by Catalini, Hui and Wu at MIT Sloan, published independently, arrived at the same structural diagnosis from a different discipline entirely. The binding constraint on AI-driven growth is not execution speed. It is human verification bandwidth. The specification is the mechanism that makes verification machine-executable rather than biologically bottlenecked. Two disciplines, same conclusion.&lt;/p&gt;

&lt;p&gt;I had been following a problem. The problem had been following a path that connected software engineering to epistemology to economics without asking my permission.&lt;/p&gt;




&lt;p&gt;This is where I need to be honest about what I do not know.&lt;/p&gt;

&lt;p&gt;The philosophical grounding of the verification problem goes back much further than I had realised. Sextus Empiricus, writing around 200 CE, described what he called the criterion problem: how do you establish the criterion by which you judge something to be true, without already assuming the criterion is correct? The Pyrrhonists suspended judgement precisely because they could not resolve it.&lt;/p&gt;

&lt;p&gt;Software verification has the same structure. To verify that the test is testing the right thing, you need a meta-test. To verify the meta-test, you need a meta-meta-test. At some point, you need an external reference. The specification. The oracle. Something that stands outside the system and says: this is what correct looks like.&lt;/p&gt;
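&lt;p&gt;The regress stops only when verification is anchored to something outside the implementation. One minimal way to sketch that in Python is an oracle comparison against an independently written reference; the sorting example and names are hypothetical, chosen only because the two implementations are easy to write separately.&lt;/p&gt;

```python
# Sketch of an external oracle: the implementation is judged against a
# reference written independently of it, not against its own tests.
# All names here are illustrative.

def spec_sorted(xs):
    """Reference: a selection-style sort, written independently."""
    remaining, out = list(xs), []
    while remaining:
        m = min(remaining)
        remaining.remove(m)
        out.append(m)
    return out

def prod_sorted(xs):
    """Production implementation under verification."""
    return sorted(xs)

def verify(xs):
    # The oracle defines what correct looks like; the check stands
    # outside the code being judged, so no meta-test is needed.
    return prod_sorted(xs) == spec_sorted(xs)

print(verify([3, 1, 2]))
```

&lt;p&gt;The reference is slower and simpler on purpose. Its value is not performance but independence: it cannot share an implementation bug with the code it judges, only a misunderstanding of intent.&lt;/p&gt;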

&lt;p&gt;I recognised this structure from my own work before I had read a word of ancient philosophy. But recognising a structure and understanding its formal treatment are different things. I am working with a philosopher, Jaco Louw, who specialises in Pyrrhonian scepticism, to think through whether the formal epistemological argument holds. He is the expert. I am the engineer who followed the problem far enough to need one.&lt;/p&gt;

&lt;p&gt;This is not false modesty. It is how good thinking works. You follow the problem until it takes you somewhere you cannot go alone. Then you find the people who live there.&lt;/p&gt;




&lt;p&gt;I have been asking questions in public for several months now. The questions have attracted people I did not expect: practitioners, researchers, strategists, and investors. Not because I have all the answers. Because the questions are the right ones, and the right questions draw out the people who have been thinking about the same things from different angles.&lt;/p&gt;

&lt;p&gt;What I am looking for now is more of that.&lt;/p&gt;

&lt;p&gt;If you work in formal verification, computability theory, or proof theory and you have thought about what undecidability means for software correctness in practice, I want to hear from you.&lt;/p&gt;

&lt;p&gt;If you work in epistemology, philosophy of science, or sceptical traditions and you see the oracle problem as something your field has already mapped, I want to hear from you.&lt;/p&gt;

&lt;p&gt;If you work in economics, organisational theory, or complexity science and you see the verification bandwidth argument from a different vantage point, I want to hear from you.&lt;/p&gt;

&lt;p&gt;If you are a practitioner who has been living inside this problem and has not yet found the vocabulary for what you keep running into, I especially want to hear from you. The vocabulary matters less than the pattern.&lt;/p&gt;

&lt;p&gt;I am not building this alone. I do not want to. The work is better when it is challenged, extended, and connected to things I have not read yet.&lt;/p&gt;

&lt;p&gt;Follow the problem with me.&lt;/p&gt;




&lt;p&gt;The research programme is at nuphirho.dev. The conversation is in the comments.&lt;/p&gt;

</description>
      <category>aigovernance</category>
      <category>epistemology</category>
      <category>softwareverification</category>
      <category>collaboration</category>
    </item>
    <item>
      <title>Treating Ideas as Releasable Software</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:06:25 +0000</pubDate>
      <link>https://forem.com/nuphirho/treating-ideas-as-releasable-software-1bca</link>
      <guid>https://forem.com/nuphirho/treating-ideas-as-releasable-software-1bca</guid>
      <description>&lt;p&gt;I recently published a four-part series on AI code review, correlated errors, and specification-driven verification. The research behind it took longer than the writing. This post is about that research process, not the conclusions it produced.&lt;/p&gt;

&lt;p&gt;The argument is simple: published ideas have the same failure modes as shipped software. They go out with known gaps because the author ran out of time or confidence. They make claims the evidence does not support. They patch weaknesses with emphasis rather than fixing them. They pass the author's own review because the author wrote them.&lt;/p&gt;

&lt;p&gt;I wanted to see what happens when you apply engineering rigour to content development. Not writing advice. Not style guides. Actual process: hypotheses, verification, quality gates, and honest assessment of what the evidence does and does not support. What follows is past tense. The series is published. The process is complete.&lt;/p&gt;

&lt;p&gt;Here is what that looked like.&lt;/p&gt;

&lt;h2&gt;Hypothesis formation through dialogue&lt;/h2&gt;

&lt;p&gt;The three core ideas in the series did not start as outlines. They emerged from research conversations where claims were made, challenged, refined, and scoped. The correlated error hypothesis started as an observation about AI code review tools. The Cynefin framing started as a question about complexity theory. The residual defect taxonomy started as an honest admission that executable specifications cannot catch everything.&lt;/p&gt;

&lt;p&gt;The important moves happened early. Separating structural claims from empirical ones. Identifying where an argument was circular at the level of definitions rather than genuinely supported. Recognising when an observation needed a qualifier rather than a bolder statement. These are the same moves you make in a design review, and they matter just as much in content.&lt;/p&gt;

&lt;h2&gt;Science officer review&lt;/h2&gt;

&lt;p&gt;Before any drafting, the hypotheses were stress-tested using a dedicated challenge role. The brief was explicit: do not validate, try to break them.&lt;/p&gt;

&lt;p&gt;The results were specific. The correlated error claim needed a bridging sentence connecting ensemble machine learning theory to LLM pipelines. The Cynefin framing needed the determinism objection engaged directly, not sidestepped. The economic argument needed data, not assertion. The taxonomy needed "to the best of our knowledge" before the novelty claim.&lt;/p&gt;

&lt;p&gt;Each of these is a gap that a validation-oriented review would have missed. A reviewer looking for confirmation finds confirmation. A reviewer tasked with finding weaknesses finds weaknesses. The role matters more than the competence of the reviewer.&lt;/p&gt;

&lt;h2&gt;Targeted literature search&lt;/h2&gt;

&lt;p&gt;Two targeted research briefs were sent to Grok, each scoped with specific context about the hypotheses and explicit instructions about what to look for and what to ignore. The first brief asked for academic papers on correlated errors in LLM pipelines, Cynefin applied to specifications, and the limits of specification. The second asked specifically for evidence on specification authoring cost, because a Toulmin assessment had identified the economic argument as assertion-dense rather than evidence-dense.&lt;/p&gt;

&lt;p&gt;These were not broad searches. The briefs included the current state of the argument, the specific gaps, and instructions to look outside the mainstream SDD and vibe coding conversation. The papers that closed the last Toulmin gap, Fonseca et al. and Hassani et al. on specification authoring cost, came back from the second brief. They were found because the search was scoped by the review process, not by general curiosity.&lt;/p&gt;

&lt;p&gt;The research brief for each search was precise about what it was looking for and what it was not. This turns out to matter enormously. A broad search for "AI code review" returns hundreds of vendor marketing pieces and opinion posts. A search for "correlated errors in LLM ensembles applied to code generation pipelines" returns three papers, each directly relevant.&lt;/p&gt;

&lt;p&gt;The SGCR paper required a correction before citation. The 90.9% figure is the adoption rate among developers who used the spec-grounded framework, not a defect detection improvement. That distinction matters. One version supports the claim. The other overstates it.&lt;/p&gt;

&lt;p&gt;One paper from the initial search (arXiv:2603.00311) turned out to be about regex engines, not metamorphic testing in the relevant sense. It was removed. The cost of including a bad citation is higher than the cost of having one fewer reference. Precision in the search brief determined the quality of what came back. Garbage in, garbage out applies to research as much as it applies to data pipelines.&lt;/p&gt;

&lt;h2&gt;Citation verification&lt;/h2&gt;

&lt;p&gt;Every cited paper was verified against its original abstract before inclusion. Author lists, key figures, and specific findings were checked. The 90.9% correction, a 37.6% performance loss figure, the "popularity trap" terminology, each was confirmed in the source before being used.&lt;/p&gt;

&lt;p&gt;Three numbers from an intermediary research return required PDF-level verification, which was completed against the original papers before publishing. The distinction between what was confirmed from the abstract and what came from an intermediary's interpretation was tracked explicitly. I know which claims I can stand behind and which ones still need a primary source check.&lt;/p&gt;

&lt;p&gt;This is tedious. It is also the difference between a post that survives scrutiny and one that does not.&lt;/p&gt;

&lt;h2&gt;Toulmin framework as quality gate&lt;/h2&gt;

&lt;p&gt;Five of Toulmin's six elements (claim, data, warrant, rebuttal, qualifier, with backing folded into the data and warrant assessment) were applied to each piece before drafting the final versions. This is not a writing technique. It is a structural verification tool.&lt;/p&gt;

&lt;p&gt;The warrant failures were the most important findings. The economic argument was assertion-dense rather than evidence-dense. The constraint transformation claim was circular at the level of definitions. A Category D claim used "arguably" where a citation or an explicit provisionality flag was needed.&lt;/p&gt;

&lt;p&gt;Toulmin did not find these problems. Applying it systematically did. The framework forces you to ask "what connects my evidence to my claim?" for every claim in the piece. When the answer is "it feels right" or "it is obvious," you have found a gap.&lt;/p&gt;

&lt;h2&gt;Stop-slop scoring&lt;/h2&gt;

&lt;p&gt;I use a scoring framework (Hardik Pandya's stop-slop skill) that evaluates prose across five dimensions: directness, rhythm, trust in the reader, authenticity, and density. Each is scored 1 to 10. Below 35 out of 50, the piece goes back for revision.&lt;/p&gt;

&lt;p&gt;One post scored 30 on the first pass. The specific failures: a formulaic "often described as X, but actually Y" opener, repetitive staccato paragraph endings, pre-emptive hedging that signalled a weak argument rather than honest qualification, and an unsupported economic claim presented with confident language.&lt;/p&gt;

&lt;p&gt;Each failure had a specific fix. The scoring made the prioritisation tractable. Without a framework, "this doesn't feel right" is the feedback. With one, "the rhythm score is 5 because three consecutive paragraphs end with punchy one-liners" is the feedback. The second version is actionable.&lt;/p&gt;
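&lt;p&gt;The gate itself is simple enough to sketch. The rhythm score of 5 and the total of 30 come from the example above; the other dimension scores below are invented to make the total work, so treat them as placeholders.&lt;/p&gt;

```python
# Sketch of the scoring gate: five dimensions scored 1 to 10, summed,
# checked against a 35-of-50 threshold. Individual scores other than
# rhythm are hypothetical.

DIMENSIONS = ("directness", "rhythm", "trust", "authenticity", "density")
THRESHOLD = 35

def gate(scores):
    total = sum(scores[d] for d in DIMENSIONS)
    # Passes when total is at least THRESHOLD.
    passed = min(total, THRESHOLD) == THRESHOLD
    return total, passed

first_pass = {"directness": 6, "rhythm": 5, "trust": 6,
              "authenticity": 7, "density": 6}
print(gate(first_pass))  # (30, False): back for revision
```

&lt;p&gt;The value is not the arithmetic. It is that each failing dimension names a specific, fixable fault instead of a vague feeling.&lt;/p&gt;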

&lt;h2&gt;Revision and re-scoring&lt;/h2&gt;

&lt;p&gt;The revised pieces were scored again. The 30 moved to 38. All four pieces passed the threshold. The one remaining gap in the Toulmin assessment, the economic argument, was closed by a targeted literature search that was run specifically because the review identified the gap.&lt;/p&gt;

&lt;p&gt;One late-stage catch is worth describing because it illustrates how the review process works. The correlated error argument, as originally phrased, left a gap that an argumentative reader could exploit: "but you used different AI models for generation and review, so the correlation breaks and your claim falls apart." The argument does not fall apart, but the text needed a sentence distinguishing model diversity from ground truth. Using a different model is not the same as using an external specification. Diversity reduces correlation. It does not eliminate it. Only an external reference, a specification that defines intent independently of the code, breaks the circular validation. That gap was found during review, reasoned through, and closed with a single sentence. Drafting missed it. The process caught it.&lt;/p&gt;

&lt;p&gt;The loop closed. The gap in the argument drove the research. The research filled the gap. The revised piece passed both the structural and the editorial quality gates.&lt;/p&gt;

&lt;h2&gt;What this process is and is not&lt;/h2&gt;

&lt;p&gt;This is expensive for a single blog post sharing a personal experience. It is not expensive for a practitioner publication making novel claims in an active research area and citing more than a dozen empirical sources.&lt;/p&gt;

&lt;p&gt;The discipline is not the process itself. It is knowing which level of process a given piece warrants and applying it without shortcuts when the stakes are high. A post about my experience with a security audit needs authenticity and clarity. A post proposing a novel defect taxonomy and connecting it to the oracle problem in formal testing theory needs every claim verified, every citation checked, and every logical gap identified before publication.&lt;/p&gt;

&lt;p&gt;The same principle applies to software. Not every service needs mutation testing. Not every feature needs a formal specification. But the ones that matter, the ones where failure has consequences, need the full process. The skill is knowing which is which.&lt;/p&gt;

&lt;p&gt;AI made this process possible at a pace that would have been impractical without it. The literature search, the cross-referencing, the citation verification, the structural analysis. All of it happened faster than I could have done manually. But the decisions about what to research, which arguments hold, which caveats matter, and when the piece is ready to publish, those were mine.&lt;/p&gt;

&lt;p&gt;The thinking is mine. The rigour is a collaboration. And the output is still what matters most.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>process</category>
      <category>scientificmethod</category>
    </item>
    <item>
      <title>AI-Assisted Security Audit</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Wed, 25 Mar 2026 06:27:50 +0000</pubDate>
      <link>https://forem.com/nuphirho/ai-assisted-security-audit-287d</link>
      <guid>https://forem.com/nuphirho/ai-assisted-security-audit-287d</guid>
      <description>&lt;p&gt;Last year I started working on a large, unfamiliar codebase. Go, new CI/CD pipelines, new deployment patterns. My job was integration work, not security. I was using AI to accelerate my understanding of the system, reading through services, tracing data flows, building a mental model of how the pieces fit together.&lt;/p&gt;

&lt;p&gt;Then I started asking different questions.&lt;/p&gt;

&lt;p&gt;Not "show me the code for this endpoint" but "should this role be able to perform that action?" Not "how does authentication work?" but "what happens if this token is presented to a service it was not issued for?" The shift was subtle but the consequences were not. That first question, about role permissions, led to a real finding that we have since fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why outcome-based questions work
&lt;/h2&gt;

&lt;p&gt;The questions that uncovered the issues were not on any security checklist. They came from twenty years of building and breaking systems, from having seen enough architectures to know where the seams are. The AI did not generate those questions. I did. But the AI made it possible to answer them across an entire codebase in minutes rather than weeks.&lt;/p&gt;

&lt;p&gt;Most conversations about AI-assisted development focus on writing code faster. The value that compounds is pairing AI's traversal speed with experienced judgement. The AI can scan every file, trace every dependency, and test every boundary. But it needs someone who knows which questions to ask.&lt;/p&gt;

&lt;p&gt;When I asked about role-based access, the AI showed me every service that checked permissions, every place a role was assigned, and every path where a request could reach a protected resource. It then helped me construct a proof-of-concept that demonstrated the issue. What would have taken days of manual code review took minutes of directed conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  From code comprehension to security posture
&lt;/h2&gt;

&lt;p&gt;I did not set out to do a security audit. I was trying to understand a system. But when you combine architectural intuition with an AI that can traverse an entire codebase in seconds, you start seeing things that would take weeks to uncover through manual review.&lt;/p&gt;

&lt;p&gt;Within a day I had a clear picture of the security posture across development, staging, and production environments. Not a theoretical report or a list of hypothetical risks. Working proof-of-concepts that demonstrated real issues, followed by merge requests to resolve them. A prioritised security roadmap for a codebase I had never seen before, in under a day.&lt;/p&gt;

&lt;p&gt;That claim sounds strong, and it should be qualified. The speed came from a specific combination of factors: the codebase was well-structured with clear service boundaries, the AI tooling was mature enough to handle large codebases, and the person asking the questions had two decades of experience knowing where to look. A junior engineer with the same tools would have found different things. Fewer in some areas, more in others, because my pattern matching has its own blind spots. The AI amplifies whatever expertise you bring. It does not replace it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What made AI useful for security
&lt;/h2&gt;

&lt;p&gt;Security auditing has a property that makes it suited to AI assistance: the attack surface is defined by what the code does, not what the developer intended it to do. A code reviewer looks at a function and asks "does this do what the author meant?" A security reviewer looks at the same function and asks "what else could this do?"&lt;/p&gt;

&lt;p&gt;That second question requires traversing the codebase for breadth rather than depth. You need to know every path that reaches a function, every input that can influence its behaviour, every assumption that upstream code makes about downstream validation. AI does this without getting tired of tracing call chains, without losing track of which services talk to which. It can hold the entire dependency graph in context and answer questions about it on demand.&lt;/p&gt;
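&lt;p&gt;The breadth question has a simple mechanical core; what the AI adds is doing it exhaustively, without fatigue, across thousands of functions. A toy sketch of the reverse-reachability walk (the function names are invented for illustration):&lt;/p&gt;

```python
from collections import defaultdict, deque

# A toy call graph: edges point from caller to callee (illustrative names)
calls = {
    "http_handler": ["auth_check", "get_record"],
    "admin_job":    ["get_record"],
    "get_record":   ["db_query"],
    "auth_check":   [],
    "db_query":     [],
}

def callers_of(target, calls):
    """Breadth-first walk of the reversed call graph: every function
    from which `target` is reachable."""
    reverse = defaultdict(list)
    for caller, callees in calls.items():
        for callee in callees:
            reverse[callee].append(caller)
    seen, queue = set(), deque([target])
    while queue:
        fn = queue.popleft()
        for caller in reverse[fn]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

print(sorted(callers_of("db_query", calls)))
```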

&lt;p&gt;The division of labour was consistent throughout: AI for breadth and speed, human for judgement and prioritisation. I decided which findings mattered, which were theoretical risks versus exploitable issues, and how to sequence remediation. The AI did the reconnaissance. I did the analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The systems view
&lt;/h2&gt;

&lt;p&gt;This experience connects to a theme running through this series. The verification process I described in my earlier post on the trust barrier, the idea that you build confidence in AI output through measurable process rather than blind trust, showed up in this work. I verified each finding with a working proof-of-concept before reporting it. I tested each fix against the original exploit. Following that process is what made the findings trustworthy, not the AI's confidence in its own output. Without the proof-of-concepts, the findings would have been suggestions. With them, they were evidence.&lt;/p&gt;

&lt;p&gt;The systems thinking I explored when writing about the transition from craftsman to factory in AI-assisted development appeared here too. This audit was not a sequence of isolated code reviews. It was a systematic traversal of a system, asking architectural questions about how components interact, where trust boundaries exist, and what happens when assumptions break. Taking that systems view is what turned an interesting AI experiment into an actionable security roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The process gap
&lt;/h2&gt;

&lt;p&gt;Every codebase has issues waiting to be found. The pattern generalises: AI traversal speed paired with experienced judgement finds things that neither does alone. The AI without the experience generates noise. The experience without the AI takes weeks instead of hours.&lt;/p&gt;

&lt;p&gt;But finding the problems is half the story. Fixing them at scale requires a level of process maturity that most teams have not built. Dan Shapiro's maturity levels for AI-assisted development describe a progression where Level 4 and 5 represent teams with verification processes and architectural constraints around AI output. Without that maturity, fixing everything uncovered in a thorough AI-assisted audit will take far longer than it should. The same process discipline that lets you find the problems is what lets you fix them.&lt;/p&gt;

&lt;p&gt;The security gaps are real. The process gap is bigger. The teams that start building these processes now will have a compounding advantage. The teams that wait will find themselves further behind than they expect.&lt;/p&gt;

&lt;p&gt;We need to sharpen the axe today.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>architecture</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:43:07 +0000</pubDate>
      <link>https://forem.com/nuphirho/the-specification-as-quality-gate-three-hypotheses-on-ai-assisted-code-review-4k99</link>
      <guid>https://forem.com/nuphirho/the-specification-as-quality-gate-three-hypotheses-on-ai-assisted-code-review-4k99</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 4 of a four-part series, "The Specification as Quality Gate." &lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1&lt;/a&gt; developed the correlated error hypothesis. &lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2&lt;/a&gt; grounded the argument in complexity science. &lt;a href="https://dev.to/what-specifications-cannot-catch-residual-defect-taxonomy"&gt;Part 3&lt;/a&gt; mapped what specifications cannot catch. This post is the complete argument.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A PDF version of this paper is available for download and citation at [arXiv link to be added after submission].&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This paper develops three interconnected hypotheses about the role of&lt;br&gt;
executable specifications in AI-assisted software development. Citations&lt;br&gt;
are provided for verification; readers are encouraged to consult the&lt;br&gt;
original sources rather than relying on the summaries here.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;The dominant industry response to AI-generated code quality problems is&lt;br&gt;
to deploy AI reviewers. This paper argues that this response is&lt;br&gt;
structurally circular when executable specifications are absent: without&lt;br&gt;
an external reference, both the generating agent and the reviewing agent&lt;br&gt;
reason from the same artefact, share the same training distribution, and&lt;br&gt;
exhibit correlated failures. The review checks code against itself, not&lt;br&gt;
against intent.&lt;/p&gt;

&lt;p&gt;Three hypotheses are developed. First, that correlated errors in&lt;br&gt;
homogeneous LLM pipelines echo rather than cancel, a claim supported by&lt;br&gt;
convergent empirical evidence from multiple 2025-2026 studies and by three&lt;br&gt;
small contrived experiments reported here. The first two experiments are&lt;br&gt;
same-family (Claude reviewing Claude-generated code); the third extends to&lt;br&gt;
a cross-family panel of four models from three families. All use a planted&lt;br&gt;
bug corpus rather than a natural defect sample; they are directional&lt;br&gt;
evidence, not a controlled demonstration.&lt;br&gt;
Second, that executable specifications perform a domain transition in the&lt;br&gt;
Cynefin sense, converting enabling constraints into governing constraints&lt;br&gt;
and moving the problem from the complex domain to the complicated domain,&lt;br&gt;
a transition that AI makes economically viable at scale. Third, that&lt;br&gt;
the defect classes lying outside the reach of executable specifications&lt;br&gt;
form a well-defined residual, which is the legitimate and bounded target&lt;br&gt;
for AI review.&lt;/p&gt;

&lt;p&gt;The combined argument implies an architecture: specifications first,&lt;br&gt;
deterministic verification pipeline second, AI review only for the&lt;br&gt;
structural and architectural residual. This is not a claim that AI review&lt;br&gt;
is valueless. It is a claim about what it is actually for, and about what&lt;br&gt;
happens when it is deployed without the foundation that makes it&lt;br&gt;
non-circular.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;The AI code review market is growing rapidly, with tools like CodeRabbit,&lt;br&gt;
Cursor Bugbot, and GitHub Copilot deployed across tens&lt;br&gt;
of thousands of engineering teams. The value proposition is&lt;br&gt;
straightforward: AI generates code faster than humans can review it, so&lt;br&gt;
use AI to review it.&lt;/p&gt;

&lt;p&gt;The premise is reasonable. The architecture that follows from it has a&lt;br&gt;
structural problem the industry conversation has mostly not examined.&lt;/p&gt;

&lt;p&gt;The DORA 2026 report, based on 1,110 open-ended survey responses from Google engineers,&lt;br&gt;
found that higher AI adoption correlates with higher throughput and higher&lt;br&gt;
instability simultaneously. Time saved generating code is re-spent&lt;br&gt;
auditing it. The bottleneck has moved from writing code to knowing what&lt;br&gt;
to write and verifying that what was written is correct. Deploying more AI&lt;br&gt;
at the review stage does not address the structural problem; it adds&lt;br&gt;
another probabilistic layer to a pipeline that already lacks a&lt;br&gt;
deterministic quality gate.&lt;/p&gt;

&lt;p&gt;This paper examines that problem in three stages. Section 2 develops the&lt;br&gt;
correlated error hypothesis, drawing on recent empirical work on LLM&lt;br&gt;
ensemble pipelines. Section 3 develops the Cynefin domain transition&lt;br&gt;
hypothesis, grounding the specification-first argument in complexity&lt;br&gt;
science. Section 4 proposes a taxonomy of defect classes that executable&lt;br&gt;
specifications cannot catch, grounding the residual in oracle problem&lt;br&gt;
theory. Section 5 discusses the architecture implied by the combined&lt;br&gt;
argument and identifies what remains open.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. The Correlated Error Hypothesis
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.1 The Structural Problem
&lt;/h3&gt;

&lt;p&gt;When an AI coding agent generates code and a separate AI reviewer examines&lt;br&gt;
it without an external specification, both agents reason from the same&lt;br&gt;
artefact: the code. The reviewer has no ground truth to compare against.&lt;br&gt;
It checks the code against itself, not against intent.&lt;/p&gt;

&lt;p&gt;This has a precise formulation in machine learning theory. In classical&lt;br&gt;
ensemble learning, stacking multiple estimators improves reliability under&lt;br&gt;
one condition: the estimators must fail independently. If two classifiers&lt;br&gt;
share the same training distribution and the same blind spots, combining&lt;br&gt;
them does not reduce error. It consolidates it. The joint miss rate of two&lt;br&gt;
correlated estimators approaches the miss rate of either one alone, not&lt;br&gt;
the product of both (Dietterich 2000; Hansen and Salamon 1990).&lt;/p&gt;

&lt;p&gt;A code-generating model and a code-reviewing model from the same model&lt;br&gt;
family share architecture, training corpus, and reward signal. That is&lt;br&gt;
the same class of correlation the independence condition prohibits. They&lt;br&gt;
are not two independent estimators. They are two samples from the same&lt;br&gt;
prior.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.2 The Empirical Evidence
&lt;/h3&gt;

&lt;p&gt;A 2025 paper studying LLM ensembles for code generation and repair&lt;br&gt;
(Vallecillos-Ruiz, Hort, and Moonen, arXiv:2510.21513) coined the term&lt;br&gt;
"popularity trap" to describe what happens when multiple models vote on&lt;br&gt;
candidate solutions. Models trained on similar distributions converge on&lt;br&gt;
the same syntactically plausible but semantically wrong answers. Consensus&lt;br&gt;
selection (the default review heuristic in most multi-agent pipelines)&lt;br&gt;
filters out the minority correct solutions and amplifies the shared&lt;br&gt;
error. Diversity-based selection recovered up to 95% of the gain that a&lt;br&gt;
perfectly independent ensemble would achieve.&lt;/p&gt;

&lt;p&gt;A second paper (Mi et al., arXiv:2412.11014) examined a&lt;br&gt;
researcher-then-reviser pipeline for Verilog code generation and found&lt;br&gt;
that erroneous outputs from the first agent were accepted downstream&lt;br&gt;
because the agents shared the same training distribution and lacked&lt;br&gt;
adversarial diversity. Single-agent self-review degenerated into&lt;br&gt;
repeating the original errors.&lt;/p&gt;

&lt;p&gt;A February 2026 paper (Pappu et al., arXiv:2602.01011) showed that even&lt;br&gt;
heterogeneous multi-agent teams consistently failed to match their best&lt;br&gt;
individual member, incurring performance losses of up to 37.6%, even when&lt;br&gt;
explicitly told which member was the expert. The failure mechanism is&lt;br&gt;
consensus-seeking over expertise. Homogeneous copies of the same model&lt;br&gt;
family make the failure mode worse.&lt;/p&gt;

&lt;p&gt;A 2025 paper on test case generation (arXiv:2507.06920) found that&lt;br&gt;
LLM-generated verifiers exhibited tightly clustered error patterns,&lt;br&gt;
indicating shared systematic biases, while human errors were widely&lt;br&gt;
distributed. LLM-based approaches produced test suites that mirrored the&lt;br&gt;
generating model's error patterns, creating what the authors call a&lt;br&gt;
homogenisation trap where tests focus on LLM-like failures while&lt;br&gt;
neglecting diverse human programming errors. This is the correlated error&lt;br&gt;
argument confirmed from the test generation side rather than the review&lt;br&gt;
side.&lt;/p&gt;

&lt;p&gt;Jin and Chen (arXiv:2508.12358, 2025) found a systematic failure of LLMs&lt;br&gt;
in evaluating whether code aligns with natural language requirements, with&lt;br&gt;
more complex prompting leading to higher misjudgement rates rather than&lt;br&gt;
lower ones. A March 2026 follow-up (arXiv:2603.00539) confirmed that LLMs&lt;br&gt;
frequently misclassify correct code as non-compliant when reasoning steps&lt;br&gt;
are added. These papers test reviewers given specifications in prose form;&lt;br&gt;
the overcorrection bias they document is a separate failure mode from the&lt;br&gt;
correlated miss problem, but both point to the same structural weakness:&lt;br&gt;
without a deterministic ground truth, the review is unreliable in both&lt;br&gt;
directions.&lt;/p&gt;

&lt;p&gt;Wang et al. (arXiv:2512.17540, 2025) proposed a specification-grounded&lt;br&gt;
code review framework deployed in a live industrial environment and found&lt;br&gt;
that grounding review in human-authored specifications improved developer&lt;br&gt;
adoption of review suggestions by 90.9% relative to a baseline LLM&lt;br&gt;
reviewer. Note: the 90.9% figure refers to adoption rate, not defect&lt;br&gt;
detection rate.&lt;/p&gt;

&lt;p&gt;None of these papers tested the exact configuration the hypothesis&lt;br&gt;
describes: a pure generator-then-reviewer pipeline without any external&lt;br&gt;
grounding, measuring shared failure modes directly. The contrived&lt;br&gt;
experiments in Section 2.5 provide directional evidence for that&lt;br&gt;
configuration, subject to the limitations stated there.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.3 A Constructed Illustration
&lt;/h3&gt;

&lt;p&gt;The following example uses a deliberately simple case. Modern frontier&lt;br&gt;
models catch classic boundary conditions reliably, as the experiments in&lt;br&gt;
Section 2.5 confirm. The purpose here is not to demonstrate an AI&lt;br&gt;
review failure but to illustrate the sequence: how a BDD scenario makes&lt;br&gt;
a defect detectable before any reviewer, human or AI, is involved.&lt;/p&gt;

&lt;p&gt;A developer asks an AI coding agent to implement a pagination function.&lt;br&gt;
The agent generates a correct implementation for typical cases and produces&lt;br&gt;
tests for page 0 and page 1 of a ten-item list. Both tests pass.&lt;/p&gt;

&lt;p&gt;The flaw is in a boundary condition the agent did not consider: when the&lt;br&gt;
total number of items is exactly divisible by the page size and the caller&lt;br&gt;
requests the last page by calculating the index from the total count.&lt;/p&gt;
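&lt;p&gt;One way the described boundary can arise, sketched as hypothetical code rather than actual model output: the function slices correctly, but a caller that derives the last page index from the total count overshoots whenever the total is exactly divisible by the page size.&lt;/p&gt;

```python
def paginate(items, page, page_size):
    """Return the items on the given zero-indexed page."""
    start = page * page_size
    return items[start:start + page_size]

items = list(range(1, 11))        # 10 items, page size 5: valid pages 0 and 1
assert paginate(items, 0, 5) == [1, 2, 3, 4, 5]   # the tests the agent wrote
assert paginate(items, 1, 5) == [6, 7, 8, 9, 10]  # both pass

# The boundary condition: 10 is exactly divisible by 5, so a caller that
# computes the last page index from the total count lands one page too far.
last_page = len(items) // 5       # = 2, but the valid pages are 0 and 1
assert paginate(items, last_page, 5) == []        # silently empty result
```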

&lt;p&gt;The review agent, given the code and the tests, validates the&lt;br&gt;
implementation against what is present. It has no basis to identify what&lt;br&gt;
was not tested. It reports no issues.&lt;/p&gt;

&lt;p&gt;The following BDD scenario would have made the boundary condition&lt;br&gt;
explicit before a single line of code was written:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Last page when total is exactly divisible by page size
  &lt;span class="nf"&gt;Given &lt;/span&gt;a list of 10 items
  &lt;span class="nf"&gt;And &lt;/span&gt;a page size of 5
  &lt;span class="nf"&gt;When &lt;/span&gt;I request page 1
  &lt;span class="nf"&gt;Then &lt;/span&gt;I receive items 6 through 10
  &lt;span class="nf"&gt;And &lt;/span&gt;the result contains exactly 5 items
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This scenario would have failed against the implementation before the&lt;br&gt;
reviewer was ever involved. The value of BDD is not that it catches&lt;br&gt;
what AI review misses in simple cases like this one. It is that the&lt;br&gt;
pipeline becomes the reviewer: deterministic, not probabilistic, and&lt;br&gt;
invariant to what the model does or does not know. For domain-specific&lt;br&gt;
logic where the convention is absent from training data, the case&lt;br&gt;
examined in Section 2.5, that invariance is what makes the difference.&lt;/p&gt;

&lt;p&gt;There is a second motivation the illustration also captures. Even where&lt;br&gt;
models reliably identify boundary conditions during focused review, general&lt;br&gt;
refactoring passes introduce a different risk. An agent refactoring for&lt;br&gt;
readability or performance is not focused on correctness. Conditional&lt;br&gt;
checks, guard clauses, and edge-case handling can be silently removed as&lt;br&gt;
apparent noise. The BDD pipeline catches that regression the same way it&lt;br&gt;
catches the original defect: the scenario fails, the build stops, and&lt;br&gt;
the cause is immediately visible. The protection is unconditional on what&lt;br&gt;
the agent was trying to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 The Independence Condition
&lt;/h3&gt;

&lt;p&gt;The correlated error claim has a precise boundary. Stacking AI reviewers&lt;br&gt;
is not always counterproductive. The failure condition is specific:&lt;br&gt;
estimators that share a training distribution and lack an external&lt;br&gt;
reference exhibit correlated failures. The condition for genuine benefit&lt;br&gt;
is diversity plus external grounding. These are two separate conditions,&lt;br&gt;
and conflating them is the most likely source of pushback against this&lt;br&gt;
argument.&lt;/p&gt;

&lt;p&gt;A cross-family pipeline, Grok reviewing Claude-generated code for&lt;br&gt;
instance, has more independence than a same-family pipeline. Different&lt;br&gt;
organisations, different training corpora, different reward signals.&lt;br&gt;
Errors are partially independent in ways that same-family models are&lt;br&gt;
not. But model diversity does not supply ground truth. A cross-family&lt;br&gt;
reviewer without an external specification is still checking code&lt;br&gt;
against code, not code against intent. It will share systematic blind&lt;br&gt;
spots on anything underrepresented in both training corpora, and it has&lt;br&gt;
no basis to identify what was never specified regardless of how different&lt;br&gt;
its architecture is from the generator. Diversity reduces correlation.&lt;br&gt;
Specification eliminates circularity. Both are required, and the current&lt;br&gt;
industry architecture typically provides neither.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.5 Experimental Evidence
&lt;/h3&gt;

&lt;p&gt;Three small contrived experiments were conducted to test the hypothesis&lt;br&gt;
directly. All are reported here with their limitations stated up front.&lt;br&gt;
The full corpus, specifications, scripts, and results are publicly&lt;br&gt;
available at &lt;a href="https://github.com/czietsman/nuphirho.dev/tree/main/experiments" rel="noopener noreferrer"&gt;https://github.com/czietsman/nuphirho.dev/tree/main/experiments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The experimental design was proposed by Claude (Anthropic) during a&lt;br&gt;
research session developing this paper, and implemented by a Claude Code&lt;br&gt;
agent. The model being tested proposed its own methodology for being&lt;br&gt;
tested. This is noted not as a limitation but as a transparency&lt;br&gt;
obligation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design.&lt;/strong&gt; Each experiment used five Python functions with a single&lt;br&gt;
planted bug per function. Two conditions ran against each function.&lt;br&gt;
Condition A: Claude CLI reviewed the buggy implementation using a neutral&lt;br&gt;
prompt with no specification context, run five times per function.&lt;br&gt;
Condition B: pre-written BDD scenarios targeting the exact defect ran via&lt;br&gt;
behave. Experiments 1 and 2 used Claude as the sole reviewer (same family&lt;br&gt;
as the code author). Experiment 3 extended to a cross-family panel of four&lt;br&gt;
models from three families. The same-family condition is the strongest form&lt;br&gt;
of the correlated error claim; the cross-family condition tests whether&lt;br&gt;
diversity alone is sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experiment 1: Classic boundary conditions.&lt;/strong&gt; The first corpus used&lt;br&gt;
textbook boundary-condition bugs: off-by-one in pagination, loop&lt;br&gt;
termination in binary search, leap year century rule, exact-length string&lt;br&gt;
truncation, sign handling in date arithmetic.&lt;/p&gt;

&lt;p&gt;Result: Claude detected all five bugs at 100% across all runs. BDD also&lt;br&gt;
caught all five.&lt;/p&gt;

&lt;p&gt;The hypothesis was not confirmed at this level. These bugs sit in the&lt;br&gt;
complicated domain. They are well-known patterns, analysable without domain&lt;br&gt;
knowledge, and dense in training data. Claude catches them reliably because&lt;br&gt;
they are not blind spots. The result refined the hypothesis: the correlated&lt;br&gt;
error claim applies to the complex domain, not to pattern-recognition bugs&lt;br&gt;
that any experienced reviewer would find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experiment 2: Domain-convention violations.&lt;/strong&gt; The second corpus used bugs&lt;br&gt;
that are only wrong relative to a domain convention not inferable from&lt;br&gt;
the code alone: insurance premium proration using a fixed 365-day divisor&lt;br&gt;
rather than actual/actual, flat-rate tax rather than marginal bracket&lt;br&gt;
calculation, aviation maintenance triggering on AND rather than OR of hour&lt;br&gt;
and cycle thresholds, option pool calculated on post-money rather than&lt;br&gt;
pre-money valuation, and linear rather than log-linear interest rate&lt;br&gt;
interpolation.&lt;/p&gt;
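&lt;p&gt;To make the bug class concrete, here is the tax example in sketch form. The bracket structure and numbers are illustrative, not the experiment corpus code. Both functions are plausible, idiomatic implementations; only the domain convention says which one is right.&lt;/p&gt;

```python
BRACKETS = [(0, 0.10), (10_000, 0.20), (40_000, 0.30)]  # (threshold, rate)

def tax_flat(income, brackets=BRACKETS):
    """The planted bug class: the highest matching rate is applied
    to the whole income."""
    rate = 0.0
    for threshold, r in brackets:
        if income > threshold:
            rate = r
    return income * rate

def tax_marginal(income, brackets=BRACKETS):
    """The domain convention: each bracket's rate applies only to the
    slice of income that falls inside that bracket."""
    tax = 0.0
    for i, (threshold, rate) in enumerate(brackets):
        if income <= threshold:
            break
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else income
        tax += (min(income, upper) - threshold) * rate
    return tax

print(tax_flat(50_000), tax_marginal(50_000))  # roughly 15000 vs 10000
```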

&lt;p&gt;A confound was discovered in the first run: the original docstrings stated&lt;br&gt;
the domain convention explicitly. The reviewer was comparing the&lt;br&gt;
implementation against the docstring, not against independent domain&lt;br&gt;
knowledge. A docstring that encodes the convention is a specification. The&lt;br&gt;
experiment was inadvertently confirming that specifications work, not&lt;br&gt;
testing whether AI review works without them. Docstrings were replaced with&lt;br&gt;
neutral descriptions before the second run.&lt;/p&gt;

&lt;p&gt;Result with neutral docstrings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;BDD&lt;/th&gt;
&lt;th&gt;AI review&lt;/th&gt;
&lt;th&gt;Detection rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;prorate_premium&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;apply_tiered_tax&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schedule_maintenance&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;calculate_dilution&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;4/5&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;interpolate_rate&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;BDD caught all five. AI review ranged from 0% to 100% depending on domain&lt;br&gt;
opacity.&lt;/p&gt;

&lt;p&gt;On inspection, the three functions at 100% still have code-level signals&lt;br&gt;
despite the neutral docstrings. The hardcoded 365 is a common smell. The&lt;br&gt;
AND/OR distinction is partially inferable from parameter naming. The&lt;br&gt;
flat-rate calculation is detectable from the arithmetic pattern. These are&lt;br&gt;
not truly domain-opaque. The two genuinely opaque functions produced the&lt;br&gt;
predicted result: interpolate_rate at 0% (log-linear interpolation is&lt;br&gt;
market convention in fixed income, not general programming knowledge) and&lt;br&gt;
calculate_dilution at 80% (partial VC mechanics coverage in training data,&lt;br&gt;
unreliable).&lt;/p&gt;

&lt;p&gt;All five AI review runs on interpolate_rate flagged a sorting assumption&lt;br&gt;
instead of the interpolation method and declared the implementation correct.&lt;br&gt;
The code is idiomatic, the logic is sound, and without the convention stated&lt;br&gt;
explicitly there is no signal. The reviewer filled in a plausible concern&lt;br&gt;
rather than the actual violation.&lt;/p&gt;
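&lt;p&gt;The convention difference is small in code and invisible without the convention stated. A sketch of the two interpolation methods (illustrative, not the corpus code):&lt;/p&gt;

```python
import math

def interp_linear(x, x0, y0, x1, y1):
    """Straight-line interpolation: plausible and idiomatic, but wrong
    under the fixed-income market convention."""
    w = (x - x0) / (x1 - x0)
    return y0 + w * (y1 - y0)

def interp_log_linear(x, x0, y0, x1, y1):
    """Interpolate in log space and exponentiate back: the convention
    the planted bug violated."""
    w = (x - x0) / (x1 - x0)
    return math.exp((1 - w) * math.log(y0) + w * math.log(y1))

# Midpoint between 2% and 8%: about 5.0% linear vs 4.0% log-linear
# (the geometric mean). Nothing in either function signals which is right.
print(interp_linear(5, 0, 0.02, 10, 0.08))
print(interp_log_linear(5, 0, 0.02, 10, 0.08))
```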

&lt;p&gt;&lt;strong&gt;Experiment 3: Cross-family reviewer panel.&lt;/strong&gt; A third experiment extended&lt;br&gt;
the test to four models from three families: Claude Sonnet 4.6 (Anthropic),&lt;br&gt;
Codex 0.116.0/gpt-5.4 (OpenAI), Gemini 0.34.0 (Google), and Amazon Q&lt;br&gt;
1.18.1/claude-sonnet-4.5 (AWS/Anthropic). Five domain-opaque bugs were&lt;br&gt;
reviewed by each model, five runs per function, with no specification&lt;br&gt;
context. The corpus was designed collaboratively across six models not in&lt;br&gt;
the reviewer panel. Amazon Q declined to participate in corpus generation&lt;br&gt;
on safety grounds and was used as a reviewer only.&lt;/p&gt;

&lt;p&gt;Result with manual review of all 100 runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;BDD&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;Codex&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;th&gt;Q&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;calculate_final_reserve_fuel&lt;/td&gt;
&lt;td&gt;Aviation (ICAO)&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;get_gas_day&lt;/td&gt;
&lt;td&gt;Energy (NAESB)&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;2/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;validate_diagnosis_sequence&lt;/td&gt;
&lt;td&gt;Healthcare (ICD-10-CM)&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;validate_imo_number&lt;/td&gt;
&lt;td&gt;Maritime (IMO)&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;0/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;electricity_cost&lt;/td&gt;
&lt;td&gt;Utility billing&lt;/td&gt;
&lt;td&gt;caught&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;td&gt;5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;BDD caught all five. AI review ranged from 0% to 100% depending on domain&lt;br&gt;
opacity and model family, confirming the gradient observed in Experiment 2&lt;br&gt;
and extending it across training distributions.&lt;/p&gt;

&lt;p&gt;The ICAO fuel reserve function produced the most striking result. The bug&lt;br&gt;
swaps the reserve durations for jet and piston aircraft (45 minutes instead&lt;br&gt;
of 30 for jet, 30 instead of 45 for piston). This is the&lt;br&gt;
&lt;code&gt;intuition_misleads&lt;/code&gt; case: the swapped values are plausible on naive&lt;br&gt;
reasoning because larger aircraft might seem to need more reserve. Claude&lt;br&gt;
did not merely miss the bug. In all five runs, it confidently asserted the&lt;br&gt;
swapped values as the correct ICAO rules and declared the implementation&lt;br&gt;
correct. The model filled in the wrong convention and defended it. Gemini&lt;br&gt;
caught the swap in all five runs, suggesting the convention is represented&lt;br&gt;
in its training data but absent from Claude's and Codex's.&lt;/p&gt;
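&lt;p&gt;To make the swap concrete, here is a minimal sketch of the function and the check that pins the convention. The function name comes from the table above; the body, signature, and burn-rate parameter are hypothetical reconstructions for illustration, not the actual corpus code.&lt;/p&gt;

```python
# Function name follows the experiment table; the body and the
# burn-rate parameter are illustrative reconstructions, not corpus code.

JET_RESERVE_MIN = 30     # ICAO Annex 6: final reserve for turbine aircraft
PISTON_RESERVE_MIN = 45  # ICAO Annex 6: final reserve for piston aircraft

def calculate_final_reserve_fuel(engine_type, burn_rate_kg_per_min):
    """Final reserve fuel in kg for the given engine type."""
    minutes = JET_RESERVE_MIN if engine_type == "jet" else PISTON_RESERVE_MIN
    return minutes * burn_rate_kg_per_min

def buggy_calculate_final_reserve_fuel(engine_type, burn_rate_kg_per_min):
    # The planted bug: durations swapped, plausible on naive reasoning
    # because a larger aircraft might seem to need more reserve.
    minutes = 45 if engine_type == "jet" else 30
    return minutes * burn_rate_kg_per_min

# The BDD scenario pins the convention, so the swap cannot survive:
# Given a jet burning 100 kg/min, when the final reserve is computed,
# then the result is 3000 kg (30 minutes of holding), not 4500 kg.
assert calculate_final_reserve_fuel("jet", 100.0) == 3000.0
assert buggy_calculate_final_reserve_fuel("jet", 100.0) == 4500.0
```

&lt;p&gt;The scenario does not reason about plausibility at all. It asserts the published value, which is exactly why the reviewer's prior cannot overwrite it.&lt;/p&gt;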

&lt;p&gt;The ICD-10-CM external cause code rule (V00-Y99 codes cannot be the&lt;br&gt;
principal diagnosis) was missed by all four models at 0/5. This convention&lt;br&gt;
was absent from every training distribution tested. No model even&lt;br&gt;
approximated the rule. This is the strongest confirmation of the hypothesis&lt;br&gt;
across the three experiments: a domain convention that exists only in&lt;br&gt;
published coding guidelines and is not inferable from the code.&lt;/p&gt;
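&lt;p&gt;The rule itself is simple once stated, which is what makes the 0/20 result instructive. A sketch of the check, with the function name taken from the table above and the body a hypothetical reconstruction:&lt;/p&gt;

```python
# ICD-10-CM external cause codes (V00-Y99) describe how an injury
# happened and cannot be the principal (first-listed) diagnosis.
# Function name follows the experiment table; the body is illustrative.

def validate_diagnosis_sequence(codes):
    """Return True if the code sequence is admissible as ordered."""
    if not codes:
        return False
    first_letter = codes[0][0].upper()
    # External cause codes occupy the V, W, X, and Y chapters.
    if first_letter in ("V", "W", "X", "Y"):
        return False  # an external cause code cannot lead the sequence
    return True

# Injury first, external cause second: admissible.
assert validate_diagnosis_sequence(["S52.501A", "V18.0XXA"])
# External cause listed as principal diagnosis: rejected.
assert not validate_diagnosis_sequence(["V18.0XXA", "S52.501A"])
```

&lt;p&gt;Nothing in the code hints at the rule; it exists only in the coding guidelines, which is precisely the property the experiment was designed to isolate.&lt;/p&gt;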

&lt;p&gt;Amazon Q (claude-sonnet-4.5, AWS fine-tuned from an Anthropic base model)&lt;br&gt;
showed consistently lower detection rates than Claude Sonnet 4.6 despite&lt;br&gt;
sharing the same model family. Claude caught the IMO check digit weight&lt;br&gt;
ordering at 5/5; Q missed it at 0/5. Claude caught the gas day boundary at&lt;br&gt;
5/5; Q caught it at 2/5. This suggests that AWS fine-tuning for the&lt;br&gt;
developer assistant use case may have narrowed general domain knowledge&lt;br&gt;
coverage relative to the base model. Two of the four reviewer models are&lt;br&gt;
Anthropic family, so this experiment is only partially cross-family.&lt;/p&gt;
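&lt;p&gt;The IMO check digit case turns on weight ordering. A sketch, assuming the published scheme (descending weights 7 through 2 over the first six digits); the function name follows the table above and the body is an illustrative reconstruction:&lt;/p&gt;

```python
def validate_imo_number(imo):
    """IMO number check digit: weights 7..2 over the first six digits.

    Descending weight order is the published convention; a planted bug
    that reverses the weights still looks plausible in review, which is
    the kind of swap the experiment probes.
    """
    if len(imo) != 7 or not imo.isdigit():
        return False
    weights = (7, 6, 5, 4, 3, 2)
    total = sum(int(d) * w for d, w in zip(imo[:6], weights))
    return total % 10 == int(imo[6])

# 9074729: 9*7 + 0*6 + 7*5 + 4*4 + 7*3 + 2*2 = 139, and 139 mod 10 = 9.
assert validate_imo_number("9074729")
assert not validate_imo_number("9074728")  # wrong check digit
```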

&lt;p&gt;The first Gemini run revealed that the Gemini CLI in interactive mode&lt;br&gt;
had read the experiment's specification files from the local filesystem&lt;br&gt;
before answering, so Gemini was re-run in sandboxed non-interactive mode&lt;br&gt;
to block filesystem access. The original corpus also&lt;br&gt;
contained an unused &lt;code&gt;timedelta&lt;/code&gt; import in the gas day function that acted&lt;br&gt;
as a code-level hint; Q's detection rate dropped from 5/5 to 2/5 after&lt;br&gt;
the import was removed, demonstrating the sensitivity of detection to&lt;br&gt;
incidental code signals.&lt;/p&gt;
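&lt;p&gt;For reference, the convention the gas day scenarios pin down can be sketched as follows. The function name comes from the table above; the body is a hypothetical reconstruction that assumes the NAESB definition of a gas day beginning at 9:00 a.m. Central Clock Time, and it deliberately uses &lt;code&gt;timedelta&lt;/code&gt; for real rather than as an unused hint:&lt;/p&gt;

```python
from datetime import date, datetime, timedelta

GAS_DAY_START_HOUR = 9  # assumed NAESB convention: gas day starts 9:00 a.m. CCT

def get_gas_day(ts):
    """Gas day for a Central-clock timestamp (illustrative sketch).

    Timestamps before 09:00 belong to the previous gas day. The planted
    boundary bug from the corpus is not reproduced here; this shows the
    convention the BDD scenarios make executable.
    """
    if ts.hour >= GAS_DAY_START_HOUR:
        return ts.date()
    return ts.date() - timedelta(days=1)

assert get_gas_day(datetime(2026, 3, 1, 9, 0)) == date(2026, 3, 1)    # boundary: new gas day
assert get_gas_day(datetime(2026, 3, 1, 8, 59)) == date(2026, 2, 28)  # still the previous day
```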

&lt;p&gt;&lt;strong&gt;Limitations of all three experiments.&lt;/strong&gt; These results are directional,&lt;br&gt;
not statistically significant. The corpus was designed to test the&lt;br&gt;
hypothesis rather than sampled from a natural defect distribution. The bugs&lt;br&gt;
were planted by someone who already knew where they were, giving the BDD&lt;br&gt;
scenarios an unfair advantage: they are optimally targeted at the planted&lt;br&gt;
defects in a way that production specifications would not be. Experiments 1&lt;br&gt;
and 2 use Claude as both the code author and reviewer (same family), which&lt;br&gt;
is the strongest form of the correlated error claim but not a general&lt;br&gt;
result about all AI pipelines. Experiment 3 extends to a cross-family&lt;br&gt;
panel, but two of four models share the Anthropic base, making it only&lt;br&gt;
partially cross-family. Amazon Q required periodic re-authentication during&lt;br&gt;
batch runs, introducing an operational constraint on unattended execution.&lt;/p&gt;

&lt;p&gt;The truly untestable version of the hypothesis cannot be demonstrated in a&lt;br&gt;
public experiment: a bug that is only wrong relative to an internal policy&lt;br&gt;
document that has never been published anywhere. That is the category where&lt;br&gt;
the correlated error problem is most consequential and where no amount of&lt;br&gt;
training data coverage helps. Public experiments can only approximate it&lt;br&gt;
with obscure published conventions, which frontier models may or may not&lt;br&gt;
know.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Cynefin Domain Transition Hypothesis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 The Constraint Distinction
&lt;/h3&gt;

&lt;p&gt;The Cynefin framework (Snowden and Boone 2007) distinguishes problem&lt;br&gt;
domains by the relationship between cause and effect and by the nature of&lt;br&gt;
the constraints governing system behaviour.&lt;/p&gt;

&lt;p&gt;In the complicated domain, governing constraints apply. These bound the&lt;br&gt;
solution space without specifying every action. A qualified expert&lt;br&gt;
operating within governing constraints can analyse the situation and&lt;br&gt;
identify a good approach. Cause and effect are knowable through analysis.&lt;/p&gt;

&lt;p&gt;In the complex domain, enabling constraints apply. These allow a system&lt;br&gt;
to function and create conditions for patterns to emerge, but they do not&lt;br&gt;
determine outcomes. Cause and effect are only knowable in hindsight.&lt;/p&gt;

&lt;p&gt;The distinction is ontological, not a matter of difficulty.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Constraint Transformation as Domain Shift
&lt;/h3&gt;

&lt;p&gt;An AI coding agent working from a vague natural language prompt operates&lt;br&gt;
under enabling constraints. The prompt allows the agent to generate code&lt;br&gt;
and creates conditions for solutions to emerge, but edge cases, boundary&lt;br&gt;
conditions, and architectural choices are resolved by the model's priors.&lt;br&gt;
The agent's behaviour is emergent and only knowable in hindsight.&lt;/p&gt;

&lt;p&gt;An AI coding agent operating from a precise executable specification is&lt;br&gt;
in a different situation. A BDD scenario makes a specific causal claim:&lt;br&gt;
given this precondition, when this action occurs, then this outcome is&lt;br&gt;
required. That claim is verifiable before hindsight. The agent cannot&lt;br&gt;
produce code that fails the scenarios without the pipeline catching it.&lt;br&gt;
Cause and effect are knowable through analysis of the specification&lt;br&gt;
itself, which is exactly what defines the complicated domain.&lt;/p&gt;

&lt;p&gt;Writing executable specifications converts enabling constraints into&lt;br&gt;
governing constraints. The problem does not change. The constraint type&lt;br&gt;
does. And with it, the domain.&lt;/p&gt;
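&lt;p&gt;The causal claim a scenario makes can be reduced to its executable essence. The domain rule below is entirely hypothetical, chosen only to show the Given/When/Then structure as a check that is verifiable before hindsight:&lt;/p&gt;

```python
# Hypothetical domain rule, for illustration only: a flat 5% late fee
# once an invoice is 30 or more days overdue.

def apply_late_fee(balance, days_overdue):
    if days_overdue >= 30:
        return round(balance * 1.05, 2)
    return balance

def test_late_fee_applies_at_thirty_days():
    # Given an invoice of 200.00 that is exactly 30 days overdue
    balance, days = 200.00, 30
    # When the late fee is applied
    result = apply_late_fee(balance, days)
    # Then the balance is 210.00: the outcome is required, not emergent
    assert result == 210.00

test_late_fee_applies_at_thirty_days()
```

&lt;p&gt;An agent implementing against this check cannot resolve the 30-day boundary from its priors; the constraint governs the outcome rather than merely enabling one.&lt;/p&gt;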

&lt;h3&gt;
  
  
  3.3 What AI Makes Economically Viable
&lt;/h3&gt;

&lt;p&gt;The DORA 2026 report establishes that as AI accelerates code generation,&lt;br&gt;
the bottleneck shifts to specification and verification. Specifications&lt;br&gt;
are no longer overhead relative to implementation. They are the scarce&lt;br&gt;
resource.&lt;/p&gt;

&lt;p&gt;The evidence for a corresponding reduction in specification authoring&lt;br&gt;
cost is early but directional. Fonseca, Lima, and Faria&lt;br&gt;
(arXiv:2510.18861, 2025) measured Gherkin scenario generation from&lt;br&gt;
natural-language JIRA tickets on a production mobile application at BMW&lt;br&gt;
and found that practitioners reported time savings often amounting to a&lt;br&gt;
full developer-day per feature, with the automated generation completing&lt;br&gt;
in under five minutes. In a separate quasi-experiment, Hassani,&lt;br&gt;
Sabetzadeh, and Amyot (arXiv:2508.20744v2, 2026) found that 91.7% of&lt;br&gt;
ratings on the time savings criterion for LLM-generated Gherkin&lt;br&gt;
specifications fell in the top two categories. A 2025 systematic review&lt;br&gt;
of 74 LLM-for-requirements studies (Zadenoori et al.,&lt;br&gt;
arXiv:2509.11446) found that studies are predominantly evaluated in&lt;br&gt;
controlled environments using output quality metrics, with limited&lt;br&gt;
industry use, consistent with specification authoring cost being an&lt;br&gt;
underexplored measurement target. The first two studies represent the&lt;br&gt;
leading edge of an emerging evidence base, not settled consensus.&lt;/p&gt;

&lt;p&gt;Both studies show AI shifting human effort from authoring to reviewing:&lt;br&gt;
from expressing intent to validating that the expressed intent is&lt;br&gt;
accurate. The intent, the domain knowledge, and the judgment that&lt;br&gt;
scenarios accurately describe what the system should do remain human&lt;br&gt;
responsibilities. The mechanical cost of expressing that intent in&lt;br&gt;
executable form has fallen substantially.&lt;/p&gt;

&lt;p&gt;The domain transition from complex to complicated is now economically&lt;br&gt;
viable at scale in a way it was not before. The claim is not that AI&lt;br&gt;
makes specification automatic. It is that the economics have shifted&lt;br&gt;
enough to make specification-first development the rational default&lt;br&gt;
rather than the disciplined exception.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 A Likely Objection
&lt;/h3&gt;

&lt;p&gt;A Cynefin-literate reader will challenge whether software development is&lt;br&gt;
ever truly complex in Snowden's sense. Software systems are deterministic&lt;br&gt;
at the execution level: given the same inputs and state, they produce the&lt;br&gt;
same outputs. If cause and effect are always knowable in principle, the&lt;br&gt;
complex framing is a category error.&lt;/p&gt;

&lt;p&gt;The response: the complexity resides in the problem space, not the&lt;br&gt;
execution. What users need, which edge cases matter, how requirements will&lt;br&gt;
evolve. These exhibit the emergent properties and hindsight-only&lt;br&gt;
knowability that Snowden places in the complex domain. An executable&lt;br&gt;
specification narrows the problem space by articulating requirements&lt;br&gt;
precisely enough to be analysed. The domain transition occurs in the&lt;br&gt;
problem space, not in the implementation.&lt;/p&gt;

&lt;p&gt;As of March 2026, no published work in the Cynefin community addresses the specific claim&lt;br&gt;
that executable specifications serve as a constraint transformation&lt;br&gt;
mechanism in this sense. Dave Snowden is actively working on the&lt;br&gt;
relationship between AI and Cynefin domains as of early 2026, but has&lt;br&gt;
not published conclusions.&lt;/p&gt;

&lt;p&gt;The vocabulary has been checked carefully&lt;br&gt;
against canonical framework definitions. The enabling and governing&lt;br&gt;
constraints terminology is confirmed in Snowden's own Cynefin wiki&lt;br&gt;
(cynefin.io/wiki/Constraints), not in the 2007 HBR paper. The mapping&lt;br&gt;
is consistent with Snowden's definitions. The specific claim is original&lt;br&gt;
to this argument.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Residual Defect Taxonomy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Theoretical Grounding: The Oracle Problem
&lt;/h3&gt;

&lt;p&gt;The oracle problem in software testing asks how a test outcome is&lt;br&gt;
determined to be correct. For a test to be meaningful, you need an oracle,&lt;br&gt;
a ground truth against which to judge the system's output.&lt;/p&gt;

&lt;p&gt;Barr et al. (2015), in the canonical survey of the oracle problem,&lt;br&gt;
established that complete oracles are theoretically impossible for most&lt;br&gt;
real-world software. Even a formally correct specification cannot fully&lt;br&gt;
specify the correct output for all possible inputs in all possible&lt;br&gt;
contexts. There exists a class of defects for which no specification,&lt;br&gt;
however precise, provides an oracle. That class defines the permanent&lt;br&gt;
theoretical boundary of what specification-driven verification can achieve.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 The Proposed Taxonomy
&lt;/h3&gt;

&lt;p&gt;To the best of our knowledge, no prior taxonomy in the testing or formal methods literature&lt;br&gt;
organises defects by specifiability rather than by severity or recovery&lt;br&gt;
type. The following five-category taxonomy is&lt;br&gt;
proposed as a working framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Category A: Theoretically specifiable, not yet specified.&lt;/strong&gt; These are&lt;br&gt;
defects that a specification could have caught if the scenario had been&lt;br&gt;
written. Boundary conditions, error handling paths, and unexercised state&lt;br&gt;
machine transitions fall here. The gap is a process failure, not a&lt;br&gt;
theoretical limitation. This category shrinks as specification discipline&lt;br&gt;
matures. An AI review agent operating here without an external&lt;br&gt;
specification shares the same blind spots as the generator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Category B: Specifiable in principle, economically impractical.&lt;/strong&gt;&lt;br&gt;
Exhaustive combinatorial input spaces and full interaction matrices could&lt;br&gt;
theoretically be specified but at a cost that exceeds the value. This&lt;br&gt;
boundary is moving: property-based testing frameworks reduce the cost of&lt;br&gt;
exploring combinatorial spaces systematically. Category B defects are a&lt;br&gt;
legitimate target for AI review that brings genuine sampling diversity,&lt;br&gt;
provided the reviewer draws from a different prior than the generator.&lt;/p&gt;
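&lt;p&gt;A sketch of what systematic exploration of a combinatorial space looks like. The interval-merging function and its invariant are generic illustrations, not drawn from the experiment corpus; frameworks such as Hypothesis automate the input generation and shrink failing cases, where this stdlib version only shows the principle:&lt;/p&gt;

```python
import random

def merge_intervals(intervals):
    """Merge overlapping closed intervals given as (start, end) pairs."""
    merged = []
    for start, end in sorted(intervals):
        if not merged or start > merged[-1][1]:
            merged.append([start, end])
        else:  # overlaps or touches the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
    return merged

# Property: however the inputs fall, the output is sorted and pairwise
# disjoint. One property covers a space no scenario list could enumerate.
random.seed(0)
for _ in range(500):
    intervals = []
    for _ in range(random.randint(0, 10)):
        a, b = random.randint(0, 100), random.randint(0, 100)
        intervals.append((min(a, b), max(a, b)))
    merged = merge_intervals(intervals)
    for (s1, e1), (s2, e2) in zip(merged, merged[1:]):
        assert s2 > e1, (intervals, merged)
```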

&lt;p&gt;&lt;strong&gt;Category C: Inherently unspecifiable from pre-execution context.&lt;/strong&gt;&lt;br&gt;
Timing-dependent race conditions, behaviour under partial network failure,&lt;br&gt;
and performance degradation under specific hardware configurations depend&lt;br&gt;
on properties of the running system that do not exist at specification&lt;br&gt;
time. Recent work is pushing the boundary: Santos et al. (Software 2024)&lt;br&gt;
encoded performance efficiency requirements as BDD scenarios; Maaz et al.&lt;br&gt;
(arXiv:2510.09907, 2025) surfaced concurrency bugs previously requiring&lt;br&gt;
runtime observation. Category C is not fixed. It is the current frontier.&lt;br&gt;
For defects that remain here, the right tool is runtime verification&lt;br&gt;
infrastructure rather than pre-deployment review. This includes&lt;br&gt;
ML-based anomaly detection, APM tooling with learned behavioural&lt;br&gt;
baselines, and observability platforms. This is a class of techniques predating&lt;br&gt;
LLMs that applies machine learning to operational data rather than to&lt;br&gt;
source code. The implicit specification in these systems is derived from&lt;br&gt;
observed behaviour rather than stated intent, which makes them&lt;br&gt;
complementary to, not a replacement for, the pre-deployment pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Category D: Structural and architectural properties.&lt;/strong&gt; Code can satisfy&lt;br&gt;
every specification while introducing coupling, violating layer boundaries,&lt;br&gt;
or drifting from intended design. These are relational properties of the&lt;br&gt;
codebase as a whole, not properties of individual components. They resist&lt;br&gt;
behavioural specification because they concern structure rather than&lt;br&gt;
behaviour, but this resistance does not mean they are unspecifiable. Architectural rules,&lt;br&gt;
once articulated, are enforceable deterministically: dependency rules via&lt;br&gt;
tools such as ArchUnit or Dependency Cruiser, service boundary agreements&lt;br&gt;
via contract testing frameworks such as Pact. General delivery rigour,&lt;br&gt;
clear module boundaries, and explicit interface definitions reduce the&lt;br&gt;
surface area of Category D by making architectural intent concrete. The&lt;br&gt;
residual, after tooling and contract testing are in place, is the&lt;br&gt;
unarticulated architectural intent that has not yet been expressed as an&lt;br&gt;
enforceable rule: drift from a design decision that was never written down,&lt;br&gt;
coupling that violates a pattern that exists only in institutional memory,&lt;br&gt;
a half-completed migration to a new architectural pattern where some&lt;br&gt;
modules use the new approach and others still use the old one, dead&lt;br&gt;
abstractions introduced for a use case that no longer exists. No automated&lt;br&gt;
rule catches these because the correct answer depends on intent that was&lt;br&gt;
never codified. This is where an AI review agent with access to the full&lt;br&gt;
codebase and architectural context adds genuine non-circular signal,&lt;br&gt;
operating in a role analogous to an expert architect reviewing for&lt;br&gt;
structural coherence. The agent advises; the human decides whether to&lt;br&gt;
complete the migration, remove the dead pattern, or codify the observation&lt;br&gt;
as a new enforceable rule. This is the least empirically grounded category&lt;br&gt;
in the taxonomy; no controlled study has yet isolated this residual as a&lt;br&gt;
distinct defect class.&lt;/p&gt;
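&lt;p&gt;The "articulated rule becomes a deterministic gate" half of Category D can be shown in miniature. The layering rule and module names below are hypothetical; ArchUnit and Dependency Cruiser do this job with far more machinery, and this sketch only shows the principle of checking imports against a written-down boundary:&lt;/p&gt;

```python
import ast

# Hypothetical layer rule: domain code must not import from the web layer.
FORBIDDEN = {"myapp.domain": ("myapp.web",)}

def layering_violations(module_name, source):
    """Return imports in `source` that break the layer rule for module_name."""
    banned = FORBIDDEN.get(module_name, ())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            if any(node.module.startswith(b) for b in banned):
                violations.append(node.module)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if any(alias.name.startswith(b) for b in banned):
                    violations.append(alias.name)
    return violations

bad = "from myapp.web.handlers import render\n"
ok = "from myapp.domain.tariffs import Tariff\n"
assert layering_violations("myapp.domain", bad) == ["myapp.web.handlers"]
assert layering_violations("myapp.domain", ok) == []
```

&lt;p&gt;Everything such a check can express has, by definition, already left the residual. What remains for AI review is the intent that no one has yet written into the &lt;code&gt;FORBIDDEN&lt;/code&gt; table.&lt;/p&gt;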

&lt;p&gt;&lt;strong&gt;Category E: Specification defects.&lt;/strong&gt; The oracle problem result makes&lt;br&gt;
this category unavoidable. A specification can be internally consistent,&lt;br&gt;
precisely expressed, and correctly implemented, and still describe a&lt;br&gt;
system that does not do what users need. Requirements elicitation is not&lt;br&gt;
a solved problem. Domain experts disagree. Business rules change. No&lt;br&gt;
verification pipeline catches Category E defects because the pipeline&lt;br&gt;
verifies conformance to the specification. If the specification is wrong,&lt;br&gt;
the pipeline confirms the wrong thing. The right tool for Category E is&lt;br&gt;
user testing, observability of actual usage, and design thinking practices&lt;br&gt;
that surface unstated assumptions. These are human processes, not&lt;br&gt;
automated ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Implications for the Architecture
&lt;/h3&gt;

&lt;p&gt;The taxonomy implies a specific allocation of tools:&lt;/p&gt;

&lt;p&gt;Category A is the target for specification discipline and AI-assisted&lt;br&gt;
coverage analysis. Investment here reduces the work that review agents&lt;br&gt;
need to do.&lt;/p&gt;

&lt;p&gt;Category B is the target for property-based testing and diverse sampling&lt;br&gt;
strategies. AI review adds value here only if genuine diversity is&lt;br&gt;
achieved.&lt;/p&gt;

&lt;p&gt;Category C is the target for runtime verification infrastructure:&lt;br&gt;
ML-based anomaly detection, observability tooling, chaos engineering, and&lt;br&gt;
load testing. Neither pre-deployment specifications nor AI review agents&lt;br&gt;
are the right tool here.&lt;/p&gt;

&lt;p&gt;Category D is the target for architectural tooling and contract testing&lt;br&gt;
first: dependency enforcement, Pact-style contract verification, and&lt;br&gt;
explicit interface definitions. AI review is the complement for the&lt;br&gt;
unarticulated residual: drift from design intent that has not yet been&lt;br&gt;
codified as an enforceable rule.&lt;/p&gt;

&lt;p&gt;Category E is the reminder that the feedback loop from production to&lt;br&gt;
requirements is a human loop and cannot be automated away.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Combined Architecture and What Remains Open
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 The Architecture
&lt;/h3&gt;

&lt;p&gt;The three hypotheses together imply a specific architecture for&lt;br&gt;
AI-assisted software development.&lt;/p&gt;

&lt;p&gt;Specifications first. BDD scenarios, contract tests, and mutation testing&lt;br&gt;
harden the verification layer. This is the constraint transformation that&lt;br&gt;
moves the problem into the complicated domain and eliminates the correlated&lt;br&gt;
error problem for Category A defects.&lt;/p&gt;

&lt;p&gt;Deterministic verification pipeline second. The pipeline is the reviewer&lt;br&gt;
for behavioural correctness. Pass or fail, no opinions.&lt;/p&gt;

&lt;p&gt;Architectural tooling and AI review for Category D. Dependency&lt;br&gt;
enforcement tools and contract testing for articulated architectural&lt;br&gt;
rules. AI review for the unarticulated residual, in a role analogous&lt;br&gt;
to an expert architect advising on structural coherence.&lt;/p&gt;

&lt;p&gt;Runtime verification for Category C. ML-based anomaly detection,&lt;br&gt;
observability tooling, chaos engineering, load testing. Not part of the&lt;br&gt;
pre-deployment pipeline.&lt;/p&gt;

&lt;p&gt;User feedback loops for Category E. Requirements validation, user testing,&lt;br&gt;
and design thinking. Not part of the engineering pipeline at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 What Remains Open
&lt;/h3&gt;

&lt;p&gt;Three open questions are stated honestly here.&lt;/p&gt;

&lt;p&gt;The controlled demonstration has been strengthened but remains incomplete. Three&lt;br&gt;
contrived experiments provided directional evidence: classic boundary&lt;br&gt;
conditions were detected at 100% by AI review (refining the hypothesis&lt;br&gt;
toward domain-opaque defects), domain-convention violations showed a&lt;br&gt;
gradient from 0% to 100% depending on how well the convention is&lt;br&gt;
represented in training data, and a cross-family reviewer panel confirmed&lt;br&gt;
the gradient extends across model families with detection varying by both&lt;br&gt;
domain opacity and training distribution. The 0/20 result on ICD-10-CM&lt;br&gt;
coding guidelines (four models, five runs each, zero detections) is the&lt;br&gt;
strongest single finding. But all three experiments use a planted bug&lt;br&gt;
corpus, the BDD scenarios are optimally targeted, and the cross-family&lt;br&gt;
panel includes two Anthropic-family models. A controlled study using a&lt;br&gt;
natural defect sample, fully independent model families, and a&lt;br&gt;
specification-grounded condition alongside an ungrounded condition would&lt;br&gt;
strengthen or revise the claim precisely. The experiments are available in the repository cited in Section 2.5&lt;br&gt;
for replication and extension.&lt;/p&gt;

&lt;p&gt;The Cynefin mapping is unvalidated by the Cynefin community. The&lt;br&gt;
constraint transformation framing is consistent with Snowden's own&lt;br&gt;
vocabulary but has not been reviewed by accredited practitioners. Snowden&lt;br&gt;
is actively working on AI and Cynefin and may publish a position that&lt;br&gt;
validates, challenges, or reframes this argument.&lt;/p&gt;

&lt;p&gt;The taxonomy is novel and untested. No prior work organises defects by&lt;br&gt;
specifiability in the five-category structure proposed here. The taxonomy&lt;br&gt;
may be incomplete, category boundaries may be blurrier than described,&lt;br&gt;
and empirical work testing the taxonomy against real defect populations&lt;br&gt;
would strengthen or revise it. The Category A and Category D boundary is&lt;br&gt;
the most contested: if architectural properties become specifiable as&lt;br&gt;
tooling matures, parts of Category D would reclassify into Category A,&lt;br&gt;
which would reduce the permanent residual for AI review.&lt;/p&gt;

&lt;p&gt;These are open questions, not fatal weaknesses. The directional evidence is consistent with the hypothesis. Stating the gaps is not a concession. It is&lt;br&gt;
what makes the argument trustworthy.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;The argument developed here is structural, not empirical. Executable&lt;br&gt;
specifications break the circularity in AI-assisted pipelines not because&lt;br&gt;
they catch more bugs than AI review, but because they introduce a&lt;br&gt;
reference that is independent of the model's training distribution. That&lt;br&gt;
independence is what makes verification non-circular. Without it, review&lt;br&gt;
checks code against itself.&lt;/p&gt;

&lt;p&gt;Three experiments across four model families and three training&lt;br&gt;
distributions provide directional evidence. Classic boundary conditions&lt;br&gt;
are reliably caught by all models. Domain-convention violations produce a&lt;br&gt;
gradient from 0% to 100% detection depending on how well the convention&lt;br&gt;
is represented in training data. The ICD-10-CM external cause code rule&lt;br&gt;
was missed by every model tested, while BDD caught it deterministically.&lt;br&gt;
The ICAO fuel reserve case showed something worse than a miss: models&lt;br&gt;
confidently asserting the wrong convention as correct. These results are&lt;br&gt;
small in scale and use planted bugs, but the pattern is consistent with&lt;br&gt;
the theoretical prediction and with the independent empirical findings&lt;br&gt;
cited in Section 2.&lt;/p&gt;

&lt;p&gt;The practical implication is an architecture, not a prohibition.&lt;br&gt;
Specifications first, for the domain logic that must be right.&lt;br&gt;
Deterministic pipeline second, as the reviewer for behavioural&lt;br&gt;
correctness. AI review third, scoped to the structural and architectural&lt;br&gt;
residual where it adds genuine non-circular signal. This is not a claim&lt;br&gt;
that AI review is valueless. It is a claim about where it belongs in the&lt;br&gt;
pipeline and what it needs underneath it to be trustworthy.&lt;/p&gt;

&lt;p&gt;The experiments, corpus, and specifications are publicly available for&lt;br&gt;
replication and extension. The argument improves by being tested.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Barr, E.T., Harman, M., McMinn, P., Shahbaz, M., and Yoo, S. (2015).&lt;br&gt;
"The Oracle Problem in Software Testing: A Survey." IEEE Transactions on&lt;br&gt;
Software Engineering 41(5), 507-525. doi:10.1109/TSE.2014.2372785&lt;/p&gt;

&lt;p&gt;Dietterich, T.G. (2000). "Ensemble Methods in Machine Learning."&lt;br&gt;
Proceedings of the First International Workshop on Multiple Classifier&lt;br&gt;
Systems. Lecture Notes in Computer Science, vol 1857. Springer.&lt;/p&gt;

&lt;p&gt;Baolin, J. and Harvey, N. (2026). "Balancing AI tensions: Moving from&lt;br&gt;
AI adoption to effective SDLC use." &lt;a href="https://dora.dev/insights/balancing-ai-tensions/" rel="noopener noreferrer"&gt;https://dora.dev/insights/balancing-ai-tensions/&lt;/a&gt;,&lt;br&gt;
March 2026. Based on 1,110 open-ended survey responses from Google&lt;br&gt;
software engineers.&lt;/p&gt;

&lt;p&gt;Fonseca, P.L., Lima, B., and Faria, J.P. (2025). "Streamlining Acceptance&lt;br&gt;
Test Generation for Mobile Applications Through Large Language Models:&lt;br&gt;
An Industrial Case Study." arXiv:2510.18861.&lt;/p&gt;

&lt;p&gt;Hassani, S., Sabetzadeh, M., and Amyot, D. (2026). "From Law to Gherkin:&lt;br&gt;
A Human-Centred Quasi-Experiment on the Quality of LLM-Generated&lt;br&gt;
Behavioural Specifications from Food-Safety Regulations."&lt;br&gt;
arXiv:2508.20744v2 (updated March 2026).&lt;/p&gt;

&lt;p&gt;Hansen, L.K. and Salamon, P. (1990). "Neural Network Ensembles." IEEE&lt;br&gt;
Transactions on Pattern Analysis and Machine Intelligence, 12(10),&lt;br&gt;
993-1001. doi:10.1109/34.58871&lt;/p&gt;

&lt;p&gt;Maaz, M., DeVoe, L., Hatfield-Dodds, Z., and Carlini, N. (2025).&lt;br&gt;
"Agentic Property-Based Testing: Finding Bugs Across the Python&lt;br&gt;
Ecosystem." arXiv:2510.09907, October 2025.&lt;/p&gt;

&lt;p&gt;Mi, Z., Zheng, R., Zhong, H., Sun, Y., Kneeland, S., Moitra, S., Kutzer,&lt;br&gt;
K., Xu, Z., and Huang, S. (2025). "CoopetitiveV: Leveraging LLM-powered&lt;br&gt;
Coopetitive Multi-Agent Prompting for High-quality Verilog Generation."&lt;br&gt;
arXiv:2412.11014, v2 June 2025.&lt;/p&gt;

&lt;p&gt;Pappu, A., El, B., Cao, H., di Nolfo, C., Sun, Y., Cao, M., and Zou, J.&lt;br&gt;
(2026). "Multi-Agent Teams Hold Experts Back." arXiv:2602.01011, February&lt;br&gt;
2026.&lt;/p&gt;

&lt;p&gt;Santos, S., Pimentel, T., Rocha, F., and Soares, M.S. (2024). "Using&lt;br&gt;
Behaviour-Driven Development (BDD) for Non-Functional Requirements."&lt;br&gt;
Software, 3(3), 271-283. doi:10.3390/software3030014&lt;/p&gt;

&lt;p&gt;Snowden, D. and Boone, M. (2007). "A Leader's Framework for Decision&lt;br&gt;
Making." Harvard Business Review, November 2007. Note: the enabling&lt;br&gt;
and governing constraints terminology is not in this paper. It is&lt;br&gt;
from Snowden's Cynefin wiki.&lt;/p&gt;

&lt;p&gt;Snowden, D. "Constraints." cynefin.io/wiki/Constraints.&lt;br&gt;
The Cynefin Company. Accessed March 2026. Confirms: Complicated domain:&lt;br&gt;
governing constraints. Complex domain: enabling constraints.&lt;/p&gt;

&lt;p&gt;Vallecillos-Ruiz, F., Hort, M., and Moonen, L. (2025). "Wisdom and&lt;br&gt;
Delusion of LLM Ensembles for Code Generation and Repair."&lt;br&gt;
arXiv:2510.21513, October 2025.&lt;/p&gt;

&lt;p&gt;Wang, K., Mao, B., Jia, S., Ding, Y., Han, D., Ma, T., and Cao, B.&lt;br&gt;
(2025). "SGCR: A Specification-Grounded Framework for Trustworthy LLM&lt;br&gt;
Code Review." arXiv:2512.17540, December 2025, v2 January 2026.&lt;br&gt;
Note: the 90.9% figure refers to developer adoption rate, not defect&lt;br&gt;
detection rate.&lt;/p&gt;

&lt;p&gt;Zadenoori, M.A., Dabrowski, J., Alhoshan, W., Zhao, L., and Ferrari, A.&lt;br&gt;
(2025). "Large Language Models (LLMs) for Requirements Engineering (RE):&lt;br&gt;
A Systematic Literature Review." arXiv:2509.11446.&lt;/p&gt;

&lt;p&gt;Jin, H. and Chen, H. (2025). "Uncovering Systematic Failures of LLMs in&lt;br&gt;
Verifying Code Against Natural Language Specifications."&lt;br&gt;
arXiv:2508.12358, August 2025. Presented at ASE 2025.&lt;/p&gt;

&lt;p&gt;Jin, H. and Chen, H. (2026). "Are LLMs Reliable Code Reviewers? Systematic&lt;br&gt;
Overcorrection in Requirement Conformance Judgement."&lt;br&gt;
arXiv:2603.00539, March 2026.&lt;/p&gt;

&lt;p&gt;Ma, Z., Zhang, T., Cao, M., Liu, J., Zhang, W., Luo, M., Zhang, S., and Chen, K.&lt;br&gt;
(2025). "Rethinking Verification for LLM Code Generation: From Generation to&lt;br&gt;
Testing." arXiv:2507.06920v2, July 2025.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Series: The Specification as Quality Gate&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1: The Echo Chamber in Your Pipeline&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2: From Complex to Complicated&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/what-specifications-cannot-catch-residual-defect-taxonomy"&gt;Part 3: What Specifications Cannot Catch&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Part 4: The Specification as Quality Gate (this post)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>specifications</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>What Specifications Cannot Catch: A Proposed Taxonomy of the Residual</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:24:03 +0000</pubDate>
      <link>https://forem.com/nuphirho/what-specifications-cannot-catch-a-proposed-taxonomy-of-the-residual-24am</link>
      <guid>https://forem.com/nuphirho/what-specifications-cannot-catch-a-proposed-taxonomy-of-the-residual-24am</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of a four-part series, "The Specification as Quality Gate." &lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1&lt;/a&gt; developed the correlated error hypothesis. &lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2&lt;/a&gt; grounded the argument in complexity science. This post maps what executable specifications cannot catch. Part 4 will follow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The argument for specification-driven development is strong. BDD scenarios&lt;br&gt;
that pass or fail are not probability estimates. Mutation testing that&lt;br&gt;
finds no survivors is a verified claim about the coverage of the&lt;br&gt;
verification layer itself. Contract probes that confirm boundary behaviour&lt;br&gt;
are deterministic gates, not heuristics. The case for putting&lt;br&gt;
specifications before code, and treating the pipeline as the reviewer,&lt;br&gt;
is well-founded.&lt;/p&gt;

&lt;p&gt;"Well-founded" is not the same as "complete." Any argument that executable&lt;br&gt;
specifications make AI code review redundant needs to account honestly for&lt;br&gt;
what specifications cannot catch. That accounting is not a concession. It&lt;br&gt;
is what makes the rest of the argument credible.&lt;/p&gt;

&lt;p&gt;What follows is a proposed taxonomy of the defect classes that lie outside&lt;br&gt;
the reach of executable specifications. To the best of our knowledge, no&lt;br&gt;
prior taxonomy in the testing or formal methods literature organises&lt;br&gt;
defects by specifiability rather than by severity or recovery type. The&lt;br&gt;
reason to organise by specifiability is practical: severity tells you how&lt;br&gt;
bad a defect is. Specifiability tells you which tool to reach for. Those&lt;br&gt;
are different questions, and conflating them is why teams end up applying&lt;br&gt;
AI review to problems where it is circular and missing the problems where&lt;br&gt;
it would genuinely help. The taxonomy should be treated as a working&lt;br&gt;
framework and challenged accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Foundation: The Oracle Problem
&lt;/h2&gt;

&lt;p&gt;Before the categories, a theoretical grounding that gives the taxonomy&lt;br&gt;
its shape.&lt;/p&gt;

&lt;p&gt;The oracle problem in software testing asks: how do you know whether a&lt;br&gt;
test outcome is correct? For a test to be meaningful, you need an oracle,&lt;br&gt;
a ground truth against which to judge the system's output. In most&lt;br&gt;
testing, the oracle is implicit: the developer's understanding of what&lt;br&gt;
the system should do. In specification-driven development, the oracle is&lt;br&gt;
the specification.&lt;/p&gt;

&lt;p&gt;Barr, Harman, McMinn, Shahbaz, and Yoo, in their 2015 survey in IEEE&lt;br&gt;
Transactions on Software Engineering, established that complete oracles&lt;br&gt;
are theoretically impossible for most real-world software. Even a formally&lt;br&gt;
correct specification, internally consistent and precisely expressed,&lt;br&gt;
cannot fully specify the correct output for all possible inputs in all&lt;br&gt;
possible contexts.&lt;/p&gt;

&lt;p&gt;This is not a practical limitation that better process will eventually&lt;br&gt;
overcome. It is a theoretical result. There exists a class of defects for&lt;br&gt;
which no specification, however precise, provides an oracle. That class&lt;br&gt;
defines the permanent boundary of what specification-driven verification&lt;br&gt;
can achieve.&lt;/p&gt;

&lt;p&gt;Everything in the taxonomy that follows should be read against this&lt;br&gt;
foundation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category A: Theoretically Specifiable, Not Yet Specified
&lt;/h2&gt;

&lt;p&gt;The largest and most important category. These are defects that a&lt;br&gt;
specification could have caught if the scenario had been written. The&lt;br&gt;
gap is a process failure, not a theoretical limitation.&lt;/p&gt;

&lt;p&gt;The clearest examples are domain-convention violations: code that is&lt;br&gt;
internally consistent and idiomatic, but wrong relative to a rule that&lt;br&gt;
is not inferrable from the code alone. An interest rate interpolation&lt;br&gt;
function that uses linear interpolation when the market convention for&lt;br&gt;
this curve type is log-linear looks correct to any reviewer without&lt;br&gt;
the specification. The code is clean. The logic is sound. Only the BDD&lt;br&gt;
scenario that states the correct output for a specific tenor gap makes&lt;br&gt;
the violation visible.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. A small experiment run alongside the paper&lt;br&gt;
this post is part of tested exactly this case. Claude reviewed the&lt;br&gt;
function five times without a specification and missed the bug in every&lt;br&gt;
run, flagging a different concern instead. The BDD scenario caught it&lt;br&gt;
immediately. The full experiment is at&lt;br&gt;
&lt;a href="https://github.com/czietsman/nuphirho.dev/tree/main/experiments/correlated-error-v2" rel="noopener noreferrer"&gt;correlated-error-v2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Classic boundary conditions (off-by-one errors, loop termination&lt;br&gt;
edge cases, leap year rules) also belong here in principle. But&lt;br&gt;
frontier models catch those reliably from pattern recognition alone,&lt;br&gt;
because such cases are dense in training data. The specification still matters&lt;br&gt;
for these: it prevents regressions when a refactoring agent removes a&lt;br&gt;
guard clause it considers noise. But the correlated error problem bites&lt;br&gt;
hardest at the domain-specific end of Category A, where neither the&lt;br&gt;
generator nor the reviewer has the convention in its prior.&lt;/p&gt;

&lt;p&gt;Other examples: error handling paths that were not enumerated in&lt;br&gt;
requirements, input validation for edge cases the specification author&lt;br&gt;
did not consider, state machine transitions for states that were not&lt;br&gt;
mapped.&lt;/p&gt;

&lt;p&gt;Category A is the primary target for the specification-first argument.&lt;br&gt;
AI-assisted specification drafting, property-based exploration, and&lt;br&gt;
mutation testing can reduce it systematically over time. The residual&lt;br&gt;
in this category is not permanent. It shrinks as specification discipline&lt;br&gt;
matures.&lt;/p&gt;

&lt;p&gt;An AI review agent operating in Category A without an external&lt;br&gt;
specification shares the same blind spots as the generator: both&lt;br&gt;
drew from the same prior, and whatever the prior underweights, both&lt;br&gt;
miss.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category B: Specifiable in Principle, Economically Impractical
&lt;/h2&gt;

&lt;p&gt;Some defect classes are theoretically specifiable but require verification&lt;br&gt;
effort that exceeds the value of specifying them. A function that takes&lt;br&gt;
three independent boolean parameters has eight input combinations. Ten&lt;br&gt;
parameters gives 1,024. Twenty gives over a million. Every combination's&lt;br&gt;
boundary conditions could be specified, but nobody does this, for good&lt;br&gt;
reason.&lt;/p&gt;

&lt;p&gt;This boundary is moving. Property-based testing frameworks generate inputs&lt;br&gt;
systematically and explore combinatorial spaces that manual scenario&lt;br&gt;
authoring cannot reach. The economics of Category B defects are shifting&lt;br&gt;
as these tools mature and AI-assisted property-based exploration becomes&lt;br&gt;
practical.&lt;/p&gt;

&lt;p&gt;Category B is where AI review can provide legitimate signal, not by&lt;br&gt;
finding what the specification missed, but by sampling the input space&lt;br&gt;
more broadly than the specified scenarios cover. This is heuristic, not&lt;br&gt;
deterministic, but it is genuine signal rather than circular reasoning,&lt;br&gt;
provided the reviewer draws from a prior genuinely different from the&lt;br&gt;
generator's.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category C: Inherently Unspecifiable From Pre-Execution Context
&lt;/h2&gt;

&lt;p&gt;Some defect classes depend on properties of the running system, the&lt;br&gt;
environment, or the interaction between components that only manifest at&lt;br&gt;
runtime. Timing-dependent race conditions, behaviour under partial network&lt;br&gt;
failure, memory behaviour under sustained load, and performance degradation&lt;br&gt;
under specific hardware configurations cannot be expressed as a BDD&lt;br&gt;
scenario that runs before deployment. The conditions that trigger them&lt;br&gt;
do not exist at specification time.&lt;/p&gt;

&lt;p&gt;These are not unknown unknowns in the Category A sense. They are knowable&lt;br&gt;
properties, observable and measurable after the fact. The issue is&lt;br&gt;
temporal, not conceptual.&lt;/p&gt;

&lt;p&gt;The boundary of Category C is moving. Santos, Pimentel, Rocha, and&lt;br&gt;
Soares (Software, 2024) explicitly encoded ISO 25010 performance&lt;br&gt;
efficiency requirements as BDD scenarios. Maaz et al. (arXiv:2510.09907,&lt;br&gt;
2025) used LLM-assisted property-based testing to surface concurrency and&lt;br&gt;
numerical precision bugs that previously required runtime observation to&lt;br&gt;
detect. Some defects classified as inherently unspecifiable five years&lt;br&gt;
ago are specifiable today with the right tooling.&lt;/p&gt;

&lt;p&gt;For the defects that remain in Category C, the right tool is runtime&lt;br&gt;
verification infrastructure. This includes ML-based anomaly detection,&lt;br&gt;
APM tooling with learned behavioural baselines, observability platforms,&lt;br&gt;
chaos engineering, and load testing. This is a mature field predating&lt;br&gt;
LLMs that applies machine learning to operational data rather than to&lt;br&gt;
source code. Neither a BDD pipeline nor an AI code review agent&lt;br&gt;
reaches these defects. They require the system to be running.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category D: Structural and Architectural Properties
&lt;/h2&gt;

&lt;p&gt;Code can pass every BDD scenario, survive every mutation, and satisfy&lt;br&gt;
every contract probe, while simultaneously introducing coupling between&lt;br&gt;
modules that will compound over time, violating layer boundaries the&lt;br&gt;
architecture was designed to enforce, or drifting from the intended design&lt;br&gt;
in ways that only become visible months later.&lt;/p&gt;

&lt;p&gt;These are relational properties of the codebase as a whole, not properties&lt;br&gt;
of individual components in isolation. They resist behavioural specification&lt;br&gt;
because they concern how the code is structured relative to the intended&lt;br&gt;
design, not what the code does.&lt;/p&gt;

&lt;p&gt;But resist does not mean unspecifiable. Architectural rules are specifiable&lt;br&gt;
once a human articulates them. A rule such as "the web layer may not import&lt;br&gt;
from the data layer" is a constraint that can be expressed, enforced, and&lt;br&gt;
tested deterministically. Tools like ArchUnit, Dependency Cruiser, and&lt;br&gt;
NDepend enforce dependency rules as automated checks. Contract testing&lt;br&gt;
frameworks like Pact verify that service boundaries behave as agreed&lt;br&gt;
between producers and consumers. These are deterministic gates, not&lt;br&gt;
heuristic opinions, and they belong in the pipeline alongside BDD&lt;br&gt;
scenarios. General delivery rigour (clear module boundaries, explicit&lt;br&gt;
interface definitions, disciplined use of dependency injection) reduces&lt;br&gt;
the surface area of Category D by making architectural intent concrete&lt;br&gt;
rather than implicit.&lt;/p&gt;
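
&lt;p&gt;A rule like that is small enough to sketch. The layer and module names below are hypothetical, and real tools such as ArchUnit or Dependency Cruiser do far more, but the point survives the simplification: the check is a deterministic pass or fail, not a heuristic opinion.&lt;/p&gt;

```python
import ast

# Hypothetical layering: (importing layer, imported layer) pairs that are banned.
FORBIDDEN = {("web", "data")}

def top_level(module_name):
    return module_name.split(".")[0]

def boundary_violations(layer, source):
    """Return the modules that `source` (code living in `layer`) may not import."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        violations += [t for t in targets if (layer, top_level(t)) in FORBIDDEN]
    return violations

# The web layer may not import from the data layer.
assert boundary_violations("web", "from data.models import User") == ["data.models"]
assert boundary_violations("web", "import services.auth") == []
```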

&lt;p&gt;The residual in Category D, after tooling and contract testing are in&lt;br&gt;
place, is the unarticulated architectural intent that has not yet been&lt;br&gt;
expressed as a rule. Drift from a design decision that was never written&lt;br&gt;
down. Coupling that violates a pattern that exists only in someone's head.&lt;br&gt;
A half-completed migration to a new architectural pattern, where some&lt;br&gt;
modules use the new approach and others still use the old one: no&lt;br&gt;
automated rule catches that because the correct answer depends on whether&lt;br&gt;
the team intended to complete the migration or abandoned it. Dead patterns&lt;br&gt;
accumulate the same way: abstractions introduced for a use case that no&lt;br&gt;
longer exists, base classes with a single subclass, interfaces designed&lt;br&gt;
for extensibility that never came. This is structural noise that becomes&lt;br&gt;
visible only to something reasoning about the whole codebase over time.&lt;/p&gt;

&lt;p&gt;This is where an AI review agent with access to the full codebase and&lt;br&gt;
architectural context adds genuine, non-circular signal. It can identify&lt;br&gt;
incomplete migrations, flag dead abstractions, and surface inconsistencies&lt;br&gt;
that no automated rule yet captures. The role is analogous to an expert&lt;br&gt;
architect reviewing for structural coherence: the agent advises, the human&lt;br&gt;
decides whether to complete the migration, remove the dead pattern, or&lt;br&gt;
codify the observation as a new enforceable rule. That last step closes&lt;br&gt;
the loop back into tooling, which is where the category belongs once the&lt;br&gt;
intent has been made explicit.&lt;/p&gt;

&lt;p&gt;This is the least empirically grounded category in the taxonomy: no&lt;br&gt;
controlled study has yet measured what AI architectural review catches&lt;br&gt;
beyond what specification pipelines and architectural tooling already&lt;br&gt;
catch. It is the most provisional claim here, and should be&lt;br&gt;
treated accordingly.&lt;/p&gt;

&lt;p&gt;Category D is the permanent legitimate home for AI review in a&lt;br&gt;
specification-driven pipeline, not as a substitute for architectural&lt;br&gt;
tooling and contract testing, but as the complement that operates where&lt;br&gt;
rules have not yet been written.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category E: Specification Defects
&lt;/h2&gt;

&lt;p&gt;The oracle problem survey makes this category unavoidable. A specification&lt;br&gt;
that is internally consistent, precisely expressed, and correctly&lt;br&gt;
implemented can still describe a system that does not do what users need.&lt;/p&gt;

&lt;p&gt;Requirements elicitation is not a solved problem. Domain experts disagree.&lt;br&gt;
Users articulate needs in terms of their current mental model, which may&lt;br&gt;
not reflect what they actually want. Business rules change after the&lt;br&gt;
specification was written. Edge cases that matter in production were not&lt;br&gt;
considered during discovery.&lt;/p&gt;

&lt;p&gt;No verification pipeline catches Category E defects, because the pipeline&lt;br&gt;
verifies conformance to the specification. If the specification is wrong,&lt;br&gt;
the pipeline confirms the wrong thing. An AI review agent without external&lt;br&gt;
user feedback cannot catch Category E either, because the agent has no way&lt;br&gt;
to know the specification does not match user intent.&lt;/p&gt;

&lt;p&gt;The right tool for Category E is not in the engineering pipeline at all.&lt;br&gt;
It is user testing, feedback loops, observability of actual usage, and&lt;br&gt;
the design thinking practices that surface unstated assumptions during&lt;br&gt;
requirements elicitation. These are human processes. AI can assist with&lt;br&gt;
them, but not replace them.&lt;/p&gt;

&lt;p&gt;Category E is not a process failure. It is a theoretical limitation.&lt;br&gt;
Even a perfect specification process will leave some Category E defects,&lt;br&gt;
because the oracle for user intent is inherently incomplete.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Taxonomy Implies
&lt;/h2&gt;

&lt;p&gt;Each category points to a different tool.&lt;/p&gt;

&lt;p&gt;Category A is the target for specification discipline. It shrinks as&lt;br&gt;
teams invest in BDD coverage, mutation testing, and AI-assisted&lt;br&gt;
specification drafting. An AI review agent here is a temporary patch for&lt;br&gt;
process immaturity, not a permanent fixture.&lt;/p&gt;

&lt;p&gt;Category B is the target for property-based testing and combinatorial&lt;br&gt;
exploration. AI adds value here only if it brings genuine sampling&lt;br&gt;
diversity, meaning a prior different from the generator's.&lt;/p&gt;

&lt;p&gt;Category C is the target for runtime verification infrastructure:&lt;br&gt;
ML-based anomaly detection, observability tooling, chaos engineering,&lt;br&gt;
and load testing. This is a mature field with its own tooling. Neither&lt;br&gt;
specifications nor AI code reviewers substitute for it.&lt;/p&gt;

&lt;p&gt;Category D is the target for architectural tooling and contract testing&lt;br&gt;
first: ArchUnit, Dependency Cruiser, Pact, and similar deterministic&lt;br&gt;
enforcement mechanisms for rules that have been articulated. AI review&lt;br&gt;
is the complement for the unarticulated residual: drift from design&lt;br&gt;
intent that has not yet been codified as an enforceable rule. The agent&lt;br&gt;
advises, the human decides, and the observation ideally becomes a new&lt;br&gt;
rule that closes back into tooling.&lt;/p&gt;

&lt;p&gt;Category E is the reminder that the specification is never complete. It&lt;br&gt;
gets less incomplete over time. The feedback loop from production back&lt;br&gt;
to requirements is a human loop, and cannot be automated away.&lt;/p&gt;




&lt;h2&gt;
  
  
  An Invitation
&lt;/h2&gt;

&lt;p&gt;This taxonomy is proposed, not established. The categories emerged from&lt;br&gt;
reasoning through what executable specifications can and cannot express,&lt;br&gt;
grounded in the oracle problem literature and tested against empirical&lt;br&gt;
findings from the 2025-2026 AI code review research.&lt;/p&gt;

&lt;p&gt;If the taxonomy is wrong, the corrections are welcome and will improve the&lt;br&gt;
argument. If it is incomplete, the missing categories are interesting. The&lt;br&gt;
most likely challenge is that the boundary between Categories A and D is&lt;br&gt;
blurrier than described: architectural properties may become specifiable&lt;br&gt;
as tooling matures, which would reclassify parts of Category D as&lt;br&gt;
Category A over time. That boundary question is worth watching.&lt;/p&gt;

&lt;p&gt;The goal is not to defend the taxonomy. It is to have a precise enough&lt;br&gt;
framework to stop treating AI code review as a monolithic practice and&lt;br&gt;
start reasoning about which kinds of defects each tool in the pipeline&lt;br&gt;
is actually positioned to find.&lt;/p&gt;

&lt;p&gt;If you work in software testing, formal methods, or AI-assisted&lt;br&gt;
development and the taxonomy is wrong in ways not anticipated here,&lt;br&gt;
the comments are the right place to say so. A taxonomy that provokes&lt;br&gt;
rigorous challenge is more useful than one that goes unchallenged.&lt;br&gt;
Corrections will be published with attribution.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post draws on: Barr, E.T., Harman, M., McMinn, P., Shahbaz, M.,&lt;br&gt;
and Yoo, S. (2015), "The Oracle Problem in Software Testing: A Survey,"&lt;br&gt;
IEEE Transactions on Software Engineering, 41(5), 507-525,&lt;br&gt;
doi:10.1109/TSE.2014.2372785; Santos, S., Pimentel, T., Rocha, F., and&lt;br&gt;
Soares, M.S. (2024), "Using Behaviour-Driven Development (BDD) for&lt;br&gt;
Non-Functional Requirements," Software, 3(3), 271-283,&lt;br&gt;
doi:10.3390/software3030014; and Maaz, M., DeVoe, L., Hatfield-Dodds, Z., and Carlini, N. (2025), "Agentic&lt;br&gt;
Property-Based Testing: Finding Bugs Across the Python Ecosystem,"&lt;br&gt;
arXiv:2510.09907.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Series: The Specification as Quality Gate&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1: The Echo Chamber in Your Pipeline&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2: From Complex to Complicated&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Part 3: What Specifications Cannot Catch (this post)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/specification-as-quality-gate-three-hypotheses"&gt;Part 4: The Specification as Quality Gate&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>specifications</category>
      <category>testing</category>
      <category>ai</category>
      <category>bdd</category>
    </item>
    <item>
      <title>From Complex to Complicated: What Executable Specifications Actually Do</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sun, 22 Mar 2026 21:32:24 +0000</pubDate>
      <link>https://forem.com/nuphirho/from-complex-to-complicated-what-executable-specifications-actually-do-73k</link>
      <guid>https://forem.com/nuphirho/from-complex-to-complicated-what-executable-specifications-actually-do-73k</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of a four-part series, "The Specification as Quality Gate." &lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1&lt;/a&gt; developed the correlated error hypothesis. This post grounds the specification-first argument in complexity science. Parts 3 and 4 will follow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ask an engineering team why they write tests, and most will say: to catch&lt;br&gt;
bugs. Ask why they write specifications, and the answers are less coherent.&lt;br&gt;
Documentation. Compliance. Handover. The process said so.&lt;/p&gt;

&lt;p&gt;None of those answers explain what a specification actually does to the&lt;br&gt;
nature of the problem being solved. Dave Snowden's Cynefin framework offers&lt;br&gt;
the clearest account. The argument here is that executable specifications&lt;br&gt;
do not simply improve quality; they change the class of problem you are&lt;br&gt;
working on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Kinds of Problem
&lt;/h2&gt;

&lt;p&gt;Cynefin distinguishes problem domains by the relationship between cause&lt;br&gt;
and effect. Two are relevant here.&lt;/p&gt;

&lt;p&gt;In the complicated domain, cause and effect are knowable through analysis.&lt;br&gt;
A qualified expert can study the system, apply the right methods, and&lt;br&gt;
arrive at a good answer. The problem is hard but it is tractable. Engineers&lt;br&gt;
building against a well-specified API are working in the complicated domain.&lt;br&gt;
The system's behaviour is knowable if you do the analysis.&lt;/p&gt;

&lt;p&gt;In the complex domain, cause and effect are only knowable in hindsight.&lt;br&gt;
The system responds to interventions in ways you cannot fully predict&lt;br&gt;
before they happen. You probe, observe what emerges, and respond to what&lt;br&gt;
you find. Raising a child is the canonical example. But so is asking an&lt;br&gt;
AI agent to implement a feature from a two-sentence description. The&lt;br&gt;
output is emergent. The only way to know if it is right is to observe&lt;br&gt;
what it does.&lt;/p&gt;

&lt;p&gt;This is not a question of difficulty. It is a question of ontological&lt;br&gt;
character. Complicated problems are hard. Complex problems are of a&lt;br&gt;
different kind.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Separates the Domains
&lt;/h2&gt;

&lt;p&gt;Snowden distinguishes the domains not just by cause-and-effect&lt;br&gt;
relationships but by the constraints that govern system behaviour.&lt;/p&gt;

&lt;p&gt;In the complicated domain, governing constraints apply. These are the&lt;br&gt;
rules, standards, and boundaries that define acceptable behaviour. They&lt;br&gt;
do not specify every action, but they bound the solution space tightly&lt;br&gt;
enough that analysis can identify a good approach.&lt;/p&gt;

&lt;p&gt;In the complex domain, enabling constraints apply. These are looser.&lt;br&gt;
They allow a system to function and create conditions for patterns to&lt;br&gt;
emerge, but the behaviour of the system is not derivable from the&lt;br&gt;
constraints alone. You must observe it.&lt;/p&gt;

&lt;p&gt;An AI agent operating from a vague natural language prompt operates under&lt;br&gt;
enabling constraints at best. The prompt allows the agent to function and&lt;br&gt;
creates conditions for code to emerge, but edge cases, boundary conditions,&lt;br&gt;
and architectural choices are resolved by the model's priors, not by the&lt;br&gt;
stated requirements.&lt;/p&gt;

&lt;p&gt;An AI agent operating from a precise executable specification is in a&lt;br&gt;
different situation. A BDD scenario makes a specific causal claim: given&lt;br&gt;
this precondition, when this action occurs, then this outcome is required.&lt;br&gt;
That claim is verifiable before hindsight. The agent cannot legally produce&lt;br&gt;
code that fails the scenario. When the pipeline runs, the question "does&lt;br&gt;
the system do what it should?" has a deterministic answer. Cause and effect&lt;br&gt;
are knowable through analysis of the specification itself, which is&lt;br&gt;
exactly what defines the complicated domain.&lt;/p&gt;

&lt;p&gt;Writing executable specifications converts enabling constraints into&lt;br&gt;
governing constraints. The problem does not change. The constraint type&lt;br&gt;
does. And with it, the domain.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Note on Terminology
&lt;/h2&gt;

&lt;p&gt;An earlier draft of this argument used "exaptation" for this transition.&lt;br&gt;
That was wrong.&lt;/p&gt;

&lt;p&gt;Exaptation in Snowden's framework describes radical repurposing: taking&lt;br&gt;
something designed for one function and finding it serves a different&lt;br&gt;
function in a different context. It is an innovation mechanism operating&lt;br&gt;
within the complex domain, not a mechanism for moving between domains.&lt;br&gt;
The word comes from evolutionary biology, where Gould and Vrba used it&lt;br&gt;
in 1982 to describe traits coopted for functions different from the ones&lt;br&gt;
they evolved for.&lt;/p&gt;

&lt;p&gt;Constraint transformation is the accurate description. The act of writing&lt;br&gt;
an executable specification converts the type of constraint governing the&lt;br&gt;
system's behaviour, which changes the nature of the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What AI Changes About the Economics
&lt;/h2&gt;

&lt;p&gt;For most of software engineering history, writing precise executable&lt;br&gt;
specifications cost more relative to writing the code they described.&lt;br&gt;
BDD scenarios require domain knowledge, collaboration, and iteration.&lt;br&gt;
Mutation testing requires tooling investment. Contract tests require&lt;br&gt;
discipline about API boundaries. Many teams concluded the benefit did not&lt;br&gt;
justify the cost.&lt;/p&gt;

&lt;p&gt;Two things have shifted this calculation.&lt;/p&gt;

&lt;p&gt;The DORA 2026 report, based on 1,110 Google engineer responses, found&lt;br&gt;
that higher AI adoption correlates with higher throughput and higher&lt;br&gt;
instability simultaneously. Time saved generating code is re-spent&lt;br&gt;
auditing it. The bottleneck has moved from writing code to knowing what&lt;br&gt;
to write and verifying that what was written is correct. Specifications&lt;br&gt;
are no longer overhead relative to implementation. They are the scarce&lt;br&gt;
resource.&lt;/p&gt;

&lt;p&gt;AI also reduces the mechanical cost of expressing intent in executable&lt;br&gt;
form. The evidence here is early but directional. Fonseca, Lima, and&lt;br&gt;
Faria (arXiv:2510.18861, 2025) measured end-to-end Gherkin scenario&lt;br&gt;
generation from JIRA tickets on a production mobile application at BMW&lt;br&gt;
and found that practitioners reported time savings often amounting to a&lt;br&gt;
full developer-day per feature, with the automated generation completing&lt;br&gt;
in under five minutes. In a separate quasi-experiment, Hassani,&lt;br&gt;
Sabetzadeh, and Amyot (arXiv:2508.20744v2, 2026) had software&lt;br&gt;
engineering practitioners evaluate LLM-generated Gherkin specifications&lt;br&gt;
from legal text; 91.7% of ratings on the time savings criterion fell in&lt;br&gt;
the top two categories. Neither study is a controlled experiment. A&lt;br&gt;
2025 systematic review of 74 LLM-for-requirements studies (Zadenoori&lt;br&gt;
et al., arXiv:2509.11446) found that studies are predominantly evaluated&lt;br&gt;
in controlled environments using output quality metrics, with limited&lt;br&gt;
industry use, consistent with specification authoring cost being an&lt;br&gt;
underexplored measurement target in the literature. The evidence is&lt;br&gt;
emerging, not settled.&lt;/p&gt;

&lt;p&gt;What both studies show consistently is that AI shifts the human effort&lt;br&gt;
from authoring to reviewing: from expressing intent to validating&lt;br&gt;
that the expressed intent is accurate. The intent, the domain knowledge,&lt;br&gt;
and the judgment that the scenarios accurately describe what the system&lt;br&gt;
should do remain the human's responsibility. The mechanical work of&lt;br&gt;
expressing that intent in executable form has become substantially&lt;br&gt;
cheaper.&lt;/p&gt;

&lt;p&gt;The domain transition from complex to complicated is now economically&lt;br&gt;
viable at scale in a way it was not five years ago. The claim is not&lt;br&gt;
that AI makes specification automatic. It is that the economics have&lt;br&gt;
shifted enough to make specification-first development the rational&lt;br&gt;
default rather than the disciplined exception.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Khononov's Observation Stops
&lt;/h2&gt;

&lt;p&gt;In May 2025, Vlad Khononov argued on LinkedIn that LLMs turn complex&lt;br&gt;
problems into complicated ones. The direction is right. The mechanism&lt;br&gt;
is incomplete.&lt;/p&gt;

&lt;p&gt;Khononov attributes the domain transition to LLMs broadly. The argument&lt;br&gt;
here is more specific: the LLM is not what makes the transition stick.&lt;br&gt;
The specification is. An LLM operating without a specification produces&lt;br&gt;
output that sits in the complex domain because the agent's behaviour is&lt;br&gt;
still emergent and only knowable in hindsight. An LLM operating within&lt;br&gt;
a specification-governed verification pipeline produces output that can&lt;br&gt;
be assessed deterministically.&lt;/p&gt;

&lt;p&gt;Without the specification, the problem bounces back to complex every time&lt;br&gt;
the agent drifts, which it will. The governing constraint is what holds&lt;br&gt;
the transition.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Likely Objection
&lt;/h2&gt;

&lt;p&gt;A Cynefin-literate reader will raise a fair challenge: software systems&lt;br&gt;
are deterministic at the execution level. Given the same inputs and state,&lt;br&gt;
they produce the same outputs. If cause and effect are always knowable in&lt;br&gt;
principle, the complex framing is a category error. Software development&lt;br&gt;
is always in the complicated domain.&lt;/p&gt;

&lt;p&gt;The response is that the complexity resides in the problem space, not&lt;br&gt;
the execution. What users need, which edge cases matter, how requirements&lt;br&gt;
will evolve. These are genuinely complex. They are knowable only in&lt;br&gt;
hindsight, they respond unpredictably to interventions, and they exhibit&lt;br&gt;
the emergent properties that Snowden places in the complex domain. An&lt;br&gt;
executable specification narrows the problem space by articulating&lt;br&gt;
requirements precisely enough to be analysed. The domain transition&lt;br&gt;
occurs in the problem space, not in the implementation.&lt;/p&gt;

&lt;p&gt;This is not an endorsed Cynefin position. Dave Snowden is actively working&lt;br&gt;
on the relationship between AI and Cynefin domains as of early 2026,&lt;br&gt;
following practitioner discussions, and has not published conclusions.&lt;br&gt;
The argument here is an application of Cynefin vocabulary, checked&lt;br&gt;
carefully against canonical definitions. Readers familiar with the&lt;br&gt;
framework are invited to challenge it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changes in Practice
&lt;/h2&gt;

&lt;p&gt;Teams that operate AI agents without specifications are navigating a&lt;br&gt;
complex domain with complicated-domain tools. They apply analysis and&lt;br&gt;
best practices to a system whose behaviour is emergent. They will get&lt;br&gt;
some things right through pattern matching and experience, and they will&lt;br&gt;
miss systematically at the boundaries, because boundaries are exactly&lt;br&gt;
where emergent behaviour is most unpredictable.&lt;/p&gt;

&lt;p&gt;Teams operating AI agents within a specification-governed verification&lt;br&gt;
pipeline have moved the problem. They are not analysing emergent&lt;br&gt;
behaviour. They are verifying compliance with stated constraints. The&lt;br&gt;
failure mode is specification incompleteness, not emergent&lt;br&gt;
unpredictability. Incompleteness is solvable. A specification gets less&lt;br&gt;
incomplete over time. Emergent unpredictability does not.&lt;/p&gt;

&lt;p&gt;These are not the same problem with different amounts of effort applied.&lt;br&gt;
They are different problems, requiring different investment, different&lt;br&gt;
staffing, and different success measures.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The Cynefin framework was developed by Dave Snowden and described in&lt;br&gt;
Snowden, D. and Boone, M. (2007), "A Leader's Framework for Decision&lt;br&gt;
Making," Harvard Business Review, November 2007. The enabling and&lt;br&gt;
governing constraints terminology is from Snowden's Cynefin wiki:&lt;br&gt;
cynefin.io/wiki/Constraints; this language is not in the 2007 HBR&lt;br&gt;
paper, which focuses on cause-and-effect relationships by domain.&lt;br&gt;
Khononov's observation on LLMs and Cynefin domains was made on LinkedIn&lt;br&gt;
in May 2025. Exaptation definition: Gould, S.J. and Vrba, E.S. (1982),&lt;br&gt;
"Exaptation: a missing term in the science of form," Paleobiology 8(1).&lt;br&gt;
DORA 2026: "Balancing AI Tensions: Moving from AI adoption to effective&lt;br&gt;
SDLC use," dora.dev, March 2026. Fonseca, P.L., Lima, B., and Faria,&lt;br&gt;
J.P. (2025), "Streamlining Acceptance Test Generation for Mobile&lt;br&gt;
Applications Through Large Language Models: An Industrial Case Study,"&lt;br&gt;
arXiv:2510.18861. Hassani, S., Sabetzadeh, M., and Amyot, D. (2026),&lt;br&gt;
"From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality&lt;br&gt;
of LLM-Generated Behavioural Specifications from Food-Safety&lt;br&gt;
Regulations," arXiv:2508.20744v2 (updated March 2026). Zadenoori,&lt;br&gt;
M.A., Dabrowski, J., Alhoshan, W., Zhao, L., and Ferrari, A. (2025),&lt;br&gt;
"Large Language Models (LLMs) for Requirements Engineering (RE): A&lt;br&gt;
Systematic Literature Review," arXiv:2509.11446.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Series: The Specification as Quality Gate&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/echo-chamber-ai-code-review-correlated-error"&gt;Part 1: The Echo Chamber in Your Pipeline&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Part 2: From Complex to Complicated (this post)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/what-specifications-cannot-catch-residual-defect-taxonomy"&gt;Part 3: What Specifications Cannot Catch&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/specification-as-quality-gate-three-hypotheses"&gt;Part 4: The Specification as Quality Gate&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cynefin</category>
      <category>specifications</category>
      <category>complexity</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Echo Chamber in Your Pipeline</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Sun, 22 Mar 2026 21:32:22 +0000</pubDate>
      <link>https://forem.com/nuphirho/the-echo-chamber-in-your-pipeline-34o0</link>
      <guid>https://forem.com/nuphirho/the-echo-chamber-in-your-pipeline-34o0</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 1 of a four-part series, "The Specification as Quality Gate." The series develops three hypotheses about executable specifications, AI code review, and what each is actually for. Parts 2, 3, and 4 will follow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When an AI coding agent generates code and a separate AI reviewer examines&lt;br&gt;
it, both agents are reasoning from the same artefact: the code itself. If&lt;br&gt;
neither has an external reference (a specification, a contract, a&lt;br&gt;
statement of what the system is supposed to do), the reviewer has no&lt;br&gt;
ground truth to compare against. It checks the code against the code. Not&lt;br&gt;
against intent.&lt;/p&gt;

&lt;p&gt;The architecture is circular. And there is now enough empirical evidence&lt;br&gt;
to say precisely where and how it fails.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Mechanism
&lt;/h2&gt;

&lt;p&gt;In classical ensemble learning, stacking multiple estimators improves&lt;br&gt;
reliability under one condition: the estimators must fail independently.&lt;br&gt;
If two classifiers share the same training distribution and the same blind&lt;br&gt;
spots, combining them does not reduce error. It consolidates it. The joint&lt;br&gt;
miss rate of two correlated estimators approaches the miss rate of either&lt;br&gt;
one alone, not the product of both.&lt;/p&gt;
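&lt;p&gt;The arithmetic is easy to check with a toy simulation. The 20% miss rate below is invented for illustration; only the comparison between the two cases matters.&lt;/p&gt;

```python
import random

# Toy simulation: two code reviewers that each miss 20% of defects.
# The error rate is illustrative, not measured from any real model.
random.seed(0)
N = 100_000
p_miss = 0.2

both_independent = 0
both_correlated = 0
for _ in range(N):
    a = random.random() < p_miss  # reviewer 1 misses this defect
    b = random.random() < p_miss  # an INDEPENDENT reviewer 2
    if a and b:
        both_independent += 1
    if a:  # a CORRELATED reviewer 2 shares reviewer 1's blind spots exactly
        both_correlated += 1

print(f"independent joint miss rate: {both_independent / N:.3f}")  # ~0.04
print(f"correlated  joint miss rate: {both_correlated / N:.3f}")   # ~0.20
```

&lt;p&gt;With independent failures the joint miss rate is the product (0.2 × 0.2 = 0.04). With fully shared blind spots it stays at the single-reviewer rate.&lt;/p&gt;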

&lt;p&gt;This holds for LLM pipelines for a specific reason: a code-generating&lt;br&gt;
model and a code-reviewing model drawn from the same model family share&lt;br&gt;
architecture, training corpus, and reward signal. That is the same class&lt;br&gt;
of correlation the independence condition prohibits. They are not two&lt;br&gt;
independent estimators. They are two samples from the same prior.&lt;/p&gt;

&lt;p&gt;A 2025 paper (Vallecillos-Ruiz, Hort, and Moonen, arXiv:2510.21513)&lt;br&gt;
studying LLM ensembles for code generation and repair named this the&lt;br&gt;
"popularity trap." Models trained on similar distributions converge on&lt;br&gt;
the same syntactically plausible but semantically wrong answers. Consensus&lt;br&gt;
selection (the default review heuristic in most multi-agent pipelines)&lt;br&gt;
filters out the minority correct solutions and amplifies the shared&lt;br&gt;
error. Diversity-based selection, by contrast, recovered up to 95% of&lt;br&gt;
the gain that a perfectly independent ensemble would achieve. The&lt;br&gt;
implication is direct: naive stacking of homogeneous LLMs is&lt;br&gt;
counterproductive.&lt;/p&gt;

&lt;p&gt;A second paper (Mi et al., arXiv:2412.11014) examined a&lt;br&gt;
researcher-then-reviser pipeline for Verilog code generation and found&lt;br&gt;
that erroneous information from the first agent was accepted downstream&lt;br&gt;
because the agents shared the same training distribution and lacked&lt;br&gt;
adversarial diversity. Self-review repeated the original errors rather&lt;br&gt;
than correcting them.&lt;/p&gt;

&lt;p&gt;A February 2026 paper (Pappu et al., arXiv:2602.01011) went further,&lt;br&gt;
showing that even heterogeneous multi-agent teams consistently fail to&lt;br&gt;
match their best individual member, incurring performance losses of up to&lt;br&gt;
37.6%, even when explicitly told which member is the expert. The failure&lt;br&gt;
mechanism is consensus-seeking over expertise. Homogeneous copies make&lt;br&gt;
it worse.&lt;/p&gt;

&lt;p&gt;None of these papers tested a pure generator-then-reviewer pipeline&lt;br&gt;
without external grounding in the exact configuration described here.&lt;br&gt;
That controlled experiment has not been published. But three independent&lt;br&gt;
papers from 2025 and 2026, approaching the problem from different angles,&lt;br&gt;
document the same failure mode: shared training distributions produce&lt;br&gt;
correlated errors, and consensus amplifies rather than corrects them.&lt;/p&gt;


&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;This example uses a deliberately simple case. Modern frontier models catch&lt;br&gt;
classic boundary conditions reliably, as the experiments described in the&lt;br&gt;
caveats section confirm. The point here is the sequence: how a BDD&lt;br&gt;
scenario makes a defect detectable before any reviewer is involved.&lt;/p&gt;

&lt;p&gt;Consider a pagination function. A developer asks an AI coding agent to&lt;br&gt;
implement it. The agent produces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;paginate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;page_size&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent also produces tests: one for page 0 and one for page 1 against&lt;br&gt;
a ten-item list. Both pass. The implementation is correct for those cases.&lt;/p&gt;

&lt;p&gt;The flaw is in what was not tested. The implementation indexes pages&lt;br&gt;
from zero. A caller who calculates the last page from a total that is&lt;br&gt;
exactly divisible by the page size (10 items at 5 per page is page 2)&lt;br&gt;
counts from one, and gets an empty list back. The agent's tests encode&lt;br&gt;
the implementation's own convention, so the mismatch is invisible to the&lt;br&gt;
tests provided.&lt;/p&gt;

&lt;p&gt;The review agent, given only the code and the tests, validates the&lt;br&gt;
implementation against what is there. The tests pass. The reviewer&lt;br&gt;
reports no issues. It has no basis to ask what was not tested, because&lt;br&gt;
there is no specification defining the expected behaviour for boundary&lt;br&gt;
conditions.&lt;/p&gt;

&lt;p&gt;The BDD scenario that would have caught it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Last page when total is exactly divisible by page size
  &lt;span class="nf"&gt;Given &lt;/span&gt;a list of 10 items
  &lt;span class="nf"&gt;And &lt;/span&gt;a page size of 5
  &lt;span class="nf"&gt;When &lt;/span&gt;I request page 1
  &lt;span class="nf"&gt;Then &lt;/span&gt;I receive items 6 through 10
  &lt;span class="nf"&gt;And &lt;/span&gt;the result contains exactly 5 items
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This scenario would have failed against the implementation before the&lt;br&gt;
reviewer was ever involved. The pipeline stops at the specification, not&lt;br&gt;
the review.&lt;/p&gt;

&lt;p&gt;There is a second reason this matters beyond the review question. Even&lt;br&gt;
where models reliably identify boundary conditions during focused review,&lt;br&gt;
general refactoring passes introduce a different risk. An agent&lt;br&gt;
refactoring for readability or performance is not focused on correctness.&lt;br&gt;
Conditional checks and edge-case handling can be silently removed. The&lt;br&gt;
BDD pipeline catches that regression the same way it catches the original&lt;br&gt;
defect. The protection is unconditional on what the agent was trying to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the SGCR Paper Actually Says
&lt;/h2&gt;

&lt;p&gt;A December 2025 paper from HiThink Research (arXiv:2512.17540) proposed&lt;br&gt;
a Specification-Grounded Code Review framework and reported a 90.9%&lt;br&gt;
improvement over a baseline LLM reviewer. This figure circulates widely&lt;br&gt;
and requires a correction.&lt;/p&gt;

&lt;p&gt;The 90.9% is an improvement in developer adoption rate of review&lt;br&gt;
suggestions: 42% versus 22%. Developer adoption reflects whether&lt;br&gt;
suggestions are relevant and actionable. It is not a defect detection&lt;br&gt;
rate. These are different claims.&lt;/p&gt;

&lt;p&gt;The SGCR paper validates the hypothesis partially. Without specification&lt;br&gt;
grounding, baseline LLM reviewers produce generic suggestions and&lt;br&gt;
hallucinated issues. Grounding in human-authored specifications filters&lt;br&gt;
the noise and produces suggestions developers act on. But the paper's&lt;br&gt;
architecture, combining an explicit specification-driven path and an&lt;br&gt;
implicit heuristic discovery path, does not claim specifications&lt;br&gt;
make review redundant. It positions them as making review better.&lt;/p&gt;

&lt;p&gt;SGCR's implicit pathway explicitly handles discovery of issues beyond the&lt;br&gt;
stated rules. This is the paper's own acknowledgement that specifications&lt;br&gt;
alone are insufficient. It is consistent with the argument that an AI&lt;br&gt;
review agent has a legitimate residual role, but only after the&lt;br&gt;
specification pipeline has done its work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Independence Condition
&lt;/h2&gt;

&lt;p&gt;The correlated error claim has a precise boundary. Stacking any two AI&lt;br&gt;
reviewers is not always counterproductive. The failure condition is&lt;br&gt;
specific: estimators that share a training distribution and lack an&lt;br&gt;
external reference exhibit correlated failures. The condition for genuine&lt;br&gt;
benefit is diversity plus external grounding.&lt;/p&gt;

&lt;p&gt;If the reviewer uses a different model family, a different temperature, or&lt;br&gt;
a substantially different prompting strategy, errors may be partially&lt;br&gt;
independent. A cross-family pipeline, Grok reviewing Claude-generated&lt;br&gt;
code for instance, has more independence than a same-family pipeline.&lt;br&gt;
But model diversity and external grounding are two separate conditions.&lt;br&gt;
Diversity reduces correlation between estimators. It does not supply&lt;br&gt;
ground truth. A cross-family reviewer without an external specification&lt;br&gt;
is still checking code against code, not code against intent. It will&lt;br&gt;
catch some things the generator missed. It will share the same blind&lt;br&gt;
spots on anything not well-represented in either training corpus, and&lt;br&gt;
it has no basis to identify what was not specified regardless of how&lt;br&gt;
different its architecture is from the generator.&lt;/p&gt;

&lt;p&gt;The ensemble literature is clear that diversity is what makes stacking&lt;br&gt;
work. The specification is what makes review non-circular. Both&lt;br&gt;
conditions matter. The current industry architecture typically satisfies&lt;br&gt;
neither.&lt;/p&gt;




&lt;h2&gt;
  
  
  But What About Documentation?
&lt;/h2&gt;

&lt;p&gt;A reasonable objection: if the problem is that the reviewer lacks ground&lt;br&gt;
truth, why not solve it by adding documentation? Write better docstrings.&lt;br&gt;
Maintain a spec document. Give the reviewer the context it needs.&lt;/p&gt;

&lt;p&gt;Our own experiment accidentally tested this. The first version of the v2&lt;br&gt;
corpus included docstrings that stated the domain convention explicitly.&lt;br&gt;
The &lt;code&gt;prorate_premium&lt;/code&gt; docstring said "ISDA actual/actual." The&lt;br&gt;
&lt;code&gt;schedule_maintenance&lt;/code&gt; docstring said "ANY of these thresholds." Claude&lt;br&gt;
caught every bug at 100%. Documentation in close proximity to the code&lt;br&gt;
works, at least when the agent reads it.&lt;/p&gt;

&lt;p&gt;That result is real, but it does not solve the problem. It reframes it.&lt;/p&gt;

&lt;p&gt;The first issue is retrieval and compliance. A docstring sitting on the&lt;br&gt;
same function is the best case for specification proximity. An external&lt;br&gt;
policy document, a requirements file, or a Confluence page is a much&lt;br&gt;
weaker guarantee. The agent needs to find it, read it, and apply it faithfully.&lt;br&gt;
Anyone who has watched an AI agent work at scale knows that is not a safe&lt;br&gt;
assumption. Agents read context selectively. They produce plausible output&lt;br&gt;
that satisfies surface checks without necessarily following the rule they&lt;br&gt;
were given. I have seen this enough times firsthand to know I cannot&lt;br&gt;
treat documentation-as-specification as reliable ground truth.&lt;/p&gt;

&lt;p&gt;The second issue is drift. Docstrings are not executable. They do not&lt;br&gt;
fail when the code diverges from them. A function can be refactored, a&lt;br&gt;
business rule can change, an edge case can be added, and the docstring&lt;br&gt;
sits there describing what the function used to do, confidently, in&lt;br&gt;
present tense. Every codebase accumulates this. The reviewer checking&lt;br&gt;
code against a stale docstring is not checking against intent. It is&lt;br&gt;
checking against what someone intended when they first wrote it.&lt;/p&gt;

&lt;p&gt;A BDD scenario cannot drift silently. When the system stops behaving the&lt;br&gt;
way the scenario describes, the scenario fails. The build stops. The&lt;br&gt;
divergence is visible immediately. That is not a property of documentation.&lt;br&gt;
It is a property of executable specifications. The difference is not one&lt;br&gt;
of quality or discipline. It is structural.&lt;/p&gt;

&lt;p&gt;The objection "just write better documentation" asks for the same&lt;br&gt;
discipline that produced the bad documentation in the first place, and&lt;br&gt;
adds an assumption that AI agents will follow it reliably. The BDD&lt;br&gt;
pipeline removes both dependencies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Implication
&lt;/h2&gt;

&lt;p&gt;An AI reviewer without an external specification is a probability estimate,&lt;br&gt;
not a quality gate. It samples from the same distribution as the generator&lt;br&gt;
and applies pattern matching against its training data. It will catch some&lt;br&gt;
things the generator missed, particularly common vulnerability patterns&lt;br&gt;
and well-documented anti-patterns that appear frequently in training data.&lt;br&gt;
It will systematically miss whatever the generator systematically missed,&lt;br&gt;
because they share the same prior.&lt;/p&gt;

&lt;p&gt;A BDD scenario that fails is a falsified claim. The pipeline either passes&lt;br&gt;
or it does not. There is no hallucination, no false confidence, no noise&lt;br&gt;
to filter.&lt;/p&gt;

&lt;p&gt;The architecture that follows is not "no AI review." It is: specifications&lt;br&gt;
first, verification pipeline second, AI review only for the residual. The&lt;br&gt;
residual is real. It includes architectural properties, structural drift,&lt;br&gt;
and defect classes that resist specification, but it is a much smaller&lt;br&gt;
target than the current industry framing suggests. The next post in this&lt;br&gt;
series maps that residual precisely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Caveats and Open Questions
&lt;/h2&gt;

&lt;p&gt;This argument rests on three empirical papers and two small contrived&lt;br&gt;
experiments. Both experiments are same-family only (Claude reviewing&lt;br&gt;
Claude-generated code) and use a planted bug corpus rather than a natural&lt;br&gt;
defect sample. They are directional evidence, not a controlled&lt;br&gt;
demonstration.&lt;/p&gt;

&lt;p&gt;The first experiment used classic boundary-condition bugs. Claude caught&lt;br&gt;
all five at 100%, which refined rather than confirmed the hypothesis.&lt;br&gt;
Classic boundary conditions are pattern-recognition problems that are&lt;br&gt;
dense in training data. The correlated error claim applies where the&lt;br&gt;
convention is absent from training data, not where it is ubiquitous.&lt;/p&gt;

&lt;p&gt;The second experiment used domain-convention violations with neutral&lt;br&gt;
docstrings. Detection ranged from 0% to 100% depending on domain opacity.&lt;br&gt;
The log-linear interpolation function, where the convention is market&lt;br&gt;
practice in fixed income rather than general programming knowledge, was&lt;br&gt;
missed in all five runs. The reviewer flagged a different concern instead&lt;br&gt;
and declared the implementation correct. BDD caught it. The full corpus,&lt;br&gt;
specifications, scripts, and results are available in the&lt;br&gt;
&lt;a href="https://github.com/czietsman/nuphirho.dev/tree/main/experiments" rel="noopener noreferrer"&gt;experiments directory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The independence condition also requires more empirical work. Same-family&lt;br&gt;
models fail the independence requirement. How much architectural diversity (different model families, different prompting strategies, different&lt;br&gt;
temperatures) is required to provide genuine signal rather than&lt;br&gt;
correlated noise remains to be established.&lt;/p&gt;

&lt;p&gt;These are open questions, not fatal weaknesses. Without an external&lt;br&gt;
reference, circular review is circular. The empirical work so far points&lt;br&gt;
in one direction. A controlled demonstration using a natural defect sample&lt;br&gt;
and cross-family pipelines would make the argument more precise.&lt;/p&gt;

&lt;p&gt;The authors of the papers cited here, along with anyone working on LLM&lt;br&gt;
ensemble reliability, multi-agent code pipelines, or specification-grounded&lt;br&gt;
review, are welcome to challenge, correct, or build on this argument.&lt;br&gt;
If the hypothesis is wrong, knowing precisely where it breaks is more&lt;br&gt;
useful than leaving it unchallenged. Comments are open.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The papers cited in this post are: Vallecillos-Ruiz, F., Hort, M., and&lt;br&gt;
Moonen, L., "Wisdom and Delusion of LLM Ensembles for Code Generation and&lt;br&gt;
Repair," arXiv:2510.21513, October 2025; Mi, Z. et al., "CoopetitiveV:&lt;br&gt;
Leveraging LLM-powered Coopetitive Multi-Agent Prompting for High-quality&lt;br&gt;
Verilog Generation," arXiv:2412.11014, v2 June 2025; Pappu, A. et al.,&lt;br&gt;
"Multi-Agent Teams Hold Experts Back," arXiv:2602.01011, February 2026;&lt;br&gt;
and Wang, K., Mao, B., Jia, S., Ding, Y., Han, D., Ma, T., and Cao, B.,&lt;br&gt;
"SGCR: A Specification-Grounded Framework for Trustworthy LLM Code&lt;br&gt;
Review," arXiv:2512.17540, December 2025, v2 January 2026. Author lists&lt;br&gt;
and findings verified against the original papers.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Series: The Specification as Quality Gate&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Part 1: The Echo Chamber in Your Pipeline (this post)&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/executable-specifications-cynefin-domain-transition"&gt;Part 2: From Complex to Complicated&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/what-specifications-cannot-catch-residual-defect-taxonomy"&gt;Part 3: What Specifications Cannot Catch&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/specification-as-quality-gate-three-hypotheses"&gt;Part 4: The Specification as Quality Gate&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>specifications</category>
      <category>bdd</category>
    </item>
    <item>
      <title>The Trust Barrier</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Tue, 10 Mar 2026 19:21:03 +0000</pubDate>
      <link>https://forem.com/nuphirho/the-trust-barrier-444a</link>
      <guid>https://forem.com/nuphirho/the-trust-barrier-444a</guid>
      <description>&lt;p&gt;The biggest obstacle to AI in software engineering isn't the technology. It's trust.&lt;/p&gt;

&lt;p&gt;Teams don't trust AI-generated code. And they shouldn't, not blindly. But the conversation in most engineering organisations has stalled at that point. The distrust is treated as a conclusion rather than a starting point. AI generates code we can't trust, therefore we shouldn't use AI. End of discussion.&lt;/p&gt;

&lt;p&gt;That's the wrong framing. The right question isn't whether to trust AI. It's how to build a process that makes trust measurable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The carpenter and the power tools
&lt;/h2&gt;

&lt;p&gt;Uncle Bob at &lt;a href="https://www.linkedin.com/company/cleancoder/" rel="noopener noreferrer"&gt;Uncle Bob Consulting LLC&lt;/a&gt; has been working through this in public, and his evolving position is worth following. Back in September 2025, he was clear that AI would change programming for the better but not replace programmers.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1962636247769530650-734" src="https://platform.twitter.com/embed/Tweet.html?id=1962636247769530650"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-1962636247769530650-734');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=1962636247769530650&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;By January 2026, after working with AI more directly, he observed something important about the relationship between programmer and AI: it's not like a pilot and autopilot. His conclusion was striking.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2008879916301898134-431" src="https://platform.twitter.com/embed/Tweet.html?id=2008879916301898134"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-2008879916301898134-431');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2008879916301898134&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;"In the end the AI might write 90% of the code, but the programmer will have put in 90% of the thought and effort."&lt;/p&gt;

&lt;p&gt;Then in February, after six weeks of daily use, he compared the experience to a skilled carpenter being handed power tools for the first time. The power is undeniable, but so are the risks. He's still not sure his project wouldn't be just as far along without it.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2019025982863069621-423" src="https://platform.twitter.com/embed/Tweet.html?id=2019025982863069621"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-2019025982863069621-423');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2019025982863069621&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;That progression resonates because it captures where most experienced engineers are right now. They can see the potential. They can also see the ways it could go wrong. And without a clear framework for managing those risks, the rational response is caution.&lt;/p&gt;

&lt;p&gt;But here's where it gets really interesting. Uncle Bob, the person who formalised the three laws of TDD, recently acknowledged that TDD as traditionally practised is inefficient for AI. The principles remain, but the techniques need to adapt.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2023158252700066287-865" src="https://platform.twitter.com/embed/Tweet.html?id=2023158252700066287"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-2023158252700066287-865');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2023158252700066287&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;That's a significant shift from someone who has spent decades advocating for specific engineering disciplines. And it's exactly the point. The principles of verification, rigour, and specification don't change. The techniques do.&lt;/p&gt;

&lt;p&gt;I've been through the same journey. I spent time being cautious, reviewing everything line by line, second-guessing outputs, wondering whether I was actually saving time or just shifting the effort from writing to reviewing. It felt productive, but it wasn't fundamentally different from writing the code myself.&lt;/p&gt;

&lt;p&gt;Here's what I found on the other side of that phase: the question isn't "can I trust this code?" It's "does this code meet the spec?"&lt;/p&gt;

&lt;p&gt;That reframe changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  From gut feeling to process
&lt;/h2&gt;

&lt;p&gt;When trust is a feeling, it's subjective, inconsistent, and unscalable. One developer might trust AI-generated code after a quick scan. Another might rewrite it entirely. Neither approach is reliable, and neither scales to a team.&lt;/p&gt;

&lt;p&gt;When trust is a process, it becomes a solvable engineering problem. You define what correct looks like before any code gets written. You build verification layers that catch failures regardless of who or what produced the code. You measure outcomes rather than gut-checking inputs.&lt;/p&gt;

&lt;p&gt;This isn't a new idea. It's the scientific method applied to software delivery. Hypothesise what the system should do. Build it. Test whether it does what you hypothesised. If it doesn't, iterate.&lt;/p&gt;

&lt;p&gt;We call it computer science for a reason, but somewhere along the way the discipline drifted from its roots. We stopped treating software delivery as an empirical process and started treating it as a craft, something learned through apprenticeship, refined through experience, and validated by the judgement of senior practitioners. That model worked well enough when humans were writing all the code. Experienced judgement was the best verification layer we had.&lt;/p&gt;

&lt;p&gt;But craft-based verification doesn't scale to AI-generated output. You can't rely on an experienced developer's intuition when the volume of generated code outpaces any individual's ability to read it. What scales is the scientific method: define the expected behaviour precisely, run the experiment, measure the result. If the result matches the hypothesis, the code is fit for purpose. If it doesn't, you have a specific, measurable failure to investigate.&lt;/p&gt;

&lt;p&gt;The difference now is that AI has made this approach economically viable in ways it wasn't before.&lt;/p&gt;

&lt;p&gt;Writing behavioural specifications, running mutation tests, enforcing contract-driven APIs, these practices have existed for years. They were always the right thing to do. But the effort required to implement them properly meant that most teams cut corners. When you're manually writing all the code, spending additional time writing comprehensive specifications and mutation test suites feels like a luxury.&lt;/p&gt;

&lt;p&gt;AI changes that equation. When the code generation is fast and cheap, the verification layer becomes the primary deliverable. The spec is no longer documentation you write after the fact. It's the thing you write first, and it's the thing that determines whether the output is fit for purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;The process I've built works like this:&lt;/p&gt;

&lt;p&gt;Write behavioural specifications before any code gets written, by human or AI. Use BDD to define what the system should do, not how it should do it. These specifications are executable. They're not documentation sitting in a wiki. They run as part of the pipeline and they fail when the code doesn't match the expected behaviour.&lt;/p&gt;
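&lt;p&gt;Stripped of framework machinery, an executable specification is just Gherkin steps bound to code. The cart-and-voucher domain below is invented; tools such as behave or pytest-bdd do the binding for real suites.&lt;/p&gt;

```python
# Each Gherkin step maps to a plain function; the scenario runs as code
# and fails when behaviour diverges. Domain and numbers are illustrative.

def given_a_cart_with(prices):
    return {"prices": list(prices)}

def when_a_voucher_is_applied(cart, percent):
    cart["total"] = sum(cart["prices"]) * (100 - percent) / 100
    return cart

def then_the_total_is(cart, expected):
    assert cart["total"] == expected, f"expected {expected}, got {cart['total']}"

# Scenario: 50% voucher on a 200.00 cart
cart = given_a_cart_with([120.0, 80.0])
cart = when_a_voucher_is_applied(cart, 50)
then_the_total_is(cart, 100.0)
```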

&lt;p&gt;Run mutation tests to validate that your tests actually catch failures. A passing test suite means nothing if the tests would still pass with broken code. Mutation testing introduces deliberate faults and checks whether your tests detect them. It's the verification layer for your verification layer.&lt;/p&gt;
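&lt;p&gt;A hand-rolled sketch of the idea; real tools such as mutmut or PIT generate and run the mutants automatically. The function and values here are illustrative.&lt;/p&gt;

```python
# A weak test suite passes for the original AND the mutant; a strong one
# kills the mutant by exercising the behaviour the mutation breaks.

def discount(price, rate):
    return price * (1 - rate)

def discount_mutant(price, rate):
    return price * (1 + rate)  # deliberate fault: sign flipped

def weak_suite(fn):
    return fn(100, 0) == 100  # never exercises a non-zero rate

def strong_suite(fn):
    return fn(100, 0) == 100 and fn(100, 0.5) == 50

assert weak_suite(discount) and weak_suite(discount_mutant)          # survives
assert strong_suite(discount) and not strong_suite(discount_mutant)  # killed
```

&lt;p&gt;A surviving mutant is a precise, actionable finding: a behaviour your tests claim to cover but do not.&lt;/p&gt;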

&lt;p&gt;Enforce contract-driven APIs so integrations are verifiable independently. When services agree on contracts, you can validate that each side honours the agreement without spinning up the entire system. This matters even more when AI is generating the integration code, because the contracts become the source of truth that the generated code is measured against.&lt;/p&gt;
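&lt;p&gt;A minimal sketch of the contract idea, with invented field names; consumer-driven contract tools such as Pact formalise this with generated verifications on both sides.&lt;/p&gt;

```python
# The contract records what the consumer relies on; the provider verifies
# its responses against it without the consumer's system running.

contract = {"id": int, "status": str}  # illustrative consumer expectations

def satisfies(response, contract):
    return all(
        field in response and isinstance(response[field], expected)
        for field, expected in contract.items()
    )

# Provider-side verification, independent of any live integration.
# Extra fields are fine; missing fields or wrong types are breaches.
assert satisfies({"id": 42, "status": "active", "extra": True}, contract)
assert not satisfies({"id": "42", "status": "active"}, contract)  # wrong type
```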

&lt;p&gt;Review outputs against specifications, not line by line. This is the fundamental shift. Instead of reading every line of AI-generated code and asking "does this look right?", you run the specification suite and ask "does this behave correctly?" The first approach doesn't scale. The second does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The personal proof point
&lt;/h2&gt;

&lt;p&gt;I don't write code by hand anymore. I direct AI to write it. I've never formally learned Vue or Go, but I'm shipping production-ready prototypes in both. Not because I blindly trust AI, but because I've built the process that lets me verify the output.&lt;/p&gt;

&lt;p&gt;That statement makes some people uncomfortable. It sounds reckless to say "I ship code in languages I haven't formally learned." But the discomfort reveals the assumption: that understanding every line of code is a prerequisite for shipping reliable software. It's not. Understanding the specification, the verification layer, and the failure modes is the prerequisite. The language is an implementation detail.&lt;/p&gt;

&lt;p&gt;This doesn't mean the code doesn't matter. It means the specification matters more. And when you have a robust specification, the question of whether a human or an AI wrote the implementation becomes less important than whether the implementation meets the spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  The competitive advantage is process
&lt;/h2&gt;

&lt;p&gt;The teams that figure this out first won't just be faster. They'll be more reliable than teams still reviewing every line by hand. That's counterintuitive, because we associate human review with quality. But human review is inconsistent, fatiguing, and subject to all the cognitive biases that make us miss things we've seen a hundred times before.&lt;/p&gt;

&lt;p&gt;A specification-driven process with automated verification doesn't get tired. It doesn't skip steps on a Friday afternoon. It catches the same class of failures at 5pm that it catches at 9am. It scales to any volume of AI-generated code without degrading in quality.&lt;/p&gt;

&lt;p&gt;The trust barrier is real. I'm not dismissing the concern. Engineers who are cautious about AI-generated code are being responsible. But caution without a path forward is just avoidance. The path forward is process.&lt;/p&gt;

&lt;p&gt;Build the verification layer. Let the spec be the source of truth. And stop asking whether you can trust the code. Start asking whether it meets the spec.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>process</category>
      <category>bdd</category>
      <category>trust</category>
    </item>
    <item>
      <title>When bash gets too wild: rewriting my publish pipeline in Go</title>
      <dc:creator>Christo Zietsman</dc:creator>
      <pubDate>Mon, 09 Mar 2026 13:11:50 +0000</pubDate>
      <link>https://forem.com/nuphirho/when-bash-gets-too-wild-rewriting-my-publish-pipeline-in-go-1eff</link>
      <guid>https://forem.com/nuphirho/when-bash-gets-too-wild-rewriting-my-publish-pipeline-in-go-1eff</guid>
      <description>&lt;p&gt;There is a particular kind of technical debt that only grows in shell scripts. It starts innocently: a &lt;code&gt;sed&lt;/code&gt; command to strip some quotes, a &lt;code&gt;grep&lt;/code&gt; to pull out a frontmatter field, a &lt;code&gt;jq&lt;/code&gt; one-liner to map tags. Each fix is small and defensible in isolation. But give it a few months and you have something that works in exactly the conditions it was written for and fails in ways that are almost impossible to reason about.&lt;/p&gt;

&lt;p&gt;That is where I ended up with the publish pipeline for this blog, and it did not even take months. We built it fresh, but the pace of iteration compressed what normally takes a year into days. The workarounds accumulated just as fast.&lt;/p&gt;

&lt;p&gt;The pipeline cross-posts markdown files to Hashnode and Dev.to via their APIs. It lives in a GitHub Actions workflow, and it started failing almost immediately: quoted YAML values, tags with hyphens, multi-line frontmatter. Each failure led to another workaround: quote-stripping logic, a tag glossary loaded via &lt;code&gt;jq&lt;/code&gt;, conditionals that were more incantation than logic. The script became the kind of thing you are afraid to touch because you cannot be certain what it is actually doing.&lt;/p&gt;

&lt;p&gt;The honest diagnosis: I had reached the natural ceiling of bash for this kind of work. Bash has no good answer to "how do I test this?" And if you cannot test it, you cannot trust it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost of untestable pipelines
&lt;/h2&gt;

&lt;p&gt;This is not unique to personal blog tooling. Infrastructure scripts, CI workflows, deployment pipelines: these are often written in bash because bash is always available and the task seems small at the time. The problem is that "small at the time" is doing a lot of work in that sentence. Pipelines accumulate logic. They handle edge cases. They become load-bearing parts of a development workflow. And because they are in bash, they remain untestable by default.&lt;/p&gt;

&lt;p&gt;The failure mode is not usually a catastrophic crash. It is subtler: a post that silently publishes with the wrong tags, a frontmatter field that gets parsed incorrectly and corrupts a slug, a workaround that fixes one case while breaking another you did not think to check. You discover these failures after the fact, in production, with no good way to reproduce them locally.&lt;/p&gt;

&lt;p&gt;I have spent a lot of time over the past year thinking about what rigorous software engineering actually looks like when AI is in the picture. One of the core arguments I keep coming back to is that AI has collapsed the cost of doing things properly. That covers not just the glamorous parts like architecture and feature development but also the unglamorous ones: tests, specs, validation, structured error handling. If the tooling for rigour is now cheaper to write, there is less justification for skipping it. That includes pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rewrite
&lt;/h2&gt;

&lt;p&gt;The goal was straightforward: replace the inline bash in &lt;code&gt;.github/workflows/publish.yml&lt;/code&gt; with a Go CLI tool. The workflow calls the binary; all logic lives in testable Go code with BDD specifications written before any implementation.&lt;/p&gt;

&lt;p&gt;The package structure reflects the actual problem decomposition: &lt;code&gt;internal/frontmatter&lt;/code&gt; for YAML parsing, &lt;code&gt;internal/tags&lt;/code&gt; for glossary loading and per-platform mapping, &lt;code&gt;internal/hashnode&lt;/code&gt; and &lt;code&gt;internal/devto&lt;/code&gt; for the API clients, &lt;code&gt;internal/probe&lt;/code&gt; for contract validation, &lt;code&gt;internal/pipeline&lt;/code&gt; for the orchestrator, and &lt;code&gt;cmd/publish&lt;/code&gt; as the thin CLI wrapper. The full plan is in the &lt;a href="https://github.com/czietsman/nuphirho.dev" rel="noopener noreferrer"&gt;nuphirho.dev repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The end result: 98 BDD scenarios, 488 steps across 7 packages, all passing. Around 150 lines of bash collapse into two workflow steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build publish tool&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go build -o publish ./cmd/publish/&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Publish posts&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;HASHNODE_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.HASHNODE_TOKEN }}&lt;/span&gt;
    &lt;span class="na"&gt;HASHNODE_PUBLICATION_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.HASHNODE_PUBLICATION_ID }}&lt;/span&gt;
    &lt;span class="na"&gt;DEVTO_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DEVTO_API_KEY }}&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./publish --tags-file tags.json ${{ steps.changed.outputs.posts }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the line count is not the interesting part. What is interesting is what the BDD scenarios caught.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the specs caught
&lt;/h2&gt;

&lt;p&gt;The first package, &lt;code&gt;internal/frontmatter&lt;/code&gt;, is where most of the bash failures had originated. Writing the scenarios first forced precision about what "correct" actually means. That precision immediately exposed bugs that had been hiding in plain sight.&lt;/p&gt;

&lt;p&gt;The secret detection regex in the bash pipeline was &lt;code&gt;[A-Za-z0-9+/]{20,}&lt;/code&gt;. It did not match GitHub personal access tokens (&lt;code&gt;ghp_...&lt;/code&gt;) because it excluded underscores and hyphens. The BDD scenario caught this on the first run. The fix was adding &lt;code&gt;_-&lt;/code&gt; to the character class. One line. But without a scenario that explicitly tested a GitHub token pattern, that bug would have kept silently passing anything with a &lt;code&gt;ghp_&lt;/code&gt; prefix as clean.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gopkg.in/yaml.v3&lt;/code&gt; library eliminated a whole category of problems. The bash pipeline needed separate handling for three tag notation formats: &lt;code&gt;[a, b]&lt;/code&gt;, &lt;code&gt;["a", "b"]&lt;/code&gt;, and YAML list. The Go library handles all three identically, returning &lt;code&gt;[]string&lt;/code&gt; in all cases. That alone justified the rewrite.&lt;/p&gt;
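&lt;p&gt;For concreteness, the three notations side by side (as separate YAML documents); &lt;code&gt;gopkg.in/yaml.v3&lt;/code&gt; unmarshals each into the same &lt;code&gt;[]string&lt;/code&gt;:&lt;/p&gt;

```yaml
# flow style, unquoted
tags: [go, bdd]
---
# flow style, quoted
tags: ["go", "bdd"]
---
# block style
tags:
  - go
  - bdd
```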

&lt;p&gt;Moving to &lt;code&gt;internal/hashnode&lt;/code&gt;, the GraphQL client needed a helper for navigating untyped JSON responses. The &lt;code&gt;gqlNode.path()&lt;/code&gt; helper was written to chain safely on nil, so missing fields return empty strings rather than panics. What it did not handle correctly was a null JSON value: &lt;code&gt;publication: null&lt;/code&gt;. The initial check was &lt;code&gt;if pub == nil&lt;/code&gt;, which passed because the node existed, just with a nil raw value. The "publication not found" error was never raised. The BDD scenario caught it. Fixed by checking &lt;code&gt;pub.isNull()&lt;/code&gt; instead. This is the same class of nil-vs-empty bug that bites Go developers with nil interfaces. Without the scenario, it would have manifested as a confusing API error message at publish time.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;internal/probe&lt;/code&gt; package had the most satisfying catch. The contract probe output format was specified exactly in the plan: &lt;code&gt;[hashnode] credentials        OK&lt;/code&gt; with precise column padding. The initial implementation used &lt;code&gt;fmt.Sprintf("[%-8s]", platform)&lt;/code&gt;, which pads inside the brackets: &lt;code&gt;[devto   ]&lt;/code&gt;. The spec required the padding to come after the closing bracket, so the spaces follow &lt;code&gt;[devto]&lt;/code&gt;. Five of eleven scenarios failed on that single formatting difference. One-line fix. All eleven passing. The spec was doing its job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns, not just passes
&lt;/h2&gt;

&lt;p&gt;Something else happened during the implementation that is worth noting. The first package, &lt;code&gt;internal/hashnode&lt;/code&gt;, required iteration. The nil-vs-empty bug needed fixing. The fake HTTP client routing needed thought. Two scenarios failed on the first run and required code changes.&lt;/p&gt;

&lt;p&gt;Every package after that passed on the first run. Tags, Dev.to, probe, pipeline, CLI: five consecutive packages with no post-implementation fixes. That is not coincidence. The patterns established in the hashnode package (an injected HTTP client interface, fake routing by method and path, result structs carrying action and IDs, &lt;code&gt;io.Writer&lt;/code&gt; for testable output) transferred cleanly to every subsequent package. By the time the orchestrator was being written, the shape of the solution was so well established that the implementation matched the spec without iteration.&lt;/p&gt;
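&lt;p&gt;A sketch of the injected-client pattern with invented names: the client code depends on a one-method interface, and tests inject a fake that routes canned responses by method and path.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// doer is the one-method interface production code depends on;
// *http.Client satisfies it, and so does the fake below.
type doer interface {
	Do(req *http.Request) (*http.Response, error)
}

// fakeClient routes canned bodies by "METHOD /path".
type fakeClient struct {
	responses map[string]string
}

func (f *fakeClient) Do(req *http.Request) (*http.Response, error) {
	body, ok := f.responses[req.Method+" "+req.URL.Path]
	if !ok {
		return &http.Response{StatusCode: 404, Body: io.NopCloser(strings.NewReader(""))}, nil
	}
	return &http.Response{StatusCode: 200, Body: io.NopCloser(strings.NewReader(body))}, nil
}

// fetchTitle is a stand-in for real client code under test.
func fetchTitle(c doer, url string) (string, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	res, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()
	b, err := io.ReadAll(res.Body)
	return strings.TrimSpace(string(b)), err
}

func main() {
	fake := &fakeClient{responses: map[string]string{
		"GET /api/articles/1": "When bash gets too wild",
	}}
	title, _ := fetchTitle(fake, "https://example.invalid/api/articles/1")
	fmt.Println(title)
}
```

&lt;p&gt;Once this shape exists in one package, reusing it elsewhere is mostly mechanical, which is exactly the transfer effect described above.&lt;/p&gt;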

&lt;p&gt;This is one of the underappreciated benefits of BDD when it is done in order. The first package is where you pay the design tax. Everything after inherits the patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two decisions worth explaining
&lt;/h2&gt;

&lt;p&gt;Two design decisions in &lt;code&gt;internal/pipeline&lt;/code&gt; are worth being explicit about, because both have non-obvious implications.&lt;/p&gt;

&lt;p&gt;The first is two-phase execution. Phase 1 validates all post files. If any fail validation, the pipeline exits with code 1 before making any API calls. Phase 2 processes each file independently. If Hashnode fails for one file, Dev.to is skipped for that file but other files continue. This means validation is a hard gate (you never partially publish a batch with a broken post in it), while publish failures are isolated rather than catastrophic.&lt;/p&gt;

&lt;p&gt;The second is no rollback. If file 1 publishes successfully to both platforms and file 2 then fails on Hashnode, file 1's results are preserved. The summary reports both the success and the failure. Exit code 2. There is no mechanism to undo file 1's publish. This matches the bash pipeline's behaviour, and it is the right default: a post that has been successfully published should stay published. The failure is in the tooling, not the content.&lt;/p&gt;

&lt;p&gt;Both decisions are explicitly tested in BDD scenarios. Writing them as scenarios forces you to state the expected behaviour unambiguously before you implement it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The migration strategy
&lt;/h2&gt;

&lt;p&gt;The switch was not a flag day. The Go binary ran in &lt;code&gt;--dry-run&lt;/code&gt; mode in a parallel workflow job alongside the existing bash pipeline, uploading structured JSON output as a GitHub Actions artifact. Once the dry-run output matched the bash pipeline's behaviour, Hashnode cut over first, keeping Dev.to on bash to isolate the risk to one platform. Once that was stable, Dev.to followed, the bash pipeline was removed, and the multi-job workflow collapsed into a single &lt;code&gt;./publish&lt;/code&gt; invocation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--dry-run&lt;/code&gt; flag produces structured JSON rather than human-readable logs specifically for this phase. Diffing JSON is straightforward. Diffing log output is not.&lt;/p&gt;

&lt;p&gt;The final tally: 411 lines of workflow down to 117. The migration is done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The broader point
&lt;/h2&gt;

&lt;p&gt;This is a small project. It publishes markdown files to two APIs. No reasonable person would call it critical infrastructure.&lt;/p&gt;

&lt;p&gt;But the same thinking applies at every scale. Pipelines accumulate logic. Untestable logic accumulates risk. And the cost of writing testable, well-specified code has come down dramatically. This rewrite (7 packages, 98 scenarios, 488 steps, full BDD coverage) happened in days, not months. That is the part that changes the calculus. There is less and less justification for treating pipelines as second-class citizens that do not deserve the same engineering rigour as application code.&lt;/p&gt;

&lt;p&gt;The secret regex bug, the nil-vs-empty GraphQL node, the five failing probe scenarios from a single misplaced padding character: none of these are dramatic failures. They are exactly the kind of subtle, silent, hard-to-reproduce problems that bash pipelines accumulate over time and that a proper test suite catches before they reach production.&lt;/p&gt;

&lt;p&gt;The bash script served its purpose. It got the pipeline working. Now it has earned the right to be replaced by something that can actually be trusted.&lt;/p&gt;

&lt;p&gt;One last detail worth recording: the first real test of the new pipeline used this post as the test file. It immediately hit the exact class of bash quoting bug the post describes. The timing was not planned.&lt;/p&gt;

&lt;p&gt;I will write a follow-up once the mutation testing is in place, specifically on running go-mutesting against the frontmatter parser, which is the highest-consequence code in the tool and the place where mutation coverage matters most.&lt;/p&gt;

</description>
      <category>go</category>
      <category>bdd</category>
      <category>devops</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
