<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: wintrover</title>
    <description>The latest articles on Forem by wintrover (@wintrover).</description>
    <link>https://forem.com/wintrover</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1896056%2Fa00c0da2-64c2-47ab-b5d0-2256c4ba0e75.png</url>
      <title>Forem: wintrover</title>
      <link>https://forem.com/wintrover</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wintrover"/>
    <language>en</language>
    <item>
      <title>"42% Silence": What It Means to Control Failure in AI Code Verification</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:34:24 +0000</pubDate>
      <link>https://forem.com/wintrover/42-silence-what-it-means-to-control-failure-in-ai-code-verification-1nip</link>
      <guid>https://forem.com/wintrover/42-silence-what-it-means-to-control-failure-in-ai-code-verification-1nip</guid>
      <description>&lt;p&gt;We live in an era where AI writes code.&lt;/p&gt;

&lt;p&gt;The problem is no longer productivity.&lt;/p&gt;

&lt;p&gt;It's trust.&lt;/p&gt;

&lt;p&gt;Code gets generated fast.&lt;br&gt;
Tests pass.&lt;/p&gt;

&lt;p&gt;But there's no guarantee that this code is actually correct.&lt;/p&gt;



&lt;h2&gt;
  
  
  1) What We're Building: Axiom
&lt;/h2&gt;

&lt;p&gt;Axiom is not a mere "automatic proof engine."&lt;/p&gt;

&lt;p&gt;It's a system that makes AI-generated code &lt;em&gt;trustworthy&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;To achieve this, we combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SMT Solver (Z3)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Abstraction (CEGAR)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summary-based Decomposition&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Formal Verification (Lean 4)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;into a single orchestration.&lt;/p&gt;



&lt;h2&gt;
  
  
  2) The Incident: "42% Silence"
&lt;/h2&gt;

&lt;p&gt;When running &lt;code&gt;ax prove&lt;/code&gt;, the progress would stall precisely at 42%.&lt;/p&gt;

&lt;p&gt;It looked like a simple UI glitch.&lt;/p&gt;

&lt;p&gt;But three possibilities existed simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actual deadlock&lt;/li&gt;
&lt;li&gt;Solver endlessly computing&lt;/li&gt;
&lt;li&gt;Some tasks finished but UI not showing them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words,&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system was in a state where you couldn't tell if it was dead or thinking.&lt;/strong&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  3) Kimi's Diagnosis: "It's Actually Stuck"
&lt;/h2&gt;

&lt;p&gt;Investigation revealed an actual bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause 1 — Sequential Result Waiting&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;^&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;  &lt;span class="c"&gt;# blocks in spawn order&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ If one hangs, everything stops&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause 2 — Blocking I/O&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="n"&gt;readLine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c"&gt;# can wait forever on Z3 pipe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Deadlock possible regardless of solver state&lt;/p&gt;



&lt;h2&gt;
  
  
  4) First Fix: Make the System Non-Blocking
&lt;/h2&gt;

&lt;p&gt;We fixed two things.&lt;/p&gt;

&lt;p&gt;✔ &lt;strong&gt;Out-of-order processing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Immediately process completed tasks&lt;/p&gt;

&lt;p&gt;✔ &lt;strong&gt;Non-blocking I/O&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="n"&gt;readData&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c"&gt;# chunk-based processing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Detect solver termination/abnormal state immediately&lt;/p&gt;

&lt;p&gt;This removed the "true freeze."&lt;/p&gt;
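
&lt;p&gt;For reference, both fixes can be sketched as one polling loop. This is a hedged sketch: &lt;code&gt;check&lt;/code&gt; and the task names are illustrative, not Axiom's actual API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;import std/[threadpool, os]

proc check(name: string): string =
  "VERIFIED " &amp; name          # stand-in for one solver run

var flows = @[spawn check("expr.nim::add"), spawn check("results.nim::isErr")]
var collected = newSeq[bool](flows.len)
var remaining = flows.len

while remaining &gt; 0:
  for i in 0 ..&lt; flows.len:
    # poll instead of blocking on `^task` in spawn order
    if not collected[i] and flows[i].isReady:
      echo ^flows[i]           # safe: the result is already available
      collected[i] = true
      dec remaining
  sleep(10)                    # yield instead of busy-waiting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;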



&lt;h2&gt;
  
  
  5) But a Bigger Problem Remained
&lt;/h2&gt;

&lt;p&gt;After fixing the bug, the same question lingered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What if the solver never finds an answer?"&lt;/strong&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  6) The Technical Debate: "Bypass or Siege?"
&lt;/h2&gt;

&lt;p&gt;Two perspectives collided here.&lt;/p&gt;

&lt;p&gt;🏛️ &lt;strong&gt;Gemini — Engineering Siege Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We're not finding answers, we're collecting evidence"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Simplify problems with CEGAR&lt;/li&gt;
&lt;li&gt;Break them down with Summary&lt;/li&gt;
&lt;li&gt;Finalize with Lean 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🏛️ &lt;strong&gt;Kimi — Limit-Awareness Design&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"All these tools fail"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;SMT solving is undecidable in general&lt;/li&gt;
&lt;li&gt;CEGAR can also stall in loops&lt;/li&gt;
&lt;li&gt;Lean automation is ultimately a search problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is no perfect solution.&lt;/strong&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  7) Our Conclusion
&lt;/h2&gt;

&lt;p&gt;The key insight from this debate:&lt;/p&gt;

&lt;p&gt;❌ "A system that always proves" is impossible&lt;br&gt;
✅ "A system that controls failure" is possible&lt;/p&gt;



&lt;h2&gt;
  
  
  8) Axiom's Design Principles
&lt;/h2&gt;

&lt;p&gt;Axiom is not a "proof engine" but:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"A system that structures failure"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🏛️ &lt;strong&gt;L1/L2 — Stop the failure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeout&lt;/li&gt;
&lt;li&gt;abstraction (CEGAR)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Block infinite computation&lt;/p&gt;

&lt;p&gt;🏛️ &lt;strong&gt;L3 — Break down the failure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decomposition&lt;/li&gt;
&lt;li&gt;summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Whole failure → Partial success&lt;/p&gt;

&lt;p&gt;🏛️ &lt;strong&gt;L4 — Escalate the failure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lean 4&lt;/li&gt;
&lt;li&gt;Human invariant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Automation → Collaboration&lt;/p&gt;



&lt;h2&gt;
  
  
  9) Another Problem Surfaced
&lt;/h2&gt;

&lt;p&gt;The bug was gone, but a new problem appeared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It still "looked stuck."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't just a UX problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Issue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a verification system, "invisible progress" equals trust collapse.&lt;/p&gt;



&lt;h2&gt;
  
  
  10) Solution: Heartbeat Patch
&lt;/h2&gt;

&lt;p&gt;We changed our approach.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Instead of "exact progress percentage,"&lt;br&gt;
show "evidence of being alive"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[████████░░] 42%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[████████░░] 42%

[VERIFIED] expr.nim::add
[VERIFIED] results.nim::isErr
[TIMEOUT] main.nim::getStr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The meaning of this change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"It's not stuck — it's solving a hard problem."&lt;/strong&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  11) Additional Improvement: Evidence-Based Hints
&lt;/h2&gt;

&lt;p&gt;Z3 says nothing when it times out.&lt;/p&gt;

&lt;p&gt;So we worked around it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Structural complexity analysis (Topology)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lean unsolved goal extraction&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;We can't directly say "why it failed,"&lt;br&gt;
but we can show "where the problem is."&lt;/p&gt;



&lt;h2&gt;
  
  
  12) Lessons Learned
&lt;/h2&gt;

&lt;p&gt;The incident clarified two things.&lt;/p&gt;

&lt;p&gt;✔ &lt;strong&gt;Nim threadpool limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FlowVar cancellation impossible&lt;/li&gt;
&lt;li&gt;Polling-based design required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✔ &lt;strong&gt;Z3's black-box nature&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeout = unknown&lt;/li&gt;
&lt;li&gt;Root cause tracking impossible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Contextual diagnosis required&lt;/p&gt;



&lt;h2&gt;
  
  
  13) Conclusion
&lt;/h2&gt;

&lt;p&gt;The "42% incident" wasn't just a simple bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It was the surfacing of structural limitations that every AI code verification system must face.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We made one choice here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abandon perfect proof&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Choose controllable failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-sentence definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In a verification engine, what matters isn't the correct answer,&lt;br&gt;
but how failure manifests and is controlled.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Axiom no longer freezes at 42%.&lt;/p&gt;

&lt;p&gt;And more importantly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even when it stops,&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;it tells you why it stopped and what to do next.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-line summary:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Axiom isn't an engine for proving code,&lt;br&gt;
it's a system for generating trust.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>axiom</category>
      <category>ai</category>
      <category>code</category>
      <category>verification</category>
    </item>
    <item>
      <title>Inside Axiom’s Verification Kernel: BMC, UAP, Lean Replay, and the Proof Vault</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:48:39 +0000</pubDate>
      <link>https://forem.com/wintrover/inside-axioms-verification-kernel-bmc-uap-lean-replay-and-the-proof-vault-50m3</link>
      <guid>https://forem.com/wintrover/inside-axioms-verification-kernel-bmc-uap-lean-replay-and-the-proof-vault-50m3</guid>
      <description>&lt;p&gt;When people hear “formal verification”, they imagine a single solver run that prints &lt;code&gt;Verified&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That’s not what Axiom is optimizing for.&lt;/p&gt;

&lt;p&gt;Axiom is built around one requirement: &lt;strong&gt;if you say something is verified, you must be able to reproduce that claim deterministically&lt;/strong&gt;—including the exact query shape, the exact counterexample (if SAT), and the exact Lean replay (if promoted to safe).&lt;/p&gt;



&lt;h2&gt;
  
  
  1) BMC Core: Bounded Models, Not Hand-Wavy Guarantees
&lt;/h2&gt;

&lt;p&gt;The BMC core is the bounded-model layer of the engine: it takes a transition system, unrolls it to a fixed bound, and asks the only question that matters inside that bound:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Does there exist a counterexample within N steps?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Axiom’s reference race model, that bound is explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt;
  &lt;span class="n"&gt;bmcStepBound&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That bound is not a random default—it’s the contract that defines what “safe” even means in this mode. The semantics are also strict: a race is promoted only when the final state falls outside the set of sequential outcomes computed from the same &lt;code&gt;TransitionRule&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The important architectural decision is that the &lt;em&gt;rule&lt;/em&gt; and the &lt;em&gt;safety property&lt;/em&gt; are not hardcoded into the solver path. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External behavior is injected as a &lt;code&gt;TransitionRule(op, delta)&lt;/code&gt; model.&lt;/li&gt;
&lt;li&gt;The single source of truth (SSOT) for the safety property is &lt;code&gt;TransitionRule.invariant&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Encoding is performed as SMT-LIB2 (QF_LIA) and executed through Z3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also where determinism stops being “a preference” and becomes “mechanically enforced”.&lt;/p&gt;

&lt;p&gt;The Z3 query text is stamped with fixed-seed and single-thread constraints, and every invariant is asserted as a named clause so UNSAT cores are stable and actionable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight common_lisp"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;set-option&lt;/span&gt; &lt;span class="ss"&gt;:fixed_seed&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;set-option&lt;/span&gt; &lt;span class="ss"&gt;:smt.threads&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;!&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nv"&gt;size&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ss"&gt;:named&lt;/span&gt; &lt;span class="nv"&gt;inv_bounded_growth.max_size&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;get-unsat-core&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In other words: the BMC core is not just “Z3 integration”. It’s a deterministic bounded-model execution contract where solver outputs can be replayed and audited.&lt;/p&gt;
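
&lt;p&gt;As a rough sketch of that contract (the proc and parameter names here are invented for illustration, not Axiom's real encoder), the deterministic header and named clauses can be emitted like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;proc emitQuery(invariants: seq[(string, string)]): string =
  # fixed seed + single thread make every solver run replayable
  result = "(set-option :fixed_seed 17)\n"
  result.add "(set-option :smt.threads 1)\n"
  for (name, expr) in invariants:
    # naming each clause keeps UNSAT cores stable and actionable
    result.add "(assert (! " &amp; expr &amp; " :named " &amp; name &amp; "))\n"
  result.add "(check-sat)\n(get-unsat-core)\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;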



&lt;h2&gt;
  
  
  2) UAP Pipeline: Universal Automated Proof as an Orchestrated Repair Loop
&lt;/h2&gt;

&lt;p&gt;BMC gives you a controlled counterexample search. But the real world is not a single bounded query; it’s a sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract a model.&lt;/li&gt;
&lt;li&gt;Find a counterexample.&lt;/li&gt;
&lt;li&gt;Repair the implementation and/or strengthen invariants.&lt;/li&gt;
&lt;li&gt;Re-prove.&lt;/li&gt;
&lt;li&gt;Preserve only the minimum reusable evidence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That sequence is what Axiom calls UAP: the &lt;strong&gt;Universal Automated Proof&lt;/strong&gt; pipeline.&lt;/p&gt;

&lt;p&gt;UAP is not “just” automation. It has a strict output contract: one JSON result that includes translation, scans, repair rounds, proof artifacts, and promotion snapshots.&lt;/p&gt;

&lt;p&gt;The core orchestrator is deliberately structured as a set of roles (Detective / Legislator / SWAT / Judge), because each stage produces different kinds of artifacts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AST + Source
  → UniversalTranslation (AST → IL/SMT mapping)
  → initialScan (counterexample search + trace grounds)
  → legislate (baseline invariant draft)
  → executeRaid (repair + PoC + BDD generation hooks)
  → certify (Lean replay + invariant finalization)
  → investigate (final scan + verdict normalization)
  → writeProofArtifact + writePromotionSnapshot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two constraints matter here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Counterexample not found” is explicitly not equal to “safe”.&lt;/li&gt;
&lt;li&gt;Safe promotion is prohibited unless Lean replay succeeds, even if SMT returns UNSAT.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So UAP is a self-healing pipeline, but it’s also an anti-vacuity pipeline: it refuses to treat solver silence as proof.&lt;/p&gt;



&lt;h2&gt;
  
  
  3) Lean 4 Reproducibility: Z3 Is the Oracle; Lean Is the Kernel
&lt;/h2&gt;

&lt;p&gt;If SMT is the “search engine”, Lean 4 is the “court of law”.&lt;/p&gt;

&lt;p&gt;The engine’s safety promotion rules explicitly treat Z3 UNSAT as a hint, not a conclusion. Final safe promotion requires a Lean-kernel-verifiable result (proof term or tactic replay).&lt;/p&gt;

&lt;p&gt;This is why Axiom’s Lean side has hard constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Semantics are not allowed to hide inside axioms like &lt;code&gt;axiom transition&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Transition systems must be reconstructed through &lt;code&gt;def nextState&lt;/code&gt; and &lt;code&gt;inductive Step&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Generated specs must be replayable under a pinned toolchain and stable tactic strategy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Operationally, Lean replay is not a one-shot script. It’s a dynamic proof search loop driven by compiler feedback through Lean’s &lt;code&gt;--server&lt;/code&gt; JSON-RPC path (LSP-style). The system treats “proof search” as an iterative, deterministic process: observe goals, propose tactics, replay, repeat.&lt;/p&gt;

&lt;p&gt;The point is not that Lean always succeeds.&lt;/p&gt;

&lt;p&gt;The point is that when it does succeed, it is &lt;strong&gt;reproducible&lt;/strong&gt;—and when it fails, the failure itself is reproducible and promotable as actionable constraints.&lt;/p&gt;



&lt;h2&gt;
  
  
  4) Proof Vault: Fingerprint-Keyed Evidence Compression and Reuse
&lt;/h2&gt;

&lt;p&gt;Most systems keep “proof results” as human-facing logs.&lt;/p&gt;

&lt;p&gt;Axiom treats proofs as assets.&lt;/p&gt;

&lt;p&gt;The Proof Vault is where those assets are stored under stable keys so they can be reused instead of re-derived. Two kinds of vaulting matter:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Promoted invariants&lt;/strong&gt; (UAP-level): the engine preserves the latest &lt;code&gt;activeInvariant&lt;/code&gt; and &lt;code&gt;refinedInvariant&lt;/code&gt; as snapshots and auto-reinjects them into later runs for the same &lt;code&gt;modulePath/functionName&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;2) &lt;strong&gt;Lean replay tactics&lt;/strong&gt; (kernel-level): successful tactics are stored keyed by theorem name plus a complexity fingerprint. On the next run, they become a warm-start strategy—preferentially reusing the path with fewer attempts and shorter tactic sequences.&lt;/p&gt;

&lt;p&gt;That “fingerprint-first” design is what turns the vault into a compression mechanism:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You discard bulky intermediate artifacts (full ASTs, large traces) after success.&lt;/li&gt;
&lt;li&gt;You keep only minimal, fingerprint-addressable evidence that can deterministically recreate the proof attempt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also why fingerprints are treated as first-class inputs across the pipeline (source checksum, SMT fingerprint, invariant fingerprint). If the fingerprint changes, Axiom does not pretend the old proof still applies—it demotes and schedules re-proof precisely where the change occurred.&lt;/p&gt;
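
&lt;p&gt;A minimal sketch of the warm-start lookup (the types and key layout are hypothetical, chosen only to show the fingerprint-first idea):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;import std/tables

type TacticEntry = object
  tactics: seq[string]    # shortest successful tactic sequence
  attempts: int

# keyed by theorem name plus complexity fingerprint
var vault: Table[string, TacticEntry]

proc warmStart(theorem, fingerprint: string): seq[string] =
  # reuse only while the fingerprint matches; a changed
  # fingerprint yields an empty plan, i.e. a fresh re-proof
  vault.getOrDefault(theorem &amp; ":" &amp; fingerprint).tactics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;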



&lt;h2&gt;
  
  
  Closing: A Kernel, Not a Collection of Tools
&lt;/h2&gt;

&lt;p&gt;BMC, UAP, Lean replay, and the Proof Vault are not separate features.&lt;/p&gt;

&lt;p&gt;They are one kernel with one invariant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If Axiom says “verified”, it must be replayable as the same query, the same evidence, and the same proof—deterministically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the only definition of verification that survives contact with AI-generated code.&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>formal</category>
      <category>verification</category>
      <category>bmc</category>
    </item>
    <item>
      <title>Nim vs Rust: Language is a Matter of 'Coherence', Not 'Performance'</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Fri, 17 Apr 2026 11:59:55 +0000</pubDate>
      <link>https://forem.com/wintrover/nim-vs-rust-language-is-a-matter-of-coherence-not-performance-45kc</link>
      <guid>https://forem.com/wintrover/nim-vs-rust-language-is-a-matter-of-coherence-not-performance-45kc</guid>
      <description>&lt;p&gt;I once spent a long time debating between two languages as the foundation for a formal verification engine.&lt;br&gt;
One was Rust, bringing its borrow checker and lifetimes. The other was Nim, leading with &lt;code&gt;func&lt;/code&gt; and &lt;code&gt;distinct&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When people compare these two, the conversation usually boils down to "Rust's memory safety vs. Nim's C-compatibility and productivity."&lt;br&gt;
But when you actually try to build something like the Axiom proof engine from the ground up, the real wall you hit is entirely different.&lt;/p&gt;

&lt;p&gt;Especially in an environment where AI generates code, &lt;strong&gt;the most terrifying bottleneck isn't a 'compile error'—it's 'semantic opacity'.&lt;/strong&gt;&lt;br&gt;
LLMs can easily spit out code that flawlessly passes tests while subtly twisting the author's original intent into something unrecognizable.&lt;/p&gt;



&lt;h2&gt;
  
  
  There is No Perfect Symmetry: The Trade-off Between Memory and Intent
&lt;/h2&gt;

&lt;p&gt;"Rust prevents memory corruption, while Nim prevents intent corruption."&lt;br&gt;
Saying it like that sounds incredibly clean and punchy, but real-world engineering is rarely that neat.&lt;/p&gt;

&lt;p&gt;Of course, Rust is perfectly capable of expressing semantics. You can use the Typestate pattern or the Newtype pattern to rigorously enforce domain meaning. It's not that Rust &lt;em&gt;can't&lt;/em&gt; express intent. The issue is &lt;strong&gt;the cognitive cost required to express that intent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Conversely, Nim is not a silver bullet. Debugging complex macros can sometimes feel like hell, and the smaller ecosystem can be a real pressure point. This is a clear trade-off.&lt;/p&gt;

&lt;p&gt;However, the reason I accepted this trade-off and sided with Nim comes down to how we distribute our &lt;strong&gt;'attention'&lt;/strong&gt; when collaborating with AI.&lt;/p&gt;



&lt;h2&gt;
  
  
  Syntactic Simplicity: Where Does Our Attention Go?
&lt;/h2&gt;

&lt;p&gt;When reviewing code generated by an LLM, our cognitive resources are strictly limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's look at Rust:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;explore&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt; &lt;span class="n"&gt;BmcTrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BmcVerdict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt; &lt;span class="n"&gt;TransitionRule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BmcVerdict&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I read this code, a massive chunk of my brainpower is spent tracking lifetimes (&lt;code&gt;'a&lt;/code&gt;), references, and mutability.&lt;br&gt;
In fact, when a bug occurred here, I found it incredibly difficult to tell at a glance: &lt;strong&gt;is the domain logic wrong, or was this code just awkwardly structured to appease the borrow checker?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now, how about Nim?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;explore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BmcTrace&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BmcVerdict&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TransitionRule&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;BmcVerdict&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, my attention is entirely focused on the 'domain variables' and 'semantic constraints'.&lt;br&gt;
Because the concerns of memory management are stripped away from the syntax, I can focus on the &lt;strong&gt;causality of the logic itself&lt;/strong&gt;, rather than the grammar.&lt;/p&gt;



&lt;h2&gt;
  
  
  Structural Devices That Enforce the "Why"
&lt;/h2&gt;

&lt;p&gt;The most painful moments in the trenches are when you have to dig through comments or commit messages to figure out "why did the previous dev (or yesterday's me, or the LLM) write it this way?"&lt;/p&gt;

&lt;p&gt;Nim allows you to bake this 'Why' directly into the language spec.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;
  &lt;span class="n"&gt;SessionId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
  &lt;span class="n"&gt;OwnerId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;distinct&lt;/code&gt; is not just a type alias. If an AI makes the mistake of passing an &lt;code&gt;OwnerId&lt;/code&gt; where a &lt;code&gt;SessionId&lt;/code&gt; is expected, the compiler grabs it by the collar immediately—long before it ever reaches runtime tests.&lt;/p&gt;
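
&lt;p&gt;A minimal repro of that collar-grab, using the types from the snippet above (the &lt;code&gt;endSession&lt;/code&gt; proc is invented for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;type
  SessionId = distinct string
  OwnerId = distinct string

proc endSession(id: SessionId) = discard

let owner = OwnerId("u-42")
endSession(owner)   # compile error: got OwnerId, expected SessionId
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
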

&lt;p&gt;The same goes for contracts that specify the preconditions and guaranteed outcomes of a function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;toSessionId&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxiomError&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;
  &lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ensures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isOk&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionIdString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't mere documentation. It's a contract enforced by the compiler before runtime.&lt;br&gt;
Sure, you can mimic this in Rust using macros or external crates, but Nim's core philosophy pushes this &lt;strong&gt;'visibility of intent'&lt;/strong&gt; by default. Just like how the single &lt;code&gt;func&lt;/code&gt; keyword structurally enforces the absence of side effects at the module level.&lt;/p&gt;
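
&lt;p&gt;That &lt;code&gt;func&lt;/code&gt; guarantee is checked by the compiler, not by convention. A tiny sketch (names invented for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;var auditLog: seq[string]

func pureAdd(a, b: int): int =
  # uncommenting the next line is rejected at compile time:
  # auditLog.add($a)   # Error: 'pureAdd' can have side effects
  a + b

echo pureAdd(2, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;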



&lt;h2&gt;
  
  
  Conclusion: The Real Bottleneck in the AI Era
&lt;/h2&gt;

&lt;p&gt;To reiterate, if performance or the sheer size of the community is your primary concern, Rust is undoubtedly an excellent choice. It provides an incredibly solid floor of memory safety.&lt;/p&gt;

&lt;p&gt;But what I felt in my bones while building Axiom is that, in the era of AI-assisted development, the most terrifying risk isn't 'unsafe code'. It's &lt;strong&gt;'code where you cannot know what it intended until you run it.'&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can you structurally grasp the author's intent without having to read every line of code?&lt;br&gt;
I didn't want to compromise on this question. And that—even if it sometimes looks a bit rough around the edges—is why I chose Nim. To break through the ceiling of coherence.&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>nim</category>
      <category>formal</category>
      <category>verification</category>
    </item>
    <item>
      <title>The Most Dangerous Word in AI Coding: "Verified"</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Wed, 08 Apr 2026 11:41:07 +0000</pubDate>
      <link>https://forem.com/wintrover/the-most-dangerous-word-in-ai-coding-verified-3a1k</link>
      <guid>https://forem.com/wintrover/the-most-dangerous-word-in-ai-coding-verified-3a1k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz7qa78jg7lrz5gvbera.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz7qa78jg7lrz5gvbera.png" alt=" " width="553" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Got a "Verified" result from my formal verification engine.&lt;/p&gt;

&lt;p&gt;Problem was, it was completely wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Looking at a simple function: &lt;code&gt;checkType&lt;/code&gt; from Bitcoin Core.&lt;/p&gt;

&lt;p&gt;The engine generated this SMT query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight common_lisp"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;throwsRuntimeError&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;typ&lt;/span&gt; &lt;span class="nv"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;typ&lt;/span&gt; &lt;span class="nv"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assert&lt;/span&gt; &lt;span class="nv"&gt;throwsRuntimeError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance? Looks fine.&lt;/p&gt;

&lt;p&gt;But there's a fatal flaw in there.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Contradiction
&lt;/h2&gt;

&lt;p&gt;Unpack it and here's what you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Error occurs when &lt;code&gt;typ != expected&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;But we're assuming &lt;code&gt;typ == expected&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;While also asserting "an error occurred"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Boil it down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;typ == expected
AND simultaneously: typ != expected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logically impossible.&lt;/p&gt;
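
&lt;p&gt;You can reproduce the contradiction without an SMT solver at all. Here is a tiny Python sketch (my own illustration, not the engine's code) that brute-forces the three constraints over a toy domain:&lt;/p&gt;

```python
from itertools import product

# The three asserted constraints from the SMT query, as Python predicates.
def constraints(typ, expected, throws_runtime_error):
    return (
        (throws_runtime_error == (typ != expected))  # error occurs iff types differ
        and (typ == expected)                        # assumed: types match
        and throws_runtime_error                     # asserted: an error occurred
    )

# Exhaustively check every assignment over a tiny domain.
satisfying = [
    (t, e, err)
    for t, e, err in product([0, 1], [0, 1], [False, True])
    if constraints(t, e, err)
]

print(satisfying)  # [] -> no assignment satisfies all three constraints
```

&lt;p&gt;No assignment survives all three constraints at once, which is exactly what Z3 reports as unsat.&lt;/p&gt;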




&lt;h2&gt;
  
  
  What the Solver Did
&lt;/h2&gt;

&lt;p&gt;Z3 (or any SMT solver) takes one look and concludes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unsat (Unsatisfiable)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In formal verification, this usually means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"No execution path exists where the error occurs."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the engine outputs:&lt;/p&gt;

&lt;p&gt;✅ Verified&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Went Wrong
&lt;/h2&gt;

&lt;p&gt;Here's the thing.&lt;/p&gt;

&lt;p&gt;The solver didn't prove the code safe.&lt;/p&gt;

&lt;p&gt;It proved the question itself was invalid.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vacuous Truth
&lt;/h2&gt;

&lt;p&gt;This is a classic trap in formal verification.&lt;/p&gt;

&lt;p&gt;The definition is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the premise is impossible, the statement is always true.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If the sun rises in the west, I am God."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Logically true.&lt;br&gt;
Because the premise can never hold.&lt;/p&gt;
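
&lt;p&gt;Python has the same rule baked in: a universal claim quantified over an empty set is true by definition. A two-liner to convince yourself:&lt;/p&gt;

```python
# A universal claim over an empty set is vacuously true:
# the body is never evaluated, so nothing can falsify it.
days_the_sun_rises_in_the_west = []   # the premise never holds

claim = all(False for _day in days_the_sun_rises_in_the_west)
print(claim)  # True
```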


&lt;h2&gt;
  
  
  What Actually Happened
&lt;/h2&gt;

&lt;p&gt;The engine effectively asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"In a state where &lt;code&gt;typ == expected&lt;/code&gt;, can an error occur?&lt;br&gt;
Given errors only happen when &lt;code&gt;typ != expected&lt;/code&gt;?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The solver's answer is clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"No such state exists."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then the engine interprets:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"No error possible → safe"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;That jump is the problem.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Verified ≠ Correct
&lt;/h2&gt;

&lt;p&gt;Key point:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Verified" doesn't mean correct.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes it means:&lt;/p&gt;

&lt;p&gt;Your model is broken.&lt;/p&gt;

&lt;p&gt;This case was exactly that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actual logic wasn't captured&lt;/li&gt;
&lt;li&gt;Contradictory constraints were generated&lt;/li&gt;
&lt;li&gt;Solver just short-circuited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Didn't verify the code.&lt;br&gt;
Logic just crashed.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;We're in this flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI writes code&lt;/li&gt;
&lt;li&gt;AI writes tests&lt;/li&gt;
&lt;li&gt;AI explains "correctness"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;None of that guarantees truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;p&gt;AI optimizes for plausibility, not consistency&lt;/p&gt;

&lt;p&gt;LLMs don't "resolve" contradictions.&lt;br&gt;
They continue them.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Real Danger: False Confidence
&lt;/h2&gt;

&lt;p&gt;This is worse than a failing test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Test fails → problem visible
Verified (fake) → problem hidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In security or finance logic? Catastrophic.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Needed
&lt;/h2&gt;

&lt;p&gt;Formal verification systems need at minimum:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Sanity Check (Premise Validation)
&lt;/h3&gt;

&lt;p&gt;Before proving anything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are these assumptions even possible together?&lt;/p&gt;
&lt;/blockquote&gt;
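
&lt;p&gt;A sketch of what that could look like (names and structure are mine, not a real engine's API): before attempting any proof, confirm that the assumptions alone admit at least one model.&lt;/p&gt;

```python
# Hypothetical sanity check: if the premises are jointly unsatisfiable,
# any subsequent "proof" is vacuous and must not be reported as Verified.
from itertools import product

def premises_satisfiable(premises, domain):
    # True if some assignment over the domain satisfies every premise.
    return any(
        all(p(t, e) for p in premises)
        for t, e in product(domain, repeat=2)
    )

premises = [
    lambda typ, expected: typ == expected,   # assumed: types match
    lambda typ, expected: typ != expected,   # implied by "an error occurred"
]

if not premises_satisfiable(premises, domain=[0, 1]):
    print("VACUOUS: premises contradict, a 'Verified' result is meaningless")
```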

&lt;h3&gt;
  
  
  2. Vacuity Detection
&lt;/h3&gt;

&lt;p&gt;Catch cases that pass because "nothing could happen."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Mutation Testing
&lt;/h3&gt;

&lt;p&gt;Break the code intentionally:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If it still verifies, your verification is broken.&lt;/p&gt;
&lt;/blockquote&gt;
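
&lt;p&gt;A minimal sketch of the idea (hypothetical code, not a real harness): pair the original with a deliberately inverted mutant, and see whether the verifier can tell them apart.&lt;/p&gt;

```python
# Mutation testing applied to a verifier: break the code on purpose.
# If the verifier still says "Verified", distrust the verifier.

def check_type(typ, expected):
    # Original: raises when types differ.
    if typ != expected:
        raise RuntimeError("type mismatch")

def check_type_mutant(typ, expected):
    # Mutant: the guard is inverted, so the code is obviously wrong.
    if typ == expected:
        raise RuntimeError("type mismatch")

def vacuous_verifier(fn):
    # Stand-in for an engine whose premises contradict: it can never
    # reach a counterexample, so it reports success for any input.
    return "Verified"

verdicts = [vacuous_verifier(check_type), vacuous_verifier(check_type_mutant)]
print(verdicts)  # the mutant also comes back "Verified": the harness is broken
```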




&lt;h2&gt;
  
  
  Axiom
&lt;/h2&gt;

&lt;p&gt;That's why I'm building Axiom.&lt;/p&gt;

&lt;p&gt;Not to replace AI.&lt;/p&gt;

&lt;p&gt;To sit on top as a "verification layer."&lt;/p&gt;

&lt;p&gt;Role is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This holds&lt;/li&gt;
&lt;li&gt;This doesn't&lt;/li&gt;
&lt;li&gt;Or: this question is invalid from the start&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;If your system says "Verified,"&lt;/p&gt;

&lt;p&gt;Ask this first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the question itself valid?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you can't answer that,&lt;/p&gt;

&lt;p&gt;You don't have verification.&lt;/p&gt;

&lt;p&gt;You have a well-crafted illusion.&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>formal</category>
      <category>verification</category>
      <category>smt</category>
    </item>
    <item>
      <title>Where Does Truth Live in AI-Generated Code?</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 07 Apr 2026 07:15:09 +0000</pubDate>
      <link>https://forem.com/wintrover/where-does-truth-live-in-ai-generated-code-5hmc</link>
      <guid>https://forem.com/wintrover/where-does-truth-live-in-ai-generated-code-5hmc</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Isn't Tests—It's Authority
&lt;/h2&gt;

&lt;p&gt;Talk about AI-generated code long enough, and you'll hit this question.&lt;/p&gt;

&lt;p&gt;"Tests pass, so what's the problem?"&lt;/p&gt;

&lt;p&gt;Not wrong. But it misses the core issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In AI-generated code, the real question is: who decides 'this is correct'?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tests? Reviewers? Another LLM?&lt;/p&gt;

&lt;p&gt;None of them.&lt;/p&gt;

&lt;p&gt;This isn't about improving accuracy. It's about where the authority to declare truth lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Actually Doing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Trusting Tests
&lt;/h3&gt;

&lt;p&gt;"Tests pass, so we're fine."&lt;/p&gt;

&lt;p&gt;In practice, this translates to:&lt;/p&gt;

&lt;p&gt;"Wechecked a few cases and nothing broke."&lt;/p&gt;

&lt;p&gt;This pattern repeats. Especially after adding LLMs.&lt;/p&gt;

&lt;p&gt;Most teams have been here.&lt;/p&gt;

&lt;p&gt;Tests sample. They miss edge cases. They rarely cover invariants.&lt;/p&gt;

&lt;p&gt;So this happens:&lt;/p&gt;

&lt;p&gt;All tests pass. Production breaks.&lt;/p&gt;

&lt;p&gt;Not an exception. A structural result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test passage is observation. Correctness is a property.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Trusting Human Review
&lt;/h3&gt;

&lt;p&gt;"A person reviewed it, so it's fine."&lt;/p&gt;

&lt;p&gt;Even less stable.&lt;/p&gt;

&lt;p&gt;Humans don't scale. LLM code output explodes. Reviews drift toward "looks about right."&lt;/p&gt;

&lt;p&gt;More fundamentally:&lt;/p&gt;

&lt;p&gt;Code review is not correctness verification. It's a consensus process.&lt;/p&gt;

&lt;p&gt;Consensus can be wrong.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Trusting LLM-as-Critic
&lt;/h3&gt;

&lt;p&gt;A popular idea lately.&lt;/p&gt;

&lt;p&gt;"Run another model to double-check?"&lt;/p&gt;

&lt;p&gt;Sounds reasonable. Many teams try this.&lt;/p&gt;

&lt;p&gt;But the structure is:&lt;/p&gt;

&lt;p&gt;Probabilistic system + probabilistic system&lt;/p&gt;

&lt;p&gt;Result stays the same:&lt;/p&gt;

&lt;p&gt;Still probabilistic.&lt;/p&gt;

&lt;p&gt;You can create consensus. But:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consensus is not proof.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Common Reactions
&lt;/h2&gt;

&lt;p&gt;Raise this topic, and responses split two ways.&lt;/p&gt;

&lt;p&gt;First:&lt;/p&gt;

&lt;p&gt;"The problem isn't correctness—it's shipping useful things fast."&lt;/p&gt;

&lt;p&gt;Second:&lt;/p&gt;

&lt;p&gt;"Invariants aren't perfect either, so keep verifying with LLMs."&lt;/p&gt;

&lt;p&gt;Both sound reasonable. Both collapse at the same point.&lt;/p&gt;




&lt;h3&gt;
  
  
  "Shipping vs Correctness" Is a False Choice
&lt;/h3&gt;

&lt;p&gt;Fast at first.&lt;/p&gt;

&lt;p&gt;A few tests. One review. Ship it.&lt;/p&gt;

&lt;p&gt;But over time:&lt;/p&gt;

&lt;p&gt;Every fix breaks something else. Unpredictable. Debugging costs explode.&lt;/p&gt;

&lt;p&gt;Eventually:&lt;/p&gt;

&lt;p&gt;Lack of correctness kills shipping speed.&lt;/p&gt;

&lt;p&gt;With LLMs, this happens much faster.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why "LLM Verification" Collapses
&lt;/h3&gt;

&lt;p&gt;Second idea:&lt;/p&gt;

&lt;p&gt;"LLM can verify invariants too, right?"&lt;/p&gt;

&lt;p&gt;This is usually where intuition misaligns.&lt;/p&gt;

&lt;p&gt;Structure breaks here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. No termination&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Need a verifier → who verifies the verifier? → another model? another?&lt;/p&gt;

&lt;p&gt;No end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Results aren't fixed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same input → different output&lt;/p&gt;

&lt;p&gt;At this point:&lt;/p&gt;

&lt;p&gt;You can't consistently state "what's correct."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Authority vanishes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Who's the final judge?&lt;/p&gt;

&lt;p&gt;The model? Training data? Prompt?&lt;/p&gt;

&lt;p&gt;No clear answer.&lt;/p&gt;

&lt;p&gt;This isn't a system. It's:&lt;/p&gt;

&lt;p&gt;"A structure that drifts toward whatever seems plausible."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question to Ask Again
&lt;/h2&gt;

&lt;p&gt;What matters isn't the method.&lt;/p&gt;

&lt;p&gt;"In this system, what decides 'correct'?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No answer means the system is already unstable.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Trust Boundary Categories
&lt;/h2&gt;

&lt;p&gt;Not all boundaries are equal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empirical boundary&lt;/strong&gt;: Tests, benchmarks, runtime monitors. Observation-based. Cannot prove absence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Social boundary&lt;/strong&gt;: Code review, approval workflows. Authority-based. Doesn't scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal boundary&lt;/strong&gt;: Invariants, type systems, proofs. Mathematical necessity. Deterministic.&lt;/p&gt;

&lt;p&gt;Most systems use a mix of all three.&lt;/p&gt;

&lt;p&gt;The problem is not recognizing which one you're relying on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Should the Trust Boundary Be?
&lt;/h2&gt;

&lt;p&gt;Simple choice.&lt;/p&gt;

&lt;p&gt;Somewhere:&lt;/p&gt;

&lt;p&gt;A point must decide "this is correct."&lt;/p&gt;

&lt;p&gt;And that point:&lt;/p&gt;

&lt;p&gt;Cannot be probabilistic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Redefining the Structure
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;BEFORE (what most systems do)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM → Code → Tests → Ship
              ↓
         (uncertain)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No clear decision point anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AFTER&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM → Proposes (may be wrong)
System → Encodes to spec
Proof → Decides validity     ← (this is the boundary)
Execution → Runs only what passes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The boundary exists explicitly.&lt;/p&gt;

&lt;p&gt;The difference is not the tools.&lt;br&gt;
It's where the boundary sits.&lt;/p&gt;

&lt;p&gt;One thing matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The verification step must be non-negotiable.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Definition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A trust boundary is where correctness becomes non-negotiable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before: exploration, generation, probability&lt;br&gt;
After: execution, responsibility, determinism&lt;/p&gt;

&lt;p&gt;Don't draw this clearly, and:&lt;/p&gt;

&lt;p&gt;The entire system stays in "ambiguous territory."&lt;/p&gt;




&lt;h2&gt;
  
  
  This Isn't a New Idea
&lt;/h2&gt;

&lt;p&gt;Formal methods. Proof systems. Invariant-based design.&lt;/p&gt;

&lt;p&gt;These already exist.&lt;/p&gt;

&lt;p&gt;What's different now:&lt;/p&gt;

&lt;p&gt;We haven't placed them at the center of code generation pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  So What Does This Look Like in Practice?
&lt;/h2&gt;

&lt;p&gt;Naturally, this structure emerges:&lt;/p&gt;

&lt;p&gt;Define invariants first&lt;br&gt;
Prove code satisfies them&lt;br&gt;
Execute only what passes&lt;/p&gt;

&lt;p&gt;This isn't about writing better tests.&lt;br&gt;
It's about redrawing the trust boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Axiom
&lt;/h2&gt;

&lt;p&gt;One attempt to implement this: Axiom.&lt;/p&gt;

&lt;p&gt;Define state in terms of invariants&lt;br&gt;
Decide correctness through proof&lt;br&gt;
Block regression structurally&lt;/p&gt;

&lt;p&gt;If the boundary has been implicit until now,&lt;br&gt;
Axiom tries to make it explicit and enforce it.&lt;/p&gt;

&lt;p&gt;The key isn't features—it's position.&lt;/p&gt;

&lt;p&gt;Clarifying "what holds final authority."&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI systems come down to one of two kinds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Probabilistic systems&lt;/strong&gt;&lt;br&gt;
Mostly right. Sometimes wrong. Don't know where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Deterministic systems&lt;/strong&gt;&lt;br&gt;
Only proven code passes. Upfront cost. Doesn't fail unpredictably.&lt;/p&gt;

&lt;p&gt;No middle ground.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Question
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;In your system, who can say "this is correct"?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can't answer clearly?&lt;br&gt;
Then the trust boundary doesn't exist.&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>ai</category>
      <category>trust</category>
      <category>boundary</category>
    </item>
    <item>
      <title>Architecture Philosophy: Rule-First Design</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Mon, 06 Apr 2026 12:31:29 +0000</pubDate>
      <link>https://forem.com/wintrover/architecture-philosophy-rule-first-design-5gd1</link>
      <guid>https://forem.com/wintrover/architecture-philosophy-rule-first-design-5gd1</guid>
      <description>&lt;h2&gt;
  
  
  The Core Question
&lt;/h2&gt;

&lt;p&gt;When building an engine to verify AI-written code, architects hit a brutal dilemma. Where do rules live, and who enforces them?&lt;/p&gt;

&lt;p&gt;Axiom's answer is simple. &lt;strong&gt;Rules must precede code.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Rule-First Principle
&lt;/h2&gt;

&lt;p&gt;Traditional software development works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write code first.&lt;/li&gt;
&lt;li&gt;Run tests to catch bugs.&lt;/li&gt;
&lt;li&gt;Add rules to prevent similar bugs.&lt;/li&gt;
&lt;li&gt;Repeat forever.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This reactive approach shatters catastrophically when verifying AI-generated code. LLMs excel at crafting outputs that pass tests while concealing logical flaws—they're geniuses at creating "plausible wrong answers." Their probabilistic nature means bugs reappear in subtly different forms, bypassing test-based detection.&lt;/p&gt;

&lt;p&gt;Without rules locked in first, the engine eventually gaslights itself. It starts lowering verification standards to accommodate AI hallucinations. A verification engine that discovers rules from code inspection ends up in a self-justifying loop. It validates code against constraints derived from that same code.&lt;/p&gt;

&lt;p&gt;So Axiom inverts the paradigm completely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declare rules&lt;/strong&gt;: Define invariants, contracts, safety properties first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encode&lt;/strong&gt;: Transform rules into verification artifacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prove&lt;/strong&gt;: Mathematically prove code satisfies rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt;: Only then allow execution.&lt;/li&gt;
&lt;/ol&gt;
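
&lt;p&gt;Compressed into a toy Python sketch (names are mine, and the "prove" step is exhaustive checking over a finite domain rather than a real proof), the inversion looks like this:&lt;/p&gt;

```python
# Rule-first pipeline: rules exist before any code runs, and
# execution is gated on the proof step.

def declare_rules():
    # 1. Declare: invariants stated up front, independent of any code.
    return [lambda x: x >= 0]          # e.g. "this value never goes negative"

def encode(rules):
    # 2. Encode: turn the rules into a single checkable predicate.
    return lambda x: all(rule(x) for rule in rules)

def prove(predicate, states):
    # 3. Prove: here, exhaustively over a finite state space.
    return all(predicate(s) for s in states)

def execute(action):
    # 4. Execute: only reached when the proof step passed.
    return action()

rules = declare_rules()
holds = prove(encode(rules), states=range(10))
result = execute(lambda: "ran") if holds else "blocked"
print(result)  # prints ran
```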




&lt;h2&gt;
  
  
  Constitution-Ordinance-Annals: Layering Rules
&lt;/h2&gt;

&lt;p&gt;Axiom splits rules into three layers. Structure borrowed from legal systems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Keyword&lt;/th&gt;
&lt;th&gt;File Pattern&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Change Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constitution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CONTEXT.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immutable design principles&lt;/td&gt;
&lt;td&gt;Almost never&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ordinances&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Spec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docs/spec/*.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Detailed algorithms and constraints&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Annals&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;History&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docs/history/*.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Decision records and history&lt;/td&gt;
&lt;td&gt;Frequent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Layer 1: Constitution (Context)
&lt;/h3&gt;

&lt;p&gt;Project identity. Principles like "Core calculations must use pure functions (func) only" belong here. Stuff that breaks the project's foundation. Modifying &lt;code&gt;CONTEXT.md&lt;/code&gt; requires serious architectural review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## 2. Core Architecture Principles (SSOT)&lt;/span&gt;

&lt;span class="gu"&gt;### Purity Hierarchy&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Core Calculation Tier**&lt;/span&gt;: &lt;span class="sb"&gt;`src/core`&lt;/span&gt; computation uses &lt;span class="sb"&gt;`func`&lt;/span&gt; exclusively
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Boundary Executor Tier**&lt;/span&gt;: &lt;span class="sb"&gt;`proc`&lt;/span&gt; exists only at OS/interfaces
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Executor-only Rule**&lt;/span&gt;: Boundary &lt;span class="sb"&gt;`proc`&lt;/span&gt; cannot own business logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 2: Ordinances (Spec)
&lt;/h3&gt;

&lt;p&gt;Actual running rules. SMT solver interfaces, Lean 4 proof strategies—concrete specifications live here. Guidelines during implementation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bmc_core.md&lt;/code&gt;: Model checking algorithms, Z3 interfaces, timeout rules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uap_logic.md&lt;/code&gt;: Universal pipeline stages, counterexample promotion, invariant refinement&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lean_proof.md&lt;/code&gt;: Lean 4 semantics, tactic strategies, replay protocols&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 3: Annals (History)
&lt;/h3&gt;

&lt;p&gt;Answers "why did we decide that back then?" Records past compromises and decision context. Annals are for reference only—they should never govern current implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## 2026-03-26&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Core introduced Bottom-up Sorter, Vault Manager, Axiom Orchestration Loop
&lt;span class="p"&gt;-&lt;/span&gt; Established stale axiom blocking and vault-based promotion paths
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Annals stay separate from rules. When you need "why does this rule exist?" you check annals. When implementing code, you focus on constitution and ordinances. No wading through historical noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  No-Downgrade Policy: The Hard Line
&lt;/h2&gt;

&lt;p&gt;Axiom enforces one rule above all others. &lt;strong&gt;Integrity can only increase, never decrease.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The func→proc Trap
&lt;/h3&gt;

&lt;p&gt;Nim strictly separates pure functions (&lt;code&gt;func&lt;/code&gt;) and impure procedures (&lt;code&gt;proc&lt;/code&gt;). During development, temptation strikes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Just one quick &lt;code&gt;echo&lt;/code&gt; here. Let me temporarily switch to &lt;code&gt;proc&lt;/code&gt;..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Doesn't work in Axiom.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Downgrade is one-way&lt;/strong&gt;: Once &lt;code&gt;proc&lt;/code&gt; exists, developers keep adding side effects. It spreads like weeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purity is binary&lt;/strong&gt;: Either a function is pure or it isn't. "Slightly impure" is meaningless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification requires purity&lt;/strong&gt;: You can't mathematically prove properties about functions with unpredictable side effects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nim 2.0+ &lt;code&gt;strictFuncs&lt;/code&gt; mode and effect tracking make this distinction even stronger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# strictFuncs enforces purity at compile time&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt; &lt;span class="n"&gt;strictFuncs&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;verifyLogic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;void&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Compiler rejects any external state mutation or IO call&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isValid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;InvalidNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why Nim fits Axiom's architecture perfectly. The language itself enforces purity boundaries at compile time. Not just at code review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Axiom's Enforcement Mechanism
&lt;/h3&gt;

&lt;p&gt;Axiom maintains a &lt;strong&gt;Purity Hierarchy&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│ Core Calculation Tier (src/core)                            │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ func-only zone: computation, parsing, normalization     │ │
│ │ Returns: Result[T], Option[T], VerifiedType[T]         │ │
│ │ Forbidden: IO, mutation, exceptions                     │ │
│ └─────────────────────────────────────────────────────────┘ │
│                                                             │
│ Boundary Executor Tier (src/ui, src/cli, gateway)          │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ proc-allowed zone: OS, TTY, filesystem, processes       │ │
│ │ Role: Execute core func results, inject dependencies    │ │
│ │ Forbidden: Business logic, domain rules                  │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;proc&lt;/code&gt; shows up in &lt;code&gt;src/core&lt;/code&gt;, it gets flagged as &lt;strong&gt;design debt&lt;/strong&gt;—a temporary compromise with a mandatory refactor plan. Code doesn't ship until:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Executor boundary clearly marked&lt;/li&gt;
&lt;li&gt;"func promotion" task exists in backlog&lt;/li&gt;
&lt;li&gt;Impurity reason documented in &lt;code&gt;CONTEXT.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What Happens When You Violate This Rule
&lt;/h3&gt;

&lt;p&gt;In Axiom, downgrading &lt;code&gt;func&lt;/code&gt; to &lt;code&gt;proc&lt;/code&gt; isn't just code modification. &lt;strong&gt;It's treated as destroying mathematical integrity.&lt;/strong&gt; Downgrades without proper justification (&lt;code&gt;CONTEXT.md&lt;/code&gt; update) get blocked at CI stage entirely.&lt;/p&gt;

&lt;p&gt;Axiom's CI enforces this through &lt;strong&gt;Procedure Gate&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If &lt;code&gt;src/core&lt;/code&gt; files change, check for &lt;code&gt;proc&lt;/code&gt; additions&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;func&lt;/code&gt;→&lt;code&gt;proc&lt;/code&gt; conversion appears in diff, &lt;strong&gt;fail the commit&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;proc&lt;/code&gt; exists without corresponding Boundary documentation, &lt;strong&gt;fail the commit&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Only exception: &lt;code&gt;--style-only&lt;/code&gt; changes (typos, comments, whitespace). Everything else needs architectural justification.&lt;/p&gt;
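
&lt;p&gt;A Procedure Gate can be sketched in a few lines (this is my hypothetical reconstruction, not Axiom's actual CI script): scan the diff and fail when &lt;code&gt;src/core&lt;/code&gt; gains a &lt;code&gt;proc&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical Procedure Gate: walk a unified diff and flag any
# added `proc` declaration inside src/core.
import re

def procedure_gate(diff):
    violations = []
    current_file = None
    for line in diff.splitlines():
        if line.startswith("+++ "):
            current_file = line[4:].strip()
        elif line.startswith("+") and current_file and "src/core" in current_file:
            # Added line in a core file: reject any new proc declaration.
            if re.match(r"\+\s*proc\s", line):
                violations.append((current_file, line.lstrip("+").strip()))
    return violations

diff = """\
+++ b/src/core/verify.nim
+proc leakyHelper(x: int): int =
+  echo x
"""
print(procedure_gate(diff))  # flags the added proc in src/core
```

&lt;p&gt;A real gate would also diff against the previous revision to catch &lt;code&gt;func&lt;/code&gt;→&lt;code&gt;proc&lt;/code&gt; conversions and cross-check the Boundary documentation, but the shape is the same: the commit fails before review even starts.&lt;/p&gt;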

&lt;p&gt;At compile time, Nim's effect system provides additional enforcement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# [Layer 1: Core Calculation Tier]&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt; &lt;span class="n"&gt;strictFuncs&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;verifyNodeIntegrity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AxiomNode&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;void&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegrityError&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="sd"&gt;## No IO, no mutation, no exceptions - mathematically pure&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fingerprint&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;computedFingerprint&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IntegrityError&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FingerprintMismatch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# [Layer 2: Boundary Executor Tier]&lt;/span&gt;
&lt;span class="k"&gt;proc &lt;/span&gt;&lt;span class="nf"&gt;executeVerification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AxiomNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;IOError&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="sd"&gt;## Thin executor: injects dependencies, delegates all logic to func&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verifyNodeIntegrity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# Calls pure function&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isOk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Verification Passed&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# IO isolated at boundary&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Verification Failed: "&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the pattern? &lt;code&gt;verifyNodeIntegrity&lt;/code&gt; is pure mathematics. &lt;code&gt;executeVerification&lt;/code&gt; is a thin wrapper handling IO. Business logic never crosses into impure territory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Example
&lt;/h3&gt;

&lt;p&gt;When implementing report rendering, Axiom faced a choice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tempting initial approach:&lt;/span&gt;
&lt;span class="k"&gt;proc &lt;/span&gt;&lt;span class="nf"&gt;renderReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProofResult&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Direct IO: writes to stdout&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;echo&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;# Enforced purity approach:&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;renderReportLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProofResult&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;noSideEffect&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Pure computation: returns strings&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;proc &lt;/span&gt;&lt;span class="nf"&gt;writeReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProofResults&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Thin executor: injects output stream&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;renderReportLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;echo&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second approach separates concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;renderReportLines&lt;/code&gt; is pure, testable, verifiable&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;writeReport&lt;/code&gt; is a trivial executor, replaceable with file output, network output, or a test mock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't over-engineering. &lt;strong&gt;It's making verification possible.&lt;/strong&gt;&lt;/p&gt;
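&lt;p&gt;The same split can be sketched outside Nim. A minimal Python analogue (names are hypothetical, not Axiom's API) shows why the pure half is trivially testable while the executor stays replaceable:&lt;/p&gt;

```python
# Hypothetical Python analogue of the renderReportLines/writeReport split.
# The pure function returns data; the thin executor performs IO.

def render_report_lines(results):
    """Pure: maps results to display strings, no IO."""
    return [f"{name}: {'PASS' if ok else 'FAIL'}" for name, ok in results]

def write_report(results, out):
    """Thin executor: the output sink is injected, so tests can pass anything."""
    for line in render_report_lines(results):
        out.write(line + "\n")

# The pure half is tested with plain equality -- no mocks, no output capture:
assert render_report_lines([("auth", True), ("io", False)]) == ["auth: PASS", "io: FAIL"]
```

&lt;p&gt;Because all decisions live in the pure function, swapping &lt;code&gt;out&lt;/code&gt; for a file, a socket, or an in-memory buffer changes nothing about what gets verified.&lt;/p&gt;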




&lt;h2&gt;
  
  
  Why These Principles Exist
&lt;/h2&gt;

&lt;p&gt;Axiom verifies AI-generated code. To do this credibly, it must be &lt;strong&gt;more rigorous than the code it verifies.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rule-First design ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verification rules never get retrofitted to justify existing code&lt;/li&gt;
&lt;li&gt;Architecture principles are explicit, discoverable, enforced&lt;/li&gt;
&lt;li&gt;Integrity only moves one direction: upward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Constitution-Ordinance-Annals separation ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Principles stay visible without drowning in details&lt;/li&gt;
&lt;li&gt;Implementers find specifications easily&lt;/li&gt;
&lt;li&gt;History informs but doesn't clutter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No-Downgrade Policy ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purity isn't sacrificed for convenience&lt;/li&gt;
&lt;li&gt;Verification stays mathematically sound&lt;/li&gt;
&lt;li&gt;Technical debt is visible, tracked, temporary&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These principles aren't aspirational. They are enforced by CI gates, code review, and architectural documentation. Every Axiom commit must pass the Procedure Gate. Every &lt;code&gt;proc&lt;/code&gt; in the core must justify its existence. Every rule change must be documented before implementation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The architecture must itself be verifiable.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Building a verification engine means &lt;strong&gt;you cannot verify probabilistic code with probabilistic tools.&lt;/strong&gt; The foundation must be deterministic, mathematically proven, architecturally pure.&lt;/p&gt;

&lt;p&gt;Next post: how Axiom transforms mathematical promises into executable proofs through bounded model checking (BMC).&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>architecture</category>
      <category>rulefirst</category>
      <category>formal</category>
    </item>
    <item>
      <title>Axiom: Deterministic Integrity Engine for Probabilistic AI</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Mon, 06 Apr 2026 11:16:13 +0000</pubDate>
      <link>https://forem.com/wintrover/axiom-deterministic-integrity-engine-for-probabilistic-ai-1j30</link>
      <guid>https://forem.com/wintrover/axiom-deterministic-integrity-engine-for-probabilistic-ai-1j30</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Why an Integrity Verification Engine?
&lt;/h2&gt;

&lt;p&gt;How can we be certain that AI-generated code is "correct"? Beyond simply compiling or passing tests, can we mathematically prove that code is free from race conditions, memory safety violations, and logical flaws?&lt;/p&gt;

&lt;p&gt;Axiom was born from a fundamental question in software engineering: &lt;strong&gt;"How far can we trust AI-written code?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our answer lies not in more test cases, but in &lt;strong&gt;mathematical verification&lt;/strong&gt;. Axiom replaces probabilistic AI outputs with deterministic, robust software. To achieve this, we combine the following technical pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bounded Model Checking (BMC)&lt;/strong&gt;: Complete integrity verification within defined exploration bounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMT Solver (Z3)&lt;/strong&gt;: Automated theorem proving based on logical constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lean 4&lt;/strong&gt;: Ensuring deterministic reproducibility of high-level design principles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dr.Nim&lt;/strong&gt;: Powerful Design-by-Contract (DbC) verification embedded in the Nim language&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Reliability Problem in AI-Agent Generated Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Limitations of Probabilistic Models
&lt;/h3&gt;

&lt;p&gt;Modern LLMs operate on probability. When AI generates code, it doesn't search for the logically most correct answer — it stochastically chains together &lt;strong&gt;"the most plausible next token"&lt;/strong&gt; from its training data.&lt;/p&gt;

&lt;p&gt;The result is a dangerous gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Surface-level correctness&lt;/strong&gt;: Code appears to work, but internal edge cases remain unaddressed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical hallucination&lt;/strong&gt;: Systems collapse without warning under untested runtime conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and concurrency defects&lt;/strong&gt;: Race conditions in parallel execution paths that humans easily miss remain difficult puzzles for AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Edsger Dijkstra once said: &lt;em&gt;"Testing proves the existence of bugs, not their absence."&lt;/em&gt; In the AI era, this maxim cuts even deeper.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Asymmetry of Verification
&lt;/h3&gt;

&lt;p&gt;A dangerous asymmetry exists in current software development:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI Code Generation Speed:   ████████████████████ (Overwhelming)
Human Code Review Speed:    ██ (Lagging)
Trust Gap:                  ██████████████████
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Axiom was created to bridge this trust gap. Our goal extends beyond augmenting human judgment — we enhance AI's productivity with mathematical certainty.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations of Traditional Static Analysis and the Value of BMC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Existing Tools Fall Short
&lt;/h3&gt;

&lt;p&gt;Traditional linters and static analyzers rely on simple pattern matching and heuristic rules. They merely speculate: &lt;em&gt;"This code is probably safe."&lt;/em&gt; In contrast, Axiom answers: &lt;strong&gt;"Is this code provably correct?"&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;th&gt;False Positive Rate&lt;/th&gt;
&lt;th&gt;Formal Guarantees&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linting&lt;/td&gt;
&lt;td&gt;Pattern matching&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type Check&lt;/td&gt;
&lt;td&gt;Type inference&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Execution sampling&lt;/td&gt;
&lt;td&gt;Incomplete&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Axiom (BMC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Exhaustive exploration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Complete (within bounds)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Complete proofs&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Bounded Model Checking: Realizing Complete Verification
&lt;/h3&gt;

&lt;p&gt;BMC systematically explores all states a program can reach within defined bounds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# BMC doesn't merely execute — it mathematically deconstructs every execution path&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProcessedData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ErrorCode&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Axiom mathematically proves:&lt;/span&gt;
  &lt;span class="c"&gt;# 1. Buffer overflow probability: 0%&lt;/span&gt;
  &lt;span class="c"&gt;# 2. Post-conditions satisfied on all paths&lt;/span&gt;
  &lt;span class="c"&gt;# 3. Invariants maintained across all reachable states&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At Axiom's core sits Z3, Microsoft Research's SMT Solver. It encodes program semantics into logical formulas. When Z3 determines a negated assertion is &lt;strong&gt;UNSAT (unsatisfiable)&lt;/strong&gt;, we have mathematical certainty: &lt;strong&gt;no execution path exists that could violate that assertion&lt;/strong&gt;.&lt;/p&gt;
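&lt;p&gt;The bounded-checking idea can be illustrated in miniature. This toy Python sketch (a brute-force stand-in, not Axiom's actual Z3 pipeline, which works symbolically) searches the whole bounded input space for a counterexample; an empty search corresponds to Z3 reporting the negated assertion UNSAT:&lt;/p&gt;

```python
# Toy illustration of bounded checking (not Axiom's actual Z3 encoding).
# We "prove" a property by exhaustively searching the bounded input space for
# a counterexample; an empty search mirrors Z3 declaring the negation UNSAT.

def clamp_index(i, length):
    """Function under verification: must return a valid index for length > 0."""
    return max(0, min(i, length - 1))

def bounded_check(bound):
    """Search all (i, length) pairs within the bound for an assertion violation."""
    for i in range(-bound, bound + 1):
        for length in range(1, bound + 1):
            idx = clamp_index(i, length)
            if not (0 <= idx < length):    # negated assertion
                return (i, length)          # SAT: concrete counterexample found
    return None                             # UNSAT within the bound

print(bounded_check(16))  # None: no path within the bound violates the assertion
```

&lt;p&gt;The difference in the real engine is that Z3 explores this space symbolically through logical constraints instead of enumerating it, which is what makes the approach scale.&lt;/p&gt;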




&lt;h2&gt;
  
  
  The Birth of Axiom: Goals and Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;p&gt;Axiom was architected with these foundational principles:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Deterministic Integrity
&lt;/h4&gt;

&lt;p&gt;Every verification result must be &lt;strong&gt;reproducible and mathematically proven&lt;/strong&gt;. No probabilistic judgments, no heuristics. Pure logical consequence.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Effect-Free Core
&lt;/h4&gt;

&lt;p&gt;The verification engine maintains strict purity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# Core verification functions are marked:&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;noSideEffect&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;  &lt;span class="c"&gt;# No IO, no mutation of external state&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;    &lt;span class="c"&gt;# No exceptions — errors are data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This guarantees that verification logic itself cannot introduce bugs through side effects.&lt;/p&gt;
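&lt;p&gt;What "errors are data" means in practice can be sketched in Python (an illustrative analogue of the &lt;code&gt;raises: []&lt;/code&gt; discipline, not Axiom's actual types): the function never throws, so every outcome is visible in its return value.&lt;/p&gt;

```python
# Sketch of "errors are data": instead of raising, the function returns a
# value that encodes success or failure, so every outcome is explicit.
from dataclasses import dataclass
from typing import Union

@dataclass
class Ok:
    value: object

@dataclass
class Err:
    error: str

Result = Union[Ok, Err]

def parse_bound(text: str) -> Result:
    """Pure parser: never raises, never touches IO; failures come back as Err."""
    if not text.isdigit():
        return Err("not a number")
    n = int(text)
    return Ok(n) if n > 0 else Err("bound must be positive")

assert parse_bound("8") == Ok(8)
assert parse_bound("abc") == Err("not a number")
```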

&lt;h4&gt;
  
  
  3. Gateway-First Architecture
&lt;/h4&gt;

&lt;p&gt;All external input passes through verification gates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Untrusted Input → Gateway (Parse/Validate/Promote) → VerifiedType → Core Engine
                                              ↓
                                     Reject with structured error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core never sees untrusted data. All validation happens at the boundaries.&lt;/p&gt;
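&lt;p&gt;A minimal sketch of the gateway pattern, with hypothetical names (this is not Axiom's actual gateway code): raw input is parsed and validated exactly once at the boundary, then promoted to a distinct type that the core accepts blindly.&lt;/p&gt;

```python
# Sketch of Parse/Validate/Promote at the boundary. Only the gateway can
# construct the verified type, so the core never handles untrusted strings.
from dataclasses import dataclass

@dataclass(frozen=True)
class VerifiedModuleName:
    """Promoted type: existence implies validation already happened."""
    name: str

def gateway_promote(raw: str):
    """Returns (VerifiedModuleName, None) on success, (None, structured_error) on reject."""
    candidate = raw.strip()
    if not candidate.isidentifier():
        return None, {"code": "INVALID_MODULE", "input": raw}
    return VerifiedModuleName(candidate), None

def core_verify(module: VerifiedModuleName) -> str:
    # The core only ever sees promoted values.
    return f"verifying {module.name}"

promoted, err = gateway_promote("  auth ")
assert err is None and core_verify(promoted) == "verifying auth"
```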

&lt;h4&gt;
  
  
  4. CLI as Primary Interface
&lt;/h4&gt;

&lt;p&gt;Axiom prioritizes command-line interfaces for automation and integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ax scan ./src              &lt;span class="c"&gt;# Scan codebase structure&lt;/span&gt;
ax prove &lt;span class="nt"&gt;--module&lt;/span&gt; auth    &lt;span class="c"&gt;# Verify specific module&lt;/span&gt;
ax report &lt;span class="nt"&gt;--format&lt;/span&gt; json   &lt;span class="c"&gt;# Generate structured report&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables seamless integration into CI/CD pipelines and development workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Series Will Cover
&lt;/h2&gt;

&lt;p&gt;This introduction marks the beginning of a detailed technical series exploring:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;BMC Core Implementation&lt;/strong&gt;: How Axiom encodes program semantics for model checking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z3 Integration Patterns&lt;/strong&gt;: Effective use of SMT solvers in program verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lean 4 Proof Replay&lt;/strong&gt;: Generating and validating formal proofs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Universal Assertion Pipeline (UAP)&lt;/strong&gt;: A generalized verification framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dr.Nim Contract System&lt;/strong&gt;: Embedding verification contracts in code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof Vault Architecture&lt;/strong&gt;: Managing verification artifacts at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Integration&lt;/strong&gt;: Automating verification in development workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Engineering&lt;/strong&gt;: Optimizing BMC for real-world codebases&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each post will combine theoretical foundations with concrete implementation details from the Axiom codebase.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vision Ahead
&lt;/h2&gt;

&lt;p&gt;Axiom's mission is clear: &lt;strong&gt;Refine AI's non-deterministic outputs into mathematically proven deterministic assets&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This isn't about replacing developers or AI assistants. It's about augmenting every participant in the software development lifecycle with certainty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; get immediate feedback on logical correctness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizations&lt;/strong&gt; get audit trails of mathematical proofs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security teams&lt;/strong&gt; get guarantees beyond penetration testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI systems&lt;/strong&gt; get a verification layer that elevates their output quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of software engineering isn't just about generating code faster — it's about generating code that we can mathematically prove correct. Axiom is the engine that makes this possible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in this series: Architecture Philosophy - Rule First Principles and Constitution-Ordinance-Annals Separation.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>bmc</category>
      <category>model</category>
      <category>checking</category>
    </item>
    <item>
      <title>Why We Still Don't Trust AI-Generated Code: The Archright Trinity</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:08:20 +0000</pubDate>
      <link>https://forem.com/wintrover/why-we-still-dont-trust-ai-generated-code-the-archright-trinity-105h</link>
      <guid>https://forem.com/wintrover/why-we-still-dont-trust-ai-generated-code-the-archright-trinity-105h</guid>
      <description>&lt;p&gt;"You just can't trust code written by AI."&lt;/p&gt;

&lt;p&gt;Until right before I decided to resign, this was the sentence I heard most often in real engineering teams.&lt;br&gt;
The paradox was obvious: organizations wanted "10x productivity" from AI, yet deeply distrusted the output.&lt;/p&gt;

&lt;p&gt;I stood in the middle of that contradiction, in agony, drilling into the root cause.&lt;br&gt;
Why do we fail to trust AI-generated code? Is it simply because the tool is imperfect?&lt;/p&gt;

&lt;p&gt;No.&lt;br&gt;
My conclusion was this: the real problem is a distorted process that burns human labor to patch AI uncertainty.&lt;/p&gt;

&lt;p&gt;Organizations forced engineers to review hundreds of lines produced in seconds through naked-eye inspection and overtime.&lt;br&gt;
That was not a productivity gain.&lt;br&gt;
It was the hell of review labor.&lt;br&gt;
I found that this distrust consistently maps to three fundamental deficits.&lt;/p&gt;



&lt;h3&gt;
  
  
  1) Deficit of Intent: Is this code truly aligned with my design context?
&lt;/h3&gt;

&lt;p&gt;When intent is not preserved, teams cannot prove whether generated code matches architectural decisions.&lt;br&gt;
The output may look correct, yet still violate what the builder actually meant.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Deficit of Stability: Will it break at runtime or quietly degrade performance?
&lt;/h3&gt;

&lt;p&gt;Without deterministic controls, generated code remains a probability game.&lt;br&gt;
Even if it passes quickly, hidden runtime failures and regressions can emerge later.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Deficit of Security: Does it work now but plant future vulnerabilities?
&lt;/h3&gt;

&lt;p&gt;Many AI outputs are operationally acceptable in the moment but logically under-proven.&lt;br&gt;
That gap becomes a delayed risk multiplier.&lt;/p&gt;




&lt;p&gt;Archright exists to solve this asymmetry as a system.&lt;br&gt;
Instead of turning humans into disposable verification parts, I translated the declaration of deterministic integrity from my resignation into three concrete technical pillars.&lt;/p&gt;



&lt;h3&gt;
  
  
  1. Thought Trajectory System: Freezing Intent
&lt;/h3&gt;

&lt;p&gt;It freezes builder intent and context as durable records, like a GitHub commit log for reasoning.&lt;br&gt;
It fixes the architect's thought flow as explicit data so AI inference does not remain a black box.&lt;br&gt;
That creates transparent intent anyone can onboard from immediately, while slashing communication cost.&lt;/p&gt;

&lt;p&gt;This system is not a simple log.&lt;br&gt;
In Archright, intent is used through this flow:&lt;/p&gt;

&lt;p&gt;Requirements input&lt;br&gt;
→ Capture the architect's intent in a structured form&lt;br&gt;
→ Use it as the reference point across every generation/verification step&lt;br&gt;
→ Validate consistency against "intent," not just against code&lt;/p&gt;

&lt;p&gt;In short, code is only one output that must satisfy this intent.&lt;/p&gt;
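&lt;p&gt;The shape of "intent as a reference point" can be sketched as structured data. The field names below are purely illustrative, not Archright's actual schema; the point is that a machine-checkable record replaces prose.&lt;/p&gt;

```python
# Hypothetical sketch: intent captured as structured data rather than prose,
# so later stages can check outputs against it mechanically.
from dataclasses import dataclass, field

@dataclass
class IntentRecord:
    goal: str                                         # what the architect wants
    constraints: list = field(default_factory=list)   # rules every output must satisfy

def consistent_with(intent: IntentRecord, discharged: set) -> bool:
    """An output is consistent only if it discharges every recorded constraint."""
    return all(c in discharged for c in intent.constraints)

intent = IntentRecord(
    goal="users read only their own data",
    constraints=["owner_check_before_read", "no_raw_sql"],
)
assert consistent_with(intent, {"owner_check_before_read", "no_raw_sql", "logging"})
assert not consistent_with(intent, {"no_raw_sql"})
```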

&lt;h3&gt;
  
  
  2. Nim Programming Language: A Trinity of Productivity, Performance, and Stability
&lt;/h3&gt;

&lt;p&gt;Why Nim instead of Rust?&lt;br&gt;
Rust is excellent for performance and safety, but often sacrifices production velocity, and its less human-friendly syntax can become a major trigger for AI-agent hallucinations.&lt;br&gt;
Nim preserves performance and stability while delivering exceptional readability.&lt;br&gt;
It is the optimal choice for a high-efficiency engine where both agents and humans communicate with clarity.&lt;/p&gt;

&lt;p&gt;Language choice is not a matter of taste.&lt;br&gt;
In Archright's flow:&lt;/p&gt;

&lt;p&gt;Intent (high level) → Constraints (intermediate form) → Code (low level)&lt;br&gt;
these three layers must remain continuously connected.&lt;/p&gt;

&lt;p&gt;We chose a language that minimizes the cost of maintaining this connection.&lt;br&gt;
Rust is powerful at solving "is this code safe?".&lt;br&gt;
However, Archright addresses an earlier question:&lt;br&gt;
"is this code correct in the first place?"&lt;/p&gt;

&lt;p&gt;We chose a language that allows this question to be handled before compile-time output.&lt;br&gt;
Intent and constraints are verified first, before they are converted into code.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Formal Verification: Mathematical Proof
&lt;/h3&gt;

&lt;p&gt;Trying to block probabilistic failure with another probabilistic tool is mathematically hollow.&lt;br&gt;
The emerging pattern of AI code review (such as Claude Review) is still a 99%-accurate AI inspecting code produced by another 99%-accurate AI.&lt;br&gt;
In that setup, accuracy may rise to 99.99%, but it can never become 100%.&lt;/p&gt;

&lt;p&gt;Most security incidents emerge precisely from that neglected 0.01% gap.&lt;br&gt;
In engineering, "almost certain" is a synonym for "not certain."&lt;br&gt;
Archright internalizes mathematical verification and authorization tools such as Z3 Solver, Lean 4, and Cedar in its engine.&lt;br&gt;
Instead of probabilistic comfort—"tests passed"—it proves "no exception exists" mathematically and frees engineers from review labor.&lt;/p&gt;

&lt;p&gt;This verification is not a post-hoc review.&lt;br&gt;
In Archright, it runs in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert intent into verifiable rules&lt;/li&gt;
&lt;li&gt;Validate those rules hold across all states&lt;/li&gt;
&lt;li&gt;Search automatically for counterexamples that break the rules&lt;/li&gt;
&lt;li&gt;Stop code generation when a counterexample is found&lt;/li&gt;
&lt;li&gt;Generate code only when all constraints hold&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, take the security condition "A user must only be allowed to read their own data." It means that,&lt;br&gt;
→ regardless of the incoming request,&lt;br&gt;
a user must only ever read their own data.&lt;/p&gt;

&lt;p&gt;If a request is sent with another user's ID:&lt;br&gt;
→ it is immediately detected as a counterexample and code generation is stopped.&lt;br&gt;
→ if even one violating case exists, that logic is not generated.&lt;/p&gt;

&lt;p&gt;In short, we do not "catch bugs."&lt;br&gt;
We create a state where bugs cannot exist.&lt;/p&gt;
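&lt;p&gt;The counterexample search above can be sketched in miniature. This toy Python brute force (a stand-in for what Z3 or Cedar do symbolically, not Archright's actual encoding) shows the decision rule: generate only when the search comes back empty.&lt;/p&gt;

```python
# Toy counterexample search for "a user may only read their own data".
from itertools import product

def allowed(requester_id, resource_owner_id):
    # Candidate policy under test. A buggy version (e.g. always True)
    # would be rejected by the search below.
    return requester_id == resource_owner_id

def find_counterexample(user_ids):
    """Return the first (requester, owner) pair violating the rule, else None."""
    for requester, owner in product(user_ids, repeat=2):
        if allowed(requester, owner) and requester != owner:
            return (requester, owner)   # access granted across users: violation
    return None

# No counterexample exists for the correct policy, so generation may proceed.
print(find_counterexample(range(100)))  # None
```

&lt;p&gt;With a symbolic solver the same check covers all IDs at once instead of a finite range, which is what turns "no counterexample found" into a proof.&lt;/p&gt;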



&lt;h2&gt;
  
  
  Reclaiming the Joy of Building
&lt;/h2&gt;

&lt;p&gt;On top of these three pillars, engineers are no longer janitors cleaning up AI-generated trash.&lt;br&gt;
They step away from tedious debugging and communication bottlenecks, and return as builders focused only on business-logic design and creative architecture.&lt;/p&gt;

&lt;p&gt;AI writes code.&lt;br&gt;
Archright creates a state where that code cannot be wrong.&lt;/p&gt;

&lt;p&gt;That is exactly the new software-engineering standard Archright proposes, and the reason I began this journey.&lt;/p&gt;

</description>
      <category>devlog</category>
      <category>ai</category>
      <category>formal</category>
      <category>verification</category>
    </item>
    <item>
      <title>In the Age of Probabilistic Intelligence, a Thirst for Deterministic Systems</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 17 Mar 2026 12:56:43 +0000</pubDate>
      <link>https://forem.com/wintrover/in-the-age-of-probabilistic-intelligence-a-thirst-for-deterministic-systems-294i</link>
      <guid>https://forem.com/wintrover/in-the-age-of-probabilistic-intelligence-a-thirst-for-deterministic-systems-294i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuxs43xjtpmekcnen8ie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuxs43xjtpmekcnen8ie.png" alt=" " width="800" height="1199"&gt;&lt;/a&gt;&lt;br&gt;
(I'm not as handsome as this guy; the image is AI generated, lol)&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;"AI is not trustworthy. Go back to coding by hand."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was a short sentence. But it perfectly captured the contradiction I had been living in.&lt;/p&gt;

&lt;p&gt;As a product engineer, I spent my time turning business priorities into shipped software. Speed mattered. Quality mattered. What became intolerable was watching an organization demand both while rejecting the engineering discipline required to make both possible.&lt;/p&gt;

&lt;p&gt;The market asks for faster delivery every quarter. But users allow almost no defects. Any engineer knows this combination is nearly unsustainable if you keep relying on grinding human labor. That is exactly why I believed AI was no longer optional. Even with probabilistic tools, we can still enforce deterministic reliability by filtering output through explicit control procedures. The core issue is not whether we use tools, but whether we build control systems.&lt;/p&gt;

&lt;p&gt;My organization chose the opposite path.&lt;/p&gt;

&lt;p&gt;Instead of building a control plane for AI usage, it framed AI itself as the problem. Disappointment from unskilled usage became evidence against the tool, not evidence of a missing system. The operating model became painfully clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep aggressive output targets.&lt;/li&gt;
&lt;li&gt;Remove the highest-leverage tool.&lt;/li&gt;
&lt;li&gt;Fill the gap with overtime and manual repetition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not engineering. It is deferred cost.&lt;/p&gt;

&lt;p&gt;This is exactly where my anger came from. Probabilistic output is supposed to be noisy. That is why deterministic quality gates exist.&lt;/p&gt;

&lt;p&gt;What I argued for was not ideology. It was process design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-commit hooks to block policy violations before review&lt;/li&gt;
&lt;li&gt;Procedure gates combining static analysis and tests before merge&lt;/li&gt;
&lt;li&gt;Formal verification checks for critical intent-to-implementation consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Banning tools can look like control, but in practice it often means giving up control. If you refuse to build systems and rely on labor intensity instead, quality degrades and people burn out at the same time.&lt;/p&gt;
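&lt;p&gt;A deterministic gate of this kind is small to build. Here is a minimal Python sketch (commands are placeholders; substitute your linter and test suite) of a fail-closed runner that a pre-commit hook or merge pipeline could invoke:&lt;/p&gt;

```python
# Sketch of a deterministic procedure gate: run every check, fail closed if
# any check fails. The check commands below are placeholders.
import subprocess
import sys

CHECKS = [
    [sys.executable, "-c", "print('lint ok')"],    # placeholder: static analysis
    [sys.executable, "-c", "print('tests ok')"],   # placeholder: test suite
]

def run_gate(checks) -> bool:
    """A commit or merge is allowed only when every check exits 0."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            return False  # fail closed on the first broken check
    return True

ok = run_gate(CHECKS)
print("gate passed" if ok else "gate failed")
```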

&lt;p&gt;I could no longer treat that as normal.&lt;/p&gt;

&lt;p&gt;My resignation was not an escape. It was a decision. I did not leave because AI is imperfect. I left because I could not align with a culture that refused to build deterministic safeguards around probabilistic intelligence.&lt;/p&gt;

&lt;p&gt;That decision is what led to Archright.&lt;/p&gt;

&lt;p&gt;I am not building just another app. I am rebuilding the production process itself:&lt;/p&gt;

&lt;p&gt;how probabilistic intelligence is disciplined into deterministic integrity, how thought trajectory survives implementation without leaking intent, and how this becomes a repeatable system instead of a heroic individual effort.&lt;/p&gt;

&lt;p&gt;So this is not just a resignation story. As an AI tsunami approaches, this is the first page of a journey to rebuild software engineering that has lost its direction.&lt;/p&gt;

</description>
      <category>devlog</category>
    </item>
    <item>
      <title>From Celery/Redis to Temporal: A Journey Toward Idempotency and Reliable Workflows</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:20:33 +0000</pubDate>
      <link>https://forem.com/wintrover/from-celeryredis-to-temporal-a-journey-toward-idempotency-and-reliable-workflows-k1i</link>
      <guid>https://forem.com/wintrover/from-celeryredis-to-temporal-a-journey-toward-idempotency-and-reliable-workflows-k1i</guid>
      <description>&lt;p&gt;When handling asynchronous tasks in distributed systems, the combination of Celery and Redis is often the go-to choice. I also chose Celery for the initial design of my KYC (Know Your Customer) orchestrator due to its familiarity. However, as the service grew in complexity, I hit a massive wall: guaranteeing &lt;strong&gt;idempotency&lt;/strong&gt; and managing &lt;strong&gt;complex states&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post, I want to share my technical journey of why I moved away from Celery to Temporal and how I ensured idempotency during that process.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Limitations of Celery/Redis: Why Was Change Necessary?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Difficulties in Idempotency Management
&lt;/h3&gt;

&lt;p&gt;While Celery is excellent for "fire and forget" tasks, retries triggered by network failures or worker crashes carry a high risk of duplicate execution. For face recognition tasks that consume significant GPU resources, duplicate execution was a critical problem in terms of both cost and performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fragmentation of State
&lt;/h3&gt;

&lt;p&gt;The KYC process follows this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User uploads an ID card image.&lt;/li&gt;
&lt;li&gt;User uploads a selfie video.&lt;/li&gt;
&lt;li&gt;Compare face similarity once both files exist.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp65ru8vmnc53elxiud11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp65ru8vmnc53elxiud11.png" alt="Celery Sequence Diagram" width="760" height="672"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a Celery environment, since I didn't know when images and videos would be uploaded, I needed complex logic to query the DB every time or store intermediate states in Redis. The logic to check "Are all files collected?" was scattered across multiple places, making maintenance difficult.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Introducing Temporal: A Paradigm Shift in Orchestration
&lt;/h2&gt;

&lt;p&gt;Temporal is not just a message queue; it's a &lt;strong&gt;stateful workflow&lt;/strong&gt; engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow Logic Must Be "Deterministic"
&lt;/h3&gt;

&lt;p&gt;Since Temporal workflow code is based on the premise of "Replay," it must always produce the same sequence of workflow API calls for the same input and history. Therefore, you should not directly perform "external-world-dependent operations" like network I/O, file I/O, system time (e.g., &lt;code&gt;DateTime.now&lt;/code&gt;), randomness, or threading within a workflow. These side effects should be pushed to activities, while the workflow focuses solely on orchestration.&lt;/p&gt;

&lt;p&gt;Official Docs: &lt;a href="https://docs.temporal.io/develop/python/core-application#workflow-logic-requirements" rel="noopener noreferrer"&gt;https://docs.temporal.io/develop/python/core-application#workflow-logic-requirements&lt;/a&gt;&lt;/p&gt;
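&lt;p&gt;To see why this rule exists, here is a pure-Python toy (not the Temporal SDK) that mimics replay: activity results are recorded into a history on the first run, and a replay consumes the recorded history instead of re-running side effects. If the workflow body called &lt;code&gt;datetime.now()&lt;/code&gt; or &lt;code&gt;random()&lt;/code&gt; directly, the replayed run could diverge from the recorded one:&lt;/p&gt;

```python
class ToyReplayer:
    """Toy event-history replayer (illustration only, not the Temporal SDK)."""

    def __init__(self, history=None):
        self.history = list(history) if history else []
        self.cursor = 0

    def execute_activity(self, fn):
        # First execution runs the activity and records its result;
        # a replay returns the recorded result without re-running the side effect.
        if self.cursor < len(self.history):
            result = self.history[self.cursor]
        else:
            result = fn()
            self.history.append(result)
        self.cursor += 1
        return result

def toy_workflow(ctx):
    # Deterministic orchestration: all side effects go through activities.
    a = ctx.execute_activity(lambda: 40)  # stands in for a real activity call
    b = ctx.execute_activity(lambda: 2)
    return a + b

first = ToyReplayer()
result = toy_workflow(first)
replayed = toy_workflow(ToyReplayer(history=first.history))
assert result == replayed  # replay reproduces the same outcome from history
```

&lt;p&gt;This is the property Temporal relies on: as long as the workflow body is deterministic, the engine can reconstruct its exact state from the event history after any crash.&lt;/p&gt;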

&lt;h3&gt;
  
  
  Workflow-Centric Design
&lt;/h3&gt;

&lt;p&gt;The first thing that changed after introducing Temporal was the visibility of business logic. &lt;code&gt;FaceSimilarityWorkflow&lt;/code&gt; now gracefully waits until files are ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b0453k5kqp5i1dnpi4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b0453k5kqp5i1dnpi4j.png" alt="Temporal Workflow Diagram" width="760" height="732"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Core logic of FaceSimilarityWorkflow
&lt;/span&gt;&lt;span class="nd"&gt;@workflow.run&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SimilarityData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SimilarityResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Wait up to 1 hour until both image and video are collected
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_files&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute GPU activity once all files are ready
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_activity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;check_face_similarity_activity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;activity_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;retry_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RetryPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maximum_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code uses &lt;code&gt;workflow.wait_condition&lt;/code&gt; to suspend the workflow until the condition is met without blocking the event loop. In Celery, this would have required complex polling or webhook logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Idempotency Strategy: Building a Double Defense
&lt;/h2&gt;

&lt;p&gt;Even with Temporal, idempotency at the activity level remains crucial. I established a double defense strategy as follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Temporal's Basic Guarantee
&lt;/h3&gt;

&lt;p&gt;Temporal records the progress of a workflow as event history. Therefore, even if a worker restarts, it resumes exactly from the last successful point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Explicit Checks within Activities
&lt;/h3&gt;

&lt;p&gt;Since Temporal activities follow an "at-least-once" execution model, an activity might be retried if a worker crashes after successfully performing it but before notifying the server. Thus, the official documentation strongly recommends making activities idempotent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97qvv4odbem80lut1gcr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97qvv4odbem80lut1gcr.png" alt="Activity Idempotency Diagram" width="760" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Official Docs: &lt;a href="https://docs.temporal.io/develop/python/error-handling#make-activities-idempotent" rel="noopener noreferrer"&gt;https://docs.temporal.io/develop/python/error-handling#make-activities-idempotent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, I use the following two together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For external system calls, pass an &lt;strong&gt;idempotency key&lt;/strong&gt; combined from the workflow execution and activity identifiers.&lt;/li&gt;
&lt;li&gt;Internally, use unique keys (or check for existing results) in the DB to prevent &lt;strong&gt;duplicate storage/processing&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@activity.defn&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_face_similarity_activity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SimilarityData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SimilarityResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workflow_run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activity_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;get_db_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FaceSimilarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FaceSimilarity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SimilarityResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Already processed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform actual GPU-intensive work...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Results: What Has Changed?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comparison Item&lt;/th&gt;
&lt;th&gt;Celery/Redis Based&lt;/th&gt;
&lt;th&gt;Temporal Based&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual storage in DB/Redis&lt;/td&gt;
&lt;td&gt;Automatically managed by engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry Strategy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual exponential backoff&lt;/td&gt;
&lt;td&gt;Declarative Retry Policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Must dig through logs&lt;/td&gt;
&lt;td&gt;Check history in Temporal UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Idempotency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very difficult to guarantee&lt;/td&gt;
&lt;td&gt;Structurally achievable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The transition from Celery to Temporal was not just about changing tools; it was about changing &lt;strong&gt;how I define business processes in code&lt;/strong&gt;. Especially in financial/authentication systems where idempotency is paramount, Temporal provided irreplaceable stability.&lt;/p&gt;

&lt;p&gt;If you are losing sleep over complex asynchronous logic and idempotency issues, I strongly recommend migrating to Temporal.&lt;/p&gt;

</description>
      <category>smbholdings</category>
    </item>
    <item>
      <title>Testing in the Age of AI Agents: How I Kept QA from Collapsing</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:20:26 +0000</pubDate>
      <link>https://forem.com/wintrover/testing-in-the-age-of-ai-agents-how-i-kept-qa-from-collapsing-2ld9</link>
      <guid>https://forem.com/wintrover/testing-in-the-age-of-ai-agents-how-i-kept-qa-from-collapsing-2ld9</guid>
      <description>&lt;p&gt;AI agents changed my development tempo overnight. I can ship more code in a day than I used to in a week, and that sounds great until the first time a tiny edge case takes down an entire flow.&lt;/p&gt;

&lt;p&gt;At that speed, QA becomes either a competitive advantage or a constant fire drill. I chose the first option, and I rebuilt my testing approach in &lt;code&gt;d:\Coding\Company\Ochestrator&lt;/code&gt; around a small set of test design techniques that scale with code volume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TDD&lt;/li&gt;
&lt;li&gt;EP-BVA (Equivalence Partitioning + Boundary Value Analysis)&lt;/li&gt;
&lt;li&gt;Pairwise (Combinatorial Testing)&lt;/li&gt;
&lt;li&gt;State Transition Testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3ehxris3spmjh41er9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3ehxris3spmjh41er9b.png" alt="Testing Toolbox Diagram" width="760" height="485"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Why I Needed “Test Design,” Not Just “More Tests”
&lt;/h2&gt;

&lt;p&gt;When code volume grows, the problem is not only “coverage.” The real problem is that the space of possible inputs and states grows faster than my time.&lt;/p&gt;

&lt;p&gt;So I stopped asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Did I write tests for this function?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I started asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Did I select test cases that actually represent the failure surface?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That mindset is what pushed me toward structured test design techniques.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. TDD: Design for Testability from Day One
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Principle:&lt;/strong&gt; TDD (Test-Driven Development) flips the traditional "write code, then test" workflow. It follows the &lt;strong&gt;Red-Green-Refactor&lt;/strong&gt; cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt;: Write a test for a new requirement and watch it fail. This confirms the test actually checks something and that the requirement isn't already met.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt;: Write the minimal amount of code to make the test pass. Avoid "over-engineering" at this stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactor&lt;/strong&gt;: Clean up the code while ensuring the tests stay green.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In Orchestrator:&lt;/strong&gt;&lt;br&gt;
Since AI agents can generate complex business logic rapidly, I used TDD to ensure that the logic was &lt;em&gt;testable&lt;/em&gt; by design. For example, when implementing the &lt;code&gt;RetryPolicy&lt;/code&gt; for our Temporal workflows, I started with the test cases for exponential backoff before writing a single line of the policy logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified TDD Example for Retry Logic
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_retry_interval_calculation&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExponentialRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 1st attempt: 1.0s
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="c1"&gt;# 2nd attempt: 2.0s
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
    &lt;span class="c1"&gt;# Capped at 10.0s
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forced me to separate the &lt;em&gt;calculation&lt;/em&gt; of delays from the &lt;em&gt;execution&lt;/em&gt; of the retry, making the system modular and robust.&lt;/p&gt;
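&lt;p&gt;For completeness, here is a hypothetical "Green" step that satisfies the test above, assuming the usual &lt;code&gt;base * 2^(attempt - 1)&lt;/code&gt; formula capped at &lt;code&gt;max_delay&lt;/code&gt; (the real policy logic has more to it, e.g. jitter):&lt;/p&gt;

```python
class ExponentialRetry:
    """Minimal delay calculator: base * 2**(attempt - 1), capped at max_delay."""

    def __init__(self, base_delay: float, max_delay: float):
        self.base_delay = base_delay
        self.max_delay = max_delay

    def get_delay(self, attempt: int) -> float:
        # attempt is 1-based: attempt 1 waits base_delay, attempt 2 waits 2x, ...
        return min(self.base_delay * 2 ** (attempt - 1), self.max_delay)

policy = ExponentialRetry(base_delay=1.0, max_delay=10.0)
assert policy.get_delay(attempt=1) == 1.0
assert policy.get_delay(attempt=2) == 2.0
assert policy.get_delay(attempt=10) == 10.0  # capped
```

&lt;p&gt;Because the calculation is a pure function, the test needs no mocks, sleeps, or network; that is exactly the testability TDD forced into the design.&lt;/p&gt;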




&lt;h2&gt;
  
  
  3. EP-BVA: Efficiency through Mathematical Selection
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Principle:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Equivalence Partitioning (EP)&lt;/strong&gt;: Instead of testing every possible value, you divide the input domain into groups (partitions) where the system is expected to behave identically. You only need to test one value from each group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundary Value Analysis (BVA)&lt;/strong&gt;: Bugs often hide at the "edges" of these partitions. BVA focuses on testing the exact boundaries, and values just inside and outside of them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In Orchestrator:&lt;/strong&gt;&lt;br&gt;
When handling user-uploaded files, we have strict size limits (e.g., 1MB to 10MB).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partitions&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Invalid (&amp;lt; 1MB)&lt;/li&gt;
&lt;li&gt;Valid (1MB - 10MB)&lt;/li&gt;
&lt;li&gt;Invalid (&amp;gt; 10MB)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BVA Points&lt;/strong&gt;: 0.99MB, 1.0MB, 1.01MB, 9.99MB, 10.0MB, 10.01MB.&lt;/li&gt;
&lt;/ul&gt;
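&lt;p&gt;Those six BVA points translate almost mechanically into a table-driven test. The validator below is a stand-in for the real upload check, not the production function:&lt;/p&gt;

```python
def is_valid_size(size_mb: float) -> bool:
    """Hypothetical stand-in for the upload validator (1MB to 10MB inclusive)."""
    return 1.0 <= size_mb <= 10.0

# One case per boundary: just outside, exactly on, and just inside each edge.
cases = [
    (0.99, False),   # below lower boundary
    (1.0, True),     # on lower boundary
    (1.01, True),    # just inside
    (9.99, True),    # just inside
    (10.0, True),    # on upper boundary
    (10.01, False),  # above upper boundary
]
for size, expected in cases:
    assert is_valid_size(size) is expected
```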

&lt;p&gt;A critical real-world example I applied was the &lt;strong&gt;72-byte limit of bcrypt&lt;/strong&gt;. Many developers don't realize that bcrypt silently ignores everything after the first 72 bytes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/backend/tests/test_auth_service.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_password_length_boundaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Boundary: 72 bytes
&lt;/span&gt;    &lt;span class="n"&gt;p72&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;72&lt;/span&gt;
    &lt;span class="n"&gt;h72&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_password_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p72&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Just above the boundary: 73 bytes
&lt;/span&gt;    &lt;span class="n"&gt;p73&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p72&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Bcrypt will treat p73 the same as p72 if only the first 72 bytes are used
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_password&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h72&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By focusing on these specific points, I reduced hundreds of potential test cases to just 6-10 highly effective ones.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Pairwise: Taming the Combinatorial Explosion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Principle:&lt;/strong&gt; Most bugs are caused by either a single input parameter or the interaction between &lt;em&gt;two&lt;/em&gt; parameters. &lt;strong&gt;Pairwise Testing&lt;/strong&gt; is a combinatorial method that ensures every possible pair of input parameters is tested at least once. This drastically reduces the number of test cases while maintaining high defect detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Orchestrator:&lt;/strong&gt;&lt;br&gt;
Our AI Inference engine has multiple configuration axes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution Provider&lt;/strong&gt;: &lt;code&gt;[CUDA, CPU, OpenVINO]&lt;/code&gt; (3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Size&lt;/strong&gt;: &lt;code&gt;[Small, Medium, Large]&lt;/code&gt; (3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantization&lt;/strong&gt;: &lt;code&gt;[INT8, FP16, FP32]&lt;/code&gt; (3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async Mode&lt;/strong&gt;: &lt;code&gt;[Enabled, Disabled]&lt;/code&gt; (2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total combinations: $3 \times 3 \times 3 \times 2 = 54$ cases.&lt;br&gt;
Using Pairwise, we can cover all interactions between any two settings in roughly &lt;strong&gt;12-15 cases&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using allpairspy to generate the matrix
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;allpairspy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AllPairs&lt;/span&gt;

&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenVINO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INT8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FP16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FP32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;combo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AllPairs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test Case &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;combo&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows us to maintain high confidence in our hardware compatibility matrix without running the full 54-case suite on every PR.&lt;/p&gt;
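&lt;p&gt;If you want to see where the reduction comes from without a library, the core idea fits in a few lines: greedily pick the case that covers the most still-uncovered pairs. (This is an illustration only; dedicated tools like allpairspy produce tighter suites.)&lt;/p&gt;

```python
from itertools import combinations, product

# The four configuration axes from above.
parameters = [
    ["CUDA", "CPU", "OpenVINO"],
    ["Small", "Medium", "Large"],
    ["INT8", "FP16", "FP32"],
    ["Enabled", "Disabled"],
]

def pairs_of(combo):
    """All (axis-index, value) pairs a single test case covers."""
    return set(combinations(enumerate(combo), 2))

all_combos = list(product(*parameters))  # 54 exhaustive cases
uncovered = set().union(*(pairs_of(c) for c in all_combos))

suite = []
while uncovered:
    # Greedily pick the case covering the most still-uncovered pairs.
    best = max(all_combos, key=lambda c: len(pairs_of(c) & uncovered))
    suite.append(best)
    uncovered -= pairs_of(best)

# Every pair is covered by far fewer than 54 cases.
print(f"{len(all_combos)} exhaustive cases reduced to {len(suite)} pairwise cases")
```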




&lt;h2&gt;
  
  
  5. State Transition Testing: Mapping the Life of a Process
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Principle:&lt;/strong&gt; This technique is used when the system's behavior depends on its current state and the events that occur. We map out a &lt;strong&gt;State Transition Diagram&lt;/strong&gt; and ensure that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All valid transitions are possible.&lt;/li&gt;
&lt;li&gt;All invalid transitions are properly blocked (Negative Testing).&lt;/li&gt;
&lt;li&gt;The system ends in the correct final state.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In Orchestrator:&lt;/strong&gt;&lt;br&gt;
The KYC (Know Your Customer) verification workflow is a complex state machine. A user's document moves through:&lt;br&gt;
&lt;code&gt;PENDING&lt;/code&gt; $\rightarrow$ &lt;code&gt;UPLOADING&lt;/code&gt; $\rightarrow$ &lt;code&gt;PROCESSING&lt;/code&gt; $\rightarrow$ &lt;code&gt;VERIFIED&lt;/code&gt; or &lt;code&gt;REJECTED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I implemented tests to ensure a &lt;code&gt;REJECTED&lt;/code&gt; document cannot suddenly jump to &lt;code&gt;VERIFIED&lt;/code&gt; without going through &lt;code&gt;PROCESSING&lt;/code&gt; again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/backend/tests/test_integration_kyc_workflow.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_invalid_state_transitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow_engine&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;workflow_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ImageStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REJECTED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This should be blocked by the business logic
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IllegalStateError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;workflow_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ImageStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VERIFIED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
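&lt;p&gt;A minimal sketch of the state machine such a test exercises. The state names mirror the workflow above, but the engine class and its transition table are hypothetical reconstructions, not the project's actual implementation:&lt;/p&gt;

```python
# Hypothetical sketch of the KYC state machine; state names mirror the
# article, the engine itself is a reconstruction for illustration.
from enum import Enum, auto

class ImageStatus(Enum):
    PENDING = auto()
    UPLOADING = auto()
    PROCESSING = auto()
    VERIFIED = auto()
    REJECTED = auto()

class IllegalStateError(Exception):
    pass

# allowed transitions; everything absent from this table is invalid
ALLOWED = {
    ImageStatus.PENDING: {ImageStatus.UPLOADING},
    ImageStatus.UPLOADING: {ImageStatus.PROCESSING},
    ImageStatus.PROCESSING: {ImageStatus.VERIFIED, ImageStatus.REJECTED},
    ImageStatus.REJECTED: {ImageStatus.PROCESSING},  # re-verification only
    ImageStatus.VERIFIED: set(),                     # terminal state
}

class WorkflowEngine:
    def __init__(self):
        self.state = ImageStatus.PENDING

    def set_state(self, state):
        self.state = state

    def transition_to(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise IllegalStateError(f"{self.state.name} to {new_state.name}")
        self.state = new_state

# negative sweep: every transition missing from ALLOWED must raise
engine = WorkflowEngine()
blocked = 0
for src in ImageStatus:
    for dst in ImageStatus:
        if dst not in ALLOWED[src]:
            engine.set_state(src)
            try:
                engine.transition_to(dst)
            except IllegalStateError:
                blocked = blocked + 1
print(blocked, "invalid transitions blocked")
```

&lt;p&gt;Driving the negative cases from the transition table itself means new states automatically extend the invalid-transition sweep.&lt;/p&gt;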



&lt;p&gt;This is crucial for AI agents that might try to "short-circuit" logic. By strictly testing the state machine, we ensure the integrity of the entire business process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the AI-agent era, code is cheap. Trust is not.&lt;/p&gt;

&lt;p&gt;What kept my QA from collapsing was not writing more tests, but adopting test design techniques that scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TDD for fast feedback and safer refactors&lt;/li&gt;
&lt;li&gt;EP-BVA to systematize edge cases&lt;/li&gt;
&lt;li&gt;Pairwise to tame combinatorial growth&lt;/li&gt;
&lt;li&gt;State Transition Testing to validate real workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the testing toolbox I expect to keep using as my code volume keeps accelerating.&lt;/p&gt;

</description>
      <category>smbholdings</category>
    </item>
    <item>
      <title>The Pitfalls of Test Coverage: Introducing Mutation Testing with Stryker and Cosmic Ray</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:20:19 +0000</pubDate>
      <link>https://forem.com/wintrover/the-pitfalls-of-test-coverage-introducing-mutation-testing-with-stryker-and-cosmic-ray-75</link>
      <guid>https://forem.com/wintrover/the-pitfalls-of-test-coverage-introducing-mutation-testing-with-stryker-and-cosmic-ray-75</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Overcome the limitations of code-coverage metrics by introducing mutation testing, which verifies whether the tests actually catch errors in business logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope&lt;/strong&gt;: Core modules of the enterprise orchestrator project (&lt;code&gt;Orchestrator&lt;/code&gt;) in both Frontend (TypeScript) and Backend (Python).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected Results&lt;/strong&gt;: Improve code stability and test reliability by tracking a 'Mutation Score' in addition to simple line coverage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We often believe that high test coverage means safe code. However, it's difficult to answer the question: &lt;strong&gt;"Who tests the tests?"&lt;/strong&gt; Tests that simply execute code without proper assertions still contribute to coverage metrics. To solve this 'coverage trap', we introduced mutation testing.&lt;/p&gt;
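&lt;p&gt;A contrived illustration of the trap (the function and tests are hypothetical, not from the project): the first test executes every line of &lt;code&gt;discount&lt;/code&gt; and so reports 100% line coverage, yet it asserts nothing, so any mutation of the logic would still pass.&lt;/p&gt;

```python
# Hypothetical example of the 'coverage trap': execution without assertions.
def discount(price, is_member):
    if is_member:
        return price * 0.9
    return price

def test_discount_runs():
    # counts toward line coverage, but a mutant returning price * 0.5
    # (or dropping the branch entirely) would still pass this test
    discount(100, True)
    discount(100, False)

def test_discount_asserts():
    # this version actually pins the behavior and kills such mutants
    assert round(discount(100, True), 6) == 90.0
    assert discount(100, False) == 100
```

&lt;p&gt;Mutation testing distinguishes these two tests automatically: the first kills no mutants, the second does.&lt;/p&gt;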

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wlbq4trux8zs989bc0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wlbq4trux8zs989bc0m.png" alt="Mutation Testing Flow" width="760" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. TypeScript Environment: Introducing Stryker Mutator
&lt;/h3&gt;

&lt;p&gt;For the TypeScript environment, including frontend and common utilities, we chose &lt;a href="https://stryker-mutator.io/" rel="noopener noreferrer"&gt;Stryker&lt;/a&gt;. It integrates well with Vitest and is easy to configure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tech Stack&lt;/strong&gt;: TypeScript, Vitest, Stryker Mutator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Configuration (&lt;code&gt;stryker.config.json&lt;/code&gt;)&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"testRunner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vitest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reporters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"html"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clear-text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"progress"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"concurrency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"incremental"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mutate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"src/utils/**/*.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"src/services/**/*.ts"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We enabled the &lt;code&gt;incremental&lt;/code&gt; option so that subsequent runs re-test only the mutants affected by changed files.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Python Environment: Introducing Cosmic Ray
&lt;/h3&gt;

&lt;p&gt;For the backend environment, we introduced &lt;a href="https://github.com/sixty-north/cosmic-ray" rel="noopener noreferrer"&gt;Cosmic Ray&lt;/a&gt;. It generates powerful mutations by manipulating the AST (Abstract Syntax Tree) using Python's dynamic nature.&lt;/p&gt;
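&lt;p&gt;To see what "manipulating the AST" means in practice, here is a toy sketch of the idea using the standard &lt;code&gt;ast&lt;/code&gt; module. Cosmic Ray's real operator set is far richer; this only flips one arithmetic operator, one of the classic mutation operators:&lt;/p&gt;

```python
# Toy sketch of AST-level mutation (cosmic-ray's real operators are richer):
# parse a function, flip '+' to '-', and re-compile the mutant.
import ast

SOURCE = """
def total(price, fee):
    return price + fee
"""

class FlipAdd(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()  # the mutation: addition becomes subtraction
        return node

tree = ast.parse(SOURCE)
mutant_tree = FlipAdd().visit(tree)
ast.fix_missing_locations(mutant_tree)

ns = {}
exec(compile(mutant_tree, "mutant", "exec"), ns)
mutant_total = ns["total"]

# the original total(10, 5) returns 15; the mutant returns 5. A test
# asserting the value 'kills' this mutant; a test merely calling it does not.
print(mutant_total(10, 5))
```

&lt;p&gt;A mutant "survives" when the full test suite still passes against the mutated module, which is exactly the signal that an assertion is missing.&lt;/p&gt;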

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tech Stack&lt;/strong&gt;: Python, Pytest, Cosmic Ray, Docker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Architecture&lt;/strong&gt;: Since mutation testing consumes significant computational resources, we configured it to run in parallel across multiple workers using Docker.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="c1"&gt;# Partial docker-compose.test.yaml&lt;/span&gt;
  &lt;span class="na"&gt;cosmic-worker-1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv run cosmic-ray worker cosmic.sqlite&lt;/span&gt;
  &lt;span class="na"&gt;cosmic-runner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;cosmic-worker-1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cosmic-worker-2&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;uv run cosmic-ray init cosmic-ray.toml cosmic.sqlite&lt;/span&gt;
      &lt;span class="s"&gt;uv run cosmic-ray exec cosmic-ray.toml cosmic.sqlite&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Debugging/Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-world Case: Survived Mutants in &lt;code&gt;VideoSplitter.ts&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The most interesting case was &lt;code&gt;VideoSplitter.ts&lt;/code&gt;, which handles video splitting. The file had over 95% line coverage, yet Stryker revealed surprising results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem Statement&lt;/strong&gt;:
A large number of mutants survived in the logic that checks available memory.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="c1"&gt;// Original Code&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;availableMemory&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;requiredMemory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Insufficient memory.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even when Stryker changed this code to &lt;code&gt;if (false)&lt;/code&gt; or &lt;code&gt;if (availableMemory &amp;lt;= requiredMemory)&lt;/code&gt;, all existing tests &lt;strong&gt;PASSED&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root Cause Analysis&lt;/strong&gt;:&lt;br&gt;
Existing tests focused only on "whether an error occurs," missing boundary value tests for exactly which conditions trigger the error. In other words, coverage was high, but the actual logic wasn't being thoroughly verified.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;:&lt;br&gt;
To 'kill' the surviving mutants, we reinforced the test cases with boundary value analysis.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Boundary value verification for memory&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Simulate situations where memory is exactly equal to or slightly less than requiredMemory&lt;/span&gt;
    &lt;span class="c1"&gt;// ... reinforced test code ...&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
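&lt;p&gt;The reinforced test code above is elided in the article. As a hedged sketch, here is what boundary-value cases that kill both surviving mutants could look like, ported to Python (the &lt;code&gt;check_memory&lt;/code&gt; guard is a hypothetical translation of the TypeScript original):&lt;/p&gt;

```python
# Hypothetical Python port of the memory guard, plus the two boundary
# cases that kill the surviving mutants described above.
import operator

def check_memory(available, required):
    # operator.lt(a, b) is the same less-than comparison as the original
    if operator.lt(available, required):
        raise MemoryError("Insufficient memory.")

# Case 1: just below the boundary must raise.
# This kills the 'if (false)' mutant, which never raises.
try:
    check_memory(1023, 1024)
    raised = False
except MemoryError:
    raised = True
assert raised

# Case 2: exactly at the boundary must NOT raise.
# This kills the less-than-or-equal mutant, which wrongly rejects equality.
check_memory(1024, 1024)
```

&lt;p&gt;The point is that each surviving mutant names the exact input the suite was missing; boundary value analysis turns that hint into a test.&lt;/p&gt;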



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Achievements&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Discovered and removed 12 &lt;strong&gt;Survived Mutants&lt;/strong&gt; in core utility modules.&lt;/li&gt;
&lt;li&gt;Elevated test code from simply 'executing' code to truly 'verifying' it.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Key Metrics&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mutation Score&lt;/strong&gt;: Improved from an initial 62% to 88%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: Prevented potential regression bugs by running &lt;code&gt;test:mutation&lt;/code&gt; scripts before deployment.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;User Feedback&lt;/strong&gt;: Positive reactions from team members: "I can now refactor with confidence, trusting our tests."&lt;/li&gt;

&lt;/ul&gt;
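&lt;p&gt;For reference, the mutation score is simply the share of generated mutants the suite detects. The raw counts below are illustrative only; the article reports scores, not the project's actual mutant totals:&lt;/p&gt;

```python
# Mutation score: killed mutants over all evaluated mutants.
# Simplified: real tools also track timeouts and no-coverage mutants.
def mutation_score(killed, survived):
    return killed / (killed + survived)

# illustrative counts only, chosen to reproduce an 88% score
print(round(100 * mutation_score(killed=88, survived=12)))
```
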

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coverage is just the beginning&lt;/strong&gt;: Line coverage only tells you 'what is not tested,' not the 'quality of what is tested.'&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutation testing is expensive but worth it&lt;/strong&gt;: Although it takes time (up to tens of minutes for full execution), it's essential for core business logic or complex utilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Adoption&lt;/strong&gt;: Rather than applying it to all code, it's important to build success stories by starting with core infrastructure code like &lt;code&gt;VideoSplitter&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;





</description>
      <category>smbholdings</category>
    </item>
  </channel>
</rss>
