<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: wintrover</title>
    <description>The latest articles on Forem by wintrover (@wintrover).</description>
    <link>https://forem.com/wintrover</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1896056%2Fa00c0da2-64c2-47ab-b5d0-2256c4ba0e75.png</url>
      <title>Forem: wintrover</title>
      <link>https://forem.com/wintrover</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wintrover"/>
    <language>en</language>
    <item>
      <title>"42% Silence": What It Means to Control Failure in AI Code Verification</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:34:24 +0000</pubDate>
      <link>https://forem.com/wintrover/42-silence-what-it-means-to-control-failure-in-ai-code-verification-1nip</link>
      <guid>https://forem.com/wintrover/42-silence-what-it-means-to-control-failure-in-ai-code-verification-1nip</guid>
      <description>&lt;p&gt;We live in an era where AI writes code.&lt;/p&gt;

&lt;p&gt;The problem is no longer productivity.&lt;/p&gt;

&lt;p&gt;It's trust.&lt;/p&gt;

&lt;p&gt;Code gets generated fast.&lt;br&gt;
Tests pass.&lt;/p&gt;

&lt;p&gt;But there's no guarantee that this code is actually correct.&lt;/p&gt;



&lt;h2&gt;
  
  
  1) What We're Building: Axiom
&lt;/h2&gt;

&lt;p&gt;Axiom is not a mere "automatic proof engine."&lt;/p&gt;

&lt;p&gt;It's a system that makes AI-generated code &lt;em&gt;trustworthy&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;To achieve this, we combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SMT Solver (Z3)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Abstraction (CEGAR)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summary-based Decomposition&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Formal Verification (Lean 4)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;into a single orchestration.&lt;/p&gt;



&lt;h2&gt;
  
  
  2) The Incident: "42% Silence"
&lt;/h2&gt;

&lt;p&gt;When running &lt;code&gt;ax prove&lt;/code&gt;, the progress would stall precisely at 42%.&lt;/p&gt;

&lt;p&gt;It looked like a simple UI glitch.&lt;/p&gt;

&lt;p&gt;But three possibilities existed simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actual deadlock&lt;/li&gt;
&lt;li&gt;Solver endlessly computing&lt;/li&gt;
&lt;li&gt;Some tasks finished but UI not showing them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words,&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system was in a state where you couldn't tell if it was dead or thinking.&lt;/strong&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  3) Kimi's Diagnosis: "It's Actually Stuck"
&lt;/h2&gt;

&lt;p&gt;Investigation revealed an actual bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause 1 — Sequential Result Waiting&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;^&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;  &lt;span class="c"&gt;# blocks in spawn order&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ If one hangs, everything stops&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause 2 — Blocking I/O&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="n"&gt;readLine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c"&gt;# can wait forever on Z3 pipe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Deadlock possible regardless of solver state&lt;/p&gt;



&lt;h2&gt;
  
  
  4) First Fix: Make the System Non-Blocking
&lt;/h2&gt;

&lt;p&gt;We fixed two things.&lt;/p&gt;

&lt;p&gt;✔ &lt;strong&gt;Out-of-order processing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Immediately process completed tasks&lt;/p&gt;

&lt;p&gt;✔ &lt;strong&gt;Non-blocking I/O&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="n"&gt;readData&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c"&gt;# chunk-based processing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Detect solver termination/abnormal state immediately&lt;/p&gt;

&lt;p&gt;This removed the "true freeze."&lt;/p&gt;
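
&lt;p&gt;For reference, both fixes can be sketched as one polling loop. This is a hedged sketch: &lt;code&gt;check&lt;/code&gt; and the task names are illustrative, not Axiom's actual API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;import std/[threadpool, os]

proc check(name: string): string =
  "VERIFIED " &amp; name          # stand-in for one solver run

var flows = @[spawn check("expr.nim::add"), spawn check("results.nim::isErr")]
var collected = newSeq[bool](flows.len)
var remaining = flows.len

while remaining &gt; 0:
  for i in 0 ..&lt; flows.len:
    # poll instead of blocking on `^task` in spawn order
    if not collected[i] and flows[i].isReady:
      echo ^flows[i]           # safe: the result is already available
      collected[i] = true
      dec remaining
  sleep(10)                    # yield instead of busy-waiting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;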



&lt;h2&gt;
  
  
  5) But a Bigger Problem Remained
&lt;/h2&gt;

&lt;p&gt;After fixing the bug, the same question lingered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What if the solver never finds an answer?"&lt;/strong&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  6) The Technical Debate: "Bypass or Siege?"
&lt;/h2&gt;

&lt;p&gt;Two perspectives collided here.&lt;/p&gt;

&lt;p&gt;🏛️ &lt;strong&gt;Gemini — Engineering Siege Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We're not finding answers, we're collecting evidence"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Simplify problems with CEGAR&lt;/li&gt;
&lt;li&gt;Break them down with Summary&lt;/li&gt;
&lt;li&gt;Finalize with Lean 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🏛️ &lt;strong&gt;Kimi — Limit-Awareness Design&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"All these tools fail"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;SMT solving is undecidable in general&lt;/li&gt;
&lt;li&gt;CEGAR can also stall in loops&lt;/li&gt;
&lt;li&gt;Lean automation is ultimately a search problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is no perfect solution.&lt;/strong&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  7) Our Conclusion
&lt;/h2&gt;

&lt;p&gt;The key insight from this debate:&lt;/p&gt;

&lt;p&gt;❌ "A system that always proves" is impossible&lt;br&gt;
✅ "A system that controls failure" is possible&lt;/p&gt;



&lt;h2&gt;
  
  
  8) Axiom's Design Principles
&lt;/h2&gt;

&lt;p&gt;Axiom is not a "proof engine" but:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"A system that structures failure"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🏛️ &lt;strong&gt;L1/L2 — Stop the failure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeout&lt;/li&gt;
&lt;li&gt;abstraction (CEGAR)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Block infinite computation&lt;/p&gt;

&lt;p&gt;🏛️ &lt;strong&gt;L3 — Break down the failure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decomposition&lt;/li&gt;
&lt;li&gt;summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Whole failure → Partial success&lt;/p&gt;

&lt;p&gt;🏛️ &lt;strong&gt;L4 — Escalate the failure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lean 4&lt;/li&gt;
&lt;li&gt;Human invariant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Automation → Collaboration&lt;/p&gt;



&lt;h2&gt;
  
  
  9) Another Problem Surfaced
&lt;/h2&gt;

&lt;p&gt;The bug was gone, but a new problem appeared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It still "looked stuck."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't just a UX problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Issue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a verification system, "invisible progress" equals trust collapse.&lt;/p&gt;



&lt;h2&gt;
  
  
  10) Solution: Heartbeat Patch
&lt;/h2&gt;

&lt;p&gt;We changed our approach.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Instead of "exact progress percentage,"&lt;br&gt;
show "evidence of being alive"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[████████░░] 42%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[████████░░] 42%

[VERIFIED] expr.nim::add
[VERIFIED] results.nim::isErr
[TIMEOUT] main.nim::getStr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The meaning of this change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"It's not stuck — it's solving a hard problem."&lt;/strong&gt;&lt;/p&gt;



&lt;h2&gt;
  
  
  11) Additional Improvement: Evidence-Based Hints
&lt;/h2&gt;

&lt;p&gt;Z3 says nothing when it times out.&lt;/p&gt;

&lt;p&gt;So we worked around it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Structural complexity analysis (Topology)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lean unsolved goal extraction&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;We can't directly say "why it failed,"&lt;br&gt;
but we can show "where the problem is."&lt;/p&gt;



&lt;h2&gt;
  
  
  12) Lessons Learned
&lt;/h2&gt;

&lt;p&gt;The incident clarified two things.&lt;/p&gt;

&lt;p&gt;✔ &lt;strong&gt;Nim threadpool limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FlowVar cancellation impossible&lt;/li&gt;
&lt;li&gt;Polling-based design required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✔ &lt;strong&gt;Z3's black-box nature&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeout = unknown&lt;/li&gt;
&lt;li&gt;Root cause tracking impossible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Contextual diagnosis required&lt;/p&gt;



&lt;h2&gt;
  
  
  13) Conclusion
&lt;/h2&gt;

&lt;p&gt;The "42% incident" wasn't just a simple bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It was the surfacing of structural limitations that every AI code verification system must face.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We made one choice here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abandon perfect proof&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Choose controllable failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-sentence definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In a verification engine, what matters isn't the correct answer,&lt;br&gt;
but how failure manifests and is controlled.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Axiom no longer freezes at 42%.&lt;/p&gt;

&lt;p&gt;And more importantly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even when it stops,&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;it tells you why it stopped and what to do next.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-line summary:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Axiom isn't an engine for proving code,&lt;br&gt;
it's a system for generating trust.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>axiom</category>
      <category>ai</category>
      <category>code</category>
      <category>verification</category>
    </item>
    <item>
      <title>Inside Axiom’s Verification Kernel: BMC, UAP, Lean Replay, and the Proof Vault</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:48:39 +0000</pubDate>
      <link>https://forem.com/wintrover/inside-axioms-verification-kernel-bmc-uap-lean-replay-and-the-proof-vault-50m3</link>
      <guid>https://forem.com/wintrover/inside-axioms-verification-kernel-bmc-uap-lean-replay-and-the-proof-vault-50m3</guid>
      <description>&lt;p&gt;When people hear “formal verification”, they imagine a single solver run that prints &lt;code&gt;Verified&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That’s not what Axiom is optimizing for.&lt;/p&gt;

&lt;p&gt;Axiom is built around one requirement: &lt;strong&gt;if you say something is verified, you must be able to reproduce that claim deterministically&lt;/strong&gt;—including the exact query shape, the exact counterexample (if SAT), and the exact Lean replay (if promoted to safe).&lt;/p&gt;



&lt;h2&gt;
  
  
  1) BMC Core: Bounded Models, Not Hand-Wavy Guarantees
&lt;/h2&gt;

&lt;p&gt;The BMC core is the bounded-model layer of the engine: it takes a transition system, unrolls it to a fixed bound, and asks the only question that matters inside that bound:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Does there exist a counterexample within N steps?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Axiom’s reference race model, that bound is explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt;
  &lt;span class="n"&gt;bmcStepBound&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That bound is not a random default—it’s the contract that defines what “safe” even means in this mode. The semantics are also strict: a race is promoted only when the final state falls outside the set of sequential outcomes computed from the same &lt;code&gt;TransitionRule&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The important architectural decision is that the &lt;em&gt;rule&lt;/em&gt; and the &lt;em&gt;safety property&lt;/em&gt; are not hardcoded into the solver path. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External behavior is injected as a &lt;code&gt;TransitionRule(op, delta)&lt;/code&gt; model.&lt;/li&gt;
&lt;li&gt;The single source of truth (SSOT) for the safety property is &lt;code&gt;TransitionRule.invariant&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Encoding is performed as SMT-LIB2 (QF_LIA) and executed through Z3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also where determinism stops being “a preference” and becomes “mechanically enforced”.&lt;/p&gt;

&lt;p&gt;The Z3 query text is stamped with fixed-seed and single-thread constraints, and every invariant is asserted as a named clause so UNSAT cores are stable and actionable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight common_lisp"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;set-option&lt;/span&gt; &lt;span class="ss"&gt;:fixed_seed&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;set-option&lt;/span&gt; &lt;span class="ss"&gt;:smt.threads&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;!&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nv"&gt;size&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ss"&gt;:named&lt;/span&gt; &lt;span class="nv"&gt;inv_bounded_growth.max_size&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;get-unsat-core&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In other words: the BMC core is not just “Z3 integration”. It’s a deterministic bounded-model execution contract where solver outputs can be replayed and audited.&lt;/p&gt;
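
&lt;p&gt;As a rough sketch of that contract (the proc and parameter names here are invented for illustration, not Axiom's real encoder), the deterministic header and named clauses can be emitted like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;proc emitQuery(invariants: seq[(string, string)]): string =
  # fixed seed + single thread make every solver run replayable
  result = "(set-option :fixed_seed 17)\n"
  result.add "(set-option :smt.threads 1)\n"
  for (name, expr) in invariants:
    # naming each clause keeps UNSAT cores stable and actionable
    result.add "(assert (! " &amp; expr &amp; " :named " &amp; name &amp; "))\n"
  result.add "(check-sat)\n(get-unsat-core)\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;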



&lt;h2&gt;
  
  
  2) UAP Pipeline: Universal Automated Proof as an Orchestrated Repair Loop
&lt;/h2&gt;

&lt;p&gt;BMC gives you a controlled counterexample search. But the real world is not a single bounded query; it’s a sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract a model.&lt;/li&gt;
&lt;li&gt;Find a counterexample.&lt;/li&gt;
&lt;li&gt;Repair the implementation and/or strengthen invariants.&lt;/li&gt;
&lt;li&gt;Re-prove.&lt;/li&gt;
&lt;li&gt;Preserve only the minimum reusable evidence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That sequence is what Axiom calls UAP: the &lt;strong&gt;Universal Automated Proof&lt;/strong&gt; pipeline.&lt;/p&gt;

&lt;p&gt;UAP is not “just” automation. It has a strict output contract: one JSON result that includes translation, scans, repair rounds, proof artifacts, and promotion snapshots.&lt;/p&gt;

&lt;p&gt;The core orchestrator is deliberately structured as a set of roles (Detective / Legislator / SWAT / Judge), because each stage produces different kinds of artifacts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AST + Source
  → UniversalTranslation (AST → IL/SMT mapping)
  → initialScan (counterexample search + trace grounds)
  → legislate (baseline invariant draft)
  → executeRaid (repair + PoC + BDD generation hooks)
  → certify (Lean replay + invariant finalization)
  → investigate (final scan + verdict normalization)
  → writeProofArtifact + writePromotionSnapshot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two constraints matter here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Counterexample not found” is explicitly not equal to “safe”.&lt;/li&gt;
&lt;li&gt;Safe promotion is prohibited unless Lean replay succeeds, even if SMT returns UNSAT.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So UAP is a self-healing pipeline, but it’s also an anti-vacuity pipeline: it refuses to treat solver silence as proof.&lt;/p&gt;



&lt;h2&gt;
  
  
  3) Lean 4 Reproducibility: Z3 Is the Oracle; Lean Is the Kernel
&lt;/h2&gt;

&lt;p&gt;If SMT is the “search engine”, Lean 4 is the “court of law”.&lt;/p&gt;

&lt;p&gt;The engine’s safety promotion rules explicitly treat Z3 UNSAT as a hint, not a conclusion. Final safe promotion requires a Lean-kernel-verifiable result (proof term or tactic replay).&lt;/p&gt;

&lt;p&gt;This is why Axiom’s Lean side has hard constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Semantics are not allowed to hide inside axioms like &lt;code&gt;axiom transition&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Transition systems must be reconstructed through &lt;code&gt;def nextState&lt;/code&gt; and &lt;code&gt;inductive Step&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Generated specs must be replayable under a pinned toolchain and stable tactic strategy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Operationally, Lean replay is not a one-shot script. It’s a dynamic proof search loop driven by compiler feedback through Lean’s &lt;code&gt;--server&lt;/code&gt; JSON-RPC path (LSP-style). The system treats “proof search” as an iterative, deterministic process: observe goals, propose tactics, replay, repeat.&lt;/p&gt;

&lt;p&gt;The point is not that Lean always succeeds.&lt;/p&gt;

&lt;p&gt;The point is that when it does succeed, it is &lt;strong&gt;reproducible&lt;/strong&gt;—and when it fails, the failure itself is reproducible and promotable as actionable constraints.&lt;/p&gt;



&lt;h2&gt;
  
  
  4) Proof Vault: Fingerprint-Keyed Evidence Compression and Reuse
&lt;/h2&gt;

&lt;p&gt;Most systems keep “proof results” as human-facing logs.&lt;/p&gt;

&lt;p&gt;Axiom treats proofs as assets.&lt;/p&gt;

&lt;p&gt;The Proof Vault is where those assets are stored under stable keys so they can be reused instead of re-derived. Two kinds of vaulting matter:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Promoted invariants&lt;/strong&gt; (UAP-level): the engine preserves the latest &lt;code&gt;activeInvariant&lt;/code&gt; and &lt;code&gt;refinedInvariant&lt;/code&gt; as snapshots and auto-reinjects them into later runs for the same &lt;code&gt;modulePath/functionName&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;2) &lt;strong&gt;Lean replay tactics&lt;/strong&gt; (kernel-level): successful tactics are stored keyed by theorem name plus a complexity fingerprint. On the next run, they become a warm-start strategy—preferentially reusing the path with fewer attempts and shorter tactic sequences.&lt;/p&gt;

&lt;p&gt;That “fingerprint-first” design is what turns the vault into a compression mechanism:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You discard bulky intermediate artifacts (full ASTs, large traces) after success.&lt;/li&gt;
&lt;li&gt;You keep only minimal, fingerprint-addressable evidence that can deterministically recreate the proof attempt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also why fingerprints are treated as first-class inputs across the pipeline (source checksum, SMT fingerprint, invariant fingerprint). If the fingerprint changes, Axiom does not pretend the old proof still applies—it demotes and schedules re-proof precisely where the change occurred.&lt;/p&gt;
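
&lt;p&gt;A minimal sketch of the warm-start lookup (the types and key layout are hypothetical, chosen only to show the fingerprint-first idea):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;import std/tables

type TacticEntry = object
  tactics: seq[string]    # shortest successful tactic sequence
  attempts: int

# keyed by theorem name plus complexity fingerprint
var vault: Table[string, TacticEntry]

proc warmStart(theorem, fingerprint: string): seq[string] =
  # reuse only while the fingerprint matches; a changed
  # fingerprint yields an empty plan, i.e. a fresh re-proof
  vault.getOrDefault(theorem &amp; ":" &amp; fingerprint).tactics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;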



&lt;h2&gt;
  
  
  Closing: A Kernel, Not a Collection of Tools
&lt;/h2&gt;

&lt;p&gt;BMC, UAP, Lean replay, and the Proof Vault are not separate features.&lt;/p&gt;

&lt;p&gt;They are one kernel with one invariant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If Axiom says “verified”, it must be replayable as the same query, the same evidence, and the same proof—deterministically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the only definition of verification that survives contact with AI-generated code.&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>formal</category>
      <category>verification</category>
      <category>bmc</category>
    </item>
    <item>
      <title>Nim vs Rust: Language is a Matter of 'Coherence', Not 'Performance'</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Fri, 17 Apr 2026 11:59:55 +0000</pubDate>
      <link>https://forem.com/wintrover/nim-vs-rust-language-is-a-matter-of-coherence-not-performance-45kc</link>
      <guid>https://forem.com/wintrover/nim-vs-rust-language-is-a-matter-of-coherence-not-performance-45kc</guid>
      <description>&lt;p&gt;I once spent a long time debating between two languages as the foundation for a formal verification engine.&lt;br&gt;
One was Rust, bringing its borrow checker and lifetimes. The other was Nim, leading with &lt;code&gt;func&lt;/code&gt; and &lt;code&gt;distinct&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When people compare these two, the conversation usually boils down to "Rust's memory safety vs. Nim's C-compatibility and productivity."&lt;br&gt;
But when you actually try to build something like the Axiom proof engine from the ground up, the real wall you hit is entirely different.&lt;/p&gt;

&lt;p&gt;Especially in an environment where AI generates code, &lt;strong&gt;the most terrifying bottleneck isn't a 'compile error'—it's 'semantic opacity'.&lt;/strong&gt;&lt;br&gt;
LLMs can easily spit out code that flawlessly passes tests while subtly twisting the author's original intent into something unrecognizable.&lt;/p&gt;



&lt;h2&gt;
  
  
  There is No Perfect Symmetry: The Trade-off Between Memory and Intent
&lt;/h2&gt;

&lt;p&gt;"Rust prevents memory corruption, while Nim prevents intent corruption."&lt;br&gt;
Saying it like that sounds incredibly clean and punchy, but real-world engineering is rarely that neat.&lt;/p&gt;

&lt;p&gt;Of course, Rust is perfectly capable of expressing semantics. You can use the Typestate pattern or the Newtype pattern to rigorously enforce domain meaning. It's not that Rust &lt;em&gt;can't&lt;/em&gt; express intent. The issue is &lt;strong&gt;the cognitive cost required to express that intent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Conversely, Nim is not a silver bullet. Debugging complex macros can sometimes feel like hell, and the smaller ecosystem can be a real pressure point. This is a clear trade-off.&lt;/p&gt;

&lt;p&gt;However, the reason I accepted this trade-off and sided with Nim comes down to how we distribute our &lt;strong&gt;'attention'&lt;/strong&gt; when collaborating with AI.&lt;/p&gt;



&lt;h2&gt;
  
  
  Syntactic Simplicity: Where Does Our Attention Go?
&lt;/h2&gt;

&lt;p&gt;When reviewing code generated by an LLM, our cognitive resources are strictly limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's look at Rust:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;explore&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt; &lt;span class="n"&gt;BmcTrace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BmcVerdict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt; &lt;span class="n"&gt;TransitionRule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BmcVerdict&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I read this code, a massive chunk of my brainpower is spent tracking lifetimes (&lt;code&gt;'a&lt;/code&gt;), references, and mutability.&lt;br&gt;
In fact, when a bug occurred here, I found it incredibly difficult to tell at a glance: &lt;strong&gt;is the domain logic wrong, or was this code just awkwardly structured to appease the borrow checker?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now, how about Nim?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;explore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BmcTrace&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BmcVerdict&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
             &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TransitionRule&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;BmcVerdict&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, my attention is entirely focused on the 'domain variables' and 'semantic constraints'.&lt;br&gt;
Because the concerns of memory management are stripped away from the syntax, I can focus on the &lt;strong&gt;causality of the logic itself&lt;/strong&gt;, rather than the grammar.&lt;/p&gt;



&lt;h2&gt;
  
  
  Structural Devices That Enforce the "Why"
&lt;/h2&gt;

&lt;p&gt;The most painful moments in the trenches are when you have to dig through comments or commit messages to figure out "why did the previous dev (or yesterday's me, or the LLM) write it this way?"&lt;/p&gt;

&lt;p&gt;Nim allows you to bake this 'Why' directly into the language spec.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;
  &lt;span class="n"&gt;SessionId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
  &lt;span class="n"&gt;OwnerId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;distinct&lt;/code&gt; is not just a type alias. If an AI makes the mistake of passing an &lt;code&gt;OwnerId&lt;/code&gt; where a &lt;code&gt;SessionId&lt;/code&gt; is expected, the compiler grabs it by the collar immediately—long before it ever reaches runtime tests.&lt;/p&gt;
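
&lt;p&gt;A minimal repro of that collar-grab, using the types from the snippet above (the &lt;code&gt;endSession&lt;/code&gt; proc is invented for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;type
  SessionId = distinct string
  OwnerId = distinct string

proc endSession(id: SessionId) = discard

let owner = OwnerId("u-42")
endSession(owner)   # compile error: got OwnerId, expected SessionId
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
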

&lt;p&gt;The same goes for contracts that specify the preconditions and guaranteed outcomes of a function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;toSessionId&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxiomError&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;
  &lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ensures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isOk&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionIdString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't mere documentation. It's a contract enforced by the compiler before runtime.&lt;br&gt;
Sure, you can mimic this in Rust using macros or external crates, but Nim's core philosophy pushes this &lt;strong&gt;'visibility of intent'&lt;/strong&gt; by default. Just like how the single &lt;code&gt;func&lt;/code&gt; keyword structurally enforces the absence of side effects at the module level.&lt;/p&gt;
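
&lt;p&gt;That &lt;code&gt;func&lt;/code&gt; guarantee is checked by the compiler, not by convention. A tiny sketch (names invented for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;var auditLog: seq[string]

func pureAdd(a, b: int): int =
  # uncommenting the next line is rejected at compile time:
  # auditLog.add($a)   # Error: 'pureAdd' can have side effects
  a + b

echo pureAdd(2, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;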



&lt;h2&gt;
  
  
  Conclusion: The Real Bottleneck in the AI Era
&lt;/h2&gt;

&lt;p&gt;To reiterate, if performance or the sheer size of the community is your primary concern, Rust is undoubtedly an excellent choice. It provides an incredibly solid floor of memory safety.&lt;/p&gt;

&lt;p&gt;But what I felt in my bones while building Axiom is that, in the era of AI-assisted development, the most terrifying risk isn't 'unsafe code'. It's &lt;strong&gt;'code where you cannot know what it intended until you run it.'&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can you structurally grasp the author's intent without having to read every line of code?&lt;br&gt;
I didn't want to compromise on this question. And that—even if it sometimes looks a bit rough around the edges—is why I chose Nim. To break through the ceiling of coherence.&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>nim</category>
      <category>formal</category>
      <category>verification</category>
    </item>
    <item>
      <title>The Most Dangerous Word in AI Coding: "Verified"</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Wed, 08 Apr 2026 11:41:07 +0000</pubDate>
      <link>https://forem.com/wintrover/the-most-dangerous-word-in-ai-coding-verified-3a1k</link>
      <guid>https://forem.com/wintrover/the-most-dangerous-word-in-ai-coding-verified-3a1k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz7qa78jg7lrz5gvbera.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz7qa78jg7lrz5gvbera.png" alt=" " width="553" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Got a "Verified" result from my formal verification engine.&lt;/p&gt;

&lt;p&gt;Problem was, it was completely wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Looking at a simple function: &lt;code&gt;checkType&lt;/code&gt; from Bitcoin Core.&lt;/p&gt;

&lt;p&gt;The engine generated this SMT query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight common_lisp"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;throwsRuntimeError&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;typ&lt;/span&gt; &lt;span class="nv"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;typ&lt;/span&gt; &lt;span class="nv"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assert&lt;/span&gt; &lt;span class="nv"&gt;throwsRuntimeError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance? Looks fine.&lt;/p&gt;

&lt;p&gt;But there's a fatal flaw in there.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Contradiction
&lt;/h2&gt;

&lt;p&gt;Unpack it and here's what you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Error occurs when &lt;code&gt;typ != expected&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;But we're assuming &lt;code&gt;typ == expected&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;While also asserting "an error occurred"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Boil it down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;typ == expected
AND simultaneously: typ != expected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logically impossible.&lt;/p&gt;
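
&lt;p&gt;You can reproduce the contradiction without an SMT solver at all. Here is a tiny Python sketch (my own illustration, not the engine's code) that brute-forces the three constraints over a toy domain:&lt;/p&gt;

```python
from itertools import product

# The three asserted constraints from the SMT query, as Python predicates.
def constraints(typ, expected, throws_runtime_error):
    return (
        (throws_runtime_error == (typ != expected))  # error occurs iff types differ
        and (typ == expected)                        # assumed: types match
        and throws_runtime_error                     # asserted: an error occurred
    )

# Exhaustively check every assignment over a tiny domain.
satisfying = [
    (t, e, err)
    for t, e, err in product([0, 1], [0, 1], [False, True])
    if constraints(t, e, err)
]

print(satisfying)  # [] -> no assignment satisfies all three constraints
```

&lt;p&gt;No assignment survives all three constraints at once, which is exactly what Z3 reports as unsat.&lt;/p&gt;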




&lt;h2&gt;
  
  
  What the Solver Did
&lt;/h2&gt;

&lt;p&gt;Z3 (or any SMT solver) takes one look and concludes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unsat (Unsatisfiable)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In formal verification, this usually means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"No execution path exists where the error occurs."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the engine outputs:&lt;/p&gt;

&lt;p&gt;✅ Verified&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Went Wrong
&lt;/h2&gt;

&lt;p&gt;Here's the thing.&lt;/p&gt;

&lt;p&gt;The solver didn't prove the code safe.&lt;/p&gt;

&lt;p&gt;It proved the question itself was invalid.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vacuous Truth
&lt;/h2&gt;

&lt;p&gt;This is a classic trap in formal verification.&lt;/p&gt;

&lt;p&gt;The definition is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the premise is impossible, the statement is always true.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If the sun rises in the west, I am God."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Logically true.&lt;br&gt;
Because the premise can never hold.&lt;/p&gt;
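
&lt;p&gt;Python has the same rule baked in: a universal claim quantified over an empty set is true by definition. A two-liner to convince yourself:&lt;/p&gt;

```python
# A universal claim over an empty set is vacuously true:
# the body is never evaluated, so nothing can falsify it.
days_the_sun_rises_in_the_west = []   # the premise never holds

claim = all(False for _day in days_the_sun_rises_in_the_west)
print(claim)  # True
```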


&lt;h2&gt;
  
  
  What Actually Happened
&lt;/h2&gt;

&lt;p&gt;The engine effectively asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"In a state where &lt;code&gt;typ == expected&lt;/code&gt;, can an error occur?&lt;br&gt;
Given errors only happen when &lt;code&gt;typ != expected&lt;/code&gt;?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The solver's answer is clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"No such state exists."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then the engine interprets:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"No error possible → safe"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;That jump is the problem.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Verified ≠ Correct
&lt;/h2&gt;

&lt;p&gt;Key point:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Verified" doesn't mean correct.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes it means:&lt;/p&gt;

&lt;p&gt;Your model is broken.&lt;/p&gt;

&lt;p&gt;This case was exactly that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actual logic wasn't captured&lt;/li&gt;
&lt;li&gt;Contradictory constraints were generated&lt;/li&gt;
&lt;li&gt;Solver just short-circuited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Didn't verify the code.&lt;br&gt;
Logic just crashed.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;We're in this flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI writes code&lt;/li&gt;
&lt;li&gt;AI writes tests&lt;/li&gt;
&lt;li&gt;AI explains "correctness"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;None of that guarantees truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;p&gt;AI optimizes for plausibility, not consistency&lt;/p&gt;

&lt;p&gt;LLMs don't "resolve" contradictions.&lt;br&gt;
They continue them.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Real Danger: False Confidence
&lt;/h2&gt;

&lt;p&gt;This is worse than a failing test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Test fails → problem visible
Verified (fake) → problem hidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In security or finance logic? Catastrophic.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Needed
&lt;/h2&gt;

&lt;p&gt;Formal verification systems need at minimum:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Sanity Check (Premise Validation)
&lt;/h3&gt;

&lt;p&gt;Before proving anything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are these assumptions even possible together?&lt;/p&gt;
&lt;/blockquote&gt;
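
&lt;p&gt;A sketch of what that could look like (names and structure are mine, not a real engine's API): before attempting any proof, confirm that the assumptions alone admit at least one model.&lt;/p&gt;

```python
# Hypothetical sanity check: if the premises are jointly unsatisfiable,
# any subsequent "proof" is vacuous and must not be reported as Verified.
from itertools import product

def premises_satisfiable(premises, domain):
    # True if some assignment over the domain satisfies every premise.
    return any(
        all(p(t, e) for p in premises)
        for t, e in product(domain, repeat=2)
    )

premises = [
    lambda typ, expected: typ == expected,   # assumed: types match
    lambda typ, expected: typ != expected,   # implied by "an error occurred"
]

if not premises_satisfiable(premises, domain=[0, 1]):
    print("VACUOUS: premises contradict, a 'Verified' result is meaningless")
```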

&lt;h3&gt;
  
  
  2. Vacuity Detection
&lt;/h3&gt;

&lt;p&gt;Catch cases that pass because "nothing could happen."&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Mutation Testing
&lt;/h3&gt;

&lt;p&gt;Break the code intentionally:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If it still verifies, your verification is broken.&lt;/p&gt;
&lt;/blockquote&gt;
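
&lt;p&gt;A minimal sketch of the idea (hypothetical code, not a real harness): pair the original with a deliberately inverted mutant, and see whether the verifier can tell them apart.&lt;/p&gt;

```python
# Mutation testing applied to a verifier: break the code on purpose.
# If the verifier still says "Verified", distrust the verifier.

def check_type(typ, expected):
    # Original: raises when types differ.
    if typ != expected:
        raise RuntimeError("type mismatch")

def check_type_mutant(typ, expected):
    # Mutant: the guard is inverted, so the code is obviously wrong.
    if typ == expected:
        raise RuntimeError("type mismatch")

def vacuous_verifier(fn):
    # Stand-in for an engine whose premises contradict: it can never
    # reach a counterexample, so it reports success for any input.
    return "Verified"

verdicts = [vacuous_verifier(check_type), vacuous_verifier(check_type_mutant)]
print(verdicts)  # the mutant also comes back "Verified": the harness is broken
```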




&lt;h2&gt;
  
  
  Axiom
&lt;/h2&gt;

&lt;p&gt;That's why I'm building Axiom.&lt;/p&gt;

&lt;p&gt;Not to replace AI.&lt;/p&gt;

&lt;p&gt;To sit on top as a "verification layer."&lt;/p&gt;

&lt;p&gt;Role is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This holds&lt;/li&gt;
&lt;li&gt;This doesn't&lt;/li&gt;
&lt;li&gt;Or: this question is invalid from the start&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;If your system says "Verified,"&lt;/p&gt;

&lt;p&gt;Ask this first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is the question itself valid?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you can't answer that,&lt;/p&gt;

&lt;p&gt;You don't have verification.&lt;/p&gt;

&lt;p&gt;You have a well-crafted illusion.&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>formal</category>
      <category>verification</category>
      <category>smt</category>
    </item>
    <item>
      <title>Where Does Truth Live in AI-Generated Code?</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 07 Apr 2026 07:15:09 +0000</pubDate>
      <link>https://forem.com/wintrover/where-does-truth-live-in-ai-generated-code-5hmc</link>
      <guid>https://forem.com/wintrover/where-does-truth-live-in-ai-generated-code-5hmc</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Isn't Tests—It's Authority
&lt;/h2&gt;

&lt;p&gt;Talk about AI-generated code long enough, and you'll hit this question.&lt;/p&gt;

&lt;p&gt;"Tests pass, so what's the problem?"&lt;/p&gt;

&lt;p&gt;Not wrong. But it misses the core issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In AI-generated code, the real question is: who decides 'this is correct'?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tests? Reviewers? Another LLM?&lt;/p&gt;

&lt;p&gt;None of them.&lt;/p&gt;

&lt;p&gt;This isn't about improving accuracy. It's about where the authority to declare truth lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Actually Doing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Trusting Tests
&lt;/h3&gt;

&lt;p&gt;"Tests pass, so we're fine."&lt;/p&gt;

&lt;p&gt;In practice, this translates to:&lt;/p&gt;

&lt;p&gt;"Wechecked a few cases and nothing broke."&lt;/p&gt;

&lt;p&gt;This pattern repeats. Especially after adding LLMs.&lt;/p&gt;

&lt;p&gt;Most teams have been here.&lt;/p&gt;

&lt;p&gt;Tests sample. They miss edge cases. They rarely cover invariants.&lt;/p&gt;

&lt;p&gt;So this happens:&lt;/p&gt;

&lt;p&gt;All tests pass. Production breaks.&lt;/p&gt;

&lt;p&gt;Not an exception. A structural result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test passage is observation. Correctness is a property.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Trusting Human Review
&lt;/h3&gt;

&lt;p&gt;"A person reviewed it, so it's fine."&lt;/p&gt;

&lt;p&gt;Even less stable.&lt;/p&gt;

&lt;p&gt;Humans don't scale. LLM code output explodes. Reviews drift toward "looks about right."&lt;/p&gt;

&lt;p&gt;More fundamentally:&lt;/p&gt;

&lt;p&gt;Code review is not correctness verification. It's a consensus process.&lt;/p&gt;

&lt;p&gt;Consensus can be wrong.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Trusting LLM-as-Critic
&lt;/h3&gt;

&lt;p&gt;A popular idea lately.&lt;/p&gt;

&lt;p&gt;"Run another model to double-check?"&lt;/p&gt;

&lt;p&gt;Sounds reasonable. Many teams try this.&lt;/p&gt;

&lt;p&gt;But the structure is:&lt;/p&gt;

&lt;p&gt;Probabilistic system + probabilistic system&lt;/p&gt;

&lt;p&gt;Result stays the same:&lt;/p&gt;

&lt;p&gt;Still probabilistic.&lt;/p&gt;

&lt;p&gt;You can create consensus. But:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consensus is not proof.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Common Reactions
&lt;/h2&gt;

&lt;p&gt;Raise this topic, and responses split two ways.&lt;/p&gt;

&lt;p&gt;First:&lt;/p&gt;

&lt;p&gt;"The problem isn't correctness—it's shipping useful things fast."&lt;/p&gt;

&lt;p&gt;Second:&lt;/p&gt;

&lt;p&gt;"Invariants aren't perfect either, so keep verifying with LLMs."&lt;/p&gt;

&lt;p&gt;Both sound reasonable. Both collapse at the same point.&lt;/p&gt;




&lt;h3&gt;
  
  
  "Shipping vs Correctness" Is a False Choice
&lt;/h3&gt;

&lt;p&gt;Fast at first.&lt;/p&gt;

&lt;p&gt;A few tests. One review. Ship it.&lt;/p&gt;

&lt;p&gt;But over time:&lt;/p&gt;

&lt;p&gt;Every fix breaks something else. Unpredictable. Debugging costs explode.&lt;/p&gt;

&lt;p&gt;Eventually:&lt;/p&gt;

&lt;p&gt;Lack of correctness kills shipping speed.&lt;/p&gt;

&lt;p&gt;With LLMs, this happens much faster.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why "LLM Verification" Collapses
&lt;/h3&gt;

&lt;p&gt;Second idea:&lt;/p&gt;

&lt;p&gt;"LLM can verify invariants too, right?"&lt;/p&gt;

&lt;p&gt;This is usually where intuition misaligns.&lt;/p&gt;

&lt;p&gt;Structure breaks here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. No termination&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Need a verifier → who verifies the verifier? → another model? another?&lt;/p&gt;

&lt;p&gt;No end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Results aren't fixed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same input → different output&lt;/p&gt;

&lt;p&gt;At this point:&lt;/p&gt;

&lt;p&gt;You can't consistently state "what's correct."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Authority vanishes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Who's the final judge?&lt;/p&gt;

&lt;p&gt;The model? Training data? Prompt?&lt;/p&gt;

&lt;p&gt;No clear answer.&lt;/p&gt;

&lt;p&gt;This isn't a system. It's:&lt;/p&gt;

&lt;p&gt;"A structure that drifts toward whatever seems plausible."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question to Ask Again
&lt;/h2&gt;

&lt;p&gt;What matters isn't the method.&lt;/p&gt;

&lt;p&gt;"In this system, what decides 'correct'?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No answer means the system is already unstable.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Trust Boundary Categories
&lt;/h2&gt;

&lt;p&gt;Not all boundaries are equal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empirical boundary&lt;/strong&gt;: Tests, benchmarks, runtime monitors. Observation-based. Cannot prove absence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Social boundary&lt;/strong&gt;: Code review, approval workflows. Authority-based. Doesn't scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal boundary&lt;/strong&gt;: Invariants, type systems, proofs. Mathematical necessity. Deterministic.&lt;/p&gt;

&lt;p&gt;Most systems use a mix of all three.&lt;/p&gt;

&lt;p&gt;The problem is not recognizing which one you're relying on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Should the Trust Boundary Be?
&lt;/h2&gt;

&lt;p&gt;Simple choice.&lt;/p&gt;

&lt;p&gt;Somewhere:&lt;/p&gt;

&lt;p&gt;A point must decide "this is correct."&lt;/p&gt;

&lt;p&gt;And that point:&lt;/p&gt;

&lt;p&gt;Cannot be probabilistic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Redefining the Structure
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;BEFORE (what most systems do)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM → Code → Tests → Ship
              ↓
         (uncertain)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No clear decision point anywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AFTER&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM → Proposes (may be wrong)
System → Encodes to spec
Proof → Decides validity     ← (this is the boundary)
Execution → Runs only what passes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The boundary exists explicitly.&lt;/p&gt;

&lt;p&gt;The difference is not the tools.&lt;br&gt;
It's where the boundary sits.&lt;/p&gt;

&lt;p&gt;One thing matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The verification step must be non-negotiable.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Definition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A trust boundary is where correctness becomes non-negotiable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before: exploration, generation, probability&lt;br&gt;
After: execution, responsibility, determinism&lt;/p&gt;

&lt;p&gt;Don't draw this clearly, and:&lt;/p&gt;

&lt;p&gt;The entire system stays in "ambiguous territory."&lt;/p&gt;




&lt;h2&gt;
  
  
  This Isn't a New Idea
&lt;/h2&gt;

&lt;p&gt;Formal methods. Proof systems. Invariant-based design.&lt;/p&gt;

&lt;p&gt;These already exist.&lt;/p&gt;

&lt;p&gt;What's different now:&lt;/p&gt;

&lt;p&gt;We haven't placed them at the center of code generation pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  So What Does This Look Like in Practice?
&lt;/h2&gt;

&lt;p&gt;Naturally, this structure emerges:&lt;/p&gt;

&lt;p&gt;Define invariants first&lt;br&gt;
Prove code satisfies them&lt;br&gt;
Execute only what passes&lt;/p&gt;

&lt;p&gt;This isn't about writing better tests.&lt;br&gt;
It's about redrawing the trust boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Axiom
&lt;/h2&gt;

&lt;p&gt;One attempt to implement this: Axiom.&lt;/p&gt;

&lt;p&gt;Define state in terms of invariants&lt;br&gt;
Decide correctness through proof&lt;br&gt;
Block regression structurally&lt;/p&gt;

&lt;p&gt;If the boundary has been implicit until now,&lt;br&gt;
Axiom tries to make it explicit and enforce it.&lt;/p&gt;

&lt;p&gt;The key isn't features—it's position.&lt;/p&gt;

&lt;p&gt;Clarifying "what holds final authority."&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI systems come down to one of two kinds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Probabilistic systems&lt;/strong&gt;&lt;br&gt;
Mostly right. Sometimes wrong. Don't know where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Deterministic systems&lt;/strong&gt;&lt;br&gt;
Only proven code passes. Upfront cost. Doesn't fail unpredictably.&lt;/p&gt;

&lt;p&gt;No middle ground.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Question
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;In your system, who can say "this is correct"?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can't answer clearly?&lt;br&gt;
Then the trust boundary doesn't exist.&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>ai</category>
      <category>trust</category>
      <category>boundary</category>
    </item>
    <item>
      <title>Architecture Philosophy: Rule-First Design</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Mon, 06 Apr 2026 12:31:29 +0000</pubDate>
      <link>https://forem.com/wintrover/architecture-philosophy-rule-first-design-5gd1</link>
      <guid>https://forem.com/wintrover/architecture-philosophy-rule-first-design-5gd1</guid>
      <description>&lt;h2&gt;
  
  
  The Core Question
&lt;/h2&gt;

&lt;p&gt;When building an engine to verify AI-written code, architects hit a brutal dilemma. Where do rules live, and who enforces them?&lt;/p&gt;

&lt;p&gt;Axiom's answer is simple. &lt;strong&gt;Rules must precede code.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Rule-First Principle
&lt;/h2&gt;

&lt;p&gt;Traditional software development works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write code first.&lt;/li&gt;
&lt;li&gt;Run tests to catch bugs.&lt;/li&gt;
&lt;li&gt;Add rules to prevent similar bugs.&lt;/li&gt;
&lt;li&gt;Repeat forever.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This reactive approach shatters catastrophically when verifying AI-generated code. LLMs excel at crafting outputs that pass tests while concealing logical flaws—they're geniuses at creating "plausible wrong answers." Their probabilistic nature means bugs reappear in subtly different forms, bypassing test-based detection.&lt;/p&gt;

&lt;p&gt;Without rules locked in first, the engine eventually gaslights itself. It starts lowering verification standards to accommodate AI hallucinations. A verification engine that discovers rules from code inspection ends up in a self-justifying loop. It validates code against constraints derived from that same code.&lt;/p&gt;

&lt;p&gt;So Axiom inverts the paradigm completely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declare rules&lt;/strong&gt;: Define invariants, contracts, safety properties first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encode&lt;/strong&gt;: Transform rules into verification artifacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prove&lt;/strong&gt;: Mathematically prove code satisfies rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt;: Only then allow execution.&lt;/li&gt;
&lt;/ol&gt;
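
&lt;p&gt;Compressed into a toy Python sketch (names are mine, and the "prove" step is exhaustive checking over a finite domain rather than a real proof), the inversion looks like this:&lt;/p&gt;

```python
# Rule-first pipeline: rules exist before any code runs, and
# execution is gated on the proof step.

def declare_rules():
    # 1. Declare: invariants stated up front, independent of any code.
    return [lambda x: x >= 0]          # e.g. "this value never goes negative"

def encode(rules):
    # 2. Encode: turn the rules into a single checkable predicate.
    return lambda x: all(rule(x) for rule in rules)

def prove(predicate, states):
    # 3. Prove: here, exhaustively over a finite state space.
    return all(predicate(s) for s in states)

def execute(action):
    # 4. Execute: only reached when the proof step passed.
    return action()

rules = declare_rules()
holds = prove(encode(rules), states=range(10))
result = execute(lambda: "ran") if holds else "blocked"
print(result)  # prints ran
```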




&lt;h2&gt;
  
  
  Constitution-Ordinance-Annals: Layering Rules
&lt;/h2&gt;

&lt;p&gt;Axiom splits rules into three layers. Structure borrowed from legal systems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Keyword&lt;/th&gt;
&lt;th&gt;File Pattern&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Change Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constitution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CONTEXT.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Immutable design principles&lt;/td&gt;
&lt;td&gt;Almost never&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ordinances&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Spec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docs/spec/*.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Detailed algorithms and constraints&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Annals&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;History&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;docs/history/*.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Decision records and history&lt;/td&gt;
&lt;td&gt;Frequent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Layer 1: Constitution (Context)
&lt;/h3&gt;

&lt;p&gt;Project identity. Principles like "Core calculations must use pure functions (func) only" belong here. Stuff that breaks the project's foundation. Modifying &lt;code&gt;CONTEXT.md&lt;/code&gt; requires serious architectural review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## 2. Core Architecture Principles (SSOT)&lt;/span&gt;

&lt;span class="gu"&gt;### Purity Hierarchy&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Core Calculation Tier**&lt;/span&gt;: &lt;span class="sb"&gt;`src/core`&lt;/span&gt; computation uses &lt;span class="sb"&gt;`func`&lt;/span&gt; exclusively
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Boundary Executor Tier**&lt;/span&gt;: &lt;span class="sb"&gt;`proc`&lt;/span&gt; exists only at OS/interfaces
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Executor-only Rule**&lt;/span&gt;: Boundary &lt;span class="sb"&gt;`proc`&lt;/span&gt; cannot own business logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 2: Ordinances (Spec)
&lt;/h3&gt;

&lt;p&gt;Actual running rules. SMT solver interfaces, Lean 4 proof strategies—concrete specifications live here. Guidelines during implementation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bmc_core.md&lt;/code&gt;: Model checking algorithms, Z3 interfaces, timeout rules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uap_logic.md&lt;/code&gt;: Universal pipeline stages, counterexample promotion, invariant refinement&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lean_proof.md&lt;/code&gt;: Lean 4 semantics, tactic strategies, replay protocols&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 3: Annals (History)
&lt;/h3&gt;

&lt;p&gt;Answers "why did we decide that back then?" Records past compromises and decision context. Annals are for reference only—they should never govern current implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## 2026-03-26&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Core introduced Bottom-up Sorter, Vault Manager, Axiom Orchestration Loop
&lt;span class="p"&gt;-&lt;/span&gt; Established stale axiom blocking and vault-based promotion paths
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Annals stay separate from rules. When you need "why does this rule exist?" you check annals. When implementing code, you focus on constitution and ordinances. No wading through historical noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  No-Downgrade Policy: The Hard Line
&lt;/h2&gt;

&lt;p&gt;Axiom enforces one rule above all others. &lt;strong&gt;Integrity can only increase, never decrease.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The func→proc Trap
&lt;/h3&gt;

&lt;p&gt;Nim strictly separates pure functions (&lt;code&gt;func&lt;/code&gt;) and impure procedures (&lt;code&gt;proc&lt;/code&gt;). During development, temptation strikes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Just one quick &lt;code&gt;echo&lt;/code&gt; here. Let me temporarily switch to &lt;code&gt;proc&lt;/code&gt;..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Doesn't work in Axiom.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Downgrade is one-way&lt;/strong&gt;: Once &lt;code&gt;proc&lt;/code&gt; exists, developers keep adding side effects. It spreads like weeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purity is binary&lt;/strong&gt;: Either a function is pure or it isn't. "Slightly impure" is meaningless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification requires purity&lt;/strong&gt;: You can't mathematically prove properties about functions with unpredictable side effects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nim 2.0+ &lt;code&gt;strictFuncs&lt;/code&gt; mode and effect tracking make this distinction even stronger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# strictFuncs enforces purity at compile time&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt; &lt;span class="n"&gt;strictFuncs&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;verifyLogic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;void&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Compiler rejects any external state mutation or IO call&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isValid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;InvalidNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why Nim fits Axiom's architecture perfectly. The language itself enforces purity boundaries at compile time. Not just at code review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Axiom's Enforcement Mechanism
&lt;/h3&gt;

&lt;p&gt;Axiom maintains a &lt;strong&gt;Purity Hierarchy&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│ Core Calculation Tier (src/core)                            │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ func-only zone: computation, parsing, normalization     │ │
│ │ Returns: Result[T], Option[T], VerifiedType[T]         │ │
│ │ Forbidden: IO, mutation, exceptions                     │ │
│ └─────────────────────────────────────────────────────────┘ │
│                                                             │
│ Boundary Executor Tier (src/ui, src/cli, gateway)          │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ proc-allowed zone: OS, TTY, filesystem, processes       │ │
│ │ Role: Execute core func results, inject dependencies    │ │
│ │ Forbidden: Business logic, domain rules                  │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;proc&lt;/code&gt; shows up in &lt;code&gt;src/core&lt;/code&gt;, it gets flagged as &lt;strong&gt;design debt&lt;/strong&gt;—a temporary compromise with a mandatory refactor plan. Code doesn't ship until:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Executor boundary clearly marked&lt;/li&gt;
&lt;li&gt;"func promotion" task exists in backlog&lt;/li&gt;
&lt;li&gt;Impurity reason documented in &lt;code&gt;CONTEXT.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What Happens When You Violate This Rule
&lt;/h3&gt;

&lt;p&gt;In Axiom, downgrading &lt;code&gt;func&lt;/code&gt; to &lt;code&gt;proc&lt;/code&gt; isn't just code modification. &lt;strong&gt;It's treated as destroying mathematical integrity.&lt;/strong&gt; Downgrades without proper justification (&lt;code&gt;CONTEXT.md&lt;/code&gt; update) get blocked at CI stage entirely.&lt;/p&gt;

&lt;p&gt;Axiom's CI enforces this through &lt;strong&gt;Procedure Gate&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If &lt;code&gt;src/core&lt;/code&gt; files change, check for &lt;code&gt;proc&lt;/code&gt; additions&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;func&lt;/code&gt;→&lt;code&gt;proc&lt;/code&gt; conversion appears in diff, &lt;strong&gt;fail the commit&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;proc&lt;/code&gt; exists without corresponding Boundary documentation, &lt;strong&gt;fail the commit&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Only exception: &lt;code&gt;--style-only&lt;/code&gt; changes (typos, comments, whitespace). Everything else needs architectural justification.&lt;/p&gt;
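
&lt;p&gt;A Procedure Gate can be sketched in a few lines (this is my hypothetical reconstruction, not Axiom's actual CI script): scan the diff and fail when &lt;code&gt;src/core&lt;/code&gt; gains a &lt;code&gt;proc&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical Procedure Gate: walk a unified diff and flag any
# added `proc` declaration inside src/core.
import re

def procedure_gate(diff):
    violations = []
    current_file = None
    for line in diff.splitlines():
        if line.startswith("+++ "):
            current_file = line[4:].strip()
        elif line.startswith("+") and current_file and "src/core" in current_file:
            # Added line in a core file: reject any new proc declaration.
            if re.match(r"\+\s*proc\s", line):
                violations.append((current_file, line.lstrip("+").strip()))
    return violations

diff = """\
+++ b/src/core/verify.nim
+proc leakyHelper(x: int): int =
+  echo x
"""
print(procedure_gate(diff))  # flags the added proc in src/core
```

&lt;p&gt;A real gate would also diff against the previous revision to catch &lt;code&gt;func&lt;/code&gt;→&lt;code&gt;proc&lt;/code&gt; conversions and cross-check the Boundary documentation, but the shape is the same: the commit fails before review even starts.&lt;/p&gt;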

&lt;p&gt;At compile time, Nim's effect system provides additional enforcement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# [Layer 1: Core Calculation Tier]&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt; &lt;span class="n"&gt;strictFuncs&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;verifyNodeIntegrity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AxiomNode&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;void&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegrityError&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="sd"&gt;## No IO, no mutation, no exceptions - mathematically pure&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fingerprint&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;computedFingerprint&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IntegrityError&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FingerprintMismatch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# [Layer 2: Boundary Executor Tier]&lt;/span&gt;
&lt;span class="k"&gt;proc &lt;/span&gt;&lt;span class="nf"&gt;executeVerification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AxiomNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;IOError&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="sd"&gt;## Thin executor: injects dependencies, delegates all logic to func&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verifyNodeIntegrity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# Calls pure function&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isOk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Verification Passed&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# IO isolated at boundary&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Verification Failed: "&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the pattern? &lt;code&gt;verifyNodeIntegrity&lt;/code&gt; is pure mathematics. &lt;code&gt;executeVerification&lt;/code&gt; is a thin wrapper handling IO. Business logic never crosses into impure territory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Example
&lt;/h3&gt;

&lt;p&gt;When implementing report rendering, Axiom faced a choice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tempting initial approach:&lt;/span&gt;
&lt;span class="k"&gt;proc &lt;/span&gt;&lt;span class="nf"&gt;renderReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProofResult&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Direct IO: writes to stdout&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;echo&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;# Enforced purity approach:&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;renderReportLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProofResult&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;noSideEffect&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Pure computation: returns strings&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;proc &lt;/span&gt;&lt;span class="nf"&gt;writeReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProofResults&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Thin executor: injects output stream&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;renderReportLines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;echo&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second approach separates concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;renderReportLines&lt;/code&gt; is pure, testable, verifiable&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;writeReport&lt;/code&gt; is a trivial executor, replaceable with file output, network output, or a test mock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't over-engineering. &lt;strong&gt;It's making verification possible.&lt;/strong&gt;&lt;/p&gt;
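&lt;p&gt;The same split can be sketched outside Nim. A minimal Python analogue (names are hypothetical, not Axiom's API) shows why the pure half is trivially testable while the executor stays replaceable:&lt;/p&gt;

```python
# Hypothetical Python analogue of the renderReportLines/writeReport split.
# The pure function returns data; the thin executor performs IO.

def render_report_lines(results):
    """Pure: maps results to display strings, no IO."""
    return [f"{name}: {'PASS' if ok else 'FAIL'}" for name, ok in results]

def write_report(results, out):
    """Thin executor: the output sink is injected, so tests can pass anything."""
    for line in render_report_lines(results):
        out.write(line + "\n")

# The pure half is tested with plain equality -- no mocks, no output capture:
assert render_report_lines([("auth", True), ("io", False)]) == ["auth: PASS", "io: FAIL"]
```

&lt;p&gt;Because all decisions live in the pure function, swapping &lt;code&gt;out&lt;/code&gt; for a file, a socket, or an in-memory buffer changes nothing about what gets verified.&lt;/p&gt;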




&lt;h2&gt;
  
  
  Why These Principles Exist
&lt;/h2&gt;

&lt;p&gt;Axiom verifies AI-generated code. To do this credibly, it must be &lt;strong&gt;more rigorous than the code it verifies.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rule-First design ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verification rules never get retrofitted to justify existing code&lt;/li&gt;
&lt;li&gt;Architecture principles are explicit, discoverable, enforced&lt;/li&gt;
&lt;li&gt;Integrity only moves one direction: upward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Constitution-Ordinance-Annals separation ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Principles stay visible without drowning in details&lt;/li&gt;
&lt;li&gt;Implementers find specifications easily&lt;/li&gt;
&lt;li&gt;History informs but doesn't clutter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No-Downgrade Policy ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purity isn't sacrificed for convenience&lt;/li&gt;
&lt;li&gt;Verification stays mathematically sound&lt;/li&gt;
&lt;li&gt;Technical debt is visible, tracked, temporary&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These principles aren't aspirational. They are enforced by CI gates, code review, and architectural documentation. Every Axiom commit must pass the Procedure Gate. Every &lt;code&gt;proc&lt;/code&gt; in the core must justify its existence. Every rule change must be documented before implementation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The architecture must itself be verifiable.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Building a verification engine means &lt;strong&gt;you cannot verify probabilistic code with probabilistic tools.&lt;/strong&gt; The foundation must be deterministic, mathematically proven, architecturally pure.&lt;/p&gt;

&lt;p&gt;Next post: how Axiom transforms mathematical promises into executable proofs through bounded model checking (BMC).&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>architecture</category>
      <category>rulefirst</category>
      <category>formal</category>
    </item>
    <item>
      <title>Axiom: Deterministic Integrity Engine for Probabilistic AI</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Mon, 06 Apr 2026 11:16:13 +0000</pubDate>
      <link>https://forem.com/wintrover/axiom-deterministic-integrity-engine-for-probabilistic-ai-1j30</link>
      <guid>https://forem.com/wintrover/axiom-deterministic-integrity-engine-for-probabilistic-ai-1j30</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Why an Integrity Verification Engine?
&lt;/h2&gt;

&lt;p&gt;How can we be certain that AI-generated code is "correct"? Beyond simply compiling or passing tests, can we mathematically prove that code is free from race conditions, memory safety violations, and logical flaws?&lt;/p&gt;

&lt;p&gt;Axiom was born from a fundamental question in software engineering: &lt;strong&gt;"How far can we trust AI-written code?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our answer lies not in more test cases, but in &lt;strong&gt;mathematical verification&lt;/strong&gt;. Axiom replaces probabilistic AI outputs with deterministic, robust software. To achieve this, we combine the following technical pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bounded Model Checking (BMC)&lt;/strong&gt;: Complete integrity verification within defined exploration bounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMT Solver (Z3)&lt;/strong&gt;: Automated theorem proving based on logical constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lean 4&lt;/strong&gt;: Ensuring deterministic reproducibility of high-level design principles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dr.Nim&lt;/strong&gt;: Powerful Design-by-Contract (DbC) verification embedded in the Nim language&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Reliability Problem in AI-Agent Generated Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Limitations of Probabilistic Models
&lt;/h3&gt;

&lt;p&gt;Modern LLMs operate on probability. When AI generates code, it doesn't search for the logically most correct answer — it stochastically chains together &lt;strong&gt;"the most plausible next token"&lt;/strong&gt; from its training data.&lt;/p&gt;

&lt;p&gt;The result is a dangerous gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Surface-level correctness&lt;/strong&gt;: Code appears to work, but internal edge cases remain unaddressed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical hallucination&lt;/strong&gt;: Systems collapse without warning under untested runtime conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and concurrency defects&lt;/strong&gt;: Race conditions in parallel execution paths that humans easily miss remain difficult puzzles for AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Edsger Dijkstra once said: &lt;em&gt;"Testing proves the existence of bugs, not their absence."&lt;/em&gt; In the AI era, this maxim cuts even deeper.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Asymmetry of Verification
&lt;/h3&gt;

&lt;p&gt;A dangerous asymmetry exists in current software development:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI Code Generation Speed:   ████████████████████ (Overwhelming)
Human Code Review Speed:    ██ (Lagging)
Trust Gap:                  ██████████████████
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Axiom was created to bridge this trust gap. Our goal extends beyond augmenting human judgment — we enhance AI's productivity with mathematical certainty.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations of Traditional Static Analysis and the Value of BMC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Existing Tools Fall Short
&lt;/h3&gt;

&lt;p&gt;Traditional linters and static analyzers rely on simple pattern matching and heuristic rules. They merely speculate: &lt;em&gt;"This code is probably safe."&lt;/em&gt; In contrast, Axiom answers: &lt;strong&gt;"Is this code provably correct?"&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;th&gt;False Positive Rate&lt;/th&gt;
&lt;th&gt;Formal Guarantees&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linting&lt;/td&gt;
&lt;td&gt;Pattern matching&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type Check&lt;/td&gt;
&lt;td&gt;Type inference&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Execution sampling&lt;/td&gt;
&lt;td&gt;Incomplete&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Axiom (BMC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Exhaustive exploration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Complete (within bounds)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Complete proofs&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Bounded Model Checking: Realizing Complete Verification
&lt;/h3&gt;

&lt;p&gt;BMC systematically explores all states a program can reach within defined bounds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# BMC doesn't merely execute — it mathematically deconstructs every execution path&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;seq&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProcessedData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ErrorCode&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="c"&gt;# Axiom mathematically proves:&lt;/span&gt;
  &lt;span class="c"&gt;# 1. Buffer overflow probability: 0%&lt;/span&gt;
  &lt;span class="c"&gt;# 2. Post-conditions satisfied on all paths&lt;/span&gt;
  &lt;span class="c"&gt;# 3. Invariants maintained across all reachable states&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At Axiom's core sits Z3, Microsoft Research's SMT Solver. It encodes program semantics into logical formulas. When Z3 determines a negated assertion is &lt;strong&gt;UNSAT (unsatisfiable)&lt;/strong&gt;, we have mathematical certainty: &lt;strong&gt;no execution path exists that could violate that assertion&lt;/strong&gt;.&lt;/p&gt;
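&lt;p&gt;The bounded-checking idea can be illustrated in miniature. This toy Python sketch (a brute-force stand-in, not Axiom's actual Z3 pipeline, which works symbolically) searches the whole bounded input space for a counterexample; an empty search corresponds to Z3 reporting the negated assertion UNSAT:&lt;/p&gt;

```python
# Toy illustration of bounded checking (not Axiom's actual Z3 encoding).
# We "prove" a property by exhaustively searching the bounded input space for
# a counterexample; an empty search mirrors Z3 declaring the negation UNSAT.

def clamp_index(i, length):
    """Function under verification: must return a valid index for length > 0."""
    return max(0, min(i, length - 1))

def bounded_check(bound):
    """Search all (i, length) pairs within the bound for an assertion violation."""
    for i in range(-bound, bound + 1):
        for length in range(1, bound + 1):
            idx = clamp_index(i, length)
            if not (0 <= idx < length):    # negated assertion
                return (i, length)          # SAT: concrete counterexample found
    return None                             # UNSAT within the bound

print(bounded_check(16))  # None: no path within the bound violates the assertion
```

&lt;p&gt;The difference in the real engine is that Z3 explores this space symbolically through logical constraints instead of enumerating it, which is what makes the approach scale.&lt;/p&gt;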




&lt;h2&gt;
  
  
  The Birth of Axiom: Goals and Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Principles
&lt;/h3&gt;

&lt;p&gt;Axiom was architected with these foundational principles:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Deterministic Integrity
&lt;/h4&gt;

&lt;p&gt;Every verification result must be &lt;strong&gt;reproducible and mathematically proven&lt;/strong&gt;. No probabilistic judgments, no heuristics. Pure logical consequence.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Effect-Free Core
&lt;/h4&gt;

&lt;p&gt;The verification engine maintains strict purity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nim"&gt;&lt;code&gt;&lt;span class="c"&gt;# Core verification functions are marked:&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;noSideEffect&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;  &lt;span class="c"&gt;# No IO, no mutation of external state&lt;/span&gt;
&lt;span class="p"&gt;{.&lt;/span&gt;&lt;span class="n"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;&lt;span class="p"&gt;.}&lt;/span&gt;    &lt;span class="c"&gt;# No exceptions — errors are data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This guarantees that verification logic itself cannot introduce bugs through side effects.&lt;/p&gt;
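&lt;p&gt;What "errors are data" means in practice can be sketched in Python (an illustrative analogue of the &lt;code&gt;raises: []&lt;/code&gt; discipline, not Axiom's actual types): the function never throws, so every outcome is visible in its return value.&lt;/p&gt;

```python
# Sketch of "errors are data": instead of raising, the function returns a
# value that encodes success or failure, so every outcome is explicit.
from dataclasses import dataclass
from typing import Union

@dataclass
class Ok:
    value: object

@dataclass
class Err:
    error: str

Result = Union[Ok, Err]

def parse_bound(text: str) -> Result:
    """Pure parser: never raises, never touches IO; failures come back as Err."""
    if not text.isdigit():
        return Err("not a number")
    n = int(text)
    return Ok(n) if n > 0 else Err("bound must be positive")

assert parse_bound("8") == Ok(8)
assert parse_bound("abc") == Err("not a number")
```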

&lt;h4&gt;
  
  
  3. Gateway-First Architecture
&lt;/h4&gt;

&lt;p&gt;All external input passes through verification gates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Untrusted Input → Gateway (Parse/Validate/Promote) → VerifiedType → Core Engine
                                              ↓
                                     Reject with structured error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core never sees untrusted data. All validation happens at the boundaries.&lt;/p&gt;
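&lt;p&gt;A minimal sketch of the gateway pattern, with hypothetical names (this is not Axiom's actual gateway code): raw input is parsed and validated exactly once at the boundary, then promoted to a distinct type that the core accepts blindly.&lt;/p&gt;

```python
# Sketch of Parse/Validate/Promote at the boundary. Only the gateway can
# construct the verified type, so the core never handles untrusted strings.
from dataclasses import dataclass

@dataclass(frozen=True)
class VerifiedModuleName:
    """Promoted type: existence implies validation already happened."""
    name: str

def gateway_promote(raw: str):
    """Returns (VerifiedModuleName, None) on success, (None, structured_error) on reject."""
    candidate = raw.strip()
    if not candidate.isidentifier():
        return None, {"code": "INVALID_MODULE", "input": raw}
    return VerifiedModuleName(candidate), None

def core_verify(module: VerifiedModuleName) -> str:
    # The core only ever sees promoted values.
    return f"verifying {module.name}"

promoted, err = gateway_promote("  auth ")
assert err is None and core_verify(promoted) == "verifying auth"
```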

&lt;h4&gt;
  
  
  4. CLI as Primary Interface
&lt;/h4&gt;

&lt;p&gt;Axiom prioritizes command-line interfaces for automation and integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ax scan ./src              &lt;span class="c"&gt;# Scan codebase structure&lt;/span&gt;
ax prove &lt;span class="nt"&gt;--module&lt;/span&gt; auth    &lt;span class="c"&gt;# Verify specific module&lt;/span&gt;
ax report &lt;span class="nt"&gt;--format&lt;/span&gt; json   &lt;span class="c"&gt;# Generate structured report&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables seamless integration into CI/CD pipelines and development workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Series Will Cover
&lt;/h2&gt;

&lt;p&gt;This introduction marks the beginning of a detailed technical series exploring:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;BMC Core Implementation&lt;/strong&gt;: How Axiom encodes program semantics for model checking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z3 Integration Patterns&lt;/strong&gt;: Effective use of SMT solvers in program verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lean 4 Proof Replay&lt;/strong&gt;: Generating and validating formal proofs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Universal Assertion Pipeline (UAP)&lt;/strong&gt;: A generalized verification framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dr.Nim Contract System&lt;/strong&gt;: Embedding verification contracts in code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof Vault Architecture&lt;/strong&gt;: Managing verification artifacts at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Integration&lt;/strong&gt;: Automating verification in development workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Engineering&lt;/strong&gt;: Optimizing BMC for real-world codebases&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each post will combine theoretical foundations with concrete implementation details from the Axiom codebase.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vision Ahead
&lt;/h2&gt;

&lt;p&gt;Axiom's mission is clear: &lt;strong&gt;Refine AI's non-deterministic outputs into mathematically proven deterministic assets&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This isn't about replacing developers or AI assistants. It's about augmenting every participant in the software development lifecycle with certainty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; get immediate feedback on logical correctness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizations&lt;/strong&gt; get audit trails of mathematical proofs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security teams&lt;/strong&gt; get guarantees beyond penetration testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI systems&lt;/strong&gt; get a verification layer that elevates their output quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of software engineering isn't just about generating code faster — it's about generating code that we can mathematically prove correct. Axiom is the engine that makes this possible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in this series: Architecture Philosophy - Rule First Principles and Constitution-Ordinance-Annals Separation.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>axiom</category>
      <category>bmc</category>
      <category>model</category>
      <category>checking</category>
    </item>
    <item>
      <title>Why We Still Don't Trust AI-Generated Code: The Archright Trinity</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:08:20 +0000</pubDate>
      <link>https://forem.com/wintrover/why-we-still-dont-trust-ai-generated-code-the-archright-trinity-105h</link>
      <guid>https://forem.com/wintrover/why-we-still-dont-trust-ai-generated-code-the-archright-trinity-105h</guid>
      <description>&lt;p&gt;"You just can't trust code written by AI."&lt;/p&gt;

&lt;p&gt;Until right before I decided to resign, this was the sentence I heard most often in real engineering teams.&lt;br&gt;
The paradox was obvious: organizations wanted "10x productivity" from AI, yet deeply distrusted the output.&lt;/p&gt;

&lt;p&gt;I stood in the middle of that contradiction, in agony, drilling into the root cause.&lt;br&gt;
Why do we fail to trust AI-generated code? Is it simply because the tool is imperfect?&lt;/p&gt;

&lt;p&gt;No.&lt;br&gt;
My conclusion was this: the real problem is a distorted process that burns human labor to patch AI uncertainty.&lt;/p&gt;

&lt;p&gt;Organizations forced engineers to review hundreds of lines produced in seconds through naked-eye inspection and overtime.&lt;br&gt;
That was not a productivity gain.&lt;br&gt;
It was the hell of review labor.&lt;br&gt;
I found that this distrust consistently maps to three fundamental deficits.&lt;/p&gt;



&lt;h3&gt;
  
  
  1) Deficit of Intent: Is this code truly aligned with my design context?
&lt;/h3&gt;

&lt;p&gt;When intent is not preserved, teams cannot prove whether generated code matches architectural decisions.&lt;br&gt;
The output may look correct, yet still violate what the builder actually meant.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Deficit of Stability: Will it break at runtime or quietly degrade performance?
&lt;/h3&gt;

&lt;p&gt;Without deterministic controls, generated code remains a probability game.&lt;br&gt;
Even if it passes quickly, hidden runtime failures and regressions can emerge later.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Deficit of Security: Does it work now but plant future vulnerabilities?
&lt;/h3&gt;

&lt;p&gt;Many AI outputs are operationally acceptable in the moment but logically under-proven.&lt;br&gt;
That gap becomes a delayed risk multiplier.&lt;/p&gt;




&lt;p&gt;Archright exists to solve this asymmetry as a system.&lt;br&gt;
Instead of turning humans into disposable verification parts, I translated the declaration of deterministic integrity from my resignation into three concrete technical pillars.&lt;/p&gt;



&lt;h3&gt;
  
  
  1. Thought Trajectory System: Freezing Intent
&lt;/h3&gt;

&lt;p&gt;It freezes builder intent and context as durable records, like a GitHub commit log for reasoning.&lt;br&gt;
It fixes the architect's thought flow as explicit data so AI inference does not remain a black box.&lt;br&gt;
That creates transparent intent anyone can onboard from immediately, while slashing communication cost.&lt;/p&gt;

&lt;p&gt;This system is not a simple log.&lt;br&gt;
In Archright, intent is used through this flow:&lt;/p&gt;

&lt;p&gt;Requirements input&lt;br&gt;
→ Capture the architect's intent in a structured form&lt;br&gt;
→ Use it as the reference point across every generation/verification step&lt;br&gt;
→ Validate consistency against "intent," not just against code&lt;/p&gt;

&lt;p&gt;In short, code is only one output that must satisfy this intent.&lt;/p&gt;
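&lt;p&gt;The shape of "intent as a reference point" can be sketched as structured data. The field names below are purely illustrative, not Archright's actual schema; the point is that a machine-checkable record replaces prose.&lt;/p&gt;

```python
# Hypothetical sketch: intent captured as structured data rather than prose,
# so later stages can check outputs against it mechanically.
from dataclasses import dataclass, field

@dataclass
class IntentRecord:
    goal: str                                         # what the architect wants
    constraints: list = field(default_factory=list)   # rules every output must satisfy

def consistent_with(intent: IntentRecord, discharged: set) -> bool:
    """An output is consistent only if it discharges every recorded constraint."""
    return all(c in discharged for c in intent.constraints)

intent = IntentRecord(
    goal="users read only their own data",
    constraints=["owner_check_before_read", "no_raw_sql"],
)
assert consistent_with(intent, {"owner_check_before_read", "no_raw_sql", "logging"})
assert not consistent_with(intent, {"no_raw_sql"})
```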

&lt;h3&gt;
  
  
  2. Nim Programming Language: A Trinity of Productivity, Performance, and Stability
&lt;/h3&gt;

&lt;p&gt;Why Nim instead of Rust?&lt;br&gt;
Rust is excellent for performance and safety, but often sacrifices production velocity, and its less human-friendly syntax can become a major trigger for AI-agent hallucinations.&lt;br&gt;
Nim preserves performance and stability while delivering exceptional readability.&lt;br&gt;
It is the optimal choice for a high-efficiency engine where both agents and humans communicate with clarity.&lt;/p&gt;

&lt;p&gt;Language choice is not a matter of taste.&lt;br&gt;
In Archright's flow:&lt;/p&gt;

&lt;p&gt;Intent (high level) → Constraints (intermediate form) → Code (low level)&lt;br&gt;
these three layers must remain continuously connected.&lt;/p&gt;

&lt;p&gt;We chose a language that minimizes the cost of maintaining this connection.&lt;br&gt;
Rust is powerful at solving "is this code safe?".&lt;br&gt;
However, Archright addresses an earlier question:&lt;br&gt;
"is this code correct in the first place?"&lt;/p&gt;

&lt;p&gt;We chose a language that allows this question to be handled before compile-time output.&lt;br&gt;
Intent and constraints are verified first, before they are converted into code.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Formal Verification: Mathematical Proof
&lt;/h3&gt;

&lt;p&gt;Trying to block probabilistic failure with another probabilistic tool is mathematically hollow.&lt;br&gt;
The emerging pattern of AI code review (such as Claude Review) is still a 99%-accurate AI inspecting code produced by another 99%-accurate AI.&lt;br&gt;
In that setup, accuracy may rise to 99.99%, but it can never become 100%.&lt;/p&gt;

&lt;p&gt;Most security incidents emerge precisely from that neglected 0.01% gap.&lt;br&gt;
In engineering, "almost certain" is a synonym for "not certain."&lt;br&gt;
Archright internalizes mathematical verification and authorization tools such as Z3 Solver, Lean 4, and Cedar in its engine.&lt;br&gt;
Instead of probabilistic comfort—"tests passed"—it proves "no exception exists" mathematically and frees engineers from review labor.&lt;/p&gt;

&lt;p&gt;This verification is not a post-hoc review.&lt;br&gt;
In Archright, it runs in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert intent into verifiable rules&lt;/li&gt;
&lt;li&gt;Validate those rules hold across all states&lt;/li&gt;
&lt;li&gt;Search automatically for counterexamples that break the rules&lt;/li&gt;
&lt;li&gt;Stop code generation when a counterexample is found&lt;/li&gt;
&lt;li&gt;Generate code only when all constraints hold&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, take the security condition "A user must only be allowed to read their own data." It means that,&lt;br&gt;
→ regardless of the incoming request,&lt;br&gt;
a user must only ever read their own data.&lt;/p&gt;

&lt;p&gt;If a request is sent with another user's ID:&lt;br&gt;
→ it is immediately detected as a counterexample and code generation is stopped.&lt;br&gt;
→ if even one violating case exists, that logic is not generated.&lt;/p&gt;

&lt;p&gt;In short, we do not "catch bugs."&lt;br&gt;
We create a state where bugs cannot exist.&lt;/p&gt;
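&lt;p&gt;The counterexample search above can be sketched in miniature. This toy Python brute force (a stand-in for what Z3 or Cedar do symbolically, not Archright's actual encoding) shows the decision rule: generate only when the search comes back empty.&lt;/p&gt;

```python
# Toy counterexample search for "a user may only read their own data".
from itertools import product

def allowed(requester_id, resource_owner_id):
    # Candidate policy under test. A buggy version (e.g. always True)
    # would be rejected by the search below.
    return requester_id == resource_owner_id

def find_counterexample(user_ids):
    """Return the first (requester, owner) pair violating the rule, else None."""
    for requester, owner in product(user_ids, repeat=2):
        if allowed(requester, owner) and requester != owner:
            return (requester, owner)   # access granted across users: violation
    return None

# No counterexample exists for the correct policy, so generation may proceed.
print(find_counterexample(range(100)))  # None
```

&lt;p&gt;With a symbolic solver the same check covers all IDs at once instead of a finite range, which is what turns "no counterexample found" into a proof.&lt;/p&gt;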



&lt;h2&gt;
  
  
  Reclaiming the Joy of Building
&lt;/h2&gt;

&lt;p&gt;On top of these three pillars, engineers are no longer janitors cleaning up AI-generated trash.&lt;br&gt;
They step away from tedious debugging and communication bottlenecks, and return as builders focused only on business-logic design and creative architecture.&lt;/p&gt;

&lt;p&gt;AI writes code.&lt;br&gt;
Archright creates a state where that code cannot be wrong.&lt;/p&gt;

&lt;p&gt;That is exactly the new software-engineering standard Archright proposes, and the reason I began this journey.&lt;/p&gt;

</description>
      <category>devlog</category>
      <category>ai</category>
      <category>formal</category>
      <category>verification</category>
    </item>
    <item>
      <title>In the Age of Probabilistic Intelligence, a Thirst for Deterministic Systems</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 17 Mar 2026 12:56:43 +0000</pubDate>
      <link>https://forem.com/wintrover/in-the-age-of-probabilistic-intelligence-a-thirst-for-deterministic-systems-294i</link>
      <guid>https://forem.com/wintrover/in-the-age-of-probabilistic-intelligence-a-thirst-for-deterministic-systems-294i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuxs43xjtpmekcnen8ie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuxs43xjtpmekcnen8ie.png" alt=" " width="800" height="1199"&gt;&lt;/a&gt;&lt;br&gt;
(I'm not as handsome as this guy; the image is AI generated, lol)&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;"AI is not trustworthy. Go back to coding by hand."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was a short sentence. But it perfectly captured the contradiction I had been living in.&lt;/p&gt;

&lt;p&gt;As a product engineer, I spent my time turning business priorities into shipped software. Speed mattered. Quality mattered. What became intolerable was watching an organization demand both while rejecting the engineering discipline required to make both possible.&lt;/p&gt;

&lt;p&gt;The market asks for faster delivery every quarter. But users allow almost no defects. Any engineer knows this combination is nearly unsustainable if you keep relying on grinding human labor. That is exactly why I believed AI was no longer optional. Even with probabilistic tools, we can still enforce deterministic reliability by filtering output through explicit control procedures. The core issue is not whether we use tools, but whether we build control systems.&lt;/p&gt;

&lt;p&gt;My organization chose the opposite path.&lt;/p&gt;

&lt;p&gt;Instead of building a control plane for AI usage, it framed AI itself as the problem. Disappointment from unskilled usage became evidence against the tool, not evidence of a missing system. The operating model became painfully clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep aggressive output targets.&lt;/li&gt;
&lt;li&gt;Remove the highest-leverage tool.&lt;/li&gt;
&lt;li&gt;Fill the gap with overtime and manual repetition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not engineering. It is deferred cost.&lt;/p&gt;

&lt;p&gt;This is exactly where my anger came from. Probabilistic output is supposed to be noisy. That is why deterministic quality gates exist.&lt;/p&gt;

&lt;p&gt;What I argued for was not ideology. It was process design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-commit hooks to block policy violations before review&lt;/li&gt;
&lt;li&gt;Procedure gates combining static analysis and tests before merge&lt;/li&gt;
&lt;li&gt;Formal verification checks for critical intent-to-implementation consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Banning tools can look like control, but in practice it often means giving up control. If you refuse to build systems and rely on labor intensity instead, quality degrades and people burn out at the same time.&lt;/p&gt;
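&lt;p&gt;A deterministic gate of this kind is small to build. Here is a minimal Python sketch (commands are placeholders; substitute your linter and test suite) of a fail-closed runner that a pre-commit hook or merge pipeline could invoke:&lt;/p&gt;

```python
# Sketch of a deterministic procedure gate: run every check, fail closed if
# any check fails. The check commands below are placeholders.
import subprocess
import sys

CHECKS = [
    [sys.executable, "-c", "print('lint ok')"],    # placeholder: static analysis
    [sys.executable, "-c", "print('tests ok')"],   # placeholder: test suite
]

def run_gate(checks) -> bool:
    """A commit or merge is allowed only when every check exits 0."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            return False  # fail closed on the first broken check
    return True

ok = run_gate(CHECKS)
print("gate passed" if ok else "gate failed")
```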

&lt;p&gt;I could no longer treat that as normal.&lt;/p&gt;

&lt;p&gt;My resignation was not an escape. It was a decision. I did not leave because AI is imperfect. I left because I could not align with a culture that refused to build deterministic safeguards around probabilistic intelligence.&lt;/p&gt;

&lt;p&gt;That decision is what led to Archright.&lt;/p&gt;

&lt;p&gt;I am not building just another app. I am rebuilding the production process itself:&lt;/p&gt;

&lt;p&gt;how probabilistic intelligence is disciplined into deterministic integrity, how thought trajectory survives implementation without leaking intent, and how this becomes a repeatable system instead of a heroic individual effort.&lt;/p&gt;

&lt;p&gt;So this is not just a resignation story. As an AI tsunami approaches, this is the first page of a journey to rebuild software engineering that has lost its direction.&lt;/p&gt;

</description>
      <category>devlog</category>
    </item>
    <item>
      <title>From Celery/Redis to Temporal: A Journey Toward Idempotency and Reliable Workflows</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:20:33 +0000</pubDate>
      <link>https://forem.com/wintrover/from-celeryredis-to-temporal-a-journey-toward-idempotency-and-reliable-workflows-k1i</link>
      <guid>https://forem.com/wintrover/from-celeryredis-to-temporal-a-journey-toward-idempotency-and-reliable-workflows-k1i</guid>
      <description>&lt;p&gt;When handling asynchronous tasks in distributed systems, the combination of Celery and Redis is often the go-to choice. I also chose Celery for the initial design of my KYC (Know Your Customer) orchestrator due to its familiarity. However, as the service grew in complexity, I hit a massive wall: guaranteeing &lt;strong&gt;idempotency&lt;/strong&gt; and managing &lt;strong&gt;complex states&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post, I want to share my technical journey of why I moved away from Celery to Temporal and how I ensured idempotency during that process.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Limitations of Celery/Redis: Why Was Change Necessary?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Difficulties in Idempotency Management
&lt;/h3&gt;

&lt;p&gt;While Celery is excellent for "fire and forget" tasks, retries triggered by network failures or worker crashes carry a high risk of duplicate execution. For face recognition tasks that consume significant GPU resources, duplicate execution was a critical problem in terms of both cost and performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fragmentation of State
&lt;/h3&gt;

&lt;p&gt;The KYC process follows this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User uploads an ID card image.&lt;/li&gt;
&lt;li&gt;User uploads a selfie video.&lt;/li&gt;
&lt;li&gt;Compare face similarity once both files exist.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp65ru8vmnc53elxiud11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp65ru8vmnc53elxiud11.png" alt="Celery Sequence Diagram" width="760" height="672"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a Celery environment, since I didn't know when images and videos would be uploaded, I needed complex logic to query the DB every time or store intermediate states in Redis. The logic to check "Are all files collected?" was scattered across multiple places, making maintenance difficult.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Introducing Temporal: A Paradigm Shift in Orchestration
&lt;/h2&gt;

&lt;p&gt;Temporal is not just a message queue; it's a &lt;strong&gt;stateful workflow&lt;/strong&gt; engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow Logic Must Be "Deterministic"
&lt;/h3&gt;

&lt;p&gt;Since Temporal workflow code is based on the premise of "Replay," it must always produce the same sequence of workflow API calls for the same input and history. Therefore, you should not directly perform "external-world-dependent operations" like network I/O, file I/O, system time (e.g., &lt;code&gt;DateTime.now&lt;/code&gt;), randomness, or threading within a workflow. These side effects should be pushed to activities, while the workflow focuses solely on orchestration.&lt;/p&gt;

&lt;p&gt;Official Docs: &lt;a href="https://docs.temporal.io/develop/python/core-application#workflow-logic-requirements" rel="noopener noreferrer"&gt;https://docs.temporal.io/develop/python/core-application#workflow-logic-requirements&lt;/a&gt;&lt;/p&gt;
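&lt;p&gt;To see why this rule exists, here is a pure-Python toy (not the Temporal SDK) that mimics replay: activity results are recorded into a history on the first run, and a replay consumes the recorded history instead of re-running side effects. If the workflow body called &lt;code&gt;datetime.now()&lt;/code&gt; or &lt;code&gt;random()&lt;/code&gt; directly, the replayed run could diverge from the recorded one:&lt;/p&gt;

```python
class ToyReplayer:
    """Toy event-history replayer (illustration only, not the Temporal SDK)."""

    def __init__(self, history=None):
        self.history = list(history) if history else []
        self.cursor = 0

    def execute_activity(self, fn):
        # First execution runs the activity and records its result;
        # a replay returns the recorded result without re-running the side effect.
        if self.cursor < len(self.history):
            result = self.history[self.cursor]
        else:
            result = fn()
            self.history.append(result)
        self.cursor += 1
        return result

def toy_workflow(ctx):
    # Deterministic orchestration: all side effects go through activities.
    a = ctx.execute_activity(lambda: 40)  # stands in for a real activity call
    b = ctx.execute_activity(lambda: 2)
    return a + b

first = ToyReplayer()
result = toy_workflow(first)
replayed = toy_workflow(ToyReplayer(history=first.history))
assert result == replayed  # replay reproduces the same outcome from history
```

&lt;p&gt;This is the property Temporal relies on: as long as the workflow body is deterministic, the engine can reconstruct its exact state from the event history after any crash.&lt;/p&gt;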

&lt;h3&gt;
  
  
  Workflow-Centric Design
&lt;/h3&gt;

&lt;p&gt;The first thing that changed after introducing Temporal was the visibility of business logic. &lt;code&gt;FaceSimilarityWorkflow&lt;/code&gt; now gracefully waits until files are ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b0453k5kqp5i1dnpi4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b0453k5kqp5i1dnpi4j.png" alt="Temporal Workflow Diagram" width="760" height="732"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Core logic of FaceSimilarityWorkflow
&lt;/span&gt;&lt;span class="nd"&gt;@workflow.run&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SimilarityData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SimilarityResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Wait up to 1 hour until both image and video are collected
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_files&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute GPU activity once all files are ready
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_activity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;check_face_similarity_activity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;activity_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;retry_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RetryPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maximum_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code uses &lt;code&gt;workflow.wait_condition&lt;/code&gt; to suspend the workflow until the condition is met without blocking the event loop. In Celery, this would have required complex polling or webhook logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Idempotency Strategy: Building a Double Defense
&lt;/h2&gt;

&lt;p&gt;Even with Temporal, idempotency at the activity level remains crucial. I established a double defense strategy as follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Temporal's Basic Guarantee
&lt;/h3&gt;

&lt;p&gt;Temporal records the progress of a workflow as event history. Therefore, even if a worker restarts, it resumes exactly from the last successful point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Explicit Checks within Activities
&lt;/h3&gt;

&lt;p&gt;Since Temporal activities follow an "at-least-once" execution model, an activity might be retried if a worker crashes after successfully performing it but before notifying the server. Thus, the official documentation strongly recommends making activities idempotent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97qvv4odbem80lut1gcr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97qvv4odbem80lut1gcr.png" alt="Activity Idempotency Diagram" width="760" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Official Docs: &lt;a href="https://docs.temporal.io/develop/python/error-handling#make-activities-idempotent" rel="noopener noreferrer"&gt;https://docs.temporal.io/develop/python/error-handling#make-activities-idempotent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, I use the following two together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For external system calls, pass an &lt;strong&gt;idempotency key&lt;/strong&gt; combined from the workflow execution and activity identifiers.&lt;/li&gt;
&lt;li&gt;Internally, use unique keys (or check for existing results) in the DB to prevent &lt;strong&gt;duplicate storage/processing&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@activity.defn&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_face_similarity_activity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SimilarityData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SimilarityResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workflow_run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activity_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;get_db_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FaceSimilarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FaceSimilarity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SimilarityResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Already processed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform actual GPU-intensive work...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Results: What Has Changed?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comparison Item&lt;/th&gt;
&lt;th&gt;Celery/Redis Based&lt;/th&gt;
&lt;th&gt;Temporal Based&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual storage in DB/Redis&lt;/td&gt;
&lt;td&gt;Automatically managed by engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry Strategy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual exponential backoff&lt;/td&gt;
&lt;td&gt;Declarative Retry Policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Must dig through logs&lt;/td&gt;
&lt;td&gt;Check history in Temporal UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Idempotency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very difficult to guarantee&lt;/td&gt;
&lt;td&gt;Structurally achievable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The transition from Celery to Temporal was not just about changing tools; it was about changing &lt;strong&gt;how I define business processes in code&lt;/strong&gt;. Especially in financial/authentication systems where idempotency is paramount, Temporal provided irreplaceable stability.&lt;/p&gt;

&lt;p&gt;If you are losing sleep over complex asynchronous logic and idempotency issues, I strongly recommend migrating to Temporal.&lt;/p&gt;

</description>
      <category>smbholdings</category>
    </item>
    <item>
      <title>Testing in the Age of AI Agents: How I Kept QA from Collapsing</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:20:26 +0000</pubDate>
      <link>https://forem.com/wintrover/testing-in-the-age-of-ai-agents-how-i-kept-qa-from-collapsing-2ld9</link>
      <guid>https://forem.com/wintrover/testing-in-the-age-of-ai-agents-how-i-kept-qa-from-collapsing-2ld9</guid>
      <description>&lt;p&gt;AI agents changed my development tempo overnight. I can ship more code in a day than I used to in a week, and that sounds great until the first time a tiny edge case takes down an entire flow.&lt;/p&gt;

&lt;p&gt;At that speed, QA becomes either a competitive advantage or a constant fire drill. I chose the first option, and I rebuilt my testing approach in &lt;code&gt;d:\Coding\Company\Ochestrator&lt;/code&gt; around a small set of test design techniques that scale with code volume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TDD&lt;/li&gt;
&lt;li&gt;EP-BVA (Equivalence Partitioning + Boundary Value Analysis)&lt;/li&gt;
&lt;li&gt;Pairwise (Combinatorial Testing)&lt;/li&gt;
&lt;li&gt;State Transition Testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3ehxris3spmjh41er9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3ehxris3spmjh41er9b.png" alt="Testing Toolbox Diagram" width="760" height="485"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Why I Needed “Test Design,” Not Just “More Tests”
&lt;/h2&gt;

&lt;p&gt;When code volume grows, the problem is not only “coverage.” The real problem is that the space of possible inputs and states grows faster than my time.&lt;/p&gt;

&lt;p&gt;So I stopped asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Did I write tests for this function?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I started asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Did I select test cases that actually represent the failure surface?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That mindset is what pushed me toward structured test design techniques.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. TDD: Design for Testability from Day One
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Principle:&lt;/strong&gt; TDD (Test-Driven Development) flips the traditional "write code, then test" workflow. It follows the &lt;strong&gt;Red-Green-Refactor&lt;/strong&gt; cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt;: Write a test for a new requirement and watch it fail. This confirms the test actually checks something and that the requirement isn't already met.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt;: Write the minimal amount of code to make the test pass. Avoid "over-engineering" at this stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactor&lt;/strong&gt;: Clean up the code while ensuring the tests stay green.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In Orchestrator:&lt;/strong&gt;&lt;br&gt;
Since AI agents can generate complex business logic rapidly, I used TDD to ensure that the logic was &lt;em&gt;testable&lt;/em&gt; by design. For example, when implementing the &lt;code&gt;RetryPolicy&lt;/code&gt; for our Temporal workflows, I started with the test cases for exponential backoff before writing a single line of the policy logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified TDD Example for Retry Logic
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_retry_interval_calculation&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExponentialRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 1st attempt: 1.0s
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="c1"&gt;# 2nd attempt: 2.0s
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
    &lt;span class="c1"&gt;# Capped at 10.0s
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forced me to separate the &lt;em&gt;calculation&lt;/em&gt; of delays from the &lt;em&gt;execution&lt;/em&gt; of the retry, making the system modular and robust.&lt;/p&gt;
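&lt;p&gt;For completeness, here is a hypothetical "Green" step that satisfies the test above, assuming the usual &lt;code&gt;base * 2^(attempt - 1)&lt;/code&gt; formula capped at &lt;code&gt;max_delay&lt;/code&gt; (the real policy logic has more to it, e.g. jitter):&lt;/p&gt;

```python
class ExponentialRetry:
    """Minimal delay calculator: base * 2**(attempt - 1), capped at max_delay."""

    def __init__(self, base_delay: float, max_delay: float):
        self.base_delay = base_delay
        self.max_delay = max_delay

    def get_delay(self, attempt: int) -> float:
        # attempt is 1-based: attempt 1 waits base_delay, attempt 2 waits 2x, ...
        return min(self.base_delay * 2 ** (attempt - 1), self.max_delay)

policy = ExponentialRetry(base_delay=1.0, max_delay=10.0)
assert policy.get_delay(attempt=1) == 1.0
assert policy.get_delay(attempt=2) == 2.0
assert policy.get_delay(attempt=10) == 10.0  # capped
```

&lt;p&gt;Because the calculation is a pure function, the test needs no mocks, sleeps, or network; that is exactly the testability TDD forced into the design.&lt;/p&gt;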




&lt;h2&gt;
  
  
  3. EP-BVA: Efficiency through Mathematical Selection
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Principle:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Equivalence Partitioning (EP)&lt;/strong&gt;: Instead of testing every possible value, you divide the input domain into groups (partitions) where the system is expected to behave identically. You only need to test one value from each group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundary Value Analysis (BVA)&lt;/strong&gt;: Bugs often hide at the "edges" of these partitions. BVA focuses on testing the exact boundaries, and values just inside and outside of them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In Orchestrator:&lt;/strong&gt;&lt;br&gt;
When handling user-uploaded files, we have strict size limits (e.g., 1MB to 10MB).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partitions&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Invalid (&amp;lt; 1MB)&lt;/li&gt;
&lt;li&gt;Valid (1MB - 10MB)&lt;/li&gt;
&lt;li&gt;Invalid (&amp;gt; 10MB)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BVA Points&lt;/strong&gt;: 0.99MB, 1.0MB, 1.01MB, 9.99MB, 10.0MB, 10.01MB.&lt;/li&gt;
&lt;/ul&gt;
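&lt;p&gt;Those six BVA points translate almost mechanically into a table-driven test. The validator below is a stand-in for the real upload check, not the production function:&lt;/p&gt;

```python
def is_valid_size(size_mb: float) -> bool:
    """Hypothetical stand-in for the upload validator (1MB to 10MB inclusive)."""
    return 1.0 <= size_mb <= 10.0

# One case per boundary: just outside, exactly on, and just inside each edge.
cases = [
    (0.99, False),   # below lower boundary
    (1.0, True),     # on lower boundary
    (1.01, True),    # just inside
    (9.99, True),    # just inside
    (10.0, True),    # on upper boundary
    (10.01, False),  # above upper boundary
]
for size, expected in cases:
    assert is_valid_size(size) is expected
```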

&lt;p&gt;A critical real-world example I applied was the &lt;strong&gt;72-byte limit of bcrypt&lt;/strong&gt;. Many developers don't realize that bcrypt silently ignores everything after the first 72 bytes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/backend/tests/test_auth_service.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_password_length_boundaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Boundary: 72 bytes
&lt;/span&gt;    &lt;span class="n"&gt;p72&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;72&lt;/span&gt;
    &lt;span class="n"&gt;h72&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_password_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p72&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Just above the boundary: 73 bytes
&lt;/span&gt;    &lt;span class="n"&gt;p73&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p72&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Bcrypt will treat p73 the same as p72 if only the first 72 bytes are used
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_password&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h72&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By focusing on these specific points, I reduced hundreds of potential test cases to just 6-10 highly effective ones.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Pairwise: Taming the Combinatorial Explosion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Principle:&lt;/strong&gt; Most bugs are caused by either a single input parameter or the interaction between &lt;em&gt;two&lt;/em&gt; parameters. &lt;strong&gt;Pairwise Testing&lt;/strong&gt; is a combinatorial method that ensures every possible pair of input parameters is tested at least once. This drastically reduces the number of test cases while maintaining high defect detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Orchestrator:&lt;/strong&gt;&lt;br&gt;
Our AI Inference engine has multiple configuration axes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution Provider&lt;/strong&gt;: &lt;code&gt;[CUDA, CPU, OpenVINO]&lt;/code&gt; (3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Size&lt;/strong&gt;: &lt;code&gt;[Small, Medium, Large]&lt;/code&gt; (3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantization&lt;/strong&gt;: &lt;code&gt;[INT8, FP16, FP32]&lt;/code&gt; (3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async Mode&lt;/strong&gt;: &lt;code&gt;[Enabled, Disabled]&lt;/code&gt; (2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total combinations: $3 \times 3 \times 3 \times 2 = 54$ cases.&lt;br&gt;
Using Pairwise, we can cover all interactions between any two settings in roughly &lt;strong&gt;12-15 cases&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using allpairspy to generate the matrix
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;allpairspy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AllPairs&lt;/span&gt;

&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenVINO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INT8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FP16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FP32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;combo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AllPairs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Test Case &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;combo&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows us to maintain high confidence in our hardware compatibility matrix without running the full 54-case suite on every PR.&lt;/p&gt;
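&lt;p&gt;If you want to see where the reduction comes from without a library, the core idea fits in a few lines: greedily pick the case that covers the most still-uncovered pairs. (This is an illustration only; dedicated tools like allpairspy produce tighter suites.)&lt;/p&gt;

```python
from itertools import combinations, product

# The four configuration axes from above.
parameters = [
    ["CUDA", "CPU", "OpenVINO"],
    ["Small", "Medium", "Large"],
    ["INT8", "FP16", "FP32"],
    ["Enabled", "Disabled"],
]

def pairs_of(combo):
    """All (axis-index, value) pairs a single test case covers."""
    return set(combinations(enumerate(combo), 2))

all_combos = list(product(*parameters))  # 54 exhaustive cases
uncovered = set().union(*(pairs_of(c) for c in all_combos))

suite = []
while uncovered:
    # Greedily pick the case covering the most still-uncovered pairs.
    best = max(all_combos, key=lambda c: len(pairs_of(c) & uncovered))
    suite.append(best)
    uncovered -= pairs_of(best)

# Every pair is covered by far fewer than 54 cases.
print(f"{len(all_combos)} exhaustive cases reduced to {len(suite)} pairwise cases")
```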




&lt;h2&gt;
  
  
  5. State Transition Testing: Mapping the Life of a Process
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Principle:&lt;/strong&gt; This technique is used when the system's behavior depends on its current state and the events that occur. We map out a &lt;strong&gt;State Transition Diagram&lt;/strong&gt; and ensure that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All valid transitions are possible.&lt;/li&gt;
&lt;li&gt;All invalid transitions are properly blocked (Negative Testing).&lt;/li&gt;
&lt;li&gt;The system ends in the correct final state.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In Orchestrator:&lt;/strong&gt;&lt;br&gt;
The KYC (Know Your Customer) verification workflow is a complex state machine. A user's document moves through:&lt;br&gt;
&lt;code&gt;PENDING&lt;/code&gt; $\rightarrow$ &lt;code&gt;UPLOADING&lt;/code&gt; $\rightarrow$ &lt;code&gt;PROCESSING&lt;/code&gt; $\rightarrow$ &lt;code&gt;VERIFIED&lt;/code&gt; or &lt;code&gt;REJECTED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I implemented tests to ensure a &lt;code&gt;REJECTED&lt;/code&gt; document cannot suddenly jump to &lt;code&gt;VERIFIED&lt;/code&gt; without going through &lt;code&gt;PROCESSING&lt;/code&gt; again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/backend/tests/test_integration_kyc_workflow.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_invalid_state_transitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow_engine&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;workflow_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ImageStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REJECTED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This should be blocked by the business logic
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IllegalStateError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;workflow_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ImageStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VERIFIED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
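&lt;p&gt;A minimal sketch of the state machine such a test exercises. The state names mirror the workflow above, but the engine class and its transition table are hypothetical reconstructions, not the project's actual implementation:&lt;/p&gt;

```python
# Hypothetical sketch of the KYC state machine; state names mirror the
# article, the engine itself is a reconstruction for illustration.
from enum import Enum, auto

class ImageStatus(Enum):
    PENDING = auto()
    UPLOADING = auto()
    PROCESSING = auto()
    VERIFIED = auto()
    REJECTED = auto()

class IllegalStateError(Exception):
    pass

# allowed transitions; everything absent from this table is invalid
ALLOWED = {
    ImageStatus.PENDING: {ImageStatus.UPLOADING},
    ImageStatus.UPLOADING: {ImageStatus.PROCESSING},
    ImageStatus.PROCESSING: {ImageStatus.VERIFIED, ImageStatus.REJECTED},
    ImageStatus.REJECTED: {ImageStatus.PROCESSING},  # re-verification only
    ImageStatus.VERIFIED: set(),                     # terminal state
}

class WorkflowEngine:
    def __init__(self):
        self.state = ImageStatus.PENDING

    def set_state(self, state):
        self.state = state

    def transition_to(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise IllegalStateError(f"{self.state.name} to {new_state.name}")
        self.state = new_state

# negative sweep: every transition missing from ALLOWED must raise
engine = WorkflowEngine()
blocked = 0
for src in ImageStatus:
    for dst in ImageStatus:
        if dst not in ALLOWED[src]:
            engine.set_state(src)
            try:
                engine.transition_to(dst)
            except IllegalStateError:
                blocked = blocked + 1
print(blocked, "invalid transitions blocked")
```

&lt;p&gt;Driving the negative cases from the transition table itself means new states automatically extend the invalid-transition sweep.&lt;/p&gt;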



&lt;p&gt;This is crucial for AI agents that might try to "short-circuit" logic. By strictly testing the state machine, we ensure the integrity of the entire business process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the AI-agent era, code is cheap. Trust is not.&lt;/p&gt;

&lt;p&gt;What kept my QA from collapsing was not writing more tests, but adopting test design techniques that scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TDD for fast feedback and safer refactors&lt;/li&gt;
&lt;li&gt;EP-BVA to systematize edge cases&lt;/li&gt;
&lt;li&gt;Pairwise to tame combinatorial growth&lt;/li&gt;
&lt;li&gt;State Transition Testing to validate real workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the testing toolbox I expect to keep using as my code volume keeps accelerating.&lt;/p&gt;

</description>
      <category>smbholdings</category>
    </item>
    <item>
      <title>The Pitfalls of Test Coverage: Introducing Mutation Testing with Stryker and Cosmic Ray</title>
      <dc:creator>wintrover</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:20:19 +0000</pubDate>
      <link>https://forem.com/wintrover/the-pitfalls-of-test-coverage-introducing-mutation-testing-with-stryker-and-cosmic-ray-75</link>
      <guid>https://forem.com/wintrover/the-pitfalls-of-test-coverage-introducing-mutation-testing-with-stryker-and-cosmic-ray-75</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Overcome the limitations of code-coverage metrics by introducing mutation testing, which verifies whether the tests actually catch errors in business logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope&lt;/strong&gt;: Core modules of the enterprise orchestrator project (&lt;code&gt;Orchestrator&lt;/code&gt;) in both Frontend (TypeScript) and Backend (Python).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected Results&lt;/strong&gt;: Improve code stability and test reliability by tracking a 'Mutation Score' in addition to simple line coverage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We often believe that high test coverage means safe code. However, it's difficult to answer the question: &lt;strong&gt;"Who tests the tests?"&lt;/strong&gt; Tests that simply execute code without proper assertions still contribute to coverage metrics. To solve this 'coverage trap', we introduced mutation testing.&lt;/p&gt;
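&lt;p&gt;A contrived illustration of the trap (the function and tests are hypothetical, not from the project): the first test executes every line of &lt;code&gt;discount&lt;/code&gt; and so reports 100% line coverage, yet it asserts nothing, so any mutation of the logic would still pass.&lt;/p&gt;

```python
# Hypothetical example of the 'coverage trap': execution without assertions.
def discount(price, is_member):
    if is_member:
        return price * 0.9
    return price

def test_discount_runs():
    # counts toward line coverage, but a mutant returning price * 0.5
    # (or dropping the branch entirely) would still pass this test
    discount(100, True)
    discount(100, False)

def test_discount_asserts():
    # this version actually pins the behavior and kills such mutants
    assert round(discount(100, True), 6) == 90.0
    assert discount(100, False) == 100
```

&lt;p&gt;Mutation testing distinguishes these two tests automatically: the first kills no mutants, the second does.&lt;/p&gt;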

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wlbq4trux8zs989bc0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wlbq4trux8zs989bc0m.png" alt="Mutation Testing Flow" width="760" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. TypeScript Environment: Introducing Stryker Mutator
&lt;/h3&gt;

&lt;p&gt;For the TypeScript environment, including frontend and common utilities, we chose &lt;a href="https://stryker-mutator.io/" rel="noopener noreferrer"&gt;Stryker&lt;/a&gt;. It integrates well with Vitest and is easy to configure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tech Stack&lt;/strong&gt;: TypeScript, Vitest, Stryker Mutator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Configuration (&lt;code&gt;stryker.config.json&lt;/code&gt;)&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"testRunner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vitest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reporters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"html"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clear-text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"progress"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"concurrency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"incremental"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mutate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"src/utils/**/*.ts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"src/services/**/*.ts"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We enabled the &lt;code&gt;incremental&lt;/code&gt; option so that subsequent runs re-test only the mutants affected by changed files.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Python Environment: Introducing Cosmic Ray
&lt;/h3&gt;

&lt;p&gt;For the backend environment, we introduced &lt;a href="https://github.com/sixty-north/cosmic-ray" rel="noopener noreferrer"&gt;Cosmic Ray&lt;/a&gt;. It generates powerful mutations by manipulating the AST (Abstract Syntax Tree) using Python's dynamic nature.&lt;/p&gt;
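&lt;p&gt;To see what "manipulating the AST" means in practice, here is a toy sketch of the idea using the standard &lt;code&gt;ast&lt;/code&gt; module. Cosmic Ray's real operator set is far richer; this only flips one arithmetic operator, one of the classic mutation operators:&lt;/p&gt;

```python
# Toy sketch of AST-level mutation (cosmic-ray's real operators are richer):
# parse a function, flip '+' to '-', and re-compile the mutant.
import ast

SOURCE = """
def total(price, fee):
    return price + fee
"""

class FlipAdd(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()  # the mutation: addition becomes subtraction
        return node

tree = ast.parse(SOURCE)
mutant_tree = FlipAdd().visit(tree)
ast.fix_missing_locations(mutant_tree)

ns = {}
exec(compile(mutant_tree, "mutant", "exec"), ns)
mutant_total = ns["total"]

# the original total(10, 5) returns 15; the mutant returns 5. A test
# asserting the value 'kills' this mutant; a test merely calling it does not.
print(mutant_total(10, 5))
```

&lt;p&gt;A mutant "survives" when the full test suite still passes against the mutated module, which is exactly the signal that an assertion is missing.&lt;/p&gt;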

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tech Stack&lt;/strong&gt;: Python, Pytest, Cosmic Ray, Docker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Architecture&lt;/strong&gt;: Since mutation testing consumes significant computational resources, we configured it to run in parallel across multiple workers using Docker.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="c1"&gt;# Partial docker-compose.test.yaml&lt;/span&gt;
  &lt;span class="na"&gt;cosmic-worker-1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv run cosmic-ray worker cosmic.sqlite&lt;/span&gt;
  &lt;span class="na"&gt;cosmic-runner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;cosmic-worker-1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cosmic-worker-2&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;uv run cosmic-ray init cosmic-ray.toml cosmic.sqlite&lt;/span&gt;
      &lt;span class="s"&gt;uv run cosmic-ray exec cosmic-ray.toml cosmic.sqlite&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Debugging/Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-world Case: Survived Mutants in &lt;code&gt;VideoSplitter.ts&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The most interesting case was &lt;code&gt;VideoSplitter.ts&lt;/code&gt;, which handles video splitting. The file had over 95% line coverage, yet Stryker revealed surprising results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem Statement&lt;/strong&gt;:
A large number of mutants survived in the logic that checks available memory.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="c1"&gt;// Original Code&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;availableMemory&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;requiredMemory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Insufficient memory.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even when Stryker changed this code to &lt;code&gt;if (false)&lt;/code&gt; or &lt;code&gt;if (availableMemory &amp;lt;= requiredMemory)&lt;/code&gt;, all existing tests &lt;strong&gt;PASSED&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root Cause Analysis&lt;/strong&gt;:&lt;br&gt;
Existing tests focused only on "whether an error occurs," missing boundary value tests for exactly which conditions trigger the error. In other words, coverage was high, but the actual logic wasn't being thoroughly verified.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;:&lt;br&gt;
To 'kill' the surviving mutants, we reinforced the test cases with boundary value analysis.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Boundary value verification for memory&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Simulate situations where memory is exactly equal to or slightly less than requiredMemory&lt;/span&gt;
    &lt;span class="c1"&gt;// ... reinforced test code ...&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
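&lt;p&gt;The reinforced test code above is elided in the article. As a hedged sketch, here is what boundary-value cases that kill both surviving mutants could look like, ported to Python (the &lt;code&gt;check_memory&lt;/code&gt; guard is a hypothetical translation of the TypeScript original):&lt;/p&gt;

```python
# Hypothetical Python port of the memory guard, plus the two boundary
# cases that kill the surviving mutants described above.
import operator

def check_memory(available, required):
    # operator.lt(a, b) is the same less-than comparison as the original
    if operator.lt(available, required):
        raise MemoryError("Insufficient memory.")

# Case 1: just below the boundary must raise.
# This kills the 'if (false)' mutant, which never raises.
try:
    check_memory(1023, 1024)
    raised = False
except MemoryError:
    raised = True
assert raised

# Case 2: exactly at the boundary must NOT raise.
# This kills the less-than-or-equal mutant, which wrongly rejects equality.
check_memory(1024, 1024)
```

&lt;p&gt;The point is that each surviving mutant names the exact input the suite was missing; boundary value analysis turns that hint into a test.&lt;/p&gt;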



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Achievements&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Discovered and removed 12 &lt;strong&gt;Survived Mutants&lt;/strong&gt; in core utility modules.&lt;/li&gt;
&lt;li&gt;Elevated test code from simply 'executing' code to truly 'verifying' it.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Key Metrics&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mutation Score&lt;/strong&gt;: Improved from an initial 62% to 88%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: Prevented potential regression bugs by running &lt;code&gt;test:mutation&lt;/code&gt; scripts before deployment.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;User Feedback&lt;/strong&gt;: Positive reactions from team members: "I can now refactor with confidence, trusting our tests."&lt;/li&gt;

&lt;/ul&gt;
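&lt;p&gt;For reference, the mutation score is simply the share of generated mutants the suite detects. The raw counts below are illustrative only; the article reports scores, not the project's actual mutant totals:&lt;/p&gt;

```python
# Mutation score: killed mutants over all evaluated mutants.
# Simplified: real tools also track timeouts and no-coverage mutants.
def mutation_score(killed, survived):
    return killed / (killed + survived)

# illustrative counts only, chosen to reproduce an 88% score
print(round(100 * mutation_score(killed=88, survived=12)))
```
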

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coverage is just the beginning&lt;/strong&gt;: Line coverage only tells you 'what is not tested,' not the 'quality of what is tested.'&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutation testing is expensive but worth it&lt;/strong&gt;: Although it takes time (up to tens of minutes for full execution), it's essential for core business logic or complex utilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Adoption&lt;/strong&gt;: Rather than applying it to all code, it's important to build success stories by starting with core infrastructure code like &lt;code&gt;VideoSplitter&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;





</description>
      <category>smbholdings</category>
    </item>
  </channel>
</rss>
