<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: sk8ordie84</title>
    <description>The latest articles on Forem by sk8ordie84 (@sk8ordie84).</description>
    <link>https://forem.com/sk8ordie84</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3889537%2Fa6734c67-3985-462e-a16e-4a5dc086772f.png</url>
      <title>Forem: sk8ordie84</title>
      <link>https://forem.com/sk8ordie84</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sk8ordie84"/>
    <language>en</language>
    <item>
      <title>I implemented PRML in two languages. Three things broke that the spec didn't warn about.</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Fri, 01 May 2026 19:50:42 +0000</pubDate>
      <link>https://forem.com/sk8ordie84/i-implemented-prml-in-two-languages-three-things-broke-that-the-spec-didnt-warn-about-2j62</link>
      <guid>https://forem.com/sk8ordie84/i-implemented-prml-in-two-languages-three-things-broke-that-the-spec-didnt-warn-about-2j62</guid>
      <description>&lt;p&gt;PRML v0.1 is a small specification I drafted three weeks ago. It binds an ML evaluation claim — &lt;em&gt;(metric, comparator, threshold, dataset hash, random seed, producer)&lt;/em&gt; — to a SHA-256 digest computed over canonical YAML bytes, &lt;em&gt;before&lt;/em&gt; the experiment runs. The spec is at &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;spec.falsify.dev/v0.1&lt;/a&gt;. The Python reference implementation is on GitHub. v0.2 freezes 2026-05-22.&lt;/p&gt;

&lt;p&gt;A specification with one implementation is indistinguishable from that implementation's bugs. So this past weekend I sat down and built a second reference implementation, in Node.js, from scratch. The goal: take the prose spec, ignore the Python source, and produce byte-identical canonical bytes for all twelve v0.1 conformance vectors.&lt;/p&gt;

&lt;p&gt;It worked. 12/12 vectors pass byte-for-byte. The implementation is 404 lines of JavaScript with zero runtime dependencies beyond the Node.js standard library. You can run it from &lt;a href="https://github.com/sk8ordie84/falsify/tree/main/impl/js" rel="noopener noreferrer"&gt;&lt;code&gt;impl/js/falsify.js&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What's interesting is what &lt;em&gt;didn't&lt;/em&gt; work the first time. The exercise surfaced three quiet portability gotchas — places where the spec's prose and the spec's twelve vectors silently disagreed about what the bytes should be. Each of them is a real defect in the v0.1 specification, and each is now an action item for v0.2.&lt;/p&gt;

&lt;p&gt;This post is the three findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 1 — Sixty-four-bit integer precision
&lt;/h2&gt;

&lt;p&gt;The first failing vector was &lt;strong&gt;TV-006&lt;/strong&gt;: &lt;code&gt;seed: 18446744073709551615&lt;/code&gt;. That's $2^{64} - 1$, the largest unsigned 64-bit integer the v0.1 spec allows for the seed field.&lt;/p&gt;

&lt;p&gt;Naive Node.js parses this through &lt;code&gt;JSON.parse&lt;/code&gt; into a &lt;code&gt;Number&lt;/code&gt;. JavaScript's &lt;code&gt;Number&lt;/code&gt; is IEEE-754 binary64. The largest &lt;em&gt;integer&lt;/em&gt; you can safely represent in binary64 is $2^{53} - 1$, which is about $9 \times 10^{15}$. Above that, integers round to the nearest representable float.&lt;/p&gt;

&lt;p&gt;So when Node.js read the test vector input file, the seed &lt;code&gt;18446744073709551615&lt;/code&gt; was quietly rounded to the nearest double, $2^{64}$, which JavaScript prints as &lt;code&gt;18446744073709552000&lt;/code&gt;: a decimal $385$ larger than what the test vector said. The canonicalizer then dumped that wrong number, and the hash didn't match.&lt;/p&gt;

&lt;p&gt;The same problem hits Go (&lt;code&gt;int64&lt;/code&gt;, $2^{63} - 1$ ceiling), Java (&lt;code&gt;long&lt;/code&gt;, same ceiling), and any other language whose default integer type cannot hold $2^{64} - 1$.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Native integer ceiling&lt;/th&gt;
&lt;th&gt;TV-006 round-trips?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python 3&lt;/td&gt;
&lt;td&gt;unbounded&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript Number&lt;/td&gt;
&lt;td&gt;$2^{53} - 1$&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;no&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go &lt;code&gt;int64&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;$2^{63} - 1$&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;no&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Java &lt;code&gt;long&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;$2^{63} - 1$&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;no&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust &lt;code&gt;u64&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;$2^{64} - 1$&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The PyYAML-based Python reference implementation works only because Python's &lt;code&gt;int&lt;/code&gt; is arbitrary-precision. The spec did not mention this, anywhere.&lt;/p&gt;
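
&lt;p&gt;The rounding is easy to reproduce from Python by forcing the TV-006 seed through a binary64 &lt;code&gt;float&lt;/code&gt;, which is what a JavaScript &lt;code&gt;Number&lt;/code&gt; is. This is a minimal sketch for illustration, not code from either reference implementation; the variable names are my own.&lt;/p&gt;

```python
import json

MAX_SEED = 2**64 - 1  # 18446744073709551615, the TV-006 seed

# Python's int is arbitrary-precision, so a JSON round trip is lossless.
exact = json.loads(json.dumps({"seed": MAX_SEED}))["seed"]

# Forcing the same value through a binary64 float imitates a JavaScript
# Number: the nearest representable double to 2**64 - 1 is 2**64.
rounded = int(float(MAX_SEED))

print(exact == MAX_SEED)   # True
print(rounded == 2**64)    # True
print(rounded - MAX_SEED)  # 1
```

&lt;p&gt;The double itself is off by one. The larger gap in the text comes from JavaScript printing that double with the shortest digits that round-trip, then padding with zeros.&lt;/p&gt;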

&lt;p&gt;&lt;strong&gt;The fix in the Node.js implementation:&lt;/strong&gt; parse the JSON text with a regex that wraps any 16-or-more-digit integer in a sentinel string before &lt;code&gt;JSON.parse&lt;/code&gt; sees it, then unwrap to &lt;code&gt;BigInt&lt;/code&gt; after parse. Twenty lines of JavaScript that no spec reader could have predicted from the prose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix for v0.2:&lt;/strong&gt; make &lt;code&gt;seed&lt;/code&gt; a quoted decimal string in the canonical form: &lt;code&gt;seed: '18446744073709551615'&lt;/code&gt;. Languages with weak integer types now get a string and can opt into BigInt themselves. The format is unambiguous from the bytes alone.&lt;/p&gt;
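
&lt;p&gt;On the consuming side, a string-typed seed could be read along these lines. This is a sketch under my own assumptions: &lt;code&gt;parse_seed&lt;/code&gt; is a hypothetical helper, and the leading-zero rule is a guess at what a v0.2 grammar might require, not something the roadmap states.&lt;/p&gt;

```python
MAX_SEED = 2**64 - 1


def parse_seed(text: str) -> int:
    """Parse a decimal-string seed into an exact integer (hypothetical helper)."""
    # Accept ASCII digits only: no sign, no whitespace, no exponent.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"seed is not a decimal string: {text!r}")
    # Assumption: reject leading zeros so each value has exactly one spelling.
    if len(text) > 1 and text[0] == "0":
        raise ValueError(f"seed has a leading zero: {text!r}")
    value = int(text)
    if value > MAX_SEED:
        raise ValueError(f"seed exceeds 2**64 - 1: {text!r}")
    return value


print(parse_seed("18446744073709551615") == MAX_SEED)  # True
```

&lt;p&gt;A JavaScript port would do the same checks and hand the string to &lt;code&gt;BigInt&lt;/code&gt;, so no value ever passes through a &lt;code&gt;Number&lt;/code&gt;.&lt;/p&gt;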

&lt;h2&gt;
  
  
  Finding 2 — Integer-valued floats lose their type
&lt;/h2&gt;

&lt;p&gt;The next failing vector was &lt;strong&gt;TV-008&lt;/strong&gt;: a manifest with &lt;code&gt;threshold: 1.0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The expected canonical bytes contain &lt;code&gt;threshold: 1.0&lt;/code&gt;. The actual produced bytes contain &lt;code&gt;threshold: 1&lt;/code&gt;. The hash differed. This bothered me for ten minutes.&lt;/p&gt;

&lt;p&gt;It turns out: when JSON parsers encounter &lt;code&gt;1.0&lt;/code&gt; in a JSON document, almost all of them lose the float-ness. JavaScript's &lt;code&gt;JSON.parse&lt;/code&gt; returns &lt;code&gt;Number(1)&lt;/code&gt;, indistinguishable at runtime from the integer &lt;code&gt;1&lt;/code&gt;. When a YAML emitter then takes that number and serialises it, it has no signal that the producer wrote &lt;code&gt;1.0&lt;/code&gt; rather than &lt;code&gt;1&lt;/code&gt;. So it emits &lt;code&gt;1&lt;/code&gt;. The hash drifts.&lt;/p&gt;

&lt;p&gt;PyYAML doesn't have this problem because PyYAML's load-and-dump cycle uses Python's native &lt;code&gt;float&lt;/code&gt; type, which round-trips through &lt;code&gt;1.0&lt;/code&gt; cleanly. JavaScript's &lt;code&gt;Number&lt;/code&gt; cannot.&lt;/p&gt;

&lt;p&gt;This is a property of JSON's data model. JSON has a single number type and does not distinguish integer-valued floats from integers, so a parser is free to discard the difference. In JavaScript the information is destroyed at parse time, before any canonicalizer runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix in the Node.js implementation:&lt;/strong&gt; a small "this field should always render as a float" set, currently containing one element: &lt;code&gt;{'threshold'}&lt;/code&gt;. The canonicalizer checks the field name and forces &lt;code&gt;.0&lt;/code&gt; when the value is integer-valued. A field-specific hack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix for v0.2:&lt;/strong&gt; specify that &lt;code&gt;threshold&lt;/code&gt; always renders with at least one decimal place in the canonical form. Two lines in the spec close it. No field-aware emitter logic required.&lt;/p&gt;
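
&lt;p&gt;A rule like that is small enough to sketch. The helper below is hypothetical and deliberately narrow: it leans on Python's &lt;code&gt;repr&lt;/code&gt; for the digits and refuses any value that &lt;code&gt;repr&lt;/code&gt; would print in exponent notation, because the exact rendering of such values is something the v0.2 text would have to pin down.&lt;/p&gt;

```python
def render_threshold(value) -> str:
    """Render a threshold with at least one decimal place (hypothetical helper)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"threshold must be a number, got {type(value).__name__}")
    text = repr(float(value))
    # repr() uses exponent notation for very large or very small magnitudes;
    # this sketch refuses those (and inf/nan) rather than guess a format.
    if "e" in text or "inf" in text or "nan" in text:
        raise ValueError(f"threshold not renderable by this sketch: {value!r}")
    return text


print(render_threshold(1))     # 1.0
print(render_threshold(0.85))  # 0.85
```

&lt;p&gt;The point of the rule is that the integer &lt;code&gt;1&lt;/code&gt; and the float &lt;code&gt;1.0&lt;/code&gt; render identically, so it no longer matters which one the parser handed over.&lt;/p&gt;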

&lt;h2&gt;
  
  
  Finding 3 — "Plain scalar" disagreements
&lt;/h2&gt;

&lt;p&gt;The third failing case was the &lt;em&gt;same&lt;/em&gt; vector, &lt;strong&gt;TV-008&lt;/strong&gt;: &lt;code&gt;comparator: ==&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The expected canonical bytes have &lt;code&gt;comparator: ==&lt;/code&gt;. JavaScript's &lt;code&gt;js-yaml&lt;/code&gt; library produced &lt;code&gt;comparator: '=='&lt;/code&gt;, single-quoted. SHA-256 is unforgiving; a two-byte difference produces a completely different hash.&lt;/p&gt;

&lt;p&gt;YAML 1.1 and 1.2 both have a notion of "plain scalars": strings that don't need quotes because they contain no characters or patterns that would confuse the parser. A long list of rules governs whether a particular string can be plain: must not start with an indicator character (&lt;code&gt;-&lt;/code&gt;, &lt;code&gt;?&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;,&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, &lt;code&gt;]&lt;/code&gt;, &lt;code&gt;{&lt;/code&gt;, &lt;code&gt;}&lt;/code&gt;, &lt;code&gt;#&lt;/code&gt;, &lt;code&gt;&amp;amp;&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;!&lt;/code&gt;, &lt;code&gt;|&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;'&lt;/code&gt;, &lt;code&gt;"&lt;/code&gt;, &lt;code&gt;%&lt;/code&gt;, &lt;code&gt;@&lt;/code&gt;, &lt;code&gt;`&lt;/code&gt;), must not contain colon-space, must not look like a number/boolean/null/timestamp, must not have leading/trailing whitespace, etc.&lt;/p&gt;

&lt;p&gt;PyYAML and &lt;code&gt;js-yaml&lt;/code&gt; implement this predicate with subtly different conservatism. PyYAML accepts &lt;code&gt;==&lt;/code&gt; as a plain scalar because none of the rules fire — there is no indicator character, no number resolution, no timestamp pattern. &lt;code&gt;js-yaml&lt;/code&gt; is more defensive: it sees a string that &lt;em&gt;could&lt;/em&gt; be confusing and quotes it.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;&amp;gt;=&lt;/code&gt;, &lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, both libraries quote — the leading character is in the indicator set. So those work. Only &lt;code&gt;==&lt;/code&gt; is special, and only &lt;code&gt;==&lt;/code&gt; differs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix in the Node.js implementation:&lt;/strong&gt; I rewrote the plain-scalar predicate from scratch, in about fifty lines, matching PyYAML's behaviour. It checks for indicator-prefix, leading/trailing whitespace, colon-space and hash-space, number-resolution regex, boolean/null set, timestamp regex, and control-character escape. With this hand-rolled predicate, TV-008 reproduces.&lt;/p&gt;
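
&lt;p&gt;To give a feel for the shape of such a predicate, here is a much-reduced sketch in Python. It is not the code in &lt;code&gt;falsify.js&lt;/code&gt; and it is not PyYAML's actual logic: the word list and both regexes are simplified assumptions of mine, chosen to err on the side of quoting.&lt;/p&gt;

```python
import re

# Indicator characters a plain scalar must not start with.
# chr(38) is the ampersand, spelled this way to keep the snippet markup-safe.
INDICATORS = set("-?:,[]{}#*!|>'\"%@`") | {chr(38)}
# Assumed set of words that would resolve to null or a boolean.
RESERVED_WORDS = {"", "~", "null", "true", "false", "yes", "no", "on", "off"}
# Deliberately loose patterns: anything number- or date-shaped gets quoted.
NUMBER_LIKE = re.compile(r"[-+]?(\d[\d_]*)?(\.\d*)?([eE][-+]?\d+)?")
TIMESTAMP_LIKE = re.compile(r"\d{4}-\d{2}-\d{2}")


def can_be_plain(text: str) -> bool:
    """Return True if this simplified predicate would emit text unquoted."""
    if text.lower() in RESERVED_WORDS:
        return False
    if text[0] in INDICATORS:
        return False
    if text != text.strip():  # leading or trailing whitespace
        return False
    if ": " in text or " #" in text or text.endswith(":"):
        return False
    if NUMBER_LIKE.fullmatch(text) or TIMESTAMP_LIKE.match(text):
        return False
    if any(ord(ch) in range(32) for ch in text):  # control characters
        return False
    return True


print(can_be_plain("=="))  # True
print(can_be_plain(">="))  # False
```

&lt;p&gt;Every branch here is a place where two emitters can disagree, which is the real argument for taking the predicate out of the canonical form.&lt;/p&gt;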

&lt;p&gt;&lt;strong&gt;The fix for v0.2:&lt;/strong&gt; publish a formal canonicalization grammar. Or, simpler and more aggressive: drop the plain-scalar concept entirely and always single-quote every string scalar in the canonical form. The output is ~10% larger; the ambiguity surface is zero. No predicate is needed, and no second implementation has to reverse-engineer an emitter.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this exercise really proves
&lt;/h2&gt;

&lt;p&gt;It does not prove that PRML is bulletproof. It proves that PRML is &lt;em&gt;implementable in a second language&lt;/em&gt; — which, at the v0.1 stage, was not yet established. A specification existing in only one implementation is indistinguishable from that implementation's bugs. PRML is now demonstrably more than that.&lt;/p&gt;

&lt;p&gt;It also does not prove that &lt;em&gt;all&lt;/em&gt; PyYAML edge cases are covered. The Node.js implementation matches the twelve current vectors, which exercise specific cases. Adding new vectors (Unicode normalisation, control characters, very long strings, unusual line-folding) might reveal further divergences.&lt;/p&gt;

&lt;p&gt;The general lesson: &lt;strong&gt;a content-addressed format has to be specified in terms of the bytes it produces, not in terms of the emitter that produces them&lt;/strong&gt;. PyYAML's &lt;code&gt;safe_dump&lt;/code&gt; is a stable, careful, twenty-year-old emitter. It is not a specification. The next time someone wants to write a content-addressed YAML format — for SBOMs, for build provenance, for AI evaluation claims, anything — write the canonicalization grammar first, and &lt;em&gt;then&lt;/em&gt; implement it. Don't describe an emitter; describe bytes.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.2 action items, summarised
&lt;/h2&gt;

&lt;p&gt;The findings translate to three concrete v0.2 specification changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;seed&lt;/code&gt; is a quoted decimal string.&lt;/strong&gt; Closes 64-bit integer precision portability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;threshold&lt;/code&gt; always renders with at least one decimal place.&lt;/strong&gt; Closes integer-valued float type loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always-quoted string scalars.&lt;/strong&gt; Eliminates the plain-scalar predicate ambiguity entirely.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Plus a fourth, broader change:&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Publish a formal canonicalization grammar in ABNF.&lt;/strong&gt; With the always-quoted rule, the grammar is short — about forty production rules. It becomes the source of truth for conformance, replacing the implicit "PyYAML's behaviour" reference.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full v0.2 roadmap, including six other extensions (algorithm agility, tolerance, multi-claim manifests, mandatory signatures for high-risk Annex III, twelve new conformance vectors, sidecar format extension), is at &lt;a href="https://github.com/sk8ordie84/falsify/blob/main/spec/v0.2/ROADMAP.md" rel="noopener noreferrer"&gt;&lt;code&gt;spec/v0.2/ROADMAP.md&lt;/code&gt;&lt;/a&gt;. The freeze is targeted for 2026-05-22, three weeks from this writing, and the five open RFC questions in the roadmap are the parts where outside opinion would carry the most weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to read along
&lt;/h2&gt;

&lt;p&gt;If you want to see the artefacts directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Node.js implementation:&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/tree/main/impl/js" rel="noopener noreferrer"&gt;&lt;code&gt;impl/js/falsify.js&lt;/code&gt;&lt;/a&gt; — 404 LOC, MIT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The portability findings document:&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/blob/main/spec/analysis/canonicalization-portability-v0.1.md" rel="noopener noreferrer"&gt;&lt;code&gt;spec/analysis/canonicalization-portability-v0.1.md&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The conformance suite:&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/tree/main/spec/test-vectors/v0.1" rel="noopener noreferrer"&gt;&lt;code&gt;spec/test-vectors/v0.1/&lt;/code&gt;&lt;/a&gt; — JSON, twelve entries with locked digests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The v0.1 spec:&lt;/strong&gt; &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;&lt;code&gt;spec.falsify.dev/v0.1&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The arXiv preprint (working draft):&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/tree/main/spec/paper" rel="noopener noreferrer"&gt;&lt;code&gt;spec/paper/&lt;/code&gt;&lt;/a&gt; — 14-page LaTeX, CC BY 4.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public review thread:&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/discussions/6" rel="noopener noreferrer"&gt;GitHub Discussion #6&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to add a third implementation in a third language — Rust, Go, Java, Swift, OCaml — the test vectors are the contract. If your canonicalizer reproduces all twelve byte-for-byte, your implementation is conformant. Open a PR; I'll add it.&lt;/p&gt;

&lt;p&gt;— Studio-11 (independent), &lt;code&gt;hello@studio-11.co&lt;/code&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>opensource</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why ML accuracy numbers are unfalsifiable, and what a 1287-line Python tool does about it</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Fri, 01 May 2026 13:46:43 +0000</pubDate>
      <link>https://forem.com/sk8ordie84/why-ml-accuracy-numbers-are-unfalsifiable-and-what-a-1287-line-python-tool-does-about-it-40e1</link>
      <guid>https://forem.com/sk8ordie84/why-ml-accuracy-numbers-are-unfalsifiable-and-what-a-1287-line-python-tool-does-about-it-40e1</guid>
      <description>&lt;p&gt;A few weeks ago I was reading a model card for an open-weight code model. It claimed &lt;code&gt;pass@1 = 67%&lt;/code&gt; on HumanEval. I tried to reproduce it. I got &lt;code&gt;54%&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible.&lt;/p&gt;

&lt;p&gt;Except: which version of HumanEval? The original 164 problems, or the de-contaminated 161? What temperature? What seed for nucleus sampling? What was the threshold the team committed to &lt;em&gt;before&lt;/em&gt; they ran the eval, and how do I know the published &lt;code&gt;67%&lt;/code&gt; is not the best of three runs at three temperatures?&lt;/p&gt;

&lt;p&gt;I read the paper. I read the README. I read the eval harness source. I could not answer any of those questions from the published artifacts. I could only ask the authors, and they could only tell me what they remembered. And I had no way to distinguish what they remembered from what they wished they had done.&lt;/p&gt;

&lt;p&gt;This is not a problem about that specific model card or those specific authors. It is a problem about every published ML accuracy number I have ever read.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five failure modes that current reporting practices cannot detect
&lt;/h2&gt;

&lt;p&gt;A claim like &lt;em&gt;"our model achieves 91.3% accuracy on benchmark X"&lt;/em&gt; can be wrong, in published form, in at least these five ways, none of which leave a forensic trace:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Threshold drift.&lt;/strong&gt; The team picked the threshold &lt;em&gt;after&lt;/em&gt; running the experiment, by looking at where their model happened to land, and reported that as if it were the original target.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slice selection.&lt;/strong&gt; The evaluation set was filtered after results were observed (e.g., dropping the 12 hardest examples because "they were mislabeled").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent re-runs.&lt;/strong&gt; Five seeds were tried; only the seed that passed was reported.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric ambiguity.&lt;/strong&gt; "F1" without specifying micro vs macro. "Accuracy" without specifying top-k. "Pass@1" without specifying temperature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset drift.&lt;/strong&gt; The benchmark hosted at the canonical URL changed between the experiment date and the publication date, and the team did not pin the bytes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these is consistent with current best-practice reporting. Each leaves the published number unfalsifiable: a reader cannot, even in principle, distinguish honest reporting from any of the above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why no infrastructure exists
&lt;/h2&gt;

&lt;p&gt;Pre-registration solved this exact problem in adjacent fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clinical trials, in 2007, with &lt;a href="https://clinicaltrials.gov" rel="noopener noreferrer"&gt;ClinicalTrials.gov&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Psychology, in 2013, with &lt;a href="https://osf.io/" rel="noopener noreferrer"&gt;Open Science Framework&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Economics, the same year, with the &lt;a href="https://www.aeaweb.org/journals/policies/rcts" rel="noopener noreferrer"&gt;AEA registry&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ML never got the equivalent. The closest thing — the &lt;a href="https://reproml.org/" rel="noopener noreferrer"&gt;ML Reproducibility Challenge&lt;/a&gt; — is an annual peer-driven effort to re-run published experiments. It produces excellent post-hoc analysis but does not change the publication-time commitment surface.&lt;/p&gt;

&lt;p&gt;The 2026 regulatory window is the part that matters most for builders. The &lt;a href="https://artificialintelligenceact.eu/article/12/" rel="noopener noreferrer"&gt;EU AI Act Article 12&lt;/a&gt; requires automatic logging of evaluation events for high-risk systems. &lt;a href="https://artificialintelligenceact.eu/article/18/" rel="noopener noreferrer"&gt;Article 18&lt;/a&gt; requires 10-year retention. Both enter into force on August 2, 2026. NIST AI RMF references content-addressed audit trails as a recommended control. ISO/IEC 42001:2023 mandates documented information practices that PRML directly satisfies.&lt;/p&gt;

&lt;p&gt;In other words: there is now a regulatory deadline by which "we have a tradition of reporting these numbers honestly" stops being a sufficient answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  PRML in plain English
&lt;/h2&gt;

&lt;p&gt;I drafted a small format, working draft v0.1, currently under public review. It is called &lt;strong&gt;PRML — Pre-Registered ML Manifest&lt;/strong&gt;. The whole spec fits in a single YAML schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prml/0.1"&lt;/span&gt;
&lt;span class="na"&gt;claim_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;01900000-0000-7000-8000-000000000000"&lt;/span&gt;
&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-01T12:00:00Z"&lt;/span&gt;
&lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy"&lt;/span&gt;
&lt;span class="na"&gt;comparator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;="&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
&lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imagenet-val-2012"&lt;/span&gt;
  &lt;span class="na"&gt;hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"&lt;/span&gt;
&lt;span class="na"&gt;seed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;42&lt;/span&gt;
&lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;studio-11.co"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire required surface. Nine top-level keys. Plain text. UTF-8. YAML 1.2 strict subset (block style only, lexicographic key ordering, no comments, no flow collections).&lt;/p&gt;

&lt;p&gt;The format defines a deterministic canonicalization. Given any logical YAML mapping with these fields, there is exactly one canonical UTF-8 byte sequence. The SHA-256 of those bytes is the manifest hash.&lt;/p&gt;
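
&lt;p&gt;As an illustration of the idea, here is a toy canonicalizer in Python. It is a sketch under loud assumptions: it is not the &lt;code&gt;falsify&lt;/code&gt; implementation, it single-quotes every string to avoid quoting decisions, and it will not reproduce the v0.1 conformance vectors. What it does show is the property that matters: key order in the input stops affecting the bytes, and any change of content changes the hash.&lt;/p&gt;

```python
import hashlib


def quote(text: str) -> str:
    # YAML single-quoted style: the only escape is doubling the quote itself.
    return "'" + text.replace("'", "''") + "'"


def emit(mapping: dict, indent: int = 0) -> list:
    """Emit block-style lines with lexicographically sorted keys."""
    lines = []
    pad = " " * indent
    for key in sorted(mapping):
        value = mapping[key]
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(emit(value, indent + 2))
        elif isinstance(value, str):
            lines.append(f"{pad}{key}: {quote(value)}")
        elif isinstance(value, bool):
            raise TypeError("booleans are not handled in this sketch")
        elif isinstance(value, (int, float)):
            lines.append(f"{pad}{key}: {value!r}")
        else:
            raise TypeError(f"unsupported value type: {type(value).__name__}")
    return lines


def canonical_bytes(manifest: dict) -> bytes:
    return ("\n".join(emit(manifest)) + "\n").encode("utf-8")


def manifest_hash(manifest: dict) -> str:
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


a = {"threshold": 0.85, "metric": "accuracy", "seed": 42}
b = {"seed": 42, "metric": "accuracy", "threshold": 0.85}
print(manifest_hash(a) == manifest_hash(b))  # True
```

&lt;p&gt;Multi-line strings, lists and non-ASCII key ordering are all out of scope here; a real canonicalizer has to define each of them.&lt;/p&gt;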

&lt;p&gt;The hash is published &lt;em&gt;before&lt;/em&gt; the experiment runs. After the experiment, an independent verifier can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Re-canonicalize the manifest.&lt;/li&gt;
&lt;li&gt;Recompute SHA-256.&lt;/li&gt;
&lt;li&gt;Compare against the published sidecar hash. If they differ, the manifest has been edited post-lock — exit code &lt;code&gt;3&lt;/code&gt; (TAMPERED).&lt;/li&gt;
&lt;li&gt;Load the dataset by its content hash. Verify byte integrity.&lt;/li&gt;
&lt;li&gt;Run the metric computation under the seed. Compare against threshold.&lt;/li&gt;
&lt;li&gt;Emit &lt;code&gt;0&lt;/code&gt; (PASS), &lt;code&gt;10&lt;/code&gt; (FAIL), or one of the diagnostic codes.&lt;/li&gt;
&lt;/ol&gt;
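
&lt;p&gt;Steps 2, 3, 5 and 6 can be sketched in a few lines of Python. This is an illustrative outline, not the &lt;code&gt;falsify&lt;/code&gt; source: the function signature is my own, the dataset check in step 4 is omitted, and only three comparators are wired up (the remaining ones follow the same pattern). The exit codes are the ones named in the list above.&lt;/p&gt;

```python
import hashlib
import operator

EXIT_PASS, EXIT_TAMPERED, EXIT_FAIL = 0, 3, 10

# Subset of comparators for illustration.
COMPARATORS = {">=": operator.ge, ">": operator.gt, "==": operator.eq}


def verify(canonical: bytes, recorded_hash: str, comparator: str,
           threshold: float, observed: float) -> int:
    """Return an exit code for an already re-canonicalized manifest."""
    # Steps 2-3: recompute the digest and compare with the sidecar value.
    current = hashlib.sha256(canonical).hexdigest()
    if current != recorded_hash:
        return EXIT_TAMPERED
    # Steps 5-6: compare the observed metric against the locked threshold.
    if comparator not in COMPARATORS:
        raise ValueError(f"comparator not covered by this sketch: {comparator!r}")
    passed = COMPARATORS[comparator](observed, threshold)
    return EXIT_PASS if passed else EXIT_FAIL


locked = b"threshold: 0.87\n"
recorded = hashlib.sha256(locked).hexdigest()
print(verify(locked, recorded, ">=", 0.87, 0.876))                # 0
print(verify(b"threshold: 0.88\n", recorded, ">=", 0.88, 0.876))  # 3
```

&lt;p&gt;The ordering matters: the hash check runs first, so an edited manifest never reaches the comparison and cannot produce a verdict.&lt;/p&gt;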

&lt;p&gt;There is no trust in the producer required at verification time. Anyone with the manifest, the dataset, and the model can reproduce the verdict offline.&lt;/p&gt;

&lt;p&gt;Honest amendments — "we found 12 mislabeled examples and re-ran" — do not overwrite. They append. Each new manifest carries a &lt;code&gt;prior_hash&lt;/code&gt; field pointing to the manifest it amends. The chain is the audit log. When a regulator or reviewer asks &lt;em&gt;"what was committed when?"&lt;/em&gt;, the answer is one hash, and from that hash the entire history is recoverable.&lt;/p&gt;
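
&lt;p&gt;Recovering that history is a linked-list walk. The sketch below assumes a hypothetical &lt;code&gt;store&lt;/code&gt; that maps each manifest hash to its parsed manifest; how manifests are actually stored and looked up is not something this post specifies.&lt;/p&gt;

```python
def history(head_hash: str, store: dict) -> list:
    """Walk prior_hash links from the newest manifest back to the original."""
    chain = []
    seen = set()
    current = head_hash
    while current is not None:
        if current in seen:
            raise ValueError("cycle in prior_hash chain")
        if current not in store:
            raise KeyError(f"missing manifest for hash {current}")
        seen.add(current)
        chain.append(current)
        # The original manifest has no prior_hash, which ends the walk.
        current = store[current].get("prior_hash")
    return chain


# Placeholder hashes for illustration only.
store = {
    "aaa": {"threshold": 0.87},
    "bbb": {"threshold": 0.87, "prior_hash": "aaa"},
    "ccc": {"threshold": 0.86, "prior_hash": "bbb"},
}
print(history("ccc", store))  # ['ccc', 'bbb', 'aaa']
```

&lt;p&gt;A missing link raises instead of being skipped, since a gap in the chain is exactly what an auditor needs to hear about.&lt;/p&gt;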

&lt;h2&gt;
  
  
  A worked example with the reference implementation
&lt;/h2&gt;

&lt;p&gt;The reference implementation is a single-file Python CLI called &lt;code&gt;falsify&lt;/code&gt;, MIT-licensed, 1287 lines. Install it the usual way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;falsify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize a claim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;falsify init imagenet-87
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This writes &lt;code&gt;.falsify/imagenet-87/spec.yaml&lt;/code&gt; with the required PRML fields as placeholders. Edit the file with your real values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prml/0.1"&lt;/span&gt;
&lt;span class="na"&gt;claim_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;01900000-0000-7000-8000-000000000010"&lt;/span&gt;
&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-01T14:00:00Z"&lt;/span&gt;
&lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy"&lt;/span&gt;
&lt;span class="na"&gt;comparator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;="&lt;/span&gt;
&lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.87&lt;/span&gt;
&lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imagenet-val-2012"&lt;/span&gt;
  &lt;span class="na"&gt;hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"&lt;/span&gt;
&lt;span class="na"&gt;seed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;42&lt;/span&gt;
&lt;span class="na"&gt;producer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-org.example"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lock it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;falsify lock imagenet-87
locked: &lt;span class="nb"&gt;yes&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;sha256:1a3466cc08ee, locked_at 2026-05-01T14:00:00Z&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the spec is hash-bound. If anyone — including you — edits the YAML, the next &lt;code&gt;falsify verify&lt;/code&gt; exits 3 and refuses to produce a verdict.&lt;/p&gt;

&lt;p&gt;Run the experiment, capture the metric value (let us say 0.876), and verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;falsify verify imagenet-87 &lt;span class="nt"&gt;--observed&lt;/span&gt; 0.876
PASS  &lt;span class="nv"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;accuracy &lt;span class="nv"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.876 &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nv"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.87
&lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the team had silently raised the threshold to 0.88 after seeing the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;falsify verify imagenet-87 &lt;span class="nt"&gt;--observed&lt;/span&gt; 0.876
TAMPERED  spec &lt;span class="nb"&gt;hash &lt;/span&gt;drift detected
recorded: 1a3466cc08ee...
current:  7b2c9a5d1e4f...
&lt;span class="nb"&gt;exit &lt;/span&gt;3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CI pipeline halts. The deploy does not happen. There is no judgment call.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you know the canonicalization actually works?
&lt;/h2&gt;

&lt;p&gt;The most reasonable skeptical question about a content-addressed format is: &lt;em&gt;what guarantees that two implementations produce the same canonical bytes for the same input?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For v0.1 we publish &lt;a href="https://spec.falsify.dev/test-vectors/v0.1/test-vectors.md" rel="noopener noreferrer"&gt;12 conformance test vectors&lt;/a&gt;. Each vector defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An input manifest (logical YAML, key order irrelevant).&lt;/li&gt;
&lt;li&gt;The exact UTF-8 byte sequence the canonicalizer must produce.&lt;/li&gt;
&lt;li&gt;The exact lowercase-hex SHA-256 of those bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The vectors exercise:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TV-001&lt;/td&gt;
&lt;td&gt;Minimal valid manifest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-002&lt;/td&gt;
&lt;td&gt;Key-ordering invariance — random insertion order produces same hash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-003&lt;/td&gt;
&lt;td&gt;Single-bit-of-content sensitivity — &lt;code&gt;0.85&lt;/code&gt; vs &lt;code&gt;0.86&lt;/code&gt; produces different hash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-004&lt;/td&gt;
&lt;td&gt;Optional fields populated (&lt;code&gt;model.id&lt;/code&gt;, &lt;code&gt;model.hash&lt;/code&gt;, &lt;code&gt;dataset.uri&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-005&lt;/td&gt;
&lt;td&gt;Unicode handling in &lt;code&gt;producer.id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-006&lt;/td&gt;
&lt;td&gt;Maximum seed value (&lt;code&gt;2⁶⁴ − 1&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-007&lt;/td&gt;
&lt;td&gt;Minimum seed (&lt;code&gt;0&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-008&lt;/td&gt;
&lt;td&gt;Equality comparator with integer-valued threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-009&lt;/td&gt;
&lt;td&gt;Amendment with &lt;code&gt;prior_hash&lt;/code&gt; linkage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-010&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pass@k&lt;/code&gt; metric for code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-011&lt;/td&gt;
&lt;td&gt;AUROC with strict comparator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TV-012&lt;/td&gt;
&lt;td&gt;Regression metric with &lt;code&gt;&amp;lt;=&lt;/code&gt; comparator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A new implementation in Rust, Go, or TypeScript is conformant only if it reproduces all 12 vectors exactly. The reference implementation has 28 unittest assertions in CI that lock in the v0.1 hash contract; any code change that breaks a vector forces a v0.2 spec bump.&lt;/p&gt;
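
&lt;p&gt;As a rough illustration of what reproducing a vector exactly involves, here is a minimal harness sketch. The &lt;code&gt;Vector&lt;/code&gt; class, &lt;code&gt;check_vector&lt;/code&gt;, and the sorted-key JSON &lt;code&gt;toy_canonicalize&lt;/code&gt; are hypothetical names invented for this example. The toy canonicalizer is not the PRML canonical form, and the demo vector is made up rather than taken from the published set.&lt;/p&gt;

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Vector:
    """One conformance vector: input, expected bytes, expected digest."""
    vector_id: str
    manifest: dict
    expected_bytes: bytes
    expected_sha256: str


def toy_canonicalize(manifest: dict) -> bytes:
    # Stand-in canonicalizer (sorted-key JSON). NOT the PRML canonical form;
    # it only exists so this harness can run on its own.
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def check_vector(canonicalize: Callable[[dict], bytes], vector: Vector) -> list[str]:
    """Return a list of failure messages; an empty list means the vector passes."""
    failures = []
    produced = canonicalize(vector.manifest)
    if produced != vector.expected_bytes:
        failures.append(f"{vector.vector_id}: canonical bytes differ")
    digest = hashlib.sha256(produced).hexdigest()
    if digest != vector.expected_sha256:
        failures.append(f"{vector.vector_id}: SHA-256 differs")
    return failures


# Made-up vector built from the toy canonicalizer, for illustration only.
_expected = b'{"metric":"accuracy","threshold":0.85}'
demo = Vector(
    vector_id="DEMO-001",
    manifest={"threshold": 0.85, "metric": "accuracy"},
    expected_bytes=_expected,
    expected_sha256=hashlib.sha256(_expected).hexdigest(),
)

print(check_vector(toy_canonicalize, demo))
```

&lt;p&gt;Checking the bytes and the digest separately is a choice made for this sketch: a byte mismatch points at the canonicalizer, while a digest-only mismatch would point at the hashing step or at the vector file itself.&lt;/p&gt;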

&lt;h2&gt;
  
  
  What it is not
&lt;/h2&gt;

&lt;p&gt;PRML does not establish &lt;em&gt;whether&lt;/em&gt; a claimed metric is correct, fair, or sufficient. It establishes only &lt;em&gt;that&lt;/em&gt; the claim was committed before it was tested. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not a model card replacement. PRML manifests sit &lt;em&gt;underneath&lt;/em&gt; model cards as the cryptographic floor.&lt;/li&gt;
&lt;li&gt;Not a benchmark. PRML does not pick metrics for you.&lt;/li&gt;
&lt;li&gt;Not a reproducibility framework. PRML does not ship code or data.&lt;/li&gt;
&lt;li&gt;Not a tool. PRML is a format. &lt;code&gt;falsify&lt;/code&gt; is one implementation. A second implementation in any language passes if it reproduces the test vectors.&lt;/li&gt;
&lt;li&gt;Not a compliance product. It is a primitive that makes named regulatory obligations satisfiable with arithmetic verification rather than process attestation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What it costs
&lt;/h2&gt;

&lt;p&gt;The cost of adopting PRML at the experiment level is one hash function call. SHA-256 is specified in FIPS 180-4 and has been a published standard since 2002, so an implementation ships with nearly every mainstream language runtime or its standard crypto library. The format is UTF-8 plain text, readable in 2046 by any tool that can read text.&lt;/p&gt;

&lt;p&gt;The cost of &lt;em&gt;not&lt;/em&gt; adopting it scales with deployment scope. For a personal project, zero. For a research paper, growing pressure as reviewers begin to ask. For a product subject to EU AI Act Annex III obligations, measurable in regulatory exposure plus legal review hours. For a foundation model that will be cited in safety cases for a decade, the cost is roughly &lt;em&gt;the credibility of every accuracy claim you have ever shipped&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am asking for
&lt;/h2&gt;

&lt;p&gt;This is a working draft. v0.2 freeze is targeted &lt;strong&gt;2026-05-22&lt;/strong&gt;. Three concrete asks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Format review.&lt;/strong&gt; Is the canonical serialization in §3 of &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;the spec&lt;/a&gt; unambiguous? Are there YAML 1.2 edge cases the spec misses?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threat-model gaps.&lt;/strong&gt; §6 of the spec enumerates six adversaries. What is missing?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance correctness.&lt;/strong&gt; &lt;a href="https://github.com/sk8ordie84/falsify/blob/main/spec/compliance/AI-Act-mapping-v0.1.md" rel="noopener noreferrer"&gt;The AI Act mapping&lt;/a&gt; maps PRML fields to Articles 12, 17, 18, 50, 72, and 73. Compliance lawyers and engineers in EU AI Act adjacent roles: are the bindings defensible?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Discussion thread: &lt;a href="https://github.com/sk8ordie84/falsify/discussions/6" rel="noopener noreferrer"&gt;github.com/sk8ordie84/falsify/discussions/6&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Most published ML accuracy numbers are unfalsifiable in practice.&lt;/li&gt;
&lt;li&gt;A small spec — eight fields, one hash function, one canonical serialization — gives published claims a cryptographic floor.&lt;/li&gt;
&lt;li&gt;Reference implementation in Python, MIT, single file. Spec under CC BY 4.0.&lt;/li&gt;
&lt;li&gt;v0.2 freeze in 3 weeks. Reviews, ambiguity reports, threat-model critiques are wanted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spec: &lt;a href="https://spec.falsify.dev/v0.1" rel="noopener noreferrer"&gt;spec.falsify.dev/v0.1&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/sk8ordie84/falsify" rel="noopener noreferrer"&gt;github.com/sk8ordie84/falsify&lt;/a&gt;&lt;br&gt;
Discussion: &lt;a href="https://github.com/sk8ordie84/falsify/discussions/6" rel="noopener noreferrer"&gt;github.com/sk8ordie84/falsify/discussions/6&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I built a CLI that hashes your ML accuracy claims before the experiment runs</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Wed, 29 Apr 2026 07:33:37 +0000</pubDate>
      <link>https://forem.com/sk8ordie84/i-built-a-cli-that-hashes-your-ml-accuracy-claims-before-the-experiment-runs-ick</link>
      <guid>https://forem.com/sk8ordie84/i-built-a-cli-that-hashes-your-ml-accuracy-claims-before-the-experiment-runs-ick</guid>
      <description>&lt;h1&gt;
  
  
  I built a CLI that hashes your ML accuracy claims before the experiment runs
&lt;/h1&gt;

&lt;p&gt;Last month, a customer told me our model's accuracy on their data was 71%, not the 94% we had shipped on the landing page.&lt;/p&gt;

&lt;p&gt;I went back to the eval notebook. The threshold was still 0.94. The test set was named the same thing. But somewhere in the last three weeks, somebody had "refreshed" the test set, somebody else had tightened the metric definition, and the original 94% was now unreproducible. Not anybody's fault, exactly — just nobody had written down the contract before running the experiment.&lt;/p&gt;

&lt;p&gt;That night I started building falsify. Three days later I shipped it.&lt;/p&gt;

&lt;p&gt;This post is what I built, why I built it that small, and the one Python function that does most of the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem in one sentence
&lt;/h2&gt;

&lt;p&gt;If you can change the spec after seeing the result, your accuracy claim is not falsifiable. And if it is not falsifiable, it is not really a claim — it is marketing.&lt;/p&gt;

&lt;p&gt;Psychology and medicine figured this out the hard way and invented pre-registration. You write down the prediction, the threshold, and the analysis plan, hash it, timestamp it, and you cannot move it later without everyone knowing.&lt;/p&gt;

&lt;p&gt;ML never adopted any of this. A &lt;code&gt;git commit&lt;/code&gt; is the closest thing most teams have, and &lt;code&gt;git commit --amend&lt;/code&gt; followed by a force-push will quietly erase the receipt.&lt;/p&gt;

&lt;p&gt;So I wrote a CLI that does the smallest possible version of pre-registration: canonicalize a YAML spec, SHA-256 it, lock the hash, and refuse to let it move.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "the smallest possible version" actually looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# falsify.yaml&lt;/span&gt;
&lt;span class="na"&gt;claim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;accuracy&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.94&lt;/span&gt;
  &lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_eval_v3&lt;/span&gt;
  &lt;span class="na"&gt;dataset_sha256&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4f1a8b2c...&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ranker-7b-2026q1&lt;/span&gt;
  &lt;span class="na"&gt;test_n&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1200&lt;/span&gt;
&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-04-28T19:45:00Z&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the contract. The CLI workflow is three commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;falsify
falsify lock falsify.yaml      &lt;span class="c"&gt;# writes a .lock file with the hash&lt;/span&gt;
falsify check falsify.yaml &lt;span class="nt"&gt;--result&lt;/span&gt; &lt;span class="nv"&gt;actual_accuracy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.91
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit codes are the API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;0&lt;/code&gt; — claim verified&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;10&lt;/code&gt; — claim falsified (you missed the threshold, but cleanly)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;3&lt;/code&gt; — tamper detected (someone edited the spec after lock)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;11&lt;/code&gt; — spec invalid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;10&lt;/code&gt; and &lt;code&gt;3&lt;/code&gt; being different exit codes is the whole point. "We didn't hit the number" is a different thing from "we moved the number."&lt;/p&gt;
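
&lt;p&gt;To make the distinction concrete, here is a small sketch of how a check run could map to those four codes. The numeric values come from the list above. The &lt;code&gt;ExitCode&lt;/code&gt; enum, the &lt;code&gt;decide&lt;/code&gt; function, the order of precedence, and the greater-or-equal comparison are assumptions made for illustration, not the tool's actual internals.&lt;/p&gt;

```python
from enum import IntEnum


class ExitCode(IntEnum):
    VERIFIED = 0
    TAMPERED = 3
    FALSIFIED = 10
    INVALID = 11


def decide(spec_valid: bool, locked_hash: str, current_hash: str,
           actual: float, threshold: float) -> ExitCode:
    """Map one check run to an exit code.

    The precedence (invalid, then tampered, then falsified) and the
    greater-or-equal comparison are assumptions of this sketch.
    """
    if not spec_valid:
        return ExitCode.INVALID
    if current_hash != locked_hash:
        # The spec moved after lock: report that, whatever the result was.
        return ExitCode.TAMPERED
    if actual >= threshold:
        return ExitCode.VERIFIED
    return ExitCode.FALSIFIED


print(int(decide(True, "abc", "abc", 0.91, 0.94)))
```

&lt;p&gt;Testing for tampering before looking at the result is the design choice that keeps the two failure modes apart: a moved spec is reported as moved even when the new number would have passed.&lt;/p&gt;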

&lt;h2&gt;
  
  
  The one function that matters
&lt;/h2&gt;

&lt;p&gt;The reason this works at all is YAML canonicalization. JSON looks canonical but isn't — key order, whitespace, and unicode forms can all drift while the document stays "the same." YAML is worse by default, but easy to canonicalize once you commit to a few rules.&lt;/p&gt;

&lt;p&gt;Here is the actual hashing function from the source. It is small on purpose:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;  &lt;span class="c1"&gt;# PyYAML
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;canonical_sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return SHA-256 of a canonicalized YAML spec.

    Canonicalization rules:
      - Parse the document, drop comments and anchors
      - Recursively sort all mapping keys
      - Normalize all strings to NFC unicode
      - Re-emit as UTF-8 with LF line endings, no trailing whitespace
      - Hash the bytes
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;unicodedata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;

    &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;allow_unicode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default_flow_style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;line_break&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire trust primitive. Everything else in the 3925-line file — the lock file format, the CI integration, the tamper detection, the schema validation — is plumbing around this one function.&lt;/p&gt;

&lt;p&gt;The reason it has to be exactly this strict: any wiggle room (key order, trailing whitespace, BOM, unicode form) is a place where someone can quietly change the spec and produce a "matching" hash. Canonicalize once, hash once, never look back.&lt;/p&gt;
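
&lt;p&gt;The unicode-form rule is the least obvious of these, so here is a standalone sketch using only the standard library. Two strings that render identically hash differently as raw bytes and identically after NFC normalization. The helper names &lt;code&gt;digest&lt;/code&gt; and &lt;code&gt;nfc_digest&lt;/code&gt; and the sample strings are invented for this example.&lt;/p&gt;

```python
import hashlib
import unicodedata

composed = "caf\u00e9"        # e-acute as one code point
decomposed = "cafe\u0301"     # "e" followed by a combining acute accent


def digest(text: str) -> str:
    """SHA-256 of the raw UTF-8 bytes, with no normalization."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def nfc_digest(text: str) -> str:
    """SHA-256 of the UTF-8 bytes after NFC normalization."""
    return digest(unicodedata.normalize("NFC", text))


raw_match = digest(composed) == digest(decomposed)          # bytes differ
nfc_match = nfc_digest(composed) == nfc_digest(decomposed)  # same after NFC

print(raw_match, nfc_match)
```

&lt;p&gt;Without the normalization step, a spec that was merely re-saved by an editor using a different unicode form would look tampered with, which is why the rule belongs in the canonicalizer rather than in reviewer discipline.&lt;/p&gt;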

&lt;h2&gt;
  
  
  The CI moment
&lt;/h2&gt;

&lt;p&gt;The point of all of this is the moment a teammate edits the spec after lock. Maybe they have a good reason. Maybe they don't. Either way, you want the system to notice.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/eval.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify accuracy claim&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;falsify check falsify.yaml --result-file results.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If anyone touches &lt;code&gt;falsify.yaml&lt;/code&gt; after the lock, the step exits with code 3, the job fails, and (with the check marked as required) the PR cannot merge. The lie is blocked by a hash comparison, not by trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned in three days
&lt;/h2&gt;

&lt;p&gt;A few things surprised me while building this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAML canonicalization is most of the value.&lt;/strong&gt; I spent way more time on the canonicalizer than on anything else. Every "clever" optimization I tried later turned out to be a place where two meaningfully different specs could produce the same hash. Boring is correct.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exit codes are an API.&lt;/strong&gt; I almost shipped with just &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;. Splitting "falsified" from "tampered" was the single biggest jump in how teams reacted to it. People immediately understood the difference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One file is a feature.&lt;/strong&gt; I kept resisting the urge to split it into a package. Auditors and skeptical SREs read single-file Python CLIs in one sitting. They do not read packages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dogfooding is non-negotiable.&lt;/strong&gt; falsify locks its own test claims with falsify. The honesty badge on the README is generated by the tool itself, on its own metrics. If a tool that locks claims cannot lock its own, why would you trust it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agents change what one person can ship in a weekend.&lt;/strong&gt; I built this solo in three days with Claude Opus 4.7 in the loop — pair programming, eval generation, doc drafting, the whole pipeline. The 518 tests and the YAML canonicalizer corner cases would have been a two-week solo grind without it. The actual design decisions were still mine; the agent just made the cost of being thorough a lot lower.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;falsify
falsify init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/sk8ordie84/falsify" rel="noopener noreferrer"&gt;https://github.com/sk8ordie84/falsify&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;90-second demo: &lt;a href="https://youtu.be/vVZTNeak5PA" rel="noopener noreferrer"&gt;https://youtu.be/vVZTNeak5PA&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Site: &lt;a href="https://falsify.dev" rel="noopener noreferrer"&gt;https://falsify.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyPI: &lt;a href="https://pypi.org/project/falsify/" rel="noopener noreferrer"&gt;https://pypi.org/project/falsify/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Single file, MIT, Python 3.11+, stdlib plus pyyaml. If you ship any number followed by a percent sign, lock it before the experiment runs. It costs 30 seconds and saves the meeting where someone has to explain why the number changed.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I built a film camera simulator in a single HTML file: here's how</title>
      <dc:creator>sk8ordie84</dc:creator>
      <pubDate>Mon, 20 Apr 2026 18:51:08 +0000</pubDate>
      <link>https://forem.com/sk8ordie84/i-built-a-film-camera-simulator-in-a-single-html-file-heres-how-403b</link>
      <guid>https://forem.com/sk8ordie84/i-built-a-film-camera-simulator-in-a-single-html-file-heres-how-403b</guid>
      <description>&lt;p&gt;Launched today: faxoffice1987.com — 8 film cameras simulated in Canvas 2D.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The constraints I set myself:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One HTML file&lt;/li&gt;
&lt;li&gt;No build step, no dependencies, no npm install&lt;/li&gt;
&lt;li&gt;Runs offline from a USB drive&lt;/li&gt;
&lt;li&gt;No backend, no account, no uploads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The hard part:&lt;/strong&gt; per-pixel color science. Each film stock (Tri-X, Portra, Velvia, Neopan Acros) has its own render path. Not a filter on top, but a decision at the pixel level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanilla JS, Canvas 2D&lt;/li&gt;
&lt;li&gt;Cloudflare Pages + Functions (share links, license validation)&lt;/li&gt;
&lt;li&gt;Polar.sh for checkout&lt;/li&gt;
&lt;li&gt;localStorage for state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing experiment:&lt;/strong&gt; $29 one-time. No subscription. 1 camera free forever.&lt;/p&gt;

&lt;p&gt;Would love architecture feedback, especially on the color science approach.&lt;/p&gt;

&lt;p&gt;Link: &lt;a href="https://faxoffice1987.com" rel="noopener noreferrer"&gt;https://faxoffice1987.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>canvas</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
