Forem: sk8ordie84

"I implemented PRML in two languages. Three things broke that the spec didn't warn about." published: true

sk8ordie84 — Fri, 01 May 2026 19:50:42 +0000

PRML v0.1 is a small specification I drafted three weeks ago. It binds an ML evaluation claim — (metric, comparator, threshold, dataset hash, random seed, producer) — to a SHA-256 digest computed over canonical YAML bytes, before the experiment runs. The spec is at spec.falsify.dev/v0.1. The Python reference implementation is on GitHub. v0.2 freezes 2026-05-22.

A specification with one implementation is indistinguishable from that implementation's bugs. So this past weekend I sat down and built a second reference implementation, in Node.js, from scratch. The goal: take the prose spec, ignore the Python source, and produce byte-identical canonical bytes for all twelve v0.1 conformance vectors.

It worked. 12/12 vectors pass byte-for-byte. The implementation is 404 lines of JavaScript with zero runtime dependencies beyond the Node.js standard library. You can run it from impl/js/falsify.js.

What's interesting is what didn't work the first time. The exercise surfaced three quiet portability gotchas — places where the spec's prose and the spec's twelve vectors silently disagreed about what the bytes should be. Each of them is a real defect in the v0.1 specification, and each is now an action item for v0.2.

This post is the three findings.

Finding 1 — Sixty-four-bit integer precision

The first failing vector was TV-006: seed: 18446744073709551615. That's $2^{64} - 1$, the largest unsigned 64-bit integer the v0.1 spec allows for the seed field.

Naive Node.js parses this through JSON.parse into a Number. JavaScript's Number is IEEE-754 binary64. The largest integer you can safely represent in binary64 is $2^{53} - 1$, which is about $9 \times 10^{15}$. Above that, integers round to the nearest representable float.

So when Node.js read the test vector input file, the seed 18446744073709551615 quietly became the nearest representable double, $2^{64}$, which JavaScript serialises as 18446744073709552000: a decimal string $385$ larger than what the test vector said. The canonicalizer then dumped that wrong number, and the hash didn't match.
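The arithmetic can be checked from any language with exact integers. The sketch below uses Python's float as a stand-in for a JavaScript Number (both are binary64); the variable names are illustrative and come from neither implementation.

```python
SEED = 2**64 - 1                   # TV-006 seed: 18446744073709551615
MAX_SAFE_INTEGER = 2**53 - 1       # largest integer binary64 holds exactly

as_double = float(SEED)            # what a binary64-only parser stores
stored = int(as_double)            # exact value of that double: 2**64

# JavaScript prints the shortest digit string that round-trips to the same
# double and pads the rest with zeros; rebuild that string from Python's repr.
mantissa, exponent = repr(as_double).split("e")
digits = mantissa.replace(".", "")
printed = int(digits + "0" * (int(exponent) - len(digits) + 1))

print(SEED > MAX_SAFE_INTEGER)     # True
print(stored - SEED)               # 1
print(printed)                     # 18446744073709552000
print(printed - SEED)              # 385
```

The stored double is off by one; the gap of 385 only appears once that double is written back out as decimal text.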

The same problem hits Go (int64, $2^{63} - 1$ ceiling), Java (same), and any other language whose default integer type isn't unbounded.

Language	Native integer ceiling	TV-006 round-trips?
Python 3	unbounded	yes
JavaScript Number	$2^{53} - 1$	no
Go `int64`	$2^{63} - 1$	no
Java `long`	$2^{63} - 1$	no
Rust `u64`	$2^{64} - 1$	yes

The PyYAML-based Python reference implementation works only because Python's int is arbitrary-precision. The spec did not mention this, anywhere.

The fix in the Node.js implementation: parse the JSON text with a regex that wraps any 16-or-more-digit integer in a sentinel string before JSON.parse sees it, then unwrap to BigInt after parse. Twenty lines of JavaScript that no spec reader could have predicted from the prose.

The fix for v0.2: make seed a quoted decimal string in the canonical form: seed: '18446744073709551615'. Languages with weak integer types now get a string and can opt into BigInt themselves. The format is unambiguous from the bytes alone.
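A reader of the proposed string form might parse it along these lines. This is a sketch only: parse_seed is a hypothetical helper, and the "no sign, no leading zeros" rule is an assumption made here to keep one spelling per value, not something the v0.1 text states.

```python
import re

UINT64_MAX = 2**64 - 1
# ASSUMPTION: canonical decimal means no sign and no leading zeros
_DECIMAL = re.compile(r"(0|[1-9][0-9]*)")

def parse_seed(text: str) -> int:
    """Parse a seed carried as a decimal string into an exact integer."""
    if not isinstance(text, str) or _DECIMAL.fullmatch(text) is None:
        raise ValueError(f"seed must be a canonical decimal string, got {text!r}")
    value = int(text)              # Python ints are exact at any size
    if value > UINT64_MAX:
        raise ValueError("seed exceeds 2**64 - 1")
    return value

print(parse_seed("18446744073709551615") == UINT64_MAX)  # True
```

In a language without unbounded integers, the same shape applies with the int() call replaced by that language's big-integer or unsigned 64-bit parser.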

Finding 2 — Integer-valued floats lose their type

The next failing vector was TV-008: a manifest with threshold: 1.0.

The expected canonical bytes contain threshold: 1.0. The actual produced bytes contain threshold: 1. The hash differed. This bothered me for ten minutes.

It turns out: when JSON parsers encounter 1.0 in a JSON document, almost all of them lose the float-ness. JavaScript's JSON.parse returns Number(1), indistinguishable at runtime from the integer 1. When a YAML emitter then takes that number and serialises it, it has no signal that the producer wrote 1.0 rather than 1. So it emits 1. The hash drifts.

PyYAML doesn't have this problem because PyYAML's load-and-dump cycle uses Python's native float type, which round-trips through 1.0 cleanly. JavaScript's Number cannot.

This is a property of how JSON is usually parsed. JSON has a single number type, so most parsers do not distinguish integer-valued floats from integers. The information is destroyed at parse time, before any canonicalizer runs.

The fix in the Node.js implementation: a small "this field should always render as a float" set, currently containing one element: {'threshold'}. The canonicalizer checks the field name and forces .0 when the value is integer-valued. A field-specific hack.

The fix for v0.2: specify that threshold always renders with at least one decimal place in the canonical form. Two lines in the spec close it. No field-aware emitter logic required.
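As a rough illustration of the proposed rule, a renderer could force the decimal point regardless of whether the parser handed it an integer or a float. render_threshold is a hypothetical name, and rejecting exponent notation is an assumption of this sketch, since the post does not say how very large or very small thresholds should be written.

```python
import math

def render_threshold(value) -> str:
    """Render a numeric threshold so it always carries a decimal point."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"threshold must be a number, got {type(value).__name__}")
    as_float = float(value)
    if not math.isfinite(as_float):
        raise ValueError("threshold must be finite")
    text = repr(as_float)          # shortest round-trip form, e.g. '1.0', '0.85'
    if "e" in text:
        # ASSUMPTION: exponent notation is out of scope for this sketch
        raise ValueError(f"threshold {text} would need exponent notation")
    return text

print(render_threshold(1))     # 1.0
print(render_threshold(0.85))  # 0.85
```

Because the output depends only on the numeric value, an implementation whose parser collapsed 1.0 to 1 still emits the same bytes.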

Finding 3 — "Plain scalar" disagreements

The third failing case was the same vector, TV-008: comparator: ==.

The expected canonical bytes have comparator: ==. JavaScript's js-yaml library produced comparator: '==', single-quoted. SHA-256 is unforgiving; this difference produces a different hash.

YAML 1.1 and 1.2 both have a notion of "plain scalars": strings that don't need quotes because they contain no characters or patterns that would confuse the parser. A long list of rules governs whether a particular string can be plain: must not start with an indicator character (-, ?, :, ,, [, ], {, }, #, &, *, !, |, >, ', ", %, @, `), must not contain colon-space, must not look like a number/boolean/null/timestamp, must not have leading/trailing whitespace, etc.

PyYAML and js-yaml implement this predicate with subtly different conservatism. PyYAML accepts == as a plain scalar because none of the rules fire — there is no indicator character, no number resolution, no timestamp pattern. js-yaml is more defensive: it sees a string that could be confusing and quotes it.

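For >= and >, both libraries quote, because the leading > is in the indicator set. The leading < of <= and < is not in that list, and the two libraries evidently agree on those as well, since the vectors that use them passed. Only == is special, and only == differs.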

The fix in the Node.js implementation: I rewrote the plain-scalar predicate from scratch, in about fifty lines, matching PyYAML's behaviour. It checks for indicator-prefix, leading/trailing whitespace, colon-space and hash-space, number-resolution regex, boolean/null set, timestamp regex, and control-character escape. With this hand-rolled predicate, TV-008 reproduces.
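To give a sense of the shape of such a predicate, here is a conservative sketch in Python. It is not PyYAML's logic and not the code in impl/js/falsify.js: the name can_be_plain, the regexes, and the reserved-word set are all assumptions, and it treats every indicator character in the list above as a forbidden first character, which is stricter than YAML itself.

```python
import re

# Indicator characters as listed in the post; forbidding all of them as a
# first character is deliberately conservative.
INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")
RESERVED_WORDS = {"", "~", "null", "true", "false", "yes", "no", "on", "off"}
NUMBER = re.compile(r"[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?")
TIMESTAMP = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}([Tt ].*)?")

def can_be_plain(text: str) -> bool:
    """Return True if text may be emitted without quotes under these rules."""
    if text.lower() in RESERVED_WORDS:          # empty, null, booleans
        return False
    if text[0] in INDICATORS:                   # indicator prefix
        return False
    if text != text.strip():                    # leading/trailing whitespace
        return False
    if ": " in text or " #" in text or text.endswith(":"):
        return False                            # colon-space, hash-space
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in text):
        return False                            # control characters
    if NUMBER.fullmatch(text) or TIMESTAMP.fullmatch(text):
        return False                            # would resolve to a non-string
    return True

print(can_be_plain("=="))        # True
print(can_be_plain(">="))        # False
print(can_be_plain("1.0"))       # False
print(can_be_plain("accuracy"))  # True
```

Every branch here is a decision a second implementer has to guess at, which is the argument for putting the rule in the spec rather than in an emitter.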

The fix for v0.2: publish a formal canonicalization grammar. Or, simpler and aggressive: drop the plain-scalar concept entirely. Always single-quote every string scalar in the canonical form. The output is ~10% larger; the ambiguity surface is zero. No predicate needed; no second implementation reverse-engineering an emitter.
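The always-quote alternative is small enough to sketch in full. In YAML's single-quoted style the only escape is a doubled single quote; quote_scalar is a hypothetical name, and rejecting control characters is an assumption of this sketch because the post does not say how v0.2 would handle them.

```python
def quote_scalar(text: str) -> str:
    """Emit text as a YAML single-quoted scalar; ' is escaped by doubling."""
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in text):
        # ASSUMPTION: single-quoted style cannot escape control characters
        # or line breaks, so this sketch rejects them outright
        raise ValueError("control characters need a rule the post does not give")
    return "'" + text.replace("'", "''") + "'"

print(quote_scalar("=="))    # '=='
print(quote_scalar("it's"))  # 'it''s'
```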

What this exercise really proves

It does not prove that PRML is bulletproof. It proves that PRML is implementable in a second language — which, at the v0.1 stage, was not yet established. A specification existing in only one implementation is indistinguishable from that implementation's bugs. PRML is now demonstrably more than that.

It also does not prove that all PyYAML edge cases are covered. The Node.js implementation matches the twelve current vectors, which exercise specific cases. Adding new vectors (Unicode normalisation, control characters, very long strings, unusual line-folding) might reveal further divergences.

The general lesson: a content-addressed format has to be specified in terms of the bytes it produces, not in terms of the emitter that produces them. PyYAML's safe_dump is a stable, careful, twenty-year-old emitter. It is not a specification. The next time someone wants to write a content-addressed YAML format — for SBOMs, for build provenance, for AI evaluation claims, anything — write the canonicalization grammar first, and then implement it. Don't describe an emitter; describe bytes.

v0.2 action items, summarised

The findings translate to three concrete v0.2 specification changes:

seed is a quoted decimal string. Closes 64-bit integer precision portability.
threshold always renders with at least one decimal place. Closes integer-valued float type loss.
Always-quoted string scalars. Eliminates the plain-scalar predicate ambiguity entirely.

Plus a fourth, broader change:

Publish a formal canonicalization grammar in ABNF. With the always-quoted rule, the grammar is short — about forty production rules. It becomes the source of truth for conformance, replacing the implicit "PyYAML's behaviour" reference.

The full v0.2 roadmap, including six other items (algorithm agility, tolerance, multi-claim manifests, mandatory signatures for high-risk Annex III, twelve new conformance vectors, sidecar format extension), is at spec/v0.2/ROADMAP.md. The freeze is targeted for 2026-05-22, three weeks from this writing, and the five open RFC questions in the roadmap are the parts where outside opinion would carry the most weight.

How to read along

If you want to see the artefacts directly:

The Node.js implementation: impl/js/falsify.js — 404 LOC, MIT.
The portability findings document: spec/analysis/canonicalization-portability-v0.1.md.
The conformance suite: spec/test-vectors/v0.1/ — JSON, twelve entries with locked digests.
The v0.1 spec: spec.falsify.dev/v0.1.
The arXiv preprint (working draft): spec/paper/ — 14-page LaTeX, CC BY 4.0.
Public review thread: GitHub Discussion #6.

If you want to add a third implementation in a third language — Rust, Go, Java, Swift, OCaml — the test vectors are the contract. If your canonicalizer reproduces all twelve byte-for-byte, your implementation is conformant. Open a PR; I'll add it.

— Studio-11 (independent), hello@studio-11.co

Why ML accuracy numbers are unfalsifiable, and what a 1287-line Python tool does about it

sk8ordie84 — Fri, 01 May 2026 13:46:43 +0000

A few weeks ago I was reading a model card for an open-weight code model. It claimed pass@1 = 67% on HumanEval. I tried to reproduce it. I got 54%.

I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible.

Except: which version of HumanEval? The original 164 problems, or the de-contaminated 161? What temperature? What seed for nucleus sampling? What was the threshold the team committed to before they ran the eval, and how do I know the published 67% is not the best of three runs at three temperatures?

I read the paper. I read the README. I read the eval harness source. I could not answer any of those questions from the published artifacts. I could only ask the authors, and they could only tell me what they remembered. And I had no way to distinguish what they remembered from what they wished they had done.

This is not a problem about that specific model card or those specific authors. It is a problem about every published ML accuracy number I have ever read.

Five failure modes that current reporting practices cannot detect

A claim like "our model achieves 91.3% accuracy on benchmark X" can be wrong, in published form, in at least these five ways, none of which leave a forensic trace:

Threshold drift. The team picked the threshold after running the experiment, by looking at where their model happened to land, and reported that as if it was the original target.
Slice selection. The evaluation set was filtered after results were observed (e.g., dropping the 12 hardest examples because "they were mislabeled").
Silent re-runs. Five seeds were tried; only the seed that passed was reported.
Metric ambiguity. "F1" without specifying micro vs macro. "Accuracy" without specifying top-k. "Pass@1" without specifying temperature.
Dataset drift. The benchmark hosted at the canonical URL changed between the experiment date and the publication date, and the team did not pin the bytes.

Each of these is consistent with current best-practice reporting. Each leaves the published number unfalsifiable: a reader cannot, even in principle, distinguish honest reporting from any of the above.

Why no infrastructure exists

Pre-registration solved this exact problem in adjacent fields:

Clinical trials, in 2007, with ClinicalTrials.gov.
Psychology, in 2013, with Open Science Framework.
Economics, the same year, with the AEA registry.

ML never got the equivalent. The closest thing — the ML Reproducibility Challenge — is an annual peer-driven effort to re-run published experiments. It produces excellent post-hoc analysis but does not change the publication-time commitment surface.

The 2026 regulatory window is the part that matters most for builders. The EU AI Act Article 12 requires automatic logging of evaluation events for high-risk systems. Article 18 requires 10-year retention. Both enter into force on August 2, 2026. NIST AI RMF references content-addressed audit trails as a recommended control. ISO/IEC 42001:2023 mandates documented information practices that PRML directly satisfies.

In other words: there is now a regulatory deadline by which "we have a tradition of reporting these numbers honestly" stops being a sufficient answer.

PRML in plain English

I drafted a small format, working draft v0.1, currently under public review. It is called PRML — Pre-Registered ML Manifest. The whole spec fits in a single YAML schema:

version: "prml/0.1"
claim_id: "01900000-0000-7000-8000-000000000000"
created_at: "2026-05-01T12:00:00Z"
metric: "accuracy"
comparator: ">="
threshold: 0.85
dataset:
  id: "imagenet-val-2012"
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
seed: 42
producer:
  id: "studio-11.co"

That is the entire required surface. Nine top-level keys. Plain text. UTF-8. YAML 1.2 strict subset (block style only, lexicographic key ordering, no comments, no flow collections).

The format defines a deterministic canonicalization. Given any logical YAML mapping with these fields, there is exactly one canonical UTF-8 byte sequence. The SHA-256 of those bytes is the manifest hash.

The hash is published before the experiment runs. After the experiment, an independent verifier can:

Re-canonicalize the manifest.
Recompute SHA-256.
Compare against the published sidecar hash. If they differ, the manifest has been edited post-lock — exit code 3 (TAMPERED).
Load the dataset by its content hash. Verify byte integrity.
Run the metric computation under the seed. Compare against threshold.
Emit 0 (PASS), 10 (FAIL), or one of the diagnostic codes.
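The hash check and the threshold comparison from the steps above can be sketched as follows. This is not the falsify source: sorted-key JSON stands in for the PRML canonical YAML form, the dataset integrity step is left out, and the function names are hypothetical. Only the exit codes 0, 10, and 3 are taken from the text.

```python
import hashlib
import json
import operator

PASS, TAMPERED, FAIL = 0, 3, 10          # exit codes named in the post
COMPARATORS = {
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt, "==": operator.eq,
}

def stand_in_canonical_bytes(manifest: dict) -> bytes:
    # ASSUMPTION: sorted-key JSON stands in for the PRML canonical YAML form
    return json.dumps(manifest, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def manifest_hash(manifest: dict) -> str:
    return hashlib.sha256(stand_in_canonical_bytes(manifest)).hexdigest()

def verify(manifest: dict, locked_hash: str, observed: float) -> int:
    """Check the lock first, then compare the observed metric to the threshold."""
    if manifest_hash(manifest) != locked_hash:
        return TAMPERED                  # edited after lock: no verdict at all
    compare = COMPARATORS[manifest["comparator"]]
    return PASS if compare(observed, manifest["threshold"]) else FAIL

manifest = {"metric": "accuracy", "comparator": ">=", "threshold": 0.87, "seed": 42}
locked = manifest_hash(manifest)
print(verify(manifest, locked, 0.876))                         # 0
print(verify(manifest, locked, 0.861))                         # 10
print(verify({**manifest, "threshold": 0.88}, locked, 0.876))  # 3
```

The ordering matters: the tamper check runs before any comparison, so an edited manifest never yields a PASS or FAIL.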

There is no trust in the producer required at verification time. Anyone with the manifest, the dataset, and the model can reproduce the verdict offline.

Honest amendments — "we found 12 mislabeled examples and re-ran" — do not overwrite. They append. Each new manifest carries a prior_hash field pointing to the manifest it amends. The chain is the audit log. When a regulator or reviewer asks "what was committed when?", the answer is one hash, and from that hash the entire history is recoverable.
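Recovering the history from one hash amounts to walking the prior_hash links. The sketch below assumes a hypothetical in-memory store that maps each manifest hash to its parsed manifest; how manifests are actually stored and fetched is not described in the post.

```python
def history(head_hash: str, store: dict[str, dict]) -> list[str]:
    """Return manifest hashes from newest to oldest by following prior_hash."""
    chain, seen, current = [], set(), head_hash
    while current is not None:
        if current in seen:
            raise ValueError("cycle in prior_hash chain")
        if current not in store:
            raise KeyError(f"manifest {current} is missing from the store")
        seen.add(current)
        chain.append(current)
        # The original manifest has no prior_hash, which ends the walk
        current = store[current].get("prior_hash")
    return chain

# Toy store: short labels stand in for real SHA-256 digests
store = {
    "aaa": {"threshold": 0.87},
    "bbb": {"threshold": 0.87, "prior_hash": "aaa"},
    "ccc": {"threshold": 0.86, "prior_hash": "bbb"},
}
print(history("ccc", store))  # ['ccc', 'bbb', 'aaa']
```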

A worked example with the reference implementation

The reference implementation is a single-file Python CLI called falsify, MIT-licensed, 1287 lines. Install it the usual way:

pip install falsify

Initialize a claim:

falsify init imagenet-87

This writes .falsify/imagenet-87/spec.yaml with the required PRML fields as placeholders. Edit the file with your real values:

version: "prml/0.1"
claim_id: "01900000-0000-7000-8000-000000000010"
created_at: "2026-05-01T14:00:00Z"
metric: "accuracy"
comparator: ">="
threshold: 0.87
dataset:
  id: "imagenet-val-2012"
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
seed: 42
producer:
  id: "your-org.example"

Lock it:

$ falsify lock imagenet-87
locked: yes (sha256:1a3466cc08ee, locked_at 2026-05-01T14:00:00Z)

Now the spec is hash-bound. If anyone — including you — edits the YAML, the next falsify verify exits 3 and refuses to produce a verdict.

Run the experiment, capture the metric value (let us say 0.876), and verify:

$ falsify verify imagenet-87 --observed 0.876
PASS  metric=accuracy observed=0.876 >= threshold=0.87
exit 0

If the team had silently raised the threshold to 0.88 after seeing the result:

$ falsify verify imagenet-87 --observed 0.876
TAMPERED  spec hash drift detected
recorded: 1a3466cc08ee...
current:  7b2c9a5d1e4f...
exit 3

The CI pipeline halts. The deploy does not happen. There is no judgment call.

How do you know the canonicalization actually works?

The most reasonable skeptical question about a content-addressed format is: what guarantees that two implementations produce the same canonical bytes for the same input?

For v0.1 we publish 12 conformance test vectors. Each vector defines:

An input manifest (logical YAML, key order irrelevant).
The exact UTF-8 byte sequence the canonicalizer must produce.
The exact lowercase-hex SHA-256 of those bytes.

The vectors exercise:

Test	Property
TV-001	Minimal valid manifest
TV-002	Key-ordering invariance — random insertion order produces same hash
TV-003	Single-character sensitivity: `0.85` vs `0.86` produces a different hash
TV-004	Optional fields populated (`model.id`, `model.hash`, `dataset.uri`)
TV-005	Unicode handling in `producer.id`
TV-006	Maximum seed value (`2⁶⁴ − 1`)
TV-007	Minimum seed (`0`)
TV-008	Equality comparator with integer-valued threshold
TV-009	Amendment with `prior_hash` linkage
TV-010	`pass@k` metric for code generation
TV-011	AUROC with strict comparator
TV-012	Regression metric with `<=` comparator

A new implementation in Rust, Go, or TypeScript is conformant only if it reproduces all 12 vectors exactly. The reference implementation has 28 unittest assertions in CI that lock in the v0.1 hash contract; any code change that breaks a vector forces a v0.2 spec bump.
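A conformance check against one vector might look like the sketch below. The field names input, canonical_bytes, and sha256 are placeholders, not the actual layout of the files in the conformance suite, and the canonicalizer is passed in so the same check works for any implementation.

```python
import hashlib
from typing import Callable

def check_vector(vector: dict, canonicalize: Callable[[dict], bytes]) -> list[str]:
    """Return a list of mismatch descriptions; an empty list means conformant."""
    problems = []
    produced = canonicalize(vector["input"])
    if produced != vector["canonical_bytes"]:
        problems.append("canonical bytes differ")
    if hashlib.sha256(produced).hexdigest() != vector["sha256"]:
        problems.append("digest differs")
    return problems

# Toy vector and a toy one-field canonicalizer, for illustration only
expected = b"threshold: 1.0\n"
vector = {
    "input": {"threshold": 1},
    "canonical_bytes": expected,
    "sha256": hashlib.sha256(expected).hexdigest(),
}
toy = lambda m: ("threshold: " + repr(float(m["threshold"])) + "\n").encode("utf-8")
print(check_vector(vector, toy))  # []
```

Comparing bytes as well as digests is redundant for a pass, but on a failure the byte diff shows where the output diverged.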

What it is not

PRML does not establish whether a claimed metric is correct, fair, or sufficient. It establishes only that the claim was committed before it was tested. Specifically:

Not a model card replacement. PRML manifests sit underneath model cards as the cryptographic floor.
Not a benchmark. PRML does not pick metrics for you.
Not a reproducibility framework. PRML does not ship code or data.
Not a tool. PRML is a format. falsify is one implementation. A second implementation in any language passes if it reproduces the test vectors.
Not a compliance product. It is a primitive that makes named regulatory obligations satisfiable with arithmetic verification rather than process attestation.

What it costs

The cost of adopting PRML at the experiment level is one hash function call. SHA-256 is FIPS 180-4, available in every standard library written since 2002. The format is UTF-8 plain text, readable in 2046 by any tool that can read text.

The cost of not adopting it scales with deployment scope. For a personal project, zero. For a research paper, growing pressure as reviewers begin to ask. For a product subject to EU AI Act Annex III obligations, measurable in regulatory exposure plus legal review hours. For a foundation model that will be cited in safety cases for a decade, the cost is roughly the credibility of every accuracy claim you have ever shipped.

What I am asking for

This is a working draft. v0.2 freeze is targeted 2026-05-22. Three concrete asks:

Format review. Is the canonical serialization in §3 of the spec unambiguous? Are there YAML 1.2 edge cases the spec misses?
Threat-model gaps. §6 of the spec enumerates six adversaries. What is missing?
Compliance correctness. The AI Act mapping ties PRML fields to Articles 12, 17, 18, 50, 72, and 73. Compliance lawyers and engineers in EU AI Act-adjacent roles: are the bindings defensible?

Discussion thread: github.com/sk8ordie84/falsify/discussions/6.

Tl;dr

Most published ML accuracy numbers are unfalsifiable in practice.
A small spec (nine top-level keys, one hash function, one canonical serialization) gives published claims a cryptographic floor.
Reference implementation in Python, MIT, single file. Spec under CC BY 4.0.
v0.2 freeze in 3 weeks. Reviews, ambiguity reports, threat-model critiques are wanted.

Spec: spec.falsify.dev/v0.1
Code: github.com/sk8ordie84/falsify
Discussion: github.com/sk8ordie84/falsify/discussions/6

I built a CLI that hashes your ML accuracy claims before the experiment runs

sk8ordie84 — Wed, 29 Apr 2026 07:33:37 +0000

Last month, a customer told me our model's accuracy on their data was 71%, not the 94% we had shipped on the landing page.

I went back to the eval notebook. The threshold was still 0.94. The test set was named the same thing. But somewhere in the last three weeks, somebody had "refreshed" the test set, somebody else had tightened the metric definition, and the original 94% was now unreproducible. Not anybody's fault, exactly — just nobody had written down the contract before running the experiment.

That night I started building falsify. Three days later I shipped it.

This post is what I built, why I built it that small, and the one Python function that does most of the work.

The problem in one sentence

If you can change the spec after seeing the result, your accuracy claim is not falsifiable. And if it is not falsifiable, it is not really a claim — it is marketing.

Psychology and medicine figured this out the hard way and invented pre-registration. You write down the prediction, the threshold, and the analysis plan, hash it, timestamp it, and you cannot move it later without everyone knowing.

ML never adopted any of this. A git commit is the closest thing most teams have, and git commit --amend followed by a force-push will quietly erase the receipt.

So I wrote a CLI that does the smallest possible version of pre-registration: canonicalize a YAML spec, SHA-256 it, lock the hash, and refuse to let it move.

What "the smallest possible version" actually looks like

# falsify.yaml
claim:
  metric: accuracy
  threshold: 0.94
  dataset: customer_eval_v3
  dataset_sha256: 4f1a8b2c...
  model: ranker-7b-2026q1
  test_n: 1200
created_at: 2026-04-28T19:45:00Z

That is the contract. The CLI workflow is three commands:

pip install falsify
falsify lock falsify.yaml      # writes a .lock file with the hash
falsify check falsify.yaml --result actual_accuracy=0.91

Exit codes are the API:

0 — claim verified
10 — claim falsified (you missed the threshold, but cleanly)
3 — tamper detected (someone edited the spec after lock)
11 — spec invalid

10 and 3 being different exit codes is the whole point. "We didn't hit the number" is a different thing from "we moved the number."
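A wrapper script can lean on that split directly. The four codes below are the ones listed above; the enum, the triage function, and the follow-up wording are illustrative and not part of falsify.

```python
from enum import IntEnum

class ExitCode(IntEnum):
    VERIFIED = 0
    TAMPERED = 3
    FALSIFIED = 10
    SPEC_INVALID = 11

def triage(returncode: int) -> str:
    """Suggest a follow-up for a falsify exit status (wording is illustrative)."""
    actions = {
        ExitCode.VERIFIED: "claim held: proceed",
        ExitCode.FALSIFIED: "threshold missed: report the miss as-is",
        ExitCode.TAMPERED: "spec changed after lock: find out who moved it",
        ExitCode.SPEC_INVALID: "spec invalid: fix the spec before locking",
    }
    try:
        return actions[ExitCode(returncode)]
    except ValueError:
        # Any status outside the documented set is handled as a failure
        return "unknown status: treat as failure"

print(triage(10))  # threshold missed: report the miss as-is
print(triage(3))   # spec changed after lock: find out who moved it
```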

The one function that matters

The reason this works at all is YAML canonicalization. JSON looks canonical but isn't — key order, whitespace, and unicode forms can all drift while the document stays "the same." YAML is worse by default, but easy to canonicalize once you commit to a few rules.

Here is the actual hashing function from the source. It is small on purpose:

import hashlib
import unicodedata
import yaml  # PyYAML

def canonical_sha256(spec_path: str) -> str:
    """Return SHA-256 of a canonicalized YAML spec.

    Canonicalization rules:
      - Parse the document, drop comments and anchors
      - Recursively sort all mapping keys
      - Normalize all strings to NFC unicode
      - Re-emit as UTF-8 with LF line endings, no trailing whitespace
      - Hash the bytes
    """
    with open(spec_path, "rb") as f:
        data = yaml.safe_load(f)

    def normalize(node):
        if isinstance(node, dict):
            return {
                unicodedata.normalize("NFC", k): normalize(v)
                for k, v in sorted(node.items())
            }
        if isinstance(node, list):
            return [normalize(x) for x in node]
        if isinstance(node, str):
            return unicodedata.normalize("NFC", node)
        return node

    canonical = yaml.safe_dump(
        normalize(data),
        sort_keys=True,
        allow_unicode=True,
        default_flow_style=False,
        line_break="\n",
    ).encode("utf-8")

    return hashlib.sha256(canonical).hexdigest()

That is the entire trust primitive. Everything else in the 3925-line file — the lock file format, the CI integration, the tamper detection, the schema validation — is plumbing around this one function.

The reason it has to be exactly this strict: any wiggle room (key order, trailing whitespace, BOM, unicode form) is a place where someone can quietly change the spec and produce a "matching" hash. Canonicalize once, hash once, never look back.
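The unicode-form case is the least visible of these, so here is a small standard-library illustration. It is separate from the function above: two strings that render identically hash differently until both are normalized to NFC.

```python
import hashlib
import unicodedata

composed = "caf\u00e9"       # é as a single code point (NFC form)
decomposed = "cafe\u0301"    # e followed by a combining acute accent (NFD form)

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

raw_match = sha256_hex(composed) == sha256_hex(decomposed)
nfc_match = sha256_hex(nfc(composed)) == sha256_hex(nfc(decomposed))

print(raw_match)  # False
print(nfc_match)  # True
```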

The CI moment

The point of all of this is the moment a teammate edits the spec after lock. Maybe they have a good reason. Maybe they don't. Either way, you want the system to notice.

# .github/workflows/eval.yml
- name: verify accuracy claim
  run: |
    falsify check falsify.yaml --result-file results.json

If anyone touches falsify.yaml after the lock, the action exits with code 3 and, as long as the check is required for merging, the PR cannot merge. The lie is blocked by a hash check in CI, not by trust.

What I learned in three days

A few things surprised me while building this:

YAML canonicalization is most of the value. I spent way more time on the canonicalizer than on anything else. Every "clever" optimization I tried later turned out to be a place where two semantically different YAMLs produced the same hash. Boring is correct.
Exit codes are an API. I almost shipped with just 0 and 1. Splitting "falsified" from "tampered" was the single biggest jump in how teams reacted to it. People immediately understood the difference.
One file is a feature. I kept resisting the urge to split it into a package. Auditors and skeptical SREs read single-file Python CLIs in one sitting. They do not read packages.
Dogfooding is non-negotiable. falsify locks its own test claims with falsify. The honesty badge on the README is generated by the tool itself, on its own metrics. If a tool that locks claims cannot lock its own, why would you trust it?
Agents change what one person can ship in a weekend. I built this solo in three days with Claude Opus 4.7 in the loop — pair programming, eval generation, doc drafting, the whole pipeline. The 518 tests and the YAML canonicalizer corner cases would have been a two-week solo grind without it. The actual design decisions were still mine; the agent just made the cost of being thorough a lot lower.

Try it

pip install falsify
falsify init

Repo: https://github.com/sk8ordie84/falsify
90-second demo: https://youtu.be/vVZTNeak5PA
Site: https://falsify.dev
PyPI: https://pypi.org/project/falsify/

Single file, MIT, Python 3.11+, stdlib plus pyyaml. If you ship any number followed by a percent sign, lock it before the experiment runs. It costs 30 seconds and saves the meeting where someone has to explain why the number changed.

I built a film camera simulator in a single HTML file: here's how

sk8ordie84 — Mon, 20 Apr 2026 18:51:08 +0000

Launched today: faxoffice1987.com — 8 film cameras simulated in Canvas 2D.

The constraints I set myself:

One HTML file
No build step, no dependencies, no npm install
Runs offline from a USB drive
No backend, no account, no uploads

The hard part: per-pixel color science. Each film stock (Tri-X, Portra, Velvia, Neopan Acros) has its own render path. Not a filter on top, but a decision at the pixel level.

Stack:

Vanilla JS, Canvas 2D
Cloudflare Pages + Functions (share links, license validation)
Polar.sh for checkout
localStorage for state

Pricing experiment: $29 one-time. No subscription. 1 camera free forever.

Would love architecture feedback, especially on the color science approach.

Link: https://faxoffice1987.com