Forem: Chester Guan （Ziyuan Guan）

How synthetic data actually performs

Chester Guan （Ziyuan Guan） — Thu, 14 May 2026 17:08:47 +0000

Originally published at prometheno.org.

Now let's think together. In "The clinical-truth gap"
I said clinical-truth verification belongs in medical empiricism. A
fair objection: why bother with real-patient infrastructure at all when
synthetic data exists? The pitch behind tools like Synthea, MDClone,
Syntegra, and mostly.ai: generate fake patients with the statistical
properties of real ones, train models on those, ship.

The honest answer is to look at how synthetic data actually performs,
not how it's pitched.

What synthetic does well

Three uses where it earns its place.

Pipeline testing. No PHI, no HIPAA review, no consent overhead.
Engineers stress-test ingestion code, validate FHIR mappings, exercise
edge cases. Synthea, the MITRE-developed open-source generator, was
built explicitly for this¹ and is widely used for testing and
demonstration in US health-IT projects.

Training augmentation. For rare conditions where real-data samples
are clinically inadequate, synthetic supplementation lifts model
performance measurably. A 2024 study on rare thyroid cancer subtype
classification used text-guided diffusion to generate synthetic images
and improved subtype-classification AUC from 0.7364 to 0.8442². The
gain came from hybrid training. Synthetic + real beat real alone.

Aggregate statistical research. Questions like what's the average
HbA1c trajectory or what's the comorbidity prevalence often produce
similar answers on synthetic and real data, with no individual-level
exposure. A JMIR comparison study of five MDClone-generated cohorts
against their real counterparts found the analyses "provide a close
estimate of real data results in general," with caveats depending on
the patient-to-variable ratio³.

That's a real value proposition. The series doesn't dismiss it.

What the benchmarks show

Three places the numbers cut against synthetic-as-substitute.

Rare-event performance plateaus. SHEPHERD — a Harvard/Zitnik-lab
model trained on 40,000+ synthetic patients across 2,134 rare diseases
— achieved 40% top-1 accuracy in causal gene discovery when
evaluated against the real-world Undiagnosed Diseases Network
cohort⁴. Forty percent is useful as a triage tool. It is not
clinical-grade. The gap between synthetic-trained performance and
real-world ground truth is precisely the gap synthetic data can't close
on its own.

[Figure: a two-column decision matrix. "Synthetic suffices": pipeline testing, training augmentation, aggregate research, hypothesis generation. "Real data required": regulatory submission, outcome verification, AI accountability, rare-event prediction. Caption: Synthetic data does real work in the left column. The right column is what HAVEN's real-patient infrastructure exists to serve.]

Hybrid almost always wins, and hybrid needs real data. Across
healthcare AI benchmarks, models trained on synthetic + real outperform
either alone. An AMD fundus-image study using ResNet-18 reached 85%
accuracy when combined data was used — outperforming the same
architecture trained on synthetic-only by a clinically meaningful
margin⁵. The destination is rarely synthetic. It's augmentation.

Privacy isn't as clean as advertised. Membership-inference attacks
against synthetic health data work. A 2022 Journal of Biomedical
Informatics analysis (since extended by multiple 2024 papers)
demonstrated attackers can infer with substantial confidence whether a
specific real patient's record was used to generate a synthetic
cohort⁶. The re-identification risk rises for unique cases (older
patients, rare conditions), which is exactly the population synthetic
data is most often used for. Differential privacy mitigates this, but
only at meaningful utility cost.
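To make the utility cost concrete, here is a minimal sketch in a much simpler setting than synthetic-cohort generation: releasing a single patient count with the Laplace mechanism. The function names are illustrative, and plain floating-point sampling like this is not hardened for production use. The point is only the trade-off: the noise scale is 1/epsilon, so a tighter privacy budget means a noisier answer.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count under epsilon-differential privacy (Laplace mechanism).

    A counting query changes by at most 1 when one patient is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


def expected_abs_error(epsilon: float) -> float:
    """Mean absolute error of the Laplace mechanism on a count: 1 / epsilon."""
    return 1.0 / epsilon


if __name__ == "__main__":
    rng = random.Random(0)
    for eps in (1.0, 0.1, 0.01):
        noisy = private_count(120, eps, rng)
        print(f"epsilon={eps}: released {noisy:.1f}, expected |error| {expected_abs_error(eps):.0f}")
```

For a rare-condition cohort of a few dozen patients, an expected error of 10 or 100 on a count can swamp the signal, which is the same population where re-identification risk is highest.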

Where synthetic structurally can't go

Two categorical limits. Better generators don't fix them.

Real outcomes. A synthetic patient doesn't develop sepsis. Doesn't
survive their cancer. Doesn't die from heart failure five years later.
Synthetic outcome data is fictional — produced to match training
distributions, not real biology. For Prometheno's longer-term horizon
— paying or penalizing AI vendors when their predictions match or miss
reality — the outcome side cannot be synthetic. No algorithm turns
simulation into observation.

Regulatory ground truth. The FDA's Center for Devices and
Radiological Health issued updated real-world-evidence guidance in
December 2025⁷. The framework rests on observational data from actual
patients in actual care. Synthetic control arms have a defined pathway
as supplements to real evidence, not substitutes for it. The EMA's
position is similar. For any AI/ML medical device seeking clearance,
the path runs through real data.

What HAVEN does that synthetic can't

The strongest argument for HAVEN comes from accepting synthetic
data's strengths.

If synthetic-only training plateaus well below clinical-grade for rare
events, the path forward runs through hybrid models — and hybrid needs
governed real data. Consent, audit, and quality grading are what make
hybrid defensible at population scale.

If real outcomes can't be synthesized, AI accountability runs on real
outcome data. The infrastructure for collecting outcomes, tying them
to the predictions that preceded them, and attributing value back to
the contributing patients is what HAVEN's four primitives enable.

If membership-inference attacks compromise synthetic privacy claims,
the answer isn't to abandon real data. It's to govern access to real
data properly. Consent-attestation and hash-chained audit produce
traceability that de-identification alone never did.

Synthetic data is complementary. It strengthens the case for a
patient-sovereign protocol layer rather than replacing it.

What comes next

The next post commits to what would prove the whole argument wrong.

¹ Walonoski, J., et al. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25, no. 3 (2018): 230-238. Open-source, MITRE-maintained, used widely for testing and demonstration.
² Frontiers in Digital Health, "Synthetic data generation: a privacy-preserving approach to accelerate rare disease research" (2025). Text-guided diffusion produced synthetic images with 92.2% realism rate; hybrid training improved AUC from 0.7364 to 0.8442.
³ JMIR Medical Informatics 8, no. 2 (2020), "Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies." https://medinform.jmir.org/2020/2/e16492/
⁴ Alsentzer, E., et al. "Deep learning for diagnosing patients with rare genetic diseases." Zitnik Lab, Harvard. SHEPHERD model evaluated against the NIH Undiagnosed Diseases Network real-world cohort; published results show 40% top-1 accuracy on causal gene discovery.
⁵ npj Digital Medicine, "Generating high-fidelity synthetic patient data for assessing machine learning healthcare software" (2020). https://www.nature.com/articles/s41746-020-00353-9. ResNet-18 on AMD fundus images: 85% accuracy with combined real+synthetic data.
⁶ Hyeong, J., et al. "Membership inference attacks against synthetic health data." Journal of Biomedical Informatics 125 (2022). https://pmc.ncbi.nlm.nih.gov/articles/PMC8766950/. Extended by multiple 2024 papers including work on differentially private synthetic data and re-identification on tabular GANs.
⁷ U.S. Food and Drug Administration. Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices. Final guidance, December 16, 2025 (supersedes 2017 guidance). Real-World Data quality criteria emphasize relevance and reliability of observational data from actual patients in actual care.

The clinical-truth gap

Chester Guan （Ziyuan Guan） — Thu, 14 May 2026 17:00:04 +0000

Originally published at prometheno.org.

Now let's think together. In "The identity gap"
I said identity-proofing rides on existing institutions, not on crypto.
This post is the same shape, different regime: clinical truth rides on
existing medical practice. What HAVEN delivers is what the protocol
layer can deliver — quality grading at ingest. The clinical-truth
verification happens where it should: in medicine.

Two questions a record raises

A clinical record claims two things at once.

First — this byte sequence is the one that was written. Crypto answers
that. Hash matches, signature verifies, chain intact. Done.

Second — the byte sequence describes the patient's body. Glucose was
The diagnosis was correct. The procedure happened. Different question.
Not because crypto fails at it; because crypto isn't pointed at it.

Three concrete ways the second question goes wrong:

Wrong patient. Two MRNs swapped at intake. The lab value belongs to
someone else's blood. Signature, timestamp, chain — clean.

Wrong recording. The phlebotomist drew from a contaminated line. The
instrument read 247. The instrument was reading the IV bag.

Wrong interpretation. "Type 2 diabetes" assigned to a patient whose
elevated A1C was steroid-induced. The patient doesn't have diabetes.
The record says they do.

The chain is fine. The record is well-formed. The data is just wrong
about the body. Catching this is medicine's job, and medicine has been
doing it for centuries.

What HAVEN solves

HAVEN's contribution at this layer is the 3-Gate Quality Protocol
from §6.4. Reproducible, machine-verifiable quality grading at ingest.

[Figure: a pipeline showing a record entering three gates (Provenance valid, Structure complete, Concepts mapped) and emerging with a Grade A, B, C, or D classification. Caption: The 3-Gate Quality Protocol. Three checks at ingest, one grade out.]

Gate 1: Provenance valid. Cryptographic chain intact, signatures
verify, hash hasn't moved. Catches custodian-level tampering.

Gate 2: Structure complete. Required OMOP fields populated. FHIR
resources validate against schema. Required relationships resolve. No
nulls in required positions.

Gate 3: Concepts mapped. Diagnosis codes resolve to standard
vocabularies (SNOMED, RxNorm, LOINC¹) rather than local custom
strings. Measurement units standardized. Medications map to active
ingredients.

All three pass → A. Two → B. One → C. None → D. The grade rides on
the record's metadata, visible to anyone who pulls it.
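The grading rule above is small enough to sketch directly. This is a minimal illustration, not the whitepaper's implementation: the `GateResults` type and function names are hypothetical, and the gate checks themselves are assumed to have run upstream and produced a pass/fail each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResults:
    """Outcome of the three ingest checks for one record (names are illustrative)."""
    provenance_valid: bool    # Gate 1: chain intact, signatures verify
    structure_complete: bool  # Gate 2: required fields and relationships present
    concepts_mapped: bool     # Gate 3: codes resolve to standard vocabularies


def grade(results: GateResults) -> str:
    """Map the number of gates passed to a grade: 3 -> A, 2 -> B, 1 -> C, 0 -> D."""
    passed = sum((results.provenance_valid,
                  results.structure_complete,
                  results.concepts_mapped))
    return "DCBA"[passed]


def grade_a_only(records: dict[str, GateResults]) -> list[str]:
    """Return the ids of records that passed all three gates."""
    return [rid for rid, r in records.items() if grade(r) == "A"]
```

Because the rule counts gates rather than naming them, a grade B alone does not say which gate failed. A consumer who cares about the difference would need the per-gate results alongside the grade.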

What the grade buys you

Before quality grading, a researcher pulling a cohort had two options:
trust the source, or audit every record by hand. A reproducible grade
gives them a third — filter to grade A and know exactly what was
checked.

An AI vendor training on a grade-A cohort gets a cleaner training
signal than one training on raw mixed-grade data. Models can be
validated against the grade.

A patient who contributed records sees their contributions weighted by
grade. HAVEN's 3-Tier Value Model ties the grade to the attribution
score². Quality matters for compensation.

The grade isn't a clinical-truth guarantee. It is the strongest claim
the protocol layer can make on its own — and it already changes how
research-grade data gets compiled.

Where clinical truth lives

[Figure: a four-layer verification stack with the Clinical truth layer highlighted. Caption: Clinical truth sits two layers above crypto. Different regime, different evidence.]

Clinical truth — whether the record matches the body — lives in
medical empiricism. Repeated observation, independent measurement,
longitudinal follow-up. The protocols are mature: Good Clinical
Practice guidelines for trial data³, data monitoring committees,
multi-source validation, adjudication panels.

These have been doing the work for decades, run by people who do
nothing else. The protocol layer connects to them. It doesn't try to
be them.

The decomposition is the design

A protocol that tried to verify clinical truth on its own would have
to run adjudication panels. It would have to be a mortality registry.
It would need credentialed physicians on staff. That's not a protocol
— that's a research institute.

HAVEN decomposes the work. Quality grading runs at the protocol layer,
where it scales across institutions. Clinical-truth verification runs
in the medical regime, where it already happens. The two meet at the
attribution layer — research outcomes flow back, tied to graded
contributions, validated against medical-empirical evidence⁴.

The longer-term arc — paying patients when their data contributes to
outcomes, paying or penalizing AI vendors when predictions match or
miss reality — depends on this decomposition holding. Quality is the
protocol's job. Truth is medicine's. Both are necessary. Neither
substitutes for the other.

What comes next

Posts 2 through 5 have argued what would happen if the protocol
works. The next post commits to what would prove the whole argument
wrong.

¹ SNOMED CT (Systematized Nomenclature of Medicine, Clinical Terms), RxNorm (NIH unified medication nomenclature), and LOINC (Logical Observation Identifiers Names and Codes): the OHDSI/OMOP standard vocabularies for diagnoses, medications, and laboratory results respectively.
² HAVEN whitepaper v2.0, §6.4: Quality Assessment and the 3-Tier Value Model. DOI: 10.5281/zenodo.18701303.
³ International Council for Harmonisation. ICH Harmonised Guideline: Good Clinical Practice E6(R3). ICH, January 2025. Normative standard for the conduct of clinical trials, including source-document verification and endpoint adjudication procedures.
⁴ U.S. FDA Center for Devices and Radiological Health, Software as a Medical Device (SaMD): Clinical Evaluation, and the IMDRF SaMD framework. Validation of clinical AI/ML is empirical and ongoing, separate from data-integrity verification.

The identity gap

Chester Guan （Ziyuan Guan） — Thu, 14 May 2026 17:00:04 +0000

Originally published at prometheno.org.

Now let's think together. In "The four primitives"
I said identity sits outside the protocol's scope. This post says what
"outside" means.

The cryptographic primitives from post 3 can prove a signature came
from a specific key. They can't prove the key belongs to a specific
human. That second question is older than crypto, and the field that
answers it has been working at it for forty years. Building a patient-
sovereign protocol that tries to redo that work is a category error.

A regime, not a function

Every claim in a healthcare system gets verified inside some regime.
Signatures get checked with math. Audit chains get checked with hash
replay. Both live in the cryptographic regime — the truth-conditions
are mathematical, the evidence is computable, any honest party with
the artifacts reaches the same answer.

"Alice Chen holds this key" isn't that kind of claim. It's a claim
about the world. The evidence is a passport, a biometric, a notary's
seal. You can audit the procedure. You can't compute it.

Not harder math. A different question.

[Figure: a four-layer stack labeled Outcome verification, Clinical truth, Identity proofing, and Cryptographic verification, with the Identity proofing layer highlighted. Caption: Four regimes the protocol rides on. Crypto sits at the bottom. Identity proofing is the next layer up, and the subject of this post.]

The protocol can use a regime's output — an identity assertion, a
quality grade, an outcome label. It can't manufacture that output from
below.

What a signature doesn't say

A signature confirms whoever holds the key signed the message. It says
nothing about who holds the key. Three ways that gap bites:

Stolen keys. A signature from Alice's key, after the key has been
exfiltrated to Bob, is indistinguishable from a genuine Alice
signature. The math doesn't care who's typing.

Shared keys. Alice gives her key to her daughter to manage Alice's
records. Every consent grant looks like Alice. The daughter could
grant access Alice would refuse, and the protocol has no way to know.

Sybil accounts. One person creates ten patient identities, each
with a different key. Signatures all verify. Contributions all look
distinct. Cryptography is structurally blind to this¹.

Identity proofing is what makes those attacks expensive. Crypto is a
coordination primitive — it lets parties who don't trust each other
agree on what was signed. It deliberately doesn't try to settle who
the parties are. That settlement happens upstream.

What proofing actually means

NIST 800-63-4² is the U.S. federal standard for identity proofing,
last revised in 2024. It defines three Identity Assurance Levels —
IAL1, IAL2, IAL3 — each naming what evidence binds a credential to a
real human. It predates healthcare-specific concerns by years. The
federal government, the financial sector, and most regulated
industries already use it.

[Figure: three steps ascending from IAL1 (self-asserted, no proofing) to IAL2 (remote or in-person proofing with strong evidence) to IAL3 (in-person plus supervised biometric capture). Caption: NIST 800-63-4 Identity Assurance Levels. The protocol's assurance is bounded by the IAL the implementing system delivers.]

IAL1 is self-asserted. You type your name; the system believes you.
Fine for newsletter signups. Not fine for an asset binding that gates
clinical data.

IAL2 is the working floor for healthcare. Government-issued ID plus
a live biometric, remote or in-person. The Cures Act patient-access
rule³ effectively assumes IAL2. ONC certified-health-IT requirements
align⁴.

IAL3 adds a supervised in-person step. An authorized agent
inspects the evidence and the person presenting it. Federal benefits,
defense clearances, some clinical research.

A protocol that demands IAL3 everywhere prices itself out of
population scale. A protocol that accepts IAL1 and pretends otherwise
prices itself out of credibility. HAVEN doesn't pick — the
implementing system picks, and the consent grant inherits whatever
assurance the system can deliver.

What Prometheno consumes

A Consent Attestation names a grantor and a grantee. The protocol
doesn't say how either gets resolved to a human. The implementing
system mounts an existing substrate.

OpenID Connect⁵ — federated identity over OAuth 2.0. The
standard for "log in with your hospital portal." Token in, identity
out, at whatever IAL the provider supports.

Decentralized Identifiers⁶ — W3C standard for self-sovereign
identities backed by verifiable credentials. Useful when patients
carry identity across institutions. Doesn't itself produce IAL; relies
on the credential issuer.

EHR-issued identity — the provider already proofed the patient at
intake. SMART on FHIR⁷ surfaces that identity to apps via the
Patient resource. Most US silent-pilot work starts here, because the
proofing already happened in-clinic.

eIDAS in the EU⁸ — national-level electronic identity with
Assurance Levels (Low / Substantial / High) that map cleanly onto
IAL1/2/3. Relevant when EU patient pools come into scope.

The whitepaper §9 is explicit: "How you verify patients are who they
say they are is up to you."⁹ That's not a punt. It's the same
scoping move SMTP made for institutional directories, OAuth made for
existing accounts, OIDC made for OAuth. Building identity proofing
into HAVEN would be like asking HTTP to run a passport office.

What HAVEN inherits

Every weakness in the substrate propagates into the protocol.

If the substrate is IAL1, every signed consent is IAL1 consent. The
chain is unbroken; the signatures verify; the audit log holds. And
"the patient" is whoever clicked through. Crypto can make the fiction
tamper-evident. It can't make it true.

If credential recovery is a security question, an attacker who guesses
the answer takes over the credential and signs whatever they like.
The protocol records it as a legitimate session. The fix isn't more
crypto. The fix is in the identity layer.

If the substrate is high-assurance for the patient but low-assurance
for the grantee — the researcher, the AI vendor, the lab — the
asymmetry hides. A chain is only as strong as its identity links.

The honest posture: document what the protocol assumes (IAL on
grantor, IAL on grantee), make those assumptions explicit in the
consent record, reject grants where the substrate can't deliver them.
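One way to read that posture as code is sketched below. It assumes the implementing system can report the IAL its substrate delivered for each party; the `IdentityAssumptions` and `IdentityPolicy` types and the IAL2 default floor are illustrative choices, not fields defined by the whitepaper.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class IAL(IntEnum):
    """NIST 800-63 Identity Assurance Levels, ordered so they can be compared."""
    IAL1 = 1
    IAL2 = 2
    IAL3 = 3


@dataclass(frozen=True)
class IdentityAssumptions:
    """What the identity substrate delivered for each party (hypothetical record)."""
    grantor_ial: IAL
    grantee_ial: IAL


@dataclass(frozen=True)
class IdentityPolicy:
    """Minimum assurance a deployment requires on each side (assumed defaults)."""
    min_grantor_ial: IAL = IAL.IAL2
    min_grantee_ial: IAL = IAL.IAL2


def rejection_reason(assumed: IdentityAssumptions,
                     policy: IdentityPolicy) -> Optional[str]:
    """Return None if the grant may proceed, else a human-readable reason."""
    if assumed.grantor_ial < policy.min_grantor_ial:
        return (f"grantor proofed at {assumed.grantor_ial.name}, "
                f"policy requires {policy.min_grantor_ial.name}")
    if assumed.grantee_ial < policy.min_grantee_ial:
        return (f"grantee proofed at {assumed.grantee_ial.name}, "
                f"policy requires {policy.min_grantee_ial.name}")
    return None
```

Checking both sides in the same function is the point: a high-assurance patient paired with a self-asserted grantee is rejected rather than silently accepted.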

Why this is a separate post

Build identity proofing into consent. Pick an IAL. Mandate biometric
capture. Ship. That path ends two ways: rejected by every system that
already has an identity substrate, or drifting into a quasi-identity
provider that competes with the substrates it was supposed to
consume.

So the boundary stays where it is. Prometheno consumes whatever the
implementing system mounts. The cost is real — the protocol's
sovereignty claim is contingent on identity-proofing upstream — but
it's the same cost SMTP pays for not running institutional directories
and OAuth pays for not running user databases.

What comes next

The second gap is harder, because it lives in a different
epistemology. Crypto can prove a value hasn't been altered. Identity
proofing can prove a key belongs to a person. Neither can prove the
value reflects clinical reality. That sits in medical empiricism. The
next post takes it up.

¹ Douceur, J.R. "The Sybil Attack." International Workshop on Peer-to-Peer Systems (IPTPS), 2002. Sybil resistance requires an out-of-band binding to a costly real-world artifact.
² NIST Special Publication 800-63-4: Digital Identity Guidelines. National Institute of Standards and Technology, 2024. Defines IAL (Identity Assurance Level), AAL (Authenticator Assurance Level), FAL (Federation Assurance Level) as orthogonal axes. This post discusses IAL only.
³ 21st Century Cures Act, Public Law 114-255 (2016); ONC interoperability rules 85 FR 25642 (May 2020), 89 FR 1437 (January 2024). Patient-access APIs operate under identity proofing equivalent to NIST IAL2.
⁴ ONC Health IT Certification Program §170.315(d): identity-proofing requirements for credentialed access to certified health IT.
⁵ Sakimura, N., Bradley, J., Jones, M., de Medeiros, B., Mortimore, C. OpenID Connect Core 1.0, OpenID Foundation, incorporating errata set 2 (2014–present). Federated authentication on top of OAuth 2.0 (RFC 6749).
⁶ W3C Decentralized Identifiers (DIDs) v1.0, W3C Recommendation, July 2022.
⁷ SMART App Launch Framework v2.0. Mandel, J.C., et al. SMART on FHIR: A standards-based, interoperable apps platform for electronic health records. JAMIA 23(5):899-908, 2016.
⁸ eIDAS Regulation EU 910/2014, effective 2016; eIDAS 2.0 (Regulation EU 2024/1183) extending the framework with the European Digital Identity Wallet. Assurance Levels (Low / Substantial / High) align with NIST IAL1/2/3.
⁹ HAVEN whitepaper v2.0, §9, "What We're Not Trying to Do." DOI: 10.5281/zenodo.18701303.

The four primitives

Chester Guan （Ziyuan Guan） — Wed, 13 May 2026 15:42:55 +0000

Originally published at prometheno.org.

Now let's think together. In "Three failures, one missing layer" I
argued that healthcare's three persistent AI failures share one shape:
each requires a governance protocol layer that doesn't yet exist
anywhere — not in applications, not in regulations, not in platforms,
not even in the standards layer that handles data shape. This post
specifies what such a layer must provide.

Four primitives carry the load.

Content-addressable Health Assets. Programmable Consent. Hash-chained
Provenance. Quality-weighted Contribution. Each one addresses a
specific failure named in the previous post. Each one earns its place
against an alternative that doesn't work.

The claim isn't that these four are provably the smallest possible
set. Design spaces resist that kind of proof. The claim is that each
one earns its place, that they cluster naturally as a working set,
and that any honest governance protocol has to answer all four
questions they answer.

Specifying, instead of waving at it

"Consent and audit" is what every patient-data pitch already says.
The phrase is correct and inert. Anything strict enough to actually
rule out the failure modes post 2 named has to be specified at the
level of data structures and algorithms, not slogans.

A primitive, to count here, has to do three things:

1. Address a specific failure mode that doesn't dissolve if you refuse to define the primitive. (If "consent" can be replaced by "the patient signed a form," the form was the primitive, not the word.)
2. Rule out alternatives that look similar but don't carry the same guarantee. (Signed consent records aren't the same as hash-chained consent records, even though both involve signing.)
3. Compose with the other primitives without circular dependency. (Provenance can't be the thing that verifies its own integrity.)

The four primitives below each pass this bar. Each section names the
failure it addresses, the obvious alternative that doesn't work, and
what breaks if you remove it.

Health Assets

Failure addressed: fragmentation.

Post 2 named the fragmentation: EHRs hold parts, apps hold other
parts, research datasets are fixed snapshots. For a governance
protocol to mean anything across all of these, it needs a way to
refer to a specific piece of clinical data that everyone agrees is
the same piece — verifiably, across systems that don't trust each
other.

That's what a Health Asset is. From HAVEN whitepaper §6.1¹:

HealthAsset := {
    asset_id        : ContentHash      // Derived from content
    data_ref        : SecureReference  // Pointer to clinical data
    substrate       : Identifier       // Data format (FHIR, OMOP, etc.)
    consent_ref     : ConsentID        // Active consent policy
    quality_class   : {A, B, C, D}     // Data quality grade
    provenance_ref  : ProvenanceID     // Audit chain reference
    patient_ref     : PatientID        // Owner of this data
    created_at      : Timestamp
}

The asset_id is a SHA-256 hash of the content. Change one byte of
the underlying data, the hash changes, the asset_id no longer
matches. The pointer carries its own integrity check. The same trick
Git uses for commits².
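A minimal sketch of that self-checking pointer, using Python's standard `hashlib`. It hashes the raw bytes as given; how the whitepaper canonicalizes a record before hashing is not stated here, so that step is left as an assumption, and the function names are illustrative.

```python
import hashlib


def asset_id(content: bytes) -> str:
    """Derive a content address: the hex SHA-256 digest of the record's bytes."""
    return hashlib.sha256(content).hexdigest()


def verify_asset(claimed_id: str, content: bytes) -> bool:
    """Recompute the hash and compare it with the id the asset claims."""
    return asset_id(content) == claimed_id


if __name__ == "__main__":
    record = b'{"code": "4548-4", "value": 6.9, "unit": "%"}'
    rid = asset_id(record)
    print(verify_asset(rid, record))                              # same bytes
    print(verify_asset(rid, record.replace(b"6.9", b"6.8")))      # one value changed
```

Two custodians holding the same bytes derive the same id without talking to each other. The flip side is that any difference in serialization (key order, whitespace) yields a different id, which is why a real deployment would need an agreed canonical form before hashing.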

The obvious alternative: just give every record a UUID.

UUIDs work fine inside one system. They fail at the boundary. Two
custodians can issue the same UUID for different records, or
different UUIDs for what should be the same record. Reconciliation
becomes a coordination problem that has to be solved custodian by
custodian. Content addressing dissolves it: same content, same hash,
anywhere. No registry needed. No reconciliation needed³.

What breaks if you remove this primitive: the protocol loses any
basis for saying two systems are referring to the same record. Every
audit becomes "trust me, this is the same record." Every consent
becomes ambiguous about what it covers. The fragmentation failure
named in post 2 stays unfixed.

Content addressing isn't new. Git has used it since 2005. IPFS
implements it for general data. RFC 6920 specifies it for URIs⁴.
The choice in HAVEN is to apply it to healthcare records specifically,
in a substrate-neutral way — the same Health Asset can wrap a FHIR
resource, an OMOP measurement, or a raw document reference.

Consent

Failure addressed: no role for the patient.

Three of the four primitives each map directly to one of the failures
named in post 2. This one doesn't. Consent is the precondition for
the other three to mean anything. It's what turns the patient from a
data source into an actor in the coordination protocol. Without it,
governance has nothing to bite on.

A patient's record is one of dozens. Each custodian decides what's
shared, with whom, for what purpose. The patient signs a form, often
under duress, and the form is then interpreted application by
application. Revoking is a phone call to records. Auditing is a FOIA
request. "Consent" in this regime is a paper artifact, not a
machine-verifiable proposition.

HAVEN's Consent Protocol turns it into one. From whitepaper §6.2¹:

ConsentAttestation := {
    consent_id      : UUID
    grantor         : PatientIdentity   // Who grants
    grantee         : AccessorIdentity  // Who receives
    scope           : DataScope         // What data
    purpose         : PurposeType       // Why
    conditions      : Conditions[]      // Under what rules
    ...
    status          : {active, revoked, expired}
    signature       : CryptoSignature
}

Three properties make this primitive different from existing consent
practice.

Closed-world semantics. If you didn't explicitly grant access to
a resource type, the answer is no. Silence is denial. Existing
consent regimes default to permission for anything not explicitly
forbidden; HAVEN inverts that.

Deterministic verification. Same inputs, same answer, every
time. No randomness, no "it depends." That's what makes the consent
machine-verifiable rather than interpretable.

Immediate revocation. The next verify() call after a revoke()
returns denied. Not "after the next sync." Not "within 24 hours."
Immediately.
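The three properties can be sketched as a small in-memory registry. This is an illustration under simplifying assumptions: the `Grant` fields are a reduced, hypothetical subset of the ConsentAttestation, signatures are omitted, and everything runs in a single process, so "immediate" is trivially true here in a way it would not be across distributed replicas.

```python
from dataclasses import dataclass


@dataclass
class Grant:
    """Reduced, illustrative view of a consent attestation."""
    grantee: str
    resource_types: frozenset   # e.g. frozenset({"Observation"})
    purpose: str
    status: str = "active"      # one of: active, revoked, expired


class ConsentRegistry:
    def __init__(self) -> None:
        self._grants: dict[str, Grant] = {}

    def grant(self, consent_id: str, g: Grant) -> None:
        self._grants[consent_id] = g

    def revoke(self, consent_id: str) -> None:
        # State changes in place, so the very next verify() sees it.
        self._grants[consent_id].status = "revoked"

    def verify(self, grantee: str, resource_type: str, purpose: str) -> bool:
        """Closed-world check: True only if an active grant explicitly covers the request.

        No grant, wrong purpose, or unlisted resource type all return False.
        The result depends only on the arguments and stored grants.
        """
        return any(
            g.status == "active"
            and g.grantee == grantee
            and g.purpose == purpose
            and resource_type in g.resource_types
            for g in self._grants.values()
        )
```

Note that `verify` has no default-allow branch anywhere: the only path to True is a matching active grant, which is what "silence is denial" amounts to in code.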

The ethical foundation isn't new. The Nuremberg Code (1947)⁵
established that voluntary consent is the floor for medical research.
The Belmont Report (1979)⁶ codified the principle for modern
practice. What's new is making the principle executable — turning a
40-page form into a function.

The obvious alternative: signed consent forms (digital or paper).

A signed form attests that consent happened. It doesn't attest to
what the consent permits, doesn't compose with audit trails, and
doesn't carry revocation state. Two systems sharing the same signed
form will interpret its scope differently. The form is evidence; the
primitive needs to be a function.

A note on Identity. Consent grants reference two parties —
grantor and grantee. Both have to be verifiable identities for the
consent to mean anything. HAVEN deliberately doesn't define how
identity is established: "How you verify patients are who they say
they are is up to you"⁷. The protocol consumes identity from
established systems (OIDC, DIDs, EHR identity proofing⁸) and
operates over those. Identity-proofing is its own deep field;
reinventing it inside the governance protocol would be a bad bet.

What breaks if you remove this primitive: data flow without
governance. The sovereignty failure stays unfixed regardless of how
clean the data layer is.

Provenance

Failure addressed: missing audit.

An audit log inside the system being audited is auditable by the
system's custodian. Nobody else. If MyChart logs your record access,
you have to ask MyChart for the log. If the log is wrong, you have
to ask MyChart to prove it isn't. That's not audit. That's a
custodian's self-attestation, served on a printout.

The Provenance Record fixes this by making the log structurally
tamper-evident. From whitepaper §6.3¹:

ProvenanceEntry := {
    entry_id        : UUID
    timestamp       : Timestamp
    event_type      : EventType
    actor           : Identity
    subject         : AssetRef | ConsentRef
    details         : EventData
    previous_hash   : Hash          // Chain linkage
    signature       : CryptoSignature
}

Each entry includes the hash of the previous one. Tampering with
history breaks the chain — the change cascades forward, every entry
after the tampered one becomes invalid. Each entry is signed with
Ed25519 or ECDSA, binding it to a specific actor. Checking a single
entry is O(log n) via Merkle proofs⁹, which assumes the entries are
also committed to a Merkle tree: you don't need to replay the whole
chain, as a bare previous_hash walk would require.
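The chaining itself can be sketched with the standard library. This is a deliberately reduced illustration: signatures and Merkle proofs are left out, so verification here is a linear replay of the whole chain, and the entry fields, the all-zero genesis value, and the canonical JSON encoding are assumptions rather than the whitepaper's wire format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous_hash for the first entry (assumption)


def entry_hash(entry: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain: list, event_type: str, actor: str, subject: str) -> str:
    """Append an entry linked to its predecessor and return the new head hash."""
    previous = entry_hash(chain[-1]) if chain else GENESIS
    chain.append({
        "event_type": event_type,
        "actor": actor,
        "subject": subject,
        "previous_hash": previous,
    })
    return entry_hash(chain[-1])


def verify_chain(chain: list, trusted_head: str) -> bool:
    """Replay the chain and compare the result with a head hash held elsewhere."""
    expected = GENESIS
    for entry in chain:
        if entry["previous_hash"] != expected:
            return False  # a link no longer matches its predecessor
        expected = entry_hash(entry)
    return expected == trusted_head
```

The `trusted_head` argument matters: a custodian who rewrites an entry and then recomputes every later link produces an internally consistent chain, and only a head hash held by someone else (the patient, a witness, a public log) exposes the rewrite.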

This is the same construction Certificate Transparency uses for the
public web's certificate logs¹⁰. And before CT, it's the same
construction Haber and Stornetta proposed in 1991¹¹ — seventeen
years before Bitcoin. The technique is well-understood. The novelty is
applying it to clinical data access.

The obvious alternative: signed but mutable audit logs.

Signatures alone aren't enough. The custodian who owns the log can
re-sign a modified version with the same key, and the substitution
is undetectable to anyone who doesn't have the original. The
chaining is what makes substitution detectable. Without it, "audit"
remains a courtesy.

What breaks if you remove this primitive: the patient has no
basis for verifying any claim about what happened to their record.
Consent becomes unenforceable in the wild, because revocation can't
be verified after the fact. The missing-audit failure stays unfixed.

Contribution

Failure addressed: misaligned incentives.

Patients contribute data; researchers use it; outcomes flow to
neither directly. To realign, the protocol needs an accounting
primitive — something that turns "Alice contributed records to study
X" into a value-weighted quantity that can be tracked, attributed,
and eventually paid.

From whitepaper §6.4¹:

Contribution := {
    patient_id      : PatientIdentity
    asset_refs      : AssetRef[]
    quality_score   : Float[0, 1]
    tier            : ContributionTier
    context         : UsageContext
    timestamp       : Timestamp
}

The score follows a transparent formula:

Value = TierWeight × QualityScore × VolumeNorm

Tiers run from PROFILE (demographics)
through STRUCTURED (labs, meds, conditions) and LONGITUDINAL
(multi-year records) to COMPLEX (notes, imaging, genomics). Quality
is determined by a three-gate protocol — provenance valid, structure
complete, concepts mapped — producing a score from 0 to 1 and a
class from A to D.

The score isn't dollars. It's a relative weight. If Alice scores
0.83 and Bob scores 0.41, Alice contributed roughly twice as much to
that study. What that translates to in money is between the
implementing system, the patients, and the business model. HAVEN
provides the accounting, not the payment rails.
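
A small Python sketch of the arithmetic, under stated assumptions. The post gives the shape of the formula but not the numbers, so the tier weights, the meaning of `volume_norm`, and the names `contribution_value` and `relative_shares` are hypothetical placeholders, not values from the whitepaper.

```python
from dataclasses import dataclass

# HYPOTHETICAL tier weights; the real values are defined by the protocol, not here.
TIER_WEIGHT = {
    "PROFILE": 1.0,
    "STRUCTURED": 2.0,
    "LONGITUDINAL": 3.0,
    "COMPLEX": 4.0,
}


@dataclass(frozen=True)
class Contribution:
    patient_id: str
    tier: str
    quality_score: float  # 0..1, output of the quality gates
    volume_norm: float    # assumed here to be a normalized volume factor in 0..1


def contribution_value(c: Contribution) -> float:
    """Value = TierWeight x QualityScore x VolumeNorm."""
    if not 0.0 <= c.quality_score <= 1.0:
        raise ValueError("quality_score must be in [0, 1]")
    return TIER_WEIGHT[c.tier] * c.quality_score * c.volume_norm


def relative_shares(contributions: list) -> dict:
    """Normalize values so they sum to 1: a relative weight, not a payment."""
    values = {c.patient_id: contribution_value(c) for c in contributions}
    total = sum(values.values())
    if total == 0:
        return {pid: 0.0 for pid in values}
    return {pid: v / total for pid, v in values.items()}


alice = Contribution("alice", "LONGITUDINAL", quality_score=0.83, volume_norm=1.0)
bob = Contribution("bob", "LONGITUDINAL", quality_score=0.41, volume_norm=1.0)
ratio = contribution_value(alice) / contribution_value(bob)  # roughly 2
shares = relative_shares([alice, bob])
```

With tier and volume held equal, the ratio reduces to 0.83 / 0.41, which is the "roughly twice" in the text. What a share converts to in money stays outside the sketch, as it stays outside the protocol.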

The obvious alternative #1: equal-share data dividends.

The Datacoup / Datawallet / LunaDNA model — every contributor gets
the same share. This collapses on contact with reality. A patient
contributing a single demographic record is treated identically to
one contributing ten years of multi-system labs. Researchers won't
trust the cohort because it can't be quality-weighted. Patients who
contribute heavily get the same as those who contribute thinly. The
system fails on both ends¹².

The obvious alternative #2: pure clinical-weight, no quality
gating.

Skip the quality gates, weight by clinical content alone. Works in
theory, breaks in practice — clinical content quality varies wildly.
A LONGITUDINAL record with 95% concept-mapping coverage is different
research material from a LONGITUDINAL record with 30%. Without
quality gating, "value" becomes garbage-in garbage-out.

The three-gate quality protocol exists because each previous attempt
at patient data marketplaces collapsed in one of these two ways. The
historical evidence is on the page already.
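
To make the gating concrete, here is a hedged Python sketch of three gates feeding one score. The post names the gates (provenance valid, structure complete, concepts mapped) and the outputs (a 0 to 1 score and a class from A to D), but not the rule that combines them. The combination rule and the class cut-offs below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    provenance_valid: bool         # gate 1: the provenance chain checks out
    structure_completeness: float  # gate 2: fraction of required fields present, 0..1
    concept_coverage: float        # gate 3: fraction of codes mapped to standard concepts, 0..1


def quality_score(g: GateInputs) -> float:
    """ASSUMED rule: a failed provenance gate zeroes the score; otherwise multiply the fractions."""
    for fraction in (g.structure_completeness, g.concept_coverage):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fractions must be in [0, 1]")
    if not g.provenance_valid:
        return 0.0
    return g.structure_completeness * g.concept_coverage


def quality_class(score: float) -> str:
    """ASSUMED cut-offs for classes A to D."""
    if score >= 0.85:
        return "A"
    if score >= 0.60:
        return "B"
    if score >= 0.30:
        return "C"
    return "D"


# Two LONGITUDINAL records that differ only in concept-mapping coverage.
well_mapped = GateInputs(True, 1.0, 0.95)
poorly_mapped = GateInputs(True, 1.0, 0.30)
unverifiable = GateInputs(False, 1.0, 0.95)

scores = [quality_score(g) for g in (well_mapped, poorly_mapped, unverifiable)]
classes = [quality_class(s) for s in scores]
```

Under pure clinical weighting the first two records would count the same; with any gating rule of this kind they separate, which is the point of the second alternative's failure.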

What breaks if you remove this primitive: value pools at the
custodian, not the patient. The misaligned-incentives failure stays
unfixed. The protocol becomes another consent-and-audit layer with
no honest accounting of where research value goes.

Data Shapley and related attribution methods¹³ suggest more
refined math is possible. The tier- and quality-weighted formula is
HAVEN's deliberate floor: easy enough to compute, sturdy enough to
defend, and intentionally open to richer attribution schemes layered
on top.

Why these four cluster

Each primitive answers a different question that any governance
protocol has to answer:

Primitive	Question it answers
Health Asset	What is the record?
Consent	Who may use it, and for what?
Provenance	What happened to it?
Contribution	What was it worth?

Remove any one and the protocol stops being a protocol. Remove
Health Assets and Consent has no stable thing to authorize. Remove
Consent and Provenance has nothing to audit against. Remove
Provenance and the whole system runs on trust. Remove Contribution
and the system has no honest reason for patients to participate.

The four cluster naturally because each answers a category of
question that the others can't. They aren't variations on a theme.
They aren't aspects of a single underlying concept. They're four
distinct functions a governance protocol has to provide if it's
going to be the missing layer post 2 named.

That's the working set. A reader who can show that one of them is
reducible to another, or that a fifth answers a question I haven't
named, should write back. The series is better for the pressure.

What this argument can't show

Three limits worth naming.

Identity sits outside the protocol's scope. HAVEN's position is that
identity-proofing happens outside the protocol, in established
systems. That's a deliberate boundary. It also
means the protocol inherits whatever weakness exists in the identity
layer it rides on. A weak identity binding produces weak consents;
HAVEN doesn't fix that upstream problem. It just declines to make it
worse.

Settlement is downstream. Turning attribution scores into actual
payments — to patients, to research funds, to whatever model the
implementing system chooses — is an application concern, not a
protocol primitive. HAVEN gives you the accounting. What's done with
the accounting is yours to design.

"Cluster naturally" is a judgment call. The
argument that each of these four primitives is necessary doesn't
rule out the possibility that some other set of four (or five) could
do the same work via different decompositions. Design spaces resist
that kind of proof, as post 2 already conceded. The defensible claim is
necessity against the failures we named. Universal minimality is a
separate question.

What comes next

Specifying four primitives didn't dissolve every problem post 2
named. Two of them surfaced as separate work during specification —
not because the primitives are wrong, but because each ran into a
verification regime the cryptographic primitives don't cover. One of
them lives in a different epistemology entirely.

The next post takes up the first of those two gaps.

HAVEN whitepaper v2.0 (February 2026). DOI: 10.5281/zenodo.18701303. Source: github.com/Chesterguan/HAVEN. ↩
Git uses SHA-1 today, with migration to SHA-256 in progress. The construction is identical to HAVEN's: hash the content, use the hash as the identifier. Original design: Linus Torvalds, 2005. ↩
IPFS (InterPlanetary File System) implements the same model for general data storage. Content Identifiers (CIDs) are the operational form. The pattern long predates blockchain. ↩
Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B., Keränen, A., and Hallam-Baker, P. RFC 6920: Naming Things with Hashes. April 2013. Defines the ni: URI scheme for content-addressable resources, including hash-algorithm parameterization. ↩
"The Nuremberg Code." Trials of War Criminals before the Nuremberg Military Tribunals. U.S. Government Printing Office, 1949 (originally issued 1947). ↩
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research. U.S. Department of Health, Education, and Welfare, 1979. ↩
HAVEN whitepaper §9, "What We're Not Trying to Do." ↩
NIST Special Publication 800-63-4 (2024): Digital Identity Guidelines — identity-proofing assurance levels. W3C Decentralized Identifiers (DIDs) v1.0, W3C Recommendation, July 2022. OpenID Connect Core 1.0 (federated identity). eIDAS Regulation EU 910/2014 and eIDAS 2.0 (2024) for the EU's electronic identification framework. HAVEN is compatible with any of these as the underlying identity substrate. ↩
Merkle, R.C. "A digital signature based on a conventional encryption function." Advances in Cryptology — CRYPTO '87. The hash-tree construction enabling O(log n) inclusion proofs. ↩
Laurie, B., Messeri, E., and Stradling, R. RFC 9162: Certificate Transparency Version 2.0. December 2021. The current normative standard for the public web's certificate transparency logs. ↩
Haber, S., and Stornetta, W.S. "How to time-stamp a digital document." Journal of Cryptology 3.2 (1991): 99-111. First presented at CRYPTO '90. The original hash-linked timestamp construction. ↩
Datacoup (founded 2012, NYC; shut down November 2019; later acquired by ODE July 2021), Datawallet (founded 2014; pivoted to crypto with a $40M DXT token sale in February 2018; functionally dormant by 2026), LunaDNA (founded December 2017 by Bob Kain et al.; closed January 31, 2024 citing capital shortage). Each attempted a patient-side data marketplace with various dividend models; each failed to attract either the patient volume or the research-buyer trust necessary to sustain a market. ↩
Ghorbani, A., and Zou, J. "Data Shapley: Equitable Valuation of Data for Machine Learning." Proceedings of the 36th International Conference on Machine Learning (ICML), 2019. Shapley-value-based attribution for individual data points in machine-learning training sets. ↩

Three failures, one missing layer

Chester Guan （Ziyuan Guan） — Wed, 13 May 2026 15:42:54 +0000

Originally published at prometheno.org.

Now let's think together. In Hello from here I said I'd revisit what
I got wrong in the Medium pieces last summer. This is that revisit.

Last summer I wrote that healthcare AI keeps stalling for three reasons:
fragmented data, missing audit, misaligned incentives. Ten months later,
I still think those three failures are real. I no longer think they're
three failures. They're symptoms of one.

What I said last summer, restated

The Medium pieces were diagnostic. They identified three failures,
gave each a name, and proposed that each could be addressed separately.
Fragmented data: fix it with better integration. Missing audit: fix it
with better logging. Misaligned incentives: fix it with better
economics. Three problems, three fixes, three workstreams.

I still believe each of those failures is real. I have spent ten months
trying to address them, primarily through the protocol I called HAVEN
and the reference implementation that runs against MIMIC-IV today. What
I have learned in those ten months is that I named them wrong. Not
because the symptoms are wrong, but because the cause is one.

Each of those three failures, when you look at what would actually fix
it, requires the same thing: a layer of infrastructure that currently
does not exist anywhere in healthcare. Not in a specific app. Not in
any single regulation. Not in any platform. A layer that lives beneath
the application layer, between the data and the things that use it,
and is jointly governed rather than custodially owned.

Healthcare hasn't built that layer. The reason the three failures are
so persistent is that all of the actors who could build it are working
at the wrong layer.

What the three failures share

Consider what each failure actually requires.

Fragmented data is a coordination problem. Each EHR holds part of a
patient's record. Each direct-to-consumer health app holds another
part. Each research dataset is a fixed snapshot of one institution.
Fixing this requires not better storage but a way for the parts to
refer to each other — to be the same record, verifiably, across systems
that don't trust each other.

Missing audit is a coordination problem. An audit log that lives
inside the system being audited is auditable by the system's custodian
only. To be trustworthy, the audit has to be visible from outside the
custodian's reach. That means coordinating audit across actors who
otherwise have no reason to cooperate.

Misaligned incentives is a coordination problem. Patients contribute
data; researchers use it; outcomes flow to neither directly. To realign
requires value-tracking across that chain. No actor in the chain has the
standing to track it on behalf of everyone. Value attribution at scale
is shared accounting across systems that do not share a custodian. That
is coordination, just at a different layer than data shape or audit
trails.
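
The first requirement, parts that can refer to the same record verifiably across systems that don't trust each other, is what content addressing provides. A minimal Python sketch, assuming sorted-key JSON as a stand-in for a real canonicalization scheme and borrowing the `ni:` URI form from RFC 6920; the record fields and the function names are hypothetical.

```python
import base64
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Deterministic encoding: sorted keys, no extra whitespace (an assumed stand-in)."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def content_id(record: dict) -> str:
    """Name the record by the SHA-256 of its canonical bytes, as an ni: URI."""
    digest = hashlib.sha256(canonical_bytes(record)).digest()
    token = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"ni:///sha-256;{token}"


# The same observation as held by two systems, with keys in a different order.
at_hospital = {"type": "lab", "test": "hba1c", "value": 6.9, "unit": "%"}
at_app = {"unit": "%", "value": 6.9, "test": "hba1c", "type": "lab"}
altered = {"type": "lab", "test": "hba1c", "value": 7.9, "unit": "%"}

same = content_id(at_hospital) == content_id(at_app)
changed = content_id(at_hospital) != content_id(altered)
```

Neither system has to trust the other's database: each computes the identifier from the bytes it holds, and a mismatch means the content differs. Canonicalizing real clinical records is harder than sorting keys, which is why that step is flagged as an assumption.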

Three failures, one shape: each is a problem of coordination across
actors who don't share a custodian. And there are four layers in
current healthcare infrastructure where such coordination has been
attempted: the application layer, the regulatory layer, the platform
layer, and the standards layer. Each has tried to host the fix. Each
has produced a layer-specific limitation worth examining in detail.

Why each existing layer can't host the fix

Application layer fails at coordination

Most patient-data infrastructure today is application-layer. MyChart
manages access to one health system's records. Pillpack manages
medications. Apple Health stores a phone's sensor data. Each has
consent UI. Each has logs. Each has some value model — even if "free"
is the model. None of them coordinate with the others. Consent given
in one is not visible to another. Audit logs in one are not auditable
from another. Value accrued in one cannot be paid across them.

You can build the best possible consent flow inside one application
and still have failed at the actual problem, because the patient does
not have one application. The patient has dozens. The data exists in
dozens of systems. The application layer cannot, by its structural
definition, coordinate across applications it does not contain.

This is not a problem that better applications will solve. It is a
problem that requires a layer applications can rest on, in the way that
an HTTP server doesn't have to reimplement TCP.

Regulatory layer fails at latency

HIPAA¹ defined privacy boundaries in 1996, before patient data was
an AI training resource. It maps poorly onto questions like "who is
allowed to train a model on this record," because the act of training
does not look like the disclosure events HIPAA was designed to govern.

GDPR² added the right to erasure in 2018. The right to erasure is
a coherent demand for records held in databases. It is much less
coherent for records held in the gradient weights of a deployed model.
The right exists in statute; the mechanism for enforcing it for
training data simply doesn't.

The 21st Century Cures Act³ and the subsequent ONC interoperability
rules (2020–2022) mandated that patients receive access to their
records via standardized APIs. Access is a precondition for
sovereignty, not a substitute for it. Receiving the data is not the
same as having rights about what is done with the data once it's
received.

What these regulations have in common is that they responded to
whatever problem was visible at the time of drafting. By the time the
regulation is in force, the technology has produced new problems.
Regulation has structurally lower bandwidth than technology, which
means whatever is built before regulation catches up will continue to
operate, will continue to extract value, and will not be unwound by
the eventual regulatory response. The fix has to exist before
regulation, or it cannot exist at all.

Platform layer fails at consolidation

The platform attempt is the most recent. Apple Health Records launched
in 2018⁴ with twelve partner health systems and now integrates with
hundreds of US health systems. Google has made four separate attempts
at healthcare data (Google Health 2008–2011, Google Fit, DeepMind
Streams, and Cloud Healthcare API)⁵; most were closed or refocused.
EHR vendors operate patient-facing portals that are platform-like at
health-system scope.

These platforms work, in the narrow sense that data does flow through
them. They do not solve the sovereignty problem. They consolidate it.
When Apple is the custodian of a unified patient-data layer, the
patient is no longer the sovereign — Apple is, with the patient as
user. When the EHR vendor is the custodian, the health system is.
Sovereignty becomes mediated, which is the opposite of sovereignty.

Platforms aren't bad. They're just not where the fix lives.

Standards layer fails at scope

Healthcare has serious protocol-layer attempts. HL7 v2 standardized
clinical message exchange in 1989⁶. HL7 FHIR has standardized
RESTful access to clinical data since 2014⁷. The OMOP Common Data
Model⁸ codified the shape of observational research data
across hundreds of institutions. SMART on FHIR⁹ standardized
authorization for clinical apps.

These are real protocol-layer wins. They are not the wins the missing
fix needs.

Each of these standards governs the wire. FHIR specifies how to
retrieve a record; it does not specify whether the retrieving party
may train a model on it. OMOP specifies how a diagnosis is encoded;
it does not specify who may access the cohort or what they owe the
patients in it. SMART on FHIR specifies how an app authenticates; it
does not specify what the patient should receive when the app's
output is used in care.

The standards layer scopes to data shape. The missing fix has to
scope to data use. The two are complementary: a governance protocol
operates over FHIR-shaped data and OMOP-modeled cohorts and supplies
what those standards don't provide: rules for use.

What "protocol layer" means in this context

A protocol is a set of rules that participants follow voluntarily,
without any of them owning the rules or storing the data the rules
govern. SMTP made email possible across institutions in 1982¹⁰, not
because Bell Labs hosted email, but because everyone agreed on how to
address it. HTTP made the web possible across servers in 1991¹¹,
not because Tim Berners-Lee hosted the web. DNS made naming possible
without a single registrar¹².

In each case, the protocol layer succeeded by enabling cross-system
behavior that no single custodian could have provided. Each protocol
was published, ratified by use, and operated without any party having
permission to revoke it. Email has survived four decades of vendor
consolidation because the protocol is older than the vendors.

Healthcare data does not have such a layer. It has applications that
consolidate. It has regulations that constrain disclosure. It has
platforms that mediate. It has no shared rules for what a record is,
what consent means, what audit consists of, or how value gets
attributed. Each of those questions is currently answered application
by application, regulation by regulation, platform by platform.

The bet is that a protocol layer for patient-sovereign healthcare data
could behave the way SMTP and HTTP did. Not because it solves any
specific application problem better than that application would, but
because it enables a class of cooperation that cannot happen without
it. There is a second part to the bet: this layer is buildable now,
before regulation forces a worse version of it, and before any single
platform consolidates the territory.

What this means for the next four posts

If the missing layer is protocol, then specifying what such a protocol
must provide is the next step. Not "consent and audit" as generic
abstractions. Specific primitives, each with a job.

The next post argues that four primitives carry the load:
content-addressable Health Assets, programmable Consent, hash-chained
Provenance, quality-weighted Contribution. Each maps to one of the
failures named here. The claim is not that these four are provably
the smallest possible set. Design spaces resist that kind of proof.
The claim is that each one earns its place against a specific failure
mode, and that the four cluster naturally rather than arbitrarily.

That's a softer commitment than "minimum sufficient." It's the one I
can defend. A reader who sees a natural fifth primitive should write
back. The series is better for the pressure.

What I underestimated

When I wrote the Medium pieces last summer, I thought the field needed
better tools. I now think it needs a layer the field hasn't built.
That's a harder problem than the one I named.

Building better tools on top of a missing layer is a treadmill.

The next post specifies. The two after that examine the gaps that
surfaced during the specification — gaps that became separate work
because they live in different verification regimes. The fifth post
commits to what would prove the whole argument wrong.

HIPAA, Public Law 104-191 (1996); Privacy Rule effective 2003. The statute governs disclosure of protected health information by covered entities. It is structurally about who may share what with whom, not about what may be inferred from what has been shared. ↩
GDPR, Regulation (EU) 2016/679, effective May 2018. Art. 17 (right to erasure) and Art. 20 (right to data portability). Art. 17 is binding on data controllers; the mechanism for applying it to data already encoded in trained model weights remains an open legal question. ↩
21st Century Cures Act, Public Law 114-255 (2016). Subsequent ONC interoperability rules: 85 FR 25642 (May 2020) and 89 FR 1437 (January 2024). FHIR R4 patient-access APIs mandated for certified health IT. ↩
Apple Health Records launched March 28, 2018. Initial 12 partner health systems; FHIR-based (DSTU2 at launch, with R4 support added later); now integrated with hundreds of US health systems. ↩
Google Health (consumer): 2008–2011. Google Fit: launched 2014. Google DeepMind Streams: piloted at Royal Free London 2016, criticized by UK ICO 2017, folded into Google Health 2018. Google Cloud Healthcare API: launched 2018, operational. None operate at protocol layer; all are platform plays. ↩
HL7 v2 (originally HL7 v2.1, 1989). Maintained by HL7 International; versions 2.3–2.7 in widespread clinical deployment. ↩
HL7 FHIR (Fast Healthcare Interoperability Resources). DSTU 1 published 2014; FHIR R4 became normative in 2019. ↩
OMOP Common Data Model, maintained by the OHDSI consortium. v5.x widely deployed across hundreds of research sites; v6.0 has been published, though OHDSI tooling primarily targets v5.x. ↩
Mandel, J.C., Kreda, D.A., Mandl, K.D., Kohane, I.S., and Ramoni, R.B. "SMART on FHIR: A standards-based, interoperable apps platform for electronic health records." Journal of the American Medical Informatics Association 23(5) (2016): 899-908. Initial profile published 2014; SMART App Launch Framework v2.0 in current use. ↩
Postel, J. (1982). RFC 821: Simple Mail Transfer Protocol. ↩
Berners-Lee, T. (1991). HTTP/0.9 first proposal; HTTP/1.0 standardized 1996, RFC 1945. ↩
Mockapetris, P. (1983). RFC 882, RFC 883: DNS specifications. ↩

Portfolio Post — What I'm Building: HAVEN, Prometheno

Chester Guan （Ziyuan Guan） — Wed, 08 Apr 2026 15:24:05 +0000

Been heads-down building the future of data governance, and I wanted to share a glimpse of what I'm working on.

First, I'm developing HAVEN (Health Asset Value & Exchange Network) – a protocol for patient-controlled health data. It focuses on how health data is referenced, consented to, audited, and valued. Think of it as the foundational layer for a more equitable health data ecosystem. Check out the spec on GitHub: https://github.com/Chesterguan/HAVEN. Recent updates include refining the documentation and adding a logo to improve community contributions.

I was also working on Prometheno, a patient-centered health data platform. The goal was to empower individuals to own, control, and benefit from their medical information while contributing to medical research on their own terms. You can find the project here: https://github.com/Chesterguan/Prometheno.

These projects are all about empowering individuals with control over their data and how it's used. I'm excited to see where this journey takes me!

#datagovernance #healthtech #opensource