Forem: Oluwafemi Adedayo

Signal Independence Audit

Oluwafemi Adedayo — Thu, 14 May 2026 09:09:40 +0000

Pre-hyperopt screening framework for testing whether a candidate signal adds information over a baseline. Methodology + crypto case study.

Note: This is the canonical Markdown source. The published version is hosted at https://hallengray.github.io/signal-independence-audit/BLOG_POST.html (GitHub Pages auto-resolves the relative ./case-study/*.md links to the corresponding Pages-rendered HTML).

Four crypto strategy failures, and the tool that caught one in 3 seconds

Or: what happens when you bring software-engineering ADR discipline to retail quant research, then watch it produce an honest negative result.

Hook

Most retail quant content reads “I found a strategy that backtests well.” This post is the inverse.

Over four phases of work on BTC/USDT and ETH/USDT spot at the 4-hour timeframe, four different strategies were tested against pre-committed criteria: Sharpe > 1.0 AND profit factor > 1.4 out-of-sample. All four failed. The bar was never lowered. Across nineteen architecture-decision records the project never said “the threshold was too strict” — it documented why each strategy fell short and moved to the next question.

One of those four — a funding-rate-conditioned variant of a Donchian breakout — was killed in ~3 seconds of pure-Python testing, before any of the planned 1000-epoch × 2-seed × 3-window hyperopt compute ran. The tool that produced that 3-second verdict is what this post ships. It’s a four-screen pre-hyperopt audit called Signal Independence Audit (SIA), published as a standalone repo under Apache 2.0. The methodology applies to any candidate signal on any asset class; crypto was the surface where it was developed and stress-tested.

The rest of this post walks through each of the four failures in turn, then the literature audit that produced an unexpected corrigendum about how AI-assisted research goes wrong, then what survives the kills as reusable methodology. It ends on what a documented negative result actually is, and what it isn’t.

If you only read one section, read the Phase 4 SIA-kill section — that’s where the methodology earns its keep.

Phase 2 — MFD (price-only trend-following) → §14 FAIL, Sharpe-bound

The first attempt was a Macro-Filtered Donchian Breakout — MFD. Entries fire when the 4-hour close breaks above an N-period Donchian high, gated by a daily-timeframe filter (daily close above a slow EMA, so trend-following only fires during macro uptrends). Exits are symmetric: a Donchian-low cross. Stops are ATR-scaled. Two seeds, 1000 hyperopt epochs each, top-5 candidates per seed evaluated out-of-sample across four calendar walk-forward windows.

The headline result: best out-of-sample Sharpe 0.314, against a threshold of > 1.0. That’s roughly a third of the bar. The best per-window Sharpe across the four walk-forward windows was 0.553 (W3, Jan–Jun 2024), still well short. Profit factor was not the problem: PF cleared 1.4 on multiple candidates, ranging 1.6–2.5 across the windows. The binding constraint was Sharpe.
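The AND-clause in the §14 bar is easy to state and easy to get subtly wrong, so here is a minimal sketch of it. It assumes the conventional definition of profit factor (gross profit divided by gross loss); the function names are hypothetical and are not taken from the project, and pairing Sharpe 0.314 with PF 1.6 is purely illustrative.

```python
from typing import List, Sequence

SHARPE_MIN = 1.0  # §14 threshold quoted in the post
PF_MIN = 1.4      # §14 threshold quoted in the post


def profit_factor(trade_pnls: Sequence[float]) -> float:
    """Gross profit / gross loss (conventional definition, assumed here)."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def binding_constraints(sharpe: float, pf: float) -> List[str]:
    """Return the criteria that fail; an empty list means the bar is cleared."""
    failed = []
    if not sharpe > SHARPE_MIN:
        failed.append("sharpe")
    if not pf > PF_MIN:
        failed.append("profit_factor")
    return failed


# Illustrative pairing: MFD's best OOS Sharpe with a PF from the reported range.
print(binding_constraints(0.314, 1.6))  # ['sharpe']
```

Returning the list of failed criteria rather than a bare boolean keeps the "which constraint was binding" question answerable from the output alone.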

The failure mode is informative: returns per unit volatility were too low, but the strategy was profitable on average. Stoploss-firing breakdowns showed ~100% of exits firing as exit_signal (the Donchian-low cross) with near-zero stop_loss and zero roi exits — meaning the per-trade edge depends on exit-signal timing, not on stoploss tightness, and that the edge is real but structurally too small relative to trade-noise variance. Wider ATR stops would not have closed the 3× Sharpe gap; the modal expected outcome had already been flagged before the hyperopt ran (details).

Phase 2’s discipline holds: 0 of 4 walk-forward windows pass, the bar wasn’t softened, and the ADR records exactly what the failure shape was — not “this strategy needs more tuning” but “this strategy class produces edge below threshold on this universe at this timeframe.”

Phase 3 — BMR (mean reversion) → §14 FAIL, Sharpe AND PF bound

If trend-following’s per-trade edge was too small, mean reversion was the next obvious thing to try. Phase 3’s strategy was BMR — Bollinger Mean Revert — entries when the 4-hour close drops below a lower Bollinger band, exits when it returns to the middle band, gated by the same daily-EMA macro filter as MFD (to avoid catching falling knives during macro downtrends).

Three seeds, top-5 hyperopt candidates per seed, 15 candidates evaluated out-of-sample total. Plan Option B was invoked to skip walk-forward — the trigger conditions for that option are conservative (all OOS candidates fail decisively AND multiple §14 criteria bind AND walk-forward would only confirm), and BMR cleared all three triggers.

Headline: all 15 out-of-sample candidates were unprofitable. Best OOS Sharpe -0.083 (seed=42 epoch=51). Worst -0.361. Mean across the 15: -0.180. Mean PF: 0.59 (best 0.85, worst 0.31). Every candidate was both unprofitable AND low-conviction (details).

This is a structurally worse failure than MFD’s. MFD was profitable-but-inconsistent — its OOS Sharpe was below threshold but its PF cleared on multiple candidates. BMR was unprofitable-AND-inconsistent: all three seeds independently produced negative best in-sample Sharpe (−0.066 to −0.111), and the hyperopt landscape showed degenerate plateaus (638 out of 1000 epochs tied on one seed), which itself suggested the macro filter was dominating entries.

Two consecutive strategies failed §14 in two different ways. The pattern raised a hypothesis explicitly: maybe the universe + timeframe combination (BTC/ETH spot, 4-hour primary) is too narrow to support strategies with both high enough Sharpe AND high enough profitability to clear the bar. Phase 4 was scoped with that hypothesis on the table.

Phase 4 — FCMFD (funding-rate-conditioned MFD) → SIA-killed pre-hyperopt in 3 seconds (the lead worked example)

By the start of Phase 4 there were two compute-expensive kills in the rearview. Each had cost roughly an hour of cloud hyperopt to produce a verdict that, in retrospect, was visible in the strategy’s design before any hyperopt ran. MFD’s per-trade edge was structurally too small. BMR’s macro filter dominated entries. Both shapes — “candidate signal weaker than expected after accounting for what’s already captured” — should have been catchable cheaply, but the pipeline didn’t have a cheap mechanism for catching them.

So before spending another hyperopt budget on the natural Phase 4 candidate, I built one.

The Phase 4 candidate was FCMFD: MFD plus a funding-rate gate. The thesis was direct: when perpetual funding rates are low (longs are not crowded), forward breakouts have more room to run; when funding is high (longs are crowded), breakouts mean-revert. Funding rates are public exchange data, free, and intuitive. The natural plan was: add a “funding < 60th-percentile of trailing 90-day window” gate to MFD’s entry condition, hyperopt the rolling-window and threshold parameters, evaluate out-of-sample.

Before any of that, I built the Signal Independence Audit (SIA) — four mechanical screens that test whether a candidate signal adds information over a price-only baseline (details). Screen 1: coverage (does the gate change trade frequency? Does it preserve enough trades for evaluation?). Screen 2: lift (do funding-gated entries’ forward returns exceed price-only-gated entries by ≥ 0.5σ on ≥ 2 of 3 horizons?). Screen 3: Jaccard overlap on a small hyperparameter grid (does the gate’s trade-set stay stable under perturbation?). Screen 4: information-split control (does the candidate signal explain more of the forward-return variance than realised volatility — the obvious confounder — explains?). The four screens run on the training window only, against a frozen rule set, and produce a JSON verdict. They take about 3 seconds.
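To make Screens 2 and 3 concrete, here is a minimal sketch of the two computations. It is not the harness's code: the pooled-standard-deviation formula is the textbook one and is an assumption (the post only says the bar is 0.5σ in pooled-stddev units), and every function name is hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence, Set


def gap_sigma(treatment: Sequence[float], control: Sequence[float]) -> float:
    """Mean difference in units of pooled standard deviation (assumed formula)."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; gap is undefined")
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def lift_verdict(gaps: Sequence[float], bar: float = 0.5, min_passing: int = 2) -> bool:
    """Screen 2 shape: pass when at least `min_passing` horizons reach the bar."""
    return sum(1 for g in gaps if g >= bar) >= min_passing


def jaccard(a: Set, b: Set) -> float:
    """Screen 3 shape: overlap of two trade sets, 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


print(gap_sigma([1, 2, 3], [0, 1, 2]))       # 1.0
print(jaccard({1, 2, 3}, {2, 3, 4}))         # 0.5
```

Trade sets for the Jaccard check could be keyed by entry timestamp, so that two parameter settings are compared on which entries they share rather than on aggregate counts.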

The Phase 4 SIA ran and returned exit code 1 (details):

Horizon	Treatment mean	Control mean	Gap σ	Threshold	Verdict
(not given)	0.85%	0.25%	0.147	0.500	FAIL
10d	1.83%	1.20%	0.098	0.500	FAIL
20d	3.78%	3.69%	0.016	0.500	FAIL

The locked direction was qualitatively right — funding-gated entries outperformed price-only entries on all three horizons. The problem was magnitude: the gap σ values (0.147 / 0.098 / 0.016) sat well below the 0.5σ-pooled-stddev bar. By the 20-day horizon the gap was effectively zero. Screen 4 reinforced the verdict: vol-tercile separation on the 10-day horizon was 6.93 percentage points (low vol 5.14%, mid -1.79%, high 3.95%); funding-tercile separation was 2.45pp (low 3.50%, mid 2.77%, high 1.05%). Vol explained more of the forward returns than funding added.
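The Screen 4 numbers above are consistent with "separation" meaning the spread between the highest and lowest tercile mean. A minimal sketch under that assumption follows; the definition is inferred from the reported figures rather than taken from the harness, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def tercile_means(signal: Sequence[float], fwd_returns: Sequence[float]) -> List[float]:
    """Sort observations by signal value, split into three groups, average returns."""
    pairs = sorted(zip(signal, fwd_returns), key=lambda p: p[0])
    n = len(pairs)
    cuts = [0, n // 3, 2 * n // 3, n]
    return [mean(r for _, r in pairs[cuts[i]:cuts[i + 1]]) for i in range(3)]


def tercile_separation(means: Sequence[float]) -> float:
    """Spread between the best and worst tercile mean (inferred definition)."""
    return max(means) - min(means)


# Tercile means quoted in the post, 10-day horizon, in percent.
vol = [5.14, -1.79, 3.95]
funding = [3.50, 2.77, 1.05]

print(round(tercile_separation(vol), 2))      # 6.93
print(round(tercile_separation(funding), 2))  # 2.45

# The candidate only adds information if it separates more than the confounder.
print(tercile_separation(funding) > tercile_separation(vol))  # False
```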

The inversion check made the kill cleaner. ADR-0014 reserves a “Phase 4-prime” branch for the case where the inverted signal direction shows ≥ 1.0σ separation — i.e., maybe the right interpretation is the opposite of what was hypothesised. The inversion cohort (funding > 40th percentile) underperformed control on all three horizons with negative gap σ. The thesis wasn’t wrong about direction; it was right about direction and quantitatively too weak.

Here’s where SIA earns the post’s title. The Phase 4b 6-month smoke (Jan–Jun 2024) had looked materially better than anything Phase 2 or 3 ever produced: +16.84% return, Sortino 2.04, daily wallet Sharpe 1.98. Without SIA, that smoke would have justified the full 1000-epoch × 2-seed × 3-window hyperopt run. SIA put the smoke in context: 18 trades over 6 months in a +46% market-change window, with a single March 2024 winner (+174 USDT) driving most of the PnL. Across the full 3.5-year training window — 313 conjunction trades, multiple regimes — the gate doesn’t carry a tradeable lift. The smoke wasn’t fraudulent; it was a small-sample regime-favourable artefact, and SIA was the mechanism that converted it from “promising signal” to “don’t spend the compute” in 3 seconds.

Phase 2 burned roughly a Hetzner-hour × 4 windows to learn the shape of its failure. Phase 3 burned a Hetzner-hour × 3 seeds. Phase 4 burned 3 seconds of laptop time. That is the entire value proposition of pre-hyperopt screening.

Phase 5 — literature audit, pre-committed gates, and the routing verdict

After three failures with three different failure modes (Phase 2 Sharpe-bound profitable; Phase 3 unprofitable on both criteria; Phase 4 candidate signal too weak vs vol), the productive next question wasn’t “what strategy class do we try fourth?” It was: is §14 empirically achievable on this surface at all?

That’s an evidence question, not a strategy-building question. So Phase 5 started with a two-day literature audit of published academic cryptocurrency-trading research, looking for a Tier-1 source reporting Sharpe > 1.0 AND PF > 1.4 OOS on a retail-realistic universe (details). The audit produced a Criterion D verdict — mixed/inconclusive evidence — which routed to a structured next move: test the timeframe hypothesis cheaply (Path C), and gate any expensive follow-up (Path D) on four pre-committed verifications (details).

Path C: take Phase 2’s locked MacroDonchian default parameters (no hyperopt, no signal-class change) and re-run them at 8-hour primary timeframe across the same four walk-forward windows. The narrow question: was 4h the binding constraint? Result (details) — 0 of 4 windows clear §14 at 8h. Per-window Sharpe means: 4h 0.425 vs 8h 0.415 (a 2% degradation, not an improvement). PF passed comfortably in all four 8h windows (1.45–2.07 range), so the binding constraint stayed exactly where it was at 4h: returns per unit volatility too low for this universe at this strategy class, regardless of candle cadence.

Path D Gate 1 was the next pre-committed verification: retrieve the full text of Fieberg/Liedtke/Metko/Zaremba 2023 (Quantitative Finance) — the closest paper in the audit’s Criterion C set — and extract its profit factor figure for the long-only winner portfolio. If PF < 1.4 net of fees, Criterion C strictly fails on §14’s AND-clause and the path forward routes to Path E (stop in-universe; ship the framework).

The full text was retrieved (via the CC-BY 4.0 published version on open.icm.edu.pl, after a header-tweaked HTTP request bypassed the JavaScript CAPTCHA — recorded for reproducibility). Findings (details):

Profit factor is never reported in the paper. Anywhere. Zero matches across main results, the long-only robustness section (§4.4.2, Table 13), the liquidity-restricted subsamples (Table 12), the 768-design-choice robustness (Table 8), or the subperiod analysis (Table 5). The paper reports weekly mean return, standard deviation, Newey–West t-statistic, and annualised Sharpe. PF (or any close analogue) is never computed and never tabulated.
No transaction-cost analysis exists. Zero matches for “transaction cost”, “trading cost”, “fees”, “basis points”, “bps”, or “net of”. All performance figures (Sharpe 1.28 for cross-sectional winners; Sharpe 1.09 for long-only winners minus risk-free; Sharpe 1.43 for the most-liquid-25% best variant) are gross.

This is strictly stronger than the gate’s pre-committed FAIL condition. The metric required to evaluate the AND-clause doesn’t exist in the source; the net-of-fees clause is unsatisfiable because no friction analysis is reported. Criterion E fires. Path D is foreclosed. Path E is routed: stop in-universe; ship the framework artefacts (details).

The routing isn’t a soft conclusion. The gates were written into ADR-0017 before Path C ran, specifically to prevent the failure mode of “Path C fails → grind on without sharper criteria.” The pre-commit caught a 4–8 week false start (Path D infrastructure build) before it happened. That’s the demonstrable value.

The audit corrigendum — what AI-assisted literature research got wrong, and how it was caught

The Path D Gate 1 retrieval surfaced something the audit hadn’t anticipated. The Phase 5a literature audit, working under paywall constraints and time pressure, had attributed two specific phrases to Fieberg et al 2023:

“incurs substantial trading costs”

“extracts alphas largely from short positions”

These phrases appeared in the audit’s §5.6 verification update as caveats on the paper’s headline Sharpe figures — framing them as evidence that the paper itself flagged friction degradation and short-side alpha capture.

Neither phrase — nor its substantive content — appears anywhere in the paper.

The misattribution survived the 2-hour focused audit because the original retrieval relied on search-result snippets and abstracts. The paywall prevented full-text verification at audit time. The narrative looked coherent — search snippets about cross-sectional momentum often do discuss trading costs and short-side dynamics, and there’s a Liu/Tsyvinski/Wu 2022 paper that addresses exactly those topics for cryptocurrency. It’s the most likely actual source of the attributed phrasing. The audit’s compressed-time retrieval co-mingled snippets across sources and attributed to one paper what another said.

The error was caught by construction. ADR-0017’s Gate 1 was written into the Path D escalation pipeline for general epistemic-hygiene reasons — literature claims should be verified against full text before they fund infrastructure builds — not because this specific error was anticipated. When Phase 5d executed the gate, the full text didn’t contain the attributed phrases. The pre-committed gate caught the error because the gate was the verification mechanism (standalone teaching artefact).

The systemic lesson is what makes this section load-bearing for the rest of the post:

Paywall reliance + abstract-only summarisation produces attribution errors that survive review.

The mechanism:

The paywall prevents full-text access at the moment of summarisation.
Search snippets and abstracts get integrated into the audit’s narrative.
The narrative reads coherently — fluent summaries that sound like the source.
Nothing in the audit’s own evidence chain catches the misattribution.
The error propagates downstream into the artefacts the audit feeds (here: ADR-0017’s framing of Criterion C’s evidence weight; the §5.6 update).
Only forcing full-text retrieval for any load-bearing claim catches it.

This is sharpest when AI-assisted research is in the loop. LLMs do exactly what LLMs do — they produce fluent summaries that integrate sources without preserving granular “this exact phrase came from THIS source” attribution. Without an explicit “verify the load-bearing phrase against the full text” gate, the failure mode is structurally hard to catch. Reading the LLM’s summary feels like reading the source; the prose discipline is indistinguishable. Only the bibliographic discipline is missing, and it’s invisible inside the summary.

The defence is concrete: for any load-bearing claim, require full-text verification before that claim drives a decision. Codify this as a pre-committed gate, not a soft norm. Soft norms aren’t enforced under time pressure. Gates are.
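As a gate, this can be as small as a string check that blocks the decision until every attributed phrase is found in the retrieved full text. The sketch below is one way to write it, not the project's gate; the names are hypothetical and the sample text is a stand-in, not a quotation from any paper.

```python
import re
from typing import List, Sequence


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_phrases(full_text: str, attributed: Sequence[str]) -> List[str]:
    """Return every attributed phrase that does not occur in the full text."""
    haystack = normalise(full_text)
    return [p for p in attributed if normalise(p) not in haystack]


def gate_passes(full_text: str, attributed: Sequence[str]) -> bool:
    """The claim may drive a decision only if nothing is left unverified."""
    return not unverified_phrases(full_text, attributed)


# Stand-in document text for illustration only.
sample_text = "Example document. It reports mean return\nand Sharpe ratio."
claims = ["reports mean return and Sharpe ratio", "incurs substantial trading costs"]

print(unverified_phrases(sample_text, claims))  # ['incurs substantial trading costs']
```

Exact matching only catches misquoted phrases. A paraphrased claim still needs a human to read the relevant section, so a check like this is a floor rather than the whole gate.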

The corrigendum strengthens the Path E routing decision rather than weakening it. The original audit reading was “Fieberg et al claims Sharpe 1.28 but acknowledges substantial trading costs and short-side alpha capture.” The corrected reading is “Fieberg et al claims gross Sharpe 1.28 (with 1.09 long-only, 1.43 most-liquid-25%-best-variant) but doesn’t address trading costs at all and doesn’t report profit factor.” The corrected reading makes the paper an even thinner foundation for the original Path D scope. The literature’s PF claim — the AND-clause of §14 — was never verified because the literature never reported it.

What survives — the SIA harness and the framework artefacts

Five attempted-or-prevented in-universe phases, four strategy classes empirically tested, one literature foundation empirically verified and found wanting, one corrigendum. The strategy attempts terminate. The methodology survives.

The SIA harness (source, Apache 2.0) is the publishable artefact. Four mechanical screens that test whether a candidate signal adds information over a baseline, runnable on a laptop in seconds, with explicit pre-committed bar thresholds. Pre-hyperopt screening like this has some precedent in academic settings but is operationally rare in retail tooling: most retail backtests either skip the question or answer it implicitly through hyperopt-then-OOS, which costs hours per signal hypothesis. SIA answers it in seconds, before hyperopt. The Phase 4 FCMFD case is the worked example of why that matters; the 16 unit tests + mypy --strict cleanliness are the polish that lets the harness ship as installable Python.

Installable via:

pip install -e git+https://github.com/hallengray/signal-independence-audit

The methodology generalises beyond crypto. The four screens are signal-agnostic: any candidate signal-vs-baseline comparison fits the shape, on any asset class with a forward-return horizon and a confounder available for the information-split control.

Other framework artefacts are catalogued in ADR-0019’s “what we keep” section (details) but are not extracted as standalone packages in Phase 5e (they could be in later phases):

Deterministic backtest pipeline: image-pinned Docker + sha256 data-manifest + cross-machine determinism check (laptop = cloud byte-for-byte). Applies to any Freqtrade-class strategy on any data.
Look-ahead-safe alignment patterns: merge_informative_pair(ffill=True) discipline (for OHLCV joins across timeframes) + a custom non-OHLCV joiner with merge_asof(allow_exact_matches=False) for cadence-mixing signals like funding rates. Applies to any cadence-mixing problem.
Data manifest tooling: schema-validated JSON manifests with sha256 verification + cross-platform LF parity handling.
Pre-committed audit methodology: literature audit before strategy attempts, mechanical decision criteria pre-committed in scope docs, audit findings mapped to criteria mechanically rather than via judgment. The Path D escalation gates in ADR-0017 caught a 4–8 week false start — that’s the demonstrable value of the methodology over soft norms.
ADR + post-mortem discipline: numbered ADRs, post-mortem-on-kill pattern, cross-strategy comparison tables, references back to predecessor ADRs.
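The look-ahead-safe alignment item is the one most worth seeing in code. With pandas, merge_asof(direction="backward", allow_exact_matches=False) attaches to each candle the latest signal value stamped strictly before it. The sketch below reproduces that semantics in plain Python so the rule is visible; it is a stand-in for illustration, not the project's joiner, and the function name is hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def asof_strict_backward(
    candle_times: Sequence[int],
    signal_times: Sequence[int],
    signal_values: Sequence[float],
) -> List[Optional[float]]:
    """For each candle, take the latest signal stamped strictly before it.

    `signal_times` must be sorted ascending. A signal stamped exactly at the
    candle time is excluded, which is what prevents look-ahead when the
    signal is only published at that instant.
    """
    joined: List[Optional[float]] = []
    for t in candle_times:
        i = bisect_left(signal_times, t)  # first index with signal_time >= t
        joined.append(signal_values[i - 1] if i > 0 else None)
    return joined


# Hours since start: 4-hour candles, funding prints every 8 hours.
candles = [0, 4, 8, 12]
funding_times = [0, 8]
funding_values = [0.01, 0.02]

print(asof_strict_backward(candles, funding_times, funding_values))
# [None, 0.01, 0.01, 0.02]
```

Note the candle at hour 8 still sees the hour-0 value: the funding print stamped at hour 8 is not available until the following candle.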

The case-study directory (reading guide) is the receipt — ten curated documents covering every kill, the methodology, the audit, the corrigendum, and the routing. Three reading paths: chronological for the full trajectory, SIA-methodology-focused for readers who want the tool, negative-result-focused for readers who want the disconfirmation story.

What this isn’t

Path E is not “the project failed.” It is “the project produced a documented disconfirmation of the in-universe hypothesis, with mechanism, on a surface where the literature itself does not have positive evidence for retail-realistic strategies meeting the criteria.” Those are different statements with different implications.

The §14 bar (Sharpe > 1.0 AND PF > 1.4) held throughout nineteen ADRs. It wasn’t relaxed when Phase 2 fell short, when Phase 3 fell shorter, when Phase 4 was SIA-killed, when Path C’s timeframe shift turned out flat, or when Path D Gate 1 found the foundational paper didn’t even report the metric. The bar transitions from “contested-theoretical” (a bar someone might clear if conditions are right) to “documented-empirical” (a bar that wasn’t cleared on this surface across documented attempts under documented discipline). That’s a strictly more useful state for the bar to be in than “contested-theoretical.”

If anyone wants to deploy capital on BTC/USDT + ETH/USDT 4-hour spot long-only in the future, the case-study directory is the receipt. It names the universe, the strategy classes attempted (price-only trend-following, mean reversion, funding-conditioned trend-following, timeframe-shifted trend-following), the bar, the literature evaluated, the corrigendum, and the routing. A future operator can decide whether to revisit the surface with new evidence rather than re-running the same experiments blind.

What this also isn’t: a claim that crypto is unsuitable for systematic trading in general, or that the §14 bar is the right bar for every operator, or that funding-rate signals are categorically uninformative. The negative result is scoped narrowly to a documented universe, a documented bar, and four documented strategy attempts. Outside that scope, the artefact this post ships — the SIA harness — is precisely the tool you’d want to apply to whatever signal hypothesis is in scope for you.

Footnotes / references

The case study: ./case-study/00-readme.md — reading guide for the ten curated documents. The audit corrigendum is at ./case-study/10-audit-corrigendum.md as a self-contained teaching artefact about AI-assisted research failure modes.
The SIA harness: this repo’s ./sia/ directory. Apache 2.0. Python 3.11+. Pandas + numpy hard deps; pytest + mypy as test extras; jupyter + matplotlib as notebook extras. Install: pip install -e git+https://github.com/hallengray/signal-independence-audit.
The source project’s repo: private. Scope B of the Phase 5e shipping decision intentionally excludes the strategy implementation code, hyperopt outputs, raw backtest data, and ops playbooks from the public artefact. The case-study directory is the curated evidence chain for the narrative above; the full source project is not part of what’s published.
Maintenance posture: this repository is published as a finished research project, not an actively-maintained tool. Issues are welcome and may be responded to selectively; there is no response time commitment. PRs may not be reviewed. Forks are welcome.
Author: Femi Adedayo — hallengray@gmail.com.

End of post.

Why Audit Is the Missing Layer in Every Healthcare RAG System

Oluwafemi Adedayo — Wed, 15 Apr 2026 10:20:15 +0000

I've reviewed a lot of healthcare AI builds over the last year. Clinicians love the demos. CTOs love the architecture diagrams. The compliance team gets invited to the meeting in week six, usually right after someone asks "what happens if this is wrong?"

That question is almost always asked too late.

In non-regulated industries, a RAG system that hallucinates occasionally is an embarrassment. In healthcare, it is a liability. The difference between those two outcomes isn't the model. It isn't even the retrieval pipeline. It's whether you built audit in from the start, or bolted it on when someone demanded it.

This post is about why audit is the highest-leverage layer in a healthcare RAG system, what it actually means to build it properly, and why most teams skip it until the regulator asks.

The stakes are different in healthcare

In my last post I wrote about the RAG Maturity Model: five levels from naive demo to enterprise-grade system. RMM-3 (Better Trust) and RMM-5 (Enterprise) both include elements of evaluation and drift detection. In healthcare, those aren't maturity features you graduate to. They are table stakes before you go live.

Here's why. A RAG system retrieving from clinical guidelines, drug databases, or patient records is generating outputs that influence clinical decisions. When it retrieves the wrong passage and the LLM composes a confident answer citing it, a clinician may act on that. The error chain is:

Wrong retrieval → confident hallucination → clinical decision → patient outcome

Every step in that chain is invisible unless you've built the instrumentation to catch it. Audit is that instrumentation.

What "audit" actually means for a healthcare RAG system

Most teams hear "audit" and think logging: drop a timestamp and a query hash into a table and call it done. That is not audit. That is compliance theater.

Real audit in a healthcare RAG context has four layers:

1. Input audit
Log the exact query, the user who sent it, the timestamp, and the session context. If the system is connected to a patient record, log which record. You need to be able to answer: who asked what, about whom, and when.

2. Retrieval audit
Log every chunk retrieved, its document source, its version, and its retrieval score. This is the layer most teams omit, and it is the most important one. If a clinician later questions a recommendation, you need to replay exactly what the system saw, not just what it said. The EU AI Act and FDA transparency guidelines both require this capability for clinical AI systems.

3. Generation audit
Log the prompt sent to the LLM (including injected context), the raw model output, and any post-processing applied. This gives you faithfulness traceability. You can verify whether the generated answer actually derived from the retrieved context or whether the model improvised.

4. Decision audit
In high-stakes workflows, log whether the output was acted on, by whom, and what the downstream outcome was. This is the layer that enables feedback loops and model improvement. It's also the layer that protects the organisation when outcomes are questioned.
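One way to keep the four layers from collapsing into "a timestamp and a query hash" is to give each its own typed record. This is a minimal sketch of that shape; every class and field name is hypothetical, chosen to mirror the layer descriptions above rather than any particular product's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class InputAudit:
    query: str
    user_id: str
    timestamp: str
    session_id: str
    patient_record_id: Optional[str] = None  # only when a record is in scope


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    source_document: str
    source_version: str
    score: float


@dataclass(frozen=True)
class GenerationAudit:
    prompt: str          # including injected context
    raw_output: str
    post_processing: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class DecisionAudit:
    acted_on: bool
    acted_by: Optional[str] = None
    outcome: Optional[str] = None


@dataclass
class AuditRecord:
    input: InputAudit
    retrieval: List[RetrievedChunk] = field(default_factory=list)
    generation: Optional[GenerationAudit] = None
    decision: Optional[DecisionAudit] = None

    def missing_layers(self) -> List[str]:
        """Name the layers that have not been recorded for this inference."""
        missing = []
        if not self.retrieval:
            missing.append("retrieval")
        if self.generation is None:
            missing.append("generation")
        if self.decision is None:
            missing.append("decision")
        return missing
```

A record built from the input layer alone reports ["retrieval", "generation", "decision"] from missing_layers(), which is exactly the "we were logging queries" state described later in this post.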

HIPAA, the EU AI Act, and why "best effort" isn't a defence

Healthcare RAG systems operating in the US must comply with HIPAA. That means comprehensive audit trails logging every data access, every query, and every response. It means Business Associate Agreements with every third-party vendor in your pipeline: your vector database provider, your embedding API, your LLM vendor. Each of those is a point of potential PHI exposure, and each requires a contractual audit trail.

In the EU, the AI Act classifies clinical decision-support systems as high-risk AI. High-risk AI systems are required to maintain logs sufficient for post-market surveillance and for regulators to audit the system's operation retrospectively. "We were doing our best" is not a defence. The log either exists or it doesn't.

The practical consequence: if your healthcare RAG system doesn't have tamper-evident, timestamped, source-linked audit records for every inference, you are not compliant under either regime, regardless of how good your retrieval precision is.

The thing audit reveals that nothing else does

Here is what audit gives you that evaluation metrics alone don't: it tells you what the system is actually being used for.

Golden sets tell you how the system performs on questions you anticipated. Audit logs tell you how the system performs on questions you didn't.

In a healthcare deployment I reviewed, the golden set showed 87% faithfulness on clinical guideline queries. A strong result. The audit logs, when finally reviewed six weeks post-launch, showed that 23% of real queries were about medication dosages, a category almost entirely absent from the golden set. Faithfulness on that category, measured retrospectively, was 61%.

The golden set was honest. The system was fine on the questions the team had anticipated. But the audit revealed a gap that could have caused real harm. It caught the problem before anything serious happened, only because audit was actually running.
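The comparison that surfaced this gap can be sketched as a check on category shares: flag any category that is common in live traffic but nearly absent from the golden set. The thresholds and names below are hypothetical, and the sketch assumes queries have already been labelled with a category.

```python
from collections import Counter
from typing import Dict, List, Sequence


def category_shares(categories: Sequence[str]) -> Dict[str, float]:
    """Fraction of queries falling in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: n / total for c, n in counts.items()}


def coverage_gaps(
    live: Sequence[str],
    golden: Sequence[str],
    min_live_share: float = 0.10,    # hypothetical threshold
    max_golden_share: float = 0.02,  # hypothetical threshold
) -> List[str]:
    """Categories that matter in production but are barely tested."""
    live_shares = category_shares(live)
    golden_shares = category_shares(golden)
    return sorted(
        c
        for c, share in live_shares.items()
        if share >= min_live_share and golden_shares.get(c, 0.0) <= max_golden_share
    )


# Shaped like the deployment above: 23% of live queries are dosage questions.
live_queries = ["dosage"] * 23 + ["guideline"] * 77
golden_set = ["guideline"] * 50

print(coverage_gaps(live_queries, golden_set))  # ['dosage']
```

Running a check like this on a schedule is what turns a six-week delay into a same-week finding.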

That is what audit is for. Not a compliance checkbox. Not a liability shield. Operational visibility into the real distribution of your system's use.

The architecture of an auditable healthcare RAG system

Building audit in from the start looks like this:

[User Query]
     ↓
[Input audit log] → tamper-evident store (append-only)
     ↓
[Retrieval pipeline] → [Retrieval audit log: chunks, sources, versions, scores]
     ↓
[LLM generation] → [Generation audit log: prompt, raw output, model version]
     ↓
[PII scan + output filter]
     ↓
[Response to user]
     ↓
[Faithfulness score (async)] → appended to generation audit record

Key design decisions:

Append-only audit store. Write-only for the application, with no update or delete path. This is what makes the record tamper-evident.
Version-pinned retrieval. Your knowledge base must be versioned. The audit record must reference the specific version of each source document retrieved, not just the document name. If the guideline changes next month, you need to know which version was retrieved for each past query.
Async faithfulness scoring. Don't put faithfulness scoring in the hot path if you can avoid it. Score asynchronously and append the result to the generation audit record. This gives you the operational data without adding latency.
PII redaction before logging. Log the query, but run PII scanning before writing it to the audit store. Patient names or SSNs stored in cleartext would turn the audit log itself into a PHI exposure.
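Removing the update and delete paths stops the application from rewriting history, but detecting a change made underneath it needs something more. One common technique is to chain each entry to the hash of the previous one. The sketch below shows that idea in memory; it is an assumption about how tamper evidence could be implemented, not a description of a specific store, and the class name is hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


class AppendOnlyLog:
    """In-memory hash-chained log: exposes append and verify, nothing else."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    def append(self, record: dict) -> str:
        """Serialise the record, chain it to the previous entry, return its hash."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev, "payload": payload, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev + entry["payload"]).encode("utf-8")
            ).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


log = AppendOnlyLog()
log.append({"layer": "input", "query_id": "q-1"})
log.append({"layer": "retrieval", "query_id": "q-1", "chunks": ["c-7"]})
print(log.verify())  # True
```

In a real deployment the latest hash would also need to be anchored somewhere the application cannot write, otherwise the whole chain could be recomputed after an edit.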

Why teams skip this and what happens when they do

The honest answer is that audit feels like overhead when you're trying to ship. The vector store is working. The LLM is producing answers that look right. The demo is good. Audit adds latency, complexity, and cost with no visible benefit on the day you deploy.

The benefit shows up on the day something goes wrong.

Without audit, a healthcare organisation that receives a regulatory inquiry or a patient complaint has no way to replay what the system actually did. They cannot demonstrate that retrieval was grounded in approved sources. They cannot prove that PII was handled correctly. They cannot show that the answer given to the clinician derived from evidence rather than from model hallucination.

At that point, "we were logging queries but not the retrieved context" is the most expensive sentence in AI development.

Mapping audit to the RAG Maturity Model

If you've read the RAG Maturity Model post, here's where audit sits:

RMM Level	Audit Requirement
RMM-0 to RMM-2	None mandated, but instrument for it now
RMM-3 (Better Trust)	Generation audit and faithfulness scoring active
RMM-4 (Better Workflow)	Full four-layer audit running, traces queryable
RMM-5 (Enterprise)	Tamper-evident store, version-pinned retrieval, regulatory export capability

For non-healthcare RAG, audit is an RMM-5 concern. For healthcare RAG, it's an RMM-3 requirement you cannot skip. The maturity model is a starting point. In regulated domains, you apply your own floor.

The takeaway

Audit is not the last thing you add to a healthcare RAG system. It is the second thing: after you confirm the retrieval pipeline actually works, and before you let anyone act on an answer.

Build the append-only log. Version your knowledge base. Score faithfulness asynchronously. Run PII scanning before you write anything to storage. Make the retrieval record replayable.

The regulator who asks "what did your system retrieve before it gave that recommendation?" deserves an answer you can produce in under an hour. So does the clinician. So does the patient.

If you're building or evaluating a healthcare RAG system right now, which of the four audit layers is the weakest in your current stack? Drop it in the comments.

Femi Adedayo builds AI automation systems at Hgray AI and writes about RAG, LLMs, and production AI systems. The RAG Maturity Model and RAG-Forge CLI referenced in this post are available on GitHub.

The 5 Levels of RAG Maturity: How to Know When Your RAG Is Actually Production-Ready

Oluwafemi Adedayo — Tue, 14 Apr 2026 17:17:11 +0000

I've watched a lot of RAG projects ship in the last eighteen months. They all follow the same arc.

Week 1: somebody wires up a vector store, drops in a few PDFs, and demos it to the team. Everyone is impressed. The CEO forwards the Loom video to the board.

Week 4: the first real users start asking questions. The answers are... fine? Sometimes. Nobody can quite tell.

Week 8: someone asks "is this actually working?" and the room goes quiet. There's a notebook somewhere with eleven test questions in it. It hasn't been run since week 2.

Week 12: the project is either quietly shelved or it's still in production and nobody trusts it.

The problem isn't that RAG is hard to build. The problem is that RAG is hard to evaluate, and without evaluation you can't tell the difference between "a demo that sometimes works" and "a system you can put in front of customers."

This post is about a framework I've been using to close that gap: a 0-to-5 maturity model for RAG systems, with concrete exit criteria at each level. Call it RMM — the RAG Maturity Model. By the end you should be able to look at any RAG system, including your own, and answer the question "what level is this, and what's the next thing I need to fix?"

Why "production-ready" is a useless phrase

The phrase "production-ready RAG" gets thrown around like everyone agrees what it means. Nobody does.

For one team it means "the API returns 200s." For another it means "no hallucinations on the golden set." For a regulated team it means "we can prove to an auditor that PII never leaves the perimeter." These are wildly different bars, and conflating them is how projects end up shipping at one bar and getting evaluated at another.

What we need is the same thing the rest of software engineering figured out years ago: a maturity model. CMMI did it for process. The Well-Architected Framework did it for cloud. DORA did it for delivery. Each of these works because it gives you:

A small number of named levels
Concrete, measurable exit criteria for each level
A clear ordering — you can't skip rungs

Here's what that looks like for RAG.

The RAG Maturity Model

Level	Name	Exit Criteria
RMM-0	Naive	Basic vector search works
RMM-1	Better Recall	Hybrid search, Recall@5 > 70%
RMM-2	Better Precision	Reranker active, nDCG@10 +10%
RMM-3	Better Trust	Guardrails, faithfulness > 85%
RMM-4	Better Workflow	Caching, P95 < 4s, cost tracking
RMM-5	Enterprise	Drift detection, CI/CD gates, adversarial tests

Let's walk through each level, what it actually feels like, and how you know you're done.

RMM-0: Naive

This is the demo. You chunked some docs, embedded them with text-embedding-3-small, stuffed them into a vector store, and built a retrieval loop that grabs the top-k and pipes them into a prompt. It works on the questions you tested it with. It feels like magic.

Exit criteria: You can answer a question end-to-end and the answer references the right document at least some of the time.

Why teams get stuck here: They don't. They get stuck moving past here, because RMM-0 looks so close to working that nobody believes it's only level zero.

The lie RMM-0 tells you: that retrieval quality is mostly solved by picking a good embedding model. It isn't.

RMM-1: Better Recall

The first thing you discover at RMM-0 is that vector search misses obvious matches. Someone asks a question with a specific product code or acronym, and the right document is sitting in your index, but cosine similarity ranks five irrelevant docs above it because the query and the doc don't share enough semantic surface area.

The fix is hybrid search — combine dense (vector) and sparse (BM25 or SPLADE) retrieval, then merge the results with reciprocal rank fusion or a similar scheme. Sparse retrieval catches the literal-match cases that embeddings fumble.
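Reciprocal rank fusion itself is only a few lines. Here is a minimal sketch, assuming each retriever returns a ranked list of document IDs. The function name is illustrative, and `k=60` is the constant commonly used for RRF rather than anything tuned for your corpus.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one fused ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for deterministic output
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]   # vector search results, best first
sparse = ["c", "a", "d"]  # BM25 results, best first
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only ranks, you don't have to normalise cosine similarities against BM25 scores before merging.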

Exit criteria: Recall@5 above 70% on a golden set of at least 50 questions. (Recall@5 = "of the questions where the right answer exists in your corpus, how often is the relevant doc in the top 5 retrieved chunks.")

The honest signal: if you can't measure Recall@5, you're still at RMM-0 regardless of how fancy your retrieval pipeline looks. Measurement is the gate, not the technique.
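Measuring it does not take much. The sketch below follows the definition above, with one assumption about data shape: the golden set maps each question ID to the set of relevant document IDs, and retrieval results map each question ID to a ranked list. Both shapes and the function name are hypothetical.

```python
def recall_at_k(golden, retrieved, k=5):
    """Share of answerable questions with a relevant doc in the top k.

    golden:    {question_id: set of relevant doc IDs}
    retrieved: {question_id: ranked list of retrieved doc IDs}
    """
    # Questions with no relevant doc in the corpus are excluded, per the definition
    answerable = [q for q, relevant in golden.items() if relevant]
    if not answerable:
        raise ValueError("golden set has no answerable questions")
    hits = sum(
        1 for q in answerable
        if set(retrieved.get(q, [])[:k]) & golden[q]
    )
    return hits / len(answerable)


golden = {"q1": {"d1"}, "q2": {"d9"}, "q3": set()}
retrieved = {
    "q1": ["d3", "d1"],                          # relevant doc at rank 2: hit
    "q2": ["d2", "d4", "d5", "d6", "d7", "d9"],  # relevant doc at rank 6: miss
}
score = recall_at_k(golden, retrieved, k=5)
```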

RMM-2: Better Precision

Now retrieval finds the right doc, but it also finds four wrong ones, and the LLM gets confused or hedges. This is where you add a reranker: a second-stage model (Cohere Rerank, BGE Reranker, Voyage, etc.) that takes your top-50 retrieved chunks and reorders them by actual relevance to the query.

The lift from a good reranker is usually larger than the lift from picking a better embedding model, and it's much cheaper to iterate on.

Exit criteria: nDCG@10 improves by at least 10% versus the RMM-1 baseline on the same golden set. (nDCG@10 captures ordering quality, not just whether the right doc is in the set.)
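For reference, here is a minimal nDCG@k sketch. It makes two assumptions worth flagging: the ideal ordering is computed by re-sorting the same retrieved list (a simplification, since a stricter version would use every relevant document in the corpus), and the 10% target is read as a relative lift over the baseline. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of each retrieved chunk, in ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_lift(baseline, candidate):
    """Assumes the exit criterion means a relative improvement over the baseline."""
    return (candidate - baseline) / baseline
```

Average `ndcg_at_k` over the golden set for the RMM-1 pipeline and again with the reranker switched on, then compare the two averages with `relative_lift`.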

Common mistake: skipping straight from RMM-0 to a reranker without ever measuring recall. If your retrieval has 40% recall, a reranker can't save you — it can only reorder what it's given.

RMM-3: Better Trust

At this point retrieval is solid, but the LLM still occasionally makes things up. It cites sources that don't say what it claims they say. It answers questions it shouldn't answer. It leaks PII it should redact.

RMM-3 is about the answer, not the retrieval. You add:

Faithfulness scoring — does the generated answer actually follow from the retrieved context, or is the model improvising?
Guardrails — input/output filters for prompt injection, jailbreaks, and policy violations.
PII scanning — detect and redact names, emails, SSNs, etc., on the way in and the way out.

Exit criteria: Faithfulness > 85% on the golden set, plus guardrails and PII scans wired into the request path with measurable block rates.
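The shape of a faithfulness score can be sketched as "fraction of claims in the answer that the retrieved context supports". The example below is deliberately naive: it splits the answer into sentences and uses token overlap as the support check. That lexical check is a stand-in I am assuming for illustration. In practice the `is_supported` hook would call an LLM judge or an entailment model, and all names here are hypothetical.

```python
import re
from typing import Callable


def split_claims(answer: str) -> list[str]:
    """Treat each sentence as one claim (a crude approximation)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def lexical_support(claim: str, context: str, threshold: float = 0.6) -> bool:
    """Naive proxy: a claim counts as supported if most of its tokens occur in the context."""
    claim_tokens = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not claim_tokens:
        return True
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= threshold


def faithfulness(
    answer: str,
    context: str,
    is_supported: Callable[[str, str], bool] = lexical_support,
) -> float:
    """Fraction of claims in the answer that the context supports."""
    claims = split_claims(answer)
    if not claims:
        return 1.0
    return sum(is_supported(c, context) for c in claims) / len(claims)


context = "The refund window is 30 days from delivery."
answer = "The refund window is 30 days. Shipping is always free."
score = faithfulness(answer, context)  # second sentence has no support in the context
```

Keeping the support check pluggable means the aggregation and the 85% gate stay the same when you swap the proxy for a real judge.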

Why this level matters more than the previous two: RMM-1 and RMM-2 fail visibly — users see bad answers and complain. RMM-3 failures are invisible. The model confidently states something false and the user believes it. These are the failures that destroy trust and end projects.

RMM-4: Better Workflow

Your RAG system answers correctly. Now it has to do that fast, cheaply, and predictably, every day, for a year, while the team keeps shipping changes. RMM-4 is the operational layer:

Caching at the embedding, retrieval, and generation tiers
P95 latency under 4 seconds end-to-end
Cost tracking per query so the finance team doesn't ambush you in month three
Distributed tracing (OpenTelemetry, Langfuse) so you can debug a bad answer two weeks after it happened

Exit criteria: P95 < 4s, cost-per-query measured and trending in a dashboard, and a trace for any query you can name.
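If your tracing backend does not report percentiles directly, P95 is easy to compute from raw latencies. This sketch uses the nearest-rank method. Other percentile definitions interpolate and can give slightly different values, so pick one and keep it fixed. The function name is illustrative.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest observed value with at least pct% of observations at or below it."""
    if not values:
        raise ValueError("no observations")
    ordered = sorted(values)
    # Multiply before dividing to avoid float error pushing the rank up by one
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_s = [1.2, 0.8, 3.9, 2.1, 6.5, 1.0, 1.4, 0.9, 2.8, 1.1]
p95 = percentile_nearest_rank(latencies_s, 95)
meets_gate = p95 < 4.0  # the RMM-4 latency criterion
```

With only ten samples, one slow request sets the P95, which is a good reason to compute it over a realistic traffic window rather than a handful of test queries.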

The insight: at RMM-4 you stop thinking about RAG as a model problem and start thinking about it as a systems problem. The hardest bugs are no longer prompt bugs; they're cache-invalidation bugs.

RMM-5: Enterprise

The final level is the one most teams never need, but the ones who do can't ship without it:

Drift detection — alerts when the embedding distribution, query distribution, or answer quality shifts
CI/CD gates — every PR runs the full audit and is blocked if faithfulness, recall, or precision regress beyond a threshold
Adversarial test suites — red-team prompts, prompt-injection corpora, and known-bad inputs run on every release

Exit criteria: A merge to main triggers an evaluation run, the run posts to a dashboard, and a regression beyond a configured threshold blocks the merge.
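The gate logic itself can be very small. The sketch below assumes baseline and current metrics are plain dictionaries where higher is better, and that the CI step fails the job by exiting non-zero. Metric names, thresholds, and function names are all hypothetical.

```python
import sys


def find_regressions(baseline, current, max_drop):
    """Return one message per metric that fell by more than its allowed drop."""
    failures = []
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - current[metric]
        if drop > allowed:
            failures.append(
                f"{metric}: {baseline[metric]:.3f} -> {current[metric]:.3f} "
                f"(allowed drop {allowed:.3f})"
            )
    return failures


def gate(baseline, current, max_drop):
    """Exit non-zero so the CI job, and therefore the merge, is blocked."""
    failures = find_regressions(baseline, current, max_drop)
    for line in failures:
        print(f"REGRESSION {line}")
    if failures:
        sys.exit(1)


baseline = {"faithfulness": 0.88, "recall_at_5": 0.78}
current = {"faithfulness": 0.84, "recall_at_5": 0.78}
max_drop = {"faithfulness": 0.02, "recall_at_5": 0.02}
failures = find_regressions(baseline, current, max_drop)
```

Allowing a small drop per metric keeps normal evaluation noise from blocking every merge while still catching real regressions.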

What this really gives you: the ability to keep shipping changes to a RAG system without breaking it. Up to RMM-4 you're building. At RMM-5 you're maintaining, and maintenance is where most RAG projects die.

How to use the model

Three rules.

1. Don't skip levels. RMM is ordered for a reason. A reranker on top of broken recall is wasted compute. Faithfulness scoring on top of bad retrieval just measures how confidently your model is wrong. Drift detection on a system with no golden set has nothing to drift from. The exit criteria are gates, not suggestions.

2. Measure before you optimize. The biggest difference between teams stuck at RMM-0 and teams climbing the model is whether they have a golden set. Fifty questions with known-good answers is enough to start. You can grow it later. You cannot skip it.

3. Score honestly. It is much more useful to know you are an honest RMM-1 than to claim you are an aspirational RMM-3. The model only works if the level is verified, not asserted.

Trying this on your own system

I built a CLI called RAG-Forge that implements this model end-to-end. It scaffolds a RAG pipeline, runs evaluation as a CI gate, and scores any audit report against the RMM levels above. It's MIT-licensed and works against your existing pipeline — you don't need to rewrite anything to score it.

The scoring command takes a JSON audit report and tells you which level you're at and what's blocking the next one:

npm install -g @rag-forge/cli

rag-forge init basic --directory my-rag
cd my-rag
rag-forge index --source ./docs
rag-forge audit --golden-set eval/golden_set.json
rag-forge assess --audit-report reports/audit-report.json

The assess command output looks roughly like this:

Current level: RMM-2 (Better Precision)
✓ Recall@5: 78% (target: >70%)
✓ nDCG@10 lift: +14% (target: +10%)
✗ Faithfulness: 71% (target: >85%) — blocking RMM-3

Next action: faithfulness is below RMM-3 threshold. Add output
verification or a guardrails layer before claiming Better Trust.

That's the whole point of the model — turning "is our RAG good?" into "we are at level 2 and the specific thing blocking level 3 is faithfulness below 85%." You can put that on a roadmap. You can argue about it in a PR review. You can show it to a non-technical stakeholder and they will understand it.
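If you want to score by hand, the level logic is a short loop over ordered gates. To be clear, this is not RAG-Forge's implementation: the metric keys and the `assess` function are hypothetical, and only the four quantitative gates from the table are modelled (RMM-5's criteria are process checks rather than single numbers).

```python
# (level, name, metric key, pass condition), in the order the levels must be climbed
GATES = [
    (1, "Better Recall", "recall_at_5", lambda v: v > 0.70),
    (2, "Better Precision", "ndcg_at_10_lift", lambda v: v >= 0.10),
    (3, "Better Trust", "faithfulness", lambda v: v > 0.85),
    (4, "Better Workflow", "p95_latency_s", lambda v: v < 4.0),
]


def assess(metrics):
    """Return (current level, blocker description or None).

    Stops at the first failed or unmeasured gate, so levels cannot be skipped.
    """
    level = 0
    for gate_level, name, key, passes in GATES:
        value = metrics.get(key)
        if value is None or not passes(value):
            return level, f"{key} blocks RMM-{gate_level} ({name})"
        level = gate_level
    return level, None


level, blocker = assess(
    {"recall_at_5": 0.78, "ndcg_at_10_lift": 0.14, "faithfulness": 0.71}
)
```

A missing metric counts as a failed gate, which matches the rule that measurement is the gate, not the technique.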

Even if you don't use the tool, you can use the framework. Print the table at the top of this post, score your own system honestly, and you will already know more about where your RAG stands than most teams shipping RAG today.

Where the model breaks down

No maturity model survives contact with reality unmodified, so here are the cases where RMM bends.

Agentic RAG with multi-hop retrieval doesn't fit cleanly into per-query Recall@5. You usually need to score each hop separately and roll up.
Conversational RAG (long chat history) needs a faithfulness metric that accounts for the running context, not just the latest retrieval.
Domain-specific RAG (legal, medical, code) often needs custom targets: 85% faithfulness is too low for a legal brief and too high for a brainstorming assistant.

The model is a starting point, not a constitution. If you find a place where it doesn't fit your system, the right move is to write down why and what you replaced it with. That's still better than no framework at all.

The takeaway

If you remember nothing else from this post: build a golden set, score yourself honestly, and stop shipping RAG without knowing what level you're at.

The tools to do this are free. The framework is published. The reason most RAG projects don't have evaluation isn't that it's hard — it's that nobody insists on it. Be the person who insists.

If you try the model on your own system, I'd love to hear where it broke down for you. Drop a comment with your level and what's blocking the next one, or open a discussion in the RAG-Forge repo. The model gets better the more systems we score against it.