Forem: Raihan

I built the first open benchmark for federal contracting AI. Here's what it shows about frontier LLMs.

Raihan — Tue, 12 May 2026 12:16:46 +0000

If you ask GPT-4o or Claude to extract Federal Acquisition Regulation clause numbers from a federal solicitation, a non-trivial fraction of the time they will hand you a number that does not exist. There is no FAR 52.999-99. The model just made it up. For a federal contractor staffing a proposal, that is the difference between a clean compliance matrix and a rejected bid.

I went looking for a benchmark that measured this. There isn't one. Commercial tools in the space — Capture2Proposal, GovTribe, GovWin, OrangeSlices — all do natural-language processing on federal solicitations, but none publish benchmarks. Academic work on RFP processing is narrow and one-off. GSA's own srt-fbo-scraper covers only Section 508 compliance.

So I built one.

FedProc-Bench

FedProc-Bench is a multi-task benchmark for federal procurement NLP. Four tasks, drawn from real federal contracting sources:

#	Task	What it tests
1	Notice type classification	Eight SAM.gov notice-type buckets — Solicitation, Combined Synopsis/Solicitation, Sources Sought, and so on
2	NAICS sector prediction	Twenty top-level NAICS sectors
3	Set-aside identification	Multi-label across SBA, SDVOSB, WOSB, EDWOSB, 8(a), HUBZone, and SDB
4	FAR / DFARS clause extraction	Token-level entity recognition on canonical clause numbers like `52.219-9` or `252.225-7042`

Task 4 is the headline. It is the task where frontier LLMs visibly fail.

The data sources are public and free. SAM.gov provides the solicitations themselves through its Opportunities API. The Electronic Code of Federal Regulations gives me Title 48 — the full FAR and DFARS — as structured XML, which I parse down to 1,032 individual clause records. Claude Haiku fills in a small amount of synthetic augmentation for rare set-aside types like HUBZone and EDWOSB that real SAM data barely contains. Every record carries a source and label_origin field, so anyone can audit the provenance line by line.

The v0 release ships 1,615 records split 1,129 / 243 / 243 train / val / test. That is small. I had originally targeted 10,000. I will come back to why I have fewer in the section on where this is actually weak.

The model

The companion model, raihan-js/fedproc-180m-v0, is a 149-million-parameter ModernBERT-base with one shared encoder and four task heads: sequence classification heads for tasks 1 and 2 (softmax over the label set), seven sigmoid outputs for the multi-label set-aside task, and a per-token BIO head for the FAR-clause extractor.

The interesting design choice is the task mask. Records from different sources contribute different supervision: SAM metadata contributes tasks 1, 2, and 3; raw FAR clause text contributes task 4; synthetic excerpts contribute all four. Inside the model's forward pass, a per-record four-boolean mask says which heads get gradient for each example. That is how a single model trains jointly on heterogeneous sources without diluting any head's signal.
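To make the task mask concrete, here is a minimal sketch that uses plain Python lists instead of tensors. The names `SOURCE_MASKS` and `masked_multitask_loss` and the loss values are assumptions for illustration, not the repo's code. In a real training loop the same masking would be applied to per-head loss tensors before the backward pass.

```python
# Hypothetical sketch of per-record task masking; names and numbers are illustrative.
TASKS = ["notice_type", "naics", "set_aside", "far_clause"]

# Which heads each record source supervises, following the description above.
SOURCE_MASKS = {
    "sam_metadata": [True, True, True, False],
    "far_text": [False, False, False, True],
    "synthetic": [True, True, True, True],
}


def masked_multitask_loss(per_head_losses, sources):
    """Average each head's loss over only the records that supervise it.

    per_head_losses: one [l1, l2, l3, l4] list per record.
    sources: one source name per record, used to look up its mask.
    Returns (total_loss, per_head_means); a head with no supervised
    record in the batch contributes 0.0.
    """
    sums = [0.0] * len(TASKS)
    counts = [0] * len(TASKS)
    for losses, source in zip(per_head_losses, sources):
        mask = SOURCE_MASKS[source]
        for head, (loss, active) in enumerate(zip(losses, mask)):
            if active:  # masked-out heads never see this record
                sums[head] += loss
                counts[head] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return sum(means), means


batch_losses = [
    [0.9, 1.2, 0.4, 5.0],  # SAM record: the task-4 value is ignored
    [3.0, 3.0, 3.0, 0.2],  # FAR clause record: only task 4 counts
    [0.5, 0.8, 0.6, 0.4],  # synthetic record: all four heads count
]
total, per_head = masked_multitask_loss(
    batch_losses, ["sam_metadata", "far_text", "synthetic"]
)
```

The large placeholder losses (5.0 and 3.0) never reach the averages, which is the point: a head is only updated by records that actually carry a label for it.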

Training takes 4.3 minutes on a single RTX 3060 for six epochs. The whole training run cost me zero dollars and the electricity to keep my desk lamp on.

What I compared against

I ran the same four tasks through three frontier systems and the trained model:

Claude Sonnet 4.6 (Anthropic)
GPT-4o (OpenAI)
Claude Haiku 4.5 (Anthropic)
FedProc-180M v0 (the model I trained)

Each system gets the same prompt and the same test split. For task 4, I use a clean canonical metric: entity F1 with exact match on clause-number strings, plus a hallucination rate, which I define as the share of predicted clause numbers that do not appear anywhere in the cached real FAR + DFARS corpus. Inventing a number that does not exist is the failure mode that matters here.
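The two task-4 metrics can be sketched in a few lines. This is a simplified, set-based illustration rather than the benchmark's scoring code: the regex `CLAUSE_RE`, the helper names, and the toy corpus are all assumptions, and the real evaluation works on token-level spans.

```python
import re

# Assumed pattern for canonical FAR (52.xxx-x) and DFARS (252.xxx-xxxx) clause numbers.
CLAUSE_RE = re.compile(r"\b(?:52|252)\.\d{3}-\d{1,4}\b")


def extract_clauses(text):
    """Return the set of clause-number strings found in free text."""
    return set(CLAUSE_RE.findall(text))


def score_task4(predicted, gold, known_clauses):
    """Exact-match entity F1 plus hallucination rate over a list of records.

    predicted, gold: one set of clause strings per record.
    known_clauses: every clause number present in the reference corpus.
    """
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    # A hallucination is a predicted number that exists nowhere in the corpus.
    invented = sum(1 for p in predicted for c in p if c not in known_clauses)
    rate = invented / n_pred if n_pred else 0.0
    return f1, rate


known = {"52.219-9", "52.212-4", "252.225-7042"}  # toy corpus
gold = [{"52.219-9", "252.225-7042"}, {"52.212-4"}]
outputs = [
    "Clauses 52.219-9 and 252.225-7042 apply.",
    "See FAR 52.212-4 and FAR 52.999-99.",
]
predicted = [extract_clauses(o) for o in outputs]
f1, hallucination_rate = score_task4(predicted, gold, known)
```

In this toy run one of the four predicted numbers (52.999-99) is absent from the corpus, so the hallucination rate is 0.25 even though recall is perfect. The two metrics are reported separately for exactly that reason.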

The headline results

Aggregate scores across all four tasks (mean of per-task macro-F1; task 4 is entity F1):

Rank	Model	Aggregate	T4 entity F1	T4 hallucination
1	Claude Sonnet 4.6	0.911	0.991	0.0% (0 / 493)
2	GPT-4o	0.896	0.970	4.4% (23 / 517)
3	Claude Haiku 4.5	0.851	0.916	15.0% (88 / 587)
4	FedProc-180M v0	0.497	0.921	5.5% (26 / 473)

Claude Sonnet 4.6 is genuinely impressive — zero invented clauses across 493 predicted spans. GPT-4o is close behind. The compact model places fourth on the aggregate because tasks 2 and 3 are weak in v0 (more on this in a moment), but on task 4 — the headline task — it is right behind GPT-4o and roughly matches Claude Haiku on F1 while inventing about a third as many fake clauses.

The honest read

Before anyone runs away with that table: 65 of the 220 task-4 test records are Claude-generated synthetic excerpts that cite specific pinned clauses. Frontier models from the Claude family are being graded on text their own family wrote. That is a real bias.

The way I disclose this in the benchmark is to break out task 4 by record source. The real-FAR slice is the honest read because no system in this comparison helped author it:

Model	Real FAR text — F1	Real FAR text — hallucination
Claude Sonnet 4.6	0.984	0.0% (0 / 182)
GPT-4o	0.937	11.0% (23 / 209)
Claude Haiku 4.5	0.804	32.1% (88 / 274)
FedProc-180M v0	0.800	13.8% (22 / 159)

So on the cleanest available slice, Claude Sonnet 4.6 still wins outright. GPT-4o is solid but invents a clause number more than one in ten times. Claude Haiku 4.5 invents a clause number almost a third of the time. And FedProc-180M, the compact specialized model, matches Haiku on F1 with less than half the hallucination rate.

That last point is the v0 takeaway: a 150M-parameter model trained in four minutes on a consumer GPU produces task-specific extraction that is competitive with Claude Haiku and demonstrably more reliable on the failure mode that matters for the use case. At roughly fifty times lower latency and three orders of magnitude lower per-call cost, that is a real Pareto point for federal contractors who want on-prem, predictable, auditable FAR-clause extraction.

Where this is actually weak

I am not going to oversell the rest of the table. Tasks 1, 2, 3 are limited in v0 because the SAM.gov daily quota on a non-federal API key ran out during the data pull before I could fetch description text for the cached solicitations. The model sees only titles like 53--O-RING for task-2 NAICS prediction. That is essentially impossible. v0.1, once the quota window cycles, will retrain on the full description text and these numbers should move substantially.

The other honest caveat: 1,129 training records is tiny by NLP standards. The fact that ModernBERT-base lifts task 4 to 0.921 F1 on this little data is partly attributable to ModernBERT being a genuinely strong base model, and partly to the fact that the FAR-clause-number pattern is fundamentally structural — it is easier to learn 52.<digits>-<digits> than to learn what makes a notice an RFI versus a Sources Sought.

Why I built this

I am a co-founder at VETR Proposal, which builds AI-assisted federal proposal management for SDVOSB, WOSB, and 8(a) contractors. Reliable FAR clause handling is core to that product. Before I shipped anything to a customer that touches clause citation, I wanted to know how often current AI systems make things up. There was no public answer. So I made the measurement public.

That is the other reason the benchmark and dataset are open: anyone working in this space — competitors, GSA, academic groups, internal teams at large contractors — can now use the same yardstick. The benchmark is the contribution. The model is just one entry on the leaderboard.

Try it

The model: raihan-js/fedproc-180m-v0.

The dataset: raihan-js/fedproc-bench.

Both are Apache 2.0. The source code that built them lives in the repo (link in the model card). To reproduce from scratch you need a SAM.gov developer key (free), an Anthropic key for the synthetic step, and a couple of hours on a GPU you already have.

If you find a clause number my model misses, an obvious bug, or a hallucination my regex did not catch — open an issue. v0.1 lands tomorrow.

If you build in federal contracting tech or care about the reliability of LLMs on regulated text, let me know what you find when you run it.

Three small models for healthcare intake — and what shipping all three taught me

Raihan — Tue, 12 May 2026 01:20:37 +0000

Two months ago I started a portfolio project: build three small specialized language models for healthcare practice intake, benchmark each one honestly against frontier APIs, and write about what I learned. The goal was to build the case that small specialized models still have a place in the 2026 toolkit alongside frontier LLMs — not as replacements, but as the first stage of a hybrid pipeline.

This is the post about the third model. It's also the post about the suite — what worked across all three, what didn't, and the pattern that emerged.

The three models, all on Hugging Face:

clarioscope-intent-deberta-v1 — 184M DeBERTa-v3-base, 7-class intent classification. Within 4 pp of Claude Haiku 4.5, 22× faster on CPU. methodology post →
clarioscope-phi-deberta-v1 — 125M RoBERTa-base, 18-category PHI span detection (HIPAA Safe Harbor). Loses on aggregate but triples frontier F1 on geographic locations. methodology post →
clarioscope-insurance-v1 — 125M RoBERTa-base, 12-field insurance / billing extraction. This post.

What the third model does

A 125M-parameter RoBERTa fine-tune that extracts twelve insurance/billing fields from patient text: carrier name, plan type, member ID, group number, policy number, subscriber name, relationship, claim ID, prior-auth number, copay, deductible, and billed amount. Output is BIO-tagged token spans, which downstream code converts into a JSON object a billing system can ingest directly.
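The span-to-JSON step might look roughly like the sketch below. The exact label strings, the lowercase key names, and the "first span per field wins" policy are assumptions for illustration. Only some of these label names (such as `MEMBER_ID`, `CLAIM_ID`, and `AUTH_NUMBER`) appear in this post; the rest are guessed from the field list.

```python
import json

# Assumed label names for the twelve fields; several are guesses.
FIELDS = [
    "CARRIER", "PLAN_TYPE", "MEMBER_ID", "GROUP_NUMBER", "POLICY_NUMBER",
    "SUBSCRIBER_NAME", "RELATIONSHIP", "CLAIM_ID", "AUTH_NUMBER",
    "COPAY", "DEDUCTIBLE", "BILLED_AMOUNT",
]


def spans_to_record(spans):
    """Collapse decoded spans into one flat dict; the first span per field wins."""
    record = {field.lower(): None for field in FIELDS}
    for span in spans:
        key = span["label"].lower()
        if key in record and record[key] is None:
            record[key] = span["text"]
    return record


spans = [
    {"text": "Aetna", "label": "CARRIER"},
    {"text": "AET-998-2210", "label": "MEMBER_ID"},
    {"text": "$35", "label": "COPAY"},
    {"text": "AET-000-0000", "label": "MEMBER_ID"},  # duplicate field, ignored
]
record = spans_to_record(spans)
payload = json.dumps(record)  # JSON string a downstream system could ingest
```

Fields the model did not find stay `None`, so the consumer always receives the same twelve keys.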

The benchmark, on a 672-example held-out test set:

Model	Macro F1	Weighted F1	Latency / example	Cost / 1K inferences
`clarioscope-insurance-v1` (CPU)	0.7882	0.8202	45.4 ms	$0.00
`gpt-4o-2024-11-20`	0.9562	0.9572	1202 ms	$1.90

Same speed/cost shape as the other two models in the suite: ~26× faster than GPT-4o, $0 marginal cost. The accuracy gap is concentrated in a small number of low-frequency fields.

Fine-tune is competitive on the high-volume fields. CLAIM_ID (0.95 vs 1.00), MEMBER_ID (0.91 vs 0.99), CARRIER (0.91 vs 0.96), SUBSCRIBER_NAME (0.89 vs 0.91 — essentially tied). These four fields collectively cover ~70% of the test entities.

The gap is concentrated in a few low-volume fields. AUTH_NUMBER is the standout weakness: 0.30 vs 0.99. The training set has only 770 AUTH_NUMBER spans and the format space is wide (PA-4421, auth #998-2210, AUTH998212, etc.). Same structured-ID problem as the PHI detector had with MRN. PLAN_TYPE is similar: short strings like "PPO", "HMO" with overloaded surface forms.

Three patterns that repeated across all three models

1. Synthetic data is fast and noisy, and the noise is systematic

In all three models, gpt-4o-mini produced label-noise patterns I had to discover and fix:

Intent classifier (Model 1): the LLM over-fitted to "ChatGPT-polite" message style on first attempts. Fixed by adding a mandatory realism mix (40/40/20 polished / casual / messy) to the generation prompt.
PHI detector (Model 2): the LLM included cue words in entity spans — "MRN 8472301" annotated as the MRN span instead of "8472301". About 8.6% of training spans had this contamination. Fixed by clean_data.py (cue-word stripping + re-locating spans).
Insurance extractor (Model 3): same cue-word noise pattern as PHI — "member ID AET-998-2210" instead of "AET-998-2210", "copay $35" instead of "$35". 7.4% of spans needed cleanup. Same fix.

Lesson: when synthetic data is the input, label QA is part of the pipeline. The LLM that generates the annotations does not produce ground truth, it produces a draft that humans (or scripts) need to validate. The version of clean_data.py that I shipped for Models 2 and 3 is now part of every future synthetic NER project I'll build.

2. Cross-generator test sets are not optional — and val numbers lie

In the two span-extraction models, val macro F1 was 17–23 percentage points higher than test macro F1; the intent classifier went the other way:

Model	Val macro F1	Test macro F1	Gap
Intent classifier	0.886	0.911	-0.025 (test actually higher)
PHI detector	0.863	0.630	+0.233
Insurance extractor	0.957	0.788	+0.169

The intent classifier was the exception — classification with 7 categories is more robust than span extraction with 12+ categories. For both span-extraction models, the val numbers from a same-generator split would have produced overconfident model cards.

Lesson: same-generator val splits are useful for early development feedback, but the headline number that goes on a model card should be from a held-out set generated by a different model with a different prompt style. Otherwise the benchmark inflates and you'll be surprised in production.
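A cross-generator split is easy to enforce if every record carries the name of the model that generated it. The sketch below assumes a hypothetical `generator` field and invented toy records; the gap calculation uses the PHI detector numbers from the table above.

```python
def split_by_generator(records, held_out_generator):
    """Keep one generator's records entirely out of train/val as the test set."""
    dev = [r for r in records if r["generator"] != held_out_generator]
    test = [r for r in records if r["generator"] == held_out_generator]
    return dev, test


def generalization_gap(val_f1, test_f1):
    """Positive values mean the same-generator val score was optimistic."""
    return round(val_f1 - test_f1, 3)


records = [  # toy records, invented for illustration
    {"text": "need to reschedule my cleaning", "generator": "gpt-4o-mini"},
    {"text": "Is the office taking new patients?", "generator": "gpt-4o-mini"},
    {"text": "billing q about my last visit", "generator": "claude"},
]
dev, test = split_by_generator(records, held_out_generator="claude")
gap = generalization_gap(0.863, 0.630)  # PHI detector: val vs test macro F1
```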

3. Small models beat frontier on linguistic entities, lose on structured-ID memorization

This pattern showed up clearest in the PHI detector and was the central observation in that model's writeup. The insurance extractor repeats it:

Linguistic + bounded vocabulary fields (CARRIER from a short list of insurance companies, CLAIM_ID with predictable claim patterns, SUBSCRIBER_NAME using ordinary names): fine-tune is competitive or tied with GPT-4o.
Structured-ID fields with high format variance (AUTH_NUMBER, PLAN_TYPE token boundaries, GROUP_NUMBER formats that vary widely): frontier models win because they've seen far more format variance during pretraining.

For both Model 2 and Model 3, the production recommendation is the same: hybrid pipeline. Fine-tuned model first, regex for highly-structured patterns, frontier API as the fallback for the long tail. Most messages are handled by the fine-tune at near-zero cost and low latency; the frontier API runs on only a small fraction of traffic.

What each model cost to build, total

Model	OpenAI (data gen)	RunPod (train)	Benchmark APIs	Total
Intent classifier	$1.20	$1.20	$1.78	$4.18
PHI detector	$1.40	$1.50	$5.20	$8.10
Insurance extractor	$1.50	$0.80	$1.10 (no Anthropic)	$3.40
Suite total	$4.10	$3.50	$8.08	~$15.70

Three published models with benchmark-grade write-ups for under sixteen dollars. The Anthropic credit gap in the insurance extractor benchmark is the only thing that prevents a clean head-to-head across all three, and that's just a "buy more credit" problem.

Hugging Face hosting: $0. Total infrastructure cost beyond the line items above: $0.

Why I built this

To show why small language models (SLMs) matter. The pivot story has three components: (1) I can train transformers from scratch on consumer hardware (the ORCH series), (2) I can fine-tune larger base models with QLoRA (ORCH-7B), and (3) I can ship benchmark-grade specialized SLMs with rigorous, transparent evaluation against frontier APIs.

The ClarioScope SLM Suite is the third leg. Three months of work, three published models, three dev.to write-ups, full transparency on synthetic-data limitations and where frontier still beats us. If you're hiring for AI engineering roles where the candidate needs to understand both training-from-scratch AND production benchmarking AND honest model-limitations communication, my LinkedIn is in my GitHub profile.

What's next

A v1.1 of the insurance extractor with Anthropic benchmarks once credit is restored, and a v2 of all three models trained on real (de-identified) patient text from a partner practice — which moves the project into HIPAA-eligible infrastructure (AWS SageMaker / Azure ML with a BAA) and out of the "synthetic-data v1" phase.

If you've shipped a small specialized model that has the inverse story — beats frontier on aggregate but loses on a specific axis — I'd love to hear about it. The interesting trade-offs in 2026 aren't "should I use a frontier API" but "what's the right hybrid architecture for this task." This three-model suite was the project that taught me that lesson, three times in a row.

Follow along on Hugging Face or GitHub.

Where small models beat frontier LLMs (and where they don't): a 125M PHI detector

Raihan — Tue, 12 May 2026 00:10:50 +0000

Last month I published a 184M-parameter intent classifier that matches frontier LLMs at 22× lower latency. The story was clean: small specialized model, narrow task, comparable accuracy, much faster, almost free per inference. People liked it.

The second model in the ClarioScope SLM Suite tells a more complicated story. It's a PHI detector: a token classifier that tags spans of protected health information in inbound patient text across all 18 HIPAA Safe Harbor identifier categories. On the macro-F1 headline number, it loses to Claude Sonnet 4.6: 0.63 vs 0.89. Against Claude Haiku 4.5: 0.63 vs 0.85. Against GPT-4o: 0.63 vs 0.81.

So the click-through headline isn't "matches frontier." It's: on aggregate, frontier wins. But the macro number hides what's actually happening, and the per-entity breakdown reveals something more interesting than either "small model wins" or "small model loses."

Model on Hugging Face: raihan-js/clarioscope-phi-deberta-v1.

The 125M-parameter fine-tune beats or matches every frontier model on linguistic entities — geographic locations, ages, person names, dates, phone numbers, fax numbers, IP addresses. It loses badly on structured-ID entities — MRNs, license numbers, health-plan IDs, device serial numbers. The right production architecture is not one or the other. It's hybrid. This post is the methodology, the benchmark, and the honest interpretation.

The task: span detection across all 18 HIPAA Safe Harbor categories

Patient text comes in messy: "Hi Dr. Okafor, this is Iniko Adeleke, DOB 11/03/1985, MRN OMK-44291, phone 312.555.7820, my partner's email is jordan.holloway@workmail.io. We live in Brookline." A PHI detector has to locate every individually identifying span: two names, a date, an MRN, a phone, an email, a location. Seven spans, six different entity types, in a single 25-token message.

The HIPAA Safe Harbor rule defines 18 categories of PHI. The ClarioScope model tags all of them: NAME, LOC, DATE, PHONE, FAX, EMAIL, SSN, MRN, HEALTH_PLAN, ACCOUNT, LICENSE, VEHICLE, DEVICE, URL, IP, BIOMETRIC, PHOTO_REF, AGE_OVER_89. The architecture is standard: RoBERTa-base encoder, a token-classification head outputting BIO tags across 37 labels (one O plus 18 entity types times {B-, I-}).
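The 37-label count follows directly from the BIO scheme. A minimal sketch of building the label maps is below. The ordering here is an assumption; the shipped model's `config.id2label` is the authoritative mapping.

```python
ENTITY_TYPES = [
    "NAME", "LOC", "DATE", "PHONE", "FAX", "EMAIL", "SSN", "MRN",
    "HEALTH_PLAN", "ACCOUNT", "LICENSE", "VEHICLE", "DEVICE", "URL",
    "IP", "BIOMETRIC", "PHOTO_REF", "AGE_OVER_89",
]

# One "O" label plus a B- and an I- label for each of the 18 entity types.
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}
```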

A side note before going further: the repo is named clarioscope-phi-deberta-v1 because the original plan was DeBERTa-v3-base. During training, DeBERTa-v3 reproduced a NaN-gradient bug specific to this 37-label token classification setup — forward pass loss healthy, backward pass NaN on the first step, across fp16, bf16, and fp32, with explicit classifier head re-init and gradient clipping. After three afternoons of trying to keep DeBERTa alive, I switched to RoBERTa-base, which trained stably with the same training script. The repo name is kept for URL stability and the model card calls it out at the top.

Why not just call the API

The same three reasons as last time, with slightly different weights:

Privacy. PHI is the canonical "you'd want this to never leave your infrastructure" data class. A frontier API with a Business Associate Agreement is one option, but BAAs aren't free, aren't available at every tier, and add legal complexity. A self-hosted model never sends the patient's address or DOB to a third party.

Latency. The fine-tuned model runs in 28.6 ms on a CPU. Frontier API calls from my Bangladesh ISP run 1,000–2,000 ms. For redaction-before-routing where every inbound message has to be processed before it can be displayed in an inbox, that wall-clock floor matters.

Cost. Claude Sonnet 4.6 with the PHI-extraction prompt costs $2.53 per 1,000 inferences. Haiku is $1.00 per 1K. GPT-4o is $1.64 per 1K. The fine-tuned model is $0 per inference after training. For a practice receiving 10K messages per day, the math is the same as last time: $3,650–$9,234 per year on frontier vs roughly free on the local model.
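The annual figures come from simple arithmetic on the per-1K prices quoted above. A small sketch follows; the function name and dictionary keys are just illustrative.

```python
# Per-1,000-inference prices quoted in this post for the PHI-extraction prompt.
COST_PER_1K = {
    "claude-haiku-4-5": 1.00,
    "gpt-4o": 1.64,
    "claude-sonnet-4-6": 2.53,
}


def annual_cost(cost_per_1k, messages_per_day, days=365):
    """Yearly API spend if every inbound message goes to the frontier model."""
    return cost_per_1k * messages_per_day / 1000 * days


yearly = {
    name: round(annual_cost(price, 10_000), 2)
    for name, price in COST_PER_1K.items()
}
```

At 10,000 messages per day this gives about $3,650 per year for Haiku and about $9,234 for Sonnet, which matches the range above.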

The benchmark

The headline numbers, on a 548-example held-out test set, with entity-level F1 measured by seqeval (which requires both entity type AND exact span boundary to match for a true positive):

Model	Macro F1	Weighted F1	Latency / example	Cost / 1K inferences
`raihan-js/clarioscope-phi-deberta-v1` (CPU)	0.6301	0.7639	28.6 ms	$0.00
`claude-haiku-4-5-20251001`	0.8492	0.9213	1294 ms	$1.00
`claude-sonnet-4-6`	0.8946	0.9396	1980 ms	$2.53
`gpt-4o-2024-11-20`	0.8094	0.8912	1111 ms	$1.64
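The strict matching rule behind these numbers is worth seeing in miniature. The sketch below is a simplified pure-Python re-implementation for illustration, not seqeval's own code. It computes a micro-averaged F1 rather than the macro and weighted averages in the table, and it ignores stray `I-` tags that have no preceding `B-`.

```python
def bio_to_spans(tags):
    """Convert one BIO tag sequence into a set of (type, start, end) spans."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel flushes the last span
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


def strict_f1(gold_seqs, pred_seqs):
    """Micro F1 where a hit needs the same type AND the same boundaries."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-MRN", "I-MRN", "O", "B-NAME"]]
pred = [["O", "B-MRN", "O", "B-NAME"]]  # MRN boundary is off by one token
score = strict_f1(gold, pred)
```

The off-by-one MRN prediction earns no partial credit: it counts as both a false positive and a false negative, so only the NAME span scores.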

If you stop reading here, the takeaway is: "frontier wins, small model is 39–69× faster but trails by 18–26 points of macro F1." That's true and it's the honest aggregate. But it's not the interesting part.

The interesting part: per-entity F1

Same benchmark, broken out by entity type, sorted by the fine-tuned model's F1 (best to worst):

Linguistic entities — small model matches or beats frontier:

Entity	This model	Haiku	Sonnet	GPT-4o
`PHONE`	0.983	1.000	0.994	1.000
`AGE_OVER_89`	0.976	0.967	0.967	0.836
`NAME`	0.961	0.996	0.994	0.980
`IP`	0.949	1.000	1.000	0.967
`FAX`	0.949	1.000	0.984	1.000
`DATE`	0.945	0.949	0.970	0.909
`LOC`	0.818	0.328	0.289	0.301

LOC is the standout. The fine-tuned model nearly triples the frontier APIs' F1 on geographic locations. Frontier models systematically under-flag informal location mentions like "she lives in Allston" or "at the Roxbury location" — their pretraining seems to have left them uncertain about whether informal context cues count as PHI. A specialized model trained explicitly to tag these does not hesitate.

AGE_OVER_89 is another quiet win. Frontier models occasionally tag ages 89-and-under as PHI (they aren't, under Safe Harbor) or miss the "over 89" qualifier ("she's 96") that determines whether the age is reportable. The fine-tuned model learned the rule directly from the training distribution.

For names, dates, phone numbers, fax numbers, and IPs, the gap between this model and frontier is 1–5 percentage points. Within margin-of-noise for production use.

Structured-ID entities — frontier wins, often dominantly:

Entity	This model	Haiku	Sonnet	GPT-4o
`EMAIL`	0.815	1.000	1.000	1.000
`ACCOUNT`	0.759	0.985	0.969	1.000
`URL`	0.738	0.967	0.967	0.931
`VEHICLE`	0.640	1.000	1.000	0.970
`SSN`	0.583	0.983	0.949	0.915
`DEVICE`	0.341	0.732	1.000	0.800
`MRN`	0.276	1.000	1.000	0.997
`HEALTH_PLAN`	0.264	0.855	0.983	0.717
`LICENSE`	0.170	1.000	1.000	0.933
`BIOMETRIC`	0.095	0.410	1.000	0.314

Frontier wins these by enormous margins. The fine-tuned model scores 0.28 F1 on MRN. Haiku and Sonnet score 1.00.

The reason is straightforward once you stare at the data. Structured-ID entities follow surface conventions that vary wildly between institutions and generators. An MRN might look like OMK-44291, RMR-882034, DENT-12345-A, or just 8472301. The training set generates one distribution of ID formats; the test set was generated by a different model and uses a different distribution. The fine-tuned model can only recognize what it saw during training, and when the test-set MRN doesn't match the training conventions, the model either misses it entirely or produces a span boundary that's off by one token (which under seqeval's strict matching is a miss).

Frontier models win these categories because they've seen a much wider distribution of ID formats during pretraining, and because their attention mechanism is strong enough to anchor an ID span to its context cue — "MRN" or "member ID" or "license #" — regardless of the specific token pattern that follows it.

This is a real limitation. It's also a reasonable one to live with, because there's a much cheaper way to catch structured-ID PHI than running a frontier API on every message.

The bug that wasn't (and the bug that was)

A first version of the model trained on the raw generated annotations and scored 0.57 macro F1 on test — worse than what's shipping now. The expected explanation was distribution shift between train and test. The actual explanation was simpler and more embarrassing.

The training data had systematic label noise: the data-generation LLM was returning entity texts that included the cue word that introduced the entity. The annotation for "MRN 8472301" came back as {"text": "MRN 8472301", "label": "MRN"} instead of {"text": "8472301"}. The literal word "SSN" was annotated as an SSN entity in six different training examples. About 8.6% of all training spans (1,676 of them) had this kind of cue-word contamination. The Claude-generated test set, with a stricter prompt, had two such cases out of 1,632 spans.

So the model wasn't being graded against the same distribution it learned. It learned "MRN" was part of an MRN span; the test set told it it wasn't.

A clean_data.py script in the repo strips known cue-word prefixes ("MRN ", "SSN ", "phone ", "account #") from entity texts, re-locates the cleaned text in the inquiry, and drops entities that no longer have a valid span. Importantly, it preserves natural prefix characters like the opening ( in (617) 555-1234 — a first version of the cleanup stripped parens too aggressively and tanked PHONE F1 from 0.94 to 0.51 in one run. The fix was to apply punctuation stripping only after a cue word had been detected and removed, not as a generic "trim leading punctuation" pass.
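The cleanup logic described above can be sketched as follows. This is not the repo's clean_data.py: the cue list, the trimmed punctuation set, and the function name `clean_entity` are assumptions chosen to mirror the behavior in the text.

```python
# Assumed cue words; the real script's list may differ.
CUE_PREFIXES = ("MRN", "SSN", "phone", "account #", "member ID", "copay")


def clean_entity(inquiry, entity):
    """Strip a leading cue word, then re-locate the cleaned text in the inquiry.

    Returns a new entity dict with character offsets, or None when nothing
    is left or the cleaned text can no longer be found.
    """
    text = entity["text"]
    for cue in CUE_PREFIXES:
        if not text.lower().startswith(cue.lower()):
            continue
        rest = text[len(cue):]
        if cue[-1].isalnum() and rest[:1].isalnum():
            continue  # "MRNX12" does not start with the cue word "MRN"
        # Punctuation is trimmed only after a cue word was removed, so
        # "(617) 555-1234" keeps its opening parenthesis.
        text = rest.lstrip(" :#-")
        break
    if not text:
        return None  # e.g. the literal word "SSN" labeled as an SSN
    start = inquiry.find(text)
    if start == -1:
        return None
    return {"text": text, "label": entity["label"],
            "start": start, "end": start + len(text)}


inquiry = "My MRN 8472301, SSN on file, call (617) 555-1234."
mrn = clean_entity(inquiry, {"text": "MRN 8472301", "label": "MRN"})
ssn = clean_entity(inquiry, {"text": "SSN", "label": "SSN"})
phone = clean_entity(inquiry, {"text": "(617) 555-1234", "label": "PHONE"})
```

Here `mrn` keeps only the digits, `ssn` is dropped because nothing remains after the cue word, and `phone` is returned unchanged with its parenthesis intact.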

The cleanup recovered about 4 percentage points of test macro F1 and (more interestingly) flipped two entity types from "loses badly" to "competitive": EMAIL went from 0.55 to 0.82, ACCOUNT from 0.21 to 0.76.

The deeper lesson is one that's well-known but easy to forget: synthetic data is fast and cheap and the cost shows up as systematic noise that nobody else will catch for you. The annotations themselves need a QA pass before training. Real-world data has its own noise problems, but it tends not to label cue words as entities, because human annotators don't.

Preventing benchmark leakage

Same trick as model 1: training set was generated by gpt-4o-mini-2024-07-18; the held-out test set was generated by Claude with a deliberately different prompt style. This cross-generator split mitigates the failure mode where a fine-tuned model just learns one generator's style and the benchmark inflates.

A side effect on this model specifically: the Claude-generated test set uses tighter, more uniform structured-ID formats than the training set. That's part of why the test F1 on structured-ID entities is harsher than the val F1 (val: 0.86 macro; test: 0.63 macro). It's also fair, because real-world MRN formats are at least as varied as the gap between the two generators — and probably more so.

The recommended production architecture: hybrid

Given the per-entity breakdown, the right architecture is not "use this model alone" or "use a frontier API alone." It's a three-stage pipeline:

Run this model first on every inbound message. ~30 ms on a CPU, $0, never sends text off-host. Captures NAME, LOC, DATE, PHONE, FAX, IP, AGE_OVER_89 reliably — that's most of the volume.
Add regex matchers for highly structured patterns the model misses: SSN (\d{3}-\d{2}-\d{4}), credit-card numbers, basic MRN/account patterns specific to your practice's conventions. Regex is fast, free, and brittle — but correct when it matches.
Fall back to a frontier API only when the message contains likely structured-ID content the local pipeline didn't resolve, or when downstream confidence is needed. This pays the latency and dollar cost only on a small fraction of traffic.
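The three stages above can be sketched as a small router. Everything here is a hedged illustration: `detect_phi`, the `ID_CUES` table, and the two stub callables are hypothetical stand-ins for a real local model and a real API client, and the escalation rule (a cue word appears but no span of that type was found) is one possible heuristic.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # pattern quoted in the text

# Assumed cue words and the entity type each one announces.
ID_CUES = {"mrn": "MRN", "license": "LICENSE", "member id": "HEALTH_PLAN"}


def detect_phi(text, local_model, frontier_api):
    """Stage 1 local model, stage 2 regex, stage 3 frontier fallback.

    local_model and frontier_api are callables that return a list of
    {"text", "label"} dicts. Returns (spans, used_fallback).
    """
    spans = list(local_model(text))                       # stage 1
    found = {s["text"] for s in spans}
    for match in SSN_RE.finditer(text):                   # stage 2
        if match.group() not in found:
            spans.append({"text": match.group(), "label": "SSN"})
    labels = {s["label"] for s in spans}
    lowered = text.lower()
    unresolved = any(
        cue in lowered and label not in labels for cue, label in ID_CUES.items()
    )
    if unresolved:                                        # stage 3
        return spans + list(frontier_api(text)), True
    return spans, False


def fake_local_model(text):  # stub: pretend the model only finds locations
    return [{"text": "Brookline", "label": "LOC"}] if "Brookline" in text else []


def fake_frontier_api(text):  # stub: stands in for a hosted API call
    return [{"text": "OMK-44291", "label": "MRN"}]


spans_a, escalated_a = detect_phi(
    "I live in Brookline, SSN 123-45-6789.", fake_local_model, fake_frontier_api
)
spans_b, escalated_b = detect_phi(
    "My MRN OMK-44291 is on file.", fake_local_model, fake_frontier_api
)
```

The first message is resolved entirely on-host by the local model and the regex. Only the second, which mentions an MRN that neither local stage tagged, pays for the fallback call.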

For a practice receiving 10,000 messages per day where roughly 10% have unresolved structured-ID content after stages 1 and 2, the hybrid sends 1,000 calls per day to Haiku instead of 10,000. At the $1.00 per 1K inferences measured above, that is about $1.00/day instead of $10.00/day. Most messages never leave the host. Latency is bounded by the local model. The frontier model becomes a "structured-ID specialist" rather than a "PHI redaction generalist."

This is the actual cost-effective answer in 2026. The small model is not a frontier replacement; it's a frontier accelerator.

The cost ledger

Item	Cost
9,500 synthetic training examples via OpenAI (`gpt-4o-mini`)	~$1.40
RunPod RTX A4000 pod (two training runs, ~25 min total)	~$1.50
Benchmark API calls (Haiku + Sonnet + GPT-4o on 548 examples × 2 runs)	~$5.20
Hugging Face hosting	$0
Total	~$8.10

A bit more expensive than model 1 because the benchmark ran twice — once before the data cleanup, once after. The cleanup story was a four-percentage-point F1 improvement, which seems small but matters for the "where does this model lose" interpretation of the per-entity numbers.

How to use it

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "raihan-js/clarioscope-phi-deberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

text = "Hi Dr. Okafor, this is Iniko Adeleke, DOB 11/03/1985. phone 312.555.7820, email jordan@workmail.io. I live in Brookline."

enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True, max_length=256)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

id2label = model.config.id2label
spans = []
i = 0
while i < len(pred_ids):
    label = id2label[pred_ids[i]]
    if label.startswith("B-"):
        ent_type = label[2:]
        start = offsets[i][0]
        end = offsets[i][1]
        j = i + 1
        while j < len(pred_ids) and id2label[pred_ids[j]] == f"I-{ent_type}":
            end = offsets[j][1]
            j += 1
        spans.append({"text": text[start:end], "label": ent_type})
        i = j
    else:
        i += 1

for s in spans:
    print(s)
# {'text': 'Dr. Okafor', 'label': 'NAME'}
# {'text': 'Iniko Adeleke', 'label': 'NAME'}
# {'text': '11/03/1985', 'label': 'DATE'}
# {'text': '312.555.7820', 'label': 'PHONE'}
# {'text': 'jordan@workmail.io', 'label': 'EMAIL'}
# {'text': 'Brookline', 'label': 'LOC'}

Limitations

The model card has the full list. The ones worth surfacing here:

All training and evaluation data is synthetic. No real production validation yet. A real-world calibration pass is required before deployment.
Structured-ID entities are weak. Per the benchmark, this model is materially worse than frontier APIs on MRN, LICENSE, HEALTH_PLAN, BIOMETRIC, and several others. Pair with regex and/or a frontier fallback.
Not a HIPAA compliance verdict. This model tags entity types as defined in the Safe Harbor rule. HIPAA compliance is a regulatory determination that a model can't make on its own.
English only, healthcare practice domain only.

What's next

This is model 2 of three. The third is clarioscope-insurance-v1 — structured JSON extraction of insurance- and billing-relevant fields from inbound text. Probably a small encoder-decoder with constrained decoding. When all three are published, they'll go up as a Hugging Face collection with a single longer post tying the suite together.

The honest takeaway from this one: small specialized models don't always beat frontier on aggregate, but the per-entity breakdown is where the actual production decision lives. Frontier-or-nothing is the wrong frame. Frontier-as-fallback-to-a-cheap-local-model is the right one.

Follow along on Hugging Face or GitHub.

Matching frontier LLMs at 22× lower latency: a 184M-parameter intent classifier for healthcare text

Raihan — Mon, 11 May 2026 17:43:38 +0000

Healthcare practices drown in inbound patient text. Email, contact forms, live chat, SMS, voicemail transcripts — every channel sends messages that need to be routed: to scheduling, to billing, to clinical, to the front desk. It's a high-volume, deterministic, latency-sensitive task.

The obvious answer in 2026 is to throw a frontier LLM at it. Claude Haiku 4.5 will give you 95% accuracy on this kind of classification. GPT-4o will too. But every call costs real money, adds about a second of network round-trip, and sends patient text to a third party that doesn't have a BAA with you.

I built a small alternative, a 184M-parameter DeBERTa-v3-base fine-tune, and benchmarked it against Claude Haiku 4.5, Claude Sonnet 4.6, and GPT-4o on a 1,154-example test set. The fine-tuned model lands about 4 percentage points behind the best frontier model on accuracy (91.16% versus 95.32%), runs 22× faster on a CPU, and costs effectively $0 per inference after training. Total cost to build it: under $3 for data generation and training, or $4.18 including the benchmark API calls.

Model on Hugging Face: raihan-js/clarioscope-intent-deberta-v1.

This is model 1 of three I'm building for the ClarioScope SLM Suite — a healthcare intake intelligence pipeline. The other two are a PHI detector and an insurance extractor; they're in development. This post is the methodology and the benchmark for the first one.

The task

Seven intent labels, designed for production routing at a healthcare practice:

Label	What it captures
`new_patient_inquiry`	A prospective patient asking about becoming a patient
`existing_patient_question`	An existing patient with a non-urgent question
`appointment_request`	Scheduling, rescheduling, or cancellation
`billing_inquiry`	Questions about bills or pricing of services already received
`clinical_concern`	An active medical concern requiring clinical attention
`complaint`	Dissatisfaction with service, staff, communication, or outcome
`price_shopper`	Pricing-only inquiry, no commitment signals

These categories are opinionated and they have real ambiguity at the edges. A new patient asking for their first appointment is both new-patient and appointment-request. A frustrated patient describing a medical concern is both clinical and complaint. The data-generation prompt encodes explicit disambiguation rules (complaint dominates when both signals are present; pre-commitment pricing questions are price_shopper even if they mention insurance), but the boundary cases are where every model — fine-tuned or frontier — gives up F1 points.

Why not just use the API

Three reasons:

Latency. Frontier API calls from my Bangladesh ISP run 1,000–1,600 ms. For routing, that's the difference between an inbox that updates instantly and one that lags noticeably. The fine-tuned model on a CPU runs in 48 ms. On a GPU it would be another 5–10× faster. Either way, the wall-clock floor for a hosted API call is in the hundreds of milliseconds even before the model processes anything.

Cost. Claude Sonnet 4.6 costs $0.76 per 1,000 inferences on this task. Haiku is $0.25 per 1K. GPT-4o is $0.53 per 1K. For a single practice receiving 10,000 inbound messages per day across all channels (not unrealistic for a multi-location dental or dermatology group), that's $912 to $2,774 per practice per year — a hard line item on the SaaS economics. The fine-tuned model has a one-time training cost and approximately zero marginal per-inference cost.

Privacy. Frontier APIs are great, and they're also a third-party data path. For protected health information you'd want a BAA, and not every API provider offers one at every tier. A self-hosted classifier never sends patient text anywhere.
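The yearly range quoted in the cost paragraph above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the rounded per-1K prices from this post; the function name and the constant-volume assumption (10,000 messages every day of the year) are mine.

```python
# Per-1K-inference prices as quoted in this post (rounded to cents).
COST_PER_1K_USD = {
    "claude-haiku-4.5": 0.25,
    "gpt-4o": 0.53,
    "claude-sonnet-4.6": 0.76,
}

def annual_api_cost(cost_per_1k: float, messages_per_day: int = 10_000, days: int = 365) -> float:
    """Yearly API spend in USD, assuming a constant daily message volume."""
    return cost_per_1k * messages_per_day * days / 1_000

for name, price in COST_PER_1K_USD.items():
    print(f"{name}: ${annual_api_cost(price):,.2f} per year")
```

At 3.65 million calls a year, Haiku comes out at $912.50 and Sonnet at $2,774.00, which is where the range in the text comes from.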

The accuracy gap versus frontier is real but small enough that for production routing, the speed/cost/privacy wins dominate.

The model

Standard DeBERTa-v3-base with a sequence classification head: a single linear layer over the pooled [CLS] representation producing 7 logits. All 184M parameters fine-tuned. No LoRA — at this dataset size, full fine-tuning beats parameter-efficient methods without much overhead. Training was 5 epochs of 8,099 examples on a single RTX 4090 (rented on RunPod), batch size 32, max sequence length 256 tokens, learning rate 2e-5 with cosine schedule and 10% warmup, fp16 mixed precision. Total training wall time: about five minutes.
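To make the schedule concrete, here is a rough sketch of what those hyperparameters imply in optimizer steps. It assumes a single GPU, no gradient accumulation, a kept final partial batch, linear warmup, and cosine decay to zero. Those details are assumptions, not something stated above, and the function name is hypothetical.

```python
import math

TRAIN_EXAMPLES = 8_099    # examples per epoch, from the post
BATCH_SIZE = 32
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_FRACTION = 0.10

# Assumes the last partial batch is kept and there is no gradient accumulation.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(WARMUP_FRACTION * total_steps)

def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run is 254 steps per epoch and 1,270 steps in total, with the learning rate peaking at step 127.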

The training data — synthetic and transparent about it

This is the most important section of the post for anyone considering similar work. All training and test data is synthetic. There is no real patient data anywhere in the pipeline. This is a deliberate choice — using synthetic data for v1 sidesteps HIPAA constraints entirely and lets the model ship fast. A v2 trained on real PHI would need HIPAA-eligible training infrastructure (AWS SageMaker or Azure ML with a BAA), and that's a separate, more careful project.

But "synthetic" is doing a lot of work in that sentence. The naïve approach — ask an LLM for 1,000 example patient inquiries per intent — produces what I'll call ChatGPT-polite text: every message opens with "Hi!", ends with "Thanks!", uses correct grammar and punctuation, and reads nothing like a real SMS message that an actual frustrated parent sends at 2 AM.

A model trained on ChatGPT-polite text will overfit to the politeness markers and degrade badly on real production text. So the generation prompt forces a mandatory realism mix per batch:

~40% polished (full sentences, correct grammar, proper punctuation, formal or neutral)
~40% casual (lowercase starts, contractions, fragments, missing terminal punctuation, conversational)
~20% messy (typos, autocorrect mistakes, abbreviations like u/appt/tmrw, ALL CAPS for urgency, run-on phrasing)

Plus channel-conditional scaling: SMS is the messiest, voicemail transcripts second messiest, email and web forms more polished. The prompt also includes about 20 lines of style anchors — concrete patterns the LLM should reproduce. Stuff like:

Abbreviations: u/ur, appt, tmrw, yr, pls, thx, rx, ins (insurance)

Fragment phrases: "billing question call me back", "need to reschedule thursday", "kid has fever 102", "still no answer about my x-ray"

Run-on voicemail: "uh hi yeah this is calling about that thing you mentioned last week i think it was a follow up or something can you call me back"

Conversational starts (no greeting): "Quick question —", "So I got this bill...", "Need to cancel —"
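One way to turn the realism mix and the channel scaling into something a generation script can enforce is to assign a target style to each message before building the prompt. This is only a sketch: the base shares come from the list above, but the per-channel multipliers, the channel keys, and the function names are hypothetical. The post gives only the ordering (SMS messiest, voicemail next, email and web forms cleaner).

```python
import random

# Base mix from the post: ~40% polished, ~40% casual, ~20% messy.
BASE_MIX = {"polished": 0.40, "casual": 0.40, "messy": 0.20}

# HYPOTHETICAL multipliers on the "messy" share per channel; only the ordering
# (sms > voicemail > email / web_form) is taken from the post.
MESSY_SCALE = {"sms": 1.5, "voicemail": 1.25, "chat": 1.0, "email": 0.5, "web_form": 0.5}

def style_mix(channel: str) -> dict[str, float]:
    """Scale the messy share for a channel, then renormalize so shares sum to 1."""
    weights = dict(BASE_MIX)
    weights["messy"] *= MESSY_SCALE[channel]
    total = sum(weights.values())
    return {style: weight / total for style, weight in weights.items()}

def assign_styles(channel: str, batch_size: int, seed: int = 0) -> list[str]:
    """Pick one target style per message in a batch, to be written into the prompt."""
    mix = style_mix(channel)
    rng = random.Random(seed)
    return rng.choices(list(mix), weights=list(mix.values()), k=batch_size)

print(style_mix("sms"))
print(assign_styles("sms", batch_size=10))
```

Sampling styles in code rather than asking the LLM to "mix it up" makes the distribution auditable, since the requested style can be stored next to each generated message.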

I did two dry runs before the full 9,000-example generation. The first, without the realism mix, produced very polite, very clean output: 82% of messages opened with "Hi!", 0% had ALL CAPS, and almost nothing was a fragment. The second, with the mix, landed at 18% greetingless openers, 22% abbreviations, 21% no terminal punctuation, and 6% ALL CAPS urgency. The shape of the distribution actually moved when the prompt told it to move.
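Dry-run statistics like these can be measured with a handful of regular expressions. The sketch below is a heuristic approximation, not the script used for the numbers above: the greeting list, the "three or more capital letters" rule, and the function name are assumptions.

```python
import re

# Heuristic patterns; the abbreviation list is the one from the style anchors.
GREETING = re.compile(r"^(hi|hello|hey|dear|good (morning|afternoon|evening))\b", re.IGNORECASE)
ABBREVIATION = re.compile(r"\b(u|ur|appt|tmrw|yr|pls|thx|rx|ins)\b", re.IGNORECASE)
ALL_CAPS_WORD = re.compile(r"\b[A-Z]{3,}\b")

def style_stats(messages: list[str]) -> dict[str, float]:
    """Return the fraction of messages showing each surface-style marker."""
    if not messages:
        raise ValueError("need at least one message")
    counts = {"greetingless_opener": 0, "abbreviation": 0,
              "no_terminal_punctuation": 0, "all_caps": 0}
    for raw in messages:
        text = raw.strip()
        counts["greetingless_opener"] += GREETING.match(text) is None
        counts["abbreviation"] += ABBREVIATION.search(text) is not None
        counts["no_terminal_punctuation"] += not text.endswith((".", "!", "?"))
        counts["all_caps"] += ALL_CAPS_WORD.search(text) is not None
    return {name: count / len(messages) for name, count in counts.items()}

sample = [
    "Hi! I would like to book an appointment.",
    "need to reschedule thursday",
    "kid has fever 102 pls call NOW",
    "So I got this bill...",
]
print(style_stats(sample))
```

Running this on each dry-run batch gives a quick before-and-after comparison without reading hundreds of messages by hand.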

Costs: the 9,000 training examples cost about $1.20 of OpenAI credit (via gpt-4o-mini-2024-07-18, JSON-object response format, temperature 1.0, 8-worker parallel generation).

Preventing benchmark leakage

The naive failure mode here is generating both train and test with the same model. The fine-tuned model would learn the generator's style, and the benchmark would inflate.

So train and test come from different generators:

Train (9,000 examples) — generated by gpt-4o-mini-2024-07-18 with the prompt above.
Test (1,154 examples) — generated by Claude with a deliberately different prompt style and a different abbreviation set (w/, &, hrs, BTW, IDK, plz versus the train prompt's u, tmrw, appt). The test set leans into more medically specific content (real conditions, real procedure names) and longer rambling messages.

A side effect of this split: when I benchmark against Claude Haiku 4.5 and Claude Sonnet 4.6 below, those models are from the same family as the test-set generator. If anything, they should get a small style-familiarity advantage. Read the benchmark numbers below with that caveat in mind. (Spoiler: they don't visibly benefit.)
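Using different generators addresses style leakage. A complementary check, not described in this post and offered only as a suggestion, is to confirm that no test message also appears in the training set after light normalization. The function names below are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())

def cross_split_duplicates(train: list[str], test: list[str]) -> list[str]:
    """Return test messages whose normalized form also occurs in train."""
    train_keys = {normalize(message) for message in train}
    return [message for message in test if normalize(message) in train_keys]

train = ["Need to reschedule Thursday!", "billing question call me back"]
test = ["need to reschedule thursday", "IDK if my ins covers this"]
print(cross_split_duplicates(train, test))
```

Exact-match checks like this only catch verbatim overlap. They say nothing about shared phrasing habits, which is what the two-generator split is for.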

The benchmark

Evaluated on 1,154 held-out test examples:

Model	Accuracy	Macro F1	Latency / example	Cost / 1K inferences
`raihan-js/clarioscope-intent-deberta-v1` (CPU)	91.16%	91.07%	48.5 ms	$0.00
`claude-haiku-4-5-20251001`	95.32%	95.28%	1064 ms	$0.252
`claude-sonnet-4-6`	93.59%	93.53%	1566 ms	$0.759
`gpt-4o-2024-11-20`	95.23%	95.17%	1036 ms	$0.527

Latency is wall-clock single-example latency through each provider's chat completions API, measured from a Bangladesh ISP. The fine-tuned model number is on a CPU (no GPU acceleration). Cost is the actual API spend per 1,000 calls based on token counts from the run.
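For readers who want to recompute the two quality columns from raw predictions, here is a dependency-free sketch of accuracy and macro F1. Macro F1 is the unweighted mean of per-class F1, which I believe matches `f1_score(average="macro")` in scikit-learn. The function names and the toy labels are illustrative, not taken from the benchmark code.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of examples where the predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

gold = ["complaint", "complaint", "billing_inquiry", "billing_inquiry"]
pred = ["complaint", "billing_inquiry", "billing_inquiry", "billing_inquiry"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Because macro F1 weights every intent equally, a weak class like `new_patient_inquiry` pulls it down more than it pulls down accuracy.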

Three things in this table are interesting

Sonnet 4.6 is worse than Haiku 4.5. A bigger, slower, more expensive frontier model produces lower accuracy on this task. This isn't an artifact of one run — I've seen it consistently. My take: for narrow, well-structured classification with short prompts, more reasoning capacity sometimes second-guesses the correct intuition. The first thought is often right, and a smaller model that doesn't have the option to deliberate just commits to it. The right tool for this kind of job is small and specific.

The latency advantage is on CPU. The 48 ms figure uses no GPU acceleration; a modest GPU would drop it to ~5–10 ms. The frontier API numbers are network-bound: the model itself processes the request in tens of milliseconds, but the wall-clock floor for a hosted API call from a non-US-East ISP is in the hundreds of milliseconds before the model has even started. Adding a GPU at the API side does nothing for that floor.

Cost gap doesn't shrink at scale. API cost scales linearly with call volume. The fine-tuned model has a one-time training cost (about $2.40 of OpenAI plus RunPod compute together) and approximately zero marginal cost. For 10K daily inferences over a year, the dollar swing is between zero and roughly $2,800.
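The latency column is single-example wall-clock time, which is easy to measure the same way for a local model and for an API client. Below is a minimal timing harness under my own assumptions: `classify` is any hypothetical callable that takes a string and returns a label, and the harness reports mean, median, and max because the post does not say which statistic the table uses.

```python
import statistics
import time
from typing import Callable

def time_per_example(classify: Callable[[str], str], texts: list[str], warmup: int = 3) -> dict[str, float]:
    """Time classify() one example at a time and summarize latency in milliseconds."""
    for text in texts[:warmup]:
        classify(text)  # untimed warm-up calls (lazy init, connection setup)
    samples_ms = []
    for text in texts:
        start = time.perf_counter()
        classify(text)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }

# Stand-in classifier so the sketch runs without a model or network access.
result = time_per_example(lambda text: "complaint", ["a", "b", "c", "d"])
print(result)
```

With the table's numbers, 1064 ms divided by 48.5 ms is about 21.9, which is the 22× figure in the title.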

Per-class F1 and where the errors live

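The model's per-class F1 on the validation set, ranked best to worst. These are validation numbers, not test numbers, so they average to about 0.886 rather than the 91.07% macro F1 in the benchmark table: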

Intent	F1
`price_shopper`	0.957
`complaint`	0.929
`billing_inquiry`	0.908
`appointment_request`	0.881
`clinical_concern`	0.874
`existing_patient_question`	0.834
`new_patient_inquiry`	0.819

The hardest pairs to disambiguate are exactly the pairs you'd expect to be hard:

new_patient_inquiry ↔ appointment_request — a new patient asking to schedule their first visit fits both labels. The data-gen prompt resolves toward new_patient_inquiry for messages that lead with the becoming-a-patient signal, but the model lands on appointment_request more often than the label intends.
existing_patient_question ↔ clinical_concern — medical questions from established patients read as low-grade concerns to the model, because at the lexical level they are.
clinical_concern ↔ complaint — frustrated medical concerns combine both signals; the prompt's tie-breaker says complaint dominates, but the model occasionally goes the other way.

These same pairs gave Claude Haiku 4.5 trouble too when I ran the benchmark by hand on a sample. They're real ambiguity in the task, not classifier weakness. Useful production move: have the model emit confidence (max softmax) alongside the label, and route low-confidence predictions to a human reviewer.
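The confidence-routing idea can be sketched in a few lines. This version works on a plain list of logits so it runs without a model. The 0.70 threshold is a hypothetical placeholder that would need tuning on real traffic, and the label order here is illustrative; in practice it should come from `model.config.id2label`.

```python
import math

# Illustrative label order only; read the real mapping from model.config.id2label.
LABELS = [
    "new_patient_inquiry", "existing_patient_question", "appointment_request",
    "billing_inquiry", "clinical_concern", "complaint", "price_shopper",
]
REVIEW_THRESHOLD = 0.70  # HYPOTHETICAL cutoff, to be calibrated on real messages

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one example's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits: list[float], threshold: float = REVIEW_THRESHOLD) -> tuple[str, float, bool]:
    """Return (label, confidence, needs_human_review) for one example."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return LABELS[best], confidence, confidence < threshold

print(route([0.1, 0.2, 0.0, 5.0, 0.3, 0.1, 0.2]))  # one dominant logit
print(route([1.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0]))  # two close candidates
```

Max softmax is a crude confidence signal, and fine-tuned classifiers are often overconfident, so the threshold is worth calibrating against human-labeled samples before trusting it.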

The cost ledger

Full breakdown of what it cost to ship this model:

Item	Cost
9,000 synthetic training examples via OpenAI (`gpt-4o-mini`)	$1.20
RunPod RTX 4090 pod (about 50 minutes including iteration)	$1.20
Benchmark API calls (Haiku + Sonnet + GPT-4o, 1,154 examples each)	$1.78
Hugging Face hosting	$0
Total	$4.18

That's it. End-to-end, from empty repo to published model + reproducible benchmark, for less than the price of lunch.
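The ledger mixes the cost of building the model with the cost of evaluating it, so it helps to total the two separately. A small sketch using the figures from the table; the grouping into "build" items is my own reading.

```python
from decimal import Decimal

# Figures copied from the cost ledger above.
LEDGER = {
    "synthetic training data (gpt-4o-mini)": Decimal("1.20"),
    "RunPod RTX 4090 pod": Decimal("1.20"),
    "benchmark API calls": Decimal("1.78"),
    "Hugging Face hosting": Decimal("0.00"),
}
# Assumption: only data generation and GPU time count as the cost to build.
BUILD_ITEMS = ("synthetic training data (gpt-4o-mini)", "RunPod RTX 4090 pod")

total = sum(LEDGER.values(), Decimal("0"))
build_only = sum((LEDGER[item] for item in BUILD_ITEMS), Decimal("0"))

print(f"build only: ${build_only}, including benchmark: ${total}")
```

Build-only spend is $2.40 and the full total is $4.18, the difference being the one-off benchmark run against the three frontier models.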

How to use it

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "raihan-js/clarioscope-intent-deberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

texts = [
    "Hi, I'm new to the area and looking for a dermatologist. Are you accepting new patients?",
    "got a bill for $382 for my visit on 4/12 but my copay should only be $35 — what's the rest?",
    "my kid's fever is 103.2 and not coming down with tylenol. need advice now",
]

inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]
print(labels)
# ['new_patient_inquiry', 'billing_inquiry', 'clinical_concern']

Limitations

I've put a full Limitations section in the model card, but the highlights:

All training and test data is synthetic. No real production validation yet. A real-world calibration pass is a prerequisite for production deployment.
English only.
Healthcare practice domain only. Routes messages within a practice — does not generalize to other industries.
Seven categories, not exhaustive. Messages that don't fit get the closest available label rather than an "unknown" bucket.
No PHI redaction is performed by this model. PHI detection is a separate model in the suite (in development), and HIPAA compliance is a regulatory determination that no model can make.

What's next

This is model 1 of three. The other two:

clarioscope-phi-deberta-v1 — a token-classification model (BIO tagging) for detecting PHI spans in patient text. Same DeBERTa base, different head, different training data (synthetic PHI-annotated text). Goal: redact before routing.
clarioscope-insurance-v1 — structured JSON extraction of insurance- and billing-relevant fields from inbound text. Probably a small encoder-decoder or constrained-decoding setup.
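For readers unfamiliar with BIO tagging, here is a toy illustration of the label scheme on whitespace tokens. It is a simplified sketch: a real token-classification model would align tags to subword tokens, and the `NAME` label and the function name are hypothetical.

```python
import re

def bio_tags(text: str, spans: list[tuple[int, int, str]]) -> tuple[list[str], list[str]]:
    """Whitespace-tokenize text and tag each token B-/I-/O from character spans.

    spans holds (start, end, label) with end exclusive; spans must not overlap.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() < end and match.end() > start:  # token overlaps the span
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

text = "this is Jane Doe calling about my appt"
start = text.index("Jane Doe")
tokens, tags = bio_tags(text, [(start, start + len("Jane Doe"), "NAME")])
print(list(zip(tokens, tags)))
```

The first token of an entity gets `B-`, later tokens of the same entity get `I-`, and everything else is `O`, which is what lets a redaction step recover whole spans from per-token predictions.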

When all three are published, they'll go up as a Hugging Face collection and the master writeup will be a single longer post tying the suite together. Follow along on Hugging Face or GitHub.

If you've shipped a small specialized model that beats — sorry, matches — frontier APIs on a narrow task, I'd love to hear about it. The pattern works.