<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: J Schoemaker</title>
    <description>The latest articles on Forem by J Schoemaker (@jerown).</description>
    <link>https://forem.com/jerown</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811352%2Fb368cade-7c1b-4580-a769-8cebb040f2bb.png</url>
      <title>Forem: J Schoemaker</title>
      <link>https://forem.com/jerown</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jerown"/>
    <language>en</language>
    <item>
      <title>Anthropic Never Released Their Tokenizer. Here's What We Found Testing the Alternatives</title>
      <dc:creator>J Schoemaker</dc:creator>
      <pubDate>Thu, 19 Mar 2026 20:47:42 +0000</pubDate>
      <link>https://forem.com/jerown/anthropic-never-released-their-tokenizer-heres-what-we-found-testing-the-alternatives-b05</link>
      <guid>https://forem.com/jerown/anthropic-never-released-their-tokenizer-heres-what-we-found-testing-the-alternatives-b05</guid>
      <description>&lt;h1&gt;
  
  
  bpe-lite accuracy benchmark — report
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; 2026-03-19&lt;br&gt;
&lt;strong&gt;Model tested against:&lt;/strong&gt; &lt;code&gt;claude-haiku-4-5-20251001&lt;/code&gt; via Anthropic &lt;code&gt;count_tokens&lt;/code&gt; API&lt;br&gt;
&lt;strong&gt;Tokenizers compared:&lt;/strong&gt; bpe-lite (modified Xenova), ai-tokenizer (claude encoding), raw Xenova (unmodified)&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Background
&lt;/h2&gt;

&lt;p&gt;bpe-lite is a zero-dependency JS tokenizer supporting OpenAI (cl100k / o200k), Anthropic (Xenova/claude-tokenizer, 65k BPE), and Gemini (Gemma3 SPM). Anthropic has not released the Claude 4 tokenizer, so the Anthropic provider is a reverse-engineered approximation sourced from &lt;code&gt;Xenova/claude-tokenizer&lt;/code&gt; on HuggingFace, with hand-tuned modifications.&lt;/p&gt;

&lt;p&gt;This report documents the construction of a stratified accuracy benchmark and its results.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Benchmark corpus
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Design
&lt;/h3&gt;

&lt;p&gt;120 samples across 12 categories (10 per category):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;english-prose&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;sentences, paragraphs, mixed punctuation, dialogue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-python&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;functions, classes, decorators, f-strings, async&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;arrow functions, classes, JSX, TypeScript, async/await&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;numbers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;integers, floats, scientific notation, dates, IPs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hex-binary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0x/0b prefixes, color codes, hashes, UUIDs, hex dumps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;symbols&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;copyright/trademark, math operators, arrows, currency clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;arabic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;words, sentences, mixed Latin, technical text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cjk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Chinese, Japanese, Korean, mixed scripts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;emoji&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;isolated, in prose, clusters, skin tones, flags, ZWJ sequences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;structured&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;JSON, HTML, XML, Markdown, CSV, YAML, SQL, GraphQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;urls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;full URLs, query strings, email addresses, data URIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;short&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1–5 token inputs, single words, punctuation only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Files
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scripts/corpus.js&lt;/code&gt; — 120 sample definitions (category, name, text)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/fetch-corpus.js&lt;/code&gt; — fetches expected counts from the Anthropic API, writes &lt;code&gt;scripts/corpus-expected.json&lt;/code&gt; (committed; benchmark runs offline)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/accuracy.js&lt;/code&gt; — offline runner; reads corpus-expected.json, compares both tokenizers, outputs per-sample table and per-category summary with Wilson 95% CI&lt;/li&gt;
&lt;/ul&gt;
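&lt;p&gt;The Wilson interval in the summary takes only a few lines. The sketch below is the standard formula for a proportion of &lt;code&gt;k&lt;/code&gt; hits in &lt;code&gt;n&lt;/code&gt; samples; it is illustrative, not the exact &lt;code&gt;accuracy.js&lt;/code&gt; code:&lt;/p&gt;

```javascript
// Wilson score interval for a binomial proportion (z = 1.96 for 95%).
// Used to attach a confidence interval to "within-10%" hit rates on small n.
function wilsonInterval(k, n, z = 1.96) {
  if (n === 0) return { lo: 0, hi: 0 };
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt(p * (1 - p) / n + z2 / (4 * n * n));
  return { lo: Math.max(0, center - half), hi: Math.min(1, center + half) };
}
```

&lt;p&gt;For 9 of 10 samples within 10% this gives roughly [0.60, 0.98], which is why the per-category table in section 5 reports half-widths of ±19–26% at n = 10.&lt;/p&gt;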


&lt;h2&gt;
  
  
  3. Calibration and a discovered overhead artifact
&lt;/h2&gt;

&lt;p&gt;Expected counts are computed as &lt;code&gt;api_raw(text) - overhead&lt;/code&gt;, where &lt;code&gt;overhead = api("Hi") - countTokens("Hi") = 7&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;±1 overhead artifact&lt;/strong&gt; was discovered: the last structural token of the Anthropic message template BPE-merges with certain first characters of content, making the effective overhead 7 or 8 depending on the first character:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Hi"   raw=8  overhead=7   letter start  — our calibration anchor
"1"    raw=9  overhead=8   digit start
"©"    raw=9  overhead=8   2-byte UTF-8, C2xx range
"→"    raw=8  overhead=7   3-byte UTF-8, E2/86 range
"Hi1"  raw=9  net=2        Hi=1 + 1=1 — digit contributes 1 token in context ✓
"1Hi"  raw=10 net=3        boundary effect inflates count by 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The artifact only matters for &lt;code&gt;expected &amp;lt; 5&lt;/code&gt; tokens — at that scale ±1 is more than 20% relative error. For longer samples it is negligible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution:&lt;/strong&gt; 6 samples with &lt;code&gt;expected &amp;lt; 5&lt;/code&gt; are excluded from percentage error calculations and shown as &lt;code&gt;n/a&lt;/code&gt;. All other samples are unaffected.&lt;/p&gt;
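&lt;p&gt;A minimal sketch of the calibration and exclusion logic (names are illustrative, not the exact &lt;code&gt;fetch-corpus.js&lt;/code&gt; or &lt;code&gt;accuracy.js&lt;/code&gt; code):&lt;/p&gt;

```javascript
// Overhead measured once against a letter-initial anchor string ("Hi"):
// api("Hi") returns 8 raw tokens, countTokens("Hi") returns 1.
const OVERHEAD = 7;

// Expected content tokens = raw API count minus the message-template overhead.
function expectedTokens(apiRaw) {
  return apiRaw - OVERHEAD;
}

// Below 5 expected tokens a ±1 boundary artifact exceeds 20% relative error,
// so such samples are reported as n/a rather than scored.
function isEligible(expected) {
  return expected >= 5;
}
```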

&lt;p&gt;We also investigated and ruled out a "prefix neutralisation" approach (&lt;code&gt;api("Hi. " + text) - api("Hi. ")&lt;/code&gt;): while it eliminates the digit-boundary artifact, the trailing space in the prefix gets absorbed into the first chunk of text (the BPE regex treats it as a leading space), corrupting token counts for short-string samples by a different ±1. The overhead subtraction approach with exclusion of tiny samples is the most honest solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. What ai-tokenizer uses for Claude
&lt;/h2&gt;

&lt;p&gt;ai-tokenizer's &lt;code&gt;claude&lt;/code&gt; encoding is a &lt;strong&gt;different vocabulary&lt;/strong&gt; from Xenova/claude-tokenizer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;64,995 tokens total (64,241 string-keyed + 754 binary)&lt;/li&gt;
&lt;li&gt;Special tokens: &lt;code&gt;EOT&lt;/code&gt;, &lt;code&gt;META&lt;/code&gt;, &lt;code&gt;META_START&lt;/code&gt;, &lt;code&gt;META_END&lt;/code&gt;, &lt;code&gt;SOS&lt;/code&gt; — characteristic of an older Claude 1/2-era tokenizer&lt;/li&gt;
&lt;li&gt;Regex pattern uses &lt;code&gt;\p{N}+&lt;/code&gt; (greedy, unlimited digits) instead of &lt;code&gt;\p{N}{1,3}&lt;/code&gt; (1–3 digits)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;\p{N}+&lt;/code&gt; pattern is ai-tokenizer's primary weakness: it chunks multi-digit numbers as a single unit, whereas Claude uses 1–3 digit chunks. This causes severe errors on anything involving numbers (43% error on Fibonacci integers, 29% on arithmetic, 22% on hex).&lt;/p&gt;
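&lt;p&gt;The difference is easy to demonstrate with the two digit patterns in isolation. This is a simplified sketch: the real pre-tokenisation regexes also handle letters, whitespace and punctuation:&lt;/p&gt;

```javascript
// Claude-style pre-tokenisation: digit runs split into 1-3 digit chunks.
const chunked = (s) => s.match(/\p{N}{1,3}/gu);
// ai-tokenizer-style: an entire digit run is a single pre-token.
const greedy = (s) => s.match(/\p{N}+/gu);

chunked("1234567"); // ["123", "456", "7"]
greedy("1234567");  // ["1234567"]
```

&lt;p&gt;Each pre-token is BPE-merged independently, so the two patterns produce different merge boundaries, and therefore different counts, on digit-heavy input.&lt;/p&gt;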

&lt;p&gt;ai-tokenizer also does &lt;strong&gt;not have a Gemini encoding&lt;/strong&gt; — all Gemini models in their registry are mapped to &lt;code&gt;o200k_base&lt;/code&gt; (OpenAI's vocabulary) with a fudge multiplier of 1.08. This produces completely wrong results for Gemini.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Results — full 120-sample benchmark
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overall summary (114 eligible samples, 6 excluded as expected &amp;lt; 5)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;bpe-lite&lt;/th&gt;
&lt;th&gt;ai-tokenizer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exact&lt;/td&gt;
&lt;td&gt;11 (9.6%)&lt;/td&gt;
&lt;td&gt;9 (7.9%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Within 5%&lt;/td&gt;
&lt;td&gt;53 (46.5%)&lt;/td&gt;
&lt;td&gt;21 (18.4%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Within 10%&lt;/td&gt;
&lt;td&gt;71 (62.3%) ±8.8% CI&lt;/td&gt;
&lt;td&gt;43 (37.7%) ±8.8% CI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean abs err&lt;/td&gt;
&lt;td&gt;9.4%&lt;/td&gt;
&lt;td&gt;16.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median abs err&lt;/td&gt;
&lt;td&gt;5.7%&lt;/td&gt;
&lt;td&gt;13.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 abs err&lt;/td&gt;
&lt;td&gt;31.0%&lt;/td&gt;
&lt;td&gt;38.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max abs err&lt;/td&gt;
&lt;td&gt;42.9% (single emoji repeated)&lt;/td&gt;
&lt;td&gt;82.6% (repeated chars)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Per-category breakdown (within-10% rate, mean abs err)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;bpe-lite within-10%&lt;/th&gt;
&lt;th&gt;bpe-lite mean err&lt;/th&gt;
&lt;th&gt;ai-tok within-10%&lt;/th&gt;
&lt;th&gt;ai-tok mean err&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;english-prose&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;90% ±19%&lt;/td&gt;
&lt;td&gt;5.5%&lt;/td&gt;
&lt;td&gt;80% ±23%&lt;/td&gt;
&lt;td&gt;7.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-python&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;90% ±19%&lt;/td&gt;
&lt;td&gt;4.8%&lt;/td&gt;
&lt;td&gt;20% ±23%&lt;/td&gt;
&lt;td&gt;11.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100% ±14%&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;td&gt;60% ±26%&lt;/td&gt;
&lt;td&gt;9.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;numbers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80% ±23%&lt;/td&gt;
&lt;td&gt;7.3%&lt;/td&gt;
&lt;td&gt;10% ±19%&lt;/td&gt;
&lt;td&gt;23.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hex-binary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80% ±23%&lt;/td&gt;
&lt;td&gt;5.3%&lt;/td&gt;
&lt;td&gt;20% ±23%&lt;/td&gt;
&lt;td&gt;22.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;symbols&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10% ±19%&lt;/td&gt;
&lt;td&gt;17.6%&lt;/td&gt;
&lt;td&gt;10% ±19%&lt;/td&gt;
&lt;td&gt;23.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;arabic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0% ±14%&lt;/td&gt;
&lt;td&gt;26.1%&lt;/td&gt;
&lt;td&gt;0% ±14%&lt;/td&gt;
&lt;td&gt;28.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cjk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;40% ±26%&lt;/td&gt;
&lt;td&gt;8.8%&lt;/td&gt;
&lt;td&gt;30% ±25%&lt;/td&gt;
&lt;td&gt;12.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;emoji&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;20% ±23%&lt;/td&gt;
&lt;td&gt;17.7%&lt;/td&gt;
&lt;td&gt;30% ±25%&lt;/td&gt;
&lt;td&gt;15.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;structured&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;90% ±19%&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;70% ±25%&lt;/td&gt;
&lt;td&gt;9.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;urls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80% ±23%&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;90% ±19%&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;short&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30% ±25%&lt;/td&gt;
&lt;td&gt;6.8%&lt;/td&gt;
&lt;td&gt;10% ±19%&lt;/td&gt;
&lt;td&gt;32.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Comparison with raw Xenova (unmodified)
&lt;/h2&gt;

&lt;p&gt;We also ran the unmodified Xenova tokenizer against the same API to isolate the effect of bpe-lite's hand-tuning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;raw Xenova&lt;/th&gt;
&lt;th&gt;bpe-lite&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Within 10%&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;84% (25-sample run)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean abs err&lt;/td&gt;
&lt;td&gt;12.48%&lt;/td&gt;
&lt;td&gt;5.74% (25-sample run)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max abs err&lt;/td&gt;
&lt;td&gt;82.6% (repeated chars)&lt;/td&gt;
&lt;td&gt;21.7% (symbols)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key modifications that drive the improvement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repeated-byte merges deleted&lt;/strong&gt; — Xenova has &lt;code&gt;aaa&lt;/code&gt;, &lt;code&gt;aaaa&lt;/code&gt; etc.; Claude does not. Fixes &lt;code&gt;repeated chars&lt;/code&gt; from 82.6% to 4.3%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emoji byte-pair injections&lt;/strong&gt; — Xenova merges a full 4-byte emoji to 1 token; Claude uses 3–4 tokens. Injected &lt;code&gt;[9F,91]&lt;/code&gt;, &lt;code&gt;[9F,92]&lt;/code&gt;, &lt;code&gt;[9F,98]&lt;/code&gt; and &lt;code&gt;[20,F0]&lt;/code&gt; pairs and deleted the full-emoji merges. Cuts emoji error from 26% to 8%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbol path engineering&lt;/strong&gt; — Deleted over-merged 3-byte tokens (&lt;code&gt;↑↓↔≈≠≤≥∞∑∫&lt;/code&gt;); injected &lt;code&gt;[E2,88]&lt;/code&gt; and &lt;code&gt;[E2,82]&lt;/code&gt; prefix pairs for correct 2-token bare paths. Reduces symbol error from 37.7% to 21.7%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CJK/Japanese injections&lt;/strong&gt; — Added missing single-char tokens (&lt;code&gt;世 機 械 習 モ 語&lt;/code&gt;). Drops Japanese error from 20% to 3%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whitespace sequence injections&lt;/strong&gt; — space×3..32, tab×2..8, nl×2..8 at rank 0. Fixes whitespace-heavy inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Space+symbol merge deletions&lt;/strong&gt; — Xenova has &lt;code&gt;£&lt;/code&gt;, &lt;code&gt;±&lt;/code&gt;, &lt;code&gt;≤&lt;/code&gt;, &lt;code&gt;≥&lt;/code&gt; merged; Claude does not. Deleted these.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NFKC normalisation&lt;/strong&gt; — Applied before BPE (&lt;code&gt;normalize: 'NFKC'&lt;/code&gt;). Fixes &lt;code&gt;™→TM&lt;/code&gt;, &lt;code&gt;…→...&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;
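&lt;p&gt;The NFKC step is plain &lt;code&gt;String.prototype.normalize&lt;/code&gt;; the compatibility mappings it relies on can be checked directly in Node:&lt;/p&gt;

```javascript
// NFKC maps compatibility characters to their canonical equivalents
// before BPE runs, so the tokenizer sees the same bytes the model does.
"™".normalize("NFKC"); // "TM"
"…".normalize("NFKC"); // "..."
"ﬁ".normalize("NFKC"); // "fi" (ligature decomposition)
```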




&lt;h2&gt;
  
  
  7. Known unresolvable issues
&lt;/h2&gt;

&lt;p&gt;These categories cannot be fully fixed without the actual Claude tokenizer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Arabic (mean err 26%):&lt;/strong&gt; Xenova was trained on far less Arabic data than Claude. It has fewer Arabic merges, producing longer token sequences. Every Arabic sample is over-tokenized by 17–46 tokens. The gap grows with text length.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symbols (mean err 18%):&lt;/strong&gt; Claude tokenizes symbols using byte-level BPE without regex pre-tokenisation. Adjacent symbols can form cross-symbol byte merges (e.g. the last byte of &lt;code&gt;©&lt;/code&gt; and the first byte of &lt;code&gt;®&lt;/code&gt; may merge). Our regex-chunked approach processes each symbol in isolation, so these cross-boundary merges cannot be replicated. Some symbols also have different space-prefixed merge behaviour than Xenova.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emoji (mean err 18%):&lt;/strong&gt; Complex emoji sequences (ZWJ families, skin-tone variants, keycap sequences, symbol-like emoji) have irregular token counts that don't follow a simple pattern. bpe-lite handles the common cases but ZWJ sequences, flag emoji, and symbol-like emoji have 14–43% errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large integers (numbers, 33% err):&lt;/strong&gt; &lt;code&gt;1000000 9999999 ...&lt;/code&gt; — these contain 7–12 digit numbers. The &lt;code&gt;\p{N}{1,3}&lt;/code&gt; pattern chunks them into 1–3 digit groups as expected. However, Claude appears to merge some specific digit sequences differently. On the current sample, bpe-lite over-counts by 12 tokens (48 vs 36).&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Comparison notes vs ai-tokenizer's published accuracy
&lt;/h2&gt;

&lt;p&gt;ai-tokenizer's README claims 97–99% accuracy for Claude models at 5k–50k tokens, measured on random text. Our benchmark shows 37.7% within 10% on our 120-sample corpus. The discrepancy has two explanations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test corpus composition:&lt;/strong&gt; ai-tokenizer tests on long random text (5k–50k tokens). At that scale, errors average out and the overall percentage is dominated by the majority of tokens which tokenize correctly. Our corpus deliberately over-represents hard categories (symbols, Arabic, emoji, numbers) that expose systematic failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Number pattern flaw:&lt;/strong&gt; ai-tokenizer's &lt;code&gt;\p{N}+&lt;/code&gt; regex is correct for the older Claude 1/2 tokenizer they appear to have encoded, but wrong for current Claude models which use &lt;code&gt;\p{N}{1,3}&lt;/code&gt;. On random prose this matters little; on code and data it causes large errors.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the specific use case of estimating token counts on real-world diverse inputs, bpe-lite's mean error of 9.4% (with a 62% within-10% rate) is substantially more reliable than ai-tokenizer's 16% mean error and 37.7% within-10% rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Benchmark scripts summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node scripts/fetch-corpus.js     &lt;span class="c"&gt;# one-time: fetch 120 expected counts from API&lt;/span&gt;
node scripts/accuracy.js         &lt;span class="c"&gt;# offline: compare bpe-lite + ai-tokenizer vs corpus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;corpus-expected.json&lt;/code&gt; is committed and does not need to be re-fetched unless the corpus changes or a new model is tested.&lt;/p&gt;

</description>
      <category>anthropic</category>
      <category>claude</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Your AI Coding Session Is Degrading Silently — Here's How to Measure It</title>
      <dc:creator>J Schoemaker</dc:creator>
      <pubDate>Sat, 07 Mar 2026 10:03:43 +0000</pubDate>
      <link>https://forem.com/jerown/your-ai-coding-session-is-degrading-silently-heres-how-to-measure-it-43nm</link>
      <guid>https://forem.com/jerown/your-ai-coding-session-is-degrading-silently-heres-how-to-measure-it-43nm</guid>
      <description>&lt;h1&gt;
  
  
  How driftguard-mcp Detects AI Context Degradation in Real Time
&lt;/h1&gt;

&lt;p&gt;Long AI coding sessions degrade. Not gradually and gracefully — silently, until the model is already repeating itself, hedging on things it was confident about an hour ago, and producing code that contradicts what it wrote earlier in the same session.&lt;/p&gt;

&lt;p&gt;Most developers don't catch this when it happens. They just feel like the AI is "having an off day" and keep pushing while the degradation compounds.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/jschoemaker/driftguard-mcp" rel="noopener noreferrer"&gt;driftguard-mcp&lt;/a&gt; to measure this in real time and expose the score as MCP tools you can call mid-session. This article covers why the problem is hard to detect, what signals actually predict it, and how the implementation works under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Context Degradation Is Hard to Notice
&lt;/h2&gt;

&lt;p&gt;The underlying mechanism is well-documented: academic benchmarks like NoLiMa (ICML 2025) show that at 32K tokens, 10 out of 12 models drop below 50% of their short-context performance — models that all claim to support at least 128K tokens. The same degradation pattern appears in coding sessions specifically. Engineers at Sourcegraph found Claude Code quality declining around 147,000–152,000 tokens, well before its advertised 200K limit. Practitioners running daily Claude Code and Cursor sessions have documented it starting as early as 20–40% context capacity. The failure mode is the same regardless of domain: the model doesn't error — it degrades.&lt;/p&gt;

&lt;p&gt;Output gets shorter. It starts paraphrasing things it said 30 messages ago. It hedges more, qualifies more, and corrects itself on minor points rather than reasoning forward. None of this looks obviously broken. The model is still responding. It's still generating code. It just isn't the same model you were talking to at message 12.&lt;/p&gt;

&lt;p&gt;The two most reliable signals are also the most invisible:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context saturation&lt;/strong&gt; accumulates incrementally. Each message pushes the window a little further. There's no threshold warning, no indicator. By the time you're at 88% token fill, the model has been operating under pressure for a while.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repetition&lt;/strong&gt; is equally invisible because developers don't read transcripts — they read current output. If the model recycled a code pattern from 20 messages ago, you'd have to actively compare to catch it.&lt;/p&gt;

&lt;p&gt;The result: most people notice something is wrong at message 60+, well after the session became unreliable.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/NQcMkPxkcho"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sources: &lt;a href="https://arxiv.org/abs/2502.05167" rel="noopener noreferrer"&gt;NoLiMa: Long-Context Evaluation Beyond Literal Matching&lt;/a&gt; (ICML 2025) · &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;Lost in the Middle&lt;/a&gt; (Liu et al., TACL 2024) · &lt;a href="https://www.turboai.dev/blog/claude-code-context-window-management" rel="noopener noreferrer"&gt;Why Claude Code Sessions Keep Dying&lt;/a&gt; · &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;Context Rot&lt;/a&gt; (Chroma Research)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Reading the Session Directly
&lt;/h2&gt;

&lt;p&gt;driftguard-mcp reads session files on disk rather than intercepting API calls. This has a few advantages: it requires no proxy layer, no API key, no modified toolchain. It just watches the same JSONL files the CLI produces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; writes session state to &lt;code&gt;~/.claude/projects/&amp;lt;hash&amp;gt;/&amp;lt;session-uuid&amp;gt;.jsonl&lt;/code&gt;. Each line is a typed message with role, content, and — critically — token counts from the API response. The &lt;code&gt;usage&lt;/code&gt; field includes &lt;code&gt;input_tokens&lt;/code&gt; and &lt;code&gt;cache_read_input_tokens&lt;/code&gt;, which together give an accurate picture of what the model actually processed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI&lt;/strong&gt; writes to &lt;code&gt;~/.gemini/tmp/&lt;/code&gt;. Its format uses &lt;code&gt;functionCall&lt;/code&gt; / &lt;code&gt;functionResponse&lt;/code&gt; pairs for tool use, which required a separate adapter to normalise into the shared message structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; uses &lt;code&gt;~/.codex/&lt;/code&gt; with &lt;code&gt;tool_calls&lt;/code&gt; / &lt;code&gt;role:tool&lt;/code&gt; format. Token counts aren't available here, so context saturation falls back to a character-based estimate with a calibration factor.&lt;/p&gt;
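&lt;p&gt;The fallback is a simple heuristic; the constants below are illustrative, not driftguard-mcp's actual calibration:&lt;/p&gt;

```javascript
// With no token counts in the session file, estimate from characters.
// Roughly 4 chars per token is a common English baseline; a calibration
// factor nudges it for code-heavy transcripts.
function estimateTokens(text, charsPerToken = 4, calibration = 1.0) {
  return Math.round((text.length / charsPerToken) * calibration);
}

estimateTokens("a".repeat(400)); // 100
```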

&lt;p&gt;All three adapters normalise to the same internal &lt;code&gt;ParsedMessage[]&lt;/code&gt; structure before scoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ParsedMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;tokenCount&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// real API counts where available&lt;/span&gt;
  &lt;span class="nl"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One edge case worth noting: &lt;strong&gt;Claude Code compact boundaries&lt;/strong&gt;. When Claude compacts mid-session, pre-compaction messages are dropped from its active context. driftguard-mcp detects this boundary in the JSONL and drops the same messages from scoring — the score only reflects what Claude actually remembers, not the full conversation history on disk.&lt;/p&gt;
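&lt;p&gt;A sketch of the trimming step, assuming the adapter has already flagged the boundary entry (the &lt;code&gt;isCompactBoundary&lt;/code&gt; flag is a hypothetical name; the real marker detection is Claude-Code-specific):&lt;/p&gt;

```javascript
// Keep only messages after the most recent compaction boundary so the
// score reflects the model's live context, not the on-disk history.
function activeWindow(messages) {
  const lastBoundary = messages
    .map((m) => Boolean(m.isCompactBoundary))
    .lastIndexOf(true);
  return lastBoundary === -1 ? messages : messages.slice(lastBoundary + 1);
}
```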




&lt;h2&gt;
  
  
  The 6-Factor Scoring Model
&lt;/h2&gt;

&lt;p&gt;The composite drift score (0–100) is a weighted sum of six factors. The weights reflect signal reliability, not equal contribution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;Signal type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context saturation&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;td&gt;Quantitative — token fill %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repetition&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;td&gt;Statistical — 3-gram overlap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response length collapse&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;Trend — rolling window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal distance&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;Semantic — TF-IDF cosine similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uncertainty signals&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;td&gt;Lexical — explicit self-corrections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence drift&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;Lexical — hedging language trend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Context saturation and repetition dominate at 74% combined. This is intentional — they're the most direct, measurable predictors of degradation. The lexical signals (uncertainty, confidence drift) contribute noise-reduction rather than primary signal, which is why they're weighted at 3% combined.&lt;/p&gt;
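&lt;p&gt;The composite is then a straightforward weighted sum of the per-factor scores, each already normalised to 0–100. A sketch using the weights from the table (field names are illustrative):&lt;/p&gt;

```javascript
// Weights from the table above; each factor score is assumed to be 0-100.
const WEIGHTS = {
  saturation: 0.37,
  repetition: 0.37,
  lengthCollapse: 0.15,
  goalDistance: 0.08,
  uncertainty: 0.02,
  confidenceDrift: 0.01,
};

function driftScore(factors) {
  let total = 0;
  for (const [factor, weight] of Object.entries(WEIGHTS)) {
    total += weight * (factors[factor] ?? 0);
  }
  return Math.min(100, Math.round(total));
}

driftScore({ saturation: 80, repetition: 60 }); // 52
```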

&lt;h3&gt;
  
  
  Context Saturation (37%)
&lt;/h3&gt;

&lt;p&gt;For Claude and Gemini, token counts come directly from the API response metadata in the session file. The saturation score is a calibrated curve against the model's known context window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;contextSaturationScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tokenCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tokenCount&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Smooth ramp: low penalty below 50%, steep above 75%&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;240&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces near-zero scores in healthy sessions and rapidly climbing scores as fill approaches capacity — matching actual model behaviour, which degrades non-linearly near the limit.&lt;/p&gt;
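&lt;p&gt;To make the shape concrete, here are the curve's values at a few fill levels, with the function restated standalone (branches reordered, behaviour identical):&lt;/p&gt;

```javascript
// Same piecewise curve as above: gentle below 50% fill, steep past 75%.
function contextSaturationScore(tokenCount, maxTokens) {
  const fill = tokenCount / maxTokens;
  if (fill >= 0.75) return 40 + (fill - 0.75) * 240;
  if (fill >= 0.5) return 10 + (fill - 0.5) * 120;
  return fill * 20;
}

contextSaturationScore(50000, 200000);  // 25% fill: 5
contextSaturationScore(120000, 200000); // 60% fill: 22
contextSaturationScore(176000, 200000); // 88% fill: about 71
```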

&lt;h3&gt;
  
  
  Repetition (37%)
&lt;/h3&gt;

&lt;p&gt;Repetition is measured using a 3-gram sliding window across recent assistant responses. The algorithm extracts all 3-word sequences from the last N responses and measures overlap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractTrigrams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trigrams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;trigrams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;trigrams&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;repetitionScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ParsedMessage&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allTrigrams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nf"&gt;extractTrigrams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)]);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allTrigrams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;overlapRatio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;allTrigrams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;overlapRatio&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// calibrated multiplier&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3-grams at this window size are reliable enough to catch genuine repetition without false positives from incidental shared vocabulary (e.g., variable names appearing across multiple messages).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool noise filtering&lt;/strong&gt;: Tool call messages — "Tool loaded.", "Calling bash...", etc. — are filtered from the user message stream before scoring. Without this, tool-heavy sessions score artificially high on repetition due to repeated tool invocation boilerplate.&lt;/p&gt;
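&lt;p&gt;A sketch of what that filter might look like; the two patterns are the post's own examples, and the real filter list is presumably longer:&lt;/p&gt;

```typescript
// Match tool-invocation boilerplate so it can be excluded before scoring.
// Patterns taken from the post's examples ("Tool loaded.", "Calling bash...");
// the actual list is an assumption.
function isToolNoise(content: string): boolean {
  return /^(Tool loaded\.|Calling )/.test(content.trim());
}
```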

&lt;h3&gt;
  
  
  Response Length Collapse (15%)
&lt;/h3&gt;

&lt;p&gt;As sessions degrade, responses get shorter. The model starts truncating explanations, omitting context it would have included earlier. This is a reliable secondary signal.&lt;/p&gt;

&lt;p&gt;The score measures the trend in response length across the last 15 assistant messages using a simple linear regression slope:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;lengthCollapseScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ParsedMessage&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;linearRegressionSlope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Negative slope = shrinking responses&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
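&lt;p&gt;&lt;code&gt;linearRegressionSlope&lt;/code&gt; itself isn't shown in the post; a standard ordinary-least-squares slope over (index, length) pairs would look something like:&lt;/p&gt;

```typescript
// Ordinary least-squares slope of values against their indices 0..n-1.
// A sketch of the helper the post references, not its actual code.
function linearRegressionSlope(values: number[]): number {
  const n = values.length;
  if (2 > n) return 0;          // not enough points for a slope
  const meanX = (n - 1) / 2;    // mean of 0..n-1
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; n > i; i++) {
    num += (i - meanX) * (values[i] - meanY);
    den += (i - meanX) * (i - meanX);
  }
  return den === 0 ? 0 : num / den;
}
```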



&lt;h3&gt;
  
  
  Goal Distance (8%)
&lt;/h3&gt;

&lt;p&gt;This factor only activates when you pass a &lt;code&gt;goal&lt;/code&gt; string to &lt;code&gt;get_drift()&lt;/code&gt;. It measures vocabulary drift from your original objective using TF-IDF cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;get_drift({ goal: "implement JWT authentication with refresh token rotation" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal string is vectorised against recent assistant responses. As the session drifts from the original task — handling edge cases, going down tangents, responding to follow-up questions — cosine similarity to the goal string decreases.&lt;/p&gt;

&lt;p&gt;The threshold curve is calibrated so that similarity ≥ 0.5 returns a near-zero score, with penalty scaling steeply below 0.3. Without a &lt;code&gt;goal&lt;/code&gt; param, this factor returns 0 and its 8% weight is redistributed proportionally.&lt;/p&gt;
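&lt;p&gt;The exact vectorisation isn't shown. Stripped of the IDF weighting, the core term-frequency cosine similarity can be sketched as:&lt;/p&gt;

```typescript
// Plain term-frequency cosine similarity between two texts. The post's
// implementation also applies IDF weighting, which is omitted here.
function cosineSimilarity(a: string, b: string): number {
  const counts = (text: string) => {
    const m = new Map();
    for (const w of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      m.set(w, (m.get(w) ?? 0) + 1);
    }
    return m;
  };
  const va = counts(a);
  const vb = counts(b);
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (const [w, c] of va) { dot += c * (vb.get(w) ?? 0); na += c * c; }
  for (const c of vb.values()) nb += c * c;
  return na * nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}
```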

&lt;h3&gt;
  
  
  Uncertainty Signals (2%) and Confidence Drift (1%)
&lt;/h3&gt;

&lt;p&gt;These are intentionally low-weight. Uncertainty signals count explicit self-corrections ("I was wrong about", "let me correct that", "actually, I made an error") — not general hedging, which is too noisy. Confidence drift measures the trend in hedging language frequency (perhaps, might, could, I think) between the first third and last third of the session.&lt;/p&gt;

&lt;p&gt;Both factors were originally weighted higher. In practice, hedging language is too context-dependent — a research session is supposed to have more hedging — and self-corrections are too rare to contribute meaningful signal in most sessions. Keeping them at 3% combined means they can nudge a borderline score without ever dominating it.&lt;/p&gt;
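&lt;p&gt;As a sketch, the first-third versus last-third hedging comparison might look like this; the hedge list is the post's, but the structure is an assumption:&lt;/p&gt;

```typescript
// Compare hedging-word frequency between the earliest and latest third
// of a session's assistant responses. Positive return = hedging rising.
function confidenceDrift(responses: string[]): number {
  const hedgeRate = (texts: string[]) => {
    const joined = texts.join(" ").toLowerCase();
    const hits = (joined.match(/\b(perhaps|might|could|i think)\b/g) ?? []).length;
    const words = joined.split(/\s+/).filter(Boolean).length;
    return words === 0 ? 0 : hits / words;
  };
  const third = Math.max(1, Math.floor(responses.length / 3));
  const early = hedgeRate(responses.slice(0, third));
  const late = hedgeRate(responses.slice(-third));
  // Only an increase in hedging counts as drift.
  return Math.max(0, late - early);
}
```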




&lt;h2&gt;
  
  
  Score Thresholds and Output Design
&lt;/h2&gt;

&lt;p&gt;Scores map to four states:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0–29&lt;/td&gt;
&lt;td&gt;Fresh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30–60&lt;/td&gt;
&lt;td&gt;Warming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;61–80&lt;/td&gt;
&lt;td&gt;Drifting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;81–100&lt;/td&gt;
&lt;td&gt;Polluted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
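&lt;p&gt;The mapping is a straightforward threshold lookup over the ranges above:&lt;/p&gt;

```typescript
// Map a 0-100 composite drift score to its named state,
// per the ranges in the table above.
function driftState(score: number): string {
  if (score >= 81) return "Polluted";
  if (score >= 61) return "Drifting";
  if (score >= 30) return "Warming";
  return "Fresh";
}
```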

&lt;p&gt;The &lt;code&gt;get_drift()&lt;/code&gt; output leads with a plain-English recommendation rather than just the score. The score is a number — what most developers need is "should I start fresh right now or not":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⚠️  Start fresh now — context is full and responses are repeating heavily.

  Context depth         █████████░   88
  Repetition            ████████░░   72
  Length collapse       █████░░░░░   48

Score: 84/100 · 67 messages

→ Call get_handoff() to write handoff.md before starting fresh.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Factor bars only appear when they're contributing meaningfully to the score. A healthy session shows only the top two; a degraded session shows all contributing factors. This avoids surfacing noise in the common case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handoff trigger&lt;/strong&gt;: The suggestion to call &lt;code&gt;get_handoff()&lt;/code&gt; fires independently of the composite score — it triggers when context depth or repetition individually cross their thresholds. A session can have a composite score of 65 (drifting) and still get a handoff suggestion if repetition is at 78.&lt;/p&gt;
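&lt;p&gt;That independent trigger reduces to per-factor checks. The thresholds below are hypothetical; the post only establishes that repetition at 78 fires it while the composite score sits at 65:&lt;/p&gt;

```typescript
// Fire the handoff suggestion when any single factor crosses its own
// threshold, regardless of the composite score. Threshold values are
// assumptions for illustration.
function shouldSuggestHandoff(contextDepth: number, repetition: number): boolean {
  return contextDepth >= 85 || repetition >= 75;
}
```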




&lt;h2&gt;
  
  
  The Handoff Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;get_handoff()&lt;/code&gt; returns a structured prompt instructing the current AI session to write a &lt;code&gt;handoff.md&lt;/code&gt; file. The AI generates the file using its full session context — which, crucially, still exists even in a degraded session. The model may be repeating itself, but it still has access to everything it has done.&lt;/p&gt;

&lt;p&gt;A typical &lt;code&gt;handoff.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## What we accomplished&lt;/span&gt;
Implemented JWT authentication with refresh token rotation. Added middleware,
updated the user model, wrote integration tests. All tests passing.

&lt;span class="gu"&gt;## Current state&lt;/span&gt;
Auth flow is working end-to-end. Rate limiting is stubbed but not implemented.
The &lt;span class="sb"&gt;`/refresh`&lt;/span&gt; endpoint has a known edge case with concurrent requests (see auth.ts:142).

&lt;span class="gu"&gt;## Files modified&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; src/middleware/auth.ts — JWT verify + refresh logic
&lt;span class="p"&gt;-&lt;/span&gt; src/models/user.ts — added refreshToken field + index
&lt;span class="p"&gt;-&lt;/span&gt; src/routes/auth.ts — /login, /logout, /refresh endpoints
&lt;span class="p"&gt;-&lt;/span&gt; tests/integration/auth.test.ts — 14 new tests

&lt;span class="gu"&gt;## Open questions / next steps&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Implement rate limiting on /login (5 attempts per 15 min)
&lt;span class="p"&gt;-&lt;/span&gt; Fix concurrent refresh edge case
&lt;span class="p"&gt;-&lt;/span&gt; Add token blacklist for logout

&lt;span class="gu"&gt;## Context for next session&lt;/span&gt;
Using jsonwebtoken@9, refresh tokens stored in DB. Access token TTL: 15min,
Refresh TTL: 7 days.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load this at the start of the next session. You don't lose context — you lose the degraded session state while keeping the useful information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trend Tracking
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;get_trend()&lt;/code&gt; returns the full score history for the current session with a sparkline, peak, average, and trajectory annotation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session drift trend (18 snapshots)

  12 ▁▁▂▃▄▄▅▆▇▇█  84

  Peak: 84  ·  Avg: 47  ·  Trajectory: ↑ climbing

Snapshots: 12 → 18 → 24 → 31 → 38 → 42 → 51 → 58 → 63 → 70 → 76 → 84
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Snapshots are persisted to &lt;code&gt;~/.driftcli/history/&lt;/code&gt; as JSONL and survive session restarts. The sparkline starts appearing after 3 &lt;code&gt;get_drift()&lt;/code&gt; calls. Trend data is per-session, keyed by session UUID.&lt;/p&gt;
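&lt;p&gt;Rendering the sparkline itself reduces to bucketing each 0–100 score into one of eight block glyphs, roughly:&lt;/p&gt;

```typescript
// Eight block-drawing glyphs, lowest to highest.
const GLYPHS = "▁▂▃▄▅▆▇█";

// Bucket each 0-100 score into one glyph. A sketch, not the tool's code.
function sparkline(scores: number[]): string {
  return scores
    .map(s => GLYPHS[Math.min(7, Math.floor((s / 100) * 8))])
    .join("");
}
```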




&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;driftguard-mcp merges global (&lt;code&gt;~/.driftclirc&lt;/code&gt;) and per-project (&lt;code&gt;.driftcli&lt;/code&gt;) config. Presets adjust factor weights without requiring manual override:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Adjustment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;coding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default weights — emphasises context depth and repetition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;research&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Weights goal distance more heavily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brainstorm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Relaxes repetition and confidence drift penalties&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;strict&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Equal weight across all six factors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Custom weight overrides are supported on top of any preset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"preset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"coding"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"warnThreshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"weights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"repetition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
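&lt;p&gt;A plausible sketch of the merge, assuming project-level values win and the &lt;code&gt;weights&lt;/code&gt; object merges key-by-key (so a project can override a single factor without restating the rest):&lt;/p&gt;

```typescript
// Merge global (~/.driftclirc) and per-project (.driftcli) config.
// The key-by-key weights merge is an assumption based on the example
// above, which overrides only one factor.
function mergeConfig(globalCfg: any, projectCfg: any): any {
  return {
    ...globalCfg,
    ...projectCfg,
    weights: { ...(globalCfg.weights ?? {}), ...(projectCfg.weights ?? {}) },
  };
}
```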






&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; driftguard-mcp
driftguard-mcp setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;setup&lt;/code&gt; auto-configures Claude Code, Gemini CLI, Codex CLI, and Cursor. Restart your CLI after running it; the tools are live as soon as it relaunches.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/driftguard-mcp" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/driftguard-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/jschoemaker/driftguard-mcp" rel="noopener noreferrer"&gt;https://github.com/jschoemaker/driftguard-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Current areas of active work: better token count estimation for Codex (the character-based fallback works but real counts would improve saturation accuracy), and a VSCode extension surface for teams that don't use CLI-first workflows.&lt;/p&gt;

&lt;p&gt;The core scoring algorithm is intentionally conservative — better to miss a drifting session than to cry wolf on healthy ones. If you're running sessions and find the thresholds too tight or too loose for your workflow, the config system is designed for exactly that.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>typescript</category>
      <category>claudeai</category>
    </item>
  </channel>
</rss>
