<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: J Schoemaker</title>
    <description>The latest articles on Forem by J Schoemaker (@jerown).</description>
    <link>https://forem.com/jerown</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811352%2Fb368cade-7c1b-4580-a769-8cebb040f2bb.png</url>
      <title>Forem: J Schoemaker</title>
      <link>https://forem.com/jerown</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jerown"/>
    <language>en</language>
    <item>
      <title>Anthropic Never Released Their Tokenizer. Here's What We Found Testing the Alternatives</title>
      <dc:creator>J Schoemaker</dc:creator>
      <pubDate>Thu, 19 Mar 2026 20:47:42 +0000</pubDate>
      <link>https://forem.com/jerown/anthropic-never-released-their-tokenizer-heres-what-we-found-testing-the-alternatives-b05</link>
      <guid>https://forem.com/jerown/anthropic-never-released-their-tokenizer-heres-what-we-found-testing-the-alternatives-b05</guid>
      <description>&lt;h1&gt;
  
  
  bpe-lite accuracy benchmark — report
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Date:&lt;/strong&gt; 2026-03-19&lt;br&gt;
&lt;strong&gt;Model tested against:&lt;/strong&gt; &lt;code&gt;claude-haiku-4-5-20251001&lt;/code&gt; via Anthropic &lt;code&gt;count_tokens&lt;/code&gt; API&lt;br&gt;
&lt;strong&gt;Tokenizers compared:&lt;/strong&gt; bpe-lite (modified Xenova), ai-tokenizer (claude encoding), raw Xenova (unmodified)&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Background
&lt;/h2&gt;

&lt;p&gt;bpe-lite is a zero-dependency JS tokenizer supporting OpenAI (cl100k / o200k), Anthropic (Xenova/claude-tokenizer, 65k BPE), and Gemini (Gemma3 SPM). Anthropic has not released the Claude 4 tokenizer, so the Anthropic provider is a reverse-engineered approximation sourced from &lt;code&gt;Xenova/claude-tokenizer&lt;/code&gt; on HuggingFace, with hand-tuned modifications.&lt;/p&gt;

&lt;p&gt;This report documents the construction of a stratified accuracy benchmark and its results.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Benchmark corpus
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Design
&lt;/h3&gt;

&lt;p&gt;120 samples across 12 categories (10 per category):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;english-prose&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;sentences, paragraphs, mixed punctuation, dialogue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-python&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;functions, classes, decorators, f-strings, async&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;arrow functions, classes, JSX, TypeScript, async/await&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;numbers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;integers, floats, scientific notation, dates, IPs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hex-binary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0x/0b prefixes, color codes, hashes, UUIDs, hex dumps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;symbols&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;copyright/trademark, math operators, arrows, currency clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;arabic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;words, sentences, mixed Latin, technical text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cjk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Chinese, Japanese, Korean, mixed scripts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;emoji&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;isolated, in prose, clusters, skin tones, flags, ZWJ sequences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;structured&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;JSON, HTML, XML, Markdown, CSV, YAML, SQL, GraphQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;urls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;full URLs, query strings, email addresses, data URIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;short&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1–5 token inputs, single words, punctuation only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Files
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scripts/corpus.js&lt;/code&gt; — 120 sample definitions (category, name, text)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/fetch-corpus.js&lt;/code&gt; — fetches expected counts from the Anthropic API, writes &lt;code&gt;scripts/corpus-expected.json&lt;/code&gt; (committed; benchmark runs offline)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/accuracy.js&lt;/code&gt; — offline runner; reads corpus-expected.json, compares both tokenizers, outputs per-sample table and per-category summary with Wilson 95% CI&lt;/li&gt;
&lt;/ul&gt;
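&lt;p&gt;The Wilson interval in the summary takes only a few lines. The sketch below is the standard formula for a proportion of &lt;code&gt;k&lt;/code&gt; hits in &lt;code&gt;n&lt;/code&gt; samples; it is illustrative, not the exact &lt;code&gt;accuracy.js&lt;/code&gt; code:&lt;/p&gt;

```javascript
// Wilson score interval for a binomial proportion (z = 1.96 for 95%).
// Used to attach a confidence interval to "within-10%" hit rates on small n.
function wilsonInterval(k, n, z = 1.96) {
  if (n === 0) return { lo: 0, hi: 0 };
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt(p * (1 - p) / n + z2 / (4 * n * n));
  return { lo: Math.max(0, center - half), hi: Math.min(1, center + half) };
}
```

&lt;p&gt;For 9 of 10 samples within 10% this gives roughly [0.60, 0.98], which is why the per-category table in section 5 reports half-widths of ±19–26% at n = 10.&lt;/p&gt;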


&lt;h2&gt;
  
  
  3. Calibration and a discovered overhead artifact
&lt;/h2&gt;

&lt;p&gt;Expected counts are computed as &lt;code&gt;api_raw(text) - overhead&lt;/code&gt;, where &lt;code&gt;overhead = api("Hi") - countTokens("Hi") = 7&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;±1 overhead artifact&lt;/strong&gt; was discovered: the last structural token of the Anthropic message template BPE-merges with certain first characters of content, making the effective overhead 7 or 8 depending on the first character:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Hi"   raw=8  overhead=7   letter start  — our calibration anchor
"1"    raw=9  overhead=8   digit start
"©"    raw=9  overhead=8   2-byte UTF-8, C2xx range
"→"    raw=8  overhead=7   3-byte UTF-8, E2/86 range
"Hi1"  raw=9  net=2        Hi=1 + 1=1 — digit contributes 1 token in context ✓
"1Hi"  raw=10 net=3        boundary effect inflates count by 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The artifact only matters for &lt;code&gt;expected &amp;lt; 5&lt;/code&gt; tokens — at that scale ±1 is more than 20% relative error. For longer samples it is negligible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution:&lt;/strong&gt; 6 samples with &lt;code&gt;expected &amp;lt; 5&lt;/code&gt; are excluded from percentage error calculations and shown as &lt;code&gt;n/a&lt;/code&gt;. All other samples are unaffected.&lt;/p&gt;
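&lt;p&gt;A minimal sketch of the calibration and exclusion logic (names are illustrative, not the exact &lt;code&gt;fetch-corpus.js&lt;/code&gt; or &lt;code&gt;accuracy.js&lt;/code&gt; code):&lt;/p&gt;

```javascript
// Overhead measured once against a letter-initial anchor string ("Hi"):
// api("Hi") returns 8 raw tokens, countTokens("Hi") returns 1.
const OVERHEAD = 7;

// Expected content tokens = raw API count minus the message-template overhead.
function expectedTokens(apiRaw) {
  return apiRaw - OVERHEAD;
}

// Below 5 expected tokens a ±1 boundary artifact exceeds 20% relative error,
// so such samples are reported as n/a rather than scored.
function isEligible(expected) {
  return expected >= 5;
}
```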

&lt;p&gt;We also investigated and ruled out a "prefix neutralisation" approach (&lt;code&gt;api("Hi. " + text) - api("Hi. ")&lt;/code&gt;): while it eliminates the digit-boundary artifact, the trailing space in the prefix gets absorbed into the first chunk of text (the BPE regex treats it as a leading space), corrupting token counts for short-string samples by a different ±1. The overhead subtraction approach with exclusion of tiny samples is the most honest solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. What ai-tokenizer uses for Claude
&lt;/h2&gt;

&lt;p&gt;ai-tokenizer's &lt;code&gt;claude&lt;/code&gt; encoding is a &lt;strong&gt;different vocabulary&lt;/strong&gt; from Xenova/claude-tokenizer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;64,995 tokens total (64,241 string-keyed + 754 binary)&lt;/li&gt;
&lt;li&gt;Special tokens: &lt;code&gt;EOT&lt;/code&gt;, &lt;code&gt;META&lt;/code&gt;, &lt;code&gt;META_START&lt;/code&gt;, &lt;code&gt;META_END&lt;/code&gt;, &lt;code&gt;SOS&lt;/code&gt; — characteristic of an older Claude 1/2-era tokenizer&lt;/li&gt;
&lt;li&gt;Regex pattern uses &lt;code&gt;\p{N}+&lt;/code&gt; (greedy, unlimited digits) instead of &lt;code&gt;\p{N}{1,3}&lt;/code&gt; (1–3 digits)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;\p{N}+&lt;/code&gt; pattern is ai-tokenizer's primary weakness: it chunks multi-digit numbers as a single unit, whereas Claude uses 1–3 digit chunks. This causes severe errors on anything involving numbers (43% error on Fibonacci integers, 29% on arithmetic, 22% on hex).&lt;/p&gt;
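&lt;p&gt;The difference is easy to demonstrate with the two digit patterns in isolation. This is a simplified sketch: the real pre-tokenisation regexes also handle letters, whitespace and punctuation:&lt;/p&gt;

```javascript
// Claude-style pre-tokenisation: digit runs split into 1-3 digit chunks.
const chunked = (s) => s.match(/\p{N}{1,3}/gu);
// ai-tokenizer-style: an entire digit run is a single pre-token.
const greedy = (s) => s.match(/\p{N}+/gu);

chunked("1234567"); // ["123", "456", "7"]
greedy("1234567");  // ["1234567"]
```

&lt;p&gt;Each pre-token is BPE-merged independently, so the two patterns produce different merge boundaries, and therefore different counts, on digit-heavy input.&lt;/p&gt;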

&lt;p&gt;ai-tokenizer also does &lt;strong&gt;not have a Gemini encoding&lt;/strong&gt; — all Gemini models in their registry are mapped to &lt;code&gt;o200k_base&lt;/code&gt; (OpenAI's vocabulary) with a fudge multiplier of 1.08. This produces completely wrong results for Gemini.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Results — full 120-sample benchmark
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overall summary (114 eligible samples, 6 excluded as expected &amp;lt; 5)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;bpe-lite&lt;/th&gt;
&lt;th&gt;ai-tokenizer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exact&lt;/td&gt;
&lt;td&gt;11 (9.6%)&lt;/td&gt;
&lt;td&gt;9 (7.9%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Within 5%&lt;/td&gt;
&lt;td&gt;53 (46.5%)&lt;/td&gt;
&lt;td&gt;21 (18.4%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Within 10%&lt;/td&gt;
&lt;td&gt;71 (62.3%) ±8.8% CI&lt;/td&gt;
&lt;td&gt;43 (37.7%) ±8.8% CI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean abs err&lt;/td&gt;
&lt;td&gt;9.4%&lt;/td&gt;
&lt;td&gt;16.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median abs err&lt;/td&gt;
&lt;td&gt;5.7%&lt;/td&gt;
&lt;td&gt;13.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 abs err&lt;/td&gt;
&lt;td&gt;31.0%&lt;/td&gt;
&lt;td&gt;38.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max abs err&lt;/td&gt;
&lt;td&gt;42.9% (single emoji repeated)&lt;/td&gt;
&lt;td&gt;82.6% (repeated chars)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Per-category breakdown (within-10% rate, mean abs err)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;bpe-lite within-10%&lt;/th&gt;
&lt;th&gt;bpe-lite mean err&lt;/th&gt;
&lt;th&gt;ai-tok within-10%&lt;/th&gt;
&lt;th&gt;ai-tok mean err&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;english-prose&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;90% ±19%&lt;/td&gt;
&lt;td&gt;5.5%&lt;/td&gt;
&lt;td&gt;80% ±23%&lt;/td&gt;
&lt;td&gt;7.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-python&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;90% ±19%&lt;/td&gt;
&lt;td&gt;4.8%&lt;/td&gt;
&lt;td&gt;20% ±23%&lt;/td&gt;
&lt;td&gt;11.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-js&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100% ±14%&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;td&gt;60% ±26%&lt;/td&gt;
&lt;td&gt;9.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;numbers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80% ±23%&lt;/td&gt;
&lt;td&gt;7.3%&lt;/td&gt;
&lt;td&gt;10% ±19%&lt;/td&gt;
&lt;td&gt;23.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hex-binary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80% ±23%&lt;/td&gt;
&lt;td&gt;5.3%&lt;/td&gt;
&lt;td&gt;20% ±23%&lt;/td&gt;
&lt;td&gt;22.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;symbols&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10% ±19%&lt;/td&gt;
&lt;td&gt;17.6%&lt;/td&gt;
&lt;td&gt;10% ±19%&lt;/td&gt;
&lt;td&gt;23.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;arabic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0% ±14%&lt;/td&gt;
&lt;td&gt;26.1%&lt;/td&gt;
&lt;td&gt;0% ±14%&lt;/td&gt;
&lt;td&gt;28.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cjk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;40% ±26%&lt;/td&gt;
&lt;td&gt;8.8%&lt;/td&gt;
&lt;td&gt;30% ±25%&lt;/td&gt;
&lt;td&gt;12.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;emoji&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;20% ±23%&lt;/td&gt;
&lt;td&gt;17.7%&lt;/td&gt;
&lt;td&gt;30% ±25%&lt;/td&gt;
&lt;td&gt;15.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;structured&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;90% ±19%&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;70% ±25%&lt;/td&gt;
&lt;td&gt;9.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;urls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;80% ±23%&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;90% ±19%&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;short&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30% ±25%&lt;/td&gt;
&lt;td&gt;6.8%&lt;/td&gt;
&lt;td&gt;10% ±19%&lt;/td&gt;
&lt;td&gt;32.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Comparison with raw Xenova (unmodified)
&lt;/h2&gt;

&lt;p&gt;We also ran the unmodified Xenova tokenizer against the same API to isolate the effect of bpe-lite's hand-tuning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;raw Xenova&lt;/th&gt;
&lt;th&gt;bpe-lite&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Within 10%&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;84% (25-sample run)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean abs err&lt;/td&gt;
&lt;td&gt;12.48%&lt;/td&gt;
&lt;td&gt;5.74% (25-sample run)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max abs err&lt;/td&gt;
&lt;td&gt;82.6% (repeated chars)&lt;/td&gt;
&lt;td&gt;21.7% (symbols)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key modifications that drive the improvement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repeated-byte merges deleted&lt;/strong&gt; — Xenova has &lt;code&gt;aaa&lt;/code&gt;, &lt;code&gt;aaaa&lt;/code&gt; etc.; Claude does not. Fixes &lt;code&gt;repeated chars&lt;/code&gt; from 82.6% to 4.3%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emoji byte-pair injections&lt;/strong&gt; — Xenova merges a full 4-byte emoji to 1 token; Claude uses 3–4 tokens. Injected &lt;code&gt;[9F,91]&lt;/code&gt;, &lt;code&gt;[9F,92]&lt;/code&gt;, &lt;code&gt;[9F,98]&lt;/code&gt; and &lt;code&gt;[20,F0]&lt;/code&gt; pairs and deleted the full-emoji merges. Cuts emoji error from 26% to 8%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbol path engineering&lt;/strong&gt; — Deleted over-merged 3-byte tokens (&lt;code&gt;↑↓↔≈≠≤≥∞∑∫&lt;/code&gt;); injected &lt;code&gt;[E2,88]&lt;/code&gt; and &lt;code&gt;[E2,82]&lt;/code&gt; prefix pairs for correct 2-token bare paths. Reduces symbol error from 37.7% to 21.7%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CJK/Japanese injections&lt;/strong&gt; — Added missing single-char tokens (&lt;code&gt;世 機 械 習 モ 語&lt;/code&gt;). Drops Japanese error from 20% to 3%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whitespace sequence injections&lt;/strong&gt; — space×3..32, tab×2..8, nl×2..8 at rank 0. Fixes whitespace-heavy inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Space+symbol merge deletions&lt;/strong&gt; — Xenova has &lt;code&gt;£&lt;/code&gt;, &lt;code&gt;±&lt;/code&gt;, &lt;code&gt;≤&lt;/code&gt;, &lt;code&gt;≥&lt;/code&gt; merged; Claude does not. Deleted these.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NFKC normalisation&lt;/strong&gt; — Applied before BPE (&lt;code&gt;normalize: 'NFKC'&lt;/code&gt;). Fixes &lt;code&gt;™→TM&lt;/code&gt;, &lt;code&gt;…→...&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;
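&lt;p&gt;The NFKC step is plain &lt;code&gt;String.prototype.normalize&lt;/code&gt;; the compatibility mappings it relies on can be checked directly in Node:&lt;/p&gt;

```javascript
// NFKC maps compatibility characters to their canonical equivalents
// before BPE runs, so the tokenizer sees the same bytes the model does.
"™".normalize("NFKC"); // "TM"
"…".normalize("NFKC"); // "..."
"ﬁ".normalize("NFKC"); // "fi" (ligature decomposition)
```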




&lt;h2&gt;
  
  
  7. Known unresolvable issues
&lt;/h2&gt;

&lt;p&gt;These categories cannot be fully fixed without the actual Claude tokenizer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Arabic (mean err 26%):&lt;/strong&gt; Xenova was trained on far less Arabic data than Claude. It has fewer Arabic merges, producing longer token sequences. Every Arabic sample is over-tokenized by 17–46 tokens. The gap grows with text length.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symbols (mean err 18%):&lt;/strong&gt; Claude tokenizes symbols using byte-level BPE without regex pre-tokenisation. Adjacent symbols can form cross-symbol byte merges (e.g. the last byte of &lt;code&gt;©&lt;/code&gt; and the first byte of &lt;code&gt;®&lt;/code&gt; may merge). Our regex-chunked approach processes each symbol in isolation, so these cross-boundary merges cannot be replicated. Some symbols also have different space-prefixed merge behaviour than Xenova.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emoji (mean err 18%):&lt;/strong&gt; Complex emoji sequences (ZWJ families, skin-tone variants, keycap sequences, symbol-like emoji) have irregular token counts that don't follow a simple pattern. bpe-lite handles the common cases but ZWJ sequences, flag emoji, and symbol-like emoji have 14–43% errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large integers (numbers, 33% err):&lt;/strong&gt; &lt;code&gt;1000000 9999999 ...&lt;/code&gt; — these contain 7–12 digit numbers. The &lt;code&gt;\p{N}{1,3}&lt;/code&gt; pattern chunks them into 1–3 digit groups as expected. However, Claude appears to merge some specific digit sequences differently. On the current sample, bpe-lite over-counts by 12 tokens (48 vs 36).&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Comparison notes vs ai-tokenizer's published accuracy
&lt;/h2&gt;

&lt;p&gt;ai-tokenizer's README claims 97–99% accuracy for Claude models at 5k–50k tokens, measured on random text. Our benchmark shows 37.7% within 10% on our 120-sample corpus. The discrepancy has two explanations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test corpus composition:&lt;/strong&gt; ai-tokenizer tests on long random text (5k–50k tokens). At that scale, errors average out and the overall percentage is dominated by the majority of tokens which tokenize correctly. Our corpus deliberately over-represents hard categories (symbols, Arabic, emoji, numbers) that expose systematic failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Number pattern flaw:&lt;/strong&gt; ai-tokenizer's &lt;code&gt;\p{N}+&lt;/code&gt; regex is correct for the older Claude 1/2 tokenizer they appear to have encoded, but wrong for current Claude models which use &lt;code&gt;\p{N}{1,3}&lt;/code&gt;. On random prose this matters little; on code and data it causes large errors.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the specific use case of estimating token counts on real-world diverse inputs, bpe-lite's mean error of 9.4% (with a 62% within-10% rate) is substantially more reliable than ai-tokenizer's 16% mean error and 37.7% within-10% rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Benchmark scripts summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node scripts/fetch-corpus.js     &lt;span class="c"&gt;# one-time: fetch 120 expected counts from API&lt;/span&gt;
node scripts/accuracy.js         &lt;span class="c"&gt;# offline: compare bpe-lite + ai-tokenizer vs corpus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;corpus-expected.json&lt;/code&gt; is committed and does not need to be re-fetched unless the corpus changes or a new model is tested.&lt;/p&gt;

</description>
      <category>anthropic</category>
      <category>claude</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Your AI Coding Session Is Degrading Silently — Here's How to Measure It</title>
      <dc:creator>J Schoemaker</dc:creator>
      <pubDate>Sat, 07 Mar 2026 10:03:43 +0000</pubDate>
      <link>https://forem.com/jerown/your-ai-coding-session-is-degrading-silently-heres-how-to-measure-it-43nm</link>
      <guid>https://forem.com/jerown/your-ai-coding-session-is-degrading-silently-heres-how-to-measure-it-43nm</guid>
      <description>&lt;h1&gt;
  
  
  How driftguard-mcp Detects AI Context Degradation in Real Time
&lt;/h1&gt;

&lt;p&gt;Long AI coding sessions degrade. Not gradually and gracefully — silently, until the model is already repeating itself, hedging on things it was confident about an hour ago, and producing code that contradicts what it wrote earlier in the same session.&lt;/p&gt;

&lt;p&gt;Most developers don't catch this when it happens. They just feel like the AI is "having an off day" and keep pushing while the degradation compounds.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/jschoemaker/driftguard-mcp" rel="noopener noreferrer"&gt;driftguard-mcp&lt;/a&gt; to measure this in real time and expose the score as MCP tools you can call mid-session. This article covers why the problem is hard to detect, what signals actually predict it, and how the implementation works under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Context Degradation Is Hard to Notice
&lt;/h2&gt;

&lt;p&gt;The underlying mechanism is well-documented: academic benchmarks like NoLiMa (ICML 2025) show that at 32K tokens, 10 out of 12 models drop below 50% of their short-context performance — models that all claim to support at least 128K tokens. The same degradation pattern appears in coding sessions specifically. Engineers at Sourcegraph found Claude Code quality declining around 147,000–152,000 tokens, well before its advertised 200K limit. Practitioners running daily Claude Code and Cursor sessions have documented it starting as early as 20–40% context capacity. The failure mode is the same regardless of domain: the model doesn't error — it degrades.&lt;/p&gt;

&lt;p&gt;Output gets shorter. It starts paraphrasing things it said 30 messages ago. It hedges more, qualifies more, and corrects itself on minor points rather than reasoning forward. None of this looks obviously broken. The model is still responding. It's still generating code. It just isn't the same model you were talking to at message 12.&lt;/p&gt;

&lt;p&gt;The two most reliable signals are also the most invisible:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context saturation&lt;/strong&gt; accumulates incrementally. Each message pushes the window a little further. There's no threshold warning, no indicator. By the time you're at 88% token fill, the model has been operating under pressure for a while.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repetition&lt;/strong&gt; is equally invisible because developers don't read transcripts — they read current output. If the model recycled a code pattern from 20 messages ago, you'd have to actively compare to catch it.&lt;/p&gt;

&lt;p&gt;The result: most people notice something is wrong at message 60+, well after the session became unreliable.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/NQcMkPxkcho"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sources: &lt;a href="https://arxiv.org/abs/2502.05167" rel="noopener noreferrer"&gt;NoLiMa: Long-Context Evaluation Beyond Literal Matching&lt;/a&gt; (ICML 2025) · &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;Lost in the Middle&lt;/a&gt; (Liu et al., TACL 2024) · &lt;a href="https://www.turboai.dev/blog/claude-code-context-window-management" rel="noopener noreferrer"&gt;Why Claude Code Sessions Keep Dying&lt;/a&gt; · &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;Context Rot&lt;/a&gt; (Chroma Research)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Reading the Session Directly
&lt;/h2&gt;

&lt;p&gt;driftguard-mcp reads session files on disk rather than intercepting API calls. This has a few advantages: it requires no proxy layer, no API key, no modified toolchain. It just watches the same JSONL files the CLI produces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; writes session state to &lt;code&gt;~/.claude/projects/&amp;lt;hash&amp;gt;/&amp;lt;session-uuid&amp;gt;.jsonl&lt;/code&gt;. Each line is a typed message with role, content, and — critically — token counts from the API response. The &lt;code&gt;usage&lt;/code&gt; field includes &lt;code&gt;input_tokens&lt;/code&gt; and &lt;code&gt;cache_read_input_tokens&lt;/code&gt;, which together give an accurate picture of what the model actually processed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI&lt;/strong&gt; writes to &lt;code&gt;~/.gemini/tmp/&lt;/code&gt;. Its format uses &lt;code&gt;functionCall&lt;/code&gt; / &lt;code&gt;functionResponse&lt;/code&gt; pairs for tool use, which required a separate adapter to normalise into the shared message structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex CLI&lt;/strong&gt; uses &lt;code&gt;~/.codex/&lt;/code&gt; with &lt;code&gt;tool_calls&lt;/code&gt; / &lt;code&gt;role:tool&lt;/code&gt; format. Token counts aren't available here, so context saturation falls back to a character-based estimate with a calibration factor.&lt;/p&gt;
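&lt;p&gt;The fallback is a simple heuristic; the constants below are illustrative, not driftguard-mcp's actual calibration:&lt;/p&gt;

```javascript
// With no token counts in the session file, estimate from characters.
// Roughly 4 chars per token is a common English baseline; a calibration
// factor nudges it for code-heavy transcripts.
function estimateTokens(text, charsPerToken = 4, calibration = 1.0) {
  return Math.round((text.length / charsPerToken) * calibration);
}

estimateTokens("a".repeat(400)); // 100
```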

&lt;p&gt;All three adapters normalise to the same internal &lt;code&gt;ParsedMessage[]&lt;/code&gt; structure before scoring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ParsedMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;tokenCount&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// real API counts where available&lt;/span&gt;
  &lt;span class="nl"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One edge case worth noting: &lt;strong&gt;Claude Code compact boundaries&lt;/strong&gt;. When Claude compacts mid-session, pre-compaction messages are dropped from its active context. driftguard-mcp detects this boundary in the JSONL and drops the same messages from scoring — the score only reflects what Claude actually remembers, not the full conversation history on disk.&lt;/p&gt;
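&lt;p&gt;A sketch of the trimming step, assuming the adapter has already flagged the boundary entry (the &lt;code&gt;isCompactBoundary&lt;/code&gt; flag is a hypothetical name; the real marker detection is Claude-Code-specific):&lt;/p&gt;

```javascript
// Keep only messages after the most recent compaction boundary so the
// score reflects the model's live context, not the on-disk history.
function activeWindow(messages) {
  const lastBoundary = messages
    .map((m) => Boolean(m.isCompactBoundary))
    .lastIndexOf(true);
  return lastBoundary === -1 ? messages : messages.slice(lastBoundary + 1);
}
```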




&lt;h2&gt;
  
  
  The 6-Factor Scoring Model
&lt;/h2&gt;

&lt;p&gt;The composite drift score (0–100) is a weighted sum of six factors. The weights reflect signal reliability, not equal contribution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;Signal type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context saturation&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;td&gt;Quantitative — token fill %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repetition&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;td&gt;Statistical — 3-gram overlap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response length collapse&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;Trend — rolling window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal distance&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;Semantic — TF-IDF cosine similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uncertainty signals&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;td&gt;Lexical — explicit self-corrections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence drift&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;Lexical — hedging language trend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Context saturation and repetition dominate at 74% combined. This is intentional — they're the most direct, measurable predictors of degradation. The lexical signals (uncertainty, confidence drift) contribute noise-reduction rather than primary signal, which is why they're weighted at 3% combined.&lt;/p&gt;
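&lt;p&gt;The composite is then a straightforward weighted sum of the per-factor scores, each already normalised to 0–100. A sketch using the weights from the table (field names are illustrative):&lt;/p&gt;

```javascript
// Weights from the table above; each factor score is assumed to be 0-100.
const WEIGHTS = {
  saturation: 0.37,
  repetition: 0.37,
  lengthCollapse: 0.15,
  goalDistance: 0.08,
  uncertainty: 0.02,
  confidenceDrift: 0.01,
};

function driftScore(factors) {
  let total = 0;
  for (const [factor, weight] of Object.entries(WEIGHTS)) {
    total += weight * (factors[factor] ?? 0);
  }
  return Math.min(100, Math.round(total));
}

driftScore({ saturation: 80, repetition: 60 }); // 52
```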

&lt;h3&gt;
  
  
  Context Saturation (37%)
&lt;/h3&gt;

&lt;p&gt;For Claude and Gemini, token counts come directly from the API response metadata in the session file. The saturation score is a calibrated curve against the model's known context window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;contextSaturationScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tokenCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tokenCount&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Smooth ramp: low penalty below 50%, steep above 75%&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;240&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces near-zero scores in healthy sessions and rapidly climbing scores as fill approaches capacity — matching actual model behaviour, which degrades non-linearly near the limit.&lt;/p&gt;
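&lt;p&gt;To make the shape concrete, here are the curve's values at a few fill levels, with the function restated standalone (branches reordered, behaviour identical):&lt;/p&gt;

```javascript
// Same piecewise curve as above: gentle below 50% fill, steep past 75%.
function contextSaturationScore(tokenCount, maxTokens) {
  const fill = tokenCount / maxTokens;
  if (fill >= 0.75) return 40 + (fill - 0.75) * 240;
  if (fill >= 0.5) return 10 + (fill - 0.5) * 120;
  return fill * 20;
}

contextSaturationScore(50000, 200000);  // 25% fill: 5
contextSaturationScore(120000, 200000); // 60% fill: 22
contextSaturationScore(176000, 200000); // 88% fill: about 71
```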

&lt;h3&gt;
  
  
  Repetition (37%)
&lt;/h3&gt;

&lt;p&gt;Repetition is measured using a 3-gram sliding window across recent assistant responses. The algorithm extracts all 3-word sequences from the last N responses and measures overlap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractTrigrams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trigrams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;trigrams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;trigrams&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;repetitionScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ParsedMessage&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allTrigrams&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nf"&gt;extractTrigrams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)]);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allTrigrams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;overlapRatio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;allTrigrams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;overlapRatio&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// calibrated multiplier&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3-grams at this window size are reliable enough to catch genuine repetition without false positives from incidental shared vocabulary (e.g., variable names appearing across multiple messages).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool noise filtering&lt;/strong&gt;: Tool call messages — "Tool loaded.", "Calling bash...", etc. — are filtered from the user message stream before scoring. Without this, tool-heavy sessions score artificially high on repetition due to repeated tool invocation boilerplate.&lt;/p&gt;
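&lt;p&gt;A sketch of what that filter might look like; the two patterns are the post's own examples, and the real filter list is presumably longer:&lt;/p&gt;

```typescript
// Match tool-invocation boilerplate so it can be excluded before scoring.
// Patterns taken from the post's examples ("Tool loaded.", "Calling bash...");
// the actual list is an assumption.
function isToolNoise(content: string): boolean {
  return /^(Tool loaded\.|Calling )/.test(content.trim());
}
```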

&lt;h3&gt;
  
  
  Response Length Collapse (15%)
&lt;/h3&gt;

&lt;p&gt;As sessions degrade, responses get shorter. The model starts truncating explanations, omitting context it would have included earlier. This is a reliable secondary signal.&lt;/p&gt;

&lt;p&gt;The score measures the trend in response length across the last 15 assistant messages using a simple linear regression slope:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;lengthCollapseScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ParsedMessage&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;linearRegressionSlope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Negative slope = shrinking responses&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;slope&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
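&lt;p&gt;&lt;code&gt;linearRegressionSlope&lt;/code&gt; itself isn't shown in the post; a standard ordinary-least-squares slope over (index, length) pairs would look something like:&lt;/p&gt;

```typescript
// Ordinary least-squares slope of values against their indices 0..n-1.
// A sketch of the helper the post references, not its actual code.
function linearRegressionSlope(values: number[]): number {
  const n = values.length;
  if (2 > n) return 0;          // not enough points for a slope
  const meanX = (n - 1) / 2;    // mean of 0..n-1
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; n > i; i++) {
    num += (i - meanX) * (values[i] - meanY);
    den += (i - meanX) * (i - meanX);
  }
  return den === 0 ? 0 : num / den;
}
```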



&lt;h3&gt;
  
  
  Goal Distance (8%)
&lt;/h3&gt;

&lt;p&gt;This factor only activates when you pass a &lt;code&gt;goal&lt;/code&gt; string to &lt;code&gt;get_drift()&lt;/code&gt;. It measures vocabulary drift from your original objective using TF-IDF cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;get_drift({ goal: "implement JWT authentication with refresh token rotation" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal string is vectorised against recent assistant responses. As the session drifts from the original task — handling edge cases, going down tangents, responding to follow-up questions — cosine similarity to the goal string decreases.&lt;/p&gt;

&lt;p&gt;The threshold curve is calibrated so that similarity ≥ 0.5 returns a near-zero score, with penalty scaling steeply below 0.3. Without a &lt;code&gt;goal&lt;/code&gt; param, this factor returns 0 and its 8% weight is redistributed proportionally.&lt;/p&gt;
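&lt;p&gt;The exact vectorisation isn't shown. Stripped of the IDF weighting, the core term-frequency cosine similarity can be sketched as:&lt;/p&gt;

```typescript
// Plain term-frequency cosine similarity between two texts. The post's
// implementation also applies IDF weighting, which is omitted here.
function cosineSimilarity(a: string, b: string): number {
  const counts = (text: string) => {
    const m = new Map();
    for (const w of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      m.set(w, (m.get(w) ?? 0) + 1);
    }
    return m;
  };
  const va = counts(a);
  const vb = counts(b);
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (const [w, c] of va) { dot += c * (vb.get(w) ?? 0); na += c * c; }
  for (const c of vb.values()) nb += c * c;
  return na * nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}
```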

&lt;h3&gt;
  
  
  Uncertainty Signals (2%) and Confidence Drift (1%)
&lt;/h3&gt;

&lt;p&gt;These are intentionally low-weight. Uncertainty signals count explicit self-corrections ("I was wrong about", "let me correct that", "actually, I made an error") — not general hedging, which is too noisy. Confidence drift measures the trend in hedging language frequency (perhaps, might, could, I think) between the first third and last third of the session.&lt;/p&gt;

&lt;p&gt;Both factors were originally weighted higher. In practice, hedging language is too context-dependent — a research session is supposed to have more hedging — and self-corrections are too rare to contribute meaningful signal in most sessions. Keeping them at 3% combined means they can nudge a borderline score without ever dominating it.&lt;/p&gt;
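&lt;p&gt;As a sketch, the first-third versus last-third hedging comparison might look like this; the hedge list is the post's, but the structure is an assumption:&lt;/p&gt;

```typescript
// Compare hedging-word frequency between the earliest and latest third
// of a session's assistant responses. Positive return = hedging rising.
function confidenceDrift(responses: string[]): number {
  const hedgeRate = (texts: string[]) => {
    const joined = texts.join(" ").toLowerCase();
    const hits = (joined.match(/\b(perhaps|might|could|i think)\b/g) ?? []).length;
    const words = joined.split(/\s+/).filter(Boolean).length;
    return words === 0 ? 0 : hits / words;
  };
  const third = Math.max(1, Math.floor(responses.length / 3));
  const early = hedgeRate(responses.slice(0, third));
  const late = hedgeRate(responses.slice(-third));
  // Only an increase in hedging counts as drift.
  return Math.max(0, late - early);
}
```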




&lt;h2&gt;
  
  
  Score Thresholds and Output Design
&lt;/h2&gt;

&lt;p&gt;Scores map to four states:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0–29&lt;/td&gt;
&lt;td&gt;Fresh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30–60&lt;/td&gt;
&lt;td&gt;Warming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;61–80&lt;/td&gt;
&lt;td&gt;Drifting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;81–100&lt;/td&gt;
&lt;td&gt;Polluted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
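&lt;p&gt;The mapping is a straightforward threshold lookup over the ranges above:&lt;/p&gt;

```typescript
// Map a 0-100 composite drift score to its named state,
// per the ranges in the table above.
function driftState(score: number): string {
  if (score >= 81) return "Polluted";
  if (score >= 61) return "Drifting";
  if (score >= 30) return "Warming";
  return "Fresh";
}
```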

&lt;p&gt;The &lt;code&gt;get_drift()&lt;/code&gt; output leads with a plain-English recommendation rather than just the score. The score is a number — what most developers need is "should I start fresh right now or not":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⚠️  Start fresh now — context is full and responses are repeating heavily.

  Context depth         █████████░   88
  Repetition            ████████░░   72
  Length collapse       █████░░░░░   48

Score: 84/100 · 67 messages

→ Call get_handoff() to write handoff.md before starting fresh.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Factor bars only appear when they're contributing meaningfully to the score. A healthy session shows only the top two; a degraded session shows all contributing factors. This avoids surfacing noise in the common case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handoff trigger&lt;/strong&gt;: The suggestion to call &lt;code&gt;get_handoff()&lt;/code&gt; fires independently of the composite score — it triggers when context depth or repetition individually cross their thresholds. A session can have a composite score of 65 (drifting) and still get a handoff suggestion if repetition is at 78.&lt;/p&gt;
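&lt;p&gt;That independent trigger reduces to per-factor checks. The thresholds below are hypothetical; the post only establishes that repetition at 78 fires it while the composite score sits at 65:&lt;/p&gt;

```typescript
// Fire the handoff suggestion when any single factor crosses its own
// threshold, regardless of the composite score. Threshold values are
// assumptions for illustration.
function shouldSuggestHandoff(contextDepth: number, repetition: number): boolean {
  return contextDepth >= 85 || repetition >= 75;
}
```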




&lt;h2&gt;
  
  
  The Handoff Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;get_handoff()&lt;/code&gt; returns a structured prompt instructing the current AI session to write a &lt;code&gt;handoff.md&lt;/code&gt; file. The AI generates the file using its full session context — which, crucially, still exists even in a degraded session. The model may be repeating itself, but it still has access to everything it has done.&lt;/p&gt;

&lt;p&gt;A typical &lt;code&gt;handoff.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## What we accomplished&lt;/span&gt;
Implemented JWT authentication with refresh token rotation. Added middleware,
updated the user model, wrote integration tests. All tests passing.

&lt;span class="gu"&gt;## Current state&lt;/span&gt;
Auth flow is working end-to-end. Rate limiting is stubbed but not implemented.
The &lt;span class="sb"&gt;`/refresh`&lt;/span&gt; endpoint has a known edge case with concurrent requests (see auth.ts:142).

&lt;span class="gu"&gt;## Files modified&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; src/middleware/auth.ts — JWT verify + refresh logic
&lt;span class="p"&gt;-&lt;/span&gt; src/models/user.ts — added refreshToken field + index
&lt;span class="p"&gt;-&lt;/span&gt; src/routes/auth.ts — /login, /logout, /refresh endpoints
&lt;span class="p"&gt;-&lt;/span&gt; tests/integration/auth.test.ts — 14 new tests

&lt;span class="gu"&gt;## Open questions / next steps&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Implement rate limiting on /login (5 attempts per 15 min)
&lt;span class="p"&gt;-&lt;/span&gt; Fix concurrent refresh edge case
&lt;span class="p"&gt;-&lt;/span&gt; Add token blacklist for logout

&lt;span class="gu"&gt;## Context for next session&lt;/span&gt;
Using jsonwebtoken@9, refresh tokens stored in DB. Access token TTL: 15min,
Refresh TTL: 7 days.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load this at the start of the next session. You don't lose context — you lose the degraded session state while keeping the useful information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trend Tracking
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;get_trend()&lt;/code&gt; returns the full score history for the current session with a sparkline, peak, average, and trajectory annotation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session drift trend (18 snapshots)

  12 ▁▁▂▃▄▄▅▆▇▇█  84

  Peak: 84  ·  Avg: 47  ·  Trajectory: ↑ climbing

Snapshots: 12 → 18 → 24 → 31 → 38 → 42 → 51 → 58 → 63 → 70 → 76 → 84
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Snapshots are persisted to &lt;code&gt;~/.driftcli/history/&lt;/code&gt; as JSONL and survive session restarts. The sparkline starts appearing after 3 &lt;code&gt;get_drift()&lt;/code&gt; calls. Trend data is per-session, keyed by session UUID.&lt;/p&gt;
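&lt;p&gt;Rendering the sparkline itself reduces to bucketing each 0–100 score into one of eight block glyphs, roughly:&lt;/p&gt;

```typescript
// Eight block-drawing glyphs, lowest to highest.
const GLYPHS = "▁▂▃▄▅▆▇█";

// Bucket each 0-100 score into one glyph. A sketch, not the tool's code.
function sparkline(scores: number[]): string {
  return scores
    .map(s => GLYPHS[Math.min(7, Math.floor((s / 100) * 8))])
    .join("");
}
```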




&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;driftguard-mcp merges global (&lt;code&gt;~/.driftclirc&lt;/code&gt;) and per-project (&lt;code&gt;.driftcli&lt;/code&gt;) config. Presets adjust factor weights without requiring manual override:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Adjustment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;coding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Default weights — emphasises context depth and repetition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;research&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Weights goal distance more heavily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;brainstorm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Relaxes repetition and confidence drift penalties&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;strict&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Equal weight across all six factors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Custom weight overrides are supported on top of any preset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"preset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"coding"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"warnThreshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"weights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"repetition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
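&lt;p&gt;A plausible sketch of the merge, assuming project-level values win and the &lt;code&gt;weights&lt;/code&gt; object merges key-by-key (so a project can override a single factor without restating the rest):&lt;/p&gt;

```typescript
// Merge global (~/.driftclirc) and per-project (.driftcli) config.
// The key-by-key weights merge is an assumption based on the example
// above, which overrides only one factor.
function mergeConfig(globalCfg: any, projectCfg: any): any {
  return {
    ...globalCfg,
    ...projectCfg,
    weights: { ...(globalCfg.weights ?? {}), ...(projectCfg.weights ?? {}) },
  };
}
```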






&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; driftguard-mcp
driftguard-mcp setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;setup&lt;/code&gt; auto-configures Claude Code, Gemini CLI, Codex CLI, and Cursor. Restart your CLI after running it; the tools are live as soon as it relaunches.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;npm: &lt;a href="https://www.npmjs.com/package/driftguard-mcp" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/driftguard-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/jschoemaker/driftguard-mcp" rel="noopener noreferrer"&gt;https://github.com/jschoemaker/driftguard-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Current areas of active work: better token count estimation for Codex (the character-based fallback works but real counts would improve saturation accuracy), and a VSCode extension surface for teams that don't use CLI-first workflows.&lt;/p&gt;

&lt;p&gt;The core scoring algorithm is intentionally conservative — better to miss a drifting session than to cry wolf on healthy ones. If you're running sessions and find the thresholds too tight or too loose for your workflow, the config system is designed for exactly that.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>typescript</category>
      <category>claudeai</category>
    </item>
  </channel>
</rss>
