<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: delimitter</title>
    <description>The latest articles on Forem by delimitter (@delimitter_8b9077911a3848).</description>
    <link>https://forem.com/delimitter_8b9077911a3848</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1878195%2F7d6143ad-5ffb-4b2f-bd25-94c0e9b21b1d.png</url>
      <title>Forem: delimitter</title>
      <link>https://forem.com/delimitter_8b9077911a3848</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/delimitter_8b9077911a3848"/>
    <language>en</language>
    <item>
      <title>Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Fri, 03 Apr 2026 08:18:05 +0000</pubDate>
      <link>https://forem.com/delimitter_8b9077911a3848/type-guided-constrained-decoding-how-to-stop-llms-from-hallucinating-code-5hbc</link>
      <guid>https://forem.com/delimitter_8b9077911a3848/type-guided-constrained-decoding-how-to-stop-llms-from-hallucinating-code-5hbc</guid>
      <description>&lt;h2&gt;
  
  
  From GBNF Grammars to Type-Directed Generation: Guarantees Instead of Hope
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you've ever had ChatGPT generate code that doesn't compile, this article explains how constrained decoding can rule out whole classes of those errors by construction. All technical terms are explained in footnotes and in the glossary at the end.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In previous articles, we showed that reducing tokens saves money, energy, and compute. But there's a more serious problem: LLMs generate &lt;strong&gt;incorrect&lt;/strong&gt; code. And every retry adds another full generation's worth of tokens to the bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scale of the Problem
&lt;/h2&gt;

&lt;p&gt;Type errors account for 33.6% of all failures in LLM-generated code (Mündler et al., PLDI 2025&lt;sup id="fnref1"&gt;1&lt;/sup&gt;). These aren't typos — they're structural errors: wrong argument types, incompatible return values, accessing nonexistent fields.&lt;/p&gt;

&lt;p&gt;When an LLM generates a sort function that doesn't compile, the cost doubles — either a human fixes it (time) or an agent retries (tokens).&lt;/p&gt;

&lt;p&gt;But what if the model &lt;strong&gt;physically cannot&lt;/strong&gt; generate syntactically invalid code?&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Levels of Constraints
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Level 1: Grammar (Syntactic Correctness)
&lt;/h3&gt;

&lt;p&gt;At each generation step, the set of grammatically&lt;sup id="fnref2"&gt;2&lt;/sup&gt; valid tokens is determined. All others are masked — probability set to zero.&lt;/p&gt;

&lt;p&gt;Example: if the model just generated &lt;code&gt;[&lt;/code&gt;, then the next token can be a number, identifier, &lt;code&gt;]&lt;/code&gt;, or &lt;code&gt;[&lt;/code&gt; — but not &lt;code&gt;+&lt;/code&gt;, not &lt;code&gt;=&lt;/code&gt;, not &lt;code&gt;)&lt;/code&gt;.&lt;/p&gt;
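&lt;p&gt;The masking step can be sketched in a few lines. This is a toy illustration with an invented vocabulary, grammar rule, and logits; it is not any engine's real internals:&lt;/p&gt;

```python
import math

# Toy sketch of grammar-constrained masking. Vocabulary, the grammar
# rule, and the logit values are all invented for illustration.
logits = {"1": 2.0, "x": 1.5, "]": 0.5, "[": 0.1, "+": 3.0, "=": 2.5, ")": 1.0}

# After '[', the grammar allows a number, an identifier, ']' or '['.
allowed = {"1", "x", "]", "["}

# Mask: disallowed tokens get probability exactly zero, then the
# remaining probabilities are renormalized (a masked softmax).
weights = {t: (math.exp(l) if t in allowed else 0.0) for t, l in logits.items()}
z = sum(weights.values())
probs = {t: w / z for t, w in weights.items()}

assert probs["+"] == 0.0 and probs["="] == 0.0 and probs[")"] == 0.0
assert math.isclose(sum(probs.values()), 1.0)
```

&lt;p&gt;Note that &lt;code&gt;+&lt;/code&gt; had the highest raw logit; masking overrides the model's preference entirely, which is why the guarantee is absolute rather than statistical.&lt;/p&gt;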

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;XGrammar&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/strong&gt; — default backend in SGLang&lt;sup id="fnref4"&gt;4&lt;/sup&gt;, vLLM, TensorRT-LLM&lt;sup id="fnref5"&gt;5&lt;/sup&gt;. Works with context-free grammars (CFG&lt;sup id="fnref6"&gt;6&lt;/sup&gt;). Approaches zero overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outlines&lt;sup id="fnref7"&gt;7&lt;/sup&gt;&lt;/strong&gt; — structured generation via finite state machines (FSM&lt;sup id="fnref8"&gt;8&lt;/sup&gt;). Supports regex and CFG.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;sup id="fnref9"&gt;9&lt;/sup&gt;&lt;/strong&gt; — built-in GBNF grammar&lt;sup id="fnref10"&gt;10&lt;/sup&gt; support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guidance&lt;/strong&gt; (Microsoft) — template-based generation with constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: &lt;strong&gt;100% syntactic correctness.&lt;/strong&gt; Every generated fragment is a valid program.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2: Types (Semantic Correctness)
&lt;/h3&gt;

&lt;p&gt;Grammar guarantees that &lt;code&gt;f x = x + 1&lt;/code&gt; is syntactically valid. But not that &lt;code&gt;x&lt;/code&gt; is a number. Type-constrained decoding&lt;sup id="fnref11"&gt;11&lt;/sup&gt; adds a second layer: only tokens compatible with the current type context are allowed.&lt;/p&gt;

&lt;p&gt;Mündler et al. (PLDI 2025) showed that type-constrained decoding reduces compilation errors by &lt;strong&gt;74.8%&lt;/strong&gt;, versus a 9.0% reduction from syntax-only constraints.&lt;/p&gt;

&lt;p&gt;This requires type inference&lt;sup id="fnref12"&gt;12&lt;/sup&gt; — so the compiler can determine valid types at every generation point without explicit annotations.&lt;/p&gt;
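&lt;p&gt;The core idea can be sketched as filtering candidates against a typing environment. This is a deliberately minimal sketch with invented names; the real prefix-typing machinery in the PLDI 2025 work is far richer:&lt;/p&gt;

```python
# Minimal sketch of type-directed candidate filtering (invented example,
# not the actual PLDI 2025 implementation).
def candidates_for(expected_type, env):
    """Return identifiers in scope whose inferred type matches the
    type expected at the current generation point."""
    return sorted(name for name, ty in env.items() if ty == expected_type)

# Typing environment: variables in scope and their inferred types.
env = {"n": "Int", "msg": "String", "inc": "Int -> Int"}

# The model is completing `inc (` and the argument must be an Int,
# so only Int-typed identifiers survive the filter.
allowed = candidates_for("Int", env)
assert allowed == ["n"]   # "msg" is ruled out before sampling
```

&lt;p&gt;In a real system this filter runs over token prefixes, not whole identifiers, and the expected type comes from the inference engine rather than a hard-coded string.&lt;/p&gt;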

&lt;h3&gt;
  
  
  Level 3: Specification (Logical Correctness)
&lt;/h3&gt;

&lt;p&gt;The most powerful level: constraints based on formal specification. A sort function doesn't just have the right type — it actually sorts. This is an area of active research (dependent types, refinement types). Not yet in production tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  How XGrammar Works
&lt;/h2&gt;

&lt;p&gt;XGrammar's key optimization: &lt;strong&gt;splitting the vocabulary into two classes:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context-independent tokens (~80%+).&lt;/strong&gt; Validity determined at preprocessing, before generation. For each grammar state, a bitmask of valid tokens is precomputed. O(1) per token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context-dependent tokens (~20%).&lt;/strong&gt; Validity depends on the current PDA&lt;sup id="fnref13"&gt;13&lt;/sup&gt; state. Checked at runtime, but few in number.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;near-zero overhead.&lt;/strong&gt; Constrained decoding adds no measurable overhead to TPOT&lt;sup id="fnref14"&gt;14&lt;/sup&gt;.&lt;/p&gt;
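&lt;p&gt;A toy sketch of the two-class split (the grammar state, vocabulary, and split here are invented for illustration, not XGrammar internals):&lt;/p&gt;

```python
# Toy sketch of XGrammar-style vocabulary splitting. States, vocabulary,
# and the precomputed sets are invented for illustration.
VOCAB = ["0", "x", "]", "[", "+", "(", ")"]

# Context-independent tokens: validity per grammar state is computed
# once, before generation. XGrammar stores this as a bitmask over the
# vocabulary; a frozenset of token ids plays that role here.
PRECOMPUTED = {"after_lbracket": frozenset({0, 1, 2, 3})}

def is_allowed(state, token_id, runtime_check=None):
    mask = PRECOMPUTED.get(state)
    if mask is not None:
        return token_id in mask   # O(1) lookup, no parser work at runtime
    # Context-dependent tokens fall back to a runtime automaton check.
    return runtime_check(token_id)

assert is_allowed("after_lbracket", 0)       # "0" is valid after '['
assert not is_allowed("after_lbracket", 4)   # "+" is masked out
```

&lt;p&gt;Because the expensive work happens once at preprocessing, the per-token cost during generation is a single set-membership (or bit) test for the common case.&lt;/p&gt;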

&lt;h2&gt;
  
  
  BPE Misalignment Breaks Constrained Decoding
&lt;/h2&gt;

&lt;p&gt;This is where language design becomes critical.&lt;/p&gt;

&lt;p&gt;When a language grammar isn't aligned to BPE boundaries, constrained decoding faces the &lt;strong&gt;bridge token&lt;/strong&gt; problem — a BPE token spanning two grammatical symbols.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domino&lt;/strong&gt; (ICML 2024&lt;sup id="fnref15"&gt;15&lt;/sup&gt;) showed that bridge tokens distort the model's probability distribution. &lt;strong&gt;Grammar-Aligned Decoding&lt;/strong&gt; (NeurIPS 2024&lt;sup id="fnref16"&gt;16&lt;/sup&gt;) formalized the problem and proposed a fix — but with added overhead.&lt;/p&gt;

&lt;p&gt;If a language is designed so bridge tokens &lt;strong&gt;never arise&lt;/strong&gt; — every grammatical symbol coincides with one BPE token — the problem disappears entirely.&lt;/p&gt;
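&lt;p&gt;A concrete toy example of the bridge-token problem (the token list is invented; real BPE vocabularies differ):&lt;/p&gt;

```python
# Toy illustration of a bridge token. The token split below is invented
# for the example, not taken from a real BPE vocabulary.
# The grammar sees four symbols; BPE sees three tokens, one of which
# straddles a grammar boundary.
grammar_symbols = ['"', "name", '"', ":"]   # quote, identifier, quote, colon
bpe_tokens = ['"', "name", '":']            # '":' bridges quote and colon

# The bridge token '":' cannot be judged against a single grammar
# symbol: it simultaneously closes the string and opens the key-value
# separator, so the grammar engine must reason across two symbols.
assert "".join(grammar_symbols) == "".join(bpe_tokens)
assert len(bpe_tokens) != len(grammar_symbols)
```

&lt;p&gt;Masking by single grammar symbols either wrongly rejects such tokens or wrongly accepts them, which is exactly the distortion Domino measured.&lt;/p&gt;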

&lt;h2&gt;
  
  
  Deterministic CFG = Zero Overhead
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Nondeterministic CFG&lt;/strong&gt; — when parsing, multiple rules may apply. Requires backtracking&lt;sup id="fnref17"&gt;17&lt;/sup&gt;. Expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic CFG (DCFG)&lt;sup id="fnref18"&gt;18&lt;/sup&gt;&lt;/strong&gt; — exactly one rule applies at each step. Compiles to an FSM. No backtracking. No ambiguity.&lt;/p&gt;

&lt;p&gt;Tian et al. (CoLM 2024&lt;sup id="fnref19"&gt;19&lt;/sup&gt;) proved that for DCFGs, constrained decoding compiles in &lt;strong&gt;closed form&lt;/strong&gt; — overhead approaches zero.&lt;/p&gt;

&lt;p&gt;If a language has a DCFG grammar with BPE-aligned operators, constrained decoding is &lt;strong&gt;free&lt;/strong&gt;: zero overhead + zero bridge tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Practice: GBNF Grammar
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root    ::= program
program ::= (decl newline)* decl newline?
decl    ::= func-def | type-sig | type-def

func-def ::= lower-id ws (pattern ws)* "=" ws expr
cond-expr ::= "?" ws expr ws "-&amp;gt;" ws expr ws ":" ws expr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plugging into SGLang:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write factorial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ebnf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synoema.gbnf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Result: GUARANTEED syntactically valid code
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or llama.cpp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./main &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;--grammar-file&lt;/span&gt; synoema.gbnf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"-- Quicksort:"&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;--temp&lt;/span&gt; 0.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Economic Impact
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lever&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BPE-aligned grammar&lt;/td&gt;
&lt;td&gt;46% fewer tokens&lt;/td&gt;
&lt;td&gt;-46% direct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quadratic attention&lt;/td&gt;
&lt;td&gt;54% of the tokens → 29% of the compute&lt;/td&gt;
&lt;td&gt;-71% on attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Constrained decoding&lt;/td&gt;
&lt;td&gt;0 invalid code → 0 retries&lt;/td&gt;
&lt;td&gt;-10–30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type constraints&lt;/td&gt;
&lt;td&gt;-74.8% type errors&lt;/td&gt;
&lt;td&gt;-5–15% additional&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Combined: &lt;strong&gt;50–70%&lt;/strong&gt; savings in cost and energy vs unoptimized Python generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In the next article, we'll introduce &lt;strong&gt;Synoema&lt;/strong&gt; — a language with all three levers: BPE-aligned grammar (33 single-token operators), Hindley-Milner&lt;sup id="fnref20"&gt;20&lt;/sup&gt; type inference, and Cranelift&lt;sup id="fnref21"&gt;21&lt;/sup&gt; JIT for native speed.&lt;/p&gt;







&lt;h2&gt;
  
  
  Series: Token Economics of Code
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/delimitter_8b9077911a3848/why-every-token-costs-more-than-you-think-3i00"&gt;Why Every Token Costs More Than You Think&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/delimitter_8b9077911a3848/the-anatomy-of-bpe-why-python-wastes-46-of-tokens-4e0k"&gt;The Anatomy of BPE: Why Python Wastes 46% of Tokens&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code ← &lt;em&gt;you are here&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Hindley-Milner for LLMs: Type Inference Without Annotations &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Show HN: Synoema — The First Programming Language Designed for LLMs &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The Future of Code Generation: From Prompts to Compilation &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Technology forbidding invalid tokens during generation. Guarantees correctness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XGrammar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Constrained decoding engine, de facto standard for LLM inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SGLang&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source LLM inference engine from UC Berkeley&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source LLM inference engine with memory optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TensorRT-LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA's inference engine, fastest on their GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GBNF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grammar description format for llama.cpp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Popular project for running LLMs on commodity hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CFG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context-Free Grammar — formal grammar describing language syntax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DCFG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deterministic CFG — unambiguous parsing, enables zero-overhead constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FSM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Finite State Machine — model for fast token validity checking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pushdown Automaton — FSM with a stack for nested structures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TPOT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time Per Output Token — main LLM inference speed metric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bridge token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BPE token spanning two grammar symbol boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatic type deduction by the compiler, no annotations needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hindley-Milner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Classic algorithm for complete type inference without annotations. Used in Haskell, OCaml&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cranelift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rust-based compiler backend. Fast JIT compilation to native code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PLDI / ICML / NeurIPS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Top academic conferences on PL, ML, and AI respectively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backtracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parsing by trial-and-error with rollbacks. Slow but universal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;strong&gt;PLDI (Programming Language Design and Implementation)&lt;/strong&gt; — one of the most prestigious academic conferences on programming languages. Papers undergo rigorous peer review. If a result is published at PLDI, it's trustworthy. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;strong&gt;Formal grammar&lt;/strong&gt; — a set of rules describing which sequences of symbols are valid in a language. For example, "after &lt;code&gt;[&lt;/code&gt;, the next symbol can be a number, identifier, or &lt;code&gt;]&lt;/code&gt;" is part of a grammar. Python has a complex grammar (hundreds of rules); JSON has a simple one (~10 rules). ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;strong&gt;XGrammar&lt;/strong&gt; — constrained decoding engine from the MLC-AI team. The de facto standard for LLM inference. Its key innovation: splitting the vocabulary into "easy" tokens (80%+, checked at preprocessing) and "hard" tokens (20%, checked at runtime), yielding near-zero overhead. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;strong&gt;SGLang&lt;/strong&gt; — open-source LLM inference engine from UC Berkeley. One of the fastest ways to serve LLMs. Supports constrained decoding via XGrammar out of the box. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;strong&gt;TensorRT-LLM&lt;/strong&gt; — NVIDIA's inference engine, optimized for their GPUs. Fastest on NVIDIA hardware, but complex to set up. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;strong&gt;CFG (Context-Free Grammar)&lt;/strong&gt; — a class of grammars where each rule has the form "symbol → sequence of symbols." Most programming languages are described by CFGs. JSON, XML, HTML, Python, JavaScript all have CFGs. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;strong&gt;Outlines&lt;/strong&gt; — open-source library for structured generation. Compiles a grammar or regex into a finite state machine that filters tokens on the fly. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;&lt;strong&gt;FSM (Finite State Machine)&lt;/strong&gt; — a mathematical model that at any moment is in one of a finite number of states and transitions between them on input symbols. Used for fast checking of whether the next token is valid. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;&lt;strong&gt;llama.cpp&lt;/strong&gt; — the most popular open-source project for running LLMs on commodity hardware (CPU, Mac M1/M2, budget GPUs). Written in C++, works without Python. Supports GBNF grammars for constrained decoding. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn10"&gt;
&lt;p&gt;&lt;strong&gt;GBNF (GGML BNF)&lt;/strong&gt; — grammar description format used in llama.cpp. Extension of standard Backus-Naur Form. Example: &lt;code&gt;expr ::= number | expr "+" expr&lt;/code&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn11"&gt;
&lt;p&gt;&lt;strong&gt;Type-constrained decoding&lt;/strong&gt; — an extension of constrained decoding that checks types in addition to grammar. If a function expects &lt;code&gt;Int&lt;/code&gt;, the model can't substitute &lt;code&gt;String&lt;/code&gt;. Requires type inference — automatic type deduction by the compiler. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn12"&gt;
&lt;p&gt;&lt;strong&gt;Type inference&lt;/strong&gt; — the compiler's ability to determine types of all expressions automatically, without programmer annotations. Instead of &lt;code&gt;int add(int x, int y)&lt;/code&gt; (as in C), you write just &lt;code&gt;add x y = x + y&lt;/code&gt;, and the compiler deduces &lt;code&gt;Int → Int → Int&lt;/code&gt;. The best-known algorithm for complete type inference is Hindley-Milner, used in Haskell and ML. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn13"&gt;
&lt;p&gt;&lt;strong&gt;PDA (Pushdown Automaton)&lt;/strong&gt; — an extension of a finite state machine with a stack. Needed for grammars with nesting (brackets, code blocks). A regular FSM can't count brackets — a PDA can. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn14"&gt;
&lt;p&gt;&lt;strong&gt;TPOT (Time Per Output Token)&lt;/strong&gt; — the time to generate one output token. The main metric for LLM inference speed. For GPT-4: ~20–30 ms; for small models on powerful GPUs: 5–10 ms. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn15"&gt;
&lt;p&gt;&lt;strong&gt;ICML (International Conference on Machine Learning)&lt;/strong&gt; — one of the top three ML conferences (with NeurIPS and ICLR). Publication at ICML signals high-quality research. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn16"&gt;
&lt;p&gt;&lt;strong&gt;NeurIPS (Neural Information Processing Systems)&lt;/strong&gt; — the largest AI/ML conference. ~15,000 attendees annually. Publication at NeurIPS is the gold standard for ML research. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn17"&gt;
&lt;p&gt;&lt;strong&gt;Backtracking&lt;/strong&gt; — a parsing method where the parser tries one rule, and if it fails, backtracks and tries another. Slow because the same text may be parsed multiple times. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn18"&gt;
&lt;p&gt;&lt;strong&gt;DCFG (Deterministic Context-Free Grammar)&lt;/strong&gt; — a subclass of CFG where parsing is unambiguous at every step. Compiles to an efficient automaton. Most real programming languages are DCFGs (or close). Python's indentation-sensitive syntax (the offside rule) isn't context-free as raw text, but once the lexer turns indentation into INDENT/DEDENT tokens, the grammar parses deterministically. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn19"&gt;
&lt;p&gt;&lt;strong&gt;CoLM (Conference on Language Modeling)&lt;/strong&gt; — a newer academic conference dedicated to research on language models, first held in 2024. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn20"&gt;
&lt;p&gt;&lt;strong&gt;Hindley-Milner&lt;/strong&gt; — a classic algorithm for complete automatic type inference, developed across the 1960s–80s. Allows the compiler to determine types of all expressions &lt;strong&gt;without a single annotation&lt;/strong&gt;. Used in Haskell, OCaml, F#, Elm. Detailed in the fifth article. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn21"&gt;
&lt;p&gt;&lt;strong&gt;Cranelift&lt;/strong&gt; — a compiler backend written in Rust. Converts intermediate representation (IR) to native machine code (x86-64, ARM). Alternative to LLVM: compiles 10× faster, though generated code is ~14% slower. Ideal for JIT compilation where compilation speed matters more than peak optimization. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>codequality</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Anatomy of BPE: Why Python Wastes 46% of Tokens</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Thu, 02 Apr 2026 15:11:31 +0000</pubDate>
      <link>https://forem.com/delimitter_8b9077911a3848/the-anatomy-of-bpe-why-python-wastes-46-of-tokens-4e0k</link>
      <guid>https://forem.com/delimitter_8b9077911a3848/the-anatomy-of-bpe-why-python-wastes-46-of-tokens-4e0k</guid>
      <description>&lt;h2&gt;
  
  
  How BPE Tokenization Works and What It Means for Language Design
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you want to understand how ChatGPT "sees" your code and why the same program costs different amounts in different languages — read on. All terms explained in footnotes and the glossary at the end.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In the previous article, we established that inference cost grows quadratically with token count. The natural question: can we reduce token count without losing semantics?&lt;/p&gt;

&lt;p&gt;To answer that, we need to understand how LLMs see code. Not as text — as a sequence of tokens. And between how a programmer sees &lt;code&gt;def factorial(n):&lt;/code&gt; and how GPT-4 sees it, there's a chasm.&lt;/p&gt;

&lt;h2&gt;
  
  
  How BPE Works
&lt;/h2&gt;

&lt;p&gt;BPE (Byte Pair Encoding)&lt;sup id="fnref1"&gt;1&lt;/sup&gt; is the algorithm that converts text into sequences of integers (tokens). It underlies all modern LLMs: GPT-4 uses the cl100k_base&lt;sup id="fnref2"&gt;2&lt;/sup&gt; vocabulary, Claude uses a modified BPE, and Llama uses SentencePiece&lt;sup id="fnref3"&gt;3&lt;/sup&gt; BPE.&lt;/p&gt;

&lt;p&gt;The algorithm is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with an alphabet of individual bytes (256 possible values).&lt;/li&gt;
&lt;li&gt;Find the most frequent pair of adjacent symbols in the corpus&lt;sup id="fnref4"&gt;4&lt;/sup&gt;.&lt;/li&gt;
&lt;li&gt;Create a new symbol for that pair and add it to the vocabulary.&lt;/li&gt;
&lt;li&gt;Repeat steps 2–3 until the desired vocabulary size (~100K).&lt;/li&gt;
&lt;/ol&gt;
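&lt;p&gt;The four steps above fit in a short training loop. This is an educational sketch on a tiny corpus; real tokenizers add byte fallback, regex pre-splitting, and train on billions of words:&lt;/p&gt;

```python
from collections import Counter

def merge_word(word, pair, symbol):
    # Replace each occurrence of `pair` in `word` with the merged symbol.
    out, i = [], 0
    while i != len(word):
        if i + 1 != len(word) and (word[i], word[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    # Step 1: start from individual characters (stand-ins for bytes).
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs across the corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        # Step 3: create a new symbol for the most frequent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [merge_word(w, best, "".join(best)) for w in words]
        # Step 4: the loop repeats until num_merges is reached.
    return merges

merges = train_bpe(["low", "lower", "lowest"] * 3, num_merges=2)
assert merges == [("l", "o"), ("lo", "w")]   # "lo", then "low"
```

&lt;p&gt;On this toy corpus the first two merges already produce a &lt;code&gt;low&lt;/code&gt; subword, which is exactly how frequent stems end up as single tokens in real vocabularies.&lt;/p&gt;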

&lt;p&gt;The result is a vocabulary of ~100,000 "subwords" of variable length. Short, frequent words (&lt;code&gt;the&lt;/code&gt;, &lt;code&gt;is&lt;/code&gt;, &lt;code&gt;def&lt;/code&gt;) encode as a single token. Rare words get split: &lt;code&gt;tokenization&lt;/code&gt; → &lt;code&gt;token&lt;/code&gt; + &lt;code&gt;ization&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The critical property:&lt;/strong&gt; BPE vocabularies are trained mostly on &lt;strong&gt;natural language&lt;/strong&gt;, with code only a minority of the corpus. So they're optimized for English prose, not Python syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Misalignment Problem
&lt;/h2&gt;

&lt;p&gt;The grammatical units of a programming language — operators, keywords, delimiters — &lt;strong&gt;don't align&lt;/strong&gt; with BPE token boundaries. This creates two types of waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 1: Redundant tokens on syntax.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take a simple Python function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;factorial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BPE (cl100k_base) splits this into &lt;strong&gt;29 tokens&lt;/strong&gt;. Semantically significant: &lt;code&gt;factorial&lt;/code&gt;, &lt;code&gt;n&lt;/code&gt;, &lt;code&gt;0&lt;/code&gt;, &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;-&lt;/code&gt;. The remaining 23 tokens are syntactic overhead: &lt;code&gt;def&lt;/code&gt;, spaces, &lt;code&gt;(&lt;/code&gt;, &lt;code&gt;)&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;if&lt;/code&gt;, &lt;code&gt;==&lt;/code&gt;, &lt;code&gt;return&lt;/code&gt; (twice), indentation, newlines.&lt;/p&gt;

&lt;p&gt;The equivalent program in a minimal-syntax functional language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight haskell"&gt;&lt;code&gt;&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;16 tokens.&lt;/strong&gt; Same semantics. 45% fewer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 2: Bridge tokens&lt;sup id="fnref5"&gt;5&lt;/sup&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes a single BPE token spans two grammatical symbols. For example, &lt;code&gt;"name"&lt;/code&gt; in JSON may become one token, even though grammatically it's space + quote + identifier + quote. This creates problems for constrained decoding&lt;sup id="fnref6"&gt;6&lt;/sup&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark: 12 Programs, 3 Languages
&lt;/h2&gt;

&lt;p&gt;I compared token counts for equivalent programs in three languages: Python, Haskell&lt;sup id="fnref7"&gt;7&lt;/sup&gt;, and an optimized language where every operator is exactly one BPE token.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Program&lt;/th&gt;
&lt;th&gt;Optimized&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;th&gt;Haskell&lt;/th&gt;
&lt;th&gt;Saving vs Python&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Factorial&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;45%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Map&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QuickSort&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;39%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FizzBuzz&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filter&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fibonacci&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;43%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sum&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Length&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reverse&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compose&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maximum&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zip&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;332&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;615&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;373&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;46%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The optimized language uses 46% fewer tokens than Python&lt;/strong&gt;, and 11% fewer than Haskell.&lt;/p&gt;

&lt;p&gt;Given the quadratic cost of attention, 46% fewer tokens translates into roughly &lt;strong&gt;71% less computation&lt;/strong&gt; in the attention layers (0.54² ≈ 0.29 of the original cost).&lt;/p&gt;
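&lt;p&gt;The arithmetic behind that figure can be checked in a few lines. A minimal sketch, assuming attention cost scales exactly as n² (the function name is just for illustration):&lt;/p&gt;

```python
# Back-of-envelope check of the quadratic-attention savings claim,
# assuming attention cost scales as n^2 in sequence length n.
def attention_savings(token_reduction: float) -> float:
    """Fraction of attention compute saved when token count drops by `token_reduction`."""
    remaining = 1.0 - token_reduction   # e.g. 0.54 of the tokens remain
    return 1.0 - remaining ** 2         # quadratic cost => 1 - 0.54^2

print(f"{attention_savings(0.46):.0%}")  # prints "71%"
```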

&lt;h2&gt;
  
  
  Where the Savings Come From
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Function Definition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: 6 tokens of boilerplate
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="c1"&gt;# Optimized: 0 boilerplate tokens
&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conditional
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: 6 overhead tokens
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="c1"&gt;# Optimized: 3 overhead tokens
&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lists
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# 9 tokens (commas are waste)
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;       &lt;span class="c1"&gt;# 7 tokens (no commas needed)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pattern Matching&lt;sup id="fnref8"&gt;8&lt;/sup&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python: 29 tokens
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;fac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Optimized: 16 tokens
&lt;/span&gt;&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;fac&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;fac &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  BPE-Aligned Grammar
&lt;/h2&gt;

&lt;p&gt;The savings above aren't just "terse syntax." They're achieved through deliberate grammar design accounting for BPE tokenizer properties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The BPE-aligned grammar&lt;sup id="fnref9"&gt;9&lt;/sup&gt; principle:&lt;/strong&gt; every language operator must be exactly one BPE token.&lt;/p&gt;

&lt;p&gt;For the optimized language, all 33 operators were verified — each encodes to exactly 1 BPE token on cl100k_base (GPT-4/Claude) and o200k_base&lt;sup id="fnref10"&gt;10&lt;/sup&gt; (GPT-4o):&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Two chars, 1 token:    -&amp;gt; &amp;lt;- |&amp;gt; ++ &amp;gt;&amp;gt; == != &amp;lt;= &amp;gt;= &amp;amp;&amp;amp; || ..
One char, 1 token:     ? : . = @ | \ _ , + - * / % &amp;lt; &amp;gt;
Delimiters, 1 token:   ( ) [ ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't coincidence — it's a &lt;strong&gt;design constraint&lt;/strong&gt;. If an operator doesn't fit in one BPE token, it gets replaced by one that does.&lt;/p&gt;
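&lt;p&gt;To make the mechanics concrete, here is a toy from-scratch sketch of the BPE merge loop. This is not tiktoken or the real cl100k_base training procedure (those run on billions of tokens); it only illustrates why an operator that appears frequently in the corpus, like &lt;code&gt;-&amp;gt;&lt;/code&gt;, quickly becomes a single vocabulary symbol:&lt;/p&gt;

```python
from collections import Counter

def bpe_merges(corpus: str, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair.
    A simplified sketch of the algorithm real tokenizers are trained with."""
    seq = list(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge across the working sequence.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

# "->" is the most frequent pair in this tiny corpus,
# so it becomes a single symbol on the very first merge.
merges, seq = bpe_merges("a->b c->d e->f", 3)
print(merges[0])  # prints "->"
```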

&lt;h2&gt;
  
  
  What This Means for LLMs
&lt;/h2&gt;

&lt;p&gt;When an LLM generates code in the optimized language instead of Python, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generates 46% fewer tokens (faster, cheaper)&lt;/li&gt;
&lt;li&gt;spends 71% less on attention (larger codebases fit in context)&lt;/li&gt;
&lt;li&gt;creates no bridge tokens (cleaner constrained decoding)&lt;/li&gt;
&lt;li&gt;can't "babble" (minimal syntax prevents bloat)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In the next article, we'll look at &lt;strong&gt;constrained decoding&lt;/strong&gt; — the technology that guarantees 100% syntactic correctness. And we'll show why BPE-aligned grammar makes constrained decoding &lt;strong&gt;free&lt;/strong&gt;.&lt;/p&gt;







&lt;h2&gt;
  
  
  Series: Token Economics of Code
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/delimitter_8b9077911a3848/why-every-token-costs-more-than-you-think-3i00"&gt;Why Every Token Costs More Than You Think&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The Anatomy of BPE: Why Python Wastes 46% of Tokens ← &lt;strong&gt;you are here&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Hindley-Milner for LLMs: Type Inference Without Annotations &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Show HN: Synoema — The First Programming Language Designed for LLMs &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The Future of Code Generation: From Prompts to Compilation &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BPE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Byte Pair Encoding — algorithm splitting text into tokens by merging frequent character pairs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cl100k_base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4/Claude's BPE vocabulary with ~100K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;o200k_base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4o's newer BPE vocabulary with ~200K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SentencePiece&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google's BPE alternative used in Llama and open models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Corpus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive text dataset for training BPE and LLMs (web, books, code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bridge token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BPE token spanning the boundary of two grammar symbols&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BPE-aligned grammar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grammar where every operator = exactly 1 BPE token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pattern matching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Defining functions by examples: &lt;code&gt;fac 0 = 1&lt;/code&gt; instead of &lt;code&gt;if n == 0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Technology forbidding invalid tokens during generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Haskell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Functional language with minimal syntax; used here as the brevity benchmark&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;strong&gt;BPE (Byte Pair Encoding)&lt;/strong&gt; — a text compression algorithm invented in 1994 and adapted for LLMs. The idea: find the most frequent pairs of characters in a huge text corpus and merge them into a new symbol. Repeat ~100,000 times. The result is a vocabulary of "subwords" that the model thinks in. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;strong&gt;cl100k_base&lt;/strong&gt; — the specific BPE token vocabulary used by GPT-4 and Claude. Contains ~100,000 tokens. Trained primarily on English internet text. GPT-4o uses a newer vocabulary called o200k_base with ~200,000 tokens. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;strong&gt;SentencePiece&lt;/strong&gt; — Google's alternative BPE implementation, used in Llama and other open models. Works at the Unicode character level instead of bytes, which is better for non-English languages. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;strong&gt;Corpus&lt;/strong&gt; — the massive text dataset used to train the BPE vocabulary (and the LLM itself). Includes web pages, books, articles, GitHub code. Usually hundreds of billions of tokens. Code makes up only ~5–15% of a typical corpus — which is why BPE is optimized for English prose, not Python syntax. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;strong&gt;Bridge token&lt;/strong&gt; — a BPE token that spans the boundary between two grammatical symbols. For example, BPE might merge a space and keyword &lt;code&gt;if&lt;/code&gt; into one token. This creates problems for constrained decoding engines, which must "split" the token, distorting the model's probability distribution. More details in the third article. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt; — technology that forbids invalid tokens at each generation step. Guarantees syntactically valid output. Covered in detail in the third article. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;strong&gt;Haskell&lt;/strong&gt; — a functional programming language with minimal syntax. Used as the "brevity benchmark" among existing languages. &lt;code&gt;fac 0 = 1&lt;/code&gt; in Haskell and the optimized language look nearly identical, but the optimized language additionally accounts for BPE boundaries. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;&lt;strong&gt;Pattern matching&lt;/strong&gt; — defining a function by "examples." Instead of &lt;code&gt;if n == 0: return 1&lt;/code&gt;, you write &lt;code&gt;fac 0 = 1&lt;/code&gt; — literally: "factorial of zero is one." The compiler generates the check automatically. Shorter, clearer, and eliminates an entire class of errors. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;&lt;strong&gt;BPE-aligned grammar&lt;/strong&gt; — a language design principle where every operator, keyword, and delimiter encodes to exactly one BPE token. This means: no "wasted" tokens on syntax and no bridge tokens. Conventional languages don't account for BPE — they were created long before LLMs. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn10"&gt;
&lt;p&gt;&lt;strong&gt;o200k_base&lt;/strong&gt; — the newer BPE vocabulary used by GPT-4o. Contains ~200,000 tokens (twice cl100k_base). Better coverage of code and non-English languages, but same underlying principles. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>science</category>
    </item>
    <item>
      <title>Why Every Token Costs More Than You Think</title>
      <dc:creator>delimitter</dc:creator>
      <pubDate>Thu, 02 Apr 2026 15:07:06 +0000</pubDate>
      <link>https://forem.com/delimitter_8b9077911a3848/why-every-token-costs-more-than-you-think-3i00</link>
      <guid>https://forem.com/delimitter_8b9077911a3848/why-every-token-costs-more-than-you-think-3i00</guid>
      <description>&lt;h1&gt;
  
  
  Why Every Token Costs More Than You Think
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Quadratic Price of Attention: How Context Length Is Killing Your AI Budget
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for.&lt;/strong&gt; If you use ChatGPT, Claude, Copilot, or Cursor to write code, this article explains why the same tasks can cost 2–4× less. No technical background required — all terms are explained inline and in the glossary at the end.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;When you ask Claude or GPT to write a sorting function, the model generates ~50 tokens&lt;sup id="fnref1"&gt;1&lt;/sup&gt; per second. Each token costs fractions of a cent. Seems cheap.&lt;/p&gt;

&lt;p&gt;But behind that simplicity lies an engineering reality most people overlook: the cost of each token grows &lt;strong&gt;quadratically&lt;/strong&gt; with context length&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. If you're working with codebases spanning thousands of lines, this quadratic relationship transforms from a theoretical abstraction into a line item that can double your AI budget.&lt;/p&gt;

&lt;p&gt;In this article, I'll show where this cost comes from, why inference — not training — is the dominant consumer of resources, and what can be done about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inference Consumes 90%+ of All Energy
&lt;/h2&gt;

&lt;p&gt;There's a common misconception: the major cost of LLMs&lt;sup id="fnref3"&gt;3&lt;/sup&gt; is training. Training GPT-4 reportedly cost $50–100M. An impressive number.&lt;/p&gt;

&lt;p&gt;But training is a one-time capital expense. Inference&lt;sup id="fnref4"&gt;4&lt;/sup&gt; is an ongoing operational cost that occurs with every request, every second, for every user.&lt;/p&gt;

&lt;p&gt;According to AWS, inference consumes more than 90% of total energy in the LLM lifecycle. The AI inference market is valued at $106 billion in 2025, projected to exceed $250 billion by 2030 at a 19.2% compound annual growth rate.&lt;/p&gt;

&lt;p&gt;Every token ChatGPT generates costs OpenAI approximately $0.00012. Sounds negligible. But at billions of daily requests, this adds up to hundreds of millions of dollars per year — and terajoules of electricity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quadratic Trap
&lt;/h2&gt;

&lt;p&gt;Here's the key fact that changes everything.&lt;/p&gt;

&lt;p&gt;In a standard transformer&lt;sup id="fnref5"&gt;5&lt;/sup&gt; with self-attention&lt;sup id="fnref6"&gt;6&lt;/sup&gt;, the computational cost of processing a sequence of &lt;em&gt;n&lt;/em&gt; tokens is:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost(n) = O(n² · d)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;em&gt;d&lt;/em&gt; is the model dimension. This is not a linear relationship. It's &lt;strong&gt;quadratic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What this means in practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Relative attention cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000 tokens&lt;/td&gt;
&lt;td&gt;1× (baseline)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,000 tokens&lt;/td&gt;
&lt;td&gt;4×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4,000 tokens&lt;/td&gt;
&lt;td&gt;16×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8,000 tokens&lt;/td&gt;
&lt;td&gt;64×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32,000 tokens&lt;/td&gt;
&lt;td&gt;1,024×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Doubling the context increases attention cost &lt;strong&gt;fourfold&lt;/strong&gt;, not twofold. This means reducing context by 50% saves not 50%, but &lt;strong&gt;75%&lt;/strong&gt; of attention computation.&lt;/p&gt;

&lt;p&gt;When a developer sends a 2,000-line Python file (~8,000 tokens) to an LLM for refactoring, the attention cost is 64× higher than for a simple 1,000-token question. And that's just one request.&lt;/p&gt;
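&lt;p&gt;The table above can be reproduced with a tiny helper, assuming cost scales exactly as n² relative to a 1,000-token baseline (real kernels add constant factors, but the scaling is the point):&lt;/p&gt;

```python
# Relative attention cost versus a 1,000-token baseline, assuming cost ~ n^2.
BASELINE = 1_000

def relative_cost(n_tokens: int) -> float:
    return (n_tokens / BASELINE) ** 2

for n in (1_000, 2_000, 4_000, 8_000, 32_000):
    print(f"{n:>6} tokens: {relative_cost(n):>6.0f}x")
```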

&lt;h2&gt;
  
  
  Real Money
&lt;/h2&gt;

&lt;p&gt;Let's calculate for a typical team.&lt;/p&gt;

&lt;p&gt;A team of 10 developers uses an AI assistant (Cursor, Copilot, Claude Code). Each makes an average of 100 requests per day. Average request context: 2,000 input tokens. Average response: 500 output tokens.&lt;/p&gt;

&lt;p&gt;At Claude Sonnet 4 pricing ($3/M input, $15/M output):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:  10 × 100 × 2,000 = 2M tokens/day × $3/M  = $6/day
Output: 10 × 100 × 500   = 500K tokens/day × $15/M = $7.50/day
Total: ~$13.50/day = ~$405/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now imagine expressing the same programs with 46% fewer tokens (I'll show in the next article that this is achievable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:  2M × 0.54 = 1.08M tokens/day × $3/M  = $3.24/day
Output: 500K × 0.54 = 270K tokens/day × $15/M = $4.05/day
Total: ~$7.29/day = ~$219/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Savings: &lt;strong&gt;$186/month&lt;/strong&gt; for 10 people, or &lt;strong&gt;$2,200/year&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For 100 developers: &lt;strong&gt;$22,000/year&lt;/strong&gt;. For 1,000: &lt;strong&gt;$220,000&lt;/strong&gt;. And this is a conservative estimate with a relatively affordable model and moderate workload.&lt;/p&gt;
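&lt;p&gt;Both estimates can be folded into one small calculator. The function name and the 30-day month are my assumptions; the prices are the list prices quoted above:&lt;/p&gt;

```python
def monthly_cost(devs, requests_per_day, in_tokens, out_tokens,
                 in_price_per_m=3.0, out_price_per_m=15.0, days=30):
    """Monthly API spend under the article's assumptions
    ($3/M input, $15/M output, 30-day month)."""
    daily_in = devs * requests_per_day * in_tokens
    daily_out = devs * requests_per_day * out_tokens
    daily = daily_in / 1e6 * in_price_per_m + daily_out / 1e6 * out_price_per_m
    return daily * days

baseline = monthly_cost(10, 100, 2_000, 500)                # Python-sized requests
compact = monthly_cost(10, 100, 2_000 * 0.54, 500 * 0.54)   # 46% fewer tokens
print(round(baseline), round(compact), round(baseline - compact))  # prints "405 219 186"
```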

&lt;h2&gt;
  
  
  The Energy Dimension
&lt;/h2&gt;

&lt;p&gt;Measurements on LLaMA-65B&lt;sup id="fnref7"&gt;7&lt;/sup&gt; (A100 GPUs&lt;sup id="fnref8"&gt;8&lt;/sup&gt;) show energy consumption in the range of 3–4 joules per output token. On modern H100s with optimized inference engines like vLLM&lt;sup id="fnref9"&gt;9&lt;/sup&gt;, efficiency has improved roughly 10×, down to ~0.39 J per token. But usage scale has grown even faster.&lt;/p&gt;

&lt;p&gt;ChatGPT processes an estimated one billion requests daily. At an average response of 500 tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1B requests × 500 tokens × 0.39 J ≈ 195 GJ/day ≈ 54,000 kWh/day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the energy consumption of a small town — every single day. Reducing token count isn't just about saving money. It's a direct reduction in energy consumption and carbon footprint.&lt;/p&gt;
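&lt;p&gt;The unit conversion behind that estimate is worth spelling out (variable names are illustrative; the 0.39 J/token figure is the H100 measurement cited above):&lt;/p&gt;

```python
# Reproducing the estimate: 1B requests/day x 500 tokens x 0.39 J/token.
requests_per_day = 1e9
tokens_per_response = 500
joules_per_token = 0.39

daily_joules = requests_per_day * tokens_per_response * joules_per_token
daily_gj = daily_joules / 1e9       # gigajoules
daily_kwh = daily_joules / 3.6e6    # 1 kWh = 3.6 MJ

print(f"{daily_gj:.0f} GJ/day, {daily_kwh:,.0f} kWh/day")
```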

&lt;h2&gt;
  
  
  The Babbling Problem
&lt;/h2&gt;

&lt;p&gt;The study "Towards Green AI" (2026) found that 3 out of 10 tested models exhibit "babbling" behavior — generating significantly more text than necessary. Suppressing this yielded energy savings of 44% to 89%.&lt;/p&gt;

&lt;p&gt;But what if the language the LLM writes code in is designed so that "babbling" is physically impossible?&lt;/p&gt;

&lt;p&gt;Python code is inherently verbose. &lt;code&gt;def&lt;/code&gt;, &lt;code&gt;return&lt;/code&gt;, &lt;code&gt;if/elif/else&lt;/code&gt;, commas in lists — all syntactic overhead&lt;sup id="fnref10"&gt;10&lt;/sup&gt; that consumes tokens without carrying semantic information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Optimization Levers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lever 1: Representation compression.&lt;/strong&gt; Express the same program with fewer tokens. This isn't obfuscation — it's grammar design optimized for BPE tokenizers&lt;sup id="fnref11"&gt;11&lt;/sup&gt;. Potential: 35–50%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lever 2: Constrained decoding&lt;sup id="fnref12"&gt;12&lt;/sup&gt;.&lt;/strong&gt; Prevent the model from generating syntactically invalid code. Every error = retry = double token spend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lever 3: Type guarantees.&lt;/strong&gt; Type errors account for 33.6% of all failures in LLM-generated code. Type-guided generation&lt;sup id="fnref13"&gt;13&lt;/sup&gt; reduces them by 74.8%.&lt;/p&gt;

&lt;p&gt;Combining all three levers can yield 60–80% cumulative savings in tokens, money, energy, and time.&lt;/p&gt;
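&lt;p&gt;As a rough sketch of how the levers compound, assume they act independently and multiplicatively. The lever-2 and lever-3 percentages below are illustrative assumptions for the sketch, not measurements:&lt;/p&gt;

```python
# Illustrative compounding of the three levers, assuming independent,
# multiplicative savings. Only the 46% token figure comes from the article;
# the other two fractions are assumptions for demonstration.
token_cut = 0.46        # lever 1: 46% fewer tokens per program
retry_cut = 0.30        # lever 2: assumed spend recovered by avoiding retries
type_error_cut = 0.15   # lever 3: assumed spend recovered by avoiding type-error fixes

remaining = (1 - token_cut) * (1 - retry_cut) * (1 - type_error_cut)
print(f"cumulative savings: {1 - remaining:.0%}")  # falls in the claimed 60-80% range
```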

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;In the next article, we'll examine &lt;strong&gt;how BPE tokenization actually works&lt;/strong&gt; and why Python syntax wastes 46% of tokens on structural noise.&lt;/p&gt;







&lt;h2&gt;
  
  
  Series: Token Economics of Code
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Why Every Token Costs More Than You Think &lt;strong&gt;← you are here&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The Anatomy of BPE: Why Python Wastes 46% of Tokens &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Hindley-Milner for LLMs: Type Inference Without Annotations &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Show HN: Synoema — The First Programming Language Designed for LLMs &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The Future of Code Generation: From Prompts to Compilation &lt;em&gt;(coming soon)&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Smallest text unit for an LLM. Roughly ¾ of a word or 3–4 characters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model — neural network that generates text/code (GPT-4, Claude, Llama)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generating a response from a trained model. Happens with every request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everything the model "sees" — prompt, chat history, files. Measured in tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transformer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Neural network architecture underlying all LLMs. Uses attention mechanism&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-attention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mechanism where every token considers all others. Cost: O(n²)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BPE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Byte Pair Encoding — algorithm that splits text into tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Technology forbidding invalid tokens during generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graphics card for AI computation. NVIDIA H100 is standard for LLM inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source engine for fast LLM serving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parts of code/computation carrying no useful payload&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;strong&gt;Token&lt;/strong&gt; — the smallest unit of text an LLM processes. Not a letter, not a word, but a "chunk" of text 1–15 characters long. The word "hello" is 1 token; the code &lt;code&gt;def factorial(n):&lt;/code&gt; is 6 tokens. The model doesn't see characters — it sees a sequence of tokens. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;strong&gt;Context (context window)&lt;/strong&gt; — everything the model "sees" at once: your question, previous messages, attached files. Measured in tokens. GPT-4 has a context of up to 128K tokens, Claude up to 200K. The longer the context, the more computation the model needs. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;strong&gt;LLM (Large Language Model)&lt;/strong&gt; — a neural network trained on massive amounts of text that can generate text, code, and answer questions. Examples: GPT-4, Claude, Llama, Gemini. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt; — the process of using an already-trained model to generate responses. When you type a prompt into ChatGPT and get an answer, that's inference. Unlike training (which happens once), inference happens billions of times per day. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;strong&gt;Transformer&lt;/strong&gt; — the neural network architecture underlying all modern LLMs. Invented at Google in 2017 ("Attention Is All You Need" paper). Its key feature is the "attention" mechanism, which lets the model consider relationships between any words in the text, even distant ones. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;strong&gt;Self-attention&lt;/strong&gt; — a mechanism where every token "looks at" every other token in the context to understand their relationships. This gives transformers their power — but also creates quadratic cost: if there are &lt;em&gt;n&lt;/em&gt; tokens, there are &lt;em&gt;n × n&lt;/em&gt; pairs to compare. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;strong&gt;LLaMA&lt;/strong&gt; — a family of open-source language models from Meta (Facebook). Available for download and self-hosted deployment, unlike GPT-4. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;&lt;strong&gt;GPU (Graphics Processing Unit)&lt;/strong&gt; — originally a graphics card, now used for AI computation. NVIDIA A100 and H100 are specialized GPUs for LLM inference and training. A single H100 costs ~$30–40K and draws 700 watts. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;&lt;strong&gt;vLLM&lt;/strong&gt; — an open-source engine for fast LLM serving. Optimizes GPU memory usage through PagedAttention, enabling more simultaneous requests. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn10"&gt;
&lt;p&gt;&lt;strong&gt;Syntactic overhead&lt;/strong&gt; — parts of code required by the language's syntax but carrying no meaning. For example, Python's &lt;code&gt;def&lt;/code&gt; before a function definition and &lt;code&gt;return&lt;/code&gt; before a return value are mandatory but contain no information about &lt;em&gt;what&lt;/em&gt; the function does. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn11"&gt;
&lt;p&gt;&lt;strong&gt;BPE (Byte Pair Encoding)&lt;/strong&gt; — the algorithm that splits text into tokens. Used in all modern LLMs. Finds the most frequent pairs of characters in a huge text corpus and merges them into new "subwords." Result: a vocabulary of ~100,000 tokens. Covered in detail in the second article. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn12"&gt;
&lt;p&gt;&lt;strong&gt;Constrained decoding&lt;/strong&gt; — a technology that forbids the model from choosing invalid tokens at each generation step. If the model is generating JSON, it ensures brackets are closed and commas are in the right places. The same can be done for any language with a formal grammar. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn13"&gt;
&lt;p&gt;&lt;strong&gt;Type-guided generation&lt;/strong&gt; — an extension of constrained decoding where the model is additionally prevented from generating code with type errors. A second layer of guarantees on top of syntactic ones. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>science</category>
    </item>
  </channel>
</rss>
