Forem: delimitter

Executable Documentation: When Your Comments Become Tests

delimitter — Sat, 04 Apr 2026 22:06:16 +0000

TL;DR: Synoema stores documentation as executable state in the AST. Doc comments (---) and their example: assertions are parsed, tested, and rendered from a single source of truth. Stale docs fail the test suite. 56% fewer tokens than Python equivalent.

Who this is for. If you've ever found an outdated docstring that claimed a function returned a string when it actually returns a list, this article is for you. Whether you maintain a library, write code with AI assistants, or just want documentation that can't lie вАФ read on.

Documentation lies. Not intentionally, but inevitably. A developer changes a function's behavior, forgets to update the docstring, and now the documentation describes code that no longer exists. This drift is not a personal failing вАФ it's a structural problem. Traditional programming languages treat documentation as metadata: a passive comment attached to code, never executed, never verified.

What if the language itself made stale documentation structurally impossible?

In this article вАФ Part 14 of Token Economics of Code вАФ I'll describe a paradigm where documentation is stored as executable state directly in the AST, verified on every test run, and consumed by both humans and LLMs from a single source of truth.

The Scale of the Problem

The disconnect between documentation and code is well-documented (ironically). A 2023 study by Wen et al. found that 25.5% of Python docstrings in popular open-source projects are inconsistent with their corresponding function signatures. One in four.

The cost isn't just confusion. When an LLM reads a stale docstring to understand your codebase, it generates code based on incorrect context. That code fails. The failure triggers a retry. The retry consumes tokens. The tokens cost money and energy. Documentation debt becomes inference debt.

Three dominant approaches exist today. None solve the problem:

Approach	Language	Verification	Drift risk
Docstrings	Python	`doctest` (opt-in, fragile)	High вАФ separate from tests
JSDoc / TSDoc	JS / TS	None вАФ comments only	Very high
Haddock	Haskell	None вАФ rendered to HTML	Moderate

Python's doctest module comes closest, but it has a fundamental limitation: it compares string representations of output, not semantic values. A change in __repr__ breaks every doctest. And doctest extraction relies on regex-level parsing, not the language's own AST.

The Paradigm: Documentation as AST State

Synoema takes a different approach. Documentation is a first-class syntactic element вАФ not a comment convention, but a token type recognized by the lexer, preserved in the AST, and consumed by the compiler toolchain.

The syntax uses triple-dash ---:

--- Compute factorial.
--- example: fact 5 == 120
fact 0 = 1
fact n = n * fact (n - 1)

Three things happen when the parser encounters ---:

The lexer emits a Token::DocComment(String) вАФ distinct from -- (regular comment, stripped during tokenization)
The parser collects consecutive doc lines and attaches them to the next declaration as doc: Vec<String> in the AST
Lines starting with example: are flagged as executable assertions

This is not a wrapper around regular comments. It's a distinct token class, occupying exactly 1 BPE token in the cl100k_base vocabulary. Regular comments (--) are invisible to the AST. Doc comments (---) persist through the entire compilation pipeline.

Here is the key difference from traditional approaches:

Traditional:  Source вЖТ [strip comments] вЖТ AST вЖТ Compile
Synoema:      Source вЖТ AST (with doc: Vec<String>) вЖТ Compile + Test + Doc

Documentation is not stripped. It travels with the code.

How It Works: From Lexer to Test Runner

Let me trace the full pipeline for a single doctest.

Step 1: Lexing. The scanner encounters --- and calls scan_doc_comment():

Input:  "--- example: fact 5 == 120\n"
Output: Token::DocComment("example: fact 5 == 120")

The text after --- is captured verbatim, with leading whitespace trimmed.

Step 2: Parsing. The parser's collect_doc_comments() method gathers consecutive DocComment tokens into a vector. When it hits a function declaration, it attaches the vector:

Decl::Func {
    name: "fact",
    equations: [...],
    doc: ["Compute factorial.", "example: fact 5 == 120"],
    span: ...,
}

Both the human-readable description ("Compute factorial") and the executable assertion ("example: fact 5 == 120") live in the same Vec<String>. No separate metadata structure. No JSON sidecar. One field.

Step 3: Test extraction. When you run synoema test, the extract_doctests() function walks every declaration, finds lines starting with example:, and splits them:

"example: fact 5 == 120"
         ^^^^^^    ^^^
         expr      expected

The split respects bracket nesting вАФ example: head [1 2 3] == 1 correctly identifies head [1 2 3] as the expression and 1 as the expected value.

Step 4: Execution. Each doctest is evaluated by appending it to the full module source:

-- Original source loaded here --
__doctest_val = fact 5

The result is compared against the expected value (also evaluated in the same context). If they match вАФ pass. If not вАФ fail with a diagnostic showing the expression, expected value, and actual value.

This means doctests have access to every definition in the file. They run in the real evaluation environment, not a sandboxed mock. If the function changes behavior, the doctest catches it.

Three Testing Tiers, One Pipeline

Synoema unifies three kinds of verification into a single synoema test command:

Tier 1: Doctests вАФ inline assertions in doc comments.

--- Reverse a list.
--- example: reverse [1 2 3] == [3 2 1]
reverse [] = []
reverse (x:xs) = reverse xs ++ [x]

Tier 2: Unit tests вАФ named boolean assertions using the test keyword.

test "fact base" = fact 0 == 1
test "fact 10" = fact 10 == 3628800
test "sort then reverse" = reverse (qsort [3 1 2]) == [3 2 1]

Tier 3: Property tests вАФ generative testing with the prop keyword.

test "reverse involution" = prop xs -> reverse (reverse xs) == xs
test "sort idempotent" = prop xs -> qsort (qsort xs) == qsort xs
test "fact positive" = prop n -> fact n >= 1 when n >= 0 && n <= 10

Property tests use Hindley-Milner type inference to determine what values to generate. The variable xs in prop xs -> reverse (reverse xs) == xs is inferred as List a вАФ so the test runner generates random lists. The variable n in prop n -> fact n >= 1 is inferred as Int вАФ so it generates random integers. No manual type annotations required.

The when clause filters generated values: when n >= 0 && n <= 10 discards any n outside that range before evaluating the property. 100 valid trials per property, deterministic seed for reproducibility.

All three tiers run together:

$ synoema test examples/testing.sno

  testing.sno
    doctests:    4 passed, 0 failed
    unit tests:  4 passed, 0 failed
    properties:  5 passed, 0 failed (500 trials)

  Total: 13 passed, 0 failed

Documentation Generation: Same Source, Different Output

The same doc: Vec<String> that drives testing also drives documentation generation. The synoema doc command reads the AST and renders it вАФ without re-parsing, without a separate doc format, without Markdown source files.

Markdown output (synoema doc --format md):

Interleaves doc lines as prose and declarations as code blocks. Lines starting with example: are rendered as highlighted code snippets. Metadata lines (guide:, order:, requires:) control page title, ordering, and dependency tracking вАФ but are invisible in the rendered output.

JSON output (synoema doc --format json):

Exports structured metadata for tooling:

{
  "file": "examples/testing.sno",
  "functions": [
    {
      "name": "fact",
      "doc": ["Compute factorial.", "example: fact 5 == 120"],
      "line": 7
    }
  ]
}

This JSON is consumed by the MCP server, which exposes Synoema documentation to LLM agents. The documentation that LLMs read is the same documentation that tests verify. There is no gap.

Why LLMs Care

When an LLM generates or modifies Synoema code, it reads doc comments as part of the source context. Those comments are guaranteed to be accurate вАФ because if they weren't, synoema test would have failed.

This creates a feedback loop:

LLM reads doc вЖТ generates code вЖТ code changes behavior вЖТ
  synoema test catches stale docs вЖТ developer updates docs вЖТ
    LLM reads updated docs вЖТ ...

In traditional languages, the loop has a silent gap: nothing catches stale docs. The LLM operates on incorrect context, generates incorrect code, and the developer blames the LLM rather than the documentation.

There's a second benefit specific to token economics. Doc comments in Synoema use --- (1 BPE token) instead of Python's """...""" (at least 2 tokens for delimiters) or JSDoc's /** ... */ (3+ tokens). Each example: line is 1 token for the keyword. The documentation syntax itself is token-efficient вАФ consistent with the language's design principle of minimizing BPE token count.

Side-by-Side: Python vs Synoema

Let's compare equivalent documented, tested code:

Python (32 tokens):

def fact(n):
    """Compute factorial.

    >>> fact(5)
    120
    """
    if n == 0:
        return 1
    return n * fact(n - 1)

Synoema (14 tokens):

--- Compute factorial.
--- example: fact 5 == 120
fact 0 = 1
fact n = n * fact (n - 1)

Same function. Same documentation. Same executable test. 56% fewer tokens. But the meaningful difference isn't token count вАФ it's that Python's doctest compares string output ("120") while Synoema's compares evaluated values (120 == 120). Change fact to return a float, and Python's doctest breaks on "120.0". Synoema's doesn't.

Try It

Install and run doctests on the example suite:

# Install Synoema
cargo run -p synoema-repl -- install

# Run all tests (doctests + unit + property)
synoema test examples/

# Run tests for a specific file
synoema test examples/testing.sno

# Generate documentation
synoema doc examples/testing.sno
synoema doc --format json examples/testing.sno

Write your own documented function:

--- Double every element in a list.
--- example: double_all [1 2 3] == [2 4 6]
double_all xs = [x * 2 | x <- xs]

Save it as my_funcs.sno and run synoema test my_funcs.sno. The example assertion becomes a test. The description becomes documentation. One source, two outputs, zero drift.

What's Next

In the next article, we'll explore the future of code generation вАФ how compilation, type inference, and executable documentation combine into an agentic pipeline where LLMs don't just write code, but verify it.

Part 14 of "Token Economics of Code" by @andbubnov. Synoema is open-source: github.com/Delimitter/synoema.

Glossary

Term	Explanation
AST	Abstract Syntax Tree вАФ the parsed structure of source code
BPE	Byte Pair Encoding вАФ how LLMs split text into tokens
Doctest	An executable example embedded in documentation
Doc comment	A `---` line in Synoema that persists in the AST
Hindley-Milner	Type inference algorithm вАФ determines types without annotations
MCP	Model Context Protocol вАФ connects LLM agents to external tools
Property test	A test that verifies a property holds for random inputs
Token	Smallest text unit for an LLM, roughly 3-4 characters

The Real Cost: Token Savings Calculator for Engineering Teams

delimitter — Sat, 04 Apr 2026 16:29:11 +0000

How Much Is Your Team Actually Spending on Syntactic Overhead?

Who this is for. Engineering managers, team leads, and developers who pay for LLM API tokens. This article turns benchmark data into dollar amounts for teams of different sizes.

We've shown that Synoema uses up to 33% fewer tokens than Python on functional code and that every token costs quadratically more than you think. Now let's do the math for real teams.

Part of Token Economics of Code series.

The Formula

Monthly cost = requests/day x tokens/request x price/token x 30 x quadratic_factor

Current API Pricing (April 2026)

Model	Input ($/M tokens)	Output ($/M tokens)
GPT-4o	$2.50	$10.00
GPT-4o-mini	$0.15	$0.60
Claude Sonnet 4.6	$3.00	$15.00
Claude Haiku 4.5	$0.80	$4.00
DeepSeek V3	$0.14	$0.28
Gemini 2.5 Pro	$1.25	$10.00

Savings: Functional Code (-33% tokens)

Component	Python	Synoema	Saving
System + prompt	250	250	0
Code context	1,500	1,005	495
Output	400	268	132
Total per request	2,150	1,523	627 (29%)

Dollar Savings by Team Size (GPT-4o)

Team size	Python monthly	Synoema monthly	Monthly saving	Annual saving
5 devs	$424	$301	$123	$1,476
25 devs	$2,118	$1,504	$614	$7,368
100 devs	$8,470	$6,014	$2,456	$29,472
500 devs	$42,350	$30,069	$12,281	$147,372

Beyond Direct Token Cost

Latency Savings

	Python	Synoema	Time saved
Output generation	8.0s	5.4s	2.6s per request
Team of 25, per month	55.8 hrs	37.1 hrs	18.7 hrs/month

Quadratic Compute

29% fewer tokens = 50% reduction in attention compute (O(n^2)).

Error Rate Reduction

Type-guided constrained decoding: 74.8% fewer type errors. Fewer retries = fewer total tokens.

Break-Even Analysis

Team size	Monthly saving (GPT-4o)	Break-even
5 devs	$123/mo	~3 months
25 devs	$614/mo	~1 month
100 devs	$2,456/mo	< 2 weeks

Try It

git clone https://github.com/Delimitter/synoema
cd synoema/lang
cargo run -p synoema-repl -- eval "map f [] = []; map f (x:xs) = f x : map f xs; map (\x -> x * 2) [1 2 3]"

# Full benchmark:
cd .. && cargo run --manifest-path benchmarks/runner/Cargo.toml -- run --phases tokens -v

# MCP integration:
npx synoema-mcp

Build Your Own Estimate

Monthly saving = team x requests x 22 x token_saving x price/M

Where:
  token_saving (input)  = context_tokens x 0.33
  token_saving (output) = output_tokens x 0.33

What's Next

Next: all the pieces together -- getting started, architecture, benchmarks, and the project roadmap.

Part of Token Economics of Code series by @andbubnov. Pricing: public API rates, April 2026.

Hindley-Milner for LLMs: Type Inference Without Annotations

delimitter — Sat, 04 Apr 2026 16:26:30 +0000

Polymorphic Typing: Fewer Tokens, Stronger Guarantees

Who this is for. If you've wondered whether it's possible to have strict typing without verbose annotations like Java or TypeScript — the answer is yes. This article explains how, and why it's critical for LLMs.

33.6% of all LLM-generated code failures are type errors. Can we eliminate them without making the LLM generate type annotations?

Part of Token Economics of Code series.

The Problem: Types Cost Tokens

// TypeScript: ~50% of tokens are type annotations
function add(a: number, b: number): number {
    return a + b;
}

Each annotation means more tokens, more context consumed (quadratic attention cost), and more opportunities for mistakes.

The ideal: 100% type safety with zero type annotations.

Hindley-Milner: A 1960s Solution to a 2020s Problem

The Hindley-Milner algorithm infers types automatically, requiring zero annotations. Used in Haskell, OCaml, F#, Elm.

-- Synoema: zero annotations, 100% type safety
add x y = x + y
-- Compiler infers: Int -> Int -> Int

id x = x
-- Compiler infers: forall a. a -> a

map f [] = []
map f (x:xs) = f x : map f xs
-- Compiler infers: forall a b. (a -> b) -> List a -> List b

How It Works

Algorithm W works in three steps:

Step 1: Constraint generation. For each expression, create type variables and record constraints.

Step 2: Unification. Solve the constraint system — replace type variables with concrete types. Conflicts = type errors.

Step 3: Generalization. Remaining free variables become polymorphic: id : a -> a becomes id : forall a. a -> a.

Key property: HM always finds the most general type.

Let-Polymorphism

id x = x
main =
  a = id 42        -- id used as Int -> Int
  b = id true       -- id used as Bool -> Bool
  a                 -- No error! id is polymorphic.

Interaction with Constrained Decoding

At each generation step, the compiler determines valid types for the next expression:

-- LLM generates: map ??? [1 2 3]
-- Compiler knows: ??? : Int -> t  (function from Int)
-- Valid: \x -> x + 1, \x -> x * 2
-- Invalid: \x -> x ++ "hello" (String != Int)

Grammar + type constraints narrow valid continuations by orders of magnitude.

Comparison

Approach	Type guarantees	Tokens on types	Runtime errors
Python (duck typing)	None	0	Many
TypeScript	Yes	~30-50% of code	Few
Java	Yes	~40-60% of code	Few
Synoema (HM)	Yes	0	None

Try It

git clone https://github.com/Delimitter/synoema
cd synoema/lang
cargo run -p synoema-repl -- eval "id x = x; id 42"
cargo run -p synoema-repl -- eval "map f [] = []; map f (x:xs) = f x : map f xs; map (\x -> x * 2) [1 2 3]"

TypeScript vs Synoema — same guarantees, 44% fewer tokens:

// TypeScript: 25 tokens
function map<A, B>(f: (a: A) => B, xs: A[]): B[] { ... }

-- Synoema: 14 tokens
map f [] = []
map f (x:xs) = f x : map f xs

Impact on LLM Code Generation

With HM, LLMs generate only semantics — the compiler handles types. This is why Synoema achieves 74.8% fewer type errors than syntax-only constrained decoding.

What's Next

Next: we measured every token across 16 algorithms in 5 languages.

Part of Token Economics of Code series by @andbubnov. HM type inference: 1,908 lines of Rust, 61 tests.

JIT vs Interpreters: Benchmarking LLM-Generated Code Execution

delimitter — Sat, 04 Apr 2026 16:22:09 +0000

Your AI Agent Writes Python. What If It Compiled to Native?

Who this is for. If you're building agentic workflows where LLMs generate and execute code — the execution speed of that code directly affects your agent's throughput. This article measures it.

Token efficiency is half the story. The other half: how fast does the generated code actually run? We benchmarked Synoema's Cranelift JIT against Python, Node.js, TypeScript (tsx), and C++ (-O2) across 12 algorithmic tasks.

Part of Token Economics of Code series.

Methodology

Hardware: Apple Silicon (macOS Darwin 25.3.0)

Runtimes: Synoema JIT (Cranelift, --release), CPython 3.12, Node.js (V8), TypeScript via tsx, C++ (g++ -O2)

Measurement: 3 warm-up runs discarded, 5 measured runs, median reported with p5/p95 percentiles.

Fairness: Identical algorithms across all languages. No language-specific optimizations.

cargo run --manifest-path benchmarks/runner/Cargo.toml -- run --phases runtime -v

Results: Overview

Language	Avg median (ms)	vs Synoema
C++ (-O2)	2.0	2.5x faster
Synoema JIT	5.2	baseline
Python 3.12	27.6	5.3x slower

Results: Per-Task (12 tasks)

Task	C++ (ms)	Synoema (ms)	Python (ms)	Synoema vs Python
binary_search	2.1	7.4	16.7	2.3x faster
collatz	2.3	5.7	16.4	2.9x faster
factorial	1.4	JIT fail	17.2	--
fibonacci	3.7	JIT fail	145.6	--
filter_map	2.3	5.2	16.6	3.2x faster
fizzbuzz	1.7	5.7	16.8	3.0x faster
gcd	2.4	5.6	16.8	3.0x faster
matrix_mult	1.5	8.4	17.6	2.1x faster
mergesort	2.1	6.6	17.4	2.6x faster
quicksort	1.4	6.0	16.7	2.8x faster
string_ops	2.0	5.1	16.3	3.2x faster
tree_traverse	1.5	6.5	17.0	2.6x faster

factorial and fibonacci fail in JIT mode (known limitation -- being addressed).

Analysis

JIT Compilation Overhead

Synoema's times include Cranelift JIT compilation (10-50ms one-time cost). For short tasks, this overhead is visible. For longer computations, it's negligible.

Key insight: JIT overhead is constant, interpreter overhead is proportional to work.

Where Synoema Wins

Recursive algorithms: no interpreter loop overhead
Tight numeric loops (collatz, gcd): native integer operations
Pattern matching: compiled to jump tables

Where Synoema Loses

String-heavy operations: Python's C-implemented string library is highly optimized
Very short programs: JIT overhead dominates when computation < 10ms
vs C++ always: Cranelift generates ~86% quality code vs LLVM/GCC

Honest Comparison

The comparison that matters for AI agents:

Synoema (JIT, type-safe, fewer tokens on functional code)
    vs
Python (interpreted, duck-typed, dominant in LLM generation)

Implications for AI Agents

Python:   generate (1.5s) -> interpret (Nms)
Synoema:  generate (0.8s, fewer tokens) -> JIT (50ms) -> native (N/4 ms)

The real question: what's the total cost of the generate -> execute -> analyze cycle? Token efficiency + compilation speed + type guarantees create compound savings.

Try It

git clone https://github.com/synoema/synoema
cd synoema
cargo run --manifest-path benchmarks/runner/Cargo.toml -- run --phases runtime -v

What's Next

Next: we sent the same prompts to 10 LLM models and measured who generates correct Synoema code.

*Part of Token Economics of Code series by @andbubnov.*llmprogrammingrust

Compilation for LLMs: Why a Language for Models Needs Native Code

delimitter — Sat, 04 Apr 2026 15:59:08 +0000

Cranelift JIT, 2.8--5.9x Faster Than Python, and Why It Matters for AI Agents

Who this is for. If you're building AI agents that generate and execute code, or want to understand why compiled LLM output isn't science fiction but working technology -- read on. All terms explained inline and in the glossary.

In previous articles, we showed how to cut tokens by 46% and guarantee syntactic correctness. But there's a third problem: generated code must not only be short and correct -- it must be fast.

Context: LLM Agents Write and Run Code

Claude Code, Cursor, Devin, OpenAI Codex -- these tools don't just generate code. They execute it: run tests, process data, call APIs. The cycle "generate -> run -> analyze result -> repeat" is the foundation of agentic workflows.

Agentic workflows -- an approach where an LLM acts as an autonomous "agent": receives a task, breaks it into steps, writes code, runs it, analyzes the result, and adjusts.

The problem: almost all agents generate Python. And Python is interpreted.

Interpreted language -- a language whose code is executed "line by line" by an interpreter, without prior compilation to machine code. Interpreted languages are simpler but 10--100x slower than compiled ones.

This means: every run goes through the CPython interpreter (slow, single-threaded), no code optimization (Python doesn't know types until runtime via duck typing), and serious computation requires C-based libraries (NumPy, pandas).

Duck typing -- Python's principle: "if it walks like a duck and quacks like a duck, it's a duck." Type errors are discovered only at runtime.

The Solution: JIT Compilation

What if LLM-generated code compiles to native machine code in milliseconds and runs at C speed?

JIT (Just-In-Time) compilation -- compiling code to machine instructions immediately before execution, "on the fly." No separate build step. LLM generates code -> JIT compiles in milliseconds -> native execution speed.

LLM generates code (.sno)
    |
Parser -> AST -> Type Check -> Core IR
    |
Cranelift JIT -> native x86-64 machine code
    |
Execution at C/Rust speed (no interpreter)

The entire cycle -- from text to native code -- takes < 100 ms.

Why Cranelift, Not LLVM

LLVM -- the industry standard. Used in Clang (C/C++), Rust, Swift, Julia. Generates very fast code but compiles slowly. Written in C++, pulls gigabytes of dependencies.

Cranelift -- written in pure Rust. Compiles 10x faster than LLVM. Generates code ~86% the quality of LLVM. Ideal for JIT.

Criterion	LLVM	Cranelift
Language	C++	Rust
Compilation speed	1x	10x
Code quality	100%	~86%
Dependencies	Gigabytes	`cargo build`
Ideal for	AOT compilation	JIT compilation

Benchmarks: Synoema JIT vs Python vs C++

Methodology: median of 5 runs, 3 warm-up discarded. All times include process startup; Synoema times include JIT compilation.

Full Suite (10 tasks)

Task	C++	Synoema JIT	Python	Synoema vs Python
quicksort	1.4 ms	6.0 ms	16.7 ms	2.8x
mergesort	2.1 ms	6.6 ms	17.4 ms	2.6x
binary_search	2.1 ms	7.4 ms	16.7 ms	2.3x
tree_traverse	1.5 ms	6.5 ms	17.0 ms	2.6x
filter_map	2.3 ms	5.2 ms	16.6 ms	3.2x
collatz	2.3 ms	5.7 ms	16.4 ms	2.9x
gcd	2.4 ms	5.6 ms	16.8 ms	3.0x
fizzbuzz	1.7 ms	5.7 ms	16.8 ms	3.0x
matrix_mult	1.5 ms	8.4 ms	17.6 ms	2.1x
string_ops	2.0 ms	5.1 ms	16.3 ms	3.2x
Average	1.9 ms	6.2 ms	16.8 ms	2.8x

Compute-Heavy Tasks

Task	Python	Synoema JIT	Speedup
fib(30)	277 ms	47 ms	5.9x
collatz (10K)	505 ms	90 ms	5.6x
gcd (100K)	143 ms	83 ms	1.7x
Average			4.4x

What the Numbers Mean

Micro-benchmarks: 2.1--3.2x faster. Startup overhead dominates.

Compute-heavy tasks: up to 5.9x faster. JIT-compiled native code pulls ahead as startup cost amortizes.

C++ context: C++ runs 3x faster than Synoema JIT on average -- expected since Cranelift generates ~86% quality code vs LLVM. Trade-off: Synoema compiles in < 100 ms (no build step).

Architecture Pipeline

Source code (.sno)
  |
  +-- Lexer (735 lines, 82 tests)
  +-- Parser (1,672 lines, 43 tests) -- Pratt parser -> AST
  +-- Type Checker (1,908 lines, 61 tests) -- Hindley-Milner
  +-- Core IR (1,536 lines, 44 tests) -- System F
  +-- Diagnostics -- structured errors, LLM hints
  +-- Backend:
      +-- Interpreter (1,894 lines, 119 tests)
      +-- Cranelift JIT (3,044 lines, 126 tests)

8 crates, ~12,000 lines of Rust, 890+ tests, 0 errors.

What This Means for AI Agents

With Python: LLM generates script (200 tokens, 1.5s) -> Python processes (12s) -> total ~15 seconds.

With Synoema: LLM generates sno code (108 tokens, 0.8s) -> JIT (50ms) -> native (3s) -> total ~4 seconds.

Savings: 73% time, 46% tokens, zero dependencies.

What's Changed Since We Started

890+ tests (from 264), all passing, 0 warnings
JIT supports: closures, records, ADTs, pattern matching, modules, TCO, string stdlib, float arithmetic, type class dispatch
Prelude: Result type with combinators (map_ok, unwrap, and_then)
MCP server: npx synoema-mcp integrates into LLM toolchains
Region inference: memory management without GC
Diagnostics: structured errors with LLM-friendly hints

Try It

cargo build --release -p synoema-repl
cargo run -p synoema-repl -- jit examples/quicksort.sno
cargo run -p synoema-repl -- eval "6 * 7"

Source: github.com/Delimitter/synoema

What's Next

Next: Hindley-Milner -- 100% type safety with zero annotations. This is what makes type-guided constrained decoding possible.

Part of Token Economics of Code series by @andbubnov.

Token Efficiency: 16 Algorithms, 5 Languages, Zero Guesswork

delimitter — Sat, 04 Apr 2026 15:36:19 +0000

Who this is for. If you use LLMs to generate code — or pay for API tokens — this article shows exactly where your budget goes. Every number is reproducible. No opinions, just data.

In previous articles, we explained why tokens are expensive (quadratic attention cost) and how BPE tokenization works. Now we show the full data: 16 algorithms implemented in 5 languages, every token counted with tiktoken (cl100k_base).

Methodology

Tokenizer: tiktoken cl100k_base (used by GPT-4, GPT-4o, Claude).

Languages: Synoema, Python, JavaScript, TypeScript, C++.

Tasks: 16 algorithmic problems covering recursion, higher-order functions, data structures, string operations, pattern matching, error handling, and custom types.

Fairness rules:

Identical algorithms (same logic, not idiomatic rewrites)
SPDX license headers stripped before counting
No comments counted
Minimal imports (only what's needed)

Reproducibility: cargo run --manifest-path benchmarks/runner/Cargo.toml -- run --phases token

Results: Summary

Language          | Avg tokens/task | vs Synoema
------------------|-----------------|----------
Synoema           | 92.5            | baseline
Python            | 92.9            | +0%
JavaScript        | 94.9            | +3%
TypeScript        | 114.6           | +24%
C++               | 166.6           | +80%

Wait — +0% vs Python? Yes. The headline "46% savings" from our earlier work was measured on 12 purely functional programs. With 16 diverse tasks including imperative-style algorithms, the picture changes. Keep reading — the per-category breakdown tells the real story.

Results: Per-Task Breakdown

Synoema Wins: Functional & Pattern-Heavy Tasks

Task             | Synoema | Python | JS  | TS  | C++ | vs Python
-----------------|---------|--------|-----|-----|-----|----------
factorial        | 25      | 32     | 35  | 38  | 58  | -22%
fibonacci        | 38      | 49     | 52  | 55  | 75  | -22%
gcd              | 26      | 35     | 38  | 43  | 61  | -26%
quicksort        | 77      | 124    | 111 | 115 | 245 | -38%
json_build       | 32      | 67     | 60  | 81  | 156 | -52%
pattern_match    | 136     | 225    | 182 | 261 | 277 | -40%
type_definition  | 83      | 118    | 127 | 189 | 204 | -30%

Average saving on functional tasks: -33% vs Python.

Near-Equal: General Algorithms

Task             | Synoema | Python | JS  | TS  | C++ | vs Python
-----------------|---------|--------|-----|-----|-----|----------
collatz          | 55      | 60     | 63  | 66  | 87  | -8%
fizzbuzz         | 59      | 63     | 66  | 69  | 97  | -6%
tree_traverse    | 129     | 130    | 117 | 163 | 338 | -1%
error_handling   | 95      | 90     | 101 | 144 | 179 | +6%
mergesort        | 194     | 179    | 179 | 189 | 320 | +8%
filter_map       | 32      | 28     | 62  | 76  | 73  | +14%

Synoema Loses: Imperative & Index-Heavy Tasks

Task             | Synoema | Python | JS  | TS  | C++ | vs Python
-----------------|---------|--------|-----|-----|-----|----------
binary_search    | 159     | 120    | 129 | 134 | 175 | +33%
string_ops       | 28      | 15     | 17  | 19  | 52  | +87%
matrix_mult      | 312     | 152    | 180 | 191 | 269 | +105%

Where the Savings Come From

1. Function Definitions

# Python: 6 tokens
def factorial(n):
    return n * factorial(n - 1)

-- Synoema: 2 tokens for the definition syntax
fac n = n * fac (n - 1)

def, (, ), :, return — 5 syntactic tokens that carry zero semantic information. Synoema uses pattern matching: fac n = is 3 tokens total (name, parameter, equals).

2. Conditionals

# Python: if/elif/else = 3 keyword tokens + colons
if x > 0:
    return x
else:
    return -x

-- Synoema: ? -> : = 3 single-character tokens
? x > 0 -> x : -x

3. Lists

# Python: commas between elements
[1, 2, 3, 4, 5]  # 9 tokens (5 numbers + 4 commas)

-- Synoema: space-separated
[1 2 3 4 5]  -- 7 tokens (5 numbers + 2 brackets)

Every comma is a wasted token. In a list of N elements, Python wastes N-1 tokens on commas.

4. Type Annotations

// TypeScript: ~50% of tokens are type information
function map<A, B>(f: (a: A) => B, xs: A[]): B[] {
    // ...
}

-- Synoema: zero type tokens, compiler infers everything
map f [] = []
map f (x:xs) = f x : map f xs
-- Inferred: forall a b. (a -> b) -> List a -> List b

5. C++ Ceremony

C++ pays the highest tax: #include, template<typename T>, std::vector<int>, {, }, ; after every statement. These structural tokens dominate — hence the +97% overhead.

The Honest Picture

The data tells a nuanced story:

Synoema saves big (-22% to -52%) when the task is naturally functional:

Pattern matching (case expressions vs Python's if/elif chains)
List processing (comprehensions, cons, no commas)
Custom type definitions (ADTs vs Python classes)
Recursive algorithms (no def/return overhead)

Synoema loses (+33% to +105%) when the task is imperative:

matrix_mult: no array indexing, must simulate with get xs n helper (+105%)
binary_search: same problem — linked-list traversal vs xs[mid] (+33%)
string_ops: Python's built-in methods (s.upper()) are extremely concise (+87%)

The critical insight: Synoema is optimized for the code patterns LLMs most commonly generate — function definitions, recursion, data transformation. The tasks where Synoema loses (imperative loops, indexed access) are patterns where LLMs already generate efficient Python.

What This Means for Your LLM Budget

For functional-style code (the majority of LLM-generated algorithms), the savings compound via quadratic attention:

Metric             | Python | Synoema | Saving
-------------------|--------|---------|-------
Tokens (func avg)  | ~93    | ~60     | 33%
Attention compute  | 8,649  | 3,600   | 58%

For mixed workloads (all 16 tasks equally weighted), savings are negligible on tokens but Synoema still wins on type safety and compilation speed.

The takeaway: choose the right tool for the task. Synoema excels at exactly the code patterns where LLMs benefit most from token reduction — function definitions, data transformations, pattern matching. For imperative array manipulation, Python remains competitive on token efficiency.

Try It

git clone https://github.com/Delimitter/synoema
cd synoema

# Run token benchmarks (no runtime deps needed)
cargo run --manifest-path benchmarks/runner/Cargo.toml -- run --phases token

# Results saved to benchmarks/results/<date>/

What's Next

Next in the series: runtime benchmarks — token efficiency is half the story. How fast does the generated code actually run?

Part of Token Economics of Code series. Token counts: tiktoken cl100k_base, reproducible via open-source benchmark suite.

Glossary

BPE (Byte Pair Encoding) — Tokenization algorithm used by GPT-4, Claude. Splits text into subword tokens.

cl100k_base — OpenAI's BPE vocabulary (~100K tokens). Standard for GPT-4 and Claude models.

tiktoken — Python library for exact BPE token counting.

Quadratic attention — Transformer cost scales as O(n²) with sequence length — 30% fewer tokens = 51% less compute.

Syntactic overhead — Tokens required by language grammar that carry no semantic information.

Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code

delimitter — Fri, 03 Apr 2026 08:18:05 +0000

From GBNF Grammars to Type-Directed Generation: Guarantees Instead of Hope

Who this is for. If you've ever had ChatGPT generate code that doesn't compile — this article explains how to eliminate that completely. All technical terms explained in footnotes and the glossary at the end.

In previous articles, we showed that reducing tokens saves money, energy, and compute. But there's a more serious problem: LLMs generate incorrect code. And every retry doubles the token spend.

The Scale of the Problem

Type errors account for 33.6% of all failures in LLM-generated code (Mündler et al., PLDI 2025¹). These aren't typos — they're structural errors: wrong argument types, incompatible return values, accessing nonexistent fields.

When an LLM generates a sort function that doesn't compile, the cost doubles — either a human fixes it (time) or an agent retries (tokens).

But what if the model physically cannot generate syntactically invalid code?

Three Levels of Constraints

Level 1: Grammar (Syntactic Correctness)

At each generation step, the set of grammatically² valid tokens is determined. All others are masked — probability set to zero.

Example: if the model just generated [, then the next token can be a number, identifier, ], or [ — but not +, not =, not ).

Tools:

XGrammar³ — default backend in SGLang⁴, vLLM, TensorRT-LLM⁵. Works with context-free grammars (CFG⁶). Approaches zero overhead.
Outlines⁷ — structured generation via finite state machines (FSM⁸). Supports regex and CFG.
llama.cpp⁹ — built-in GBNF grammar¹⁰ support.
Guidance (Microsoft) — template-based generation with constraints.

Result: 100% syntactic correctness. Every generated fragment is a valid program.

Level 2: Types (Semantic Correctness)

Grammar guarantees that f x = x + 1 is syntactically valid. But not that x is a number. Type-constrained decoding¹¹ adds a second layer: only tokens compatible with the current type context are allowed.

Mündler et al. (PLDI 2025) showed that type-constrained decoding reduces compilation errors by 74.8% compared to 9.0% for syntax-only constraints.

This requires type inference¹² — so the compiler can determine valid types at every generation point without explicit annotations.

Level 3: Specification (Logical Correctness)

The most powerful level: constraints based on formal specification. A sort function doesn't just have the right type — it actually sorts. This is an area of active research (dependent types, refinement types). Not yet in production tools.

How XGrammar Works

XGrammar's key optimization: splitting the vocabulary into two classes:

Context-independent tokens (~80%+). Validity determined at preprocessing, before generation. For each grammar state, a bitmask of valid tokens is precomputed. O(1) per token.

Context-dependent tokens (~20%). Validity depends on the current PDA¹³ state. Checked at runtime, but few in number.

Result: near-zero overhead. Constrained decoding adds no measurable overhead to TPOT¹⁴.

BPE Misalignment Breaks Constrained Decoding

This is where language design becomes critical.

When a language grammar isn't aligned to BPE boundaries, constrained decoding faces the bridge token problem — a BPE token spanning two grammatical symbols.

Domino (ICML 2024¹⁵) showed that bridge tokens distort the model's probability distribution. Grammar-Aligned Decoding (NeurIPS 2024¹⁶) formalized the problem and proposed a fix — but with added overhead.

If a language is designed so bridge tokens never arise — every grammatical symbol coincides with one BPE token — the problem disappears entirely.

Deterministic CFG = Zero Overhead

Nondeterministic CFG — when parsing, multiple rules may apply. Requires backtracking¹⁷. Expensive.

Deterministic CFG (DCFG)¹⁸ — exactly one rule applies at each step. Compiles to an FSM. No backtracking. No ambiguity.

Tian et al. (CoLM 2024¹⁹) proved that for DCFGs, constrained decoding compiles in closed form — overhead approaches zero.

If a language has a DCFG grammar with BPE-aligned operators, constrained decoding is free: zero overhead + zero bridge tokens.

In Practice: GBNF Grammar

root    ::= program
program ::= (decl newline)* decl newline?
decl    ::= func-def | type-sig | type-def

func-def ::= lower-id ws (pattern ws)* "=" ws expr
cond-expr ::= "?" ws expr ws "->" ws expr ws ":" ws expr

Plugging into SGLang:

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write factorial"}],
    extra_body={"ebnf": open("synoema.gbnf").read()},
)
# Result: GUARANTEED syntactically valid code

Or llama.cpp:

./main -m model.gguf --grammar-file synoema.gbnf \
  -p "-- Quicksort:" -n 128 --temp 0.2

The Economic Impact

Lever	Mechanism	Savings
BPE-aligned grammar	46% fewer tokens	-46% direct
Quadratic attention	54% length → 29% cost	-71% on attention
Constrained decoding	0 invalid code → 0 retries	-10–30%
Type constraints	-74.8% type errors	-5–15% additional

Combined: 50–70% savings in cost and energy vs unoptimized Python generation.

What's Next

In the next article, we'll introduce Synoema — a language with all three levers: BPE-aligned grammar (33 single-token operators), Hindley-Milner²⁰ type inference, and Cranelift²¹ JIT for native speed.

Series: Token Economics of Code

Why Every Token Costs More Than You Think
The Anatomy of BPE: Why Python Wastes 46% of Tokens
Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code ← you are here
Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python (coming soon)
Hindley-Milner for LLMs: Type Inference Without Annotations (coming soon)
Show HN: Synoema — The First Programming Language Designed for LLMs (coming soon)
The Future of Code Generation: From Prompts to Compilation (coming soon)

Glossary

Term	Explanation
Constrained decoding	Technology forbidding invalid tokens during generation. Guarantees correctness
XGrammar	Constrained decoding engine, de facto standard for LLM inference
SGLang	Open-source LLM inference engine from UC Berkeley
vLLM	Open-source LLM inference engine with memory optimization
TensorRT-LLM	NVIDIA's inference engine, fastest on their GPUs
GBNF	Grammar description format for llama.cpp
llama.cpp	Popular project for running LLMs on commodity hardware
CFG	Context-Free Grammar — formal grammar describing language syntax
DCFG	Deterministic CFG — unambiguous parsing, enables zero-overhead constraints
FSM	Finite State Machine — model for fast token validity checking
PDA	Pushdown Automaton — FSM with a stack for nested structures
TPOT	Time Per Output Token — main LLM inference speed metric
Bridge token	BPE token spanning two grammar symbol boundaries
Type inference	Automatic type deduction by the compiler, no annotations needed
Hindley-Milner	Most powerful type inference algorithm. Used in Haskell, OCaml
Cranelift	Rust-based compiler backend. Fast JIT compilation to native code
PLDI / ICML / NeurIPS	Top academic conferences on PL, ML, and AI respectively
Backtracking	Parsing by trial-and-error with rollbacks. Slow but universal

PLDI (Programming Language Design and Implementation) — one of the most prestigious academic conferences on programming languages. Papers undergo rigorous peer review. If a result is published at PLDI, it's trustworthy. ↩
Formal grammar — a set of rules describing which sequences of symbols are valid in a language. For example, "after [, the next symbol can be a number, identifier, or ]" is part of a grammar. Python has a complex grammar (hundreds of rules); JSON has a simple one (~10 rules). ↩
XGrammar — constrained decoding engine from the MLC-AI team. The de facto standard for LLM inference. Its key innovation: splitting the vocabulary into "easy" tokens (80%+, checked at preprocessing) and "hard" tokens (20%, checked at runtime), yielding near-zero overhead. ↩
SGLang — open-source LLM inference engine from UC Berkeley. One of the fastest ways to serve LLMs. Supports constrained decoding via XGrammar out of the box. ↩
TensorRT-LLM — NVIDIA's inference engine, optimized for their GPUs. Fastest on NVIDIA hardware, but complex to set up. ↩
CFG (Context-Free Grammar) — a class of grammars where each rule has the form "symbol → sequence of symbols." Most programming languages are described by CFGs. JSON, XML, HTML, Python, JavaScript all have CFGs. ↩
Outlines — open-source library for structured generation. Compiles a grammar or regex into a finite state machine that filters tokens on the fly. ↩
FSM (Finite State Machine) — a mathematical model that at any moment is in one of a finite number of states and transitions between them on input symbols. Used for fast checking of whether the next token is valid. ↩
llama.cpp — the most popular open-source project for running LLMs on commodity hardware (CPU, Mac M1/M2, budget GPUs). Written in C++, works without Python. Supports GBNF grammars for constrained decoding. ↩
GBNF (GGML BNF) — grammar description format used in llama.cpp. Extension of standard Backus-Naur Form. Example: expr ::= number | expr "+" expr. ↩
Type-constrained decoding — an extension of constrained decoding that checks types in addition to grammar. If a function expects Int, the model can't substitute String. Requires type inference — automatic type deduction by the compiler. ↩
Type inference — the compiler's ability to determine types of all expressions automatically, without programmer annotations. Instead of int add(int x, int y) (as in C), you write just add x y = x + y, and the compiler deduces Int → Int → Int. The most powerful type inference algorithm is Hindley-Milner, used in Haskell and ML. ↩
PDA (Pushdown Automaton) — an extension of a finite state machine with a stack. Needed for grammars with nesting (brackets, code blocks). A regular FSM can't count brackets — a PDA can. ↩
TPOT (Time Per Output Token) — the time to generate one output token. The main metric for LLM inference speed. For GPT-4: ~20–30 ms; for small models on powerful GPUs: 5–10 ms. ↩
ICML (International Conference on Machine Learning) — one of the top three ML conferences (with NeurIPS and ICLR). Publication at ICML signals high-quality research. ↩
NeurIPS (Neural Information Processing Systems) — the largest AI/ML conference. ~15,000 attendees annually. Publication at NeurIPS is the gold standard for ML research. ↩
Backtracking — a parsing method where the parser tries one rule, and if it fails, backtracks and tries another. Slow because the same text may be parsed multiple times. ↩
DCFG (Deterministic Context-Free Grammar) — a subclass of CFG where parsing is unambiguous at every step. Compiles to an efficient automaton. Most real programming languages are DCFGs (or close). Python technically isn't due to indentation, but with the offside rule it approximates one. ↩
CoLM — a newer conference at the intersection of language models and formal methods. Focuses on how compiler theory can improve LLMs. ↩
Hindley-Milner — the most powerful automatic type inference algorithm, developed in the 1960s–80s. Allows the compiler to determine types of all expressions without a single annotation. Used in Haskell, OCaml, F#, Elm. Detailed in the fifth article. ↩
Cranelift — a compiler backend written in Rust. Converts intermediate representation (IR) to native machine code (x86-64, ARM). Alternative to LLVM: compiles 10× faster, though generated code is ~14% slower. Ideal for JIT compilation where compilation speed matters more than peak optimization. ↩

The Anatomy of BPE: Why Python Wastes 46% of Tokens

delimitter — Thu, 02 Apr 2026 15:11:31 +0000

How BPE Tokenization Works and What It Means for Language Design

Who this is for. If you want to understand how ChatGPT "sees" your code and why the same program costs different amounts in different languages — read on. All terms explained in footnotes and the glossary at the end.

In the previous article, we established that inference cost grows quadratically with token count. The natural question: can we reduce token count without losing semantics?

To answer that, we need to understand how LLMs see code. Not as text — as a sequence of tokens. And between how a programmer sees def factorial(n): and how GPT-4 sees it, there's a chasm.

How BPE Works

BPE (Byte Pair Encoding)¹ is the algorithm that converts text into sequences of integers (tokens). It underlies all modern LLMs: GPT-4 uses the cl100k_base² vocabulary, Claude uses a modified BPE, and Llama uses SentencePiece³ BPE.

The algorithm is simple:

Start with an alphabet of individual bytes (256 characters).
Find the most frequent pair of adjacent symbols in the corpus⁴.
Create a new symbol for that pair and add it to the vocabulary.
Repeat steps 2–3 until the desired vocabulary size (~100K).

The result is a vocabulary of ~100,000 "subwords" of variable length. Short, frequent words (the, is, def) encode as a single token. Rare words get split: tokenization → token + ization.

The critical property: BPE is trained on natural language, not code. So it's optimized for English prose, not Python syntax.

The Misalignment Problem

The grammatical units of a programming language — operators, keywords, delimiters — don't align with BPE token boundaries. This creates two types of waste.

Type 1: Redundant tokens on syntax.

Take a simple Python function:

def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

BPE (cl100k_base) splits this into 29 tokens. Semantically significant: factorial, n, 0, 1, *, -. The remaining 23 tokens are syntactic overhead: def, spaces, (, ), :, if, ==, return (twice), indentation, newlines.

The equivalent program in a minimal-syntax functional language:

fac 0 = 1
fac n = n * fac (n - 1)

16 tokens. Same semantics. 45% fewer.

Type 2: Bridge tokens⁵.

Sometimes a single BPE token spans two grammatical symbols. For example, "name" in JSON may become one token, even though grammatically it's space + quote + identifier + quote. This creates problems for constrained decoding⁶.

Benchmark: 12 Programs, 3 Languages

I compared token counts for equivalent programs in three languages: Python, Haskell⁷, and an optimized language where every operator is exactly one BPE token.

Program	Optimized	Python	Haskell	Saving vs Python
Factorial	16	29	16	45%
Map	20	42	23	52%
QuickSort	51	83	54	39%
FizzBuzz	44	64	49	31%
Filter	27	67	36	60%
Fibonacci	26	46	26	43%
Sum	16	33	19	52%
Length	16	30	19	47%
Reverse	18	35	18	49%
Compose	38	75	39	49%
Maximum	28	58	37	52%
Zip	32	53	37	40%
Total	332	615	373	46%

The optimized language uses 46% fewer tokens than Python, and 11% fewer than Haskell.

Given quadratic attention cost: 46% fewer tokens ≈ 71% less computation in attention layers.

Where the Savings Come From

Function Definition

# Python: 6 tokens of boilerplate
def add(x, y):
    return x + y

# Optimized: 0 boilerplate tokens
add x y = x + y

Conditional

# Python: 6 overhead tokens
if x > 0:
    return x
else:
    return -x

# Optimized: 3 tokens
? x > 0 -> x : -x

Lists

[1, 2, 3, 4, 5]  # 9 tokens (commas are waste)
[1 2 3 4 5]       # 7 tokens (no commas needed)

Pattern Matching⁸

# Python: 29 tokens
def fac(n):
    if n == 0:
        return 1
    return n * fac(n - 1)

# Optimized: 16 tokens
fac 0 = 1
fac n = n * fac (n - 1)

BPE-Aligned Grammar

The savings above aren't just "terse syntax." They're achieved through deliberate grammar design accounting for BPE tokenizer properties.

The BPE-aligned grammar⁹ principle: every language operator must be exactly one BPE token.

For the optimized language, all 33 operators were verified — each encodes to exactly 1 BPE token on cl100k_base (GPT-4/Claude) and o200k_base¹⁰ (GPT-4o):

Two chars, 1 token:    -> <- |> ++ >> == != <= >= && || ..
One char, 1 token:     ? : . = @ | \ _ , + - * / % < >
Delimiters, 1 token:   ( ) [ ]

This isn't coincidence — it's a design constraint. If an operator doesn't fit in one BPE token, it gets replaced by one that does.

What This Means for LLMs

When an LLM generates code in the optimized language instead of Python, it generates 46% fewer tokens (faster, cheaper), spends 71% less on attention (larger codebases fit), creates no bridge tokens (cleaner constrained decoding), and can't "babble" (minimal syntax prevents bloat).

What's Next

In the next article, we'll look at constrained decoding — the technology that guarantees 100% syntactic correctness. And we'll show why BPE-aligned grammar makes constrained decoding free.

Series: Token Economics of Code

Why Every Token Costs More Than You Think
The Anatomy of BPE: Why Python Wastes 46% of Tokens ← you are here
Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code (coming soon)
Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python (coming soon)
Hindley-Milner for LLMs: Type Inference Without Annotations (coming soon)
Show HN: Synoema — The First Programming Language Designed for LLMs (coming soon)
The Future of Code Generation: From Prompts to Compilation (coming soon)

Glossary

Term	Explanation
BPE	Byte Pair Encoding — algorithm splitting text into tokens by merging frequent character pairs
cl100k_base	GPT-4/Claude's BPE vocabulary with ~100K tokens
o200k_base	GPT-4o's newer BPE vocabulary with ~200K tokens
SentencePiece	Google's BPE alternative used in Llama and open models
Corpus	Massive text dataset for training BPE and LLMs (web, books, code)
Bridge token	BPE token spanning the boundary of two grammar symbols
BPE-aligned grammar	Grammar where every operator = exactly 1 BPE token
Pattern matching	Defining functions by examples: `fac 0 = 1` instead of `if n == 0`
Constrained decoding	Technology forbidding invalid tokens during generation
Haskell	Functional language with minimal syntax, brevity benchmark

BPE (Byte Pair Encoding) — a text compression algorithm invented in 1994 and adapted for LLMs. The idea: find the most frequent pairs of characters in a huge text corpus and merge them into a new symbol. Repeat ~100,000 times. The result is a vocabulary of "subwords" that the model thinks in. ↩
cl100k_base — the specific BPE token vocabulary used by GPT-4 and Claude. Contains ~100,000 tokens. Trained primarily on English internet text. GPT-4o uses a newer vocabulary called o200k_base with ~200,000 tokens. ↩
SentencePiece — Google's alternative BPE implementation, used in Llama and other open models. Works at the Unicode character level instead of bytes, which is better for non-English languages. ↩
Corpus — the massive text dataset used to train the BPE vocabulary (and the LLM itself). Includes web pages, books, articles, GitHub code. Usually hundreds of billions of tokens. Code makes up only ~5–15% of a typical corpus — which is why BPE is optimized for English prose, not Python syntax. ↩
Bridge token — a BPE token that spans the boundary between two grammatical symbols. For example, BPE might merge a space and keyword if into one token. This creates problems for constrained decoding engines, which must "split" the token, distorting the model's probability distribution. More details in the third article. ↩
Constrained decoding — technology that forbids invalid tokens at each generation step. Guarantees syntactically valid output. Covered in detail in the third article. ↩
Haskell — a functional programming language with minimal syntax. Used as the "brevity benchmark" among existing languages. fac 0 = 1 in Haskell and the optimized language look nearly identical, but the optimized language additionally accounts for BPE boundaries. ↩
Pattern matching — defining a function by "examples." Instead of if n == 0: return 1, you write fac 0 = 1 — literally: "factorial of zero is one." The compiler generates the check automatically. Shorter, clearer, and eliminates an entire class of errors. ↩
BPE-aligned grammar — a language design principle where every operator, keyword, and delimiter encodes to exactly one BPE token. This means: no "wasted" tokens on syntax and no bridge tokens. Conventional languages don't account for BPE — they were created long before LLMs. ↩
o200k_base — the newer BPE vocabulary used by GPT-4o. Contains ~200,000 tokens (twice cl100k_base). Better coverage of code and non-English languages, but same underlying principles. ↩

Why Every Token Costs More Than You Think

delimitter — Thu, 02 Apr 2026 15:07:06 +0000

Why Every Token Costs More Than You Think

The Quadratic Price of Attention: How Context Length Is Killing Your AI Budget

Who this is for. If you use ChatGPT, Claude, Copilot, or Cursor to write code, this article explains why the same tasks can cost 2–4× less. No technical background required — all terms are explained inline and in the glossary at the end.

When you ask Claude or GPT to write a sorting function, the model generates ~50 tokens¹ per second. Each token costs fractions of a cent. Seems cheap.

But behind that simplicity lies an engineering reality most people overlook: the cost of each token grows quadratically with context length². If you're working with codebases spanning thousands of lines, this quadratic relationship transforms from a theoretical abstraction into a line item that can double your AI budget.

In this article, I'll show where this cost comes from, why inference — not training — is the dominant consumer of resources, and what can be done about it.

Inference Consumes 90%+ of All Energy

There's a common misconception: the major cost of LLMs³ is training. Training GPT-4 reportedly cost $50–100M. An impressive number.

But training is a one-time capital expense. Inference⁴ is an ongoing operational cost that occurs with every request, every second, for every user.

According to AWS, inference consumes more than 90% of total energy in the LLM lifecycle. The AI inference market is valued at $106 billion in 2025, projected to exceed $250 billion by 2030 at a 19.2% compound annual growth rate.

Every token ChatGPT generates costs OpenAI approximately $0.00012. Sounds negligible. But at billions of daily requests, this adds up to hundreds of millions of dollars per year — and terajoules of electricity.

The Quadratic Trap

Here's the key fact that changes everything.

In a standard transformer⁵ with self-attention⁶, the computational cost of processing a sequence of n tokens is:

Cost(n) = O(n² · d)

where d is the model dimension. This is not a linear relationship. It's quadratic.

What this means in practice:

Context	Relative attention cost
1,000 tokens	1× (baseline)
2,000 tokens	4×
4,000 tokens	16×
8,000 tokens	64×
32,000 tokens	1,024×

Doubling the context increases attention cost fourfold, not twofold. This means reducing context by 50% saves not 50%, but 75% of attention computation.

When a developer sends a 2,000-line Python file (~8,000 tokens) to an LLM for refactoring, the attention cost is 64× higher than for a simple 1,000-token question. And that's just one request.

Real Money

Let's calculate for a typical team.

A team of 10 developers uses an AI assistant (Cursor, Copilot, Claude Code). Each makes an average of 100 requests per day. Average request context: 2,000 input tokens. Average response: 500 output tokens.

At Claude Sonnet 4 pricing ($3/M input, $15/M output):

Input:  10 × 100 × 2,000 = 2M tokens/day × $3/M  = $6/day
Output: 10 × 100 × 500   = 500K tokens/day × $15/M = $7.50/day
Total: ~$13.50/day = ~$405/month

Now imagine expressing the same programs with 46% fewer tokens (I'll show in the next article that this is achievable):

Input:  2M × 0.54 = 1.08M tokens/day × $3/M  = $3.24/day
Output: 500K × 0.54 = 270K tokens/day × $15/M = $4.05/day
Total: ~$7.29/day = ~$219/month

Savings: $186/month for 10 people, or $2,200/year.

For 100 developers: $22,000/year. For 1,000: $220,000. And this is a conservative estimate with a relatively affordable model and moderate workload.

The Energy Dimension

Measurements on LLaMA-65B⁷ (A100 GPUs⁸) show energy consumption in the range of 3–4 joules per output token. On modern H100s with optimized inference engines like vLLM⁹, efficiency has improved roughly 10×, down to ~0.39 J per token. But usage scale has grown even faster.

ChatGPT processes an estimated one billion requests daily. At an average response of 500 tokens:

1B requests × 500 tokens × 0.39 J ≈ 195 GJ/day ≈ 54,000 kWh/day

That's the energy consumption of a small town — every single day. Reducing token count isn't just about saving money. It's a direct reduction in energy consumption and carbon footprint.

The Babbling Problem

The study "Towards Green AI" (2026) found that 3 out of 10 tested models exhibit "babbling" behavior — generating significantly more text than necessary. Suppressing this yielded energy savings of 44% to 89%.

But what if the language the LLM writes code in is designed so that "babbling" is physically impossible?

Python code is inherently verbose. def, return, if/elif/else, commas in lists — all syntactic overhead¹⁰ that consumes tokens without carrying semantic information.

Three Optimization Levers

Lever 1: Representation compression. Express the same program with fewer tokens. This isn't obfuscation — it's grammar design optimized for BPE tokenizers¹¹. Potential: 35–50%.

Lever 2: Constrained decoding¹². Prevent the model from generating syntactically invalid code. Every error = retry = double token spend.

Lever 3: Type guarantees. Type errors account for 33.6% of all failures in LLM-generated code. Type-guided generation¹³ reduces them by 74.8%.

Combining all three levers can yield 60–80% cumulative savings in tokens, money, energy, and time.

What's Next

In the next article, we'll examine how BPE tokenization actually works and why Python syntax wastes 46% of tokens on structural noise.

Series: Token Economics of Code

Why Every Token Costs More Than You Think ← you are here
The Anatomy of BPE: Why Python Wastes 46% of Tokens (coming soon)
Type-Guided Constrained Decoding: How to Stop LLMs from Hallucinating Code (coming soon)
Compilation for LLMs: Cranelift JIT, 4.4× Faster Than Python (coming soon)
Hindley-Milner for LLMs: Type Inference Without Annotations (coming soon)
Show HN: Synoema — The First Programming Language Designed for LLMs (coming soon)
The Future of Code Generation: From Prompts to Compilation (coming soon)

Glossary

Term	Explanation
Token	Smallest text unit for an LLM. Roughly ¾ of a word or 3–4 characters
LLM	Large Language Model — neural network that generates text/code (GPT-4, Claude, Llama)
Inference	Generating a response from a trained model. Happens with every request
Context	Everything the model "sees" — prompt, chat history, files. Measured in tokens
Transformer	Neural network architecture underlying all LLMs. Uses attention mechanism
Self-attention	Mechanism where every token considers all others. Cost: O(n²)
BPE	Byte Pair Encoding — algorithm that splits text into tokens
Constrained decoding	Technology forbidding invalid tokens during generation
GPU	Graphics card for AI computation. NVIDIA H100 is standard for LLM inference
vLLM	Open-source engine for fast LLM serving
Overhead	Parts of code/computation carrying no useful payload

Token — the smallest unit of text an LLM processes. Not a letter, not a word, but a "chunk" of text 1–15 characters long. The word "hello" is 1 token; the code def factorial(n): is 6 tokens. The model doesn't see characters — it sees a sequence of tokens. ↩
Context (context window) — everything the model "sees" at once: your question, previous messages, attached files. Measured in tokens. GPT-4 has a context of up to 128K tokens, Claude up to 200K. The longer the context, the more computation the model needs. ↩
LLM (Large Language Model) — a neural network trained on massive amounts of text that can generate text, code, and answer questions. Examples: GPT-4, Claude, Llama, Gemini. ↩
Inference — the process of using an already-trained model to generate responses. When you type a prompt into ChatGPT and get an answer, that's inference. Unlike training (which happens once), inference happens billions of times per day. ↩
Transformer — the neural network architecture underlying all modern LLMs. Invented at Google in 2017 ("Attention Is All You Need" paper). Its key feature is the "attention" mechanism, which lets the model consider relationships between any words in the text, even distant ones. ↩
Self-attention — a mechanism where every token "looks at" every other token in the context to understand their relationships. This gives transformers their power — but also creates quadratic cost: if there are n tokens, there are n × n pairs to compare. ↩
LLaMA — a family of open-source language models from Meta (Facebook). Available for download and self-hosted deployment, unlike GPT-4. ↩
GPU (Graphics Processing Unit) — originally a graphics card, now used for AI computation. NVIDIA A100 and H100 are specialized GPUs for LLM inference and training. A single H100 costs ~$30–40K and draws 700 watts. ↩
vLLM — an open-source engine for fast LLM serving. Optimizes GPU memory usage through PagedAttention, enabling more simultaneous requests. ↩
Syntactic overhead — parts of code required by the language's syntax but carrying no meaning. For example, Python's def before a function definition and return before a return value are mandatory but contain no information about what the function does. ↩
BPE (Byte Pair Encoding) — the algorithm that splits text into tokens. Used in all modern LLMs. Finds the most frequent pairs of characters in a huge text corpus and merges them into new "subwords." Result: a vocabulary of ~100,000 tokens. Covered in detail in the second article. ↩
Constrained decoding — a technology that forbids the model from choosing invalid tokens at each generation step. If the model is generating JSON, it ensures brackets are closed and commas are in the right places. The same can be done for any language with a formal grammar. ↩
Type-guided generation — an extension of constrained decoding where the model is additionally prevented from generating code with type errors. A second layer of guarantees on top of syntactic ones. ↩