tokencount: Fast GPT token counting CLI

CharlonTank — Tue, 16 Sep 2025 19:55:31 +0000

tokencount is a new Rust CLI that helps you answer a deceptively simple question: how many GPT tokens are hidden across this project? If you build AI features, write prompt-heavy docs, or just keep an eye on context windows, this tool makes the audit painless.

Why I built it

Most token counters either run one file at a time or ignore the filesystem realities of big projects. I wanted something that:

Walks a codebase quickly (parallel Rayon workers + a gitignore-aware directory walker)
Respects .gitignore by default and lets me layer custom --exclude globs
Uses the same encodings as OpenAI models (cl100k_base, o200k_base, etc.)
Gives a useful summary out of the box: per-file counts, totals, percentiles, and top-N offenders
Plays nicely with automation (JSON and NDJSON streaming modes)
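
The "parallel workers" point can be illustrated with the standard library alone. This is a minimal sketch and not tokencount's actual code: the real tool uses Rayon and tiktoken-rs, whereas here std::thread::scope stands in for Rayon, and count_tokens is a hypothetical whitespace-based stand-in for a real tokenizer. The function names and the chunking strategy are assumptions made for illustration.

```rust
use std::thread;

// Hypothetical stand-in for a real BPE tokenizer (tokencount uses tiktoken-rs):
// it simply counts whitespace-separated words.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

// Counts tokens for every input on up to `workers` threads, preserving input order.
fn count_all(inputs: &[&str], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division so every input lands in a chunk; at least 1 keeps chunks() valid.
    let chunk_len = ((inputs.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = inputs
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|text| count_tokens(text))
                        .collect::<Vec<usize>>()
                })
            })
            .collect();
        // Join in spawn order so results line up with the inputs.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<usize>>()
    })
}

fn main() {
    // In-memory strings stand in for file contents.
    let files = ["module Main exposing (main)", "import Html", "", "main = Html.text \"hi\""];
    let counts = count_all(&files, 2);
    let total: usize = counts.iter().sum();
    println!("per-file: {counts:?}, total: {total}");
}
```

Keeping results in input order makes the per-file listing deterministic regardless of which worker finishes first.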

Features at a glance

Blazing fast scan – ignore::WalkBuilder + Rayon for concurrent IO/tokenization
Smart defaults – only scans *.elm unless you add --include-ext flags (good for Elm-heavy repos)
Flexible filtering – combine --include-ext, --exclude, --max-bytes, and --follow-symlinks
Multiple outputs – table, JSON array with summary, or NDJSON stream for pipelines
Rich stats – totals, average per file, and P50/P90/P99 percentiles to spot outliers fast
Quiet/verbose modes – keep CI logs clean or turn on detailed warnings locally
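
To make the filtering rules concrete, here is a rough sketch of how an extension allow-list and a size cap might combine into one accept/reject decision. The Filter struct and its accepts method are hypothetical names, not tokencount's internals, and the exact semantics (case-sensitive extension match, files at exactly the limit being accepted) are assumptions.

```rust
use std::path::Path;

// Hypothetical settings mirroring the --include-ext and --max-bytes flags.
struct Filter {
    include_ext: Vec<String>,
    max_bytes: Option<u64>,
}

impl Filter {
    // Returns true when a file with this path and size should be tokenized.
    fn accepts(&self, path: &Path, size: u64) -> bool {
        // Files without an extension (or with a non-UTF-8 one) are rejected.
        let ext_ok = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map(|ext| self.include_ext.iter().any(|want| want.as_str() == ext))
            .unwrap_or(false);
        // No limit configured means every size passes.
        let size_ok = self.max_bytes.map_or(true, |limit| size <= limit);
        ext_ok && size_ok
    }
}

fn main() {
    let filter = Filter {
        include_ext: vec!["elm".to_string(), "ts".to_string()],
        max_bytes: Some(1_000_000),
    };
    let candidates: [(&str, u64); 3] = [
        ("src/Main.elm", 4_096),
        ("src/app.ts", 2_000_000),
        ("README.md", 512),
    ];
    for (path, size) in candidates {
        println!("{path}: {}", filter.accepts(Path::new(path), size));
    }
}
```

Checking size before reading the file means oversized blobs never need to be loaded into memory.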

Install

cargo install tokencount

Quick tour

# default: scan current directory, only *.elm files, table output
tokencount

# include Elm + TypeScript
tokencount ./frontend --include-ext elm --include-ext ts

# show top 10 largest files by tokens
tokencount --top 10

# machine-readable summary for CI
tokencount --format json > tokens.json

# streaming counts for further processing
tokencount --format ndjson

# sort descending by token count
tokencount --sort tokens

Each run ends with a footer like this:

---
total files: 42
total tokens: 128730
average/file: 3065.00
p50: 812
p90: 7194
p99: 24403

Need only the top offenders? Pass --top N; it works with both --sort tokens and the default path sort.

Under the hood

Ignore handling uses the ignore crate, so .gitignore, .git/info/exclude, and global git ignores are respected automatically. tokencount also excludes common junk folders (node_modules, target, .git) by default, so you don’t have to.
Tokenization relies on tiktoken-rs, so you get the same counts as OpenAI’s cl100k_base/o200k_base encodings.
Error handling is friendly by default: non-UTF-8 files and oversized blobs are skipped with a warning (or silently with --quiet).
Percentiles use a nearest-rank approach and degrade gracefully when there are zero files.
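
For readers unfamiliar with nearest-rank percentiles, the idea can be sketched as follows. This is an illustrative version and not necessarily how tokencount implements it: the function name is hypothetical, and returning None for empty input is one assumed way to "degrade gracefully". The rank is ceil(p/100 * n), computed here with integer arithmetic to avoid floating-point rounding.

```rust
// Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted data.
// Returns None for empty input so callers can print a placeholder instead of panicking.
fn percentile_nearest_rank(sorted: &[u64], p: u64) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    let n = sorted.len() as u64;
    // Integer ceiling of p * n / 100, clamped into 1..=n.
    let rank = ((p * n + 99) / 100).clamp(1, n);
    Some(sorted[(rank - 1) as usize])
}

fn main() {
    // Made-up per-file token counts for illustration.
    let mut counts: Vec<u64> = vec![120, 812, 40, 7194, 24403, 300, 950, 15, 2200, 640];
    counts.sort_unstable();
    for p in [50, 90, 99] {
        match percentile_nearest_rank(&counts, p) {
            Some(value) => println!("p{p}: {value}"),
            None => println!("p{p}: n/a"),
        }
    }
}
```

With these ten sample values the sketch reports p50 = 640, p90 = 7194, and p99 = 24403. Nearest-rank always returns a value that actually occurs in the data, which makes it easy to trace an outlier percentile back to a specific file.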

Roadmap & feedback

I’m exploring:

More encodings (if you need a different tokenizer, open an issue)
Optional HTML/Markdown report outputs
Built-in file size histogram to complement token stats

Repo & issues live here: github.com/CharlonTank/tokencount

If you try tokencount, I’d love to hear how it fits into your prompt engineering workflow or CI pipelines. Reach out in the repo or drop a comment below.