Forem: Phasu Yeneng

Stop Your OpenAI Bill from Exploding: Per-User LLM Budget Caps in Node.js

Phasu Yeneng — Mon, 04 May 2026 14:44:32 +0000

The cost incident that started this

Three weeks after we put our chatbot into production, I opened the OpenAI billing dashboard on a Monday morning and stopped breathing for a second. One session — not one user, one session — had burned through roughly four times the daily budget for the entire app. Over a single afternoon.

The session wasn't malicious. It was a test account someone forgot to log out of, hammering the chat endpoint in the background while reloading a broken page. No rate limit was breached. No alarm fired. No infrastructure metric looked unusual. The only place it showed up was the bill at the end of the month.

That was the day I learned that rate limits and budget limits are not the same thing, and that running an LLM-powered app without a per-user cost cap is roughly the same as putting a credit card behind a public form and hoping nobody fills it in 800 times.

This post walks through the pattern I now use in every Node.js + Express app that talks to OpenAI: track first, then cap, then degrade gracefully, then cache aggressively. It's framework-agnostic, Postgres-backed, and a fresh team can ship it in a single afternoon.

Why "rate limit" is the wrong abstraction

Classic rate limiting nudges you toward thinking in requests per minute. That model works fine for a REST API where every request costs roughly the same. It falls apart for LLM APIs because request count is decoupled from cost.

Compare two requests to the same endpoint:

A 50-token "what's your refund policy?" question → roughly $0.0005
A 50-page pasted document with "summarize this" prompt → roughly $0.30

That's a 600× cost spread on two requests that both count as "1" to a token bucket. If your rate limit is 60 requests/minute, an attacker — or a buggy client, or a curious power user — can drive your bill into triple digits per hour while staying perfectly within rate-limit bounds.

You need to cap the dollar value, not the request count.

Step 1 — Measure before you cap

You cannot cap what you do not measure. The first thing to ship is a single funnel that every LLM call passes through, with structured logging into a real database (not a JSON file, not an analytics tool — something you can JOIN and WHERE against in real time).

Here's the schema I use, simplified:

CREATE TABLE llm_usage_logs (
  id                     BIGSERIAL PRIMARY KEY,
  session_id             TEXT NOT NULL,
  model                  TEXT NOT NULL,
  prompt_tokens          INT  NOT NULL DEFAULT 0,
  cached_prompt_tokens   INT  NOT NULL DEFAULT 0,    -- subset of prompt_tokens that hit OpenAI's cache
  completion_tokens      INT  NOT NULL DEFAULT 0,
  total_tokens           INT  NOT NULL DEFAULT 0,
  prompt_cost_usd        NUMERIC(10,6) NOT NULL DEFAULT 0,
  cached_prompt_cost_usd NUMERIC(10,6) NOT NULL DEFAULT 0,  -- billed at ~50% of full prompt rate
  completion_cost_usd    NUMERIC(10,6) NOT NULL DEFAULT 0,
  total_cost_usd         NUMERIC(10,6) NOT NULL DEFAULT 0,
  finish_reason          TEXT,
  response_time_ms       INT,
  created_at             TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX llm_usage_logs_session_time_idx
  ON llm_usage_logs (session_id, created_at DESC);

Three non-obvious choices:

Store cost in USD as NUMERIC, not as cents in an integer. Token-priced cost has 4–6 significant decimal digits. If you store cents, most short calls round to zero and the sums become meaningless.
Index on (session_id, created_at DESC). Every "is this user over budget?" query scans recent rows for a session. Without this index it's a sequential scan, and you'll regret it the day usage spikes.
Provision the cached-token columns from day one. Even if you're not using prompt caching yet, adding cached_prompt_tokens and cached_prompt_cost_usd up front saves you a migration later — they default to 0 and Step 2 wires them up.

Then a single logging function:

async function logLLMCall({
  session_id, model,
  prompt_tokens, completion_tokens, total_tokens,
  finish_reason, response_time_ms,
}) {
  // All prices below are USD per 1M tokens (NOT per 1K).
  const promptPricePerM     = parseFloat(process.env.LLM_PROMPT_PRICE_PER_M)     || 2.50;
  const completionPricePerM = parseFloat(process.env.LLM_COMPLETION_PRICE_PER_M) || 10.00;

  const promptCost     = (prompt_tokens     / 1_000_000) * promptPricePerM;
  const completionCost = (completion_tokens / 1_000_000) * completionPricePerM;
  const totalCost      = promptCost + completionCost;

  await pg.query(
    `INSERT INTO llm_usage_logs
       (session_id, model, prompt_tokens, completion_tokens, total_tokens,
        prompt_cost_usd, completion_cost_usd, total_cost_usd,
        finish_reason, response_time_ms)
     VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)`,
    [session_id, model, prompt_tokens, completion_tokens, total_tokens,
     promptCost, completionCost, totalCost, finish_reason, response_time_ms]
  );
}

Step 2 — Get the cost math right

Three things that bite people in cost calculation:

Different pricing per model. At list prices of $2.50 vs $0.15 per million input tokens, GPT-4o costs roughly 16× what GPT-4o-mini costs. Don't hardcode prices; pull them from env vars (or a small model_pricing table) keyed by model name. When OpenAI announces new pricing (and they will), you change config, not code.

Trust the API's token count, not your local one. Tokenizers like tiktoken are close to what the API actually charges, but not identical. The number that matters is response.usage.{prompt,completion}_tokens returned in the API response. Log that, not your local pre-call estimate.

Streaming responses still report usage — if you ask for it. With stream: true, you must pass stream_options: { include_usage: true } to get a final usage chunk. Many people miss this and end up logging 0 tokens for every streamed call, which silently zeroes their cost dashboard.

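const startedAt = Date.now();
const stream = await openai.chat.completions.create({
  model, messages,
  stream: true,
  stream_options: { include_usage: true },
});

let usage;
let finish_reason = null;
for await (const chunk of stream) {
  if (chunk.usage) usage = chunk.usage;        // arrives in the final chunk, whose choices array is empty
  const choice = chunk.choices[0];             // undefined on that final usage-only chunk
  if (choice?.finish_reason) finish_reason = choice.finish_reason;
  // ... yield choice?.delta to client
}

// usage stays undefined if the stream is interrupted before the final chunk
if (usage) {
  await logLLMCall({
    session_id, model, ...usage,
    finish_reason, response_time_ms: Date.now() - startedAt,
  });
}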
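For the first point above (per-model pricing), here is a minimal sketch of a lookup keyed by model name. The names DEFAULT_PRICING, loadPricing, priceFor, and the LLM_PRICING_JSON environment variable are illustrative assumptions, not part of the original code, and the prices are placeholders that should be checked against the current price list.

```javascript
// Hypothetical pricing table: USD per 1M tokens, keyed by model name.
// The numbers are placeholders; verify them against current list prices.
const DEFAULT_PRICING = {
  'gpt-4o':      { prompt: 2.50, cachedPrompt: 1.25,  completion: 10.00 },
  'gpt-4o-mini': { prompt: 0.15, cachedPrompt: 0.075, completion: 0.60 },
};

// Optional override via an assumed env var holding the same shape as JSON,
// so a price change is a config change rather than a code change.
function loadPricing(env = process.env) {
  if (!env.LLM_PRICING_JSON) return DEFAULT_PRICING;
  return { ...DEFAULT_PRICING, ...JSON.parse(env.LLM_PRICING_JSON) };
}

function priceFor(model, pricing = loadPricing()) {
  const entry = pricing[model];
  // Fail loudly on unknown models: a silent default would mis-price them.
  if (!entry) throw new Error(`No pricing configured for model "${model}"`);
  return entry;
}
```

One thing to watch if you key by the model name returned in the API response: it can be a dated snapshot name rather than the alias you requested, so the table may need an entry for each.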

Don't forget prompt caching — it changes the math

OpenAI prompt caching kicks in automatically once your prompt prefix is long enough (1,024 tokens or more, per OpenAI's documentation) and reused, and those cached prompt tokens come back at roughly half price. They show up under usage.prompt_tokens_details.cached_tokens. If you ignore the field, your dashboard can overstate spend by 20–30% (about 26% in the example below), and worse, you'll under-credit the optimizations you're actually doing.

Three-rate calculation:

function calculateCost(usage) {
  const cached    = usage.prompt_tokens_details?.cached_tokens || 0;
  const promptRaw = (usage.prompt_tokens || 0) - cached;          // billed at full rate
  const completion = usage.completion_tokens || 0;

  // All prices below are USD per 1M tokens.
  const fullPromptPerM   = parseFloat(process.env.LLM_PROMPT_PRICE_PER_M)        || 2.50;
  const cachedPromptPerM = parseFloat(process.env.LLM_CACHED_PROMPT_PRICE_PER_M) || 1.25;  // ~50% off full
  const completionPerM   = parseFloat(process.env.LLM_COMPLETION_PRICE_PER_M)    || 10.00;

  return {
    prompt_cost_usd:        (promptRaw  / 1_000_000) * fullPromptPerM,
    cached_prompt_cost_usd: (cached     / 1_000_000) * cachedPromptPerM,
    completion_cost_usd:    (completion / 1_000_000) * completionPerM,
    total_cost_usd:
      (promptRaw  / 1_000_000) * fullPromptPerM +
      (cached     / 1_000_000) * cachedPromptPerM +
      (completion / 1_000_000) * completionPerM,
  };
}

Concrete example: a 4,000-token system prompt with a 2,000-token cache hit, plus 200 completion tokens on gpt-4o:

Component	Tokens	Rate (per 1M)	Cost
Prompt (uncached)	2,000	$2.50	$0.005000
Prompt (cached)	2,000	$1.25	$0.002500
Completion	200	$10.00	$0.002000
Total			$0.0095

Without caching that same call would cost $0.0120, so caching saves about 21% (put the other way, the uncached call is about 26% more expensive). The cached_prompt_cost_usd column in the llm_usage_logs schema from Step 1 lets you track caching ROI directly.
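The numbers in the table can be reproduced with a short script. This is a self-contained sketch that uses explicit prices instead of environment variables; PRICES and costOf are illustrative names.

```javascript
// Self-contained version of the three-rate calculation with explicit prices
// (USD per 1M tokens) instead of environment variables.
const PRICES = { prompt: 2.50, cachedPrompt: 1.25, completion: 10.00 };

function costOf(usage, prices = PRICES) {
  const cached = usage.prompt_tokens_details?.cached_tokens || 0;
  const uncached = (usage.prompt_tokens || 0) - cached;
  const completion = usage.completion_tokens || 0;
  return (
    (uncached / 1_000_000) * prices.prompt +
    (cached / 1_000_000) * prices.cachedPrompt +
    (completion / 1_000_000) * prices.completion
  );
}

const withCache = costOf({
  prompt_tokens: 4000,
  prompt_tokens_details: { cached_tokens: 2000 },
  completion_tokens: 200,
});
const withoutCache = costOf({ prompt_tokens: 4000, completion_tokens: 200 });

console.log(withCache.toFixed(4));     // 0.0095
console.log(withoutCache.toFixed(4));  // 0.0120
// Share of the uncached cost that caching removed, as a percentage.
console.log(((1 - withCache / withoutCache) * 100).toFixed(1));  // 20.8
```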

A note on currency

If you operate in a non-USD currency (we report in THB internally), store the canonical cost as USD and convert at query time. Exchange rates drift; locking yesterday's rate into the row makes month-over-month comparisons quietly lie to you.

Step 3 — Per-session budget middleware

Now that costs are visible, add an Express middleware that runs before the LLM call:

async function budgetGuard(req, res, next) {
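  const sessionId = req.session?.id || req.body?.session_id;   // body fallback is prototype-only, see note below
  if (!sessionId) {
    return res.status(400).json({ error: 'missing_session' });
  }

  const caps = {
    anonymous: 0.50,
    free:      2.00,
    paid:     20.00,
    internal:100.00,
  };
  // Unknown tiers fall back to the smallest cap instead of bypassing the check.
  const dailyCapUsd = caps[req.user?.tier] ?? caps.anonymous;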

  const { rows } = await pg.query(
    `SELECT COALESCE(SUM(total_cost_usd), 0)::float AS spent
       FROM llm_usage_logs
      WHERE session_id = $1
        AND created_at > NOW() - INTERVAL '24 hours'`,
    [sessionId]
  );

  const spent = rows[0].spent;
  req.budget = { spent, cap: dailyCapUsd, remaining: dailyCapUsd - spent };

  if (spent >= dailyCapUsd) {
    return res.status(429).json({
      error: 'daily_budget_exceeded',
      message: 'You have hit your daily usage limit. Please try again later.',
      spent_usd: spent,
      cap_usd: dailyCapUsd,
    });
  }

  next();
}

app.post('/chat', budgetGuard, chatHandler);

⚠️ Security note on the identity you cap against. The example above falls back to req.body.session_id for clarity, but in production never trust an identifier the client can rotate. A hostile (or just curious) client can change session_id on every request and dodge the cap entirely. Pull the identity from a verified source: a signed cookie session, a JWT subject claim, or the authenticated user object set by your auth middleware. Treat the body fallback as prototype-only — replace it with the real authenticated principal before anything ships.
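A minimal sketch of what "pull the identity from a verified source" could look like. The helper name resolveBudgetIdentity and the req.user.id and req.user.tier fields are assumptions for illustration; the sketch also assumes req.session comes from a signed-cookie session middleware and that req.user is only set after your auth middleware has verified the caller.

```javascript
// Hypothetical helper: pick the identity to budget against, using only
// values the server has verified. It never reads req.body.
function resolveBudgetIdentity(req) {
  // req.user is assumed to be set by your auth middleware after it has
  // verified a JWT or session; `id` and `tier` are illustrative field names.
  if (req.user?.id) {
    return { key: `user:${req.user.id}`, tier: req.user.tier || 'free' };
  }
  // req.session.id is assumed to come from a signed session cookie,
  // so the client cannot forge an arbitrary value.
  if (req.session?.id) {
    return { key: `session:${req.session.id}`, tier: 'anonymous' };
  }
  return null; // no verified identity: the caller should reject the request
}
```

An anonymous client can still discard its cookie to get a fresh session, so it is probably worth pairing the anonymous tier with a coarser cap (per IP or app-wide) as well.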

A few real-world refinements:

Tier the cap. Anonymous traffic gets the smallest budget, paid users get more. Don't give every visitor a $20/day allowance by default.
24h sliding window, not calendar day. A user who maxes out at 23:59 shouldn't get a fresh budget at 00:00.
Attach the budget object to req. Downstream handlers can read req.budget.remaining to make smarter decisions — see the next section.

Step 4 — Soft cap → hard cap → fallback (the 3-tier pattern)

A single threshold is too binary. Instead, treat the cap as three zones:

Zone	Trigger	Action
Green	< 80% of cap	Use the premium model. Business as usual.
Yellow	80% to < 100%	Log a warning, ping `#alerts` on Slack, degrade to a cheaper model (e.g. `gpt-4o-mini`). User keeps getting answers.
Red	≥ 100%	Hard-refuse with a friendly message.

The decision flow looks like this:

flowchart TD
    A[Incoming chat request] --> B[Look up 24h spend for user]
    B --> C{spent < 80% of cap?}
    C -- yes --> D["🟢 GREEN<br/>use premium model"]
    C -- no --> E{spent < 100% of cap?}
    E -- yes --> F["🟡 YELLOW<br/>fall back to cheap model<br/>+ Slack alert"]
    E -- no --> G["🔴 RED<br/>return 429<br/>budget_exceeded"]
    D --> H[Call LLM, log usage, respond]
    F --> H

function pickModel(req) {
  const usage = req.budget.spent / req.budget.cap;
  if (usage >= 1.0) return null;            // upstream middleware already 429'd
  if (usage >= 0.8) return 'gpt-4o-mini';   // graceful degradation
  return 'gpt-4o';
}

The Yellow zone is the trick that makes this user-friendly instead of abusive. The product still works at 90% of cap; it's just running on a cheaper engine. Most users won't notice. Engineers who wake up to the Slack alert can investigate before the wall is hit.

Step 5 — Caching is the largest cost cut you can make

Logging and capping reduce runaway spend. Caching reduces baseline spend, often by 20–50%. Two layers, in order of effort:

Exact-match cache (cheap). Normalize the question (lowercase, trim, collapse whitespace). If you've answered the exact same thing in the last 24h, return the stored answer.

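// Assumes llm_usage_logs has two extra columns not shown in the Step 1 schema:
// question (stored in normalized form) and answer.
function normalizeQuestion(question) {
  return question.toLowerCase().trim().replace(/\s+/g, ' ');
}

async function findExactMatch(question) {
  const { rows } = await pg.query(
    `SELECT answer FROM llm_usage_logs
      WHERE question = $1 AND answer IS NOT NULL
        AND created_at > NOW() - INTERVAL '24 hours'
      ORDER BY created_at DESC
      LIMIT 1`,
    [normalizeQuestion(question)]
  );
  return rows[0]?.answer || null;
}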

Semantic cache (high ROI). Embed the incoming question, look for past questions with cosine similarity ≥ 0.95, return their stored answers. This catches "what is your refund policy?" vs "how do refunds work?", which are textually different but semantically identical. If your hit rate is 30%, you skip roughly 30% of your LLM calls from the day you ship it, at the price of one cheap embedding call per question.

A practical tip: don't cache personalized or stateful answers. Cache the FAQ-style stuff. A simple cacheable: true flag on the prompt template handles this cleanly without leaking one user's data to another.
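A minimal in-memory sketch of the semantic lookup described above. The function names and the shape of the cache entries ({ embedding, answer, cacheable }) are assumptions for illustration; a production version would query a vector index rather than scan an array, and the embeddings would come from whichever embedding model you use.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory lookup. `entries` is an array of
// { embedding, answer, cacheable } objects; a real system would use a
// vector index instead of a linear scan.
function findSemanticMatch(queryEmbedding, entries, threshold = 0.95) {
  let best = null;
  let bestScore = threshold;
  for (const entry of entries) {
    if (!entry.cacheable) continue; // skip personalized or stateful answers
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best ? { answer: best.answer, score: bestScore } : null;
}
```

The 0.95 threshold is the one quoted in the text; it is worth tuning against real traffic, since a threshold that is too low returns the wrong cached answer instead of simply missing.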

Step 6 — Observability you'll actually look at

Build one endpoint that returns daily spend per model for the last 30 days:

app.get('/api/llm/budget', async (req, res) => {
  const { rows } = await pg.query(`
    SELECT
      DATE(created_at)            AS date,
      model,
      COUNT(*)::int               AS calls,   -- cast: node-postgres returns bigint as a string
      SUM(total_tokens)::float    AS tokens,
      SUM(total_cost_usd)::float  AS cost_usd
    FROM llm_usage_logs
    WHERE created_at > NOW() - INTERVAL '30 days'
    GROUP BY DATE(created_at), model
    ORDER BY date DESC
  `);
  res.json({ daily: rows });
});

Pipe it into a small HTML dashboard or a Grafana board. Set one alert worth its salt: >30% above the 7-day rolling average triggers a Slack ping. That single alert has caught every cost incident I've had since I shipped it.
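The alert rule above can be sketched as a small pure function. The name checkSpendAlert and its argument shapes are illustrative assumptions; the daily totals would come from the endpoint above, and posting to Slack is left out.

```javascript
// Returns an alert decision for today's spend against the trailing average.
// `previousDays` is an array of daily totals in USD (ideally the last 7 days),
// `today` is today's total so far. Both names are illustrative.
function checkSpendAlert(today, previousDays, thresholdRatio = 1.3) {
  if (previousDays.length === 0) return { alert: false, average: 0, ratio: null };
  const average = previousDays.reduce((sum, d) => sum + d, 0) / previousDays.length;
  // With no historical spend, any spend at all is treated as unusual.
  if (average === 0) return { alert: today > 0, average, ratio: null };
  const ratio = today / average;
  return { alert: ratio > thresholdRatio, average, ratio };
}
```

Comparing a partial day against full-day averages will under-alert in the morning, so one option is to run the check on the trailing 24 hours instead of the calendar day.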

A simple version of the dashboard looks roughly like this:

┌──────────────────────────────────────────────────────────────────┐
│  LLM SPEND — last 7 days                              ↻ refresh  │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   $12 ┤                                            ▇             │
│   $10 ┤             ▇                              ▇             │
│    $8 ┤      ▇      ▇      ▇             ▇        ▇             │
│    $6 ┤      ▇      ▇      ▇      ▇      ▇        ▇             │
│    $4 ┤▇     ▇      ▇      ▇      ▇      ▇        ▇             │
│    $2 ┤▇     ▇      ▇      ▇      ▇      ▇        ▇             │
│    $0 ┴┴─────┴──────┴──────┴──────┴──────┴────────┴───           │
│       Mon   Tue    Wed    Thu    Fri    Sat      Sun ⚠ +47%      │
│                                                                  │
│  Today          $11.83   ████████████████  (cap $20)             │
│  Top session    sess_8f2 $4.12  • 312 calls  • gpt-4o            │
│  Cache hit      31.4%    (≈ $5.20 saved today)                   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

The three numbers I look at every morning: today's total, the highest-spend session, and cache hit rate. Anything weird shows up in one of those three.

Pitfalls I learned the hard way

Tool calls aren't always counted the way you expect. Older SDK versions occasionally return 0 for completion_tokens when the response is a tool call instead of plain text. Always log the full usage object and verify against your dashboard once a week.

Retries double-count. If your HTTP client retries a 5xx, you can log the same call twice. Pass an idempotency key — the request ID works — and dedupe on it.

Logging on the hot path. Don't await the log insert before responding to the user; the user is waiting on you. Fire-and-forget with a .catch() handler (a try/catch will not catch a rejection from a promise you never await), or push to a small in-memory queue and drain in the background. But bound the queue size: an unbounded queue is just a memory leak waiting for traffic.

Long system prompts compound. A 4,000-token system prompt × 10 messages per session × 1,000 sessions/day is 40M prompt tokens, even if every user message is short. Use OpenAI's prompt caching where supported, and trim the system prompt aggressively. Most "best practice" system prompts on the internet are 3× longer than they need to be.

Cached input is cheaper, and the field is easy to miss. Prompt caching lives in usage.prompt_tokens_details.cached_tokens, not at the top level. We covered the math in Step 2; the practical pitfall is that older code paths often only read prompt_tokens and silently price cached tokens at the full rate, about double what they actually cost. Audit every place that reads usage.
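For the hot-path logging pitfall above, a bounded in-memory queue could look roughly like this. createLogQueue, maxSize, and the shape of the returned object are illustrative assumptions; writeFn stands in for whatever persists one entry, for example the logLLMCall function from Step 1.

```javascript
// Hypothetical bounded, fire-and-forget log queue. `writeFn` is whatever
// persists one entry (for example the logLLMCall function from Step 1).
function createLogQueue(writeFn, { maxSize = 1000 } = {}) {
  const queue = [];
  let draining = false;
  let dropped = 0;

  async function drain() {
    if (draining) return;
    draining = true;
    while (queue.length > 0) {
      const entry = queue.shift();
      try {
        await writeFn(entry);
      } catch (err) {
        // A failed log write must never crash the request path.
        console.error('llm usage log failed:', err);
      }
    }
    draining = false;
  }

  return {
    // Returns false when the queue is full and the entry was dropped.
    enqueue(entry) {
      if (queue.length >= maxSize) {
        dropped++;
        return false;
      }
      queue.push(entry);
      drain(); // not awaited: the caller responds to the user immediately
      return true;
    },
    stats: () => ({ pending: queue.length, dropped }),
  };
}
```

Dropping entries when the queue is full is a deliberate trade-off in this sketch: the budget check will undercount during an overload, so the dropped counter is worth exposing on your dashboard.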

What this gets you

After shipping the full pattern in our app:

No more cost surprises. The worst possible day is now bounded by cap_per_user × active_users. We can math it out before launching a feature.
Abuse is a non-event. The same loop that burned through a daily budget in three hours now hits the 429 in twelve minutes and stays quiet.
Spend is steerable. When we want to spend less, we tighten the Yellow threshold. When we want to spend more on quality, we raise it. It's a knob, not a panic button.
Engineers stopped being scared of LLM features. "What if it gets expensive?" finally has an answer.

Where to go next

The version above is the minimum viable pattern. Once it's running, the natural extensions are:

Per-feature budgets — chat, summarization, and embedding refresh each get their own cap.
Anomaly detection — alert on cost-per-session 3σ above the mean instead of a fixed threshold.
Auto-throttle — when the daily org-wide cap is approaching, slow down lower-tier users first.
Pre-flight estimation — refuse a 200KB pasted blob before it hits the API, not after.

Each is a follow-up post. The foundation — log, cap, degrade, cache — is what stops the bleeding tonight.

If you ship this pattern and it saves you money, I'd love to hear what your hit rate on the semantic cache ended up being. That number was the most surprising thing for me — much higher than I expected.

9 Verified Tools to Stop Burning Claude Tokens Unnecessarily

Phasu Yeneng — Mon, 20 Apr 2026 14:03:30 +0000

You're not using Claude more — you're just wasting more context.

I went looking for real, working tools after seeing a widely-shared list that mixed legitimate repos with hallucinated ones. This article only covers tools I could verify on GitHub, organized by the type of waste they fix.

Why tokens disappear faster than you expect

Before the tools: understanding where tokens actually go.

Most developers assume their prompts are the main cost. They're not. The real culprits are:

Verbose model output — Claude explaining what it's about to do, then doing it, then summarizing what it did
Raw CLI output — dumping full git log, npm install, or test runner output directly into context
Bloated CLAUDE.md — this file loads on every turn before Claude reads a single line of your code. A 5,000-token CLAUDE.md costs 5,000 tokens per message, before you've typed a word
Code navigation by content — when Claude reads entire files to find one function instead of navigating by symbol
Ghost tokens — leftover context from earlier in the session that no longer contributes to the task but still costs money every turn

Each category has a different fix. Here's what actually works.

Category 1 — Shrink what Claude writes back

Caveman

The simplest idea with surprisingly good results: make Claude talk like a caveman. Short words. No filler. Still technically accurate.

It ships as a Claude Code skill that cuts ~65–75% of output tokens while preserving full technical accuracy. The compression is aggressive but the signal stays intact: you get "fix auth bug in login.js line 42" instead of three paragraphs explaining what the fix does.

Works as a plugin for Cursor, Windsurf, Cline, and others too.

# install as a Claude Code skill
# see: https://github.com/juliusbrussee/caveman

Best for: Long coding sessions where Claude's explanations are eating your context budget.

claude-token-efficient

Single CLAUDE.md file. Drop it into your project. Done.

It bakes response-terseness rules directly into Claude's instructions, forcing shorter output on heavy workflows without you having to prompt for it every time. No code changes, no new dependencies.

Best for: Projects where you want a permanent "be concise" baseline without running an extra tool.

Category 2 — Compress what you send in

RTK (Rust Token Killer)

A CLI proxy that filters terminal output before it reaches Claude. Instead of Claude seeing 2,000 tokens of raw git status, it sees ~200 tokens of the relevant parts.

The hook transparently rewrites shell commands — git status becomes rtk git status — and Claude never sees the rewrite, just the compressed result.

# without rtk
git status  → ~2,000 tokens raw output

# with rtk
rtk git status  → ~200 tokens filtered output

Claims 60–90% reduction on common dev commands. Single Rust binary, zero dependencies.

Best for: Teams running agentic workflows where Claude executes a lot of shell commands.

Headroom

A context optimization layer that sits between your app and the LLM. Three compression modes:

SmartCrusher — JSON compression
CodeCompressor — AST-aware code compression (understands structure, not just text)
Kompress-base — general text compression

The AST-aware approach is the interesting one. It doesn't just truncate code — it understands which parts of a file are structurally relevant and compresses accordingly.

Best for: Applications (not just Claude Code) that programmatically build context before sending to any LLM.

claude-token-optimizer

Reusable CLAUDE.md setup prompts that structure your documentation so Claude only loads what it needs per task.

One real-world example from the repo: a RedwoodJS project reduced session start from 11,000 tokens down to 1,300 by restructuring which docs load when.

Best for: Projects with large documentation that Claude currently loads all at once.

Category 3 — MCP-level optimization

Token Optimizer MCP

Intelligent token optimization for Claude Code via MCP. Claims 95%+ reduction through caching, compression, and smart tool intelligence — meaning it tracks which tools Claude actually uses and optimizes the tool definitions it sends.

Best for: Claude Code users running MCP-heavy workflows.

PromptThrift MCP

Compresses conversation history using a local Gemma 4 model (runs on your machine, no extra API cost) or heuristic fallback. Key feature: pinned facts — you can mark specific context as protected so it survives compression.

Claims 70–90% reduction on conversation history while keeping critical context intact.

Best for: Long multi-turn sessions where early context becomes expensive dead weight.

Token Savior

MCP server specifically built for code navigation. Instead of Claude reading entire files to find a function, it indexes your codebase by symbol — functions, classes, imports, call graph — and navigates by pointer.

Also includes a persistent memory engine that stores decisions, conventions, and session summaries in SQLite and re-injects them as a compact delta at session start.

Claims 97% reduction on code navigation across 170+ real sessions.

Best for: Large codebases where Claude spends significant tokens just finding the right file and function.

Category 4 — Clean up ghost tokens

Token Optimizer

Targets "ghost tokens" — context that's still technically in the window but no longer relevant to the current task. Also helps survive context compaction without losing quality.

More of a diagnostic and cleanup tool than a compression layer.

Best for: Sessions that run long and accumulate stale context from earlier tasks.

The free fix most people skip

Before installing anything: audit your CLAUDE.md.

According to Claude Code's official documentation, CLAUDE.md loads before every response, before Claude reads your code, before anything else. It's the most expensive file in your project on a per-token basis.

The recommended limit is under 200 lines. If yours is longer, move sections into on-demand skill files that only load when invoked. Most CLAUDE.md files I've seen in the wild are 3–5x longer than they need to be.

Summary

Tool	What it fixes	Claimed reduction
Caveman	Verbose model output	~65–75% output tokens
claude-token-efficient	Verbose model output	Drop-in terseness
RTK	Raw CLI output in context	60–90% on shell commands
Headroom	Input context size	AST-aware compression
claude-token-optimizer	Doc structure / loading	11k → 1.3k session start
Token Optimizer MCP	MCP tool definitions	95%+
PromptThrift MCP	Conversation history	70–90%
Token Savior	Code navigation	~97%
Token Optimizer	Ghost tokens / session health	Varies

The numbers above come from each project's own documentation — treat them as upper bounds under ideal conditions, not guarantees.

Pick one from Category 1 and one from Category 2. Those two changes alone will have the most impact on day-to-day cost for most workflows.

Why your RAG chatbot fails in Thai — and how to fix it

Phasu Yeneng — Sun, 19 Apr 2026 10:08:22 +0000

A real-world walkthrough of how we built a customer service chatbot for a Thai e-commerce company — and the chunking problem nobody warns you about.

When I started building a RAG (Retrieval-Augmented Generation) chatbot for a Thai e-commerce company, I made the same mistake every developer makes: I copied the LangChain quickstart example, set chunk_size=500, and expected things to just work.

They didn't.

This is the story of why naive chunking fails for Thai text, what we built instead, and the full pipeline from PDF product manuals to chatbot answers — using Python, Qdrant, and OpenAI.

The Problem Nobody Warns You About

Most RAG tutorials are written with English in mind. The chunking logic looks like this:

# Works fine for English
chunks = text.split('. ')
# or
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

This works because English has clear word boundaries — spaces between every word. When you split on periods or character count, you still get coherent, searchable chunks.

Thai is completely different.

Thai has no spaces between words.

This sentence — "ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ" — means "Our store has many product categories to choose from." But to a naive chunker, it looks like one enormous, unsplittable blob. There are 7 meaningful words in there, with zero whitespace between them.

Here's what happens when you embed that raw blob versus properly tokenized words:

Input to embedding model	What it sees
`ร้านค้าของเรามีสินค้าหลายหมวดหมู่ให้เลือกซื้อ`	One opaque token sequence
`ร้านค้า | ของเรา | …` (the same sentence, word-segmented)	Separate word units, one per concept

The second form produces embeddings that actually capture the meaning of each concept — "store", "product", "category" — which leads to better retrieval when a user asks "มีสินค้าหมวดหมู่ไหนบ้าง" (what product categories are available?).

The Pipeline We Built

Here's the full architecture:
{% raw %}

PDF product manuals / FAQ documents
    |
Python (PyMuPDF) → extract raw text
    |
Sentence splitting by '. '
    |
[Stored in MongoDB as raw sentences]
    |
Python → pythainlp tokenization
    |
OpenAI text-embedding-3-small
    |
Qdrant vector database (cosine similarity, 1536 dims)
    |
User query → tokenize → embed → search → top-7 chunks
    |
GPT-4o-mini + context → answer

Let's walk through each step with real code. Here are the dependencies we'll use:

# requirements.txt
pymupdf==1.27.2.2
pythainlp==5.2.0
openai==2.32.0
qdrant-client==1.17.1
pymongo==4.10.1

Step 1 — Extract Text from PDF

We use PyMuPDF (the fitz library) instead of PyPDF2 because it handles Thai character encoding much more reliably.

# app/python/PdfToSentences.py
import pymupdf as fitz  # PyMuPDF 1.27+ (legacy: import fitz)
import re
import uuid
import requests

def extract_sentences_from_pdf(pdf_path):
    pdf_file = fitz.open(pdf_path)
    text = ""
    for page in pdf_file:
        text += page.get_text("text")

    # Split on English period + space — works for mixed Thai/English documents
    sentences = [sentence.strip() for sentence in text.split('. ') if sentence.strip()]
    return sentences

def clean_text(text):
    cleaned_text = re.sub(r'•', '', text)  # Remove bullet points
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text

Two things to note here:

Why PyMuPDF over PyPDF2? Thai PDF documents often use non-standard font encodings. PyMuPDF handles these much better — with PyPDF2 you'd frequently get garbled output or empty strings for Thai text blocks. Note: as of PyMuPDF 1.24+, the recommended import is import pymupdf (the old import fitz still works but is considered legacy).

Why split on '. ' (period + space)? Our documents are mixed Thai/English: product names, SKUs, and technical specs are often in English, while descriptions are Thai. The period-space split is a pragmatic middle ground that preserves Thai paragraphs as single chunks rather than fragmenting them randomly at character 500.

⚠️ Limitation: Formal Thai text often ends paragraphs with a line break rather than a period. If your PDFs have no periods at all, text.split('. ') will return one giant chunk for the whole document, because the function above concatenates every page before splitting. In that case, use pythainlp's sentence tokenizer instead:
from pythainlp.tokenize import sent_tokenize
sentences = sent_tokenize(text, engine="crfcut")

Step 2 — Thai Word Tokenization Before Embedding

This is the most important step, and the one that differs most from English RAG.

Before sending any Thai text to the embedding model, we tokenize it with pythainlp:

# thai_tokenizer.py
from pythainlp.tokenize import word_tokenize

def word_cut(text: str) -> str:
    tokens = word_tokenize(text, engine="newmm")
    # Join with pipe separator so the embedding model sees distinct units
    return "|".join(tokens)

pythainlp uses a dictionary-based approach (newmm engine) to segment Thai text into individual words:

Input:  "สินค้าอิเล็กทรอนิกส์ราคาถูกส่งฟรี"
Output: "สินค้า|อิเล็กทรอนิกส์|ราคาถูก|ส่งฟรี"

Now the embedding model sees four distinct semantic units instead of one long string. The cosine similarity between "ส่งฟรี" (free shipping) and a user's query "จัดส่งฟรีไหม" (is shipping free?) will be much higher and more meaningful after proper tokenization.

We also tried attacut (a neural-network-based engine in pythainlp) but settled on newmm for its speed and dictionary coverage — important when your domain includes product jargon and Thai promotional phrases like "ลดราคา", "ส่งฟรี", "ผ่อนชำระ".

Step 3 — Generate and Store Embeddings

We use OpenAI's text-embedding-3-small for embeddings — the current-generation model that replaced text-embedding-ada-002. It scores 44% on the MIRACL multilingual benchmark vs 31.4% for the old model, and costs 5x less. The key is that we tokenize before embedding — not after:

# ingest_embeddings.py
from thai_tokenizer import word_cut
from openai_module import create_embedding

for item in data:
    # ✅ Tokenize Thai text FIRST
    tokenized = word_cut(item["keyword"])

    # Then embed the tokenized version
    result = create_embedding(tokenized)

    if result["status"]:
        sentence = {
            "id": item["id"],
            "sentence": item["text"],      # store original for display
            "keyword": item["keyword"],    # store original keyword
            "embeded": result["embed"],    # embed the tokenized version
        }
        sentences_collection.insert_one(sentence)

Notice we store the original text as the payload but create the embedding from the tokenized version. This way, when a match is found, the chatbot returns the human-readable original sentence — not the pipe-separated tokenized form.

The embedding function itself:

# openai_module.py
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
MAX_INPUT_LENGTH = 10000

def create_embedding(text: str) -> dict:
    if len(text) > MAX_INPUT_LENGTH:
        return {"status": False, "message": "Text too long"}

    response = client.embeddings.create(
        model="text-embedding-3-small",  # replaces text-embedding-ada-002
        input=text,
        dimensions=1536,                 # if you change this, update Qdrant collection size too!
    )
    return {
        "status": True,
        "embed": response.data[0].embedding,
    }

Step 4 — Qdrant as the Vector Store

We use Qdrant running in Docker as our vector database. It's fast, lightweight, and the REST API is straightforward to call with Python's requests:

# qdrant_module.py
import os
import requests

QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")

def create_rag_collection(collection_name: str, vector_size: int):
    requests.put(
        f"{QDRANT_URL}/collections/{collection_name}",
        json={
            "vectors": {
                "chatgpt_vector": {
                    "size": vector_size,  # 1536 for text-embedding-3-small (default)
                    "distance": "Cosine",
                }
            }
        },
    )

def search(collection_name: str, vector: dict, limit: int = 5) -> dict:
    response = requests.post(
        f"{QDRANT_URL}/collections/{collection_name}/points/search",
        json={
            "vector": vector,
            "limit": limit,
            "with_payload": True,
        },
    )
    return response.json()
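The module above creates the collection and searches it, but the step that writes vectors into Qdrant is not shown. Below is a minimal sketch of the request body for Qdrant's points upsert endpoint (PUT /collections/{collection_name}/points), assuming the named vector chatgpt_vector from above. build_upsert_body and the sample record are hypothetical names for illustration; the actual ingest code may look different.

```python
# Hypothetical helper: builds the JSON body for
# PUT {QDRANT_URL}/collections/{collection_name}/points
def build_upsert_body(records: list) -> dict:
    points = []
    for record in records:
        points.append({
            # Qdrant accepts unsigned integers or UUID strings as point IDs
            "id": record["id"],
            # Key must match the named vector declared when the collection was created
            "vector": {"chatgpt_vector": record["embeded"]},
            # The original (untokenized) text is what search returns as payload
            "payload": {
                "sentence": record["sentence"],
                "keyword": record["keyword"],
            },
        })
    return {"points": points}


# Made-up record with a 3-dimensional vector, for illustration only
body = build_upsert_body([
    {"id": 1, "sentence": "ส่งฟรีทั่วประเทศ", "keyword": "ส่งฟรี", "embeded": [0.1, 0.2, 0.3]},
])

# Sending it would look roughly like:
# requests.put(f"{QDRANT_URL}/collections/chatgpt/points", json=body).raise_for_status()
```

Keeping the body builder separate from the HTTP call makes it easy to check that the vector name and payload fields line up with what search and retrieve_relevant_context expect.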

Start Qdrant locally with one Docker command:

docker run -dt --name VectorDB \
  -p 6333:6333 \
  -v /your/path/storage:/qdrant/storage \
  qdrant/qdrant:latest

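We use Cosine similarity rather than Euclidean distance. Cosine similarity measures the angle between vectors, so it compares direction (meaning) and ignores magnitude, whereas Euclidean distance also reacts to vector length. OpenAI documents its embeddings as normalized to length 1, so with this model both metrics rank results identically; Cosine is the conventional choice and stays correct if you later switch to a model whose vectors are not normalized.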
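A minimal sketch of that difference, using made-up 2-D vectors rather than real embeddings: on raw vectors the two metrics can disagree about the nearest neighbour, while on unit-length vectors they pick the same one. All names here are illustrative.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    norm = math.hypot(*v)
    return [x / norm for x in v]


query = [1.0, 0.0]
same_direction_long = [10.0, 1.0]    # almost the same direction, large magnitude
other_direction_short = [0.6, 0.8]   # different direction, but close in space
candidates = [same_direction_long, other_direction_short]

# Raw vectors: cosine prefers the aligned vector, Euclidean prefers the nearby one
cos_pick = max(candidates, key=lambda v: cosine_similarity(query, v))
euc_pick_raw = min(candidates, key=lambda v: math.dist(query, v))

# Unit-length vectors: Euclidean now agrees with cosine
unit_candidates = [normalize(v) for v in candidates]
euc_pick_unit = min(unit_candidates, key=lambda v: math.dist(query, v))
```

For unit vectors the squared Euclidean distance equals 2 − 2·cos, which is why the rankings coincide.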

Step 5 — The RAG Query Flow

When a user asks a question, here's what happens:

# chat_module.py
from openai_module import create_embedding
from qdrant_module import search

def rag(question: str, category_name: str) -> str:
    # 1. Build a context-rich search query
    search_query = "สินค้า" + category_name  # "Product [category]"

    # 2. Embed the search query (tokenization happens upstream before this call)
    question_embed = create_embedding(search_query)

    # 3. Search Qdrant for the top 7 most similar sentences
    gpt_vector = {"name": "chatgpt_vector", "vector": question_embed["embed"]}
    search_result = search("chatgpt", gpt_vector, limit=7)

    # 4. Assemble context from the matched payloads
    context = retrieve_relevant_context(search_result["result"])
    return context

def retrieve_relevant_context(results: list) -> str:
    context = ""
    for item in results:
        context += item["payload"]["sentence"] + "\n\n"
    return context

The assembled context is then injected into GPT-4o-mini's system prompt:

system_content = f"""Use the attached context to answer the user's questions.
Answer only questions related to our company's products and services:

{context}

ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น"""

That last Thai instruction tells the model: "Reply in the same language as the user's most recent message." This handles the bilingual nature of our users — some ask in Thai, some in English, some mix both.
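How the context reaches the model can be sketched as a small message builder. This is an assumption about the wiring, not the author's actual code: build_messages and the history parameter (prior turns, for example loaded from the chat-history store) are hypothetical names.

```python
SYSTEM_TEMPLATE = """Use the attached context to answer the user's questions.
Answer only questions related to our company's products and services:

{context}

ภาษาที่ใช้ตอบกลับ User ให้ยึดจากภาษาของคำถามล่าสุดของ User เท่านั้น"""


def build_messages(context, question, history=None):
    """Assemble the chat messages: system prompt, prior turns, then the new question."""
    messages = [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(context=context.strip())}
    ]
    messages.extend(history or [])  # hypothetical: earlier user/assistant turns
    messages.append({"role": "user", "content": question})
    return messages


messages = build_messages("ส่งฟรีทั่วประเทศ\n\n", "Is shipping free?")

# The result would then be passed to the chat API, e.g.:
# client.chat.completions.create(model="gpt-4o-mini", messages=messages)
```

The retrieved context lives only in the system message, so the language instruction applies to the user's latest message rather than to the (Thai) context.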

Step 6 — Question Classification Before RAG

One non-obvious optimization: not every question needs a RAG lookup. We classify questions first with GPT-4o-mini to decide which path to take:

# chat_module.py
import json
from openai import OpenAI

client = OpenAI()

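def question_classification(question: str) -> dict:
    # Prompt (Thai): "Classify the user's question and answer as JSON { "type": value }"
    #   type 0 = greeting / unrelated to products or services
    #   type 1 = promotions / discounts / product categories
    #   type 2 = branches / delivery areas
    #   type 3 = product or service information (the only type that needs RAG)
    #   type 4 = general questions about the company
    prompt = """วิเคราะห์คำถามของ User ว่าเป็นคำถามประเภทไหน โดยให้ตอบเป็น JSON { "type": value }
    type 0 = ทักทาย / ไม่เกี่ยวกับสินค้าหรือบริการ
    type 1 = ถามเกี่ยวกับโปรโมชั่น / ส่วนลด / หมวดหมู่สินค้า
    type 2 = ถามเกี่ยวกับสาขา / พื้นที่จัดส่ง
    type 3 = ถามเกี่ยวกับข้อมูลสินค้าหรือบริการ
    type 4 = ถามทั่วไปเกี่ยวกับบริษัท"""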

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": question},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Only type 3 (specific product info questions) triggers the full RAG pipeline. Promotion and branch questions (type 1-2) use structured data from a JSON catalog instead. Greetings (type 0) go straight to the LLM without any retrieval at all.

This classification step saves both latency and API cost — you're not doing a vector search for "สวัสดีครับ" (hello).
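The routing decision itself can be sketched as a small pure function. The route names are hypothetical, and sending type 4 or malformed classifier output to the plain LLM path is an assumption; the article only specifies types 0 to 3.

```python
def choose_route(classification: dict) -> str:
    """Map the classifier's JSON output to a handler name (names are illustrative)."""
    try:
        question_type = int(classification.get("type"))
    except (TypeError, ValueError):
        question_type = None  # missing or non-numeric "type"

    if question_type == 3:
        return "rag"          # product info: embed the query and search Qdrant
    if question_type in (1, 2):
        return "catalog"      # promotions / branches: structured JSON catalog
    # Type 0 (greetings); assumed fallback for type 4 and unexpected output
    return "direct_llm"


route = choose_route({"type": 3})
```

Coercing the value with int() guards against the model returning "3" as a string, which JSON mode does not rule out.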

What We Learned

1. Tokenize before embedding, always. The single biggest quality improvement came from running pythainlp on every piece of text before it touches the embedding model — both at ingest time and at query time. Without this, retrieval quality was noticeably worse for Thai-only queries.

2. Use PyMuPDF, not PyPDF2. For Thai PDF documents, PyMuPDF is dramatically more reliable. PyPDF2 would silently drop or garble Thai characters from complex layouts. Also note: as of v1.24+, use import pymupdf instead of the legacy import fitz.

3. Store original text, embed tokenized text. Users should see natural language in responses. Keep these as separate fields.

4. Sentence-level chunks beat character-level chunks for Thai. Because Thai sentences naturally carry complete thoughts, splitting at sentence boundaries (.) gives the model coherent context units rather than arbitrary fragments. A chunk_size=500 cut might land in the middle of a Thai word — or more precisely, in the middle of a run of characters that spans multiple words, since there's no space to safely break at.

5. Question classification as a router saves money. Not every user message needs vector search. A cheap classification step routes simple questions to a direct LLM call and complex ones to the full RAG pipeline.
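Point 4 can be illustrated with a minimal sketch that contrasts fixed-size character chunks with sentence-packed chunks. It assumes sentences are delimited by "." as described above; the function names and sample text are made up for illustration.

```python
def chunk_by_chars(text: str, chunk_size: int) -> list:
    """Fixed-size slices: may cut through the middle of a word or sentence."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_by_sentences(text: str, max_chars: int, delimiter: str = ".") -> list:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    sentences = [s.strip() for s in text.split(delimiter) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "ส่งฟรีทั่วประเทศ. ผ่อนชำระได้ 0%. ลดราคาทุกวันศุกร์."
char_chunks = chunk_by_chars(text, 10)
sentence_chunks = chunk_by_sentences(text, 20)
```

With the sample text, the 10-character slices split the first sentence across two chunks, while the sentence-based version keeps each sentence intact.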

The Stack at a Glance

Layer	Tool	Version
PDF extraction	PyMuPDF (`pymupdf`)	1.27.2.2
Thai tokenization	`pythainlp` (`newmm` engine)	5.2.0
Embedding model	OpenAI `text-embedding-3-small` (1536d)	—
Vector database	Qdrant + `qdrant-client`	1.17.1
LLM	OpenAI GPT-4o-mini	—
OpenAI SDK	`openai`	2.32.0
Backend	Python / FastAPI or Flask	—
Chat history	MongoDB	—

Final Thoughts

Building RAG for Thai taught me that most of the "standard" chunking advice assumes English. Once you work with a language that has no word boundaries, the whole pipeline has to be rethought — from how you split sentences to how you normalize text before embedding.

The good news: the fix is not complicated. A single tokenization step with pythainlp before embedding makes a significant difference. The hard part is knowing you need it in the first place.

If you're building RAG for other Asian languages — Japanese, Chinese, Korean — the same principle applies. Never assume your text has whitespace-delimited tokens. Always pre-process with a language-appropriate tokenizer before hitting your embedding model.

Addendum — "But where's the benchmark?" (added after publication)

After this post went up, a sharp comment pushed back on the central claim — and they were right to. The question is worth surfacing in the article itself, because the answer matters for anyone considering this technique:

"Embedding models like text-embedding-3-small already have an internal BPE tokenizer. Pre-tokenizing with newmm and joining with | adds | tokens to the sequence — possibly OOD relative to the model's training distribution. So what's the actual mechanism that makes pre-tokenization help?"

Two honest acknowledgments first.

1. This article does not contain a rigorous ablation. The quality claims — including the ส่งฟรี ↔ จัดส่งฟรีไหม cosine-similarity example — come from production retrieval logs and qualitative observation, not from a controlled experiment with recall@k or MRR numbers. Treat the recommendation as "this worked for our case" rather than "this is empirically proven across Thai RAG generally."

2. The OOD concern is half-right. | (U+007C) is not OOD at the vocabulary level — it's a single token in cl100k_base, common in code, CSV, and markdown across the training data. But its co-occurrence pattern with Thai script is distributionally OOD: the model has rarely seen "ไทย|สมุนไพร|พื้นบ้าน" during training, which can shift the resulting embedding toward a "structured/code-like" subspace. Whether that helps or hurts retrieval is an empirical question, not a settled one.

Proposed mechanism (hypothesis, not proof)

The reason pre-tokenization seems to help isn't the | token itself — it's that a hard separator blocks BPE from merging across word boundaries.

Thai is scriptio continua — no whitespace between words. When BPE runs on raw Thai, it greedily merges subword units based on training-corpus frequency, which routinely produces tokens that span word boundaries: the trailing subword of one word fuses with the leading subword of the next. The resulting token's embedding "blurs" the semantics of two distinct words into a single fragment.

Insert any separator (|, space, special token) and BPE merging halts at the separator. The tokens that come out align more closely with linguistic word boundaries, so the embedding reflects per-word semantics rather than a subword soup.
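The mechanism can be illustrated with a toy merge procedure. This is a deliberately simplified sketch, not the real cl100k_base tokenizer (which works on bytes and applies a regex pre-split); the Latin letters stand in for Thai characters, and the merge table is invented so that the highest-priority merge spans the boundary between the words "ab" and "cd".

```python
def apply_merges(text, merges):
    """Toy BPE: start from characters and apply merge rules in priority order."""
    tokens = list(text)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Invented merge table: ("b", "c") crosses the word boundary and has top priority
merges = [("b", "c"), ("a", "b"), ("c", "d")]

raw_tokens = apply_merges("abcd", merges)    # no separator between "ab" and "cd"
sep_tokens = apply_merges("ab|cd", merges)   # separator inserted at the boundary
```

Without the separator the toy tokenizer produces a "bc" token that straddles both words; with it, "b" and "c" are never adjacent, so the merges stay inside each word. Whether the real tokenizer behaves the same way on Thai text can be checked by inspecting the output of tiktoken's cl100k_base encoding.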

If that hypothesis is correct, the predictions are:

pipe ≈ space (both block cross-boundary merging)
pipe > raw Thai (consistent with what this article reports)
pipe may be marginally worse than space if | injects distributional noise

The ablation that should exist

Variant	Input fed to the embedder
A	Raw Thai, no pre-tokenization
B	`newmm` tokens, space-joined
C	`newmm` tokens, `|`-joined (this article's approach)
D	`newmm` tokens, special separator like `<sep>`

Evaluate recall@5 and MRR on a Thai QA dataset (TyDi-QA Thai works as an open option) or in-house labeled pairs. My prediction: A trails B/C/D clearly, while B and C land within noise of each other. If that pattern holds, the takeaway updates from "pipe-tokenize before embedding" to:

"Insert any boundary marker that prevents BPE from merging across word boundaries. The marker itself doesn't matter much — the boundary signal does."

I plan to run this ablation and publish a follow-up with the actual numbers. If anyone runs it first, please share — happy to link it from here.
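For anyone running the ablation, the two metrics named above take only a few lines. This is a minimal sketch with made-up rankings, not measured results; the function names and document IDs are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """runs: one (ranked_ids, relevant_ids) pair per query; returns mean metrics."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Made-up rankings for one variant, for illustration only
variant_runs = [
    (["d3", "d1", "d9"], {"d1"}),   # relevant document at rank 2
    (["d4", "d7", "d2"], {"d2"}),   # relevant document at rank 3
    (["d5", "d6", "d8"], {"d0"}),   # relevant document not retrieved
]
scores = evaluate(variant_runs, k=2)
```

Running evaluate once per variant (A to D) on the same queries and relevance labels gives directly comparable numbers.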

Thanks to the commenter who raised this. That's the kind of question that turns a heuristic into actual knowledge.

25 Free Developer Tools That Run 100% in Your Browser

Phasu Yeneng — Sat, 18 Apr 2026 07:07:19 +0000

If you've ever pasted sensitive data into a random online tool and immediately regretted it — this post is for you.

I built toolsstack.cloud — a collection of 25 free developer tools that run entirely in your browser. No backend. No account. No data ever leaves your device.

Here's the full list, grouped by category.

🔐 Security & Auth

1. JWT Decoder

Decode and inspect JWT tokens instantly — header, payload, expiry, and signature status. Useful when debugging auth issues without needing Postman or curl.

2. Hash Generator

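Generate MD5, SHA-1, SHA-256, and SHA-512 hashes from any text. Everything runs client-side, so nothing is sent to a server. The SHA variants are available through the Web Crypto API; MD5 is not part of that API and needs a JavaScript implementation.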

3. Password Generator

Cryptographically secure passwords with options for length, uppercase, numbers, symbols. Uses crypto.getRandomValues(), a cryptographically secure random source, rather than Math.random().

📦 Data & Encoding

4. JSON Formatter & Validator

Format, minify, and validate JSON with syntax highlighting. Shows line numbers on errors. One of the most-used tools on the site.

5. YAML to JSON Converter

Paste YAML, get JSON. Or paste JSON, get YAML. Uses js-yaml — supports anchors, multiline strings, and nested objects. Great for switching between Kubernetes manifests and API calls.

6. JSON to CSV Converter

Convert JSON arrays to CSV with auto-detected column headers. Download the result directly. No server upload needed.

7. Base64 Encoder / Decoder

Encode and decode Base64 — supports both text and file input. Handles UTF-8 correctly (unlike some tools that break on non-ASCII characters).

8. Image to Base64

Upload an image, get a Base64 data URI. Copy it directly into your HTML <img src=""> or CSS background-image. Useful for embedding small icons without an extra HTTP request.

9. URL Encoder & Decoder

Encode/decode URLs with two modes: encodeURIComponent (for query params) and encodeURI (for full URLs). Also parses URLs into protocol, host, path, and individual query parameters.

10. HTML Entity Encoder

Encode special characters like <, >, &, " to HTML entities and back. Useful when embedding user-generated content or debugging XSS-safe output.

🛠️ Developer Utilities

11. UUID / GUID Generator

Generate UUID v4 identifiers — single or bulk up to 1000 at once. Options for uppercase, no hyphens, or {braces} GUID format. Uses crypto.getRandomValues() for proper randomness.

12. Regex Tester

Test regular expressions with live match highlighting, group capture display, and flags support. Faster than switching between your editor and a browser console.

13. Cron Expression Generator

Build cron schedules with a visual builder — minute, hour, day, month, weekday. Shows a human-readable description and the next 5 run times. No more googling "cron every 15 minutes".

14. Epoch / Unix Timestamp Converter

Convert between Unix timestamps and human-readable dates in any timezone. Also shows the current timestamp live — useful for debugging API responses with created_at fields.

15. Diff Checker

Paste two blocks of text and see line-by-line differences highlighted in green/red. Good for quickly spotting config file changes or API response differences.

16. chmod Calculator

Click checkboxes for owner/group/others read/write/execute permissions and see the numeric value update in real time. Never google "chmod 755 meaning" again.

💻 Code & Text

17. CSS Minifier

Minify CSS and see the exact bytes saved. Pure client-side — paste your stylesheet, get minified output without uploading to any server.

18. SQL Formatter

Format and beautify SQL queries with proper indentation and keyword casing. Supports SELECT, INSERT, UPDATE, DELETE, JOIN, subqueries.

19. Markdown Converter

Convert Markdown to HTML with a live preview. Useful for checking how your README or blog post will render before committing.

20. Lorem Ipsum Generator

Generate placeholder text — choose paragraphs, sentences, or word count. Starts with the classic "Lorem ipsum dolor sit amet" or generates fully random Latin-ish text.

🎨 Design & Visual

21. Color Converter

Convert between HEX, RGB, HSL, and HSV instantly with a color picker. Copy any format with one click. Useful when your designer gives you a hex code but your CSS needs HSL.

22. QR Code Generator

Generate QR codes for URLs, WiFi credentials, vCards, email, SMS, and more. Customize dot styles, eye shapes, colors, and upload a logo overlay. Download as PNG or SVG.

🌏 Specialized

23. IP Lookup

Shows your public IPv4 and IPv6 addresses with geolocation (country, city, ISP). Useful for verifying VPN connections or debugging network issues.

24. PDF Text Extractor

Extract text from PDF files — supports both English and Thai. Uses pdf.js entirely in the browser. The PDF never gets uploaded anywhere.

25. Thai Text to URL Slug

Convert Thai text to URL-friendly slugs using RTGS (Royal Thai General System of Transcription) romanization. For example, สวัสดีครับ → sawatdi-khrap. Useful for building SEO-friendly URLs for Thai-language content.

Why client-side only?

Most online developer tools send your data to a server. That means:

Your JWT tokens (with user data) go to someone else's server
Your passwords get logged somewhere
Your internal API responses are stored in someone's database

Every tool on toolsstack.cloud processes data entirely in your browser using native browser APIs (crypto.getRandomValues, URL, FileReader) and trusted open-source CDN libraries like js-yaml and pdf.js.

The network tab stays clean.

Tech stack

Pure HTML + Vanilla JavaScript. No framework. No build step. No node_modules. The entire site is static files on shared hosting.

For tools that need libraries:

js-yaml@4.1.0 — YAML parsing
pdf.js@4.4 — PDF text extraction
qrcode-generator — QR code rendering

Everything else uses browser-native APIs.

If you find any of these useful — or want to suggest a tool that's missing — feel free to drop a comment below.

🔗 https://toolsstack.cloud