Forem: Praful Reddy

Why I used a 50-year-old algorithm instead of embeddings to cut Claude API token costs

Praful Reddy — Wed, 22 Apr 2026 04:16:37 +0000

I built Prism — a local proxy that routes only relevant knowledge to Claude per query using BM25, with zero extra API calls, zero embeddings, and zero vector databases.
Every time you send a prompt to Claude, it considers its entire
knowledge space. A question about a React bug still costs tokens
on geography, cooking, history, and every other domain Claude was
trained on. Nobody talks about this because the context window is
large enough that it "works." But it's wasteful by design — and
it produces padded, unfocused responses as a side effect.

I spent two weeks building a fix. The result is
Prism — a local
proxy that intercepts your Claude API calls and routes only
relevant knowledge to each query. Zero extra API calls. Zero
embeddings. Zero vector database. Pure deterministic math.

## The problem with every other approach

Before I explain what Prism does, I want to explain what it
deliberately does not do — because the distinction matters.

Every existing context optimizer I found uses a second, smaller
LLM to compress the input to the main LLM. LLMlingua, Selective
Context, LLM-DCP — they all work the same way: call a model to
decide what to keep, then call the main model with the compressed
input.

That's two inference calls instead of one. You're burning tokens
to save tokens. The abstraction is broken at the foundation.

I kept asking: do you actually need a model to decide what's
relevant? For most prompts, the answer is no. The relevant
domain is structurally detectable from the words in the query
itself. You don't need intelligence to know that "fix this
TypeError" is a code/debug question. You need pattern matching.

So I reached for BM25.

## What BM25 is and why it works here

BM25 (Best Match 25) is an information retrieval algorithm from
1994, built on TF-IDF principles from the 1970s. It's the
ranking function that powered search engines before neural
networks existed. It scores documents against a query using
term frequency, inverse document frequency, and document length
normalization. No model. No training. Pure math.

Here's the key insight: I pre-built a corpus of 40 knowledge
domain nodes — javascript, security, databases, geography,
history, medicine, etc. Each node has a keyword set describing
that domain. At query time, BM25 scores every domain node
against the incoming prompt in microseconds and returns the
top 5 by relevance.

The Knowledge Graph then walks relationship edges — if
"javascript" scores highest, its related domains (node, react,
typescript) get included at a discounted score. The result is
a focused set of 3-5 domains that actually matter for this
specific query.

Building the index takes ~2ms at startup. Each query takes
under 1ms. The entire operation costs zero dollars and works
completely offline.

## The four pipeline stages

1. Intent Classifier
Reads the prompt and assigns one of six intent types: CODE,
DEBUG, FACTUAL, CONCEPTUAL, DECISION, or CREATIVE. Uses a
deterministic keyword graph — trigger words, regex patterns,
confidence thresholds. No model call. Under 1ms.

2. Knowledge Graph
BM25-scores 40 domain nodes against the prompt + intent.
Applies intent affinity boosts (DEBUG queries get a 1.4x
multiplier on security and language domains). Walks relationship
edges for related domains. Returns top 5 nodes with scores.

3. Context Injector
Builds a focused system prompt fragment under 300 tokens.
Format varies by intent:

DEBUG: "Diagnose from [security] perspective. State cause first. Then fix. Then why."
FACTUAL: "Answer from [geography] knowledge. One direct answer. No padding. Facts only."

Always appends: "Be dense. Replace meta-commentary with labels:
[reason] [context] [caveat]. Skip preamble and sign-off."

4. Response Enforcer
Post-processes Claude's raw response before returning it to
you. Runs 111 filler phrase patterns — "Here's the thing
about", "Let me walk you through", "I hope this helps",
"Great question" and 108 others. Prefix and suffix patterns
are deleted entirely. Inline patterns are replaced with
compact semantic labels. Result: 30-50% shorter responses
that are actually denser with information.

## Before and after

Here's a real example. Prompt: "fix the TypeError in my
auth middleware"

Without Prism:

Full knowledge space considered
Response opens with: "I'd be happy to help you fix that TypeError! Let me walk you through what's likely happening here. TypeErrors in Express middleware are quite common and usually fall into a few categories..."
Tokens in: ~800 (with any system context)

With Prism:

Intent: DEBUG (0.94 confidence)
Domains activated: security (0.91), javascript (0.87), node (0.72)
System fragment injected: 52 tokens
Response opens directly with the diagnosis
Filler removed: 6 phrases
Tokens saved: ~140

The response isn't just shorter — it's structured differently.
It leads with cause, then fix, then explanation. That's the
intent-specific formatting doing its job.

## How to use it

Prism runs as a local proxy on port 3179. You change one
thing in your code:

// Before
const client = new Anthropic({
  baseURL: 'https://api.anthropic.com'
});

// After — that's it
const client = new Anthropic({
  baseURL: 'http://localhost:3179'
});

Or use the SDK directly:

import { prism } from 'prism-ai';

const response = await prism.send({
  prompt: "fix the TypeError in my auth middleware",
  apiKey: process.env.ANTHROPIC_API_KEY
});

console.log(response.intent);    // DEBUG
console.log(response.domains);   // ['security', 'javascript', 'node']
console.log(response.tokensIn);  // 312
console.log(response.fillerRemoved); // 6

Install:

npx prism-ai start

That's the entire setup.

## Prism Agent

I also built Prism Agent
on top of the SDK — a Claude Code alternative with a live
knowledge graph pane in the terminal. Every turn shows you
which domains activated, their BM25 scores, tokens saved,
and filler removed. You can pin domains (always include this)
or suppress them (never use this). First coding agent that
isn't a black box.

npm install -g prism-agent
prism-agent

Why this matters beyond token costs

The token savings are real but they're not the main point.
The main point is that focused context produces better
responses. When Claude is directed at a specific domain
with a specific response format for a specific intent type,
the quality of the answer goes up — not just the length
down. The Response Enforcer compounds this by removing the
preamble padding that dilutes the actual answer.

BM25 has been solving information retrieval problems for 50
years. It doesn't hallucinate. It doesn't drift. It doesn't
need a GPU. It runs in a for loop. For the specific problem
of "which knowledge domain is this query about," it's more
than sufficient — and it's the right tool precisely because
it's so much simpler than the alternatives.

Both projects are fully open source and MIT licensed.

prism-ai: npm | GitHub
prism-agent: GitHub

If you're building anything on top of Claude or other LLM
APIs and want to talk through the BM25 implementation,
drop a comment — happy to go deeper on any part of this.

I built an open-source Python tool to detect drift in embedding spaces

Praful Reddy — Fri, 17 Apr 2026 00:19:56 +0000

I built an open-source Python tool to detect drift in embedding spaces

Most monitoring pipelines wait for a downstream metric to break: accuracy drops, retrieval quality slips, or user-facing behavior gets worse.

By then, the shift has already happened.

I wanted a simpler way to catch changes earlier by looking directly at the embeddings themselves.

So I built drift-lens-monitor — an open-source Python package for detecting drift in embedding spaces.

GitHub: https://github.com/PRAFULREDDYM/Drift_lense
PyPI: https://pypi.org/project/drift-lens-monitor/

What problem this solves

A lot of modern ML systems depend on embeddings:

semantic search
RAG pipelines
recommenders
clustering
classification pipelines with embedding-based features

Even when the raw system looks “healthy,” the embedding space can start changing underneath you.

That change may come from:

new user behavior
model updates
data source changes
upstream preprocessing differences
gradual distribution shift over time

If you only monitor business metrics or model accuracy, you often detect the issue late.

The idea behind this project is simple:

take snapshots of embeddings over time and compare them directly.

What drift-lens-monitor includes

The package currently supports three drift detection approaches.

1) Fréchet Embedding Distance (FED)

This is inspired by FID-style comparison, but applied to arbitrary embeddings.

At a high level, it compares the mean and covariance of two embedding distributions.

Useful when you want a compact statistical distance between a reference snapshot and a current snapshot.

2) Maximum Mean Discrepancy (MMD)

This is a kernel-based, non-parametric method for comparing two samples.

I included permutation-based p-values so it can be used not just as a raw distance, but also as a statistical test.

Useful when you want a more flexible distribution comparison without assuming Gaussian structure.

3) Persistent homology

This is the unusual one.

Instead of only asking whether two embedding clouds differ statistically, this looks at whether their shape changes.

It builds topological summaries over the point cloud and compares them using Wasserstein distance.

Why that matters:

A system can preserve rough averages while still changing structurally. For example:

clusters merge
clusters split
holes or loops appear/disappear
local geometry shifts in ways mean/covariance may not capture

This makes persistent homology an interesting complement to more standard drift metrics.

Design goals

I wanted the tool to stay practical:

local-first
no cloud dependency
no API keys
simple snapshot storage
easy to inspect and extend

Snapshots are stored as parquet files, so the workflow stays lightweight and reproducible.

The package can be used:

as a Python library
through a Streamlit dashboard for visual exploration

Install


bash
pip install drift-lens-monitor

## Example workflow

The intended workflow is straightforward:

1. Save a reference embedding snapshot
2. Save a new embedding snapshot later
3. Compare them using one or more drift methods
4. Inspect drift scores and visualize the changes

This makes it usable for both experimentation and production-adjacent monitoring workflows.

## Why I built it

I was interested in a gap I keep seeing in ML tooling:

we monitor model outputs, latency, costs, and downstream metrics heavily, but we often do much less direct monitoring of the representation space itself.

Embeddings are doing a huge amount of work in modern AI systems. They deserve first-class monitoring too.

I also wanted to explore whether more unusual techniques like **topological drift detection** could add signal beyond standard statistical distances.

## What I’d love feedback on

I’d especially love feedback on three things:

1. Does persistent homology feel genuinely useful here, or too heavyweight?
2. What baselines or benchmark datasets would make this more convincing?
3. How should the package/API be improved to make it easier to use in real workflows?

## Links

- GitHub: https://github.com/PRAFULREDDYM/Drift_lense
- PyPI: https://pypi.org/project/drift-lens-monitor/

A zero-token progress bar for Claude Code

Praful Reddy — Fri, 10 Apr 2026 11:48:59 +0000

Every Claude Code extension I've seen shows the same thing: token counts, API costs, model info. Useful, but it doesn't answer the question I actually care about mid-session — how much of the work is done.
So I built task-progress-bar. It reads Claude Code's native task list from disk and renders a live ASCII progress bar with time estimates. It runs as a PostToolUse hook, which means it consumes zero tokens — it's a subprocess that Claude never sees.

Tasks [████████░░] 8/10 (~3m left) | ✓8 ⟳1 ○1

How it works

Claude Code persists tasks as JSON files in ~/.claude/tasks/. The plugin watches for TodoWrite, TodoRead, TaskCreate, and TaskUpdate tool calls via the PostToolUse hook, then:

Parses every JSON file in the tasks directory
Counts completed, in-progress, and pending tasks
Computes a time estimate using an exponential moving average (EMA)
Outputs a single status line to Claude Code's statusLine renderer

The time estimation is straightforward. Each time a task moves to completed, the timestamp is logged. The interval between consecutive completions feeds into an EMA with α=0.3:

EMA_new = 0.3 × latest_interval + 0.7 × EMA_old
estimated_remaining = EMA × tasks_left

Intervals over 1 hour are clamped to avoid skew from session breaks. The first 3 completions show "calculating..." until there's enough data.
The entire plugin is a single Python file — stdlib only, no pip dependencies. It uses json, pathlib, time, and sys. Nothing else.

Install

curl -fsSL https://raw.githubusercontent.com/PRAFULREDDYM/task-progress-bar/main/install.sh | bash

The installer checks for Python 3.8+, copies the script to ~/.claude/, and patches settings.json with the hook configuration. There's a matching uninstall.sh for clean removal.

What it looks like

The progress bar color-codes by completion percentage — red below 33%, yellow from 33–66%, green above 66%. When all tasks finish, it shows ✅ All done! for 30 seconds and then hides.
If you run it standalone (python3 task_progress_bar.py), you get a full multi-line colored view with a task breakdown and per-task average.