<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gaurav Vij</title>
    <description>The latest articles on Forem by Gaurav Vij (@gaurav_vij137).</description>
    <link>https://forem.com/gaurav_vij137</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390876%2F1001f18f-15c5-4cb3-b792-3c4e81a1cc61.jpg</url>
      <title>Forem: Gaurav Vij</title>
      <link>https://forem.com/gaurav_vij137</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gaurav_vij137"/>
    <language>en</language>
    <item>
      <title>Stop Paying for the Same Answer Twice: A Deep Dive into llm-cache</title>
      <dc:creator>Gaurav Vij</dc:creator>
      <pubDate>Fri, 17 Apr 2026 12:33:34 +0000</pubDate>
      <link>https://forem.com/gaurav_vij137/stop-paying-for-the-same-answer-twice-a-deep-dive-into-llm-cache-1llp</link>
      <guid>https://forem.com/gaurav_vij137/stop-paying-for-the-same-answer-twice-a-deep-dive-into-llm-cache-1llp</guid>
      <description>&lt;p&gt;Every AI engineer has been there. You open the billing dashboard, squint at the number, and do a quiet double-take. You know the product is working, traffic is healthy, users are happy. But somewhere in that invoice is a dirty secret: you are paying for the same computation over and over again.&lt;/p&gt;

&lt;p&gt;Someone asks your support bot "How do I reset my password?" Fifty other users ask "What are the steps to reset my password?" Twenty more ask "Can you help me change my password?" The LLM doesn't know it has answered this question a hundred times today. It just runs the full forward pass every single time, burns your tokens, and charges you accordingly.&lt;/p&gt;

&lt;p&gt;This is not a fringe problem. It is the default state of almost every production LLM deployment.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;llm-cache&lt;/code&gt; is a Python middleware library that fixes this. It caches LLM responses not by exact string match, but by semantic similarity. The project was built fully autonomously by NEO, an AI coding agent, and the code is clean, thoughtful, and surprisingly production-ready for a library with just two commits. Let's get into how it works, what it costs you, and what you can do with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight: Meaning, Not Characters
&lt;/h2&gt;

&lt;p&gt;Most developers' first instinct when building a cache is a hash map. Take the prompt string, hash it, store the result. This works if your users send byte-for-byte identical queries. They never do.&lt;/p&gt;

&lt;p&gt;Users paraphrase. They make typos. They use formal phrasing in one context and casual phrasing in another. A naive cache misses all of these. &lt;code&gt;llm-cache&lt;/code&gt; approaches the problem differently: instead of comparing strings, it compares meaning.&lt;/p&gt;

&lt;p&gt;The pipeline is elegant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Incoming prompt is converted into a 384-dimensional embedding vector using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;, a sentence-transformers model that runs entirely locally.&lt;/li&gt;
&lt;li&gt;The vector is L2-normalized so that inner product becomes equivalent to cosine similarity.&lt;/li&gt;
&lt;li&gt;A FAISS &lt;code&gt;IndexFlatIP&lt;/code&gt; index does exact nearest-neighbor search over all previously cached vectors.&lt;/li&gt;
&lt;li&gt;If the closest match clears a configurable similarity threshold (default 0.95), the cached response is returned immediately, no API call made.&lt;/li&gt;
&lt;li&gt;On a miss, the real LLM API is called, the response is stored, and future similar queries will hit the cache.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is that "What is the capital of France?" and "Tell me the capital city of France" return the same cached response. One API call served two users.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture Under the Hood
&lt;/h2&gt;

&lt;p&gt;The project is organized into four tight modules plus two SDK wrappers.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;embedder.py&lt;/code&gt; wraps sentence-transformers with an LRU cache so repeated embeddings of the same text do not trigger redundant model inference. It normalizes vectors before returning them so the FAISS layer can do cosine comparisons via inner product without any extra math.&lt;/p&gt;
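&lt;p&gt;The memoization pattern is the standard &lt;code&gt;functools.lru_cache&lt;/code&gt; one. In this sketch a hypothetical counting stub stands in for the sentence-transformers forward pass so the effect is visible:&lt;/p&gt;

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the "model" actually runs

@lru_cache(maxsize=4096)
def embed(text: str):
    # Stand-in for a sentence-transformers forward pass; the real embedder
    # would return a normalized vector here.
    calls["n"] += 1
    return tuple(float(ord(c)) for c in text[:8])  # dummy, hashable "vector"

embed("hello")
embed("hello")      # served from the LRU cache, no second "inference"
print(calls["n"])   # -> 1
```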

&lt;p&gt;&lt;code&gt;store.py&lt;/code&gt; is where the actual cache lives. It holds a FAISS index and a parallel Python dict of metadata keyed by integer IDs. Thread safety is handled with a &lt;code&gt;threading.RLock&lt;/code&gt;, which means you can use this in multi-threaded FastAPI or Django setups without adding your own locking. Persistence is handled by periodically flushing both the FAISS index (via &lt;code&gt;faiss.write_index&lt;/code&gt;) and the metadata dict (via pickle) to &lt;code&gt;~/.llm_cache/&lt;/code&gt;. Every 10 writes by default, configurable if you want more frequent or less frequent saves.&lt;/p&gt;
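&lt;p&gt;The locking and periodic-flush behavior can be sketched like this. A plain dict stands in for the FAISS index plus metadata, and the file path is made up for the example:&lt;/p&gt;

```python
import os
import pickle
import tempfile
import threading

class ToyStore:
    def __init__(self, path=None, save_every=10):
        self.path = path or os.path.join(tempfile.gettempdir(), "toy_cache.pkl")
        self.save_every = save_every
        self.lock = threading.RLock()   # safe under multi-threaded servers
        self.data = {}                  # stands in for FAISS index + metadata
        self.writes = 0

    def add(self, key, value):
        with self.lock:
            self.data[key] = value
            self.writes += 1
            if self.writes % self.save_every == 0:
                self.save()             # RLock allows this nested acquire

    def save(self):
        with self.lock:
            with open(self.path, "wb") as f:
                pickle.dump(self.data, f)

store = ToyStore(save_every=2)
store.add("q1", "a1")
store.add("q2", "a2")   # second write triggers a flush to disk
```

&lt;p&gt;The nested &lt;code&gt;save()&lt;/code&gt; call inside &lt;code&gt;add()&lt;/code&gt; is why an &lt;code&gt;RLock&lt;/code&gt; rather than a plain &lt;code&gt;Lock&lt;/code&gt; is the natural choice here: the same thread re-acquires the lock it already holds.&lt;/p&gt;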

&lt;p&gt;&lt;code&gt;cache.py&lt;/code&gt; is the high-level interface that ties embedder and store together. It exposes &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;set&lt;/code&gt;, &lt;code&gt;lookup_or_call&lt;/code&gt;, &lt;code&gt;get_similar&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, &lt;code&gt;clear&lt;/code&gt;, &lt;code&gt;save&lt;/code&gt;, and &lt;code&gt;stats&lt;/code&gt;. The &lt;code&gt;lookup_or_call&lt;/code&gt; method is particularly useful: it takes a prompt and a callable, checks the cache first, and only invokes the callable on a miss.&lt;/p&gt;
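&lt;p&gt;To make the &lt;code&gt;lookup_or_call&lt;/code&gt; contract concrete, here is a minimal stand-in that mirrors the described behavior with exact-match keys rather than the library's semantic lookup:&lt;/p&gt;

```python
class MiniCache:
    def __init__(self):
        self.store = {}   # exact-match stand-in for the semantic lookup
        self.hits = self.misses = 0

    def lookup_or_call(self, prompt, fn):
        if prompt in self.store:
            self.hits += 1
            return self.store[prompt]
        self.misses += 1
        result = fn(prompt)          # the callable runs only on a miss
        self.store[prompt] = result
        return result

cache = MiniCache()
expensive = lambda p: f"LLM answer to: {p}"  # stands in for a real API call
cache.lookup_or_call("hello", expensive)     # miss: invokes the callable
cache.lookup_or_call("hello", expensive)     # hit: no call made
print(cache.hits, cache.misses)              # -> 1 1
```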

&lt;p&gt;&lt;code&gt;wrappers/openai_wrapper.py&lt;/code&gt; and &lt;code&gt;wrappers/anthropic_wrapper.py&lt;/code&gt; are where the ergonomic magic happens. &lt;code&gt;CachedOpenAI&lt;/code&gt; and &lt;code&gt;CachedAnthropic&lt;/code&gt; subclass the official SDKs and intercept the &lt;code&gt;chat.completions.create&lt;/code&gt; and &lt;code&gt;messages.create&lt;/code&gt; methods respectively. From the outside, they are drop-in replacements. You change one import line and one constructor call. Everything else in your codebase stays identical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CachedOpenAI&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CachedOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire migration. Your existing call sites do not change.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Much Money Does This Actually Save?
&lt;/h2&gt;

&lt;p&gt;Let's put concrete numbers on this. The library's own README claims 40 to 60 percent cost reduction on repetitive workloads. That tracks with how LLM usage actually distributes in production.&lt;/p&gt;

&lt;p&gt;Consider a customer support application running on GPT-4o. GPT-4o is priced at $2.50 per million input tokens and $10.00 per million output tokens at the time of writing. A typical support query might be 150 input tokens and 300 output tokens: that is $0.000375 on input plus $0.003000 on output, coming to roughly $0.003375 per call. If you handle 100,000 queries a day, that is $337.50 a day, around $10,125 a month, and over $123,000 a year.&lt;/p&gt;
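&lt;p&gt;The arithmetic is easy to reproduce:&lt;/p&gt;

```python
# GPT-4o list prices as quoted above, in dollars per token.
IN_PRICE, OUT_PRICE = 2.50 / 1e6, 10.00 / 1e6

per_call = 150 * IN_PRICE + 300 * OUT_PRICE   # typical support query
per_day = per_call * 100_000                  # 100K queries a day
print(round(per_call, 6), round(per_day, 2), round(per_day * 30, 2))
# 0.003375 per call, 337.5 per day, 10125.0 per 30-day month
```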

&lt;p&gt;Now think about the actual query distribution. Support traffic is highly repetitive. Password resets, billing questions, shipping status, cancellation flows. If even 40 percent of queries are semantically similar to something already cached, you are looking at potential savings in the range of $4,000 a month from API costs alone. On Claude Sonnet or GPT-4 Turbo, where token prices differ, the math shifts accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimer
&lt;/h3&gt;

&lt;p&gt;That said, these are illustrative numbers based on a simplified model. Real production costs vary considerably depending on prompt length distribution, how diverse your user base actually is, which threshold you settle on, and how much of your traffic is genuinely repetitive versus novel. The 40 to 60 percent savings figure from the library's README is a reasonable ballpark for repetitive workloads, but your actual hit rate depends entirely on your specific use case. Treat these numbers as a directional estimate, not a guarantee.&lt;/p&gt;

&lt;p&gt;For batch processing workloads the economics can be even more compelling. If you are enriching a product catalog, generating descriptions for SKUs, or running the same analysis prompts across thousands of documents with overlapping content, cache hit rates can push above 70 percent. On a $10,000 monthly LLM bill, that kind of hit rate could represent thousands of dollars in avoided API calls, though again the actual figure depends on how much genuine repetition exists in your data.&lt;/p&gt;

&lt;p&gt;There is also a latency dimension that is easy to overlook. A cache hit returns in milliseconds. A real API call takes 500ms to 3 seconds depending on model and load. In user-facing applications, this latency improvement translates directly to perceived product quality, which is harder to put a dollar figure on but is real.&lt;/p&gt;

&lt;p&gt;The configurable threshold gives you a dial between savings and correctness. At 0.95 you are catching clear paraphrases while being conservative about false positives. At 0.88 to 0.91 you are being more aggressive, which works well for batch workloads where the cost of an occasional semantically-mismatched cache hit is low. At 0.85 and below you risk serving stale or wrong responses for queries that are topically related but not actually equivalent.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Watch Out For
&lt;/h2&gt;

&lt;p&gt;The library is honest about its limitations, which is a good sign.&lt;/p&gt;

&lt;p&gt;Streaming responses are not cached. If you use &lt;code&gt;stream=True&lt;/code&gt;, the call passes through unchanged. This is a real gap for chat applications where streaming UX is expected. The architecture would need changes to buffer the streamed response and store it post-completion, which is doable but adds complexity.&lt;/p&gt;

&lt;p&gt;Tool and function calls are not cached either. If your agents rely on tool use, those responses pass through. This matters less than it sounds for cost savings, because tool call responses are usually dynamic by nature, but it is worth knowing.&lt;/p&gt;

&lt;p&gt;The cache is model-agnostic. The key is the semantic content of the prompt, not the model name. If you ask the same question to GPT-4o and Claude Sonnet, they will share a cache entry by default. This is fine if you want that behavior, but if you need model-specific caches, use different &lt;code&gt;cache_name&lt;/code&gt; values per model.&lt;/p&gt;

&lt;p&gt;The cache is also not context-aware. If the same question means different things depending on prior conversation turns, the cache will incorrectly serve a response from a different context. This matters for multi-turn chat where the embedding of the final user message does not capture the full conversational state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Built Fully by NEO: Your AI Engineering Agent
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;llm-cache&lt;/code&gt; repository was built autonomously by &lt;a href="https://heyneo.com" rel="noopener noreferrer"&gt;NEO, a fully autonomous AI engineering agent&lt;/a&gt; capable of fine-tuning, evaluating, and experimenting with AI models, and of building and deploying AI pipelines such as RAG systems and classical ML experiments.&lt;/p&gt;

&lt;p&gt;This project is definitely not a toy. The codebase has a proper package structure with separated concerns across embedder, store, cache, and wrappers. It has a test suite covering the cache, the OpenAI wrapper, and the Anthropic wrapper. It has working examples that run without API keys using mock responses. It has configuration documentation, a thresholds reference table, and SVG architecture diagrams. It has async support with &lt;code&gt;AsyncCachedOpenAI&lt;/code&gt; and &lt;code&gt;AsyncCachedAnthropic&lt;/code&gt;. The FAISS persistence strategy, the RLock threading model, the LRU cache on the embedder, the L2 normalization before FAISS inner product: these are not random choices. They are informed engineering decisions.&lt;/p&gt;

&lt;p&gt;NEO can be used in VS Code via its &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or in Cursor. It works as an autonomous AI engineering agent that takes a high-level goal, plans the implementation, writes the code, runs tests, and iterates until the project is complete. This library was produced from a single prompt.&lt;/p&gt;

&lt;p&gt;What that unlocks is interesting. The tool itself is useful. But the meta-point is that an engineer with a clear idea and access to NEO can build a production-ready Python library in the time it used to take to write a design doc. The feedback loop between idea and artifact has collapsed.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Build on This with NEO
&lt;/h2&gt;

&lt;p&gt;The library works well as-is, but there are several directions where it could go further. If you want to extend it, NEO is the fastest way to do that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-turn context caching&lt;/strong&gt; is the most valuable near-term addition. Right now the cache key is the embedding of a single message. A more robust implementation would embed the last N turns of conversation concatenated together, so that "what about France?" in the context of a geography discussion produces a different cache key from the same phrase in a cooking discussion. You could prompt NEO like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Clone the llm-cache repo: https://github.com/dakshjain-1616/llm-cache and extend llm-cache to support multi-turn context-aware caching by embedding the last 3 messages as a single context string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and it would handle the implementation.&lt;/p&gt;
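&lt;p&gt;The shape of that change is simple to sketch: build the cache key from the last N turns instead of the final message alone. The helper below is illustrative, not part of the library:&lt;/p&gt;

```python
def context_key(messages, n=3, sep="\n"):
    # Join the last n turns into one string; embedding this string yields a
    # context-aware cache key instead of a single-message key.
    recent = messages[-n:]
    return sep.join(f"{m['role']}: {m['content']}" for m in recent)

geo = [
    {"role": "user", "content": "Tell me about European capitals."},
    {"role": "assistant", "content": "Sure: Paris, Berlin, Madrid..."},
    {"role": "user", "content": "what about France?"},
]
cooking = [
    {"role": "user", "content": "Compare national cuisines."},
    {"role": "assistant", "content": "Italian, Japanese, Mexican..."},
    {"role": "user", "content": "what about France?"},
]
# Same final message, different context, so the keys (and thus the
# embeddings) no longer collide.
print(context_key(geo) != context_key(cooking))  # -> True
```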

&lt;p&gt;&lt;strong&gt;A Redis-backed store&lt;/strong&gt; would make the cache shareable across multiple instances of your application. Right now each process has its own FAISS index on disk. A distributed cache requires a shared vector store. Qdrant, Pinecone, or Redis with the RedisSearch module are all viable backends. NEO could scaffold the new &lt;code&gt;store.py&lt;/code&gt; backend and the adapter pattern to keep the existing API unchanged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache warming&lt;/strong&gt; is another high-value addition for predictable workloads. If you know your users will ask about a known set of topics, you can pre-populate the cache before the first real user query arrives, guaranteeing zero-latency responses for those cases. A simple CLI command, &lt;code&gt;llm-cache warm --questions questions.txt&lt;/code&gt;, would do this.&lt;/p&gt;
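&lt;p&gt;The warming pass itself is just a loop over known questions, sketched here against a minimal stand-in cache. The FAQ list and the answer function are made up for the example:&lt;/p&gt;

```python
def warm_cache(cache, questions, answer_fn):
    # Pre-populate the cache offline so the first real user query is a hit.
    for q in questions:
        cache.lookup_or_call(q, answer_fn)

class DictCache:  # minimal stand-in for the real semantic cache
    def __init__(self):
        self.store = {}

    def lookup_or_call(self, prompt, fn):
        if prompt not in self.store:
            self.store[prompt] = fn(prompt)
        return self.store[prompt]

faq = ["How do I reset my password?", "How do I cancel my plan?"]
cache = DictCache()
warm_cache(cache, faq, lambda q: f"canned answer for: {q}")
# At serve time these questions now return with zero LLM latency.
```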

&lt;p&gt;&lt;strong&gt;Analytics and observability&lt;/strong&gt; around the cache would help you tune the threshold intelligently. Right now &lt;code&gt;get_stats()&lt;/code&gt; returns hits, misses, and hit rate. A richer implementation would log the similarity score of every hit, let you visualize the distribution, and suggest threshold adjustments. Integrating with OpenTelemetry or Datadog would make this production-observable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming support&lt;/strong&gt; is the most technically involved gap. You would need to buffer streamed response chunks, detect completion, reconstruct the full response, serialize it, and store it. The tricky part is that Python generators are not directly picklable. A design that stores the reconstructed text and then re-streams it from the cache on subsequent hits would give users the same UX without paying the API cost.&lt;/p&gt;
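&lt;p&gt;One plausible shape for that design, sketched as a generator that tees the live stream into a buffer and re-streams it on later hits. This is an assumption about how it could be built, not the library's code:&lt;/p&gt;

```python
def cached_stream(prompt, cache, live_stream_fn):
    # On a hit, re-stream the stored chunks; on a miss, tee the live stream
    # into a buffer (a list of strings is picklable, the generator is not).
    cached = cache.get(prompt)
    if cached is not None:
        yield from cached
        return
    buffer = []
    for chunk in live_stream_fn(prompt):
        buffer.append(chunk)
        yield chunk            # user still gets streaming UX on the miss
    cache[prompt] = buffer     # stored only after the stream completes

cache = {}
fake_api = lambda p: iter(["Hel", "lo ", "world"])  # stand-in for stream=True
first = "".join(cached_stream("hi", cache, fake_api))
second = "".join(cached_stream("hi", cache, fake_api))  # served from cache
print(first == second == "Hello world")  # -> True
```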

&lt;p&gt;To build any of these with NEO, install the VS Code extension, clone the llm-cache repo and open the directory, and describe what you want in NEO's new chat prompt. NEO reads the existing codebase, understands the architecture, and writes code that fits the existing patterns rather than starting from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started with llm-cache
&lt;/h2&gt;

&lt;p&gt;Install the dependencies and run a demo that shows the cache working without any API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;faiss-cpu sentence-transformers openai anthropic
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
python examples/openai_example.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see &lt;code&gt;[CACHE HIT]&lt;/code&gt; and &lt;code&gt;[CACHE MISS]&lt;/code&gt; labels with a final stats block showing the hit rate. The sentence-transformers model downloads automatically on first run and is about 90 MB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjchs25tzrv7le8mwaek1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjchs25tzrv7le8mwaek1.jpg" alt="llm-cache management system example view for post fleet deployment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a real application, the migration is two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CachedOpenAI&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CachedOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with &lt;code&gt;threshold=0.90&lt;/code&gt;. Watch &lt;code&gt;client.get_stats()&lt;/code&gt; for a few days in staging. If you are seeing false positives, move to 0.93 or 0.95. If your hit rate is low and your workload is genuinely repetitive, try 0.88. The right number is workload-specific.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The unsexy reality of running LLMs in production is that most of the cost is not in the interesting, novel queries. It is in the mundane, repeated ones that look slightly different on the surface but mean exactly the same thing. &lt;code&gt;llm-cache&lt;/code&gt; is a focused solution to that specific problem, and it is well-engineered enough to drop into a production system with confidence.&lt;/p&gt;

&lt;p&gt;The fact that it was built autonomously by NEO in a single session is, honestly, the most interesting detail in the whole story. Not because it diminishes the quality of the code, but because it demonstrates what becomes possible when the cost of building a tool drops to nearly zero. You stop asking "is this worth building?" and start asking "why haven't I built this yet?"&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/llm-cache" rel="noopener noreferrer"&gt;github.com/dakshjain-1616/llm-cache&lt;/a&gt;. Go look at it. Your API bill will thank you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>We Gave an AI Agent a Long Context Caching Idea. Here's what happened next!</title>
      <dc:creator>Gaurav Vij</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:10:04 +0000</pubDate>
      <link>https://forem.com/gaurav_vij137/we-gave-an-ai-agent-a-long-context-caching-idea-heres-what-happened-next-4c40</link>
      <guid>https://forem.com/gaurav_vij137/we-gave-an-ai-agent-a-long-context-caching-idea-heres-what-happened-next-4c40</guid>
      <description>&lt;p&gt;A few days ago, Han Xiao (VP AI @ Elastic) shared an &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7448363489133273088/" rel="noopener noreferrer"&gt;experiment on Linkedin&lt;/a&gt; that asked a provocative question: what happens if you stop treating retrieval as a separate system and instead use the model’s own KV cache as the document store? &lt;/p&gt;

&lt;p&gt;The setup was ambitious: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen3.5-35B-A3B LLM,&lt;/li&gt;
&lt;li&gt;1M token context,&lt;/li&gt;
&lt;li&gt;A single 24 GB L4 GPU, and &lt;/li&gt;
&lt;li&gt;A pipeline that avoids embeddings, vector databases, and chunking entirely. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core idea was simple. Prefill the document once, save the KV cache to disk, restore it on demand, and answer queries with the full document already resident in context.&lt;/p&gt;

&lt;p&gt;We wanted to see whether &lt;a href="https://heyneo.com" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; - Fully autonomous AI Engineering Agent, could take that idea and turn it into a working implementation on its own.&lt;/p&gt;

&lt;p&gt;So we gave NEO the research direction and let it run.&lt;/p&gt;

&lt;p&gt;In about 30 minutes, it autonomously produced a working Cache-Augmented Generation system that implements the same core pattern: ingest a document once, prefill the entire document into the model’s KV cache, persist the cache as a &lt;code&gt;.bin&lt;/code&gt; file, restore it before each query, and answer against full-document context without re-embedding or re-chunking anything. &lt;/p&gt;

&lt;p&gt;The resulting &lt;a href="https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt; also documents that the full implementation, debugging, GPU validation, and documentation were done autonomously by NEO, including fixing 9 bugs across CUDA, Python, and shell, and running 11 GPU validation tests end to end.&lt;/p&gt;

&lt;h2&gt;
  
  
  The original idea
&lt;/h2&gt;

&lt;p&gt;Traditional RAG pipelines split documents into chunks, embed them, store those embeddings in a vector index, and retrieve a subset of chunks at query time. That architecture is practical and scalable, but it comes with tradeoffs. The model only sees selected fragments, retrieval quality becomes a separate engineering problem, and there is always some risk that the right information was chunked poorly or never retrieved at all. The repo’s own README summarizes that contrast directly: RAG gives the model chunked fragments, while CAG aims to keep the full document active for every query.&lt;/p&gt;

&lt;p&gt;Han’s experiment pushed that idea hard. His post describes loading a 1.2 million word novel into KV cache, pre-filling 905K tokens on a single L4 24 GB GPU, and relying on several optimizations to make that feasible, including YaRN scaling, Q3_K_M quantization, compressed KV cache, slot save and restore, and custom patches to support the model architecture. He also reported a key caveat that matters a lot: the system worked mechanically, but retrieval quality degraded badly in the middle of the context window, which is the classic lost-in-the-middle problem.&lt;/p&gt;

&lt;p&gt;That was the interesting part for us.&lt;/p&gt;

&lt;p&gt;Not because “RAG is dead” is the right conclusion. It probably is not. But because the experiment is a good stress test for whether an AI agent can reproduce a non-trivial systems idea from a public technical post and turn it into runnable software.&lt;/p&gt;

&lt;h2&gt;
  
  
  What NEO built
&lt;/h2&gt;

&lt;p&gt;We used &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;NEO's extension&lt;/a&gt; in VS Code and prompted it to build a cache-augmented generation system: a full-document QA stack built around &lt;code&gt;llama-server&lt;/code&gt; and a persistent KV slot workflow.&lt;/p&gt;

&lt;p&gt;The flow is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A document is wrapped into a structured prompt and sent to the model for a one-time prefill.&lt;/li&gt;
&lt;li&gt;The resulting KV cache is saved to disk as a slot file.&lt;/li&gt;
&lt;li&gt;For every future query, that slot file is restored into &lt;code&gt;llama-server&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The user’s question is appended to the restored state.&lt;/li&gt;
&lt;li&gt;The model answers with the entire document already present in active context.&lt;/li&gt;
&lt;/ol&gt;
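&lt;p&gt;The save and restore steps reduce to two HTTP calls against &lt;code&gt;llama-server&lt;/code&gt;'s slot endpoints. The sketch below only constructs the requests; the paths follow llama.cpp's slot save/restore API, but the base URL and slot id are assumptions and no server is contacted here:&lt;/p&gt;

```python
BASE = "http://localhost:8080"   # assumed llama-server address

def save_slot_request(slot_id, filename):
    # After the one-time prefill, persist the KV cache for this slot.
    return ("POST", f"{BASE}/slots/{slot_id}?action=save",
            {"filename": filename})

def restore_slot_request(slot_id, filename):
    # Before each query, load the saved KV state back into the slot.
    return ("POST", f"{BASE}/slots/{slot_id}?action=restore",
            {"filename": filename})

# One-time ingest: prefill the document, then save the slot.
method, url, body = save_slot_request(0, "my_doc.bin")
# Per query: restore the slot, then send the question with the document
# already resident in context.
method, url, body = restore_slot_request(0, "my_doc.bin")
print(url)  # -> http://localhost:8080/slots/0?action=restore
```

&lt;p&gt;In llama.cpp, slot persistence like this typically also requires launching the server with a slot-save path configured.&lt;/p&gt;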

&lt;p&gt;That sounds small in one paragraph, but there is a lot packed into it.&lt;/p&gt;

&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a setup script that builds the required inference stack and downloads the model&lt;/li&gt;
&lt;li&gt;a server launch script&lt;/li&gt;
&lt;li&gt;a FastAPI application for ingestion, querying, corpus management, and health checks&lt;/li&gt;
&lt;li&gt;CLI scripts for document ingest and querying&lt;/li&gt;
&lt;li&gt;a demo path&lt;/li&gt;
&lt;li&gt;Docker artifacts&lt;/li&gt;
&lt;li&gt;validation docs and a GPU testing checklist &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Access the &lt;a href="https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System" rel="noopener noreferrer"&gt;Cache-Augmented Generation GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14qqzsnw7s0om2iw0zsw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14qqzsnw7s0om2iw0zsw.jpg" alt=" " width="800" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The API surface is also clean enough to use like a real system, not just a one-off experiment. &lt;/p&gt;

&lt;p&gt;There are endpoints for &lt;code&gt;/ingest&lt;/code&gt;, &lt;code&gt;/status/{job_id}&lt;/code&gt;, &lt;code&gt;/query&lt;/code&gt;, &lt;code&gt;/corpora&lt;/code&gt;, &lt;code&gt;/corpora/{id}&lt;/code&gt;, and &lt;code&gt;/health&lt;/code&gt;, with ingestion running asynchronously and status polled through a job state transition. &lt;/p&gt;

&lt;p&gt;That matters because replication is not just “it ran once on my machine.” A credible reproduction needs to be shaped into something other people can actually use, inspect, and test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the implementation is technically interesting
&lt;/h2&gt;

&lt;p&gt;The most important architectural shift here is that retrieval moves from an external index into the model runtime itself.&lt;/p&gt;

&lt;p&gt;In standard RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;storage is in a vector database&lt;/li&gt;
&lt;li&gt;retrieval happens before generation&lt;/li&gt;
&lt;li&gt;the model sees only the retrieved subset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this CAG-style system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;storage is effectively the saved KV state&lt;/li&gt;
&lt;li&gt;retrieval is replaced by restoring a prior attention state&lt;/li&gt;
&lt;li&gt;the model answers after the full document context is already loaded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That changes both latency and operational behavior.&lt;/p&gt;

&lt;p&gt;The expensive part becomes the first prefill pass. After that, repeated queries are cheap because the cache is restored instead of recomputed. The repo reports that after ingestion, the cache lives in &lt;code&gt;kv_slots/my_doc.bin&lt;/code&gt;, and future queries restore it instantly while surviving server restarts.&lt;/p&gt;

&lt;p&gt;This is a very different tradeoff from RAG. You pay a large one-time setup cost per document or corpus, then reuse that precomputed attention state repeatedly.&lt;/p&gt;

&lt;p&gt;For some workloads, that is extremely attractive.&lt;/p&gt;

&lt;p&gt;If you have a relatively fixed corpus and many follow-up queries, the economics can make sense. If your corpus changes constantly, or if you need many documents active concurrently, the tradeoff looks worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reported results from the replication
&lt;/h2&gt;

&lt;p&gt;According to the repo, all 11 GPU tests were run on an NVIDIA RTX A6000 with Qwen3.5-35B-A3B Q3_K_M at a 1,048,576 token context window. The README reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;24.3 minute cold prefill for War and Peace at 922K tokens&lt;/li&gt;
&lt;li&gt;1.2 second KV slot restore from disk&lt;/li&gt;
&lt;li&gt;roughly 100 tokens per second decode speed at 1M context&lt;/li&gt;
&lt;li&gt;4 GB KV cache size at 1M context versus 23 GB in f16&lt;/li&gt;
&lt;li&gt;about 43% VRAM usage on the A6000 in that configuration&lt;/li&gt;
&lt;/ul&gt;
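&lt;p&gt;Those figures also let you back out the effective prefill throughput, which makes the one-time-cost versus cheap-reuse tradeoff concrete:&lt;/p&gt;

```python
# War and Peace cold prefill, per the README figures above.
tokens, minutes = 922_000, 24.3
prefill_tps = tokens / (minutes * 60)
print(round(prefill_tps))  # roughly 632 tokens/second during prefill
# versus the reported ~100 tokens/second decode once the KV cache is resident
```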

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc92jxpsh5y78wlm6lita.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc92jxpsh5y78wlm6lita.jpg" alt=" " width="800" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It also lists successful validation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TurboQuant cache types&lt;/li&gt;
&lt;li&gt;KV compression&lt;/li&gt;
&lt;li&gt;YaRN context extension from 262K to 1,048,576 tokens&lt;/li&gt;
&lt;li&gt;slot save and restore timing&lt;/li&gt;
&lt;li&gt;VRAM profiling&lt;/li&gt;
&lt;li&gt;Flash Attention&lt;/li&gt;
&lt;li&gt;end-to-end document QA demos&lt;/li&gt;
&lt;li&gt;concurrent query handling&lt;/li&gt;
&lt;li&gt;stress testing on War and Peace&lt;/li&gt;
&lt;li&gt;API key authentication&lt;/li&gt;
&lt;li&gt;persistence across server restarts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is also a smaller demo run using Alice in Wonderland and Peter Pan where the repo reports 2 out of 2 documents ingested, 6 out of 6 queries answered correctly, average decode speed around 103 tok/s, and no OOM errors.&lt;/p&gt;

&lt;p&gt;Those numbers are useful for two reasons.&lt;/p&gt;

&lt;p&gt;First, they show the system is not just conceptually aligned with the original post. It is instrumented and benchmarked.&lt;/p&gt;

&lt;p&gt;Second, they make it easier to reason about where this architecture is actually viable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The engineering constraints are the real story
&lt;/h2&gt;

&lt;p&gt;One thing I like about both the original post and the replication is that neither pretends this is magic.&lt;/p&gt;

&lt;p&gt;The constraints are real.&lt;/p&gt;

&lt;p&gt;The replicated system is explicitly Linux and NVIDIA only. The large-model path requires 24 GB or more of VRAM for the full 1M-token configuration. Smaller VRAM tiers fall back to smaller Qwen variants and much shorter context windows. Initial setup takes about 35 minutes, mostly spent building CUDA kernels. The full Qwen3.5-35B path also requires a Hugging Face token.&lt;/p&gt;

&lt;p&gt;There are also architectural limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the long initial prefill still costs about 24 minutes on the A6000 for a very large document&lt;/li&gt;
&lt;li&gt;only one active corpus is supported in the current single-slot setup&lt;/li&gt;
&lt;li&gt;switching corpora means restoring a different slot&lt;/li&gt;
&lt;li&gt;the lost-in-the-middle problem remains real at extreme context depth&lt;/li&gt;
&lt;/ul&gt;
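&lt;p&gt;On the single-slot point: switching corpora is a save/restore call rather than a fresh prefill. As a hypothetical sketch, assuming a llama.cpp-style slot API (&lt;code&gt;POST /slots/{id}?action=save|restore&lt;/code&gt;), which the replication may or may not expose in exactly this shape:&lt;/p&gt;

```python
# Hypothetical client helper for swapping the single active corpus by
# restoring a different saved KV slot. The endpoint shape mirrors
# llama.cpp's slot save/restore API; treat it as an assumption, not the
# replication repo's documented interface.
from urllib.parse import urlencode

def slot_url(base: str, slot_id: int, action: str) -> str:
    if action not in ("save", "restore"):
        raise ValueError(f"unknown slot action: {action}")
    return f"{base}/slots/{slot_id}?{urlencode({'action': action})}"

# Switching corpora = restore another slot's KV state instead of re-prefilling:
print(slot_url("http://localhost:8080", 0, "restore"))
# → http://localhost:8080/slots/0?action=restore
```

&lt;p&gt;The appeal is that the restore is a disk read measured in seconds, while re-prefilling the new corpus would cost minutes.&lt;/p&gt;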

&lt;p&gt;The lost-in-the-middle point is the big one.&lt;/p&gt;

&lt;p&gt;Han’s own comment on the original post says the system could generate readable answers, but hallucinated badly in the middle of the 905K-token context and mainly attended to the start and end of the document. The replicated repo reports a similar caveat in its sample War and Peace results, where one ending-related question is marked only partial because of lost-in-the-middle behavior.&lt;/p&gt;

&lt;p&gt;So no, this does not prove that traditional RAG is obsolete.&lt;/p&gt;

&lt;p&gt;What it proves is that KV-cache-centric document serving is increasingly practical as a systems pattern, and that the bottleneck is moving from “can we load this much context” toward “can the model actually use it reliably.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4buhjqry3u0j2gtcr0i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4buhjqry3u0j2gtcr0i.jpg" alt=" " width="800" height="659"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that matters most to me
&lt;/h2&gt;

&lt;p&gt;The technical implementation is interesting.&lt;/p&gt;

&lt;p&gt;But the more important story is how it got built.&lt;/p&gt;

&lt;p&gt;The repo states that NEO handled setup, debugging, validation, and documentation autonomously. That means this was not just a code generation exercise. It involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;navigating an unfamiliar architecture&lt;/li&gt;
&lt;li&gt;getting the inference stack working&lt;/li&gt;
&lt;li&gt;dealing with CUDA and shell issues&lt;/li&gt;
&lt;li&gt;validating runtime behavior on GPU&lt;/li&gt;
&lt;li&gt;wrapping the result in a usable API and CLI&lt;/li&gt;
&lt;li&gt;writing documentation that explains how the system works and where it breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is much closer to real engineering work than most “AI built X” demos.&lt;/p&gt;

&lt;p&gt;The valuable question is no longer whether an agent can produce a toy script from a prompt.&lt;/p&gt;

&lt;p&gt;The better question is whether it can take a new technical idea, explore the dependency stack, adapt it to real hardware constraints, instrument the result, debug its mistakes, and leave behind something another engineer can inspect and run.&lt;/p&gt;

&lt;p&gt;This replication is a good example of that threshold being crossed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would take away from this
&lt;/h2&gt;

&lt;p&gt;I do not think the lesson is “replace RAG with giant context everywhere.”&lt;/p&gt;

&lt;p&gt;I think the real lessons are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistent KV state is becoming a usable systems primitive.&lt;/strong&gt;&lt;br&gt;
It is not just an internal optimization anymore. It can be treated as part of application architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-context serving changes the shape of the stack.&lt;/strong&gt;&lt;br&gt;
You can move work from retrieval infrastructure into model runtime, but only for some workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The hard part is now quality, not just capacity.&lt;/strong&gt;&lt;br&gt;
Getting 1M context to fit is impressive. Getting the model to attend well across that full range is the deeper challenge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Autonomous agents are becoming useful for reproducing research systems.&lt;/strong&gt;&lt;br&gt;
Not in a magical “push button, get product” sense. In a practical engineering sense where they can compress a lot of setup, debugging, and validation work into one session.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last point is the reason we cared enough to run this experiment in the first place.&lt;/p&gt;

&lt;p&gt;A lot of technical posts die as inspiration. They get bookmarked, maybe discussed, and then disappear.&lt;/p&gt;

&lt;p&gt;This one turned into a runnable system in about 30 minutes.&lt;/p&gt;

&lt;p&gt;That is a meaningful change in what an AI engineering agent can do.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>gemma</category>
    </item>
    <item>
<title>Gemma 4 on GPU runtime. An overview of the process. #llm #gemma #benchmarks</title>
      <dc:creator>Gaurav Vij</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:35:43 +0000</pubDate>
      <link>https://forem.com/gaurav_vij137/gemma-4-on-gpu-runtime-an-overview-of-the-process-llm-gemma-benchmarks-57kp</link>
      <guid>https://forem.com/gaurav_vij137/gemma-4-on-gpu-runtime-an-overview-of-the-process-llm-gemma-benchmarks-57kp</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/gaurav_vij137/i-ran-googles-latest-gemma-4-models-on-48gb-gpu-heres-what-actually-happened-5d3d" class="crayons-story__hidden-navigation-link"&gt;I Ran Google's latest Gemma 4 Models on 48GB GPU. Here's What Actually Happened.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/gaurav_vij137" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390876%2F1001f18f-15c5-4cb3-b792-3c4e81a1cc61.jpg" alt="gaurav_vij137 profile" class="crayons-avatar__image" width="400" height="400"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/gaurav_vij137" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Gaurav Vij
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Gaurav Vij
                
              
              &lt;div id="story-author-preview-content-3454289" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/gaurav_vij137" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390876%2F1001f18f-15c5-4cb3-b792-3c4e81a1cc61.jpg" class="crayons-avatar__image" alt="" width="400" height="400"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Gaurav Vij&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/gaurav_vij137/i-ran-googles-latest-gemma-4-models-on-48gb-gpu-heres-what-actually-happened-5d3d" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 4&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/gaurav_vij137/i-ran-googles-latest-gemma-4-models-on-48gb-gpu-heres-what-actually-happened-5d3d" id="article-link-3454289"&gt;
          I Ran Google's latest Gemma 4 Models on 48GB GPU. Here's What Actually Happened.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/gemma"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;gemma&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/llm"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;llm&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/gemini"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;gemini&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/gaurav_vij137/i-ran-googles-latest-gemma-4-models-on-48gb-gpu-heres-what-actually-happened-5d3d" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/gaurav_vij137/i-ran-googles-latest-gemma-4-models-on-48gb-gpu-heres-what-actually-happened-5d3d#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            6 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>A CLI tool to score fine-tuning dataset quality before training starts</title>
      <dc:creator>Gaurav Vij</dc:creator>
      <pubDate>Tue, 14 Apr 2026 14:55:18 +0000</pubDate>
      <link>https://forem.com/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng</link>
      <guid>https://forem.com/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsqi3tlfcrrol6uclerq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsqi3tlfcrrol6uclerq.png" alt=" " width="800" height="510"&gt;&lt;/a&gt;&lt;br&gt;
One of the most frustrating outcomes in machine learning is spending time and GPU budget on a fine-tuning run, only to discover later that the real issue was the dataset.&lt;/p&gt;

&lt;p&gt;A few missing fields, inconsistent structure, duplicated samples, weak coverage, or noisy records can quietly drag down results. And by the time you notice, you have already paid for the experiment.&lt;/p&gt;

&lt;p&gt;To make that easier to catch upfront, we built &lt;strong&gt;Fine-tune Dataset Quality Scorer&lt;/strong&gt; using &lt;a href="https://heyneo.com" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, the first autonomous AI engineering agent.&lt;/p&gt;

&lt;p&gt;It is a CLI tool that analyzes fine-tuning datasets before training begins and returns an actionable quality score in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;Instead of waiting for model behavior to reveal data problems, the tool scans your JSONL dataset ahead of time and surfaces issues with exact row references and concrete recommendations.&lt;/p&gt;

&lt;p&gt;It runs &lt;strong&gt;11 automated checks&lt;/strong&gt; across four layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data integrity&lt;/li&gt;
&lt;li&gt;content coverage&lt;/li&gt;
&lt;li&gt;LLM-based review&lt;/li&gt;
&lt;li&gt;cross-dataset safety&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also auto-detects dataset schema, so it can adapt to formats like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alpaca&lt;/li&gt;
&lt;li&gt;ChatML&lt;/li&gt;
&lt;li&gt;Prompt/Completion&lt;/li&gt;
&lt;li&gt;ShareGPT&lt;/li&gt;
&lt;li&gt;Generic JSONL&lt;/li&gt;
&lt;/ul&gt;
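&lt;p&gt;The repo does not publish its detection internals, but the idea can be sketched as key-based sniffing on each JSONL record. This is an illustrative approximation, not the tool's actual logic:&lt;/p&gt;

```python
# Minimal schema sniffing: each fine-tuning format has a telltale set of
# top-level keys. Function and label names here are illustrative.
def detect_schema(record: dict) -> str:
    if "messages" in record:                              # ChatML-style chat turns
        return "chatml"
    if "conversations" in record:                         # ShareGPT exports
        return "sharegpt"
    if "instruction" in record and "output" in record:    # Alpaca triples
        return "alpaca"
    if "prompt" in record and "completion" in record:     # classic pairs
        return "prompt_completion"
    return "generic"                                      # fall back to generic JSONL

print(detect_schema({"prompt": "Hi", "completion": "Hello"}))  # → prompt_completion
```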

&lt;h2&gt;
  
  
  How scoring works
&lt;/h2&gt;

&lt;p&gt;Each check contributes to a weighted final score from &lt;strong&gt;0 to 100&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That score maps to four grades:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;READY&lt;/strong&gt;: 92–100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CAUTION&lt;/strong&gt;: 80–91&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NEEDS WORK&lt;/strong&gt;: 60–79&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NOT READY&lt;/strong&gt;: below 60&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The weights are configurable through YAML, so teams can tune the scoring logic to match their own standards.&lt;/p&gt;
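&lt;p&gt;The grade mapping itself is simple enough to show inline. A minimal sketch using the thresholds above; the function name is illustrative, not the tool's actual code:&lt;/p&gt;

```python
# Map a weighted 0-100 score to the four grades listed above.
def grade(score: float) -> str:
    if score >= 92:
        return "READY"
    if score >= 80:
        return "CAUTION"
    if score >= 60:
        return "NEEDS WORK"
    return "NOT READY"

print(grade(88.8))  # the Hacker News example below → CAUTION
```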

&lt;h2&gt;
  
  
  Domain-specific analysis
&lt;/h2&gt;

&lt;p&gt;One part I especially like is that it does not stop at generic validation.&lt;/p&gt;

&lt;p&gt;The tool can also detect the dataset domain automatically, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coding&lt;/li&gt;
&lt;li&gt;QA&lt;/li&gt;
&lt;li&gt;translation&lt;/li&gt;
&lt;li&gt;summarization&lt;/li&gt;
&lt;li&gt;conversation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it runs coverage analysis that is specific to that type of dataset.&lt;/p&gt;

&lt;p&gt;For example, a coding dataset can be checked for things like task-type balance and error-handling coverage, instead of receiving only generic warnings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optional LLM-based review
&lt;/h2&gt;

&lt;p&gt;There is also an &lt;code&gt;llm-review&lt;/code&gt; mode.&lt;/p&gt;

&lt;p&gt;This samples records and asks a Claude model to evaluate them on clarity, quality, and coherence. That score can be folded into the overall result with a 15% weight. If no API key is present, it skips this step automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example output
&lt;/h2&gt;

&lt;p&gt;We also generated an HTML report for the &lt;a href="https://huggingface.co/datasets/open-index/hacker-news" rel="noopener noreferrer"&gt;Hacker News comments dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It scored &lt;strong&gt;88.8 / 100&lt;/strong&gt;, which landed in &lt;strong&gt;CAUTION&lt;/strong&gt;. Most checks passed, but the report flagged &lt;strong&gt;missing values&lt;/strong&gt; as the main issue, with completeness at &lt;strong&gt;85.6%&lt;/strong&gt;. That is a good example of the kind of problem that often slips through until much later in the pipeline. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why we built it
&lt;/h2&gt;

&lt;p&gt;This project was also a useful demonstration of what we are building with NEO.&lt;/p&gt;

&lt;p&gt;Rather than using AI only for snippets or one-off code suggestions, we wanted to show that an autonomous agent can build something practical end-to-end: a real tool, with structured logic, useful outputs, and production relevance.&lt;/p&gt;

&lt;p&gt;The result is not just a demo. It is something teams could actually plug into their workflow or CI pipeline to catch dataset issues before training starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/dakshjain-1616/Fine-tune-Dataset-Quality-Scorer" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Fine-tune-Dataset-Quality-Scorer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think dataset quality is still one of the most under-appreciated bottlenecks in fine-tuning workflows.&lt;/p&gt;

&lt;p&gt;A lot of “model quality” problems are really data quality problems in disguise.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>finetuning</category>
      <category>llm</category>
      <category>datascience</category>
    </item>
    <item>
      <title>I Ran Google's latest Gemma 4 Models on 48GB GPU. Here's What Actually Happened.</title>
      <dc:creator>Gaurav Vij</dc:creator>
      <pubDate>Sat, 04 Apr 2026 15:58:51 +0000</pubDate>
      <link>https://forem.com/gaurav_vij137/i-ran-googles-latest-gemma-4-models-on-48gb-gpu-heres-what-actually-happened-5d3d</link>
      <guid>https://forem.com/gaurav_vij137/i-ran-googles-latest-gemma-4-models-on-48gb-gpu-heres-what-actually-happened-5d3d</guid>
      <description>&lt;p&gt;This week Google dropped Gemma 4, and I wanted to test all four variants on my workstation. &lt;/p&gt;

&lt;p&gt;The specs looked interesting: two small edge models (2B and 4B), a MoE model that claims "26B total but only 4B active", and a dense 31B beast. The question was simple: which ones actually run on a single RTX A6000 with 48GB of VRAM?&lt;/p&gt;

&lt;p&gt;The internet had answers. Most said you'd need 4-bit quantization for the larger models. Some said the MoE wouldn't fit at all. I decided to test everything in full bfloat16 precision, no quantization, and measure what actually happens.&lt;/p&gt;

&lt;p&gt;I didn't do this manually. I worked with &lt;a href="https://heyneo.so" rel="noopener noreferrer"&gt;Neo&lt;/a&gt;, an AI engineering agent we built, to set up the benchmark pipeline. Neo researched the model architectures, wrote the loading scripts, fixed bugs when the MoE model refused to load, and ran each test iteration. When the 31B model showed weird memory numbers, Neo caught that we'd accidentally loaded it in 4-bit instead of bfloat16 and re-ran it correctly. The whole process took a few hours instead of days because Neo handled the implementation details while I focused on what the results meant.&lt;/p&gt;

&lt;p&gt;Here's what I found. A TL;DR snapshot of the quantitative evaluations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zq8jbh2akqye03uh60a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zq8jbh2akqye03uh60a.png" alt="Gemma 4 model evaluations" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I tested all four models on an &lt;strong&gt;NVIDIA RTX A6000 (48GB VRAM)&lt;/strong&gt;. No quantization. No tricks. Just loading each model in native bfloat16 precision and running 15 test prompts through them.&lt;/p&gt;

&lt;p&gt;The prompts covered three areas: JSON output (5 tests), instruction following (5 tests), and general generation (5 tests). I measured peak VRAM usage, tokens per second, time to first token, and whether the models actually followed the prompts.&lt;/p&gt;
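&lt;p&gt;For reference, both latency metrics come from a simple wall-clock wrapper around a streaming generation loop. This is a generic sketch of the measurement, not the actual benchmark harness; &lt;code&gt;stream_tokens&lt;/code&gt; is a stand-in for any streaming API:&lt;/p&gt;

```python
import time

def benchmark(stream_fn, prompt):
    """Return (time-to-first-token in seconds, tokens per second)."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream_fn(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total

# Dummy streamer so the sketch runs standalone; a real one would yield
# tokens from the model as they decode.
def stream_tokens(prompt):
    for tok in prompt.split():
        yield tok

ttft, tps = benchmark(stream_tokens, "a quick benchmark sketch")
print(f"TTFT {ttft * 1000:.3f} ms, {tps:.0f} tok/s")
```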

&lt;h2&gt;
  
  
  The Memory Surprise
&lt;/h2&gt;

&lt;p&gt;Here's the thing nobody expected. All four models loaded successfully in full bfloat16 precision. No quantization needed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;VRAM Used&lt;/th&gt;
&lt;th&gt;% of 48GB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E2B&lt;/td&gt;
&lt;td&gt;10.25GB&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td&gt;15.99GB&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B-A4B&lt;/td&gt;
&lt;td&gt;42.30GB&lt;/td&gt;
&lt;td&gt;88%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;43.82GB&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 31B model uses 43.82GB. The 26B-A4B MoE uses 42.30GB. Both fit. Both run. No quantization required.&lt;/p&gt;

&lt;p&gt;If you've been running these models in 4-bit because you thought they wouldn't fit, you can stop. You're using quantization for a problem that doesn't exist on 48GB hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed vs Size: The Trade-Off Gets Real
&lt;/h2&gt;

&lt;p&gt;Throughput told a different story. The smaller models are fast. The big ones are... not.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tokens/sec&lt;/th&gt;
&lt;th&gt;Time to First Token&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E2B&lt;/td&gt;
&lt;td&gt;16.93&lt;/td&gt;
&lt;td&gt;0.06s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td&gt;13.82&lt;/td&gt;
&lt;td&gt;0.07s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B-A4B&lt;/td&gt;
&lt;td&gt;9.58&lt;/td&gt;
&lt;td&gt;0.21s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;0.54&lt;/td&gt;
&lt;td&gt;1.89s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 31B model generates 0.54 tokens per second. That's one token every two seconds. For a chatbot, that's painful. For batch processing, maybe fine. For real-time applications, forget it.&lt;/p&gt;

&lt;p&gt;The 26B-A4B MoE is the interesting one here. It runs at 9.58 tokens per second. That's 18 times faster than the dense 31B, using almost the same amount of VRAM. The MoE architecture activates only about 4B of parameters per token, even though all 26B weights sit in memory. You get near-31B quality with 4B inference cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "4B Active" Actually Means
&lt;/h2&gt;

&lt;p&gt;This confused me at first. The model is called "26B-A4B". Marketing says "4B active parameters". But it uses 42GB of VRAM. If it's only using 4B parameters, why does it need 42GB?&lt;/p&gt;

&lt;p&gt;The answer: "4B active" refers to computation, not memory. All 26 billion weights load into VRAM. But for each token, the model routes through only about 4 billion of them. The rest sit idle.&lt;/p&gt;

&lt;p&gt;Think of it like a restaurant with 26 chefs in the kitchen, but only 4 cook your order. You still need to pay all 26 chefs (memory cost), but only 4 are working at any moment (compute cost).&lt;/p&gt;

&lt;p&gt;This is why the MoE runs so fast. It's doing 4B worth of math per token, not 26B. But you still need the full 42GB to store all the weights.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Edge Models Are Built Differently
&lt;/h2&gt;

&lt;p&gt;The E2B and E4B models use something called Per-Layer Embeddings. Traditional transformers have one embedding layer at the start. Gemma 4's edge models add a second embedding pathway that feeds into every decoder layer.&lt;/p&gt;

&lt;p&gt;Google designed this for quantized deployment on phones and laptops. The extra embedding pathway helps small models maintain quality even when you compress them to 4-bit or 8-bit. On my 48GB GPU, they ran in full precision and used 10GB and 16GB respectively.&lt;/p&gt;

&lt;p&gt;They're fast. The E2B hits 16.93 tokens per second with 61ms time to first token. If you're building a chatbot that needs to feel instant, this is your model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Following: The 73% Pattern
&lt;/h2&gt;

&lt;p&gt;I ran 15 prompts per model. Five asked for JSON output. Five tested instruction following. Five were general generation tasks.&lt;/p&gt;

&lt;p&gt;Three models scored 73% compliance. E4B, 26B-A4B, and 31B all passed 11 out of 15 tests. The E2B scored lower at 60%, passing 9 out of 15.&lt;/p&gt;

&lt;p&gt;The pattern wasn't random. The larger three models failed the same JSON tests. They'd produce valid JSON structure, but wrap it in markdown code blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you parse this as raw JSON, it fails. The parser sees the backticks and "json" label before the curly brace. &lt;strong&gt;But the JSON itself is valid.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't a model capability issue. It's a formatting convention. The models learned to wrap code in markdown during training. If you strip the markdown wrappers before parsing, compliance jumps from 73% to roughly 90-95%.&lt;/p&gt;
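&lt;p&gt;Stripping the wrappers is a one-liner. A sketch, assuming output fenced like the example above:&lt;/p&gt;

```python
import json
import re

def parse_fenced_json(text: str):
    # Drop a leading ```json (or bare ```) fence and a trailing ``` fence,
    # then parse whatever is left as JSON. Un-fenced input passes through.
    cleaned = re.sub(r"^\s*```(?:json)?\s*|\s*```\s*$", "", text.strip())
    return json.loads(cleaned)

raw = '```json\n{\n  "name": "Alice",\n  "age": 30\n}\n```'
print(parse_fenced_json(raw))  # → {'name': 'Alice', 'age': 30}
```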

&lt;p&gt;The E2B failed more often on instruction tests. It would truncate responses or miss constraints in multi-step prompts. The larger models followed instructions precisely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Use for Real Projects
&lt;/h2&gt;

&lt;p&gt;After running all four, here's what I'd pick for different use cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time chatbot:&lt;/strong&gt; E2B. It's fast enough that users won't notice latency. 16.93 tokens per second means responses appear instantly. The 60% compliance rate is fine for casual chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production API:&lt;/strong&gt; E4B. Best balance of speed and capability. 13.82 tokens per second, 73% compliance, uses only 16GB VRAM. You can run this on a single mid-range GPU and serve real users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex reasoning:&lt;/strong&gt; 26B-A4B. If you need the model to think through multi-step problems or handle nuanced tasks, this is the sweet spot. Near-31B quality, 9.58 tokens per second, fits on 48GB without quantization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maximum quality, no speed requirement:&lt;/strong&gt; 31B. Only if you're doing batch processing or research where throughput doesn't matter. The 0.54 tokens per second is brutal for interactive use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quantization Myth
&lt;/h2&gt;

&lt;p&gt;The biggest takeaway: you don't need 4-bit quantization for Gemma 4 on 48GB hardware. The models fit in full precision. The 31B uses 43.82GB. The 26B-A4B uses 42.30GB. Both leave enough headroom for context and batch processing.&lt;/p&gt;

&lt;p&gt;If you're quantizing because you think the models won't fit, try loading them in bfloat16 first. You might find you're trading quality for a problem that doesn't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Bottleneck
&lt;/h2&gt;

&lt;p&gt;Memory isn't the bottleneck for Gemma 4 on 48GB GPUs. Throughput is.&lt;/p&gt;

&lt;p&gt;The 31B model fits. But it's so slow that you'll question whether it's usable. The MoE architecture in 26B-A4B solves this by activating fewer parameters per token. You get the quality of a 26B model with the speed of a 4B model, while still needing 42GB VRAM to store all the weights.&lt;/p&gt;

&lt;p&gt;If you're choosing between 26B-A4B and 31B for a production system, pick the MoE. The 18x speed difference matters more than the marginal quality gain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Gemma 4's architecture choices signal where the industry is heading. Per-Layer Embeddings for edge deployment. MoE for cloud workstations. Dense models for maximum quality when speed doesn't matter.&lt;/p&gt;

&lt;p&gt;The edge models (E2B, E4B) are built for phones and laptops. The MoE (26B-A4B) is built for single-GPU cloud workstations. The dense 31B is built for research and batch processing.&lt;/p&gt;

&lt;p&gt;Pick the one that matches your deployment target. Don't quantize unless you actually need to. And if you're parsing JSON, strip the markdown wrappers first.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All benchmarks ran on NVIDIA RTX A6000 (48GB VRAM) using bfloat16 precision without quantization. Test suite: 15 prompts per model (5 JSON, 5 instruction, 5 generation).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemma</category>
      <category>ai</category>
      <category>llm</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Achieving 90% Cost-Effective Transcription and Translation with Optimised OpenAI Whisper on Q Blocks</title>
      <dc:creator>Gaurav Vij</dc:creator>
      <pubDate>Sat, 08 Apr 2023 21:23:06 +0000</pubDate>
      <link>https://forem.com/gaurav_vij137/achieving-90-cost-effective-transcription-and-translation-with-optimised-openai-whisper-onq-blocks-59gf</link>
      <guid>https://forem.com/gaurav_vij137/achieving-90-cost-effective-transcription-and-translation-with-optimised-openai-whisper-onq-blocks-59gf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgft7w0waw0itlfyc7va2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgft7w0waw0itlfyc7va2.jpg" alt="Optimizing OpenAI Whisper for High-Performance Transcribing and Cost Efficiency" width="800" height="450"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Large language models (LLMs) are AI models that use deep learning algorithms, such as transformers, to process vast amounts of text data, enabling them to learn patterns of human language and thus generate high-quality text outputs. They are used in applications like speech to text, chatbots, virtual assistants, language translation, and sentiment analysis.&lt;/p&gt;

&lt;p&gt;However, it is difficult to use these LLMs because they require significant computational resources to train and run effectively. More computational resources require complex scaling infrastructure and often result in higher cloud costs.&lt;/p&gt;

&lt;p&gt;To help solve this massive problem of using LLMs at scale, &lt;a href="https://www.qblocks.cloud/" rel="noopener noreferrer"&gt;Q Blocks&lt;/a&gt; has introduced a decentralized GPU computing approach coupled with optimized model deployment, which not only reduces the cost of execution several-fold but also increases throughput, resulting in more samples served per second.&lt;/p&gt;

&lt;p&gt;In this article, we will show (with a comparison) how the cost of execution can be reduced and throughput increased several-fold for a large language model like OpenAI Whisper in a speech-to-text transcription use case, by first optimising the AI model and then using Q Blocks's cost-efficient GPU cloud to run it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Want early access to Q Blocks' Whisper API? Join our &lt;a href="https://regw8xqrnyn.typeform.com/monsterapi" rel="noopener noreferrer"&gt;waitlist&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Importance of optimising AI models
&lt;/h3&gt;




&lt;p&gt;For any AI model, there are 2 major phases of execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Learning (Model training phase), and&lt;/li&gt;
&lt;li&gt;  Execution (Model deployment phase).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Training a large language model can take weeks or even months and can require specialized hardware, such as graphics processing units (GPUs), which are prohibitively expensive on traditional cloud platforms like AWS and GCP.&lt;/p&gt;

&lt;p&gt;In addition, LLMs can be computationally expensive to run, especially when processing large volumes of text or speech data in real time. In particular, the complexity of large language models stems from the massive number of parameters that they contain. These parameters represent the model's learned representations of language patterns. More parameters can help produce higher-quality outputs, but they require more memory and compute to process.  &lt;/p&gt;

&lt;p&gt;This can make it challenging to deploy these models in production environments and can limit their practical use.&lt;/p&gt;

&lt;p&gt;Efficient, smaller LLMs result in a lower cost of deployment, higher speed, and easier scaling, enabling businesses to deploy LLMs more quickly and effectively.&lt;/p&gt;

&lt;p&gt;This is why model optimisation becomes crucial in the domain of AI. The optimisation process also helps reduce the carbon footprint of AI models, making them more sustainable and environmentally friendly.&lt;/p&gt;

&lt;h3&gt;
  
  
  About OpenAI-Whisper Model
&lt;/h3&gt;




&lt;p&gt;OpenAI Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The architecture of the model is based on an encoder-decoder transformer and has shown significant performance improvement compared to previous models because it has been trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0w1veqn6mlqx2uit7es.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0w1veqn6mlqx2uit7es.png" alt="OpenAI Whisper model encoder-decoder transformer architecture" width="800" height="606"&gt;&lt;/a&gt; &lt;a href="https://github.com/openai/whisper" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenAI released 6 versions of the Whisper model. Each version has a different parameter count; more parameters mean a higher memory requirement due to the larger model size, but also higher accuracy in the transcribed output.  &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Large-v2&lt;/code&gt; is the biggest version of the Whisper model and offers superior transcription quality, but it requires more GPU memory due to its size and is 32x slower than the smallest version, i.e. &lt;code&gt;tiny&lt;/code&gt;. More information on the available versions is &lt;a href="https://github.com/openai/whisper#available-models-and-languages" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But here lies a conflict: what if you want the highest-quality transcription output but are restricted by a limited GPU budget for executing the model? Model optimisation is what helps us achieve that. There are a couple of optimisation approaches, such as using mixed precision, which reduces the memory requirements and computation time of the model, or reducing the number of layers or using a smaller hidden dimension to shrink the model's size and thus speed up inference.&lt;/p&gt;
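&lt;p&gt;A quick back-of-the-envelope calculation shows why precision matters. Assuming roughly 1.55B parameters for Whisper large-v2 (per the model card), halving the bytes per weight halves the memory needed for the weights alone:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Approximate weight memory for Whisper large-v2 (~1.55B parameters);
# activations and intermediate buffers add more on top of this.
params = 1_550_000_000
fp32_gb = params * 4 / 1e9   # 4 bytes per float32 weight
fp16_gb = params * 2 / 1e9   # 2 bytes per float16 weight
print(fp32_gb, fp16_gb)      # 6.2 3.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;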


&lt;h3&gt;
  
  
  Model optimisation and Cost improvements using Q Blocks
&lt;/h3&gt;




&lt;p&gt;Q Blocks makes it very easy for developers to train, tune and deploy their AI models using pre-configured ML environments on GPU instances that already have CUDA libraries, GPU drivers, and suitable AI frameworks loaded. As a result, the work required to set up an ML environment for development and deployment is reduced.&lt;/p&gt;

&lt;p&gt;For optimising the OpenAI Whisper model, we will use CTranslate2, a C++ and Python library for efficient inference with Transformer models. CTranslate2 offers out-of-the-box optimisations for the Whisper model.&lt;/p&gt;

&lt;p&gt;CTranslate2 can be easily installed in a Q Blocks GPU instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/guillaumekln/faster-whisper.git  
cd faster-whisper  
pip install -e .[conversion]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rest of the dependencies are handled by pre-installed packages in Q Blocks instances.&lt;/p&gt;

&lt;p&gt;Now we convert the Whisper large-v2 model into the CTranslate2 format for efficient inference using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2 \
   --copy_files tokenizer.json --quantization float16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is now optimised and ready for efficient inference. Here's the Python code for transcription:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from faster_whisper import WhisperModel
model_path = "whisper-large-v2-ct2/"
# Run on GPU with FP16
model = WhisperModel(model_path, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -&amp;gt; %.2fs] %s" % (segment.start, segment.end, segment.text))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we run this optimised &lt;code&gt;whisper large-v2&lt;/code&gt; model on a Q Blocks decentralized Tesla V100 GPU instance and compare it with the default Whisper large-v2 model running on an AWS P3.2xlarge (Tesla V100) GPU instance.  &lt;/p&gt;

&lt;p&gt;Both GPU instances offer the same GPU compute, but the Q Blocks GPU instance costs 50% less than AWS out of the box.&lt;/p&gt;

&lt;p&gt;We used a 1-hour audio sample and transcribed it with the models running on the two GPU instances mentioned above. Below is a quick comparison of the number of GPUs and the cost consumed to process the same amount of audio hours in a normalized execution period:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8htzf99mjeem2q6u9p6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8htzf99mjeem2q6u9p6.png" alt="Benchmark table" width="634" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above benchmarks, it is evident that running an optimised model on Q Blocks's cost-efficient GPUs resulted in a &lt;strong&gt;12x cost reduction&lt;/strong&gt;. These numbers lead to even greater savings and performance upgrades at scale.  &lt;/p&gt;

&lt;p&gt;For example, transcribing 10,000 hours of audio files would be $3,100 less costly on Q Blocks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd27w2o536qgrpqhtzs5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd27w2o536qgrpqhtzs5.jpg" alt="OpenAI Whisper model benchmark on AWS v/s Q Blocks" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Using these optimisations in production
&lt;/h3&gt;




&lt;p&gt;The implications of running optimised models on a decentralized GPU cloud like Q Blocks are significant for a wide range of AI applications.&lt;br&gt;&lt;br&gt;
For instance, consider the case of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Zoom calls and video subtitles:&lt;/strong&gt; In these scenarios, real-time transcription accuracy is crucial for effective communication. By reducing costs and improving performance, a business can scale to serve millions of users without compromising their experience.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Customer Service Chatbots:&lt;/strong&gt; With Q Blocks GPU cloud, LLM based chatbots can be trained to respond more quickly and accurately, providing a better user experience for customers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Language Translation:&lt;/strong&gt; Serving real-time translation to millions of users requires faster response times, and using optimised LLMs on Q Blocks can help you achieve that.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Whisper API for speech to text transcription 🗣
&lt;/h3&gt;




&lt;p&gt;At Q Blocks, we understand the need for affordable and high-performing GPU instances to accelerate AI model development and deployment. We are making the process of accessing AI models like Whisper easier for application developers to create cutting-edge products that deliver optimal performance and cost-effectiveness.&lt;/p&gt;

&lt;p&gt;For the use case of transcribing audio files at scale, &lt;a href="https://monsterapi.ai" rel="noopener noreferrer"&gt;MonsterAPI&lt;/a&gt; (a platform for Generative AI by Q Blocks) is coming up with a ready-to-use API for the Whisper Large-v2 model which will be optimised and work out of the box at scale to serve your needs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Want early access to Q Blocks' Whisper API? Join our &lt;a href="https://regw8xqrnyn.typeform.com/monsterapi" rel="noopener noreferrer"&gt;waitlist&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;




&lt;p&gt;To conclude, performance optimization has become a crucial aspect of AI model development, and GPUs play a significant role in achieving faster training and inference times. The comparison of the two approaches for running an AI model has shown that Q Blocks can help you improve your AI models' cost efficiency by up to 12x.&lt;/p&gt;

&lt;p&gt;Reference: Thanks to the GitHub project &lt;a href="https://github.com/guillaumekln/faster-whisper" rel="noopener noreferrer"&gt;guillaumekln/faster-whisper&lt;/a&gt; for providing a CTranslate2-driven optimised workflow for Whisper.&lt;/p&gt;

</description>
      <category>whisper</category>
      <category>nlp</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What do you use GPUs for?</title>
      <dc:creator>Gaurav Vij</dc:creator>
      <pubDate>Mon, 16 May 2022 16:10:17 +0000</pubDate>
      <link>https://forem.com/gaurav_vij137/what-do-you-use-gpus-for-31m6</link>
      <guid>https://forem.com/gaurav_vij137/what-do-you-use-gpus-for-31m6</guid>
      <description>&lt;p&gt;GPUs are becoming better to serve general purpose computing.🚀&lt;/p&gt;

&lt;p&gt;With their parallel computing architecture, they offer great speed-ups for use cases like 3D rendering, machine learning, data science, crypto mining, and scientific computing.&lt;/p&gt;

&lt;p&gt;That makes me wonder, &lt;strong&gt;What do you use GPUs for today? 😀&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Oftentimes, access to GPUs is very costly. Whether you build your own PC or run it on the cloud, the costs can burn a hole in your pocket. &lt;/p&gt;

&lt;p&gt;To solve this issue, we created &lt;a href="https://www.qblocks.cloud" rel="noopener noreferrer"&gt;Q Blocks&lt;/a&gt; - a decentralized GPU computing platform for machine learning! 🎉&lt;/p&gt;

&lt;p&gt;Using under-utilised computing systems to run ML drastically reduces the cost of GPU access for ML devs. High-end GPUs are available on Q Blocks at up to 1/10th the cost.&lt;/p&gt;

&lt;p&gt;If you had access to this much GPU power, what would you use it for? 🤔&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Why GPUs are great for Reinforcement Learning?</title>
      <dc:creator>Gaurav Vij</dc:creator>
      <pubDate>Thu, 14 Apr 2022 16:02:45 +0000</pubDate>
      <link>https://forem.com/gaurav_vij137/why-gpus-are-great-for-reinforcement-learning-iac</link>
      <guid>https://forem.com/gaurav_vij137/why-gpus-are-great-for-reinforcement-learning-iac</guid>
      <description>&lt;p&gt;This quick guide focuses on the basics of reinforcement learning (RL) and how GPUs enable accelerated performance for RL.&lt;/p&gt;

&lt;p&gt;To give a quick insight into why GPUs matter so much in today's world:&lt;br&gt;
GPUs achieve faster performance through their parallel computing architecture. They are designed to run thousands of parallel threads and follow what is known as a SIMD architecture, i.e. Single Instruction, Multiple Data. &lt;/p&gt;

&lt;p&gt;A simple example of SIMD is rendering a game scene on a screen: the GPU uses thousands of cores to render each pixel in parallel. The instruction to render a pixel is the same, while the data for each pixel is different.&lt;/p&gt;

&lt;p&gt;GPUs are finding a great use in Deep learning and Machine learning applications today. But sometimes it can be really frustrating to figure out the &lt;a href="https://www.qblocks.cloud/blog/best-gpu-for-deep-learning" rel="noopener noreferrer"&gt;best GPUs for deep learning&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  So first of all, what really is Reinforcement Learning?
&lt;/h2&gt;

&lt;p&gt;Reinforcement learning is a type of machine learning that provides a framework for solving problems in ways that are similar to the way humans would solve them. It is the machine equivalent of trial and error. The goal is to maximize the amount of reward received by repeatedly attempting different actions.&lt;/p&gt;

&lt;p&gt;The use cases for reinforcement learning are wide-ranging, and can be used to solve problems in domains such as healthcare, marketing, traffic management, robotics, education and more. Reinforcement learning is machine learning with experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvbzv2nt5kzpflh4lini.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvbzv2nt5kzpflh4lini.gif" alt="Robot learning to solve Rubiks cube" width="720" height="380"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image &lt;a href="https://towardsdatascience.com/this-is-how-reinforcement-learning-works-5080b3a335d6" rel="noopener noreferrer"&gt;source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The most common algorithm for reinforcement learning is called Q-learning, in which a software agent must make a decision or take a course of action in each state.&lt;/p&gt;

&lt;p&gt;Open source toolkits such as &lt;a href="https://gym.openai.com/" rel="noopener noreferrer"&gt;Open AI Gym&lt;/a&gt; can be used for developing and comparing reinforcement learning algorithms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjb6omg9xz2o2ipdy4wnq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjb6omg9xz2o2ipdy4wnq.png" alt="Reinforcement learning example" width="800" height="507"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image &lt;a href="https://towardsdatascience.com/deep-q-network-combining-deep-reinforcement-learning-a5616bcfc207" rel="noopener noreferrer"&gt;source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Reinforcement Learning uses a reward signal to learn. Its aim is to explore all the possible cases in an environment to learn which actions can help it maximize the total reward collected over time. This exploration is performed by an RL agent.&lt;/p&gt;
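&lt;p&gt;The ideas above can be sketched in a few lines of tabular Q-learning. This is our own toy example (a 5-state corridor with a reward at the far end, not taken from any particular toolkit); because Q-learning is off-policy, the agent can act uniformly at random during training and still learn the greedy policy of always moving right:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random

random.seed(0)            # deterministic toy run

N_STATES = 5              # states 0..4; state 4 is the rewarded goal
ACTIONS = [-1, 1]         # move left or move right
alpha, gamma = 0.5, 0.9   # learning rate and discount factor
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, move):
    nxt = min(max(state + move, 0), N_STATES - 1)   # clamp to the corridor
    reward = 1.0 if nxt == N_STATES - 1 else 0.0    # reward only at the goal
    return nxt, reward

for episode in range(300):
    s = 0
    for t in range(100):                     # cap the episode length
        i = random.randrange(len(ACTIONS))   # random behaviour policy
        s2, r = step(s, ACTIONS[i])
        # Q-learning update: bootstrap from the best action in the next state
        Q[s][i] += alpha * (r + gamma * max(Q[s2]) - Q[s][i])
        s = s2
        if s == N_STATES - 1:                # episode ends at the goal
            break

# The greedy policy now moves right (action index 1) in every state
policy = [Q[s].index(max(Q[s])) for s in range(N_STATES - 1)]
print(policy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Real RL workloads run many such environments and updates simultaneously, which is exactly where a GPU's parallelism pays off.&lt;/p&gt;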

&lt;h2&gt;
  
  
  GPUs in Deep Learning &amp;amp; Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;Most people think of GPUs as something that is only used for gaming or video editing, but in recent years they have taken on a new role in AI. &lt;/p&gt;

&lt;p&gt;GPUs work better than CPUs when it comes to deep learning neural networks and Reinforcement learning because they can process more data at once with less power consumption. This is mainly due to their parallel processing abilities - meaning they can do more calculations at the same time.&lt;/p&gt;

&lt;p&gt;CPU vs GPU performance example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c1hh55frkjpppw6sxkb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c1hh55frkjpppw6sxkb.gif" alt="CPU vs GPU performance example fo fluid rendering" width="462" height="260"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image &lt;a href="https://gfycat.com/glaringscalyirrawaddydolphin" rel="noopener noreferrer"&gt;source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Deep learning is a type of machine learning where neural networks are used to make inferences about data. These neural networks are computationally demanding because they contain many layers. With the help of GPUs, deep neural nets can be trained much faster than before, which has led to an exponential increase in their use for classification and regression problems.&lt;/p&gt;

&lt;p&gt;GPUs are great at matrix multiplications, and deep neural nets have to perform thousands of matrix multiplication tasks during training, making GPUs a great fit for them.&lt;/p&gt;
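&lt;p&gt;To make this concrete, here is a plain-Python matrix multiply (our own illustration): every output element is an independent dot product, and that independence is what a GPU exploits by assigning one thread per element. A dense neural-net layer's forward pass is exactly this operation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):        # on a GPU, each (i, j) output element
        for j in range(cols):    # would be computed by its own thread
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(inner))
    return C

W = [[1.0, 2.0], [3.0, 4.0]]   # a tiny dense layer's weights
x = [[1.0], [1.0]]             # an input column vector
print(matmul(W, x))            # prints [[3.0], [7.0]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;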

&lt;p&gt;A GPU-powered reinforcement learner is a machine learning agent that runs its RL experiments on a GPU and tries to learn how to maximise its expected reward by interacting with an environment, where it receives rewards or punishments based on its actions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5efi196iw6uuvdrnaoi.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5efi196iw6uuvdrnaoi.gif" alt="Robot is able to play a game" width="480" height="263"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image &lt;a href="https://www.useunicorn.com/pepper-robot-uses-trial-and-error-learning-to-master-a-childs-game/" rel="noopener noreferrer"&gt;source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;GPUs are also used for applications such as self-driving, analytics, or any other workload that needs to process large amounts of data in parallel, where the cost was previously too high to use them for these purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Using GPUs for Deep Learning Applications
&lt;/h2&gt;

&lt;p&gt;GPUs have proven to be the most efficient processing hardware for deep learning applications. Over the past few years, the use of GPUs for deep learning has been on the rise because of the many benefits they provide. These advantages include high-quality training and fast processing times, as well as a lower overall cost of experimentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb14dj96li7i1b5kx5ml.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb14dj96li7i1b5kx5ml.jpg" alt="GPUs in view" width="640" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@nanadua11?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Nana Dua&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/gpu?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPUs are used because they have more cores, which allows them to process more data at a time and provides better performance. This makes them a great fit for deep learning applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A GPU's architecture allows models to be trained more quickly, making it an ideal option for reinforcement learning as well as supervised and unsupervised machine learning.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are the Best GPUs for Reinforcement Learning?
&lt;/h2&gt;

&lt;p&gt;A GPU (Graphics Processing Unit) executes complex algorithms efficiently by offering high bandwidth and low latency for its memory access, making it one of the fastest general-purpose compute devices.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;NVIDIA Tesla V100&lt;/strong&gt; is one of the best GPUs for reinforcement learning. It is capable of hosting multiple computational graphs and scales almost linearly up to 8-GPU clusters.&lt;/p&gt;

&lt;p&gt;Deep learning frameworks such as Caffe2, PyTorch, and TensorFlow can all make use of GPU acceleration to achieve better performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Reinforcement learning is a process of trial and error: the agent makes many attempts to maximize reward and thus learns the best actions to take. GPUs speed up reinforcement learning by evaluating these actions in parallel with their parallel computing architecture.&lt;/p&gt;

&lt;p&gt;If you are a &lt;strong&gt;deep learning or machine learning engineer&lt;/strong&gt; then you'd know that GPU computing is very costly on cloud.&lt;br&gt;
We understand your pain. &lt;/p&gt;

&lt;p&gt;So to democratize GPU computing access, we built &lt;a href="https://www.qblocks.cloud/" rel="noopener noreferrer"&gt;Q Blocks&lt;/a&gt;, a decentralized computing platform that enables 50-80% cost efficient GPU computing for Machine learning and Deep learning workloads. 😀&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>distributedsystems</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
