<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Derrick Pedranti</title>
    <description>The latest articles on Forem by Derrick Pedranti (@derrickpedranti).</description>
    <link>https://forem.com/derrickpedranti</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3746226%2Fa6a66f89-1947-4bb1-a546-78cdcb512123.png</url>
      <title>Forem: Derrick Pedranti</title>
      <link>https://forem.com/derrickpedranti</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/derrickpedranti"/>
    <language>en</language>
    <item>
      <title>Improving Determinism with LLMs: Prompting, Model Selection, Context, and Tools</title>
      <dc:creator>Derrick Pedranti</dc:creator>
      <pubDate>Sat, 02 May 2026 04:48:06 +0000</pubDate>
      <link>https://forem.com/derrickpedranti/improving-determinism-with-llms-prompting-model-selection-context-and-tools-21ja</link>
      <guid>https://forem.com/derrickpedranti/improving-determinism-with-llms-prompting-model-selection-context-and-tools-21ja</guid>
      <description>&lt;p&gt;Large language models are incredibly powerful, but they are not automatically deterministic.&lt;/p&gt;

&lt;p&gt;Ask the same question twice and you may get slightly different answers. Ask for facts without enough context and the model may fill in gaps. Ask it to perform complex matching or calculations directly in natural language and you may get an answer that sounds confident but is not reliable enough for production use.&lt;/p&gt;

&lt;p&gt;That does not mean LLMs are unreliable by default. It means we need to design around how they work.&lt;/p&gt;

&lt;p&gt;When building AI-powered applications, improving determinism usually comes down to four practical methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt engineering&lt;/li&gt;
&lt;li&gt;Choosing the right model&lt;/li&gt;
&lt;li&gt;Providing the right context, including RAG&lt;/li&gt;
&lt;li&gt;Using tools for deterministic work&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal is not to make the LLM magically perfect. The goal is to reduce ambiguity, improve accuracy, and prevent the model from inventing answers when it does not have enough information.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Prompt Engineering
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is one of the simplest ways to improve LLM reliability. A vague prompt gives the model too much freedom. A specific prompt gives it boundaries.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compare these records and tell me which ones match.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can improve the prompt by giving the model a clear process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compare the records step by step.
First, normalize company names.
Second, compare addresses.
Third, compare phone numbers.
Fourth, assign a confidence score.
If there is not enough evidence to determine a match, return `unknown`.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good prompt engineering often includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step-by-step instructions&lt;/li&gt;
&lt;li&gt;Specific examples&lt;/li&gt;
&lt;li&gt;Example outputs&lt;/li&gt;
&lt;li&gt;Clear formatting requirements&lt;/li&gt;
&lt;li&gt;Constraints on what sources the model should use&lt;/li&gt;
&lt;li&gt;Permission for the model to say “I don’t know”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is important.&lt;/p&gt;

&lt;p&gt;LLMs are often optimized to be helpful, which can sometimes make them answer even when they should not.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Giving the model permission to say it does not know can reduce hallucinations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If the answer cannot be determined from the provided context, respond with:
"I don't know based on the provided information."
Do not guess.
Do not use outside knowledge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of instruction helps the model stay inside the boundaries of the task. Prompting alone will not guarantee perfect results, but it is usually the first layer of control.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Choosing the Right Model
&lt;/h2&gt;

&lt;p&gt;Not all LLMs are equally good at every task.&lt;/p&gt;

&lt;p&gt;Some models are stronger at reasoning. Some are better at coding. Some are optimized for speed and cost. Some are designed for image generation, document understanding, or multimodal workflows.&lt;/p&gt;

&lt;p&gt;For example, a model like Claude Opus 4.7 is commonly used for complex reasoning and coding-heavy tasks. A model like Nano Banana Pro is designed for high-quality image generation and editing, including use cases where accurate text rendering inside images matters.&lt;/p&gt;

&lt;p&gt;The key point is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pick the model based on the task, not just the brand name.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If your task is code generation, evaluate models against coding benchmarks and real coding examples from your own project. If your task is medical document summarization, legal review, financial extraction, or data matching, evaluate models against examples from that subject matter. If your task is image generation, use a model designed for image generation.&lt;/p&gt;

&lt;p&gt;Model settings matter too.&lt;/p&gt;

&lt;p&gt;Temperature is one of the most important settings for determinism. Lower temperature generally makes responses more predictable and focused, while higher temperature increases creativity and variation.&lt;/p&gt;

&lt;p&gt;For accuracy-focused tasks like structured extraction, classification, JSON output, or data processing, I usually prefer a low temperature (often close to &lt;code&gt;0&lt;/code&gt;). Conversely, for creative writing, brainstorming, or marketing copy, a higher temperature may be more appropriate.&lt;/p&gt;
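
&lt;p&gt;For example, a minimal sketch of a low-temperature extraction call using the Anthropic Python SDK might look like this (the model id and prompt are placeholders, not recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Low temperature keeps structured extraction predictable and focused
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=1024,
    temperature=0.0,
    messages=[
        {"role": "user", "content": "Extract the company name, address, and phone number as JSON."}
    ],
)

print(response.content[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
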

&lt;p&gt;Another useful pattern is intelligent model routing.&lt;/p&gt;

&lt;p&gt;Instead of sending every prompt to the same model, you can route tasks based on intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If the user asks for code generation, use the coding model.
If the user asks for image generation, use the image model.
If the user asks for summarization, use the fast summarization model.
If the user asks for complex reasoning, use the reasoning model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This routing can be rule-based, or you can use an LLM to classify the task and select the best model. The more specialized the task, the more important model selection becomes.&lt;/p&gt;
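
&lt;p&gt;A rule-based version of this can be as small as a dictionary lookup. The sketch below is illustrative only; the intent classifier and the model names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal rule-based router: map a classified intent to a model id (all names are placeholders)
MODEL_ROUTES = {
    "code_generation": "coding-model",
    "image_generation": "image-model",
    "summarization": "fast-summarization-model",
    "complex_reasoning": "reasoning-model",
}

def classify_intent(prompt: str) -&amp;gt; str:
    """Stand-in classifier. In practice this could be keyword rules or a small LLM call."""
    lowered = prompt.lower()
    if "image" in lowered or "diagram" in lowered:
        return "image_generation"
    if "summarize" in lowered or "summary" in lowered:
        return "summarization"
    if "code" in lowered or "function" in lowered:
        return "code_generation"
    return "complex_reasoning"

def route_model(intent: str) -&amp;gt; str:
    """Pick a model for the intent, with a safe default."""
    return MODEL_ROUTES.get(intent, "general-purpose-model")

model = route_model(classify_intent("Summarize this contract in three bullet points."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
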




&lt;h2&gt;
  
  
  3. Providing the Right Context (RAG)
&lt;/h2&gt;

&lt;p&gt;Context is one of the biggest factors in improving LLM accuracy.&lt;/p&gt;

&lt;p&gt;An LLM without context may answer based on general knowledge. That can be useful, but it is risky when you need answers grounded in specific documents, company policies, user data, contracts, codebases, or domain-specific content.&lt;/p&gt;

&lt;p&gt;Context gives the model boundaries.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Answer only using the provided context.
If the context does not contain the answer, say you do not know.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is where RAG becomes extremely useful.&lt;/p&gt;

&lt;p&gt;RAG stands for Retrieval-Augmented Generation. In a RAG system, your documents are usually chunked, embedded, and stored in a vector database. When a user asks a question, the system performs a semantic search to find relevant content and passes that content to the LLM as context.&lt;/p&gt;

&lt;p&gt;Instead of asking the model to rely only on what it already knows, you are giving it the source material it should use.&lt;/p&gt;

&lt;p&gt;A simplified RAG flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User asks a question
        ↓
Search relevant documents
        ↓
Retrieve the best matching chunks
        ↓
Pass those chunks to the LLM
        ↓
Generate an answer grounded in the retrieved context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This improves determinism because the model is no longer operating in an open-ended way. It has a defined source of truth.&lt;/p&gt;
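
&lt;p&gt;In code, the retrieve-then-generate step often reduces to a few lines. The sketch below assumes you already have an embedder, a vector store with a search method, and an LLM client; all of those names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def answer_with_rag(question: str, embedder, vector_store, llm) -&amp;gt; str:
    """Ground the answer in retrieved chunks instead of open-ended generation."""
    # 1. Embed the question and retrieve the closest chunks
    query_vector = embedder.encode(question)
    chunks = vector_store.search(query_vector, top_k=5)

    # 2. Build a prompt that restricts the model to the retrieved context
    context_block = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer only using the provided context.\n"
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )

    # 3. Generate an answer grounded in the retrieved context
    return llm.generate(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
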

&lt;p&gt;RAG is especially useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal documentation&lt;/li&gt;
&lt;li&gt;Policy questions&lt;/li&gt;
&lt;li&gt;Knowledge bases&lt;/li&gt;
&lt;li&gt;Technical documentation&lt;/li&gt;
&lt;li&gt;Customer support&lt;/li&gt;
&lt;li&gt;Contract review&lt;/li&gt;
&lt;li&gt;Medical or legal document review&lt;/li&gt;
&lt;li&gt;Codebase Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Research assistants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, RAG does not automatically solve everything. You still need good chunking, good retrieval, good metadata, and good prompting. If the wrong context is retrieved, the model may still produce the wrong answer.&lt;/p&gt;

&lt;p&gt;A strong RAG prompt is built on strict boundaries. While a production prompt would be much more detailed, a simplified example of the core instructions looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use only the provided context.
Cite the source sections used.
Do not answer from general knowledge.
If the answer is not present in the context, say so.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helps reduce hallucinations and makes the answer easier to verify.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Using Tools for Deterministic Work
&lt;/h2&gt;

&lt;p&gt;Tools are one of the best ways to improve reliability.&lt;/p&gt;

&lt;p&gt;There are many tasks that an LLM should not perform directly if you need consistent, production-quality results.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex calculations&lt;/li&gt;
&lt;li&gt;Fuzzy matching across large datasets&lt;/li&gt;
&lt;li&gt;Sorting and filtering&lt;/li&gt;
&lt;li&gt;Database queries&lt;/li&gt;
&lt;li&gt;API lookups&lt;/li&gt;
&lt;li&gt;File parsing&lt;/li&gt;
&lt;li&gt;Data validation&lt;/li&gt;
&lt;li&gt;Date calculations&lt;/li&gt;
&lt;li&gt;Business rule execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An LLM can reason about these tasks, but it should not always be the thing performing them.&lt;/p&gt;

&lt;p&gt;If you need to compare thousands of records, do not rely on the LLM to manually inspect all of them in a prompt. Instead, create a tool.&lt;/p&gt;

&lt;p&gt;For example, a fuzzy matching tool could be written in Python and exposed to the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fuzzy_match_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Deterministically compare two datasets and return likely matches.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;source_records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;target_records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
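

&lt;p&gt;The &lt;code&gt;calculate_similarity&lt;/code&gt; helper above is left abstract on purpose. One possible sketch, using Python's standard-library &lt;code&gt;difflib&lt;/code&gt; and comparing only normalized name fields, might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from difflib import SequenceMatcher

def calculate_similarity(source: dict, target: dict) -&amp;gt; float:
    """Score two records between 0 and 1 by comparing normalized name fields.
    A production version would also weigh addresses, phone numbers, and so on."""
    source_name = source.get("name", "").strip().lower()
    target_name = target.get("name", "").strip().lower()
    return SequenceMatcher(None, source_name, target_name).ratio()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;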



&lt;p&gt;The LLM can decide when to use the tool, explain the results, and help the user interpret the output. But the matching itself happens in code, which is much more reliable.&lt;/p&gt;

&lt;p&gt;The same applies to calculations. If you need accurate math, use a calculator tool or a Python function. If you need data from a database, use a query tool. If you need to check real-time information, use an API.&lt;/p&gt;
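
&lt;p&gt;How a tool is exposed depends on your provider's tool-use API, but on the application side it is usually just a registry that maps tool names to plain functions, plus a schema the model can see. A rough sketch (the schema fields follow a common JSON-Schema style and are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Registry of deterministic tools the orchestration layer can call
TOOLS = {
    "fuzzy_match_records": fuzzy_match_records,
}

# Description handed to the LLM so it knows when to request the tool
TOOL_SCHEMAS = [
    {
        "name": "fuzzy_match_records",
        "description": "Deterministically compare two datasets and return likely matches.",
        "input_schema": {
            "type": "object",
            "properties": {
                "source_records": {"type": "array"},
                "target_records": {"type": "array"},
                "threshold": {"type": "number"},
            },
            "required": ["source_records", "target_records"],
        },
    }
]

def dispatch_tool_call(name: str, arguments: dict):
    """Run the requested tool in code, not in the model."""
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**arguments)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
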

&lt;p&gt;The pattern is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use the LLM for reasoning, language, orchestration, and explanation.&lt;br&gt;
Use tools for deterministic execution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is especially important in agentic workflows.&lt;/p&gt;

&lt;p&gt;The more autonomy you give an AI agent, the more important tool boundaries become. Tools should be scoped, validated, logged, and restricted. A tool should do one thing clearly and safely.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools make LLM systems more reliable because they move critical operations out of natural language and into deterministic code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One important clarification: a tool does not guarantee correct results just because it is a tool. It guarantees that the same code runs consistently, assuming the implementation and inputs are correct. That is still a major improvement over asking an LLM to improvise calculations or matching logic in plain text.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bringing It All Together
&lt;/h2&gt;

&lt;p&gt;Improving determinism with LLMs is not about one magic trick. It is a layered approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt engineering gives the model clear instructions.&lt;/li&gt;
&lt;li&gt;Model selection ensures you are using the right model for the task.&lt;/li&gt;
&lt;li&gt;Context and RAG ground the model in relevant source material.&lt;/li&gt;
&lt;li&gt;Tools move critical logic into deterministic code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these methods can dramatically improve the reliability of LLM-powered applications.&lt;/p&gt;

&lt;p&gt;A practical architecture might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Prompt
   ↓
Prompt Classification
   ↓
Model Routing
   ↓
Retrieve Context with RAG
   ↓
LLM Reasoning
   ↓
Tool Calls for Deterministic Work
   ↓
Validated Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of design gives you the best of both worlds. You get the flexibility and reasoning ability of an LLM, but you also get the reliability of structured prompts, grounded context, model specialization, and deterministic tools.&lt;/p&gt;
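
&lt;p&gt;A thin orchestration layer tying that diagram together might look roughly like the sketch below; every collaborator it takes (classifier, router, retriever, LLM client, tool registry, validator) is a placeholder for the pieces described earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def handle_prompt(user_prompt: str, classifier, router, retriever, llm, tools, validator) -&amp;gt; str:
    """Sketch of the layered flow: classify, route, ground, reason, execute, validate."""
    intent = classifier(user_prompt)                    # prompt classification
    model = router(intent)                              # model routing
    context = retriever(user_prompt)                    # retrieve context with RAG
    draft = llm.generate(model, user_prompt, context)   # LLM reasoning

    # Deterministic work happens in tools, not in the model
    for call in draft.tool_calls:
        result = tools.dispatch(call.name, call.arguments)
        draft = llm.generate(model, user_prompt, context, tool_results=[result])

    return validator(draft.text)                        # validated response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
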

&lt;p&gt;That is where LLM applications become much more production-ready.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;LLMs are powerful, but they need guardrails. If you want better accuracy, fewer hallucinations, and more repeatable results, start by asking four questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is my prompt specific enough?&lt;/li&gt;
&lt;li&gt;Am I using the right model for this task?&lt;/li&gt;
&lt;li&gt;Have I provided the right context?&lt;/li&gt;
&lt;li&gt;Should this task be handled by a tool instead of the LLM?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The more often you answer those questions intentionally, the more deterministic your AI system becomes. LLMs are not just chatbots anymore. They are reasoning engines, orchestrators, and interfaces to tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;But for production systems, the best results come when we stop expecting the model to do everything by itself and instead design systems that combine LLM intelligence with deterministic software engineering.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>rag</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Stop Overloading Your CLAUDE.md — Simplicity Wins (and Saves Tokens)</title>
      <dc:creator>Derrick Pedranti</dc:creator>
      <pubDate>Sun, 12 Apr 2026 21:15:56 +0000</pubDate>
      <link>https://forem.com/derrickpedranti/stop-overloading-your-claudemd-simplicity-wins-and-saves-tokens-e07</link>
      <guid>https://forem.com/derrickpedranti/stop-overloading-your-claudemd-simplicity-wins-and-saves-tokens-e07</guid>
      <description>&lt;p&gt;If your &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;.cursorrules&lt;/code&gt;, or &lt;code&gt;agent.md&lt;/code&gt; file is longer than a few hundred lines, you are probably making your AI assistant worse, not better.&lt;/p&gt;

&lt;p&gt;Every time you start a new chat session, you pay a hidden cost for massive context files—in tokens, performance, and overall accuracy. Many developers tend to over-engineer their context files, stuffing them with endless rules and massive context blocks. Ironically, this usually leads to worse results.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shift Most Developers Haven't Fully Realized
&lt;/h2&gt;

&lt;p&gt;Modern Large Language Models (LLMs) are exceptionally capable right out of the box. You no longer need to explain fundamental concepts like how React works, what REST APIs are, or re-teach basic programming architecture. The models already possess this knowledge.&lt;/p&gt;

&lt;p&gt;What matters now isn't providing &lt;em&gt;more&lt;/em&gt; instructions, but managing the context you provide much more effectively. This emerging practice is known as &lt;strong&gt;context engineering&lt;/strong&gt;—the art of optimizing exactly what goes into the model's context window to produce the best possible results without overwhelming it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hidden Cost of Large Context Files
&lt;/h2&gt;

&lt;p&gt;Every time you start a new coding session or prompt your AI assistant, your context files (&lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;agent.md&lt;/code&gt;, system instructions) are all loaded into the context window.&lt;/p&gt;

&lt;p&gt;That content immediately converts into tokens, and those tokens have a tangible cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; You pay for every token processed, whether through direct API usage or hidden compute limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention:&lt;/strong&gt; LLMs have finite attention spans. Essential project rules get diluted by boilerplate instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Risk:&lt;/strong&gt; The larger the context, the slower the response times, and the higher the chance the model hallucinates or ignores specific constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLMs operate within a finite context window, meaning everything you include competes for attention. When you dump a massive configuration file into every single session, you run the risk of degrading the model's reasoning capabilities over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Mistake: "More Context = Better Results"
&lt;/h2&gt;

&lt;p&gt;It feels logical to assume that giving an AI more background information will yield a better answer. However, research and real-world usage consistently demonstrate a "less-is-more" effect in prompting. Removing non-essential content actually improves the accuracy and relevance of the model's output.&lt;/p&gt;

&lt;p&gt;When a context window is bloated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The model gets distracted:&lt;/strong&gt; It might fixate on a minor, irrelevant rule you included "just in case."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Important instructions get buried:&lt;/strong&gt; The "needle in a haystack" problem means your critical constraints are lost in a sea of generic best practices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signal-to-noise ratio drops:&lt;/strong&gt; Meaningful project context is drowned out by unnecessary explanations, leading to generic or confused outputs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What You Should Do Instead
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Keep Context Files Minimal
&lt;/h3&gt;

&lt;p&gt;Most developers and teams do not need an enormous configuration file. Your system prompts should be lean and highly specific.&lt;/p&gt;

&lt;p&gt;Only include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project-specific rules:&lt;/strong&gt; Naming conventions, specific directory structures, or custom architectural patterns unique to your repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraints the model wouldn't infer:&lt;/strong&gt; Hard requirements like "Never use external libraries for data fetching" or "Strictly adhere to local timezones."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Truly required defaults:&lt;/strong&gt; Formatting preferences or language-specific compiler flags.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else? Remove it.&lt;/p&gt;
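
&lt;p&gt;For a sense of scale, a lean context file for a hypothetical project might be nothing more than this (every rule below is invented purely for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CLAUDE.md

## Project rules
- API routes live in src/api/, shared types in src/types/.
- Use the in-house fetch wrapper in src/lib/http.ts; never add a new HTTP client.
- All timestamps are stored in UTC and formatted at the edge.

## Conventions
- TypeScript strict mode; no `any` without a comment explaining why.
- Tests are colocated as *.test.ts next to the file under test.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
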

&lt;h3&gt;
  
  
  2. Stop Treating Agents Like They're Dumb
&lt;/h3&gt;

&lt;p&gt;There is no need to include generic instructions like "Write clean code" or "Use best practices." Modern models are aligned to do this by default. Telling an advanced LLM to write good code is like telling a senior engineer not to forget to breathe—it wastes space and adds no value.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Use Skills Instead of Static Context
&lt;/h3&gt;

&lt;p&gt;This is where you can drastically improve your workflow. Agent skills allow for &lt;strong&gt;progressive disclosure&lt;/strong&gt; of context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instead of loading a massive document of instructions upfront, only the skill's name and a brief description are loaded initially (consuming perhaps 100 tokens).&lt;/li&gt;
&lt;li&gt;The full, detailed instructions and context are only loaded dynamically when the agent decides it needs to use that specific skill.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By utilizing skills, you ensure lower token usage per request, significantly better focus for the model, and a much more scalable system as your project grows.&lt;/p&gt;
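
&lt;p&gt;For illustration, a skill often boils down to a small markdown file: a short name and description that always sit in context, and a body that only loads when the skill is invoked. The exact frontmatter fields depend on your agent framework, so treat the example below as a hypothetical sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
name: generate-release-notes
description: Draft release notes from merged PR titles following the project's changelog format.
---

# Instructions (only loaded when the skill is invoked)
1. Group changes under Added / Changed / Fixed.
2. Link each entry to its pull request.
3. Keep every entry to a single line.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
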

&lt;h3&gt;
  
  
  4. Keep Skills Small Too
&lt;/h3&gt;

&lt;p&gt;Even dynamic skills can suffer from bloat if you aren't careful. When building out agent capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Only include what the agent wouldn't already know:&lt;/strong&gt; Do not paste the entirety of a public API's documentation if the model was likely trained on it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep instructions concise and actionable:&lt;/strong&gt; Focus on input/output expectations and specific steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid "documentation-style" writing:&lt;/strong&gt; Be direct. Once a skill activates, its entire payload enters the context window, so every word should earn its keep.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Real Insight: Context Is a Budget
&lt;/h2&gt;

&lt;p&gt;It helps to think of your context window as system RAM. In software development, you wouldn't load unnecessary libraries into memory, keep unused data structures active, or duplicate logic everywhere.&lt;/p&gt;

&lt;p&gt;You should treat your AI's context with the same level of discipline. Manage it like a strict budget where every token must justify its inclusion.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters More Than Ever
&lt;/h2&gt;

&lt;p&gt;We are entering an era of AI development where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Foundational models are becoming commoditized.&lt;/li&gt;
&lt;li&gt;Baseline capabilities across different providers are largely similar.&lt;/li&gt;
&lt;li&gt;The true differentiation lies in &lt;strong&gt;how you orchestrate and utilize them&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineering teams and individual developers who will excel are those who keep their systems lean, rigorously optimize their context, and build modular, reusable workflows—not the ones writing the most exhaustive, monolithic prompts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing: simplify-markdown
&lt;/h2&gt;

&lt;p&gt;One problem I consistently encountered while refining these workflows is that AI-generated markdown tends to get bloated incredibly fast. It becomes too verbose, contains redundant sections, includes unnecessary explanations, and relies on token-heavy structures.&lt;/p&gt;

&lt;p&gt;To solve this, I built a specialized skill: &lt;strong&gt;simplify-markdown&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This tool is designed to systematically reduce token usage, clean up unwieldy context files, and simplify agent or skill markdown files so that only the signal remains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to Find It
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/dpedranti/ai-agent-toolkit" rel="noopener noreferrer"&gt;ai-agent-toolkit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill Source:&lt;/strong&gt; &lt;a href="https://github.com/dpedranti/ai-agent-toolkit/blob/master/skills/simplify-markdown/SKILL.md" rel="noopener noreferrer"&gt;simplify-markdown/SKILL.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When to Use It
&lt;/h3&gt;

&lt;p&gt;Consider integrating &lt;code&gt;simplify-markdown&lt;/code&gt; into your workflow when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your context files (&lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;CLAUDE.md&lt;/code&gt;, etc.) are growing too large to manage easily.&lt;/li&gt;
&lt;li&gt;Your dynamic skills feel bloated and are slowing down execution.&lt;/li&gt;
&lt;li&gt;Your prompt architecture is becoming difficult to reason about.&lt;/li&gt;
&lt;li&gt;You want to immediately improve response performance and lower your token expenditure.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The future of AI-assisted development isn't about writing more instructions. It is about writing &lt;strong&gt;fewer, but better&lt;/strong&gt; instructions.&lt;/p&gt;

&lt;p&gt;By focusing on smaller context windows, cleaner automated workflows, and smarter loading mechanisms like skills, you empower the AI rather than suffocate it. The models are already highly capable; your job as an engineer is simply to provide the right environment and stay out of their way.&lt;/p&gt;




&lt;h3&gt;
  
  
  Inspiration &amp;amp; Sources
&lt;/h3&gt;

&lt;p&gt;Some of the core ideas and inspiration for this post came from the following resources—highly recommend checking them out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/playlist?list=PLQHTakJAwGLMQxKVlKUVpC7aiYTZ2j2J9" rel="noopener noreferrer"&gt;The Startup Ideas Podcast&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/@rasmic" rel="noopener noreferrer"&gt;Ras Mic on YouTube&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claude</category>
      <category>promptengineering</category>
      <category>tooling</category>
      <category>ai</category>
    </item>
    <item>
      <title>Semantic Caching for LLMs: Faster Responses, Lower Costs</title>
      <dc:creator>Derrick Pedranti</dc:creator>
      <pubDate>Sun, 29 Mar 2026 20:24:12 +0000</pubDate>
      <link>https://forem.com/derrickpedranti/semantic-caching-for-llms-faster-responses-lower-costs-81e</link>
      <guid>https://forem.com/derrickpedranti/semantic-caching-for-llms-faster-responses-lower-costs-81e</guid>
      <description>&lt;p&gt;If you're building AI applications with LLMs, you've probably noticed a pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The same (or very similar) questions keep coming in&lt;/li&gt;
&lt;li&gt;Each one triggers a full LLM call&lt;/li&gt;
&lt;li&gt;Latency adds up, and token costs quietly grow in the background&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What makes this especially frustrating is that many of these requests aren't truly unique. They're slightly reworded versions of things you've already answered.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What is the capital of France?"&lt;/li&gt;
&lt;li&gt;"What's France's capital?"&lt;/li&gt;
&lt;li&gt;"Can you tell me the capital city of France?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From an LLM's perspective, these are three separate requests. From a user's perspective, they're the same question. Without caching, you pay for each one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt; solves this. Instead of treating every request as new, your system recognizes when a query is similar enough to a previous one and reuses the existing response.&lt;/p&gt;

&lt;p&gt;In real-world systems, this single optimization can reduce LLM calls by 30–70%, drop latency from seconds to milliseconds, and significantly lower your token costs. It's one of the highest-leverage improvements you can make early in your architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Traditional caching relies on exact string matches. Change a single character and the cache misses.&lt;/p&gt;

&lt;p&gt;Semantic caching takes a different approach: instead of comparing raw text, it compares &lt;strong&gt;meaning&lt;/strong&gt; using embeddings.&lt;/p&gt;

&lt;p&gt;Here's the flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    ↓
Generate embedding
    ↓
Search cache for similar embeddings
    ↓
Match found? → Return cached response
    ↓
No match? → Call LLM → Store result in cache → Return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: you avoid calling the LLM unless you have to. Before every request, you ask &lt;em&gt;"Have I already answered something similar enough?"&lt;/em&gt; If yes, you skip the most expensive part of your system entirely.&lt;/p&gt;

&lt;p&gt;Under the hood, this works by converting queries into vectors and measuring how close they are in vector space. If the distance is below a threshold, the system considers them a match.&lt;/p&gt;
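
&lt;p&gt;To make that concrete, here is a tiny sketch using sentence-transformers to embed the three France questions from earlier; rewordings of the same question end up close together in vector space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

queries = [
    "What is the capital of France?",
    "What's France's capital?",
    "Can you tell me the capital city of France?",
]
vectors = encoder.encode(queries)

# Smaller L2 distance = closer in meaning; rewordings of the same question stay close
print(np.linalg.norm(vectors[0] - vectors[1]))
print(np.linalg.norm(vectors[0] - vectors[2]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
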




&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Let's build a working semantic cache. We'll use &lt;a href="https://github.com/facebookresearch/faiss" rel="noopener noreferrer"&gt;FAISS&lt;/a&gt; for vector search and &lt;a href="https://www.sbert.net/" rel="noopener noreferrer"&gt;sentence-transformers&lt;/a&gt; for embeddings, which keeps everything local and dependency-light.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;faiss-cpu sentence-transformers numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Note on Dependencies
&lt;/h4&gt;

&lt;p&gt;Depending on your environment (especially Python 3.12 on macOS), you may need to pin a few dependencies due to PyTorch compatibility.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"sentence-transformers&amp;lt;4"&lt;/span&gt; &lt;span class="s2"&gt;"transformers&amp;lt;5"&lt;/span&gt; &lt;span class="s2"&gt;"numpy&amp;lt;2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The versions in this article are intentionally left unpinned to keep things simple, but if you run into installation issues, try the pinned versions above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Define a cache interface
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abstractmethod&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResponseCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Look up a cached response. Returns None on miss.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Store a response for future reuse.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Defining an interface keeps your application decoupled from the caching backend. You might start with FAISS locally, then move to Redis or Qdrant in production. Your LLM logic shouldn't need to change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implement a no-op cache
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NoCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ResponseCache&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a safe default for environments where caching isn't available, and a clean baseline for benchmarking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implement the semantic cache
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ResponseCache&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;distance_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_sentence_embedding_dimension&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distance_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;distance_threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ttl_seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;

        &lt;span class="c1"&gt;# FAISS index for fast similarity search (L2 distance)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Parallel store: maps index position → cached entry
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_context_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a deterministic key from context so we only match
        responses generated under the same conditions.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;stable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ntotal&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;ctx_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_context_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distance_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="c1"&gt;# Context must match (model, temperature, user, etc.)
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;ctx_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="c1"&gt;# Respect TTL
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_context_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;&lt;/strong&gt; is a lightweight embedding model (~80MB) that's fast enough for real-time use. For higher accuracy on domain-specific queries, consider &lt;code&gt;all-mpnet-base-v2&lt;/code&gt; or a fine-tuned model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAISS &lt;code&gt;IndexFlatL2&lt;/code&gt;&lt;/strong&gt; does exact nearest-neighbor search using L2 (Euclidean) distance. For millions of entries, switch to &lt;code&gt;IndexIVFFlat&lt;/code&gt; or &lt;code&gt;IndexHNSWFlat&lt;/code&gt; for approximate search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The context key&lt;/strong&gt; ensures we never return a cached response generated under different conditions (different model, temperature, user, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tie it all together
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ResponseCache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Check cache
&lt;/span&gt;    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[cache hit]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Cache miss — call the LLM
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[cache miss → calling LLM]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Store for future reuse
&lt;/span&gt;    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Three steps: check, call, store.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try it out
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Swap in your actual LLM client here
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MockLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The capital of France is Paris.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distance_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MockLLM&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# First call — cache miss, calls the LLM
&lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;handle_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Second call — semantically similar, cache hit
&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;handle_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s France&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s capital city?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Different context — cache miss even though query is similar
&lt;/span&gt;&lt;span class="n"&gt;other_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;r3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;handle_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tell me the capital of France&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Tuning the Distance Threshold
&lt;/h2&gt;

&lt;p&gt;The distance threshold is the most important tuning parameter in your system. It controls the tradeoff between precision (returning only correct matches) and recall (catching more cache hits).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lower values&lt;/strong&gt; → stricter matching, fewer false positives, lower hit rate&lt;br&gt;
&lt;strong&gt;Higher values&lt;/strong&gt; → more matches, higher hit rate, risk of returning wrong responses&lt;/p&gt;

&lt;p&gt;The right value depends on your embedding model and distance metric:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Typical Range&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L2 (Euclidean)&lt;/td&gt;
&lt;td&gt;0.15 – 0.40&lt;/td&gt;
&lt;td&gt;Used in our FAISS example above&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cosine distance&lt;/td&gt;
&lt;td&gt;0.05 – 0.15&lt;/td&gt;
&lt;td&gt;1 - cosine_similarity; common in Redis, Qdrant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Start around 0.25 for L2 or 0.10 for cosine&lt;/strong&gt;, then adjust based on real traffic.&lt;/p&gt;
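
&lt;p&gt;The two metrics are directly related when embeddings are normalized to unit length: squared L2 distance equals twice the cosine distance, so you can translate a threshold from one row of the table to the other. Here is a quick sketch for building intuition, assuming sentence-transformers with &lt;code&gt;normalize_embeddings=True&lt;/code&gt; (the cache example above may not normalize, so treat this as a rough guide rather than an exact mapping):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Unit-length embeddings make the two distance metrics directly comparable
a, b = model.encode(
    ["What is the capital of France?", "What's France's capital city?"],
    normalize_embeddings=True,
)

cosine_distance = 1.0 - float(np.dot(a, b))
l2_distance = float(np.linalg.norm(a - b))

# For unit vectors: l2_distance ** 2 == 2 * cosine_distance (up to float error)
print(f"cosine distance: {cosine_distance:.4f}")
print(f"L2 distance:     {l2_distance:.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;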
&lt;h3&gt;
  
  
  How to calibrate
&lt;/h3&gt;

&lt;p&gt;Don't guess. Log your near-misses and spot-check them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# During development, log borderline matches for review
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distance_threshold&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Near match: dist=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | query=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | cached=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review these logs periodically. If you see incorrect matches slipping through, tighten the threshold. If you see obvious matches being missed, loosen it.&lt;/p&gt;
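
&lt;p&gt;For a more systematic pass, label a handful of query pairs from your logs as match or no-match, compute their distances once, and sweep candidate thresholds to see which value separates them best. A rough sketch; the labeled pairs here are illustrative placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# (query_a, query_b, should_match) pairs labeled by hand from real traffic
labeled_pairs = [
    ("What is the capital of France?", "What's France's capital city?", True),
    ("What is the capital of France?", "What is the capital of Germany?", False),
]

# Compute each pair's distance once
scored = []
for qa, qb, should_match in labeled_pairs:
    ea, eb = model.encode([qa, qb])
    scored.append((float(np.linalg.norm(ea - eb)), should_match))

# Sweep thresholds and report how many pairs each one classifies correctly
for threshold in (0.15, 0.20, 0.25, 0.30, 0.35, 0.40):
    correct = sum((dist &lt;= threshold) == should_match for dist, should_match in scored)
    print(f"threshold={threshold:.2f}  correct={correct}/{len(scored)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;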




&lt;h2&gt;
  
  
  Context Filters: Correctness and Security
&lt;/h2&gt;

&lt;p&gt;Semantic similarity alone isn't enough. Two queries can be nearly identical in meaning but require different responses based on context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correctness concerns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different models produce different outputs&lt;/li&gt;
&lt;li&gt;Temperature affects randomness — a cached &lt;code&gt;temperature=0&lt;/code&gt; response shouldn't serve a &lt;code&gt;temperature=1&lt;/code&gt; request&lt;/li&gt;
&lt;li&gt;System prompts or attached documents change the answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security concerns:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In multi-tenant systems, responses that contain user-specific data (account details, personalized recommendations, user-scoped RAG results) must include the user identifier in the context key. Without it, User A could receive User B's cached response. Treat this as a security boundary, not just a correctness optimization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;A note on cache hit rate:&lt;/strong&gt; Including the user in the context key means every user builds separate cache entries, even for identical answers. For applications where responses don't depend on who's asking — general knowledge, shared documentation, public FAQs — consider omitting the user from the context key so all users share the same cache entries. This can dramatically improve your hit rate. The right approach depends on your application; the important thing is to make the decision deliberately rather than applying one default across the board.&lt;/p&gt;
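
&lt;p&gt;One way to make that decision explicit is to build the context per request and include the user only when the response could contain user-specific data. A minimal sketch; the &lt;code&gt;personalized&lt;/code&gt; flag is a hypothetical signal your application would supply (for example from the route or a query classifier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_context(model: str, temperature: float, user_id: str, personalized: bool) -&gt; dict:
    """Build the cache context, sharing entries across users for non-personalized queries."""
    context = {"model": model, "temperature": temperature}
    if personalized:
        # User-specific responses must never be shared across users
        context["user"] = user_id
    return context

# Shared entry: any user asking general questions can reuse it
shared = build_context("claude-sonnet-4-20250514", 0.0, "user_123", personalized=False)

# Private entry: scoped to user_123 only
private = build_context("claude-sonnet-4-20250514", 0.0, "user_123", personalized=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;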




&lt;h2&gt;
  
  
  Cache Invalidation
&lt;/h2&gt;

&lt;p&gt;TTL handles the simple case: responses expire after a set period. But in practice, you'll also need to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Knowledge updates.&lt;/strong&gt; If the underlying data your LLM references changes (e.g., you update your RAG corpus), cached responses built on the old data become stale. Consider including a version identifier in your context key so that corpus updates automatically invalidate old entries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System prompt changes.&lt;/strong&gt; If you modify your system prompt, cached responses from the previous version may no longer be appropriate. Hashing the system prompt into your context key (as shown above) handles this automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Selective invalidation.&lt;/strong&gt; Sometimes you need to invalidate specific entries rather than waiting for TTL. Adding a &lt;code&gt;purge(context)&lt;/code&gt; method to your cache gives you this escape hatch.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;purge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove all entries matching a given context. Returns count removed.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;ctx_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_context_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;removed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ctx_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# Force expiry
&lt;/span&gt;            &lt;span class="n"&gt;removed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;removed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production systems with frequent knowledge updates, you'll likely want a more sophisticated approach — e.g., tagging cache entries with a corpus version and bulk-invalidating when the version changes.&lt;/p&gt;
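
&lt;p&gt;A lightweight version of that idea is to fold a corpus version into the context itself and bump it whenever the corpus is re-indexed; older entries then stop matching without an explicit purge. The version value below is a placeholder, and in practice it might be a deployment tag or a hash of the corpus manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;CORPUS_VERSION = "2025-05-01"  # bump whenever the RAG corpus is re-indexed

context = {
    "model": "claude-sonnet-4-20250514",
    "temperature": 0.0,
    "corpus_version": CORPUS_VERSION,
}

# Entries written under an older corpus_version have a different context key,
# so they simply stop matching; TTL eventually cleans them up.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;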




&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Caching everything.&lt;/strong&gt; Some responses should never be cached: real-time data (stock prices, weather), responses containing sensitive personal data (PII), or anything where staleness causes harm. Maintain an explicit skip list.&lt;/p&gt;
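
&lt;p&gt;A simple way to enforce this is a &lt;code&gt;should_cache&lt;/code&gt; check in front of the &lt;code&gt;put&lt;/code&gt; call. The keyword rules below are only an illustration and would need to reflect your own domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;NEVER_CACHE_KEYWORDS = ("stock price", "weather", "right now", "today")

def should_cache(query: str) -&gt; bool:
    """Skip caching for time-sensitive queries (illustrative rules only)."""
    lowered = query.lower()
    return not any(keyword in lowered for keyword in NEVER_CACHE_KEYWORDS)

# In handle_request, store the response only when it is safe to reuse:
# if should_cache(query):
#     cache.put(query, context, response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;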

&lt;p&gt;&lt;strong&gt;No TTL.&lt;/strong&gt; Without expiration, your cache will silently return outdated responses. Always set a TTL, even if it's generous (e.g., 24 hours).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring context.&lt;/strong&gt; If you cache without filtering on model, temperature, and user, you will eventually serve wrong or leaked responses. This is the most dangerous pitfall because it often doesn't surface in testing — only in production with real multi-user traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poor serialization.&lt;/strong&gt; If you're caching structured LLM responses (tool calls, JSON, streaming chunks), make sure your serialization round-trips correctly. A subtle bug here can produce responses that look right but are subtly broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding model mismatch.&lt;/strong&gt; If you change your embedding model, your existing cache becomes invalid — the vector spaces are incompatible. Either clear the cache on model change or version your cache keys.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use It (and When Not To)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FAQ-style or search-style applications with high query overlap&lt;/li&gt;
&lt;li&gt;Customer support bots where similar questions recur frequently&lt;/li&gt;
&lt;li&gt;RAG systems where the same retrievals happen repeatedly&lt;/li&gt;
&lt;li&gt;Internal tools with a bounded set of common queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Poor fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly personalized responses that differ per user even for the same query&lt;/li&gt;
&lt;li&gt;Real-time data applications where freshness is critical&lt;/li&gt;
&lt;li&gt;Creative applications where variety is the point (e.g., brainstorming tools)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Semantic caching works best when there's meaningful reuse across queries. If every request is genuinely unique, the cache overhead adds cost without benefit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Going to Production
&lt;/h2&gt;

&lt;p&gt;The FAISS implementation above is great for prototyping and single-process applications. When you're ready to scale, here's what changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector store.&lt;/strong&gt; Move to a dedicated vector database: &lt;a href="https://redis.io/docs/latest/develop/interact/search-and-query/query/vector-search/" rel="noopener noreferrer"&gt;Redis with vector search&lt;/a&gt;, &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;, &lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt;, or &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;. These give you persistence, replication, and filtering built in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding service.&lt;/strong&gt; Consider moving embedding generation to an API (OpenAI, Cohere, or a self-hosted model) so your application server stays lightweight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring.&lt;/strong&gt; Track your cache hit rate, average distance of matches, and latency savings. These metrics tell you if your threshold is calibrated correctly and how much value the cache is providing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warm-up.&lt;/strong&gt; Pre-populate the cache with common queries from your logs to maximize hit rate from day one (see the sketch below).&lt;/li&gt;
&lt;/ul&gt;
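
&lt;p&gt;The warm-up step can be as simple as replaying your most frequent historical queries through the same request path before taking live traffic. A small sketch reusing &lt;code&gt;handle_request&lt;/code&gt; from earlier; the query list is a stand-in for whatever your logs produce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Replay frequent queries from logs so the first real users hit a warm cache
common_queries = [
    "What is the capital of France?",
    "How do I reset my password?",
    "What are your business hours?",
]

for query in common_queries:
    handle_request(query, context, cache, llm)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;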




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Semantic caching doesn't require retraining models or changing your core LLM logic. It adds a decision layer in front of your existing system: &lt;em&gt;Have I already answered something similar enough?&lt;/em&gt; If yes, skip the LLM call.&lt;/p&gt;

&lt;p&gt;That single decision can make your system significantly faster, cheaper, and more scalable. If you're running LLMs in production, it's worth building in early — the ROI only grows as your traffic does.&lt;/p&gt;

&lt;p&gt;The full working code from this article is available as a single file you can drop into your project and start experimenting with immediately.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
