<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Roman Belov</title>
    <description>The latest articles on Forem by Roman Belov (@spyrae).</description>
    <link>https://forem.com/spyrae</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3814828%2F4f08754d-1c5b-45ed-9659-de1473e054df.jpeg</url>
      <title>Forem: Roman Belov</title>
      <link>https://forem.com/spyrae</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/spyrae"/>
    <language>en</language>
    <item>
      <title>Prompt Engineering System: Managing 50+ Prompts in Production</title>
      <dc:creator>Roman Belov</dc:creator>
      <pubDate>Fri, 10 Apr 2026 04:04:41 +0000</pubDate>
      <link>https://forem.com/spyrae/prompt-engineering-system-managing-50-prompts-in-production-44co</link>
      <guid>https://forem.com/spyrae/prompt-engineering-system-managing-50-prompts-in-production-44co</guid>
      <description>&lt;p&gt;The average LLM project in production uses 20–50 prompts. Classification, summarization, data extraction, response generation, quality evaluation. Each prompt requires iteration, and each iteration can break something that was working. At 50 prompts, managing them manually becomes chaos: who changed the classifier prompt? Why did summarizer accuracy drop? Which version is in production right now?&lt;/p&gt;

&lt;p&gt;This article covers how to build a prompt management system that scales from 5 to 500 prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Can't Store Prompts in Code
&lt;/h2&gt;

&lt;p&gt;A prompt looks like a string. Developers store it in code, next to the call logic. This works fine when there are only a few prompts and iterations are infrequent.&lt;/p&gt;

&lt;p&gt;Problems start at scale:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changing a prompt requires deploying the app.&lt;/strong&gt; The prompt is hardcoded. To fix a single word in a system prompt, you need a PR, review, merge, deploy. Iteration cycle: hours instead of minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No versioning.&lt;/strong&gt; Git stores history, but a diff on a 2,000-character prompt is unreadable. There's no fast path to roll back a prompt to a previous version without rolling back the entire app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No link between version and metrics.&lt;/strong&gt; Prompt changed, quality dropped. Connecting a specific prompt version to specific metrics is manual work when the prompt lives in code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-team chaos.&lt;/strong&gt; The product manager wants to adjust the tone. The ML engineer is optimizing tokens. The developer is refactoring the template. All three are editing the same file, and the outcome is unpredictable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of a Prompt Engineering System
&lt;/h2&gt;

&lt;p&gt;A mature prompt management system has four layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│              Prompt Engineering System          │
├────────────┬────────────┬────────────┬──────────┤
│  Registry  │  Testing   │  Deploy    │ Monitor  │
│            │            │            │          │
├────────────┼────────────┼────────────┼──────────┤
│ Storage    │ Pre-deploy │ Canary /   │ Metrics  │
│ + versions │ eval       │ A/B rollout│ + alerts │
└────────────┴────────────┴────────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Registry&lt;/strong&gt; — a centralized prompt store with versioning, metadata, and access control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt; — automated quality evaluation of a prompt against test datasets before deploying to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy&lt;/strong&gt; — a mechanism to push a new prompt version to production without deploying the application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor&lt;/strong&gt; — tracking quality metrics tied to specific prompt versions.&lt;/p&gt;

&lt;p&gt;You don't need to build all four layers at once. A minimum viable system is registry + deploy, but treat that as a starting point: without testing and monitoring, you're flying blind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Registry: Centralized Prompt Storage
&lt;/h2&gt;

&lt;p&gt;The registry solves the basic problem: a single source of truth for all prompts. Two approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 1: Langfuse Prompt Management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/llm-observability-langfuse/"&gt;Langfuse&lt;/a&gt; provides prompt management out of the box. Each prompt is a named entity with versions, labels, and variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Langfuse&lt;/span&gt;

&lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Langfuse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Get the production version of a prompt
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket-classifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# or "staging", "latest"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Prompt with variables
&lt;/span&gt;&lt;span class="n"&gt;system_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing,technical,general,urgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompt structure in Langfuse:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unique identifier&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ticket-classifier&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Auto-increment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;14&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;label&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Environment / status&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;production&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Format&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;text&lt;/code&gt; or &lt;code&gt;chat&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model parameters&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{"model": "gpt-4o-mini", "temperature": 0}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The prompt is decoupled from code. A product manager edits the prompt in the UI, assigns the &lt;code&gt;staging&lt;/code&gt; label, tests it, and switches to &lt;code&gt;production&lt;/code&gt;. The application code stays the same.&lt;/p&gt;
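&lt;p&gt;Under the hood, a label is just a movable pointer from an environment name to a version number. A minimal in-memory sketch of that idea (illustrative only, not the Langfuse API):&lt;/p&gt;

```python
# Illustrative sketch: a label is a pointer from an environment name
# to a version number. Promotion and rollback are pointer moves.
class LabeledPrompts:
    def __init__(self):
        self.versions = {}   # (name, version) -> prompt text
        self.labels = {}     # (name, label)   -> version

    def push(self, name, text):
        """Store a new immutable version; auto-increment the version number."""
        version = max((v for (n, v) in self.versions if n == name), default=0) + 1
        self.versions[(name, version)] = text
        self.labels[(name, "latest")] = version
        return version

    def set_label(self, name, label, version):
        """Promotion or rollback: move the pointer, nothing is redeployed."""
        self.labels[(name, label)] = version

    def get(self, name, label="production"):
        return self.versions[(name, self.labels[(name, label)])]


store = LabeledPrompts()
v1 = store.push("ticket-classifier", "Classify the ticket.")
v2 = store.push("ticket-classifier", "Classify the ticket. Return JSON.")
store.set_label("ticket-classifier", "production", v1)
store.set_label("ticket-classifier", "staging", v2)

# Promoting staging to production is a label move, not an app deploy:
store.set_label("ticket-classifier", "production", v2)
```

&lt;p&gt;Rollback is the same operation in reverse: point &lt;code&gt;production&lt;/code&gt; back at the previous version.&lt;/p&gt;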

&lt;h3&gt;
  
  
  Approach 2: Prompts-as-Code
&lt;/h3&gt;

&lt;p&gt;For teams that prefer Git as the single source of truth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompts/
├── ticket-classifier/
│   ├── prompt.yaml
│   ├── config.yaml
│   └── tests/
│       ├── dataset.jsonl
│       └── eval.py
├── summarizer/
│   ├── prompt.yaml
│   ├── config.yaml
│   └── tests/
└── prompt_registry.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prompts/ticket-classifier/prompt.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ticket-classifier&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chat&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;
&lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;system&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;You are a support ticket classifier.&lt;/span&gt;
      &lt;span class="s"&gt;Categories: {{categories}}.&lt;/span&gt;
      &lt;span class="s"&gt;Return JSON: {"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}&lt;/span&gt;
      &lt;span class="s"&gt;Response language: {{language}}.&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ticket_text}}"&lt;/span&gt;
&lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing,technical,general,urgent"&lt;/span&gt;
  &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prompt_registry.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PromptRegistry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompts_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompts_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;prompt_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompts_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;variables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}),&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{{{&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;}}}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two approaches also combine into a hybrid: prompts live in Git as the source of truth, and CI/CD syncs them to Langfuse on every merge to main.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ci/sync_prompts.py — called in CI pipeline
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Langfuse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prompt_registry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptRegistry&lt;/span&gt;

&lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Langfuse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;registry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptRegistry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket-classifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response-generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;prompt_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing: Eval Before Deploying a Prompt
&lt;/h2&gt;

&lt;p&gt;A prompt without tests is a gamble. Every change can silently break edge cases. Automated evaluation before deployment catches regressions before they reach users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Datasets: The Gold Standard
&lt;/h3&gt;

&lt;p&gt;Every prompt needs a test dataset. Minimum size: 20–30 examples covering the main scenarios and edge cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"input": "Can't process payment, card is being declined", "expected": {"category": "billing", "confidence_min": 0.8}}
{"input": "App crashes when opening the chat", "expected": {"category": "technical", "confidence_min": 0.8}}
{"input": "I want to delete my account and all my data", "expected": {"category": "general", "confidence_min": 0.7}}
{"input": "URGENT! Server is down, customers can't log in", "expected": {"category": "urgent", "confidence_min": 0.9}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dataset sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production logs.&lt;/strong&gt; Real requests with labeled responses. The most valuable source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual labeling.&lt;/strong&gt; For new prompts with no production data yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic data.&lt;/strong&gt; An LLM generates variations of existing examples. Useful for expanding edge case coverage.&lt;/li&gt;
&lt;/ul&gt;
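&lt;p&gt;For the synthetic route, the augmentation request can be assembled mechanically from labeled seed examples. A sketch of the request builder (the API call itself is omitted; the wording and &lt;code&gt;n_variations&lt;/code&gt; parameter are assumptions):&lt;/p&gt;

```python
import json

def build_augmentation_messages(seed_examples, n_variations=5):
    """Build a chat request asking an LLM to paraphrase labeled seed
    examples while keeping the expected labels intact."""
    seeds = "\n".join(json.dumps(e) for e in seed_examples)
    system = (
        "You generate test data for a support ticket classifier. "
        f"For each seed example, produce {n_variations} paraphrased "
        "variations with the SAME expected category. "
        "Return one JSON object per line, same schema as the seeds."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": seeds},
    ]

seeds = [
    {"input": "Can't process payment, card is being declined",
     "expected": {"category": "billing", "confidence_min": 0.8}},
]
messages = build_augmentation_messages(seeds, n_variations=3)
# messages is then passed to client.chat.completions.create(...);
# the generated lines are reviewed before joining the gold dataset.
```

&lt;p&gt;Keeping the expected label fixed in the instruction is the point: the model varies the surface form, not the ground truth.&lt;/p&gt;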

&lt;h3&gt;
  
  
  Eval Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prompt_registry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptRegistry&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;registry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptRegistry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Evaluate a prompt against a dataset. Return pass/fail.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;examples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ticket_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence_min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low confidence: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wrong category: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failures&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For complex cases, &lt;a href="https://dev.to/blog/llm-as-judge-automated-quality-gate/"&gt;LLM-as-Judge&lt;/a&gt; fits well. A judge model evaluates response quality against defined criteria: relevance, completeness, tone.&lt;/p&gt;
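&lt;p&gt;A minimal sketch of the judge side, under stated assumptions: the prompt template, criteria, and passing score below are illustrative, not a fixed recipe. The judge model returns JSON, and only the parsing needs to be deterministic:&lt;/p&gt;

```python
import json

# Hypothetical judge prompt; criteria and the 1-5 scale are illustrative assumptions
JUDGE_PROMPT = """You are a strict evaluator. Score the response on each criterion
from 1 to 5 and return JSON like {{"relevance": 5, "completeness": 4, "tone": 5}}.

User request:
{request}

Model response:
{response}"""


def parse_judge_verdict(raw: str, passing_score: float = 4.0) -> tuple[bool, float]:
    """Average the judge's criterion scores and compare against a passing bar."""
    scores = json.loads(raw)
    mean = sum(scores.values()) / len(scores)
    return mean >= passing_score, mean
```

&lt;p&gt;The judge call itself is an ordinary chat completion with &lt;code&gt;JUDGE_PROMPT.format(...)&lt;/code&gt;; keeping the verdict as structured JSON makes the quality gate scriptable.&lt;/p&gt;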

&lt;h3&gt;
  
  
  CI/CD Integration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/prompt-eval.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prompt Evaluation&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompts/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install dependencies&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install openai langfuse pyyaml&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run prompt evaluations&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python ci/eval_prompts.py --changed-only&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Comment PR with results&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;const fs = require('fs');&lt;/span&gt;
            &lt;span class="s"&gt;const results = JSON.parse(fs.readFileSync('eval_results.json'));&lt;/span&gt;
            &lt;span class="s"&gt;let body = '## Prompt Eval Results\n\n';&lt;/span&gt;
            &lt;span class="s"&gt;for (const [name, result] of Object.entries(results)) {&lt;/span&gt;
              &lt;span class="s"&gt;const status = result.passed ? '✅' : '❌';&lt;/span&gt;
              &lt;span class="s"&gt;body += `| ${name} | ${status} | ${result.accuracy.toFixed(2)} | ${result.threshold} |\n`;&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;
            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body&lt;/span&gt;
            &lt;span class="s"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every PR touching prompts automatically runs the eval pipeline and posts results as a comment.&lt;/p&gt;
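&lt;p&gt;The &lt;code&gt;--changed-only&lt;/code&gt; flag points at a project-specific script; one way such a script might map a PR diff to prompt names (assuming prompts live as YAML files under &lt;code&gt;prompts/&lt;/code&gt; and the changed paths come from &lt;code&gt;git diff --name-only&lt;/code&gt; against the base branch):&lt;/p&gt;

```python
from pathlib import PurePosixPath


def changed_prompt_names(changed_files: list[str], prompts_dir: str = "prompts") -> list[str]:
    """Map changed file paths from a PR diff to prompt names to re-evaluate."""
    names = set()
    for path in changed_files:
        p = PurePosixPath(path)
        # Only YAML files under prompts/ count as prompt definitions
        if p.parts and p.parts[0] == prompts_dir and p.suffix in {".yaml", ".yml"}:
            names.add(p.stem)
    return sorted(names)
```

&lt;p&gt;Evaluating only the changed prompts keeps CI runs cheap: an eval suite across all 50 prompts on every PR would burn both time and tokens.&lt;/p&gt;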

&lt;h2&gt;
  
  
  Deploy: Shipping Prompts Without Deploying Code
&lt;/h2&gt;

&lt;p&gt;Three strategies for delivering a new prompt version to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instant Switch
&lt;/h3&gt;

&lt;p&gt;The simplest option. Flip the &lt;code&gt;production&lt;/code&gt; label to a new prompt version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In Langfuse UI: assign label "production" to prompt v14
# The app picks it up automatically on the next request
&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket-classifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_ttl_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 5-minute cache
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good for non-critical prompts and quick fixes. Risk: 100% of traffic immediately hits the new version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canary Deploy
&lt;/h3&gt;

&lt;p&gt;Gradual traffic shift: 5% → 25% → 50% → 100%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_prompt_with_canary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canary_percentage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return a prompt and its version (production or canary).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;canary_percentage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Canary and production metrics are compared in real time. If the canary degrades, traffic is rolled back automatically.&lt;/p&gt;
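&lt;p&gt;The rollback decision itself can be a small guard. A sketch under stated assumptions: the accuracy values would come from your tracing backend, and the minimum sample size and allowed drop are illustrative thresholds, not recommendations:&lt;/p&gt;

```python
def should_rollback(
    canary_accuracy: float,
    production_accuracy: float,
    canary_samples: int,
    min_samples: int = 200,
    max_relative_drop: float = 0.05,
) -> bool:
    """Decide whether to roll the canary back.

    Waits for a minimum sample size so a handful of bad requests
    doesn't trigger a rollback, then compares relative accuracy.
    """
    if canary_samples < min_samples:
        return False  # not enough data yet; keep collecting
    if production_accuracy <= 0:
        return False
    relative_drop = (production_accuracy - canary_accuracy) / production_accuracy
    return relative_drop > max_relative_drop
```

&lt;p&gt;The sample-size guard matters: with 5% canary traffic, early metrics are noisy, and rolling back on the first few failures would make every deploy flap.&lt;/p&gt;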

&lt;h3&gt;
  
  
  Feature Flags
&lt;/h3&gt;

&lt;p&gt;For teams with an existing feature flag system (LaunchDarkly, Unleash, or homegrown):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_prompt_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Determine the prompt version via feature flag.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feature_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_enabled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_variant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "v14", "v15"
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also target specific users, segments, or regions.&lt;/p&gt;
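&lt;p&gt;Whichever provider backs the flags, variant assignment should be deterministic per user, so repeated requests from the same user hit the same prompt version. A minimal hash-based sketch (the rollout table and flag naming are illustrative assumptions):&lt;/p&gt;

```python
import hashlib


def assign_variant(user_id: str, flag_name: str, rollout: dict[str, int]) -> str:
    """Deterministically bucket a user into a prompt-version variant.

    `rollout` maps variant names to percentages, e.g. {"v15": 10, "production": 90}.
    Hashing user_id together with flag_name keeps buckets stable across
    requests but independent across different flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0..99
    cumulative = 0
    for variant, pct in rollout.items():
        cumulative += pct
        if bucket < cumulative:
            return variant
    return "production"  # fallback if percentages don't sum to 100
```

&lt;p&gt;Compared to the &lt;code&gt;random.randint&lt;/code&gt; canary above, hash-based bucketing also makes per-user metrics comparable: a user's whole session runs on one version.&lt;/p&gt;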

&lt;h2&gt;
  
  
  Monitor: Tying Metrics to Prompt Versions
&lt;/h2&gt;

&lt;p&gt;Monitoring without version context is useless. When quality drops, you can't tell what broke: the prompt, the model, or the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tracing with Prompt Version
&lt;/h3&gt;

&lt;p&gt;Every LLM call should include the prompt version in metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket-classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket-classifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# 14
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Langfuse automatically links the version
&lt;/span&gt;    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Version Dashboard
&lt;/h3&gt;

&lt;p&gt;Key metrics to monitor:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it shows&lt;/th&gt;
&lt;th&gt;Alert when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;Fraction of correct responses&lt;/td&gt;
&lt;td&gt;&amp;lt; threshold for prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency p95&lt;/td&gt;
&lt;td&gt;Response time&lt;/td&gt;
&lt;td&gt;&amp;gt; 2x baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token usage&lt;/td&gt;
&lt;td&gt;Token consumption&lt;/td&gt;
&lt;td&gt;&amp;gt; 1.5x vs previous version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;Fraction of invalid responses&lt;/td&gt;
&lt;td&gt;&amp;gt; 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per request&lt;/td&gt;
&lt;td&gt;Cost per call&lt;/td&gt;
&lt;td&gt;&amp;gt; budget&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
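&lt;p&gt;The alert conditions above translate into one small check. A sketch with the table's thresholds hard-coded for illustration; the metric names and the baseline source are assumptions about your setup:&lt;/p&gt;

```python
def check_version_metrics(
    metrics: dict,
    baseline: dict,
    budget: float,
    accuracy_threshold: float,
) -> list[str]:
    """Return the alert conditions a prompt version currently violates.

    Implements the table above: accuracy vs. its per-prompt threshold,
    p95 latency vs. 2x baseline, tokens vs. 1.5x the previous version,
    error rate vs. 5%, and cost per request vs. budget.
    """
    alerts = []
    if metrics["accuracy"] < accuracy_threshold:
        alerts.append("accuracy below threshold")
    if metrics["latency_p95"] > 2 * baseline["latency_p95"]:
        alerts.append("latency p95 above 2x baseline")
    if metrics["tokens"] > 1.5 * baseline["tokens"]:
        alerts.append("token usage above 1.5x previous version")
    if metrics["error_rate"] > 0.05:
        alerts.append("error rate above 5%")
    if metrics["cost_per_request"] > budget:
        alerts.append("cost per request over budget")
    return alerts
```

&lt;p&gt;Returning the list of violations rather than a single boolean makes the alert message actionable: the on-call engineer sees which dimension regressed.&lt;/p&gt;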



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: automatic comparison of two prompt versions
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_prompt_versions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version_a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version_b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare metrics for two prompt versions from Langfuse.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;traces_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch_traces&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-eval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;version_a&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;traces_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch_traces&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-eval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;version_b&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;scores_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;traces_a&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;scores_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;traces_b&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;version_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_a&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;version_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_b&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores_a&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Regression Alerts
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check metrics every 15 minutes (cron job or Langfuse webhook)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_prompt_regression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;current_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
    &lt;span class="n"&gt;recent_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_recent_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_baseline_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;recent_scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;gt; 10% degradation
&lt;/span&gt;        &lt;span class="nf"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Regression detected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; v&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;recent_scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(baseline: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Automatic rollback to previous version
&lt;/span&gt;        &lt;span class="nf"&gt;rollback_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_version&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prompt Organization Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Composition Over Monoliths
&lt;/h3&gt;

&lt;p&gt;A 3,000-token monolithic prompt is hard to test and maintain. Break it into components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prompts/components/output-format.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output-format-json&lt;/span&gt;
&lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Respond STRICTLY in JSON. No text before or after the JSON.&lt;/span&gt;
  &lt;span class="s"&gt;If you cannot determine the answer, return {"error": "unable to classify"}.&lt;/span&gt;

&lt;span class="c1"&gt;# prompts/components/language-rules.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;language-rules&lt;/span&gt;
&lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Response language: {{language}}.&lt;/span&gt;
  &lt;span class="s"&gt;Do not translate proper nouns or technical terms.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compose_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;component_names&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Assemble a prompt from components.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;component_names&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;component&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;components/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{{{&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;}}}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compose_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket-classifier-core&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output-format-json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language-rules&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing,technical,general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Naming Convention
&lt;/h3&gt;

&lt;p&gt;At 50+ prompts, consistent naming matters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{domain}-{task}-{variant}

ticket-classifier-v2
ticket-classifier-multilingual
order-summarizer-short
order-summarizer-detailed
response-generator-formal
response-generator-casual
quality-judge-relevance
quality-judge-toxicity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
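&lt;p&gt;A convention only survives if something enforces it. A minimal CI check might look like this (the regex and the function name are illustrative, not from any particular tool):&lt;/p&gt;

```python
import re

# {domain}-{task}-{variant}: lowercase words joined by hyphens.
# The variant segment is optional, so two segments is the minimum.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+$")

def validate_prompt_names(names):
    """Return the names that violate the naming convention."""
    return [n for n in names if not NAME_PATTERN.match(n)]
```

&lt;p&gt;Run it against the registry listing in CI and fail the build on any violation.&lt;/p&gt;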



&lt;h3&gt;
  
  
  Prompt Metadata
&lt;/h3&gt;

&lt;p&gt;Each prompt should carry metadata for auditing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ticket-classifier&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-team&lt;/span&gt;
  &lt;span class="na"&gt;created&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-01-15&lt;/span&gt;
  &lt;span class="na"&gt;last_tested&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-03-20&lt;/span&gt;
  &lt;span class="na"&gt;model_compatibility&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;
  &lt;span class="na"&gt;avg_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;450&lt;/span&gt;
  &lt;span class="na"&gt;cost_per_call_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.002&lt;/span&gt;
  &lt;span class="na"&gt;test_accuracy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.92&lt;/span&gt;
  &lt;span class="na"&gt;dataset_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;150&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
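&lt;p&gt;Metadata pays off once something reads it. A sketch of a staleness audit over the fields above (the 90-day threshold is an arbitrary choice):&lt;/p&gt;

```python
from datetime import date, timedelta

def find_stale_prompts(prompts, max_age_days=90, today=None):
    """Return names of prompts whose last_tested date exceeds max_age_days."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [
        p["name"]
        for p in prompts
        if date.fromisoformat(p["metadata"]["last_tested"]) < cutoff
    ]
```

&lt;p&gt;Scheduled weekly, this turns "last_tested" from documentation into an alert.&lt;/p&gt;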



&lt;h2&gt;
  
  
  Scaling: From 5 to 500 Prompts
&lt;/h2&gt;

&lt;p&gt;How the system evolves as the number of prompts grows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Registry&lt;/th&gt;
&lt;th&gt;Testing&lt;/th&gt;
&lt;th&gt;Deploy&lt;/th&gt;
&lt;th&gt;Monitor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5–10 prompts&lt;/td&gt;
&lt;td&gt;YAML in Git&lt;/td&gt;
&lt;td&gt;Manual eval&lt;/td&gt;
&lt;td&gt;Instant switch&lt;/td&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10–50 prompts&lt;/td&gt;
&lt;td&gt;Langfuse + Git sync&lt;/td&gt;
&lt;td&gt;CI eval pipeline&lt;/td&gt;
&lt;td&gt;Canary&lt;/td&gt;
&lt;td&gt;Version dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50–200 prompts&lt;/td&gt;
&lt;td&gt;Langfuse + RBAC&lt;/td&gt;
&lt;td&gt;CI + LLM-as-Judge&lt;/td&gt;
&lt;td&gt;Feature flags&lt;/td&gt;
&lt;td&gt;Alerts + auto-rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200+ prompts&lt;/td&gt;
&lt;td&gt;Custom registry&lt;/td&gt;
&lt;td&gt;Eval platform&lt;/td&gt;
&lt;td&gt;Progressive rollout&lt;/td&gt;
&lt;td&gt;ML monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key thresholds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10 prompts&lt;/strong&gt; — you need a registry. Prompts in code become unmanageable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;30 prompts&lt;/strong&gt; — you need CI eval. Manual testing doesn't scale; regressions slip through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;50 prompts&lt;/strong&gt; — you need RBAC. Different teams own different prompts; access control becomes non-optional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;100 prompts&lt;/strong&gt; — you need auto-rollback. Humans can't respond to regressions fast enough in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Management Tools
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source&lt;/td&gt;
&lt;td&gt;Prompt management + tracing + evals in one. Self-hostable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PromptLayer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;Specialized in prompt management. Good UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Humanloop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;Prompt management + eval + annotation. Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pezzo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open-source&lt;/td&gt;
&lt;td&gt;Prompt management. Lightweight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Git + YAML + CI scripts. Maximum control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Langfuse covers most scenarios: registry with versioning, prompt-to-trace linking, dataset-based evals, MCP server for IDE management. Detailed walkthrough in the &lt;a href="https://dev.to/blog/llm-observability-langfuse/"&gt;Langfuse guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompts in .env or config files.&lt;/strong&gt; No versioning, no testing, no connection to metrics. Fine for prototypes, falls apart in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing on three examples.&lt;/strong&gt; The prompt passes three tests and ships to production. A week later you discover it breaks on long inputs or edge case categories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No baseline.&lt;/strong&gt; The new prompt version "works well." Without a baseline, there's nothing to compare against. The previous version may have been better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizing tokens at the expense of quality.&lt;/strong&gt; Prompt reduced from 800 to 300 tokens. Cost drops 60%. Accuracy drops from 0.94 to 0.81. The $50/month saving costs you dozens of wrong responses every day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Engineering for Prompts
&lt;/h2&gt;

&lt;p&gt;A prompt doesn't exist in isolation. Quality depends on what's fed alongside it: &lt;a href="https://dev.to/blog/context-engineering-guide/"&gt;context engineering&lt;/a&gt; determines which data enters the context window and in what order.&lt;/p&gt;

&lt;p&gt;Three rules for production prompts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Variables instead of hardcoded values.&lt;/strong&gt; Anything that might change (categories, languages, formats) goes into variables. The prompt stays stable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Few-shot examples at the end.&lt;/strong&gt; Models "see" the end of the context more clearly. Placing examples after instructions improves accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Minimal context.&lt;/strong&gt; Every extra token in the prompt dilutes the model's attention. If an instruction doesn't affect quality — remove it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
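&lt;p&gt;The three rules combine naturally in a small builder (a sketch; the double-brace template syntax mirrors the composition examples earlier):&lt;/p&gt;

```python
def build_prompt(instructions: str, examples: list[str], **variables) -> str:
    """Rule 1: variables instead of hardcoded values.
    Rule 2: few-shot examples at the end.
    Rule 3: nothing that doesn't affect quality."""
    filled = instructions
    for key, value in variables.items():
        filled = filled.replace("{{" + key + "}}", str(value))
    # Examples go last, where the model attends most reliably
    return filled + "\n\n" + "\n".join(examples)

prompt = build_prompt(
    "Classify the ticket into one of: {{categories}}.",
    ["Ticket: 'Card charged twice' -> billing"],
    categories="billing,technical,general",
)
```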

&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Week 1.&lt;/strong&gt; Inventory. Collect all prompts from your codebase into one place — YAML files in Git or Langfuse. Standardize the format: name, version, model, messages, variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2.&lt;/strong&gt; Datasets. For each prompt, collect 20–30 test examples from production logs. Label the expected output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3.&lt;/strong&gt; Eval pipeline. A script that runs the prompt against the dataset and outputs accuracy. Triggered in CI when prompts change.&lt;/p&gt;
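&lt;p&gt;The Week 3 script can start as a dozen lines. A sketch, where run_prompt stands in for your actual LLM call and the 0.9 threshold is a choice, not a rule:&lt;/p&gt;

```python
def evaluate(dataset, run_prompt, threshold=0.9):
    """Run the prompt over a labeled dataset; True means the CI gate passes."""
    correct = sum(
        1 for example in dataset
        if run_prompt(example["input"]) == example["expected"]
    )
    accuracy = correct / len(dataset)
    print(f"accuracy: {accuracy:.2f} on {len(dataset)} examples")
    return accuracy >= threshold
```

&lt;p&gt;In CI, exit nonzero when it returns False, and the prompt change is blocked until the dataset passes.&lt;/p&gt;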

&lt;p&gt;&lt;strong&gt;Week 4.&lt;/strong&gt; Monitoring. Prompt version in every trace's metadata. Dashboard with metrics per version. Alert on &amp;gt; 10% degradation.&lt;/p&gt;

&lt;p&gt;After a month — a working system where every prompt change is tested, versioned, and monitored. No chaos, no regressions, no "who changed this prompt?"&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>ai</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Context Engineering: How to Manage Context for AI Models and Agents</title>
      <dc:creator>Roman Belov</dc:creator>
      <pubDate>Thu, 09 Apr 2026 04:33:23 +0000</pubDate>
      <link>https://forem.com/spyrae/context-engineering-how-to-manage-context-for-ai-models-and-agents-1hej</link>
      <guid>https://forem.com/spyrae/context-engineering-how-to-manage-context-for-ai-models-and-agents-1hej</guid>
      <description>&lt;p&gt;Claude's context window holds 200,000 tokens. Gemini's stretches to two million. But response quality starts degrading long before the window fills up. Window size doesn't solve the context problem — it masks it.&lt;/p&gt;

&lt;p&gt;Prompt engineering teaches you &lt;em&gt;how to ask&lt;/em&gt;. Context engineering teaches you &lt;em&gt;what to feed&lt;/em&gt; the model before asking. And the second one shapes the answer more than the first.&lt;/p&gt;

&lt;p&gt;Andrej Karpathy &lt;a href="https://x.com/karpathy/status/1937902205765607626" rel="noopener noreferrer"&gt;put it this way&lt;/a&gt;: "Context engineering — the delicate art and science of filling the context window with just the right information for the next step." Tobi Lütke, CEO of Shopify, popularized the term itself, and Gartner declared in July 2025: "Context engineering is in, and prompt engineering is out."&lt;/p&gt;

&lt;p&gt;This piece covers concrete techniques, models, and patterns. Things that actually work when you're using AI agents in development every day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt vs Context: Where the Line Falls
&lt;/h2&gt;

&lt;p&gt;Here's an analogy that works: you're hiring an expert consultant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt; — your question: "What should I do?"&lt;br&gt;
&lt;strong&gt;Context&lt;/strong&gt; — the briefing you hand them before the question.&lt;/p&gt;

&lt;p&gt;You can phrase the question perfectly, but if the briefing contains 500 pages of irrelevant documents, even a strong expert will get lost. Flip it around: hand them exactly the 2 pages they need, and even a simple question yields a precise answer.&lt;/p&gt;

&lt;p&gt;Prompt engineering answers questions like: how to frame the task, what role to assign the model, what output format to request. Context engineering answers different ones: feed 100 reviews or pick 15 representative ones? The entire 500-line file or just lines 45–80? All the documentation or extract the facts?&lt;/p&gt;

&lt;p&gt;A more technical analogy drives it home. The LLM is a CPU. The context window is RAM. You're the operating system deciding what gets loaded into working memory. The goal: load exactly the data needed for the current operation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why More Context Is Worse
&lt;/h2&gt;

&lt;p&gt;This is counterintuitive, but backed by research.&lt;/p&gt;
&lt;h3&gt;
  
  
  Context Rot
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;Chroma's research (2025)&lt;/a&gt; showed that LLM accuracy drops as the token count in context grows — even when the window is far from full.&lt;/p&gt;

&lt;p&gt;The mechanism: attention is a fixed resource. Weights always sum to 1. More tokens means less attention per fragment. Think of a flashlight — the wider the beam, the dimmer the light at any point. And the harder the task, the steeper the drop.&lt;/p&gt;
&lt;h3&gt;
  
  
  Lost in the Middle
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;study&lt;/a&gt; found a specific pattern: LLM performance drops 30%+ when critical information sits in the middle of a long context. Beginning and end? Fine. The middle is a blind spot.&lt;/p&gt;

&lt;p&gt;Practical takeaway: put the important stuff at the beginning or end. System prompt up top. Few-shot examples at the bottom.&lt;/p&gt;
&lt;h3&gt;
  
  
  Economics
&lt;/h3&gt;

&lt;p&gt;Every token costs money, and the model rereads the entire context on every request (LLMs are stateless):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: ~$3 per 1M tokens (Claude Sonnet)&lt;/li&gt;
&lt;li&gt;100K context × 100 requests/day = ~$30/day = &lt;strong&gt;$900/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Context engineering is budget engineering too.&lt;/p&gt;
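&lt;p&gt;The arithmetic above, as a function you can point at your own numbers (the $3 per million tokens is the Claude Sonnet input price cited here; substitute your model's rate):&lt;/p&gt;

```python
def monthly_input_cost(context_tokens, requests_per_day,
                       usd_per_million_tokens=3.0, days=30):
    """Input-token spend: the model rereads the full context on every request."""
    daily = context_tokens * requests_per_day * usd_per_million_tokens / 1_000_000
    return daily * days

monthly_input_cost(100_000, 100)  # 900.0 — the $900/month from the example above
```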
&lt;h3&gt;
  
  
  Hallucinations from Overload
&lt;/h3&gt;

&lt;p&gt;With a bloated context, the model tries to "use everything" and starts inventing connections between unrelated parts. Data about Company A gets attributed to Company B. Functions that don't exist get "recalled" from similar code that drifted into the context twenty screens back.&lt;/p&gt;
&lt;h2&gt;
  
  
  Six Layers of Context
&lt;/h2&gt;

&lt;p&gt;Structure context like an onion — six layers, each with a specific job. This fights degradation by placing the most important information at the beginning and end, instead of spreading it across the middle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  1. SYSTEM — who you are &amp;amp; how to act   │  ← Permanent (beginning)
├─────────────────────────────────────────┤
│  2. PROJECT — project context           │  ← Semi-permanent
├─────────────────────────────────────────┤
│  3. TASK — the specific task            │  ← Per task
├─────────────────────────────────────────┤
│  4. DIFF / CODE — relevant fragments    │  ← Per task
├─────────────────────────────────────────┤
│  5. ACCEPTANCE CRITERIA — exit criteria │  ← Per task
├─────────────────────────────────────────┤
│  6. EXAMPLES (Few-shot) — samples       │  ← Optional (end)
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  System: Role and Behavior
&lt;/h3&gt;

&lt;p&gt;Who the model is and how it should behave. Always at the very beginning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an experienced backend developer working with Python and FastAPI.
Keep answers concise. Use type hints. Don't add dependencies without asking.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the role, response style, and constraints go. Task details do &lt;strong&gt;not&lt;/strong&gt; belong here — that's layer 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project: Project Context
&lt;/h3&gt;

&lt;p&gt;Tech stack, structure, architecture decisions, code conventions. This layer gets reused across tasks. In Claude Code, it lives in the &lt;code&gt;CLAUDE.md&lt;/code&gt; file — the agent reads it automatically on every launch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task: What to Do
&lt;/h3&gt;

&lt;p&gt;A clear description of what to do, why, and — this gets forgotten constantly — what &lt;strong&gt;not&lt;/strong&gt; to do.&lt;/p&gt;

&lt;p&gt;A good example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Add rate limiting to /users.
Context: Endpoint is unprotected, bots are overloading it.
Requirements: 100 req/min per IP, Redis for counters, 429 on exceeded.
Out of scope: Changing endpoint logic, adding authorization.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A bad example: "Add rate limiting."&lt;/p&gt;

&lt;h3&gt;
  
  
  Diff/Code: Only What's Relevant
&lt;/h3&gt;

&lt;p&gt;Provide only the code fragments that relate to the task. Not the entire file. Specify path and lines: &lt;code&gt;app/api/users.py, lines 45–60&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Acceptance Criteria: When to Stop
&lt;/h3&gt;

&lt;p&gt;Clear, verifiable conditions. The model only knows when to stop if you tell it. Skip these and you'll get either a half-finished answer or something wildly overengineered.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;-&lt;/span&gt; [ ] Return 429 status when limit is exceeded
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Include Retry-After header in the response
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Unit tests cover edge cases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Examples (Few-shot): At the End
&lt;/h3&gt;

&lt;p&gt;For nonstandard output formats or a specific style. Place them at the end of the context, where the model attends most reliably.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Hurts to Put in Context
&lt;/h3&gt;

&lt;p&gt;A few anti-patterns that will reliably tank your results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The entire project codebase&lt;/strong&gt; — signal drowns in noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contradictory instructions&lt;/strong&gt; — "use Redux" + "use Context API" = the model gets confused&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outdated examples&lt;/strong&gt; — code with deprecated APIs gets reproduced verbatim&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vague phrasing&lt;/strong&gt; — "make it better" gives the model no direction&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Four Strategies for Managing Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://blog.langchain.com/context-engineering-for-agents/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; and Anthropic propose a framework: all context work boils down to four actions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Persist externally&lt;/td&gt;
&lt;td&gt;Scratchpads, MEMORY.md, progress files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Select&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extract only what's relevant&lt;/td&gt;
&lt;td&gt;RAG, grep, code search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compact&lt;/td&gt;
&lt;td&gt;Compaction, summarization, tool result cleanup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Isolate tasks&lt;/td&gt;
&lt;td&gt;Subagents with clean context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Everything described below is a specific case of one of these four.&lt;/p&gt;

&lt;h2&gt;
  
  
  Persistence: Bridging Sessions
&lt;/h2&gt;

&lt;p&gt;Every session with an AI agent starts from scratch. New context window, zero memory of previous work. &lt;a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; calls it the "shift engineer" problem: each new engineer coming on shift remembers nothing of the previous one's work. No notes left behind? Start over.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plain Files
&lt;/h3&gt;

&lt;p&gt;The most basic form of memory — markdown notes the agent writes for its future self. Claude Code uses &lt;code&gt;MEMORY.md&lt;/code&gt; for this: the agent automatically records project patterns, decisions, and architectural notes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Git as Memory
&lt;/h3&gt;

&lt;p&gt;Commits with meaningful messages form a changelog and restore points. The agent can experiment freely, knowing it can always roll back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured Notes
&lt;/h3&gt;

&lt;p&gt;Plain files evolve. Instead of a flat log, the agent maintains a structured knowledge base. The pattern: &lt;code&gt;write_to_notes(topic, content)&lt;/code&gt; + &lt;code&gt;read_from_notes(topic)&lt;/code&gt; — an external hard drive for memory.&lt;/p&gt;
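&lt;p&gt;A minimal file-backed version of that pattern (the JSON layout and file name are illustrative; the function names follow the pattern above):&lt;/p&gt;

```python
import json
from pathlib import Path

NOTES = Path("agent_notes.json")  # survives context resets

def write_to_notes(topic: str, content: str) -> None:
    """Append a note under a topic in the external knowledge base."""
    notes = json.loads(NOTES.read_text()) if NOTES.exists() else {}
    notes.setdefault(topic, []).append(content)
    NOTES.write_text(json.dumps(notes, indent=2))

def read_from_notes(topic: str) -> list[str]:
    """Read back everything recorded under a topic."""
    notes = json.loads(NOTES.read_text()) if NOTES.exists() else {}
    return notes.get(topic, [])
```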

&lt;p&gt;&lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;An example from Anthropic&lt;/a&gt;: an agent playing Pokemon recorded "trained Pikachu 1234 steps, 8 out of 10 levels." After a context reset, it read its own notes and picked up right where it left off.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scratchpad
&lt;/h3&gt;

&lt;p&gt;Working memory within the current session. The agent "thinks out loud" — storing intermediate results, hypotheses, a plan. Scratchpad is RAM; files are disk.&lt;/p&gt;

&lt;p&gt;Simple thought, but it changes everything: stop making the model remember. Give it a notebook.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Compaction
&lt;/h3&gt;

&lt;p&gt;When the context fills up, compress it. The model gets the full history and produces a summary. Old conversation gets tossed, compressed version goes at the start of the new context.&lt;/p&gt;

&lt;p&gt;Manual compaction at logical breakpoints (after finishing a feature) beats automatic. There's also a lighter variant: cleaning up tool results — strip the verbose command outputs from history, keep just the fact that they ran.&lt;/p&gt;
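&lt;p&gt;The tool-result cleanup variant is easy to sketch (the message shape is illustrative; real agent frameworks differ):&lt;/p&gt;

```python
def strip_tool_outputs(history, keep_last=3):
    """Replace old tool outputs with a stub; keep the most recent ones verbatim."""
    cutoff = len(history) - keep_last
    cleaned = []
    for i, msg in enumerate(history):
        if msg["role"] == "tool" and i < cutoff:
            cleaned.append({"role": "tool", "content": "[output elided; call succeeded]"})
        else:
            cleaned.append(msg)
    return cleaned
```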

&lt;h3&gt;
  
  
  Task Trackers
&lt;/h3&gt;

&lt;p&gt;For long-running projects, the "Initializer + Executor" pattern works well. The first agent doesn't write code — it creates a structured task list in JSON: description, status, dependencies. Each subsequent agent reads the list, picks a pending task, completes it, updates the status, and commits.&lt;/p&gt;
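&lt;p&gt;A sketch of the task list the Initializer might produce, with the Executor's selection logic (the schema is illustrative):&lt;/p&gt;

```python
tasks = [
    {"id": 1, "description": "Set up project skeleton", "status": "done", "depends_on": []},
    {"id": 2, "description": "Add rate limiting", "status": "pending", "depends_on": [1]},
    {"id": 3, "description": "Write integration tests", "status": "pending", "depends_on": [2]},
]

def next_task(tasks):
    """Pick the first pending task whose dependencies are all done."""
    done = {t["id"] for t in tasks if t["status"] == "done"}
    for t in tasks:
        if t["status"] == "pending" and set(t["depends_on"]) <= done:
            return t
    return None
```

&lt;p&gt;Each Executor run calls this, does the work, flips the status to done, and commits the updated list.&lt;/p&gt;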

&lt;h2&gt;
  
  
  Subagents: Isolation as Strategy
&lt;/h2&gt;

&lt;p&gt;The main agent can delegate a subtask to a subagent — a separate process with its own clean context window. Like a manager asking a database specialist to optimize a query: hand them the schema and the slow query, not the entire month's email thread.&lt;/p&gt;

&lt;p&gt;Three wins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context purity.&lt;/strong&gt; The subagent isn't polluted by the main agent's history. The main agent might have 85% of its window occupied — the subagent starts at 5%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialization.&lt;/strong&gt; You can use different models or system prompts for different subagents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallelization.&lt;/strong&gt; Multiple subagents can work simultaneously.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In Claude Code, subagents are launched via the Task tool. The main agent describes the task, the subagent receives it in a clean context, does the work, and returns a structured result. The main agent's context cost is minimal.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP and the Tool Tax
&lt;/h2&gt;

&lt;p&gt;MCP (&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;) is an open standard defining how AI agents discover and call tools. Each MCP server adds its tool descriptions to the context. Every description costs tokens.&lt;/p&gt;

&lt;p&gt;You feel it the moment you start working for real: connect 5–10 MCP servers (GitHub, Slack, database, analytics, monitoring) and tens of thousands of tokens in tool descriptions land &lt;em&gt;in every request&lt;/em&gt;, even when none of them get called.&lt;/p&gt;

&lt;p&gt;The fix is lazy loading. Claude Code uses Tool Search: tool descriptions load on demand, only when the agent decides it might need one. Saves around 85% of tokens. Other agents have similar tricks: lazy-mcp, MetaMCP.&lt;/p&gt;
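
&lt;p&gt;The idea behind lazy loading fits in a sketch: keep only one-line summaries in context, and inject a full schema after the agent picks a tool. The catalog below is invented, and keyword matching stands in for semantic search:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Lazy tool loading sketch: summaries up front, schemas on demand.
# All tool names and schemas are invented for illustration.
CATALOG = {
    "github_create_issue": "Open an issue in a GitHub repo",
    "slack_post_message": "Post a message to a Slack channel",
}
FULL_SCHEMAS = {
    "github_create_issue": {"params": {"repo": "string", "title": "string"}},
    "slack_post_message": {"params": {"channel": "string", "text": "string"}},
}

def search_tools(query):
    """Cheap stand-in for semantic search: keyword match on summaries."""
    q = query.lower()
    return [name for name, desc in CATALOG.items()
            if q in desc.lower() or q in name]

def load_tool(name):
    """Inject the full schema only after the agent picks a tool."""
    return FULL_SCHEMAS[name]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The context cost of the catalog is one line per tool; the full schemas never enter the window until something actually matches.&lt;/p&gt;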

&lt;p&gt;Tool design principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-sufficiency:&lt;/strong&gt; the description contains everything needed for use. The model doesn't read your README.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unambiguity:&lt;/strong&gt; &lt;code&gt;user_email&lt;/code&gt; instead of &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;validate_payment&lt;/code&gt; instead of &lt;code&gt;process&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimalism:&lt;/strong&gt; one tool = one atomic operation. If the description exceeds 200 words, the tool does too much.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Memory Hierarchy
&lt;/h2&gt;

&lt;p&gt;Context in production tools isn't a single file — it's a multi-level system. &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;Claude Code's docs&lt;/a&gt; lay out the hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt&lt;/strong&gt; — base instructions (always loaded)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings&lt;/strong&gt; — user preferences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLAUDE.md&lt;/strong&gt; — project instructions (loaded from the repository root)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rules&lt;/strong&gt; — modular instructions, can be path-specific (loaded only when working with certain files)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills&lt;/strong&gt; — entire folders of instructions and scripts the agent loads at its own discretion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Memory&lt;/strong&gt; — memory the agent forms for itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html" rel="noopener noreferrer"&gt;Martin Fowler&lt;/a&gt; proposes a useful distinction: &lt;strong&gt;Instructions&lt;/strong&gt; (orders — "write a test for this function") vs &lt;strong&gt;Guidance&lt;/strong&gt; (general rules — "all tests must be independent of each other"). CLAUDE.md and rules are mostly Guidance. Chat prompts are Instructions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Working with Large Documents
&lt;/h2&gt;

&lt;p&gt;You can't just dump a 50-page PDF into the model. You need a strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunking
&lt;/h3&gt;

&lt;p&gt;Break it into pieces of 1,500–3,000 tokens with 10–20% overlap. Semantic chunking (by chapters and sections) works noticeably better than chopping at fixed lengths.&lt;/p&gt;
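
&lt;p&gt;A sketch of the overlap mechanics, approximating tokens as whitespace-separated words (a real pipeline would count with the model's tokenizer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fixed-size chunking with overlap. 300/2000 is 15% overlap, inside the
# 10-20% range above. Words stand in for tokens to keep this runnable.
def chunk(words, size=2000, overlap=300):
    """Split a word list into chunks of `size` sharing `overlap` words."""
    step = size - overlap
    return [words[i:i + size]
            for i in range(0, max(len(words) - overlap, 1), step)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;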

&lt;p&gt;&lt;a href="https://www.anthropic.com/news/contextual-retrieval" rel="noopener noreferrer"&gt;Contextual Retrieval from Anthropic&lt;/a&gt; tackles the ripped-from-context problem: before indexing, each fragment gets a description of where it came from and what the section covers. Result: at least 35% fewer retrieval failures, up to 67% with reranking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fact Extraction
&lt;/h3&gt;

&lt;p&gt;Skip the full text. Pull a structured list of facts and figures from each chunk instead. Smaller footprint, better accuracy for analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Map-Reduce
&lt;/h3&gt;

&lt;p&gt;For very large documents: split into chunks, summarize each (MAP), assemble the mini-summaries into a final one (REDUCE). The MAP phase can be parallelized — speedup scales with the number of workers.&lt;/p&gt;
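
&lt;p&gt;The control flow in a sketch. Here &lt;code&gt;summarize&lt;/code&gt; just truncates so the example runs; in practice it would be a model call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Map-Reduce summarization sketch. `summarize` is a placeholder for an
# LLM call; truncation keeps the control flow runnable.
from concurrent.futures import ThreadPoolExecutor

def summarize(text, limit=120):
    return text[:limit]  # placeholder for a real model call

def map_reduce_summary(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunks))  # MAP, parallel
    return summarize("\n".join(partials), limit=500)  # REDUCE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;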

&lt;h3&gt;
  
  
  RAG vs Long Context
&lt;/h3&gt;

&lt;p&gt;With windows getting bigger (Gemini 2M), the question keeps coming up: do we still need RAG? &lt;a href="https://arxiv.org/abs/2501.01880" rel="noopener noreferrer"&gt;Research (arXiv:2501.01880)&lt;/a&gt; says it depends on the task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG wins:&lt;/strong&gt; the corpus is huge (&amp;gt; 1M tokens), freshness matters, budget is limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long context wins:&lt;/strong&gt; you need synthesis across sections, structural understanding, document &amp;lt; 200K.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid (the way to go):&lt;/strong&gt; RAG for selection, long context for analysis. The cost gap is real: full 2M context on every request runs an order of magnitude more than RAG selection + 50K of relevant context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Doesn't Work
&lt;/h2&gt;

&lt;p&gt;Wouldn't be honest to stop at the upsides.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context engineering won't fix a bad model
&lt;/h3&gt;

&lt;p&gt;If the model can't write Rust, no amount of context will help. Context engineering works within what the model can already do. If the task is too hard for the current generation, break it into subtasks or try a different angle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preparation overhead
&lt;/h3&gt;

&lt;p&gt;Assembling a perfect six-layer context package for every request takes time. For quick questions ("how does this function work?") it's overkill. Context engineering pays off on repeatable tasks and with agents that chain dozens of operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compaction loses information
&lt;/h3&gt;

&lt;p&gt;Compression is a tradeoff. The model picks what to keep and what to toss. Sometimes it tosses what matters. Manual compaction at logical breakpoints is safer, but needs the operator paying attention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lost in the Middle works both ways
&lt;/h3&gt;

&lt;p&gt;You can get so focused on "important stuff at the beginning and end" that the middle turns into a junk drawer. Better to cut the context down than hope positioning saves you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subagents add latency
&lt;/h3&gt;

&lt;p&gt;Delegating to a subagent means a separate API call with its own context. On a complex task, one subagent fires dozens of requests. For anything real-time, that's too slow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lazy tool loading isn't free
&lt;/h3&gt;

&lt;p&gt;Tool Search saves context but adds a search step. If the agent hunts for a tool before every action, that's extra requests and wasted time. Balancing tools-in-context against search frequency takes tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common mistakes
&lt;/h3&gt;

&lt;p&gt;Three that come up more than anything else:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Copying an entire file instead of the relevant fragment.&lt;/strong&gt; The model gets 500 lines when it needed lines 45–60. The other 484 lines are pure noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not saying what NOT to do.&lt;/strong&gt; Without constraints, the model refactors the whole file when you asked it to fix one function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping acceptance criteria.&lt;/strong&gt; The model doesn't know when to stop. You get either undercooked or overcomplicated output.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Checklist
&lt;/h2&gt;

&lt;p&gt;Run through this before every serious request to a model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before the request:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is there a source of truth (docs, code, data) in the context?&lt;/li&gt;
&lt;li&gt;Is the task clearly described?&lt;/li&gt;
&lt;li&gt;Is the output format specified?&lt;/li&gt;
&lt;li&gt;Is what NOT to do specified?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In the prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Answer only based on the provided context"&lt;/li&gt;
&lt;li&gt;"If you don't know — say you don't know"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After the response:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are the facts verified?&lt;/li&gt;
&lt;li&gt;Do the referenced functions and libraries actually exist?&lt;/li&gt;
&lt;li&gt;Were characteristics of one object attributed to another?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Five takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Less = better.&lt;/strong&gt; Quality and relevance of context matter more than quantity. The goal is the smallest set of tokens with the strongest signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure it.&lt;/strong&gt; Six layers: System, Project, Task, Code, Criteria, Examples. Important stuff at the beginning and end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persist it.&lt;/strong&gt; Persistence = bridge between sessions. State files, structured notes, git.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolate it.&lt;/strong&gt; Subagents with clean context for specialized tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compress it.&lt;/strong&gt; Compaction and tool result cleanup when the context grows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start small: assemble a six-layer context package for one typical task and compare the result to what you get from pasting code into the chat. The difference tends to be obvious on the first try.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  At what token count does context rot become practically noticeable, and is there a threshold to monitor?
&lt;/h3&gt;

&lt;p&gt;Multiple benchmarks (including studies by Chroma and others) show measurable accuracy degradation starting around 20–30K tokens for complex reasoning tasks, with a steeper drop past 50K. For simpler extraction tasks the threshold is higher — around 80–100K. A practical monitoring rule: if your average context exceeds 40K tokens per request and you're seeing inconsistent output quality, context size is the first variable to investigate. The $900/month calculation in the article assumes 100K tokens — most production agents can cut that by 60–70% through selective RAG retrieval without measurable quality loss.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does lazy tool loading in Claude Code achieve 85% token savings, and what is the actual mechanism?
&lt;/h3&gt;

&lt;p&gt;Without lazy loading, every MCP server's full tool schema is injected into the system prompt on every request — 10 servers with 5 tools each at ~200 tokens per tool description equals 10,000 tokens of overhead per call, regardless of which tools actually get used. Tool Search defers schema injection: the agent first sends a semantic search query to find relevant tool names (~50 tokens), then loads only the matching tool descriptions (~400 tokens for 2 tools). The 85% savings comes from eliminating the full schema dump for 8–9 unused tools per typical request.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should you use manual context compaction versus automatic, and what information is typically lost?
&lt;/h3&gt;

&lt;p&gt;Manual compaction at logical breakpoints (end of a feature, after a passing test suite) is safer because you control what the summary captures. Automatic compaction triggers on window fill and summarizes whatever is current — which may include half-finished reasoning, temporary debugging state, or contradictory instructions from mid-session pivots. The most common loss is architectural decisions made conversationally: "let's not use Redux here because X" survives a manual summary but gets dropped by automatic compaction which treats it as transient chat rather than binding constraint.&lt;/p&gt;

</description>
      <category>contextengineering</category>
      <category>llm</category>
      <category>ai</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>I Built a Lie Detector for AI Coding Agents</title>
      <dc:creator>Roman Belov</dc:creator>
      <pubDate>Mon, 09 Mar 2026 13:59:51 +0000</pubDate>
      <link>https://forem.com/spyrae/i-built-a-lie-detector-for-ai-coding-agents-21l2</link>
      <guid>https://forem.com/spyrae/i-built-a-lie-detector-for-ai-coding-agents-21l2</guid>
      <description>&lt;h2&gt;
  
  
  The problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;AI coding agents lie. Not on purpose - they hallucinate.&lt;/p&gt;

&lt;p&gt;Claude Code tells you "All tests pass!" when tests were never executed. It says "I updated the file" when the content is byte-for-byte identical. It sneaks in &lt;code&gt;git commit --no-verify&lt;/code&gt; to skip the hooks that would catch its mistakes.&lt;/p&gt;

&lt;p&gt;This isn't rare. It's a &lt;a href="https://github.com/anthropics/claude-code/issues/1501" rel="noopener noreferrer"&gt;documented bug&lt;/a&gt; and it hits every serious Claude Code user. System prompts don't help - the agent just ignores them when it "decides" something is done.&lt;/p&gt;

&lt;p&gt;I spent a couple weeks building a fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't ask the agent to be honest - verify it
&lt;/h2&gt;

&lt;p&gt;That's the whole idea. Claude Code has a hooks API - before and after every tool call, it can run your scripts. Those scripts inspect what actually happened and &lt;strong&gt;block&lt;/strong&gt; the agent if the results don't match the claims.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent claims: "I updated utils.ts"
    |
[PostToolUse hook]
    |
Compare SHA256 before/after -&amp;gt; IDENTICAL
    |
BLOCKED: "File was not actually modified. Checksum unchanged."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Can't argue with a checksum. This isn't a prompt the agent can ignore. It's a gate.&lt;/p&gt;
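
&lt;p&gt;The core of that gate, stripped of the hook plumbing (the verdict shape is illustrative, not the exact hooks API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Phantom-edit gate sketch: record a checksum before the edit, compare
# after. The verdict dict is illustrative, not the exact hooks schema.
import hashlib

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def verify_edit(path, checksum_before):
    """Block if the file is byte-for-byte identical to before the 'edit'."""
    if sha256_of(path) == checksum_before:
        return {"decision": "block",
                "reason": "File was not actually modified. Checksum unchanged."}
    return {"decision": "allow"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;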

&lt;h2&gt;
  
  
  Six hooks, zero fluff
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hook&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dangerous command blocker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;--no-verify&lt;/code&gt;, &lt;code&gt;push --force&lt;/code&gt;; warns on &lt;code&gt;reset --hard&lt;/code&gt;, &lt;code&gt;clean -f&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pre-commit test runner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-detects your framework, runs tests before every commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File checksum recorder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Saves SHA256 before file edit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exit code verifier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Command failed (exit 1) but agent claims success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Phantom edit detector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File unchanged after a claimed "edit"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Commit verification reminder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes the agent prove the fix works before claiming "done"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Two days of real use
&lt;/h2&gt;

&lt;p&gt;I ran TruthGuard on a production Flutter project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5 commits blocked&lt;/strong&gt; - agent kept trying to commit with failing tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 dangerous commands caught&lt;/strong&gt; - 2x &lt;code&gt;git push --force&lt;/code&gt;, 1x &lt;code&gt;git commit --no-verify&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;0 false positives&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pre-commit test hook alone stopped me from shipping broken code five times in two days. Five times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-commit testing is the killer feature
&lt;/h2&gt;

&lt;p&gt;When Claude runs &lt;code&gt;git commit&lt;/code&gt;, TruthGuard intercepts it. Detects your project type, runs the right test command, and blocks the commit if tests fail. Simple as that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Auto-detection:&lt;/span&gt;
&lt;span class="c"&gt;# pubspec.yaml     -&amp;gt; flutter test&lt;/span&gt;
&lt;span class="c"&gt;# package.json     -&amp;gt; npm test&lt;/span&gt;
&lt;span class="c"&gt;# Cargo.toml       -&amp;gt; cargo test&lt;/span&gt;
&lt;span class="c"&gt;# go.mod           -&amp;gt; go test ./...&lt;/span&gt;
&lt;span class="c"&gt;# pyproject.toml   -&amp;gt; python -m pytest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Override if you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .truthguard.yml&lt;/span&gt;
&lt;span class="na"&gt;test_command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npm&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;test:unit"&lt;/span&gt;
&lt;span class="na"&gt;skip_on_no_tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The subtler problem: wrong fixes
&lt;/h2&gt;

&lt;p&gt;After building the basic hooks, I ran into something trickier. Claude makes real changes, tests pass, but the fix doesn't actually solve the original problem. It genuinely thinks it's done.&lt;/p&gt;

&lt;p&gt;So I added a post-commit reminder. After every successful commit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You just committed code. STOP and verify: did you actually confirm the fix works?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A nudge, basically. But it makes the agent pause instead of rushing to "Done."&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx truthguard &lt;span class="nb"&gt;install
cd &lt;/span&gt;your-project
npx truthguard init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copies scripts to &lt;code&gt;~/.truthguard/&lt;/code&gt;, adds hooks to &lt;code&gt;.claude/settings.json&lt;/code&gt;. Restart Claude Code and that's it.&lt;/p&gt;

&lt;p&gt;Homebrew works too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap spyrae/truthguard &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew &lt;span class="nb"&gt;install &lt;/span&gt;truthguard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Agent-agnostic
&lt;/h2&gt;

&lt;p&gt;Scripts read JSON from stdin, write JSON to stdout. Same scripts power both Claude Code and Gemini CLI. Supporting another agent means writing a config file, not rewriting hooks.&lt;/p&gt;
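
&lt;p&gt;The contract in sketch form (the event and verdict fields mirror the checksum example earlier and are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A hook is a filter: one JSON object in on stdin, one verdict out on
# stdout. Field names are illustrative, not a fixed schema.
import json
import sys

def run_hook(check, stdin=sys.stdin, stdout=sys.stdout):
    event = json.load(stdin)   # what the agent claims it did
    verdict = check(event)     # e.g. the phantom-edit check
    json.dump(verdict, stdout)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;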

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;This is the free local-only version. No backend, no telemetry, everything on your machine.&lt;/p&gt;

&lt;p&gt;Some ideas I'm thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A second LLM that checks whether the diff actually solves the described problem&lt;/li&gt;
&lt;li&gt;Team dashboard with honesty stats&lt;/li&gt;
&lt;li&gt;VS Code extension for Cursor and Copilot users&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/spyrae/truthguard" rel="noopener noreferrer"&gt;github.com/spyrae/truthguard&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/truthguard" rel="noopener noreferrer"&gt;npmjs.com/package/truthguard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your agent lies in ways I haven't covered - open an issue and I'll write a hook for it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
