<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: astronaut</title>
    <description>The latest articles on Forem by astronaut (@astronaut27).</description>
    <link>https://forem.com/astronaut27</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3594224%2F48ba6b28-a495-4630-8112-62f28ff8b5dc.png</url>
      <title>Forem: astronaut</title>
      <link>https://forem.com/astronaut27</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/astronaut27"/>
    <language>en</language>
    <item>
      <title>Prompt Management Is Infrastructure: Requirements, Tools, and Patterns</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Tue, 17 Mar 2026 17:00:56 +0000</pubDate>
      <link>https://forem.com/astronaut27/prompt-management-is-infrastructure-requirements-tools-and-patterns-32nn</link>
      <guid>https://forem.com/astronaut27/prompt-management-is-infrastructure-requirements-tools-and-patterns-32nn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Mission Log #6 — Prompt control center: from strings in code to a production-grade system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your LLM service keeps prompts in code or in a UI without strict version control, you're accumulating technical debt. Not the usual kind. This debt doesn't show up as stack traces. It shows up as silent quality drift: SLAs green, logs clean, and users increasingly getting irrelevant answers.&lt;/p&gt;

&lt;p&gt;In production, a prompt is the &lt;strong&gt;behavioral contract of your service&lt;/strong&gt;. It directly affects tool-calling accuracy, RAG faithfulness, latency distribution, inference cost, and downstream behavior.&lt;/p&gt;

&lt;p&gt;This article is not about prompt engineering (how to write a good prompt). It's about &lt;strong&gt;prompt management&lt;/strong&gt; — how to manage prompts as an engineer: version, deploy, roll back, observe, and avoid silent regressions.&lt;/p&gt;

&lt;p&gt;You'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What prompt management is and how it differs from prompt engineering.&lt;/li&gt;
&lt;li&gt;What production demands from prompt management (and what breaks when you ignore it).&lt;/li&gt;
&lt;li&gt;A maturity model: where your team is and what the next step is.&lt;/li&gt;
&lt;li&gt;Tools that address these requirements, and how each one maps onto them.&lt;/li&gt;
&lt;li&gt;Architectural patterns for embedding prompt management into your system.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Is Prompt Management (and What Are We Versioning?)
&lt;/h2&gt;

&lt;p&gt;Prompt management is the set of practices and tools for the full lifecycle of prompts: creation, versioning, testing, deployment, monitoring, and rollback.&lt;/p&gt;

&lt;p&gt;In production, a "prompt" is not a single text string. It's a &lt;strong&gt;composite artifact&lt;/strong&gt; of several components, each of which affects service behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Why we version it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;"You are a support agent..."&lt;/td&gt;
&lt;td&gt;Defines model behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot examples&lt;/td&gt;
&lt;td&gt;3 input→output pairs&lt;/td&gt;
&lt;td&gt;Affect format and quality of responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool schemas&lt;/td&gt;
&lt;td&gt;OpenAPI specs for function calling&lt;/td&gt;
&lt;td&gt;Define which tools the model can call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output schema&lt;/td&gt;
&lt;td&gt;JSON Schema for structured output&lt;/td&gt;
&lt;td&gt;Breaks downstream parsers when changed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference params&lt;/td&gt;
&lt;td&gt;model, temperature, max_tokens, top_p&lt;/td&gt;
&lt;td&gt;Affect latency, cost, response style&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt template&lt;/td&gt;
&lt;td&gt;Template with variables (&lt;code&gt;{{user_name}}&lt;/code&gt;, &lt;code&gt;{{context}}&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Logic for assembling the final prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing logic&lt;/td&gt;
&lt;td&gt;Which prompt for which tenant/use case&lt;/td&gt;
&lt;td&gt;Determines who sees which version&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Engineers often version only the system prompt text. But if someone changes a tool schema or bumps temperature from 0.3 to 0.9, system behavior changes just as much. In mature production systems, teams version the &lt;strong&gt;entire artifact&lt;/strong&gt;, not just the text.&lt;/p&gt;
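&lt;p&gt;A rough sketch of what versioning the entire artifact can look like: one frozen record whose version id hashes every component, so a temperature bump or a schema change produces a new version just like a text edit. Field names here are illustrative, not any specific tool's schema.&lt;/p&gt;

```python
import hashlib
import json
from dataclasses import dataclass, asdict

# Illustrative sketch: the full prompt artifact as one immutable record.
@dataclass(frozen=True)
class PromptArtifact:
    system_prompt: str
    few_shot_examples: tuple = ()
    tool_schemas: tuple = ()
    output_schema: str = ""          # JSON Schema as a string
    model: str = "gpt-4o"
    temperature: float = 0.3
    max_tokens: int = 1024
    template: str = "{system}\n\nContext: {context}"

    def version_id(self) -> str:
        # Hash every component, so an inference-param or schema change
        # yields a new version just like a text edit does.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = PromptArtifact(system_prompt="You are a support agent.")
v2 = PromptArtifact(system_prompt="You are a support agent.", temperature=0.9)
assert v1.version_id() != v2.version_id()  # params change = new version
```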




&lt;h2&gt;
  
  
  9 Requirements for Production-Grade Prompt Management
&lt;/h2&gt;

&lt;p&gt;These requirements come from working with production LLM systems. Each is described with a concrete failure mode — what actually breaks when the requirement isn't met.&lt;/p&gt;

&lt;p&gt;It helps to split them into three planes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Versioning&lt;/strong&gt;: version identity, diff, change history, reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery/Rollout&lt;/strong&gt;: labels, canary, version distribution, rollback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control/Governance&lt;/strong&gt;: eval gating, audit trail, trace linkage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1. Immutable versions
&lt;/h3&gt;

&lt;p&gt;Every prompt version is immutable and carries a unique &lt;code&gt;prompt_version_id&lt;/code&gt; (a content hash or an incremental id).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you can't tell which exact prompt version was live during an incident. "Someone changed the prompt last week, I think" is guesswork, not debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Labels / Aliases
&lt;/h3&gt;

&lt;p&gt;Named labels for routing prompt versions at runtime. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;By environment&lt;/strong&gt;: &lt;code&gt;production&lt;/code&gt;, &lt;code&gt;canary&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By model&lt;/strong&gt;: &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;claude-sonnet&lt;/code&gt;, &lt;code&gt;llama-3-70b&lt;/code&gt; — different prompts tuned for different LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By tenant/use case&lt;/strong&gt;: &lt;code&gt;tenant_acme&lt;/code&gt;, &lt;code&gt;support_flow&lt;/code&gt;, &lt;code&gt;sales_agent&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By experiment&lt;/strong&gt;: &lt;code&gt;experiment_v3_concise&lt;/code&gt;, &lt;code&gt;baseline&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The app requests a prompt by label, not by concrete version. That lets you change the version without changing code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: changing a prompt version means a full service deploy. Every text change goes through the full CI/CD pipeline.&lt;/p&gt;
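&lt;p&gt;A minimal in-process sketch of that label indirection (method names like &lt;code&gt;publish&lt;/code&gt; and &lt;code&gt;set_label&lt;/code&gt; are illustrative, not any vendor's API):&lt;/p&gt;

```python
# The app asks for a label; ops reassign the label to roll out or roll back.
class PromptRegistry:
    def __init__(self):
        self._versions = {}   # version_id -> prompt payload (immutable)
        self._labels = {}     # label -> version_id

    def publish(self, version_id, payload):
        if version_id in self._versions:
            raise ValueError("versions are immutable; publish a new id")
        self._versions[version_id] = payload

    def set_label(self, label, version_id):
        self._labels[label] = version_id   # instant rollout or rollback

    def get(self, label):
        return self._versions[self._labels[label]]

reg = PromptRegistry()
reg.publish("v2-abc123", "You are a support agent. Be concise.")
reg.publish("v3-def456", "You are a support agent. Be friendly.")
reg.set_label("production", "v2-abc123")
reg.set_label("canary", "v3-def456")
reg.set_label("production", "v3-def456")   # promote canary: no redeploy
```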

&lt;h3&gt;
  
  
  3. Evaluation gating
&lt;/h3&gt;

&lt;p&gt;A new prompt version goes through controlled validation before promotion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;domain-specific golden dataset,&lt;/li&gt;
&lt;li&gt;automated regression tests,&lt;/li&gt;
&lt;li&gt;offline comparison to baseline,&lt;/li&gt;
&lt;li&gt;(optional) LLM-based scoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Promotion is a deliberate decision, not a blind merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: every prompt change is a lottery. You can go a month without noticing that answer quality dropped 15%.&lt;/p&gt;
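&lt;p&gt;The gate itself can be sketched as a comparison against the baseline on a golden set. The exact-match scorer below is a stand-in for whatever domain metric you actually use; the threshold is an assumed policy, not a recommendation.&lt;/p&gt;

```python
# Assumed scoring function; real gates use an eval framework and a
# domain-specific golden dataset.
def exact_match_score(answers, golden):
    hits = sum(1 for a, g in zip(answers, golden) if a.strip() == g.strip())
    return hits / len(golden)

def gate(candidate_answers, baseline_answers, golden, min_delta=-0.02):
    cand = exact_match_score(candidate_answers, golden)
    base = exact_match_score(baseline_answers, golden)
    # Promote only if the candidate is not meaningfully worse than baseline.
    return (cand - base) >= min_delta

golden = ["42", "Paris", "escalate"]
baseline = ["42", "Paris", "refund"]      # 2/3 on the golden set
candidate = ["42", "Paris", "escalate"]   # 3/3
assert gate(candidate, baseline, golden) is True
assert gate(["no", "no", "no"], baseline, golden) is False
```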

&lt;h3&gt;
  
  
  4. Low-latency fetch
&lt;/h3&gt;

&lt;p&gt;Predictable time to fetch the prompt at runtime. In-memory cache on the hot path. The goal is to avoid putting a slow, uncached config dependency on the critical request path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: prompt management becomes a single point of failure. If the config service responds in 500ms instead of 5ms, your TTFT is already broken.&lt;/p&gt;
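&lt;p&gt;A sketch of the hot-path cache with a stale fallback. The TTL and error handling are illustrative choices, not a prescription:&lt;/p&gt;

```python
import time

# On a cache hit the request never waits on the config service;
# on a failed refresh it serves the stale copy instead of failing.
class CachedPromptClient:
    def __init__(self, fetch_fn, ttl_seconds=60.0):
        self._fetch = fetch_fn
        self._ttl = ttl_seconds
        self._cache = {}   # label -> (payload, fetched_at)

    def get(self, label):
        entry = self._cache.get(label)
        now = time.monotonic()
        expired = entry is None or now - entry[1] > self._ttl
        if expired:
            try:
                payload = self._fetch(label)
                self._cache[label] = (payload, now)
            except Exception:
                if entry is None:
                    raise   # cold start with the service down: nothing to serve
                return entry[0]   # serve stale rather than break TTFT
        return self._cache[label][0]

calls = []
def slow_fetch(label):
    calls.append(label)
    return f"prompt for {label}"

client = CachedPromptClient(slow_fetch, ttl_seconds=60.0)
client.get("production")
client.get("production")           # served from memory, no second fetch
assert calls == ["production"]
```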

&lt;h3&gt;
  
  
  5. Audit trail
&lt;/h3&gt;

&lt;p&gt;Who changed what, when, and why, recorded as a commit message plus metadata on every change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: after an incident you run a detective investigation instead of root-cause analysis. "Who changed the support prompt?" shouldn't take more than 10 seconds to answer.&lt;/p&gt;
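&lt;p&gt;The record itself can be as simple as an append-only log entry written alongside every label change; the fields here are assumptions, adjust to your own governance needs:&lt;/p&gt;

```python
import datetime

audit_log = []

# Every mutating operation appends who/what/when/why before it takes effect.
def reassign_label(label, version_id, actor, reason):
    audit_log.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": "reassign_label",
        "label": label,
        "version_id": version_id,
        "actor": actor,
        "reason": reason,
    })

reassign_label("production", "v2-abc123", "alice",
               "rollback: tone regression in canary")
assert audit_log[-1]["actor"] == "alice"
```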

&lt;h3&gt;
  
  
  6. Trace linkage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;prompt_version_id&lt;/code&gt; attached to every trace/span. Correlation with metrics: latency, tool-call success rate, semantic failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you see quality degrade but can't tie it to a specific prompt version. Observability without trace linkage is dashboards for the sake of dashboards.&lt;/p&gt;
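&lt;p&gt;A sketch of the linkage, with a plain dict standing in for a real tracing span (a production setup would set this as, for example, an OpenTelemetry span attribute):&lt;/p&gt;

```python
# Tag every generation span with the prompt version so dashboards can
# group latency and failure metrics per version.
def start_generation_span(trace_id, prompt_version_id, model):
    return {
        "trace_id": trace_id,
        "prompt_version_id": prompt_version_id,  # the correlation key
        "model": model,
        "attributes": {},
    }

span = start_generation_span("trace-001", "v2-abc123", "gpt-4o")
span["attributes"]["latency_ms"] = 840
span["attributes"]["tool_call_success"] = True

# "Did v3 regress tool calling?" becomes a group-by, not archaeology.
assert span["prompt_version_id"] == "v2-abc123"
```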

&lt;h3&gt;
  
  
  7. Rollback without downtime
&lt;/h3&gt;

&lt;p&gt;Reassign a label → fast rollback without redeploy or service restart (within your propagation window).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: recovery time after a bad prompt equals full deploy time (minutes or hours instead of seconds). In agent systems with dozens of prompts, that's critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Structured schema support
&lt;/h3&gt;

&lt;p&gt;Version not only text but tool schemas, output constraints, and templating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you track prompt text, someone quietly changes the output schema, and the downstream parser breaks. Half the artifact is out of control.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. GitOps-friendly or API-driven workflow
&lt;/h3&gt;

&lt;p&gt;Infra and product teams work in parallel without overwriting each other. Prompts are managed via Git (PR, review) or via API (SDK, UI).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: two people edit the same prompt in the UI → last save wins, wiping the first person's changes. Familiar Google Docs pain, but with production impact.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00htb8giltooe9jwb827.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00htb8giltooe9jwb827.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Maturity Model: Where Are You Now?
&lt;/h2&gt;

&lt;p&gt;Not every system needs Level 4. The point is to know your current level and choose the next step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 0 — Strings in code
&lt;/h3&gt;

&lt;p&gt;Prompts live as literals in code or hardcoded in the UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Typical Level 0
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;No explicit versions (only git blame if you're lucky).&lt;/li&gt;
&lt;li&gt;Rollback = git revert + full deploy.&lt;/li&gt;
&lt;li&gt;Debugging means reading the code to see what's there, but production may be running a different build.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: minimal code-level audit trail and version history in Git; almost none of the runtime requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1 — Git-based prompts
&lt;/h3&gt;

&lt;p&gt;Prompts live in separate files (YAML, JSON, Markdown) and are versioned in Git.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prompts/support_agent/v2.yaml&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;support_agent&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
&lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
&lt;span class="na"&gt;system_prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;You are a support agent for {{product_name}}.&lt;/span&gt;
  &lt;span class="s"&gt;Always check the knowledge base before answering.&lt;/span&gt;
  &lt;span class="s"&gt;If unsure, escalate to a human.&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;search_kb&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;create_ticket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change history and PR review.&lt;/li&gt;
&lt;li&gt;Audit trail via git log.&lt;/li&gt;
&lt;li&gt;Rollback still via deploy (git revert → CI → deploy).&lt;/li&gt;
&lt;li&gt;No runtime labels/aliases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: immutable history in Git, audit trail (git log), GitOps workflow, structured schema (if the file holds all components). Immutable runtime artifacts only appear when you explicitly build and publish versioned artifacts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2 — Config store + labels
&lt;/h3&gt;

&lt;p&gt;Prompts live in a key-value store (Redis, Postgres, DynamoDB, internal config service) with label support.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/prompts/support_agent?label=production
→ { version_id: "v2-abc123", system_prompt: "...", tools: [...] }

GET /v1/prompts/support_agent?label=canary
→ { version_id: "v3-def456", system_prompt: "...", tools: [...] }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Runtime routing by alias.&lt;/li&gt;
&lt;li&gt;Changing the production version without deploy (reassign label).&lt;/li&gt;
&lt;li&gt;In-memory cache on the client + background refresh.&lt;/li&gt;
&lt;li&gt;No built-in eval gating.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: immutability, labels, low-latency fetch, rollback, audit trail (if you keep it), GitOps/API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 3 — Dedicated prompt management platform
&lt;/h3&gt;

&lt;p&gt;A dedicated platform: UI for version management, diffs between versions, built-in tracing, and observability integrations.&lt;/p&gt;

&lt;p&gt;Examples: Langfuse, Braintrust, MLflow Prompt Registry, PromptLayer, LangSmith.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI for comparing versions, promoting, rolling back.&lt;/li&gt;
&lt;li&gt;Observability integration (trace linkage).&lt;/li&gt;
&lt;li&gt;A/B testing and canary rollouts.&lt;/li&gt;
&lt;li&gt;Non-engineers (product, domain experts) can edit prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: all 9 requirements to varying degrees (platform-dependent).&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 4 — Full prompt ops
&lt;/h3&gt;

&lt;p&gt;Single pipeline: create → eval → offline comparison → canary rollout → monitoring → auto-rollback.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt management is part of CI/CD and the eval pipeline.&lt;/li&gt;
&lt;li&gt;Evaluation gating built into the promotion process.&lt;/li&gt;
&lt;li&gt;Automatic alerts when metrics degrade for a given prompt_version.&lt;/li&gt;
&lt;li&gt;A prompt doesn't reach production until it passes the golden set and regression tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: all 9 requirements plus automated eval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tool Overview
&lt;/h2&gt;

&lt;p&gt;Not a feature list — a mapping onto the 9 requirements. The focus is on infrastructure needs, not marketing features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Langfuse
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: LLM observability + prompt management platform, open-source / open-core. After the ClickHouse merger, the project kept an open core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Versioning with labels (&lt;code&gt;production&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, custom).&lt;/li&gt;
&lt;li&gt;Client-side cache — prompt is fetched once, then served from memory. No extra latency on requests.&lt;/li&gt;
&lt;li&gt;Trace linkage: &lt;code&gt;prompt_version_id&lt;/code&gt; attached to every trace.&lt;/li&gt;
&lt;li&gt;Self-hosted option (Docker) — important for compliance and data-sensitive systems.&lt;/li&gt;
&lt;li&gt;Open-source/open-core: most core features are open; some capabilities depend on the commercial plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI for non-engineers is less polished than more product-centric platforms.&lt;/li&gt;
&lt;li&gt;Eval gating has to be built separately (via integration with eval frameworks).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓, eval gating ~, low-latency ✓, audit trail ✓, trace linkage ✓, rollback ✓, schema ~, GitOps/API ✓.&lt;/p&gt;

&lt;h3&gt;
  
  
  MLflow Prompt Registry
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Part of the MLflow GenAI ecosystem. Git-inspired versioning for prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immutable versions + aliasing (Git-inspired).&lt;/li&gt;
&lt;li&gt;Lineage tracking — link prompts to model runs and eval results.&lt;/li&gt;
&lt;li&gt;Natural fit for teams already on MLflow/Databricks.&lt;/li&gt;
&lt;li&gt;Template support with variables (&lt;code&gt;{{variable}}&lt;/code&gt;), conversion to LangChain/LlamaIndex formats.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tightly coupled to the MLflow ecosystem; if you're not on Databricks/MLflow, you pay integration overhead.&lt;/li&gt;
&lt;li&gt;Not a standalone observability platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓ (aliases), eval gating ✓ (via MLflow evaluate), low-latency ~, audit trail ✓, trace linkage ~ (via MLflow tracking), rollback ✓, schema ✓, GitOps ~ (custom scripts).&lt;/p&gt;

&lt;h3&gt;
  
  
  Braintrust
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: AI observability platform with prompt management, eval, and production monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environments: development → staging → production with quality gates.&lt;/li&gt;
&lt;li&gt;Bidirectional sync between code (SDK) and UI (playground) — engineers and product work in parallel.&lt;/li&gt;
&lt;li&gt;GitHub Actions integration: eval in CI, blocking deployments, PR comments.&lt;/li&gt;
&lt;li&gt;Prompt playground for testing on real data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SaaS-first: deployment and data-plane options depend on enterprise setup and contracts.&lt;/li&gt;
&lt;li&gt;Platform lock-in and migration cost if you switch vendors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓ (environments), eval gating ✓, low-latency ✓, audit trail ✓, trace linkage ✓, rollback ✓, schema ✓, GitOps ✓.&lt;/p&gt;

&lt;h3&gt;
  
  
  PromptLayer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Lightweight tool for logging and versioning LLM calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easiest integration (&amp;lt; 30 minutes, a few lines of code).&lt;/li&gt;
&lt;li&gt;Prompt registry: prompts stored outside code, deployed via API.&lt;/li&gt;
&lt;li&gt;Release labels and dynamic labels for runtime routing.&lt;/li&gt;
&lt;li&gt;Basic eval and version comparison.&lt;/li&gt;
&lt;li&gt;Low barrier to entry; good for getting started.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less depth on observability and governance than full-stack LLMOps platforms.&lt;/li&gt;
&lt;li&gt;Teams tend to outgrow it quickly as system complexity grows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓, eval gating ~, low-latency ~, audit trail ✓, trace linkage ~, rollback ✓, schema ~, GitOps ~.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangSmith
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: LangChain platform for tracing, eval, and prompt management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deep integration with LangChain/LangGraph.&lt;/li&gt;
&lt;li&gt;Hub for sharing and versioning prompts.&lt;/li&gt;
&lt;li&gt;Evaluation + dataset management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tied to the LangChain ecosystem (though an SDK and an API are available).&lt;/li&gt;
&lt;li&gt;Commercial product: deployment modes and enterprise features depend on plan and contract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ~, eval gating ✓, low-latency ~, audit trail ✓, trace linkage ✓, rollback ~, schema ✓, GitOps ~.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Langfuse&lt;/th&gt;
&lt;th&gt;MLflow&lt;/th&gt;
&lt;th&gt;Braintrust&lt;/th&gt;
&lt;th&gt;PromptLayer&lt;/th&gt;
&lt;th&gt;LangSmith&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immutability&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Labels/Aliases&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval Gating&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency Fetch&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit Trail&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trace Linkage&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured Schema&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps/API&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;✓ = full support, ~ = partial or needs extra setup, ✗ = no.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The table reflects public docs and typical production scenarios at the time of writing. Before a real decision, check each vendor's current plan limits, licensing, and deployment modes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architectural Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Git-native
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;prompts/&lt;/span&gt;
  &lt;span class="s"&gt;support_agent/&lt;/span&gt;
    &lt;span class="s"&gt;v1.yaml&lt;/span&gt;
    &lt;span class="s"&gt;v2.yaml&lt;/span&gt;
  &lt;span class="s"&gt;code_review/&lt;/span&gt;
    &lt;span class="s"&gt;v1.yaml&lt;/span&gt;
  &lt;span class="s"&gt;registry.yaml     ← index&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;which label points to which version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI builds prompts into an artifact (JSON bundle, SQLite, Redis snapshot). The service loads the artifact at startup.&lt;/p&gt;
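&lt;p&gt;The build step can be sketched as resolving the registry's labels into one immutable bundle. The layout and field names follow the tree above; everything is in-memory here for brevity, where a real CI job would read the YAML files and write the bundle to an artifact store.&lt;/p&gt;

```python
import json

# Stand-ins for the parsed prompt files and registry.yaml above.
prompt_files = {
    "support_agent/v1": {"system_prompt": "v1 text", "temperature": 0.3},
    "support_agent/v2": {"system_prompt": "v2 text", "temperature": 0.3},
    "code_review/v1": {"system_prompt": "review text", "temperature": 0.0},
}
registry = {
    "support_agent": {"production": "v2", "canary": "v2"},
    "code_review": {"production": "v1"},
}

def build_bundle(files, registry):
    # Fail the CI build if a label points at a version that doesn't exist.
    bundle = {}
    for name, labels in registry.items():
        for label, version in labels.items():
            key = f"{name}/{version}"
            if key not in files:
                raise KeyError(f"registry points at missing version: {key}")
            bundle[f"{name}:{label}"] = files[key]
    return json.dumps(bundle, sort_keys=True)

artifact = build_bundle(prompt_files, registry)
assert "support_agent:production" in artifact
```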

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Familiar workflow (PR, review, CI)&lt;/td&gt;
&lt;td&gt;Rollback = new deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full audit trail&lt;/td&gt;
&lt;td&gt;Non-engineers can't edit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No runtime dependencies&lt;/td&gt;
&lt;td&gt;No runtime labels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No extra cost&lt;/td&gt;
&lt;td&gt;Eval gating built from scratch&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams of 1–5 engineers, early stage, few prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Config service (internal)
&lt;/h3&gt;

&lt;p&gt;Your own service with REST/gRPC API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/prompts/{name}?label=production
POST /v1/prompts/{name}/versions   ← create version
PUT /v1/prompts/{name}/labels      ← reassign label
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Storage: Postgres / DynamoDB. Clients: SDK with in-memory cache + background polling (TTL 30–60 sec).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;td&gt;Build and maintain it yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime labels + rollback&lt;/td&gt;
&lt;td&gt;Another service in the stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency (your cache)&lt;/td&gt;
&lt;td&gt;You build the UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No vendor lock-in&lt;/td&gt;
&lt;td&gt;Eval gating is a separate concern&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Consistency note&lt;/strong&gt;: with background polling and TTL 30–60 sec, after reassigning a label different instances can run on &lt;strong&gt;different prompt versions&lt;/strong&gt; for up to a minute. For most LLM use cases eventual consistency is fine. For safety-critical systems you need a push mechanism (webhook/event) or a shorter TTL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: mid-size and larger teams that care about control and have capacity for infra.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Managed platform (SaaS)
&lt;/h3&gt;

&lt;p&gt;Langfuse Cloud / Braintrust / LangSmith — prompts managed via the platform's UI and SDK.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast to start&lt;/td&gt;
&lt;td&gt;Runtime dependency on SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI for non-engineers&lt;/td&gt;
&lt;td&gt;Vendor lock-in (as with Humanloop, which was discontinued)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval, tracing, A/B out of the box&lt;/td&gt;
&lt;td&gt;Cost at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No infra to build&lt;/td&gt;
&lt;td&gt;Data residency constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Critical question&lt;/strong&gt;: what happens when the SaaS is down? The client SDK must have a fallback (last known good version from cache). Without it, SaaS downtime = your service downtime.&lt;/p&gt;
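&lt;p&gt;One way to sketch that fallback: persist the last known good version to local disk, so even a process restart during platform downtime has something to serve. The path and format are illustrative, not any vendor's SDK behavior.&lt;/p&gt;

```python
import json
import pathlib
import tempfile

class DurableFallback:
    def __init__(self, fetch_fn, cache_dir):
        self._fetch = fetch_fn
        self._dir = pathlib.Path(cache_dir)

    def get(self, label):
        path = self._dir / f"{label}.json"
        try:
            payload = self._fetch(label)
            path.write_text(json.dumps(payload))   # refresh last known good
            return payload
        except Exception:
            if path.exists():
                return json.loads(path.read_text())   # platform is down
            raise

def flaky_fetch(label):
    raise ConnectionError("SaaS is down")

with tempfile.TemporaryDirectory() as d:
    healthy = DurableFallback(lambda label: {"v": "v2"}, d)
    healthy.get("production")                   # healthy: fetch and persist
    offline = DurableFallback(flaky_fetch, d)   # e.g. after a restart
    recovered = offline.get("production")

assert recovered == {"v": "v2"}
```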

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that need a quick start and non-engineer access, and accept the risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 4: Hybrid (Git + platform)
&lt;/h3&gt;

&lt;p&gt;Git is source of truth. CI syncs prompts into the platform (Langfuse, Braintrust). The platform handles runtime delivery and observability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer → Git PR → Review → Merge → CI syncs to Platform → Runtime fetch via SDK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code review + runtime flexibility&lt;/td&gt;
&lt;td&gt;Sync complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail in Git&lt;/td&gt;
&lt;td&gt;Drift between Git and platform possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-engineers see result in UI&lt;/td&gt;
&lt;td&gt;Two sources of truth when things go wrong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime labels + rollback&lt;/td&gt;
&lt;td&gt;Extra CI plumbing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Failure modes to plan for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drift&lt;/strong&gt;: CI sync fails, Git moves ahead, platform serves an old version. Engineer thinks the prompt is updated — service is still on the previous one. Mitigation: check &lt;code&gt;prompt_hash&lt;/code&gt; on the platform side + alert on mismatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ownership&lt;/strong&gt;: if non-engineers can edit prompts directly in the platform UI, bypassing Git, Git is no longer the single source of truth. Either block direct edits in the UI or implement reverse sync (platform → Git), which is much more complex.&lt;/li&gt;
&lt;/ul&gt;
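
&lt;p&gt;The &lt;code&gt;prompt_hash&lt;/code&gt; check itself can be as small as this sketch (function names are illustrative):&lt;/p&gt;

```python
import hashlib

def prompt_hash(body: str) -> str:
    # Normalize trailing whitespace so cosmetic diffs don't trigger alerts.
    return hashlib.sha256(body.strip().encode()).hexdigest()[:12]

def check_drift(git_prompts: dict, platform_prompts: dict) -> list:
    """Return prompt names whose Git content does not match what the
    platform is serving. Run after every CI sync and on a schedule."""
    drifted = []
    for name, body in git_prompts.items():
        served = platform_prompts.get(name)
        if served is None or prompt_hash(served) != prompt_hash(body):
            drifted.append(name)
    return drifted

git = {"router": "Route the request.", "judge": "Score the answer v2."}
platform = {"router": "Route the request.", "judge": "Score the answer v1."}
assert check_drift(git, platform) == ["judge"]  # alert on this mismatch
```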

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that want Git review plus runtime flexibility. Most mature pattern, and the hardest to operate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 5: Feature flags
&lt;/h3&gt;

&lt;p&gt;Prompt versions are managed as feature flags in your existing system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Granular rollout (5% → 50% → 100%)&lt;/td&gt;
&lt;td&gt;Flag systems aren't built for long text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instant rollback (toggle off)&lt;/td&gt;
&lt;td&gt;With dozens of prompts, flag sprawl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A/B testing out of the box&lt;/td&gt;
&lt;td&gt;No diffs between prompt versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Familiar if you already use it&lt;/td&gt;
&lt;td&gt;Prompts still need to live somewhere&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that already have feature-flag infra and need granular rollout. Works well as a &lt;strong&gt;complement&lt;/strong&gt; to other patterns (e.g. Git-native + flags for rollout), not as the only mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime delivery: 3 questions for any pattern
&lt;/h3&gt;

&lt;p&gt;Whatever pattern you pick, answer these before production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How does the prompt reach runtime?&lt;/strong&gt; Polling with TTL, push via webhook/event, or baked in at deploy? This determines how fast changes propagate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens if the prompt source is unavailable?&lt;/strong&gt; Fallback from local cache (stale-while-revalidate) or hard failure? Without fallback you add a single point of failure on the hot path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How quickly do all instances see the new version?&lt;/strong&gt; Eventual consistency (seconds–minutes) or strong? For most LLM use cases eventual is enough, but you must know your consistency window.&lt;/li&gt;
&lt;/ol&gt;
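
&lt;p&gt;To make question 3 concrete: with polling-with-TTL delivery, the consistency window is exactly the TTL. A minimal getter (names are illustrative):&lt;/p&gt;

```python
import time

class PolledPrompt:
    """Poll-with-TTL delivery: each instance refreshes at most once per
    `ttl` seconds, so a label reassignment propagates within one TTL."""

    def __init__(self, fetch_fn, ttl=30.0):
        self._fetch, self._ttl = fetch_fn, ttl
        self._value, self._fetched_at = None, 0.0

    def get(self, now=None):
        now = time.time() if now is None else now
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()      # refresh from the source
            self._fetched_at = now
        return self._value                   # possibly up to `ttl` stale

versions = iter(["v1", "v2"])
p = PolledPrompt(lambda: next(versions), ttl=30.0)
assert p.get(now=0.0) == "v1"
assert p.get(now=10.0) == "v1"   # inside the consistency window: still v1
assert p.get(now=31.0) == "v2"   # after the TTL: new version picked up
```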

&lt;p&gt;Each of these is an engineering concern borrowed from distributed config propagation. A deeper treatment — caching patterns, failure modes, examples — deserves a separate post.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Choose: Decision Framework
&lt;/h2&gt;

&lt;p&gt;Don't choose by feature list. Choose by four questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Who edits prompts?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only engineers → Git-native or config service.&lt;/li&gt;
&lt;li&gt;Product/domain experts too → Platform or hybrid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. How fast must rollback be?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seconds → you need runtime labels (Level 2+).&lt;/li&gt;
&lt;li&gt;Minutes via CI is acceptable → Git-native is enough.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. How many prompts and how often do they change?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 prompts, change once a month → Git-native.&lt;/li&gt;
&lt;li&gt;50+ prompts, change weekly → Platform or hybrid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Data residency and compliance?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data must stay in region / on-premise → self-hosted (Langfuse, MLflow) or your own config service.&lt;/li&gt;
&lt;li&gt;No constraints → SaaS is fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For enterprise teams, (4) is often the &lt;strong&gt;first filter&lt;/strong&gt; and rules out half the options immediately.&lt;/p&gt;
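
&lt;p&gt;The four questions can even be encoded as a first-cut filter. The thresholds below are illustrative, not prescriptive:&lt;/p&gt;

```python
def recommend_pattern(editors_are_engineers: bool,
                      rollback_in_seconds: bool,
                      prompt_count: int,
                      data_must_stay_onprem: bool) -> str:
    """First-cut mapping of the four questions to a pattern.
    Note the order: compliance filters first, as in the text."""
    if data_must_stay_onprem:
        return "self-hosted platform or own config service"
    if not editors_are_engineers or prompt_count >= 50:
        return "managed platform or hybrid (Git + platform)"
    if rollback_in_seconds:
        return "config service with runtime labels"
    return "Git-native"

assert recommend_pattern(True, False, 5, False) == "Git-native"
assert "self-hosted" in recommend_pattern(False, True, 100, True)
```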




&lt;h2&gt;
  
  
  Insight
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Prompt management is a new infrastructure layer. It's closest to config management and feature flags, but with a twist: prompt semantics are opaque and the impact of changes is probabilistic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don't need to build Level 4 right away. See where you are and pick &lt;strong&gt;one&lt;/strong&gt; next step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At Level 0? → Move prompts to files and introduce &lt;code&gt;prompt_version_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;At Level 1? → Add runtime labels and rollback without deploy.&lt;/li&gt;
&lt;li&gt;At Level 2? → Add eval gating and trace linkage.&lt;/li&gt;
&lt;li&gt;At Level 3? → Automate the promotion pipeline.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you already run prompt management in production — what approach did you choose and what pitfalls did you hit?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Design Recipe: Observability Pyramid for LLM Infrastructure</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 05 Feb 2026 08:44:43 +0000</pubDate>
      <link>https://forem.com/astronaut27/design-recipe-observability-pyramid-for-llm-infrastructure-3b5l</link>
      <guid>https://forem.com/astronaut27/design-recipe-observability-pyramid-for-llm-infrastructure-3b5l</guid>
      <description>&lt;p&gt;In classic backend systems, we are used to determinism: code either works or crashes with a clear stack trace. In LLM systems, we deal with "soft failures" — the system runs fast and without log errors, but outputs hallucinations or irrelevant context.&lt;/p&gt;

&lt;p&gt;As an engineer with a highload and distributed systems background, I like to view the system as a conveyor with measurable efficiency at each stage. For this, I use the &lt;strong&gt;Observability Pyramid&lt;/strong&gt;, where each layer protects the next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff75eneriwv4ubbmlx40v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff75eneriwv4ubbmlx40v.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. System Layer: Telemetry and SRE Basics
&lt;/h2&gt;

&lt;p&gt;Without this layer, the others make no sense. If you don't meet SLAs for availability and speed, response accuracy doesn't matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (Time to First Token): the main metric for UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPOT&lt;/strong&gt; (Time Per Output Token): generation stability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens/Sec &amp;amp; Input/Output Ratio&lt;/strong&gt;: critical for capacity planning and understanding KV-cache load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Engineering Approach:&lt;/strong&gt; Monitor inference engines (vLLM/TGI) via Prometheus/Grafana and OpenTelemetry (OpenLLMetry).&lt;/p&gt;
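
&lt;p&gt;If your stack doesn't export TTFT/TPOT directly, both can be derived from any streaming token iterator. A minimal sketch, not tied to a specific engine:&lt;/p&gt;

```python
import time

def stream_metrics(token_stream):
    """Compute TTFT and TPOT from any token iterator (for example,
    a streaming OpenAI-compatible response). Times are in seconds."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    # TPOT: average inter-token latency after the first token.
    tpot = (end - first_token_at) / max(count - 1, 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens": count}

def fake_stream():
    # Stand-in for a real model stream: three tokens, 10 ms apart.
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok

m = stream_metrics(fake_stream())
assert m["tokens"] == 3 and m["ttft_s"] > 0 and m["tpot_s"] > 0
```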

&lt;p&gt;For details on profiling the engine and finding bottlenecks — see my article:&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;LLM Engine Telemetry: How to profile models&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Retrieval Layer: Data Hygiene (RAG Triad)
&lt;/h2&gt;

&lt;p&gt;Most hallucinations stem from poor retrieval. RAG evaluation should be decomposed into three components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Context Precision&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
How relevant are the retrieved chunks? Noise distracts the model and wastes tokens.&lt;br&gt;&lt;br&gt;
Tools: RAGAS, DeepEval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B. Context Recall&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Does the retrieved set contain the factual answer?&lt;br&gt;&lt;br&gt;
Practice: You need a "golden standard" — a labeled dataset. I use &lt;a href="https://github.com/facebookresearch/CRAG" rel="noopener noreferrer"&gt;Meta CRAG&lt;/a&gt; because it simulates real-world chaos and dynamically changing data.&lt;br&gt;&lt;br&gt;
See my guide on local CRAG evaluation &lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C. Faithfulness&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Is the answer derived from the context or hallucinated? &lt;/p&gt;

&lt;p&gt;A judge model checks every claim in the response against the provided source.&lt;/p&gt;
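
&lt;p&gt;The core of that check can be sketched in a few lines. The stub judge below uses substring matching purely for illustration; a real judge model performs semantic entailment, and RAGAS/DeepEval wrap this same claim-by-claim pattern:&lt;/p&gt;

```python
def faithfulness_score(claims, context, judge_fn):
    """Fraction of claims in the answer that the judge says are
    supported by the retrieved context."""
    if not claims:
        return 1.0
    supported = sum(1 for c in claims if judge_fn(claim=c, context=context))
    return supported / len(claims)

# Stub judge: "supported" means the claim text appears in the context.
# A real judge model does semantic entailment, not substring matching.
stub_judge = lambda claim, context: claim.lower() in context.lower()

context = "The Eiffel Tower is 330 metres tall and stands in Paris."
claims = ["the eiffel tower is 330 metres tall", "it was built in 1889"]
assert faithfulness_score(claims, context, stub_judge) == 0.5
```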

&lt;h2&gt;
  
  
  3. Semantic Layer: LLM-as-a-Judge at Scale
&lt;/h2&gt;

&lt;p&gt;This level checks logic. The main challenge is balancing evaluation quality with cost/speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Gating&lt;/strong&gt;: Full run on a reference dataset. If Faithfulness drops below 0.8 — block deployment (tune the threshold for your domain).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Sampling&lt;/strong&gt;: In highload systems, evaluating 100% of traffic via GPT-4o is financial suicide. Use sampling (1–5%).
Additionally: implement &lt;strong&gt;judge caching&lt;/strong&gt; (&lt;a href="https://github.com/zilliztech/GPTCache" rel="noopener noreferrer"&gt;GPTCache&lt;/a&gt;, LangChain cache, or vLLM prefix caching). This is especially effective when users ask similar questions — the same prompt+context may be evaluated multiple times, but you pay only once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Judges&lt;/strong&gt;: Instead of "naked" small models (which often struggle with logic), use Prometheus-2 or Flow-Judge. They are trained specifically for evaluation tasks, comparable in quality to GPT-4, and can be hosted locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-band Eval&lt;/strong&gt;: In production, evaluation always runs asynchronously to avoid increasing main request latency.&lt;/li&gt;
&lt;/ul&gt;
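
&lt;p&gt;The first two practices fit in a few lines. The 0.8 threshold and the 2% rate below are the illustrative values from the list above:&lt;/p&gt;

```python
import random

FAITHFULNESS_THRESHOLD = 0.8   # tune for your domain, as noted above

def ci_gate(eval_scores):
    """Block deployment when mean faithfulness on the reference
    dataset drops below the threshold."""
    mean = sum(eval_scores) / len(eval_scores)
    return mean >= FAITHFULNESS_THRESHOLD

def should_sample(rate=0.02, rng=random):
    """Per-request decision: send this trace to the judge or not.
    2% keeps judge cost bounded in high-traffic systems."""
    return rng.random() < rate

assert ci_gate([0.9, 0.85, 0.88]) is True
assert ci_gate([0.9, 0.5, 0.6]) is False     # deployment blocked
rng = random.Random(0)
sampled = sum(should_sample(0.02, rng) for _ in range(10_000))
assert 100 < sampled < 300                   # roughly 2% of traffic
```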

&lt;h2&gt;
  
  
  Diagnostic Map: What to Fix?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;If Dropped, Problem In:&lt;/th&gt;
&lt;th&gt;Action Plan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context Recall&lt;/td&gt;
&lt;td&gt;Embeddings / Indexing&lt;/td&gt;
&lt;td&gt;Switch embedding model, implement Hybrid Search (Vector + Keyword)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;td&gt;Chunking / Noise&lt;/td&gt;
&lt;td&gt;Add Reranker (Cross-Encoder), revise Chunking Strategy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;Temperature / Context&lt;/td&gt;
&lt;td&gt;Lower Temperature, strengthen system prompt, check chunk integrity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT (Latency)&lt;/td&gt;
&lt;td&gt;Hardware / Load&lt;/td&gt;
&lt;td&gt;Check Cache Hit Rate, enable quantization or PagedAttention&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Implementation Plan (Checklist)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instrument (Day 0)&lt;/strong&gt;: Set up export of metrics and traces (vLLM + OpenTelemetry).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden Set&lt;/strong&gt;: Collect 50–100 critical cases. Use Meta CRAG structure as reference (details in my article &lt;a href="https://dev.to/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k"&gt;Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate&lt;/strong&gt;: Integrate DeepEval/RAGAS into GitHub Actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sampling &amp;amp; Feedback&lt;/strong&gt;: Set up log and user feedback collection (thumbs up/down) for gray-zone analysis in Arize Phoenix or LangSmith.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For an experienced engineer, an LLM system is just another probabilistic node in a distributed architecture. Our job is to surround it with sensors so its behavior becomes predictable — like the trajectory of a rocket on a verified orbit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fak7booeooz6xzliaod2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fak7booeooz6xzliaod2g.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>aiops</category>
      <category>rag</category>
    </item>
    <item>
      <title>Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Tue, 30 Dec 2025 15:31:54 +0000</pubDate>
      <link>https://forem.com/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k</link>
      <guid>https://forem.com/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Want to skip the theory and launch a local RAG benchmark in Docker right now? Check out the &lt;a href="https://github.com/astronaut27/CRAG_with_Docker" rel="noopener noreferrer"&gt;repo&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Introduction: Breaking the Infrastructure Barrier&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://medium.com/@astronaut27/how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-docker-8ea8435f9fa2" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, we prepped our "shuttle" for launch by containerizing the Meta CRAG infrastructure. It gave us a standardized environment, but we were still tethered to one expensive "ground control" dependency.&lt;/p&gt;

&lt;p&gt;The original benchmark baselines are &lt;strong&gt;resource-hungry&lt;/strong&gt;. They expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paid OpenAI API for final judging.&lt;/li&gt;
&lt;li&gt;GPU(CUDA) clusters to run inference via vLLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Developing a RAG system under these constraints feels like ordering expensive parts by mail when you already have the tools in your garage.&lt;/em&gt; You spend your budget on "shipping" (API tokens) and wait for external servers to reply, even though you have plenty of local horsepower sitting idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if you could launch the rocket from your own spaceport?&lt;/strong&gt; Right on your laptop, with &lt;strong&gt;zero cost per request&lt;/strong&gt; and total autonomy. We’re swapping external APIs for local inference using Ollama and Ray.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Architecture: The OpenAI-Compatible Interface&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj9afsublbt9qjh1moeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj9afsublbt9qjh1moeg.png" alt=" " width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest headache with academic benchmarks is their &lt;strong&gt;rigid stack.&lt;/strong&gt; Meta CRAG expects either vLLM or OpenAI by default. Rewriting the core evaluation logic is a recipe for bugs and broken metrics.&lt;/p&gt;

&lt;p&gt;Instead, we’ll take the engineering shortcut:&lt;/p&gt;

&lt;p&gt;We implemented a &lt;code&gt;RAGOpenAICompatibleModel&lt;/code&gt; class. It uses the standard &lt;code&gt;openai&lt;/code&gt; library but "hijacks" the data flow via the &lt;code&gt;base_url&lt;/code&gt; variable. This lets us point the benchmark at a local Ollama instance without changing the core logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; This gives us hot-swappable brains. Want to test Llama 3? Just change the model key. Want to compare it against Qwen or Gemma? A quick &lt;code&gt;export&lt;/code&gt; in your terminal and a few lines in the configuration file are all it takes.&lt;/p&gt;
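
&lt;p&gt;The adapter idea boils down to a few lines with the standard &lt;code&gt;openai&lt;/code&gt; client. The environment variable names here are illustrative; see the repo for the full &lt;code&gt;RAGOpenAICompatibleModel&lt;/code&gt;:&lt;/p&gt;

```python
import os

# Env-driven config: swapping Llama 3 for Qwen or Gemma is one `export`.
# Variable names are illustrative; Ollama's OpenAI-compatible endpoint
# lives at /v1 by default.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
MODEL_NAME = os.getenv("EVAL_MODEL", "llama3:8b")

def make_client():
    from openai import OpenAI  # pip install openai
    # Ollama ignores the API key, but the client requires a value.
    return OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")

def generate(prompt: str) -> str:
    """One chat completion against the local engine."""
    resp = make_client().chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Config resolves without a running server; generate() needs Ollama up.
assert OLLAMA_BASE_URL.endswith("/v1")
```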




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Tuning the "Onboard Systems": Ray and HTML Cleanup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the cloud, you pay for convenience—you can feed raw HTML to the LLM and hope it figures it out. In a local spaceport, &lt;strong&gt;resources are finite.&lt;/strong&gt; Every extra token is &lt;em&gt;dead weight&lt;/em&gt; (ballast).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0nirj6ok5e98lyi3npm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0nirj6ok5e98lyi3npm.png" alt=" " width="800" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🛠 Parallelism via Ray
&lt;/h3&gt;

&lt;p&gt;Processing hundreds of HTML pages for every question is heavy. We use &lt;strong&gt;Ray&lt;/strong&gt; to distribute the load: while the GPU is busy generating an answer, the idle CPU cores are &lt;strong&gt;"scrubbing" data&lt;/strong&gt; for the next batch in the background.&lt;/p&gt;
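
&lt;p&gt;The repo implements this overlap with Ray's &lt;code&gt;@ray.remote&lt;/code&gt; tasks. The same prefetch idea can be sketched dependency-free with &lt;code&gt;concurrent.futures&lt;/code&gt;; the stubs below stand in for the real scrub and generate steps:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def scrub(html: str) -> str:
    # Stand-in for the real BeautifulSoup cleanup (see the next section).
    return " ".join(html.replace("<div>", " ").replace("</div>", " ").split())

def generate_answer(question: str, context: str) -> str:
    return f"answer({question})"  # stand-in for the GPU-bound LLM call

def pipeline(batches):
    """Overlap CPU scrubbing of the *next* batch with generation for
    the current one: the same idea Ray's @ray.remote tasks implement."""
    results = []
    with ThreadPoolExecutor() as pool:
        pending = pool.submit(scrub, batches[0]["html"])
        for i, batch in enumerate(batches):
            context = pending.result()
            if i + 1 < len(batches):              # prefetch the next batch
                pending = pool.submit(scrub, batches[i + 1]["html"])
            results.append(generate_answer(batch["q"], context))
    return results

batches = [{"q": "q1", "html": "<div>a</div>"}, {"q": "q2", "html": "<div>b</div>"}]
assert pipeline(batches) == ["answer(q1)", "answer(q2)"]
```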

&lt;h3&gt;
  
  
  🧹 The "Space Junk" Filter
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;BeautifulSoup&lt;/code&gt; to strip tags is a &lt;strong&gt;survival requirement.&lt;/strong&gt; Local models with 8k context windows quickly "suffocate" under endless &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We clean the HTML.&lt;/li&gt;
&lt;li&gt;Split text into sentences.&lt;/li&gt;
&lt;li&gt;Cap snippets at 1000 characters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Result&lt;/em&gt;: We fit significantly more useful info into the context, boosting accuracy without needing massive model weights.&lt;/p&gt;
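
&lt;p&gt;The three steps above, sketched without dependencies. The repo itself uses &lt;code&gt;BeautifulSoup&lt;/code&gt;, but the pipeline is identical: strip tags, drop script/style junk, cap snippet length:&lt;/p&gt;

```python
from html.parser import HTMLParser

class SpaceJunkFilter(HTMLParser):
    """Keep visible text, drop script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str, cap: int = 1000) -> list:
    parser = SpaceJunkFilter()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Cap snippets at `cap` characters, as in the pipeline above.
    return [text[i:i + cap] for i in range(0, len(text), cap)]

html = "<div><script>var x=1;</script><p>The launch window opens at dawn.</p></div>"
assert clean_html(html) == ["The launch window opens at dawn."]
```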




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Field Testing: Real Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwjpl3xvyraxwbhsbgaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwjpl3xvyraxwbhsbgaq.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We picked three popular models to see how they handle a "combat" RAG scenario.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy (Correct)&lt;/th&gt;
&lt;th&gt;Hallucination&lt;/th&gt;
&lt;th&gt;Missing (I don't know)&lt;/th&gt;
&lt;th&gt;Final Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma-2-9B&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3-8B&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;td&gt;-0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen-2.5-7B&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;-1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Post-Mortem: Why did Qwen crash? 💥
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmakpmgg3v4g3aaf5lgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmakpmgg3v4g3aaf5lgh.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qwen’s results look catastrophic, but this is a &lt;strong&gt;huge engineering lesson&lt;/strong&gt;. It didn't fail because it was "stupid"—it failed because it violated the protocol.&lt;/p&gt;

&lt;p&gt;Typical Qwen output:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;" &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; Okay, let's see. The user is asking about the producers... I need to check the references..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model started "thinking out loud" via the &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tag, ignoring the instruction to &lt;em&gt;answer succinctly&lt;/em&gt;. In CRAG, any text that isn't the direct answer is flagged as a &lt;strong&gt;Hallucination&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Takeaway:&lt;/strong&gt; Models with forced &lt;em&gt;Chain-of-Thought (CoT)&lt;/em&gt; need heavy post-processing (stripping the reasoning tags) or surgically precise prompting to keep them from turning a short answer into a philosophical essay.&lt;/p&gt;
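
&lt;p&gt;A minimal version of that post-processing. It assumes the reasoning is wrapped in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags, terminated or not:&lt;/p&gt;

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)

def strip_reasoning(raw: str) -> str:
    """Remove chain-of-thought blocks so only the direct answer is
    scored. Also drops an unterminated block (the Qwen failure mode)."""
    cleaned = THINK_BLOCK.sub("", raw)
    cleaned = re.sub(r"<think>.*", "", cleaned, flags=re.DOTALL | re.IGNORECASE)
    return cleaned.strip()

raw = "<think>Okay, let's see. The user is asking...</think>The answer is 42."
assert strip_reasoning(raw) == "The answer is 42."
assert strip_reasoning("<think>unterminated rambling") == ""
```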

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl6rhnhtp6shszsguqf3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl6rhnhtp6shszsguqf3.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Try it Yourself: Code on GitHub&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Stop reading and start launching. I’ve prepped a repository with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker configs for easy deployment.&lt;/li&gt;
&lt;li&gt;Ollama adapters for local inference.&lt;/li&gt;
&lt;li&gt;Ray scripts for high-speed HTML cleaning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀 Project Repo: &lt;a href="https://github.com/astronaut27/CRAG_with_Docker" rel="noopener noreferrer"&gt;astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Conclusion: Autonomy Achieved&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’ve proven that you don’t need a corporate budget to do serious RAG engineering.&lt;/p&gt;

&lt;p&gt;Our Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt;: Run the benchmark with a single command.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Exactly $0 per iteration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Your data never leaves your "space station."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Local evaluation is about building an honest development process where every change is backed by numbers, not just gut feeling.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Next Mission: RAGAS vs. CRAG&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our spaceport is fully operational. But how does our local ground truth compare to popular metrics like &lt;strong&gt;RAGAS&lt;/strong&gt;? In the next post, we’ll pit RAGAS against the hard facts of CRAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See you in orbit! 👨‍🚀✨&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>🧑‍🚀 LLM Engine Telemetry: How to Profile Models and See Where Performance is Lost</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 27 Nov 2025 14:32:20 +0000</pubDate>
      <link>https://forem.com/astronaut27/llm-engine-telemetry-how-to-profile-models-and-see-where-performance-is-lost-169b</link>
      <guid>https://forem.com/astronaut27/llm-engine-telemetry-how-to-profile-models-and-see-where-performance-is-lost-169b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“Any LLM is an engine. It can be massive or compact, but if you don't look at the telemetry, you'll never understand where you're burning energy inefficiently.”&lt;br&gt;
— Astronaut Engineer, Logbook #4&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🌌 Introduction: Why LLMs Need Profiling
&lt;/h2&gt;

&lt;p&gt;When engineers discuss LLM performance, three key phases are most often mentioned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tokenization latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (Time To First Token)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;tokens/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;But it's easier to think of it this way:&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An LLM is an engine, and the profiler is its dashboard. The rest is visible through the readings—and we're about to break them down.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Just like in real machinery:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Startup is always more expensive than the cruising phase,&lt;/li&gt;
&lt;li&gt;Different engine components consume energy differently,&lt;/li&gt;
&lt;li&gt;The true picture is only visible through telemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;👨‍🚀 Caption: "Before launch—rely only on the instruments"&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Mission Plan
&lt;/h2&gt;

&lt;p&gt;We are launching the &lt;strong&gt;GPT-2&lt;/strong&gt; model in three scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short prompt&lt;/li&gt;
&lt;li&gt;Medium prompt&lt;/li&gt;
&lt;li&gt;Long prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each test goes through three key phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization&lt;/strong&gt; — Preparing the input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefill&lt;/strong&gt; — The initial prompt processing that establishes TTFT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode / Steady-State&lt;/strong&gt; — The cruising phase of generation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokenization time,&lt;/li&gt;
&lt;li&gt;TTFT,&lt;/li&gt;
&lt;li&gt;Generation speed (ms/token, tokens/sec),&lt;/li&gt;
&lt;li&gt;Memory usage (peakRSS),&lt;/li&gt;
&lt;li&gt;The most expensive low-level operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All data is collected via &lt;em&gt;torch.profiler&lt;/em&gt; and displayed in TensorBoard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhyry70g5m1ntxb992o4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhyry70g5m1ntxb992o4.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 LLM Operation Phases: What Happens Under the Hood
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Tokenization — Input Preparation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The text is converted into tokens using the chosen tokenizer. On short texts, measuring this phase can be highly susceptible to system noise (jitter), which is why tokenization is almost always measured separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Prefill — Prompt Processing and Model State Establishment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this phase, the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs the entire prompt through all layers once,&lt;/li&gt;
&lt;li&gt;Computes attention for the entire input sequence,&lt;/li&gt;
&lt;li&gt;Populates the KV-Cache for subsequent generation,&lt;/li&gt;
&lt;li&gt;Allocates temporary tensors and buffers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Formally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TTFT = Prefill time + first Decode step time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TTFT is the time required to complete the prompt processing and generate the first token. Taken as a single step, prefill is by far the most expensive phase, since the entire prompt is processed in one go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Decode — Generating New Tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After prefill, the model transitions to sequential generation. Each new token requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 forward pass → 1 token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Decode characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operations are repeated with the same structure,&lt;/li&gt;
&lt;li&gt;The KV-Cache prevents re-computing attention for the entire prompt,&lt;/li&gt;
&lt;li&gt;Metrics become stable: &lt;code&gt;ms/token&lt;/code&gt;, &lt;code&gt;tokens/sec&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;📡 Experimental Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;mission_profiler.py&lt;/code&gt; script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performs three launches (short / medium / long prompt)&lt;/li&gt;
&lt;li&gt;Executes two generations for each:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Prefill → TTFT and full generation → Steady&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saves traces to TensorBoard,&lt;/li&gt;
&lt;li&gt;Outputs a summary metrics table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ We do not perform any warmup, so the first run (short_prompt) may be slower.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🛠️ Launch Telemetry Yourself!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can replicate this "flight" and study the profiler logs on your own machine. &lt;/p&gt;

&lt;p&gt;All the code, settings, and launch instructions are available in the mission repository: &lt;a href="https://github.com/astronaut27/llm-profiler-mission" rel="noopener noreferrer"&gt;GitHub: LLM Profiler Mission - Engine Telemetry&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  📈 Mission Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;================= MISSION SUMMARY =================
tag            prompt_len   tokenize   TTFT(ms)   steady(ms)   actual_tok   ms/token   tok/s   peakRSS(MB)
--------------------------------------------------------------------------------------------------------
short_prompt           19        6.6      920.9        823.5         32      25.73     38.9       2541.2
medium_prompt          56        1.4       43.2       1047.4         32      32.73     30.6       2866.3
long_prompt           116        1.7       32.5        894.0         32      27.94     35.8       2886.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How to Read These Numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Tokenize.&lt;/strong&gt; We can't reliably compare tokenizers from this data: short strings are heavily affected by system noise. Tokenization performance needs a dedicated, large-scale benchmark and is evaluated separately.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;TTFT (Time-To-First-Token).&lt;/strong&gt; The most interesting observation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short prompt → 921 ms&lt;/li&gt;
&lt;li&gt;Medium → 43 ms&lt;/li&gt;
&lt;li&gt;Long → 32 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Why the difference?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first run (short_prompt) bore the full impact of the &lt;strong&gt;cold start&lt;/strong&gt;: it includes CUDA/MPS warmup, allocations, and JIT compilation of kernels.&lt;/li&gt;
&lt;li&gt;TTFT is sensitive to the very first execution.&lt;/li&gt;
&lt;li&gt;In subsequent runs (medium, long prompt), after warmup, TTFT stabilizes, and the difference between the medium and long prompt becomes minimal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TTFT should be measured either after a dedicated &lt;strong&gt;warmup&lt;/strong&gt; or averaged over several runs.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Steady-State (ms/token).&lt;/strong&gt; The cost per token remains relatively stable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~26–33 ms/token&lt;/li&gt;
&lt;li&gt;~30–39 tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the engine's speed in &lt;strong&gt;cruising mode&lt;/strong&gt;. As expected, the per-token latency shows almost no dependence on prompt length.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Peak RSS: 2541 → 2866 → 2886 MB.&lt;/strong&gt; Memory usage jumps noticeably when going from the short to the medium prompt &lt;em&gt;(due to the growth of the KV-Cache and general allocations)&lt;/em&gt;, but further lengthening adds little. This confirms that the primary VRAM/RAM allocation is for the model itself, while the KV-Cache consumes only a small fraction; its size does, however, grow linearly with input length.&lt;/p&gt;
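&lt;p&gt;That linear growth can be estimated on the back of an envelope: two cached tensors (K and V) per layer, per position. A sketch with illustrative, roughly 7B-class parameters:&lt;/p&gt;

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Estimate KV-Cache size: 2 tensors (K and V) per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
per_token = kv_cache_bytes(1, 32, 32, 128)              # 524288 bytes, i.e. 0.5 MiB/token
ctx_4k_mib = kv_cache_bytes(4096, 32, 32, 128) / 2**20  # 2048 MiB for a 4k context
```

&lt;p&gt;At a hundred-token prompt this is only tens of megabytes, consistent with the table above, where model weights rather than the cache dominate peak RSS.&lt;/p&gt;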




&lt;p&gt;&lt;strong&gt;📊 Who is Really Consuming Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the profiler, all operations fall into two camps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔥 1. Main Thrust (Useful Work)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the GPU, these are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;addmm&lt;/li&gt;
&lt;li&gt;mm / matmul&lt;/li&gt;
&lt;li&gt;scaled_dot_product_attention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They consume the majority of the CUDA time. These are the matrix computation kernels—the operations that truly &lt;strong&gt;propel&lt;/strong&gt; the LLM engine forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚙️ 2. Control Expenses (Overhead)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Utility operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_local_scalar_dense&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;item&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;copy_&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;to&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Mask checks (&lt;code&gt;eq&lt;/code&gt;, &lt;code&gt;all&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data movement,&lt;/li&gt;
&lt;li&gt;Synchronization between the CPU/host and the GPU/device,&lt;/li&gt;
&lt;li&gt;Scalar extraction,&lt;/li&gt;
&lt;li&gt;Utility logic for generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not the engine's thrust, but the cost of &lt;strong&gt;flight control.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🧭 The Big Picture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core computational kernels (attention, matmuls, addmm)&lt;/strong&gt; determine whether the model is fast or slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead operations&lt;/strong&gt; are non-productive costs that can be reduced through optimizations: minimizing synchronizations, enabling &lt;code&gt;use_cache=True&lt;/code&gt;, and reducing the number of small tensor operations.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;CUDA&lt;/strong&gt;, matrix kernels dominate (as they should); on &lt;strong&gt;MPS&lt;/strong&gt;, utility operations often dominate instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real profile of LLM performance is hidden in the balance between these two groups.&lt;/p&gt;
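&lt;p&gt;One practical way to read a trace is to bucket each operation into the two camps and compare totals. A sketch over (op name, total time) pairs, as you would export them from the profiler table (the name sets mirror the lists above; the sample numbers are made up):&lt;/p&gt;

```python
# Matrix-math "thrust" kernels vs. control/synchronization overhead
COMPUTE_OPS = {"addmm", "mm", "matmul", "bmm", "scaled_dot_product_attention"}
OVERHEAD_OPS = {"_local_scalar_dense", "item", "cat", "copy_", "to", "eq", "all"}

def split_profile(op_times):
    """op_times: dict mapping op name to total time (ms).
    Returns the totals for the two camps plus the uncategorized remainder."""
    compute = sum(t for op, t in op_times.items() if op in COMPUTE_OPS)
    overhead = sum(t for op, t in op_times.items() if op in OVERHEAD_OPS)
    other = sum(op_times.values()) - compute - overhead
    return compute, overhead, other

sample = {"addmm": 610.0, "scaled_dot_product_attention": 240.0,
          "copy_": 55.0, "item": 20.0, "layer_norm": 35.0}
compute, overhead, other = split_profile(sample)  # 850.0, 75.0, 35.0
```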




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvveiwef3stj8drsd2xza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvveiwef3stj8drsd2xza.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚙️ Why Profiling LLMs is Essential&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The profiler turns: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❌ "The model is running slow" into ✔ "Here is the specific operation that's consuming energy."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It helps reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the &lt;strong&gt;bottleneck&lt;/strong&gt; is located,&lt;/li&gt;
&lt;li&gt;The cost of prefill,&lt;/li&gt;
&lt;li&gt;The cost of each token,&lt;/li&gt;
&lt;li&gt;How &lt;strong&gt;memory&lt;/strong&gt; behaves,&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;overhead&lt;/strong&gt; created by Hugging Face's &lt;code&gt;generate()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🏁 Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM is an engine. Sometimes powerful, sometimes compact, but always complex and sensitive to overloads. And until you open the profiler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You won't see the expensive matrix operations,&lt;/li&gt;
&lt;li&gt;You won't see the &lt;strong&gt;synchronization overhead&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;You won't know the cost of prefill,&lt;/li&gt;
&lt;li&gt;You won't see the growth of the KV-Cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The profiler is our &lt;em&gt;flight recorder&lt;/em&gt;. It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the engine is pulling,&lt;/li&gt;
&lt;li&gt;Where it's stalling,&lt;/li&gt;
&lt;li&gt;And where the energy is going.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;No one launches a rocket without a flight recorder.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
    <item>
      <title>🧑‍🚀 Choosing the Right Engine to Launch Your LLM (LM Studio, Ollama, and vLLM)</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 06 Nov 2025 17:00:00 +0000</pubDate>
      <link>https://forem.com/astronaut27/choosing-the-right-engine-to-launch-your-llm-lm-studio-ollama-and-vllm-195o</link>
      <guid>https://forem.com/astronaut27/choosing-the-right-engine-to-launch-your-llm-lm-studio-ollama-and-vllm-195o</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;A Practical Field Guide for Engineers: LM Studio, Ollama, and vLLM&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When you’re building your first LLM ship, the hardest part isn’t takeoff — it’s choosing the right engine.”&lt;br&gt;
— Engineer-Astronaut, Mission Log №3&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In the LLM universe, everything moves at lightspeed.&lt;br&gt;
Sooner or later, every engineer faces the same question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how do you run a local model — fast, stable, and reliably?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;LM Studio — a local capsule with a friendly interface.&lt;/li&gt;
&lt;li&gt;Ollama — a maneuverable shuttle for edge missions.&lt;/li&gt;
&lt;li&gt;vLLM — an industrial reactor for API workloads and GPU clusters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But which one is right for &lt;em&gt;your&lt;/em&gt; mission?&lt;br&gt;
This article isn’t just another benchmark — it’s a &lt;strong&gt;navigation map&lt;/strong&gt;, built by an engineer who has wrestled with GPU crashes, dependency hell, and Dockerization pains.&lt;/p&gt;


&lt;h2&gt;
  
  
  🪐 Personal Log.
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“When I first tried LM Studio on my laptop, it was beautiful —&lt;br&gt;
until I needed to automate the launch.&lt;br&gt;
The GUI couldn’t be containerized, and the headless mode required extra tinkering.&lt;br&gt;
Then I switched to Ollama, and only with vLLM did I finally understand what a real production-grade workload feels like.”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  ⚙️ 1. LM Studio — A Piloted Capsule for Local Missions
&lt;/h2&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;LM Studio is a desktop application with a local OpenAI-compatible API.&lt;br&gt;
It lets you work offline and run models directly on your laptop.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//lmstudio.ai/docs"&gt;lmstudio&lt;/a&gt;&lt;br&gt;
💻 Platforms: macOS, Windows, Linux (AppImage).&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Download and install from &lt;a href="https://lmstudio.ai/download"&gt;lmstudio.ai&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Caveats:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;GUI-only app — limited containerization;&lt;/li&gt;
&lt;li&gt;Experimental headless API;&lt;/li&gt;
&lt;li&gt;May overload CPU/GPU during long sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“LM Studio is a flight simulator — perfect for training,&lt;br&gt;
but it won’t take you into orbit.”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🚀 2. Ollama — A Maneuverable Shuttle for Edge Missions
&lt;/h2&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;An open-source CLI/desktop runtime for models like Mistral, Gemma, Phi-3, and Llama-3.&lt;br&gt;
It runs as a REST API and integrates easily into Docker.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//ollama.ai"&gt;ollama.ai&lt;/a&gt;&lt;br&gt;
💻 Platforms: macOS, Linux, Windows.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
ollama run llama3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or via Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 11434:11434 ollama/ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
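&lt;p&gt;Once the server is up, any stdlib HTTP client can talk to it. A minimal sketch against Ollama's &lt;code&gt;/api/generate&lt;/code&gt; REST endpoint (non-streaming, no error handling):&lt;/p&gt;

```python
import json
from urllib import request

def ollama_generate_payload(model, prompt):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ollama_generate(prompt, model="llama3", host="http://localhost:11434"):
    """Send a non-streaming generation request to a running Ollama server."""
    req = request.Request(
        host + "/api/generate",
        data=ollama_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires `ollama run llama3` or the container
        return json.loads(resp.read())["response"]
```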



&lt;h4&gt;
  
  
  When to use:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Local REST APIs and edge inference;&lt;/li&gt;
&lt;li&gt;CI/CD and microservices;&lt;/li&gt;
&lt;li&gt;Quick launches without complex dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Ollama is a light shuttle —&lt;br&gt;
it can launch from any planet, but it won’t carry heavy cargo.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  ☀️ 3. vLLM — A Reactor for Production-Grade Flights
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;vLLM is a high-performance runtime for LLM inference,&lt;br&gt;
optimized for GPUs, fully OpenAI-API compatible, and designed for scaling.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//github.com/vllm-project/vllm"&gt;vllm&lt;/a&gt;&lt;br&gt;&lt;br&gt;
💻 Platforms: Linux and major cloud providers (AWS, GCP, Azure).&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; meta-llama/Llama-3-8b-instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
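&lt;p&gt;Because vLLM exposes the OpenAI-compatible API, the client side is the same few lines whichever engine serves it. A stdlib-only sketch against the container above:&lt;/p&gt;

```python
import json
from urllib import request

def chat_payload(model, user_message, max_tokens=64):
    """Build an OpenAI-style chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }).encode()

def vllm_chat(user_message, model="meta-llama/Llama-3-8b-instruct",
              base="http://localhost:8000"):
    """Query a running vLLM server via its OpenAI-compatible endpoint."""
    req = request.Request(base + "/v1/chat/completions",
                          data=chat_payload(model, user_message),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```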



&lt;h4&gt;
  
  
  When to use:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Product APIs and AI platforms;&lt;/li&gt;
&lt;li&gt;Multi-user environments;&lt;/li&gt;
&lt;li&gt;High-speed, CUDA-optimized inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Caveats:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Requires NVIDIA GPU (CUDA ≥ 12.x);&lt;/li&gt;
&lt;li&gt;Not compatible with macOS (no GPU backend);&lt;/li&gt;
&lt;li&gt;Needs DevOps experience — monitoring, logging, version sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“vLLM is a deep-space reactor — built for interstellar journeys.&lt;br&gt;
But if you try to fire it up in your garage, it simply won’t ignite.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  🪐 The Mission Map — Which Engine to Choose
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmrz1ujv1oplh0yzayxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmrz1ujv1oplh0yzayxh.png" alt=" " width="665" height="1460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ Common pitfalls:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;LM Studio → limited containerization;&lt;/li&gt;
&lt;li&gt;Ollama → not all models available out of the box, though you can import from Hugging Face;&lt;/li&gt;
&lt;li&gt;vLLM → CUDA version mismatch causes kernel errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nlx7xmmqyt22mcf52y0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nlx7xmmqyt22mcf52y0.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🧩 Mission Debrief
&lt;/h3&gt;

&lt;p&gt;Every engine is built for its own orbit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LM Studio&lt;/strong&gt; — for solo flights and quick system checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — for agile edge missions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — for long-range, interstellar operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Sometimes an engineer’s mission isn’t to build a new engine —&lt;br&gt;
but to understand which existing one fits the current flight plan.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  🛰️ Previous Missions
&lt;/h3&gt;

&lt;p&gt;🚀 &lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;Prepared Meta’s CRAG Benchmark for Launch in Docker&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>🧑‍🚀 Mission Accomplished: How an Engineer-Astronaut Prepared Meta’s CRAG Benchmark for Launch in Docker</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 06 Nov 2025 11:04:46 +0000</pubDate>
      <link>https://forem.com/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6</link>
      <guid>https://forem.com/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Every ML system is like a spacecraft — powerful, intricate, and temperamental.&lt;br&gt;
But without telemetry, you have no idea where it’s headed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🌌 Introduction
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;CRAG (Comprehensive RAG Benchmark)&lt;/strong&gt; from Meta AI is the control panel for Retrieval-Augmented Generation systems.&lt;br&gt;
It measures how well model responses stay grounded in facts, remain robust under noise, and maintain contextual relevance.&lt;/p&gt;

&lt;p&gt;As is often the case with research projects, CRAG required &lt;strong&gt;engineering adaptation&lt;/strong&gt; to operate reliably in a modern environment:&lt;br&gt;
incompatible library versions, dependency conflicts, unclear paths, and manual launch steps.&lt;/p&gt;

&lt;p&gt;🧰 I wanted to bring CRAG to a state where it could be launched with a single command — no dependency chaos, no manual fixes.&lt;br&gt;
The result is a fully reproducible Dockerized environment, available here:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/astronaut27/CRAG_with_Docker"&gt;github.com/astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 What I Improved
&lt;/h2&gt;

&lt;p&gt;In the original build, several issues made CRAG difficult to run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔧 Conflicting library versions;&lt;/li&gt;
&lt;li&gt;⚙️ No unified, reproducible start-up workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, everything comes to life with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After building, two containers start automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🛰️ mock-api — an emulator for web search and Knowledge Graph APIs;&lt;/li&gt;
&lt;li&gt;🚀 crag-app — the main container with the benchmark and built-in baseline models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧱 Pre-Launch Preparation: Handling the Mission Artifacts
&lt;/h2&gt;

&lt;p&gt;Before firing up the Docker build, make sure all mission artifacts — the large data and model files — are present locally.&lt;/p&gt;

&lt;p&gt;Because CRAG includes files over 100 MB, it uses &lt;strong&gt;Git Large File Storage (LFS)&lt;/strong&gt;. Without these files, the container won't initialize.&lt;/p&gt;

&lt;p&gt;So the first command in your console is essentially &lt;strong&gt;fueling the ship&lt;/strong&gt; with data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git lfs pull
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧩 How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foigv0nsubou4mzm3i4uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foigv0nsubou4mzm3i4uc.png" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  📡 ⚙️ CRAG in Autonomous Mode
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;mock-API&lt;/em&gt;&lt;/strong&gt; — simulates external data sources (Web Search, KG API) used by the RAG system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;crag-app&lt;/em&gt;&lt;/strong&gt; — the main container running the benchmark and the model used for response generation (a dummy model at this stage).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;local_evaluation.py&lt;/em&gt;&lt;/strong&gt; — coordinates the pipeline, calls the mock API, and handles metric evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;ChatGPT&lt;/em&gt;&lt;/strong&gt; — serves as an LLM-assisted judge that evaluates generated responses by CRAG’s metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧠 What CRAG Measures: The Telemetry Dashboard
&lt;/h2&gt;

&lt;p&gt;CRAG reports quantitative indicators — a flight log of your system after a test mission:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;total&lt;/em&gt;&lt;/strong&gt;: Total number of evaluated examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_correct&lt;/em&gt;&lt;/strong&gt;: Count of responses that are fully supported by retrieved context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_hallucination&lt;/em&gt;&lt;/strong&gt;: Number of responses containing unsupported or invented facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_miss&lt;/em&gt;&lt;/strong&gt;: Responses missing key information or empty answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;accuracy / score&lt;/em&gt;&lt;/strong&gt;: Overall accuracy (the ratio of correct responses).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;hallucination&lt;/em&gt;&lt;/strong&gt;: Ratio = n_hallucination / total.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;missing&lt;/em&gt;&lt;/strong&gt;: Ratio = n_miss / total.&lt;/li&gt;
&lt;/ul&gt;
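&lt;p&gt;The ratios follow directly from the counts. A small sketch (the headline score convention, correct minus hallucinated with misses counting as zero, matches CRAG's published scoring, but treat the exact formula as an assumption here):&lt;/p&gt;

```python
def crag_summary(n_correct, n_hallucination, n_miss):
    """Derive CRAG-style ratios from raw counts."""
    total = n_correct + n_hallucination + n_miss
    return {
        "total": total,
        "accuracy": n_correct / total,
        "hallucination": n_hallucination / total,
        "missing": n_miss / total,
        # correct answers score +1, hallucinations -1, misses 0
        "score": (n_correct - n_hallucination) / total,
    }

stats = crag_summary(n_correct=60, n_hallucination=15, n_miss=25)  # score: 0.45
```

&lt;p&gt;The asymmetry is deliberate: a hallucinated answer is worse than an empty one, so the score punishes it twice as hard as a miss.&lt;/p&gt;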

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn5zvjv46heqbitc43ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn5zvjv46heqbitc43ev.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 These metrics are &lt;strong&gt;the sensors on your RAG ship’s dashboard&lt;/strong&gt;.&lt;br&gt;
If any of them start flashing red — it’s time to check the model’s engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧱 Docker Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Mock API service for RAG data&lt;/span&gt;
  &lt;span class="na"&gt;mock-api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../mock_api&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../deployments/Dockerfile.mock-api&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crag-mock-api&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../mock_api/cragkg:/app/cragkg&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PYTHONPATH=/app&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crag-network&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

  &lt;span class="c1"&gt;# CRAG application container&lt;/span&gt;
  &lt;span class="na"&gt;crag-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;..&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployments/Dockerfile.crag-app&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crag-app&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mock-api&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# OpenAI for evaluation (optional)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;
      &lt;span class="c1"&gt;# Mock API connection (Docker service)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CRAG_MOCK_API_URL=http://mock-api:8000&lt;/span&gt;
      &lt;span class="c1"&gt;# Evaluation model&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;EVALUATION_MODEL_NAME=${EVALUATION_MODEL_NAME:-gpt-4-0125-preview}&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Mount large data directories (read-only)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../data:/app/data:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../results:/app/results&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../example_data:/app/example_data:ro&lt;/span&gt;
      &lt;span class="c1"&gt;# Tokenizer (if needed)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../tokenizer:/app/tokenizer:ro&lt;/span&gt;
    &lt;span class="na"&gt;extra_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host.docker.internal:host-gateway"&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crag-network&lt;/span&gt;
    &lt;span class="na"&gt;stdin_open&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local_evaluation.py"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;crag-network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🪐 Why This Matters
&lt;/h2&gt;

&lt;p&gt;RAG systems are quickly becoming the &lt;strong&gt;core engines of modern LLM-based products&lt;/strong&gt;.&lt;br&gt;
CRAG allows engineers to evaluate their reliability and factual grounding before shipping to production.&lt;/p&gt;

&lt;p&gt;This Docker build transforms Meta AI’s research benchmark into a &lt;strong&gt;practical engineering environment&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 fully isolated and reproducible;&lt;/li&gt;
&lt;li&gt;🧠 runnable locally or in CI pipelines;&lt;/li&gt;
&lt;li&gt;🚀 easily extendable with your own models (for example, via LM Studio — coming in the next mission).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔭 The Next Mission
&lt;/h3&gt;

&lt;p&gt;Right now, CRAG runs on its built-in baselines — a test flight before mounting the real engine.&lt;br&gt;
The next step is integrating the &lt;strong&gt;LM Studio API&lt;/strong&gt; and evaluating a live LLM within the same container setup.&lt;br&gt;
That will be &lt;strong&gt;Mission II&lt;/strong&gt; 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14hvf10ag8ty80z3alh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14hvf10ag8ty80z3alh0.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧭 Mission Summary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Sometimes engineering magic isn’t about building a brand-new ship,&lt;br&gt;
but about preparing an existing one for its next flight.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CRAG now launches reliably, telemetry is stable, and the mission is a success.&lt;/p&gt;

&lt;p&gt;Next up: integrating LM Studio and real models.&lt;br&gt;
For now, the ship holds a steady course. 🪐&lt;/p&gt;

&lt;h4&gt;
  
  
  🔗 Mission Repository
&lt;/h4&gt;

&lt;p&gt;📦 &lt;a href="https://github.com/astronaut27/CRAG_with_Docker"&gt;github.com/astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📜 License&lt;br&gt;
CRAG is distributed under the MIT License, developed by Meta AI / Facebook Research.&lt;br&gt;
All modifications in &lt;a href="https://github.com/astronaut27/CRAG_with_Docker"&gt;CRAG_with_Docker&lt;/a&gt; preserve the original copyright notices.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
