<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: astronaut</title>
    <description>The latest articles on Forem by astronaut (@astronaut27).</description>
    <link>https://forem.com/astronaut27</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3594224%2F48ba6b28-a495-4630-8112-62f28ff8b5dc.png</url>
      <title>Forem: astronaut</title>
      <link>https://forem.com/astronaut27</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/astronaut27"/>
    <language>en</language>
    <item>
      <title>Prompt Management Is Infrastructure: Requirements, Tools, and Patterns</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Tue, 17 Mar 2026 17:00:56 +0000</pubDate>
      <link>https://forem.com/astronaut27/prompt-management-is-infrastructure-requirements-tools-and-patterns-32nn</link>
      <guid>https://forem.com/astronaut27/prompt-management-is-infrastructure-requirements-tools-and-patterns-32nn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Mission Log #6 — Prompt control center: from strings in code to a production-grade system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your LLM service keeps prompts in code or in a UI without strict version control, you're accumulating technical debt. Not the usual kind. This debt doesn't show up as stack traces. It shows up as silent quality drift: SLAs green, logs clean, and users increasingly getting irrelevant answers.&lt;/p&gt;

&lt;p&gt;In production, a prompt is the &lt;strong&gt;behavioral contract of your service&lt;/strong&gt;. It directly affects tool-calling accuracy, RAG faithfulness, latency distribution, inference cost, and downstream behavior.&lt;/p&gt;

&lt;p&gt;This article is not about prompt engineering (how to write a good prompt). It's about &lt;strong&gt;prompt management&lt;/strong&gt; — how to manage prompts as an engineer: version, deploy, roll back, observe, and avoid silent regressions.&lt;/p&gt;

&lt;p&gt;You'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What prompt management is and how it differs from prompt engineering.&lt;/li&gt;
&lt;li&gt;What production demands from prompt management (and what breaks when you ignore it).&lt;/li&gt;
&lt;li&gt;A maturity model: where your team is and what the next step is.&lt;/li&gt;
&lt;li&gt;Tools that address these requirements, and how each one maps onto them.&lt;/li&gt;
&lt;li&gt;Architectural patterns for embedding prompt management into your system.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Is Prompt Management (and What Are We Versioning?)
&lt;/h2&gt;

&lt;p&gt;Prompt management is the set of practices and tools for the full lifecycle of prompts: creation, versioning, testing, deployment, monitoring, and rollback.&lt;/p&gt;

&lt;p&gt;In production, a "prompt" is not a single text string. It's a &lt;strong&gt;composite artifact&lt;/strong&gt; of several components, each of which affects service behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Why we version it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;"You are a support agent..."&lt;/td&gt;
&lt;td&gt;Defines model behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot examples&lt;/td&gt;
&lt;td&gt;3 input→output pairs&lt;/td&gt;
&lt;td&gt;Affect format and quality of responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool schemas&lt;/td&gt;
&lt;td&gt;OpenAPI specs for function calling&lt;/td&gt;
&lt;td&gt;Define which tools the model can call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output schema&lt;/td&gt;
&lt;td&gt;JSON Schema for structured output&lt;/td&gt;
&lt;td&gt;Breaks downstream parsers when changed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference params&lt;/td&gt;
&lt;td&gt;model, temperature, max_tokens, top_p&lt;/td&gt;
&lt;td&gt;Affect latency, cost, response style&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt template&lt;/td&gt;
&lt;td&gt;Template with variables (&lt;code&gt;{{user_name}}&lt;/code&gt;, &lt;code&gt;{{context}}&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Logic for assembling the final prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing logic&lt;/td&gt;
&lt;td&gt;Which prompt for which tenant/use case&lt;/td&gt;
&lt;td&gt;Determines who sees which version&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Engineers often version only the system prompt text. But if someone changes a tool schema or bumps temperature from 0.3 to 0.9, system behavior changes just as much. In mature production systems, teams version the &lt;strong&gt;entire artifact&lt;/strong&gt;, not just the text.&lt;/p&gt;
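&lt;p&gt;A rough sketch of what versioning the entire artifact can look like: one frozen record whose version id hashes every component, so a temperature bump or a schema change produces a new version just like a text edit. Field names here are illustrative, not any specific tool's schema.&lt;/p&gt;

```python
import hashlib
import json
from dataclasses import dataclass, asdict

# Illustrative sketch: the full prompt artifact as one immutable record.
@dataclass(frozen=True)
class PromptArtifact:
    system_prompt: str
    few_shot_examples: tuple = ()
    tool_schemas: tuple = ()
    output_schema: str = ""          # JSON Schema as a string
    model: str = "gpt-4o"
    temperature: float = 0.3
    max_tokens: int = 1024
    template: str = "{system}\n\nContext: {context}"

    def version_id(self) -> str:
        # Hash every component, so an inference-param or schema change
        # yields a new version just like a text edit does.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = PromptArtifact(system_prompt="You are a support agent.")
v2 = PromptArtifact(system_prompt="You are a support agent.", temperature=0.9)
assert v1.version_id() != v2.version_id()  # params change = new version
```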




&lt;h2&gt;
  
  
  9 Requirements for Production-Grade Prompt Management
&lt;/h2&gt;

&lt;p&gt;These requirements come from working with production LLM systems. Each is described with a concrete failure mode — what actually breaks when the requirement isn't met.&lt;/p&gt;

&lt;p&gt;It helps to split them into three planes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Versioning&lt;/strong&gt;: version identity, diff, change history, reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery/Rollout&lt;/strong&gt;: labels, canary, version distribution, rollback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control/Governance&lt;/strong&gt;: eval gating, audit trail, trace linkage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1. Immutable versions
&lt;/h3&gt;

&lt;p&gt;Every prompt version is immutable and carries a unique &lt;code&gt;prompt_version_id&lt;/code&gt; (a content hash or an incremental id).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you can't tell which exact prompt version was live during an incident. "Someone changed the prompt last week, I think" is guesswork, not debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Labels / Aliases
&lt;/h3&gt;

&lt;p&gt;Named labels for routing prompt versions at runtime. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;By environment&lt;/strong&gt;: &lt;code&gt;production&lt;/code&gt;, &lt;code&gt;canary&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By model&lt;/strong&gt;: &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;claude-sonnet&lt;/code&gt;, &lt;code&gt;llama-3-70b&lt;/code&gt; — different prompts tuned for different LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By tenant/use case&lt;/strong&gt;: &lt;code&gt;tenant_acme&lt;/code&gt;, &lt;code&gt;support_flow&lt;/code&gt;, &lt;code&gt;sales_agent&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By experiment&lt;/strong&gt;: &lt;code&gt;experiment_v3_concise&lt;/code&gt;, &lt;code&gt;baseline&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The app requests a prompt by label, not by concrete version. That lets you change the version without changing code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: changing a prompt version means a full service deploy. Every text change goes through the full CI/CD pipeline.&lt;/p&gt;
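&lt;p&gt;A minimal in-process sketch of that label indirection (method names like &lt;code&gt;publish&lt;/code&gt; and &lt;code&gt;set_label&lt;/code&gt; are illustrative, not any vendor's API):&lt;/p&gt;

```python
# The app asks for a label; ops reassign the label to roll out or roll back.
class PromptRegistry:
    def __init__(self):
        self._versions = {}   # version_id -> prompt payload (immutable)
        self._labels = {}     # label -> version_id

    def publish(self, version_id, payload):
        if version_id in self._versions:
            raise ValueError("versions are immutable; publish a new id")
        self._versions[version_id] = payload

    def set_label(self, label, version_id):
        self._labels[label] = version_id   # instant rollout or rollback

    def get(self, label):
        return self._versions[self._labels[label]]

reg = PromptRegistry()
reg.publish("v2-abc123", "You are a support agent. Be concise.")
reg.publish("v3-def456", "You are a support agent. Be friendly.")
reg.set_label("production", "v2-abc123")
reg.set_label("canary", "v3-def456")
reg.set_label("production", "v3-def456")   # promote canary: no redeploy
```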

&lt;h3&gt;
  
  
  3. Evaluation gating
&lt;/h3&gt;

&lt;p&gt;A new prompt version goes through controlled validation before promotion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;domain-specific golden dataset,&lt;/li&gt;
&lt;li&gt;automated regression tests,&lt;/li&gt;
&lt;li&gt;offline comparison to baseline,&lt;/li&gt;
&lt;li&gt;(optional) LLM-based scoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Promotion is a deliberate decision, not a blind merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: every prompt change is a lottery. You can go a month without noticing that answer quality dropped 15%.&lt;/p&gt;
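&lt;p&gt;The gate itself can be sketched as a comparison against the baseline on a golden set. The exact-match scorer below is a stand-in for whatever domain metric you actually use; the threshold is an assumed policy, not a recommendation.&lt;/p&gt;

```python
# Assumed scoring function; real gates use an eval framework and a
# domain-specific golden dataset.
def exact_match_score(answers, golden):
    hits = sum(1 for a, g in zip(answers, golden) if a.strip() == g.strip())
    return hits / len(golden)

def gate(candidate_answers, baseline_answers, golden, min_delta=-0.02):
    cand = exact_match_score(candidate_answers, golden)
    base = exact_match_score(baseline_answers, golden)
    # Promote only if the candidate is not meaningfully worse than baseline.
    return (cand - base) >= min_delta

golden = ["42", "Paris", "escalate"]
baseline = ["42", "Paris", "refund"]      # 2/3 on the golden set
candidate = ["42", "Paris", "escalate"]   # 3/3
assert gate(candidate, baseline, golden) is True
assert gate(["no", "no", "no"], baseline, golden) is False
```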

&lt;h3&gt;
  
  
  4. Low-latency fetch
&lt;/h3&gt;

&lt;p&gt;Predictable time to fetch the prompt at runtime. In-memory cache on the hot path. The goal is to avoid putting a slow, uncached config dependency on the critical request path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: prompt management becomes a single point of failure. If the config service responds in 500ms instead of 5ms, your TTFT is already broken.&lt;/p&gt;
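&lt;p&gt;A sketch of the hot-path cache with a stale fallback. The TTL and error handling are illustrative choices, not a prescription:&lt;/p&gt;

```python
import time

# On a cache hit the request never waits on the config service;
# on a failed refresh it serves the stale copy instead of failing.
class CachedPromptClient:
    def __init__(self, fetch_fn, ttl_seconds=60.0):
        self._fetch = fetch_fn
        self._ttl = ttl_seconds
        self._cache = {}   # label -> (payload, fetched_at)

    def get(self, label):
        entry = self._cache.get(label)
        now = time.monotonic()
        expired = entry is None or now - entry[1] > self._ttl
        if expired:
            try:
                payload = self._fetch(label)
                self._cache[label] = (payload, now)
            except Exception:
                if entry is None:
                    raise   # cold start with the service down: nothing to serve
                return entry[0]   # serve stale rather than break TTFT
        return self._cache[label][0]

calls = []
def slow_fetch(label):
    calls.append(label)
    return f"prompt for {label}"

client = CachedPromptClient(slow_fetch, ttl_seconds=60.0)
client.get("production")
client.get("production")           # served from memory, no second fetch
assert calls == ["production"]
```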

&lt;h3&gt;
  
  
  5. Audit trail
&lt;/h3&gt;

&lt;p&gt;Who changed what, when, and why, recorded as a commit message plus metadata on every change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: after an incident you run a detective investigation instead of root-cause analysis. "Who changed the support prompt?" shouldn't take more than 10 seconds to answer.&lt;/p&gt;
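&lt;p&gt;The record itself can be as simple as an append-only log entry written alongside every label change; the fields here are assumptions, adjust to your own governance needs:&lt;/p&gt;

```python
import datetime

audit_log = []

# Every mutating operation appends who/what/when/why before it takes effect.
def reassign_label(label, version_id, actor, reason):
    audit_log.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": "reassign_label",
        "label": label,
        "version_id": version_id,
        "actor": actor,
        "reason": reason,
    })

reassign_label("production", "v2-abc123", "alice",
               "rollback: tone regression in canary")
assert audit_log[-1]["actor"] == "alice"
```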

&lt;h3&gt;
  
  
  6. Trace linkage
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;prompt_version_id&lt;/code&gt; attached to every trace/span. Correlation with metrics: latency, tool-call success rate, semantic failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you see quality degrade but can't tie it to a specific prompt version. Observability without trace linkage is dashboards for the sake of dashboards.&lt;/p&gt;
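&lt;p&gt;A sketch of the linkage, with a plain dict standing in for a real tracing span (a production setup would set this as, for example, an OpenTelemetry span attribute):&lt;/p&gt;

```python
# Tag every generation span with the prompt version so dashboards can
# group latency and failure metrics per version.
def start_generation_span(trace_id, prompt_version_id, model):
    return {
        "trace_id": trace_id,
        "prompt_version_id": prompt_version_id,  # the correlation key
        "model": model,
        "attributes": {},
    }

span = start_generation_span("trace-001", "v2-abc123", "gpt-4o")
span["attributes"]["latency_ms"] = 840
span["attributes"]["tool_call_success"] = True

# "Did v3 regress tool calling?" becomes a group-by, not archaeology.
assert span["prompt_version_id"] == "v2-abc123"
```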

&lt;h3&gt;
  
  
  7. Rollback without downtime
&lt;/h3&gt;

&lt;p&gt;Reassign a label → fast rollback without redeploy or service restart (within your propagation window).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: recovery time after a bad prompt equals full deploy time (minutes or hours instead of seconds). In agent systems with dozens of prompts, that's critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Structured schema support
&lt;/h3&gt;

&lt;p&gt;Version not only text but tool schemas, output constraints, and templating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: you track prompt text, someone quietly changes the output schema, and the downstream parser breaks. Half the artifact is out of control.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. GitOps-friendly or API-driven workflow
&lt;/h3&gt;

&lt;p&gt;Infra and product teams work in parallel without overwriting each other. Prompts are managed via Git (PR, review) or via API (SDK, UI).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without it&lt;/strong&gt;: two people edit the same prompt in the UI → last save wins, wiping the first person's changes. Familiar Google Docs pain, but with production impact.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00htb8giltooe9jwb827.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00htb8giltooe9jwb827.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Maturity Model: Where Are You Now?
&lt;/h2&gt;

&lt;p&gt;Not every system needs Level 4. The point is to know your current level and choose the next step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 0 — Strings in code
&lt;/h3&gt;

&lt;p&gt;Prompts live as literals in code or hardcoded in the UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Typical Level 0
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;No explicit versions (only git blame if you're lucky).&lt;/li&gt;
&lt;li&gt;Rollback = git revert + full deploy.&lt;/li&gt;
&lt;li&gt;Debugging means reading the code to see what's there, but production may be running a different build.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: minimal code-level audit trail and version history in Git; almost none of the runtime requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1 — Git-based prompts
&lt;/h3&gt;

&lt;p&gt;Prompts live in separate files (YAML, JSON, Markdown) and are versioned in Git.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# prompts/support_agent/v2.yaml&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;support_agent&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
&lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
&lt;span class="na"&gt;system_prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;You are a support agent for {{product_name}}.&lt;/span&gt;
  &lt;span class="s"&gt;Always check the knowledge base before answering.&lt;/span&gt;
  &lt;span class="s"&gt;If unsure, escalate to a human.&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;search_kb&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;create_ticket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change history and PR review.&lt;/li&gt;
&lt;li&gt;Audit trail via git log.&lt;/li&gt;
&lt;li&gt;Rollback still via deploy (git revert → CI → deploy).&lt;/li&gt;
&lt;li&gt;No runtime labels/aliases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: immutable history in Git, audit trail (git log), GitOps workflow, structured schema (if the file holds all components). Immutable runtime artifacts only appear when you explicitly build and publish versioned artifacts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2 — Config store + labels
&lt;/h3&gt;

&lt;p&gt;Prompts live in a key-value store (Redis, Postgres, DynamoDB, internal config service) with label support.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/prompts/support_agent?label=production
→ { version_id: "v2-abc123", system_prompt: "...", tools: [...] }

GET /v1/prompts/support_agent?label=canary
→ { version_id: "v3-def456", system_prompt: "...", tools: [...] }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Runtime routing by alias.&lt;/li&gt;
&lt;li&gt;Changing the production version without deploy (reassign label).&lt;/li&gt;
&lt;li&gt;In-memory cache on the client + background refresh.&lt;/li&gt;
&lt;li&gt;No built-in eval gating.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: immutability, labels, low-latency fetch, rollback, audit trail (if you keep it), GitOps/API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 3 — Dedicated prompt management platform
&lt;/h3&gt;

&lt;p&gt;A dedicated platform: UI for version management, diffs between versions, built-in tracing, and observability integrations.&lt;/p&gt;

&lt;p&gt;Examples: Langfuse, Braintrust, MLflow Prompt Registry, PromptLayer, LangSmith.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI for comparing versions, promoting, rolling back.&lt;/li&gt;
&lt;li&gt;Observability integration (trace linkage).&lt;/li&gt;
&lt;li&gt;A/B testing and canary rollouts.&lt;/li&gt;
&lt;li&gt;Non-engineers (product, domain experts) can edit prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: all 9 requirements to varying degrees (platform-dependent).&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 4 — Full prompt ops
&lt;/h3&gt;

&lt;p&gt;Single pipeline: create → eval → offline comparison → canary rollout → monitoring → auto-rollback.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt management is part of CI/CD and the eval pipeline.&lt;/li&gt;
&lt;li&gt;Evaluation gating built into the promotion process.&lt;/li&gt;
&lt;li&gt;Automatic alerts when metrics degrade for a given prompt_version.&lt;/li&gt;
&lt;li&gt;A prompt doesn't reach production until it passes the golden set and regression tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Covers&lt;/strong&gt;: all 9 requirements plus automated eval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tool Overview
&lt;/h2&gt;

&lt;p&gt;Not a feature list — a mapping onto the 9 requirements. The focus is on infrastructure needs, not marketing features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Langfuse
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: LLM observability + prompt management platform, open-source / open-core. After the ClickHouse merger, the project kept an open core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Versioning with labels (&lt;code&gt;production&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, custom).&lt;/li&gt;
&lt;li&gt;Client-side cache — prompt is fetched once, then served from memory. No extra latency on requests.&lt;/li&gt;
&lt;li&gt;Trace linkage: &lt;code&gt;prompt_version_id&lt;/code&gt; attached to every trace.&lt;/li&gt;
&lt;li&gt;Self-hosted option (Docker) — important for compliance and data-sensitive systems.&lt;/li&gt;
&lt;li&gt;Open-source/open-core: most core features are open; some capabilities depend on the commercial plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI for non-engineers is less polished than more product-centric platforms.&lt;/li&gt;
&lt;li&gt;Eval gating has to be built separately (via integration with eval frameworks).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓, eval gating ~, low-latency ✓, audit trail ✓, trace linkage ✓, rollback ✓, schema ~, GitOps/API ✓.&lt;/p&gt;

&lt;h3&gt;
  
  
  MLflow Prompt Registry
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Part of the MLflow GenAI ecosystem. Git-inspired versioning for prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immutable versions + aliasing (Git-inspired).&lt;/li&gt;
&lt;li&gt;Lineage tracking — link prompts to model runs and eval results.&lt;/li&gt;
&lt;li&gt;Natural fit for teams already on MLflow/Databricks.&lt;/li&gt;
&lt;li&gt;Template support with variables (&lt;code&gt;{{variable}}&lt;/code&gt;), conversion to LangChain/LlamaIndex formats.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tightly coupled to the MLflow ecosystem; if you're not on Databricks/MLflow, you pay integration overhead.&lt;/li&gt;
&lt;li&gt;Not a standalone observability platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓ (aliases), eval gating ✓ (via MLflow evaluate), low-latency ~, audit trail ✓, trace linkage ~ (via MLflow tracking), rollback ✓, schema ✓, GitOps ~ (custom scripts).&lt;/p&gt;

&lt;h3&gt;
  
  
  Braintrust
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: AI observability platform with prompt management, eval, and production monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environments: development → staging → production with quality gates.&lt;/li&gt;
&lt;li&gt;Bidirectional sync between code (SDK) and UI (playground) — engineers and product work in parallel.&lt;/li&gt;
&lt;li&gt;GitHub Actions integration: eval in CI, blocking deployments, PR comments.&lt;/li&gt;
&lt;li&gt;Prompt playground for testing on real data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SaaS-first: deployment and data-plane options depend on enterprise setup and contracts.&lt;/li&gt;
&lt;li&gt;Platform lock-in and migration cost if you switch vendors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓ (environments), eval gating ✓, low-latency ✓, audit trail ✓, trace linkage ✓, rollback ✓, schema ✓, GitOps ✓.&lt;/p&gt;

&lt;h3&gt;
  
  
  PromptLayer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Lightweight tool for logging and versioning LLM calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easiest integration (&amp;lt; 30 minutes, a few lines of code).&lt;/li&gt;
&lt;li&gt;Prompt registry: prompts stored outside code, deployed via API.&lt;/li&gt;
&lt;li&gt;Release labels and dynamic labels for runtime routing.&lt;/li&gt;
&lt;li&gt;Basic eval and version comparison.&lt;/li&gt;
&lt;li&gt;Low barrier to entry; good for getting started.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less depth on observability and governance than full-stack LLMOps platforms.&lt;/li&gt;
&lt;li&gt;Teams tend to outgrow it quickly as system complexity grows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ✓, eval gating ~, low-latency ~, audit trail ✓, trace linkage ~, rollback ✓, schema ~, GitOps ~.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangSmith
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: LangChain platform for tracing, eval, and prompt management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deep integration with LangChain/LangGraph.&lt;/li&gt;
&lt;li&gt;Hub for sharing and versioning prompts.&lt;/li&gt;
&lt;li&gt;Evaluation + dataset management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tied to the LangChain ecosystem (though an SDK and an API are available).&lt;/li&gt;
&lt;li&gt;Commercial product: deployment modes and enterprise features depend on plan and contract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;: immutability ✓, labels ~, eval gating ✓, low-latency ~, audit trail ✓, trace linkage ✓, rollback ~, schema ✓, GitOps ~.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Langfuse&lt;/th&gt;
&lt;th&gt;MLflow&lt;/th&gt;
&lt;th&gt;Braintrust&lt;/th&gt;
&lt;th&gt;PromptLayer&lt;/th&gt;
&lt;th&gt;LangSmith&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immutability&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Labels/Aliases&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval Gating&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency Fetch&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit Trail&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trace Linkage&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured Schema&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitOps/API&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;✓ = full support, ~ = partial or needs extra setup, ✗ = no.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The table reflects public docs and typical production scenarios at the time of writing. Before a real decision, check each vendor's current plan limits, licensing, and deployment modes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architectural Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Git-native
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;prompts/&lt;/span&gt;
  &lt;span class="s"&gt;support_agent/&lt;/span&gt;
    &lt;span class="s"&gt;v1.yaml&lt;/span&gt;
    &lt;span class="s"&gt;v2.yaml&lt;/span&gt;
  &lt;span class="s"&gt;code_review/&lt;/span&gt;
    &lt;span class="s"&gt;v1.yaml&lt;/span&gt;
  &lt;span class="s"&gt;registry.yaml     ← index&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;which label points to which version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI builds prompts into an artifact (JSON bundle, SQLite, Redis snapshot). The service loads the artifact at startup.&lt;/p&gt;
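&lt;p&gt;The build step can be sketched as resolving the registry's labels into one immutable bundle. The layout and field names follow the tree above; everything is in-memory here for brevity, where a real CI job would read the YAML files and write the bundle to an artifact store.&lt;/p&gt;

```python
import json

# Stand-ins for the parsed prompt files and registry.yaml above.
prompt_files = {
    "support_agent/v1": {"system_prompt": "v1 text", "temperature": 0.3},
    "support_agent/v2": {"system_prompt": "v2 text", "temperature": 0.3},
    "code_review/v1": {"system_prompt": "review text", "temperature": 0.0},
}
registry = {
    "support_agent": {"production": "v2", "canary": "v2"},
    "code_review": {"production": "v1"},
}

def build_bundle(files, registry):
    # Fail the CI build if a label points at a version that doesn't exist.
    bundle = {}
    for name, labels in registry.items():
        for label, version in labels.items():
            key = f"{name}/{version}"
            if key not in files:
                raise KeyError(f"registry points at missing version: {key}")
            bundle[f"{name}:{label}"] = files[key]
    return json.dumps(bundle, sort_keys=True)

artifact = build_bundle(prompt_files, registry)
assert "support_agent:production" in artifact
```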

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Familiar workflow (PR, review, CI)&lt;/td&gt;
&lt;td&gt;Rollback = new deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full audit trail&lt;/td&gt;
&lt;td&gt;Non-engineers can't edit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No runtime dependencies&lt;/td&gt;
&lt;td&gt;No runtime labels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No extra cost&lt;/td&gt;
&lt;td&gt;Eval gating built from scratch&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams of 1–5 engineers, early stage, few prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Config service (internal)
&lt;/h3&gt;

&lt;p&gt;Your own service with REST/gRPC API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/prompts/{name}?label=production
POST /v1/prompts/{name}/versions   ← create version
PUT /v1/prompts/{name}/labels      ← reassign label
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Storage: Postgres / DynamoDB. Clients: SDK with in-memory cache + background polling (TTL 30–60 sec).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;td&gt;Build and maintain it yourself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime labels + rollback&lt;/td&gt;
&lt;td&gt;Another service in the stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency (your cache)&lt;/td&gt;
&lt;td&gt;You build the UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No vendor lock-in&lt;/td&gt;
&lt;td&gt;Eval gating is a separate concern&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Consistency note&lt;/strong&gt;: with background polling and TTL 30–60 sec, after reassigning a label different instances can run on &lt;strong&gt;different prompt versions&lt;/strong&gt; for up to a minute. For most LLM use cases eventual consistency is fine. For safety-critical systems you need a push mechanism (webhook/event) or a shorter TTL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: mid-size and larger teams that care about control and have capacity for infra.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Managed platform (SaaS)
&lt;/h3&gt;

&lt;p&gt;Langfuse Cloud / Braintrust / LangSmith — prompts managed via the platform's UI and SDK.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast to start&lt;/td&gt;
&lt;td&gt;Runtime dependency on SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI for non-engineers&lt;/td&gt;
&lt;td&gt;Vendor lock-in (as with Humanloop, which was discontinued)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval, tracing, A/B out of the box&lt;/td&gt;
&lt;td&gt;Cost at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No infra to build&lt;/td&gt;
&lt;td&gt;Data residency constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Critical question&lt;/strong&gt;: what happens when the SaaS is down? The client SDK must have a fallback (last known good version from cache). Without it, SaaS downtime = your service downtime.&lt;/p&gt;
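&lt;p&gt;One way to sketch that fallback: persist the last known good version to local disk, so even a process restart during platform downtime has something to serve. The path and format are illustrative, not any vendor's SDK behavior.&lt;/p&gt;

```python
import json
import pathlib
import tempfile

class DurableFallback:
    def __init__(self, fetch_fn, cache_dir):
        self._fetch = fetch_fn
        self._dir = pathlib.Path(cache_dir)

    def get(self, label):
        path = self._dir / f"{label}.json"
        try:
            payload = self._fetch(label)
            path.write_text(json.dumps(payload))   # refresh last known good
            return payload
        except Exception:
            if path.exists():
                return json.loads(path.read_text())   # platform is down
            raise

def flaky_fetch(label):
    raise ConnectionError("SaaS is down")

with tempfile.TemporaryDirectory() as d:
    healthy = DurableFallback(lambda label: {"v": "v2"}, d)
    healthy.get("production")                   # healthy: fetch and persist
    offline = DurableFallback(flaky_fetch, d)   # e.g. after a restart
    recovered = offline.get("production")

assert recovered == {"v": "v2"}
```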

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that need a quick start and non-engineer access, and accept the risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 4: Hybrid (Git + platform)
&lt;/h3&gt;

&lt;p&gt;Git is source of truth. CI syncs prompts into the platform (Langfuse, Braintrust). The platform handles runtime delivery and observability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer → Git PR → Review → Merge → CI syncs to Platform → Runtime fetch via SDK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code review + runtime flexibility&lt;/td&gt;
&lt;td&gt;Sync complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail in Git&lt;/td&gt;
&lt;td&gt;Drift between Git and platform possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-engineers see result in UI&lt;/td&gt;
&lt;td&gt;Two sources of truth when things go wrong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime labels + rollback&lt;/td&gt;
&lt;td&gt;Extra CI plumbing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Failure modes to plan for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drift&lt;/strong&gt;: CI sync fails, Git moves ahead, platform serves an old version. Engineer thinks the prompt is updated — service is still on the previous one. Mitigation: check &lt;code&gt;prompt_hash&lt;/code&gt; on the platform side + alert on mismatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ownership&lt;/strong&gt;: if non-engineers can edit prompts directly in the platform UI, bypassing Git, Git is no longer the single source of truth. Either block direct edits in the UI or implement reverse sync (platform → Git), which is much more complex.&lt;/li&gt;
&lt;/ul&gt;
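
&lt;p&gt;The &lt;code&gt;prompt_hash&lt;/code&gt; check itself can be as small as this sketch (function names are illustrative):&lt;/p&gt;

```python
import hashlib

def prompt_hash(body: str) -> str:
    # Normalize trailing whitespace so cosmetic diffs don't trigger alerts.
    return hashlib.sha256(body.strip().encode()).hexdigest()[:12]

def check_drift(git_prompts: dict, platform_prompts: dict) -> list:
    """Return prompt names whose Git content does not match what the
    platform is serving. Run after every CI sync and on a schedule."""
    drifted = []
    for name, body in git_prompts.items():
        served = platform_prompts.get(name)
        if served is None or prompt_hash(served) != prompt_hash(body):
            drifted.append(name)
    return drifted

git = {"router": "Route the request.", "judge": "Score the answer v2."}
platform = {"router": "Route the request.", "judge": "Score the answer v1."}
assert check_drift(git, platform) == ["judge"]  # alert on this mismatch
```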

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that want Git review plus runtime flexibility. Most mature pattern, and the hardest to operate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 5: Feature flags
&lt;/h3&gt;

&lt;p&gt;Prompt versions are managed as feature flags in your existing system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Granular rollout (5% → 50% → 100%)&lt;/td&gt;
&lt;td&gt;Flag systems aren't built for long text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instant rollback (toggle off)&lt;/td&gt;
&lt;td&gt;With dozens of prompts, flag sprawl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A/B testing out of the box&lt;/td&gt;
&lt;td&gt;No diffs between prompt versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Familiar if you already use it&lt;/td&gt;
&lt;td&gt;Prompts still need to live somewhere&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: teams that already have feature-flag infra and need granular rollout. Works well as a &lt;strong&gt;complement&lt;/strong&gt; to other patterns (e.g. Git-native + flags for rollout), not as the only mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime delivery: 3 questions for any pattern
&lt;/h3&gt;

&lt;p&gt;Whatever pattern you pick, answer these before production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How does the prompt reach runtime?&lt;/strong&gt; Polling with TTL, push via webhook/event, or baked in at deploy? This determines how fast changes propagate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens if the prompt source is unavailable?&lt;/strong&gt; Fallback from local cache (stale-while-revalidate) or hard failure? Without fallback you add a single point of failure on the hot path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How quickly do all instances see the new version?&lt;/strong&gt; Eventual consistency (seconds–minutes) or strong? For most LLM use cases eventual is enough, but you must know your consistency window.&lt;/li&gt;
&lt;/ol&gt;
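
&lt;p&gt;To make question 3 concrete: with polling-with-TTL delivery, the consistency window is exactly the TTL. A minimal getter (names are illustrative):&lt;/p&gt;

```python
import time

class PolledPrompt:
    """Poll-with-TTL delivery: each instance refreshes at most once per
    `ttl` seconds, so a label reassignment propagates within one TTL."""

    def __init__(self, fetch_fn, ttl=30.0):
        self._fetch, self._ttl = fetch_fn, ttl
        self._value, self._fetched_at = None, 0.0

    def get(self, now=None):
        now = time.time() if now is None else now
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()      # refresh from the source
            self._fetched_at = now
        return self._value                   # possibly up to `ttl` stale

versions = iter(["v1", "v2"])
p = PolledPrompt(lambda: next(versions), ttl=30.0)
assert p.get(now=0.0) == "v1"
assert p.get(now=10.0) == "v1"   # inside the consistency window: still v1
assert p.get(now=31.0) == "v2"   # after the TTL: new version picked up
```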

&lt;p&gt;Each of these is an engineering concern borrowed from distributed config propagation. A deeper treatment — caching patterns, failure modes, examples — deserves a separate post.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Choose: Decision Framework
&lt;/h2&gt;

&lt;p&gt;Don't choose by feature list. Choose by four questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Who edits prompts?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only engineers → Git-native or config service.&lt;/li&gt;
&lt;li&gt;Product/domain experts too → Platform or hybrid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. How fast must rollback be?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seconds → you need runtime labels (Level 2+).&lt;/li&gt;
&lt;li&gt;Minutes via CI is acceptable → Git-native is enough.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. How many prompts and how often do they change?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 prompts, change once a month → Git-native.&lt;/li&gt;
&lt;li&gt;50+ prompts, change weekly → Platform or hybrid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Data residency and compliance?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data must stay in region / on-premise → self-hosted (Langfuse, MLflow) or your own config service.&lt;/li&gt;
&lt;li&gt;No constraints → SaaS is fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For enterprise teams, (4) is often the &lt;strong&gt;first filter&lt;/strong&gt; and rules out half the options immediately.&lt;/p&gt;
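
&lt;p&gt;The four questions can even be encoded as a first-cut filter. The thresholds below are illustrative, not prescriptive:&lt;/p&gt;

```python
def recommend_pattern(editors_are_engineers: bool,
                      rollback_in_seconds: bool,
                      prompt_count: int,
                      data_must_stay_onprem: bool) -> str:
    """First-cut mapping of the four questions to a pattern.
    Note the order: compliance filters first, as in the text."""
    if data_must_stay_onprem:
        return "self-hosted platform or own config service"
    if not editors_are_engineers or prompt_count >= 50:
        return "managed platform or hybrid (Git + platform)"
    if rollback_in_seconds:
        return "config service with runtime labels"
    return "Git-native"

assert recommend_pattern(True, False, 5, False) == "Git-native"
assert "self-hosted" in recommend_pattern(False, True, 100, True)
```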




&lt;h2&gt;
  
  
  Insight
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Prompt management is a new infrastructure layer. It's closest to config management and feature flags, but with a twist: prompt semantics are opaque and the impact of changes is probabilistic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don't need to build Level 4 right away. See where you are and pick &lt;strong&gt;one&lt;/strong&gt; next step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At Level 0? → Move prompts to files and introduce &lt;code&gt;prompt_version_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;At Level 1? → Add runtime labels and rollback without deploy.&lt;/li&gt;
&lt;li&gt;At Level 2? → Add eval gating and trace linkage.&lt;/li&gt;
&lt;li&gt;At Level 3? → Automate the promotion pipeline.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you already run prompt management in production — what approach did you choose and what pitfalls did you hit?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Design Recipe: Observability Pyramid for LLM Infrastructure</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 05 Feb 2026 08:44:43 +0000</pubDate>
      <link>https://forem.com/astronaut27/design-recipe-observability-pyramid-for-llm-infrastructure-3b5l</link>
      <guid>https://forem.com/astronaut27/design-recipe-observability-pyramid-for-llm-infrastructure-3b5l</guid>
      <description>&lt;p&gt;In classic backend systems, we are used to determinism: code either works or crashes with a clear stack trace. In LLM systems, we deal with "soft failures" — the system runs fast and without log errors, but outputs hallucinations or irrelevant context.&lt;/p&gt;

&lt;p&gt;As an engineer with a highload and distributed systems background, I like to view the system as a conveyor with measurable efficiency at each stage. For this, I use the &lt;strong&gt;Observability Pyramid&lt;/strong&gt;, where each layer protects the next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff75eneriwv4ubbmlx40v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff75eneriwv4ubbmlx40v.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. System Layer: Telemetry and SRE Basics
&lt;/h2&gt;

&lt;p&gt;Without this layer, the others make no sense. If you don't meet SLAs for availability and speed, response accuracy doesn't matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (Time to First Token): the main metric for UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPOT&lt;/strong&gt; (Time Per Output Token): generation stability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens/Sec &amp;amp; Input/Output Ratio&lt;/strong&gt;: critical for capacity planning and understanding KV-cache load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Engineering Approach:&lt;/strong&gt; Monitor inference engines (vLLM/TGI) via Prometheus/Grafana and OpenTelemetry (OpenLLMetry).&lt;/p&gt;
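
&lt;p&gt;If your stack doesn't export TTFT/TPOT directly, both can be derived from any streaming token iterator. A minimal sketch, not tied to a specific engine:&lt;/p&gt;

```python
import time

def stream_metrics(token_stream):
    """Compute TTFT and TPOT from any token iterator (for example,
    a streaming OpenAI-compatible response). Times are in seconds."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    # TPOT: average inter-token latency after the first token.
    tpot = (end - first_token_at) / max(count - 1, 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens": count}

def fake_stream():
    # Stand-in for a real model stream: three tokens, 10 ms apart.
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok

m = stream_metrics(fake_stream())
assert m["tokens"] == 3 and m["ttft_s"] > 0 and m["tpot_s"] > 0
```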

&lt;p&gt;For details on profiling the engine and finding bottlenecks — see my article:&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;LLM Engine Telemetry: How to profile models&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Retrieval Layer: Data Hygiene (RAG Triad)
&lt;/h2&gt;

&lt;p&gt;Most hallucinations stem from poor retrieval. RAG evaluation should be decomposed into three components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Context Precision&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
How relevant are the retrieved chunks? Noise distracts the model and wastes tokens.&lt;br&gt;&lt;br&gt;
Tools: RAGAS, DeepEval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B. Context Recall&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Does the retrieved set contain the factual answer?&lt;br&gt;&lt;br&gt;
Practice: You need a "golden standard" — a labeled dataset. I use &lt;a href="https://github.com/facebookresearch/CRAG" rel="noopener noreferrer"&gt;Meta CRAG&lt;/a&gt; because it simulates real-world chaos and dynamically changing data.&lt;br&gt;&lt;br&gt;
See my guide on local CRAG evaluation &lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C. Faithfulness&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Is the answer derived from the context or hallucinated? &lt;/p&gt;

&lt;p&gt;A judge model checks every claim in the response against the provided source.&lt;/p&gt;
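
&lt;p&gt;The core of that check can be sketched in a few lines. The stub judge below uses substring matching purely for illustration; a real judge model performs semantic entailment, and RAGAS/DeepEval wrap this same claim-by-claim pattern:&lt;/p&gt;

```python
def faithfulness_score(claims, context, judge_fn):
    """Fraction of claims in the answer that the judge says are
    supported by the retrieved context."""
    if not claims:
        return 1.0
    supported = sum(1 for c in claims if judge_fn(claim=c, context=context))
    return supported / len(claims)

# Stub judge: "supported" means the claim text appears in the context.
# A real judge model does semantic entailment, not substring matching.
stub_judge = lambda claim, context: claim.lower() in context.lower()

context = "The Eiffel Tower is 330 metres tall and stands in Paris."
claims = ["the eiffel tower is 330 metres tall", "it was built in 1889"]
assert faithfulness_score(claims, context, stub_judge) == 0.5
```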

&lt;h2&gt;
  
  
  3. Semantic Layer: LLM-as-a-Judge at Scale
&lt;/h2&gt;

&lt;p&gt;This level checks logic. The main challenge is balancing evaluation quality with cost/speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Gating&lt;/strong&gt;: Full run on a reference dataset. If Faithfulness drops below 0.8 — block deployment (tune the threshold for your domain).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Sampling&lt;/strong&gt;: In highload systems, evaluating 100% of traffic via GPT-4o is financial suicide. Use sampling (1–5%).
Additionally: implement &lt;strong&gt;judge caching&lt;/strong&gt; (&lt;a href="https://github.com/zilliztech/GPTCache" rel="noopener noreferrer"&gt;GPTCache&lt;/a&gt;, LangChain cache, or vLLM prefix caching). This is especially effective when users ask similar questions — the same prompt+context may be evaluated multiple times, but you pay only once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Judges&lt;/strong&gt;: Instead of "naked" small models (which often struggle with logic), use Prometheus-2 or Flow-Judge. They are trained specifically for evaluation tasks, comparable in quality to GPT-4, and can be hosted locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-band Eval&lt;/strong&gt;: In production, evaluation always runs asynchronously to avoid increasing main request latency.&lt;/li&gt;
&lt;/ul&gt;
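
&lt;p&gt;The first two practices fit in a few lines. The 0.8 threshold and the 2% rate below are the illustrative values from the list above:&lt;/p&gt;

```python
import random

FAITHFULNESS_THRESHOLD = 0.8   # tune for your domain, as noted above

def ci_gate(eval_scores):
    """Block deployment when mean faithfulness on the reference
    dataset drops below the threshold."""
    mean = sum(eval_scores) / len(eval_scores)
    return mean >= FAITHFULNESS_THRESHOLD

def should_sample(rate=0.02, rng=random):
    """Per-request decision: send this trace to the judge or not.
    2% keeps judge cost bounded in high-traffic systems."""
    return rng.random() < rate

assert ci_gate([0.9, 0.85, 0.88]) is True
assert ci_gate([0.9, 0.5, 0.6]) is False     # deployment blocked
rng = random.Random(0)
sampled = sum(should_sample(0.02, rng) for _ in range(10_000))
assert 100 < sampled < 300                   # roughly 2% of traffic
```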

&lt;h2&gt;
  
  
  Diagnostic Map: What to Fix?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;If Dropped, Problem In:&lt;/th&gt;
&lt;th&gt;Action Plan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context Recall&lt;/td&gt;
&lt;td&gt;Embeddings / Indexing&lt;/td&gt;
&lt;td&gt;Switch embedding model, implement Hybrid Search (Vector + Keyword)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Precision&lt;/td&gt;
&lt;td&gt;Chunking / Noise&lt;/td&gt;
&lt;td&gt;Add Reranker (Cross-Encoder), revise Chunking Strategy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faithfulness&lt;/td&gt;
&lt;td&gt;Temperature / Context&lt;/td&gt;
&lt;td&gt;Lower Temperature, strengthen system prompt, check chunk integrity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT (Latency)&lt;/td&gt;
&lt;td&gt;Hardware / Load&lt;/td&gt;
&lt;td&gt;Check Cache Hit Rate, enable quantization or PagedAttention&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Implementation Plan (Checklist)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instrument (Day 0)&lt;/strong&gt;: Set up export of metrics and traces (vLLM + OpenTelemetry).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden Set&lt;/strong&gt;: Collect 50–100 critical cases. Use Meta CRAG structure as reference (details in my article &lt;a href="https://dev.to/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k"&gt;Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate&lt;/strong&gt;: Integrate DeepEval/RAGAS into GitHub Actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sampling &amp;amp; Feedback&lt;/strong&gt;: Set up log and user feedback collection (thumbs up/down) for gray-zone analysis in Arize Phoenix or LangSmith.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For an experienced engineer, an LLM system is just another probabilistic node in a distributed architecture. Our job is to surround it with sensors so its behavior becomes predictable — like the trajectory of a rocket on a verified orbit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fak7booeooz6xzliaod2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fak7booeooz6xzliaod2g.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>aiops</category>
      <category>rag</category>
    </item>
    <item>
      <title>Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Tue, 30 Dec 2025 15:31:54 +0000</pubDate>
      <link>https://forem.com/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k</link>
      <guid>https://forem.com/astronaut27/build-your-own-spaceport-local-rag-evaluation-with-meta-crag-4b2k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Want to skip the theory and launch a local RAG benchmark in Docker right now? Check out the &lt;a href="https://github.com/astronaut27/CRAG_with_Docker" rel="noopener noreferrer"&gt;repo&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Introduction: Breaking the Infrastructure Barrier&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://medium.com/@astronaut27/how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-docker-8ea8435f9fa2" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, we prepped our "shuttle" for launch by containerizing the Meta CRAG infrastructure. It gave us a standardized environment, but we were still tethered to one expensive "ground control" dependency.&lt;/p&gt;

&lt;p&gt;The original benchmark baselines are &lt;strong&gt;resource-hungry&lt;/strong&gt;. They expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paid OpenAI API for final judging.&lt;/li&gt;
&lt;li&gt;GPU(CUDA) clusters to run inference via vLLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Developing a RAG system under these constraints feels like ordering expensive parts by mail when you already have the tools in your garage.&lt;/em&gt; You spend your budget on "shipping" (API tokens) and wait for external servers to reply, even though you have plenty of local horsepower sitting idle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if you could launch the rocket from your own spaceport?&lt;/strong&gt; Right on your laptop, with &lt;strong&gt;zero cost per request&lt;/strong&gt; and total autonomy. We’re swapping external APIs for local inference using Ollama and Ray.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Architecture: The OpenAI-Compatible Interface&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj9afsublbt9qjh1moeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj9afsublbt9qjh1moeg.png" alt=" " width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest headache with academic benchmarks is their &lt;strong&gt;rigid stack.&lt;/strong&gt; Meta CRAG expects either vLLM or OpenAI by default. Rewriting the core evaluation logic is a recipe for bugs and broken metrics.&lt;/p&gt;

&lt;p&gt;Instead, we’ll take the engineering shortcut:&lt;/p&gt;

&lt;p&gt;We implemented a &lt;code&gt;RAGOpenAICompatibleModel&lt;/code&gt; class. It uses the standard &lt;code&gt;openai&lt;/code&gt; library but "hijacks" the data flow via the &lt;code&gt;base_url&lt;/code&gt; variable. This lets us point the benchmark at a local Ollama instance without changing the core logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; This gives us hot-swappable brains. Want to test Llama 3? Just change the model key. Want to compare it against Qwen or Gemma? A quick &lt;code&gt;export&lt;/code&gt; in your terminal and a few lines in the configuration file are all it takes.&lt;/p&gt;
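
&lt;p&gt;The adapter idea boils down to a few lines with the standard &lt;code&gt;openai&lt;/code&gt; client. The environment variable names here are illustrative; see the repo for the full &lt;code&gt;RAGOpenAICompatibleModel&lt;/code&gt;:&lt;/p&gt;

```python
import os

# Env-driven config: swapping Llama 3 for Qwen or Gemma is one `export`.
# Variable names are illustrative; Ollama's OpenAI-compatible endpoint
# lives at /v1 by default.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1")
MODEL_NAME = os.getenv("EVAL_MODEL", "llama3:8b")

def make_client():
    from openai import OpenAI  # pip install openai
    # Ollama ignores the API key, but the client requires a value.
    return OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")

def generate(prompt: str) -> str:
    """One chat completion against the local engine."""
    resp = make_client().chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Config resolves without a running server; generate() needs Ollama up.
assert OLLAMA_BASE_URL.endswith("/v1")
```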




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Tuning the "Onboard Systems": Ray and HTML Cleanup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the cloud, you pay for convenience—you can feed raw HTML to the LLM and hope it figures it out. In a local spaceport, &lt;strong&gt;resources are finite.&lt;/strong&gt; Every extra token is &lt;em&gt;dead weight&lt;/em&gt; (ballast).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0nirj6ok5e98lyi3npm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0nirj6ok5e98lyi3npm.png" alt=" " width="800" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🛠 Parallelism via Ray
&lt;/h3&gt;

&lt;p&gt;Processing hundreds of HTML pages for every question is heavy. We use &lt;strong&gt;Ray&lt;/strong&gt; to distribute the load: while the GPU is busy generating an answer, the idle CPU cores are &lt;strong&gt;"scrubbing" data&lt;/strong&gt; for the next batch in the background.&lt;/p&gt;
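
&lt;p&gt;The repo implements this overlap with Ray's &lt;code&gt;@ray.remote&lt;/code&gt; tasks. The same prefetch idea can be sketched dependency-free with &lt;code&gt;concurrent.futures&lt;/code&gt;; the stubs below stand in for the real scrub and generate steps:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def scrub(html: str) -> str:
    # Stand-in for the real BeautifulSoup cleanup (see the next section).
    return " ".join(html.replace("<div>", " ").replace("</div>", " ").split())

def generate_answer(question: str, context: str) -> str:
    return f"answer({question})"  # stand-in for the GPU-bound LLM call

def pipeline(batches):
    """Overlap CPU scrubbing of the *next* batch with generation for
    the current one: the same idea Ray's @ray.remote tasks implement."""
    results = []
    with ThreadPoolExecutor() as pool:
        pending = pool.submit(scrub, batches[0]["html"])
        for i, batch in enumerate(batches):
            context = pending.result()
            if i + 1 < len(batches):              # prefetch the next batch
                pending = pool.submit(scrub, batches[i + 1]["html"])
            results.append(generate_answer(batch["q"], context))
    return results

batches = [{"q": "q1", "html": "<div>a</div>"}, {"q": "q2", "html": "<div>b</div>"}]
assert pipeline(batches) == ["answer(q1)", "answer(q2)"]
```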

&lt;h3&gt;
  
  
  🧹 The "Space Junk" Filter
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;BeautifulSoup&lt;/code&gt; to strip tags is a &lt;strong&gt;survival requirement.&lt;/strong&gt; Local models with 8k context windows quickly "suffocate" under endless &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We clean the HTML.&lt;/li&gt;
&lt;li&gt;Split text into sentences.&lt;/li&gt;
&lt;li&gt;Cap snippets at 1000 characters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Result&lt;/em&gt;: We fit significantly more useful info into the context, boosting accuracy without needing massive model weights.&lt;/p&gt;
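
&lt;p&gt;The three steps above, sketched without dependencies. The repo itself uses &lt;code&gt;BeautifulSoup&lt;/code&gt;, but the pipeline is identical: strip tags, drop script/style junk, cap snippet length:&lt;/p&gt;

```python
from html.parser import HTMLParser

class SpaceJunkFilter(HTMLParser):
    """Keep visible text, drop script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def clean_html(html: str, cap: int = 1000) -> list:
    parser = SpaceJunkFilter()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Cap snippets at `cap` characters, as in the pipeline above.
    return [text[i:i + cap] for i in range(0, len(text), cap)]

html = "<div><script>var x=1;</script><p>The launch window opens at dawn.</p></div>"
assert clean_html(html) == ["The launch window opens at dawn."]
```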




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Field Testing: Real Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwjpl3xvyraxwbhsbgaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwjpl3xvyraxwbhsbgaq.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We picked three popular models to see how they handle a "combat" RAG scenario.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy (Correct)&lt;/th&gt;
&lt;th&gt;Hallucination&lt;/th&gt;
&lt;th&gt;Missing (I don't know)&lt;/th&gt;
&lt;th&gt;Final Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma-2-9B&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3-8B&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;55%&lt;/td&gt;
&lt;td&gt;-0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen-2.5-7B&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;-1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Post-Mortem: Why did Qwen crash? 💥
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmakpmgg3v4g3aaf5lgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmakpmgg3v4g3aaf5lgh.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qwen’s results look catastrophic, but this is a &lt;strong&gt;huge engineering lesson&lt;/strong&gt;. It didn't fail because it was "stupid"—it failed because it violated the protocol.&lt;/p&gt;

&lt;p&gt;Typical Qwen output:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;" &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; Okay, let's see. The user is asking about the producers... I need to check the references..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model started "thinking out loud" via the &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tag, ignoring the instruction to &lt;em&gt;answer succinctly&lt;/em&gt;. In CRAG, any text that isn't the direct answer is flagged as a &lt;strong&gt;Hallucination&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Takeaway:&lt;/strong&gt; Models with forced &lt;em&gt;Chain-of-Thought (CoT)&lt;/em&gt; need heavy post-processing (stripping the reasoning tags) or surgically precise prompting to keep them from turning a short answer into a philosophical essay.&lt;/p&gt;
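
&lt;p&gt;A minimal version of that post-processing. It assumes the reasoning is wrapped in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags, terminated or not:&lt;/p&gt;

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)

def strip_reasoning(raw: str) -> str:
    """Remove chain-of-thought blocks so only the direct answer is
    scored. Also drops an unterminated block (the Qwen failure mode)."""
    cleaned = THINK_BLOCK.sub("", raw)
    cleaned = re.sub(r"<think>.*", "", cleaned, flags=re.DOTALL | re.IGNORECASE)
    return cleaned.strip()

raw = "<think>Okay, let's see. The user is asking...</think>The answer is 42."
assert strip_reasoning(raw) == "The answer is 42."
assert strip_reasoning("<think>unterminated rambling") == ""
```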

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl6rhnhtp6shszsguqf3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl6rhnhtp6shszsguqf3.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Try it Yourself: Code on GitHub&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Stop reading and start launching. I’ve prepped a repository with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker configs for easy deployment.&lt;/li&gt;
&lt;li&gt;Ollama adapters for local inference.&lt;/li&gt;
&lt;li&gt;Ray scripts for high-speed HTML cleaning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🚀 Project Repo: &lt;a href="https://github.com/astronaut27/CRAG_with_Docker" rel="noopener noreferrer"&gt;astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Conclusion: Autonomy Achieved&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’ve proven that you don’t need a corporate budget to do serious RAG engineering.&lt;/p&gt;

&lt;p&gt;Our Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility&lt;/strong&gt;: Run the benchmark with a single command.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Exactly $0 per iteration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Your data never leaves your "space station."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Local evaluation is about building an honest development process where every change is backed by numbers, not just gut feeling.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Next Mission: RAGAS vs. CRAG&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our spaceport is fully operational. But how does our local ground truth compare to popular metrics like &lt;strong&gt;RAGAS&lt;/strong&gt;? In the next post, we’ll pit RAGAS against the hard facts of CRAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See you in orbit! 👨‍🚀✨&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>🧑‍🚀 LLM Engine Telemetry: How to Profile Models and See Where Performance is Lost</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 27 Nov 2025 14:32:20 +0000</pubDate>
      <link>https://forem.com/astronaut27/llm-engine-telemetry-how-to-profile-models-and-see-where-performance-is-lost-169b</link>
      <guid>https://forem.com/astronaut27/llm-engine-telemetry-how-to-profile-models-and-see-where-performance-is-lost-169b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“Any LLM is an engine. It can be massive or compact, but if you don't look at the telemetry, you'll never understand where you're burning energy inefficiently.”&lt;br&gt;
— Astronaut Engineer, Logbook #4&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🌌 Introduction: Why LLMs Need Profiling
&lt;/h2&gt;

&lt;p&gt;When engineers discuss LLM performance, three key phases are most often mentioned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tokenization latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (Time To First Token)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;tokens/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;But it's easier to think of it this way:&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An LLM is an engine, and the profiler is its dashboard. The rest is visible through the readings—and we're about to break them down.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Just like in real machinery:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Startup is always more expensive than the cruising phase,&lt;/li&gt;
&lt;li&gt;Different engine components consume energy differently,&lt;/li&gt;
&lt;li&gt;The true picture is only visible through telemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;👨‍🚀 Caption: "Before launch—rely only on the instruments"&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Mission Plan
&lt;/h2&gt;

&lt;p&gt;We are launching the &lt;strong&gt;GPT-2&lt;/strong&gt; model in three scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short prompt&lt;/li&gt;
&lt;li&gt;Medium prompt&lt;/li&gt;
&lt;li&gt;Long prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each test goes through three key phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization&lt;/strong&gt; — Preparing the input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefill&lt;/strong&gt; — The initial prompt processing that establishes TTFT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode / Steady-State&lt;/strong&gt; — The cruising phase of generation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokenization time,&lt;/li&gt;
&lt;li&gt;TTFT,&lt;/li&gt;
&lt;li&gt;Generation speed (ms/token, tokens/sec),&lt;/li&gt;
&lt;li&gt;Memory usage (peakRSS),&lt;/li&gt;
&lt;li&gt;The most expensive low-level operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All data is collected via &lt;em&gt;torch.profiler&lt;/em&gt; and displayed in TensorBoard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhyry70g5m1ntxb992o4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhyry70g5m1ntxb992o4.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 LLM Operation Phases: What Happens Under the Hood
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Tokenization — Input Preparation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The text is converted into tokens using the chosen tokenizer. On short texts, measuring this phase can be highly susceptible to system noise (jitter), which is why tokenization is almost always measured separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Prefill — Prompt Processing and Model State Establishment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this phase, the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs the entire prompt through all layers once,&lt;/li&gt;
&lt;li&gt;Computes attention for the entire input sequence,&lt;/li&gt;
&lt;li&gt;Populates the KV-Cache for subsequent generation,&lt;/li&gt;
&lt;li&gt;Allocates temporary tensors and buffers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Formally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TTFT = Prefill time + first Decode step time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TTFT is the time required to complete the prompt processing and generate the first token. Taken as a single step, prefill is by far the most expensive phase, since the entire prompt is processed in one go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Decode — Generating New Tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After prefill, the model transitions to sequential generation. Each new token requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 forward pass → 1 token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Decode characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operations are repeated with the same structure,&lt;/li&gt;
&lt;li&gt;The KV-Cache prevents re-computing attention for the entire prompt,&lt;/li&gt;
&lt;li&gt;Metrics become stable: &lt;code&gt;ms/token&lt;/code&gt;, &lt;code&gt;tokens/sec&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;📡 Experimental Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;mission_profiler.py&lt;/code&gt; script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performs three launches (short / medium / long prompt)&lt;/li&gt;
&lt;li&gt;Executes two generations for each:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Prefill → TTFT and full generation → Steady&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saves traces to TensorBoard,&lt;/li&gt;
&lt;li&gt;Outputs a summary metrics table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ We do not perform any warmup, so the first run (short_prompt) may be slower.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🛠️ Launch Telemetry Yourself!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can replicate this "flight" and study the profiler logs on your own machine. &lt;/p&gt;

&lt;p&gt;All the code, settings, and launch instructions are available in the mission repository: &lt;a href="https://github.com/astronaut27/llm-profiler-mission" rel="noopener noreferrer"&gt;GitHub: LLM Profiler Mission - Engine Telemetry&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  📈 Mission Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;================= MISSION SUMMARY =================
tag            prompt_len   tokenize   TTFT(ms)   steady(ms)   actual_tok   ms/token   tok/s   peakRSS(MB)
--------------------------------------------------------------------------------------------------------
short_prompt           19        6.6      920.9        823.5         32      25.73     38.9       2541.2
medium_prompt          56        1.4       43.2       1047.4         32      32.73     30.6       2866.3
long_prompt           116        1.7       32.5        894.0         32      27.94     35.8       2886.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How to Read These Numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Tokenize.&lt;/strong&gt; We can't reliably compare tokenizers from this data: short strings are heavily affected by system noise. Tokenization performance needs a dedicated, large-scale benchmark and is evaluated separately.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;TTFT (Time-To-First-Token).&lt;/strong&gt; The most interesting observation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short prompt → 921 ms&lt;/li&gt;
&lt;li&gt;Medium → 43 ms&lt;/li&gt;
&lt;li&gt;Long → 32 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Why the difference?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first run (short_prompt) bore the full impact of the &lt;strong&gt;cold start&lt;/strong&gt;: it includes CUDA/MPS warmup, allocations, and JIT compilation of kernels.&lt;/li&gt;
&lt;li&gt;TTFT is sensitive to the very first execution.&lt;/li&gt;
&lt;li&gt;In subsequent runs (medium, long prompt), after warmup, TTFT stabilizes, and the difference between the medium and long prompt becomes minimal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TTFT should be measured either after a dedicated &lt;strong&gt;warmup&lt;/strong&gt; or averaged over several runs.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Steady-State (ms/token).&lt;/strong&gt; The cost per token remains relatively stable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~26–33 ms/token&lt;/li&gt;
&lt;li&gt;~30–39 tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the engine's speed in &lt;strong&gt;cruising mode&lt;/strong&gt;. As expected, the per-token latency shows almost no dependence on prompt length.&lt;/p&gt;

&lt;p&gt;🔹 &lt;strong&gt;Peak RSS: 2541 → 2866 → 2886 MB.&lt;/strong&gt; Memory usage jumps noticeably when going from the short to the medium prompt &lt;em&gt;(due to the growth of the KV-Cache and general allocations)&lt;/em&gt;, but further lengthening adds little. This confirms that the primary VRAM/RAM allocation is for the model itself, while the KV-Cache consumes only a small fraction; its size does, however, grow linearly with input length.&lt;/p&gt;
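&lt;p&gt;That linear growth can be estimated on the back of an envelope: two cached tensors (K and V) per layer, per position. A sketch with illustrative, roughly 7B-class parameters:&lt;/p&gt;

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Estimate KV-Cache size: 2 tensors (K and V) per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
per_token = kv_cache_bytes(1, 32, 32, 128)              # 524288 bytes, i.e. 0.5 MiB/token
ctx_4k_mib = kv_cache_bytes(4096, 32, 32, 128) / 2**20  # 2048 MiB for a 4k context
```

&lt;p&gt;At a hundred-token prompt this is only tens of megabytes, consistent with the table above, where model weights rather than the cache dominate peak RSS.&lt;/p&gt;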




&lt;p&gt;&lt;strong&gt;📊 Who is Really Consuming Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the profiler, all operations fall into two camps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔥 1. Main Thrust (Useful Work)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the GPU, these are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;addmm&lt;/li&gt;
&lt;li&gt;mm / matmul&lt;/li&gt;
&lt;li&gt;scaled_dot_product_attention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They consume the majority of the CUDA time. These are the matrix computation kernels—the operations that truly &lt;strong&gt;propel&lt;/strong&gt; the LLM engine forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚙️ 2. Control Expenses (Overhead)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Utility operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_local_scalar_dense&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;item&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;copy_&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;to&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Mask checks (&lt;code&gt;eq&lt;/code&gt;, &lt;code&gt;all&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data movement,&lt;/li&gt;
&lt;li&gt;Synchronization between the CPU/host and the GPU/device,&lt;/li&gt;
&lt;li&gt;Scalar extraction,&lt;/li&gt;
&lt;li&gt;Utility logic for generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not the engine's thrust, but the cost of &lt;strong&gt;flight control.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🧭 The Big Picture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core computational kernels (attention, matmuls, addmm)&lt;/strong&gt; determine whether the model is fast or slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overhead operations&lt;/strong&gt; are non-productive costs that can be reduced through optimizations: minimizing synchronizations, enabling &lt;code&gt;use_cache=True&lt;/code&gt;, and reducing the number of small tensor operations.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;CUDA&lt;/strong&gt;, matrix kernels dominate (as they should); on &lt;strong&gt;MPS&lt;/strong&gt;, utility operations often dominate instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real profile of LLM performance is hidden in the balance between these two groups.&lt;/p&gt;
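&lt;p&gt;One practical way to read a trace is to bucket each operation into the two camps and compare totals. A sketch over (op name, total time) pairs, as you would export them from the profiler table (the name sets mirror the lists above; the sample numbers are made up):&lt;/p&gt;

```python
# Matrix-math "thrust" kernels vs. control/synchronization overhead
COMPUTE_OPS = {"addmm", "mm", "matmul", "bmm", "scaled_dot_product_attention"}
OVERHEAD_OPS = {"_local_scalar_dense", "item", "cat", "copy_", "to", "eq", "all"}

def split_profile(op_times):
    """op_times: dict mapping op name to total time (ms).
    Returns the totals for the two camps plus the uncategorized remainder."""
    compute = sum(t for op, t in op_times.items() if op in COMPUTE_OPS)
    overhead = sum(t for op, t in op_times.items() if op in OVERHEAD_OPS)
    other = sum(op_times.values()) - compute - overhead
    return compute, overhead, other

sample = {"addmm": 610.0, "scaled_dot_product_attention": 240.0,
          "copy_": 55.0, "item": 20.0, "layer_norm": 35.0}
compute, overhead, other = split_profile(sample)  # 850.0, 75.0, 35.0
```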




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvveiwef3stj8drsd2xza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvveiwef3stj8drsd2xza.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚙️ Why Profiling LLMs is Essential&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The profiler turns: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❌ "The model is running slow" into ✔ "Here is the specific operation that's consuming energy."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It helps reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the &lt;strong&gt;bottleneck&lt;/strong&gt; is located,&lt;/li&gt;
&lt;li&gt;The cost of prefill,&lt;/li&gt;
&lt;li&gt;The cost of each token,&lt;/li&gt;
&lt;li&gt;How &lt;strong&gt;memory&lt;/strong&gt; behaves,&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;overhead&lt;/strong&gt; created by Hugging Face's &lt;code&gt;generate()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;🏁 Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM is an engine. Sometimes powerful, sometimes compact, but always complex and sensitive to overloads. And until you open the profiler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You won't see the expensive matrix operations,&lt;/li&gt;
&lt;li&gt;You won't see the &lt;strong&gt;synchronization overhead&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;You won't know the cost of prefill,&lt;/li&gt;
&lt;li&gt;You won't see the growth of the KV-Cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The profiler is our &lt;em&gt;flight recorder&lt;/em&gt;. It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the engine is pulling,&lt;/li&gt;
&lt;li&gt;Where it's stalling,&lt;/li&gt;
&lt;li&gt;And where the energy is going.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;No one launches a rocket without a flight recorder.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
    <item>
      <title>🧑‍🚀 Choosing the Right Engine to Launch Your LLM (LM Studio, Ollama, and vLLM)</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 06 Nov 2025 17:00:00 +0000</pubDate>
      <link>https://forem.com/astronaut27/choosing-the-right-engine-to-launch-your-llm-lm-studio-ollama-and-vllm-195o</link>
      <guid>https://forem.com/astronaut27/choosing-the-right-engine-to-launch-your-llm-lm-studio-ollama-and-vllm-195o</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;A Practical Field Guide for Engineers: LM Studio, Ollama, and vLLM&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When you’re building your first LLM ship, the hardest part isn’t takeoff — it’s choosing the right engine.”&lt;br&gt;
— Engineer-Astronaut, Mission Log №3&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;In the LLM universe, everything moves at lightspeed.&lt;br&gt;
Sooner or later, every engineer faces the same question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how do you run a local model — fast, stable, and reliably?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;LM Studio — a local capsule with a friendly interface.&lt;/li&gt;
&lt;li&gt;Ollama — a maneuverable shuttle for edge missions.&lt;/li&gt;
&lt;li&gt;vLLM — an industrial reactor for API workloads and GPU clusters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But which one is right for &lt;em&gt;your&lt;/em&gt; mission?&lt;br&gt;
This article isn’t just another benchmark — it’s a &lt;strong&gt;navigation map&lt;/strong&gt;, built by an engineer who has wrestled with GPU crashes, dependency hell, and Dockerization pains.&lt;/p&gt;


&lt;h2&gt;
  
  
  🪐 Personal Log.
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“When I first tried LM Studio on my laptop, it was beautiful —&lt;br&gt;
until I needed to automate the launch.&lt;br&gt;
The GUI couldn’t be containerized, and the headless mode required extra tinkering.&lt;br&gt;
Then I switched to Ollama, and only with vLLM did I finally understand what a real production-grade workload feels like.”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  ⚙️ 1. LM Studio — A Piloted Capsule for Local Missions
&lt;/h2&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;LM Studio is a desktop application with a local OpenAI-compatible API.&lt;br&gt;
It lets you work offline and run models directly on your laptop.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//lmstudio.ai/docs"&gt;lmstudio&lt;/a&gt;&lt;br&gt;
💻 Platforms: macOS, Windows, Linux (AppImage).&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Download and install from &lt;a href="https://lmstudio.ai/download"&gt;lmstudio.ai&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Caveats:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;GUI-only app — limited containerization;&lt;/li&gt;
&lt;li&gt;Experimental headless API;&lt;/li&gt;
&lt;li&gt;May overload CPU/GPU during long sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“LM Studio is a flight simulator — perfect for training,&lt;br&gt;
but it won’t take you into orbit.”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🚀 2. Ollama — A Maneuverable Shuttle for Edge Missions
&lt;/h2&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;An open-source CLI/desktop runtime for models like Mistral, Gemma, Phi-3, and Llama-3.&lt;br&gt;
It runs as a REST API and integrates easily into Docker.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//ollama.ai"&gt;ollama.ai&lt;/a&gt;&lt;br&gt;
💻 Platforms: macOS, Linux, Windows.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
ollama run llama3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or via Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 11434:11434 ollama/ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
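&lt;p&gt;Once the server is up, any stdlib HTTP client can talk to it. A minimal sketch against Ollama's &lt;code&gt;/api/generate&lt;/code&gt; REST endpoint (non-streaming, no error handling):&lt;/p&gt;

```python
import json
from urllib import request

def ollama_generate_payload(model, prompt):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ollama_generate(prompt, model="llama3", host="http://localhost:11434"):
    """Send a non-streaming generation request to a running Ollama server."""
    req = request.Request(
        host + "/api/generate",
        data=ollama_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires `ollama run llama3` or the container
        return json.loads(resp.read())["response"]
```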



&lt;h4&gt;
  
  
  When to use:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Local REST APIs and edge inference;&lt;/li&gt;
&lt;li&gt;CI/CD and microservices;&lt;/li&gt;
&lt;li&gt;Quick launches without complex dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Ollama is a light shuttle —&lt;br&gt;
it can launch from any planet, but it won’t carry heavy cargo.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  ☀️ 3. vLLM — A Reactor for Production-Grade Flights
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;What it is:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;vLLM is a high-performance runtime for LLM inference,&lt;br&gt;
optimized for GPUs, fully OpenAI-API compatible, and designed for scaling.&lt;/p&gt;

&lt;p&gt;📚 Documentation: &lt;a href="//github.com/vllm-project/vllm"&gt;vllm&lt;/a&gt;&lt;br&gt;&lt;br&gt;
💻 Platforms: Linux and major cloud providers (AWS, GCP, Azure).&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How to launch:&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; meta-llama/Llama-3-8b-instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
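&lt;p&gt;Because vLLM exposes the OpenAI-compatible API, the client side is the same few lines whichever engine serves it. A stdlib-only sketch against the container above:&lt;/p&gt;

```python
import json
from urllib import request

def chat_payload(model, user_message, max_tokens=64):
    """Build an OpenAI-style chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }).encode()

def vllm_chat(user_message, model="meta-llama/Llama-3-8b-instruct",
              base="http://localhost:8000"):
    """Query a running vLLM server via its OpenAI-compatible endpoint."""
    req = request.Request(base + "/v1/chat/completions",
                          data=chat_payload(model, user_message),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```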



&lt;h4&gt;
  
  
  When to use:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Product APIs and AI platforms;&lt;/li&gt;
&lt;li&gt;Multi-user environments;&lt;/li&gt;
&lt;li&gt;High-speed, CUDA-optimized inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Caveats:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Requires NVIDIA GPU (CUDA ≥ 12.x);&lt;/li&gt;
&lt;li&gt;Not compatible with macOS (no GPU backend);&lt;/li&gt;
&lt;li&gt;Needs DevOps experience — monitoring, logging, version sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“vLLM is a deep-space reactor — built for interstellar journeys.&lt;br&gt;
But if you try to fire it up in your garage, it simply won’t ignite.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  🪐 The Mission Map — Which Engine to Choose
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmrz1ujv1oplh0yzayxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmrz1ujv1oplh0yzayxh.png" alt=" " width="665" height="1460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ Common pitfalls:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;LM Studio → limited containerization;&lt;/li&gt;
&lt;li&gt;Ollama → not all models available out of the box, though you can import from Hugging Face;&lt;/li&gt;
&lt;li&gt;vLLM → CUDA version mismatch causes kernel errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nlx7xmmqyt22mcf52y0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nlx7xmmqyt22mcf52y0.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🧩 Mission Debrief
&lt;/h3&gt;

&lt;p&gt;Every engine is built for its own orbit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LM Studio&lt;/strong&gt; — for solo flights and quick system checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — for agile edge missions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — for long-range, interstellar operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Sometimes an engineer’s mission isn’t to build a new engine —&lt;br&gt;
but to understand which existing one fits the current flight plan.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  🛰️ Previous Missions
&lt;/h3&gt;

&lt;p&gt;🚀 &lt;a href="https://dev.to/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6"&gt;Prepared Meta’s CRAG Benchmark for Launch in Docker&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>🧑‍🚀 Mission Accomplished: How an Engineer-Astronaut Prepared Meta’s CRAG Benchmark for Launch in Docker</title>
      <dc:creator>astronaut</dc:creator>
      <pubDate>Thu, 06 Nov 2025 11:04:46 +0000</pubDate>
      <link>https://forem.com/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6</link>
      <guid>https://forem.com/astronaut27/mission-accomplished-how-an-engineer-astronaut-prepared-metas-crag-benchmark-for-launch-in-4bl6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Every ML system is like a spacecraft — powerful, intricate, and temperamental.&lt;br&gt;
But without telemetry, you have no idea where it’s headed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🌌 Introduction
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;CRAG (Comprehensive RAG Benchmark)&lt;/strong&gt; from Meta AI is the control panel for Retrieval-Augmented Generation systems.&lt;br&gt;
It measures how well model responses stay grounded in facts, remain robust under noise, and maintain contextual relevance.&lt;/p&gt;

&lt;p&gt;As is often the case with research projects, CRAG required &lt;strong&gt;engineering adaptation&lt;/strong&gt; to operate reliably in a modern environment:&lt;br&gt;
incompatible library versions, dependency conflicts, unclear paths, and manual launch steps.&lt;/p&gt;

&lt;p&gt;🧰 I wanted to bring CRAG to a state where it could be launched with a single command — no dependency chaos, no manual fixes.&lt;br&gt;
The result is a fully reproducible Dockerized environment, available here:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/astronaut27/CRAG_with_Docker"&gt;github.com/astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  🚀 What I Improved
&lt;/h2&gt;

&lt;p&gt;In the original build, several issues made CRAG difficult to run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔧 Conflicting library versions;&lt;/li&gt;
&lt;li&gt;⚙️ No unified, reproducible start-up workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, everything comes to life with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After building, two containers start automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🛰️ mock-api — an emulator for web search and Knowledge Graph APIs;&lt;/li&gt;
&lt;li&gt;🚀 crag-app — the main container with the benchmark and built-in baseline models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧱 Pre-Launch Preparation: Handling the Mission Artifacts
&lt;/h2&gt;

&lt;p&gt;Before firing up the Docker build, make sure all mission artifacts — the large data and model files — are present locally.&lt;/p&gt;

&lt;p&gt;Because CRAG includes files over 100 MB, it uses &lt;strong&gt;Git Large File Storage (LFS)&lt;/strong&gt;. Without these files, the container won't initialize.&lt;/p&gt;

&lt;p&gt;So the first command in your console is essentially &lt;strong&gt;fueling the ship&lt;/strong&gt; with data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git lfs pull
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧩 How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foigv0nsubou4mzm3i4uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foigv0nsubou4mzm3i4uc.png" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  📡 ⚙️ CRAG in Autonomous Mode
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;mock-API&lt;/em&gt;&lt;/strong&gt; — simulates external data sources (Web Search, KG API) used by the RAG system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;crag-app&lt;/em&gt;&lt;/strong&gt; — the main container running the benchmark and the model used for response generation (a dummy model at this stage).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;local_evaluation.py&lt;/em&gt;&lt;/strong&gt; — coordinates the pipeline, calls the mock API, and handles metric evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;ChatGPT&lt;/em&gt;&lt;/strong&gt; — serves as an LLM-assisted judge that evaluates generated responses by CRAG’s metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧠 What CRAG Measures: The Telemetry Dashboard
&lt;/h2&gt;

&lt;p&gt;CRAG reports quantitative indicators — a flight log of your system after a test mission:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;total&lt;/em&gt;&lt;/strong&gt;: Total number of evaluated examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_correct&lt;/em&gt;&lt;/strong&gt;: Count of responses that are fully supported by retrieved context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_hallucination&lt;/em&gt;&lt;/strong&gt;: Number of responses containing unsupported or invented facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;n_miss&lt;/em&gt;&lt;/strong&gt;: Responses missing key information or empty answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;accuracy / score&lt;/em&gt;&lt;/strong&gt;: Overall accuracy (the ratio of correct responses).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;hallucination&lt;/em&gt;&lt;/strong&gt;: Ratio = n_hallucination / total.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;missing&lt;/em&gt;&lt;/strong&gt;: Ratio = n_miss / total.&lt;/li&gt;
&lt;/ul&gt;
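&lt;p&gt;The ratios follow directly from the counts. A small sketch (the headline score convention, correct minus hallucinated with misses counting as zero, matches CRAG's published scoring, but treat the exact formula as an assumption here):&lt;/p&gt;

```python
def crag_summary(n_correct, n_hallucination, n_miss):
    """Derive CRAG-style ratios from raw counts."""
    total = n_correct + n_hallucination + n_miss
    return {
        "total": total,
        "accuracy": n_correct / total,
        "hallucination": n_hallucination / total,
        "missing": n_miss / total,
        # correct answers score +1, hallucinations -1, misses 0
        "score": (n_correct - n_hallucination) / total,
    }

stats = crag_summary(n_correct=60, n_hallucination=15, n_miss=25)  # score: 0.45
```

&lt;p&gt;The asymmetry is deliberate: a hallucinated answer is worse than an empty one, so the score punishes it twice as hard as a miss.&lt;/p&gt;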

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn5zvjv46heqbitc43ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdn5zvjv46heqbitc43ev.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 These metrics are &lt;strong&gt;the sensors on your RAG ship’s dashboard&lt;/strong&gt;.&lt;br&gt;
If any of them start flashing red — it’s time to check the model’s engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧱 Docker Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Mock API service for RAG data&lt;/span&gt;
  &lt;span class="na"&gt;mock-api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../mock_api&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../deployments/Dockerfile.mock-api&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crag-mock-api&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../mock_api/cragkg:/app/cragkg&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PYTHONPATH=/app&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crag-network&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

  &lt;span class="c1"&gt;# CRAG application container&lt;/span&gt;
  &lt;span class="na"&gt;crag-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;..&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployments/Dockerfile.crag-app&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crag-app&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mock-api&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# OpenAI for evaluation (optional)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY=${OPENAI_API_KEY}&lt;/span&gt;
      &lt;span class="c1"&gt;# Mock API connection (Docker service)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CRAG_MOCK_API_URL=http://mock-api:8000&lt;/span&gt;
      &lt;span class="c1"&gt;# Evaluation model&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;EVALUATION_MODEL_NAME=${EVALUATION_MODEL_NAME:-gpt-4-0125-preview}&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Mount large data directories (read-only)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../data:/app/data:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../results:/app/results&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../example_data:/app/example_data:ro&lt;/span&gt;
      &lt;span class="c1"&gt;# Tokenizer (if needed)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../tokenizer:/app/tokenizer:ro&lt;/span&gt;
    &lt;span class="na"&gt;extra_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host.docker.internal:host-gateway"&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;crag-network&lt;/span&gt;
    &lt;span class="na"&gt;stdin_open&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local_evaluation.py"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;crag-network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🪐 Why This Matters
&lt;/h2&gt;

&lt;p&gt;RAG systems are quickly becoming the &lt;strong&gt;core engines of modern LLM-based products&lt;/strong&gt;.&lt;br&gt;
CRAG allows engineers to evaluate their reliability and factual grounding before shipping to production.&lt;/p&gt;

&lt;p&gt;This Docker build transforms Meta AI’s research benchmark into a &lt;strong&gt;practical engineering environment&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 fully isolated and reproducible;&lt;/li&gt;
&lt;li&gt;🧠 runnable locally or in CI pipelines;&lt;/li&gt;
&lt;li&gt;🚀 easily extendable with your own models (for example, via LM Studio — coming in the next mission).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔭 The Next Mission
&lt;/h3&gt;

&lt;p&gt;Right now, CRAG runs on its built-in baselines — a test flight before mounting the real engine.&lt;br&gt;
The next step is integrating the &lt;strong&gt;LM Studio API&lt;/strong&gt; and evaluating a live LLM within the same container setup.&lt;br&gt;
That will be &lt;strong&gt;Mission II&lt;/strong&gt; 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14hvf10ag8ty80z3alh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14hvf10ag8ty80z3alh0.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧭 Mission Summary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;“Sometimes engineering magic isn’t about building a brand-new ship,&lt;br&gt;
but about preparing an existing one for its next flight.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CRAG now launches reliably, telemetry is stable, and the mission is a success.&lt;/p&gt;

&lt;p&gt;Next up: integrating LM Studio and real models.&lt;br&gt;
For now, the ship holds a steady course. 🪐&lt;/p&gt;

&lt;h4&gt;
  
  
  🔗 Mission Repository
&lt;/h4&gt;

&lt;p&gt;📦 &lt;a href="https://github.com/astronaut27/CRAG_with_Docker"&gt;github.com/astronaut27/CRAG_with_Docker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📜 License&lt;br&gt;
CRAG is distributed under the MIT License, developed by Meta AI / Facebook Research.&lt;br&gt;
All modifications in &lt;a href="https://github.com/astronaut27/CRAG_with_Docker"&gt;CRAG_with_Docker&lt;/a&gt; preserve the original copyright notices.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
