<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hugo</title>
    <description>The latest articles on Forem by Hugo (@hugo_o137).</description>
    <link>https://forem.com/hugo_o137</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812725%2Fb3e35184-36d9-4ef4-b7db-dd5b42abc601.png</url>
      <title>Forem: Hugo</title>
      <link>https://forem.com/hugo_o137</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hugo_o137"/>
    <language>en</language>
    <item>
      <title>How to Monitor AI Agents in Production</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Mon, 09 Mar 2026 07:31:40 +0000</pubDate>
      <link>https://forem.com/hugo_o137/how-to-monitor-ai-agents-in-production-10ll</link>
      <guid>https://forem.com/hugo_o137/how-to-monitor-ai-agents-in-production-10ll</guid>
      <description>&lt;p&gt;Uptime monitoring is not enough. Here's what you actually need to track, why agent failures are mostly silent, and which tools the industry uses today.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why monitoring an AI agent is different
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring is built around a simple contract: the system either works or it doesn't. A server is up or down. An API returns 200 or 500. Alerts fire, someone fixes it.&lt;/p&gt;

&lt;p&gt;AI agents break this contract. An agent can be fully available — no crashes, no timeouts, no error codes — while producing wrong answers, calling the wrong tool, or fabricating information. From an infrastructure perspective, everything looks healthy. From a user perspective, the agent is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The silent failure problem.&lt;/strong&gt; The biggest production incidents with agents don't throw exceptions. They look like: a confident answer that's factually wrong, a tool call that partially succeeded, a workflow that loops until it hits a timeout. None of these trigger a standard alert.&lt;/p&gt;

&lt;p&gt;This is why the AI industry has converged on a broader concept than monitoring: observability. The goal isn't just to know if the agent is running — it's to understand what it's doing, step by step, and whether it's doing it correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to track: the five layers
&lt;/h2&gt;

&lt;p&gt;A production AI agent generates several distinct types of telemetry. You need all of them — each layer reveals failures that the others miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Traces
&lt;/h3&gt;

&lt;p&gt;A trace is the complete execution record of one agent interaction: every step, every decision, every tool call, every intermediate output, with timestamps. For a multi-step agent, a single user request can trigger dozens of internal operations. Without traces, when something goes wrong you have no way to know at which step it happened or why.&lt;/p&gt;

&lt;p&gt;What good tracing looks like: you can replay any past interaction exactly as it happened, inspect each step in isolation, and compare the execution path when the agent worked correctly versus when it failed.&lt;/p&gt;
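&lt;p&gt;A minimal sketch of what trace capture can look like in application code. The names (&lt;code&gt;Trace&lt;/code&gt;, &lt;code&gt;record_step&lt;/code&gt;) are illustrative, not from any specific library — in practice you'd emit OpenTelemetry spans instead:&lt;/p&gt;

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str        # e.g. "llm_call" or "tool:search"
    inputs: dict
    outputs: dict
    started_at: float
    ended_at: float

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    steps: list = field(default_factory=list)

    def record_step(self, name, inputs, outputs, started_at, ended_at):
        self.steps.append(Step(name, inputs, outputs, started_at, ended_at))

    def replay(self):
        # Yield each step in order, so a past interaction can be
        # inspected step by step after the fact
        for step in self.steps:
            yield step.name, step.inputs, step.outputs, step.ended_at - step.started_at

# Usage: wrap each agent operation and record its inputs/outputs
trace = Trace()
t0 = time.monotonic()
result = {"answer": "42"}                  # stand-in for a real LLM or tool call
trace.record_step("llm_call", {"prompt": "..."}, result, t0, time.monotonic())

for name, inputs, outputs, duration in trace.replay():
    print(name, duration >= 0)
```

&lt;p&gt;The point is the shape of the record, not the storage: every step carries its own inputs, outputs, and timing, so a failed run can be compared against a successful one step by step.&lt;/p&gt;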

&lt;h3&gt;
  
  
  2. Quality metrics
&lt;/h3&gt;

&lt;p&gt;This is what separates AI monitoring from infrastructure monitoring. You need to measure whether the agent's outputs are actually correct — not just fast and available.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task completion rate&lt;/strong&gt; — did the agent accomplish what the user asked?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination detection&lt;/strong&gt; — did the agent produce claims not grounded in its sources or tool outputs? Measured via automated "LLM-as-judge" evaluation on sampled traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool selection quality&lt;/strong&gt; — did the agent call the right tool, with the right parameters?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction adherence&lt;/strong&gt; — did the agent follow its system prompt, formatting rules, and policy constraints?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn consistency&lt;/strong&gt; — does the agent contradict itself across conversation turns?&lt;/li&gt;
&lt;/ul&gt;
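&lt;p&gt;The hallucination check above is typically automated with an LLM-as-judge prompt. Here is a sketch of the deterministic parts — building the judge prompt and parsing its verdict — with the actual model call left out (&lt;code&gt;call_judge_model&lt;/code&gt; would be your evaluation endpoint; it is an assumed placeholder, not a real API):&lt;/p&gt;

```python
def build_judge_prompt(answer: str, sources: list[str]) -> str:
    # Ask a second model to check grounding, answering in a fixed format
    joined = "\n".join(f"- {s}" for s in sources)
    return (
        "You are a strict evaluator. Given the sources below, decide whether "
        "every claim in the answer is supported by them.\n"
        f"Sources:\n{joined}\n\nAnswer:\n{answer}\n\n"
        "Reply with exactly one line: VERDICT: GROUNDED or VERDICT: HALLUCINATED"
    )

def parse_verdict(judge_output: str) -> bool:
    # Returns True if the judge flagged a hallucination
    for line in judge_output.splitlines():
        if line.strip().startswith("VERDICT:"):
            return "HALLUCINATED" in line
    raise ValueError("judge output did not contain a VERDICT line")

# judge_output = call_judge_model(build_judge_prompt(answer, sources))
print(parse_verdict("VERDICT: HALLUCINATED"))  # → True
```

&lt;p&gt;Forcing a fixed output format makes the judge's answer machine-parseable, which is what lets you run it on sampled traffic and aggregate the results into a metric.&lt;/p&gt;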

&lt;h3&gt;
  
  
  3. Latency — by percentile, not average
&lt;/h3&gt;

&lt;p&gt;Average latency hides the problem. A multi-step agent might respond in 800ms most of the time, but take 15 seconds for complex queries. The users who experience those 15-second waits drive complaints and churn — the average never shows it.&lt;/p&gt;

&lt;p&gt;Track p50, p95, and p99. The p99 (the slowest 1% of requests) is what defines the worst-case user experience. Set alerts on p95 and p99, not on averages.&lt;/p&gt;
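&lt;p&gt;Percentiles are cheap to compute from raw latency samples; a nearest-rank sketch:&lt;/p&gt;

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that
    at least p% of all samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based rank
    return ordered[max(rank, 1) - 1]

latencies_ms = list(range(1, 101))             # toy data: 1..100 ms
print(percentile(latencies_ms, 50))   # 50
print(percentile(latencies_ms, 95))   # 95
print(percentile(latencies_ms, 99))   # 99
```

&lt;p&gt;On this toy data the mean is ~50 ms and would look identical whether the tail is 100 ms or 15 seconds — the p99 is what exposes the difference.&lt;/p&gt;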

&lt;h3&gt;
  
  
  4. Cost per request
&lt;/h3&gt;

&lt;p&gt;Token costs are not evenly distributed. A small proportion of requests typically accounts for a disproportionate share of your LLM spend. Without per-request cost tracking, you can't identify which queries, workflows, or user segments are burning your budget — and you can't optimize.&lt;/p&gt;

&lt;p&gt;Track cost at the trace level, broken down by model, endpoint, and if possible by user segment or workflow type.&lt;/p&gt;
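&lt;p&gt;The per-trace arithmetic is simple once you log token counts per call. A sketch — the model names and per-token prices here are placeholders, not real rates; look up your provider's current pricing:&lt;/p&gt;

```python
# Placeholder prices per 1K tokens — substitute your provider's actual rates
PRICE_PER_1K = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.0100, "output": 0.0300},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICE_PER_1K[model]
    return (input_tokens / 1000) * rates["input"] + \
           (output_tokens / 1000) * rates["output"]

def trace_cost(steps: list[dict]) -> float:
    # Sum cost across every LLM call inside one trace
    return sum(request_cost(s["model"], s["in"], s["out"]) for s in steps)

steps = [
    {"model": "large-model", "in": 4000, "out": 800},
    {"model": "small-model", "in": 1200, "out": 300},
]
print(round(trace_cost(steps), 4))
```

&lt;p&gt;Attaching this number to each trace is what lets you group spend by model, endpoint, or user segment afterwards.&lt;/p&gt;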

&lt;h3&gt;
  
  
  5. Drift over time
&lt;/h3&gt;

&lt;p&gt;An agent that performs well at launch can degrade over weeks without any code change. Reasons include: changes in how users phrase requests, upstream data quality shifts, model provider updates, or subtle prompt regressions after a deployment. Without longitudinal quality tracking, drift is invisible until it's severe.&lt;/p&gt;

&lt;p&gt;Run automated quality evaluations continuously on sampled production traffic, and compare scores week-over-week. A consistent downward trend is a signal to act before users notice.&lt;/p&gt;
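&lt;p&gt;The week-over-week comparison can be as simple as flagging a run of consecutive drops in the mean evaluation score. A sketch (the &lt;code&gt;min_drop&lt;/code&gt; threshold is an illustrative assumption, not a standard value):&lt;/p&gt;

```python
def weekly_trend(weekly_scores: list[float], min_drop: float = 0.02) -> str:
    """Flag a consistent downward trend: every week drops by at least
    min_drop compared to the previous one."""
    if len(weekly_scores) < 3:
        return "insufficient-data"
    drops = [a - b for a, b in zip(weekly_scores, weekly_scores[1:])]
    if all(d >= min_drop for d in drops):
        return "degrading"
    return "stable"

# Four weeks of mean judge scores on sampled production traffic
print(weekly_trend([0.88, 0.85, 0.81, 0.78]))  # degrading
print(weekly_trend([0.88, 0.87, 0.88, 0.86]))  # stable
```

&lt;p&gt;Requiring several consecutive drops filters out normal week-to-week noise while still catching the slow slide that drift produces.&lt;/p&gt;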




&lt;h2&gt;
  
  
  How agent failures actually look in production
&lt;/h2&gt;

&lt;p&gt;Understanding the failure modes helps you set up the right alerts. Agent failures in production tend to fall into a few recurring patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  The wrong tool, confidently called
&lt;/h3&gt;

&lt;p&gt;The agent selects a plausible-looking tool but the wrong one for the task. The call succeeds (no error), it returns data, and the agent builds its response on that data — which is irrelevant or misleading. The entire downstream output is flawed, but nothing in your infrastructure logs flags it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infinite loops
&lt;/h3&gt;

&lt;p&gt;An agent retries a failed operation repeatedly, or continues processing a task that was already completed. This burns compute and token budget silently, and can corrupt data through duplicate operations. Define explicit termination conditions and set circuit breakers on retry loops.&lt;/p&gt;
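&lt;p&gt;A circuit breaker of this kind is a few lines of code: a hard step budget for the whole run plus a retry budget per operation. A sketch (the class and its limits are illustrative, not from a specific framework):&lt;/p&gt;

```python
class LoopBreaker:
    """Hard limits on agent iteration: a step budget for the run and a
    retry budget per operation. Trips before a runaway loop burns
    compute or duplicates side effects."""

    def __init__(self, max_steps: int = 20, max_retries: int = 3):
        self.max_steps = max_steps
        self.max_retries = max_retries
        self.steps = 0
        self.retries = {}

    def check_step(self):
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError(f"step budget exceeded ({self.max_steps})")

    def check_retry(self, operation: str):
        self.retries[operation] = self.retries.get(operation, 0) + 1
        if self.retries[operation] > self.max_retries:
            raise RuntimeError(f"retry budget exceeded for {operation!r}")

# Usage inside the agent loop
breaker = LoopBreaker(max_steps=5)
tripped = False
try:
    while True:              # simulates an agent that never terminates on its own
        breaker.check_step()
except RuntimeError:
    tripped = True
print(tripped)  # True
```

&lt;p&gt;The key design point is that the limits live outside the agent's own reasoning: the model cannot talk itself past them.&lt;/p&gt;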

&lt;h3&gt;
  
  
  Context loss in multi-turn conversations
&lt;/h3&gt;

&lt;p&gt;In longer sessions, the agent loses track of constraints or prior decisions established earlier in the conversation. It starts contradicting itself or ignoring instructions it acknowledged a few turns back. This is hard to catch with per-request monitoring — it only shows up in session-level analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt drift after deployment
&lt;/h3&gt;

&lt;p&gt;A prompt change that looked fine in testing degrades performance on a class of production queries that wasn't represented in the test set. This shows up as a gradual decline in quality scores for a specific intent type — catchable with segment-level evaluation, invisible with aggregate metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  The tools the industry uses today
&lt;/h2&gt;

&lt;p&gt;The observability ecosystem for AI agents has matured significantly. OpenTelemetry has emerged as the industry standard for collecting telemetry — it's vendor-neutral, which means your trace data stays portable across tools. Most major frameworks (LangChain, CrewAI, OpenAI Agents SDK) emit OpenTelemetry-compatible traces natively.&lt;/p&gt;

&lt;p&gt;On top of that foundation, several purpose-built platforms have emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langfuse&lt;/strong&gt; — Open-source · Self-hostable&lt;br&gt;
Full trace replay, prompt versioning, cost tracking, LLM-as-judge evaluations. Standard choice for teams that want data sovereignty and full control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Arize Phoenix&lt;/strong&gt; — Open-source · Cloud option&lt;br&gt;
Strong on drift detection and embedding monitoring. Good for teams that need to track model-level performance degradation alongside agent behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; — LangChain ecosystem&lt;br&gt;
Deep integration for LangChain/LangGraph stacks. Execution graph visualization, prompt comparison, dataset-based regression testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog LLM Observability&lt;/strong&gt; — Enterprise · Full-stack&lt;br&gt;
Connects AI monitoring to your existing infrastructure observability. Best for teams already on Datadog who want unified dashboards across infra and agents.&lt;/p&gt;

&lt;p&gt;All four support OpenTelemetry as a data source, so you're not locked in. The practical choice depends on whether you prioritize data control (Langfuse, Arize), ecosystem fit (LangSmith), or infra consolidation (Datadog).&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitoring for compliance and governance
&lt;/h2&gt;

&lt;p&gt;For teams in regulated industries — finance, healthcare, legal, HR — monitoring isn't just an operational concern. It's a legal one.&lt;/p&gt;

&lt;p&gt;An AI agent that influences decisions (a loan recommendation, a candidate screening, a customer response) needs an audit trail that can answer: what did the agent receive as input, what did it output, which tools did it call, and what model version was running at the time? Without this, you can't respond to a compliance inquiry or a regulatory audit.&lt;/p&gt;

&lt;p&gt;This means monitoring infrastructure needs to capture and store, in a tamper-evident way: full input/output logs with timestamps, model version and configuration at time of execution, tool calls and their results, and any human approval or override events.&lt;/p&gt;
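&lt;p&gt;One common way to make such logs tamper-evident is to hash-chain the entries: each record's hash covers both its own content and the previous record's hash, so editing any past entry invalidates everything after it. A minimal sketch:&lt;/p&gt;

```python
import hashlib
import json

def append_entry(log: list[dict], record: dict) -> None:
    """Append an audit record whose hash covers both its content and
    the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})

def verify_chain(log: list[dict]) -> bool:
    prev_hash = "0" * 64
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != hashlib.sha256((prev_hash + payload).encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"input": "...", "output": "...", "model": "m-1", "tools": []})
append_entry(log, {"input": "...", "output": "...", "model": "m-1", "tools": ["crm"]})
print(verify_chain(log))   # True
log[0]["record"]["output"] = "edited"
print(verify_chain(log))   # False
```

&lt;p&gt;In production you'd also need append-only storage and retention policies, but the chain is what lets an auditor verify that nothing was rewritten after the fact.&lt;/p&gt;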

&lt;p&gt;&lt;strong&gt;What Azure AI Foundry's team noted on this:&lt;/strong&gt; Traditional observability covers metrics, logs, and traces. Agent observability adds two layers on top: evaluations (did the agent achieve the right outcome?) and governance (did it operate within policy and compliance constraints?). Both are needed for production in regulated environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  A practical setup to start with
&lt;/h2&gt;

&lt;p&gt;If you're instrumenting a production agent for the first time, here's a reasonable sequence that doesn't require months of infrastructure work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 1: Traces.&lt;/strong&gt; Instrument your agent with OpenTelemetry or plug into Langfuse directly. Make sure every execution generates a trace with inputs, outputs, tool calls, and latency per step. This alone gives you the ability to debug failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1: Latency and cost dashboards.&lt;/strong&gt; Set up per-request cost tracking and p95/p99 latency monitoring. Set alerts for cost anomalies (sudden spikes in token spend) and for latency regressions after deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2: Quality evaluations.&lt;/strong&gt; Define 3–5 evaluation criteria specific to your use case (relevance, factual grounding, policy adherence). Run them automatically on a sample of production traffic. Establish a baseline score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 1: Drift monitoring.&lt;/strong&gt; Compare quality scores week-over-week. Add segment-level breakdowns (by intent type, user segment, or workflow) to catch regressions that don't show up in aggregate metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ongoing: Audit trail.&lt;/strong&gt; If you're in a regulated context, ensure logs are stored with version context (model, prompt hash, config) and are accessible for compliance review.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Deploy AI Agents in Production: The Practical 2026 Guide</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Sun, 08 Mar 2026 10:05:25 +0000</pubDate>
      <link>https://forem.com/hugo_o137/deploy-ai-agents-in-production-the-practical-2026-guide-2p7h</link>
      <guid>https://forem.com/hugo_o137/deploy-ai-agents-in-production-the-practical-2026-guide-2p7h</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="//www.o137.ai"&gt;o137.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The demo was impressive. Production is another story.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
What enterprise reports really say — and what it means in practice.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Based on: LangChain State of Agents 2026, Cleanlab Enterprise Report, UC Berkeley MAP, McKinsey State of AI, Docker official documentation&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The demo/production gap is real — and massive
&lt;/h2&gt;

&lt;p&gt;In 2024-2025, AI agent demos proliferated. An agent that answers in natural language, uses tools, chains actions across multiple steps — on stage or in a notebook, it impresses.&lt;/p&gt;

&lt;p&gt;In production, it's different. Not slightly different. Fundamentally different.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key finding — Cleanlab / MIT 2025
&lt;/h3&gt;

&lt;p&gt;Of 1,837 companies surveyed on their AI agent deployment, only 95 actually had an agent in production with real user interactions. And among those 95, the majority remained in an early maturity phase.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: AI Agents in Production 2025, Cleanlab (based on MIT State of AI in Business 2025 data)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's not a model problem. LLMs work. The problem is everything around them: infrastructure, evaluation, governance, team trust.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Most so-called AI agents can't reliably do what they claim."&lt;br&gt;&lt;br&gt;
— Curtis Northcutt, CEO of Cleanlab&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  What "production" really requires
&lt;/h2&gt;

&lt;p&gt;The original article listed correct requirements but without quantified context. Here's what the data shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;57%&lt;/strong&gt; of surveyed companies have agents in production &lt;em&gt;(LangChain, 1,300+ respondents, 2025)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;32%&lt;/strong&gt; cite quality as the main barrier to production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;89%&lt;/strong&gt; of production teams have implemented some form of observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;68%&lt;/strong&gt; of agents run fewer than 10 steps before human intervention &lt;em&gt;(Berkeley MAP)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Sources: LangChain State of Agent Engineering (Dec. 2025, n=1,340); UC Berkeley Measuring Agents in Production (n=300+)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume and latency.&lt;/strong&gt; An application with 10,000 requests/day does not have the same constraints as a 10-request prototype. Latency has become the second most cited challenge (20% of teams), especially for multi-step agents where each LLM call adds up. Practical recommendations: aim under 500ms for a conversational agent, under 2 seconds for complex analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability, not uptime.&lt;/strong&gt; Traditional uptime (99.9%) is not the right metric for an AI agent. An agent can be "available" but produce wrong answers, hallucinate, call the wrong tool, or get stuck in an infinite loop. These silent failures are more dangerous than a crash, because they trigger no alert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal traceability and audit.&lt;/strong&gt; In regulated sectors, 42% of companies plan to add supervision features (approvals, review controls) — versus only 16% in unregulated sectors. Without auditability of every decision, a production deployment exposes the company to regulatory risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human escalation.&lt;/strong&gt; Berkeley measured that 92.5% of production agents send their output to humans rather than to other systems. That's not a design flaw — it's a deliberate strategy to maintain reliability.&lt;/p&gt;


&lt;h2&gt;
  
  
  From localhost to production: the technical path
&lt;/h2&gt;

&lt;p&gt;This is where most guides stop being useful. "Deploy to the cloud" is not a step. Here's the concrete path.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why localhost ≠ production
&lt;/h3&gt;

&lt;p&gt;When your agent works on your machine, you're typically running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API keys hardcoded in a &lt;code&gt;.env&lt;/code&gt; file or directly in the code&lt;/li&gt;
&lt;li&gt;A single Python process with no restart on crash&lt;/li&gt;
&lt;li&gt;No logging, no monitoring, no concurrency handling&lt;/li&gt;
&lt;li&gt;Dependencies tied to your local Python version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of that survives production. Here's how to bridge the gap systematically.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1 — Containerize with Docker
&lt;/h3&gt;

&lt;p&gt;Docker is the standard because it solves the "works on my machine" problem definitively. Your agent runs in an isolated container with its own dependencies, Python version, and environment — identical across dev, staging, and prod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dockerfile (Python agent, FastAPI)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# --- Build stage ---&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# --- Runtime stage ---&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="c"&gt;# Never hardcode secrets here&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PYTHONUNBUFFERED=1&lt;/span&gt;

&lt;span class="c"&gt;# Health check: verify the agent AND its dependencies are up&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=30s --timeout=5s --start-period=10s --retries=3 \&lt;/span&gt;
  CMD python -c "import requests; requests.get('http://localhost:8000/health', timeout=3).raise_for_status()"

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-stage build&lt;/strong&gt; keeps the final image small (no build tools in production)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HEALTHCHECK&lt;/strong&gt; verifies the container is actually functional, not just running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No secrets&lt;/strong&gt; in the Dockerfile — ever&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2 — Manage secrets properly
&lt;/h3&gt;

&lt;p&gt;The most common mistake: API keys in code or committed &lt;code&gt;.env&lt;/code&gt; files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local development&lt;/strong&gt; — &lt;code&gt;.env&lt;/code&gt; file (never committed to git):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env — add this to .gitignore&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-...
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:password@localhost:5432/agent_db
&lt;span class="nv"&gt;REDIS_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;redis://localhost:6379/0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In docker-compose.yml&lt;/strong&gt; (local and staging):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;env_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.env&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;

  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis:7-alpine&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis-cli"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ping"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15-alpine&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_db&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_user&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_password&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD-SHELL"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg_isready&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-U&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent_user"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In production&lt;/strong&gt; — use your cloud provider's secret manager, never plain env vars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS → Secrets Manager&lt;/li&gt;
&lt;li&gt;GCP → Secret Manager&lt;/li&gt;
&lt;li&gt;Kubernetes → &lt;code&gt;kubectl create secret&lt;/code&gt; + mount as env vars&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3 — Add a staging environment
&lt;/h3&gt;

&lt;p&gt;Never deploy directly from localhost to production. The staging environment catches environment-specific bugs (different OS, different network, different secret values) before they hit users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;localhost (dev)
     ↓
  docker-compose up  →  everything runs locally, identical to prod
     ↓
  staging (cloud)    →  same Docker image, real secrets, limited traffic
     ↓
  production         →  same image promoted from staging, full traffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key principle: &lt;strong&gt;the same Docker image&lt;/strong&gt; travels through all three environments. You're not rebuilding for prod — you're promoting a tested image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Choose your production infrastructure
&lt;/h3&gt;

&lt;p&gt;Three main options depending on your scale and team:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Scaling&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Run / AWS Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateless agents, variable traffic&lt;/td&gt;
&lt;td&gt;Automatic (serverless)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS ECS / Azure Container Apps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams without Kubernetes expertise&lt;/td&gt;
&lt;td&gt;Manual or auto&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes (EKS, GKE, AKS)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large scale, multi-agent systems&lt;/td&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Practical recommendation:&lt;/strong&gt; Start with Cloud Run or ECS. Kubernetes is justified only when you have multiple agent types, high traffic, and a dedicated DevOps function.&lt;/p&gt;

&lt;p&gt;For Cloud Run (simplest path from Docker to production):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build and push your image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; gcr.io/your-project/your-agent:v1.0.0 &lt;span class="nb"&gt;.&lt;/span&gt;
docker push gcr.io/your-project/your-agent:v1.0.0

&lt;span class="c"&gt;# Deploy&lt;/span&gt;
gcloud run deploy your-agent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; gcr.io/your-project/your-agent:v1.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--platform&lt;/span&gt; managed &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; europe-west1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt; 2Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt; 60s &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;openai-key:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;--memory 2Gi&lt;/code&gt; minimum — LLM applications need at least 1-2GB RAM. And &lt;code&gt;--timeout 60s&lt;/code&gt; accounts for multi-step reasoning chains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5 — Handle concurrency with a queue
&lt;/h3&gt;

&lt;p&gt;At low traffic (&amp;lt; 100 requests/day), a single process is fine. At scale, you need to separate request intake from execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incoming requests → Redis queue → Worker 1
                                → Worker 2
                                → Worker 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents a slow agent run (10+ LLM calls) from blocking all other requests. Queue depth (jobs waiting) and worker utilization (CPU/memory per worker) become your main scaling signals — add workers when the queue grows faster than it drains.&lt;/p&gt;
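&lt;p&gt;The intake/worker split can be sketched in-process with the standard library — here &lt;code&gt;queue.Queue&lt;/code&gt; stands in for Redis, and &lt;code&gt;payload.upper()&lt;/code&gt; stands in for the actual agent run:&lt;/p&gt;

```python
import queue
import threading

jobs = queue.Queue()           # stands in for the Redis queue
results = {}
lock = threading.Lock()

def worker():
    # Each worker pulls jobs independently, so one slow agent run
    # occupies one worker instead of blocking all intake
    while True:
        job = jobs.get()
        if job is None:        # poison pill: shut this worker down
            jobs.task_done()
            return
        request_id, payload = job
        outcome = payload.upper()        # stand-in for the actual agent run
        with lock:
            results[request_id] = outcome
        jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for i in range(10):            # request intake: enqueue and return immediately
    jobs.put((i, f"request-{i}"))
for _ in workers:
    jobs.put(None)
jobs.join()
print(len(results))   # 10
```

&lt;p&gt;In the real deployment the queue lives in Redis and the workers are separate processes or containers, but the scaling signal is the same: watch &lt;code&gt;jobs.qsize()&lt;/code&gt; (queue depth) and add workers when it grows faster than it drains.&lt;/p&gt;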




&lt;h2&gt;
  
  
  The real problems in production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hallucinations and output quality
&lt;/h3&gt;

&lt;p&gt;Hallucinations don't work like classic software bugs. An agent doesn't "crash" when it hallucinates — it answers confidently while inventing information. In a multi-step workflow, an early hallucination can contaminate all following steps.&lt;/p&gt;

&lt;p&gt;Beware of misleading metrics. An 85% accuracy rate at launch may seem solid. If it drops to 72% three months later, that's a signal of model drift or data misalignment — not normal fluctuation.&lt;/p&gt;

&lt;p&gt;Measuring hallucinations in production today relies mainly on the "LLM-as-judge" approach: one model evaluates another model's outputs on consistency, factuality, and grounding in sources. It's imperfect but operational at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drift and stack instability
&lt;/h3&gt;

&lt;p&gt;The AI stack moves fast — too fast to be stable. In the regulated sector, 70% of teams rebuild their agent stack every three months or faster. Each rebuild loses behavioral continuity. What you validated in January may no longer be valid in April if you changed model, framework version, or data pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration with existing systems
&lt;/h3&gt;

&lt;p&gt;Salesforce acknowledged that its Einstein Copilot encountered difficulties in pilot because it could not reliably navigate between customer data silos and existing CRM workflows. This case isn't isolated — it's the norm. McKinsey notes that organizations reporting significant ROI from AI projects are twice as likely to have reconfigured their workflows end-to-end &lt;em&gt;before&lt;/em&gt; deploying the agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: the non-negotiable foundation
&lt;/h2&gt;

&lt;p&gt;89% of teams with agents in production have implemented some form of observability. Among teams planning investments in the coming year, improving observability is the top priority, cited by 62% of production teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to trace
&lt;/h3&gt;

&lt;p&gt;An AI agent is not a classic web service. A single user request can trigger 15+ LLM calls across multiple chains, models, and tools. Standard monitoring tools (uptime, API latency) don't measure what matters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full traces&lt;/strong&gt; — every reasoning step, every tool call, every intermediate decision, with inputs/outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality metrics&lt;/strong&gt; — relevance, factuality, instruction compliance, consistency over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per request&lt;/strong&gt; — the top 5% most expensive requests often consume 50% of tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency by percentile&lt;/strong&gt; — p50, p95, p99 (not just average: slow requests are the ones that generate complaints)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift detection&lt;/strong&gt; — compare performance across prompt versions, models, or time windows&lt;/li&gt;
&lt;/ul&gt;
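&lt;p&gt;The latency and cost items above can be sketched in a few lines of plain Python (nearest-rank percentiles; the request data is invented for illustration):&lt;/p&gt;

```python
from math import ceil

def percentile(values, p):
    """Nearest-rank percentile, p in 0..100."""
    ordered = sorted(values)
    rank = max(1, ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def top_share(tokens_per_request, fraction=0.05):
    """Share of all tokens consumed by the most expensive requests."""
    ordered = sorted(tokens_per_request, reverse=True)
    k = max(1, ceil(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)

latencies = [0.8, 0.9, 1.1, 1.2, 1.3, 1.5, 2.0, 2.4, 6.5, 9.0]  # seconds
print(percentile(latencies, 50), percentile(latencies, 95))  # median vs tail
tokens = [500] * 19 + [40_000]  # one runaway request among twenty
print(round(top_share(tokens), 2))  # one request, most of the bill
```

Note how far p95 sits from the median here: averages would hide exactly the requests that generate complaints.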

&lt;h3&gt;
  
  
  Market tools (2025-2026)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; &lt;em&gt;(open-source, self-hosted)&lt;/em&gt;: full traces with replay, prompt versioning, evaluations. De facto standard for teams that want full control of their data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arize Phoenix&lt;/strong&gt;: unified observability for traditional ML + LLM, "council of judges" approach for evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; &lt;em&gt;(LangChain)&lt;/em&gt;: native integration for LangChain/LangGraph projects, execution chain visualization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog LLM Observability&lt;/strong&gt;: for teams already on Datadog — integrates AI monitoring into the existing observability stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're looking for a platform that integrates observability, human supervision, and agent control natively — without stitching together five tools — that's exactly what we built at &lt;a href="https://www.o137.ai/" rel="noopener noreferrer"&gt;Origin 137&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: the real choices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Containerization and orchestration
&lt;/h3&gt;

&lt;p&gt;Docker + Kubernetes is the de facto standard for production deployments. Docker ensures reproducibility. Kubernetes handles scaling, load balancing, and automatic recovery on failure. For execution mode: if your agents must handle traffic spikes, queue mode (Redis + workers) separates scheduling from execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG vs fine-tuning
&lt;/h3&gt;

&lt;p&gt;Most production teams use off-the-shelf models without fine-tuning, with manually tuned prompts. Fine-tuning complexity is only justified for very specific use cases. RAG (Retrieval-Augmented Generation) remains the preferred solution to ground responses in verifiable sources and reduce hallucinations.&lt;/p&gt;
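&lt;p&gt;A toy sketch of the RAG idea: retrieve, then constrain the model to the retrieved sources. Keyword overlap stands in for a real embedding retriever here, purely to keep the example dependency-free:&lt;/p&gt;

```python
def keyword_retrieve(query, docs, k=2):
    """Toy retriever ranking documents by query-term overlap. A production
    system would use embeddings; word overlap keeps the sketch runnable."""
    terms = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(terms.intersection(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query, docs):
    """Assemble a prompt that constrains the model to the retrieved sources."""
    sources = "\n".join(f"[{i + 1}] {d}"
                        for i, d in enumerate(keyword_retrieve(query, docs)))
    return f"Answer only from these sources, citing [n]:\n{sources}\n\nQuestion: {query}"

docs = ["Refund window is 30 days.", "Shipping takes 5 days.", "Support is 24/7."]
print(grounded_prompt("what is the refund window", docs))
```

The grounding comes from the prompt contract: the model is asked to cite numbered sources, which makes its answers auditable against the retrieved text.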

&lt;h3&gt;
  
  
  Multi-agent or single agent?
&lt;/h3&gt;

&lt;p&gt;The move toward distributed multi-agent systems is real in large enterprises. But beware: each additional agent multiplies communication paths, conflict scenarios, and coordination requirements. Berkeley teams observe that 68% of production agents stop in fewer than 10 steps before human intervention — a sign that complexity remains deliberately limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common pitfall:&lt;/strong&gt; Agents can end up in infinite loops — retrying failed operations indefinitely, or continuing to process already completed tasks. Defining explicit termination conditions is not optional.&lt;/p&gt;
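&lt;p&gt;A minimal sketch of such an explicit termination condition; the 10-step budget is an illustrative limit, not a recommendation:&lt;/p&gt;

```python
class StepBudgetExceeded(Exception):
    """Raised so the caller can escalate to a human instead of looping."""

def run_agent(step, done, max_steps=10):
    """Drive an agent loop under an explicit termination condition.
    `step` advances the state and `done` checks completion."""
    state = None
    for _ in range(max_steps):
        state = step(state)
        if done(state):
            return state
    raise StepBudgetExceeded(f"no completion after {max_steps} steps")

# A task that would otherwise loop forever is cut off and escalated.
try:
    run_agent(step=lambda s: s, done=lambda s: False)
except StepBudgetExceeded as exc:
    print("escalated:", exc)
```

The important design choice is that exhaustion raises rather than returning partial output: a silent partial result is exactly the failure mode described above.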




&lt;h2&gt;
  
  
  Human supervision: not a stopgap
&lt;/h2&gt;

&lt;p&gt;In the vast majority of production cases, agents pass their results to humans rather than to other systems. That's not a lack of trust in the technology — it's deliberate architecture.&lt;/p&gt;

&lt;p&gt;Forrester states it clearly in its 2025 AI Model Overview Report: AI agents fail in unexpected and costly ways, with failure modes that don't resemble classic software bugs. They emerge from ambiguity, poor coordination, and unpredictable systemic dynamics.&lt;/p&gt;

&lt;p&gt;Human supervision isn't a temporary limitation until models improve. It's an architectural component that enables responsible deployment today while maintaining auditability and legal accountability.&lt;/p&gt;




&lt;h2&gt;
  
  
  The KPIs that actually matter
&lt;/h2&gt;

&lt;p&gt;Uptime (99.9% vs 95%) is a relevant KPI for infrastructure, not for evaluating an AI agent. The metrics that matter in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task completion rate&lt;/strong&gt; — does the agent actually accomplish the requested task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination rate&lt;/strong&gt; — measured continuously via automated evaluations on real traffic samples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p95 and p99 latency&lt;/strong&gt; — the slowest users define perceived experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human escalation rate&lt;/strong&gt; — too low can mean false confidence; too high indicates a quality problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per successful request&lt;/strong&gt; — not total cost, but cost relative to actually useful outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality drift over time&lt;/strong&gt; — weekly or monthly comparison of evaluation scores&lt;/li&gt;
&lt;/ul&gt;
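&lt;p&gt;Two of these KPIs as a quick sketch; the escalation band edges and the traffic numbers are invented for illustration and should be tuned per use case:&lt;/p&gt;

```python
def cost_per_success(total_cost, successes):
    """Cost relative to actually useful outputs, not raw request count."""
    return total_cost / successes if successes else float("inf")

def escalation_band(escalations, requests, low=0.02, high=0.30):
    """Classify the human-escalation rate. The band edges are illustrative."""
    rate = escalations / requests
    if low > rate:
        return "suspiciously low"   # possible false confidence
    if rate > high:
        return "quality problem"    # too much work bounces to humans
    return "healthy"

print(cost_per_success(120.0, 800))   # dollars per successful request
print(escalation_band(40, 800))
```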




&lt;h2&gt;
  
  
  What it means in practice
&lt;/h2&gt;

&lt;p&gt;If you're starting an AI agent project in 2026, the data suggests this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define what "reliable" means&lt;/strong&gt; for your specific use case — not in general. What error rate is acceptable? What latency? When must a human be in the loop?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerize from day one.&lt;/strong&gt; A proper Dockerfile + docker-compose.yml from the start eliminates an entire class of "works on my machine" problems before they happen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Put observability in before launch.&lt;/strong&gt; Not after. Langfuse or Arize Phoenix open-source are enough to start. Without full traces, you can't debug, improve, or justify the agent's decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a staging environment.&lt;/strong&gt; The same Docker image travels from localhost → staging → production. Never rebuild for prod.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconfigure workflows before plugging in the agent.&lt;/strong&gt; McKinsey data is clear: organizations that re-design their processes upfront are twice as likely to achieve significant ROI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay simple until complexity is justified.&lt;/strong&gt; A 5-step agent with well-designed human supervision is more reliable — and more useful — than a 20-step autonomous agent that produces silent errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for stack instability.&lt;/strong&gt; If 70% of teams in regulated sectors rebuild their stack every three months, that's the norm. Architect with swappable modules. Don't marry one framework.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Main sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LangChain, &lt;em&gt;State of Agent Engineering&lt;/em&gt;, Dec. 2025 (n=1,340 professionals)&lt;/li&gt;
&lt;li&gt;Cleanlab, &lt;em&gt;AI Agents in Production 2025&lt;/em&gt; (MIT State of AI in Business 2025, n=1,837)&lt;/li&gt;
&lt;li&gt;UC Berkeley, &lt;em&gt;Measuring Agents in Production&lt;/em&gt;, Melissa Pan et al. (n=300+ teams)&lt;/li&gt;
&lt;li&gt;McKinsey, &lt;em&gt;State of AI 2025&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Forrester, &lt;em&gt;2025 AI Model Overview Report&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Docker, &lt;em&gt;Agentic AI Applications — official documentation&lt;/em&gt;, 2025-2026&lt;/li&gt;
&lt;li&gt;Docker, &lt;em&gt;Build AI Agents with Docker Compose&lt;/em&gt;, Nov. 2025&lt;/li&gt;
&lt;li&gt;MachineLearningMastery, &lt;em&gt;Deploying AI Agents to Production: Architecture, Infrastructure, and Implementation Roadmap&lt;/em&gt;, Mar. 2026&lt;/li&gt;
&lt;li&gt;n8n Blog, &lt;em&gt;15 best practices for deploying AI agents in production&lt;/em&gt;, Jan. 2026&lt;/li&gt;
&lt;li&gt;FreeCodeCamp, &lt;em&gt;How to Build and Deploy a Multi-Agent AI System with Python and Docker&lt;/em&gt;, Feb. 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This article synthesizes public data available in March 2026. Figures may evolve rapidly in this space.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Not sure where to start with your own agent?
&lt;/h2&gt;

&lt;p&gt;We offer a free 20-minute workshop to help you define your first agentic use case — what to automate, how to scope it, and what production readiness actually looks like for your context.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
