<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hugo</title>
    <description>The latest articles on Forem by Hugo (@hugo_o137).</description>
    <link>https://forem.com/hugo_o137</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812725%2Fb3e35184-36d9-4ef4-b7db-dd5b42abc601.png</url>
      <title>Forem: Hugo</title>
      <link>https://forem.com/hugo_o137</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hugo_o137"/>
    <language>en</language>
    <item>
      <title>How to Monitor AI Agents in Production</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Mon, 09 Mar 2026 07:31:40 +0000</pubDate>
      <link>https://forem.com/hugo_o137/how-to-monitor-ai-agents-in-production-10ll</link>
      <guid>https://forem.com/hugo_o137/how-to-monitor-ai-agents-in-production-10ll</guid>
      <description>&lt;p&gt;Uptime monitoring is not enough. Here's what you actually need to track, why agent failures are mostly silent, and which tools the industry uses today.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why monitoring an AI agent is different
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring is built around a simple contract: the system either works or it doesn't. A server is up or down. An API returns 200 or 500. Alerts fire, someone fixes it.&lt;/p&gt;

&lt;p&gt;AI agents break this contract. An agent can be fully available — no crashes, no timeouts, no error codes — while producing wrong answers, calling the wrong tool, or fabricating information. From an infrastructure perspective, everything looks healthy. From a user perspective, the agent is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The silent failure problem.&lt;/strong&gt; The biggest production incidents with agents don't throw exceptions. They look like: a confident answer that's factually wrong, a tool call that partially succeeded, a workflow that loops until it hits a timeout. None of these trigger a standard alert.&lt;/p&gt;

&lt;p&gt;This is why the AI industry has converged on a broader concept than monitoring: observability. The goal isn't just to know if the agent is running — it's to understand what it's doing, step by step, and whether it's doing it correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to track: the five layers
&lt;/h2&gt;

&lt;p&gt;A production AI agent generates several distinct types of telemetry. You need all of them — each layer reveals failures that the others miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Traces
&lt;/h3&gt;

&lt;p&gt;A trace is the complete execution record of one agent interaction: every step, every decision, every tool call, every intermediate output, with timestamps. For a multi-step agent, a single user request can trigger dozens of internal operations. Without traces, when something goes wrong you have no way to know at which step it happened or why.&lt;/p&gt;

&lt;p&gt;What good tracing looks like: you can replay any past interaction exactly as it happened, inspect each step in isolation, and compare the execution path when the agent worked correctly versus when it failed.&lt;/p&gt;
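&lt;p&gt;A minimal sketch of what trace capture can look like in application code. The names (&lt;code&gt;Trace&lt;/code&gt;, &lt;code&gt;record_step&lt;/code&gt;) are illustrative, not from any specific library — in practice you'd emit OpenTelemetry spans instead:&lt;/p&gt;

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str        # e.g. "llm_call" or "tool:search"
    inputs: dict
    outputs: dict
    started_at: float
    ended_at: float

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    steps: list = field(default_factory=list)

    def record_step(self, name, inputs, outputs, started_at, ended_at):
        self.steps.append(Step(name, inputs, outputs, started_at, ended_at))

    def replay(self):
        # Yield each step in order, so a past interaction can be
        # inspected step by step after the fact
        for step in self.steps:
            yield step.name, step.inputs, step.outputs, step.ended_at - step.started_at

# Usage: wrap each agent operation and record its inputs/outputs
trace = Trace()
t0 = time.monotonic()
result = {"answer": "42"}                  # stand-in for a real LLM or tool call
trace.record_step("llm_call", {"prompt": "..."}, result, t0, time.monotonic())

for name, inputs, outputs, duration in trace.replay():
    print(name, duration >= 0)
```

&lt;p&gt;The point is the shape of the record, not the storage: every step carries its own inputs, outputs, and timing, so a failed run can be compared against a successful one step by step.&lt;/p&gt;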

&lt;h3&gt;
  
  
  2. Quality metrics
&lt;/h3&gt;

&lt;p&gt;This is what separates AI monitoring from infrastructure monitoring. You need to measure whether the agent's outputs are actually correct — not just fast and available.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task completion rate&lt;/strong&gt; — did the agent accomplish what the user asked?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination detection&lt;/strong&gt; — did the agent produce claims not grounded in its sources or tool outputs? Measured via automated "LLM-as-judge" evaluation on sampled traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool selection quality&lt;/strong&gt; — did the agent call the right tool, with the right parameters?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction adherence&lt;/strong&gt; — did the agent follow its system prompt, formatting rules, and policy constraints?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn consistency&lt;/strong&gt; — does the agent contradict itself across conversation turns?&lt;/li&gt;
&lt;/ul&gt;
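&lt;p&gt;The hallucination check above is typically automated with an LLM-as-judge prompt. Here is a sketch of the deterministic parts — building the judge prompt and parsing its verdict — with the actual model call left out (&lt;code&gt;call_judge_model&lt;/code&gt; would be your evaluation endpoint; it is an assumed placeholder, not a real API):&lt;/p&gt;

```python
def build_judge_prompt(answer: str, sources: list[str]) -> str:
    # Ask a second model to check grounding, answering in a fixed format
    joined = "\n".join(f"- {s}" for s in sources)
    return (
        "You are a strict evaluator. Given the sources below, decide whether "
        "every claim in the answer is supported by them.\n"
        f"Sources:\n{joined}\n\nAnswer:\n{answer}\n\n"
        "Reply with exactly one line: VERDICT: GROUNDED or VERDICT: HALLUCINATED"
    )

def parse_verdict(judge_output: str) -> bool:
    # Returns True if the judge flagged a hallucination
    for line in judge_output.splitlines():
        if line.strip().startswith("VERDICT:"):
            return "HALLUCINATED" in line
    raise ValueError("judge output did not contain a VERDICT line")

# judge_output = call_judge_model(build_judge_prompt(answer, sources))
print(parse_verdict("VERDICT: HALLUCINATED"))  # → True
```

&lt;p&gt;Forcing a fixed output format makes the judge's answer machine-parseable, which is what lets you run it on sampled traffic and aggregate the results into a metric.&lt;/p&gt;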

&lt;h3&gt;
  
  
  3. Latency — by percentile, not average
&lt;/h3&gt;

&lt;p&gt;Average latency hides the problem. A multi-step agent might respond in 800ms most of the time, but take 15 seconds for complex queries. The users who experience those 15-second waits drive complaints and churn — the average never shows it.&lt;/p&gt;

&lt;p&gt;Track p50, p95, and p99. The p99 (the slowest 1% of requests) is what defines the worst-case user experience. Set alerts on p95 and p99, not on averages.&lt;/p&gt;
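&lt;p&gt;Percentiles are cheap to compute from raw latency samples; a nearest-rank sketch:&lt;/p&gt;

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that
    at least p% of all samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based rank
    return ordered[max(rank, 1) - 1]

latencies_ms = list(range(1, 101))             # toy data: 1..100 ms
print(percentile(latencies_ms, 50))   # 50
print(percentile(latencies_ms, 95))   # 95
print(percentile(latencies_ms, 99))   # 99
```

&lt;p&gt;On this toy data the mean is ~50 ms and would look identical whether the tail is 100 ms or 15 seconds — the p99 is what exposes the difference.&lt;/p&gt;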

&lt;h3&gt;
  
  
  4. Cost per request
&lt;/h3&gt;

&lt;p&gt;Token costs are not evenly distributed. A small proportion of requests typically accounts for a disproportionate share of your LLM spend. Without per-request cost tracking, you can't identify which queries, workflows, or user segments are burning your budget — and you can't optimize.&lt;/p&gt;

&lt;p&gt;Track cost at the trace level, broken down by model, endpoint, and if possible by user segment or workflow type.&lt;/p&gt;
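&lt;p&gt;The per-trace arithmetic is simple once you log token counts per call. A sketch — the model names and per-token prices here are placeholders, not real rates; look up your provider's current pricing:&lt;/p&gt;

```python
# Placeholder prices per 1K tokens — substitute your provider's actual rates
PRICE_PER_1K = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.0100, "output": 0.0300},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICE_PER_1K[model]
    return (input_tokens / 1000) * rates["input"] + \
           (output_tokens / 1000) * rates["output"]

def trace_cost(steps: list[dict]) -> float:
    # Sum cost across every LLM call inside one trace
    return sum(request_cost(s["model"], s["in"], s["out"]) for s in steps)

steps = [
    {"model": "large-model", "in": 4000, "out": 800},
    {"model": "small-model", "in": 1200, "out": 300},
]
print(round(trace_cost(steps), 4))
```

&lt;p&gt;Attaching this number to each trace is what lets you group spend by model, endpoint, or user segment afterwards.&lt;/p&gt;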

&lt;h3&gt;
  
  
  5. Drift over time
&lt;/h3&gt;

&lt;p&gt;An agent that performs well at launch can degrade over weeks without any code change. Reasons include: changes in how users phrase requests, upstream data quality shifts, model provider updates, or subtle prompt regressions after a deployment. Without longitudinal quality tracking, drift is invisible until it's severe.&lt;/p&gt;

&lt;p&gt;Run automated quality evaluations continuously on sampled production traffic, and compare scores week-over-week. A consistent downward trend is a signal to act before users notice.&lt;/p&gt;
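&lt;p&gt;The week-over-week comparison can be as simple as flagging a run of consecutive drops in the mean evaluation score. A sketch (the &lt;code&gt;min_drop&lt;/code&gt; threshold is an illustrative assumption, not a standard value):&lt;/p&gt;

```python
def weekly_trend(weekly_scores: list[float], min_drop: float = 0.02) -> str:
    """Flag a consistent downward trend: every week drops by at least
    min_drop compared to the previous one."""
    if len(weekly_scores) < 3:
        return "insufficient-data"
    drops = [a - b for a, b in zip(weekly_scores, weekly_scores[1:])]
    if all(d >= min_drop for d in drops):
        return "degrading"
    return "stable"

# Four weeks of mean judge scores on sampled production traffic
print(weekly_trend([0.88, 0.85, 0.81, 0.78]))  # degrading
print(weekly_trend([0.88, 0.87, 0.88, 0.86]))  # stable
```

&lt;p&gt;Requiring several consecutive drops filters out normal week-to-week noise while still catching the slow slide that drift produces.&lt;/p&gt;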




&lt;h2&gt;
  
  
  How agent failures actually look in production
&lt;/h2&gt;

&lt;p&gt;Understanding the failure modes helps you set up the right alerts. Agent failures in production tend to fall into a few recurring patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  The wrong tool, confidently called
&lt;/h3&gt;

&lt;p&gt;The agent selects a plausible-looking tool but the wrong one for the task. The call succeeds (no error), it returns data, and the agent builds its response on that data — which is irrelevant or misleading. The entire downstream output is flawed, but nothing in your infrastructure logs flags it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infinite loops
&lt;/h3&gt;

&lt;p&gt;An agent retries a failed operation repeatedly, or continues processing a task that was already completed. This burns compute and token budget silently, and can corrupt data through duplicate operations. Define explicit termination conditions and set circuit breakers on retry loops.&lt;/p&gt;
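&lt;p&gt;A circuit breaker of this kind is a few lines of code: a hard step budget for the whole run plus a retry budget per operation. A sketch (the class and its limits are illustrative, not from a specific framework):&lt;/p&gt;

```python
class LoopBreaker:
    """Hard limits on agent iteration: a step budget for the run and a
    retry budget per operation. Trips before a runaway loop burns
    compute or duplicates side effects."""

    def __init__(self, max_steps: int = 20, max_retries: int = 3):
        self.max_steps = max_steps
        self.max_retries = max_retries
        self.steps = 0
        self.retries = {}

    def check_step(self):
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError(f"step budget exceeded ({self.max_steps})")

    def check_retry(self, operation: str):
        self.retries[operation] = self.retries.get(operation, 0) + 1
        if self.retries[operation] > self.max_retries:
            raise RuntimeError(f"retry budget exceeded for {operation!r}")

# Usage inside the agent loop
breaker = LoopBreaker(max_steps=5)
tripped = False
try:
    while True:              # simulates an agent that never terminates on its own
        breaker.check_step()
except RuntimeError:
    tripped = True
print(tripped)  # True
```

&lt;p&gt;The key design point is that the limits live outside the agent's own reasoning: the model cannot talk itself past them.&lt;/p&gt;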

&lt;h3&gt;
  
  
  Context loss in multi-turn conversations
&lt;/h3&gt;

&lt;p&gt;In longer sessions, the agent loses track of constraints or prior decisions established earlier in the conversation. It starts contradicting itself or ignoring instructions it acknowledged a few turns back. This is hard to catch with per-request monitoring — it only shows up in session-level analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt drift after deployment
&lt;/h3&gt;

&lt;p&gt;A prompt change that looked fine in testing degrades performance on a class of production queries that wasn't represented in the test set. This shows up as a gradual decline in quality scores for a specific intent type — catchable with segment-level evaluation, invisible with aggregate metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  The tools the industry uses today
&lt;/h2&gt;

&lt;p&gt;The observability ecosystem for AI agents has matured significantly. OpenTelemetry has emerged as the industry standard for collecting telemetry — it's vendor-neutral, which means your trace data stays portable across tools. Most major frameworks (LangChain, CrewAI, OpenAI Agents SDK) emit OpenTelemetry-compatible traces natively.&lt;/p&gt;

&lt;p&gt;On top of that foundation, several purpose-built platforms have emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langfuse&lt;/strong&gt; — Open-source · Self-hostable&lt;br&gt;
Full trace replay, prompt versioning, cost tracking, LLM-as-judge evaluations. Standard choice for teams that want data sovereignty and full control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Arize Phoenix&lt;/strong&gt; — Open-source · Cloud option&lt;br&gt;
Strong on drift detection and embedding monitoring. Good for teams that need to track model-level performance degradation alongside agent behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; — LangChain ecosystem&lt;br&gt;
Deep integration for LangChain/LangGraph stacks. Execution graph visualization, prompt comparison, dataset-based regression testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog LLM Observability&lt;/strong&gt; — Enterprise · Full-stack&lt;br&gt;
Connects AI monitoring to your existing infrastructure observability. Best for teams already on Datadog who want unified dashboards across infra and agents.&lt;/p&gt;

&lt;p&gt;All four support OpenTelemetry as a data source, so you're not locked in. The practical choice depends on whether you prioritize data control (Langfuse, Arize), ecosystem fit (LangSmith), or infra consolidation (Datadog).&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitoring for compliance and governance
&lt;/h2&gt;

&lt;p&gt;For teams in regulated industries — finance, healthcare, legal, HR — monitoring isn't just an operational concern. It's a legal one.&lt;/p&gt;

&lt;p&gt;An AI agent that influences decisions (a loan recommendation, a candidate screening, a customer response) needs an audit trail that can answer: what did the agent receive as input, what did it output, which tools did it call, and what model version was running at the time? Without this, you can't respond to a compliance inquiry or a regulatory audit.&lt;/p&gt;

&lt;p&gt;This means monitoring infrastructure needs to capture and store, in a tamper-evident way: full input/output logs with timestamps, model version and configuration at time of execution, tool calls and their results, and any human approval or override events.&lt;/p&gt;
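&lt;p&gt;One common way to make such logs tamper-evident is to hash-chain the entries: each record's hash covers both its own content and the previous record's hash, so editing any past entry invalidates everything after it. A minimal sketch:&lt;/p&gt;

```python
import hashlib
import json

def append_entry(log: list[dict], record: dict) -> None:
    """Append an audit record whose hash covers both its content and
    the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})

def verify_chain(log: list[dict]) -> bool:
    prev_hash = "0" * 64
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != hashlib.sha256((prev_hash + payload).encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"input": "...", "output": "...", "model": "m-1", "tools": []})
append_entry(log, {"input": "...", "output": "...", "model": "m-1", "tools": ["crm"]})
print(verify_chain(log))   # True
log[0]["record"]["output"] = "edited"
print(verify_chain(log))   # False
```

&lt;p&gt;In production you'd also need append-only storage and retention policies, but the chain is what lets an auditor verify that nothing was rewritten after the fact.&lt;/p&gt;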

&lt;p&gt;&lt;strong&gt;What Azure AI Foundry's team noted on this:&lt;/strong&gt; Traditional observability covers metrics, logs, and traces. Agent observability adds two layers on top: evaluations (did the agent achieve the right outcome?) and governance (did it operate within policy and compliance constraints?). Both are needed for production in regulated environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  A practical setup to start with
&lt;/h2&gt;

&lt;p&gt;If you're instrumenting a production agent for the first time, here's a reasonable sequence that doesn't require months of infrastructure work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 1: Traces.&lt;/strong&gt; Instrument your agent with OpenTelemetry or plug into Langfuse directly. Make sure every execution generates a trace with inputs, outputs, tool calls, and latency per step. This alone gives you the ability to debug failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1: Latency and cost dashboards.&lt;/strong&gt; Set up per-request cost tracking and p95/p99 latency monitoring. Set alerts for cost anomalies (sudden spikes in token spend) and for latency regressions after deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2: Quality evaluations.&lt;/strong&gt; Define 3–5 evaluation criteria specific to your use case (relevance, factual grounding, policy adherence). Run them automatically on a sample of production traffic. Establish a baseline score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 1: Drift monitoring.&lt;/strong&gt; Compare quality scores week-over-week. Add segment-level breakdowns (by intent type, user segment, or workflow) to catch regressions that don't show up in aggregate metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ongoing: Audit trail.&lt;/strong&gt; If you're in a regulated context, ensure logs are stored with version context (model, prompt hash, config) and are accessible for compliance review.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Deploy AI Agents in Production: The Practical 2026 Guide</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Sun, 08 Mar 2026 10:05:25 +0000</pubDate>
      <link>https://forem.com/hugo_o137/deploy-ai-agents-in-production-the-practical-2026-guide-2p7h</link>
      <guid>https://forem.com/hugo_o137/deploy-ai-agents-in-production-the-practical-2026-guide-2p7h</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="//www.o137.ai"&gt;o137.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The demo was impressive. Production is another story.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
What enterprise reports really say — and what it means in practice.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Based on: LangChain State of Agents 2026, Cleanlab Enterprise Report, UC Berkeley MAP, McKinsey State of AI, Docker official documentation&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The demo/production gap is real — and massive
&lt;/h2&gt;

&lt;p&gt;In 2024-2025, AI agent demos proliferated. An agent that answers in natural language, uses tools, chains actions across multiple steps — on stage or in a notebook, it impresses.&lt;/p&gt;

&lt;p&gt;In production, it's different. Not slightly different. Fundamentally different.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key finding — Cleanlab / MIT 2025
&lt;/h3&gt;

&lt;p&gt;Of 1,837 companies surveyed on their AI agent deployment, only 95 actually had an agent in production with real user interactions. And among those 95, the majority remained in an early maturity phase.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: AI Agents in Production 2025, Cleanlab (based on MIT State of AI in Business 2025 data)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's not a model problem. LLMs work. The problem is everything around them: infrastructure, evaluation, governance, team trust.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Most so-called AI agents can't reliably do what they claim."&lt;br&gt;&lt;br&gt;
— Curtis Northcutt, CEO of Cleanlab&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  What "production" really requires
&lt;/h2&gt;

&lt;p&gt;The original article listed correct requirements but without quantified context. Here's what the data shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;57%&lt;/strong&gt; of surveyed companies have agents in production &lt;em&gt;(LangChain, 1,300+ respondents, 2025)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;32%&lt;/strong&gt; cite quality as the main barrier to production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;89%&lt;/strong&gt; of production teams have implemented some form of observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;68%&lt;/strong&gt; of agents run fewer than 10 steps before human intervention &lt;em&gt;(Berkeley MAP)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Sources: LangChain State of Agent Engineering (Dec. 2025, n=1,340); UC Berkeley Measuring Agents in Production (n=300+)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume and latency.&lt;/strong&gt; An application with 10,000 requests/day does not have the same constraints as a 10-request prototype. Latency has become the second most cited challenge (20% of teams), especially for multi-step agents where each LLM call adds up. Practical recommendations: aim under 500ms for a conversational agent, under 2 seconds for complex analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability, not uptime.&lt;/strong&gt; Traditional uptime (99.9%) is not the right metric for an AI agent. An agent can be "available" but produce wrong answers, hallucinate, call the wrong tool, or get stuck in an infinite loop. These silent failures are more dangerous than a crash, because they trigger no alert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal traceability and audit.&lt;/strong&gt; In regulated sectors, 42% of companies plan to add supervision features (approvals, review controls) — versus only 16% in unregulated sectors. Without auditability of every decision, a production deployment exposes the company to regulatory risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human escalation.&lt;/strong&gt; Berkeley measured that 92.5% of production agents send their output to humans rather than to other systems. That's not a design flaw — it's a deliberate strategy to maintain reliability.&lt;/p&gt;


&lt;h2&gt;
  
  
  From localhost to production: the technical path
&lt;/h2&gt;

&lt;p&gt;This is where most guides stop being useful. "Deploy to the cloud" is not a step. Here's the concrete path.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why localhost ≠ production
&lt;/h3&gt;

&lt;p&gt;When your agent works on your machine, you're typically running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API keys hardcoded in a &lt;code&gt;.env&lt;/code&gt; file or directly in the code&lt;/li&gt;
&lt;li&gt;A single Python process with no restart on crash&lt;/li&gt;
&lt;li&gt;No logging, no monitoring, no concurrency handling&lt;/li&gt;
&lt;li&gt;Dependencies tied to your local Python version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of that survives production. Here's how to bridge the gap systematically.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1 — Containerize with Docker
&lt;/h3&gt;

&lt;p&gt;Docker is the standard because it solves the "works on my machine" problem definitively. Your agent runs in an isolated container with its own dependencies, Python version, and environment — identical across dev, staging, and prod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dockerfile (Python agent, FastAPI)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# --- Build stage ---&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# --- Runtime stage ---&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="c"&gt;# Never hardcode secrets here&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PYTHONUNBUFFERED=1&lt;/span&gt;

&lt;span class="c"&gt;# Health check: verify the agent AND its dependencies are up&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=30s --timeout=5s --start-period=10s --retries=3 \&lt;/span&gt;
  CMD python -c "import requests; requests.get('http://localhost:8000/health', timeout=3).raise_for_status()"

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-stage build&lt;/strong&gt; keeps the final image small (no build tools in production)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HEALTHCHECK&lt;/strong&gt; verifies the container is actually functional, not just running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No secrets&lt;/strong&gt; in the Dockerfile — ever&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2 — Manage secrets properly
&lt;/h3&gt;

&lt;p&gt;The most common mistake: API keys in code or committed &lt;code&gt;.env&lt;/code&gt; files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local development&lt;/strong&gt; — &lt;code&gt;.env&lt;/code&gt; file (never committed to git):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env — add this to .gitignore&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-...
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:password@localhost:5432/agent_db
&lt;span class="nv"&gt;REDIS_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;redis://localhost:6379/0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In docker-compose.yml&lt;/strong&gt; (local and staging):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;env_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.env&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;

  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis:7-alpine&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis-cli"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ping"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15-alpine&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_db&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_user&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_password&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD-SHELL"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg_isready&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-U&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent_user"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In production&lt;/strong&gt; — use your cloud provider's secret manager, never plain env vars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS → Secrets Manager&lt;/li&gt;
&lt;li&gt;GCP → Secret Manager&lt;/li&gt;
&lt;li&gt;Kubernetes → &lt;code&gt;kubectl create secret&lt;/code&gt; + mount as env vars&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3 — Add a staging environment
&lt;/h3&gt;

&lt;p&gt;Never deploy directly from localhost to production. The staging environment catches environment-specific bugs (different OS, different network, different secret values) before they hit users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;localhost (dev)
     ↓
  docker-compose up  →  everything runs locally, identical to prod
     ↓
  staging (cloud)    →  same Docker image, real secrets, limited traffic
     ↓
  production         →  same image promoted from staging, full traffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key principle: &lt;strong&gt;the same Docker image&lt;/strong&gt; travels through all three environments. You're not rebuilding for prod — you're promoting a tested image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Choose your production infrastructure
&lt;/h3&gt;

&lt;p&gt;Three main options depending on your scale and team:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Scaling&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Cloud Run / AWS Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateless agents, variable traffic&lt;/td&gt;
&lt;td&gt;Automatic (serverless)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS ECS / Azure Container Apps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams without Kubernetes expertise&lt;/td&gt;
&lt;td&gt;Manual or auto&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes (EKS, GKE, AKS)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large scale, multi-agent systems&lt;/td&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Practical recommendation:&lt;/strong&gt; Start with Cloud Run or ECS. Kubernetes is justified only when you have multiple agent types, high traffic, and a dedicated DevOps function.&lt;/p&gt;

&lt;p&gt;For Cloud Run (simplest path from Docker to production):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build and push your image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; gcr.io/your-project/your-agent:v1.0.0 &lt;span class="nb"&gt;.&lt;/span&gt;
docker push gcr.io/your-project/your-agent:v1.0.0

&lt;span class="c"&gt;# Deploy&lt;/span&gt;
gcloud run deploy your-agent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; gcr.io/your-project/your-agent:v1.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--platform&lt;/span&gt; managed &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; europe-west1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt; 2Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--timeout&lt;/span&gt; 60s &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-secrets&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;openai-key:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;--memory 2Gi&lt;/code&gt; minimum — LLM applications need at least 1-2GB RAM. And &lt;code&gt;--timeout 60s&lt;/code&gt; accounts for multi-step reasoning chains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5 — Handle concurrency with a queue
&lt;/h3&gt;

&lt;p&gt;At low traffic (&amp;lt; 100 requests/day), a single process is fine. At scale, you need to separate request intake from execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incoming requests → Redis queue → Worker 1
                                → Worker 2
                                → Worker 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents a slow agent run (10+ LLM calls) from blocking all other requests. Queue depth (jobs waiting) and worker utilization (CPU/memory per worker) become your main scaling signals — add workers when the queue grows faster than it drains.&lt;/p&gt;
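&lt;p&gt;The intake/worker split can be sketched in-process with the standard library — here &lt;code&gt;queue.Queue&lt;/code&gt; stands in for Redis, and &lt;code&gt;payload.upper()&lt;/code&gt; stands in for the actual agent run:&lt;/p&gt;

```python
import queue
import threading

jobs = queue.Queue()           # stands in for the Redis queue
results = {}
lock = threading.Lock()

def worker():
    # Each worker pulls jobs independently, so one slow agent run
    # occupies one worker instead of blocking all intake
    while True:
        job = jobs.get()
        if job is None:        # poison pill: shut this worker down
            jobs.task_done()
            return
        request_id, payload = job
        outcome = payload.upper()        # stand-in for the actual agent run
        with lock:
            results[request_id] = outcome
        jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for i in range(10):            # request intake: enqueue and return immediately
    jobs.put((i, f"request-{i}"))
for _ in workers:
    jobs.put(None)
jobs.join()
print(len(results))   # 10
```

&lt;p&gt;In the real deployment the queue lives in Redis and the workers are separate processes or containers, but the scaling signal is the same: watch &lt;code&gt;jobs.qsize()&lt;/code&gt; (queue depth) and add workers when it grows faster than it drains.&lt;/p&gt;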




&lt;h2&gt;
  
  
  The real problems in production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hallucinations and output quality
&lt;/h3&gt;

&lt;p&gt;Hallucinations don't work like classic software bugs. An agent doesn't "crash" when it hallucinates — it answers confidently while inventing information. In a multi-step workflow, an early hallucination can contaminate all following steps.&lt;/p&gt;

&lt;p&gt;Beware of misleading metrics. An 85% accuracy rate at launch may seem solid. If it drops to 72% three months later, that's a signal of model drift or data misalignment — not normal fluctuation.&lt;/p&gt;

&lt;p&gt;Measuring hallucinations in production today relies mainly on the "LLM-as-judge" approach: one model evaluates another model's outputs on consistency, factuality, and grounding in sources. It's imperfect but operational at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drift and stack instability
&lt;/h3&gt;

&lt;p&gt;The AI stack moves fast — too fast to be stable. In the regulated sector, 70% of teams rebuild their agent stack every three months or faster. Each rebuild loses behavioral continuity. What you validated in January may no longer be valid in April if you changed model, framework version, or data pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration with existing systems
&lt;/h3&gt;

&lt;p&gt;Salesforce acknowledged that its Einstein Copilot encountered difficulties in pilot because it could not reliably navigate between customer data silos and existing CRM workflows. This case isn't isolated — it's the norm. McKinsey notes that organizations reporting significant ROI from AI projects are twice as likely to have reconfigured their workflows end-to-end &lt;em&gt;before&lt;/em&gt; deploying the agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: the non-negotiable foundation
&lt;/h2&gt;

&lt;p&gt;89% of teams with agents in production have implemented some form of observability. Among teams planning investments in the coming year, improving observability is the top priority, cited by 62% of production teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to trace
&lt;/h3&gt;

&lt;p&gt;An AI agent is not a classic web service. A single user request can trigger 15+ LLM calls across multiple chains, models, and tools. Standard monitoring tools (uptime, API latency) don't measure what matters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full traces&lt;/strong&gt; — every reasoning step, every tool call, every intermediate decision, with inputs/outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality metrics&lt;/strong&gt; — relevance, factuality, instruction compliance, consistency over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per request&lt;/strong&gt; — the top 5% most expensive requests often consume 50% of tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency by percentile&lt;/strong&gt; — p50, p95, p99 (not just average: slow requests are the ones that generate complaints)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift detection&lt;/strong&gt; — compare performance across prompt versions, models, or time windows&lt;/li&gt;
&lt;/ul&gt;
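&lt;p&gt;The latency and cost items above can be sketched in a few lines of plain Python (nearest-rank percentiles; the request data is invented for illustration):&lt;/p&gt;

```python
from math import ceil

def percentile(values, p):
    """Nearest-rank percentile, p in 0..100."""
    ordered = sorted(values)
    rank = max(1, ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def top_share(tokens_per_request, fraction=0.05):
    """Share of all tokens consumed by the most expensive requests."""
    ordered = sorted(tokens_per_request, reverse=True)
    k = max(1, ceil(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)

latencies = [0.8, 0.9, 1.1, 1.2, 1.3, 1.5, 2.0, 2.4, 6.5, 9.0]  # seconds
print(percentile(latencies, 50), percentile(latencies, 95))  # median vs tail
tokens = [500] * 19 + [40_000]  # one runaway request among twenty
print(round(top_share(tokens), 2))  # one request, most of the bill
```

Note how far p95 sits from the median here: averages would hide exactly the requests that generate complaints.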

&lt;h3&gt;
  
  
  Market tools (2025-2026)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; &lt;em&gt;(open-source, self-hosted)&lt;/em&gt;: full traces with replay, prompt versioning, evaluations. De facto standard for teams that want full control of their data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arize Phoenix&lt;/strong&gt;: unified observability for traditional ML + LLM, "council of judges" approach for evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; &lt;em&gt;(LangChain)&lt;/em&gt;: native integration for LangChain/LangGraph projects, execution chain visualization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog LLM Observability&lt;/strong&gt;: for teams already on Datadog — integrates AI monitoring into the existing observability stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're looking for a platform that integrates observability, human supervision, and agent control natively — without stitching together five tools — that's exactly what we built at &lt;a href="https://www.o137.ai/" rel="noopener noreferrer"&gt;Origin 137&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: the real choices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Containerization and orchestration
&lt;/h3&gt;

&lt;p&gt;Docker + Kubernetes is the de facto standard for production deployments. Docker ensures reproducibility. Kubernetes handles scaling, load balancing, and automatic recovery on failure. For execution mode: if your agents must handle traffic spikes, queue mode (Redis + workers) separates scheduling from execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG vs fine-tuning
&lt;/h3&gt;

&lt;p&gt;Most production teams use off-the-shelf models without fine-tuning, with manually tuned prompts. Fine-tuning complexity is only justified for very specific use cases. RAG (Retrieval-Augmented Generation) remains the preferred solution to ground responses in verifiable sources and reduce hallucinations.&lt;/p&gt;
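&lt;p&gt;A toy sketch of the RAG idea: retrieve, then constrain the model to the retrieved sources. Keyword overlap stands in for a real embedding retriever here, purely to keep the example dependency-free:&lt;/p&gt;

```python
def keyword_retrieve(query, docs, k=2):
    """Toy retriever ranking documents by query-term overlap. A production
    system would use embeddings; word overlap keeps the sketch runnable."""
    terms = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(terms.intersection(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query, docs):
    """Assemble a prompt that constrains the model to the retrieved sources."""
    sources = "\n".join(f"[{i + 1}] {d}"
                        for i, d in enumerate(keyword_retrieve(query, docs)))
    return f"Answer only from these sources, citing [n]:\n{sources}\n\nQuestion: {query}"

docs = ["Refund window is 30 days.", "Shipping takes 5 days.", "Support is 24/7."]
print(grounded_prompt("what is the refund window", docs))
```

The grounding comes from the prompt contract: the model is asked to cite numbered sources, which makes its answers auditable against the retrieved text.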

&lt;h3&gt;
  
  
  Multi-agent or single agent?
&lt;/h3&gt;

&lt;p&gt;The move toward distributed multi-agent systems is real in large enterprises. But beware: each additional agent multiplies communication paths, conflict scenarios, and coordination requirements. Berkeley teams observe that 68% of production agents stop in fewer than 10 steps before human intervention — a sign that complexity remains deliberately limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common pitfall:&lt;/strong&gt; Agents can end up in infinite loops — retrying failed operations indefinitely, or continuing to process already completed tasks. Defining explicit termination conditions is not optional.&lt;/p&gt;
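&lt;p&gt;A minimal sketch of such an explicit termination condition; the 10-step budget is an illustrative limit, not a recommendation:&lt;/p&gt;

```python
class StepBudgetExceeded(Exception):
    """Raised so the caller can escalate to a human instead of looping."""

def run_agent(step, done, max_steps=10):
    """Drive an agent loop under an explicit termination condition.
    `step` advances the state and `done` checks completion."""
    state = None
    for _ in range(max_steps):
        state = step(state)
        if done(state):
            return state
    raise StepBudgetExceeded(f"no completion after {max_steps} steps")

# A task that would otherwise loop forever is cut off and escalated.
try:
    run_agent(step=lambda s: s, done=lambda s: False)
except StepBudgetExceeded as exc:
    print("escalated:", exc)
```

The important design choice is that exhaustion raises rather than returning partial output: a silent partial result is exactly the failure mode described above.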




&lt;h2&gt;
  
  
  Human supervision: not a stopgap
&lt;/h2&gt;

&lt;p&gt;In the vast majority of production cases, agents pass their results to humans rather than to other systems. That's not a lack of trust in the technology — it's deliberate architecture.&lt;/p&gt;

&lt;p&gt;Forrester states it clearly in its 2025 AI Model Overview Report: AI agents fail in unexpected and costly ways, with failure modes that don't resemble classic software bugs. They emerge from ambiguity, poor coordination, and unpredictable systemic dynamics.&lt;/p&gt;

&lt;p&gt;Human supervision isn't a temporary limitation until models improve. It's an architectural component that enables responsible deployment today while maintaining auditability and legal accountability.&lt;/p&gt;




&lt;h2&gt;
  
  
  The KPIs that actually matter
&lt;/h2&gt;

&lt;p&gt;Uptime (99.9% vs 95%) is a relevant KPI for infrastructure, not for evaluating an AI agent. The metrics that matter in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task completion rate&lt;/strong&gt; — does the agent actually accomplish the requested task?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination rate&lt;/strong&gt; — measured continuously via automated evaluations on real traffic samples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p95 and p99 latency&lt;/strong&gt; — the slowest users define perceived experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human escalation rate&lt;/strong&gt; — too low can mean false confidence; too high indicates a quality problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per successful request&lt;/strong&gt; — not total cost, but cost relative to actually useful outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality drift over time&lt;/strong&gt; — weekly or monthly comparison of evaluation scores&lt;/li&gt;
&lt;/ul&gt;
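&lt;p&gt;Two of these KPIs as a quick sketch; the escalation band edges and the traffic numbers are invented for illustration and should be tuned per use case:&lt;/p&gt;

```python
def cost_per_success(total_cost, successes):
    """Cost relative to actually useful outputs, not raw request count."""
    return total_cost / successes if successes else float("inf")

def escalation_band(escalations, requests, low=0.02, high=0.30):
    """Classify the human-escalation rate. The band edges are illustrative."""
    rate = escalations / requests
    if low > rate:
        return "suspiciously low"   # possible false confidence
    if rate > high:
        return "quality problem"    # too much work bounces to humans
    return "healthy"

print(cost_per_success(120.0, 800))   # dollars per successful request
print(escalation_band(40, 800))
```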




&lt;h2&gt;
  
  
  What it means in practice
&lt;/h2&gt;

&lt;p&gt;If you're starting an AI agent project in 2026, the data suggests this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define what "reliable" means&lt;/strong&gt; for your specific use case — not in general. What error rate is acceptable? What latency? When must a human be in the loop?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerize from day one.&lt;/strong&gt; A proper Dockerfile + docker-compose.yml from the start eliminates an entire class of "works on my machine" problems before they happen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Put observability in before launch.&lt;/strong&gt; Not after. Langfuse or Arize Phoenix open-source are enough to start. Without full traces, you can't debug, improve, or justify the agent's decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a staging environment.&lt;/strong&gt; The same Docker image travels from localhost → staging → production. Never rebuild for prod.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconfigure workflows before plugging in the agent.&lt;/strong&gt; McKinsey data is clear: organizations that re-design their processes upfront are twice as likely to achieve significant ROI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay simple until complexity is justified.&lt;/strong&gt; A 5-step agent with well-designed human supervision is more reliable — and more useful — than a 20-step autonomous agent that produces silent errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for stack instability.&lt;/strong&gt; If 70% of teams in regulated sectors rebuild their stack every three months, that's the norm. Architect with swappable modules. Don't marry one framework.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Main sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LangChain, &lt;em&gt;State of Agent Engineering&lt;/em&gt;, Dec. 2025 (n=1,340 professionals)&lt;/li&gt;
&lt;li&gt;Cleanlab, &lt;em&gt;AI Agents in Production 2025&lt;/em&gt; (MIT State of AI in Business 2025, n=1,837)&lt;/li&gt;
&lt;li&gt;UC Berkeley, &lt;em&gt;Measuring Agents in Production&lt;/em&gt;, Melissa Pan et al. (n=300+ teams)&lt;/li&gt;
&lt;li&gt;McKinsey, &lt;em&gt;State of AI 2025&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Forrester, &lt;em&gt;2025 AI Model Overview Report&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Docker, &lt;em&gt;Agentic AI Applications — official documentation&lt;/em&gt;, 2025-2026&lt;/li&gt;
&lt;li&gt;Docker, &lt;em&gt;Build AI Agents with Docker Compose&lt;/em&gt;, Nov. 2025&lt;/li&gt;
&lt;li&gt;MachineLearningMastery, &lt;em&gt;Deploying AI Agents to Production: Architecture, Infrastructure, and Implementation Roadmap&lt;/em&gt;, Mar. 2026&lt;/li&gt;
&lt;li&gt;n8n Blog, &lt;em&gt;15 best practices for deploying AI agents in production&lt;/em&gt;, Jan. 2026&lt;/li&gt;
&lt;li&gt;FreeCodeCamp, &lt;em&gt;How to Build and Deploy a Multi-Agent AI System with Python and Docker&lt;/em&gt;, Feb. 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This article synthesizes public data available in March 2026. Figures may evolve rapidly in this space.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Not sure where to start with your own agent?
&lt;/h2&gt;

&lt;p&gt;We offer a free 20-minute workshop to help you define your first agentic use case — what to automate, how to scope it, and what production readiness actually looks like for your context.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
