Forem: Simran Kumari

AI Agent Monitoring: How to Observe Autonomous AI Agents in Production

Simran Kumari — Fri, 10 Apr 2026 16:09:26 +0000

AI agent monitoring — also called LLM observability — is the practice of collecting, analysing, and acting on telemetry data generated by LLM calls and the autonomous agents built on top of them. Think of it as traditional APM, but purpose-built for AI workloads.

A modern AI agent is not a static API call. It's a dynamic, multi-step reasoning system that may:

Plan and decompose subtasks autonomously
Call external tools (web search, code execution, APIs)
Retrieve documents via Retrieval-Augmented Generation (RAG)
Spawn sub-agents for parallel task execution
Loop and self-correct until a goal is satisfied

Every one of those steps is a potential point of failure, latency spike, or cost explosion. Just as DevOps engineers would never deploy a microservice without metrics, traces, and logs, MLOps and AI engineers need the same rigour for LLM-powered systems.

Why It Matters in Production

The jump from a prototype that "works on my machine" to a reliable production AI agent is enormous. Here's what routinely breaks without proper monitoring:

🔴 Runaway Token Costs

An unchecked agentic loop can consume millions of tokens before you notice. A single misbehaving agent session — stuck in a reasoning loop — can exhaust your entire daily token budget in minutes. Token-level telemetry gives you per-request cost visibility and the ability to set budget-based circuit breakers.

🔴 Silent Latency Regressions

A new model version, a longer system prompt, or a change in retrieval strategy can quietly double your agent's response time. Without distributed latency traces, you discover this from frustrated users — not from a proactive alert.

🔴 Rate-Limit Cascade Failures

LLM API rate limits hit unpredictably under production load. A single rate-limit event can trigger aggressive retries across multiple parallel agent sessions, cascading into a full outage.

🔴 Degraded Output Quality

Hallucinations, refusals, and incoherent responses increase as context windows grow or prompts drift. Span-level metadata correlating prompt structure with output quality lets you catch these regressions systematically.

🔴 Multi-Step Reasoning Failures

In agentic pipelines, a failure deep in a reasoning chain is nearly impossible to attribute without distributed tracing. Did the agent fail because the web search tool returned bad data, because the LLM misinterpreted the tool output, or because the context window overflowed? Traces answer this.

🔴 Compliance & Audit Requirements

Enterprise deployments increasingly require complete audit logs of what the agent decided, why, what data it accessed, and what actions it took.

The Four Pillars of LLM Observability

1. Distributed Tracing

Every agent action — from receiving a user prompt to returning a final answer — is instrumented as a trace composed of spans. Each span captures a discrete unit of work: an LLM call, a tool invocation, a database retrieval, or a sub-agent call.

Tracing answers: "What happened, in what order, and how long did each step take?"

2. Metrics

Aggregated numerical data over time — token counts, latency percentiles (p50/p95/p99), error rates, throughput, and cost per request. Metrics are cheap to store and fast to query, making them ideal for real-time dashboards and threshold-based alerting.

3. Structured Logs

Rich, machine-readable event records attached to each agent action — prompt text, model parameters, completion content, tool call arguments, and exception stack traces. Unlike metrics, logs retain the full context needed for post-incident debugging.

4. Evaluations (Evals)

A layer unique to AI observability: automated or human-assisted scoring of agent outputs for correctness, safety, relevance, and faithfulness. Evals close the loop between operational telemetry and output quality.

💡 Pro Tip: For most teams starting out, distributed tracing delivers the highest immediate value. It reveals exactly where latency and failures originate across multi-step agent pipelines — something neither metrics nor logs alone can show.

Key Metrics to Track

Metric	What It Tells You	Typical Alert Threshold
`llm.usage.prompt_tokens`	Input token consumption per request	> 80% of model context window
`llm.usage.completion_tokens`	Output token consumption per request	Sudden spike > 2× baseline
`llm.usage.total_tokens`	Combined cost proxy per call	Daily cost budget exceeded
`duration` (end-to-end)	User-perceived latency	p95 > 10s for interactive agents
`error.rate`	% of requests that fail or timeout	> 1% over a 5-minute window
`tool_call.count`	Tool invocations per session	> 20 per session (loop indicator)
`agent.steps`	Depth of reasoning chain	> configured max steps
`llm.request.model`	Which model was invoked	Unexpected model fallback detected

OpenTelemetry: The Standard for AI Observability

OpenTelemetry (OTel) is the open-source observability framework that has become the industry standard for instrumenting distributed systems. For AI agents, it provides a vendor-neutral way to emit traces, metrics, and logs from any LLM call to any compatible backend — OpenObserve, Prometheus, Jaeger, Grafana, Datadog, and more.

The ecosystem includes dedicated auto-instrumentation libraries for all major LLM providers:

opentelemetry-instrumentation-openai
opentelemetry-instrumentation-anthropic
opentelemetry-instrumentation-langchain
opentelemetry-instrumentation-llama-index
opentelemetry-instrumentation-cohere

These libraries wrap LLM client calls and automatically attach semantic attributes — token counts, model name, temperature, max tokens, error details — as span attributes, with no manual instrumentation required.

How OTel Spans Map to Agent Steps

In an agentic pipeline, the OTel trace tree mirrors the agent's reasoning hierarchy:

[root trace] user-request
  └── [span] planner-llm-call
        └── [span] tool: web_search
        └── [span] tool: code_executor
              └── [span] sub-agent: summariser-llm-call

This lets you instantly see which step was the bottleneck or failure point in any given agent run.

Setting Up LLM Monitoring with OpenObserve

OpenObserve is an open-source observability platform with a native OTLP endpoint — purpose-built for high-volume telemetry at significantly lower cost and resource footprint than alternatives like the Elastic Stack.

Prerequisites

Python 3.8+
uv package manager (or pip)
An OpenObserve account — cloud or self-hosted
Your OpenObserve organisation ID and Base64-encoded auth token
API key for your LLM provider (OpenAI, Anthropic, etc.)

Step 1: Configure Your Environment

Create a .env file in your project root:

# OpenObserve instance URL
OPENOBSERVE_URL=https://api.openobserve.ai/

# Your OpenObserve organisation slug or ID
OPENOBSERVE_ORG=your_org_id

# Basic auth token — Base64-encoded "email:password"
OPENOBSERVE_AUTH_TOKEN="Basic <your_base64_token>"

# Enable or disable tracing (default: true)
OPENOBSERVE_ENABLED=true

# LLM provider keys
OPENAI_API_KEY="your-openai-key"
ANTHROPIC_API_KEY="your-anthropic-key"

Step 2: Install Dependencies

# Using uv (recommended)
uv pip install openobserve-telemetry-sdk \
               opentelemetry-instrumentation-openai \
               opentelemetry-instrumentation-anthropic \
               python-dotenv

# Or with pip
pip install openobserve-telemetry-sdk opentelemetry-instrumentation-openai python-dotenv

Step 3: Instrument Your Application

OpenAI

Add two lines before any LLM calls are made:

from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from openobserve import openobserve_init

# Instrument OpenAI and initialise the OpenObserve exporter
OpenAIInstrumentor().instrument()
openobserve_init()

from openai import OpenAI

client = OpenAI()

# Use the client exactly as normal — traces are captured automatically
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this document..."}]
)
print(response.choices[0].message.content)

Anthropic (Claude)

from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
from openobserve import openobserve_init

AnthropicInstrumentor().instrument()
openobserve_init()

from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyse this data..."}]
)
print(response.content[0].text)

Every call is now captured as a trace span and exported to OpenObserve automatically.

Note: The openobserve-telemetry-sdk is an optional thin wrapper around the standard OTel Python SDK. If you already use OpenTelemetry, you can send telemetry directly to OpenObserve's OTLP endpoint without it.

Step 4: View Traces in OpenObserve

Log in to your OpenObserve instance
Navigate to Traces in the left sidebar
Filter by service name, model name, or time range
Click any span to inspect token counts, latency, parameters, and full request metadata

What Gets Captured in Each Trace Span

The OTel instrumentation libraries automatically attach the following attributes — no manual coding needed:

OTel Attribute	Description	Example Value
`llm.request.model`	Model identifier	`gpt-4o`
`llm.usage.prompt_tokens`	Tokens in the prompt	`1,247`
`llm.usage.completion_tokens`	Tokens in the response	`312`
`llm.usage.total_tokens`	Combined token usage	`1,559`
`llm.request.temperature`	Sampling temperature	`0.7`
`llm.request.max_tokens`	Max response length	`2048`
`duration`	End-to-end request latency	`2,340ms`
`error`	Exception details on failure	`RateLimitError: 429`

Adding Custom Span Attributes

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("agent-task") as span:
    span.set_attribute("user.id", "usr_abc123")
    span.set_attribute("session.id", "sess_xyz789")
    span.set_attribute("agent.name", "research-agent")
    span.set_attribute("task.type", "document-summarisation")
    span.set_attribute("prompt.version", "v2.3.1")

    # Your LLM call here — child spans are created automatically
    response = client.chat.completions.create(...)

Unique Challenges in Agentic Systems

Non-Determinism

Unlike traditional software, the same input to an agent may produce different execution paths on different runs. Your monitoring must capture the full trace of each individual run, not just aggregated statistics.

Long-Horizon Context Windows

As agents maintain conversation history across multiple turns, context windows grow substantially. A single agent session can consume tens of thousands of tokens. Per-turn token tracking is essential.

Nested and Parallel Tool Calls

Modern agents call multiple tools — often in parallel. Distributed tracing with proper parent-child span relationships is the only reliable way to reconstruct the true execution timeline.

Infinite Loop Detection

Agents can get stuck in reasoning loops, repeatedly calling the same tool without making progress. Monitor agent.steps and tool_call.count per session, combined with a max-step circuit breaker.

Multi-Agent Coordination

Orchestrator-worker architectures require trace context propagation across agent boundaries. OpenTelemetry's W3C TraceContext standard enables this:

from opentelemetry.propagate import inject, extract
import requests

# Orchestrator: inject trace context into outgoing request headers
headers = {}
inject(headers)  # adds traceparent, tracestate headers

response = requests.post(
    "http://worker-agent/execute",
    json={"task": task_payload},
    headers=headers
)

# Worker agent: extract and continue the trace
context = extract(incoming_request.headers)
with tracer.start_as_current_span("worker-task", context=context):
    # Appears as child span in orchestrator's trace
    ...

⚠️ Critical: Always propagate the W3C traceparent header when your orchestrator calls a worker agent. Without this, each agent's activity appears as a disconnected root trace — making end-to-end debugging nearly impossible.

Best Practices for AI Agent Monitoring

✅ Instrument Early, Not After the Fact

Add observability during development, not after incidents. Retrofitting into a complex agentic system leaves blind spots in the most critical execution paths.

✅ Separate Evaluation Metrics from Operational Metrics

Don't conflate system health (latency, error rate, tokens) with output quality (correctness, relevance, safety). Keep them in separate pipelines with separate alert policies.

✅ Sample Intelligently, Not Uniformly

Use head-based sampling for normal traffic (e.g., 10%), but configure tail-based sampling to capture 100% of failed or slow requests. Full fidelity where it matters most, without prohibitive storage costs.

✅ Mask Sensitive Data Before Export

from opentelemetry.sdk.trace import SpanProcessor

class SensitiveDataRedactor(SpanProcessor):
    SENSITIVE_ATTRS = ["llm.prompts", "llm.completions", "user.email"]

    def on_end(self, span):
        for attr in self.SENSITIVE_ATTRS:
            if attr in span.attributes:
                span.set_attribute(attr, "[REDACTED]")

✅ Version Your Prompts

Treat prompt templates as software artefacts with version identifiers. Attach prompt.version: v2.3.1 as a span attribute to compare performance across prompt versions — just like canary deployments.

✅ Tag Every Trace with Business Context

Add user.id, session.id, agent.name, task.type, and feature.flag to every trace. These transform your observability data from an engineering artefact into a product intelligence asset.

✅ Build a Feedback Loop from Evals to Prompts

Connect your evaluation pipeline back to your prompt management system. When evaluations detect a quality regression, it should automatically trigger a prompt review workflow — the AI equivalent of failing a CI/CD pipeline on test failures.

Conclusion

As autonomous AI agents take on consequential tasks — writing and executing code, managing business workflows, interacting with customers at scale — the organisations that invest in proper observability will have a decisive operational advantage: faster debugging cycles, lower costs, better output quality, and the confidence to scale reliably.

OpenTelemetry + OpenObserve gives you a vendor-neutral, open-source foundation that scales from a solo developer's project to an enterprise deployment, without lock-in or prohibitive cost at scale.

You cannot improve what you cannot measure. For AI agents, observability is the measurement layer that makes continuous improvement possible.

Top 10 APM Tools in 2026: A Complete Comparison Guide

Simran Kumari — Fri, 27 Mar 2026 09:53:11 +0000

Application Performance Monitoring (APM) tools help engineering teams track, analyze, and optimize how their applications behave in production. They collect telemetry data — response times, error rates, throughput — across distributed systems and turn it into actionable insights.

But picking the right APM tool in 2026 is more nuanced than it used to be. Teams are increasingly pushing back on:

Runaway costs that scale with data volume or host count
Vendor lock-in from proprietary agents and query languages
Lack of data sovereignty when compliance requires on-prem or regional storage
Unnecessary complexity for teams with simpler observability needs

This guide covers 10 APM tools that address these concerns — from open source platforms to enterprise SaaS solutions.

What to Look for in an APM Tool

Before diving in, here's a quick framework for evaluating your options:

Criterion	What to Evaluate
Unified Observability	Single pane for metrics, logs, and traces
Cost Structure	Transparent pricing; no hidden fees at scale
Data Ownership	Self-hosted option, data export, retention control
Scalability	Ingestion throughput and query performance at volume
Migration Ease	OpenTelemetry support, agent compatibility
Query Capabilities	SQL, PromQL, or a proprietary DSL
Alerting & Visualization	Alert config flexibility, dashboard quality
High-Cardinality Support	User-level tracking without cost blowups

1. OpenObserve

Best for: Teams wanting unified observability without vendor lock-in or unpredictable costs.

OpenObserve is an open-source observability platform that unifies logs, metrics, traces, and APM in a single interface. It uses 140x compression technology that can reduce storage and ingestion costs by 60–90% compared to legacy tools.

Pros:

Unified logs, metrics, traces, and APM in one platform
OpenTelemetry-native — works as a drop-in replacement for proprietary agents
SQL-based querying instead of a vendor-specific DSL
Self-hosted or cloud deployment options
No per-host or per-metric billing surprises

Cons:

Requires SQL familiarity for advanced analysis
Smaller integration marketplace vs. legacy vendors

Deployment: Self-hosted / Cloud
Pricing: Open source + low-cost cloud

2. Datadog

Best for: Teams that want a mature, feature-rich SaaS platform and have the budget for it.

Datadog is one of the most well-known names in cloud monitoring, offering 900+ integrations and a powerful unified platform for metrics, logs, traces, RUM, and security.

Pros:

Enormous integration ecosystem (900+ integrations)
Strong end-to-end distributed tracing with automatic service discovery
AI-powered anomaly detection and root cause analysis
Quick time-to-value with solid documentation

Cons:

Pricing scales rapidly with data volume and host count
Complex billing model with separate per-feature charges
Custom metric auto-generation can create unexpected costs
Proprietary agents and query language create lock-in

Deployment: SaaS
Pricing: Host + usage-based

3. Dynatrace

Best for: Large enterprises running complex distributed systems that want automated instrumentation.

Dynatrace's OneAgent handles instrumentation automatically, and its Davis AI engine cuts through alert noise with built-in root cause analysis.

Pros:

Zero-touch instrumentation via OneAgent
Strong AI-driven alerting with reduced noise
Excellent support for hybrid and on-premises environments
End-to-end visibility from infrastructure to user experience

Cons:

Premium pricing — often the most expensive option
Proprietary data formats and agents
Can be overkill for smaller or cloud-native teams

Deployment: SaaS / Hybrid
Pricing: Host / unit-based

4. New Relic

Best for: Teams wanting a familiar all-in-one SaaS APM experience with a generous free tier.

New Relic offers deep code-level performance visibility across metrics, logs, traces, RUM, and synthetics, with a 100 GB/month free data ingest that makes it accessible for smaller teams.

Pros:

Unified platform with strong APM capabilities
100 GB/month data ingest on the free tier
Good OpenTelemetry support for easier migration
Developer-friendly onboarding and documentation

Cons:

Still a proprietary SaaS platform
Costs grow quickly with high data volumes
Limited data residency control vs. self-hosted tools
Advanced features gated behind higher pricing tiers

Deployment: SaaS
Pricing: Usage-based

5. AppDynamics

Best for: Enterprises already in the Cisco ecosystem that need deep business transaction visibility.

AppDynamics maps application performance directly to business outcomes — making it particularly useful for organizations where IT metrics need to connect to revenue and customer experience metrics.

Pros:

Deep code-level visibility and dependency mapping
Business transaction monitoring that connects to business impact
Tight Cisco networking and security integration
Works well across on-prem, hybrid, and legacy environments

Cons:

Expensive enterprise pricing
Heavy agent-based approach
Complex setup and configuration
Less cloud-native than modern alternatives

Deployment: SaaS / On-prem
Pricing: Unit-based

6. Splunk APM

Best for: Compliance-heavy organizations with mature security and audit requirements.

Splunk has been an enterprise log analytics powerhouse for years, with its APM offering extending that depth to distributed tracing and full-fidelity observability.

Pros:

Extremely powerful analytics via SPL (Search Processing Language)
Enterprise-grade security and compliance capabilities
Full-fidelity tracing with no default sampling
Flexible deployment (on-prem and cloud)

Cons:

One of the most expensive APM tools on the market
Steep learning curve for SPL
Complex licensing model
Often excessive for pure APM use cases

Deployment: SaaS / On-prem
Pricing: Data-volume based

7. Elastic APM

Best for: Teams already using the ELK stack who want to extend into APM.

Elastic Observability builds on Elasticsearch's powerful full-text and structured search to offer logs, metrics, and APM in a unified interface.

Pros:

Best-in-class log search via Elasticsearch
Flexible deployment (cloud, self-hosted, hybrid)
Large community and broad ecosystem integrations
Strong SIEM overlap for security + observability use cases

Cons:

Expensive to operate at scale
High infrastructure and tuning overhead
Storage costs can grow quickly
Complex cluster management

Deployment: Self-hosted / Cloud
Pricing: Data / host-based

8. Grafana Stack (Prometheus + Loki + Tempo)

Best for: Teams that want best-in-class open source tools and have the ops capability to manage them.

The Grafana Stack isn't a single product — it's a collection of open source tools: Prometheus for metrics, Loki for logs, and Tempo for traces, all visualized through Grafana dashboards.

Pros:

Prometheus is the de-facto standard for Kubernetes and infrastructure metrics
Completely open source, no vendor lock-in
Highly customizable dashboards that rival commercial tools
Thousands of exporters and plugins

Cons:

Not a unified product — requires managing multiple systems
Significantly higher operational overhead at scale
Alerting setup is more complex than integrated platforms
Steeper learning curve for full-stack setup

Deployment: Self-hosted / Cloud (Grafana Cloud managed option available)
Pricing: OSS + managed tiers

9. Honeycomb

Best for: Engineering teams debugging complex distributed systems with high-cardinality data.

Honeycomb was purpose-built for the challenges modern microservices create — where request IDs, user IDs, and other high-cardinality fields need to be tracked without blowing up your observability bill.

Pros:

Handles high-cardinality dimensions (user IDs, request IDs) without performance or cost penalties
Fast, ad-hoc exploratory querying for unknown unknowns
First-class SLOs, error budgets, and burn-rate alerts
OpenTelemetry-native ingestion

Cons:

SaaS-only, no self-hosted option
Pricing scales with event volume
Less focus on traditional infrastructure dashboards
Different mental model than legacy APM tools

Deployment: SaaS
Pricing: Event-based

10. Site24x7

Best for: Smaller DevOps teams wanting broad monitoring coverage at a competitive price.

Site24x7 covers APM, RUM, synthetic monitoring, server, and cloud monitoring in one platform — without the enterprise price tag.

Pros:

Competitive pricing with a broad feature set
Quick setup and guided onboarding
Covers APM, synthetic, infrastructure, and cloud in one tool
Good customer support reputation

Cons:

UI feels dated compared to modern competitors
Less depth in distributed tracing
Advanced features locked behind higher tiers
Smaller community and ecosystem

Deployment: SaaS
Pricing: Tier-based

Quick Comparison Table

Tool	Deployment	Metrics	Logs	Traces	APM	Pricing
OpenObserve	Self-hosted / Cloud	✅	✅	✅	✅	OSS + low-cost cloud
Datadog	SaaS	✅	✅	✅	✅	Host + usage-based
Dynatrace	SaaS / Hybrid	✅	✅	✅	✅	Host / unit-based
New Relic	SaaS	✅	✅	✅	✅	Usage-based
AppDynamics	SaaS / On-prem	✅	✅	✅	✅	Unit-based
Splunk APM	SaaS / On-prem	✅	✅	✅	✅	Data-volume based
Elastic APM	Self-hosted / Cloud	✅	✅	✅	✅	Data / host-based
Grafana Stack	Self-hosted / Cloud	✅	✅	✅	⚠️	OSS + managed
Honeycomb	SaaS	⚠️	⚠️	✅	✅	Event-based
Site24x7	SaaS	✅	✅	✅	✅	Tier-based

How to Choose

By budget:

Tight → OpenObserve, Grafana Stack, Elastic APM
Moderate → New Relic (free tier), Site24x7
Enterprise → Dynatrace, Datadog, Splunk

By deployment preference:

Self-hosted required → OpenObserve, Grafana Stack, Elastic
SaaS preferred → New Relic, Datadog, Honeycomb, OpenObserve Cloud
Hybrid needed → Dynatrace, Elastic, AppDynamics

By use case:

General observability → OpenObserve, New Relic, Datadog
Business transaction visibility → AppDynamics
Log analytics → OpenObserve, Elastic, Splunk
High-cardinality tracing → Honeycomb, OpenObserve
Security + observability → Splunk, Elastic, OpenObserve

By migration strategy:

Quick migration → OpenTelemetry-native tools (OpenObserve, Honeycomb, New Relic)
Gradual transition → Start with one signal type (logs or metrics first)
Parallel running → Run new tool alongside existing APM during evaluation

Final Thoughts

The APM landscape in 2026 is richer — and more opinionated — than ever. The right tool depends on your team's technical depth, budget constraints, compliance requirements, and how much operational overhead you're willing to take on.

A few principles that apply regardless of which tool you choose:

Adopt OpenTelemetry to instrument once and avoid being locked into any specific backend
Start with a pilot on non-critical services before committing to a full migration
Model your costs at scale — what looks cheap at 10 hosts can surprise you at 100
Run tools in parallel during evaluation to validate parity before cutting over

If you're looking for a starting point that balances cost, flexibility, and full-stack observability, OpenObserve is worth a look — it's open source, OTel-native, and offers both self-hosted and cloud deployment options.

Originally based on the OpenObserve blog.

Best Open Source LLM Observability Tools in 2026: Complete Guide

Simran Kumari — Wed, 25 Mar 2026 15:27:12 +0000

What Is LLM Observability?

LLM observability is the practice of monitoring, tracing, and analyzing every layer of an AI application — from the prompt you send to the final response your model returns. As AI systems grow more complex, with multi-step agent workflows, retrieval-augmented generation (RAG) pipelines, and tool calls chained together, traditional logging falls short.

The four core components of LLM observability are:

Tracing — tracking the full lifecycle of a user interaction, including intermediate steps, model API calls, and tool invocations
Evaluation — measuring output quality through automated metrics (relevance, faithfulness, toxicity) or human annotation
Cost & Usage Monitoring — tracking token consumption, latency, and spend per model, user, or session
Prompt Management — versioning, testing, and iterating on prompts without losing reproducibility

Without these, teams are blind to quality regressions, prompt drift, hallucinations, and runaway API costs in production.

Why LLM Observability Is Different from Traditional Monitoring

Traditional observability tools like Grafana and Prometheus are excellent for infrastructure-level signals — CPU, memory, request rates, latency percentiles. But LLMs introduce an entirely new class of failure that metrics alone cannot detect:

Traditional Monitoring	LLM Observability
Tracks uptime, latency, error rates	Tracks hallucinations, prompt quality, output relevance
Alerts on crashes or timeouts	Alerts on silent quality regressions
Measures infrastructure health	Measures model behavior and output correctness
Query languages: PromQL, SQL	Evaluation frameworks: LLM-as-judge, semantic similarity
Dashboards for SREs	Dashboards for ML engineers and product teams

What to Look for in an Open Source LLM Observability Tool

A CHI 2025 study with 30 developers identified four core design principles every solid LLM observability tool should satisfy:

Principle	What It Means
Awareness	Makes model behavior visible — you understand what is happening inside the system
Monitoring	Real-time feedback during training and evaluation to catch issues early
Intervention	Enables you to act on problems as they surface, not after users report them
Operability	Supports long-term maintainability as models and requirements evolve

Beyond those principles, evaluate tools on:

Self-hosting support — critical for data residency and compliance
Framework integrations — LangChain, LlamaIndex, OpenAI SDK, LiteLLM, Vercel AI SDK, Haystack
OpenTelemetry compatibility — avoids vendor lock-in and lets you route traces to any OTEL-compatible backend
Evaluation capabilities — LLM-as-judge, human annotation, hallucination detection
Prompt management — versioning and collaboration features for iterating on prompts
Cost tracking — per-user, per-model, per-session breakdowns
Unified observability — whether the tool also covers infrastructure so you don't need a second platform
License — MIT, Apache 2.0, and Elastic License 2.0 carry very different implications for commercial use

Top Open Source LLM Observability Tools

1. OpenObserve

License: AGPL-3.0 | Website: openobserve.ai | Cloud: cloud.openobserve.ai

OpenObserve is our top pick for 2026. While most tools on this list specialize in LLM-specific concerns, OpenObserve unifies LLM observability with full infrastructure monitoring — logs, metrics, traces, and frontend (RUM) monitoring — in a single deployment. For teams tired of managing a separate DevOps telemetry stack alongside a dedicated LLM tool, OpenObserve eliminates that overhead entirely.

Built on OpenTelemetry standards and using a Parquet/Vertex columnar format with aggressive compression, OpenObserve delivers 140x lower storage costs compared to traditional stacks like Prometheus + Loki + Tempo. Its SQL-based query interface means teams can correlate LLM trace data with infrastructure metrics without learning multiple proprietary query languages. With single binary deployment, you can be up and running in under 2 minutes.

Key Features:

Unified platform — logs, metrics, traces, LLM traces, and RUM monitoring in one tool
OpenTelemetry-native — drop-in instrumentation for LLM applications using any OTEL SDK
SQL-based queries — correlate LLM trace data with infrastructure signals using familiar syntax
140x lower storage costs — Parquet columnar format with aggressive compression
High-cardinality support — handles per-user, per-session, and per-request LLM telemetry without performance degradation
Single binary deployment — self-hosted in under 2 minutes; no Kubernetes expertise required
Real-time alerting — set alerts on token usage, latency spikes, error rates, and custom LLM metrics
Rich dashboards — visualization for both infrastructure health and LLM operational metrics side by side
Self-hosted or Cloud — full data residency control with flexible deployment options

Pros:

Only open source platform covering infrastructure observability AND LLM tracing in a single tool
140x storage cost reduction makes it dramatically cheaper to retain long-term LLM trace history
SQL querying lowers the learning curve — one language for both infrastructure and LLM queries
Fully OpenTelemetry-native — no vendor lock-in

Cons:

LLM-specific features like LLM-as-judge evaluation and prompt management are handled through integrations rather than built-in modules
Advanced LLM dashboard templates require manual configuration

Pricing:

Open source (self-hosted): Free
Cloud: Free tier available; usage-based pricing beyond that

Best for: Teams that want a single open source platform covering both LLM observability and infrastructure monitoring, or organizations with strict self-hosting/data residency requirements.

2. Langfuse

GitHub Stars: 21,000+ | License: MIT (core) | Website: langfuse.com

Langfuse is the most widely adopted open source LLM-specific observability platform. Originally from YCombinator W23, it was recently acquired by ClickHouse, signalling strong long-term investment in its data infrastructure. Its MIT-licensed core covers end-to-end tracing, prompt management, evaluation, and datasets — everything a production LLM team needs on the application layer.

Key Features:

End-to-end tracing across LLM calls, retrieval steps, and agent actions with waterfall views
Session replay to reconstruct complete conversation histories for debugging
Prompt management with version control and live iteration without redeployment
LLM-as-a-judge evaluation workflows for hallucination, toxicity, and relevance
LLM Playground for testing prompts directly from a failed trace
Native integrations: LangChain, LlamaIndex, OpenAI SDK, LiteLLM, Vercel AI SDK, Haystack, Mastra
Self-host via Docker Compose in under 5 minutes

Pros:

Strongest LLM-specific community adoption in the open source space
Covers the full LLM development lifecycle — tracing, evals, datasets, prompt management
Generous free tier on Langfuse Cloud (50k events/month, 2 users)
True MIT license on core features

Cons:

No built-in infrastructure monitoring — needs a separate platform for full-stack visibility
Enterprise features (SSO, RBAC, advanced security) are separately licensed
Cloud pricing can grow quickly at high event volumes

Pricing:

Self-hosted: Free
Cloud: Free up to 50k events/month, then $29/month for 100k events

Best for: Engineering teams that want the deepest open source LLM-specific observability with prompt management and evaluation built in.

3. Arize Phoenix

License: Elastic License 2.0 (source-available) | Website: phoenix.arize.com

Arize Phoenix is a source-available observability platform built specifically for LLM applications, RAG pipelines, and agent workflows. Built on OpenTelemetry standards, it includes built-in hallucination detection and embedding drift visualization, making it particularly powerful for teams iterating on retrieval pipelines.

Key Features:

End-to-end tracing for prompts, responses, and agent workflows
RAG observability — inspect retrieval results, chunk quality, and grounding
Hallucination detection built in
Embedding drift detection for monitoring distribution shifts over time
OpenTelemetry-native export to OpenObserve, Datadog, Grafana, or any OTEL backend
Supports Python and JavaScript

Pros:

Purpose-built for RAG and agent debugging — best-in-class for retrieval pipeline visibility
OTEL-native design eliminates vendor lock-in
Rich visualizations for understanding embedding spaces and cluster drift

Cons:

Elastic License 2.0 restricts certain commercial uses (not true open source)
Less mature prompt management than Langfuse
No infrastructure monitoring — requires a separate backend
Enterprise features require moving to Arize AI platform ($50/month+)

Pricing:

Phoenix (open source): Free
Arize AX Pro: $50/month; Enterprise: custom

Best for: AI engineering teams building RAG-based systems and agent workflows where deep retrieval pipeline visibility is critical.

4. OpenLLMetry

License: Apache 2.0 | Website: openllmetry.com

OpenLLMetry is the most vendor-neutral option on this list. An open source observability framework built purely on OpenTelemetry standards, it provides LLM instrumentation for Python and TypeScript with a single line of setup code. It then ships traces to any OTEL-compatible backend.

Key Features:

Single-line setup for automatic instrumentation
Supports OpenAI, Anthropic, Cohere, Azure OpenAI, Bedrock, Vertex AI, and more
Framework support: LangChain, LlamaIndex, Haystack, CrewAI, and others
Privacy controls for redacting sensitive prompts from traces
Custom attributes for A/B testing and feature flag tracking
Completely free — no licensing costs

Pros:

True vendor neutrality — switch backends without changing instrumentation code
Widest framework and provider coverage on the list
Fully Apache 2.0 licensed — safe for any commercial use
Zero cost, zero lock-in

Cons:

Instrumentation library only — requires a separate backend for storage, dashboards, and alerting
No built-in evaluation, prompt management, or dashboards
Requires more setup work to build a complete observability stack

Pricing: Completely free

Best for: Teams that want vendor-neutral LLM instrumentation and already have an observability backend, or teams building a custom OpenTelemetry-native stack.

5. Comet Opik

License: Apache 2.0 | Website: comet.com/site/products/opik

Opik is an open source LLM observability and evaluation platform from Comet ML, focused on systematic testing, optimization, and production monitoring. It stands out for its automated prompt optimization — six algorithms including Few-shot Bayesian, evolutionary, and LLM-powered MetaPrompt approaches — which is rare in open source tooling.

Key Features:

Full tracing for LLM calls, agent steps, and RAG pipelines
Automated prompt optimization (six algorithms built in)
Built-in guardrails for PII filtering, off-topic detection, and competitor mention blocking
Works with any LLM provider; native integrations for LangChain, LlamaIndex, OpenAI, Anthropic, Vertex AI
60-day data retention on free hosted plan with unlimited team members
Self-hostable with full features available in the codebase

Pros:

Automated prompt optimization is a major differentiator
Guardrails are built in, not bolted on
Truly open source (Apache 2.0) with full feature access
Unlimited team members on free tier

Cons:

Smaller community than Langfuse
No infrastructure monitoring
Some advanced analytics features are cloud-only

Pricing:

Free hosted: 25k spans/month, unlimited team members, 60-day retention
Pro: $39/month for 100k spans

Best for: Teams that want comprehensive observability with automated prompt optimization and guardrails built in.

6. Helicone

License: MIT | Website: helicone.ai

Helicone takes a fundamentally different approach: it is a proxy-first observability platform. Rather than adding an SDK, you simply change your base URL to route traffic through Helicone — and it immediately logs every request, response, token count, cost, and error with zero code changes.

Key Features:

Proxy-based setup — change one line of code (base URL), nothing else
Works with 100+ models and any OpenAI-compatible endpoint
Request caching to reduce latency and cost on repeated calls
Intelligent request routing and automatic provider failover
Rate limiting and usage controls to prevent runaway spend
Cost tracking by model, user, and session

Pros:

Fastest time-to-value — production observability in under 5 minutes
No SDK to install or manage
Caching and routing features go beyond pure observability
MIT licensed and self-hostable

Cons:

Proxy architecture introduces a network hop
Less suited for deep agent workflow tracing than Langfuse or Arize Phoenix
No infrastructure monitoring
Evaluation features are limited compared to dedicated eval platforms

Pricing:

Hobby (free): 50k monthly logs
Pro: $79/month
Team: $799/month

Best for: Teams that need lightweight model-level observability and cost control with the absolute minimum setup friction.

7. Lunary

License: Apache 2.0 | Website: lunary.ai

Lunary is a lightweight open source observability platform optimized for RAG pipelines and chatbot applications. It offers SDKs for JavaScript (Node.js, Deno, Vercel Edge, Cloudflare Workers) and Python, with a setup time of roughly two minutes. Its Radar feature automatically categorizes LLM responses based on pre-defined criteria, making it easy to audit outputs at scale.

Key Features:

Specialized RAG tracing with embedding metrics and latency visualization
Radar: rule-based categorization of LLM responses for downstream auditing
SDKs for JavaScript environments including Vercel Edge and Cloudflare Workers
Session-level tracing for chatbot conversations
10k events/month free with 30-day retention

Pros:

Best JavaScript/TypeScript support of any tool on this list
Lightweight and fast to set up — under 2 minutes
Purpose-built for RAG and chatbot use cases

Cons:

Narrower feature set than Langfuse or OpenObserve
Some advanced features require Enterprise licensing
Smaller community and ecosystem

Pricing:

Free tier: 10k events/month, 30-day retention
Enterprise: Custom (includes self-hosting)

Best for: JavaScript-first teams building RAG pipelines or chatbot applications who need quick observability setup.

8. TruLens

License: MIT | Website: trulens.org

TruLens takes a qualitative-first approach to LLM observability, built around structured feedback functions that evaluate LLM responses after each call. It is particularly strong for teams using LlamaIndex and LangChain who want systematic evaluation pipelines rather than traditional tracing.

Key Features:

Feedback functions that run automatically after each LLM call
Pre-built evaluators for relevance, groundedness, and coherence
RAG triad evaluation: answer relevance, context relevance, groundedness
Deep integration with LlamaIndex and LangChain
LLM-agnostic — supports any model as an evaluator

Pros:

Best-in-class for structured, systematic evaluation pipelines
RAG triad evaluation is a well-regarded methodology for RAG quality assessment
MIT licensed with no restrictions

Cons:

Python only — no JavaScript/TypeScript support
Less focus on tracing and production monitoring
Smaller community than Langfuse

Pricing: Free (MIT licensed)

Best for: Research teams and ML engineers who need rigorous, automated evaluation pipelines for RAG systems with Python-native tooling.

9. PostHog LLM Analytics

GitHub Stars: 32,100+ | License: MIT | Website: posthog.com

PostHog bundles LLM observability alongside product analytics, session replay, feature flags, A/B testing, and error tracking. For teams who want to understand not just how their LLM performs technically but how users actually interact with it, PostHog is uniquely positioned.

Key Features:

LLM generation capture with cost, latency, and usage metrics
Combines LLM data with product analytics — funnels, retention, and user behaviour
Session replay for AI interactions — watch exactly what users experienced
A/B testing for prompts using the same experiment framework as product features
Prompt management (beta) with version control
100k LLM observability events/month on free tier

Pros:

Only tool on this list that combines LLM observability with full product analytics
Session replay for AI interactions is a uniquely powerful debugging tool
Massive community (32k+ GitHub stars)
Transparent, usage-based pricing

Cons:

LLM-specific features (evaluation, RAG tracing) are less mature than dedicated tools
No infrastructure monitoring
Prompt management is still in beta

Pricing:

Free: 100k LLM events/month, 30-day retention
Usage-based beyond that

Best for: Product-led teams who want to combine LLM monitoring with user behaviour and product analytics in one platform.

10. Weave by Weights & Biases

License: Apache 2.0 | Website: wandb.ai/site/weave

Weave is the LLM observability product from Weights & Biases (W&B), extending W&B's ML experiment tracking into LLM application observability — covering tracing, evaluation, and dataset management in a unified interface.

Key Features:

End-to-end tracing for LLM calls, chains, and agent workflows
Dataset management with versioning for evaluation benchmarks
Integration with W&B experiment tracking for model-level and application-level comparison
Human annotation tools for labelling and review workflows
Supports Python and JavaScript
Model-agnostic — works with OpenAI, Anthropic, open source models, and custom endpoints

Pros:

Natural fit for teams already using W&B for model training and experiment tracking
Strong dataset and evaluation management inherited from W&B's research-grade tooling
Apache 2.0 license — commercially safe
Bridges model development and production deployment in one workspace

Cons:

Less specialized for production LLM monitoring than Langfuse or OpenObserve
Tightly coupled to the W&B ecosystem — less useful if you're not already a W&B user

Pricing:

Free tier available via W&B
Team and Enterprise plans: custom pricing

Best for: ML research teams already invested in the W&B ecosystem who want to extend experiment tracking into production LLM observability.

Comparison Table

Tool	License	Self-Hosted	Tracing	Evaluation	Prompt Mgmt	Infra Monitoring	RAG Support	Best For
OpenObserve	AGPL-3.0	✅	✅	⚠️	⚠️	✅✅	✅	Unified infra + LLM observability
Langfuse	MIT (core)	✅	✅	✅	✅	❌	✅	Full-lifecycle LLM observability
Arize Phoenix	ELv2	✅	✅	✅	⚠️	❌	✅✅	RAG and agent debugging
OpenLLMetry	Apache 2.0	✅	✅	❌	❌	❌	✅	Vendor-neutral instrumentation
Comet Opik	Apache 2.0	✅	✅	✅	✅	❌	✅	Prompt optimization + observability
Helicone	MIT	✅	✅	⚠️	❌	❌	⚠️	Lightweight proxy-based monitoring
Lunary	Apache 2.0	✅	✅	⚠️	❌	❌	✅	JavaScript RAG & chatbots
TruLens	MIT	✅	⚠️	✅✅	❌	❌	✅	Structured evaluation pipelines
PostHog	MIT	✅	✅	⚠️	⚠️	❌	⚠️	LLM + product analytics combined
Weave (W&B)	Apache 2.0	✅	✅	✅	⚠️	❌	✅	ML research teams on W&B

✅ = strong support, ⚠️ = partial or in beta, ❌ = not available

How to Choose the Right Tool

1. Start with your deployment requirement

If your organization requires data residency or strict compliance, every tool on this list supports self-hosting. For the simplest self-hosted path, OpenObserve stands out — single binary deployment in under 2 minutes, covering both infrastructure and LLM telemetry. For pure LLM-specific self-hosting, Langfuse via Docker Compose takes about 5 minutes.

2. Match the tool to your primary bottleneck

If your main problem is...	Best tool(s)
Unified infra + LLM observability in one place	OpenObserve
Debugging agent and chain failures	OpenObserve, Langfuse, Arize Phoenix
RAG pipeline quality	Arize Phoenix, TruLens, Lunary
Prompt quality and optimization	Comet Opik, Langfuse
Cost and token tracking	Helicone, Langfuse, OpenObserve
Storage cost at scale	OpenObserve (140x compression)
Vendor-neutral instrumentation	OpenLLMetry → OpenObserve as backend
JavaScript/Node.js first	Lunary, PostHog
Product analytics + LLM	PostHog

3. Consider your framework dependencies

LangChain / LangGraph users: Langfuse has the deepest native LLM-specific integration
LlamaIndex users: TruLens and Arize Phoenix have strong LlamaIndex support
OpenAI SDK / Anthropic SDK users: All tools support this; Helicone is fastest to set up
Custom stacks / framework agnostic: OpenLLMetry → OpenObserve is the safest, most future-proof combination

4. Think about the evaluation maturity you need

In early development, basic tracing and cost monitoring (Helicone, Lunary) may be enough. As you move to production, evaluation becomes critical. Langfuse and Arize Phoenix lead for comprehensive evaluation workflows; TruLens leads for structured RAG evaluation methodology.

5. Factor in long-term lock-in risk

Tools built on OpenTelemetry standards — particularly OpenLLMetry, Arize Phoenix, and OpenObserve — give you the most flexibility to change components without re-instrumenting your application.

FAQs

What is the best open source LLM observability tool in 2026?

OpenObserve is our top pick for 2026 — the only open source platform covering both LLM observability and infrastructure monitoring in a single deployment. For LLM-specific evaluation and prompt management on top, Langfuse is the strongest companion. For RAG-specific debugging, Arize Phoenix leads.

Can I use these tools with any LLM provider?

Yes. All tools on this list support major providers including OpenAI, Anthropic, Cohere, Azure OpenAI, AWS Bedrock, Vertex AI, and most open source model endpoints. OpenLLMetry and Helicone have the broadest provider coverage (100+ models).

What is the difference between LLM tracing and LLM evaluation?

Tracing records what happened — prompts sent, responses received, latencies, token counts, tool calls. Evaluation assesses whether what happened was good — was the response accurate, relevant, grounded in retrieved context, free of hallucinations?

Do I need a separate observability stack for infrastructure if I adopt one of these tools?

Not if you choose OpenObserve. It handles metrics, logs, distributed traces, and LLM telemetry in a single platform — replacing the need for separate tools like Prometheus, Loki, and Tempo. For all other tools on this list, you will need a separate infrastructure monitoring stack.

What is the easiest tool to set up?

Helicone wins on LLM-specific setup speed — one line of code (change your base URL) and you have immediate production observability. OpenObserve wins on full-stack setup speed — single binary deployment in under 2 minutes covering both LLM and infrastructure telemetry.

How much does LLM observability cost at scale?

This is where OpenObserve stands out most clearly. Its Parquet-based 140x compression technology dramatically reduces the cost of storing LLM traces, prompt histories, and operational metrics at scale — critical as LLM application volumes grow.

Originally published on openobserve.ai

Microservices Observability: Using Logs, Metrics, and Traces to Keep Distributed Systems Healthy

Simran Kumari — Wed, 25 Mar 2026 07:20:39 +0000

Picture this: It's Black Friday and you're the lead developer at a high-traffic e-commerce platform. Orders start failing. Customer complaints flood in. Your team scrambles to find the root cause — but with dozens of microservices working together, it feels like searching for a needle in a haystack of needles.

This is exactly the kind of scenario that microservices observability is designed to solve.

In this guide, we'll break down the three pillars of observability, explore real-world implementation patterns, and look at how you can build a system that tells you what's wrong, where it's wrong, and why — before your users start noticing.

What Is Microservices Observability?

Traditional monitoring tells you that something broke. Observability tells you why.

In a microservices architecture, observability means collecting rich telemetry data — logs, metrics, and traces — that allows you to ask arbitrary questions about your system's state without having to predict in advance what might go wrong.

Think of an observable system as one that has a voice. Instead of failing silently, it tells you exactly what happened, when it happened, and why.

The Three Pillars of Observability

1. Logs

Logs are the detailed diary entries of your microservices. Whenever an event occurs — a user action, an internal process, an error — it gets recorded. They're essential for debugging and post-mortem analysis.

Best practice: use structured logging.

Instead of plain-text messages, structured logs use consistent, queryable fields — timestamps, service names, error codes, user IDs. This makes it dramatically faster to isolate the specific entry you're looking for during an incident.

{
  "timestamp": "2024-11-29T14:32:11Z",
  "service": "payment-service",
  "level": "ERROR",
  "message": "Transaction timeout",
  "user_id": "u_83721",
  "trace_id": "4bf92f3577b34da6"
}

2. Metrics

Metrics are quantitative measurements of your system's performance over time. They're typically numeric values — request rates, error rates, latency percentiles, CPU usage — that can be aggregated, trended, and alerted on.

Metrics are your early warning system. A sudden spike in p99 latency or a drop in request throughput can trigger an alert before users even notice something is wrong.

Key metrics to track per service:

Request rate — how many requests per second
Error rate — percentage of failed requests
Latency — response time distributions (p50, p95, p99)
Saturation — how close to capacity the service is running

3. Traces

Traces are the most powerful pillar for distributed systems. They map the complete journey of a single request as it travels through multiple services, showing you the sequence of operations and how long each one takes.

In a microservices environment, a single user action might touch 10+ services. Without traces, when one of them is slow or failing, you're guessing. With traces, you can follow that request step by step and pinpoint exactly where the bottleneck or failure occurred.

User Request
 └─ API Gateway (2ms)
     └─ Order Service (45ms)
         ├─ Inventory Service (12ms)
         └─ Payment Service (820ms) ← Bottleneck here
             └─ Fraud Detection (790ms) ← Root cause

Why All Three Work Together

Each pillar tells part of the story. None of them gives you the full picture alone.

Signal	Answers	Limitation
Metrics	Is something wrong?	Doesn't tell you why
Logs	What happened in detail?	Hard to correlate across services
Traces	Where in the flow did it break?	Doesn't capture system-wide trends

The real power comes when they're correlated. During an incident:

Metrics show you a latency spike at 14:32
Logs from that window reveal timeout errors in the payment service
Traces lead you directly to the fraud detection call that took 790ms

Without all three pieces, you might spend hours checking database connections, server resources, or network configs — chasing the wrong thing entirely.

Common Implementation Patterns

Distributed Tracing

Implement trace context propagation across all your services using a standard like OpenTelemetry. Every service should pass along a trace_id and span_id in its requests so you can reconstruct the full request path after the fact.

from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("user.id", user_id)
    span.set_attribute("transaction.amount", amount)
    result = process_payment(user_id, amount)

Centralized Log Aggregation

Instead of SSH-ing into individual containers to read logs, aggregate everything into a central platform. Tag logs with service name, environment, and crucially, the trace ID — so you can jump from a trace directly to the relevant log lines.

Golden Signals Dashboards

Focus your metrics dashboards around the four golden signals (from Google's SRE book):

Latency — how long requests take
Traffic — how much demand the system is handling
Errors — rate of failed requests
Saturation — how full the system is

Real-World Scenario: The Black Friday Incident

Your e-commerce platform starts seeing order failures during a peak sales event.

Without observability:
You restart services, check CPU, look at database connections. 2 hours later, still no root cause. The incident drags on.

With observability:

Metrics dashboard: response times spiking, error rate climbing since 14:30
Logs: timeout errors concentrated in the payment service
Traces: every failing order shows 800ms+ spent in the fraud detection service
Root cause: a bug in fraud detection makes it extremely slow for high-value transactions — exactly the kind that flood in during flash sales

Total time to resolution: 12 minutes.

Getting Started: Practical Advice

Start small and scale gradually. You don't need to instrument everything on day one. Pick your most critical service, add structured logging, expose a /metrics endpoint, and add basic tracing. Learn from that before expanding.

Use OpenTelemetry. It's the vendor-neutral standard for instrumentation. Instrument once, and you can export to any backend. This avoids lock-in and makes it easy to swap or add tools later.

Correlate with a common trace ID. The most important thing to get right early is making sure your logs, metrics, and traces all share a common identifier. Without this, correlating signals during an incident requires manual timestamp-matching — painful and slow.

Set meaningful alerts. Don't alert on every metric. Focus on user-facing impact: elevated error rates, latency crossing SLO thresholds, or traffic anomalies that suggest something is off.

Conclusion

Effective microservices observability isn't a single tool or a one-time project — it's a continuous practice of collecting, correlating, and acting on telemetry data across your entire system.

By integrating logs, metrics, and traces into a unified observability strategy, you shift from reactive firefighting to proactive system understanding. You spend less time guessing and more time building.

The next time Black Friday rolls around, you'll be ready.

Originally published on OpenObserve Blog.

Logs, Metrics, and Traces: What They Are and When to Use Each

Simran Kumari — Mon, 02 Feb 2026 11:49:43 +0000

It's 2 AM. You get a call: "The website is broken."

You SSH into your server, run top to check CPU, maybe df -h for disk space. Everything looks... fine? You restart the application. It works again. But you're left wondering: what actually went wrong?

This happens because we don't have visibility into what's happening inside our systems. That's what observability solves.

The Three Pillars

Observability data comes in three forms:

Metrics: Numbers that change over time
Logs: Detailed records of specific events
Traces: Maps of requests flowing through distributed systems

Each answers different questions. Let's break them down.

Metrics: Is Everything OK?

Metrics are numbers that change over time. Think of them as your app's vital signs—temperature and pulse, measured continuously.

# Server health
cpu_usage_percent = 45
memory_usage_percent = 67
disk_usage_percent = 23

# Application health  
requests_per_minute = 120
response_time_ms = 250
failed_requests_percent = 0.8
active_users = 43

Metrics tell you something is wrong before users complain. If failed_requests_percent jumps from 0.8% to 15%, you know there's a problem.

When to Use Metrics

Dashboards: Visualize trends over time
Alerting: Get notified when thresholds are breached
Capacity planning: Predict when you'll run out of resources
SLO tracking: Monitor service level objectives

Example

Your response time metric shows requests normally taking 200ms are now taking 2000ms. You check and find the database connection pool is exhausted. Fixed before users notice.

Metrics answer: "Is my system healthy?"

Logs: What Exactly Happened?

Metrics tell you that something is wrong. Logs tell you what.

Logs are detailed records of specific events. They're a diary of everything your app does.

Instead of just knowing "more requests are failing," logs show:

2024-08-08T14:30:15Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password
2024-08-08T14:30:16Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password  
2024-08-08T14:30:17Z ERROR [AuthService] Failed login attempt for user@email.com: account locked after 3 failed attempts
2024-08-08T14:30:45Z INFO [AuthService] Password reset requested for user@email.com

Now the "failed requests spike" makes sense—a user forgot their password. Not a bug, just expected behavior.

Structure Your Logs

Random text logs become useless at scale. Use structured logging (JSON):

{
  "timestamp": "2024-08-08T14:30:15Z",
  "level": "ERROR",
  "service": "AuthService", 
  "message": "Failed login attempt",
  "user_id": "12345",
  "email": "user@email.com",
  "reason": "invalid_password",
  "attempt_count": 1
}

Structured logs let you query:

"Show me all errors from AuthService in the last hour"
"How many failed login attempts did user 12345 have today?"

Logs answer: "What exactly happened?"

Traces: Where Did It Break?

Your single-server app grows into microservices:

User Service: Authentication and profiles
Product Service: Product catalog
Order Service: Processes purchases
Payment Service: Handles transactions

A user reports: "I can't complete my purchase. The page just hangs."

You check metrics—all services look healthy. You check logs in each service... but which service did the request even touch? How do you follow a single user's journey across multiple services?

Traces Connect the Dots

A trace shows the path of a single request through your distributed system.

When a user clicks "Buy Now":

Order Service receives the request and creates a trace ID (e.g., "abc123")
It records: "I'm processing order abc123, started at 10:30:15"
When it calls User Service, it passes along that trace ID
User Service records: "I'm verifying user for trace abc123"
When User Service calls Payment Service, same trace ID
Payment Service records: "I'm charging card for trace abc123"

Each service leaves breadcrumbs connected by the same trace ID. The result:

Purchase Request [1,200ms total]
├── Order Service: Process Order [50ms]
├── User Service: Verify User [100ms] ✓
├── Product Service: Check Inventory [150ms] ✓  
├── Payment Service: Charge Card [900ms] ⚠️
│   ├── Validate Card [100ms] ✓
│   ├── External Payment Gateway [750ms] ⚠️ 
│   └── Update Transaction [50ms] ✓
└── Order Service: Finalize Order [100ms] ✓

The problem is obvious: the external payment gateway is taking 750ms.

When to Use Traces

Debugging latency: Find which service is slow
Understanding dependencies: See how services interact
Root cause analysis: Follow a failing request across the stack
Performance optimization: Identify bottlenecks

Traces answer: "Where in my distributed system did things go wrong?"

Combining All Three: A Real Scenario

Your e-commerce site is struggling during a Black Friday sale.

Step 1: Metrics

Response times are spiking. Error rate is up.

Step 2: Logs

Timeout errors in the Payment Service.

Step 3: Traces

Payment requests take 10+ seconds, but only for orders over $500.

Root cause: The fraud detection system (called by Payment Service) has a bug that makes it extremely slow for high-value transactions.

Without all three, you might have wasted hours checking database connections, server resources, or network issues.

Quick Reference: When to Use What

Question	Use
Is my system healthy?	Metrics
What exactly happened?	Logs
Where did it break (distributed)?	Traces
Should I alert on-call?	Metrics (thresholds)
Why did this specific request fail?	Logs + Traces
Which service is the bottleneck?	Traces

Common Mistakes

Metric overload

Don't track everything. Start with what matters to users: latency, error rate, throughput.

Unstructured logs

console.log("error happened") is useless at scale. Add context, use JSON.

Tracing everything

Sample your traces. 100% trace coverage kills performance and storage.

Using them in isolation

Each pillar is useful alone. Together, they're powerful. Correlate metrics spikes with log errors with trace timelines.

Getting Started

If you're not sure where to begin:

Start with metrics for the RED method: Rate (requests/sec), Errors (error rate), Duration (latency)
Add structured logging to your services
Implement tracing when you have 2+ services communicating

Sign up for a 14-day free OpenObserve Cloud trial and integrate your metrics, logs, and traces into one powerful platform to boost your operational efficiency and enable smarter, faster decision-making.

What's your observability setup? Are you using all three pillars, or still figuring out where to start? Let me know in the comments 👇

How to Convert Logs to Metrics: A Practical Guide with OpenObserve Pipelines

Simran Kumari — Mon, 02 Feb 2026 11:43:23 +0000

Most engineering teams start their observability journey with logs. They're easy to implement, they capture exactly what happened, and when something breaks, logs are usually the first place you look.

But here's the thing: your logs already contain metrics—timestamps, status codes, error flags, latency values. The problem isn't a lack of data. It's that you're asking metric questions while relying entirely on logs.

In this guide, I'll show you how to extract metrics from logs using scheduled pipelines, step by step.

Why Convert Logs to Metrics?

Before diving into the how, let's understand the why.

Aspect	Logs	Metrics
What they represent	Individual events	Aggregated summaries
Detail level	Per request/event	High-level trends
Cardinality	High	Low
Best for	Debugging, root cause analysis	Monitoring, alerting, dashboards
Query cost	Expensive at scale	Cheap and fast

When you try to use logs as a substitute for metrics, you pay the cost of high cardinality for questions that don't need that detail.

Example: Building a dashboard showing "error rates over time" by scanning millions of log entries repeatedly? That's slow and expensive. Deriving a metric once per minute? That's fast and cheap.

Understanding Pipeline Types

In OpenObserve, pipelines fall into two categories:

Real-time Pipelines

Operate on individual events as they arrive. Great for:

Normalizing fields
Enriching records
Dropping noisy data
Routing events to different streams

Scheduled Pipelines

Run at fixed intervals over defined time windows. Perfect for:

Aggregating logs into metrics
Computing summaries
Generating time-series data

For logs-to-metrics conversion, scheduled pipelines are what you need.

The Logs-to-Metrics Flow

Here's how the data flows:

App Logs → Log Stream → Scheduled Pipeline (every 1min) → Metric Stream → Dashboards/Alerts

The pipeline:

Reads logs from the previous time window
Filters and aggregates them
Writes results to a metric stream

Step-by-Step: Converting Kubernetes Logs to Metrics

Let's build this with real Kubernetes logs.

Prerequisites

An OpenObserve instance (Cloud or Self-hosted)
Sample log data (or your own logs)

Step 1: Ingest Your Logs

For this demo, grab some sample Kubernetes logs:

curl -L https://zinc-public-data.s3.us-west-2.amazonaws.com/zinc-enl/sample-k8s-logs/k8slog_json.json.zip -o k8slog_json.json.zip
unzip k8slog_json.json.zip

These logs contain typical K8s fields like _timestamp, code (HTTP status), kubernetes_container_name, kubernetes_labels_app, etc.

Step 2: Define Your Metrics

Before writing pipeline code, decide what you want to measure:

k8s_http_requests_total: Total requests per app per minute
k8s_http_errors_total: Total 5xx errors per app per minute

Step 3: Create the Scheduled Pipeline

Source Query for Request Count

SELECT
  'k8s_http_requests_total' AS "__name__",
  'counter' AS "__type__",
  COUNT(*) AS "value",
  kubernetes_labels_app AS app,
  kubernetes_namespace_name AS namespace, 
  MAX(_timestamp) AS _timestamp
FROM kubernetes_logs
GROUP BY
  kubernetes_labels_app,
  kubernetes_namespace_name

Key fields explained:

__name__ → Metric name (required)
__type__ → Metric type: counter or gauge (required)
value → The actual metric value (required)
app, namespace → Labels for filtering/grouping

Source Query for Error Count

SELECT
  'k8s_http_errors_total' AS __name__,
  'counter' AS __type__,
  COUNT(*) AS value,
  kubernetes_labels_app AS app,
  kubernetes_namespace_name AS namespace,
  MAX(_timestamp) AS _timestamp
FROM kubernetes_logs
WHERE code >= 500
GROUP BY
  kubernetes_labels_app,
  kubernetes_namespace_name

The WHERE code >= 500 filter ensures we only count server errors.

Step 4: Configure Pipeline Settings

Test the query - Run it manually to verify output includes __name__, __type__, and value
Set the interval - Typically 1 minute for real-time metrics
Define the destination - Point to your metric stream
Connect the nodes and save

Step 5: Verify Your Metrics

After the pipeline runs:

Check the destination metric stream
Verify records contain expected metric names, types, values, and labels
Build dashboards using the new metrics

Troubleshooting Common Errors

`error in ingesting metrics missing name`

Your query output doesn't include the metric name.

Fix: Ensure your SQL includes:

SELECT 'metric_name' AS "__name__", ...

`error in ingesting metrics missing type`

The metric type isn't being set.

Fix: Add the type field:

SELECT 'counter' AS "__type__", ...

`error in ingesting metrics missing value`

The numeric value is missing.

Fix: Include an aggregation:

COUNT(*) AS "value"

`DerivedStream has reached max retries of 3`

The pipeline failed multiple times due to validation errors.

Fix:

Open the pipeline config
Run the source SQL manually
Verify all required fields are present
Save and wait for next scheduled run

No Metrics Produced (But No Errors)

The pipeline runs but produces nothing.

Cause: No logs exist in the source stream for the time window.

Fix:

Verify logs are being ingested to the source stream
Run the SQL query manually against recent data
Check the time window matches when logs exist

Debugging Pipeline Failures

Enable usage reporting to track pipeline execution:

ZO_USAGE_REPORTING_ENABLED=true

This surfaces:

Error stream: Detailed failure messages
Triggers stream: Pipeline execution history
UI indicators: Visual failure signals with error messages

What's Next?

Once your logs-to-metrics pipeline is running:

Build dashboards from your derived metrics
Set up alerts on metric thresholds
Create SLOs using your new metrics

Key Takeaways

Logs contain metric data—you just need to extract it
Scheduled pipelines bridge the gap between raw logs and aggregated metrics
Three required fields: __name__, __type__, value
No new instrumentation needed—work with data you already have
Result: Faster dashboards, cheaper queries, better observability

Have you implemented logs-to-metrics in your stack? What challenges did you face? Let me know in the comments! 👇

Originally published on the OpenObserve blog.

Good read!

Simran Kumari — Mon, 02 Feb 2026 04:42:58 +0000

Manas Sharma

Feb 1

NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year

#devops #monitoring #gpu #observability

Comments

7 min read

[Boost]

Simran Kumari — Mon, 02 Feb 2026 04:42:33 +0000

Manas Sharma

Feb 2

FastAPI + OpenTelemetry: Stop Debugging with grep (Use Distributed Tracing)

#fastapi #python #opentelemetry #observability

Comments

3 min read

Forem: Simran Kumari

AI Agent Monitoring: How to Observe Autonomous AI Agents in Production

Why It Matters in Production

🔴 Runaway Token Costs

🔴 Silent Latency Regressions

🔴 Rate-Limit Cascade Failures

🔴 Degraded Output Quality

🔴 Multi-Step Reasoning Failures

🔴 Compliance & Audit Requirements

The Four Pillars of LLM Observability

1. Distributed Tracing

2. Metrics

3. Structured Logs

4. Evaluations (Evals)

Key Metrics to Track

OpenTelemetry: The Standard for AI Observability

How OTel Spans Map to Agent Steps

Setting Up LLM Monitoring with OpenObserve

Prerequisites

Step 1: Configure Your Environment

Step 2: Install Dependencies

Step 3: Instrument Your Application

OpenAI

Anthropic (Claude)

Step 4: View Traces in OpenObserve

What Gets Captured in Each Trace Span

Adding Custom Span Attributes

Unique Challenges in Agentic Systems

Non-Determinism

Long-Horizon Context Windows

Nested and Parallel Tool Calls

Infinite Loop Detection

Multi-Agent Coordination

Best Practices for AI Agent Monitoring

✅ Instrument Early, Not After the Fact

✅ Separate Evaluation Metrics from Operational Metrics

✅ Sample Intelligently, Not Uniformly

✅ Mask Sensitive Data Before Export

✅ Version Your Prompts

✅ Tag Every Trace with Business Context

✅ Build a Feedback Loop from Evals to Prompts

Conclusion

Further Reading

Top 10 APM Tools in 2026: A Complete Comparison Guide

What to Look for in an APM Tool

1. OpenObserve

2. Datadog

3. Dynatrace

4. New Relic

5. AppDynamics

6. Splunk APM

7. Elastic APM

8. Grafana Stack (Prometheus + Loki + Tempo)

9. Honeycomb

10. Site24x7

Quick Comparison Table

How to Choose

Final Thoughts

Best Open Source LLM Observability Tools in 2026: Complete Guide

What Is LLM Observability?

Why LLM Observability Is Different from Traditional Monitoring

What to Look for in an Open Source LLM Observability Tool

Top Open Source LLM Observability Tools

1. OpenObserve

2. Langfuse

3. Arize Phoenix

4. OpenLLMetry

5. Comet Opik

6. Helicone

7. Lunary

8. TruLens

9. PostHog LLM Analytics

10. Weave by Weights & Biases

Comparison Table

How to Choose the Right Tool

1. Start with your deployment requirement

2. Match the tool to your primary bottleneck

3. Consider your framework dependencies

4. Think about the evaluation maturity you need

5. Factor in long-term lock-in risk

`error in ingesting metrics missing name`

`error in ingesting metrics missing type`

`error in ingesting metrics missing value`

`DerivedStream has reached max retries of 3`