<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Simran Kumari</title>
    <description>The latest articles on Forem by Simran Kumari (@simran_kumari_464546e0a3c).</description>
    <link>https://forem.com/simran_kumari_464546e0a3c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3746392%2F62deca9d-c57e-4a16-a24d-45a8d9b11311.png</url>
      <title>Forem: Simran Kumari</title>
      <link>https://forem.com/simran_kumari_464546e0a3c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/simran_kumari_464546e0a3c"/>
    <language>en</language>
    <item>
      <title>Top 10 APM Tools in 2026: A Complete Comparison Guide</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Fri, 27 Mar 2026 09:53:11 +0000</pubDate>
      <link>https://forem.com/simran_kumari_464546e0a3c/top-10-apm-tools-in-2026-a-complete-comparison-guide-57pi</link>
      <guid>https://forem.com/simran_kumari_464546e0a3c/top-10-apm-tools-in-2026-a-complete-comparison-guide-57pi</guid>
      <description>&lt;p&gt;Application Performance Monitoring (APM) tools help engineering teams track, analyze, and optimize how their applications behave in production. They collect telemetry data — response times, error rates, throughput — across distributed systems and turn it into actionable insights.&lt;/p&gt;

&lt;p&gt;But picking the right APM tool in 2026 is more nuanced than it used to be. Teams are increasingly pushing back on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runaway costs&lt;/strong&gt; that scale with data volume or host count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor lock-in&lt;/strong&gt; from proprietary agents and query languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of data sovereignty&lt;/strong&gt; when compliance requires on-prem or regional storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unnecessary complexity&lt;/strong&gt; for teams with simpler observability needs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide covers 10 APM tools that address these concerns — from open source platforms to enterprise SaaS solutions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Look for in an APM Tool
&lt;/h2&gt;

&lt;p&gt;Before diving in, here's a quick framework for evaluating your options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;What to Evaluate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unified Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single pane for metrics, logs, and traces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost Structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transparent pricing; no hidden fees at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Ownership&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted option, data export, retention control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ingestion throughput and query performance at volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Migration Ease&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenTelemetry support, agent compatibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Capabilities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL, PromQL, or a proprietary DSL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alerting &amp;amp; Visualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alert config flexibility, dashboard quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High-Cardinality Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User-level tracking without cost blowups&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  1. OpenObserve
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams wanting unified observability without vendor lock-in or unpredictable costs.&lt;/p&gt;

&lt;p&gt;OpenObserve is an open-source observability platform that unifies logs, metrics, traces, and APM in a single interface. Its columnar storage and aggressive compression (the project claims up to 140x lower storage costs) can reduce storage and ingestion spend by 60–90% compared to legacy tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unified logs, metrics, traces, and APM in one platform&lt;/li&gt;
&lt;li&gt;OpenTelemetry-native — works as a drop-in replacement for proprietary agents&lt;/li&gt;
&lt;li&gt;SQL-based querying instead of a vendor-specific DSL&lt;/li&gt;
&lt;li&gt;Self-hosted or cloud deployment options&lt;/li&gt;
&lt;li&gt;No per-host or per-metric billing surprises&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires SQL familiarity for advanced analysis&lt;/li&gt;
&lt;li&gt;Smaller integration marketplace vs. legacy vendors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; Self-hosted / Cloud&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Open source + low-cost cloud&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Datadog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want a mature, feature-rich SaaS platform and have the budget for it.&lt;/p&gt;

&lt;p&gt;Datadog is one of the most well-known names in cloud monitoring, offering 900+ integrations and a powerful unified platform for metrics, logs, traces, RUM, and security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enormous integration ecosystem (900+ integrations)&lt;/li&gt;
&lt;li&gt;Strong end-to-end distributed tracing with automatic service discovery&lt;/li&gt;
&lt;li&gt;AI-powered anomaly detection and root cause analysis&lt;/li&gt;
&lt;li&gt;Quick time-to-value with solid documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pricing scales rapidly with data volume and host count&lt;/li&gt;
&lt;li&gt;Complex billing model with separate per-feature charges&lt;/li&gt;
&lt;li&gt;Custom metric auto-generation can create unexpected costs&lt;/li&gt;
&lt;li&gt;Proprietary agents and query language create lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Host + usage-based&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Dynatrace
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Large enterprises running complex distributed systems that want automated instrumentation.&lt;/p&gt;

&lt;p&gt;Dynatrace's OneAgent handles instrumentation automatically, and its Davis AI engine cuts through alert noise with built-in root cause analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero-touch instrumentation via OneAgent&lt;/li&gt;
&lt;li&gt;Strong AI-driven alerting with reduced noise&lt;/li&gt;
&lt;li&gt;Excellent support for hybrid and on-premises environments&lt;/li&gt;
&lt;li&gt;End-to-end visibility from infrastructure to user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Premium pricing — often the most expensive option&lt;/li&gt;
&lt;li&gt;Proprietary data formats and agents&lt;/li&gt;
&lt;li&gt;Can be overkill for smaller or cloud-native teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS / Hybrid&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Host / unit-based&lt;/p&gt;




&lt;h2&gt;
  
  
  4. New Relic
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams wanting a familiar all-in-one SaaS APM experience with a generous free tier.&lt;/p&gt;

&lt;p&gt;New Relic offers deep code-level performance visibility across metrics, logs, traces, RUM, and synthetics, with a 100 GB/month free data ingest that makes it accessible for smaller teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unified platform with strong APM capabilities&lt;/li&gt;
&lt;li&gt;100 GB/month data ingest on the free tier&lt;/li&gt;
&lt;li&gt;Good OpenTelemetry support for easier migration&lt;/li&gt;
&lt;li&gt;Developer-friendly onboarding and documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still a proprietary SaaS platform&lt;/li&gt;
&lt;li&gt;Costs grow quickly with high data volumes&lt;/li&gt;
&lt;li&gt;Limited data residency control vs. self-hosted tools&lt;/li&gt;
&lt;li&gt;Advanced features gated behind higher pricing tiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Usage-based&lt;/p&gt;




&lt;h2&gt;
  
  
  5. AppDynamics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprises already in the Cisco ecosystem that need deep business transaction visibility.&lt;/p&gt;

&lt;p&gt;AppDynamics maps application performance directly to business outcomes — making it particularly useful for organizations where IT metrics need to connect to revenue and customer experience metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deep code-level visibility and dependency mapping&lt;/li&gt;
&lt;li&gt;Business transaction monitoring that connects to business impact&lt;/li&gt;
&lt;li&gt;Tight Cisco networking and security integration&lt;/li&gt;
&lt;li&gt;Works well across on-prem, hybrid, and legacy environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive enterprise pricing&lt;/li&gt;
&lt;li&gt;Heavy agent-based approach&lt;/li&gt;
&lt;li&gt;Complex setup and configuration&lt;/li&gt;
&lt;li&gt;Less cloud-native than modern alternatives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS / On-prem&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Unit-based&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Splunk APM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Compliance-heavy organizations with mature security and audit requirements.&lt;/p&gt;

&lt;p&gt;Splunk has been an enterprise log-analytics powerhouse for years; its APM offering extends that depth to distributed tracing and full-fidelity observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely powerful analytics via SPL (Search Processing Language)&lt;/li&gt;
&lt;li&gt;Enterprise-grade security and compliance capabilities&lt;/li&gt;
&lt;li&gt;Full-fidelity tracing with no default sampling&lt;/li&gt;
&lt;li&gt;Flexible deployment (on-prem and cloud)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One of the most expensive APM tools on the market&lt;/li&gt;
&lt;li&gt;Steep learning curve for SPL&lt;/li&gt;
&lt;li&gt;Complex licensing model&lt;/li&gt;
&lt;li&gt;Often excessive for pure APM use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS / On-prem&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Data-volume based&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Elastic APM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using the ELK stack who want to extend into APM.&lt;/p&gt;

&lt;p&gt;Elastic Observability builds on Elasticsearch's powerful full-text and structured search to offer logs, metrics, and APM in a unified interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best-in-class log search via Elasticsearch&lt;/li&gt;
&lt;li&gt;Flexible deployment (cloud, self-hosted, hybrid)&lt;/li&gt;
&lt;li&gt;Large community and broad ecosystem integrations&lt;/li&gt;
&lt;li&gt;Strong SIEM overlap for security + observability use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive to operate at scale&lt;/li&gt;
&lt;li&gt;High infrastructure and tuning overhead&lt;/li&gt;
&lt;li&gt;Storage costs can grow quickly&lt;/li&gt;
&lt;li&gt;Complex cluster management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; Self-hosted / Cloud&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Data / host-based&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Grafana Stack (Prometheus + Loki + Tempo)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want best-in-class open source tools and have the ops capability to manage them.&lt;/p&gt;

&lt;p&gt;The Grafana Stack isn't a single product — it's a collection of open source tools: Prometheus for metrics, Loki for logs, and Tempo for traces, all visualized through Grafana dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus is the de facto standard for Kubernetes and infrastructure metrics&lt;/li&gt;
&lt;li&gt;Completely open source, no vendor lock-in&lt;/li&gt;
&lt;li&gt;Highly customizable dashboards that rival commercial tools&lt;/li&gt;
&lt;li&gt;Thousands of exporters and plugins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not a unified product — requires managing multiple systems&lt;/li&gt;
&lt;li&gt;Significantly higher operational overhead at scale&lt;/li&gt;
&lt;li&gt;Alerting setup is more complex than integrated platforms&lt;/li&gt;
&lt;li&gt;Steeper learning curve for full-stack setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; Self-hosted / Cloud (Grafana Cloud managed option available)&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; OSS + managed tiers&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Honeycomb
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering teams debugging complex distributed systems with high-cardinality data.&lt;/p&gt;

&lt;p&gt;Honeycomb was purpose-built for the challenges modern microservices create — where request IDs, user IDs, and other high-cardinality fields need to be tracked without blowing up your observability bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles high-cardinality dimensions (user IDs, request IDs) without performance or cost penalties&lt;/li&gt;
&lt;li&gt;Fast, ad-hoc exploratory querying for unknown unknowns&lt;/li&gt;
&lt;li&gt;First-class SLOs, error budgets, and burn-rate alerts&lt;/li&gt;
&lt;li&gt;OpenTelemetry-native ingestion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SaaS-only, no self-hosted option&lt;/li&gt;
&lt;li&gt;Pricing scales with event volume&lt;/li&gt;
&lt;li&gt;Less focus on traditional infrastructure dashboards&lt;/li&gt;
&lt;li&gt;Different mental model than legacy APM tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Event-based&lt;/p&gt;
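
&lt;p&gt;Honeycomb's burn-rate alerting is easy to reason about with a small sketch. The formula below is the general SLO burn-rate idea, not Honeycomb's exact implementation, and the numbers are illustrative:&lt;/p&gt;

```python
# Sketch of SLO burn-rate alerting: how fast the error budget is being
# consumed relative to what the SLO target allows. Thresholds and numbers
# here are illustrative conventions, not Honeycomb's exact formula.

def burn_rate(bad_events, total_events, slo_target=0.999):
    """Observed error rate divided by the error budget the SLO allows."""
    error_budget = 1.0 - slo_target            # 0.1% budget for a 99.9% SLO
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget

# 0.5% errors against a 0.1% budget burns the budget roughly 5x too fast,
# which a multi-window burn-rate alert would typically page on.
rate = burn_rate(bad_events=50, total_events=10_000)
```

&lt;p&gt;A burn rate of 1.0 means the budget lasts exactly the SLO window; values well above 1.0 mean it will be exhausted early.&lt;/p&gt;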




&lt;h2&gt;
  
  
  10. Site24x7
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Smaller DevOps teams wanting broad monitoring coverage at a competitive price.&lt;/p&gt;

&lt;p&gt;Site24x7 covers APM, RUM, synthetic monitoring, server, and cloud monitoring in one platform — without the enterprise price tag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Competitive pricing with a broad feature set&lt;/li&gt;
&lt;li&gt;Quick setup and guided onboarding&lt;/li&gt;
&lt;li&gt;Covers APM, synthetic, infrastructure, and cloud in one tool&lt;/li&gt;
&lt;li&gt;Good customer support reputation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI feels dated compared to modern competitors&lt;/li&gt;
&lt;li&gt;Less depth in distributed tracing&lt;/li&gt;
&lt;li&gt;Advanced features locked behind higher tiers&lt;/li&gt;
&lt;li&gt;Smaller community and ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; SaaS&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Tier-based&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;Metrics&lt;/th&gt;
&lt;th&gt;Logs&lt;/th&gt;
&lt;th&gt;Traces&lt;/th&gt;
&lt;th&gt;APM&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenObserve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;OSS + low-cost cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Host + usage-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynatrace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / Hybrid&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Host / unit-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New Relic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AppDynamics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / On-prem&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Unit-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Splunk APM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS / On-prem&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Data-volume based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic APM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Data / host-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana Stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted / Cloud&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;OSS + managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Honeycomb&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Event-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Site24x7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Tier-based&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How to Choose
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;By budget:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tight → OpenObserve, Grafana Stack, Elastic APM&lt;/li&gt;
&lt;li&gt;Moderate → New Relic (free tier), Site24x7&lt;/li&gt;
&lt;li&gt;Enterprise → Dynatrace, Datadog, Splunk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;By deployment preference:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted required → OpenObserve, Grafana Stack, Elastic&lt;/li&gt;
&lt;li&gt;SaaS preferred → New Relic, Datadog, Honeycomb, OpenObserve Cloud&lt;/li&gt;
&lt;li&gt;Hybrid needed → Dynatrace, Elastic, AppDynamics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;By use case:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;General observability → OpenObserve, New Relic, Datadog&lt;/li&gt;
&lt;li&gt;Business transaction visibility → AppDynamics&lt;/li&gt;
&lt;li&gt;Log analytics → OpenObserve, Elastic, Splunk&lt;/li&gt;
&lt;li&gt;High-cardinality tracing → Honeycomb, OpenObserve&lt;/li&gt;
&lt;li&gt;Security + observability → Splunk, Elastic, OpenObserve&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;By migration strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quick migration → OpenTelemetry-native tools (OpenObserve, Honeycomb, New Relic)&lt;/li&gt;
&lt;li&gt;Gradual transition → Start with one signal type (logs or metrics first)&lt;/li&gt;
&lt;li&gt;Parallel running → Run new tool alongside existing APM during evaluation&lt;/li&gt;
&lt;/ul&gt;
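
&lt;p&gt;The parallel-running strategy can be sketched in a few lines: fan every telemetry event out to both the current backend and the candidate one, so dashboards can be compared side by side during evaluation. The sender callables below are hypothetical stand-ins for real exporters:&lt;/p&gt;

```python
# Fan-out sketch for parallel running: one event, every configured backend.
# The "backends" here are in-memory buffers standing in for real exporters.

def fan_out(event, senders):
    """Send one telemetry event to every configured backend."""
    results = {}
    for name, send in senders.items():
        results[name] = send(event)
    return results

legacy_buffer, candidate_buffer = [], []
senders = {
    "legacy_apm": lambda e: legacy_buffer.append(e) or "ok",
    "candidate_apm": lambda e: candidate_buffer.append(e) or "ok",
}

results = fan_out({"service": "checkout", "latency_ms": 182}, senders)
```

&lt;p&gt;In practice you would point an OpenTelemetry Collector at two exporters instead of hand-rolling this, but the shape of the cutover is the same.&lt;/p&gt;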




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The APM landscape in 2026 is richer — and more opinionated — than ever. The right tool depends on your team's technical depth, budget constraints, compliance requirements, and how much operational overhead you're willing to take on.&lt;/p&gt;

&lt;p&gt;A few principles that apply regardless of which tool you choose:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Adopt OpenTelemetry&lt;/strong&gt; to instrument once and avoid being locked into any specific backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start with a pilot&lt;/strong&gt; on non-critical services before committing to a full migration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model your costs at scale&lt;/strong&gt; — what looks cheap at 10 hosts can surprise you at 100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run tools in parallel&lt;/strong&gt; during evaluation to validate parity before cutting over&lt;/li&gt;
&lt;/ol&gt;
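
&lt;p&gt;Principle 3 is worth making concrete. The toy model below uses invented prices (not any vendor's real rates), but it shows how a per-host plan and a usage-based plan can rank very differently at 10 hosts versus 100:&lt;/p&gt;

```python
# Toy monthly cost model for "model your costs at scale". All prices are
# made up for illustration; plug in real quotes before deciding anything.

def per_host_cost(hosts, price_per_host=23.0, gb_per_host=2.0, price_per_gb=0.10):
    """Monthly bill when pricing scales with host count plus ingested data."""
    ingested_gb = hosts * gb_per_host * 30  # 30 days of daily ingest
    return hosts * price_per_host + ingested_gb * price_per_gb

def usage_cost(hosts, gb_per_host=2.0, price_per_gb=0.30):
    """Monthly bill when pricing is purely data-volume based."""
    return hosts * gb_per_host * 30 * price_per_gb

small = (per_host_cost(10), usage_cost(10))    # roughly (290, 180)
large = (per_host_cost(100), usage_cost(100))  # roughly (2900, 1800)
```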

&lt;p&gt;If you're looking for a starting point that balances cost, flexibility, and full-stack observability, &lt;a href="https://openobserve.ai/" rel="noopener noreferrer"&gt;OpenObserve&lt;/a&gt; is worth a look — it's open source, OTel-native, and offers both self-hosted and cloud deployment options.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally based on the &lt;a href="https://openobserve.ai/blog/top-10-apm-tools/" rel="noopener noreferrer"&gt;OpenObserve blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>observability</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Best Open Source LLM Observability Tools in 2026: Complete Guide</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Wed, 25 Mar 2026 15:27:12 +0000</pubDate>
      <link>https://forem.com/simran_kumari_464546e0a3c/best-open-source-llm-observability-tools-in-2026-complete-guide-kn5</link>
      <guid>https://forem.com/simran_kumari_464546e0a3c/best-open-source-llm-observability-tools-in-2026-complete-guide-kn5</guid>
      <description>&lt;h2&gt;
  
  
  What Is LLM Observability?
&lt;/h2&gt;

&lt;p&gt;LLM observability is the practice of monitoring, tracing, and analyzing every layer of an AI application — from the prompt you send to the final response your model returns. As AI systems grow more complex, with multi-step agent workflows, retrieval-augmented generation (RAG) pipelines, and tool calls chained together, traditional logging falls short.&lt;/p&gt;

&lt;p&gt;The four core components of LLM observability are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracing&lt;/strong&gt; — tracking the full lifecycle of a user interaction, including intermediate steps, model API calls, and tool invocations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation&lt;/strong&gt; — measuring output quality through automated metrics (relevance, faithfulness, toxicity) or human annotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost &amp;amp; Usage Monitoring&lt;/strong&gt; — tracking token consumption, latency, and spend per model, user, or session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Management&lt;/strong&gt; — versioning, testing, and iterating on prompts without losing reproducibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, teams are blind to quality regressions, prompt drift, hallucinations, and runaway API costs in production.&lt;/p&gt;
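
&lt;p&gt;Those four components can be made concrete with a minimal trace record for a single LLM call. The field names and the per-token price below are illustrative assumptions, not any specific tool's schema:&lt;/p&gt;

```python
# Minimal sketch of one LLM-call record touching all four components.
# PRICE_PER_1K_TOKENS and every field name are illustrative assumptions.
import uuid

PRICE_PER_1K_TOKENS = 0.002  # assumed blended rate, for illustration only

def record_llm_call(prompt_version, prompt_tokens, completion_tokens, latency_s):
    """Build one trace-like record covering the four observability components."""
    total = prompt_tokens + completion_tokens
    return {
        "trace_id": uuid.uuid4().hex,       # tracing: correlate steps later
        "prompt_version": prompt_version,   # prompt management: reproducibility
        "latency_s": round(latency_s, 3),   # cost & usage monitoring
        "total_tokens": total,
        "cost_usd": round(total / 1000 * PRICE_PER_1K_TOKENS, 6),
        "eval_score": None,  # evaluation: filled later, e.g. by an LLM-as-judge run
    }

span = record_llm_call("checkout-v3", prompt_tokens=420,
                       completion_tokens=180, latency_s=1.27)
```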




&lt;h2&gt;
  
  
  Why LLM Observability Is Different from Traditional Monitoring
&lt;/h2&gt;

&lt;p&gt;Traditional observability tools like Grafana and Prometheus are excellent for infrastructure-level signals — CPU, memory, request rates, latency percentiles. But LLMs introduce an entirely new class of failure modes that metrics alone cannot detect:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional Monitoring&lt;/th&gt;
&lt;th&gt;LLM Observability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tracks uptime, latency, error rates&lt;/td&gt;
&lt;td&gt;Tracks hallucinations, prompt quality, output relevance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerts on crashes or timeouts&lt;/td&gt;
&lt;td&gt;Alerts on silent quality regressions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Measures infrastructure health&lt;/td&gt;
&lt;td&gt;Measures model behavior and output correctness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query languages: PromQL, SQL&lt;/td&gt;
&lt;td&gt;Evaluation frameworks: LLM-as-judge, semantic similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboards for SREs&lt;/td&gt;
&lt;td&gt;Dashboards for ML engineers and product teams&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What to Look for in an Open Source LLM Observability Tool
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://dl.acm.org/doi/10.1145/3706598.3713913" rel="noopener noreferrer"&gt;CHI 2025 study with 30 developers&lt;/a&gt; identified four core design principles every solid LLM observability tool should satisfy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Principle&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes model behavior visible — you understand what is happening inside the system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time feedback during training and evaluation to catch issues early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Intervention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enables you to act on problems as they surface, not after users report them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports long-term maintainability as models and requirements evolve&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Beyond those principles, evaluate tools on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosting support&lt;/strong&gt; — critical for data residency and compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework integrations&lt;/strong&gt; — LangChain, LlamaIndex, OpenAI SDK, LiteLLM, Vercel AI SDK, Haystack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry compatibility&lt;/strong&gt; — avoids vendor lock-in and lets you route traces to any OTEL-compatible backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation capabilities&lt;/strong&gt; — LLM-as-judge, human annotation, hallucination detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt management&lt;/strong&gt; — versioning and collaboration features for iterating on prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost tracking&lt;/strong&gt; — per-user, per-model, per-session breakdowns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified observability&lt;/strong&gt; — whether the tool also covers infrastructure so you don't need a second platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt; — MIT, Apache 2.0, and Elastic License 2.0 carry very different implications for commercial use&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Top Open Source LLM Observability Tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. OpenObserve
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; AGPL-3.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://openobserve.ai/" rel="noopener noreferrer"&gt;openobserve.ai&lt;/a&gt; | &lt;strong&gt;Cloud:&lt;/strong&gt; &lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;cloud.openobserve.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenObserve is our top pick for 2026.&lt;/strong&gt; While most tools on this list specialize in LLM-specific concerns, OpenObserve unifies LLM observability with full infrastructure monitoring — logs, metrics, traces, and frontend (RUM) monitoring — in a single deployment. For teams tired of managing a separate DevOps telemetry stack alongside a dedicated LLM tool, OpenObserve eliminates that overhead entirely.&lt;/p&gt;

&lt;p&gt;Built on OpenTelemetry standards and using a Parquet columnar format with aggressive compression, OpenObserve claims up to &lt;strong&gt;140x lower storage costs&lt;/strong&gt; compared to traditional stacks like Prometheus + Loki + Tempo. Its SQL-based query interface means teams can correlate LLM trace data with infrastructure metrics without learning multiple proprietary query languages. With its single-binary deployment, you can be up and running in under 2 minutes.&lt;/p&gt;
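
&lt;p&gt;That SQL-style correlation can be sketched with sqlite3 so it runs anywhere. The table and column names below are hypothetical, not OpenObserve's actual stream schema; the point is that one familiar SQL dialect can join LLM trace data with infrastructure metrics:&lt;/p&gt;

```python
# Joining LLM trace data with host metrics in plain SQL. sqlite3 stands in
# for the backend; table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE llm_traces (service TEXT, total_tokens INT, latency_ms REAL)")
con.execute("CREATE TABLE host_metrics (service TEXT, cpu_pct REAL)")
con.executemany("INSERT INTO llm_traces VALUES (?, ?, ?)",
                [("rag-api", 600, 950.0), ("rag-api", 820, 1400.0)])
con.execute("INSERT INTO host_metrics VALUES ('rag-api', 87.5)")

# One query answers: is this service slow while its host is also hot?
row = con.execute("""
    SELECT t.service, AVG(t.latency_ms), m.cpu_pct
    FROM llm_traces t JOIN host_metrics m ON t.service = m.service
    GROUP BY t.service
""").fetchone()
```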

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxgs69jmt69ojwxt3ynx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxgs69jmt69ojwxt3ynx.png" alt="LLM Observability in OpenObserve" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified platform&lt;/strong&gt; — logs, metrics, traces, LLM traces, and RUM monitoring in one tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry-native&lt;/strong&gt; — drop-in instrumentation for LLM applications using any OTEL SDK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL-based queries&lt;/strong&gt; — correlate LLM trace data with infrastructure signals using familiar syntax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;140x lower storage costs&lt;/strong&gt; — Parquet columnar format with aggressive compression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-cardinality support&lt;/strong&gt; — handles per-user, per-session, and per-request LLM telemetry without performance degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single binary deployment&lt;/strong&gt; — self-hosted in under 2 minutes; no Kubernetes expertise required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time alerting&lt;/strong&gt; — set alerts on token usage, latency spikes, error rates, and custom LLM metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich dashboards&lt;/strong&gt; — visualization for both infrastructure health and LLM operational metrics side by side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted or Cloud&lt;/strong&gt; — full data residency control with flexible deployment options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only open source platform covering infrastructure observability AND LLM tracing in a single tool&lt;/li&gt;
&lt;li&gt;140x storage cost reduction makes it dramatically cheaper to retain long-term LLM trace history&lt;/li&gt;
&lt;li&gt;SQL querying lowers the learning curve — one language for both infrastructure and LLM queries&lt;/li&gt;
&lt;li&gt;Fully OpenTelemetry-native — no vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM-specific features like LLM-as-judge evaluation and prompt management are handled through integrations rather than built-in modules&lt;/li&gt;
&lt;li&gt;Advanced LLM dashboard templates require manual configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source (self-hosted): Free&lt;/li&gt;
&lt;li&gt;Cloud: Free tier available; usage-based pricing beyond that&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want a single open source platform covering both LLM observability and infrastructure monitoring, or organizations with strict self-hosting/data residency requirements.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Langfuse
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GitHub Stars:&lt;/strong&gt; 21,000+ | &lt;strong&gt;License:&lt;/strong&gt; MIT (core) | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;langfuse.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Langfuse is the most widely adopted open source LLM-specific observability platform. Launched out of Y Combinator's W23 batch, it was recently acquired by ClickHouse, signalling strong long-term investment in its data infrastructure. Its MIT-licensed core covers end-to-end tracing, prompt management, evaluation, and datasets — everything a production LLM team needs on the application layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xwdq6nso10y0z0xj9v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xwdq6nso10y0z0xj9v8.png" alt="Langfuse" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end tracing across LLM calls, retrieval steps, and agent actions with waterfall views&lt;/li&gt;
&lt;li&gt;Session replay to reconstruct complete conversation histories for debugging&lt;/li&gt;
&lt;li&gt;Prompt management with version control and live iteration without redeployment&lt;/li&gt;
&lt;li&gt;LLM-as-a-judge evaluation workflows for hallucination, toxicity, and relevance&lt;/li&gt;
&lt;li&gt;LLM Playground for testing prompts directly from a failed trace&lt;/li&gt;
&lt;li&gt;Native integrations: LangChain, LlamaIndex, OpenAI SDK, LiteLLM, Vercel AI SDK, Haystack, Mastra&lt;/li&gt;
&lt;li&gt;Self-host via Docker Compose in under 5 minutes&lt;/li&gt;
&lt;/ul&gt;
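&lt;p&gt;The prompt-management idea, iterating on a versioned prompt without redeploying application code, can be sketched in a few lines. This is not the Langfuse SDK, just a hypothetical in-memory registry showing the concept:&lt;/p&gt;

```python
# Minimal in-memory sketch of versioned prompt management.
# This is NOT the Langfuse SDK API; it only illustrates the idea of
# iterating on a prompt without redeploying application code.
class PromptRegistry:
    def __init__(self):
        self._versions = {}  # name mapped to a list of prompt strings

    def push(self, name, text):
        self._versions.setdefault(name, []).append(text)
        return len(self._versions[name])  # 1-based version number

    def get(self, name, version=None):
        history = self._versions[name]
        if version is None:
            return history[-1]  # latest version by default
        return history[version - 1]

registry = PromptRegistry()
registry.push("summarize", "Summarize the text.")
registry.push("summarize", "Summarize the text in three bullet points.")
latest = registry.get("summarize")
v1 = registry.get("summarize", version=1)
```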

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strongest LLM-specific community adoption in the open source space&lt;/li&gt;
&lt;li&gt;Covers the full LLM development lifecycle — tracing, evals, datasets, prompt management&lt;/li&gt;
&lt;li&gt;Generous free tier on Langfuse Cloud (50k events/month, 2 users)&lt;/li&gt;
&lt;li&gt;True MIT license on core features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No built-in infrastructure monitoring — needs a separate platform for full-stack visibility&lt;/li&gt;
&lt;li&gt;Enterprise features (SSO, RBAC, advanced security) are separately licensed&lt;/li&gt;
&lt;li&gt;Cloud pricing can grow quickly at high event volumes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted: Free&lt;/li&gt;
&lt;li&gt;Cloud: Free up to 50k events/month, then $29/month for 100k events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering teams that want the deepest open source LLM-specific observability with prompt management and evaluation built in.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Arize Phoenix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Elastic License 2.0 (source-available) | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://phoenix.arize.com/" rel="noopener noreferrer"&gt;phoenix.arize.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Arize Phoenix is a source-available observability platform built specifically for LLM applications, RAG pipelines, and agent workflows. Built on OpenTelemetry standards, it includes built-in hallucination detection and embedding drift visualization, making it particularly powerful for teams iterating on retrieval pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mcq92bni4kt3j42zvqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mcq92bni4kt3j42zvqn.png" alt="Arize Phoenix" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end tracing for prompts, responses, and agent workflows&lt;/li&gt;
&lt;li&gt;RAG observability — inspect retrieval results, chunk quality, and grounding&lt;/li&gt;
&lt;li&gt;Hallucination detection built in&lt;/li&gt;
&lt;li&gt;Embedding drift detection for monitoring distribution shifts over time&lt;/li&gt;
&lt;li&gt;OpenTelemetry-native export to OpenObserve, Datadog, Grafana, or any OTEL backend&lt;/li&gt;
&lt;li&gt;Supports Python and JavaScript&lt;/li&gt;
&lt;/ul&gt;
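&lt;p&gt;Embedding drift detection boils down to comparing the distribution of recent embeddings against a baseline. A toy version, using only centroid shift (Phoenix uses much richer statistics and visualization), looks like this:&lt;/p&gt;

```python
import math

# Toy embedding-drift check: compare the centroid of a current batch
# of query embeddings with a baseline centroid. Real tools use far
# richer distribution statistics; this only shows the basic idea.
def centroid(vectors):
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def drift(baseline, current):
    a, b = centroid(baseline), centroid(current)
    return math.dist(a, b)  # Euclidean shift between centroids

baseline = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15]]
shifted = [[0.8, 0.9], [0.9, 0.8], [0.85, 0.85]]
low = drift(baseline, baseline)   # no drift at all
high = drift(baseline, shifted)   # clearly larger shift
```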

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purpose-built for RAG and agent debugging — best-in-class for retrieval pipeline visibility&lt;/li&gt;
&lt;li&gt;OTEL-native design eliminates vendor lock-in&lt;/li&gt;
&lt;li&gt;Rich visualizations for understanding embedding spaces and cluster drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elastic License 2.0 restricts certain commercial uses (not true open source)&lt;/li&gt;
&lt;li&gt;Less mature prompt management than Langfuse&lt;/li&gt;
&lt;li&gt;No infrastructure monitoring — requires a separate backend&lt;/li&gt;
&lt;li&gt;Enterprise features require moving to the Arize AX platform ($50/month+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phoenix (source-available): Free&lt;/li&gt;
&lt;li&gt;Arize AX Pro: $50/month; Enterprise: custom&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; AI engineering teams building RAG-based systems and agent workflows where deep retrieval pipeline visibility is critical.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. OpenLLMetry
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://www.traceloop.com/docs/openllmetry" rel="noopener noreferrer"&gt;traceloop.com/docs/openllmetry&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenLLMetry is the most vendor-neutral option on this list. An open source observability framework built purely on OpenTelemetry standards, it provides LLM instrumentation for Python and TypeScript with a single line of setup code. It then ships traces to any OTEL-compatible backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk9kddqochvx4dkxur1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk9kddqochvx4dkxur1e.png" alt="OpenLLMetry" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-line setup for automatic instrumentation&lt;/li&gt;
&lt;li&gt;Supports OpenAI, Anthropic, Cohere, Azure OpenAI, Bedrock, Vertex AI, and more&lt;/li&gt;
&lt;li&gt;Framework support: LangChain, LlamaIndex, Haystack, CrewAI, and others&lt;/li&gt;
&lt;li&gt;Privacy controls for redacting sensitive prompts from traces&lt;/li&gt;
&lt;li&gt;Custom attributes for A/B testing and feature flag tracking&lt;/li&gt;
&lt;li&gt;Completely free — no licensing costs&lt;/li&gt;
&lt;/ul&gt;
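&lt;p&gt;The custom-attributes feature can be illustrated with a hypothetical decorator that attaches A/B-test metadata to each traced call. This is not the OpenLLMetry API, only a sketch of how per-call attributes ride along with spans:&lt;/p&gt;

```python
import functools

# Hypothetical sketch of attribute-tagged tracing, loosely in the
# spirit of OTEL span attributes. This is NOT the OpenLLMetry API;
# it only shows how per-call metadata (A/B variant, feature flag)
# can travel with each traced LLM call.
SPANS = []

def traced(**attributes):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            # Record a minimal "span" after the call completes.
            SPANS.append({"name": fn.__name__, "attributes": dict(attributes)})
            return result
        return wrapper
    return decorator

@traced(variant="B", feature_flag="new_system_prompt")
def answer(question):
    return f"stub answer to: {question}"

reply = answer("What is OTEL?")
```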

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;True vendor neutrality — switch backends without changing instrumentation code&lt;/li&gt;
&lt;li&gt;Widest framework and provider coverage on the list&lt;/li&gt;
&lt;li&gt;Fully Apache 2.0 licensed — safe for any commercial use&lt;/li&gt;
&lt;li&gt;Zero cost, zero lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instrumentation library only — requires a separate backend for storage, dashboards, and alerting&lt;/li&gt;
&lt;li&gt;No built-in evaluation, prompt management, or dashboards&lt;/li&gt;
&lt;li&gt;Requires more setup work to build a complete observability stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Completely free&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want vendor-neutral LLM instrumentation and already have an observability backend, or teams building a custom OpenTelemetry-native stack.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Comet Opik
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://www.comet.com/site/products/opik/" rel="noopener noreferrer"&gt;comet.com/site/products/opik&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Opik is an open source LLM observability and evaluation platform from Comet ML, focused on systematic testing, optimization, and production monitoring. It stands out for its automated prompt optimization — six algorithms including Few-shot Bayesian, evolutionary, and LLM-powered MetaPrompt approaches — which is rare in open source tooling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F574n68s79c95yvhpp370.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F574n68s79c95yvhpp370.png" alt="Comet Opik" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full tracing for LLM calls, agent steps, and RAG pipelines&lt;/li&gt;
&lt;li&gt;Automated prompt optimization (six algorithms built in)&lt;/li&gt;
&lt;li&gt;Built-in guardrails for PII filtering, off-topic detection, and competitor mention blocking&lt;/li&gt;
&lt;li&gt;Works with any LLM provider; native integrations for LangChain, LlamaIndex, OpenAI, Anthropic, Vertex AI&lt;/li&gt;
&lt;li&gt;60-day data retention on free hosted plan with unlimited team members&lt;/li&gt;
&lt;li&gt;Self-hostable with full features available in the codebase&lt;/li&gt;
&lt;/ul&gt;
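&lt;p&gt;A guardrail, at its simplest, is a filter applied before a response leaves the application. The toy example below redacts email addresses; Opik's built-in guardrails cover far more patterns and policies:&lt;/p&gt;

```python
import re

# Toy guardrail in the spirit of built-in PII filtering: redact email
# addresses before a response is returned to the user. Real guardrails
# handle many more PII types, plus topic and policy checks.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_pii(text):
    return EMAIL.sub("[REDACTED_EMAIL]", text)

safe = redact_pii("Contact jane.doe@example.com for details.")
```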

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated prompt optimization is a major differentiator&lt;/li&gt;
&lt;li&gt;Guardrails are built in, not bolted on&lt;/li&gt;
&lt;li&gt;Truly open source (Apache 2.0) with full feature access&lt;/li&gt;
&lt;li&gt;Unlimited team members on free tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller community than Langfuse&lt;/li&gt;
&lt;li&gt;No infrastructure monitoring&lt;/li&gt;
&lt;li&gt;Some advanced analytics features are cloud-only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free hosted: 25k spans/month, unlimited team members, 60-day retention&lt;/li&gt;
&lt;li&gt;Pro: $39/month for 100k spans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want comprehensive observability with automated prompt optimization and guardrails built in.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. Helicone
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; MIT | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://helicone.ai/" rel="noopener noreferrer"&gt;helicone.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Helicone takes a fundamentally different approach: it is a proxy-first observability platform. Rather than adding an SDK, you simply change your base URL to route traffic through Helicone — and it immediately logs every request, response, token count, cost, and error with zero code changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbimfxtlw3f4y4rmn4c8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbimfxtlw3f4y4rmn4c8m.png" alt="Helicone" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy-based setup — change one line of code (base URL), nothing else&lt;/li&gt;
&lt;li&gt;Works with 100+ models and any OpenAI-compatible endpoint&lt;/li&gt;
&lt;li&gt;Request caching to reduce latency and cost on repeated calls&lt;/li&gt;
&lt;li&gt;Intelligent request routing and automatic provider failover&lt;/li&gt;
&lt;li&gt;Rate limiting and usage controls to prevent runaway spend&lt;/li&gt;
&lt;li&gt;Cost tracking by model, user, and session&lt;/li&gt;
&lt;/ul&gt;
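&lt;p&gt;Cost tracking by model and user is fundamentally an aggregation over the request log a proxy sees. A minimal sketch, with made-up per-token prices (real pricing varies by provider and model):&lt;/p&gt;

```python
from collections import defaultdict

# Toy sketch of proxy-side cost accounting: aggregate spend by model
# and by user from logged requests. Prices here are invented for the
# example; real per-token pricing varies by provider and model.
PRICE_PER_1K_TOKENS = {"gpt-4o": 0.005, "gpt-4o-mini": 0.0006}

def cost_report(requests):
    by_model = defaultdict(float)
    by_user = defaultdict(float)
    for r in requests:
        cost = r["tokens"] / 1000 * PRICE_PER_1K_TOKENS[r["model"]]
        by_model[r["model"]] += cost
        by_user[r["user"]] += cost
    return dict(by_model), dict(by_user)

log = [
    {"model": "gpt-4o", "user": "u1", "tokens": 2000},
    {"model": "gpt-4o-mini", "user": "u1", "tokens": 10000},
    {"model": "gpt-4o", "user": "u2", "tokens": 1000},
]
by_model, by_user = cost_report(log)
```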

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fastest time-to-value — production observability in under 5 minutes&lt;/li&gt;
&lt;li&gt;No SDK to install or manage&lt;/li&gt;
&lt;li&gt;Caching and routing features go beyond pure observability&lt;/li&gt;
&lt;li&gt;MIT licensed and self-hostable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy architecture introduces a network hop&lt;/li&gt;
&lt;li&gt;Less suited for deep agent workflow tracing than Langfuse or Arize Phoenix&lt;/li&gt;
&lt;li&gt;No infrastructure monitoring&lt;/li&gt;
&lt;li&gt;Evaluation features are limited compared to dedicated eval platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hobby (free): 50k monthly logs&lt;/li&gt;
&lt;li&gt;Pro: $79/month&lt;/li&gt;
&lt;li&gt;Team: $799/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that need lightweight model-level observability and cost control with the absolute minimum setup friction.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Lunary
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://lunary.ai/" rel="noopener noreferrer"&gt;lunary.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lunary is a lightweight open source observability platform optimized for RAG pipelines and chatbot applications. It offers SDKs for JavaScript (Node.js, Deno, Vercel Edge, Cloudflare Workers) and Python, with a setup time of roughly two minutes. Its Radar feature automatically categorizes LLM responses based on pre-defined criteria, making it easy to audit outputs at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7v5b5e5czc3twnkwktf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7v5b5e5czc3twnkwktf.png" alt="Lunary" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specialized RAG tracing with embedding metrics and latency visualization&lt;/li&gt;
&lt;li&gt;Radar: rule-based categorization of LLM responses for downstream auditing&lt;/li&gt;
&lt;li&gt;SDKs for JavaScript environments including Vercel Edge and Cloudflare Workers&lt;/li&gt;
&lt;li&gt;Session-level tracing for chatbot conversations&lt;/li&gt;
&lt;li&gt;10k events/month free with 30-day retention&lt;/li&gt;
&lt;/ul&gt;
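&lt;p&gt;Radar-style categorization can be pictured as a set of rules applied to every response, with each response landing in the first bucket that matches. The rules below are invented for illustration:&lt;/p&gt;

```python
# Toy sketch of rule-based response categorization: bucket each LLM
# response by the first matching rule. Rules invented for illustration.
RULES = [
    ("refusal", ["cannot help", "not able to"]),
    ("apology", ["sorry", "apologize"]),
]

def categorize(response):
    lowered = response.lower()
    for label, phrases in RULES:
        if any(p in lowered for p in phrases):
            return label
    return "other"

labels = [categorize(r) for r in [
    "I cannot help with that request.",
    "Sorry, let me try again.",
    "The capital of France is Paris.",
]]
```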

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best JavaScript/TypeScript support of any tool on this list&lt;/li&gt;
&lt;li&gt;Lightweight and fast to set up — under 2 minutes&lt;/li&gt;
&lt;li&gt;Purpose-built for RAG and chatbot use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Narrower feature set than Langfuse or OpenObserve&lt;/li&gt;
&lt;li&gt;Some advanced features require Enterprise licensing&lt;/li&gt;
&lt;li&gt;Smaller community and ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free tier: 10k events/month, 30-day retention&lt;/li&gt;
&lt;li&gt;Enterprise: Custom (includes self-hosting)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; JavaScript-first teams building RAG pipelines or chatbot applications who need quick observability setup.&lt;/p&gt;




&lt;h3&gt;
  
  
  8. TruLens
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; MIT | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://www.trulens.org/" rel="noopener noreferrer"&gt;trulens.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TruLens takes a qualitative-first approach to LLM observability, built around structured feedback functions that evaluate LLM responses after each call. It is particularly strong for teams using LlamaIndex and LangChain who want systematic evaluation pipelines rather than traditional tracing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxix8ggqa72fngjv4b0pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxix8ggqa72fngjv4b0pl.png" alt="TruLens" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feedback functions that run automatically after each LLM call&lt;/li&gt;
&lt;li&gt;Pre-built evaluators for relevance, groundedness, and coherence&lt;/li&gt;
&lt;li&gt;RAG triad evaluation: answer relevance, context relevance, groundedness&lt;/li&gt;
&lt;li&gt;Deep integration with LlamaIndex and LangChain&lt;/li&gt;
&lt;li&gt;LLM-agnostic — supports any model as an evaluator&lt;/li&gt;
&lt;/ul&gt;
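&lt;p&gt;A feedback function takes an LLM input/output pair and returns a score. Real groundedness evaluators, TruLens's included, use an LLM judge; the word-overlap stand-in below only shows the shape of the idea:&lt;/p&gt;

```python
# Toy "feedback function" in the spirit of the RAG triad: score how
# much of the answer's vocabulary is grounded in retrieved context.
# A real groundedness evaluator uses an LLM judge; this word-overlap
# ratio is only a stand-in for the shape of the API.
def groundedness(answer, context):
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    overlap = answer_words.intersection(context_words)
    return len(overlap) / len(answer_words)

grounded = groundedness("paris is the capital", "the capital of france is paris")
ungrounded = groundedness("berlin is the capital", "sunny weather expected tomorrow")
```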

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best-in-class for structured, systematic evaluation pipelines&lt;/li&gt;
&lt;li&gt;RAG triad evaluation is a well-regarded methodology for RAG quality assessment&lt;/li&gt;
&lt;li&gt;MIT licensed with no restrictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python only — no JavaScript/TypeScript support&lt;/li&gt;
&lt;li&gt;Less focus on tracing and production monitoring&lt;/li&gt;
&lt;li&gt;Smaller community than Langfuse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free (MIT licensed)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Research teams and ML engineers who need rigorous, automated evaluation pipelines for RAG systems with Python-native tooling.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. PostHog LLM Analytics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GitHub Stars:&lt;/strong&gt; 32,100+ | &lt;strong&gt;License:&lt;/strong&gt; MIT | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://posthog.com/docs/ai-engineering" rel="noopener noreferrer"&gt;posthog.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PostHog bundles LLM observability alongside product analytics, session replay, feature flags, A/B testing, and error tracking. For teams who want to understand not just how their LLM performs technically but how users actually interact with it, PostHog is uniquely positioned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmb41flykagr9xwec52y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmb41flykagr9xwec52y.png" alt="PostHog LLM Analytics" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM generation capture with cost, latency, and usage metrics&lt;/li&gt;
&lt;li&gt;Combines LLM data with product analytics — funnels, retention, and user behaviour&lt;/li&gt;
&lt;li&gt;Session replay for AI interactions — watch exactly what users experienced&lt;/li&gt;
&lt;li&gt;A/B testing for prompts using the same experiment framework as product features&lt;/li&gt;
&lt;li&gt;Prompt management (beta) with version control&lt;/li&gt;
&lt;li&gt;100k LLM observability events/month on free tier&lt;/li&gt;
&lt;/ul&gt;
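&lt;p&gt;Prompt A/B testing relies on deterministic bucketing: the same user must always see the same variant. A hypothetical sketch of that assignment logic:&lt;/p&gt;

```python
import hashlib

# Toy sketch of deterministic experiment bucketing, as used for A/B
# testing prompts: hash the user id so each user always lands in the
# same variant. The bucketing details are invented for illustration.
def variant(user_id, experiment):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"

v1 = variant("user-42", "prompt-test")
v2 = variant("user-42", "prompt-test")  # same user, same bucket
```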

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only tool on this list that combines LLM observability with full product analytics&lt;/li&gt;
&lt;li&gt;Session replay for AI interactions is a uniquely powerful debugging tool&lt;/li&gt;
&lt;li&gt;Massive community (32k+ GitHub stars)&lt;/li&gt;
&lt;li&gt;Transparent, usage-based pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM-specific features (evaluation, RAG tracing) are less mature than dedicated tools&lt;/li&gt;
&lt;li&gt;No infrastructure monitoring&lt;/li&gt;
&lt;li&gt;Prompt management is still in beta&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free: 100k LLM events/month, 30-day retention&lt;/li&gt;
&lt;li&gt;Usage-based beyond that&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Product-led teams who want to combine LLM monitoring with user behaviour and product analytics in one platform.&lt;/p&gt;




&lt;h3&gt;
  
  
  10. Weave by Weights &amp;amp; Biases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0 | &lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://wandb.ai/site/weave" rel="noopener noreferrer"&gt;wandb.ai/site/weave&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Weave is the LLM observability product from Weights &amp;amp; Biases (W&amp;amp;B), extending W&amp;amp;B's ML experiment tracking into LLM application observability — covering tracing, evaluation, and dataset management in a unified interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw99ue1opxd9et0bey7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw99ue1opxd9et0bey7j.png" alt="Weave by Weights &amp;amp; Biases" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end tracing for LLM calls, chains, and agent workflows&lt;/li&gt;
&lt;li&gt;Dataset management with versioning for evaluation benchmarks&lt;/li&gt;
&lt;li&gt;Integration with W&amp;amp;B experiment tracking for model-level and application-level comparison&lt;/li&gt;
&lt;li&gt;Human annotation tools for labelling and review workflows&lt;/li&gt;
&lt;li&gt;Supports Python and JavaScript&lt;/li&gt;
&lt;li&gt;Model-agnostic — works with OpenAI, Anthropic, open source models, and custom endpoints&lt;/li&gt;
&lt;/ul&gt;
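&lt;p&gt;Dataset versioning for evaluation benchmarks can be approximated by content addressing: identical datasets share a version id, and any edit produces a new one. This is not the Weave API, just the underlying idea:&lt;/p&gt;

```python
import hashlib
import json

# Toy sketch of content-addressed dataset versioning: identical
# evaluation sets get identical version ids, edits produce new ones.
# This is NOT the Weave API, only the underlying idea.
def dataset_version(examples):
    blob = json.dumps(examples, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

v_a = dataset_version([{"q": "2+2?", "a": "4"}])
v_b = dataset_version([{"q": "2+2?", "a": "4"}])    # same content
v_c = dataset_version([{"q": "2+2?", "a": "four"}])  # edited
```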

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natural fit for teams already using W&amp;amp;B for model training and experiment tracking&lt;/li&gt;
&lt;li&gt;Strong dataset and evaluation management inherited from W&amp;amp;B's research-grade tooling&lt;/li&gt;
&lt;li&gt;Apache 2.0 license — commercially safe&lt;/li&gt;
&lt;li&gt;Bridges model development and production deployment in one workspace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less specialized for production LLM monitoring than Langfuse or OpenObserve&lt;/li&gt;
&lt;li&gt;Tightly coupled to the W&amp;amp;B ecosystem — less useful if you're not already a W&amp;amp;B user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free tier available via W&amp;amp;B&lt;/li&gt;
&lt;li&gt;Team and Enterprise plans: custom pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; ML research teams already invested in the W&amp;amp;B ecosystem who want to extend experiment tracking into production LLM observability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Self-Hosted&lt;/th&gt;
&lt;th&gt;Tracing&lt;/th&gt;
&lt;th&gt;Evaluation&lt;/th&gt;
&lt;th&gt;Prompt Mgmt&lt;/th&gt;
&lt;th&gt;Infra Monitoring&lt;/th&gt;
&lt;th&gt;RAG Support&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenObserve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AGPL-3.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Unified infra + LLM observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT (core)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Full-lifecycle LLM observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arize Phoenix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ELv2&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;RAG and agent debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenLLMetry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Vendor-neutral instrumentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Comet Opik&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Prompt optimization + observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Helicone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;Lightweight proxy-based monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lunary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;JavaScript RAG &amp;amp; chatbots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TruLens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Structured evaluation pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostHog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;LLM + product analytics combined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weave (W&amp;amp;B)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;ML research teams on W&amp;amp;B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;✅✅ = best in class, ✅ = strong support, ⚠️ = partial or in beta, ❌ = not available&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Choose the Right Tool
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Start with your deployment requirement
&lt;/h3&gt;

&lt;p&gt;If your organization requires data residency or strict compliance, every tool on this list supports self-hosting. For the simplest self-hosted path, &lt;strong&gt;OpenObserve&lt;/strong&gt; stands out — single binary deployment in under 2 minutes, covering both infrastructure and LLM telemetry. For pure LLM-specific self-hosting, &lt;strong&gt;Langfuse&lt;/strong&gt; via Docker Compose takes about 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Match the tool to your primary bottleneck
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If your main problem is...&lt;/th&gt;
&lt;th&gt;Best tool(s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unified infra + LLM observability in one place&lt;/td&gt;
&lt;td&gt;OpenObserve&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging agent and chain failures&lt;/td&gt;
&lt;td&gt;OpenObserve, Langfuse, Arize Phoenix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG pipeline quality&lt;/td&gt;
&lt;td&gt;Arize Phoenix, TruLens, Lunary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt quality and optimization&lt;/td&gt;
&lt;td&gt;Comet Opik, Langfuse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost and token tracking&lt;/td&gt;
&lt;td&gt;Helicone, Langfuse, OpenObserve&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage cost at scale&lt;/td&gt;
&lt;td&gt;OpenObserve (140x compression)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor-neutral instrumentation&lt;/td&gt;
&lt;td&gt;OpenLLMetry → OpenObserve as backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript/Node.js first&lt;/td&gt;
&lt;td&gt;Lunary, PostHog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product analytics + LLM&lt;/td&gt;
&lt;td&gt;PostHog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Consider your framework dependencies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain / LangGraph users:&lt;/strong&gt; Langfuse has the deepest native LLM-specific integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LlamaIndex users:&lt;/strong&gt; TruLens and Arize Phoenix have strong LlamaIndex support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI SDK / Anthropic SDK users:&lt;/strong&gt; All tools support this; Helicone is fastest to set up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom stacks / framework agnostic:&lt;/strong&gt; OpenLLMetry → OpenObserve is the safest, most future-proof combination&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Think about the evaluation maturity you need
&lt;/h3&gt;

&lt;p&gt;In early development, basic tracing and cost monitoring (Helicone, Lunary) may be enough. As you move to production, evaluation becomes critical. &lt;strong&gt;Langfuse&lt;/strong&gt; and &lt;strong&gt;Arize Phoenix&lt;/strong&gt; lead for comprehensive evaluation workflows; &lt;strong&gt;TruLens&lt;/strong&gt; leads for structured RAG evaluation methodology.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Factor in long-term lock-in risk
&lt;/h3&gt;

&lt;p&gt;Tools built on OpenTelemetry standards — particularly &lt;strong&gt;OpenLLMetry&lt;/strong&gt;, &lt;strong&gt;Arize Phoenix&lt;/strong&gt;, and &lt;strong&gt;OpenObserve&lt;/strong&gt; — give you the most flexibility to change components without re-instrumenting your application.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the best open source LLM observability tool in 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenObserve is our top pick for 2026 — the only open source platform covering both LLM observability and infrastructure monitoring in a single deployment. For LLM-specific evaluation and prompt management on top, &lt;strong&gt;Langfuse&lt;/strong&gt; is the strongest companion. For RAG-specific debugging, &lt;strong&gt;Arize Phoenix&lt;/strong&gt; leads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use these tools with any LLM provider?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. All tools on this list support major providers including OpenAI, Anthropic, Cohere, Azure OpenAI, AWS Bedrock, Vertex AI, and most open source model endpoints. &lt;strong&gt;OpenLLMetry&lt;/strong&gt; and &lt;strong&gt;Helicone&lt;/strong&gt; have the broadest provider coverage (100+ models).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the difference between LLM tracing and LLM evaluation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tracing records &lt;em&gt;what happened&lt;/em&gt; — prompts sent, responses received, latencies, token counts, tool calls. Evaluation assesses &lt;em&gt;whether what happened was good&lt;/em&gt; — was the response accurate, relevant, grounded in retrieved context, free of hallucinations?&lt;/p&gt;
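&lt;p&gt;As a toy illustration of that distinction (the field names and the exact-match check below are invented for this example, not any particular tool's API): a trace is a record of the call, while evaluation is a scoring step applied to it:&lt;/p&gt;

```python
# Toy illustration only: field names are invented, not from a specific tool.
trace = {
    "prompt": "What is the capital of France?",
    "response": "Paris",
    "latency_ms": 420,
    "tokens": {"input": 8, "output": 1},
}

def evaluate(trace, reference="Paris"):
    """Tracing recorded what happened; evaluation scores whether it was good."""
    return {"correct": trace["response"].strip() == reference}

print(evaluate(trace))
```

&lt;p&gt;Real evaluators score relevance, groundedness, or hallucination rather than exact matches, but the division of labor is the same.&lt;/p&gt;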

&lt;p&gt;&lt;strong&gt;Do I need a separate observability stack for infrastructure if I adopt one of these tools?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not if you choose &lt;strong&gt;OpenObserve&lt;/strong&gt;. It handles metrics, logs, distributed traces, and LLM telemetry in a single platform — replacing the need for separate tools like Prometheus, Loki, and Tempo. For all other tools on this list, you will need a separate infrastructure monitoring stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the easiest tool to set up?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; wins on LLM-specific setup speed — one line of code (change your base URL) and you have immediate production observability. &lt;strong&gt;OpenObserve&lt;/strong&gt; wins on full-stack setup speed — single binary deployment in under 2 minutes covering both LLM and infrastructure telemetry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much does LLM observability cost at scale?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;OpenObserve&lt;/strong&gt; stands out most clearly. Its Parquet-based 140x compression technology dramatically reduces the cost of storing LLM traces, prompt histories, and operational metrics at scale — critical as LLM application volumes grow.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://openobserve.ai/blog/llm-observability-tools/" rel="noopener noreferrer"&gt;openobserve.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>observability</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>Microservices Observability: Using Logs, Metrics, and Traces to Keep Distributed Systems Healthy</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:20:39 +0000</pubDate>
      <link>https://forem.com/simran_kumari_464546e0a3c/microservices-observability-using-logs-metrics-and-traces-to-keep-distributed-systems-healthy-56e6</link>
      <guid>https://forem.com/simran_kumari_464546e0a3c/microservices-observability-using-logs-metrics-and-traces-to-keep-distributed-systems-healthy-56e6</guid>
      <description>&lt;p&gt;Picture this: It's Black Friday and you're the lead developer at a high-traffic e-commerce platform. Orders start failing. Customer complaints flood in. Your team scrambles to find the root cause — but with dozens of microservices working together, it feels like searching for a needle in a haystack of needles.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of scenario that microservices &lt;strong&gt;observability&lt;/strong&gt; is designed to solve.&lt;/p&gt;

&lt;p&gt;In this guide, we'll break down the three pillars of observability, explore real-world implementation patterns, and look at how you can build a system that tells you &lt;em&gt;what's wrong&lt;/em&gt;, &lt;em&gt;where it's wrong&lt;/em&gt;, and &lt;em&gt;why&lt;/em&gt; — before your users start noticing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Microservices Observability?
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring tells you &lt;em&gt;that&lt;/em&gt; something broke. Observability tells you &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In a microservices architecture, observability means collecting rich telemetry data — logs, metrics, and traces — that allows you to ask arbitrary questions about your system's state without having to predict in advance what might go wrong.&lt;/p&gt;

&lt;p&gt;Think of an observable system as one that has a voice. Instead of failing silently, it tells you exactly what happened, when it happened, and why.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Pillars of Observability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Logs
&lt;/h3&gt;

&lt;p&gt;Logs are the detailed diary entries of your microservices. Whenever an event occurs — a user action, an internal process, an error — it gets recorded. They're essential for debugging and post-mortem analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best practice: use structured logging.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of plain-text messages, structured logs use consistent, queryable fields — timestamps, service names, error codes, user IDs. This makes it dramatically faster to isolate the specific entry you're looking for during an incident.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-11-29T14:32:11Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"payment-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Transaction timeout"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"u_83721"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4bf92f3577b34da6"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Metrics
&lt;/h3&gt;

&lt;p&gt;Metrics are quantitative measurements of your system's performance over time. They're typically numeric values — request rates, error rates, latency percentiles, CPU usage — that can be aggregated, trended, and alerted on.&lt;/p&gt;

&lt;p&gt;Metrics are your early warning system. A sudden spike in p99 latency or a drop in request throughput can trigger an alert before users even notice something is wrong.&lt;/p&gt;

&lt;p&gt;Key metrics to track per service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request rate&lt;/strong&gt; — how many requests per second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate&lt;/strong&gt; — percentage of failed requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — response time distributions (p50, p95, p99)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saturation&lt;/strong&gt; — how close to capacity the service is running&lt;/li&gt;
&lt;/ul&gt;
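&lt;p&gt;The latency percentiles above can be computed from raw response-time samples with nothing beyond Python's standard library; a minimal sketch:&lt;/p&gt;

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute p50/p95/p99 from a list of response times in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

samples = [120, 130, 125, 118, 140, 135, 2000, 128, 122, 131]
print(latency_percentiles(samples))
```

&lt;p&gt;In production you would query these from your metrics backend rather than compute them by hand, but the definitions are the same.&lt;/p&gt;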

&lt;h3&gt;
  
  
  3. Traces
&lt;/h3&gt;

&lt;p&gt;Traces are the most powerful pillar for distributed systems. They map the complete journey of a single request as it travels through multiple services, showing you the sequence of operations and how long each one takes.&lt;/p&gt;

&lt;p&gt;In a microservices environment, a single user action might touch 10+ services. Without traces, when one of them is slow or failing, you're guessing. With traces, you can follow that request step by step and pinpoint exactly where the bottleneck or failure occurred.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
 └─ API Gateway (2ms)
     └─ Order Service (45ms)
         ├─ Inventory Service (12ms)
         └─ Payment Service (820ms) ← Bottleneck here
             └─ Fraud Detection (790ms) ← Root cause
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why All Three Work Together
&lt;/h2&gt;

&lt;p&gt;Each pillar tells part of the story. None of them gives you the full picture alone.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Answers&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;Is something wrong?&lt;/td&gt;
&lt;td&gt;Doesn't tell you &lt;em&gt;why&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;What happened in detail?&lt;/td&gt;
&lt;td&gt;Hard to correlate across services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traces&lt;/td&gt;
&lt;td&gt;Where in the flow did it break?&lt;/td&gt;
&lt;td&gt;Doesn't capture system-wide trends&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The real power comes when they're &lt;strong&gt;correlated&lt;/strong&gt;. During an incident:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; show you a latency spike at 14:32&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt; from that window reveal timeout errors in the payment service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt; lead you directly to the fraud detection call that took 790ms&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without all three pieces, you might spend hours checking database connections, server resources, or network configs — chasing the wrong thing entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Implementation Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Distributed Tracing
&lt;/h3&gt;

&lt;p&gt;Implement trace context propagation across all your services using a standard like &lt;strong&gt;OpenTelemetry&lt;/strong&gt;. Every service should pass along a &lt;code&gt;trace_id&lt;/code&gt; and &lt;code&gt;span_id&lt;/code&gt; in its requests so you can reconstruct the full request path after the fact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transaction.amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Centralized Log Aggregation
&lt;/h3&gt;

&lt;p&gt;Instead of SSH-ing into individual containers to read logs, aggregate everything into a central platform. Tag logs with service name, environment, and crucially, the &lt;strong&gt;trace ID&lt;/strong&gt; — so you can jump from a trace directly to the relevant log lines.&lt;/p&gt;
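&lt;p&gt;A minimal sketch of what trace-tagged structured logging can look like with Python's standard &lt;code&gt;logging&lt;/code&gt; module (the field names here are illustrative, not a required schema):&lt;/p&gt;

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as a single JSON line carrying the trace ID."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

logger = logging.getLogger("payment")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# The trace ID arrives with the request and is attached to every log line
logger.error("Transaction timeout", extra={"trace_id": "4bf92f3577b34da6"})
```

&lt;p&gt;With every line carrying a &lt;code&gt;trace_id&lt;/code&gt;, your aggregation platform can link any log entry back to the trace it belongs to.&lt;/p&gt;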

&lt;h3&gt;
  
  
  Golden Signals Dashboards
&lt;/h3&gt;

&lt;p&gt;Focus your metrics dashboards around the &lt;strong&gt;four golden signals&lt;/strong&gt; (from Google's SRE book):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — how long requests take&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic&lt;/strong&gt; — how much demand the system is handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors&lt;/strong&gt; — rate of failed requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saturation&lt;/strong&gt; — how full the system is&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World Scenario: The Black Friday Incident
&lt;/h2&gt;

&lt;p&gt;Your e-commerce platform starts seeing order failures during a peak sales event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without observability:&lt;/strong&gt;&lt;br&gt;
You restart services, check CPU, look at database connections. 2 hours later, still no root cause. The incident drags on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With observability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics dashboard: response times spiking, error rate climbing since 14:30&lt;/li&gt;
&lt;li&gt;Logs: timeout errors concentrated in the payment service&lt;/li&gt;
&lt;li&gt;Traces: every failing order shows 800ms+ spent in the fraud detection service&lt;/li&gt;
&lt;li&gt;Root cause: a bug in fraud detection makes it extremely slow for high-value transactions — exactly the kind that flood in during flash sales&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total time to resolution: &lt;strong&gt;12 minutes.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started: Practical Advice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start small and scale gradually.&lt;/strong&gt; You don't need to instrument everything on day one. Pick your most critical service, add structured logging, expose a &lt;code&gt;/metrics&lt;/code&gt; endpoint, and add basic tracing. Learn from that before expanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use OpenTelemetry.&lt;/strong&gt; It's the vendor-neutral standard for instrumentation. Instrument once, and you can export to any backend. This avoids lock-in and makes it easy to swap or add tools later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlate with a common trace ID.&lt;/strong&gt; The most important thing to get right early is making sure your logs, metrics, and traces all share a common identifier. Without this, correlating signals during an incident requires manual timestamp-matching — painful and slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set meaningful alerts.&lt;/strong&gt; Don't alert on every metric. Focus on user-facing impact: elevated error rates, latency crossing SLO thresholds, or traffic anomalies that suggest something is off.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Effective microservices observability isn't a single tool or a one-time project — it's a continuous practice of collecting, correlating, and acting on telemetry data across your entire system.&lt;/p&gt;

&lt;p&gt;By integrating logs, metrics, and traces into a unified observability strategy, you shift from reactive firefighting to proactive system understanding. You spend less time guessing and more time building.&lt;/p&gt;

&lt;p&gt;The next time Black Friday rolls around, you'll be ready.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://openobserve.ai/blog/microservices-observability-logs-metrics-traces/" rel="noopener noreferrer"&gt;OpenObserve Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>observability</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Logs, Metrics, and Traces: What They Are and When to Use Each</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Mon, 02 Feb 2026 11:49:43 +0000</pubDate>
      <link>https://forem.com/simran_kumari_464546e0a3c/logs-metrics-and-traces-what-they-are-and-when-to-use-each-2pcp</link>
      <guid>https://forem.com/simran_kumari_464546e0a3c/logs-metrics-and-traces-what-they-are-and-when-to-use-each-2pcp</guid>
      <description>&lt;p&gt;It's 2 AM. You get a call: "The website is broken."&lt;/p&gt;

&lt;p&gt;You SSH into your server, run &lt;code&gt;top&lt;/code&gt; to check CPU, maybe &lt;code&gt;df -h&lt;/code&gt; for disk space. Everything looks... fine? You restart the application. It works again. But you're left wondering: &lt;strong&gt;what actually went wrong?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This happens because we don't have visibility into what's happening inside our systems. That's what observability solves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Pillars
&lt;/h2&gt;

&lt;p&gt;Observability data comes in three forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Numbers that change over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt;: Detailed records of specific events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt;: Maps of requests flowing through distributed systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbib4izddwrzc9y0nknd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbib4izddwrzc9y0nknd.png" alt="Logs, metrics and traces: The pillars of Observability" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each answers different questions. Let's break them down.&lt;/p&gt;




&lt;h2&gt;
  
  
  Metrics: Is Everything OK?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Metrics are numbers that change over time.&lt;/strong&gt; Think of them as your app's vital signs — temperature and pulse, measured continuously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Server health
cpu_usage_percent = 45
memory_usage_percent = 67
disk_usage_percent = 23

# Application health  
requests_per_minute = 120
response_time_ms = 250
failed_requests_percent = 0.8
active_users = 43
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Metrics tell you &lt;strong&gt;something is wrong&lt;/strong&gt; before users complain. If &lt;code&gt;failed_requests_percent&lt;/code&gt; jumps from 0.8% to 15%, you know there's a problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards&lt;/strong&gt;: Visualize trends over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting&lt;/strong&gt;: Get notified when thresholds are breached&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning&lt;/strong&gt;: Predict when you'll run out of resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO tracking&lt;/strong&gt;: Monitor service level objectives&lt;/li&gt;
&lt;/ul&gt;
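&lt;p&gt;At its core, threshold alerting is just a comparison against a metric; a toy sketch (the 5% threshold is an arbitrary example, not a recommendation):&lt;/p&gt;

```python
def should_alert(failed, total, threshold_pct=5.0):
    """Return True when the error rate meets or exceeds the alert threshold."""
    if total == 0:
        return False
    return (failed / total) * 100 >= threshold_pct

print(should_alert(failed=12, total=150))  # 8% error rate
```

&lt;p&gt;Real alerting systems add evaluation windows and deduplication on top, but the core decision is this simple.&lt;/p&gt;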

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Your response time metric shows requests normally taking 200ms are now taking 2000ms. You check and find the database connection pool is exhausted. Fixed before users notice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n2swx8emq294k4c2ce5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n2swx8emq294k4c2ce5.png" alt="Sample dashboard showing metrics over time" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics answer: "Is my system healthy?"&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Logs: What Exactly Happened?
&lt;/h2&gt;

&lt;p&gt;Metrics tell you &lt;strong&gt;that&lt;/strong&gt; something is wrong. Logs tell you &lt;strong&gt;what&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs are detailed records of specific events.&lt;/strong&gt; They're a diary of everything your app does.&lt;/p&gt;

&lt;p&gt;Instead of just knowing "more requests are failing," logs show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2024-08-08T14:30:15Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password
2024-08-08T14:30:16Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password  
2024-08-08T14:30:17Z ERROR [AuthService] Failed login attempt for user@email.com: account locked after 3 failed attempts
2024-08-08T14:30:45Z INFO [AuthService] Password reset requested for user@email.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the "failed requests spike" makes sense—a user forgot their password. Not a bug, just expected behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structure Your Logs
&lt;/h3&gt;

&lt;p&gt;Random text logs become useless at scale. Use structured logging (JSON):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-08-08T14:30:15Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AuthService"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Failed login attempt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"12345"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user@email.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"invalid_password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attempt_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Structured logs let you query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Show me all errors from AuthService in the last hour"&lt;/li&gt;
&lt;li&gt;"How many failed login attempts did user 12345 have today?"&lt;/li&gt;
&lt;/ul&gt;
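&lt;p&gt;Both of those queries become one-liners once logs are JSON. A small sketch using only the standard library, with the same field names as the example above:&lt;/p&gt;

```python
import json

log_lines = [
    '{"level": "ERROR", "service": "AuthService", "user_id": "12345", "message": "Failed login attempt"}',
    '{"level": "INFO", "service": "AuthService", "user_id": "12345", "message": "Password reset requested"}',
    '{"level": "ERROR", "service": "OrderService", "user_id": "777", "message": "Timeout"}',
]
records = [json.loads(line) for line in log_lines]

# "Show me all errors from AuthService"
auth_errors = [r for r in records
               if r["service"] == "AuthService" and r["level"] == "ERROR"]

# "How many failed login attempts did user 12345 have?"
failed_logins = sum(1 for r in records
                    if r["user_id"] == "12345" and "Failed login" in r["message"])

print(len(auth_errors), failed_logins)
```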

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67pa5f2g2p8ll118upkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67pa5f2g2p8ll118upkc.png" alt="Filtering logs by service and level" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs answer: "What exactly happened?"&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Traces: Where Did It Break?
&lt;/h2&gt;

&lt;p&gt;Your single-server app grows into microservices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Service&lt;/strong&gt;: Authentication and profiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product Service&lt;/strong&gt;: Product catalog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order Service&lt;/strong&gt;: Processes purchases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payment Service&lt;/strong&gt;: Handles transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A user reports: "I can't complete my purchase. The page just hangs."&lt;/p&gt;

&lt;p&gt;You check metrics—all services look healthy. You check logs in each service... but which service did the request even touch? How do you follow a single user's journey across multiple services?&lt;/p&gt;

&lt;h3&gt;
  
  
  Traces Connect the Dots
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;A trace shows the path of a single request through your distributed system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a user clicks "Buy Now":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Order Service receives the request and creates a &lt;strong&gt;trace ID&lt;/strong&gt; (e.g., "abc123")&lt;/li&gt;
&lt;li&gt;It records: "I'm processing order abc123, started at 10:30:15"&lt;/li&gt;
&lt;li&gt;When it calls User Service, it passes along that trace ID&lt;/li&gt;
&lt;li&gt;User Service records: "I'm verifying user for trace abc123"&lt;/li&gt;
&lt;li&gt;When User Service calls Payment Service, same trace ID&lt;/li&gt;
&lt;li&gt;Payment Service records: "I'm charging card for trace abc123"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each service leaves breadcrumbs connected by the same trace ID. The result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Purchase Request [1,200ms total]
├── Order Service: Process Order [50ms]
├── User Service: Verify User [100ms] ✓
├── Product Service: Check Inventory [150ms] ✓  
├── Payment Service: Charge Card [900ms] ⚠️
│   ├── Validate Card [100ms] ✓
│   ├── External Payment Gateway [750ms] ⚠️ 
│   └── Update Transaction [50ms] ✓
└── Order Service: Finalize Order [100ms] ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem is obvious: the external payment gateway is taking 750ms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohq6mdu5zqwgbyyuo219.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohq6mdu5zqwgbyyuo219.png" alt="Trace view showing request flow across services" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;
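&lt;p&gt;The propagation steps above can be sketched without any tracing library: the first service mints a trace ID, and every downstream call carries it along (the service names and the &lt;code&gt;X-Trace-Id&lt;/code&gt; header below are illustrative):&lt;/p&gt;

```python
import uuid

spans = []  # stand-in for a trace backend

def record(service, operation, trace_id):
    spans.append({"service": service, "op": operation, "trace_id": trace_id})

def payment_service(headers):
    record("payment-service", "charge_card", headers["X-Trace-Id"])

def user_service(headers):
    record("user-service", "verify_user", headers["X-Trace-Id"])
    payment_service(headers)  # forwards the same headers, same trace ID

def order_service():
    # The first service in the chain creates the trace ID
    headers = {"X-Trace-Id": uuid.uuid4().hex}
    record("order-service", "process_order", headers["X-Trace-Id"])
    user_service(headers)
    return headers["X-Trace-Id"]

trace_id = order_service()
# Every breadcrumb shares one trace ID, so the full path can be reconstructed
assert all(span["trace_id"] == trace_id for span in spans)
```

&lt;p&gt;In practice you would use OpenTelemetry's context propagation rather than hand-rolled headers, but the mechanism is exactly this.&lt;/p&gt;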

&lt;h3&gt;
  
  
  When to Use Traces
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging latency&lt;/strong&gt;: Find which service is slow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understanding dependencies&lt;/strong&gt;: See how services interact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause analysis&lt;/strong&gt;: Follow a failing request across the stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimization&lt;/strong&gt;: Identify bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traces answer: "Where in my distributed system did things go wrong?"&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Combining All Three: A Real Scenario
&lt;/h2&gt;

&lt;p&gt;Your e-commerce site is struggling during a Black Friday sale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Metrics&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Response times are spiking. Error rate is up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Logs&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Timeout errors in the Payment Service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Traces&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Payment requests take 10+ seconds, but only for orders over $500.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause&lt;/strong&gt;: The fraud detection system (called by Payment Service) has a bug that makes it extremely slow for high-value transactions.&lt;/p&gt;

&lt;p&gt;Without all three, you might have wasted hours checking database connections, server resources, or network issues.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference: When to Use What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is my system healthy?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What exactly happened?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Logs&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where did it break (distributed)?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Traces&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Should I alert on-call?&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; (thresholds)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Why did this specific request fail?&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Logs&lt;/strong&gt; + &lt;strong&gt;Traces&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which service is the bottleneck?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Traces&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Metric overload&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Don't track everything. Start with what matters to users: latency, error rate, throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unstructured logs&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;code&gt;console.log("error happened")&lt;/code&gt; is useless at scale. Add context, use JSON.&lt;/p&gt;
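&lt;p&gt;As a rough sketch of what structured logging looks like (Python's standard &lt;code&gt;logging&lt;/code&gt; module with a hand-rolled JSON formatter; the field names here are illustrative, not a requirement of any particular log backend):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json, logging, time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # One JSON object per line, so log processors can parse and index fields
        return json.dumps({
            "timestamp": time.time(),
            "level": record.levelname,
            "message": record.getMessage(),
            **getattr(record, "context", {}),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)

# Context fields make the event searchable and aggregatable
logger.error("payment timeout", extra={"context": {"order_id": "A-123", "amount_usd": 640}})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now "show me all payment timeouts for orders over $500" becomes a field filter instead of a regex over free text.&lt;/p&gt;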

&lt;p&gt;&lt;strong&gt;Tracing everything&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Sample your traces. Capturing 100% of them adds runtime overhead and balloons storage costs.&lt;/p&gt;
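&lt;p&gt;A common approach is head-based sampling keyed on the trace ID, so every span of a trace gets the same keep/drop decision. A minimal sketch of the idea (this mirrors what ratio samplers such as OpenTelemetry's &lt;code&gt;TraceIdRatioBased&lt;/code&gt; do; it is not a drop-in replacement for an SDK sampler):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random

def should_sample(trace_id: int, ratio: float = 0.1) -&gt; bool:
    # Compare the low 64 bits of the trace ID against a fixed bound:
    # deterministic per trace, and ~ratio of all traces are kept.
    bound = int(ratio * (1 &lt;&lt; 64))
    return (trace_id &amp; ((1 &lt;&lt; 64) - 1)) &lt; bound

kept = sum(should_sample(random.getrandbits(128)) for _ in range(10_000))
print(f"kept roughly {kept} of 10000 traces")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the decision is a pure function of the trace ID, every service in the call chain agrees on whether to keep a trace without any coordination.&lt;/p&gt;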

&lt;p&gt;&lt;strong&gt;Using them in isolation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Each pillar is useful alone. Together, they're powerful. Correlate metric spikes with log errors and trace timelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you're not sure where to begin:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with metrics&lt;/strong&gt; using the RED method: Rate (requests/sec), Errors (error rate), Duration (latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add structured logging&lt;/strong&gt; to your services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement tracing&lt;/strong&gt; when you have 2+ services communicating&lt;/li&gt;
&lt;/ol&gt;
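&lt;p&gt;To make the RED method concrete, here's a toy in-process tracker (illustrative only; in practice you'd use a metrics library such as &lt;code&gt;prometheus_client&lt;/code&gt; or an OpenTelemetry SDK rather than rolling your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class RedMetrics:
    def __init__(self):
        self.requests = 0      # Rate: total requests seen
        self.errors = 0        # Errors: failed requests
        self.durations = []    # Duration: per-request latency in seconds

    def observe(self, duration_s, ok):
        self.requests += 1
        if not ok:
            self.errors += 1
        self.durations.append(duration_s)

    def snapshot(self):
        n = self.requests
        return {
            "rate": n,
            "error_rate": self.errors / n if n else 0.0,
            "p50_latency_s": sorted(self.durations)[n // 2] if n else 0.0,
        }

red = RedMetrics()
for d, ok in [(0.05, True), (0.10, True), (0.90, False), (0.07, True)]:
    red.observe(d, ok)
print(red.snapshot())  # error_rate 0.25, p50 around 0.10s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;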

&lt;p&gt;Sign up for a &lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;14-day free OpenObserve Cloud trial&lt;/a&gt; and integrate your metrics, logs, and traces into one powerful platform to boost your operational efficiency and enable smarter, faster decision-making.&lt;/p&gt;




&lt;p&gt;What's your observability setup? Are you using all three pillars, or still figuring out where to start? Let me know in the comments 👇&lt;/p&gt;

</description>
      <category>observability</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>How to Convert Logs to Metrics: A Practical Guide with OpenObserve Pipelines</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Mon, 02 Feb 2026 11:43:23 +0000</pubDate>
      <link>https://forem.com/simran_kumari_464546e0a3c/how-to-convert-logs-to-metrics-a-practical-guide-with-openobserve-pipelines-582n</link>
      <guid>https://forem.com/simran_kumari_464546e0a3c/how-to-convert-logs-to-metrics-a-practical-guide-with-openobserve-pipelines-582n</guid>
      <description>&lt;p&gt;Most engineering teams start their observability journey with logs. They're easy to implement, they capture exactly what happened, and when something breaks, logs are usually the first place you look.&lt;/p&gt;

&lt;p&gt;But here's the thing: &lt;strong&gt;your logs already contain metrics&lt;/strong&gt;—timestamps, status codes, error flags, latency values. The problem isn't a lack of data. It's that you're asking &lt;em&gt;metric questions&lt;/em&gt; while relying entirely on &lt;em&gt;logs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this guide, I'll show you how to extract metrics from logs using scheduled pipelines, step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Convert Logs to Metrics?
&lt;/h2&gt;

&lt;p&gt;Before diving into the how, let's understand the why.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Logs&lt;/th&gt;
&lt;th&gt;Metrics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What they represent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Individual events&lt;/td&gt;
&lt;td&gt;Aggregated summaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Detail level&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per request/event&lt;/td&gt;
&lt;td&gt;High-level trends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cardinality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Debugging, root cause analysis&lt;/td&gt;
&lt;td&gt;Monitoring, alerting, dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expensive at scale&lt;/td&gt;
&lt;td&gt;Cheap and fast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When you try to use logs as a substitute for metrics, you pay the cost of high cardinality for questions that don't need that detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Building a dashboard showing "error rates over time" by scanning millions of log entries repeatedly? That's slow and expensive. Deriving a metric once per minute? That's fast and cheap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Pipeline Types
&lt;/h2&gt;

&lt;p&gt;In OpenObserve, pipelines fall into two categories:&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time Pipelines
&lt;/h3&gt;

&lt;p&gt;Operate on individual events as they arrive. Great for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normalizing fields&lt;/li&gt;
&lt;li&gt;Enriching records&lt;/li&gt;
&lt;li&gt;Dropping noisy data&lt;/li&gt;
&lt;li&gt;Routing events to different streams&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scheduled Pipelines
&lt;/h3&gt;

&lt;p&gt;Run at fixed intervals over defined time windows. Perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregating logs into metrics&lt;/li&gt;
&lt;li&gt;Computing summaries&lt;/li&gt;
&lt;li&gt;Generating time-series data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For logs-to-metrics conversion, &lt;strong&gt;scheduled pipelines&lt;/strong&gt; are what you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Logs-to-Metrics Flow
&lt;/h2&gt;

&lt;p&gt;Here's how the data flows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App Logs → Log Stream → Scheduled Pipeline (every 1min) → Metric Stream → Dashboards/Alerts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads logs from the previous time window&lt;/li&gt;
&lt;li&gt;Filters and aggregates them&lt;/li&gt;
&lt;li&gt;Writes results to a metric stream&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step-by-Step: Converting Kubernetes Logs to Metrics
&lt;/h2&gt;

&lt;p&gt;Let's build this with real Kubernetes logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;An OpenObserve instance (&lt;a href="https://cloud.openobserve.ai/" rel="noopener noreferrer"&gt;Cloud&lt;/a&gt; or &lt;a href="https://openobserve.ai/downloads/" rel="noopener noreferrer"&gt;Self-hosted&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Sample log data (or your own logs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Ingest Your Logs
&lt;/h3&gt;

&lt;p&gt;For this demo, grab some sample Kubernetes logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://zinc-public-data.s3.us-west-2.amazonaws.com/zinc-enl/sample-k8s-logs/k8slog_json.json.zip &lt;span class="nt"&gt;-o&lt;/span&gt; k8slog_json.json.zip
unzip k8slog_json.json.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These logs contain typical K8s fields like &lt;code&gt;_timestamp&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt; (HTTP status), &lt;code&gt;kubernetes_container_name&lt;/code&gt;, &lt;code&gt;kubernetes_labels_app&lt;/code&gt;, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Define Your Metrics
&lt;/h3&gt;

&lt;p&gt;Before writing pipeline code, decide what you want to measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;k8s_http_requests_total&lt;/code&gt;&lt;/strong&gt;: Total requests per app per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;k8s_http_errors_total&lt;/code&gt;&lt;/strong&gt;: Total 5xx errors per app per minute&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Create the Scheduled Pipeline
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Source Query for Request Count
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="s1"&gt;'k8s_http_requests_total'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"__name__"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'counter'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"__type__"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;kubernetes_labels_app&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;kubernetes_namespace_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;_timestamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;kubernetes_logs&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
  &lt;span class="n"&gt;kubernetes_labels_app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;kubernetes_namespace_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key fields explained:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;__name__&lt;/code&gt; → Metric name (required)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__type__&lt;/code&gt; → Metric type: &lt;code&gt;counter&lt;/code&gt; or &lt;code&gt;gauge&lt;/code&gt; (required)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;value&lt;/code&gt; → The actual metric value (required)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;app&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt; → Labels for filtering/grouping&lt;/li&gt;
&lt;/ul&gt;
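&lt;p&gt;Putting it together, a single record written to the metric stream would look something like this (the app name, count, and timestamp are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "__name__": "k8s_http_requests_total",
  "__type__": "counter",
  "value": 128,
  "app": "checkout",
  "namespace": "prod",
  "_timestamp": 1767312000000000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;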

&lt;h4&gt;
  
  
  Source Query for Error Count
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="s1"&gt;'k8s_http_errors_total'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'counter'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;__type__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;kubernetes_labels_app&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;kubernetes_namespace_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;_timestamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;kubernetes_logs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
  &lt;span class="n"&gt;kubernetes_labels_app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;kubernetes_namespace_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;WHERE code &amp;gt;= 500&lt;/code&gt; filter ensures we only count server errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Configure Pipeline Settings
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test the query&lt;/strong&gt; - Run it manually to verify output includes &lt;code&gt;__name__&lt;/code&gt;, &lt;code&gt;__type__&lt;/code&gt;, and &lt;code&gt;value&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set the interval&lt;/strong&gt; - Typically 1 minute for real-time metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define the destination&lt;/strong&gt; - Point to your metric stream&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect the nodes&lt;/strong&gt; and save&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 5: Verify Your Metrics
&lt;/h3&gt;

&lt;p&gt;After the pipeline runs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the destination metric stream&lt;/li&gt;
&lt;li&gt;Verify records contain expected metric names, types, values, and labels&lt;/li&gt;
&lt;li&gt;Build dashboards using the new metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Troubleshooting Common Errors
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;error in ingesting metrics missing __name__&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Your query output doesn't include the metric name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Ensure your SQL includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'metric_name'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"__name__"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;error in ingesting metrics missing __type__&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The metric type isn't being set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Add the type field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'counter'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"__type__"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;error in ingesting metrics missing value&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The numeric value is missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Include an aggregation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nv"&gt;"value"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;DerivedStream has reached max retries of 3&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The pipeline failed multiple times due to validation errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the pipeline config&lt;/li&gt;
&lt;li&gt;Run the source SQL manually&lt;/li&gt;
&lt;li&gt;Verify all required fields are present&lt;/li&gt;
&lt;li&gt;Save and wait for next scheduled run&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  No Metrics Produced (But No Errors)
&lt;/h3&gt;

&lt;p&gt;The pipeline runs but produces nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; No logs exist in the source stream for the time window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify logs are being ingested to the source stream&lt;/li&gt;
&lt;li&gt;Run the SQL query manually against recent data&lt;/li&gt;
&lt;li&gt;Check the time window matches when logs exist&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Debugging Pipeline Failures
&lt;/h2&gt;

&lt;p&gt;Enable usage reporting to track pipeline execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ZO_USAGE_REPORTING_ENABLED=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Error stream&lt;/strong&gt;: Detailed failure messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triggers stream&lt;/strong&gt;: Pipeline execution history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI indicators&lt;/strong&gt;: Visual failure signals with error messages&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Once your logs-to-metrics pipeline is running:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://openobserve.ai/docs/user-guide/dashboards/" rel="noopener noreferrer"&gt;Build dashboards&lt;/a&gt;&lt;/strong&gt; from your derived metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://openobserve.ai/docs/user-guide/alerts/" rel="noopener noreferrer"&gt;Set up alerts&lt;/a&gt;&lt;/strong&gt; on metric thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://openobserve.ai/blog/slo-based-alerting/" rel="noopener noreferrer"&gt;Create SLOs&lt;/a&gt;&lt;/strong&gt; using your new metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logs contain metric data&lt;/strong&gt;—you just need to extract it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled pipelines&lt;/strong&gt; bridge the gap between raw logs and aggregated metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three required fields&lt;/strong&gt;: &lt;code&gt;__name__&lt;/code&gt;, &lt;code&gt;__type__&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No new instrumentation needed&lt;/strong&gt;—work with data you already have&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: Faster dashboards, cheaper queries, better observability&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Have you implemented logs-to-metrics in your stack? What challenges did you face? Let me know in the comments! 👇&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on the &lt;a href="https://openobserve.ai/blog/logs-to-metrics/" rel="noopener noreferrer"&gt;OpenObserve blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Good read!</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Mon, 02 Feb 2026 04:42:58 +0000</pubDate>
      <link>https://forem.com/simran_kumari_464546e0a3c/good-read-4299</link>
      <guid>https://forem.com/simran_kumari_464546e0a3c/good-read-4299</guid>
      <description>&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/manas_sharma/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6" class="crayons-story__hidden-navigation-link"&gt;NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/manas_sharma" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3739096%2F64c63567-d504-47de-b304-1cd488cc2906.jpeg" alt="manas_sharma profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/manas_sharma" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Manas Sharma
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Manas Sharma
                
              
              &lt;div id="story-author-preview-content-3216286" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/manas_sharma" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3739096%2F64c63567-d504-47de-b304-1cd488cc2906.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Manas Sharma&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/manas_sharma/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Feb 1&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/manas_sharma/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6" id="article-link-3216286"&gt;
          NVIDIA GPU Monitoring: Catch Thermal Throttling Before It Costs You $50k/Year
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/monitoring"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;monitoring&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/gpu"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;gpu&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/observability"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;observability&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/manas_sharma/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;4&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/manas_sharma/nvidia-gpu-monitoring-with-dcgm-exporter-and-openobserve-complete-setup-guide-34k6#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            7 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;




</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>gpu</category>
      <category>observability</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Simran Kumari</dc:creator>
      <pubDate>Mon, 02 Feb 2026 04:42:33 +0000</pubDate>
      <link>https://forem.com/simran_kumari_464546e0a3c/-2eeg</link>
      <guid>https://forem.com/simran_kumari_464546e0a3c/-2eeg</guid>
      <description>&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/manas_sharma/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5" class="crayons-story__hidden-navigation-link"&gt;FastAPI + OpenTelemetry: Stop Debugging with grep (Use Distributed Tracing)&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/manas_sharma" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3739096%2F64c63567-d504-47de-b304-1cd488cc2906.jpeg" alt="manas_sharma profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/manas_sharma" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Manas Sharma
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Manas Sharma
                
              
              &lt;div id="story-author-preview-content-3219295" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/manas_sharma" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3739096%2F64c63567-d504-47de-b304-1cd488cc2906.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Manas Sharma&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/manas_sharma/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Feb 2&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/manas_sharma/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5" id="article-link-3219295"&gt;
          FastAPI + OpenTelemetry: Stop Debugging with grep (Use Distributed Tracing)
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/fastapi"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;fastapi&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opentelemetry"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opentelemetry&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/observability"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;observability&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/manas_sharma/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/raised-hands-74b2099fd66a39f2d7eed9305ee0f4553df0eb7b4f11b01b6b1b499973048fe5.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;3&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/manas_sharma/fastapi-opentelemetry-stop-debugging-with-grep-use-distributed-tracing-16m5#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            3 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;




</description>
      <category>fastapi</category>
      <category>python</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
