Solutions are increasingly built using microservices architecture, leading to complex distributed systems. Monitoring these systems becomes challenging due to the diversity of tools, protocols, and data formats.
This blog focuses on:
- Explaining the basics of OpenTelemetry, its role in observability, and the current state of observability in the industry.
- Explaining how to instrument code and identify when to use manual and automatic instrumentation.
- Discussing the OpenTelemetry Collector and Connector, which are responsible for processing and forwarding telemetry data.
What is OpenTelemetry?
OpenTelemetry addresses these challenges by providing a unified framework for collecting, processing, and exporting telemetry data, enabling you to gain deep insights into your apps’ behavior.
For this guide, consider a modern coffee shop app with the following microservices:
- Order Service: Handles customer orders.
- Payment Service: Processes payments.
- Inventory Service: Manages stock levels.
- Notification Service: Sends order confirmations.
Each service operates independently, possibly written in different languages and deployed across various environments. When a customer places an order, the request traverses multiple services, making it essential to have a comprehensive observability solution to monitor and troubleshoot the system effectively.
Key Benefits of OpenTelemetry:
- Unified Instrumentation: Instrument your code once and send telemetry data to multiple backends without re-instrumentation.
- Vendor-Neutral: Avoid vendor lock-in by using standard APIs and protocols; you can switch platforms without re-instrumenting your entire solution.
- Unified Telemetry: Combines tracing, logging, and metrics into a single framework, enabling correlation of all data and establishing an open standard for telemetry data. Linking these signals helps you make better decisions.
- Community-Driven: Benefit from a vibrant open-source community contributing to continuous improvements.
- Improved Correlation: Easily correlate data across different telemetry signals for better insights.
Three Pillars of Observability
Telemetry is collected via instrumentation and flows through a pipeline that enriches, batches, and stores it for later analysis. Most observability tooling revolves around three categories of telemetry: logs, metrics, and traces.
While they share architectural similarities, such as instrumentation, ingestion, storage, and visualization, each type presents unique challenges and is best suited to answering different kinds of questions.
Logs
Logs are immutable, timestamped records of discrete events. Each log entry typically contains a message and optional structured metadata. However, coming up with a standardized log format is no easy task, since different pieces of information are critical for different types of software.
You can also use logging agents and protocols to forward logs to a central location for efficient storage. For example, consider a user placing an order in a microservices-based coffee shop app. The order-service logs a line like:
{
  "timestamp": "2025-06-01T08:43:12Z",
  "level": "INFO",
  "service": "order-service",
  "message": "New order placed: latte",
  "order_id": "ORD-20250601-001"
}
Metrics
Metrics give you a high-level view of the current state of your system. A metric is a single numerical value derived by applying a statistical measure to a group of events.
In other words, metrics represent an aggregate. This is useful because their compact representation allows us to graph how a system changes over time.
Different Metric Types:
- Counters: Total number of orders placed
- Gauges: Current number of in-progress orders
- Histograms: Distribution of order preparation times
- Summaries: Quantiles of response times
Coffee Shop Example: The order-service emits a metric like the following:
orders_placed_total{beverage="latte"} 1560
A Prometheus dashboard may show a sharp spike in latte orders, suggesting a promotional campaign is working or an anomaly is occurring.
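As a rough sketch, the order-service could emit such a counter (and a preparation-time histogram) with .NET’s System.Diagnostics.Metrics API, which OpenTelemetry collects once the SDK registers the meter; the meter and instrument names here are illustrative assumptions:

using System.Diagnostics.Metrics;

// Hypothetical meter for the order-service; the SDK must register it, e.g. via AddMeter("CoffeeShop.Orders").
var meter = new Meter("CoffeeShop.Orders");

// Counter: monotonically increasing total of orders placed.
var ordersPlaced = meter.CreateCounter<long>("orders_placed_total");

// Histogram: distribution of order preparation times in milliseconds.
var prepTime = meter.CreateHistogram<double>("order_preparation_time_ms");

// Record a latte order and its preparation time.
ordersPlaced.Add(1, new KeyValuePair<string, object?>("beverage", "latte"));
prepTime.Record(3200, new KeyValuePair<string, object?>("beverage", "latte"));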
Traces
To understand the larger context in a distributed solution, you must identify other related events, such as the specific request or transaction that initiated a log entry and the sequence of services involved in processing that request across the system.
Traces visualize the full journey of a single request across services. A trace consists of multiple spans, each representing a step in the request’s lifecycle. This makes it possible to reconstruct the journey of requests in the system.
Coffee Shop Example: A user places an order. The request flows through UI -> order-service -> payment-service -> inventory-service.
Each service adds a span with trace and span IDs, allowing you to view:
- Total request duration
- Which service caused a delay
- Any failed steps in the chain
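To make spans concrete, here is a minimal, self-contained .NET sketch; the ActivityListener merely stands in for a configured OpenTelemetry SDK, and the source and span names are illustrative:

using System.Diagnostics;

// A listener stands in for the OpenTelemetry SDK so that StartActivity returns recorded spans.
ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = s => s.Name == "CoffeeShop",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllDataAndRecorded
});

var source = new ActivitySource("CoffeeShop");

// Parent span: the incoming order request handled by the order-service.
using var order = source.StartActivity("PlaceOrder");

// Child span: the downstream payment call; it shares the parent's TraceId.
using (var payment = source.StartActivity("ChargeCard"))
{
    Console.WriteLine($"trace_id: {payment?.TraceId}, span_id: {payment?.SpanId}, parent: {payment?.ParentSpanId}");
}

In a real deployment the OpenTelemetry SDK plays the listener’s role and exports these spans over OTLP.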
Problems with the Current Observability Approach
Logs, metrics, and traces typically live in separate systems, with different formats and tooling. This fragmentation forces you to jump between dashboards and correlate data manually. Even with shared metadata like timestamps or service names, stitching information together remains time-consuming and error-prone.
Coffee Shop Example: Imagine a spike in failed order-service requests. You check metrics and see a high error rate. You then switch to logs, scan for failures, and try to match logs with trace IDs. Without consistent context, root cause analysis becomes guesswork.
Lack of Built-in Instrumentation in Open Source Software
Many open source libraries expose hooks but do not include native telemetry support. You must build and maintain custom adapters.
Problems this causes:
- Version Compatibility: Library updates may break adapters.
- Telemetry Loss: Converting data between formats can degrade signal quality.
- Engineering Overhead: Teams spend time wiring telemetry instead of building features.
Coffee Shop Example: If the inventory-service uses a third-party stock manager with no OpenTelemetry support, you must manually instrument it or depend on its observability hooks.
What OpenTelemetry is NOT
OpenTelemetry simplifies telemetry collection and export, but it doesn’t offer end-to-end observability out of the box. It’s a toolkit, not a monitoring platform.
| Not OpenTelemetry’s Job | Description |
|---|---|
| Data storage | OpenTelemetry exports data; it doesn’t store it. You’ll need systems like SigNoz, Prometheus, or Elasticsearch. |
| Visualization | No dashboards or charts are included. Use tools like Grafana, Jaeger, or Datadog. |
| Alerting | OpenTelemetry doesn’t generate alerts. Integrate it with systems that support alert rules. |
| Monitoring out-of-the-box | It doesn’t auto-instrument everything or provide prebuilt dashboards. You must configure and integrate. |
| Performance optimization | It helps identify bottlenecks, but doesn’t tune your app. |
OpenTelemetry standardizes how you collect logs, metrics, and traces. It enables observability, but doesn’t deliver it on its own. You still need storage, visualization, alerting, and analysis platforms to complete the picture.
Signals in OpenTelemetry
OpenTelemetry organizes observability data into three core signals:
| Signal | Purpose |
|---|---|
| Traces | Capture the lifecycle and flow of a request across services. |
| Metrics | Measure system and app performance over time. |
| Logs | Record discrete events and state changes in the app. |
Each signal is independent but can be correlated to provide richer observability. OpenTelemetry’s architecture ensures signal consistency and interoperability across programming languages through its official OpenTelemetry Specification.
OpenTelemetry Specification Components
- Common Terminology: Ensures a consistent vocabulary across implementations.
- API Specification: Provides language-agnostic interfaces to generate telemetry (traces, metrics, logs). APIs are backend-agnostic and enable portable instrumentation. For more information, refer to the Tracing API, Metrics API, and OpenTelemetry Logging.
- SDK Specification: Defines how SDKs process, sample, and export telemetry, ensuring consistent behavior across languages. For more information, refer to the Tracing SDK, Metrics SDK, and Logs SDK.
- Semantic Conventions: Standardizes names and attributes for telemetry data (e.g., HTTP status codes, DB queries).
- OpenTelemetry Protocol (OTLP): Describes a vendor-neutral transport protocol for sending telemetry data.
Why separate API from SDK?
The API–SDK split improves modularity, portability, and vendor neutrality:
- Library safety: A shared library (e.g., database driver) can safely include only the API, avoiding heavy SDK dependencies and avoiding conflicts in user apps.
- Portability: You can ship apps with OpenTelemetry APIs baked in, and let platform teams decide which SDK/exporter to use later.
- Flexibility: You can write your own SDK or replace components (e.g., use a custom sampler or exporter).
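As an illustration of the library-safety point above, a hypothetical shared library in .NET can depend only on the API surface (System.Diagnostics.ActivitySource); its spans are no-ops until a host app wires up an SDK that subscribes to the source. The library and source names below are illustrative:

using System.Diagnostics;

// Hypothetical shared library: references only the API (ActivitySource), never the SDK.
public static class CoffeeGrinder
{
    private static readonly ActivitySource Source = new("CoffeeShop.Grinder");

    public static void Grind(string beans)
    {
        // No-op unless the host app configures an SDK (or another listener)
        // that subscribes to this source, e.g. AddSource("CoffeeShop.Grinder").
        using var activity = Source.StartActivity("GrindBeans");
        activity?.SetTag("beans.variety", beans);

        // ... grinding logic ...
    }
}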
OpenTelemetry API vs SDK
| Feature | OpenTelemetry API | OpenTelemetry SDK |
|---|---|---|
| Purpose | Defines interfaces to generate telemetry | Implements logic to process and export telemetry |
| Responsibility | Exposes functions to create spans, metrics, logs | Manages batching, sampling, context, and export |
| Language-specific? | Yes | Yes |
| Included by default? | Yes, lightweight | No, must be explicitly added and configured |
| Default behavior | No-op | Active when configured |
| Used by | App and library developers | DevOps, SREs, platform engineers |
| Stability | High | May evolve with backends and exporter needs |
| Customizable | No | Yes (exporters, processors, samplers) |
For example, consider the following scenarios:
| Scenario | Best choice |
|---|---|
| Open-source library with tracing support | API only (lightweight, no deps) |
| Production microservice exporting to Grafana | API + SDK + OTLP Exporter |
| CLI tool needing optional debug tracing | API (enabled conditionally with SDK) |
How to Instrument Code with OpenTelemetry
OpenTelemetry supports multiple instrumentation approaches to capture telemetry from apps. Understanding these methods helps you choose the right approach based on your app’s complexity, development stage, and observability goals.
OpenTelemetry classifies instrumentation into three categories, often overlapping in practice:
| Category | Effort | Control | Customization | Code Changes |
|---|---|---|---|---|
| Automatic Instrumentation (Zero-Code) | Low | Limited | Minimal | None |
| Instrumentation Libraries | Moderate | Medium | Moderate | Minimal to moderate |
| Manual Instrumentation (Fully Code-Based) | High | Full | Full | Extensive |
OpenTelemetry provides three ways to capture telemetry from your app:
1. Automatic Instrumentation
Auto-instrumentation in .NET 8 is available via the OpenTelemetry .NET Auto-Instrumentation Agent, which instruments common libraries like ASP.NET Core, HttpClient, and SQL clients at runtime.
Ideal use case: Use this to quickly add observability to .NET services without modifying source code.
Example: orders-service (.NET 8, ASP.NET Core)
- Download and install the auto-instrumentation binaries from the .NET Auto-Instrumentation GitHub repository
- Run the app with the auto-instrumentation profiler enabled:
set OTEL_SERVICE_NAME=orders-service
set OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
set CORECLR_ENABLE_PROFILING=1
set CORECLR_PROFILER_PATH=C:\otel-dotnet-auto\OpenTelemetry.AutoInstrumentation.Native.dll
dotnet run
What it captures:
- HTTP requests and responses
- Outgoing HTTP/gRPC calls
- SQL queries via ADO.NET
Pros:
- No code changes required
- Fast onboarding
- Works well for ASP.NET Core, Entity Framework, and HttpClient
Cons:
- Limited to supported libraries
- Less control over span names and metadata
2. Library-Based Instrumentation
Library-based instrumentation uses the OpenTelemetry SDK and prebuilt instrumentations like AddAspNetCoreInstrumentation.
Ideal use case: You want to customize configuration and capture high-value signals without full manual control.
Example: menu-service (.NET 8, ASP.NET Core)
Install NuGet packages:
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
Configure in Program.cs:
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithTracing(tracerProviderBuilder =>
    {
        tracerProviderBuilder
            .SetResourceBuilder(
                ResourceBuilder.CreateDefault()
                    .AddService("menu-service"))
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddOtlpExporter(otlp =>
            {
                otlp.Endpoint = new Uri("http://otel-collector:4317");
            });
    });

var app = builder.Build();

app.MapGet("/", () => "Hello from Menu Service");

app.Run();
What it captures:
- Inbound ASP.NET Core request spans
- Outbound calls (HttpClient, gRPC)
- Custom span and resource metadata
Pros
- Easy to configure
- Integrates well with DI and hosting model
- Supports enrichment and filtering
Cons
- Requires adding code/configuration
- Less flexible than full manual instrumentation
3. Manual Instrumentation
Manual instrumentation lets you define custom spans for critical business logic (e.g., awarding loyalty points or calculating discounts).
Ideal use case: You need to trace domain-specific logic not covered by auto or library-based methods.
Example: loyalty-service (.NET 8 Worker Service)
Install packages:
dotnet add package OpenTelemetry
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
Configure tracing in Program.cs:
using System.Diagnostics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithTracing(tracerProviderBuilder =>
    {
        tracerProviderBuilder
            .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("loyalty-service"))
            .AddSource("LoyaltyService")
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri("http://otel-collector:4317");
            });
    });

var app = builder.Build();

// Start the host so the OpenTelemetry tracer provider is created and listening
app.Start();

// Start a custom span manually
var source = new ActivitySource("LoyaltyService");
using (var activity = source.StartActivity("AwardLoyaltyPoints", ActivityKind.Internal))
{
    activity?.SetTag("customer.id", "cust-123");
    activity?.SetTag("points.awarded", 20);

    // Simulate business logic
    Console.WriteLine("Loyalty points awarded.");
}

// Stop the host; shutting down flushes pending spans to the exporter
await app.StopAsync();
What it captures:
- Custom spans for logic like point calculations
- Rich metadata (tags)
- Correlation with other telemetry (metrics/logs)
Pros:
- Full control over telemetry
- Capture domain-specific operations
- High value for debugging or performance tuning
Cons:
- Requires development effort
- Must manage span lifecycle correctly
- Potential for inconsistent usage without guidelines
Overlaps and Clarifications
- Instrumentation libraries sometimes provide automatic instrumentation after import, blurring the line between zero-code and code-based.
- Under the hood, all approaches use some form of libraries.
- Zero-code is broad and quick; libraries add customization; manual is full control.
Recommended Approach and Strategy
- Start with automatic instrumentation to gain immediate insight with minimal effort.
- Add instrumentation libraries where you need more coverage or framework-specific tracing.
- Use manual instrumentation for critical business logic or custom metrics requiring fine-grained control.
Why use the OpenTelemetry Collector?
The OpenTelemetry Collector is a vendor-agnostic, standalone service that simplifies telemetry management in production. It decouples telemetry generation from ingestion and export.
The Collector provides three core capabilities:
| Function | Description |
|---|---|
| Receive | Accepts telemetry from apps, agents, or other Collectors via OTLP or other supported protocols. |
| Process | Filters, enriches, transforms, batches, or samples telemetry data. |
| Export | Sends processed data to one or more observability backends. |
Key benefits
| Without Collector | With Collector |
|---|---|
| Apps must export data directly to each backend | Central point of control for all telemetry |
| Risk of tight coupling to backend protocols | Decouples app logic from backend details |
| Difficult to enforce consistent processing | Apply transformations consistently |
| No central routing or batching | Route and batch data efficiently |
Understanding OpenTelemetry Protocol (OTLP)
OTLP is the native telemetry transport used across OpenTelemetry. It standardizes how telemetry is serialized, transmitted, and received.
Key benefits:
- Unified: Handles traces, metrics, and logs in one format.
- Vendor-neutral: Reduces backend lock-in and removes custom exporters.
- Efficient: Uses gRPC and Protobuf for high-performance streaming.
- Extensible: Schema evolves without breaking compatibility.
- Integrated: Collector and most observability tools support OTLP out of the box.
OTLP transport options
| Transport | Encoding | Use case |
|---|---|---|
| gRPC | Protobuf | Default for performance and bi-directional streaming |
| HTTP/1.1 | JSON | Debugging, human-readable payloads |
| HTTP/2 | Protobuf | Efficient, firewall-friendly alternative to gRPC |
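In .NET, for example, the transport is a switch on the OTLP exporter options. This is a minimal sketch assuming the Collector endpoints used elsewhere in this post and an illustrative source name:

using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Trace;

// Default transport: gRPC + Protobuf on port 4317.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("CoffeeShop")
    .AddOtlpExporter(options =>
    {
        options.Protocol = OtlpExportProtocol.Grpc;
        options.Endpoint = new Uri("http://otel-collector:4317");

        // Firewall-friendly alternative: HTTP + Protobuf on port 4318.
        // options.Protocol = OtlpExportProtocol.HttpProtobuf;
        // options.Endpoint = new Uri("http://otel-collector:4318/v1/traces");
    })
    .Build();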
Example: Pre-OTLP vs OTLP
| Before OTLP | With OTLP |
|---|---|
| Prometheus exporter, Zipkin exporter, Fluentd plugin | One OTLP exporter and one Collector instance |
| Multiple exporters in each service | Centralized, simplified telemetry pipeline |
In OpenTelemetry, logs are a critical signal for observability. Any data that isn’t a trace or metric is categorized as a log. Events, for instance, are specialized log entries.
Unlike traces and metrics, which OpenTelemetry implements via dedicated APIs and SDKs, logging is designed to integrate with existing logging frameworks in various programming languages. Instead of requiring a brand-new logging API, OpenTelemetry provides a Logs Bridge API that links traditional logging systems with telemetry signals such as traces and metrics.
How Logging Works in OpenTelemetry
You instrument logging using the Logs Bridge API, which connects popular logging frameworks (like Serilog, ILogger, or log4net in .NET) to OpenTelemetry’s pipeline.
Key Components
- LoggerProvider: Factory for creating loggers.
- Logger: Used to create log entries (LogRecord).
- LogRecord: Represents a single log entry with metadata.
- LogRecordExporter: Sends logs to destinations like the OpenTelemetry Collector.
- LogRecordProcessor: Processes logs before they’re exported.
LogRecord Structure
A LogRecord typically includes:
- timestamp: When the log occurred.
- trace_id, span_id: Links to a trace/span for correlation.
- severity_text: e.g., INFO, WARNING, ERROR.
- body: The log message or structured content.
- attributes: Custom metadata (e.g., user.id, http.method).
Example Use Case: The coffee app has a /get_coffee endpoint. When a coffee request fails due to a missing ID, the app logs this event.
logger.error("Missing coffee ID", extra={"http.status_code": 400, "coffee_id": None})
This log entry can be linked to the trace of the request, helping correlate the failure with upstream service calls and backend metrics.
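As a hedged .NET sketch of the bridge: routing ILogger through OpenTelemetry makes each LogRecord carry the active trace and span IDs automatically. The service name, endpoint, and /get_coffee handler below are illustrative assumptions, and the OpenTelemetry and OTLP exporter packages are assumed to be installed:

using OpenTelemetry.Logs;
using OpenTelemetry.Resources;

var builder = WebApplication.CreateBuilder(args);

// Bridge ILogger to the OpenTelemetry logs pipeline.
builder.Logging.AddOpenTelemetry(options =>
{
    options.SetResourceBuilder(
        ResourceBuilder.CreateDefault().AddService("order-service"));
    options.IncludeFormattedMessage = true;
    options.AddOtlpExporter(otlp =>
    {
        otlp.Endpoint = new Uri("http://otel-collector:4317");
    });
});

var app = builder.Build();

app.MapGet("/get_coffee", (string? coffeeId, ILogger<Program> logger) =>
{
    if (coffeeId is null)
    {
        // Emitted inside the request's Activity, so the LogRecord is correlated with the trace.
        logger.LogError("Missing coffee ID, returning {StatusCode}", 400);
        return Results.BadRequest("Missing coffee ID");
    }
    return Results.Ok($"Brewing coffee {coffeeId}");
});

app.Run();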
Collector Configuration
The OpenTelemetry Collector decouples telemetry generation from backend concerns. It processes logs, traces, and metrics independently.
Collector Pipeline Example
receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch: {}
exporters:
  logging:
    loglevel: debug
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
Collector Deployment Topologies
The OpenTelemetry Collector supports multiple deployment models, allowing you to tailor observability pipelines based on your architecture and scalability needs. Each topology serves different use cases—from tightly coupled microservices to centralized processing in large-scale environments.
Sidecar Deployment: The OpenTelemetry Collector runs as a sidecar alongside each application instance. This setup is common in containerized environments like Kubernetes, where the Collector is injected into each Pod.
Advantages:
- Low latency: The Collector runs on the same host or Pod, reducing network overhead for exporting telemetry data.
- Isolation: Each service has a dedicated Collector instance, ensuring telemetry data stays service-specific and avoids cross-contamination.
- Simplified trace correlation: Local logs, traces, and metrics can be more easily linked.
Ideal Use Case: Microservices architectures where services operate independently and require individual telemetry pipelines.
Node Agent Deployment: A single Collector instance runs per host or node. This is typically implemented as a Kubernetes DaemonSet or a similar system service in virtual machine environments.
Advantages:
- Centralized control per node: One Collector handles telemetry for all services on the same node.
- Resource-efficient: Fewer Collector instances are required compared to the sidecar model.
- System metrics access: Can collect host-level metrics (CPU, memory, disk, etc.) in addition to application telemetry.
Ideal Use Case:
- Suitable for clusters with many lightweight services that share node resources.
- Often used to monitor node-level infrastructure and runtime metrics alongside service-level data.
Standalone or Gateway Deployment: The Collector runs as a dedicated service, often behind a load balancer. Applications send telemetry data remotely to this central Collector (typically over OTLP).
Advantages:
- Scalability: A centralized Collector cluster can scale independently from application workloads.
- Simplified configuration management: Telemetry pipelines and transformations are managed in one place.
- Decoupling from application logic: Developers don’t need to worry about backend changes or exporter configurations.
Ideal Use Case:
- Best suited for large-scale systems with high telemetry volume.
- Useful for teams that want to offload all processing from applications and maintain a consistent observability architecture.
Benefits of OpenTelemetry Collectors:
- Separation of concerns: Developers emit logs; operators manage pipelines.
- Centralized management: All configuration is in one place.
- Resource efficiency: Offloads processing from app.
- No redeployments needed: Change pipelines without touching app code.
SigNoz with OpenTelemetry
SigNoz is a powerful observability platform built specifically for OpenTelemetry. It provides a seamless experience for collecting, storing, visualizing, and querying telemetry data, without vendor lock-in.
With OpenTelemetry, you collect signals (logs, metrics, and traces) from the coffee shop services. These signals are sent to the OpenTelemetry Collector, which processes and forwards them to SigNoz.
In our coffee shop microservices example, SigNoz plays the role of the observability backend, giving your team full visibility into traces, metrics, and logs generated by the app. Here’s how SigNoz helps the coffee shop:
- Traces: Visualize how an order moves through the system, from frontend-service to payment-service and inventory-service. Identify latency bottlenecks or failed calls.
- Metrics: Monitor key service-level indicators like espresso_orders_per_minute, latency, and error_rate without writing custom dashboards.
- Logs: Correlate logs with trace IDs and span IDs to troubleshoot order failures (e.g., inventory out-of-stock or payment declined).
For more information, refer to: