<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jade Lassery</title>
    <description>The latest articles on Forem by Jade Lassery (@jadelassery).</description>
    <link>https://forem.com/jadelassery</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1191037%2F0b692db7-ebe6-4de3-a884-06c85cda976a.jpeg</url>
      <title>Forem: Jade Lassery</title>
      <link>https://forem.com/jadelassery</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jadelassery"/>
    <language>en</language>
    <item>
      <title>From Zero to Observability: Your first steps sending OpenTelemetry data to an Observability backend</title>
      <dc:creator>Jade Lassery</dc:creator>
      <pubDate>Tue, 10 Dec 2024 16:51:52 +0000</pubDate>
      <link>https://forem.com/jadelassery/from-zero-to-observability-your-first-steps-sending-opentelemetry-data-to-an-observability-backend-29l7</link>
      <guid>https://forem.com/jadelassery/from-zero-to-observability-your-first-steps-sending-opentelemetry-data-to-an-observability-backend-29l7</guid>
      <description>&lt;p&gt;In formal terms, OpenTelemetry 🔭 is an open source framework used for instrumenting, generating, collecting, and exporting telemetry data for applications, services, and infrastructure. It provides vendor-neutral tools, SDKs and APIs for generating, collecting, and exporting telemetry data such as traces, metrics, and logs to any observability backend, including both open source and commercial tools&lt;/p&gt;

&lt;p&gt;While some concepts might seem straightforward to experienced engineers, I believe it’s important to share ideas in a way that’s inclusive and approachable. With that in mind, think of OpenTelemetry (a.k.a. OTel) as a universal translator for data from various applications and systems. Imagine you’re managing a group of machines or software programs, each speaking its own language. Clearly you need to understand what they’re saying to monitor their performance and spot issues. &lt;/p&gt;

&lt;p&gt;This is where OTel steps in to gather and standardize this data—things like error logs or performance metrics—and organizes it so you can send this data to a “central location” or backend for analysis.  OTel transforms raw information into something clear and actionable, making it easier for users to gain deep visibility into their workloads, helping to observe, monitor, troubleshoot, and optimize software systems. &lt;/p&gt;

&lt;h2&gt;
  
  
  What to expect in this guide
&lt;/h2&gt;

&lt;p&gt;This hands-on article will guide you through starting your observability journey: sending logs, metrics and traces from Kubernetes-deployed applications to an observability backend/vendor using OTel. Whether you're a first-time user or an experienced engineer seeking a fast, hands-on setup, this is your chance to enhance your OTel and Kubernetes observability skills. &lt;/p&gt;

&lt;p&gt;With OTel growing its contributor base and ranking as the second &lt;a href="https://www.cncf.io/reports/cncf-annual-report-2023/#projects:~:text=CNCF%20PROJECT%20VELOCITY" rel="noopener noreferrer"&gt;highest velocity project&lt;/a&gt; in the CNCF ecosystem, there's never been a better time to dive in and explore its potential for optimizing observability.&lt;/p&gt;

&lt;p&gt;In this guide we’ll use the &lt;a href="https://opentelemetry.io/docs/demo/" rel="noopener noreferrer"&gt;OpenTelemetry Demo App&lt;/a&gt; and the &lt;a href="https://logz.io/welcome-back-page/" rel="noopener noreferrer"&gt;Logz.io&lt;/a&gt; exporter (I chose Logz.io as my observability backend, but you can choose any that supports OTel). It’s not mandatory to use the OTel Demo, but it’s a nice starting point if you don’t have a real-world implementation or if it’s your first time trying Logz.io and OTel. &lt;br&gt;
The OpenTelemetry Demo includes microservices written in multiple programming languages, communicating over gRPC and HTTP, plus a load generator that uses Locust to simulate user traffic automatically, eliminating the need to create scenarios manually. You can check the Demo architecture &lt;a href="https://opentelemetry.io/docs/demo/architecture/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://logz.io/freetrial/" rel="noopener noreferrer"&gt;Logz.io account&lt;/a&gt; (I choose this one to be my observability backend)&lt;/li&gt;
&lt;li&gt;Any Kubernetes cluster 1.24+ with &lt;a href="https://kubernetes.io/docs/tasks/tools/" rel="noopener noreferrer"&gt;kubectl&lt;/a&gt; configured (for this guide I’m using EKS, but &lt;a href="https://minikube.sigs.k8s.io/docs/start/?arch=%2Fmacos%2Farm64%2Fstable%2Fbinary+download" rel="noopener noreferrer"&gt;Minikube&lt;/a&gt;/&lt;a href="https://kind.sigs.k8s.io/" rel="noopener noreferrer"&gt;Kind&lt;/a&gt; is also welcome) &lt;/li&gt;
&lt;li&gt;6 GB of free RAM for the application&lt;/li&gt;
&lt;li&gt;Helm 3.14+ installation (for Helm installation method only)&lt;/li&gt;
&lt;li&gt;OpenTelemetry Collector (for this guide, I’m using the official &lt;a href="https://opentelemetry.io/docs/demo/" rel="noopener noreferrer"&gt;OpenTelemetry Demo for Kubernetes&lt;/a&gt;, which already provides the Collector)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  OpenTelemetry Core Components - quick explanation:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Instrumentation libraries:&lt;/strong&gt; Tools and SDKs integrated into applications to automatically or manually generate telemetry data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Collector:&lt;/strong&gt; Vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exporters:&lt;/strong&gt; Exporters take the processed data and send it to your chosen observability platform, such as Logz.io, Prometheus, Jaeger…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context for this guide:&lt;/strong&gt; The OTel Demo App will handle instrumentation, and the OTel Collector (which comes by default when deploying the OTel Demo Helm chart) will send telemetry data to Logz.io using the Logz.io exporter.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Demo Application → OTel SDK → OTel Collector with Logz.io Exporter → Logz.io Backend&lt;/code&gt;&lt;/p&gt;
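
&lt;p&gt;As a minimal sketch of that flow (illustrative only; the full Helm values we’ll apply later are more complete), the Collector side looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;receivers:
  otlp:                 # receives OTLP data emitted by the app's OTel SDKs
    protocols:
      grpc:
      http:

exporters:
  logzio/traces:        # swap in any OTel-supported backend exporter here
    account_token: "YOUR-TRACES-SHIPPING-TOKEN"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logzio/traces]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;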

&lt;p&gt;Now let’s see how it works in practical terms…&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying the OTel Demo App:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add the OpenTelemetry Helm chart repository:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Deploy the app (in my case I deployed with the release name my-otel-demo):
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install my-otel-demo open-thelemetry/opentelemetry-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Verify that app pods are running:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
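
&lt;p&gt;If you prefer a single command that blocks until the demo is up (a convenience, not part of the official chart instructions), &lt;code&gt;kubectl wait&lt;/code&gt; can do the polling for you:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Wait up to 5 minutes for every pod in the current namespace to become Ready
kubectl wait pod --all --for=condition=Ready --timeout=300s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;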

&lt;h2&gt;
  
  
  Accessing the OTel app:
&lt;/h2&gt;

&lt;p&gt;After you’ve deployed the Helm chart, the Demo application’s services need to be exposed outside of the Kubernetes cluster so you can use and navigate them. You can easily expose the services to your local system using the &lt;code&gt;kubectl port-forward&lt;/code&gt; command or by configuring service types (i.e. LoadBalancer), optionally with &lt;a href="https://opentelemetry.io/docs/demo/kubernetes-deployment/#:~:text=Expose%20Demo%20components%20using%20service%20or%20ingress%20configurations" rel="noopener noreferrer"&gt;deployed ingress resources&lt;/a&gt;.&lt;br&gt;
The easiest way to expose services is with &lt;code&gt;kubectl port-forward&lt;/code&gt;, which I’m using in this guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl port-forward svc/my-otel-demo-frontendproxy 8080:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the frontendproxy port-forward set up, you can access:&lt;br&gt;
Web store: &lt;a href="http://localhost:8080/" rel="noopener noreferrer"&gt;http://localhost:8080/&lt;/a&gt;&lt;br&gt;
Grafana: &lt;a href="http://localhost:8080/grafana/" rel="noopener noreferrer"&gt;http://localhost:8080/grafana/&lt;/a&gt;&lt;br&gt;
Load Generator UI: &lt;a href="http://localhost:8080/loadgen/" rel="noopener noreferrer"&gt;http://localhost:8080/loadgen/&lt;/a&gt;&lt;br&gt;
Jaeger UI: &lt;a href="http://localhost:8080/jaeger/ui/" rel="noopener noreferrer"&gt;http://localhost:8080/jaeger/ui/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Bringing your own backend:
&lt;/h2&gt;

&lt;p&gt;Now it’s time to configure the OTel Collector for Logz.io, using the Logz.io exporter and some additional Logz.io parameters. This will allow us to start sending telemetry from the OTel App to Logz.io. &lt;/p&gt;

&lt;p&gt;The OpenTelemetry Collector’s configuration is exposed in the Helm chart that we deployed in the previous steps. Any additions you make will be merged into the default configuration, and you can choose any backend you like; that’s the main idea of using OTel: &lt;strong&gt;vendor-neutrality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a configuration file named &lt;code&gt;my-values-file.yaml&lt;/code&gt; with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opentelemetry-collector:
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
          http:
            endpoint: "0.0.0.0:4318"

    exporters:
      logzio/logs:
        account_token: "YOUR-LOGS-SHIPPING-TOKEN"
        region: "your-region-code"
        headers:
          user-agent: logzio-opentelemetry-logs
      prometheusremotewrite:
        endpoint: https://listener.logz.io:8053
        headers:
          Authorization: "Bearer YOUR-METRICS-SHIPPING-TOKEN"
          user-agent: logzio-opentelemetry-metrics
        target_info:
          enabled: false
      logzio/traces:
        account_token: "YOUR-TRACES-SHIPPING-TOKEN"
        region: "your-region-code"
        headers:
          user-agent: logzio-opentelemetry-traces
      prometheusremotewrite/spm:
        endpoint: "https://listener-uk.logz.io:8053"
        add_metric_suffixes: false
        headers:
          Authorization: "Bearer YOUR-METRICS-SHIPPING-TOKEN" # metrics account token for span metrics
          user-agent: "logzio-opentelemetry-apm"


    processors:
      batch:
      tail_sampling:
        policies:
          [
            {
              name: policy-errors,
              type: status_code,
              status_code: {status_codes: [ERROR]}
            },
            {
              name: policy-slow,
              type: latency,
              latency: {threshold_ms: 1000}
            }, 
            {
              name: policy-random-ok,
              type: probabilistic,
              probabilistic: {sampling_percentage: 10}
            }        
          ]

    extensions:
      pprof:
        endpoint: :1777
      zpages:
        endpoint: :55679
      health_check:

    service:
      extensions: [health_check, pprof, zpages]
      pipelines:
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [logzio/logs]
        metrics:
          receivers: [otlp,spanmetrics]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [tail_sampling, batch]
          exporters: [logzio/traces,logzio/logs,spanmetrics]
      telemetry: # log verbosity for the Collector logs
        logs:
          level: "debug"     

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
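
&lt;p&gt;A note on the endpoints above: the Logz.io listener hostname depends on your account’s region code, i.e. &lt;code&gt;listener.logz.io&lt;/code&gt; for the default us region, &lt;code&gt;listener-eu.logz.io&lt;/code&gt; for eu, &lt;code&gt;listener-uk.logz.io&lt;/code&gt; for uk, and so on. A tiny hypothetical helper (the function name is mine, not part of any Logz.io tooling) captures the pattern:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical helper: derive the Logz.io listener endpoint from a region code.
# The default region (us) has no suffix; other regions add one, e.g. "eu", "uk".
logzio_listener() {
  local region="$1"
  if [ -z "$region" ] || [ "$region" = "us" ]; then
    echo "https://listener.logz.io:8053"
  else
    echo "https://listener-${region}.logz.io:8053"
  fi
}

logzio_listener uk   # prints https://listener-uk.logz.io:8053
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;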



&lt;blockquote&gt;
&lt;p&gt;❗️&lt;strong&gt;Notes:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Receivers:&lt;/em&gt; Defines how telemetry data is received&lt;/li&gt;
&lt;li&gt;otlp: Specifies the protocol (grpc and http) for receiving logs, metrics, or traces from applications.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Exporters:&lt;/em&gt; Specifies where and how telemetry data is sent.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Services:&lt;/em&gt; Defines the data flow pipelines for processing telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tail_sampling&lt;/code&gt; defines which traces to sample after all spans in a request are completed. By default, it collects all traces with an error span, traces slower than 1000 ms, and 10% of all other traces.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;extensions&lt;/em&gt; section is optional. &lt;/li&gt;
&lt;li&gt;When merging YAML values with Helm, objects are merged and arrays are replaced. The &lt;code&gt;spanmetrics&lt;/code&gt; exporter must be included in the array of exporters for the &lt;code&gt;traces&lt;/code&gt; pipeline if overridden. Not including this exporter will result in an error.&lt;/li&gt;
&lt;li&gt;You can find all your personal parameters and data shipping tokens by logging into the Logz.io platform and going to Settings &amp;gt; Data shipping tokens, or to Integrations &amp;gt; OpenTelemetry. &lt;/li&gt;
&lt;li&gt;You can also find the full OTel configuration directly in the Logz.io platform, under Integrations (search for OpenTelemetry), or in the Logz.io &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/logzioexporter" rel="noopener noreferrer"&gt;exporter GitHub documentation&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
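
&lt;p&gt;To make the array-replacement caveat above concrete, here’s a sketch of what an override of the &lt;code&gt;traces&lt;/code&gt; pipeline must look like: since Helm replaces arrays instead of merging them, the chart’s default &lt;code&gt;spanmetrics&lt;/code&gt; exporter has to be re-listed explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opentelemetry-collector:
  config:
    service:
      pipelines:
        traces:
          # This array REPLACES the chart default, so every exporter you still
          # want (including spanmetrics) must appear here again.
          exporters: [logzio/traces, spanmetrics]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;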

&lt;ul&gt;
&lt;li&gt;To finalize, apply the YAML configuration changes to start sending telemetry to Logz.io:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade my-otel-demo open-telemetry/opentelemetry-demo --values my-values-file.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will apply the changes to the current OTel Helm release without requiring a fresh installation.&lt;/p&gt;
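
&lt;p&gt;Because the configuration above sets the Collector’s own log level to debug, a quick way to confirm the upgrade took effect is to tail the Collector logs and watch for export activity or errors (the deployment name below assumes the &lt;code&gt;my-otel-demo&lt;/code&gt; release name; adjust it to match yours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Stream the Collector's recent logs; exporter errors such as 401s usually
# mean a wrong shipping token or region code
kubectl logs deployment/my-otel-demo-otelcol --tail=100 -f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;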

&lt;h2&gt;
  
  
  Validating and exploring your OpenTelemetry data in the vendor backend:
&lt;/h2&gt;

&lt;p&gt;After deploying the OTel Demo App and configuring the collector to send data to your chosen backend, it’s important to validate that the telemetry data is flowing correctly. After a few seconds, you can start exploring all your logs, metrics and traces quickly within the vendor platform!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In my case I can go to the Logz.io Logs section to view the incoming log data and interact with it: &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs2yyxhbuclxb5h54nri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs2yyxhbuclxb5h54nri.png" alt="Log Management" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Logz.io App 360 menu is where you’ll find all the deployed OpenTelemetry microservices, with the ability to dive into a specific service to get even more app-level details, traces and app metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn24wgwignavshwy0ena5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn24wgwignavshwy0ena5.png" alt="Applications List" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3drmq29gaozevbdzheg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3drmq29gaozevbdzheg4.png" alt="Application Overview" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Well done!&lt;/strong&gt; Your OTel configuration worked, and the data was collected and exported to the vendor backend! &lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up:
&lt;/h2&gt;

&lt;p&gt;By following the steps laid out in this guide, you've taken the critical first steps in using OpenTelemetry. You’ve learned how to collect telemetry data from applications deployed in a Kubernetes environment using the OTel demo app and send it to a vendor backend, using the native vendor &lt;em&gt;Exporter&lt;/em&gt;. In just a few simple steps, you’ve set up logs, metrics, and traces streaming into a unified observability platform/backend, enabling seamless monitoring and troubleshooting of your systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix: Troubleshooting &amp;amp; references
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Common issues and fixes:&lt;br&gt;
No data in Logz.io: verify the shipping tokens you used in &lt;code&gt;my-values-file.yaml&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Further Reading:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/demo/kubernetes-deployment/" rel="noopener noreferrer"&gt;OpenTelemetry Demo App for Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.logz.io/docs/shipping/other/opentelemetry-data/" rel="noopener noreferrer"&gt;Sending Otel data to Logz.io documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.logz.io/docs/telemetry-collector/telemetry-collector-k8s/" rel="noopener noreferrer"&gt;Send Kubernetes Data with Logz.io Telemetry Collector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/logzioexporter" rel="noopener noreferrer"&gt;Logz.io Exporter GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://minikube.sigs.k8s.io/docs/" rel="noopener noreferrer"&gt;Setting up your local Kubernetes environment&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>opensource</category>
      <category>kubernetes</category>
      <category>observability</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>Migrating from DIY ELK to a full SaaS platform</title>
      <dc:creator>Jade Lassery</dc:creator>
      <pubDate>Tue, 10 Dec 2024 15:07:15 +0000</pubDate>
      <link>https://forem.com/jadelassery/migrating-from-diy-elk-to-a-full-saas-platform-1m8p</link>
      <guid>https://forem.com/jadelassery/migrating-from-diy-elk-to-a-full-saas-platform-1m8p</guid>
      <description>&lt;p&gt;Managing modern systems requires a constant balance between operational efficiency and innovation; going a little further, maintaining seamless operations and delivering exceptional customer experiences increasingly depend on ensuring robust observability.&lt;/p&gt;

&lt;p&gt;For years, the ELK stack (Elasticsearch, Logstash, Kibana) has been the go-to solution for many organizations for log management and observability, offering flexibility, control and an open source approach. However, as organizations scale and their data demands grow, maintaining ELK often becomes a real challenge, requiring more resources, generating higher costs and driving increasing complexity.&lt;/p&gt;

&lt;p&gt;Shifting to a full SaaS observability platform — purpose-built solutions designed to simplify operations, enhance insights and scale effortlessly — offers a strategic alternative. The shift allows businesses to offload the operational challenges of DIY ELK, enabling teams to focus on delivering value instead of maintaining infrastructure.  It’s not just about swapping tools, it’s about transforming the way you approach observability to support long-term business success by aggregating innovation and managed capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do Organizations Choose SaaS Over DIY ELK?
&lt;/h2&gt;

&lt;p&gt;To begin understanding the migration process, it’s important to consider why organizations choose SaaS over DIY ELK. The answer lies in the challenges of managing this kind of stack. &lt;/p&gt;

&lt;p&gt;As organizations expand, their data requirements become more demanding. Scaling a DIY ELK stack to handle increasing log volumes and infrastructure requirements can lead to performance issues, data loss and downtime, creating a constant need for manual intervention. SaaS platforms, on the other hand, manage all the hurdles for you, automatically scaling to accommodate growing data levels, reducing operational complexity and ensuring near-seamless performance.&lt;/p&gt;

&lt;p&gt;But scalability is just one part of the equation. Operating and maintaining a DIY ELK stack also means handling constant updates, security patches and rebalancing of infrastructure — tasks that consume time and resources. SaaS platforms handle these tasks in the background, allowing teams to focus on strategic work. Moreover, while DIY ELK might seem cost-effective initially, hidden expenses for scaling, maintenance and management can add up. SaaS platforms offer predictable observability pricing, simplifying budget management. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Benefits of Moving to a SaaS Observability Platform
&lt;/h2&gt;

&lt;p&gt;A significant benefit of moving to a SaaS platform is access to advanced features that go beyond traditional log management. Many SaaS observability platforms provide integrated solutions for logs, metrics and traces in one unified interface. These platforms now also frequently leverage AI-powered observability tools for anomaly detection and root cause analysis (RCA) to quickly surface issues, reducing time spent troubleshooting and enabling proactive incident management.&lt;/p&gt;

&lt;p&gt;Beyond these operational benefits, SaaS platforms also offer enhanced security and compliance features that can be difficult and costly to implement with a DIY stack. With built-in encryption, access controls and industry certifications (such as SOC 2, GDPR compliance, etc), SaaS providers help ensure that your data remains secure and meets regulatory standards, without requiring additional overhead from your internal teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  When is it Time to Move?
&lt;/h2&gt;

&lt;p&gt;There are many factors to consider when deciding whether it’s the right time to migrate from a DIY ELK stack to a SaaS platform. Here are some things to watch out for: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data growth is overwhelming: Your ELK stack struggles to keep up with increasing data volumes, leading to slow query times and infrastructure strain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operational complexity is draining resources: Managing and maintaining the stack is consuming your DevOps team’s time, leaving little room for innovation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Costs are escalating or unpredictable: Infrastructure, storage and operational expenses are becoming hard to forecast and justify.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unified and advanced observability is needed: Siloed tools for logs, metrics and traces make it challenging to diagnose and resolve issues quickly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security or compliance is a concern: You need advanced security features or compliance certifications that are difficult to implement in a DIY stack.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you’ve identified that your stack is no longer meeting your needs — whether due to scaling issues, rising costs, or operational inefficiencies — the next step is to start planning your migration to a SaaS platform. Making this shift doesn’t have to be overwhelming, but it does require careful consideration and a strategic approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here are the key steps that you can use as a baseline to ensure a smooth transition:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Evaluate your needs:&lt;/strong&gt; Understand what you need from your observability stack. Are you looking for better scalability, advanced features, or simplified management? What else?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Choose the right platform:&lt;/strong&gt; Not all SaaS platforms are created equal. Here’s a tip: look for one that offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Native integrations with your current tools such as Logstash, Beats or OpenTelemetry.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unified support for logs, metrics, traces and extra visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI-powered insights and automation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Platforms like Logz.io, for example, support the same ingestion methods as ELK, so you can reuse your existing configurations with minimal changes, and they also provide advanced capabilities like root cause analysis to help businesses proactively manage their systems with minimal effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Plan and test:&lt;/strong&gt; Begin by setting up the SaaS platform alongside your existing ELK stack. Test data ingestion using a subset of your logs or metrics to validate compatibility and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Migrate gradually:&lt;/strong&gt; Move workloads incrementally, starting with non-critical systems. Once the process is stable and workflows are optimized, transition critical systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Recreate dashboards and alerts:&lt;/strong&gt; Export dashboards and alerts from ELK and import them into the new managed platform. Take advantage of pre-built templates and advanced alerting options to refine your observability strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Optimize and train:&lt;/strong&gt; Ensure your team is trained on the new platform and continue optimizing configurations to align with your needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Decommission DIY ELK:&lt;/strong&gt; Once all systems are successfully migrated, phase out your ELK infrastructure, archiving historical data in external storage if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unlocking Value for the Long Term
&lt;/h2&gt;

&lt;p&gt;Migrating to a SaaS observability platform is more than just a technical upgrade or getting everything up and running. It’s a strategic decision that drives long-term value. By offloading operational complexity, businesses can focus on innovation, improve system reliability and enhance customer experiences.&lt;/p&gt;

&lt;p&gt;Organizations that make this shift often find they’re not just solving operational headaches, they’re positioning themselves for scalable, data-driven growth. It’s a step toward making observability a seamless enabler of success, rather than a persistent challenge.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>observability</category>
      <category>saas</category>
    </item>
    <item>
      <title>How to monitor Openshift using Datadog Operator</title>
      <dc:creator>Jade Lassery</dc:creator>
      <pubDate>Mon, 11 Mar 2024 04:58:58 +0000</pubDate>
      <link>https://forem.com/jadelassery/how-to-monitor-openshift-using-datadog-operator-ffp</link>
      <guid>https://forem.com/jadelassery/how-to-monitor-openshift-using-datadog-operator-ffp</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Co-authors: &lt;a href="https://www.linkedin.com/in/blevenhagen/" rel="noopener noreferrer"&gt;Luiz Bernardo Levenhagen&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/leonardo-alves-de-araujo/" rel="noopener noreferrer"&gt;Leonardo Araujo&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  In this article, we will demonstrate how to integrate Openshift with Datadog using the &lt;code&gt;Datadog operator&lt;/code&gt; to collect metrics, logs, events and also applications' data.
&lt;/h3&gt;

&lt;p&gt;In this article we use the following versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Openshift v4.13.11&lt;/li&gt;
&lt;li&gt;Datadog Operator v1.3.0&lt;/li&gt;
&lt;li&gt;Datadog account (more information on how to request a trial at the bottom of the blog)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;About&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This article is aimed at users who would like to integrate or monitor their &lt;code&gt;Openshift Cluster&lt;/code&gt; using the &lt;code&gt;Datadog monitoring solution&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We will use the Datadog operator to instantiate our agent and collect all metrics (cluster/containers), cluster and container/pod logs, network, CPU and memory consumption, as well as applications' data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Red Hat does not support the Datadog operator or its configuration. For any questions related to the use of the platform or operator, contact Datadog.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;User with the cluster-admin cluster role&lt;/li&gt;
&lt;li&gt;Openshift 4.10 or later&lt;/li&gt;
&lt;li&gt;Datadog account (more information on how to request a trial at the bottom of the blog)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Procedure&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Datadog
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Add API Keys
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;To add a new &lt;code&gt;datadog API Key&lt;/code&gt;, navigate to &lt;code&gt;Organization Settings&lt;/code&gt; &amp;gt; &lt;code&gt;API Keys&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you have the permission to create API keys, click &lt;code&gt;New Key&lt;/code&gt; in the top right corner.&lt;/li&gt;
&lt;li&gt;Define a name that will help you identify the key in the future.&lt;/li&gt;
&lt;li&gt;Once created, copy the Key so we can use it later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtpxc1mexjswf7wn45ma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtpxc1mexjswf7wn45ma.png" alt="Image description" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Add Application keys
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;To add a new Datadog &lt;code&gt;Application Key&lt;/code&gt;, navigate to &lt;code&gt;Organization Settings&lt;/code&gt; &amp;gt; &lt;code&gt;Application Keys&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you have permission to create Application Keys, click &lt;code&gt;New Key&lt;/code&gt; in the top right corner.&lt;/li&gt;
&lt;li&gt;Give the key a name that will help you identify it later.&lt;/li&gt;
&lt;li&gt;Once created, copy the key so we can use it later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrbyeppblcifiokr028d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrbyeppblcifiokr028d.png" alt="Image description" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Openshift
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Datadog Operator Install
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In the OpenShift console, in the left side menu, click &lt;code&gt;Operators&lt;/code&gt; &amp;gt; &lt;code&gt;OperatorHub&lt;/code&gt;, then type &lt;code&gt;datadog&lt;/code&gt; in the &lt;code&gt;search field&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyatnmfo5dv4p5wdrbky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyatnmfo5dv4p5wdrbky.png" alt="Image description" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Tip&lt;br&gt;
Whenever available, use a certified option.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As we can see, we are using version 1.3.0 of the operator; click &lt;code&gt;Install&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9vdx406jkpr9e2v75up.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff9vdx406jkpr9e2v75up.png" alt="Image description" width="800" height="709"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On this screen, we will keep all the default options:

&lt;ul&gt;
&lt;li&gt;Update channel: &lt;code&gt;stable&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Installation mode: &lt;code&gt;All namespaces on the cluster (default)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Installed Namespace: &lt;code&gt;openshift-operators&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Update approval: &lt;code&gt;Automatic&lt;/code&gt; 

&lt;ul&gt;
&lt;li&gt;Obs.: &lt;em&gt;If you prefer, you can use the Manual option.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Click &lt;code&gt;Install&lt;/code&gt;.
&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqxy187v5d97gtljevp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqxy187v5d97gtljevp3.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;
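&lt;p&gt;If you prefer the CLI, the same installation can be expressed as an OLM &lt;code&gt;Subscription&lt;/code&gt; object. This is only a sketch mirroring the defaults above; the &lt;code&gt;source&lt;/code&gt; catalog name is an assumption and may differ in your cluster, so verify it with &lt;code&gt;oc get packagemanifests&lt;/code&gt; first.&lt;/p&gt;

```yaml
# Sketch of an OLM Subscription equivalent to the console install above.
# The catalog source name is an assumption; verify it in your cluster.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: datadog-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: datadog-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
```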

&lt;ul&gt;
&lt;li&gt;Wait until the installation is complete.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw9o0bbxf0c7xkjaflgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw9o0bbxf0c7xkjaflgh.png" alt="Image description" width="690" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Create a secret with the Datadog keys (not mandatory, but good practice)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In the terminal, switch to the &lt;code&gt;openshift-operators&lt;/code&gt; project
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;oc project openshift-operators
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now let's create a secret to store the API Key and Application Key. Replace the values below with the keys we generated earlier in the Datadog console.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;oc create secret generic datadog-secret &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--from-literal&lt;/span&gt; api-key&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;REPLACE_ME&lt;span class="sb"&gt;`&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--from-literal&lt;/span&gt; app-key&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;REPLACE_ME&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
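&lt;p&gt;Equivalently, the same secret can be declared as a manifest and applied with &lt;code&gt;oc apply&lt;/code&gt;; the &lt;code&gt;stringData&lt;/code&gt; field accepts the plain-text keys and the cluster stores them base64-encoded:&lt;/p&gt;

```yaml
# Declarative equivalent of the oc create secret command above.
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
  namespace: openshift-operators
type: Opaque
stringData:
  api-key: REPLACE_ME
  app-key: REPLACE_ME
```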



&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let's now instantiate our Datadog agent using the YAML below
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; datadog_agent.yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: openshift-operators
spec:
  features:
    apm:
      enabled: true
      unixDomainSocketConfig:
        enabled: true
    clusterChecks:
      enabled: true
      useClusterChecksRunners: true
    dogstatsd:
      originDetectionEnabled: true
      unixDomainSocketConfig:
        enabled: true
    eventCollection:
      collectKubernetesEvents: true
    liveContainerCollection:
      enabled: true
    liveProcessCollection:
      enabled: true
    logCollection:
      containerCollectAll: true
      enabled: true
    npm:
      collectDNSStats: true
      enableConntrack: true
      enabled: true
  global:
    clusterName: DemoLab
    credentials:
      apiSecret:
        keyName: api-key
        secretName: datadog-secret
      appSecret:
        keyName: app-key
        secretName: datadog-secret
    criSocketPath: /var/run/crio/crio.sock
    kubelet:
      tlsVerify: false
    site: datadoghq.eu
  override:
    clusterAgent:
      containers:
        cluster-agent:
          securityContext:
            readOnlyRootFilesystem: false
      replicas: 2
      serviceAccountName: datadog-agent-scc
    nodeAgent:
      hostNetwork: true
      securityContext:
        runAsUser: 0
        seLinuxOptions:
          level: s0
          role: system_r
          type: spc_t
          user: system_u
      serviceAccountName: datadog-agent-scc
      tolerations:
      - operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
&lt;/span&gt;&lt;span class="no"&gt;EOF        
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;A few notes on what we are enabling in this agent:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enabling the &lt;code&gt;APM&lt;/code&gt; (&lt;em&gt;Application Performance Monitoring&lt;/em&gt;) feature&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;unixDomainSocketConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
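&lt;p&gt;With the unix domain socket enabled, an instrumented application can point its Datadog tracer at the agent's socket. A hypothetical snippet for a traced application container (the socket path shown is the agent default; adjust it if you changed the configuration):&lt;/p&gt;

```yaml
# Hypothetical env snippet for an application container whose tracer
# should send traces over the node agent's APM unix socket.
env:
  - name: DD_TRACE_AGENT_URL
    value: unix:///var/run/datadog/apm.socket
```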



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Cluster Checks&lt;/code&gt; extend the autodiscovery function to non-containerized resources, checking whether there are integrations or technologies to monitor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;clusterChecks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;useClusterChecksRunners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Dogstatsd&lt;/code&gt; collects custom metrics and events and periodically sends them to a metrics aggregation service on the Datadog server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;dogstatsd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;originDetectionEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;unixDomainSocketConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
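&lt;p&gt;For reference, DogStatsD speaks a simple text protocol: &lt;code&gt;metric.name:value|type|#tag:value&lt;/code&gt;. A minimal sketch building such a datagram (the metric name and tag are made up; on a cluster node it could then be written to the agent's unix socket, whose default path is an assumption here):&lt;/p&gt;

```shell
# Build a DogStatsD datagram: a gauge named checkout.cart_size (hypothetical)
# with value 3 and an env:demo tag.
metric="checkout.cart_size:3|g|#env:demo"
echo "$metric"
# On a node with the agent running, it could be sent like this:
#   echo -n "$metric" | nc -uU /var/run/datadog/dsd.socket   # path is an assumption
```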



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Here we are enabling the collection of all logs (including container logs) and events generated in our cluster and sending them to Datadog.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;eventCollection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;collectKubernetesEvents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;liveContainerCollection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;liveProcessCollection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;logCollection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containerCollectAll&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;With &lt;code&gt;NPM&lt;/code&gt; (&lt;em&gt;Network Performance Monitoring&lt;/em&gt;), we gain visibility into all traffic across our cluster: nodes, containers, availability zones, and more.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;npm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;collectDNSStats&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;enableConntrack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;credentials&lt;/code&gt; block under &lt;strong&gt;global&lt;/strong&gt;, we reference the secret created earlier containing the API and application keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;apiSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-key&lt;/span&gt;
    &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadog-secret&lt;/span&gt;
  &lt;span class="na"&gt;appSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-key&lt;/span&gt;
    &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadog-secret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;In this block, we define the path to the CRI-O service socket, disable TLS verification for communication with the kubelet, and with &lt;code&gt;site&lt;/code&gt; define which Datadog server will receive the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;criSocketPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/run/crio/crio.sock&lt;/span&gt;
&lt;span class="na"&gt;kubelet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tlsVerify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;site&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadoghq.eu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;clusterAgent&lt;/code&gt; block under &lt;strong&gt;override&lt;/strong&gt;, we add SecurityContext (SCC) settings and define which &lt;code&gt;serviceaccount&lt;/code&gt; the &lt;code&gt;datadog-cluster-agent&lt;/code&gt; pods should use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;clusterAgent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cluster-agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;readOnlyRootFilesystem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadog-agent-scc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;❗Note&lt;br&gt;
The &lt;code&gt;datadog-agent-scc serviceaccount&lt;/code&gt; is created automatically by the operator and already has all the necessary permissions for the agent to run correctly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;nodeAgent&lt;/code&gt; block under &lt;strong&gt;override&lt;/strong&gt;, we define SecurityContext settings for the &lt;code&gt;datadog-agent&lt;/code&gt; pods, reuse the same &lt;code&gt;datadog-agent-scc&lt;/code&gt; serviceaccount, and define &lt;code&gt;tolerations&lt;/code&gt; for nodes with taints, in our case the master nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nodeAgent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hostNetwork&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runAsUser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
    &lt;span class="na"&gt;seLinuxOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s0&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;system_r&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spc_t&lt;/span&gt;
      &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;system_u&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadog-agent-scc&lt;/span&gt;
  &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exists&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node-role.kubernetes.io/master&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
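&lt;p&gt;The same pattern extends to any other tainted nodes you want to monitor. For example, a hypothetical extra toleration for dedicated infra nodes could be appended to the list:&lt;/p&gt;

```yaml
# Hypothetical extra toleration: schedule the node agent on infra nodes too.
- effect: NoSchedule
  key: node-role.kubernetes.io/infra
  operator: Exists
```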



&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With that covered, let's deploy our Datadog agent. Execute this command to create the object:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;oc &lt;span class="nt"&gt;-n&lt;/span&gt; openshift-operators create &lt;span class="nt"&gt;-f&lt;/span&gt; datadog_agent.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Once created, let's validate that our agent was deployed correctly
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;oc &lt;span class="nt"&gt;-n&lt;/span&gt; openshift-operators get datadogagent
&lt;span class="nv"&gt;$ &lt;/span&gt;oc &lt;span class="nt"&gt;-n&lt;/span&gt; openshift-operators get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;❗Note&lt;br&gt;
Here we should have a datadog-agent pod running on each available OpenShift node.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqziejgi1u5u4wecy1c1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqziejgi1u5u4wecy1c1.png" alt="Image description" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❗Information&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The
&lt;code&gt;datadog-agent-xxxxx&lt;/code&gt; pods are responsible for collecting all metrics, events, traces and logs from each node in the cluster.&lt;/li&gt;
&lt;li&gt;The
&lt;code&gt;datadog-cluster-agent-xxxxx&lt;/code&gt; pods act as a proxy between the API server and the node-based agents; the Cluster Agent helps ease the server load.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now let's check the logs of the datadog-agent-xxxxx pods to identify whether there are any communication errors.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;oc logs &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/managed-by&lt;span class="o"&gt;=&lt;/span&gt;datadog-operator &lt;span class="nt"&gt;--max-log-requests&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w4mnq6qtotbuqmg46y4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w4mnq6qtotbuqmg46y4.png" alt="Image description" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Datadog platform/UI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Now on the Datadog platform, in the left side menu, click on &lt;code&gt;Infrastructure&lt;/code&gt; and then on &lt;code&gt;Infrastructure List&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2075y4d52n2gtmv1wjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2075y4d52n2gtmv1wjg.png" alt="Image description" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❗Information &lt;br&gt;
Server data, such as status, CPU information, memory and other details, may take a few minutes to be displayed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To view more details about a specific node, click the node name and navigate through the available tabs. This is the quickest way to inspect your nodes/hosts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvssj0bf9h2j6g8trvo8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvssj0bf9h2j6g8trvo8q.png" alt="Image description" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under the &lt;code&gt;Infrastructure&lt;/code&gt; menu, Datadog also gives you a dedicated &lt;code&gt;Kubernetes&lt;/code&gt; menu with the full picture of your cluster. You can check the state of all your Kubernetes resources, troubleshoot patterns, access out-of-the-box Dashboards and enable some recommended Alerts to monitor your environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fci7cukbn32g8jw4n4kj3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fci7cukbn32g8jw4n4kj3.gif" alt="Image description" width="1024" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can also explore the containers running in your OpenShift environment in more depth by going to &lt;code&gt;Infrastructure &amp;gt; Containers&lt;/code&gt;. Here you can analyze container logs, traces, the networking layer, processes running inside the container, and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8mx2ekb6lwr8l2a1r77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8mx2ekb6lwr8l2a1r77.png" alt="Image description" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To view more details about network traffic, in the left side menu, go to &lt;code&gt;Infrastructure&lt;/code&gt; &amp;gt; &lt;code&gt;Network Map&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiidte9vcbho7l92lsr34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiidte9vcbho7l92lsr34.png" alt="Image description" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To view the logs received from the cluster or from any application or technology running in your Kubernetes environment, in the left side menu, go to &lt;code&gt;Logs&lt;/code&gt; &amp;gt; &lt;code&gt;Analytics&lt;/code&gt;. On this screen, we can view all the details, filter application logs and even view the processes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F882e4r8pktbirmaojxdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F882e4r8pktbirmaojxdy.png" alt="Image description" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To view all collected metrics, in the left side menu, go to &lt;code&gt;Metrics&lt;/code&gt; &amp;gt; &lt;code&gt;Explorer&lt;/code&gt;. Here we can view all metrics, run and save queries, or create dashboards based on queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7hzpoe72fxjq8qtrdkx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7hzpoe72fxjq8qtrdkx.png" alt="Image description" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Datadog provides out-of-the-box Dashboards that can be used and customized. To use one, in the left side menu, go to &lt;code&gt;Dashboards&lt;/code&gt; &amp;gt; &lt;code&gt;Dashboard List&lt;/code&gt;, choose a dashboard and click its name.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b9suenv7uwfg4f1ku06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b9suenv7uwfg4f1ku06.png" alt="Image description" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❗Note:&lt;br&gt;
To customize a dashboard provided by Datadog, use the Clone feature to make the desired changes and save.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;br&gt;
 &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Using the Datadog Operator, we get a complete monitoring solution for our OpenShift cluster, with features such as APM, network analysis, logs, events and metrics.&lt;/p&gt;

&lt;p&gt;To request an OpenShift trial and learn more about our solution, &lt;a href="https://www.redhat.com/en/technologies/cloud-computing/openshift/try-it" rel="noopener noreferrer"&gt;click here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To request a Datadog trial and be able to replicate this knowledge, &lt;a href="https://www.datadoghq.com/dg/monitor/free-trial" rel="noopener noreferrer"&gt;click here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For more details and other configurations, start with the reference documents below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a&gt;Red Hat Catalog - Datadog Operator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datadoghq.com/integrations/openshift/?tab=operator" rel="noopener noreferrer"&gt;Datadog Documentation - Openshift&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datadoghq.com/containers/guide/operator-advanced/" rel="noopener noreferrer"&gt;Datadog Documentation - Advanced setup for Datadog Operator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datadoghq.com/integrations/openshift/?tab=operator#custom-datadog-scc-for-all-features" rel="noopener noreferrer"&gt;Custom Datadog SCC for all features&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>containers</category>
      <category>operator</category>
    </item>
  </channel>
</rss>
