<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: YukiOnodera</title>
    <description>The latest articles on Forem by YukiOnodera (@yukionodera).</description>
    <link>https://forem.com/yukionodera</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1233037%2F06e91f94-7604-4892-b9fe-ce75dd21243b.jpeg</url>
      <title>Forem: YukiOnodera</title>
      <link>https://forem.com/yukionodera</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yukionodera"/>
    <language>en</language>
    <item>
      <title>Instrumenting Python Apps with Datadog APM: A Docker Setup Guide</title>
      <dc:creator>YukiOnodera</dc:creator>
      <pubDate>Fri, 01 May 2026 13:51:47 +0000</pubDate>
      <link>https://forem.com/yukionodera/instrumenting-python-apps-with-datadog-apm-a-docker-setup-guide-3oo2</link>
      <guid>https://forem.com/yukionodera/instrumenting-python-apps-with-datadog-apm-a-docker-setup-guide-3oo2</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this post, we'll walk through the components required to instrument a Python application with Datadog APM, and how to configure them in a Docker environment.&lt;/p&gt;

&lt;p&gt;A common shorthand is "just install the Datadog Agent and the tracing library and you're good to go" — and while that's essentially correct, understanding what each component is actually doing under the hood makes troubleshooting dramatically easier when things go wrong. If you've ever stared at an empty APM dashboard wondering why your traces aren't showing up, this guide should help you build a clearer mental model of how the pieces fit together.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Building Blocks of Datadog APM
&lt;/h1&gt;

&lt;p&gt;To instrument an app with Datadog APM, you need to set up two things: the &lt;strong&gt;Agent side&lt;/strong&gt; and the &lt;strong&gt;library side&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Side: The Trace Agent
&lt;/h2&gt;

&lt;p&gt;When you enable APM in the Datadog Agent, an internal component called the &lt;strong&gt;Trace Agent&lt;/strong&gt; starts up.&lt;/p&gt;

&lt;p&gt;The Trace Agent is responsible for receiving trace data sent from your application and forwarding it to Datadog's backend. By default, it listens for traces on port &lt;strong&gt;8126/tcp&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a Docker environment, you enable APM with the following environment variables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_APM_ENABLED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enables APM (the Trace Agent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_APM_NON_LOCAL_TRAFFIC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Allows trace submissions from other containers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;If you don't set &lt;code&gt;DD_APM_NON_LOCAL_TRAFFIC=true&lt;/code&gt;, traces from other containers on the same Docker network won't be accepted — watch out for this.&lt;/p&gt;
&lt;/blockquote&gt;
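
&lt;p&gt;Once the Agent container is up, it's worth a quick connectivity check before you start debugging the library side. A minimal sketch (the hostname &lt;code&gt;datadog-agent&lt;/code&gt; is an assumption; substitute your Agent container's name or IP):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# check_agent.py - confirm the Trace Agent port is reachable from this container.
# "datadog-agent" is a hypothetical Compose service name; adjust as needed.
import socket

HOST = "datadog-agent"
PORT = 8126  # the Trace Agent's default trace intake port

try:
    with socket.create_connection((HOST, PORT), timeout=3):
        print(f"Trace Agent reachable at {HOST}:{PORT}")
except OSError as exc:
    print(f"Cannot reach {HOST}:{PORT}: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;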

&lt;h2&gt;
  
  
  Library Side: ddtrace
&lt;/h2&gt;

&lt;p&gt;On the Python application side, you'll use a library called &lt;strong&gt;ddtrace&lt;/strong&gt;. It's Datadog's official Python APM client and provides automatic instrumentation for over 80 libraries, including Flask, Django, and SQLAlchemy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ddtrace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  How Auto Instrumentation Works: Monkey Patching
&lt;/h1&gt;

&lt;p&gt;ddtrace's auto instrumentation works through a technique called &lt;strong&gt;Monkey Patching&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monkey Patching&lt;/strong&gt; is a method of dynamically replacing existing classes or functions at runtime. ddtrace uses this approach to inject trace instrumentation into supported libraries without requiring any changes to your application code.&lt;/p&gt;
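
&lt;p&gt;To make that concrete, here's a toy sketch of the technique itself (illustration only, not ddtrace's actual internals):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy monkey patching: replace requests.get at runtime so every caller
# picks up timing "instrumentation" without changing their code.
import time

import requests

_original_get = requests.get  # keep a reference to the real function

def _traced_get(url, **kwargs):
    start = time.time()
    try:
        return _original_get(url, **kwargs)
    finally:
        print(f"GET {url} took {time.time() - start:.3f}s")

requests.get = _traced_get  # from here on, requests.get is the wrapped version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;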

&lt;p&gt;There are two ways to enable it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ddtrace-run Command (Recommended)
&lt;/h2&gt;

&lt;p&gt;Just prepend &lt;code&gt;ddtrace-run&lt;/code&gt; to your application's startup command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ddtrace-run python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a Dockerfile, modify the CMD like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CMD ["ddtrace-run", "python", "app.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  import ddtrace.auto
&lt;/h2&gt;

&lt;p&gt;Alternatively, you can import it at the very top of your entry point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ddtrace.auto&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ Using &lt;code&gt;ddtrace-run&lt;/code&gt; and &lt;code&gt;import ddtrace.auto&lt;/code&gt; at the same time will cause monkey patching to be applied twice, so use only one of them.&lt;/p&gt;
&lt;/blockquote&gt;
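
&lt;p&gt;If you need spans beyond what auto instrumentation creates, the tracer also lets you open custom spans by hand. A minimal sketch (the function and tag names are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ddtrace import tracer

def process_order(order_id):
    # Open a custom span; it is finished and queued for the Trace Agent
    # when the "with" block exits.
    with tracer.trace("order.process", resource=str(order_id)) as span:
        span.set_tag("order.id", order_id)
        # ... business logic ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;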

&lt;h1&gt;
  
  
  Configuration in a Docker Environment
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Agent Container
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  datadog-agent:
    image: gcr.io/datadoghq/agent:latest
    environment:
      DD_API_KEY: ${DD_API_KEY}
      DD_SITE: datadoghq.com
      DD_APM_ENABLED: "true"
      DD_APM_NON_LOCAL_TRAFFIC: "true"
    ports:
      - "8126:8126/tcp"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Application Container
&lt;/h2&gt;

&lt;p&gt;On the application container side, point the tracer at the Agent container via environment variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  myapp:
    build: .
    environment:
      DD_AGENT_HOST: datadog-agent
      DD_TRACE_AGENT_PORT: "8126"
      DD_SERVICE: my-python-app
      DD_ENV: production
      DD_VERSION: 1.0.0
    depends_on:
      - datadog-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use the Docker Compose service name directly as the value of &lt;code&gt;DD_AGENT_HOST&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Unified Service Tagging
&lt;/h1&gt;

&lt;p&gt;The three variables &lt;code&gt;DD_SERVICE&lt;/code&gt; / &lt;code&gt;DD_ENV&lt;/code&gt; / &lt;code&gt;DD_VERSION&lt;/code&gt; are part of a mechanism called &lt;strong&gt;Unified Service Tagging&lt;/strong&gt; — standard tags that link telemetry across all of Datadog.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_SERVICE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Service name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;my-python-app&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_ENV&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Environment name&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;production&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DD_VERSION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Version&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By setting these three, you'll be able to navigate from APM trace views to related logs and metrics with a single click. I strongly recommend configuring them.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Datadog APM instrumentation works through the cooperation of two pieces: the Trace Agent on the Agent side, and ddtrace (with its monkey patching) on the library side. In a Docker environment, &lt;code&gt;DD_APM_NON_LOCAL_TRAFFIC=true&lt;/code&gt; and &lt;code&gt;DD_AGENT_HOST&lt;/code&gt; are particularly common gotchas, so keep them in mind during setup.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://docs.datadoghq.com/containers/docker/apm/" rel="noopener noreferrer"&gt;Tracing Docker Applications | Datadog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.datadoghq.com/tracing/trace_collection/automatic_instrumentation/dd_libraries/python/" rel="noopener noreferrer"&gt;Tracing Python Applications | Datadog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.datadoghq.com/getting_started/tagging/unified_service_tagging/" rel="noopener noreferrer"&gt;Unified Service Tagging | Datadog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Subscribe for more Datadog &amp;amp; Observability deep-dives.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datadog</category>
      <category>apm</category>
      <category>python</category>
      <category>docker</category>
    </item>
    <item>
      <title>Inside Datadog's Log Pipeline: How "Logging without Limits" Actually Works</title>
      <dc:creator>YukiOnodera</dc:creator>
      <pubDate>Mon, 27 Apr 2026 09:54:07 +0000</pubDate>
      <link>https://forem.com/yukionodera/inside-datadogs-log-pipeline-how-logging-without-limits-actually-works-50od</link>
      <guid>https://forem.com/yukionodera/inside-datadogs-log-pipeline-how-logging-without-limits-actually-works-50od</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In this post, I want to walk through how Datadog processes logs internally — from raw ingestion all the way to indexed, queryable data.&lt;/p&gt;

&lt;p&gt;If you've spent time clicking around Datadog's log management UI, you've probably noticed something satisfying: raw, messy log lines gradually get enriched and structured as they flow through the pipeline. It's a really elegant design, and once you understand the order in which things happen, it becomes clear why Datadog can offer both cost control and deep observability at the same time. Let me break it down.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article focuses on the main steps I studied this time around. In practice, there are more detailed processes — sensitive data scanning, Error Tracking, Live Tail, and so on. For the full picture, please refer to the &lt;a href="https://docs.datadoghq.com/logs/" rel="noopener noreferrer"&gt;official documentation | Datadog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  The Overall Log Processing Flow
&lt;/h1&gt;

&lt;p&gt;Datadog's log management is built around a design philosophy called &lt;strong&gt;Logging without Limits™&lt;/strong&gt;, which lets you independently control "ingestion," "storage," and "analysis."&lt;/p&gt;

&lt;p&gt;The high-level flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ingest
  ↓
Pipelines (Parse &amp;amp; Enrich)
  ↓
Generate Metrics
  ↓
Exclusion Filters
  ↓
Index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Walking Through Each Step
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Ingest
&lt;/h2&gt;

&lt;p&gt;First, logs are collected into Datadog from a wide variety of sources.&lt;/p&gt;

&lt;p&gt;Datadog offers &lt;strong&gt;over 500 log integrations&lt;/strong&gt;, covering AWS, GCP, Kubernetes, and all kinds of middleware. I was honestly surprised by just how many there are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipelines (Parse &amp;amp; Enrich)
&lt;/h2&gt;

&lt;p&gt;Once raw logs are ingested, they pass through &lt;strong&gt;pipelines&lt;/strong&gt; that structure and enrich them (adding extra information).&lt;/p&gt;

&lt;p&gt;Using processors like the &lt;strong&gt;Grok parser&lt;/strong&gt;, unstructured text logs get broken down into fields, and additional attributes can be attached.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Before&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;parsing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(raw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;log)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="mi"&gt;2024-04-27&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ERROR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;app&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Connection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;timeout:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;host=db&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;duration=&lt;/span&gt;&lt;span class="mi"&gt;5002&lt;/span&gt;&lt;span class="err"&gt;ms&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;After&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;parsing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(structured)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-04-27T12:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Connection timeout"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5002&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watching unformatted logs get cleaned up and enriched is honestly the most fun part of the whole pipeline.&lt;/p&gt;
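
&lt;p&gt;For reference, a parsing rule along these lines could produce that structure. This is a sketch in Datadog's Grok matcher syntax; the exact rule depends on your log format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of a Grok parsing rule for the raw log line above (adjust to taste)
my_rule %{date("yyyy-MM-dd HH:mm:ss"):timestamp} %{word:level} \[%{word:service}\] %{regex("[^:]*"):message}: host=%{notSpace:host} duration=%{integer:duration_ms}ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;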

&lt;h2&gt;
  
  
  Generate Metrics
&lt;/h2&gt;

&lt;p&gt;This is the most interesting part of the Logging without Limits design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log-based metrics are generated &lt;em&gt;before&lt;/em&gt; exclusion filters run.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In other words, even for logs that will later be discarded and never make it to the index, you can still retain statistical information as metrics.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The benefit of this design is that even if you're aggressively dropping logs to keep costs down, you still get reliable metrics on trends, error rates, and the like.&lt;/p&gt;
&lt;/blockquote&gt;
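
&lt;p&gt;As a concrete example: you could define a log-based metric that counts logs matching &lt;code&gt;service:app status:error&lt;/code&gt;, grouped by &lt;code&gt;service&lt;/code&gt;. Even if a later exclusion filter drops most of those error logs from the index, the per-service error count keeps accumulating. (The query and grouping here are illustrative, not from a real setup.)&lt;/p&gt;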

&lt;h2&gt;
  
  
  Exclusion Filters
&lt;/h2&gt;

&lt;p&gt;After metric generation, &lt;strong&gt;exclusion filters&lt;/strong&gt; decide which logs are &lt;em&gt;not&lt;/em&gt; saved to the index.&lt;/p&gt;

&lt;p&gt;Debug logs, high-volume boilerplate logs, and anything that isn't needed for ongoing search can be dropped here, helping keep indexing costs under control.&lt;/p&gt;
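
&lt;p&gt;For example, an exclusion filter with the query &lt;code&gt;status:debug&lt;/code&gt; and a 100% exclusion rate drops every debug log before indexing, while any metrics generated from those logs in the previous step survive. (Illustrative query; exclusion filters are configured per index.)&lt;/p&gt;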

&lt;h2&gt;
  
  
  Index
&lt;/h2&gt;

&lt;p&gt;Logs that pass through the filters are finally stored in the &lt;strong&gt;Index&lt;/strong&gt;. Once a log is indexed, you can use Datadog's UI for facet search and analysis.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why This Ordering Matters
&lt;/h1&gt;

&lt;p&gt;The key insight in this processing order is the design principle: &lt;strong&gt;"extract metrics before throwing logs away."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Log storage costs balloon quickly, so indexing every single log is rarely realistic. But if you just drop logs, you lose visibility into trends within the discarded data.&lt;/p&gt;

&lt;p&gt;Logging without Limits solves this by placing metric generation &lt;em&gt;before&lt;/em&gt; exclusion filters. You can lower storage costs while still maximizing observability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrap-up
&lt;/h1&gt;

&lt;p&gt;Datadog's log pipeline has clearly separated stages: ingest, parse, generate metrics, exclude, and index. The design choice to run metric generation before exclusion filters strikes me as especially important — it's what allows you to balance cost and observability rather than trade one off against the other.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Subscribe for more Datadog &amp;amp; Observability deep-dives.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datadog</category>
      <category>observability</category>
      <category>logging</category>
    </item>
    <item>
      <title>ECR Costs Had Increased Over 10 Times Without Me Noticing</title>
      <dc:creator>YukiOnodera</dc:creator>
      <pubDate>Fri, 16 Aug 2024 12:10:15 +0000</pubDate>
      <link>https://forem.com/yukionodera/ecr-costs-had-increased-over-10-times-without-me-noticing-15l1</link>
      <guid>https://forem.com/yukionodera/ecr-costs-had-increased-over-10-times-without-me-noticing-15l1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The other day, while I was looking through Cost Explorer, I discovered that the cost of ECR had ballooned to &lt;strong&gt;more than 10 times&lt;/strong&gt; what it was a few months ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  Investigation Begins
&lt;/h2&gt;

&lt;p&gt;Realizing this was a serious issue, I &lt;strong&gt;immediately began investigating the cause&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confirming ECR Pricing
&lt;/h3&gt;

&lt;p&gt;I started by &lt;strong&gt;reviewing the pricing structure for ECR&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/ecr/pricing/" rel="noopener noreferrer"&gt;https://aws.amazon.com/ecr/pricing/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ECR costs come from two components: &lt;strong&gt;storage charges based on the amount of image data stored, and charges for data transfer out&lt;/strong&gt;.&lt;/p&gt;
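
&lt;p&gt;For a rough sense of scale (illustrative numbers; check the pricing page for current rates): data transfer out to the internet starts at about $0.09/GB, so a CI pipeline that pulls a 1 GB image 50 times a day from outside AWS moves roughly 1,500 GB a month, on the order of $135 in transfer charges alone.&lt;/p&gt;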

&lt;h2&gt;
  
  
  Reviewing the Invoice
&lt;/h2&gt;

&lt;p&gt;Next, I &lt;strong&gt;checked last month’s invoice&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal was to &lt;strong&gt;identify whether the increased costs were due to storage or data transfer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At this point, I noticed that in the ECR section of the invoice, only the storage costs were visible. Data transfer costs are listed under a separate data transfer section, so make sure to check that. I initially overlooked this, which left me puzzled about the discrepancy.&lt;/p&gt;

&lt;p&gt;In my case, &lt;strong&gt;data transfer costs had skyrocketed&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checking AWS Accounts
&lt;/h3&gt;

&lt;p&gt;Since I was using AWS Organizations, I &lt;strong&gt;checked which member account was seeing the increased ECR costs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Fortunately, only one account had significantly higher costs, so I was able to quickly pinpoint the source.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reviewing ECR Repositories
&lt;/h3&gt;

&lt;p&gt;However, just identifying the AWS account wasn’t enough to solve the problem.&lt;/p&gt;

&lt;p&gt;I decided to open the list of ECR repositories and take a look.&lt;/p&gt;

&lt;p&gt;I noticed that &lt;strong&gt;several repositories had been created around the time costs started increasing&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cause of the Cost Increase
&lt;/h3&gt;

&lt;p&gt;After discussing it with team members and digging deeper, we discovered that this was due to a repository that had been migrated from Docker Hub for use in CI.&lt;/p&gt;

&lt;p&gt;CI runs on GitHub Actions on every commit, and each run pulled multiple images of considerable size from ECR. Because GitHub-hosted runners live outside AWS, every one of those pulls counts as data transfer out to the internet, so &lt;strong&gt;data transfer costs had surged&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I'm glad I found the cause. In fact, I consider myself &lt;strong&gt;lucky&lt;/strong&gt; to have noticed it through Cost Explorer.&lt;/p&gt;

&lt;p&gt;Cloud costs can sometimes spike unexpectedly, so it’s important to stay vigilant.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.docker.com/docker-hub/download-rate-limit/" rel="noopener noreferrer"&gt;https://docs.docker.com/docker-hub/download-rate-limit/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecr</category>
      <category>container</category>
    </item>
  </channel>
</rss>
