<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kemal Akkoyun</title>
    <description>The latest articles on Forem by Kemal Akkoyun (@kakkoyun).</description>
    <link>https://forem.com/kakkoyun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F154921%2Fc7f37497-7b15-47ad-adeb-7a893eeee5c9.jpeg</url>
      <title>Forem: Kemal Akkoyun</title>
      <link>https://forem.com/kakkoyun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kakkoyun"/>
    <language>en</language>
    <item>
      <title>Measuring Software Performance: Why Your Benchmarks Are Probably Lying</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Fri, 06 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/measuring-software-performance-why-your-benchmarks-are-probably-lying-k9o</link>
      <guid>https://forem.com/kakkoyun/measuring-software-performance-why-your-benchmarks-are-probably-lying-k9o</guid>
      <description>&lt;h3&gt;
  
  
  A Loose Cable That Broke Physics
&lt;/h3&gt;

&lt;p&gt;In 2006, a team of physicists began building the &lt;a href="https://en.wikipedia.org/wiki/OPERA_experiment" rel="noopener noreferrer"&gt;OPERA experiment&lt;/a&gt; — a detector at the Gran Sasso laboratory in Italy, designed to measure the speed of neutrinos beamed 730 kilometers through the Earth’s crust from CERN in Switzerland. Five years of construction. Roughly 100 million euros. Some of the most rigorous experimental physics on the planet.&lt;/p&gt;

&lt;p&gt;In September 2011, the results came back. Neutrinos were traveling &lt;a href="https://profmattstrassler.com/articles-and-posts/particle-physics-basics/neutrinos/neutrinos-faster-than-light/opera-what-went-wrong/" rel="noopener noreferrer"&gt;faster than the speed of light&lt;/a&gt;. The team had just broken the laws of physics.&lt;/p&gt;

&lt;p&gt;Except they hadn’t. After months of rechecking the math, the sensors, and the calibration, they found the root cause: a single fiber-optic cable that wasn’t fully plugged in. A loose connector had introduced a 73-nanosecond timing error — enough to make neutrinos appear superluminal.&lt;/p&gt;

&lt;p&gt;Most of us aren’t building 730-kilometer tunnels. But we deal with “loose cables” every day when measuring software performance. A benchmark that shows a 5% speedup might be measuring thermal throttling, CPU frequency scaling, or a noisy neighbor on a shared cloud instance. The signal is real, but so is the noise — and telling them apart requires discipline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxmzu4i9p9heod3vefct.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxmzu4i9p9heod3vefct.jpeg" alt="Software Performance Devroom audience at FOSDEM 2026" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post expands on the talk &lt;a href="https://github.com/igoragoli" rel="noopener noreferrer"&gt;Augusto de Oliveira&lt;/a&gt; and I gave at the &lt;a href="https://dev.to/talks/how-to-reliably-measure-software-performance/"&gt;FOSDEM 2026 Software Performance Devroom&lt;/a&gt;. The &lt;a href="https://github.com/igoragoli/fosdem-2026-software-performance" rel="noopener noreferrer"&gt;slides and experiments&lt;/a&gt; are all open source.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why Benchmarking Is Hard
&lt;/h3&gt;

&lt;p&gt;Measuring software performance is a specialized version of a more general problem: finding a signal in a world full of noise.&lt;/p&gt;

&lt;p&gt;Modern systems have layers of non-determinism that conspire against repeatable measurements. The CPU dynamically adjusts its clock frequency based on load and temperature. The OS scheduler moves threads between cores. Caches warm and cool. Background processes steal cycles. VMs share physical resources with other tenants. Memory layout changes between runs due to address space layout randomization (ASLR).&lt;/p&gt;

&lt;p&gt;Any one of these factors can shift your numbers by a few percent. Stack them up, and a benchmark that reports a 5% improvement might just be measuring random variation. You run it again and the improvement vanishes — or reverses.&lt;/p&gt;

&lt;p&gt;The gap between “I ran a quick benchmark on my laptop” and “this measurement is reliable enough to make decisions on” is enormous. Closing that gap requires controlling the environment, designing the benchmark properly, interpreting results with statistical rigor, and integrating the whole process into your development workflow.&lt;/p&gt;




&lt;h3&gt;
  
  
  Environment Control
&lt;/h3&gt;

&lt;p&gt;This is the foundation. No amount of statistical sophistication will compensate for a noisy measurement environment. The sources of noise come from every layer of the stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Sources of Noise&lt;/th&gt;
&lt;th&gt;Mitigations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network, temperature, vibration, virtualization&lt;/td&gt;
&lt;td&gt;Bare metal instances, dedicated hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Memory layout, compilation/linking&lt;/td&gt;
&lt;td&gt;Fixed builds, disable ASLR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kernel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scheduling, caching&lt;/td&gt;
&lt;td&gt;CPU affinity, process priority, cache management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SMT contention, dynamic frequency scaling&lt;/td&gt;
&lt;td&gt;Disable SMT, disable DFS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Noisy Neighbors and Bare Metal
&lt;/h4&gt;

&lt;p&gt;If you’re running benchmarks on a shared cloud VM, you’re sharing physical CPU cores, memory bandwidth, and last-level cache with other tenants. Their workload affects your numbers. This is the classic noisy neighbor problem.&lt;/p&gt;

&lt;p&gt;The fix: use bare metal cloud instances (e.g., AWS &lt;code&gt;m5.metal&lt;/code&gt;). They cost more, but they give you exclusive access to the underlying hardware. Just as importantly, bare metal access lets you apply the kernel-level and CPU-level mitigations below — none of which are possible on shared VMs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.mongodb.com/company/blog/engineering/reducing-variability-performance-tests-ec2-setup-key-results" rel="noopener noreferrer"&gt;MongoDB’s engineering team documented this well&lt;/a&gt; — their work on reducing variability in EC2 performance tests is an excellent reference for anyone setting up cloud-based benchmarking infrastructure.&lt;/p&gt;

&lt;h4&gt;
  
  
  CPU Affinity and Process Priority
&lt;/h4&gt;

&lt;p&gt;The OS scheduler moves processes between CPU cores to balance load. Each migration can evict warm cache lines and introduce jitter. Pinning your benchmark to specific cores with &lt;code&gt;taskset&lt;/code&gt; eliminates this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pin benchmark to CPU 0&lt;/span&gt;
taskset &lt;span class="nt"&gt;-c&lt;/span&gt; 0 ./benchmark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, raising process priority with &lt;code&gt;nice&lt;/code&gt; reduces scheduling interference from other processes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Higher priority (niceness -5, where -20 is highest)&lt;/span&gt;
&lt;span class="nb"&gt;nice&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-5&lt;/span&gt; ./benchmark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cache Management
&lt;/h4&gt;

&lt;p&gt;If your benchmark touches the filesystem, cold vs. warm page cache can dramatically change results. Either warm the cache deliberately before measurement, or drop it to start from a known state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Drop all caches (requires root)&lt;/span&gt;
&lt;span class="nb"&gt;echo &lt;/span&gt;3 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /proc/sys/vm/drop_caches &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sync&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Simultaneous Multithreading (SMT)
&lt;/h4&gt;

&lt;p&gt;SMT (marketed as Hyper-Threading on Intel CPUs) allows two hardware threads to share a single physical core. They share execution resources — ALUs, caches, branch predictors — while maintaining separate architectural state.&lt;/p&gt;

&lt;p&gt;For I/O-bound workloads, this is fine: one thread executes while the other waits for I/O. But for CPU-bound benchmarks, SMT introduces severe contention. Two threads fight over the same execution units, and the resulting interference shows up as variance in your measurements.&lt;/p&gt;

&lt;p&gt;We ran a simple experiment on an AWS &lt;code&gt;m5.metal&lt;/code&gt; instance with DFS disabled, measuring two CPU-bound tasks running on the same core (SMT enabled) vs. separate cores (SMT disabled):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;Coeff. of Variation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SMT enabled, task 1&lt;/td&gt;
&lt;td&gt;1537.64 +/- 367.29 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;23.887%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMT enabled, task 2&lt;/td&gt;
&lt;td&gt;1536.88 +/- 366.84 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;23.869%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMT disabled, task 1&lt;/td&gt;
&lt;td&gt;737.37 +/- 0.32 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.044%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMT disabled, task 2&lt;/td&gt;
&lt;td&gt;737.93 +/- 1.74 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.235%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That’s &lt;strong&gt;100x less variance&lt;/strong&gt; with SMT disabled. The tasks also run twice as fast because they’re no longer contending for shared execution resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disable SMT&lt;/span&gt;
&lt;span class="nb"&gt;echo &lt;/span&gt;off &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/devices/system/cpu/smt/control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Dynamic Frequency Scaling (DFS)
&lt;/h4&gt;

&lt;p&gt;Modern CPUs adjust their clock frequency dynamically based on workload, thermals, and power budgets. Intel calls the upward scaling “Turbo Boost.” This is great for general-purpose computing but terrible for benchmarking — the frequency varies based on how many cores are active, the ambient temperature, and the power headroom.&lt;/p&gt;

&lt;p&gt;A single-threaded benchmark might run at 3.5 GHz. Start another workload on a neighboring core and the frequency drops to 3.1 GHz. Your benchmark just got 11% slower, and the code didn’t change.&lt;/p&gt;

&lt;p&gt;We measured this on the same &lt;code&gt;m5.metal&lt;/code&gt; instance with SMT disabled, varying the number of concurrent CPU-bound tasks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;Coeff. of Variation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DFS on, 1 task&lt;/td&gt;
&lt;td&gt;533.97 +/- 2.046 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.383%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DFS on, 8 tasks&lt;/td&gt;
&lt;td&gt;578.67 +/- 0.287 ms&lt;/td&gt;
&lt;td&gt;0.050%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DFS off, 1 task&lt;/td&gt;
&lt;td&gt;738.18 +/- 0.306 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.041%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DFS off, 8 tasks&lt;/td&gt;
&lt;td&gt;739.18 +/- 0.351 ms&lt;/td&gt;
&lt;td&gt;0.047%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With DFS enabled, the single-task case shows ~10x more variance than with DFS disabled. The absolute runtime is higher with DFS off (the CPU runs at its base frequency rather than boosting), but the measurements are rock-solid. When benchmarking, consistency matters more than raw speed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pin clock rate to base frequency&lt;/span&gt;
&lt;span class="nb"&gt;echo &lt;/span&gt;2500000 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/devices/system/cpu/cpu&lt;span class="k"&gt;*&lt;/span&gt;/cpufreq/scaling_max_freq

&lt;span class="c"&gt;# Set scaling governor to "performance"&lt;/span&gt;
&lt;span class="nb"&gt;echo &lt;/span&gt;performance &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/devices/system/cpu/cpu&lt;span class="k"&gt;*&lt;/span&gt;/cpufreq/scaling_governor

&lt;span class="c"&gt;# Disable Turbo Boost (Intel CPUs)&lt;/span&gt;
&lt;span class="nb"&gt;echo &lt;/span&gt;1 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/devices/system/cpu/intel_pstate/no_turbo

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Denis Bakhvalov’s &lt;a href="https://github.com/dendibakh/perf-book" rel="noopener noreferrer"&gt;Performance Analysis and Tuning on Modern CPUs&lt;/a&gt; covers CPU-level tuning in depth and is the definitive reference on this topic.&lt;/p&gt;




&lt;h3&gt;
  
  
  Benchmark Design
&lt;/h3&gt;

&lt;p&gt;Environment control reduces noise. Good benchmark design ensures the signal you’re measuring is actually meaningful.&lt;/p&gt;

&lt;h4&gt;
  
  
  Representative Workloads
&lt;/h4&gt;

&lt;p&gt;A benchmark is only useful if it measures something that matters. What does your application actually do?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Archetype&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Idle&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Background workers, minimal load&lt;/td&gt;
&lt;td&gt;Low RPS, minimal CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Microservices, APIs&lt;/td&gt;
&lt;td&gt;High RPS, low CPU per request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Queue workers, batch processing&lt;/td&gt;
&lt;td&gt;Moderate RPS, high CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business apps with DB/API calls&lt;/td&gt;
&lt;td&gt;Moderate RPS, mixed CPU/IO&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Your benchmark workload should match your production workload. A microbenchmark that measures a tight loop in isolation won’t tell you much about how your API server handles realistic traffic patterns.&lt;/p&gt;

&lt;p&gt;That said, microbenchmarks have their place. They’re invaluable for comparing algorithms, validating specific optimizations, and catching regressions in hot paths. The key is knowing which type fits your question:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Benchmark Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Comparing algorithms&lt;/td&gt;
&lt;td&gt;Micro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validating optimizations&lt;/td&gt;
&lt;td&gt;Micro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regression detection&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity planning&lt;/td&gt;
&lt;td&gt;Macro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User experience&lt;/td&gt;
&lt;td&gt;Macro&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Best practice: use both in your pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Coordinated Omission Problem
&lt;/h4&gt;

&lt;p&gt;If your load generator waits for each response before sending the next request, it’s probably lying to you. When the system under test slows down, the generator slows down too — sending fewer requests per second, which artificially improves the measured latencies.&lt;/p&gt;

&lt;p&gt;Gil Tene’s talk &lt;a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU" rel="noopener noreferrer"&gt;“How NOT to Measure Latency”&lt;/a&gt; is the definitive explanation of this problem. The short version: use load generators that maintain a constant request rate regardless of response time. Tools like &lt;a href="https://k6.io/" rel="noopener noreferrer"&gt;k6&lt;/a&gt; and &lt;a href="https://github.com/giltene/wrk2" rel="noopener noreferrer"&gt;wrk2&lt;/a&gt; handle this correctly.&lt;/p&gt;
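&lt;p&gt;To make the effect concrete, here is a small, self-contained simulation (a toy model for illustration, not part of the original talk): a service answers in 1 ms except for a single 1-second stall, measured once by a closed-loop generator that waits for each response and once by an open-loop generator that sends at a constant rate.&lt;/p&gt;

```python
import statistics

STALL_START, STALL_END = 5000.0, 6000.0  # the service stalls for 1 s

def closed_loop(duration_ms=10_000.0):
    """Closed loop: send the next request only after the previous reply."""
    t, latencies = 0.0, []
    while t < duration_ms:
        # During the stall, the single in-flight request takes 1000 ms.
        lat = 1000.0 if STALL_START <= t < STALL_END else 1.0
        latencies.append(lat)
        t += lat  # the generator itself blocks: the stall yields ONE sample
    return latencies

def open_loop(duration_ms=10_000.0, interval_ms=1.0):
    """Open loop: issue requests on a fixed schedule, regardless of replies."""
    latencies, t = [], 0.0
    while t < duration_ms:
        if STALL_START <= t < STALL_END:
            # A request arriving mid-stall queues until the stall clears.
            lat = (STALL_END - t) + 1.0
        else:
            lat = 1.0
        latencies.append(lat)
        t += interval_ms
    return latencies

closed = closed_loop()
opened = open_loop()
print(f"closed-loop mean latency: {statistics.mean(closed):6.1f} ms")  # ~1 ms
print(f"open-loop mean latency:   {statistics.mean(opened):6.1f} ms")  # ~51 ms
```

&lt;p&gt;The closed-loop generator blocks during the stall and records it as one slow sample, so its mean stays near 1 ms; the open-loop generator keeps sending and reports the queueing delay real users would have seen.&lt;/p&gt;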

&lt;h4&gt;
  
  
  Warm-Up and Steady State
&lt;/h4&gt;

&lt;p&gt;We learned this the hard way with a Java benchmark. The goal: measure instrumentation overhead on a Spring application. Initial setup: 20-second warmup, 15 seconds of measurements, collecting one sample per second.&lt;/p&gt;

&lt;p&gt;The coefficient of variation was &lt;strong&gt;11.80%&lt;/strong&gt; — far too noisy to detect real changes.&lt;/p&gt;

&lt;p&gt;The problem was warmup. The JVM compiles methods on the fly (JIT compilation). Each method needs to be called enough times to hit the compilation threshold, then you wait for the compiler to finish. Twenty seconds wasn’t nearly enough. By extending the warmup to 160 seconds and the measurement period to match, the picture changed completely.&lt;/p&gt;

&lt;p&gt;From the experiments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip 1:&lt;/strong&gt; Run benchmarks long enough to uncover perturbations like warmup effects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip 2:&lt;/strong&gt; Collect enough samples to reduce intra-run variation. N &amp;gt;= 30 is a reasonable minimum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip 3:&lt;/strong&gt; Rerun benchmarks multiple times to reduce inter-run variation. M &amp;gt;= 5 runs helps account for &lt;a href="https://link.springer.com/chapter/10.1007/11758525_26" rel="noopener noreferrer"&gt;random initial state effects&lt;/a&gt; (cache layout, memory placement).&lt;/p&gt;

&lt;p&gt;Applying all three tips reduced the coefficient of variation from &lt;strong&gt;11.80% to 2.94%&lt;/strong&gt; — a 4x improvement from benchmark design alone, before any environment control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip 4:&lt;/strong&gt; Use deterministic inputs. Non-deterministic data leads to non-deterministic measurements.&lt;/p&gt;
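&lt;p&gt;The coefficient of variation quoted throughout is straightforward to compute from raw samples. Here is a minimal sketch with made-up latency values:&lt;/p&gt;

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean, as a percentage."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100

# Hypothetical latency samples (ms): a short, noisy run vs. a properly
# warmed-up run of the same benchmark.
short_run = [118, 95, 132, 101, 88, 125, 97, 140, 93, 110]
warm_run = [100, 101, 99, 100, 102, 98, 101, 100, 99, 100]

print(f"short run CV: {coefficient_of_variation(short_run):.2f}%")
print(f"warm run CV:  {coefficient_of_variation(warm_run):.2f}%")
```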




&lt;h3&gt;
  
  
  Statistical Methods
&lt;/h3&gt;

&lt;p&gt;You’ve controlled the environment and designed a good benchmark. Now you have data. The question is: is the difference you’re seeing real, or noise?&lt;/p&gt;

&lt;h4&gt;
  
  
  Why Averages Lie
&lt;/h4&gt;

&lt;p&gt;Consider a throughput benchmark run before and after a code change. The “before” mean is 102.7 req/s. The “after” mean is 105.0 req/s. That’s a 2.3% improvement. Ship it?&lt;/p&gt;

&lt;p&gt;Not so fast. Each of those means summarizes a distribution of individual measurements. If those distributions overlap significantly, the difference between the means might not be statistically significant — it could easily arise from random variation alone.&lt;/p&gt;

&lt;h4&gt;
  
  
  Hypothesis Testing
&lt;/h4&gt;

&lt;p&gt;The intuition is straightforward: compare the size of the difference to the size of the noise.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/Welch%27s_t-test" rel="noopener noreferrer"&gt;Welch’s t-test&lt;/a&gt; formalizes this. It computes a test statistic &lt;em&gt;t&lt;/em&gt; that is essentially the ratio of the mean difference to the standard error. If &lt;em&gt;t&lt;/em&gt; exceeds a critical value (determined by your chosen false positive rate, alpha), you can conclude the difference is statistically significant.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;a statistically significant result tells you the difference is unlikely to be zero, but not that the difference is large or practically meaningful.&lt;/strong&gt; Always pair hypothesis testing with effect size estimates. A 0.1% improvement might be statistically significant with enough samples — but not worth the code complexity.&lt;/p&gt;
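&lt;p&gt;In practice you would reach for &lt;code&gt;scipy.stats.ttest_ind(..., equal_var=False)&lt;/code&gt;; as a sketch of what it computes, here is the statistic from scratch, on made-up throughput samples that mirror the 102.7 vs. 105.0 req/s scenario:&lt;/p&gt;

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic: difference of means over its standard error."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical throughput samples (req/s) before and after a change.
before = [101.2, 103.5, 99.8, 104.1, 102.9, 101.0, 103.8, 105.3]
after = [104.0, 106.2, 103.1, 105.8, 104.9, 105.5, 103.9, 106.6]

t = welch_t(after, before)
print(f"t = {t:.2f}")  # roughly 2.9 for these samples
```

&lt;p&gt;With &lt;em&gt;t&lt;/em&gt; around 2.9 at roughly 12 degrees of freedom, the difference clears the usual alpha = 0.05 critical value (about 2.2), so it is unlikely to be noise alone. Whether a 2.3% win justifies the change is the separate effect-size question.&lt;/p&gt;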

&lt;h4&gt;
  
  
  Change Point Detection
&lt;/h4&gt;

&lt;p&gt;Hypothesis testing works well when you have a clear “before” and “after.” But what about continuous benchmarking, where you’re tracking performance across hundreds of commits?&lt;/p&gt;

&lt;p&gt;Change point detection algorithms scan a time series and identify where the underlying distribution shifts. The &lt;a href="https://aakinshin.net/posts/edpelt/" rel="noopener noreferrer"&gt;e-divisive method&lt;/a&gt; (ED-PELT) is particularly effective for benchmark data. It handles non-normal distributions, detects multiple change points, and works well with the kind of noisy data that benchmarks produce.&lt;/p&gt;
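&lt;p&gt;The core idea can be sketched in a few lines: try every split point and keep the one that divides the series into the two most internally consistent segments. This toy detector finds a single change point in made-up data; real implementations like ED-PELT handle multiple change points and non-normal noise.&lt;/p&gt;

```python
import statistics

def best_change_point(series):
    """Return the split index that minimizes total within-segment variance."""
    best_i, best_cost = None, float("inf")
    for i in range(2, len(series) - 1):
        left, right = series[:i], series[i:]
        cost = (statistics.pvariance(left) * len(left)
                + statistics.pvariance(right) * len(right))
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i

# Hypothetical per-commit benchmark times (ms); a regression lands at commit 10.
times = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100,
         110, 111, 109, 112, 110, 111, 108, 110, 112, 109]
print(best_change_point(times))  # → 10
```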

&lt;p&gt;Netflix’s engineering team wrote an excellent post on &lt;a href="https://netflixtechblog.com/fixing-performance-regressions-before-they-happen-eab2602b86fe" rel="noopener noreferrer"&gt;fixing performance regressions before they happen&lt;/a&gt;, which covers their use of change point detection in continuous benchmarking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.nyrkio.com/2025/06/12/slides-from-presentation-to-spec-devops-performance-wg/" rel="noopener noreferrer"&gt;Henrik Ingo&lt;/a&gt; (who spoke in the same Software Performance Devroom at FOSDEM) has published extensively on applying these methods in practice.&lt;/p&gt;

&lt;h4&gt;
  
  
  Visualization: Strip Plots Over Boxplots
&lt;/h4&gt;

&lt;p&gt;Boxplots hide too much. They show quartiles and a median, but they obscure the actual distribution shape — bimodality, outlier clusters, and gaps all disappear into a box.&lt;/p&gt;

&lt;p&gt;Strip plots (dot plots of every individual measurement) are better for benchmark data. They make outliers obvious, reveal distribution shape at a glance, and scale well for the sample sizes typical in benchmarking (30-200 points).&lt;/p&gt;
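&lt;p&gt;As a quick illustration of what a boxplot hides, here is a throwaway text-mode strip plot (a sketch; in practice you would use matplotlib or seaborn’s &lt;code&gt;stripplot&lt;/code&gt;) rendering a bimodal sample:&lt;/p&gt;

```python
def strip_plot(samples, width=40):
    """Render one dot per measurement on a single text axis."""
    lo, hi = min(samples), max(samples)
    row = [" "] * (width + 1)
    for s in samples:
        row[round((s - lo) / (hi - lo) * width)] = "*"
    return f"{lo:6.1f} |{''.join(row)}| {hi:.1f}"

# Bimodal latencies (ms): two clusters a boxplot would merge into one box.
samples = [10.1, 10.3, 9.8, 10.0, 10.2, 9.9, 10.4,
           19.8, 20.1, 20.3, 19.9, 20.0]
print(strip_plot(samples))
```

&lt;p&gt;A boxplot of this sample would draw its box across the empty middle of the axis; the two dot clusters make the bimodality impossible to miss.&lt;/p&gt;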

&lt;p&gt;Brendan Gregg’s work on &lt;a href="https://www.brendangregg.com/FrequencyTrails/outliers.html#Causes" rel="noopener noreferrer"&gt;frequency trails&lt;/a&gt; is excellent on this topic — showing how visualization choices affect your ability to detect real patterns in performance data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Integrating Into Development Workflows
&lt;/h3&gt;

&lt;p&gt;Reliable measurement is only half the problem. The other half is making performance a first-class part of the development process.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Feedback Loop
&lt;/h4&gt;

&lt;p&gt;The ideal: a developer opens a pull request, benchmarks run automatically, and within minutes they see whether their changes have performance implications. If there’s a regression, they know about it before the code merges — not weeks later when a customer notices.&lt;/p&gt;

&lt;p&gt;This requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automated benchmark execution&lt;/strong&gt; triggered by code changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical analysis&lt;/strong&gt; to distinguish real regressions from noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear reporting&lt;/strong&gt; that developers can act on — not a wall of numbers, but a concise “this got 3% slower, here’s the data”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local reproducibility&lt;/strong&gt; so developers can investigate and fix regressions on their own machines&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Performance Quality Gates
&lt;/h4&gt;

&lt;p&gt;Beyond PR-level feedback, performance quality gates can block releases that don’t meet defined SLOs. The philosophy is the same as any other quality gate — you wouldn’t ship without passing tests, so don’t ship without passing performance benchmarks.&lt;/p&gt;

&lt;h4&gt;
  
  
  When to Benchmark
&lt;/h4&gt;

&lt;p&gt;The answer depends on your resources and risk tolerance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Every PR&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Complete&lt;/td&gt;
&lt;td&gt;Critical paths, performance-sensitive libraries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Periodic (nightly/weekly)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Trend detection&lt;/td&gt;
&lt;td&gt;General regression catching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-demand&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Targeted&lt;/td&gt;
&lt;td&gt;Investigation, optimization validation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most teams, a combination works best: lightweight benchmarks on every PR, comprehensive macrobenchmarks nightly, and on-demand deep dives when investigating specific issues.&lt;/p&gt;

&lt;h4&gt;
  
  
  Open Source Tools
&lt;/h4&gt;

&lt;p&gt;You don’t need to build a benchmarking platform from scratch. Several open source projects can get you started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://bencher.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;bencher.dev&lt;/strong&gt;&lt;/a&gt; — Continuous benchmarking as a service. Tracks benchmark results over time, detects regressions, and integrates with CI/CD.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sharkdp/hyperfine" rel="noopener noreferrer"&gt;&lt;strong&gt;hyperfine&lt;/strong&gt;&lt;/a&gt; — A CLI benchmarking tool for comparing command execution times. Handles warmup, statistical analysis, and parameterized runs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/benchmark-action/github-action-benchmark" rel="noopener noreferrer"&gt;&lt;strong&gt;github-action-benchmark&lt;/strong&gt;&lt;/a&gt; — GitHub Action for running benchmarks and tracking results over time, with support for Go, Python, Rust, and other language-specific benchmark formats.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dandavison/chronologer" rel="noopener noreferrer"&gt;&lt;strong&gt;chronologer&lt;/strong&gt;&lt;/a&gt; — Benchmark tracking focused on Go benchmarks with historical comparison.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.nyrkio.com/2025/05/08/welcome-apache-otava-incubating-project/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apache Otava&lt;/strong&gt;&lt;/a&gt; (formerly Nyrkio, incubating) — Performance change point detection service, built on the e-divisive algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aclements/perflock" rel="noopener noreferrer"&gt;&lt;strong&gt;perflock&lt;/strong&gt;&lt;/a&gt; — A tool for locking CPU frequency and other system settings during benchmarks. Useful for local development.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right tool depends on your language ecosystem, CI system, and how much you want to self-host vs. use a managed service.&lt;/p&gt;




&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;Four things to remember:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Control your benchmarking environment.&lt;/strong&gt; Bare metal instances, CPU isolation, disable SMT, disable dynamic frequency scaling. Environment noise is the single largest source of unreliable measurements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design your benchmarks to be representative and repeatable.&lt;/strong&gt; Match your production workload. Run long enough. Collect enough samples. Rerun multiple times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interpret results with statistical rigor.&lt;/strong&gt; Don’t trust averages. Use hypothesis testing or change point detection. Always ask: is this difference real, or noise?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integrate benchmarks into your development workflow.&lt;/strong&gt; Run continuously. Catch regressions on PRs. Make performance feedback as fast as test feedback.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Performance Matters
&lt;/h3&gt;

&lt;p&gt;Performance is not always the first thing we think about when building software. We focus on features, correctness, and security, and rightly so. But in the end, performance is what users experience.&lt;/p&gt;

&lt;p&gt;Low latency means your users aren’t waiting. High throughput means your system handles the load. Cost-efficient performance means you’re not burning money (and energy) on infrastructure that could be halved with the right optimization. A &lt;a href="https://www.brendangregg.com/blog/2020-07-15/systems-performance-2nd-edition.html" rel="noopener noreferrer"&gt;500ms delay costs Google 20% of their traffic&lt;/a&gt;. A 400ms improvement gave Yahoo 5-9% more traffic. The numbers are real.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Not all fast software is world-class, but all world-class software is fast.” – Tobi Lutke, CEO of Shopify&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So write benchmarks. Run them continuously. Catch regressions before your users do.&lt;/p&gt;

&lt;p&gt;And don’t shout in the datacenter.&lt;/p&gt;




&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/igoragoli/fosdem-2026-software-performance" rel="noopener noreferrer"&gt;Slides and experiments (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=8211fNI_nc4" rel="noopener noreferrer"&gt;Talk recording (YouTube)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/talks/how-to-reliably-measure-software-performance/"&gt;Talk page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/fosdem-2026/"&gt;FOSDEM 2026 recap&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/otel-unplugged-eu-2026/"&gt;OTel Unplugged EU 2026: Field Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Bakhvalov, D. — &lt;a href="https://github.com/dendibakh/perf-book" rel="noopener noreferrer"&gt;Performance Analysis and Tuning on Modern CPUs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Gregg, B. — &lt;a href="https://www.brendangregg.com/blog/2020-07-15/systems-performance-2nd-edition.html" rel="noopener noreferrer"&gt;Systems Performance: Enterprise and the Cloud&lt;/a&gt;, 2nd ed.&lt;/li&gt;
&lt;li&gt;Tene, G. — &lt;a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU" rel="noopener noreferrer"&gt;How NOT to Measure Latency&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kalibera, T. et al. — &lt;a href="https://link.springer.com/chapter/10.1007/11758525_26" rel="noopener noreferrer"&gt;Benchmark Precision and Random Initial State&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Leiserson, C. et al. — &lt;a href="https://science.sciencemag.org/content/368/6495/eaam9744" rel="noopener noreferrer"&gt;There’s Plenty of Room at the Top&lt;/a&gt; (Science, 2020)&lt;/li&gt;
&lt;li&gt;Netflix Engineering — &lt;a href="https://netflixtechblog.com/fixing-performance-regressions-before-they-happen-eab2602b86fe" rel="noopener noreferrer"&gt;Fixing Performance Regressions Before They Happen&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ingo, H. — &lt;a href="https://blog.nyrkio.com/2025/06/12/slides-from-presentation-to-spec-devops-performance-wg/" rel="noopener noreferrer"&gt;Change Point Detection for Performance&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Gregg, B. — &lt;a href="https://www.brendangregg.com/FrequencyTrails/outliers.html#Causes" rel="noopener noreferrer"&gt;Frequency Trails: Outliers&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/company/blog/engineering/reducing-variability-performance-tests-ec2-setup-key-results" rel="noopener noreferrer"&gt;MongoDB: Reducing Variability in EC2 Performance Tests&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>computerscience</category>
      <category>performance</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>Auto-Instrumenting Go: From eBPF to USDT Probes</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Fri, 27 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/auto-instrumenting-go-from-ebpf-to-usdt-probes-3hgd</link>
      <guid>https://forem.com/kakkoyun/auto-instrumenting-go-from-ebpf-to-usdt-probes-3hgd</guid>
      <description>&lt;p&gt;This post expands on the &lt;a href="https://dev.to/talks/how-to-instrument-go-without-changing-code/"&gt;FOSDEM 2026 Go Devroom talk&lt;/a&gt; I co-presented with &lt;a href="https://hannahkm.github.io" rel="noopener noreferrer"&gt;Hannah S. Kim&lt;/a&gt;. The talk, demo code, and all benchmark scenarios are available in the &lt;a href="https://github.com/kakkoyun/fosdem-2026" rel="noopener noreferrer"&gt;fosdem-2026 repository&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Go is one of the best languages for building production backend services. It compiles to native binaries, has excellent concurrency primitives, and produces predictable performance characteristics. But when it comes to auto-instrumentation — adding observability without modifying source code — Go is uniquely difficult.&lt;/p&gt;

&lt;p&gt;In the JVM world, bytecode manipulation gives you powerful hooks. Java agents can intercept method calls, inject tracing, and propagate context without the application developer knowing. Python and Node.js have similar dynamic capabilities. Go has none of this.&lt;/p&gt;

&lt;p&gt;The reasons are structural:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static compilation.&lt;/strong&gt; Go compiles to a single native binary. There is no intermediate bytecode to rewrite at load time, no classloader to intercept, no dynamic linking by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;LD_PRELOAD&lt;/code&gt;.&lt;/strong&gt; Go’s default static linking means the &lt;code&gt;LD_PRELOAD&lt;/code&gt; trick that works for C/C++ applications (and that the &lt;a href="https://github.com/open-telemetry/opentelemetry-injector" rel="noopener noreferrer"&gt;OTel Injector&lt;/a&gt; uses for Java, .NET, and Node.js) doesn’t apply.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique calling convention.&lt;/strong&gt; Go’s ABI passes arguments in registers with a convention different from the platform C ABI. This makes dynamic hooking with tools like Frida or ptrace significantly harder — you can’t just read standard frame pointers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goroutine stack management.&lt;/strong&gt; Goroutines start with small, growable stacks that the runtime resizes by copying, so stack addresses can change at any time. Traditional stack-walking assumptions break.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap between “Go is great for production” and “Go is hard to auto-instrument” is real. This is the gap we set out to map.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Comparison Framework
&lt;/h3&gt;

&lt;p&gt;We built a &lt;a href="https://github.com/kakkoyun/fosdem-2026" rel="noopener noreferrer"&gt;demo repository&lt;/a&gt; with the same Go HTTP server implemented across seven scenarios, each using a different instrumentation approach. The application is deliberately simple — an HTTP server with configurable CPU load, memory allocation, and off-CPU time — so that instrumentation overhead is isolated and measurable.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Seven Scenarios
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;default&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Baseline. No instrumentation of any kind.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;manual&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OTel SDK&lt;/td&gt;
&lt;td&gt;Manual OpenTelemetry SDK integration — explicit tracer initialization, span creation via &lt;code&gt;otelhttp&lt;/code&gt;, and context propagation. The “standard” way.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;obi&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;eBPF (OBI)&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation" rel="noopener noreferrer"&gt;OpenTelemetry eBPF Instrumentation&lt;/a&gt;. Network-level eBPF hooks. Runs as a sidecar, attaches to the running process. No code changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ebpf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;eBPF (Auto)&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation" rel="noopener noreferrer"&gt;OpenTelemetry Go Auto-Instrumentation&lt;/a&gt;. Uprobe-based eBPF hooks targeting Go runtime functions. No code changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;orchestrion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Compile-time&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/datadog/orchestrion" rel="noopener noreferrer"&gt;Datadog Orchestrion&lt;/a&gt; with OTel SDK. AST transformation via &lt;code&gt;-toolexec&lt;/code&gt; at compile time. Requires a rebuild but no source changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;libstabst&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;USDT (salp)&lt;/td&gt;
&lt;td&gt;USDT probes via &lt;a href="https://github.com/mmcshane/salp" rel="noopener noreferrer"&gt;salp&lt;/a&gt;/&lt;a href="https://github.com/sthima/libstapsdt" rel="noopener noreferrer"&gt;libstapsdt&lt;/a&gt;, consumed by a bpftrace sidecar that exports to OTLP. Proof of concept.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;code&gt;usdt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;USDT (native)&lt;/td&gt;
&lt;td&gt;Native USDT probes via a &lt;a href="https://github.com/kakkoyun/go/tree/poc_usdt" rel="noopener noreferrer"&gt;custom Go fork&lt;/a&gt; that adds probe points to &lt;code&gt;net/http&lt;/code&gt;, &lt;code&gt;database/sql&lt;/code&gt;, &lt;code&gt;crypto/tls&lt;/code&gt;, and &lt;code&gt;net&lt;/code&gt;. Proof of concept.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each scenario runs in Docker with an identical observability stack (OTel Collector, Jaeger, Prometheus) and is load-tested with identical parameters.&lt;/p&gt;

&lt;h4&gt;
  
  
  Evaluation Axes
&lt;/h4&gt;

&lt;p&gt;We compared the approaches across three dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance overhead&lt;/strong&gt; — latency, CPU, memory (RSS), throughput&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robustness&lt;/strong&gt; — stability across Go versions, container environments, failure modes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational friction&lt;/strong&gt; — deployment complexity, privilege requirements, debugging&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Manual OTel SDK (Baseline for Comparison)
&lt;/h3&gt;

&lt;p&gt;The manual scenario is not auto-instrumentation — it is the standard way to instrument a Go service. You import the OTel SDK, initialize a tracer provider, wrap your HTTP handler with &lt;code&gt;otelhttp.NewHandler&lt;/code&gt;, and create spans explicitly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;setupHandlers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mux&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServeMux&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/health"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HealthHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/load"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;otelhttp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;LoadHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"manual"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"manual.handler"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;End&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c"&gt;// ... business logic&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you full control — custom span attributes, context propagation, error recording. But it requires code changes in every service, and those changes accumulate. Multiply by a hundred microservices and you understand why auto-instrumentation matters.&lt;/p&gt;




&lt;h3&gt;
  
  
  Compile-Time: Orchestrion and OTel Compile-Time Instrumentation
&lt;/h3&gt;

&lt;p&gt;Orchestrion uses Go’s &lt;code&gt;-toolexec&lt;/code&gt; flag to intercept the compilation pipeline. During the AST transformation phase, it injects instrumentation code — adding OTel spans, wrapping handlers, propagating context — without the developer modifying source files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go build &lt;span class="nt"&gt;-toolexec&lt;/span&gt; &lt;span class="s1"&gt;'orchestrion toolexec'&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; myapp &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# Or equivalently:&lt;/span&gt;
orchestrion go build &lt;span class="nt"&gt;-o&lt;/span&gt; myapp &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mechanism is aspect-oriented: you declare join points (e.g., “any function in package &lt;code&gt;main&lt;/code&gt; named &lt;code&gt;LoadHandler&lt;/code&gt;”) and advice (e.g., “prepend a span creation statement”). The transformation happens at the AST level before the compiler emits machine code.&lt;/p&gt;

&lt;p&gt;Orchestrion supports OpenTelemetry natively — it is not Datadog-specific. In January 2025, Datadog and Alibaba began merging their compile-time instrumentation efforts into a unified solution under the &lt;a href="https://github.com/open-telemetry/opentelemetry-go-compile-instrumentation" rel="noopener noreferrer"&gt;OpenTelemetry Compile-Time Instrumentation SIG&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires a rebuild. You cannot instrument already-deployed binaries.&lt;/li&gt;
&lt;li&gt;Deepest instrumentation of all approaches — it can instrument stdlib and dependencies.&lt;/li&gt;
&lt;li&gt;Zero runtime overhead from the instrumentation mechanism itself (the injected OTel code has the same cost as manual instrumentation).&lt;/li&gt;
&lt;li&gt;Stable across Go versions (the toolexec interface is stable).&lt;/li&gt;
&lt;li&gt;No kernel privileges required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a deeper dive into the &lt;code&gt;-toolexec&lt;/code&gt; mechanism, see my earlier &lt;a href="https://dev.to/talks/unleashing-the-go-toolchain/"&gt;Unleashing the Go Toolchain&lt;/a&gt; talk from GopherCon UK 2025.&lt;/p&gt;




&lt;h3&gt;
  
  
  eBPF Approaches
&lt;/h3&gt;

&lt;h4&gt;
  
  
  OBI (OpenTelemetry eBPF Instrumentation)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation" rel="noopener noreferrer"&gt;OBI&lt;/a&gt; takes a network-level approach. It uses eBPF programs to hook into kernel-level network operations, intercepting HTTP/S and gRPC traffic. It is multi-language — Go, Java, .NET, Python, Node.js, Ruby, Rust — because it operates at the protocol layer rather than the language runtime layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;docker run --privileged \
  --pid=container:myapp \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318 \
  otel/ebpf-instrumentation:latest

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OBI runs as a sidecar container. It attaches to the target process’s PID namespace and loads eBPF programs that intercept network system calls. No source code modification, no recompilation, no restart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; or privileged containers. Security teams push back on this.&lt;/li&gt;
&lt;li&gt;Limited to what eBPF can observe at the network level. Application-internal spans are not visible.&lt;/li&gt;
&lt;li&gt;Protocol coverage is growing: HTTP/S, gRPC, TLS visibility.&lt;/li&gt;
&lt;li&gt;Excellent for topology mapping and network observability beyond just tracing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  OTel Go Auto-Instrumentation
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation" rel="noopener noreferrer"&gt;OpenTelemetry Go Auto-Instrumentation&lt;/a&gt; project uses uprobe-based eBPF hooks that target specific Go runtime functions. Unlike OBI’s network-level approach, this hooks directly into Go function prologues.&lt;/p&gt;

&lt;p&gt;This project is effectively in maintenance mode. Several of its contributors have moved to OBI. At &lt;a href="https://dev.to/posts/otel-unplugged-eu-2026/"&gt;OTel Unplugged EU 2026&lt;/a&gt;, the frank assessment was: the people moved to where the momentum is.&lt;/p&gt;




&lt;h3&gt;
  
  
  Runtime Injection: Frida and ptrace
&lt;/h3&gt;

&lt;p&gt;Beyond the seven benchmarked scenarios, the demo also includes an &lt;code&gt;injector&lt;/code&gt; scenario that explores dynamic instrumentation via &lt;a href="https://frida.re/" rel="noopener noreferrer"&gt;Frida&lt;/a&gt;, a ptrace-based toolkit for runtime function hooking. The idea is conceptually simple: attach to a running process, find the function you want to hook, and replace its prologue with a trampoline that calls your instrumentation code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// The application code uses //go:noinline to keep functions hookable.&lt;/span&gt;
&lt;span class="c"&gt;//go:noinline&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;LoadHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// ... business logic&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, this is extremely hard for Go binaries. &lt;a href="https://blog.quarkslab.com/lets-go-into-the-rabbit-hole-part-1-the-challenges-of-dynamically-hooking-golang-program.html" rel="noopener noreferrer"&gt;Quarkslab’s excellent three-part series&lt;/a&gt; documents the challenges in detail: Go’s register-based calling convention, goroutine stack relocation, and compiler optimizations (inlining, dead code elimination) all conspire against reliable dynamic hooking.&lt;/p&gt;

&lt;p&gt;The demo’s injector scenario includes a helper tool that uses &lt;code&gt;unsafe.Offsetof&lt;/code&gt; to find &lt;code&gt;http.Request&lt;/code&gt; struct field offsets — information you need just to read the HTTP method and path from a hooked function’s arguments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works with existing binaries. No rebuild required.&lt;/li&gt;
&lt;li&gt;Requires &lt;code&gt;-gcflags="all=-N -l"&lt;/code&gt; to disable optimizations, which defeats the purpose for production.&lt;/li&gt;
&lt;li&gt;Fragile across Go versions — struct layouts and calling conventions change.&lt;/li&gt;
&lt;li&gt;Limited applicability for Go’s statically linked binaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Go, this approach is more useful as a debugging tool than a production instrumentation strategy.&lt;/p&gt;




&lt;h3&gt;
  
  
  USDT Probes: The Novel Part
&lt;/h3&gt;

&lt;p&gt;USDT (User Statically-Defined Tracing) probes are a mechanism from the DTrace/SystemTap ecosystem. They are marker points compiled into a binary that external tooling (bpftrace, perf, DTrace) can attach to at runtime. The key property: &lt;strong&gt;when no consumer is attached, the probe site is a NOP instruction with zero overhead.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We built two proof-of-concept implementations.&lt;/p&gt;

&lt;h4&gt;
  
  
  libstabst: USDT via salp and bpftrace
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;libstabst&lt;/code&gt; scenario uses &lt;a href="https://github.com/mmcshane/salp" rel="noopener noreferrer"&gt;salp&lt;/a&gt;, a Go binding to &lt;a href="https://github.com/sthima/libstapsdt" rel="noopener noreferrer"&gt;libstapsdt&lt;/a&gt;, to create USDT probes at runtime. The application defines probe points for request start and end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;probes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;salp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fosdem"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reqStart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;probes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddProbe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"request_start"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reqEnd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;probes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddProbe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"request_end"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;probes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;// In the handler:&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;reqStart&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;reqStart&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Enabled&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;reqStart&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reqID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startTime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A bpftrace sidecar attaches to these probes and exports events as OTLP traces via a custom exporter bridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Known limitations:&lt;/strong&gt; The salp library has compatibility issues with Go 1.25+, pinning this scenario to Go 1.23.x. It also needs &lt;code&gt;/proc/self/fd/&lt;/code&gt; access for mmap, which fails in many container environments. On bare metal Linux or in a Lima VM, it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4anzzwbubc0jesmzb57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4anzzwbubc0jesmzb57.png" alt="Presenting the USDT + eBPF proof of concept at the Go Devroom"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Native USDT: Custom Go Fork
&lt;/h4&gt;

&lt;p&gt;The more ambitious PoC is a &lt;a href="https://github.com/kakkoyun/go/tree/poc_usdt" rel="noopener noreferrer"&gt;custom Go fork&lt;/a&gt; that adds USDT probe points directly to the Go standard library — &lt;code&gt;net/http&lt;/code&gt;, &lt;code&gt;database/sql&lt;/code&gt;, &lt;code&gt;crypto/tls&lt;/code&gt;, and &lt;code&gt;net&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"runtime/trace/usdt"&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;handleRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;usdt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"myapp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"request_start"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;usdt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Probe1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"myapp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"request_end"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c"&gt;// ... handle request&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fork includes a &lt;code&gt;go tool usdt&lt;/code&gt; command for listing probes in a binary and generating bpftrace scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go tool usdt list ./myserver
&lt;span class="go"&gt;PROVIDER NAME ADDRESS ARGUMENTS
net_http server_request_start 0x63296c 8@%rsi -8@%r8

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go tool usdt bpftrace ./myserver &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; trace.bt
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;bpftrace trace.bt
&lt;span class="go"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This PoC proves that native USDT support in Go is technically feasible. The standard library instrumentation is automatically available in any binary built with the fork — no application code changes, no SDK imports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Known limitations:&lt;/strong&gt; ARM64 argument parsing in bpftrace has issues with the probe argument notation emitted by the fork. The fork is strictly a proof of concept and not suitable for production.&lt;/p&gt;




&lt;h3&gt;
  
  
  Go Runtime PoCs: Flight Recording
&lt;/h3&gt;

&lt;p&gt;Beyond USDT, we explored a &lt;a href="https://github.com/kakkoyun/go/tree/poc_flight_recorder" rel="noopener noreferrer"&gt;flight recorder PoC&lt;/a&gt; based on &lt;a href="https://github.com/golang/go/issues/63185" rel="noopener noreferrer"&gt;golang/go#63185&lt;/a&gt;. The concept: always-on distributed tracing built into the Go runtime, with a bounded ring buffer and GODEBUG-based activation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"runtime/trace/flight"&lt;/span&gt;

&lt;span class="n"&gt;flight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Enable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTP&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;flight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SQL&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;flight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Net&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;flight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// Export on error or crash&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flight recorder PoC watches for trace files produced by the runtime, converts them to OTLP spans, and exports to a collector. If Go’s runtime trace facilities eventually gain W3C Trace Context propagation, this could become the lowest-friction instrumentation path for Go — no SDK, no eBPF, no compile-time tools. Just the runtime doing what runtimes should do.&lt;/p&gt;




&lt;h3&gt;
  
  
  Benchmark Results
&lt;/h3&gt;

&lt;p&gt;We ran each scenario under identical load conditions using a Docker-based observability stack with 5-minute sustained load tests.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;th&gt;Memory (RSS)&lt;/th&gt;
&lt;th&gt;Max Latency&lt;/th&gt;
&lt;th&gt;Max Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (no instrumentation)&lt;/td&gt;
&lt;td&gt;10.2%&lt;/td&gt;
&lt;td&gt;202 MiB&lt;/td&gt;
&lt;td&gt;4.50 ms&lt;/td&gt;
&lt;td&gt;3.1k req/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual OTel SDK&lt;/td&gt;
&lt;td&gt;10.3% (+0.1%)&lt;/td&gt;
&lt;td&gt;210 MiB (+8 MiB)&lt;/td&gt;
&lt;td&gt;3.02 ms&lt;/td&gt;
&lt;td&gt;13.97k req/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eBPF Auto-Instrumentation&lt;/td&gt;
&lt;td&gt;10.0% (-0.3%)&lt;/td&gt;
&lt;td&gt;204 MiB (+2 MiB)&lt;/td&gt;
&lt;td&gt;3.07 ms&lt;/td&gt;
&lt;td&gt;4.57k req/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compile-time (Orchestrion)&lt;/td&gt;
&lt;td&gt;9.8% (-0.4%)&lt;/td&gt;
&lt;td&gt;210 MiB (+8 MiB)&lt;/td&gt;
&lt;td&gt;2.59 ms&lt;/td&gt;
&lt;td&gt;27.8k req/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things stand out. The CPU and memory overhead across all approaches is negligible for this workload. The throughput differences are more interesting — Orchestrion’s compile-time approach achieved the highest throughput, likely because the OTel code injected at compile time benefits from the same optimizations as the rest of the application. The eBPF approach showed lower throughput, consistent with the overhead of crossing the kernel boundary for each intercepted call.&lt;/p&gt;

&lt;p&gt;The USDT scenarios (&lt;code&gt;libstapsdt&lt;/code&gt; and &lt;code&gt;usdt&lt;/code&gt;) are not included in the table because they are proof-of-concept implementations with different exporter architectures. The core property of USDT — zero overhead when probes are not attached — was confirmed, but end-to-end benchmarking against the other approaches requires further work.&lt;/p&gt;

&lt;p&gt;Full benchmark data and reproduction instructions are in the &lt;a href="https://github.com/kakkoyun/fosdem-2026" rel="noopener noreferrer"&gt;demo repository&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Ecosystem Picture
&lt;/h3&gt;

&lt;p&gt;The most grounded take from the &lt;a href="https://dev.to/posts/otel-unplugged-eu-2026/"&gt;OTel Unplugged EU 2026&lt;/a&gt; OBI/eBPF session:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Document the trade-offs between OBI, compile-time, injector, and SDK. Let people choose. Make them aware of each other and let them work together.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These approaches are not competing. They serve different deployment scenarios and can coexist.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Compile-time&lt;/strong&gt; (Orchestrion)&lt;/td&gt;
&lt;td&gt;AST transformation via &lt;code&gt;-toolexec&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Deepest instrumentation, security-sensitive environments&lt;/td&gt;
&lt;td&gt;Requires rebuild&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;eBPF/OBI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kernel-level network hooks&lt;/td&gt;
&lt;td&gt;Runtime flexibility, multi-language, no restart&lt;/td&gt;
&lt;td&gt;Needs kernel privileges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;eBPF Auto&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Uprobe hooks on Go functions&lt;/td&gt;
&lt;td&gt;Go-specific deep tracing without code changes&lt;/td&gt;
&lt;td&gt;Maintenance mode, fragile across Go versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Injector/SSI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;K8s operator + &lt;code&gt;LD_PRELOAD&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Lowest friction onboarding&lt;/td&gt;
&lt;td&gt;Does not work for Go’s static binaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;USDT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compiled probe points + bpftrace&lt;/td&gt;
&lt;td&gt;Zero overhead when not tracing, future potential&lt;/td&gt;
&lt;td&gt;Proof of concept, ecosystem immaturity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The vision articulated at OTel Unplugged — &lt;code&gt;apt install opentelemetry&lt;/code&gt; and everything works — requires all these layers coordinating. OBI detecting the Injector and backing off. Compile-time instrumentation detecting existing SDK usage. USDT probes coexisting with eBPF hooks. We are not there yet, but the direction is clear.&lt;/p&gt;




&lt;h3&gt;
  
  
  Future Directions
&lt;/h3&gt;

&lt;p&gt;Several threads from the talk and surrounding conversations point forward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OTel Compile-Time SIG.&lt;/strong&gt; The merger between Datadog’s Orchestrion and Alibaba’s compile-time instrumentation under the OpenTelemetry umbrella is the most significant near-term development. A vendor-neutral, community-maintained compile-time instrumentation tool for Go would change the adoption curve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;W3C context propagation in runtimes.&lt;/strong&gt; If language runtimes and compilers understand trace context natively, the instrumentation story simplifies fundamentally. This was a recurring theme at OTel Unplugged.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;eBPF Tokens.&lt;/strong&gt; &lt;a href="https://fosdem.org/2026/schedule/event/3LLHG9-bpf-tokens-safe-userspace-ebpf/" rel="noopener noreferrer"&gt;BPF Tokens&lt;/a&gt; could significantly reduce the privilege requirements for eBPF-based instrumentation. Instead of &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt;, a token-based trust model would lower the bar for security teams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native USDT in Go.&lt;/strong&gt; The PoC fork demonstrates feasibility. Whether the Go team would accept USDT probes into the standard library is an open question, but the pattern exists in other ecosystems — Postgres, MySQL, and the JVM all have static tracepoints behind flags.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flight recording.&lt;/strong&gt; The &lt;code&gt;golang/go#63185&lt;/code&gt; proposal for always-on flight recording in the Go runtime could eventually provide the foundation for zero-touch distributed tracing without any external tooling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Closing
&lt;/h3&gt;

&lt;p&gt;The instrumentation tax is real and unavoidable. The question is not whether to pay it, but how to manage it. For Go, the answer is increasingly “you have options” — and those options are getting better.&lt;/p&gt;

&lt;p&gt;The slides are available as &lt;a href="https://github.com/kakkoyun/fosdem-2026/blob/main/presentation.md" rel="noopener noreferrer"&gt;Markdown&lt;/a&gt; in the repository. The demo code, Docker setup, and benchmark scripts are all in the &lt;a href="https://github.com/kakkoyun/fosdem-2026" rel="noopener noreferrer"&gt;fosdem-2026 repository&lt;/a&gt;. The &lt;a href="https://www.youtube.com/watch?v=0TvrSebuDPk" rel="noopener noreferrer"&gt;recording is on YouTube&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to get involved: the &lt;a href="https://github.com/open-telemetry/opentelemetry-go-compile-instrumentation" rel="noopener noreferrer"&gt;OTel Compile-Time Instrumentation SIG&lt;/a&gt;, &lt;a href="https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation" rel="noopener noreferrer"&gt;OBI&lt;/a&gt;, and &lt;a href="https://github.com/open-telemetry/opentelemetry-go" rel="noopener noreferrer"&gt;OTel Go&lt;/a&gt; repositories all accept contributions. The &lt;code&gt;#otel-go&lt;/code&gt; and &lt;code&gt;#otel-ebpf-sig&lt;/code&gt; channels on &lt;a href="https://slack.cncf.io/" rel="noopener noreferrer"&gt;CNCF Slack&lt;/a&gt; are where the discussions happen.&lt;/p&gt;

&lt;p&gt;See also: &lt;a href="https://dev.to/posts/otel-unplugged-eu-2026/"&gt;OTel Unplugged EU 2026 field notes&lt;/a&gt; for the broader ecosystem context.&lt;/p&gt;

</description>
      <category>go</category>
      <category>linux</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>OTel Unplugged EU 2026: Field Notes from the Instrumentation Frontier</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Fri, 20 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/otel-unplugged-eu-2026-field-notes-from-the-instrumentation-frontier-3fn0</link>
      <guid>https://forem.com/kakkoyun/otel-unplugged-eu-2026-field-notes-from-the-instrumentation-frontier-3fn0</guid>
      <description>&lt;h3&gt;
  
  
  Brussels Again, But Make It Unplugged
&lt;/h3&gt;

&lt;p&gt;The day after FOSDEM, about a hundred of us gathered at &lt;strong&gt;Sparks Meeting&lt;/strong&gt; on Rue Ravenstein in Brussels for &lt;a href="https://opentelemetry.io/blog/2025/otel-unplugged-fosdem/" rel="noopener noreferrer"&gt;OTel Unplugged EU 2026&lt;/a&gt; — an unconference dedicated entirely to OpenTelemetry. Purple stage lights, a mid-century auditorium with wood paneling, and the familiar buzz of people who spend their days thinking about telemetry pipelines. If you know, you know.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0c8chrwfahl6dmb3a7g.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0c8chrwfahl6dmb3a7g.jpeg" alt="OTel Unplugged agenda projected on stage" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The format is simple: no prepared talks, no slides. Morning session brainstorming, dot-voting on topics, then self-organizing into &lt;strong&gt;nine rooms across four breakout slots&lt;/strong&gt;. You vote with your feet. If a conversation isn’t working, you move. It’s chaotic, it’s honest, and it produces the kind of discussions that polished conference talks rarely achieve.&lt;/p&gt;

&lt;p&gt;I spent the day bouncing between sessions on &lt;strong&gt;Prometheus and OpenTelemetry convergence&lt;/strong&gt;, the &lt;strong&gt;Injector and Operator&lt;/strong&gt;, &lt;strong&gt;OBI/eBPF&lt;/strong&gt;, and &lt;strong&gt;auto-instrumentation for Go&lt;/strong&gt;. Four rooms, one thread connecting them all: &lt;em&gt;how do we make applications observable without asking developers to change their code?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s what I learned.&lt;/p&gt;




&lt;h3&gt;
  
  
  Prometheus Loves OpenTelemetry (It’s Complicated)
&lt;/h3&gt;

&lt;p&gt;Prometheus and OpenTelemetry had &lt;strong&gt;two sessions&lt;/strong&gt; — one in the morning with end users and contributors, and a follow-up in the afternoon specifically for maintainers. Both were packed. The relationship between these two projects is the kind you’d describe as “it’s complicated” on social media.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Resource Attributes Mess
&lt;/h4&gt;

&lt;p&gt;The biggest pain point? Getting OTLP data into Prometheus. &lt;strong&gt;Resource attributes&lt;/strong&gt; are the central headache. OTLP has a rich hierarchy — resource, scope, and metric attributes. Prometheus is flat. Bridging these two models means choosing between promoting all attributes, promoting some, or relying on &lt;code&gt;target_info&lt;/code&gt;. There are too many config options, no consistency across deployments, and the &lt;code&gt;info&lt;/code&gt; function (using &lt;code&gt;target_info&lt;/code&gt;) helps but adoption is uneven.&lt;/p&gt;
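&lt;p&gt;For readers who haven't hit this yet, the "what to promote" knob lives in Prometheus's own configuration. A minimal sketch using Prometheus's &lt;code&gt;promote_resource_attributes&lt;/code&gt; option (the attribute list here is illustrative, not a recommendation):&lt;/p&gt;

```yaml
# prometheus.yml — promote selected OTLP resource attributes to labels;
# anything not listed stays reachable only via target_info joins.
otlp:
  promote_resource_attributes:
    - service.name
    - service.namespace
    - k8s.namespace.name
```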

&lt;p&gt;One person described running an observability platform for &lt;strong&gt;over a thousand developers&lt;/strong&gt;, most using Prometheus &lt;code&gt;remote_write&lt;/code&gt;. Some teams want a single OTLP endpoint for logs and metrics, but that just shifts the same “what to promote, what to drop” problem. The frustration was palpable — someone put it bluntly: &lt;em&gt;“OTel is rewriting everything again.”&lt;/em&gt; Different conventions (&lt;code&gt;.&lt;/code&gt; vs &lt;code&gt;_&lt;/code&gt;), hard &lt;code&gt;target_info&lt;/code&gt; joins, and the sense that mature Prometheus semantics (&lt;code&gt;cluster&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;) are being duplicated under different names (&lt;code&gt;k8s.*&lt;/code&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  Migration Resistance
&lt;/h4&gt;

&lt;p&gt;Teams recognize the value of OTel’s semantic conventions, but the migration path is painful. &lt;strong&gt;Naming inconsistencies&lt;/strong&gt; (&lt;code&gt;.&lt;/code&gt; vs &lt;code&gt;_&lt;/code&gt;), hard &lt;code&gt;target_info&lt;/code&gt; joins, and the cognitive overhead of moving from Prometheus’s world view to OTel’s. Several people mentioned that &lt;code&gt;PromQL IS AWESOME&lt;/code&gt; (their emphasis, not mine) and that transformation adds overhead that people who come from a Prometheus background don’t want to pay.&lt;/p&gt;

&lt;p&gt;On the SDK side, OTel measurements require a &lt;strong&gt;hashmap lookup&lt;/strong&gt; while Prometheus doesn’t. Too many concepts — meter, instrument, aggregation — versus Prometheus’s closer alignment to mechanical sympathy. The performance direction being pursued? &lt;strong&gt;Zero allocations, no lookups&lt;/strong&gt; — the bound instruments PoC is the concrete step toward closing that gap. Nobody in the room uses delta temporality.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“People care about observability, not query languages.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  The Afternoon: Maintainers Chart a Path
&lt;/h4&gt;

&lt;p&gt;The afternoon session brought Prometheus and OTel maintainers together. The mood was constructive. &lt;strong&gt;OTel SDK v2&lt;/strong&gt; was discussed as an opportunity for the kind of breaking changes that could simplify the metrics API — a simplified, more performant, but less flexible API. The &lt;strong&gt;Prometheus 3.0&lt;/strong&gt; experience was instructive: the maintainers planned for major breakage but ended up with almost none.&lt;/p&gt;

&lt;p&gt;Concrete progress: &lt;strong&gt;David Ashpole’s &lt;a href="https://github.com/open-telemetry/opentelemetry-go/pull/7790" rel="noopener noreferrer"&gt;bound instruments PoC in Go&lt;/a&gt;&lt;/strong&gt; — instruments pre-bound to specific attribute sets, eliminating the hashmap lookup. People in the room care about Go and C++ performance, and this could be a game changer.&lt;/p&gt;

&lt;p&gt;On the receiver/exporter convergence front: &lt;strong&gt;cAdvisor is considering archiving its Prometheus exporter&lt;/strong&gt; and moving all code into the OTel collector. OTel Kubernetes monitoring is broadly adopted, with near-parity to kube-state-metrics. The idea of Prometheus carrying an OTel Collector distribution was floated.&lt;/p&gt;

&lt;p&gt;The messaging problem came into sharp focus: as one Prometheus maintainer put it, &lt;em&gt;“Having joint statements helps towards the perception of working together.”&lt;/em&gt; The gap isn’t just technical — it’s about perception. End users see two projects that look like they’re competing, even when the maintainers are collaborating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action items&lt;/strong&gt;: use the &lt;code&gt;#otel-prometheus&lt;/code&gt; Slack channel, meet again in &lt;strong&gt;Amsterdam&lt;/strong&gt;, and produce &lt;strong&gt;joint messaging&lt;/strong&gt; — “this is built together and is compatible.” Who owns that messaging? That’s the open question.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbsqjzl55mfdrxm668f4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbsqjzl55mfdrxm668f4.jpeg" alt="Emerging topics sorted on sticky notes during morning brainstorming" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Injector: From LD_PRELOAD to &lt;code&gt;apt install opentelemetry&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Two sessions covered the &lt;strong&gt;Injector and Operator&lt;/strong&gt; ecosystem — one focused on the general architecture, the other specifically on &lt;strong&gt;OBI and Injector coordination for Go&lt;/strong&gt;. The framing that stuck with me came early: &lt;em&gt;“OTel instrumentation feels more like a collection of tools than a product.”&lt;/em&gt; That’s why the Injector exists — to close the gap between what OTel offers and what users expect to just work.&lt;/p&gt;

&lt;h4&gt;
  
  
  Injector vs Operator
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;Injector&lt;/strong&gt; is opinionated and out-of-the-box. It aims for &lt;strong&gt;80% coverage with zero configuration&lt;/strong&gt;. The &lt;strong&gt;Operator&lt;/strong&gt; is for power users who want fine-grained control. Both end users and OTel maintainers were in the room, and more people knew about the Operator than the Injector.&lt;/p&gt;

&lt;p&gt;The Injector works via &lt;code&gt;LD_PRELOAD&lt;/code&gt; — it hooks into process loading to activate SDK instrumentation for Java, .NET, Node.js, and soon Python. It’s being used &lt;strong&gt;in production at scale&lt;/strong&gt; on Kubernetes. It can detect libc vs musl. Blocking system start during injection? Not perceived as a problem by anyone in the room.&lt;/p&gt;

&lt;p&gt;The inevitable question came up: &lt;strong&gt;“What about Go?”&lt;/strong&gt; For Go’s statically linked binaries, there’s no &lt;code&gt;LD_PRELOAD&lt;/code&gt; equivalent. The answer is either eBPF or compile-time instrumentation. Go remains the special case that requires different thinking.&lt;/p&gt;

&lt;h4&gt;
  
  
  Beyond Kubernetes
&lt;/h4&gt;

&lt;p&gt;There’s clear demand for the Injector &lt;strong&gt;outside of Kubernetes&lt;/strong&gt; — EC2, bare metal, traditional VMs. Users not on K8s or Docker &lt;em&gt;“end up using custom Ansibles”&lt;/em&gt; — the packaging gap is real and concrete. System packages (Debian, RPM) are needed, but hosting for them doesn’t exist yet. &lt;strong&gt;Red Hat is looking into packaging OTel components.&lt;/strong&gt; Multiple projects are independently solving the same packaging problems — signatures, distribution, hosting — which led to a proposal for a &lt;strong&gt;new SIG on OS packaging&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Vision: One Package to Rule Them All
&lt;/h4&gt;

&lt;p&gt;The afternoon session on OBI and Injector for Go articulated a bold vision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Run &lt;code&gt;apt install opentelemetry&lt;/code&gt; and get everything running — SDKs, Injector, OBI, all coordinated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This would require massive coordination between instrumentation providers — and it would include the &lt;strong&gt;OTel profiler&lt;/strong&gt; alongside the SDKs and Injector. The group discussed how to avoid &lt;strong&gt;double instrumentation&lt;/strong&gt; when both OBI and the Injector are present — OBI should detect the Injector and back off (similar to how it already detects other SDKs). A creative proposal emerged: &lt;strong&gt;OBI injecting the Injector&lt;/strong&gt; instead of the Operator, since eBPF can intercept process loading natively.&lt;/p&gt;

&lt;p&gt;The reality is that &lt;strong&gt;OTel declarative configuration&lt;/strong&gt; doesn’t cleanly fit either project’s model yet. The Injector has its own config format. OBI instruments many applications from a single daemon, which doesn’t map neatly to per-application YAML. This is a design problem that needs solving before the &lt;code&gt;apt install&lt;/code&gt; dream becomes real.&lt;/p&gt;

&lt;p&gt;And the question that kept coming back in both sessions — &lt;em&gt;“What about Go?”&lt;/em&gt; — led naturally into the next room.&lt;/p&gt;




&lt;h3&gt;
  
  
  eBPF and the Instrumentation Tax
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;OBI/eBPF session&lt;/strong&gt; drew a crowd interested in the promise and the trade-offs of non-invasive auto-instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftow3ltmx7svy9u4snf9e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftow3ltmx7svy9u4snf9e.jpeg" alt="Session brainstorming — Go instrumentation topics cluster together" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/open-telemetry/opentelemetry-go-instrumentation" rel="noopener noreferrer"&gt;OBI&lt;/a&gt; (eBPF-based auto-instrumentation) uses &lt;strong&gt;uprobes&lt;/strong&gt; to hook into application functions at the kernel level. No source code modification, no recompilation, no SDK integration. The trade-off? You need &lt;strong&gt;privileges&lt;/strong&gt;. &lt;code&gt;CAP_SYS_ADMIN&lt;/code&gt; or root access is a hard sell for security teams, and the discussion around reducing privilege requirements was lively.&lt;/p&gt;

&lt;p&gt;The operational reality came through early: someone described a case where &lt;em&gt;“the instrumentation was bringing down the pod”&lt;/em&gt; — auto-injection and sidecars destabilizing the very workloads they’re supposed to observe. That anecdote set the tone for the rest of the session and led directly to the quote that stuck with me most.&lt;/p&gt;

&lt;p&gt;A bright spot: &lt;strong&gt;&lt;a href="https://fosdem.org/2026/schedule/event/3LLHG9-bpf-tokens-safe-userspace-ebpf/" rel="noopener noreferrer"&gt;eBPF Tokens&lt;/a&gt;&lt;/strong&gt;, a newer Linux facility for safer userspace eBPF, could significantly lower the trust bar. There was optimism in the room about this direction.&lt;/p&gt;

&lt;p&gt;OBI isn’t just about application tracing. It shines in &lt;strong&gt;network observability&lt;/strong&gt; — topology mapping, correlating network stack behavior with application layer events. Someone asked about lock observability — &lt;em&gt;“Maybe profiling”&lt;/em&gt; was the answer, hinting at the breadth of what people want from eBPF beyond just tracing. And there’s an underexplored opportunity around &lt;strong&gt;USDTs&lt;/strong&gt; (user-defined static tracepoints). Postgres and MySQL already have them behind flags. Rust makes them easy to add. But we need to convince &lt;strong&gt;popular libraries across more languages&lt;/strong&gt; to adopt them.&lt;/p&gt;

&lt;p&gt;A broader point was raised: &lt;strong&gt;W3C context propagation&lt;/strong&gt; should be pushed into language runtimes and compilers, not just libraries. If the runtime itself understands trace context, the instrumentation story becomes fundamentally simpler.&lt;/p&gt;

&lt;p&gt;The most grounded take came during the Go discussion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Document the trade-offs between OBI, compile-time, injector, and SDK. Let people choose. Make them aware of each other and let them work together.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And the reality check:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Instrumentation tax is inevitable. Manage it, don’t pretend it’s free.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The consensus across multiple sessions: &lt;strong&gt;stop treating these approaches as competing camps&lt;/strong&gt;. They’re complementary layers for different deployment scenarios:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compile-time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AST transformation via &lt;code&gt;-toolexec&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Deepest instrumentation, zero runtime overhead, requires rebuild&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;eBPF/OBI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kernel-level uprobe hooking&lt;/td&gt;
&lt;td&gt;No app modification, needs kernel privileges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Injector/SSI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;K8s operator triggering instrumentation&lt;/td&gt;
&lt;td&gt;Lowest friction onboarding, abstracts complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On the Kubernetes operations side, there was a concrete proposal: a &lt;strong&gt;CRD for otel-operator&lt;/strong&gt; to deploy OBI daemonsets — with config validation and selective node deployment via workload labels. Not theoretical; the group was sketching the API surface.&lt;/p&gt;
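&lt;p&gt;No such CRD exists today, and every name below is invented for illustration, but the shape the room was sketching looks roughly like this:&lt;/p&gt;

```yaml
# Hypothetical only — a sketch of the API surface discussed in the room;
# the group, kind, and field names here do not exist yet.
apiVersion: opentelemetry.io/v1alpha1
kind: EBPFInstrumentation
metadata:
  name: obi
spec:
  nodeSelector:                 # selective node deployment via labels
    obi.example.io/enabled: "true"
  config:                       # validated OBI configuration
    discovery:
      services:
        - k8sNamespace: production
```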




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8vp7aib1wh3vn9lcdwx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8vp7aib1wh3vn9lcdwx.jpeg" alt="Community and ecosystem topics — the other half of the brainstorming table" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Patterns Across the Day
&lt;/h3&gt;

&lt;p&gt;Beyond the sessions I attended, three themes kept surfacing throughout the unconference.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ship Faster vs Stable by Default
&lt;/h4&gt;

&lt;p&gt;Two rooms, opposite tensions. One group argued: &lt;em&gt;“We discourage people from trying. Processes feel rigid. We can only learn if we actually build something.”&lt;/em&gt; The Prometheus model — experiment first, let things mature, specify later — was held up as the better feedback loop. The other group was laser-focused on &lt;strong&gt;stability&lt;/strong&gt;: feature gates, opt-in experimental features, the pain of breaking changes in semantic conventions. The impatience was clear: &lt;em&gt;“Less bike-shedding, more doing.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Both are right. The community is threading a needle between moving fast enough to stay relevant and being stable enough that enterprises trust the project. The gap between these two positions is where a lot of energy gets spent.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Maintainer Crisis
&lt;/h4&gt;

&lt;p&gt;This came up in at least three rooms. &lt;strong&gt;Not enough maintainers, too many PRs, codeowners disappearing.&lt;/strong&gt; The JavaScript SIG has an automated script to move inactive maintainers to emeritus after three months. Other SIGs handle it manually. Some SIGs have tried a buddy/mentor system for onboarding new contributors — it helps, but it doesn’t scale across all SIGs when the existing maintainers barely have time to review PRs. The phrase that stuck with me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Maintainership is privilege AND responsibility.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And a new problem: as one maintainer put it directly, &lt;em&gt;“AI slop creates a lot of work for maintainers.”&lt;/em&gt; Low-quality AI-generated PRs need review just like everything else, but they rarely lead to productive outcomes — creating a treadmill of review work that burns out the very people the project can’t afford to lose.&lt;/p&gt;

&lt;h4&gt;
  
  
  opentelemetry-go-auto: Quietly Fading
&lt;/h4&gt;

&lt;p&gt;During the Go-focused session, someone asked about &lt;code&gt;opentelemetry-go-auto&lt;/code&gt; — the eBPF-based Go auto-instrumentation project (originally from Alibaba). The answer was frank: the project &lt;strong&gt;“seems in maintenance mode, some of their maintainers are already contributing to OBI.”&lt;/strong&gt; The group decided to keep it out of the discussions unless those maintainers want to participate. No drama, just the natural evolution of open-source projects. The people moved to where the momentum is.&lt;/p&gt;




&lt;h3&gt;
  
  
  What Comes Next
&lt;/h3&gt;

&lt;p&gt;The unconference produced concrete next steps across every thread:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus + OTel&lt;/strong&gt;: Convergence work continues. Joint messaging, Amsterdam meetup, bound instruments moving forward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injector&lt;/strong&gt;: Merge functionality into the Operator starting with one language. System packages for non-K8s environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OBI&lt;/strong&gt;: Gradual protocol expansion, packaging SIG proposal, exploration of an eBPF-based OTel Collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go auto-instrumentation&lt;/strong&gt;: Coordinate all three approaches, document trade-offs clearly for end users.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Unplugged
&lt;/h3&gt;

&lt;p&gt;The unconference format works because &lt;strong&gt;the hardest problems in observability right now are not technical&lt;/strong&gt; — they’re social. Governance, maintenance burden, convergence between projects that grew up independently, vendor-neutrality when vendors are the primary contributors. You can’t solve these with a slide deck. You need a room, a whiteboard, and honest conversation.&lt;/p&gt;

&lt;p&gt;As I always say — &lt;strong&gt;the hallway track is the real conference.&lt;/strong&gt; OTel Unplugged is an entire day of hallway track, and it’s exactly what the community needs.&lt;/p&gt;

&lt;p&gt;If you want to get involved: join the &lt;a href="https://slack.cncf.io/" rel="noopener noreferrer"&gt;CNCF Slack&lt;/a&gt; and find the &lt;code&gt;#otel-prometheus&lt;/code&gt;, &lt;code&gt;#otel-ebpf-sig&lt;/code&gt;, and &lt;code&gt;#otel-go&lt;/code&gt; channels. The SIG meetings are open and listed on the &lt;a href="https://github.com/open-telemetry/community" rel="noopener noreferrer"&gt;OTel community repo&lt;/a&gt;. Show up, contribute, and help shape the future of observability.&lt;/p&gt;

&lt;p&gt;Already looking forward to the next one.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>opensource</category>
      <category>tooling</category>
    </item>
    <item>
      <title>FOSDEM 2026: Even Bigger, Even Better</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/fosdem-2026-even-bigger-even-better-b67</link>
      <guid>https://forem.com/kakkoyun/fosdem-2026-even-bigger-even-better-b67</guid>
      <description>&lt;h3&gt;
  
  
  Another Year, Another FOSDEM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FOSDEM&lt;/strong&gt; — the annual Brussels pilgrimage. If you’ve been, you know the drill: too many talks, too little time, questionable coffee, and the kind of conversations that only happen when you pack thousands of open-source developers into a university campus in the dead of winter.&lt;/p&gt;

&lt;p&gt;This year was different for me, though. Two talks in two devrooms, three sessions at OTel Unplugged — and this time, I brought the whole family. My wife and our toddler (who has graduated from “can barely walk” to “can absolutely destroy a hotel room in under four minutes”) came along, and we turned it into a proper trip — FOSDEM, then a few days exploring &lt;strong&gt;Ghent&lt;/strong&gt; and &lt;strong&gt;Antwerp&lt;/strong&gt; before heading home.&lt;/p&gt;

&lt;p&gt;The conference part was incredible. The journey home… well, we’ll get to that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Saturday Morning: eBPF Devroom
&lt;/h3&gt;

&lt;p&gt;Last year the eBPF Devroom was impenetrable — nobody leaves, nobody gets in. This year I made it in early and spent the morning there.&lt;/p&gt;

&lt;p&gt;Three sessions stood out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"&lt;a href="https://fosdem.org/2026/schedule/event/8GVBN7-ebpf-hooks-gotchas/" rel="noopener noreferrer"&gt;eBPF Hookpoint Gotchas: Why Your Program Fires (or Fails) in Unexpected Ways&lt;/a&gt;"&lt;/strong&gt; — Donia Chaiehloudj and Chris Tarazi walked through the subtle behaviors of kprobes, fentry, tracepoints, and uprobes that catch everyone off guard. The kind of talk where half the room is nodding along because they’ve hit these exact edge cases in production. If you write eBPF programs, this is required viewing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"&lt;a href="https://fosdem.org/2026/schedule/event/H3LM7G-performance_and_reliability_pitfalls_of_ebpf/" rel="noopener noreferrer"&gt;Performance and Reliability Pitfalls of eBPF&lt;/a&gt;"&lt;/strong&gt; — Usama Saqib shared hard-won lessons from running eBPF at scale: kprobe performance varying across kernel versions, fentry stability issues, and the challenges of scaling uprobes. Directly relevant to anyone using eBPF-based auto-instrumentation — the kind of detail you don’t find in documentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"&lt;a href="https://fosdem.org/2026/schedule/event/VTXQSK-oomprof/" rel="noopener noreferrer"&gt;OOMProf: Profiling Go Heap Memory at OOM Time&lt;/a&gt;"&lt;/strong&gt; — Tommy Reilly presented OOMProf, a Go library that uses eBPF to hook into Linux OOM tracepoints and capture heap profiles right before the kernel kills your process. Exports to pprof or Parca. The intersection of Go, eBPF, and profiling — three things I care deeply about.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The eBPF Devroom continues to be one of the most technically dense tracks at FOSDEM. Every talk assumes you already know the basics and goes straight to the edge cases and production realities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sunday: Two Talks, Two Devrooms
&lt;/h3&gt;

&lt;p&gt;Sunday was a double-header. Two talks in two devrooms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Augusto de Oliveira&lt;/strong&gt; and I co-presented &lt;strong&gt;“How to Reliably Measure Software Performance”&lt;/strong&gt; in the &lt;strong&gt;Software Performance Devroom&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frruf1v23mv3sr66ofauv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frruf1v23mv3sr66ofauv.jpeg" alt="Kemal and Augusto presenting at the Software Performance Devroom" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The talk opened with one of my favorite stories in science: the OPERA experiment that appeared to show neutrinos traveling faster than the speed of light, only for the root cause to be a single fiber-optic cable that wasn’t fully plugged in. That’s benchmarking in a nutshell — a world where loose cables are everywhere and your numbers are lying to you until you prove otherwise.&lt;/p&gt;

&lt;p&gt;We covered the full stack of what it takes to measure reliably. &lt;strong&gt;Environment control&lt;/strong&gt;: bare metal instances, disabling SMT, CPU affinity, cache management. &lt;strong&gt;Benchmark design&lt;/strong&gt;: making measurements representative and repeatable. &lt;strong&gt;Statistical rigor&lt;/strong&gt;: because if you’re not thinking about variance, you’re not thinking. And then the part I’m most excited about — &lt;strong&gt;integrating benchmarks into development workflows&lt;/strong&gt;. Performance quality gates on PRs, auto-generated regression comments, continuous benchmarking infrastructure. We showed what we’ve built at Datadog and pointed to the open-source alternatives available today.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Performance matters. It’s not always the first thing we think about when building software. But in the end, performance is what users experience.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Performance Devroom had a strong lineup all day. The audience was deeply technical — people who care about p99 latencies and can argue for an hour about whether your benchmark harness is introducing measurement bias. My kind of crowd.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/posts/fosdem-2026-measuring-software-performance/"&gt;technical blog post&lt;/a&gt; goes deeper, and the &lt;a href="https://dev.to/talks/how-to-reliably-measure-software-performance/"&gt;talk page&lt;/a&gt; has the slides and recording.&lt;/p&gt;

&lt;p&gt;Then I crossed campus to the &lt;strong&gt;Go Devroom&lt;/strong&gt;. My kind of room.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hannah S. Kim&lt;/strong&gt; and I presented &lt;strong&gt;“How to Instrument Go Without Changing a Single Line of Code”&lt;/strong&gt; — a talk comparing every strategy available today for zero-touch Go observability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3gisy488wpsrfc756pm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3gisy488wpsrfc756pm.jpeg" alt="Hannah and Kemal presenting at the Go Devroom" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We walked through eBPF-based auto-instrumentation with OBI, compile-time manipulation with tools like Orchestrion and the OTel Go compile-time instrumentation project, runtime injection via LD_PRELOAD, and the emerging world of USDTs for Go.&lt;/p&gt;

&lt;p&gt;The core of the talk was practical: benchmark results and small realistic services, compared along three axes — &lt;strong&gt;performance overhead&lt;/strong&gt;, &lt;strong&gt;robustness across Go versions&lt;/strong&gt;, and &lt;strong&gt;operational friction&lt;/strong&gt;. We showed the trade-offs honestly. eBPF gives you zero code changes but needs kernel privileges. Compile-time rewriting gives you the deepest instrumentation but requires a rebuild. The Injector abstracts complexity but is currently Kubernetes-only. There’s no silver bullet, just choices with different costs.&lt;/p&gt;

&lt;p&gt;We also looked forward at how upcoming work in the Go runtime — flight recording, improved diagnostics primitives, USDT probe generation — could unlock cleaner hooks for future instrumentation. The room was full. The questions were sharp. Hannah handled the eBPF deep-dives while I covered the compile-time and operational integration angles. It worked.&lt;/p&gt;

&lt;p&gt;If you want the full technical breakdown, I wrote a &lt;a href="https://dev.to/posts/fosdem-2026-auto-instrumenting-go/"&gt;companion blog post&lt;/a&gt; and the &lt;a href="https://dev.to/talks/how-to-instrument-go-without-changing-code/"&gt;talk page has the slides and recording&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monday: OTel Unplugged
&lt;/h3&gt;

&lt;p&gt;The day after FOSDEM, about a hundred of us gathered at &lt;strong&gt;Sparks Meeting&lt;/strong&gt; on Rue Ravenstein for &lt;a href="https://opentelemetry.io/blog/2025/otel-unplugged-fosdem/" rel="noopener noreferrer"&gt;OTel Unplugged EU 2026&lt;/a&gt; — an unconference dedicated entirely to OpenTelemetry. No slides, no prepared talks, just session brainstorming, dot-voting, and then splitting into nine rooms across four breakout slots.&lt;/p&gt;

&lt;p&gt;I led or co-led &lt;strong&gt;three sessions&lt;/strong&gt;: one on &lt;strong&gt;Prometheus and OTel convergence&lt;/strong&gt;, one on &lt;strong&gt;OBI/eBPF-based auto-instrumentation&lt;/strong&gt;, and one on &lt;strong&gt;the Injector and OBI coordination for Go&lt;/strong&gt;. The thread connecting all three was the same question that keeps me up at night: &lt;em&gt;how do we make applications observable without asking developers to change their code?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I wrote a &lt;a href="https://dev.to/posts/otel-unplugged-eu-2026/"&gt;dedicated post covering the full day&lt;/a&gt;, so I won’t repeat it here. The short version: the community is converging. Prometheus and OTel maintainers are charting a path together, the Injector vision is expanding beyond Kubernetes, and the various auto-instrumentation approaches for Go are finally being treated as complementary layers rather than competing camps. Read the post for the details.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hallway Track
&lt;/h3&gt;

&lt;p&gt;As always — &lt;strong&gt;the hallway track is the real conference.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some of the best conversations happened between sessions, over coffee, during the frantic sprints between ULB buildings. Catching up with &lt;strong&gt;Prometheus maintainers&lt;/strong&gt; about v3 adoption and the road ahead. Talking auto-instrumentation strategy with OTel contributors who I’d only known through GitHub issues. Comparing notes on performance engineering practices with people running infrastructure at wildly different scales.&lt;/p&gt;

&lt;p&gt;The informal &lt;strong&gt;Prometheus maintainers gathering&lt;/strong&gt; was a highlight. Getting the people who build and maintain the project into the same room, away from structured agendas, just talking about what’s working and what isn’t — that’s where real alignment happens. No Zoom call will ever replicate that.&lt;/p&gt;

&lt;p&gt;I’m incredibly grateful for the people I managed to see this year. And as always, slightly heartbroken about the ones I missed. FOSDEM is four thousand developers in one place for a weekend, and no matter how fast you move, you can’t see everyone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Travel and Logistics: The Sequel Nobody Asked For
&lt;/h3&gt;

&lt;p&gt;A FOSDEM without travel drama is, apparently, not something the universe allows for me.&lt;/p&gt;

&lt;p&gt;We flew &lt;strong&gt;Berlin to Brussels&lt;/strong&gt; on Friday and had a great time — FOSDEM all weekend, OTel Unplugged on Monday, then a few days of family time in &lt;strong&gt;Ghent&lt;/strong&gt; and &lt;strong&gt;Antwerp&lt;/strong&gt;. Belgian frites, Belgian waffles, Belgian everything. The toddler approved.&lt;/p&gt;

&lt;p&gt;Then came Thursday. Our flight home from Brussels to Berlin: first delayed, then cancelled. &lt;strong&gt;Berlin airport shut down.&lt;/strong&gt; The coldest winter in twenty years had frozen the city solid. We ended up at a hotel near Brussels airport with a very tired toddler and no plan B.&lt;/p&gt;

&lt;p&gt;Friday morning we flew to &lt;strong&gt;Frankfurt&lt;/strong&gt; instead, only to learn that the onward Frankfurt-to-Berlin flight was also cancelled. Surely a train from Frankfurt to Berlin would be straightforward? Of course not. We rebooked on a train, but our checked luggage was… somewhere. The airline couldn’t tell us where. We waited two hours at the airport, watching the carousel go around empty, then gave up and headed to the train station.&lt;/p&gt;

&lt;p&gt;Four and a half hours of train ride later, we were finally home. &lt;strong&gt;Antwerp to Berlin: 29.5 hours, door to door.&lt;/strong&gt; With a toddler. In the coldest week Germany had seen in two decades.&lt;/p&gt;

&lt;p&gt;The luggage? It arrived ten days later. Intact, thankfully. But ten days.&lt;/p&gt;

&lt;p&gt;Last year’s transport chaos was cute by comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  Looking Forward
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FOSDEM 2026&lt;/strong&gt; was my best one yet. Two talks across the weekend, three unconference sessions on Monday, and more hallway track conversations than I can count. The open-source observability community is in a remarkable place right now — Prometheus and OpenTelemetry converging, auto-instrumentation maturing across multiple approaches, and performance engineering finally getting the attention it deserves.&lt;/p&gt;

&lt;p&gt;Already thinking about next year. If you’re into open source and haven’t experienced FOSDEM, just go. You won’t regret it.&lt;/p&gt;

</description>
      <category>community</category>
      <category>devjournal</category>
      <category>monitoring</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Fix Go Module Downloads Behind a Corporate VPN</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/fix-go-module-downloads-behind-a-corporate-vpn-7ce</link>
      <guid>https://forem.com/kakkoyun/fix-go-module-downloads-behind-a-corporate-vpn-7ce</guid>
      <description>&lt;p&gt;If you work at a company that runs its own Go module proxy and you connect through a VPN, you’ve probably seen this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Get "https://binaries.example.com/google.golang.org/grpc/@v/v1.77.0.mod":
  dial tcp 172.27.5.36:443: i/o timeout

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The module has nothing to do with your company. It’s a public dependency. Yet Go refuses to fetch it from the public proxy and just dies with a timeout. The frustrating part: you know &lt;code&gt;proxy.golang.org&lt;/code&gt; has the module, and your config lists it as a fallback. So why doesn’t it fall through?&lt;/p&gt;

&lt;h2&gt;
  
  
  The comma trap
&lt;/h2&gt;

&lt;p&gt;A typical corporate Go setup looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export GOPROXY=corp-proxy.internal,https://proxy.golang.org,direct

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The comma separator between proxies looks harmless, but it controls exactly when Go tries the next proxy in the chain. With commas, Go only falls through on &lt;strong&gt;HTTP 404 or 410&lt;/strong&gt; — meaning the proxy responded and said “I don’t have this module.” Any other error, including TCP timeouts, DNS failures, and 5xx server errors, is treated as a &lt;strong&gt;hard failure&lt;/strong&gt;. Go stops and reports the error.&lt;/p&gt;

&lt;p&gt;When your VPN is disconnected, the corporate proxy is unreachable. That’s a TCP timeout, not a 404. Go never tries &lt;code&gt;proxy.golang.org&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipe fix
&lt;/h2&gt;

&lt;p&gt;Go 1.15 introduced the pipe separator (&lt;code&gt;|&lt;/code&gt;) as an alternative to commas. With a pipe, Go falls through on &lt;strong&gt;any error&lt;/strong&gt;, including network failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export GOPROXY="corp-proxy.internal|https://proxy.golang.org,direct"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the mix of separators. The pipe between the corporate proxy and the public proxy means “if the corporate proxy is unreachable, try the public one.” The comma between the public proxy and &lt;code&gt;direct&lt;/code&gt; means “only go direct if the public proxy returns 404” — which is the safer default for the last hop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not use pipes everywhere?
&lt;/h2&gt;

&lt;p&gt;The comma separator exists for a reason: &lt;strong&gt;privacy&lt;/strong&gt;. When Go tries to fetch a module from a proxy, it reveals the module path in the request URL. If your corporate proxy is down and you use pipes everywhere, Go would send your private module paths (&lt;code&gt;github.com/your-company/secret-service&lt;/code&gt;) to &lt;code&gt;proxy.golang.org&lt;/code&gt; before finally trying to fetch them directly.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;GOPRIVATE&lt;/code&gt; and &lt;code&gt;GONOPROXY&lt;/code&gt; environment variables mitigate this. Modules matching those patterns bypass the proxy chain entirely and are fetched directly from source. If you set &lt;code&gt;GOPRIVATE&lt;/code&gt; correctly, the pipe separator is safe for your use case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export GOPRIVATE=github.com/your-company
export GONOPROXY=github.com/your-company
export GONOSUMDB=github.com/your-company,go.internal.example.com
export GOPROXY="corp-proxy.internal|https://proxy.golang.org,direct"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setup, private modules never touch any proxy. Public modules try the corporate proxy first (fast, cached, available on VPN), fall back to the public proxy on failure, and go direct as a last resort.&lt;/p&gt;
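
&lt;p&gt;If you’d rather not touch your shell config at all, the same values can be written once to Go’s own environment file with &lt;code&gt;go env -w&lt;/code&gt; (a sketch using the same placeholder names as above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Persisted to the file reported by `go env GOENV`, so it applies to
# every shell, editor, and CI step that uses this Go installation
go env -w GOPRIVATE=github.com/your-company
go env -w GONOPROXY=github.com/your-company
go env -w 'GOPROXY=corp-proxy.internal|https://proxy.golang.org,direct'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One caveat: real environment variables take precedence over values stored with &lt;code&gt;go env -w&lt;/code&gt;, so drop the old &lt;code&gt;export&lt;/code&gt; lines if you switch.&lt;/p&gt;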

&lt;h2&gt;
  
  
  The full picture
&lt;/h2&gt;

&lt;p&gt;Here’s how Go resolves a module with this configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;go get google.golang.org/grpc@v1.77.0

1. Does "google.golang.org/grpc" match GOPRIVATE? No.
2. Try corp-proxy.internal -&amp;gt; TCP timeout (VPN off)
3. Separator is "|" -&amp;gt; fall through on any error
4. Try proxy.golang.org -&amp;gt; 200 OK, module found
5. Done.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for a private module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;go get github.com/your-company/secret-service@latest

1. Does "github.com/your-company/secret-service" match GOPRIVATE? Yes.
2. Skip proxy chain entirely.
3. Fetch directly from github.com via git.
4. Done.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  One-line fix
&lt;/h2&gt;

&lt;p&gt;If you’re in this situation, the fix is a single character change in your shell config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- export GOPROXY=corp-proxy.internal,https://proxy.golang.org,direct
+ export GOPROXY="corp-proxy.internal|https://proxy.golang.org,direct"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reload your shell (&lt;code&gt;source ~/.zshrc&lt;/code&gt;) and Go will gracefully fall back to the public proxy whenever your corporate proxy is unreachable. No more waiting for timeouts to tell you what you already know.&lt;/p&gt;

</description>
      <category>go</category>
      <category>networking</category>
      <category>tooling</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Putting API Keys in Your Shell Config</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/stop-putting-api-keys-in-your-shell-config-119o</link>
      <guid>https://forem.com/kakkoyun/stop-putting-api-keys-in-your-shell-config-119o</guid>
      <description>&lt;p&gt;We all know better. Don’t hardcode secrets. Use a vault. Rotate your keys. We’ve been saying this for years.&lt;/p&gt;

&lt;p&gt;And then the &lt;strong&gt;agentic coding boom&lt;/strong&gt; happened.&lt;/p&gt;

&lt;p&gt;Suddenly every tool wants an API key. OpenAI, Anthropic, Gemini, Groq, Mistral, Replicate—the list grows weekly. And where do those keys end up? Right there in &lt;code&gt;.zshrc&lt;/code&gt;, in plain text, because you needed it working &lt;em&gt;right now&lt;/em&gt; and you were going to fix it later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The "I'll fix this later" hall of shame
export OPENAI_API_KEY=sk-proj-abc123...
export ANTHROPIC_API_KEY=sk-ant-xyz789...
export GEMINI_API_KEY=AIzaSy...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I caught myself doing exactly this. Two API keys, sitting in my dotfiles, probably backed up to Time Machine, possibly in shell history, definitely in my terminal scrollback. Let’s fix this properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Plain text API keys in shell configs are bad for reasons you already know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shell history&lt;/strong&gt; — &lt;code&gt;~/.zsh_history&lt;/code&gt; records commands, and sometimes you &lt;code&gt;echo $OPENAI_API_KEY&lt;/code&gt; to debug something&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup snapshots&lt;/strong&gt; — Time Machine, cloud backups, dotfile repos all capture the file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shoulder surfing&lt;/strong&gt; — &lt;code&gt;cat ~/.zshrc&lt;/code&gt; during a screen share or a pairing session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminal scrollback&lt;/strong&gt; — the key is sitting in your terminal buffer right now&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And this isn’t just a theoretical risk. Attackers actively scan repos and backups for unprotected credentials — and when they find stolen API keys, they rack up thousands of dollars in charges. The platform bills the original owner.&lt;/p&gt;

&lt;p&gt;The “I’ll rotate it later” never comes. Meanwhile these keys have billing attached to them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: 1Password CLI
&lt;/h2&gt;

&lt;p&gt;If you use 1Password, you already have a secret manager with biometric unlock, audit logging, and team sharing. The &lt;code&gt;op&lt;/code&gt; CLI lets you pull secrets into your shell without ever writing them to disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install the CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install --cask 1password-cli

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable the CLI integration in 1Password desktop app: &lt;strong&gt;Settings &amp;gt; Developer &amp;gt; Connect with 1Password CLI&lt;/strong&gt;. This lets the CLI authenticate via the desktop app (Touch ID on Mac) instead of requiring a separate login.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Store Your Keys
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;op item create \
  --category="API Credential" \
  --title="OpenAI API Key" \
  --vault="Private" \
  "credential=sk-proj-your-key-here"

op item create \
  --category="API Credential" \
  --title="Gemini API Key" \
  --vault="Private" \
  "credential=AIzaSy-your-key-here"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Replace Hardcoded Values
&lt;/h3&gt;

&lt;p&gt;In your &lt;code&gt;.zshrc&lt;/code&gt; (or &lt;code&gt;.bashrc&lt;/code&gt;, &lt;code&gt;.profile&lt;/code&gt;, whatever you use):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- export OPENAI_API_KEY=sk-proj-abc123...
- export GEMINI_API_KEY=AIzaSy...
+ export OPENAI_API_KEY=$(op read "op://Private/OpenAI API Key/credential" --no-newline 2&amp;gt;/dev/null)
+ export GEMINI_API_KEY=$(op read "op://Private/Gemini API Key/credential" --no-newline 2&amp;gt;/dev/null)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. Three steps. The keys now live in 1Password, protected by your master password and biometric auth.&lt;/p&gt;

&lt;p&gt;One catch: this triggers a 1Password biometric prompt every time you open a terminal. If that bothers you (it bothered me), see Shell Startup Speed for the lazy-loading version that only prompts when you actually run a command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Rotate the Old Keys
&lt;/h3&gt;

&lt;p&gt;This is the step people skip. &lt;strong&gt;Do it now.&lt;/strong&gt; The old keys have been in plaintext. Assume they’re compromised.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI: &lt;a href="https://platform.openai.com/api-keys" rel="noopener noreferrer"&gt;platform.openai.com/api-keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Google AI: &lt;a href="https://aistudio.google.com/apikey" rel="noopener noreferrer"&gt;aistudio.google.com/apikey&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic: &lt;a href="https://console.anthropic.com/settings/keys" rel="noopener noreferrer"&gt;console.anthropic.com/settings/keys&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generate new keys, update the 1Password items with &lt;code&gt;op item edit&lt;/code&gt;, and you’re done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Details Worth Knowing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why &lt;code&gt;--no-newline&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;op read&lt;/code&gt; appends a trailing newline by default. API keys with a stray newline cause cryptic authentication failures—the kind where the key “looks right” but every request returns 401. The &lt;code&gt;--no-newline&lt;/code&gt; flag strips it.&lt;/p&gt;
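
&lt;p&gt;You can see the stray byte without touching 1Password at all; in this sketch &lt;code&gt;printf&lt;/code&gt; stands in for &lt;code&gt;op read&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# printf simulates op read's output; wc -c counts what lands in the file
printf 'sk-test-123\n' &amp;gt; /tmp/key_default   # default: value plus trailing newline
printf 'sk-test-123'   &amp;gt; /tmp/key_trimmed   # --no-newline: value only
wc -c &amp;lt; /tmp/key_default   # 12 bytes
wc -c &amp;lt; /tmp/key_trimmed   # 11 bytes

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;(Command substitution with &lt;code&gt;$(...)&lt;/code&gt; strips trailing newlines on its own, but the flag keeps you safe when the value goes to a file, a pipe, or a config template.)&lt;/p&gt;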

&lt;h3&gt;
  
  
  Why &lt;code&gt;2&amp;gt;/dev/null&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;If 1Password is locked or the CLI isn’t authenticated, &lt;code&gt;op read&lt;/code&gt; writes an error to stderr. The redirect silences that so you don’t get a wall of errors every time you open a terminal without 1Password unlocked. The variable simply becomes empty.&lt;/p&gt;

&lt;p&gt;The tradeoff: a misconfigured vault path also fails silently. Test it once after setup, and you’re fine.&lt;/p&gt;
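
&lt;p&gt;A quick post-setup check along those lines (the variable names are the ones from this post; swap in your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Warn about any expected key that ended up empty (e.g. a typo'd op:// path)
for var in OPENAI_API_KEY GEMINI_API_KEY; do
  val=$(eval "printf '%s' \"\${$var:-}\"")
  [ -z "$val" ] &amp;amp;&amp;amp; echo "WARN: $var is empty; check its op:// reference" &amp;gt;&amp;amp;2
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;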

&lt;h3&gt;
  
  
  What About Shell Startup Speed?
&lt;/h3&gt;

&lt;p&gt;The eager approach above runs &lt;code&gt;op read&lt;/code&gt; at shell init, which means every new terminal triggers a 1Password biometric prompt. If you open terminals frequently, this gets old fast.&lt;/p&gt;

&lt;p&gt;The fix is lazy loading with command-specific triggers. In zsh, the &lt;code&gt;preexec&lt;/code&gt; hook fires right before a command executes and receives the command string — perfect for deciding &lt;em&gt;which&lt;/em&gt; secrets to load &lt;em&gt;when&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Map: env var → 1Password secret reference
typeset -A _op_refs=(
  OPENAI_API_KEY "op://Private/OpenAI API Key/credential"
  GEMINI_API_KEY "op://Private/Gemini API Key/credential"
)

# Map: command → which keys it needs
typeset -A _op_cmd_keys=(
  codex "OPENAI_API_KEY"
  aider "OPENAI_API_KEY GEMINI_API_KEY"
  gemini "GEMINI_API_KEY"
)

_maybe_load_op_secrets() {
  local cmd="${1%% *}" # extract first word
  cmd="${cmd##*/}" # strip path prefix
  local keys="${_op_cmd_keys[$cmd]}"
  [[ -z "$keys" ]] &amp;amp;&amp;amp; return
  for key in ${=keys}; do
    [[ -n "${(P)key}" ]] &amp;amp;&amp;amp; continue # already loaded
    export "$key=$(op read "${_op_refs[$key]}" --no-newline 2&amp;gt;/dev/null)"
  done
}
preexec_functions+=(_maybe_load_op_secrets)

# Manual fallback: load everything
load-secrets() {
  for key ref in "${(@kv)_op_refs}"; do
    export "$key=$(op read "$ref" --no-newline 2&amp;gt;/dev/null)"
  done
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you three properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No startup cost&lt;/strong&gt; — terminal opens instantly, no biometric prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least privilege&lt;/strong&gt; — &lt;code&gt;codex&lt;/code&gt; only loads &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;, not every secret you have&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load once&lt;/strong&gt; — each key is fetched at most once per session (the &lt;code&gt;${(P)key}&lt;/code&gt; guard skips keys that are already set)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adding a new tool is one line in &lt;code&gt;_op_cmd_keys&lt;/code&gt;. Adding a new key is one line in &lt;code&gt;_op_refs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you have multiple 1Password accounts (personal + work), add &lt;code&gt;--account=my.1password.com&lt;/code&gt; to the &lt;code&gt;op read&lt;/code&gt; calls to avoid vault name collisions.&lt;/p&gt;

&lt;p&gt;For even more granularity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;op run&lt;/code&gt;&lt;/strong&gt; — inject secrets into a specific command rather than the global environment:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Only injects the key for this one command
op run --env-file=.env.1password -- python train.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;op inject&lt;/code&gt;&lt;/strong&gt; — when you have a dozen keys, individual &lt;code&gt;op read&lt;/code&gt; calls add up. With &lt;code&gt;op inject&lt;/code&gt;, you define all your secrets in a single template and load them in one shot:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ~/.env.op (template — safe to commit, contains no secrets)
export OPENAI_API_KEY={{ op://Private/OpenAI API Key/credential }}
export GEMINI_API_KEY={{ op://Private/Gemini API Key/credential }}
export ANTHROPIC_API_KEY={{ op://Private/Anthropic API Key/credential }}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In .zshrc — one CLI call loads everything
eval "$(op inject --in-file ~/.env.op)"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is substantially faster than N individual &lt;code&gt;op read&lt;/code&gt; calls — the CLI resolves all references in a single authentication round-trip.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Scoped injection&lt;/strong&gt; — skip the global environment entirely and inject a key for exactly one command’s lifetime:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY=$(op read "op://Private/OpenAI API Key/credential" --no-newline) python train.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key exists only in that command’s process environment. Nothing touches your shell, nothing lingers after the process exits. This is the most paranoid option, and it’s great for CI scripts or one-off runs.&lt;/p&gt;
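
&lt;p&gt;The “nothing lingers” part is plain shell semantics, easy to verify with a dummy value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The VAR=value cmd form sets the variable only in cmd's environment
DEMO_KEY=sk-dummy env | grep '^DEMO_KEY='   # the child process sees it
echo "${DEMO_KEY:-unset}"                   # the shell itself never did: prints "unset"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;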

&lt;h3&gt;
  
  
  What About macOS Keychain?
&lt;/h3&gt;

&lt;p&gt;macOS Keychain (&lt;code&gt;security find-generic-password&lt;/code&gt;) works too and has zero startup overhead since it’s always unlocked when you’re logged in. I use it for some tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export GITLAB_TOKEN=$(security find-generic-password -a ${USER} -s gitlab_token -w)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage of 1Password over Keychain: cross-device sync, team sharing, audit logs, and a UI that doesn’t make you question your life choices. Use whichever fits your workflow. The point is to stop storing secrets in plain text.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agentic Boom Made This Worse
&lt;/h2&gt;

&lt;p&gt;A year ago, most developers had maybe one or two API keys. Now? I know people with &lt;strong&gt;six or more&lt;/strong&gt; AI service keys in their shell config. Coding agents need them. MCP servers need them. Every new tool in the ecosystem asks you to “just export your API key” and the docs always show the hardcoded version because it’s simpler to explain.&lt;/p&gt;

&lt;p&gt;MCP servers are the newest vector here. Tools like Claude Code, Cursor, and Windsurf use configuration files (&lt;code&gt;claude_desktop_config.json&lt;/code&gt;, &lt;code&gt;mcp.json&lt;/code&gt;) that store API keys for tool servers. The LLM itself never sees the secret values — the MCP server process does — but only if you inject them properly. Hardcoding keys in MCP configs is the same mistake as hardcoding them in &lt;code&gt;.zshrc&lt;/code&gt;, just in a newer file. The &lt;code&gt;op&lt;/code&gt; CLI works here too: use &lt;code&gt;op run&lt;/code&gt; or environment variable references in your MCP server configs instead of raw keys.&lt;/p&gt;

&lt;p&gt;This is a tooling culture problem. The default getting-started experience for almost every AI API is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export MAGIC_AI_KEY=your-key-here # don't do this

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We should normalize showing the secure version in documentation. Until that happens, take five minutes and move your keys to a vault. Your future self (and your billing page) will thank you.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Before: plain text keys in .zshrc
export OPENAI_API_KEY=sk-proj-...

# After: lazy-loaded from 1Password, per-command per-key
typeset -A _op_refs=(
  OPENAI_API_KEY "op://Private/OpenAI API Key/credential"
  GEMINI_API_KEY "op://Private/Gemini API Key/credential"
)
typeset -A _op_cmd_keys=(
  codex "OPENAI_API_KEY"
  aider "OPENAI_API_KEY GEMINI_API_KEY"
)
_maybe_load_op_secrets() {
  local cmd="${1%% *}"; cmd="${cmd##*/}"
  local keys="${_op_cmd_keys[$cmd]}"
  [[ -z "$keys" ]] &amp;amp;&amp;amp; return
  for key in ${=keys}; do
    [[ -n "${(P)key}" ]] &amp;amp;&amp;amp; continue
    export "$key=$(op read "${_op_refs[$key]}" --no-newline 2&amp;gt;/dev/null)"
  done
}
preexec_functions+=(_maybe_load_op_secrets)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install &lt;code&gt;op&lt;/code&gt;, store your keys, replace the exports, rotate the old keys. Five minutes. Zero excuses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://1password.com/blog/securing-mcp-servers-with-1password-stop-credential-exposure-in-your-agent" rel="noopener noreferrer"&gt;Securing MCP Servers with 1Password&lt;/a&gt; — 1Password’s take on stopping credential exposure in agent configurations&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://williamcallahan.com/blog/secure-environment-variables-1password-doppler-llms-mcps-ai-tools" rel="noopener noreferrer"&gt;Secure Environment Variables for LLMs, MCPs, and AI Tools&lt;/a&gt; — William Callahan’s walkthrough of using 1Password CLI and Doppler for AI tool secrets&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://1password.com/blog/where-mcp-fits-and-where-it-doesnt" rel="noopener noreferrer"&gt;Where MCP Fits and Where It Doesn’t&lt;/a&gt; — 1Password on the security model of MCP and credential boundaries&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.1password.com/docs/cli/secret-references/" rel="noopener noreferrer"&gt;1Password CLI: Secret References&lt;/a&gt; — official docs on the &lt;code&gt;op://&lt;/code&gt; URI scheme&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.1password.com/docs/cli/reference/commands/inject/" rel="noopener noreferrer"&gt;1Password CLI: &lt;code&gt;op inject&lt;/code&gt;&lt;/a&gt; — batch-load secrets from template files&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.1password.com/docs/cli/shell-plugins/" rel="noopener noreferrer"&gt;1Password Shell Plugins&lt;/a&gt; — native integrations for CLI tools like &lt;code&gt;gh&lt;/code&gt;, &lt;code&gt;aws&lt;/code&gt;, and &lt;code&gt;stripe&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>cli</category>
      <category>security</category>
    </item>
    <item>
      <title>Vibe Coding with Cursor: My R&amp;D Week Adventure 🚀</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Wed, 12 Mar 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/vibe-coding-with-cursor-my-rd-week-adventure-52g</link>
      <guid>https://forem.com/kakkoyun/vibe-coding-with-cursor-my-rd-week-adventure-52g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR: Spent a week building cool stuff with &lt;a href="https://cursor.com" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, an AI-powered IDE. Found it surprisingly effective for both coding and managing my &lt;a href="https://www.buildingasecondbrain.com/" rel="noopener noreferrer"&gt;second brain&lt;/a&gt;. When your requirements are clear, it’s almost magical! ✨&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Setup: R&amp;amp;D Week Vibes
&lt;/h2&gt;

&lt;p&gt;You know that feeling when R&amp;amp;D week rolls around, and you’re caught between “I should learn something useful” and “I want to have fun”? Well, this time I decided to combine both by diving deep into &lt;a href="https://cursor.com" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, an AI-powered code editor that’s been making waves in the developer community.&lt;/p&gt;

&lt;p&gt;The mission was simple: Use Cursor for &lt;strong&gt;everything&lt;/strong&gt;, from managing my notes to building small task-specific projects. And by everything, I mean &lt;em&gt;everything&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes Cursor Different?
&lt;/h2&gt;

&lt;p&gt;Unlike traditional IDEs that just help you write code, Cursor feels more like having a pair programmer who actually gets your context. It’s built on top of VSCode (so you get all the good stuff you’re used to) but adds a layer of AI-powered features that make development feel more… vibey? 😎&lt;/p&gt;

&lt;h3&gt;
  
  
  The Good Parts
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context-Aware AI&lt;/strong&gt;: The AI understands your project structure and can help with everything from code completion to refactoring. For example, when working on a React component, it automatically suggested appropriate hooks and state management patterns based on my component’s purpose. When your requirements are clear, it’s almost magical how it can scaffold projects and implement patterns!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.cursor.com/context/rules-for-ai" rel="noopener noreferrer"&gt;Rules Feature&lt;/a&gt;&lt;/strong&gt;: This is where things get interesting. You can create custom rules and context for different types of work, both at the project and global level. Think project-specific coding standards, documentation patterns, and even architecture guidelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.cursor.com/beta/notepads" rel="noopener noreferrer"&gt;Notepads&lt;/a&gt;&lt;/strong&gt;: Quick thoughts? Code snippets? The notepad feature is like having a smart scratchpad that understands code and can share context between different parts of your development workflow.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Second Brain Management: A Pleasant Surprise
&lt;/h2&gt;

&lt;p&gt;One of my unexpected discoveries was how well Cursor handles note-taking and &lt;a href="https://www.buildingasecondbrain.com/" rel="noopener noreferrer"&gt;second brain&lt;/a&gt; management. Here’s what made it click for me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Markdown support is top-notch
- AI understands context across files
- Easy to maintain structure with rules
- Quick navigation between related notes
- File attachments for enhanced documentation
- Dynamic references using @ mentions

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Rules Feature: A Game Changer
&lt;/h3&gt;

&lt;p&gt;The rules feature deserves its own spotlight. Cursor offers two powerful ways to customize AI behavior (note that the older &lt;code&gt;.cursorrules&lt;/code&gt; file is being deprecated in favor of this new system):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Project Rules&lt;/strong&gt; (&lt;code&gt;.cursor/rules&lt;/code&gt; directory)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Global Rules&lt;/strong&gt; (Cursor Settings)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pro tip: Use project rules whenever possible - they’re more flexible, can be version controlled, and provide better granular control over different parts of your project.&lt;/p&gt;
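For illustration, a project rule is just a small file under `.cursor/rules` that scopes instructions to part of the repository. The file name, globs, and guidelines below are invented, and the exact frontmatter fields may differ between Cursor versions:

```
---
description: Writing guidelines for technical blog posts
globs: ["posts/**/*.md"]
---
- Prefer short paragraphs and concrete examples
- Use sentence-case headings
- Link to primary sources rather than summaries
```

Because these live in the repository, they travel with the code and evolve through the same review process as everything else.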

&lt;p&gt;I’ve set up different contexts for various types of work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical blog posts (with specific writing guidelines)&lt;/li&gt;
&lt;li&gt;Project documentation (with architecture patterns)&lt;/li&gt;
&lt;li&gt;Personal notes (with custom templates)&lt;/li&gt;
&lt;li&gt;Code standards (with framework-specific rules)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each context comes with its own set of rules and AI behavior. It’s like having multiple specialized assistants at your disposal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notepads: Beyond Simple Notes
&lt;/h3&gt;

&lt;p&gt;The Notepads feature (currently in beta) has been a revelation. Think of them as enhanced reference documents that go beyond regular &lt;code&gt;.cursorrules&lt;/code&gt;. I use them for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Boilerplate Generation&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architecture Documentation&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Development Guidelines&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The ability to share context between composers and chat interactions makes them incredibly powerful. Plus, you can attach files and use @ mentions to create a web of connected knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Small Projects, Big Impact
&lt;/h2&gt;

&lt;p&gt;During the week, I worked on several small, task-specific projects. The workflow typically went like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new project with clear requirements&lt;/li&gt;
&lt;li&gt;Set up project-specific rules and templates&lt;/li&gt;
&lt;li&gt;Let the AI handle boilerplate and routine coding&lt;/li&gt;
&lt;li&gt;Focus on architecture and edge cases&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI handled a lot of the repetitive work, letting me focus on the creative aspects of each project. The clearer my requirements were, the more magical the results became. ✨&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI-Powered Doesn’t Mean AI-Dependent&lt;/strong&gt;: Cursor enhances your workflow without taking over.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rules Are Your Friend&lt;/strong&gt;: Taking time to set up proper rules pays off immensely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context is King&lt;/strong&gt;: The more context you provide, the better the AI assistance becomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Second Brain Benefits&lt;/strong&gt;: It’s not just for coding; it’s a genuine knowledge management tool.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clear Requirements = Magic&lt;/strong&gt;: The more precise your task definition, the better the results.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;I’m planning to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand my rule sets for different types of work&lt;/li&gt;
&lt;li&gt;Create more structured templates for common architectural patterns&lt;/li&gt;
&lt;li&gt;Explore advanced AI features like multi-file refactoring&lt;/li&gt;
&lt;li&gt;Share my rules and templates with the community&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;R&amp;amp;D weeks are about trying new things and finding better ways to work. This experiment with Cursor turned out to be more than just playing with a new tool; it’s changed how I think about IDE capabilities and knowledge management.&lt;/p&gt;

&lt;p&gt;The combination of familiar VSCode features with AI assistance, especially the rules system, makes it a powerful tool for both coding and knowledge work. It’s not perfect (what is?), but it’s definitely earned its place in my daily toolkit.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Remember: The best tools are the ones that enhance your natural workflow rather than forcing you to adapt to them. Cursor does this surprisingly well. 👍&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>cursor</category>
      <category>vibecoding</category>
      <category>secondbrain</category>
    </item>
    <item>
      <title>FOSDEM 2025: Blimey, What a Weekend!</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Tue, 04 Feb 2025 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/fosdem-2025-blimey-what-a-weekend-3191</link>
      <guid>https://forem.com/kakkoyun/fosdem-2025-blimey-what-a-weekend-3191</guid>
      <description>&lt;h3&gt;
  
  
  Another Year, Another FOSDEM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FOSDEM&lt;/strong&gt;—the annual pilgrimage to &lt;strong&gt;Brussels&lt;/strong&gt; for a weekend of open-source brilliance, hallway track magic, and the inevitable sleep deprivation. This year’s &lt;strong&gt;Free and Open Source Software Developers’ European Meeting&lt;/strong&gt; was, as always, a whirlwind of ideas, people, and tech so bleeding-edge it practically needed bandages.&lt;/p&gt;

&lt;p&gt;But for me? It was all about &lt;strong&gt;seeing friends&lt;/strong&gt;. Catching up, syncing, and squeezing in as many conversations as humanly possible. As we always say—the &lt;strong&gt;hallway track is the real conference&lt;/strong&gt;. I’m beyond grateful for the people I managed to see, and equally bummed about those I missed. But with a toddler waiting at home, even carving out this limited time was a logistical miracle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Saturday: Go, Go, Go… and the eBPF Black Hole
&lt;/h3&gt;

&lt;p&gt;Saturday kicked off with a deep dive into the &lt;strong&gt;Go DevRoom&lt;/strong&gt;, before a (failed) mission to infiltrate the &lt;strong&gt;eBPF&lt;/strong&gt; talks.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Go Goodness&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;Go DevRoom&lt;/strong&gt; delivered as expected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"&lt;a href="https://fosdem.org/2025/schedule/event/fosdem-2025-5353-the-state-of-go/" rel="noopener noreferrer"&gt;The State of Go&lt;/a&gt;"&lt;/strong&gt; – Maartje Eyskens gave a solid rundown on where Go is headed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"&lt;a href="https://fosdem.org/2025/schedule/event/fosdem-2025-6049-swiss-maps-in-go/" rel="noopener noreferrer"&gt;Swiss Maps in Go&lt;/a&gt;"&lt;/strong&gt; – Bryan Boreham took us through these lightning-fast maps. &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifoyhgwgvsd9vwn5kbl5.jpeg" alt="Swiss Maps in Go talk" width="800" height="600"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"&lt;a href="https://fosdem.org/2025/schedule/event/fosdem-2025-5343-go-ing-easy-on-memory-writing-gc-friendly-code/" rel="noopener noreferrer"&gt;Go-ing Easy on Memory: Writing GC-Friendly Code&lt;/a&gt;"&lt;/strong&gt; – Sümer Cip’s talk was a timely reminder that, yes, your garbage collection problems are (probably) your fault.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;eBPF Fail&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;eBPF DevRoom&lt;/strong&gt;? Packed. Absolutely impenetrable. As someone put it on Twitter:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Nobody leaves #eBPF room at #FOSDEM, so nobody gets in. 🥲”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Next year, I’m bringing a tent and camping outside the door.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sunday: Monitoring, Metrics, and Maybe Too Many Frites
&lt;/h3&gt;

&lt;p&gt;Sunday was all about observability, performance, and squeezing every bit of insight from running systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Observability Overload&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;Monitoring and Observability DevRoom&lt;/strong&gt; had a strong lineup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Richard “RichiH” Hartmann &lt;a href="https://fosdem.org/2025/schedule/event/fosdem-2025-6715-monitoring-and-observability-devroom-opening/" rel="noopener noreferrer"&gt;set the stage&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"&lt;a href="https://fosdem.org/2025/schedule/event/fosdem-2025-5502-the-performance-impact-of-auto-instrumentation/" rel="noopener noreferrer"&gt;The Performance Impact of Auto-Instrumentation&lt;/a&gt;"&lt;/strong&gt; – James Belchamber gave a fantastic talk on the hidden costs of auto-instrumentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"&lt;a href="https://fosdem.org/2025/schedule/event/fosdem-2025-6571-prometheus-version-3/" rel="noopener noreferrer"&gt;Prometheus Version 3&lt;/a&gt;"&lt;/strong&gt; – Jan Fajerski and Bryan Boreham gave us the lowdown on what’s next for Prometheus. &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cdq5umh7ebk0yryrlxb.jpeg" alt="Prometheus 3 talk" width="800" height="600"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Community Vibes
&lt;/h3&gt;

&lt;p&gt;Like I said—&lt;strong&gt;FOSDEM&lt;/strong&gt; is really about the people. The talks are great, but the real magic happens in the hallway track. Some of the best conversations weren’t planned; they just happened over coffee, between sessions, or during a frantic sprint between buildings.&lt;/p&gt;

&lt;p&gt;I’m incredibly happy for the folks I got to see, and at the same time, I wish I had more time to catch up with everyone I missed. But life is about balance, and with a little one waiting at home, I had to make every moment count.&lt;/p&gt;

&lt;p&gt;Oh, and the &lt;strong&gt;frites&lt;/strong&gt;? Still undefeated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus: Trains, Chaos, and a Race Against Time
&lt;/h3&gt;

&lt;p&gt;Because no trip is complete without &lt;strong&gt;public transport drama&lt;/strong&gt;, my journey back home came with an extra dose of stress. Trains? Cancelled. Schedule? A mess. Plane? Hanging by a thread. &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmnsu832edfgkp3zltf4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmnsu832edfgkp3zltf4.jpeg" alt="Train chaos while trying to catch my flight" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Somehow, I made it. But FOSDEM weekend wouldn’t be complete without at least one unexpected adventure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FOSDEM 2025&lt;/strong&gt; delivered. Again. Already looking forward to next year. If you’re into open source and haven’t experienced &lt;strong&gt;FOSDEM&lt;/strong&gt;, sort it out.&lt;/p&gt;

</description>
      <category>fosdem</category>
      <category>conference</category>
      <category>opensource</category>
    </item>
    <item>
      <title>When Hustle Culture and Personal Values Collide: Lessons from My Startup Journey</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Wed, 16 Oct 2024 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/when-hustle-culture-and-personal-values-collide-lessons-from-my-startup-journey-59o4</link>
      <guid>https://forem.com/kakkoyun/when-hustle-culture-and-personal-values-collide-lessons-from-my-startup-journey-59o4</guid>
      <description>&lt;p&gt;Startups can be exciting arenas of innovation, filled with ambitious goals, rapid development cycles, and the allure of shaping the future. But when the pace becomes unsustainable, and personal values clash with company culture, the dream can quickly lose its luster. My recent experience at a machine learning inference startup taught me invaluable lessons about overwork, alignment, and the balance between idealism and pragmatism.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Decided to Leave
&lt;/h2&gt;

&lt;p&gt;The decision to leave wasn’t easy, but it became necessary when I realized that the environment was not compatible with my personal and professional priorities.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Overwork as a Default&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The company embraced a hustle culture where working over 10 hours a day and being on-call 24/7 was normalized. This wasn’t limited to crunch times—it was the baseline expectation. For someone with a newborn at home, this level of overwork was unsustainable and detrimental to my family life.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lack of Empathy and Transparency&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Despite knowing about my life situation, the company struggled to adjust its expectations. With no parents on the team, there was little understanding of what it meant to balance work and family. Additionally, expectations around work hours and deliverables weren’t clearly communicated during onboarding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Misaligned Values&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I joined with the goal of building resilient, scalable, high-performance systems while contributing to open-source projects—a passion of mine. However, the company prioritized rapid feature delivery and short-term metrics over reliability, sustainability, or open-source contributions. This fundamental misalignment created constant friction.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  My Mistakes
&lt;/h2&gt;

&lt;p&gt;While the cultural mismatch played a significant role, I also made mistakes that compounded the challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Over-Optimizing Instead of Delivering&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I leaned into finding ideal solutions rather than delivering quick, practical implementations. In a fast-paced startup, speed often outweighs perfection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Focusing Too Much on Learning&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
My desire to deeply understand and control every detail of the platform slowed me down. While this mindset works well in some roles, it was counterproductive in a high-pressure, delivery-focused environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prioritizing Reliability Over Features&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I instinctively gravitated toward improving system reliability and long-term sustainability, even when it was clear the company valued rapid feature delivery instead. This misalignment of priorities made my efforts less impactful in their eyes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spending Time on Open Source&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I worked on improving and contributing to open-source tools, which I saw as valuable. However, the company didn’t share this enthusiasm, and my efforts were viewed as misaligned with their goals.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Reflecting on this experience, I’ve taken away several key lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cultural Fit is Critical&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
No matter how exciting the technology or mission, if the company’s culture doesn’t align with your values, frustrations will inevitably arise. Startups that glorify overwork are not sustainable for someone who values balance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clarify Expectations Early&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Misaligned expectations around priorities and success metrics can derail even the most skilled engineers. Asking detailed questions during interviews and onboarding is essential to ensure alignment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance Idealism with Pragmatism&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Striking a balance between delivering quick wins and building sustainable systems is key, especially in startups. Knowing when to prioritize speed over perfection is a crucial skill.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stay True to Your Priorities&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For me, being present for my family and maintaining a balanced life outweighs any professional ambition. Leaving the role wasn’t easy, but it was the right decision for my well-being and values.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This experience was a humbling reminder of the importance of alignment—between personal values, company culture, and role expectations. While I’ve always thrived at the intersection of systems engineering and challenging problems, this chapter underscored the need for environments that respect the individual, not just the output.&lt;/p&gt;

&lt;p&gt;Startups can be transformative experiences for those who thrive on rapid growth and ambiguity. But for those who prioritize balance and long-term thinking, it’s critical to choose an organization that values these traits. For me, this experience reaffirmed the importance of staying true to my values, even when the professional stakes are high.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Profiling Python with eBPF: A New Frontier in Performance Analysis</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Mon, 12 Feb 2024 00:00:00 +0000</pubDate>
      <link>https://forem.com/kakkoyun/profiling-python-with-ebpf-a-new-frontier-in-performance-analysis-2dmj</link>
      <guid>https://forem.com/kakkoyun/profiling-python-with-ebpf-a-new-frontier-in-performance-analysis-2dmj</guid>
      <description>&lt;h1&gt;
  
  
  Profiling Python with eBPF: A New Frontier in Performance Analysis
&lt;/h1&gt;

&lt;p&gt;Profiling Python applications can be challenging, especially in scenarios involving high-performance requirements or complex workloads. Existing tools often require code instrumentation, making them impractical for certain use cases. Enter &lt;a href="https://ebpf.io/" rel="noopener noreferrer"&gt;eBPF&lt;/a&gt; (Extended Berkeley Packet Filter)—a revolutionary Linux technology—and the open-source project &lt;a href="https://parca.dev" rel="noopener noreferrer"&gt;Parca&lt;/a&gt;, which together are reshaping the landscape of Python profiling.&lt;/p&gt;

&lt;p&gt;In this post, I’ll explore how eBPF enables continuous profiling, discuss challenges like stack unwinding in Python, and demonstrate the power of modern profiling tools.&lt;/p&gt;

&lt;p&gt;You can also watch my &lt;a href="https://youtu.be/nNbU26CoMWA?si=t3Mh1z6XfNwa5r7M" rel="noopener noreferrer"&gt;full talk here&lt;/a&gt; or refer to the &lt;a href="https://kakkoyun.me/notes/presentations/FOSDEM24+-+Profiling+Python+with+eBPF+-+A+New+Frontier+in+Performance+Analysis" rel="noopener noreferrer"&gt;slides from the presentation&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Do We Need Profiling?
&lt;/h2&gt;

&lt;p&gt;Profiling helps optimize performance and troubleshoot issues, such as CPU spikes, memory leaks, or out-of-memory (OOM) events. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimization:&lt;/strong&gt; Identifying bottlenecks in code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident resolution:&lt;/strong&gt; Determining which function or component caused a memory spike or CPU overload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional Python profiling tools like &lt;a href="https://docs.python.org/3/library/profile.html" rel="noopener noreferrer"&gt;&lt;code&gt;cProfile&lt;/code&gt;&lt;/a&gt; require application instrumentation, which isn’t always feasible, especially in production environments where code access might be restricted; even external samplers like &lt;a href="https://github.com/benfred/py-spy" rel="noopener noreferrer"&gt;&lt;code&gt;py-spy&lt;/code&gt;&lt;/a&gt; aren’t designed for always-on, fleet-wide profiling. This is where eBPF shines, offering non-intrusive, external profiling.&lt;/p&gt;
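For contrast, here is a minimal sketch of the instrumented, in-process approach using the standard library's `cProfile` (the `busy_work` function is just a stand-in workload); this is exactly the code change that external eBPF profilers let you avoid:

```python
import cProfile
import io
import pstats


def busy_work(n):
    # Deliberately quadratic so it dominates the profile.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total


profiler = cProfile.Profile()
profiler.enable()
busy_work(200)
profiler.disable()

# Render the hottest functions by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The profiler has to wrap the code under measurement, which is precisely why this style of tool is hard to deploy against services you don't own or can't redeploy.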




&lt;h2&gt;
  
  
  Existing Profiling Solutions in Python
&lt;/h2&gt;

&lt;p&gt;The Python ecosystem offers several profiling tools, each with unique strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.python.org/3/library/profile.html" rel="noopener noreferrer"&gt;&lt;code&gt;cProfile&lt;/code&gt;&lt;/a&gt;: A built-in module for deterministic profiling.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/joerick/pyinstrument" rel="noopener noreferrer"&gt;&lt;code&gt;pyinstrument&lt;/code&gt;&lt;/a&gt;: A call stack profiler for Python.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/benfred/py-spy" rel="noopener noreferrer"&gt;&lt;code&gt;py-spy&lt;/code&gt;&lt;/a&gt;: A sampling profiler for Python programs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sumerc/yappi" rel="noopener noreferrer"&gt;&lt;code&gt;yappi&lt;/code&gt;&lt;/a&gt;: Yet Another Python Profiler, supports multithreaded programs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pyflame.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;Pyflame&lt;/code&gt;&lt;/a&gt;: A ptracing profiler for Python.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/plasma-umass/scalene" rel="noopener noreferrer"&gt;&lt;code&gt;Scalene&lt;/code&gt;&lt;/a&gt;: A high-performance CPU and memory profiler.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these tools are valuable, many require code instrumentation or introduce significant overhead, making them less suitable for continuous profiling in production environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is eBPF?
&lt;/h2&gt;

&lt;p&gt;Originally designed for network packet filtering, &lt;a href="https://ebpf.io/" rel="noopener noreferrer"&gt;eBPF&lt;/a&gt; has evolved into a versatile event-driven system. It enables safe execution of custom programs inside the Linux kernel, using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Performance_monitoring_unit" rel="noopener noreferrer"&gt;Performance Monitoring Units (PMUs)&lt;/a&gt;:&lt;/strong&gt; Efficient hardware units that track CPU cycles and other metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://perf.wiki.kernel.org/index.php/Main_Page" rel="noopener noreferrer"&gt;Perf subsystem&lt;/a&gt;:&lt;/strong&gt; A Linux facility for hooking into kernel and user-space events, such as CPU activity, memory allocation, or I/O.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By leveraging eBPF with PMUs, profiling becomes faster and more efficient than traditional approaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  Continuous Profiling with Parca
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://parca.dev" rel="noopener noreferrer"&gt;Parca&lt;/a&gt; is an open-source project enabling continuous profiling. Its eBPF agent hooks into &lt;a href="https://perf.wiki.kernel.org/index.php/Tutorial" rel="noopener noreferrer"&gt;perf events&lt;/a&gt;, collects stack traces, and aggregates data for visualization. The process involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hooking into CPU events&lt;/strong&gt; to monitor active functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack unwinding&lt;/strong&gt; to trace function calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data aggregation and visualization&lt;/strong&gt; in a web-based UI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Unlike traditional profilers, Parca introduces minimal runtime overhead, making it ideal for production workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stack Unwinding: A Key Challenge
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Native Code
&lt;/h3&gt;

&lt;p&gt;Profiling native code is straightforward: we unwind the stack by reading memory addresses from the CPU and resolving them into human-readable symbols using debug information (e.g., &lt;a href="https://dwarfstd.org/" rel="noopener noreferrer"&gt;DWARF&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Code
&lt;/h3&gt;

&lt;p&gt;For Python, stack unwinding is complex due to its interpreter-based execution. Python maintains execution state in custom data structures, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interpreter state:&lt;/strong&gt; Tracks threads and their execution context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread state:&lt;/strong&gt; A linked list of threads running in the interpreter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frame state:&lt;/strong&gt; Represents the current execution frame.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To unwind Python stacks, we must traverse these structures, extract relevant information, and map them to human-readable symbols.&lt;/p&gt;
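&lt;p&gt;To get a feel for what that traversal looks like, here is an in-process sketch using CPython's frame objects. A profiler like Parca performs the equivalent walk from outside the process via eBPF; the attributes used here (&lt;code&gt;f_back&lt;/code&gt;, &lt;code&gt;f_code&lt;/code&gt;) are the Python-level view of the frame state described above:&lt;/p&gt;

```python
import sys

def capture_stack():
    """Walk the chain of frame objects (the in-process view of the
    'frame state' described above) and return function names,
    innermost first. An eBPF profiler reads the same linked
    structures from outside the process."""
    frame = sys._getframe(1)  # skip capture_stack itself
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names

def inner():
    return capture_stack()

def outer():
    return inner()

stack = outer()
print(stack[:2])  # innermost frames first
```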




&lt;h2&gt;
  
  
  How Parca Profiles Python
&lt;/h2&gt;

&lt;p&gt;Here’s how Parca handles Python profiling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reverse Engineering the Python Runtime:&lt;/strong&gt; Inspecting the interpreter's internal data structures (interpreter, thread, and frame state) to learn where execution information lives in each Python version.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unwinding Python Stacks:&lt;/strong&gt; Traversing those structures from eBPF to reconstruct the chain of active frames.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mapping Symbols:&lt;/strong&gt; Resolving the extracted frame information to human-readable function names and source locations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient Data Handling:&lt;/strong&gt; Aggregating stack traces in kernel space so that only compact summaries reach user space, keeping overhead minimal.&lt;/p&gt;&lt;/li&gt;




&lt;h2&gt;
  
  
  Python 3.13: A Game-Changer for Profiling
&lt;/h2&gt;

&lt;p&gt;Python 3.13 introduces a debug offsets structure that simplifies stack unwinding. It provides precomputed offsets for key runtime fields, eliminating much of the manual reverse engineering required for earlier versions. This improvement marks a significant leap forward for tools like Parca.&lt;/p&gt;




&lt;h2&gt;
  
  
  Visualizing Profiles with Parca
&lt;/h2&gt;

&lt;p&gt;Parca’s UI provides a comprehensive view of application performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flame graphs:&lt;/strong&gt; Visualize stack traces over time, highlighting bottlenecks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering and Metadata:&lt;/strong&gt; Focus on specific languages (e.g., Python) or layers (e.g., C libraries).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous Insights:&lt;/strong&gt; Compare profiles across deployments to monitor performance regressions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, a flame graph might reveal inefficient recursion in a Python function, enabling developers to pinpoint and optimize the problematic code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Supported Python Versions
&lt;/h2&gt;

&lt;p&gt;Parca supports profiling for Python versions from 2.7 to 3.11, with ongoing work for 3.12 and full support anticipated for 3.13. The project’s modular design allows quick adaptation to new Python runtime changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Profiling Python applications with eBPF and Parca represents a new frontier in performance analysis. By leveraging eBPF and continuous profiling, we can gain invaluable insights into our applications, enabling effective performance optimization. I encourage you to explore Parca, provide feedback, and contribute to the project—it’s a collaborative effort that can benefit us all as we tackle the challenges of modern software development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get Started
&lt;/h3&gt;

&lt;p&gt;Watch my &lt;a href="https://youtu.be/nNbU26CoMWA?si=t3Mh1z6XfNwa5r7M" rel="noopener noreferrer"&gt;full talk&lt;/a&gt; or check out the &lt;a href="https://kakkoyun.me/notes/presentations/FOSDEM24+-+Profiling+Python+with+eBPF+-+A+New+Frontier+in+Performance+Analysis" rel="noopener noreferrer"&gt;presentation slides&lt;/a&gt;. Explore Parca on &lt;a href="https://github.com/parca-dev/parca" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and join the community. Your feedback helps improve the tooling and shape the future of observability.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ebpf</category>
      <category>profiling</category>
    </item>
    <item>
      <title>Fantastic Symbols and Where to Find Them - Part 2</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Thu, 27 Jan 2022 14:46:02 +0000</pubDate>
      <link>https://forem.com/kakkoyun/fantastic-symbols-and-where-to-find-them-part-2-1edk</link>
      <guid>https://forem.com/kakkoyun/fantastic-symbols-and-where-to-find-them-part-2-1edk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on polarsignals.com/blog on 27.01.2022&lt;/p&gt;

&lt;p&gt;This is a blog post series. If you haven’t read &lt;a href="https://www.polarsignals.com/blog/posts/2022/01/13/fantastic-symbols-and-where-to-find-them" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; we recommend you to do so first!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href="https://www.polarsignals.com/blog/posts/2022/01/13/fantastic-symbols-and-where-to-find-them" rel="noopener noreferrer"&gt;the first blog post&lt;/a&gt;, we learned about the fantastic symbols (&lt;a href="https://en.wikipedia.org/wiki/Debug_symbol" rel="noopener noreferrer"&gt;debug symbols&lt;/a&gt;), how the symbolization process works and lastly, how to find the symbolic names of addresses in a compiled binary.&lt;/p&gt;

&lt;p&gt;The actual location of the symbolic information depends on the programming language implementation the program is written in.&lt;br&gt;
We can categorize the programming language implementations into three groups: compiled languages (with or without a runtime), interpreted languages, and &lt;a href="https://en.wikipedia.org/wiki/Just-in-time_compilation" rel="noopener noreferrer"&gt;JIT-compiled&lt;/a&gt; languages.&lt;/p&gt;

&lt;p&gt;In this post, we will continue our journey to find fantastic symbols. And we will look into where to find them for the other types of programming language implementations.&lt;/p&gt;
&lt;h2&gt;
  
  
  JIT-compiled language implementations
&lt;/h2&gt;

&lt;p&gt;Examples of JIT-compiled languages include Java, .NET, Erlang, JavaScript (Node.js) and many others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Just-in-time_compilation" rel="noopener noreferrer"&gt;Just-In-Time&lt;/a&gt; compiled languages compile the source code into &lt;a href="https://en.wikipedia.org/wiki/Bytecode" rel="noopener noreferrer"&gt;bytecode&lt;/a&gt;, which is then compiled into &lt;a href="https://en.wikipedia.org/wiki/Machine_code" rel="noopener noreferrer"&gt;machine code&lt;/a&gt; at runtime,&lt;br&gt;
often using direct feedback from runtime to guide compiler optimizations on the fly.&lt;/p&gt;

&lt;p&gt;Because functions are compiled on the fly, there is no pre-built, discoverable symbol table in any object files. Instead, the symbol table is created on the fly.&lt;br&gt;
The symbol mappings (location to symbol) are usually stored in the &lt;em&gt;memory&lt;/em&gt; of the &lt;a href="https://en.wikipedia.org/wiki/Runtime_(program_lifecycle_phase)" rel="noopener noreferrer"&gt;runtime&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Virtual_machine" rel="noopener noreferrer"&gt;virtual machine&lt;/a&gt;&lt;br&gt;
and used to render human-readable stack traces when needed; for example, when an exception occurs, the runtime uses these mappings to produce a readable stack trace.&lt;/p&gt;

&lt;p&gt;The good thing is that most runtimes provide supplemental symbol mappings for the just-in-time compiled code, so that Linux &lt;code&gt;perf&lt;/code&gt; can use them.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;perf&lt;/code&gt; defines &lt;a href="https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/jit-interface.txt" rel="noopener noreferrer"&gt;an interface&lt;/a&gt; to resolve symbols for dynamically generated code by a JIT compiler.&lt;br&gt;
These files usually can be found in &lt;code&gt;/tmp/perf-$PID.map&lt;/code&gt;, where &lt;code&gt;$PID&lt;/code&gt; is the process ID of the process of the runtime that is running on the system.&lt;/p&gt;

&lt;p&gt;Runtimes usually don't emit these symbol mappings by default.&lt;br&gt;
You might need to change a configuration, run the virtual machine with a specific flag or environment variable, or run an additional program to obtain these mappings.&lt;br&gt;
For example, the JVM needs an agent, called &lt;a href="https://github.com/jvm-profiling-tools/perf-map-agent" rel="noopener noreferrer"&gt;perf-map-agent&lt;/a&gt;, to provide supplemental symbol mapping files.&lt;/p&gt;

&lt;p&gt;Let's see an example &lt;code&gt;perf map&lt;/code&gt; file for Node.js. The runtimes out there output this file with &lt;em&gt;more or less&lt;/em&gt; the same format, &lt;a href="https://github.com/parca-dev/parca-agent/issues/139" rel="noopener noreferrer"&gt;more or less!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To generate a similar file for &lt;a href="https://en.wikipedia.org/wiki/Node.js" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt;, we need to run &lt;code&gt;node&lt;/code&gt; with &lt;code&gt;--perf-basic-prof&lt;/code&gt; option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# With Node.js &amp;gt;=v0.11.15 the following command will create a map file for NodeJS:&lt;/span&gt;
node &lt;span class="nt"&gt;--perf-basic-prof&lt;/span&gt; your-app.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a map file at &lt;code&gt;/tmp/perf-&amp;lt;pid&amp;gt;.map&lt;/code&gt; that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3ef414c0 398 RegExp:[{(]
3ef418a0 398 RegExp:[})]
59ed4102 26 LazyCompile:~REPLServer.self.writer repl.js:514
59ed44ea 146 LazyCompile:~inspect internal/util/inspect.js:152
59ed4e4a 148 LazyCompile:~formatValue internal/util/inspect.js:456
59ed558a 25f LazyCompile:~formatPrimitive internal/util/inspect.js:768
59ed5d62 35 LazyCompile:~formatNumber internal/util/inspect.js:761
59ed5fca 5d LazyCompile:~stylizeWithColor internal/util/inspect.js:267
4edd2e52 65 LazyCompile:~Domain.exit domain.js:284
4edd30ea 14b LazyCompile:~lastIndexOf native array.js:618
4edd3522 35 LazyCompile:~online internal/repl.js:157
4edd37f2 ec LazyCompile:~setTimeout timers.js:388
4edd3cca b0 LazyCompile:~Timeout internal/timers.js:55
4edd40ba 55 LazyCompile:~initAsyncResource internal/timers.js:45
4edd42da f LazyCompile:~exports.active timers.js:151
4edd457a cb LazyCompile:~insert timers.js:167
4edd4962 50 LazyCompile:~TimersList timers.js:195
4edd4cea 37 LazyCompile:~append internal/linkedlist.js:29
4edd4f12 35 LazyCompile:~remove internal/linkedlist.js:15
4edd5132 d LazyCompile:~isEmpty internal/linkedlist.js:44
4edd529a 21 LazyCompile:~ok assert.js:345
4edd555a 68 LazyCompile:~innerOk assert.js:317
4edd59a2 27 LazyCompile:~processTimers timers.js:220
4edd5d9a 197 LazyCompile:~listOnTimeout timers.js:226
4edd6352 15 LazyCompile:~peek internal/linkedlist.js:9
4edd66ca a1 LazyCompile:~tryOnTimeout timers.js:292
4edd6a02 86 LazyCompile:~ontimeout timers.js:429
4edd7132 d7 LazyCompile:~process.kill internal/process/per_thread.js:173
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Each line has &lt;code&gt;START&lt;/code&gt;, &lt;code&gt;SIZE&lt;/code&gt; and &lt;code&gt;symbolname&lt;/code&gt; fields, separated with spaces. &lt;code&gt;START&lt;/code&gt; and &lt;code&gt;SIZE&lt;/code&gt; are hex numbers without 0x.&lt;br&gt;
&lt;code&gt;symbolname&lt;/code&gt; is the rest of the line, so it could contain special characters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With the help of this mapping file, we have everything we need to symbolize the addresses in the stack trace. Of course, as always, this is just an oversimplification.&lt;/p&gt;
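&lt;p&gt;A symbolizer's use of this file can be sketched in a few lines of Python. This is a simplified illustration (real tools must also handle map reloads and overlapping ranges), using two lines from the example map above:&lt;/p&gt;

```python
def parse_perf_map(text):
    """Parse perf map lines: START SIZE SYMBOLNAME, where START and
    SIZE are hex numbers without the 0x prefix and SYMBOLNAME is the
    rest of the line (it may contain spaces)."""
    entries = []
    for line in text.strip().splitlines():
        start, size, symbol = line.split(" ", 2)
        entries.append((int(start, 16), int(size, 16), symbol))
    return entries

def resolve(entries, address):
    """Return the symbol whose address range covers the given
    address, or None if no mapping matches."""
    for start, size, symbol in entries:
        if address in range(start, start + size):
            return symbol
    return None

# Two lines taken from the example map above.
sample = """59ed4102 26 LazyCompile:~REPLServer.self.writer repl.js:514
59ed44ea 146 LazyCompile:~inspect internal/util/inspect.js:152"""

entries = parse_perf_map(sample)
print(resolve(entries, 0x59ED4110))
```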

&lt;p&gt;For example, these mappings might change as the runtime decides to recompile the bytecode. So we need to keep an eye on these files and keep track of the changes to resolve the address correctly with their most recent mapping.&lt;/p&gt;

&lt;p&gt;Each runtime and virtual machine has its peculiarities that we need to adapt. But those are out of the scope of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpreted language implementations
&lt;/h2&gt;

&lt;p&gt;Examples of interpreted languages include Python, Ruby, and again many others.&lt;br&gt;
There are also languages that commonly use interpretation as a stage before &lt;a href="https://en.wikipedia.org/wiki/Just-in-time_compilation" rel="noopener noreferrer"&gt;JIT compilation&lt;/a&gt;, e.g., Java.&lt;br&gt;
Symbolization for this stage of compilation is similar to interpreted languages.&lt;/p&gt;

&lt;p&gt;Interpreted language runtimes do not compile the program to machine code.&lt;br&gt;
Instead, interpreters and virtual machines parse and execute the source code using their own evaluation (&lt;a href="https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop" rel="noopener noreferrer"&gt;REPL&lt;/a&gt;) routines, or execute bytecode on their own virtual processor.&lt;br&gt;
Either way, they have their own way of executing functions and managing stacks.&lt;/p&gt;

&lt;p&gt;If you observe (profile or debug) these runtimes using something like &lt;code&gt;perf&lt;/code&gt;,&lt;br&gt;
you will see symbols for the runtime. However, you won't see the language-level context you might be expecting.&lt;/p&gt;

&lt;p&gt;Moreover, the interpreter itself is probably written in a lower-level language like C or C++.&lt;br&gt;
When you inspect the object file of the runtime/interpreter, the symbol table you find shows the internals of the interpreter, not the symbols from your source code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding the symbols for our runtime
&lt;/h3&gt;

&lt;p&gt;The runtime symbols are useful because they allow you to see the internal routines of the interpreter, e.g., how much time your program spends on garbage collection.&lt;br&gt;
It is also very likely that the stack traces you see in a debugger or profiler will contain calls to the internals of the runtime.&lt;br&gt;
So these symbols are also helpful for debugging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiicety55keefh9bqvnh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiicety55keefh9bqvnh0.png" alt="Node Stack Trace" width="794" height="1042"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Most of the runtimes are compiled with &lt;code&gt;production&lt;/code&gt; mode, and they most likely lack the debug symbols in their release binaries.&lt;br&gt;
You might need to manually compile your runtime in &lt;code&gt;debug mode&lt;/code&gt; to actually have them in the resulting binary.&lt;br&gt;
Some runtimes, such as Node.js, already have them in their &lt;code&gt;production&lt;/code&gt; distributions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lastly, to completely resolve the stack traces of the runtime, we might need to obtain the debug information for the linked libraries.&lt;br&gt;
If you remember from &lt;a href="https://www.polarsignals.com/blog/posts/2022/01/13/fantastic-symbols-and-where-to-find-them" rel="noopener noreferrer"&gt;the first blog post&lt;/a&gt;, debuginfo files can help us.&lt;br&gt;
Debuginfo files for software packages are available through package managers in Linux distributions.&lt;br&gt;
Usually for an available package called &lt;code&gt;mypackage&lt;/code&gt; there exists a &lt;code&gt;mypackage-dbgsym&lt;/code&gt;, &lt;code&gt;mypackage-dbg&lt;/code&gt; or &lt;code&gt;mypackage-debuginfo&lt;/code&gt; package.&lt;br&gt;
There are also &lt;a href="https://sourceware.org/elfutils/Debuginfod.html" rel="noopener noreferrer"&gt;public servers&lt;/a&gt; that serve debug information.&lt;br&gt;
So we need to find the debuginfo files for the runtime we are using and all the linked libraries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding the symbols for our target program
&lt;/h3&gt;

&lt;p&gt;The symbols we look for in our own program are most likely stored in a memory table specific to the runtime.&lt;br&gt;
For example, in Python, the symbol mappings can be accessed using &lt;a href="https://docs.python.org/3/library/symtable.html" rel="noopener noreferrer"&gt;&lt;code&gt;symtable&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
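&lt;p&gt;For instance, the standard library's &lt;code&gt;symtable&lt;/code&gt; module lets you inspect the symbol table that CPython builds for a piece of source code:&lt;/p&gt;

```python
import symtable

# Build the symbol table for a small piece of source code, the same
# kind of mapping the interpreter keeps for the programs it runs.
source = "def add(a, b):\n    return a + b\n"
table = symtable.symtable(source, "example.py", "exec")

# The module-level table knows about the function we defined...
print([s.get_name() for s in table.get_symbols()])

# ...and nested tables describe its parameters and locals.
func = table.get_children()[0]
print(func.get_name(), [s.get_name() for s in func.get_symbols()])
```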

&lt;p&gt;As a result, you need to craft a specific routine for each interpreter runtime (in some cases, each version of that runtime) to obtain symbol information.&lt;br&gt;
Educated eyes might have already noticed that this is not an easy undertaking, considering the sheer number of interpreted languages out there.&lt;br&gt;
For example, a well-known Ruby profiler, &lt;a href="https://github.com/rbspy/rbspy/blob/master/ARCHITECTURE.md" rel="noopener noreferrer"&gt;rbspy&lt;/a&gt;, generates code for reading the internal structs of the Ruby runtime for each version.&lt;/p&gt;

&lt;p&gt;If you were to write a general-purpose profiler, &lt;em&gt;like us&lt;/em&gt;, you would need to write a special subroutine in your profiler for each runtime that you want to support.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Again&lt;/em&gt;, don't worry, we got you covered
&lt;/h2&gt;

&lt;p&gt;The good news is we've got you covered. If you are using the &lt;a href="https://github.com/parca-dev/parca-agent" rel="noopener noreferrer"&gt;Parca Agent&lt;/a&gt;, we already do &lt;a href="https://www.parca.dev/docs/symbolization" rel="noopener noreferrer"&gt;the heavy lifting&lt;/a&gt; for you to symbolize captured stack traces.&lt;br&gt;
And we keep extending our support for different languages and runtimes.&lt;br&gt;
For example, Parca already has support for parsing the &lt;code&gt;perf&lt;/code&gt; JIT interface to resolve symbols for collected stack traces.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://www.parca.dev/" rel="noopener noreferrer"&gt;Parca&lt;/a&gt; and let us know what you think on our &lt;a href="https://discord.gg/ZgUpYgpzXy" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; channel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/jit-interface.txt" rel="noopener noreferrer"&gt;perf JIT Interface&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.brendangregg.com/perf.html#JIT_Symbols" rel="noopener noreferrer"&gt;perf JIT Symbols&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://joyeecheung.github.io/blog/2018/12/31/tips-and-tricks-node-core/" rel="noopener noreferrer"&gt;Node.js profiling tips and tricks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html" rel="noopener noreferrer"&gt;Node.js Flamegraphs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>debugging</category>
      <category>profiling</category>
      <category>runtimes</category>
    </item>
    <item>
      <title>Fantastic Symbols and Where to Find Them - Part 1</title>
      <dc:creator>Kemal Akkoyun</dc:creator>
      <pubDate>Sat, 15 Jan 2022 08:18:33 +0000</pubDate>
      <link>https://forem.com/kakkoyun/fantastic-symbols-and-where-to-find-them-part-1-1epo</link>
      <guid>https://forem.com/kakkoyun/fantastic-symbols-and-where-to-find-them-part-1-1epo</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.polarsignals.com/blog/" rel="noopener noreferrer"&gt;polarsignals.com/blog&lt;/a&gt; on 13.01.2022&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Symbolization is a technique that allows you to translate machine memory addresses to human-readable symbol information (symbols).&lt;/p&gt;

&lt;p&gt;Why do we need to read what programs do anyway? We usually do not need to translate everything to a human-readable format when things run smoothly. But when things go south, we need to understand what is going on under the hood.&lt;br&gt;
Symbolization is needed by introspection tools like &lt;a href="https://en.wikipedia.org/wiki/Debugger" rel="noopener noreferrer"&gt;debuggers&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Profiling_(computer_programming)" rel="noopener noreferrer"&gt;profilers&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Core_dump" rel="noopener noreferrer"&gt;core dumps&lt;/a&gt; or any other program that needs to trace the execution of another program.&lt;br&gt;
While a target program is executing on a machine, these types of programs capture the stack traces of the program that is being executed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;a href="https://en.wikipedia.org/wiki/Stack_trace" rel="noopener noreferrer"&gt;stack trace&lt;/a&gt; (also called stack backtrace or stack traceback) is a report of the active stack frames at a certain point in time during the execution of a program.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.polarsignals.com%2Fblog%2Fposts%2F2022%2F01%2Fcall_stack_layout.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.polarsignals.com%2Fblog%2Fposts%2F2022%2F01%2Fcall_stack_layout.svg" alt="Call Stack Layout" width="684" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In raw stack traces, the addresses of the functions that are being called are recorded. The addresses are hexadecimal numbers representing the memory return addresses of the functions. Symbols are needed to translate memory addresses into function and variable names precisely as in the program’s source code to be read by us humans.&lt;br&gt;
Without symbols, all we see are hexadecimal numbers representing the memory addresses that we have captured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs4pw98s97izt7nv2j2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs4pw98s97izt7nv2j2b.png" alt="Unsymbolized Stack" width="772" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It sounds simple enough, right? Well, it's not. As with everything else about computers, it's a bit of sorcery. It has its challenges, such as associating addresses with the correct symbols, transforming addresses, and, most importantly, actually finding the symbols!&lt;br&gt;
The strategies to get symbol information vary depending on the platform and the programming language implementation that the program is written in.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For the sake of simplicity, we will focus on Linux as the target platform and ignore Windows, macOS and many other platforms. Otherwise, I could end up writing a small book here :)&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Fantastic Symbols ...
&lt;/h2&gt;

&lt;p&gt;A symbol (or debug symbol, to be precise) is a special kind of &lt;a href="https://en.wikipedia.org/wiki/Symbol_(programming)" rel="noopener noreferrer"&gt;symbol&lt;/a&gt; that attaches additional information to the symbol table of a program.&lt;br&gt;
This symbol information allows a debugger or a profiler to gain access to information from the program's source code, such as the names of identifiers, including variables and functions.&lt;br&gt;
But where can we find these symbols?&lt;/p&gt;
&lt;h2&gt;
  
  
  ... and Where to Find Them
&lt;/h2&gt;

&lt;p&gt;The actual location of the symbolic information depends on the programming language implementation the program is written in.&lt;br&gt;
We can categorize the programming language implementations into three groups: compiled languages (with or without a runtime), interpreted languages, and &lt;a href="https://en.wikipedia.org/wiki/Just-in-time_compilation" rel="noopener noreferrer"&gt;JIT-compiled&lt;/a&gt; languages.&lt;/p&gt;

&lt;p&gt;If the program is a compiled one, these may be compiled together with the binary file, distributed in a separate file, or discarded during the compilation and/or linking.&lt;br&gt;
Or, if the program is interpreted, these may be stored in the program itself. Let's briefly look at where and how we can find these symbols depending on the programming language implementation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Compiled language implementations
&lt;/h3&gt;

&lt;p&gt;Examples of compiled languages include C, C++, Go, Rust and many others.&lt;/p&gt;

&lt;p&gt;The compiled languages usually have a &lt;a href="https://en.wikipedia.org/wiki/Symbol_table" rel="noopener noreferrer"&gt;symbol table&lt;/a&gt; that contains all the symbols used in the program.&lt;br&gt;
The symbol table is usually compiled in the executable binary file. And the binary file is typically in the &lt;a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format" rel="noopener noreferrer"&gt;ELF&lt;/a&gt; format (for Linux systems).&lt;br&gt;
Symbol tables are included in the ELF binary file, specifically for mapping the addresses to function names and object names.&lt;br&gt;
In rare cases, it is stored in a separate file, usually with the same name as the binary file, but with a different extension.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfiq9k0pawylyobtoyte.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfiq9k0pawylyobtoyte.png" alt="ELF" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ELF format is not an easy one to describe in a couple of sentences. For the purpose of this article, we will focus on what we need to know about the ELF format.&lt;br&gt;
Each ELF file is made up of one ELF header, followed by file data. The ELF header is a fixed size and contains information about the data sections.&lt;br&gt;
The relevant part for us is that symbols can live in two special sections, &lt;code&gt;.symtab&lt;/code&gt; and &lt;code&gt;.dynsym&lt;/code&gt;.&lt;br&gt;
&lt;code&gt;.dynsym&lt;/code&gt; is the “dynamic symbol table”, a smaller version of &lt;code&gt;.symtab&lt;/code&gt; that only contains global symbols.&lt;/p&gt;

&lt;p&gt;Contents of &lt;code&gt;.dynsym&lt;/code&gt; and &lt;code&gt;.symtab&lt;/code&gt; section using &lt;code&gt;readelf -s /bin/go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Symbol table '.dynsym' contains 38 entries:
   Num: Value Size Type Bind Vis Ndx Name
     0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
     1: 00000000006355e0 99 FUNC GLOBAL DEFAULT 1 crosscall2
     2: 00000000006355a0 55 FUNC GLOBAL DEFAULT 1 _cgo_panic
     3: 0000000000465560 25 FUNC GLOBAL DEFAULT 1 _cgo_topofstack
     4: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND [...]@GLIBC_2.2.5 (6)
     5: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND [...]@GLIBC_2.2.5 (4)
     6: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND [...]@GLIBC_2.2.5 (4)
     7: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND [...]@GLIBC_2.2.5 (4)
     8: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND [...]@GLIBC_2.2.5 (4)
     9: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND [...]@GLIBC_2.2.5 (4)
    10: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND [...]@GLIBC_2.2.5 (4)
...
Symbol table '.symtab' contains 13199 entries:
   Num: Value Size Type Bind Vis Ndx Name
     0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
     1: 0000000000000000 0 FILE LOCAL DEFAULT ABS go.go
     2: 0000000000401000 0 FUNC LOCAL DEFAULT 1 runtime.text
     3: 0000000000401000 214 FUNC LOCAL DEFAULT 1 net(.text)
     4: 00000000004010e0 214 FUNC LOCAL DEFAULT 1 runtime/cgo(.text)
     5: 00000000004011c0 601 FUNC LOCAL DEFAULT 1 runtime/cgo(.text)
     6: 0000000000401420 480 FUNC LOCAL DEFAULT 1 runtime/cgo(.text)
     7: 0000000000401420 47 FUNC LOCAL HIDDEN 1 threadentry
     8: 0000000000401600 70 FUNC LOCAL DEFAULT 1 runtime/cgo(.text)
     9: 0000000000401646 5 FUNC LOCAL DEFAULT 1 runtime/cgo(.tex[...]
    10: 0000000000401646 5 FUNC LOCAL HIDDEN 1 x_cgo_munmap.cold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Go has a unique table (of course). It stores its symbols in a section called &lt;a href="https://pkg.go.dev/debug/gosym#LineTable" rel="noopener noreferrer"&gt;&lt;code&gt;.gopclntab&lt;/code&gt;&lt;/a&gt;. This is a table of functions, line numbers and addresses.&lt;br&gt;
Go does this because it needs to be able to render human-readable stack traces when a panic occurs at runtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that addresses in the symbol table do not move during execution, so the table can be read at any time while the program runs.&lt;br&gt;
It can easily be loaded into memory independently of the running program, and an observer can easily read it.&lt;/p&gt;

&lt;p&gt;We assumed that the binary file is a statically linked executable until this point. However, this might not be the case. The binary file might be dynamically linked to other libraries.&lt;br&gt;
From now on, we will refer to these shared library files and executables (both in ELF format) as &lt;a href="https://en.wikipedia.org/wiki/Object_file" rel="noopener noreferrer"&gt;object files&lt;/a&gt;. Each object file can have its own symbol table.&lt;/p&gt;

&lt;p&gt;We need to note that when we take a snapshot of the stack (a.k.a stack trace), it could include addresses from linked shared libraries and Kernel functions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kernel-level software differs as it has its own dynamic symbol table in &lt;code&gt;/proc/kallsyms&lt;/code&gt;, which is a file that contains all the symbols that are used in the kernel. And it can grow as the kernel modules are loaded.&lt;/p&gt;
&lt;/blockquote&gt;
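&lt;p&gt;The &lt;code&gt;/proc/kallsyms&lt;/code&gt; format is simple enough to parse by hand. Here is a small Python sketch over hypothetical sample lines; on a real Linux system you would read the file itself (addresses may be shown as zero without sufficient privileges):&lt;/p&gt;

```python
def parse_kallsyms(text):
    """Parse /proc/kallsyms-style lines: ADDRESS TYPE NAME [MODULE].
    Symbols from loadable modules carry a trailing [modulename]."""
    symbols = []
    for line in text.strip().splitlines():
        parts = line.split()
        address, kind, name = parts[0], parts[1], parts[2]
        module = parts[3].strip("[]") if len(parts) == 4 else None
        symbols.append((int(address, 16), kind, name, module))
    return symbols

# Hypothetical sample lines in the /proc/kallsyms format.
sample = """ffffffffb8400000 T startup_64
ffffffffb9000000 T _text
ffffffffc0a00000 t helper_fn [mymodule]"""

for addr, kind, name, module in parse_kallsyms(sample):
    print(hex(addr), kind, name, module)
```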

&lt;p&gt;We can read the object files by using binary utilities such as &lt;a href="https://en.wikipedia.org/wiki/Objdump" rel="noopener noreferrer"&gt;objdump&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Readelf" rel="noopener noreferrer"&gt;readelf&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Nm_(Unix)" rel="noopener noreferrer"&gt;nm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To read the &lt;code&gt;.symtab&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nm &lt;span class="nv"&gt;$FILE&lt;/span&gt;
&lt;span class="c"&gt;# or&lt;/span&gt;
objdump &lt;span class="nt"&gt;--syms&lt;/span&gt; &lt;span class="nv"&gt;$FILE&lt;/span&gt;
&lt;span class="c"&gt;# or&lt;/span&gt;
readelf &lt;span class="nt"&gt;--syms&lt;/span&gt; &lt;span class="nv"&gt;$FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To read the &lt;code&gt;.dynsym&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nm &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;$FILE&lt;/span&gt;
&lt;span class="c"&gt;# or&lt;/span&gt;
objdump &lt;span class="nt"&gt;--dynamic-syms&lt;/span&gt; &lt;span class="nv"&gt;$FILE&lt;/span&gt;
&lt;span class="c"&gt;# or&lt;/span&gt;
readelf &lt;span class="nt"&gt;--dyn-syms&lt;/span&gt; &lt;span class="nv"&gt;$FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the compiled languages, the symbol table is not the only source of symbols. There are also DWARFs!&lt;/p&gt;

&lt;h4&gt;
  
  
  Debuginfo
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;ELFs and DWARFs, welcome to fairyland.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another way to obtain the symbols from an object file is to use the debug information, or &lt;code&gt;debuginfo&lt;/code&gt; for short.&lt;br&gt;
As with the symbol table, this information can be compiled into the binary file, formatted as &lt;a href="https://en.wikipedia.org/wiki/DWARF" rel="noopener noreferrer"&gt;DWARF (Debugging With Attributed Record Formats)&lt;/a&gt;, or shipped in a separate file.&lt;/p&gt;

&lt;p&gt;DWARF is the debug information format most commonly used with ELF. It’s not necessarily tied to ELF, but the two were developed in tandem and work very well together.&lt;br&gt;
This information is split across different ELF sections (&lt;code&gt;.debug_*&lt;/code&gt; and &lt;code&gt;.zdebug_*&lt;/code&gt; for compressed ones), each with its own piece of information to relay.&lt;br&gt;
For our specific needs, we use the &lt;code&gt;.debug_info&lt;/code&gt; section to find the corresponding functions and the &lt;code&gt;.debug_line&lt;/code&gt; section to find the corresponding line numbers.&lt;/p&gt;

&lt;p&gt;Debuginfo files for software packages are available through package managers in Linux distributions.&lt;br&gt;
Usually, for a package called &lt;code&gt;mypackage&lt;/code&gt; there exists a &lt;code&gt;mypackage-dbgsym&lt;/code&gt;, &lt;code&gt;mypackage-dbg&lt;/code&gt; or &lt;code&gt;mypackage-debuginfo&lt;/code&gt; package.&lt;br&gt;
There are also &lt;a href="https://sourceware.org/elfutils/Debuginfod.html" rel="noopener noreferrer"&gt;public servers&lt;/a&gt; that serve debug information.&lt;/p&gt;
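
&lt;p&gt;These debuginfod servers speak a simple HTTP API: debug information is keyed by the GNU build ID embedded in the binary (readable with &lt;code&gt;readelf -n&lt;/code&gt;). A hedged Python sketch, assuming the public elfutils server; the build ID you pass in would come from a real binary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# debuginfod serves debug information over HTTP, keyed by the GNU build ID
# stored in an ELF binary (readable with `readelf -n`).
DEBUGINFOD_URL = "https://debuginfod.elfutils.org"  # public elfutils server

def debuginfo_url(build_id, server=DEBUGINFOD_URL):
    # The API exposes debug info at /buildid/ID/debuginfo,
    # where ID is the lowercase hex build ID.
    return server + "/buildid/" + build_id + "/debuginfo"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fetching that URL (with &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;urllib&lt;/code&gt;, or the &lt;code&gt;debuginfod-find&lt;/code&gt; client) downloads the separate debuginfo file, which the tools below can then consume.&lt;/p&gt;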
&lt;h4&gt;
  
  
  One Program to bring them all, and in the darkness bind them: &lt;code&gt;addr2line&lt;/code&gt;
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Wait, what?! Isn't that from another fantasy book?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now that we have the symbol table or debug information, we can use &lt;code&gt;addr2line&lt;/code&gt; (&lt;em&gt;address to line&lt;/em&gt;) to get the source code location of a given address.&lt;br&gt;
&lt;a href="https://linux.die.net/man/1/addr2line" rel="noopener noreferrer"&gt;&lt;code&gt;addr2line&lt;/code&gt;&lt;/a&gt; converts addresses back to function and line numbers.&lt;/p&gt;

&lt;p&gt;Let's see it in action &lt;code&gt;addr2line -a 0x0000000000001154 -e &amp;lt;objectFile&amp;gt;&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For addr2line, &lt;code&gt;&amp;lt;objectFile&amp;gt;&lt;/code&gt; can be any object file compiled with debug information or symbols: an executable, a shared library, or the separate debug file produced by a &lt;code&gt;strip --only-keep-debug&lt;/code&gt; operation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Voilà!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0x0000000000001154
main
/home/newt/Sandbox/hello-c/hello.c:14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used a simple C executable for this example, and we have our symbol and the attached source information for the corresponding address 🎉&lt;/p&gt;
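
&lt;p&gt;Under the hood, the address-to-symbol half of this lookup is conceptually a search over symbols sorted by start address. A minimal Python sketch, with made-up symbol names and addresses rather than ones from a real binary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import bisect

def resolve(symbols, addr):
    """symbols: a list of (start_address, name) pairs sorted by start address.
    Returns the name of the last symbol starting at or before addr."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i == -1:
        return None  # addr lies before the first known symbol
    return symbols[i][1]

symbols = [(0x1000, "_start"), (0x1130, "main"), (0x1200, "helper")]
resolve(symbols, 0x1154)  # inside main, which starts at 0x1130
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A real symbolizer also checks each symbol's size, so addresses that fall in the gaps between functions are not misattributed to the preceding one.&lt;/p&gt;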

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlwwnxa3qyp4z69o2uxo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlwwnxa3qyp4z69o2uxo.jpg" alt="Success" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If only every programming language implementation out there were a compiled one, our job here would be finished. But they are not, so we need to keep digging.&lt;br&gt;
For that, though, you will have to wait another week. As we hinted at in the title of this post, there will be a part 2! All the best franchises are sequels, right?!&lt;br&gt;
In part 2, we will see how interpreted languages and &lt;a href="https://en.wikipedia.org/wiki/Just-in-time_compilation" rel="noopener noreferrer"&gt;Just-In-Time&lt;/a&gt; compiled languages handle symbols.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Please stay tuned!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't worry, we've got you covered
&lt;/h2&gt;

&lt;p&gt;Even though we have simplified things a bit here, writing a program that performs symbolization still involves a lot of work.&lt;br&gt;
Many open-source tools, such as &lt;a href="https://www.brendangregg.com/perf.html" rel="noopener noreferrer"&gt;&lt;code&gt;perf&lt;/code&gt;&lt;/a&gt;, already handle the nitty-gritty details of symbolization.&lt;/p&gt;

&lt;p&gt;The good news is that we have got you covered. If you are using &lt;a href="https://github.com/parca-dev/parca-agent" rel="noopener noreferrer"&gt;Parca Agent&lt;/a&gt;, we already do &lt;a href="https://www.parca.dev/docs/symbolization" rel="noopener noreferrer"&gt;the heavy lifting&lt;/a&gt; of symbolizing captured stack traces for you.&lt;br&gt;
And we keep extending our support for different languages and runtimes.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://www.parca.dev/" rel="noopener noreferrer"&gt;Parca&lt;/a&gt; and let us know what you think on our &lt;a href="https://discord.gg/ZgUpYgpzXy" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; channel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Debug_symbol" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Debug_symbol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.brendangregg.com/bpf-performance-tools-book.html" rel="noopener noreferrer"&gt;https://www.brendangregg.com/bpf-performance-tools-book.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/DataDog/go-profiler-notes/blob/main/stack-traces.md" rel="noopener noreferrer"&gt;https://github.com/DataDog/go-profiler-notes/blob/main/stack-traces.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.brendangregg.com/perf.html" rel="noopener noreferrer"&gt;https://www.brendangregg.com/perf.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jvns.ca/blog/2018/01/09/resolving-symbol-addresses/" rel="noopener noreferrer"&gt;https://jvns.ca/blog/2018/01/09/resolving-symbol-addresses/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/File:Call_stack_layout.svg" rel="noopener noreferrer"&gt;Call Stack Layout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/corkami/pics/blob/28cb0226093ed57b348723bc473cea0162dad366/binary/elf101/elf101-64.svg" rel="noopener noreferrer"&gt;ELF Executable and Linkable Format diagram by Ange Albertini&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>profiling</category>
      <category>debugging</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
