<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vladimir</title>
    <description>The latest articles on Forem by Vladimir (@khan5v).</description>
    <link>https://forem.com/khan5v</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842227%2F95c64040-57e0-4257-b47b-59bfefea0aa5.png</url>
      <title>Forem: Vladimir</title>
      <link>https://forem.com/khan5v</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/khan5v"/>
    <language>en</language>
    <item>
      <title>Your model migration passed. Here's what the aggregate didn't show.</title>
      <dc:creator>Vladimir</dc:creator>
      <pubDate>Fri, 27 Mar 2026 08:33:28 +0000</pubDate>
      <link>https://forem.com/khan5v/your-model-migration-passed-heres-what-the-aggregate-didnt-show-147e</link>
      <guid>https://forem.com/khan5v/your-model-migration-passed-heres-what-the-aggregate-didnt-show-147e</guid>
      <description>&lt;p&gt;&lt;a href="https://arxiv.org/abs/2603.03823" rel="noopener noreferrer"&gt;SWE-CI&lt;/a&gt;, a benchmark published this month by Alibaba, tested whether AI coding agents maintain correct behavior over time. The result: &lt;a href="https://awesomeagents.ai/news/alibaba-swe-ci-ai-coding-agents-long-term-maintenance/" rel="noopener noreferrer"&gt;75% of them break previously working code&lt;/a&gt; — and model upgrades are one of the triggers.&lt;/p&gt;

&lt;p&gt;This isn't unique to coding agents. Every team running an LLM-powered agent hits the same problem quarterly: the provider deprecates your model, or a newer version promises better performance, or you're switching providers for cost. You change one string in your config, run the eval, and check the dashboard.&lt;/p&gt;

&lt;p&gt;The dashboard looks fine. But underneath, the behavior may have shifted in ways the aggregate doesn't surface.&lt;/p&gt;

&lt;h2&gt;The quarterly forced experiment&lt;/h2&gt;

&lt;p&gt;Model deprecations used to be annual. Now they're quarterly. Claude Opus 3 was retired earlier this year. GPT-4 Turbo was sunset last year. Each deprecation forces every team on that model to migrate — not on their schedule, on the provider's.&lt;/p&gt;

&lt;p&gt;And it's not just deprecations. Teams switch models for cost optimization, latency improvements, or capability upgrades. Every switch is a forced experiment where the variables aren't controlled — the new model behaves differently on every task, and the differences are invisible in the aggregate.&lt;/p&gt;

&lt;h2&gt;How migrations hide regressions&lt;/h2&gt;

&lt;p&gt;Different models have different strengths across task types. GPT-4o might be better at structured extraction while Claude excels at multi-step reasoning. A model that's faster might produce shorter responses — which looks like a cost improvement until you realize the shorter responses are &lt;em&gt;incomplete&lt;/em&gt; responses.&lt;/p&gt;

&lt;p&gt;The standard migration test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Success rate: 82% → 83%. Ship it.&lt;/li&gt;
&lt;li&gt;Median cost: $0.04 → $0.03. Even better.&lt;/li&gt;
&lt;li&gt;Median latency: 6.2s → 5.8s. Faster too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this misses: 8 out of 25 task types regressed. The high-volume, low-complexity tasks got slightly better — inflating the aggregate. The complex business flows that make up 15% of traffic broke silently.&lt;/p&gt;

&lt;p&gt;The aggregate improved. Key task types degraded. And nothing in the top-line numbers flagged it.&lt;/p&gt;
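&lt;p&gt;The masking arithmetic is easy to sketch: a traffic-weighted average can rise while a key segment collapses. The numbers below are invented for illustration.&lt;/p&gt;

```python
# Hypothetical success rates: high-volume easy tasks improve slightly,
# the complex 15% of traffic regresses badly.
easy = {"share": 0.85, "before": 0.84, "after": 0.88}
hard = {"share": 0.15, "before": 0.70, "after": 0.55}

def aggregate(key):
    # Traffic-weighted success rate across both segments.
    return easy["share"] * easy[key] + hard["share"] * hard[key]

before = aggregate("before")                # 0.819
after = aggregate("after")                  # 0.8305: the top line improves...
regressed = hard["after"] - hard["before"]  # -0.15: ...while hard tasks lose 15 points
```

The dashboard sees only `before` and `after`; the 15-point drop on the hard segment never surfaces.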

&lt;p&gt;This is the same failure mode I described in &lt;a href="https://dev.to/blog/kalibra-regression-detection/"&gt;Aggregate metrics are a blind spot in agent evaluation&lt;/a&gt; — but model migrations make it worse because they change &lt;em&gt;everything at once&lt;/em&gt;. A prompt edit affects one step. A model swap affects every LLM call in every trace.&lt;/p&gt;

&lt;h2&gt;What catches it&lt;/h2&gt;

&lt;p&gt;Two things that help when the aggregate looks flat:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistical significance.&lt;/strong&gt; Did the cost actually decrease, or is 50 traces not enough to tell? Kalibra computes bootstrap confidence intervals automatically — if the CI on the median delta includes zero, the "improvement" can't be distinguished from sampling noise.&lt;/p&gt;
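&lt;p&gt;For intuition, a percentile bootstrap over the difference of medians looks roughly like this. It's a generic sketch of the technique, not Kalibra's exact implementation:&lt;/p&gt;

```python
import random

def median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

def bootstrap_ci_median_delta(baseline, current, n_boot=2000, alpha=0.05):
    # Resample each population with replacement, recompute the median delta,
    # and take the central 95% of the resampled deltas as the interval.
    deltas = []
    for _ in range(n_boot):
        b = [random.choice(baseline) for _ in baseline]
        c = [random.choice(current) for _ in current]
        deltas.append(median(c) - median(b))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# If the interval straddles zero, the observed delta is indistinguishable
# from sampling noise at this sample size.
```

With 50 traces per side, a cost delta of a cent or two routinely produces an interval that straddles zero.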

&lt;p&gt;&lt;strong&gt;Per-task breakdown.&lt;/strong&gt; Which tasks got better? Which got worse? If 8 task types flipped from pass to fail, that's a regression — even if the aggregate went up.&lt;/p&gt;

&lt;p&gt;Here's what this looks like with &lt;a href="https://github.com/khan5v/kalibra" rel="noopener noreferrer"&gt;Kalibra&lt;/a&gt;, an open-source CLI that compares two trace populations statistically. It works on any JSONL traces that include outcomes (from an LLM-as-judge, deterministic eval, or the provider's finish reason). We ran 25 tasks through the same agent, baseline vs current, 50 traces each. The aggregate shows improvement everywhere:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▲ Token usage       963 → 337 tokens/trace (median)  -65.0%
                    95% CI [-75.4%, -14.9%]
▲ Duration          7.1s → 3.3s median  -52.9%
                    95% CI [-62.3%, -14.0%]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tokens down 65% with a tight CI. Duration halved. Both statistically significant. If you stopped here, it looks like a clear win.&lt;/p&gt;

&lt;p&gt;The per-task breakdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Trace breakdown
▼ Per trace         25 matched — ✗ 10 regressed

Quality gates
  [ OK ] token_delta_pct &amp;lt;= -10   actual: -65.00
  [FAIL] regressions &amp;lt;= 2         actual: 10.00

FAILED — quality gate violation (exit code 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;10 task types broke. The token gate passed — tokens &lt;em&gt;did&lt;/em&gt; go down. The regressions gate failed — too many tasks regressed. The tokens went down because complex responses were cut short, not because the agent got more efficient. Instead of generating a 40-line SQL query with explanatory comments, the agent output "You can use a SELECT with GROUP BY and HAVING" — drastically fewer tokens, but missing the actual answer.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;regressions &amp;lt;= 2&lt;/code&gt; gate surfaced what the aggregate metric missed.&lt;/p&gt;

&lt;h2&gt;One file, two populations&lt;/h2&gt;

&lt;p&gt;The practical question: how do you compare traces from two models without managing separate files, separate runs, separate export pipelines?&lt;/p&gt;

&lt;p&gt;Tag each trace with its model version. Put everything in one file. Split at compare time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# kalibra.yml — assuming all traces are in one centralized log&lt;/span&gt;
&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;baseline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./traces.jsonl&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;model_version == gpt-4o&lt;/span&gt;
  &lt;span class="na"&gt;current&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./traces.jsonl&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;model_version == claude-4.6&lt;/span&gt;

&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;task_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;task_name&lt;/span&gt;   &lt;span class="c1"&gt;# matches the 'task_name' key in your JSONL trace objects&lt;/span&gt;

&lt;span class="na"&gt;require&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;regressions &amp;lt;= &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cost_delta_pct &amp;lt;= &lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;success_rate_delta &amp;gt;= -5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kalibra compare   &lt;span class="c"&gt;# reads config, exits 1 on gate failure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;where&lt;/code&gt; filters traces by metadata — Prometheus-style matchers (&lt;code&gt;==&lt;/code&gt;, &lt;code&gt;!=&lt;/code&gt;, &lt;code&gt;=~&lt;/code&gt;, &lt;code&gt;!~&lt;/code&gt;). Both populations come from the same file, split by a tag you control. No separate export pipelines.&lt;/p&gt;
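&lt;p&gt;A matching trace file is just one JSONL record per trace, tagged with the model that produced it. Field names beyond &lt;code&gt;task_name&lt;/code&gt; and &lt;code&gt;model_version&lt;/code&gt; are illustrative:&lt;/p&gt;

```python
import json

# Two records for the same task, one per model. A real log would hold
# 50 traces per task per model; the extra fields here are illustrative.
traces = [
    {"task_name": "sql_report", "model_version": "gpt-4o",
     "success": True, "tokens": 963, "cost": 0.04, "duration_s": 7.1},
    {"task_name": "sql_report", "model_version": "claude-4.6",
     "success": False, "tokens": 337, "cost": 0.03, "duration_s": 3.3},
]

with open("traces.jsonl", "w") as f:
    for t in traces:
        f.write(json.dumps(t) + "\n")

# The `where` clauses in kalibra.yml split this one file into two populations:
baseline = [t for t in traces if t["model_version"] == "gpt-4o"]
current = [t for t in traces if t["model_version"] == "claude-4.6"]
```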

&lt;p&gt;The three gates check three different failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;regressions &amp;lt;= 2&lt;/code&gt; — the per-task breakdown. Catches hidden regressions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cost_delta_pct &amp;lt;= 30&lt;/code&gt; — the cost didn't blow up.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;success_rate_delta &amp;gt;= -5&lt;/code&gt; — the aggregate didn't tank either.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any gate fails, exit code 1. The deploy pauses until you've looked at which tasks were affected.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/khan5v/kalibra" rel="noopener noreferrer"&gt;Kalibra&lt;/a&gt;&lt;/strong&gt; — regression detection for AI agents · &lt;a href="https://github.com/khan5v/kalibra" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://kalibra.cc" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; · &lt;a href="https://pypi.org/project/kalibra/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>automation</category>
    </item>
    <item>
      <title>When agent trace metrics lie: the span tree double-counting problem</title>
      <dc:creator>Vladimir</dc:creator>
      <pubDate>Wed, 25 Mar 2026 18:06:27 +0000</pubDate>
      <link>https://forem.com/khan5v/when-agent-trace-metrics-lie-the-span-tree-double-counting-problem-3lhp</link>
      <guid>https://forem.com/khan5v/when-agent-trace-metrics-lie-the-span-tree-double-counting-problem-3lhp</guid>
      <description>&lt;p&gt;I was building OpenInference support for an agent trace comparison tool when the token counts came back double what they should have been. The code was simple — sum tokens across all spans in a trace. The bug was that "all spans" included orchestration wrappers that carried their children's totals. Nothing crashed. The numbers just looked plausible enough to ship.&lt;/p&gt;

&lt;p&gt;This is the span tree double-counting problem. It's not hard to fix once you see it, but it's easy to miss because the wrong numbers look reasonable.&lt;/p&gt;

&lt;h2&gt;The tree&lt;/h2&gt;

&lt;p&gt;Agent traces are trees. This isn't a new data structure — &lt;a href="https://opentelemetry.io/docs/concepts/signals/traces/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; has used tree-structured traces for distributed systems since long before LLMs were mainstream. &lt;a href="https://github.com/Arize-ai/openinference" rel="noopener noreferrer"&gt;OpenInference&lt;/a&gt;, the AI-specific semantic convention layer built on top of OpenTelemetry, inherits this model and adds span kinds tailored to AI workloads: LLM, TOOL, CHAIN, AGENT, RETRIEVER, and others. Every OpenInference trace is a valid &lt;a href="https://opentelemetry.io/docs/specs/otel/protocol/" rel="noopener noreferrer"&gt;OTLP&lt;/a&gt; trace — the conventions give attribute names their AI-specific meaning.&lt;/p&gt;

&lt;p&gt;A root AGENT or CHAIN span wraps child spans — LLM calls, tool invocations, retrievals. Those children can have children of their own. A planning step spawns sub-queries. A tool call triggers an LLM to parse the result. The depth is arbitrary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root (AGENT)
├── plan (LLM)         ← 500 input, 200 output tokens, $0.02
├── search (TOOL)      ← no tokens, no cost
│   └── parse (LLM)   ← 300 input, 100 output tokens, $0.01
└── respond (LLM)      ← 800 input, 400 output tokens, $0.03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four spans. Three are LLM calls with token counts and costs. One is a tool invocation. The AGENT span at the root is an orchestration wrapper — it didn't make an LLM call itself.&lt;/p&gt;

&lt;p&gt;The total cost is $0.06. The total tokens are 2,300. Straightforward — you sum the three LLM spans.&lt;/p&gt;

&lt;p&gt;But whether this works depends entirely on your instrumentation. What happens when parent spans &lt;em&gt;also&lt;/em&gt; carry token and cost attributes?&lt;/p&gt;

&lt;h2&gt;An old problem in new clothes&lt;/h2&gt;

&lt;p&gt;If you've worked with distributed tracing, you might recognize this. Traditional Application Performance Monitoring (APM) has dealt with a version of it for years under the name &lt;strong&gt;self-time&lt;/strong&gt; (or &lt;strong&gt;exclusive time&lt;/strong&gt;) — the duration a span spends doing its own work, excluding time enclosed by children. Elastic APM computes &lt;a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-metrics.html" rel="noopener noreferrer"&gt;&lt;code&gt;span.self_time&lt;/code&gt;&lt;/a&gt; metrics specifically for this: they subtract child durations from the parent's total to produce a breakdown visualization that doesn't double-count.&lt;/p&gt;

&lt;p&gt;The AI-specific twist is that the double-counting isn't about &lt;strong&gt;duration&lt;/strong&gt; — which is inherently hierarchical, since parent spans enclose children by definition. It's about &lt;strong&gt;metric values on spans&lt;/strong&gt;: tokens and costs. These are point measurements that should live on the specific span that generated them. They are not hierarchical quantities. When a parent span carries &lt;code&gt;total_tokens: 2300&lt;/code&gt; as a subtotal of its children, and you sum across all spans, you get 4,600 tokens. Double the actual value.&lt;/p&gt;

&lt;p&gt;Duration double-counting is a display and analysis problem — the data itself is correct; you just need to compute self-time. Token and cost double-counting is a data problem — the same value exists in two places, and the spec doesn't tell you which one is the source of truth.&lt;/p&gt;

&lt;h2&gt;Where things go wrong&lt;/h2&gt;

&lt;p&gt;Some instrumentations record aggregated subtotals on parent spans. A parent AGENT span might carry &lt;code&gt;total_tokens: 2300&lt;/code&gt; — the sum of its children. If you now sum &lt;em&gt;all&lt;/em&gt; spans, you get 4,600 tokens.&lt;/p&gt;
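&lt;p&gt;A minimal reproduction, using the tree from above plus a subtotal on the root (span data hypothetical):&lt;/p&gt;

```python
# The tree from the diagram, except the root AGENT span also carries its
# children's subtotal — the pattern some instrumentations produce.
spans = [
    {"name": "root", "kind": "AGENT", "total_tokens": 2300},  # subtotal, not own usage
    {"name": "plan", "kind": "LLM", "total_tokens": 700},
    {"name": "search", "kind": "TOOL", "total_tokens": None},
    {"name": "parse", "kind": "LLM", "total_tokens": 400},
    {"name": "respond", "kind": "LLM", "total_tokens": 1200},
]

naive = sum(s["total_tokens"] or 0 for s in spans)  # 4600: root counted on top of children
llm_only = sum(s["total_tokens"] for s in spans
               if s["kind"] == "LLM" and s["total_tokens"] is not None)  # 2300
```

Nothing about `naive == 4600` looks broken in a dashboard; it's just silently double the real usage.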

&lt;p&gt;This isn't hypothetical. Langfuse has seen &lt;a href="https://github.com/langfuse/langfuse/issues/10914" rel="noopener noreferrer"&gt;related&lt;/a&gt; &lt;a href="https://github.com/orgs/langfuse/discussions/11252" rel="noopener noreferrer"&gt;reports&lt;/a&gt; surface in different forms. The &lt;a href="https://github.com/orgs/langfuse/discussions/11252" rel="noopener noreferrer"&gt;Microsoft Agent Framework&lt;/a&gt; integration ran into it directly: the framework's &lt;code&gt;invoke_agent&lt;/code&gt; spans carried a &lt;code&gt;gen_ai.request.model&lt;/code&gt; attribute, which caused Langfuse to classify them as generations and infer token counts — even though the framework explicitly set &lt;code&gt;capture_usage=False&lt;/code&gt;. The result: both the orchestration span and the nested LLM calls got counted, doubling the totals. The presence of a &lt;code&gt;model&lt;/code&gt; attribute on a non-LLM span was enough to trigger it.&lt;/p&gt;

&lt;p&gt;What makes this tricky is that no specification forbids putting aggregated values on parent spans. OpenInference defines &lt;code&gt;llm.token_count.prompt&lt;/code&gt; and &lt;code&gt;llm.cost.total&lt;/code&gt; as span-level attributes but doesn't say "only attach these to leaf spans." OpenTelemetry's &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/" rel="noopener noreferrer"&gt;GenAI semantic conventions&lt;/a&gt; define &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt; on inference spans but don't warn about aggregation. These conventions are still in &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;Development status&lt;/a&gt; — the earliest maturity level in OTel's lifecycle — and they define no cost attributes at all. The convention — cost and tokens live only on the actual LLM call — is implicit, not specified.&lt;/p&gt;

&lt;p&gt;When conventions are implicit, they get violated. And the violations are silent — your numbers are wrong, but nothing crashes. Not every dataset has this problem, which makes it harder to catch when one does.&lt;/p&gt;

&lt;p&gt;And here's why you can't just detect it after the fact: imagine a parent span with &lt;code&gt;cost: $0.05&lt;/code&gt; and two children costing &lt;code&gt;$0.02&lt;/code&gt; and &lt;code&gt;$0.03&lt;/code&gt;. Is the parent's cost an aggregated subtotal of its children — meaning you should ignore it — or did the parent make its own LLM call that happened to cost &lt;code&gt;$0.05&lt;/code&gt;? That's not a contrived scenario: an orchestration step that reasons about which tool to call &lt;em&gt;and then&lt;/em&gt; delegates to children is both an LLM caller and a parent. You can't distinguish "aggregated subtotal" from "coincidentally equal own cost" by looking at the numbers alone.&lt;/p&gt;
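&lt;p&gt;The ambiguity is worth staring at, because the two cases are byte-identical:&lt;/p&gt;

```python
# Case 1: the parent's cost is a rollup of its children (true total: $0.05).
# Case 2: the parent made its own $0.05 LLM call (true total: $0.10).
# Both serialize to exactly the same data.
case_rollup = {"parent_cost": 0.05, "child_costs": [0.02, 0.03]}
case_own_call = {"parent_cost": 0.05, "child_costs": [0.02, 0.03]}

assert case_rollup == case_own_call  # nothing in the numbers distinguishes them
```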

&lt;p&gt;And this compounds: in a tree of arbitrary height, you're not double-counting — you're potentially N-counting, with the ambiguity multiplying at every level.&lt;/p&gt;

&lt;h2&gt;What to do about it&lt;/h2&gt;

&lt;p&gt;The double-counting issue manifests differently for each metric type. Here's how I handle each one.&lt;/p&gt;

&lt;h3&gt;Cost and tokens&lt;/h3&gt;

&lt;p&gt;The approach I landed on: &lt;code&gt;sum(s.cost for s in spans if s.cost is not None)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This works regardless of span kind taxonomy because it relies on the data, not the labels. Orchestration spans with &lt;code&gt;None&lt;/code&gt; cost are excluded. LLM spans with &lt;code&gt;0.0&lt;/code&gt; cost (cached responses, free-tier models) are correctly included. Non-LLM spans that legitimately have cost (paid API tool calls) are also correctly included. In my experience, this is more robust than filtering by span kind, which requires knowing every possible kind value across every instrumentation library.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;None&lt;/code&gt; vs &lt;code&gt;0&lt;/code&gt; distinction is critical here and easy to get wrong. &lt;code&gt;None&lt;/code&gt; means "this span didn't measure cost" — a TOOL span, a CHAIN wrapper. &lt;code&gt;0.0&lt;/code&gt; means "this span measured cost and it was zero" — a cached LLM response, a free-tier model call. If you collapse &lt;code&gt;None&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt; before summing — a common shortcut — you lose the ability to tell "no cost data" from "genuinely free." Your medians shift toward zero, your comparisons break, and you won't see it in the output because zero looks reasonable.&lt;/p&gt;

&lt;p&gt;This approach works because the convention places cost and token data exclusively on the spans that generated them — orchestration spans have &lt;code&gt;None&lt;/code&gt;, not a subtotal. It's a pragmatic shortcut, not a general solution: if a parent span carried an aggregated subtotal as a real value, None-filtering would silently include it. You'd need true self-time-style subtraction to handle that case. But in practice, the convention holds often enough that filtering on &lt;code&gt;None&lt;/code&gt; is the more robust default.&lt;/p&gt;
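&lt;p&gt;In code, the distinction is one conditional. The span data here is hypothetical:&lt;/p&gt;

```python
# None means "this span didn't measure cost"; 0.0 means "measured, and free".
spans = [
    {"name": "root", "kind": "AGENT", "cost": None},    # orchestration wrapper
    {"name": "plan", "kind": "LLM", "cost": 0.02},
    {"name": "cached", "kind": "LLM", "cost": 0.0},     # cached response: measured, zero
    {"name": "search", "kind": "TOOL", "cost": None},
    {"name": "respond", "kind": "LLM", "cost": 0.03},
]

total = sum(s["cost"] for s in spans if s["cost"] is not None)   # 0.05
# Collapsing None to 0 before aggregating would make the wrapper spans look
# like free LLM calls and drag per-span medians toward zero.
measured = [s["cost"] for s in spans if s["cost"] is not None]   # [0.02, 0.0, 0.03]
```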

&lt;h3&gt;Step count&lt;/h3&gt;

&lt;p&gt;This one is less about correctness and more about what you're trying to measure. A 3-step agent (plan, search, respond) wrapped in a CHAIN has 4 total spans. &lt;code&gt;len(spans)&lt;/code&gt; returns 4, not 3. Whether that's "wrong" depends on the question. If you're asking "how complex is this trace's orchestration," total span count is fine. If you're asking "how many things did the agent actually do," I found leaf spans — spans with no children — to be more useful. The orchestration wrappers are envelopes, not actions. Though it's worth noting that the boundary isn't always clean — a "search" step might be a parent span that delegates to an LLM call for query parsing. In that case, "search" is a logical step but not a leaf. What you're really counting with leaves is execution primitives, not logical operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;parent_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spans&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;leaves&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spans&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parent_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One caveat: if the tree is incomplete — child spans missing due to instrumentation gaps or partial exports — a parent will look like a leaf and inflate the count. In practice this is rare with well-instrumented code, but worth knowing about.&lt;/p&gt;

&lt;h3&gt;Duration&lt;/h3&gt;

&lt;p&gt;Summing span durations is always wrong for traces — a parent span's duration overlaps its children. This is the classic self-time problem that APM tools have solved at the visualization layer. What you want for trace-level duration is wall-clock time: &lt;code&gt;max(end_time) - min(start_time)&lt;/code&gt; across all spans. That gives you total elapsed time without double-counting overlapping execution. This works correctly even when branches execute in parallel.&lt;/p&gt;

&lt;p&gt;But for per-span analysis — comparing "how long does the search step take across 100 traces" — each span's own duration is exactly right. Even for parent spans, where the duration tells you how long that sub-pipeline consumed end-to-end. Group by span name, compare independently. This is valid at any tree depth because you're comparing the same span across traces, not summing different spans within a trace.&lt;/p&gt;
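&lt;p&gt;With hypothetical timestamps, the gap between the two aggregations is stark:&lt;/p&gt;

```python
# Seconds since trace start; the root encloses its children, as parents do.
spans = [
    {"name": "root", "start": 0.0, "end": 7.1},
    {"name": "plan", "start": 0.1, "end": 2.0},
    {"name": "search", "start": 2.0, "end": 4.5},
    {"name": "respond", "start": 4.5, "end": 7.0},
]

# Wrong for trace totals: the root's 7.1s overlaps every child.
naive_sum = sum(s["end"] - s["start"] for s in spans)  # 14.0

# Right: wall-clock elapsed time across the whole trace.
wall_clock = max(s["end"] for s in spans) - min(s["start"] for s in spans)  # 7.1
```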

&lt;h2&gt;How major platforms handle it&lt;/h2&gt;

&lt;p&gt;The major observability platforms all address this, though the reasoning behind their approaches isn't always well-documented. Here's what I've gathered from their docs and public data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phoenix / OpenInference&lt;/strong&gt; relies on span kind. The &lt;a href="https://arize-ai.github.io/openinference/spec/semantic_conventions.html" rel="noopener noreferrer"&gt;OpenInference semantic conventions&lt;/a&gt; define &lt;code&gt;llm.token_count.*&lt;/code&gt; and &lt;code&gt;llm.cost.*&lt;/code&gt; attributes specifically for &lt;a href="https://github.com/Arize-ai/openinference/blob/main/spec/llm_spans.md" rel="noopener noreferrer"&gt;LLM spans&lt;/a&gt; — CHAIN, AGENT, and TOOL spans don't typically carry them. Phoenix also &lt;a href="https://arize.com/docs/phoenix/tracing/how-to-tracing/cost-tracking" rel="noopener noreferrer"&gt;computes cost server-side&lt;/a&gt; by combining token counts with built-in model pricing tables, rather than relying on pre-computed cost attributes on spans — the two public Phoenix trace datasets I tested (&lt;code&gt;context-retrieval&lt;/code&gt; and &lt;code&gt;random&lt;/code&gt;) have no &lt;code&gt;llm.cost.*&lt;/code&gt; attributes, consistent with this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langfuse&lt;/strong&gt; uses &lt;a href="https://langfuse.com/docs/observability/features/observation-types" rel="noopener noreferrer"&gt;observation types&lt;/a&gt;: generation, span, embedding, and several others. Only generation and embedding carry &lt;a href="https://langfuse.com/docs/observability/features/token-and-cost-tracking" rel="noopener noreferrer"&gt;cost and token data&lt;/a&gt;. When the Microsoft Agent Framework integration produced double-counts, the root cause was that any span with a &lt;code&gt;model&lt;/code&gt; attribute was auto-classified as a generation. The &lt;a href="https://github.com/orgs/langfuse/discussions/11252" rel="noopener noreferrer"&gt;discussion&lt;/a&gt; shows this is still being worked through — the architecture is sound, but it depends on the instrumentation not accidentally triggering the heuristic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; records token usage on LLM call runs. Their &lt;a href="https://docs.langchain.com/langsmith/cost-tracking" rel="noopener noreferrer"&gt;cost tracking docs&lt;/a&gt; describe the trace tree as displaying total usage for the entire trace, aggregated values for each parent run, and token and cost breakdowns for each child run. The docs don't specify whether parent aggregation is stored or computed at display time, but the architecture clearly separates individual run data from rolled-up totals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Braintrust&lt;/strong&gt; fixes the source rather than filtering at consumption. Their &lt;a href="https://www.braintrust.dev/docs/changelog" rel="noopener noreferrer"&gt;v3.1.0 changelog&lt;/a&gt; notes a fix for "token double counting between parent and child spans in Vercel AI SDK integration." Their data model supports DAG-structured spans, and aggregation happens at query time via their BTQL language rather than at export time.&lt;/p&gt;

&lt;p&gt;The common thread: every platform puts cost and tokens on the actual LLM call, not on the orchestration wrapper. The convention exists. It's just not documented as a rule that instrumentation authors are expected to follow.&lt;/p&gt;

&lt;h2&gt;The spec gap&lt;/h2&gt;

&lt;p&gt;Traditional APM solved duration double-counting by establishing self-time as a first-class concept — Elastic APM has &lt;a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-metrics.html" rel="noopener noreferrer"&gt;&lt;code&gt;span.self_time&lt;/code&gt;&lt;/a&gt; metrics, and most APM UIs distinguish between a span's total time and its exclusive time. The solution was baked into the tooling because the problem was well-understood.&lt;/p&gt;

&lt;p&gt;AI trace metrics don't have an equivalent. Neither OpenInference nor the OpenTelemetry &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;GenAI semantic conventions&lt;/a&gt; specify whether &lt;code&gt;llm.token_count.*&lt;/code&gt; or &lt;code&gt;gen_ai.usage.*&lt;/code&gt; attributes represent the span's own values or cumulative subtotals of children. The conventions — still at the earliest maturity level, with work begun in &lt;a href="https://opentelemetry.io/blog/2024/otel-generative-ai/" rel="noopener noreferrer"&gt;early 2024&lt;/a&gt; — don't define cost attributes at all. The &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/" rel="noopener noreferrer"&gt;agent spans spec&lt;/a&gt; defines span types but says nothing about token or cost rollup. OpenInference, which does define &lt;code&gt;llm.cost.total&lt;/code&gt;, is &lt;a href="https://arize-ai.github.io/openinference/spec/semantic_conventions.html" rel="noopener noreferrer"&gt;ahead here&lt;/a&gt; but still doesn't clarify the aggregation semantics.&lt;/p&gt;

&lt;p&gt;One sentence in either spec would fix this: &lt;em&gt;"Token count and cost attributes on a span represent that span's own values, not cumulative subtotals of descendant spans."&lt;/em&gt; That turns an implicit convention into a guarantee that instrumentation authors can code against. Now, while the conventions are still being shaped, is the time to say it.&lt;/p&gt;

&lt;p&gt;Until that happens, defensive coding is the practical answer: filter on &lt;code&gt;None&lt;/code&gt; for aggregation, don't assume every instrumentation follows the convention, and validate against known-good data before trusting the numbers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I ran into this while building &lt;a href="https://github.com/khan5v/kalibra" rel="noopener noreferrer"&gt;OpenInference support in Kalibra&lt;/a&gt;, a regression detection tool for AI agent traces. The tree aggregation problem was one of the design decisions that required real thought — not because it's algorithmically hard, but because getting it wrong produces numbers that look right.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>opentelemetry</category>
      <category>python</category>
    </item>
    <item>
      <title>How do you test your LLM agents before shipping changes?</title>
      <dc:creator>Vladimir</dc:creator>
      <pubDate>Tue, 24 Mar 2026 21:55:42 +0000</pubDate>
      <link>https://forem.com/khan5v/how-do-you-test-your-llm-agents-before-shipping-changes-go8</link>
      <guid>https://forem.com/khan5v/how-do-you-test-your-llm-agents-before-shipping-changes-go8</guid>
      <description>&lt;p&gt;Genuinely curious how other engineers are handling this. &lt;/p&gt;

&lt;p&gt;Every time I change a prompt, swap a model, or tweak a tool, I've struggled to get a reliable answer to a simple question: &lt;strong&gt;did the agent get better or worse overall?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The challenge I keep hitting is that aggregate metrics (average success rate, total tokens) usually look fine, but specific task types silently break. The easy tasks improve, masking the regressions on the hard ones. By the time someone notices, it's already in production.&lt;/p&gt;

&lt;p&gt;Here’s what I tried before landing on something that actually worked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge scoring:&lt;/strong&gt; Too inconsistent between runs. Hard to tell if a score change was real or just statistical noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual spot-checking:&lt;/strong&gt; Useful early on, but didn't scale past ~10 task types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparing trace-level metrics statistically:&lt;/strong&gt; Looking at distributions of tokens, duration, and cost per specific task ended up being the most reliable signal, so much so that I ended up building my own tooling around it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What does your testing setup look like? Do you have CI gates that block deploys on agent regressions, or is it mostly manual review?&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
