<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vladimir</title>
    <description>The latest articles on Forem by Vladimir (@khan5v).</description>
    <link>https://forem.com/khan5v</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842227%2F95c64040-57e0-4257-b47b-59bfefea0aa5.png</url>
      <title>Forem: Vladimir</title>
      <link>https://forem.com/khan5v</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/khan5v"/>
    <language>en</language>
    <item>
      <title>Your model migration passed. Here's what the aggregate didn't show.</title>
      <dc:creator>Vladimir</dc:creator>
      <pubDate>Fri, 27 Mar 2026 08:33:28 +0000</pubDate>
      <link>https://forem.com/khan5v/your-model-migration-passed-heres-what-the-aggregate-didnt-show-147e</link>
      <guid>https://forem.com/khan5v/your-model-migration-passed-heres-what-the-aggregate-didnt-show-147e</guid>
      <description>&lt;p&gt;&lt;a href="https://arxiv.org/abs/2603.03823" rel="noopener noreferrer"&gt;SWE-CI&lt;/a&gt;, a benchmark published this month by Alibaba, tested whether AI coding agents maintain correct behavior over time. The result: &lt;a href="https://awesomeagents.ai/news/alibaba-swe-ci-ai-coding-agents-long-term-maintenance/" rel="noopener noreferrer"&gt;75% of them break previously working code&lt;/a&gt; — and model upgrades are one of the triggers.&lt;/p&gt;

&lt;p&gt;This isn't unique to coding agents. Every team running an LLM-powered agent hits the same problem quarterly: the provider deprecates your model, or a newer version promises better performance, or you're switching providers for cost. You change one string in your config, run the eval, and check the dashboard.&lt;/p&gt;

&lt;p&gt;The dashboard looks fine. But underneath, the behavior may have shifted in ways the aggregate doesn't surface.&lt;/p&gt;

&lt;h2&gt;The quarterly forced experiment&lt;/h2&gt;

&lt;p&gt;Model deprecations used to be annual. Now they're quarterly. Claude Opus 3 was retired earlier this year. GPT-4 Turbo was sunset last year. Each deprecation forces every team on that model to migrate — not on their schedule, on the provider's.&lt;/p&gt;

&lt;p&gt;And it's not just deprecations. Teams switch models for cost optimization, latency improvements, or capability upgrades. Every switch is a forced experiment where the variables aren't controlled — the new model behaves differently on every task, and the differences are invisible in the aggregate.&lt;/p&gt;

&lt;h2&gt;How migrations hide regressions&lt;/h2&gt;

&lt;p&gt;Different models have different strengths across task types. GPT-4o might be better at structured extraction while Claude excels at multi-step reasoning. A model that's faster might produce shorter responses — which looks like a cost improvement until you realize the shorter responses are &lt;em&gt;incomplete&lt;/em&gt; responses.&lt;/p&gt;

&lt;p&gt;The standard migration test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Success rate: 82% → 83%. Ship it.&lt;/li&gt;
&lt;li&gt;Median cost: $0.04 → $0.03. Even better.&lt;/li&gt;
&lt;li&gt;Median latency: 6.2s → 5.8s. Faster too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this misses: 8 out of 25 task types regressed. The high-volume, low-complexity tasks got slightly better — inflating the aggregate. The complex business flows that make up 15% of traffic broke silently.&lt;/p&gt;

&lt;p&gt;The aggregate improved. Key task types degraded. And nothing in the top-line numbers flagged it.&lt;/p&gt;
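&lt;p&gt;The masking arithmetic is easy to sketch: a traffic-weighted average can rise while a key segment collapses. The numbers below are invented for illustration.&lt;/p&gt;

```python
# Hypothetical success rates: high-volume easy tasks improve slightly,
# the complex 15% of traffic regresses badly.
easy = {"share": 0.85, "before": 0.84, "after": 0.88}
hard = {"share": 0.15, "before": 0.70, "after": 0.55}

def aggregate(key):
    # Traffic-weighted success rate across both segments.
    return easy["share"] * easy[key] + hard["share"] * hard[key]

before = aggregate("before")                # 0.819
after = aggregate("after")                  # 0.8305: the top line improves...
regressed = hard["after"] - hard["before"]  # -0.15: ...while hard tasks lose 15 points
```

The dashboard sees only `before` and `after`; the 15-point drop on the hard segment never surfaces.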

&lt;p&gt;This is the same failure mode I described in &lt;a href="https://dev.to/blog/kalibra-regression-detection/"&gt;Aggregate metrics are a blind spot in agent evaluation&lt;/a&gt; — but model migrations make it worse because they change &lt;em&gt;everything at once&lt;/em&gt;. A prompt edit affects one step. A model swap affects every LLM call in every trace.&lt;/p&gt;

&lt;h2&gt;What catches it&lt;/h2&gt;

&lt;p&gt;Two things that help when the aggregate looks flat:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistical significance.&lt;/strong&gt; Did the cost actually decrease, or is 50 traces not enough to tell? Kalibra computes bootstrap confidence intervals automatically — if the CI on the median delta includes zero, the "improvement" can't be distinguished from sampling noise.&lt;/p&gt;
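&lt;p&gt;For intuition, a percentile bootstrap over the difference of medians looks roughly like this. It's a generic sketch of the technique, not Kalibra's exact implementation:&lt;/p&gt;

```python
import random

def median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

def bootstrap_ci_median_delta(baseline, current, n_boot=2000, alpha=0.05):
    # Resample each population with replacement, recompute the median delta,
    # and take the central 95% of the resampled deltas as the interval.
    deltas = []
    for _ in range(n_boot):
        b = [random.choice(baseline) for _ in baseline]
        c = [random.choice(current) for _ in current]
        deltas.append(median(c) - median(b))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# If the interval straddles zero, the observed delta is indistinguishable
# from sampling noise at this sample size.
```

With 50 traces per side, a cost delta of a cent or two routinely produces an interval that straddles zero.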

&lt;p&gt;&lt;strong&gt;Per-task breakdown.&lt;/strong&gt; Which tasks got better? Which got worse? If 8 task types flipped from pass to fail, that's a regression — even if the aggregate went up.&lt;/p&gt;

&lt;p&gt;Here's what this looks like with &lt;a href="https://github.com/khan5v/kalibra" rel="noopener noreferrer"&gt;Kalibra&lt;/a&gt;, an open-source CLI that compares two trace populations statistically. It works on any JSONL traces that include outcomes (from an LLM-as-judge, deterministic eval, or the provider's finish reason). We ran 25 tasks through the same agent, baseline vs current, 50 traces each. The aggregate shows improvement everywhere:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▲ Token usage       963 → 337 tokens/trace (median)  -65.0%
                    95% CI [-75.4%, -14.9%]
▲ Duration          7.1s → 3.3s median  -52.9%
                    95% CI [-62.3%, -14.0%]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tokens down 65% with a tight CI. Duration halved. Both statistically significant. If you stopped here, it looks like a clear win.&lt;/p&gt;

&lt;p&gt;The per-task breakdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Trace breakdown
▼ Per trace         25 matched — ✗ 10 regressed

Quality gates
  [ OK ] token_delta_pct &amp;lt;= -10   actual: -65.00
  [FAIL] regressions &amp;lt;= 2         actual: 10.00

FAILED — quality gate violation (exit code 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;10 task types broke. The token gate passed — tokens &lt;em&gt;did&lt;/em&gt; go down. The regressions gate failed — too many tasks regressed. The tokens went down because complex responses were cut short, not because the agent got more efficient. Instead of generating a 40-line SQL query with explanatory comments, the agent output "You can use a SELECT with GROUP BY and HAVING" — drastically fewer tokens, but missing the actual answer.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;regressions &amp;lt;= 2&lt;/code&gt; gate surfaced what the aggregate metric missed.&lt;/p&gt;

&lt;h2&gt;One file, two populations&lt;/h2&gt;

&lt;p&gt;The practical question: how do you compare traces from two models without managing separate files, separate runs, separate export pipelines?&lt;/p&gt;

&lt;p&gt;Tag each trace with its model version. Put everything in one file. Split at compare time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# kalibra.yml — assuming all traces are in one centralized log&lt;/span&gt;
&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;baseline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./traces.jsonl&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;model_version == gpt-4o&lt;/span&gt;
  &lt;span class="na"&gt;current&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./traces.jsonl&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;model_version == claude-4.6&lt;/span&gt;

&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;task_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;task_name&lt;/span&gt;   &lt;span class="c1"&gt;# matches the 'task_name' key in your JSONL trace objects&lt;/span&gt;

&lt;span class="na"&gt;require&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;regressions &amp;lt;= &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cost_delta_pct &amp;lt;= &lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;success_rate_delta &amp;gt;= -5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kalibra compare   &lt;span class="c"&gt;# reads config, exits 1 on gate failure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;where&lt;/code&gt; filters traces by metadata — Prometheus-style matchers (&lt;code&gt;==&lt;/code&gt;, &lt;code&gt;!=&lt;/code&gt;, &lt;code&gt;=~&lt;/code&gt;, &lt;code&gt;!~&lt;/code&gt;). Both populations come from the same file, split by a tag you control. No separate export pipelines.&lt;/p&gt;
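&lt;p&gt;A matching trace file is just one JSONL record per trace, tagged with the model that produced it. Field names beyond &lt;code&gt;task_name&lt;/code&gt; and &lt;code&gt;model_version&lt;/code&gt; are illustrative:&lt;/p&gt;

```python
import json

# Two records for the same task, one per model. A real log would hold
# 50 traces per task per model; the extra fields here are illustrative.
traces = [
    {"task_name": "sql_report", "model_version": "gpt-4o",
     "success": True, "tokens": 963, "cost": 0.04, "duration_s": 7.1},
    {"task_name": "sql_report", "model_version": "claude-4.6",
     "success": False, "tokens": 337, "cost": 0.03, "duration_s": 3.3},
]

with open("traces.jsonl", "w") as f:
    for t in traces:
        f.write(json.dumps(t) + "\n")

# The `where` clauses in kalibra.yml split this one file into two populations:
baseline = [t for t in traces if t["model_version"] == "gpt-4o"]
current = [t for t in traces if t["model_version"] == "claude-4.6"]
```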

&lt;p&gt;The three gates check three different failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;regressions &amp;lt;= 2&lt;/code&gt; — the per-task breakdown. Catches hidden regressions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cost_delta_pct &amp;lt;= 30&lt;/code&gt; — the cost didn't blow up.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;success_rate_delta &amp;gt;= -5&lt;/code&gt; — the aggregate didn't tank either.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any gate fails, exit code 1. The deploy pauses until you've looked at which tasks were affected.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/khan5v/kalibra" rel="noopener noreferrer"&gt;Kalibra&lt;/a&gt;&lt;/strong&gt; — regression detection for AI agents · &lt;a href="https://github.com/khan5v/kalibra" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://kalibra.cc" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; · &lt;a href="https://pypi.org/project/kalibra/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>automation</category>
    </item>
    <item>
      <title>When agent trace metrics lie: the span tree double-counting problem</title>
      <dc:creator>Vladimir</dc:creator>
      <pubDate>Wed, 25 Mar 2026 18:06:27 +0000</pubDate>
      <link>https://forem.com/khan5v/when-agent-trace-metrics-lie-the-span-tree-double-counting-problem-3lhp</link>
      <guid>https://forem.com/khan5v/when-agent-trace-metrics-lie-the-span-tree-double-counting-problem-3lhp</guid>
      <description>&lt;p&gt;I was building OpenInference support for an agent trace comparison tool when the token counts came back double what they should have been. The code was simple — sum tokens across all spans in a trace. The bug was that "all spans" included orchestration wrappers that carried their children's totals. Nothing crashed. The numbers just looked plausible enough to ship.&lt;/p&gt;

&lt;p&gt;This is the span tree double-counting problem. It's not hard to fix once you see it, but it's easy to miss because the wrong numbers look reasonable.&lt;/p&gt;

&lt;h2&gt;The tree&lt;/h2&gt;

&lt;p&gt;Agent traces are trees. This isn't a new data structure — &lt;a href="https://opentelemetry.io/docs/concepts/signals/traces/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; has used tree-structured traces for distributed systems since long before LLMs were mainstream. &lt;a href="https://github.com/Arize-ai/openinference" rel="noopener noreferrer"&gt;OpenInference&lt;/a&gt;, the AI-specific semantic convention layer built on top of OpenTelemetry, inherits this model and adds span kinds tailored to AI workloads: LLM, TOOL, CHAIN, AGENT, RETRIEVER, and others. Every OpenInference trace is a valid &lt;a href="https://opentelemetry.io/docs/specs/otel/protocol/" rel="noopener noreferrer"&gt;OTLP&lt;/a&gt; trace — the conventions give attribute names their AI-specific meaning.&lt;/p&gt;

&lt;p&gt;A root AGENT or CHAIN span wraps child spans — LLM calls, tool invocations, retrievals. Those children can have children of their own. A planning step spawns sub-queries. A tool call triggers an LLM to parse the result. The depth is arbitrary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root (AGENT)
├── plan (LLM)         ← 500 input, 200 output tokens, $0.02
├── search (TOOL)      ← no tokens, no cost
│   └── parse (LLM)   ← 300 input, 100 output tokens, $0.01
└── respond (LLM)      ← 800 input, 400 output tokens, $0.03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four spans. Three are LLM calls with token counts and costs. One is a tool invocation. The AGENT span at the root is an orchestration wrapper — it didn't make an LLM call itself.&lt;/p&gt;

&lt;p&gt;The total cost is $0.06. The total tokens are 2,300. Straightforward — you sum the three LLM spans.&lt;/p&gt;

&lt;p&gt;But whether this works depends entirely on your instrumentation. What happens when parent spans &lt;em&gt;also&lt;/em&gt; carry token and cost attributes?&lt;/p&gt;

&lt;h2&gt;An old problem in new clothes&lt;/h2&gt;

&lt;p&gt;If you've worked with distributed tracing, you might recognize this. Traditional Application Performance Monitoring (APM) has dealt with a version of it for years under the name &lt;strong&gt;self-time&lt;/strong&gt; (or &lt;strong&gt;exclusive time&lt;/strong&gt;) — the duration a span spends doing its own work, excluding time enclosed by children. Elastic APM computes &lt;a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-metrics.html" rel="noopener noreferrer"&gt;&lt;code&gt;span.self_time&lt;/code&gt;&lt;/a&gt; metrics specifically for this: they subtract child durations from the parent's total to produce a breakdown visualization that doesn't double-count.&lt;/p&gt;

&lt;p&gt;The AI-specific twist is that the double-counting isn't about &lt;strong&gt;duration&lt;/strong&gt; — which is inherently hierarchical, since parent spans enclose children by definition. It's about &lt;strong&gt;metric values on spans&lt;/strong&gt;: tokens and costs. These are point measurements that should live on the specific span that generated them. They are not hierarchical quantities. When a parent span carries &lt;code&gt;total_tokens: 2300&lt;/code&gt; as a subtotal of its children, and you sum across all spans, you get 4,600 tokens. Double the actual value.&lt;/p&gt;

&lt;p&gt;Duration double-counting is a display and analysis problem — the data itself is correct; you just need to compute self-time. Token and cost double-counting is a data problem — the same value exists in two places, and the spec doesn't tell you which one is the source of truth.&lt;/p&gt;

&lt;h2&gt;Where things go wrong&lt;/h2&gt;

&lt;p&gt;Some instrumentations record aggregated subtotals on parent spans. A parent AGENT span might carry &lt;code&gt;total_tokens: 2300&lt;/code&gt; — the sum of its children. If you now sum &lt;em&gt;all&lt;/em&gt; spans, you get 4,600 tokens.&lt;/p&gt;
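&lt;p&gt;A minimal reproduction, using the tree from above plus a subtotal on the root (span data hypothetical):&lt;/p&gt;

```python
# The tree from the diagram, except the root AGENT span also carries its
# children's subtotal — the pattern some instrumentations produce.
spans = [
    {"name": "root", "kind": "AGENT", "total_tokens": 2300},  # subtotal, not own usage
    {"name": "plan", "kind": "LLM", "total_tokens": 700},
    {"name": "search", "kind": "TOOL", "total_tokens": None},
    {"name": "parse", "kind": "LLM", "total_tokens": 400},
    {"name": "respond", "kind": "LLM", "total_tokens": 1200},
]

naive = sum(s["total_tokens"] or 0 for s in spans)  # 4600: root counted on top of children
llm_only = sum(s["total_tokens"] for s in spans
               if s["kind"] == "LLM" and s["total_tokens"] is not None)  # 2300
```

Nothing about `naive == 4600` looks broken in a dashboard; it's just silently double the real usage.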

&lt;p&gt;This isn't hypothetical. Langfuse has seen &lt;a href="https://github.com/langfuse/langfuse/issues/10914" rel="noopener noreferrer"&gt;related&lt;/a&gt; &lt;a href="https://github.com/orgs/langfuse/discussions/11252" rel="noopener noreferrer"&gt;reports&lt;/a&gt; surface in different forms. The &lt;a href="https://github.com/orgs/langfuse/discussions/11252" rel="noopener noreferrer"&gt;Microsoft Agent Framework&lt;/a&gt; integration ran into it directly: the framework's &lt;code&gt;invoke_agent&lt;/code&gt; spans carried a &lt;code&gt;gen_ai.request.model&lt;/code&gt; attribute, which caused Langfuse to classify them as generations and infer token counts — even though the framework explicitly set &lt;code&gt;capture_usage=False&lt;/code&gt;. The result: both the orchestration span and the nested LLM calls got counted, doubling the totals. The presence of a &lt;code&gt;model&lt;/code&gt; attribute on a non-LLM span was enough to trigger it.&lt;/p&gt;

&lt;p&gt;What makes this tricky is that no specification forbids putting aggregated values on parent spans. OpenInference defines &lt;code&gt;llm.token_count.prompt&lt;/code&gt; and &lt;code&gt;llm.cost.total&lt;/code&gt; as span-level attributes but doesn't say "only attach these to leaf spans." OpenTelemetry's &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/" rel="noopener noreferrer"&gt;GenAI semantic conventions&lt;/a&gt; define &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt; on inference spans but don't warn about aggregation. These conventions are still in &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;Development status&lt;/a&gt; — the earliest maturity level in OTel's lifecycle — and they define no cost attributes at all. The convention — cost and tokens live only on the actual LLM call — is implicit, not specified.&lt;/p&gt;

&lt;p&gt;When conventions are implicit, they get violated. And the violations are silent — your numbers are wrong, but nothing crashes. Not every dataset has this problem, which makes it harder to catch when one does.&lt;/p&gt;

&lt;p&gt;And here's why you can't just detect it after the fact: imagine a parent span with &lt;code&gt;cost: $0.05&lt;/code&gt; and two children costing &lt;code&gt;$0.02&lt;/code&gt; and &lt;code&gt;$0.03&lt;/code&gt;. Is the parent's cost an aggregated subtotal of its children — meaning you should ignore it — or did the parent make its own LLM call that happened to cost &lt;code&gt;$0.05&lt;/code&gt;? That's not a contrived scenario: an orchestration step that reasons about which tool to call &lt;em&gt;and then&lt;/em&gt; delegates to children is both an LLM caller and a parent. You can't distinguish "aggregated subtotal" from "coincidentally equal own cost" by looking at the numbers alone.&lt;/p&gt;
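&lt;p&gt;The ambiguity is worth staring at, because the two cases are byte-identical:&lt;/p&gt;

```python
# Case 1: the parent's cost is a rollup of its children (true total: $0.05).
# Case 2: the parent made its own $0.05 LLM call (true total: $0.10).
# Both serialize to exactly the same data.
case_rollup = {"parent_cost": 0.05, "child_costs": [0.02, 0.03]}
case_own_call = {"parent_cost": 0.05, "child_costs": [0.02, 0.03]}

assert case_rollup == case_own_call  # nothing in the numbers distinguishes them
```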

&lt;p&gt;And this compounds: in a tree of arbitrary height, you're not double-counting — you're potentially N-counting, with the ambiguity multiplying at every level.&lt;/p&gt;

&lt;h2&gt;What to do about it&lt;/h2&gt;

&lt;p&gt;The double-counting issue manifests differently for each metric type. Here's how I handle each one.&lt;/p&gt;

&lt;h3&gt;Cost and tokens&lt;/h3&gt;

&lt;p&gt;The approach I landed on: &lt;code&gt;sum(s.cost for s in spans if s.cost is not None)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This works regardless of span kind taxonomy because it relies on the data, not the labels. Orchestration spans with &lt;code&gt;None&lt;/code&gt; cost are excluded. LLM spans with &lt;code&gt;0.0&lt;/code&gt; cost (cached responses, free-tier models) are correctly included. Non-LLM spans that legitimately have cost (paid API tool calls) are also correctly included. In my experience, this is more robust than filtering by span kind, which requires knowing every possible kind value across every instrumentation library.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;None&lt;/code&gt; vs &lt;code&gt;0&lt;/code&gt; distinction is critical here and easy to get wrong. &lt;code&gt;None&lt;/code&gt; means "this span didn't measure cost" — a TOOL span, a CHAIN wrapper. &lt;code&gt;0.0&lt;/code&gt; means "this span measured cost and it was zero" — a cached LLM response, a free-tier model call. If you collapse &lt;code&gt;None&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt; before summing — a common shortcut — you lose the ability to tell "no cost data" from "genuinely free." Your medians shift toward zero, your comparisons break, and you won't see it in the output because zero looks reasonable.&lt;/p&gt;

&lt;p&gt;This approach works because the convention places cost and token data exclusively on the spans that generated them — orchestration spans have &lt;code&gt;None&lt;/code&gt;, not a subtotal. It's a pragmatic shortcut, not a general solution: if a parent span carried an aggregated subtotal as a real value, None-filtering would silently include it. You'd need true self-time-style subtraction to handle that case. But in practice, the convention holds often enough that filtering on &lt;code&gt;None&lt;/code&gt; is the more robust default.&lt;/p&gt;
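&lt;p&gt;In code, the distinction is one conditional. The span data here is hypothetical:&lt;/p&gt;

```python
# None means "this span didn't measure cost"; 0.0 means "measured, and free".
spans = [
    {"name": "root", "kind": "AGENT", "cost": None},    # orchestration wrapper
    {"name": "plan", "kind": "LLM", "cost": 0.02},
    {"name": "cached", "kind": "LLM", "cost": 0.0},     # cached response: measured, zero
    {"name": "search", "kind": "TOOL", "cost": None},
    {"name": "respond", "kind": "LLM", "cost": 0.03},
]

total = sum(s["cost"] for s in spans if s["cost"] is not None)   # 0.05
# Collapsing None to 0 before aggregating would make the wrapper spans look
# like free LLM calls and drag per-span medians toward zero.
measured = [s["cost"] for s in spans if s["cost"] is not None]   # [0.02, 0.0, 0.03]
```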

&lt;h3&gt;Step count&lt;/h3&gt;

&lt;p&gt;This one is less about correctness and more about what you're trying to measure. A 3-step agent (plan, search, respond) wrapped in a CHAIN has 4 total spans. &lt;code&gt;len(spans)&lt;/code&gt; returns 4, not 3. Whether that's "wrong" depends on the question. If you're asking "how complex is this trace's orchestration," total span count is fine. If you're asking "how many things did the agent actually do," I found leaf spans — spans with no children — to be more useful. The orchestration wrappers are envelopes, not actions. Though it's worth noting that the boundary isn't always clean — a "search" step might be a parent span that delegates to an LLM call for query parsing. In that case, "search" is a logical step but not a leaf. What you're really counting with leaves is execution primitives, not logical operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;parent_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spans&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;leaves&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spans&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parent_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One caveat: if the tree is incomplete — child spans missing due to instrumentation gaps or partial exports — a parent will look like a leaf and inflate the count. In practice this is rare with well-instrumented code, but worth knowing about.&lt;/p&gt;

&lt;h3&gt;Duration&lt;/h3&gt;

&lt;p&gt;Summing span durations is always wrong for traces — a parent span's duration overlaps its children. This is the classic self-time problem that APM tools have solved at the visualization layer. What you want for trace-level duration is wall-clock time: &lt;code&gt;max(end_time) - min(start_time)&lt;/code&gt; across all spans. That gives you total elapsed time without double-counting overlapping execution. This works correctly even when branches execute in parallel.&lt;/p&gt;

&lt;p&gt;But for per-span analysis — comparing "how long does the search step take across 100 traces" — each span's own duration is exactly right. Even for parent spans, where the duration tells you how long that sub-pipeline consumed end-to-end. Group by span name, compare independently. This is valid at any tree depth because you're comparing the same span across traces, not summing different spans within a trace.&lt;/p&gt;
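&lt;p&gt;With hypothetical timestamps, the gap between the two aggregations is stark:&lt;/p&gt;

```python
# Seconds since trace start; the root encloses its children, as parents do.
spans = [
    {"name": "root", "start": 0.0, "end": 7.1},
    {"name": "plan", "start": 0.1, "end": 2.0},
    {"name": "search", "start": 2.0, "end": 4.5},
    {"name": "respond", "start": 4.5, "end": 7.0},
]

# Wrong for trace totals: the root's 7.1s overlaps every child.
naive_sum = sum(s["end"] - s["start"] for s in spans)  # 14.0

# Right: wall-clock elapsed time across the whole trace.
wall_clock = max(s["end"] for s in spans) - min(s["start"] for s in spans)  # 7.1
```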

&lt;h2&gt;How major platforms handle it&lt;/h2&gt;

&lt;p&gt;The major observability platforms all address this, though the reasoning behind their approaches isn't always well-documented. Here's what I've gathered from their docs and public data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phoenix / OpenInference&lt;/strong&gt; relies on span kind. The &lt;a href="https://arize-ai.github.io/openinference/spec/semantic_conventions.html" rel="noopener noreferrer"&gt;OpenInference semantic conventions&lt;/a&gt; define &lt;code&gt;llm.token_count.*&lt;/code&gt; and &lt;code&gt;llm.cost.*&lt;/code&gt; attributes specifically for &lt;a href="https://github.com/Arize-ai/openinference/blob/main/spec/llm_spans.md" rel="noopener noreferrer"&gt;LLM spans&lt;/a&gt; — CHAIN, AGENT, and TOOL spans don't typically carry them. Phoenix also &lt;a href="https://arize.com/docs/phoenix/tracing/how-to-tracing/cost-tracking" rel="noopener noreferrer"&gt;computes cost server-side&lt;/a&gt; by combining token counts with built-in model pricing tables, rather than relying on pre-computed cost attributes on spans — the two public Phoenix trace datasets I tested (&lt;code&gt;context-retrieval&lt;/code&gt; and &lt;code&gt;random&lt;/code&gt;) have no &lt;code&gt;llm.cost.*&lt;/code&gt; attributes, consistent with this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langfuse&lt;/strong&gt; uses &lt;a href="https://langfuse.com/docs/observability/features/observation-types" rel="noopener noreferrer"&gt;observation types&lt;/a&gt;: generation, span, embedding, and several others. Only generation and embedding carry &lt;a href="https://langfuse.com/docs/observability/features/token-and-cost-tracking" rel="noopener noreferrer"&gt;cost and token data&lt;/a&gt;. When the Microsoft Agent Framework integration produced double-counts, the root cause was that any span with a &lt;code&gt;model&lt;/code&gt; attribute was auto-classified as a generation. The &lt;a href="https://github.com/orgs/langfuse/discussions/11252" rel="noopener noreferrer"&gt;discussion&lt;/a&gt; shows this is still being worked through — the architecture is sound, but it depends on the instrumentation not accidentally triggering the heuristic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; records token usage on LLM call runs. Their &lt;a href="https://docs.langchain.com/langsmith/cost-tracking" rel="noopener noreferrer"&gt;cost tracking docs&lt;/a&gt; describe the trace tree as displaying total usage for the entire trace, aggregated values for each parent run, and token and cost breakdowns for each child run. The docs don't specify whether parent aggregation is stored or computed at display time, but the architecture clearly separates individual run data from rolled-up totals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Braintrust&lt;/strong&gt; fixes the source rather than filtering at consumption. Their &lt;a href="https://www.braintrust.dev/docs/changelog" rel="noopener noreferrer"&gt;v3.1.0 changelog&lt;/a&gt; notes a fix for "token double counting between parent and child spans in Vercel AI SDK integration." Their data model supports DAG-structured spans, and aggregation happens at query time via their BTQL language rather than at export time.&lt;/p&gt;

&lt;p&gt;The common thread: every platform puts cost and tokens on the actual LLM call, not on the orchestration wrapper. The convention exists. It's just not documented as a rule that instrumentation authors are expected to follow.&lt;/p&gt;

&lt;h2&gt;The spec gap&lt;/h2&gt;

&lt;p&gt;Traditional APM solved duration double-counting by establishing self-time as a first-class concept — Elastic APM has &lt;a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-metrics.html" rel="noopener noreferrer"&gt;&lt;code&gt;span.self_time&lt;/code&gt;&lt;/a&gt; metrics, and most APM UIs distinguish between a span's total time and its exclusive time. The solution was baked into the tooling because the problem was well-understood.&lt;/p&gt;

&lt;p&gt;AI trace metrics don't have an equivalent. Neither OpenInference nor the OpenTelemetry &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;GenAI semantic conventions&lt;/a&gt; specify whether &lt;code&gt;llm.token_count.*&lt;/code&gt; or &lt;code&gt;gen_ai.usage.*&lt;/code&gt; attributes represent the span's own values or cumulative subtotals of children. The conventions — still at the earliest maturity level, with work begun in &lt;a href="https://opentelemetry.io/blog/2024/otel-generative-ai/" rel="noopener noreferrer"&gt;early 2024&lt;/a&gt; — don't define cost attributes at all. The &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/" rel="noopener noreferrer"&gt;agent spans spec&lt;/a&gt; defines span types but says nothing about token or cost rollup. OpenInference, which does define &lt;code&gt;llm.cost.total&lt;/code&gt;, is &lt;a href="https://arize-ai.github.io/openinference/spec/semantic_conventions.html" rel="noopener noreferrer"&gt;ahead here&lt;/a&gt; but still doesn't clarify the aggregation semantics.&lt;/p&gt;

&lt;p&gt;One sentence in either spec would fix this: &lt;em&gt;"Token count and cost attributes on a span represent that span's own values, not cumulative subtotals of descendant spans."&lt;/em&gt; That turns an implicit convention into a guarantee that instrumentation authors can code against. Now, while the conventions are still being shaped, is the time to say it.&lt;/p&gt;

&lt;p&gt;Until that happens, defensive coding is the practical answer: filter on &lt;code&gt;None&lt;/code&gt; for aggregation, don't assume every instrumentation follows the convention, and validate against known-good data before trusting the numbers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I ran into this while building &lt;a href="https://github.com/khan5v/kalibra" rel="noopener noreferrer"&gt;OpenInference support in Kalibra&lt;/a&gt;, a regression detection tool for AI agent traces. The tree aggregation problem was one of the design decisions that required real thought — not because it's algorithmically hard, but because getting it wrong produces numbers that look right.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>opentelemetry</category>
      <category>python</category>
    </item>
    <item>
      <title>How do you test your LLM agents before shipping changes?</title>
      <dc:creator>Vladimir</dc:creator>
      <pubDate>Tue, 24 Mar 2026 21:55:42 +0000</pubDate>
      <link>https://forem.com/khan5v/how-do-you-test-your-llm-agents-before-shipping-changes-go8</link>
      <guid>https://forem.com/khan5v/how-do-you-test-your-llm-agents-before-shipping-changes-go8</guid>
      <description>&lt;p&gt;Genuinely curious how other engineers are handling this. &lt;/p&gt;

&lt;p&gt;Every time I change a prompt, swap a model, or tweak a tool, I've struggled to get a reliable answer to a simple question: &lt;strong&gt;did the agent get better or worse overall?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The challenge I keep hitting is that aggregate metrics (average success rate, total tokens) usually look fine, but specific task types silently break. The easy tasks improve, masking the regressions on the hard ones. By the time someone notices, it's already in production.&lt;/p&gt;

&lt;p&gt;Here’s what I tried before landing on something that actually worked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge scoring:&lt;/strong&gt; Too inconsistent between runs. Hard to tell if a score change was real or just statistical noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual spot-checking:&lt;/strong&gt; Useful early on, but didn't scale past ~10 task types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparing trace-level metrics statistically:&lt;/strong&gt; Looking at distributions of tokens, duration, and cost per specific task ended up being the most reliable signal, so much so that I ended up building my own tooling around it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What does your testing setup look like? Do you have CI gates that block deploys on agent regressions, or is it mostly manual review?&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
