<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Backboard.io</title>
    <description>The latest articles on Forem by Backboard.io (@backboardio).</description>
    <link>https://forem.com/backboardio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12706%2F4a562173-03b2-4c28-9be4-32f3e22f5474.png</url>
      <title>Forem: Backboard.io</title>
      <link>https://forem.com/backboardio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/backboardio"/>
    <language>en</language>
    <item>
      <title>The Hidden Challenge of Multi-LLM Context Management</title>
      <dc:creator>Jonathan Murray</dc:creator>
      <pubDate>Fri, 24 Apr 2026 20:19:51 +0000</pubDate>
      <link>https://forem.com/backboardio/the-hidden-challenge-of-multi-llm-context-management-1pbh</link>
      <guid>https://forem.com/backboardio/the-hidden-challenge-of-multi-llm-context-management-1pbh</guid>
      <description>&lt;h1&gt;Why token counting isn't a solved problem when building across providers&lt;/h1&gt;

&lt;p&gt;Building AI products that span multiple LLM providers involves a challenge most developers don't anticipate until they hit it: context windows are not interoperable.&lt;/p&gt;

&lt;p&gt;On the surface, managing context in a multi-LLM system seems straightforward. You track how long conversations get, trim when needed, and move on. In practice, it's considerably more complex — and if you're routing requests across providers like OpenAI, Anthropic, Google, Cohere, or xAI, there's a fundamental mismatch that can break your product in subtle ways.&lt;/p&gt;

&lt;h2&gt;The Tokenization Problem&lt;/h2&gt;

&lt;p&gt;Every major LLM provider uses its own tokenizer. These tokenizers don't agree. The same block of text produces different token counts depending on which model processes it. The difference is often 10–20%, sometimes more.&lt;/p&gt;

&lt;p&gt;What this means in practice: a conversation that fits comfortably in one model's context window may silently overflow another's. A prompt routed to OpenAI might count as 1,200 tokens; the same prompt routed to Claude might count as 1,450. That gap matters.&lt;/p&gt;
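
&lt;p&gt;You can see the divergence directly. The sketch below uses the &lt;code&gt;tiktoken&lt;/code&gt; library to count the same string under two real OpenAI-family encodings; other providers expose their own tokenizers or counting endpoints, so treat this as an illustration of the gap rather than a universal counter:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Counting one string under two real encodings with tiktoken.
# Even within a single provider's model family the counts differ;
# across providers the gap is usually larger.
import tiktoken

text = "Context windows are not interoperable across LLM providers."

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))
&lt;/code&gt;&lt;/pre&gt;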

&lt;h2&gt;Where It Breaks&lt;/h2&gt;

&lt;p&gt;The failure modes tend to show up at the boundaries. When you switch providers mid-conversation, the new model has to ingest the full prior context. If your context management layer was calibrated to the previous model's tokenizer, the new model may see a context that's already at or over the limit — before it's even responded to anything new.&lt;/p&gt;

&lt;p&gt;This produces three common failure patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unexpected context-window overflow:&lt;/strong&gt; the conversation that worked before now breaches the limit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent truncation:&lt;/strong&gt; different models truncate at different points, changing what prior context the model actually sees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unpredictable routing failures:&lt;/strong&gt; the token counts your routing logic planned around don't match the counts the model actually applies&lt;/li&gt;
&lt;/ul&gt;
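
&lt;p&gt;One way to guard that boundary is to re-count the entire prior history the way the target model will count it before committing to the switch. A minimal sketch, where &lt;code&gt;count_tokens_for()&lt;/code&gt; and &lt;code&gt;limit_for()&lt;/code&gt; are hypothetical helpers you would wire to each provider's real tokenizer or counting API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical pre-flight check before routing a conversation to a
# new provider. count_tokens_for() and limit_for() are assumed
# helpers backed by each provider's own tokenizer or counting API.

def safe_to_switch(messages, target_provider, reply_budget=1024):
    """Re-count the full history the way the target model will."""
    history = sum(
        count_tokens_for(target_provider, m["content"]) for m in messages
    )
    # Leave headroom for the model's reply, not just the prompt.
    return limit_for(target_provider) - history &gt;= reply_budget
&lt;/code&gt;&lt;/pre&gt;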

&lt;h2&gt;Why Simple Estimates Fail&lt;/h2&gt;

&lt;p&gt;The instinct is to maintain a single "token estimate" with a generous safety margin. The problem is that the margin you'd need varies by provider, model version, and content type (code tokenizes differently than prose). A margin calibrated for one use case will either be too tight for another, causing failures, or too generous, causing unnecessary truncation that degrades conversation quality.&lt;/p&gt;
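
&lt;p&gt;A quick experiment makes the point. The common "roughly four characters per token" heuristic drifts differently for prose and for code, so a single margin calibrated on one will miss on the other; this sketch uses &lt;code&gt;tiktoken&lt;/code&gt; purely to illustrate:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The chars/4 rule of thumb versus an actual tokenizer, on prose
# and on code. The point is the drift, not the specific numbers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "prose": "The quick brown fox jumps over the lazy dog by the river.",
    "code": "def f(xs):\n    return {k: v for k, v in xs if v is not None}",
}

for label, text in samples.items():
    print(label, "estimate:", round(len(text) / 4), "actual:", len(enc.encode(text)))
&lt;/code&gt;&lt;/pre&gt;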

&lt;h2&gt;The Solution: Provider-Aware Token Counting&lt;/h2&gt;

&lt;p&gt;A robust multi-LLM context management layer makes token counting provider-specific. Rather than maintaining a single estimate, it measures each prompt the way the actual target model will measure it. The routing layer uses these per-provider measurements to make decisions before requests are sent.&lt;/p&gt;

&lt;p&gt;This lets the system stay ahead of context limits: it knows when a conversation is approaching an edge, trims or compresses history calibrated to the specific model receiving the request, and avoids the pricing and failure surprises that come from miscounted tokens.&lt;/p&gt;
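
&lt;p&gt;The shape of such a layer can stay small. A sketch under stated assumptions: the OpenAI counter uses &lt;code&gt;tiktoken&lt;/code&gt; locally, while the Anthropic counter is a rough stand-in for what would really be a call to their server-side count-tokens API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a provider-aware counting and trimming layer.
import tiktoken

def count_openai(text):
    # Local counting for OpenAI-family models.
    return len(tiktoken.get_encoding("o200k_base").encode(text))

def count_anthropic(text):
    # Stand-in only: in production this would call Anthropic's
    # count-tokens endpoint rather than guess from length.
    return max(1, round(len(text) / 3.5))

COUNTERS = {"openai": count_openai, "anthropic": count_anthropic}

def trim_to_fit(messages, provider, limit, reply_budget=1024):
    """Drop oldest turns until the history fits the target model."""
    count = COUNTERS[provider]
    budget = limit - reply_budget
    trimmed = list(messages)
    while trimmed and sum(count(m["content"]) for m in trimmed) &gt; budget:
        trimmed.pop(0)  # oldest-first; smarter compression also works
    return trimmed
&lt;/code&gt;&lt;/pre&gt;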

&lt;p&gt;The end result is what users should see: a smooth conversation experience, regardless of which model is serving it. The complexity of "every model speaks a slightly different token language" stays inside the infrastructure layer, invisible to the people using the product.&lt;/p&gt;

&lt;p&gt;This is the approach we've taken in our adaptive context window management component, and it's become a foundational part of how we think about multi-LLM routing more broadly.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rob Imbeault&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Apr 17, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why LLM Reasoning Is Breaking AI Infrastructure (And How to Fix It)</title>
      <dc:creator>Jonathan Murray</dc:creator>
      <pubDate>Fri, 24 Apr 2026 20:18:05 +0000</pubDate>
      <link>https://forem.com/backboardio/why-llm-reasoning-is-breaking-ai-infrastructure-and-how-to-fix-it-2aik</link>
      <guid>https://forem.com/backboardio/why-llm-reasoning-is-breaking-ai-infrastructure-and-how-to-fix-it-2aik</guid>
      <description>&lt;p&gt;If you've tried building anything serious on top of large language models (LLMs) recently, you've probably run into this:&lt;/p&gt;

&lt;p&gt;"Thinking" is supposed to make models better. In practice, it makes your infrastructure worse.&lt;/p&gt;

&lt;p&gt;This isn't a model problem—it's an infrastructure and abstraction problem. And it's getting worse as teams scale across multiple AI providers.&lt;/p&gt;

&lt;p&gt;Let's break down exactly where things go wrong.&lt;/p&gt;

&lt;h2&gt;The Illusion of "Just Turn On Reasoning"&lt;/h2&gt;

&lt;p&gt;At a high level, LLM reasoning sounds straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turn reasoning on → better answers&lt;/li&gt;
&lt;li&gt;Turn reasoning off → cheaper, faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in production systems, reality looks very different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Models sometimes skip reasoning even when explicitly prompted to think&lt;/li&gt;
&lt;li&gt;Models over-reason on trivial queries, wasting tokens&lt;/li&gt;
&lt;li&gt;Behavior is inconsistent across providers and model versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of predictable performance, you get variability.&lt;/p&gt;

&lt;p&gt;You're no longer just building an AI product—you're debugging model behavior at runtime.&lt;/p&gt;

&lt;h2&gt;The Fragmentation Problem in LLM Reasoning&lt;/h2&gt;

&lt;p&gt;One of the biggest hidden challenges in AI infrastructure today is fragmentation.&lt;/p&gt;

&lt;p&gt;Every major provider has implemented reasoning differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; → reasoning effort levels (low, medium, high)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic (Claude)&lt;/strong&gt; → explicit reasoning token budgets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google AI (Gemini)&lt;/strong&gt; → hybrid approaches depending on model version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's just input configuration.&lt;/p&gt;
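
&lt;p&gt;Even normalizing just that input layer takes provider-specific translation. A minimal sketch that maps one reasoning knob onto per-provider request fields; the field names below are best-effort snapshots of each SDK at the time of writing and change often, so treat them as illustrative rather than authoritative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# One "reasoning" knob translated into per-provider request fields.
# Field names are best-effort snapshots of each SDK and may have
# drifted; check current provider docs before relying on them.

def reasoning_params(provider, effort="medium", budget_tokens=4096):
    if provider == "openai":
        # OpenAI o-series: named effort levels.
        return {"reasoning_effort": effort}
    if provider == "anthropic":
        # Claude: an explicit thinking-token budget.
        return {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}}
    if provider == "google":
        # Gemini: a thinking budget inside the generation config.
        return {"thinking_config": {"thinking_budget": budget_tokens}}
    return {}
&lt;/code&gt;&lt;/pre&gt;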

&lt;p&gt;&lt;strong&gt;Output fragmentation is even worse:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some models return separate reasoning blocks&lt;/li&gt;
&lt;li&gt;Others provide summarized reasoning&lt;/li&gt;
&lt;li&gt;Some mix reasoning directly into standard responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No shared schema&lt;/li&gt;
&lt;li&gt;No standardized interface&lt;/li&gt;
&lt;li&gt;No predictable structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What this means for developers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're building a multi-model AI system, you now need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input normalization layers&lt;/li&gt;
&lt;li&gt;Output parsing logic per provider&lt;/li&gt;
&lt;li&gt;Custom handling for reasoning formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, "simple API routing" becomes complex middleware engineering.&lt;/p&gt;
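
&lt;p&gt;Here is roughly what that middleware ends up looking like on the output side. The response shapes below are simplified stand-ins for the real payloads, which vary by SDK version and are messier than this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Per-provider output parsing into one {reasoning, answer} shape.
# Response structures are simplified illustrations, not exact schemas.

def normalize(provider, response):
    if provider == "anthropic":
        # Claude can return separate "thinking" and "text" blocks.
        blocks = response["content"]
        thinking = [b["thinking"] for b in blocks if b["type"] == "thinking"]
        text = [b["text"] for b in blocks if b["type"] == "text"]
        return {"reasoning": "\n".join(thinking), "answer": "\n".join(text)}
    if provider == "openai":
        # o-series keeps raw reasoning hidden; only the final message
        # (and sometimes a summary) comes back.
        message = response["choices"][0]["message"]
        return {"reasoning": None, "answer": message["content"]}
    raise ValueError(f"no parser registered for {provider}")
&lt;/code&gt;&lt;/pre&gt;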

&lt;h2&gt;AI Cost Optimization Becomes a Moving Target&lt;/h2&gt;

&lt;p&gt;Reasoning doesn't just impact performance—it breaks cost predictability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Billing inconsistencies across providers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some expose reasoning tokens explicitly&lt;/li&gt;
&lt;li&gt;Others bundle them into total usage&lt;/li&gt;
&lt;li&gt;Some introduce custom billing fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you're not just optimizing latency or quality.&lt;/p&gt;

&lt;p&gt;You're building a cost translation layer across providers (a sketch follows the list below).&lt;/p&gt;

&lt;p&gt;This adds complexity to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forecasting&lt;/li&gt;
&lt;li&gt;Budget control&lt;/li&gt;
&lt;li&gt;Scaling decisions&lt;/li&gt;
&lt;/ul&gt;
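
&lt;p&gt;The translation layer itself can stay small; the hard part is keeping the per-provider usage fields straight. A sketch, with the usage field names as assumptions to verify against current provider docs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Normalizing provider usage payloads into one cost number.
# Usage field names are assumptions; verify against current docs.

def usage_to_cost(provider, usage, price_in, price_out):
    """Dollars spent, with prices given per million tokens."""
    if provider == "anthropic":
        tokens_in = usage["input_tokens"]
        tokens_out = usage["output_tokens"]  # thinking tokens billed here
    else:
        # OpenAI-style usage objects.
        tokens_in = usage["prompt_tokens"]
        tokens_out = usage["completion_tokens"]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000
&lt;/code&gt;&lt;/pre&gt;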

&lt;h2&gt;Why Multi-Model Switching Breaks Systems&lt;/h2&gt;

&lt;p&gt;In theory, switching between LLM providers should improve reliability and cost efficiency.&lt;/p&gt;

&lt;p&gt;In practice, it introduces system instability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even within a single provider:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different endpoints behave differently&lt;/li&gt;
&lt;li&gt;Input formats change&lt;/li&gt;
&lt;li&gt;Output schemas change&lt;/li&gt;
&lt;li&gt;Reasoning structures vary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Now add state management&lt;/strong&gt; (one partial answer is sketched after these questions):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What context should persist?&lt;/li&gt;
&lt;li&gt;How do you maintain reasoning continuity?&lt;/li&gt;
&lt;li&gt;How do you prevent token explosion?&lt;/li&gt;
&lt;/ul&gt;
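
&lt;p&gt;One partial answer to the token-explosion question: persist only the plain conversational turns and strip provider-specific reasoning residue before replaying history to a different model. A minimal sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Portable conversation state: keep only the fields every provider
# understands, dropping thinking blocks, reasoning summaries, and
# other provider-specific residue so history stays bounded.

def portable_history(messages):
    return [{"role": m["role"], "content": m["content"]} for m in messages]
&lt;/code&gt;&lt;/pre&gt;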

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most teams either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Abandon portability, or&lt;/li&gt;
&lt;li&gt;Build fragile adapter layers that constantly break&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;The Real Problem: Lack of Abstraction&lt;/h2&gt;

&lt;p&gt;After working through these challenges, one thing becomes clear:&lt;/p&gt;

&lt;p&gt;The core issue isn't reasoning—it's the absence of a unified abstraction layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developers today are forced to:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn multiple reasoning systems&lt;/li&gt;
&lt;li&gt;Normalize different response formats&lt;/li&gt;
&lt;li&gt;Track multiple billing models&lt;/li&gt;
&lt;li&gt;Rebuild state handling for each provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not scalable.&lt;/p&gt;

&lt;h2&gt;What "Unified LLM Reasoning" Should Look Like&lt;/h2&gt;

&lt;p&gt;To make AI infrastructure truly production-ready, reasoning needs to be abstracted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A unified system should provide&lt;/strong&gt; (sketched after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single reasoning parameter&lt;/li&gt;
&lt;li&gt;Direct control over reasoning budgets&lt;/li&gt;
&lt;li&gt;Consistent behavior across models&lt;/li&gt;
&lt;li&gt;Standardized input/output formats&lt;/li&gt;
&lt;/ul&gt;
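
&lt;p&gt;As an interface shape, that could look something like the sketch below. None of this is a real library; it is the surface the argument points toward:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a unified reasoning surface, not a real library.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Completion:
    answer: str
    reasoning: Optional[str]  # normalized; provider-specific shape hidden
    input_tokens: int
    output_tokens: int

def complete(model: str, messages: list, reasoning_budget: int = 0) -&gt; Completion:
    """One call shape for every provider.

    The "provider/name" model convention and the single
    reasoning_budget knob are assumptions of this sketch.
    """
    raise NotImplementedError("dispatch to the provider SDK goes here")
&lt;/code&gt;&lt;/pre&gt;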

&lt;p&gt;&lt;strong&gt;The impact:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Developers can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tune reasoning without provider lock-in&lt;/li&gt;
&lt;li&gt;Switch models without rewriting logic&lt;/li&gt;
&lt;li&gt;Maintain consistent state across systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most importantly:&lt;/p&gt;

&lt;p&gt;Stop thinking about thinking.&lt;/p&gt;

&lt;h2&gt;The Uncomfortable Truth About Scaling AI Systems&lt;/h2&gt;

&lt;p&gt;If you're working with LLMs and haven't encountered these issues yet—you will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity compounds rapidly when you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a second provider&lt;/li&gt;
&lt;li&gt;Enable reasoning features&lt;/li&gt;
&lt;li&gt;Optimize for cost&lt;/li&gt;
&lt;li&gt;Maintain persistent context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point:&lt;/p&gt;

&lt;p&gt;You're no longer building your product. You're building AI infrastructure.&lt;/p&gt;

&lt;h2&gt;The Future of AI Platforms&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Short-term impact of a unified abstraction layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced engineering time (weeks to months saved)&lt;/li&gt;
&lt;li&gt;Lower debugging overhead&lt;/li&gt;
&lt;li&gt;More predictable cost structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-term shift:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The winning AI platforms won't be defined by model quality alone.&lt;/p&gt;

&lt;p&gt;They will be defined by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interoperability&lt;/strong&gt; (model interchangeability)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statefulness&lt;/strong&gt; (persistent, portable context)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the real unlock in the next phase of AI development.&lt;/p&gt;

&lt;h2&gt;Quick Audit for Your AI Stack&lt;/h2&gt;

&lt;p&gt;If you're currently integrating multiple LLM providers, ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many reasoning formats are you handling?&lt;/li&gt;
&lt;li&gt;How portable is your state management layer?&lt;/li&gt;
&lt;li&gt;How predictable are your AI costs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If those answers aren't clean and consistent:&lt;/p&gt;

&lt;p&gt;You're already paying the infrastructure tax.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rob Imbeault&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Apr 20, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
