<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Robert Imbeault</title>
    <description>The latest articles on Forem by Robert Imbeault (@robimbeault).</description>
    <link>https://forem.com/robimbeault</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818726%2F3d165aef-8612-4c2c-aba9-6dd7754f4f84.jpeg</url>
      <title>Forem: Robert Imbeault</title>
      <link>https://forem.com/robimbeault</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/robimbeault"/>
    <language>en</language>
    <item>
      <title>I Think Therefore I Am… A Big Pain in the A$$</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Mon, 20 Apr 2026 17:59:09 +0000</pubDate>
      <link>https://forem.com/robimbeault/i-think-therefore-i-am-a-big-pain-in-the-a-3a9m</link>
      <guid>https://forem.com/robimbeault/i-think-therefore-i-am-a-big-pain-in-the-a-3a9m</guid>
      <description>&lt;p&gt;If you’ve tried to build anything serious on top of LLMs recently, you’ve probably run into this:&lt;/p&gt;

&lt;p&gt;“Thinking” is supposed to make models better.&lt;br&gt;
In practice, it makes your infrastructure worse.&lt;/p&gt;

&lt;p&gt;Let’s break down where it actually hurts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The illusion of “just turn on reasoning”
&lt;/h2&gt;

&lt;p&gt;At a high level, you’d expect something simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turn reasoning on → better answers&lt;/li&gt;
&lt;li&gt;Turn reasoning off → cheaper, faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reality is messier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Models sometimes &lt;strong&gt;don’t think when you ask them to&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Models sometimes &lt;strong&gt;overthink trivial prompts&lt;/strong&gt;, burning tokens for no gain&lt;/li&gt;
&lt;li&gt;There’s &lt;strong&gt;no consistent behavior across providers&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So now you’re not just building a product.&lt;br&gt;
You’re debugging &lt;em&gt;model psychology&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The fragmentation problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Every provider decided to implement “thinking” differently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI → effort levels (low, medium, high)&lt;/li&gt;
&lt;li&gt;Anthropic → token budgets (explicit caps)&lt;/li&gt;
&lt;li&gt;Google → both… depending on the model version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s just inputs.&lt;/p&gt;

&lt;p&gt;Outputs are worse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some return &lt;strong&gt;dedicated thinking blocks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Others return &lt;strong&gt;reasoning summaries&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Some mix reasoning into standard content structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no standard. No shared schema. No predictable behavior.&lt;/p&gt;

&lt;p&gt;So if you’re routing across models, you now need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input normalization&lt;/li&gt;
&lt;li&gt;Output parsing per provider&lt;/li&gt;
&lt;li&gt;Logic to reconcile different reasoning formats&lt;/li&gt;
&lt;/ul&gt;
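
&lt;p&gt;To make that concrete, here’s a minimal sketch of what an adapter layer like that tends to look like. Everything here is illustrative: the provider names, config fields, and response shapes are placeholders, not any real SDK’s API.&lt;/p&gt;

```python
# Sketch of a reasoning-config adapter layer. Provider keys and field
# names are illustrative placeholders, not real SDK parameters.

def to_provider_config(provider, effort):
    """Map one unified 'effort' knob (0.0-1.0) onto each provider's scheme."""
    if provider == "openai_style":
        # Effort levels: discretize the unified knob.
        levels = ["low", "medium", "high"]
        index = min(int(effort * len(levels)), len(levels) - 1)
        return {"reasoning_effort": levels[index]}
    if provider == "anthropic_style":
        # Token budgets: scale the knob into an explicit cap.
        return {"thinking_budget_tokens": int(effort * 32_000)}
    raise ValueError(f"unknown provider: {provider}")

def extract_reasoning(provider, response):
    """Normalize output into one shape: (reasoning_text, answer_text)."""
    if provider == "openai_style":
        # Dedicated reasoning-summary field.
        return response.get("reasoning_summary", ""), response["content"]
    if provider == "anthropic_style":
        # Reasoning interleaved with the answer as typed blocks.
        thinking = " ".join(
            b["text"] for b in response["blocks"] if b["type"] == "thinking"
        )
        answer = " ".join(
            b["text"] for b in response["blocks"] if b["type"] == "text"
        )
        return thinking, answer
    raise ValueError(f"unknown provider: {provider}")
```

&lt;p&gt;Multiply that by every provider and every response shape you support, and the adapter pile grows fast.&lt;/p&gt;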

&lt;p&gt;This is where “simple API routing” stops being simple.&lt;/p&gt;




&lt;h2&gt;
  
  
  Billing is inconsistent too
&lt;/h2&gt;

&lt;p&gt;Even cost modeling breaks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some providers expose reasoning tokens explicitly&lt;/li&gt;
&lt;li&gt;Some hide it inside total usage&lt;/li&gt;
&lt;li&gt;Some introduce &lt;strong&gt;provider-specific fields&lt;/strong&gt; (looking at you, xAI)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you’re not just optimizing performance.&lt;br&gt;
You’re building a &lt;strong&gt;cost translation layer&lt;/strong&gt;.&lt;/p&gt;
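
&lt;p&gt;A rough sketch of what that layer does, with hypothetical usage payloads standing in for real provider schemas:&lt;/p&gt;

```python
# Sketch of a usage-normalization step. The field names below are
# hypothetical examples of the kinds of shapes providers return,
# not any specific provider's actual schema.

def normalize_usage(provider, usage):
    """Reduce provider-specific usage payloads to one shape."""
    if provider == "explicit_reasoning":
        # Reasoning tokens broken out as their own field.
        return {
            "input": usage["prompt_tokens"],
            "output": usage["completion_tokens"],
            "reasoning": usage["reasoning_tokens"],
        }
    if provider == "bundled":
        # Reasoning hidden inside total output usage: the best we can
        # report is zero broken-out reasoning tokens.
        return {
            "input": usage["input_tokens"],
            "output": usage["output_tokens"],
            "reasoning": 0,
        }
    raise ValueError(f"unknown provider: {provider}")

def estimated_cost(normalized, price_per_1k_in, price_per_1k_out):
    """Assumes reasoning tokens bill as output tokens; providers differ."""
    billable_out = normalized["output"] + normalized["reasoning"]
    return (normalized["input"] * price_per_1k_in +
            billable_out * price_per_1k_out) / 1000
```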




&lt;h2&gt;
  
  
  Model switching makes everything worse
&lt;/h2&gt;

&lt;p&gt;Switching models mid-thread sounds great… until you try it.&lt;/p&gt;

&lt;p&gt;Even within the same provider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different endpoints behave differently (yes, even inside OpenAI)&lt;/li&gt;
&lt;li&gt;Input formats change&lt;/li&gt;
&lt;li&gt;Output structures change&lt;/li&gt;
&lt;li&gt;Reasoning formats change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now add state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What context do you carry over?&lt;/li&gt;
&lt;li&gt;How do you preserve reasoning continuity?&lt;/li&gt;
&lt;li&gt;How do you avoid exploding token usage?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where most teams either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Give up on portability, or&lt;/li&gt;
&lt;li&gt;Build a fragile pile of adapters that break every few weeks&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What we realized building Backboard
&lt;/h2&gt;

&lt;p&gt;The real problem isn’t reasoning.&lt;/p&gt;

&lt;p&gt;It’s &lt;strong&gt;lack of abstraction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Developers shouldn’t have to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn 5 different “thinking” systems&lt;/li&gt;
&lt;li&gt;Normalize 5 different response formats&lt;/li&gt;
&lt;li&gt;Track 5 different billing models&lt;/li&gt;
&lt;li&gt;Rebuild state every time they switch models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we made a call:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unify it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What “unified thinking” actually means
&lt;/h2&gt;

&lt;p&gt;Instead of exposing provider quirks, we abstract them into one model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single &lt;strong&gt;thinking parameter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Direct control over &lt;strong&gt;reasoning budget&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Consistent behavior across models&lt;/li&gt;
&lt;li&gt;Normalized input and output structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tune reasoning without caring about provider differences&lt;/li&gt;
&lt;li&gt;Switch models without rewriting logic&lt;/li&gt;
&lt;li&gt;Keep state intact across everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most importantly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop thinking about thinking.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The uncomfortable truth
&lt;/h2&gt;

&lt;p&gt;If you’re building on multiple LLMs and you haven’t hit these issues yet, you will.&lt;/p&gt;

&lt;p&gt;The complexity is not obvious at the start.&lt;br&gt;
It compounds as soon as you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a second provider&lt;/li&gt;
&lt;li&gt;Introduce reasoning&lt;/li&gt;
&lt;li&gt;Try to optimize cost&lt;/li&gt;
&lt;li&gt;Or maintain state across sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, you’re not building your product anymore.&lt;br&gt;
You’re building infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where this goes long term
&lt;/h2&gt;

&lt;p&gt;Short term:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abstractions like this save teams weeks or months of engineering time&lt;/li&gt;
&lt;li&gt;They reduce cost volatility and debugging overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Long term:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The winning platforms won’t be the best models&lt;/li&gt;
&lt;li&gt;They’ll be the ones that make models &lt;strong&gt;interchangeable and stateful&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the real unlock.&lt;/p&gt;




&lt;p&gt;If you’re currently stitching together multiple providers, do a quick audit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many reasoning formats are you handling?&lt;/li&gt;
&lt;li&gt;How portable is your state layer?&lt;/li&gt;
&lt;li&gt;How confident are you in your cost predictability?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer isn’t clean, you’re already paying the tax.&lt;/p&gt;

&lt;p&gt;This is what we're working on at Backboard.io :)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>developers</category>
      <category>reasoning</category>
    </item>
    <item>
      <title>Why Token Counting in Multi-LLM Systems Is Harder Than You Think</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Thu, 16 Apr 2026 14:09:06 +0000</pubDate>
      <link>https://forem.com/robimbeault/why-token-counting-in-multi-llm-systems-is-harder-than-you-think-1moj</link>
      <guid>https://forem.com/robimbeault/why-token-counting-in-multi-llm-systems-is-harder-than-you-think-1moj</guid>
      <description>&lt;p&gt;When we set out to build our adaptive context window management component, we ran into a problem that sounds deceptively simple: how do you manage context windows when your system routes requests across multiple LLM providers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Problem&lt;/strong&gt;&lt;br&gt;
Each model has its own tokenizer, context window, and pricing rules. The same text is not "the same" across providers. OpenAI might count a prompt as 1,200 tokens; Claude might see it as 1,450. A chat session that fits comfortably in one model can silently exceed limits or cost significantly more in another.&lt;/p&gt;

&lt;p&gt;This creates real problems when you switch providers mid-conversation. The new model has to ingest the full conversation history again — but since each model counts that context differently, you can hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unexpected context-window overflow: the conversation that fit before now breaches the limit&lt;/li&gt;
&lt;li&gt;Inconsistent truncation: different models truncate at different points, changing what context the model sees&lt;/li&gt;
&lt;li&gt;Hard-to-predict routing failures: your router makes decisions based on one token count, but the model uses another&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why a Single 'Token Estimate' Doesn't Cut It&lt;/strong&gt;&lt;br&gt;
The tempting solution is to maintain a single token count with a safety margin. The problem: OpenAI, Claude, Gemini, Cohere, xAI, and others don't tokenize text the same way. A single estimate will be wrong in both directions — undercount and you risk failures; overcount and you truncate too aggressively, degrading conversation quality unnecessarily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How We Solved It&lt;/strong&gt;&lt;br&gt;
The answer is making token counting provider-aware. Instead of a single universal estimate, the context management layer measures each prompt the way the specific target model will measure it. The router uses this measurement before the request is sent.&lt;/p&gt;

&lt;p&gt;In practice this means the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Knows when a conversation is approaching the edge of a model's context window&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trims or compresses history intelligently, not just blindly chopping from the front&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoids expensive overages from miscounted tokens&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keeps model-switching complexity invisible to the end user&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
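
&lt;p&gt;A toy sketch of the idea, with crude per-provider counters standing in for real tokenizers (a production version would call each provider’s actual tokenizer, e.g. tiktoken for OpenAI models):&lt;/p&gt;

```python
# Sketch of provider-aware token accounting. The tokens-per-character
# ratios and limits are made up to show the shape of the problem.

TOKENIZERS = {
    # Crude stand-ins: the same text tokenizes differently per provider.
    "provider_a": lambda text: max(1, len(text) // 4),
    "provider_b": lambda text: max(1, len(text) // 3),
}

CONTEXT_LIMITS = {"provider_a": 8192, "provider_b": 4096}

def fits(provider, messages, reserve=512):
    """Measure the prompt the way the *target* model will, then check that
    model's own limit, keeping `reserve` tokens free for the response."""
    count = TOKENIZERS[provider]
    total = sum(count(m) for m in messages)
    headroom = CONTEXT_LIMITS[provider] - total - reserve
    return headroom >= 0, total
```

&lt;p&gt;The same 12,000-character conversation counts as 3,000 tokens for one stand-in provider and 4,000 for the other, so it fits inside one model’s window and overflows the other: exactly the mid-conversation failure mode described above.&lt;/p&gt;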

&lt;p&gt;The user sees a smooth conversation. The system handles the messy reality that every model speaks a slightly different "token language."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What We're Building Toward&lt;/strong&gt;&lt;br&gt;
This is one component of a larger routing layer. The goal: switch LLM providers mid-product — based on cost, capability, or availability — without that complexity leaking to users. Provider-aware token counting turns out to be a foundational piece of that.&lt;/p&gt;

&lt;p&gt;We're doing this so you won't have to. :)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your Context Window Is Chaos. We Fixed It.</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Tue, 31 Mar 2026 10:47:36 +0000</pubDate>
      <link>https://forem.com/robimbeault/your-context-window-is-chaos-we-fixed-it-3ca5</link>
      <guid>https://forem.com/robimbeault/your-context-window-is-chaos-we-fixed-it-3ca5</guid>
      <description>&lt;p&gt;If you’re routing across multiple LLMs, you probably already know this feeling:&lt;/p&gt;

&lt;p&gt;One model happily accepts your massive conversation.&lt;br&gt;
The next model chokes, truncates half the important bits, and hallucinates the rest.&lt;/p&gt;

&lt;p&gt;Same app. Same user. Different context window. Chaos.&lt;/p&gt;

&lt;p&gt;Backboard.io now includes Adaptive Context Management, a system that automatically manages conversation state when your app moves between models with different context sizes. &lt;/p&gt;

&lt;p&gt;P.S. If you have keys from any of the frontier providers or from OpenRouter, you can use this for free!&lt;/p&gt;

&lt;p&gt;You still get access to 17,000+ LLMs on the platform.&lt;/p&gt;

&lt;p&gt;You just don’t have to personally babysit their context windows anymore.&lt;/p&gt;

&lt;p&gt;And yes, it’s included for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Context Windows Are Inconsistent (and Annoying)&lt;/strong&gt;&lt;br&gt;
In a multi‑model setup, this is what actually happens:&lt;/p&gt;

&lt;p&gt;You start on a large‑context model. Everything fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt&lt;/li&gt;
&lt;li&gt;conversation history&lt;/li&gt;
&lt;li&gt;tool calls + tool responses&lt;/li&gt;
&lt;li&gt;RAG chunks&lt;/li&gt;
&lt;li&gt;web search results&lt;/li&gt;
&lt;li&gt;random runtime metadata you forgot you added&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your router decides to send the next request to a smaller‑context model.&lt;/p&gt;

&lt;p&gt;Suddenly your carefully curated “state” is too big to fit. Something has to go.&lt;/p&gt;

&lt;p&gt;Most platforms respond with:&lt;/p&gt;

&lt;p&gt;“Cool, just write truncation and summarization logic that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prioritizes what matters,&lt;/li&gt;
&lt;li&gt;handles overflow nicely,&lt;/li&gt;
&lt;li&gt;doesn’t break when you add a new tool,&lt;/li&gt;
&lt;li&gt;and works for every model you might ever route to.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we all end up writing the same brittle code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if tokens &amp;gt; limit:
    drop_old_messages()
    maybe_summarize()
    hope_nothing_important_was_there()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In a multi‑model system, that logic gets complicated and fragile fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What We Shipped: Adaptive Context Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Backboard now automatically handles context transitions when models change.&lt;/p&gt;

&lt;p&gt;There’s no extra endpoint and no new config. It runs inside the Backboard runtime whenever a request is routed to a model.&lt;/p&gt;

&lt;p&gt;When that happens, Backboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Looks up the model’s context window.&lt;/li&gt;
&lt;li&gt;Dynamically budgets it: 20% reserved for raw state, 80% freed via summarization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Within that 20% “raw state” budget, we prioritize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt&lt;/li&gt;
&lt;li&gt;recent messages&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;RAG results&lt;/li&gt;
&lt;li&gt;web search context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whatever fits in that 20% goes through unchanged.&lt;/p&gt;

&lt;p&gt;Everything else is handled by intelligent summarization.&lt;/p&gt;

&lt;p&gt;You don’t write the logic. You just route between models.&lt;/p&gt;
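
&lt;p&gt;If you did have to write it yourself, the budgeting rule above might look something like this minimal sketch (illustrative helper names and numbers, not Backboard’s actual implementation):&lt;/p&gt;

```python
# Sketch of the 20/80 budgeting rule described above. Names and numbers
# are illustrative, not Backboard's actual code.

RAW_STATE_FRACTION = 0.20  # share of the window passed through unchanged

def budget(context_window_tokens):
    """Split a model's context window into a raw-state budget and a
    summarization budget."""
    raw = int(context_window_tokens * RAW_STATE_FRACTION)
    return {"raw_state": raw, "summarizable": context_window_tokens - raw}

def pack_raw_state(items, raw_budget):
    """Greedily keep the highest-priority items (system prompt first, then
    recent messages, tools, RAG, web search) inside the raw budget;
    everything else becomes a candidate for summarization."""
    kept, overflow, used = [], [], 0
    for name, tokens in items:  # items already sorted by priority
        if used + tokens > raw_budget:
            overflow.append((name, tokens))
        else:
            kept.append((name, tokens))
            used += tokens
    return kept, overflow
```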

&lt;p&gt;&lt;strong&gt;How Intelligent Summarization Works&lt;/strong&gt;&lt;br&gt;
When we need to compress, we follow a simple rule:&lt;/p&gt;

&lt;p&gt;First try the model you’re switching to.&lt;/p&gt;

&lt;p&gt;“Hey smaller model, summarize this so you can still understand what’s going on.”&lt;/p&gt;

&lt;p&gt;If the summary still doesn’t fit, we fall back to the larger model that was previously in use to generate a more efficient summary.&lt;/p&gt;

&lt;p&gt;This preserves the important parts of the conversation while ensuring the final state always fits within the new model’s context window.&lt;/p&gt;

&lt;p&gt;All of this happens automatically during the request and tool calls.&lt;/p&gt;

&lt;p&gt;No manual orchestration. No custom jobs. No extra service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You Should Rarely Hit 100% Context Again&lt;/strong&gt;&lt;br&gt;
Because Adaptive Context Management runs continuously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reshapes and compresses state before you slam into the limit.&lt;/li&gt;
&lt;li&gt;It keeps a buffer in the context window instead of riding at 99.9% and hoping for the best.&lt;/li&gt;
&lt;li&gt;Mid‑conversation model switches stop being a coin flip on whether something vital gets chopped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your job: define the routing logic and features.&lt;/p&gt;

&lt;p&gt;Our job: make sure the context window doesn’t quietly wreck them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You Still Get Visibility: context_usage in msg&lt;/strong&gt;&lt;br&gt;
This is not a black box.&lt;/p&gt;

&lt;p&gt;We expose context usage directly in the msg endpoint so you can see what’s happening in real time.&lt;/p&gt;

&lt;p&gt;Example response:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;"context_usage": {
  "used_tokens": 1302,
  "context_limit": 8191,
  "percent": 19.9,
  "summary_tokens": 0,
  "model": "gpt-4"
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much context is currently used&lt;/li&gt;
&lt;li&gt;how close you are to the limit&lt;/li&gt;
&lt;li&gt;how many tokens are from summarization&lt;/li&gt;
&lt;li&gt;which model is currently managing the context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you like graphs and dashboards, this gives you the raw data without forcing you to build your own context tracking system from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bigger Idea: Treat Models Like Infrastructure&lt;/strong&gt;&lt;br&gt;
Backboard’s thesis is simple:&lt;/p&gt;

&lt;p&gt;You should be able to treat models as interchangeable infrastructure.&lt;/p&gt;

&lt;p&gt;Your state should just move with the user.&lt;/p&gt;

&lt;p&gt;That only works if state can move safely between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cheap and expensive models&lt;/li&gt;
&lt;li&gt;long‑context and short‑context models&lt;/li&gt;
&lt;li&gt;different providers and pricing tiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adaptive Context Management is the safety layer that makes that viable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You route across thousands of models.&lt;/li&gt;
&lt;li&gt;Backboard keeps the conversation state aligned with each model’s constraints.&lt;/li&gt;
&lt;li&gt;You don’t write ad‑hoc truncation and summarization logic per model.&lt;/li&gt;
&lt;li&gt;You focus on product behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We handle the context window drama.&lt;/p&gt;

&lt;p&gt;Adaptive Context Management is free and live today in the Backboard API.&lt;/p&gt;

&lt;p&gt;No feature flag. No extra pricing line.&lt;/p&gt;

&lt;p&gt;You can start building with it now at:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.backboard.io" rel="noopener noreferrer"&gt;https://docs.backboard.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re already routing across multiple models and have horror stories about context windows, I’d love to hear them.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>api</category>
    </item>
    <item>
      <title>I'm biased but I love this!</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Tue, 24 Mar 2026 15:13:56 +0000</pubDate>
      <link>https://forem.com/robimbeault/im-bias-but-i-love-this-4oom</link>
      <guid>https://forem.com/robimbeault/im-bias-but-i-love-this-4oom</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2" class="crayons-story__hidden-navigation-link"&gt;I’m Learning AI in Public, and I Think Developers Need to Chill a Bit&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/jon_at_backboardio" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824580%2Fcbf3ef23-2d0b-4576-90ff-0d46b2119ea8.png" alt="jon_at_backboardio profile" class="crayons-avatar__image" width="96" height="96"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/jon_at_backboardio" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Jonathan Murray
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Jonathan Murray
                
              
              &lt;div id="story-author-preview-content-3395533" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/jon_at_backboardio" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824580%2Fcbf3ef23-2d0b-4576-90ff-0d46b2119ea8.png" class="crayons-avatar__image" alt="" width="96" height="96"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Jonathan Murray&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Mar 24&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2" id="article-link-3395533"&gt;
          I’m Learning AI in Public, and I Think Developers Need to Chill a Bit
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devrel"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devrel&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;49&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              10&lt;span class="hidden s:inline"&gt; comments&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            5 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ai</category>
      <category>devops</category>
      <category>devrel</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Hidden Problem With Multi-Model AI Systems: Context Window Mismatch</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Tue, 24 Mar 2026 13:37:22 +0000</pubDate>
      <link>https://forem.com/robimbeault/the-hidden-problem-with-multi-model-ai-systems-context-window-mismatch-821</link>
      <guid>https://forem.com/robimbeault/the-hidden-problem-with-multi-model-ai-systems-context-window-mismatch-821</guid>
      <description>&lt;p&gt;Notes from building infrastructure for 17,000+ LLMs&lt;/p&gt;

&lt;p&gt;One of the promises of modern AI infrastructure is simple:&lt;br&gt;
You should be able to switch models whenever you want.&lt;/p&gt;

&lt;p&gt;Different models have different strengths. Some are faster. Some are cheaper. Some reason better. Some support large context windows.&lt;/p&gt;

&lt;p&gt;In theory, you route requests dynamically and get the best of each.&lt;br&gt;
In practice, something breaks almost immediately.&lt;br&gt;
Context windows don’t match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Moment Everything Breaks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine this common scenario:&lt;/p&gt;

&lt;p&gt;A conversation begins on a large context model. Maybe something like a 128k context window.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system prompt is fairly large.&lt;/li&gt;
&lt;li&gt;The user has been chatting for a while.&lt;/li&gt;
&lt;li&gt;Tools have been called.&lt;/li&gt;
&lt;li&gt;A RAG system has pulled in documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything works. Then your router decides to switch to a smaller model, maybe for latency or cost reasons.&lt;/p&gt;

&lt;p&gt;Suddenly the entire state no longer fits. The request fails or the model behaves unpredictably.&lt;/p&gt;

&lt;p&gt;This happens because the model’s context window is not just holding messages. It contains the entire runtime state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompts&lt;/li&gt;
&lt;li&gt;recent conversation turns&lt;/li&gt;
&lt;li&gt;tool calls and tool outputs&lt;/li&gt;
&lt;li&gt;RAG results&lt;/li&gt;
&lt;li&gt;web search context&lt;/li&gt;
&lt;li&gt;other metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you exceed the limit, something has to give. Most teams end up writing custom logic to handle this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;truncating older messages&lt;/li&gt;
&lt;li&gt;prioritizing certain content&lt;/li&gt;
&lt;li&gt;summarizing conversation history&lt;/li&gt;
&lt;li&gt;trying to prevent context overflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This logic grows quickly and often becomes fragile. We ran into this problem while building Backboard, which currently routes across 17,000+ LLMs. So we built a system to handle it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Idea: Treat Context Like a Budget&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The approach we landed on was surprisingly simple. Instead of filling the entire context window with raw state, we reserve a portion of it as a stable budget. When a request is routed to a model, we allocate the context window like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~20% reserved for raw state&lt;/li&gt;
&lt;li&gt;~80% available for summarization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system calculates how many tokens fit inside that 20% allocation. Within that space we prioritize the most important live inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt&lt;/li&gt;
&lt;li&gt;most recent messages&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;RAG results&lt;/li&gt;
&lt;li&gt;web search context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else becomes eligible for summarization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Summarization Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the system identifies which parts of the state cannot fit directly into the context window, it compresses them. We designed the summarization pipeline around a simple rule: first try summarizing using the target model.&lt;/p&gt;

&lt;p&gt;If the summary still does not fit, fall back to the larger model previously used to generate a more efficient summary.&lt;/p&gt;

&lt;p&gt;This helps preserve as much information as possible while guaranteeing the final prompt fits inside the model’s context window.&lt;br&gt;
All of this happens automatically in the runtime.&lt;/p&gt;
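
&lt;p&gt;As a sketch, that fallback rule looks roughly like this. The summarizer callables stand in for real model calls; none of this is the actual Backboard code:&lt;/p&gt;

```python
# Sketch of the fallback rule above: try the target model's own summary
# first, then fall back to the previously used (larger) model.

def compress_history(history, token_count, budget,
                     summarize_with_target, summarize_with_previous):
    """Return a summary of `history` that fits in `budget` tokens,
    preferring the target model's own summary."""
    summary = summarize_with_target(history)
    if token_count(summary) > budget:
        # Target model's summary is still too large: ask the larger,
        # previously used model for a tighter one.
        summary = summarize_with_previous(history)
    if token_count(summary) > budget:
        # Last-resort guardrail so the final prompt always fits.
        summary = summary[: budget * 4]  # rough chars-per-token heuristic
    return summary
```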

&lt;p&gt;&lt;strong&gt;Avoiding Hard Context Failures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of our goals was to make context exhaustion extremely rare. Because the system runs continuously during requests and tool calls, the state is reshaped before the context window is fully consumed. In practice this means applications rarely hit the absolute context limit of a model. Developers do not have to constantly monitor token counts or worry about prompt overflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Making Context Usage Observable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even though the system runs automatically, we wanted developers to see what was happening, so we added context metrics directly to the API response.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"context_usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"used_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1302&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"context_limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8191&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;19.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"summary_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes it easy to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much context is being used&lt;/li&gt;
&lt;li&gt;when summarization happens&lt;/li&gt;
&lt;li&gt;how close you are to a model’s limit&lt;/li&gt;
&lt;li&gt;which model processed the request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production systems, this visibility is useful for debugging and optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why We Think This Belongs in Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A lot of AI applications now route between multiple models depending on cost, latency, or capability, but context window management often ends up as application code. Our view was that this is infrastructure responsibility, not application responsibility. Developers should be able to move between models freely without rebuilding state management every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive Context Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We ended up calling this system Adaptive Context Management. Its job is simple: ensure the conversation state always fits the model being used.&lt;/p&gt;

&lt;p&gt;No prompt surgery.&lt;br&gt;
No manual truncation logic.&lt;br&gt;
No context window surprises.&lt;/p&gt;

&lt;p&gt;As AI systems move toward multi-model architectures, context management becomes one of the most important reliability problems.&lt;/p&gt;

&lt;p&gt;Different models will always have different limits.&lt;br&gt;
The goal is to make those differences invisible to developers.&lt;/p&gt;

&lt;p&gt;If you are curious about the architecture behind this or how we tested summarization quality, I’d love to hear how others are approaching context management in multi-model systems.&lt;/p&gt;

&lt;p&gt;Adaptive Context Management is now available in Backboard and automatically enabled for users.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>api</category>
    </item>
  </channel>
</rss>
