<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Robert Imbeault</title>
    <description>The latest articles on Forem by Robert Imbeault (@robimbeault).</description>
    <link>https://forem.com/robimbeault</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818726%2F3d165aef-8612-4c2c-aba9-6dd7754f4f84.jpeg</url>
      <title>Forem: Robert Imbeault</title>
      <link>https://forem.com/robimbeault</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/robimbeault"/>
    <language>en</language>
    <item>
      <title>I Think Therefore I Am… A Big Pain in the A$$</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Mon, 20 Apr 2026 17:59:09 +0000</pubDate>
      <link>https://forem.com/robimbeault/i-think-therefore-i-am-a-big-pain-in-the-a-3a9m</link>
      <guid>https://forem.com/robimbeault/i-think-therefore-i-am-a-big-pain-in-the-a-3a9m</guid>
      <description>&lt;p&gt;If you’ve tried to build anything serious on top of LLMs recently, you’ve probably run into this:&lt;/p&gt;

&lt;p&gt;“Thinking” is supposed to make models better.&lt;br&gt;
In practice, it makes your infrastructure worse.&lt;/p&gt;

&lt;p&gt;Let’s break down where it actually hurts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The illusion of “just turn on reasoning”
&lt;/h2&gt;

&lt;p&gt;At a high level, you’d expect something simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turn reasoning on → better answers&lt;/li&gt;
&lt;li&gt;Turn reasoning off → cheaper, faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reality is messier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Models sometimes &lt;strong&gt;don’t think when you ask them to&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Models sometimes &lt;strong&gt;overthink trivial prompts&lt;/strong&gt;, burning tokens for no gain&lt;/li&gt;
&lt;li&gt;There’s &lt;strong&gt;no consistent behavior across providers&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So now you’re not just building a product.&lt;br&gt;
You’re debugging &lt;em&gt;model psychology&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The fragmentation problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Every provider decided to implement “thinking” differently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI → effort levels (low, medium, high)&lt;/li&gt;
&lt;li&gt;Anthropic → token budgets (explicit caps)&lt;/li&gt;
&lt;li&gt;Google → both… depending on the model version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s just inputs.&lt;/p&gt;

&lt;p&gt;Outputs are worse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some return &lt;strong&gt;dedicated thinking blocks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Others return &lt;strong&gt;reasoning summaries&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Some mix reasoning into standard content structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no standard. No shared schema. No predictable behavior.&lt;/p&gt;

&lt;p&gt;So if you’re routing across models, you now need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input normalization&lt;/li&gt;
&lt;li&gt;Output parsing per provider&lt;/li&gt;
&lt;li&gt;Logic to reconcile different reasoning formats&lt;/li&gt;
&lt;/ul&gt;
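
&lt;p&gt;To make that concrete, here’s a minimal sketch of what an adapter layer like that tends to look like. Everything here is illustrative: the provider names, config fields, and response shapes are placeholders, not any real SDK’s API.&lt;/p&gt;

```python
# Sketch of a reasoning-config adapter layer. Provider keys and field
# names are illustrative placeholders, not real SDK parameters.

def to_provider_config(provider, effort):
    """Map one unified 'effort' knob (0.0-1.0) onto each provider's scheme."""
    if provider == "openai_style":
        # Effort levels: discretize the unified knob.
        levels = ["low", "medium", "high"]
        index = min(int(effort * len(levels)), len(levels) - 1)
        return {"reasoning_effort": levels[index]}
    if provider == "anthropic_style":
        # Token budgets: scale the knob into an explicit cap.
        return {"thinking_budget_tokens": int(effort * 32_000)}
    raise ValueError(f"unknown provider: {provider}")

def extract_reasoning(provider, response):
    """Normalize output into one shape: (reasoning_text, answer_text)."""
    if provider == "openai_style":
        # Dedicated reasoning-summary field.
        return response.get("reasoning_summary", ""), response["content"]
    if provider == "anthropic_style":
        # Reasoning interleaved with the answer as typed blocks.
        thinking = " ".join(
            b["text"] for b in response["blocks"] if b["type"] == "thinking"
        )
        answer = " ".join(
            b["text"] for b in response["blocks"] if b["type"] == "text"
        )
        return thinking, answer
    raise ValueError(f"unknown provider: {provider}")
```

&lt;p&gt;Multiply that by every provider and every response shape you support, and the adapter pile grows fast.&lt;/p&gt;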

&lt;p&gt;This is where “simple API routing” stops being simple.&lt;/p&gt;




&lt;h2&gt;
  
  
  Billing is inconsistent too
&lt;/h2&gt;

&lt;p&gt;Even cost modeling breaks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some providers expose reasoning tokens explicitly&lt;/li&gt;
&lt;li&gt;Some hide it inside total usage&lt;/li&gt;
&lt;li&gt;Some introduce &lt;strong&gt;provider-specific fields&lt;/strong&gt; (looking at you, xAI)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you’re not just optimizing performance.&lt;br&gt;
You’re building a &lt;strong&gt;cost translation layer&lt;/strong&gt;.&lt;/p&gt;
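
&lt;p&gt;A rough sketch of what that layer does, with hypothetical usage payloads standing in for real provider schemas:&lt;/p&gt;

```python
# Sketch of a usage-normalization step. The field names below are
# hypothetical examples of the kinds of shapes providers return,
# not any specific provider's actual schema.

def normalize_usage(provider, usage):
    """Reduce provider-specific usage payloads to one shape."""
    if provider == "explicit_reasoning":
        # Reasoning tokens broken out as their own field.
        return {
            "input": usage["prompt_tokens"],
            "output": usage["completion_tokens"],
            "reasoning": usage["reasoning_tokens"],
        }
    if provider == "bundled":
        # Reasoning hidden inside total output usage: the best we can
        # report is zero broken-out reasoning tokens.
        return {
            "input": usage["input_tokens"],
            "output": usage["output_tokens"],
            "reasoning": 0,
        }
    raise ValueError(f"unknown provider: {provider}")

def estimated_cost(normalized, price_per_1k_in, price_per_1k_out):
    """Assumes reasoning tokens bill as output tokens; providers differ."""
    billable_out = normalized["output"] + normalized["reasoning"]
    return (normalized["input"] * price_per_1k_in +
            billable_out * price_per_1k_out) / 1000
```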




&lt;h2&gt;
  
  
  Model switching makes everything worse
&lt;/h2&gt;

&lt;p&gt;Switching models mid-thread sounds great… until you try it.&lt;/p&gt;

&lt;p&gt;Even within the same provider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different endpoints behave differently (yes, even inside OpenAI)&lt;/li&gt;
&lt;li&gt;Input formats change&lt;/li&gt;
&lt;li&gt;Output structures change&lt;/li&gt;
&lt;li&gt;Reasoning formats change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now add state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What context do you carry over?&lt;/li&gt;
&lt;li&gt;How do you preserve reasoning continuity?&lt;/li&gt;
&lt;li&gt;How do you avoid exploding token usage?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where most teams either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Give up on portability, or&lt;/li&gt;
&lt;li&gt;Build a fragile pile of adapters that break every few weeks&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What we realized building Backboard
&lt;/h2&gt;

&lt;p&gt;The real problem isn’t reasoning.&lt;/p&gt;

&lt;p&gt;It’s &lt;strong&gt;lack of abstraction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Developers shouldn’t have to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn 5 different “thinking” systems&lt;/li&gt;
&lt;li&gt;Normalize 5 different response formats&lt;/li&gt;
&lt;li&gt;Track 5 different billing models&lt;/li&gt;
&lt;li&gt;Rebuild state every time they switch models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we made a call:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unify it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What “unified thinking” actually means
&lt;/h2&gt;

&lt;p&gt;Instead of exposing provider quirks, we abstract them into one model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single &lt;strong&gt;thinking parameter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Direct control over &lt;strong&gt;reasoning budget&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Consistent behavior across models&lt;/li&gt;
&lt;li&gt;Normalized input and output structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tune reasoning without caring about provider differences&lt;/li&gt;
&lt;li&gt;Switch models without rewriting logic&lt;/li&gt;
&lt;li&gt;Keep state intact across everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most importantly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop thinking about thinking.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The uncomfortable truth
&lt;/h2&gt;

&lt;p&gt;If you’re building on multiple LLMs and you haven’t hit these issues yet, you will.&lt;/p&gt;

&lt;p&gt;The complexity is not obvious at the start.&lt;br&gt;
It compounds as soon as you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a second provider&lt;/li&gt;
&lt;li&gt;Introduce reasoning&lt;/li&gt;
&lt;li&gt;Try to optimize cost&lt;/li&gt;
&lt;li&gt;Or maintain state across sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, you’re not building your product anymore.&lt;br&gt;
You’re building infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where this goes long term
&lt;/h2&gt;

&lt;p&gt;Short term:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abstractions like this save teams weeks or months of engineering time&lt;/li&gt;
&lt;li&gt;They reduce cost volatility and debugging overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Long term:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The winning platforms won’t be the best models&lt;/li&gt;
&lt;li&gt;They’ll be the ones that make models &lt;strong&gt;interchangeable and stateful&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the real unlock.&lt;/p&gt;




&lt;p&gt;If you’re currently stitching together multiple providers, do a quick audit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many reasoning formats are you handling?&lt;/li&gt;
&lt;li&gt;How portable is your state layer?&lt;/li&gt;
&lt;li&gt;How confident are you in your cost predictability?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer isn’t clean, you’re already paying the tax.&lt;/p&gt;

&lt;p&gt;This is what we're working on at Backboard.io :)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>developers</category>
      <category>reasoning</category>
    </item>
    <item>
      <title>Why Token Counting in Multi-LLM Systems Is Harder Than You Think</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Thu, 16 Apr 2026 14:09:06 +0000</pubDate>
      <link>https://forem.com/robimbeault/why-token-counting-in-multi-llm-systems-is-harder-than-you-think-1moj</link>
      <guid>https://forem.com/robimbeault/why-token-counting-in-multi-llm-systems-is-harder-than-you-think-1moj</guid>
      <description>&lt;p&gt;When we set out to build our adaptive context window management component, we ran into a problem that sounds deceptively simple: how do you manage context windows when your system routes requests across multiple LLM providers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Problem&lt;/strong&gt;&lt;br&gt;
Each model has its own tokenizer, context window, and pricing rules. The same text is not "the same" across providers. OpenAI might count a prompt as 1,200 tokens; Claude might see it as 1,450. A chat session that fits comfortably in one model can silently exceed limits or cost significantly more in another.&lt;/p&gt;

&lt;p&gt;This creates real problems when you switch providers mid-conversation. The new model has to ingest the full conversation history again — but since each model counts that context differently, you can hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unexpected context-window overflow: the conversation that fit before now breaches the limit&lt;/li&gt;
&lt;li&gt;Inconsistent truncation: different models truncate at different points, changing what context the model sees&lt;/li&gt;
&lt;li&gt;Hard-to-predict routing failures: your router makes decisions based on one token count, but the model uses another&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why a Single 'Token Estimate' Doesn't Cut It&lt;/strong&gt;&lt;br&gt;
The tempting solution is to maintain a single token count with a safety margin. The problem: OpenAI, Claude, Gemini, Cohere, xAI, and others don't tokenize text the same way. A single estimate will be wrong in both directions — undercount and you risk failures; overcount and you truncate too aggressively, degrading conversation quality unnecessarily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How We Solved It&lt;/strong&gt;&lt;br&gt;
The answer is making token counting provider-aware. Instead of a single universal estimate, the context management layer measures each prompt the way the specific target model will measure it. The router uses this measurement before the request is sent.&lt;/p&gt;

&lt;p&gt;In practice this means the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Knows when a conversation is approaching the edge of a model's context window&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trims or compresses history intelligently, not just blindly chopping from the front&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoids expensive overages from miscounted tokens&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keeps model-switching complexity invisible to the end user&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
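
&lt;p&gt;A toy sketch of the idea, with crude per-provider counters standing in for real tokenizers (a production version would call each provider’s actual tokenizer, e.g. tiktoken for OpenAI models):&lt;/p&gt;

```python
# Sketch of provider-aware token accounting. The tokens-per-character
# ratios and limits are made up to show the shape of the problem.

TOKENIZERS = {
    # Crude stand-ins: the same text tokenizes differently per provider.
    "provider_a": lambda text: max(1, len(text) // 4),
    "provider_b": lambda text: max(1, len(text) // 3),
}

CONTEXT_LIMITS = {"provider_a": 8192, "provider_b": 4096}

def fits(provider, messages, reserve=512):
    """Measure the prompt the way the *target* model will, then check that
    model's own limit, keeping `reserve` tokens free for the response."""
    count = TOKENIZERS[provider]
    total = sum(count(m) for m in messages)
    headroom = CONTEXT_LIMITS[provider] - total - reserve
    return headroom >= 0, total
```

&lt;p&gt;The same 12,000-character conversation counts as 3,000 tokens for one stand-in provider and 4,000 for the other, so it fits inside one model’s window and overflows the other: exactly the mid-conversation failure mode described above.&lt;/p&gt;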

&lt;p&gt;The user sees a smooth conversation. The system handles the messy reality that every model speaks a slightly different "token language."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What We're Building Toward&lt;/strong&gt;&lt;br&gt;
This is one component of a larger routing layer. The goal: switch LLM providers mid-product — based on cost, capability, or availability — without that complexity leaking to users. Provider-aware token counting turns out to be a foundational piece of that.&lt;/p&gt;

&lt;p&gt;We're doing this so you won't have to. :)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your Context Window Is Chaos. We Fixed It.</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Tue, 31 Mar 2026 10:47:36 +0000</pubDate>
      <link>https://forem.com/robimbeault/your-context-window-is-chaos-we-fixed-it-3ca5</link>
      <guid>https://forem.com/robimbeault/your-context-window-is-chaos-we-fixed-it-3ca5</guid>
      <description>&lt;p&gt;If you’re routing across multiple LLMs, you probably already know this feeling:&lt;/p&gt;

&lt;p&gt;One model happily accepts your massive conversation.&lt;br&gt;
The next model chokes, truncates half the important bits, and hallucinates the rest.&lt;/p&gt;

&lt;p&gt;Same app. Same user. Different context window. Chaos.&lt;/p&gt;

&lt;p&gt;Backboard.io now includes Adaptive Context Management, a system that automatically manages conversation state when your app moves between models with different context sizes. &lt;/p&gt;

&lt;p&gt;P.S. If you have keys from any of the frontier providers or from OpenRouter, you can use this for free!&lt;/p&gt;

&lt;p&gt;You still get access to 17,000+ LLMs on the platform.&lt;/p&gt;

&lt;p&gt;You just don’t have to personally babysit their context windows anymore.&lt;/p&gt;

&lt;p&gt;And yes, it’s included for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Context Windows Are Inconsistent (and Annoying)&lt;/strong&gt;&lt;br&gt;
In a multi‑model setup, this is what actually happens:&lt;/p&gt;

&lt;p&gt;You start on a large‑context model. Everything fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt&lt;/li&gt;
&lt;li&gt;conversation history&lt;/li&gt;
&lt;li&gt;tool calls + tool responses&lt;/li&gt;
&lt;li&gt;RAG chunks&lt;/li&gt;
&lt;li&gt;web search results&lt;/li&gt;
&lt;li&gt;random runtime metadata you forgot you added&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your router decides to send the next request to a smaller‑context model.&lt;/p&gt;

&lt;p&gt;Suddenly your carefully curated “state” is too big to fit. Something has to go.&lt;/p&gt;

&lt;p&gt;Most platforms respond with:&lt;/p&gt;

&lt;p&gt;“Cool, just write truncation and summarization logic that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prioritizes what matters,&lt;/li&gt;
&lt;li&gt;handles overflow nicely,&lt;/li&gt;
&lt;li&gt;doesn’t break when you add a new tool,&lt;/li&gt;
&lt;li&gt;and works for every model you might ever route to.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we all end up writing the same brittle code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if tokens &amp;gt; limit:
    drop_old_messages()
    maybe_summarize()
    hope_nothing_important_was_there()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In a multi‑model system, that logic gets complicated and fragile fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What We Shipped: Adaptive Context Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Backboard now automatically handles context transitions when models change.&lt;/p&gt;

&lt;p&gt;There’s no extra endpoint and no new config. It runs inside the Backboard runtime whenever a request is routed to a model.&lt;/p&gt;

&lt;p&gt;When that happens, Backboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Looks up the model’s context window.&lt;/li&gt;
&lt;li&gt;Dynamically budgets it: 20% reserved for raw state, 80% freed via summarization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Within that 20% “raw state” budget, we prioritize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt&lt;/li&gt;
&lt;li&gt;recent messages&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;RAG results&lt;/li&gt;
&lt;li&gt;web search context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whatever fits in that 20% goes through unchanged.&lt;/p&gt;

&lt;p&gt;Everything else is handled by intelligent summarization.&lt;/p&gt;

&lt;p&gt;You don’t write the logic. You just route between models.&lt;/p&gt;
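
&lt;p&gt;If you did have to write it yourself, the budgeting rule above might look something like this minimal sketch (illustrative helper names and numbers, not Backboard’s actual implementation):&lt;/p&gt;

```python
# Sketch of the 20/80 budgeting rule described above. Names and numbers
# are illustrative, not Backboard's actual code.

RAW_STATE_FRACTION = 0.20  # share of the window passed through unchanged

def budget(context_window_tokens):
    """Split a model's context window into a raw-state budget and a
    summarization budget."""
    raw = int(context_window_tokens * RAW_STATE_FRACTION)
    return {"raw_state": raw, "summarizable": context_window_tokens - raw}

def pack_raw_state(items, raw_budget):
    """Greedily keep the highest-priority items (system prompt first, then
    recent messages, tools, RAG, web search) inside the raw budget;
    everything else becomes a candidate for summarization."""
    kept, overflow, used = [], [], 0
    for name, tokens in items:  # items already sorted by priority
        if used + tokens > raw_budget:
            overflow.append((name, tokens))
        else:
            kept.append((name, tokens))
            used += tokens
    return kept, overflow
```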

&lt;p&gt;&lt;strong&gt;How Intelligent Summarization Works&lt;/strong&gt;&lt;br&gt;
When we need to compress, we follow a simple rule:&lt;/p&gt;

&lt;p&gt;First try the model you’re switching to.&lt;/p&gt;

&lt;p&gt;“Hey smaller model, summarize this so you can still understand what’s going on.”&lt;/p&gt;

&lt;p&gt;If the summary still doesn’t fit, we fall back to the larger model that was previously in use to generate a more efficient summary.&lt;/p&gt;

&lt;p&gt;This preserves the important parts of the conversation while ensuring the final state always fits within the new model’s context window.&lt;/p&gt;

&lt;p&gt;All of this happens automatically during the request and tool calls.&lt;/p&gt;

&lt;p&gt;No manual orchestration. No custom jobs. No extra service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You Should Rarely Hit 100% Context Again&lt;/strong&gt;&lt;br&gt;
Because Adaptive Context Management runs continuously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It reshapes and compresses state before you slam into the limit.&lt;/li&gt;
&lt;li&gt;It keeps a buffer in the context window instead of riding at 99.9% and hoping for the best.&lt;/li&gt;
&lt;li&gt;Mid‑conversation model switches stop being a coin flip on whether something vital gets chopped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your job: define the routing logic and features.&lt;/p&gt;

&lt;p&gt;Our job: make sure the context window doesn’t quietly wreck them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You Still Get Visibility: context_usage in msg&lt;/strong&gt;&lt;br&gt;
This is not a black box.&lt;/p&gt;

&lt;p&gt;We expose context usage directly in the msg endpoint so you can see what’s happening in real time.&lt;/p&gt;

&lt;p&gt;Example response:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;"context_usage": {
  "used_tokens": 1302,
  "context_limit": 8191,
  "percent": 19.9,
  "summary_tokens": 0,
  "model": "gpt-4"
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much context is currently used&lt;/li&gt;
&lt;li&gt;how close you are to the limit&lt;/li&gt;
&lt;li&gt;how many tokens are from summarization&lt;/li&gt;
&lt;li&gt;which model is currently managing the context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you like graphs and dashboards, this gives you the raw data without forcing you to build your own context tracking system from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bigger Idea: Treat Models Like Infrastructure&lt;/strong&gt;&lt;br&gt;
Backboard’s thesis is simple:&lt;/p&gt;

&lt;p&gt;You should be able to treat models as interchangeable infrastructure.&lt;/p&gt;

&lt;p&gt;Your state should just move with the user.&lt;/p&gt;

&lt;p&gt;That only works if state can move safely between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cheap and expensive models&lt;/li&gt;
&lt;li&gt;long‑context and short‑context models&lt;/li&gt;
&lt;li&gt;different providers and pricing tiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adaptive Context Management is the safety layer that makes that viable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You route across thousands of models.&lt;/li&gt;
&lt;li&gt;Backboard keeps the conversation state aligned with each model’s constraints.&lt;/li&gt;
&lt;li&gt;You don’t write ad‑hoc truncation and summarization logic per model.&lt;/li&gt;
&lt;li&gt;You focus on product behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We handle the context window drama.&lt;/p&gt;

&lt;p&gt;Adaptive Context Management is free and live today in the Backboard API.&lt;/p&gt;

&lt;p&gt;No feature flag. No extra pricing line.&lt;/p&gt;

&lt;p&gt;You can start building with it now at:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.backboard.io" rel="noopener noreferrer"&gt;https://docs.backboard.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re already routing across multiple models and have horror stories about context windows, I’d love to hear them.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>api</category>
    </item>
    <item>
      <title>I'm biased but I love this!</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Tue, 24 Mar 2026 15:13:56 +0000</pubDate>
      <link>https://forem.com/robimbeault/im-bias-but-i-love-this-4oom</link>
      <guid>https://forem.com/robimbeault/im-bias-but-i-love-this-4oom</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2" class="crayons-story__hidden-navigation-link"&gt;I’m Learning AI in Public, and I Think Developers Need to Chill a Bit&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/jon_at_backboardio" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824580%2Fcbf3ef23-2d0b-4576-90ff-0d46b2119ea8.png" alt="jon_at_backboardio profile" class="crayons-avatar__image" width="96" height="96"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/jon_at_backboardio" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Jonathan Murray
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Jonathan Murray
                
              
              &lt;div id="story-author-preview-content-3395533" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/jon_at_backboardio" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824580%2Fcbf3ef23-2d0b-4576-90ff-0d46b2119ea8.png" class="crayons-avatar__image" alt="" width="96" height="96"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Jonathan Murray&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Mar 24&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2" id="article-link-3395533"&gt;
          I’m Learning AI in Public, and I Think Developers Need to Chill a Bit
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devrel"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devrel&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;49&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/jon_at_backboardio/im-learning-ai-in-public-and-i-think-developers-need-to-chill-a-bit-31d2#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              10&lt;span class="hidden s:inline"&gt; comments&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            5 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ai</category>
      <category>devops</category>
      <category>devrel</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Hidden Problem With Multi-Model AI Systems: Context Window Mismatch</title>
      <dc:creator>Robert Imbeault</dc:creator>
      <pubDate>Tue, 24 Mar 2026 13:37:22 +0000</pubDate>
      <link>https://forem.com/robimbeault/the-hidden-problem-with-multi-model-ai-systems-context-window-mismatch-821</link>
      <guid>https://forem.com/robimbeault/the-hidden-problem-with-multi-model-ai-systems-context-window-mismatch-821</guid>
      <description>&lt;p&gt;Notes from building infrastructure for 17,000+ LLMs&lt;/p&gt;

&lt;p&gt;One of the promises of modern AI infrastructure is simple:&lt;br&gt;
You should be able to switch models whenever you want.&lt;/p&gt;

&lt;p&gt;Different models have different strengths. Some are faster. Some are cheaper. Some reason better. Some support large context windows.&lt;/p&gt;

&lt;p&gt;In theory, you route requests dynamically and get the best of each.&lt;br&gt;
In practice, something breaks almost immediately.&lt;br&gt;
Context windows don’t match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Moment Everything Breaks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine this common scenario:&lt;/p&gt;

&lt;p&gt;A conversation begins on a large context model. Maybe something like a 128k context window.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system prompt is fairly large.&lt;/li&gt;
&lt;li&gt;The user has been chatting for a while.&lt;/li&gt;
&lt;li&gt;Tools have been called.&lt;/li&gt;
&lt;li&gt;A RAG system has pulled in documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything works. Then your router decides to switch to a smaller model, maybe for latency or cost reasons.&lt;/p&gt;

&lt;p&gt;Suddenly the entire state no longer fits. The request fails or the model behaves unpredictably.&lt;/p&gt;

&lt;p&gt;This happens because the model’s context window is not just holding messages. It contains the entire runtime state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompts&lt;/li&gt;
&lt;li&gt;recent conversation turns&lt;/li&gt;
&lt;li&gt;tool calls and tool outputs&lt;/li&gt;
&lt;li&gt;RAG results&lt;/li&gt;
&lt;li&gt;web search context&lt;/li&gt;
&lt;li&gt;other metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you exceed the limit, something has to give. Most teams end up writing custom logic to handle this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;truncating older messages&lt;/li&gt;
&lt;li&gt;prioritizing certain content&lt;/li&gt;
&lt;li&gt;summarizing conversation history&lt;/li&gt;
&lt;li&gt;trying to prevent context overflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This logic grows quickly and often becomes fragile. We ran into this problem while building Backboard, which currently routes across 17,000+ LLMs. So we built a system to handle it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Idea: Treat Context Like a Budget&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The approach we landed on was surprisingly simple. Instead of filling the entire context window with raw state, we reserve a portion of it as a stable budget. When a request is routed to a model, we allocate the context window like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~20% reserved for raw state&lt;/li&gt;
&lt;li&gt;~80% available for summarization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system calculates how many tokens fit inside that 20% allocation. Within that space we prioritize the most important live inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt&lt;/li&gt;
&lt;li&gt;most recent messages&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;RAG results&lt;/li&gt;
&lt;li&gt;web search context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else becomes eligible for summarization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Summarization Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the system identifies which parts of the state cannot fit directly into the context window, it compresses them. We designed the summarization pipeline around a simple rule: first try summarizing using the target model.&lt;/p&gt;

&lt;p&gt;If the summary still does not fit, fall back to the larger model previously used to generate a more efficient summary.&lt;/p&gt;

&lt;p&gt;This helps preserve as much information as possible while guaranteeing the final prompt fits inside the model’s context window.&lt;br&gt;
All of this happens automatically in the runtime.&lt;/p&gt;
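
&lt;p&gt;As a sketch, that fallback rule looks roughly like this. The summarizer callables stand in for real model calls; none of this is the actual Backboard code:&lt;/p&gt;

```python
# Sketch of the fallback rule above: try the target model's own summary
# first, then fall back to the previously used (larger) model.

def compress_history(history, token_count, budget,
                     summarize_with_target, summarize_with_previous):
    """Return a summary of `history` that fits in `budget` tokens,
    preferring the target model's own summary."""
    summary = summarize_with_target(history)
    if token_count(summary) > budget:
        # Target model's summary is still too large: ask the larger,
        # previously used model for a tighter one.
        summary = summarize_with_previous(history)
    if token_count(summary) > budget:
        # Last-resort guardrail so the final prompt always fits.
        summary = summary[: budget * 4]  # rough chars-per-token heuristic
    return summary
```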

&lt;p&gt;&lt;strong&gt;Avoiding Hard Context Failures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of our goals was to make context exhaustion extremely rare. Because the system runs continuously during requests and tool calls, the state is reshaped before the context window is fully consumed. In practice this means applications rarely hit the absolute context limit of a model. Developers do not have to constantly monitor token counts or worry about prompt overflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Making Context Usage Observable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even though the system runs automatically, we wanted developers to see what was happening, so we added context metrics directly to the API response.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"context_usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"used_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1302&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"context_limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8191&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;19.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"summary_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes it easy to track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much context is being used&lt;/li&gt;
&lt;li&gt;when summarization happens&lt;/li&gt;
&lt;li&gt;how close you are to a model’s limit&lt;/li&gt;
&lt;li&gt;which model processed the request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production systems, this visibility is useful for debugging and optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why We Think This Belongs in Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A lot of AI applications now route between multiple models depending on cost, latency, or capability, but context window management often ends up as application code. Our view was that this is infrastructure responsibility, not application responsibility. Developers should be able to move between models freely without rebuilding state management every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive Context Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We ended up calling this system Adaptive Context Management. Its job is simple: ensure the conversation state always fits the model being used.&lt;/p&gt;

&lt;p&gt;No prompt surgery.&lt;br&gt;
No manual truncation logic.&lt;br&gt;
No context window surprises.&lt;/p&gt;

&lt;p&gt;As AI systems move toward multi-model architectures, context management becomes one of the most important reliability problems.&lt;/p&gt;

&lt;p&gt;Different models will always have different limits.&lt;br&gt;
The goal is to make those differences invisible to developers.&lt;/p&gt;

&lt;p&gt;If you are curious about the architecture behind this or how we tested summarization quality, I’d love to hear how others are approaching context management in multi-model systems.&lt;/p&gt;

&lt;p&gt;Adaptive Context Management is now available in Backboard and automatically enabled for users.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>api</category>
    </item>
  </channel>
</rss>
