Forem: Jonathan Murray

If Microsoft and Uber can't afford AI coding, what chance do the rest of us have?

Jonathan Murray — Mon, 25 May 2026 15:46:18 +0000

Two stories landed in the same news cycle.

Microsoft cancelled most internal Claude Code licenses. Windows, Surface, Teams, Outlook, all migrating to GitHub Copilot CLI by June 30. Reporting is consistent on the why: usage exploded, the bills got indefensible, and the company that owns Azure and is one of Anthropic's biggest partners decided it was cheaper to migrate thousands of engineers than to keep paying the meter.

Uber's CTO Praveen Neppalli Naga said the company is "back to the drawing board" on AI coding. They burned through their planned 2026 AI budget within months. R&D was $3.4B last year and is still climbing. Engineers were ranked on internal leaderboards for AI tool usage. Claude Code became dominant. Costs went vertical.

Read that twice. Two of the most capitalized, AI-bullish companies on the planet just hit the wall on AI coding cost, and we're still in the first inning.

If they can't make the math work, what happens to the rest of us.

The thing nobody is saying out loud

The current generation of AI coding tools is built on an assumption: more tokens equals better output.

Bigger context windows. Longer reasoning chains. More tool calls per task. The whole industry is in a token-maxing arms race, and the pricing model is perfectly aligned with that race. Every additional token the agent burns is revenue for the model provider. Every re-fetch of the same file, every redundant reasoning loop, every "let me re-read your codebase to remember what we discussed", that's the meter running.

This is the part where I'm supposed to be diplomatic. I'm not going to be.

Claude Code is excellent. Cursor is excellent. Codex is excellent. The engineering is genuinely impressive. But the business model is a parking meter and you are the car. The longer your session, the deeper the agent goes, the more files it touches, the more money the vendor makes. Productivity and cost are positively correlated. That's not a bug. That's the design.

Microsoft figured this out at scale and pulled the plug. Uber figured it out and is rebuilding from scratch. If you're a developer reading this thinking "well, my $200/month plan is fine for now", I have bad news. Your plan is fine because somebody upstream is eating the difference between what you pay and what your usage actually costs. That subsidy ends the moment these companies need real margins. Anthropic is reportedly raising at a $900B valuation. OpenAI just raised again. The investor math doesn't close at "we lose money on every power user forever."

You're not the customer in this model. You're the funnel.

Bigger context is not the answer

The industry's response so far has been to make the context window bigger. 200K. 1M. 2M. Look at all this room.

This is a category error.

A bigger context window doesn't help you, it helps the bill. You're paying to stuff your entire repo into a prompt every turn so the model can "remember" what file structure you have. That's not memory. That's amnesia with a credit card attached.

Real memory, the kind your brain runs on, doesn't reload everything every time you think. It selectively recalls what's relevant. It compresses. It forgets things that don't matter. It builds a model of the world that persists across sessions.

When your coding agent actually remembers your codebase architecture, your conventions, the decision you made last Tuesday, the bug you fixed in auth.ts three weeks ago, the patterns your team prefers, it doesn't need to re-read 400K tokens of context to do the next task. It already knows. The token bill collapses. Quality goes up, not down, because the agent isn't drowning in fresh context every turn.

This is the part of the stack the hyperscalers don't want to build.

Memory is harder than context. Memory is opinionated. Memory requires you to commit to architecture decisions about what to retain, what to compress, what to forget. And critically, memory cuts token revenue. It's a direct conflict of interest for any vendor whose margin depends on you burning tokens.

If you're a vendor making money per token, why would you ever ship the feature that uses fewer tokens.

You wouldn't. And they haven't.

Silicon Valley can afford this. The rest of the world cannot.

A Brazilian developer earning R$15K/month does not have a $200/month Claude Max budget. A two-person Jakarta startup is not dropping $1,500/month per seat on agentic coding. An indie hacker in Lagos is not running a Cursor team plan. The math doesn't work and it isn't going to start working because OpenAI raises another $40B at a higher valuation.

The current AI coding market is a luxury product priced for San Francisco salaries and venture-subsidized burn. That's a real market and the companies serving it should keep serving it. But pretending that's the market is delusional.

There are roughly 30 million developers globally. Maybe 2 million of them work at companies that can sustainably absorb token-metered agentic coding at current prices. The other 28 million need a different solution. Not a worse one. A different one. One whose architecture isn't designed to extract maximum revenue per keystroke.

And let's be honest with each other for a second. The "AI levels the playing field for developers in emerging markets" narrative has been one of the dominant talking points of the last two years. Every keynote. Every blog post. Every "the future of work is global" panel.

How is that going? Right now, with current pricing, the playing field is the most tilted it has ever been. A junior developer in Toronto on a Pro plan has more leverage per dollar than a senior developer in São Paulo on a budget. That's not democratization. That's a new caste system with better marketing.

The "aligned with the driver, not the parking meter" test

I keep coming back to this framing because it keeps being right. The question to ask any AI tool you adopt going forward is whose side the economics are actually on.

If the vendor makes more money when you use it more, you have a parking meter. Your interests and theirs diverge the moment you scale.

If the vendor makes more money when you succeed (you ship faster, retain users, build better), you have a partner. Your interests align.

Most of the AI coding industry right now is parking meters wearing partner costumes. Microsoft just got billed for the parking. Uber too. The smart play for everyone else is to pick tools where the architecture itself, not just the marketing copy, is on your side.

What we're doing about it

We're opening the alpha of our CLI at Backboard. Memory-first. Built for the 28 million developers who are not the target market of the current generation of tools.

I'm not going to pitch you here. I'm telling you we're taking this problem on, and we'd rather have a smaller post and a bigger fight than the other way around.

If the Microsoft and Uber stories landed wrong for you, if you're tired of token bills that look like rent, if you think memory is more interesting than context, come find us.

We're aligned with the driver. Not the parking meter.

backboard.io

OpenAI and Anthropic are Friendster and MySpace, if Subquadratic proves to be true.

Jonathan Murray — Wed, 06 May 2026 15:24:34 +0000

If you've ever shipped an LLM-powered feature that needed to reason over a real codebase, a real contract, or a real research corpus, you already know the shape of the problem. The model technically accepts a million tokens of context. In practice, the answers get worse as the context gets longer, and your infra bill gets worse faster than that.

SubQ is built around SSA — Subquadratic Sparse Attention — a linearly scaling attention mechanism designed for long-context retrieval, reasoning, and software engineering workloads. The technical results are strong on their own merits: 52.2× prefill speedup at 1M tokens, RULER 95.0%, MRCR v2 65.9%, SWE-Bench Verified 81.8%.

But the more interesting question is what happens to the industry if results like these stop being a one-off. The valuations, pricing, and competitive narrative around the major labs have been priced as if compute is the moat — as if maximizing token use and burning more dollars per call is the cost of doing business at the frontier. SSA is one of the first credible signals that this might not be true for much longer. And if it isn't, the OpenAIs and Anthropics of today look less like permanent fixtures and more like the Friendsters and MySpaces of the next platform shift.

The problem isn't "missing context." It's fragmented context.

The hard problems enterprise AI needs to solve are long-context problems. Codebases, contracts, enterprise corpora, databases, spreadsheets, research collections, and long-running agent sessions rarely fail because the answer is absent. They fail because the relevant evidence is distributed across a large body of context, referenced indirectly, and only meaningful when multiple pieces are held in view at once.

If you build with these systems, this list will look familiar:

a codebase where a function is defined in one module, called in dozens of others, and constrained by tests elsewhere
a contract where an obligation depends on a definition, an exception, and a referenced clause several pages apart
a research workflow where a conclusion depends on reconciling evidence across many papers
a long-running coding task where prior planning decisions, intermediate edits, review notes, and regressions all matter

These aren't lookup problems. They're multi-hop reasoning problems over fragmented corpora. And the workarounds we've been using — chunking, RAG, agentic decomposition, recursive summarization — all have the same shape. They preserve some signal and lose some signal. RAG keeps semantic similarity but loses position, hierarchy, neighboring context, and reference structure. Agentic workflows decompose tasks into smaller calls but compound errors across steps and bake hand-authored orchestration policy into the system. The bitter lesson keeps showing up: scaffolding that works today doesn't generalize tomorrow.

SSA is an attempt to remove more of the reason that scaffolding is necessary in the first place.

Why dense attention is the bottleneck

Attention is a retrieval operation built into the model. Each token acts as a query, compares itself against every other token, scores their relevance, and aggregates their information into its next representation. Powerful, because every token gets access to the full context. Expensive, for the exact same reason — every query compares against every key, and the cost grows quadratically with sequence length.

At small contexts this is fine. At hundreds of thousands to millions of tokens, it becomes the dominant constraint. Doubling context doesn't double cost; it quadruples it.

And here's the part that should bother any engineer: most of that work is wasted. In trained models, the vast majority of attention weights are near zero. The model performs the full all-pairs comparison, but only a small fraction of those interactions meaningfully influence the output. Dense attention isn't just quadratic — it's wastefully quadratic.

FlashAttention made this much more practical at today's context lengths by avoiding materialization of the full attention matrix and optimizing memory movement. That's a real win. But it doesn't change the underlying scaling. The number of comparisons is still the same. The model still does quadratic work; it just does that work more efficiently.

System-level workarounds — retrieval pipelines, context compaction, recursive decomposition, agentic orchestration — make dense-attention systems usable. None of them change the scaling law. They route around the limitation. The quadratic cost is the boundary they're routing around.

What prior efficient architectures gave up

The field has spent years trying to make attention cheaper. The hard part isn't reducing cost. It's reducing cost without breaking retrieval. Every prior approach traded something away.

Fixed-pattern sparse attention — sliding windows, strided patterns, dilated masks — gets subquadratic scaling by deciding in advance which positions a token can attend to. The routing decision is positional, not content-aware. The model decides where to look before it knows what it's looking for. When the relevant information falls outside the pattern, it's invisible.

State space models and recurrent alternatives drop the all-pairs comparison entirely, replacing it with a compressed state that evolves across the sequence. Linear scaling by construction — but the state has fixed capacity. Information gets summarized, blurred, or discarded as the sequence grows. Great at gist and structure, weaker at retrieving a specific fact introduced arbitrarily far back.

Hybrid architectures combine both ideas: efficient layers do most of the compute, dense attention layers preserve retrieval. Works in practice, but the dense layers stay load-bearing. As context grows, their quadratic cost dominates again. The benefit is scalar, not asymptotic.

DeepSeek Sparse Attention offsets attention's quadratic cost to a lightning indexer that selects, per query, which keys to attend to. The indexer is itself quadratic — it scores every query against every key with small constants but the same O(n²) scaling. The complexity has been moved, not removed.

The pattern is consistent. Fixed sparsity gives up content-dependent routing. Recurrent models give up exact retrieval. Hybrids reintroduce the original cost. DeepSeek-style indexers stay quadratic and become cost-prohibitive at scale.

The open problem isn't "make attention faster." It's: build a mechanism that's efficient, content-dependent, and capable of retrieving from arbitrary positions across long context.

How SSA works

SSA changes how attention work is allocated. The core idea is content-dependent selection: for each query, the model selects which parts of the sequence are worth attending to, and computes attention exactly over those positions.

Dense attention assumes every pair might matter, so it evaluates all of them. In practice, almost none do. SSA drops that assumption. It doesn't approximate attention — it restricts attention to the positions that actually carry signal, and skips the rest.

That gives SSA three properties that matter together:

Linear scaling in compute and memory. Attention cost grows with the number of selected positions, not the full sequence. Long context becomes economically usable.
Content-dependent routing. The model decides where to look based on meaning, not position. Relevant information can be retrieved regardless of where it appears.
Sparse retrieval from arbitrary positions. Unlike recurrent or compressed approaches, SSA preserves the ability to recover specific information introduced far earlier in the sequence.

The practical distinction matters: SSA is not just a faster implementation of dense attention. It reduces the amount of attention work the model performs. That reduction is what shows up as speed.

Measured in wall-clock input processing time on B200s, SSA achieves the following speedups over standard attention with FlashAttention-2 (FlashAttention-3 did not produce a speedup over FA-2 on B200s):

Context length	SSA speed increase vs. FlashAttention
128K	7.2×
256K	13.2×
512K	23.0×
1M	52.2×

This is the throughput inversion that matters in production. Dense attention becomes slower relative to SSA as context grows. SSA gets more advantageous exactly where long-context workloads become most valuable.

Training SSA for long-context behavior

Architecture is necessary but not sufficient. A model can have a long context window and still fail to use it well. SSA was trained to make long-context use reliable, not just possible.

The training pipeline is three stages:

Pre-training establishes base language modeling capability and the long-context representations the selection mechanism uses.
Supervised fine-tuning shapes behavior toward instruction following, structured reasoning, and the code generation patterns enterprise workloads need.
Reinforcement learning targets the behaviors that are hardest to induce through supervised examples: reliable long-context retrieval, and coding behavior that uses the available context aggressively instead of defaulting to local reasoning.

That last stage is the one developers should care about. Long-context failures often look plausible. A model answers from nearby context because nearby evidence is easier to use, even when the decisive evidence is much earlier. It produces a locally correct patch that violates an interface defined elsewhere. It summarizes a prior decision instead of preserving the exact constraint that should govern a later step. SSA's RL stage is designed around exactly those failure modes.

Training data emphasizes long-form sources with high information density and cross-reference structure — the kind of data that forces the selection mechanism to learn routing over large positional distances. The goal isn't benchmark memorization. It's teaching the model to attend to what matters regardless of where it sits.

Why the training infrastructure matters too

Long-context training isn't only a modeling problem. It's a systems problem that only shows up at scale. At million-token sequence lengths, failure modes that are invisible at shorter contexts become binding — memory pressure, sequence partitioning across devices, gradient instability, numerical precision, kernel efficiency. These determine whether training runs at all.

The SSA training stack runs stably at 1M tokens and beyond, maintains linear memory scaling across the training pipeline, and uses distributed sequence parallelism to shard sequences across devices when they exceed single-device limits.

The consequence isn't just that long-context training becomes possible. It becomes iterable.

Under dense attention, long-context experiments are expensive enough that they get treated as reserved runs. With SSA's linear scaling, they become routine. More ablations, more evaluations, faster feedback, targeted fixes on the behaviors that actually matter at long context.

That's the deeper implication. SSA doesn't only reduce the cost of inference. It reduces the cost of learning long-context behavior in the first place — and that's the thing that compounds for developers downstream.

Evaluating functional context, not nominal context

An advertised context window doesn't tell you how much context a model can use. The real question is whether the model can retrieve, connect, and reason over evidence distributed across that window.

SubQ is evaluated across two axes:

Deployment viability — compute reduction and wall-clock speed
Retrieval capability — RULER, MRCR v2, and SWE-Bench Verified

More general benchmarks will be published in the upcoming model card. Needle-in-a-Haystack tests exact retrieval of a single target. RULER extends that to multi-hop retrieval, aggregation, variable tracking, and selective filtering. MRCR v2 goes further: the model must locate and integrate multiple pieces of evidence distributed across the context, where the relevant set isn't given in advance. That's closer to the shape of real work — finding one fact isn't enough; the model has to determine which pieces matter and combine them into a coherent answer.

Results

Compute and speed

SSA's linear scaling means doubling context length doubles attention compute, rather than quadrupling it. At 1M tokens, that's a 62.5× attention FLOP reduction relative to standard quadratic attention.

Context length	Attention FLOP reduction vs. standard attention
128K	8×
1M	62.5×

Wall-clock speed is the more product-relevant result: a 52.2× prefill speedup over dense attention at 1M tokens. That's the difference between a long-context system that behaves like an interactive tool and one that feels like an offline batch job.

Context length	Input processing speed increase
128K	7.2×
256K	13.2×
512K	23.0×
1M	52.2×

RULER

RULER tests retrieval and reasoning beyond simple needle lookup — multi-hop retrieval, aggregation, variable tracking, selective filtering.

Model	RULER @ 128K
SSA / SubQ	95.0%
Opus 4.6	94.8%

For real workflows this matters because multi-hop tasks compound. A missed reference early in the chain can corrupt every conclusion downstream.

MRCR v2

MRCR v2 is the most demanding retrieval benchmark in this set. It evaluates the ability to locate and integrate multiple non-adjacent pieces of evidence across long context.

Model	MRCR v2 score
SSA / SubQ	65.9%
Opus 4.6	78.3%
GPT 5.5	74.0%
GPT 5.4	36.6%
Opus 4.7	32.2%
Gemini 3.1 Pro	26.3%

SubQ lands at 65.9% — solidly in the range of frontier dense models, well ahead of GPT 5.4, Opus 4.7, and Gemini 3.1 Pro. That's the clearest evidence for the gap between nominal and functional context. A model can accept a long input and still fail to reason reliably over that input. MRCR v2 surfaces the gap because it requires retrieval and combination, not just token processing.

SWE-Bench Verified

SWE-Bench Verified is an end-to-end software engineering benchmark on real GitHub issues. Not a pure retrieval test — it asks whether the model can use codebase understanding to localize bugs, reason about implementation constraints, and produce patches.

Model	SWE-Bench Verified
SSA / SubQ	81.8%
Opus 4.7	87.6%
Opus 4.6	80.8%
Gemini 3.1 Pro	80.6%
GPT 5.4	not reported
GPT 5.5	not reported

Sitting at 81.8% — ahead of Opus 4.6 and Gemini 3.1 Pro on a real-world coding benchmark while running on a subquadratic architecture — is the result that should land hardest for developers. This is the workload most of us actually care about.

The part nobody priced in

Step back from the architecture for a second and look at what the current AI industry is actually selling.

The valuations, the capex, the data center buildouts, the multi-year compute contracts — all of it is underwritten by an assumption that frontier intelligence requires frontier-scale spend. Long context costs a lot. Reasoning costs a lot. Agents cost a lot. The premise running through every pitch deck and earnings call is that the labs with the most GPUs win, and the rest of the market pays for tokens at whatever margin those labs choose.

SSA is one architecture, on one model, with one set of benchmarks. But the result it points at is uncomfortable for that premise: the dominant cost of long-context inference may not be a law of physics — it may be an artifact of dense attention. A 52.2× prefill speedup at 1M tokens isn't a 10% efficiency gain. It is the kind of step-change that, if it generalizes, rewrites the unit economics of the entire industry.

If you don't have to maximize tokens consumed and dollars burned to get frontier-quality long-context behavior, a lot of the moat narrative collapses with it.

Why the incumbents look more fragile than they're priced

The Friendster and MySpace comparison isn't snark — it's a specific lesson. Both had network effects. Both had brand. Both had scale advantages that looked durable right up until a better-architected product showed up and the users moved over a weekend. The moat people talked about (network effects, switching costs) turned out to be much weaker than the moat that actually mattered (a better product on a better stack).

The current frontier labs have a similar mismatch:

API-level switching cost is near zero. Most production code paths abstract the model behind a thin client. Swapping providers is a config change, not a migration.
Compute scarcity is the moat people brag about. It is also the moat that subquadratic architectures attack first. If a challenger can match frontier quality at a fraction of the FLOPs, the capex advantage flips into a capex liability — billions of dollars of GPU contracts depreciating against a more efficient successor.
Pricing power assumes scarcity. Today's per-token prices for long context look reasonable because the underlying compute is genuinely expensive. Drop the cost of a 1M-token prefill by 50× and the same prices start looking like rent extraction, not value capture.
Brand isn't a defense once parity exists. "Nobody got fired for buying OpenAI" works until a model with comparable benchmarks costs an order of magnitude less to serve. Then it works against them, the same way "nobody got fired for choosing IBM" did.

This isn't a prediction that any specific lab disappears. Anthropic, OpenAI, and Google have real assets — distribution, talent, training data, alignment research, regulatory relationships. Those don't evaporate. But the valuations and the pricing power are built on the assumption that frontier compute is a stable moat, and that assumption depends on dense attention staying expensive.

SSA is one of the first credible signals that it might not.

What developers should actually take away

Strip out the industry analysis and the practical takeaways for anyone building on top of these systems are pretty clean:

Long context as a product surface is about to get a lot cheaper and a lot better. If you've been deferring long-context features because the economics didn't pencil, the economics are about to pencil.
A nominal context window has never told you what a model can actually use. RULER 95.0% and MRCR v2 65.9% on a subquadratic architecture is the gap between marketing tokens and functional tokens, and that gap is closing.
Less hand-authored scaffolding. Chunking, recursive summarization, and bespoke orchestration are workarounds for an attention bottleneck. As that bottleneck loosens, the scaffolding becomes a maintenance burden rather than an asset.
Watch where the open and challenger labs go next. Efficient architectures disproportionately benefit teams that don't already own a hyperscaler-sized GPU fleet. The next frontier-quality model that runs cheaply on commodity infra is the one to track.
Don't lock into long-term commitments priced on dense-attention economics. Multi-year contracts written against today's per-token costs are the riskiest thing on the table if a successor architecture cuts those costs by an order of magnitude.

SSA on its own is one paper, one architecture, one set of numbers. The reason it's worth paying attention to is what it implies if the result is real and replicable: the AI bubble's tightest correlation — bigger spend, better model — gets a lot weaker. That's good for developers, good for customers, and meaningfully bad for any incumbent whose story to investors depends on the old curve holding.

The Friendsters and MySpaces of this cycle won't lose because their products got worse. They'll lose because someone shows up with a better-architected stack at a fraction of the cost, and the switching cost turns out to have been a config flag the whole time.

Worth watching.

Very cool use of Backboard!

Jonathan Murray — Sat, 02 May 2026 02:29:45 +0000

DEV Weekend Challenge: Earth Day

Arqam Waheed

Apr 20

Terra Triage: I Built a 3-Agent Wildlife Dispatcher That Learns From Every Referral

#devchallenge #weekendchallenge #ai #backboard

Comments 3

9 min read

"Of Course" Erodes Trust Faster Than Bad Code ... Two Words That Are Killing Your Career

Jonathan Murray — Thu, 30 Apr 2026 18:38:04 +0000

You already have the job or the internship. You're on the team. You're in the meetings. You're in the Slack channels.

And the thing that's going to hold you back has nothing to do with your code.

Someone you work with says "hey, could we do X?" and you say "yeah, of course." Feels confident. Feels like you just proved you've got it.

But you gave an answer that contains zero information. No cost. No timeline. No tradeoff. No indication of whether you even understood the question. And now you're either about to disappear for two weeks and come back with something nobody asked for, or pull an all-nighter for something that was just a question, not a request.

Both started with "of course."

Why I'm Writing This

I'm a non-technical founder. I don't write the code. But I build alongside my team every day. I set direction, I think through problems, I get my hands dirty in the product.

The devs who accelerated fastest on my team were never the ones who said yes the fastest. They were the ones who slowed down long enough to make sure we were talking about the same thing. The ones who said "of course" to everything burned out, shipped the wrong thing, and lost trust. Not because they weren't talented, but because they never gave anyone a chance to actually collaborate with them.

A Yell That Sounds Like a Whisper

Not everything a founder or lead says carries the same weight. But it doesn't always feel that way from your side. When the person steering the ship says "hey what if we tried this," it can land like a mandate even when it's just a thought.

So before you go heads-down for 48 hours on something mentioned in a 5-minute conversation, ask:

"Is this urgent or is this something we should plan for?"

"Are you asking me to build this or are you asking if it's possible?"

"Where does this sit relative to what I'm working on right now?"

And yeah, sometimes the answer is going to be "yes, it's urgent, do it right now, and please don't ask me any more questions." That happens. But even that is better than the silence you were working inside of before. That five-second question just saved you from building the wrong thing at the wrong pace.

"Of Course" Erodes Trust Faster Than Bad Code

You say "of course." You go dark. A week passes. Someone checks in. The thing isn't done, or it's half-done, or it's not what was asked for. The people around you start second-guessing every "of course" that comes after it.

That didn't happen because you're a bad developer. It happened because you skipped making sure everyone was on the same page before you started building.

If you're stuck, say so. If it's more complex than expected, flag it. If the original ask doesn't make sense technically, speak up. "That won't work because of X, but here's what would" is one of the most valuable sentences in engineering.

If you can build but you can't communicate what you're building, why you built it that way, and what could go wrong, you are operating at half your potential.

What the Best Devs Actually Sound Like

"I want to make sure I understand what you're looking for before I start."

"That's a cool idea. Here's what it would take and here's what we'd need to deprioritize."

"I can do a rough version by Friday to see if it's even the right direction. Want that instead of the full build?"

"Honestly, I'm not sure yet. Let me look into it and come back to you tomorrow."

None of those sound weak. They sound like someone you'd hand the keys to. Someone who respects the problem enough to not pretend it's already solved.

The Line

The most dangerous dev says "of course."

The most valuable dev says "let me make sure I understand the problem first."

You already got the job. Now show them why they were right.

The Hidden Challenge of Multi-LLM Context Management

Jonathan Murray — Fri, 24 Apr 2026 20:19:51 +0000

Why token counting isn't a solved problem when building across providers

Building AI products that span multiple LLM providers involves a challenge most developers don't anticipate until they hit it: context windows are not interoperable.

On the surface, managing context in a multi-LLM system seems straightforward. You track how long conversations get, trim when needed, and move on. In practice, it's considerably more complex — and if you're routing requests across providers like OpenAI, Anthropic, Google, Cohere, or xAI, there's a fundamental mismatch that can break your product in subtle ways.

The Tokenization Problem

Every major LLM provider uses its own tokenizer. These tokenizers don't agree. The same block of text produces different token counts depending on which model processes it. The difference is often 10–20%, sometimes more.

What this means in practice: a conversation that fits comfortably in one model's context window may silently overflow another's. A prompt routed to OpenAI might count as 1,200 tokens; the same prompt routed to Claude might count as 1,450. That gap matters.

Where It Breaks

The failure modes tend to show up at the boundaries. When you switch providers mid-conversation, the new model has to ingest the full prior context. If your context management layer was calibrated to the previous model's tokenizer, the new model may see a context that's already at or over the limit — before it's even responded to anything new.

This produces three common failure patterns:

Unexpected context-window overflow: the conversation that worked before now breaches the limit
Inconsistent truncation: different models truncate at different points, changing what prior context the model actually sees
Routing failures that are unpredictable because the numbers your system used don't match the numbers the model actually used

Why Simple Estimates Fail

The instinct is to maintain a single "token estimate" with a generous safety margin. The problem is that the margin you'd need varies by provider, model version, and content type (code tokenizes differently than prose). A margin calibrated for one use case will either be too tight for another, causing failures, or too generous, causing unnecessary truncation that degrades conversation quality.

The Solution: Provider-Aware Token Counting

A robust multi-LLM context management layer makes token counting provider-specific. Rather than maintaining a single estimate, it measures each prompt the way the actual target model will measure it. The routing layer uses these per-provider measurements to make decisions before requests are sent.

This lets the system stay ahead of context limits: it knows when a conversation is approaching an edge, trims or compresses history calibrated to the specific model receiving the request, and avoids the pricing and failure surprises that come from miscounted tokens.

The end result is what users should see: a smooth conversation experience, regardless of which model is serving it. The complexity of "every model speaks a slightly different token language" stays inside the infrastructure layer, invisible to the people using the product.

This is the approach we've taken in our adaptive context window management component, and it's become a foundational part of how we think about multi-LLM routing more broadly.

Rob Imbeault
Apr 17, 2026

Why LLM Reasoning Is Breaking AI Infrastructure (And How to Fix It)

Jonathan Murray — Fri, 24 Apr 2026 20:18:05 +0000

If you've tried building anything serious on top of large language models (LLMs) recently, you've probably run into this:

"Thinking" is supposed to make models better. In practice, it makes your infrastructure worse.

This isn't a model problem—it's an infrastructure and abstraction problem. And it's getting worse as teams scale across multiple AI providers.

Let's break down exactly where things go wrong.

The Illusion of "Just Turn On Reasoning"

At a high level, LLM reasoning sounds straightforward:

Turn reasoning on → better answers
Turn reasoning off → cheaper, faster

But in production systems, reality looks very different.

What actually happens:

Models don't reason when explicitly prompted
Models over-reason on trivial queries, wasting tokens
Behavior is inconsistent across providers and model versions

Instead of predictable performance, you get variability.

You're no longer just building an AI product—you're debugging model behavior at runtime.

The Fragmentation Problem in LLM Reasoning

One of the biggest hidden challenges in AI infrastructure today is fragmentation.

Every major provider has implemented reasoning differently:

OpenAI → reasoning effort levels (low, medium, high)
Anthropic (Claude) → explicit reasoning token budgets
Google AI (Gemini) → hybrid approaches depending on model version

That's just input configuration.

Output fragmentation is even worse:

Some models return separate reasoning blocks
Others provide summarized reasoning
Some mix reasoning directly into standard responses

There is:

No shared schema
No standardized interface
No predictable structure

What this means for developers:

If you're building a multi-model AI system, you now need:

Input normalization layers
Output parsing logic per provider
Custom handling for reasoning formats

At this point, "simple API routing" becomes complex middleware engineering.

AI Cost Optimization Becomes a Moving Target

Reasoning doesn't just impact performance—it breaks cost predictability.

Billing inconsistencies across providers:

Some expose reasoning tokens explicitly
Others bundle them into total usage
Some introduce custom billing fields

Now you're not just optimizing latency or quality.

You're building a cost translation layer across providers.

This adds complexity to:

Forecasting
Budget control
Scaling decisions

Why Multi-Model Switching Breaks Systems

In theory, switching between LLM providers should improve reliability and cost efficiency.

In practice, it introduces system instability.

Even within a single provider:

Different endpoints behave differently
Input formats change
Output schemas change
Reasoning structures vary

Now add state management:

What context should persist?
How do you maintain reasoning continuity?
How do you prevent token explosion?

The result:

Most teams either:

Abandon portability, or
Build fragile adapter layers that constantly break

The Real Problem: Lack of Abstraction

After working through these challenges, one thing becomes clear:

The core issue isn't reasoning—it's the absence of a unified abstraction layer.

Developers today are forced to:

Learn multiple reasoning systems
Normalize different response formats
Track multiple billing models
Rebuild state handling for each provider

This is not scalable.

What "Unified LLM Reasoning" Should Look Like

To make AI infrastructure truly production-ready, reasoning needs to be abstracted.

A unified system should provide:

A single reasoning parameter
Direct control over reasoning budgets
Consistent behavior across models
Standardized input/output formats

The impact:

Developers can:

Tune reasoning without provider lock-in
Switch models without rewriting logic
Maintain consistent state across systems

And most importantly:

Stop thinking about thinking.

The Uncomfortable Truth About Scaling AI Systems

If you're working with LLMs and haven't encountered these issues yet—you will.

Complexity compounds rapidly when you:

Add a second provider
Enable reasoning features
Optimize for cost
Maintain persistent context

At that point:

You're no longer building your product. You're building AI infrastructure.

The Future of AI Platforms

Short-term impact:

Reduced engineering time (weeks to months saved)
Lower debugging overhead
More predictable cost structures

Long-term shift:

The winning AI platforms won't be defined by model quality alone.

They will be defined by:

Interoperability (model interchangeability)
Statefulness (persistent, portable context)

That's the real unlock in the next phase of AI development.

Quick Audit for Your AI Stack

If you're currently integrating multiple LLM providers, ask yourself:

How many reasoning formats are you handling?
How portable is your state management layer?
How predictable are your AI costs?

If those answers aren't clean and consistent:

You're already paying the infrastructure tax.

Rob Imbeault
Apr 20, 2026

I Broke SSO Trying to Center a Div. Let's Talk About Tokenmaxxing

Jonathan Murray — Fri, 24 Apr 2026 15:28:12 +0000

Backboard CODEGEN CLI Waitlist

A couple weeks ago, I tried to recenter some text on one of my side project SSO pages.

That's the whole task. Move the text. Left a bit. Right a bit. Until it's in the middle. Center. Middle.

I opened Claude Code. I said, roughly, "hey, center this."

Fifteen minutes later I was two bugs deep, SSO was broken — not the text, the whole login flow — and I'd hit my usage limit trying to unbreak the thing I broke while trying to do the thing that should've taken eight seconds in the inspector.

Palm. Face.

"Why did I just do that?"

That was tokenmaxxing. You know what tokenmaxxing is. Your timeline knows. There is an entire subgenre of VC on X right now posting, with their whole chest, variations of "if your engineers aren't maxing out their token budgets every day, they aren't working hard enough."

Three thousand likes. Quote tweets from other VCs agreeing. "This," they type. "100%," they type.

I want to say this clearly, one time, so we can move on: that is insane.

Measuring engineering effort by token spend is like measuring a chef by how much gas they burn. Congratulations. Your kitchen is on fire and the soup is fine.

Tokenmaxxing is when the answer to every problem — a typo, a bug, a bad schema, a bad decision, a bad Tuesday — is to shove more context, more tokens, more model at it until the problem stops complaining.

It is console.log("hello world") wearing a $400 watch.

A lot of people are going to read this and get defensive. I get it. I've done it.

I once built a "documentation agent" that loaded the entire repo into context and then asked, very politely, whether we had a login page.

We did. It was in routes/login.tsx.

That query cost $2.17.

I tell myself it was research.

Here's the part nobody says out loud: brute-force compute is the new jQuery.

Not in the "it works, ship it" way. In the "we're going to look at this in three years and wince" way.

We're living in a window where:

A 100k-token prompt to find one number is considered normal.
"Just pass the whole codebase" is a real architectural decision that real adults say out loud in real meetings.
The solution to hallucinations is more tokens. The solution to latency is more tokens. The solution to your cat being sad is, apparently, more tokens.

And the people selling the tokens? Thrilled. Obviously. You would be too.

I want to be clear: I love LLMs. I use them constantly. I have emotions about them I will not discuss here.

But the current game is rigged in a very specific direction. The model companies make more money when you're lazy. Your sloppy prompt is their margin. Your 90,000-token scaffolding is someone's yacht.

Meanwhile, the indie devs — the people who built the internet worth having — are getting priced out of the exact kind of tinkering that used to be free. You can't "just try something" when "just trying something" is $40.

The next big app should not require a $10k/month API budget to prototype. It used to require a laptop and an unreasonable amount of Red Bull. I'd like to go back to that, if possible.

So here is my proposal, which I will now name dramatically so it fits in a tweet.

The Token Minimizing Revolution.

It has two rules, and they are embarrassingly obvious.

Precision over volume. A small, clever retrieval beats a giant dumb context every time. RAG, fine-tunes, routers, caches. Boring stuff. Works.
Token-golfing is the new code-golfing. The flex is not "look what I made the big model do." The flex is "look what I made the small model do."

We're building something right now at backboard.io that is the opposite of tokenmaxxing.

But if you've been feeling that itch — the one where you look at your API bill and think this is not a technology problem, this is a vibes problem — you are not alone.

The revolution will be small. Efficient. Under budget.

Stop building what your customers ask for

Jonathan Murray — Wed, 22 Apr 2026 13:44:39 +0000

I was at a conference this week.

Bunch of stakeholders on stage. Hospital admins, big-name buyers, a couple of policy folks. The message to founders was loud and clear:

"You need to be consulting us. You need to be adapting your products to our suggestions."

And honestly? I hated it.

Not because they were completely wrong. They were half right. They were just shouting the half that was wrong.

Here's the part that's true

Building in a vacuum is how you ship things nobody uses. Founders, especially technical ones, have a real habit of deciding what the world needs from inside a Notion doc.

So yes. Talk to users. Ride along. Watch people struggle with your product. All of that.

The stakeholders aren't crazy for wanting a seat at the table.

Here's the part that breaks things

"Listen to us" slowly turns into "do what we say."

And that's where it gets weird.

Because every dev on earth has learned this lesson already. It's called a bug report.

A user says "the login is slow." You dig in. The login isn't slow. They're on hotel wifi and there's no loading spinner, so it feels frozen. The complaint was real. The proposed fix, "make the login faster," was useless.

Stakeholder feedback works exactly the same way.

The pain is the signal. The proposed fix is a guess. Usually a bad one.

A senior eng who shipped whatever the ticket said would get laughed out of the room. Why do we call a founder who ships whatever the customer asks for "responsive"?

Why the fix is almost always wrong

Three reasons, no mystery to any of this:

1) Stakeholders see their slice. Not the whole system. Of course their fix is local.
2) They imagine solutions inside the workflow they already have. Which is often the exact workflow you're trying to change.
3) The thing that would actually solve the problem doesn't exist in their vocabulary yet. That's kind of your job.

When a cardiologist says "add a button that auto-generates the referral letter," the real signal is referrals are friction. The button might be the worst possible version of the fix. Maybe the letter shouldn't exist. Maybe the referral shouldn't need a letter. That's a conversation. Not a ticket.

The receipt: healthcare AI just ran this experiment for us

For years, stakeholders told the industry they wanted "AI that can pass the medical boards."

The industry listened. Every model got tuned on USMLE-style questions. Board-exam scores became the benchmark everyone pointed at.

This month, JAMA Network Open dropped a study across 21 top LLMs (ChatGPT, Claude, Gemini, DeepSeek, Grok). Final-diagnosis accuracy on complete cases? Over 90%.

Differential diagnosis, the thing an actual doctor does all day? Failed more than 80% of the time.

The stakeholders asked for the wrong benchmark. Founders shipped it. We now have a generation of models that ace trivia and fold on reasoning.

The founders who had pushed back, the ones who said we hear you want trustworthy AI, we're not going to chase board scores to prove it, would look prescient right now. The ones who obeyed built an industry of exam-passers.

How to actually do it

When a stakeholder hands me a feature request, I try to never put it in the backlog as written. Three questions first:

1) What were they trying to do when they felt the pain?
2) What's the actual friction, stripped of their proposed fix?
3) What would "solved" feel like, regardless of how it gets built?

Rule of thumb: if a stakeholder ask fits neatly into a Jira ticket, I haven't translated it yet.

Back to the conference

I get why the stakeholders were on that stage. They've been burned by founders who ignored them. They want to be heard.

But "heard" is not the same as "obeyed." And founders who treat customer feedback as a spec instead of a bug report end up building slightly nicer versions of the thing that already isn't working.

Listen obsessively.
Obey selectively.
And be willing to tell the room that the button they're asking for isn't the thing they actually need.

That's not arrogance. That's the job.

What's a piece of stakeholder feedback you took literally and regretted? Or one you translated into something better and it worked?

🙏🏻🙏🏻🙏🏻🙏🏻💪🏻💪🏻💪🏻💪🏻

Jonathan Murray — Tue, 21 Apr 2026 18:47:22 +0000

Ranjan Dailata

Apr 21

Building Conversational Intelligence with Backboard: Turning Conversations into a Living Intelligence System

#ai #machinelearning #nlp #productivity

Comments

4 min read

Two Days, Two Hacks: The Lovable Disclosure and the Pattern Nobody Wants to Talk About

Jonathan Murray — Mon, 20 Apr 2026 18:14:55 +0000

Yesterday I wrote about the Vercel incident and walked through why you need to rotate your "non-sensitive" environment variables today. I thought that would be the week's security post.

Then I woke up to @weezerOSINT's disclosure about Lovable, and now I am starting to wonder if someone out there is just running an end-to-end test on the mythos of the modern AI-dev stack.

Two days. Two incidents. Totally different root causes. Same uncomfortable conclusion.

What dropped

The short version: security researcher @weezerOSINT made a free Lovable account and was able to read other users' source code, database credentials, AI chat histories, and customer data. Any free account. Every project created before November 2025.

The screenshot making the rounds shows a response from api.lovable.dev/GetProjectMessagesOutputBody.json with another user's prompts, AI reasoning traces, task lists, and project IDs sitting there in plain JSON. The bug is Broken Object Level Authorization on Lovable's own platform API, not the more familiar "the generated app shipped without Supabase RLS" story we got in February.

The part that actually made me set my coffee down: the report was filed through Lovable's bug bounty program 48 days ago, marked as a duplicate of an earlier informative report, and left open. At the time of the disclosure it reportedly still worked.

Forty. Eight. Days.

Why this one hits different

The February Lovable wave was a story about generated apps. The takeaway was "audit the output" — a thing developers already know how to do, at least in principle. You could imagine a fix: better defaults, RLS on by default in the scaffolds, a linter that yells at you when a table is public.

This one is a story about the platform itself. The thing you trusted to hold your code, your keys, your customer data — the control plane, not the output — had a missing auth check on a production API endpoint for at least seven weeks after someone told them about it.

Stack this next to the Vercel situation and a pattern starts to emerge. In the Vercel case, the breach came through a third-party AI tool that had been granted a Workspace OAuth scope that went further than anyone audited. In the Lovable case, it is the platform's own API failing to check "is this caller allowed to see this object." Different failure modes, same underlying theme: the trust boundaries in the AI-assisted-dev stack are drawn with marker, and the marker is washing off in the rain.

The vibe-coding angle

Here is the thing that will keep me up tonight. When you vibe-code an app, you do not type process.env.STRIPE_KEY into a .env file and move on. You paste the key into the chat so the AI can wire it up. You paste the database URL into the chat to fix a schema bug. You paste a sample customer record into the chat to get the types right.

Every one of those messages lives in the project's chat history. The disclosed endpoint returned chat histories. So it is not just "your generated app is exposed" — it is "every secret you ever mentioned in a conversation with Lovable is sitting in a JSON response that any free account could fetch."

If you have built on Lovable, go read your own chat history right now, with the eyes of an attacker. Search for sk-, postgres://, Bearer, anything that looks like a secret. Every match is a key to rotate at the source. Not rename. Rotate. Revoke at the provider and reissue.

What I actually think is going on

I do not think someone is literally targeting the AI-dev ecosystem on a two-day schedule for dramatic effect. What I think is happening is that this category of tools grew very fast, shipped a lot of features, pointed their best engineers at the next feature rather than the last one, and is now discovering that "trust boundaries" is a feature that does not show up in a demo.

The vibe-coding productivity is real. I still use these tools. I will still use them next week. But I am going to stop pretending that a platform saying "secure by default" counts for anything until I see a disclosure track record that backs it up. Forty-eight days on a report with the title "Broken Object Level Authorization on Lovable API leads to unauthorized access to user data and project source code" is, to use a technical term, a lot.

If you are shipping on Lovable right now

Short version, because I already wrote the long version yesterday for Vercel and the shape is the same:

Rotate anything a Lovable project ever touched. Revoke at the upstream provider, not just in the Lovable dashboard. Audit your chat histories for pasted secrets. Turn on RLS on every Supabase table while you are in there. If personal data was exposed, talk to a lawyer today about your disclosure obligations, because "we used an AI app builder" is not going to hold up in front of a regulator.

Two days. Two hacks. Maybe it is the start of a trend, maybe it is the week from hell, maybe someone really is testing the mythos. Either way, rotate your keys and get back to building.

Source for the disclosure: @weezerOSINT on X. If you have audited a Lovable project in the last day and found something worth sharing, the comments are open.

My co-founder is just being honest in this post. ;)

Jonathan Murray — Mon, 20 Apr 2026 18:09:14 +0000

Robert Imbeault

Apr 20

I Think Therefore I Am… A Big Pain in the A$$

#ai #llm #developers #reasoning

Comments 1

3 min read

Anyone working on applications for Recursive Language Models? #discussion #rlm

Jonathan Murray — Mon, 20 Apr 2026 16:34:38 +0000