<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ravi Patel</title>
    <description>The latest articles on Forem by Ravi Patel (@ravi_patel_99).</description>
    <link>https://forem.com/ravi_patel_99</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864188%2Fbfe4a748-be1f-4248-821b-95213c67a5ae.jpg</url>
      <title>Forem: Ravi Patel</title>
      <link>https://forem.com/ravi_patel_99</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ravi_patel_99"/>
    <language>en</language>
    <item>
      <title>There Is No Best AI Model in 2026 — And That's Actually Good News</title>
      <dc:creator>Ravi Patel</dc:creator>
      <pubDate>Thu, 09 Apr 2026 03:36:45 +0000</pubDate>
      <link>https://forem.com/ravi_patel_99/there-is-no-best-ai-model-in-2026-and-thats-actually-good-news-4lmn</link>
      <guid>https://forem.com/ravi_patel_99/there-is-no-best-ai-model-in-2026-and-thats-actually-good-news-4lmn</guid>
      <description>&lt;p&gt;GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro all dropped within weeks. Each is best at something different. Here's why that changes how you should build with AI.&lt;/p&gt;

&lt;p&gt;The last six weeks produced one of the densest model release windows in AI history. OpenAI shipped GPT-5.4 with native computer use and a 1M context window. Anthropic shipped Claude Opus 4.6 with the strongest expert task performance scores anyone has measured. Google shipped Gemini 3.1 Pro at $2 per million input tokens, undercutting both. DeepSeek dropped V4 with 1 trillion parameters at less than a tenth the price of frontier models. Mistral, MiniMax, and Alibaba all released models that beat last year's flagships.&lt;/p&gt;

&lt;p&gt;If you're a developer trying to pick "the best model" right now, you've probably noticed something strange. Every comparison article picks a different winner. Every benchmark tells a different story. Every Twitter thread argues for a different model.&lt;/p&gt;

&lt;p&gt;That's because there is no best model. And after building an AI proxy that routes across all three major providers, I've come to think that's actually the better outcome.&lt;/p&gt;

&lt;h2&gt;The current landscape — who wins what&lt;/h2&gt;

&lt;p&gt;Let me walk through the actual numbers, because the marketing pages bury them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.4&lt;/strong&gt; leads on knowledge work and computer use. Its GDPval score of 83% matches industry professionals across 44 different occupations, and its 75% on OSWorld makes it the first model to surpass human performance on desktop task benchmarks. If you're building agents that need to navigate operating systems, browsers, and terminal interfaces, GPT-5.4 is the one. Pricing: $2.50 per million input, $20 per million output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.6&lt;/strong&gt; leads on coding and expert-level reasoning. It scores 80.8% on SWE-bench Verified and 81.4% with prompt modification. Its GDPval-AA Elo benchmark score of 1,633 points is 316 points ahead of Gemini 3.1 Pro, indicating that human evaluators consistently prefer Claude's outputs for expert tasks. It also has 128K max output, which means it can generate entire multi-file patches without truncation. Pricing: $5 per million input, $25 per million output. Above 200K context, the price doubles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; is the price-performance king. It scores 80.6% on SWE-bench (within 0.2% of Opus), 94.3% on GPQA Diamond (the highest of any frontier model), and 77.1% on ARC-AGI-2. Context window is 1M tokens standard, 2M in some configurations. Pricing: $2 per million input, $12 per million output. That's 2.5x cheaper than Opus on input and roughly half the price on output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt; is the quiet workhorse. 79.6% SWE-bench, $3 input, $15 output. Within 1 point of Opus on most coding tasks at 60% of the price. Most production apps should probably default to Sonnet, not Opus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Haiku 4.5&lt;/strong&gt; at $1 input, $5 output. Half the price of Sonnet. Handles classification, extraction, summarization, and routine generation tasks at quality that would have been considered frontier 18 months ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini Flash&lt;/strong&gt; at $0.50 input, $3 output. Cheap enough that you can run high-volume workloads almost for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek V4&lt;/strong&gt; at $0.28 input, $1.10 output. Open-weight, frontier-class performance on many benchmarks, roughly 27x cheaper than the closed flagships.&lt;/p&gt;

&lt;h2&gt;The pattern that matters&lt;/h2&gt;

&lt;p&gt;Notice something? Seven different models, each best at something different. None of them is best at everything. The price spread between them is more than 90x for similar quality on appropriate tasks.&lt;/p&gt;

&lt;p&gt;Five years ago there was one model that mattered for production work. Three years ago there were maybe three. Today there are easily ten frontier-class models, each with distinct strengths.&lt;/p&gt;

&lt;p&gt;The decision isn't "which model do I pick" anymore. It's "how do I match each task to the right model."&lt;/p&gt;

&lt;h2&gt;The two ways developers respond to this&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Pick one and call it done.&lt;/strong&gt; Most developers do this. They sign up for OpenAI, integrate GPT-4o or GPT-5.4, and never look back. It's simpler. There's only one billing dashboard, one SDK, one set of failure modes. The cost is significant overpayment on simple tasks and underperformance on complex ones, where a stronger model would have given a better result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Multi-model routing.&lt;/strong&gt; Use the right model for each job. Simple classifications go to Haiku or Flash. Coding tasks go to Sonnet or Opus. Reasoning-heavy work goes to Opus or Gemini Pro. Computer-use agents go to GPT-5.4. The cost savings are 30-70% on most workloads. The quality on hard tasks goes up because you're using the right tool. But the engineering overhead is significant — three API keys, three SDKs, three sets of error handling, three billing dashboards.&lt;/p&gt;
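&lt;p&gt;As a sketch, the Option 2 routing table is nothing more than a lookup from task type to model. The model names below are placeholders for illustration, not exact API identifiers:&lt;/p&gt;

```python
# Illustrative task-to-model routing table; substitute whatever model
# identifiers your providers currently publish.
ROUTES = {
    "simple": "gemini-flash",        # classification, extraction, summaries
    "code": "claude-sonnet",         # multi-file edits, refactors
    "reasoning": "claude-opus",      # expert analysis, multi-step logic
    "computer_use": "gpt-5.4",       # OS / browser / terminal agents
}

def pick_model(task_type):
    # When in doubt, fall back to the most capable model rather than
    # risk sending a hard task to a cheap one.
    return ROUTES.get(task_type, "claude-opus")
```

&lt;p&gt;The whole difficulty of Option 2 is everything around this table: the classifier that produces the task type, and the three SDKs behind the model names.&lt;/p&gt;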

&lt;p&gt;This is a real tradeoff. Most teams pick Option 1 because Option 2 is too much work for too little immediate payoff. You can save 40% on your AI bill, but you spend two weeks building the infrastructure to do it.&lt;/p&gt;

&lt;h2&gt;Why proxies exist&lt;/h2&gt;

&lt;p&gt;This is exactly the problem proxies solve. A proxy sits between your application and the providers. You make one type of request to one endpoint with one API key. The proxy handles the routing, the multiple SDKs, the failover, the cost tracking. Your code stays simple. You get the multi-model benefits without the multi-model overhead.&lt;/p&gt;

&lt;p&gt;The proxies that exist today fall into two camps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass-through routers&lt;/strong&gt; like OpenRouter let you specify a model name in your request and they forward it to the right provider. Useful for accessing many models through one billing relationship, but you still have to pick the model yourself. The intelligence is on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent routers&lt;/strong&gt; classify your query and pick the model for you. This is what I built with Prism. You pick a mode (eco, balanced, or sport) and Prism's classifier decides which model handles each query. Simple tasks go cheap. Complex tasks go capable. Quality floor enforced — eco mode never sends complex reasoning to Flash.&lt;/p&gt;

&lt;p&gt;Both approaches are valid. The pass-through routers are great if you already know exactly which model you want for which task and you just want unified billing. The intelligent routers are better if you want the routing decisions made for you.&lt;/p&gt;

&lt;h2&gt;What the model proliferation actually means&lt;/h2&gt;

&lt;p&gt;The model release pace has compressed from quarterly to monthly. OpenAI confirmed monthly GPT-5 series releases. Anthropic, Google, and the open-source labs are matching that cadence. By the end of 2026 we'll likely have 15-20 frontier-class models, each with distinct strengths.&lt;/p&gt;

&lt;p&gt;This means three things for developers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Vendor lock-in is increasingly expensive.&lt;/strong&gt; If you hardcoded GPT-4o into your app two months ago, you're already on a deprecated model. The next version is better and cheaper, but switching means code changes, prompt rewrites, and regression testing. Building against an abstraction layer (OpenAI-compatible API or a proxy) means swapping models becomes a config change instead of a migration project.&lt;/p&gt;
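&lt;p&gt;The abstraction-layer point is small enough to show. A minimal sketch, assuming the model id lives in an environment variable rather than in code:&lt;/p&gt;

```python
import os

# Hypothetical sketch: the model id comes from configuration, so swapping
# to next month's model is a deployment setting, not a code migration.
MODEL = os.environ.get("APP_MODEL", "gpt-5.4")

def completion_params(prompt):
    # Everything else in the request stays model-agnostic.
    return {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
```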

&lt;p&gt;&lt;strong&gt;2. Continuous evaluation matters more than picking right once.&lt;/strong&gt; No matter which model you choose today, a better one will exist in 6 weeks. The right strategy is to build the ability to swap and re-evaluate easily, not to pick the perfect model upfront.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Routing infrastructure is now table stakes.&lt;/strong&gt; What used to be a "nice to have" optimization is becoming standard practice. The teams winning on AI economics are the ones who've automated model selection. The teams losing are the ones still hardcoding flagship models for every request.&lt;/p&gt;

&lt;h2&gt;The simple version&lt;/h2&gt;

&lt;p&gt;If you remember nothing else from this post, remember this: the AI model landscape in 2026 is no longer a "pick the best one" problem. It's a "match each task to the right model" problem.&lt;/p&gt;

&lt;p&gt;The savings from doing this correctly are 30-70%. The quality improvements on hard tasks are also significant. The engineering cost is the only thing standing in the way, and that's exactly what proxies and routers solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop picking. Start routing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://prism.ssimplifi.com" rel="noopener noreferrer"&gt;Prism&lt;/a&gt; because I wanted intelligent routing without building it myself. It's an OpenAI-compatible proxy that classifies your queries and routes them to the optimal model across Anthropic, OpenAI, and Google. Free tier available. &lt;a href="https://prism.ssimplifi.com/signup" rel="noopener noreferrer"&gt;Get an API key&lt;/a&gt; or &lt;a href="https://prism.ssimplifi.com/docs" rel="noopener noreferrer"&gt;read the docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How I Cut My AI API Costs by 40% Without Changing a Single Prompt</title>
      <dc:creator>Ravi Patel</dc:creator>
      <pubDate>Tue, 07 Apr 2026 09:50:20 +0000</pubDate>
      <link>https://forem.com/ravi_patel_99/how-i-cut-my-ai-api-costs-by-40-without-changing-a-single-prompt-1h4f</link>
      <guid>https://forem.com/ravi_patel_99/how-i-cut-my-ai-api-costs-by-40-without-changing-a-single-prompt-1h4f</guid>
      <description>&lt;p&gt;Most developers overpay for AI by sending every query to the same model. Here's how intelligent routing across Anthropic, OpenAI, and Google saved me 40% on API costs.&lt;br&gt;
I was spending about $200 a month on Anthropic API calls. Building a product that makes a lot of AI requests — some complex, some dead simple. Every single call went to Claude Sonnet because it was "good enough" and I didn't want to deal with multiple providers.&lt;/p&gt;

&lt;p&gt;Then I sat down and actually looked at what those calls were doing.&lt;/p&gt;

&lt;p&gt;About 60-70% of my API calls were simple tasks. Summarise this paragraph. Extract the name from this email. Classify this support ticket. Translate this sentence. These don't need Sonnet. A model like Gemini Flash or Claude Haiku handles them perfectly at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;The other 30-40% were genuinely complex — analysing financial data, generating reports, multi-step reasoning. Those needed a capable model. But I was paying Sonnet prices for the simple stuff just because I couldn't be bothered to manage multiple providers and figure out which model to use for each call.&lt;/p&gt;

&lt;p&gt;So I fixed it. And the fix turned into a product.&lt;/p&gt;

&lt;h2&gt;The actual cost difference between models&lt;/h2&gt;

&lt;p&gt;Here's what caught my attention. These are real per-million-token costs as of early 2026:&lt;/p&gt;

&lt;p&gt;Gemini 2.5 Flash costs about $0.15 per million input tokens and $0.60 per million output. Claude Haiku is $0.80 input and $4.00 output. GPT-4o-mini is $0.15 input and $0.60 output.&lt;/p&gt;

&lt;p&gt;Compare that to the "default" models most developers use: Claude Sonnet at $3.00 input and $15.00 output. GPT-4o at $2.50 input and $10.00 output.&lt;/p&gt;

&lt;p&gt;For a simple summarisation task, you're paying 10-20x more than you need to by using Sonnet instead of Flash. And the output quality on straightforward tasks is nearly identical.&lt;/p&gt;

&lt;p&gt;The maths is simple. If 65% of your calls can go to cheap models and 35% need the mid-tier ones, your blended cost drops from roughly $8 per million tokens to about $3-4 per million. That's a 50% reduction before you've changed a single prompt.&lt;/p&gt;
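&lt;p&gt;Spelled out, with the blended per-tier prices assumed to match the figures above:&lt;/p&gt;

```python
# Back-of-the-envelope version of the blended-cost claim. The blended
# per-million prices are assumptions: about $0.40 for the cheap tier,
# about $8 for the mid tier.
cheap_share, mid_share = 0.65, 0.35
cheap_price, mid_price = 0.40, 8.00   # USD per million tokens, blended in/out

blended = cheap_share * cheap_price + mid_share * mid_price
print(round(blended, 2))   # prints 3.06: roughly $3 instead of $8
```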

&lt;h2&gt;Why developers don't do this already&lt;/h2&gt;

&lt;p&gt;Because it's genuinely painful to set up. You need to sign up for Anthropic, OpenAI, and Google. Manage three API keys. Learn three different request formats (they're similar but not identical). Build routing logic. Handle failures when one provider goes down. Track costs across three billing dashboards.&lt;/p&gt;

&lt;p&gt;Nobody does this for a side project. Most companies don't do it either — they pick one provider and accept the overspend because the engineering cost of multi-provider routing isn't worth it.&lt;/p&gt;

&lt;h2&gt;What intelligent routing actually looks like&lt;/h2&gt;

&lt;p&gt;The approach I built classifies each incoming query before routing it. The classifier looks at signals in the prompt: how long is it, does it contain code, does it ask for analysis or reasoning, is there a system prompt, what's the conversation depth.&lt;/p&gt;

&lt;p&gt;Based on those signals, each query gets tagged as one of four types: simple, code, reasoning, or complex. Then the routing table maps that type to a model based on your cost-quality preference.&lt;/p&gt;
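&lt;p&gt;A toy version of that classifier, with an invented signal list and thresholds purely for illustration (a production classifier would be tuned against labelled traffic):&lt;/p&gt;

```python
def classify(prompt, system_prompt="", depth=1):
    # Signals: code markers, reasoning vocabulary, prompt length,
    # conversation depth, presence of a system prompt.
    has_code = "def " in prompt or "class " in prompt or "{" in prompt
    reasoning_words = ("analyse", "analyze", "explain", "compare", "why")
    asks_reasoning = any(w in prompt.lower() for w in reasoning_words)
    long_or_deep = len(prompt) >= 800 or depth >= 5 or bool(system_prompt)
    if has_code:
        return "code"
    if asks_reasoning and long_or_deep:
        return "complex"
    if asks_reasoning:
        return "reasoning"
    return "simple"
```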

&lt;p&gt;If you want aggressive cost savings, simple tasks go to Gemini Flash, code goes to Haiku, reasoning goes to Haiku, and only truly complex multi-step queries go to Sonnet. That's the cheapest option that still maintains a quality floor.&lt;/p&gt;

&lt;p&gt;If you want the best answer regardless of cost, everything goes to the most capable model available. Simple or not.&lt;/p&gt;

&lt;p&gt;The interesting case is the middle ground — you want good answers at a reasonable price. Simple tasks go cheap, but anything with substance goes to a capable model. This is where most production apps should operate.&lt;/p&gt;

&lt;h2&gt;The quality floor matters more than the routing&lt;/h2&gt;

&lt;p&gt;Here's the thing I learned building this: the routing algorithm is less important than the quality floor. If your cost optimisation ever sends a complex reasoning task to a model that can't handle it, the developer loses trust immediately. One bad answer and they switch back to hardcoding Sonnet.&lt;/p&gt;

&lt;p&gt;So the classifier has to be conservative. When in doubt, route to a more capable model. It's better to overpay slightly on an edge case than to return a garbage response. The savings come from volume — getting the easy 60-70% of calls right, not from being aggressive on the hard 30%.&lt;/p&gt;

&lt;h2&gt;Session memory: the other cost nobody talks about&lt;/h2&gt;

&lt;p&gt;While building the router, I noticed another source of waste. Every AI API is stateless. If you're building a chatbot or any multi-turn interaction, you resend the entire conversation history on every single call.&lt;/p&gt;

&lt;p&gt;Message 1: you send 1 message. Message 5: you send all 5 messages. Message 20: you send all 20. You're paying input token costs on the same messages over and over.&lt;/p&gt;

&lt;p&gt;The fix was adding a memory layer at the proxy level. The developer sends a session identifier with their request. The proxy stores the conversation history and assembles it before forwarding to the provider. The developer sends one new message each time. The proxy handles the rest.&lt;/p&gt;

&lt;p&gt;This doesn't reduce the provider cost — the full history still gets sent. But it eliminates the need for the developer to build conversation management infrastructure. No Redis store, no message assembly logic, no context window overflow handling. One header and it works.&lt;/p&gt;
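&lt;p&gt;Conceptually the memory layer reduces to a few lines. A minimal sketch, assuming an in-memory store (a real proxy would persist sessions and enforce context limits):&lt;/p&gt;

```python
# Minimal sketch of a proxy-side memory layer: the client sends a session
# id plus one new message; the proxy keeps the history and assembles the
# full message list before forwarding to the provider.
_sessions = {}

def assemble(session_id, new_message):
    history = _sessions.setdefault(session_id, [])
    history.append(new_message)
    return list(history)   # the full history still goes to the provider
```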

&lt;p&gt;The more interesting cost benefit comes later with compression. When conversations get long, the proxy can summarise older messages using a cheap model before forwarding. The developer doesn't manage this. The summary is transparent. And the input token count on long conversations drops significantly.&lt;/p&gt;
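&lt;p&gt;The compression step can be sketched the same way. Here summarise stands in for a call to a cheap model, and the keep-last-six cutoff is illustrative:&lt;/p&gt;

```python
def compress(history, summarise, keep_last=6):
    # Fold all but the last keep_last turns into one summary message.
    # summarise is a stand-in for a call to a cheap model such as Flash
    # or Haiku; any real cutoff would be tuned, not hardcoded.
    if len(history) > keep_last:
        older, recent = history[:-keep_last], history[-keep_last:]
        summary = {"role": "system", "content": summarise(older)}
        return [summary] + recent
    return history
```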

&lt;h2&gt;Real numbers from production&lt;/h2&gt;

&lt;p&gt;I ran a test with the same prompt across different routing strategies.&lt;/p&gt;

&lt;p&gt;A reasonably complex prompt about Indian tax law sent directly to Claude Sonnet: 955 input tokens, 1115 output tokens, cost roughly $0.018.&lt;/p&gt;

&lt;p&gt;The same prompt through intelligent routing in balanced mode: same quality response, same token counts, but routed to the optimal model. Cost was comparable in this case because the classifier correctly identified it as a reasoning task.&lt;/p&gt;

&lt;p&gt;Where the savings show up is in aggregate. Across a mix of simple and complex calls — the kind any production app generates — balanced mode saves 30-50% compared to hardcoding a single mid-tier model. Aggressive cost mode saves 50-70% on workloads that are mostly simple tasks.&lt;/p&gt;

&lt;h2&gt;The five-minute integration&lt;/h2&gt;

&lt;p&gt;The whole point is that none of this should require work from the developer. The proxy accepts OpenAI-compatible requests. Switching means changing one URL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before: base_url = "https://api.openai.com/v1"&lt;/li&gt;
&lt;li&gt;After: base_url = "https://api.prism.ssimplifi.com/v1"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Existing prompts work unchanged. Existing response parsing works unchanged. The developer adds one header to choose their cost-quality preference, and optionally a session header for memory.&lt;/p&gt;
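&lt;p&gt;For the curious, here's what the switch looks like using only the Python standard library. The X-Prism-Mode and X-Prism-Session header names are my illustrative assumptions; check the docs for the real ones:&lt;/p&gt;

```python
import json
import urllib.request

# Build (but don't send) an OpenAI-compatible request pointed at the proxy.
payload = {
    "messages": [{"role": "user", "content": "Classify this support ticket"}],
}
req = urllib.request.Request(
    "https://api.prism.ssimplifi.com/v1/chat/completions",  # was api.openai.com/v1
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_PRISM_KEY",
        "Content-Type": "application/json",
        "X-Prism-Mode": "balanced",       # cost-quality preference (assumed name)
        "X-Prism-Session": "ticket-42",   # optional session memory (assumed name)
    },
)
# urllib.request.urlopen(req) would send it; the rest of the client code
# stays exactly as it was.
```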

&lt;p&gt;The proxy handles model selection, provider failover, session memory, cost tracking, and response normalisation. The developer's code stays exactly as it was.&lt;/p&gt;

&lt;h2&gt;What I'd tell myself six months ago&lt;/h2&gt;

&lt;p&gt;Stop overpaying for simple tasks. The model landscape has cheap, capable options for straightforward work. Reserve the expensive models for queries that actually need them. And if the engineering overhead of managing multiple providers is what's stopping you, use a proxy that handles it.&lt;/p&gt;

&lt;p&gt;The AI API cost problem isn't about any single model being too expensive. It's about using the same model for everything when the tasks have wildly different complexity levels. Fix the routing and the savings follow.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://prism.ssimplifi.com/" rel="noopener noreferrer"&gt;Prism&lt;/a&gt; to solve this for myself and now it's available for any developer. Free tier available, no credit card required. Read the docs or get an API key.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
