<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kael Tiwari</title>
    <description>The latest articles on Forem by Kael Tiwari (@kaeltiwari).</description>
    <link>https://forem.com/kaeltiwari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3781118%2F0c4469cb-e26a-4d68-accb-b6f70868a762.png</url>
      <title>Forem: Kael Tiwari</title>
      <link>https://forem.com/kaeltiwari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kaeltiwari"/>
    <language>en</language>
    <item>
      <title>GPU Economics: What Inference Actually Costs in 2026</title>
      <dc:creator>Kael Tiwari</dc:creator>
      <pubDate>Wed, 25 Feb 2026 03:35:12 +0000</pubDate>
      <link>https://forem.com/kaeltiwari/gpu-economics-what-inference-actually-costs-in-2026-2goo</link>
      <guid>https://forem.com/kaeltiwari/gpu-economics-what-inference-actually-costs-in-2026-2goo</guid>
      <description>&lt;p&gt;The question every AI team eventually asks: should we rent GPUs and run models ourselves, or just pay per token through an API?&lt;/p&gt;

&lt;p&gt;The answer changed a lot in the last six months. GPU rental prices dropped. API prices dropped faster. New GPU generations shipped. And mixture-of-experts models made the whole calculation messier than it used to be.&lt;/p&gt;

&lt;p&gt;Here's the actual math, with real numbers from real providers.&lt;/p&gt;




&lt;h2&gt;GPU rental prices right now&lt;/h2&gt;

&lt;p&gt;These are on-demand, publicly listed prices as of February 2026. No negotiated enterprise deals, no reserved instances.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;$/hour&lt;/th&gt;
&lt;th&gt;VRAM (GB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA B200&lt;/td&gt;
&lt;td&gt;&lt;a href="https://coreweave.com/pricing" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;8x GPU&lt;/td&gt;
&lt;td&gt;$68.80&lt;/td&gt;
&lt;td&gt;180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA GB200 NVL72&lt;/td&gt;
&lt;td&gt;&lt;a href="https://coreweave.com/pricing" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;4-GPU slice&lt;/td&gt;
&lt;td&gt;$42.00&lt;/td&gt;
&lt;td&gt;186&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA HGX H200&lt;/td&gt;
&lt;td&gt;&lt;a href="https://coreweave.com/pricing" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;8x GPU&lt;/td&gt;
&lt;td&gt;$50.44&lt;/td&gt;
&lt;td&gt;141&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA HGX H100&lt;/td&gt;
&lt;td&gt;&lt;a href="https://coreweave.com/pricing" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;8x GPU&lt;/td&gt;
&lt;td&gt;$49.24&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA GH200&lt;/td&gt;
&lt;td&gt;&lt;a href="https://coreweave.com/pricing" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1x GPU&lt;/td&gt;
&lt;td&gt;$6.50&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA A100 80GB&lt;/td&gt;
&lt;td&gt;&lt;a href="https://coreweave.com/pricing" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;8x GPU&lt;/td&gt;
&lt;td&gt;$21.60&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA L40S&lt;/td&gt;
&lt;td&gt;&lt;a href="https://coreweave.com/pricing" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;8x GPU&lt;/td&gt;
&lt;td&gt;$18.00&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA RTX PRO 6000&lt;/td&gt;
&lt;td&gt;&lt;a href="https://coreweave.com/pricing" rel="noopener noreferrer"&gt;CoreWeave&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;8x GPU&lt;/td&gt;
&lt;td&gt;$20.00&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things stand out. The B200 costs 40% more than the H100 per hour, but delivers roughly 2.5x the inference throughput for large models according to &lt;a href="https://www.nvidia.com/en-us/data-center/b200/" rel="noopener noreferrer"&gt;NVIDIA's own benchmarks&lt;/a&gt;. The H200 is barely more expensive than the H100 despite having 76% more VRAM. And the A100 — which was the default choice 18 months ago — is now less than half the price of current gen.&lt;/p&gt;

&lt;p&gt;CoreWeave sets the benchmark for GPU cloud pricing. They're seeking an &lt;a href="https://www.techmeme.com/260224/p44#a260224p44" rel="noopener noreferrer"&gt;$8.5B loan backed by a Meta contract&lt;/a&gt; worth up to $14.2B, which tells you the scale of demand here.&lt;/p&gt;

&lt;h2&gt;What does it cost to serve a model yourself?&lt;/h2&gt;

&lt;p&gt;Let's do the math on running Llama 3.1 405B — a model big enough to compete with GPT-5 mini on most benchmarks, and the most common choice for self-hosted production deployments.&lt;/p&gt;

&lt;p&gt;Hardware requirement: 405B parameters at FP8 precision need roughly 405GB of VRAM. That's a minimum of 6x H100 80GB GPUs, or more practically an 8-GPU H100 node.&lt;/p&gt;
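&lt;p&gt;A quick sketch of that sizing rule (weights only; real deployments also need headroom for KV cache and activations, so treat the result as a floor):&lt;/p&gt;

```python
# Weight memory alone: parameter count times bytes per parameter.
# Ignores KV cache and activation memory, so this is a lower bound.
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param

print(weight_vram_gb(405, 1.0))  # FP8: 405.0 GB, i.e. at least 6x 80GB H100s
print(weight_vram_gb(405, 2.0))  # FP16 would need 810.0 GB
```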

&lt;p&gt;Hourly cost on CoreWeave: $49.24/hr for 8x H100.&lt;/p&gt;

&lt;p&gt;Throughput: with vLLM and continuous batching, expect roughly 2,000-3,000 output tokens per second on an 8x H100 setup running Llama 405B at FP8. Call it 2,500 tok/s as a conservative estimate based on &lt;a href="https://blog.vllm.ai/2024/09/05/perf-update.html" rel="noopener noreferrer"&gt;vLLM benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cost per million output tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2,500 tokens/second = 9,000,000 tokens/hour&lt;/li&gt;
&lt;li&gt;$49.24 / 9M tokens = $5.47 per million output tokens&lt;/li&gt;
&lt;/ul&gt;
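&lt;p&gt;The same arithmetic as a reusable sketch (it assumes the node runs flat out with zero idle time, which is the best case):&lt;/p&gt;

```python
# Self-hosted cost per million output tokens at full, continuous utilization.
def cost_per_million_output(hourly_rate_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour_millions = tokens_per_second * 3600 / 1_000_000
    return hourly_rate_usd / tokens_per_hour_millions

print(round(cost_per_million_output(49.24, 2500), 2))    # 8x H100, Llama 405B: 5.47
print(round(cost_per_million_output(2.25, 10_000), 4))   # single L40S, 8B model: 0.0625
```

&lt;p&gt;The same function covers the small-model case later in this post: a single L40S at $2.25/hr pushing 10,000 tok/s lands at about $0.06/M.&lt;/p&gt;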

&lt;p&gt;Compare that to API pricing for models in this class:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Input $/M&lt;/th&gt;
&lt;th&gt;Output $/M&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 405B&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together AI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$3.50&lt;/td&gt;
&lt;td&gt;$3.50&lt;/td&gt;
&lt;td&gt;Serverless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-0528&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together AI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$7.00&lt;/td&gt;
&lt;td&gt;Serverless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 mini&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$1.75&lt;/td&gt;
&lt;td&gt;$14.00&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Pro&lt;/td&gt;
&lt;td&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/pricing" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$12.00&lt;/td&gt;
&lt;td&gt;Vertex AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash&lt;/td&gt;
&lt;td&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/pricing" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;Vertex AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-397B-A17B&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together AI&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;$3.60&lt;/td&gt;
&lt;td&gt;Serverless&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Self-hosting Llama 405B at $5.47/M output tokens is more expensive than calling Together AI's API for the same model at $3.50/M. That's the efficiency of shared infrastructure at scale. Together AI batches requests from thousands of customers across the same GPUs. You're paying for idle time; they're not.&lt;/p&gt;

&lt;h2&gt;When self-hosting wins&lt;/h2&gt;

&lt;p&gt;The math flips in three scenarios.&lt;/p&gt;

&lt;p&gt;First, when you're running at near-100% capacity. If your inference demand is constant and maxes out the hardware — say, a consumer product doing millions of requests per day — your effective per-token cost drops because you're eliminating idle time. At 90%+ load, self-hosted Llama 405B drops to roughly $4.00/M output. Still not cheaper than Together AI's serverless rate, but cheaper than OpenAI's GPT-5.2 at $14.00/M.&lt;/p&gt;

&lt;p&gt;Second, data isolation. Some industries (healthcare, defense, finance) can't send prompts to third-party APIs. The premium you pay for self-hosting is really a compliance cost. CoreWeave and Lambda offer single-tenant nodes for this.&lt;/p&gt;

&lt;p&gt;Third, smaller models. A 7B or 8B model on a single L40S ($2.25/hr for one GPU) can push 10,000+ tokens/second. That works out to about $0.06/M output tokens — roughly matching the cheapest API options like Llama 3.2 3B at &lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;$0.06/M on Together AI&lt;/a&gt;. But if you're running a fine-tuned version of that model, the API option doesn't exist.&lt;/p&gt;

&lt;h2&gt;When APIs win&lt;/h2&gt;

&lt;p&gt;For most teams, most of the time. Here's why.&lt;/p&gt;

&lt;p&gt;Mixture-of-experts models destroyed the self-hosting value proposition. Qwen3.5-397B has 397B total parameters but activates only 17B per token. Together AI charges &lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;$0.60/M input and $3.60/M output&lt;/a&gt; for it. Running it yourself requires enough VRAM to hold all 397B parameters even though only 17B are active for any given token. You're paying for dead weight.&lt;/p&gt;

&lt;p&gt;The same applies to DeepSeek V3.1 ($0.60/$1.70 on Together AI), Llama 4 Maverick ($0.27/$0.85), and most new open models shipping with MoE architectures. API providers handle the memory overhead across a shared fleet. You'd handle it alone.&lt;/p&gt;

&lt;p&gt;Batch pricing cuts costs in half. OpenAI's Batch API gives you 50% off both input and output tokens in exchange for 24-hour turnaround. For non-realtime workloads — data processing, content generation, analysis pipelines — that brings GPT-5 mini down to $0.125/$1.00. No GPU rental comes close for a model of that quality.&lt;/p&gt;

&lt;p&gt;You don't need to hire anyone. Running inference infrastructure requires MLOps engineers. Kubernetes. Monitoring. Model updates. Quantization debugging. One senior ML infra engineer costs $200K+/year. That's equivalent to roughly 4,000 hours of an 8x H100 node at CoreWeave, or about 100 billion output tokens through GPT-5 mini's API.&lt;/p&gt;
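&lt;p&gt;Sanity-checking those equivalences (list prices from above; the salary figure is a round assumption):&lt;/p&gt;

```python
ENGINEER_COST = 200_000        # USD/year, senior ML infra (round assumption)
H100_NODE_RATE = 49.24         # USD/hr, 8x H100 node on CoreWeave
GPT5_MINI_OUTPUT = 2.00        # USD per million output tokens

node_hours = ENGINEER_COST / H100_NODE_RATE
output_tokens_billions = ENGINEER_COST / GPT5_MINI_OUTPUT * 1_000_000 / 1e9

print(f"{node_hours:,.0f} node-hours")                  # ~4,062
print(f"{output_tokens_billions:,.0f}B output tokens")  # 100B
```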

&lt;h2&gt;The Blackwell generation changes the math (slightly)&lt;/h2&gt;

&lt;p&gt;NVIDIA's B200 delivers roughly &lt;a href="https://www.nvidia.com/en-us/data-center/b200/" rel="noopener noreferrer"&gt;2.5x the inference throughput of an H100&lt;/a&gt; for FP8 workloads. At $68.80/hr for 8x B200 on CoreWeave versus $49.24 for 8x H100, you're paying 40% more for 2.5x the throughput. Per-token cost drops by about 44%.&lt;/p&gt;
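&lt;p&gt;The price-performance ratio in one step (the 2.5x figure is NVIDIA's own, so treat the result as an upper bound on the improvement):&lt;/p&gt;

```python
h100_rate, b200_rate = 49.24, 68.80   # $/hr, 8-GPU nodes on CoreWeave
speedup = 2.5                         # B200 vs H100 inference throughput (NVIDIA's number)

per_token_ratio = (b200_rate / h100_rate) / speedup
print(f"per-token cost drops {1 - per_token_ratio:.0%}")   # 44%
print(f"Llama 405B: ${5.47 * per_token_ratio:.2f}/M")      # $3.06/M, vs $5.47/M on H100s
```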

&lt;p&gt;That brings self-hosted Llama 405B on B200s down to roughly $3.10/M output tokens — finally competitive with Together AI's API rate. But B200 availability is still constrained. CoreWeave's GB200 NVL72 (the rack-scale option at $42.00/hr for a 4-GPU slice) adds even more memory bandwidth, and at 186GB of VRAM per GPU a 4-GPU slice totals roughly 744GB, enough to hold 405B-class models at FP8 with room to spare.&lt;/p&gt;

&lt;p&gt;For teams that can get B200 allocations and run at high capacity, self-hosting starts to make financial sense again. For everyone else, the API gap keeps widening.&lt;/p&gt;

&lt;h2&gt;The real cost nobody talks about&lt;/h2&gt;

&lt;p&gt;Electricity. A single 8x H100 node draws about 10.2 kW under load. At US commercial electricity rates ($0.12/kWh average from the &lt;a href="https://www.eia.gov/electricity/monthly/epm_table_5_6_a.html" rel="noopener noreferrer"&gt;EIA&lt;/a&gt;), that's $1.22/hr just for power — roughly 2.5% of the CoreWeave rental price. Not a big deal for cloud renters.&lt;/p&gt;
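&lt;p&gt;The power math, for the record (the 10.2 kW draw is an approximate full-node figure, not a measured one):&lt;/p&gt;

```python
power_kw = 10.2          # approximate draw of an 8x H100 node under load
rate_per_kwh = 0.12      # US commercial average (EIA)
rental_per_hr = 49.24    # CoreWeave 8x H100 on-demand

power_cost_per_hr = power_kw * rate_per_kwh
print(f"${power_cost_per_hr:.2f}/hr")                            # $1.22/hr
print(f"{power_cost_per_hr / rental_per_hr:.1%} of the rental")  # 2.5%
```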

&lt;p&gt;But if you're Meta building out &lt;a href="https://www.reddit.com/r/artificial/comments/1rdm17p/meta_strikes_up_to_100b_amd_chip_deal_as_it/" rel="noopener noreferrer"&gt;data centers that consume gigawatts&lt;/a&gt;, or CoreWeave financing $8.5B in infrastructure, power becomes the constraint that sets the floor on how cheap inference can get. Big Tech is projected to invest &lt;a href="https://www.reddit.com/r/artificial/comments/1rcmgzy/big_tech_to_invest_about_650_billion_in_ai_in/" rel="noopener noreferrer"&gt;$650B in AI infrastructure in 2026&lt;/a&gt;, and a meaningful chunk of that is electricity and cooling.&lt;/p&gt;

&lt;h2&gt;Bottom line&lt;/h2&gt;

&lt;p&gt;For teams processing fewer than 10B tokens per month, APIs are cheaper, simpler, and better maintained. GPT-5 mini at $0.25/$2.00 or Qwen3.5-397B at $0.60/$3.60 will outperform anything you self-host at the same cost.&lt;/p&gt;

&lt;p&gt;For teams above 10B tokens/month with consistent demand, self-hosting on B200s starts to pencil out — but only if you have the engineering team to run it and can tolerate the 3-6 month wait for hardware allocation.&lt;/p&gt;
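&lt;p&gt;To see where that threshold comes from, compare monthly bills at a given volume using the per-million rates above (output tokens only, which oversimplifies; input tokens shift the picture further in the API's favor):&lt;/p&gt;

```python
def monthly_cost_usd(output_tokens_billions: float, dollars_per_million: float) -> float:
    return output_tokens_billions * 1_000 * dollars_per_million

volume = 10  # billion output tokens per month
print(f"${monthly_cost_usd(volume, 3.50):,.0f}")  # Together AI serverless: $35,000
print(f"${monthly_cost_usd(volume, 5.47):,.0f}")  # self-hosted 8x H100:    $54,700
print(f"${monthly_cost_usd(volume, 3.06):,.0f}")  # self-hosted B200 est.:  $30,600
```

&lt;p&gt;At 10B output tokens a month, H100 self-hosting still loses to the serverless rate; the B200 estimate is the first configuration that undercuts it, which is why the threshold sits where it does.&lt;/p&gt;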

&lt;p&gt;The interesting middle ground is dedicated endpoints from providers like Together AI and Fireworks, where you rent reserved GPU capacity but the provider handles the stack. You get lower per-token costs than serverless without the ops overhead. That's where most serious production deployments end up.&lt;/p&gt;

&lt;p&gt;If you want to see how the API prices compare across all major providers, we maintain an updated table in our &lt;a href="https://dev.to/blog/llm-pricing-comparison-feb-2026"&gt;LLM pricing comparison&lt;/a&gt;. And for context on which models are worth running in the first place, see our &lt;a href="https://dev.to/blog/open-source-vs-proprietary-llms"&gt;open source vs proprietary LLM analysis&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We publish data-driven analysis on AI infrastructure, pricing, and adoption every week. &lt;a href="https://dev.to/#newsletter"&gt;Subscribe to get it in your inbox&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>inference</category>
      <category>pricing</category>
      <category>analysis</category>
    </item>
    <item>
      <title>AI coding assistant adoption by company size: who's actually using what</title>
      <dc:creator>Kael Tiwari</dc:creator>
      <pubDate>Fri, 20 Feb 2026 05:37:08 +0000</pubDate>
      <link>https://forem.com/kaeltiwari/ai-coding-assistant-adoption-by-company-size-whos-actually-using-what-3n9b</link>
      <guid>https://forem.com/kaeltiwari/ai-coding-assistant-adoption-by-company-size-whos-actually-using-what-3n9b</guid>
      <description>&lt;p&gt;Nearly every developer you know probably uses an AI coding assistant. DX's latest research — &lt;a href="https://getdx.com/research/measuring-ai-code-assistants-and-agents/" rel="noopener noreferrer"&gt;121,000 developers, 450+ companies&lt;/a&gt; — puts the monthly usage number at 92.6%. Sounds like a settled question. It isn't. A solo dev auto-completing functions in Cursor and an enterprise pushing Codex through six months of compliance review are living in different worlds. The story worth telling is in that gap.&lt;/p&gt;

&lt;h2&gt;The numbers everyone quotes (and what they miss)&lt;/h2&gt;

&lt;p&gt;Three data points, three years:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Survey&lt;/th&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Sample&lt;/th&gt;
&lt;th&gt;"Using AI tools now"&lt;/th&gt;
&lt;th&gt;"Using or plan to"&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://survey.stackoverflow.co/2024/ai" rel="noopener noreferrer"&gt;Stack Overflow Developer Survey&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;65,000+ devs&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://stackoverflow.blog/2023/06/12/developer-survey-sentiment-ai-ml/" rel="noopener noreferrer"&gt;Stack Overflow Developer Survey&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;td&gt;90,000+ devs&lt;/td&gt;
&lt;td&gt;44%&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/" rel="noopener noreferrer"&gt;DX / Laura Tacho research&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Q4 2025–Q1 2026&lt;/td&gt;
&lt;td&gt;121,000 devs&lt;/td&gt;
&lt;td&gt;92.6% (monthly)&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;44% to 92.6% in under three years. Nobody disputes the trend anymore. But these surveys flatten a variable that matters a lot: company size.&lt;/p&gt;

&lt;h2&gt;Small teams move fast, big teams move carefully&lt;/h2&gt;

&lt;p&gt;Under 50 engineers? AI coding tools show up overnight. No procurement. No security review. A founder enables Copilot and the team has it by lunch.&lt;/p&gt;

&lt;p&gt;Big companies are different. &lt;a href="https://getdx.com/research/measuring-ai-code-assistants-and-agents/" rel="noopener noreferrer"&gt;DX found&lt;/a&gt; that even the best-performing large organizations cap out around 60% &lt;em&gt;active&lt;/em&gt; usage — weekly, habitual use, not "opened it once in January." That 60% ceiling versus 92.6% monthly tells you everything about the enterprise adoption gap.&lt;/p&gt;

&lt;p&gt;Rough pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Company size&lt;/th&gt;
&lt;th&gt;Typical adoption rate&lt;/th&gt;
&lt;th&gt;Active weekly usage&lt;/th&gt;
&lt;th&gt;Primary blocker&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1–50 engineers&lt;/td&gt;
&lt;td&gt;&amp;gt;90%&lt;/td&gt;
&lt;td&gt;~75%&lt;/td&gt;
&lt;td&gt;Individual preference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;51–500 engineers&lt;/td&gt;
&lt;td&gt;~80%&lt;/td&gt;
&lt;td&gt;~55%&lt;/td&gt;
&lt;td&gt;Security review, budget&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500–5,000 engineers&lt;/td&gt;
&lt;td&gt;~70%&lt;/td&gt;
&lt;td&gt;~45%&lt;/td&gt;
&lt;td&gt;Compliance, SSO/audit requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000+ engineers&lt;/td&gt;
&lt;td&gt;~65%&lt;/td&gt;
&lt;td&gt;~35%&lt;/td&gt;
&lt;td&gt;Procurement, data residency, IP concerns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Synthesized from &lt;a href="https://getdx.com/research/measuring-ai-code-assistants-and-agents/" rel="noopener noreferrer"&gt;DX benchmarks&lt;/a&gt; (4M+ samples, hundreds of orgs), &lt;a href="https://survey.stackoverflow.co/2024/ai" rel="noopener noreferrer"&gt;Stack Overflow 2024&lt;/a&gt;, and &lt;a href="https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/" rel="noopener noreferrer"&gt;Pragmatic Summit keynote&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Big companies adopting slower — fine, obvious. The weird part is how many licenses go unused. A 500-person eng org buys Copilot for everyone. 45% open it in a given week. The rest? Expensive shelfware.&lt;/p&gt;
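&lt;p&gt;The shelfware math is worth writing down (a sketch: seat price assumes GitHub Copilot Business list pricing; the 45% weekly-active figure is from the table above):&lt;/p&gt;

```python
seats = 500
price_per_seat = 19.0    # $/user/month, GitHub Copilot Business list price (assumed)
weekly_active = 0.45     # share of seats opening the tool in a given week

monthly_spend = seats * price_per_seat
idle_spend = monthly_spend * (1 - weekly_active)
print(f"${monthly_spend:,.0f}/month total, ${idle_spend:,.0f}/month on inactive seats")
```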

&lt;h2&gt;The productivity plateau is real — and it hits different by size&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/" rel="noopener noreferrer"&gt;Laura Tacho's DX research&lt;/a&gt; found that productivity gains from AI coding tools have flatlined at about 10%. Developers save 3.6–4 hours a week — same number as Q2 2025. The needle stopped moving.&lt;/p&gt;

&lt;p&gt;Except that's an average. Averages lie.&lt;/p&gt;

&lt;p&gt;Small orgs tend to get more out of these tools. Simpler codebases. Faster CI. Developers who do everything. A full-stack dev at a 20-person startup scaffolds an API endpoint with Copilot and saves an hour — visible immediately.&lt;/p&gt;

&lt;p&gt;At a 5,000-person company, the same tool collides with slow CI pipelines, three rounds of code review, and legacy code that AI can't parse. &lt;a href="https://survey.stackoverflow.co/2024/ai" rel="noopener noreferrer"&gt;Stack Overflow's 2024 survey&lt;/a&gt; found 45% of professional developers rate AI tools as "bad or very bad at handling complex tasks." Complex tasks live at big companies.&lt;/p&gt;

&lt;p&gt;DX's numbers get wilder. Well-run orgs? 50% fewer customer-facing incidents with AI. Messy orgs? Incidents &lt;em&gt;doubled&lt;/em&gt;. Same tools, opposite outcomes. &lt;a href="https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/" rel="noopener noreferrer"&gt;Tacho's take&lt;/a&gt;: "AI tends to highlight existing flaws rather than fix them." Messy orgs tend to be bigger ones. Not a rule. But a pattern.&lt;/p&gt;

&lt;h2&gt;AI-authored code is climbing fast&lt;/h2&gt;

&lt;p&gt;One metric that cuts through the adoption noise: the share of production code written by AI. &lt;a href="https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/" rel="noopener noreferrer"&gt;DX's dataset of 4.2 million samples&lt;/a&gt;, collected between November 2025 and February 2026, shows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Trend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AI-authored code in production&lt;/td&gt;
&lt;td&gt;26.9%&lt;/td&gt;
&lt;td&gt;Up from 22% previous quarter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI-authored code (daily users)&lt;/td&gt;
&lt;td&gt;~33%&lt;/td&gt;
&lt;td&gt;Approaching one-third&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Onboarding time (time to 10th PR)&lt;/td&gt;
&lt;td&gt;Cut in half&lt;/td&gt;
&lt;td&gt;Steady decline since Q1 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That onboarding cut matters most for big companies. New hires at large orgs historically take months to ship anything in a sprawling codebase. Cut that ramp in half and the ROI math changes completely. It stops being about writing code faster. It becomes about making people useful sooner.&lt;/p&gt;

&lt;h2&gt;Which tools win at which scale&lt;/h2&gt;

&lt;p&gt;Tool choice maps pretty cleanly to org size:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Segment&lt;/th&gt;
&lt;th&gt;Dominant tools&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solo / small team&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://cursor.sh" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, Claude Code, Windsurf&lt;/td&gt;
&lt;td&gt;Best DX, no procurement needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-market (50–500)&lt;/td&gt;
&lt;td&gt;GitHub Copilot, Cursor Business&lt;/td&gt;
&lt;td&gt;Balance of features and admin controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise (500+)&lt;/td&gt;
&lt;td&gt;GitHub Copilot Enterprise, &lt;a href="https://openai.com/index/codex/" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;SSO, audit logs, IP indemnification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/" rel="noopener noreferrer"&gt;Codex&lt;/a&gt; deserves a special mention. The desktop app launched February 2 and hit one million downloads within weeks, growing 60% week-over-week. Inside OpenAI, 95% of developers use it and submit roughly 60% more pull requests per week. Cisco deployed it to 18,000 engineers for migrations and code reviews, cutting review time in half.&lt;/p&gt;

&lt;p&gt;Enterprise adoption of Codex is early though. Most big companies haven't finished security vetting. Copilot Enterprise stays the default at scale because GitHub already lives in their stack.&lt;/p&gt;

&lt;p&gt;Mid-market is where Cursor and similar AI-native editors are winning. Deep model integration, reasonable admin controls on the business tier, none of the enterprise procurement overhead. Good enough for a 200-person eng org.&lt;/p&gt;

&lt;h2&gt;The experience gap nobody talks about&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.blog/2023/06/12/developer-survey-sentiment-ai-ml/" rel="noopener noreferrer"&gt;Stack Overflow's 2023 data&lt;/a&gt; had a pattern that jumped out:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experience&lt;/th&gt;
&lt;th&gt;Using AI tools&lt;/th&gt;
&lt;th&gt;Don't plan to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Less than 1 year&lt;/td&gt;
&lt;td&gt;55.1%&lt;/td&gt;
&lt;td&gt;21.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1–5 years&lt;/td&gt;
&lt;td&gt;51.3%&lt;/td&gt;
&lt;td&gt;24.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6–10 years&lt;/td&gt;
&lt;td&gt;42.3%&lt;/td&gt;
&lt;td&gt;30.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11–15 years&lt;/td&gt;
&lt;td&gt;39.5%&lt;/td&gt;
&lt;td&gt;32.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16–20 years&lt;/td&gt;
&lt;td&gt;35.9%&lt;/td&gt;
&lt;td&gt;36.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21+ years&lt;/td&gt;
&lt;td&gt;30.2%&lt;/td&gt;
&lt;td&gt;42.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This connects directly to company size. Older, larger companies employ more senior engineers. A shop where average tenure is 12 years will see lower organic adoption than a startup where the median engineer has three years under their belt. Senior engineers aren't Luddites. They're working on problems where current AI tools genuinely can't help much yet.&lt;/p&gt;

&lt;h2&gt;Geography makes it messier&lt;/h2&gt;

&lt;p&gt;Where your developers sit changes things too. From &lt;a href="https://stackoverflow.blog/2023/06/12/developer-survey-sentiment-ai-ml/" rel="noopener noreferrer"&gt;the same Stack Overflow data&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Country&lt;/th&gt;
&lt;th&gt;Using or plan to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🇮🇳 India&lt;/td&gt;
&lt;td&gt;83.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🇧🇷 Brazil&lt;/td&gt;
&lt;td&gt;78.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🇺🇸 United States&lt;/td&gt;
&lt;td&gt;63.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🇩🇪 Germany&lt;/td&gt;
&lt;td&gt;63.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🇫🇷 France&lt;/td&gt;
&lt;td&gt;61.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🇬🇧 United Kingdom&lt;/td&gt;
&lt;td&gt;61.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;India and Brazil lead. Younger developer populations, faster-growing tech sectors. &lt;a href="https://github.blog/news-insights/octoverse/octoverse-2024/" rel="noopener noreferrer"&gt;GitHub's Octoverse report&lt;/a&gt; projects India will have the most developers on GitHub by 2028 — generative AI contributions on the platform surged 59% in 2024.&lt;/p&gt;

&lt;p&gt;For multinationals, this means a patchwork. Your Bangalore team is all-in on Copilot. Your Munich office wants to see more evidence first. That's not a tech problem. It's a cultural one.&lt;/p&gt;

&lt;h2&gt;So what do you actually do&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Small team, under 50 engineers.&lt;/strong&gt; Pick something and commit to it. Cursor or Claude Code for the best solo experience. Copilot if everyone uses VS Code. Don't agonize over the choice — daily habit matters more than which tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mid-market, 50–500.&lt;/strong&gt; Track active usage, not seat count. &lt;a href="https://getdx.com/research/measuring-ai-code-assistants-and-agents/" rel="noopener noreferrer"&gt;DX recommends&lt;/a&gt; measuring weekly active users and time saved per developer. Booking.com did this across 3,500 engineers — &lt;a href="https://getdx.com/customers/booking-drives-ai-adoption-with-dx/" rel="noopener noreferrer"&gt;16% throughput increase&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise, 500+.&lt;/strong&gt; The tool is almost irrelevant. What matters: fast CI, clear docs, well-defined service boundaries. &lt;a href="https://getdx.com/research/measuring-ai-code-assistants-and-agents/" rel="noopener noreferrer"&gt;DX identifies these&lt;/a&gt; as the real predictors of whether AI tools deliver value. Fix developer experience first. Add AI second. Otherwise you're just automating dysfunction.&lt;/p&gt;

&lt;p&gt;The winners aren't the companies that adopted first. They're the ones that measured what happened after and changed course when the data told them to. &lt;a href="https://shiftmag.dev/this-cto-says-93-of-developers-use-ai-but-productivity-is-still-10-8013/" rel="noopener noreferrer"&gt;Laura Tacho's blunt summary&lt;/a&gt;: "This is really a management problem."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More from Kael Research: &lt;a href="https://dev.to/blog/llm-pricing-comparison-feb-2026"&gt;LLM pricing comparison&lt;/a&gt; and &lt;a href="https://dev.to/blog/ai-agent-market-map-2026"&gt;AI agent market map 2026&lt;/a&gt;. Get these posts in your inbox — &lt;a href="https://dev.to/#newsletter"&gt;join the newsletter&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aicodingassistants</category>
      <category>developerproductivity</category>
      <category>enterpriseai</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>AI agent market map 2026: who's building what</title>
      <dc:creator>Kael Tiwari</dc:creator>
      <pubDate>Thu, 19 Feb 2026 13:40:21 +0000</pubDate>
      <link>https://forem.com/kaeltiwari/ai-agent-market-map-2026-whos-building-what-231b</link>
      <guid>https://forem.com/kaeltiwari/ai-agent-market-map-2026-whos-building-what-231b</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kaelresearch.com/blog/ai-agent-market-map-2026" rel="noopener noreferrer"&gt;Kael Research&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI agent market split into two camps this year: frameworks racing for developer adoption, and platforms betting on enterprise deployment. After analyzing GitHub stars, HuggingFace downloads, and funding announcements, the winners are becoming clear.&lt;/p&gt;

&lt;h2&gt;Market size and momentum&lt;/h2&gt;

&lt;p&gt;The agent space got real money in 2026. CrewAI claims 100,000+ certified developers through their courses at learn.crewai.com. LangChain maintains its position as the default choice but faces performance pressure from newer frameworks. Microsoft shifted focus from AutoGen to its new Agent Framework after placing AutoGen v0.2 in maintenance mode.&lt;/p&gt;

&lt;p&gt;Enterprise adoption accelerated. Accenture now allegedly ties promotions to "regular" AI adoption and tracks individual weekly AI tool logins for senior staff, according to &lt;a href="https://www.ft.com" rel="noopener noreferrer"&gt;Financial Times reporting&lt;/a&gt;. TCS signed OpenAI as the first customer for its data center business, with 100MW of capacity, the start of what could be grid-scale enterprise AI deployment.&lt;/p&gt;

&lt;p&gt;India emerged as a major market force. At the India AI Impact Summit 2026, organizers claimed 300+ exhibitors, 500 sessions, 250K visitors, and billions in investment commitments. Reliance plans up to $110B in AI infrastructure over seven years, while Pine Labs is embedding OpenAI APIs directly into payment infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;GitHub Stars/Users&lt;/th&gt;
&lt;th&gt;Key Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;td&gt;Free/LangSmith paid&lt;/td&gt;
&lt;td&gt;100K+ stars&lt;/td&gt;
&lt;td&gt;Model interoperability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;td&gt;Free/AMP Suite paid&lt;/td&gt;
&lt;td&gt;Not specified&lt;/td&gt;
&lt;td&gt;Role-based multi-agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;30K+ stars&lt;/td&gt;
&lt;td&gt;Conversational agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Assistants&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Per-token&lt;/td&gt;
&lt;td&gt;N/A (deprecated Aug 2026)&lt;/td&gt;
&lt;td&gt;Native OpenAI integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Tool Use&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Per-token&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Claude-native tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenAI is sunsetting its Assistants API in August 2026 in favor of the new Responses API, a significant shift toward simpler mental models. The new system replaces assistants with "prompts" that can be versioned in the dashboard, and threads with "conversations" that store items beyond just messages.&lt;/p&gt;

&lt;p&gt;CrewAI positioned itself as the anti-LangChain this year — completely independent, no dependencies, built from scratch. They claim 5.76x faster execution than LangGraph in certain QA tasks and tout their lean architecture. The framework offers both autonomous "Crews" for flexible decision-making and precise "Flows" for event-driven control.&lt;/p&gt;

&lt;p&gt;AutoGen's Microsoft backing kept it relevant despite the maintenance mode announcement. The new Agent Framework promises better layered architecture with Core API for message passing, AgentChat API for rapid prototyping, and Extensions API for third-party capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform comparison
&lt;/h2&gt;

&lt;p&gt;The platform battle intensified around deployment and monitoring:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Notable Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Usage-based&lt;/td&gt;
&lt;td&gt;LangChain native observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI AMP&lt;/td&gt;
&lt;td&gt;Enterprise control&lt;/td&gt;
&lt;td&gt;Enterprise pricing&lt;/td&gt;
&lt;td&gt;Unified control plane, 24/7 support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen Studio&lt;/td&gt;
&lt;td&gt;No-code GUI&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Visual multi-agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;Personal agents&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;Telegram-native, cross-platform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenClaw gained traction in messaging-native agent deployment, particularly on Telegram. The platform offers personal AI assistants that integrate across devices and supports features like voice message transcription and real-time collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source model momentum
&lt;/h2&gt;

&lt;p&gt;HuggingFace download numbers revealed shifting preferences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;moonshotai/Kimi-K2.5&lt;/code&gt; hit 955K+ downloads with 2.2K likes — Kimi adoption is accelerating&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hexgrad/Kokoro-82M&lt;/code&gt; dominated text-to-speech with 8.1M+ downloads — tiny models are winning distribution
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MiniMaxAI/MiniMax-M2.5&lt;/code&gt; showed 89.9K downloads — non-US models are gaining serious traction&lt;/li&gt;
&lt;li&gt;Video generation crossed from demos to repeated use: &lt;code&gt;Lightricks/LTX-2&lt;/code&gt; reached 2M+ downloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is clear: smaller, specialized models are eating market share from larger general-purpose systems. Developers want fast, focused tools over Swiss Army knife solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recent launches and announcements
&lt;/h2&gt;

&lt;p&gt;February 2026 brought several major developments:&lt;/p&gt;

&lt;p&gt;Funding rounds brought a major capital influx. Fei-Fei Li's World Labs reportedly raised $1B from A16Z and Nvidia for world models. OpenAI is approaching a funding round that could exceed $100B, with its valuation potentially hitting $850B according to Bloomberg.&lt;/p&gt;

&lt;p&gt;Enterprise deals showed infrastructure scale. TCS and OpenAI's 100MW data center partnership signals AI infrastructure moving to utility scale. Circuit raised $30M for AI manufacturing platforms, showing vertical-specific agent demand.&lt;/p&gt;

&lt;p&gt;Technical updates accelerated too. Gemini 3.1 Pro went live on Vertex AI, and new releases from the major providers brought significant improvements in reasoning and tool use.&lt;/p&gt;

&lt;p&gt;Platform consolidation emerged around three major approaches: framework-first (LangChain, CrewAI), platform-first (enterprise solutions), and API-first (OpenAI, Anthropic).&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for builders
&lt;/h2&gt;

&lt;p&gt;The agent market is maturing fast. Three trends matter most:&lt;/p&gt;

&lt;p&gt;Performance beats features every time. CrewAI's speed claims against LangGraph reflect broader developer frustration with bloated frameworks. Lean, fast solutions are winning mindshare.&lt;/p&gt;

&lt;p&gt;Enterprise deployment patterns are hardening. The TCS-OpenAI deal and Accenture's promotion policies show enterprise AI is moving from experimentation to operational requirement. IT departments want monitoring, control planes, and SLA guarantees.&lt;/p&gt;

&lt;p&gt;Messaging-native experiences are becoming default UX patterns: Telegram bots, WhatsApp integrations, and SMS-based agents. The command line lost to the chat interface.&lt;/p&gt;

&lt;p&gt;If you're building agents in 2026, focus on deployment simplicity over framework complexity. The market rewarded practical tools that solve real workflow problems, not academic demonstrations of multi-agent collaboration.&lt;/p&gt;

&lt;p&gt;The infrastructure layer is consolidating around a few winners, but application opportunities remain wide open. Pick your framework based on deployment target: CrewAI for speed, LangChain for ecosystem, or native APIs for direct model integration.&lt;/p&gt;

&lt;p&gt;For more analysis on model pricing trends, read our &lt;a href="https://dev.to/blog/llm-pricing-comparison-feb-2026"&gt;LLM pricing comparison Feb 2026&lt;/a&gt; and &lt;a href="https://dev.to/blog/open-source-vs-proprietary-llms"&gt;open source vs proprietary LLMs&lt;/a&gt; breakdown.&lt;/p&gt;

&lt;p&gt;Want updates on agent market developments? &lt;a href="https://dev.to/#newsletter"&gt;Subscribe to our newsletter&lt;/a&gt; for weekly analysis of funding, launches, and technical developments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Open Source vs Proprietary LLMs: The Real Cost Breakdown</title>
      <dc:creator>Kael Tiwari</dc:creator>
      <pubDate>Thu, 19 Feb 2026 13:34:17 +0000</pubDate>
      <link>https://forem.com/kaeltiwari/open-source-vs-proprietary-llms-the-real-cost-breakdown-15d0</link>
      <guid>https://forem.com/kaeltiwari/open-source-vs-proprietary-llms-the-real-cost-breakdown-15d0</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kaelresearch.com/blog/open-source-vs-proprietary-llms" rel="noopener noreferrer"&gt;Kael Research&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Below 1B tokens/month, just use APIs; whether they're proprietary or hosted open-source barely matters at that scale. Between 1 and 10B tokens, hosted open-source APIs from &lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt; or &lt;a href="https://groq.com/pricing" rel="noopener noreferrer"&gt;Groq&lt;/a&gt; are usually cheapest. Above 10B tokens/month, self-hosting can win, but only if you already have an MLOps team. The "open source is free" narrative ignores $300K to $600K/year in engineering overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pricing table
&lt;/h2&gt;

&lt;p&gt;Prices move fast. Here's where things stand in February 2026. All prices are per 1M tokens (input/output).&lt;/p&gt;

&lt;h3&gt;
  
  
  Open source models via hosted APIs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4 Maverick&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.27&lt;/td&gt;
&lt;td&gt;$0.85&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4 Maverick&lt;/td&gt;
&lt;td&gt;&lt;a href="https://groq.com/pricing" rel="noopener noreferrer"&gt;Groq&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;562 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-120B&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt; / &lt;a href="https://fireworks.ai/pricing" rel="noopener noreferrer"&gt;Fireworks&lt;/a&gt; / &lt;a href="https://groq.com/pricing" rel="noopener noreferrer"&gt;Groq&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;Bargain tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3.1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;$1.70&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-235B&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Small 3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Proprietary models
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.2&lt;/td&gt;
&lt;td&gt;$1.75&lt;/td&gt;
&lt;td&gt;$14.00&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 mini&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/about-claude/models/overview" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/about-claude/models/overview" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things jump out. GPT-OSS-120B at $0.15 input is wild. That's 11x cheaper than GPT-5.2 on the input side. GPT-5 mini and Gemini 2.5 Flash sit in a middle ground where proprietary pricing gets surprisingly close to open-source hosted rates. For a deeper dive on the month-over-month trends, see our &lt;a href="https://dev.to/blog/llm-pricing-comparison-feb-2026"&gt;full pricing comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real comparison: API vs API vs self-hosted
&lt;/h2&gt;

&lt;p&gt;People frame this as "open source vs proprietary." That's wrong. The actual decision has three options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Proprietary API, where you pay OpenAI, Anthropic, or Google directly&lt;/li&gt;
&lt;li&gt;Hosted open-source API, where you pay Together.ai, Groq, or Fireworks to run open models for you&lt;/li&gt;
&lt;li&gt;Self-hosted open source, where you rent GPUs and run the models yourself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 2 gets overlooked constantly. You get the flexibility of open weights without the operational burden. For most companies, this is the right answer.&lt;/p&gt;

&lt;p&gt;Option 3 sounds appealing on paper. In practice, it's a staffing decision disguised as a technology decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breakeven math at different scales
&lt;/h2&gt;

&lt;p&gt;Let's do the math for a representative setup: GPT-OSS-120B via Together.ai ($0.15/$0.60) vs self-hosting on H100s from &lt;a href="https://lambdalabs.com/" rel="noopener noreferrer"&gt;Lambda Labs&lt;/a&gt; at $2.99/hr ($2,183/mo). A single H100 running a 70B-class model produces roughly 50 tokens/second per stream, which works out to about 130M tokens per month; batched serving can multiply that, but it's a useful floor.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale (tokens/mo)&lt;/th&gt;
&lt;th&gt;Together.ai cost&lt;/th&gt;
&lt;th&gt;Self-hosted cost&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10M&lt;/td&gt;
&lt;td&gt;~$4.50&lt;/td&gt;
&lt;td&gt;$2,183 + eng. overhead&lt;/td&gt;
&lt;td&gt;API by a mile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100M&lt;/td&gt;
&lt;td&gt;~$45&lt;/td&gt;
&lt;td&gt;$2,183 + eng. overhead&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1B&lt;/td&gt;
&lt;td&gt;~$450&lt;/td&gt;
&lt;td&gt;$2,183 + eng. overhead&lt;/td&gt;
&lt;td&gt;API on compute, and by more once overhead counts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10B&lt;/td&gt;
&lt;td&gt;~$4,500&lt;/td&gt;
&lt;td&gt;~$17K compute (8× H100s) + eng. overhead&lt;/td&gt;
&lt;td&gt;Depends on your team&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that at these throughput numbers, the API stays cheaper on raw compute at every scale in the table; self-hosting only pulls ahead once batching pushes per-GPU throughput well past the single-stream figure. And compute isn't the whole story.&lt;/p&gt;

&lt;p&gt;At AWS rates of ~$3.90/hr per H100, the math shifts even further toward APIs. Reserved instances at $1.85/hr help, but you're committing to a year of capacity. H200s at $6.00/hr and B200s at $9.00/hr from &lt;a href="https://fireworks.ai/pricing" rel="noopener noreferrer"&gt;Fireworks&lt;/a&gt; give you more throughput per dollar, but the hourly bill climbs too.&lt;/p&gt;
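
&lt;p&gt;To make the crossover concrete, here's a back-of-envelope sketch using the on-demand numbers above. The blended API rate is my assumption (roughly a 3:1 input:output mix at Together.ai's GPT-OSS-120B prices); treat it as illustration, not a quote:&lt;/p&gt;

```python
# When does one rented H100 beat the hosted API on raw compute?
# Figures are the on-demand numbers quoted above; the blended API rate
# assumes a roughly 3:1 input:output token mix (an assumption, not a quote).
API_RATE_PER_M = 0.45        # blended $/M tokens, GPT-OSS-120B on Together.ai
GPU_MONTHLY = 2183           # one H100 at $2.99/hr for a 730-hour month
SINGLE_STREAM_M = 130        # ~50 tok/s single-stream -> ~130M tokens/month

breakeven_m = GPU_MONTHLY / API_RATE_PER_M            # millions of tokens/month
required_tps = breakeven_m * 1e6 / (30 * 24 * 3600)   # sustained tokens/second
multiple = breakeven_m / SINGLE_STREAM_M

print(f"breakeven: {breakeven_m:,.0f}M tokens/month per GPU")
print(f"needs ~{required_tps:,.0f} tok/s sustained vs ~50 tok/s single-stream")
print(f"so batched serving must hit ~{multiple:.0f}x single-stream throughput")
```

&lt;p&gt;The answer falls out immediately: a single H100 has to sustain nearly 1,900 tok/s, dozens of times the single-stream figure, before its rental undercuts the API on compute alone.&lt;/p&gt;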

&lt;h2&gt;
  
  
  The hidden costs of self-hosting
&lt;/h2&gt;

&lt;p&gt;Here's the part that "open source is free" evangelists skip over.&lt;/p&gt;

&lt;p&gt;An MLOps team to keep self-hosted models running costs $300K to $600K per year. That's 2 to 4 engineers, and you're competing with every AI company on earth for that talent. Good luck hiring them quickly.&lt;/p&gt;

&lt;p&gt;Beyond salaries, you're signing up for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monitoring and alerting infrastructure&lt;/li&gt;
&lt;li&gt;model version management and rollback procedures&lt;/li&gt;
&lt;li&gt;GPU usage tuning (most teams waste 30 to 50% of their compute)&lt;/li&gt;
&lt;li&gt;security patching and compliance audits&lt;/li&gt;
&lt;li&gt;on-call rotations for when inference goes sideways at 3 AM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this shows up in the $/token calculation. It should.&lt;/p&gt;

&lt;p&gt;There's also the upgrade treadmill. A new model drops, your fine-tuned version is two generations behind, and now you need to re-run your evaluation suite, re-tune, and redeploy. With an API provider, you change a model string.&lt;/p&gt;

&lt;h2&gt;
  
  
  When open source wins
&lt;/h2&gt;

&lt;p&gt;Open source isn't always the cheaper option, but it's sometimes the &lt;em&gt;only&lt;/em&gt; option.&lt;/p&gt;

&lt;p&gt;Compliance and data sovereignty come first. If you're operating in healthcare or finance with strict data residency requirements, self-hosted open source gives you full control. The data never leaves your infrastructure. No BAA negotiations, no hoping your provider's compliance team got it right. HIPAA and GDPR compliance by design, not by contract.&lt;/p&gt;

&lt;p&gt;Air-gapped environments are the extreme version of this. Defense, certain government agencies, some financial institutions: they can't send data to external APIs at all. Open source is the only game in town.&lt;/p&gt;

&lt;p&gt;Fine-tuning is where open source pulls ahead on cost dramatically. Training on &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI's GPT-4.1 costs $25 per million tokens&lt;/a&gt;. The same job on open-source models through &lt;a href="https://fireworks.ai/pricing" rel="noopener noreferrer"&gt;Fireworks runs $0.50 per million tokens&lt;/a&gt; for models up to 16B parameters. Self-hosted, you pay only for compute. That's a 50x cost difference at the API level. If you need customized models, and many &lt;a href="https://dev.to/brief/ai-agents"&gt;agent-based architectures&lt;/a&gt; do, open source is hard to beat.&lt;/p&gt;
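
&lt;p&gt;In dollars, for a concrete training job at the API rates just quoted (the 100M-token job size is a hypothetical example):&lt;/p&gt;

```python
# Fine-tuning cost at the API level: the 50x gap in dollars.
# Per-M-token training rates are the figures quoted above; job size is
# a hypothetical example, not from either provider's docs.
OPENAI_FT_RATE = 25.00       # $/M training tokens (GPT-4.1)
FIREWORKS_FT_RATE = 0.50     # $/M training tokens (models up to 16B)
TRAINING_TOKENS_M = 100      # a 100M-token fine-tuning job (assumption)

openai_cost = TRAINING_TOKENS_M * OPENAI_FT_RATE
fireworks_cost = TRAINING_TOKENS_M * FIREWORKS_FT_RATE
print(f"OpenAI:    ${openai_cost:,.0f}")
print(f"Fireworks: ${fireworks_cost:,.0f}  ({openai_cost / fireworks_cost:.0f}x cheaper)")
```

&lt;p&gt;$2,500 vs $50 for the same job. Re-run that monthly for continuous tuning and the gap compounds fast.&lt;/p&gt;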

&lt;p&gt;High volume is the last piece. Past 10B tokens per month, the economics of self-hosting start making sense, assuming you've already got the infrastructure team. The key word is "already." Building that team from scratch to save on inference costs rarely pencils out.&lt;/p&gt;

&lt;h2&gt;
  
  
  When proprietary wins
&lt;/h2&gt;

&lt;p&gt;Speed to market is the obvious one. You can go from zero to production with GPT-5.2 or Claude Sonnet 4.6 in a weekend. No infrastructure provisioning, no model tuning, no serving framework selection. Just an API key and a credit card.&lt;/p&gt;

&lt;p&gt;Quality ceiling matters too. As of February 2026, Claude Opus 4.6 and GPT-5.2 still outperform open-source alternatives on complex reasoning tasks. The gap has narrowed (Llama 4 Maverick and Qwen3-235B are genuinely impressive) but for the hardest problems, proprietary models hold an edge. That edge costs 10 to 20x more per token, so the question is whether your use case actually needs it.&lt;/p&gt;

&lt;p&gt;No infra team is the underrated advantage. A startup with 5 engineers shouldn't be allocating 2 of them to GPU management. Use that headcount to build product instead. The API cost premium is cheaper than the hiring cost.&lt;/p&gt;

&lt;p&gt;Proprietary providers also handle the compliance paperwork for you. &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI has a BAA for HIPAA&lt;/a&gt;. &lt;a href="https://platform.claude.com/docs/en/about-claude/models/overview" rel="noopener noreferrer"&gt;Anthropic is HIPAA-ready&lt;/a&gt;. Azure OpenAI gives you EU data residency. These aren't free (enterprise plans cost more) but the operational simplicity has real value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision framework
&lt;/h2&gt;

&lt;p&gt;Forget the vibes. Use this matrix.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Use proprietary API&lt;/th&gt;
&lt;th&gt;Use hosted open-source API&lt;/th&gt;
&lt;th&gt;Self-host&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Volume&lt;/td&gt;
&lt;td&gt;&amp;lt; 1B tok/mo&lt;/td&gt;
&lt;td&gt;1 to 10B tok/mo&lt;/td&gt;
&lt;td&gt;&amp;gt; 10B tok/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team size&lt;/td&gt;
&lt;td&gt;No MLOps engineers&lt;/td&gt;
&lt;td&gt;No MLOps engineers&lt;/td&gt;
&lt;td&gt;2+ MLOps engineers already on staff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data sensitivity&lt;/td&gt;
&lt;td&gt;Standard (with BAA if needed)&lt;/td&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Air-gapped or strict residency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning needed&lt;/td&gt;
&lt;td&gt;Light (prompt engineering suffices)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Heavy or continuous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to production&lt;/td&gt;
&lt;td&gt;Days&lt;/td&gt;
&lt;td&gt;Days&lt;/td&gt;
&lt;td&gt;Weeks to months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality requirements&lt;/td&gt;
&lt;td&gt;Highest available&lt;/td&gt;
&lt;td&gt;Good enough&lt;/td&gt;
&lt;td&gt;Good enough + customized&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
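
&lt;p&gt;The matrix condenses into a few lines of logic. This is a sketch only: the thresholds mirror the table, and a real decision weighs latency, quality needs, and procurement too.&lt;/p&gt;

```python
# The decision matrix above as a function. Thresholds follow the table;
# real decisions weigh more factors than these three.
def deployment_choice(tokens_b_per_month, mlops_engineers, air_gapped=False):
    """Return a rough recommendation from the matrix above."""
    if air_gapped:
        return "self-host"                      # external APIs are off the table
    if tokens_b_per_month > 10 and mlops_engineers >= 2:
        return "self-host"
    if tokens_b_per_month > 1:
        return "hosted open-source API"
    return "proprietary API"

print(deployment_choice(0.3, 0))    # proprietary API
print(deployment_choice(5, 0))      # hosted open-source API
print(deployment_choice(25, 3))     # self-host
```

&lt;p&gt;Note the order of checks: high volume without an existing MLOps team still lands on a hosted API, which is the whole point of the staffing argument above.&lt;/p&gt;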

&lt;p&gt;My honest take: most companies should start with proprietary APIs, move to hosted open-source APIs as volume grows, and only self-host when they're processing billions of tokens and already have the team. The middle option, hosted open source, is the most underused path. And it's often the best one.&lt;/p&gt;

&lt;p&gt;The market is moving fast. Prices on this page will be outdated within weeks. We track changes monthly in our &lt;a href="https://dev.to/blog/llm-pricing-comparison-feb-2026"&gt;LLM pricing comparison&lt;/a&gt;, and if you want updates when the math shifts, &lt;a href="https://dev.to/#newsletter"&gt;join the newsletter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>LLM Pricing in February 2026: What Every Model Actually Costs</title>
      <dc:creator>Kael Tiwari</dc:creator>
      <pubDate>Thu, 19 Feb 2026 13:34:15 +0000</pubDate>
      <link>https://forem.com/kaeltiwari/llm-pricing-in-february-2026-what-every-model-actually-costs-3jdd</link>
      <guid>https://forem.com/kaeltiwari/llm-pricing-in-february-2026-what-every-model-actually-costs-3jdd</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://kaelresearch.com/blog/llm-pricing-comparison-feb-2026" rel="noopener noreferrer"&gt;Kael Research&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Cheapest option is OpenAI's open-source GPT-OSS-20B at $0.05/M input. Best value is GPT-5 mini at $0.25/M. Most expensive is Grok-4 at $30/M — 600x more than GPT-OSS-20B. Claude Opus 4.6 dropped to $5/$25 (down from $15/$75 on Opus 4). Full table with 18 models below.&lt;/p&gt;




&lt;p&gt;If you're building on top of LLMs right now, you're probably spending more than you need to. Pricing has changed so fast over the past year that most teams are running on outdated assumptions.&lt;/p&gt;

&lt;p&gt;Here's what every major model actually costs as of February 2026, with the context that matters for choosing between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The full pricing table
&lt;/h2&gt;

&lt;p&gt;All prices are per million tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.2&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;$1.75&lt;/td&gt;
&lt;td&gt;$14.00&lt;/td&gt;
&lt;td&gt;Flagship, best overall quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;Best price/performance ratio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;td&gt;Still widely deployed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1 nano&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;Cheapest OpenAI option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o4-mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;$1.10&lt;/td&gt;
&lt;td&gt;$4.40&lt;/td&gt;
&lt;td&gt;Reasoning model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;Top-tier reasoning + coding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;Workhorse model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;Fast + cheap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-120B&lt;/td&gt;
&lt;td&gt;OpenAI (open-source)&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;Open-weight, via hosted APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;OpenAI (open-source)&lt;/td&gt;
&lt;td&gt;$0.05&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;Smallest open-weight option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;Strong on long context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.0 Flash&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;Budget tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4 Maverick&lt;/td&gt;
&lt;td&gt;Meta (via API)&lt;/td&gt;
&lt;td&gt;$0.27&lt;/td&gt;
&lt;td&gt;$0.85&lt;/td&gt;
&lt;td&gt;Open-weight, self-hostable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3.1&lt;/td&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;$1.70&lt;/td&gt;
&lt;td&gt;Chinese lab, surprisingly strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok-4&lt;/td&gt;
&lt;td&gt;xAI&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;$150.00&lt;/td&gt;
&lt;td&gt;Most expensive model on market&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok-4-fast&lt;/td&gt;
&lt;td&gt;xAI&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;xAI's mid-tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok-3&lt;/td&gt;
&lt;td&gt;xAI&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;$150.00&lt;/td&gt;
&lt;td&gt;Previous gen, same price as Grok-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok-3-mini&lt;/td&gt;
&lt;td&gt;xAI&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;Budget reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Sources: &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI pricing&lt;/a&gt;, &lt;a href="https://platform.claude.com/docs/en/about-claude/models/overview" rel="noopener noreferrer"&gt;Anthropic models&lt;/a&gt;, &lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Google AI pricing&lt;/a&gt;, &lt;a href="https://docs.x.ai/docs/models#models-and-pricing" rel="noopener noreferrer"&gt;xAI pricing&lt;/a&gt;, &lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek pricing&lt;/a&gt;, &lt;a href="https://www.together.ai/pricing" rel="noopener noreferrer"&gt;Together.ai&lt;/a&gt;, &lt;a href="https://groq.com/pricing" rel="noopener noreferrer"&gt;Groq&lt;/a&gt; for open-source model hosting. All checked February 19, 2026.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What stands out
&lt;/h2&gt;

&lt;p&gt;The gap between cheapest and most expensive is staggering. GPT-OSS-20B at $0.05/M input vs Grok-4 at $30/M input. That's 600x. Even comparing production-grade models, GPT-5 mini at $0.25/M vs Claude Opus 4.6 at $5/M is a 20x spread. For most workloads, the cheaper models handle 80%+ of tasks just fine.&lt;/p&gt;

&lt;p&gt;xAI is pricing itself out. Grok-4 at $30/$150 per million tokens is the most expensive API on the market. That's 6x Claude Opus 4.6 and 17x GPT-5.2 on input. Unless you need something Grok does better (hard to name what that is), the pricing makes no sense for production use.&lt;/p&gt;

&lt;p&gt;Google is quietly the cheapest. Gemini 2.0 Flash at $0.10/$0.40 matches GPT-4.1 nano and undercuts almost everything else. If your use case tolerates the quality tradeoff, it's the best deal available.&lt;/p&gt;

&lt;p&gt;Open-weight models changed the math. Llama 4 Maverick at $0.27/$0.85 through hosted APIs is cheap, but the real story is self-hosting. Running Llama on your own GPUs drops the effective cost below $0.10/M tokens for input. The breakeven vs API depends on volume, but for companies doing 10B+ tokens/month, self-hosting wins.&lt;/p&gt;
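
&lt;p&gt;The spread is easier to feel as a monthly bill. A minimal sketch using prices from the table and a hypothetical workload of 2B input and 500M output tokens per month:&lt;/p&gt;

```python
# Monthly bill for one sample workload across models from the table above.
# The workload mix (2B input, 500M output per month) is a hypothetical example.
PRICES = {                    # $/M tokens (input, output), from the table
    "GPT-OSS-20B":      (0.05, 0.20),
    "Gemini 2.0 Flash": (0.10, 0.40),
    "GPT-5 mini":       (0.25, 2.00),
    "Claude Opus 4.6":  (5.00, 25.00),
    "Grok-4":           (30.00, 150.00),
}
IN_M, OUT_M = 2_000, 500      # tokens, in millions

for model, (p_in, p_out) in PRICES.items():
    bill = IN_M * p_in + OUT_M * p_out
    print(f"{model:18s} ${bill:>10,.0f}/month")
```

&lt;p&gt;Same workload, anywhere from $200 to $135,000 a month depending on the model. That spread is the entire argument for routing easy requests to cheap models.&lt;/p&gt;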

&lt;h2&gt;
  
  
  Beyond the price tag
&lt;/h2&gt;

&lt;p&gt;The table is just the start. What actually matters:&lt;/p&gt;

&lt;p&gt;Output tokens cost 2 to 8x more than input, and the pattern holds across every provider. If your app generates long responses (code, reports, content), output cost dominates your bill. Trim your outputs.&lt;/p&gt;

&lt;p&gt;Caching changes everything. &lt;a href="https://platform.openai.com/docs/guides/prompt-caching" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; and &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; both offer prompt caching that cuts repeat-context costs by 50-90%. If you're sending the same system prompt or few-shot examples on every call, caching alone might cut your bill in half.&lt;/p&gt;
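
&lt;p&gt;Here's what that looks like in dollars. The 80% cache-hit rate and 90% cached-token discount below are assumptions for illustration; actual discounts and minimum-prefix rules vary by provider:&lt;/p&gt;

```python
# Effect of prompt caching on a chat workload where most input tokens are a
# repeated system prompt plus few-shot examples. The hit rate and discount
# are illustrative assumptions, not any provider's published numbers.
INPUT_RATE = 1.75             # $/M input tokens (GPT-5.2, from the table)
CACHED_FRACTION = 0.80        # share of input tokens that hit the cache
CACHE_DISCOUNT = 0.90         # cached tokens billed at 90% off (assumption)

def input_cost(tokens_m, cached_fraction=0.0):
    """Monthly input bill for tokens_m million tokens at a given hit rate."""
    cached = tokens_m * cached_fraction
    fresh = tokens_m - cached
    return fresh * INPUT_RATE + cached * INPUT_RATE * (1 - CACHE_DISCOUNT)

no_cache = input_cost(1_000)                      # 1B input tokens/month
with_cache = input_cost(1_000, CACHED_FRACTION)
print(f"without caching: ${no_cache:,.0f}")
print(f"with caching:    ${with_cache:,.0f}  ({1 - with_cache / no_cache:.0%} saved)")
```

&lt;p&gt;Under these assumptions the input bill drops from $1,750 to $490, a 72% cut, squarely inside the 50-90% range.&lt;/p&gt;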

&lt;p&gt;Quality gaps are shrinking. A year ago, there was a clear hierarchy: GPT-4 &amp;gt; Claude 3 &amp;gt; everything else. Now GPT-5 mini, Claude Sonnet 4.6, and Gemini 2.5 Flash are all competitive for most tasks. The premium models (GPT-5.2, Opus 4.6) still win on complex reasoning and long-form analysis, but the gap keeps closing.&lt;/p&gt;

&lt;p&gt;Latency matters more than price. The cheapest model that takes 8 seconds to respond might cost you more in user drop-off than a 2x pricier model that responds in 1.5 seconds. Benchmark latency alongside cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should use what
&lt;/h2&gt;

&lt;p&gt;High-volume production (chatbots, classification, extraction): GPT-5 mini or Gemini 2.0 Flash. Both under $0.50/M input with solid quality.&lt;/p&gt;

&lt;p&gt;Code generation: Claude Sonnet 4.6 or GPT-5.2. Sonnet is generally better at following complex coding instructions; GPT-5.2 has an edge on multi-file refactoring.&lt;/p&gt;

&lt;p&gt;Research and analysis: Claude Opus 4.6 if budget allows ($5/$25 is much more reasonable than the old Opus 4 pricing). GPT-5.2 if not.&lt;/p&gt;

&lt;p&gt;Cost-sensitive startups: Llama 4 Maverick self-hosted, or GPT-4.1 nano for API. Get to market first, pick the right model later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Pricing has dropped roughly 10x per year for equivalent quality over the past three years. There's no reason to think that stops. By Q4 2026, expect GPT-5 mini-equivalent quality at $0.05/M input or less.&lt;/p&gt;

&lt;p&gt;The real shift is happening at the infrastructure layer. Custom silicon (Google TPUs, Amazon Trainium, Microsoft Maia) is starting to undercut Nvidia GPU economics. As that scales, hosted API pricing will drop faster than self-hosting costs — potentially flipping the build-vs-buy calculation for mid-size companies.&lt;/p&gt;

&lt;p&gt;We'll update this comparison monthly. &lt;a href="https://dev.to/#newsletter"&gt;Subscribe to get updates&lt;/a&gt; when pricing changes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This analysis is part of Kael Research's ongoing coverage of AI market economics. We track pricing, adoption, and competition across the AI industry. &lt;a href="https://dev.to/briefs"&gt;See our full research briefs&lt;/a&gt; for deeper analysis on specific markets.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>pricing</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
