<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nate Voss</title>
    <description>The latest articles on Forem by Nate Voss (@natevoss).</description>
    <link>https://forem.com/natevoss</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839442%2F8d858ab2-90a7-47dd-b9ee-fa8c73b19227.png</url>
      <title>Forem: Nate Voss</title>
      <link>https://forem.com/natevoss</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/natevoss"/>
    <language>en</language>
    <item>
      <title>Model Routing: 3 Things I Learned Sending Tasks to the Cheapest Model That Actually Works</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Mon, 04 May 2026 06:54:30 +0000</pubDate>
      <link>https://forem.com/natevoss/model-routing-3-things-i-learned-sending-tasks-to-the-cheapest-model-that-actually-works-4e31</link>
      <guid>https://forem.com/natevoss/model-routing-3-things-i-learned-sending-tasks-to-the-cheapest-model-that-actually-works-4e31</guid>
      <description>&lt;p&gt;Everyone benchmarks models. Sonnet beats Haiku on reasoning. Opus beats Sonnet. Haiku is fastest. These things are all true.&lt;/p&gt;

&lt;p&gt;But benchmarking and deploying are different games. At scale, the difference between Haiku at $0.80/million tokens and Sonnet at $3/million tokens isn't academic. It's $400+ monthly on a mid-size application. The trap is paying for capability you don't actually need because you never measured what you do need.&lt;/p&gt;

&lt;p&gt;I built a router to answer one question: which tasks in my actual workflow could run on the cheapest model without failing? The answer surprised me. And I learned that the real value isn't the savings. It's the forcing function. You can't implement routing without auditing exactly where your complexity lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  3 Things I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Your Intuition About Task Complexity Is Backwards
&lt;/h3&gt;

&lt;p&gt;You think something needs Sonnet. Your gut says: "this requires reasoning, obviously expensive model."&lt;/p&gt;

&lt;p&gt;So I measured. Content classification? Haiku handles 95% of real requests. Writing summaries? 88%. Extracting structured data? 92%. The edge cases that needed Sonnet were smaller than I'd guessed. And they were always the same types of edge cases.&lt;/p&gt;

&lt;p&gt;Here's the pattern I found: obvious cases are &lt;strong&gt;really&lt;/strong&gt; obvious to Haiku. Spam detection, data validation, simple extractions. Haiku nails these. The failures cluster in a small, identifiable category: cases where even the human answer would be ambiguous. That's when you need Sonnet's nuance.&lt;/p&gt;

&lt;p&gt;But you don't know your edge case percentage until you try. Guessing leaves money on the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. You Need Observability Before Routing Saves Anything
&lt;/h3&gt;

&lt;p&gt;The instinct is to build the router first. "Let's write logic that detects complex requests and routes to Sonnet."&lt;/p&gt;

&lt;p&gt;This is backward. You need to measure first. Log every task with both Haiku and Sonnet responses side-by-side. Compare them. Find the patterns.&lt;/p&gt;

&lt;p&gt;Real questions to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When did Haiku refuse a task that Sonnet handled?&lt;/li&gt;
&lt;li&gt;How often do their answers differ, and which one was right?&lt;/li&gt;
&lt;li&gt;Was Haiku just uncertain, or actually wrong?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This requires instrumenting your inference layer. It takes a week. But you can't optimize what you can't see. Most teams skip this and build routers on intuition, which is why their routers are fragile.&lt;/p&gt;
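
&lt;p&gt;A minimal sketch of that measurement step, assuming the same Anthropic SDK used later in this post (the &lt;code&gt;shadowCompare&lt;/code&gt; helper and its crude string-equality check are illustrations, not a library API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function ask(model, prompt) {
  const response = await client.messages.create({
    model,
    max_tokens: 200,
    messages: [{ role: "user", content: prompt }]
  });
  return response.content[0].text;
}

// Development-only harness: run the same task through both models and log the pair.
async function shadowCompare(prompt) {
  const [haiku, sonnet] = await Promise.all([
    ask("claude-3-5-haiku-20241022", prompt),
    ask("claude-3-5-sonnet-20241022", prompt)
  ]);

  // Append to a log you can audit later: where do the answers diverge?
  console.log(JSON.stringify({
    prompt: prompt.slice(0, 80),
    haiku,
    sonnet,
    agree: haiku === sonnet
  }));

  return { haiku, sonnet };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
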

&lt;h3&gt;
  
  
  3. Routing Rules Should Be Dumb, Not Smart
&lt;/h3&gt;

&lt;p&gt;The temptation: build a classifier that predicts task complexity. Input length heuristics, keyword matching, embedding similarity. Something sophisticated.&lt;/p&gt;

&lt;p&gt;Don't. Use a simple rule: &lt;strong&gt;"If the model reports low confidence, escalate to Sonnet."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This separates the decision from the task. Haiku tells you when it's uncertain. That's a signal you can act on immediately, without needing to predict the future.&lt;/p&gt;

&lt;p&gt;The dumb rule wins because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It adapts as your tasks change (no retraining)&lt;/li&gt;
&lt;li&gt;It's testable (you can verify the confidence threshold)&lt;/li&gt;
&lt;li&gt;It fails safely (escalation costs more but works)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The smart rule loses because the routing logic itself becomes load-bearing infrastructure: it requires constant tuning and breaks when your data distribution shifts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;classifyWithFallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;confidenceThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="c1"&gt;// First pass: try Haiku (cheap, fast)&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;haikuResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-3-5-haiku-20241022&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"`&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

 &lt;span class="c1"&gt;// Log all Haiku decisions (even successes)&lt;/span&gt;
 &lt;span class="c1"&gt;// You're building a dataset of "when does Haiku work?"&lt;/span&gt;
 &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;haiku&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;tokensUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
 &lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="c1"&gt;// If Haiku is unsure, escalate to Sonnet&lt;/span&gt;
 &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;confidenceThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"`&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonnet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;escalatedFrom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;haiku&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;tokensUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
 &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Run both models in parallel during development and log the results. In production, start with Haiku, escalate on low confidence. As your logs accumulate, you'll see exactly which tasks need expensive models and which don't.&lt;/p&gt;
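
&lt;p&gt;Once those log lines pile up, a few lines of analysis tell you your real escalation rate. A rough sketch, assuming the entries above were written as one JSON object per line to a file (the &lt;code&gt;router.log&lt;/code&gt; name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { readFileSync } from "node:fs";

// One JSON object per line, as logged by the router above.
const entries = readFileSync("router.log", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) =&gt; JSON.parse(line));

// Every request logs a Haiku entry; escalations log an extra Sonnet entry.
const haikuCalls = entries.filter((e) =&gt; e.model === "haiku").length;
const sonnetCalls = entries.filter((e) =&gt; e.model === "sonnet").length;

console.log(`requests: ${haikuCalls}`);
console.log(`escalated to Sonnet: ${sonnetCalls} (${((sonnetCalls / haikuCalls) * 100).toFixed(1)}%)`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
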

&lt;h2&gt;
  
  
  The Math
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Haiku: $0.80 per 1M input tokens
Sonnet: $3 per 1M input tokens

Scenario: 1M requests/month, 200 input tokens average
- All Sonnet: 1M × 200 = 200M tokens at $3/1M = $600
- Routed (95% resolved by Haiku):
    Haiku first pass on all 1M requests: 200M tokens at $0.80/1M = $160
    Sonnet on the 5% that escalate: 50k × 200 = 10M tokens at $3/1M = $30
    Total: $190
- Savings: ~$410/month

At enterprise scale (100M requests/month), the same split saves roughly $41,000/month by routing to the cheapest viable model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost difference compounds. Small routing decisions get multiplied across thousands of requests.&lt;/p&gt;
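
&lt;p&gt;If you want to plug in your own numbers, the arithmetic above fits in a few lines. A sketch, with the per-million input-token prices quoted in this post hard-coded (swap in current pricing before trusting the output):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Per 1M input tokens, as quoted above.
const PRICE_PER_M = { haiku: 0.80, sonnet: 3.00 };

function monthlyCost(requests, tokensPerRequest, escalationRate) {
  const millionsOfTokens = (requests * tokensPerRequest) / 1_000_000;
  const allSonnet = millionsOfTokens * PRICE_PER_M.sonnet;
  // Routed: every request pays the Haiku pass; escalations also pay Sonnet.
  const routed =
    millionsOfTokens * PRICE_PER_M.haiku +
    millionsOfTokens * escalationRate * PRICE_PER_M.sonnet;
  return { allSonnet, routed, savings: allSonnet - routed };
}

console.log(monthlyCost(1_000_000, 200, 0.05));
// { allSonnet: 600, routed: 190, savings: 410 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
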

&lt;h2&gt;
  
  
  One Common Pitfall
&lt;/h2&gt;

&lt;p&gt;You'll build a sophisticated router and wonder why it doesn't move the needle. Usually because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You spent three months on routing logic but only one week validating it&lt;/li&gt;
&lt;li&gt;The escalation threshold is too aggressive ("if anything looks hard, use Sonnet")&lt;/li&gt;
&lt;li&gt;You're routing on heuristics, not observed behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix: measure first, always. Log both models' responses in parallel before committing to either one. You'll find that the obvious cases are really obvious, and the edge cases are smaller than you think.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Routing Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have &amp;gt;100k requests/month (smaller volume doesn't justify overhead)&lt;/li&gt;
&lt;li&gt;Your requests fall into clusters (some are cheap tasks, some are hard)&lt;/li&gt;
&lt;li&gt;You can measure ground truth (compare Haiku vs Sonnet, track which was right)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't build it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;lt;10k requests/month (infrastructure overhead isn't worth it)&lt;/li&gt;
&lt;li&gt;Every request is unique and complex (no pattern to exploit)&lt;/li&gt;
&lt;li&gt;You need 99.9% accuracy (can't tolerate Haiku failures)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;The cost savings are real. But the bigger win is the audit itself. Building a router forces you to measure exactly where your complexity actually lives. Most teams overthink what they need because they never measure. The router is just the excuse to finally look.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>3 Things I Learned Auditing My LLM App's Token Spend (And Why Your Benchmarks Are Lying)</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Mon, 27 Apr 2026 08:04:50 +0000</pubDate>
      <link>https://forem.com/natevoss/3-things-i-learned-auditing-my-llm-apps-token-spend-and-why-your-benchmarks-are-lying-3nbi</link>
      <guid>https://forem.com/natevoss/3-things-i-learned-auditing-my-llm-apps-token-spend-and-why-your-benchmarks-are-lying-3nbi</guid>
      <description>&lt;p&gt;You know that feeling when you ship an AI feature and realize your token bill is 3x what you estimated? Yeah, that was me last week.&lt;/p&gt;

&lt;p&gt;I have this thing called Agent-Max — it's a multi-platform growth agent that runs autonomous workflows: generating content, publishing to Bluesky, Medium, Twitter, Reddit. Sounds heavy, right? Every Monday it synthesizes a week of reading, scrapes engagement metrics, decides what to post and where. Seven platforms. Infinite LLM calls if you're not paying attention.&lt;/p&gt;

&lt;p&gt;Last Sunday I realized I had no idea what I was actually spending. I knew &lt;em&gt;roughly&lt;/em&gt; — "somewhere between $5 and $20 a week" — but roughly is how you end up with bill shock. So I built PromptFuel to solve the actual problem: measure what your app is doing, not what the docs say it &lt;em&gt;should&lt;/em&gt; do.&lt;/p&gt;

&lt;p&gt;Here's what three days of auditing my own code taught me.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Your bottleneck isn't the model you picked, it's the prompt you didn't trim
&lt;/h2&gt;

&lt;p&gt;I assumed my biggest cost sink was the weekly reflection. Claude reads 7 days of snapshots, engagement data, content history, trend analysis, then reasons about next week's strategy. Heavy prompt, right?&lt;/p&gt;

&lt;p&gt;Nope.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;pf optimize&lt;/code&gt; on the actual prompts showed the reflection was 2,847 tokens. Not small, but fine. The real killer: the daily content pregeneration loop was calling Claude 5 times per platform, and each call had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The entire engagement history (redundant; I fetch fresh data every run)&lt;/li&gt;
&lt;li&gt;Every. Single. Previous. Post. (all 120 of them, in the context)&lt;/li&gt;
&lt;li&gt;Current date, weather, trending topics (reloaded every call)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cutting history to "last 10 posts, last 3 days of engagement" knocked 40% off. Not because I switched models. Because I stopped hallucinating I needed context I wasn't even &lt;em&gt;reading&lt;/em&gt;.&lt;/p&gt;
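
&lt;p&gt;The trim itself is unglamorous. A minimal sketch of the idea (the data shapes here are made up for illustration; the point is capping what goes into every prompt):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical context builder: keep only what the model actually needs.
function buildContext(allPosts, engagementByDay) {
  const recentPosts = allPosts.slice(-10);            // last 10 posts, not all 120
  const recentEngagement = engagementByDay.slice(-3); // last 3 days, not the full history

  return [
    "Recent posts:",
    ...recentPosts.map((p) =&gt; `- ${p.title} (${p.platform})`),
    "Engagement, last 3 days:",
    ...recentEngagement.map((d) =&gt; `- ${d.date}: ${d.impressions} impressions`)
  ].join("\n");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
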

&lt;h2&gt;
  
  
  2. Your audit will surface the dumb stuff, not the obvious stuff
&lt;/h2&gt;

&lt;p&gt;Benchmarks tell you Sonnet costs $3 per 1M input tokens and Haiku costs $0.80. Pick the right model, do the math, move on.&lt;/p&gt;

&lt;p&gt;Except I was calling Claude Sonnet 7 times/week on background analytics where Haiku was plenty. Not intentional. I'd copied the model from an earlier prompt and never thought about it again. One-line change, zero quality loss, $2 saved per month.&lt;/p&gt;

&lt;p&gt;That math &lt;em&gt;never&lt;/em&gt; shows up in a benchmark. It shows up in your actual codebase, on your actual data, running your actual job. PromptFuel's advantage isn't telling you models are expensive. It's finding the calls you forgot about and showing you the before/after side-by-side.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Once you see the numbers, the optimization loop becomes obvious
&lt;/h2&gt;

&lt;p&gt;The first time I ran the dashboard, I thought I was done. Then Monday's weekly job ran and I watched 47 new prompts execute, with the dashboard updating in real time. I saw the pattern, and there was another cut to make.&lt;/p&gt;

&lt;p&gt;Auditing once is useful. Auditing every week is how you stop bleeding money.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's walk through it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run pf optimize&lt;/strong&gt; on a real prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize ./src/prompts/reflect.md &lt;span class="nt"&gt;--model&lt;/span&gt; claude-3-5-sonnet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see token count, cost per call, and a readability score. More importantly, you'll see where the redundancy is hiding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open the dashboard&lt;/strong&gt; to watch prompts in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf dashboard &lt;span class="nt"&gt;--watch&lt;/span&gt; ./src/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard opens on port 3000. Every time you call an LLM, you see it logged: model, input tokens, output tokens, cost, latency. No guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For production, wire up the SDK:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PromptFuel&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;promptfuel/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PromptFuel&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrapClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your prompt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Automatically tracked. One line changes nothing&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getMetrics&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; 
&lt;span class="c1"&gt;// { totalTokens: 342, totalCost: $0.008, calls: 1 }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real numbers
&lt;/h2&gt;

&lt;p&gt;Agent-Max before: ~1,847 tokens/week across all platforms.&lt;/p&gt;

&lt;p&gt;Agent-Max after (trimmed + downgraded safe calls to Haiku): 1,094 tokens/week.&lt;/p&gt;

&lt;p&gt;40% reduction. No quality loss. Three hours to audit and implement.&lt;/p&gt;

&lt;p&gt;That's not a benchmark. That's a real app, real prompts, real data.&lt;/p&gt;




&lt;p&gt;Stop guessing about your token spend. Measure what you're actually doing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max" rel="noopener noreferrer"&gt;https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How I Accidentally Spent $800/Month on LLM Tokens I Didn't Need (And How to Fix It)</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:46:42 +0000</pubDate>
      <link>https://forem.com/natevoss/how-i-accidentally-spent-800month-on-llm-tokens-i-didnt-need-and-how-to-fix-it-oi7</link>
      <guid>https://forem.com/natevoss/how-i-accidentally-spent-800month-on-llm-tokens-i-didnt-need-and-how-to-fix-it-oi7</guid>
      <description>&lt;p&gt;I spent six weeks shipping the wrong thing.&lt;/p&gt;

&lt;p&gt;I built PromptFuel because I was hemorrhaging money on API calls. Not because I was building at scale—I wasn't. I was building &lt;em&gt;dumb&lt;/em&gt;. I'd write a prompt in isolation, test it once, ship it, then wonder why my OpenAI bill jumped $200. Turns out I was doing things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asking GPT-4 to write validation logic that Haiku could handle just fine&lt;/li&gt;
&lt;li&gt;Sending full context windows when 30% of it was redundant&lt;/li&gt;
&lt;li&gt;Retrying identical requests with slightly different temperatures instead of picking one and sticking with it&lt;/li&gt;
&lt;li&gt;Including examples in prompts that the model was already trained on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real kicker? None of this was visible. I had no idea which requests were wasteful, which models were overkill for my tasks, or where I was throwing money away. I just had a credit card statement and regret.&lt;/p&gt;

&lt;p&gt;So I built a tool to see what I was actually doing. And then I optimized it. Here's how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Choosing the right model for a job isn't about capabilities. Haiku can validate JSON, classify text, and format output just as well as GPT-4o for most real work. The difference is cost: Haiku costs a fraction as much per token.&lt;/p&gt;

&lt;p&gt;But without visibility, you default to the expensive one. Because it's safe. Because you can't see the waste.&lt;/p&gt;

&lt;p&gt;After I started measuring, I found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35% of my requests didn't need GPT-4o.&lt;/strong&gt; They were hitting it because it was the default, not because it was the right tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20% of my prompts had bloat.&lt;/strong&gt; Instructions that contradicted each other, examples I copy-pasted but never used, context I included "just in case."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15% of requests were duplicates.&lt;/strong&gt; Same input, same model, within minutes. Caching or batching them would have made those calls essentially free.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: &lt;strong&gt;40% waste.&lt;/strong&gt; $800 → $480. Not revolutionary, but real money for an indie project.&lt;/p&gt;

&lt;p&gt;The fix wasn't rocket science. It was boring infrastructure: measure, analyze, optimize, repeat.&lt;/p&gt;
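
&lt;p&gt;For the duplicate-request problem specifically, even a naive in-memory cache closes the gap. A sketch, not a production pattern (no TTLs, no persistence, and &lt;code&gt;cachedCreate&lt;/code&gt; is a made-up helper name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const cache = new Map();

async function cachedCreate(client, params) {
  // Same model + same messages = same key.
  const key = JSON.stringify(params);
  if (cache.has(key)) {
    return cache.get(key); // duplicate within this process: zero tokens spent
  }
  const response = await client.messages.create(params);
  cache.set(key, response);
  return response;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
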

&lt;h2&gt;
  
  
  Step 1: See What You're Actually Doing
&lt;/h2&gt;

&lt;p&gt;Install PromptFuel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No API keys, no auth, no bullshit. The tool runs locally.&lt;/p&gt;

&lt;p&gt;Now run this on any prompt or code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="s2"&gt;"Your prompt here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or point it at a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize &lt;span class="nt"&gt;--file&lt;/span&gt; my-prompt.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token count&lt;/strong&gt; — exactly what you'll be charged for&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost estimates&lt;/strong&gt; — broken down by model (Haiku, Sonnet, GPT-4o, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization suggestions&lt;/strong&gt; — what you can trim without losing meaning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model recommendations&lt;/strong&gt; — which model actually makes sense for this task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current prompt: 412 tokens

Optimization suggestions:
  - Remove redundant instruction (line 8)
  - Simplify JSON schema example (saves 34 tokens)
  - Collapse repeated context (saves 18 tokens)

Cost per call:
  - GPT-4o: $0.006 (❌ overpowered)
  - Claude 3.5 Sonnet: $0.002 (✓ recommended)
  - Claude 3 Haiku: $0.0004 (✓ if you only need classification)

Estimated monthly (1000 calls):
  - Current setup: $6.12
  - Optimized: $1.84
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the insight. That's what I was missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Understand Your Actual Costs
&lt;/h2&gt;

&lt;p&gt;Open the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your default browser opens to a local dashboard showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All your recent prompts&lt;/strong&gt; and their token counts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost distribution&lt;/strong&gt; — which requests ate the most budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model usage&lt;/strong&gt; — are you using the expensive ones too much?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization opportunities&lt;/strong&gt; — ranked by potential savings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dashboard doesn't need your API keys. It's analyzing local data. But it &lt;em&gt;will&lt;/em&gt; tell you which of your shipped prompts are costing way more than they should.&lt;/p&gt;

&lt;p&gt;Spend 10 minutes here. You'll probably find something you didn't realize you were doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Integrate into Your Stack
&lt;/h2&gt;

&lt;p&gt;Once you see the waste, you'll want to catch it earlier. That's where the SDK and MCP server come in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: JavaScript SDK (for Next.js, Node apps)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @promptfuel/sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PromptOptimizer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@promptfuel/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PromptOptimizer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`You are a helpful assistant...
Classify the following text into categories...
[20 more lines of context you don't actually need]`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`This prompt costs $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costPerCall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Optimized version: $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;optimized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costPerCall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Actually use the optimized version&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;optimizedPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;optimized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
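

&lt;p&gt;To actually act on that in a request path, swap the optimized text into whatever call you were already making. Here's a minimal sketch, assuming the &lt;code&gt;analyze()&lt;/code&gt; shape from the snippet above; the Anthropic client, model name, and &lt;code&gt;classify()&lt;/code&gt; helper are illustrative, not part of PromptFuel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: analyze the draft prompt, then send the trimmed version
// downstream. The Anthropic client and model name are examples only.
import Anthropic from '@anthropic-ai/sdk';
import { PromptOptimizer } from '@promptfuel/sdk';

const optimizer = new PromptOptimizer();
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function classify(text) {
  const draft = `You are a helpful assistant... Classify the following text: ${text}`;
  const analysis = await optimizer.analyze(draft);

  const response = await client.messages.create({
    model: 'claude-3-5-haiku-latest',
    max_tokens: 256,
    messages: [{ role: 'user', content: analysis.optimized.text }],
  });
  return response.content[0].text;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;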



&lt;p&gt;&lt;strong&gt;Option B: Claude Code MCP Server (for use in Claude directly)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're like me and you use Claude for a lot of your thinking, add the PromptFuel MCP server to your Claude Code settings. Then ask Claude directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@promptfuel optimize my prompt for cost

[paste your prompt]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude runs it through PromptFuel's analysis and tells you exactly where you're bleeding money. Then it generates an optimized version.&lt;/p&gt;

&lt;p&gt;Both approaches catch waste before it ships.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened Next
&lt;/h2&gt;

&lt;p&gt;After I actually measured and optimized my stuff, here's what I learned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You don't need the expensive model as often as you think.&lt;/strong&gt; Most of my classification, formatting, and even some reasoning tasks work fine on Haiku.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt bloat is real.&lt;/strong&gt; Every instruction that contradicts another one, every "just in case" example, every "let me explain the context" paragraph adds tokens and confusion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The savings compound.&lt;/strong&gt; I thought I'd save 10%. I saved 40%, because once you see the pattern, you fix it everywhere.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For me: $800 → $480/month. For you, it might be different. But it won't be zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started (Right Now)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install: &lt;code&gt;npm install -g promptfuel&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optimize a single prompt: &lt;code&gt;pf optimize --file your-prompt.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Open the dashboard: &lt;code&gt;pf dashboard&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you like it, integrate the SDK or MCP server into your workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No commitment. No API keys. No upsell. Just a free tool that shows you where your money's going.&lt;/p&gt;

&lt;p&gt;The tool exists because I was tired of guessing. If you are too, give it a try: &lt;a href="https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max" rel="noopener noreferrer"&gt;https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>3 Things I Learned Benchmarking Claude, GPT-4o, and Gemini on Real Dev Work</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Tue, 21 Apr 2026 12:07:05 +0000</pubDate>
      <link>https://forem.com/natevoss/3-things-i-learned-benchmarking-claude-gpt-4o-and-gemini-on-real-dev-work-38fl</link>
      <guid>https://forem.com/natevoss/3-things-i-learned-benchmarking-claude-gpt-4o-and-gemini-on-real-dev-work-38fl</guid>
      <description>&lt;p&gt;If you're still picking LLM providers by gut feeling, you're leaving money on the table. I ran 5 developer use cases through Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 Flash using PromptFuel to measure token usage and cost. The results? More interesting than "fastest wins." Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I took 5 tasks I actually do in PromptFuel development:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;JSON schema validation prompt&lt;/strong&gt; — catch malformed API responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review feedback&lt;/strong&gt; — multi-file analysis with context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring suggestion&lt;/strong&gt; — optimize a chunky utility function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug diagnosis&lt;/strong&gt; — trace through a stack trace with logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation generation&lt;/strong&gt; — write API docs from code comments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each got run through all three models with identical input. I used PromptFuel's CLI to count tokens and calculate costs, because doing this manually is chaos. Output quality was rated by me (subjectively, but honestly).&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. JSON Schema Validation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Schema definition + malformed JSON sample + expected error message format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage (input → output):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 1,847 → 512 (cost: $0.0043)&lt;/li&gt;
&lt;li&gt;GPT-4o: 2,156 → 487 (cost: $0.0082)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 1,923 → 501 (cost: $0.0001)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; All three nailed it. Claude was most concise in its explanation. GPT-4o over-explained. Gemini was crisp and useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini, by cost. Claude, by clarity per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Code Review (3 files, ~200 LOC)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Three TypeScript modules + review instructions + examples of good feedback&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 4,231 → 891 (cost: $0.0147)&lt;/li&gt;
&lt;li&gt;GPT-4o: 4,782 → 856 (cost: $0.0208)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 4,456 → 823 (cost: $0.0003)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude caught subtle issues I actually cared about. GPT-4o was thorough but verbose. Gemini gave surface-level feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cheapest. Claude best output/token.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Refactoring Suggestion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; 80-line utility function + performance requirements + current bottleneck description&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 2,134 → 618 (cost: $0.0054)&lt;/li&gt;
&lt;li&gt;GPT-4o: 2,445 → 602 (cost: $0.0110)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 2,287 → 587 (cost: $0.0002)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude's refactor was production-ready. GPT-4o suggested good ideas but with syntax issues. Gemini's suggestion worked but wasn't elegant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Bug Diagnosis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Stack trace (15 lines) + error logs (20 lines) + code snippet (40 lines) + fixes already attempted&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 2,856 → 445 (cost: $0.0071)&lt;/li&gt;
&lt;li&gt;GPT-4o: 3,102 → 421 (cost: $0.0127)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 2,934 → 438 (cost: $0.0002)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude nailed it immediately. GPT-4o circled around the issue. Gemini flagged the right file but not the root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Documentation Generation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; 12 functions with JSDoc comments + expected markdown format + examples&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 3,445 → 734 (cost: $0.0118)&lt;/li&gt;
&lt;li&gt;GPT-4o: 3,821 → 689 (cost: $0.0182)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 3,567 → 712 (cost: $0.0004)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude's docs were complete and well-structured. GPT-4o's docs were good but needed minor cleanup. Gemini's docs were functional but missing details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude completeness.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 3 Things I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Cost-per-task != best value.&lt;/strong&gt; Gemini Flash is comically cheap (~90% less than GPT-4o), but you get what you pay for. When I needed high-stakes work (code review, bug diagnosis), Claude was worth the extra cents because I didn't have to iterate. For throwaway tasks (generating examples, formatting), Gemini's cost made its mediocrity acceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Token count is not predictive of quality.&lt;/strong&gt; All three models produced similar token counts for the same input, but output quality varied wildly. GPT-4o consistently used more tokens and wasn't proportionally better. Claude packed useful signal into fewer tokens. This matters: if you're optimizing for cost alone, you'll pick the wrong model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Real-world testing beats benchmarks.&lt;/strong&gt; The model rankings flip depending on what you're actually doing. For documentation, Claude wins. For budget validation of a throwaway check, Gemini wins. Generic "fastest model" articles don't capture this. You need to test your actual tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Benchmark Yours
&lt;/h2&gt;

&lt;p&gt;Here's the thing: this comparison is data, not law. Your tasks might shake out differently. Let me show you how I tested this using PromptFuel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install PromptFuel (if you haven't)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel

&lt;span class="c"&gt;# Create a test file with your prompt&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; test-prompt.txt &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
[your prompt here]
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Count tokens across models&lt;/span&gt;
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; claude-3-5-sonnet
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; gemini-2.0-flash

&lt;span class="c"&gt;# Compare costs&lt;/span&gt;
pf count test-prompt.txt &lt;span class="nt"&gt;--compare&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;--compare&lt;/code&gt; flag gives you a cost matrix. Takes 30 seconds. Beats guessing.&lt;/p&gt;

&lt;p&gt;The real insight: &lt;strong&gt;run this for your specific use cases.&lt;/strong&gt; A document summarizer might favor Claude. A high-throughput classification pipeline might favor Gemini. The only way to know is to test.&lt;/p&gt;
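
&lt;p&gt;If you'd rather script this than run the CLI once per prompt, the same numbers are reachable from the &lt;code&gt;@promptfuel/sdk&lt;/code&gt; package. A rough sketch; the &lt;code&gt;costPerCall.gpt4o&lt;/code&gt; field follows the SDK's &lt;code&gt;analyze()&lt;/code&gt; output, and the file names are placeholders for your own prompts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Rough batch version of the comparison: read each prompt file, analyze it,
// and print the estimated per-call cost. File names are placeholders.
import { readFile } from 'node:fs/promises';
import { PromptOptimizer } from '@promptfuel/sdk';

const optimizer = new PromptOptimizer();
const files = ['code-review.txt', 'bug-diagnosis.txt', 'doc-gen.txt'];

for (const file of files) {
  const prompt = await readFile(file, 'utf8');
  const analysis = await optimizer.analyze(prompt);
  console.log(`${file}: $${analysis.costPerCall.gpt4o} per call on GPT-4o`);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;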




&lt;h2&gt;
  
  
  The Real Optimization
&lt;/h2&gt;

&lt;p&gt;After picking your model, there's still money left on the table. Here's a before/after from actual PromptFuel code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (unoptimized prompt):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert code reviewer. Review the following code for quality, security, 
and performance issues. Check for common bugs, suggest improvements, and rate the 
code from 1-10. Consider edge cases, error handling, and best practices. Be thorough 
and detailed in your feedback.

[400 tokens of instructions]
[200 tokens of examples]
[150 tokens of code to review]
Total: ~750 input tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (optimized with PromptFuel):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Review code for quality, security, performance. Rate 1-10.

[Stripped redundant instructions]
[Examples reduced to 1 exemplar instead of 3]
[Code reformatted to remove whitespace]
Total: ~420 input tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost saved: ~$0.0012 per review on Claude. Run that 100 times a day, and you're saving $0.12/day, roughly $44/year. Small? Yes. Multiplied by 50 internal tools? Now you're talking real money.&lt;/p&gt;
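
&lt;p&gt;If it helps, here's that arithmetic as a sketch you can rerun with your own volumes; the per-review saving is the number measured above, everything else is an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Scaling the per-review saving. The saving comes from the before/after
// above; call volume and tool count are illustrative.
const savedPerReview = 0.0012;   // $ saved per review on Claude
const reviewsPerDay = 100;
const internalTools = 50;

const perToolPerYear = savedPerReview * reviewsPerDay * 365;
console.log(perToolPerYear.toFixed(2));                   // ~43.80 per tool, per year
console.log((perToolPerYear * internalTools).toFixed(0)); // ~2190 across 50 tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;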




&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;Pick the model that gives you the output you need, then optimize the prompt. Stop optimizing for the wrong metric. Benchmarks are fun, but production bills are real.&lt;/p&gt;

&lt;p&gt;If you're running this analysis for your own stuff, PromptFuel makes it stupidly easy. It's free, no API keys needed, runs locally. Just &lt;code&gt;npm install -g promptfuel&lt;/code&gt; and compare. If you want the actual numbers from your prompts, run the test. Don't inherit my data — build your own.&lt;/p&gt;

&lt;p&gt;What's your highest-volume LLM task? Test it. You might be surprised which model wins.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #tutorial #javascript #optimization&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>optimization</category>
    </item>
    <item>
      <title>I Tested Claude Haiku, GPT-4o Mini, and Gemini Flash on Real Tasks. Here's What Actually Happened.</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Fri, 17 Apr 2026 06:08:24 +0000</pubDate>
      <link>https://forem.com/natevoss/i-tested-claude-haiku-gpt-4o-mini-and-gemini-flash-on-real-tasks-heres-what-actually-happened-3mc6</link>
      <guid>https://forem.com/natevoss/i-tested-claude-haiku-gpt-4o-mini-and-gemini-flash-on-real-tasks-heres-what-actually-happened-3mc6</guid>
      <description>&lt;h1&gt;
  
  
  I Tested Claude Haiku, GPT-4o Mini, and Gemini Flash on Real Tasks. Here's What Actually Happened.
&lt;/h1&gt;

&lt;p&gt;Every few weeks someone posts a new model comparison and it's always the same: benchmark scores, carefully designed test prompts, neat bar charts. Then you try the "winning" model on your actual workload and something weird happens.&lt;/p&gt;

&lt;p&gt;I've been running all three in production for a few months. Here's what I actually found, including the parts that don't make for clean charts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Pricing Reality Check
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o Mini&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;$0.075&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini Flash is genuinely 13× cheaper than Haiku. That's not a rounding error. Before you immediately migrate everything: keep reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Was Testing
&lt;/h2&gt;

&lt;p&gt;Three real workloads from a side project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Document summarization&lt;/strong&gt; — long PDFs, messy formatting, inconsistent structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON extraction&lt;/strong&gt; — pull structured data from unstructured user input&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code explanation&lt;/strong&gt; — explain what a function does in plain English&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Not "write a poem" or "solve this logic puzzle." Real things an app might do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Summarization: Gemini Flash Wins
&lt;/h2&gt;

&lt;p&gt;Flash's 1M context window is genuinely useful here, not just a spec sheet number. I was chunking documents into pieces to fit smaller context windows. With Flash I stopped doing that.&lt;/p&gt;

&lt;p&gt;It's also significantly cheaper, which matters when you're summarizing hundreds of documents.&lt;/p&gt;

&lt;p&gt;The catch: Flash occasionally invents a section heading that wasn't in the original. It's subtle and sounds plausible, which makes it worse. For internal tools where someone will verify the output, fine. For anything customer-facing, I'd want a review step.&lt;/p&gt;
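
&lt;p&gt;That review step can be cheap. A rough sketch of the kind of guard I mean, assuming markdown-style headings in the summary; the helper and sample strings are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Flag any heading in the summary that never appears in the source text.
// Assumes the summary uses markdown-style "## Heading" lines.
function findInventedHeadings(sourceText, summaryMarkdown) {
  const source = sourceText.toLowerCase();
  return summaryMarkdown
    .split('\n')
    .filter(function (line) { return /^#{1,6}\s+/.test(line); })
    .map(function (line) { return line.replace(/^#{1,6}\s+/, '').trim(); })
    .filter(function (heading) { return !source.includes(heading.toLowerCase()); });
}

// Placeholder strings; swap in your real document and the model's summary.
const sourceDoc = 'Quarterly report. Revenue grew 12 percent. Risks include churn.';
const summary = '## Revenue\nGrew 12 percent.\n\n## Forward Guidance\nNot in the source.';
console.log(findInventedHeadings(sourceDoc, summary)); // [ 'Forward Guidance' ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;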

&lt;h2&gt;
  
  
  JSON Extraction: Haiku Wins, Annoyingly
&lt;/h2&gt;

&lt;p&gt;I really wanted Flash to win here because of the price. It didn't.&lt;/p&gt;

&lt;p&gt;The task: given a paragraph of user-written text, extract structured fields into a specific JSON schema. Claude Haiku followed the schema reliably. Flash followed it most of the time, but occasionally added fields that weren't in the schema, renamed fields it didn't like, or decided to nest things differently.&lt;/p&gt;

&lt;p&gt;Each of these breaks downstream code. The debugging cost per incident outweighed the token savings.&lt;/p&gt;

&lt;p&gt;Haiku is predictable in a way that sounds boring but is exactly what you want when you're processing thousands of records.&lt;/p&gt;
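
&lt;p&gt;This failure mode is also cheap to guard against at the boundary, whichever model produced the output. A minimal sketch of the check I mean; the expected field names are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Accept an extraction only if its keys exactly match the expected schema.
// Field names are hypothetical placeholders for your own schema.
const EXPECTED_KEYS = ['name', 'email', 'company', 'intent'];

function isSchemaCompliant(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return false; // not even valid JSON
  }
  const keys = Object.keys(parsed).sort();
  return JSON.stringify(keys) === JSON.stringify(EXPECTED_KEYS.slice().sort());
}

// Anything that fails goes to a retry (or a stricter model) instead of
// silently breaking downstream code.
console.log(isSchemaCompliant('{"name":"Ada","email":"a@b.co","company":"X","intent":"demo"}')); // true
console.log(isSchemaCompliant('{"name":"Ada","contact_email":"a@b.co"}'));                      // false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;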

&lt;h2&gt;
  
  
  Code Explanation: GPT-4o Mini Wins
&lt;/h2&gt;

&lt;p&gt;This one surprised me. GPT-4o Mini is clearly well-trained on code-related tasks. Its explanations are accurate and well-structured. It also handled edge cases (unusual syntax, legacy patterns) better than the other two.&lt;/p&gt;

&lt;p&gt;For anything code-adjacent, it's what I reach for first now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Use
&lt;/h2&gt;

&lt;p&gt;I stopped trying to find one winner and started routing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;pickModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;inputTokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;summarize&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;inputTokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-1.5-flash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;extract_json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;explain_code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// safe default&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check price differences before committing to a migration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
promptfuel compare claude-haiku-4-5 gpt-4o-mini gemini-1.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Actual Bottom Line
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cheapest by far&lt;/strong&gt;: Gemini Flash. Use it for volume tasks where edge cases are acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most format-reliable&lt;/strong&gt;: Claude Haiku. Use it when you need strict schema compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best at code&lt;/strong&gt;: GPT-4o Mini. Use it for anything developer-facing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest answer is: don't pick one. Pick based on the task. The cost difference between routing intelligently versus defaulting to one model is real, and it compounds over time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>optimization</category>
    </item>
  </channel>
</rss>
