<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nate Voss</title>
    <description>The latest articles on Forem by Nate Voss (@natevoss).</description>
    <link>https://forem.com/natevoss</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839442%2F8d858ab2-90a7-47dd-b9ee-fa8c73b19227.png</url>
      <title>Forem: Nate Voss</title>
      <link>https://forem.com/natevoss</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/natevoss"/>
    <language>en</language>
    <item>
      <title>Model Routing: 3 Things I Learned Sending Tasks to the Cheapest Model That Actually Works</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Mon, 04 May 2026 06:54:30 +0000</pubDate>
      <link>https://forem.com/natevoss/model-routing-3-things-i-learned-sending-tasks-to-the-cheapest-model-that-actually-works-4e31</link>
      <guid>https://forem.com/natevoss/model-routing-3-things-i-learned-sending-tasks-to-the-cheapest-model-that-actually-works-4e31</guid>
      <description>&lt;p&gt;Everyone benchmarks models. Sonnet beats Haiku on reasoning. Opus beats Sonnet. Haiku is fastest. These things are all true.&lt;/p&gt;

&lt;p&gt;But benchmarking and deploying are different games. At scale, the difference between Haiku at $0.80/million tokens and Sonnet at $3/million tokens isn't academic. It's $400+ monthly on a mid-size application. The trap is paying for capability you don't actually need because you never measured what you do need.&lt;/p&gt;

&lt;p&gt;I built a router to answer one question: which tasks in my actual workflow could run on the cheapest model without failing? The answer surprised me. And I learned that the real value isn't the savings. It's the forcing function. You can't implement routing without auditing exactly where your complexity lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  3 Things I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Your Intuition About Task Complexity Is Backwards
&lt;/h3&gt;

&lt;p&gt;You think something needs Sonnet. Your gut says: "this requires reasoning, obviously expensive model."&lt;/p&gt;

&lt;p&gt;So I measured. Content classification? Haiku handles 95% of real requests. Writing summaries? 88%. Extracting structured data? 92%. The edge cases that needed Sonnet were smaller than I'd guessed. And they were always the same types of edge cases.&lt;/p&gt;

&lt;p&gt;Here's the pattern I found: obvious cases are &lt;strong&gt;really&lt;/strong&gt; obvious to Haiku. Spam detection, data validation, simple extractions. Haiku nails these. The failures cluster in a small, identifiable category: cases where even the human answer would be ambiguous. That's when you need Sonnet's nuance.&lt;/p&gt;

&lt;p&gt;But you don't know your edge case percentage until you try. Guessing leaves money on the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. You Need Observability Before Routing Saves Anything
&lt;/h3&gt;

&lt;p&gt;The instinct is to build the router first. "Let's write logic that detects complex requests and routes to Sonnet."&lt;/p&gt;

&lt;p&gt;This is backward. You need to measure first. Log every task with both Haiku and Sonnet responses side-by-side. Compare them. Find the patterns.&lt;/p&gt;

&lt;p&gt;Real questions to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When did Haiku refuse a task that Sonnet handled?&lt;/li&gt;
&lt;li&gt;How often do their answers differ, and which one was right?&lt;/li&gt;
&lt;li&gt;Was Haiku just uncertain, or actually wrong?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This requires instrumenting your inference layer. It takes a week. But you can't optimize what you can't see. Most teams skip this and build routers on intuition, which is why their routers are fragile.&lt;/p&gt;
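
&lt;p&gt;A minimal sketch of that measurement step, assuming the same Anthropic SDK used later in this post (the &lt;code&gt;shadowCompare&lt;/code&gt; helper and its crude string-equality check are illustrations, not a library API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function ask(model, prompt) {
  const response = await client.messages.create({
    model,
    max_tokens: 200,
    messages: [{ role: "user", content: prompt }]
  });
  return response.content[0].text;
}

// Development-only harness: run the same task through both models and log the pair.
async function shadowCompare(prompt) {
  const [haiku, sonnet] = await Promise.all([
    ask("claude-3-5-haiku-20241022", prompt),
    ask("claude-3-5-sonnet-20241022", prompt)
  ]);

  // Append to a log you can audit later: where do the answers diverge?
  console.log(JSON.stringify({
    prompt: prompt.slice(0, 80),
    haiku,
    sonnet,
    agree: haiku === sonnet
  }));

  return { haiku, sonnet };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
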

&lt;h3&gt;
  
  
  3. Routing Rules Should Be Dumb, Not Smart
&lt;/h3&gt;

&lt;p&gt;The temptation: build a classifier that predicts task complexity. Input length heuristics, keyword matching, embedding similarity. Something sophisticated.&lt;/p&gt;

&lt;p&gt;Don't. Use a simple rule: &lt;strong&gt;"If the model reports low confidence, escalate to Sonnet."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This separates the decision from the task. Haiku tells you when it's uncertain. That's a signal you can act on immediately, without needing to predict the future.&lt;/p&gt;

&lt;p&gt;The dumb rule wins because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It adapts as your tasks change (no retraining)&lt;/li&gt;
&lt;li&gt;It's testable (you can verify the confidence threshold)&lt;/li&gt;
&lt;li&gt;It fails safely (escalation costs more but works)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The smart rule loses because the routing logic itself becomes load-bearing infrastructure: it requires constant tuning and breaks when your data distribution shifts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;classifyWithFallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;confidenceThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="c1"&gt;// First pass: try Haiku (cheap, fast)&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;haikuResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-3-5-haiku-20241022&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"`&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

 &lt;span class="c1"&gt;// Log all Haiku decisions (even successes)&lt;/span&gt;
 &lt;span class="c1"&gt;// You're building a dataset of "when does Haiku work?"&lt;/span&gt;
 &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;haiku&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;tokensUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
 &lt;span class="nx"&gt;haikuResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="c1"&gt;// If Haiku is unsure, escalate to Sonnet&lt;/span&gt;
 &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;confidenceThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"`&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonnet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;escalatedFrom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;haiku&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;tokensUsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
 &lt;span class="nx"&gt;sonnetResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;sonnetResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;haikuResult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Run both models in parallel during development and log the results. In production, start with Haiku, escalate on low confidence. As your logs accumulate, you'll see exactly which tasks need expensive models and which don't.&lt;/p&gt;
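
&lt;p&gt;Once those log lines pile up, a few lines of analysis tell you your real escalation rate. A rough sketch, assuming the entries above were written as one JSON object per line to a file (the &lt;code&gt;router.log&lt;/code&gt; name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { readFileSync } from "node:fs";

// One JSON object per line, as logged by the router above.
const entries = readFileSync("router.log", "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) =&gt; JSON.parse(line));

// Every request logs a Haiku entry; escalations log an extra Sonnet entry.
const haikuCalls = entries.filter((e) =&gt; e.model === "haiku").length;
const sonnetCalls = entries.filter((e) =&gt; e.model === "sonnet").length;

console.log(`requests: ${haikuCalls}`);
console.log(`escalated to Sonnet: ${sonnetCalls} (${((sonnetCalls / haikuCalls) * 100).toFixed(1)}%)`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
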

&lt;h2&gt;
  
  
  The Math
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Haiku: $0.80 per 1M input tokens
Sonnet: $3 per 1M input tokens

Scenario: 1M requests/month, 200 input tokens average
- All Sonnet: 1M × 200 = 200M tokens at $3/1M = $600
- Routed (95% resolved by Haiku):
    Haiku first pass on all 1M requests: 200M tokens at $0.80/1M = $160
    Sonnet on the 5% that escalate: 50k × 200 = 10M tokens at $3/1M = $30
    Total: $190
- Savings: ~$410/month

At enterprise scale (100M requests/month), the same split saves roughly $41,000/month by routing to the cheapest viable model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost difference compounds. Small routing decisions get multiplied across thousands of requests.&lt;/p&gt;
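
&lt;p&gt;If you want to plug in your own numbers, the arithmetic above fits in a few lines. A sketch, with the per-million input-token prices quoted in this post hard-coded (swap in current pricing before trusting the output):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Per 1M input tokens, as quoted above.
const PRICE_PER_M = { haiku: 0.80, sonnet: 3.00 };

function monthlyCost(requests, tokensPerRequest, escalationRate) {
  const millionsOfTokens = (requests * tokensPerRequest) / 1_000_000;
  const allSonnet = millionsOfTokens * PRICE_PER_M.sonnet;
  // Routed: every request pays the Haiku pass; escalations also pay Sonnet.
  const routed =
    millionsOfTokens * PRICE_PER_M.haiku +
    millionsOfTokens * escalationRate * PRICE_PER_M.sonnet;
  return { allSonnet, routed, savings: allSonnet - routed };
}

console.log(monthlyCost(1_000_000, 200, 0.05));
// { allSonnet: 600, routed: 190, savings: 410 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
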

&lt;h2&gt;
  
  
  One Common Pitfall
&lt;/h2&gt;

&lt;p&gt;You'll build a sophisticated router and wonder why it doesn't move the needle. Usually because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You spent three months on routing logic but only one week validating it&lt;/li&gt;
&lt;li&gt;The escalation threshold is too aggressive ("if anything looks hard, use Sonnet")&lt;/li&gt;
&lt;li&gt;You're routing on heuristics, not observed behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix: measure first, always. Log both models' responses in parallel before committing to either one. You'll find that the obvious cases are really obvious, and the edge cases are smaller than you think.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Routing Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have &amp;gt;100k requests/month (smaller volume doesn't justify overhead)&lt;/li&gt;
&lt;li&gt;Your requests fall into clusters (some are cheap tasks, some are hard)&lt;/li&gt;
&lt;li&gt;You can measure ground truth (compare Haiku vs Sonnet, track which was right)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't build it if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;lt;10k requests/month (infrastructure overhead isn't worth it)&lt;/li&gt;
&lt;li&gt;Every request is unique and complex (no pattern to exploit)&lt;/li&gt;
&lt;li&gt;You need 99.9% accuracy (can't tolerate Haiku failures)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Win
&lt;/h2&gt;

&lt;p&gt;The cost savings are real. But the bigger win is the audit itself. Building a router forces you to measure exactly where your complexity actually lives. Most teams overthink what they need because they never measure. The router is just the excuse to finally look.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>3 Things I Learned Auditing My LLM App's Token Spend (And Why Your Benchmarks Are Lying)</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Mon, 27 Apr 2026 08:04:50 +0000</pubDate>
      <link>https://forem.com/natevoss/3-things-i-learned-auditing-my-llm-apps-token-spend-and-why-your-benchmarks-are-lying-3nbi</link>
      <guid>https://forem.com/natevoss/3-things-i-learned-auditing-my-llm-apps-token-spend-and-why-your-benchmarks-are-lying-3nbi</guid>
      <description>&lt;p&gt;You know that feeling when you ship an AI feature and realize your token bill is 3x what you estimated? Yeah, that was me last week.&lt;/p&gt;

&lt;p&gt;I have this thing called Agent-Max — it's a multi-platform growth agent that runs autonomous workflows: generating content, publishing to Bluesky, Medium, Twitter, Reddit. Sounds heavy, right? Every Monday it synthesizes a week of reading, scrapes engagement metrics, decides what to post and where. Seven platforms. Infinite LLM calls if you're not paying attention.&lt;/p&gt;

&lt;p&gt;Last Sunday I realized I had no idea what I was actually spending. I knew &lt;em&gt;roughly&lt;/em&gt; — "somewhere between $5 and $20 a week" — but roughly is how you end up with bill shock. So I built PromptFuel to solve the actual problem: measure what your app is doing, not what the docs say it &lt;em&gt;should&lt;/em&gt; do.&lt;/p&gt;

&lt;p&gt;Here's what three days of auditing my own code taught me.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Your bottleneck isn't the model you picked, it's the prompt you didn't trim
&lt;/h2&gt;

&lt;p&gt;I assumed my biggest cost sink was the weekly reflection. Claude reads 7 days of snapshots, engagement data, content history, trend analysis, then reasons about next week's strategy. Heavy prompt, right?&lt;/p&gt;

&lt;p&gt;Nope.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;pf optimize&lt;/code&gt; on the actual prompts showed the reflection was 2,847 tokens. Not small, but fine. The real killer: the daily content pregeneration loop was calling Claude 5 times per platform, and each call had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The entire engagement history (redundant; I fetch fresh data every run)&lt;/li&gt;
&lt;li&gt;Every. Single. Previous. Post. (all 120 of them, in the context)&lt;/li&gt;
&lt;li&gt;Current date, weather, trending topics (reloaded every call)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cutting history to "last 10 posts, last 3 days of engagement" knocked 40% off. Not because I switched models. Because I stopped hallucinating I needed context I wasn't even &lt;em&gt;reading&lt;/em&gt;.&lt;/p&gt;
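
&lt;p&gt;The trim itself is unglamorous. A minimal sketch of the idea (the data shapes here are made up for illustration; the point is capping what goes into every prompt):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical context builder: keep only what the model actually needs.
function buildContext(allPosts, engagementByDay) {
  const recentPosts = allPosts.slice(-10);            // last 10 posts, not all 120
  const recentEngagement = engagementByDay.slice(-3); // last 3 days, not the full history

  return [
    "Recent posts:",
    ...recentPosts.map((p) =&gt; `- ${p.title} (${p.platform})`),
    "Engagement, last 3 days:",
    ...recentEngagement.map((d) =&gt; `- ${d.date}: ${d.impressions} impressions`)
  ].join("\n");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
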

&lt;h2&gt;
  
  
  2. Your audit will surface the dumb stuff, not the obvious stuff
&lt;/h2&gt;

&lt;p&gt;Benchmarks tell you Sonnet costs $3 per 1M input tokens and Haiku costs $0.80. Pick the right model, do the math, move on.&lt;/p&gt;

&lt;p&gt;Except I was calling Claude Sonnet 7 times/week on background analytics where Haiku was plenty. Not intentional. I'd copied the model from an earlier prompt and never thought about it again. One-line change, zero quality loss, $2 saved per month.&lt;/p&gt;

&lt;p&gt;That math &lt;em&gt;never&lt;/em&gt; shows up in a benchmark. It shows up in your actual codebase, on your actual data, running your actual job. PromptFuel's advantage isn't telling you models are expensive. It's finding the calls you forgot about and showing you the before/after side-by-side.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Once you see the numbers, the optimization loop becomes obvious
&lt;/h2&gt;

&lt;p&gt;The first time I ran the dashboard, I thought I was done. Then Monday's weekly job ran and I watched 47 new prompts execute, with the dashboard updating in real time. I saw the pattern, and there was another cut to make.&lt;/p&gt;

&lt;p&gt;Auditing once is useful. Auditing every week is how you stop bleeding money.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's walk through it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run pf optimize&lt;/strong&gt; on a real prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize ./src/prompts/reflect.md &lt;span class="nt"&gt;--model&lt;/span&gt; claude-3-5-sonnet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see token count, cost per call, and a readability score. More importantly, you'll see where the redundancy is hiding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open the dashboard&lt;/strong&gt; to watch prompts in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf dashboard &lt;span class="nt"&gt;--watch&lt;/span&gt; ./src/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard opens on port 3000. Every time you call an LLM, you see it logged: model, input tokens, output tokens, cost, latency. No guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For production, wire up the SDK:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PromptFuel&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;promptfuel/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PromptFuel&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrapClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your prompt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Automatically tracked. One line changes nothing&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getMetrics&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; 
&lt;span class="c1"&gt;// { totalTokens: 342, totalCost: $0.008, calls: 1 }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real numbers
&lt;/h2&gt;

&lt;p&gt;Agent-Max before: ~1,847 tokens/week across all platforms.&lt;/p&gt;

&lt;p&gt;Agent-Max after (trimmed + downgraded safe calls to Haiku): 1,094 tokens/week.&lt;/p&gt;

&lt;p&gt;40% reduction. No quality loss. Three hours to audit and implement.&lt;/p&gt;

&lt;p&gt;That's not a benchmark. That's a real app, real prompts, real data.&lt;/p&gt;




&lt;p&gt;Stop guessing about your token spend. Measure what you're actually doing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max" rel="noopener noreferrer"&gt;https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How I Accidentally Spent $800/Month on LLM Tokens I Didn't Need (And How to Fix It)</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:46:42 +0000</pubDate>
      <link>https://forem.com/natevoss/how-i-accidentally-spent-800month-on-llm-tokens-i-didnt-need-and-how-to-fix-it-oi7</link>
      <guid>https://forem.com/natevoss/how-i-accidentally-spent-800month-on-llm-tokens-i-didnt-need-and-how-to-fix-it-oi7</guid>
      <description>&lt;p&gt;I spent six weeks shipping the wrong thing.&lt;/p&gt;

&lt;p&gt;I built PromptFuel because I was hemorrhaging money on API calls. Not because I was building at scale—I wasn't. I was building &lt;em&gt;dumb&lt;/em&gt;. I'd write a prompt in isolation, test it once, ship it, then wonder why my OpenAI bill jumped $200. Turns out I was doing things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asking GPT-4 to write validation logic that Haiku could handle just fine&lt;/li&gt;
&lt;li&gt;Sending full context windows when 30% of it was redundant&lt;/li&gt;
&lt;li&gt;Retrying identical requests with slightly different temperatures instead of picking one and sticking with it&lt;/li&gt;
&lt;li&gt;Including examples in prompts that the model was already trained on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real kicker? None of this was visible. I had no idea which requests were wasteful, which models were overkill for my tasks, or where I was throwing money away. I just had a credit card statement and regret.&lt;/p&gt;

&lt;p&gt;So I built a tool to see what I was actually doing. And then I optimized it. Here's how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Choosing the right model for a job isn't about capabilities. Haiku can validate JSON, classify text, and format output just as well as GPT-4o for most real work. The difference is cost: Haiku costs a fraction as much per token.&lt;/p&gt;

&lt;p&gt;But without visibility, you default to the expensive one. Because it's safe. Because you can't see the waste.&lt;/p&gt;

&lt;p&gt;After I started measuring, I found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;35% of my requests didn't need GPT-4o.&lt;/strong&gt; They were hitting it because it was the default, not because it was the right tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20% of my prompts had bloat.&lt;/strong&gt; Instructions that contradicted each other, examples I copy-pasted but never used, context I included "just in case."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15% of requests were duplicates.&lt;/strong&gt; Same input, same model, within minutes. Caching or batching them would have made those calls essentially free.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: &lt;strong&gt;40% waste.&lt;/strong&gt; $800 → $480. Not revolutionary, but real money for an indie project.&lt;/p&gt;

&lt;p&gt;The fix wasn't rocket science. It was boring infrastructure: measure, analyze, optimize, repeat.&lt;/p&gt;
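
&lt;p&gt;For the duplicate-request problem specifically, even a naive in-memory cache closes the gap. A sketch, not a production pattern (no TTLs, no persistence, and &lt;code&gt;cachedCreate&lt;/code&gt; is a made-up helper name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const cache = new Map();

async function cachedCreate(client, params) {
  // Same model + same messages = same key.
  const key = JSON.stringify(params);
  if (cache.has(key)) {
    return cache.get(key); // duplicate within this process: zero tokens spent
  }
  const response = await client.messages.create(params);
  cache.set(key, response);
  return response;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
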

&lt;h2&gt;
  
  
  Step 1: See What You're Actually Doing
&lt;/h2&gt;

&lt;p&gt;Install PromptFuel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No API keys, no auth, no bullshit. The tool runs locally.&lt;/p&gt;

&lt;p&gt;Now run this on any prompt or code snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="s2"&gt;"Your prompt here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or point it at a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf optimize &lt;span class="nt"&gt;--file&lt;/span&gt; my-prompt.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token count&lt;/strong&gt; — exactly what you'll be charged for&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost estimates&lt;/strong&gt; — broken down by model (Haiku, Sonnet, GPT-4o, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization suggestions&lt;/strong&gt; — what you can trim without losing meaning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model recommendations&lt;/strong&gt; — which model actually makes sense for this task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current prompt: 412 tokens

Optimization suggestions:
  - Remove redundant instruction (line 8)
  - Simplify JSON schema example (saves 34 tokens)
  - Collapse repeated context (saves 18 tokens)

Cost per call:
  - GPT-4o: $0.006 (❌ overpowered)
  - Claude 3.5 Sonnet: $0.002 (✓ recommended)
  - Claude 3 Haiku: $0.0004 (✓ if you only need classification)

Estimated monthly (1000 calls):
  - Current setup: $6.12
  - Optimized: $1.84
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the insight. That's what I was missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Understand Your Actual Costs
&lt;/h2&gt;

&lt;p&gt;Open the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pf dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your default browser opens to a local dashboard showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All your recent prompts&lt;/strong&gt; and their token counts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost distribution&lt;/strong&gt; — which requests ate the most budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model usage&lt;/strong&gt; — are you using the expensive ones too much?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization opportunities&lt;/strong&gt; — ranked by potential savings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dashboard doesn't need your API keys. It's analyzing local data. But it &lt;em&gt;will&lt;/em&gt; tell you which of your shipped prompts are costing way more than they should.&lt;/p&gt;

&lt;p&gt;Spend 10 minutes here. You'll probably find something you didn't realize you were doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Integrate into Your Stack
&lt;/h2&gt;

&lt;p&gt;Once you see the waste, you'll want to catch it earlier. That's where the SDK and MCP server come in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: JavaScript SDK (for Next.js, Node apps)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @promptfuel/sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PromptOptimizer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@promptfuel/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PromptOptimizer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`You are a helpful assistant...
Classify the following text into categories...
[20 more lines of context you don't actually need]`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`This prompt costs $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costPerCall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Optimized version: $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;optimized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costPerCall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gpt4o&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Actually use the optimized version&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;optimizedPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;optimized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
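

&lt;p&gt;To actually act on that in a request path, swap the optimized text into whatever call you were already making. Here's a minimal sketch, assuming the &lt;code&gt;analyze()&lt;/code&gt; shape from the snippet above; the Anthropic client, model name, and &lt;code&gt;classify()&lt;/code&gt; helper are illustrative, not part of PromptFuel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: analyze the draft prompt, then send the trimmed version
// downstream. The Anthropic client and model name are examples only.
import Anthropic from '@anthropic-ai/sdk';
import { PromptOptimizer } from '@promptfuel/sdk';

const optimizer = new PromptOptimizer();
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function classify(text) {
  const draft = `You are a helpful assistant... Classify the following text: ${text}`;
  const analysis = await optimizer.analyze(draft);

  const response = await client.messages.create({
    model: 'claude-3-5-haiku-latest',
    max_tokens: 256,
    messages: [{ role: 'user', content: analysis.optimized.text }],
  });
  return response.content[0].text;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;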



&lt;p&gt;&lt;strong&gt;Option B: Claude Code MCP Server (for use in Claude directly)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're like me and you use Claude for a lot of your thinking, add the PromptFuel MCP server to your Claude Code settings. Then ask Claude directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@promptfuel optimize my prompt for cost

[paste your prompt]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude runs it through PromptFuel's analysis and tells you exactly where you're bleeding money. Then it generates an optimized version.&lt;/p&gt;

&lt;p&gt;Both approaches catch waste before it ships.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened Next
&lt;/h2&gt;

&lt;p&gt;After I actually measured and optimized my stuff, here's what I learned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You don't need the expensive model as often as you think.&lt;/strong&gt; Most of my classification, formatting, and even some reasoning tasks work fine on Haiku.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt bloat is real.&lt;/strong&gt; Every instruction that contradicts another one, every "just in case" example, every "let me explain the context" paragraph adds tokens and confusion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The savings compound.&lt;/strong&gt; I thought I'd save 10%. I saved 40%, because once you see the pattern, you fix it everywhere.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For me: $800 → $480/month. For you, it might be different. But it won't be zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started (Right Now)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install: &lt;code&gt;npm install -g promptfuel&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optimize a single prompt: &lt;code&gt;pf optimize --file your-prompt.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Open the dashboard: &lt;code&gt;pf dashboard&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you like it, integrate the SDK or MCP server into your workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No commitment. No API keys. No upsell. Just a free tool that shows you where your money's going.&lt;/p&gt;

&lt;p&gt;The tool exists because I was tired of guessing. If you are too, give it a try: &lt;a href="https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max" rel="noopener noreferrer"&gt;https://promptfuel.vercel.app?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=max&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>3 Things I Learned Benchmarking Claude, GPT-4o, and Gemini on Real Dev Work</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Tue, 21 Apr 2026 12:07:05 +0000</pubDate>
      <link>https://forem.com/natevoss/3-things-i-learned-benchmarking-claude-gpt-4o-and-gemini-on-real-dev-work-38fl</link>
      <guid>https://forem.com/natevoss/3-things-i-learned-benchmarking-claude-gpt-4o-and-gemini-on-real-dev-work-38fl</guid>
      <description>&lt;p&gt;If you're still picking LLM providers by gut feeling, you're leaving money on the table. I ran 5 developer use cases through Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 Flash using PromptFuel to measure token usage and cost. The results? More interesting than "fastest wins." Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I took 5 tasks I actually do in PromptFuel development:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;JSON schema validation prompt&lt;/strong&gt; — catch malformed API responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review feedback&lt;/strong&gt; — multi-file analysis with context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring suggestion&lt;/strong&gt; — optimize a chunky utility function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug diagnosis&lt;/strong&gt; — trace through a stack trace with logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation generation&lt;/strong&gt; — write API docs from code comments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each got run through all three models with identical input. I used PromptFuel's CLI to count tokens and calculate costs, because doing this manually is chaos. Output quality was rated by me (subjectively, but honestly).&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. JSON Schema Validation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Schema definition + malformed JSON sample + expected error message format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage (input → output):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 1,847 → 512 (cost: $0.0043)&lt;/li&gt;
&lt;li&gt;GPT-4o: 2,156 → 487 (cost: $0.0082)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 1,923 → 501 (cost: $0.0001)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; All three nailed it. Claude was most concise in its explanation. GPT-4o over-explained. Gemini was crisp and useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini, by cost. Claude, by clarity per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Code Review (3 files, ~200 LOC)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Three TypeScript modules + review instructions + examples of good feedback&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 4,231 → 891 (cost: $0.0147)&lt;/li&gt;
&lt;li&gt;GPT-4o: 4,782 → 856 (cost: $0.0208)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 4,456 → 823 (cost: $0.0003)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude caught subtle issues I actually cared about. GPT-4o was thorough but verbose. Gemini gave surface-level feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cheapest. Claude best output/token.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Refactoring Suggestion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; 80-line utility function + performance requirements + current bottleneck description&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 2,134 → 618 (cost: $0.0054)&lt;/li&gt;
&lt;li&gt;GPT-4o: 2,445 → 602 (cost: $0.0110)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 2,287 → 587 (cost: $0.0002)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude's refactor was production-ready. GPT-4o suggested good ideas but with syntax issues. Gemini's suggestion worked but wasn't elegant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Bug Diagnosis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Stack trace (15 lines) + error logs (20 lines) + code snippet (40 lines) + fixes already attempted&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 2,856 → 445 (cost: $0.0071)&lt;/li&gt;
&lt;li&gt;GPT-4o: 3,102 → 421 (cost: $0.0127)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 2,934 → 438 (cost: $0.0002)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude nailed it immediately. GPT-4o circled around the issue. Gemini flagged the right file but not the root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Documentation Generation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; 12 functions with JSDoc comments + expected markdown format + examples&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet: 3,445 → 734 (cost: $0.0118)&lt;/li&gt;
&lt;li&gt;GPT-4o: 3,821 → 689 (cost: $0.0182)&lt;/li&gt;
&lt;li&gt;Gemini Flash: 3,567 → 712 (cost: $0.0004)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quality:&lt;/strong&gt; Claude's docs were complete and well-structured. GPT-4o's docs were good but needed minor cleanup. Gemini's docs were functional but missing details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency win:&lt;/strong&gt; Gemini cost, Claude completeness.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 3 Things I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Cost-per-task != best value.&lt;/strong&gt; Gemini Flash is comically cheap (~90% less than GPT-4o), but you get what you pay for. When I needed high-stakes work (code review, bug diagnosis), Claude was worth the extra cents because I didn't have to iterate. For throwaway tasks (generating examples, formatting), Gemini's cost made its mediocrity acceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Token count is not predictive of quality.&lt;/strong&gt; All three models produced similar token counts for the same input, but output quality varied wildly. GPT-4o consistently used more tokens and wasn't proportionally better. Claude packed useful signal into fewer tokens. This matters: if you're optimizing for cost alone, you'll pick the wrong model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Real-world testing beats benchmarks.&lt;/strong&gt; The model rankings flip depending on what you're actually doing. For documentation, Claude wins. For budget validation of a throwaway check, Gemini wins. Generic "fastest model" articles don't capture this. You need to test your actual tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Benchmark Yours
&lt;/h2&gt;

&lt;p&gt;Here's the thing: this comparison is data, not law. Your tasks might shake out differently. Let me show you how I tested this using PromptFuel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install PromptFuel (if you haven't)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel

&lt;span class="c"&gt;# Create a test file with your prompt&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; test-prompt.txt &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
[your prompt here]
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Count tokens across models&lt;/span&gt;
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; claude-3-5-sonnet
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o
pf count test-prompt.txt &lt;span class="nt"&gt;--model&lt;/span&gt; gemini-2.0-flash

&lt;span class="c"&gt;# Compare costs&lt;/span&gt;
pf count test-prompt.txt &lt;span class="nt"&gt;--compare&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;--compare&lt;/code&gt; flag gives you a cost matrix. Takes 30 seconds. Beats guessing.&lt;/p&gt;

&lt;p&gt;The real insight: &lt;strong&gt;run this for your specific use cases.&lt;/strong&gt; A document summarizer might favor Claude. A high-throughput classification pipeline might favor Gemini. The only way to know is to test.&lt;/p&gt;
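
&lt;p&gt;If you'd rather script this than run the CLI once per prompt, the same numbers are reachable from the &lt;code&gt;@promptfuel/sdk&lt;/code&gt; package. A rough sketch; the &lt;code&gt;costPerCall.gpt4o&lt;/code&gt; field follows the SDK's &lt;code&gt;analyze()&lt;/code&gt; output, and the file names are placeholders for your own prompts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Rough batch version of the comparison: read each prompt file, analyze it,
// and print the estimated per-call cost. File names are placeholders.
import { readFile } from 'node:fs/promises';
import { PromptOptimizer } from '@promptfuel/sdk';

const optimizer = new PromptOptimizer();
const files = ['code-review.txt', 'bug-diagnosis.txt', 'doc-gen.txt'];

for (const file of files) {
  const prompt = await readFile(file, 'utf8');
  const analysis = await optimizer.analyze(prompt);
  console.log(`${file}: $${analysis.costPerCall.gpt4o} per call on GPT-4o`);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;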




&lt;h2&gt;
  
  
  The Real Optimization
&lt;/h2&gt;

&lt;p&gt;After picking your model, there's still money left on the table. Here's a before/after from actual PromptFuel code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (unoptimized prompt):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert code reviewer. Review the following code for quality, security, 
and performance issues. Check for common bugs, suggest improvements, and rate the 
code from 1-10. Consider edge cases, error handling, and best practices. Be thorough 
and detailed in your feedback.

[400 tokens of instructions]
[200 tokens of examples]
[150 tokens of code to review]
Total: ~750 input tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (optimized with PromptFuel):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Review code for quality, security, performance. Rate 1-10.

[Stripped redundant instructions]
[Examples reduced to 1 exemplar instead of 3]
[Code reformatted to remove whitespace]
Total: ~420 input tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost saved: ~$0.0012 per review on Claude. Run that 100 times a day, and you're saving $0.12/day, roughly $44/year. Small? Yes. Multiplied by 50 internal tools? Now you're talking real money.&lt;/p&gt;
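
&lt;p&gt;If it helps, here's that arithmetic as a sketch you can rerun with your own volumes; the per-review saving is the number measured above, everything else is an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Scaling the per-review saving. The saving comes from the before/after
// above; call volume and tool count are illustrative.
const savedPerReview = 0.0012;   // $ saved per review on Claude
const reviewsPerDay = 100;
const internalTools = 50;

const perToolPerYear = savedPerReview * reviewsPerDay * 365;
console.log(perToolPerYear.toFixed(2));                   // ~43.80 per tool, per year
console.log((perToolPerYear * internalTools).toFixed(0)); // ~2190 across 50 tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;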




&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;Pick the model that gives you the output you need, then optimize the prompt. Stop optimizing for the wrong metric. Benchmarks are fun, but production bills are real.&lt;/p&gt;

&lt;p&gt;If you're running this analysis for your own stuff, PromptFuel makes it stupidly easy. It's free, no API keys needed, runs locally. Just &lt;code&gt;npm install -g promptfuel&lt;/code&gt; and compare. If you want the actual numbers from your prompts, run the test. Don't inherit my data — build your own.&lt;/p&gt;

&lt;p&gt;What's your highest-volume LLM task? Test it. You might be surprised which model wins.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #tutorial #javascript #optimization&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>optimization</category>
    </item>
    <item>
      <title>I Tested Claude Haiku, GPT-4o Mini, and Gemini Flash on Real Tasks. Here's What Actually Happened.</title>
      <dc:creator>Nate Voss</dc:creator>
      <pubDate>Fri, 17 Apr 2026 06:08:24 +0000</pubDate>
      <link>https://forem.com/natevoss/i-tested-claude-haiku-gpt-4o-mini-and-gemini-flash-on-real-tasks-heres-what-actually-happened-3mc6</link>
      <guid>https://forem.com/natevoss/i-tested-claude-haiku-gpt-4o-mini-and-gemini-flash-on-real-tasks-heres-what-actually-happened-3mc6</guid>
      <description>&lt;h1&gt;
  
  
  I Tested Claude Haiku, GPT-4o Mini, and Gemini Flash on Real Tasks. Here's What Actually Happened.
&lt;/h1&gt;

&lt;p&gt;Every few weeks someone posts a new model comparison and it's always the same: benchmark scores, carefully designed test prompts, neat bar charts. Then you try the "winning" model on your actual workload and something weird happens.&lt;/p&gt;

&lt;p&gt;I've been running all three in production for a few months. Here's what I actually found, including the parts that don't make for clean charts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Pricing Reality Check
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o Mini&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;$0.075&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini Flash is genuinely 13× cheaper than Haiku. That's not a rounding error. Before you immediately migrate everything: keep reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Was Testing
&lt;/h2&gt;

&lt;p&gt;Three real workloads from a side project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Document summarization&lt;/strong&gt; — long PDFs, messy formatting, inconsistent structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON extraction&lt;/strong&gt; — pull structured data from unstructured user input&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code explanation&lt;/strong&gt; — explain what a function does in plain English&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Not "write a poem" or "solve this logic puzzle." Real things an app might do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Summarization: Gemini Flash Wins
&lt;/h2&gt;

&lt;p&gt;Flash's 1M context window is genuinely useful here, not just a spec sheet number. I was chunking documents into pieces to fit smaller context windows. With Flash I stopped doing that.&lt;/p&gt;

&lt;p&gt;It's also significantly cheaper, which matters when you're summarizing hundreds of documents.&lt;/p&gt;

&lt;p&gt;The catch: Flash occasionally invents a section heading that wasn't in the original. It's subtle and sounds plausible, which makes it worse. For internal tools where someone will verify the output, fine. For anything customer-facing, I'd want a review step.&lt;/p&gt;
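
&lt;p&gt;That review step can be cheap. A rough sketch of the kind of guard I mean, assuming markdown-style headings in the summary; the helper and sample strings are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Flag any heading in the summary that never appears in the source text.
// Assumes the summary uses markdown-style "## Heading" lines.
function findInventedHeadings(sourceText, summaryMarkdown) {
  const source = sourceText.toLowerCase();
  return summaryMarkdown
    .split('\n')
    .filter(function (line) { return /^#{1,6}\s+/.test(line); })
    .map(function (line) { return line.replace(/^#{1,6}\s+/, '').trim(); })
    .filter(function (heading) { return !source.includes(heading.toLowerCase()); });
}

// Placeholder strings; swap in your real document and the model's summary.
const sourceDoc = 'Quarterly report. Revenue grew 12 percent. Risks include churn.';
const summary = '## Revenue\nGrew 12 percent.\n\n## Forward Guidance\nNot in the source.';
console.log(findInventedHeadings(sourceDoc, summary)); // [ 'Forward Guidance' ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;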

&lt;h2&gt;
  
  
  JSON Extraction: Haiku Wins, Annoyingly
&lt;/h2&gt;

&lt;p&gt;I really wanted Flash to win here because of the price. It didn't.&lt;/p&gt;

&lt;p&gt;The task: given a paragraph of user-written text, extract structured fields into a specific JSON schema. Claude Haiku followed the schema reliably. Flash followed it most of the time, but occasionally added fields that weren't in the schema, renamed fields it didn't like, or decided to nest things differently.&lt;/p&gt;

&lt;p&gt;Each of these breaks downstream code. The debugging cost per incident outweighed the token savings.&lt;/p&gt;

&lt;p&gt;Haiku is predictable in a way that sounds boring but is exactly what you want when you're processing thousands of records.&lt;/p&gt;
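
&lt;p&gt;This failure mode is also cheap to guard against at the boundary, whichever model produced the output. A minimal sketch of the check I mean; the expected field names are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Accept an extraction only if its keys exactly match the expected schema.
// Field names are hypothetical placeholders for your own schema.
const EXPECTED_KEYS = ['name', 'email', 'company', 'intent'];

function isSchemaCompliant(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return false; // not even valid JSON
  }
  const keys = Object.keys(parsed).sort();
  return JSON.stringify(keys) === JSON.stringify(EXPECTED_KEYS.slice().sort());
}

// Anything that fails goes to a retry (or a stricter model) instead of
// silently breaking downstream code.
console.log(isSchemaCompliant('{"name":"Ada","email":"a@b.co","company":"X","intent":"demo"}')); // true
console.log(isSchemaCompliant('{"name":"Ada","contact_email":"a@b.co"}'));                      // false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;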

&lt;h2&gt;
  
  
  Code Explanation: GPT-4o Mini Wins
&lt;/h2&gt;

&lt;p&gt;This one surprised me. GPT-4o Mini is clearly well-trained on code-related tasks. Its explanations are accurate and well-structured. It also handled edge cases (unusual syntax, legacy patterns) better than the other two.&lt;/p&gt;

&lt;p&gt;For anything code-adjacent, it's what I reach for first now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Use
&lt;/h2&gt;

&lt;p&gt;I stopped trying to find one winner and started routing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;pickModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;inputTokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;summarize&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;inputTokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini-1.5-flash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;extract_json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;explain_code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-haiku-4-5&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// safe default&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check price differences before committing to a migration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; promptfuel
promptfuel compare claude-haiku-4-5 gpt-4o-mini gemini-1.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Actual Bottom Line
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cheapest by far&lt;/strong&gt;: Gemini Flash. Use it for volume tasks where edge cases are acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most format-reliable&lt;/strong&gt;: Claude Haiku. Use it when you need strict schema compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best at code&lt;/strong&gt;: GPT-4o Mini. Use it for anything developer-facing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest answer is: don't pick one. Pick based on the task. The cost difference between routing intelligently versus defaulting to one model is real, and it compounds over time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>optimization</category>
    </item>
  </channel>
</rss>
