<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jacob Blackmer</title>
    <description>The latest articles on Forem by Jacob Blackmer (@build_with_jake).</description>
    <link>https://forem.com/build_with_jake</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3785687%2F0068e12e-2115-48a6-8cc8-db75c1dd429d.png</url>
      <title>Forem: Jacob Blackmer</title>
      <link>https://forem.com/build_with_jake</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/build_with_jake"/>
    <language>en</language>
    <item>
      <title>How I cut my AI API costs 79% — the boring stuff that actually worked</title>
      <dc:creator>Jacob Blackmer</dc:creator>
      <pubDate>Tue, 24 Feb 2026 03:54:34 +0000</pubDate>
      <link>https://forem.com/build_with_jake/how-i-cut-my-ai-api-costs-79-the-boring-stuff-that-actually-worked-27fe</link>
      <guid>https://forem.com/build_with_jake/how-i-cut-my-ai-api-costs-79-the-boring-stuff-that-actually-worked-27fe</guid>
      <description>&lt;p&gt;Look, theres a million posts about AI cost optimization that read like they were written by a consulting firm. This isn't one of those. This is what I actually did, including the parts where I felt dumb.&lt;/p&gt;

&lt;h2&gt;
  
  
  The situation
&lt;/h2&gt;

&lt;p&gt;I run an always-on AI assistant on a VPS. It handles email, calendar, code generation, research, monitoring — basically it runs my digital life. Over the past few months I kept adding features without paying attention to costs, and one day I looked at my bill: &lt;strong&gt;$288.61 for February.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For context, that's on a €5/month Hetzner VPS. The infrastructure is cheap. The API calls are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the money was actually going
&lt;/h2&gt;

&lt;p&gt;When I finally sat down and looked at task-level costs (instead of just the aggregate dashboard), I wanted to crawl under my desk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;40% of my API calls were using Claude Opus.&lt;/strong&gt; That's the $15-per-million-token model. And what was it doing? Status checks. Formatting dates. Checking if emails were unread. Generating heartbeat responses.&lt;/p&gt;

&lt;p&gt;It's like hiring a brain surgeon to take your temperature. It works. But why.&lt;/p&gt;

&lt;p&gt;The answer is I was lazy. Opus was the default, and I never questioned it. That's the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed (and what each thing saved)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Model routing — saved about 45%
&lt;/h3&gt;

&lt;p&gt;This was the big one. I set up routing logic that matches tasks to models by complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple stuff&lt;/strong&gt; (status, formatting, extraction, Q&amp;amp;A) → Claude Haiku at $0.25/M tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real work&lt;/strong&gt; (analysis, writing, code review) → Claude Sonnet at $3/M&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Important decisions&lt;/strong&gt; (architecture, strategy) → Claude Opus at $15/M&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After this, 85% of calls go to Haiku. Opus dropped from 40% usage to under 2%.&lt;/p&gt;

&lt;p&gt;The implementation is embarrassingly simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;classify&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analysis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;writing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code_review&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;opus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# only for the big stuff
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's basically it. A three-way if statement saved me $130/month.&lt;/p&gt;
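&lt;p&gt;If you want to sanity-check that number, the blended per-token price tells the story. A quick back-of-the-envelope using the usage shares above (the 35% Sonnet share before routing is my estimate, since I only tracked Opus and Haiku precisely):&lt;/p&gt;

```python
# Rough blended input-token cost per million tokens, before vs. after routing.
# Prices are the per-model rates from this post; call sizes assumed similar.
PRICES = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}  # $/M input tokens

def blended_cost(mix):
    """mix maps model name to its fraction of calls; returns $/M tokens."""
    return sum(PRICES[m] * share for m, share in mix.items())

before = blended_cost({"haiku": 0.25, "sonnet": 0.35, "opus": 0.40})
after = blended_cost({"haiku": 0.85, "sonnet": 0.13, "opus": 0.02})
print(f"before ~ ${before:.2f}/M, after ~ ${after:.2f}/M")
```

&lt;p&gt;That works out to roughly $7.11/M before versus $0.90/M after — an ~87% drop in the blended rate, which is where most of the overall savings came from.&lt;/p&gt;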

&lt;h3&gt;
  
  
  2. Local embeddings — saved about 15%
&lt;/h3&gt;

&lt;p&gt;Every time my system searched its memory or did semantic matching, it was making an API call for embeddings. I switched to &lt;code&gt;nomic-embed-text&lt;/code&gt; running locally through Ollama.&lt;/p&gt;

&lt;p&gt;Quick setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull nomic-embed-text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's 274 MB, runs on CPU, costs $0, and the quality is close enough for retrieval that I genuinely can't tell the difference. Saved about $40/mo.&lt;/p&gt;

&lt;p&gt;If you're paying for embedding APIs right now and you have any kind of server or decent machine, just... stop. Run them locally. This is the lowest of low-hanging fruit.&lt;/p&gt;
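&lt;p&gt;For the curious, here's a minimal sketch of what the local setup looks like from Python. The endpoint and payload follow Ollama's embeddings API; the cosine helper is all the retrieval math you need:&lt;/p&gt;

```python
import json
import math
import urllib.request

def embed(text, model="nomic-embed-text", host="http://localhost:11434"):
    """Get an embedding vector from a local Ollama server (no API cost)."""
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Usage (needs Ollama running locally):
#   scores = [(cosine(embed(query), embed(doc)), doc) for doc in memory]
```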

&lt;h3&gt;
  
  
  3. Deduplication — saved about 10%
&lt;/h3&gt;

&lt;p&gt;Turns out 30-40% of my API calls were completely redundant. My heartbeat system was checking the same things every 15 minutes whether or not anything had changed.&lt;/p&gt;

&lt;p&gt;Added basic state tracking — a JSON file that records what was checked and when. Before making an API call: "did this change since last time?" If no, skip it.&lt;/p&gt;

&lt;p&gt;My heartbeat calls went from 96/day to about 5/day. Same coverage.&lt;/p&gt;
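&lt;p&gt;A minimal sketch of the state-tracking idea — the function names and file path here are illustrative, not my exact code:&lt;/p&gt;

```python
import hashlib
import json
import os

STATE_FILE = "heartbeat_state.json"  # illustrative path

def _load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def should_call(check_name, current_value):
    """Return True only if this check's input changed since last run."""
    state = _load_state()
    fingerprint = hashlib.sha256(repr(current_value).encode()).hexdigest()
    if state.get(check_name) == fingerprint:
        return False  # nothing changed -- skip the API call entirely
    state[check_name] = fingerprint
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return True
```

&lt;p&gt;The gate runs before every heartbeat call: hash what you're about to ask about, compare to last time, and only pay for the API call when the answer could actually be different.&lt;/p&gt;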

&lt;h3&gt;
  
  
  4. Batch scheduling — saved about 9%
&lt;/h3&gt;

&lt;p&gt;Non-urgent tasks (memory consolidation, background research) got moved to cheaper models running on a schedule. Some of it runs on a local Qwen2.5 7B for free.&lt;/p&gt;
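&lt;p&gt;The pattern is just a deferred queue: urgent work hits the API now, everything else waits for a scheduled batch window. A sketch, with the actual model calls stubbed out:&lt;/p&gt;

```python
import queue

deferred = queue.Queue()

def run_on_api(task):
    """Placeholder: call the paid API (Sonnet/Opus) immediately."""
    return ("api", task)

def run_on_local_model(task):
    """Placeholder: run on a local model (e.g. Qwen2.5 7B via Ollama), free."""
    return ("local", task)

def submit(task, urgent=False):
    # Urgent work pays for low latency; everything else waits for the batch.
    if urgent:
        return run_on_api(task)
    deferred.put(task)

def flush_batch():
    """Run on a schedule (e.g. nightly cron): drain the queue locally."""
    results = []
    while not deferred.empty():
        results.append(run_on_local_model(deferred.get()))
    return results
```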

&lt;h2&gt;
  
  
  What didn't work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Going 100% local.&lt;/strong&gt; I tried. Qwen2.5 and Llama models handle simple tasks fine but fall apart on anything requiring nuanced reasoning. The quality gap led to more retries and manual fixes. Time cost &amp;gt; money saved. Hybrid wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggressive prompt shortening.&lt;/strong&gt; Cut too much context and got worse outputs. Worse outputs need retries. Retries cost money. Net negative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token micro-optimization.&lt;/strong&gt; Spent a few hours trying to squeeze tokens out of individual prompts. Saved maybe $2/month. Not worth it when routing saves $130.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Thing&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monthly cost&lt;/td&gt;
&lt;td&gt;$288&lt;/td&gt;
&lt;td&gt;~$60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus usage&lt;/td&gt;
&lt;td&gt;40% of calls&lt;/td&gt;
&lt;td&gt;&amp;lt;2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haiku usage&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;85%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local models&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Functionality&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Full (no loss)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  If you want to do this today
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check your API dashboard.&lt;/strong&gt; What % of calls use your most expensive model? If it's over 20%, you're overpaying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log task types.&lt;/strong&gt; Even for a day. See what tasks are actually hitting which models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add routing.&lt;/strong&gt; Even a basic if/else saves money immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for redundant calls.&lt;/strong&gt; Anything polling on a timer without checking for changes is probably wasting calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try local embeddings.&lt;/strong&gt; nomic-embed-text via Ollama. Seriously just do it.&lt;/li&gt;
&lt;/ol&gt;
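&lt;p&gt;For step 2, a log line per call is all it takes. Something like this (the file path and field names are illustrative):&lt;/p&gt;

```python
import csv
import time

LOG_PATH = "api_calls.csv"  # illustrative path

def log_call(task_type, model, input_tokens, output_tokens):
    """Append one row per API call; a day of this shows where the money goes."""
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f).writerow(
            [int(time.time()), task_type, model, input_tokens, output_tokens]
        )

def summarize():
    """Tally calls per (task_type, model) pair."""
    counts = {}
    with open(LOG_PATH) as f:
        for _, task_type, model, *_ in csv.reader(f):
            key = (task_type, model)
            counts[key] = counts.get(key, 0) + 1
    return counts
```

&lt;p&gt;After a day, the summary makes the mismatches obvious: if "status" rows are landing on your most expensive model, that's your routing rule right there.&lt;/p&gt;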

&lt;h2&gt;
  
  
  The honest takeaway
&lt;/h2&gt;

&lt;p&gt;None of this is technically impressive. A three-way if statement and a JSON state file account for most of my savings. The hard part was sitting down, looking at the numbers, and admitting I'd been wasting money through laziness.&lt;/p&gt;

&lt;p&gt;If you're running AI in production and haven't audited your model usage, do it this week. You'll probably find what I found: most of the cost is going to the wrong model.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running Claude + Ollama on Hetzner. Total cost including VPS: about $80/mo for 24/7 AI automation. Happy to answer questions in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>devops</category>
      <category>ai</category>
      <category>optimization</category>
    </item>
  </channel>
</rss>
