<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: John Medina</title>
    <description>The latest articles on Forem by John Medina (@amedinat).</description>
    <link>https://forem.com/amedinat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3854284%2F73b7fb73-f118-4d37-b5a7-37581d43bd0a.png</url>
      <title>Forem: John Medina</title>
      <link>https://forem.com/amedinat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/amedinat"/>
    <language>en</language>
    <item>
      <title>The Token Spiral: How One Runaway AI Agent Burned $2,847 in 4 Hours</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 22 May 2026 14:02:11 +0000</pubDate>
      <link>https://forem.com/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-12ok</link>
      <guid>https://forem.com/amedinat/the-token-spiral-how-one-runaway-ai-agent-burned-2847-in-4-hours-12ok</guid>
      <description>&lt;p&gt;traditional monitoring is completely broken when it comes to AI agents. &lt;/p&gt;

&lt;p&gt;we've all seen the dashboards. everything is green. HTTP 200s across the board. p99 latency looks fine. CPU is barely ticking. &lt;/p&gt;

&lt;p&gt;meanwhile, your agent is stuck in an infinite retry loop, burning $80 per iteration because it keeps hallucinating an invalid JSON payload and asking the LLM to fix it. &lt;/p&gt;

&lt;p&gt;this exact failure mode—the "token spiral"—recently burned $2,847 in just 4 hours for a dev team. and they only noticed because their card declined.&lt;/p&gt;

&lt;p&gt;here is why standard observability tools miss this:&lt;br&gt;
they track the container, the request, the database. they don't track the &lt;em&gt;tokens per customer task&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;when an agent starts spiraling, it's making valid API calls to OpenAI or Anthropic. the provider happily returns 200 OK. the latency might be slightly elevated, but not enough to trigger a generic PagerDuty alert. it just looks like heavy usage.&lt;/p&gt;

&lt;p&gt;to catch a token spiral before it bankrupts you, you need runtime cost enforcement. not just a daily digest, but active circuit breakers.&lt;/p&gt;
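
&lt;p&gt;a minimal sketch of what a per-task cost circuit breaker could look like, in typescript. everything here is hypothetical: the &lt;code&gt;CostCircuitBreaker&lt;/code&gt; class, the $5 limit, and the per-million-token prices are assumptions for illustration, not any provider's or library's api. the only idea it shows is "check the budget before every call, record usage after every call, stop once the task is over its limit."&lt;/p&gt;

```typescript
// hypothetical sketch: class name, limit, and prices are assumptions for illustration
type Usage = { inputTokens: number; outputTokens: number };
type Price = { inputPerMTok: number; outputPerMTok: number };

class CostCircuitBreaker {
  private spentUsd = 0;
  private readonly limitUsd: number;
  private readonly price: Price;

  constructor(limitUsd: number, price: Price) {
    this.limitUsd = limitUsd;
    this.price = price;
  }

  // true while the task is still under its spend limit
  isOpen(): boolean {
    return this.limitUsd > this.spentUsd;
  }

  // call after every response with the token usage the provider reported
  record(usage: Usage): void {
    this.spentUsd +=
      (usage.inputTokens / 1e6) * this.price.inputPerMTok +
      (usage.outputTokens / 1e6) * this.price.outputPerMTok;
  }
}

// simulated spiral: each iteration stands in for one model call that fails to fix the JSON
const breaker = new CostCircuitBreaker(5, { inputPerMTok: 3, outputPerMTok: 15 });
let completedCalls = 0;
while (breaker.isOpen()) {
  breaker.record({ inputTokens: 400000, outputTokens: 20000 });
  completedCalls += 1;
}
console.log(`breaker tripped after ${completedCalls} calls`);
```

&lt;p&gt;in a real agent you would feed &lt;code&gt;record&lt;/code&gt; from the usage block on each provider response and treat a tripped breaker as a hard stop that pages a human, not as something to retry.&lt;/p&gt;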

&lt;p&gt;if you're at an enterprise, you buy Braintrust or Vantage. &lt;br&gt;
if you're building a startup or just vibing in your garage, you can't afford those.&lt;/p&gt;

&lt;p&gt;imo, you need open-source per-customer cost attribution. i built &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=agent-token-spiral-silent-killer" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; to solve exactly this problem. it tracks costs by model, by user, by day. you can set budget alerts and actually see which specific tenant is spiraling out of control.&lt;/p&gt;

&lt;p&gt;ymmv, but don't deploy agents without cost circuit breakers. the API providers aren't going to refund you for bad prompts.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>The Overlooked Costs of Your LLM API Calls</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 20 May 2026 14:01:53 +0000</pubDate>
      <link>https://forem.com/amedinat/the-overlooked-costs-of-your-llm-api-calls-21gd</link>
      <guid>https://forem.com/amedinat/the-overlooked-costs-of-your-llm-api-calls-21gd</guid>
      <description>&lt;p&gt;Everyone tracks the cost per token. It's the obvious metric. But if that's all you're watching, you're missing the bigger picture. After spending way too much time sifting through invoices and logs, I've found the real cost sinks are often hidden elsewhere.&lt;/p&gt;

&lt;h3&gt;1. The Retries &amp;amp; Timeouts Tax&lt;/h3&gt;

&lt;p&gt;Your code retries on a 503 from OpenAI. Standard practice, right? But are you tracking the cost of those retries? A temporary outage or a poorly optimized prompt can cause a spike in retries, doubling or tripling the cost of a single user action without you even noticing until the end of the month. It's not just the API cost, either. It's the extended function execution time, the user waiting, the potential for cascading failures.&lt;/p&gt;
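
&lt;p&gt;One way to make the retry tax visible is to log every attempt with its attempt number, then split first-try cost from retry cost. The sketch below is a hypothetical illustration: the &lt;code&gt;Attempt&lt;/code&gt; shape and the &lt;code&gt;retryOverhead&lt;/code&gt; function are assumed names, and costs are in integer cents to keep the arithmetic exact.&lt;/p&gt;

```typescript
// Hypothetical log shape: one row per API attempt, cost in integer cents.
type Attempt = { actionId: string; attempt: number; costCents: number };

// Splits total spend into first-try cost and retry cost.
function retryOverhead(attempts: Attempt[]) {
  let firstTryCents = 0;
  let retryCents = 0;
  for (const a of attempts) {
    if (a.attempt === 1) {
      firstTryCents += a.costCents;
    } else {
      retryCents += a.costCents;
    }
  }
  // A multiplier of 1 means no retry tax; 2 means retries doubled the bill.
  const multiplier =
    firstTryCents > 0 ? (firstTryCents + retryCents) / firstTryCents : 1;
  return { firstTryCents, retryCents, multiplier };
}

const report = retryOverhead([
  { actionId: "summarize-42", attempt: 1, costCents: 2 },
  { actionId: "summarize-42", attempt: 2, costCents: 2 },
  { actionId: "summarize-42", attempt: 3, costCents: 2 },
  { actionId: "classify-7", attempt: 1, costCents: 1 },
]);
console.log(report);
```

&lt;p&gt;In this made-up sample, retries cost more than the first attempts did, which is exactly the kind of thing a monthly total hides.&lt;/p&gt;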

&lt;h3&gt;2. The "Which User Was That?" Problem&lt;/h3&gt;

&lt;p&gt;A huge bill comes in. You see a spike in &lt;code&gt;gpt-4-turbo&lt;/code&gt; usage last Tuesday. Who caused it? Was it a single power user, a misbehaving script, or a feature getting abused? If you're just passing an API key from your backend, you have no idea. Per-user attribution isn't a vanity metric; it's essential. Without it, you can't tell who your most expensive users are, or if a specific user is hammering your service in a way you didn't anticipate.&lt;/p&gt;

&lt;h3&gt;3. The "Development vs. Production" Blind Spot&lt;/h3&gt;

&lt;p&gt;You run tests, you experiment in a staging environment. All those calls are hitting your single API key. How much are you spending on non-production traffic? Is your CI/CD pipeline making a dozen LLM calls on every commit? These small, untracked costs from multiple developers and automated systems add up. They muddy the waters, making it impossible to see your actual production COGS.&lt;/p&gt;




&lt;p&gt;I built LLMeter to get a handle on this. It's an open-source dashboard that helps me see costs per user, per model, and set alerts before the bill gets out of hand. It directly connects to OpenAI, Anthropic, and others, giving me a real-time view of what's actually happening. Fwiw, it's helped me catch more than one runaway script. Check it out if you're tired of flying blind: &lt;a href="https://llmeter.org" rel="noopener noreferrer"&gt;https://llmeter.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Your AI costs are growing faster than your revenue</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 18 May 2026 14:09:10 +0000</pubDate>
      <link>https://forem.com/amedinat/your-ai-costs-are-growing-faster-than-your-revenue-3c6o</link>
      <guid>https://forem.com/amedinat/your-ai-costs-are-growing-faster-than-your-revenue-3c6o</guid>
      <description>&lt;p&gt;Most startups integrating LLMs run into the exact same wall around month 6.&lt;/p&gt;

&lt;p&gt;User growth looks great. ARR is going up. But your OpenAI/Anthropic bill is growing 3x faster than your MRR. Suddenly your gross margins are negative, and you have no idea why.&lt;/p&gt;

&lt;p&gt;I've talked to dozens of founders this year. Almost everyone starts the same way: one global API key, no caching, and a "we'll figure out costs later" mentality. Later usually arrives when your Stripe revenue no longer covers the API bill.&lt;/p&gt;

&lt;p&gt;The problem isn't the model pricing. GPT-4o mini and Claude 3.5 Haiku are cheap. The problem is lack of visibility.&lt;/p&gt;

&lt;p&gt;When a customer complains about an issue, your support team runs an agent loop. When a user uploads a PDF, your RAG pipeline chunks and embeds 50 pages. Who paid for that? Which customer is actually profitable? &lt;/p&gt;

&lt;p&gt;Usually, the top 20% of your users burn 80% of your API budget, while the rest subsidize them. But without per-tenant cost attribution, you can't tell them apart, so you can't adjust your pricing tiers.&lt;/p&gt;

&lt;p&gt;If you don't track costs per user, you are flying blind. &lt;/p&gt;

&lt;p&gt;Start tracking per-tenant usage early. Even logging token counts to your database is better than nothing.&lt;br&gt;
fwiw, if you don't want to build it yourself, I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-15-devto-ai-costs-revenue" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-15-devto-ai-costs-revenue&lt;/a&gt;). It's an open-source dashboard that tracks LLM API costs by model, by user, and by day. Handles OpenAI, Anthropic, DeepSeek, and OpenRouter. &lt;/p&gt;

&lt;p&gt;Stop guessing where your margin went.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>How I track per-customer LLM costs in production</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 15 May 2026 19:27:55 +0000</pubDate>
      <link>https://forem.com/amedinat/how-i-track-per-customer-llm-costs-in-production-4l2l</link>
      <guid>https://forem.com/amedinat/how-i-track-per-customer-llm-costs-in-production-4l2l</guid>
      <description>&lt;p&gt;Tracking LLM costs across an entire app is easy. Finding out &lt;em&gt;which&lt;/em&gt; customer is actually burning through your OpenAI bill? That's a nightmare.&lt;/p&gt;

&lt;p&gt;For a while, we were just eating the cost. You look at the Stripe dashboard, look at the OpenAI invoice, and pray the margins make sense. But when a single power user decides to process 10k documents through Claude on a Saturday night, averages stop mattering real fast.&lt;/p&gt;

&lt;p&gt;tbh, most billing dashboards are useless for this. They tell you that you spent $400 yesterday, but not &lt;em&gt;who&lt;/em&gt; spent it.&lt;/p&gt;

&lt;p&gt;Here is what actually works in production, without over-engineering your entire stack.&lt;/p&gt;

&lt;h3&gt;The metadata trick&lt;/h3&gt;

&lt;p&gt;If you're using OpenAI or Anthropic, you can pass metadata with every request. Don't just log it in your db. Pass the &lt;code&gt;user_id&lt;/code&gt; or &lt;code&gt;tenant_id&lt;/code&gt; directly to the provider.&lt;/p&gt;

&lt;p&gt;OpenAI supports this natively through the &lt;code&gt;user&lt;/code&gt; field. Anthropic accepts a &lt;code&gt;metadata.user_id&lt;/code&gt; field in the request body. OpenRouter makes it trivial.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...],&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`tenant_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;--- Do this. Always.&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple line changes everything. Now you can at least export CSVs from your provider and run a pivot table. But fwiw, doing that manually every week gets old fast.&lt;/p&gt;

&lt;h3&gt;Building a real ingestion pipeline&lt;/h3&gt;

&lt;p&gt;We needed real-time budget alerts. If a user on a $19/mo plan burns $5 in API costs, I want a Slack ping before they hit $20.&lt;/p&gt;

&lt;p&gt;We ended up building a pipeline for this: Next.js edge functions for ingestion, Supabase for fast time-series queries, and Inngest for async budget checking. &lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Proxy the request (or capture the token usage post-request).&lt;/li&gt;
&lt;li&gt;Fire an event to Inngest with &lt;code&gt;tenant_id&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, and &lt;code&gt;tokens&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Inngest writes to a Supabase table.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;sum(cost)&lt;/code&gt; &amp;gt; &lt;code&gt;budget_limit&lt;/code&gt; -&amp;gt; alert.&lt;/li&gt;
&lt;/ol&gt;
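
&lt;p&gt;Step 4 boils down to a sum and a comparison. Here is a rough sketch of that check as a pure function, with the Inngest and Supabase wiring left out. The &lt;code&gt;UsageEvent&lt;/code&gt; shape, the &lt;code&gt;tenantsOverBudget&lt;/code&gt; name, and the sample budgets are assumed for illustration and are not taken from LLMeter.&lt;/p&gt;

```typescript
// Hypothetical event shape: one row per model call, cost already computed in USD.
type UsageEvent = { tenantId: string; model: string; costUsd: number };
type Budgets = { [tenantId: string]: number };

// Returns the tenants whose summed cost is above their budget limit.
function tenantsOverBudget(events: UsageEvent[], budgetLimitUsd: Budgets): string[] {
  const spent: { [tenantId: string]: number } = {};
  for (const e of events) {
    spent[e.tenantId] = (spent[e.tenantId] ?? 0) + e.costUsd;
  }
  // Tenants without a configured budget are never flagged.
  return Object.keys(spent).filter(
    (t) => (spent[t] ?? 0) > (budgetLimitUsd[t] ?? Infinity)
  );
}

const overBudget = tenantsOverBudget(
  [
    { tenantId: "acme", model: "gpt-4o", costUsd: 3 },
    { tenantId: "acme", model: "gpt-4o", costUsd: 2.5 },
    { tenantId: "beta", model: "gpt-4o", costUsd: 1 },
  ],
  { acme: 5, beta: 5 }
);
console.log(overBudget);
```

&lt;p&gt;In the pipeline above, this check would run inside the async budget job after each write, and a non-empty result is what triggers the Slack ping.&lt;/p&gt;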

&lt;p&gt;No Stripe integration needed. Just raw usage limits. Ymmv, but decoupling billing from cost-tracking saved us a lot of headaches.&lt;/p&gt;

&lt;h3&gt;Open sourcing the dashboard&lt;/h3&gt;

&lt;p&gt;I got tired of rebuilding this for every project, so I built LLMeter. It's an open-source dashboard specifically for this: per-tenant LLM cost tracking.&lt;/p&gt;

&lt;p&gt;It tracks costs per model, per user, per day. Handles budget alerts. Supports OpenAI, Anthropic, DeepSeek, and OpenRouter out of the box. &lt;/p&gt;

&lt;p&gt;Stack is Next.js, Supabase, Inngest. It’s AGPL-3.0 licensed, completely free if you self-host.&lt;br&gt;
We use it internally to keep margins in check. &lt;/p&gt;

&lt;p&gt;Repo is up at &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=devto-per-customer-llm-costs" rel="noopener noreferrer"&gt;llmeter.org&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;How are you all handling multi-tenant API costs right now? Anyone found a better way to enforce per-user limits?&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>costtracking</category>
    </item>
    <item>
      <title>Agents need control flow because the loop pays the bill</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 13 May 2026 20:55:10 +0000</pubDate>
      <link>https://forem.com/amedinat/agents-need-control-flow-because-the-loop-pays-the-bill-414d</link>
      <guid>https://forem.com/amedinat/agents-need-control-flow-because-the-loop-pays-the-bill-414d</guid>
      <description>&lt;p&gt;last week a post called "agents need control flow, not more prompts" went around hn (thread &lt;a href="https://news.ycombinator.com/item?id=48051562" rel="noopener noreferrer"&gt;48051562&lt;/a&gt;, 588 points, 293 comments). the argument is an engineering one: open-ended prompt loops are unpredictable, deterministic harnesses aren't, so wrap the agent in a flowchart and feed it one step at a time. one commenter described doing exactly that — "wrapped the agent in a loop that kept feeding it the next step in the flowchart."&lt;/p&gt;

&lt;p&gt;all true. but there's a second axis the thread mostly stepped around, and one person said it out loud:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I used to assume they pushed people into the prompt-only workflows because you're paying them for the tokens" — DrewADesign, &lt;a href="https://news.ycombinator.com/item?id=48051562" rel="noopener noreferrer"&gt;same thread&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;an open-ended agent loop isn't just unreliable behavior. it's unbounded &lt;em&gt;spend&lt;/em&gt;. and the part almost nobody is instrumenting: when the invoice arrives, you have one number for a session that made 30 model calls, and no way to tell which of those 30 calls re-read the repo three times and cost $1.40 of the $1.83.&lt;/p&gt;

&lt;h2&gt;the loop got more expensive three times in the last 30 days&lt;/h2&gt;

&lt;p&gt;the timing here isn't subtle. while the thread was arguing about flowcharts, the per-token price moved underneath everyone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;github copilot&lt;/strong&gt; shifted to a token-credit model where the same Opus turn bills at a 1x / 7.5x / 27x multiplier depending on plan and overage state (HN 47923357). same work, three prices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;anthropic&lt;/strong&gt; A/B-tested removing Claude Code from the Pro tier mid-cycle (HN 47854477). people found out when the harness they'd built around it stopped working — no notice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;openai&lt;/strong&gt; shipped GPT-5.5 on may 8 at roughly 2x GPT-5.4's per-token price (HN 48057209, 213 pts). OpenRouter measured a 49–92% net cost increase even after the 19–34% token-efficiency gain, because efficiency doesn't save you if the price moved further than the efficiency did. a commenter there: "it's also quite the cost lottery and i'm not sure i am comfortable with that."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and the workload itself is variance-heavy &lt;em&gt;before&lt;/em&gt; any repricing. reflex.dev benchmarked computer-use against a structured API on the &lt;em&gt;same&lt;/em&gt; admin-panel task (HN 48024859, 269 comments): 550,976 ± 178,849 input tokens for the agent loop, 12,151 ± 27 for the structured call. the standard deviation on the loop is ~32% of the mean. run the same task twice and you get a 400k–750k token swing — and a matching 750s–1257s wall-clock swing.&lt;/p&gt;

&lt;p&gt;stack those: a workload that already swings ~2x run-to-run, on a per-token price that moved three times in a month, inside a loop whose length is decided by the model and not by you. "average cost per task" is not a number you can budget against. it's a number that was true once, for one run.&lt;/p&gt;

&lt;h2&gt;what we actually measured&lt;/h2&gt;

&lt;p&gt;i build llmeter — an open-source dashboard for llm api cost tracking — so i spend a lot of time staring at this data. the thing that broke for us first wasn't the price. it was attribution.&lt;/p&gt;

&lt;p&gt;here's the shape of an agentic task: one user request becomes N model calls, where N is the agent's choice, not the user's. without per-call attribution your invoice for that session is a single line. you can't point at the iteration that re-read the repo three times. you can't tell a retried tool-call branch — one that already succeeded — apart from real work. "agents are expensive" stays a feeling.&lt;/p&gt;

&lt;p&gt;once we started recording cost per API call and rolling it up per user / per model / per day, two things fell out fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the cost distribution across "the same" task is &lt;strong&gt;bimodal, not normal&lt;/strong&gt;. most runs are cheap; a small tail of runs is 3–5x, and the tail is exactly where the loop decided to do something extra. the mean hides the tail. the p95 is the number that actually predicts your invoice.&lt;/li&gt;
&lt;li&gt;a handful of users — usually on the free tier, usually running something on a cron — accounted for a wildly disproportionate share of token spend. one cron job re-summarizing the same document every hour will quietly outspend your paying customers, and you won't see it in an aggregate provider bill.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;none of that is a pricing problem. it's a visibility problem that pricing volatility makes expensive.&lt;/p&gt;

&lt;h2&gt;what to do this week (none of this needs a tool)&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;find your single most expensive task in production this month.&lt;/strong&gt; not the average — the single worst run. if you can't query that in under five minutes, that's the gap, and closing it is usually a five-line change: log &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;input_tokens&lt;/code&gt;, &lt;code&gt;output_tokens&lt;/code&gt;, &lt;code&gt;cached_tokens&lt;/code&gt; and a &lt;code&gt;task_id&lt;/code&gt; next to every completion call, then &lt;code&gt;GROUP BY task_id ORDER BY cost DESC LIMIT 10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;break out the cached-token line.&lt;/strong&gt; OpenAI, Anthropic and DeepSeek each name cache-hit / cache-miss / cached-input differently, and the cached tier is the one that tends to move most on a repricing. if your cost rollup collapses everything into "input tokens," a price change on the cached tier is invisible until the invoice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;put a &lt;code&gt;task_id&lt;/code&gt; on agent loops and count the iterations.&lt;/strong&gt; the reflex numbers say iteration count is your variance source. if you're not logging "this user request fanned out to 14 model calls," you can't tell a healthy run from a runaway one — and you definitely can't alert on it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;alert on p95, not the mean.&lt;/strong&gt; a Slack ping at "you've spent your monthly average" fires after the damage. a ping at "this task is in the top 5% of cost we've ever recorded for this task type" fires while it's still running. (this is the one spot a tool earns its keep — per-model / per-user / per-day budget alerts — but the logic is simple enough to roll yourself.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;if you route to DeepSeek for cost, write down the date.&lt;/strong&gt; the V4-Pro 75% promo expires 2026-05-31 15:59 UTC and every line item goes 4x at that second. that's not a forecast, it's a calendar entry — model it before may 30, not on june 1.&lt;/li&gt;
&lt;/ol&gt;
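
&lt;p&gt;steps 1, 3 and 4 fit in a few lines once the per-call log exists. a rough sketch, assuming a hypothetical &lt;code&gt;CallLog&lt;/code&gt; row with a &lt;code&gt;taskId&lt;/code&gt; and a &lt;code&gt;costUsd&lt;/code&gt;. the function names and the nearest-rank percentile are choices made for illustration, not how any particular tool does it.&lt;/p&gt;

```typescript
// hypothetical log row: one per completion call
type CallLog = { taskId: string; costUsd: number };
type TaskRollup = { [taskId: string]: { costUsd: number; calls: number } };

// total cost and iteration count per task (the GROUP BY task_id step)
function rollupByTask(logs: CallLog[]): TaskRollup {
  const byTask: TaskRollup = {};
  for (const log of logs) {
    const entry = byTask[log.taskId] ?? { costUsd: 0, calls: 0 };
    entry.costUsd += log.costUsd;
    entry.calls += 1;
    byTask[log.taskId] = entry;
  }
  return byTask;
}

// nearest-rank percentile over historical per-task costs
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1] as number;
}

// a running task is suspicious once it costs more than the historical p95
function isRunaway(currentCostUsd: number, history: number[]): boolean {
  return currentCostUsd > percentile(history, 95);
}

// made-up data: 20 past runs costing $1..$20, plus three logged calls
const history = Array.from({ length: 20 }, (_, i) => i + 1);
const rollup = rollupByTask([
  { taskId: "t1", costUsd: 0.5 },
  { taskId: "t1", costUsd: 0.25 },
  { taskId: "t2", costUsd: 1 },
]);
console.log(rollup, percentile(history, 95), isRunaway(19.5, history));
```

&lt;p&gt;the point of keying the alert on &lt;code&gt;isRunaway&lt;/code&gt; is that it can be evaluated after every call inside the loop, so it fires while the task is still running.&lt;/p&gt;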

&lt;h2&gt;the point&lt;/h2&gt;

&lt;p&gt;the control-flow argument and the cost argument are the same argument. a deterministic harness is predictable behavior &lt;em&gt;and&lt;/em&gt; a predictable bill. an open-ended loop is "trust me" on both. the harness people are right — but the reason to draw the flowchart isn't only that the agent behaves better. it's that you can finally point at the box that cost you the money.&lt;/p&gt;

&lt;p&gt;if you can't draw the cost shape of your agent's loop, your control flow is just hope.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;i build &lt;a href="https://llmeter.org/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hn-week-loop-pays-the-bill" rel="noopener noreferrer"&gt;llmeter&lt;/a&gt; — an open-source (AGPL-3.0) cost dashboard for OpenAI / Anthropic / DeepSeek / OpenRouter / Mistral / Azure OpenAI: per-model, per-user, per-day, with budget alerts. it's not a proxy and it doesn't sit in your request path — the SDK forwards usage metadata async. the per-call attribution stuff above is the part that made me build it. free tier is one provider / 7-day retention. genuinely want to hear how other people slice agentic cost — what does your rollup key on?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>saas</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your prompt is getting longer without you knowing it (and it's killing your margins)</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Tue, 12 May 2026 21:39:59 +0000</pubDate>
      <link>https://forem.com/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-1b71</link>
      <guid>https://forem.com/amedinat/your-prompt-is-getting-longer-without-you-knowing-it-and-its-killing-your-margins-1b71</guid>
      <description>&lt;p&gt;I've been looking at LLM billing patterns lately, and there's a silent killer that creeps up on almost every team: prompt inflation.&lt;/p&gt;

&lt;p&gt;When you first build an AI feature, your prompt is tight. Maybe 500 tokens for the system instructions and 100 for the user query. The math looks great. "This will cost us fractions of a cent per call," you tell the team.&lt;/p&gt;

&lt;p&gt;Fast forward three months.&lt;/p&gt;

&lt;p&gt;Someone added conversation history to make the bot "smarter." Another dev added a massive RAG context block because the model hallucinated once. Product asked for formatting instructions, so now the system prompt is a 2,000-word essay. &lt;/p&gt;

&lt;p&gt;Suddenly, your baseline request is 8k tokens. &lt;/p&gt;

&lt;p&gt;The worst part is that user value doesn't scale linearly with prompt size. But your OpenAI bill sure does. If you're running at scale, you're suddenly paying $0.05+ per request for a feature you modeled at $0.005. &lt;/p&gt;
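
&lt;p&gt;The arithmetic is worth doing once. A quick sketch, assuming a hypothetical input price of $2.50 per million tokens (swap in your model's real rate) and ignoring output tokens. The &lt;code&gt;inputCostUsd&lt;/code&gt; helper is illustrative only.&lt;/p&gt;

```typescript
// Assumed price for illustration only; check your provider's current rate.
const ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 2.5;

// Input-side cost of one request at a given prompt size.
function inputCostUsd(tokens: number, usdPerMillion: number): number {
  return (tokens / 1e6) * usdPerMillion;
}

// Day one: 500 system tokens + 100 user tokens. Month three: 8k baseline.
const dayOne = inputCostUsd(500 + 100, ASSUMED_USD_PER_MILLION_INPUT_TOKENS);
const monthThree = inputCostUsd(8000, ASSUMED_USD_PER_MILLION_INPUT_TOKENS);
const growth = monthThree / dayOne;

console.log(`per-request input cost grew ${growth.toFixed(1)}x`);
```

&lt;p&gt;The ratio does not depend on the price you plug in: going from 600 to 8,000 input tokens is roughly a 13x increase per request, which lines up with the jump from $0.005 to $0.05+.&lt;/p&gt;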

&lt;p&gt;If you just look at your monthly total on the provider dashboard, it just looks like you're getting more usage. You think "growth is good" until the Stripe payout hits and you realize your margins are gone.&lt;/p&gt;

&lt;p&gt;You need to track cost &lt;em&gt;per user&lt;/em&gt; and cost &lt;em&gt;per feature&lt;/em&gt;, not just total spend. If you see specific users driving crazy costs, they're probably accumulating massive context windows that you need to truncate.&lt;/p&gt;

&lt;p&gt;fwiw, I ran into this exact issue, which is why I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-21-prompt-inflation-margin-killer&lt;/a&gt;). It's an open-source, proxy-free way to track this stuff. It attributes costs down to the user ID level so you can actually see who is dragging around a 10k token history.&lt;/p&gt;

&lt;p&gt;Stop assuming your prompt is the same size it was on day one. Track it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Hidden 43% — How Teams Are Wasting Almost Half Their LLM API Budget</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Fri, 08 May 2026 23:20:51 +0000</pubDate>
      <link>https://forem.com/amedinat/the-hidden-43-how-teams-are-wasting-almost-half-their-llm-api-budget-32b5</link>
      <guid>https://forem.com/amedinat/the-hidden-43-how-teams-are-wasting-almost-half-their-llm-api-budget-32b5</guid>
      <description>&lt;p&gt;You look at your provider dashboard and see one number: the total bill. It's like getting an electricity bill that just says "$5,000" with no breakdown of whether it was the AC, the fridge, or someone leaving the lights on all month.&lt;/p&gt;

&lt;p&gt;tbh, most AI startups are flying blind right now. We recently looked into the cost breakdown for several teams and found something crazy: almost 43% of LLM API spend is completely wasted. It’s not about paying for usage; it’s about paying for bad architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s where the leaks are actually happening:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Retry Storms (34% of waste)&lt;br&gt;
Your agent fails to parse a JSON response, so it retries. And retries. Sometimes 5-10 times in a loop. You aren't just paying for the failure, you are paying for the massive context window sent every single time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Duplicate Calls (85% of apps have this issue)&lt;br&gt;
Multiple users asking the exact same question, or internal systems running the same RAG pipeline on the same document. Without a response cache in front of the provider, you're paying OpenAI to generate identical tokens twice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context Bloat&lt;br&gt;
Sending the entire 50-page document history when the user just asked "what's the summary of page 2". RAG is great, but shoving everything into the prompt "just in case" is burning your runway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrong Model Selection&lt;br&gt;
Using GPT-4o or Claude 3 Opus for simple classification tasks when Haiku or GPT-3.5-turbo would do it for a fraction of the cost.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can't fix what you can't see. That's exactly why I built LLMeter (&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hidden-43-percent-llm-waste" rel="noopener noreferrer"&gt;https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hidden-43-percent-llm-waste&lt;/a&gt;). It's an open-source dashboard that gives you per-customer and per-model cost tracking. Stop guessing who or what is draining your API budget.&lt;/p&gt;

&lt;p&gt;Fwiw, just setting up basic budget alerts and seeing the breakdown by tenant usually drops a team's bill by 20% in the first week. Give it a try, it's open source (AGPL-3.0) and you can self-host or use the free tier.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>saas</category>
    </item>
    <item>
      <title>The week your AI coding tier got smaller</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 06 May 2026 23:15:24 +0000</pubDate>
      <link>https://forem.com/amedinat/the-week-your-ai-coding-tier-got-smaller-1a2j</link>
      <guid>https://forem.com/amedinat/the-week-your-ai-coding-tier-got-smaller-1a2j</guid>
      <description>&lt;p&gt;In 48 hours this week, two of the biggest AI coding platforms confirmed the same thing: your unlimited subscription was never sustainable for how you actually use it. The provider will be the one who decides when to cut you off.&lt;/p&gt;

&lt;p&gt;Anthropic silently removed Claude Code from Pro on a "2% A/B test" (later reversed). Their Head of Growth justified it by saying "usage has changed a lot and our current plans weren't built for this." GitHub paused new Copilot Pro signups and dropped Opus from Pro entirely.&lt;/p&gt;

&lt;p&gt;One dev on HN said sending 3-4 messages to Opus 4.7 blew through their $20 plan limits and consumed $10 of extra usage.&lt;/p&gt;

&lt;p&gt;Simon Willison framed the trust break: "Should I be taking a bet on Claude Code if I know that they might 5x the minimum price of the product?"&lt;/p&gt;

&lt;p&gt;The structural takeaway for any team shipping AI features: the invoice is the governance boundary, not the plan page. The provider's unit economics are now public. Every user is a small loss when they exceed the pricing assumption, and no vendor has found the pricing floor yet.&lt;/p&gt;

&lt;p&gt;Teams that cannot meter their own spend per-customer, per-agent, per-task are now one pricing memo away from being unprofitable overnight.&lt;/p&gt;

&lt;p&gt;The concrete fix: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;track your tokens (not the invoice's)&lt;/li&gt;
&lt;li&gt;use per-customer attribution (so you know whose usage is killing you)&lt;/li&gt;
&lt;li&gt;implement hard budget caps at the agent level. Alerts don't stop a runaway loop.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is exactly what LLM Budget Guard is being built for. &lt;/p&gt;

&lt;p&gt;Here is how a wrapper around the SDK produces per-customer token attribution without waiting for invoice day:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;wrapOpenAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llmeter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrapOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;prod-cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tenantId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cust_883&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Cost is now tracked per customer automatically&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Generate report&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Track your costs early. Check out &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=ai-coding-tier-collapse" rel="noopener noreferrer"&gt;LLMeter&lt;/a&gt; to get started with attribution.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Bun Is Porting from Zig to Rust — Here's Why That Matters If You Run LLM Workloads</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Wed, 06 May 2026 17:11:01 +0000</pubDate>
      <link>https://forem.com/amedinat/bun-is-porting-from-zig-to-rust-heres-why-that-matters-if-you-run-llm-workloads-3kgo</link>
      <guid>https://forem.com/amedinat/bun-is-porting-from-zig-to-rust-heres-why-that-matters-if-you-run-llm-workloads-3kgo</guid>
      <description>&lt;p&gt;This week Bun published its internal &lt;a href="https://github.com/oven-sh/bun/commit/46d3bc29f270fa881dd5730ef1549e88407701a5" rel="noopener noreferrer"&gt;Zig→Rust porting guide&lt;/a&gt; — a signal that the runtime is migrating core components from Zig to Rust. The HN thread hit 700+ points in 24 hours.&lt;/p&gt;

&lt;p&gt;The Rust move is a reasonable technical bet. Rust's ecosystem maturity, tooling, and contributor onboarding advantages at Bun's scale are real. But for teams running LLM workloads in production, the migration surfaces a question worth thinking about: &lt;em&gt;what does it mean that your JavaScript runtime is now an Anthropic asset?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background: December 2025
&lt;/h2&gt;

&lt;p&gt;Anthropic acquired Bun in December 2025. For most developers this was a minor footnote. For teams running AI pipelines, it quietly changed the vendor dependency structure.&lt;/p&gt;

&lt;p&gt;If you're using Bun as your JS runtime, Claude Code as your AI CLI, and Anthropic as your LLM provider, all three now share the same balance sheet. The Rust port doesn't change that — it makes Bun faster and easier to contribute to, but it doesn't change who owns it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the shared balance sheet matters at the billing layer
&lt;/h2&gt;

&lt;p&gt;The last 90 days produced 6 separate LLM billing incidents across the major providers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Incident&lt;/th&gt;
&lt;th&gt;HN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mar 2026&lt;/td&gt;
&lt;td&gt;Anthropic Pro A/B — silent tier reclassification&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47854477" rel="noopener noreferrer"&gt;47854477&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mar 2026&lt;/td&gt;
&lt;td&gt;Cursor per-token surprise — $200→$500 overnight&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47847849" rel="noopener noreferrer"&gt;47847849&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;GitHub Copilot 7.5x billing multiplier&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=192911" rel="noopener noreferrer"&gt;#192911&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;GitHub Copilot 27x billing trap&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47923357" rel="noopener noreferrer"&gt;47923357&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;OpenClaw trigger-word charges&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47963204" rel="noopener noreferrer"&gt;47963204&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;HERMES.md rate reclassification&lt;/td&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/item?id=47952722" rel="noopener noreferrer"&gt;47952722&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these were announced in advance. All of them hit teams that thought they had controls in place — dashboards, alerts, rate limits set through the vendor's own UI.&lt;/p&gt;

&lt;p&gt;The pattern: vendor-side controls fail at the worst moment because they live inside the system making the billing decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What out-of-band enforcement looks like
&lt;/h2&gt;

&lt;p&gt;The durable fix isn't switching runtimes or providers. It's moving cost enforcement &lt;em&gt;outside&lt;/em&gt; the vendor stack.&lt;/p&gt;

&lt;p&gt;Enforcement that runs before the API call goes out (synchronously, without a network round-trip) can't be overridden by a policy update on the vendor side. The call either goes out or it doesn't. No spend is committed until your own cap logic says it's safe.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;wrapAnthropic&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@simplifai/budget-guard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Anthropic&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@anthropic-ai/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;global_cap_per_day_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;per_customer_cap_per_day_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrapAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// BudgetCapError thrown before the call if cap exceeded&lt;/span&gt;
&lt;span class="c1"&gt;// The call never goes out. No spend incurred.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a cap is hit you get structured data — &lt;code&gt;scope&lt;/code&gt;, &lt;code&gt;spend_usd&lt;/code&gt;, &lt;code&gt;cap_usd&lt;/code&gt;, &lt;code&gt;retry_after&lt;/code&gt; — not a Slack alert 10 minutes after the damage is done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rust port is good news
&lt;/h2&gt;

&lt;p&gt;Genuinely. Faster startup, better memory safety, easier for external contributors. If you're building on Bun, the migration is a net positive for stability.&lt;/p&gt;

&lt;p&gt;The vendor dependency concern is a separate question from runtime quality. Bun being well-engineered in Rust doesn't change that it's now part of the same vendor stack as your LLM calls. For teams where that matters, the answer isn't a different runtime — it's enforcement that lives outside all of them.&lt;/p&gt;




&lt;p&gt;SDK (TypeScript, zero deps, 29 tests): &lt;code&gt;npm install @simplifai/budget-guard&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Managed version waitlist → &lt;a href="https://simplifai.tools/validate/budget-guard" rel="noopener noreferrer"&gt;simplifai.tools/validate/budget-guard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>ai</category>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>27 days to the DeepSeek V4-Pro cliff: what a 4x price jump looks like in production</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Tue, 05 May 2026 13:48:06 +0000</pubDate>
      <link>https://forem.com/amedinat/27-days-to-the-deepseek-v4-pro-cliff-what-a-4x-price-jump-looks-like-in-production-4j4c</link>
      <guid>https://forem.com/amedinat/27-days-to-the-deepseek-v4-pro-cliff-what-a-4x-price-jump-looks-like-in-production-4j4c</guid>
      <description>&lt;p&gt;So here's the thing about the deepseek v4-pro pricing schedule that has been making the rounds on hn this week (thread 48002136, 494 pts / 198 comments) — the 75% promotional discount everyone has been routing to is &lt;strong&gt;calendar-deterministic&lt;/strong&gt;. it expires 2026-05-31 at 15:59 UTC. that's 27 days from today. on june 1 the same agent that costs you $87 in tokens will cost you $348.&lt;/p&gt;

&lt;p&gt;most teams i've talked to about this have a vague awareness ("yeah we've been saving with deepseek") and zero plan for the cliff. nobody has a runbook for what happens when a long-running agent session starts before the price flip and finishes after it. nobody has tested whether their billing dashboard refreshes mid-session or only on the next invoice. and nobody has checked if their fallback provider is actually configured to take over at 16:00 UTC on the 31st.&lt;/p&gt;

&lt;p&gt;this isn't a "scandal" the way openclaw or hermes.md or the copilot 27x pivot were. deepseek published the schedule openly. but the failure mode is identical to a stripe price-update event delivered with no buyer-side notification: you've been routing 60-80% of agent traffic to v4-pro since the 04-26 launch, the meter ticks up at 4x without an alert, and the line on june's bill is the first thing you see.&lt;/p&gt;

&lt;h2&gt;
  
  
  what's actually happening
&lt;/h2&gt;

&lt;p&gt;the published schedule (api-docs.deepseek.com/quick_start/pricing) is unambiguous:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;line item&lt;/th&gt;
&lt;th&gt;promo (now → may 31 15:59 UTC)&lt;/th&gt;
&lt;th&gt;post-cliff (june 1 onward)&lt;/th&gt;
&lt;th&gt;multiple&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;output tokens&lt;/td&gt;
&lt;td&gt;$0.87 / 1M&lt;/td&gt;
&lt;td&gt;$3.48 / 1M&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;input cache miss&lt;/td&gt;
&lt;td&gt;$0.435 / 1M&lt;/td&gt;
&lt;td&gt;$1.74 / 1M&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;input cache hit&lt;/td&gt;
&lt;td&gt;$0.03625 / 1M&lt;/td&gt;
&lt;td&gt;$0.145 / 1M&lt;/td&gt;
&lt;td&gt;4x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;every line is exactly 4x. the cache-hit tier is the one that catches people — at $0.036/M it's basically free and most teams stopped instrumenting after the first week. at $0.145/M it's still cheap but cache-hit volume is usually 5-10x cache-miss volume, so the absolute dollar delta on cache-hit can dwarf the headline output-token delta.&lt;/p&gt;

&lt;p&gt;the deepclaw thread (claude code → deepseek wrapper, hit hn front page same week) is full of "$0.06 per task with v4-pro" anecdotes. every one of those becomes $0.24 on june 1. the $30/mo dev-side budget becomes $120. the $4k/mo agent fleet becomes $16k. nothing on the deepseek side of the wire changes — same model, same endpoint, same response — just a 4x multiplier on the meter.&lt;/p&gt;
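
&lt;p&gt;to make the 4x concrete, here is a minimal sketch that prices a monthly token mix under both schedules. the per-1M prices come from the table above; the volumes in &lt;code&gt;exampleVolume&lt;/code&gt; and the helper names are made up for illustration, so plug in your own numbers.&lt;/p&gt;

```typescript
// prices in USD per 1M tokens, copied from the schedule table above
type PerTier = { output: number; cacheMiss: number; cacheHit: number };

const PROMO: PerTier = { output: 0.87, cacheMiss: 0.435, cacheHit: 0.03625 };
const POST_CLIFF: PerTier = { output: 3.48, cacheMiss: 1.74, cacheHit: 0.145 };

// cost per tier in USD for a volume given in millions of tokens
function costByTier(volumeM: PerTier, prices: PerTier): PerTier {
  return {
    output: volumeM.output * prices.output,
    cacheMiss: volumeM.cacheMiss * prices.cacheMiss,
    cacheHit: volumeM.cacheHit * prices.cacheHit,
  };
}

function total(cost: PerTier): number {
  return cost.output + cost.cacheMiss + cost.cacheHit;
}

// hypothetical monthly volume (millions of tokens), not real usage data
const exampleVolume: PerTier = { output: 20, cacheMiss: 100, cacheHit: 800 };

const promoCost = costByTier(exampleVolume, PROMO);
const cliffCost = costByTier(exampleVolume, POST_CLIFF);

// dollar delta per tier, which is more useful than the percent change
const delta: PerTier = {
  output: cliffCost.output - promoCost.output,
  cacheMiss: cliffCost.cacheMiss - promoCost.cacheMiss,
  cacheHit: cliffCost.cacheHit - promoCost.cacheHit,
};

console.log("promo total:", total(promoCost).toFixed(2));
console.log("post-cliff total:", total(cliffCost).toFixed(2));
console.log("delta by tier:", delta);
```

&lt;p&gt;because every tier moves by the same multiple, the total moves by 4x too; the per-tier deltas show where the extra dollars land.&lt;/p&gt;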

&lt;h2&gt;
  
  
  what we learned doing the math on our own usage
&lt;/h2&gt;

&lt;p&gt;i ran our last 30 days of llm spend through the cliff scenario assuming we keep the same routing distribution. three observations worth sharing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. cache-hit dominance flips the headline number.&lt;/strong&gt; our v4-pro mix is ~70% cache-hit / 25% cache-miss / 5% output-heavy. on the promo schedule cache-hit is ~6% of total spend; post-cliff it's ~22%. the "output tokens went 4x" story misses where the dollars actually sit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. mid-session pricing flips are a real failure mode.&lt;/strong&gt; we have agent runs that span 4-6 hours. on may 31 a session that starts at 12:00 UTC and finishes at 18:00 UTC will see the promo rate for 3h59m of tokens and the post-cliff rate for the rest. nothing in the deepseek api response surfaces which tier the request was billed at — you find out from the invoice. for a long-running summarization or repo-analysis job that's a $0.40 → $1.60 swing on a single run, but the agent has no way to detect it and slow down.&lt;/p&gt;
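
&lt;p&gt;a rough way to size that exposure ahead of time: work out how much of a session's wall-clock time lands after the cutoff and blend the two rates. this is only a sketch. it assumes tokens are spread evenly across the session and that the cutoff is exactly 15:59:00 UTC, and both function names are invented here.&lt;/p&gt;

```typescript
// promo end as stated in the schedule: 2026-05-31 15:59 UTC (seconds assumed to be :00)
const PROMO_END_MS = Date.UTC(2026, 4, 31, 15, 59, 0); // month index 4 = may

// fraction of the session's wall-clock time that falls after the cutoff
function postCliffFraction(startMs: number, endMs: number, cutoffMs: number = PROMO_END_MS): number {
  const totalMs = endMs - startMs;
  if (!(totalMs > 0)) return 0; // empty or inverted session
  const afterMs = Math.max(0, endMs - Math.max(startMs, cutoffMs));
  return afterMs / totalMs;
}

// blended cost, assuming tokens are spread evenly over the session
function blendedCostUsd(promoRateCostUsd: number, fraction: number, multiple: number = 4): number {
  return promoRateCostUsd * (1 - fraction) + promoRateCostUsd * multiple * fraction;
}

// the example from the text: a run from 12:00 to 18:00 UTC on may 31
const sessionStart = Date.UTC(2026, 4, 31, 12, 0, 0);
const sessionEnd = Date.UTC(2026, 4, 31, 18, 0, 0);
const fraction = postCliffFraction(sessionStart, sessionEnd);
console.log("post-cliff share of session:", fraction.toFixed(3));
console.log("blended cost of a $0.40 run:", blendedCostUsd(0.4, fraction).toFixed(2));
```

&lt;p&gt;real token usage is rarely uniform over a session, so treat the result as an estimate of exposure and not a prediction of the invoice.&lt;/p&gt;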

&lt;p&gt;&lt;strong&gt;3. multi-provider failover is necessary but not sufficient.&lt;/strong&gt; the obvious move is "fail over to anthropic / openai if deepseek crosses some price threshold." but the threshold isn't price (deepseek hasn't changed prices, the schedule has) — it's calendar. you have to encode "after 2026-05-31 15:59 UTC, if i'm still routing to v4-pro, stop" in the routing layer itself. and you have to test it before the day, because if your e2e test environment hits real apis on june 1 you've lost the dry run.&lt;/p&gt;
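
&lt;p&gt;the calendar rule can live in the router as a plain timestamp check with an injectable clock, which is what lets you dry-run it before the day. a minimal sketch, with hypothetical route names and a made-up &lt;code&gt;pickRoute&lt;/code&gt; helper:&lt;/p&gt;

```typescript
// hypothetical route labels; use whatever your router actually calls them
type Route = "deepseek-v4-pro" | "fallback";

// cutoff from the published schedule: 2026-05-31 15:59 UTC (month index 4 = may)
const V4_PRO_CUTOFF_MS = Date.UTC(2026, 4, 31, 15, 59, 0);

// nowMs is passed in so tests can simulate any date without waiting for it
function pickRoute(nowMs: number, cutoffMs: number = V4_PRO_CUTOFF_MS): Route {
  return nowMs >= cutoffMs ? "fallback" : "deepseek-v4-pro";
}

// dry run: simulate the minute before the cutoff and the cutoff minute itself
console.log(pickRoute(Date.UTC(2026, 4, 31, 15, 58, 0)));
console.log(pickRoute(Date.UTC(2026, 4, 31, 15, 59, 0)));
// in production you would call pickRoute(Date.now())
```

&lt;p&gt;passing an earlier &lt;code&gt;cutoffMs&lt;/code&gt; gives you headroom instead of flipping on the exact billing minute.&lt;/p&gt;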

&lt;h2&gt;
  
  
  practical implications (things you can do this week)
&lt;/h2&gt;

&lt;p&gt;these are the actions, in priority order, that close the cliff exposure for a typical team running 60-80% deepseek-v4-pro traffic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;audit your last 30 days of v4-pro spend by tier.&lt;/strong&gt; group by cache-hit / cache-miss / output. multiply each line by 4x and look at the dollar delta, not the percent. the result is your post-cliff monthly run-rate floor. if it scares you, the rest of this list is mandatory; if it doesn't, you can stop here.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;add a hard cutoff in your router by may 30, not may 31.&lt;/strong&gt; "switch off v4-pro at 15:59 UTC on the 31st" is the obvious answer and the wrong one — it puts a billing event on the same calendar minute as a routing event with no headroom. switch off at midnight UTC may 30, run on the post-cliff equivalent (deepseek-v4 base, anthropic haiku, gpt-4.1-mini) for 24-48 hours, validate that quality + latency hold, then make a deliberate "do we want v4-pro at the new price" call with the actual numbers in front of you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;instrument cache-hit-rate dollar impact, not just request count.&lt;/strong&gt; most observability stacks count cache hits as a vanity metric. on june 1 cache-hit becomes 4x its promo price and the volume doesn't change. plot dollar-cost-per-cache-hit over time so the cliff shows up as a step function and not a slow drift.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;kill any long-running session that starts after may 31 12:00 UTC.&lt;/strong&gt; for the 4-hour window before the cliff, force agent runs to stay under 30 minutes wall-clock. it's annoying for one shift but it eliminates the mid-session-pricing-flip class of bug entirely. cheaper than discovering on june 2 that your overnight run got billed at the post-cliff rate for 90% of its tokens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;set an out-of-band budget cap that survives provider misbehavior.&lt;/strong&gt; the cliff is the deterministic case. the worse case is a provider you've never used before issuing a similar schedule on a tuesday with no warning. the only durable defense is a hard $/period ceiling on a layer the provider doesn't control. that's the slot llmeter sits in — count tokens before they leave your network, refuse to forward requests once the cap is hit, fail closed when the meter is uncertain. fwiw it's the same architecture every team rolls themselves on the 2nd of the month after the first surprise bill.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
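
&lt;p&gt;the cap in the last item can be sketched as a small in-process guard: estimate the cost before the request leaves, refuse once the ceiling would be crossed, and refuse when the estimate is unusable. this is an illustration with invented names, not the llmeter api.&lt;/p&gt;

```typescript
// hypothetical in-process guard; class and method names are illustrative only
class LocalBudgetCap {
  private spentUsd = 0;
  private readonly capUsd: number;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  // returns true if the request may be forwarded, and reserves its cost
  allow(estimatedCostUsd: number): boolean {
    // fail closed: a missing, non-finite, or negative estimate is refused
    if (!Number.isFinite(estimatedCostUsd) || 0 > estimatedCostUsd) return false;
    // refuse any request that would push spend past the cap
    if (this.spentUsd + estimatedCostUsd > this.capUsd) return false;
    this.spentUsd += estimatedCostUsd;
    return true;
  }

  remainingUsd(): number {
    return this.capUsd - this.spentUsd;
  }
}

const cap = new LocalBudgetCap(1.0);
console.log(cap.allow(0.5));  // first request fits under the cap
console.log(cap.allow(0.75)); // would exceed the cap, so it is refused
console.log(cap.allow(NaN));  // unusable estimate, refused
```

&lt;p&gt;a real deployment would also need to reset the counter per period and reconcile estimates against actual usage; both are left out here.&lt;/p&gt;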

&lt;h2&gt;
  
  
  tldr
&lt;/h2&gt;

&lt;p&gt;deepseek did the honest thing and published the schedule. the failure mode is still that 99% of teams won't read it until june 1. you have 27 days. the cheap version of preparing is item 1 (run the numbers); the durable version is item 5 (own the cap). the rest is somewhere in between.&lt;/p&gt;

&lt;p&gt;if you're running llm spend across deepseek + anthropic + openai and want a dashboard that flags the pre-cliff routing exposure today rather than after the invoice, that's what i've been building at &lt;a href="https://llmeter.org/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hn-week-deepseek-may-31-cliff" rel="noopener noreferrer"&gt;https://llmeter.org/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=hn-week-deepseek-may-31-cliff&lt;/a&gt;. open-source, agpl-3.0, no proxy in the request path. drop the sdk in, point it at your providers, see the cliff exposure as a number.&lt;/p&gt;

&lt;p&gt;happy to compare notes if you've already done the audit on your stack — what was the cache-hit delta vs the headline?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deepseek</category>
      <category>llm</category>
      <category>saas</category>
    </item>
    <item>
      <title>You Vibe-Coded Your SaaS Landing Page — Google Can't See It</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Mon, 04 May 2026 21:42:46 +0000</pubDate>
      <link>https://forem.com/amedinat/you-vibe-coded-your-saas-landing-page-google-cant-see-it-16cj</link>
      <guid>https://forem.com/amedinat/you-vibe-coded-your-saas-landing-page-google-cant-see-it-16cj</guid>
      <description>&lt;p&gt;Look, I get it. You used Lovable or Bolt, shipped a beautiful landing page in 3 hours, and felt like a god. The UI is slick, animations are butter. &lt;/p&gt;

&lt;p&gt;But it's been 3 weeks and you have 0 organic traffic. &lt;/p&gt;

&lt;p&gt;Here is why: you shipped a Client-Side Rendered (CSR) React app. Google hates those.&lt;/p&gt;

&lt;p&gt;I was digging into indexation stats recently. Google takes about 9x longer to index JS-heavy pages. If your site exceeds their rendering budget, indexation drops by up to 40%. &lt;/p&gt;

&lt;p&gt;When Googlebot hits your vibe-coded site, it sees a blank HTML file and a massive JS bundle. It has to queue it for rendering, parse the JS, execute it, and only &lt;em&gt;then&lt;/em&gt; sees the content. Most indie devs don't realize this until their Search Console stays flat for months.&lt;/p&gt;

&lt;p&gt;Vibe-coding tools default to SPA (Single Page Application) architectures because they're easier to generate and feel faster to the user. But for a landing page? It's SEO suicide.&lt;/p&gt;

&lt;p&gt;If you want organic users, you need SSR (Server-Side Rendering) or SSG (Static Site Generation). When Google hits the URL, the HTML needs to be there instantly. &lt;/p&gt;

&lt;p&gt;I built LLMeter (open-source LLM cost tracking) using Next.js on Vercel specifically for this reason. SSR out of the box. No waiting for Googlebot to execute JS. It just indexes.&lt;/p&gt;

&lt;p&gt;If you're stuck with a CSR landing page right now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prerender.io is an option, though it's a band-aid.&lt;/li&gt;
&lt;li&gt;Rebuild the marketing pages in Next.js/Astro. Keep the CSR app for the actual dashboard on a subdomain (&lt;code&gt;app.yoursaas.com&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don't let the AI tools trick you into bad architecture. Ship fast, but make sure Google can actually read it, tbh.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=2026-04-17-vibe-coding-seo-invisible" rel="noopener noreferrer"&gt;Check out LLMeter if you're dealing with LLM API costs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>seo</category>
      <category>llm</category>
    </item>
    <item>
      <title>GitHub Copilot's 27x Billing Trap is Closing — The Budget Guard Deadline</title>
      <dc:creator>John Medina</dc:creator>
      <pubDate>Sun, 03 May 2026 01:55:07 +0000</pubDate>
      <link>https://forem.com/amedinat/github-copilots-27x-billing-trap-is-closing-the-budget-guard-deadline-1ioc</link>
      <guid>https://forem.com/amedinat/github-copilots-27x-billing-trap-is-closing-the-budget-guard-deadline-1ioc</guid>
      <description>&lt;p&gt;GitHub Copilot is shifting to usage-based billing on June 1, 2026. If you've been relying on flat-rate pricing to shield you from the realities of AI coding costs, your runway just evaporated.&lt;/p&gt;

&lt;p&gt;We've been tracking a nasty trend in the wild: the "27x billing trap." When an AI coding assistant gets stuck in a recursive loop—hallucinating a fix, failing the test, and trying again—it burns tokens at maximum velocity. Under flat-rate billing, this was an invisible annoyance. Under usage-based billing, it's a catastrophic financial event.&lt;/p&gt;

&lt;p&gt;We saw one dev end up with a -$563 balance on a $5 prepaid account because provider-side limits aren't enforced in real time. They are &lt;em&gt;eventually consistent&lt;/em&gt;, and by the time the provider shuts you down, the damage is done.&lt;/p&gt;

&lt;p&gt;The clock is ticking. You have less than 30 days to implement hard enforcement before the new billing model takes effect. &lt;/p&gt;

&lt;p&gt;You don't need another dashboard that sends you a Slack alert &lt;em&gt;after&lt;/em&gt; your budget is blown. You need an enforcement layer that kills the request &lt;em&gt;before&lt;/em&gt; the network call is made.&lt;/p&gt;

&lt;p&gt;That's why we open-sourced LLM Budget Guard. It's a dead-simple, local SDK enforcement layer that cuts the cord the millisecond your hard cap is reached. No proxy routing latency, no cloud gateway single points of failure. Just deterministic budget enforcement.&lt;/p&gt;

&lt;p&gt;Self-host it and protect your runway: &lt;a href="https://llmeter.org?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=copilot-27x-budget-guard-deadline" rel="noopener noreferrer"&gt;llmeter.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>github</category>
      <category>ai</category>
      <category>opensource</category>
      <category>githubcopilot</category>
    </item>
  </channel>
</rss>
