<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tijo Gaucher</title>
    <description>The latest articles on Forem by Tijo Gaucher (@rapidclaw).</description>
    <link>https://forem.com/rapidclaw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3850323%2F4c57502d-d13a-4255-aa80-30e2ab22d035.jpeg</url>
      <title>Forem: Tijo Gaucher</title>
      <link>https://forem.com/rapidclaw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rapidclaw"/>
    <language>en</language>
    <item>
      <title>Running Gemma 3 next to your agent runtime: notes from a small shop</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 06 Apr 2026 08:04:51 +0000</pubDate>
      <link>https://forem.com/rapidclaw/running-gemma-3-next-to-your-agent-runtime-notes-from-a-small-shop-c79</link>
      <guid>https://forem.com/rapidclaw/running-gemma-3-next-to-your-agent-runtime-notes-from-a-small-shop-c79</guid>
      <description>&lt;p&gt;My brother Brandon and I run RapidClaw. Most days it's just the two of us, a handful of customers, and a few agents chugging along in production. A few months ago we started putting small open-weight models on the same box as the agent runtime — mostly Gemma 4, a bit of Phi-4 for comparison, some Qwen. This is a short write-up of what's actually worked and what hasn't.&lt;/p&gt;

&lt;p&gt;Nothing revolutionary here. I'm writing it because I searched for "agent + local Gemma" a bunch of times last quarter and mostly found benchmark posts, not lived-experience notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing we noticed
&lt;/h2&gt;

&lt;p&gt;The newest small models are small enough to fit on the same machine as the agent loop. That's the whole observation. Gemma 3 4B runs fine on a 24 GB GPU next to a Node process running our agent code. Phi-4 14B is tight but works. A year ago you needed a separate inference box, which meant a network hop, which meant we just paid for a hosted API and moved on.&lt;/p&gt;

&lt;p&gt;Now the tradeoff is different. You can keep the hosted model for the hard stuff and quietly route the cheap, high-volume calls to the local model. Hybrid, not replacement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually do
&lt;/h2&gt;

&lt;p&gt;We have four agents running in production right now. One of them — the one that classifies incoming support messages and decides which of the other agents to hand off to — used to make a hosted-model call per message. That single agent was roughly 80% of our inference spend because it ran on every message, even the obvious ones.&lt;/p&gt;

&lt;p&gt;We moved that classifier to Gemma 3 4B on the same box. The agent framework is unchanged; it just points at a local OpenAI-compatible endpoint (we're using Ollama for now; llama.cpp's server also works). The other three agents still call the hosted models when they need to reason about something real.&lt;/p&gt;

&lt;p&gt;That's it. One local model, four agents, one box. No Kubernetes, no model router, no fancy fallback chain.&lt;/p&gt;
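
&lt;p&gt;For the curious, here's roughly what that wiring looks like, sketched in Python for brevity (our actual runtime is Node). The endpoint URL and the &lt;code&gt;gemma3:4b&lt;/code&gt; tag are assumptions for a default Ollama install; swap in whatever your server exposes.&lt;/p&gt;

```python
import json
import urllib.request

# Sketch of the classifier call against a local OpenAI-compatible server.
# URL and model tag assume a default Ollama install (hypothetical here).
LOCAL_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(message):
    return {
        "model": "gemma3:4b",  # illustrative tag, not prescriptive
        "temperature": 0,
        "messages": [
            {"role": "system",
             "content": "Reply with exactly one word: billing, bug, or other."},
            {"role": "user", "content": message},
        ],
    }

def classify(message):
    req = urllib.request.Request(
        LOCAL_URL,
        data=json.dumps(build_payload(message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"].strip().lower()
```

&lt;p&gt;Because the endpoint speaks the OpenAI chat shape, the hosted and local paths share everything except the base URL.&lt;/p&gt;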

&lt;h2&gt;
  
  
  Numbers from our box
&lt;/h2&gt;

&lt;p&gt;Single machine, RTX 4090, one of our production workers. Measured over a week in March on real traffic, not a synthetic benchmark.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Median latency&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;th&gt;Cost per 1k calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hosted Sonnet-class&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;td&gt;~$4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosted mini/flash-class&lt;/td&gt;
&lt;td&gt;0.9s&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;td&gt;~$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 3 4B, local, same box&lt;/td&gt;
&lt;td&gt;0.25s&lt;/td&gt;
&lt;td&gt;0.6s&lt;/td&gt;
&lt;td&gt;~$0.04*&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Local cost is amortized GPU + power on a box we were already paying for. If you had to rent a GPU just for this, the numbers flip hard — more on that below.&lt;/p&gt;

&lt;p&gt;For the classifier workload specifically, Gemma 3 is good enough. It's not as sharp as the big hosted models, but "is this message a billing question or a bug report" doesn't need them. We compared a week of its outputs against the hosted model's outputs on the same messages — they agreed on about 94% of them. The 6% where they disagreed were mostly ambiguous messages where the hosted model wasn't obviously right either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gotchas we hit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cold starts are real.&lt;/strong&gt; The first request after the model unloads takes 8–15 seconds. We pin the model in memory with a keepalive. Obvious in hindsight.&lt;/p&gt;
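
&lt;p&gt;On Ollama, the pin is one small request: &lt;code&gt;keep_alive&lt;/code&gt; set to -1 asks the server to keep the model resident instead of unloading it after the default idle window. A minimal sketch, assuming a default local install (setting the server's &lt;code&gt;OLLAMA_KEEP_ALIVE&lt;/code&gt; environment variable achieves the same thing):&lt;/p&gt;

```python
import json
import urllib.request

# Warmup request: no prompt, just keep_alive of -1, which asks Ollama to
# load the model and hold it in memory indefinitely. Model tag is
# illustrative, not prescriptive.
def pin_payload(model="gemma3:4b"):
    return {"model": model, "keep_alive": -1}

def pin_model(url="http://localhost:11434/api/generate"):
    req = urllib.request.Request(
        url,
        data=json.dumps(pin_payload()).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```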

&lt;p&gt;&lt;strong&gt;VRAM math is tighter than you think.&lt;/strong&gt; Gemma 3 4B at Q4, plus an 8k context window, plus our Node process, plus the occasional burst of parallel requests: we hit OOM twice in the first week. We now cap concurrent local calls at 3 and queue the rest. Nothing fancy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt formats drift.&lt;/strong&gt; A prompt that worked cleanly on the hosted model produced mush on Gemma. Small models are less forgiving of vague instructions. We ended up maintaining two prompt versions — one terse and explicit for Gemma, one more conversational for the hosted model. Not ideal but it's only two prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval is annoying but necessary.&lt;/strong&gt; You can't just swap models and hope. We built a small eval set (about 200 labeled messages) and run it whenever we change the local model or the prompt. Takes five minutes. Worth it.&lt;/p&gt;
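
&lt;p&gt;The harness is nothing clever. A sketch of the shape; the labels are hypothetical and the 90% floor reflects our tolerance, not a standard:&lt;/p&gt;

```python
# Eval harness: run the candidate model over labeled messages and refuse
# to ship if agreement with the labels drops below a floor.
def agreement(labeled, predict):
    """labeled: list of (message, expected_label) pairs; predict: callable."""
    hits = sum(1 for msg, want in labeled if predict(msg) == want)
    return hits / len(labeled)

def gate(labeled, predict, floor=0.90):
    score = agreement(labeled, predict)
    if score >= floor:
        return score
    raise SystemExit(f"eval failed: {score:.1%} agreement, floor {floor:.0%}")
```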

&lt;h2&gt;
  
  
  When not to bother
&lt;/h2&gt;

&lt;p&gt;Honestly, most people reading this probably shouldn't do this yet. A few cases where it doesn't make sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low volume.&lt;/strong&gt; If you're making under ~10k inference calls a day, the hosted APIs are cheaper than any GPU you'd rent. Local only wins at volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You don't already have a box.&lt;/strong&gt; If you're renting a GPU purely to run Gemma 3, the math only works if you're saturating it. We could do this because we already had machines running the agent runtime with idle GPU capacity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The task actually needs the big model.&lt;/strong&gt; If you're doing code generation or multi-step planning, Gemma 3 4B will frustrate you. Use the hosted model and stop fighting it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're early.&lt;/strong&gt; If you're pre-product-market-fit, every hour spent on inference optimization is an hour not spent on the thing users actually care about. We only did this after the classifier bill started showing up on the monthly invoice.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'd try next
&lt;/h2&gt;

&lt;p&gt;Phi-4 14B for one of the agents that does light reasoning over structured data. We haven't moved it yet because the quality bar is higher and I haven't built the eval set for it. Probably in April.&lt;/p&gt;

&lt;p&gt;Also curious about Qwen 2.5 for a multilingual case we have, but that's further out.&lt;/p&gt;




&lt;p&gt;That's the whole post. Nothing dramatic — a classifier moved, a bill went down, we learned some boring operational lessons. Small open-weight models finally being small enough to share a box with the agent runtime is, for us, the thing that made any of this viable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tijo Gaucher runs RapidClaw (rapidclaw.dev) with his brother Brandon — managed hosting for AI agents. If you're running agents and curious about hybrid local/hosted setups, the site has more.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>selfhosted</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Token Cost Optimization for AI Agents: 7 Patterns That Cut Our Bill by 73%</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 06 Apr 2026 07:15:49 +0000</pubDate>
      <link>https://forem.com/rapidclaw/token-cost-optimization-for-ai-agents-7-patterns-that-cut-our-bill-by-73-1aie</link>
      <guid>https://forem.com/rapidclaw/token-cost-optimization-for-ai-agents-7-patterns-that-cut-our-bill-by-73-1aie</guid>
      <description>&lt;h1&gt;
  
  
  Token Cost Optimization for AI Agents: 7 Patterns That Cut Our Bill by 73%
&lt;/h1&gt;

&lt;p&gt;Six months ago our monthly LLM bill at &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt; hit a number I'd rather not print. We were running production AI agents across customer workloads, and every "let's just add one more tool call" was quietly compounding into a four-figure surprise on the invoice.&lt;/p&gt;

&lt;p&gt;I'm Tijo Gaucher, founder of RapidClaw. We build infrastructure for teams who want to ship AI agents without becoming full-time prompt engineers. After spending a quarter obsessing over our own token economics, we cut spend by 73% — without degrading agent quality. Here are the seven patterns that mattered most.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Prompt caching is the cheapest 90% win you'll ever ship
&lt;/h2&gt;

&lt;p&gt;If you're sending the same system prompt, tool definitions, or RAG context on every turn, you're paying full freight for tokens the model has already seen. Anthropic, OpenAI, and most major providers now support prompt caching with cache hits priced at roughly 10% of normal input tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: 4,200 input tokens/turn at full price
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LARGE_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TOOL_DEFINITIONS&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# After: same prompt, marked cacheable
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LARGE_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TOOL_DEFINITIONS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few lines of request structure. Cache hits are billed at roughly a tenth of normal input tokens, so the cached portion gets close to 90% cheaper (slightly less in practice once cache writes are counted). There's no excuse not to ship this today.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Route by complexity, not by habit
&lt;/h2&gt;

&lt;p&gt;Not every task needs your most expensive model. We built a tiny router that classifies incoming agent requests into three buckets and dispatches them to the cheapest model that can plausibly handle the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;          &lt;span class="c1"&gt;# ~$0.25/M input
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context_size&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50_000&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;         &lt;span class="c1"&gt;# ~$3/M input
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;              &lt;span class="c1"&gt;# default to cheap
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We escalate to the bigger model only when the cheap one returns low confidence or fails validation. Roughly 68% of our agent calls now resolve on the small model. That alone moved the needle more than any other optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Trim your tool definitions ruthlessly
&lt;/h2&gt;

&lt;p&gt;Tool/function schemas are tokens too. We audited ours and found 11 tools with descriptions averaging 180 tokens each, half of which were redundant explanation the model didn't actually need.&lt;/p&gt;

&lt;p&gt;Cut every tool description down to its single most informative sentence. Move worked examples into a separate retrievable doc the agent can fetch &lt;em&gt;only&lt;/em&gt; when it needs guidance. We saved ~1,400 tokens per turn just by editing JSON.&lt;/p&gt;
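
&lt;p&gt;A quick way to find the offenders before you start cutting: serialize each tool schema and estimate its token weight. The four-characters-per-token heuristic below is a crude assumption (a real tokenizer is more accurate), but it ranks the bloat correctly:&lt;/p&gt;

```python
import json

# Rank tool schemas by approximate token weight so you trim the heaviest
# descriptions first. ~4 chars per token is a rough heuristic.
def approx_tokens(schema):
    return len(json.dumps(schema)) // 4

def audit(tools):
    ranked = sorted(tools, key=approx_tokens, reverse=True)
    return [(t["name"], approx_tokens(t)) for t in ranked]
```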

&lt;h2&gt;
  
  
  4. Stop re-feeding the entire conversation history
&lt;/h2&gt;

&lt;p&gt;The naive agent loop ships the full message history on every turn. By turn 12 you're paying for turns 1–11 again. Three things help:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sliding window&lt;/strong&gt; — keep only the last N turns verbatim&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary compaction&lt;/strong&gt; — once history exceeds a threshold, ask a cheap model to summarize older turns into a 200-token recap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory extraction&lt;/strong&gt; — pull stable facts (user prefs, project IDs, decisions) into a structured memory store, then inject only the relevant rows
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compact_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cheap_summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Earlier context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Cap your tool-call loops
&lt;/h2&gt;

&lt;p&gt;The single biggest money pit in agent systems isn't the model — it's the runaway loop. An agent that retries a flaky tool 14 times will quietly burn through more budget than 200 normal sessions.&lt;/p&gt;

&lt;p&gt;Hard cap iterations. Add exponential backoff. Surface a clear error to the user instead of letting the model keep paying to re-try. Our default is 8 tool calls per turn with a budget guardrail that aborts the session if input tokens exceed a configured ceiling. You can read more about how we handle this in our &lt;a href="https://rapidclaw.dev/blog" rel="noopener noreferrer"&gt;agent runtime docs&lt;/a&gt;.&lt;/p&gt;
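
&lt;p&gt;A hedged sketch of that guardrail; the cap and ceiling mirror our defaults, and the &lt;code&gt;step&lt;/code&gt; callable is a stand-in for one tool-call round:&lt;/p&gt;

```python
import time

# Hard cap on tool calls per turn, exponential backoff between retries,
# and a token ceiling that aborts the session outright. Limits are ours;
# tune for your stack.
MAX_TOOL_CALLS = 8

def run_tools(step, token_ceiling=150_000, backoff=1.0):
    spent = 0
    for attempt in range(MAX_TOOL_CALLS):
        result, tokens = step()   # step returns (result or None, tokens used)
        spent += tokens
        if spent > token_ceiling:
            raise RuntimeError("token ceiling hit; aborting session")
        if result is not None:
            return result
        time.sleep(min(backoff * 2 ** attempt, 30))  # backoff before retry
    raise RuntimeError("tool-call cap reached; surface the error instead")
```

&lt;p&gt;Escalating the wait between retries keeps a flaky tool from hammering the model while you still pay per token.&lt;/p&gt;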

&lt;h2&gt;
  
  
  6. Stream and short-circuit
&lt;/h2&gt;

&lt;p&gt;If your agent's output gets parsed and acted on, you don't need to wait for the full completion. Stream the response and short-circuit as soon as you've got the structured field you need. We saved roughly 22% of output tokens on long-form generations by stopping early when a &lt;code&gt;&amp;lt;done&amp;gt;&lt;/code&gt; sentinel was emitted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(...):&lt;/span&gt;
    &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;done&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# stop paying for more tokens
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Self-host the cheap stuff
&lt;/h2&gt;

&lt;p&gt;Not every step in an agent pipeline needs a frontier model. Embeddings, classification, reranking, simple extraction — these run beautifully on small open models you can deploy on a single GPU box for a fixed monthly cost.&lt;/p&gt;

&lt;p&gt;We moved embeddings and intent classification onto a self-hosted setup and the marginal cost dropped to effectively zero. The frontier model still handles the hard reasoning, but the surrounding plumbing now runs on infrastructure we control. If you're curious how we deploy and scale these, we wrote up the full architecture on the &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Stacked together, here's what each pattern contributed to our 73% cut:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Contribution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt caching&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model routing&lt;/td&gt;
&lt;td&gt;19%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosting plumbing&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;History compaction&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool definition trim&lt;/td&gt;
&lt;td&gt;3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Loop caps + budget guard&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stream short-circuit&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The lesson isn't that any single trick is magical — it's that token economics is &lt;em&gt;additive&lt;/em&gt;. Five mediocre optimizations beat one heroic one, and they're far easier to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do first if I were starting over
&lt;/h2&gt;

&lt;p&gt;If I had to rebuild this from scratch tomorrow with one week to optimize, I'd ship in this order: prompt caching → loop caps → model routing → history compaction. Those four alone get you to roughly 60% savings and require no infrastructure changes.&lt;/p&gt;

&lt;p&gt;Everything else is polish.&lt;/p&gt;




&lt;p&gt;If you're building production agents and want a runtime that bakes these patterns in by default, that's exactly what we're building at &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt;. I'd love to hear how you're handling token economics in your own stack — drop a comment or hit me up.&lt;/p&gt;

&lt;p&gt;— Tijo&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>I replaced myself with AI agents and now my startup runs 60% faster</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 30 Mar 2026 02:45:03 +0000</pubDate>
      <link>https://forem.com/rapidclaw/i-replaced-myself-with-ai-agents-and-now-my-startup-runs-60-faster-3nj8</link>
      <guid>https://forem.com/rapidclaw/i-replaced-myself-with-ai-agents-and-now-my-startup-runs-60-faster-3nj8</guid>
      <description>&lt;p&gt;So about 6 months ago I was basically drowning. Solo founder, trying to build an AI platform, doing everything myself — investor outreach, pitch decks, dev work, customer support, content, SEO... you know the drill. I was working 14 hour days and still falling behind.&lt;/p&gt;

&lt;p&gt;Then I started using AI agents for real. Not just ChatGPT for writing emails — I mean actual autonomous agents that handle entire workflows end to end. And honestly it kinda changed everything about how I run my startup.&lt;/p&gt;

&lt;h2&gt;
  
  
  what actually happened
&lt;/h2&gt;

&lt;p&gt;I'm building &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;Rapid Claw&lt;/a&gt; — it's a platform for deploying and managing OpenClaw AI agents. OpenClaw is an open-source AI co-founder framework, and we make it stupid easy to spin up instances and manage them without dealing with all the infra headaches.&lt;/p&gt;

&lt;p&gt;But before we had the platform ready, I was running these agents manually. Setting up servers, configuring environments, managing state, handling crashes at 2am... it was a mess. I was spending more time babysitting the agents than actually building my product.&lt;/p&gt;

&lt;p&gt;The irony of building an agent hosting platform while struggling to host your own agents is not lost on me lol.&lt;/p&gt;

&lt;h2&gt;
  
  
  the numbers that made me rethink everything
&lt;/h2&gt;

&lt;p&gt;Here's roughly what I was spending per month running agents the "hard way":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPS instances: ~$400/mo (3 servers on Hetzner)&lt;/li&gt;
&lt;li&gt;API costs (OpenAI + Anthropic): ~$800/mo
&lt;/li&gt;
&lt;li&gt;My time on devops/firefighting: ~25 hrs/mo (that's worth... a lot when you're a solo founder)&lt;/li&gt;
&lt;li&gt;Random tools and services: ~$200/mo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: ~$1,400/mo + 25 hours of my life&lt;/p&gt;

&lt;p&gt;After I dogfooded our own platform and moved everything to managed Rapid Claw instances, it dropped to about $600/mo total and I spend maybe 3-4 hours a month on agent ops. The rest of that time goes into actually building features and talking to users.&lt;/p&gt;

&lt;h2&gt;
  
  
  what the agents actually do for me
&lt;/h2&gt;

&lt;p&gt;I'm not just using agents for one thing — I have multiple OpenClaw instances handling different parts of the business:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research agent&lt;/strong&gt; — scrapes competitor pricing, tracks Product Hunt launches in my space, monitors relevant subreddits. Used to spend 5+ hours a week on this manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content agent&lt;/strong&gt; — drafts blog posts, helps with SEO research, generates social media content. I still edit everything but starting from a solid draft vs a blank page saves me hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dev assistant agent&lt;/strong&gt; — reviews PRs, writes tests, handles repetitive code tasks. This one alone probably saves me 10 hours a week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outreach agent&lt;/strong&gt; — personalizes cold emails for investor outreach, researches potential partners. Way better than the generic templates I was sending before.&lt;/p&gt;

&lt;h2&gt;
  
  
  the part nobody warns you about
&lt;/h2&gt;

&lt;p&gt;The thing that surprised me most wasn't the cost savings or the time savings. It was how much mental energy it freed up.&lt;/p&gt;

&lt;p&gt;When you're a solo founder, context switching is the real killer. Going from writing code to researching competitors to drafting emails to fixing a server — your brain never gets to go deep on anything. &lt;/p&gt;

&lt;p&gt;Having agents handle the repetitive stuff means I can actually focus on the 2-3 things that matter most each day. That's honestly been the biggest win.&lt;/p&gt;

&lt;h2&gt;
  
  
  why I built rapid claw
&lt;/h2&gt;

&lt;p&gt;After going through all this pain myself, I realized other founders are dealing with the exact same thing. Everyone wants to use AI agents but nobody wants to deal with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up and maintaining servers&lt;/li&gt;
&lt;li&gt;Managing multiple agent instances
&lt;/li&gt;
&lt;li&gt;Handling security and permissions (you do NOT want an agent with unrestricted access to your systems btw)&lt;/li&gt;
&lt;li&gt;Monitoring and logging&lt;/li&gt;
&lt;li&gt;Scaling up when things get busy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So that's basically why &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;Rapid Claw&lt;/a&gt; exists. You pick your OpenClaw agent template, configure it, deploy it, and we handle all the infra. We've got this permission firewall thing that lets you control exactly what each agent can access, which honestly should be table stakes for anyone running agents in production but most people just... don't do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  if you're thinking about trying agents
&lt;/h2&gt;

&lt;p&gt;A few things I wish someone had told me when I started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with one workflow.&lt;/strong&gt; Don't try to automate everything at once. Pick the most repetitive task you do and agent-ify that first. For me it was competitor research.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expect to iterate.&lt;/strong&gt; Your first agent config will suck. That's fine. The second one will be way better. By the third you'll have a solid sense of what works.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't give agents more access than they need.&lt;/strong&gt; Seriously. An agent with write access to your production database is a disaster waiting to happen. Principle of least privilege, always.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track the time savings.&lt;/strong&gt; It's easy to underestimate how much time agents save you. I started logging it and was genuinely surprised — went from ~60 hrs/week of work to about 35 hrs for the same output.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  wrapping up
&lt;/h2&gt;

&lt;p&gt;I went from being a burned out solo founder working insane hours to actually having time to think strategically about my business. The agents aren't perfect and they definitely need supervision, but they've basically become my team.&lt;/p&gt;

&lt;p&gt;If you're a founder or indie hacker who's been curious about agents but hasn't taken the plunge — just start. Even a basic research agent will change how you work.&lt;/p&gt;

&lt;p&gt;Happy to answer questions about my setup in the comments. Been running this way for a few months now and have learned a ton about what works and what definitely doesn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;btw if you want to try running your own OpenClaw agents without the infra pain, check out &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;rapidclaw.dev&lt;/a&gt; — we're still early but the free tier is enough to get started.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>cloud</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
