<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mervin</title>
    <description>The latest articles on Forem by Mervin (@mervindublin).</description>
    <link>https://forem.com/mervindublin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906971%2F89876791-af06-4c66-a1b7-99df14df1b85.png</url>
      <title>Forem: Mervin</title>
      <link>https://forem.com/mervindublin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mervindublin"/>
    <language>en</language>
    <item>
      <title>How I Cut My LLM Costs by 90% Without Changing My App Logic</title>
      <dc:creator>Mervin</dc:creator>
      <pubDate>Thu, 21 May 2026 20:44:33 +0000</pubDate>
      <link>https://forem.com/mervindublin/how-i-cut-my-llm-costs-by-90-without-changing-my-app-logic-278f</link>
      <guid>https://forem.com/mervindublin/how-i-cut-my-llm-costs-by-90-without-changing-my-app-logic-278f</guid>
      <description>&lt;h1&gt;
  
  
  How I Cut My LLM Costs by 90% Without Changing My App Logic
&lt;/h1&gt;

&lt;p&gt;There’s a particular kind of dread that comes with checking your OpenAI billing dashboard mid-month.&lt;/p&gt;

&lt;p&gt;I’ve been building a news automation hub that runs 14 editorial workspaces — summarizing, rewriting, fact-checking, SEO-tagging, and translation pipelines around the clock.&lt;/p&gt;

&lt;p&gt;The AI layer was already fairly optimized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq&lt;/li&gt;
&lt;li&gt;Gemini Flash&lt;/li&gt;
&lt;li&gt;DeepSeek&lt;/li&gt;
&lt;li&gt;OpenRouter&lt;/li&gt;
&lt;li&gt;provider rotation&lt;/li&gt;
&lt;li&gt;fallback logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the final fallback was still OpenAI, and once rate limits hit, costs climbed faster than expected.&lt;/p&gt;

&lt;p&gt;What I needed wasn’t more routing logic.&lt;/p&gt;

&lt;p&gt;I needed a smarter endpoint.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Problem
&lt;/h1&gt;

&lt;p&gt;My setup already rotated between multiple providers, but the architecture had a weakness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Provider exhausted
    -&amp;gt; fallback
        -&amp;gt; OpenAI
            -&amp;gt; credits disappear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The more providers I added, the messier things became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more API keys&lt;/li&gt;
&lt;li&gt;more retry logic&lt;/li&gt;
&lt;li&gt;more conditional branches&lt;/li&gt;
&lt;li&gt;more provider-specific handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I was optimizing infrastructure with application code.&lt;/p&gt;

&lt;p&gt;That was the mistake.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Fix
&lt;/h1&gt;

&lt;p&gt;After digging through self-hosted AI tooling, I found &lt;code&gt;freellmapi&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It’s a lightweight OpenAI-compatible proxy that automatically routes requests across multiple free-tier LLM providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq&lt;/li&gt;
&lt;li&gt;Cerebras&lt;/li&gt;
&lt;li&gt;SambaNova&lt;/li&gt;
&lt;li&gt;Cloudflare Workers AI&lt;/li&gt;
&lt;li&gt;GitHub Models&lt;/li&gt;
&lt;li&gt;OpenRouter free models&lt;/li&gt;
&lt;li&gt;and others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined free-tier capacity: roughly &lt;strong&gt;800M tokens/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The interesting part is that the routing happens inside the proxy — not inside your app.&lt;/p&gt;




&lt;h1&gt;
  
  
  My Integration
&lt;/h1&gt;

&lt;p&gt;The integration took less than an hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Deploy the proxy
&lt;/h2&gt;

&lt;p&gt;I ran it on my existing VPS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 20&lt;/li&gt;
&lt;li&gt;~40MB idle RAM&lt;/li&gt;
&lt;li&gt;localhost only&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Add provider credentials
&lt;/h2&gt;

&lt;p&gt;I added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq key&lt;/li&gt;
&lt;li&gt;Cloudflare credentials&lt;/li&gt;
&lt;li&gt;OpenRouter key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;inside the admin panel.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Point my app to a single endpoint
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:3001/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LOCAL_ROUTER_KEY&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was basically it.&lt;/p&gt;

&lt;p&gt;The important detail:&lt;/p&gt;

&lt;p&gt;I stopped specifying models for non-critical tasks.&lt;/p&gt;

&lt;p&gt;Instead of forcing a specific provider, I let the proxy auto-route requests to whatever free provider was currently available.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App
  -&amp;gt; freellmapi
      -&amp;gt; Groq
      -&amp;gt; Cloudflare Workers AI
      -&amp;gt; Cerebras
      -&amp;gt; SambaNova
      -&amp;gt; OpenRouter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Groq rate-limited:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;another provider picked up the request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a provider became slow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;routing shifted automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My application code never needed to know.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Result
&lt;/h1&gt;

&lt;p&gt;Within 24 hours:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI usage dropped by ~90%&lt;/li&gt;
&lt;li&gt;background AI tasks became almost entirely free-tier&lt;/li&gt;
&lt;li&gt;no additional retry logic was needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly:&lt;br&gt;
I removed provider chaos from my application layer.&lt;/p&gt;




&lt;h1&gt;
  
  
  What I Learned
&lt;/h1&gt;

&lt;p&gt;When engineers hit rate limits, the instinct is usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add more providers&lt;/li&gt;
&lt;li&gt;add more fallback logic&lt;/li&gt;
&lt;li&gt;add more code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But sometimes the better solution is adding an abstraction layer that absorbs the complexity for you.&lt;/p&gt;

&lt;p&gt;Another realization:&lt;/p&gt;

&lt;p&gt;Most AI tasks do &lt;strong&gt;not&lt;/strong&gt; require a specific premium model.&lt;/p&gt;

&lt;p&gt;For:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summaries&lt;/li&gt;
&lt;li&gt;tagging&lt;/li&gt;
&lt;li&gt;drafts&lt;/li&gt;
&lt;li&gt;translations&lt;/li&gt;
&lt;li&gt;background enrichment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…almost any decent modern 70B model works fine.&lt;/p&gt;




&lt;h1&gt;
  
  
  Caveats
&lt;/h1&gt;

&lt;p&gt;Free-tier infrastructure has tradeoffs.&lt;/p&gt;

&lt;p&gt;Some providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;have cold starts&lt;/li&gt;
&lt;li&gt;introduce latency spikes&lt;/li&gt;
&lt;li&gt;become temporarily unavailable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For real-time user-facing chat systems, you should test failover carefully.&lt;/p&gt;

&lt;p&gt;For async pipelines and batch jobs, though, it’s been surprisingly solid.&lt;/p&gt;

&lt;p&gt;Also:&lt;br&gt;
run this on infrastructure you control.&lt;/p&gt;

&lt;p&gt;A proxy like this handles upstream API keys — don’t hand that responsibility to random hosted services.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thought
&lt;/h1&gt;

&lt;p&gt;The biggest optimization wasn’t changing models.&lt;/p&gt;

&lt;p&gt;It was removing complexity from the layer that had to manage them.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Turn ~800M Free AI Tokens Into a Single OpenAI API with FreeLLMAPI</title>
      <dc:creator>Mervin</dc:creator>
      <pubDate>Thu, 21 May 2026 08:21:17 +0000</pubDate>
      <link>https://forem.com/mervindublin/turn-800m-free-ai-tokens-into-a-single-openai-api-with-freellmapi-2gm9</link>
      <guid>https://forem.com/mervindublin/turn-800m-free-ai-tokens-into-a-single-openai-api-with-freellmapi-2gm9</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Every major AI lab now offers a free tier. Gemini, Groq, Mistral, Cerebras — they all give you a few million tokens a month, a few thousand requests a day.&lt;/p&gt;

&lt;p&gt;On paper, that's generous. In practice, you end up juggling 14 different SDKs, 14 rate limits, and 14 places a request can silently fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FreeLLMAPI&lt;/strong&gt; solves exactly that.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;It's a self-hosted proxy that aggregates free tiers from 14 providers behind a &lt;strong&gt;single &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint&lt;/strong&gt; — fully compatible with the OpenAI SDK.&lt;/p&gt;

&lt;p&gt;Supported providers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Notable Models&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini&lt;/td&gt;
&lt;td&gt;2.5 Pro / Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Groq&lt;/td&gt;
&lt;td&gt;Llama 4, Qwen, Kimi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cerebras&lt;/td&gt;
&lt;td&gt;Llama 3.3, Qwen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SambaNova&lt;/td&gt;
&lt;td&gt;Llama 3.3 70B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA NIM&lt;/td&gt;
&lt;td&gt;Full catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;La Plateforme&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenRouter&lt;/td&gt;
&lt;td&gt;Free-tier models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Models&lt;/td&gt;
&lt;td&gt;GPT-4o, Llama, Phi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hugging Face&lt;/td&gt;
&lt;td&gt;Inference Providers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare&lt;/td&gt;
&lt;td&gt;Workers AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zhipu&lt;/td&gt;
&lt;td&gt;GLM-4 series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moonshot&lt;/td&gt;
&lt;td&gt;Kimi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax&lt;/td&gt;
&lt;td&gt;abab / hailuo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Combined: roughly &lt;strong&gt;~800M tokens/month&lt;/strong&gt; across all providers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Zero Code Changes
&lt;/h2&gt;

&lt;p&gt;Point your existing OpenAI SDK at &lt;code&gt;localhost:3001/v1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3001/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;freellmapi-your-unified-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# router picks the best available
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise the fall of Rome in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Routed via:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-routed-via&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every response includes an &lt;code&gt;X-Routed-Via&lt;/code&gt; header so you know which provider actually served the request.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Highlights
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Automatic failover&lt;/strong&gt; — On 429 / timeout / 5xx, the router cools down the key and retries the next provider in your chain, up to 20 attempts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sticky sessions&lt;/strong&gt; — Multi-turn conversations stay on the same model for 30 minutes. This matters more than it sounds — switching models mid-conversation causes subtle hallucination spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-key rate tracking&lt;/strong&gt; — RPM, RPD, TPM, and TPD counters per &lt;code&gt;(platform, model, key)&lt;/code&gt;. The router always picks a key that's under its caps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encrypted key storage&lt;/strong&gt; — AES-256-GCM before hitting SQLite. Upstream provider keys never leave your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Admin dashboard&lt;/strong&gt; — React + Vite UI to manage keys, reorder the fallback chain, inspect analytics, and test prompts in a playground.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lightweight&lt;/strong&gt; — Runs on a Raspberry Pi 4 at ~40MB RAM idle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup in 3 Lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/tashfeenahmed/freellmapi
&lt;span class="nb"&gt;cd &lt;/span&gt;freellmapi &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install
cp&lt;/span&gt; .env.example .env &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;localhost:5173&lt;/code&gt;, add your provider API keys, grab your unified key → done.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Part
&lt;/h2&gt;

&lt;p&gt;A few things the README says clearly, and you should know upfront:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligence degrades throughout the day.&lt;/strong&gt; Gemini 2.5 Pro and GPT-4o (via GitHub Models) have the lowest daily caps. Once they're exhausted, the router falls back to smaller models. Expect effective quality to drop in the late hours — then reset at UTC midnight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool calling and vision are not yet supported.&lt;/strong&gt; Text-only for now. PRs are welcome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency is unpredictable.&lt;/strong&gt; Cerebras and Groq are extremely fast. Others are not. You get whichever one is available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personal use only.&lt;/strong&gt; No multi-tenant auth. Don't expose this to the internet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tiers change without notice.&lt;/strong&gt; When a provider tightens limits, you'll see 429s until the catalog is updated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;✅ Building AI agents or coding assistants and want to prototype without spending money upfront&lt;/p&gt;

&lt;p&gt;✅ Researchers and students who hit rate limits on one provider and want seamless fallback&lt;/p&gt;

&lt;p&gt;✅ Anyone tired of maintaining multiple SDK integrations&lt;/p&gt;

&lt;p&gt;❌ Production workloads — use a paid API with an SLA&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick ToS Note
&lt;/h2&gt;

&lt;p&gt;The project includes a detailed review of each provider's terms. Most are fine for single-user personal use. Notable exceptions: &lt;strong&gt;Cohere's trial ToS explicitly forbids personal/household use&lt;/strong&gt;, and &lt;strong&gt;NVIDIA NIM's free tier is scoped to evaluation only&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Read the full table in the README before adding keys.&lt;/p&gt;




&lt;p&gt;FreeLLMAPI is MIT licensed and actively welcoming contributors — especially for adding embeddings, tool calling, and new providers.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/tashfeenahmed/freellmapi" rel="noopener noreferrer"&gt;github.com/tashfeenahmed/freellmapi&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>devtools</category>
      <category>llm</category>
    </item>
    <item>
      <title>Gemma 4 as an On-Call SRE: Turning Alert Spam into One Reasoned Incident</title>
      <dc:creator>Mervin</dc:creator>
      <pubDate>Thu, 21 May 2026 06:53:34 +0000</pubDate>
      <link>https://forem.com/mervindublin/gemma-4-as-an-on-call-sre-turning-alert-spam-into-one-reasoned-incident-be3</link>
      <guid>https://forem.com/mervindublin/gemma-4-as-an-on-call-sre-turning-alert-spam-into-one-reasoned-incident-be3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;description: "I rebuilt my alert pipeline around Gemma 4 26B MoE. It now groups cascading alerts into a single incident and writes the root cause for me. Architecture, demo, and why MoE — not the dense 31B — was the right tool."&lt;/p&gt;

&lt;h2&gt;
  
  
  tags: gemmachallenge, gemma, ai, devops
&lt;/h2&gt;

&lt;p&gt;I had a working alerts service — Postgres, BullMQ, rules engine, Telegram bot. Classic stuff. It also produced the classic problem: 4 separate notifications for what was obviously &lt;strong&gt;one&lt;/strong&gt; incident, no causal narrative, no fix suggestion. So I bolted a single component onto it: a &lt;strong&gt;Gemma 4 26B MoE&lt;/strong&gt; "SRE brain" that reads correlated events and writes the postmortem before I finish my coffee.&lt;/p&gt;

&lt;p&gt;Demo. &lt;a href="https://github.com/melyx-id/alert-service" rel="noopener noreferrer"&gt;https://github.com/melyx-id/alert-service&lt;/a&gt;&lt;br&gt;
Repo (single self-contained NestJS service): &lt;code&gt;/opt/alert-service&lt;/code&gt; on the host.&lt;/p&gt;

&lt;p&gt;The intentional pick: &lt;strong&gt;&lt;code&gt;google/gemma-4-26B-A4B-it&lt;/code&gt;&lt;/strong&gt; (26B MoE, 4B active) — not the dense 31B. Reasoning below.&lt;/p&gt;


&lt;h2&gt;
  
  
  The problem I actually had
&lt;/h2&gt;

&lt;p&gt;Last week our &lt;code&gt;api-gateway&lt;/code&gt; hiccupped after a deploy. Telegram fired:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚨 Deploy #441 promoted to production
🔴 Redis connection timeout spike (p99 4.2s)
🚨 5xx error rate surged 340% (12% of traffic)
🔴 Checkout latency p95 = 8.7s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four pings. Four pages. I had to assemble the story myself: &lt;em&gt;the deploy caused the redis pool to exhaust, which caused 5xx, which broke checkout.&lt;/em&gt; Obvious in retrospect. &lt;strong&gt;Cognitive load at 2am, not so obvious.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the gap I wanted Gemma to close.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Webhooks / app events
                            │
                            ▼
                  ┌─────────────────────┐
                  │ /events/incident    │ (Fastify + NestJS)
                  └────────┬────────────┘
                           │
                  ┌────────▼────────┐
                  │  AlertsService  │  dedup → Postgres
                  └────────┬────────┘
                           │
                  ┌────────▼────────────────┐
                  │  IncidentsService       │
                  │  • signature(service)   │
                  │  • find OPEN incident   │
                  │    in last 10 min       │
                  │  • attach alert         │
                  └────────┬────────────────┘
                           │
                           ▼
                  ┌─────────────────────────┐
                  │  AnalysisService        │
                  │  HF Inference Router →  │
                  │  gemma-4-26B-A4B-it     │
                  │  (system prompt: SRE)   │
                  └────────┬────────────────┘
                           │
       ┌───────────────────┼────────────────────┐
       ▼                   ▼                    ▼
  Telegram (HTML +    Dashboard (Alpine,    Postgres
   inline buttons:    polls /incidents)     (timeline, aiFixes,
   ACK / RESOLVE /                          aiConfidence,
   Retry AI)                                aiRootCause)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key idea: an &lt;strong&gt;Incident&lt;/strong&gt; is the unit, not an Alert. An incident gathers all alerts from the same &lt;code&gt;service&lt;/code&gt; within a 10-minute window. Every new alert in the window re-invokes Gemma with the &lt;em&gt;full chronological stack&lt;/em&gt; — so the analysis improves with context instead of repeating with each ping.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Gemma 4 26B MoE (and not the dense 31B)
&lt;/h2&gt;

&lt;p&gt;The challenge specifically asks why each model is the right tool. Here's my honest answer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;What incident analysis needs&lt;/th&gt;
&lt;th&gt;26B MoE&lt;/th&gt;
&lt;th&gt;31B Dense&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Workload shape&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Bursty&lt;/strong&gt; (idle → 3-6 events in 60s → idle)&lt;/td&gt;
&lt;td&gt;✅ sparse activation = lower TTFT per call&lt;/td&gt;
&lt;td&gt;dense is always-on cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning depth&lt;/td&gt;
&lt;td&gt;Multi-step causal chain (deploy → pool → 5xx → checkout)&lt;/td&gt;
&lt;td&gt;✅ MoE benchmarks competitive with 31B on reasoning&lt;/td&gt;
&lt;td&gt;slightly better, marginal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long context&lt;/td&gt;
&lt;td&gt;Up to 128K — we send growing event timelines&lt;/td&gt;
&lt;td&gt;✅ both fine&lt;/td&gt;
&lt;td&gt;✅ both fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per analysis&lt;/td&gt;
&lt;td&gt;Want sub-cent&lt;/td&gt;
&lt;td&gt;✅ 4B active params → cheaper inference&lt;/td&gt;
&lt;td&gt;higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency budget&lt;/td&gt;
&lt;td&gt;&amp;lt;10s per call (on-call patience)&lt;/td&gt;
&lt;td&gt;✅ ~4–7s observed&lt;/td&gt;
&lt;td&gt;~6–9s observed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For an &lt;strong&gt;incident analyst&lt;/strong&gt; workload — short, bursty, but reasoning-heavy — MoE was the right tool. I kept the 31B Dense wired in as automatic fallback for when the MoE provider 429's. Both go through the HuggingFace Inference Router using the same OpenAI-compatible interface (&lt;code&gt;/v1/chat/completions&lt;/code&gt;) which made the fallback a one-line config swap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/modules/analysis/analysis.service.ts&lt;/span&gt;
&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GEMMA_MODEL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google/gemma-4-26B-A4B-it&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fallbackModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GEMMA_FALLBACK_MODEL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google/gemma-4-31B-it&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A subtle gotcha worth flagging: &lt;strong&gt;HF Router model IDs are case-sensitive&lt;/strong&gt;. &lt;code&gt;gemma-4-26b-a4b-it&lt;/code&gt; returns 400 model_not_found. &lt;code&gt;gemma-4-26B-A4B-it&lt;/code&gt; works. Lost 30 minutes to that.&lt;/p&gt;




&lt;h2&gt;
  
  
  The system prompt that actually mattered
&lt;/h2&gt;

&lt;p&gt;The interesting part of the build wasn't the plumbing — it was getting Gemma to &lt;em&gt;reason&lt;/em&gt; rather than &lt;em&gt;summarize&lt;/em&gt;. My first prompt produced confident-sounding restatements of the input ("Redis is timing out and 5xx errors are happening"). Useless.&lt;/p&gt;

&lt;p&gt;What worked was framing it as a senior person with a strong opinion about what &lt;em&gt;qualifies&lt;/em&gt; as a root cause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a senior Site Reliability Engineer with 10+ years on-call experience...

Rules:
- Prefer concrete causal chains over vague language
  ("connection pool exhaustion after deploy #441" beats "service degradation")
- If a deploy event is present, evaluate whether it is the likely trigger
- Severity: CRITICAL = revenue path or full outage
- Confidence: be honest. 0.5 means "plausible but unverified".
  Above 0.85 only when the causal chain is clear.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two design choices behind this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Force a causal chain, not a summary.&lt;/strong&gt; Without this, Gemma reflexively rewrites symptoms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence as a contract.&lt;/strong&gt; When I tell it "0.5 = plausible but unverified", it actually self-rates lower on weak signal. With the redis cascade demo, the first alert (deploy event alone) returned confidence 50%. By the third alert it hit 95% — &lt;em&gt;because the causal chain became visible&lt;/em&gt;. The model is policing its own certainty.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What the demo actually looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;npm run demo:redis

&lt;span class="o"&gt;===&lt;/span&gt; Demo scenario: Redis &lt;span class="nb"&gt;timeout &lt;/span&gt;cascade after deploy &lt;span class="c"&gt;#441 ===&lt;/span&gt;

  &lt;span class="o"&gt;[&lt;/span&gt;1/4] &lt;span class="o"&gt;(&lt;/span&gt;4528ms&lt;span class="o"&gt;)&lt;/span&gt; LOW      Deploy &lt;span class="c"&gt;#441 promoted to production&lt;/span&gt;
     ↳ grouped into INC-260520-001 &lt;span class="o"&gt;(&lt;/span&gt;conf 50%, google/gemma-4-26B-A4B-it&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt;2/4] &lt;span class="o"&gt;(&lt;/span&gt;7616ms&lt;span class="o"&gt;)&lt;/span&gt; HIGH     Redis connection &lt;span class="nb"&gt;timeout &lt;/span&gt;spike &lt;span class="o"&gt;(&lt;/span&gt;p99 4.2s&lt;span class="o"&gt;)&lt;/span&gt;
     ↳ grouped into INC-260520-001 &lt;span class="o"&gt;(&lt;/span&gt;conf 90%, google/gemma-4-26B-A4B-it&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt;3/4] &lt;span class="o"&gt;(&lt;/span&gt;5634ms&lt;span class="o"&gt;)&lt;/span&gt; CRITICAL 5xx error rate surged 340% &lt;span class="o"&gt;(&lt;/span&gt;12% of traffic&lt;span class="o"&gt;)&lt;/span&gt;
     ↳ grouped into INC-260520-001 &lt;span class="o"&gt;(&lt;/span&gt;conf 95%, google/gemma-4-26B-A4B-it&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt;4/4] &lt;span class="o"&gt;(&lt;/span&gt;4407ms&lt;span class="o"&gt;)&lt;/span&gt; HIGH     Checkout latency p95 &lt;span class="o"&gt;=&lt;/span&gt; 8.7s
     ↳ grouped into INC-260520-001 &lt;span class="o"&gt;(&lt;/span&gt;conf 95%, google/gemma-4-26B-A4B-it&lt;span class="o"&gt;)&lt;/span&gt;

Gemma 4 final analysis:
  root cause : Deploy &lt;span class="c"&gt;#441 introduced a regression causing Redis connection&lt;/span&gt;
               pool exhaustion, leading to request queuing and 5xx errors
               on the checkout path.
  impact     : Users are experiencing high latency and a 12% failure rate
               during the checkout process, directly impacting revenue.
  severity   : CRITICAL  &lt;span class="o"&gt;(&lt;/span&gt;auto-escalated from initial LOW&lt;span class="o"&gt;)&lt;/span&gt;
  confidence : 95%
  fixes:
    - Roll back api-gateway to the previous stable version &lt;span class="o"&gt;(&lt;/span&gt;v2.3.3&lt;span class="o"&gt;)&lt;/span&gt;
    - Increase Redis connection pool size as a temporary mitigation
      &lt;span class="k"&gt;if &lt;/span&gt;rollback is delayed
    - Investigate commit a1b2c3d &lt;span class="k"&gt;for &lt;/span&gt;unclosed Redis connections
      or inefficient session lookups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: severity was &lt;strong&gt;auto-escalated&lt;/strong&gt;. The first event was tagged LOW (a deploy isn't itself a problem). Gemma rewrote the incident's severity to CRITICAL after seeing the cascading impact — exactly what a human SRE would do.&lt;/p&gt;

&lt;p&gt;Before/after view on the dashboard makes this concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before&lt;/strong&gt; (raw alerts pane): 4 separate-looking entries, no narrative, on-call paged 4 times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After&lt;/strong&gt; (Gemma pane): 1 grouped incident, root cause + impact + fixes + 95% confidence, on-call paged once with all context inline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same data. Different outcome.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things I deliberately did NOT do (yet)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent reasoning&lt;/strong&gt; (DB-specialist, network-specialist, summarizer). LangGraph would slot in cleanly, but for the use case — small bursts, single service per incident — one well-prompted call beats four coordinated ones in latency. Multi-agent is on the roadmap once I'm grouping across services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector search for similar past incidents.&lt;/strong&gt; pgvector is already running on the host; the hook is in &lt;code&gt;IncidentsService.groupAndAnalyze&lt;/code&gt;. Will add when there are &amp;gt;50 historical incidents to retrieve from.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Ollama.&lt;/strong&gt; Tempting for privacy, but my VPS is 4GB RAM and runs ~15 other services. The HF Router gives me the same Gemma 4 weights without evicting half my fleet. If you're on dedicated hardware, swap the endpoint — the prompt and grouping logic don't change.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Production-y bits that came along for the ride
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dedupe + retry.&lt;/strong&gt; Cache key = &lt;code&gt;sha1(title:source)&lt;/code&gt;, 2-min TTL. Stops a runaway cron from re-analyzing the same payload 60x.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telegram inline keyboard&lt;/strong&gt;: ACK / RESOLVE / &lt;strong&gt;Retry AI&lt;/strong&gt; / open dashboard. The Retry AI button is my favorite — it re-invokes Gemma with the current event stack. Cheap second opinion when the first reasoning felt off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity escalation&lt;/strong&gt;. The incident's stored severity is &lt;code&gt;max(human-rule severity, AI severity)&lt;/code&gt;. AI can upgrade LOW→CRITICAL but cannot downgrade a CRITICAL classification, by design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence as UI signal&lt;/strong&gt;. The dashboard shows &lt;code&gt;conf 95%&lt;/code&gt; next to every root cause. Below 70% the UI hints "consider re-analysis or wait for more events."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Stack summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NestJS 11&lt;/strong&gt; (Fastify) — existing service, ~30 LOC of wiring to add the Gemma layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prisma + Postgres&lt;/strong&gt; — 1 new model (&lt;code&gt;Incident&lt;/code&gt;), 3 new columns on &lt;code&gt;Alert&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Inference Router&lt;/strong&gt; — &lt;code&gt;google/gemma-4-26B-A4B-it&lt;/code&gt; primary, &lt;code&gt;gemma-4-31B-it&lt;/code&gt; fallback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alpine.js + Tailwind CDN&lt;/strong&gt; — single-file dashboard, polls /incidents every 5s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telegram bot&lt;/strong&gt; — HTML messages with inline keyboard, HMAC-signed callbacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Single &lt;code&gt;npm run demo:redis&lt;/code&gt; reproduces the entire flow from cold start.&lt;/p&gt;




&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;p&gt;I expected Gemma to be good at the language — paraphrasing logs, polishing summaries. What I didn't expect was how reliably it &lt;em&gt;upgrades&lt;/em&gt; severity. The first event in my demo (a deploy) is mundane. The model only paints it as CRITICAL once it has the second and third alerts to connect the chain. That's not pattern matching, that's reasoning over a sequence. It's the behavior I'd want from a junior SRE on their third month.&lt;/p&gt;

&lt;p&gt;The other surprise: &lt;strong&gt;confidence actually moves.&lt;/strong&gt; Most LLM "confidence" outputs are 0.9 forever. Telling Gemma in the system prompt that 0.5 is honest got me back a useful spread of values that I can now drive UI on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3ijuzx7ochv18f8rnrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3ijuzx7ochv18f8rnrs.png" alt="Demo Page" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If it's empty when you look, the demo data may have expired — you're welcome to mentally substitute "redis cascade after deploy #441, severity CRITICAL, 95% confidence, with fixes." Or watch the next real incident roll through, which is the whole point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3ijuzx7ochv18f8rnrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3ijuzx7ochv18f8rnrs.png" alt="Demo Page" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Idea2Post Agent Mode: turning a working content SaaS into an autonomous operator with Hermes</title>
      <dc:creator>Mervin</dc:creator>
      <pubDate>Wed, 20 May 2026 15:57:32 +0000</pubDate>
      <link>https://forem.com/mervindublin/idea2post-agent-mode-turning-a-working-content-saas-into-an-autonomous-operator-with-hermes-3g</link>
      <guid>https://forem.com/mervindublin/idea2post-agent-mode-turning-a-working-content-saas-into-an-autonomous-operator-with-hermes-3g</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/hermes-agent-2026-05-15"&gt;Hermes Agent Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Idea2Post Agent Mode&lt;/strong&gt; — a new autonomous "Agent" tab inside an already-shipped content SaaS (&lt;a href="https://idea2post.app" rel="noopener noreferrer"&gt;idea2post.app&lt;/a&gt;). The existing product is a one-shot generator: paste an idea, get hooks/blog/social posts. Useful, but it's still a stateless AI wrapper — the user has to decide &lt;em&gt;what&lt;/em&gt; to generate.&lt;/p&gt;

&lt;p&gt;Agent Mode flips that. The user types a goal — &lt;em&gt;"What should I post on Facebook this week?"&lt;/em&gt; — and a Hermes Agent autonomously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pulls the system's own state (queued jobs, recent content) via MCP tools&lt;/li&gt;
&lt;li&gt;Reads the most recent posts from the user's tracked competitors (real crawler signal, not made up)&lt;/li&gt;
&lt;li&gt;Analyzes the engagement and identifies underserved themes&lt;/li&gt;
&lt;li&gt;Optionally calls the existing in-house content generator for concrete drafts&lt;/li&gt;
&lt;li&gt;Reports back with citations to the actual data it used&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Crucially, none of the existing flows were touched. Quick Mode (one-shot generation), auto-pipelines (cron-driven crawl→generate→publish loops), and the 7 background workers all keep running unchanged. Agent Mode is a parallel route that &lt;em&gt;orchestrates the same building blocks&lt;/em&gt; in a different sequence.&lt;/p&gt;

&lt;p&gt;The whole layer is roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser  ──SSE──▶  PHP proxy  ──HTTP──▶  Hermes Agent
                                         (gateway @ :8642)
                                              │
                                              ├─ web search (built-in)
                                              └─ idea2post MCP server (stdio)
                                                     │
                                                     └─▶ PHP CLI ─▶ MySQL + content engine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Live agent run inside the dashboard. Notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool chips (&lt;code&gt;🔧 status_report&lt;/code&gt;, &lt;code&gt;🔧 recent_competitor_posts&lt;/code&gt;) appear as &lt;strong&gt;Hermes calls them&lt;/strong&gt;, pulsing while running, turning green when complete&lt;/li&gt;
&lt;li&gt;Content streams in token-by-token after research finishes&lt;/li&gt;
&lt;li&gt;Numbers and competitor names in the reply come from the &lt;strong&gt;real database&lt;/strong&gt;, not the LLM's imagination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;[ &lt;a href="https://youtu.be/7e4dsX8pd1s" rel="noopener noreferrer"&gt;https://youtu.be/7e4dsX8pd1s&lt;/a&gt; — record from cp.idea2post.app/?page=i2p-agent]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Quick prompts the demo runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"What should I post on Facebook this week? Check my competitor signal and propose 3 angles."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Give me a status report — queued jobs, content last 7 days, active pipelines, recent competitor activity."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Generate a full content plan: research recent competitor posts, pick the most engaging theme, draft 1 blog + 3 FB posts."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/melyx-id/idea2post-agent" rel="noopener noreferrer"&gt;https://github.com/melyx-id/idea2post-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key files:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP server (Python, FastMCP) — exposes idea2post engine as tools&lt;/td&gt;
&lt;td&gt;&lt;code&gt;opt/idea2post-mcp/server.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PHP CLI bridge — talks to MySQL + existing content engine&lt;/td&gt;
&lt;td&gt;&lt;code&gt;engine/i2p_agent_cli.php&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent UI (chat, streaming)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pages/i2p-agent.php&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~270&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSE proxy: browser ↔ Hermes Agent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pages/i2p-agent-api-stream.php&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~110&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  My Tech Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hermes Agent&lt;/strong&gt; v0.14.0 — running as a systemd user service on a Ubuntu 22.04 VPS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: &lt;code&gt;openai/gpt-oss-120b:free&lt;/code&gt; via OpenRouter (free tier, supports tool calling)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP&lt;/strong&gt;: official &lt;code&gt;mcp&lt;/code&gt; Python SDK (&lt;code&gt;FastMCP&lt;/code&gt;) over stdio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: existing SaaS — PHP 8.3, MySQL 10.6 (MariaDB), Apache → Caddy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: vanilla JS + Tailwind CDN, Server-Sent Events for live tool/token streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infra&lt;/strong&gt;: Caddy reverse proxy, idea2post's existing 7-worker cron stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No new frameworks, no rewrites. The whole agent layer is ~600 lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Hermes Agent
&lt;/h2&gt;

&lt;p&gt;Three Hermes capabilities did the heavy lifting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The agent loop itself.&lt;/strong&gt; I didn't want to write a tool-calling state machine. Hermes' &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint already runs the full reasoning→tool→observe→reason loop internally. My PHP backend just POSTs the user's goal and reads the stream. Multi-turn tool calls, retries, compression, all handled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. MCP for tool surface.&lt;/strong&gt; Instead of hard-coding tools into the prompt, I built a small Python MCP server that exposes 7 idea2post operations (&lt;code&gt;status_report&lt;/code&gt;, &lt;code&gt;list_competitors&lt;/code&gt;, &lt;code&gt;recent_competitor_posts&lt;/code&gt;, &lt;code&gt;list_publish_accounts&lt;/code&gt;, &lt;code&gt;list_pipelines&lt;/code&gt;, &lt;code&gt;generate_content&lt;/code&gt;, &lt;code&gt;queue_publish&lt;/code&gt;). Each tool internally shells out to a PHP CLI that talks to the real MySQL and the existing content engine. &lt;code&gt;hermes mcp add idea2post --command ...&lt;/code&gt; registered it in one line. Adding a new capability later means adding one function in &lt;code&gt;server.py&lt;/code&gt; — Hermes picks it up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The streaming protocol.&lt;/strong&gt; Hermes' SSE stream emits both standard OpenAI &lt;code&gt;chat.completion.chunk&lt;/code&gt; frames &lt;strong&gt;and&lt;/strong&gt; custom &lt;code&gt;hermes.tool.progress&lt;/code&gt; events with &lt;code&gt;status: running&lt;/code&gt; / &lt;code&gt;completed&lt;/code&gt; per tool call. This is the difference between "spinner for 30 seconds" and "the agent is visibly thinking" — chip turns from purple-pulsing to green-✓ in real time as each MCP tool returns. Massive UX upgrade for ~30 lines of frontend code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this was the right fit.&lt;/strong&gt; I had a working SaaS that already did the &lt;em&gt;expensive&lt;/em&gt; work — competitor crawlers, content generation, FB/LinkedIn publishing. What I was missing was an &lt;em&gt;orchestration layer&lt;/em&gt; that could reason over which subsystem to invoke for a given high-level goal. Hermes provided that layer without forcing a rewrite. The existing one-shot generator at &lt;code&gt;?page=i2p-generate&lt;/code&gt; is now one of many tools the agent can call, instead of being the entire product.&lt;/p&gt;

&lt;p&gt;The net result: same SaaS, but the user can ask it open-ended questions and watch it autonomously chain research → analysis → generation. That's the leap from "AI wrapper" to "agent product" — and Hermes did 80% of it.&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>devchallenge</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
