Forem: Mervin

How I Cut My LLM Costs by 90% Without Changing My App Logic

Mervin — Thu, 21 May 2026 20:44:33 +0000

There’s a particular kind of dread that comes with checking your OpenAI billing dashboard mid-month.

I’ve been building a news automation hub that runs 14 editorial workspaces — summarizing, rewriting, fact-checking, SEO-tagging, and translation pipelines around the clock.

The AI layer was already fairly optimized:

Groq
Gemini Flash
DeepSeek
OpenRouter
provider rotation
fallback logic

But the final fallback was still OpenAI, and once rate limits hit, costs climbed faster than expected.

What I needed wasn’t more routing logic.

I needed a smarter endpoint.

The Problem

My setup already rotated between multiple providers, but the architecture had a weakness:

Provider exhausted
    -> fallback
        -> OpenAI
            -> credits disappear

The more providers I added, the messier things became:

more API keys
more retry logic
more conditional branches
more provider-specific handling

I was optimizing infrastructure with application code.

That was the mistake.

The Fix

After digging through self-hosted AI tooling, I found freellmapi.

It’s a lightweight OpenAI-compatible proxy that automatically routes requests across multiple free-tier LLM providers:

Groq
Cerebras
SambaNova
Cloudflare Workers AI
GitHub Models
OpenRouter free models
and others

Combined free-tier capacity: roughly 800M tokens/month.

The interesting part is that the routing happens inside the proxy — not inside your app.

My Integration

The integration took less than an hour.

1. Deploy the proxy

I ran it on my existing VPS:

Node.js 20
~40MB idle RAM
localhost only

2. Add provider credentials

I added:

Groq key
Cloudflare credentials
OpenRouter key

inside the admin panel.

3. Point my app to a single endpoint

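import OpenAI from "openai";

// Every request now goes to the local proxy instead of api.openai.com
const client = new OpenAI({
  baseURL: "http://localhost:3001/v1",
  apiKey: process.env.LOCAL_ROUTER_KEY
});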

That was basically it.

The important detail:

I stopped specifying models for non-critical tasks.

Instead of forcing a specific provider, I let the proxy auto-route requests to whatever free provider was currently available.

App
  -> freellmapi
      -> Groq
      -> Cloudflare Workers AI
      -> Cerebras
      -> SambaNova
      -> OpenRouter

If Groq rate-limited:

another provider picked up the request

If a provider became slow:

routing shifted automatically

My application code never needed to know.

The Result

Within 24 hours:

OpenAI usage dropped by ~90%
background AI tasks became almost entirely free-tier
no additional retry logic was needed

Most importantly:
I removed provider chaos from my application layer.

What I Learned

When engineers hit rate limits, the instinct is usually:

add more providers
add more fallback logic
add more code

But sometimes the better solution is adding an abstraction layer that absorbs the complexity for you.

Another realization:

Most AI tasks do not require a specific premium model.

For:

summaries
tagging
drafts
translations
background enrichment

…almost any decent modern 70B model works fine.

Caveats

Free-tier infrastructure has tradeoffs.

Some providers:

have cold starts
introduce latency spikes
become temporarily unavailable

For real-time user-facing chat systems, you should test failover carefully.

For async pipelines and batch jobs, though, it’s been surprisingly solid.

Also:
run this on infrastructure you control.

A proxy like this handles upstream API keys — don’t hand that responsibility to random hosted services.

Final Thought

The biggest optimization wasn’t changing models.

It was removing complexity from the layer that had to manage them.

Turn ~800M Free AI Tokens Into a Single OpenAI API with FreeLLMAPI

Mervin — Thu, 21 May 2026 08:21:17 +0000

The Problem Nobody Talks About

Every major AI lab now offers a free tier. Gemini, Groq, Mistral, Cerebras — they all give you a few million tokens a month, a few thousand requests a day.

On paper, that's generous. In practice, you end up juggling 14 different SDKs, 14 rate limits, and 14 places a request can silently fail.

FreeLLMAPI solves exactly that.

What It Does

It's a self-hosted proxy that aggregates free tiers from 14 providers behind a single /v1/chat/completions endpoint — fully compatible with the OpenAI SDK.

Supported providers:

Provider	Notable Models
Google Gemini	2.5 Pro / Flash
Groq	Llama 4, Qwen, Kimi
Cerebras	Llama 3.3, Qwen
SambaNova	Llama 3.3 70B
NVIDIA NIM	Full catalog
Mistral	La Plateforme
OpenRouter	Free-tier models
GitHub Models	GPT-4o, Llama, Phi
Hugging Face	Inference Providers
Cloudflare	Workers AI
Zhipu	GLM-4 series
Moonshot	Kimi
MiniMax	abab / hailuo

Combined: roughly 800M tokens/month across all providers.

Zero Code Changes

Point your existing OpenAI SDK at localhost:3001/v1:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",
)

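# with_raw_response keeps the HTTP response so the headers stay accessible
raw = client.chat.completions.with_raw_response.create(
    model="auto",  # router picks the best available
    messages=[{"role": "user", "content": "Summarise the fall of Rome in one sentence."}],
)
resp = raw.parse()  # the usual ChatCompletion object

print(resp.choices[0].message.content)
print("Routed via:", raw.headers.get("x-routed-via"))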

That's it. Every response includes an X-Routed-Via header so you know which provider actually served the request.

Technical Highlights

Automatic failover — On 429 / timeout / 5xx, the router cools down the key and retries the next provider in your chain, up to 20 attempts.
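To make the failover behaviour concrete, here is a minimal Python sketch of a cooldown-and-retry loop. It is not FreeLLMAPI's actual code: `Route`, `UpstreamError`, `FailoverRouter` and the 60-second cooldown are hypothetical names and values chosen for illustration. Only the 20-attempt ceiling comes from the description above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


class UpstreamError(Exception):
    """Hypothetical stand-in for a retryable failure (429, timeout, 5xx)."""


@dataclass(frozen=True)
class Route:
    provider: str
    key_id: str


class FailoverRouter:
    def __init__(self, chain: List[Route], cooldown_s: float = 60.0,
                 max_attempts: int = 20,
                 clock: Callable[[], float] = time.monotonic):
        self.chain = chain
        self.cooldown_s = cooldown_s      # assumed value, not from the project
        self.max_attempts = max_attempts  # 20, as stated in the post
        self.clock = clock
        self.cooling_until: Dict[str, float] = {}  # key_id -> usable-again time

    def send(self, call: Callable[[Route], str]) -> Tuple[str, str]:
        """Try routes in chain order and return (provider, response)."""
        attempts = 0
        for route in self.chain:
            if attempts >= self.max_attempts:
                break
            until = self.cooling_until.get(route.key_id)
            if until is not None and until > self.clock():
                continue  # key is cooling down: skip it without spending an attempt
            attempts += 1
            try:
                return route.provider, call(route)
            except UpstreamError:
                # Park the key, then fall through to the next route in the chain.
                self.cooling_until[route.key_id] = self.clock() + self.cooldown_s
        raise RuntimeError("all routes exhausted or cooling down")


calls: List[str] = []


def fake_call(route: Route) -> str:
    calls.append(route.provider)
    if route.provider == "groq":
        raise UpstreamError("429")
    return f"ok from {route.provider}"


router = FailoverRouter([Route("groq", "k1"), Route("cerebras", "k2")])
provider, body = router.send(fake_call)
router.send(fake_call)  # groq is still cooling down, so it is not called again
print(provider, calls)
```

The second `send` shows the point of the cooldown: the rate-limited key is skipped outright instead of being retried on every request.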

Sticky sessions — Multi-turn conversations stay on the same model for 30 minutes. This matters more than it sounds — switching models mid-conversation causes subtle hallucination spikes.
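A sticky session can be sketched as a small map from a session id to a pinned model with an expiry. This is an illustration only: how the proxy derives a session id and whether the 30-minute window refreshes on each turn are assumptions, and `StickySessions` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Tuple

STICKY_TTL_S = 30 * 60  # 30 minutes, as described above


class StickySessions:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        # session_id -> (model, expires_at)
        self.pins: Dict[str, Tuple[str, float]] = {}

    def model_for(self, session_id: str, pick_model: Callable[[], str]) -> str:
        now = self.clock()
        pinned = self.pins.get(session_id)
        if pinned is not None and pinned[1] > now:
            model = pinned[0]      # still inside the window: reuse the model
        else:
            model = pick_model()   # new or expired session: route freshly
        # Assumption: every turn refreshes the window.
        self.pins[session_id] = (model, now + STICKY_TTL_S)
        return model


now = [0.0]  # fake clock so the example is deterministic
sessions = StickySessions(clock=lambda: now[0])

first = sessions.model_for("chat-1", lambda: "llama-3.3-70b")
now[0] = 10 * 60                      # 10 minutes later
second = sessions.model_for("chat-1", lambda: "qwen")
now[0] = 10 * 60 + 31 * 60            # 31 minutes after the last turn
third = sessions.model_for("chat-1", lambda: "qwen")
print(first, second, third)
```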

Per-key rate tracking — RPM, RPD, TPM, and TPD counters per (platform, model, key). The router always picks a key that's under its caps.
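One plausible way to keep such counters is a sliding window per (platform, model, key). The sketch below covers only the per-minute caps (RPM and TPM); daily caps would follow the same pattern with a longer window. `MinuteWindow` and the cap values are hypothetical and not taken from the project.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple

Key = Tuple[str, str, str]  # (platform, model, key_id)


class MinuteWindow:
    def __init__(self, rpm_cap: int, tpm_cap: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rpm_cap = rpm_cap
        self.tpm_cap = tpm_cap
        self.clock = clock
        # Each entry is (timestamp, tokens) for one request.
        self.events: Dict[Key, Deque[Tuple[float, int]]] = defaultdict(deque)

    def _prune(self, key: Key) -> Deque[Tuple[float, int]]:
        cutoff = self.clock() - 60.0
        q = self.events[key]
        while q and q[0][0] <= cutoff:
            q.popleft()  # drop requests older than one minute
        return q

    def under_caps(self, key: Key, tokens: int) -> bool:
        q = self._prune(key)
        used_tokens = sum(t for _, t in q)
        return len(q) < self.rpm_cap and used_tokens + tokens <= self.tpm_cap

    def record(self, key: Key, tokens: int) -> None:
        self.events[key].append((self.clock(), tokens))


now = [0.0]  # fake clock so the example is deterministic
window = MinuteWindow(rpm_cap=2, tpm_cap=1000, clock=lambda: now[0])
key = ("groq", "llama-3.3-70b", "key-1")

window.record(key, 400)
window.record(key, 400)
blocked = window.under_caps(key, 100)     # two requests already in the window
now[0] = 61.0
free_again = window.under_caps(key, 100)  # the old requests have aged out
print(blocked, free_again)
```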

Encrypted key storage — AES-256-GCM before hitting SQLite. Upstream provider keys never leave your machine.

Admin dashboard — React + Vite UI to manage keys, reorder the fallback chain, inspect analytics, and test prompts in a playground.

Lightweight — Runs on a Raspberry Pi 4 at ~40MB RAM idle.

Setup in 3 Lines

git clone https://github.com/tashfeenahmed/freellmapi
cd freellmapi && npm install
cp .env.example .env && npm run dev

Open localhost:5173, add your provider API keys, grab your unified key → done.

The Honest Part

A few things the README says clearly, and you should know upfront:

Intelligence degrades throughout the day. Gemini 2.5 Pro and GPT-4o (via GitHub Models) have the lowest daily caps. Once they're exhausted, the router falls back to smaller models. Expect effective quality to drop in the late hours — then reset at UTC midnight.

Tool calling and vision are not yet supported. Text-only for now. PRs are welcome.

Latency is unpredictable. Cerebras and Groq are extremely fast. Others are not. You get whichever one is available.

Personal use only. No multi-tenant auth. Don't expose this to the internet.

Free tiers change without notice. When a provider tightens limits, you'll see 429s until the catalog is updated.

Who This Is For

✅ Building AI agents or coding assistants and want to prototype without spending money upfront

✅ Researchers and students who hit rate limits on one provider and want seamless fallback

✅ Anyone tired of maintaining multiple SDK integrations

❌ Production workloads — use a paid API with an SLA

Quick ToS Note

The project includes a detailed review of each provider's terms. Most are fine for single-user personal use. Notable exceptions: Cohere's trial ToS explicitly forbids personal/household use, and NVIDIA NIM's free tier is scoped to evaluation only.

Read the full table in the README before adding keys.

FreeLLMAPI is MIT licensed and actively welcoming contributors — especially for adding embeddings, tool calling, and new providers.

→ github.com/tashfeenahmed/freellmapi

Gemma 4 as an On-Call SRE: Turning Alert Spam into One Reasoned Incident

Mervin — Thu, 21 May 2026 06:53:34 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

description: "I rebuilt my alert pipeline around Gemma 4 26B MoE. It now groups cascading alerts into a single incident and writes the root cause for me. Architecture, demo, and why MoE — not the dense 31B — was the right tool."

I had a working alerts service — Postgres, BullMQ, rules engine, Telegram bot. Classic stuff. It also produced the classic problem: 4 separate notifications for what was obviously one incident, no causal narrative, no fix suggestion. So I bolted a single component onto it: a Gemma 4 26B MoE "SRE brain" that reads correlated events and writes the postmortem before I finish my coffee.

Demo. https://github.com/melyx-id/alert-service
Repo (single self-contained NestJS service): /opt/alert-service on the host.

The intentional pick: google/gemma-4-26B-A4B-it (26B MoE, 4B active) — not the dense 31B. Reasoning below.

The problem I actually had

Last week our api-gateway hiccupped after a deploy. Telegram fired:

🚨 Deploy #441 promoted to production
🔴 Redis connection timeout spike (p99 4.2s)
🚨 5xx error rate surged 340% (12% of traffic)
🔴 Checkout latency p95 = 8.7s

Four pings. Four pages. I had to assemble the story myself: the deploy caused the redis pool to exhaust, which caused 5xx, which broke checkout. Obvious in retrospect. Cognitive load at 2am, not so obvious.

That's the gap I wanted Gemma to close.

Architecture

                    Webhooks / app events
                            │
                            ▼
                  ┌─────────────────────┐
                  │ /events/incident    │ (Fastify + NestJS)
                  └────────┬────────────┘
                           │
                  ┌────────▼────────┐
                  │  AlertsService  │  dedup → Postgres
                  └────────┬────────┘
                           │
                  ┌────────▼────────────────┐
                  │  IncidentsService       │
                  │  • signature(service)   │
                  │  • find OPEN incident   │
                  │    in last 10 min       │
                  │  • attach alert         │
                  └────────┬────────────────┘
                           │
                           ▼
                  ┌─────────────────────────┐
                  │  AnalysisService        │
                  │  HF Inference Router →  │
                  │  gemma-4-26B-A4B-it     │
                  │  (system prompt: SRE)   │
                  └────────┬────────────────┘
                           │
       ┌───────────────────┼────────────────────┐
       ▼                   ▼                    ▼
  Telegram (HTML +    Dashboard (Alpine,    Postgres
   inline buttons:    polls /incidents)     (timeline, aiFixes,
   ACK / RESOLVE /                          aiConfidence,
   Retry AI)                                aiRootCause)

Key idea: an Incident is the unit, not an Alert. An incident gathers all alerts from the same service within a 10-minute window. Every new alert in the window re-invokes Gemma with the full chronological stack — so the analysis improves with context instead of repeating with each ping.
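The grouping rule can be sketched in a few lines of Python. This is not the service's NestJS code: `Alert`, `Incident`, `attach` and `timeline` are hypothetical names, and measuring the 10-minute window from the incident's most recent alert is an assumption (it could equally be measured from when the incident opened).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

WINDOW = timedelta(minutes=10)


@dataclass
class Alert:
    service: str
    title: str
    at: datetime


@dataclass
class Incident:
    service: str
    status: str = "OPEN"
    alerts: List[Alert] = field(default_factory=list)


def attach(incidents: List[Incident], alert: Alert) -> Incident:
    """Attach to an OPEN incident for the same service, else open a new one."""
    for inc in incidents:
        if (inc.status == "OPEN" and inc.service == alert.service
                and alert.at - inc.alerts[-1].at <= WINDOW):
            inc.alerts.append(alert)
            return inc
    inc = Incident(service=alert.service, alerts=[alert])
    incidents.append(inc)
    return inc


def timeline(inc: Incident) -> str:
    """Full chronological stack, i.e. what would be re-sent to the model."""
    ordered = sorted(inc.alerts, key=lambda a: a.at)
    return "\n".join(f"{a.at:%H:%M:%S} {a.title}" for a in ordered)


t0 = datetime(2026, 5, 20, 2, 0, 0)  # illustrative timestamp
incidents: List[Incident] = []
a = attach(incidents, Alert("api-gateway", "Deploy #441 promoted to production", t0))
b = attach(incidents, Alert("api-gateway", "Redis connection timeout spike (p99 4.2s)",
                            t0 + timedelta(seconds=40)))
c = attach(incidents, Alert("api-gateway", "Checkout latency p95 = 8.7s",
                            t0 + timedelta(minutes=30)))
print(timeline(a))
```

The first two alerts land in one incident. The third arrives 30 minutes later, outside the window, so it opens a new one.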

Why Gemma 4 26B MoE (and not the dense 31B)

The challenge specifically asks why each model is the right tool. Here's my honest answer:

Property	What incident analysis needs	26B MoE	31B Dense
Workload shape	Bursty (idle → 3-6 events in 60s → idle)	✅ sparse activation = lower TTFT per call	dense is always-on cost
Reasoning depth	Multi-step causal chain (deploy → pool → 5xx → checkout)	✅ MoE benchmarks competitive with 31B on reasoning	slightly better, marginal
Long context	Up to 128K — we send growing event timelines	✅ both fine	✅ both fine
Cost per analysis	Want sub-cent	✅ 4B active params → cheaper inference	higher
Latency budget	<10s per call (on-call patience)	✅ ~4–7s observed	~6–9s observed

For an incident analyst workload (short, bursty, but reasoning-heavy), MoE was the right tool. I kept the 31B Dense wired in as an automatic fallback for when the MoE provider returns a 429. Both go through the HuggingFace Inference Router using the same OpenAI-compatible interface (/v1/chat/completions), which made the fallback a one-line config swap.

// src/modules/analysis/analysis.service.ts
// Primary model is the 26B MoE; the dense 31B is only used as a fallback.
this.model = config.get('GEMMA_MODEL') || 'google/gemma-4-26B-A4B-it';
this.fallbackModel = config.get('GEMMA_FALLBACK_MODEL') || 'google/gemma-4-31B-it';

A subtle gotcha worth flagging: HF Router model IDs are case-sensitive. gemma-4-26b-a4b-it returns 400 model_not_found. gemma-4-26B-A4B-it works. Lost 30 minutes to that.

The system prompt that actually mattered

The interesting part of the build wasn't the plumbing — it was getting Gemma to reason rather than summarize. My first prompt produced confident-sounding restatements of the input ("Redis is timing out and 5xx errors are happening"). Useless.

What worked was framing it as a senior person with a strong opinion about what qualifies as a root cause:

You are a senior Site Reliability Engineer with 10+ years on-call experience...

Rules:
- Prefer concrete causal chains over vague language
  ("connection pool exhaustion after deploy #441" beats "service degradation")
- If a deploy event is present, evaluate whether it is the likely trigger
- Severity: CRITICAL = revenue path or full outage
- Confidence: be honest. 0.5 means "plausible but unverified".
  Above 0.85 only when the causal chain is clear.

Two design choices behind this:

Force a causal chain, not a summary. Without this, Gemma reflexively rewrites symptoms.
Confidence as a contract. When I tell it "0.5 = plausible but unverified", it actually self-rates lower on weak signal. With the redis cascade demo, the first alert (deploy event alone) returned confidence 50%. By the third alert it hit 95% — because the causal chain became visible. The model is policing its own certainty.

What the demo actually looks like

$ npm run demo:redis

=== Demo scenario: Redis timeout cascade after deploy #441 ===

  [1/4] (4528ms) LOW      Deploy #441 promoted to production
     ↳ grouped into INC-260520-001 (conf 50%, google/gemma-4-26B-A4B-it)
  [2/4] (7616ms) HIGH     Redis connection timeout spike (p99 4.2s)
     ↳ grouped into INC-260520-001 (conf 90%, google/gemma-4-26B-A4B-it)
  [3/4] (5634ms) CRITICAL 5xx error rate surged 340% (12% of traffic)
     ↳ grouped into INC-260520-001 (conf 95%, google/gemma-4-26B-A4B-it)
  [4/4] (4407ms) HIGH     Checkout latency p95 = 8.7s
     ↳ grouped into INC-260520-001 (conf 95%, google/gemma-4-26B-A4B-it)

Gemma 4 final analysis:
  root cause : Deploy #441 introduced a regression causing Redis connection
               pool exhaustion, leading to request queuing and 5xx errors
               on the checkout path.
  impact     : Users are experiencing high latency and a 12% failure rate
               during the checkout process, directly impacting revenue.
  severity   : CRITICAL  (auto-escalated from initial LOW)
  confidence : 95%
  fixes:
    - Roll back api-gateway to the previous stable version (v2.3.3)
    - Increase Redis connection pool size as a temporary mitigation
      if rollback is delayed
    - Investigate commit a1b2c3d for unclosed Redis connections
      or inefficient session lookups

Note: severity was auto-escalated. The first event was tagged LOW (a deploy isn't itself a problem). Gemma rewrote the incident's severity to CRITICAL after seeing the cascading impact — exactly what a human SRE would do.
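A minimal sketch of upgrade-only escalation, assuming severities form an ordered scale and the stored value is the higher of the rule-based and AI ratings. The `Severity` enum and `stored_severity` are hypothetical names, and the MEDIUM level is an assumption, since only LOW, HIGH and CRITICAL appear in the demo output.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2    # assumed level; not shown in the demo output
    HIGH = 3
    CRITICAL = 4


def stored_severity(rule: Severity, ai: Severity) -> Severity:
    """The AI can raise the stored severity but never lower it."""
    return max(rule, ai)


upgraded = stored_severity(Severity.LOW, Severity.CRITICAL)
not_downgraded = stored_severity(Severity.CRITICAL, Severity.LOW)
print(upgraded.name, not_downgraded.name)
```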

Before/after view on the dashboard makes this concrete:

Before (raw alerts pane): 4 separate-looking entries, no narrative, on-call paged 4 times.
After (Gemma pane): 1 grouped incident, root cause + impact + fixes + 95% confidence, on-call paged once with all context inline.

Same data. Different outcome.

Things I deliberately did NOT do (yet)

Multi-agent reasoning (DB-specialist, network-specialist, summarizer). LangGraph would slot in cleanly, but for the use case — small bursts, single service per incident — one well-prompted call beats four coordinated ones in latency. Multi-agent is on the roadmap once I'm grouping across services.
Vector search for similar past incidents. pgvector is already running on the host; the hook is in IncidentsService.groupAndAnalyze. Will add when there are >50 historical incidents to retrieve from.
Local Ollama. Tempting for privacy, but my VPS is 4GB RAM and runs ~15 other services. The HF Router gives me the same Gemma 4 weights without evicting half my fleet. If you're on dedicated hardware, swap the endpoint — the prompt and grouping logic don't change.

Production-y bits that came along for the ride

Dedupe + retry. Cache key = sha1(title:source), 2-min TTL. Stops a runaway cron from re-analyzing the same payload 60x.
Telegram inline keyboard: ACK / RESOLVE / Retry AI / open dashboard. The Retry AI button is my favorite — it re-invokes Gemma with the current event stack. Cheap second opinion when the first reasoning felt off.
Severity escalation. The incident's stored severity is max(human-rule severity, AI severity). AI can upgrade LOW→CRITICAL but cannot downgrade a CRITICAL classification, by design.
Confidence as UI signal. The dashboard shows conf 95% next to every root cause. Below 70% the UI hints "consider re-analysis or wait for more events."
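The dedupe step from the list above can be sketched as a SHA-1 key with an expiry. The in-memory dict is an assumption made for the example (the post does not say where the cache lives), and `cache_key` and `Deduper` are hypothetical names.

```python
import hashlib
import time
from typing import Callable, Dict

TTL_S = 120  # 2-minute TTL, as described above


def cache_key(title: str, source: str) -> str:
    return hashlib.sha1(f"{title}:{source}".encode("utf-8")).hexdigest()


class Deduper:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.seen: Dict[str, float] = {}  # cache key -> expiry time

    def is_duplicate(self, title: str, source: str) -> bool:
        now = self.clock()
        key = cache_key(title, source)
        expires = self.seen.get(key)
        if expires is not None and expires > now:
            return True               # same payload seen within the TTL
        self.seen[key] = now + TTL_S  # first sighting, or the entry expired
        return False


now = [0.0]  # fake clock so the example is deterministic
dedupe = Deduper(clock=lambda: now[0])

first = dedupe.is_duplicate("Redis connection timeout spike", "api-gateway")
repeat = dedupe.is_duplicate("Redis connection timeout spike", "api-gateway")
now[0] = 121.0
after_ttl = dedupe.is_duplicate("Redis connection timeout spike", "api-gateway")
print(first, repeat, after_ttl)
```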

Stack summary

NestJS 11 (Fastify) — existing service, ~30 LOC of wiring to add the Gemma layer
Prisma + Postgres — 1 new model (Incident), 3 new columns on Alert
HuggingFace Inference Router — google/gemma-4-26B-A4B-it primary, gemma-4-31B-it fallback
Alpine.js + Tailwind CDN — single-file dashboard, polls /incidents every 5s
Telegram bot — HTML messages with inline keyboard, HMAC-signed callbacks

Single npm run demo:redis reproduces the entire flow from cold start.

What surprised me

I expected Gemma to be good at the language: paraphrasing logs, polishing summaries. What I didn't expect was how reliably it upgrades severity. The first event in my demo (a deploy) is mundane. The model only paints it as CRITICAL once it has the second and third alerts to connect the chain. That's not pattern matching, that's reasoning over a sequence. It's the behavior I'd want from a junior SRE in their third month.

The other surprise: confidence actually moves. Most LLM "confidence" outputs are 0.9 forever. Telling Gemma in the system prompt that 0.5 is honest got me back a useful spread of values that I can now drive UI on.

Try it

If it's empty when you look, the demo data may have expired — you're welcome to mentally substitute "redis cascade after deploy #441, severity CRITICAL, 95% confidence, with fixes." Or watch the next real incident roll through, which is the whole point.

Idea2Post Agent Mode: turning a working content SaaS into an autonomous operator with Hermes

Mervin — Wed, 20 May 2026 15:57:32 +0000

This is a submission for the Hermes Agent Challenge

What I Built

Idea2Post Agent Mode — a new autonomous "Agent" tab inside an already-shipped content SaaS (idea2post.app). The existing product is a one-shot generator: paste an idea, get hooks/blog/social posts. Useful, but it's still a stateless AI wrapper — the user has to decide what to generate.

Agent Mode flips that. The user types a goal — "What should I post on Facebook this week?" — and a Hermes Agent autonomously:

Pulls the system's own state (queued jobs, recent content) via MCP tools
Reads the most recent posts from the user's tracked competitors (real crawler signal, not made up)
Analyzes the engagement and identifies underserved themes
Optionally calls the existing in-house content generator for concrete drafts
Reports back with citations to the actual data it used

Crucially, none of the existing flows were touched. Quick Mode (one-shot generation), auto-pipelines (cron-driven crawl→generate→publish loops), and the 7 background workers all keep running unchanged. Agent Mode is a parallel route that orchestrates the same building blocks in a different sequence.

The whole layer is roughly:

Browser  ──SSE──▶  PHP proxy  ──HTTP──▶  Hermes Agent
                                         (gateway @ :8642)
                                              │
                                              ├─ web search (built-in)
                                              └─ idea2post MCP server (stdio)
                                                     │
                                                     └─▶ PHP CLI ─▶ MySQL + content engine

Demo

Live agent run inside the dashboard. Notice:

Tool chips (🔧 status_report, 🔧 recent_competitor_posts) appear as Hermes calls them, pulsing while running, turning green when complete
Content streams in token-by-token after research finishes
Numbers and competitor names in the reply come from the real database, not the LLM's imagination

Demo video: https://youtu.be/7e4dsX8pd1s (recorded from cp.idea2post.app/?page=i2p-agent)

Quick prompts the demo runs:

"What should I post on Facebook this week? Check my competitor signal and propose 3 angles."
"Give me a status report — queued jobs, content last 7 days, active pipelines, recent competitor activity."
"Generate a full content plan: research recent competitor posts, pick the most engaging theme, draft 1 blog + 3 FB posts."

Code

https://github.com/melyx-id/idea2post-agent

Key files:

Layer	Path	Lines
MCP server (Python, FastMCP) — exposes idea2post engine as tools	`opt/idea2post-mcp/server.py`	~95
PHP CLI bridge — talks to MySQL + existing content engine	`engine/i2p_agent_cli.php`	~110
Agent UI (chat, streaming)	`pages/i2p-agent.php`	~270
SSE proxy: browser ↔ Hermes Agent	`pages/i2p-agent-api-stream.php`	~110

My Tech Stack

Hermes Agent v0.14.0, running as a systemd user service on an Ubuntu 22.04 VPS
LLM: openai/gpt-oss-120b:free via OpenRouter (free tier, supports tool calling)
MCP: official mcp Python SDK (FastMCP) over stdio
Backend: existing SaaS on PHP 8.3, MariaDB 10.6 (MySQL-compatible), Apache → Caddy
Frontend: vanilla JS + Tailwind CDN, Server-Sent Events for live tool/token streaming
Infra: Caddy reverse proxy, idea2post's existing 7-worker cron stack

No new frameworks, no rewrites. The whole agent layer is ~600 lines.

How I Used Hermes Agent

Three Hermes capabilities did the heavy lifting:

1. The agent loop itself. I didn't want to write a tool-calling state machine. Hermes' /v1/chat/completions endpoint already runs the full reasoning→tool→observe→reason loop internally. My PHP backend just POSTs the user's goal and reads the stream. Multi-turn tool calls, retries, compression, all handled.

2. MCP for tool surface. Instead of hard-coding tools into the prompt, I built a small Python MCP server that exposes 7 idea2post operations (status_report, list_competitors, recent_competitor_posts, list_publish_accounts, list_pipelines, generate_content, queue_publish). Each tool internally shells out to a PHP CLI that talks to the real MySQL and the existing content engine. hermes mcp add idea2post --command ... registered it in one line. Adding a new capability later means adding one function in server.py — Hermes picks it up.
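The shell-out bridge behind each tool might look roughly like the sketch below. It deliberately leaves out the FastMCP registration and shows only the part that calls the PHP CLI. The argument convention (tool name plus one JSON string) is an assumption, `build_command` and `call_engine` are hypothetical names, and the fake runner returns made-up data so the example works without PHP installed.

```python
import json
import subprocess
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

PHP_CLI = "engine/i2p_agent_cli.php"  # path from the key-files table


def build_command(tool: str, args: Dict[str, Any]) -> List[str]:
    # Arguments travel as one JSON string in an argv list,
    # so nothing is interpolated into a shell command line.
    return ["php", PHP_CLI, tool, json.dumps(args, sort_keys=True)]


def call_engine(tool: str, args: Dict[str, Any],
                run: Callable[..., Any] = subprocess.run) -> Any:
    """Run one engine operation and return its decoded JSON output."""
    proc = run(build_command(tool, args),
               capture_output=True, text=True, timeout=60)
    if proc.returncode != 0:
        raise RuntimeError(f"{tool} failed: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


# Each MCP tool function would then be a thin wrapper such as
#   call_engine("recent_competitor_posts", {"limit": 10})


def fake_run(cmd: List[str], **kwargs: Any) -> SimpleNamespace:
    # Stand-in for subprocess.run with invented output, for demonstration only.
    return SimpleNamespace(returncode=0, stdout='{"queued_jobs": 3}', stderr="")


command = build_command("status_report", {})
report = call_engine("status_report", {}, run=fake_run)
print(command, report)
```

Injecting `run` keeps the bridge testable without touching the real database.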

3. The streaming protocol. Hermes' SSE stream emits both standard OpenAI chat.completion.chunk frames and custom hermes.tool.progress events with status: running / completed per tool call. This is the difference between "spinner for 30 seconds" and "the agent is visibly thinking" — chip turns from purple-pulsing to green-✓ in real time as each MCP tool returns. Massive UX upgrade for ~30 lines of frontend code.
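To show how a client could separate the two kinds of frames, here is a small Python sketch of an SSE parser plus a reducer that tracks tool status and accumulates reply text. The exact wire format is an assumption: it presumes tool progress arrives as named SSE events with a JSON payload containing `tool` and `status` fields. Only the event name and the running/completed statuses come from the description above.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) per SSE frame; a blank line ends a frame."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)


def render(frames: Iterable[Tuple[str, str]]) -> Tuple[Dict[str, str], str]:
    """Return (latest status per tool, concatenated reply text)."""
    chips: Dict[str, str] = {}
    text: List[str] = []
    for event, data in frames:
        if data == "[DONE]":
            break
        payload = json.loads(data)
        if event == "hermes.tool.progress":
            chips[payload["tool"]] = payload["status"]  # field names assumed
        else:
            delta = payload["choices"][0]["delta"]      # OpenAI chunk shape
            text.append(delta.get("content") or "")
    return chips, "".join(text)


stream = [
    "event: hermes.tool.progress",
    'data: {"tool": "status_report", "status": "running"}',
    "",
    "event: hermes.tool.progress",
    'data: {"tool": "status_report", "status": "completed"}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello, "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "world"}}]}',
    "",
    "data: [DONE]",
    "",
]
chips, reply = render(parse_sse(stream))
print(chips, reply)
```

In a browser the same split would drive the UI: progress events update the tool chips, and chunk frames append to the message body.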

Why this was the right fit. I had a working SaaS that already did the expensive work — competitor crawlers, content generation, FB/LinkedIn publishing. What I was missing was an orchestration layer that could reason over which subsystem to invoke for a given high-level goal. Hermes provided that layer without forcing a rewrite. The existing one-shot generator at ?page=i2p-generate is now one of many tools the agent can call, instead of being the entire product.

The net result: same SaaS, but the user can ask it open-ended questions and watch it autonomously chain research → analysis → generation. That's the leap from "AI wrapper" to "agent product" — and Hermes did 80% of it.