Forem: Devansh

Your Google Sheets backend is silently dropping rows. Here's why.

Devansh — Sun, 26 Apr 2026 23:30:00 +0000

A signup form POSTing to Google Sheets is the most common "backend" on the indie web.

It works for your landing page demo. It works when you test it with 5 friends. It works right up until the moment you don't want it to fail.

Here's the part nobody tells you: Google's own values.append endpoint silently drops rows under concurrent writes. Two simultaneous POSTs can resolve to the same target row and one of them gets overwritten. No error in your logs. No error in your client. Just rows that silently didn't land.

Every "Sheets as a backend" wrapper you've heard of — SheetDB, Sheety, SheetBest, NoCodeAPI — forwards your request straight to values.append. They inherit the bug.

I spent the last few weeks building a fix. It's called SheetForge, it's MIT-licensed, and this post is about the actual bug and why the fix isn't as simple as "throw a mutex on it."

The bug, reproduced in 4 lines

await Promise.all([
  sheets.values.append({ ...rowA }),
  sheets.values.append({ ...rowB }),
  sheets.values.append({ ...rowC }),
  sheets.values.append({ ...rowD }),
]);

Four POSTs. You expect four rows. Under load you often get three. Sometimes two.

The reason is inside values.append. The operation reads the current last row, then writes to the position after it. When two calls race, they can read the same "current last row" and write to the same target cell range. One value wins. The other silently disappears.

Google has documented this. The official workaround is: "Don't write concurrently." That's a fine rule when your launch gets 3 signups. It's actively destructive when you hit HN's front page.

Why your form loses 40% of rows during a traffic spike

Your form looks like this:

// /api/signup
export async function POST(req: Request) {
  const { email } = await req.json()
  await sheets.values.append({ values: [[email, new Date().toISOString()]] })
  return Response.json({ ok: true })
}

This is deployed on Vercel. Each request gets its own isolated serverless invocation. They run truly in parallel. There is zero coordination between them.

Now Product Hunt puts you on the daily leaderboard. Forty people land on your page in the same 10 seconds. Twelve of them submit the form before the others.

Twelve concurrent POSTs to values.append.

You won't end up with twelve rows. You'll end up with something closer to eight, maybe nine. The exact number depends on how Google's backend serializes the writes internally (it doesn't, deterministically — that's the whole problem). The lost rows show no error. The users who submitted them see a green checkmark. Their email is gone.

This isn't theoretical. Levels.fyi wrote a long engineering post about running their entire site on Sheets until exactly this class of problem forced them to migrate.

Why a mutex doesn't fix it

The naive fix is obvious: put a lock in front of the Sheets API call. One write at a time per sheet. Problem solved.

In practice, this is where it breaks:

1. Your API runs on serverless. A lock in Node process memory doesn't work when requests are spread across 12 cold starts. You need a distributed lock.

2. A distributed lock has its own bugs. Redis SETNX with a TTL is the standard answer. But the classic failure mode: process A acquires the lock, gets paused by GC, TTL expires, process B grabs the lock, process A wakes up and releases "its" lock — which is now B's lock. Now two processes think they hold the lock. You're back to silent data loss.

3. Retries break everything. Your client retries a failed POST. The lock-protected write succeeds twice. Now you have duplicate rows. Fixing drops created duplicates.

4. Crashes strand the lock. If your process dies while holding the lock, the next writer has to wait for the TTL to expire. For TTLs long enough to be safe (30+ seconds), that stalls your whole write throughput.

Every one of these is a real bug I hit while prototyping. The fix isn't a mutex. The fix is a proper queue.

How SheetForge actually fixes it

The architecture, in one sentence: every write goes into a per-sheet queue, one worker per sheet pulls from that queue inside a Postgres transaction, and an idempotency key dedupes retries.

Here's the flow:

submitWrite(row)
  └─ INSERT INTO write_ledger (sheet_id, idempotency_key, payload)
     (partial unique index dedupes retries before we even touch Sheets)
  └─ XADD to Redis Stream for this sheet
  └─ return { writeId, status: 'pending' }

processNext()  ← runs in a loop on the worker
  └─ XREADGROUP from the sheet's stream
  └─ BEGIN transaction
     └─ SELECT pg_advisory_xact_lock(hashtextextended(streamKey, 0))
     └─ call sheets.values.append(payload)
     └─ UPDATE write_ledger SET status = 'committed'
  └─ COMMIT
  └─ XACK only after commit

Four things matter here:

The advisory lock is the fence. pg_advisory_xact_lock acquires a lock inside the transaction. If the transaction commits or aborts, the lock is released automatically by Postgres. There is no TTL. No lease clock. No split-brain. If your process dies mid-handler, the transaction rolls back, the lock releases, and the message redelivers.

The idempotency key is the deduper. Every write comes in with an Idempotency-Key header. A partial unique index on (sheet_id, idempotency_key) WHERE status IN ('pending', 'committed') means the database rejects duplicates before the worker even sees them. Retry the same request 100 times, you get 1 row.

The ledger is the truth. The write is durable in Postgres the moment you get the writeId back. Sheets is downstream. If Google's API is down, your writes queue up and flush when it recovers. Your users see a green checkmark and it means something.

XACK happens post-commit. Redis Streams' PEL (pending entries list) redelivers messages if they're not acked. If the worker crashes mid-transaction, Postgres rolls back, Redis redelivers, the idempotency key catches the replay. Exactly-once semantics, for real.

The full test for this concurrency

Every change to the write-queue slice requires a concurrency test. This is the one I do not break:

it('50 parallel writes land 50 rows in order', async () => {
  const sheet = await createTestSheet()
  const writes = Array.from({ length: 50 }, (_, i) => ({
    email: `user-${i}@test.com`,
    idempotencyKey: `key-${i}`,
  }))

  await Promise.all(
    writes.map(w => sheetforge.rows.insert(w, { idempotencyKey: w.idempotencyKey }))
  )
  await waitForQueueDrain(sheet.id)

  const rows = await readSheet(sheet.id)
  expect(rows).toHaveLength(50)
  expect(rows.map(r => r.email)).toEqual(writes.map(w => w.email))
})

50 parallel POSTs. 50 rows. In order. Retry safe.

This same test against raw values.append reliably fails. It fails under SheetDB, Sheety, and SheetBest too — I checked.

The typed SDK is the bonus

Once you have a proper queue, the rest gets interesting. Since SheetForge knows your sheet's header row, it can generate a typed TypeScript client with literal union types inferred from your sample cells.

Header row: email | plan | created_at
Sample cells: hi@example.com | free | 2026-04-15

Generated SDK:

import { createClient } from './sheetforge-client'

const sheet = createClient({
  apiKey: process.env.SHEETFORGE_API_KEY!,
  sheetId: 'sht_abc123',
})

await sheet.rows.insert(
  {
    email: 'hi@example.com',
    plan: 'free',  // 'free' | 'pro' — inferred from sample cells
    created_at: new Date().toISOString(),
  },
  { idempotencyKey: crypto.randomUUID() },
)

The compiler catches header drift. Rename a column in the sheet and regenerate the client — TypeScript tells you every call site that needs updating.

Getting started

One-click hosted:

https://getsheetforge.vercel.app

Self-host:

git clone https://github.com/Devansh-365/sheetforge.git
cd sheetforge
pnpm install
cp .env.example .env   # Google OAuth + DATABASE_URL + Redis
pnpm db:push
pnpm dev               # web :3000, api :3001

Prereqs: Node 20+, pnpm 9+, Postgres 14+, Redis 6+ (or Upstash REST).

The OSS core (packages/queue, packages/codegen, packages/sdk-ts) is MIT and stays free forever. The hosted SaaS runs the same code.

What SheetForge is not

It is not a Postgres replacement. If you need complex queries, indices, or relational integrity, use a real database.

It is not a high-throughput pipe. Google caps you at ~60 writes/minute per sheet regardless of what sits in front. SheetForge makes sure those writes land; it doesn't make them faster.

It is not a reason to keep Sheets as your backend forever. It's the right tool for landing pages, waitlists, internal forms, ops tools, and MVPs where you need rows to actually land. If your app outgrows Sheets, it outgrows SheetForge too.

The one takeaway

Every Sheets-as-backend wrapper you've been using has a silent data-loss bug that only shows up when traffic spikes. That's exactly the moment you most care about losing rows.

If you've ever shipped a form on Sheets and watched rows vanish mid-launch, give SheetForge a try. If it saves you one bug, star the repo.

GitHub: github.com/Devansh-365/sheetforge
Hosted: getsheetforge.vercel.app
License: MIT

Gemini 2.5 Flash was returning 37 tokens. Here's why.

Devansh — Sun, 19 Apr 2026 23:30:00 +0000

I set max_tokens: 1000 on a Gemini 2.5 Flash call.

The response came back with 37 tokens. finish_reason: "MAX_TOKENS". No error. No warning. Just a string that stopped mid-sentence.

I changed it to 2000. Got back 41 tokens. Then 5000. Got back 38.

That's when I knew something was actually broken, not just a config issue.

I spent a day tracing this. The root cause is surprising, the official docs don't explain it, and the fix depends on which version of which SDK you're using. Here's what I learned, and a diagnostic script at the end so you can figure out which variant of the bug you hit.

The symptom

Your Gemini 2.5 Flash or Pro call returns one of these shapes:

{
  "candidates": [{
    "content": { "parts": [{ "text": "" }] },
    "finishReason": "MAX_TOKENS"
  }],
  "usageMetadata": {
    "promptTokenCount": 120,
    "candidatesTokenCount": 0,
    "thoughtsTokenCount": 964,
    "totalTokenCount": 1084
  }
}

Or a truncated mid-sentence response with candidatesTokenCount near zero and thoughtsTokenCount close to whatever you set max_output_tokens to.

The word thoughtsTokenCount is the giveaway.

Why it happens

Gemini 2.5 Flash and Pro are reasoning models. Like OpenAI's o-series, they burn tokens on internal reasoning before writing the visible response. Unlike OpenAI's models, Google counts those thinking tokens against your max_output_tokens budget.

So when you ask for 1,000 tokens:

The model thinks. This uses some number of tokens, tracked as thoughtsTokenCount.
Once thoughtsTokenCount + candidatesTokenCount hits your budget, generation stops.
If thinking consumed most of the budget, candidatesTokenCount ends up near zero.

Gemini 2.5 Flash defaults to a dynamic thinking budget. It decides how much to think based on the task. For anything non-trivial, it will happily burn 90 to 98 percent of your budget on reasoning.

You can see this directly in the API response. If you're using the Google GenAI SDK:

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize quantum computing.",
    config={"max_output_tokens": 1000}
)

print("Output tokens:", response.usage_metadata.candidates_token_count)
print("Thinking tokens:", response.usage_metadata.thoughts_token_count)
print("Total:", response.usage_metadata.total_token_count)
print("Finish reason:", response.candidates[0].finish_reason)

The thoughts_token_count field is where your budget actually went.

The three fixes, ranked

There are three ways to handle this, and they have real tradeoffs.

Fix	What it does	Latency	Cost	Quality	Best for
Disable thinking	`thinking_budget: 0` (Flash) or `reasoning_effort: "none"`	Fast	Low	Lower on complex reasoning	Chat UIs, structured extraction, high-volume endpoints
Cap thinking	`thinking_budget: 1024` + `max_output_tokens: 8192`	Medium	Medium	Good	Most production workloads
Dynamic thinking	Let Flash decide, set `max_output_tokens` to 8K+	Slowest	Highest	Best	Research queries, complex analysis, one-shot deep tasks

The third option is the default, and it's the source of the bug. It's only the right choice if you're actually okay with burning most of your tokens on reasoning and waiting 5 to 30 seconds per response.

Fix 1: Disable thinking for Flash

For Gemini 2.5 Flash, you can turn thinking off entirely:

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain circuit breakers in 2 sentences.",
    config=types.GenerateContentConfig(
        max_output_tokens=1000,
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    )
)

thinking_budget=0 is only valid for 2.5 Flash. Pro refuses to run without at least some thinking, and throws:

Thinking can't be disabled for this model.

For Pro, the minimum accepted value is 128. Using thinking_budget=128 gets you the closest thing to "off" that Pro allows.

Fix 2: The OpenAI-compat escape hatch (underdocumented)

If you're hitting Gemini through the OpenAI-compatible endpoint (either Google's own generativelanguage.googleapis.com/v1beta/openai or through a proxy like LiteLLM), you can use reasoning_effort instead of thinking_budget:

from openai import OpenAI

client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai",
    api_key=GEMINI_API_KEY
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Explain circuit breakers"}],
    max_tokens=1000,
    reasoning_effort="none"  # or "low", "medium", "high"
)

This is barely documented. Google's official OpenAI-compatibility page mentions it in passing, and almost no tutorials cover it. But it works, and it's the cleanest way to control reasoning from code that uses the OpenAI SDK.

Mapping:

reasoning_effort: "none" → thinking_budget: 0 (Flash only)
reasoning_effort: "low" → thinking_budget: 1024
reasoning_effort: "medium" → thinking_budget: 8192
reasoning_effort: "high" → thinking_budget: 24576

Fix 3: The integration-specific gotchas

The bug manifests differently depending on your stack. Some quick notes from actual GitHub issues (python-genai #782, gemini-cli #23081, langchain-google #1490):

LangChain silently truncates output. Developers report setting max_tokens=16000 and still getting cut-off responses. Fix: pass thinking_budget via model_kwargs, or switch to the OpenAI-compat endpoint through ChatOpenAI.

LiteLLM accepts reasoning_effort and maps it to the Gemini parameter, but as of late 2025 it rejected the parameter for Pro with "Thinking can't be disabled." Fix: use reasoning_effort="low" instead of "none" for Pro.

ha-llmvision defaulted thinkingBudget to 35 to 50 tokens. That value gets fully consumed by thinking, leaving nothing for output. Fix: set to 1024 or higher.

Cline set thinkingBudget: 0 which works for Flash Lite but throws on Pro. Fix depends on which model you're targeting.

Vertex AI uses thinkingConfig.thinkingBudget nested inside the config object. Raw API requests that put it at the top level silently ignore it.

Diagnostic script

If you're not sure which variant of the bug you hit, run this:

from google import genai
client = genai.Client()

def diagnose(model: str, prompt: str, max_tokens: int):
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config={"max_output_tokens": max_tokens}
    )
    usage = response.usage_metadata
    text = response.candidates[0].content.parts[0].text or ""
    finish = response.candidates[0].finish_reason

    thinking_pct = (usage.thoughts_token_count / max_tokens * 100) if usage.thoughts_token_count else 0
    output_pct = (usage.candidates_token_count / max_tokens * 100) if usage.candidates_token_count else 0

    print(f"Model:          {model}")
    print(f"Budget:         {max_tokens}")
    print(f"Thinking used:  {usage.thoughts_token_count} ({thinking_pct:.0f}%)")
    print(f"Output tokens:  {usage.candidates_token_count} ({output_pct:.0f}%)")
    print(f"Finish reason:  {finish}")
    print(f"Response len:   {len(text)} chars")

    if finish == "MAX_TOKENS" and usage.candidates_token_count < max_tokens * 0.1:
        print("\nDIAGNOSIS: Thinking tokens ate your budget.")
        print("FIX: Set thinking_budget=0 (Flash) or reasoning_effort='none'.")
    elif finish == "MAX_TOKENS":
        print("\nDIAGNOSIS: Output actually hit the cap. Raise max_output_tokens.")

diagnose("gemini-2.5-flash", "Write a short poem about debugging.", 200)

The script prints a percentage breakdown showing exactly where your budget went. If thinking is over 50 percent of your budget, you need to cap it.

The gateway-level fix

All of this is fixable at the application layer, but it requires every caller to know about reasoning budgets. That doesn't scale if you have multiple services calling Gemini.

I run FreeLLM in front of my LLM calls. It's an OpenAI-compatible gateway that routes across six providers, and it sets the right reasoning budget per Gemini model automatically. Flash gets reasoning_effort: "none". Pro gets "low". Your full max_tokens budget goes to the actual answer. You can override per-request if you need reasoning back.

curl http://localhost:3000/v1/chat/completions \
  -d '{
    "model": "gemini/gemini-2.5-flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 1000
  }'

Before the gateway: 37 output tokens. After: 670+ tokens, finish_reason: stop. Same prompt, same budget.

The point is not "use my tool." The point is that gateway-level defaults let you fix provider quirks once instead of in every service.

What to take away

Gemini 2.5 is a reasoning model. Its thinking tokens count against your max_output_tokens.
The dynamic default will eat 90 to 98 percent of your budget on anything non-trivial.
For Flash, disable thinking with thinking_budget: 0 or reasoning_effort: "none".
For Pro, cap thinking with thinking_budget: 128 (minimum) or reasoning_effort: "low".
If you're using an OpenAI-compat endpoint, reasoning_effort is cleaner and underdocumented.
Run the diagnostic script above when in doubt.

The official docs don't make any of this obvious. Hopefully this post saves you the day I spent figuring it out.

GitHub: github.com/devansh-365/freellm (the gateway that handles this for you)

LiteLLM got hacked. I built a simpler LLM gateway you can actually audit.

Devansh — Tue, 14 Apr 2026 00:10:45 +0000

On March 24, 2026, LiteLLM versions 1.82.7 and 1.82.8 were uploaded to PyPI with a credential harvester, a Kubernetes lateral-movement toolkit, and a persistent remote code execution backdoor baked in.

The malicious package was live for about 40 minutes before PyPI quarantined it.

40 minutes doesn't sound like much. But LiteLLM gets 95 million downloads a month. It's the default multi-provider routing library for anyone building on LLMs. Teams running pip install litellm during that window got compromised automatically. No explicit import needed. The payload triggered on Python interpreter startup via a .pth file.

Google brought in Mandiant for the investigation. Snyk, Kaspersky, and Trend Micro all published breakdowns. The attack vector: a compromised Trivy security scanner leaked CircleCI credentials, including the PyPI publishing token and a GitHub PAT.

This is not a theoretical risk. This happened.

The real problem is not one attack

LiteLLM does a lot. 2,000+ models across 100+ providers. Proxy server, load balancing, spend tracking, A/B testing, caching, logging, guardrails, prompt management.

That scope is the problem.

A developer on HN described the codebase as having a 7,000+ line utils.py. A 30-year engineer called it "the worst code I have ever read in my life." Before the supply chain attack, a DEV Community post titled "5 Real Issues With LiteLLM That Are Pushing Teams Away in 2026" was already documenting the trust erosion.

The supply chain attack was the tipping point, not the root cause. The root cause is depending on a massive, opaque library for critical routing infrastructure.

What a simpler design looks like

I ran into the same multi-provider routing problem last year while building Metis, an AI stock analysis tool. Kept burning through Groq's free tier in 20 minutes, switching to Gemini manually, hitting their cap, switching again.

Built FreeLLM to stop doing that manually. It solves a narrower problem than LiteLLM, and that's the point.

FreeLLM is an OpenAI-compatible gateway that routes across Groq, Gemini, Mistral, Cerebras, NVIDIA NIM, and Ollama. When one provider rate-limits, the next one answers. That's the core of it.

What it does

curl http://localhost:3000/v1/chat/completions \
  -d '{"model": "free-fast", "messages": [{"role": "user", "content": "Hello!"}]}'

Your existing OpenAI SDK code works. Swap the base URL. Keep your code.

Three meta-models handle routing: free-fast (lowest latency, usually Groq/Cerebras), free-smart (best reasoning, usually Gemini 2.5 Pro), and free (max availability).

What it fixes that LiteLLM doesn't

Gemini 2.5 reasoning tokens eating your output. This is one of the most reported Gemini bugs right now. Gemini 2.5 Flash and Pro are reasoning models. They burn 90-98% of your max_tokens on internal thinking before producing visible text. Ask for 1,000 tokens and you get back 37. There are 15+ open GitHub issues about this across multiple SDKs.

FreeLLM fixes it at the gateway. Flash gets reasoning_effort: "none" by default. Pro gets "low". Your full token budget goes to the actual answer. Override per-request if you want the reasoning back.

Provider outages don't break your app. Claude went down for three consecutive days in early April. 8,000+ Downdetector reports. If your app depends on one provider, that's three days of broken service. FreeLLM's circuit breakers pull failing providers from rotation and test for recovery automatically.

Response caching without a separate layer. Identical prompts return in ~23ms with zero quota burn. The cache refuses to store truncated responses (another Gemini bug: reasoning models returning cut-off output that then poisons your cache for an hour).

Browser-safe tokens for static sites. Mint a short-lived HMAC-signed token from a serverless function, pass it to the browser, call the gateway directly from client-side JavaScript. No auth backend. No session store.

Key stacking: 360 free requests per minute

Every provider env var accepts a comma-separated list. FreeLLM rotates round-robin per key.

GROQ_API_KEY=gsk_key1,gsk_key2,gsk_key3
GEMINI_API_KEY=AI_key1,AI_key2,AI_key3

Stack 3 keys across 5 cloud providers: ~360 req/min. All free. Enough to prototype an entire product without spending anything.

Get it running

Docker:

docker run -d -p 3000:3000 \
  -e GROQ_API_KEY=gsk_... \
  -e GEMINI_API_KEY=AI... \
  ghcr.io/devansh-365/freellm:latest

Or one-click deploy on Railway or Render (buttons in the README).

Use it from Python:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="unused")
response = client.chat.completions.create(
    model="free-smart",
    messages=[{"role": "user", "content": "Explain circuit breakers"}]
)

TypeScript, Go, Ruby, anything that speaks OpenAI. Same pattern.

Why this matters beyond FreeLLM

The LiteLLM attack exposed something the community already suspected: critical AI infrastructure is running on libraries nobody audits.

The fix is not "use my tool instead." The fix is smaller dependencies, pinned versions, codebases you can read in an afternoon. FreeLLM is 262 tests across 22 files. TypeScript, not Python. Docker images with pinned deps. MIT licensed.

If you don't use FreeLLM, build something similarly scoped. The era of "install this 100-provider mega-library and trust it with your API keys" should be over.

262 tests. 6 providers. One endpoint. Zero cost.

GitHub: github.com/devansh-365/freellm

I built an OpenAI-compatible gateway that routes across 5 free LLM providers

Devansh — Mon, 06 Apr 2026 20:22:07 +0000

Every LLM provider has a free tier.

Groq gives you 30 requests per minute. Gemini gives you 15. Cerebras gives you 30. Mistral gives you 5.

Combined, that's about 80 requests per minute. Enough for prototyping, internal tools, and side projects where you don't want to pay for API access yet.

The problem: each provider has its own SDK, its own rate limits, its own auth, and its own downtime. You end up writing provider-switching logic, catching 429 errors, and managing API keys across five different dashboards.

I got tired of this while building Metis, an AI stock analysis tool. Kept hitting Groq's limits while Gemini had capacity sitting idle. So I built FreeLLM.

What FreeLLM does

One endpoint. Five providers. Twenty models. All free.

curl http://localhost:3000/v1/chat/completions \
  -d '{"model": "free-fast", "messages": [{"role": "user", "content": "Hello!"}]}'

Your existing OpenAI SDK code works. Just change the base URL. That's the whole migration.

How the routing works

When a request comes in, FreeLLM:

Checks which providers are healthy (circuit breakers track this automatically)
Picks the best available provider based on your model choice
If that provider returns a 429 or fails, it tries the next one
You get a response

Three meta-models handle routing:

free-fast   → lowest latency (usually Groq or Cerebras)
free-smart  → most capable model (usually Gemini 2.5)
free        → maximum availability across all providers

Providers and their free tiers

Provider	Models	Free Tier
Groq	Llama 3.3 70B, Llama 4 Scout, Qwen3 32B	~30 req/min
Gemini	2.5 Flash, 2.5 Pro, 2.0 Flash	~15 req/min
Cerebras	Llama 3.1 8B, Qwen3 235B, GPT-OSS 120B	~30 req/min
Mistral	Small, Medium, Nemo	~5 req/min
Ollama	Any local model	Unlimited

What's under the hood

This isn't a simple round-robin proxy. The routing layer handles real production concerns:

Sliding-window rate limiter. Each provider's limits are tracked independently. FreeLLM knows how many requests you've sent to Groq in the last 60 seconds and won't send another if you're near the cap.

Circuit breakers. If Gemini starts returning 500s, FreeLLM pulls it from rotation. Every 30 seconds, it sends a test request. When the provider recovers, it goes back in.

Per-client rate limiting. If you expose this to a team, each client gets their own limit. Admin auth protects the config endpoints.

Zod validation. Every request is validated before it hits any provider. Bad payloads fail fast with clear error messages.

Real-time dashboard. React frontend showing provider health, request logs, and latency. You can see which providers are healthy at a glance.

Get it running in 30 seconds

git clone https://github.com/devansh-365/freellm.git
cd freellm
cp .env.example .env   # add your free API keys
docker compose up

API on localhost:3000. Dashboard on localhost:3000/dashboard. Done.

Using it with the OpenAI SDK

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "free-fast",
  messages: [{ role: "user", content: "Explain circuit breakers in 2 sentences" }],
});

No new SDK to learn. No migration effort.

Why I built this

I was building Metis and kept running into the same pattern: burn through Groq's free tier in 20 minutes of testing, switch to Gemini manually, hit their limit, switch to Mistral. Repeat.

Wrote a quick proxy to automate the switching. Added failover because providers go down randomly. Added circuit breakers because I didn't want to wait for timeouts. Added a dashboard because I wanted to see what was happening.

It grew into a proper tool. Open-sourced it because every developer prototyping with LLMs has this exact problem.

Stack

TypeScript, Express 5, React 19, Zod, Docker. MIT licensed.

GitHub: github.com/devansh-365/freellm

react native animation

Devansh — Wed, 10 Sep 2025 16:11:18 +0000