Forem: TokenHub

Build the eval set before you swap the model.

TokenHub — Tue, 12 May 2026 09:46:54 +0000

The pattern I keep seeing on teams chasing AI cost reductions: someone swaps a workload from GPT-4o to DeepSeek-V3, eyeballs a handful of outputs, calls it good, ships it. The cost graph drops the next day. Three weeks later a customer surfaces a regression — the cheaper model hallucinates a date format 6% more often, breaks the downstream invoice generator, and the rollback erases most of the savings plus a week of engineering time.

The fix isn't "don't swap models." Swapping is mostly the right move — DeepSeek-V3 at $0.07/$0.28 per million tokens vs GPT-4o at $2.50/$10 is too much money to leave on the table when the workload tolerates it.

The fix is: build the eval set before the swap, not after.

What a useful eval set looks like

You don't need fancy infrastructure for this. Five steps:

1. Pull 100-300 real prompts from production logs

Cover the long tail of inputs, not just the happy path. Include the weird ones:

The customer ticket in Spanish when your system was designed for English
The PR diff with binary files mixed in
The malformed JSON the user pasted instead of describing the issue in words
The prompt that hit a 30-second timeout last Tuesday

These are the inputs where models actually differ. A 50-prompt happy-path eval will tell you both models are 99% accurate, and you'll learn nothing.

2. Get the current model's outputs on those prompts

Save them with a timestamp. This is your baseline. Don't skip this — you'll need it for the comparison and you can't reconstruct it later if the model gets deprecated.

from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://your-gateway/v1")

baseline_outputs = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    baseline_outputs[prompt_id] = resp.choices[0].message.content

with open("baseline_gpt4o.json", "w") as f:
    json.dump(baseline_outputs, f)

3. Get the candidate model's outputs on the same prompts

Same code, different model name. If your application is wired through an OpenAI-compatible gateway, this is one config change:

candidate_outputs = {}
for prompt_id, prompt in eval_set.items():
    resp = client.chat.completions.create(
        model="deepseek-chat",  # the only change
        messages=[{"role": "user", "content": prompt}],
    )
    candidate_outputs[prompt_id] = resp.choices[0].message.content

The cost of running 300 prompts through DeepSeek-V3 is roughly $0.20. Don't optimize this step.

4. Compare programmatically where you can, human-review the rest

For structured outputs (JSON, tool calls, field extraction), programmatic comparison covers most ground:

Schema validity: does the output parse?
Field match: do the extracted fields match the baseline?
Edit distance for short strings

For free-form outputs (summaries, explanations, agent responses), human review is the bottleneck. Three minutes per prompt × 300 prompts = 15 hours, which sounds bad but is a one-time cost for a decision that affects every production call going forward.

Use an LLM-as-judge to triage: have a stronger model (Claude 3.5, GPT-4o) rate each candidate output against the baseline as better / equivalent / worse / different-but-acceptable. Then human-review only the worse and different buckets. That cuts human time by ~70% in my experience.

5. Set a threshold before you ship

"Candidate model has to match baseline on at least 95% of evals to ship" is a reasonable default. The exact number depends on the workload:

Safety-critical (legal, medical, financial): 99%+
User-facing high-stakes (customer-facing summaries): 97%+
Internal tooling (Slack summaries, dev tools): 92%+
Background tasks (data cleanup, tagging): 85%+

Pick the threshold before you see the numbers. Picking after is how you talk yourself into shipping a model that's slightly worse on the dimension you care about.

The architectural prerequisite

This whole loop only works cheaply if swapping the model for the eval is a config change, not an integration project. Wire your application code through the OpenAI Python SDK with a configurable base_url and let a gateway handle the provider-specific bits.

client = OpenAI(
    api_key="th-...",
    base_url="https://your-gateway/v1",
)

# Same client, different model per call
client.chat.completions.create(model="gpt-4o", ...)
client.chat.completions.create(model="deepseek-chat", ...)
client.chat.completions.create(model="claude-3-5-sonnet", ...)

I use TokenHub for the gateway — 40+ models behind one API key, route per call. LiteLLM self-hosted gets you the same shape if you'd rather run it yourself.

Without that wiring, every eval is a custom integration project, which is why most teams don't run evals.

TL;DR

Pull 100-300 real prompts from logs (include weird ones)
Run baseline model, save outputs
Run candidate model, save outputs
Compare (programmatic for structured, LLM-judge + human for free-form)
Threshold before shipping

The whole exercise takes a day. It saves you the rollback story.

Stop tuning one model. Route per workload.

TokenHub — Thu, 07 May 2026 13:49:58 +0000

"What's the best model?" used to be a meaningful question. Today it has the wrong shape.

The useful version is: best for which workload?

In the pipeline I migrated last month, three workloads ended up on three different models. None of those would have been my answer if you'd asked "what's the best model overall." Each one is the right answer to a more specific question.

Field extraction wants reliability

We pull structured fields from customer support tickets and route them into a workflow tool. Accuracy matters more than anything because a wrong field silently corrupts a downstream system. Speed barely matters; cost matters somewhat.

Landed on GPT-4o-mini. Boring, reliable, structured-output friendly. Cheap enough that we don't think about it. Trying to use a smarter model here would be a category error — the failure mode isn't "the model isn't smart enough," it's "the model occasionally hallucinates a field name that breaks the schema validator." Smaller, more boring models hallucinate less.

Long-context summarization wants context and price

We summarize PR diffs that range from 5K to 80K tokens, posted to Slack. Quality matters, but the difference between 91% and 94% accuracy on "summarize this diff" is invisible to the human reading the Slack message; the difference between $0.49/day and $17.50/day shows up on the next finance review.

Landed on DeepSeek-V3. The accuracy hit is real but small; the cost reduction is large; the context window covers what we need. The fact that DeepSeek speaks the OpenAI wire format meant the migration was a configuration change.

Tool-use agents want strong instruction following

We have an agent that calls 4-6 tools in sequence to draft and send a follow-up email. Each retry is expensive (in latency, in user trust, in tokens spent debugging). The model that gets it right on the first try the most often is the one we want, almost regardless of per-call cost.

Landed on Claude 3.5. The per-call price is higher; the retries-per-task ratio is lower; the net cost is similar to GPT-4o and the reliability ceiling is meaningfully higher.

The architectural prerequisite

This only works if your application code doesn't care which model serves a given call. We wired everything through the OpenAI Python SDK with a configurable base_url, then put a gateway in front that routes per request:

from openai import OpenAI

client = OpenAI(
    api_key="th-...",
    base_url="https://your-gateway/v1",
)

# Same client, different model per call
extraction = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
)

summary = client.chat.completions.create(
    model="deepseek-chat",
    messages=[...],
)

agent_step = client.chat.completions.create(
    model="claude-3-5-sonnet",
    messages=[...],
)

Switching a workload to a different model is now a one-line config change in production.

Without that, picking three models means writing three integrations, debugging three sets of edge cases, and watching one of them rot every time a provider changes their API. With it, the marginal cost of trying a new model on a workload is near zero, which is also the marginal cost of moving off a model that stops being the right answer.

Why this matters more in 2026

The market keeps shipping new models at new price points. Last quarter alone: DeepSeek dropped V3 prices, Anthropic shipped a cheaper Haiku, OpenAI introduced 4o-mini, Mistral updated their open-weight lineup. Each of those releases changed which model was the right answer for at least one workload in our system.

If switching costs a sprint, you ignore most of those announcements. If it costs a config change, you re-evaluate routinely and your costs trend down while the market keeps shipping.

The shape of the question stays the same: which model is the best answer to this specific workload right now? You want code that can give a different answer next week without a sprint of refactoring.

TL;DR

Don't pick the best model. Pick the best model per workload, and keep picking — the answer changes every few weeks. The architectural prerequisite is wiring everything through one OpenAI-compatible API surface so swapping is free.

I used TokenHub for the gateway — 40+ models behind one API key, route per call. LiteLLM self-hosted is the same architectural pattern if you'd rather run it yourself. The point, again, is the pattern more than the vendor.

Swap OpenAI for DeepSeek without rewriting a single line of code

TokenHub — Mon, 27 Apr 2026 03:55:51 +0000

Last month I added Claude to a project that was already using GPT-4o. Two SDKs, two error formats, two retry strategies. By the time I finished I had wrapped both in my own abstraction — a tiny LLM gateway, badly written, that I now had to maintain.

Then I noticed something I should have noticed earlier: most of the new providers expose an OpenAI-compatible endpoint. DeepSeek, Mistral, Together, Fireworks — they all speak the same wire format. You don't need a new SDK. You need a new base_url.

This post is the 5-minute version of that realization, with the tradeoffs I learned the hard way.

The "before" code

Standard OpenAI Python:

from openai import OpenAI

client = OpenAI(api_key="sk-...")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this PR diff..."}],
)
print(resp.choices[0].message.content)

The "after" code

from openai import OpenAI

client = OpenAI(
    api_key="th-...",
    base_url="https://jiatoken.com/v1",   # gateway
)

# Same call, different model
resp = client.chat.completions.create(
    model="deepseek-chat",                # DeepSeek-V3
    messages=[{"role": "user", "content": "Summarize this PR diff..."}],
)

That's it. Two lines changed. The rest of your code — streaming handlers, tool calls, retry logic — keeps working because the response shape is identical.

Why this works

The OpenAI Python SDK is just a typed HTTP client. It POSTs JSON to {base_url}/chat/completions. Anything that responds with the same JSON shape is, from the SDK's point of view, OpenAI.

Most gateways take advantage of this:

DeepSeek ships its own OpenAI-compatible endpoint at api.deepseek.com/v1. You can point the SDK there directly.
Anthropic does not — Claude has its own message format. You need a translator.
Gemini has both: a native API and a Vertex-side OpenAI shim.

A multi-model gateway (LiteLLM, OpenRouter, TokenHub, your own) collapses these into one endpoint. One key, one base_url, every model behind it.

What I actually save

For the workload I just migrated (~3M input tokens / 1M output per day, mostly summarization):

Model	Input $/1M	Output $/1M	Daily cost
GPT-4o	2.50	10.00	$17.50
Claude 3.5	3.00	15.00	$24.00
DeepSeek-V3	0.07	0.28	$0.49

DeepSeek isn't a drop-in quality replacement for everything — GPT-4o still wins on instruction following in my evals — but for the 80% of calls that are "summarize this", "extract these fields", "rewrite in tone X", it's fine and ~35× cheaper.

The annoying parts

A few things don't carry over cleanly through OpenAI compatibility:

Tool calling JSON shape. Most providers match it now, but older OSS models return tool calls inside the content string. Always test with your actual prompts before flipping production.
Vision. OpenAI uses image_url parts; some providers want base64. A gateway should normalize this for you — verify before you assume.
Streaming with usage stats. OpenAI added stream_options={"include_usage": True} to get token counts on the final SSE chunk. Not every backend forwards this.
Rate limits. You're now subject to the gateway's RPM, which may be lower than direct provider limits.

When NOT to use a gateway

You only ever call one provider. Direct SDK is one less moving part.
You need provider-specific features (Anthropic's prompt caching, OpenAI's Realtime API, Gemini's long context). Gateways usually lag behind native features by weeks.
You're in a regulated environment that requires data plane control. Most gateways are SaaS.

For everything else — especially side projects and prototypes where the model you "want" changes every two weeks — a gateway pays for itself in saved switching cost.

TL;DR

  client = OpenAI(
      api_key="...",
+     base_url="https://your-gateway/v1",
  )
  client.chat.completions.create(
-     model="gpt-4o",
+     model="deepseek-chat",
      ...
  )

If you want to skip running your own LiteLLM, TokenHub hosts a pre-configured gateway with 40+ models behind one key. Otherwise, LiteLLM self-hosted is the standard answer.

I Built an OpenAI-Compatible Gateway to 40+ AI Models (DeepSeek, MiniMax, Claude)

TokenHub — Sun, 26 Apr 2026 08:31:18 +0000

The Problem

I was paying for 5+ different AI subscriptions: OpenAI, Anthropic, Google, etc. Each with separate API keys, billing dashboards, and SDK quirks.

When DeepSeek-V3 dropped at ~$0.28 per million output tokens (vs GPT-4o at $10), I wanted to switch — but the friction of changing SDKs across multiple projects was a pain.

So I built TokenHub — an OpenAI-compatible gateway that routes to 40+ AI models with a single API key.

How It Works

It's a drop-in replacement for the OpenAI SDK. Just change base_url and api_key:

from openai import OpenAI

client = OpenAI(
    api_key="your-tokenhub-key",
    base_url="https://jiatoken.com/v1"
)

# Use any of 40+ models — DeepSeek, MiniMax, Claude, GPT, Gemini, Llama, etc.
response = client.chat.completions.create(
    model="deepseek-v3",
    messages=[{"role": "user", "content": "Explain async/await in Python"}]
)

print(response.choices[0].message.content)

That's it. The same code works with:

gpt-4o
claude-sonnet-4-6
gemini-2.5-pro
deepseek-v3 / deepseek-r1
minimax-text-01
llama-3.3-70b
...and more

Real Pricing Comparison

Per million tokens (input / output):

Model	Provider	Input	Output
GPT-4o	OpenAI	$2.50	$10.00
GPT-4o mini	OpenAI	$0.15	$0.60
DeepSeek-V3	TokenHub	$0.07	$0.28
DeepSeek-R1	TokenHub	$0.14	$0.55
MiniMax-Text-01	TokenHub	$0.10	$0.40

For high-volume workloads (RAG, agents, batch summarization), DeepSeek-V3 is ~35x cheaper than GPT-4o for output tokens.

When to Use Which Model

A quick mental model from my own usage:

Cheap & good enough → DeepSeek-V3 (most general tasks)
Reasoning → DeepSeek-R1 (CoT-style tasks)
Long context → MiniMax-Text-01 (200K+ tokens)
Frontier capability → GPT-4o or Claude (still worth it for hard problems)
Code → Claude Sonnet 4.6 or DeepSeek-V3

The win is being able to A/B test across models without rewriting code.

Why I Open-Sourced the Routing Logic

(Note: TokenHub itself is hosted, but the routing pattern is straightforward.)

The hardest part wasn't the proxy — it was:

Normalizing function-calling formats across providers
Handling streaming differences (SSE format quirks)
Token counting for accurate billing pre-request

If you're building something similar, the OpenAI spec is the de facto standard. Most providers either match it or have OpenAI-compatible endpoints already.

Try It

If you're tired of juggling AI subscriptions:

👉 https://jiatoken.com
Free credits to start
Pay-as-you-go, no monthly commitment
Compatible with OpenAI SDK out of the box

I'd love feedback — especially on which models you'd want added, or pricing pain points.

What's your current setup? Are you using a single provider or juggling multiple?