<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Owen</title>
    <description>The latest articles on Forem by Owen (@owen_fox).</description>
    <link>https://forem.com/owen_fox</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3893304%2Fb8cec06b-7789-423e-a8d0-386db7f00620.png</url>
      <title>Forem: Owen</title>
      <link>https://forem.com/owen_fox</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/owen_fox"/>
    <language>en</language>
    <item>
      <title>Grok API: Pricing, Setup &amp; Access Guide (2026)</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 11:07:34 +0000</pubDate>
      <link>https://forem.com/owen_fox/grok-api-pricing-setup-access-guide-2026-dfe</link>
      <guid>https://forem.com/owen_fox/grok-api-pricing-setup-access-guide-2026-dfe</guid>
      <description>&lt;h1&gt;
  
  
  Grok API: Pricing, Setup &amp;amp; Access Guide (2026)
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Three Grok models are worth knowing: Grok 4.1 Fast ($0.20/M input, 2M context) for high-volume work, Grok 4.20 ($2.00/M input) for deep reasoning, and Grok Code Fast 1 ($0.20/M input, 256K context) for coding agents. All three are OpenAI-compatible. Getting to a working API call is a two-line config change in your existing OpenAI SDK setup.&lt;/p&gt;

&lt;p&gt;Grok 4.1 Fast gives you a 2-million-token context window at $0.20/M input — that's 5x more context than GPT-5.4 Mini at a lower price, with real-time X search built in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Grok API Actually Offers
&lt;/h2&gt;

&lt;p&gt;xAI's API covers three distinct use cases: a cost-efficient general model with an unusually large context window, a flagship reasoning model with a built-in multi-agent architecture, and a coding-specific model optimized for agentic workflows.&lt;/p&gt;

&lt;p&gt;Function calling, structured output, image input, and streaming all follow the same schema as the OpenAI Chat Completions API. If you're already using any OpenAI-compatible SDK, switching to Grok is a two-line change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing: Every Model Compared
&lt;/h2&gt;

&lt;p&gt;All prices verified via &lt;a href="https://ofox.ai/models" rel="noopener noreferrer"&gt;ofox.ai/models&lt;/a&gt;, April 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Model ID&lt;/th&gt;
&lt;th&gt;Input / 1M&lt;/th&gt;
&lt;th&gt;Output / 1M&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.1 Fast&lt;/td&gt;
&lt;td&gt;&lt;code&gt;grok-4-1-fast&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.20&lt;/td&gt;
&lt;td&gt;&lt;code&gt;grok-4.20&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$6.00&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.20 Multi-Agent&lt;/td&gt;
&lt;td&gt;&lt;code&gt;grok-4.20-multi-agent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$6.00&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok Code Fast 1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;grok-code-fast-1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; is automatic — no configuration needed. Cache hit prices vary by model: Grok 4.1 Fast drops to $0.05/M (75% off), Grok 4.20 drops to $0.20/M (90% off from $2.00), and Grok Code Fast 1 drops to $0.02/M (90% off from $0.20). For long, repeated system prompts, this makes the effective cost significantly lower than the sticker price.&lt;/p&gt;
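&lt;p&gt;As a back-of-envelope check, here's the blended input rate — the rates come from the table above, while the 80% cached fraction is an illustrative assumption for a long, stable system prompt, not a measured figure:&lt;/p&gt;

```python
# Blended $/M input tokens when part of each request hits the prompt cache.
# Rates are from the pricing table; the 80% cached fraction is an assumption.
def effective_input_cost_per_m(base_rate, cache_rate, cached_fraction):
    """Weighted average of cache-hit and cache-miss input pricing."""
    return cached_fraction * cache_rate + (1 - cached_fraction) * base_rate

# Grok 4.1 Fast: $0.20 base, $0.05 on cache hits
print(round(effective_input_cost_per_m(0.20, 0.05, 0.80), 4))  # 0.08

# Grok Code Fast 1: $0.20 base, $0.02 on cache hits
print(round(effective_input_cost_per_m(0.20, 0.02, 0.80), 4))  # 0.056
```

&lt;p&gt;At an 80% hit rate, Grok 4.1 Fast's effective input rate drops from $0.20/M to $0.08/M.&lt;/p&gt;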

&lt;p&gt;&lt;strong&gt;Built-in tool fees&lt;/strong&gt; (web search, X search, code execution): $2.50–$5.00 per 1,000 successful tool calls. These are charged separately from token costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Grok Pricing Compares
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input / 1M&lt;/th&gt;
&lt;th&gt;Output / 1M&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.1 Fast&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok Code Fast 1&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Flash&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 Mini&lt;/td&gt;
&lt;td&gt;~$0.75&lt;/td&gt;
&lt;td&gt;$4.50&lt;/td&gt;
&lt;td&gt;400K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok 4.20&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$6.00&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Sources: &lt;a href="https://ofox.ai/models" rel="noopener noreferrer"&gt;ofox.ai/models&lt;/a&gt;, &lt;a href="https://docs.x.ai/docs/models" rel="noopener noreferrer"&gt;docs.x.ai/docs/models&lt;/a&gt;, April 2026.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Grok 4.1 Fast and Grok Code Fast 1 both land at $0.20/M input, undercutting DeepSeek V4's $0.30. Grok 4.1 Fast pairs that with a 2M context window vs DeepSeek's 1M; Grok Code Fast 1's 256K window is tuned for large codebases. For a full cross-model cost breakdown, see our &lt;a href="https://ofox.ai/blog/how-to-reduce-ai-api-costs-2026/" rel="noopener noreferrer"&gt;AI API cost reduction guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Models Worth Knowing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Grok 4.1 Fast — the everyday workhorse
&lt;/h3&gt;

&lt;p&gt;At $0.20/M input, Grok 4.1 Fast is cheaper than GPT-5.4 Mini per token, but with 5x more context. The 2M token window means you can load an entire codebase, a long document collection, or a multi-day conversation history without truncating.&lt;/p&gt;

&lt;p&gt;Two things set it apart from other $0.20-tier models: real-time web search via X (no knowledge cutoff for current events), and automatic prompt caching that requires zero configuration. Reasoning and non-reasoning modes are both available — toggle based on whether the task needs deliberate step-by-step thinking or fast pattern matching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grok 4.20 — multi-agent reasoning
&lt;/h3&gt;

&lt;p&gt;Grok 4.20 is the only model currently available via API that exposes a multi-agent architecture at the call level. One request dispatches four internal agents — Grok, Harper, Benjamin, and Lucas — that cross-check each other's reasoning and actively debate conclusions to reduce hallucinations.&lt;/p&gt;

&lt;p&gt;At $2.00/M input, it's 10x more expensive than Grok 4.1 Fast. For most tasks, that premium isn't justified. For high-stakes analysis — technical due diligence, research synthesis, complex decision support — the multi-perspective output is meaningfully better than a single-model call. The &lt;code&gt;grok-4.20-multi-agent&lt;/code&gt; model ID explicitly routes to this architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grok Code Fast 1 — for coding agents
&lt;/h3&gt;

&lt;p&gt;Grok Code Fast 1 (&lt;code&gt;grok-code-fast-1&lt;/code&gt;) is purpose-built for agentic coding workflows. The 256K context window holds large codebases in memory across multi-step tool calls. At $0.20/M input and $1.50/M output, it's priced for high-volume use in CI pipelines, code review agents, and IDE integrations.&lt;/p&gt;

&lt;p&gt;It supports function calling, structured output, and streaming — the full toolkit for building coding agents. For a comparison of how it stacks up against Claude and GPT on actual coding benchmarks, see our &lt;a href="https://ofox.ai/blog/best-ai-model-for-coding-2026/" rel="noopener noreferrer"&gt;best AI model for coding guide&lt;/a&gt;.&lt;/p&gt;
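&lt;p&gt;Because the schema matches OpenAI's, declaring a tool and dispatching the model's tool calls looks like standard function-calling code. The &lt;code&gt;run_tests&lt;/code&gt; tool and its stubbed handler below are illustrative, not part of the API:&lt;/p&gt;

```python
import json

# Illustrative tool definition in the OpenAI function-calling schema.
run_tests_tool = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project test suite and return pass/fail counts.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

def dispatch(tool_call):
    """Route a model-issued tool call to a local handler (stubbed here)."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    if fn["name"] == "run_tests":
        return {"passed": 12, "failed": 0, "path": args["path"]}  # stub result
    raise ValueError("unknown tool: " + fn["name"])
```

&lt;p&gt;The tool list is passed as &lt;code&gt;tools=[run_tests_tool]&lt;/code&gt; on the same &lt;code&gt;chat.completions.create&lt;/code&gt; call shown below, and the model's tool invocations come back on &lt;code&gt;response.choices[0].message.tool_calls&lt;/code&gt;.&lt;/p&gt;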

&lt;h2&gt;
  
  
  Setup: Two Ways to Access the Grok API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Through ofox.ai (recommended for most developers)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://ofox.ai" rel="noopener noreferrer"&gt;ofox.ai&lt;/a&gt; exposes the full Grok model family through an OpenAI-compatible endpoint. One API key covers Grok, Claude, GPT, Gemini, and 100+ other models — no separate xAI account needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-ofox-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grok-4-1-fast&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain prompt caching in one paragraph.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;JavaScript/TypeScript:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;your-ofox-api-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;grok-4-1-fast&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain prompt caching in one paragraph.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're already using any model through ofox.ai, switching to Grok is a one-string change. For a full walkthrough of migrating from the OpenAI SDK to ofox.ai, see our &lt;a href="https://ofox.ai/blog/openai-sdk-migration-to-ofoxai-guide-2026/" rel="noopener noreferrer"&gt;OpenAI SDK migration guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Through xAI directly
&lt;/h3&gt;

&lt;p&gt;xAI's official endpoint is &lt;code&gt;https://api.x.ai/v1&lt;/code&gt;. New accounts receive free credits to get started — check &lt;a href="https://x.ai/api" rel="noopener noreferrer"&gt;x.ai/api&lt;/a&gt; for current amounts. Enrolling in xAI's Data Sharing Program adds additional monthly free usage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.x.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-xai-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The direct route gets you the fastest access to new model releases and exclusive features like Live Search and X Search. The tradeoff: a separate billing relationship, and payment requires an international credit card.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Which Model
&lt;/h2&gt;

&lt;p&gt;Context size and cost tolerance drive most of this decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-volume pipelines, RAG, long-document tasks&lt;/strong&gt; → Grok 4.1 Fast. The 2M context and $0.20 input price are hard to beat at this tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding agents, IDE integrations, code review&lt;/strong&gt; → Grok Code Fast 1. The 256K window and coding-specific tuning make it the right fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex analysis, research, high-stakes decisions&lt;/strong&gt; → Grok 4.20 or Grok 4.20 Multi-Agent. The 10x price premium is only worth it when output quality genuinely matters more than cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time information&lt;/strong&gt; → Any Grok model. The X integration gives you current data that models with fixed knowledge cutoffs can't match.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a broader comparison of how Grok fits into the current model landscape, see our &lt;a href="https://ofox.ai/blog/claude-vs-gpt-vs-gemini-model-comparison-guide-2026/" rel="noopener noreferrer"&gt;LLM leaderboard and model comparison guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Notes
&lt;/h2&gt;

&lt;p&gt;Prompt caching is automatic. Unlike Anthropic's explicit cache control headers, Grok caches repeated prefixes without any configuration. Cache hit discounts differ by model: Grok 4.1 Fast saves 75% ($0.05/M), Grok 4.20 saves 90% ($0.20/M), and Grok Code Fast 1 saves 90% ($0.02/M). If your system prompt is long and consistent across requests, you're already paying well below the sticker price.&lt;/p&gt;

&lt;p&gt;Tool call pricing is separate. The $2.50–$5.00/1,000 fee applies to built-in tools (web search, X search, code execution). Standard function calling with your own tools is charged at normal token rates.&lt;/p&gt;
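&lt;p&gt;A quick sketch of how the two meters combine on an agent run — Grok 4.1 Fast token rates come from the table above, while the low end of the tool-fee range and the traffic numbers are illustrative:&lt;/p&gt;

```python
# Blended cost of an agent run: token charges plus built-in tool fees.
# Defaults assume Grok 4.1 Fast rates and the $2.50/1K low end of the fee range.
def run_cost(input_tokens_m, output_tokens_m, tool_calls,
             in_rate=0.20, out_rate=0.50, tool_fee_per_k=2.50):
    tokens = input_tokens_m * in_rate + output_tokens_m * out_rate
    tools = tool_calls / 1000 * tool_fee_per_k  # successful built-in calls only
    return tokens + tools

# 5M input + 0.5M output tokens plus 200 successful web searches:
print(round(run_cost(5, 0.5, 200), 2))  # 1.75
```

&lt;p&gt;Note that the tool fee line applies only to xAI's built-in tools; your own function calls show up purely as tokens.&lt;/p&gt;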

&lt;p&gt;The 2M token window is real, but very long contexts can affect output quality on complex reasoning tasks. For most practical workloads — codebases, document collections, long conversations — you won't hit the ceiling.&lt;/p&gt;

&lt;p&gt;New xAI accounts have conservative rate limits. If you're building production workloads, plan for this or use an API gateway like ofox.ai that manages rate limits across multiple upstream providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Get an API key at &lt;a href="https://ofox.ai" rel="noopener noreferrer"&gt;ofox.ai&lt;/a&gt; (covers Grok + all other models) or &lt;a href="https://x.ai/api" rel="noopener noreferrer"&gt;x.ai/api&lt;/a&gt; (xAI direct, free credits for new accounts)&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;base_url&lt;/code&gt; to &lt;code&gt;https://api.ofox.ai/v1&lt;/code&gt; or &lt;code&gt;https://api.x.ai/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use model ID &lt;code&gt;grok-4-1-fast&lt;/code&gt; for general tasks, &lt;code&gt;grok-code-fast-1&lt;/code&gt; for coding, &lt;code&gt;grok-4.20&lt;/code&gt; for deep reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Grok API's real unlock for developers isn't just the pricing — it's a 2M context window at the $0.20 tier that makes whole-codebase and whole-document reasoning practical without the cost math falling apart.&lt;/p&gt;

&lt;p&gt;For teams already using an API gateway, adding Grok is a one-line model ID change. For teams evaluating whether to consolidate API providers, our &lt;a href="https://ofox.ai/blog/ai-api-aggregation-access-every-model-one-endpoint/" rel="noopener noreferrer"&gt;AI API aggregation guide&lt;/a&gt; covers the tradeoffs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/grok-api-pricing-setup-access-guide-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>grok</category>
      <category>api</category>
      <category>xai</category>
    </item>
    <item>
      <title>OpenAI Just Named It Workspace Agents. We Open-Sourced Our Lark Version Six Months Ago</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 08:50:31 +0000</pubDate>
      <link>https://forem.com/owen_fox/openai-just-named-it-workspace-agents-we-open-sourced-our-lark-version-six-months-ago-kc</link>
      <guid>https://forem.com/owen_fox/openai-just-named-it-workspace-agents-we-open-sourced-our-lark-version-six-months-ago-kc</guid>
      <description>&lt;p&gt;&lt;em&gt;Hero image: from &lt;a href="https://openai.com/index/introducing-workspace-agents-in-chatgpt/" rel="noopener noreferrer"&gt;OpenAI's Workspace Agents announcement&lt;/a&gt; (April 22, 2026).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — OpenAI shipped &lt;a href="https://openai.com/index/introducing-workspace-agents-in-chatgpt/" rel="noopener noreferrer"&gt;Workspace Agents&lt;/a&gt; on April 22: shared, cloud-resident AI agents that live in Slack and ChatGPT, built to do the work your team already does. That's the same shape as &lt;a href="https://github.com/ofoxai/lark-claude-bot" rel="noopener noreferrer"&gt;Marvin&lt;/a&gt;, the Lark/Feishu bot we open-sourced six months ago — but MIT-licensed, model-agnostic via the Ofox gateway, MCP-extensible, and running on your own hardware. Here's what Workspace Agents confirmed, and why the open-source version matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  A 30-second Lark conversation
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;"Which blog post drove the most conversions last week?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A teammate asked this in a Lark group chat. Within a minute, a bot card appeared: the top five posts ranked by paid conversions, anomalies flagged, with a throwaway line at the end: &lt;em&gt;"This one is 3x the runner-up. Probably worth a follow-up piece."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Doing this manually: open GA4, pick a report template, add dimensions, add filters, export CSV, paste into a spreadsheet, compare. Ten minutes minimum. Marvin (our in-house Lark bot) did it in the time it took to type the question.&lt;/p&gt;

&lt;p&gt;This isn't a new technique. It's just putting the agent &lt;strong&gt;where the work actually happens&lt;/strong&gt; — the chat your team is already in — instead of making people walk over to the agent's UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI just made this a product category
&lt;/h2&gt;

&lt;p&gt;Yesterday (April 22, 2026) OpenAI announced Workspace Agents. The shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Codex-powered&lt;/strong&gt;, built for "work you're already doing" — preparing reports, writing code, responding to messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs in the cloud&lt;/strong&gt;, keeps going when you're offline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared across the org&lt;/strong&gt; — build once, team reuses via ChatGPT or Slack&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedules + approval flows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free until May 6&lt;/strong&gt;, credit-based pricing after&lt;/li&gt;
&lt;li&gt;Framed as "the evolution of GPTs"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most striking part of the launch isn't a feature — it's the repeated framing that &lt;strong&gt;the agent should come to the system the team already works in, not the other way around&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The industry has been converging on this for a while
&lt;/h2&gt;

&lt;p&gt;The past two years have moved AI from "chat UI" toward "agent inside a workflow":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code put the agent in your &lt;strong&gt;terminal&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Cursor / Windsurf / Zed put it in your &lt;strong&gt;IDE&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Linear's Ask Agent put it in your &lt;strong&gt;issue tracker&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Workspace Agents now puts it in your &lt;strong&gt;team IM&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The consensus is clear: &lt;strong&gt;no one should have to leave where they work to use AI&lt;/strong&gt;. The correct place for AI is wherever you already are.&lt;/p&gt;

&lt;p&gt;For many teams, that place is Lark/Feishu. Real collaboration, real decisions, real feedback happen there. If an agent can't meet the team there, it might as well not exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The open-source reference implementation: Marvin
&lt;/h2&gt;

&lt;p&gt;We open-sourced &lt;a href="https://github.com/ofoxai/lark-claude-bot" rel="noopener noreferrer"&gt;Marvin&lt;/a&gt; six months ago — a TypeScript Lark/Feishu bot framework built on the Claude Code CLI. Named after the depressive-but-terrifyingly-capable robot from &lt;em&gt;The Hitchhiker's Guide to the Galaxy&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Feature-by-feature against Workspace Agents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workspace Agents promise&lt;/th&gt;
&lt;th&gt;Marvin's implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Always running&lt;/td&gt;
&lt;td&gt;Sessions persisted to disk, auto-resumes interrupted tasks on restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared in the team IM&lt;/td&gt;
&lt;td&gt;Live progress cards in Lark/Feishu groups, ⬜ → 🔄 → ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schedule + event triggers&lt;/td&gt;
&lt;td&gt;Built-in cron + WebSocket event listener&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ask for approval&lt;/td&gt;
&lt;td&gt;Admin can interrupt mid-task with a message; Claude resumes with full context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Org permissions / safety&lt;/td&gt;
&lt;td&gt;Output filter strips API keys, tokens, internal IDs, internal IPs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architecture is smaller than you'd guess: Lark WebSocket in → Claude Code CLI → real-time progress card renderer out. Full diagram in the &lt;a href="https://github.com/ofoxai/lark-claude-bot#readme" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Marvin actually does for us
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Take a task from Lark, ship it to production.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Someone flags a copy issue in chat, or @-mentions Marvin with "take this." Marvin greps the repo to find the code, opens a feature branch, pushes a PR, waits for CI, merges to dev, opens the second PR to master, merges that, confirms the deploy, reports back in chat, and closes the task. Human input: one sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. A GA4 question, answered in under a minute.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The opening anecdote. Marvin calls GA4 through its MCP server, cross-references dimensions, flags anomalies, returns a structured summary. Ten minutes of human work, done in 30 to 60 seconds.&lt;/p&gt;

&lt;p&gt;The underrated part here isn't speed — it's that &lt;strong&gt;the cost of asking a data question drops to nearly zero&lt;/strong&gt;. Before, you'd think "is this worth opening GA4 for?" and often skip. Now you just ask. Question volume goes up, and decisions get more evidence-based.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Everything else you plug in.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Marvin's capability surface equals the MCP servers you connect. Firecrawl for research, freee for accounting, Context7 for docs, your own internal services. Add an MCP server today; Marvin uses it tomorrow. That's the real ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four ways the choices differ
&lt;/h2&gt;

&lt;p&gt;A side-by-side parameter table isn't the point. The philosophy differences are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;OpenAI Workspace Agents&lt;/th&gt;
&lt;th&gt;Marvin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent ↔ model&lt;/td&gt;
&lt;td&gt;Locked to Codex&lt;/td&gt;
&lt;td&gt;Pluggable via the &lt;a href="https://ofox.ai" rel="noopener noreferrer"&gt;Ofox&lt;/a&gt; gateway — swap models anytime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data boundary&lt;/td&gt;
&lt;td&gt;Runs on OpenAI's cloud&lt;/td&gt;
&lt;td&gt;Runs on your machine / your server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IM decoupling&lt;/td&gt;
&lt;td&gt;Primarily Slack (and ChatGPT)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;lark.ts&lt;/code&gt; is an adapter — swap Slack/Discord without touching the core&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool ecosystem&lt;/td&gt;
&lt;td&gt;Codex + OpenAI preset integrations&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Any MCP server&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modifiability&lt;/td&gt;
&lt;td&gt;Closed SaaS&lt;/td&gt;
&lt;td&gt;Persona, rules, pipeline are all files in your repo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For many teams, &lt;strong&gt;data boundary&lt;/strong&gt; and &lt;strong&gt;modifiability&lt;/strong&gt; are the decisive ones. What's in your team chat, your repo, your GA4 property, your accounting — that's core organizational information, and you probably don't want it flowing through someone else's cloud.&lt;/p&gt;

&lt;p&gt;Marvin runs on your own hardware. Ours lives on a Mac mini in the office, managed by launchd, with no external cloud services involved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running it
&lt;/h2&gt;

&lt;p&gt;Repo: &lt;strong&gt;&lt;a href="https://github.com/ofoxai/lark-claude-bot" rel="noopener noreferrer"&gt;github.com/ofoxai/lark-claude-bot&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have &lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; installed, paste this into it and it'll walk you through setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Clone and configure https://github.com/ofoxai/lark-claude-bot for me:
1. Clone the repo and cd into it
2. Run npm install
3. Copy .env.example to .env
4. Ask me for Lark App ID, App Secret, and Encrypt Key, fill them in
5. If I don't have a Lark app yet, walk me through creating one on open.larksuite.com or open.feishu.cn
6. Once configured, run npm run dev to start the bot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few minutes later there's a Marvin in your team chat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;OpenAI has made "workspace agent" a product category. That's a net good — more teams will take this form seriously. But the future of agents shouldn't be "one SaaS vendor, one model, one IM they picked for you."&lt;/p&gt;

&lt;p&gt;Marvin is one reference implementation of the opposite: open, model-agnostic, MCP-extensible, locally hosted. It's also the workflow-side incarnation of the Ofox gateway philosophy — &lt;strong&gt;one API for every model, one bot shape for every IM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Code is here: &lt;strong&gt;&lt;a href="https://github.com/ofoxai/lark-claude-bot" rel="noopener noreferrer"&gt;github.com/ofoxai/lark-claude-bot&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/workspace-agents-open-source-marvin-lark-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>openai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>LLM Leaderboard: Best AI Models Ranked (April 2026)</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:33:09 +0000</pubDate>
      <link>https://forem.com/owen_fox/llm-leaderboard-best-ai-models-ranked-april-2026-mf0</link>
      <guid>https://forem.com/owen_fox/llm-leaderboard-best-ai-models-ranked-april-2026-mf0</guid>
      <description>&lt;p&gt;There is no single best model in April 2026 — the leaderboard has fractured by task.&lt;/p&gt;

&lt;p&gt;Claude Opus 4.7 dominates coding benchmarks at 82% on SWE-bench Verified and ranks first on LM Arena with a 1504 Elo rating. Three models tie at the top of the Artificial Analysis Intelligence Index (score of 57): Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4. DeepSeek V3.2 is the price leader at $0.29 per million input tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  How These Rankings Work
&lt;/h2&gt;

&lt;p&gt;These rankings draw on four independent benchmarking systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LM Arena&lt;/strong&gt; — Blind human preference voting across 339 models with 5.7M+ votes. The largest human-preference dataset in existence, using chess-style Elo ratings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SWE-bench Verified&lt;/strong&gt; — Evaluates whether models can resolve actual GitHub issues through agent-based testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPQA Diamond&lt;/strong&gt; — Graduate-level science questions where human PhD experts typically score 65-70%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt; — Combines multiple benchmarks into composite scoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overall Leaderboard (LM Arena Top 10)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Elo Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;claude-opus-4-7-thinking&lt;/td&gt;
&lt;td&gt;1504&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;claude-opus-4-6-thinking&lt;/td&gt;
&lt;td&gt;1502&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;claude-opus-4-7&lt;/td&gt;
&lt;td&gt;1497&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;claude-opus-4-6&lt;/td&gt;
&lt;td&gt;1496&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;muse-spark (Meta)&lt;/td&gt;
&lt;td&gt;1493&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;gemini-3.1-pro-preview&lt;/td&gt;
&lt;td&gt;1493&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;gemini-3-pro&lt;/td&gt;
&lt;td&gt;1486&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;grok-4.20-beta1&lt;/td&gt;
&lt;td&gt;1482&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;gpt-5.4-high&lt;/td&gt;
&lt;td&gt;1482&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;grok-4.20-beta-0309-reasoning&lt;/td&gt;
&lt;td&gt;1480&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Anthropic holds four of the top five spots. The 24-point gap between first and tenth is statistically meaningful but not a blowout.&lt;/p&gt;
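&lt;p&gt;For intuition, an Elo gap maps directly to an expected head-to-head win rate via the standard chess-style logistic formula. A minimal sketch, plugging in the rank-1 and rank-10 scores from the table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Expected win rate implied by an Elo gap (standard logistic formula)
def expected_win_rate(elo_a, elo_b):
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

# Rank 1 (1504) vs. rank 10 (1480) on LM Arena
rate = expected_win_rate(1504, 1480)
print(round(rate, 3))  # prints 0.534
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A 24-point gap means the leader is expected to win only about 53% of blind matchups, which is why "meaningful but not a blowout" is the right read.&lt;/p&gt;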

&lt;h2&gt;
  
  
  Best for Coding: SWE-bench Rankings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;82.0%&lt;/td&gt;
&lt;td&gt;Released April 16, 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro Preview&lt;/td&gt;
&lt;td&gt;78.8%&lt;/td&gt;
&lt;td&gt;Best price among top-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6 (Thinking)&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;Cheaper alternative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;Tied with Opus 4.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.3 Codex&lt;/td&gt;
&lt;td&gt;78.0%&lt;/td&gt;
&lt;td&gt;Coding-tuned variant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The spread between #1 and #5 is roughly 4 percentage points. Differences appear in edge cases — complex multi-file refactors, ambiguous specs, long-running tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best for Reasoning: Composite Intelligence Index
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;AA Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro Preview&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The three-way tie at 57 points indicates the current frontier is a plateau. Selection depends on cost, context window, and task-specific requirements rather than performance differentiation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Value: Price-Performance Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input $/M&lt;/th&gt;
&lt;th&gt;Output $/M&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;SWE-bench&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;$0.29&lt;/td&gt;
&lt;td&gt;$0.43&lt;/td&gt;
&lt;td&gt;164K&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;vendor-reported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro Preview&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$12.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;78.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;82.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DeepSeek V3.2 is 17x cheaper than Claude Opus 4.7 on input tokens. Kimi K2.6 is roughly 8x cheaper, with an Intelligence Index score only 3 points below the frontier band.&lt;/p&gt;
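&lt;p&gt;To translate those multipliers into per-request dollars, here is a minimal cost sketch using the prices from the table; the 10K-input / 2K-output request size is an illustrative assumption, not a benchmark figure:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Per-request cost from the table's $/M-token prices
PRICES = {  # model: (input $/M, output $/M)
    "deepseek-v3.2": (0.29, 0.43),
    "kimi-k2.6": (0.60, 2.50),
    "claude-opus-4.7": (5.00, 25.00),
}

def request_cost(model, input_tokens, output_tokens):
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A hypothetical 10K-in / 2K-out request
for model in PRICES:
    print(model, round(request_cost(model, 10_000, 2_000), 5))
# deepseek-v3.2 0.00376
# kimi-k2.6 0.011
# claude-opus-4.7 0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At this request shape, Claude Opus 4.7 costs about 27x what DeepSeek V3.2 does — wider than the 17x input-price gap, because the output-price gap (about 58x) is even larger.&lt;/p&gt;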

&lt;h2&gt;
  
  
  Best Open-Source Model
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kimi K2.6&lt;/strong&gt; from Moonshot AI — 1-trillion-parameter Mixture-of-Experts architecture with 32B active parameters, 256K context window. Scores 54 on the Intelligence Index, ahead of Claude Opus 4.6 (53).&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Model Should You Pick?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For Coding:&lt;/strong&gt; Claude Opus 4.7 leads at 82% SWE-bench. Cost-conscious teams should evaluate Kimi K2.6.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Long-Context Work:&lt;/strong&gt; Gemini 3.1 Pro Preview — 1M-token window with tied frontier performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For High-Volume Production:&lt;/strong&gt; DeepSeek V3.2 as a cost-effective alternative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For General Chat:&lt;/strong&gt; Claude Opus 4.7 (thinking mode) leads, but gaps are negligible for most apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Self-Hosted:&lt;/strong&gt; Kimi K2.6 — the only open-weight model that belongs in this conversation.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://ofox.ai/blog/llm-leaderboard-best-ai-models-ranked-2026/" rel="noopener noreferrer"&gt;ofox.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
