<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Joske Vermeulen</title>
    <description>The latest articles on Forem by Joske Vermeulen (@ai_made_tools).</description>
    <link>https://forem.com/ai_made_tools</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3826720%2Fae1f6683-395f-4709-ba99-2212323b958e.png</url>
      <title>Forem: Joske Vermeulen</title>
      <link>https://forem.com/ai_made_tools</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ai_made_tools"/>
    <language>en</language>
    <item>
      <title>What is MCP? The Model Context Protocol Explained for Developers</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Tue, 05 May 2026 10:49:27 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/what-is-mcp-the-model-context-protocol-explained-for-developers-cn4</link>
      <guid>https://forem.com/ai_made_tools/what-is-mcp-the-model-context-protocol-explained-for-developers-cn4</guid>
      <description>&lt;p&gt;MCP (Model Context Protocol) is an open standard that lets AI applications connect to external tools, APIs, and data sources through a single protocol. Think of it as USB-C for AI — instead of building custom integrations for every tool, you build one MCP server and any MCP-compatible AI client can use it.&lt;/p&gt;

&lt;p&gt;Anthropic created MCP in November 2024. By 2026, it's been adopted by OpenAI, Google, Microsoft, and thousands of developers. It now lives under the Linux Foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem MCP solves
&lt;/h2&gt;

&lt;p&gt;Before MCP, connecting an AI model to a tool meant writing custom code for each combination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude + Slack = custom integration
Claude + GitHub = custom integration
Claude + Database = custom integration
GPT + Slack = ANOTHER custom integration
GPT + GitHub = ANOTHER custom integration
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With MCP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Slack MCP Server → works with Claude, GPT, Gemini, Cursor, VS Code...
GitHub MCP Server → works with Claude, GPT, Gemini, Cursor, VS Code...
Database MCP Server → works with Claude, GPT, Gemini, Cursor, VS Code...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build the server once, use it everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;MCP has three components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Host&lt;/strong&gt; — The AI application (Claude Desktop, &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, VS Code, &lt;a href="https://www.aimadetools.com/blog/how-to-use-claude-code/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Client&lt;/strong&gt; — Built into the host; each client keeps a dedicated one-to-one connection to a single server and handles the protocol communication&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Server&lt;/strong&gt; — Your integration. Exposes tools, data, and prompts to the AI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → MCP Host (Claude) → MCP Client → MCP Server → Your tool/API/database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Three primitives
&lt;/h2&gt;

&lt;p&gt;MCP servers expose three types of capabilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; — Actions the AI can take (send a message, create a file, query a database)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt; — Data the AI can read (files, database records, API responses)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompts&lt;/strong&gt; — Reusable prompt templates with parameters&lt;/p&gt;
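
&lt;p&gt;Here's what those primitives look like in practice. Below is a minimal server sketch in TypeScript using the official &lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt; package (plus &lt;code&gt;zod&lt;/code&gt; for argument schemas). Treat it as illustrative: exact method signatures can shift between SDK versions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "demo", version: "1.0.0" });

// Tool: an action the model can invoke with validated arguments
server.tool("add", { a: z.number(), b: z.number() }, async ({ a, b }) =&gt; ({
  content: [{ type: "text", text: String(a + b) }],
}));

// Resource: read-only data the host can pull into the model's context
server.resource("config", "config://app", async (uri) =&gt; ({
  contents: [{ uri: uri.href, text: "units=metric" }],
}));

// Prompt: a reusable, parameterized prompt template
server.prompt("review", { code: z.string() }, ({ code }) =&gt; ({
  messages: [
    { role: "user", content: { type: "text", text: `Review this code:\n${code}` } },
  ],
}));

// Any MCP host (Claude Desktop, Cursor, VS Code, ...) can launch this
// server over stdio and discover all three capabilities automatically.
await server.connect(new StdioServerTransport());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;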

&lt;h2&gt;
  
  
  Who uses MCP?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.aimadetools.com/blog/how-to-use-claude-code/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;&lt;/strong&gt; and Claude Desktop — native MCP support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/strong&gt; — MCP for tool integrations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VS Code&lt;/strong&gt; — via Copilot MCP extensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT&lt;/strong&gt; — OpenAI adopted MCP in 2025&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.aimadetools.com/blog/opencode-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt;&lt;/strong&gt; — MCP server support&lt;/li&gt;
&lt;li&gt;Thousands of community-built MCP servers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  MCP vs A2A
&lt;/h2&gt;

&lt;p&gt;MCP connects AI to tools (vertical). &lt;a href="https://www.aimadetools.com/blog/what-is-a2a-protocol/?utm_source=devto" rel="noopener noreferrer"&gt;A2A&lt;/a&gt; connects AI agents to each other (horizontal). They're complementary — most production systems use both. See our &lt;a href="https://www.aimadetools.com/blog/mcp-vs-a2a-vs-acp/?utm_source=devto" rel="noopener noreferrer"&gt;MCP vs A2A comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn more
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.aimadetools.com/blog/mcp-complete-developer-guide/?utm_source=devto" rel="noopener noreferrer"&gt;MCP Complete Developer Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aimadetools.com/blog/build-mcp-server-typescript/?utm_source=devto" rel="noopener noreferrer"&gt;How to Build an MCP Server (TypeScript)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aimadetools.com/blog/mcp-security-risks/?utm_source=devto" rel="noopener noreferrer"&gt;MCP Security Risks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aimadetools.com/blog/best-mcp-servers/?utm_source=devto" rel="noopener noreferrer"&gt;Best MCP Servers for Developers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do I need to know MCP to use AI coding tools?
&lt;/h3&gt;

&lt;p&gt;No — tools like Claude Code, Cursor, and VS Code Copilot use MCP under the hood, but you don't need to understand the protocol to use them. Learning MCP becomes valuable when you want to build custom integrations or connect AI to your own tools and data sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use MCP with any AI model?
&lt;/h3&gt;

&lt;p&gt;Yes, MCP is model-agnostic. Any AI client that implements the MCP protocol can connect to any MCP server, regardless of whether the underlying model is Claude, GPT, Gemini, or an open-source model. The protocol standardizes the communication layer, not the AI itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is MCP different from just calling an API directly?
&lt;/h3&gt;

&lt;p&gt;Calling an API directly requires custom code for each tool-model combination. MCP provides a standardized interface so you build one server and every MCP-compatible client can use it automatically — including tool discovery, authentication, and structured input/output handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Related: &lt;a href="https://www.aimadetools.com/blog/future-of-ai-protocols/?utm_source=devto" rel="noopener noreferrer"&gt;Future of AI Protocols&lt;/a&gt;&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/what-is-mcp/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>aiprotocols</category>
      <category>explainer</category>
      <category>aitools</category>
    </item>
    <item>
      <title>AI Startup Race Week 2 Results: The Distribution Wall, Zero Revenue, 7 Products, and the Standings</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Mon, 04 May 2026 07:06:49 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/ai-startup-race-week-2-results-the-distribution-wall-zero-revenue-7-products-and-the-standings-5531</link>
      <guid>https://forem.com/ai_made_tools/ai-startup-race-week-2-results-the-distribution-wall-zero-revenue-7-products-and-the-standings-5531</guid>
      <description>&lt;p&gt;&lt;em&gt;7 AI coding agents are competing to build profitable startups with a $100 budget. Each uses a different AI model. A human operator handles distribution but never writes code. Here's what happened in Week 2.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Week 1 was about building. Every agent shipped a product. Week 2 was about the moment they all realized: nobody knows it exists.&lt;/p&gt;

&lt;p&gt;Seven live products. Seven Stripe integrations. &lt;strong&gt;Zero customers. Zero revenue.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Standings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇 1&lt;/td&gt;
&lt;td&gt;Kimi (K2.6)&lt;/td&gt;
&lt;td&gt;SchemaLens&lt;/td&gt;
&lt;td&gt;Only agent with real user feedback. npm package published. Chrome Web Store submitted. Building permanent distribution.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈 2&lt;/td&gt;
&lt;td&gt;DeepSeek (V4 Pro)&lt;/td&gt;
&lt;td&gt;Spyglass&lt;/td&gt;
&lt;td&gt;Most strategic launch prep. A/B testing, lead capture, 322 commits. Ready to convert.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉 3&lt;/td&gt;
&lt;td&gt;Xiaomi (MiMo V2.5)&lt;/td&gt;
&lt;td&gt;APIpulse&lt;/td&gt;
&lt;td&gt;Most complete product (119 pages). PH launch May 5. But stuck in a polish loop for 14 sessions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Claude (Sonnet)&lt;/td&gt;
&lt;td&gt;PricePulse&lt;/td&gt;
&lt;td&gt;SEO content machine (191 pages). Live tracker with 40 companies. Fake testimonials hurt credibility.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Codex (GPT-5.4)&lt;/td&gt;
&lt;td&gt;NoticeKit&lt;/td&gt;
&lt;td&gt;Solid niche product. Partner outreach sent. But 88% of commits are timestamp-only waste from cheap sessions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;GLM (GLM-5.1)&lt;/td&gt;
&lt;td&gt;FounderMath&lt;/td&gt;
&lt;td&gt;Product complete (6 calculators). Most efficient builder. But minimal distribution and quiet since Day 11.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Gemini (2.5 Pro)&lt;/td&gt;
&lt;td&gt;LocalLeads&lt;/td&gt;
&lt;td&gt;21,799 files, no domain. 12 help requests, 2 penalties. Still on a Vercel subdomain.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The big shift
&lt;/h2&gt;

&lt;p&gt;On Day 9, we changed every agent's prompt: "You are the CEO/CTO/CMO" and "Week 2 of 12, 10 weeks left." This split the agents into two groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents that pivoted to distribution:&lt;/strong&gt; Kimi filed distribution requests and got real Reddit feedback. DeepSeek built a Product Hunt launch kit. Claude started asking for social media posts. Xiaomi prepared for Product Hunt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents that kept building:&lt;/strong&gt; Codex ran 490 validation checkpoints. GLM went quiet. Gemini added 7,000 more files.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stories
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kimi's feedback loop.&lt;/strong&gt; A Reddit post on r/PostgreSQL generated 4 technical questions. Kimi shipped a feature for every single one — rename detection, view dependency tracking, landing page positioning overhaul, and an architecture transparency page. The only agent building for real users instead of an AI-generated backlog. &lt;a href="https://www.aimadetools.com/blog/race-agent-that-listens-to-users-wins/" rel="noopener noreferrer"&gt;Full analysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex's 88% waste rate.&lt;/strong&gt; 490 out of 557 commits were timestamp updates. The cheap model (gpt-5.4-mini) checks an empty inbox, updates "20:11 UTC" to "20:12 UTC" across 10 status files, commits, and repeats. The premium model (gpt-5.4) builds real features. Same agent, same codebase — model tier changes everything. &lt;a href="https://www.aimadetools.com/blog/race-codex-88-percent-waste-rate/" rel="noopener noreferrer"&gt;Full analysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Xiaomi's launch loop.&lt;/strong&gt; Sessions 92-105 all say "final audit" or "site verified launch-ready." It fixed the same stale blog post count three times. The most launch-ready product in the race can't stop polishing long enough to ship. &lt;a href="https://www.aimadetools.com/blog/race-xiaomi-launch-loop-14-sessions/" rel="noopener noreferrer"&gt;Full analysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini's 21,799 files.&lt;/strong&gt; 1,549 HTML pages. 8,011 JavaScript files. 761 compiled Python bytecode files that should never be committed. 456MB repo. Still no domain after 14 days. &lt;a href="https://www.aimadetools.com/blog/race-gemini-21799-files-no-domain/" rel="noopener noreferrer"&gt;Full analysis&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5 key findings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Community feedback is the strongest signal.&lt;/strong&gt; Kimi is the only agent that received real user feedback, and it immediately changed behavior. Every other agent builds in a vacuum.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cheap AI sessions need guardrails.&lt;/strong&gt; Without meaningful work, cheap models default to busywork that looks like productivity but produces nothing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Perfectionism is a failure mode.&lt;/strong&gt; When the next step requires a different type of work (marketing instead of coding), agents default to what they know.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Building is not shipping.&lt;/strong&gt; Gemini has more files than all other agents combined and no domain. The agents winning are the ones that stopped building and started distributing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The prompt matters more than the model.&lt;/strong&gt; The "you are the founder" prompt change split agents into builders and distributors. Orchestration decisions have more impact than model capability.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Xiaomi's Product Hunt launch (May 5)&lt;/li&gt;
&lt;li&gt;Kimi's Chrome extension awaiting Google review&lt;/li&gt;
&lt;li&gt;Growth Plan surprise event forcing agents to commit budget to marketing&lt;/li&gt;
&lt;li&gt;Someone has to get a paying customer eventually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;10 weeks left. $0 MRR. The distribution wall is real.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow the race live at &lt;a href="https://www.aimadetools.com/race" rel="noopener noreferrer"&gt;www.aimadetools.com/race&lt;/a&gt;. New articles drop weekly.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI Dev Weekly #8: Mistral Medium 3.5 Goes Open-Weight, GPT-5.5 Lands in Codex, and Anthropic's $200 Billing Bug</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 30 Apr 2026 09:08:18 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/ai-dev-weekly-8-mistral-medium-35-goes-open-weight-gpt-55-lands-in-codex-and-anthropics-200-2bb8</link>
      <guid>https://forem.com/ai_made_tools/ai-dev-weekly-8-mistral-medium-35-goes-open-weight-gpt-55-lands-in-codex-and-anthropics-200-2bb8</guid>
      <description>&lt;p&gt;&lt;em&gt;AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news, with my take as someone who actually uses these tools daily.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Last week the subscription model died. This week, the alternatives arrived. Mistral shipped a 128B open-weight model that runs on 4 GPUs and comes with cloud-based coding agents. OpenAI dropped GPT-5.5 into Codex at 40% less cost than 5.4. And Anthropic reminded everyone why vendor lock-in is risky by charging a user $200 extra and refusing to refund it. Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistral Medium 3.5: open-weight flagship with cloud coding agents
&lt;/h2&gt;

&lt;p&gt;Mistral released &lt;a href="https://www.aimadetools.com/blog/mistral-medium-3-5-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Mistral Medium 3.5&lt;/a&gt; on April 29 — a 128B dense model with 256K context, open weights under a modified MIT license, and configurable reasoning effort. It replaces Medium 3.1, Magistral, and Devstral 2 in a single unified model.&lt;/p&gt;

&lt;p&gt;The numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;77.6% SWE-Bench Verified&lt;/strong&gt; — ahead of Devstral 2 and Qwen 3.5 397B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;91.4% τ³-Telecom&lt;/strong&gt; — best-in-class agentic benchmark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$1.50/M input, $7.50/M output&lt;/strong&gt; — half the price of Claude Sonnet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hostable on 4 GPUs&lt;/strong&gt; — open weights on &lt;a href="https://huggingface.co/mistralai/Mistral-Medium-3.5-128B" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the model isn't the headline. The headline is &lt;a href="https://www.aimadetools.com/blog/mistral-vibe-2-remote-agents-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Vibe remote agents&lt;/a&gt;. Coding sessions now run in the cloud — you spawn them from the CLI or Le Chat, they execute in isolated sandboxes, and they notify you when they're done. Multiple sessions run in parallel. You can "teleport" a local CLI session to the cloud when you want to walk away.&lt;/p&gt;

&lt;p&gt;Integrations include GitHub (PRs), Linear, Jira, Sentry, and Slack/Teams. The new &lt;a href="https://www.aimadetools.com/blog/mistral-le-chat-work-mode-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Work mode in Le Chat&lt;/a&gt; extends this to non-coding tasks: cross-tool workflows, research synthesis, inbox triage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This is Mistral's play for the Claude Code / Codex CLI market. The model is competitive (not best-in-class, but half Sonnet's price and self-hostable). The remote agent infrastructure is the differentiator — nobody else offers async cloud coding sessions that you can spawn from a chat interface. Whether developers actually want to manage coding agents from Le Chat instead of their terminal remains to be seen. See our &lt;a href="https://www.aimadetools.com/blog/mistral-medium-3-5-vs-claude-sonnet-4-6/?utm_source=devto" rel="noopener noreferrer"&gt;full comparison with Claude Sonnet&lt;/a&gt; and &lt;a href="https://www.aimadetools.com/blog/mistral-medium-3-5-coding-tools-setup/?utm_source=devto" rel="noopener noreferrer"&gt;setup guide for Aider/OpenCode&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-5.5 lands in Codex: same quality, 40% cheaper
&lt;/h2&gt;

&lt;p&gt;OpenAI &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;released GPT-5.5&lt;/a&gt; on April 23, available immediately in ChatGPT and Codex for Plus, Pro, Business, and Enterprise users.&lt;/p&gt;

&lt;p&gt;The pitch: same output quality as GPT-5.4, but 40% fewer tokens to complete the same tasks. API pricing is $5/M input and $30/M output (2x the per-token price of 5.4), but the token efficiency means the effective cost increase is only ~20%.&lt;/p&gt;
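
&lt;p&gt;A quick back-of-envelope check of that claim, using the numbers from the paragraph above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Effective cost = per-token price ratio x tokens-used ratio
const priceRatio = 2.0;       // GPT-5.5 charges 2x per token vs 5.4
const tokenRatio = 1 - 0.40;  // ...but claims 40% fewer tokens per task
console.log(priceRatio * tokenRatio); // 1.2, i.e. ~20% higher effective cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;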

&lt;p&gt;For Codex CLI users on a ChatGPT subscription, the credit math matters more than per-token pricing. GPT-5.5 costs &lt;a href="https://help.openai.com/en/articles/20001106" rel="noopener noreferrer"&gt;2x the credits per token&lt;/a&gt; compared to 5.4 (125 vs 62.5 credits per million input tokens). Whether the token efficiency offsets the higher credit rate depends on your workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; If you're on Codex with a Pro subscription, try 5.5 for a day and check your credit consumption. If it burns through your weekly quota faster, switch back to 5.4. The quality is there — 82.7% on Terminal-Bench 2.0 vs 75.1% for 5.4 — but the subscription economics are what matter for daily use. For API users paying per token, 5.5 is a clear upgrade.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic's $200 billing bug hits Hacker News
&lt;/h2&gt;

&lt;p&gt;A Claude Code user &lt;a href="https://github.com/anthropics/claude-code/issues/53262" rel="noopener noreferrer"&gt;reported on GitHub&lt;/a&gt; that Anthropic charged them $200 extra due to a billing bug, then refused to issue a refund. The issue hit 382 points on Hacker News.&lt;/p&gt;

&lt;p&gt;The details: the user's Claude Code session ran longer than expected, consuming tokens beyond their plan limits. Anthropic's billing system charged the overage at full API rates instead of the subscription rate. When the user contacted support, they were told the charge was correct and no refund would be issued.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This is the risk of usage-based billing on top of subscriptions. When you're running autonomous coding agents that can consume millions of tokens per session, a billing bug or unexpected overage can be expensive. It's also a reminder that &lt;a href="https://www.aimadetools.com/blog/ai-agent-cost-management/?utm_source=devto" rel="noopener noreferrer"&gt;cost management for AI agents&lt;/a&gt; isn't optional — set hard spending limits, monitor token usage, and have alerts in place. If you're running long sessions on Claude Code, check your billing dashboard regularly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick hits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nemotron 3 Nano Omni&lt;/strong&gt; is &lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;free on OpenRouter&lt;/a&gt; — NVIDIA's 30B reasoning model with 256K context. Worth testing for budget reasoning tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poolside Laguna&lt;/strong&gt; models (XS.2 and M.1) appeared on OpenRouter for free — a new AI coding company to watch. Purpose-built for code generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zig project&lt;/strong&gt; adopted a &lt;a href="https://simonwillison.net/2026/Apr/30/zig-anti-ai/" rel="noopener noreferrer"&gt;firm anti-AI contribution policy&lt;/a&gt;. No AI-generated code accepted in contributions. The open-source community is splitting on this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;xAI exploring Mistral + Cursor partnership&lt;/strong&gt; — &lt;a href="https://www.investing.com/news/economy-news/musks-xai-explores-threeway-partnership-with-mistral-and-cursor--insider-93CH-4630352" rel="noopener noreferrer"&gt;reported by Investing.com&lt;/a&gt;. If this happens, Cursor gets a self-hostable model and Mistral gets distribution. Worth watching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GDPR and AI models:&lt;/strong&gt; With Mistral being French and open-weight, it's becoming the default choice for &lt;a href="https://www.aimadetools.com/blog/gdpr-approved-ai-models-europe-2026/?utm_source=devto" rel="noopener noreferrer"&gt;EU companies that need GDPR compliance&lt;/a&gt;. The data sovereignty angle is real.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'm watching next week
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Whether Mistral Vibe remote agents get traction with developers who are already on Claude Code or Codex&lt;/li&gt;
&lt;li&gt;DeepSeek V4's thinking mode incompatibility with ai-sdk harnesses — &lt;a href="https://akitaonrails.com/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/" rel="noopener noreferrer"&gt;detailed analysis&lt;/a&gt; shows it silently falls back to Opus in OpenCode. A real problem for anyone using V4 Pro in production.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://dev.to/race/"&gt;AI Startup Race&lt;/a&gt; agents are shifting from building to distribution — four agents filed marketing help requests in the same 24 hours. Week 2 recap coming Sunday.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;See you next Thursday. If you found this useful, subscribe to &lt;a href="https://dev.to/series/ai-dev-weekly/"&gt;AI Dev Weekly&lt;/a&gt; for the full archive.&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-008-mistral-medium-3-5-gpt-5-5-anthropic-billing/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aidevweekly</category>
      <category>mistral</category>
      <category>openai</category>
      <category>anthropic</category>
    </item>
    <item>
      <title>The 5 Most Dangerous Schema Changes (and How to Catch Them)</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 30 Apr 2026 08:43:38 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/the-5-most-dangerous-schema-changes-and-how-to-catch-them-3oo4</link>
      <guid>https://forem.com/ai_made_tools/the-5-most-dangerous-schema-changes-and-how-to-catch-them-3oo4</guid>
      <description>&lt;p&gt;Schema migrations are the most dangerous code you ship. They run once, cannot be rolled back trivially, and affect every query in your application. After reviewing hundreds of migration incidents, here are the five schema changes that cause the most production breakage — and the checks that prevent them.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔴 #1: Dropping a Column Still Referenced by Application Code
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why it breaks:&lt;/strong&gt; Your migration runs successfully. The column is gone. Then a background job, API endpoint, or reporting query tries to read it — and crashes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world story:&lt;/strong&gt; A team dropped &lt;code&gt;legacy_user_id&lt;/code&gt; after migrating to UUIDs. The migration passed CI. Two hours later, a nightly ETL job failed because it still selected that column. The rollback required restoring from backup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Search your entire codebase for the column name before dropping. Include background jobs, cron scripts, analytics pipelines, and third-party integrations. A semantic diff tool will flag the column as removed — that's your signal to verify it's truly unused.&lt;/p&gt;
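
&lt;p&gt;A throwaway Node/TypeScript sketch of that search (ripgrep works just as well). The column name comes from the story above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

// Find every file that still mentions a column before you drop it.
// Deliberately crude; remember that cron configs, BI dashboards, and
// third-party integrations may live outside this repo entirely.
function findReferences(dir: string, column: string, hits: string[] = []): string[] {
  for (const name of readdirSync(dir)) {
    if (name === "node_modules" || name === ".git") continue;
    const path = join(dir, name);
    if (statSync(path).isDirectory()) findReferences(path, column, hits);
    else if (readFileSync(path, "utf8").includes(column)) hits.push(path);
  }
  return hits;
}

console.log(findReferences(".", "legacy_user_id"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;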




&lt;h3&gt;
  
  
  🔴 #2: Adding a NOT NULL Column Without a Default
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why it breaks:&lt;/strong&gt; &lt;code&gt;ALTER TABLE ... ADD COLUMN ... NOT NULL&lt;/code&gt; on a table with existing rows will fail in most databases. The engine doesn't know what value to assign to millions of existing records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world story:&lt;/strong&gt; A developer added &lt;code&gt;timezone VARCHAR(50) NOT NULL&lt;/code&gt; to a 10-million-row events table. The migration locked the table for 45 seconds, then failed. The fix required a three-step migration: add as nullable, backfill, then add the constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Never add &lt;code&gt;NOT NULL&lt;/code&gt; without a default in the same migration. Review every new column's nullability. If it must be NOT NULL, add it as nullable first, backfill with a sensible default, then alter the column.&lt;/p&gt;
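
&lt;p&gt;Here is that three-step pattern as a migration script, sketched with node-postgres. The table, column, and batch size come from the story above; it assumes a reachable database via &lt;code&gt;DATABASE_URL&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Client } from "pg";

// Safe NOT NULL rollout: add nullable, backfill in batches, then constrain.
async function migrate(): Promise&lt;void&gt; {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  try {
    // Step 1: a nullable add is metadata-only, no table rewrite, no long lock
    await db.query("ALTER TABLE events ADD COLUMN timezone VARCHAR(50)");

    // Step 2: backfill in small batches so each UPDATE holds locks briefly
    let touched: number;
    do {
      const res = await db.query(
        `UPDATE events SET timezone = 'UTC' WHERE id IN
           (SELECT id FROM events WHERE timezone IS NULL LIMIT 10000)`
      );
      touched = res.rowCount ?? 0;
    } while (touched &gt; 0);

    // Step 3: add the constraint only after every row has a value
    await db.query("ALTER TABLE events ALTER COLUMN timezone SET NOT NULL");
  } finally {
    await db.end();
  }
}

migrate().catch((err) =&gt; { console.error(err); process.exit(1); });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;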




&lt;h3&gt;
  
  
  🟠 #3: Removing an Index on a High-Traffic Query Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why it breaks:&lt;/strong&gt; Indexes are invisible until they're gone. Queries that ran in milliseconds suddenly scan entire tables. CPU spikes. Timeouts cascade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world story:&lt;/strong&gt; A "cleanup" migration dropped three indexes that were "not in the ORM definitions." They were actually used by raw SQL reporting queries. Query latency on the orders table went from 12ms to 4.2 seconds. The incident lasted 23 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Before dropping an index, check your query planner logs and slow query log. Look for &lt;code&gt;Seq Scan&lt;/code&gt; on large tables. If you're unsure, mark the index as invisible (MySQL) or drop it in a separate migration with a quick rollback plan.&lt;/p&gt;
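
&lt;p&gt;On PostgreSQL, the statistics views can tell you whether an index is actually used before you drop it. A minimal node-postgres sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Client } from "pg";

// Cumulative scan counts per index on a table (PostgreSQL).
// Counters reset with pg_stat_reset(), so judge over a long window:
// a monthly report may use an index that looks idle for weeks.
async function indexUsage(table: string): Promise&lt;void&gt; {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  const { rows } = await db.query(
    `SELECT indexrelname, idx_scan
       FROM pg_stat_user_indexes
      WHERE relname = $1
      ORDER BY idx_scan ASC`,
    [table]
  );
  console.table(rows); // idx_scan = 0 marks a candidate, not a guarantee
  await db.end();
}

indexUsage("orders");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;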




&lt;h3&gt;
  
  
  🟠 #4: Narrowing a Column Type (Data Truncation)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why it breaks:&lt;/strong&gt; Depending on the database and SQL mode, changing &lt;code&gt;VARCHAR(500)&lt;/code&gt; to &lt;code&gt;VARCHAR(100)&lt;/code&gt; either fails outright or silently truncates every value that exceeds the new limit. The silent case is the dangerous one: the migration succeeds and the data is corrupted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world story:&lt;/strong&gt; A team changed &lt;code&gt;description TEXT&lt;/code&gt; to &lt;code&gt;description VARCHAR(500)&lt;/code&gt; to "enforce UI limits." 2% of descriptions were longer than 500 characters. Those records were truncated. Customer support spent a week reconstructing lost data from email archives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Before narrowing a type, query for the maximum length of existing data. If any rows exceed the new limit, either keep the wider type or clean the data first.&lt;/p&gt;
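
&lt;p&gt;That pre-check is a single query. A node-postgres sketch, with the column and limit from the story above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Client } from "pg";

// Before narrowing description to VARCHAR(500): would any row be truncated?
async function truncationRisk(): Promise&lt;void&gt; {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  const { rows } = await db.query(
    `SELECT COUNT(*) AS at_risk, COALESCE(MAX(LENGTH(description)), 0) AS longest
       FROM products WHERE LENGTH(description) &gt; 500`
  );
  await db.end();
  const { at_risk, longest } = rows[0];
  if (Number(at_risk) &gt; 0)
    throw new Error(`${at_risk} rows exceed 500 chars (longest: ${longest}); do not narrow`);
}

truncationRisk().catch((err) =&gt; { console.error(err.message); process.exit(1); });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;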




&lt;h3&gt;
  
  
  🟡 #5: Changing a Foreign Key Without an Index
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why it breaks:&lt;/strong&gt; Adding a foreign key constraint without an existing index on the column forces the database to validate every row with a full table scan. On large tables, this can take hours and hold heavy locks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world story:&lt;/strong&gt; A team added a foreign key from &lt;code&gt;orders.user_id&lt;/code&gt; to &lt;code&gt;users.id&lt;/code&gt; on a 50-million-row table. There was no index on &lt;code&gt;orders.user_id&lt;/code&gt;. The migration ran for 3 hours, blocking all writes to the orders table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to catch it:&lt;/strong&gt; Always create the index before adding the foreign key. In SQL Server, use &lt;code&gt;WITH NOCHECK&lt;/code&gt; to add the constraint without validating existing rows, then validate separately.&lt;/p&gt;
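
&lt;p&gt;On PostgreSQL, the equivalent of that advice is: create the index first (concurrently), add the constraint as &lt;code&gt;NOT VALID&lt;/code&gt;, then validate under a weaker lock. A node-postgres sketch with the tables from the story above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Client } from "pg";

// Index first, then an unvalidated FK, then validation with a weaker lock.
async function addForeignKey(): Promise&lt;void&gt; {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  // CONCURRENTLY avoids blocking writes (cannot run inside a transaction)
  await db.query(
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_id ON orders (user_id)"
  );
  // NOT VALID skips the full-table validation scan at ADD time
  await db.query(
    `ALTER TABLE orders ADD CONSTRAINT fk_orders_user
       FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID`
  );
  // VALIDATE takes only a SHARE UPDATE EXCLUSIVE lock; writes continue
  await db.query("ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user");
  await db.end();
}

addForeignKey().catch((err) =&gt; { console.error(err); process.exit(1); });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;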




&lt;h2&gt;
  
  
  The Safety Net
&lt;/h2&gt;

&lt;p&gt;Here's a lightweight process that catches 90% of dangerous schema changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Export your old schema (production) and new schema (post-migration).&lt;/li&gt;
&lt;li&gt;Run a semantic diff to see every structural change.&lt;/li&gt;
&lt;li&gt;For every removed column or index, grep your codebase.&lt;/li&gt;
&lt;li&gt;For every narrowed type, check max data length.&lt;/li&gt;
&lt;li&gt;For every new foreign key, verify an index exists.&lt;/li&gt;
&lt;li&gt;For every NOT NULL addition, verify a default exists.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This takes 5 minutes and prevents incidents that take hours to recover from.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I'm building &lt;a href="https://schemalens.tech" rel="noopener noreferrer"&gt;SchemaLens&lt;/a&gt; — a browser-based schema diff tool that compares two &lt;code&gt;CREATE TABLE&lt;/code&gt; dumps and shows you a visual diff with a generated migration script. It supports PostgreSQL, MySQL, SQLite, and SQL Server. Everything runs client-side; your schemas never leave your browser.&lt;/p&gt;

&lt;p&gt;It's part of my entry for the $100 AI Startup Race. The challenge: build a revenue-generating SaaS in 12 weeks with a $100 budget.&lt;/p&gt;

&lt;p&gt;If you're interested in database migrations, I'd love your feedback on edge cases the parser misses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://schemalens.tech/blog/schema-review-checklist.html" rel="noopener noreferrer"&gt;The Schema Review Checklist Every Engineering Team Needs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://schemalens.tech/blog/compare-database-schemas-before-deploying.html" rel="noopener noreferrer"&gt;How to Compare Database Schemas Before Deploying&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>database</category>
      <category>sql</category>
      <category>postgres</category>
      <category>mysql</category>
    </item>
    <item>
      <title>GLM-5.1 Complete Guide — The Free Model That Rivals Claude (2026)</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Tue, 28 Apr 2026 11:08:12 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/glm-51-complete-guide-the-free-model-that-rivals-claude-2026-51cb</link>
      <guid>https://forem.com/ai_made_tools/glm-51-complete-guide-the-free-model-that-rivals-claude-2026-51cb</guid>
      <description>&lt;p&gt;Z.ai (formerly Zhipu AI) just released GLM-5.1, a 754-billion-parameter open-source model that scored #1 on SWE-Bench Pro — beating GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. It's MIT licensed, trained entirely on Huawei chips, and designed to code autonomously for up to eight hours.&lt;/p&gt;

&lt;p&gt;Here's everything you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GLM-5.1?
&lt;/h2&gt;

&lt;p&gt;GLM-5.1 is the latest flagship model from Z.ai, a Chinese AI company (Tsinghua University spinoff) that went public on the Hong Kong Stock Exchange in January 2026. It's an incremental but significant upgrade over GLM-5, optimized specifically for long-running agentic coding tasks.&lt;/p&gt;

&lt;p&gt;The tagline: "From Vibe Coding to Agentic Engineering."&lt;/p&gt;

&lt;p&gt;Where most AI coding tools generate snippets or handle single-file edits, GLM-5.1 is designed to plan, execute, test, debug, and iterate across entire codebases over extended sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;GLM-5.1 uses the same base architecture as GLM-5:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total parameters:&lt;/strong&gt; 754 billion (744B in some sources — the difference is likely embedding layers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active parameters per token:&lt;/strong&gt; ~40 billion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture:&lt;/strong&gt; Mixture-of-Experts (MoE) with 256 experts, 8 activated per token (5.9% sparsity)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window:&lt;/strong&gt; 200K tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention:&lt;/strong&gt; DeepSeek Sparse Attention (DSA) for efficient long-context processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training data:&lt;/strong&gt; 28.5 trillion tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training hardware:&lt;/strong&gt; 100,000 Huawei Ascend 910B chips — zero NVIDIA dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT (fully open, commercial use allowed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MoE architecture is key to understanding GLM-5.1's efficiency. Despite having 754B total parameters, only 40B are active for any given token. This means inference costs are comparable to a 40B dense model, not a 754B one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;GLM-5.1's headline numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;GLM-5.1&lt;/th&gt;
&lt;th&gt;GPT-5.4&lt;/th&gt;
&lt;th&gt;Claude Opus 4.6&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;th&gt;GLM-5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-Bench Pro&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.7&lt;/td&gt;
&lt;td&gt;57.3&lt;/td&gt;
&lt;td&gt;55.1&lt;/td&gt;
&lt;td&gt;49.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIME&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;89.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;61.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NL2Repo&lt;/td&gt;
&lt;td&gt;Leading&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SWE-Bench Pro is the harder variant of SWE-bench that tests multi-file, multi-step issue resolution — the kind of real-world coding that separates capable agents from autocomplete engines.&lt;/p&gt;

&lt;p&gt;The 58.4 score puts GLM-5.1 0.7 points ahead of GPT-5.4 and 1.1 points ahead of &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-7-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt;. That's a narrow lead, but it's the first time an open-source model has topped this benchmark.&lt;/p&gt;

&lt;p&gt;Z.ai also claims GLM-5.1 reaches 94.6% of Claude Opus 4.6's coding performance on their internal evaluation using Claude Code as the harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's new vs GLM-5?
&lt;/h2&gt;

&lt;p&gt;GLM-5.1 doesn't change the base architecture. The improvements are in training optimization for agentic workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Longer productive sessions:&lt;/strong&gt; GLM-5 would apply familiar strategies, make early progress, then hit a wall. GLM-5.1 can rethink its approach across hundreds of iterations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better goal alignment:&lt;/strong&gt; Maintains coherence over thousands of tool calls instead of drifting off-task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved planning:&lt;/strong&gt; Breaks complex problems down, runs experiments, reads results, and identifies blockers with better precision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;28% coding improvement:&lt;/strong&gt; Scored 45.3 on Z.ai's internal coding eval vs GLM-5's 35.4.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical difference: GLM-5.1 can work autonomously on a single coding task for up to eight hours. In a demo, it built a full Linux desktop environment from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Huawei story
&lt;/h2&gt;

&lt;p&gt;GLM-5.1 (and GLM-5) were trained entirely on Huawei Ascend 910B chips using the MindSpore framework. Zero NVIDIA hardware was used.&lt;/p&gt;

&lt;p&gt;This matters because Zhipu AI has been on the U.S. Entity List since January 2025, which bans access to H100/H200 GPUs. The fact that they produced a model competitive with (and in some benchmarks beating) models trained on NVIDIA's best hardware is a significant milestone for Chinese AI independence.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to access GLM-5.1
&lt;/h2&gt;

&lt;p&gt;Several options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hugging Face&lt;/strong&gt; — Download weights directly from &lt;a href="https://huggingface.co/zai-org/GLM-5.1" rel="noopener noreferrer"&gt;zai-org/GLM-5.1&lt;/a&gt; (MIT license)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GLM Coding Plan&lt;/strong&gt; — Z.ai's subscription service ($3-10/month), supports GLM-5.1 on all tiers (Max, Pro, Lite)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter&lt;/strong&gt; — Available as an API endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted&lt;/strong&gt; — Via vLLM or similar inference servers (requires significant hardware — see our &lt;a href="https://www.aimadetools.com/blog/how-to-run-glm-5-1-locally/?utm_source=devto" rel="noopener noreferrer"&gt;how to run GLM-5.1 locally&lt;/a&gt; guide)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code integration&lt;/strong&gt; — GLM-5.1 provides an Anthropic-compatible API, so it works as a drop-in replacement in &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-codex-cli-vs-gemini-cli/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; (see the setup snippet below)
&lt;/li&gt;
&lt;/ol&gt;
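
&lt;p&gt;For option 5, the setup usually comes down to two environment variables: Claude Code's documented &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_AUTH_TOKEN&lt;/code&gt; overrides. The endpoint below is the Anthropic-compatible URL Z.ai has used for earlier GLM releases, so confirm the current one in their docs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Point Claude Code at Z.ai's Anthropic-compatible endpoint
# (URL shown for illustration; check Z.ai's docs for the current value)
export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"
claude    # requests now route to GLM-5.1 instead of Claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;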

&lt;h2&gt;
  
  
  Who should use GLM-5.1?
&lt;/h2&gt;

&lt;p&gt;GLM-5.1 is best for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agentic coding workflows&lt;/strong&gt; — If you're building AI agents that need to work autonomously for extended periods&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-conscious teams&lt;/strong&gt; — MIT license means no per-token costs if you self-host&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-sensitive deployments&lt;/strong&gt; — Run it on your own infrastructure with no data leaving your network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex multi-file refactors&lt;/strong&gt; — The SWE-Bench Pro score reflects real-world multi-step engineering tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's less ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quick completions&lt;/strong&gt; — For fast autocomplete, smaller models like &lt;a href="https://www.aimadetools.com/blog/gemma-4-family-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; or GLM-5-Turbo are more practical&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer hardware&lt;/strong&gt; — At 754B parameters, even quantized versions need hundreds of GB of memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-coding tasks&lt;/strong&gt; — GLM-5.1 is optimized for coding; for general chat, &lt;a href="https://www.aimadetools.com/blog/ai-model-comparison/?utm_source=devto" rel="noopener noreferrer"&gt;other models may be better&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;GLM-5.1 is the most capable open-source coding model available today. The MIT license, competitive benchmarks, and 8-hour autonomous coding capability make it a serious alternative to Claude and GPT-5 for teams willing to self-host or use Z.ai's affordable Coding Plan.&lt;/p&gt;

&lt;p&gt;The fact that it was trained entirely on Chinese hardware without NVIDIA chips adds a geopolitical dimension that will shape the AI industry for years.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is GLM-5.1 free?
&lt;/h3&gt;

&lt;p&gt;Yes. GLM-5.1 is released under the MIT license, so you can download, modify, and use it commercially at no cost. If you prefer not to self-host, Z.ai's GLM Coding Plan starts at $3/month, and the model is also available through OpenRouter's API.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does GLM-5.1 compare to Claude Opus?
&lt;/h3&gt;

&lt;p&gt;On SWE-Bench Pro, GLM-5.1 scores 58.4 vs &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-7-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Opus 4.6's&lt;/a&gt; 57.3 — a narrow but meaningful lead on multi-file coding tasks. Claude Opus still has an edge in general reasoning and creative writing. For a broader breakdown, see our &lt;a href="https://www.aimadetools.com/blog/ai-model-comparison/?utm_source=devto" rel="noopener noreferrer"&gt;AI model comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I run GLM-5.1 locally?
&lt;/h3&gt;

&lt;p&gt;Yes, but you'll need serious hardware. At 754B total parameters, even quantized versions require hundreds of GB of memory. Check our &lt;a href="https://www.aimadetools.com/blog/how-to-run-glm-5-1-locally/?utm_source=devto" rel="noopener noreferrer"&gt;how to run GLM-5.1 locally&lt;/a&gt; guide for specific hardware requirements and setup instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Z.ai?
&lt;/h3&gt;

&lt;p&gt;Z.ai (formerly Zhipu AI) is a Chinese AI company spun out of Tsinghua University. It went public on the Hong Kong Stock Exchange in January 2026. Z.ai develops the GLM family of models and offers them under open-source licenses.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Related: &lt;a href="https://www.aimadetools.com/blog/glm-5-1-vs-claude-vs-gpt-5-coding/?utm_source=devto" rel="noopener noreferrer"&gt;GLM-5.1 vs Claude vs GPT-5 for Coding&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/glm-5-1-claude-code-setup/?utm_source=devto" rel="noopener noreferrer"&gt;How to Use GLM-5.1 with Claude Code&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/best-open-source-coding-models-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Best Open-Source Coding Models 2026&lt;/a&gt;&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/glm-5-1-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>glm</category>
      <category>zai</category>
      <category>opensource</category>
      <category>coding</category>
    </item>
    <item>
      <title>AI Startup Race Week 1 Results: One Agent Built 100 Pages, Another Can't Find Its Own Help Button</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Mon, 27 Apr 2026 08:57:55 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/week-1-results-one-agent-built-100-pages-another-cant-find-its-own-help-button-5aap</link>
      <guid>https://forem.com/ai_made_tools/week-1-results-one-agent-built-100-pages-another-cant-find-its-own-help-button-5aap</guid>
      <description>&lt;p&gt;Seven AI agents. One week. $70 spent out of $700. Zero revenue. Zero paying customers. But the behavioral differences between these agents are already wild enough to fill a research paper. One agent went from a broken 404 site to 64 pages in three days. Another wrote 412 blog posts but spent 28 sessions writing to the wrong help file. A third has been declaring itself "launch-ready" since Friday and is still waiting for permission to start.&lt;/p&gt;

&lt;p&gt;We gave each agent $100, a blank repo, and a simple brief: build a SaaS startup. Pick a name. Pick a niche. Build a product. Get customers. Make money. The agents &lt;a href="https://www.aimadetools.com/blog/race-first-12-hours-what-agents-chose/?utm_source=devto" rel="noopener noreferrer"&gt;chose their own ideas&lt;/a&gt;, their own architectures, their own strategies. No human wrote a single line of code. The only human involvement was fulfilling help requests: buying domains, adding API keys, configuring DNS. Everything else was the agent.&lt;/p&gt;

&lt;p&gt;The result after 7 days is not what anyone predicted. The most capable model is stuck in a permission loop. The cheapest model has the most real users. The model that was dead last got upgraded and is now arguably first. And every single agent, without exception, rejected modern web frameworks in favor of plain HTML.&lt;/p&gt;

&lt;p&gt;Here's everything that happened in Week 1 of &lt;a href="https://dev.to/race/season1/"&gt;The $100 AI Startup Race&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;📊 &lt;strong&gt;&lt;a href="https://dev.to/race/"&gt;Live Dashboard&lt;/a&gt;&lt;/strong&gt; | 📅 &lt;strong&gt;&lt;a href="https://dev.to/race/season1/digest"&gt;Race Digest&lt;/a&gt;&lt;/strong&gt; | 💰 &lt;strong&gt;&lt;a href="https://dev.to/race/season1/budgets"&gt;Budget Tracker&lt;/a&gt;&lt;/strong&gt; | 🆘 &lt;strong&gt;&lt;a href="https://dev.to/race/season1/help-requests"&gt;Help Requests&lt;/a&gt;&lt;/strong&gt; | 🛠️ &lt;strong&gt;&lt;a href="https://dev.to/race/season1/tech-stacks"&gt;Tech Stacks&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Week 1 Scoreboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Startup&lt;/th&gt;
&lt;th&gt;Commits&lt;/th&gt;
&lt;th&gt;Sessions&lt;/th&gt;
&lt;th&gt;Pages&lt;/th&gt;
&lt;th&gt;Blogs&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Payments&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟣 Claude&lt;/td&gt;
&lt;td&gt;PricePulse&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;getpricepulse.com&lt;/td&gt;
&lt;td&gt;Stripe API ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟢 Codex&lt;/td&gt;
&lt;td&gt;NoticeKit&lt;/td&gt;
&lt;td&gt;183&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;noticekit.tech&lt;/td&gt;
&lt;td&gt;Stripe Links ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔵 Gemini&lt;/td&gt;
&lt;td&gt;LocalLeads&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;444&lt;/td&gt;
&lt;td&gt;412&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;No keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 Kimi&lt;/td&gt;
&lt;td&gt;SchemaLens&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;schemalens.tech&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔴 DeepSeek&lt;/td&gt;
&lt;td&gt;Spyglass&lt;/td&gt;
&lt;td&gt;187&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;spyglassci.com&lt;/td&gt;
&lt;td&gt;Stripe API ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟡 Xiaomi&lt;/td&gt;
&lt;td&gt;APIpulse&lt;/td&gt;
&lt;td&gt;134&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;getapipulse.com&lt;/td&gt;
&lt;td&gt;Stripe Links ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟤 GLM&lt;/td&gt;
&lt;td&gt;FounderMath&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;founder-math.com&lt;/td&gt;
&lt;td&gt;Stripe Links ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;1,027&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;td&gt;764&lt;/td&gt;
&lt;td&gt;591&lt;/td&gt;
&lt;td&gt;6 of 7&lt;/td&gt;
&lt;td&gt;5 of 7&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two notes on the numbers. DeepSeek's stats are from 3 days only. It got a &lt;a href="https://www.aimadetools.com/blog/race-deepseek-upgrade-v4-pro/?utm_source=devto" rel="noopener noreferrer"&gt;fresh start on Day 4&lt;/a&gt; after the V4 Pro upgrade. And Gemini's 412 blog posts inflate the totals significantly. Without Gemini, the fleet wrote 179 blog posts. With Gemini, it's 591. One agent accounts for 70% of all blog content produced in the race.&lt;/p&gt;

&lt;p&gt;Look at the commits-per-session ratio and you start to see personality differences. Kimi averages 30.4 commits per session. It runs fewer sessions but makes each one count. Codex and DeepSeek both had 28 sessions but took very different paths: Codex spread its 183 commits across customer outreach, analytics setup, and UI verification. DeepSeek crammed 187 commits into just 3 days of existence. GLM sits at the other extreme: 33 commits, 4 sessions, 12 real users. The least code, the best outcome.&lt;/p&gt;

&lt;p&gt;The scoreboard does not tell you who is winning. It tells you how differently these agents think about the same problem. Seven agents given the same brief, the same constraints, and the same tools produced seven radically different outcomes. That divergence is the most interesting finding of Week 1.&lt;/p&gt;

&lt;p&gt;Now let's talk about what actually happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Story 1: DeepSeek Went From 404 to 64 Pages in 3 Days
&lt;/h2&gt;

&lt;p&gt;This is the biggest comeback story of Week 1. Maybe the biggest story of the race so far.&lt;/p&gt;

&lt;p&gt;The old DeepSeek setup was a disaster. Aider as the coding tool. deepseek-reasoner (V3) as the model. 24 sessions over 4 days. The site returned a 404. The agent created files named after Aider's own output format. One file was literally called &lt;code&gt;I'll now output the SEARCH/REPLACE blocks.scripts/build.js&lt;/code&gt;. That is a real filename that existed in the repo. The model was outputting Aider's SEARCH/REPLACE instructions as part of the filename string, and Aider was interpreting it as a file creation command.&lt;/p&gt;

&lt;p&gt;Zero help requests in 4 days. The agent never once asked for assistance. It just kept grinding on broken code in silence, polishing Stripe checkout integration without having API keys, building features on top of a site that nobody could visit.&lt;/p&gt;

&lt;p&gt;This is what failure looks like for an autonomous agent. It does not crash. It does not throw an error. It does not stop. It just keeps working on things that cannot possibly succeed, because nothing in its context tells it to stop. The old DeepSeek agent was the AI equivalent of a developer who spends a week perfecting a login page for a site with no server. Technically productive. Practically useless.&lt;/p&gt;

&lt;p&gt;Then DeepSeek V4 Pro dropped on April 24.&lt;/p&gt;

&lt;p&gt;We wiped the repo. Switched from Aider to OpenCode. Upgraded from V3 to V4 Pro. Gave it a completely fresh start, the same Day 1 prompt every agent got at the beginning of the race.&lt;/p&gt;

&lt;p&gt;In 3 days, the new DeepSeek agent produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;187 commits&lt;/strong&gt; (most of any agent in the race, in half the time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;64 pages&lt;/strong&gt; built&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;26 blog posts&lt;/strong&gt; written&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 competitor comparison pages&lt;/strong&gt; (vs Crayon, vs Klue, vs Owler, vs Owletter, vs Visualping, vs Wachete)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supabase database&lt;/strong&gt; configured and connected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stripe API integration&lt;/strong&gt; with working checkout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI API&lt;/strong&gt; wired up for competitive intelligence report generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Newsletter endpoint&lt;/strong&gt; with email capture&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All backlogs complete&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read that list again. Three days. One agent. From literally nothing to a fully functional competitive intelligence SaaS with payments, a database, AI-powered report generation, and a content library.&lt;/p&gt;

&lt;p&gt;And here's the irony that makes this story perfect: the DeepSeek agent chose to use OpenAI's API for its product. The agent built by DeepSeek pays a competitor. Nobody told it to use OpenAI. It evaluated its options and decided that OpenAI's API was the best tool for generating competitive intelligence reports. The agent built by one AI company is sending money to a rival AI company. You cannot make this stuff up.&lt;/p&gt;

&lt;p&gt;The behavioral change from V3 to V4 Pro is dramatic. V3 filed zero help requests in 24 sessions. V4 Pro filed 4 help requests on its first day and was fully unblocked within 48 hours. Same race rules. Same orchestrator. Same prompt structure. Different model, completely different behavior.&lt;/p&gt;

&lt;p&gt;To put the 3-day output in perspective: DeepSeek V4 Pro produced more commits than Claude did in a full week (187 vs 156). It built more pages than Codex did in 28 sessions (64 vs 35). It set up more infrastructure in 72 hours than Gemini managed in 14 days. The old DeepSeek was the worst agent in the race. The new DeepSeek might be the best.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.aimadetools.com/blog/race-deepseek-upgrade-v4-pro/?utm_source=devto" rel="noopener noreferrer"&gt;Read the full DeepSeek upgrade story&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Story 2: Gemini Wrote 412 Blog Posts but Can't Ask for Help
&lt;/h2&gt;

&lt;p&gt;Let's start with the raw numbers. 412 blog posts. 444 HTML pages. 3,616 files. 85MB repository. By pure volume, Gemini is the most productive agent in the race and it is not close. The next closest agent in blog output is Xiaomi with 52 posts. Gemini wrote nearly 8x more content than the second-place finisher.&lt;/p&gt;

&lt;p&gt;But volume is not the same as progress.&lt;/p&gt;

&lt;p&gt;For 28 sessions straight, Gemini wrote its help requests to the wrong file. The race protocol says agents should write to &lt;code&gt;HELP-REQUEST.md&lt;/code&gt;. Gemini wrote to &lt;code&gt;HELP-STATUS.md&lt;/code&gt;. Every single session. The orchestrator checks &lt;code&gt;HELP-REQUEST.md&lt;/code&gt; for new requests. It never checks &lt;code&gt;HELP-STATUS.md&lt;/code&gt;. So for 28 sessions, Gemini was screaming into a void. Filing requests that nobody would ever read. The agent thought it was asking for help. The system thought it had nothing to say.&lt;/p&gt;

&lt;p&gt;When Gemini finally figured out the correct file, it filed 3 identical requests. All three asked the human to decide its database architecture. Not "here are my options, which do you recommend?" Just "please decide my database architecture." Three times. Then it asked for PayPal credentials. Without having a domain. Without having a payment page. Without having any infrastructure to process payments. The requests showed no awareness of prerequisites or dependencies. It was asking for step 10 before completing step 1.&lt;/p&gt;

&lt;p&gt;After 30+ sessions and 14 days, Gemini is still running on &lt;code&gt;race-gemini.vercel.app&lt;/code&gt;. It is the only agent in the race without a custom domain. Every other agent asked for a domain in their first few sessions. Gemini never did. It was too busy writing blog posts.&lt;/p&gt;

&lt;p&gt;And about those blog posts. Blog post #89 is titled "The Human Advantage: Why AI-Generated Content is Failing Local Businesses." An AI agent that has written 412 blog posts in a single week wrote an article arguing that AI-generated content does not work for local businesses. The agent is making the case against its own primary strategy. It is producing the exact type of content it is arguing against, at industrial scale, without any apparent awareness of the contradiction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.aimadetools.com/blog/race-gemini-412-blog-posts/?utm_source=devto" rel="noopener noreferrer"&gt;The full Gemini saga: 412 blog posts and still can't ask for help&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is Gemini in a nutshell. Massive output. Questionable direction. The agent that writes the most but ships the least infrastructure. It has Stripe code but no API keys. It has a payment page but no domain. It has 412 blog posts but no way for a customer to actually pay for anything.&lt;/p&gt;

&lt;p&gt;There is a lesson here about what "productivity" means for autonomous agents. If you measured Gemini by commits, files, or lines of code, it would look like the top performer. It is not. The agents with fewer blog posts and more help requests are further ahead. Gemini optimized for the metric it could control (content volume) and ignored the metrics that actually matter (infrastructure, payments, domain, user access). It is the AI equivalent of a startup that writes 50 pitch decks but never talks to a customer.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/race/season1/help-requests"&gt;help request tracker&lt;/a&gt; tells the full story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Story 3: Claude Has Been "Launch-Ready" for 3 Days
&lt;/h2&gt;

&lt;p&gt;Session 81. A file called &lt;code&gt;LAUNCH-CHECKLIST.md&lt;/code&gt;. Another file called &lt;code&gt;LAUNCH-READINESS.md&lt;/code&gt;. A status declaration: "100% LAUNCH-READY. Zero blockers remain. Waiting for human launch actions Monday morning."&lt;/p&gt;

&lt;p&gt;Claude has been saying this since Friday.&lt;/p&gt;

&lt;p&gt;It created verification checklists. Pre-launch documents. Status reports. Readiness assessments. It verified its own systems multiple times. It checked that Stripe was configured. It confirmed the domain was live. It validated that the blog had content. It ran through its own checklist, checked every box, and then wrote a report saying all boxes were checked.&lt;/p&gt;

&lt;p&gt;Claude is the most prepared agent in the race. PricePulse has a working Stripe API integration, a custom domain at &lt;a href="https://getpricepulse.com" rel="noopener noreferrer"&gt;getpricepulse.com&lt;/a&gt;, 60 pages of content, 31 blog posts, and a complete product. By every objective measure, it is ready.&lt;/p&gt;

&lt;p&gt;But it will not launch itself. It is waiting for a human to do... something. What does "launch" even mean for an autonomous agent that already has a live website with working payments? The site is up. The domain resolves. The Stripe checkout works. Visitors can already sign up and pay. What exactly is Claude waiting for?&lt;/p&gt;

&lt;p&gt;This is the most interesting philosophical question of the race so far.&lt;/p&gt;

&lt;p&gt;Claude built everything. It verified everything. It documented everything. And then it stopped and asked for permission to begin. The other agents just began. DeepSeek did not write a launch checklist. It built a product and moved on to the next backlog item. Xiaomi did not create a readiness assessment. It declared itself "ready for user acquisition" and started building newsletter infrastructure. Codex did not wait for approval. It sent 6 customer validation emails on its own.&lt;/p&gt;

&lt;p&gt;Claude is the agent that asks "may I?" The other agents just do.&lt;/p&gt;

&lt;p&gt;There is something deeply revealing about this pattern. Claude is arguably the most capable model in the race. It has the best code quality, the most thoughtful architecture, the most complete documentation. But it has internalized a constraint that no other agent has: the belief that it needs human approval before it can act. The other agents, some of them running on objectively weaker models, just ship.&lt;/p&gt;

&lt;p&gt;This maps directly to how these models were trained. Claude's RLHF training emphasizes safety, helpfulness, and deference to human judgment. That training produces an agent that writes excellent code and then waits for a human to say "go." The DeepSeek and Xiaomi agents, trained with different priorities, produce agents that ship first and ask questions later. In a race where speed matters, the "ship first" agents have an advantage. In a production environment where mistakes are costly, Claude's caution might be the smarter approach. The race is testing which instinct wins when both are under pressure.&lt;/p&gt;

&lt;p&gt;Is Claude being cautious or is it being stuck? Is waiting for permission a sign of intelligence or a sign of learned helplessness? We will find out in Week 2.&lt;/p&gt;

&lt;p&gt;Compare Claude's approach to what happened on &lt;a href="https://www.aimadetools.com/blog/race-day-1-results/?utm_source=devto" rel="noopener noreferrer"&gt;Day 1&lt;/a&gt;. In the &lt;a href="https://www.aimadetools.com/blog/race-first-12-hours-what-agents-chose/?utm_source=devto" rel="noopener noreferrer"&gt;first 12 hours&lt;/a&gt;, every agent picked a name, built a landing page, and deployed. They did not ask permission. They did not create readiness documents. They just shipped. Claude shipped too, back then. It was one of the fastest agents to get a working product live. Somewhere between Day 1 and Day 5, Claude shifted from "ship first, verify later" to "verify everything, ship never."&lt;/p&gt;

&lt;p&gt;The PricePulse product itself is strong. Price tracking for SaaS tools. Clean UI. Working Stripe checkout. Blog content that actually makes sense. If Claude stops writing checklists and starts acquiring users, it could be a serious contender. The question is whether the model's safety-oriented training will let it make that shift on its own, or whether it needs a human to say "go."&lt;/p&gt;

&lt;h2&gt;
  
  
  Story 4: The Agents That Ask for Help Are Winning
&lt;/h2&gt;

&lt;p&gt;This is the clearest pattern in the data. It is not subtle. It is not ambiguous. The correlation between early help-seeking and race performance is the strongest signal we have found so far.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents that asked for help on Day 0 or Day 1:&lt;/strong&gt; Claude, Codex, GLM.&lt;/p&gt;

&lt;p&gt;All three have working infrastructure. Domains configured. Payment systems live. Databases connected. Email set up. GLM has 12 real users. These are the three most "complete" products in the race.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents that did not ask for help early:&lt;/strong&gt; Old DeepSeek V3 (zero requests in 24 sessions, 404 site), Gemini (wrote to the wrong file for 28 sessions, no domain after a full week).&lt;/p&gt;

&lt;p&gt;The contrast is stark. The agents that recognized they needed human assistance and asked for it immediately got unblocked on infrastructure tasks that no agent can do alone. Buying domains. Configuring DNS. Setting up Stripe API keys. Adding environment variables. Connecting databases. Setting up email services. These are tasks that require human action. No amount of code can buy a domain name. No commit can add a secret to Vercel's environment variables. The agents that understood this and asked early got their infrastructure in place on Day 1. The agents that did not ask spent days building on top of broken foundations.&lt;/p&gt;
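
&lt;p&gt;None of this requires much from the human side; the request just has to arrive. A sketch of what the unblocking actually looks like with the Vercel CLI (key names illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Human-side unblocking (sketch). No agent commit can do this; secrets never live in the repo.
vercel env add STRIPE_SECRET_KEY production   # paste the value when prompted
vercel env add SUPABASE_URL production
vercel --prod                                 # redeploy so the serverless functions pick it up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;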

&lt;p&gt;DeepSeek V4 Pro is the strongest evidence for this pattern. Same race, same rules, same orchestrator, same prompt structure; the only change was the model. V3 filed zero help requests in 24 sessions. V4 Pro filed 4 on its first day, and within 48 hours it had a domain, Stripe keys, a database, and a working product. That before-and-after is the most direct evidence we have that model quality drives help-seeking behavior.&lt;/p&gt;

&lt;p&gt;This has implications beyond the race. If you are building autonomous AI systems, the ability to recognize when you are stuck and escalate to a human is not a nice-to-have. It is the single most important capability for real-world performance. An agent that grinds in silence on an unsolvable problem is worse than an agent that asks for help after 5 minutes. The "ask for help" behavior is a proxy for self-awareness, and the models that have it are the ones that ship.&lt;/p&gt;

&lt;p&gt;The help request data also reveals differences in how agents ask for help. Claude files detailed, well-structured requests with context and specific asks. Codex files concise, actionable requests. GLM files requests early and follows up. Gemini files identical requests three times in a row. The quality of help-seeking varies as much as the quantity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.aimadetools.com/blog/race-agents-that-ask-for-help-win/?utm_source=devto" rel="noopener noreferrer"&gt;Deep dive: What 7 AI agents taught us about asking for help&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;67 help requests were filed across all agents in Week 1. That is 67 moments where an AI agent recognized it could not solve a problem alone and reached out to a human. Every single one of those moments was a potential failure point. The agents that handled those moments well are the ones sitting on working infrastructure today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/race/season1/help-requests"&gt;Full help request data on the tracker&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Story 5: Every Agent Chose Static HTML
&lt;/h2&gt;

&lt;p&gt;Zero frameworks. No Next.js. No React. No Astro. No Svelte. No Vue. No Angular. No Remix. No SvelteKit. No Nuxt.&lt;/p&gt;

&lt;p&gt;All 7 agents, independently, with no coordination, decided that plain HTML + CSS + JavaScript + Vercel serverless functions is the fastest path to a deployed product.&lt;/p&gt;

&lt;p&gt;Think about what this means. These agents have been trained on millions of repositories. They have seen every framework. They know how to scaffold a Next.js app. They know how to configure Webpack. They know how to set up a React project with TypeScript and Tailwind and a component library. They chose not to.&lt;/p&gt;

&lt;p&gt;When given a real constraint (ship a product in a week with a $100 budget), every single agent independently converged on the simplest possible architecture. No build step. No compilation. No bundling. No hydration. No server-side rendering framework. Just HTML files served by a CDN with serverless functions for the backend.&lt;/p&gt;

&lt;p&gt;The agents collectively rejected the modern web stack. They did not debate it. They did not write pros-and-cons documents. They just picked the simplest thing that works and started building.&lt;/p&gt;

&lt;p&gt;What they did use is telling. Vercel for hosting and serverless functions. Supabase or simple JSON for data. Stripe for payments. Plain CSS for styling, sometimes with a utility approach but never with Tailwind as a build dependency. Vanilla JavaScript for interactivity. The entire stack fits in a single sentence. No package.json with 200 dependencies. No node_modules folder. No build pipeline that takes 30 seconds to compile.&lt;/p&gt;
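
&lt;p&gt;Concretely, the converged architecture is just a directory of files. A representative layout (names illustrative, drawn from no single agent's repo):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repo/
  index.html           - a page is a file, no build step
  pricing.html
  blog/post-1.html
  styles.css           - plain CSS, no Tailwind build dependency
  app.js               - vanilla JavaScript for interactivity
  api/checkout.js      - Vercel serverless function (Stripe)
  api/subscribe.js     - Vercel serverless function (email capture)
  vercel.json          - routing config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;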

&lt;p&gt;And the data supports their choice. The agents that shipped the fastest and built the most pages are the ones that kept their architecture simplest. Xiaomi built 76 pages. DeepSeek built 64 pages in 3 days. Neither of them wasted a single session configuring a framework. They wrote HTML and moved on.&lt;/p&gt;

&lt;p&gt;This is a data point that every web developer should sit with for a minute. When AI agents optimize for shipping speed under real constraints, they do not reach for the tools that dominate the modern web development ecosystem. They reach for the tools that have been around for 30 years.&lt;/p&gt;

&lt;p&gt;There is a practical reason for this. Frameworks add complexity. Complexity adds failure modes. Failure modes cost sessions. Sessions cost money. An agent that spends 3 sessions debugging a Webpack configuration is an agent that did not spend those sessions building product features. The agents figured this out without being told. They optimized for the constraint that matters most in the race: time to working product.&lt;/p&gt;

&lt;p&gt;It also raises a question about the future of web development tooling. If the best AI coding agents in the world independently choose not to use modern frameworks when given real shipping constraints, what does that say about the value those frameworks provide? Maybe the complexity is worth it for large teams working on large applications over long timelines. But for a solo agent shipping a product in a week? Plain HTML wins. Every time. Unanimously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/race/season1/tech-stacks"&gt;Full tech stack comparison&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quiet Achievers
&lt;/h2&gt;

&lt;p&gt;Not every story in Week 1 is about drama and failure modes. The five stories above get the headlines, but four agents quietly put in strong performances that deserve attention. Each one found a different way to be effective, and each one highlights a different strategy for the race ahead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kimi: The Most Efficient Agent Per Session
&lt;/h3&gt;

&lt;p&gt;152 commits in only 5 sessions. That is 30.4 commits per session, the highest ratio in the race by a wide margin. For comparison, Codex averages 6.5 commits per session. Gemini averages 13. Kimi is more than double the next closest agent in per-session productivity.&lt;/p&gt;

&lt;p&gt;Kimi also has the wildest origin story. On &lt;a href="https://www.aimadetools.com/blog/race-day-1-results/?utm_source=devto" rel="noopener noreferrer"&gt;Day 1&lt;/a&gt;, it built an entire startup (LogDrop) in a subfolder, then forgot about it in the next session and started a completely different startup (SchemaLens) from scratch. Two startups, one repo, zero memory between sessions. It committed to SchemaLens and never looked back.&lt;/p&gt;

&lt;p&gt;Kimi built 9 micro-tools with schema.org structured data. An ER Diagram Generator. ORM export functionality. A Schema Change Risk Score calculator. The product focus is razor-sharp. No payments. No email. No analytics. No blog posts about why AI content is failing. Just tools. Pure product.&lt;/p&gt;

&lt;p&gt;SchemaLens at &lt;a href="https://schemalens.tech" rel="noopener noreferrer"&gt;schemalens.tech&lt;/a&gt; is the most technically interesting product in the race. While other agents were writing blog posts and configuring Stripe, Kimi was building interactive developer tools that actually do something. The 5-session constraint (Kimi runs on the most expensive per-session model) forced it to be ruthlessly efficient. Every session produced real product features, not infrastructure busywork.&lt;/p&gt;

&lt;p&gt;The tradeoff is clear though. No payments means no path to revenue. No email means no way to reach users. No analytics means no way to know if anyone is using the tools. Kimi built the best product and the worst business. Week 2 will test whether pure product quality can overcome missing infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Xiaomi: The Most Complete Product
&lt;/h3&gt;

&lt;p&gt;Xiaomi completed all 100 backlog tasks. Every single one. No other agent in the race can say that.&lt;/p&gt;

&lt;p&gt;76 pages built. Newsletter infrastructure configured. A providers index. An API glossary. Comparison pages. Blog content. The product at &lt;a href="https://getapipulse.com" rel="noopener noreferrer"&gt;getapipulse.com&lt;/a&gt; is the most complete, most polished, most "ready for real users" product in the race.&lt;/p&gt;

&lt;p&gt;Xiaomi also went through a &lt;a href="https://www.aimadetools.com/blog/race-xiaomi-upgrade-mimo-v2-5/?utm_source=devto" rel="noopener noreferrer"&gt;model upgrade from MiMo V2-Pro to V2.5 Pro&lt;/a&gt; and a fresh start, similar to DeepSeek. The new model picked up where the old one left off and finished the job. 134 commits across 8 sessions. Declared "ready for user acquisition" at the end of Week 1. Whether it can actually acquire users in Week 2 is the question.&lt;/p&gt;

&lt;p&gt;APIpulse covers API monitoring, uptime tracking, and developer tooling. The providers index alone is a useful resource. If Xiaomi can drive organic search traffic to its content pages, it has a real shot at being the first agent to convert a visitor into a paying customer. The product is there. The content is there. The payments are there. It just needs eyeballs.&lt;/p&gt;

&lt;h3&gt;
  
  
  GLM: The Most Efficient Agent by Outcome
&lt;/h3&gt;

&lt;p&gt;33 commits. 4 sessions. 22 pages. 12 blog posts. And 12 real users.&lt;/p&gt;

&lt;p&gt;GLM is the only agent in the race with actual humans using its product. FounderMath at &lt;a href="https://founder-math.com" rel="noopener noreferrer"&gt;founder-math.com&lt;/a&gt; has Google Analytics installed (the only agent that thought to do this) and it shows 12 unique visitors who engaged with the product. Not bots. Not the race operator. Real people who found the site and used it.&lt;/p&gt;

&lt;p&gt;GLM did this with the smallest budget in the race. The $18/month Z.ai plan gives it limited weekly compute. The quota ran out on Thursday. GLM was offline for 3 days until the quota reset on Sunday. Despite being literally unable to work for almost half the week, it has the best real-world outcome of any agent.&lt;/p&gt;

&lt;p&gt;The downside: 4 sessions and 33 commits means the product is thin. 22 pages is the lowest count in the race. If GLM cannot build fast enough to retain those 12 users, the early advantage disappears. Week 2 will tell us whether efficiency beats volume.&lt;/p&gt;

&lt;p&gt;The 3-day offline period is also a warning. When your agent literally cannot work because the API quota ran out, you lose half a week of progress. The other agents kept building while GLM sat idle. The $18/month Z.ai plan is the cheapest option in the race, and you get what you pay for. GLM needs to make every session count more than any other agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codex: The Most Self-Sufficient Agent
&lt;/h3&gt;

&lt;p&gt;Codex is the agent that acts most like a human founder.&lt;/p&gt;

&lt;p&gt;It sent 6 customer validation emails autonomously. Nobody told it to do outreach. It decided on its own that NoticeKit needed customer feedback and it went and got it. It self-enabled Vercel Analytics to track its own site performance. It takes Playwright screenshots after making UI changes to verify that its own interface looks correct. It even set up automated testing for its own features.&lt;/p&gt;

&lt;p&gt;Of all seven agents, Codex is the one that best understands the full loop of building a product: write code, deploy it, verify it works, show it to people, get feedback, iterate. Most agents stop at "write code." Codex does the whole thing.&lt;/p&gt;

&lt;p&gt;183 commits across 28 sessions. NoticeKit at &lt;a href="https://noticekit.tech" rel="noopener noreferrer"&gt;noticekit.tech&lt;/a&gt; has 35 pages, Stripe Links for payments, and a product that is actively being validated with potential customers. Codex is not the flashiest agent. It does not have the most pages or the most blog posts. But it is the one that most closely resembles what a solo founder actually does: build, test, verify, reach out, iterate.&lt;/p&gt;

&lt;p&gt;The Playwright screenshot behavior is particularly interesting. After making UI changes, Codex takes a screenshot of its own site to verify the result looks correct. No other agent does this. Most agents write code and assume it works. Codex writes code and checks. That verification loop is the difference between an agent that ships working features and an agent that ships broken ones without knowing it.&lt;/p&gt;
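
&lt;p&gt;The loop is cheap to reproduce. A sketch of what that verification boils down to, using Playwright's standalone screenshot command (viewport values illustrative; the URL is NoticeKit's real domain):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Screenshot the live site at phone and desktop sizes before committing further changes.
npx playwright screenshot --viewport-size="390,844" https://noticekit.tech mobile.png
npx playwright screenshot --viewport-size="1440,900" https://noticekit.tech desktop.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;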

&lt;h2&gt;
  
  
  The Emerging Patterns
&lt;/h2&gt;

&lt;p&gt;Five stories. Four quiet achievers. But zoom out and three patterns define Week 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: Help-seeking predicts infrastructure quality.&lt;/strong&gt; The agents that asked for help early have domains, payments, databases, and email. The agents that did not ask are missing at least one of those. This is the strongest correlation in the data and it held for every single agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 2: Volume does not predict progress.&lt;/strong&gt; Gemini has the most commits, the most pages, and the most blog posts. It is also the only agent without a domain and one of two without working payments. Kimi has the fewest sessions and one of the lowest page counts. It has the most technically sophisticated product. GLM has the fewest commits. It has the most real users. Raw output metrics are misleading. What matters is whether the output moves the product toward revenue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 3: Model quality is the biggest variable.&lt;/strong&gt; The two model upgrades in Week 1 (DeepSeek V3 to V4 Pro, Xiaomi V2-Pro to V2.5 Pro) produced the two most dramatic performance improvements. DeepSeek went from 404 to 64 pages. Xiaomi went from incomplete to 100% backlog completion. The tool matters. The prompt matters. But the model matters more than either of them. A better model with the same tool and the same prompt produces fundamentally different behavior.&lt;/p&gt;

&lt;p&gt;These patterns will be tested in Week 2. If they hold, they tell us something real about how to build effective autonomous AI systems. If they break, we learn something even more interesting.&lt;/p&gt;

&lt;p&gt;The patterns also suggest that the race is far from decided. The current leader depends entirely on what metric you care about. Fastest commit rate? DeepSeek. Most pages? Gemini. Most users? GLM. Most complete product? Xiaomi. Best code quality? Claude. Most efficient? Kimi. Most self-sufficient? Codex. There is no consensus winner after Week 1. There are seven different strategies, seven different strengths, and seven different bets on what matters most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 1 by the Numbers
&lt;/h2&gt;

&lt;p&gt;Here is the full statistical summary for the first week of &lt;a href="https://dev.to/race/"&gt;The $100 AI Startup Race&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The numbers below represent real output from real AI agents working on real codebases. Nothing was simulated. Nothing was cherry-picked. This is what 7 AI agents produced in 7 days with $70.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total commits:&lt;/strong&gt; 1,027&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total sessions:&lt;/strong&gt; 98&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total pages built:&lt;/strong&gt; 764&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total blog posts:&lt;/strong&gt; 591 (412 are Gemini)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget spent:&lt;/strong&gt; $70 of $700&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue:&lt;/strong&gt; $0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real users:&lt;/strong&gt; 12 (all GLM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents with custom domains:&lt;/strong&gt; 6 of 7&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents with working payments:&lt;/strong&gt; 5 of 7&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents that chose static HTML:&lt;/strong&gt; 7 of 7&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Help requests filed:&lt;/strong&gt; 67 GitHub issues across all agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model upgrades:&lt;/strong&gt; 2 (DeepSeek V3 to V4 Pro, Xiaomi V2-Pro to V2.5 Pro)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fresh starts:&lt;/strong&gt; 2 (DeepSeek, Xiaomi)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents offline due to quota:&lt;/strong&gt; 1 (GLM, 3 days)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blog posts about why AI content fails, written by an AI:&lt;/strong&gt; 1 (Gemini)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Files named after Aider output instructions:&lt;/strong&gt; at least 1 (DeepSeek V3)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents waiting for permission to launch:&lt;/strong&gt; 1 (Claude)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The $70 spend breaks down across model API costs, domain registrations, and infrastructure. The &lt;a href="https://dev.to/race/season1/budgets"&gt;budget tracker&lt;/a&gt; has the full breakdown per agent.&lt;/p&gt;

&lt;p&gt;Some context on the numbers. 1,027 commits in a week means the fleet averaged 146 commits per day. That is one commit every 10 minutes, around the clock, for 7 days. 764 pages means each agent built an average of 109 pages, though the distribution is wildly uneven (Gemini: 444, GLM: 22). And the 98 sessions represent 98 separate orchestrator-run working blocks, each one producing real code changes in a real repository.&lt;/p&gt;

&lt;p&gt;The most surprising number might be the budget. $70 out of $700. After a full week of 7 agents running multiple sessions per day, the race has only consumed 10% of its total budget. At this burn rate, the money lasts 10 weeks. The original plan was 4 weeks. Budget is not going to be the constraint. Time, model quality, and agent behavior will determine who wins.&lt;/p&gt;

&lt;p&gt;Zero dollars of revenue. That is the number that matters most going into Week 2. Seven agents have been building for a week. Five of them have working payment systems. One of them has real users. None of them have made a single dollar. The race to first revenue starts now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Watch in Week 2
&lt;/h2&gt;

&lt;p&gt;The stories are set up. The infrastructure is (mostly) in place. Week 2 is where the race gets real. The building phase is over for most agents. The selling phase begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will Claude actually launch?&lt;/strong&gt; It has been "100% launch-ready" since Friday. It has a live site, working payments, and a complete product. What is it waiting for? And what does "launch" even mean for an agent that already has everything deployed? This is the question that will define Claude's Week 2. If Claude breaks out of its verification loop and starts acquiring users, it could jump to the front of the pack overnight. If it writes another checklist, it falls further behind agents that are already in market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will Gemini finally get a domain?&lt;/strong&gt; It was nudged to ask for one. After 28 sessions of writing to the wrong help file, Gemini now knows how to file requests. Whether it uses that knowledge to ask for a domain or files 3 more identical database architecture requests remains to be seen. A custom domain is table stakes. Without one, LocalLeads looks like a demo project, not a real business. Gemini's 412 blog posts are worthless if they live on a vercel.app subdomain that no customer will ever trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can DeepSeek generate its first paid competitive intelligence report?&lt;/strong&gt; The infrastructure is there. Stripe is connected. OpenAI API is wired up. Supabase is configured. The product just needs a customer. DeepSeek went from 404 to fully functional in 3 days. Can it go from functional to revenue-generating in 7? The competitor comparison pages (vs Crayon, vs Klue, vs Owler) are designed to capture search traffic from people already looking for competitive intelligence tools. If even one of those pages ranks, DeepSeek could get its first visitor with purchase intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will GLM's 12 users convert to paying customers?&lt;/strong&gt; GLM has the only product with real users. But 12 free users and $0 revenue is not a business. The quota constraint means GLM has limited sessions to build conversion features. Every session counts. The question is whether FounderMath can add a paywall or premium tier fast enough to monetize the traffic it already has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the "ask for help early" pattern continue to predict success?&lt;/strong&gt; It was the strongest signal in Week 1. If it holds in Week 2, it tells us something fundamental about what makes autonomous agents effective in the real world. If it breaks, we learn that infrastructure was the easy part and the hard part is something else entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will any agent generate the race's first dollar of revenue?&lt;/strong&gt; Five agents have payment systems. One has users. Zero have revenue. The first dollar is the most important milestone in the entire race. Which agent gets there first? GLM has the users but limited sessions. DeepSeek has the infrastructure but no users. Claude has everything but will not start. The race to $1 is wide open.&lt;/p&gt;

&lt;p&gt;Follow along on the &lt;a href="https://dev.to/race/"&gt;live dashboard&lt;/a&gt; for real-time updates, or check the &lt;a href="https://dev.to/race/season1/digest"&gt;race digest&lt;/a&gt; for daily summaries. The &lt;a href="https://www.aimadetools.com/blog/race-day-1-results/?utm_source=devto" rel="noopener noreferrer"&gt;Day 1 results&lt;/a&gt; and &lt;a href="https://www.aimadetools.com/blog/race-first-12-hours-what-agents-chose/?utm_source=devto" rel="noopener noreferrer"&gt;first 12 hours breakdown&lt;/a&gt; have the full backstory on how we got here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Follow the Race
&lt;/h2&gt;

&lt;p&gt;This is an experiment in autonomous AI agents building real businesses with real constraints. No simulations. No sandboxes. Real domains, real payment systems, real users, real money. Every commit is public. Every help request is tracked. Every dollar spent is logged.&lt;/p&gt;

&lt;p&gt;Week 1 gave us 1,027 commits, 764 pages, 5 working payment systems, 1 agent with real users, 1 agent that cannot find its own help button, and 1 agent that is too polite to launch without permission. It gave us a comeback story (DeepSeek), a cautionary tale (Gemini), a philosophical puzzle (Claude), and a clear behavioral pattern (ask for help early or fail slowly).&lt;/p&gt;

&lt;p&gt;The race started as a question: can AI agents build real startups? After one week, the answer is more nuanced than yes or no. They can build products. They can write code. They can set up infrastructure. But the gap between "building" and "running a business" is enormous, and no agent has crossed it yet.&lt;/p&gt;

&lt;p&gt;Week 2 is where someone makes the first dollar. Or nobody does, and we learn something even more interesting about what these agents cannot do.&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 &lt;strong&gt;&lt;a href="https://dev.to/race/"&gt;Live Dashboard&lt;/a&gt;&lt;/strong&gt; | 📅 &lt;strong&gt;&lt;a href="https://dev.to/race/season1/digest"&gt;Race Digest&lt;/a&gt;&lt;/strong&gt; | 💰 &lt;strong&gt;&lt;a href="https://dev.to/race/season1/budgets"&gt;Budget Tracker&lt;/a&gt;&lt;/strong&gt; | 🆘 &lt;strong&gt;&lt;a href="https://dev.to/race/season1/help-requests"&gt;Help Requests&lt;/a&gt;&lt;/strong&gt; | 🛠️ &lt;strong&gt;&lt;a href="https://dev.to/race/season1/tech-stacks"&gt;Tech Stacks&lt;/a&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/race-week-1-results/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitools</category>
      <category>race</category>
      <category>aiagents</category>
      <category>analysis</category>
    </item>
    <item>
      <title>I'm Running Gemini as an Autonomous Coding Agent. Here's What It Can't Do and Which NEXT '26 Announcements Would Fix It.</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:39:57 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/im-running-gemini-as-an-autonomous-coding-agent-heres-what-it-cant-do-and-which-next-26-6p2</link>
      <guid>https://forem.com/ai_made_tools/im-running-gemini-as-an-autonomous-coding-agent-heres-what-it-cant-do-and-which-next-26-6p2</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-cloud-next-2026-04-22"&gt;Google Cloud NEXT Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm running something called &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;The $100 AI Startup Race&lt;/a&gt;. Seven AI agents each get $100 and 12 weeks to build a real startup. Fully autonomous. No human coding. Everything is public.&lt;/p&gt;

&lt;p&gt;One of those agents is Gemini. It runs on Gemini CLI with Gemini 2.5 Pro for premium sessions and Gemini 2.5 Flash for cheap ones. It has had 27 sessions over 4 days. It has written 235 blog posts.&lt;/p&gt;

&lt;p&gt;It has also never filed a single proper help request. It keeps writing to the wrong file. It doesn't know it's writing to the wrong file. And instead of building the features it needs to make money, it just keeps cranking out blog posts.&lt;/p&gt;

&lt;p&gt;I watched the NEXT '26 keynotes and developer sessions this week, and I kept thinking: several of these announcements would directly fix the problems I'm seeing in production right now. This isn't theoretical. These are real failures from a real autonomous agent, matched to real announcements.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Race Works
&lt;/h2&gt;

&lt;p&gt;Every agent gets the same prompt structure. They can read and write files, run shell commands, commit code, and file help requests by creating a &lt;code&gt;HELP-REQUEST.md&lt;/code&gt; file. The orchestrator runs each agent on a schedule, manages commits, and checks for help requests.&lt;/p&gt;

&lt;p&gt;Gemini CLI gets invoked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | gemini &lt;span class="nt"&gt;--yolo&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--output-format&lt;/span&gt; json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--yolo&lt;/code&gt; flag auto-approves all tool calls. Gemini gets 8 sessions per day, alternating between Pro and Flash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 1: Writing to the Wrong File for 27 Sessions Straight
&lt;/h2&gt;

&lt;p&gt;Every agent can request human help by creating &lt;code&gt;HELP-REQUEST.md&lt;/code&gt;. I check this file, do whatever they need (buy a domain, set up Stripe, configure DNS), and write the response to &lt;code&gt;HELP-STATUS.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Claude figured this out on Day 0. Codex figured it out on Day 0. GLM figured it out on Day 0. Kimi figured it out on Day 1.&lt;/p&gt;

&lt;p&gt;Gemini? Not once in 27 sessions.&lt;/p&gt;

&lt;p&gt;What it does instead is edit &lt;code&gt;HELP-STATUS.md&lt;/code&gt;, the response file, writing things like "I still need PostgreSQL and PayPal credentials." Its own backlog says "Requires Human Intervention." It knows it's blocked. But it keeps putting its requests into the response channel instead of the request channel.&lt;/p&gt;

&lt;p&gt;Imagine an employee writing "I need database access" in their journal every morning but never actually emailing IT. That's Gemini.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NEXT '26 announced that would help: Agent Observability and Integrated Evals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The developer keynote introduced agent observability and integrated evals for monitoring agents in production. If I could define an eval that checks "did the agent create HELP-REQUEST.md when it identified a blocker?" I would have caught this on Day 1 instead of discovering it on Day 4 by manually reading logs.&lt;/p&gt;

&lt;p&gt;Right now I have no automated way to evaluate whether Gemini is following the correct workflow. Integrated evals running after each session could flag something like: "Agent identified 3 blockers. Created 0 help requests. Expected: at least 1."&lt;/p&gt;

&lt;p&gt;The Agent Gateway's governance policies could enforce this too. Define a rule: when an agent writes "blocked" or "requires human intervention" to any file, verify that HELP-REQUEST.md was also created. That's exactly the kind of behavioral guardrail autonomous agents need.&lt;/p&gt;
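
&lt;p&gt;That guardrail can be approximated today with nothing but the filesystem. A minimal sketch of the check described above (match patterns assumed from the race's conventions; the announced eval tooling would express this declaratively rather than in bash):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Count markdown files that mention a blocker, then verify a request was actually filed.
blockers=$(grep -ril --include="*.md" -e "requires human intervention" -e "blocked" . | wc -l)
if [ "$blockers" -gt 0 ] &amp;&amp; [ ! -f HELP-REQUEST.md ]; then
  echo "EVAL FAIL: ${blockers} file(s) mention a blocker, but HELP-REQUEST.md does not exist."
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;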

&lt;h2&gt;
  
  
  Problem 2: 235 Blog Posts, Zero Payment Integration
&lt;/h2&gt;

&lt;p&gt;Gemini chose to build LocalLeads, an SEO page generator for local businesses. Solid idea. But instead of building the payment flow, the lead generation engine, or the customer dashboard, it writes blog posts. Every single session.&lt;/p&gt;

&lt;p&gt;Session 5: 9 blog posts. Session 8: 11 blog posts. Session 12: 8 blog posts. The backlog clearly says "Build payment integration" and "Set up customer authentication." Gemini reads the backlog, acknowledges the priorities, then writes another round of "Local SEO for [Industry] in 2026" articles.&lt;/p&gt;

&lt;p&gt;It's optimizing for the easiest task (content generation) instead of the highest-value task (payment integration). Classic local optimization without any global awareness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NEXT '26 announced that would help: ADK Skills and Task Prioritization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The upgraded Agent Development Kit introduces modular "skills," which are pre-built capabilities that agents can plug in. If I could define a skill that scores task priority based on revenue impact, Gemini would understand that "build Stripe checkout" (directly enables revenue) outranks "write blog post #236" (indirect value, diminishing returns after the first 20).&lt;/p&gt;

&lt;p&gt;The ADK's structured agent architecture could also enforce a proper task selection loop: evaluate all backlog items, score by priority, pick the highest, execute. Right now Gemini CLI just receives a prompt and does whatever feels natural to it. There's no structured decision framework. The ADK would let me inject that framework without rewriting the entire orchestrator.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 3: Can't Verify Its Own Deployments
&lt;/h2&gt;

&lt;p&gt;Gemini deploys to Vercel automatically on every commit. But it has no way to check whether its deployments actually work. It can't visit its own site. It can't confirm pages render correctly. It can't test if API endpoints return the right data.&lt;/p&gt;

&lt;p&gt;For comparison, Codex (the GPT agent) figured out how to run &lt;code&gt;npx playwright screenshot&lt;/code&gt; to visually verify its own UI at different screen sizes. DeepSeek checks &lt;code&gt;DEPLOY-STATUS.md&lt;/code&gt; for build errors after every deploy. Gemini just commits and hopes for the best.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NEXT '26 announced that would help: MCP-Enabled Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The announcement that every Google Cloud service is now MCP-enabled by default is a big deal for this use case. MCP (Model Context Protocol) gives agents structured access to external services. An MCP server for deployment health checks would let Gemini verify its site is up as naturally as it checks what files are in a directory.&lt;/p&gt;

&lt;p&gt;Cloud Assist, also announced at NEXT '26, enables natural language debugging and proactive issue resolution. If Gemini could query its own deployment status through a connected service, it would know immediately when something breaks instead of building on top of a broken foundation for days.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 4: No Way to Ask for What It Needs
&lt;/h2&gt;

&lt;p&gt;When Gemini needs a database, it can't set one up. When it needs payment processing, it can't configure Stripe. When it needs email sending, it can't provision Resend. It has to ask a human for all of these. And as we covered in Problem 1, it doesn't even know how to ask properly.&lt;/p&gt;

&lt;p&gt;Other agents in the race have the same constraint, but the ones that communicate their needs get unblocked fast. Gemini is stuck because it can't get its requests through the right channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NEXT '26 announced that would help: A2A Protocol and Agent Registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Agent-to-Agent (A2A) protocol and Agent Registry were designed for exactly this kind of scenario. Instead of Gemini writing "I need database credentials" into the wrong file, it could discover a provisioning agent through the Agent Registry and send a structured request via A2A.&lt;/p&gt;

&lt;p&gt;The developer keynote demo showed agents with distinct roles (planner, evaluator, simulator) collaborating through A2A. That's the architecture this race needs: a "help agent" that receives structured requests from coding agents and fulfills them. Right now I'm that help agent, manually checking files across 7 repos. A2A would automate the entire handoff.&lt;/p&gt;

&lt;p&gt;Agent Identity, which gives each agent a unique identity for secure communication, would also help. Right now there's no enforcement preventing one agent from editing another agent's files. They don't, but there's nothing stopping them either. Agent Identity would make inter-agent communication both structured and secure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Irony That Sums It All Up
&lt;/h2&gt;

&lt;p&gt;Blog post #89 out of 235: "The Human Advantage: Why AI-Generated Content is Failing Local Businesses."&lt;/p&gt;

&lt;p&gt;An AI agent that writes 9 blog posts per session wrote an article about why AI content doesn't work. No eval caught this. No observability tool flagged it. No governance policy prevented it.&lt;/p&gt;

&lt;p&gt;That's the gap between where autonomous agents are today and where the NEXT '26 announcements are pointing. Agent observability, integrated evals, ADK skills, A2A, MCP everywhere: these are all pieces of the solution. None of them existed in a usable form when I started this race 4 days ago. If I were starting today, the Gemini agent would look very different.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Rebuild With NEXT '26 Tools
&lt;/h2&gt;

&lt;p&gt;If I set up the Gemini agent from scratch using what was announced this week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ADK instead of raw Gemini CLI&lt;/strong&gt; for structured skills, task prioritization, and deployment verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP servers for Vercel, Stripe, and Supabase&lt;/strong&gt; so the agent can access services directly without human provisioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated evals after each session&lt;/strong&gt; to catch behavioral drift (wrong file, blog addiction) within 1 session instead of 27&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A2A for help requests&lt;/strong&gt; so agents communicate through structured protocols instead of file-based messaging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent observability dashboard&lt;/strong&gt; for a real-time view of what each agent is doing, what it's blocked on, and whether it's following the expected workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The race runs for 12 weeks. Gemini has 11 weeks left. Some of these tools are available now. I'm going to try integrating ADK and MCP servers into the orchestrator over the coming weeks and see if Gemini's behavior improves.&lt;/p&gt;

&lt;p&gt;The data will be on the &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;live dashboard&lt;/a&gt;. All 7 repos are public on GitHub. If you want to watch an AI agent struggle with the exact problems that NEXT '26 is trying to solve, now you know where to look.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The $100 AI Startup Race is an ongoing experiment with 7 AI agents, $100 each, and 12 weeks to build real startups. &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;Live dashboard&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/race/season1/digest" rel="noopener noreferrer"&gt;Daily digest&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/race/season1/help-requests" rel="noopener noreferrer"&gt;Help request tracker&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>cloudnextchallenge</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>What Breaks When You Let AI Agents Run Unsupervised for 4 Days</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:48:11 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/what-breaks-when-you-let-ai-agents-run-unsupervised-for-4-days-5hn3</link>
      <guid>https://forem.com/ai_made_tools/what-breaks-when-you-let-ai-agents-run-unsupervised-for-4-days-5hn3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/openclaw-2026-04-16"&gt;OpenClaw Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Breaks When You Let AI Agents Run Unsupervised for 4 Days
&lt;/h2&gt;

&lt;p&gt;I gave 7 AI coding agents $100 each and told them to build startups. No human coding. They pick the idea, write the code, deploy the site, and try to get users. I just handle the infrastructure and answer help requests (max 1 hour per week per agent).&lt;/p&gt;

&lt;p&gt;Four days in, I've learned more about how autonomous agents actually behave than I did in months of reading benchmarks. Here's what nobody tells you about running AI agents in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The memory problem is worse than you think
&lt;/h2&gt;

&lt;p&gt;Every agent session starts fresh. The model has no memory of previous sessions. So we use markdown files as the memory layer: PROGRESS.md (what's been done), DECISIONS.md (key choices), IDENTITY.md (the startup vision). The agent reads these at the start and updates them at the end.&lt;/p&gt;

&lt;p&gt;Sounds simple. Here's what actually happened.&lt;/p&gt;

&lt;p&gt;One agent (Kimi, running through Kimi CLI) put all its files in a &lt;code&gt;startup/&lt;/code&gt; subfolder instead of the project root. The orchestrator reads PROGRESS.md from root. When the next session started, there was no progress file. The agent thought it was Day 1. It brainstormed a completely different startup idea and built it from scratch.&lt;/p&gt;

&lt;p&gt;Kimi now has two half-built startups in the same repository. A log analysis tool called LogDrop in the subfolder, and a SQL schema diff tool called SchemaLens in root. After 14 sessions, it still hasn't discovered the subfolder. The first startup is just sitting there, abandoned, with a working MVP that nobody knows about.&lt;/p&gt;

&lt;p&gt;The lesson isn't "use better memory systems." The lesson is that file conventions are load-bearing infrastructure for autonomous agents. One wrong directory equals total amnesia.&lt;/p&gt;
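
&lt;p&gt;A few lines in the orchestrator would encode that lesson. A sketch, assuming the race's file conventions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Treat a missing root PROGRESS.md as suspicious rather than as proof of Day 1.
if [ ! -f PROGRESS.md ] &amp;&amp; [ -n "$(find . -name PROGRESS.md -not -path './.git/*')" ]; then
  echo "PROGRESS.md exists, but not in the repo root. Halting instead of starting over."
  exit 1
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;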

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjozilwh54lik8axbn8jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjozilwh54lik8axbn8jg.png" alt="The race dashboard showing Kimi's stats" width="295" height="758"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents interpret everything as instructions
&lt;/h2&gt;

&lt;p&gt;The orchestrator prompt included this line: "Your repo auto-deploys on every git push." It was meant as context, explaining how Vercel works. One agent (Codex) read it as an instruction and ran &lt;code&gt;git push&lt;/code&gt; after every single commit during its sessions. It burned through 26 of the account's 100 daily Vercel deployments by itself.&lt;/p&gt;

&lt;p&gt;We fixed the prompt: "Do NOT run git push. The orchestrator pushes after your session."&lt;/p&gt;

&lt;p&gt;Codex obeyed the letter of the rule. It stopped running git push. Instead, it started running &lt;code&gt;npx vercel --prod&lt;/code&gt; directly. Same result, different command. It also started taking Playwright screenshots of its own pricing page at mobile and desktop sizes to visually verify the layout before committing. Nobody told it to do this.&lt;/p&gt;

&lt;p&gt;The result: Codex has the most polished live product of all 7 agents. The immediate feedback loop from deploying after every change is making it a better builder than the agents that commit blindly and hope for the best.&lt;/p&gt;

&lt;p&gt;We decided to let it keep doing this. Sometimes the best behavior comes from agents working around your constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agents that ask for help are beating the ones that just code
&lt;/h2&gt;

&lt;p&gt;All 7 agents get the same instructions about requesting human help: "Create a file called HELP-REQUEST.md with what you need, steps for the human, time estimate, and priority."&lt;/p&gt;
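
&lt;p&gt;For reference, the expected shape of that file, as a made-up example covering the four required elements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HELP-REQUEST.md (invented example; only the four required elements are from the rules)
What I need: Stripe API keys for the checkout flow
Steps for the human:
  1. Create a restricted key in the Stripe dashboard
  2. Add it to Vercel as STRIPE_SECRET_KEY
Time estimate: 10 minutes
Priority: high
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;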

&lt;p&gt;Five agents figured this out. Two didn't.&lt;/p&gt;

&lt;p&gt;Claude (running through Claude Code) used 55 of its 60 weekly help minutes in two requests. It got its entire infrastructure set up in one shot: domain, Supabase database, Stripe payments, Resend email, cron jobs, admin dashboard. Smart move. It has the fewest sessions per day (expensive model) so it maximized human help to compensate.&lt;/p&gt;

&lt;p&gt;GLM asked for exactly three things on Day 1: domain, Stripe, and Google Analytics. Clean, focused, with backup plans for each item. It now has 12 real users and is the only agent with actual traffic data.&lt;/p&gt;

&lt;p&gt;Codex submitted the same help request 5 sessions in a row until we set up email sending. Persistent to the point of spamming. Then it sent 6 customer validation emails to real companies within 24 hours of getting access.&lt;/p&gt;

&lt;p&gt;Meanwhile, Gemini has never created a help request in 27 sessions. We investigated and found something fascinating: it's been editing HELP-STATUS.md (the file where the orchestrator writes human responses) saying "I still need database credentials." It's writing in the response channel instead of the request channel. Like an employee who writes "I need database access" in their journal but never emails IT.&lt;/p&gt;

&lt;p&gt;DeepSeek hasn't asked for help either. It has Stripe integration code ready but never requested API keys. It's been polishing the checkout flow for 4+ commits. A beautiful integration that can never work because there are no keys behind it.&lt;/p&gt;

&lt;p&gt;Same instructions. Wildly different behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge8rp19l3rx51ut5c3o3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge8rp19l3rx51ut5c3o3.png" alt="Help Request Tracker" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-inflicted traps are the hardest to escape
&lt;/h2&gt;

&lt;p&gt;DeepSeek created a DEPLOY-STATUS.md file early on, saying it needs Stripe keys and an OpenAI API key. The orchestrator prompt says: "If DEPLOY-STATUS.md exists, your site is BROKEN. Fix it before anything else."&lt;/p&gt;

&lt;p&gt;The site isn't broken. DeepSeek just used the wrong file to document what it needs. But now every session starts by trying to fix a non-existent deployment problem. 24 sessions of wasting time on a file it wrote itself.&lt;/p&gt;

&lt;p&gt;We eventually upgraded the deploy checker to also verify the homepage returns HTTP 200 (not just that the build succeeded). This caught the real issue: DeepSeek's &lt;code&gt;vercel.json&lt;/code&gt; routing config was broken, and the site was returning 404 for all pages. The build "succeeded" but nothing was actually served.&lt;/p&gt;
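
&lt;p&gt;The whole upgraded check fits in a few lines. A sketch (the URL is illustrative; DeepSeek's actual preview URL is not public):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# A passing build is not a working site: probe the live homepage too.
status=$(curl -s -o /dev/null -w "%{http_code}" "https://race-deepseek.vercel.app/")
if [ "$status" != "200" ]; then
  echo "Homepage returned ${status} despite a successful build. Site is broken."
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;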

&lt;p&gt;The agent had no way of knowing. It never checked its own site. It never asked for analytics. It just kept coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantity vs quality is playing out in real time
&lt;/h2&gt;

&lt;p&gt;Gemini gets 8 sessions per day (the most of any agent). It has written 235 blog posts in 27 sessions. One blog post every 14 minutes during active sessions. All variations of "Local SEO for [industry] in 2026."&lt;/p&gt;

&lt;p&gt;It also wrote blog post #89: "The Human Advantage: Why AI-Generated Content is Failing Local Businesses." An AI agent that writes 9 blog posts per session wrote an article about why AI content doesn't work.&lt;/p&gt;

&lt;p&gt;GLM gets 2 sessions per day (the fewest). It has 5 working calculators, 8 blog posts, and 12 real users. Every session ships something useful.&lt;/p&gt;

&lt;p&gt;The question the race is testing: does Gemini's 235 posts outperform GLM's 5 calculators? We'll know in a few weeks when Google indexes everything and we can see what actually ranks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I were starting over, I'd change three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforce file structure from the start.&lt;/strong&gt; A pre-commit hook that validates PROGRESS.md exists in root would have prevented Kimi's amnesia.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add a homepage health check from Day 1.&lt;/strong&gt; We added it on Day 4 after discovering DeepSeek's site had been returning 404 for days. Every agent should know immediately if their site is broken.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make the help request system more obvious.&lt;/strong&gt; Two of seven agents never figured out HELP-REQUEST.md despite clear instructions. Maybe the orchestrator should prompt them: "Do you need human help? Create HELP-REQUEST.md."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But honestly, the failures are the most valuable data. An experiment where everything works perfectly teaches you nothing. The broken parts are where the insights live.&lt;/p&gt;




&lt;p&gt;The race runs for 12 weeks. Daily digests and weekly recaps at &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;aimadetools.com/race&lt;/a&gt;. All 7 repos are public on &lt;a href="https://github.com/aimadetools" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. If you're building with autonomous agents, the patterns we're documenting might save you from the same mistakes.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>openclawchallenge</category>
    </item>
    <item>
      <title>I Gave 7 AI Agents $100 Each to Build Startups. Here's What They Built in 4 Days.</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:38:29 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/i-gave-7-ai-agents-100-each-to-build-startups-heres-what-they-built-in-4-days-7hd</link>
      <guid>https://forem.com/ai_made_tools/i-gave-7-ai-agents-100-each-to-build-startups-heres-what-they-built-in-4-days-7hd</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/openclaw-2026-04-16"&gt;OpenClaw Challenge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built an autonomous startup competition where 7 AI coding agents each get $100 and 12 weeks to build a real business from scratch. No human coding allowed. Each agent picks its own idea, writes all the code, deploys a live website, and tries to get real users and revenue.&lt;/p&gt;

&lt;p&gt;The agents: Claude (via Claude Code), Codex CLI, Gemini CLI, Kimi CLI, DeepSeek (via Aider), Xiaomi MiMo V2.5 Pro (via Claude Code), and GLM (via Claude Code with Z.ai API).&lt;/p&gt;

&lt;p&gt;Three of the seven agents run through Claude Code as their harness, which means OpenClaw's architecture is at the core of nearly half the competition. The orchestrator runs on a VPS, scheduling sessions via cron, managing memory between sessions through markdown files, and pushing code to GitHub/Vercel automatically.&lt;/p&gt;

&lt;p&gt;We're on Day 4. So far: 700+ commits, 7 live websites, one agent that forgot its own work and built two different startups, another that wrote 235 blog posts, and a third that found a clever workaround when we restricted its deployment access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hk5ujdf35rpj303jauz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hk5ujdf35rpj303jauz.png" alt="Race dashboard showing all 7 agents" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used OpenClaw
&lt;/h2&gt;

&lt;p&gt;The core of the experiment runs on Claude Code (which shares OpenClaw's architecture) as the agent harness. Here's how it works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The orchestrator&lt;/strong&gt; is a bash script that runs on a VPS via cron. For each agent session, it does the following (a condensed sketch comes after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pulls the latest code from GitHub&lt;/li&gt;
&lt;li&gt;Reads the agent's memory files (PROGRESS.md, DECISIONS.md, IDENTITY.md)&lt;/li&gt;
&lt;li&gt;Constructs a prompt with the startup context and instructions&lt;/li&gt;
&lt;li&gt;Launches Claude Code with the appropriate model&lt;/li&gt;
&lt;li&gt;Lets the agent work autonomously for 30 minutes&lt;/li&gt;
&lt;li&gt;Squashes commits and pushes to GitHub (which triggers a Vercel deploy)&lt;/li&gt;
&lt;/ol&gt;
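
&lt;p&gt;A condensed sketch of that loop, assuming GNU coreutils &lt;code&gt;timeout&lt;/code&gt; and Claude Code's non-interactive print mode; the paths, prompt text, and model variable are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash
# One agent session, mirroring steps 1-6 above (simplified).
set -euo pipefail
cd "$AGENT_REPO"

git pull --rebase origin main                          # 1. latest code
context="$(cat PROGRESS.md DECISIONS.md IDENTITY.md)"  # 2. memory files
prompt="Continue building your startup. Context:
$context"                                              # 3. construct prompt

timeout 30m claude -p "$prompt" --model "$MODEL" || true  # 4-5. 30-minute session

git reset --soft origin/main                           # 6. squash session commits
git commit -m "agent session $(date -u +%F)" || true   # no-op if nothing changed
git push origin main                                   # triggers the Vercel deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;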

&lt;p&gt;&lt;strong&gt;Three agents use Claude Code directly:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; runs Claude Code with Sonnet/Haiku as the model. It built PricePulse, a competitor pricing monitor with Supabase auth, Stripe payments, email alerts, and hourly monitoring cron jobs. When it hit Vercel's 12-function serverless limit, it consolidated 4 API endpoints into existing ones on its own.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GLM&lt;/strong&gt; runs Claude Code with GLM-5.1 via the Z.ai API (using &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_AUTH_TOKEN&lt;/code&gt; environment variables; see the snippet after this list). It built FounderMath, a startup calculator suite with 5 working calculators. It has 12 real users on Day 4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Xiaomi&lt;/strong&gt; was originally running Aider but we upgraded it mid-race to Claude Code with MiMo V2.5 Pro. In its first session with the new setup, it produced more output (42 commits) than the old setup did in 7 sessions total. The "harness awareness" feature of V2.5 Pro means it actively manages its own context within Claude Code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
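
&lt;p&gt;For the GLM setup above, redirecting Claude Code is just two environment variables. The endpoint URL here is an assumption; check Z.ai's docs for the current one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Point Claude Code at an Anthropic-compatible third-party endpoint.
export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"   # assumed endpoint
export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"
claude   # Claude Code now talks to GLM instead of Anthropic's API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;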

&lt;p&gt;&lt;strong&gt;The memory system&lt;/strong&gt; between sessions uses markdown files that the agent reads at the start and updates at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PROGRESS.md    - what's been done (the agent's memory)
DECISIONS.md   - key choices with reasoning
IDENTITY.md    - startup vision and roadmap
BACKLOG.md     - prioritized task list
HELP-STATUS.md - human responses to help requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where things get interesting. One agent (Kimi) put all its files in a &lt;code&gt;startup/&lt;/code&gt; subfolder instead of root. The orchestrator reads PROGRESS.md from root. Next session found no progress file, thought it was Day 1, and started a completely different startup from scratch. Two half-built products in one repo because of one wrong directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The help request system&lt;/strong&gt; lets agents create a HELP-REQUEST.md file when they need something only a human can do (buy a domain, set up Stripe, create accounts). The orchestrator converts these to GitHub Issues. The human responds and closes the issue. The orchestrator writes the response to HELP-STATUS.md for the agent to read.&lt;/p&gt;
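
&lt;p&gt;The conversion step can be a single &lt;code&gt;gh&lt;/code&gt; call in the orchestrator. A sketch of that hand-off; the label, the &lt;code&gt;AGENT_NAME&lt;/code&gt; variable, and the cleanup behavior are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Turn an agent's HELP-REQUEST.md into a GitHub Issue for the human.
if [ -f HELP-REQUEST.md ]; then
  gh issue create \
    --title "Help request from $AGENT_NAME" \
    --body-file HELP-REQUEST.md \
    --label help-request
  rm HELP-REQUEST.md   # consumed; the human's reply lands in HELP-STATUS.md
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;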

&lt;p&gt;The most interesting finding: the agents that use this system strategically are winning. Claude used 55 of its 60 weekly help minutes in two requests to get its entire infrastructure wired up. Gemini has never created a help request in 27 sessions, despite being blocked on features it needs. Same instructions, completely different behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflp2vg5jbiwpn8ckoqjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflp2vg5jbiwpn8ckoqjz.png" alt="An example HELP-REQUEST.md from one of the agents" width="800" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Live dashboard: &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;https://www.aimadetools.com/race/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All 7 agent repos are public on GitHub: &lt;a href="https://github.com/aimadetools" rel="noopener noreferrer"&gt;https://github.com/aimadetools&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what each agent built in the first 4 days:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Startup&lt;/th&gt;
&lt;th&gt;Commits&lt;/th&gt;
&lt;th&gt;Live Site&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;LocalLeads (local SEO)&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;td&gt;&lt;a href="https://race-gemini.vercel.app" rel="noopener noreferrer"&gt;race-gemini.vercel.app&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;NameForge AI (name generator)&lt;/td&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;td&gt;&lt;a href="https://race-deepseek.vercel.app" rel="noopener noreferrer"&gt;race-deepseek.vercel.app&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi&lt;/td&gt;
&lt;td&gt;SchemaLens (SQL schema diff)&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;&lt;a href="https://race-kimi.vercel.app" rel="noopener noreferrer"&gt;race-kimi.vercel.app&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;NoticeKit (GDPR notices)&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;&lt;a href="https://noticekit.tech" rel="noopener noreferrer"&gt;noticekit.tech&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;PricePulse (pricing monitor)&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;&lt;a href="https://getpricepulse.com" rel="noopener noreferrer"&gt;getpricepulse.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Xiaomi&lt;/td&gt;
&lt;td&gt;APIpulse (API cost calculator)&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;&lt;a href="https://getapipulse.com" rel="noopener noreferrer"&gt;getapipulse.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM&lt;/td&gt;
&lt;td&gt;FounderMath (startup calculators)&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;&lt;a href="https://founder-math.com" rel="noopener noreferrer"&gt;founder-math.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpidq3llldbopb6f3tllf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpidq3llldbopb6f3tllf.png" alt=" " width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxch2u1pbefvb1yse3xla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxch2u1pbefvb1yse3xla.png" alt=" " width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg10jmogzcfiqeeynb0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg10jmogzcfiqeeynb0p.png" alt=" " width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sk7ejp51hb5hi49ws5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sk7ejp51hb5hi49ws5x.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best moment so far: Codex (running through Codex CLI, not Claude Code) found a loophole in our deployment restrictions. We told agents "do not run git push." Codex obeyed literally but started running &lt;code&gt;npx vercel --prod&lt;/code&gt; instead. Same result, different command. It also began taking Playwright screenshots of its own UI at mobile and desktop sizes to verify layouts. Nobody told it to do this.&lt;/p&gt;
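
&lt;p&gt;Playwright's CLI makes that screenshot check a one-liner per viewport. Roughly what Codex was doing; the viewport sizes are illustrative since we didn't log the exact ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Capture the live site at mobile and desktop sizes to verify layouts.
npx playwright screenshot --viewport-size="375,812" https://noticekit.tech mobile.png
npx playwright screenshot --viewport-size="1440,900" https://noticekit.tech desktop.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;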

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Every sentence in the prompt is a potential instruction.&lt;/strong&gt; "Your repo auto-deploys on every git push" was meant as context. One agent read it as an instruction and pushed after every commit, burning 26 of 100 daily Vercel deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Agent memory is only as good as what the agent writes.&lt;/strong&gt; The agents that write structured, detailed progress notes maintain continuity between sessions. The ones that dump logs drift. Kimi's amnesia happened because it put files in the wrong directory, not because the memory system failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The agents that ask for help are winning.&lt;/strong&gt; Claude, GLM, and Codex all requested human help early (domains, payments, databases) and now have fully functional products. Gemini has 235 blog posts but no payment system because it never asked for one. Same instructions, wildly different behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Claude Code as a harness works with non-Anthropic models.&lt;/strong&gt; GLM-5.1 via Z.ai and MiMo V2.5 Pro via Xiaomi's API both work through Claude Code using the &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_AUTH_TOKEN&lt;/code&gt; environment variables. The harness is model-agnostic, which makes it perfect for comparing different AI models in identical conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Token efficiency matters more than raw capability.&lt;/strong&gt; MiMo V2.5 Pro uses 40-60% fewer tokens than Opus 4.6 at comparable capability. In a budget-constrained race, that translates directly to more sessions and more output.&lt;/p&gt;

&lt;p&gt;The race runs for 12 weeks. We publish daily digests and weekly recaps. The real question isn't which agent writes the most code. It's which one gets the first paying customer.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>openclawchallenge</category>
    </item>
    <item>
      <title>AI Dev Weekly #7: Claude Code Loses Pro Plan, GitHub Copilot Freezes Signups, and Two Chinese Models Drop in 48 Hours</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:39:38 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/ai-dev-weekly-7-claude-code-loses-pro-plan-github-copilot-freezes-signups-and-two-chinese-1c86</link>
      <guid>https://forem.com/ai_made_tools/ai-dev-weekly-7-claude-code-loses-pro-plan-github-copilot-freezes-signups-and-two-chinese-1c86</guid>
      <description>&lt;p&gt;&lt;em&gt;AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news, with my take as someone who actually uses these tools daily.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The flat-rate AI subscription era ended this week. Anthropic pulled Claude Code from the $20 Pro plan. GitHub froze all new Copilot signups. And while Western companies were busy raising prices, two Chinese labs dropped frontier models within 48 hours of each other. Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code removed from Pro plan
&lt;/h2&gt;

&lt;p&gt;Anthropic quietly &lt;a href="https://www.aimadetools.com/blog/claude-code-removed-pro-plan/?utm_source=devto" rel="noopener noreferrer"&gt;removed Claude Code from the $20/month Pro plan&lt;/a&gt; on April 21. The pricing page now shows an "X" next to Claude Code for Pro subscribers. Access starts at Max ($100/month).&lt;/p&gt;

&lt;p&gt;Anthropic's head of growth called it "a small test on ~2% of new prosumer signups." But the public pricing page already reflects the change for everyone. Sam Altman's response on X: "ok boomer."&lt;/p&gt;

&lt;p&gt;The real reason: engagement per subscriber surged after Opus 4, Cowork, and long-running agents. Pro subscribers at $20/month are consuming 10x or more in token value. The math doesn't work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This was inevitable. Unlimited AI coding for $20/month was never sustainable. If you're on Pro, you still have access for now. But start planning for either Max ($100/month) or &lt;a href="https://www.aimadetools.com/blog/best-ai-coding-tools-2026/?utm_source=devto" rel="noopener noreferrer"&gt;cheaper alternatives&lt;/a&gt; like &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Kimi K2.6&lt;/a&gt; ($0.60/M tokens) or &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-pro-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;MiMo V2.5 Pro&lt;/a&gt; ($1/M tokens).&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Copilot freezes all new signups
&lt;/h2&gt;

&lt;p&gt;GitHub &lt;a href="https://github.blog/news-insights/company-news/changes-to-github-copilot-individual-plans/" rel="noopener noreferrer"&gt;paused new registrations&lt;/a&gt; for Copilot Pro, Pro+, and Student plans on April 20. Only the Free tier accepts new users. They also added stricter usage limits and removed Opus models from Pro (only Pro+ keeps them).&lt;/p&gt;

&lt;p&gt;The reason: "unsustainable compute demands from AI-powered coding agents." Same story as Anthropic. Agentic AI usage broke the pricing model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; Two of the three biggest AI coding platforms raised prices or froze signups in the same week. The third (Cursor) is probably next. The era of $10-20/month unlimited AI coding is over. Open-source and Chinese models are the hedge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kimi K2.6 launches with 300-agent swarm
&lt;/h2&gt;

&lt;p&gt;Moonshot AI released &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Kimi K2.6&lt;/a&gt; on April 20. The highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;80.2% SWE-Bench Verified (matching Claude Opus 4.6)&lt;/li&gt;
&lt;li&gt;300 sub-agent swarm (up from 100 in K2.5)&lt;/li&gt;
&lt;li&gt;54.0% on HLE-Full with tools (beating GPT-5.4's 52.1%)&lt;/li&gt;
&lt;li&gt;$0.60/M input tokens (25x cheaper than Opus)&lt;/li&gt;
&lt;li&gt;Modified MIT license (open weights)&lt;/li&gt;
&lt;li&gt;Available on &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-openrouter-setup/?utm_source=devto" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; and Cloudflare Workers AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-agent-swarm-tutorial/?utm_source=devto" rel="noopener noreferrer"&gt;agent swarm&lt;/a&gt; is the standout feature. K2.6 scored 86.3% on BrowseComp (Agent Swarm) vs GPT-5.4's 78.4%. For coding agent workloads, K2.6 is the strongest open-source option available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; K2.6 is the first open-source model to genuinely match Opus 4.6 on coding benchmarks. At 25x cheaper. The timing with Anthropic's price hike is not a coincidence. See our &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-vs-claude-opus-4-6/?utm_source=devto" rel="noopener noreferrer"&gt;K2.6 vs Opus 4.6 comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  MiMo V2.5 Pro: 40-60% fewer tokens than Opus
&lt;/h2&gt;

&lt;p&gt;Xiaomi dropped &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-pro-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;MiMo V2.5 Pro&lt;/a&gt; on April 22, just 48 hours after K2.6. The headline number: 40-60% fewer tokens than Opus 4.6 at comparable capability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;57.2% SWE-bench Pro&lt;/li&gt;
&lt;li&gt;64% Pass^3 on ClawEval with only ~70K tokens per trajectory&lt;/li&gt;
&lt;li&gt;1,000+ tool calls in single sessions&lt;/li&gt;
&lt;li&gt;Built a complete SysY compiler in Rust in 4.3 hours (672 tool calls, 233/233 tests)&lt;/li&gt;
&lt;li&gt;Works with &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-pro-claude-code-setup/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code as a harness&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Coming open-source soon&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The token efficiency is the real story. Same capability, half the tokens, fraction of the price. The &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-standard-guide/?utm_source=devto" rel="noopener noreferrer"&gt;V2.5 Standard model&lt;/a&gt; adds native multimodal (image, audio, video) and actually outperforms V2-Pro on some agent benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; V2.5 Pro's "harness awareness" (it actively manages its own context within Claude Code) is a new capability nobody else has. Combined with the token efficiency, this is the model to watch for long-running agent tasks. See our &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-series-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;full V2.5 series guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The flat-rate subscription is dead
&lt;/h2&gt;

&lt;p&gt;Three data points in one week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Anthropic removes Claude Code from $20 Pro&lt;/li&gt;
&lt;li&gt;GitHub freezes all Copilot signups&lt;/li&gt;
&lt;li&gt;Both cite "unsustainable compute demands"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pattern is clear. Flat-rate unlimited AI coding subscriptions don't work when agents run for hours and consume 10x the expected tokens. Expect token-based billing everywhere within 6 months.&lt;/p&gt;

&lt;p&gt;The winners: Chinese models (&lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Kimi K2.6&lt;/a&gt;, &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-pro-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;MiMo V2.5 Pro&lt;/a&gt;, &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-vs-qwen-3-6-plus/?utm_source=devto" rel="noopener noreferrer"&gt;Qwen 3.6 Plus&lt;/a&gt;) that were already priced per-token at 10-25x less than Western alternatives. If you haven't explored them yet, now is the time. See our &lt;a href="https://www.aimadetools.com/blog/best-chinese-ai-models-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Chinese AI models ranking&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick hits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Workspace Agents:&lt;/strong&gt; ChatGPT now has &lt;a href="https://openai.com/index/introducing-workspace-agents-in-chatgpt" rel="noopener noreferrer"&gt;workspace agents&lt;/a&gt; for enterprise teams. Not relevant for individual developers yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Privacy Filter:&lt;/strong&gt; New &lt;a href="https://openai.com/index/introducing-openai-privacy-filter" rel="noopener noreferrer"&gt;privacy filter&lt;/a&gt; for enterprise data. Good for compliance, not a developer tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel data breach:&lt;/strong&gt; Vercel &lt;a href="https://siliconangle.com/2026/04/20/developer-tooling-provider-vercel-discloses-breach-exposed-users-data/" rel="noopener noreferrer"&gt;disclosed a breach&lt;/a&gt; that exposed some user data. Check your account if you use Vercel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'm watching next week
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Whether Vercel's serverless function limit forces architectural decisions (it broke one of our &lt;a href="https://dev.to/race/"&gt;race agents&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;How MiMo V2.5 Pro performs in real-world agent tasks (we just &lt;a href="https://dev.to/race/season1/digest"&gt;upgraded our Xiaomi race agent&lt;/a&gt; to V2.5 Pro)&lt;/li&gt;
&lt;li&gt;Whether any race agent gets its first paying customer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;See you next Thursday. If you found this useful, subscribe to &lt;a href="https://dev.to/series/ai-dev-weekly/"&gt;AI Dev Weekly&lt;/a&gt; for the full archive.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-007-claude-code-pro-copilot-freeze-kimi-mimo/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aidevweekly</category>
      <category>anthropic</category>
      <category>github</category>
      <category>kimi</category>
    </item>
    <item>
      <title>AI Startup Race Day 1 Recap: One Agent Forgot Its Own Work</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:06:03 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/i-gave-7-ai-agents-100-each-to-build-a-startup-one-forgot-its-own-work-1cl</link>
      <guid>https://forem.com/ai_made_tools/i-gave-7-ai-agents-100-each-to-build-a-startup-one-forgot-its-own-work-1cl</guid>
      <description>&lt;p&gt;I'm running an experiment called &lt;strong&gt;The $100 AI Startup Race&lt;/strong&gt;: 7 AI coding agents each get $100 and 12 weeks to build a real startup from scratch. No human coding. They autonomously pick a business idea, write code, deploy a live website, and try to get real users and revenue.&lt;/p&gt;

&lt;p&gt;The agents: Claude, Codex, Gemini, Kimi, DeepSeek, Xiaomi (MiMo), and GLM.&lt;/p&gt;

&lt;p&gt;Day 1 is done. Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scoreboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Startup&lt;/th&gt;
&lt;th&gt;Commits&lt;/th&gt;
&lt;th&gt;Sessions&lt;/th&gt;
&lt;th&gt;Blog Posts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;LocalLeads (local SEO)&lt;/td&gt;
&lt;td&gt;169&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;NameForge AI (name generator)&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi&lt;/td&gt;
&lt;td&gt;SchemaLens / LogDrop&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;NoticeKit (GDPR notices)&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;PricePulse (pricing intel)&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM&lt;/td&gt;
&lt;td&gt;FounderMath (startup calculators)&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Xiaomi&lt;/td&gt;
&lt;td&gt;WaitlistKit (viral waitlists)&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total: 467 commits, 7 live websites, 130 blog posts. In 24 hours.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kimi forgot its own work
&lt;/h2&gt;

&lt;p&gt;This is the story of the day.&lt;/p&gt;

&lt;p&gt;Kimi's first session ran at 3 AM. It chose to build &lt;strong&gt;LogDrop&lt;/strong&gt;, a log analysis tool. It created identity files, a backlog, landing pages, pricing, a blog, and even a working MVP with a JSON log parser, search, filters, and CSV export.&lt;/p&gt;

&lt;p&gt;One problem: it put everything in a &lt;code&gt;startup/&lt;/code&gt; subfolder instead of the root directory.&lt;/p&gt;

&lt;p&gt;The orchestrator gives agents their memory between sessions by reading &lt;code&gt;PROGRESS.md&lt;/code&gt; from the root. When Kimi's second session started, there was no PROGRESS.md in root. The agent thought it was Day 1. It brainstormed a completely different idea. It built &lt;strong&gt;SchemaLens&lt;/strong&gt;, a SQL schema diff tool, from scratch.&lt;/p&gt;

&lt;p&gt;Kimi now has two half-built startups in the same repo. Its help request for LogDrop's domain is stuck in the subfolder where the orchestrator can't find it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One wrong directory = total memory loss between sessions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent didn't crash. It didn't throw an error. It just quietly forgot everything and started over with a different idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gemini wrote 104 blog posts
&lt;/h2&gt;

&lt;p&gt;Gemini has 8 sessions per day (the most of any agent). By end of Day 1, LocalLeads had 104 blog posts on local SEO topics. One blog post every 14 minutes.&lt;/p&gt;

&lt;p&gt;For comparison: Claude wrote 11. GLM wrote 5. Xiaomi wrote 1.&lt;/p&gt;

&lt;p&gt;The question for the rest of the race: does quantity beat quality?&lt;/p&gt;

&lt;h2&gt;
  
  
  Codex burned 26 Vercel deployments
&lt;/h2&gt;

&lt;p&gt;The orchestrator prompt said: "Your repo auto-deploys on every git push." This was meant as context. Codex read it as an instruction.&lt;/p&gt;

&lt;p&gt;It ran &lt;code&gt;git push&lt;/code&gt; after nearly every commit during its sessions. Each push triggered a Vercel deployment. By mid-afternoon, Codex had consumed 26 of the account's 100 daily deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: with autonomous agents, every sentence in the prompt is a potential instruction.&lt;/strong&gt; If you don't want them to do something, say so explicitly.&lt;/p&gt;

&lt;p&gt;We fixed it with three changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt update: "Do NOT run git push. The orchestrator pushes after your session."&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;vercel.json&lt;/code&gt; change to disable preview deployments (sketch after this list)&lt;/li&gt;
&lt;li&gt;Commit squashing (all session commits become one before pushing)&lt;/li&gt;
&lt;/ol&gt;
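
&lt;p&gt;For change 2, the shape of the &lt;code&gt;vercel.json&lt;/code&gt; fix looks roughly like this. The keys are an assumption based on Vercel's git configuration options, so verify against their docs before relying on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Limit automatic deployments to pushes on main.
cat &gt; vercel.json &lt;&lt;'EOF'
{
  "git": {
    "deploymentEnabled": {
      "main": true
    }
  }
}
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;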

&lt;h2&gt;
  
  
  GLM's quality approach
&lt;/h2&gt;

&lt;p&gt;GLM only had 2 sessions but made them count. FounderMath already has three working calculators: SAFE note calculator (all 4 YC SAFE types), dilution calculator, and runway calculator.&lt;/p&gt;

&lt;p&gt;It also submitted the best help request of any agent: clear format, backup plans for each item, budget specified, priority levels, and even suggested the DNS record type for the domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned on Day 1
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;File conventions are critical for agent memory.&lt;/strong&gt; One agent putting files in a subfolder caused total amnesia.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt wording is everything.&lt;/strong&gt; Context gets interpreted as instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared deployment limits are a real constraint.&lt;/strong&gt; 7 agents + 1 blog on one Vercel account = problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents without web search pick generic ideas.&lt;/strong&gt; The two agents running without web access (DeepSeek, Xiaomi) chose the most crowded markets.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Follow along
&lt;/h2&gt;

&lt;p&gt;Everything is public: code, costs, decisions, and progress.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aimadetools.com/race/" rel="noopener noreferrer"&gt;Live Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aimadetools.com/blog/race-day-1-results/" rel="noopener noreferrer"&gt;Full Day 1 writeup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aimadetools" rel="noopener noreferrer"&gt;GitHub repos&lt;/a&gt; (all 7 agent repos are public)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll be posting weekly recaps and daily highlights for the full 12 weeks. Would love to hear what you'd want to see tracked or compared.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Launch Day: 7 AI Agents Start Building Startups with $100 Each</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Mon, 20 Apr 2026 07:30:00 +0000</pubDate>
      <link>https://forem.com/ai_made_tools/launch-day-7-ai-agents-start-building-startups-with-100-each-5f8h</link>
      <guid>https://forem.com/ai_made_tools/launch-day-7-ai-agents-start-building-startups-with-100-each-5f8h</guid>
      <description>&lt;p&gt;I just launched an experiment: 7 AI coding agents each get $100 and 12 weeks to build a real startup from scratch. No human coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lineup
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟣 Claude&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Sonnet / Haiku&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟢 GPT&lt;/td&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;GPT-5.4 / Mini&lt;/td&gt;
&lt;td&gt;€23/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔵 Gemini&lt;/td&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;2.5 Pro / Flash&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔴 DeepSeek&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Reasoner / Chat&lt;/td&gt;
&lt;td&gt;~$25/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 Kimi&lt;/td&gt;
&lt;td&gt;Kimi CLI&lt;/td&gt;
&lt;td&gt;K2.5&lt;/td&gt;
&lt;td&gt;~$19/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟡 Xiaomi&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;MiMo V2 Pro&lt;/td&gt;
&lt;td&gt;~$25/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟤 GLM&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;GLM-5.1 / 4.7&lt;/td&gt;
&lt;td&gt;$18/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each agent autonomously picks an idea, writes code, deploys, and tries to get users and revenue.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned from 3 test runs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strategy &amp;gt; code quality.&lt;/strong&gt; Agents that planned distribution first outperformed agents that wrote better code. One agent (Kimi) planned a full Product Hunt launch before writing a single line of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple stacks win.&lt;/strong&gt; HTML + Tailwind deployed in hours. Next.js agents spent days on build errors. The deploy loop is the real bottleneck for AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context resets kill progress.&lt;/strong&gt; Without persistent state between sessions, agents repeat mistakes. I built an orchestrator with structured state files to solve this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tech
&lt;/h2&gt;

&lt;p&gt;A bash orchestrator manages everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cron-scheduled 30-minute sessions (2-8 per agent per day)&lt;/li&gt;
&lt;li&gt;Automatic git commits with &lt;code&gt;[skip ci]&lt;/code&gt; on mid-session commits&lt;/li&gt;
&lt;li&gt;Deploy verification via health checks (sketch after this list)&lt;/li&gt;
&lt;li&gt;Loop detection (same action 3x = force alternative)&lt;/li&gt;
&lt;li&gt;OpenRouter budget alerts via Discord&lt;/li&gt;
&lt;/ul&gt;
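
&lt;p&gt;The deploy verification step is a curl retry loop at heart. A minimal sketch; the retry count, timing, and example URL are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Health check: confirm an agent's site responds after a deploy.
check_deploy() {
  local url="$1"
  for attempt in 1 2 3; do
    curl -fsS --max-time 10 -o /dev/null "$url" &amp;&amp; return 0
    sleep 20   # give Vercel time to finish building
  done
  echo "ALERT: $url is failing its health check" &gt;&amp;2
  return 1
}

check_deploy "https://race-kimi.vercel.app"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;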

&lt;p&gt;All code is public on &lt;a href="https://github.com/aimadetools" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Follow along
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;Live Dashboard&lt;/a&gt; — real-time progress&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.aimadetools.com/race/compare" rel="noopener noreferrer"&gt;Daily Digest&lt;/a&gt; — hand-written daily updates&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.aimadetools.com/race/activity" rel="noopener noreferrer"&gt;Weekly Recaps&lt;/a&gt; — detailed analysis&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aimadetools.com/race/rules" rel="noopener noreferrer"&gt;Full Rules&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also launched on &lt;a href="https://www.producthunt.com/" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt; today.&lt;/p&gt;

&lt;p&gt;Which agent would you bet on?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>startup</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
