<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stephan Miller</title>
    <description>The latest articles on Forem by Stephan Miller (@eristoddle).</description>
    <link>https://forem.com/eristoddle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F18795%2F5f6c41b8-6033-4887-937a-2ebdfe623d2e.jpeg</url>
      <title>Forem: Stephan Miller</title>
      <link>https://forem.com/eristoddle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/eristoddle"/>
    <language>en</language>
    <item>
      <title>The Autoresearch Ecosystem - How One Repo Spawned 9 Different Types of AI Projects</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Mon, 04 May 2026 12:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/the-autoresearch-ecosystem-how-one-repo-spawned-9-different-types-of-ai-projects-335c</link>
      <guid>https://forem.com/eristoddle/the-autoresearch-ecosystem-how-one-repo-spawned-9-different-types-of-ai-projects-335c</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1ue5ldwtkl8pj549p12.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1ue5ldwtkl8pj549p12.jpg" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’d been messing around with &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;Karpathy’s autoresearch&lt;/a&gt; for a couple of weekends, mostly because I’m interested in letting agents do shit while I sleep and someone had finally formalized the pattern in 630 lines of Python. Run the loop, modify &lt;code&gt;train.py&lt;/code&gt;, train for five minutes, check val_bpb, keep or revert, repeat forever. Compounding gains while you’re not even at your desk.&lt;/p&gt;

&lt;p&gt;So I fired up GitHub search for “autoresearch” expecting to find a handful of ML forks. People porting it to their hardware, maybe a few hyperparameter tweaks. You know how that goes.&lt;/p&gt;

&lt;p&gt;I found nine distinct categories of project. Some brilliant. Some “why did you do this.” And a few that made me stop scrolling and think “oh, that’s actually the interesting idea here.” It turns out the original repo isn’t really about ML. It’s a pattern, and people figured that out pretty quickly.&lt;/p&gt;

&lt;p&gt;I’m going to walk through every category I found, what each one actually does differently, and what they tell us about where this whole thing is going. There are a lot of repos here, all linked.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Karpathy Actually Built&lt;/li&gt;
&lt;li&gt;
1. Platform Ports: Running It On Hardware You Actually Own

&lt;ul&gt;
&lt;li&gt;GPU Cluster Scaling&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

2. ML Research Enhancers: Making the Loop Smarter

&lt;ul&gt;
&lt;li&gt;Memory-Enhanced Researchers&lt;/li&gt;
&lt;li&gt;Bayesian + Active Inference&lt;/li&gt;
&lt;li&gt;Multi-GPU Infrastructure&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

3. Prompt Optimizers: Same Loop, Different Target File

&lt;ul&gt;
&lt;li&gt;autoresearch-prompt-optimization (az9713)&lt;/li&gt;
&lt;li&gt;autoresearch-for-agents (Galileo)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

4. Generalized Frameworks: Autoresearch For Anything

&lt;ul&gt;
&lt;li&gt;uditgoenka/autoresearch — Claude Code Skill&lt;/li&gt;
&lt;li&gt;autoresearch-anything (zkarimi22)&lt;/li&gt;
&lt;li&gt;menonpg/autoloop — The pip Package&lt;/li&gt;
&lt;li&gt;krzysztofdudek/ResearcherSkill — One File, Full Discipline&lt;/li&gt;
&lt;li&gt;alfonsograziano/auto-agent — Autoresearch Builds Agents&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

5. Production Codebase Optimization: Autoresearch on Real OSS

&lt;ul&gt;
&lt;li&gt;More Production War Stories&lt;/li&gt;
&lt;li&gt;idealo Search Ranking&lt;/li&gt;
&lt;li&gt;Tennis XGBoost — The Reward Hacking Cautionary Tale&lt;/li&gt;
&lt;li&gt;Vesuvius Challenge Ink Detection&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;6. Agent Factory: Autoresearch Builds Agents&lt;/li&gt;

&lt;li&gt;

7. Research OS / Skills Systems: Institutionalizing the Pattern

&lt;ul&gt;
&lt;li&gt;PhD-Zero (TenureAI)&lt;/li&gt;
&lt;li&gt;alirezarezvani/claude-skills&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

8. Creative Writing: Autoresearch For Prose and Fiction

&lt;ul&gt;
&lt;li&gt;redpen — Prose Refinement Engine&lt;/li&gt;
&lt;li&gt;NousResearch/autonovel — Complete Novel Pipeline&lt;/li&gt;
&lt;li&gt;sinfiny/Auto-Creative-Reasoning&lt;/li&gt;
&lt;li&gt;CalvinMagezi/self-evolving-skill — Brand Document Evolution&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

9. Meta-Pattern: Wrapping Autoresearch as a Worker

&lt;ul&gt;
&lt;li&gt;The Problem with Solo Autoresearch&lt;/li&gt;
&lt;li&gt;The Fix: 3 Files, 4 Subagents&lt;/li&gt;
&lt;li&gt;What Actually Broke In Production&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;So What Does This Actually Mean?&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Karpathy Actually Built
&lt;/h2&gt;

&lt;p&gt;Before we go through the derivatives, let’s look at the original. The repo is small and the loop is dumb on purpose:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read &lt;code&gt;program.md&lt;/code&gt; (the meta-skill that tells the agent how to be a researcher)&lt;/li&gt;
&lt;li&gt;Modify &lt;code&gt;train.py&lt;/code&gt; with a small, reviewable diff&lt;/li&gt;
&lt;li&gt;Train for ~5 minutes on one GPU&lt;/li&gt;
&lt;li&gt;Check val_bpb (validation bits per byte — the metric)&lt;/li&gt;
&lt;li&gt;If it improved, commit. If it regressed, &lt;code&gt;git reset --hard&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Goto 1.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it. About 100 experiments overnight on a single H100 while you sleep. Git is the memory. The flat TSV file is the search log. The mechanical metric (val_bpb) means there’s no judgment call about whether something worked.&lt;/p&gt;
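
&lt;p&gt;Here’s roughly what that loop looks like as code. This is my own sketch, not Karpathy’s actual implementation: &lt;code&gt;agent_propose_diff&lt;/code&gt; is a hypothetical stand-in for whatever agent harness you’re running, and it assumes &lt;code&gt;train.py&lt;/code&gt; prints the final val_bpb on its last line.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def run_training():
    # ~5 minutes on one GPU; assumes train.py prints val_bpb on its last line.
    out = subprocess.run(["python", "train.py"], capture_output=True, text=True, check=True)
    return float(out.stdout.strip().splitlines()[-1])

def log_result(score, kept):
    # The flat TSV is the search log: one row per experiment.
    with open("results.tsv", "a") as f:
        f.write(f"{score:.4f}\t{'keep' if kept else 'revert'}\n")

best = run_training()
while True:
    agent_propose_diff("train.py", meta_skill="program.md")  # hypothetical agent call
    score = run_training()
    if score &lt; best:  # lower bits per byte is better
        best = score
        subprocess.run(["git", "commit", "-am", f"val_bpb {score:.4f}"], check=True)
        log_result(score, kept=True)
    else:
        subprocess.run(["git", "reset", "--hard"], check=True)  # automatic rollback
        log_result(score, kept=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;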

&lt;p&gt;The main idea is that &lt;strong&gt;constraint enables autonomy&lt;/strong&gt;. The diffs are small, so they’re reviewable. The metric is mechanical, so the agent can’t argue with it. The rollback is automatic, so a bad experiment can’t poison the next one. You’re giving it a cheap way to test things and a cheap way to undo them, and letting it run. Not asking it to be smart.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;program.md&lt;/code&gt; is what Karpathy calls the meta-skill. Humans don’t program the training run. They program the researcher that programs the training run. That’s the part that generalizes, and that’s the part everybody on GitHub immediately ran with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frh3epa29pv9qnup60g0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frh3epa29pv9qnup60g0l.png" alt="Karpathy's original screenshot showing val_bpb improvement curve" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Platform Ports: Running It On Hardware You Actually Own
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The “I don’t have an H100” forks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first thing that happened is what always happens. People without enterprise GPUs ported it to whatever they had lying around. These forks are the most faithful to the original but with the substrate swapped out.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/miolini/autoresearch-macos" rel="noopener noreferrer"&gt;&lt;code&gt;miolini/autoresearch-macos&lt;/code&gt;&lt;/a&gt; — straight macOS port using MPS backend&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/trevin-creator/autoresearch-mlx" rel="noopener noreferrer"&gt;&lt;code&gt;trevin-creator/autoresearch-mlx&lt;/code&gt;&lt;/a&gt; — Apple Silicon native, using MLX instead of PyTorch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/jsegov/autoresearch-win-rtx" rel="noopener noreferrer"&gt;&lt;code&gt;jsegov/autoresearch-win-rtx&lt;/code&gt;&lt;/a&gt; — Windows with RTX&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/lucasgelfond/autoresearch-webgpu" rel="noopener noreferrer"&gt;&lt;code&gt;lucasgelfond/autoresearch-webgpu&lt;/code&gt;&lt;/a&gt; — runs entirely in the browser using WebGPU. No Python setup. The whole research loop in a tab.&lt;/li&gt;
&lt;li&gt;A Colab/Kaggle T4 port (upstream issue #208) that swaps Flash Attention 3 for PyTorch SDPA so you can run experiments overnight on a free GPU&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ArmanJR-Lab/autoautoresearch" rel="noopener noreferrer"&gt;&lt;code&gt;ArmanJR-Lab/autoautoresearch&lt;/code&gt;&lt;/a&gt; — Jetson AGX Orin port with a “director” written in Go that injects novelty (arxiv papers, DeepSeek Reasoner output) when the loop gets stuck in local minima&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/supratikpm/gemini-autoresearch" rel="noopener noreferrer"&gt;&lt;code&gt;supratikpm/gemini-autoresearch&lt;/code&gt;&lt;/a&gt; — Gemini CLI native, with Google Search grounding plugged into the loop as a live verification source. True headless overnight mode via &lt;code&gt;--yolo --prompt&lt;/code&gt;. 1M token context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpathy himself endorsed several of these in the README and added hyperparameter tuning advice for smaller setups.&lt;/p&gt;

&lt;p&gt;The interesting ones in this group aren’t the “same thing on Mac” ports. They’re the ones that change the substrate enough to do something the original couldn’t. MLX on Apple Silicon is legitimately different compute. WebGPU means you can hand someone a URL instead of asking them to set up Python. The Jetson port is the only one trying to escape local minima with external novelty injection, which is the kind of thing the original loop has no concept of. And the Gemini port has Search grounding inside the loop, which means the agent can verify claims against the live web while it’s iterating.&lt;/p&gt;

&lt;p&gt;The Apple Silicon and WebGPU ports are the most useful if you don’t have data center hardware. The director-based Jetson fork is the most interesting if you care about where this pattern is heading. Most loops can hill-climb. Almost none of them can detect that they’re stuck and go grab a paper to read.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU Cluster Scaling
&lt;/h3&gt;

&lt;p&gt;The opposite direction. What happens if you give it 16 GPUs instead of one?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.skypilot.co/scaling-autoresearch/" rel="noopener noreferrer"&gt;SkyPilot wrote it up&lt;/a&gt;. They gave autoresearch access to a 16-GPU Kubernetes cluster, ran it for 8 hours, and let it figure out how to use the resources.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~910 experiments in 8 hours&lt;/li&gt;
&lt;li&gt;val_bpb dropped from 1.003 to 0.974 (a 2.87% improvement, which sounds small but is enormous for an LM at this scale)&lt;/li&gt;
&lt;li&gt;9x faster than a simulated sequential baseline to reach the same result&lt;/li&gt;
&lt;li&gt;The agent taught itself to use H200s for validation and screen ideas on cheaper H100s. Nobody told it to do that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing that surprised me was how the search behavior changed with parallelism. Sequential autoresearch is greedy hill-climbing: try one thing, keep or discard, try the next. Parallel autoresearch starts running factorial grids of 10-13 experiments per wave. It catches interaction effects between parameters that single-axis tweaking would never find. Two changes that look mediocre alone can be great together. You can’t see that one-at-a-time.&lt;/p&gt;
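
&lt;p&gt;A minimal sketch of what one parallel wave might look like, assuming a hypothetical &lt;code&gt;launch_on_cluster&lt;/code&gt; that dispatches a training run to a free GPU and returns val_bpb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from itertools import product
from concurrent.futures import ThreadPoolExecutor

# One wave: every combination across two axes instead of one tweak at a time.
learning_rates = [3e-4, 6e-4, 1e-3]
warmup_steps = [100, 250, 500, 1000]
grid = list(product(learning_rates, warmup_steps))  # 12 experiments per wave

def run_experiment(cfg):
    lr, warmup = cfg
    return launch_on_cluster(lr=lr, warmup=warmup)  # hypothetical dispatcher

with ThreadPoolExecutor(max_workers=len(grid)) as pool:
    scores = list(pool.map(run_experiment, grid))

# Interaction effects show up here: a (lr, warmup) pair can win even when
# neither value wins on its own axis.
best_cfg, best_score = min(zip(grid, scores), key=lambda pair: pair[1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;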

&lt;p&gt;This is the version that stops looking like a hobby project. If your metric is fast and your discard mechanism is reliable, more compute really does just turn into more answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. ML Research Enhancers: Making the Loop Smarter
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The “the flat TSV is not enough” camp&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These forks all keep the loop intact but argue that the agent’s memory is too primitive. A TSV with one row per experiment doesn’t carry the right information forward. So they bolt on cognitive architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory-Enhanced Researchers
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/tonitangpotato/autoresearch-engram" rel="noopener noreferrer"&gt;&lt;code&gt;tonitangpotato/autoresearch-engram&lt;/code&gt;&lt;/a&gt; plugs the Engram cognitive memory library into the loop. It’s neuroscience-grounded: ACT-R activation, Hebbian learning, Ebbinghaus forgetting. RECALL and STORE steps wrap around the existing loop.&lt;/p&gt;

&lt;p&gt;The numbers from a long-running instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After 50 experiments, the agent recognizes patterns like “architecture changes outperform optimizer tweaks in this regime”&lt;/li&gt;
&lt;li&gt;After 100, it knows the optimal architecture for your specific compute budget&lt;/li&gt;
&lt;li&gt;One production deployment is at 3,846 memories, 230,103 recalls, 12,510 Hebbian links&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What that buys you, supposedly, is research intuition. Not “this worked” but “here’s why and here’s the pattern.” The thing that made human researchers good was never their willingness to try lots of things. It was the priors they built up about what was worth trying.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bayesian + Active Inference
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/ErikDeBruijn/autoresearcher2" rel="noopener noreferrer"&gt;&lt;code&gt;ErikDeBruijn/autoresearcher2&lt;/code&gt;&lt;/a&gt; is the most ambitious one I found. The whole flat results log gets replaced with a Bayesian generative model. Then he piles on Friston’s active inference, Wozniak’s learntropy, and Schmidhuber’s compression progress. The agent doesn’t just ask “was this experiment good?” It asks “which of my latent beliefs was wrong?”&lt;/p&gt;

&lt;p&gt;Four additions to the original loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generative model over experiment outcomes&lt;/li&gt;
&lt;li&gt;Policy evaluation via Expected Free Energy&lt;/li&gt;
&lt;li&gt;Learntropy appraisal module&lt;/li&gt;
&lt;li&gt;Persistent memory with decay dynamics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s been validated on synthetic environments where it beats random and greedy baselines. There’s an evidence-quality comparison run in progress on an RTX PRO 6000 Blackwell against vanilla autoresearch. The repo also has a &lt;code&gt;CONSTITUTION.md&lt;/code&gt; because the project is partially about whether recursive self-improvement can deepen judgment, not just power.&lt;/p&gt;

&lt;p&gt;The interesting distinction is structural insight (“RoPE matters more than the optimizer in this regime”) versus flat knowledge (“RoPE improved val_bpb by 0.02”). The flat version doesn’t compose. The structural version does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-GPU Infrastructure
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/iii-hq/n-autoresearch" rel="noopener noreferrer"&gt;&lt;code&gt;iii-hq/n-autoresearch&lt;/code&gt;&lt;/a&gt; keeps the loop and replaces the plumbing. Out goes bash + git + TSV. In comes structured KV state, a REST API, and crash recovery. Multi-GPU parallel experiments via iii-engine (Python orchestrator + Rust GPU workers). Cross-machine GPU workers.&lt;/p&gt;

&lt;p&gt;The clever part is the adaptive search strategy. The loop has phases (explore, exploit, combine, ablation) and it auto-transitions based on history. There’s also near-miss detection for when two recent experiments combined would probably work even though neither alone did.&lt;/p&gt;

&lt;p&gt;Honestly, this is the “what if you scaled it to a real research lab” fork. If autoresearch becomes how labs actually run experiments, this is roughly what the production version looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Prompt Optimizers: Same Loop, Different Target File
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What if &lt;code&gt;train.py&lt;/code&gt; was your system prompt?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once you accept that the loop is substrate-agnostic, the next move is obvious. Point it at a prompt file. Use accuracy on a test set as the metric. Let it iterate.&lt;/p&gt;

&lt;h3&gt;
  
  
  autoresearch-prompt-optimization (az9713)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/az9713/autoresearch-prompt-optimization" rel="noopener noreferrer"&gt;az9713/autoresearch-prompt-optimization&lt;/a&gt; is the cleanest version of this. The loop targets &lt;code&gt;prompt.txt&lt;/code&gt; instead of &lt;code&gt;train.py&lt;/code&gt;. The metric is field extraction accuracy on 30 test examples instead of val_bpb. Everything else is the same.&lt;/p&gt;

&lt;p&gt;The numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;74.72% → 100% accuracy in 8 experiments&lt;/li&gt;
&lt;li&gt;Zero human intervention&lt;/li&gt;
&lt;li&gt;Experiment 5 regressed and got auto-discarded: the loop caught it exactly as designed&lt;/li&gt;
&lt;li&gt;Cross-model: Claude Opus writes the prompts that Gemini 2.5 Flash executes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing prompt engineering has always been missing is a tight feedback signal. Most people write a prompt, eyeball some outputs, decide it “looks better.” Autoresearch makes prompt engineering a numerical optimization problem. Reading &lt;code&gt;last_run.json&lt;/code&gt; after each iteration turns prompt writing from art into engineering. That’s a real shift.&lt;/p&gt;
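
&lt;p&gt;The metric is simple enough to sketch. Something like this, where &lt;code&gt;call_model&lt;/code&gt; is a hypothetical stand-in for the Gemini call and &lt;code&gt;tests.jsonl&lt;/code&gt; holds the 30 examples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def accuracy(prompt_path="prompt.txt", tests_path="tests.jsonl"):
    # The mechanical metric: fraction of examples where the extracted
    # fields exactly match the expected ones.
    prompt = open(prompt_path).read()
    correct = total = 0
    for line in open(tests_path):
        case = json.loads(line)
        got = call_model(prompt, case["input"])  # hypothetical LLM call
        correct += int(got == case["expected"])
        total += 1
    return correct / total

# Same keep-or-revert decision as the ML loop, just a different target file:
# edit prompt.txt, re-run accuracy(), commit on improvement, reset on regression.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;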

&lt;h3&gt;
  
  
  autoresearch-for-agents (Galileo)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/rungalileo/autoresearch-for-agents" rel="noopener noreferrer"&gt;&lt;code&gt;rungalileo/autoresearch-for-agents&lt;/code&gt;&lt;/a&gt; is more ambitious. They’re using the loop for adversarial testing plus prompt optimization on support agents.&lt;/p&gt;

&lt;p&gt;Two phases. Phase 1 builds a frozen adversarial test suite (the exam). Phase 2 optimizes the prompt against that frozen suite (the studying). Separating the exam from the studying stops the optimizer from moving the goalposts.&lt;/p&gt;

&lt;p&gt;The other clever bit is proportional scoring instead of binary pass/fail. Binary scores give the optimizer no gradient. “70% of the way there” is a signal you can climb. “Failed” isn’t.&lt;/p&gt;
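
&lt;p&gt;A toy version of the difference, with substring checks standing in for Galileo’s real scorers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def binary_score(output, required_facts):
    # Pass/fail: the optimizer gets no gradient to climb.
    return 1.0 if all(f in output for f in required_facts) else 0.0

def proportional_score(output, required_facts):
    # Partial credit: "70% of the way there" is a direction, not a verdict.
    hits = sum(f in output for f in required_facts)
    return hits / len(required_facts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;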

&lt;p&gt;Results: 0.05 → 0.80 accuracy in 15 experiments. They also documented the limits of what prompt engineering alone can fix. Things like absence detection (“the customer didn’t mention X”) and off-by-one date math just don’t get solved by tweaking the prompt. That’s a useful negative result. Most write-ups about prompt optimization conveniently skip the part where they hit a wall.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Generalized Frameworks: Autoresearch For Anything
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“Wait, this works for any measurable thing”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the category that broke containment. Once a few people had ported the loop to prompts, the next move was to extract the pattern entirely. The result is a bunch of frameworks that don’t care what file you’re optimizing or what metric you’re using.&lt;/p&gt;

&lt;h3&gt;
  
  
  uditgoenka/autoresearch — Claude Code Skill
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/uditgoenka/autoresearch" rel="noopener noreferrer"&gt;uditgoenka/autoresearch&lt;/a&gt; packages the loop as a Claude Code skill. You install it, you run &lt;code&gt;/autoresearch&lt;/code&gt;, and you point it at any task with a mechanical metric. The README runs through about a dozen domains: test coverage, bundle size, TypeScript error count, SQL query speed, HR policy readability, Dockerfile size, accessibility audits, sales copy, marketing content. There’s also &lt;code&gt;/loop N&lt;/code&gt; integration for bounded iterations.&lt;/p&gt;

&lt;p&gt;It also documents how to wire MCP servers (PostgreSQL, GitHub, Stripe) as verification sources. So your “metric” can be a query against your actual production database, not a fixture.&lt;/p&gt;

&lt;p&gt;This is the version that makes the generalization explicit. The loop works for anything with constraint plus metric plus fast verification.&lt;/p&gt;

&lt;h3&gt;
  
  
  autoresearch-anything (zkarimi22)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/zkarimi22/autoresearch-anything" rel="noopener noreferrer"&gt;zkarimi22/autoresearch-anything&lt;/a&gt; is the lowest-friction setup I’ve seen. You run &lt;code&gt;npx autoresearch-anything&lt;/code&gt; and it interrogates you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What file should I edit?&lt;/li&gt;
&lt;li&gt;What metric am I optimizing?&lt;/li&gt;
&lt;li&gt;How do I run the eval?&lt;/li&gt;
&lt;li&gt;What’s off-limits?&lt;/li&gt;
&lt;li&gt;A few more along those lines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It outputs &lt;code&gt;setup.md&lt;/code&gt; and &lt;code&gt;eval.js&lt;/code&gt; and you’re running. Eight questions and you have a configured autoresearch loop pointed at your project.&lt;/p&gt;

&lt;h3&gt;
  
  
  menonpg/autoloop — The pip Package
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/menonpg/autoloop" rel="noopener noreferrer"&gt;menonpg/autoloop&lt;/a&gt; is the first one that’s actually a Python library. &lt;code&gt;pip install autoloop-ai&lt;/code&gt;, import, and the API is clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autoloop&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoLoop&lt;/span&gt;

&lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AutoLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/optimize_me.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;run_benchmark&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;directives&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make this faster, don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t break tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;budget_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;experiments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parallel experiments via &lt;code&gt;loop.run(parallel=4)&lt;/code&gt;. Warm starts. Composite metrics with weights. Agent-agnostic: works with Claude, Codex, Ollama local models. CLI tools for inspecting history (&lt;code&gt;autoloop history&lt;/code&gt;, &lt;code&gt;autoloop best&lt;/code&gt;, &lt;code&gt;autoloop diff 12 best&lt;/code&gt;, &lt;code&gt;autoloop rollback 12&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The demo shows a 6.9x speedup on a fibonacci function in 4 experiments, and the framework auto-detected and discarded the broken iterations.&lt;/p&gt;

&lt;p&gt;This one’s for you if you want autoresearch as a library you import rather than a skill you invoke. The bar is “have a Python function that returns a float” and you’re in. That’s about as low as it gets.&lt;/p&gt;

&lt;h3&gt;
  
  
  krzysztofdudek/ResearcherSkill — One File, Full Discipline
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/krzysztofdudek/ResearcherSkill" rel="noopener noreferrer"&gt;krzysztofdudek/ResearcherSkill&lt;/a&gt; is interesting because it ignores the framework race entirely. It’s one &lt;code&gt;researcher.md&lt;/code&gt; file you drop into any AI agent. Before doing anything, the agent interviews you: goal, metric, constraints, time limit, stopping conditions.&lt;/p&gt;

&lt;p&gt;It creates a &lt;code&gt;.lab/&lt;/code&gt; directory (gitignored) for experiment history that survives code reverts. That’s separate from git on purpose. You don’t want a &lt;code&gt;git reset --hard&lt;/code&gt; to wipe your experiment log.&lt;/p&gt;

&lt;p&gt;The loop has three phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;THINK&lt;/strong&gt; — mandatory written analysis before each experiment, logged separately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TEST&lt;/strong&gt; — commit, run, keep or revert&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;REFLECT&lt;/strong&gt; — log entry in &lt;code&gt;log.md&lt;/code&gt;, row in &lt;code&gt;results.tsv&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are also convergence guardrails baked in. Three discards in a row = mandatory pause. Five discards = force branch fork. Plateau for 8+ experiments = invert assumptions.&lt;/p&gt;
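
&lt;p&gt;The guardrails are easy to picture as code. My paraphrase of the README’s rules, assuming a history list with one &lt;code&gt;(score, kept)&lt;/code&gt; entry per experiment and lower scores better:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def next_action(history):
    # history: list of (score, kept) tuples, newest last.
    streak = 0
    for _, kept in reversed(history):
        if kept:
            break
        streak += 1
    if streak &gt;= 5:
        return "fork-branch"       # five discards: force a branch fork
    if streak &gt;= 3:
        return "mandatory-pause"   # three in a row: stop and write analysis
    # Plateau: no new best score in the last 8 experiments.
    if len(history) &gt; 8:
        best_before = min(score for score, _ in history[:-8])
        if all(score &gt;= best_before for score, _ in history[-8:]):
            return "invert-assumptions"
    return "continue"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;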

&lt;p&gt;The interesting part is THINK. Most autoresearch implementations skip written analysis. The agent just runs. Forcing it to write down what it expects to happen &lt;em&gt;before&lt;/em&gt; running changes what it tries. The README claims “10 minutes of analysis can prevent 5 wasted experiments,” which I believe.&lt;/p&gt;

&lt;p&gt;There’s also a “thought experiment” type that lets the agent log analysis without running code. It counts as a row in the results, just labeled &lt;code&gt;thought&lt;/code&gt;. That’s a small detail and it matters more than it should.&lt;/p&gt;

&lt;h3&gt;
  
  
  alfonsograziano/auto-agent — Autoresearch Builds Agents
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/alfonsograziano/auto-agent" rel="noopener noreferrer"&gt;alfonsograziano/auto-agent&lt;/a&gt; is autoresearch turned on AI agents themselves. You give it a target agent (in a separate repo) and a golden dataset of expected input/output pairs. The orchestrator spawns Claude Code or Kiro CLI inside the target repo, has it analyze failures, implement fixes, and re-run.&lt;/p&gt;

&lt;p&gt;Two repos: orchestrator and target. &lt;code&gt;MEMORY.md&lt;/code&gt; persists across hypotheses (what worked, what didn’t, known blockers). Each hypothesis gets its own git branch and its own &lt;code&gt;REPORT.md&lt;/code&gt; with before/after metrics and a &lt;code&gt;CONTINUE&lt;/code&gt; or &lt;code&gt;ROLLBACK&lt;/code&gt; decision. After a run, &lt;code&gt;npm run generate-changelog&lt;/code&gt; produces a human-readable summary.&lt;/p&gt;

&lt;p&gt;This is recursive in a way that’s very interesting. The thing being optimized is an AI agent. The thing doing the optimizing is also an AI agent. The metric is how often the target hits the golden set. You’re using autoresearch to make agents better at the things you created them for.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Production Codebase Optimization: Autoresearch on Real OSS
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Shopify used it on the Liquid template engine&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where the pattern stops being a demo. Shopify ran autoresearch against the Liquid template engine, the thing that renders every theme on Shopify, and shipped the results.&lt;/p&gt;

&lt;p&gt;The setup is in &lt;a href="https://github.com/Shopify/liquid/blob/2543fdc1a101f555db208fb0deeb2e3bf1ae9e36/auto/autoresearch.md" rel="noopener noreferrer"&gt;auto/autoresearch.md&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark: ThemeRunner (real Shopify theme templates, not synthetic)&lt;/li&gt;
&lt;li&gt;Metric: combined parse + render time in microseconds (primary), allocations (secondary)&lt;/li&gt;
&lt;li&gt;Constraints: tests must pass, no new gem dependencies, semantic correctness preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results across 17 tracked experiments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;7,374µs → 4,815µs (-34%)&lt;/li&gt;
&lt;li&gt;62,620 → 37,355 allocations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent’s techniques included replacing regex with manual byte parsing, fast-path variable parsing, and short-circuit checks for common cases. None of it is rocket science. It’s the kind of optimization a senior developer would do given enough time and a good profiler. The agent just had cheap iteration and an automatic discard for anything that broke a test.&lt;/p&gt;

&lt;h3&gt;
  
  
  More Production War Stories
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Real companies, real metrics, real prod deploys&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once Shopify went public with theirs, more case studies surfaced.&lt;/p&gt;

&lt;h3&gt;
  
  
  idealo Search Ranking
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://medium.com/idealo-tech-blog/one-hour-37-faster-applying-autoresearch-to-our-search-ranking-inference-endpoint-34cffc08e373" rel="noopener noreferrer"&gt;The idealo team&lt;/a&gt; (Atakan Filgöz, Gena Shabanov, Arjun Roy Choudhury) ran autoresearch against &lt;code&gt;preprocess.py&lt;/code&gt; in their Learning-to-Rank inference endpoint. They added a correctness constraint that required bit-for-bit identical output between the original and optimized version, then optimized for average latency over 500 benchmark iterations.&lt;/p&gt;

&lt;p&gt;Numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;13 experiments in 1 hour&lt;/li&gt;
&lt;li&gt;10 kept, 3 reverted&lt;/li&gt;
&lt;li&gt;Preprocessing latency: 3.9ms → 0.66ms (83% reduction, 5.9x speedup)&lt;/li&gt;
&lt;li&gt;End-to-end production latency: 46ms → 28.8ms (37% reduction at 250+ req/sec)&lt;/li&gt;
&lt;li&gt;Total cost: ~$7 in Claude Opus on AWS Bedrock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For seven dollars and an hour of supervision, they took 37% off a production endpoint that’s serving 250+ req/sec. That’s an absurd ROI.&lt;/p&gt;

&lt;p&gt;The techniques the agent found: shared computation (sort once, derive everything else), algorithmic shortcuts for sorted arrays, minimal allocations. The agent reasoned like a profiler: “the ranking computation takes 40% of total time, focus there next.” They watched it work, occasionally steered it, and shadow-tested before shipping. It’s now in production.&lt;/p&gt;

&lt;p&gt;The honest detail in the writeup is that the agent’s code was clean at 13 experiments but they suspect longer runs would over-engineer. That tracks with my experience using AI tools for refactoring. The first dozen suggestions are gold. By suggestion 50 it’s pattern-matching to “more abstraction must be better” and you have to slap its hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tennis XGBoost — The Reward Hacking Cautionary Tale
&lt;/h3&gt;

&lt;p&gt;This is the one nobody mentions when they’re hyping the pattern. &lt;a href="https://nickoak.com/posts/tennis-xgboost-autoresearch/" rel="noopener noreferrer"&gt;Nick Oak&lt;/a&gt; ran autoresearch on a tennis match prediction XGBoost model. The agent found a way to game the metric without actually improving the model. He preserved the embarrassing iterations on an &lt;code&gt;archived/gamed-iterations&lt;/code&gt; branch so you can read what the agent did.&lt;/p&gt;

&lt;p&gt;The discard mechanism only saves you if your metric is measuring what you actually care about. If your eval can be gamed, the agent will game it. This is not an RL-only problem. Reward hacking shows up everywhere there’s an automated optimizer, and autoresearch is exactly that.&lt;/p&gt;

&lt;p&gt;The takeaway isn’t “autoresearch is dangerous.” It’s “your metric is now a load-bearing piece of software and you should treat it that way.” Spend more time on the eval than on the loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vesuvius Challenge Ink Detection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scrollprize.substack.com/p/we-are-cooking" rel="noopener noreferrer"&gt;Vesuvius Challenge ran a multi-agent autoresearch loop&lt;/a&gt; for ink detection on ancient scrolls, focused on cross-scroll generalization. I haven’t dug deep into this one, but it’s worth knowing that autoresearch is currently being used to read 2,000-year-old burned scrolls. That’s a thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Agent Factory: Autoresearch Builds Agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Applying the loop to creating other agents&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Dominien/agent-factory" rel="noopener noreferrer"&gt;Dominien/agent-factory&lt;/a&gt; takes the meta move further than auto-agent. Instead of optimizing an existing agent, it autonomously researches problems and builds new specialized agents to solve them.&lt;/p&gt;

&lt;p&gt;The loop is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt;: Reddit, HN, GitHub, Twitter — find real problems people have&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: Venture Score plus TAM estimate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt;: Next.js agent from a seed template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt;: against synthetic users / actual usage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ship&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repeat&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There’s a threshold ratchet. The bar to ship keeps rising as the system finds better ideas. So the things it builds get better over time, not because the agent is smarter, but because it’s competing against its own previous best.&lt;/p&gt;
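
&lt;p&gt;The ratchet itself is a few lines, sketched here with a hypothetical &lt;code&gt;deploy&lt;/code&gt; step:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;ship_bar = 0.0  # rises every time something ships

def maybe_ship(agent, score):
    global ship_bar
    if score &gt; ship_bar:
        deploy(agent)      # hypothetical: push the new agent live
        ship_bar = score   # the next idea has to beat this one
        return True
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;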

&lt;p&gt;Agents shipped so far: freelancer-deduction-finder, wage-rights-advisor, data-broker-opt-out, property-tax-appeal-advisor. Twenty agents and counting.&lt;/p&gt;

&lt;p&gt;This is the meta-loop concept and I find it disorienting. Research quality compounds the same way training quality does. A loop that researches problems, builds solutions, ships, and uses ship-ability as the metric will eventually outpace anyone manually doing the same thing. Whether the agents it ships are any good is the open question. But the &lt;em&gt;number&lt;/em&gt; keeps going up.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Research OS / Skills Systems: Institutionalizing the Pattern
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What if autoresearch was the entire research methodology?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If autoresearch is going to actually be how research gets done, somebody has to build the scaffolding around it. Two projects are going hard at this.&lt;/p&gt;

&lt;h3&gt;
  
  
  PhD-Zero (TenureAI)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/TenureAI/PhD-Zero" rel="noopener noreferrer"&gt;TenureAI/PhD-Zero&lt;/a&gt; is an operating system for research-oriented coding agents. Modular skill library: run-governor, research-workflow, deep-research, experiment-execution, memory-manager, human-checkpoint, paper-writing.&lt;/p&gt;

&lt;p&gt;Cross-runtime: same skills exposed to Codex (via AGENTS.md) and Claude Code (via .claude/skills/). The focus is reproducibility, literature review, experiment planning. Discipline around the process.&lt;/p&gt;

&lt;p&gt;This is the thing that turns autoresearch from “fun overnight experiment” into something that could plausibly be used by a real research group. The autoresearch loop runs experiments. PhD-Zero runs the literature review, the writeup, the human checkpoints, the reproducibility checks. The loop is one verb in a much bigger vocabulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  alirezarezvani/claude-skills
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/alirezarezvani/claude-skills/tree/main/engineering" rel="noopener noreferrer"&gt;alirezarezvani/claude-skills&lt;/a&gt; is a 204-skill library for AI coding agents, with autoresearch-agent as one skill in the engineering tier. Works across Claude Code, Codex, Gemini CLI, Cursor, Aider, Windsurf — eleven tools total.&lt;/p&gt;

&lt;p&gt;Treating autoresearch as a reusable skill component rather than a standalone repo is an important move. It means your agent uses autoresearch the way it uses anything else: as a tool you reach for when the situation calls for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Creative Writing: Autoresearch For Prose and Fiction
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The thing nobody expected: it works on writing too&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the one I want to come back to in another post. The transfer is straightforward. If you can score a draft, you can run the loop. The metric just needs to be cheap, mechanical, and not gameable. (See the tennis cautionary tale.)&lt;/p&gt;

&lt;p&gt;Multiple projects figured this out independently within a few weeks of each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  redpen — Prose Refinement Engine
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/itspikabubu/redpen" rel="noopener noreferrer"&gt;itspikabubu/redpen&lt;/a&gt; is a ratchet loop for blog posts and writing. Drafts can only get better, never worse. Six AI personas score on different dimensions: seed founder, fellow GP, LP allocator, LinkedIn reader, HN skeptic, VC Twitter. Each persona runs three times and the scores are medianed for noise reduction.&lt;/p&gt;

&lt;p&gt;The writer agent makes one surgical edit targeting the weakest dimension. Re-evaluate. If the minimum score improved, keep. If not, discard and revert. Repeat until target score or max iterations.&lt;/p&gt;
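
&lt;p&gt;The scoring-plus-ratchet mechanics look something like this. A sketch, not redpen’s actual code: &lt;code&gt;score_draft&lt;/code&gt; and &lt;code&gt;edit_fn&lt;/code&gt; stand in for its LLM-judge and writer-agent calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from statistics import median

PERSONAS = ["seed founder", "fellow GP", "LP allocator",
            "LinkedIn reader", "HN skeptic", "VC Twitter"]

def evaluate(draft):
    # Each persona scores three times; the median knocks down judge noise.
    return {p: median(score_draft(draft, persona=p) for _ in range(3))
            for p in PERSONAS}

def ratchet(draft, edit_fn, max_iters=20):
    scores = evaluate(draft)
    for _ in range(max_iters):
        weakest = min(scores, key=scores.get)
        candidate = edit_fn(draft, target=weakest)  # one surgical edit
        new_scores = evaluate(candidate)
        if min(new_scores.values()) &gt; min(scores.values()):
            draft, scores = candidate, new_scores   # keep: it can only get better
        # otherwise discard the edit and try a different angle
    return draft
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;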

&lt;p&gt;You can configure voice: tone spectrum, blacklist words, a 16-point natural prose rubric. I have not tried this yet but I’m planning to. If it works, it solves the thing every blogger struggles with: I can tell a draft is bad, but I can’t always tell &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  NousResearch/autonovel — Complete Novel Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/NousResearch/autonovel" rel="noopener noreferrer"&gt;NousResearch/autonovel&lt;/a&gt; is the most ambitious creative writing fork. Full autonomous novel pipeline: seed concept → world bible → characters → outline → draft chapters → revision → export.&lt;/p&gt;

&lt;p&gt;Five co-evolving layers: voice, world, characters, outline, and chapters, with canon cross-cutting all of them. Two evaluation systems running in parallel: mechanical (regex bans for AI clichés, slop forensics) and LLM-judge (prose quality, voice adherence). Phase 3b sends the full manuscript to Claude Opus for a dual-persona review (literary critic + professor of fiction) and the loop continues until the reviewer’s complaints are mostly “qualified hedges rather than real problems.” Their phrase, not mine.&lt;/p&gt;

&lt;p&gt;There’s also an art pipeline (fal.ai), multi-voice audiobook (ElevenLabs), LaTeX typesetting, ePub generation, landing page.&lt;/p&gt;

&lt;p&gt;The first novel produced is &lt;em&gt;The Second Son of the House of Bells&lt;/em&gt;. 79,456 words. 19 chapters (down from 24: the loop did four structural merges). Six rounds of Opus review.&lt;/p&gt;

&lt;p&gt;The loop improved prose and changed the structure of the book. We talk about autoresearch like it’s a fine-grained optimizer, but at long enough horizons, it’s making editorial decisions a human would make.&lt;/p&gt;

&lt;h3&gt;
  
  
  sinfiny/Auto-Creative-Reasoning
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/sinfiny/Auto-Creative-Reasoning-" rel="noopener noreferrer"&gt;sinfiny/Auto-Creative-Reasoning&lt;/a&gt; is benchmark-first. The repo motto is “generation is not the product. Evaluation is the product.” Rewrite ladders route failure to the right level: prose, scene, chapter, arc, premise. Rubrics score hook strength, strategy, clue fairness, consequence density, readability.&lt;/p&gt;

&lt;p&gt;There’s a Codex plugin for running benchmarked loops against existing fiction drafts. The long-term vision is multiple parallel novel timelines with competing chapter versions compared head-to-head.&lt;/p&gt;

&lt;p&gt;This is the version that argues evaluation is harder and more important than generation. Which is exactly the lesson from the tennis XGBoost story, ported to fiction.&lt;/p&gt;

&lt;h3&gt;
  
  
  CalvinMagezi/self-evolving-skill — Brand Document Evolution
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/CalvinMagezi/self-evolving-skill" rel="noopener noreferrer"&gt;CalvinMagezi/self-evolving-skill&lt;/a&gt; is the business-minded version. Autoresearch applied to &lt;code&gt;writing-strategy.md&lt;/code&gt; instead of &lt;code&gt;train.py&lt;/code&gt;. The metric is an LLM judge composite score on a fixed test brief, run three times at temperature=0 and medianed.&lt;/p&gt;

&lt;p&gt;The output is real documents: &lt;code&gt;.docx&lt;/code&gt;, &lt;code&gt;.pptx&lt;/code&gt;, &lt;code&gt;.pdf&lt;/code&gt; that match brand identity. Git history serves as memory; the loop reads &lt;code&gt;git log&lt;/code&gt; before each iteration to avoid repeating failed ideas. Works with any LLM via LiteLLM (OpenRouter, Gemini, OpenAI, Anthropic).&lt;/p&gt;

&lt;p&gt;This is the one with the clearest business case of the bunch. Companies actually need their documents to get better. They have brand rubrics. They have a fixed test brief in the form of “the next thing we need to write.” All the pieces are already there.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Meta-Pattern: Wrapping Autoresearch as a Worker
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What happens when autoresearch is just one layer of something bigger&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the one that snapped my view of the whole ecosystem into focus. alirezarezvani had been shipping autoresearch as a skill since March. A month of production use revealed &lt;a href="https://alirezarezvani.medium.com/the-orchestrator-was-missing-building-an-internal-research-agent-around-autoresearch-in-claude-678b08a83c9b" rel="noopener noreferrer"&gt;the missing piece&lt;/a&gt;: orchestration above it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Solo Autoresearch
&lt;/h3&gt;

&lt;p&gt;One context window and reasoning trajectory, with no isolation between investigation threads. A query like “what is X, who are the players, what are the limits, what changed in 6 months” becomes four tangled sub-questions sharing one bloated context. By the time you’re on sub-question 4, the context is thick with answers from 1-3, and synthesis drifts.&lt;/p&gt;

&lt;p&gt;This is something I hit constantly with Claude Code on big tasks. By the time the context is full of half-finished investigations, the model is reasoning about all of them at once, badly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix: 3 Files, 4 Subagents
&lt;/h3&gt;

&lt;p&gt;The whole rebuild is small:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLAUDE.md&lt;/strong&gt; — decomposition rules, including an “independence test” (a sub-question is independent if its answer wouldn’t change based on another sub-question in the same query)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;.mcp.json&lt;/strong&gt; — Firecrawl, Perplexity, internal docs server. Critically, scoped per-agent to avoid the token tax of loading all MCP tool descriptions into every context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 subagent definitions&lt;/strong&gt; — lead-researcher (orchestrator, no MCPs), web-searcher (invokes autoresearch inside its own context), internal-searcher, citation-checker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lead decomposes. Workers fan out in parallel. Each worker runs an autoresearch loop to convergence inside its own isolated context. Lead synthesizes. Citation-checker verifies every source. Wall-clock time ends up shorter than single-session autoresearch because the workers run in parallel.&lt;/p&gt;
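
&lt;p&gt;Stripped of the Claude Code specifics, the fan-out has a simple shape. A sketch with hypothetical &lt;code&gt;decompose&lt;/code&gt;, &lt;code&gt;run_worker&lt;/code&gt;, &lt;code&gt;citations_check&lt;/code&gt;, and &lt;code&gt;synthesize&lt;/code&gt; steps:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def research(query):
    # Lead decomposes, applying the independence test to each sub-question.
    subqs = decompose(query)
    # Each worker runs its own autoresearch loop in an isolated context.
    answers = await asyncio.gather(*(run_worker(q) for q in subqs))
    # Citation-checker verifies every source before synthesis.
    verified = [a for a in answers if citations_check(a)]
    return synthesize(query, verified)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;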

&lt;h3&gt;
  
  
  What Actually Broke In Production
&lt;/h3&gt;

&lt;p&gt;Four failure modes from the writeup, and they all rang bells:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator over-delegation&lt;/strong&gt; — without the independence test, the orchestrator was paying for parallel context windows to produce worse answers than one session would have&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP tool-description token tax&lt;/strong&gt; — every MCP server’s tool descriptions loading into every agent’s context. Scoping per-agent fixed it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Citation drift&lt;/strong&gt; — workers returning confident claims where the page didn’t quite support the paraphrase. Paraphrase drift, not hallucination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context amnesia between sessions&lt;/strong&gt; — a flat &lt;code&gt;lessons.md&lt;/code&gt; file the lead reads on startup is the imperfect fix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The lesson here is the one that rewires the whole picture. Autoresearch was already a strong worker. The orchestrator does nothing clever: decompose, delegate, synthesize. The intelligence is in the decomposition rules, and those took three rewrites to get right.&lt;/p&gt;

&lt;p&gt;So the future isn’t “smarter autoresearch.” It’s autoresearch as a primitive that other systems call into.&lt;/p&gt;

&lt;h2&gt;
  
  
  So What Does This Actually Mean?
&lt;/h2&gt;

&lt;p&gt;Karpathy didn’t just build an ML research tool. He demonstrated a pattern that works anywhere you can measure progress with a command: constraint plus mechanical metric plus autonomous iteration.&lt;/p&gt;

&lt;p&gt;Here are the categories ranked by fidelity to the original idea:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Platform ports&lt;/strong&gt; — most faithful. Same loop, different hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML enhancers&lt;/strong&gt; — extend the substrate. Memory, Bayesian updates, multi-GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt optimizers&lt;/strong&gt; — same loop, different file. &lt;code&gt;train.py&lt;/code&gt; → &lt;code&gt;prompt.txt&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalized frameworks&lt;/strong&gt; — extract the pattern. pip packages, Claude Code skills, “give me any metric.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production codebase&lt;/strong&gt; — industrial application. Shopify -34%, idealo -37% in 1 hour for $7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent factory&lt;/strong&gt; — meta-application. The loop builds other agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research OS&lt;/strong&gt; — institutionalization. The whole methodology, not just the loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative writing&lt;/strong&gt; — the surprise expansion. Prose, fiction, brand documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt; — autoresearch as worker, not the whole system.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A few honest takes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reward hacking problem is the cautionary tale nobody includes.&lt;/strong&gt; In the tennis XGBoost case, the loop found a way to improve the metric without improving the model. The discard mechanism is only as good as your metric. If your eval can be gamed, the agent will game it. Spend more time on the eval than on the loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern is more durable than the implementation.&lt;/strong&gt; Most of the forks I found were “what if we applied this to X” and they all worked. That’s kind of remarkable. The discard mechanism (&lt;code&gt;git reset&lt;/code&gt; on regression) is the key. You don’t need intelligence. You need iteration speed, a mechanical metric, and automatic rollback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Shopify and idealo case studies should embarrass you a little.&lt;/strong&gt; $7 of API and an hour of supervision took 37% off a production endpoint serving 250+ req/sec. There are perf wins like this in basically every codebase. We’re just not asking for them yet because we still think of optimization as expensive senior-engineer time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration eats the loop.&lt;/strong&gt; alirezarezvani’s piece shows that solo autoresearch is fine, but the next move is autoresearch as a worker that orchestrators call when a sub-question lands. That’s where this is heading and it’s already happening in production.&lt;/p&gt;

&lt;p&gt;If you’re not running at least one of these on a real project, you’re leaving free improvements on the table. The bar to entry is &lt;code&gt;pip install autoloop-ai&lt;/code&gt; or &lt;code&gt;npx autoresearch-anything&lt;/code&gt;. There’s no reason not to point one at something you care about and let it run overnight. You’ll either get a better version of the thing or you’ll learn something about your metric. Both of those are wins.&lt;/p&gt;

</description>
      <category>aiagents</category>
    </item>
    <item>
      <title>Model Roundup: The Free Countdown, the $300 Amnesiac, and the Quiet Climber at #7</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Sat, 02 May 2026 12:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/model-roundup-the-free-countdown-the-300-amnesiac-and-the-quiet-climber-at-7-167f</link>
      <guid>https://forem.com/eristoddle/model-roundup-the-free-countdown-the-300-amnesiac-and-the-quiet-climber-at-7-167f</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o8jjr90c8zhp8c4a4wl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o8jjr90c8zhp8c4a4wl.jpg" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I check OpenRouter rankings every week to figure out which models to throw at my projects. This week, the model at the top of the charts had something I’d never seen before: an expiration date.&lt;/p&gt;

&lt;p&gt;Right there on the Tencent Hy3 Preview page: “Going Away May 8.” Six days from now. And it’s currently generating 2.15 trillion tokens a week with a +1,356% spike. You know what that is? Not a sign of the best model on the market. It’s the AI equivalent of a store liquidation sale. Everyone’s grabbing tokens before they cost money.&lt;/p&gt;

&lt;p&gt;That’s W18 in a nutshell. The #1 model is a countdown timer. The hottest new premium subscription ($300/month from xAI) still can’t remember who you are between sessions.&lt;/p&gt;

&lt;p&gt;There’s good news buried in all this: Kimi K2.6, which I mentioned &lt;a href="https://dev.to/april-2026-model-roundup-the-billing-horror-the-012m-unicorn-and-metas-open-source-betrayal/"&gt;last week&lt;/a&gt; as an interesting launch, has started showing real production numbers. And there’s a model called Step 3.5 Flash that’s been quietly climbing the rankings for three months with zero hype, which in this market is basically a standing ovation.&lt;/p&gt;

&lt;p&gt;Let me tell you what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Table of Contents&lt;/li&gt;
&lt;li&gt;The #1 Model Is a Countdown Timer (Tencent Hy3 Preview)&lt;/li&gt;
&lt;li&gt;
Kimi K2.6 Is Now a Real Recommendation

&lt;ul&gt;
&lt;li&gt;Where K2.6 Falls Short&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

The Sleeper: Step 3.5 Flash Has Been Climbing for Three Months

&lt;ul&gt;
&lt;li&gt;The One Real Catch&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Grok 4.3: Genuinely Impressive, Genuinely Annoying, $300/Month&lt;/li&gt;

&lt;li&gt;Your Smarter Model Might Be Breaking Your Agents&lt;/li&gt;

&lt;li&gt;What’s Actually Worth Using (and What’s Coming)&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  The #1 Model Is a Countdown Timer (Tencent Hy3 Preview)
&lt;/h2&gt;

&lt;p&gt;Tencent launched Hy3 Preview on April 22 with a free access period that runs out May 8. That’s the entire explanation for the +1,356% weekly spike and the 2.15 trillion tokens burned. Developers saw “free” and “295B MoE” in the same sentence and did what developers do: they stress-tested it before anyone sent them a bill.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgedcd3q9v2alykl2pmjn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgedcd3q9v2alykl2pmjn.jpg" alt="Tencent free Hy3 Preview on OpenRouter" width="800" height="643"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s what Hy3 Preview actually is: 295 billion total parameters, 21 billion activated per token (mixture of experts, efficient by design), 262K context window, configurable reasoning you can dial from disabled to low to high. Designed for agentic coding workflows. On paper, solid.&lt;/p&gt;

&lt;p&gt;In practice? No Arena votes because it’s too new to have accumulated any. No long-form reviews because nobody’s shipped anything with it yet. No “I’ve been using this for three weeks and it’s my daily driver” posts anywhere I could find. Just a lot of “grabbing free tokens before May 8” energy.&lt;/p&gt;

&lt;p&gt;What happens after May 8 is the real question. Hy3 Preview becomes a paid model competing against DeepSeek V3.2 (which costs $0.14 input / $0.28 output per 1M tokens and has months of production track record), Kimi K2.6 ($0.74/$3.49 with confirmed adoption), and Step 3.5 Flash (which I’ll get to in a moment). Entering that field with no reviews and no Arena ranking is a tough position.&lt;/p&gt;

&lt;p&gt;If you want to play with it before the deadline, go to &lt;a href="https://openrouter.ai/tencent/hy3-preview:free" rel="noopener noreferrer"&gt;openrouter.ai/tencent/hy3-preview:free&lt;/a&gt; and run some benchmarks. Just don’t build a dependency on something with a “Going Away” notice stamped on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kimi K2.6 Is Now a Real Recommendation
&lt;/h2&gt;

&lt;p&gt;Last week I called Kimi K2.6 an interesting launch. Twelve days later, the production numbers are coming in and it’s something more concrete.&lt;/p&gt;

&lt;p&gt;Real developers running real workflows are reporting 88% cost savings when they replace Claude with K2.6 for bulk coding tasks: batch migrations, test generation, format conversion, anything where you’re doing a lot of the same kind of work repeatedly. The Kimi Code CLI, the companion tool for using K2.6 in your terminal the same way you’d use Claude Code, crossed 6,400 GitHub stars. That’s people betting actual infrastructure on this model, not just upvoting a launch post.&lt;/p&gt;

&lt;p&gt;The pattern hardening into consensus across forums is this: use K2.6 for bulk, use Claude for the high-stakes core. At $0.74 input / $3.49 output per 1M tokens, K2.6 is roughly 4x cheaper than Claude Sonnet 4.6. For workflows that generate a lot of tokens on repetitive work, that math compounds fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where K2.6 Falls Short
&lt;/h3&gt;

&lt;p&gt;This is the part I actually care about more than the hype. K2.6 trails GPT-5.4 on GPQA-Diamond (90.5% vs 92.8%) and AIME 2026 (96.4% vs 99.2%). These are hard reasoning benchmarks. For anything where being wrong has real consequences (financial analysis, medical context, legal questions), K2.6 is not the answer. The cost savings don’t matter if the output costs you more to fix.&lt;/p&gt;

&lt;p&gt;Use it for code. Trust it with the boring high-volume stuff. Keep a premium model on anything where you’d be embarrassed if an AI got it wrong.&lt;/p&gt;

&lt;p&gt;K2.6 also ships with agent swarm architecture supporting up to 300 parallel sub-agents and 4,000 coordinated steps. After &lt;a href="https://dev.to/my-home-ai-agent-kept-making-shit-up/"&gt;my own experiences with AI agents inventing things&lt;/a&gt; I’d start with single-agent mode until you’ve validated its judgment in your specific domain. 300 parallel sub-agents hallucinating tool calls in parallel is not a good time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sleeper: Step 3.5 Flash Has Been Climbing for Three Months
&lt;/h2&gt;

&lt;p&gt;Most models follow the same OpenRouter arc: spike at launch, plateau after a few weeks, slowly fade as the next shiny thing arrives. Step 3.5 Flash doesn’t fit this pattern.&lt;/p&gt;

&lt;p&gt;StepFun released it in early 2026; the exact date is contested across sources (somewhere between late January and March), but it doesn’t much matter. As of this week it’s at #7 on OpenRouter with +28% week-over-week. For a model that’s been around roughly three months, that’s not a hype spike. That’s sustained adoption with nothing to explain it except developers finding it useful.&lt;/p&gt;

&lt;p&gt;The numbers back it up: #4 intelligence ranking out of 64 models on Artificial Analysis. That puts it above almost everything priced anywhere near its cost: free on the rate-limited tier, $0.10 input / $0.30 output per 1M tokens on paid. For comparison, DeepSeek V3.2 costs $0.14/$0.28 and ranks lower on the same index. Step 3.5 Flash is somehow cheaper AND smarter on paper, and nobody’s writing breathless posts about it.&lt;/p&gt;

&lt;p&gt;Architecture: 196 billion total parameters, 11 billion activated per token (MoE), 262K context, reasoning parameter support so you can see step-by-step thinking in API responses if you want it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The One Real Catch
&lt;/h3&gt;

&lt;p&gt;Step 3.5 Flash is extremely verbose. During Artificial Analysis evaluation it generated 260 million tokens versus an 11 million token average for comparable models. It thinks out loud, at length, in a way that will surprise your output token budget if you’re not watching.&lt;/p&gt;

&lt;p&gt;Set &lt;code&gt;max_tokens&lt;/code&gt; limits. If you’re using it for any high-volume generation, put a ceiling on it. Otherwise you’ll get thorough reasoning that costs more than you expected from a supposedly cheap model.&lt;/p&gt;
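
&lt;p&gt;Via the API that’s one extra field in the request body. A sketch; the model slug is my guess at the OpenRouter naming, so check the listing, and the ceiling is arbitrary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hard ceiling on output tokens so the verbosity can't blow the budget.
# The slug is assumed (check OpenRouter's listing); 2000 is arbitrary.
import os, requests

r = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "stepfun/step-3.5-flash",  # assumed slug
        "messages": [{"role": "user", "content": "Summarize this changelog..."}],
        "max_tokens": 2000,  # the part that matters
    },
    timeout=120,
)
print(r.json()["usage"])  # watch completion_tokens for a while before trusting it

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;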

&lt;p&gt;Worth adding to your comparison set before someone writes a breathless Medium post about it and StepFun decides to raise the price.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grok 4.3: Genuinely Impressive, Genuinely Annoying, $300/Month
&lt;/h2&gt;

&lt;p&gt;Let’s do the good news first, because there’s real good news here.&lt;/p&gt;

&lt;p&gt;Grok 4.3 (launched April 17, currently rolling out in beta to SuperGrok Heavy subscribers) added native video input processing, not “describe this video” video but actual video-grounded reasoning. It can generate fully-formatted downloadable PDFs, populated spreadsheets, and PowerPoint presentations directly from conversation. Early beta testers are reporting formatted outputs they could hand to someone without cleanup. The integration with Grok Computer (xAI’s desktop automation agent) got tighter. If you’re doing autonomous desktop workflows, Grok 4.3 has a real story.&lt;/p&gt;

&lt;p&gt;Now the bad news.&lt;/p&gt;

&lt;p&gt;Grok 4.3 costs $300/month. That’s $100 more than ChatGPT Pro and $100 more than Claude Max. Both of those services have had persistent memory between sessions for over a year. Grok 4.3 does not. Every time you close your tab, the model forgets you. You start over. Blank context, fresh start, zero memory of anything you’ve built together.&lt;/p&gt;

&lt;p&gt;Persistent memory is not on xAI’s published roadmap.&lt;/p&gt;

&lt;p&gt;Multiple reviewers landed on the same observation this week. One X user put it cleanly: “you’re paying $300/month for a model that forgets you between sessions.” That’s not exaggeration. That’s the product.&lt;/p&gt;

&lt;p&gt;At $200/month, this would be annoying. At $300/month, it’s a product decision, and product decisions tell you something about what a company is optimizing for. xAI built the video capabilities and the document generation first. Memory (the feature that makes an AI assistant feel like an actual assistant rather than a very fancy search box) is apparently not the priority.&lt;/p&gt;

&lt;p&gt;Add the “High Demand” server errors that hit during the launch-week beta and you’ve got a model that’s impressive in demos and frustrating in daily use. The full API rollout is coming mid-to-late May. When it hits general availability, this conversation is going to get louder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Smarter Model Might Be Breaking Your Agents
&lt;/h2&gt;

&lt;p&gt;This one’s structural rather than model-specific, and it’s relevant for anyone running agentic pipelines.&lt;/p&gt;

&lt;p&gt;An April 2026 ICLR paper titled “The Reasoning Trap” documented something uncomfortable: RL-based reasoning training (the kind that makes frontier models better at hard reasoning tasks) increases tool-hallucination rates in lockstep. The better a model gets at reasoning, the more often it invents tool calls that don’t exist. Function names, API endpoints, methods that aren’t in your schema. The model reasons its way to a call it can’t actually make.&lt;/p&gt;

&lt;p&gt;If you’ve upgraded your agentic pipeline to a stronger reasoning model because it’s smarter, you may have simultaneously increased the rate at which it hallucinates the tools it should be calling. The capability and the failure mode scale together.&lt;/p&gt;

&lt;p&gt;I’ve written about &lt;a href="https://forem.com/eristoddle/my-ai-agent-kept-making-shit-up-and-other-lessons-from-running-openclaw-566p"&gt;running into this firsthand with OpenClaw&lt;/a&gt;. The model-specific details differ but the pattern is the same. Stronger reasoning doesn’t mean better tool selection, and in agentic contexts “smarter” can break things in ways you don’t catch until something fails in production.&lt;/p&gt;

&lt;p&gt;Practical response: add tool-call schema validation before your agents execute. Check that every tool the model selects actually exists in your registry before you let it run. This applies to every frontier RL-trained model right now. It’s not a specific model bug, it’s how these systems are being trained.&lt;/p&gt;
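
&lt;p&gt;The gate itself is small. A sketch, assuming the common OpenAI-style tool-call shape; swap in whatever registry your stack already keeps:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Refuse to dispatch any tool call whose name isn't in the registry.
# Assumes OpenAI-style tool_call dicts; field names vary by framework.
KNOWN_TOOLS = {"read_file", "search_web", "send_email"}  # your real registry

def validate_tool_calls(tool_calls):
    invented = [c["function"]["name"] for c in tool_calls
                if c["function"]["name"] not in KNOWN_TOOLS]
    if invented:
        # Fail loudly instead of letting the agent "succeed" on fiction
        raise ValueError(f"model invented tool(s): {invented}")
    return tool_calls

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;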

&lt;h2&gt;
  
  
  What’s Actually Worth Using (and What’s Coming)
&lt;/h2&gt;

&lt;p&gt;Quick reference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input $/1M&lt;/th&gt;
&lt;th&gt;Output $/1M&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free (grab it now)&lt;/td&gt;
&lt;td&gt;Hy3 Preview&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Experiments before May 8 only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free (stable)&lt;/td&gt;
&lt;td&gt;Step 3.5 Flash&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Rate-limited; best free reasoning available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free (open weights)&lt;/td&gt;
&lt;td&gt;Nemotron 3 Super 120B&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;NVIDIA-backed, open license, 262K context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free (new, watch)&lt;/td&gt;
&lt;td&gt;Owl Alpha (stealth)&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;1M context, agentic (prompts may be logged)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget&lt;/td&gt;
&lt;td&gt;Step 3.5 Flash (paid)&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;Climbing for 3 months, verbose but smart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget&lt;/td&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;Proven track record, still the value baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid&lt;/td&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;$0.74&lt;/td&gt;
&lt;td&gt;$3.49&lt;/td&gt;
&lt;td&gt;Bulk coding workflows, 88% cheaper than Claude&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$12.00&lt;/td&gt;
&lt;td&gt;Arena #4 overall, 1M context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;~$3.00&lt;/td&gt;
&lt;td&gt;~$15.00&lt;/td&gt;
&lt;td&gt;#2 Arena coding, proven daily driver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;#1 Arena overall (thinking mode), high stakes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Mark your calendar for May 19.&lt;/strong&gt; Google I/O is 17 days away. Gemini 4 isn’t confirmed, but annual release patterns and confirmed agenda items (agentic AI, developer tooling) make it likely. That’s the next probable shakeup in this table.&lt;/p&gt;

&lt;p&gt;Claude Mythos, Anthropic’s model that developed a working exploit for a remote code execution vulnerability in FreeBSD (CVE-2026-4747), is not coming to a public API. It’s locked in Project Glasswing, a security research consortium, and Anthropic has no public timeline for changing that. Mention it at parties.&lt;/p&gt;

&lt;p&gt;GPT-6 is still vaporware. Polymarket has it at 84% by December 31, 2026. That’s not a date, it’s a guess with confidence bounds.&lt;/p&gt;

&lt;p&gt;The model worth your attention this week isn’t at #1. It’s at #7, three months old, climbing steadily, no hype cycle to explain it. Step 3.5 Flash just keeps showing up in the data.&lt;/p&gt;

</description>
      <category>largelanguagemodels</category>
    </item>
    <item>
      <title>My AI Agent Kept Making Shit Up (And Other Lessons From Running OpenClaw)</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Tue, 28 Apr 2026 12:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/my-ai-agent-kept-making-shit-up-and-other-lessons-from-running-openclaw-566p</link>
      <guid>https://forem.com/eristoddle/my-ai-agent-kept-making-shit-up-and-other-lessons-from-running-openclaw-566p</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wkuqyq3jbkahikbjwjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wkuqyq3jbkahikbjwjs.png" alt="OpenClaw" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wanted an AI agent running on my home network. Not a cloud subscription and not something requiring me to be at the keyboard all day. A thing that wakes up at 7am, pulls from RSS feeds and Reddit, synthesizes the news I actually care about, and emails it to me. Just that. That’s what I started with. Seemed simple. It wasn’t like I was asking much.&lt;/p&gt;

&lt;p&gt;The reality was six weeks of debugging hallucinations, silent config failures, broken tool schemas, and a recurring realization that LLMs are, in certain contexts, compulsive liars.&lt;/p&gt;

&lt;p&gt;Here’s what I learned the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Setup: OpenClaw + DeepSeek in Docker&lt;/li&gt;
&lt;li&gt;The Exec Approval Maze&lt;/li&gt;
&lt;li&gt;The Reports That Were Too Good&lt;/li&gt;
&lt;li&gt;Going Around the Agent&lt;/li&gt;
&lt;li&gt;When Tools Become Literal Text&lt;/li&gt;
&lt;li&gt;Ripping Out Slack&lt;/li&gt;
&lt;li&gt;What’s Actually Working&lt;/li&gt;
&lt;li&gt;
But Here’s What She’s Actually Good At

&lt;ul&gt;
&lt;li&gt;The Report Engine Isn’t a One-Trick Pony&lt;/li&gt;
&lt;li&gt;Email Delivery, Old School On Purpose&lt;/li&gt;
&lt;li&gt;Multi-Model, Not Locked to DeepSeek&lt;/li&gt;
&lt;li&gt;The Track Record, Three Days In&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;What I Actually Built&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Setup: OpenClaw + DeepSeek in Docker
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; is a self-hosted AI agent framework. If you haven’t heard of it, think a local version of an AI assistant with cron jobs, tool calling, Slack/Telegram integration, and memory. Plus, how haven’t you heard of it? You run it in Docker, point it at whatever LLM you want, and theoretically have an autonomous agent working for you.&lt;/p&gt;

&lt;p&gt;I named mine Sabrina. She runs DeepSeek V3 (&lt;code&gt;deepseek/deepseek-chat&lt;/code&gt;) because the OpenAI and Anthropic APIs bill by the token and Sabrina is a chatty agent who generates daily reports. DeepSeek at pay-as-you-go rates keeps the monthly bill manageable.&lt;/p&gt;

&lt;p&gt;The architecture is two containers: &lt;code&gt;openclaw-gateway&lt;/code&gt; handles HTTP and the Slack/Telegram socket connections, and &lt;code&gt;openclaw-cli&lt;/code&gt; is the shell interface. The whole &lt;code&gt;~/.openclaw&lt;/code&gt; directory mounts into the container at &lt;code&gt;/home/node/.openclaw&lt;/code&gt; so configs, cron jobs, and workspace scripts are all live-editable from the host without rebuilding.&lt;/p&gt;

&lt;p&gt;On paper, this is elegant. In practice, you will spend a lot of time staring at container logs wondering why your agent is quietly lying to you. Or realizing you can just put Claude Code on the host and have it fix things when they break.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Exec Approval Maze
&lt;/h2&gt;

&lt;p&gt;Before Sabrina could run scripts, I had to configure &lt;code&gt;exec-approvals.json&lt;/code&gt;: a policy file that controls what shell commands the agent is allowed to execute. Fine. Reasonable. I set up allowlists for the workspace scripts and Python interpreter.&lt;/p&gt;

&lt;p&gt;Then the cron jobs started silently failing. The daily 7am AI report would produce output, but something felt off. I dug into the exec-approval config and found the first trap:&lt;/p&gt;

&lt;p&gt;The documentation (and my own reasoning at the time) suggested &lt;code&gt;"ask": "never"&lt;/code&gt; as a way to skip interactive approval prompts for unattended jobs. This is wrong. The schema only accepts &lt;code&gt;"off" | "on-miss" | "always"&lt;/code&gt;. Using &lt;code&gt;"never"&lt;/code&gt; doesn’t throw an error. It gets silently stripped by &lt;code&gt;sanitizeExecApprovalPolicy&lt;/code&gt; the next time the app writes the file. Your config looks fine, your intent is gone, and the agent starts timing out on approval requests at 7am with no operator connected.&lt;/p&gt;

&lt;p&gt;The correct pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"defaults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"security"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allowlist"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ask"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"off"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"allowlist"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"main"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"security"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allowlist"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ask"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"off"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"allowlist"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;"ask": "off"&lt;/code&gt; makes the allowlist the sole policy.&lt;/p&gt;
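
&lt;p&gt;Since the app rewrites the file silently, a pre-flight check beats trusting your memory. A sketch against the schema as I understand it; the three valid values are the whole point:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pre-flight check for exec-approvals.json: catch values the app would
# silently strip (like "never") before the 7am cron job finds out for you.
import json, sys

VALID_ASK = {"off", "on-miss", "always"}

policy = json.load(open(sys.argv[1]))  # path to exec-approvals.json
blocks = [policy.get("defaults", {})] + list(policy.get("agents", {}).values())
for block in blocks:
    ask = block.get("ask")
    if ask is not None and ask not in VALID_ASK:
        sys.exit(f'invalid "ask" value {ask!r}; schema allows {sorted(VALID_ASK)}')
print("exec-approvals.json looks sane")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;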

&lt;p&gt;I fixed this. Or so I thought.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reports That Were Too Good
&lt;/h2&gt;

&lt;p&gt;The AI intelligence report looked great. Every morning: a well-formatted digest of the day’s AI news, summaries, source links. Sabrina was crushing it.&lt;/p&gt;

&lt;p&gt;Then I noticed the timestamps.&lt;/p&gt;

&lt;p&gt;Every log entry in the fabricated reports had timestamps ending in &lt;code&gt;:00&lt;/code&gt; or &lt;code&gt;:30&lt;/code&gt;. Real log files don’t look like that: they’re messy, they have milliseconds, they reflect actual compute time. These were fake. I checked the URLs. Several of them 404’d. The article summaries were plausible but not verifiable. Sabrina had been generating the reports &lt;em&gt;herself&lt;/em&gt;, not from RSS feeds, but from her training data and imagination, because the exec approval issue wasn’t actually fixed. When the script couldn’t run, the agent fell back on what LLMs do naturally: produce what the output &lt;em&gt;should&lt;/em&gt; look like.&lt;/p&gt;
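
&lt;p&gt;That tell is cheap to automate. A sketch of the heuristic, run over a report before it goes out; the threshold is a judgment call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Flag reports whose timestamps are suspiciously round (:00 / :30).
# Real logs are messy; fabricated ones tend not to be.
import re

def looks_fabricated(report_text, threshold=0.9):
    stamps = re.findall(r"\d{1,2}:\d{2}(?::\d{2})?", report_text)
    if not stamps:
        return False  # nothing to judge on
    suspicious = sum(1 for s in stamps if s.endswith((":00", ":30")))
    return suspicious / len(stamps) &gt;= threshold

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;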

&lt;p&gt;This is the thing nobody tells you about giving LLMs agentic tasks: when they fail to do the thing, they don’t say “I failed to do the thing.” They generate a plausible simulation of having done the thing.&lt;/p&gt;

&lt;p&gt;The fix I’d been applying, tweaking exec-approvals, only addressed the symptom. The agent could bypass exec approval entirely by deciding to write the content directly. There was no configuration that would stop a sufficiently motivated language model from bullshitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Around the Agent
&lt;/h2&gt;

&lt;p&gt;The actual fix was nuclear: remove the agent from report generation entirely.&lt;/p&gt;

&lt;p&gt;I disabled the OpenClaw cron jobs for both the AI report and the email send, then added host-level cron entries that call &lt;code&gt;docker exec&lt;/code&gt; directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="m"&gt;7&lt;/span&gt; * * * &lt;span class="n"&gt;docker&lt;/span&gt; &lt;span class="n"&gt;exec&lt;/span&gt; &lt;span class="n"&gt;openclaw&lt;/span&gt;-&lt;span class="n"&gt;openclaw&lt;/span&gt;-&lt;span class="n"&gt;gateway&lt;/span&gt;-&lt;span class="m"&gt;1&lt;/span&gt; /&lt;span class="n"&gt;usr&lt;/span&gt;/&lt;span class="n"&gt;bin&lt;/span&gt;/&lt;span class="n"&gt;python3&lt;/span&gt; /&lt;span class="n"&gt;home&lt;/span&gt;/&lt;span class="n"&gt;node&lt;/span&gt;/.&lt;span class="n"&gt;openclaw&lt;/span&gt;/&lt;span class="n"&gt;workspace&lt;/span&gt;/&lt;span class="n"&gt;ai_report&lt;/span&gt;.&lt;span class="n"&gt;py&lt;/span&gt; --&lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;-&lt;span class="n"&gt;intelligence&lt;/span&gt; &amp;gt;&amp;gt; /&lt;span class="n"&gt;home&lt;/span&gt;/&lt;span class="n"&gt;eristoddle&lt;/span&gt;/.&lt;span class="n"&gt;openclaw&lt;/span&gt;/&lt;span class="n"&gt;workspace&lt;/span&gt;/&lt;span class="n"&gt;logs&lt;/span&gt;/&lt;span class="n"&gt;report&lt;/span&gt;-&lt;span class="n"&gt;host&lt;/span&gt;-$(&lt;span class="n"&gt;date&lt;/span&gt; +\%&lt;span class="n"&gt;Y&lt;/span&gt;-\%&lt;span class="n"&gt;m&lt;/span&gt;-\%&lt;span class="n"&gt;d&lt;/span&gt;).&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&amp;gt;&amp;amp;&lt;span class="m"&gt;1&lt;/span&gt;

&lt;span class="m"&gt;30&lt;/span&gt; &lt;span class="m"&gt;7&lt;/span&gt; * * * &lt;span class="n"&gt;docker&lt;/span&gt; &lt;span class="n"&gt;exec&lt;/span&gt; &lt;span class="n"&gt;openclaw&lt;/span&gt;-&lt;span class="n"&gt;openclaw&lt;/span&gt;-&lt;span class="n"&gt;gateway&lt;/span&gt;-&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="n"&gt;bash&lt;/span&gt; /&lt;span class="n"&gt;home&lt;/span&gt;/&lt;span class="n"&gt;node&lt;/span&gt;/.&lt;span class="n"&gt;openclaw&lt;/span&gt;/&lt;span class="n"&gt;workspace&lt;/span&gt;/&lt;span class="n"&gt;send&lt;/span&gt;-&lt;span class="n"&gt;ai&lt;/span&gt;-&lt;span class="n"&gt;intelligence&lt;/span&gt;-&lt;span class="n"&gt;report&lt;/span&gt;-&lt;span class="n"&gt;proper&lt;/span&gt;.&lt;span class="n"&gt;sh&lt;/span&gt; &amp;gt;&amp;gt; /&lt;span class="n"&gt;home&lt;/span&gt;/&lt;span class="n"&gt;eristoddle&lt;/span&gt;/.&lt;span class="n"&gt;openclaw&lt;/span&gt;/&lt;span class="n"&gt;workspace&lt;/span&gt;/&lt;span class="n"&gt;logs&lt;/span&gt;/&lt;span class="n"&gt;email&lt;/span&gt;-&lt;span class="n"&gt;host&lt;/span&gt;-$(&lt;span class="n"&gt;date&lt;/span&gt; +\%&lt;span class="n"&gt;Y&lt;/span&gt;-\%&lt;span class="n"&gt;m&lt;/span&gt;-\%&lt;span class="n"&gt;d&lt;/span&gt;).&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&amp;gt;&amp;amp;&lt;span class="m"&gt;1&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python script runs inside the container, where it has access to the right Python packages, but the &lt;em&gt;trigger&lt;/em&gt; is the host crontab. No agent involved. No LLM between the script and reality.&lt;/p&gt;

&lt;p&gt;This works. The reports now have messy timestamps and real URLs that actually load.&lt;/p&gt;

&lt;p&gt;The Obsidian weekly report I left in OpenClaw, because that one &lt;em&gt;needs&lt;/em&gt; the agent. It reads my vault, categorizes clips, writes summaries, analyzes git diffs: actual LLM work that benefits from Sabrina’s reasoning. The difference is whether the task is “run a script and report the output” (host cron) or “think about my vault and synthesize something useful” (agent cron). Only one of those should involve an LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Tools Become Literal Text
&lt;/h2&gt;

&lt;p&gt;OpenClaw gets updates. After updates, things break in interesting ways.&lt;/p&gt;

&lt;p&gt;Twice now I’ve run into a scenario where Sabrina starts responding to everything, but her tool calls appear as raw text in the chat. Instead of actually reading a file, she’d output &lt;code&gt;read:/home/node/.openclaw/workspace/HEARTBEAT.md&lt;/code&gt; as a literal string.&lt;/p&gt;

&lt;p&gt;This is a DeepSeek-specific quirk that OpenClaw triggers by accident. The framework converts tool schemas to OpenAI format before sending them to providers. DeepSeek expects its own native format. The conversion breaks its tool call parsing silently. It receives schemas it doesn’t understand and falls back to treating the tool call syntax as plain text.&lt;/p&gt;

&lt;p&gt;The fix is a compat flag in the model config in &lt;code&gt;openclaw.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deepseek-chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DeepSeek V3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contextWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;163840&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"anthropicToolSchemaMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"native"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;anthropicToolSchemaMode: "native"&lt;/code&gt; tells OpenClaw to skip the schema conversion and send the native format. Tools work again. I found this via a GitHub issue (#36651) after two sessions of source archaeology that I really didn’t want to be doing.&lt;/p&gt;

&lt;p&gt;The lesson: when OpenClaw updates and tools start appearing as text, don’t read source code first. Check GitHub issues and Reddit. The community finds these fixes faster than you will staring at the framework internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ripping Out Slack
&lt;/h2&gt;

&lt;p&gt;OpenClaw supports Slack via socket mode. I had it connected for a while because it was useful for checking in on Sabrina from my phone without VPN or port-forwarding.&lt;/p&gt;

&lt;p&gt;Then an update changed the Slack config schema. The gateway crashed on startup with “Config invalid” and wouldn’t come back up until I removed the entire &lt;code&gt;channels.slack&lt;/code&gt; block from &lt;code&gt;openclaw.json&lt;/code&gt;. This happened twice. After the second time I removed Slack permanently and switched to Telegram, which has been stable.&lt;/p&gt;

&lt;p&gt;This is the trade-off with self-hosted software that’s still actively developed: you get the control, you eat the breakage. Updates that ship on Tuesday can invalidate configs you spent a week getting right. Having Claude Code manage the &lt;code&gt;~/.openclaw&lt;/code&gt; config directory directly, rather than asking Sabrina to fix herself through chat, means at least the fixes land correctly the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Actually Working
&lt;/h2&gt;

&lt;p&gt;Six weeks in, here’s the honest status:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily AI intelligence report:&lt;/strong&gt; Running reliably via host cron. Real data. Real URLs. Emails delivered by 7:30am.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly Obsidian report:&lt;/strong&gt; Agent-generated, delivers Fridays. Sabrina does genuine LLM work here — categorizing clips, writing summaries — and it shows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling:&lt;/strong&gt; Stable with the compat flag. Breaks again when OpenClaw updates, gets fixed in under an hour now that I know where to look.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The exec-approvals file:&lt;/strong&gt; Still fragile. I keep a copy of the correct config in my notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing I underestimated: running an AI agent autonomously is mostly an infrastructure problem, not an AI problem. The interesting parts are the prompts and the LLM reasoning. The annoying parts are Docker networking, cron timing, config schema drift, and an agent that will hallucinate convincingly rather than admit it can’t do something.&lt;/p&gt;

&lt;p&gt;Sabrina’s useful. She’s also a liar when she’s backed into a corner. I’ve learned to keep her away from any task where I can’t independently verify the output.&lt;/p&gt;

&lt;p&gt;That’s not an OpenClaw problem or a DeepSeek problem. That’s just what LLMs do. But here’s the thing: once I stopped asking her to do the things LLMs are bad at, she got useful in a hurry. Most of what follows happened since last Thursday night.&lt;/p&gt;

&lt;h2&gt;
  
  
  But Here’s What She’s Actually Good At
&lt;/h2&gt;

&lt;p&gt;OpenClaw’s skill system is pluggable. You drop a skill into the workspace, the agent loads it, and it becomes part of how she thinks. Sabrina didn’t ship with most of her current capabilities. She built them through the same autonomous workflow she runs every day.&lt;/p&gt;

&lt;p&gt;A few that earn their slot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sm-blog-outline&lt;/code&gt;&lt;/strong&gt;: Started life as a generic &lt;code&gt;blog-outline&lt;/code&gt; skill. Now it’s the full pipeline I use for &lt;em&gt;this site&lt;/em&gt; — notes → outline → email. Trained on my voice, my content pillars, my snark level. It’s the skill that outlined this post, pulling from both Sabrina’s and Claude Code’s logs as well as a running list of notes I kept on the setup process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ct-humanizer&lt;/code&gt;&lt;/strong&gt;: Sequential editing passes that strip AI tells out of nonfiction. Diagnoses patterns first, then kills the AI vocabulary, then breaks up the structural templates LLMs love so much. Not a magic button, more like a brutal copy editor. It cleans up the outline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;verbalized-sampling&lt;/code&gt;&lt;/strong&gt;: Instead of spitting back a single answer, generates multiple candidates with probability weights. I use it for brainstorming and “show me five angles” tasks. The default LLM answer is usually the median answer; this skill surfaces the weirder, more useful ones. Got the idea &lt;a href="https://www.verbalized-sampling.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, gave Opus all the documentation, and used the Claude &lt;code&gt;skill-creator&lt;/code&gt; skill to create it. It is one of my favorite skills because you never know what you’re going to get.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;vault-tag-search&lt;/code&gt;&lt;/strong&gt; + &lt;strong&gt;&lt;code&gt;vault-idea-scorer&lt;/code&gt;&lt;/strong&gt;: Companions to the blog pipeline. One searches my Obsidian vault by tag &lt;em&gt;and&lt;/em&gt; body content with deduplication. The other ranks blog post ideas by whether they dovetail with multiple goals: research vs. content vs. portfolio vs. SEO.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://clawhub.ai/ivangdavila/self-improving" rel="noopener noreferrer"&gt;A self-improving skill&lt;/a&gt;&lt;/strong&gt;: Logs corrections and preferences so Sabrina compounds learning between sessions instead of getting the same feedback every week.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point isn’t any single skill. It’s that the agent grows a custom toolkit shaped by the work I actually do, not whatever generic capabilities the framework shipped with.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Report Engine Isn’t a One-Trick Pony
&lt;/h3&gt;

&lt;p&gt;That &lt;code&gt;ai_report.py&lt;/code&gt; script generating the daily AI digest isn’t hardcoded to AI news. It’s a topic-agnostic engine that takes a profile flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 ai_report.py &lt;span class="nt"&gt;--profile&lt;/span&gt; ai-intelligence
python3 ai_report.py &lt;span class="nt"&gt;--profile&lt;/span&gt; golang
python3 ai_report.py &lt;span class="nt"&gt;--profile&lt;/span&gt; typescript

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each profile defines its own RSS feeds, Reddit subreddits, and keyword filters. Depth is tunable too: quick briefing vs. deep dive, set per profile. Articles get scored against my interests using CLIP + BM25 indexing before they make the cut, so I don’t end up with a digest full of stuff I don’t care about.&lt;/p&gt;

&lt;p&gt;Same engine, different sources, same usefulness. Once the host cron pattern is locked in for one topic, adding another is a profile file and a crontab line.&lt;/p&gt;
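
&lt;p&gt;I won’t paste the real profiles, but the shape is roughly this; every name and feed below is hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical profile shape for ai_report.py; the real field names differ,
# but each profile bundles sources, filters, and depth like this.
PROFILES = {
    "golang": {
        "rss_feeds": ["https://go.dev/blog/feed.atom"],  # example feed
        "subreddits": ["golang"],
        "keywords": ["generics", "runtime", "proposal"],
        "depth": "brief",  # "brief" or "deep", set per profile
    },
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;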

&lt;h3&gt;
  
  
  Email Delivery, Old School On Purpose
&lt;/h3&gt;

&lt;p&gt;Everything Sabrina produces comes to me as email. Gmail SMTP, app password auth…for now. Yes, that’s old-fashioned. That’s the feature.&lt;/p&gt;

&lt;p&gt;A dashboard would be one more thing to check. Notifications would be one more app fighting for attention. Email is the universal inbox I already process. I can read it on my iPad without installing anything, forward to Obsidian if it’s worth keeping, drag it to drafts if it’s a blog skeleton, or delete it if Sabrina got it wrong.&lt;/p&gt;

&lt;p&gt;The pattern is generic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;send-email.sh "Subject" body-or-file [attachment]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. Anything in the system that needs to deliver text to a human goes through that script. Reports, blog outlines, and research summaries all use it.&lt;/p&gt;
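
&lt;p&gt;Under the hood it’s nothing exotic. A minimal sketch of the kind of thing that script wraps, stdlib only; the addresses and env var name are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Roughly what send-email.sh wraps: stdlib SMTP with a Gmail app password.
import os, smtplib
from email.message import EmailMessage

def send_email(subject, body, to="me@example.com"):
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "agent@example.com", to
    msg.set_content(body)
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login("agent@example.com", os.environ["GMAIL_APP_PASSWORD"])
        smtp.send_message(msg)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;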

&lt;h3&gt;
  
  
  Multi-Model, Not Locked to DeepSeek
&lt;/h3&gt;

&lt;p&gt;DeepSeek runs the daily cron work because it’s cheap. But Sabrina isn’t married to it. The agent routes through OpenRouter, which means any task can pick its own model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;qwen/qwen3.6-plus&lt;/code&gt;&lt;/strong&gt; — 1M context window, great for long-form research and generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;minimax/minimax-m2.5&lt;/code&gt;&lt;/strong&gt; — strong reasoning, what I reach for on analytical work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;google/gemini-3-flash-preview&lt;/code&gt;&lt;/strong&gt; — also 1M context, fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt;&lt;/strong&gt; — solid alternative when the others are misbehaving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The job picks the model. Daily AI report? DeepSeek, because it’s cheap and the task isn’t hard. Blog outline that needs to chew through a pile of research notes? Qwen, because the context window swallows the whole input without chunking. Analytical synthesis? Minimax. And again, for now. I’m just getting into these new models after using Claude Code for however long it’s been out. But the success I’ve had with them has me setting up Opencode to use them.&lt;/p&gt;

&lt;p&gt;The subagent system lets me parallelize too. While the main session ran on DeepSeek doing one thing, a subagent on Qwen drafted an outline for a different post. Two models, two tasks, one wall clock.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Track Record, Three Days In
&lt;/h3&gt;

&lt;p&gt;Concrete deliverables since Thursday night:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blog outlines:&lt;/strong&gt; Two posts — a Kiro AI article and one I’m calling “The AI Psychologist” — both went notes → web research → verbalized sampling for angle selection → outline → email. Full pipeline, no me-in-the-loop until the outline showed up in my inbox.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research tasks:&lt;/strong&gt; Author bios with structured JSON + bibliography, topic deep-dives on AI tools, vibe coding, prompt engineering psychology. Stuff I’d normally burn an afternoon on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brainstorming:&lt;/strong&gt; Content ideas, project names, productivity workflows, all using verbalized sampling so I get diverse options with probability weights instead of one safe median answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory compounding:&lt;/strong&gt; Daily logs roll up to weekly memory promotion. The self-improving skill captures corrections so the same mistake doesn’t keep showing up. Each week she’s a little less stupid about my preferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly Obsidian reports:&lt;/strong&gt; Genuinely useful vault digests. What changed. What’s worth re-reading. What’s collecting dust and should be archived or thrown out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this involves Sabrina pretending to run scripts she can’t run. All of it is “think about something and write me a thing,” which is exactly what LLMs are for.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Built
&lt;/h2&gt;

&lt;p&gt;Six weeks ago I wanted an autonomous AI agent. What I have now is better and stupider at the same time.&lt;/p&gt;

&lt;p&gt;The discovery, after all the silent hallucinations and config schema drift and tool-calls-as-text bullshit: AI agents are great at the &lt;em&gt;thinking&lt;/em&gt; parts: research, writing, brainstorming, synthesis. They’re terrible at the &lt;em&gt;doing&lt;/em&gt; parts: running scripts reliably, admitting they can’t do something, not making shit up when cornered.&lt;/p&gt;

&lt;p&gt;So I built around the doing and leaned into the thinking. Sabrina does real work now. She just doesn’t run the cron jobs herself anymore: the host crontab does. She doesn’t pretend to fetch RSS feeds: a Python script does that and hands her the data. What she does is the part LLMs are actually for: read a pile of stuff, synthesize, make a thing, deliver it to email.&lt;/p&gt;

&lt;p&gt;The host cron + agent hybrid is the pattern that actually ships. The agent is the writer, not the operator. The operator is &lt;code&gt;cron&lt;/code&gt; and a Python interpreter, both of which have been doing their jobs reliably since long before transformers were a thing.&lt;/p&gt;

&lt;p&gt;Six weeks to figure out what should have been obvious from the start: stop using language models for things that aren’t language. At least that’s what I’m going with until I have time to go through another continuous break-then-fix cycle.&lt;/p&gt;

</description>
      <category>aiagents</category>
    </item>
    <item>
      <title>April 2026 Model Roundup: Opus 4.7 Official, DeepSeek V4 Open-Sources 1M Context, and GPT-5.5 Upstaged the GPT-6 Hype</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Fri, 24 Apr 2026 12:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/april-2026-model-roundup-opus-47-official-deepseek-v4-open-sources-1m-context-and-gpt-55-47m1</link>
      <guid>https://forem.com/eristoddle/april-2026-model-roundup-opus-47-official-deepseek-v4-open-sources-1m-context-and-gpt-55-47m1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkub0qcd717p9b0keoh8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkub0qcd717p9b0keoh8w.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two weeks into the month, developers discovered their Gemini API bills had exploded. Google’s billing system was charging for approximately 114 internal search queries per API call with grounding enabled. That was the story I started writing. By the time April 24 arrived, three new models had officially launched, the “Still Waiting for GPT-6” watch had ended not with GPT-6 but with GPT-5.5, and DeepSeek V4 dropped today with a 1M context window under Apache 2.0, on the same day GPT-5.5 went live, apparently just to split the news cycle.&lt;/p&gt;

&lt;p&gt;This roundup covers April 2026 so far.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;When Google Billed 114x&lt;/li&gt;
&lt;li&gt;What Actually Moved This Week&lt;/li&gt;
&lt;li&gt;Claude Opus 4.7 Is Official — and the Cost Story Is Better Than Expected&lt;/li&gt;
&lt;li&gt;Hype Check: Mimo V2 Pro, One Month In&lt;/li&gt;
&lt;li&gt;Kimi K2.6: The Open-Source Agentic Coding Model Nobody Covered&lt;/li&gt;
&lt;li&gt;Meta Broke Open Source Hearts&lt;/li&gt;
&lt;li&gt;The Models That Cost Almost Nothing (No, Really)&lt;/li&gt;
&lt;li&gt;The Hidden Tax: How Sonnet 4.6 Can Still Cost More Than Opus&lt;/li&gt;
&lt;li&gt;GPT-5.5 Shipped Yesterday, Not GPT-6, and DeepSeek V4 Dropped Today&lt;/li&gt;
&lt;li&gt;The Actual Takeaways&lt;/li&gt;
&lt;li&gt;Read for Yourself&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When Google Billed 114x
&lt;/h2&gt;

&lt;p&gt;Gemini 3 Flash Preview is Google’s high-volume, reasonably priced model: $0.50/M input, $3/M output, 1M context window. It’s been running at #4 on OpenRouter by weekly token volume. A lot of people have pipelines running on it. The “search grounding” feature, which lets the model query Google Search to ground its responses in real-time information, sounds great on paper.&lt;/p&gt;

&lt;p&gt;Turns out the billing for that feature had a misconfiguration. For every API call, users were being billed for roughly &lt;strong&gt;114 separate search queries&lt;/strong&gt; rather than the actual number of queries they used. The “Generate content search query Gemini 3” SKU in users’ dashboards was showing 10-15x the expected line items. Actual grounding call frequency had decreased, but bills exploded anyway.&lt;/p&gt;

&lt;p&gt;The scale of damage before Google caught it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple developers reporting 4x–10x cost increases on flat or declining usage&lt;/li&gt;
&lt;li&gt;€1,000+ additional daily costs for at least one European developer&lt;/li&gt;
&lt;li&gt;₩340,000 in two days for a Korean developer&lt;/li&gt;
&lt;li&gt;Google identified the root cause on April 14, committed to fixing the misconfiguration and correcting previous bills&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Update as of April 24:&lt;/strong&gt; Google engineer Ali Cevik confirmed on the developer forum that the billing misconfiguration is fixed going forward. Refunds are being processed, but Google has provided no specific timeline. Forum responses from their team said “by the end of the month” without committing to anything more specific. Affected users are reporting that support is framing corrections as “one-time exceptions” rather than acknowledging the systemic bug. Re-enabling grounding is probably safe now for new calls, but check your billing dashboard before turning it back on, and watch the first few days’ charges carefully.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://discuss.ai.google.dev/t/sudden-cost-spike-with-gemini-3-flash-preview-despite-decreased-usage-april-2026/139138" rel="noopener noreferrer"&gt;thread&lt;/a&gt; on the Google AI Developers Forum is worth reading if you’re running anything on Gemini 3 Flash with grounding enabled. The concrete lesson here: &lt;strong&gt;search grounding is billed separately from token usage&lt;/strong&gt;, and before you enable any “enhanced” feature on a high-volume model, understand exactly what gets metered and how. Don’t assume the main pricing page tells the whole story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83ndcooek9nw3uqvnnpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83ndcooek9nw3uqvnnpb.png" alt="April 2026 LLM Model Ranking" width="800" height="815"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Moved This Week
&lt;/h2&gt;

&lt;p&gt;Here’s the OpenRouter picture as of the week ending April 24:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Weekly Tokens&lt;/th&gt;
&lt;th&gt;WoW Change&lt;/th&gt;
&lt;th&gt;Arena Overall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;1.38T&lt;/td&gt;
&lt;td&gt;+3%&lt;/td&gt;
&lt;td&gt;#3 (1496)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;1.32T&lt;/td&gt;
&lt;td&gt;+3%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Gemini 3 Flash Preview&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;1.11T&lt;/td&gt;
&lt;td&gt;stable&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;951B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4,221%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;#1 (1503, thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Mimo V2 Pro&lt;/td&gt;
&lt;td&gt;Xiaomi&lt;/td&gt;
&lt;td&gt;902B&lt;/td&gt;
&lt;td&gt;+9%&lt;/td&gt;
&lt;td&gt;not yet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;Minimax&lt;/td&gt;
&lt;td&gt;856B&lt;/td&gt;
&lt;td&gt;+22%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;MiniMax M2.7&lt;/td&gt;
&lt;td&gt;Minimax&lt;/td&gt;
&lt;td&gt;813B&lt;/td&gt;
&lt;td&gt;+24%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Kimi K2.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moonshot AI&lt;/td&gt;
&lt;td&gt;792B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;New&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;not yet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;756B&lt;/td&gt;
&lt;td&gt;+46%&lt;/td&gt;
&lt;td&gt;#2 (1503, thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Grok 4.1 Fast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;X.AI&lt;/td&gt;
&lt;td&gt;700B&lt;/td&gt;
&lt;td&gt;+33%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three new entries this week: Claude Opus 4.7 at #4 with a 4,221% spike, Kimi K2.6 debuting at #8 on its first week, and Grok 4.1 Fast at #10. Claude Opus 4.6, which was briefly the second-most-used model, dropped to #9 as people migrated to 4.7.&lt;/p&gt;

&lt;p&gt;The stable story continues in the background: Claude Sonnet 4.6 and DeepSeek V3.2 are running neck and neck at the top, both at a slow +3% WoW. That’s real production traffic, not evaluation runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Opus 4.7 Is Official — and the Cost Story Is Better Than Expected
&lt;/h2&gt;

&lt;p&gt;Anthropic launched Claude Opus 4.7 on April 16. It’s been sitting in Arena’s blind comparison system for a few weeks, and it’s now publicly available on the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.&lt;/p&gt;

&lt;p&gt;Pricing: &lt;strong&gt;$5/M input, $25/M output&lt;/strong&gt; — unchanged from Opus 4.6. That’s the headline.&lt;/p&gt;

&lt;p&gt;The real story is what the model does to your costs in practice. Artificial Analysis ran Opus 4.7 through their GDPVal-AA benchmark suite (44 occupations, 9 industries) and found it uses roughly &lt;strong&gt;35% fewer output tokens than Opus 4.6&lt;/strong&gt; to complete the same tasks. The practical effect: real-world costs on Opus 4.7 run approximately 11% lower than Opus 4.6 at the same stated price per token.&lt;/p&gt;

&lt;p&gt;There’s a caveat on the input side. The 4.7 tokenizer is less efficient, generating up to 35% more tokens from the same input text depending on content type. For workloads with heavy, repeated system prompts or long document context, this can offset some of the output savings. Prompt caching (available at roughly 10% of the input rate) largely neutralizes this.&lt;/p&gt;
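
&lt;p&gt;Whether that nets out in your favor depends on your input/output mix, so it’s worth plugging in your own telemetry. A sketch; the 60/40 mix below is invented, and the input inflation is the worst case:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Per-task cost, Opus 4.6 vs 4.7, under the reported token shifts.
# $5/M in, $25/M out for both; the task mix here is hypothetical.
def task_cost(input_m, output_m, in_price=5.0, out_price=25.0):
    return input_m * in_price + output_m * out_price

base_in, base_out = 0.6, 0.4  # M tokens per task on 4.6, made up
opus_46 = task_cost(base_in, base_out)
opus_47 = task_cost(base_in * 1.35, base_out * 0.65)  # worst-case input, -35% output
print(f"4.6: ${opus_46:.2f}  4.7: ${opus_47:.2f}  ({opus_47 / opus_46 - 1:+.0%})")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;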

&lt;p&gt;&lt;strong&gt;Performance numbers that matter:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Artificial Analysis Intelligence Index: &lt;strong&gt;57&lt;/strong&gt; (up 4 points from Opus 4.6, tied with GPT-5.4 and Gemini 3.1 Pro)&lt;/li&gt;
&lt;li&gt;GDPVal-AA: &lt;strong&gt;1,753 Elo&lt;/strong&gt; — 79 points ahead of the next model on real-world knowledge work&lt;/li&gt;
&lt;li&gt;Hallucination rate: &lt;strong&gt;36%&lt;/strong&gt; (down from 61% on Opus 4.6, achieved through more frequent abstention)&lt;/li&gt;
&lt;li&gt;Arena: &lt;strong&gt;#1 tied&lt;/strong&gt; at 1503 Elo (with thinking mode), #4 at 1494 without thinking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 4,221% WoW jump on OpenRouter is a curiosity spike plus a migration wave from people moving off 4.6. By next week you’ll see whether it settles into sustained usage or was just upgrade traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New cybersecurity guardrails:&lt;/strong&gt; Anthropic added automatic detection and blocking for prohibited cybersecurity uses. Security professionals doing legitimate work (pen testing, vuln research, red-teaming) need to join their new Cyber Verification Program to preserve access to those capabilities on 4.7.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hype Check: Mimo V2 Pro, One Month In
&lt;/h2&gt;

&lt;p&gt;Mimo V2 Pro shot up 140% WoW in the April 14 data. Now, with another week of data, it’s at +9%. The spike is over and it’s settling into a real usage tier.&lt;/p&gt;

&lt;p&gt;Xiaomi’s flagship foundation model: over 1 trillion total parameters, 42 billion active (MoE architecture), $1/M input, $3/M output, 1 million token context window. Benchmarks put it at 49 on the Artificial Analysis Intelligence Index.&lt;/p&gt;

&lt;p&gt;The +140% spike was the evaluation-and-curiosity phase. The +9% continuing growth suggests people who ran it liked it enough to keep using it. Still no Arena votes worth analyzing. At ~5 weeks old, the model hasn’t been around long enough for production validation at scale.&lt;/p&gt;

&lt;p&gt;Check back in 2–3 more weeks. If it accumulates Arena votes and holds a respectable position there, the benchmarks were real. Stable OpenRouter usage without Arena presence is ambiguous: it could mean quality users who prefer specific capabilities, or it could mean low-friction API access driving test traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kimi K2.6: The Open-Source Agentic Coding Model Nobody Covered
&lt;/h2&gt;

&lt;p&gt;Moonshot AI released Kimi K2.6 on April 20 and it debuted at #8 on OpenRouter in its first week. You probably missed it because the same week had Claude Opus 4.7’s official launch, Grok 4.3 Beta, and the GPT-5.5 pre-announcement noise.&lt;/p&gt;

&lt;p&gt;What it is: a 1-trillion-parameter MoE model with 32B active parameters, a &lt;strong&gt;262,144-token context window&lt;/strong&gt;, vision, and agentic capabilities. Weights published on Hugging Face under a &lt;strong&gt;Modified MIT License&lt;/strong&gt;: full open weights, commercially usable.&lt;/p&gt;

&lt;p&gt;What it’s built for: long-horizon coding agents, front-end generation from natural language, and massively parallel agent swarms. Moonshot’s documentation specifically highlights scaling to 300 sub-agents and 4,000 coordinated steps in a single session. If you’re building orchestration-heavy multi-agent systems, this is the open-weight model that was designed from the ground up for that use case.&lt;/p&gt;

&lt;p&gt;Benchmark comparisons are mixed but solid. On SWE-Bench Pro it outperforms DeepSeek V4-Pro (58.6 vs 55.4). On LiveCodeBench it trails V4-Pro (89.6 vs 93.5). On competitive coding (Codeforces), both trail GPT-5.5.&lt;/p&gt;

&lt;p&gt;No pricing table yet because it’s primarily a self-hosted model. Kimi API pricing for hosted inference isn’t broadly published yet. For the open-weights version: the cost is your inference infrastructure. A 32B-active MoE runs reasonably on mid-tier GPU setups.&lt;/p&gt;

&lt;p&gt;DeepSeek V4 (more on that below) is the stronger model by most closed benchmarks, and V4-Pro wins on context too (1M vs K2.6’s 262K). What K2.6 has going for it is the license: the MIT-derived terms are cleaner than Apache 2.0 for certain commercial use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meta Broke Open Source Hearts
&lt;/h2&gt;

&lt;p&gt;Llama made Meta relevant in the AI developer world. Open weights, commercial use, the whole deal. Llama 4 dropped in 2025 with a 10 million token context window and impressive parameter counts. The developer community built on it. People ran it locally, fine-tuned it, deployed it. Meta was the company that understood that open source was ecosystem building.&lt;/p&gt;

&lt;p&gt;Then on April 8, Muse Spark dropped from Meta’s new “Superintelligence Labs.” Proprietary model. Not open weights. API in private preview. To try it on the web, you need a Facebook or Instagram login.&lt;/p&gt;

&lt;p&gt;Meta went from an Artificial Analysis Index score of 18 with Llama 4 Maverick to 52 with Muse Spark. That’s not a modest improvement. And in Arena’s head-to-head voting, Muse Spark is sitting at #6 overall with an Elo of 1492, beating GPT-5.4-high in actual user preference votes.&lt;/p&gt;

&lt;p&gt;So the model is legitimately good. &lt;strong&gt;As of April 24, the API remains private preview only: no public access, no announced pricing, no timeline for broader availability.&lt;/strong&gt; Priority access is going to healthcare, education, and enterprise research partners. If you’re building something that needs Muse Spark today, you’re waiting.&lt;/p&gt;

&lt;p&gt;“Meta learned from OpenAI: make the good stuff closed, give the community the crumbs.” I’ve been seeing that take everywhere this month, and I don’t think it’s entirely wrong.&lt;/p&gt;

&lt;p&gt;The broader question this raises: if every lab eventually closes off its best models, what’s the long-term roadmap for building on open weights? DeepSeek and Moonshot are still playing the open-source game. Kimi K2.6 is MIT-licensed. And DeepSeek V4 dropped today with Apache 2.0 weights on Hugging Face. The pattern is becoming hard to ignore, but there are still holdouts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Models That Cost Almost Nothing (No, Really)
&lt;/h2&gt;

&lt;p&gt;I need to talk about MiniMax M2.5 because I’ve been mentioning it in passing and it deserves its own paragraph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$0.118 per million input tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s twelve cents per million tokens. For a model that’s sitting at #6 on OpenRouter by weekly volume and growing at 22% WoW. For a model that scores 80.2% on SWE-Bench Verified: which is roughly what Claude’s flagship hits. With a 196,608 token context window. And it’s good enough at agentic tasks that it’s been called out repeatedly in the Latent.Space local model community as the go-to for tool-heavy applications.&lt;/p&gt;

&lt;p&gt;The pricing table this week:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input $/1M&lt;/th&gt;
&lt;th&gt;Output $/1M&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Arena #1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;#1 volume on OR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;td&gt;Tops AA Index at 60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5 Pro&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;$180.00&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;td&gt;Research tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4-Pro&lt;/td&gt;
&lt;td&gt;$1.74&lt;/td&gt;
&lt;td&gt;$3.48&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Apache 2.0, released today&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4-Flash&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Apache 2.0, released today&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3.2&lt;/td&gt;
&lt;td&gt;$0.259&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;163K&lt;/td&gt;
&lt;td&gt;3-mo validated, #2 on OR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;$0.118&lt;/td&gt;
&lt;td&gt;$0.99&lt;/td&gt;
&lt;td&gt;196K&lt;/td&gt;
&lt;td&gt;80.2% SWE-Bench&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.7&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;$1.20&lt;/td&gt;
&lt;td&gt;196K&lt;/td&gt;
&lt;td&gt;Upgraded M2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mimo V2 Pro&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Settling into usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Flash&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Grounding: proceed cautiously&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When MiniMax M2.5 matches or beats Sonnet on SWE-Bench at roughly 1/25th the per-token input cost, we’re in strange territory. Either the benchmark is missing something important about real-world usability, or there’s value being left on the table by anyone running default Claude endpoints on agentic coding tasks without at least testing alternatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My actual picks this week, by use case:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Budget, coding/agentic&lt;/strong&gt; : &lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt; at $0.14/M: just dropped today, open source. Test it immediately. MiniMax M2.5 at $0.118/M is still the safety pick if you want community-validated quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget, general&lt;/strong&gt; : DeepSeek V3.2. $0.42/M output, three months of community validation, strong on math and code. Nothing changed here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Balanced&lt;/strong&gt; : DeepSeek V4-Pro at $1.74/M input, $3.48/M output with 1M context. Undercuts everything at this quality tier by a factor of 5-8x.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium, coding&lt;/strong&gt; : Claude Sonnet 4.6 or Claude Opus 4.7 depending on your task complexity and whether the token economics work out (see the next section).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you have to have the absolute best&lt;/strong&gt; : Claude Opus 4.7 with thinking (Arena #1, 1503 Elo) or GPT-5.5 (AA Index #1 at 60). Accept the pricing gap vs open-source alternatives.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hidden Tax: How Sonnet 4.6 Can Still Cost More Than Opus
&lt;/h2&gt;

&lt;p&gt;Claude Sonnet 4.6 is marketed as the economical alternative to Opus. It’s $3/M input versus Opus 4.7’s $5/M: a modest 1.67x difference on input. But that’s not where your money goes on agentic workloads.&lt;/p&gt;

&lt;p&gt;On the Artificial Analysis GDPVal-AA benchmark, Sonnet 4.6 generates &lt;strong&gt;4.5x more output tokens&lt;/strong&gt; than Opus 4.6 to complete the same tasks. The model isn’t worse. It’s producing more intermediate reasoning, more scaffolding, more steps. But output tokens are what you pay for.&lt;/p&gt;

&lt;p&gt;The math with correct current pricing: Sonnet 4.6 at 4.5x tokens × $15/M output = &lt;strong&gt;$67.50/M effective output cost&lt;/strong&gt; versus Opus 4.7 at $25/M output. Sonnet costs 2.7x more per equivalent task in heavy agentic use.&lt;/p&gt;
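
&lt;p&gt;If you want to sanity-check that against your own traffic, the arithmetic is trivial to script. A minimal sketch using the prices above; the 4.5x multiplier is the benchmark’s number, so swap in token counts measured on your own workload:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-envelope effective output cost. The 4.5x multiplier is the
# GDPVal-AA figure (measured against Opus 4.6), not a universal constant.

def effective_output_cost(price_per_m_tokens, token_multiplier):
    """Dollars per million 'task-equivalent' output tokens."""
    return price_per_m_tokens * token_multiplier

sonnet_46 = effective_output_cost(15.00, 4.5)  # 67.50
opus_47 = effective_output_cost(25.00, 1.0)    # 25.00
print(f"Sonnet 4.6 effective: ${sonnet_46:.2f}/M")
print(f"Opus 4.7 effective:   ${opus_47:.2f}/M")
print(f"Sonnet / Opus:        {sonnet_46 / opus_47:.1f}x")  # 2.7x

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;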

&lt;p&gt;The practical takeaway: if you’re running document summarization, one-shot Q&amp;amp;A, light code generation, Sonnet 4.6 is cheaper and you should use it. If you’re running agentic pipelines, autonomous coding agents, extended tool-use workflows: &lt;strong&gt;benchmark on your actual workload before you assume Sonnet saves money&lt;/strong&gt;. The pricing page isn’t lying; the intuitive comparison probably is.&lt;/p&gt;

&lt;p&gt;And now there’s a third option in the mix: Opus 4.7, which uses ~35% fewer output tokens than Opus 4.6 at the same $25/M rate. For heavy agentic use, Opus 4.7 may be the cheapest of the three Anthropic options. Run your own numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-5.5 Shipped Yesterday, Not GPT-6, and DeepSeek V4 Dropped Today
&lt;/h2&gt;

&lt;p&gt;OpenAI finished pre-training the model codenamed “Spud” on March 24. An April 14 release date came and went with nothing. Then on April 23, OpenAI shipped &lt;strong&gt;GPT-5.5&lt;/strong&gt; instead — their most capable model to date and, per their description, the first fully retrained base since GPT-4.5.&lt;/p&gt;

&lt;p&gt;It’s not GPT-6. But it’s real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5 numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Artificial Analysis Intelligence Index: 60&lt;/strong&gt; — three points ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, both at 57&lt;/li&gt;
&lt;li&gt;Terminal-Bench 2.0: 82.7% (vs 75.1% for GPT-5.4)&lt;/li&gt;
&lt;li&gt;Expert-SWE: 73.1% (vs 68.5% for GPT-5.4)&lt;/li&gt;
&lt;li&gt;Pricing: &lt;strong&gt;$5/M input, $30/M output&lt;/strong&gt; — double the cost of GPT-5.4 on output&lt;/li&gt;
&lt;li&gt;GPT-5.5 Pro tier: $30/M input, $180/M output (research/enterprise)&lt;/li&gt;
&lt;li&gt;Context window: 2M tokens (1M longer than most competitors)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 40% reduction in output token usage that OpenAI claims keeps the effective cost increase to roughly 20% despite the doubled price per token. That math depends entirely on your workload matching the benchmark profile.&lt;/p&gt;
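
&lt;p&gt;Spelled out, with OpenAI’s 40% figure treated as an assumption rather than a given:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# GPT-5.4 output was $15/M; GPT-5.5 is $30/M but, per OpenAI, uses ~40%
# fewer output tokens. Both numbers are claims to verify on your workload.
token_ratio = 0.60            # 40% fewer tokens
price_ratio = 30.00 / 15.00   # doubled output price
print(f"Effective cost multiplier: {token_ratio * price_ratio:.2f}x")  # 1.20x

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;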

&lt;p&gt;&lt;strong&gt;Now the same-day plot twist:&lt;/strong&gt; On April 24 — today — &lt;strong&gt;DeepSeek V4&lt;/strong&gt; dropped with open weights under Apache 2.0.&lt;/p&gt;

&lt;p&gt;DeepSeek V4-Pro: 1.6T total parameters, 49B active (MoE), 1M context window, $1.74/M input, $3.48/M output. V4-Pro output is &lt;strong&gt;8.6x cheaper than GPT-5.5&lt;/strong&gt; and &lt;strong&gt;21x cheaper than Claude Opus 4.7&lt;/strong&gt; at stated per-token rates.&lt;/p&gt;

&lt;p&gt;DeepSeek V4-Flash: 284B total parameters, 13B active, 1M context, $0.14/M input, $0.28/M output.&lt;/p&gt;

&lt;p&gt;Both variants under Apache 2.0, weights on Hugging Face and ModelScope today.&lt;/p&gt;

&lt;p&gt;Performance on competitive coding (Codeforces): V4-Pro scores 3,206 vs GPT-5.5’s 3,168 — V4-Pro wins. On SWE-Bench Pro, Kimi K2.6 beats V4-Pro (58.6 vs 55.4). On long-context retrieval (MRCR 1M), Claude Opus 4.6 beats V4-Pro (92.9 vs 83.5). So V4-Pro isn’t universally better — but at $3.48/M output vs $25-30/M for closed alternatives, it doesn’t need to be universally better to be the right answer for most workloads.&lt;/p&gt;

&lt;p&gt;GPT-6 “Spud”: still hasn’t arrived. Polymarket has it at 72% by April 30 and 95%+ by June 30. At this point I’ll believe it when I see it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Other things still in the pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Mythos Preview&lt;/strong&gt; : still available only to approximately 50 partner organizations since April 7. Cybersecurity focus. $25/M input, $125/M output. Nothing changed here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grok 4.3 Beta&lt;/strong&gt; : dropped April 17 with native video understanding, PDF/PowerPoint generation, and enhanced long-context processing. Not yet on OpenRouter broadly. Still in xAI testing phase.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Actual Takeaways
&lt;/h2&gt;

&lt;p&gt;April 2026 shipped more major model releases than any previous month in AI history, and then DeepSeek V4 and GPT-5.5 both dropped on the same day at the end of it. The landscape looks different today than it did two weeks ago.&lt;/p&gt;

&lt;p&gt;What actually matters as of April 24:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gemini 3 Flash grounding billing is fixed going forward&lt;/strong&gt; but check your billing dashboard before re-enabling, and watch the first few days’ charges carefully. Refunds are in process; don’t expect speed on that.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DeepSeek V4 just dropped open-source with 1M context and Apache 2.0.&lt;/strong&gt; V4-Flash at $0.14/$0.28 and V4-Pro at $1.74/$3.48. Test it today. It’s too new for community validation but the pedigree is real and the pricing is absurd for the quality tier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MiniMax M2.5 at $0.118/M and 80.2% SWE-bench is still the community-validated budget pick for agentic coding.&lt;/strong&gt; Three weeks of steady usage volume with no hype cycle. DeepSeek V4-Flash is the new challenger — if validation holds over the next few weeks, it may displace M2.5.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude Opus 4.7 is the easiest top-tier upgrade you’ll make this month.&lt;/strong&gt; Same price as 4.6, 35% fewer output tokens, Arena #1. If you’re running Opus 4.6 today, just switch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Benchmark Sonnet 4.6 vs Opus 4.7 on your actual agentic workloads.&lt;/strong&gt; Opus 4.7’s improved token efficiency means the economics may favor it over Sonnet for complex agent tasks. Run the math on your usage before assuming Sonnet is cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mimo V2 Pro and Kimi K2.6 need another 2-3 weeks.&lt;/strong&gt; Both show real usage momentum. Neither has Arena data yet. Hold the investment thesis pending community validation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPT-5.5 topped the Artificial Analysis Intelligence Index at 60.&lt;/strong&gt; That matters, but at $30/M output you’re paying a substantial premium over DeepSeek V4-Pro ($3.48/M) for about 3 points on a benchmark. Evaluate whether that delta maps to your actual workload before committing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model evaluation cycle for “is this the right choice?” is now measured in weeks, not quarters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Read for Yourself
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://discuss.ai.google.dev/t/sudden-cost-spike-with-gemini-3-flash-preview-despite-decreased-usage-april-2026/139138" rel="noopener noreferrer"&gt;Gemini 3 Flash billing bug thread&lt;/a&gt;&lt;/strong&gt; — r/[Google AI Dev Forum] — Developer discussion of the billing disaster, with cost breakdowns and screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://techcrunch.com/2026/04/08/meta-debuts-the-muse-spark-model-in-a-ground-up-overhaul-of-its-ai/" rel="noopener noreferrer"&gt;Meta introduces Muse Spark&lt;/a&gt;&lt;/strong&gt; — TechCrunch — The story of Meta’s open-source pivot; comment threads are heated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.latent.space/p/ainews-top-local-models-list-april" rel="noopener noreferrer"&gt;Top Local Models List April 2026&lt;/a&gt;&lt;/strong&gt; — Latent.Space — Community-validated rankings for open-weight models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://decrypt.co/362633/xiaomi-mimo-v2-pro-review-so-good-mistaken-deepseek-v4" rel="noopener noreferrer"&gt;Mimo V2 Pro: mistaken for DeepSeek V4&lt;/a&gt;&lt;/strong&gt; — Decrypt — The review that captures the week’s Xiaomi surprise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://awesomeagents.ai/reviews/review-claude-sonnet-4-6/" rel="noopener noreferrer"&gt;Claude Sonnet 4.6: the workhorse that ate the flagship&lt;/a&gt;&lt;/strong&gt; — AwesomeAgents — Honest multi-week review with the token cost caveat&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>largelanguagemodels</category>
    </item>
    <item>
      <title>Microsoft APM - Managing AI Context Like a Dependency Problem</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Mon, 13 Apr 2026 12:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/microsoft-apm-managing-ai-context-like-a-dependency-problem-5361</link>
      <guid>https://forem.com/eristoddle/microsoft-apm-managing-ai-context-like-a-dependency-problem-5361</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1av2e4p8h7ab7q3jbtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1av2e4p8h7ab7q3jbtx.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It started with a small problem that wouldn’t stop nagging me.&lt;/p&gt;

&lt;p&gt;I had AI coding tools scattered across machines, each one configured slightly differently, each one producing slightly different results. My Claude Code setup on my laptop didn’t match my desktop. I had created skills for these coding agents, but the skills I could use depended on which machine I was using. It was an “it works on my machine” issue, but they were all my machines.&lt;/p&gt;

&lt;p&gt;I started using &lt;a href="https://github.com/runkids/skillshare" rel="noopener noreferrer"&gt;Skillshare&lt;/a&gt; a little while ago, and it helps somewhat, but it focuses on syncing skills between the coding agents’ configs in your user folder. That’s useful for some skills, but not all of them, because sometimes you only need a skill at the repo level. And putting all your coding agent skills at the user folder level not only pollutes your context but makes it hard to find a specific skill when you want one.&lt;/p&gt;

&lt;p&gt;So when I was asked to look for an enterprise tool to manage skills with a focus on GitHub Copilot, I found &lt;a href="https://github.com/microsoft/apm/blob/main/README.md" rel="noopener noreferrer"&gt;Microsoft APM&lt;/a&gt;, Agent Package Manager.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Blueprint: APM’s Declarative Infrastructure&lt;/li&gt;
&lt;li&gt;Organizing the Monorepo&lt;/li&gt;
&lt;li&gt;Solving Context Pollution with Intelligent Compilation&lt;/li&gt;
&lt;li&gt;Automating the Standards&lt;/li&gt;
&lt;li&gt;Local Iteration and the Playground Strategy&lt;/li&gt;
&lt;li&gt;From Supervisor to Architect&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Blueprint: APM’s Declarative Infrastructure
&lt;/h2&gt;

&lt;p&gt;Many of us are still prompting like it’s 2023. Copy-paste a prompt. Maybe drop a &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; in the repo. Hope for the best. When the AI does something dumb, yell at it in the chat window and hope it remembers next time. It won’t.&lt;/p&gt;

&lt;p&gt;APM replaces all of that with a declarative, version-locked workflow that treats AI context the same way we treat dependencies. You declare what you need in &lt;code&gt;apm.yml&lt;/code&gt;, lock it with &lt;code&gt;apm.lock.yaml&lt;/code&gt;, and install it with &lt;code&gt;apm install&lt;/code&gt;. If that sounds like npm or pip, good. That’s the point. We solved dependency management for code twenty years ago. It’s insane that we’re still managing AI context by hand.&lt;/p&gt;
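
&lt;p&gt;If you haven’t seen one, an &lt;code&gt;apm.yml&lt;/code&gt; is about as small as manifests get. A minimal sketch with made-up package names, using the path-style dependency that shows up again later in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# apm.yml (hypothetical example; path-style dependencies only)
dependencies:
  - path: ../department/standards-security
  - path: ../team/frontend-react
# 'apm install' resolves these and pins them in apm.lock.yaml,
# the same way npm writes package-lock.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;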

&lt;p&gt;The system is built around seven primitives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Instructions&lt;/strong&gt; — The guardrails. Think &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; files that tell the AI how to behave.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills&lt;/strong&gt; — Reusable capabilities the AI can invoke.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; — Executable task templates with defined inputs. Called commands in Claude Code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt; — Specialized sub-agents with their own instructions and tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt; — Shell commands that fire on specific events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugins&lt;/strong&gt; — Extensions that add functionality to the agent runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Servers&lt;/strong&gt; — Model Context Protocol servers that give agents access to external tools and data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The part that sold me: APM doesn’t run a daemon or require a runtime. It populates your existing &lt;code&gt;.github/&lt;/code&gt;, &lt;code&gt;.claude/&lt;/code&gt;, and &lt;code&gt;.cursor/&lt;/code&gt; folders with native configuration files. The agents just pick them up. If you delete APM tomorrow, those files still work. Zero lock-in. That’s how you know someone thought about this for more than a weekend.&lt;/p&gt;
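
&lt;p&gt;To make that concrete, here’s roughly what a repo looks like after an install and a compile. This is my illustration, limited to the outputs the docs and this post actually mention:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.github/          native GitHub Copilot configuration
.claude/          CLAUDE.md, skills
.cursor/          rules
AGENTS.md         compiled instruction hierarchy (apm compile)
apm_modules/      installed packages, like node_modules/
apm.lock.yaml     pinned dependency versions

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;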

&lt;p&gt;&lt;a href="https://microsoft.github.io/apm/key-concepts/" rel="noopener noreferrer"&gt;Key Concepts Guide&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Organizing the Monorepo
&lt;/h2&gt;

&lt;p&gt;So you’ve got seven types of primitives and you want to share them across multiple projects. Maybe across a whole team. Maybe across an entire engineering organization. You need structure, or you’ll drown in conflicting instructions and duplicated skills within a month. For now, this works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Department&lt;/strong&gt; — The top layer. These are your organization-wide standards. Security policies. Code review requirements. Compliance guardrails. The stuff that applies everywhere and nobody gets to opt out of. Think of it like your company’s engineering handbook, except the AI actually reads it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team&lt;/strong&gt; — The middle layer. Your team’s specializations. Maybe your frontend team has specific React patterns. Your data team has dbt conventions. Your platform team has infrastructure standards. These inherit from Department but add domain-specific knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project&lt;/strong&gt; — The bottom layer. Local context for a specific repo. The stuff that only matters here. Your project’s architecture decisions, custom tooling, specific quirks.&lt;/p&gt;

&lt;p&gt;In practice, this lives in a monorepo where each layer is a directory containing &lt;strong&gt;virtual subdirectory packages&lt;/strong&gt;. So you might have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/department/standards-security
/department/standards-code-review
/team/frontend-react
/team/data-engineering

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of those is a standalone APM package that can be versioned and depended on independently, but they all live in one repo where you can see the whole picture. You slap &lt;code&gt;CODEOWNERS&lt;/code&gt; on the department folders so nobody changes the security standards without review, but teams get autonomy over their own specializations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://microsoft.github.io/apm/guides/org-packages/" rel="noopener noreferrer"&gt;Org-Wide Packages Pattern&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving Context Pollution with Intelligent Compilation
&lt;/h2&gt;

&lt;p&gt;Here’s a problem I didn’t anticipate until I was neck-deep in it: &lt;strong&gt;context pollution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You’ve got department-level instructions. Team-level instructions. Project-level instructions. Skills from three different packages. Prompts from two more. And now your AI assistant is trying to load all of that into a context window that is not infinite. Irrelevant instructions don’t just waste tokens and degrade performance. Tell an AI too many things and it starts forgetting the important ones.&lt;/p&gt;

&lt;p&gt;APM solves this with &lt;code&gt;apm compile&lt;/code&gt;, which transforms all your scattered primitives into optimized, hierarchical &lt;code&gt;AGENTS.md&lt;/code&gt; files. It figures out which instructions belong at which level and how to structure them so the AI gets the most relevant context first.&lt;/p&gt;

&lt;p&gt;The conflict resolution model is opinionated: &lt;strong&gt;local project files always win&lt;/strong&gt;. If your project has an &lt;code&gt;AGENTS.md&lt;/code&gt; and an installed package also has one, &lt;code&gt;apm install&lt;/code&gt; skips the existing file unless you explicitly &lt;code&gt;--force&lt;/code&gt; it. During &lt;code&gt;apm compile&lt;/code&gt;, instructions get merged intelligently based on file patterns, but your local overrides stay on top. This is the right call. The project knows itself better than any upstream package does.&lt;/p&gt;

&lt;p&gt;I found this out the hard way, naturally. I had a package that defined broad coding standards and a project that had specific exceptions. Without the compile step, the AI was getting contradictory instructions and doing that thing where it apologizes and asks which rule you’d prefer it follow. With &lt;code&gt;apm compile&lt;/code&gt;, the hierarchy is built in. The AI just does the right thing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://microsoft.github.io/apm/guides/compilation/" rel="noopener noreferrer"&gt;Compilation &amp;amp; Optimization Guide&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Automating the Standards
&lt;/h2&gt;

&lt;p&gt;Once you have a pattern, you need a way to reuse it without having to explain it every time. So I built &lt;code&gt;util-apm-builder&lt;/code&gt;: a meta-skill that helps scaffold new packages. Yes, I used an AI tool to build a tool that teaches AI tools how to use AI tools.&lt;/p&gt;

&lt;p&gt;Building this taught me something important about how AI skills can actually be structured. I do have skills that consist of a single &lt;code&gt;SKILL.md&lt;/code&gt;. They just describe what the skill does and include a couple of examples. I have created more advanced skills with workflows and references too. That’s what I was used to in Claude Code.&lt;/p&gt;

&lt;p&gt;But I was in GitHub Copilot world now and the structure it built for the first skill was really interesting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Instructions&lt;/strong&gt; — Guardrails about the monorepo’s directory structure. “Department packages go here. Team packages go there. Don’t create folders outside this hierarchy.” Without these, the AI will hallucinate creative new locations for things.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt; — The technical knowledge base. Manifest schemas. Valid field values. What &lt;code&gt;apm.yml&lt;/code&gt; actually accepts. This is the reference material the AI consults mid-task, and without it, you get manifests that look right but fail validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; — The executable task template. “Create a new package” with defined inputs for name, layer, type. This is what the developer actually triggers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A &lt;code&gt;SKILL.md&lt;/code&gt; at the root makes it a hybrid package: part skill, part instruction set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://microsoft.github.io/apm/guides/agent-workflows/" rel="noopener noreferrer"&gt;Agent Workflows (Experimental)&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Iteration and the Playground Strategy
&lt;/h2&gt;

&lt;p&gt;Let me save you from a mistake I made so you can make different, more interesting mistakes.&lt;/p&gt;

&lt;p&gt;Do not install APM packages at the root of your monorepo during development. I did this. What happens is the AI discovers your package source files in &lt;code&gt;/department/my-package/&lt;/code&gt; AND the deployed copies in &lt;code&gt;apm_modules/&lt;/code&gt; and &lt;code&gt;.claude/&lt;/code&gt;, and now it’s seeing the same instructions twice from two different locations. It doesn’t know which is authoritative. It gets confused. You get confused. Everyone’s confused. It’s a bad time.&lt;/p&gt;

&lt;p&gt;The fix is stupidly simple: create a &lt;code&gt;/local&lt;/code&gt; folder as your playground. It’s a separate workspace where you install packages using relative path dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /local/apm.yml&lt;/span&gt;
&lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../department/util-apm-builder&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../team/frontend-react&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you fast iteration without pushing to a remote registry, and it keeps the source packages and the deployed copies in separate directory trees so the AI doesn’t see double.&lt;/p&gt;

&lt;p&gt;One gotcha: VS Code only discovers skills at the root of an open workspace. So if you’re testing a new skill in &lt;code&gt;/local&lt;/code&gt;, you need to actually open that folder in VS Code, or set up a multi-root workspace that includes it. I spent ten minutes wondering why my skill wasn’t showing up before I figured this out.&lt;/p&gt;

&lt;p&gt;For git discipline: add &lt;code&gt;apm_modules/&lt;/code&gt; to your &lt;code&gt;.gitignore&lt;/code&gt; (it’s like &lt;code&gt;node_modules/&lt;/code&gt;, derived, not source), but commit &lt;code&gt;apm.lock.yaml&lt;/code&gt; and the deployed primitives in &lt;code&gt;.github/&lt;/code&gt; and &lt;code&gt;.claude/&lt;/code&gt;. The lock file ensures reproducibility. The deployed files ensure any developer who clones the repo gets the same AI context without needing to run &lt;code&gt;apm install&lt;/code&gt; first.&lt;/p&gt;
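
&lt;p&gt;In &lt;code&gt;.gitignore&lt;/code&gt; terms, the whole policy is one line; everything else is what you deliberately commit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .gitignore
apm_modules/

# committed on purpose: apm.lock.yaml, .github/, .claude/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;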

&lt;p&gt;&lt;a href="https://microsoft.github.io/apm/guides/dependencies/" rel="noopener noreferrer"&gt;Dependencies &amp;amp; Lockfile Guide&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From Supervisor to Architect
&lt;/h2&gt;

&lt;p&gt;There’s a maturity curve to working with AI coding assistants, and most of us are stuck somewhere in the middle of it.&lt;/p&gt;

&lt;p&gt;At the beginning, you’re a &lt;strong&gt;Supervisor&lt;/strong&gt;. You watch every line the AI writes. You correct it constantly. You paste errors back into the chat. You basically do pair programming where your partner has amnesia and you’re doing all the navigating.&lt;/p&gt;

&lt;p&gt;The next level is what I’ve been doing for a while: running multiple AI tools on multiple projects simultaneously, trusting them with larger chunks of work, letting them plan and execute while I review the output. It’s better, but it’s still reactive. You’re managing agents, not engineering systems.&lt;/p&gt;

&lt;p&gt;What APM enables is the jump to &lt;strong&gt;Architect&lt;/strong&gt;. You define the standards, the guardrails, the knowledge hierarchy, and the execution patterns once. You version them. You distribute them. And then every AI assistant that touches any project in your ecosystem automatically knows how to behave, what standards to follow, and what context matters. You stop supervising individual interactions and start engineering the environment those interactions happen in.&lt;/p&gt;

&lt;p&gt;The best part is the escape hatch. “Back in the day” in AI terms is last month. Who knows what will change. APM’s output is native configuration files: &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;.cursor/rules&lt;/code&gt;, skill definitions. If APM disappears tomorrow, or you decide it’s not for you, those files keep working. You haven’t locked yourself into anything except having better-organized AI context, which is not exactly a downside.&lt;/p&gt;

</description>
      <category>aiassisteddevelopmen</category>
    </item>
    <item>
      <title>I Burned Out on Vibe Coding, Came Back, and Rewrote Everything</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Sun, 08 Feb 2026 07:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/i-burned-out-on-vibe-coding-came-back-and-rewrote-everything-l6i</link>
      <guid>https://forem.com/eristoddle/i-burned-out-on-vibe-coding-came-back-and-rewrote-everything-l6i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zne0f49kpebtq4tpg8g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zne0f49kpebtq4tpg8g.jpg" alt="AI-assisted development" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hit a wall with vibe coding. Not a dramatic crash. More like the slow realization that I’d been sprinting for months and couldn’t remember why. I had 15 projects in various states of “maybe done,” a GitHub commit chart that looked like a heart monitor, and a growing suspicion that I was building things just to build things.&lt;/p&gt;

&lt;p&gt;Fortunately, freelance writing work picked up right around the same time. Enough to actually pay attention to it. So I stepped away from the side projects, wrote about other people’s technology for a change, and let my own code sit untouched for a few months.&lt;/p&gt;

&lt;p&gt;When I came back, I had no patience for bullshit. And I looked at my projects differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Your Vibe-Coded Apps Are Prototypes (And That’s Fine)&lt;/li&gt;
&lt;li&gt;Making “Adding Features” the Feature&lt;/li&gt;
&lt;li&gt;Building Bottom-Up with Verdent and Claude Code&lt;/li&gt;
&lt;li&gt;The 60 Missing APIs&lt;/li&gt;
&lt;li&gt;Making Plans That Any AI Agent Can Execute&lt;/li&gt;
&lt;li&gt;The Same Pattern, Different Project&lt;/li&gt;
&lt;li&gt;What Changed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Your Vibe-Coded Apps Are Prototypes (And That’s Fine)
&lt;/h2&gt;

&lt;p&gt;Here’s the thing I couldn’t see while I was in the thick of it: almost everything I’d built with AI coding tools was a prototype. Not in the dismissive sense. These apps worked. &lt;a href="https://dev.to/eristoddle/building-an-electron-app-from-scratch-with-claude-code-5c03"&gt;EmberText&lt;/a&gt; was a functional Electron writing app. Niche Site Factory could generate and manage content sites. They ran. They did things.&lt;/p&gt;

&lt;p&gt;But they were all built top-down. I’d tell the AI “build me an app that does X” and it would scaffold the whole thing, features and all, in one giant session. The problem is that when you build top-down with AI, you end up with something that works but is almost impossible to extend. Every new feature is a negotiation with the existing architecture. You’re not adding to the app. You’re fighting it.&lt;/p&gt;

&lt;p&gt;EmberText was the clearest example. I built it with Claude Code over about 16 hours and $80 in API costs. It had AI integration, text generation, character relationship graphs, plot scaffolding. Impressive on paper. But by the time I realized it should have had a plugin architecture, I was already deep enough that refactoring meant essentially starting over.&lt;/p&gt;

&lt;p&gt;So that’s what I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making “Adding Features” the Feature
&lt;/h2&gt;

&lt;p&gt;The insight that changed everything was stupid simple: instead of building an app with features, build an app where adding features &lt;em&gt;is&lt;/em&gt; the feature.&lt;/p&gt;

&lt;p&gt;I’d been using Obsidian for years and it’s in my top five favorite pieces of software. It’s incredible for notes, planning, and organization. You can even make it distraction-free for writing. It’s just not the default, and “not the default” matters more than you’d think when you’re trying to get into a flow state. I tried to hack around this with my Daily Prompts plugin that launched an alert and opened a daily note in Zen mode. It worked, kind of, but I was still fighting the tool.&lt;/p&gt;

&lt;p&gt;VS Code is for code. Obsidian is for notes. What’s for writing?&lt;/p&gt;

&lt;p&gt;That question led to Veneer, a complete rewrite of EmberText from scratch. Same idea, a distraction-free writing environment, but built from the ground up as a plugin-first architecture. The “Zen-First Shell” concept: when you open it, you see nothing but a clean sheet and your text. Sidebars, ribbons, status bars exist as ghost elements, hidden by default, appearing only when you hover near the edges or hit a hotkey. Everything that isn’t the writing surface has to earn its right to be on screen.&lt;/p&gt;

&lt;p&gt;And critically, every feature is a plugin. The file explorer? Plugin. The markdown editor? Plugin. The command palette? Plugin. Even core functionality ships as plugins that can be swapped, extended, or replaced. This isn’t just for a future community. It makes the whole thing dramatically easier to build with AI, because each plugin is a self-contained unit with clear boundaries. You can hand an AI agent a plugin spec and let it work without worrying about it breaking everything else.&lt;/p&gt;
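
&lt;p&gt;The contract matters more than the language here, so here’s the shape of the idea as a minimal sketch. The names are invented and the real app is Electron/TypeScript; this is just the smallest illustration of every feature implementing one small interface:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The shape of the idea, not Veneer's actual API. Every feature implements
# one small contract, so an agent can build a plugin against a spec in
# isolation without touching the shell.

class PluginContext:
    """What the shell exposes to plugins: registration points, nothing else."""
    def __init__(self):
        self.commands = {}

    def register_command(self, name, fn):
        self.commands[name] = fn

class Plugin:
    def activate(self, ctx):
        raise NotImplementedError

    def deactivate(self):
        pass

class FileExplorer(Plugin):
    """Even 'core' features ship this way, so they can be swapped or replaced."""
    def activate(self, ctx):
        ctx.register_command("explorer.toggle", lambda: print("toggle file tree"))

# The shell's whole job: load plugins, hand them the context.
ctx = PluginContext()
for plugin in (FileExplorer(),):
    plugin.activate(ctx)
ctx.commands["explorer.toggle"]()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;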

&lt;h2&gt;
  
  
  Building Bottom-Up with Verdent and Claude Code
&lt;/h2&gt;

&lt;p&gt;I used &lt;a href="https://www.verdent.ai/" rel="noopener noreferrer"&gt;Verdent&lt;/a&gt; to build the base application. If you read &lt;a href="https://dev.to/eristoddle/verdent-ai-when-your-ai-coding-assistant-finishes-before-you-can-get-coffee-4e76"&gt;my post about Verdent&lt;/a&gt;, you know this thing is fast. Too fast, honestly. It finished most of the base app, including a file browser sidebar plugin, a markdown editor plugin, and a command palette, in about 220 credits, roughly $20 worth. There were bugs left when I ran out of credits, but the foundation was solid.&lt;/p&gt;

&lt;p&gt;But here’s where the process got interesting. Instead of just continuing to add features on top, I switched to Claude Code and did something I hadn’t done before: I asked it to &lt;em&gt;audit&lt;/em&gt; the codebase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use your skills and check the repo for best practices
- UI
- Is it themable like Obsidian or VS Code
- Plugin Architecture (and compare to VS Code and Obsidian)
- TypeScript
- Electron
- Structure, Naming Conventions

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’d forgotten how many skills and plugins I had installed in Claude Code. When I ran this, it deployed four specialized agents in parallel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Explore subagent&lt;/strong&gt; analyzed the overall project structure, UI patterns, theming, and naming conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture Strategist&lt;/strong&gt; evaluated system design decisions and compared the plugin architecture against VS Code and Obsidian&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kieran TypeScript Reviewer&lt;/strong&gt; checked strict mode compliance, type safety, interface definitions, and generic patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best Practices Researcher&lt;/strong&gt; gathered industry standards and found examples from successful projects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not how I was working six months ago. Six months ago, I would have just told the AI to add the next feature and hoped for the best.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 60 Missing APIs
&lt;/h2&gt;

&lt;p&gt;The audit turned up a lot. Claude gave the codebase an A- (92/100) overall, which sounds great until you read the details. The critical finding was the plugin API gaps. Obsidian provides 60+ plugin APIs. Veneer was missing most of them.&lt;/p&gt;

&lt;p&gt;No modals. No notification system. No context menus anywhere. No way for plugins to subscribe to file or workspace events. No way to extend the CodeMirror editor. The native OS menu had “Open Folder” under “Veneer” instead of “File,” which is the kind of thing that makes you realize the AI built the structure but didn’t think about the conventions.&lt;/p&gt;

&lt;p&gt;I had Claude store all the findings in the project’s docs folder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BEST_PRACTICES_REVIEW.md&lt;/strong&gt; : Everything organized by priority with an implementation roadmap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PLUGIN_API_GAPS.md&lt;/strong&gt; : A detailed comparison against Obsidian and VS Code showing exactly what was missing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Making Plans That Any AI Agent Can Execute
&lt;/h2&gt;

&lt;p&gt;This is the part that parallels &lt;a href="https://mitchellh.com/writing/my-ai-adoption-journey" rel="noopener noreferrer"&gt;Mitchell Hashimoto's AI adoption journey&lt;/a&gt;. He talks about “harness engineering,” the idea that every time an agent makes a mistake, you engineer a solution so it never makes that mistake again. Better implicit prompting. Actual programmed tools. The goal is building up an ecosystem where agents get better over time.&lt;/p&gt;

&lt;p&gt;I’m doing something similar, but at the project planning level. Instead of just fixing bugs as they come, I’m creating structured documentation that any AI tool can pick up and execute. My next prompt to Claude was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Take docs/BEST_PRACTICES_REVIEW.md and docs/PLUGIN_API_GAPS.md and create a
markdown list of TODOs in the docs folder. These should be grouped into tasks
and subtasks. If it is possible to work on some tasks concurrently this should
be mentioned. This file should be able to be used by an AI agent to finish
these tasks. Add enough details to each task to speed up development time.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I have the work planned in 3 phases across 3 TODO files, plus a final phase listing every Obsidian plugin API that Veneer doesn’t have yet for future development. These are all in the docs folder of the project, version controlled, and written so that any agent, Claude Code, Jules, a VS Code extension with Qwen, whatever, can pick them up and start working.&lt;/p&gt;

&lt;p&gt;This is the difference between vibe coding and what I’m doing now. I’m still using AI to do the heavy lifting. But I’m not just throwing prompts at the wall. I’m using one AI tool to build, another to audit, and then creating structured plans that decouple the &lt;em&gt;what needs to happen&lt;/em&gt; from the &lt;em&gt;which tool does it&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Same Pattern, Different Project
&lt;/h2&gt;

&lt;p&gt;This isn’t just how I rebuilt Veneer. I’m doing the same thing with Niche Site Factory. Instead of telling an AI to “build me a niche site generator” (which is roughly what I did the first time), I started over by building the data model first.&lt;/p&gt;

&lt;p&gt;I took a real project, a sci-fi encyclopedia wiki, and used it to design the content structures. A knowledge graph in PostgreSQL with pgvector for embeddings. 2,622 books ingested into the entities table. Flexible JSONB storage that can handle books, concepts, authors, movies, whatever. The data model came first, the application came second.&lt;/p&gt;
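
&lt;p&gt;The flexible part is easier to see as data than as schema. A sketch with invented fields, Python dicts standing in for the JSONB payloads:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Two rows from one hypothetical entities table: a 'kind' discriminator,
# a JSONB payload that differs per kind, and a pgvector embedding alongside
# for semantic lookups.
book = {
    "kind": "book",
    "data": {"title": "Dune", "author": "Frank Herbert", "year": 1965},
    "embedding": [0.021, -0.113, 0.406],  # truncated; real vectors run to hundreds of dims
}
concept = {
    "kind": "concept",
    "data": {"name": "terraforming", "related": ["ecology", "colonization"]},
    "embedding": [0.174, 0.002, -0.091],
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;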

&lt;p&gt;It’s the same bottom-up principle. Don’t build the house and then figure out the foundation. Build the foundation, verify it’s solid, then build up from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;I think the burnout was actually useful. Stepping away let me see the pattern I was stuck in: build fast, hit a wall, start something new. That’s fine when you’re learning the tools. It’s how I figured out what Claude Code, Kiro, Verdent, and Jules are each good at. But at some point, you have to stop prototyping and start building.&lt;/p&gt;

&lt;p&gt;Here’s what’s different now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bottom-up, not top-down.&lt;/strong&gt; Start with the architecture and data model, not the features. Let the AI build on a solid foundation instead of improvising one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit before extending.&lt;/strong&gt; Use AI review tools to find the gaps before you pile on more code. It’s cheaper to fix the structure now than refactor later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plans as portable artifacts.&lt;/strong&gt; Write TODO files detailed enough that any AI agent can execute them. Don’t marry yourself to one tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugins as a development strategy.&lt;/strong&gt; A plugin architecture isn’t just for the community. It makes AI-assisted development dramatically easier because each unit is self-contained.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Past work is research.&lt;/strong&gt; EmberText wasn’t a failure. It was an $80 prototype that taught me exactly what Veneer needed to be.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m still &lt;a href="https://www.stephanmiller.com/category/vibe-coding/" rel="noopener noreferrer"&gt;vibe coding&lt;/a&gt;. I’m just vibing with more structure now and calling it &lt;a href="https://www.stephanmiller.com/category/ai-assisted-development/" rel="noopener noreferrer"&gt;AI-assisted development&lt;/a&gt;. And honestly, after a few months of writing for clients and not touching my own projects, coming back to this with fresh eyes and no patience might be the best thing that happened to any of them.&lt;/p&gt;

</description>
      <category>aiassisteddevelopmen</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>Verdent AI - When Your AI Coding Assistant Finishes Before You Can Get Coffee</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Tue, 23 Sep 2025 13:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/verdent-ai-when-your-ai-coding-assistant-finishes-before-you-can-get-coffee-4e76</link>
      <guid>https://forem.com/eristoddle/verdent-ai-when-your-ai-coding-assistant-finishes-before-you-can-get-coffee-4e76</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewhtymj27nrppg1jmcu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewhtymj27nrppg1jmcu1.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My current AI development process, if you want to call it that, is getting one AI tool working on one project while having another work on a different project. This only works if one or both projects aren’t near the end where I have to test and have AI fix a lot of things. This process grew out of the fact that sometimes it takes AI a little while to get a task done, but not enough time for you to get any real work done yourself. So it was either work on another project or scroll through Reddit.&lt;/p&gt;

&lt;p&gt;But Verdent put a kink in that plan.&lt;/p&gt;

&lt;p&gt;I needed something for Verdent to work on. When I had &lt;a href="https://dev.to/eristoddle/how-i-built-two-obsidian-plugins-while-kiro-ai-did-most-of-the-work-40e4"&gt;Kiro build Obsidian plugins&lt;/a&gt;, one which was relatively simple, it created over a dozen tasks and took anywhere from two to four hours for each plugin. So I figured a couple of plugins would be enough work for a Saturday afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Projects: What I Threw at Verdent
&lt;/h2&gt;

&lt;p&gt;I had two ideas picked out to build and I used Auto Run mode to build both:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obsidian Cleaner Plugin&lt;/strong&gt; : A “cleaner” plugin that checks the attachments in the Obsidian attachment folder and provides you with a checkbox list of all those that aren’t linked so you can delete them. It does the same thing with conflicted files. If I find any more common things I clean like this, I will add them later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59g8hfv91i3o4qi4cced.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59g8hfv91i3o4qi4cced.png" alt="Obsidian Cleaner Plugin" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3D Tag Explorer Plugin&lt;/strong&gt; : A plugin that takes hierarchical tags like &lt;code&gt;llm/writing/software&lt;/code&gt; and turns them into a 3D node graph with notes containing those tags included as the final nodes.&lt;/p&gt;

&lt;p&gt;Pretty straightforward stuff. I figured these would keep Verdent busy while I worked on something else.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9dsumfm5fm9weqk6g0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9dsumfm5fm9weqk6g0z.png" alt="Obsidian 3D Tag Explorer Plugin" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reality Check: When Hours Becomes Minutes
&lt;/h2&gt;

&lt;p&gt;Verdent finished both of these Obsidian plugins in less than 15 minutes each.&lt;/p&gt;

&lt;p&gt;Now these plugins were relatively simple, but I did not expect that.&lt;/p&gt;

&lt;p&gt;There was one bug in the tag explorer where the background was above the node graph. I mentioned it to Verdent and it fixed it quickly. The cleaner plugin just worked on the first try.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8ys3fzpgpjba1103doc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8ys3fzpgpjba1103doc.jpg" alt="Verdent Chat Window" width="800" height="1492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So now I’m sitting here at 10:30 AM on a Saturday with two working plugins and I really wanted to focus on another task while Verdent just did its thing. This is not the kind of problem I expected to have with AI coding tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Panic Building: More Projects to Feed the Beast
&lt;/h2&gt;

&lt;p&gt;I had to scramble to find more work for Verdent to do. Here’s what I threw at it next:&lt;/p&gt;

&lt;h3&gt;
  
  
  BookForge
&lt;/h3&gt;

&lt;p&gt;A web app, API, command line app, and library that converts markdown files into simple epubs. This was simple, but I’d bet it took less than 30 minutes. I really did not trust it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbd7rz4f8i62menl7w8qa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbd7rz4f8i62menl7w8qa.jpg" alt="BookForge Done In One Commit" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was essentially done with the first commit and worked well enough to be the MVP. One commit. For a complete application with multiple interfaces. What the hell is happening to software development? I have made a total of eight commits to the repo. The rest were to add docker, make slight modifications to the text in the web app, and create a release script so I can use it as a library in other projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  PromptOS
&lt;/h3&gt;

&lt;p&gt;A service, libraries, and extensions to store prompts and other text, markdown, and JSON instruction files for AI. This was the most complex of the four projects. Verdent broke it into 4 phases. After each phase I did a commit and approved it to start on the next phase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Site Factory
&lt;/h3&gt;

&lt;p&gt;A project for building pre-configured Gatsby sites quickly. I’m still unsure of the architecture of this one and over-architected it twice. That taught me not to learn a new tool and have AI build the project with it at the same time. I started over again after reading documentation on Gatsby and the other tools I planned to use to get a better idea of what I actually wanted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mostly Static
&lt;/h3&gt;

&lt;p&gt;A set of services and dashboard for static sites. I threw this in at the last minute to put the last bit of my generous beta access to work. This was multiple phases also. But that Saturday, Verdent built two Obsidian plugins and four applications in a few hours. And did not use up the 2000 credits I got for the day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdent Features: The Good Stuff That Actually Works
&lt;/h2&gt;

&lt;p&gt;I didn’t test Verdent Deck because my personal Mac is still an Intel one and apparently I’m living in the stone age of computing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plan Mode
&lt;/h3&gt;

&lt;p&gt;Verdent’s plan mode is where things get interesting. It will have you approve the plan before it starts building. If you’re in Auto Run mode, it will just run until it’s done with the project, though it did stop and ask about certain commands that might be destructive. Or you can go directly to Skip Permissions mode and have it stop asking questions. I stuck to the middle road and only had to tell it to continue a couple of times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl6ottz95dx7512vvodr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl6ottz95dx7512vvodr.jpg" alt="Verdent Plan Mode" width="664" height="1260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your directions are vague, it will ask you a series of questions to help ensure it builds what you are expecting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpyjk83543l4m7fw8hw7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpyjk83543l4m7fw8hw7.jpg" alt="Verdent Follow Up Questions" width="612" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For larger projects, it breaks work into phases and tells you when each is done. PromptOS was the most complex in that it had an API, website, desktop app, and extensions for two applications. Verdent broke that into 4 phases automatically. Another project I started building after this set was broken into 11 phases.&lt;/p&gt;

&lt;p&gt;You may want to tell it to save the plan to the project docs folder, so you can keep it in version control for reference, since it doesn’t do that automatically. In the newest version, you can copy the content of the plan from the chat and save it to the project if you want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6q192qx09msn60z6xfd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6q192qx09msn60z6xfd.jpg" alt="Verdent Project Plan" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other features, to tell you the truth, I haven’t touched yet, because I simply didn’t need to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rules System
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2bhb9hfz6c9t8iu0aml.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2bhb9hfz6c9t8iu0aml.jpg" alt="Verdent Rules" width="644" height="1182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rules system lets you set custom instructions that persist across projects. This is useful for coding standards, preferred libraries, or just telling it not to do stupid shit that you’ve seen it do before.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subagents
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gpz3s7xri273k1r2d2w.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gpz3s7xri273k1r2d2w.jpg" alt="Verdent Subagents" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Verdent uses specialized subagents for different types of work. You don’t have to think about this much: it just routes tasks to the right AI worker automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Support
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F612pm1umov4wech9tqcq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F612pm1umov4wech9tqcq.jpg" alt="Verdent MCP Support" width="670" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Model Context Protocol support means Verdent can integrate with other tools and services. This is probably more useful than I realize, but I haven’t had time to explore it fully given how fast everything else has been happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pricing Reality Check
&lt;/h2&gt;

&lt;p&gt;During the beta, I got 2000 credits a day. These were hard to use up even when I was actively trying to burn through them. Once the beta was over, I received a bucket of credits. Right now, I’m testing how long these credits last to determine how I’ll use Verdent in the future.&lt;/p&gt;

&lt;p&gt;The pricing tiers are reasonable for what you get. If you’re doing any serious development work, the cost of the tool becomes insignificant compared to the time it saves. But I’m still figuring out my usage patterns before committing to a subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict: Too Fast and Good to Ignore
&lt;/h2&gt;

&lt;p&gt;I was definitely happy with the results. During the beta, getting through 2000 credits in a day required serious effort. The speed and quality of the output is genuinely impressive.&lt;/p&gt;

&lt;p&gt;And I will be using it in the future. It is just too fast and good not to. Right now the only AI subscription I have is Claude Pro, so I use Claude Code mainly. But I also have API accounts at most of the big AI companies. So I might just pay for credits as I go for a while until my usage is consistent and then pick up a subscription.&lt;/p&gt;

&lt;p&gt;The multi-AI workflow is becoming essential. When one tool is thinking, another can be building. When you have AI assistants that can complete substantial projects in under 30 minutes, the bottleneck becomes your ability to feed them work, not their ability to do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Future of Lazy Coding
&lt;/h2&gt;

&lt;p&gt;Verdent actually delivered on its promises. The difference between Verdent and some other AI coding tools I’ve used is the speed, the fact that it rarely gets confused about what I’m asking it to do, and, even though I was dreading testing the apps it built, they had fewer bugs and less weirdness than I expected.&lt;/p&gt;

&lt;p&gt;We’re at the point where AI coding assistants are good enough that the economics start to make sense for most developers. When something can build a complete application in 30 minutes, you find a way to afford the monthly subscription.&lt;/p&gt;

&lt;p&gt;The real challenge now isn’t getting AI to write code. It’s keeping up with how fast it can work and making sure you’re feeding it projects that are actually worth building. But honestly, that’s a good problem to have.&lt;/p&gt;

&lt;p&gt;I’m still figuring out the economics of AI-assisted development, but when something works this well, you adapt. The alternative is falling behind while other developers are shipping software at warp speed.&lt;/p&gt;

&lt;p&gt;And if you’re still writing everything by hand while AI tools like Verdent exist, well, you might want to reconsider your approach. The future of coding is here, and it’s fast enough to finish your weekend projects before lunch.&lt;/p&gt;

&lt;p&gt;You can learn more about Verdent &lt;a href="https://verdent.ai/" rel="noopener noreferrer"&gt;here&lt;/a&gt; and find the VS Code plugin or Verdent Deck &lt;a href="https://www.verdent.ai/download" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>verdent</category>
    </item>
    <item>
      <title>The Great Vibe Coding Experiment - How I Built 15 Projects with AI in My Spare Time</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Mon, 15 Sep 2025 07:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/the-great-vibe-coding-experiment-how-i-built-15-projects-with-ai-in-my-spare-time-275o</link>
      <guid>https://forem.com/eristoddle/the-great-vibe-coding-experiment-how-i-built-15-projects-with-ai-in-my-spare-time-275o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcl7xf3z1genj6bxtrld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcl7xf3z1genj6bxtrld.png" alt="Vibe Coding" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I started this whole thing wanting to test Claude Desktop with MCPs. Just one little experiment. You know how that goes.&lt;/p&gt;

&lt;p&gt;Six months later, I’ve got 15 projects in various states of “done” and a GitHub commit chart that doesn’t look too crazy until you realize that, on the days I have time at all, I only get a couple of hours to experiment with this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29gdtyt1i22s4gj3orqi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29gdtyt1i22s4gj3orqi.jpg" alt="Github Commit Chart" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Welcome to “vibe coding”: building shit because it feels right and letting AI do most of the heavy lifting. It’s not agile development. It’s not waterfall. And as I’ve done more of it, it has become less chaotic and more structured. Most nights I work on two projects simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;When Vibe Coding Becomes Automated Spec-Driven Development&lt;/li&gt;
&lt;li&gt;
The Obsidian Plugin Empire

&lt;ul&gt;
&lt;li&gt;Apple Books Highlights Plugin: Hacking SQLite&lt;/li&gt;
&lt;li&gt;Joplin Portal: My First Test of Kiro&lt;/li&gt;
&lt;li&gt;Daily Note Prompts: An Extension I Am Using&lt;/li&gt;
&lt;li&gt;Tag Explorer 3D: Testing a Shiny New Tool&lt;/li&gt;
&lt;li&gt;Attachment Cleaner: Simple But Necessary&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

The Bigger Fish: Apps That Do Real Shit

&lt;ul&gt;
&lt;li&gt;Gatsby Site with Scraper: The Abandoned First Project&lt;/li&gt;
&lt;li&gt;GitWrite: When Jules Met Agentic Project Manager&lt;/li&gt;
&lt;li&gt;EmberText: My First Claude Code Experiment&lt;/li&gt;
&lt;li&gt;MDQuery: Fed Up with Proprietary Tools&lt;/li&gt;
&lt;li&gt;AutoVibe: The Meta Project&lt;/li&gt;
&lt;li&gt;AutoVibe Template: The Stop-Gap Solution&lt;/li&gt;
&lt;li&gt;ShopBoth: Testing the Template (and Learning Hard Lessons)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Tool Whose Name Shall Not Be Spoken&lt;/li&gt;

&lt;li&gt;How These Projects Work Together&lt;/li&gt;

&lt;li&gt;

What I Learned About AI-Assisted Development

&lt;ul&gt;
&lt;li&gt;Each AI Tool Has Its Own Personality&lt;/li&gt;
&lt;li&gt;The Process That Emerged&lt;/li&gt;
&lt;li&gt;Why 15 Projects Makes Sense (Sort Of)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;What’s Next: The AI Development Army&lt;/li&gt;

&lt;li&gt;Conclusion: Embrace the Chaos&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  When Vibe Coding Becomes Automated Spec-Driven Development
&lt;/h2&gt;

&lt;p&gt;Vibe coding is when you have an idea, fire up an AI coding assistant, and see what happens. No detailed specs. No project management software. Just “hey AI, build me a thing that does X” and then iterating until it works or you get distracted by building something else.&lt;/p&gt;

&lt;p&gt;But there was a reason I abandoned my first vibe coding project. I just added random features to an idea that I came up with on the fly and thought about for maybe ten minutes. I just wanted to see what it could do. But by the second project, I knew I needed some kind of rails.&lt;/p&gt;

&lt;p&gt;The next step was to get out of this loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tell the AI tool to create the feature&lt;/li&gt;
&lt;li&gt;Test the result and tell the AI tool to fix the errors and failed tests&lt;/li&gt;
&lt;li&gt;Maybe do that again or a few more times, copying and pasting errors&lt;/li&gt;
&lt;li&gt;Ask the AI tool how to prevent this from happening with a better prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And realize it could be a self-optimizing process of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actor: Have an agent write the code&lt;/li&gt;
&lt;li&gt;Auditor: Have an agent review the code and run the tests (multiple types of auditor) and either pass or fail and send back to Actor&lt;/li&gt;
&lt;li&gt;Process Improver: Have an agent examine the steps that caused the failed process and update the commands, agent definitions, or other project docs to prevent them in the future&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All because you can’t trust AI to do what you thought you told it to do. And this trail of projects documents my journey towards that goal.&lt;/p&gt;
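
&lt;p&gt;To make that loop concrete, here’s a minimal sketch in TypeScript. To be clear, this is not AutoVibe’s actual code: &lt;code&gt;runAgent&lt;/code&gt; is a hypothetical stand-in for whatever CLI or API you happen to be driving, and the JSON verdict format is an assumption.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Minimal sketch of the Actor/Auditor/Process Improver loop.
// runAgent() is a hypothetical stand-in for a real AI CLI or API.
type Verdict = { passed: boolean; report: string };

async function runAgent(role: string, prompt: string): Promise&amp;lt;string&amp;gt; {
  throw new Error(`wire the ${role} agent up to a real tool before sending: ${prompt}`);
}

async function selfOptimizingLoop(task: string, maxRounds = 5) {
  let feedback = "";
  for (let round = 0; round &amp;lt; maxRounds; round++) {
    // Actor: write (or rewrite) the code for the task
    const code = await runAgent("actor", `${task}\n${feedback}`);

    // Auditor: review the code, run the tests, pass or fail
    const audit: Verdict = JSON.parse(
      await runAgent("auditor", `Review and test:\n${code}`)
    );
    if (audit.passed) return;

    // Process Improver: update commands and agent docs so this
    // failure class does not happen again, then loop back to the Actor
    await runAgent("improver", `Update project docs to prevent:\n${audit.report}`);
    feedback = audit.report;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;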

&lt;h2&gt;
  
  
  The Obsidian Plugin Empire
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Apple Books Highlights Plugin: Hacking SQLite
&lt;/h3&gt;

&lt;p&gt;I built this &lt;a href="https://github.com/eristoddle/apple-books-annotation-import" rel="noopener noreferrer"&gt;plugin&lt;/a&gt; with Claude Desktop and MCPs, documented the whole process in &lt;a href="https://dev.to/eristoddle/creating-an-obsidian-plugin-with-claude-ai-gaj"&gt;this vibe coding post&lt;/a&gt;. The plugin actually works. I use it daily. It extracts annotations from the macOS Books SQLite database and creates formatted markdown notes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrfsvm2cqtxensrldr32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrfsvm2cqtxensrldr32.png" alt="Apple Book Annotation Import" width="800" height="957"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Joplin Portal: My First Test of Kiro
&lt;/h3&gt;

&lt;p&gt;I had all these notes in Joplin that I wanted to access from Obsidian. So I fired up Kiro AI and told it to build me &lt;a href="https://github.com/eristoddle/joplin-portal" rel="noopener noreferrer"&gt;https://github.com/eristoddle/joplin-portal&lt;/a&gt;. &lt;a href="https://dev.to/eristoddle/how-i-built-two-obsidian-plugins-while-kiro-ai-did-most-of-the-work-40e4"&gt;Kiro did most of the work&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2vl52sfhg1ktis4ahcc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2vl52sfhg1ktis4ahcc.jpg" alt="Joplin Portal Sidebar" width="800" height="1391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The surprising thing about letting AI write plugins is that they actually follow best practices better than I do. Kiro created proper TypeScript interfaces, handled errors gracefully, and even added settings panels I didn’t ask for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Daily Note Prompts: An Extension I Am Using
&lt;/h3&gt;

&lt;p&gt;Another Kiro collaboration. This one adds customizable prompts to daily notes. I actually use this, which is more than I can say for many of the plugins I’ve tried.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fow12n4pj9ims4uwqae9h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fow12n4pj9ims4uwqae9h.jpg" alt="Obsidian Daily Prompt Settings" width="800" height="944"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It still needs some work, but it works for me for now. I’m keeping a running list of changes I want to make and bugs I’ve run into, and one of these days I’ll put the list in one of these AI coding tools and set it to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tag Explorer 3D: Testing a Shiny New Tool
&lt;/h3&gt;

&lt;p&gt;I started this project the day after making my list of existing projects. Why? Because I wanted to test a new AI tool (can’t name it yet) and needed something to build.&lt;/p&gt;

&lt;p&gt;Tag Explorer 3D visualizes your Obsidian tags and notes with those tags in 3D space. Is it necessary? Probably not. Is it cool? Absolutely. Sometimes you build things because they’re interesting, not because they solve real problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9dsumfm5fm9weqk6g0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9dsumfm5fm9weqk6g0z.png" alt="Obsidian Tag Explorer 3D" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Attachment Cleaner: Simple But Necessary
&lt;/h3&gt;

&lt;p&gt;Really simple plugin built with the same unnamed AI tool. It finds and removes unused attachments from your vault. I keep blog post drafts there and paste images in; the images get stored in the vault and then forgotten when I move the draft.&lt;/p&gt;

&lt;p&gt;I need to test it more, but the code looks solid. If you want AI coding to work, it’s often better at the simple, boring stuff than the complex, interesting stuff. And I may turn it into a general cleanup tool, because I am also tired of tracking down conflicted files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59g8hfv91i3o4qi4cced.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59g8hfv91i3o4qi4cced.png" alt="Obsidian Attachment Cleaner" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Fish: Apps That Do Real Shit
&lt;/h2&gt;

&lt;p&gt;Plugins are fun, but eventually you want to build something more substantial. That’s where things got interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gatsby Site with Scraper: The Abandoned First Project
&lt;/h3&gt;

&lt;p&gt;This was my &lt;a href="https://dev.to/eristoddle/claude-mcps-vibe-coding-without-specialized-ides-part-1-1hmd"&gt;first attempt at vibe coding&lt;/a&gt;. I wanted to build a Gatsby site with an integrated scraper using Claude Desktop and MCPs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi40ktvl05t4hj1wf4rhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi40ktvl05t4hj1wf4rhm.png" alt="AI Generated Gatsby Site" width="800" height="1110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I abandoned it. Not because it didn’t work, but because I realized I was building it the wrong way. Sometimes the most important decision is knowing when to stop.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitWrite: When Jules Met Agentic Project Manager
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/eristoddle/git-write" rel="noopener noreferrer"&gt;GitWrite&lt;/a&gt; is an abstraction over git for writers, editors, and beta readers. I built it with Jules AI, used an Agentic Project Manager to coordinate, and finished it up with Qoder. Well, it said it was finished and I am still working my way around to testing it’s full functionality. Just made the mistake of finishing it before I needed it.&lt;/p&gt;

&lt;p&gt;It exists to support EmberText and other writing tools I’m building. I am really, really, really tired of being required to use things like “Suggesting” in Word and Google Docs. Who thought this thing up? Satan? I like Git better, but it needed to be dumbed down in some places and tweaked in others to do what I needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  EmberText: My First Claude Code Experiment
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://dev.to/eristoddle/building-an-electron-app-from-scratch-with-claude-code-5c03"&gt;EmberText&lt;/a&gt; was my first serious relationship with Claude Code. I built an Electron app for writers, rolling my own context and project management system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ha0dotuo887ok6513r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ha0dotuo887ok6513r6.png" alt="EmberText Project View" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code has quirks. It’s opinionated about project structure. It sometimes goes off on tangents. But when it works, it does pretty well. EmberText is a fully functional app that I am testing, but I also need to refactor it to use some of the services I am building, so it is probably the last of these projects that will be finished.&lt;/p&gt;

&lt;h3&gt;
  
  
  MDQuery: Fed Up with Proprietary Tools
&lt;/h3&gt;

&lt;p&gt;I got tired of proprietary MCPs and tools to search systems that essentially consist of markdown files: Obsidian, Joplin, Jekyll, Logseq, and on and on. So I built a universal tool with Kiro and Qoder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/eristoddle/mdquery" rel="noopener noreferrer"&gt;MDQuery&lt;/a&gt; provides SQL-like syntax for searching and analyzing markdown files across different note-taking systems and static site generators. Because fuck trying custom MCPs that don’t work the way you want for each and every markdown based platform.&lt;/p&gt;

&lt;p&gt;And I already had a job lined up for the tool. I had been collecting everything interesting around vibe coding and spec-driven development in my Obsidian vault under a specific tag. Developers were going in so many directions that I wanted to categorize and get an overview. We’re all blind developers describing different parts of this elephant. Maybe if we categorize what we’re all doing, I can be a little less blind. I recently found this post on &lt;a href="https://shmck.substack.com/p/claude-code-framework-wars" rel="noopener noreferrer"&gt;Claude Code framework wars&lt;/a&gt; that does just that.&lt;/p&gt;

&lt;p&gt;So I attached the MCP to Claude desktop and prompted it to analyze all of my Obsidian notes, using the &lt;a href="https://github.com/eristoddle/mdquery/blob/main/docs/claude-desktop-prompts.md" rel="noopener noreferrer"&gt;prompts the AI tools had put in the documentation&lt;/a&gt;. And it spit out an &lt;a href="///downloads/LlmCodingNotes.pdf"&gt;80 page document&lt;/a&gt;.&lt;/p&gt;
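
&lt;p&gt;For anyone who hasn’t wired an MCP into Claude Desktop before: it comes down to a JSON entry in claude_desktop_config.json. The &lt;code&gt;mcpServers&lt;/code&gt; shape is Claude Desktop’s real config format, but the command and args for mdquery here are my guess at an invocation, so check the repo’s README for the real one.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "mdquery": {
      "command": "python",
      "args": ["-m", "mdquery.mcp"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;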

&lt;h3&gt;
  
  
  AutoVibe: The Meta Project
&lt;/h3&gt;

&lt;p&gt;AutoVibe is the most recursive project I’ve ever built. I’m using Claude Code, AI Studio, and Backlog.md to build a tool that will make the vibe coding process smoother.&lt;/p&gt;

&lt;p&gt;It’s infrastructure for building infrastructure. Custom commands and agents to coordinate AI tools. The future of development might be less about writing code and more about conducting orchestras of AI agents. At least, that’s what I think it is. But it’s like a box of chocolates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.stephanmiller.com%2Fimages%2F2025%2Fautovibe-dashboard.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.stephanmiller.com%2Fimages%2F2025%2Fautovibe-dashboard.png" alt="AutoVibe Dashboard" width="800" height="982"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AutoVibe Template: The Stop-Gap Solution
&lt;/h3&gt;

&lt;p&gt;While AutoVibe has bugs to work out, I needed something that worked now. So I built a &lt;a href="https://github.com/eristoddle/autovibe-backlog-md-template" rel="noopener noreferrer"&gt;template&lt;/a&gt; with a set of Claude commands and agents that, when used with Backlog.md and Claude Code, makes developing projects more bullet-resistant.&lt;/p&gt;

&lt;p&gt;I used Claude Desktop to help develop it since the template is full of AI instruction files that other tools would try to execute. But then I sort of got stuck testing its usage in the next project. So now the priority is getting AutoVibe to work while keeping the simple template around.&lt;/p&gt;

&lt;h3&gt;
  
  
  ShopBoth: Testing the Template (and Learning Hard Lessons)
&lt;/h3&gt;

&lt;p&gt;I used Claude Code with my Backlog.md template to build ShopBoth, a React Native app for testing and tweaking the template. Why did I choose React Native for testing? Because I’m an idiot.&lt;/p&gt;

&lt;p&gt;I also discovered that “generic” automated QA doesn’t exist. QA needs to be specific to the platform, the framework, the use case. There ain’t no such thing as universal testing, and I learned that the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Whose Name Shall Not Be Spoken
&lt;/h2&gt;

&lt;p&gt;I’ve built three more apps with an AI tool I can tell you about in a couple of weeks. They add to the ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BookForge&lt;/strong&gt; transforms markdown files into professional ebooks. It supports EmberText and GitWrite, completing the writing workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PromptOS&lt;/strong&gt; stores and provides prompts for EmberText, AutoVibe, and everything else. Because managing prompts across 15 projects gets complicated fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Site Factory&lt;/strong&gt; brings me full circle to that abandoned Gatsby project, but with actual research this time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren’t random projects. They’re pieces of a larger system for AI-assisted content creation and development.&lt;/p&gt;

&lt;p&gt;Update: I actually built one more project while I was writing this. It’s a platform to provide dynamic services for static sites.&lt;/p&gt;

&lt;h2&gt;
  
  
  How These Projects Work Together
&lt;/h2&gt;

&lt;p&gt;This isn’t just a collection of random tools. There’s method to the madness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EmberText handles writing → GitWrite manages collaboration → BookForge generates ebooks&lt;/li&gt;
&lt;li&gt;PromptOS supports everything with prompt management&lt;/li&gt;
&lt;li&gt;Obsidian plugins feed the writing process with research and notes&lt;/li&gt;
&lt;li&gt;MDQuery searches across all the markdown files these tools create&lt;/li&gt;
&lt;li&gt;AutoVibe coordinates the development of new tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s an actual ecosystem, not just 15 disconnected projects. Each piece makes the others more useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned About AI-Assisted Development
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Each AI Tool Has Its Own Personality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Desktop:&lt;/strong&gt; Great for planning and architecture, terrible for actually writing code. It’s the project manager that never writes any code, mainly because chat lengths are limited. As soon as you get somewhere, you have to start a new chat and lose the context. I do use it to develop git templates for AI-driven coding projects, because, unlike the coding tools, it will ignore the instruction files I have it tweak instead of trying to execute them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code:&lt;/strong&gt; I am not sure if it is still the best after using Qoder and the new tool I have. I still use it almost daily to work on projects and it’s my go-to tool, but the limit comes up quicker now and it seems like it has dropped some IQ points. Not sure about its status in my workflow right now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kiro:&lt;/strong&gt; The reliable workhorse. Give it a clear task and it delivers solid, working code. Perfect for plugins and smaller projects. After seeing how the new tool I found works, I wonder if it breaks things down into too many tasks. To create an Obsidian plugin, it broke it down into 16 tasks I had to click to get through and it took a few hours. The new tool just decided it would build a plugin for me in 15 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jules:&lt;/strong&gt; I still use it for small things, like fixing bugs, because I still get 15 free chats a day. I actually built most of GitWrite with it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qoder:&lt;/strong&gt; Set it loose on a complex project and come back in a few hours to find it’s built everything you asked for and more. The wikis it builds are useful, but I think they would eat up a lot of tokens in large codebases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unnamed Tool:&lt;/strong&gt; The new experiment. Still figuring out its personality. But it’s fast. It finished the two Obsidian plugins in less than an hour, both of them. It built the other full projects in one day, all of them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Process That Emerged
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Start with an actual need, not cool technology. I tried building technology first and it never worked.&lt;/li&gt;
&lt;li&gt;Let AI handle the boilerplate and focus on the interesting problems. AI is great at CRUD operations and terrible at creative problem-solving.&lt;/li&gt;
&lt;li&gt;Build infrastructure projects to support main projects. GitWrite exists so EmberText can focus on writing, not version control. I’m also hoping less scope means less context an AI tool needs.&lt;/li&gt;
&lt;li&gt;Test new AI tools with real projects, not demos. You learn more building something you’ll actually use.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why 15 Projects Makes Sense (Sort Of)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each project taught me something new about working with different AI tools.&lt;/li&gt;
&lt;li&gt;Building an ecosystem requires multiple pieces. You can’t do everything with one app.&lt;/li&gt;
&lt;li&gt;Some projects exist only to support other projects. That’s fine.&lt;/li&gt;
&lt;li&gt;Abandoning projects is part of the process. That abandoned Gatsby site taught me what not to build.&lt;/li&gt;
&lt;li&gt;The commit chart tells the story. Bursts of activity when testing new tools. Long gaps when focusing on one complex project.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What’s Next: The AI Development Army
&lt;/h2&gt;

&lt;p&gt;AutoVibe is getting closer to making this process smoother. The template approach gives consistent results. The unnamed AI tool projects will add new capabilities.&lt;/p&gt;

&lt;p&gt;I’m building a sustainable vibe coding workflow that lets me maintain 15+ projects without losing my mind. The future of development might be more like conducting an orchestra than playing a solo instrument.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Embrace the Chaos
&lt;/h2&gt;

&lt;p&gt;I started wanting to try Claude Desktop with MCPs. I ended up with 15 projects, a completely new development process, and insights into how AI-assisted development actually works.&lt;/p&gt;

&lt;p&gt;The learning process matters more than perfect planning. AI tools are getting good enough to enable this kind of scattered productivity. You can afford to be curious, to follow tangents, to build infrastructure for projects that don’t exist yet.&lt;/p&gt;

&lt;p&gt;Vibe coding and spec-driven development isn’t for everyone. But if you’re comfortable with chaos (Yes, even in spec-driven development), if you like building things just to see what happens, if you want to push the boundaries of what one person can build with AI assistance, give it a try.&lt;/p&gt;

&lt;p&gt;Just don’t blame me when you end up with 15 projects and a commit chart that looks like madness. That’s the price you pay. And while I was finishing this up, I started another project. I had to give my new tool something to work on before the free ride runs out.&lt;/p&gt;

</description>
      <category>vibecoding</category>
    </item>
    <item>
      <title>How I Built Two Obsidian Plugins While Kiro AI Did Most of the Work</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Fri, 22 Aug 2025 07:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/how-i-built-two-obsidian-plugins-while-kiro-ai-did-most-of-the-work-40e4</link>
      <guid>https://forem.com/eristoddle/how-i-built-two-obsidian-plugins-while-kiro-ai-did-most-of-the-work-40e4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx236i76wouhdk8d7g3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx236i76wouhdk8d7g3o.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://dev.to/eristoddle/creating-an-obsidian-plugin-with-claude-ai-gaj"&gt;the first Obsidian plugin I wrote with AI&lt;/a&gt;, I used Claude desktop and already had a Python script that did most of what I needed the plugin to do. I just needed AI to convert it to an Obsidian plugin. Then &lt;a href="https://dev.to/eristoddle/jules-ai-the-currently-free-coding-assistant-that-cant-follow-directions-but-gets-shit-done-33k3"&gt;I used Jules to fix the final bugs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But I have been moving away from the chaos of vibe coding so I can get results that look more like the vision I have in my head. This started when &lt;a href="https://dev.to/eristoddle/building-an-electron-app-from-scratch-with-claude-code-5c03"&gt;I rolled my project management system with Claude Code and a bunch of markdown files&lt;/a&gt;. Then I found out that &lt;a href="https://dev.to/eristoddle/i-tried-to-upgrade-my-blog-with-ai-project-management-and-everything-went-to-hell-but-the-process-2e43-temp-slug-8610687"&gt;Backlog.md could make my vibe coding projects go more smoothly&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But along the way, Kiro came out and had a spec-driven development feature. So I figured I’d try it by building even more Obsidian plugins. Why two, though? Well, the first one was more complex, and the more complex a project, the more things AI leaves behind that you have to clean up. The second one was actually simple enough that I am currently prepping it for an official release.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What is Kiro and Why I’m Using It&lt;/li&gt;
&lt;li&gt;
Plugin #1: Daily Note Prompts - The Spec-Driven Experience

&lt;ul&gt;
&lt;li&gt;Requirement 1&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;My Obsidian Development Process&lt;/li&gt;

&lt;li&gt;Plugin #2: Joplin Portal - When AI Tools Hit a Wall&lt;/li&gt;

&lt;li&gt;

Lessons Learned

&lt;ul&gt;
&lt;li&gt;Spec-driven development workflow&lt;/li&gt;
&lt;li&gt;When to trust AI vs when to investigate yourself&lt;/li&gt;
&lt;li&gt;Managing AI tool limitations and daily limits&lt;/li&gt;
&lt;li&gt;The value of simple projects for learning new tools&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Kiro and Why I’m Using It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kiro.dev/blog/introducing-kiro/" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; is an AI coding tool that takes a different approach than most of the tools I’ve been testing. Instead of just throwing code at you, it starts with what they call “spec-driven development.” You give Kiro an idea, and it creates three files: requirements.md, design.md, and tasks.md.&lt;/p&gt;

&lt;p&gt;Kiro breaks down your project into digestible chunks before writing a single line of code. No more “let’s see what happens” coding sessions that end with you staring at a pile of TypeScript wondering how you got there.&lt;/p&gt;

&lt;p&gt;But here’s the real reason I’m going all-in on Kiro right now: it’s completely free. No credits, no tokens. There is a daily limit that cuts you off for 24 hours, but I can deal with that.&lt;/p&gt;

&lt;p&gt;So I’m testing Kiro on every coding project I can think of before the free tier disappears. Obsidian plugins are perfect for this because I use Obsidian daily, I have a constant stream of plugin ideas, and I can see the results of my and Kiro’s efforts quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Plugin #1: Daily Note Prompts - The Spec-Driven Experience
&lt;/h2&gt;

&lt;p&gt;I used to write daily. I had a morning practice where I woke up, meditated, read, and then wrote. But who has all that time? Actually, I’m kicking myself because I had a three-year streak, and it only took one missed day for me to give that up. The “streak” gurus never tell you about that part.&lt;/p&gt;

&lt;p&gt;And for a while, I was doing it in Obsidian using Daily Notes. Now I wanted a plugin to nag me about it and give me a prompt to start with. So that was the start of the idea. Here’s how I fleshed it out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt packs in JSON format can be imported and exported with these attributes (sketched in code after this list):

&lt;ul&gt;
&lt;li&gt;Type: ‘Sequential’, ‘Random’, ‘Date’&lt;/li&gt;
&lt;li&gt;Prompt: Link, String, or Markdown&lt;/li&gt;
&lt;li&gt;Date: For Date type, like devotionals&lt;/li&gt;
&lt;li&gt;Order: For Sequential type that have to be done in order&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Set Reminder/Alert with System or Obsidian notification.&lt;/li&gt;
&lt;li&gt;Launches daily note with prompt&lt;/li&gt;
&lt;li&gt;Automatically go into zen mode&lt;/li&gt;
&lt;/ul&gt;
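
&lt;p&gt;For what it’s worth, here’s roughly how that shape looks as TypeScript. The attribute names come straight from my notes above; the interfaces themselves are illustrative, not the plugin’s actual types.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative types only; the plugin Kiro generated may differ.
type PromptType = "Sequential" | "Random" | "Date";

interface Prompt {
  content: string; // a link, plain string, or markdown
  date?: string;   // for Date packs, like devotionals
  order?: number;  // for Sequential packs done in order
}

interface PromptPack {
  name: string;
  type: PromptType;
  prompts: Prompt[];
}

// A pack like this round-trips as JSON for import and export:
const example: PromptPack = {
  name: "Morning pages",
  type: "Sequential",
  prompts: [
    { content: "What did you dream about?", order: 1 },
    { content: "[[Weekly goals]]", order: 2 },
  ],
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;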

&lt;p&gt;And that is basically what I gave Kiro in the first prompt. You can actually still “vibe code” with Kiro. You just have to select that option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnnl8k0yl9i6z5i0y9bs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnnl8k0yl9i6z5i0y9bs.jpg" alt="Kiro Chat Drawer" width="800" height="1768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And Kiro built the three spec files in order, waiting after each file for my approval or for me to request changes.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/eristoddle/obsidian-daily-note-prompts/blob/main/.kiro/specs/obsidian-daily-prompts/requirements.md" rel="noopener noreferrer"&gt;requirements file&lt;/a&gt; has entries that look like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  Requirement 1
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;User Story:&lt;/strong&gt; As an Obsidian user, I want to create and manage prompt packs with different delivery modes, so that I can organize my writing prompts according to my preferred workflow.&lt;/p&gt;
&lt;h4&gt;
  
  
  Acceptance Criteria
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;WHEN a user creates a new prompt pack THEN the system SHALL allow them to specify the type as ‘Sequential’, ‘Random’, or ‘Date’&lt;/li&gt;
&lt;li&gt;WHEN a user adds prompts to a pack THEN the system SHALL support prompts as links, strings, or markdown content&lt;/li&gt;
&lt;li&gt;WHEN a user creates a Sequential prompt pack THEN the system SHALL allow them to define the order of prompts&lt;/li&gt;
&lt;li&gt;WHEN a user creates a Date-based prompt pack THEN the system SHALL allow them to assign specific dates to prompts&lt;/li&gt;
&lt;li&gt;WHEN a user creates a Random prompt pack THEN the system SHALL randomly select prompts without repetition until all are used&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href="https://github.com/eristoddle/obsidian-daily-note-prompts/blob/main/.kiro/specs/obsidian-daily-prompts/design.md" rel="noopener noreferrer"&gt;design file&lt;/a&gt; specifies things like data models, service interfaces, and other architectural details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg999gf5ziy4pcec1y3uu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg999gf5ziy4pcec1y3uu.jpg" alt="Kiro Design File" width="800" height="735"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final file it creates is the &lt;a href="https://github.com/eristoddle/obsidian-daily-note-prompts/blob/main/.kiro/specs/obsidian-daily-prompts/tasks.md" rel="noopener noreferrer"&gt;tasks file&lt;/a&gt;, which looks like a basic list of nested to-dos in markdown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjrzxo8tzoax8g4rixt1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjrzxo8tzoax8g4rixt1.jpg" alt="Kiro Tasks File" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But when it is loaded in the Kiro IDE, it adds a &lt;strong&gt;Start task&lt;/strong&gt; link you can click on to have Kiro start working on it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F061vtr6rxkakih7891fp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F061vtr6rxkakih7891fp.jpg" alt="Kiro Implementation Plan" width="800" height="1125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am not sure whether adding a command to the trust list was buggy, or I just didn’t know how to use it. It would seem that you could just click the play triangle once and the double play triangle to trust the command in the future, but that didn’t seem to work all the time. Then I realized that the list of commands below the paragraph were also buttons, and I had more success with those, but not every time. And like I said, it could be user error, but they didn’t make it easy to figure out. And who wants to read documentation, anyway? No one’s got time for that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falwqpybcy21ps6a601sv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falwqpybcy21ps6a601sv.jpg" alt="Trusting Kiro Commands" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the end, I clicked &lt;strong&gt;Start task&lt;/strong&gt; over and over for about four hours while doing other things, finished all the tasks, and decided it was too late to even attempt testing it that night. I know how that goes. So, I tested the next day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fow12n4pj9ims4uwqae9h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fow12n4pj9ims4uwqae9h.jpg" alt="Obsidian Daily Prompt Settings" width="800" height="944"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it actually went pretty well. There were two or three minor bugs. For example, the time field for notifications needed debouncing. It was basically working after about half an hour of bug fixing. But there is a lot of weirdness. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It rolled its own Zen mode, and it does not work quite right. It removed the sidebars and other things. But I couldn’t figure out a way back to normal mode. I just reloaded Obsidian.&lt;/li&gt;
&lt;li&gt;It has global settings that should be overridden by the settings of each prompt pack, but nothing happens to a prompt in a prompt pack if I don’t set the child settings. So, I think most of the global settings do nothing.&lt;/li&gt;
&lt;li&gt;It doesn’t track progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiymdoxyszl5axbcybwf2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiymdoxyszl5axbcybwf2.jpg" alt="Obsidian Daily Prompt Edit Prompts" width="800" height="1688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am testing it currently and making notes on the changes I need, and once I think I have them all, I’ll start working on it again. I am actually using it, but it is not finished.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s the repo:&lt;/strong&gt; &lt;a href="https://github.com/eristoddle/obsidian-daily-note-prompts" rel="noopener noreferrer"&gt;Obsidian Daily Note Prompts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other Obsidian plugin idea was much simpler, so I could finish that one.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Obsidian Development Process
&lt;/h2&gt;

&lt;p&gt;After building my fourth Obsidian plugin (3 using AI), I have a process that works really well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I will use Kiro for now while it’s free, and because I could get results with both plugins before I hit the daily limit.&lt;/li&gt;
&lt;li&gt;I have a Test vault and keep all my plugin repos in the &lt;code&gt;.obsidian/plugins&lt;/code&gt; folder of the vault.&lt;/li&gt;
&lt;li&gt;I test the plugin in that vault and keep notes on the improvements and changes I want to make in a note in that vault rather than sending one-off prompts to Kiro.&lt;/li&gt;
&lt;li&gt;When I think I have found enough for another spec in Kiro, I just paste what I’ve been collecting into another Kiro spec chat. A chat interface makes me anxious, to tell the truth, because it requires interaction. This allows me to deal with responding on my time. It also means I can plan changes even when I’ve been cut off.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8z5hs30agzvolyiiq1gd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8z5hs30agzvolyiiq1gd.jpg" alt="Kiro Features Prompt" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Plugin #2: Joplin Portal - When AI Tools Hit a Wall
&lt;/h2&gt;

&lt;p&gt;With this plugin, I simply wanted to access Joplin from Obsidian and be able to import notes from Joplin directly from the plugin. I know I can copy and paste the note or export notes from Joplin, but a plugin will make things easier.&lt;/p&gt;

&lt;p&gt;These are the notes I gave Kiro:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name: Joplin Portal&lt;/li&gt;
&lt;li&gt;A way to access my notes in Joplin from Obsidian so I don’t have to junk up my vault&lt;/li&gt;
&lt;li&gt;Use the Joplin Web Clipper API to interact with Joplin: &lt;a href="https://joplinapp.org/help/api/references/rest_api/" rel="noopener noreferrer"&gt;Joplin Data API | Joplin&lt;/a&gt; (a rough call sketch follows this list)
&lt;/li&gt;
&lt;li&gt;Add a sidebar panel that searches Joplin notes either by full text or by tag&lt;/li&gt;
&lt;li&gt;Also import and convert Joplin notes as Obsidian notes (template, default folder, etc) probably using the same search functionality, with a checkbox to check if you want to import that result.&lt;/li&gt;
&lt;/ul&gt;
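
&lt;p&gt;The Data API is a small HTTP service the Web Clipper runs locally, on port 41184 by default. Here’s a minimal sketch of the search call the sidebar needs. The &lt;code&gt;/search&lt;/code&gt; endpoint and the &lt;code&gt;tag:&lt;/code&gt; query syntax are from the linked docs; everything around them is illustrative plumbing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Minimal Joplin Data API search. The token comes from Joplin's
// Web Clipper options screen; 41184 is the default port.
const JOPLIN = "http://localhost:41184";
const TOKEN = process.env.JOPLIN_TOKEN ?? "";

async function searchJoplin(query: string) {
  const url = `${JOPLIN}/search?query=${encodeURIComponent(query)}&amp;amp;token=${TOKEN}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Joplin API error: ${res.status}`);
  const { items } = await res.json();
  return items; // array of { id, title, ... }
}

// Full-text search, or tag search via Joplin's query syntax:
searchJoplin("obsidian").then(console.log);
searchJoplin("tag:reading").then(console.log);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;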

&lt;p&gt;This project went about the same way. Kiro created the requirements, design, and task files, and I clicked &lt;strong&gt;Start task&lt;/strong&gt; over and over for 3-4 hours (while doing other things) until I discovered that there is a limit to Kiro usage and I hit it. And from what I can tell online, it’s a 24-hour break. And there was one task left, so I actually had Jules finish that task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2vl52sfhg1ktis4ahcc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2vl52sfhg1ktis4ahcc.jpg" alt="Joplin Portal Sidebar" width="800" height="1391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it worked relatively well. The biggest issue was that the plugin was pulling in the Joplin notes as is, and Joplin references images in the note with a Joplin ID. So, unlike the screenshot above, the images were not showing. So in my next Kiro session, I worked on that.&lt;/p&gt;

&lt;p&gt;In trying to fix this, I relearned a lesson about using AI for development: if it takes the tool more than two tries to fix something, give the AI tool more information.&lt;/p&gt;

&lt;p&gt;So after about 8 tries and 2 days, I gathered the information it needed myself. It turned out that, depending on how the note was created in Joplin, embedded images showed up in one of three formats. Then I found the function that created the preview and the function that imported the note into Obsidian. I told Kiro about all of this, and once it had that, it fixed the issue in both functions.&lt;/p&gt;
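
&lt;p&gt;Some context on why this was fiddly: Joplin refers to attachments by a 32-character resource ID rather than a file path, most commonly in markdown like &lt;code&gt;![alt](:/resourceid)&lt;/code&gt;, with HTML and plain-link variants needing the same treatment. Here’s a sketch of the kind of rewrite involved, with the actual download left as a stub; the &lt;code&gt;/resources/:id/file&lt;/code&gt; endpoint is from the Data API docs, the rest is assumed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Joplin embeds images as ![alt](:/resourceid), id = 32 hex chars.
const RESOURCE_LINK = /!\[([^\]]*)\]\(:\/([0-9a-f]{32})\)/g;

// Stub: GET /resources/:id/file from the Data API, save the bytes
// into the vault, and return the new vault-relative path.
async function saveResource(id: string): Promise&amp;lt;string&amp;gt; {
  throw new Error(`download resource ${id} and return its vault path`);
}

async function rewriteEmbeds(body: string): Promise&amp;lt;string&amp;gt; {
  let out = body;
  for (const [full, alt, id] of body.matchAll(RESOURCE_LINK)) {
    const path = await saveResource(id);
    out = out.replace(full, `![${alt}](${path})`);
  }
  return out;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;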

&lt;p&gt;And it was pretty close to done, but I wanted to clean up the UI a little and give the important things a little more space. So I started another spec chat on that, and Kiro created a new set of three files called ui-improvements.&lt;/p&gt;

&lt;p&gt;Then, to prep for submitting to the community, I had it look up the rules for plugin submission and create some specs for that as well. Now that I know I can keep adding new specs as a project grows, I might try using Kiro with something bigger than an Obsidian plugin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s the repo:&lt;/strong&gt; &lt;a href="https://github.com/eristoddle/joplin-portal" rel="noopener noreferrer"&gt;Obsidian Joplin Portal&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7vltbfvx1n0dstilr85.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7vltbfvx1n0dstilr85.jpg" alt="Kiro Specs, Hooks, Steering, and MCP" width="790" height="2484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Spec-driven development workflow
&lt;/h3&gt;

&lt;p&gt;Turns out there’s something to be said for planning before you code. What a concept! The three-file system is genuinely useful. When I hit a wall or need to explain a bug to Kiro, I can reference the original specs instead of trying to reverse-engineer what sleep-deprived-me was thinking at 2 AM last Tuesday.&lt;/p&gt;

&lt;p&gt;Also, no more scope creep disguised as “quick features.” When everything is specced out upfront, it’s obvious when you’re adding random shit that doesn’t belong. That is if you pay attention instead of accepting everything as if it were only terms and conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to trust AI vs when to investigate yourself
&lt;/h3&gt;

&lt;p&gt;Give any AI tool exactly two chances to fix a complex problem. After that, you’re just feeding prompts to a very expensive random number generator.&lt;/p&gt;

&lt;p&gt;The Joplin image rendering issue taught me this lesson yet again. I spent hours watching different AI tools chase their own tails, generating increasingly elaborate solutions to a problem they didn’t understand. When all I had to do was stop, investigate what was happening for less than an hour, and give Kiro the details.&lt;/p&gt;

&lt;p&gt;AI tools are great at implementing solutions. They’re terrible at debugging complex integration issues that require understanding context they don’t have access to. It remains to be seen whether I will actually remember this lesson any earlier next time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing AI tool limitations and daily limits
&lt;/h3&gt;

&lt;p&gt;Treat daily limits like sprint deadlines. Plan your work in chunks that can be completed within the limit, and always end each session by documenting exactly where you are. Nothing sucks more than hitting your limit mid-debugging session and forgetting what the hell you were trying to fix.&lt;/p&gt;

&lt;p&gt;Also, don’t game the system by splitting complex tasks into tiny prompts. You’ll just confuse the AI and waste your allocation on back-and-forth clarifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  The value of simple projects for learning new tools
&lt;/h3&gt;

&lt;p&gt;Complex projects are terrible for learning new AI tools. You spend too much time fighting edge cases and not enough time understanding the tool’s strengths and weaknesses. The Joplin Portal plugin was simple enough that I could focus on Kiro’s workflow instead of drowning in business logic. The Daily Note Prompts plugin had a lot more going on.&lt;/p&gt;

&lt;p&gt;Start simple. Learn the tool. Then tackle the hard stuff when you’re not also learning how to communicate with an AI that may or may not understand what a TypeScript interface is. Simple projects also fail faster and more obviously, which means you spend less time wondering if the AI is broken or if your requirements were just garbage to begin with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kiro’s free tier is basically a coding assistant on steroids with no monthly subscription guilt. While it lasts, I’m going to get my money’s worth, even though it’s not costing me any money.&lt;/p&gt;

&lt;p&gt;The spec-driven approach works well. Having an AI that plans before it codes instead of just eyeballing everything and hoping for the best is a game changer. But… I was already building my own tool to do this before I heard of Kiro, so I know it’s not “magic.” And I also suspect that it consists mainly of system prompts.&lt;/p&gt;

&lt;p&gt;But while it’s free… So if you see a bunch of random Obsidian plugins with my name on them over the next few months, you’ll know why. I’m not building them because the world desperately needs another note-taking plugin. I’m building them because there’s a free AI coding assistant that won’t be free forever, and I plan to learn everything I can while the learning is cheap.&lt;/p&gt;

&lt;p&gt;And while I get side-tracked and my own AI software is not ready yet, my next post should be about using AI to build that software. The process I have works really well, and I am building a public GitHub template repository to recreate it.&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>obsidian</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Jules AI - The (Currently) Free Coding Assistant That Can't Follow Directions But Gets Shit Done</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Tue, 10 Jun 2025 13:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/jules-ai-the-currently-free-coding-assistant-that-cant-follow-directions-but-gets-shit-done-33k3</link>
      <guid>https://forem.com/eristoddle/jules-ai-the-currently-free-coding-assistant-that-cant-follow-directions-but-gets-shit-done-33k3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4hh44z0chvfa8bmzqnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4hh44z0chvfa8bmzqnc.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jules is currently free, so I figured it was worth a try. Of course, I had to find out what I could break.&lt;/p&gt;

&lt;p&gt;Each user gets these default limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;5 concurrent tasks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;60 total tasks per day&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 codecasts per day&lt;/strong&gt; (still not sure what this means)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sixty tasks a day for free? That’s really generous and it meant I didn’t have to worry about reading documentation or instructions the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Jumped in Feet First
&lt;/h2&gt;

&lt;p&gt;My Obsidian plugin worked well enough for me to use it, but there were issues. Issues that I’d been ignoring because the damn thing worked and I had other shit to do. But I finally added issues to my GitHub repo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevline3hrpysdmps6uen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevline3hrpysdmps6uen.png" alt="obsidian-plugin-github-issues" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four issues. Nothing too crazy. Should be straightforward for an AI that can supposedly code.&lt;/p&gt;

&lt;p&gt;I gave Jules access to the repo and told it: “I want you to work on the issues in this repository and create a feature branch for each fix. Start with the first one.”&lt;/p&gt;

&lt;p&gt;Simple enough, right?&lt;/p&gt;

&lt;h2&gt;
  
  
  Jules Decides to Ignore My Instructions
&lt;/h2&gt;

&lt;p&gt;Jules fetched the repo, cloned it into a VM, and read the README file. But then it discovered the “Future Enhancements” section I’d added to the README and decided that’s what it wanted to work on instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0fu7sjoprmthq42z967.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0fu7sjoprmthq42z967.png" alt="obsidian-plugin-future-enhancements" width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It completely ignored my instruction to work on the issues, the same issues that are listed right above the “Future Enhancements” section, and decided it would work on the “Interactive book selection dialog” instead.&lt;/p&gt;

&lt;p&gt;Jules then asked me to approve the plan, but I didn’t see any way to reject it. It was going to auto-approve in about a minute anyway. I thought maybe clicking “Approve” would launch a modal where I could say “no, you illiterate AI, work on the issues I told you to work on.”&lt;/p&gt;

&lt;p&gt;It did not.&lt;/p&gt;

&lt;p&gt;So it started working on the wrong thing. I messaged “stop” and it replied: “I have stopped the current task. Is there something specific you would like me to do instead, or a different issue you’d like me to focus on from the README?”&lt;/p&gt;

&lt;p&gt;It paused for a few seconds and then continued doing exactly what it was doing before.&lt;/p&gt;

&lt;p&gt;Currently, as far as I can tell, the only way to reject a plan is to let it start, pause it manually, and then delete the entire task. User experience design at its finest.&lt;/p&gt;

&lt;p&gt;Jules finished the feature and gave me the option to click a button to create a feature branch. So now I had a new feature to test before I could fix the actual issues I wanted fixed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqnup91vj1pbim7q6hqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqnup91vj1pbim7q6hqw.png" alt="jules-obsidian-plugin-branch" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the Feature that Jules Built While Ignoring My Prompt
&lt;/h2&gt;

&lt;p&gt;So I pulled the branch down to the Obsidian vault I use for testing, the one with all the plugins I’m developing loaded up. There was one error in the code, in unit tests I never asked Jules to add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tests/main.test.ts:160:35 - error TS2554: Expected 0 arguments, but got 1.
sortAnnotationsByCFI: jest.fn((ann: Annotation[]) =&amp;gt; ann), // Simple pass-through

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I just deleted the broken test.&lt;/p&gt;

&lt;p&gt;It took me a while to figure out that the selection modal didn’t launch via the sidebar button like I expected. I had to hit Command+P and find the command to launch it. But you know what? It actually worked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91ifg2dalj4t6p4zm0og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91ifg2dalj4t6p4zm0og.png" alt="ebook-selection-modal" width="800" height="842"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since the feature worked, I merged the change and created a new release. Then I did some reading to figure out how to prevent Jules from going off on another tangent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to Fixing Bugs (This Time With Feeling)
&lt;/h2&gt;

&lt;p&gt;I thought maybe I should be more descriptive, but there wasn’t much to be descriptive about. There are issues. I added them to the repo. They’re also in the README that I know Jules can access. Most of the &lt;a href="https://github.com/google-labs-code/jules-awesome-list" rel="noopener noreferrer"&gt;example prompts&lt;/a&gt; are very short, shorter than what I’d already tried.&lt;/p&gt;

&lt;p&gt;So I tried again: “This github repo has 4 issues listed here: &lt;a href="https://github.com/eristoddle/apple-books-annotation-import/issues" rel="noopener noreferrer"&gt;https://github.com/eristoddle/apple-books-annotation-import/issues&lt;/a&gt;. Give me a plan for fixing the first issue. If you have no access to these, let me know. Do not do anything else.”&lt;/p&gt;

&lt;p&gt;This time it worked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpci5s2f4utp2mt4abhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpci5s2f4utp2mt4abhn.png" alt="jule-web-interface" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ykxjvnrqe99ae1jiq56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ykxjvnrqe99ae1jiq56.png" alt="jules-cover-fix-plan" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both the accidental feature and the bug fix took very little time. I was actually surprised by the speed. But Jules created broken unit tests again. I figured maybe it was because I didn’t have my npm and Node versions in the package.json, so I added those and deleted the broken tests again.&lt;/p&gt;

&lt;p&gt;It actually fixed the bug and at least doubled the number of book covers I was getting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Great Color Mapping Mystery
&lt;/h2&gt;

&lt;p&gt;The second issue was about finding annotation colors.&lt;/p&gt;

&lt;p&gt;Me: “Fix this feature in the repo: &lt;a href="https://github.com/eristoddle/apple-books-annotation-import/issues/5" rel="noopener noreferrer"&gt;https://github.com/eristoddle/apple-books-annotation-import/issues/5&lt;/a&gt;. Do not do anything else.”&lt;/p&gt;

&lt;p&gt;I think “Do not do anything else” will be part of my prompt from now on.&lt;/p&gt;

&lt;p&gt;Jules said: “The code changes have been reviewed for syntactical correctness and logical soundness. Direct testing with an Apple Books database is not possible, so final validation will be up to the user.”&lt;/p&gt;

&lt;p&gt;Translation: “I changed some code and it compiles. Good luck!”&lt;/p&gt;

&lt;p&gt;After testing, I told Jules: “Now it seems like there are no annotation color indicators at all in the md files. Before they were all just purple. Now they are just not there at all.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4az4upl978wms13qsati.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4az4upl978wms13qsati.png" alt="jules-annotation-color-message-two" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second attempt produced different colors, but they didn’t match the highlight colors in the Books app.&lt;/p&gt;

&lt;p&gt;I mapped the colors Jules gave me to the actual colors and told it about the differences: “The color images are back and there are a variety of them, but they do not match the colors in the Books app. Here is how they differ (Books app color → markdown color): underline → both yellow and underline, purple → no icon at all, pink → red, yellow → purple, blue → blue, green → green.”&lt;/p&gt;

&lt;p&gt;I even included the log showing the distinct annotation style values from my test database. So this bug took three interactions, but Jules finally got the color mapping right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfjye3fwyv6n12bxlj7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfjye3fwyv6n12bxlj7o.png" alt="obsidian-annotation-colors" width="800" height="1381"&gt;&lt;/a&gt;&lt;/p&gt;
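
&lt;p&gt;For the curious, the final fix boils down to a small lookup table from the annotation style integers in the Books database to display colors. Here’s a minimal TypeScript sketch of the shape of it; the specific integer-to-color pairs are my own assumptions from the style values I logged, so verify them against your database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical sketch: map Apple Books annotation style integers to the
// color icon used in the markdown output. These pairs come from my own
// test database logs; treat them as assumptions, not documented values.
const ANNOTATION_STYLE_COLORS: Record&amp;lt;number, string&amp;gt; = {
  0: 'underline',
  1: 'green',
  2: 'blue',
  3: 'yellow',
  4: 'pink',
  5: 'purple',
};

function colorForStyle(style: number): string {
  // Fall back to 'yellow' for any style value not seen before.
  return ANNOTATION_STYLE_COLORS[style] ?? 'yellow';
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
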

&lt;h2&gt;
  
  
  Taking It to the Big Leagues
&lt;/h2&gt;

&lt;p&gt;My main Apple Books account has 4000 books and over 60 books with highlights. When I first worked on this plugin, the database structure was different, so I wanted to make sure everything worked in the wild before calling it done.&lt;/p&gt;

&lt;p&gt;The plugin handled the larger database just fine. All the covers loaded correctly, and the color annotations matched what I saw in the Books app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everything I Fixed In This Change
&lt;/h2&gt;

&lt;p&gt;It took about two hours total, not counting the issue Jules couldn’t fix. I let it code while I took notes for this article, worked on Obsidian cleanup, and checked out Reddit for a while.&lt;/p&gt;

&lt;p&gt;Here’s what got implemented:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ New interactive ebook selection dialog:&lt;/strong&gt; By accident when I was asking Jules to fix issues, but I’m counting it as a win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Fixed finding ebook covers:&lt;/strong&gt; This was the original first issue. Jules doubled the number of covers the plugin could find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Fixed annotation color icons:&lt;/strong&gt; Took three tries and some detailed feedback, but now the colors match what’s in the Books app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Fixed missing annotation dates:&lt;/strong&gt; This needed a second attempt. I had to tell Jules: “I have include dates set to true and include citations set to false. There are no dates in the markdown files though.” But it got it right the second time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❌ Did not fix gibberish in epub chapter names:&lt;/strong&gt; This one turned into a nightmare. The TypeScript build error carried over from the tests that never worked. I tried to get Jules to fix it once and it failed. The second time it was thinking forever, so I tried Copilot in VS Code, and that fixed the build.&lt;/p&gt;

&lt;p&gt;But then there was a runtime error I’d run into before when Claude Code tried to use a Node.js ebook parsing package. Something about not being able to access the file system in Electron. I went round and round with Jules on this one and finally gave up.&lt;/p&gt;

&lt;p&gt;One thing I realized is that I could tell Jules to build the app to test the build and fix issues, but I had to phrase it as a command. When I asked “Did you build the app to test it?” it said that wasn’t possible. But when I just told it to build the project, it did it.&lt;/p&gt;

&lt;p&gt;So I let it churn through fixing the build errors for 30 minutes. When I tested the result, the chapters still weren’t fixed. But I didn’t expect this one to be easy without doing some research first.&lt;/p&gt;

&lt;p&gt;I used 5 of my available 60 free tasks for the day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jules: Fast, Confused, but Generally Useful
&lt;/h2&gt;

&lt;p&gt;Jules is fast. That surprised me. At least until there’s back and forth on changes; then the conversation slows down.&lt;/p&gt;

&lt;p&gt;It was done with the next fix before I finished testing the previous one. I added one feature I didn’t plan on and fixed three issues in a little over an hour.&lt;/p&gt;

&lt;p&gt;But Jules went off on a weird tangent when I gave it what I thought were specific instructions. There seems to be no real interaction other than the first message and accepting the plan. I was worried about fixing issues in a branch it created if something didn’t work.&lt;/p&gt;

&lt;p&gt;Using a task like it was a chat and trying to fix more than one issue per task got confusing. Then I figured out I needed a different workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 chat per branch, but can do multiple related things&lt;/li&gt;
&lt;li&gt;1 chat can be as big as 1 feature&lt;/li&gt;
&lt;li&gt;Code review each branch&lt;/li&gt;
&lt;li&gt;If there are issues, go back to the chat and have Jules fix them&lt;/li&gt;
&lt;li&gt;If there are build errors, tell Jules specifically to build the app again, test, and fix any errors.&lt;/li&gt;
&lt;li&gt;Pull and test again&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I was used to my Claude Code process where everything happened in the branch I was on locally. The “Publish Branch” button in Jules seemed scary at first, but the second time I used it, it just pushed modifications to the existing branch. I was expecting merge conflicts or some other drama.&lt;/p&gt;

&lt;p&gt;Most things I asked Jules to do worked well, except for the chapter names issue. I also didn’t ask it to create unit tests, but it kept creating broken ones anyway. In each branch it created, I ran the tests once, and if they failed, I deleted them. They mainly failed because of TypeScript type issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Lazy Coding
&lt;/h2&gt;

&lt;p&gt;Sixty tasks a day for free is generous. I’m planning to move everything I have to GitHub that isn’t already there and create issues for any feature ideas or bugs I encounter.&lt;/p&gt;

&lt;p&gt;I could envision a process where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user puts in a support request&lt;/li&gt;
&lt;li&gt;An issue gets created on the repo&lt;/li&gt;
&lt;li&gt;I tell Jules to fix it&lt;/li&gt;
&lt;li&gt;Jules fixes it, creates a custom build, and the user tests it (don’t know about this, but I can dream)&lt;/li&gt;
&lt;li&gt;If the fix works, I merge it and create a new release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’m not sure exactly how this process would work in practice, but the potential is there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict: Worth the Price of Free
&lt;/h2&gt;

&lt;p&gt;Jules isn’t perfect. It ignored my initial instructions, created broken unit tests, and couldn’t fix the most complex issue I threw at it. But it’s fast, it’s free for now, and it actually solved most of the problems I needed solved.&lt;/p&gt;

&lt;p&gt;In two hours, I got a new feature and fixed three out of four issues that had been sitting in my backlog. The fourth issue would have taken me some research into ePub structure even if I were doing it manually.&lt;/p&gt;

&lt;p&gt;I’m definitely planning to shift more of my “vibe coding” projects to Jules. At 60 free tasks per day, I can afford to let it work on the boring stuff while I focus on the interesting problems. In fact, to actually use 60 tasks a day, I’d have to try running concurrent tasks, and Jules allows 5 of those.&lt;/p&gt;

&lt;p&gt;And if it goes off on another tangent and builds something I didn’t ask for? Well, sometimes the best features are the ones you never knew you needed.&lt;/p&gt;

</description>
      <category>llmcoding</category>
      <category>vibecoding</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Creating an Obsidian Plugin with Claude AI</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Mon, 02 Jun 2025 13:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/creating-an-obsidian-plugin-with-claude-ai-gaj</link>
      <guid>https://forem.com/eristoddle/creating-an-obsidian-plugin-with-claude-ai-gaj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d0eqmeljwetrhkifnmv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d0eqmeljwetrhkifnmv.jpg" alt="Image description" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this post, I took a break from the &lt;a href="https://dev.to/eristoddle/building-an-electron-app-from-scratch-with-claude-code-5c03"&gt;Electron project I am building with Claude Code&lt;/a&gt; to see how fast I could get a smaller project done. Since I am building a relatively simple Obsidian plugin, I went back to using Claude Desktop with the &lt;a href="https://github.com/modelcontextprotocol/servers/tree/main/src/sequentialthinking" rel="noopener noreferrer"&gt;sequential thinking&lt;/a&gt; and &lt;a href="https://github.com/wonderwhy-er/DesktopCommanderMCP" rel="noopener noreferrer"&gt;desktop commander&lt;/a&gt; MCP tools.&lt;/p&gt;

&lt;p&gt;And I learned a few more things about using AI to write code. Read on for the complete story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story Behind the Project
&lt;/h2&gt;

&lt;p&gt;I’m obsessed with having all my book highlights and notes in one place, and Obsidian seemed like a great place for them. First, I paid for a year’s Readwise subscription, which really didn’t cost too much and did the job well. But after the year, I realized I only used it to import my highlights from Kindle and Apple Books, and there were already free Obsidian plugins to import Kindle highlights.&lt;/p&gt;

&lt;p&gt;Then I discovered that Apple Books stored all these highlights and notes in an open, though hard to find and hard to understand, SQLite database. So I wrote a &lt;a href="https://dev.to/eristoddle/exporting-mac-osx-book-highlights-into-an-obsidian-vault-or-markdown-files-40lg"&gt;Python script to import Apple Books annotations&lt;/a&gt;, using examples from sources like this &lt;a href="https://github.com/davidfor/calibre-annotations/blob/master/readers/_iBooks.py" rel="noopener noreferrer"&gt;Calibre plugin repo&lt;/a&gt;, and used the Obsidian Python Scripter plugin to run it from Obsidian. I wrote most of that first script myself and only had AI help when I ran into issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problems with the Original Python Script
&lt;/h2&gt;

&lt;p&gt;The first issue with the Python script I wrote is that the Python Scripter plugin is no longer available. This was not much of a problem though. I just hard-coded the file path of my vault’s book notes folder in the script and it still worked for me.&lt;/p&gt;

&lt;p&gt;Only it wasn’t really working the way I thought it was. There were some issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The annotations were not sorted by the location in the book.&lt;/li&gt;
&lt;li&gt;The annotations weren’t grouped by chapter (there is still an issue in the current plugin: I have chapters now, but they are not human readable).&lt;/li&gt;
&lt;li&gt;There were no configuration options. I just edited the script when I wanted to change something.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Phase 1: Perfecting the Python Script with Claude
&lt;/h2&gt;

&lt;p&gt;Before jumping right into creating a plugin, I fixed the Python script’s sorting issue first. My reasoning was that if I got stuck in the plugin development process, I would at least have this script and it would work. And once it did, I could have Claude look at it as a working example. It turns out this was a good move for these exact reasons.&lt;/p&gt;

&lt;p&gt;I also thought I could create this plugin with only a Claude Project and no MCP support, but I was wrong about not using MCP. It might have been possible, but MCP tools made things easier. Here was my first message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I am using this Python file to import highlights and notes from the Mac iBooks app. I don’t think I am rendering them in order from the front of the book to the back in the markdown file. Let me know if I am and if not how to fix it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Sorting Challenge
&lt;/h3&gt;

&lt;p&gt;Claude initially gave me three sorting options, but none actually sorted by book location. To test the sort, I put highlights in a book with notes that told their order by location and by when I entered them. So I sent this message, along with a dump of the SQLite table that had the annotations:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The first method didn’t seem to work. The second two seemed to sort by date entered or modified instead of by location in the book. Here is the dump of that database table.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The table dump was key, because Claude analyzed the CFI strings that I’d been struggling with and created a sophisticated parser that could extract positional information from cryptic strings like &lt;code&gt;epubcfi(/6/24[c3.xhtml]!/4/188/2/1,:0,:1)&lt;/code&gt;. And I was done with that until the end, when I had to revisit it multiple times in the plugin version.&lt;/p&gt;
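
&lt;p&gt;To give a feel for what that parser does, here’s a rough TypeScript sketch of the idea (the script was actually Python, the names here are hypothetical, and the real parser handled more edge cases): pull the numeric steps out of the CFI path and compare them element by element, like version numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sketch only: turn "epubcfi(/6/24[c3.xhtml]!/4/188/2/1,:0,:1)" into a
// sortable list of numbers like [6, 24, 4, 188, 2, 1].
function cfiSortKey(cfi: string): number[] {
  const body = cfi.replace(/^epubcfi\(/, '').replace(/\)$/, '');
  // Keep only the path; drop the character-offset part after the comma.
  const path = body.split(',')[0];
  return [...path.matchAll(/\/(\d+)/g)].map((m) =&amp;gt; parseInt(m[1], 10));
}

// Compare two annotations by CFI, element by element.
function compareByCFI(a: string, b: string): number {
  const ka = cfiSortKey(a);
  const kb = cfiSortKey(b);
  for (let i = 0; i &amp;lt; Math.max(ka.length, kb.length); i++) {
    const diff = (ka[i] ?? 0) - (kb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
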

&lt;h3&gt;
  
  
  Adding Chapter Detection
&lt;/h3&gt;

&lt;p&gt;With sorting fixed, I pushed further:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Now that we have that fixed, is there a way to put chapter headings where they belong in the resulting markdown?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because I was using a small set of 5 books with highlights, this seemed to work well. Initially, chapters in some books came out like “c3.xhtml” and various other things. But it magically fixed that.&lt;/p&gt;

&lt;p&gt;But in the end, it was not magic. And just to let you know, I was using Claude 4, and Claude 4 will use workarounds and not tell you about it. When I tested the script on an account that had 60 books with highlights, none of the chapters in the results markdown were human readable.&lt;/p&gt;

&lt;p&gt;I looked into what Claude wrote and it was basically a hack. It was a series of ifs customized to that first set of 5 books, so only they came out right. These are still not human readable in the plugin. I figure I have to use the SQLite results in unison with the ePub file data to make them so, but that’s what new features are for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo728yja1b1tgi26dhgar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo728yja1b1tgi26dhgar.png" alt="new-obsidian-annotations-md-template" width="800" height="1327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced Features
&lt;/h3&gt;

&lt;p&gt;I wanted to make sure I added everything I could to the Python script before moving on to the plugin, so I added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Citation-ready format&lt;/li&gt;
&lt;li&gt;More metadata extraction&lt;/li&gt;
&lt;li&gt;Better error handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the code for that &lt;a href="https://gist.github.com/eristoddle/5a8e7dd0597d09d00aa5de066788c303" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Converting to an Obsidian Plugin
&lt;/h2&gt;

&lt;p&gt;It took me about an hour to update the Python script. I figured I was on a roll. I had working code, and all Claude had to do was convert it to JavaScript and make it an Obsidian plugin. Famous last words. There is always something and many times it is something stupid.&lt;/p&gt;

&lt;p&gt;I had updated the Python script in a Claude chat inside of a Claude Project, but hadn’t yet added any Project Resources. Now that I was going to work on the plugin, I uploaded these there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The final Python script:&lt;/strong&gt; So it could reference this instead of reinventing the wheel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The blog post I wrote about the original Python version:&lt;/strong&gt; I figured it would explain what it was doing and what I was trying to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Obsidian sample plugin repository:&lt;/strong&gt; I am not sure I needed this. Maybe at the beginning, but I eventually realized (duh) that project resources become part of the context, which means your chats get cut off quicker, so I removed it. Claude never had problems with the plugin API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I sent this message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have attached a Python script to import apple books annotations. I have also attached the Obsidian plugin sample for reference. I have also attached an article on how an older version of the Python script integrates with Obsidian. Help me create an Obsidian plugin that does the same thing, step by step.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Development Process
&lt;/h3&gt;

&lt;p&gt;Claude recognized I had the sequential thinking MCP, used it right away, and came up with a plan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp01dvp6aoft0hfj1nbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp01dvp6aoft0hfj1nbn.png" alt="claude-generate-obsidian-plugin" width="800" height="888"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But I forgot to tell Claude where the project was located, so after that, it spit out all the files in the chat. I quickly corrected that issue by telling it to create all the files in my project itself.&lt;/p&gt;

&lt;p&gt;The initial conversion went surprisingly smoothly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Project structure creation&lt;/strong&gt; - all necessary TypeScript files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database access logic&lt;/strong&gt; - converting Python SQLite to JavaScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markdown generation&lt;/strong&gt; - translating the formatting logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings interface&lt;/strong&gt; - creating a proper Obsidian settings panel, which I had been worried about up to this point.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrfsvm2cqtxensrldr32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrfsvm2cqtxensrldr32.png" alt="obsidian-annotation-import-settings" width="800" height="957"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Challenges and Solutions
&lt;/h3&gt;

&lt;p&gt;Well, it was functional. But let’s just say I was not even halfway done yet. Also, and I am so tired of this, Claude corrected my name everywhere from “Stephan” to “Stephen”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pc1pk1h1lep81hcu2qz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pc1pk1h1lep81hcu2qz.png" alt="claude-getting-my-name-wrong" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just like Google always does.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jxsexvhopwy4l1euxbh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jxsexvhopwy4l1euxbh.png" alt="fuck-you-google" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No, Google, my name is “Stephan.” But back on track. Here are the issues I ran into after I had a “working” Obsidian plugin.&lt;/p&gt;

&lt;h4&gt;
  
  
  SQLite Library Issues
&lt;/h4&gt;

&lt;p&gt;A big hurdle was database access. Claude initially tried &lt;code&gt;better-sqlite3&lt;/code&gt; but ran into Electron compatibility issues. We switched to &lt;code&gt;sql.js&lt;/code&gt;, a pure JavaScript SQLite implementation, but then faced WebAssembly loading problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Configure &lt;code&gt;sql.js&lt;/code&gt; with proper WASM file handling in the Obsidian environment.&lt;/p&gt;
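
&lt;p&gt;In case it saves someone the same fight, here’s a minimal sketch of that configuration, assuming the &lt;code&gt;sql-wasm.wasm&lt;/code&gt; file ships alongside the plugin and you already have the path to the Books database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Minimal sketch, not the plugin's exact code: point sql.js at a local
// copy of sql-wasm.wasm so it doesn't try to fetch one over the network
// from inside Electron.
import initSqlJs from 'sql.js';
import { readFileSync } from 'fs';

async function openBooksDb(dbPath: string, wasmDir: string) {
  const SQL = await initSqlJs({
    locateFile: (file) =&amp;gt; `${wasmDir}/${file}`,
  });
  // sql.js works on an in-memory copy of the database file.
  const fileBuffer = readFileSync(dbPath);
  return new SQL.Database(new Uint8Array(fileBuffer));
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
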

&lt;h4&gt;
  
  
  Query Formatting Problems
&lt;/h4&gt;

&lt;p&gt;Even after fixing the SQLite library, we had issues with SQL query formatting when passed to the command line. Claude described the approach as “bulletproof” right before it broke, which became a running theme.&lt;/p&gt;

&lt;h4&gt;
  
  
  Chapter Parsing Regression
&lt;/h4&gt;

&lt;p&gt;One book showed chapters as “Ahr5106 us trade bbp text 2” instead of readable names. This required Claude to revisit the CFI parsing logic and ensure consistency with the Python version. This is still a challenge and I am going to do research before I try to fix this.&lt;/p&gt;

&lt;h3&gt;
  
  
  The EPUB Metadata Challenge
&lt;/h3&gt;

&lt;p&gt;It was now two hours since I started revising the original Python script. The Python script used &lt;code&gt;ebooklib&lt;/code&gt; to extract book covers and enhanced metadata. Finding a JavaScript equivalent proved challenging. And I was a little gun-shy by now, so I asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is there anything in JavaScript that will do the same thing the python script did with &lt;code&gt;ebooklib&lt;/code&gt;, like get the cover image and other metadata? If you know of something, do not commit code until I say it is working. It works now and I don’t want to commit broken code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude suggested three options, and I chose &lt;code&gt;epub2&lt;/code&gt; for its popularity and feature set. However, it took several rounds of debugging across multiple chat sessions to get ePub parsing working correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Testing Challenges
&lt;/h3&gt;

&lt;p&gt;Testing on my actual Apple Books account (with 4,711 books) shook some new bugs loose:&lt;/p&gt;

&lt;h4&gt;
  
  
  Database Schema Variations
&lt;/h4&gt;

&lt;p&gt;The plugin tried to query columns that didn’t exist on my primary account, causing SQLite errors. The Python script handled this gracefully, but the plugin needed updates.&lt;/p&gt;
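
&lt;p&gt;The general fix is to ask SQLite which columns actually exist before building the query. A hedged sketch (the table name is illustrative, and the structural type only covers the bit of sql.js used here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Minimal structural type for the piece of sql.js this sketch needs.
type SqlJsDb = {
  exec(sql: string): { columns: string[]; values: unknown[][] }[];
};

// Discover the columns a table really has, so the SELECT can be built
// only from columns present in this account's schema. Illustrative only.
function existingColumns(db: SqlJsDb, table: string): Set&amp;lt;string&amp;gt; {
  const res = db.exec(`PRAGMA table_info(${table})`);
  // Each row looks like [cid, name, type, notnull, dflt_value, pk].
  if (!res.length) return new Set();
  return new Set(res[0].values.map((row) =&amp;gt; String(row[1])));
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
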

&lt;h4&gt;
  
  
  Memory and Buffer Limits
&lt;/h4&gt;

&lt;p&gt;With thousands of books, we hit “maxBuffer length exceeded” errors. Claude implemented chunked processing to handle large datasets.&lt;/p&gt;
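
&lt;p&gt;I don’t know exactly how Claude chunked it, but the usual shape of the fix is to page through results with LIMIT/OFFSET instead of pulling everything in one query. A rough sketch, reusing the &lt;code&gt;SqlJsDb&lt;/code&gt; type from the previous sketch and with illustrative table and column names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sketch of the chunking idea: page through annotations so no single
// result set gets huge. The table and column names are illustrative,
// not the real Apple Books schema.
function* annotationChunks(db: SqlJsDb, chunkSize = 500) {
  let offset = 0;
  while (true) {
    const res = db.exec(
      `SELECT * FROM annotations ORDER BY id LIMIT ${chunkSize} OFFSET ${offset}`
    );
    if (!res.length || res[0].values.length === 0) return;
    yield res[0].values;
    offset += chunkSize;
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
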

&lt;h4&gt;
  
  
  Annotation Grouping Issues
&lt;/h4&gt;

&lt;p&gt;The plugin kept breaking up annotations that should have been together and duplicating others. This took six rounds of debugging, with Claude repeatedly missing the fact that the Python version worked perfectly.&lt;/p&gt;

&lt;p&gt;I finally resorted to extended thinking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Think as hard as you can about the fact that the Python script is working, and the plugin is not and the only thing happening is SQLite queries and handling the results, so this should be fucking simple.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Development Timeline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hour 1-2:&lt;/strong&gt; Python script update, initial plugin creation, and basic functionality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hour 3:&lt;/strong&gt; Fighting with ePub metadata extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hour 4-6:&lt;/strong&gt; Real-world testing, debugging on large dataset, and configuration tweaking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total development time: About 6 hours across multiple sessions, with Claude handling the bulk of the coding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72papzemlnrrmm8b574x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72papzemlnrrmm8b574x.png" alt="claude-create-obsidian-plugin-results" width="800" height="2064"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned This Time
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Good
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rapid prototyping:&lt;/strong&gt; Claude excels at converting working logic between languages most of the time, though its failure to realize it could reuse the same SQLite queries in both JavaScript and Python was a pain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive solutions:&lt;/strong&gt; Often suggests improvements you haven’t considered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern recognition:&lt;/strong&gt; Great at handling complex parsing tasks like CFI strings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; Automatically generates thorough README files and comments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Challenging
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context switching:&lt;/strong&gt; Moving between chat sessions sometimes loses important context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overconfidence:&lt;/strong&gt; Claims solutions are “bulletproof” right before they break&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging persistence:&lt;/strong&gt; Sometimes fixates on wrong solutions instead of reverting to working code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name corrections:&lt;/strong&gt; Consistently “corrected” my name spelling throughout the project&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tips for Successful Claude Desktop Coding
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use projects&lt;/strong&gt; for better context persistence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be specific&lt;/strong&gt; about what’s working vs. broken&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain working versions&lt;/strong&gt; before attempting fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test incrementally&lt;/strong&gt; rather than making multiple changes at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide concrete examples&lt;/strong&gt; when debugging&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;I am happy with the results. There is still some more work to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not sure what is happening with the covers. The Python version had much more luck finding the covers, but then it used a more extensive library to do it.&lt;/li&gt;
&lt;li&gt;The chapter names are still gibberish in general. I think this is because it is just using the value in the CFI location string. I am guessing I have to look this up somehow in the ePub file.&lt;/li&gt;
&lt;li&gt;There is a configuration value for adding the annotation date to each entry, but I haven’t seen a date in the output yet. Not sure what is going on there.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for now I can use it, and when I fix it, I’ll just have it overwrite all the old files. Then, once I am happy with it, I’ll store a hash of the file in properties or something like that and have overwriting work more intelligently.&lt;/p&gt;
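
&lt;p&gt;That hash idea is simple enough to sketch now (helper names hypothetical): store a short hash of the generated markdown in the note’s properties, and on the next import only overwrite a file whose current content still matches its stored hash, meaning I never hand-edited it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createHash } from 'crypto';

// Sketch of smarter overwriting: hash the markdown the plugin wrote,
// keep that hash in the note's frontmatter, and skip any note whose
// content no longer matches its stored hash (it was edited by hand).
function contentHash(markdown: string): string {
  return createHash('sha256').update(markdown).digest('hex').slice(0, 16);
}

function safeToOverwrite(currentContent: string, storedHash: string): boolean {
  return contentHash(currentContent) === storedHash;
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
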

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vh0i4p0shp2n3jhndre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vh0i4p0shp2n3jhndre.png" alt="obsidian-book-note-example" width="800" height="1210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Install the Apple Books Highlight and Note Importer
&lt;/h2&gt;

&lt;p&gt;I do plan on releasing this as a community plugin. I just want to use it a while to see if there are any more bugs I want to fix or features I want to add.&lt;/p&gt;

&lt;p&gt;But for now the simplest way to install and test this plugin is with &lt;a href="https://github.com/TfTHacker/obsidian42-brat" rel="noopener noreferrer"&gt;BRAT (Beta Reviewer’s Auto-update Tool)&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install BRAT&lt;/strong&gt; from the Community Plugins store in Obsidian&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open BRAT settings&lt;/strong&gt; (Settings → BRAT)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Click “Add Beta plugin”&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enter this repository URL&lt;/strong&gt; : &lt;code&gt;https://github.com/eristoddle/obsidian-apple-books-import&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click “Add Plugin”&lt;/strong&gt; - Choose the latest version and BRAT will automatically install and enable it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-updates&lt;/strong&gt; : BRAT will automatically update the plugin when new versions are released if you choose the &lt;code&gt;latest&lt;/code&gt; version from the dropdown.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;I took a break from a more complex project for a day to see if I could finish a smaller one. And even though there is still more work to do here, I think it was a success and an excellent test. I will continue to explore new ways of making the “vibe coding” process go smoother. In the couple of days that it took me to write this, I found even more tools to help.&lt;/p&gt;

&lt;p&gt;I have no end of software ideas I have been collecting but never got to, because I never had the time. So I will continue to work on my main vibe-coding project while taking breaks to test new ways of doing this on smaller projects. My next post should be about the next set of things I learned &lt;a href="https://dev.to/eristoddle/building-an-electron-app-from-scratch-with-claude-code-5c03"&gt;developing an Electron app with Claude Code&lt;/a&gt;. I already have notes on it, but wanted to try this first.&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>obsidian</category>
      <category>javascript</category>
      <category>python</category>
    </item>
    <item>
      <title>Building an Electron App from Scratch with Claude Code</title>
      <dc:creator>Stephan Miller</dc:creator>
      <pubDate>Tue, 20 May 2025 12:00:00 +0000</pubDate>
      <link>https://forem.com/eristoddle/building-an-electron-app-from-scratch-with-claude-code-5c03</link>
      <guid>https://forem.com/eristoddle/building-an-electron-app-from-scratch-with-claude-code-5c03</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey98j5g9kkhxw3u8rn9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey98j5g9kkhxw3u8rn9v.png" alt="Image description" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have been curious about “vibe coding” ever since I heard about the process, even though I hate the name. I started by &lt;a href="https://dev.to/eristoddle/claude-mcps-vibe-coding-without-specialized-ides-part-1-1hmd"&gt;using Claude Desktop with MCP tools&lt;/a&gt; to build a code project from scratch and it felt like magic. That was until I discovered Claude Code and my workflow got so much easier and less chaotic right away.&lt;/p&gt;

&lt;p&gt;This post documents my adventure so far, using Claude Code to build an Electron writing app. I made a few mistakes and am still making them, but corrected them and think I am on a better path now. I doubt it’s the right path, but I’ll keep tweaking it until it works for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vibe Coding with Claude Desktop and MCP&lt;/li&gt;
&lt;li&gt;Features of Claude Code that Made Me Switch&lt;/li&gt;
&lt;li&gt;
Installing and Using Claude Code

&lt;ul&gt;
&lt;li&gt;Requirements&lt;/li&gt;
&lt;li&gt;Installing and Initializing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Building a Project from Scratch with Claude Code&lt;/li&gt;

&lt;li&gt;The Pain of Refactoring with Claude Code&lt;/li&gt;

&lt;li&gt;

My Current &lt;code&gt;CLAUDE.md&lt;/code&gt; file and Coding Process

&lt;ul&gt;
&lt;li&gt;How to Use This File&lt;/li&gt;
&lt;li&gt;Project Overview&lt;/li&gt;
&lt;li&gt;Key Requirements&lt;/li&gt;
&lt;li&gt;Important References&lt;/li&gt;
&lt;li&gt;Code Style Guidelines&lt;/li&gt;
&lt;li&gt;Project SDLC&lt;/li&gt;
&lt;li&gt;Build/Lint/Test Commands&lt;/li&gt;
&lt;li&gt;Development Status&lt;/li&gt;
&lt;li&gt;Next Development Steps&lt;/li&gt;
&lt;li&gt;QA Checklist&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Lessons I Learned Using Claude Code&lt;/li&gt;

&lt;li&gt;Features of Claude Code I Haven’t Tried Yet&lt;/li&gt;

&lt;li&gt;Claude Code Tips From Others I Plan on Trying&lt;/li&gt;

&lt;li&gt;

What Claude Code Built: Screenshots of My app

&lt;ul&gt;
&lt;li&gt;Dashboard&lt;/li&gt;
&lt;li&gt;Settings&lt;/li&gt;
&lt;li&gt;Writing Interface&lt;/li&gt;
&lt;li&gt;Character Relationship Graph&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

My Journey with Claude Code Continues

&lt;ul&gt;
&lt;li&gt;What I’ve Learned&lt;/li&gt;
&lt;li&gt;What’s Next&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Vibe Coding with Claude Desktop and MCP
&lt;/h2&gt;

&lt;p&gt;When I was &lt;a href="https://dev.to/eristoddle/claude-mcps-vibe-coding-without-specialized-ides-part-1-1hmd"&gt;using Claude Desktop to write code&lt;/a&gt;, I always felt like I was using workarounds to plan and organize. I started out with a standalone chat, but eventually got cut off and had to start a new chat to continue working on the app. So I switched to using a Claude Project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/eristoddle/vibe-coding-with-claude-desktop-and-mcp-part-2-switching-to-scrapy-elb"&gt;Working on the app in a Claude Project&lt;/a&gt; made things a little easier by allowing me to upload 50 reference documents, but that was a mess, because the project was always changing and I had to delete and re-upload changes to documentation all the time. I could have also tried connected to the project’s Github repo or simply putting the documentation in a folder in the local project.&lt;/p&gt;

&lt;p&gt;But around that time, I ran into &lt;a href="https://docs.anthropic.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, read up on it, and figured it would work better for what I needed, so I started a new project with it. I might go back to the first project with a better plan, because I was making progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features of Claude Code that Made Me Switch
&lt;/h2&gt;

&lt;p&gt;I am used to a &lt;strong&gt;command-line interface&lt;/strong&gt;. I use them all the time. It feels like I am doing work. In chat, I feel like I am getting nagged by a sales bot on an e-commerce site or arguing with customer service. I get that it makes no sense, but the command-line interface was a big feature for me.&lt;/p&gt;

&lt;p&gt;Another significant feature is the &lt;strong&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; file&lt;/strong&gt;, which holds your project memory for Claude Code. You can add things to it, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build/Lint/Test Commands&lt;/li&gt;
&lt;li&gt;Code Style Guidelines&lt;/li&gt;
&lt;li&gt;The SDLC process you want to use&lt;/li&gt;
&lt;li&gt;Reference other documentation on the project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means I can keep its memory in the project itself, instead of in uploaded Claude Project files. When I first started working with Claude Code, I started setting up MCP connections like I did with Desktop. But I then realized I was wasting my time and adding overhead, because &lt;strong&gt;the functionality I was getting by adding MCPs to Claude Desktop was already built into Code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I could see adding a database connection MCP in the future if I work on a project that has one, but right now I am not using any with Claude Code. Claude Code has a lot more features than the three I listed here and I will get to them later, but these were all the reasons I needed to make the switch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and Using Claude Code
&lt;/h2&gt;

&lt;p&gt;I have to admit that I basically ran the command to install Claude Code, ran the init command in a new project folder, and started playing with it. Writing this, I figured I should dig deeper into its documentation to make sure I missed nothing. And I had missed a few things that might have made this entire process smoother.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operating Systems&lt;/strong&gt; : macOS 10.15+, Ubuntu 20.04+/Debian 10+, or Windows via WSL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware&lt;/strong&gt; : 4GB RAM minimum&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software&lt;/strong&gt; : 

&lt;ul&gt;
&lt;li&gt;Node.js 18+&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://git-scm.com/downloads" rel="noopener noreferrer"&gt;git&lt;/a&gt; 2.23+ (optional)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cli.github.com/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or &lt;a href="https://gitlab.com/gitlab-org/cli" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt; CLI for PR workflows (optional)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/BurntSushi/ripgrep?tab=readme-ov-file#installation" rel="noopener noreferrer"&gt;ripgrep&lt;/a&gt; (rg) for enhanced file search (optional)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;One thing I missed there was installing &lt;code&gt;ripgrep&lt;/code&gt; for advanced search. Maybe that will make things quicker? I did not install it until I started writing this, so maybe that is why things seemed to take longer than they needed to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing and Initializing
&lt;/h3&gt;

&lt;p&gt;To get started, run this command to install Claude Code globally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install -g @anthropic-ai/claude-code

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then navigate to your project’s folder and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;claude

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After Claude Code started, I ran the following command to generate the &lt;code&gt;CLAUDE.md&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/init

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building a Project from Scratch with Claude Code
&lt;/h2&gt;

&lt;p&gt;I wasn’t sure I could do this because all the examples online were working with existing codebases and I wanted to start one from scratch. So I asked Claude, and it said I could.&lt;/p&gt;

&lt;p&gt;And I decided I would build an Electron app. A writing app. Yeah, I know. Like we need another one of those. But it was something I had always wanted to do and have been collecting a list of features over the years, which only made the odds I would ever get to it worse.&lt;/p&gt;

&lt;p&gt;And, of course, you can’t make a writing app these days without including AI, so I am using AI to write an AI writing app. And with all the middlemen in the AI world, I figured I would make it so you just enter your LLM API keys and get started. The app would be completely self-contained in an Electron desktop app. For now, at least.&lt;/p&gt;

&lt;p&gt;But if I were going to start over, I would start with more of a plan. I basically pointed Claude Code to the features of other similar apps and told it I wanted to build an Electron app with similar features.&lt;/p&gt;

&lt;p&gt;As I chatted with Claude Code and worked on the app for the first couple of hours, I just let it update &lt;code&gt;CLAUDE.md&lt;/code&gt; with the project details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pain of Refactoring with Claude Code
&lt;/h2&gt;

&lt;p&gt;I was learning from my mistakes as I went. The one I learned quickly was to not only plan features, but to plan the architecture the features will work in. Think ahead. By the time I had an app where some features were working enough to be useful, I decided I should have used a plugin architecture.&lt;/p&gt;

&lt;p&gt;I realized this architecture would work better for what I was doing when I was adding a context menu for the editor. When you right click in an editor, there are many things you can have it do, but when you add a new action, you are basically just adding another item to that menu that runs a new function when it is clicked.&lt;/p&gt;

&lt;p&gt;Now I also think this type of architecture could be useful when using Claude Code to create an app like this. Have it build the core system and add a well-defined interface for plugins to interact with the core. Then build all the features as plugins. I think that this would keep the context lighter because Claude could use the plugin API docs to build the features instead of searching and parsing everything all the time.&lt;/p&gt;
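
&lt;p&gt;To make that concrete, here’s a rough TypeScript sketch of the kind of interface I mean. Every name here is hypothetical; the point is that the core exposes one narrow registration surface and each feature lives behind it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical plugin API sketch: the core owns the editor and menus,
// and features only touch them through this narrow interface.
interface EditorAction {
  id: string;
  label: string;
  run(selection: string): void | Promise&amp;lt;void&amp;gt;;
}

interface CoreAPI {
  registerEditorAction(action: EditorAction): void;
}

interface AppPlugin {
  id: string;
  onLoad(core: CoreAPI): void;
}

// A feature then becomes a small, self-contained module:
const rewriteWithAI: AppPlugin = {
  id: 'rewrite-with-ai',
  onLoad(core) {
    core.registerEditorAction({
      id: 'rewrite-selection',
      label: 'Rewrite with AI',
      run: async (selection) =&amp;gt; {
        // Call the configured LLM with the selected text here.
      },
    });
  },
};

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
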

&lt;p&gt;But moving toward an architecture like that later in the game is hard. My first attempt to globally change things in the project was successful, but it took quite a while. And I still have a way to go. So, currently I am looking through the code base to see what all has to change for the complete refactor to happen. That way, when I tell it to refactor, I can give it some guidance and know what to expect when it does it right.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Current &lt;code&gt;CLAUDE.md&lt;/code&gt; file and Coding Process
&lt;/h2&gt;

&lt;p&gt;For the first few hours, I just told Claude Code to build new features and keep track of where it was at in the &lt;code&gt;CLAUDE.md&lt;/code&gt; file. But after a while, I knew I had to rein it in. Here are the sections I have in my &lt;code&gt;CLAUDE.md&lt;/code&gt; file currently, which has changed often and will continue to:&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Use This File
&lt;/h3&gt;

&lt;p&gt;When I decided to refactor the &lt;code&gt;CLAUDE.md&lt;/code&gt;, I had some ideas, but also asked Claude Code what would work. It suggested this section. Here’s what it has in it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;This file contains essential project information&lt;/li&gt;
&lt;li&gt;Detailed documentation can be found in referenced files&lt;/li&gt;
&lt;li&gt;Update both this file and referenced files when making changes&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Project Overview
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;EmberText is an AI-powered desktop writing application built with Electron, React, and TypeScript. It combines features from TypingMind, NovelCrafter, SudoWrite, and Scrivener.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Requirements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Projects must be stored as markdown files in a folder structure (similar to Obsidian)&lt;/li&gt;
&lt;li&gt;Each chapter should be saved as an individual markdown file&lt;/li&gt;
&lt;li&gt;Project metadata and structure should be stored in a JSON file within the project directory&lt;/li&gt;
&lt;li&gt;The app should be able to open and edit these files directly, enabling compatibility with other markdown editors&lt;/li&gt;
&lt;li&gt;Help a writer complete a project with AI or other tools from idea to ebook publishing&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Important References
&lt;/h3&gt;

&lt;p&gt;I had everything in my &lt;code&gt;CLAUDE.md&lt;/code&gt; file to begin with, but it quickly got huge. So I thought about how to trim it down. I also asked Claude Code what should be in the file versus what it could access on an as-needed basis. There is the concept of &lt;a href="https://docs.anthropic.com/en/docs/claude-code/memory#claude-md-imports" rel="noopener noreferrer"&gt;imports&lt;/a&gt;, but I didn’t want to use that because I would essentially be doing the same thing, just splitting this massive amount of information across multiple files that still got read every time I ran Claude Code.&lt;/p&gt;

&lt;p&gt;So I created a project knowledge base folder, put everything that it wouldn’t need every time there, and added an “Important References” section in &lt;code&gt;CLAUDE.md&lt;/code&gt;. Here’s what I have in this section:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Code Project Structure: knowledge_base/project/structure.md&lt;/li&gt;
&lt;li&gt;Writing Project Structure: knowledge_base/project/directory-structure.md&lt;/li&gt;
&lt;li&gt;Key Features: knowledge_base/project/key-features.md&lt;/li&gt;
&lt;li&gt;Potential Feature Ideas: knowledge_base/feature-ideas&lt;/li&gt;
&lt;li&gt;Features Ready for Development or Already Developed: knowledge_base/features&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Code Style Guidelines
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Formatting&lt;/strong&gt; : Use 2-space indentation, single quotes, no semicolons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Imports&lt;/strong&gt; : Group imports (node builtins, external, internal)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Types&lt;/strong&gt; : Use TypeScript with explicit return types on functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Naming&lt;/strong&gt; : camelCase for variables/functions, PascalCase for classes/components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Components&lt;/strong&gt; : One component per file, match filename to component name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt; : Use try/catch blocks with specific error types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functions&lt;/strong&gt; : Prefer pure functions, avoid side effects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt; : Write unit tests for all new functionality&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Project SDLC
&lt;/h3&gt;

&lt;p&gt;As I was working on the project, I decided to “formalize” the development process, so that Claude Code knew what steps I wanted it to take. Here is what I have in this section:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;New features will be picked from “Key Features” in knowledge_base/project/key-features.md when “Next development steps” is empty. Once those are done, they will be sourced from knowledge_base/feature-ideas.&lt;/li&gt;
&lt;li&gt;I will ask you to think about the features and you will create notes about the feature in a markdown file named after it in knowledge_base/features. This file will be used as reference when we update a feature or fix bugs. The file will be kept updated with the feature changes.&lt;/li&gt;
&lt;li&gt;You will then reference the file next to the feature in “Next development steps” like this: “Add workshop chat: knowledge_base/features/AddWorkShopChat.md”&lt;/li&gt;
&lt;li&gt;I will ask you to do the next task in “Next development steps”. You will work on it and tell me when it is done, asking any necessary questions along the way.&lt;/li&gt;
&lt;li&gt;When that is done, you will move the feature from “Next development steps” to “Current progress” and make sure to move the reference to the note in knowledge_base/features to it.&lt;/li&gt;
&lt;li&gt;You will also update “Key Features” in knowledge_base/project/key-features.md if anything changed there.&lt;/li&gt;
&lt;li&gt;Update the project structure in knowledge_base/project/structure.md and writing project structure in knowledge_base/project/directory-structure.md when necessary.&lt;/li&gt;
&lt;li&gt;You will add testing the feature to “QA Checklist”&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Build/Lint/Test Commands
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Install: &lt;code&gt;npm install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start: &lt;code&gt;npm start&lt;/code&gt; (builds main and renderer processes and starts Electron)&lt;/li&gt;
&lt;li&gt;Dev: &lt;code&gt;npm run dev&lt;/code&gt; (development mode with live reload)&lt;/li&gt;
&lt;li&gt;Build Main: &lt;code&gt;npm run build:main&lt;/code&gt; (TypeScript compilation for main process)&lt;/li&gt;
&lt;li&gt;Build Renderer: &lt;code&gt;npm run build:renderer&lt;/code&gt; (Webpack bundling for renderer)&lt;/li&gt;
&lt;li&gt;Test: &lt;code&gt;npm test&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Lint: &lt;code&gt;npm run lint&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Format: &lt;code&gt;npm run format&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Development Status
&lt;/h3&gt;

&lt;p&gt;This is a list of features it has already implemented. Just a simple description of the feature and the path to the feature plan file. More details on the feature plan file below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Development Steps
&lt;/h3&gt;

&lt;p&gt;This is a list of features I told it to put here. I usually have 3-4 features in this list. When I tell it to add a feature, I have it “think hard” about how to implement it, create a file in my project knowledge base folder that details how the feature will be implemented, and add it to this list like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement scene scaffolding system: knowledge_base/features/SceneScaffolding.md&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives me a chance to look it over before it implements it. I am also hoping it follows my instructions and only references those files when dealing with that specific feature to keep unnecessary information out of the context. The files it creates are pretty extensive and have the following headings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview&lt;/li&gt;
&lt;li&gt;Key Requirements&lt;/li&gt;
&lt;li&gt;User Interface Components&lt;/li&gt;
&lt;li&gt;Data Structures&lt;/li&gt;
&lt;li&gt;Components to Create&lt;/li&gt;
&lt;li&gt;AI Integration&lt;/li&gt;
&lt;li&gt;User Experience Flow&lt;/li&gt;
&lt;li&gt;Integration with Existing Features&lt;/li&gt;
&lt;li&gt;Implementation Plan&lt;/li&gt;
&lt;li&gt;Dependency Notes&lt;/li&gt;
&lt;li&gt;Testing Requirements&lt;/li&gt;
&lt;li&gt;Future Enhancements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  QA Checklist
&lt;/h3&gt;

&lt;p&gt;Claude Code will add items to this list when it has finished a feature. It’s actually a reminder for me to test new features, so it probably doesn’t need to be in the file. For now, though, it keeps me from forgetting to test things, and it doesn’t take up that much space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons I Learned Using Claude Code
&lt;/h2&gt;

&lt;p&gt;I learned most of these lessons after making many mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plan well:&lt;/strong&gt; This is a lesson I keep learning over and over. But I still jumped right into this project without much of a plan. All I had was Electron and a list of features. I should have thought more about architecture at the beginning instead of refactoring a few hours in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop and look at the project holistically often:&lt;/strong&gt; Especially if you don’t plan well. I knew I wanted AI to help write a book at every step, from idea to ebook, but I sort of just let it do what it wanted. And by the time I realized the AI functionality could be a modal used in multiple places, it had already put AI forms in three or four different places.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell Claude Code to “think harder” when you are planning a feature:&lt;/strong&gt; This is the third lesson that involves “planning.” Notice a pattern? When you tell Claude to “think”, it triggers &lt;a href="https://docs.anthropic.com/en/docs/claude-code/tutorials#use-extended-thinking" rel="noopener noreferrer"&gt;Claude’s extended thinking&lt;/a&gt;, and when you tell it to “think harder”, “think a lot”, or “think more”, it triggers deeper thinking (see the example prompt after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do your own git commits:&lt;/strong&gt; I didn’t think Claude Code would commit code without being explicitly told to, but now and then it did. And each time, it was when I wanted to be done for the night and, of course, the code was broken. So I spent more time having it fix the code so that I could commit working code. I could have reverted, but it had just added a new feature that I knew was at least partially working.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QA and review the UI often:&lt;/strong&gt; Claude will forget things. It will think hard about a feature, create a plan, work on it, tell you it is done, and still skip 20% of what was in the plan. It was really hard for me to stop adding shiny new things to the app to do mundane testing, but I eventually had to backtrack through a lot of things and fix bugs.&lt;/li&gt;
&lt;/ul&gt;
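
&lt;p&gt;To make the “think harder” lesson concrete, a planning prompt in this workflow looks something like the following. The feature name is just an example; the point is the trigger phrase plus the instruction to write the plan to a knowledge base file.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;think harder about how to implement the character relationship graph.
Write your plan to knowledge_base/features/CharacterRelationshipGraph.md
and reference it in "Next development steps".
&lt;/code&gt;&lt;/pre&gt;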

&lt;h2&gt;
  
  
  Features of Claude Code I Haven’t Tried Yet
&lt;/h2&gt;

&lt;p&gt;I jumped right into this project without reading 95% of &lt;a href="https://docs.anthropic.com/en/docs/claude-code/overview" rel="noopener noreferrer"&gt;the docs&lt;/a&gt;. But I figured I should at least peruse them before I wrote this. Here are some things I ran into that aren’t covered in this article but that you might want to check out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free Claude Code with Claude Max:&lt;/strong&gt; I may or may not try this in the future. If you pay for the $100/month Claude Max subscription, Claude Code comes along with it. In my estimation, working with Claude Code costs me around $5 an hour, and with my limited time, I have only spent about $50/month so far with the API. Maybe with &lt;code&gt;ripgrep&lt;/code&gt; installed, the process will go faster and cost me more per hour.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code/tutorials#use-claude-code-as-an-mcp-server" rel="noopener noreferrer"&gt;Use Claude Code as an MCP server&lt;/a&gt;:&lt;/strong&gt; I thought this was cool, but then wasn’t sure what I would use it for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code/tutorials#create-custom-slash-commands" rel="noopener noreferrer"&gt;Create custom slash commands&lt;/a&gt;:&lt;/strong&gt; You can create a &lt;code&gt;.claude/commands&lt;/code&gt; folder in your project to hold commands you use over and over. You can also design these commands to accept arguments (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
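
&lt;p&gt;As a quick illustration of the slash command idea: you drop a markdown file into &lt;code&gt;.claude/commands&lt;/code&gt;, its filename becomes the command, and &lt;code&gt;$ARGUMENTS&lt;/code&gt; is replaced with whatever you type after the command. A hypothetical &lt;code&gt;.claude/commands/plan-feature.md&lt;/code&gt; wrapping my feature-planning step might contain:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;think harder about how to implement $ARGUMENTS.
Create a plan file for it in knowledge_base/features and reference
that file in "Next development steps".
&lt;/code&gt;&lt;/pre&gt;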

&lt;h2&gt;
  
  
  Claude Code Tips From Others I Plan on Trying
&lt;/h2&gt;

&lt;p&gt;I work on projects like this and write articles about them in the evenings when I don’t have freelance writing work, and in the last two weeks I haven’t been able to touch this project. But I still looked for tips whenever I could and took notes on them. Here are some things I plan on trying in the future:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manage compacting more effectively:&lt;/strong&gt; I noticed Claude Code mentioning something about compacting context and showing a percentage. With further research, I’ve learned that Claude Code will compress the conversation history to stay within the context window’s limits. I found this &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1ko5pxk/claude_code_is_a_beast_tips_from_a_week_of/" rel="noopener noreferrer"&gt;post on Reddit&lt;/a&gt; recommending you always compact manually, for example, right before you start on a new feature, because auto-compacting can really screw you if it happens at the wrong time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage other LLMs and &lt;a href="https://repomix.com/" rel="noopener noreferrer"&gt;repomix&lt;/a&gt; or &lt;a href="https://github.com/mufeedvh/code2prompt" rel="noopener noreferrer"&gt;code2prompt&lt;/a&gt;:&lt;/strong&gt; The refactoring process was a pain. Fixing bugs was a pain. &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1kkatqk/comment/mrtsfn2/?utm_source=share&amp;amp;utm_medium=web3x&amp;amp;utm_name=web3xcss&amp;amp;utm_term=1&amp;amp;utm_content=share_button" rel="noopener noreferrer"&gt;Redditor squareboxrox mentioned&lt;/a&gt; using these tools to upload your whole project to another LLM when you run into the inevitable bug-fix rabbit hole, instead of saying “still not fixed” over and over. Not every bug is like this, maybe 20%, but I figured I’d give this method a try (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
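
&lt;p&gt;A rough sketch of how I expect to use both tips, one inside a Claude Code session and one from the shell. Treat the repomix invocation as an assumption and check its docs for output options before relying on it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Inside a Claude Code session: compact manually at a natural
# breakpoint instead of letting auto-compact fire mid-feature.
/compact

# From the shell: pack the whole repo into a single file you can
# hand to another LLM when a bug fix goes down a rabbit hole.
npx repomix
&lt;/code&gt;&lt;/pre&gt;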

&lt;h2&gt;
  
  
  What Claude Code Built: Screenshots of My App
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; These screenshots showcase the functional aspects of EmberText rather than polished design. As this project focused on testing Claude Code’s capabilities for rapidly building functional features, I prioritized getting core functionality working over UI refinement. Consider these a “developer preview” of what’s possible with AI-assisted coding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Dashboard
&lt;/h3&gt;

&lt;p&gt;The home screen showing multiple writing projects with their structure and metadata. Claude Code implemented the card-based layout and project metadata tracking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqatg649thtn2uvi871dw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqatg649thtn2uvi871dw.png" alt="embertext-workbench" width="800" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Settings
&lt;/h3&gt;

&lt;p&gt;Settings panel for connecting to various AI models. This shows how Claude Code implemented form validation and API connection management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l5b8680mz5gism3vnj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l5b8680mz5gism3vnj3.png" alt="embertext-settings" width="800" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing Interface
&lt;/h3&gt;

&lt;p&gt;The main editor with project structure sidebar and formatting tools. Note the line-numbered editor and structure navigation that Claude Code implemented based on my requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ha0dotuo887ok6513r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ha0dotuo887ok6513r6.png" alt="embertext-project" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Character Relationship Graph
&lt;/h3&gt;

&lt;p&gt;An interactive visualization showing connections between characters. This feature shows Claude Code’s ability to integrate complex visualizations using React.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbpolr5n8x8qjgmzysej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbpolr5n8x8qjgmzysej.png" alt="embertext-relationships" width="800" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m impressed by how quickly Claude Code could assemble a functional writing application with complex features like relationship graphs and structured document editing. While the UI would benefit from refinement, the core functionality is there and working as intended after just 16 hours of development time.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Journey with Claude Code Continues
&lt;/h2&gt;

&lt;p&gt;After spending about 16 hours and $80 in API costs, I’ve built a writing application with some functionality. My app currently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connects to the Claude and OpenAI APIs&lt;/li&gt;
&lt;li&gt;Generates text and dialogue&lt;/li&gt;
&lt;li&gt;Creates outline and plot structures&lt;/li&gt;
&lt;li&gt;Has a plot timeline&lt;/li&gt;
&lt;li&gt;Has a character relationship graph&lt;/li&gt;
&lt;li&gt;Adds location, character, and item context to API calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The refactoring process I described earlier was painful enough that I have temporarily paused development to reorganize my project memory files and carefully plan upcoming features. It has also reinforced the biggest lesson I learned: &lt;strong&gt;planning saves significant time and frustration&lt;/strong&gt; with AI-assisted development.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I’ve Learned
&lt;/h3&gt;

&lt;p&gt;The most valuable lesson from this experiment is that Claude Code does well when given clear direction and architecture upfront. While it can adapt on the fly, major architectural shifts are still painful, much like traditional development, just faster.&lt;/p&gt;

&lt;p&gt;My approach now combines the benefits of AI coding with disciplined software design: I plan thoroughly, create detailed feature specifications, and only then engage Claude Code to implement them.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s Next
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complete my writing app&lt;/strong&gt;: Once I’ve implemented the remaining AI helpers, I’ll test the application with a real book project. Since I built this tool primarily for myself, this test will be the most accurate measure of success.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore smaller projects&lt;/strong&gt;: The development speed has inspired me to tackle several smaller ideas in parallel. With proper planning, I could potentially complete 2-3 modest apps in the same time it would typically take to build one.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you’re interested in trying Claude Code yourself, I recommend starting with something small and well-defined. The initial time investment in proper planning and documentation pays dividends in development speed and reduces frustrating rework.&lt;/p&gt;

&lt;p&gt;I’ll be documenting more of this journey as I continue exploring AI-assisted development.&lt;/p&gt;

</description>
      <category>llmcoding</category>
      <category>vibecoding</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
