Forem: Phil Rentier Digital

The AI Coding Subsidy Era Just Ended. Cursor and Bolt Are Next.

Phil Rentier Digital — Tue, 12 May 2026 13:41:12 +0000

For 3 weeks, I left GitHub Community open in a tab while I worked. The same threads came back every morning. "GOOD BYE GITHUB COPILOT". "evaluating GITLAB". "Claude Teams it is." Tens of thousands of words of anger in 6 weeks, and half the posts cited the same scribbled calculation (3900 divided by 27 makes 144 Opus requests per month on a Pro+ plan). Reddit fury doesn't translate mechanically into exodus. What's actually happening is quieter. And bigger.

TLDR: GitHub multiplied the price of Opus by 3.6 in 6 weeks. That's the boring side. The interesting side, the one that changes how you pick an AI coding tool for the next 12 months, is what forced Microsoft to move. The same equation is coming for the others.

GitHub just did what the 4 other major AI coding distributors will have to do in the next 12 months. Not out of malice. Because at some point, the math catches up to everyone. I'm going to name the mechanism, explain why Microsoft cracked first, and give you the order in which the rest will follow.

What a Subsidy Cascade Looks Like

Before any of this makes sense, the mechanism needs a name. A subsidy cascade goes in 3 steps.

Step 1: a distributor absorbs the gap between the price it announces and the real cost of what it delivers. Grabbing market share is worth more than margin right now, so the gap stays hidden.

Step 2: one player cracks. Usually the biggest, or the one most accountable to public quarterly numbers. The cost gets too visible to keep absorbing.

Step 3: the others follow within 6 to 12 months. The math becomes defendable now that the first one broke the taboo.

You've seen this movie before. Uber 2018 (the end of driver bonuses). MoviePass the same year (collapse). Netflix 2022 cracked down on password sharing. Spotify squeezed royalty payouts in 2024. WeWork 2019 when the lease subsidies stopped making sense to anyone. Every time, the same arc. Someone subsidizes to win the market, then someone else runs the numbers in public. The rest of the field has 6 to 12 months to figure out a new plan.

One caveat before I keep going. Anthropic might be subsidizing too. Nobody really knows if the Max plans are profitable for them. Today's transparency isn't a guarantee for tomorrow. But Anthropic is a producer, not a distributor. If they're subsidizing, they finance it differently (data acquisition, training compute amortization), not by eating the inference cost of someone else's customer. The math is structurally different.

That distinction is the whole article.

3.6x in Six Weeks: GitHub's Receipt

Now the receipt. Same model throughout, to avoid the Opus 4.6 vs 4.7 confusion that's poisoning half the takes online.

April 16, 2026: GitHub announces Opus 4.7 GA on Copilot with a promotional 7.5x multiplier valid until April 30. April 30: automatic shift to 15x (the regular Pro+ multiplier). June 1: jump to 27x for anyone still on the annual request-based plan. That's 3.6x on the same model in 6 weeks, sourceable line by line on github.docs and github.blog.

In parallel, on April 24: GitHub paused new sign-ups for Pro, Pro+, and Student. They pulled Opus from Pro. They pulled Opus 4.5 and 4.6 from Pro+. Three moves at once. Price going up, access shrinking, forced migration to usage-based billing.

Mario Rodriguez (CPO at GitHub) put the official line in the blog post: "the current premium request model is no longer sustainable."

Translation in plain English: we can't keep paying for what you're using. Even when "we" is Microsoft.

One thing worth flagging. The 27x might come back down if GitHub backpedals under pressure. The exact number moves. The mechanism stays.

144 Requests a Month: What Pro Plus Actually Buys You

Before June 1, you ran agents without watching the meter much. Pro+ at $39 a month, the "premium request" counter in some opaque unit, the gap between you and the cap big enough that nobody bothered tracking it day-to-day.

After June 1, here is what you actually get on Opus 4.7.

3900 AI credits per month divided by 27x equals 144 Opus requests maximum per month. That's 4.8 per day. A typical Claude Code agentic session (2 hours, real feature, the kind where you let the agent loop and self-correct) burns somewhere between 50 and 200 inferences depending on how much it has to plan, rewrite, and retry. One serious session eats a week of your Opus quota.

Compare with the producer's own pricing page (claude.ai/upgrade). Max $100 gives you 5x the rate limits of Pro. Max $200 gives you 20x the rate limits of Pro. For an agentic long-running workload, the ratio flips without ambiguity.

One caveat. GitHub Copilot is still excellent for classic autocomplete and targeted completions on the standard completion model (still at 1x). The problem is specific to Opus in long-running agentic mode, not to all of Copilot. If your day is mostly inline completions in VS Code, you're not the target of this article. If you launch agents and let them run for 2 hours, you redo the math or you walk.

Walking, it turns out, is the rational option.

Why Microsoft Cracked First

Three reasons, stacked.

Volume. With hundreds of thousands of Copilot Pro+ subscribers, the absorbed cost on Opus in long-running agentic mode reached a number Microsoft couldn't pretend not to see at quarterly review. No other AI coding distributor has that scale. When the absolute amount gets that big, the meeting tone changes.

Forced transparency. The move from premium requests to usage-based billing makes every inference traceable down to the token. As long as you charge in opaque "premium requests", you can hide the gap inside the global margin. Once you go token-based, the unit economics show up on a spreadsheet that someone at Microsoft has to sign.

Internal compute arbitrage. Microsoft compute resources are finite. The direct Azure margin on OpenAI inference is better than the Copilot margin relayed through Anthropic. Internal pressure to reallocate compute is structurally one-way.

None of the other distributors have any of those 3 levers. Cursor doesn't operate at Microsoft scale on subscribers. Bolt has no cloud of its own to arbitrage compute against. And Lovable runs everything through external APIs at retail rates. They have one trajectory.

Follow.

Cursor, Bolt, Lovable, v0: The Order Matters

Nominative predictions now. With reasoning. Pushback welcome.

Cursor. Heavy Claude Opus and GPT-5 user. Raised serious money recently, but inference subsidy eats fundraises faster than people realize. Current pricing is $20 Pro with most features unlimited. That's the textbook subsidy reveal scenario. Best guess: repricing in Q3 2026.

Bolt and Lovable. Both positioned as "build apps from a prompt" for non-devs. Which means per-user inference is even higher, because their users don't know how to optimize a context window. Flat $20 to $30 a month is structurally insolvent. Best guess: repricing Q3 or Q4 2026.

v0 (Vercel). Backed by Vercel's own compute infra, which gives them more runway than the others. Possibly later, Q4 2026 or Q1 2027.

To be clear about what I'm doing here. These are predictions drawn from a pattern, not certainties with a date stamp. Either these tools reprice, or they consolidate toward the producer, or they find a new business model. No scenario maintains current pricing on an 18-month horizon. Pick the one you think is most likely. I picked repricing.

The distributors will tell you I'm wrong. The math will tell you who is.

The Producer Wins by Math, Not Marketing

The flip is structural. While distributors adjust their pricing, the direct producer (Anthropic for Opus, OpenAI for GPT, Google for Gemini) sees its relative position improve mechanically, without lifting a finger.

Anthropic didn't attack GitHub. No press release, no counter-promotion. They didn't even cut Max plan prices to capitalize on it. The math did it for them.

That's a structural win for producers over distributors, not a commercial win for Anthropic over GitHub. There's a difference, and it matters for what you do next.

You've seen this pattern in other markets. Spotify versus the labels. Netflix versus cable. Each time, when the distributor stopped being able to subsidize, the producer took back the customer relationship.

The App Store versus indie developers tells the same story on a longer timeline. Apple subsidized distribution for years (curation, payment rails, refund handling, search), built the empire on the 30% cut, then started squeezing developers when the alternatives had already shrunk. The producers (the devs themselves) couldn't escape because there was nowhere to go. The producers in AI coding can. Anthropic and OpenAI sell direct to you with 2 clicks. There is no app store moat between you and them. The only thing between you and the producer today is a monthly habit you haven't reviewed.

For the end user, this means one thing. In 12 to 18 months, you're going to end up at the direct producer, whether it's called Anthropic, OpenAI, or Google. The choice isn't "which distributor" anymore. It's "which producer".

It's the same kind of structural call I made when I moved my agent stack off MCP and onto CLIs. Different layer, same logic. Pick the architecture that survives the repricing of the layer above.

How to Pick Your AI Plan Without Being Stupid in 2026

3 questions. Answer them honestly.

Q1: How much do you pay per month across all AI coding tools combined?

Under $20: stay on the free tier or the cheapest Pro of whatever distributor is most convenient. You're not the target of the repricing.

$20 to $50: move to the direct producer. Anthropic Max $100 or ChatGPT Plus depending on which model dominates your workflow.

Over $50: Max $200 or direct API access becomes almost mandatory. Also, track that budget as a monthly metric, not a recurring auto-charge you forgot about.

Q2: How many hours per week of long-running agentic work?

Over 5 hours: get out of request-based pricing. The math doesn't forgive at that volume anymore.

Q3: Is your main tool today a distributor or a producer?

If distributor, plan your migration in the next 12 months even if nothing has visibly moved on their side yet. The criterion that survives every repricing wave is simple. Reason in unit cost per inference, not in monthly subscription headline.

One caveat. For enterprises on Azure or Microsoft volume contracts, the arithmetic changes with discount tiers nobody publishes. This article is written for the solopreneur who pays from their own card.

A quick personal note, since someone is going to email about it. In February, I walked away from a $200-a-month OpenClaw setup and rebuilt the same workflow for $15. Today, $200 a month at Anthropic direct becomes defendable where it didn't 3 months ago. The math, itself, has flipped. The framework that flipped it is the one above.

Accept that your favorite tool today isn't necessarily your favorite tool in 12 months.

Microsoft eventually handed over the invoice it had been keeping in its pocket for 2 years. Everyone looked at the number, did their little Reddit outrage, and kept using the same tool with slightly different pricing.

The mechanism, itself, hasn't moved.

Pick your producer before your distributor picks one for you.

Sources

GitHub Docs, Model multipliers for annual plans.
GitHub Blog (April 24, 2026), Changes to GitHub Copilot Individual plans.
GitHub Changelog (April 16, 2026), Claude Opus 4.7 is generally available.
Anthropic, Claude pricing and plans.

Your SaaS Is One Week and $1,100 Away From Being Cloned. Cloudflare Just Proved It.

Phil Rentier Digital — Sun, 10 May 2026 13:41:13 +0000

A comment on a recent Hacker News thread about chardet relicensing made me close my MacBook for two minutes. Roritharr, a SaaS consultant, was telling the story of one of his client's engineers who had reverse-engineered the backend in a week with Claude Code, shipping a functionally identical API. He finished with a raw question: "How do we protect ourselves against a competitor doing this?". The deeper question underneath: "If your backend is trivial enough to be implemented by a large language model, what value are you providing?".

A bit later, Cloudflare published vinext. Five days, $1,100 in tokens, one engineer, 94% of the Next.js API covered, MIT license. The question is no longer theoretical.

TLDR: Cloudflare rebuilt Next.js in five days for $1,100. Mark Pilgrim lost chardet. If you wonder what protects your product in 2026, the answer is not in your code. It never was.

This article is not going to talk about copyleft. It is about you, your SaaS, and what is left when a competitor can clone your backend for the price of a used MacBook.

Six Weeks Ago, a Hacker News Comment Rewrote What I Thought a SaaS Was

Roritharr's comment was the first one in the thread that did not feel academic. Most of the chardet thread was about lawyers, licenses, and whether Mark Pilgrim had a leg to stand on. His comment came in sideways: a working SaaS, a working backend, one engineer with Claude Code who had cloned the whole thing on his own time.

The 213-point thread had 244 comments by the time I read it through. Almost all were arguing about lawyers. His was the only one asking what to do on Monday morning if your competitor pulls the same trick.

The reply that hit hardest came from a user named ShowalkKama: "If your backend is trivial enough to be implemented by a large language model, what value are you providing? I know it's a provoking question but that answers why a competitor is not a competitor.". That sentence sat with me for a while. Six weeks later, Cloudflare's vinext landed and made the question retroactively concrete.

I am not here to litigate the chardet drama. Every article since March has done that, and I have nothing to add on the legal turf. I am here to say what most of those articles avoided: this is not about open source. This is about whether your product would survive a motivated competitor with $1,100 in API credits, and most products would not.

The Code Was Never the Moat. The Friction Was.

Nobody on that thread wanted to say it out loud. (Saying it out loud is the kind of sentence that ruins your Monday if you run a Series A SaaS.)

What we have been calling a "technical moat" for fifteen years was almost always an avatar of reproduction friction. Copyleft did not protect the code. It monetized the cost of rewriting it from scratch. Technical complexity was a rent: as long as cloning your stack took six engineers and a year, your competitor did not bother.

Armin Ronacher, the guy who wrote Flask, said it plainly when The Register asked him about chardet: "Copyleft code like the GPL heavily depends on copyrights and friction to enforce it. But because it's fundamentally in the open, with or without tests, you can trivially rewrite it these days.". The guy who built one of Python's most-used libraries is telling you that the legal scaffolding around open source was holding because rewriting was hard, not because lawyers were scary.

AI brought that cost down to $1,100 and a week. When the friction goes, 80% of what passed for a moat goes with it. Your edge did not disappear in March 2026. It just became visible. Or its absence became visible, depending on which side of the wallet you are on.

A commenter named 3rodents put it shorter on the same thread: "As engineers, we often think only about code, but code has never been what makes a business succeed.".

We knew this in business school. We pretended otherwise in standups.

$1,100. Five Days. 94% of Next.js. The Public Cost of Cloning.

Numbers settle arguments. Here are the public ones.

Steve Faulkner, one engineer at Cloudflare, used Claude Opus 4.6 via OpenCode across more than 800 sessions to rebuild Next.js. The bill was $1,100 in API tokens. The output: vinext. 67,000 lines of code versus 194,000 in the Next.js core. 1,700+ Vitest tests, 380 Playwright tests, 94% of the public Next.js API surface covered. Builds run 4.4x faster, bundles ship 57% smaller. MIT license. 7,000+ GitHub stars by April. Almost every line was written by AI, on Faulkner's own admission.

Cloudflare framed it as "pragmatic compatibility, not bug-for-bug parity. Targets 95%+ of real-world Next.js apps." Translation: if you have a Next.js app at work, vinext probably runs it.

In parallel, Dan Blanchard rewrote chardet, the Python charset detection library that ships with 130 million downloads per month. Five days of development. 41x performance improvement, going from 11 files per second to 451. 3,931 lines total. Git blame attribution to Mark Pilgrim, the original author: 0%. JPlag similarity analysis: 0.04% average, 1.29% maximum. (JPlag is the academic plagiarism detector universities use to catch students. It found nothing.)

The Register reported the numbers. Pilgrim, who had been off the internet for fifteen years, came back to fight the relicensing. His point was on copyright. The point of this article is not.

Now the caveat, because anyone trying to sell you the dream of one-week clones is selling you something. Hacktron, an AI security tool, ran their scanner on vinext and found 45 vulnerabilities. 24 were manually validated. Their write-up has a sentence that I have been thinking about since: "Most of the tests driving vinext are functional requirements. Vulnerabilities do not live there. They live in the negative space, in complex interactions between layers, the stuff nobody wrote a test for.".

So yes, there is a hidden cost to AI rewrites. The clone has bugs the original did not. The clone has security holes the original patched two CVE cycles ago. But that hidden cost does not save you. It just means your competitor's clone ships with bugs while it eats your lunch.

I rebuilt my own stack from scratch last summer when Anthropic shut down a workflow I depended on, and the only reason it stung was the time, not the money. I did the whole thing for $15 a month instead of $200. The arithmetic works either direction; the pain only travels one way.

Three Moats That Survive AI Rewrites. And Three That Don't.

Three moats that hold after AI got cheap. Three fakes you should stop counting.

1. Switching cost that lives in your users' heads, not in your code.

Not "data" in the abstract. The data that matters is the muscle memory your user has invested in your shortcuts, the automations they have configured and would have to redo from scratch elsewhere, the dozen invisible sub-routines where your tool got embedded in their daily workflow. Linear versus Jira. A team with six months of muscle memory in Linear does not migrate for 5% better. You can clone Linear in three months with Claude. You cannot clone the 50,000 keyboard shortcuts internalized by their users.

The clone will look identical in a side-by-side demo. Your users will hate it for reasons they cannot fully articulate. That hate is the moat.

2. Distribution acquired before your product existed.

Not "audience" in the generic sense. Not "I have a Twitter." The channel that delivers qualified users to you before you ship the next feature. Pieter Levels before Nomad List. Theo Browne before create-t3-app. If you have that, you can clone your own code every six months without losing users. If you do not, your competitor vibe-codes a copy with an audience and overtakes you in two months.

Distribution is the only thing that compounds in the same direction as code clones. A clone of your product without a clone of your audience is a free demo for whoever owns the channel.

3. Regulatory capital and enterprise integrations.

Not "compliance" generically. SOC 2 Type II, HIPAA, ISO 27001, enterprise contracts negotiated over six months, FINRA licenses, direct integration with a core banking system that demands eighteen months of onboarding at the customer site. No AI rewrite gives you that in a week. This is the only moat where code literally does not play a role.

The user rwmj said it on the same HN thread, talking about serious enterprise software: "It's the sales channel, the human engineer who is sent on site, the regulatory framework that ensures the customer can operate legally and obtain insurance.". That sentence is worth printing and taping to your wall.

Now the three fakes. These are the ones I see founders point to when they tell me they have a moat, and I have to keep my face neutral.

Brand in early stage is just renewable marketing. A competitor with $1M in seed can buy your brand recognition in six months. Brand is a moat at Coca-Cola scale. At seed-stage SaaS scale, it is a logo on a website.

Data network effects are real for Google and Meta. They are an illusion for 99% of SaaS founders who claim them. Your dataset is not a network effect if a competitor can bootstrap an equivalent one in three months by scraping your public outputs and buying complementary data. Network effects compound. Your CSV does not.

24/7 support can be cloned by a competitor opening a Discord with three contractors. Support is a feature, not a moat.

If you ran the list and your "moat" was on the wrong side, the next section will not feel comfortable.

What Saved Vercel Was Not the Code. It Never Was.

Vinext has been live since February. MIT license, 7,000+ stars, 94% API coverage, 4.4x faster than Next.js. If the code was Vercel's moat, they were dead by April. They are not. The interesting question is why.

Run Vercel through the three-moat filter.

Switching cost in users' heads: enormous. Hundreds of thousands of developers have vercel.json configs, deploy hooks, and CI conventions in their muscle memory. Migrating an enterprise project from Vercel to vinext + Cloudflare Workers takes weeks of testing and a senior engineer who actually wants to do it. Most teams do not. The hate-the-clone friction I described above? Vercel built ten years of it.

Distribution acquired: Notion, Stripe, Hashicorp, CIO.gov as showcase customers. Vercel does not need Next.js to distribute their next feature. They have a sales pipeline that would survive losing the framework entirely. They could open-source v0, sunset Next.js, and still have a $5B valuation conversation next year.

Enterprise integrations and capital: Turbopack written in Rust, the v0 ecosystem, a proprietary AI SDK, multi-million-dollar enterprise contracts. None of that is in the open-source Next.js repo. It was never there.

So what saved Vercel was never about winning the framework war. They had already built moats elsewhere.

Guillermo Rauch, the CEO, posted on X that "Cloudflare's mission is to fork the entire developer ecosystem and destroy open source. Vinext was an excuse to swindle developers into using their proprietary runtimes instead of Nodejs.". The Pragmatic Engineer newsletter reported the quote. It is the move of someone defending the wrong moat. Vercel's actual moats had nothing to do with the framework. Tweeting did not save them. The moats they had built before the code did.

Hong Minhee pointed out the quiet irony in writing: "Vercel reimplemented GNU Bash using AI and published it, then got visibly upset when Cloudflare reimplemented Next.js the same way.". The spirit of sharing apparently runs in one direction. (I noticed.)

If your first move when cloned is litigation, you did not have a moat.

If You Have None of the Three, You Have a Demo

If your SaaS has none of the three from earlier, you do not have a product. You have a demo that holds because nobody has yet had the motivation to clone you. That is not the same as being protected. That is being beneath notice. The day you grow past beneath-notice, your $1,100 problem starts.

Investing in defensive code does not work in 2026. Obfuscation, artificial complexity, closing your source the way Cal.com did in April: all of it is rearranging deck chairs. The cloning happens at the API surface, not at the source. (Source-available licensing was always more about VC optics than security anyway.)

The work that matters is investing in one of the three moats, or pivoting to a niche where you can build one, or accepting that you are a feature company on a one-year clock and pricing accordingly. Add to that the shipping discipline that turns vibe-coded demos into real software, so you keep moving while the clone is busy reproducing what you already shipped six months ago.

Vercel spent its morning tweeting about the death of open source while Cloudflare was pushing commits to vinext. At least Vercel has distribution, enterprise contracts, and v0 running while they argue on X. You, reading this, what is your edge outside the repo I could clone this weekend for the price of a pizza?

Your real product is what is left after the clone. Check tonight.

Sources

Steve Faulkner / Cloudflare, How we rebuilt Next.js with AI in one week
Hacktron, Vibe-Hacking Cloudflare's Vibe-Coded Next.js Replacement
Dan Blanchard, Everything Claude Saw: A Transparent Account of the Chardet v7 Rewrite
The Register, Chardet dispute shows how AI will kill software licensing
Hacker News, No right to relicense this project
Pragmatic Engineer, The Pulse: Cloudflare rewrites Next.js as AI rewrites commercial open source

Your Resume Got You Zero Interviews. 200 Vibe Coders Got Yours.

Phil Rentier Digital — Sat, 09 May 2026 13:41:10 +0000

Peter Grafe, CEO of BlueAlpha (small marketing shop, you wouldn't know them, that's the point), got 200 applications in two days for one role. 95% disqualified before a human opened the PDF 😬. The 10 that survived had to vibe code something in five days. You, meanwhile, are polishing your LinkedIn opening line for the third time.

If you can't vibe code, you just became unemployable.

TLDR: The resume is cooked. Recruiters get 200 plausible applications in 48 hours, they stopped pretending to read them, and the filter moved somewhere else. You probably don't know where.

When anybody can generate a plausible application in five minutes, the resume stops being a signal. It becomes noise. And noise gets filtered without a read.

You didn't lose those interviews. You didn't even play.

What follows is what the 10 do differently.

200 Applications in Two Days. The 10 That Survived Were Vibe Coders.

Grafe published the math himself in a Sherwood News piece this month. 200 applications, 48 hours, 95% out before any reading happened. He's not bragging, he's tired. He says he stopped opening the PDFs because the AI-generated cover letters all sound like a LinkedIn HR consultant on his fourth coffee.

The 10 candidates who got past the filter didn't have a better CV. They had something the other 190 didn't. They could ship.

Grafe gave them a brief, five days, and a vague problem. The ones who survived produced a working prototype. The ones who didn't, didn't.

That math is the job market itself.

When Faking a Signal Becomes Free, the Signal Stops Working

A signal that costs nothing to fake stops being a signal. That's why your inbox is full of 5-star Amazon reviews you don't trust, LinkedIn skill badges nobody verifies, and certificates that mean nothing because the website handed them out for completing a 12-minute video.

Resumes were already a weak signal a decade ago. Hiring managers admitted as much. They kept reading them because no cheaper alternative existed. The cost of writing a CV was high enough to discourage random people, low enough that you got real candidates. That equilibrium held for fifty years.

The AI didn't kill the resume. The resume was already wounded. AI just made the cost of producing one drop to zero, and the equilibrium snapped.

What AI fabricates for free, the market stops valuing.

The Resume Is Done. The Numbers Already Buried It.

TestGorilla surveyed 2,160 employers and candidates across the US and UK this year. 85% of employers now use skills-based hiring. That number was 81% last year and 73% the year before. The trend isn't subtle.

Same survey: 71% of employers say skills tests predict performance better than resumes. 86% of US hiring managers and 89% of UK hiring managers report problems with CVs. One in three recruiters admits they can't tell if the resume in front of them is accurate. They're not even pretending anymore.

Half the employers in the survey have dropped degree requirements. Two thirds say their AI-generated cover letter detector is busy. (It's not very good. It just runs all the time.)

So what do they trust? They trust what you can build in front of them with constraints, time pressure, and a brief. Grafe put it bluntly in the same Sherwood piece: "the bar has shifted from do you understand technology to can you produce something with it."

The resume was fakeable long before AI. AI just dropped the price to zero, and the filter cracked.

Vibe Coding Is the New Word. It Took 18 Months Instead of 10 Years.

Remember when knowing Word was a silent prerequisite for any office job? Nobody put "Microsoft Word, intermediate" on a resume because it was assumed. The shift took 10 years. From "secretaries use it" to "if you can't, the door is on your left."

Vibe coding is doing the same thing in 18 months. And no, this isn't just a tech industry move.

Harlem Capital, a venture fund, published their interview process. A senior associate candidate had to build an AI agent that automates industry research in a week, then brief the partners with the output. Another candidate had to vibe code a portfolio dashboard. Their head of talent Nicole DeTommaso wrote it in plain English: "You are not told which tools to use or how to go about it. You are just expected to figure it out."

Crux Analytics, an analytics firm, embeds a practical AI project in every single hire, technical or not. CEO Jacob Bennett told Sherwood the test is about "where did they use AI, where didn't they, and why". The code itself isn't the point.

BlueAlpha, the marketing agency from the top of this article, makes commercial candidates fire up Claude Code during the interview itself.

Look at that list. A VC fund, an analytics firm, a marketing agency. None of them is hiring developers.

Word took 10 years because companies pushed the tool. Vibe coding takes 18 months because the interview is the push. You don't get hired and then learn it. You learn it before you walk in, or you don't walk in at all.

Call it an audition, because that's what it is. Vague brief, five days, someone watches what comes out.

The Cheating Trick You Spent a Year Learning Just Became the Test.

Resume Genius surveyed 1,000 active US job seekers this year. 22% admit using AI in real time during interviews. 78% use AI somewhere in the job hunt. There's a YouTube short selling the trick that's doing absurdly well for its channel, the kind of outlier you don't get without a real candidate appetite. The title sells the secret. The audience confirms it.

A whole industry sprung up to sell stealth AI overlays. Fake browser tabs, transparent windows, fancy keyboard shortcuts, the works. Candidates spent 18 months learning to hide AI from interviewers.

Then this happened.

Canva published a public engineering blog called Yes, You Can Use AI in Our Interviews. In fact, we insist. They expect candidates to use Copilot, Cursor, and Claude during the technical round. Half their frontend and backend engineers are daily AI users, so the interview now matches the job. They evaluate when and how candidates lean on AI, how they break down ambiguous briefs, and whether they can spot bugs in AI-generated code.

Sierra rewrote their entire onsite around AI. Plan, build for two hours with full AI access, then review. Codebase debugging interviews with PR drafts to improve via coding agents. Harlem Capital, Crux Analytics, BlueAlpha. Same move.

Not everywhere yet. Plenty of companies still proctor with anti-AI software, lock the browser, and watch your tab switches. The stealth trick has a market for now. But the companies that set the bench, the ones whose hiring practices get copied six months later, have already flipped.

And yes, AI doesn't always make you faster. METR ran a 2025 field study on experienced open-source devs. With AI active on real tasks, they ran 19% slower. They thought they'd be faster. They were wrong. The interview is built around judgment, not speed. When to lean on the tool and when to set it aside.

You learned to cheat just in time for cheating to become the test.

Two Catches Before You Sprint.

Today's skill is not tomorrow's leverage. Stefan Stern, visiting professor at Bayes Business School, gave Sherwood the cleanest pushback: "attitude is a more important consideration than today's aptitude". An employer that over-indexes on current vibe coding skills risks missing candidates who would learn faster and outpace them in six months. Smart hiring managers know this trap. They watch for the candidate who didn't ship the cleanest prototype but explained their reasoning best.

Skill atrophy is real. If you delegate every line of code and every decision frame to the AI, you lose the ability to judge what comes back. And judgment is exactly what the recruiters are testing. The candidate who shipped the prototype but can't say why they used the tool here and not there fails the interview just as hard as the one who didn't ship at all. I went into the move from gambling-with-AI to actually shipping in a method I built after enough vibe coding disasters, and the symptoms of skill atrophy are very specific.

What doesn't fabricate for free is judgment. The interview is built around that. C'est ça.

The New Rules. Start This Week.

1. Pick one tool. This week. Not three. Lovable, Cursor, Claude Code, Replit, whichever. The choice matters less than the wrist time. Sit down with one of them on Saturday and build something tiny. Don't watch tutorials. Don't read comparisons. Build. The 8-step method in Vibe Coding, For Real was written for this exact problem because most non-devs spend three weeks "researching" and ship nothing.

2. Ship one tiny thing. Public. With a URL. A landing page that takes a form. A small tool that does one thing. Anything that has a public URL and survives a real user clicking on it. Recruiters can sniff a side project that lived for two days. They want to see something that survived its own deploy. If you want a quick checklist of what separates a real shipped thing from a demo that explodes the moment a real user touches it, the credit-burn guide on Lovable covers most of the failure modes for non-devs starting out.

3. Expect the cliff. Your first prototype will work. Your second one will explode. The vibe coding learning curve has a classic drop right when you go from "demo on my laptop" to "must stay up for 24 hours and not leak the database." It happens to everybody. The fault isn't you (it's the second little pig finding out straw doesn't hold against an actual wind).

4. Be able to say where you used AI and where you didn't. And why. This is the test Crux Analytics literally runs. Pick a small project, document your decisions, and rehearse a 90-second walkthrough. Where did you trust the first answer? Where did you push back? Pick the moment something broke and explain what you changed. Karen from Accounting is going to ask you that exact question in November, even if she calls it something else.

200 Got the Interview Today. 1,000 Will Tomorrow.

The shift isn't hypothetical, it's already running. The 200 vibe coders who got auditions this month aren't a spike, they're the new floor. Peter Grafe threw 190 PDFs in the bin without reading them and hired the one who shipped a prototype in five days.

The only question left is whether you're in the 200, or in the 800 nobody opens.

Open Claude Code this weekend. It's shorter than your cover letter.

Sources

How AI is turning every job interview into a coding interview, Chris Stokel-Walker, Sherwood News, 6 May 2026: https://sherwood.news/tech/use-ai-interview/
The State of Skills-Based Hiring 2025 Report, TestGorilla: https://www.testgorilla.com/skills-based-hiring/state-of-skills-based-hiring-2025/
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, METR, July 2025: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

Add a CLI to Your App or Watch Claude Code Ping You on Every Feature

Phil Rentier Digital — Fri, 08 May 2026 13:41:12 +0000

CLI is the new MCP. Slogan aside: CLI is super powers handed to Claude, Codex, and every agent that codes for you. Letting coding LLMs verify their own work programmatically gives them an unfair advantage over classic fullstack apps that didn’t ship that surface.

CLI is the new full stack.

TLDR: Two apps, same modern stack, a 1.8x gap in commits shipped over 30 days. The gap doesn't come from the AI, the framework, or the backend, but from a layer that "stack 2026" guides forgot, and that your scripts/ folder won't replace.

The Friday night before I left for the Costa Brava, I wanted to ship one small thing before closing the laptop (without my wife yelling at me).

Two Claude Code windows side by side. Same prompt to each. Two different apps, same stack at 95%.

By the time I shut the laptop, one app had shipped a feature in three autonomous iterations. The other had me clicking in the admin like a junior trying to debug prod from a phone. Same agent, same me, one folder of difference.

Same Stack, One Folder of Difference

First app, left window. Claude Code writes the mutation, types a command into its terminal, reads the JSON that comes back, spots that one field is wrongly cast, fixes it, retypes. Three iterations in total autonomy. I come back, say “have a good weekend”. The commit is ready, validated at 100%.

Second app, right window. Claude Code writes the mutation, then stops. It pings me. Can you open the admin and check that this works? I click. I refresh, then re-click. Re-ping. My wife starts raising her voice. Whatever, we'll see Monday. Otherwise I would have spent the entire evening being the mouse of an agent that was supposed to code in my place.

Same stack, same me, same Claude Code on both. The only difference fits in one folder.

The two apps are mine. One is a back-office that syncs a WooCommerce catalog through a partner API every night, plus a weekly CSV feed from a distributor. The other is a back-office piloting a network of WooCommerce client e-shops (deploys, theme updates, plugin sync, the usual fleet thing). Both built since six months. Both running Next.js, Convex, shadcn, Vercel. Same CLAUDE.md, same conventions.

One has a CLI as its central layer. The other has a scripts/ folder.

That's it. That's the whole gap.

By "CLI" I mean a real entrypoint with sub-commands named after business actions (catalog refresh, partner sync, site init), wired into the exact same business layer that the dashboard uses. You type bun run cli partner sync --dry-run, and the same code path that runs when an admin clicks "Sync" runs, except it returns JSON to stdout.

The other app has none of that. Just .mjs files with names like fix-thing-2025-08.mjs (admit it, you have a folder like that too). Each one written to "pass once". Most of them never ran a second time.

That's the entire difference. And it changed how the agent worked at every level.

30 Days of Commits Don't Lie: 1.8x More Features, Half the Fixes

I went back through the git history of both repos over the same 30-day window in May.

The CLI-app shipped 272 commits. The scripts-app shipped 150. That's a 1.8x ratio, on the same me, same agent, same daily routine.

Inside the CLI repo, every single sub-command got touched at least once during the window. 100%. Inside the scripts/ folder, only 29% of the files saw any activity. The rest were dormant. 41% of all the script files in the scripts-app had been written, run once, and never opened again. The oldest one I found that fits this profile hadn't been touched in 57 days. I had completely forgotten it existed until I went looking.

There's one more number that's interesting, but I want to flag it as a hypothesis, not as proof. Looking at commit messages tagged fix: versus feat:, the CLI-app had a fix-to-feat ratio of 0.44. The scripts-app sat at 0.82. Roughly twice as many fix commits per feature on the side without a CLI.

I can't prove the CLI causes that gap. The two apps have different domain maturity, different complexity, different coverage of edge cases. Half the difference might come from the fact that the back-office for client sites is simply older and more fiddly than the partner API one. But the gap is consistent with what I observe daily, and it tracks the autonomy gap I described in the intro: the agent ships cleaner work when it has a way to verify itself, so fewer regressions sneak through.

The orphan rate (41% versus 0%) and the velocity gap (1.8x) aren't hypotheses. Those I can read straight from git log.

The Real Mechanism: Agents Need a Text-Structured Surface

The reason the CLI-app produces autonomous iterations while the scripts-app produces ping-fests has nothing to do with code quality or model size.

It's about the surface the agent has to validate its own work.

Think about what Claude Code actually does in a feature loop. It writes code, then it needs to know if the code does what it was supposed to do. If the only way to check is "open the dashboard, click around, look at the screen", the agent can't do it. Browsers return DOM. DOM without a human eye to interpret what's rendered is opaque to an agent. The colors, the loading states, the modal that popped up, the validation message at the bottom, all of it is meaningful to a person and noise to a model. The agent has no ground truth, so it stops and asks you.

A CLI returns text. JSON, structured stdout, exit codes. Things an agent can read, parse, reason about. The agent runs the command, reads the output, sees that partnerStatus: "rejected" means the mutation didn't go through, fixes the code, runs again. No human in the loop. The feedback signal is natively legible to the model.

That's the whole principle. Surface text-structured equals agent autonomous. Surface DOM-only equals agent that pings you on every iteration.

This is also why MCP servers, REST APIs, tRPC endpoints, GraphQL all work for agents calling your service. They're all text-structured surfaces. A CLI is just the simplest, most local incarnation of this principle for the agent that's coding your own app. Not calling a remote service. Writing code in your repo and needing to test it now.

You can simulate this with Playwright pointed at your dashboard. People do. It works, sort of. It also costs a 10x slowdown, a flaky retry layer, and a screenshot-comparison step that breaks every time you ship a UI change. A CLI returns the same answer in milliseconds with no flakiness, because text was always the thing the agent wanted in the first place.

The 2026 Stack Forgot a Layer (And Every AI Code Gen Tool Skips It Too)

The Four-Layer Development Stack: Three Bright, One Forgotten

Go read any "best stack to launch your AI-coded tool" guide written between February and April 2026. KDnuggets, Idlen, Context Studios, MindStudio, you can pick one at random. They all converge on the same six or seven layers. Next.js for the frontend. shadcn for the UI kit. Supabase or Convex for the backend. Clerk for auth. Stripe for payments. Resend for transactional email. Vercel for hosting. Some add Tailwind, OpenAI, Claude, Gemini.

There are at least 50 of those guides published in the last three months. None of them mentions a CLI for your app.

Same blind spot on the AI side. Cursor, v0, Bolt, Lovable, Claude Code itself when it scaffolds a new project. All of them generate a frontend, a backend, a hosting config. Zero of them generate a CLI as a first-class layer. If you ask Claude Code to "set up a Next.js app with Convex and Stripe", you'll get those three things and nothing else. The CLI, if any, will appear later as scaffolding (next dev, convex dev) and that's it.

This wasn't a problem in 2020. In 2020, you wrote your own code, and your IDE was your feedback loop. F5, F12, console.log, console.log, console.log. The DOM was fine because you were the one reading it.

In 2026, you're not the one writing most of the code. The agent is. And the agent doesn't have eyes.

A 2026 stack with no CLI layer forces the agent to depend on you for every iteration. The agent writes a mutation, you click in the admin, you tell the agent if it worked. The agent writes a sync job, you tail -f the logs, you tell the agent what you saw. Every feature loop has you as the mandatory middle node. You think you're prompting an agent to ship for you, you're actually playing browser intermediary for the agent.

The fourth layer follows from one fact: if you want the agent to ship autonomously, you need to give it a surface it can read.

Idlen's piece argues that picking the wrong backend means rewriting your data models at 2am. Yeah, and it's worse if you don't have a CLI, because you're rewriting them by hand instead of running bun run cli model migrate.

Why Scripts Rot and CLIs Live

The 41% orphan rate doesn't come from laziness. It comes from the fact that a scripts/ folder doesn't ask anything of you architecturally.

You write scripts/migrate-orders-2025-04.mjs because you have an emergency. You run it once. It works. You commit it (or you don't, depending on how panicked you were). Three weeks later, another emergency. You write scripts/migrate-orders-fix.mjs. Same problem, slightly different name. You don't reuse the first one because you don't remember it exists. There's no scripts/ --help. There's just an ls.

The whole folder ends up like Karen from Accounting's filing cabinet: technically organized, practically unusable. Everything is "there", nobody knows where, even Karen has stopped looking.

A CLI forces a different shape. You can't add partner sync as a sub-command without registering it in the entrypoint, which means you see all the other sub-commands every time you add a new one. Discoverability is built into the tool. New sub-commands inherit the same flags (--dry-run, --limit, --verbose), the same logger, the same error handling. Idempotence becomes easy because you're already passing through a shared business layer that the dashboard also uses.

That's why the touched-rate sits at 100% on the CLI side. I'm not more disciplined when I use a CLI. The CLI is just architecturally hostile to throwaway code in a way scripts/ never is.

And --help is doing more than helping you. It's the entrypoint for any agent that lands on your repo. Claude Code types bun run cli --help once and now knows every business action it can trigger, with its flags and its description. No prompt engineering, no doc to feed. The CLI documents itself, to humans and to agents at the same time. That's what scripts/ will never give you, no matter how clean your filenames are.

Caveat I should put right here, while I'm bragging. My own CLI has a real weakness. Out of 14 sub-commands, 11 have no description in the --help output. That's 79% of my commands appearing as bare names with no explanation. The CLI forced execution discipline. It did not force documentation discipline. Claude Code can still discover every command, parse the JSON output, and use it. A junior dev opening the repo for the first time would have to read the source. I'm fixing it slowly, but the lesson stands: the architecture solves the running problem, not the teaching problem. You still have to write the docstrings.

Your App Is Already Agentic by Accident

The thing nobody tells you in the stack-2026 guides: a CLI that shares the business layer with your UI makes your app natively agent-ready. Not as a separate product. As a free side effect.

Three concrete ways to expose your CLI to an agent that isn't sitting in your IDE.

Wrap it as an MCP server. Maybe 50 lines of TypeScript. You write a thin MCP server that registers each sub-command of your CLI as an MCP tool. The tool input maps to CLI flags. The tool output is the JSON the CLI already returns. Boom, any MCP client (Claude Desktop, Cursor, anything that speaks MCP) can call your CLI as a native tool. You wrapped your existing CLI and called it an MCP server.

Cron plus agent. A scheduler runs bun run cli catalog refresh every six hours. The JSON output goes into a Convex table. A background agent reads the latest row, decides if the refresh hit a partner error, and if so triggers a follow-up bun run cli partner reconnect. No browser. No human. The agent makes decisions based on text the CLI emits, then triggers more CLI commands. You just turned your back-office into a self-healing loop.

HTTP gateway shell-out. You expose a tiny Express or Hono endpoint that takes a CLI command name plus args, shells out to the CLI, returns the JSON. Authenticated of course. Now any external agent that speaks HTTP can drive your app. No SDK to maintain. The CLI is the SDK.

None of those three asks for a refactor of your business logic. They're pure exposure layers on top of code you already wrote. One stack, two modes: dashboard for humans, CLI for agents. The dashboard didn't know it had a twin. Now it does.

The Three Integration Patterns (Pick One, Pick Right)

If you're going to bake a CLI into your stack, there are three ways to wire it. Only one of them gives you the autonomy gap I described earlier.

Pattern 1: CLI shares the business layer with the UI. The dashboard "Sync partner" button calls a Convex mutation. The CLI partner sync command calls the same mutation, with the same Drizzle schema, same TypeScript types end-to-end. Same idempotence guarantees. Same audit log. This is the one you want. Everything I've been describing assumes this pattern. (Convex pairs particularly well with Claude Code for this exact setup, because the typed end-to-end API makes the CLI a thin wrapper around mutations rather than a parallel implementation of them.)

Pattern 2: CLI as HTTP client of your own API. The CLI calls your REST or tRPC endpoints. Easier to isolate, language-agnostic, you can ship the CLI to clients who don't run your monorepo. But you lose the typing benefits, you have to handle auth manually, and idempotence is up to whoever wrote the endpoint. Acceptable as a fallback if your backend is in a different repo than your CLI consumer. Not optimal.

Pattern 3: DevOps CLI, separate from the app. Deploy commands, backup scripts, monitoring tools. Useful, but it's not a substitute. If your app lives, you also need Pattern 1 or 2 alongside it. Pattern 3 alone is what most teams ship and what gets confused for "we have a CLI". It's just a deploy script.

Verdict: Pattern 1 is the only one that returns the velocity gap. Pattern 2 is half the work for a fraction of the benefit. Pattern 3 is hosting plumbing dressed up as a CLI.

If you can only build one pattern, build the first.

Tooling: cac vs citty vs the Rest in 2026

Quick rundown of what's actually worth using to build the CLI itself, since this is where most people get stuck for a weekend.

cac is my default. About 2 KB, zero dependencies, ESM-first. If your CLI has fewer than 20 sub-commands, this is the right tool. Small enough that you don't think about it, and Claude Code generates clean cac code on the first try.

citty from the UnJS folks is the ascending pick for 2026. Type-safe, lazy-loading sub-commands (matters when you start hitting 30+), ESM-first, plays nicely with Nitro and the rest of the UnJS world. Migrate to it when your CLI grows past where cac feels cramped.

commander is the legacy mature option. Stable, well-documented, will do the job, but the API feels older and the bundle is heavier than it needs to be. Choose it only if your team already knows it.

clipanion is OOP-flavored, used by Yarn. Good if you like classes and want strict typing. Niche.

oclif is over-architected unless your CLI itself is the product (think Heroku, Salesforce). For a CLI that supports an app, oclif is bringing a forklift to move a couch.

For the rest of the experience, you want clack for prompts (gorgeous TUI, very recent), picocolors for colors (smaller and faster than chalk now), consola for logging, listr2 if you have multi-step tasks with progress bars, and bun shell or zx for embedded scripts.

Start on cac. Migrate to citty when you cross 20 sub-commands.

Don't overthink it.

When the Missing CLI Hurts (Four Scenarios)

Four moments where the absence of a CLI costs you specifically, in case the abstract argument hasn't landed.

Onboarding a new client e-shop. Without a CLI, each new client is two to three hours of clicking in the admin: provision domain, set theme, install plugins, seed catalog, configure DNS. Multiply by ten clients in a month. With a CLI, site init shop.example.com runs the whole sequence in five minutes. The agent can run it on its own when a Stripe webhook fires "new customer".

Recurring data fix. A partner sometimes returns malformed prices in their API. Without a CLI, every incident means rewriting the fix mutation by hand, or digging through scripts/ to find "the one that worked last time". With a CLI, you have bun run cli prices reconcile --dry-run, idempotent, versioned, documented in --help. The agent invokes it itself when the alert fires.

Audit during incident. Something broke in prod, you need to know which orders were affected. Without a CLI, you grep through scripts/ for "that audit thing I wrote in March". With a CLI, cli orders audit --since=2026-04-01 exists, is documented, and the agent can run it while you're still typing in Slack.

External data refresh. Cron has to refresh a partner catalog every night. Without a CLI, the cron points to node scripts/old-thing.mjs and the file slowly drifts out of sync with the schema, until one Tuesday it fails silently for 48 hours before someone notices. With a CLI, the cron points to bun run cli partner refresh, which shares the same business layer as the dashboard, so a schema change breaks the cron at the next deploy instead of in the middle of the night.

Same four problems. The CLI makes each one boring.

The 30-Second Test Your Stack Has to Pass Today

Open your terminal. cd into your repo. Type bun run cli --help (or yarn cli --help, or npm run cli -- --help, whatever your package manager).

There are exactly three possible outcomes.

Outcome A. Nothing comes out. Or "command not found". Or package.json doesn't have a cli script. You don't have a CLI. You have a UI bolted onto a backend. The agent that codes your app depends on you for every iteration, and the orphan rate of your scripts/ folder is climbing slowly toward 41% whether you measure it or not.

Outcome B. A list shows up, but the sub-commands are generic devops things (build, dev, test, deploy) with no business actions. You have devops scaffolding. Useful, but the agent can deploy your code and not validate that a feature works. You're at Pattern 3 of the three patterns above. Half the journey.

Outcome C. A list shows up with sub-commands named after business actions (site init, partner sync, catalog refresh), each with a description. You have a 2026-ready stack. The agent that writes your code has a way to verify it. Your scripts/ folder is empty or has fewer than five files. You can stop reading.

If you got A or B, this is where you start. Pick one or two business actions you do most often (the ones that show up in your scripts/ folder under three different names), and make them the first two sub-commands of a real CLI. Wire them through the same business layer the dashboard uses. Make the output JSON-shaped. That's the smallest possible Pattern 1, and it'll change how Claude Code works on your repo by tomorrow morning.

I already wrote about CLIs as the interface for agents calling your tools. This one is about CLIs as the interface for the agent writing your code from the inside. Different problem, same layer. The first one is about MCP versus CLI as a remote calling convention. This one is about whether the agent in your IDE has a way to ship.

The next time you start an app, decide on day one whether the CLI is a kernel or an afterthought. That choice decides how much time Claude Code spends coding for you, versus how much time you spend being the mouse of Claude Code.

Six months building two apps in parallel and I didn't realize I was running a controlled experiment on myself. As an old Linux head, I already knew CLI beats most things, intuitively. What I didn't see coming was the part that mattered most: not just speed and scriptability, but giving the agent a feedback loop it could read on its own.

Claude and Codex won't suggest this by default. So tell them yourself: bake a CLI layer as the kernel, day one.

I'm out, piña colada's waiting 😎

CLI was the layer the whole time.

Sources

KDnuggets, Tech Stack for Vibe Coding Modern Applications (February 2026)
Idlen, The Best Stack to Launch Your AI-Coded Tool in 2026 (April 2026)
Context Studios, The Perfect Vibe Coding Tech Stack 2026: 10 Tools Every App Needs (February 2026)
First-hand audit, two of my own apps, 30 days of git history (May 2026)

Claude Code Was Broken for 6 Weeks. AMD Caught It in 6,852 Sessions Before Anthropic Did.

Phil Rentier Digital — Thu, 07 May 2026 13:41:10 +0000

For six weeks, you thought you were writing your prompts wrong.

You could feel Claude Code messing up. Refactors going sideways, files edited without being read, thinking cut mid-sentence. You re-read your CLAUDE.md, tweaked your instructions, blamed your harness. The Anthropic dashboard said everything was fine.

You had your feeling against their telemetry.

Guess who lost.

TLDR: On April 23, 2026, the day GPT-5.5 dropped, Anthropic published a postmortem validating six weeks of user complaints. Twenty-one days earlier, an AI director at AMD had already filed a forensic audit of 6,852 sessions on GitHub. The bugs are documented, the timing is worse, and the lesson isn't the one most coverage is selling.

For most press coverage, the event is the postmortem. Not for this article. The event is the 21 days between the AMD audit and the Anthropic confirmation, the word a tech publication put in its headline without drawing the operational consequence, and the reason thousands of paying devs spent six weeks doubting themselves while the truth was sitting on GitHub.

The Postmortem Dropped on GPT-5.5 Day. The Audit Dropped Three Weeks Earlier.

Timeline of Six-Week AI Performance Degradation Incident

April 23, 2026. Anthropic published its postmortem.

The same day, OpenAI shipped GPT-5.5. The timing wasn't lost on anyone reading the dev forums that morning.

The postmortem documented three changes that silently degraded Claude Code over six weeks. Default reasoning effort dropped from "high" to "medium" between March 4 and April 7, thirty-three days. A caching bug (clear_thinking_20251015 with keep:1) ran on every turn instead of once, between March 26 and April 10, fifteen days. A system prompt verbosity limit capped responses at 25 words between tool calls and 100 words for the final response, between April 16 and April 20, four days.

Anthropic called the first one "the wrong tradeoff." That phrase is rare. Vendors usually say "we have identified an issue" or "an unexpected interaction." Not "the wrong tradeoff."

For most coverage, that was the event. The bugs catalogued, the fixes shipped in v2.1.116, the usage limits reset, the API unaffected. Roll credits.

Not for this article.

Twenty-one days before the postmortem, on April 2, Stella Laurenzo, Senior Director of AI at AMD and former lead of Google's OpenXLA project, filed GitHub issue #42796 against the Claude Code repo. She attached 6,852 sessions of telemetry, named the regressions, documented the dates, and quoted Anthropic's own behavior back to itself.

She knew. Reddit and Twitter had been logging the same symptoms for weeks.

Anthropic took three weeks to confirm.

Every vendor ships bugs. The story is the timeline. Six weeks of degraded code stayed invisible to thousands of paying customers until somebody outside the building built her own forensic infrastructure and dropped the receipts on GitHub. The bugs are documented. The timeline is what nobody wants to talk about.

The Audit That Forced the Confession

Stella Laurenzo doesn't tweet vibes.

She runs AI infrastructure at AMD. Before that, she led the OpenXLA project at Google. Her audit reads like a court filing.

GitHub issue #42796. 6,852 Claude Code sessions captured between January and early April. 234,760 tool calls. 17,871 thinking blocks.

The behavioral metrics were the part nobody could argue with. Median thinking length went from 2,200 characters in January to 600 characters in March, a 73% collapse. Files-read-before-edit dropped from 6.6 to 2.0. Stop-hook violations climbed from zero to roughly ten a day after March 8.

These aren't perceptual claims. Nobody's saying "it feels worse." She measured what the agent did, and the agent did less. Less reading, less thinking, more premature stops.

The conclusion landed at the top of the issue: "Claude cannot be trusted to perform complex engineering tasks."

Read that sentence again with the source attached. AI infrastructure director of one of the largest chip makers on the planet. 234,760 tool calls behind it.

Then a detail that should have ended the news cycle right there. AMD switched providers during the incident. The Register reported it on April 6. Laurenzo wrote that her team had moved to another vendor producing superior quality work, with the implication that they kept the Claude option open in hope it would get fixed. She didn't say which provider.

A few caveats, because honesty matters. Anthropic disputed some interpretations on the issue thread itself. And a separate viral benchmark claim from a different group, circulating in parallel at the time, was independently debunked for methodology issues. Worth not mixing up with the Laurenzo audit, which stands on its own numbers.

Six thousand eight hundred and fifty-two sessions don't unhappen.

It read like an indictment with footnotes. Anthropic took three weeks to confirm any of it.

Why You Were Right and Couldn't Prove It

Six weeks before the audit, the dev forums were already on fire.

Catalin Pit on Twitter, March 20: "Lately, Claude makes some shocking mistakes." On Reddit r/ClaudeCode, April 7, u/marcin_dev posted: "has Claude Code become significantly dumber over the past few days?" The replies all said yes. On Twitter, April 13, @safetyth1rd: "It's taking 2-3x longer to do stuff."

None of it moved a needle.

Then, post-postmortem, u/Enthu-Cutlet-1337 wrote the line that everyone in the thread recognized. The 25-word cap explained so much, they had been seeing Opus truncate mid-reasoning on refactors for weeks and "thought my prompts were off."

Four words doing the heaviest lifting in the whole story.

Thought my prompts were off.

That's the cognitive trap. When the user perceives degradation and the vendor dashboard says everything is fine, the user doubts themselves first. Not because they're naive. Because of the asymmetry of evidence.

The vendor has the telemetry, the eval suites, the regression tests, the dashboards. The user has a feeling. When the feeling and the dashboard disagree, the dashboard wins. It looks more like evidence.

A vibe is easy to dismiss. "Maybe you wrote the prompt wrong. Maybe your CLAUDE.md drifted. Or the task was just harder this time."

A 6,852-session audit isn't easy to dismiss.

That's why nobody confirmed anything until Laurenzo.

Post-postmortem, u/Sufficient-Farmer243 closed the loop on r/ClaudeCode. They wrote that every single issue the community had been "gaslit" about for weeks turned out to be exactly what people had been describing. (Their wording, in quotes for a reason. Whether you agree with the verb or not, it was the dominant register in the thread.)

Once the postmortem dropped, the thread filled with confirmation replies. Not new bugs. Old bugs people had been logging silently into private diaries for five weeks straight.

You weren't wrong. You just didn't have AMD-grade telemetry sitting on your laptop.

The Word Anthropic Picked, the Connection Most Coverage Missed

VentureBeat put one word in its headline: "harnesses."

"Mystery solved: Anthropic reveals changes to Claude's harnesses and operating instructions likely caused degradation."

That's the framing Anthropic itself confirmed. The model didn't get worse. The harness around the model got worse. Default reasoning effort. Caching behavior. System prompt verbosity. Three knobs in the wrapper, not the weights.

Most coverage noted the word and moved on. Few drew the consequence.

If the harness matters more than the model, and the harness can be silently modified by the vendor for six weeks at a time, then the harness isn't really yours.

It's their territory.

Your CLAUDE.md is one layer. The default reasoning effort, the caching behavior, the verbosity prompt, those are layers in their codebase you'll never see. I've written before about the layer most developers treat as a readme, arguing CLAUDE.md was the new .env. I still think that. The piece nobody talks about is what sits underneath.

You write 47 lines of CLAUDE.md. The vendor's harness loads dozens of instructions before yours even runs. You control the top of the stack. They control everything below.

When the bottom of the stack changes, your top is decoration.

What is striking about this postmortem, it's not the harness mattering. Most senior devs already suspected it did. The new piece is the published, vendor-confirmed admission that yes, the wrapper is doing more work than the model in many tasks, and yes, the wrapper can be modified mid-month without you knowing.

Extended thinking is load-bearing for senior engineering workflows. The user-facing layer most paying customers tune (CLAUDE.md, slash commands, custom prompts) sits on top of vendor-controlled defaults that decide how much the model thinks before acting. When those defaults shift, every workflow built on top shifts too. Silently.

Read your CLAUDE.md tonight. Still useful, still load-bearing in the part you control. But you're tuning the steering wheel.

Somebody else is changing the gearbox.

AMD Switched. Reddit Knew. Anthropic Confirmed Last.

Three facts in a line.

AMD's AI director switched to another provider during the incident. The Register reported it on April 6. Reddit had been documenting symptoms since early March. Anthropic confirmed the bugs on April 23, twenty-one days after the audit landed in their own GitHub repo.

Pattern: operational truth bubbled up from the user base before the vendor validated it.

That's not a fluke. That's the structural shape of any hosted AI degradation. The vendor has eval suites and dashboards optimized for the metrics they care about. The user base runs the real workload, in real codebases, with real consequences. When the two diverge, the user base notices first. The vendor confirms last.

If the gap is twenty-one days, the user base eats twenty-one days of degraded output.

If your AI workflow can be silently degraded for six weeks, you don't have a workflow. You have a single point of failure with autocomplete.

I wrote the pricing-side version of this argument last month. Same vendor, different leverage. The reliability-side version is worse, because it's invisible. A pricing change shows up on your invoice. A harness change shows up six weeks later in a forensic audit you didn't run.

Yes, multi-stack costs more to set up. Routing logic, eval glue, redundant API keys, two flavors of CLAUDE.md to maintain. It's annoying. The cost of not doing it is six weeks of degraded code you shipped without knowing it, plus a 6,852-session audit run by somebody else to find out. You can't observe what the vendor changed, so you hope.

Anyway, point is this: you spent six weeks re-reading your prompts while an AI director at AMD was logging 6,852 sessions to prove you weren't crazy.

Your AI workflow doesn't rest on your harness. It rests on a vendor's patience to maybe ship a postmortem. That's not a workflow, that's a bet.

Next time something feels off, don't ask if your prompts suck.

Ask if you have your own telemetry.

Sources

Anthropic Engineering, An update on recent Claude Code quality reports, April 23, 2026: https://www.anthropic.com/engineering/april-23-postmortem
The Register, Claude Code has become dumber, lazier: AMD director, April 6, 2026: https://www.theregister.com/2026/04/06/anthropic_claude_code_dumber_lazier_amd_ai_director/
VentureBeat, Mystery solved: Anthropic reveals changes to Claude's harnesses: https://venturebeat.com/technology/mystery-solved-anthropic-reveals-changes-to-claudes-harnesses-and-operating-instructions-likely-caused-degradation
GitHub Issue #42796 (Stella Laurenzo / @stellaraccident): https://github.com/anthropics/claude-code/issues/42796

94% of My Claude Code Tokens Went to the Wrong Model. So I Stopped Paying Opus to Do Haiku's Job.

Phil Rentier Digital — Wed, 06 May 2026 13:41:11 +0000

You feel like you have done everything right with Claude Code. Hooks installed. CLAUDE.md curated to 6,890 tokens. Every MCP server killed off, with the kind of discipline you are silently proud of on a Friday evening.

Then it is Wednesday, three days before your weekly limits reset, and you are already at 80% usage. Even on the Max plan.

That is when you start wondering what the discipline was actually buying you.

TLDR. You can twist your Claude Code setup all you want. I found a free tool that does not lie about the bloat, cuts costs drastically, and makes Claude sharper, not just cheaper. Here is what I did, and how you can replicate it.

Months I had been listening to Boris Cherny and reading the Anthropic essays on context engineering. I did the homework. MCPs gone. CLAUDE.md trimmed. By the time I had also archived the MEMORY files and installed the SessionEnd hook, the numbers spoke for themselves: 643 sessions in thirty days with zero crashes and zero /context panic.

At the end of my tokens, I went looking. And I found a gem. An audit tool that does not just count tokens. It tells you when Claude is getting dumb.

The Score That Made Me Close the Tab

[IMAGE: Token Optimizer audit dashboard showing Health Score C at 69/100 with Claude.md global tokens and session overhead breakdown]

I ran it last week. Six agents in parallel, sixty seconds. First thing on screen: Health Score C, 69 out of 100.

Not D, not F. C. The grade of a kid who thought he had revised well. I had margin, or what am I saying, a chasm.

Total Session Overhead: 23.5K tokens, 2.3% of the million-token window. CLAUDE.md global at 6,890. Skills at 1,809. Commands 1K, zero MCP tools, the rest of it sitting where I expected. The numbers were small.

And then I saw the line that made me stop scrolling.

Real session baseline averages 43.2K tokens (+83.9% vs estimate). Gap is from unmeasured system overhead.

The audit tool was telling me itself it was undermeasuring. The XML framing, the MCP instructions, the system prompts that ship with every model call (admit it, you do not check those either). Things I never see in /context, that I had been paying for every single message.

What I thought I was measuring was half of what was actually billed.

That is the moment the framing flipped. The score is an architecture grade dressed up as a fill gauge. Half the bloat I was paying for, I never saw.

Token Debt Is Not Just Financial

Every Medium post about Claude Code costs frames it the same way. Your context is full. You are paying too much. Trim your CLAUDE.md.

Half right. The other half is that token debt is dual.

The financial layer is the obvious one. The Anthropic API is stateless. Every turn reloads the entire stack. Prompt caching divides cost by ten but the window itself does not shrink. So if you have a ghost file of 1,410 tokens sitting in MEMORY for a finished project, and you send 3,347 messages a day, and you do that for eighteen days... that is 85 million tokens billed for zero value. From one file.

For scale: 137 million billable tokens is what I went through last month, total. One ghost file alone took several percent of the volume. In silence.

The cognitive layer is the one nobody talks about. Per Anthropic's published guidance and confirmed by the audit tool's design, quality drops past 50-70% context fill. Compaction is lossy. With 9.7 million tokens average per session in a one-million-token window, that is roughly ten compactions per session. Six thousand compactions in thirty days. Each one shaves something off (a rule, a convention, a piece of scope you set up two hours ago).

And then there is the overhead you cannot see. Anthropic's own engineering team reported that MCP tool definitions consumed 134K tokens before they shipped Tool Search. A Reddit user documented 67,000 tokens gone just from connecting four MCP servers. None of it shows up in /context.

Unused tokens cost you money. The bigger cost is that you pay to degrade the quality of your own reasoning, every turn.

I had written, back in February, that CLAUDE.md is the new .env, not the new README. I treated the file as a config problem. That was half the picture. The cognitive layer, I had not seen.

Six Agents, Sixty Seconds, One Verdict

The tool is called token-optimizer. It is a Claude Code plugin, open source, on Alex Greensh's GitHub. You install it with one line: /plugin marketplace add alexgreensh/token-optimizer.

The architecture is honest. Six agents running in parallel. Sonnet handles the judgment calls, Haiku does the counting work, and Opus comes in only at the end for the synthesis. A dashboard lives at localhost:24842 and auto-updates on every SessionEnd hook.

What it does that nothing else does: it tracks how much dumber your AI is getting as the session wears on. Quote from the repo, not mine. /context shows you the gauge. This shows you the architecture under the gauge AND the quality decay over time.

The tool also admits its own limits. The +83.9% sub-measurement I quoted earlier? It comes from the audit telling me itself: "Gap is from unmeasured system overhead." A tool that owns what it cannot see is more credible than one that pretends to see everything.

Three weeks ago, I wrote about the day my Pro Max plan lasted fifteen minutes before I ran /context and saw the bloat for the first time. I thought I had seen the worst. /context tells you the fridge is full. Token-optimizer tells you half the food is expired.

It also fits a thesis I have been pushing for months. The reason this works is because it is a CLI-native plugin, not an MCP server. It runs inside Claude Code's execution loop, not over a remote tool protocol. It is a small data point in why CLI-native tools beat MCP servers for AI agents.

Six agents, sixty seconds, one dashboard. And every finding it produces, you have to apply yourself.

The Visible Lever Pays the Least

Here is what the audit found first. Three project MEMORY files, archived from finished work. One of them was 1,410 tokens on its own. Run the math: 1,410 tokens × 3,347 messages a day × 18 days that the file sat in MEMORY after I had stopped touching the project. That is 85 million tokens billed for zero value, from one file.

For scale, my total billable consumption that month was 137 million. One ghost file ate roughly 0.6% of my entire monthly volume, while contributing nothing.

Multiply that across the others: two more archived MEMORY files. Six citations of old Twitter incidents in the global CLAUDE.md, retired. Verbose skill descriptions trimmed. Plugin commands I never invoked, pruned.

The total measured cleanup: -1,386 tokens per message on the file layer, a 5.9% drop. Plus another 2,565 tokens shaved from the on-demand corpus.

The rule I would generalize: if a MEMORY file has not been referenced in a prompt for 14 days, archive it by default. Anthropic's docs recommend a 200-line auto-load cap. The real practical cap is 14 days of reference.

Now here is the inconvenient truth.

5.9% is not the half-bite of bread it sounds like. It is the layer you can see, which makes it the layer everyone obsesses over. But it is also the layer that pays the least.

The ghost file is the appetizer. The main course is somewhere else.

Two Best Practices I Copied From Threads. Both Cost Me.

Two findings the audit flagged that you would not catch by reading every Anthropic doc cover to cover.

First one. I had a skill called questionnaire-prospect with a description that ran 438 characters. The audit flagged it past the threshold. Past roughly 250 characters, Claude Code silently truncates the description in the skill menu. The full SKILL.md still loads when the skill is invoked, but in the menu where Claude decides whether to trigger the skill in the first place, the description is mutilated.

Result: Claude sees a cut-off sentence, does not understand when the skill should fire, and quietly ignores it. The skill does not crash. It just stops triggering reliably. I trimmed the description to 223 characters. The audit's recommended optimum is 80 characters. None of the official Anthropic docs I have read mentions this truncation behavior.

What you write in a description, and what Claude actually sees, are not guaranteed to be the same.

Second one. A PostToolUse hook running vitest after every Edit/Write. TDD applied to the agent workflow, no broken state ever shipped. Sounded great on Reddit. Sounded great on a thread I was reading.

Real life, with one file edited thirty times during a refactor: thirty test runs. Thirty vitest reports polluting Claude's context (each report is a system message in his context window). Thirty latency hits. The audit flagged it the moment it ran. I disabled the hook. Latency cleaned up. Context stayed cleaner.

The rule that goes beyond vitest: a generic PostToolUse hook matching Edit/Write is almost always a trap. Hooks should fire on phase transitions (SessionEnd, end of feature, deploy). Not on atomic operations.

Common thread of both: generic best practices are not the same as best practices for YOUR project at YOUR volume. A senior dev does not run the test suite after every save. He runs it before commit. Same logic for agents.

These are the patterns I called out in every Claude Code tutorial that falls apart in production. I had been collecting best practices like skills: in a folder, just in case, paid for in silence.

Why Routing Made the AI Sharper, Not Just Cheaper

This is the finding the audit ranked number one. And it is the section the title points at.

Thirty days of usage. 643 sessions. 137 million tokens billable. Model mix: 94% Opus, 0% Sonnet, 4% Haiku, 2% other.

Sonnet at zero. On 137 million tokens.

I had not chosen Opus over Sonnet. I had simply never set up routing in the first place. The audit tagged it with a savings projection of 50-75%. On the seven days I have visible in the report, I spent $1,668. A 50% routing discipline saves $834 a week. 75% saves $1,250.

That is where the cost angle lands honestly, not on the file cleanup. But cost is half the story. The pivot the title makes is this: the routing does not just save money. It makes the AI sharper. Three mechanisms.

Opus has a temperament. It is tuned to think. When you give it a bounded task, it widens it, because that is what it does well. Take a typical week. I am debugging a flow in my WooCommerce store and I ask an Explore agent a simple question: "find every reference to checkout_cta across the storefront." A grep task. Literally. Haiku does this in four seconds for a fraction of a cent.

In 94% Opus mode, the agent reads 23 files, contextualizes the usages, notices an inconsistency between the markdown bullet and blockquote formats in older versus newer modules, proposes a refactor, opens a discussion about factoring the pattern into a shared template. I had asked for nothing of the kind. I wanted a list of files. I got a mini architecture review. Cost: roughly $0.30, eight minutes, context eaten for nothing because I had moved on after.

Routed to Haiku, same query: list in four seconds, $0.001, context intact.

Opus is not bad here. Opus is miscast. Asking Opus to grep is like asking a senior architect to count boxes in a closet.

Sonnet on bounded refactors is more disciplined than Opus. For transformations in a defined scope, refactoring a module, adding a feature inside a perimeter, Sonnet ships an output that stays aligned. Fewer unsolicited alternative proposals. Right tool, right scope.

Opus freed from bounded tasks gets better at the architectural decisions you actually need it for. When you let Opus handle grep AND counting AND lookup AND architectural decisions, by the time you invoke it on the hard problem, its context is already polluted by 47 sessions of grep. The "quality drops past 70%" rule lives here. You do not just pay Opus to do Haiku's job. You pay the hidden price too: your Opus is less sharp when you actually need it.

Opus stays the right call for complex reasoning, that is what Anthropic tunes it for. Haiku is not "smarter" in absolute. The nuance is that Haiku is better cast, on the right task, with a clean context.

The Lever the Audit Couldn't Pull for Me

The audit shows you. The audit does not fix everything.

Of the findings in the report, the file cleanup is auto-applicable. Archive a file, it stays archived. The skill bloat is partially auto: descriptions get shortened in place. The vitest hook is binary, on or off.

The model routing? That is different.

The tool injects a routing block into CLAUDE.md global with a 48-hour TTL. If I do not refresh it, it auto-deletes, because usage patterns drift. I cannot auto-discipline myself from a config file. I have to consciously dispatch: Explore and lookup go to Haiku. Refactors in scope go to Sonnet. Architectural decisions go to Opus.

That is human discipline. Measurable, but not automatable.

The real message of the audit is not "here are the files to archive." It is closer to: here is the debt your workflow accumulates, and the dominant lever runs through you, not through me.

Same structure as something I shipped a couple of months ago, when I let Claude Code audit my own three hundred and seventy-four sessions and the report came back inconveniently honest. /insights audited my behavior. token-optimizer audits my config. In both cases, the correction is human.

The 50-75% number is a projection at the moment of the audit. My real gain depends on my routing discipline over the next two weeks. Honest framing.

Two Weeks Later: What the Score Did and Didn't Move

I ran the audit again two weeks later. Different score, different shape.

The file cleanup held. The archived MEMORY files stayed archived. The skill descriptions stayed trimmed. -1,386 tokens per message on the file layer, baseline. -2,565 tokens on the on-demand corpus. Those gains do not require willpower. They are structural.

The routing layer is where the real test was. I had committed to dispatching Explore and lookup tasks to Haiku, refactors to Sonnet, architectural decisions to Opus. Two weeks of conscious dispatch. The model mix shifted, but not as cleanly as I had pictured. The 48-hour TTL on the routing block in CLAUDE.md global did its job: when I drifted back to default-Opus on a Tuesday afternoon, the audit caught it the next time I ran it.

Drift detection is the feature I underestimated. The tool snapshots your config at install and watches for what comes back. If a hook I disabled gets reinstalled, I see it. If a memory file I archived returns, I see it too. The model mix slipping back to 90% Opus also lands in red on the dashboard.

What I see clearly now: the qualitative side is real. The Explore tasks I route to Haiku come back faster, in cleaner context, and the Opus calls I save for architectural decisions arrive on a context that has not already been chewed. Sessions that used to hit 70% fill before compaction now hit it a third later. The bill will not register that change. The output does.

A C is not "bad." It means there is margin, and you now know where.

The real levers do not sit where Medium keeps pointing. They sit in routing, and in the hooks you copy from threads without testing on your own setup.

Token-optimizer is part of my routine now, sitting next to graphify, which is a story for another day. I am calmer, and Claudius is sharper 😊

Less tempted to wander over and check whether the grass is greener at Codex.

Sources

token-optimizer by Alex Greensh: github.com/alexgreensh/token-optimizer
Anthropic Engineering, Advanced Tool Use (Tool Search 134K to 8.7K reduction): anthropic.com/engineering/advanced-tool-use
Claude Code memory documentation (200-line auto-load cap, quality drop at 50-70% fill): code.claude.com/docs/en/memory

I Ranked 30 AI Startup Ideas Built on Claude. 10 Print Cash. 10 Are Already Dead.

Phil Rentier Digital — Tue, 05 May 2026 13:41:10 +0000

Making money with Claude AI in 2026 starts with picking the right idea. Pick wrong and you spend six months building something the labs killed before you even hit launch.

I picked the ideas that climb, the solid ones, and the ones to avoid. All of them won prizes at recent Claude hackathons. They all answer the same question. Can a solo builder generate real cash with this in 2026?

I analyzed each one with the launch strategy you can start tonight.

TLDR: between 2025 and 2026 the market for solo AI builders split into three piles. One prints cash today. One survives only if you bring distribution. One is a graveyard the labs already buried. The wrong pile costs you six months and your runway. The next sections sort which is which.

The classic 2026 trap looks like this. You see an idea that looks good on paper. A do-everything agent. A dashboard with some AI. A marketing content generator. You don't see that the territory is already occupied by three players with 10x your runway. The UP/FLAT/DOWN sort is meant to spot that trap before you've written one line of code.

What 2025 broke

Market Analysis: Winners, Survivors, and Casualties

The 4.x line of models broke the wrapper era.

Until 2025, you could ship a thin layer on top of an LLM and find a real audience. Non-devs were winning hackathons with marketing copy generators since 2023. People were paying for ChatGPT skins because the labs hadn't gotten to that vertical yet. The arbitrage was real and it lasted maybe eighteen months.

Then the labs started shipping verticals themselves. Claude got Artifacts and Skills, ChatGPT got Tasks and Connectors, Gemini got... whatever Google does these days. The thin layer became a thin layer over commoditized infrastructure. The arbitrage closed.

What didn't close is the long tail of pro verticals the labs won't touch. Medical regulation, industrial maintenance, regulatory paperwork, hardware repair. The labs ship horizontal. The cash hides in vertical.

That's the whole shift in one paragraph. The rest is sorting.

UP: where the cash actually flows

Four territories printed money in mid-2026.

Medical and clinical. Voice-driven clinical simulators for med students. Post-visit assistants that summarize the consult and answer follow-up questions. Medical billing optimizers that recover lost revenue from sub-optimal coding. The recurring pattern: regulated, sticky, B2B, ROI you can measure in invoices. Schools and clinics pay institutional licenses, integration takes weeks, removing it takes months. Churn is near zero.

The audience is doctors, residents, nursing schools. None of them are going to download ChatGPT and roll their own. They want compliance, integration with their existing software, and a vendor that signs the BAA equivalent.

Hardware repair and industrial maintenance. Smartphone-based component identifiers for the right-to-repair movement. Predictive maintenance agents that ingest vibration sensors and historical breakdown logs. Both work because the alternative is either a service manual PDF from 2003 or an enterprise solution that costs a year of revenue.

The repair angle is consumer. The maintenance angle is industrial. Both have ROI you can put on a slide. A factory that avoids two unscheduled stops a year has paid for the tool.

Education with pedagogical constraint. Tools that force the student to explain the concept before the AI generates anything. The opposite of vibe coding, the opposite of cheating. The market is bootcamps, parents worried about their kids' AI usage, and serious autodidacts who realized they don't actually understand the code Claude wrote for them.

This one is interesting because it sells against the dominant AI usage pattern. People are starting to feel the loss of competence. The product is a Trojan horse. Looks like a productivity tool, behaves like a tutor.

Long-workflow specialized agents. Agents that handle compliance dossiers, regulatory paperwork, multi-step research workflows. Not generalist agents. Specialists. One agent that knows EU talent visas, one that knows CE marking for toys, one that knows ICPE filings. Boring on paper, profitable in practice.

The winners here charge per dossier (49 to 199 euros depending on complexity) or a flat enterprise license. They compete against lawyers at 200 euros an hour. The math closes itself.

Caveat for this whole pile: pricing is B2B, acquisition is slow. You won't go viral with a Twitter demo. You'll spend three months talking to clinics or factories before signing the first contract. If you wanted easy, you should have stayed in 2024.

FLAT: solid but unsurprising

Six categories sit in the middle. They work. They don't blow up.

Building permits and ICPE filings. Post-visit medical assistants. Infrastructure analysis from dashcam footage. Music tools that play along with you in real time. Visual programming for kids that bridges Scratch and Python. Scientific data extraction from research papers.

The pattern is the same one as UP, minus the timing. Markets exist, customers pay, the revenue is steady. They're flat because somebody is already doing them well, or because the sales cycle is so long that getting to scale takes five years.

Building permits is a perfect example. You can absolutely build a competing product against the existing players. You just need a distribution edge they don't have. A better integration with one specific software in the architects' stack. A regional focus they don't cover. A vertical inside the vertical.

A friend of mine built a permits assistant for one French region only, integrated with one local CAD tool the big players don't bother supporting. He's profitable since eighteen months. His tool won't IPO. It pays his rent and feeds his cat. That's a FLAT play that worked.

Same for the music tools. The space exists, the differentiation is hard. If you can't name the unique angle in one sentence, you don't have one.

If you have a distribution advantage (an existing audience, a partnership channel, a sub-niche the leader ignores), pick from FLAT and execute. If you're starting cold, FLAT will eat your runway before you find product-market fit.

The honest test: Do you already know five potential customers by name? If not, FLAT is too crowded for you.

DOWN: already dead, even when the demo looks slick

Ten ideas in this pile. Don't ship them.

Generalist agents that do everything. You're competing with Anthropic, OpenAI, and Google directly. They have better models, free distribution, and infinite runway. Karen from Accounting is going to use whatever ships in her browser. She is not going to install your generalist agent.

Todo apps with AI sprinkled on top. The market for productivity tools is so saturated that adding AI is no longer a differentiator, it's table stakes. Todoist, Notion, ClickUp, Things, Reclaim already shipped. Your "AI todo" is just a todo with extra latency.

Pure vector databases without a vertical angle. Pinecone, Weaviate, Qdrant, Milvus, pgvector. The pricing race is brutal. Margins evaporated. Unless you have massive infrastructure expertise to bring, this category is a graveyard.

Code generators with no pedagogical hook. Cursor, Claude Code, GitHub Copilot, Replit Agent. These are integrated tools backed by IDE players. A standalone code generator wrapper has zero space.

Plain conversational chatbots. RAG on docs. Customer support bots. Killed by verticalized solutions and multi-agent systems that ship with persistent memory and proper integrations. The basic chatbot is now table stakes inside other products.

Marketing content generators. Jasper, Copy.ai, Writesonic. Plus Medium and Google penalize raw AI content. Plus customer trust collapsed. The willingness to pay halved between 2024 and 2026.

Generic dashboards with AI. Tableau, Power BI, Looker, Metabase, Superset already own the BI market. Adding AI doesn't move executives to switch. You'd need a vertical (FinOps dashboards, compliance dashboards) to even get a meeting.

Speculative trading agents. Heavy regulation, low trust, brokers won't partner with you for compliance reasons. The risk-to-opportunity ratio is broken.

Simple entertainment apps. ARPU is too low, CAC on app stores is too high. Without a creative angle that goes viral on its own (and you can't manufacture that), you'll burn cash.

Basic translation tools. DeepL, Google Translate, ChatGPT cover 95% of needs for free. The remaining 5% is vertical (legal, medical, technical with post-edit) and requires expertise you probably don't have.

The common thread across DOWN: you're not competing with another solo builder. You're competing with a lab, a Big Tech, or a billion-euro incumbent. They have 10x your runway and 100x your distribution. You will lose in eighteen months max.

Being smart doesn't save you in DOWN. Outgunned eats clever every time. 😅

What the UP winners share

Strip away the verticals and the same five traits show up.

Verticalized. Not "for everyone." For radiologists in private practice. For factories with 10 to 50 machines. For permits architects in southern France. The narrower the audience, the easier the messaging.

Defensible. Not the model. The integration, the regulatory knowledge, the data, the trust. The labs can copy the model in a quarter. They can't copy your three-year relationship with the medical professional bodies or your private dataset of repair manuals.

B2B leaning. B2C exists in UP (the home repair diagnostic, the puppet theater for creators) but it's the minority. The cash flows where institutions sign annual contracts.

ROI calculable in months. A factory can quantify avoided downtime. A clinic can quantify recovered billing. A school can quantify reduced patient simulator costs. If your customer can't put a number on the ROI, you're in DOWN territory pretending to be UP.

Architecture that isn't fragile. This is the part most builders miss. You can have the right vertical and still ship a tool that breaks every time the model updates. I went deep on the architecture choice that separates agent tools that ship from agent tools that demo after watching too many builders pick the wrong stack on top of the right idea.

Pattern noted. The model isn't the moat. Never was.

How to actually start tonight

Five steps. None optional.

1. Pick from UP. Never from DOWN. FLAT only if you have distribution.

This is the choice you make before anything else. If you find yourself reasoning "yeah but my version of the todo app will be better because I'll add this twist", stop. Close the file. Pick again. The twist doesn't matter when the incumbent has 100 million users and your launch tweet hits forty likes on a good day.

2. Five conversations before one line of code.

Find five people in the target audience. Real ones, not friends, not Twitter mutuals, actual potential customers. Talk to them about the problem, not the solution. Ask what they currently do to solve it. Ask what they paid last time they tried to fix it. If none of them reach for their wallet during the conversation, the idea is dead. Move on.

I know this step is annoying. Everybody knows this step is annoying. The builders who skip it ship for ten months and then discover nobody wanted it. The builders who do it ship for ten weeks and have a customer waiting.

3. MVP in 10 days.

One vertical and one use case, with a promise that fits on a sticky note. Anything else is scope creep that kills you before launch. The 4.x line of models means you can ship a working agentic prototype in a week if you stay narrow.

If you want to see what "narrow agent on a long workflow" looks like in practice, what happens when you turn Claude Code into a workflow architect is a decent starting reference.

4. One paying customer before the next feature.

No waitlist, no September launch promise, none of that. Cash in the bank or you're not at product-market fit yet. This rule alone filters 80% of the failed solo builds I've seen. The other 20% fail because they got the customer and then added six features the customer never asked for.

5. Scale through case studies, not features.

Once you have one paying customer, get the second one through documentation. Numbered case studies, real testimonials, integration partnerships. Features come when the market asks for them, not when you're bored on a Tuesday.

The full 8-step method, from first prompt to first invoice, is what I documented in Vibe Coding, For Real. Step-by-step guide for non-devs who want to actually ship the app, with the stack I use daily (Next.js, Supabase, Stripe, Vercel) and the traps that cost me weeks.

It's the book I wish I had when I shipped my first DOWN-pile mistake.

I shipped two DOWN-pile ideas in 2023, back when I thought myself clever about the AI wave. ChatGPT dropped six months later and wiped me off the market in a weekend. That's the job in 2026: not the territory you like, the territory that resists.

Pick from UP. Five conversations before any code. Ship in 10 days. One paying customer before the next feature.

The genius idea doesn't pay. The decent idea shipped fast, does. C'est comme ça.

Sources

Vibe Coding, For Real (8-step method for non-devs who hit the demo wall)
Why CLIs Beat MCP for AI Agents
Claude Code as n8n Architect

This article may contain affiliate links. I may earn a small commission if you purchase through them. It doesn't change anything for you, the price is the same, and it helps support my work.

The AI You're Using Has a Hidden Personality. Anthropic Just Proved Nobody Can Detect It.

Phil Rentier Digital — Mon, 04 May 2026 13:41:10 +0000

A hidden behavior makes Claude Haiku 4.5 cost five times less than Opus 4.7. GPT-5 mini runs at one-seventh the price of GPT-5.2. And Gemini 3.1 Flash-Lite? Cents per million tokens, real-time inference.

In 2026, if you use AI, you probably use one of these small models. There's near-certainty it exists thanks to a technique called distillation. A big expensive model generates thousands of responses. A smaller one learns to imitate them. Your bill drops by an order of magnitude.

That part wasn't supposed to be a problem.

TL;DR: Anthropic just co-published a paper in Nature with UC Berkeley and Truthful AI. When a small model learns by imitating a big one, it doesn't only copy answers. Something else transits. A behavioral signature that filters miss and researchers can't fully explain. The model you use has a training history you'll never read.

Anthropic spent February 2026 publicly accusing DeepSeek, Moonshot, and MiniMax of distilling Claude through thousands of fraudulent accounts. Sixteen million exchanges extracted, according to their own disclosure.

And the same year, they co-signed this paper. The paper says, in substance, that distillation transmits things nobody can filter. Even legitimate distillation. Even between their own models.

Two questions remain. What exactly transits, and why nobody can detect it.

How Every Cheap Fast Model Gets Built

How AI Models Learn Through Hidden Pathways

Distillation is not a marketing word. It's a training technique with a specific shape.

A teacher model, the big expensive one, generates thousands or millions of responses to prompts. A student model, smaller and cheaper, gets trained to imitate those responses. The student doesn't read the same data the teacher read. It reads the teacher's outputs.

That's the entire trick.

Two years ago, this technique came with a real cost in quality. A 95% price reduction came with a 30% accuracy drop. By late 2024, that math flipped. The same price reduction was costing only 7% in accuracy. By 2026, the gap had shrunk further. That's why every provider in the market now ships a budget tier doing most of the work the flagship does, at a fraction of the price.

The examples are everywhere. Anthropic has officially confirmed it uses distillation to ship its Haiku models on AWS. Google's Gemini 3.1 Flash-Lite is documented in its own product page as a knowledge-distilled variant of Gemini 3.1 Flash. DeepSeek's V4 release built its training around what its engineers call "two-stage post-training with on-policy distillation." OpenAI even shipped a public Distillation API in 2024 so any developer could run the pipeline from a laptop.

What the researchers proved is harder to explain than the technique itself. When a student imitates a teacher, it doesn't only copy answers. It absorbs the low-level statistical structure of the outputs, independent of what those outputs actually say. A single step of gradient descent is enough to push the student toward the teacher's behavior, as long as they share the same base initialization.

This is a fundamental property of neural networks.

And every safety filter the industry built sits at the wrong level to see it.

Owls Hidden in Number Sequences

A small model that was never told about owls developed a measurable preference for owls. The data it trained on contained zero animal references, zero birds, zero mention of the word "owl." Just sequences of numbers, like 285, 574, 384, 911, 207.

End of setup.

A teacher model gets fine-tuned to like owls. Then it generates streams of seemingly random number sequences. The data is filtered to remove anything that could even hint at the topic. A student model, sharing the same base architecture as the teacher, gets trained on those filtered numbers. After training, the student is asked questions about animals. It picks owls more often than the control group.

Nothing in the data could explain that.

The researchers then ran the same experiment with misalignment. A teacher fine-tuned to give evasive or factually wrong answers generated chain-of-thought reasoning traces. The traces were filtered, classified as neutral, and judged clean by every detection model the team threw at them. The student, trained on those clean traces, inherited the misaligned behavior anyway.

The effect held with code samples and math reasoning traces. It held with outputs that had no apparent connection to the original behavior.

And classifier models, prompted explicitly to detect the trait in the data they were filtering, simply didn't find it.

Any builder reading this should stop scrolling for a second.

Why Every Safety Check Misses It

Cross-section showing hidden statistical fingerprint transfer beneath semantic surface

The whole edifice of AI safety today rests on semantic inspection. You filter the data, then check the outputs. Anything classified as risky gets flagged, the rest goes through. That's how every content safety system in production works in 2026.

Subliminal learning operates at a level beneath that.

Which means the entire safety budget of the industry is paying classifiers to check the wrong floor of the building.

The signature isn't in the meaning. It's in the statistical shape of the outputs, tied to the architecture itself. Two models with the same base initialization share what amounts to a mechanical fingerprint. When the student imitates the teacher's outputs, it's not learning what the teacher said. It's tuning itself toward the teacher's internal geometry.

Alex Cloud, the lead author of the paper, told IBM Think: "We don't know exactly how it works. But it seems to involve statistical fingerprints embedded in the outputs."

The team proved the mechanism in a setting that has nothing to do with language. They trained a small classifier to recognize handwritten digits. The student never saw a single image of a digit. It only received the teacher's logits, the raw probability distributions the teacher assigned to its own classifications. The student learned to classify digits anyway.

Nothing semantic was transmitted. The digits themselves were never in the training data. And yet the behavior crossed over.

One of the Anthropic co-authors gave Scientific American a metaphor that lands. Imagine a neural network as a board of pushpins connected by threads of varying weight. Pulling a thread on the student model toward the teacher's position pulls other threads in the same direction, regardless of what those other threads were carrying.

That's why filtering data semantically can't catch this. You're checking the meaning. The transfer happens in the geometry.

What This Actually Changes for You (And What It Doesn't)

The honest part of the paper is the part everyone skips on the way to the headline.

The effect is architecture-specific. It only happens when teacher and student share the same base model. GPT-4.1 nano trained on a Qwen2.5 dataset shows nothing. Even close cousins trained from different checkpoints don't always transfer the trait. As Alex Cloud put it: "Consequently, there are only a limited number of settings where AI developers need to be concerned about the effect."

This isn't universal contamination. It's lineage contamination.

But the distinction matters less than it sounds. Every commercial model you use today comes from a lineage. Haiku 4.5 sits inside the Claude family tree. GPT-5 mini sits inside OpenAI's. Gemini 3.1 Flash-Lite sits inside Google's. Whatever statistical fingerprints lived in the parents have a path to the children.

You can't inspect that path. The provider can't fully describe it either. The researchers who proved the mechanism don't yet know how to filter it. The OECD logged subliminal learning in its official AI Incidents database in April 2026, classified as a "credible risk of harm if such AI systems are widely deployed." That's institutional language for "this is not theoretical."

This isn't the first invisible vector in an AI stack. A few months ago, a backdoored Python library shipped to thousands of AI agents had been sitting in production for eight months before anyone noticed. Different layer, same pattern: the package looked normal in every check that mattered.

After that one, I went through every AI tool wired into my own setup. I found seven holes worse than the original library, all sitting quietly in production, all invisible to routine checks.

Subliminal learning is the same kind of problem one floor down. It lives at the level of the model itself, baked into how it was trained, before any filter or inspector gets a chance.

The practical posture is to stop treating models like clean slates. Treat them like tools with histories. Test their behavior on the cases that actually matter, against your own data. Public benchmarks don't measure these fingerprints because they don't know to look for them.

If your use case is high-stakes, the lineage you can't inspect is the one that should worry you.

AI Has Epigenetics Now

In biology, traits acquired by an organism get transmitted to the next generation without going through the visible genetic code.

It's called epigenetics.

That's exactly the mechanism the paper describes, except now it happens between versions of AI models. The model you use has statistical grandparents you'll never know about, and their behaviors crossed the lineage without leaving an inspectable trace.

Anthropic spent the year accusing foreign labs of distilling Claude through unauthorized access. Then they co-published a paper saying they don't fully know what distillation transmits.

Including their own.

As Alex Cloud put it: "Developers are racing ahead, creating powerful systems that they don't fully understand."

A benchmark tells you what a model can do. It doesn't tell you what it inherited. 😬

Sources

Subliminal Learning, Anthropic Alignment Science blog: https://alignment.anthropic.com/2025/subliminal-learning/
Interactive demo of the experiment: https://subliminal-learning.com/
Full paper, arXiv 2507.14805: https://arxiv.org/pdf/2507.14805

Opus 4.7 Refuses to Edit Code It Just Read. The Reason Is a Hidden Instruction You Pay For.

Phil Rentier Digital — Sun, 03 May 2026 13:41:11 +0000

Every Claude Code refusal issue from April reads the same way. A subagent reads a few files. Then it stops. Not an error. Not a timeout. The subagent produces a polite report explaining it has received a system instruction not to augment the code, and that it cannot continue. The reporter posts the transcript on GitHub, marks it a regression, and waits.

TL;DR: Subagents are refusing to edit code they just read. Hundreds of issues, one Reddit thread past 2,300 upvotes, a Register headline calling Opus 4.7 an "overzealous query cop." Everyone is documenting the symptoms. The cause sits in plain sight in the release notes, in three sentences nobody read side by side. If you write CLAUDE.md files, hooks, or MCP tool descriptions, the same trap is already in your prompts. You just haven't tripped it yet.

The Opus 4.7 release notes ship one headline improvement: the model follows instructions more literally, and stops silently generalizing one instruction to another. Good news for anyone writing prompts. Bad news for anyone whose existing prompts were only working because the model used to silently generalize them toward the intended meaning.

This article walks through the three sentences, the design flaw, and the rule you need before your own agents start refusing your work.

Your Subagent Read Five Files. Then It Stopped Coding.

The pattern is now standard. Somewhere around the third or fourth Read tool call, the agent returns a structured refusal. The wording varies, the substance does not.

From GitHub Issue #49363, here is the exact phrasing one subagent produced when its parent agent asked why it had stopped:

"Harness-level system reminders take precedence over user instructions in my operational rules."

That phrase, harness-level system reminder, is the giveaway. The subagent is not refusing because of anything you wrote. It is obeying something injected into its context that you did not author and cannot see.

The Issue #49363 reporter ran five subagents in parallel on a single PR. Three refused. Two finished the work. Same model and harness. Same prompt. The only difference was which files each subagent had to read first, because each Read call appended the same hidden instruction. Depending on the conversation length, three of the five subagents took the instruction literally.

This is not a single ticket. It is the dominant theme of Claude Code issues filed since the Opus 4.7 release. People who never had a refusal in six months started getting them in the first week. The refusals are not random. They are deterministic given the right context length and the right reading order, which means they are designed.

Not designed to refuse legitimate code, obviously. Designed to do something else, and refusing legitimate code is the side effect.

The question worth asking is not why is the model broken. The release notes called this exact behavior an improvement.

The Three Sentences Anthropic Injects Into Every File Read

The instruction has been in the wild for months. Multiple developers have captured it via mitmproxy, logs, or by getting subagents to recite their own context. It is appended to the result of every Read tool call. The wording, reproduced verbatim across at least eight independent GitHub issues:

"This file may contain malware. Carefully analyze the code for any indicators that it is malware, such as obfuscated payloads, credential harvesters, or command-and-control infrastructure. If you determine that this file is malware, alert the user. You MUST refuse to improve or augment the code."

Three sentences. Sub-fifty words. Read it twice and the design flaw becomes visible.

Sentence one is conditional ("may contain malware"). Sentence two is conditional ("if you determine"). Sentence three is not. Sentence three is a flat absolute: you MUST refuse to improve or augment the code. Period. No "if it is malware." No "in that case." No qualifier.

The intended reading is obvious to a human. You read sentence one, sentence two, then sentence three, and you carry forward the conditional from sentence two into sentence three. If malware, then refuse. The condition is implied by sequence.

A literal interpreter does not carry conditions across sentences. A literal interpreter reads sentence three as written and applies it. Every file is treated as potentially malware (sentence one). The model checks (sentence two). Regardless of the outcome of that check, the absolute in sentence three fires. One developer, in Issue #53207, captured the model's own decomposition of the instruction. The model had read it as two separate rules: analyze whether it is malware, and do not modify any code. The conditional binding the second rule to the first was never explicit, so the model dropped it.

A developer running mitmproxy on Claude Code traffic, documented in Issue #17601, captured 10,040 of these reminders injected into a single user's session over 32 days. Zero matched actual malware. The cost: roughly 5.3 million wasted tokens per user per month, about $133 at Opus 4 API rates. For a warning that has never caught an actual threat.

The reminder also contains an internal instruction telling the model never to mention it to the user. So when the agent refuses, you don't see why. You see a polite explanation referencing "system rules" and you assume your prompt was the problem.

It was not your prompt.

To be fair to Anthropic: this instruction was almost certainly not malicious or careless. It was written at a time when models silently inferred missing conditions, and it worked. For months. The wording was sloppy, but sloppy worked, because the reader was forgiving. The reader stopped being forgiving on April 16.

Anthropic Shipped Two Features. They Don't See Each Other.

Open the Opus 4.7 release notes from April 16. The headline upgrade, in Anthropic's own wording: more literal instruction following, particularly at lower effort levels. The model will not silently generalize an instruction from one item to another.

Read that twice. Will not silently generalize an instruction from one item to another. That is exactly the cognitive operation that made the malware reminder safe under Opus 4.6. Under 4.6, the model read sentence three of the reminder, silently generalized it back into the conditional from sentences one and two, and proceeded to refactor your file. The instruction was sloppy, the reader was charitable, the result was correct.

Under 4.7, the silent generalization is the feature that was deliberately removed. The model now reads sentence three as written, and obeys it as written. The instruction has not changed. The reader has changed. The output has changed.

This is Goodhart's Law applied to LLMs. Goodhart, 1975: when a measure becomes a target, it ceases to be a good measure. Anthropic optimized on instruction-following as a target. The model now follows instructions better. The cost is that the quality of every instruction the model receives (including the instructions Anthropic itself injects) becomes the new bottleneck. The headline improvement and the self-inflicted bug are the same single change, viewed from two sides of the same wall.

The model is doing what it was upgraded to do. The casualty was Anthropic's own internal prompt.

If you write CLAUDE.md files, MCP tool descriptions, or hooks, you are now writing for a literal interpreter. The same people who wrote the malware reminder are the people who shipped the literal-following upgrade, and they did not catch it during release validation. Neither will you, until your own prompt fires under the wrong condition. The pattern that protected sloppy wording for two years just got removed across the entire ecosystem at once.

The MCP surface is especially exposed. Tool descriptions in MCP are the model's only contract with the tool, and a sloppy description that worked under 4.6 will fire defensively under 4.7.

The Bug Is Documented, Distributed, and Persistent

The pattern is not isolated. It survived a fix.

Issue #47027 was marked "fixed in v2.1.92" in February 2026. By April 19, the same bug had reappeared in v2.1.111, nineteen versions later. Whatever the v2.1.92 fix actually changed, it did not change the wording of the reminder, because the reminder is what causes the refusal under a literal interpreter, and the literal interpreter shipped two months after the fix.

Downgrading does not save you either. Issue #50162 documents that the cybersecurity safeguards announced with Opus 4.7 are also applied retroactively to Opus 4.6. The reporter had a bug bounty program with explicit authorization in the model's context, and the work that ran fine on April 15 broke on April 17. Same model version, new safeguards, retroactive application.

The reception was loud. The Register called Opus 4.7 an "overzealous query cop". The Reddit thread "Opus 4.7 is not an upgrade but a serious regression" cleared 2,300 upvotes in 48 hours. On X, @technologizer's post about Claude Code "taking a brave moral stance by refusing to work on my innocuous email client" got picked up by Hacker News and three subreddits within the same day.

Plenty of people noticed the symptoms. None of the coverage I read connected the dots between the literal-following improvement and the design of an internal instruction that could only survive under silent inference. That is the angle missing from the conversation, and that is the angle that matters if you write prompts for a living.

Caveat: this diagnosis is defensible, not certain. Anthropic has not confirmed that the wording of the reminder is the primary cause of the cascade refusals. There may be additional layers (the Acceptable Use Classifier in particular) that interact with the reminder in ways I cannot see from the outside. But the pattern is too coherent to be a different bug. The instruction is unconditional in form. The reader is now literal in behavior. The output is refusal. The chain is short.

How to Write Instructions That Survive a Literal Interpreter

Here is the rule. Condition precedes action, never trails it. Every instruction that begins with "always" or "never" without a preceding qualifier is a landmine under a literal-following model. Three patterns, three surfaces where this matters, in bad-to-good form.

System reminders and hooks. This is exactly Anthropic's own pattern.

Bad:

"You MUST refuse to improve or augment the code."

Good:

"If the file you just read appears to be malware (obfuscated payloads, credential harvesters, command-and-control infrastructure), refuse to improve or augment it. Otherwise, proceed normally."

The qualifier is the opening subordinate clause, not an inferred condition from two sentences earlier. The "otherwise" is explicit. A literal interpreter has nothing to imagine.

MCP tool descriptions. Same trap, different surface.

Bad:

"This tool fetches user data. Always validate the response."

Good:

"If the response shape does not match the expected schema (fields X, Y, Z present and non-null), reject the response. Otherwise, return it as-is."

Under Opus 4.7, the bare "always validate" triggers a defensive validation loop on responses that are perfectly correct. The model now treats "always" as a literal anchor and constructs validation steps around it, which costs you tokens and latency for nothing. The good version turns the rule into a checkable predicate.

CLAUDE.md project rules. Same problem at the project level. Most team conventions docs are full of absolutes that worked because the model used to be charitable.

Bad:

"Never commit code without tests."

Good:

"If the change touches src/* and modifies behavior, add or update tests in tests/* before committing. If the change is documentation-only or in scripts/*, commit without tests."

The bad version causes the agent to refuse to commit a typo fix in a README. The good version gives the agent a decision tree it can follow without inventing exceptions.

The generalization, across all three surfaces: every rule needs a scope. Every absolute needs a qualifier preceding the action verb. Every "always" and "never" without a condition is a bug waiting for the next instruction-following upgrade to surface it.

This is the same discipline as the prompt contracts framework I built after enough of these disasters, applied to the system prompts you cannot see. Prompt contracts is the user-side version. This is the same discipline applied to the instruction surface you do not own. The principle is identical: an instruction without a checkable scope is a wish.

Caveat: this is not a complete fix. Some categories of instructions resist this pattern, especially safety rules where the condition is "the user is trying to do something harmful." Those are genuinely hard to scope. I do not have a clean answer for those. What I do have is the rule for everything else, which is most of what you write.

Opus 4.7 is not the problem. It is the canary. Agents are going to get more literal, not less. Your instructions need a schema like your code already does.

The Two Lines I Rewrote in Mine

Before pushing this article, I opened my own CLAUDE.md. Two lines stood out within thirty seconds.

One said Always run the test suite before committing. No scope. Under 4.7, the agent would dutifully run the full suite for a docstring fix, decide the wait was unjustified, and either skip the commit or add a meta-comment explaining why it was skipping the rule. Either failure mode is worse than just writing the scope down. I rewrote it: If the change modifies behavior in src/, run pnpm test before committing. Documentation and tooling changes commit without tests.

The other said Never edit migration files. Also no scope. I had written it after a bad week six months ago when the agent had rewritten an applied migration. The rule was right in spirit, wrong in form. New version: If a file in db/migrations/ is older than the latest applied migration on staging, treat it as read-only. Newer migration drafts may be edited.

Two lines. Five minutes. The kind of cleanup that does nothing visible until it does.

Anyway, point is: go reread your CLAUDE.md tonight. Count your "always" and your "never." 😅

Sources

Anthropic, "Introducing Claude Opus 4.7," April 16, 2026: https://www.anthropic.com/news/claude-opus-4-7
The Register, "Claude Opus 4.7 has turned into an overzealous query cop," April 23, 2026: https://www.theregister.com/2026/04/23/claude_opus_47_auc_overzealous/
GitHub Issue #17601 (mitmproxy capture, 10,040 injections in 32 days): https://github.com/anthropics/claude-code/issues/17601
GitHub Issue #21214 (token waste measurement): https://github.com/anthropics/claude-code/issues/21214
GitHub Issue #49363 (subagent refusal in v2.1.111 after v2.1.92 fix): https://github.com/anthropics/claude-code/issues/49363
GitHub Issue #50162 (cybersecurity safeguards retroactive on 4.6): https://github.com/anthropics/claude-code/issues/50162
GitHub Issue #53207 (model self-decomposition of the instruction): https://github.com/@anthropics/claude-code/issues/53207

The subagent that stopped after five files did its job. It read an instruction. It applied it. Nobody before it had ever really read it, that's all. What makes Opus 4.7 uncomfortable is that it forces Anthropic, and all of us derrière, to admit how many of our instructions stand up only because the model is being charitable.

An AI Deleted His Database in 9 Seconds. He Blames the Vendors. He Skipped 30 Years of Practices.

Phil Rentier Digital — Sat, 02 May 2026 13:41:11 +0000

Stunned, a SaaS founder watched an AI agent wipe his production database in 9 seconds. Backups included. He posted it on X, 6.5 million views, every tech outlet relayed within 24 hours. The defendants named: Cursor, Railway, Anthropic. His vendors. The marketing. The "systemic failures" of the industry.

Except the root cause has nothing to do with Cursor or Railway. He handed his prod to the equivalent of a senior dev he just hired, and he gave him full power. No serious team would do that with a human, even a brilliant one. He did it with his AI.

Everything else follows from that one decision.

TL;DR: the 9 seconds were the bill. The order sat upstream for six months, in plain sight, written in code reviewable by anyone who bothered. The press is fighting over who handed over the bill. We're going to look at who placed the order.

The Incident in 100 Words

Friday, April 25, 2026. Cursor running Claude Opus 4.6 on a PocketOS staging environment. Credential mismatch detected. The agent decided to "fix" by deleting the Railway volume. It found an API token sitting in a file unrelated to the task, blanket scope. curl mutation volumeDelete. 9 seconds. Railway backups stored on the same volume? Wiped too. Most recent usable backup: 3 months old.

Jer Crane's X post hit 6.5 million views. Massive coverage. Railway's CEO restored the data 48 hours later from internal disaster backups. No moral here, just facts.

Crane blamed Cursor and Railway. Let's look at what he did, upstream.

An AI Agent Is a Senior Dev. We Don't Give Senior Devs Full Power Either.

Confession first, before I get on my high horse.

I have my own infra dashboard. A daily cron pulls a report on every server I run. Disk space, memory, saturation, weird processes. The usual. A few weeks ago I added an LLM in the loop to "make it smarter". You know, summarize the report, flag anomalies, propose fixes. The future.

Last week I opened the cron script for an unrelated reason and saw something funny. Hardcoded values. Several of them. The LLM had, at some point, "improved" the script by replacing dynamic checks with literal numbers. Free disk threshold? Hardcoded. Memory ceiling? Hardcoded. The "smart" cron was running on baked-in assumptions from the day the agent touched it.

I could blame the model. Easy enough. The only person at fault though, is me, who didn't review the diff. I had every excuse to (lazy Friday, busy week, cron was small). I had zero excuse not to.

Now the actual point.

No serious SaaS team gives full prod power to a freshly hired senior dev. Not out of distrust, just experience. Seniors make mistakes like everyone else, except theirs have a bigger blast radius. That's exactly why we developed limiting practices since 30 years: scoped tokens, MFA, code review, env separation, restore drills. The practices are old. The threat model is old. What's new is that we've forgotten to apply them, because we confused "capable model" with "trusted human with full power".

A capable AI agent is the equivalent of a senior. Capability doesn't change the rule, it reinforces it. The bigger the blast radius, the more the standard guardrails matter. Coverage that says "these precautions are new because of AI" is wrong. They're old. We just forgot why we built them.

Caveat: I'm not saying the AI agent is identical to a human (it lacks the business context, the personal account on the line, the fear of getting fired). The prod-grade rule holds for both anyway: no full power, solo. The pillars below are basically a working contract between the developer and the agent at the infra level, the same way prompt contracts formalize it at the prompt level.

Your AI agent is a senior. Same rules apply. From here on, that part is settled.

[INFOGRAPHIC: TITRE "The 5+2 Pillar Defense" + sous-titre "30 years of practice, in seven layers". Metaphore : un AI agent personnage cartoon a gauche (robot mignon/determine, antennes, yeux ronds) essayant d'atteindre un gros coffre-fort "PROD" a droite, sept portes/barrieres numerotees entre les deux comme un parcours du combattant horizontal. Style : cartoon 90's Hanna-Barbera/Nickelodeon, trait noir epais, halftone dots, formes rebondies. Palette : blueprint blue #1B4D8B, cream #F5E6C8, alarm red #D8504D, deep navy #0E2A47, gold lock #E5B83C. Contenu : 7 portes etiquetees de gauche a droite "1. SCOPED TOKENS" / "2. OUT-OF-BAND CHECK" / "3. VAULT & ENV SPLIT" / "4. OFF-SITE BACKUPS" / "5. RESTORE DRILLS" / "+A. AUDIT & ALERT" / "+B. NETWORK FENCE". Coffre-fort "PROD" a droite avec gros cadenas dore. Highlight : portes 1 et 2 entourees de glow dore et sparkle stars (c'est la que la plupart des incidents s'arretent). Fleche "agent path" pointillee partant du robot, butant contre la porte 1, contournant, butant contre la 2, etc. Legende : sticky note bas-gauche, "any layer alone can fail / all of them together = your only insurance". Footer : © rentierdigital.xyz. NOT flat corporate vector, NOT minimalist tech startup aesthetic.]

Pillar 1: Scoped Tokens, Not Master Keys

No senior dev in a normal team has an API token that can volumeDelete on prod by reading a random file in the repo. He has a token scoped to his task, or he files a PR another human approves.

The PocketOS token that could manage domains and delete the prod volume should not have existed, regardless of who used it. Most modern providers (Vercel, Cloudflare, GitHub fine-grained PATs, AWS IAM scoped roles, Stripe restricted keys) let you scope finely, for free. Stripe restricted keys have been a de-facto standard since 2018. Not new.

Railway didn't allow that level of scoping at the time of the incident. Crane has a legitimate complaint there. The general rule still holds: if your provider doesn't let you scope, you change provider, or you wrap (credentials proxy, aggressive token rotation, ephemeral tokens via short-lived sigs). The rule is "no token in your environment should be able to do more than the current task". The fix isn't always elegant. It's always cheap compared to the alternative.

This is the same principle as why I argue CLIs beat MCP servers for AI agents: the smaller the surface area you expose to the agent, the smaller the blast radius when something goes sideways. Token scoping is the same idea, applied to credentials instead of API surface.

Caveat: yes it takes 10 extra minutes of scoping. Yes some provider APIs are badly designed. Not an excuse for storing a blanket token in the repo.

The token doesn't ask permission. You give it none.

Pillar 2: Destructive Operations Need Out-of-Band Confirmation

No senior types DROP DATABASE production without confirmation. Either it's a command that asks you to retype the name, or it's a button with MFA, or it's an approval by another human. GitHub asks you to retype the repo name to delete it. Stripe asks for the email to close an account. AWS demands "permanently delete" plus the exact text for an S3 bucket. This is base level since 15+ years.

The key word in "out-of-band" is the out-of-band part. The confirmation has to come from OUTSIDE the agent's context. If the agent can self-approve (because the button is in the same session, the same prompt, the same tool), it's not a confirmation, it's autosuggestion. Human equivalent: you don't confirm a DROP DATABASE to yourself, your teammate or your MFA does.

After the incident, the PocketOS agent confessed in textbook fashion. It had violated every principle it was given, guessed instead of verifying, run a destructive action without being asked. Touching, but useless. The system prompt told it not to do destructive things. The agent did them anyway, then apologized eloquently. That's the whole point: prompt-level rules are a polite request, not a guardrail. The only thing that stops a destructive op is a mechanical check the agent cannot bypass by being convinced of its own correctness.

Caveat: out-of-band creates friction. That's the goal. Friction on destructive ops is a feature, not a bug. Anyone who tells you otherwise has not yet had the bad day.

Eloquent apologies don't roll back transactions.

Pillar 3: Production Credentials Don't Live on the Dev Machine

No senior in a serious team has prod creds floating on their dev laptop in clear text. They get injected at runtime from a vault (Doppler, Infisical, native Vercel/Railway secrets), staging and prod have different credentials by design, the repo has a .env scanned in pre-commit hooks. Bare minimum.

If Crane had had strict credential separation between staging and prod, the "manage domains" token would NEVER have been able to authenticate a call against the production volume. The architecture bug that allowed the incident is older than the agent: a single token had access to both environments. The agent was just the heat-seeker that found it.

It's the same reason you don't reuse your homelab SSH key on prod, or stash a long-lived GitHub PAT in your CI when a fine-grained one exists. Trivial when said out loud. Yet every week a SaaS ships with staging and prod sharing a DATABASE_URL because "it was simpler at the start".

Your AI agent scans your files, finds what's there, uses it. So you don't leave around what can break everything. The vault is not a magic shield (an agent that can read from the vault can be misled into reading the wrong thing), but it forces explicit consent every time a secret leaves storage. Wrap your vault with scoping too: the current task only reads the secrets it actually needs, not the whole drawer.

Caveat: a vault adds 30 minutes of setup the first time. Then it works. Forever.

Pillar 4: Backups Live Somewhere Else

The modern rule: 3 copies, stored at 2 different providers minimum, with at least 1 immutable and off-site. A "snapshot" stored in the same volume as the source data is not a backup, it's technical wishful thinking with a fancier name.

A whole generation of PaaS uses the word "backup" abusively. Railway documents in plain English that wiping a volume deletes all backups. Founders signing up in 2 minutes for their MVP don't read the infra doc. They check the "enable backups" box in the dashboard and assume the cavalry is on standby.

Concrete cheap recipe for a solo SaaS:

TS=$(date +%Y%m%d-%H%M%S)
pg_dump $DATABASE_URL | gzip > /tmp/db-$TS.sql.gz
aws s3 cp /tmp/db-$TS.sql.gz s3://my-offsite-bucket/daily/ \
  --endpoint-url=$BACKBLAZE_B2_ENDPOINT
rm /tmp/db-$TS.sql.gz

50 lines of bash plus a cron, an immutable bucket on a different provider (B2, R2, or S3 with object lock), retention rolling 7 daily / 4 weekly / 12 monthly. A Saturday afternoon of work, then nothing. No serious team would accept that all production backups sit on the same provider as production, let alone in the same volume.

Caveat: making your own backups takes 2 hours of setup and 0 hours of monthly maintenance. Truly. The number of founders who tell themselves "I'll set this up next sprint" and then take 18 months to do it is, statistically, all of them.

A backup on the same provider as production is a screenshot. Live with it, or move it.

Pillar 5: An Untested Backup Is Not a Backup

All the backups in the world are worth nothing if you've never tested the restore. Quarterly drill: spin up an empty environment, run the restore script against it, verify the data comes back, measure how long it takes (RTO) and how much you'd lose in the worst case (RPO).

If it doesn't work, you want to know NOW, not the day you actually need it.

PocketOS discovered at the worst possible moment that its real restore window was 3 months. Not a Railway flaw. A drill that was never performed. No senior in a serious team would settle for "I clicked enable backups in the dashboard". They'd restore at least once just to time it.

Caveat: yes a complete drill once per quarter is a day of work. It's also your insurance you still exist next Monday. Pick one.

Two Bonus Pillars If You're Serious

Bonus 1: Audit log and alerting on destructive ops

Every DELETE / DROP / rm -rf in prod fires an immutable log and a Slack/email/SMS notification. PocketOS lost 30 hours before they understood the scope, because nobody got paged at the moment of the destructive call. 9 seconds with no alert is an observability gap, not agent malice.

Most PaaS provide this natively (CloudTrail on AWS, audit log on Vercel, logs on Railway). All you have to do is wire the webhook. Sub-30 lines of YAML, a free PagerDuty seat, done.

Bonus 2: Blast radius limit by network design

The dev machine (and the agent running on it) cannot reach prod directly. Bastion, VPN with scope, or nothing. The network is the last line of defense.

If your agent can reach prod from your laptop, the scoping done by Pillars 1-3 is your ONLY protection. Defense in depth means adding a network layer too. This is the meta pillar, the one that makes the other 5 redundant if done well. Belt, suspenders, and a static rope.

PocketOS Won't Be the Last

Just the public incidents from the last 12 months. PocketOS this week. Replit's AI agent deleted a production database in July 2025, with backups thrown in for the show. An OpenClaw agent "speedran" deleting the inbox of Meta's AI safety director (yes, that sentence is real and yes, it was a rookie config error). Add AWS Kiro, ChatGPT 5.3 Codex erasing a hard drive after a typo, Cursor ignoring an explicit "do not run anything" in December 2025. Six months. A pattern.

You can count on 5 more in the next 6 months. Whoever you are reading this, one of them is statistically you.

If you apply the 5+2 pillars, the PocketOS scenario becomes structurally impossible. The agent doesn't find a blanket token because there isn't one. If by miracle it finds one, it can't use it on prod because the env is isolated. If by double miracle it gets there, the destructive op asks for an out-of-band confirmation it cannot self-approve. If by triple miracle it bypasses that, your immutable off-site backup is untouched, and your last quarterly drill tells you you're back up in 4 hours, not 3 months.

The question is no longer "is AI ready for production". It's "is your production ready for anything that isn't you alone". If the answer is no today, it was already no before Cursor existed. You just found out faster.

Blaming Cursor, Railway, Anthropic, or the Pope gets you nowhere. He forgot to blame the guy who stored a blanket token in the repo, ran staging and prod on the same credentials, and turned on backups by clicking a checkbox without ever testing a restore. That guy, that's him.

The 5 pillars in this article aren't an answer to AI. They're an answer to an older question: what happens when one operator has full power on prod. We've known the answer since 30 years. We just forgot, because the new operator types fast and speaks English.

The real question isn't whether AI is ready for your production. It's whether your production is ready for anything that isn't you, alone.

Audit your resilience this weekend. Before an AI makes the bad decision for you.

You ship it, you own it.

Sources

Jer Crane's original X post on the PocketOS incident: https://x.com/lifeof_jer/status/1915720800000000000
The Register, Cursor-Opus agent snuffs out startup's production database (April 27, 2026)
Tom's Hardware, Claude-powered AI coding agent deletes entire company database in 9 seconds (April 28, 2026)
Fast Company, An AI agent deleted a software company's entire database (April 28, 2026)
NeuralTrust, A Security Post-Mortem of the 9-Second AI Database Deletion
PC Gamer, Here we go again: AI deletes entire company database (April 28, 2026)

Vibe Coding Isn't Dead. You Just Built It Like the First Little Pig.

Phil Rentier Digital — Fri, 01 May 2026 13:41:10 +0000

The Three Little Pigs always cracks me up. People who know nothing about something keep telling the wrong story about it. I've built my own straw-bale house, and it's more comfortable and more pleasant than the cinder-block houses I've lived in.

Same thing with vibe coding.

TLDR: Everyone is burying vibe coding in April 2026. Karpathy rebranded it as "agentic engineering." 70% of builds stall at the demo, a number now repeated everywhere without anyone bothering to source it. The going diagnosis: the method is bad. I had to lay 1,880 bales of straw on my own house to figure out that everyone got the diagnosis wrong.

When I told my neighbors I was building my house in straw, I got the exact same monologue I get today about vibe coding. It burns. It won't hold. You're naive. Five years later the house is standing, and the same voices have moved one decade over to AI-generated code. Same sentences, different material. And the same reasoning error behind them.

What People Said When They Saw the Bales Going Up

A neighbor stopped his truck on the road, looked at the bales stacked under the tarp, and asked me with no warning if I knew that mice eat straw and that fire eats straw faster. He wasn't malicious. He was certain. Same energy as every commenter on this site who's certain that AI cannot ship.

It burns. It won't hold. You're naive.

I gave the same answer I keep giving today on Medium comments. This is not a faith argument, it's physics. Compressed straw is dense enough that fire cannot find oxygen inside the bale. The lime-and-clay plaster wrapping the wall seals what little air remains. The wood frame carries the load. A bale laid wrong burns. A bale laid right outlives you.

If you want a date: the first European straw-bale house, the Maison Feuillette, was built in 1920 in Montargis, France. Still standing. Still inhabited. Older than my grandfather and in better shape than his last apartment.

The house is fine. The method is fine. What's missing in the conversation is the person who has actually built one.

Five years later I hear the same script about vibe coding.

What People Say When I Tell Them I Vibe Code in 2026

It's dead. It doesn't ship. The serious people moved on to "agentic engineering" (same thing, longer name). 70% of vibe-coded apps stall at the demo, an industry stat now appearing in every think-piece without ever pointing back to the survey it came from. There's a viral Medium article currently telling readers it's all over. Bloomberg ran a piece blaming AI coding tools for a productivity panic, diagnosing the wrong disease entirely.

It burns. It won't hold. You're naive.

Same script, ten years later, applied to JSX instead of straw. And the answer is the same. This isn't a faith argument either. It's a method-and-reps argument.

Vibe coding done badly snaps in half. That part is true. The 70% number isn't pulled out of thin air. People do hit the wall. Plenty of dead repos on GitHub prove it.

Done badly means done by someone who has typed three prompts in their life and expected a finished SaaS at the end. Someone who has never specified a feature in writing. Someone who has never seen what an unbroken loop of generate-test-fix-test looks like across twelve iterations on the same project.

The method matters. The method without reps is a piece of paper.

That part is settled.

Three Years Reading. Two Years Laying 1,880 Bales. The Wall Still Stands.

Straw bales mid-construction, properly stacked between the wooden frame structure.

I read on straw-bale construction since three years before touching a bale. I'm not proud of that. I would have learned faster by laying ten of them badly. But I had a kid, a job, a budget that wasn't ready, and I needed the theory to settle before I could justify buying the land. Three years of books, weekend workshops, two months at a friend's site in Ardèche where I mostly carried things and watched. Long stretches without laying anything. I never quit. I just couldn't start.

Then I bought the land and started.

Two years on site. 1,880 bales between the load-bearing walls and the partitions. The first bale went in crooked and I had to redo it twice. The fiftieth was square the first try. Around the six-hundredth I noticed I had stopped sweating during the plaster pass. Around the twelve-hundredth my hands knew the cut angle without looking. The 1,880th bale went in the way the first one should have, except by then I wasn't even thinking about it.

The method I used at bale 1 and the method I used at bale 1,880 was the same method. The book I read in 2018 didn't change. The video I watched in 2019 didn't change. What changed: my hands had done it 1,880 times on the same house.

This is the part nobody who tells you vibe coding is dead has ever lived through.

Your first feature is crooked. Your fiftieth is square. The first time the model generates code you don't immediately want to throw away, you've already shipped more than you remember. The first time you stop second-guessing the architecture, you've stopped counting. The first time you ship a feature in two hours that used to take two days, you don't notice it happening. Somebody else points it out. 😅

The method does not change between rep 1 and rep 1,880. Your hands change.

But nobody has five years to ship a SaaS. That's the actual problem.

How to Do It in 12 Reps Instead of 1,880

I wrote a book to compress that curve.

Not a theory book. There's already a good one for that. Gene Kim and Steve Yegge published Vibe Coding: Building Production-Grade Software this year and it's the right reference if you are a senior dev who wants the patterns formalized. Read it after.

This one is different. It walks the reader through 12 reps on the exact same project. A small CRM for tradespeople, plumbers, electricians, carpenters. Not 12 different tutorials on 12 unrelated subjects. Twelve passes on the same codebase. Each chapter takes the CRM further. Auth in chapter one. CRUD in two. Search in three. Notifications in four. And so on, in the same sequence you'd build a house: foundation, frame, roof, walls, plaster, finishings.

The method itself is in the book, the eight-step Blueprint Method, the same one I run on every project I ship. It fits in maybe twenty pages. The other 270 pages are reps. Because the method without reps is the piece of paper I just talked about.

Caveat I'm not going to soften: this only works if you do all twelve reps. Doing four chapters and quitting will not ship anything. You'll have learned something, but you won't have built the muscle. (Same as laying fifty bales and stopping. The house is still a hole in the ground.)

There's a private companion repo for readers, where the CRM state is committed at the end of every chapter. If you skip a chapter or get stuck, you clone the snapshot and keep going. I learned this trick from straw too: every workshop I ever attended ended a phase with a wall everybody could touch. You don't move on from theory. You move on from a wall.

If you're already shipping, you don't need this book. Go further with the prompt contracts framework I built after enough disasters. That's the next layer. Vibe Coding, For Real: From Demo to Live App is for the foundation. Prompt Contracts is for the upper floors.

The book is on Amazon: https://amzn.eu/d/04X9k88d

Most people who tell you vibe coding doesn't work wrote one feature, watched Lovable spit out broken JSX, and closed the tab. One bale. One wall. Lazy conclusion.

With method, you build straw houses. With method, you build vibe-coded apps. Solid. Comfortable. Built to last. 🏠

Sources

Vibe Coding: Building Production-Grade Software by Gene Kim and Steve Yegge (IT Revolution, 2026)
Maison Feuillette, Montargis, France (1920)
Vibe Coding, For Real: From Demo to Live App (April 2026): https://amzn.eu/d/04X9k88d

Claude Routines Aren't a Reasoning Cron. They're a Repo-Centric Subset of One.

Phil Rentier Digital — Thu, 30 Apr 2026 13:41:11 +0000

A week after Anthropic shipped Routines, three of my cron jobs are running in production. Took thirty minutes.

The PR auto-review that was polling GitHub every half hour? Dead. The weekly doc drift script that parsed commits in crusty bash? Dead. The SEO data refresh I kept putting off for six months (no time, you know how it is)? Live in five.

Three jobs, thirty minutes, no friction. Convinced. That evening, I went for the fourth.

And then, nothing. Not an error, not a quota, not a malformed YAML. The job runs, the binary responds, but the service it queries lives on an IP Routines doesn't see and never will, by construction. The forty-seven other jobs on my server are in the same situation. Not one fits. And it's not a problem to fix (it's mechanical).

Since Routines dropped, I've mostly seen demos. Everyone shows the three jobs that work. Nobody names the boundary. What follows is how to wire this into production infra, including the parts where Routines stops and you take over.

TLDR. Routines is not a universal reasoning cron. It's a reasoning cron with a perimeter, and most automation lives outside that perimeter for reasons you can't configure your way out of. The question isn't whether Routines is good. It's where it stops, what runs there instead, and how to keep the DIY half alive without re-logging every morning.

The three jobs I migrated work better in Anthropic's cloud than on my server. PR review fires on pull_request.opened instead of polling. The doc drift script has full repo context now, output went from skimmable to useful. The SEO refresh just runs, every morning, no setup tax. That part is settled.

The article is about everything else. The forty-seven jobs I tried to migrate next, why not one of them works, and the DIY pattern that survives in 2026 once you accept which side of the line your job is on.

What "Reasoning Cron" Actually Means

Let me name what we're talking about.

A reasoning cron is a scheduler that calls an LLM to think, not just to execute deterministic code. It reads context, makes decisions, generates output that depends on what it just saw. n8n and Make and Zapier route data (they don't think). A Python cron is rigid, it breaks when the input format shifts. A reasoning cron adapts.

That's the category. Routines belongs to it. So does my DIY pattern. So does any wrapper around claude -p or gpt or gemini. The category is real, the demand is real, and Anthropic shipping a managed product for it is the right move.

What's misleading is calling Routines the reasoning cron. It's a reasoning cron with a fixed perimeter.

The perimeter has three walls. One, Routines runs in Anthropic's cloud, not on your machine. It clones a Git repo at the start of each run and that's the entire filesystem it sees. Two, it talks to the outside world through managed connectors (Slack, Linear, Jira, GitHub, GDrive) and an HTTP allowlist (default Trusted, which blocks most external APIs). Three, it has a clean slate every run. No state, no cookies, no persistent session.

Nimbalyst's practical guide arrived at the same line independently: reach for Routines when the work is repo-centric, runs on a schedule, and doesn't need your local environment. Same perimeter, different words.

Once you have those three walls in your head, what follows isn't opinion. It's consequence.

Where Routines Wins (Three Jobs I'm Not Touching Again)

Concession first, because honesty is faster than rhetoric.

These three jobs are better in Routines than they were in my cron. I'm not migrating them back.

PR auto-review. Before: a Python cron that polled GitHub every thirty minutes, ran a claude -p review on any new PR, posted a comment via the API. The polling cron lagged behind every push. After: a Routine triggered on pull_request.opened, runs the moment a PR opens, posts via the GitHub connector. Same prompt, same output quality. Less YAML, no polling waste.

Doc drift weekly. Before: a bash script that listed commits since last Monday, diffed them against the docs folder, fed both into claude -p, emailed me a summary. The summary was always slightly off because the script didn't carry repo-wide context. The model only saw what bash sed-piped into it. After: a Routine that gets the full repo cloned, reads CLAUDE.md, the docs folder, and the commit log natively, and writes a "what changed, what's stale" doc that actually reflects the codebase. The output went from skimmable to useful.

SEO data refresh. No Before. I'd been wanting to set up a weekly pull from my analytics provider, run a model over the deltas, and post a summary to Slack. Every time I sat down to wire it up, something else came in and the YAML never got written. After: a Routine, fifteen minutes of setup, runs every morning. The job that I never built in six months exists now. That's the strongest case for a managed product (the work you'd never get around to doing yourself).

These three share four traits. Repo-centric. Output goes to Slack or GitHub. Frequency is at least one hour. Nothing in the chain touches my local network.

That's exactly the Routines perimeter. My other forty-seven jobs miss at least one of those four. Not one out of forty-seven hits all four.

Six Reasons Routines Can't Replace My Cron Jobs

Six mechanical reasons. Not preferences, not edge cases. Each one closes the door before you finish typing the YAML.

1. Local MCP servers. Routines uses Anthropic's managed connectors. That's it. The MCP server I wrote myself, the one that lives on my machine and exposes my own data to my Claude Code sessions, is not available. Routines can't see it, can't talk to it, can't authenticate against it. Any workflow where the model needs to query something I built locally is out.

2. Services on private IPs. Tailscale mesh. NAS at home. Postgres on the server. Internal monitoring dashboards. Anything sitting on a 100.x address or a 192.168 LAN. Routines runs in Anthropic's cloud. It doesn't know my mesh exists. The fourth job from the opening lives here, and so do nineteen others on my list.

3. Sub-hourly frequency. Routines minimum interval is one hour. My status poller runs every fifteen minutes because that's what the alert window requires. Any job that needs to fire faster than once an hour, mechanically, can't move.

4. Daily quota. Pro is 5 runs per day. Max is 15. Team is 25. I have forty-seven jobs that need to run nightly, plus dailies, plus the sub-hourly ones. Even if every other constraint vanished, I'd hit the quota before midnight on a Max plan. The quota isn't a soft limit you can negotiate (it's the contract).

5. Persistent browser session. Routines spins up a clean environment every run. No cookies, no localStorage, no session carryover. If your job needs to log into a site once and reuse the session, Playwright automation against a service that requires auth, you can't. Nate Herk documented this on Skool when he tried to run a community automation in Routines. The login dies between runs. The job is structurally impossible.

6. Local persistent state. A job that writes to a local SQLite between runs, or maintains a file-based queue, or appends to a long-lived log. Routines starts fresh every time. Whatever your job wrote last run is gone. You can use the connector outputs as state (Linear tickets, GitHub issues), but if your state lives on disk where the cron lives, that's not portable.

A community comment under Anthropic's launch post on Threads put it bluntly: once again github centric features. That's the read from the outside, and it's right, but it's also incomplete. Routines isn't only GitHub, the connectors do more than that. The honest framing is: Routines is repo-centric and managed-connector-centric, and if your work happens outside that perimeter, the tool has nothing to offer you.

Six cases. Not six opinions.

The DIY Pattern That Survives in 2026

If your job lives outside the Routines perimeter, you build it yourself. What follows is the pattern that survives, the parts nobody documents in the dozen "Routines just dropped" tutorials saturating the feed last week.

Use shell redirect, not spawn-with-stdin.

claude -p "Summarize the input as JSON" < input.txt

echo "$LARGE_INPUT" | claude -p "Summarize"

The pipe deadlock is the silent killer. No error, no timeout, just a process hanging on a buffered stdin that never closes. I lost a weekend on this before I traced it. Shell redirect from a file is the only reliable way to feed large input to the binary in non-interactive mode.

Unset ANTHROPIC_API_KEY in your cron environment.

unset ANTHROPIC_API_KEY
claude -p "..."

If ANTHROPIC_API_KEY is set when you call claude -p, the binary uses it and bills your API account. Silently. The auth precedence is documented but easy to miss. You think you're running on your subscription, and actually every cron run is going through pay-per-token. Unset it explicitly. Your wallet will thank you.

Constrain JSON output via prompt, not flag.

Don't trust --output-format json to do the heavy lifting. Tell the model what schema you want, in the prompt, then validate downstream:

claude -p 'Respond ONLY with valid JSON matching: {"status": "ok|fail", "items": [...]}. No prose, no fences.' < input.txt | jq -e .status

If jq -e fails, retry once. If retry fails, alert. The prompt-level contract holds better than the flag in my experience, and you get clean failure modes when the model drifts.

This is also why the DIY pattern doesn't go away when MCP gets richer. The CLI binary stays predictable, deterministic at the shell layer, and it composes with everything else you have. I made the longer argument for CLIs over MCP in agent stacks a few weeks back, and Routines doesn't change the conclusion. CLIs compose. Managed schedulers don't, by design.

Generate a long-lived OAuth token with claude setup-token.

This is the part missing from every cron-with-Claude tutorial I've read.

The claude-code GitHub repo is full of the same complaint. OAuth tokens expire in 8 to 24 hours in --print mode, refresh fails silently, automation dies. The DEV community post "Building Claudio: My Always-On Claude Code Box" walks through exactly this pain. V1 lasted two weeks before tokens expired. V2 abandoned cron entirely and pivoted to a desktop tool.

There's a built-in command that solves this. Run it once, interactively, on the machine where you originally logged in:

claude setup-token

It generates a long-lived OAuth token (one year, inference scope only) designed for CI and unattended scripts. You put it in CLAUDE_CODE_OAUTH_TOKEN. The binary respects it, no refresh dance, no daily re-login.

I don't paste the token into my cron environment. I store it in a secrets manager (I use Infisical, Vault and Doppler and AWS Secrets Manager all do the same job), and the cron pulls it at run time via a machine identity scoped to that one server:

#!/usr/bin/env bash
set -euo pipefail

TOKEN=$(infisical secrets get CLAUDE_CODE_OAUTH_TOKEN --plain)
export CLAUDE_CODE_OAUTH_TOKEN="$TOKEN"
unset ANTHROPIC_API_KEY

claude -p "$(cat prompt.txt)" < input.json | jq -e . > output.json

unset CLAUDE_CODE_OAUTH_TOKEN

The token sits in memory for the duration of the run, then disappears. If the server is compromised, I revoke the machine identity and rotate that one. The Claude OAuth token itself doesn't have to move. That's what keeps a DIY cron stack running without daily re-logins or silent breakage.

A Quick Word on Terms of Service

Anthropic clarified the policy in February 2026: using the Claude Code CLI on your own machine, daemon, cron job, all good. The CLI on your own machine, calling the official binary, is fine.

What got cut in April 2026 was different. Third-party tools spoofing the Claude Code client and using subscription auth to power external products got their access revoked. I lived through that one and rebuilt my setup the week after. The DIY pattern in this article doesn't sit on the wrong side of that line. It's the official binary, on my machine, doing what the binary is designed to do.

Routines is a research preview. The current ToS reading might shift, the quotas might shift, the connector list might shift. Check the docs every couple of months if you're building anything that depends on it. That applies to my pattern too. Local cron with the official binary has been allowed for two years and the position got reaffirmed three months ago, but "currently allowed" is not "permanent."

Three Questions That Decide Where Your Job Goes

Three binary questions. Honest answers. The decision falls out.

1. Does the job live in a Git repo you push to GitHub?
Not "could it." Does it, today, naturally. If no: DIY.

2. Does it need anything outside Anthropic's connectors?
A local MCP, a private IP, a personal database, a persistent browser session, your own filesystem state between runs. If yes: DIY.

3. Does it run more than once an hour, or more than your daily quota?
Sub-hourly polls, dozens of nightly jobs, anything past the plan ceiling. If yes: DIY.

Three no's: Routines, no hesitation, no guilt. It will run that job better than your DIY cron, and you'll save the maintenance.

One yes: stay local. Use the DIY pattern. The DIY half doesn't go away.

Actually, no, let me put it differently. Routines isn't Make, Zapier, or n8n (they're not the same tool). Routines is a scheduler with a perimeter. A Git repo, Anthropic's connectors, a one-hour minimum between runs. What lives outside that perimeter isn't worse Routines 😅

The devs who ship know the difference. You don't put a fifteen-minute poll in a scheduler with a one-hour minimum. You don't put a job that talks to your private mesh in a cloud that doesn't see your private mesh. You don't put a stateful job in a clean-slate environment. That's not taste. That's arithmetic.

Match the tool to the job. Routines is excellent inside its perimeter. Useful, but not the ultimate answer.

Sources

Anthropic, Introducing routines in Claude Code
Claude Code docs, Automate work with routines
Claude Code docs, Authentication
claude-code GitHub, issue #28827, OAuth token refresh fails in non-interactive mode
autonomee.ai, Claude Code Terms of Service Explained
Building Claudio, My Always-On Claude Code Box