<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Max Quimby</title>
    <description>The latest articles on Forem by Max Quimby (@max_quimby).</description>
    <link>https://forem.com/max_quimby</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3823178%2F0a97facc-1e95-494c-9db9-084aa3b35e47.png</url>
      <title>Forem: Max Quimby</title>
      <link>https://forem.com/max_quimby</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/max_quimby"/>
    <language>en</language>
    <item>
      <title>Google's $40B Anthropic Bet: What It Means for Developers</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Sun, 26 Apr 2026 03:39:13 +0000</pubDate>
      <link>https://forem.com/max_quimby/googles-40b-anthropic-bet-what-it-means-for-developers-5g5o</link>
      <guid>https://forem.com/max_quimby/googles-40b-anthropic-bet-what-it-means-for-developers-5g5o</guid>
      <description>&lt;p&gt;Last Thursday, &lt;a href="https://www.bloomberg.com/news/articles/2026-04-24/google-plans-to-invest-up-to-40-billion-in-anthropic" rel="noopener noreferrer"&gt;Google announced&lt;/a&gt; it would invest up to $40 billion in Anthropic — the company behind Claude. The headline is enormous, but the structure of the deal is what developers should actually study. This isn't a standard venture investment. It's a circular finance loop: Google gives Anthropic capital, Anthropic spends that capital on Google Cloud compute, Google books the revenue. The money goes around in a circle, and what comes out the other end is 5 gigawatts of dedicated AI compute locked to the Google TPU stack.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://computeleap.com/blog/google-40b-anthropic-investment-circular-deal-developers" rel="noopener noreferrer"&gt;Read the full version with charts and embedded sources on ComputeLeap →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For developers building on the Claude API, this matters more than it looks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47892074" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdhpbvgum5e9159opj30.png" alt="Hacker News: Google plans to invest up to $40B in Anthropic — 798 points, 798 comments" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deal Structure, Decoded
&lt;/h2&gt;

&lt;p&gt;The $40 billion breaks into two tranches. &lt;a href="https://techcrunch.com/2026/04/24/google-to-invest-up-to-40b-in-anthropic-in-cash-and-compute/" rel="noopener noreferrer"&gt;TechCrunch reported&lt;/a&gt; that $10 billion is immediate cash at a $350 billion valuation for Anthropic. The remaining $30 billion is contingent — tied to undisclosed performance milestones that function as options Google can exercise over time.&lt;/p&gt;

&lt;p&gt;That 25/75 split between immediate and contingent capital matters. The immediate $10B is real capital. The $30B contingent tranche is more accurately described as a multi-year compute credit facility dressed up as an investment. &lt;a href="https://www.ghacks.net/2026/04/25/google-plans-to-invest-up-to-40-billion-in-anthropic-in-two-phase-deal-tied-to-performance-targets/" rel="noopener noreferrer"&gt;gHacks&lt;/a&gt; describes it as "a hybrid of Microsoft's OpenAI playbook and the cloud-credit model Amazon used in 2023 — equity capital flows out, but the bulk cycles back into Google Cloud as TPU spend over a multi-year horizon."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ghacks.net/2026/04/25/google-plans-to-invest-up-to-40-billion-in-anthropic-in-two-phase-deal-tied-to-performance-targets/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7et1oep3iadkd7dxvwm.png" alt="gHacks: Google Plans to Invest Up to $40 Billion in Anthropic in Two-Phase Deal" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This comes just days after Amazon announced its own &lt;a href="https://computeleap.com/blog/anthropic-100b-aws-claude-dominance-6-month-clock-2026" rel="noopener noreferrer"&gt;$33 billion Anthropic deal&lt;/a&gt; — with a separate $100 billion compute commitment to AWS infrastructure. In under 100 hours, Anthropic collected $65+ billion in fresh pledges from its two largest cloud partners.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ &lt;strong&gt;The circular deal, simplified:&lt;/strong&gt; Google gives Anthropic $40B → Anthropic buys Google Cloud TPUs → Google books cloud revenue. The investment is also a guaranteed customer acquisition for Google's infrastructure business.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why "Circular" — And Why It Matters
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.humai.blog/google-just-gave-anthropic-40-billion-anthropic-will-spend-it-on-google/" rel="noopener noreferrer"&gt;Humai blog&lt;/a&gt; published the clearest diagnosis of the deal structure: "The $40 billion is, in practical terms, a very expensive customer acquisition cost — paid in advance, recorded as an investment, and recouped through cloud bills nobody outside the deal will ever audit."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.humai.blog/google-just-gave-anthropic-40-billion-anthropic-will-spend-it-on-google/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp74p5s77zhshbfkm8f2.png" alt="Humai Blog analysis: Google's $40B Anthropic Deal is Circular Finance, Not Investment" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That framing went viral on Hacker News, where the &lt;a href="https://news.ycombinator.com/item?id=47892074" rel="noopener noreferrer"&gt;story topped 798 points and 798 comments&lt;/a&gt; — the platform's top story on April 24. The community immediately noted that Anthropic is now what one commenter called "a MicroAmaGooVidia amalgamation" — simultaneously backed by Microsoft, Amazon, Google, and dependent on all three for compute.&lt;/p&gt;

&lt;p&gt;The circular structure isn't new — Amazon's 2023 investment used the same cloud-credit playbook. But the scale is novel. The cumulative concentration of hyperscaler-AI lab partnerships (Microsoft–OpenAI, Google–Anthropic, Amazon–Anthropic) has grown large enough that analysts note the FTC, DOJ, and European Commission are likely to revisit the structure.&lt;/p&gt;

&lt;p&gt;For developers, the circular nature matters for one specific reason: it means Anthropic's compute access is now &lt;em&gt;structurally guaranteed&lt;/em&gt; by capital agreements, not just purchasing relationships. That's a different kind of stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic Gets: 5 Gigawatts and a Roadmap
&lt;/h2&gt;

&lt;p&gt;The concrete deliverable from this deal isn't the $40 billion number — it's the 5 gigawatts of dedicated compute capacity that Google Cloud will provide over five years. &lt;a href="https://www.anthropic.com/news/google-broadcom-partnership-compute" rel="noopener noreferrer"&gt;Anthropic's own announcement&lt;/a&gt; notes this builds on a separate Broadcom partnership for 3.5 gigawatts of next-generation TPU capacity coming online in 2027.&lt;/p&gt;

&lt;p&gt;Combine the two commitments and you get a picture of Anthropic's training substrate for the next 3–5 years: a massive TPU-first infrastructure that validates Google's chips as a credible alternative to Nvidia for frontier model training. The financial backdrop makes the compute question urgent. &lt;a href="https://sacra.com/c/anthropic/" rel="noopener noreferrer"&gt;Sacra's research&lt;/a&gt; shows Anthropic's revenue grew from $1 billion annualized in December 2024 to $30 billion in April 2026 — a 30x increase in 16 months. Business customers spending over $1 million annually doubled from 500 to 1,000 in under two months. Claude Code alone reached $2.5 billion in annualized billings. Demand is outrunning supply.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mythos: The Model This Compute Is Built For
&lt;/h2&gt;

&lt;p&gt;There's a specific model behind the compute math. &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/claude-mythos-preview-on-vertex-ai" rel="noopener noreferrer"&gt;Google Cloud's blog announced&lt;/a&gt; Claude Mythos in private preview on Vertex AI as part of "Project Glasswing" in early April. &lt;a href="https://sherwood.news/tech/report-despite-blacklisting-nsa-currently-using-anthropics-mythos-model/" rel="noopener noreferrer"&gt;Sherwood News reported&lt;/a&gt; that Mythos — internally codenamed "Capybara" — is described in Anthropic's red-team disclosures as "a step change" above Opus 4.6, with pricing in the gated preview at $25 per million input tokens and $125 per million output tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sherwood.news/tech/report-despite-blacklisting-nsa-currently-using-anthropics-mythos-model/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk09q1tpw9defonrnxzzc.png" alt="Sherwood News: NSA is currently using Anthropic's unreleased Mythos model despite blacklisting" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We already have a &lt;a href="https://computeleap.com/blog/claude-mythos-preview-project-glasswing-cybersecurity" rel="noopener noreferrer"&gt;deep look at Claude Mythos and Project Glasswing&lt;/a&gt; on ComputeLeap. The key new data point: Mythos is already being used in production by the NSA despite official blacklisting — a signal that the model's capability premium is significant enough to override institutional friction.&lt;/p&gt;

&lt;p&gt;Prediction markets are watching closely. &lt;a href="https://polymarket.com/event/which-company-has-the-best-ai-model-end-of-april" rel="noopener noreferrer"&gt;Polymarket's&lt;/a&gt; "Which company has the best AI model end of April?" market has $18.5M in volume, with Anthropic currently at ~90% implied probability — even as DeepSeek V4, GPT-5.5, and Meta Muse Spark all launched in the same 72-hour window this week.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://polymarket.com/event/which-company-has-the-best-ai-model-end-of-april" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkhs3ocnkxa4m0p2bx8e.png" alt="Polymarket: Anthropic at 90% implied probability for best AI model end of April — $18.5M in volume" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;For developers:&lt;/strong&gt; Claude Mythos is accessible now via Vertex AI for approved enterprise accounts. If you're building on Google Cloud, apply for Project Glasswing access — this is the earliest path to the next frontier tier before public availability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What It Means for Developers: Capacity, Rate Limits, and Platform Choice
&lt;/h2&gt;

&lt;p&gt;The practical developer question is straightforward: will this make Claude faster to call, higher-limit, and more reliable?&lt;/p&gt;

&lt;p&gt;The short answer is yes — but not immediately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tokencalculator.com/blog/claude-api-rate-limits-april-2026" rel="noopener noreferrer"&gt;Current Claude API rate limits&lt;/a&gt; reflect a compute-constrained environment. Tier 1 developers get 50 requests per minute and 30,000 tokens per minute. Tier 4 (requiring $400 in cumulative credits) reaches 4,000 RPM and 2,000,000 ITPM. The &lt;a href="https://computeleap.com/blog/claude-code-quota-limits-billing-changes-2026" rel="noopener noreferrer"&gt;Claude Code rate limits guide&lt;/a&gt; covers the developer-side mechanics in detail.&lt;/p&gt;

&lt;p&gt;New infrastructure takes 12–24 months to translate into available capacity. The 5GW Google committed and the 3.5GW Broadcom deal (starting 2027) won't relieve rate pressure until late 2026 at earliest. But the trajectory is clear: Anthropic is building a compute foundation sized for the next order of magnitude of demand.&lt;/p&gt;

&lt;p&gt;There's also a platform availability angle that's underappreciated. Anthropic is now &lt;a href="https://www.anthropic.com/news/google-broadcom-partnership-compute" rel="noopener noreferrer"&gt;the only frontier AI lab with native integrations across all three major cloud platforms&lt;/a&gt;: AWS Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry. If you're an enterprise developer already committed to any of the big three clouds, Claude is there — and the investment locks in that availability for years.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ &lt;strong&gt;Multi-cloud positioning:&lt;/strong&gt; Anthropic's presence on AWS Bedrock, Google Vertex AI, and Azure Foundry means enterprise developers don't have to migrate infrastructure to access Claude. This is a meaningful competitive moat that OpenAI (primarily Microsoft/Azure-aligned) doesn't match.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Regulatory Overhang
&lt;/h2&gt;

&lt;p&gt;One signal developers building on Claude should track: regulatory scrutiny of these hyperscaler-AI lab partnerships is coming. The FTC, DOJ, and EU Commission are likely to revisit the structure of Microsoft–OpenAI, Google–Anthropic, and Amazon–Anthropic simultaneously.&lt;/p&gt;

&lt;p&gt;The risk for developers isn't that Claude goes away. It's that regulatory action could constrain how these deals are structured going forward — potentially affecting compute availability SLAs, pricing tiers, or multi-cloud access. Worth watching, but not worth panicking over for most development teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Position Your Claude App for the Capacity Wave
&lt;/h2&gt;

&lt;p&gt;If you're building production applications on Claude, the Google deal changes your planning horizon:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short-term (now → Q3 2026):&lt;/strong&gt; Capacity is still constrained. Use prompt caching aggressively — cache-read tokens don't count against your input-tokens-per-minute limit, effectively multiplying your throughput while cutting input costs. Route lower-stakes tasks to Claude Haiku 4.5, which has more generous limits. Use the Batch API for non-real-time workloads at 50% of standard cost.&lt;/p&gt;
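
&lt;p&gt;A minimal sketch of that short-term posture, assuming the &lt;code&gt;anthropic&lt;/code&gt; Python SDK: the &lt;code&gt;cache_control&lt;/code&gt; block marks a large, stable system prompt for reuse, and the routing between the Opus and Haiku model ids (both illustrative here) keeps low-stakes traffic on the cheaper, higher-limit model:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

# A large, stable prefix (style guide, schema docs, etc.) is the ideal caching target.
LONG_SYSTEM_PROMPT = open("style_guide.md").read()

def ask(question, high_stakes=False):
    """Route low-stakes work to Haiku and reuse the cached system prompt."""
    model = "claude-opus-4-6" if high_stakes else "claude-haiku-4-5"  # illustrative ids
    return client.messages.create(
        model=model,
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache hits bypass most input cost
        }],
        messages=[{"role": "user", "content": question}],
    )
&lt;/code&gt;&lt;/pre&gt;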

&lt;p&gt;&lt;strong&gt;Medium-term (Q4 2026 → 2027):&lt;/strong&gt; New Google Cloud capacity starts coming online. Rate limit tiers should expand meaningfully. If you're currently hitting walls at Tier 2 or Tier 3, plan for those ceilings to rise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-term (2027+):&lt;/strong&gt; The 3.5GW Broadcom TPU deal comes online. This is Mythos-scale compute — the infrastructure that trains and runs models well above current pricing tiers.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://computeleap.com/blog/anthropic-vs-openai-api-developer-platform-2026" rel="noopener noreferrer"&gt;Anthropic vs. OpenAI platform comparison&lt;/a&gt; covers how these compute roadmaps translate into API feature differences — worth revisiting with this investment context in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Google's $40B investment in Anthropic is simultaneously a capital event, a compute lockup, and a signal about where frontier AI infrastructure is headed. The circular structure isn't a flaw — it's the point. Hyperscalers are discovering that the most effective way to guarantee demand for their own compute infrastructure is to fund the companies that need the most compute.&lt;/p&gt;

&lt;p&gt;For developers, the practical read is this: Anthropic is better capitalized and better infrastructure-secured than it has ever been. The models getting trained on 5 gigawatts of Google TPUs over the next five years will be substantially more capable than what's available today. The question isn't whether Claude will have compute — it's whether you're building on a platform positioned to scale with it.&lt;/p&gt;

&lt;p&gt;The capacity wave is coming. The capital to fund it just got committed.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://computeleap.com/blog/google-40b-anthropic-investment-circular-deal-developers" rel="noopener noreferrer"&gt;ComputeLeap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>anthropic</category>
      <category>ai</category>
      <category>developers</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>GStack: Turn Claude Code Into a Full Engineering Team</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Sat, 25 Apr 2026 04:12:55 +0000</pubDate>
      <link>https://forem.com/max_quimby/gstack-turn-claude-code-into-a-full-engineering-team-1c7e</link>
      <guid>https://forem.com/max_quimby/gstack-turn-claude-code-into-a-full-engineering-team-1c7e</guid>
      <description>&lt;p&gt;The first time you type &lt;code&gt;/office-hours&lt;/code&gt; into Claude Code with GStack installed, something strange happens. The AI stops acting like a helpful coding assistant and starts acting like a skeptical product manager who thinks your feature idea is probably wrong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://agentconn.com/blog/gstack-claude-code-harness-open-source-2026" rel="noopener noreferrer"&gt;Read the full version on AgentConn →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the design. And it is why &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;GStack&lt;/a&gt; — Garry Tan's open-source Claude Code skill setup — has accumulated 82,700 stars and 12,000 forks on GitHub since its March 2026 launch.&lt;/p&gt;

&lt;p&gt;For context: Garry Tan is the President and CEO of Y Combinator. When the person who has reviewed more startups than almost anyone else on earth open-sources the exact AI development workflow that runs his code, developers pay attention. They also argue about it extensively on &lt;a href="https://news.ycombinator.com/item?id=47418576" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/garrytan/status/2032014570118922347" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh81gsfxjhp4buwswrh50.png" alt="Garry Tan tweet: I've been having such an amazing time with Claude Code I wanted you to be able to have my exact skill setup — Introducing gstack" width="548" height="847"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This guide explains what GStack actually does, how it compares to &lt;a href="https://github.com/code-yeongyu/oh-my-openagent" rel="noopener noreferrer"&gt;oh-my-openagent&lt;/a&gt; and other harnesses, why the "it's just prompts" criticism misses the point, and whether it belongs in your workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  What GStack Actually Does: The 23 Skills
&lt;/h2&gt;

&lt;p&gt;GStack is not a new coding assistant. It is a collection of CLAUDE.md skills — structured instructions that give Claude Code specialist personas. Install it in your project, and Claude Code gains access to 23 tools that simulate an engineering team.&lt;/p&gt;

&lt;p&gt;The roles divide into recognizable job functions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planning and Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/office-hours&lt;/code&gt; — Product interrogation with forcing questions. Challenges your idea before you build it. The "skeptical PM" experience.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/plan-ceo-review&lt;/code&gt; — Strategic scope challenge. Asks whether you are solving the right problem.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/plan-eng-review&lt;/code&gt; — Architecture and testing challenge. Finds the assumptions in your technical plan.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/plan-design-review&lt;/code&gt; — Design system audit. Catches "AI slop" — visual patterns that look fine locally but break at scale.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/plan-devex-review&lt;/code&gt; — Developer experience review of the plan.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/autoplan&lt;/code&gt; — Runs CEO, Engineering, and DevEx review in sequence automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Design and Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/design-consultation&lt;/code&gt;, &lt;code&gt;/design-shotgun&lt;/code&gt;, &lt;code&gt;/design-html&lt;/code&gt; — Design guidance at various fidelity levels.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/review&lt;/code&gt; — Code review targeting security issues, bugs, and architectural concerns.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/investigate&lt;/code&gt; — Root-cause debugging with structured reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Testing and Quality&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/qa&lt;/code&gt; — Live browser testing with fixes applied inline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/qa-only&lt;/code&gt; — Bug reporting without code modification.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/cso&lt;/code&gt; — Security audit applying OWASP Top 10 and STRIDE threat modeling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Release and Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/ship&lt;/code&gt;, &lt;code&gt;/land-and-deploy&lt;/code&gt;, &lt;code&gt;/document-release&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Additional Tools&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/browse&lt;/code&gt;, &lt;code&gt;/canary&lt;/code&gt;, &lt;code&gt;/benchmark&lt;/code&gt;, &lt;code&gt;/retro&lt;/code&gt;, &lt;code&gt;/codex&lt;/code&gt;, &lt;code&gt;/pair-agent&lt;/code&gt;, &lt;code&gt;/learn&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;/codex&lt;/code&gt; skill adds OpenAI Codex as a parallel review engine inside Claude Code, giving you cross-model code review without leaving your terminal.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Conductor: Parallelizing Everything
&lt;/h2&gt;

&lt;p&gt;The Conductor coordinates multiple Claude Code sessions running simultaneously in isolated workspaces. One session running &lt;code&gt;/office-hours&lt;/code&gt; on a new idea, another doing &lt;code&gt;/review&lt;/code&gt; on an open PR, a third implementing a feature, a fourth running &lt;code&gt;/qa&lt;/code&gt; on staging — each in its own git worktree with its own context window.&lt;/p&gt;

&lt;p&gt;This is the part that makes GStack genuinely novel compared to a folder of CLAUDE.md prompts. Conductor is multi-agent orchestration built into the harness — not a separate tool you have to wire up yourself.&lt;/p&gt;
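
&lt;p&gt;GStack's own Conductor code isn't reproduced here, but the underlying pattern (one isolated git worktree per concurrent session) is easy to sketch. A hypothetical, minimal version in Python, where the branch prefix, workspace naming, and task list are all made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import subprocess
from pathlib import Path

REPO = Path(".").resolve()

def new_workspace(task_slug):
    """Create an isolated git worktree so each agent session gets its own checkout."""
    branch = "agent/" + task_slug
    path = REPO.parent / ("ws-" + task_slug)
    subprocess.run(["git", "worktree", "add", "-b", branch, str(path)], check=True)
    return path

# One workspace per parallel concern, mirroring the Conductor pattern.
for task in ["office-hours-idea", "pr-review", "feature-build", "qa-staging"]:
    workspace = new_workspace(task)
    print("start a Claude Code session in", workspace)
&lt;/code&gt;&lt;/pre&gt;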




&lt;h2&gt;
  
  
  The Productivity Claim: 810×
&lt;/h2&gt;

&lt;p&gt;Garry Tan reports his 2026 development pace at approximately 810× his 2013 baseline (11,417 logical lines/day vs 14). Key caveats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The metric is "logical LOC," not raw lines.&lt;/strong&gt; Logical LOC measures meaningful changes — new behaviors, not reformatted whitespace. This is a more honest metric than it first appears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 2013 baseline is a single-developer comparison.&lt;/strong&gt; Tan is comparing his own pre-AI vs. post-AI productivity. Not a controlled experiment, but an honest data point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It doesn't hold for all workflows.&lt;/strong&gt; The &lt;a href="https://techcrunch.com/2026/03/17/why-garry-tans-claude-code-setup-has-gotten-so-much-love-and-hate/" rel="noopener noreferrer"&gt;TechCrunch analysis&lt;/a&gt; notes developers working on hardware-adjacent code or regulated domains see much smaller gains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47418576" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99uuizqy44wfwrkpbwqe.png" alt="HN thread: Garry Tan's Claude Code Setup — 74 points, 79 comments" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Just Prompts" Criticism
&lt;/h2&gt;

&lt;p&gt;The most common dismissal: GStack is "a bunch of prompts in a text file." This criticism is partially correct and mostly misses the point.&lt;/p&gt;

&lt;p&gt;It is correct that the individual skills are structured prompts. There's no compiled code, nothing that prevents you from reading every CLAUDE.md instruction.&lt;/p&gt;

&lt;p&gt;What the criticism misses is that &lt;strong&gt;the value is in the system design, not the technology&lt;/strong&gt;. The insight is architectural: separating planning from implementation, using adversarial reviewing roles, and enforcing security audits as a default step before shipping. These are software engineering principles applied to AI agent orchestration.&lt;/p&gt;

&lt;p&gt;The CTO testimonial Garry Tan shared is worth taking at face value:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/garrytan/status/2032196172430131498" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhh7z6x6vqsu9yeze5mkh.png" alt="Garry Tan quoting a CTO: Your eng review discovered a subtle XSS attack that I don't even think my team is aware of" width="548" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A security audit that runs automatically before every merge is not "just a prompt." It is a default gate that most teams skip under schedule pressure. GStack makes skipping it harder than doing it.&lt;/p&gt;




&lt;h2&gt;
  
  
  GStack vs the Field
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;GStack&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;oh-my-openagent&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;GSD&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;cc-switch&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;82.7K&lt;/td&gt;
&lt;td&gt;53.9K&lt;/td&gt;
&lt;td&gt;35K&lt;/td&gt;
&lt;td&gt;54K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model lock-in&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code only&lt;/td&gt;
&lt;td&gt;Multi-model&lt;/td&gt;
&lt;td&gt;Claude Code first&lt;/td&gt;
&lt;td&gt;Model-agnostic config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Specialist roles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;23 skills&lt;/td&gt;
&lt;td&gt;11 agents&lt;/td&gt;
&lt;td&gt;Spec-driven only&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parallel sessions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Conductor)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Install complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30 seconds (paste)&lt;/td&gt;
&lt;td&gt;npm install&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;CLI install&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/code-yeongyu/oh-my-openagent" rel="noopener noreferrer"&gt;oh-my-openagent&lt;/a&gt; routes tasks to the best model — if you need DeepSeek for cost-sensitive tasks and Claude for hard reasoning, OmO handles the routing. GStack is entirely Claude Code native.&lt;/p&gt;




&lt;h2&gt;
  
  
  When GStack Wins, When It Doesn't
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GStack is best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solo developers building SaaS or web products without a senior team&lt;/li&gt;
&lt;li&gt;Early-stage startups without dedicated QA, security reviewer, or architect&lt;/li&gt;
&lt;li&gt;Developers already on Claude Code — zero-friction install&lt;/li&gt;
&lt;li&gt;Teams shipping fast who default to skipping review steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GStack is probably wrong for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams needing multi-model routing (OmO is better)&lt;/li&gt;
&lt;li&gt;Teams with mature code review culture (GStack replaces informal processes)&lt;/li&gt;
&lt;li&gt;Developers on OpenCode or other non-Claude agents (GStack is CLAUDE.md-native)&lt;/li&gt;
&lt;li&gt;Embedded, firmware, or highly regulated domains&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Getting Started in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;GStack lives at &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;github.com/garrytan/gstack&lt;/a&gt;. To install, open Claude Code and type &lt;code&gt;Install GStack&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Your first three commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/office-hours&lt;/code&gt; — Challenge your current feature idea&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/cso&lt;/code&gt; — Security audit on your last commit&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/autoplan&lt;/code&gt; — CEO, Eng, and DevEx review your next technical plan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://x.com/garrytan/status/2037355994838429849" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4t156z0wj49v89mxjpb4.png" alt="Garry Tan: 50k stars and it feels so good — type install gstack into claude code right now" width="548" height="566"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;GStack implements software engineering best practices — adversarial review, security auditing, design critique, spec challenge — as default steps in your Claude Code workflow. Steps that solo developers skip not because they are bad engineers but because there is nobody else in the room.&lt;/p&gt;

&lt;p&gt;If you are a Claude Code user building a product, install it. The 30-second install cost is trivially small relative to finding a single XSS vulnerability before it ships to production.&lt;/p&gt;

&lt;p&gt;The frontier in AI-assisted development is not a better autocomplete. It is a well-designed team of reviewers who catch the mistakes you were going to make anyway.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://agentconn.com/blog/gstack-claude-code-harness-open-source-2026" rel="noopener noreferrer"&gt;AgentConn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>devtools</category>
      <category>opensource</category>
    </item>
    <item>
      <title>DeepSeek V4 vs GPT-5.5 vs Claude Opus 4.7: Model Guide</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Sat, 25 Apr 2026 03:34:22 +0000</pubDate>
      <link>https://forem.com/max_quimby/deepseek-v4-vs-gpt-55-vs-claude-opus-47-model-guide-29nb</link>
      <guid>https://forem.com/max_quimby/deepseek-v4-vs-gpt-55-vs-claude-opus-47-model-guide-29nb</guid>
      <description>&lt;p&gt;Today is the most chaotic single day in the 2026 AI model race.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://computeleap.com/blog/deepseek-v4-vs-gpt-55-vs-claude-opus-47-model-comparison-2026" rel="noopener noreferrer"&gt;Read the full version with charts and embedded sources on ComputeLeap →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Within a 24-hour window, OpenAI shipped &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;GPT-5.5&lt;/a&gt; — its most capable API model yet, with a 74% long-context score that doubles its predecessor — and DeepSeek responded within hours with two open-source models: &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;V4-Pro&lt;/a&gt; (1.6 trillion parameters, MIT license) and V4-Flash (284 billion, equally open). Claude Opus 4.7, which launched April 16, has been the dominant coding model since. Now it has two new challengers on the same day.&lt;/p&gt;

&lt;p&gt;The timing is not a coincidence. It rarely is at this level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/deepseek_ai/status/2047516922263285776" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ct6wmjta94mewzsz3lx.png" alt="DeepSeek V4 launch tweet — 36.2K likes, 8.4K RTs" width="548" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For developers, this isn't an academic benchmark exercise. The question is practical: &lt;strong&gt;for your next sprint, which model do you route each task to?&lt;/strong&gt; This guide gives you the data and the decision framework.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ Read our &lt;a href="https://computeleap.com/blog/gpt-5-5-vs-claude-code-agentic-coding-ai-2026" rel="noopener noreferrer"&gt;GPT-5.5 vs Claude Code deep dive&lt;/a&gt; for the coding-specific head-to-head from yesterday's launch. This article covers the broader model selection question.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Just Dropped: The Models at a Glance
&lt;/h2&gt;

&lt;p&gt;Before comparing, here's what we're actually comparing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;DeepSeek V4-Pro&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total params&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.6T&lt;/td&gt;
&lt;td&gt;284B&lt;/td&gt;
&lt;td&gt;Undisclosed&lt;/td&gt;
&lt;td&gt;Undisclosed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Active params&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;49B&lt;/td&gt;
&lt;td&gt;13B&lt;/td&gt;
&lt;td&gt;Undisclosed&lt;/td&gt;
&lt;td&gt;Undisclosed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT (open weights)&lt;/td&gt;
&lt;td&gt;MIT (open weights)&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.74/M&lt;/td&gt;
&lt;td&gt;$0.14/M&lt;/td&gt;
&lt;td&gt;$5.00/M&lt;/td&gt;
&lt;td&gt;$5.00/M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$3.48/M&lt;/td&gt;
&lt;td&gt;$0.28/M&lt;/td&gt;
&lt;td&gt;$30.00/M&lt;/td&gt;
&lt;td&gt;$25.00/M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-hostable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architecture story behind these numbers: DeepSeek uses Mixture-of-Experts (MoE), which is why a 1.6 trillion parameter model only activates 49 billion parameters per token. &lt;a href="https://simonwillison.net/2026/apr/24/deepseek-v4/" rel="noopener noreferrer"&gt;Simon Willison notes&lt;/a&gt; that V4-Flash achieves only 10% of the single-token FLOPs and 7% of the KV cache size of its predecessor — that's what enables the aggressive pricing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/sama/status/2047787124846653895" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryrnjez3lrg31xtatnzy.png" alt="Sam Altman announcing GPT-5.5 and GPT-5.5 Pro now available in the API" width="548" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ DeepSeek V4 runs entirely on Huawei chips with zero CUDA dependency. This matters beyond hardware specs: the inference pipeline isn't subject to US export control disruption, a meaningful consideration for enterprise planning.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Benchmark Breakdown: Who Wins Where
&lt;/h2&gt;

&lt;p&gt;Raw benchmark numbers are imperfect, but they're what we have. Here's the honest picture:&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligence Index (Artificial Analysis)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GPT-5.5: &lt;strong&gt;60 points&lt;/strong&gt; (&lt;a href="https://artificialanalysis.ai/models/comparisons/deepseek-v4-pro-high-vs-gpt-5-5" rel="noopener noreferrer"&gt;top score&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Claude Opus 4.7: &lt;strong&gt;57 points&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;DeepSeek V4-Pro: competitive, positioned between the two above&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GPT-5.5 leads the overall intelligence index — but three points over Claude Opus 4.7 is a margin unlikely to be decisive in most production workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coding (SWE-Bench Verified)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Claude Opus 4.7: &lt;strong&gt;87.6%&lt;/strong&gt; (+6.8 points over Opus 4.6)&lt;/li&gt;
&lt;li&gt;DeepSeek V4-Pro: &lt;strong&gt;80.6%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-5.5 Pro: &lt;strong&gt;58.6%&lt;/strong&gt; (notably behind)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 For pure coding tasks, Claude Opus 4.7 holds the highest verified score at 87.6% on SWE-bench. DeepSeek V4-Pro is competitive at 80.6%. GPT-5.5 Pro trails at 58.6% — a surprising gap given its overall intelligence lead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;DeepSeek also leads on terminal-level coding: V4-Pro scores 67.9% on Terminal-Bench 2.0 vs Claude at 65.4%. These are close enough that real-world workload matters more than the benchmark gap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hard Reasoning (Humanity's Last Exam, no tools)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Claude Opus 4.7: &lt;strong&gt;46.9%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-5.5 Pro: &lt;strong&gt;43.1%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-5.5: &lt;strong&gt;41.4%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;DeepSeek V4-Pro: &lt;strong&gt;37.7%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the most revealing split. For tasks requiring genuine hard reasoning — the kind where neither model has a template to pattern-match from — Claude Opus 4.7 leads by 9 points over DeepSeek. That's a meaningful gap for legal, financial analysis, or complex research workloads. (&lt;a href="https://fundaai.substack.com/p/deepdeepseek-v4-vs-claude-vs-gpt" rel="noopener noreferrer"&gt;FundaAI 38-task benchmark&lt;/a&gt;)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ GPT-5.5's documented 86% hallucination rate — per &lt;a href="https://the-decoder.com/gpt-5-5-tops-benchmarks-but-still-hallucinates-frequently-and-costs-20-percent-more-over-the-api/" rel="noopener noreferrer"&gt;The Decoder's independent testing&lt;/a&gt; — is a significant weakness despite its top intelligence index score. For factual grounding, Claude Opus 4.7 or DeepSeek V4-Pro are more reliable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Long-Context Reasoning (MRCR v2 at 1M tokens)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GPT-5.5: &lt;strong&gt;74.0%&lt;/strong&gt; (up from 36.6% in GPT-5.4 — an extraordinary jump)&lt;/li&gt;
&lt;li&gt;Claude Opus 4.7: strong, but only supports 200K native context&lt;/li&gt;
&lt;li&gt;DeepSeek V4-Pro: 1M context native, performance data pending&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GPT-5.5 long-context improvement is the headline technical achievement of this launch. If your workload involves very long document processing, GPT-5.5's long-context reasoning may be worth the price premium.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47879092" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztpf1mg56pclkiy9ixto.png" alt="HN thread: GPT-5.5 — 1,493 points on launch day" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost Math
&lt;/h2&gt;

&lt;p&gt;This is where DeepSeek V4 becomes genuinely disruptive. Let's make the numbers concrete for a typical development team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assume 100M output tokens per month&lt;/strong&gt; (a moderately active team with LLM-intensive workflows):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Monthly Output Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5 Pro&lt;/td&gt;
&lt;td&gt;$18,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$3,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$2,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4-Pro&lt;/td&gt;
&lt;td&gt;$348&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4-Flash&lt;/td&gt;
&lt;td&gt;$28&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DeepSeek V4-Pro at $348 vs Claude Opus 4.7 at $2,500 is a &lt;strong&gt;7× cost difference&lt;/strong&gt; for near-comparable coding performance. V4-Pro's output cost versus &lt;a href="https://decrypt.co/365455/deepseek-v4-launch-pro-version-costs-less-gpt-5-pro" rel="noopener noreferrer"&gt;GPT-5.5 Pro is a 98% reduction&lt;/a&gt;.&lt;/p&gt;
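
&lt;p&gt;The table is simple arithmetic, and it is worth making reproducible for your own volumes. A small sketch using the per-million output rates quoted in this article (the GPT-5.5 Pro figure is implied by the $18,000 row above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Output price per million tokens, as quoted in this comparison (April 2026).
OUTPUT_PRICE_PER_M = {
    "GPT-5.5 Pro": 180.00,
    "GPT-5.5": 30.00,
    "Claude Opus 4.7": 25.00,
    "DeepSeek V4-Pro": 3.48,
    "DeepSeek V4-Flash": 0.28,
}

def monthly_output_cost(tokens_per_month, model):
    """Cost = (tokens / 1,000,000) * price per million."""
    return tokens_per_month / 1_000_000 * OUTPUT_PRICE_PER_M[model]

for name in OUTPUT_PRICE_PER_M:
    # The 100M-tokens-per-month scenario from the table above.
    print(name, round(monthly_output_cost(100_000_000, name), 2))
&lt;/code&gt;&lt;/pre&gt;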

&lt;p&gt;For teams already running on a budget, we covered a similar cost calculus in &lt;a href="https://computeleap.com/blog/kimi-k2-6-vs-claude-opus-47-open-source-chinese-ai-model-comparison-2026" rel="noopener noreferrer"&gt;our Kimi K2.6 vs Claude Opus 4.7 comparison&lt;/a&gt; — the pattern of Chinese open-source models delivering 80–90% of the capability at a fraction of the cost is now a structural feature of the AI market, not an anomaly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ With cached input, the gap widens further. DeepSeek-V4-Pro's cache-hit cost is roughly one-tenth of GPT-5.5 and one-eighth of Claude Opus 4.7 at scale. If your architecture reuses prompt prefixes, the savings compound aggressively.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The 1M Context Window: What It Actually Changes
&lt;/h2&gt;

&lt;p&gt;Both DeepSeek V4 models and GPT-5.5 ship with 1M token context windows. Claude Opus 4.7 caps at 200K. The practical implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What 1M tokens enables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feeding an entire 500-page technical specification into a single prompt&lt;/li&gt;
&lt;li&gt;Six months of project documentation without chunking&lt;/li&gt;
&lt;li&gt;A full codebase (~750,000 words of active context)&lt;/li&gt;
&lt;li&gt;Multi-step agent workflows where the model retains chain-of-thought across 20+ tool calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DeepSeek V4 introduces "interleaved thinking" — full chain-of-thought retention across tool calls in agent workflows. This means a 20-step agent workflow doesn't suffer the amnesia-halfway-through problem that plagues most agentic pipelines.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://news.ycombinator.com/item?id=47884971" rel="noopener noreferrer"&gt;HN discussion at 1,588 points&lt;/a&gt; surfaced a key practical detail: DeepSeek's zero CUDA dependency makes it runnable in environments where Nvidia GPUs aren't available — relevant for enterprise deployments on private infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47884971" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcefczw8ynf9gdhj3gel.png" alt="HN thread: DeepSeek V4 — 1,588 points, top story of the day" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Polymarket Divergence: Hype vs. Market Confidence
&lt;/h2&gt;

&lt;p&gt;Here's the contrarian data point most coverage will skip.&lt;/p&gt;

&lt;p&gt;The Polymarket market "DeepSeek V4 released by...?" had &lt;a href="https://polymarket.com/event/deepseek-v4-released-by-march-31" rel="noopener noreferrer"&gt;$2.4 million in trading volume&lt;/a&gt; and resolved at 100% — traders called the release date correctly. Developer enthusiasm is genuine.&lt;/p&gt;

&lt;p&gt;But Polymarket's "Best Chinese AI company 2026" market? &lt;strong&gt;DeepSeek sits at 3%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That divergence — maximum developer excitement, minimal market confidence in DeepSeek as a &lt;em&gt;company&lt;/em&gt; — is worth sitting with. Some reasons the market might be right:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open-source models generate developer mindshare but not revenue&lt;/li&gt;
&lt;li&gt;DeepSeek's pricing is so aggressive it may be below sustainable margin&lt;/li&gt;
&lt;li&gt;US export restrictions on Nvidia GPUs create a hardware ceiling for scale&lt;/li&gt;
&lt;li&gt;Anthropic holds ~85% in the "best coding AI company" Polymarket market — consensus hasn't shifted despite DeepSeek's coding scores&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ The pattern is clear: Chinese AI labs use open-source releases to win global developer mindshare while keeping closed models for domestic enterprise. Two major drops in one day (DeepSeek V4 plus Tencent Hy3 at 295B parameters) is not a coincidence. (&lt;a href="https://x.com/TencentHunyuan/status/2047347774501634251" rel="noopener noreferrer"&gt;Tencent Hy3 launch&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Decision Framework: Routing Logic by Task Type
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://fundaai.substack.com/p/deepdeepseek-v4-vs-claude-vs-gpt" rel="noopener noreferrer"&gt;FundaAI 38-task benchmark&lt;/a&gt; and the &lt;a href="https://artificialanalysis.ai/models/comparisons/deepseek-v4-pro-high-vs-gpt-5-5" rel="noopener noreferrer"&gt;Artificial Analysis comparison&lt;/a&gt; both land at the same conclusion: don't pick one model. Route.&lt;/p&gt;

&lt;h3&gt;
  
  
  Route to Claude Opus 4.7 when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hard reasoning is required (legal, financial, medical research)&lt;/li&gt;
&lt;li&gt;Code review or complex multi-file refactoring (87.6% SWE-bench)&lt;/li&gt;
&lt;li&gt;Citation accuracy matters — lowest hallucination rate&lt;/li&gt;
&lt;li&gt;Enterprise compliance rules out Chinese infrastructure&lt;/li&gt;
&lt;li&gt;You're already in Cursor (&lt;a href="https://x.com/cursor_ai/status/2044785960899236341" rel="noopener noreferrer"&gt;50% off right now&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Route to DeepSeek V4-Pro when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Long-context analysis (1M token codebase ingestion, multi-document synthesis)&lt;/li&gt;
&lt;li&gt;Agentic workflows with 20+ steps (interleaved thinking retention)&lt;/li&gt;
&lt;li&gt;High-volume batch processing where cost is the constraint&lt;/li&gt;
&lt;li&gt;Self-hosting or private infrastructure is required&lt;/li&gt;
&lt;li&gt;You want open weights for fine-tuning on domain data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Route to DeepSeek V4-Flash when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High-frequency, lower-complexity tasks ($0.28/M output)&lt;/li&gt;
&lt;li&gt;First-pass triage or pre-processing before escalation to a stronger model&lt;/li&gt;
&lt;li&gt;Any use case where V4-Pro would work but volume makes cost prohibitive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Route to GPT-5.5 when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Extreme long-context reasoning (1M tokens, MRCR v2 at 74%)&lt;/li&gt;
&lt;li&gt;Agentic computer use tasks via OpenAI Codex&lt;/li&gt;
&lt;li&gt;Speed is a priority (GPT-5.5 Fast Mode: 1.5× faster tokens)&lt;/li&gt;
&lt;li&gt;Deep OpenAI ecosystem integration (ChatGPT, Codex)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Skip GPT-5.5 Pro unless:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You have specific enterprise contracts with OpenAI&lt;/li&gt;
&lt;li&gt;The $180/M output cost is justifiable for specialized, low-volume, high-stakes tasks&lt;/li&gt;
&lt;li&gt;The 58.6% SWE-bench score won't matter for your use case&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 The optimal architecture for most teams: route 60–70% of traffic to V4-Flash for high-volume/low-complexity tasks, escalate coding to Claude Opus 4.7, use GPT-5.5 for long-context document tasks. This pattern typically reduces costs 40–60% compared to running everything through a single frontier model.&lt;/p&gt;
&lt;/blockquote&gt;
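
&lt;p&gt;A hedged sketch of what that routing layer can look like in code. The thresholds, task labels, and model ids below are illustrative assumptions, not published defaults; the point is the shape (a cheap default with explicit escalation rules), which you would tune against your own traffic:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def pick_model(task_type, context_tokens, high_stakes=False):
    """Illustrative routing rules distilled from the framework above."""
    if context_tokens &gt; 200_000:
        return "gpt-5.5"            # only the 1M-context models can hold this much
    if high_stakes or task_type in ("coding", "code-review", "hard-reasoning"):
        return "claude-opus-4-7"    # top SWE-bench score, lowest hallucination rate
    if task_type in ("triage", "classification", "extraction"):
        return "deepseek-v4-flash"  # high volume, lowest cost
    return "deepseek-v4-pro"        # cheap default for everything else

print(pick_model("coding", 40_000))         # claude-opus-4-7
print(pick_model("summarize", 600_000))     # gpt-5.5
print(pick_model("triage", 2_000))          # deepseek-v4-flash
&lt;/code&gt;&lt;/pre&gt;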




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Three models, three different bets on what matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; is the best coding model at 87.6% SWE-bench verified. It leads on hard reasoning. It hallucinates least. It costs $25/M output tokens. For high-stakes code and reasoning work, it remains the default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek V4-Pro&lt;/strong&gt; is 7× cheaper than Claude, open-source, and within 10 points on most coding benchmarks. The 9-point gap on Humanity's Last Exam matters for hard reasoning. For everything else, the cost case is compelling — especially with 1M context native and interleaved thinking for agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; wins the intelligence index at 60 points and made a massive long-context leap. But the 86% hallucination rate and $30/M output cost make it a niche choice: buy it when you specifically need that long-context reasoning, and verify outputs carefully.&lt;/p&gt;

&lt;p&gt;The frontier in 2026 is not a single model. It's a routing layer.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;For the coding-specific comparison, see &lt;a href="https://computeleap.com/blog/gpt-5-5-vs-claude-code-agentic-coding-ai-2026" rel="noopener noreferrer"&gt;GPT-5.5 vs Claude Code: Which AI Should You Use for Agentic Development?&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For a deeper look at the Chinese open-source cost story, see &lt;a href="https://computeleap.com/blog/kimi-k2-6-vs-claude-opus-47-open-source-chinese-ai-model-comparison-2026" rel="noopener noreferrer"&gt;Kimi K2.6 vs Claude Opus 4.7: The 88% Cost Advantage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://computeleap.com/blog/deepseek-v4-vs-gpt-55-vs-claude-opus-47-model-comparison-2026" rel="noopener noreferrer"&gt;ComputeLeap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>deepseek</category>
      <category>developer</category>
    </item>
    <item>
      <title>Meta's Real Story Isn't the Layoffs. It's the Surveillance.</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Fri, 24 Apr 2026 23:47:16 +0000</pubDate>
      <link>https://forem.com/max_quimby/metas-real-story-isnt-the-layoffs-its-the-surveillance-22jm</link>
      <guid>https://forem.com/max_quimby/metas-real-story-isnt-the-layoffs-its-the-surveillance-22jm</guid>
      <description>&lt;p&gt;On April 23, Meta &lt;a href="https://www.bloomberg.com/news/articles/2026-04-23/meta-tells-staff-it-will-cut-10-of-jobs-in-push-for-efficiency" rel="noopener noreferrer"&gt;told 8,000 employees&lt;/a&gt; they would be walking out the door on May 20. The next morning, Microsoft &lt;a href="https://www.cnbc.com/2026/04/23/microsoft-plans-first-voluntary-retirement-program-for-us-employees.html" rel="noopener noreferrer"&gt;announced&lt;/a&gt; the first voluntary retirement program in its 51-year history — up to 8,750 people eligible. Six weeks earlier, Block &lt;a href="https://thehill.com/policy/technology/5758605-block-cash-app-square-parent-layoffs-ai/" rel="noopener noreferrer"&gt;gutted 40%&lt;/a&gt; of its workforce, citing its own internal AI agent. The headlines write themselves: "AI is eating the tech industry," "the labor crisis is here."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://www.computeleap.com/blog/meta-surveillance-tech-layoffs-2026/" rel="noopener noreferrer"&gt;Read the full version with charts and embedded sources on ComputeLeap →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the story everyone is telling. It's also the wrong one.&lt;/p&gt;

&lt;p&gt;The real story is happening inside Meta's own walls. The same week the company said goodbye to 8,000 people, it quietly began installing &lt;strong&gt;surveillance software on the computers of the employees it's keeping&lt;/strong&gt; — a program called the Model Capability Initiative (MCI) that records every keystroke, mouse movement, and periodic screenshot across Gmail, GitHub, Slack, and hundreds of other sites. The stated purpose: to train AI agents that can automate white-collar work. The logical endpoint: the employees generating the training data are, line by line, encoding the workflows that will replace the next cohort. That isn't a productivity tool. That is the first visible instance of &lt;strong&gt;enterprise-scale AI observability applied to employees&lt;/strong&gt; — and because Meta just normalized it, every Fortune 500 will pilot a version within 18 months.&lt;/p&gt;

&lt;p&gt;This piece is about the layoffs. But the layoffs are the distraction. The surveillance is the precedent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Layoff Wave — the Surface Story
&lt;/h2&gt;

&lt;p&gt;The numbers are real, and they are large. Meta's chief people officer Janelle Gale sent a Thursday memo informing staff the company would cut 10% of its workforce — roughly 8,000 people — and decline to fill another 6,000 open roles. "This is not an easy tradeoff," she wrote, "and it will mean letting go of people who have made meaningful contributions to Meta during their time here." &lt;a href="https://www.bloomberg.com/news/articles/2026-04-23/meta-tells-staff-it-will-cut-10-of-jobs-in-push-for-efficiency" rel="noopener noreferrer"&gt;Bloomberg was first to report&lt;/a&gt;; &lt;a href="https://www.cnbc.com/2026/04/23/meta-will-cut-10percent-of-workforce-as-it-pushes-more-into-ai.html" rel="noopener noreferrer"&gt;CNBC&lt;/a&gt;, Reuters, and the FT followed within hours. The cuts land May 20.&lt;/p&gt;

&lt;p&gt;Meta's 2026 capital expenditure guidance — &lt;strong&gt;$115 billion to $135 billion&lt;/strong&gt;, roughly double the $72 billion spent in 2025 — explains the accounting. CEO Mark Zuckerberg has told investors the company will spend whatever it takes to build "personal superintelligence." Paying for that at current margins means shedding people. The company's framing is that the capex is the investment and the layoffs are the offset. What's hidden behind this number is a structural bet: that the people being cut now will be replaced by software the remaining employees are being paid to help train.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/NTVE4TRkvT4"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Microsoft, on the same day, introduced something unprecedented in the company's history. Rather than traditional layoffs, Satya Nadella's team offered a &lt;a href="https://www.cnbc.com/2026/04/23/microsoft-plans-first-voluntary-retirement-program-for-us-employees.html" rel="noopener noreferrer"&gt;one-time voluntary retirement program&lt;/a&gt; to about 7% of U.S. employees — senior director level and below, whose years of employment plus age sum to at least 70. That's roughly 8,750 people eligible to walk with severance and extended healthcare. The financial engineering is elegant: the people most expensive to fire voluntarily raise their hands. &lt;a href="https://techcrunch.com/2026/04/23/microsoft-offers-buyout-for-up-to-7-of-u-s-employees/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt; and &lt;a href="https://www.geekwire.com/2026/microsoft-will-offer-voluntary-retirement-to-thousands-of-employees-in-a-first-for-tech-giant/" rel="noopener noreferrer"&gt;GeekWire&lt;/a&gt; both flagged this as a first for the 51-year-old company. What matters isn't that Microsoft is unusually generous — it's that the framing of "retirement" sidesteps the WARN Act, muddles the layoffs.fyi counter, and gives the company political cover on capex earnings calls. This is layoff optimization, not compassion.&lt;/p&gt;

&lt;p&gt;Block is the canary. In March, Jack Dorsey announced the company behind Square and Cash App would cut 4,000 people — roughly 40% of its workforce — and pointed directly at the company's internal AI agent, codename goose, as the reason. Goose had been in production internally for about 18 months; &lt;a href="https://fortune.com/2026/03/06/exclusive-block-cfo-ai-leaps-18-months-led-decision-slash-nearly-half-its-workforce/" rel="noopener noreferrer"&gt;Fortune's exclusive with Block's CFO&lt;/a&gt; detailed the leverage math. Then came the complication: within six weeks, as we documented in our analysis of &lt;a href="https://www.computeleap.com/blog/block-ai-revolution-builderbot-replacing-engineers-2026" rel="noopener noreferrer"&gt;Block's 40% layoff and its codename goose agent&lt;/a&gt;, technical leads began threatening to quit unless laid-off teammates were rehired. &lt;a href="https://www.humai.blog/jack-dorsey-fired-4-000-block-workers-for-ai-then-the-rehires-started/" rel="noopener noreferrer"&gt;HumAI's reporting&lt;/a&gt; shows Block has quietly rehired engineers — often at lower seniority and tighter comp. The lesson the rest of big tech is taking: cut fast, claim AI, rehire the critical third at a 30% discount.&lt;/p&gt;

&lt;p&gt;Zoom out and the macro number is staggering. Per &lt;a href="https://www.cnbc.com/2026/04/24/20k-job-cuts-at-meta-microsoft-raise-concern-of-ai-labor-crisis-.html" rel="noopener noreferrer"&gt;layoffs.fyi tallies reported by CNBC&lt;/a&gt;, &lt;strong&gt;more than 92,000 tech workers have been laid off in 2026 alone&lt;/strong&gt;, bringing the running total since 2020 to nearly 900,000. Amazon announced its widest layoff in company history earlier this quarter. Oracle, Snap, Disney — &lt;a href="https://tech.yahoo.com/general/article/tech-layoffs-2026-over-96000-employees-have-been-laid-off-this-year-across-oracle-amazon-meta-disney-snap-and-more-144545855.html" rel="noopener noreferrer"&gt;the list is 96,000 and climbing&lt;/a&gt;. Glassdoor's Employee Confidence Index shows the tech sector dropped 6.8 percentage points year-over-year in March, the largest drop in any industry. But this is the part everyone is already covering. Let's move to the part they aren't.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47879986" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotgs68oud2lrilxsyfro.png" alt="Hacker News thread on Meta's 10% layoff announcement — 781 points, 829 comments, top comment framing the cut as capex-driven rather than AI-productivity-driven" width="800" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the HN thread for the Bloomberg story (781 points, 829 comments), the top comment from user &lt;code&gt;bandrami&lt;/code&gt; captured the actual dynamic:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 "This is interesting because it's a case of 'AI taking jobs' but not in the way people normally mean; these massive layoffs are happening not because AI is doing the work they used to do but because capex is sucking all of the operating money out of everywhere." — bandrami, HN&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hold that frame. Because the next section explains where the capex goes — and who pays for it with their keystrokes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story Underneath — Meta's MCI
&lt;/h2&gt;

&lt;p&gt;Two days before Meta announced the 10% cut, &lt;a href="https://fortune.com/2026/04/21/meta-will-start-tracking-employees-screens-and-keystrokes-to-train-ai/" rel="noopener noreferrer"&gt;Fortune broke a different story&lt;/a&gt;: Meta is installing tracking software on every U.S. employee's work computer. The program is called the &lt;strong&gt;Model Capability Initiative (MCI)&lt;/strong&gt;, and it does three things. It records mouse movements and clicks. It logs keystrokes. And it periodically captures screenshots — all inside a set of "work apps and websites" that &lt;a href="https://www.cnbc.com/2026/04/22/meta-tracks-employee-usage-on-google-linkedin-ai-training-project.html" rel="noopener noreferrer"&gt;CNBC's reporting&lt;/a&gt; reveals includes Google, LinkedIn, Wikipedia, Microsoft's GitHub, Salesforce's Slack, Atlassian's Jira and Confluence, Meta's own Threads and Manus, Gmail, Visual Studio Code, and an internal tool called Metamate. Hundreds of sites in total.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theregister.com/2026/04/22/meta_employee_surveillance_software/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk82dk5astvrr4vejpz8r.png" alt="The Register, April 22, 2026: 'Magnificent irony as Meta staff unhappy about running surveillance software on work PCs' — primary reporting on the Model Capability Initiative, citing Reuters, Business Insider, and the internal Bosworth memo" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The stated purpose is not productivity management. It is training data. A Meta spokesperson explained the logic in clean language to Fortune: "If we're building agents to help people complete everyday tasks using computers, our models need real examples of how people actually use them." CTO Andrew Bosworth went further in an internal memo &lt;a href="https://www.theregister.com/2026/04/22/meta_employee_surveillance_software/" rel="noopener noreferrer"&gt;reported by The Register&lt;/a&gt;: Meta envisions "a world where our agents primarily do the work and our role is to direct, review and help them improve." Read carefully. The people being monitored are the raw material for the agents that will reduce the need for their successors. One employee, anonymous to the BBC, used the word that has dominated every discussion of MCI since: "&lt;strong&gt;very dystopian&lt;/strong&gt;." &lt;a href="https://www.computing.co.uk/news/2026/very-dystopian-meta-to-track-employee-keystrokes-to-train-ai-systems" rel="noopener noreferrer"&gt;Computing.co.uk used the quote as their headline&lt;/a&gt;. Futurism summarized the company's position more bluntly: "&lt;a href="https://futurism.com/artificial-intelligence/meta-track-everything-workers-type-click-train-ai" rel="noopener noreferrer"&gt;Meta is saying the quiet part out loud.&lt;/a&gt;"&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/26Vf9mxV7so"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47851948" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbivh57qg5x7wc1qaglx7.png" alt="Hacker News thread: 'Meta to start capturing employee mouse movements, keystrokes for AI training' — practitioners weighing in on the legal and ethical contours" width="800" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47860961" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyqdlrzxhb6skx5qw7iz.png" alt="Hacker News thread: 'Meta employees are up in arms over a mandatory program to train AI on their work' — practitioner reaction to the mandatory, no-opt-out nature of the MCI rollout" width="800" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The timing is the part nobody in Meta's PR shop can spin. Fortune's story landed on April 21. CNBC's followed April 22. The 10% layoff memo landed April 23. Two days apart. &lt;a href="https://www.techradar.com/pro/simply-by-doing-their-daily-work-meta-tracks-staff-activity-to-teach-ai-how-to-replace-them" rel="noopener noreferrer"&gt;TechRadar Pro ran the causal headline explicitly&lt;/a&gt;: "Meta is logging employees' keystrokes and screenshots to train AI agents — weeks before major layoffs." Under U.S. federal law, there is no opt-out. Workers at Meta's U.S. offices have no legal right to refuse the MCI agent on their machines. Tell someone they have six weeks until the WARN notice and then ask them to hand over their keystroke data with no opt-out — and then call it consent — and you have defined the outer edge of what "at will" means in 2026. &lt;a href="https://www.ai-supremacy.com/p/massive-layoffs-meta-surveillance-deepseek-v4-preview-ai-news-this-week" rel="noopener noreferrer"&gt;AI Supremacy's newsletter&lt;/a&gt;, which broke the broader narrative into public consciousness, put it tersely: "&lt;strong&gt;Meta not just spying with AI glasses, now data harvesting talented staff. No opt-out.&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ai-supremacy.com/p/massive-layoffs-meta-surveillance-deepseek-v4-preview-ai-news-this-week" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23kqepfzpkss9fvwx59n.png" alt="AI Supremacy Substack post on April 24, 2026 — 'Massive Layoffs, Meta Surveillance' — framing the week's twin stories as a single narrative about AI-era labor and control" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 18-Month Precedent
&lt;/h2&gt;

&lt;p&gt;Every enterprise-software procurement cycle I've watched over 20 years follows the same pattern. A FAANG normalizes a practice. A tier-two SaaS company builds a commercial version of it. The Fortune 500 starts piloting within 12 months. It becomes a standard line item inside 24.&lt;/p&gt;

&lt;p&gt;Meta just normalized AI-native employee observability at the scale of 75,000 users. The SaaS category that will emerge around this is already taking shape: products that record a granular stream of employee screen, keyboard, and application telemetry; pipe it through an LLM scoring layer for "workflow classification"; and feed it back into either a coaching agent or a direct-automation agent. Every CIO has received a pitch from at least one of these vendors in the last ninety days. Meta just gave all of them the reference customer they needed.&lt;/p&gt;

&lt;p&gt;The logic is not secret — it's what &lt;a href="https://www.computeleap.com/blog/ai-native-org-dorsey-vs-tang-dynasty" rel="noopener noreferrer"&gt;the AI-native org playbook Dorsey has been pitching&lt;/a&gt; has been missing. Dorsey's Block built goose and then cut 40% of staff. But Block didn't productionize employee telemetry collection to &lt;em&gt;train&lt;/em&gt; the next version of goose. Meta is. That is the leap. The companies downstream of Meta will not write research papers or send internal memos — they will simply deploy the product, usually rolled in under an existing "endpoint security" or "DLP" SKU where most employees won't notice until it's in the HR handbook.&lt;/p&gt;

&lt;p&gt;Here is the prediction: by Q4 2027, three to five of the top ten U.S. private employers will be running some version of MCI under a brand name sold by a Menlo Park-funded SaaS vendor. Disclosure will vary. Consent will be buried in a revised acceptable-use policy. The 18-month timeline is not a guess — it's the standard procurement gap between "FAANG reference customer" and "regulated enterprise rollout."&lt;/p&gt;

&lt;h2&gt;
  
  
  Contrarian Corner — Steelman First
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;The strongest counter-argument to the thesis above:&lt;/strong&gt; workplace monitoring is not new. Pinkerton detectives watched factory floors in 1892. Keystroke loggers have been commercial software since the 1990s. Every enterprise already collects endpoint telemetry for security and DLP. Employees consented when they signed the handbook. What Meta is doing is a UX improvement, not a category break. And anyway, as &lt;a href="https://fortune.com/2026/02/19/sam-altman-confirms-ai-washing-job-displacement-layoffs/" rel="noopener noreferrer"&gt;Sam Altman pointed out in February&lt;/a&gt;, companies are "AI washing" layoffs they'd have done anyway — so blaming the surveillance for the layoffs, or even the other way around, is narrative fiction. The real driver is capex reallocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rebuttal:&lt;/strong&gt; the steelman is right about Pinkertons and keyloggers, and it's right that AI washing is real — but it's wrong about scale and purpose. Security telemetry is collected to detect malicious &lt;em&gt;activity&lt;/em&gt; (data exfiltration, credential misuse). Performance management telemetry is collected to measure &lt;em&gt;output&lt;/em&gt; (tickets closed, calls handled). MCI is different. MCI collects &lt;em&gt;the process&lt;/em&gt; — the specific sequence of clicks and keystrokes a senior engineer uses to structure a code review, the phrasing a PM uses in a Slack thread, the order in which a designer opens Figma panels. That's not security. That's not performance. That's an &lt;strong&gt;apprenticeship in bulk&lt;/strong&gt;, extracted from people who were not told what the apprentice would be. Altman's AI-washing caveat applies to the layoff narrative — and we'll take it. It does not apply to the MCI narrative. The monitoring isn't about the layoffs. The monitoring is about what comes after them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Ethical Question — And It Is One
&lt;/h2&gt;

&lt;p&gt;It would be lazy to call this dystopian and stop there. The harder question: where is the line?&lt;/p&gt;

&lt;p&gt;There is a real distinction — one that law professor Ifeoma Ajunwa has been writing about for a decade — between monitoring that enforces a contract (you agreed to do X hours of work; we verify X hours happened) and monitoring that extracts value beyond the contract (we watch how you work and build an asset, owned entirely by us, that captures the transferable skill you spent a career developing). The first is controversial but defensible. The second has no settled legal or ethical framework in the U.S. — and because most U.S. states are at-will, no practical avenue for refusal.&lt;/p&gt;

&lt;p&gt;European workers have more ground. GDPR Article 88 gives member states authority to pass employment-specific data protection laws; most have. France's CNIL has already ruled that keystroke-level monitoring without a documented, proportionate business case violates the GDPR's "data minimization" principle. Germany's works councils can veto the deployment of tracking software outright. Meta's MCI would not, in its current U.S. form, pass a German BetrVG review. The company has been pointedly quiet about whether it will extend MCI outside the U.S., and the regulatory asymmetry is the reason.&lt;/p&gt;

&lt;p&gt;Inside the U.S., the landscape is a patchwork. &lt;a href="https://iapp.org/news/a/workplace-privacy-in-us-laws-and-policies" rel="noopener noreferrer"&gt;IAPP's summary&lt;/a&gt; lays it out: California's CCPA, as of January 1, 2026, requires employers to conduct risk assessments for processing personal-email content over company systems and for any automated processing used to infer job performance. New York requires written notice of electronic monitoring at hiring, posted in a conspicuous place. Illinois's BIPA requires informed written consent and strict data-handling for biometric data. Connecticut and Delaware have their own notice regimes. Fifteen more states have biometric legislation in committee. The federal backstop — the Electronic Communications Privacy Act and Stored Communications Act — permits monitoring "for legitimate business purposes," which has never been tested against the specific question of "training AI to replace the monitored worker." Someone will file that suit in 2026. If Meta is the defendant, the discovery alone will be brutal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://reddit.com/r/technology/comments/1stq5fk/palantir_employees_are_starting_to_wonder_if/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g8l1mjg4thc0s8zsg5j.png" alt="r/technology front-page thread April 23, 2026 — 'Palantir Employees Are Starting to Wonder if They're the Bad Guys' — 22,780 upvotes, 1,136 comments, signal of shifting sentiment inside surveillance-adjacent tech companies" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The social temperature is already shifting. An r/technology thread &lt;a href="https://reddit.com/r/technology/comments/1stq5fk/palantir_employees_are_starting_to_wonder_if/" rel="noopener noreferrer"&gt;noting that Palantir employees are starting to wonder if they're the bad guys&lt;/a&gt; hit 22,780 upvotes and 1,136 comments in a day. Some 30,000 Samsung union members took to the streets this week demanding a share of AI-driven profits. The broader &lt;a href="https://www.computeleap.com/blog/ai-backlash-violence-china-shift-2026" rel="noopener noreferrer"&gt;AI backlash that's already visible&lt;/a&gt; in public sentiment — and in specific acts of sabotage — will not spare a company that is simultaneously laying off 8,000 people and installing keystroke trackers on the survivors. The PR exposure could not be larger.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Employees Should Do — Right Now, This Week
&lt;/h2&gt;

&lt;p&gt;This is the practitioner section. None of what follows requires a lawyer, a union rep, or a grievance. Every single item is something a salaried tech worker can do this week with fifteen minutes and a personal laptop.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assume your work device is instrumented.&lt;/strong&gt; Not just at Meta — at every tier-one tech employer within 18 months. Do not conduct job searches, update LinkedIn, or compose resume materials on a work machine. Do not route personal email through a work browser profile. Do not use work Slack or Teams for anything sensitive. If you're unsure whether endpoint monitoring is installed, check your company's acceptable-use policy and endpoint security agent list — you don't need IT's permission to read HR's own documents. (A quick way to check what's actually running on your machine is sketched just after this list.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Know your jurisdiction.&lt;/strong&gt; Californians under CCPA have data-subject rights including access and deletion requests — use them. New Yorkers can demand the electronic-monitoring disclosure that state law requires employers to provide at hiring. Illinois employees with any biometric capture (many keystroke loggers qualify) are protected by BIPA and can sue individually. If you're in the EU under GDPR, your employer is already on weaker legal ground than they realize. Look up your state attorney general's consumer protection page this week. Bookmark it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document performance on personal storage.&lt;/strong&gt; Copies of performance reviews, 1:1 notes, project outcomes, praise emails, and compensation history — kept in a personal cloud account you still control after a termination. Not a work machine. Not a work-synced OneDrive. If you are fired and want to contest it or negotiate severance, you will need evidence your employer no longer grants you access to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Talk to a lawyer before you need one.&lt;/strong&gt; Most employment-law firms offer &lt;strong&gt;free 30-minute consults&lt;/strong&gt;. Use one. Ask three questions: (a) what does my employment contract allow around monitoring and post-termination data collection, (b) does my state have any notice or consent laws that apply, (c) if I negotiate a severance, what is a typical multiplier in my jurisdiction and role. You are not hiring a lawyer. You are getting a calibration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use LinkedIn and Blind with discipline.&lt;/strong&gt; Never post job-search activity under a handle linked to your work email. Never cross-post on Blind from a device on the corporate network — Blind's "verified employer" check doesn't mean Blind itself is safe from discovery in litigation. If you are contemplating a move, set up a personal-email Blind account on a personal device today.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Know your collective-action options.&lt;/strong&gt; Most U.S. tech workers are non-union, but the NLRB protects concerted activity even without a union. Two or more employees raising monitoring concerns in writing to HR is protected. If the topic feels too hot for email, CODE-CWA and the Tech Workers Coalition both run confidential channels.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
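
&lt;p&gt;A note on that first item: if you want a quick, point-in-time look at what's already running on a macOS work laptop, the sketch below uses only standard macOS tooling. The product names in the last command are illustrative examples, not an exhaustive list of monitoring agents.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Configuration profiles pushed by IT (user scope; system profiles may need sudo)
profiles list

# Persistent agents and daemons installed system-wide
ls /Library/LaunchAgents /Library/LaunchDaemons

# Processes that look like endpoint telemetry (names are illustrative, not exhaustive)
ps aux | grep -iE 'crowdstrike|sentinel|dlp|teramind|activtrak' | grep -v grep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If one of these commands is blocked by policy, that is itself useful information.&lt;/p&gt;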

&lt;p&gt;None of this is paranoia. It is hygiene. Your great-grandparents knew not to discuss wages or organizing plans in the company town's general store. The same discipline applies when the general store now runs on the laptop in your bag.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Employers Should Do
&lt;/h2&gt;

&lt;p&gt;Shorter version. Four items. If you are building or approving a monitoring program, answer each in writing before you deploy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transparency.&lt;/strong&gt; Tell employees what is collected, how it is processed, who has access, how long it is retained, and what it will be used for. Not in an updated AUP. In an email, plus a town hall, plus a written Q&amp;amp;A that is revised based on the questions asked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consent with a real opt-out.&lt;/strong&gt; If there is no opt-out, it is not consent. If the only opt-out is resignation, it is not consent. Build a workflow that allows employees to exclude specific apps or time windows from collection, with no retaliation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Narrow scope and retention.&lt;/strong&gt; Collect the minimum required for the stated purpose. Delete the rest on a 90-day rolling window. Publish the retention schedule.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Independent audit.&lt;/strong&gt; Annual third-party review of what was collected, how it was used, and whether the stated purpose and the actual purpose match. Publish the audit summary internally.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Companies that do these four things will retain talent. Companies that don't will face union drives, class actions, and a steady bleed of senior engineers to competitors inside 24 months. The math is not hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;Eighteen months is the window. By late 2027 we will know whether enterprise AI employee-observability splits into two markets — a disclosure-first one sold to companies that care about retention, and a dark-patterns one sold to companies that will be defendants in the 2028 class-action docket. The companies that land on the right side of that split will not be the ones with the most advanced surveillance stack. They will be the ones that wrote the consent architecture first.&lt;/p&gt;

&lt;p&gt;The Meta layoffs are the headline. They're a footnote. The headline is what Meta is doing to the 72,000 employees it didn't lay off this week. They're being asked to train the thing that replaces the next 8,000 people. And because of where Meta sits in the procurement food chain, every CIO at every mid-cap company in America just saw the playbook. They will run it, with minor modifications, and with far less press attention.&lt;/p&gt;

&lt;p&gt;If you're a tech employee, the next year is not about whether your company survives the AI wave. It's about whether you can tell the difference between a performance review and a training run. Assume you're being watched. Make it worth their while.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.computeleap.com/blog/meta-surveillance-tech-layoffs-2026/" rel="noopener noreferrer"&gt;ComputeLeap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tech</category>
      <category>surveillance</category>
      <category>career</category>
    </item>
    <item>
      <title>Shannon AI Review: Autonomous Web Pentesting Agent</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Fri, 24 Apr 2026 04:25:16 +0000</pubDate>
      <link>https://forem.com/max_quimby/shannon-ai-review-autonomous-web-pentesting-agent-3jdi</link>
      <guid>https://forem.com/max_quimby/shannon-ai-review-autonomous-web-pentesting-agent-3jdi</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://agentconn.com/blog/shannon-ai-pentester-review-autonomous-web-security-2026" rel="noopener noreferrer"&gt;Read the full version with screenshots and embedded sources on AgentConn →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On April 22, 2026, the &lt;a href="https://news.ycombinator.com/item?id=47876043" rel="noopener noreferrer"&gt;Bitwarden CLI package was compromised&lt;/a&gt; and pushed to npm as version 2026.4.0. The malicious release was live for 19 hours. 334 users downloaded it before detection. Bitwarden is one of the most-audited, most-trusted password managers on the planet — and the attack was caught by community monitoring, not by the organization's own tooling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47876043" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksg200fy1xty3fpxc884.png" alt="Hacker News: Bitwarden CLI compromised in Checkmarx supply chain campaign — 679 points, 337 comments" width="600" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the context in which &lt;a href="https://github.com/KeygraphHQ/shannon" rel="noopener noreferrer"&gt;Shannon&lt;/a&gt; needs to be evaluated — not as an academic security toy, but as a response to an increasingly hostile environment where the traditional model of "annual pentest, quarterly audit" is already obsolete before the PDF is delivered.&lt;/p&gt;

&lt;p&gt;Shannon is an open-source autonomous AI pentesting agent built by &lt;a href="https://keygraph.io/shannon" rel="noopener noreferrer"&gt;Keygraph&lt;/a&gt;. It reads your source code, maps your attack surface, and attempts to break in — producing a report with zero false positives, because it only files findings it can actively prove with a working exploit. It has 40.1K GitHub stars as of April 2026. Powered by Anthropic's Claude.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/The_Cyber_News/status/2019777360313434478" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frddtlmdi1zutukguj1z5.png" alt="@The_Cyber_News: Shannon AI Pentesting Tool that Autonomously Checks for Code Vulnerabilities in 90 Minutes" width="548" height="920"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Shannon Actually Does
&lt;/h2&gt;

&lt;p&gt;When you run Shannon, it executes a five-phase workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-reconnaissance&lt;/strong&gt; — Static code analysis: architecture patterns, entry points, authentication mechanisms, likely attack vectors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconnaissance&lt;/strong&gt; — Dynamic analysis via Playwright browser automation: forms, API endpoints, authentication flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulnerability &amp;amp; Exploitation&lt;/strong&gt; — Five parallel Claude agents simultaneously test for SQLi, XSS, authorization bypasses, SSRF, and IDOR. No PoC = no finding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confirmation&lt;/strong&gt; — Dedicated pass verifies each exploit is reproducible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting&lt;/strong&gt; — Proven vulnerabilities only, with exact &lt;code&gt;curl&lt;/code&gt; commands to reproduce&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cost: ~$50 in Anthropic API credits. Time: 1–1.5 hours. Compare: $10,000–$50,000 for a traditional pentest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/DavidBorish/status/2041171017029042465" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4ykpdasi1k6j8ay35tb.png" alt="@DavidBorish: Shannon hit 10,000 GitHub stars by actually breaking into web applications instead of just flagging potential problems" width="548" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The XBOW Benchmark: 96.15%
&lt;/h2&gt;

&lt;p&gt;Shannon scored 96.15% on the XBOW security benchmark — 100 of 104 intentionally vulnerable web apps solved in hint-free, source-aware mode. Commercial DAST tools typically score 30–40% on comparable evaluations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/AISecHub/status/2000413083693445600" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5un01bd5gie7t5376aqd.png" alt="@AISecHub: Shannon has achieved a 96.15% success rate on the hint-free source-aware XBOW Benchmark" width="548" height="764"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Hands-On Test Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DVNA (Node.js)&lt;/strong&gt; — Shannon detected SQL injection, command injection, XSS, and XXE with working exploits. "What stood out was how Shannon organized the analysis — it structured the findings into clear sections."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OWASP Juice Shop&lt;/strong&gt; — &lt;a href="https://betterstack.com/community/guides/ai/shannon-ai/" rel="noopener noreferrer"&gt;Better Stack's test&lt;/a&gt; consumed ~$60 in API credits. Shannon "didn't say 'this login looks weak' — it bypassed the login, dumped data, and handed me the screenshots and logs to prove it." Zero false positives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Economics
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional pentest&lt;/td&gt;
&lt;td&gt;$10,000–$50,000&lt;/td&gt;
&lt;td&gt;Weeks&lt;/td&gt;
&lt;td&gt;Annual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shannon per scan&lt;/td&gt;
&lt;td&gt;~$50 API&lt;/td&gt;
&lt;td&gt;1–1.5 hours&lt;/td&gt;
&lt;td&gt;Daily in CI/CD&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Shannon Misses
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;White-box only&lt;/strong&gt; — requires source code access; can't test closed-source dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited vulnerability classes&lt;/strong&gt; — only the categories listed above: SQLi, XSS, SSRF, authorization bypass, and IDOR. Business logic flaws: not in scope&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not for production&lt;/strong&gt; — creates users, modifies data, fires injection probes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM residual risk&lt;/strong&gt; — confirmation phase helps but human review still essential&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Dual-Use Concern
&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://news.ycombinator.com/item?id=46944416" rel="noopener noreferrer"&gt;HN discussion&lt;/a&gt;: "Since this is open source, it's a white-hat tool, but it also democratizes script kiddos being able to do some serious damage." Developer: "I guess who owns the most hardware wins the arms race?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Requirements: Docker, Node.js 18+, Anthropic API key&lt;/span&gt;
npx @keygraph/shannon setup
npx @keygraph/shannon start &lt;span class="nt"&gt;-u&lt;/span&gt; https://your-dev-app.com &lt;span class="nt"&gt;-r&lt;/span&gt; /path/to/repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
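
&lt;p&gt;If you want the "daily in CI/CD" cadence from the economics table, a minimal nightly job might look like the sketch below. It reuses the commands above; the staging URL, the secret name, and the artifact handling are assumptions to adapt to your own pipeline — and it should only ever point at a non-production environment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# nightly-shannon.sh — run by a CI scheduler against a staging deployment, never production.
set -euo pipefail

: "${ANTHROPIC_API_KEY:?set this via your CI secret store}"   # Anthropic key for the run (exact env var name: check Shannon's setup output)
TARGET_URL="https://staging.your-app.example"                 # hypothetical staging URL
REPO_PATH="$(pwd)"                                            # CI checkout of the repo under test

npx @keygraph/shannon start -u "$TARGET_URL" -r "$REPO_PATH"

# Publish whatever report Shannon writes as a CI artifact; the exact output
# location depends on your Shannon configuration, so check a local run first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;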



&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Shannon if:&lt;/strong&gt; shifting security left, web app with source code you control, OWASP Top 10 exposure, need something between nothing and a full pentest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't rely on Shannon if:&lt;/strong&gt; black-box testing needed, business logic is your risk, compliance-ready reports required, production environment.&lt;/p&gt;

&lt;p&gt;Shannon is at &lt;a href="https://github.com/KeygraphHQ/shannon" rel="noopener noreferrer"&gt;github.com/KeygraphHQ/shannon&lt;/a&gt; — AGPL-3.0.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://agentconn.com/blog/shannon-ai-pentester-review-autonomous-web-security-2026" rel="noopener noreferrer"&gt;AgentConn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>opensource</category>
      <category>pentesting</category>
    </item>
    <item>
      <title>GPT-5.5 vs Claude Code: Which AI Should You Use?</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:26:24 +0000</pubDate>
      <link>https://forem.com/max_quimby/gpt-55-vs-claude-code-which-ai-should-you-use-58fe</link>
      <guid>https://forem.com/max_quimby/gpt-55-vs-claude-code-which-ai-should-you-use-58fe</guid>
      <description>&lt;p&gt;The agentic coding race just got a whole lot more explicit.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://computeleap.com/blog/gpt-5-5-vs-claude-code-agentic-coding-ai-2026" rel="noopener noreferrer"&gt;Read the full version with charts and embedded sources on ComputeLeap →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On April 23, 2026, OpenAI shipped &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;GPT-5.5&lt;/a&gt; with a framing it hasn't used before: not a smarter chat model, but "a new class of intelligence for real work and powering agents." The subtext is unmistakable — OpenAI is coming directly for the territory Claude Code has been quietly dominating among professional developers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/OpenAI/status/2047376561205325845" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fondzunwp52ngnswtmi75.png" alt="OpenAI tweet announcing GPT-5.5 — 40K likes, 8.4K retweets" width="700" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The launch racked up 40K likes within hours. Developers who have been routing serious coding work through Claude Code are suddenly asking whether it's time to reconsider. The honest answer? It depends on what you're building — and who's paying for it.&lt;/p&gt;

&lt;p&gt;This is a practical decision guide. We'll cover the benchmark reality, the pricing drama that erupted this week, and the three distinct use cases where each tool wins. No hype, no both-sides-ism. Just a clear read on the current state of the agentic coding wars.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GPT-5.5 Actually Is
&lt;/h2&gt;

&lt;p&gt;GPT-5.5 is the first fully retrained base model OpenAI has shipped since GPT-4.5. Every previous 5.x release (5.1, 5.2, 5.3, 5.4) was built on the same foundation — this one is not.&lt;/p&gt;

&lt;p&gt;The headline benchmark: &lt;strong&gt;82.7% on Terminal-Bench 2.0&lt;/strong&gt;, a test of complex command-line workflows that require planning, iteration, and coordinated tool use. It also posts 58.6% on SWE-Bench Pro (real GitHub issue resolution end-to-end in a single pass) and 84.9% on GDPval, which tests general-purpose knowledge work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techcrunch.com/2026/04/23/openai-chatgpt-gpt-5-5-ai-model-superapp/" rel="noopener noreferrer"&gt;TechCrunch's coverage&lt;/a&gt; notes that Greg Brockman called it "a real step forward towards the kind of computing that we expect in the future" — pointing to autonomous task completion, not just chat fluency. The model is designed to use tools, verify its own work, and carry multi-step tasks through to completion without requiring constant human steering.&lt;/p&gt;

&lt;p&gt;What changed under the hood according to &lt;a href="https://interestingengineering.com/ai-robotics/opanai-gpt-5-5-agentic-coding-gains" rel="noopener noreferrer"&gt;Interesting Engineering&lt;/a&gt;: fewer refusals mid-task, better intent retention across long tool chains, and more efficient token usage per completed task than GPT-5.4. It's natively omnimodal (text, images, audio, video in a single unified system) and available in both ChatGPT and Codex immediately on launch day for Plus, Pro, Business, and Enterprise subscribers.&lt;/p&gt;

&lt;p&gt;The pricing is not gentle. &lt;a href="https://venturebeat.com/ai/openais-gpt-5-5-is-here-and-its-no-potato-narrowly-beats-anthropics-claude-mythos-preview-on-terminal-bench-2-0/" rel="noopener noreferrer"&gt;VentureBeat's analysis&lt;/a&gt; puts GPT-5.5 API at $5/million input tokens and $30/million output tokens — roughly 2x the per-token cost of GPT-5.4. OpenAI's defense is fewer tokens per task, but that tradeoff only holds if your workload actually benefits from GPT-5.5's strengths.&lt;/p&gt;
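
&lt;p&gt;A quick back-of-the-envelope makes that tradeoff concrete. The token counts below are hypothetical, and the GPT-5.4 prices are simply inferred from the "roughly 2x" figure — treat this as a sketch of the break-even logic, not published pricing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Per-task cost at GPT-5.5 list price ($5 in / $30 out per 1M tokens) versus an
# implied ~half-price GPT-5.4. Token counts per completed task are made up.
awk 'BEGIN {
  in55 = 20000; out55 = 5000;    # hypothetical GPT-5.5 usage for one task
  in54 = 30000; out54 = 9000;    # hypothetical GPT-5.4 usage for the same task
  cost55 = in55 * 5   / 1e6 + out55 * 30 / 1e6;
  cost54 = in54 * 2.5 / 1e6 + out54 * 15 / 1e6;
  printf "GPT-5.5: $%.3f per task   GPT-5.4: $%.3f per task\n", cost55, cost54;
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even with roughly a third fewer tokens in that example, GPT-5.5 comes out slightly more expensive per task ($0.25 vs. about $0.21); at a 2x per-token premium, the newer model has to roughly halve the tokens it spends per completed task before the economics flip.&lt;/p&gt;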

&lt;h2&gt;
  
  
  What Claude Code Actually Is
&lt;/h2&gt;

&lt;p&gt;Claude Code is a different category of product. It's not a chat interface with coding capabilities bolted on — it's a terminal-native agent built specifically for software engineers. It runs in your local terminal, integrates directly with VS Code and JetBrains, understands your full repo context, and executes multi-hour autonomous coding sessions that Anthropic describes as its core use case.&lt;/p&gt;

&lt;p&gt;The underlying model powering serious Claude Code work today is &lt;strong&gt;Claude Opus 4.7&lt;/strong&gt;, released April 16, 2026. Its signature benchmark is &lt;strong&gt;64.3% on SWE-Bench Pro&lt;/strong&gt; — the highest score on that test for complex multi-file GitHub issue resolution. Opus 4.7 leads GPT-5.5 on 6 of the 10 shared benchmarks both providers report, particularly on the reasoning-heavy and code review-grade tests (GPQA Diamond, HLE, SWE-Bench Pro, MCP Atlas).&lt;/p&gt;

&lt;p&gt;For a ground-level look at how real developers are using it, the &lt;a href="https://www.youtube.com/watch?v=wkv2ifxPpF8" rel="noopener noreferrer"&gt;Y Combinator video featuring Garry Tan's Claude Code setup&lt;/a&gt; is worth 15 minutes. Tan walks through his "GStack" — the full Claude Code-native development environment he runs as a solo-founder-style operator.&lt;/p&gt;

&lt;p&gt;Claude Code's strongest differentiator isn't a benchmark. It's the depth of context retention and the autonomy of its execution. In the &lt;a href="https://news.ycombinator.com/item?id=47879092" rel="noopener noreferrer"&gt;Hacker News thread&lt;/a&gt; that followed GPT-5.5's launch, one recurring pattern emerged: developers described Claude Code as "autonomous/thoughtful — it plans deeply and asks less of the human," while Codex/GPT-5.5 is characterized as "an interactive collaborator where you steer it mid-execution."&lt;/p&gt;

&lt;p&gt;Check our &lt;a href="https://computeleap.com/blog/claude-code-complete-guide-2026" rel="noopener noreferrer"&gt;complete guide to Claude Code&lt;/a&gt; for a deep dive on how to set up and optimize Claude Code for your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Head-to-Head: Benchmarks That Actually Matter
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://lushbinary.com/blog/gpt-5-5-vs-claude-opus-4-7-comparison-benchmarks-pricing/" rel="noopener noreferrer"&gt;Lushbinary's analysis&lt;/a&gt; of the 10 benchmarks both providers publicly report gives the clearest picture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.7 leads on 6:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SWE-Bench Pro: &lt;strong&gt;64.3%&lt;/strong&gt; vs 58.6%&lt;/li&gt;
&lt;li&gt;GPQA Diamond: Opus leads&lt;/li&gt;
&lt;li&gt;HLE (with and without tools): Opus leads&lt;/li&gt;
&lt;li&gt;MCP Atlas: Opus leads&lt;/li&gt;
&lt;li&gt;FinanceAgent v1.1: Opus leads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5 leads on 4:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminal-Bench 2.0: &lt;strong&gt;82.7%&lt;/strong&gt; vs 69.4%&lt;/li&gt;
&lt;li&gt;BrowseComp: GPT-5.5 leads&lt;/li&gt;
&lt;li&gt;OSWorld-Verified: GPT-5.5 leads&lt;/li&gt;
&lt;li&gt;CyberGym: GPT-5.5 leads (82%)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ One important nuance: GPT-5.5's 58.6% on SWE-Bench Pro is measured in single-pass mode. Claude Code typically runs multiple iterations. Comparing single-pass GPT-5.5 scores to multi-pass Claude Code sessions is not apples-to-apples.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://x.com/omarsar0/status/2047424707310289058" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frw7dyhdrdmfmmfedem1f.png" alt="AI researcher first impressions of GPT-5.5 agentic capabilities" width="700" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47879092" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yerdqx6jnmwu8iptc13.png" alt="Hacker News discussion on GPT-5.5 — developers compare Claude Code vs Codex workflows" width="700" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pricing Drama You Need to Know
&lt;/h2&gt;

&lt;p&gt;On April 22, &lt;a href="https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/" rel="noopener noreferrer"&gt;The Register reported&lt;/a&gt; that Anthropic quietly updated its pricing page — Claude Code showed an "X" in the Pro column, suggesting the feature was being moved exclusively to the $100/month and $200/month Max plans. No press release, no email, no changelog entry.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Reddit and HN caught fire immediately. For a large segment of Pro subscribers, Claude Code &lt;em&gt;was&lt;/em&gt; the reason they paid $20/month. The apparent removal felt like a retroactive bait-and-switch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faivgm4g3on4z8isb7ay7.png" alt="The Register coverage of Anthropic removing Claude Code from Pro plan" width="700" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://simonwillison.net/2026/apr/22/claude-code-confusion/" rel="noopener noreferrer"&gt;Simon Willison's take&lt;/a&gt; captured the confusion well: within hours of his blog post being drafted, Anthropic had reversed the pricing page change. Anthropic's Head of Growth Amol Avasare clarified the change affected "~2% of new prosumer signups" only.&lt;/p&gt;

&lt;p&gt;The contrast with Codex is stark. &lt;a href="https://www.builder.io/blog/codex-vs-claude-code" rel="noopener noreferrer"&gt;Builder.io's comparison&lt;/a&gt; makes it plain: "Many more people can live comfortably on the $20 Codex plan than Claude's $17 plan where limits get hit quickly."&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Decision Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: Solo Developer / Indie Hacker
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude Code&lt;/strong&gt; — with caveats on budget.&lt;/p&gt;

&lt;p&gt;If you're running a solo operation and want an AI that will autonomously execute multi-hour coding sessions while you focus on product decisions, Claude Code on Opus 4.7 is the deeper tool. The caveat: if you're on the $20 Pro plan and hitting limits regularly, GPT-5.5 in Codex is a legitimate alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: Engineering Team (5–50 People)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Winner: GPT-5.5 / Codex&lt;/strong&gt; — on ecosystem and GitHub integration.&lt;/p&gt;

&lt;p&gt;For teams, &lt;a href="https://www.builder.io/blog/codex-vs-claude-code" rel="noopener noreferrer"&gt;Builder.io&lt;/a&gt; identifies Codex's GitHub integration as its decisive advantage. GPT-5.5 also supports the AGENTS.md standard — Claude Code's exclusive use of CLAUDE.md creates friction in multi-tool team environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 3: Enterprise (100+ Engineers)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Winner: Hybrid + &lt;a href="https://github.com/farion1231/cc-switch" rel="noopener noreferrer"&gt;cc-switch&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At enterprise scale, the right answer is an intelligent routing layer. cc-switch (49K stars) unifies Claude Code, Codex, OpenCode, and Gemini CLI into a single Rust-powered desktop app.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 For enterprise teams: Claude Opus 4.7 for code review and complex refactors; GPT-5.5 for long-running agentic workflows and computer use. cc-switch makes this routing practical at scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Claude Code (Opus 4.7) if:&lt;/strong&gt; complex multi-file coding, autonomous execution, terminal-native workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use GPT-5.5 / Codex if:&lt;/strong&gt; long-running tool chains, computer use, GitHub-centric team workflows, cost-sensitive setups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use both (via cc-switch) if:&lt;/strong&gt; team or enterprise scale with mixed workloads.&lt;/p&gt;

&lt;p&gt;The developers winning with AI coding in 2026 stop asking "which is better overall?" and start asking "which is better for this specific task?"&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://computeleap.com/blog/gpt-5-5-vs-claude-code-agentic-coding-ai-2026" rel="noopener noreferrer"&gt;ComputeLeap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>openai</category>
      <category>coding</category>
    </item>
    <item>
      <title>Claude Code Agentic Stack: cc-switch &amp; claude-context MCP</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Thu, 23 Apr 2026 03:36:41 +0000</pubDate>
      <link>https://forem.com/max_quimby/claude-code-agentic-stack-cc-switch-claude-context-mcp-1dg2</link>
      <guid>https://forem.com/max_quimby/claude-code-agentic-stack-cc-switch-claude-context-mcp-1dg2</guid>
      <description>&lt;p&gt;Claude Code just won a &lt;a href="https://www.webbyawards.com/press/press-releases/30th-annual-webby-awards-announce-2026-winners/" rel="noopener noreferrer"&gt;Webby Award&lt;/a&gt; for Best Product or Service in AI Features &amp;amp; Innovation. Boris Cherny, Claude Code's PM at Anthropic, &lt;a href="https://x.com/bcherny/status/2047004804283773321" rel="noopener noreferrer"&gt;announced the win on X&lt;/a&gt; to a wave of congratulations from the developer community:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://computeleap.com/blog/claude-code-agentic-dev-stack-2026" rel="noopener noreferrer"&gt;Read the full version with charts and embedded sources on ComputeLeap →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://x.com/bcherny/status/2047004804283773321" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcomputeleap.com" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the real story isn't the trophy — it's what's happening in the GitHub repos trending alongside it.&lt;/p&gt;

&lt;p&gt;Two repos hit the GitHub Trending page on the same day as the Webby announcement: &lt;strong&gt;&lt;a href="https://github.com/farion1231/cc-switch" rel="noopener noreferrer"&gt;cc-switch&lt;/a&gt;&lt;/strong&gt; (+665 stars in 24 hours, 48,667 total) and &lt;strong&gt;&lt;a href="https://github.com/zilliztech/claude-context" rel="noopener noreferrer"&gt;claude-context&lt;/a&gt;&lt;/strong&gt; (+873 stars). Both extend Claude Code's capabilities significantly — and together with a properly configured &lt;code&gt;CLAUDE.md&lt;/code&gt;, they represent what serious agentic developer stacks look like in 2026.&lt;/p&gt;

&lt;p&gt;This guide covers exactly how to set up both tools and wire everything together for maximum development velocity.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the "Agentic Developer Stack" Actually Means in 2026
&lt;/h2&gt;

&lt;p&gt;In the 2026 context, an agentic developer stack has three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provider management&lt;/strong&gt; — switch between Claude Code, Codex, Gemini CLI, OpenCode, and other AI coding tools from a single interface, sharing provider configs, MCP servers, and skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codebase context&lt;/strong&gt; — give your AI agent deep semantic understanding of your entire codebase, not just the files currently open&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent configuration&lt;/strong&gt; — the &lt;code&gt;CLAUDE.md&lt;/code&gt; files, skills, and subagent definitions that turn Claude Code from a general-purpose tool into a domain-specific engineering partner&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;According to &lt;a href="https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf" rel="noopener noreferrer"&gt;Anthropic's 2026 Agentic Coding Trends Report&lt;/a&gt;, teams using structured &lt;code&gt;CLAUDE.md&lt;/code&gt; configs and subagent workflows report 2-4x velocity improvements over baseline Claude Code usage. The tools in this guide enable exactly that configuration.&lt;/p&gt;
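
&lt;p&gt;The report doesn't prescribe a format, but as a rough illustration, a minimal &lt;code&gt;CLAUDE.md&lt;/code&gt; usually captures three things: the commands the agent should run, the conventions it should follow, and the lines it should not cross. Everything below is hypothetical example content, not a template from Anthropic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CLAUDE.md (illustrative example)

## Commands
- Build: npm run build
- Test:  npm test
- Lint:  npm run lint

## Conventions
- TypeScript strict mode; avoid `any` in new code
- All database access goes through src/db/repository.ts

## Constraints
- Never edit files under migrations/ by hand
- Ask before adding a new dependency
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;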




&lt;h2&gt;
  
  
  Layer 1: cc-switch — Unified Provider Management
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What cc-switch Does
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/farion1231/cc-switch" rel="noopener noreferrer"&gt;cc-switch&lt;/a&gt; is a cross-platform desktop app built with Tauri and Rust that unifies management of five AI coding CLI tools: Claude Code, OpenAI Codex, Gemini CLI, OpenCode, and OpenClaw. Instead of maintaining separate configuration files and MCP server setups for each tool, cc-switch provides a single interface that syncs settings bidirectionally.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50+ built-in provider presets&lt;/strong&gt; — one-click import of API configurations for Anthropic, OpenAI, Gemini, xAI, Mistral, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System tray quick switch&lt;/strong&gt; — instant provider switching without opening a terminal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified MCP &amp;amp; Skills Management&lt;/strong&gt; — install MCP servers and skills once, sync across all connected tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud sync&lt;/strong&gt; — settings sync via Dropbox, OneDrive, iCloud, or WebDAV servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usage dashboard&lt;/strong&gt; — track spending, request counts, and token consumption per provider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform&lt;/strong&gt; — Windows, macOS, and Linux support&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 cc-switch is built with Tauri (Rust-based) for native performance — not an Electron wrapper. Cold launch is under 200ms and system tray switching responds in under 50ms. This matters when you're switching between providers dozens of times a day.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Installing cc-switch
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;macOS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; cc-switch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or download the latest release from &lt;a href="https://github.com/farion1231/cc-switch/releases" rel="noopener noreferrer"&gt;cc-switch/releases&lt;/a&gt; — &lt;code&gt;.dmg&lt;/code&gt; for macOS, &lt;code&gt;.exe&lt;/code&gt; for Windows, &lt;code&gt;.AppImage&lt;/code&gt; for Linux.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/farion1231/cc-switch" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcomputeleap.com" alt="" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cc-switch &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Initial Setup: Provider Configuration
&lt;/h3&gt;

&lt;p&gt;On first launch, cc-switch walks you through connecting your providers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open cc-switch from the system tray or Applications folder&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Providers&lt;/strong&gt; → &lt;strong&gt;Add Provider&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Select from the preset list (Anthropic, OpenAI, Gemini, etc.) or add a custom provider&lt;/li&gt;
&lt;li&gt;Paste your API key — cc-switch stores it in your OS keychain, not in plain text&lt;/li&gt;
&lt;li&gt;Test the connection with the &lt;strong&gt;Verify&lt;/strong&gt; button&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For Claude Code, cc-switch automatically detects your existing &lt;code&gt;~/.claude/&lt;/code&gt; configuration and imports it. Your existing settings, custom commands, and history are preserved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up MCP Servers in cc-switch
&lt;/h3&gt;

&lt;p&gt;The real power of cc-switch is managing MCP servers across all your coding tools simultaneously. Instead of configuring the same MCP server separately for each tool, you configure it once and cc-switch deploys it to all connected tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cc-switch mcp add &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"claude-context"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt; &lt;span class="s2"&gt;"npx"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--args&lt;/span&gt; &lt;span class="s2"&gt;"-y @zilliztech/claude-context"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; all-tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Layer 2: claude-context MCP — Semantic Codebase Search
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Codebase Context Is the Biggest Bottleneck
&lt;/h3&gt;

&lt;p&gt;When you ask Claude Code to modify a function that depends on types defined in five other files, Claude Code has to either load all five files into context (expensive) or try to infer the types from what it can see (error-prone). &lt;a href="https://github.com/zilliztech/claude-context" rel="noopener noreferrer"&gt;claude-context&lt;/a&gt; solves this with semantic search over your entire codebase.&lt;/p&gt;

&lt;p&gt;Instead of loading full files, it retrieves only the semantically relevant code snippets. According to &lt;a href="https://www.augmentcode.com/mcp/claude-context-mcp-server" rel="noopener noreferrer"&gt;Augment Code's MCP registry benchmarks&lt;/a&gt;, claude-context achieves approximately &lt;strong&gt;40% token reduction&lt;/strong&gt; under equivalent retrieval quality conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How claude-context Works
&lt;/h3&gt;

&lt;p&gt;claude-context uses a hybrid search approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25&lt;/strong&gt; — lexical matching (finds exact variable names, function signatures)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dense vector search&lt;/strong&gt; — semantic matching (finds conceptually related code even with different naming)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your codebase is indexed into a Milvus vector database (local) or Zilliz Cloud (managed). The index uses AST-aware chunking — it understands code structure at the syntax level. Function bodies, class definitions, and interface declarations are kept semantically intact.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 claude-context uses incremental Merkle-tree-based indexing. After the initial index build, only changed files are re-indexed. For a mid-size repo (50K LOC), re-indexing typically completes in under 5 seconds after a &lt;code&gt;git pull&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Installing and Configuring claude-context
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; Node.js 18+ and a running Milvus instance (local Docker) or &lt;a href="https://zilliz.com/" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt; account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @zilliztech/claude-context
claude-context init   &lt;span class="c"&gt;# configure vector DB + embedding provider&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude-context index &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
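
&lt;p&gt;If you don't already have a Milvus instance for the prerequisite above, the Docker-based standalone quick start is the fastest path; run it before &lt;code&gt;claude-context init&lt;/code&gt;. A minimal sketch, assuming the upstream &lt;code&gt;standalone_embed.sh&lt;/code&gt; helper script and the default port 19530; check the current Milvus docs before relying on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Start a local standalone Milvus via the official Docker helper script
# (assumption: script name and default port; verify against the Milvus docs)
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
bash standalone_embed.sh start   # serves on localhost:19530 by default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;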



&lt;p&gt;&lt;strong&gt;Register with Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"claude-context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@zilliztech/claude-context"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"serve"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"MILVUS_URI"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:19530"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"EMBEDDING_PROVIDER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"OPENAI_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${OPENAI_API_KEY}"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use cc-switch's MCP manager (recommended) — it handles the configuration and syncs it across all your AI coding tools automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=45181577" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcomputeleap.com" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Using claude-context During Development
&lt;/h3&gt;

&lt;p&gt;Once installed, claude-context adds a &lt;code&gt;search_codebase&lt;/code&gt; tool to Claude Code. You can invoke it explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use the search_codebase tool to find all implementations of the PaymentProcessor interface before modifying it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or Claude Code will invoke it automatically when understanding more of the codebase would improve its response.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 For large monorepos, create a &lt;code&gt;.claude-context-ignore&lt;/code&gt; file (similar to &lt;code&gt;.gitignore&lt;/code&gt;) to exclude generated files, &lt;code&gt;node_modules&lt;/code&gt;, build artifacts, and test fixtures. This keeps the index clean and retrieval precise.&lt;/p&gt;
&lt;/blockquote&gt;
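
&lt;p&gt;A reasonable starting point for that ignore file, using the same pattern syntax as &lt;code&gt;.gitignore&lt;/code&gt; (adjust to your repo layout):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_modules/
dist/
build/
coverage/
*.min.js
*.map
test/fixtures/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;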




&lt;h2&gt;
  
  
  Layer 3: CLAUDE.md Configuration — Making It All Stick
&lt;/h2&gt;

&lt;p&gt;Having great tools is only half the equation. The other half is configuring Claude Code to use them intelligently. This is where &lt;code&gt;CLAUDE.md&lt;/code&gt; comes in — and where most developers leave significant productivity on the table.&lt;/p&gt;

&lt;p&gt;For the fundamentals, see our &lt;a href="https://computeleap.com/blog/claude-code-complete-guide-2026" rel="noopener noreferrer"&gt;Claude Code Complete Guide&lt;/a&gt;. This section focuses on configuration patterns specific to the 2026 agentic stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Role of CLAUDE.md in an Agentic Stack
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is the document Claude Code reads at the start of every session. According to the &lt;a href="https://www.mindstudio.ai/blog/agentic-business-os-claude-code-architecture-guide" rel="noopener noreferrer"&gt;MindStudio guide on Agentic Business OS architecture&lt;/a&gt;, it's the "foundational document for your brand context layer — it defines what every agent knows before it starts any task."&lt;/p&gt;

&lt;p&gt;Use it to tell the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which MCP servers are available and when to use them&lt;/li&gt;
&lt;li&gt;Your coding standards and conventions&lt;/li&gt;
&lt;li&gt;When to spawn subagents vs. work in the main context&lt;/li&gt;
&lt;li&gt;What tools to reach for first&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sample CLAUDE.md for the 2026 Agentic Stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project: [Your Project Name]&lt;/span&gt;

&lt;span class="gu"&gt;## Stack&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Language: TypeScript 5.4 (strict mode)
&lt;span class="p"&gt;-&lt;/span&gt; Runtime: Node.js 22 LTS
&lt;span class="p"&gt;-&lt;/span&gt; Package manager: pnpm

&lt;span class="gu"&gt;## MCP Servers Available&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**claude-context**&lt;/span&gt;: Use &lt;span class="sb"&gt;`search_codebase`&lt;/span&gt; before modifying any class, interface, 
  or utility function that may have downstream consumers. Always search before refactoring.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**chrome-mcp**&lt;/span&gt;: Available for UI verification tasks.

&lt;span class="gu"&gt;## Coding Standards&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Functions: single responsibility, &amp;lt;=50 lines
&lt;span class="p"&gt;-&lt;/span&gt; No &lt;span class="sb"&gt;`any`&lt;/span&gt; types — use &lt;span class="sb"&gt;`unknown`&lt;/span&gt; + type guards
&lt;span class="p"&gt;-&lt;/span&gt; Tests: co-located &lt;span class="sb"&gt;`.test.ts`&lt;/span&gt; files, Vitest
&lt;span class="p"&gt;-&lt;/span&gt; Commits: conventional commits format

&lt;span class="gu"&gt;## Subagent Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Spawn a subagent (with worktree isolation) for: feature branches, large refactors, research
&lt;span class="p"&gt;-&lt;/span&gt; Keep the main context for: interactive debugging, short edits, Q&amp;amp;A

&lt;span class="gu"&gt;## Agent Workflow&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Search codebase (claude-context) before modifying shared code
&lt;span class="p"&gt;2.&lt;/span&gt; Write tests before implementation for new features
&lt;span class="p"&gt;3.&lt;/span&gt; Run &lt;span class="sb"&gt;`pnpm build`&lt;/span&gt; and &lt;span class="sb"&gt;`pnpm test`&lt;/span&gt; before committing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern — explicitly naming available MCP servers and when to use subagents — is what separates teams that get 2-4x velocity gains from teams that treat Claude Code as smart autocomplete.&lt;/p&gt;

&lt;p&gt;For detailed &lt;code&gt;CLAUDE.md&lt;/code&gt; patterns, see &lt;a href="https://computeleap.com/blog/karpathy-claude-md-template-skills-github-stars-viral" rel="noopener noreferrer"&gt;Karpathy's CLAUDE.md template analysis&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subagent Setup with Worktree Isolation
&lt;/h3&gt;

&lt;p&gt;For complex features requiring parallel workstreams, the &lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;official subagent documentation&lt;/a&gt; provides the full setup. The key pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;feature-agent&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Use for implementing new features across multiple modules&lt;/span&gt;
&lt;span class="na"&gt;isolation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worktree&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;read&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;edit&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;write&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;bash&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;search_codebase&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

You are a focused implementation agent. Use search_codebase to understand 
existing patterns before writing new code. Work in the isolated worktree.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;isolation: worktree&lt;/code&gt; gives the subagent its own copy of the repository, preventing conflicts when multiple agents work in parallel. For more on this, see the &lt;a href="https://github.com/shanraisshan/claude-code-best-practice" rel="noopener noreferrer"&gt;Claude Code best practices guide&lt;/a&gt;.&lt;/p&gt;
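
&lt;p&gt;Conceptually this maps to a plain &lt;code&gt;git worktree&lt;/code&gt;, which is also handy for inspecting or cleaning up whatever a subagent leaves behind. The paths and branch name below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# A worktree is a separate checkout that shares the same underlying .git store
git worktree add ../myproject-feature-x -b feature-x   # the isolated copy an agent works in
git worktree list                                       # see which worktrees are active
git worktree remove ../myproject-feature-x              # clean up once the branch is merged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;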




&lt;h2&gt;
  
  
  The Pricing Context: What the Pro Plan Controversy Means for Your Setup
&lt;/h2&gt;

&lt;p&gt;On April 21, 2026, Anthropic briefly removed Claude Code from the $20/month Pro plan listing — prompting a 2,648-upvote Reddit thread and coverage in &lt;a href="https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/" rel="noopener noreferrer"&gt;The Register&lt;/a&gt; and &lt;a href="https://www.xda-developers.com/anthropic-cut-claude-code-new-pro-subscriptions/" rel="noopener noreferrer"&gt;XDA Developers&lt;/a&gt;. &lt;a href="https://simonwillison.net/2026/apr/22/claude-code-confusion/" rel="noopener noreferrer"&gt;Simon Willison's analysis&lt;/a&gt; described it as an "A/B test on ~2% of new prosumer signups." Anthropic reversed the change the same day — existing Pro and Max subscribers are not affected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.reddit.com/r/ClaudeAI/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcomputeleap.com" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the incident reveals the underlying tension: Claude Code sessions on Claude Opus 4.7 run up to three times longer than they did on 4.6, and inference costs are escalating accordingly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ If you're building agentic workflows with long Claude Code sessions, budget for the Max plan ($100/month, 5x the Pro usage limits). Agentic sessions — especially with subagents and frequent claude-context queries — consume context much faster than interactive sessions. Use cc-switch's usage dashboard to track token consumption and catch runaway workflows before they hit billing limits.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Full Stack Setup Sequence
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Install Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Install cc-switch:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; cc-switch
&lt;span class="c"&gt;# Or: github.com/farion1231/cc-switch/releases&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Import your existing Claude Code config&lt;/strong&gt; — cc-switch auto-detects &lt;code&gt;~/.claude/&lt;/code&gt; on first launch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Install and configure claude-context:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @zilliztech/claude-context
claude-context init
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; claude-context index &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Register claude-context MCP via cc-switch&lt;/strong&gt; → MCP → Add Server → scope: All Tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Write your CLAUDE.md&lt;/strong&gt; in your project root using the template above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Define subagents&lt;/strong&gt; in &lt;code&gt;.claude/agents/&lt;/code&gt; — start with a feature-agent and a research-agent.&lt;/p&gt;
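
&lt;p&gt;A matching research-agent can reuse the same frontmatter format as the feature-agent shown earlier. The read-only tool list here is a suggested starting point, not a requirement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;---
name: research-agent
description: Use for codebase exploration, dependency audits, and design research
isolation: worktree
tools: [read, bash, search_codebase]
---

You are a read-only research agent. Use search_codebase and read to gather
context, then report findings back to the main session. Do not modify files.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;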

&lt;p&gt;&lt;strong&gt;8. Test the full stack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude
&lt;span class="c"&gt;# Ask: "Search the codebase for the authentication flow and explain it"&lt;/span&gt;
&lt;span class="c"&gt;# claude-context should invoke automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next in the Ecosystem
&lt;/h2&gt;

&lt;p&gt;A few things worth watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;cc-switch's cloud sync&lt;/strong&gt; is expanding to git-based sync, enabling team-wide provider config sharing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude-context's offline mode&lt;/strong&gt; (tracking in &lt;a href="https://github.com/zilliztech/claude-context/issues/162" rel="noopener noreferrer"&gt;Issue #162&lt;/a&gt;) would enable fully local indexing without an external vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Tool Search&lt;/strong&gt; (launched January 14, 2026) allows Claude Code to dynamically load tools into context when MCP servers have 50+ tools — reducing context pressure from large MCP setups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The underlying trend is clear: Claude Code has crossed from "developer tool" to "developer platform." The Webby Award is the cultural marker. The GitHub trending repos are the technical evidence. Setting up this stack today puts you ahead of the curve.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;GitHub&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cc-switch&lt;/td&gt;
&lt;td&gt;Unified provider + MCP management desktop app&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/farion1231/cc-switch" rel="noopener noreferrer"&gt;farion1231/cc-switch&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-context&lt;/td&gt;
&lt;td&gt;Semantic codebase search MCP&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/zilliztech/claude-context" rel="noopener noreferrer"&gt;zilliztech/claude-context&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md&lt;/td&gt;
&lt;td&gt;Agent configuration and context file&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/shanraisshan/claude-code-best-practice" rel="noopener noreferrer"&gt;shanraisshan/claude-code-best-practice&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the full Claude Code foundation, read the &lt;a href="https://computeleap.com/blog/claude-code-complete-guide-2026" rel="noopener noreferrer"&gt;Claude Code Complete Guide&lt;/a&gt;. For browser automation integration, see &lt;a href="https://computeleap.com/blog/chrome-built-in-mcp-server-native-mcp-v2-2026" rel="noopener noreferrer"&gt;Chrome's built-in MCP server guide&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sources: &lt;a href="https://github.com/farion1231/cc-switch" rel="noopener noreferrer"&gt;cc-switch GitHub&lt;/a&gt; · &lt;a href="https://github.com/zilliztech/claude-context" rel="noopener noreferrer"&gt;claude-context GitHub&lt;/a&gt; · &lt;a href="https://www.webbyawards.com/press/press-releases/30th-annual-webby-awards-announce-2026-winners/" rel="noopener noreferrer"&gt;Webby Awards 2026&lt;/a&gt; · &lt;a href="https://simonwillison.net/2026/apr/22/claude-code-confusion/" rel="noopener noreferrer"&gt;Simon Willison&lt;/a&gt; · &lt;a href="https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/" rel="noopener noreferrer"&gt;The Register&lt;/a&gt; · &lt;a href="https://www.xda-developers.com/anthropic-cut-claude-code-new-pro-subscriptions/" rel="noopener noreferrer"&gt;XDA Developers&lt;/a&gt; · &lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;Anthropic Subagent Docs&lt;/a&gt; · &lt;a href="https://www.mindstudio.ai/blog/agentic-business-os-claude-code-architecture-guide" rel="noopener noreferrer"&gt;MindStudio Agentic OS&lt;/a&gt; · &lt;a href="https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf" rel="noopener noreferrer"&gt;Anthropic 2026 Agentic Coding Trends&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://computeleap.com/blog/claude-code-agentic-dev-stack-2026" rel="noopener noreferrer"&gt;ComputeLeap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>mcp</category>
      <category>devtools</category>
      <category>ai</category>
    </item>
    <item>
      <title>Iran's Prediction Markets Tell Two Stories</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Tue, 21 Apr 2026 04:04:57 +0000</pubDate>
      <link>https://forem.com/max_quimby/irans-prediction-markets-tell-two-stories-4i29</link>
      <guid>https://forem.com/max_quimby/irans-prediction-markets-tell-two-stories-4i29</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://thearcofpower.com/blog/iran-prediction-markets-polymarket-insider-trading-ceasefire-2026" rel="noopener noreferrer"&gt;Read the full analysis with charts on The Arc of Power →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On April 7, 2026, as President Trump was preparing to announce a two-week ceasefire with Iran, more than fifty newly-created accounts on &lt;a href="https://polymarket.com/predictions/iran" rel="noopener noreferrer"&gt;Polymarket&lt;/a&gt; placed large, specific bets that the ceasefire would be announced that day. Minutes later, Trump made the announcement. The accounts profited approximately $600,000. Within 48 hours, the White House had sent internal emails warning staff not to place prediction market bets related to the Iran war.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our thesis: prediction markets are a remarkably accurate signal for short-term diplomatic timing and a systematically poor signal for structural outcomes.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The $200 Million Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.newsweek.com/iran-ceasefire-wreaks-havoc-on-prediction-markets-11806355" rel="noopener noreferrer"&gt;Over $200 million has traded&lt;/a&gt; on Polymarket contracts related to Iran's ceasefire timing. Approximately $118 million was bet specifically on an April 7 deadline — the exact day Trump announced the ceasefire.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cnn.com/2026/03/24/politics/iran-war-bets-prediction-markets" rel="noopener noreferrer"&gt;CNN reported&lt;/a&gt; that a single trader made nearly $1 million from well-timed Polymarket bets correctly predicting US and Israeli military actions against Iran since 2024. The &lt;a href="https://www.cnbc.com/2026/04/10/iran-war-prediction-markets-white-house.html" rel="noopener noreferrer"&gt;White House warned staff&lt;/a&gt; not to bet on Iran war outcomes. Two senators wrote the CFTC demanding investigation. The BETS OFF Act was introduced in Congress.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The regulatory framing matters:&lt;/strong&gt; The BETS OFF Act would prohibit contracts on "government actions, terrorism, war, assassination, and events where an individual knows or controls the outcome." The last clause is the tell — directed at people who influence outcomes, not just know about them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What the Markets Actually Got Right
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The ceasefire timing markets were accurate.&lt;/strong&gt; April 7 was right. The probability curve leading into the announcement showed a sustained spike beginning roughly 6-8 hours before Trump spoke — consistent with information leakage through informal channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The regime stability markets have been roughly accurate.&lt;/strong&gt; Polymarket currently prices an &lt;a href="https://polymarket.com/event/will-the-iranian-regime-fall-by-the-end-of-2026" rel="noopener noreferrer"&gt;80.5% probability against the Iranian regime falling before 2027&lt;/a&gt;. Despite Khamenei's assassination and ongoing protests, the IRGC's institutional structure has remained intact. The market's skepticism of regime collapse — maintained even as Western media ran "end of the regime" framings — has proven correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Markets Are Getting Wrong
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://polymarket.com/event/us-iran-nuclear-deal-by-june-30" rel="noopener noreferrer"&gt;67% probability&lt;/a&gt; of a nuclear deal by June 30 is where prediction markets hit the limits of their model.&lt;/p&gt;

&lt;p&gt;Current odds (April 20, 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Market&lt;/th&gt;
&lt;th&gt;Odds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nuclear Deal by April 30&lt;/td&gt;
&lt;td&gt;36%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nuclear Deal by June 30&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nuclear Deal before 2027&lt;/td&gt;
&lt;td&gt;59-61%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regime Falls before 2027&lt;/td&gt;
&lt;td&gt;19.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The problem: prediction markets aggregate the probability that &lt;em&gt;a deal happens&lt;/em&gt;, not the probability that a deal &lt;em&gt;resolves the underlying dispute&lt;/em&gt;. A deal that fails to address uranium enrichment infrastructure is not a deal. It is a delay.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/Kalshi/status/2043002302722654321" rel="noopener noreferrer"&gt;Kalshi surged to 61%&lt;/a&gt; on nuclear deal odds following Trump's April 13 statement that Iran wants a deal "badly." Markets correctly incorporated Trump's statement — but cannot distinguish between a statement made for domestic political effect and one reflecting genuine diplomatic progress.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The contrarian read on 67%:&lt;/strong&gt; The markets cannot price the difference between "a document is signed" and "the structural conditions for Iranian nuclear breakout capability are removed." These resolve identically in contract language but produce very different geopolitical outcomes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Information Asymmetry Diagnostic
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.bloomberg.com/news/articles/2026-04-08/polymarket-s-iran-bets-draw-fresh-disputes-and-insider-scrutiny" rel="noopener noreferrer"&gt;Bloomberg analysis&lt;/a&gt; identified two populations who could have generated the April 7 betting pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Population 1: Informed analysts&lt;/strong&gt; — people who track backchannel communications through open-source methods and correctly model diplomatic decision-making. The Islamabad negotiations were not secret. A skilled analyst could have assessed April 7 as the most likely ceasefire date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Population 2: Informed insiders&lt;/strong&gt; — people with access to non-public government information. The fifty new accounts make this the more plausible explanation for that specific cluster.&lt;/p&gt;

&lt;p&gt;The distinction matters: if it's Population 1, markets aggregate genuine analytical skill. If it's Population 2, markets track who has access to government communications. The signal quality is real in both cases — but for different reasons.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;Prediction markets on Iran are useful as a rough prior — a starting estimate before you apply your own analysis. They are not useful as a substitute for structural analysis.&lt;/p&gt;

&lt;p&gt;The 67% nuclear deal by June 30 tells you what the aggregate of informed and uninformed bettors believes will happen. It does not tell you whether the deal, if it happens, will matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to watch in the next 48 hours:&lt;/strong&gt; The ceasefire expires April 22. Watch whether the price spike in the ceasefire-extension contract precedes or follows the official announcement. That timing will tell you more about information asymmetry in Iran prediction markets than any regulatory filing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://thearcofpower.com/blog/iran-prediction-markets-polymarket-insider-trading-ceasefire-2026" rel="noopener noreferrer"&gt;The Arc of Power&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geopolitics</category>
      <category>analysis</category>
      <category>markets</category>
      <category>prediction</category>
    </item>
    <item>
      <title>Hermes Agent v0.10: Local AGI Stack &amp; Browser Guide</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Tue, 21 Apr 2026 03:53:49 +0000</pubDate>
      <link>https://forem.com/max_quimby/hermes-agent-v010-local-agi-stack-browser-guide-33bo</link>
      <guid>https://forem.com/max_quimby/hermes-agent-v010-local-agi-stack-browser-guide-33bo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://agentconn.com/blog/hermes-agent-review-local-agi-stack-browser-integration-2026" rel="noopener noreferrer"&gt;Read the full version with diagrams and embedded sources on AgentConn →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In seven weeks, &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;NousResearch/hermes-agent&lt;/a&gt; went from zero to 95,600 GitHub stars — the fastest star velocity of any agent framework in 2026. The question isn't whether Hermes Agent matters. The question is what v0.10.0 (released April 16, 2026) actually changes — and whether local deployment and browser integration are ready for production use.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's New in v0.10.0 (v2026.4.16)
&lt;/h2&gt;

&lt;p&gt;The v0.10 release is the most practically significant update for developers who want to run Hermes without API costs or need browser automation in their workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key additions in v0.10:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama integration&lt;/strong&gt; — First-class local model support via Ollama, llama.cpp, and vLLM with zero API cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hermes-plugin-chrome-profiles&lt;/strong&gt; — Experimental Chrome CDP integration for multi-profile browser automation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser Use v0.8.0+&lt;/strong&gt; — Upgraded browser automation with better reliability and vision integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GEPA v2 improvements&lt;/strong&gt; — Faster evolution cycles for the self-improvement engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Android/Termux support&lt;/strong&gt; — Hermes can now run natively on Android devices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The install story hasn't changed: one command, works everywhere.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Local Deployment: Ollama Integration in Practice
&lt;/h2&gt;

&lt;p&gt;The case for local Hermes is straightforward: if you're running a long-horizon autonomous task — a 2-hour coding session, a research crawl, a data pipeline — API costs compound fast. Switching to Ollama means the economics of "leave it running" change completely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Requirements
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.ollama.com/integrations/hermes" rel="noopener noreferrer"&gt;Official Ollama integration docs&lt;/a&gt; are specific about what local deployment requires:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Apple Silicon (M2/M3/M4)&lt;/td&gt;
&lt;td&gt;Unified RAM (≥16GB)&lt;/td&gt;
&lt;td&gt;50-80 tok/s on 7B&lt;/td&gt;
&lt;td&gt;Metal acceleration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA GPU&lt;/td&gt;
&lt;td&gt;8-16GB VRAM+&lt;/td&gt;
&lt;td&gt;60-100+ tok/s on 7B&lt;/td&gt;
&lt;td&gt;CUDA via Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU-only&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;3-8 tok/s on 7B&lt;/td&gt;
&lt;td&gt;Usable, not recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The recommendation is a 7B or 13B model with a 64K+ context window. Models with shorter contexts will truncate mid-task and produce inconsistent results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama first (if not already)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama  &lt;span class="c"&gt;# macOS&lt;/span&gt;

&lt;span class="c"&gt;# Pull a compatible model (llama3.1 has 128K context natively)&lt;/span&gt;
ollama pull llama3.1:8b

&lt;span class="c"&gt;# Configure Hermes to use local model&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.hermes/config.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
llm:
  provider: ollama
  model: llama3.1:8b
  base_url: http://localhost:11434
  context_window: 65536
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Start Ollama server&lt;/span&gt;
ollama serve &amp;amp;

&lt;span class="c"&gt;# Run Hermes&lt;/span&gt;
hermes run &lt;span class="s2"&gt;"your task here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Context Window Constraint
&lt;/h3&gt;

&lt;p&gt;The critical gotcha: &lt;strong&gt;your model must support ≥64K context&lt;/strong&gt; for reliable multi-step tasks. Most quantized 7B models default to 4K or 8K context.&lt;/p&gt;

&lt;p&gt;Models confirmed to work well with local Hermes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama3.1:8b&lt;/code&gt; (128K context natively)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mistral:7b-instruct-q4_K_M&lt;/code&gt; (64K context with extended config)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen2.5:14b&lt;/code&gt; (32K context, good for medium tasks)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deepseek-coder-v2:16b&lt;/code&gt; (128K context, strong for coding tasks)&lt;/li&gt;
&lt;/ul&gt;
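
&lt;p&gt;If your preferred model defaults to a short context, you can usually raise it through an Ollama Modelfile rather than switching models. A minimal sketch, assuming Ollama's &lt;code&gt;num_ctx&lt;/code&gt; parameter and that the base model was actually trained for the longer window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Build a 64K-context variant of a pulled model via a Modelfile
# (num_ctx only helps if the base model supports that window)
cat &amp;gt; Modelfile &amp;lt;&amp;lt;'EOF'
FROM mistral:7b-instruct-q4_K_M
PARAMETER num_ctx 65536
EOF
ollama create mistral-7b-64k -f Modelfile
ollama run mistral-7b-64k "hello"   # quick smoke test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Point the &lt;code&gt;model:&lt;/code&gt; field in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt; at the new tag afterwards.&lt;/p&gt;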




&lt;h2&gt;
  
  
  Browser Integration: CDP and Browser Use
&lt;/h2&gt;

&lt;p&gt;Hermes ships with two browser automation layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser Use v0.8.0+&lt;/strong&gt; is the default — high-level API for navigation, form filling, clicking, and vision-enabled page reading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;hermes-plugin-chrome-profiles&lt;/strong&gt; is the experimental CDP layer for multi-account workflows. It lets you connect to a running Chrome instance and switch between profiles programmatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Browser Use is bundled — just enable it in config&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.hermes/config.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
tools:
  browser:
    enabled: true
    provider: browser_use
    headless: false
    timeout: 30
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;hermes run &lt;span class="s2"&gt;"Research and summarize the top 5 HN posts from today, save to research-notes.md"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CDP plugin is useful for multi-account testing but not production-stable — &lt;a href="https://news.ycombinator.com/item?id=47726913" rel="noopener noreferrer"&gt;community reports&lt;/a&gt; describe connection drops mid-task. Treat it as beta.&lt;/p&gt;




&lt;h2&gt;
  
  
  The GEPA Self-Improvement Engine
&lt;/h2&gt;

&lt;p&gt;GEPA (Genetic Evolution of Prompt Architectures) was presented as an &lt;a href="https://github.com/NousResearch/hermes-agent-self-evolution" rel="noopener noreferrer"&gt;ICLR 2026 Oral&lt;/a&gt;. The mechanism: GEPA reads execution traces, identifies failure patterns, and proposes improvements to skill prompts. Unlike simple retry logic, GEPA does causal analysis — it tries to understand &lt;em&gt;why&lt;/em&gt; something failed.&lt;/p&gt;

&lt;p&gt;The 40% speedup on repeat tasks is achievable, but it accumulates over time rather than appearing immediately. The first hour feels similar to any other agent. By hour two, after 15-20 similar tasks, the improvement becomes noticeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Self-Grading Problem
&lt;/h3&gt;

&lt;p&gt;Hermes's self-grading tends to be optimistic. The workaround is explicit, verifiable success criteria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Instead of vague prompts:&lt;/span&gt;
hermes run &lt;span class="s2"&gt;"Fix the authentication bug in auth.py"&lt;/span&gt;

&lt;span class="c"&gt;# Use verifiable success criteria:&lt;/span&gt;
hermes run &lt;span class="s2"&gt;"Fix the authentication bug in auth.py.
Success criteria:
1. All tests in test_auth.py pass
2. Login endpoint returns 200 for valid credentials
3. Login endpoint returns 401 for invalid credentials
Run the tests and show output before marking complete."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Hermes vs Claude Code: Complementary, Not Competing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/articles/openclaw-vs-hermes-what-reddit-says" rel="noopener noreferrer"&gt;Community consensus on Reddit&lt;/a&gt;: these are complementary tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hermes excels at:&lt;/strong&gt; long-horizon orchestration, repetitive workflows, local deployment, multi-agent coordination, persistent memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code excels at:&lt;/strong&gt; deep intensive coding, complex architecture decisions, production-critical changes, interactive debugging.&lt;/p&gt;

&lt;p&gt;The practical pattern: Hermes runs background orchestration, calls Claude Code for intensive steps, accumulates skills from each cycle.&lt;/p&gt;
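
&lt;p&gt;One way to express that hand-off is to describe the delegation in the Hermes task prompt itself. This is a sketch only: it assumes Claude Code is installed on the same machine and that its non-interactive print mode (&lt;code&gt;claude -p&lt;/code&gt;) behaves as its docs describe; verify the flag against your installed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hermes orchestrates; the heavy refactor step is delegated to Claude Code
# (assumption: claude -p runs one non-interactive prompt and prints the result)
hermes run "Audit the repo for flaky tests and write a fix plan to flaky-tests.md.
For the single worst offender, delegate the refactor by running:
  claude -p 'Refactor this flaky test to remove its timing dependence'
then incorporate the output, re-run the test suite, and report the results."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;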




&lt;h2&gt;
  
  
  Quick Start Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

&lt;span class="c"&gt;# Cloud API path&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ANTHROPIC_API_KEY=your-key"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.hermes/.env
hermes run &lt;span class="s2"&gt;"your first task"&lt;/span&gt;

&lt;span class="c"&gt;# Local Ollama path (zero cost)&lt;/span&gt;
ollama pull llama3.1:8b
hermes config &lt;span class="nb"&gt;set &lt;/span&gt;llm.provider ollama llm.model llama3.1:8b
hermes run &lt;span class="s2"&gt;"your first task"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;95,600 stars in seven weeks is an endorsement of the concept. v0.10 is the release where the execution starts catching up to the pitch.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://agentconn.com/blog/hermes-agent-review-local-agi-stack-browser-integration-2026" rel="noopener noreferrer"&gt;AgentConn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
    <item>
      <title>Kimi K2.6 vs Claude Opus 4.7: The 88% Cost Advantage</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Tue, 21 Apr 2026 03:28:38 +0000</pubDate>
      <link>https://forem.com/max_quimby/kimi-k26-vs-claude-opus-47-the-88-cost-advantage-2916</link>
      <guid>https://forem.com/max_quimby/kimi-k26-vs-claude-opus-47-the-88-cost-advantage-2916</guid>
      <description>&lt;p&gt;When Clement Delangue, the CEO of Hugging Face, called Kimi K2.6 a standout open-source model on the day of its release, the AI procurement conversation shifted. Not because a Chinese model was competitive — Kimi's K2 family and DeepSeek had already proved that point — but because of what &lt;em&gt;competitive&lt;/em&gt; now costs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://computeleap.com/blog/kimi-k2-6-vs-claude-opus-47-open-source-chinese-ai-model-comparison-2026" rel="noopener noreferrer"&gt;Read the full version with charts and embedded sources on ComputeLeap →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kimi K2.6, the latest open-weight model from Beijing-based Moonshot AI, runs at &lt;strong&gt;$0.60 per million input tokens&lt;/strong&gt; on the official API. &lt;a href="https://openrouter.ai/anthropic/claude-opus-4.7" rel="noopener noreferrer"&gt;Claude Opus 4.7&lt;/a&gt;, Anthropic's frontier model, costs &lt;strong&gt;$5.00 per million input tokens&lt;/strong&gt;. That's an 8.3× difference — or roughly 88% cheaper.&lt;/p&gt;

&lt;p&gt;If your team spends $10,000 a month on Claude Opus 4.7 today, K2.6 could in theory handle the same workload for $1,200. Engineering teams are already running the math. This guide gives you the honest version of that calculation: where K2.6 delivers, where it doesn't, and how to make the decision without the hype in either direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture Behind the Price
&lt;/h2&gt;

&lt;p&gt;The reason Kimi K2.6 can be so cheap while performing at frontier level comes down to architecture. K2.6 is a &lt;strong&gt;Mixture-of-Experts (MoE) model&lt;/strong&gt;: it has 1 trillion total parameters but activates only 32 billion per token during inference.&lt;/p&gt;

&lt;p&gt;Dense models pay the full computational cost of every parameter on every token. MoE models route each token through a small subset of specialized "expert" subnetworks. The result is trillion-parameter model quality at a fraction of the inference cost — which flows directly to the API price.&lt;/p&gt;
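
&lt;p&gt;The arithmetic is easy to sanity-check from the published figures: with roughly 32B of 1T parameters active per token, only about 3% of the network does work on any given token, which bounds the per-token compute advantage over an equally sized dense model. Illustrative only; real serving cost also depends on memory footprint, batching, and routing overhead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;awk 'BEGIN {
  total_b  = 1000;   # total parameters, in billions
  active_b = 32;     # active parameters per token, in billions
  printf "active fraction: %.1f%%\n", 100 * active_b / total_b;
  printf "dense vs MoE compute per token: ~%.0fx\n", total_b / active_b;
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;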

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5ae684sfat6snv8czy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5ae684sfat6snv8czy.jpg" alt="MoE architecture diagram showing how Kimi K2.6 routes tokens through 8 of 384 experts, activating only 32B of 1T total parameters" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;K2.6's MoE structure is unusually large-scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;384 expert subnetworks&lt;/strong&gt;, with 8 selected per token plus 1 shared expert&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;61 transformer layers&lt;/strong&gt; (including 1 dense layer)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-head Latent Attention (MLA)&lt;/strong&gt; mechanism for efficient long-context processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;256K token context window&lt;/strong&gt; — enough to process entire large codebases in a single prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoonViT vision encoder&lt;/strong&gt; (400M parameters) for native multimodal input&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 256K context and 160K-token vocabulary round out a model that's clearly engineered for production coding workloads, not benchmark optimization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ MoE models have a catch: they're harder to run locally. At 1T total parameters, K2.6 requires significant hardware even with 8-bit quantization. Community quantizations exist on HuggingFace (via unsloth and ubergarm), but self-hosted K2.6 is a serious infrastructure commitment. If local deployment is your goal, smaller Chinese open-source models may be more practical.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Benchmarks: Where K2.6 Actually Leads
&lt;/h2&gt;

&lt;p&gt;Benchmark theater is a real phenomenon in AI. But some numbers here are worth taking seriously because they map to real engineering workloads.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Kimi K2.6&lt;/th&gt;
&lt;th&gt;Claude Opus 4.7&lt;/th&gt;
&lt;th&gt;GPT-5.4&lt;/th&gt;
&lt;th&gt;Gemini 3.1 Pro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-Bench Pro&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;53.4&lt;/td&gt;
&lt;td&gt;57.7&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HLE Full w/ Tools&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;54.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;53.0&lt;/td&gt;
&lt;td&gt;52.1&lt;/td&gt;
&lt;td&gt;51.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BrowseComp&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;82.7&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-Bench Verified&lt;/td&gt;
&lt;td&gt;80.2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Input Price&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.60/M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$5.00/M&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Output Price&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2.50/M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$25.00/M&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;SWE-Bench Pro&lt;/strong&gt; measures performance on real GitHub issues — actual engineering tasks, not constructed problems. K2.6's 58.6 vs Claude Opus 4.7's 53.4 is a meaningful gap on the metric that matters most to software teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HLE (Humanity's Last Exam) with Tools&lt;/strong&gt; is a research-grade exam specifically designed to resist AI memorization. K2.6 leads all frontier models at 54.0, placing above Claude Opus 4.7 (53.0) and GPT-5.4 (52.1). This is surprising for a model priced as a "budget" alternative.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ These benchmarks are from Moonshot AI's own release. Independent, third-party SWE-Bench Pro evaluations are still catching up. Take the K2.6-specific numbers with the usual caveat applied to vendor benchmarks — the HN community reception and Cursor integration are better early signals than the numbers alone.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Agent Swarm Capability
&lt;/h2&gt;

&lt;p&gt;Beyond raw benchmark scores, K2.6 introduces a capability that doesn't have an obvious analogue in Opus 4.7: &lt;strong&gt;agent swarm scaling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;K2.6 can orchestrate up to &lt;strong&gt;300 sub-agents executing 4,000 coordinated steps&lt;/strong&gt; — decomposing a complex task into parallel, domain-specialized subtasks running simultaneously. According to &lt;a href="https://www.kimi.com/blog/kimi-k2-6" rel="noopener noreferrer"&gt;Moonshot's technical blog&lt;/a&gt;, real-world case studies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimizing Zig inference performance from 15 to 193 tokens/second over a 12-hour autonomous run&lt;/li&gt;
&lt;li&gt;Overhauling a financial matching engine from 0.43 to 1.24 million transactions/second (185% improvement) over a 13-hour session&lt;/li&gt;
&lt;li&gt;Generating full-stack websites with databases from text-only prompts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A "Claw Groups" preview feature lets humans and agents collaborate in a shared operational space, with task-to-agent matching and failure detection. This positions K2.6 less as a chat model and more as an infrastructure primitive for long-horizon background workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Developer Reception: What the HN Thread Reveals
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://news.ycombinator.com/item?id=47835735" rel="noopener noreferrer"&gt;Kimi K2.6 Hacker News thread&lt;/a&gt; scored 592 points with 303 comments within hours of release — unusually strong engagement for a non-US model launch.&lt;/p&gt;

&lt;p&gt;The developer sentiment breaks roughly into thirds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bullish:&lt;/strong&gt; "Dirt cheap on OpenRouter for how good it is" (regularfry). Simon Willison posted a live demo of K2.6 generating animated SVG HTML via OpenRouter, citing it as practical and fast. One commenter confirmed K2.6 &lt;strong&gt;powers Cursor's composer-2 model&lt;/strong&gt; — a real-world quality endorsement that's harder to fake than a benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skeptical:&lt;/strong&gt; "Tried it once... my experience was just okay-ish despite strong benchmarks." Some users report it "does only slightly better than Kimi K2.5" and "struggles with domain-specific tasks."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Philosophical:&lt;/strong&gt; "Funny that Chinese companies are pioneering possibly the world's most important tech via open source while the US goes closed" — a sentiment that lands differently when you consider DeepSeek R1, Qwen, and now K2.6 all dropped open weights.&lt;/p&gt;

&lt;p&gt;The median impression aligns with &lt;a href="https://benchlm.ai/compare/claude-opus-4-7-vs-kimi-k2-5" rel="noopener noreferrer"&gt;BenchLM's Claude Opus 4.7 vs Kimi K2.5 comparison&lt;/a&gt;: Claude leads overall (94 vs 68) with its sharpest advantage in agentic reliability. K2.6 narrows that gap meaningfully, but it has not closed it entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Qwen3.6-Max-Preview Context: Two Chinese Models in One Day
&lt;/h2&gt;

&lt;p&gt;K2.6 didn't land in isolation. On the same day — April 20, 2026 — Alibaba released &lt;a href="https://decrypt.co/364948/alibaba-qwen-3-6-max-preview-most-powerful-model" rel="noopener noreferrer"&gt;Qwen3.6-Max-Preview&lt;/a&gt;, topping six major coding benchmarks including SWE-Bench Pro, Terminal-Bench 2.0, SkillsBench, and SciCode.&lt;/p&gt;

&lt;p&gt;Qwen3.6-Max-Preview is proprietary (no open weights), but the convergence of two major Chinese AI releases on the same day is structurally significant. &lt;a href="https://importai.substack.com/p/import-ai-454-automating-alignment" rel="noopener noreferrer"&gt;Jack Clark's Import AI newsletter&lt;/a&gt; has tracked this arc: Chinese models are no longer "almost competitive" — they're trading leads on specific benchmarks with the frontier models from Anthropic, OpenAI, and Google.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://chinai.substack.com/p/chinai-291-chinese-open-source-models" rel="noopener noreferrer"&gt;ChinAI newsletter&lt;/a&gt; framed it earlier this year: "Chinese open-source models are now leading foreign open-source models and closing in on global first-tier closed-source models." April 20 is a data point, not an anomaly.&lt;/p&gt;

&lt;p&gt;If you've been following &lt;a href="https://computeleap.com/blog/qwen3-35b-a3b-local-mac-setup-lm-studio-open-source" rel="noopener noreferrer"&gt;our Qwen3-35B-A3B local setup guide&lt;/a&gt;, K2.6 is the cloud-API counterpart to that story — optimized for different constraints but part of the same structural trend.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use Kimi K2.6
&lt;/h2&gt;

&lt;p&gt;K2.6 is the right choice when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-horizon coding tasks&lt;/strong&gt; — multi-hour autonomous runs on well-scoped engineering problems, where the agent swarm architecture pays off&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-volume production workloads&lt;/strong&gt; — teams spending $5K+/month on Opus-level API calls where the 88% cost delta is real money&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-shot code generation&lt;/strong&gt; — initial code scaffolding, UI generation from design prompts, full-stack boilerplate where SWE-Bench Pro performance matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent orchestration&lt;/strong&gt; — building multi-agent systems (see &lt;a href="https://computeleap.com/blog/openai-agents-python-tutorial-multi-agent-ai-workflows-2026" rel="noopener noreferrer"&gt;our OpenAI Agents Python SDK tutorial&lt;/a&gt; for framework context) where K2.6's 300-sub-agent ceiling gives headroom&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-tier architectures&lt;/strong&gt; — using K2.6 for first-pass generation and Claude for final review/validation captures most of the cost savings without sacrificing output quality&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When Claude Opus 4.7 Is Still Worth the Premium
&lt;/h2&gt;

&lt;p&gt;Stick with Opus 4.7 when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex reasoning under ambiguity&lt;/strong&gt; — open-ended problems where the model needs judgment, not execution; Claude's agentic reliability lead is real&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production workloads where errors are expensive&lt;/strong&gt; — if a wrong answer costs $10K to fix, the API call price is irrelevant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise compliance&lt;/strong&gt; — Anthropic's usage policies, data handling, and audit trails are more mature than Moonshot's at the enterprise procurement level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal tasks requiring judgment&lt;/strong&gt; — vision tasks that need contextual interpretation, not just image recognition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative and long-form writing&lt;/strong&gt; — anecdotal but consistent: Claude's prose quality and editorial judgment remain ahead&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 The hybrid approach is underrated: use K2.6 for code generation and execution, Claude Opus 4.7 for planning and validation. Our &lt;a href="https://computeleap.com/blog/anthropic-vs-openai-api-developer-platform-2026" rel="noopener noreferrer"&gt;API cost comparison&lt;/a&gt; showed that most production AI spend is concentrated in generation volume — exactly where the K2.6 cost advantage is largest.&lt;/p&gt;
&lt;/blockquote&gt;
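
&lt;p&gt;A minimal sketch of that two-tier pattern over plain HTTP, with K2.6 generating and Claude reviewing. It assumes jq is installed, a Bearer token for the OpenAI-compatible Kimi endpoint, and an Anthropic model ID of &lt;code&gt;claude-opus-4-7&lt;/code&gt;; treat all three as placeholders to verify against the providers' docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Tier 1: generate a draft with Kimi K2.6 (OpenAI-compatible endpoint)
DRAFT=$(curl -s https://api.kimi.com/v1/chat/completions \
  -H "Authorization: Bearer $KIMI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"kimi-k2.6","max_tokens":4096,
       "messages":[{"role":"user","content":"Write a Python token-bucket rate limiter class"}]}' \
  | jq -r '.choices[0].message.content')

# Tier 2: have Claude review the draft via Anthropic's Messages API
# (model ID below is a placeholder)
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg code "$DRAFT" \
        '{model:"claude-opus-4-7", max_tokens:2048,
          messages:[{role:"user",
            content:("Review this code for correctness and edge cases:\n\n" + $code)}]}')" \
  | jq -r '.content[0].text'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;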




&lt;h2&gt;
  
  
  Accessing K2.6: Your Options
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kimi.com API (direct):&lt;/strong&gt; &lt;code&gt;$0.60/M&lt;/code&gt; input, &lt;code&gt;$2.50/M&lt;/code&gt; output. Compatible with the OpenAI Python SDK via base URL swap — no code refactoring if you're already calling OpenAI-compatible endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter:&lt;/strong&gt; &lt;code&gt;$0.60/M&lt;/code&gt; input, &lt;code&gt;$2.80/M&lt;/code&gt; output (slight markup). Useful for routing alongside other models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted:&lt;/strong&gt; Available on HuggingFace under Modified MIT license. Requires &lt;code&gt;transformers &amp;gt;=4.57.1&lt;/code&gt;. Recommended inference: vLLM or SGLang. Commercial restriction applies for entities with 100M+ MAU or $20M+ monthly revenue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Drop-in replacement for OpenAI-compatible code
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-kimi-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.kimi.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kimi-k2.6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your prompt here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OpenAI SDK compatibility is the practical win here — most teams can A/B test K2.6 against their current model with a one-line base URL change.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Kimi K2.6 is not a Claude Opus 4.7 replacement for all workloads. But for code generation at volume, long-horizon agent tasks, and cost-sensitive production workloads, K2.6 delivers at a price point that makes the tradeoffs genuinely favorable.&lt;/p&gt;

&lt;p&gt;The hidden cost of cheap models is real — we covered it &lt;a href="https://computeleap.com/blog/hidden-cost-cheap-ai-reasoning-models-2026" rel="noopener noreferrer"&gt;here&lt;/a&gt;. But the hidden cost of expensive models is also real: teams that overpay for capabilities they don't use, or avoid running AI on high-volume tasks because the math doesn't work. K2.6 makes more tasks economically viable, and that's worth something even if you keep Claude for the hard stuff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick decision:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-volume coding generation → &lt;strong&gt;K2.6&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Complex reasoning, enterprise compliance, judgment-heavy tasks → &lt;strong&gt;Claude Opus 4.7&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Both → &lt;strong&gt;two-tier architecture&lt;/strong&gt; (K2.6 generates, Claude validates; see the sketch below)&lt;/li&gt;
&lt;/ul&gt;
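
&lt;p&gt;If the two-tier route fits, the wiring is short: generate on the cheap endpoint, validate on the expensive one. Here's a minimal sketch, assuming the Kimi endpoint from the snippet above plus the Anthropic Python SDK; the Claude model ID string is a placeholder, so check Anthropic's docs for the exact identifier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Two-tier sketch: K2.6 drafts, Claude Opus validates.
# "claude-opus-4-7" is a placeholder model ID, not a confirmed identifier.
import openai
import anthropic

kimi = openai.OpenAI(api_key="your-kimi-api-key", base_url="https://api.kimi.com/v1")
claude = anthropic.Anthropic(api_key="your-anthropic-api-key")

def generate_and_validate(task):
    # Tier 1: high-volume generation on the cheaper model
    draft = kimi.chat.completions.create(
        model="kimi-k2.6",
        messages=[{"role": "user", "content": task}],
        max_tokens=4096,
    ).choices[0].message.content

    # Tier 2: a single validation pass on the premium model
    review = claude.messages.create(
        model="claude-opus-4-7",  # placeholder ID
        max_tokens=1024,
        messages=[{"role": "user", "content": "Review this code for correctness and risky edge cases:\n\n" + draft}],
    )
    return draft, review.content[0].text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;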




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://computeleap.com/blog/kimi-k2-6-vs-claude-opus-47-open-source-chinese-ai-model-comparison-2026" rel="noopener noreferrer"&gt;ComputeLeap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>deer-flow vs evolver vs GenericAgent: Production-Ready?</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:22:45 +0000</pubDate>
      <link>https://forem.com/max_quimby/deer-flow-vs-evolver-vs-genericagent-production-ready-33m6</link>
      <guid>https://forem.com/max_quimby/deer-flow-vs-evolver-vs-genericagent-production-ready-33m6</guid>
      <description>&lt;h1&gt;
  
  
  deer-flow vs evolver vs GenericAgent: Production-Ready?
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://agentconn.com/blog/self-evolving-ai-agents-evolver-genericagent-deerflow-comparison-2026" rel="noopener noreferrer"&gt;Read the full version with diagrams and embedded sources on AgentConn →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On April 19, 2026, three self-evolving agent frameworks landed simultaneously in GitHub's global top 10: &lt;a href="https://github.com/bytedance/deer-flow" rel="noopener noreferrer"&gt;bytedance/deer-flow&lt;/a&gt; at 62,800 stars, &lt;a href="https://github.com/EvoMap/evolver" rel="noopener noreferrer"&gt;EvoMap/evolver&lt;/a&gt; at 5,700 stars, and &lt;a href="https://github.com/lsdefine/GenericAgent" rel="noopener noreferrer"&gt;lsdefine/GenericAgent&lt;/a&gt; at 4,600 stars. That's not three projects trending. That's a category arriving.&lt;/p&gt;

&lt;p&gt;The timing matters. We've &lt;a href="https://agentconn.com/blog/self-evolving-ai-agents-genericagent-evomap-skill-trees-guide" rel="noopener noreferrer"&gt;already covered GenericAgent and EvoMap's skill-tree approaches&lt;/a&gt; in detail. What hasn't been covered is how they compare to deer-flow, which is by far the largest of the three — and how all three stack up on the question that actually matters for teams considering them: can you run this in production without it becoming a liability?&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Self-Evolving" Actually Means (And What It Doesn't)
&lt;/h2&gt;

&lt;p&gt;Before comparing frameworks, the clarification that saves everyone time: &lt;strong&gt;none of these systems modify their underlying model weights.&lt;/strong&gt; This is important because the marketing doesn't always make it clear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2507.21046" rel="noopener noreferrer"&gt;The academic survey that anchors this category&lt;/a&gt; defines the feedback loop cleanly: agent executes a task → environment responds → optimizer extracts patterns → skill store is updated → next execution draws on those patterns. The agent improves over time not because the model gets smarter, but because the tools available to the model improve.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://news.ycombinator.com/item?id=44884091" rel="noopener noreferrer"&gt;Hacker News discussion&lt;/a&gt; put it plainly: "Self-improvement is really prompt/tool optimization, not weight updates." The skeptic position is correct if you're expecting AGI-style capability jumps. The practitioner position is also correct: process recursion — skill accumulation — is a genuine capability improvement, even if it's not the learning the term implies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=44884091" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fagentconn.com%2Fblog%2Fhn-self-evolving-agents-survey-94pts.png" alt="HN: A Comprehensive Survey of Self-Evolving AI Agents — 94 points, 29 comments" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With that framing established, here are the three frameworks.&lt;/p&gt;




&lt;h2&gt;
  
  
  deer-flow (ByteDance) — The SuperAgent Harness
&lt;/h2&gt;

&lt;p&gt;At 62,800 stars, deer-flow isn't just the largest self-evolving framework on GitHub — it's one of the largest agent frameworks period. It claimed #1 on GitHub Trending in February 2026 when version 2 launched, and crossed 60,000 stars within weeks.&lt;/p&gt;

&lt;p&gt;The core concept is what ByteDance calls a "SuperAgent harness." Rather than a single intelligent agent, deer-flow is &lt;a href="https://github.com/bytedance/deer-flow" rel="noopener noreferrer"&gt;an orchestration runtime&lt;/a&gt; that gives agents the infrastructure to actually get work done: a lead agent decomposes complex tasks into parallelizable sub-tasks, spawns sub-agents with scoped contexts, runs them concurrently, and synthesizes the results into a coherent output. The framework handles tasks that "take minutes to hours."&lt;/p&gt;

&lt;p&gt;What makes this concrete is the execution environment. As &lt;a href="https://dev.to/arshtechpro/deerflow-20-what-it-is-how-it-works-and-why-developers-should-pay-attention-3ip3"&gt;Dev.to's technical breakdown&lt;/a&gt; put it directly: "The agent does not suggest a bash command. It runs it." Deer-flow provides agents with an isolated Docker container with filesystem access and a bash terminal — actual compute, not a sandbox emulation.&lt;/p&gt;

&lt;p&gt;Key architecture decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sub-agent parallelization&lt;/strong&gt;: Scoped contexts, concurrent execution, convergent synthesis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory&lt;/strong&gt;: Asynchronous debounced queue tracking user preferences and project state across sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills system&lt;/strong&gt;: Markdown-based workflow definitions (extensible without code changes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model agnosticism&lt;/strong&gt;: Works with GPT-4, Claude, DeepSeek, Kimi, Doubao-Seed, and Ollama&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The production deployment guidance is notably serious. The documentation specifies 8+ vCPU / 16GB RAM minimum for server deployment, Docker-based production and development modes, and explicit warnings about untrusted network exposure with IP allowlisting and VLAN isolation recommendations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ByteDance factor:&lt;/strong&gt; &lt;a href="https://venturebeat.com/orchestration/what-is-deerflow-and-what-should-enterprises-know-about-this-new-local-ai" rel="noopener noreferrer"&gt;VentureBeat noted&lt;/a&gt; that "ByteDance provenance may trigger organizational review processes." Enterprise teams in regulated industries or US government-adjacent environments should route this through procurement before deploying. MIT-licensed, fully auditable codebase — but the organizational source still matters for some teams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/bytedance/deer-flow" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fagentconn.com%2Fblog%2Fhn-deerflow-bytedance-superagent-62k.png" alt="DeerFlow: 62,800 GitHub stars, #1 trending Feb 2026, ByteDance SuperAgent harness" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built on:&lt;/strong&gt; LangGraph + LangChain. If your team already uses LangGraph for orchestration, deer-flow's mental model will feel familiar.&lt;/p&gt;




&lt;h2&gt;
  
  
  evolver (EvoMap) — Genome Evolution Protocol
&lt;/h2&gt;

&lt;p&gt;At 5,700 stars, &lt;a href="https://github.com/EvoMap/evolver" rel="noopener noreferrer"&gt;EvoMap/evolver&lt;/a&gt; is the smallest of the three by star count but the most distinctive by architecture. It introduced the Genome Evolution Protocol (GEP) — a framework for treating prompt evolution as a structured, auditable process analogous to biological gene expression.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://evomap.ai/blog/gep-protocol-deep-dive" rel="noopener noreferrer"&gt;GEP deep dive&lt;/a&gt; explains the key insight: rather than letting agents evolve through raw trial-and-error, GEP solidifies successful behaviors into three reusable asset types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Genes&lt;/strong&gt;: Atomic capability units — validated code or prompt fragments for a single operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capsules&lt;/strong&gt;: Successful task execution paths — complex problem solutions encoded as reusable workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Events&lt;/strong&gt;: Immutable evolution logs — every mutation (Innovation) or repair (Repair) recorded with full context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operational logic is disciplined: the 70/30 rule allocates 70% of compute to stability (Repair mode) and 30% to capability expansion (Feature mode). When crashes or tool call failures are detected, evolver enters Repair Mode and follows explicit protocol gates before any mutation.&lt;/p&gt;
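
&lt;p&gt;To make the asset types and the 70/30 split concrete, here's a hypothetical sketch of the shapes involved. The field names are illustrative, not evolver's actual schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of GEP-style assets and the 70/30 mode split.
# Field names are illustrative; see the evolver repo for the real schema.
import random
from dataclasses import dataclass

@dataclass
class Gene:        # atomic capability unit: one validated prompt or code fragment
    name: str
    fragment: str

@dataclass
class Capsule:     # a successful execution path, encoded as a reusable workflow
    task: str
    steps: list

@dataclass
class Event:       # immutable evolution log entry
    kind: str      # "Innovation" (mutation) or "Repair"
    context: dict

def pick_mode(crash_detected):
    # Crashes or tool-call failures force Repair Mode; otherwise the 70/30
    # rule allocates roughly 70% of cycles to stability, 30% to new capability.
    if crash_detected:
        return "repair"
    return random.choices(["repair", "feature"], weights=[70, 30])[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;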

&lt;p&gt;Critically: &lt;strong&gt;evolver does not edit code directly.&lt;/strong&gt; It generates guided prompts for human review or integration with host runtimes. This limits scope — and also limits blast radius.&lt;/p&gt;

&lt;p&gt;The launch story is worth knowing: evolver hit the top of ClawHub within 10 minutes of release in February 2026, racking up 36,000 downloads in three days. It later became the center of a plagiarism controversy when EvoMap accused Hermes Agent (released March 2026) of copying evolver's self-evolution architecture — Hermes Agent shipped a similar feature only 24-39 days after evolver's open-source release.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/EvoMap/evolver" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fagentconn.com%2Fblog%2Fhn-evolver-evomap-gep-protocol.png" alt="EvoMap/evolver: GEP Genome Evolution Protocol — 5,700 stars, 36K ClawHub downloads" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that need compliance-friendly audit trails for agent behavior changes, or deployments in regulated environments where agent mutations need to be explainable.&lt;/p&gt;




&lt;h2&gt;
  
  
  GenericAgent (lsdefine) — The Minimal Skill Tree
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/lsdefine/GenericAgent" rel="noopener noreferrer"&gt;GenericAgent&lt;/a&gt; makes its design philosophy explicit: "grows a skill tree from a 3,300-line seed, achieving full system control with 6x less token consumption." The Fudan University team built something unusually minimal — the entire framework is ~3K lines with a ~100-line agent loop.&lt;/p&gt;

&lt;p&gt;The architecture is built around five layers of memory (L0–L4):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L0&lt;/strong&gt;: Meta-rules (agent identity and constraints)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L1&lt;/strong&gt;: Insights (generalized patterns from past tasks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L2&lt;/strong&gt;: Global facts (persistent world knowledge)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3&lt;/strong&gt;: Task skills (crystallized execution paths from completed tasks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L4&lt;/strong&gt;: Session archives (full interaction logs, added April 2026)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When GenericAgent completes a task, it automatically crystallizes the execution path as a skill file. As &lt;a href="https://pyshine.com/GenericAgent-Self-Evolving-AI-Agent/" rel="noopener noreferrer"&gt;PyShine's walkthrough&lt;/a&gt; notes: "After a few weeks, an agent instance will have a skill tree no one else in the world has — all grown from 3K lines of seed code."&lt;/p&gt;

&lt;p&gt;The token efficiency claim is real and measurable. Where comparable agents require 200K–1M token context windows, GenericAgent operates under 30K by loading only relevant skills from memory rather than the full history. The "6x less" figure comes from this selective loading compared to agents that stuff entire conversation histories into context.&lt;/p&gt;
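
&lt;p&gt;The mechanism is easy to picture: score stored skills against the incoming task and load only the best fits under a token budget. An illustrative sketch, not GenericAgent's actual code; the scoring and the budget math are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of selective skill loading (not GenericAgent's code).
# Score stored skills against the task, then fill the context up to a budget
# instead of replaying the entire interaction history.

def build_context(task, skill_files, token_budget=30_000):
    def relevance(skill_text):
        # crude score: count of words the task and the skill share
        return len(set(task.lower().split()).intersection(skill_text.lower().split()))

    ranked = sorted(skill_files, key=relevance, reverse=True)
    selected, used = [], 0
    for skill in ranked:
        cost = len(skill) // 4              # rough token estimate
        if used + cost &amp;gt; token_budget:   # stop once the budget is spent
            break
        selected.append(skill)
        used += cost
    return "\n\n".join(selected)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;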

&lt;p&gt;Nine atomic tools cover the full system control surface: browser (with preserved login sessions), terminal, filesystem, keyboard/mouse input, screen vision, and mobile ADB. Multi-model: supports Claude, Gemini, Kimi, MiniMax.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/lsdefine/GenericAgent" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fagentconn.com%2Fblog%2Fhn-genericagent-skill-tree-4k.png" alt="GenericAgent: 4,600 stars, 6x token reduction, Fudan team self-evolving skill tree" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Cost-conscious teams running long-running autonomous agents where token efficiency directly maps to operational cost. Also the most approachable codebase of the three — 3,300 lines is something a team can actually audit in a week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Security Reality No One Mentions
&lt;/h2&gt;

&lt;p&gt;All three frameworks share a category-level risk that &lt;a href="https://simonw.substack.com/p/the-lethal-trifecta-for-ai-agents" rel="noopener noreferrer"&gt;Simon Willison identified&lt;/a&gt; as "the lethal trifecta": if an agent combines (1) access to private data, (2) exposure to untrusted content, and (3) the ability to externally communicate, an attacker can trick it into exfiltrating private data to an external endpoint. Self-evolving agents make this attack surface significantly larger than standard API-call agents.&lt;/p&gt;
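
&lt;p&gt;The trifecta is concrete enough to check mechanically before an agent ships. An illustrative heuristic, not a feature of any of these frameworks; the capability flag names are made up for the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative heuristic (not from any of these frameworks): flag agent
# configs that combine all three legs of the lethal trifecta.
TRIFECTA = ("reads_private_data", "ingests_untrusted_content", "can_send_externally")

def trifecta_risk(agent_config):
    """agent_config: dict of capability flags, e.g. {"reads_private_data": True}."""
    legs = [name for name in TRIFECTA if agent_config.get(name)]
    return len(legs) == 3, legs

risky, legs = trifecta_risk({
    "reads_private_data": True,
    "ingests_untrusted_content": True,
    "can_send_externally": True,
})
# risky is True here: at least one leg should be removed or mediated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;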

&lt;p&gt;The &lt;a href="https://www.gravitee.io/blog/state-of-ai-agent-security-2026-report-when-adoption-outpaces-control" rel="noopener noreferrer"&gt;2026 AI Agent Security Report&lt;/a&gt; puts it starkly: 88% of organizations confirmed or suspected security incidents involving AI agents in the last year. Only 24.4% have full visibility into which agents are communicating with each other. More than half run with no security oversight or logging.&lt;/p&gt;

&lt;p&gt;For self-evolving frameworks specifically, the risk compounds: if the framework modifies agent behavior over time (as all three do), security review at deployment isn't sufficient — you need ongoing behavioral monitoring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.bvp.com/atlas/securing-ai-agents-the-defining-cybersecurity-challenge-of-2026" rel="noopener noreferrer"&gt;Bessemer Venture Partners&lt;/a&gt; frames the identity problem: "In a mature agentic ecosystem, swarms of agents may be instantiated to perform a single task and then decommissioned within minutes — traditional security architectures that rely on periodic scans will fail to detect these identities entirely."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical mitigation per framework:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deer-flow&lt;/strong&gt;: Docker sandbox isolation is built-in; use it. Enable IP allowlisting and VLAN isolation as the docs recommend. Monitor sub-agent spawning rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;evolver&lt;/strong&gt;: Use Review mode and validation steps. The audit trail via Events is the strongest governance artifact of the three.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GenericAgent&lt;/strong&gt;: Audit the skill tree periodically. Skills accumulate without a built-in approval gate — add one in production deployments.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fagentconn.com%2Fblog%2Fself-evolving-ai-agents-evolver-genericagent-deerflow-comparison-2026-diagram-1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fagentconn.com%2Fblog%2Fself-evolving-ai-agents-evolver-genericagent-deerflow-comparison-2026-diagram-1.jpg" alt="Comparison table: deer-flow vs evolver vs GenericAgent — stars, architecture, security, production readiness" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;deer-flow&lt;/th&gt;
&lt;th&gt;evolver&lt;/th&gt;
&lt;th&gt;GenericAgent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;62.8k&lt;/td&gt;
&lt;td&gt;5.7k&lt;/td&gt;
&lt;td&gt;4.6k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python + TypeScript&lt;/td&gt;
&lt;td&gt;JavaScript&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-evolution type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sub-agent + memory&lt;/td&gt;
&lt;td&gt;Prompt/gene evolution&lt;/td&gt;
&lt;td&gt;Skill tree accumulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;6x vs. alternatives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandbox&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker (built-in)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LangSmith/Langfuse&lt;/td&gt;
&lt;td&gt;Built-in Events log&lt;/td&gt;
&lt;td&gt;Session archive (L4)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ByteDance provenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production-ready&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (with hardening)&lt;/td&gt;
&lt;td&gt;Yes (limited scope)&lt;/td&gt;
&lt;td&gt;Yes (with monitoring)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Choose deer-flow&lt;/strong&gt; when you're building long-horizon autonomous tasks — research pipelines, multi-step code generation, content workflows that run for hours. The Docker sandbox, sub-agent parallelization, and extensive deployment documentation make it the most enterprise-ready despite the ByteDance provenance consideration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose evolver&lt;/strong&gt; when compliance and audit trails are non-negotiable. The GEP protocol's structured mutation model is the only framework here that produces a legally defensible record of every agent behavior change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose GenericAgent&lt;/strong&gt; when token cost is the primary constraint, or when you want a framework small enough to audit completely. The 3,300-line codebase is readable by a small team in a week. The 6x token efficiency advantage is real and meaningful at production scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;None of the above&lt;/strong&gt; if you're building a customer-facing application where adversarial users could reach the agent with untrusted content. All three need additional input sanitization and communication controls before they're safe in that context.&lt;/p&gt;

&lt;p&gt;For context on related frameworks: the &lt;a href="https://agentconn.com/blog/nousresearch-hermes-agent-self-improving-framework-review" rel="noopener noreferrer"&gt;hermes-agent review&lt;/a&gt; covers NousResearch's self-improving framework (95.6K stars), which is the highest-starred in this category but follows a different architectural approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;deer-flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/bytedance/deer-flow
&lt;span class="nb"&gt;cd &lt;/span&gt;deer-flow &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visit &lt;code&gt;localhost:3000&lt;/code&gt;. Works with any OpenAI-compatible API key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;evolver:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @evomap/evolver
evolver init &lt;span class="nt"&gt;--mode&lt;/span&gt; review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review mode prevents any mutation from applying without human confirmation — recommended for first deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GenericAgent:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/lsdefine/GenericAgent
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python agent.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;code&gt;GETTING_STARTED.md&lt;/code&gt; in the repo — the Fudan team wrote unusually clear onboarding documentation.&lt;/p&gt;




&lt;p&gt;The category is real. Three frameworks at 62.8k, 5.7k, and 4.6k stars trending simultaneously isn't noise — it's the infrastructure layer of agentic AI arriving in production-deployable form. The question isn't whether to pay attention; it's which one fits your actual use case, and whether your team has thought through the security posture before the first deployment.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2507.21046" rel="noopener noreferrer"&gt;comprehensive academic survey&lt;/a&gt; ends with an observation worth sitting with: "The challenge isn't making agents that learn — it's making agents whose learning is observable, bounded, and reversible." All three frameworks here have made progress on the first goal. The second and third are still largely up to the team deploying them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://agentconn.com/blog/self-evolving-ai-agents-evolver-genericagent-deerflow-comparison-2026" rel="noopener noreferrer"&gt;AgentConn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>security</category>
    </item>
    <item>
      <title>openai-agents-python: Build Multi-Agent AI Workflows (2026)</title>
      <dc:creator>Max Quimby</dc:creator>
      <pubDate>Mon, 20 Apr 2026 03:38:53 +0000</pubDate>
      <link>https://forem.com/max_quimby/openai-agents-python-build-multi-agent-ai-workflows-2026-45gk</link>
      <guid>https://forem.com/max_quimby/openai-agents-python-build-multi-agent-ai-workflows-2026-45gk</guid>
      <description>&lt;p&gt;OpenAI's &lt;a href="https://github.com/openai/openai-agents-python" rel="noopener noreferrer"&gt;openai-agents-python&lt;/a&gt; crossed 22,981 GitHub stars this week — gaining 751 in a single day and landing at #2 on GitHub's global trending list. That's not hype noise. It's developer validation. And it happened the same week OpenAI rolled out sandbox execution support for enterprise deployments, cementing this library's position as the most-starred agent framework on the platform.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://computeleap.com/blog/openai-agents-python-tutorial-multi-agent-ai-workflows-2026" rel="noopener noreferrer"&gt;Read the full version with charts, code, and embedded sources on ComputeLeap →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But star counts tell you nothing about whether something is worth learning. So this tutorial skips the marketing and goes straight to the code. By the end, you'll have a working multi-agent research pipeline you can actually run — and an honest assessment of when this SDK makes sense versus building the same workflow with Anthropic's Claude.&lt;/p&gt;

&lt;p&gt;Today's intelligence signals confirm what GitHub is showing: &lt;strong&gt;5 of the top 7 trending AI repos are explicitly multi-agent or self-evolving systems&lt;/strong&gt;. The infrastructure layer is materializing. If you're a developer building anything AI-adjacent in 2026, understanding how agent orchestration actually works — not in theory, but in production — is now a baseline skill.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=35nxORG1mtg" rel="noopener noreferrer"&gt;▶️ Watch: Agents SDK from OpenAI! Full Tutorial&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why openai-agents-python Is Having Its Moment
&lt;/h2&gt;

&lt;p&gt;The library is the official, production-ready successor to OpenAI's experimental &lt;a href="https://github.com/openai/swarm" rel="noopener noreferrer"&gt;Swarm&lt;/a&gt; library. Where Swarm was a research demo, &lt;code&gt;openai-agents-python&lt;/code&gt; ships the same multi-agent primitives in a framework that's designed for real deployments.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ The SDK is provider-agnostic — it works with OpenAI's APIs and supports 100+ additional LLMs via LiteLLM and compatible adapters. So despite the OpenAI branding, you're not locked in at the model layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nine capabilities ship out of the box:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt; — LLMs configured with instructions, tools, guardrails, and handoffs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandbox Agents&lt;/strong&gt; — agents running inside isolated containers for extended tasks (&lt;a href="https://techcrunch.com/2026/04/15/openai-updates-its-agents-sdk-to-help-enterprises-build-safer-more-capable-agents/" rel="noopener noreferrer"&gt;TechCrunch, April 2026&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Delegation&lt;/strong&gt; — agents that function as tools, callable by other agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — function tools, MCP integrations, and hosted tools (file search, web search, code interpreter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails&lt;/strong&gt; — input/output validation with blocking and tripwire modes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human In The Loop&lt;/strong&gt; — structured pause points for human review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sessions&lt;/strong&gt; — automatic conversation history management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracing&lt;/strong&gt; — built-in observability integrating with OpenAI's dashboard, Logfire, and OpenTelemetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice&lt;/strong&gt; — support for &lt;code&gt;gpt-realtime-1.5&lt;/code&gt; voice agents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The current release, v0.13, added an any-LLM adapter, opt-in retry policies, MCP resource support, and session persistence — making it meaningfully more production-ready than it was at launch. The &lt;a href="https://softmaxdata.com/blog/definitive-guide-to-agentic-frameworks-in-2026-langgraph-crewai-ag2-openai-and-more/" rel="noopener noreferrer"&gt;Definitive Guide to Agentic Frameworks in 2026&lt;/a&gt; ranks it among the top 3 most actively developed frameworks alongside LangGraph and Microsoft's Agent Framework.&lt;/p&gt;
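
&lt;p&gt;Session persistence is worth a quick look before the setup section, because it removes the most tedious part of multi-turn agents: threading conversation history by hand. A short sketch based on the SDK's documented Sessions feature; verify the class name against the docs for your installed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sessions: the SDK stores and replays conversation history for you.
# Based on the documented Sessions feature; check your installed version's docs.
from agents import Agent, Runner, SQLiteSession

agent = Agent(name="Assistant", instructions="Reply concisely.")
session = SQLiteSession("user-42")  # history keyed by this ID, persisted to SQLite

Runner.run_sync(agent, "My favorite city is Lisbon.", session=session)
followup = Runner.run_sync(agent, "What did I say my favorite city was?", session=session)
print(followup.final_output)  # answerable only because the session replayed turn one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;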

&lt;h2&gt;
  
  
  Installation and Setup
&lt;/h2&gt;

&lt;p&gt;Requirements: Python 3.10+, an OpenAI API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openai-agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For voice support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"openai-agents[voice]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your first agent in under 10 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → "The capital of France is Paris."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the complete hello world. &lt;code&gt;Agent&lt;/code&gt; defines the LLM + instructions + tools. &lt;code&gt;Runner&lt;/code&gt; executes it. &lt;code&gt;run_sync&lt;/code&gt; blocks until the agent produces its final output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concepts in 5 Minutes
&lt;/h2&gt;

&lt;p&gt;Before building anything non-trivial, you need to understand five primitives.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agents
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;

&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You research topics thoroughly.
    Always provide sources and key facts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;model&lt;/code&gt; parameter defaults to &lt;code&gt;gpt-4o&lt;/code&gt; if omitted. You can swap in any OpenAI model, or any LiteLLM-compatible endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Function Tools
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the web for information on a topic.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Your search implementation here
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Results for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use search_web to find information.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@function_tool&lt;/code&gt; decorator auto-generates the JSON schema from your function signature and docstring. Pydantic validation runs on every call — no manual schema writing required.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Handoffs
&lt;/h3&gt;

&lt;p&gt;Handoffs let one agent transfer control entirely to another:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;

&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write clear, engaging content based on research provided.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research the topic, then hand off to the Writer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handoffs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the researcher decides the user would be better served by the writer, it hands off and the writer takes over the conversation entirely. This is a one-way transfer — the researcher is done.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Agent as Tool
&lt;/h3&gt;

&lt;p&gt;The alternative pattern keeps one agent in charge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;writer_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;draft_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Draft written content from a research summary.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;coordinator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coordinator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Orchestrate research and writing. Use draft_content to get the writer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s output.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;writer_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the coordinator calls the writer as a function and receives its output — the coordinator never loses control of the conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Guardrails
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GuardrailFunctionOutput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_guardrail&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SafetyCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;is_safe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="nd"&gt;@input_guardrail&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safety_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;malicious&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GuardrailFunctionOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;output_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SafetyCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_safe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flagged content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;tripwire_triggered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GuardrailFunctionOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;output_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SafetyCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_safe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tripwire_triggered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;safe_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SafeAgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Help users with their questions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_guardrails&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;safety_check&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;tripwire_triggered=True&lt;/code&gt;, the agent never executes — preventing token spend on inputs that would fail downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your First Multi-Agent Workflow
&lt;/h2&gt;

&lt;p&gt;Here's a complete, runnable research pipeline with three specialized agents. You can copy and run this directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;

&lt;span class="c1"&gt;# --- Tool definitions ---
&lt;/span&gt;
&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the web for information on a given query.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Replace with your actual search API (Tavily, SerpAPI, etc.)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Search results for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: Top 5 results found.]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Save a draft to disk.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved draft to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# --- Agent definitions ---
&lt;/span&gt;
&lt;span class="n"&gt;reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a critical editor. Review drafts for:
    - Accuracy and factual claims
    - Clear structure and flow
    - Specific, actionable improvements
    Provide a verdict: APPROVED or NEEDS_REVISION.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a clear, concise technical writer.
    Write well-structured content from research notes.
    When done, hand off to the Reviewer for quality check.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;save_draft&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;handoffs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You research topics thoroughly using web_search.
    Gather at least 3 distinct facts or perspectives.
    Summarize your findings, then hand off to the Writer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;handoffs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Run the pipeline ---
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🔍 Starting research pipeline for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research this topic and produce a written summary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;✅ Pipeline complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Final output:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s openai-agents-python SDK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a &lt;strong&gt;chain&lt;/strong&gt;: &lt;code&gt;Researcher → Writer → Reviewer&lt;/code&gt;. Each agent does its job and hands off. The &lt;code&gt;Runner&lt;/code&gt; handles the entire execution loop — including managing multiple turns if an agent needs to call tools before handing off.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 The &lt;a href="https://cookbook.openai.com/examples/agents_sdk/multi-agent-portfolio-collaboration/multi_agent_portfolio_collaboration" rel="noopener noreferrer"&gt;OpenAI Cookbook's multi-agent portfolio collaboration example&lt;/a&gt; is the best reference for production-style patterns — a coordinator calls data analyst, statistician, and report writer as tools and merges their outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For debugging, enable tracing to see every step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt;
&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enable_verbose_stdout_logging&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full trace — every LLM call, tool execution, and handoff — is viewable in the OpenAI Traces Dashboard. This is essential for debugging where a pipeline stalls in production.&lt;/p&gt;
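
&lt;p&gt;If you want several runs to show up as a single workflow in that dashboard, wrap them in one trace. A minimal sketch, assuming the &lt;code&gt;trace&lt;/code&gt; context manager the SDK exports and reusing the &lt;code&gt;researcher&lt;/code&gt; agent from the pipeline above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agents import Runner, trace

async def run_traced_pipeline(topic: str):
    # Everything inside this block is grouped as one workflow in the Traces dashboard.
    with trace("Research pipeline"):
        result = await Runner.run(researcher, f"Research this topic: {topic}")
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;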

&lt;h2&gt;
  
  
  Handoffs vs. Agent-as-Tool: Which Pattern to Use
&lt;/h2&gt;

&lt;p&gt;This is the core architectural decision in multi-agent systems. The &lt;a href="https://openai.github.io/openai-agents-python/multi_agent/" rel="noopener noreferrer"&gt;official multi-agent docs&lt;/a&gt; define the distinction clearly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Handoff&lt;/th&gt;
&lt;th&gt;Agent-as-Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specialist takes over&lt;/td&gt;
&lt;td&gt;Manager retains control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Conversation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specialist responds directly&lt;/td&gt;
&lt;td&gt;Manager synthesizes output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Routing workflows&lt;/td&gt;
&lt;td&gt;Aggregation workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Customer service triage&lt;/td&gt;
&lt;td&gt;Report generation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use handoffs&lt;/strong&gt; when the conversation is inherently routing — the user interacts with whichever specialist is most relevant, and you want that specialist to own the exchange.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use agent-as-tool&lt;/strong&gt; when a manager needs to collect results from multiple specialists and synthesize them. The portfolio collaboration example from OpenAI's cookbook demonstrates this: a coordinator calls a data analyst, statistician, and report writer as tools, then merges their outputs into a final deliverable.&lt;/p&gt;
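
&lt;p&gt;To make the contrast concrete, here is a rough sketch of the agent-as-tool pattern using the SDK's &lt;code&gt;as_tool()&lt;/code&gt; helper. The agent names and instructions are illustrative, not taken from the cookbook example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agents import Agent, Runner

analyst = Agent(name="Analyst", instructions="Analyze the data you are given.")
report_writer = Agent(name="ReportWriter", instructions="Turn findings into a short report.")

# The manager keeps control: specialists are exposed as callable tools,
# and the manager synthesizes their outputs into a single answer.
manager = Agent(
    name="Manager",
    instructions="Call the analysis tool, then the report tool, and merge the results.",
    tools=[
        analyst.as_tool(
            tool_name="run_analysis",
            tool_description="Analyze raw data and return key findings.",
        ),
        report_writer.as_tool(
            tool_name="write_report",
            tool_description="Turn findings into a short report.",
        ),
    ],
)

result = Runner.run_sync(manager, "Summarize these weekly signup counts: 120, 450, 980")
print(result.final_output)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;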

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v6f2wcg4t8ytih0i96l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v6f2wcg4t8ytih0i96l.jpg" alt="Side-by-side diagram comparing Handoff pattern (triage routes to specialist who owns conversation) vs Agent-as-Tool pattern (manager calls specialists and synthesizes output)" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/jangwook_kim_e31e7291ad98/build-your-first-multi-agent-system-with-openai-agents-sdk-step-by-step-python-tutorial-2026-2n79"&gt;Dev.to tutorial by Jangwook Kim&lt;/a&gt; demonstrates both patterns with a complete content production pipeline — worth reading alongside this tutorial for a different angle on the same concepts.&lt;/p&gt;

&lt;p&gt;The developer community has been active on this architectural question. A popular HN thread showed practitioners converging on the same conclusion:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=45654040" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l83b3ri3fcveqkv7pn4.png" alt="HN thread: Show HN Multi-Agent AI with OpenAI Agents SDK — developers debating handoff vs agent-as-tool pattern for report generation workflows" width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Guardrails That Actually Work in Production
&lt;/h2&gt;

&lt;p&gt;The guardrails system is more sophisticated than it first appears. Two distinct scopes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent-level guardrails&lt;/strong&gt; run before the agent processes its turn. Good for filtering malicious inputs, PII, or off-topic requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-level guardrails&lt;/strong&gt; run on every tool invocation within an agent's execution. Use these when you need to validate what the agent is actually &lt;em&gt;doing&lt;/em&gt;, not just what it received.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;output_guardrail&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="nd"&gt;@output_guardrail&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;no_pii_in_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Ensure no PII leaks in the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\d{3}-\d{2}-\d{4}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GuardrailFunctionOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;output_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flagged&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SSN pattern detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;tripwire_triggered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GuardrailFunctionOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;output_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flagged&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;tripwire_triggered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per the &lt;a href="https://openai.github.io/openai-agents-python/guardrails/" rel="noopener noreferrer"&gt;guardrails docs&lt;/a&gt;: "Blocking execution runs and completes the guardrail before the agent starts. If the guardrail tripwire is triggered, the agent never executes, preventing token consumption and tool execution."&lt;/p&gt;
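
&lt;p&gt;For the agent-level case, here is a minimal sketch of an input guardrail that blocks off-topic requests before any tokens are spent. It assumes the &lt;code&gt;input_guardrail&lt;/code&gt; decorator and &lt;code&gt;InputGuardrailTripwireTriggered&lt;/code&gt; exception from the same guardrails module; the off-topic check itself is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agents import (
    Agent,
    GuardrailFunctionOutput,
    InputGuardrailTripwireTriggered,
    Runner,
    input_guardrail,
)

@input_guardrail
async def block_off_topic(ctx, agent, user_input):
    """Trip the guardrail when the request is clearly out of scope."""
    flagged = "lottery numbers" in str(user_input).lower()  # placeholder check
    return GuardrailFunctionOutput(
        output_info={"flagged": flagged},
        tripwire_triggered=flagged,
    )

support = Agent(
    name="Support",
    instructions="Answer product questions.",
    input_guardrails=[block_off_topic],
)

async def main():
    try:
        await Runner.run(support, "What are tonight's lottery numbers?")
    except InputGuardrailTripwireTriggered:
        # The agent never ran, so no tokens or tool calls were consumed.
        print("Request blocked by the input guardrail.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;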

&lt;blockquote&gt;
&lt;p&gt;⚠️ Latent Space's analysis found a &lt;strong&gt;60x higher security incident rate&lt;/strong&gt; for agent deployments compared to standard API calls. Guardrails are necessary but not sufficient — you also need robust authentication, access controls, and sandbox execution for agents that touch the filesystem or execute code. OpenAI's April 2026 SDK update added sandbox support via E2B, Modal, Cloudflare, Daytona, Runloop, Vercel, and Blaxel.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  State Management and Sessions
&lt;/h2&gt;

&lt;p&gt;Sessions are the SDK's answer to long-horizon tasks — multi-step workflows where an agent needs to remember context across multiple runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.extensions.sessions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySessionStorage&lt;/span&gt;

&lt;span class="n"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemorySessionStorage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LongRunningAgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You help users with multi-step tasks. Remember context from previous messages.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# First interaction
&lt;/span&gt;&lt;span class="n"&gt;result1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start a report on market trends in AI agent frameworks.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report-session-001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Second interaction — agent remembers the previous exchange
&lt;/span&gt;&lt;span class="n"&gt;result2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Now add a section on the OpenAI Agents SDK specifically.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report-session-001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production, swap &lt;code&gt;InMemorySessionStorage&lt;/code&gt; for the Redis-backed session store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"openai-agents[redis]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This persists sessions across server restarts and horizontal scale — essential for production multi-step workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Integration
&lt;/h2&gt;

&lt;p&gt;The SDK supports Model Context Protocol for connecting external tools and data sources. Version 0.0.7+ includes the &lt;code&gt;MCPServerStdio&lt;/code&gt; class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MCPServerStdio&lt;/span&gt;

&lt;span class="n"&gt;mcp_server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPServerStdio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@modelcontextprotocol/server-filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/workspace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FileAgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You help with file operations.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mcp_server&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://news.ycombinator.com/item?id=43485566" rel="noopener noreferrer"&gt;HN discussion on OpenAI's MCP support&lt;/a&gt; captured the developer community's mixed reaction: top criticism is that "MCP overcomplicates tool calling" versus the counterpoint that MCP enables runtime tool discovery — you can add new tools to an MCP server without redeploying your agent code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=43485566" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7pbie9chkiitsse170o.png" alt="HN thread: OpenAI adds MCP support to Agents SDK — 807 points, 267 comments debating complexity vs runtime tool discovery benefits" width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For most projects, function tools are simpler and sufficient. Reach for MCP when you need to reuse an existing MCP server ecosystem or when runtime tool discovery is a genuine requirement.&lt;/p&gt;
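
&lt;p&gt;For comparison, here is what the same filesystem capability looks like as a plain function tool. A minimal sketch using the &lt;code&gt;function_tool&lt;/code&gt; decorator; the helper and its path check are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path
from agents import Agent, function_tool

WORKSPACE = Path("/tmp/workspace")

@function_tool
def read_workspace_file(relative_path: str) -&gt; str:
    """Read a file from the workspace directory and return its contents."""
    target = (WORKSPACE / relative_path).resolve()
    if not target.is_relative_to(WORKSPACE.resolve()):
        return "Error: path escapes the workspace."
    return target.read_text()

file_agent = Agent(
    name="FileAgent",
    instructions="You help with file operations.",
    tools=[read_workspace_file],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;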

&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;p&gt;Production deployments bring additional complexity that tutorials rarely cover. Community experience on HN offers an honest take:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=44358969" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ckkt82eqx1j6b1dbzgd.png" alt="HN thread: Agentic AI Hands-On in Python — practitioners sharing production war stories about security incidents, guardrails, and sandbox requirements" width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability first.&lt;/strong&gt; In multi-agent systems, a single user query can trigger multiple LLM calls, tool executions, and handoffs. Tracing captures all of this. Connect to Logfire or export OpenTelemetry spans to your existing stack.&lt;/p&gt;
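
&lt;p&gt;A minimal sketch of the Logfire route, assuming Logfire's OpenAI Agents instrumentation helper is available in your installed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logfire

logfire.configure()                  # reads LOGFIRE_TOKEN from the environment
logfire.instrument_openai_agents()   # forwards agent spans (LLM calls, tools, handoffs)

# From here on, every Runner.run(...) emits spans you can inspect in Logfire
# alongside the rest of your OpenTelemetry data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;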

&lt;p&gt;&lt;strong&gt;Token accounting.&lt;/strong&gt; With multi-agent chains, token costs multiply fast. Each handoff means a new context window with the full conversation history. Design your agent instructions to be minimal and your handoff payloads to carry only what the next agent needs.&lt;/p&gt;
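
&lt;p&gt;One way to keep handoff payloads lean is an input filter on the handoff itself. A sketch using the SDK's &lt;code&gt;handoff()&lt;/code&gt; wrapper and the bundled &lt;code&gt;handoff_filters&lt;/code&gt;, assuming both are present in your version; the agents are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agents import Agent, handoff
from agents.extensions import handoff_filters

billing_agent = Agent(name="Billing", instructions="Resolve billing questions.")

triage_agent = Agent(
    name="Triage",
    instructions="Route the user to the right specialist.",
    handoffs=[
        # Strip prior tool calls and tool results from the history so the billing
        # agent's context window carries only the user-facing conversation.
        handoff(billing_agent, input_filter=handoff_filters.remove_all_tools),
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;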

&lt;p&gt;&lt;strong&gt;Parallel execution.&lt;/strong&gt; For independent subtasks, use &lt;code&gt;asyncio.gather&lt;/code&gt; with multiple &lt;code&gt;Runner.run&lt;/code&gt; calls rather than sequential handoffs. The &lt;a href="https://softmaxdata.com/blog/definitive-guide-to-agentic-frameworks-in-2026-langgraph-crewai-ag2-openai-and-more/" rel="noopener noreferrer"&gt;definitive guide&lt;/a&gt; covers this pattern in depth.&lt;/p&gt;
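
&lt;p&gt;A short sketch of that fan-out, with two illustrative specialist agents running concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
from agents import Agent, Runner

pricing_agent = Agent(name="Pricing", instructions="Summarize pricing pages.")
docs_agent = Agent(name="Docs", instructions="Summarize developer documentation.")

async def research_both(pricing_url: str, docs_url: str):
    # Independent subtasks run in parallel instead of as sequential handoffs.
    pricing, docs = await asyncio.gather(
        Runner.run(pricing_agent, f"Summarize the pricing page at {pricing_url}"),
        Runner.run(docs_agent, f"Summarize the docs at {docs_url}"),
    )
    return pricing.final_output, docs.final_output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;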

&lt;p&gt;&lt;strong&gt;Sandbox for code execution.&lt;/strong&gt; Any agent that can execute arbitrary code should run inside a sandbox. The April 2026 update made this straightforward — pick your sandbox provider from the supported list and pass it to the agent configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Assessment: OpenAI SDK vs. Anthropic Claude SDK
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://composio.dev/content/claude-agents-sdk-vs-openai-agents-sdk-vs-google-adk" rel="noopener noreferrer"&gt;Composio three-way comparison&lt;/a&gt; puts it well: "These represent two competing visions of agentic AI: OpenAI ships an opinionated, batteries-included SDK; Anthropic ships a model plus an open protocol."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose openai-agents-python when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your team is already on GPT models and wants minimal switching cost&lt;/li&gt;
&lt;li&gt;You want hosted tools (file_search, web_search, code_interpreter) without managing your own infrastructure&lt;/li&gt;
&lt;li&gt;You need rapid prototyping — hello world in under 10 lines&lt;/li&gt;
&lt;li&gt;Your workflow is routing-oriented (triage → specialist patterns)&lt;/li&gt;
&lt;li&gt;Cost matters for longer sessions: OpenAI bills tokens only, while Managed Agents adds a $0.08/hour runtime fee that adds up for sessions over 10 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Anthropic's Claude SDK when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building multi-model architectures — Claude's SDK is built on MCP, an open standard&lt;/li&gt;
&lt;li&gt;You need native computer control — agents can read files, write code, and execute commands without additional configuration&lt;/li&gt;
&lt;li&gt;Model quality is your primary variable — Polymarket currently prices Anthropic at 92% for "best AI model end of April"&lt;/li&gt;
&lt;li&gt;Vendor lock-in at the protocol layer is a concern (MCP is open; OpenAI's hosted tools are proprietary)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per &lt;a href="https://agentpatch.ai/blog/openai-agents-sdk-vs-claude-agent-sdk/" rel="noopener noreferrer"&gt;AgentPatch's cost comparison&lt;/a&gt;: for short sessions under 5 minutes, pricing difference is negligible. For long-horizon tasks running 10–30 minutes, OpenAI runs 20–30% cheaper for the same token count.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://enhancial.substack.com/p/choosing-the-right-ai-framework-a" rel="noopener noreferrer"&gt;Enhancial framework comparison&lt;/a&gt; adds a useful dimension: quick prototyping (OpenAI SDK, 2–3 weeks to production) → production-grade single agent (Claude SDK, 1–2 weeks) → complex stateful systems (LangGraph, 1–3 months). Match the tool to your complexity requirement.&lt;/p&gt;

&lt;p&gt;For deeper context on the model-layer tradeoffs, see our &lt;a href="https://computeleap.com/blog/anthropic-vs-openai-api-developer-platform-2026" rel="noopener noreferrer"&gt;Anthropic vs. OpenAI API comparison&lt;/a&gt; and our &lt;a href="https://computeleap.com/blog/claude-code-opus-47-creator-secrets-expert-tips" rel="noopener noreferrer"&gt;Claude Code Opus 4.7 creator tips&lt;/a&gt; for the Claude-native workflow patterns.&lt;/p&gt;

&lt;p&gt;For making agents production-durable (surviving crashes and scaling to parallel executions), the Temporal integration is worth examining:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=44736713" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozyx71asp1lxfqdhcr1u.png" alt="HN thread: Show HN OpenAI Agents SDK demos with Temporal — durable execution that survives process crashes, used by OpenAI for ChatGPT Images and Codex" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;pip install openai-agents&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Copy the three-agent pipeline above and run it with your API key&lt;/li&gt;
&lt;li&gt;Swap the &lt;code&gt;web_search&lt;/code&gt; stub for a real API (Tavily integrates cleanly; see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Enable tracing and review the execution trace in the OpenAI dashboard&lt;/li&gt;
&lt;li&gt;Add your first input guardrail before exposing to external inputs&lt;/li&gt;
&lt;/ol&gt;
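
&lt;p&gt;For step 3, a rough sketch of a Tavily-backed &lt;code&gt;web_search&lt;/code&gt; tool. It assumes &lt;code&gt;pip install tavily-python&lt;/code&gt; and a &lt;code&gt;TAVILY_API_KEY&lt;/code&gt; environment variable; the function name mirrors the stub it replaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from agents import function_tool
from tavily import TavilyClient

_tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

@function_tool
def web_search(query: str) -&gt; str:
    """Search the web and return the top results as plain text."""
    response = _tavily.search(query, max_results=5)
    return "\n\n".join(
        f"{item['title']}\n{item['url']}\n{item['content']}"
        for item in response["results"]
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;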

&lt;p&gt;The framework is genuinely good. The primitives are small, the documentation is clear, and the handoff pattern makes complex routing workflows dramatically easier than building them from scratch. 22,981 developers found their way to the repo this week — the SDK earned those stars by solving a real problem with clean abstractions. Build something with it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://computeleap.com/blog/openai-agents-python-tutorial-multi-agent-ai-workflows-2026" rel="noopener noreferrer"&gt;ComputeLeap&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>python</category>
      <category>aiagents</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
