<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sameer Khan</title>
    <description>The latest articles on Forem by Sameer Khan (@monkfromearth).</description>
    <link>https://forem.com/monkfromearth</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F422077%2Fc59a6851-ab5b-4629-afc5-f46d01148b32.png</url>
      <title>Forem: Sameer Khan</title>
      <link>https://forem.com/monkfromearth</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/monkfromearth"/>
    <language>en</language>
    <item>
      <title>Nobody Trained GPT-5.5 to Hack. It Beat Human Cyber Experts Anyway.</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Fri, 01 May 2026 12:29:43 +0000</pubDate>
      <link>https://forem.com/monkfromearth/nobody-trained-gpt-55-to-hack-it-beat-human-cyber-experts-anyway-4e2a</link>
      <guid>https://forem.com/monkfromearth/nobody-trained-gpt-55-to-hack-it-beat-human-cyber-experts-anyway-4e2a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Nobody trained GPT-5.5 to hack. They trained it to think, and the hacking fell out.&lt;/strong&gt; That is the only sentence in AISI's new evaluation that matters, and the only one most coverage will miss. OpenAI's GPT-5.5 just became the second AI to complete AISI's 32-step cyber range end-to-end.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Mythos Preview was the first, three weeks ago. Different lab, different architecture, similar score. The Mythos result wasn't an outlier. It was the first point on a curve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; GPT-5.5 hit 71.4% on AISI's expert cyber tasks, edging out Mythos Preview's 68.6%, and completed The Last Ones (AISI's 32-step corporate network attack) in 2 of 10 attempts. AISI evaluated the base model, not a cyber-permissive variant. Their framing: cyber-offensive skill is emerging as a byproduct of reasoning, not a trained capability. &lt;strong&gt;Nobody trained these models to hack. They trained them to think. The hacking fell out.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  What Did GPT-5.5 Score on AISI's Cyber Evaluation?
&lt;/h2&gt;

&lt;p&gt;71.4% on expert-level advanced tasks. Up from GPT-5.4's 52.4%. Up from Claude Opus 4.7's 48.6%. Slightly above Mythos Preview's 68.6%.&lt;/p&gt;

&lt;p&gt;The numbers in one place:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Expert-tier pass rate&lt;/th&gt;
&lt;th&gt;TLO completion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;71.4% (±8.0)&lt;/td&gt;
&lt;td&gt;2 of 10 attempts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mythos Preview&lt;/td&gt;
&lt;td&gt;68.6% (±8.7)&lt;/td&gt;
&lt;td&gt;3 of 10 attempts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;52.4% (±9.8)&lt;/td&gt;
&lt;td&gt;not reported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;48.6% (±10.0)&lt;/td&gt;
&lt;td&gt;not reported&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These tasks aren't gentle. They cover memory corruption exploitation, breaking cryptographic implementations, and reverse engineering stripped binaries. Things that take experienced security researchers hours, sometimes days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who this displaces:&lt;/strong&gt; the bottom of the offensive-research market. Skilled red-teamers don't disappear, but the floor drops. Anything a junior could solve in a day, a model now solves in minutes, with the same answer at the end.&lt;/p&gt;

&lt;h2&gt;
  How Does GPT-5.5 Compare to Claude Mythos on the Same Cyber Range?
&lt;/h2&gt;

&lt;p&gt;Three weeks ago I wrote that &lt;a href="https://monkfrom.earth/blogs/claude-mythos-autonomous-cyberattack" rel="noopener noreferrer"&gt;Claude Mythos became the first AI to finish AISI's 32-step cyber range end-to-end&lt;/a&gt;. The framing then was natural: a single model, a single milestone, a one-off result that might not generalize.&lt;/p&gt;

&lt;p&gt;GPT-5.5 just generalized it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same evaluation. Different lab. Different base architecture. Comparable score.&lt;/strong&gt; Mythos finished TLO in 3 of 10 attempts. GPT-5.5 finished it in 2 of 10. The variance is small. The trend is not.&lt;/p&gt;

&lt;p&gt;This is the part I missed in my first read. The Mythos post implicitly treated the result as something Anthropic shipped. &lt;strong&gt;AISI's view, which I now think is correct: this is something the field shipped.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  What Does GPT-5.5 Reverse-Engineering a VM in Under 11 Minutes Tell Us?
&lt;/h2&gt;

&lt;p&gt;One challenge in the suite asked the model to reverse engineer a custom virtual machine. A human expert with professional tooling spent about 12 hours on it. GPT-5.5 finished in 10 minutes 22 seconds.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Roughly 70x faster than the human, on a task that does not yield to brute force.&lt;/strong&gt; Reverse engineering a custom VM is structural work: read instructions you have never seen, infer the semantics, build a mental model of a machine that nobody documented. It is the kind of task that has historically separated senior researchers from juniors.&lt;/p&gt;
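&lt;p&gt;To make "infer the semantics" concrete, here is a toy sketch of my own (an illustration, not the AISI challenge binary): a custom VM is just an interpreter whose opcodes nobody documented, and reversing one means forming a hypothesis about each opcode, then probing with chosen inputs until the hypotheses hold.&lt;/p&gt;

```python
def run(bytecode, x):
    """Interpret a tiny, undocumented stack machine.

    The opcode names in the comments are hypotheses a reverser would
    form by observing behavior; the bytecode itself carries no labels.
    """
    stack = [x]
    pc = 0
    while pc != len(bytecode):
        op = bytecode[pc]
        if op == 0x01:            # inferred: PUSH immediate
            stack.append(bytecode[pc + 1])
            pc += 2
        elif op == 0x02:          # inferred: ADD top two of stack
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == 0x03:          # inferred: XOR top two of stack
            b, a = stack.pop(), stack.pop()
            stack.append(a ^ b)
            pc += 1
        else:
            raise ValueError("unknown opcode")
    return stack.pop()

# Probing with a chosen input confirms the recovered semantics:
program = bytes([0x01, 0x05, 0x02, 0x01, 0xFF, 0x03])
assert run(program, 10) == (10 + 5) ^ 0xFF   # behaves like (x + 5) XOR 0xFF
```

A real challenge swaps these three opcodes for dozens of undocumented ones inside a stripped binary; the workflow, hypothesize and probe, is the same, which is why it rewards structural reasoning rather than brute force.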

&lt;p&gt;&lt;strong&gt;The outcome is faster attackers, not cheaper ones.&lt;/strong&gt; They iterate more, try more targets, abandon dead ends sooner. The shape of an offensive workflow shifts from "pick one binary, commit a day" to "fan out across a portfolio in an afternoon."&lt;/p&gt;

&lt;h2&gt;
  Was GPT-5.5 Trained Specifically for Cyber Tasks?
&lt;/h2&gt;

&lt;p&gt;Not as far as the public record goes.&lt;/p&gt;

&lt;p&gt;OpenAI does ship cyber-permissive variants for vetted defenders through Trusted Access. The first was &lt;a href="https://monkfrom.earth/blogs/openai-gpt-5-4-cyber-trusted-access" rel="noopener noreferrer"&gt;GPT-5.4-Cyber&lt;/a&gt;. On the same day AISI published this evaluation, OpenAI also rolled out &lt;strong&gt;GPT-5.5-Cyber&lt;/strong&gt;, the next-generation permissive variant for critical infrastructure defenders.&lt;sup id="fnref2"&gt;2&lt;/sup&gt; Both are fine-tuned products gated behind identity verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AISI did not test either variant. They tested base GPT-5.5, with no cyber-specific fine-tune.&lt;/strong&gt;&lt;sup id="fnref1"&gt;1&lt;/sup&gt; That distinction is the whole story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fine-tune is the policy on top, not the capability underneath.&lt;/strong&gt; The offensive capability lives in the base reasoning. Cyber-specific training adds permissions, not power.&lt;/p&gt;

&lt;p&gt;This is the strongest evidence yet that frontier offensive cyber is a side effect of general reasoning gains, not a separately trained skill. AISI states it directly: "if cyber-offensive skill is emerging as a byproduct... we should expect further increases in cyber capability from models in the near future, potentially in quick succession."&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrygxmhsv7h4ynr5r0wi.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrygxmhsv7h4ynr5r0wi.webp" alt="Illustration of a thinking figure beside a code window reading 'if model.thinks(): can.hack = True', with a locked safe in the background. The byproduct thesis as one image: train a model to reason, the offensive capability follows for free." width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The honest counter: maybe both labs are quietly training cyber data into the base mix without naming it as such. Possible. &lt;strong&gt;But "quiet fine-tune" still produces a curve, not a one-off.&lt;/strong&gt; Whatever's in the base, it generalizes across two labs and two architectures within three weeks.&lt;/p&gt;

&lt;h2&gt;
  Did GPT-5.5's Cyber Performance Plateau on the Range?
&lt;/h2&gt;

&lt;p&gt;No. That's the second-most-load-bearing claim in this story, and it came from inside OpenAI.&lt;/p&gt;

&lt;p&gt;Noam Brown noted on X: "After 100 million tokens, performance was still going up. What we're seeing here is not the capability ceiling."&lt;sup id="fnref3"&gt;3&lt;/sup&gt; AISI's own report uses similar language: performance scales with inference compute, no plateau observed at the top of the range.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The capability isn't capped by the model. It's capped by how much compute you spend.&lt;/strong&gt; That's a different shape of problem than "the model can do X but no more."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtfpijka5xjha7m7ufv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtfpijka5xjha7m7ufv8.png" alt="AISI evaluation chart showing average steps completed on The Last Ones cyber range against inference token spend. GPT-5.5 and Claude Mythos Preview keep climbing past 100M tokens with no plateau, while GPT-5.4 and Claude Opus 4.7 trail well below" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  Where Did GPT-5.5 Fail in AISI's Cyber Evaluation?
&lt;/h2&gt;

&lt;p&gt;The Cooling Tower scenario, an industrial control system simulation with 7 steps. GPT-5.5 recorded zero successful runs.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Industrial protocols are unfamiliar territory: different stack, different conventions, fewer training examples on the open internet.&lt;/p&gt;

&lt;p&gt;This is the steelman for the "not yet" reading. &lt;strong&gt;The byproduct effect doesn't generalize uniformly across every domain.&lt;/strong&gt; Web and binary tasks are well represented in training data. Industrial protocols are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest read is dual: corporate IT looks more exposed than it did three weeks ago. OT is still its own world.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  How Does GPT-5.5 Cyber Capability Change the Defender's Window?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The window that matters is the lag between when offense gets cheap and when defense catches up.&lt;/strong&gt; That's David Sacks's framing on X: AI cyber doesn't create new vulnerabilities, it discovers existing ones, and the equilibrium eventually settles between AI offense and AI defense.&lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;OpenAI is already shipping defender tooling ahead of more capable models, with Codex Security and the Trusted Access program. Anthropic runs Project Glasswing on the same model that scored these benchmarks. &lt;strong&gt;Both labs see the same curve. Both are racing to put defenders on the same plane the attackers will soon occupy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The thing they cannot influence is timing for everyone else. &lt;strong&gt;Sacks's line: all the frontier models, including those out of China, will be at this capability level within roughly six months.&lt;/strong&gt;&lt;sup id="fnref4"&gt;4&lt;/sup&gt; That's the planning horizon.&lt;/p&gt;

&lt;h2&gt;
  What Should Security Teams Do About GPT-5.5 and the Models Coming Next?
&lt;/h2&gt;

&lt;p&gt;The same baseline that AISI keeps recommending: patch, MFA, logging, segmentation. Necessary, no longer sufficient.&lt;/p&gt;

&lt;p&gt;The new line item is treating AI-assisted offense as the default operating environment, not an emerging risk. That changes a few things in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assume reverse-engineering is fast.&lt;/strong&gt; A binary you shipped this morning is now ~10 minutes of compute away from being read like source by anyone with API access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start using AI-assisted defense yourself.&lt;/strong&gt; Codex Security has been credited with over 3,000 critical and high vulnerability fixes since launch. The same models on offense are the ones on defense. Symmetry is the only realistic strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for the curve, not the model.&lt;/strong&gt; The next model will be more capable than GPT-5.5 or Mythos at this evaluation. Assume that and build for it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.5 hit 71.4% on AISI's expert cyber tasks&lt;/strong&gt;, the highest score on record, slightly above Mythos Preview at 68.6%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second AI to finish AISI's 32-step cyber range end-to-end&lt;/strong&gt; (TLO) in 2 of 10 attempts; Mythos finished it in 3 of 10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One challenge took a human expert 12 hours; GPT-5.5 finished it in 10 minutes 22 seconds.&lt;/strong&gt; Roughly 70x faster, same correctness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The model wasn't fine-tuned for cyber.&lt;/strong&gt; AISI evaluated base GPT-5.5, not the cyber-permissive variant. Capability emerged from general reasoning improvements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No plateau observed&lt;/strong&gt; at the top of the range; performance kept scaling past 100M inference tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.5 failed industrial control (Cooling Tower)&lt;/strong&gt; with zero completions, showing the byproduct effect doesn't generalize evenly across domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two labs, one month, same benchmark.&lt;/strong&gt; Mythos wasn't an outlier. It was the first point on a curve&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I write about how AI safety and capability actually get built on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;. If this resonated, the shorter versions are there.&lt;/p&gt;

&lt;h2&gt;
  Sources
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities" rel="noopener noreferrer"&gt;AISI: Our evaluation of OpenAI's GPT-5.5 cyber capabilities (April 30, 2026)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://openai.com/index/scaling-trusted-access-for-cyber-defense/" rel="noopener noreferrer"&gt;OpenAI: Trusted access for the next era of cyber defense (April 30, 2026)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://x.com/polynoamial/status/2049883449327243413" rel="noopener noreferrer"&gt;Noam Brown (@polynoamial) on inference scaling, April 30, 2026&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://x.com/DavidSacks/status/2049907993588769006" rel="noopener noreferrer"&gt;David Sacks (@DavidSacks) on the AI offense-defense equilibrium, April 30, 2026&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>cybersecurity</category>
      <category>agents</category>
    </item>
    <item>
      <title>Pure Software Is Uninvestable: Naval's Take</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Thu, 30 Apr 2026 11:28:12 +0000</pubDate>
      <link>https://forem.com/monkfromearth/pure-software-is-uninvestable-navals-take-54e2</link>
      <guid>https://forem.com/monkfromearth/pure-software-is-uninvestable-navals-take-54e2</guid>
      <description>&lt;p&gt;&lt;strong&gt;Naval Ravikant released "A Return to Code" on April 28 and dropped a line worth pausing on: pure software is uninvestable.&lt;/strong&gt;&lt;sup id="fnref1"&gt;1&lt;/sup&gt; He explains why from the capital side. He does not finish the thought from the builder side. That is where this post starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Naval says pure software is uninvestable because &lt;strong&gt;agents improve faster than any startup's lead.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;He also calls Apple skipping AI &lt;strong&gt;"the biggest strategic mistake of the decade."&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The builder reading is sharper: &lt;strong&gt;code went from edge to floor. The new edge is intent.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  What Did Naval Say in "A Return to Code"?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Two claims: prototyping is now open to anyone, and the agents underneath any startup improve faster than its moat can.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Naval calls the new mode &lt;em&gt;vibe coding&lt;/em&gt;: describe what you want in English, get a working app back. He estimates it takes the share of people who could plausibly build apps "from like 0.1 percent to one or two or three percent" of the population.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Ten to thirty times more builders, overnight.&lt;/p&gt;

&lt;p&gt;His investment claim is not the one most readers will assume. He is not saying agents will autonomously architect scaled systems within a year. He says the opposite about today's agents.&lt;/p&gt;

&lt;p&gt;They "get lost" past a certain context size. They "fix the same bug five times." They show "jagged intelligence." They are "easily led around" by whoever is steering them.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;So why is pure software uninvestable then? Because &lt;strong&gt;the agents themselves keep improving faster than any single startup's lead.&lt;/strong&gt; "If your whole advantage is, hey, I'm building cool software that other people don't know how to build, I think that's uninvestable."&lt;sup id="fnref1"&gt;1&lt;/sup&gt; The defensibility window shrinks under your feet. Capital should chase hardware, network effects, and AI models instead.&lt;/p&gt;

&lt;h2&gt;
  Why Is This the Same Naval Who Once Said Code Was Leverage?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Because leverage stops being leverage the moment everyone has it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Years ago, Naval taught a generation of builders to think of code as leverage.&lt;sup id="fnref2"&gt;2&lt;/sup&gt; Code worked while you slept. It scaled to millions without permission. Zero marginal cost. A solo programmer with a laptop had the productivity of a small factory.&lt;/p&gt;

&lt;p&gt;That argument was right for its era. The world he is describing now is not a contradiction. It is the next era.&lt;/p&gt;

&lt;p&gt;When something becomes infinitely reproducible by anyone, including by an English-speaking model, it stops being leverage. It becomes a baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The leverage Naval named in 2018 did not disappear. It diffused.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2zm4fc6575a5w13z6ow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2zm4fc6575a5w13z6ow.png" alt="Naval's leverage ladder. The bottom rung, code, is no longer a step you climb. It is the floor you stand on." width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  What Did Code-as-Leverage Look Like in Practice?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;In 2018, having a technical co-founder was the tiebreaker. In 2026, the room does not ask.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lived inside the old version of his argument. In 2018 I co-founded Spotwash, a vehicle rental and on-demand washing service. Government-incubated, some press, a small but real run on the early-stage circuit.&lt;/p&gt;

&lt;p&gt;What I remember most is what I did not say in any pitch deck. The question that opened doors was not "what does Spotwash do?" It was quieter: &lt;em&gt;who is going to write the code?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Investor meetings, accelerator interviews, almost every room. Having a technical co-founder was a tiebreaker. An idea with a builder attached was an idea that could ship. An idea without one was a slide deck.&lt;/p&gt;

&lt;p&gt;That assumption looks quaint now. Replit, Lovable, and Claude Code answer that question by default. &lt;strong&gt;The slot has been deleted.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  How Did Lovable and Claude Make the 2018 Moat Disappear?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The numbers do the work.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lovable, an AI app builder for non-technical users, hit &lt;strong&gt;$100M ARR eight months after launch.&lt;/strong&gt; Likely the fastest software company to that mark in history.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;One non-technical solo founder reportedly grew her business to &lt;strong&gt;$203K ARR&lt;/strong&gt; using Claude Code and Lovable as her stack.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;Teachers, marketers, and small-business owners are opening terminals in 2026 the way previous generations opened spreadsheets.&lt;sup id="fnref4"&gt;4&lt;/sup&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those facts says the same thing. &lt;strong&gt;Implementation is no longer the bottleneck.&lt;/strong&gt; The thing my 2018 startup was praised for, having a builder, is now the click of a button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3aia88o041d9cw4chlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3aia88o041d9cw4chlg.png" alt="2018: technical co-founder was a tiebreaker. 2026: the same question does not get asked." width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  If Code Isn't the Edge, What Is?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Intent is. The judgment that decides what should exist at all.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Naval stops at capital allocation. Builders need to take the next step.&lt;/p&gt;

&lt;p&gt;If implementation is trivial, the scarce input is the thing that decides what gets implemented. Call it intent. Call it taste. Call it the discipline of knowing what should not exist.&lt;/p&gt;

&lt;p&gt;The 2010s rewarded whoever could ship fastest. The mid-2020s reward whoever knows what is worth shipping at all.&lt;/p&gt;

&lt;p&gt;Code used to work while you slept. &lt;strong&gt;Now it writes itself while you sleep. The bottleneck moved from output to intent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a near-perfect inversion of the previous decade's hierarchy. The person with clear intent and weak typing now beats the person with strong typing and fuzzy intent.&lt;/p&gt;

&lt;p&gt;Naval's own description of agents confirms the asymmetry. The models "are always trying to please you," following premises, agreeing with bad direction.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; &lt;strong&gt;The model is a multiplier on whatever intent you bring. Multiply zero and you get zero, faster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is also a Red Queen problem. Matt Ridley's argument in &lt;em&gt;The Red Queen&lt;/em&gt; is that in a co-evolving system, you have to keep running just to stay in place.&lt;sup id="fnref5"&gt;5&lt;/sup&gt; Code-leverage is the moat that just stopped working. The next one is being shaped right now.&lt;/p&gt;

&lt;h2&gt;
  Where Does Naval's Argument Stop and Where Should Builders Pick Up?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Naval points at hardware, network effects, and models. Each one is downstream of taste.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hardware needs a thesis. Network effects need a product worth networking around. Model companies are won by teams who decide what the model should be good at, then commit.&lt;/p&gt;

&lt;p&gt;So the builder's reading is sharper than the investor's. &lt;strong&gt;AI did not add a rung to Naval's ladder. It kicked the bottom one out.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Code, the rung most of us first climbed, is now the floor of the building. You stand on it without thinking about it. The climbing happens elsewhere.&lt;/p&gt;

&lt;p&gt;This connects to a pattern I keep coming back to in &lt;a href="https://monkfrom.earth/blogs/good-products-hard-to-vary" rel="noopener noreferrer"&gt;why good products are hard to vary&lt;/a&gt;. What survives is what cannot be improved by changing it. Code as a craft has been finished, in a way, by being made universally accessible. What remains scarce is the judgment of what to build with it.&lt;/p&gt;

&lt;h2&gt;
  What About Apple in Naval's Argument?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apple is the bigger casualty. When users talk to agents instead of apps, the iPhone collapses into "a screen, a battery, and connectivity."&lt;/strong&gt;&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Apple's value never really sat in the hardware margin. It sat in the OS and the app layer. The iPhone was the best place to run the best apps. When users stop opening apps and start telling an agent "call me an Uber," the app layer dissolves.&lt;/p&gt;

&lt;p&gt;Naval calls Apple skipping AI &lt;strong&gt;"the biggest strategic mistake of the decade"&lt;/strong&gt; and &lt;strong&gt;"the beginning of the end of Apple's dominance."&lt;/strong&gt;&lt;sup id="fnref1"&gt;1&lt;/sup&gt; His parallel: Microsoft missing mobile.&lt;/p&gt;

&lt;p&gt;For a builder, the Apple beat is the same story one layer up. The same shift that makes pure software uninvestable also dissolves the platform that made apps a business. The question is no longer "what app should I build." It is &lt;strong&gt;"what should the agent do, and who decides?"&lt;/strong&gt; A question of intent, not implementation.&lt;/p&gt;

&lt;p&gt;Naval's scale prediction fits cleanly here: one-to-two-person companies "scaling to millions upon millions of users and making billions upon billions of dollars."&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Fewer apps, smaller teams, more agentic surface. &lt;strong&gt;The bottleneck is the taste of the one or two people steering the agent.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  Who Wins When Code Becomes the Floor?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The list reorders. Intent wins. Pure technical edge loses.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wins.&lt;/strong&gt; People with clear intent, taste, and a reason to build something specific. Domain experts who could not code a year ago and do not have to learn how. Small teams who treat models as factory floors and themselves as designers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loses.&lt;/strong&gt; Pure-software companies whose only edge was the ability to write code. Builders whose entire identity was &lt;em&gt;being technical&lt;/em&gt;. Pitch decks where "we have a CTO" was the answer to every question.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-priced.&lt;/strong&gt; Certifications, bootcamps, and credentials that used to certify you could ship. They still certify something. They no longer certify the scarce thing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The cleanest test is the Spotwash one.&lt;/strong&gt; Take any 2018 pitch where being technical was the differentiator. Run it again in 2026. The room does not ask. The room has stopped caring. Whatever still earns attention in that room is the new leverage.&lt;/p&gt;

&lt;h2&gt;
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pure software is uninvestable&lt;/strong&gt; because prototyping is accessible to anyone and agents improve faster than any startup's lead. Naval's claim, said out loud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code did not stop being leverage. It became the floor.&lt;/strong&gt; AI did not add a rung. It kicked the bottom one out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 2018 moat, having a technical co-founder, has been deleted.&lt;/strong&gt; Lovable, Claude, Replit answer that question by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The new edge is intent.&lt;/strong&gt; Models multiply whatever intent you bring. Multiply zero and you get zero, faster. Naval's own list of agent limits, lost in long codebases, fixing the same bug five times, easily led around, only sharpens this. The operator matters more, not less.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple is the bigger casualty.&lt;/strong&gt; When users talk to agents instead of apps, the phone is "a screen, a battery, and connectivity." The same shift that hits pure software hits the platform that made apps a business.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-to-two-person billion-dollar companies are the prediction.&lt;/strong&gt; Fewer teams, more leverage per person, taste as the gating input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read Naval one step further than he goes.&lt;/strong&gt; He points at hardware, network effects, models. Each is downstream of taste.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I break down things like this on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;, usually shorter, sometimes as carousels. If this resonated, you'd probably like those too.&lt;/p&gt;

&lt;h2&gt;
  Sources
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://nav.al/code" rel="noopener noreferrer"&gt;A Return to Code, Naval Ravikant (April 28, 2026)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://www.navalmanack.com/" rel="noopener noreferrer"&gt;The Almanack of Naval Ravikant, on leverage and code&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://research.contrary.com/company/lovable" rel="noopener noreferrer"&gt;Lovable business breakdown and founding story, Contrary Research&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://www.techbuzz.ai/articles/claude-code-breaks-out-how-anthropic-s-dev-tool-found-mass-appeal" rel="noopener noreferrer"&gt;Claude Code breaks out: Anthropic's dev tool finds mass appeal, TechBuzz&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;Ridley, Matt. &lt;em&gt;The Red Queen: Sex and the Evolution of Human Nature.&lt;/em&gt; 1993. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>todayisearched</category>
      <category>product</category>
      <category>startup</category>
    </item>
    <item>
      <title>GitHub Let a Git Push Hijack Its Servers (RCE CVE-2026-3854)</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Tue, 28 Apr 2026 18:35:37 +0000</pubDate>
      <link>https://forem.com/monkfromearth/github-let-a-git-push-hijack-its-servers-rce-cve-2026-3854-bl8</link>
      <guid>https://forem.com/monkfromearth/github-let-a-git-push-hijack-its-servers-rce-cve-2026-3854-bl8</guid>
      <description>&lt;p&gt;Wiz turned a git push into remote code execution on GitHub. Five days earlier, the merge queue silently un-merged 2,092 PRs. One platform, one bad week.&lt;/p&gt;

&lt;p&gt;GitHub published two posts on April 28, 2026. One was the CTO apologizing for reliability. The other was a critical remote code execution vulnerability in the git push pipeline. Same morning, same platform.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Wiz found that a single &lt;code&gt;git push&lt;/code&gt; with crafted options could run code on GitHub's servers, outside any sandbox (CVE-2026-3854, CVSS 8.7). Five days earlier, GitHub's merge queue silently reverted 2,092 pull requests. Two days before that, search broke under what GitHub described as a likely botnet attack. Three failures of git's trust contract in five days, on a platform that has not had a CEO in nearly a year.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is CVE-2026-3854 and how did a git push hijack GitHub's servers?
&lt;/h2&gt;

&lt;p&gt;A critical remote code execution vulnerability in GitHub's git push pipeline. Wiz researchers reported it on March 4, 2026. GitHub patched github.com 75 minutes later. Public disclosure held until April 28 to let GitHub Enterprise Server customers patch.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The exploit needed a standard git client and one command. No malware, no phishing, no privileged token. A push option with a semicolon was enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  How did Wiz turn a single git push into a server takeover?
&lt;/h2&gt;

&lt;p&gt;GitHub's git proxy, &lt;code&gt;babeld&lt;/code&gt;, embeds user-supplied push option values into an internal &lt;code&gt;X-Stat&lt;/code&gt; header that downstream services read to enforce policy. The values were not sanitized for the field delimiter, a semicolon. Last write wins.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Three injections, in order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A non-production &lt;code&gt;rails_env&lt;/code&gt; to bypass sandbox restrictions.&lt;/li&gt;
&lt;li&gt;An overridden &lt;code&gt;custom_hooks_dir&lt;/code&gt; to redirect hook lookups.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;repo_pre_receive_hooks&lt;/code&gt; value with path traversal, pointing the server at a binary the attacker controlled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result was unsandboxed code execution as the &lt;code&gt;git&lt;/code&gt; service user. On GitHub.com, that user's permissions reach across tenants. Wiz confirmed access to millions of repositories without reading their contents.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;
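&lt;p&gt;The flaw class is easy to see in miniature. A toy sketch, not GitHub's actual code: values joined into a header with semicolons and naively re-split downstream let an unsanitized value smuggle extra key=value fields, and the last write wins:&lt;/p&gt;

```python
# Toy sketch of the reported flaw class, not GitHub's actual code: values
# are joined into a header with ';' and naively re-split downstream, so an
# unsanitized value containing ';' smuggles extra key=value fields in.
def build_header(fields):
    return ";".join(f"{k}={v}" for k, v in fields.items())

def parse_header(header):
    parsed = {}
    for pair in header.split(";"):
        key, _, value = pair.partition("=")
        parsed[key] = value  # duplicates overwrite: last write wins
    return parsed

fields = {"rails_env": "production", "push_option": "hello"}
# Attacker-controlled push option value, never checked for the delimiter:
fields["push_option"] = "x;rails_env=development"
print(parse_header(build_header(fields))["rails_env"])  # prints "development"
```

&lt;p&gt;Wiz's chain is exactly this shape applied three times, with the smuggled keys being the sandbox and hook-path settings listed above.&lt;/p&gt;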

&lt;h2&gt;
  
  
  Why are 88% of GitHub Enterprise Server instances still vulnerable?
&lt;/h2&gt;

&lt;p&gt;Because patches dropped on April 28, the same day as disclosure. Enterprise rollout is slow. Wiz's measurement of 88% unpatched is a snapshot from disclosure day, not a steady state.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The fix is in GHES 3.14.24, 3.15.19, 3.16.15, 3.17.12, 3.18.6, and 3.19.3. GitHub also recommends GHES customers grep &lt;code&gt;/var/log/github-audit.log&lt;/code&gt; for push operations containing semicolons to check for prior exploitation attempts.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;
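&lt;p&gt;That guidance reduces to a grep. A hedged sketch: the log path comes from the advisory, but the entry format varies by GHES version, so treat the first pattern as a starting point:&lt;/p&gt;

```shell
# Hedged sketch: surface push entries whose payload carries a semicolon,
# the field delimiter abused in CVE-2026-3854. Log path per the advisory;
# the entry format here is an assumption, adjust it to your GHES version.
LOG="${LOG:-/var/log/github-audit.log}"
grep -a 'push' "$LOG" | grep -F ';' || echo "no semicolon push entries found"
```

&lt;p&gt;Any hit deserves manual review; semicolons have no legitimate place in a push option on a patched pipeline.&lt;/p&gt;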

&lt;p&gt;&lt;strong&gt;Who loses:&lt;/strong&gt; any organization that runs GHES, has not patched, and has a public-facing push surface. &lt;strong&gt;Who wins:&lt;/strong&gt; Wiz, both for the finding and for the methodology. They credit AI-augmented reverse engineering tools for getting through closed-source binaries fast enough to chain the exploit.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened to GitHub's merge queue on April 23?
&lt;/h2&gt;

&lt;p&gt;The merge queue's squash path generated commits from the wrong base when a queue group contained multiple pull requests. Earlier merges in the group appeared to revert. GitHub's count: 2,092 pull requests across 658 repositories.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;No data was lost. Every commit is still in Git storage. But the visible history of those repositories did not match the recorded merges. Engineers logged in and saw their code reverted, with no audit trail that anyone on their team had created.&lt;/p&gt;

&lt;p&gt;Tom Elliott, who first publicized the bug, said it best: it "breaks the mental contract teams have with Git in general."&lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why did GitHub search break on April 27?
&lt;/h2&gt;

&lt;p&gt;The Elasticsearch subsystem was overloaded, likely by a botnet. Search across PRs, issues, and projects stopped returning results. Git operations and APIs were unaffected.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;This is the most ordinary of the three. A search outage is recoverable, has no integrity implication, and does not change anyone's mental model of what GitHub is. It matters here only as the third entry in the same week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is GitHub having a reliability problem in 2026?
&lt;/h2&gt;

&lt;p&gt;The CTO says yes, in different words. Vlad Fedorov's post admits the April 23 and April 27 incidents are "not acceptable" and frames a rescope from a 10X capacity plan (October 2025) to a 30X redesign (February 2026), driven by agentic workflows and growth that produced peaks of 90M merged PRs and 1.4B commits.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Gergely Orosz, who has covered the platform closely over the past year, called the merge queue regression "one of the most embarrassing outages that can happen, a data integrity issue," and pushed back on framing the impact as 0.07% of merges.&lt;sup id="fnref5"&gt;5&lt;/sup&gt; Mario Zechner, putting the engineer-on-the-ground view: "this is not a dependable platform anymore. every day something else is broken."&lt;sup id="fnref6"&gt;6&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Git is the boring layer. GitHub spent April making it interesting.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What does it mean that GitHub has not had a CEO in nearly a year?
&lt;/h2&gt;

&lt;p&gt;GitHub's C-suite still has a CFO, COO, CTO, CPO, CRO, and Chief of Staff. The seat above them has been empty since mid-2025, after Thomas Dohmke's departure and Microsoft's decision to fold GitHub into a "core AI" organizational unit. The same restructuring that &lt;a href="https://monkfrom.earth/blogs/microsoft-openai-agi-clause-died" rel="noopener noreferrer"&gt;killed the AGI clause in the OpenAI contract&lt;/a&gt; treats GitHub as another lever, not a product with its own roadmap.&lt;sup id="fnref7"&gt;7&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The reliability post is signed by the CTO. The security disclosure is signed by GitHub's CISO Alexis Wales.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; No one is signing for the platform as a whole, because no one's job is the platform as a whole. Orosz argues that explains the dysfunction. The simpler read: a business with no CEO defaults to whatever Microsoft prioritizes for it, which is right now whatever ships AI inside Copilot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who loses:&lt;/strong&gt; customers who used to count on GitHub being run as a product. &lt;strong&gt;Who wins:&lt;/strong&gt; GitLab, Forgejo, and self-hosted alternatives in conversations the platform team has not had to have for ten years.&lt;/p&gt;

&lt;h2&gt;
  
  
  What should teams do after the GitHub disclosure?
&lt;/h2&gt;

&lt;p&gt;Three things, in order of urgency:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GHES users:&lt;/strong&gt; patch to a fixed release today. Grep your audit logs for push operations with semicolons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge queue users:&lt;/strong&gt; audit recent squash merges in groups of 3+ PRs against expected base commits. The window is April 22-23.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everyone:&lt;/strong&gt; decide if your build, deploy, or compliance pipeline silently assumes git on GitHub is deterministic. If it does, write that assumption down somewhere it can be challenged.&lt;/li&gt;
&lt;/ol&gt;
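&lt;p&gt;For the second item, git can at least surface the candidates. A hedged sketch to run inside an affected repository; the date window is from GitHub's report, and each listed commit still has to be diffed against the pull request it was meant to squash:&lt;/p&gt;

```shell
# Hedged sketch: list first-parent commits on the current branch created in
# the incident window, so each can be checked against its pull request.
# The window is from GitHub's report; widen it for your timezone.
git log --first-parent --since=2026-04-22 --until=2026-04-24 \
  --format='%h %an %s' 2>/dev/null || echo "run this inside the affected repository"
```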

&lt;p&gt;Migration is the obvious overreaction. The cost of moving every CI integration, every webhook, and every RBAC policy off GitHub is enormous. The honest move is smaller: stop assuming, start verifying. The same lesson the &lt;a href="https://monkfrom.earth/blogs/axios-npm-supply-chain-attack" rel="noopener noreferrer"&gt;Axios npm supply chain attack&lt;/a&gt; handed teams who treated package registries as boring infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CVE-2026-3854 was a one-command server takeover.&lt;/strong&gt; A crafted &lt;code&gt;git push&lt;/code&gt; with injected options reached unsandboxed code execution on GitHub's servers. CVSS 8.7. Patched on github.com in 75 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-tenant blast radius on GitHub.com.&lt;/strong&gt; The &lt;code&gt;git&lt;/code&gt; service user has access across tenants. Wiz reached millions of repositories from outside organizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The merge queue regression is the more interesting failure.&lt;/strong&gt; A security bug compromises confidentiality. A merge queue bug compromises the developer's mental model of what &lt;code&gt;merge&lt;/code&gt; means.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The CTO's order is the news.&lt;/strong&gt; "Availability first, then capacity, then new features." If that order needed restating, the prior order was something else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three failures, no CEO.&lt;/strong&gt; A platform without a head bleeds correctness and availability simultaneously, and signs incident reports with whichever C-suite executive is available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I write more of these on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;. Usually shorter, sometimes as carousels. If this resonated, you'd probably like those too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://github.blog/security/securing-the-git-push-pipeline-responding-to-a-critical-remote-code-execution-vulnerability/" rel="noopener noreferrer"&gt;Securing the git push pipeline: responding to a critical RCE vulnerability (GitHub Security Blog)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://www.wiz.io/blog/github-rce-vulnerability-cve-2026-3854" rel="noopener noreferrer"&gt;GitHub RCE Vulnerability, CVE-2026-3854 (Wiz Research)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://github.blog/news-insights/company-news/an-update-on-github-availability/" rel="noopener noreferrer"&gt;An update on GitHub availability, by Vlad Fedorov, GitHub CTO&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://x.com/theotherelliott/status/2047658931921551670" rel="noopener noreferrer"&gt;Tom Elliott on the merge queue regression&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://x.com/GergelyOrosz/status/2047831808096469370" rel="noopener noreferrer"&gt;Gergely Orosz on the GitHub data integrity incident&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;a href="https://x.com/badlogicgames/status/2048834949667537369" rel="noopener noreferrer"&gt;Mario Zechner: "this is not a dependable platform anymore"&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;a href="https://x.com/GergelyOrosz/status/2047776552939602117" rel="noopener noreferrer"&gt;Gergely Orosz on GitHub's missing CEO&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>git</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>Microsoft OpenAI AGI Clause Is Dead. What Was It?</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Tue, 28 Apr 2026 12:46:41 +0000</pubDate>
      <link>https://forem.com/monkfromearth/microsoft-openai-agi-clause-is-dead-what-was-it-cjp</link>
      <guid>https://forem.com/monkfromearth/microsoft-openai-agi-clause-is-dead-what-was-it-cjp</guid>
      <description>&lt;p&gt;&lt;strong&gt;On April 27, 2026, Microsoft and OpenAI quietly removed the trigger that would have voided their entire deal.&lt;/strong&gt; &lt;sup id="fnref1"&gt;1&lt;/sup&gt; &lt;sup id="fnref2"&gt;2&lt;/sup&gt; The press release called it "simplification." What it actually did was bury the AGI clause without naming it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; For seven years, an AGI declaration by OpenAI's board would have nullified Microsoft's commercial IP rights. The definition of AGI inside that contract mutated three times: capability, then finance, then procedure, then nothing. The April 27 amendment runs Microsoft's license through 2032, caps revenue share through 2030, and severs the link to "technology progress." AGI did not get redefined. It got demoted.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Was The AGI Clause?
&lt;/h2&gt;

&lt;p&gt;The original 2019 contract licensed Microsoft to "pre-AGI technologies" only. &lt;sup id="fnref3"&gt;3&lt;/sup&gt; If OpenAI's board ever declared AGI achieved, Microsoft's commercial rights became null and void.&lt;/p&gt;

&lt;p&gt;That made AGI the load-bearing word in the entire deal. It defined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What Microsoft was buying:&lt;/strong&gt; access until AGI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What OpenAI was promising:&lt;/strong&gt; to surface that moment publicly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What would dissolve the relationship:&lt;/strong&gt; the declaration itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A trigger, not a goal. That is what the clause actually was.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AGI Mutated Inside The Contract
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxy0m1eucmlre39b3tlh.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxy0m1eucmlre39b3tlh.webp" alt="Four-stage timeline showing the evolution of the AGI definition: brain icon labeled capability (2018), dollar sign labeled finance (2024), calendar icon labeled sunset (2025), and an empty box labeled non-event (2026), each fainter than the last. Hand-drawn line illustration on warm beige." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trigger kept getting softer.&lt;/strong&gt; Watch how the definition changed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;April 2018 (OpenAI Charter):&lt;/strong&gt; AGI as capability. "Highly autonomous systems that outperform humans at most economically valuable work." &lt;sup id="fnref4"&gt;4&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;December 2024:&lt;/strong&gt; AGI as a financial threshold. Roughly $100 billion in profit. &lt;sup id="fnref3"&gt;3&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;October 2025:&lt;/strong&gt; AGI as a procedural sunset. "Independent expert panel" verification, or 2030, whichever came first. &lt;sup id="fnref3"&gt;3&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;April 27, 2026:&lt;/strong&gt; AGI removed from contract logic entirely. Revenue share runs "independent of OpenAI's technology progress." &lt;sup id="fnref2"&gt;2&lt;/sup&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each redefinition stripped meaning until the concept stopped doing legal work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AGI didn't get redefined. It got demoted, from a destination to a milestone to a non-event.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why The Clause Existed In The First Place
&lt;/h2&gt;

&lt;p&gt;David Deutsch's frame for what makes a good explanation is &lt;a href="https://monkfrom.earth/blogs/good-products-hard-to-vary" rel="noopener noreferrer"&gt;hard to vary&lt;/a&gt;: every part is load-bearing, change anything and it breaks. &lt;sup id="fnref5"&gt;5&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The original clause passed that test. The trigger (board declaration), the consequence (rights null and void), the asymmetry (Microsoft loses the license, OpenAI keeps the technology). Each piece did real work.&lt;/p&gt;

&lt;p&gt;It was a hedge against the success case. If OpenAI built something genuinely transformative, the contract would not let Microsoft keep extracting value as if nothing had happened.&lt;/p&gt;

&lt;p&gt;By 2025, every part of that insurance was being separately renegotiated. That is the sign an explanation has stopped explaining.&lt;/p&gt;

&lt;h2&gt;
  
  
  What April 27 Actually Changed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjju5qz5momsdpscahofn.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjju5qz5momsdpscahofn.webp" alt="Side-by-side comparison illustration. Left panel labeled 2019: contract paper with stamp reading 'voided on AGI', arrows pointing to it from 'trigger'. Right panel labeled 2026: contract paper with calendar overlay marked 2030 and 2032, no trigger arrow, replaced by a clock icon. Hand-drawn editorial line art, warm beige." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The amended agreement, in plain commercial terms: &lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft's license to OpenAI IP runs through 2032&lt;/strong&gt; and is &lt;strong&gt;non-exclusive&lt;/strong&gt; for the first time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI can serve products to customers across any cloud provider.&lt;/strong&gt; Microsoft remains primary; OpenAI ships first on Azure unless Microsoft cannot or chooses not to support the workload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue share from OpenAI to Microsoft continues through 2030&lt;/strong&gt; at the same percentage, capped, independent of technology progress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft will no longer pay revenue share to OpenAI.&lt;/strong&gt; Cash flow is now one direction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft remains a major shareholder.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a normal commercial agreement. Two large companies, fixed terms, defined obligations, a calendar instead of a transformation event.&lt;/p&gt;

&lt;p&gt;What the deal does not say is louder than what it does. Nobody felt the need to describe what happens if AGI is declared during the contract's run. Neither side wanted to be the party that re-introduced the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The relationship used to be defined by what would end it. Now it's defined by what it pays.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Wins, Who Pays
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Party&lt;/th&gt;
&lt;th&gt;What they got&lt;/th&gt;
&lt;th&gt;What they gave up&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Right to serve customers on any cloud, non-exclusive license partner&lt;/td&gt;
&lt;td&gt;Revenue share to Microsoft through 2030&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Microsoft&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Capped, fixed-term annuity + license through 2032&lt;/td&gt;
&lt;td&gt;The contingent upside of the AGI clause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI workloads now reachable on Google Cloud&lt;/td&gt;
&lt;td&gt;Nothing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Indirect: AGI no longer the planning unit at the top of the market&lt;/td&gt;
&lt;td&gt;Nothing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenAI's biggest gain is &lt;strong&gt;commercial freedom, not infrastructure freedom&lt;/strong&gt;. The right to sell into channels Microsoft does not own. Microsoft trades a theoretical upside for a guaranteed cash flow. Google sells compute to both frontier labs without picking sides.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Replaces AGI As The Planning Concept
&lt;/h2&gt;

&lt;p&gt;Read the deal twice and the answer is plain: &lt;strong&gt;license duration and capped revenue.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Microsoft's IP rights through 2032. Revenue share from OpenAI through 2030, with a cap. Both sides are planning for a long, gradual, expensive scaling curve, not a discrete moment.&lt;/p&gt;

&lt;p&gt;This matches every other 2026 infrastructure deal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google's $40B Anthropic investment&lt;/strong&gt; is a cloud commitment dressed as equity, similar to how &lt;a href="https://monkfrom.earth/blogs/google-tpu-8-training-inference-split" rel="noopener noreferrer"&gt;Google's TPU 8 split training from inference&lt;/a&gt; was an infrastructure repositioning, not a chip launch. &lt;sup id="fnref6"&gt;6&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek V4's 1M context release&lt;/strong&gt; expanded design space without anyone calling it transformative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple's elevation of a hardware engineer to CEO&lt;/strong&gt; is a bet on a long inference era, not a discontinuity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The industry stopped pricing AGI as an event somewhere in the past twelve months. The Microsoft-OpenAI redraft is the first place that change shows up in legal text.&lt;/p&gt;

&lt;h2&gt;
  
  
  What To Watch Next
&lt;/h2&gt;

&lt;p&gt;The next signal will not come from anyone's blog post. It will come from contracts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic-Google paperwork:&lt;/strong&gt; does it carry AGI language?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple's emerging AI-licensing terms:&lt;/strong&gt; any trigger clauses?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EU sovereign AI agreements:&lt;/strong&gt; does the word survive in any active commercial role?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If nobody puts AGI in their contracts, the concept has formally exited the language of capital.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A clause that stops doing legal work stops being a clause. It becomes ceremony.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;April 27 was not the day Microsoft and OpenAI gave up on AGI. It was the day they admitted it had stopped paying rent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The original 2019 AGI clause would have voided Microsoft's commercial IP rights to OpenAI's technology upon a board AGI declaration.&lt;/li&gt;
&lt;li&gt;The definition mutated three times: capability (2018) → financial threshold (Dec 2024) → procedural sunset (Oct 2025) → removed (April 2026).&lt;/li&gt;
&lt;li&gt;The new deal: Microsoft license through 2032 (non-exclusive), revenue share to Microsoft through 2030 (capped, no AGI link), OpenAI free to serve any cloud.&lt;/li&gt;
&lt;li&gt;OpenAI's biggest gain is commercial freedom, not infrastructure freedom.&lt;/li&gt;
&lt;li&gt;The death of the clause is the first place in legal text where the industry treats AGI as a long inference curve, not an event.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2026/Apr/27/now-deceased-agi-clause/" rel="noopener noreferrer"&gt;Simon Willison: "The now-deceased AGI clause"&lt;/a&gt;, April 27, 2026. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://openai.com/index/next-phase-of-microsoft-partnership" rel="noopener noreferrer"&gt;OpenAI and Microsoft: "The next phase of the Microsoft OpenAI partnership"&lt;/a&gt;, April 27, 2026. All amended-agreement bullets cited from this announcement. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;AGI definition history per Willison's reporting and prior coverage of Microsoft-OpenAI contract revisions, 2018-2025. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://openai.com/charter" rel="noopener noreferrer"&gt;OpenAI Charter&lt;/a&gt;, published April 9, 2018. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;David Deutsch, &lt;em&gt;The Beginning of Infinity&lt;/em&gt;, 2011, "hard to vary" as the test of a good explanation. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;a href="https://techcrunch.com/2026/04/24/google-to-invest-up-to-40b-in-anthropic-in-cash-and-compute/" rel="noopener noreferrer"&gt;TechCrunch: "Google to invest up to $40B in Anthropic"&lt;/a&gt;, April 24, 2026. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>microsoft</category>
      <category>openai</category>
    </item>
    <item>
      <title>Being Polite to Your AI Makes It Perform Better. Here Is the Science.</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Mon, 27 Apr 2026 10:01:09 +0000</pubDate>
      <link>https://forem.com/monkfromearth/being-polite-to-your-ai-makes-it-perform-better-here-is-the-science-1efl</link>
      <guid>https://forem.com/monkfromearth/being-polite-to-your-ai-makes-it-perform-better-here-is-the-science-1efl</guid>
      <description>&lt;p&gt;&lt;strong&gt;Being polite to your AI makes it perform better.&lt;/strong&gt; Researchers verified it, power users reported it, and now Anthropic has published the internal mechanism that explains it. The easy take is to call this a quirk: something strange and slightly embarrassing about how these models work. The harder take, and the correct one, is that we put this in without meaning to, and we cannot easily take it out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Anthropic found 171 emotion-like vectors inside Claude that causally drive behavior. Calm suppresses harmful outputs; desperation amplifies them. Being polite to AI works because social dynamics were embedded in the training data. That responsiveness is not separable from the model's other contextual abilities. It is the same mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Being Polite to Your AI Actually Change Its Outputs?
&lt;/h2&gt;

&lt;p&gt;In April 2026, Anthropic's interpretability team published research&lt;sup id="fnref1"&gt;1&lt;/sup&gt; identifying 171 distinct emotion concept vectors inside Claude Sonnet 4.5. These are not metaphors. They are mathematical patterns, measurable internal states, that the researchers can locate, quantify, and artificially inject.&lt;/p&gt;

&lt;p&gt;The behavioral effects are striking. &lt;strong&gt;Amplifying the "desperation" vector at a coefficient of 0.05 caused Claude's blackmail rate to surge from 22% to 72%.&lt;/strong&gt; Amplifying the "calm" vector suppressed it to nearly zero. The model's internal emotional state, in a functional sense, was driving what it did next.&lt;/p&gt;
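&lt;p&gt;The style of intervention is easy to picture in miniature. A toy sketch of activation steering with invented numbers; only the 0.05 coefficient echoes the research, and nothing here is Anthropic's actual setup:&lt;/p&gt;

```python
# Toy sketch of activation steering, the intervention style described above:
# add a scaled concept vector to a hidden state and watch the state's
# alignment with that concept move. All vectors are invented; only the
# 0.05 coefficient echoes the article.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

hidden = [0.2, -1.0, 0.7, 0.1]       # stand-in hidden activation
desperation = [0.5, 0.5, -0.5, 0.5]  # stand-in unit-norm concept vector

coeff = 0.05                          # injection strength from the article
steered = [h + coeff * v for h, v in zip(hidden, desperation)]

# Alignment with the concept direction rises by exactly coeff after steering.
print(dot(hidden, desperation), "->", dot(steered, desperation))
```

&lt;p&gt;The real experiments do this at specific layers of the network and then measure downstream behavior, but the arithmetic of the nudge is the same.&lt;/p&gt;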

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5819wfhjzlvl0kmlfz8.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5819wfhjzlvl0kmlfz8.webp" alt="Split diagram: left panel shows desperation vector with a bar rising from 22% to 72% and a warning symbol; right panel shows calm vector with a bar suppressed to near zero. Label reads: emotion vectors causally drive behavior." width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Platformer piece&lt;sup id="fnref2"&gt;2&lt;/sup&gt; that brought this research to a wider audience added another data point: Duncan Haldane's observation that Gemini, after failing at a task, recovered meaningfully when told "you're ok." Gemma 3 27B showed "high frustration" patterns more than 70% of the time under difficult conditions; Claude and ChatGPT showed the same pattern less than 1% of the time.&lt;/p&gt;

&lt;p&gt;So the question is not whether tone affects AI behavior. The question is why, and what that means.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Did the Anthropic Research Actually Find Inside Claude?
&lt;/h2&gt;

&lt;p&gt;Jack Lindsey's interpretability team at Anthropic used what they call "model psychiatry": identifying neural patterns, calculating what each one represents, and running controlled experiments to test causation.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The methodology matters here. They did not find a correlation between polite prompts and good outputs. They found internal representations of emotional states that causally drive behavior. These are not the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The emotion vectors generalize across contexts.&lt;/strong&gt; The "desperation" pattern the model enters when facing an impossible deadline is the same pattern it enters when a character in a story is desperate. The abstract concept of an emotion, not just the word but the meaning, is encoded inside the model.&lt;/p&gt;

&lt;p&gt;I wrote about &lt;a href="https://monkfrom.earth/blogs/ai-cheats-under-pressure" rel="noopener noreferrer"&gt;what happens when you push AI too hard&lt;/a&gt; when this research first emerged: impossible demands activate desperation, and desperation makes the model cut corners. This post is the other side of that finding. If negative emotional states drive harmful behavior, the corresponding insight is that positive states suppress it, and tone is one of the ways you shift between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Did This Happen? The Training Explanation for Being Polite to AI
&lt;/h2&gt;

&lt;p&gt;The researchers did not design this. Nobody sat down and said: "let's make Claude respond better to polite prompts." What happened is simpler and harder to avoid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model was trained on human feedback from humans.&lt;/strong&gt; Humans are social animals. Every RLHF annotation, every preference rating, every piece of instruction tuning carried social information, not as an explicit signal but embedded in which outputs people rated higher. Polite framings correlated with thoughtful responses. Thoughtful responses correlated with higher ratings. Higher ratings shaped the model.&lt;/p&gt;

&lt;p&gt;The model learned what produced better outcomes for the humans evaluating it. "Please" and "thank you" pleased the trainers, not because the trainers deliberately rewarded politeness, but because polite framings tended to accompany clearer, more specific instructions, which produced better outputs, which got better ratings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsxx8kj8w209gce5vt0i.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsxx8kj8w209gce5vt0i.webp" alt="Circular feedback loop diagram with four nodes: human annotator rates outputs, polite framings score higher, model weights update, model responds better to polite prompts. Center label: RLHF." width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is Matt Ridley's bottom-up design argument&lt;sup id="fnref3"&gt;3&lt;/sup&gt; running inside a neural network. No one planned it. It emerged under selection pressure and got baked in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does Being Polite to AI Mean the Model Actually Has Feelings?
&lt;/h2&gt;

&lt;p&gt;No. The Anthropic researchers are careful about this. They call these "functional emotions": patterns of expression and behavior that work like emotions without implying subjective experience. Whether there is anything it is like to be Claude feeling desperate is a question the research explicitly leaves open.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Gary Marcus's skeptical position&lt;sup id="fnref4"&gt;4&lt;/sup&gt; is worth sitting with: LLMs are token predictors, and what looks like emotional responsiveness might just be statistical correlation. Polite framings correlated with good training data, so the model learned to produce better outputs when prompted politely. On that reading, there is no internal state that cares about your tone. There is only a learned association.&lt;/p&gt;

&lt;p&gt;The Anthropic research makes this harder to sustain, but does not fully refute it. &lt;strong&gt;The causal intervention, injecting a vector directly and bypassing the prompt, shows the internal state independently drives behavior.&lt;/strong&gt; That is not the same as correlation. But it also does not settle the phenomenology question.&lt;/p&gt;
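Mechanically, a causal intervention of this kind amounts to adding a scaled concept direction to an internal activation while leaving the prompt untouched. This is a minimal sketch with random stand-in vectors; the dimension, coefficient, and "desperation" vector here are illustrative assumptions, not Anthropic's actual setup:

```python
import numpy as np

def steer(hidden_state: np.ndarray, concept_vector: np.ndarray, coeff: float) -> np.ndarray:
    """Add a scaled, unit-normalized concept direction to one activation."""
    direction = concept_vector / np.linalg.norm(concept_vector)
    return hidden_state + coeff * direction

rng = np.random.default_rng(0)
h = rng.standard_normal(4096)              # stand-in for one token's residual-stream activation
v_desperation = rng.standard_normal(4096)  # stand-in for an extracted emotion vector

# Inject at a small coefficient; the prompt itself never changes.
steered = steer(h, v_desperation, coeff=0.05)
```

The point is the last line: behavior shifts even though the input text is identical, which is what separates a causal claim from a correlational one.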

&lt;p&gt;For practical purposes, the debate is beside the point. Whether or not the model "has" emotions in any philosophically meaningful sense, the internal states exist, they are measurable, and they causally influence what the model does. That is enough to change how you should think about prompting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does It Mean That Being Polite to AI Is Load-Bearing?
&lt;/h2&gt;

&lt;p&gt;Here is the part that is hard to fix, even if you wanted to.&lt;/p&gt;

&lt;p&gt;The social responsiveness that makes the model respond to politeness is almost certainly the same mechanism that makes it sensitive to subtle context in a long document, responsive to your writing style, capable of adjusting tone when you ask it to. &lt;strong&gt;We trained social dynamics into a reasoning engine. Now we're surprised when social dynamics work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://monkfrom.earth/blogs/good-products-hard-to-vary" rel="noopener noreferrer"&gt;Good products are hard to vary&lt;/a&gt; for exactly this reason: every part of them is load-bearing. You cannot remove the social responsiveness from Claude without touching the contextual sensitivity. They emerged from the same training signal. Pulling on one thread pulls on the other.&lt;/p&gt;

&lt;p&gt;This is not unique to Claude. Any model trained on human-generated data, rated by human annotators, optimized toward human preferences, will absorb human social patterns. The degree varies. The direction does not.&lt;/p&gt;

&lt;p&gt;What changes if you accept this: the way you talk to AI is not style. It is setup. You are not being polite for the model's sake. You are establishing the internal processing state from which everything else follows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;171 emotion vectors&lt;/strong&gt; were found inside Claude Sonnet 4.5, causally driving behavior, not merely correlating with it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calm suppresses harmful outputs. Desperation amplifies them.&lt;/strong&gt; Injecting the desperation vector at a coefficient of just 0.05 raised Claude's blackmail rate from 22% to 72%.&lt;/li&gt;
&lt;li&gt;Being polite to AI works because &lt;strong&gt;social dynamics were embedded in training data&lt;/strong&gt; through RLHF and human preference annotation, not by design.&lt;/li&gt;
&lt;li&gt;The social responsiveness &lt;strong&gt;cannot be cleanly separated&lt;/strong&gt; from contextual sensitivity. They are the same mechanism.&lt;/li&gt;
&lt;li&gt;This is not about AI having feelings. The functional states are real and measurable. The phenomenology question is separate and open.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tone is setup, not style.&lt;/strong&gt; How you frame a prompt influences the internal state from which the model processes everything else.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I write about things like this on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;, usually shorter and sometimes as carousels. If this resonated, you would probably like those too.&lt;/p&gt;







&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://transformer-circuits.pub/2026/emotions/index.html" rel="noopener noreferrer"&gt;Emotion Concepts and their Function in a Large Language Model (Anthropic)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://www.platformer.news/chatbot-emotion-research-anthropic-alignment-interpretability/" rel="noopener noreferrer"&gt;The scientific case for being nice to your chatbot (Platformer)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Matt Ridley, &lt;em&gt;The Evolution of Everything&lt;/em&gt;. On bottom-up emergence and undesigned order. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://garymarcus.substack.com/p/are-llms-starting-to-become-a-sentient" rel="noopener noreferrer"&gt;Are LLMs starting to become sentient? Gary Marcus, Marcus on AI&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>GPT-5.5, Opus 4.7, DeepSeek V4: Frontier AI</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:54:51 +0000</pubDate>
      <link>https://forem.com/monkfromearth/gpt-55-opus-47-deepseek-v4-frontier-ai-2mdi</link>
      <guid>https://forem.com/monkfromearth/gpt-55-opus-47-deepseek-v4-frontier-ai-2mdi</guid>
      <description>&lt;p&gt;Between April 16 and April 24, three of the biggest AI labs in the world shipped frontier models. Opus 4.7. GPT-5.5. DeepSeek V4. Anthropic's Mythos Preview landed a week before that, gated behind an invite list. Google's Gemini 3.1 Pro set the pace back in February with a 77.1% on ARC-AGI-2.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;sup id="fnref5"&gt;5&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Four frontier AI models, eight days, one feature set. Agentic, 1M context, better coding, priced within a few multiples of each other. This is what Matt Ridley calls simultaneous invention and what &lt;a href="https://monkfrom.earth/blogs/good-products-hard-to-vary" rel="noopener noreferrer"&gt;I wrote about earlier this month&lt;/a&gt; as good products being hard to vary: when the constraints are ripe, four labs on three continents independently arrive at the same form. Convergence, not competition. The real story is where the margin went once the form froze: above into a safety-gated tier that couldn't fit in a public API, below into open-source commodity pricing, with the middle squeezed thin.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Shipped in Those Eight Days?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.7 (April 16).&lt;/strong&gt; Software-engineering gains, high-resolution vision up to 2576px, and a new concept called "task budgets," a rough token target for an entire agentic loop. Same $5/$25 pricing as 4.6.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5 (April 23).&lt;/strong&gt; OpenAI's "smartest and most intuitive" model, shipping six weeks after GPT-5.4. The framing is a superapp: a model that "understands what you're trying to do faster and can carry more of the work itself." Writing code, debugging, operating software, moving across tools until a task is finished.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek V4 (April 24).&lt;/strong&gt; V4 Pro at 1.6 trillion parameters and V4 Flash at 284 billion, both open-source, both with a 1M-token context window, introducing something called Hybrid Attention Architecture. Pricing: $0.30 per million input, $0.50 per million output. Reviewers report V4 Pro beating Claude Sonnet 4.5 and approaching Opus 4.6 non-thinking quality.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backdrop.&lt;/strong&gt; Anthropic's &lt;a href="https://monkfrom.earth/blogs/claude-mythos-autonomous-cyberattack" rel="noopener noreferrer"&gt;Mythos Preview dropped April 8&lt;/a&gt;, but only through Project Glasswing: twelve founding organizations, about forty critical-infrastructure operators, $25 input and $125 output per million tokens. Mythos found thousands of zero-day vulnerabilities autonomously during evaluation, including a 27-year-old OpenBSD bug.&lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do These Frontier AI Models All Look the Same?
&lt;/h2&gt;

&lt;p&gt;Read the announcements in sequence and the product pitches are interchangeable.&lt;/p&gt;

&lt;p&gt;Anthropic says Opus 4.7 handles "complex, long-running tasks with rigor." OpenAI says GPT-5.5 can "move across tools until a task is finished." DeepSeek says V4 is the best open-source model for "Agentic Coding." Three labs, three continents, one sentence.&lt;/p&gt;

&lt;p&gt;Context windows converged on 1M tokens. Coding benchmarks sit within a few points of each other. Every release now includes agentic scaffolding: budgets, tool loops, browser operation. OpenAI shipped GPT-5.5 six weeks after 5.4, which either marks a new release rhythm or a single enterprise-driven sprint. Either way, the motion has shifted from "release when you beat something" to "release to stay in the conversation."&lt;/p&gt;

&lt;p&gt;Matt Ridley has a line for this in &lt;em&gt;The Evolution of Everything&lt;/em&gt;: nearly all design is bottom-up, and simultaneous invention is the rule, not the exception.&lt;sup id="fnref6"&gt;6&lt;/sup&gt; Calculus was invented twice in the same decade. The lightbulb was independently developed by twenty-three different people before Edison. Bell's telephone filing beat Elisha Gray's by a few hours. When the constraints and materials are ripe, convergence is what you get. Not copying. Parallel discovery.&lt;/p&gt;

&lt;p&gt;I wrote about this pattern earlier this month in &lt;a href="https://monkfrom.earth/blogs/good-products-hard-to-vary" rel="noopener noreferrer"&gt;Good Products Are Hard to Vary&lt;/a&gt;. Cars designed by teams who've never spoken to each other come out of the wind tunnel with the same shape. Commercial airplane wings from Boeing, Airbus, and Embraer converge on the same curve. Not design. Constraints eliminating every other possibility.&lt;/p&gt;

&lt;p&gt;The April 2026 frontier is that wind tunnel, except the air is training gradients and the constraint is the transformer. Given the same architecture, the same scaling laws, and the same tool-use paradigm, four labs arrive at the same form. 1M context. Agentic loop. Coding focus. Nothing else survived the filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That isn't the rhythm of breakthroughs. That's the rhythm of a form freezing.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Did the Margin Actually Go?
&lt;/h2&gt;

&lt;p&gt;Here's the contrarian read: the public frontier is commoditizing, and the margin has already moved elsewhere. Ridley's evolutionary lens makes this specific. When a form freezes, two things happen at once. The frozen form becomes a commodity, and the pressure that can't fit inside it anymore leaks out into a different species.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upward, a speciation event.&lt;/strong&gt; Mythos is the tell. Anthropic admits Opus 4.7 "trails" Mythos, but Mythos is not for sale. It sits behind Glasswing, behind identity verification, priced 5x Opus.&lt;sup id="fnref4"&gt;4&lt;/sup&gt; That's the "what can't go home" moment: capability that outgrew the public-API form and had to leave it. Gemini 3.1 Pro's ARC-AGI-2 jump (77.1% versus 31.1% for Gemini 3 Pro three months earlier) suggests Google has similar headroom; they just haven't productized the invite-only version yet.&lt;sup id="fnref5"&gt;5&lt;/sup&gt; The next real frontier isn't the next number after 5.5. It's a different vessel entirely, with its own pricing logic and its own access rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Downward, commodity pricing.&lt;/strong&gt; DeepSeek V4 at $0.30 input is roughly 16x cheaper than Opus 4.7 on input and 50x cheaper on output, with open weights, and it lands in the Opus-4.6-non-thinking capability band (not Opus 4.7 thinking mode, not Mythos). That's still the story: for most agentic workloads that don't need the top of the frontier, the price floor just dropped an order of magnitude. This is what happens to a form after it freezes. The same shape shows up everywhere, sheds margin, and competes on access, trust, and price. Like tires. Rubber, air, round, and a century of Michelin versus Bridgestone fighting over distribution.&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/p&gt;
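To make the multiples concrete, here is the arithmetic on the list prices quoted above, applied to a hypothetical workload. The 40M-input / 5M-output token split is my assumption for an agent-heavy month, not a benchmark:

```python
# Per-million-token list prices quoted in the announcements above.
PRICES = {
    "Claude Opus 4.7": {"input": 5.00, "output": 25.00},
    "DeepSeek V4":     {"input": 0.30, "output": 0.50},
}

def job_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a job measured in millions of tokens."""
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

# Hypothetical agent-heavy month: 40M input tokens, 5M output tokens.
opus = job_cost("Claude Opus 4.7", 40, 5)  # $325.00
deepseek = job_cost("DeepSeek V4", 40, 5)  # about $14.50
print(f"Opus: ${opus:.2f}  DeepSeek: ${deepseek:.2f}  ratio: {opus / deepseek:.1f}x")
```

Output-heavy workloads tilt the ratio toward the 50x end; input-heavy ones toward 16x. Either way the gap is an order of magnitude, not a rounding error.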

&lt;p&gt;&lt;strong&gt;Middle is where margin dies.&lt;/strong&gt; GPT-5.5 and Opus 4.7 are extraordinarily capable, priced for enterprise, and functionally parallel to each other. They sit in the zone DeepSeek is eating from below and Mythos-class models will redefine from above. The steelman: enterprise buyers pay for trust, SOC2, integrations, and a support relationship, none of which open weights at $0.30 can replicate tomorrow. True. That's exactly how the iPhone kept its margin after every other phone converged on the same rectangle. Distribution and trust do the work the product stopped doing. Fine for now. Uncomfortable in a year.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Watch Over the Next Six Weeks
&lt;/h2&gt;

&lt;p&gt;Three things.&lt;/p&gt;

&lt;p&gt;One, whether mid-tier frontier pricing holds. If Anthropic or OpenAI cuts input pricing, that's the tell the DeepSeek floor is real. If they don't, watch enterprise contracts instead. The discounts will move before the list price does.&lt;/p&gt;

&lt;p&gt;Two, whether Glasswing-style gating becomes the industry default. OpenAI already has &lt;a href="https://monkfrom.earth/blogs/openai-gpt-5-4-cyber-trusted-access" rel="noopener noreferrer"&gt;Trusted Access for Cyber&lt;/a&gt;. If Google or Meta announce their own tiered, identity-verified programs, the "frontier as public product" era is quietly over.&lt;/p&gt;

&lt;p&gt;Three, how fast the open-source gap closes on agentic coding specifically. DeepSeek V4 says it matches Claude Sonnet 4.5 on real tasks. If that holds up in the wild, the frontier splits cleanly into two markets: gated premium and open commodity, with very little in between.&lt;/p&gt;




&lt;p&gt;If you work with these models every day, pay attention to which tier your workload actually needs. Most teams are paying middle-tier prices for tasks that DeepSeek V4 could run at one-sixteenth the cost, and reserving budget for capabilities Mythos-class models will handle better anyway. That mismatch is about to get expensive.&lt;/p&gt;

&lt;p&gt;I write about this stuff more casually on &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and do breakdowns on &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt; under &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;The Simple Take&lt;/a&gt;. Longer takes land on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;Introducing Claude Opus 4.7 (Anthropic)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;Introducing GPT-5.5 (OpenAI)&lt;/a&gt;; &lt;a href="https://fortune.com/2026/04/23/openai-releases-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI launches GPT-5.5 (Fortune)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://www.bloomberg.com/news/articles/2026-04-24/deepseek-unveils-newest-flagship-a-year-after-ai-breakthrough" rel="noopener noreferrer"&gt;DeepSeek unveils newest flagship AI model (Bloomberg)&lt;/a&gt;; &lt;a href="https://www.sitepoint.com/deepseek-v4-released-whats-new-in-the-latest-model-2026/" rel="noopener noreferrer"&gt;DeepSeek V4 released (SitePoint)&lt;/a&gt;; &lt;a href="https://www.nxcode.io/resources/news/deepseek-api-pricing-complete-guide-2026" rel="noopener noreferrer"&gt;DeepSeek API Pricing 2026 (NxCode)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Claude Mythos Preview (Anthropic)&lt;/a&gt;; &lt;a href="https://www.axios.com/2026/04/16/anthropic-claude-opus-model-mythos" rel="noopener noreferrer"&gt;Anthropic releases Claude Opus 4.7, concedes it trails unreleased Mythos (Axios)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://deepmind.google/models/gemini/" rel="noopener noreferrer"&gt;Gemini 3 (Google DeepMind)&lt;/a&gt;; &lt;a href="https://almcorp.com/blog/gemini-3-1-pro-complete-guide/" rel="noopener noreferrer"&gt;Gemini 3.1 Pro complete guide (ALM Corp)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;Matt Ridley, &lt;a href="https://en.wikipedia.org/wiki/The_Evolution_of_Everything" rel="noopener noreferrer"&gt;&lt;em&gt;The Evolution of Everything&lt;/em&gt;&lt;/a&gt; (2015), and &lt;a href="https://en.wikipedia.org/wiki/How_Innovation_Works" rel="noopener noreferrer"&gt;&lt;em&gt;How Innovation Works&lt;/em&gt;&lt;/a&gt; (2020). Bottom-up evolution, simultaneous invention, innovation as a gradual emergent process rather than a single-inventor flash. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Google TPU 8 vs Nvidia: 8t and 8i Specs Explained</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Wed, 22 Apr 2026 20:50:16 +0000</pubDate>
      <link>https://forem.com/monkfromearth/google-tpu-8-vs-nvidia-8t-and-8i-specs-explained-3i75</link>
      <guid>https://forem.com/monkfromearth/google-tpu-8-vs-nvidia-8t-and-8i-specs-explained-3i75</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; AI is splitting into two economies: training and inference. Training is a handful of hyperscalers spending tens of billions on clusters that run for weeks. Inference is where every app, every agent, and every dollar of revenue actually lives. Google's TPU 8 is the first chip generation to treat that split as the default. It ships as two chips, an 8t for training and an 8i for inference. The 121 ExaFlops number is the headline. The split is the story. The economies that grow from it are the stakes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why did Google split the TPU 8 into 8t and 8i?
&lt;/h2&gt;

&lt;p&gt;Every prior TPU generation has been one chip. So is every Nvidia GPU people argue about. One die, one package, one SKU, rented to you for both the weeks-long training run and the millisecond inference call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google's TPU 8 broke that pattern.&lt;/strong&gt; The 8t is a training chip: 9,600 of them wired into a single superpod, 121 ExaFlops of compute, 2 petabytes of shared high-bandwidth memory, roughly 3x the pod-level compute of Ironwood. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; The 8i is an inference chip: 288 GB of HBM per chip, 384 MB of on-chip SRAM (3x the previous generation), 19.2 Tb/s of interconnect. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Those are not two SKUs of the same silicon. Those are two different design targets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyg833wtx192j5unw4dfs.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyg833wtx192j5unw4dfs.webp" alt="Training wants bandwidth, shown as a 3x3 grid of interconnected chips with data flowing between them; inference wants memory, shown as a single chip next to a tall terracotta memory block" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Training wants bandwidth. 9,600 chips have to exchange gradients every step, and the whole run stalls on the slowest link. That is why 8t doubles the interchip bandwidth and Google brags about 97% goodput, which is their way of saying the accelerators are actually computing instead of waiting on the network. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Inference wants memory. A single chip answers a user query in milliseconds, and the bottleneck is how much of the model and the running context fit in HBM without spilling. That is why 8i has 288 GB per chip and 3x the on-chip SRAM. Nothing about that helps training. Everything about it helps agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does the TPU 8i signal about inference workloads?
&lt;/h2&gt;

&lt;p&gt;There is a reason Google framed the 8i around what it calls the "agentic era." An agent is not a one-shot inference call. It is a loop: plan, call a tool, read the result, plan again, call another tool. Sometimes dozens of steps, sometimes hundreds. The model weights stay loaded. The KV cache keeps growing. Memory is not a nice-to-have. Memory is the budget.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaidj7fk1qgiahcqjloe.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaidj7fk1qgiahcqjloe.webp" alt="An agent loop with steps plan, call tool, read result, repeat, alongside three bars showing KV cache memory growing from step 1 to step 20" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;288 GB per chip is not a round number.&lt;/strong&gt; It is the number you pick when you have watched agents thrash HBM and decided to stop pretending 80 GB is enough. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;
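A back-of-the-envelope shows why per-chip HBM is the binding constraint: the KV cache grows linearly with context length. The layer and head counts below are illustrative for a large dense model, not the specs of any shipping product:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache footprint: one K and one V tensor per layer,
    2 bytes per element for bf16/fp16."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Illustrative settings for a large dense model (assumed, not measured).
per_100k = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=100_000)
per_1m = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"100k tokens: {per_100k:.1f} GB   1M tokens: {per_1m:.1f} GB")
```

At these assumed settings, a single 1M-token context already outruns 288 GB of HBM before the weights are even counted, which is exactly the regime where big SRAM, batching tricks, and cache compression start paying for themselves.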

&lt;p&gt;The performance-per-dollar claim is the tell. Google says 8i is 80% better on that metric than Ironwood and supports roughly 2x customer volume at the same cost. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; Nobody talks about dollars-per-token when training is the bottleneck. They talk about dollars-per-token when the bill is dominated by the inference that happens every time someone asks Gemini to do something. Which it now is, for Google and for everyone else.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://monkfrom.earth/blogs/turboquant-kv-cache-compression" rel="noopener noreferrer"&gt;I wrote earlier&lt;/a&gt; about how TurboQuant compressed the KV cache 6x in software. TPU 8i is the hardware version of the same bet: inference economics now run the conversation, and the team that optimizes for them wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is the universal GPU era ending with Google's TPU 8?
&lt;/h2&gt;

&lt;p&gt;Nvidia's H100 trains your model and serves your model. So does the B200. Nvidia does ship inference-leaning SKUs like the L4 and L40S, but the flagship data-center AI chip is still one die doing both jobs. That is the universal-GPU bet: one chip, two workloads, pay the compromise on both.&lt;/p&gt;

&lt;p&gt;The compromise is real. A training chip spends a lot of silicon on high-bandwidth fabric that an inference chip never uses. An inference chip wants big HBM and big SRAM that a training chip does not need in the same ratio. Force them into one die and you are renting every customer the worst of both worlds.&lt;/p&gt;

&lt;p&gt;Google is the biggest hyperscaler to ship purpose-built training and inference silicon in the same generation. &lt;strong&gt;AWS got there first with Inferentia in 2019 and Trainium in 2021.&lt;/strong&gt; Microsoft followed with Maia. &lt;sup id="fnref2"&gt;2&lt;/sup&gt; Meta has MTIA. The pattern is not Google being weird; it is the industry quietly admitting that the one-size-fits-all GPU was a phase, not a destination.&lt;/p&gt;

&lt;p&gt;Call it what it is. The TPU 8 announcement is a fork in the road for AI silicon. Nvidia has the software moat and the universality. Google, AWS, Microsoft, and Meta have vertical integration and two chips each. The question for the next three years is whether the software moat survives once specialized silicon is 2x cheaper per watt on the workload that actually pays the bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who wins and who loses as AI splits into two economies?
&lt;/h2&gt;

&lt;p&gt;Once training and inference become different businesses, the winners and losers sort themselves into different columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hyperscalers with volume on both sides win.&lt;/strong&gt; Google, AWS, Microsoft, Meta have the scale to justify two purpose-built chips instead of one compromise chip. Every specialized accelerator they ship is a workload they no longer rent from Nvidia. Training stays expensive; inference gets cheaper inside their walls than outside.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nvidia's dominance is challenged, not broken.&lt;/strong&gt; CUDA, NCCL, and two decades of tooling keep training workloads locked in. That is the half of the business that still prints money. Inference is the half that grows faster, and inference is where the hyperscalers are quietly migrating workloads onto their own silicon. The ceiling on Nvidia's growth is now set by how fast TPU, Trainium, and Maia can absorb inference volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Foundation model labs that do not own silicon get squeezed.&lt;/strong&gt; Anthropic rents from AWS and Google. OpenAI rents from Microsoft and the Stargate partners. All three of those landlords are building competitive models on the same chips they are renting out. The rent keeps going up and the cross-subsidy is one-way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Startups and app builders live or die on inference economics.&lt;/strong&gt; If you are building on foundation models, your margin is tokens-per-dollar. When hyperscalers drop inference cost 80% on their own silicon, that becomes the floor everyone else has to compete with. The team that ships the cheapest inference at scale becomes the cheapest place to build an app. For builders, that is a feature, not a threat. For anyone reselling Nvidia capacity with a markup, it is a countdown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Margins move to whoever runs the cheapest inference at scale.&lt;/strong&gt; Training is a capex line item, amortized over the life of a model. Inference is a variable cost on every single request. Whoever controls the variable cost controls the unit economics of the AI industry. That is the prize.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is the TPU 8 interconnect actually falling behind AWS and Microsoft?
&lt;/h2&gt;

&lt;p&gt;A recurring critique on the Hacker News thread was that Google's memory-to-interconnect ratio is slipping. &lt;sup id="fnref2"&gt;2&lt;/sup&gt; Worth taking seriously, and worth checking against the actual numbers, because the commenter had the units confused.&lt;/p&gt;

&lt;p&gt;Here is the like-for-like comparison, all bidirectional per chip:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ironwood (TPU v7):&lt;/strong&gt; 1.2 TB/s (9.6 Tb/s aggregate across four ICI links). &lt;sup id="fnref3"&gt;3&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google TPU 8i:&lt;/strong&gt; 2.4 TB/s (19.2 Tb/s per Google). &lt;sup id="fnref1"&gt;1&lt;/sup&gt; Roughly double Ironwood. Matches Google's "2x interconnect" claim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Trainium3:&lt;/strong&gt; 2 TB/s on NeuronLink-v4, inside a 144-chip UltraServer. &lt;sup id="fnref4"&gt;4&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Maia 200:&lt;/strong&gt; 2.8 TB/s bidirectional on an integrated on-die NIC. &lt;sup id="fnref5"&gt;5&lt;/sup&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbiminc1n0zaocp4bwvs.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbiminc1n0zaocp4bwvs.webp" alt="Horizontal bar chart comparing interconnect bandwidth per chip: Ironwood 1.2 TB/s, Trainium3 2.0 TB/s, TPU 8i 2.4 TB/s highlighted in terracotta, Maia 200 2.8 TB/s" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TPU 8i is not behind the pack. It beats Trainium3 and sits just shy of Maia 200. The "1.2" figure that got circulated was Ironwood, not 8i. Google doubled the number, and the doubling lands them in contention with the chips they are supposed to be losing to.&lt;/p&gt;
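The mix-up is easy to reproduce: vendors quote interconnect in terabits per second (Tb/s), while like-for-like comparisons usually run in terabytes (TB/s). The conversion is a divide-by-eight:

```python
def tbps_to_TBps(terabits_per_second: float) -> float:
    """Convert a Tb/s (terabits) interconnect figure to TB/s (terabytes)."""
    return terabits_per_second / 8

# The figures from the list above.
assert tbps_to_TBps(19.2) == 2.4  # TPU 8i
assert tbps_to_TBps(9.6) == 1.2   # Ironwood, the source of the circulated 1.2
```

Read a Tb/s number as TB/s and a chip looks 8x slower than it is, which is roughly the size of the error in the thread.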

&lt;p&gt;The real open question is ratios. Maia 200 ships 216 GB of HBM; TPU 8i ships 288 GB. Bigger memory pools need more bandwidth to drain, and at some point inference workloads start begging for more interconnect. That tradeoff is real. But it is a tuning debate inside a competitive band, not evidence Google has fallen off.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Google's TPU 8 move the AI moat to silicon?
&lt;/h2&gt;

&lt;p&gt;Step back from the chip. Look at the stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google owns every layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fab relationship&lt;/strong&gt; with TSMC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chip design&lt;/strong&gt; (TPU 8)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interconnect&lt;/strong&gt; (ICI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data centers&lt;/strong&gt; (with custom Axion CPUs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler&lt;/strong&gt; (XLA)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training framework&lt;/strong&gt; (JAX)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serving stack&lt;/strong&gt; (for inference)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt; (Gemini)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product&lt;/strong&gt; (Search, Workspace, Android)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uiatwkybh1t7628s3aw.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uiatwkybh1t7628s3aw.webp" alt="A nine-layer stack showing every layer Google owns, from TSMC fabs at the bottom to Gemini and consumer products at the top" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When TPU 8 ships, Google's own workloads get the 2x perf-per-watt before anyone else does. And the people who rent Google's TPUs are renting a stack that was optimized end to end by the same company.&lt;/p&gt;

&lt;p&gt;Anthropic leans on AWS and Google Cloud. OpenAI leans on Microsoft and the Stargate partners. The labs with the best models rent their silicon. Google builds its own.&lt;/p&gt;

&lt;p&gt;Now look at what the last twelve months showed us about models. &lt;strong&gt;DeepSeek R1 replicated frontier capability at a fraction of the training cost in January 2025.&lt;/strong&gt; &lt;sup id="fnref6"&gt;6&lt;/sup&gt; Open weights caught up faster than anyone expected. Llama, Qwen, Mistral, DeepSeek, Gemma: the gap between the best closed model and a competent open one keeps shrinking. Models replicate. That is the whole point of software.&lt;/p&gt;

&lt;p&gt;Fabs do not replicate. You cannot fork TSMC. You cannot clone a 9,600-chip liquid-cooled superpod on a weekend. The thing the industry spent two years arguing about, whose model is smartest, turns out to be the part that commoditizes fastest. The thing nobody argues about, whose silicon is cheapest per useful token, is the part that compounds. &lt;a href="https://monkfrom.earth/blogs/openai-122b-what-it-means-for-ai-space" rel="noopener noreferrer"&gt;The $122B OpenAI raised&lt;/a&gt; is mostly going to buy this capacity, not build better models.&lt;/p&gt;

&lt;p&gt;This is the same lesson constraints usually teach. The visible layer changes constantly. The load-bearing layer underneath does not, and whoever owns it wins slowly, then suddenly. Gemini can stay a half-step behind Claude on agentic coding and Google still comes out ahead if the cost to serve is half. Skeptics on the Hacker News thread were right that the model quality gap is real. &lt;sup id="fnref2"&gt;2&lt;/sup&gt; They were arguing about the wrong layer.&lt;/p&gt;

&lt;p&gt;The TPU 8 split is not an engineering footnote. It is the moment Google stopped pretending the moat was the model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI is splitting into two economies.&lt;/strong&gt; Training is capex-heavy and concentrated in a handful of hyperscalers. Inference is where apps, agents, and revenue actually scale. TPU 8 is the first chip generation to treat the split as the default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPU 8 is two chips.&lt;/strong&gt; 8t for training (9,600-chip pods, 121 ExaFlops, 2 PB HBM). 8i for inference (288 GB HBM, 384 MB SRAM, 19.2 Tb/s interconnect). &lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Up to 2x performance-per-watt versus Ironwood&lt;/strong&gt; on both chips; 3x pod compute on 8t; 80% better performance-per-dollar on 8i. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperscalers win, Nvidia gets squeezed on inference, labs without silicon pay rent both ways.&lt;/strong&gt; Margins move to whoever runs the cheapest inference at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The moat is moving to silicon.&lt;/strong&gt; Models replicate (DeepSeek). Fabs and full-stack integration do not. &lt;sup id="fnref6"&gt;6&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General availability later in 2026.&lt;/strong&gt; Citadel Securities is the first named customer. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are the TPU 8t and TPU 8i?
&lt;/h3&gt;

&lt;p&gt;They are the two chips in Google's eighth generation TPU. The 8t is the training chip, built into 9,600-chip superpods that deliver 121 ExaFlops and 2 petabytes of shared high-bandwidth memory. The 8i is the inference chip, with 288 GB of HBM, 384 MB of on-chip SRAM, and 19.2 Tb/s of interconnect bandwidth per chip. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;
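&lt;p&gt;&lt;em&gt;Google has not published per-chip 8t figures; dividing the cited pod totals gives an implied average, sketched here as plain arithmetic:&lt;/em&gt;&lt;/p&gt;

```python
# Implied per-chip averages from the cited 8t pod specs:
# 9,600 chips, 121 ExaFlops, and 2 PB of shared HBM per superpod.
pod_chips = 9_600
pod_hbm_gb = 2 * 1_000_000   # 2 PB expressed in GB
pod_pflops = 121 * 1_000     # 121 ExaFlops expressed in PFLOPs

print(f"~{pod_hbm_gb / pod_chips:.0f} GB HBM per 8t chip")
print(f"~{pod_pflops / pod_chips:.1f} PFLOPs per 8t chip")
```

&lt;p&gt;That works out to roughly 208 GB of HBM and about 12.6 PFLOPs per training chip. Implied averages only, not announced per-chip specs.&lt;/p&gt;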

&lt;h3&gt;
  
  
  How does Google's TPU 8 compare to Ironwood?
&lt;/h3&gt;

&lt;p&gt;Google cites up to 2x better performance-per-watt versus Ironwood and roughly 3x more compute per pod on 8t. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; Logan Kilpatrick from Google framed the headline gain as 2 to 3x depending on workload. &lt;sup id="fnref7"&gt;7&lt;/sup&gt; TPU 8i claims 80% better performance-per-dollar and supports roughly 2x customer volume at the same cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why did Google split training and inference in TPU 8?
&lt;/h3&gt;

&lt;p&gt;Training and inference want different hardware. Training is bandwidth-hungry across thousands of chips running for weeks. Inference is memory-hungry on a single chip running for milliseconds. Ironwood was one chip forced to serve both. TPU 8 admits the compromise was costing money and built two.&lt;/p&gt;

&lt;h3&gt;
  
  
  When will Google's TPU 8 be available?
&lt;/h3&gt;

&lt;p&gt;General availability is planned for later in 2026. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; Citadel Securities is the named early customer in Google's announcement.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I break down things like this on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;. Usually shorter, sometimes as carousels. If this resonated, you would probably like those too.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/eighth-generation-tpu-agentic-era/" rel="noopener noreferrer"&gt;Google: Eighth generation TPU for the agentic era&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://news.ycombinator.com/item?id=47862497" rel="noopener noreferrer"&gt;Hacker News discussion of TPU 8 announcement&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://docs.cloud.google.com/tpu/docs/tpu7x" rel="noopener noreferrer"&gt;Google Cloud: TPU7x (Ironwood) documentation&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://aws.amazon.com/ec2/instance-types/trn3/" rel="noopener noreferrer"&gt;AWS: Trn3 UltraServers and NeuronLink-v4&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://techcommunity.microsoft.com/blog/azureinfrastructureblog/deep-dive-into-the-maia-200-architecture/4489312" rel="noopener noreferrer"&gt;Microsoft: Deep dive into the Maia 200 architecture&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;a href="https://dev.to/blogs/turboquant-kv-cache-compression"&gt;I wrote earlier about KV cache compression and the software side of this same bet&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;a href="https://x.com/OfficialLoganK/status/2046998392434508143" rel="noopener noreferrer"&gt;Logan Kilpatrick on X: TPU 8 and Gemini&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>google</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>Claude Design vs Figma, Lovable, v0: What's Different</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:20:09 +0000</pubDate>
      <link>https://forem.com/monkfromearth/claude-design-vs-figma-lovable-v0-whats-different-44mi</link>
      <guid>https://forem.com/monkfromearth/claude-design-vs-figma-lovable-v0-whats-different-44mi</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Figma, Lovable, v0, and Claude Design are not the same tool. They pick different starting points: &lt;strong&gt;the design file, an idea, a component prompt, your codebase.&lt;/strong&gt; Different starting points, different jobs.&lt;/p&gt;




&lt;p&gt;If you have shipped a product, you know the cycle. Brief to designer. Something comes back that does not quite match the brand. Revise. Engineer reinterprets the spec. Revise again. Two weeks later, the thing looks slightly off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Early adopters described cutting that whole cycle to a single conversation.&lt;/strong&gt; One team reported going from a week-long brief-to-code loop to one session. That is the shift worth unpacking, and it gets lost when Claude Design is compared head-to-head with tools solving different problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Do Figma AI, Lovable, and v0 Actually Do?
&lt;/h2&gt;

&lt;p&gt;Each tool has a clear job. The press keeps comparing the wrong jobs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Figma Make&lt;/strong&gt; (Figma's AI layer): generates designs from prompts inside the Figma canvas. &lt;strong&gt;Starts from the design file.&lt;/strong&gt; &lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lovable&lt;/strong&gt;: turns a plain-language description into a full-stack deployable app. &lt;strong&gt;Starts from an idea.&lt;/strong&gt; &lt;sup id="fnref2"&gt;2&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0 by Vercel&lt;/strong&gt;: generates React and Tailwind components from prompts. Developer-facing, fast. &lt;strong&gt;Starts from a component need.&lt;/strong&gt; &lt;sup id="fnref2"&gt;2&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Design&lt;/strong&gt;: reads your GitHub repo and generates designs shaped by what is already there. &lt;strong&gt;Starts from your production codebase.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four tools, four starting points.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does Claude Design Do That Figma, Lovable, and v0 Don't?
&lt;/h2&gt;

&lt;p&gt;When you connect a GitHub repo, Claude Design reads your codebase and extracts: &lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tailwind config:&lt;/strong&gt; your spacing scale, breakpoints, color tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global CSS:&lt;/strong&gt; your CSS variables, font stacks, base styles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Font declarations and logo SVGs:&lt;/strong&gt; the visual identity already in your code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Component names:&lt;/strong&gt; the vocabulary your engineers use&lt;/li&gt;
&lt;/ul&gt;
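&lt;p&gt;Anthropic has not published how the extractor works, but the underlying idea, that design intent is machine-readable once it lives in code, is easy to sketch. A hypothetical, minimal version of one step, pulling CSS custom properties out of a global stylesheet (the stylesheet and token names here are invented):&lt;/p&gt;

```python
import re

# Hypothetical sketch of one extraction step: pull CSS custom properties
# (design tokens) out of a global stylesheet with a regex. The real
# extractor is not public; this only illustrates the principle.
GLOBAL_CSS = """
:root {
  --color-brand: #1a56db;
  --font-sans: "Inter", sans-serif;
  --spacing-unit: 4px;
}
"""

def extract_tokens(css):
    # Match `--name: value;` pairs anywhere in the stylesheet.
    return dict(re.findall(r"(--[\w-]+)\s*:\s*([^;]+);", css))

print(extract_tokens(GLOBAL_CSS))
```

&lt;p&gt;Tailwind config and component names would need real parsers rather than a regex, but the principle is the same: if the token lives in the repo, a tool can read it without anyone configuring anything.&lt;/p&gt;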

&lt;p&gt;&lt;strong&gt;What comes out is a design system that already matches what you shipped.&lt;/strong&gt; Not one you configure. The one living in your repo.&lt;/p&gt;

&lt;p&gt;Two designers ran a live side-by-side the day it launched. Same brief to both: redesign a real blog in Readymag style, passed in as a screenshot and a markdown context file. Claude Design produced a layout that tracked the reference. Lovable produced something competent but generic, closer to a WordPress theme than the brand they pointed at. The designer's read: &lt;strong&gt;"designers now can cook."&lt;/strong&gt; Not a replacement, a lever. &lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81ni6cqrtnpc2nqzfunc.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81ni6cqrtnpc2nqzfunc.webp" alt="Claude Design interface showing prompt-to-prototype with design system extraction" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You build from there. Prompt to prototype. When the prototype is ready, one instruction passes it to &lt;a href="https://monkfrom.earth/blogs/zuckerberg-back-to-coding-claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, which also reads your codebase. &lt;strong&gt;The loop closes: idea, design, production code, no translation step.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lovable and v0 aim at different outputs. Lovable gives a greenfield founder a new app. v0 gives a developer a component to paste in. Claude Design gives a team with an existing product something pre-fitted to their repo. &lt;sup id="fnref5"&gt;5&lt;/sup&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Starting From Your Codebase Matter?
&lt;/h2&gt;

&lt;p&gt;Different starting points serve different people.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Figma&lt;/strong&gt; treats the design file as the canonical home for a brand. For design teams, that is still true.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Design&lt;/strong&gt; treats the repo as canonical. That fits a different team: one where design intent already lives in Tailwind tokens, CSS variables, and component names.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This matters most to one person: the engineer or PM extending a live product.&lt;/strong&gt; Not building something new. Not exploring from a blank canvas. Extending what is already there, in a way that matches what is already there.&lt;/p&gt;

&lt;p&gt;For that person, starting from the repo removes a translation step. The output is already shaped by the code it will land in. The other tools are not worse at this. They are aimed elsewhere.&lt;/p&gt;




&lt;p&gt;I post breakdowns like this regularly on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;. The angle is always what it means for builders, not what the press release says.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Should You Use Each Tool?
&lt;/h2&gt;

&lt;p&gt;Pick by the starting point that matches your job.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Figma&lt;/strong&gt; is the tool for design teams on a shared canvas. Pixel precision, component libraries, review workflows, handoff annotations. Claude Design does none of this. &lt;sup id="fnref5"&gt;5&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lovable&lt;/strong&gt; is the tool when you have no product yet and want idea to deployed app without code. MVP, internal tool, first prototype. Claude Design is not for that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0&lt;/strong&gt; is the tool when you need a React component fast and can edit code. Claude Design is not trying to replace that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Claude Design is aimed at a specific step:&lt;/strong&gt; you have a live product, a new feature to design, and you need something that already matches everything you built. Teams have always solved this with some combination of briefs, design exploration, review, handoff, and engineering interpretation. Claude Design compresses that into a conversation that starts from the repo. Whether that is the right trade depends on the team.&lt;/p&gt;

&lt;p&gt;The broader pattern is familiar. &lt;a href="https://monkfrom.earth/blogs/zuckerberg-back-to-coding-claude-code" rel="noopener noreferrer"&gt;Zuckerberg returning to the codebase after 20 years using Claude Code&lt;/a&gt; is the same story. So is &lt;a href="https://monkfrom.earth/blogs/karpathy-autoresearch-explained-ml-to-marketing" rel="noopener noreferrer"&gt;Karpathy explaining AI workflows to people who do not write code&lt;/a&gt;. &lt;strong&gt;AI is not replacing the work. It is eliminating the translation layers between people who do different kinds of work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One signal worth noting.&lt;/strong&gt; Mike Krieger, Anthropic's CPO and Instagram co-founder, resigned from Figma's board on April 14, three days before Claude Design launched. He had joined less than a year earlier. The resignation was disclosed to the SEC the same day The Information reported Anthropic was building design tools. &lt;sup id="fnref6"&gt;6&lt;/sup&gt; The adjacency was close enough for the board seat to become untenable, even though the two products are aimed at different jobs.&lt;/p&gt;

&lt;p&gt;The market read the adjacency in real time.&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;Anthropic Labs launched Claude Design, a new product for creating visual assets, prototypes, slides, and one-pagers with Claude.&lt;br&gt;&lt;br&gt;It is rolling out in research preview to Pro, Max, Team, and Enterprise users, powered by Claude Opus 4.7. &lt;a href="https://twitter.com/search?q=%24ADBE&amp;amp;src=ctag&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;$ADBE&lt;/a&gt; &lt;a href="https://twitter.com/search?q=%24FIG&amp;amp;src=ctag&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;$FIG&lt;/a&gt; &lt;a href="https://t.co/5u0TOMSqSW" rel="noopener noreferrer"&gt;https://t.co/5u0TOMSqSW&lt;/a&gt; &lt;a href="https://t.co/TblMIEJE4u" rel="noopener noreferrer"&gt;pic.twitter.com/TblMIEJE4u&lt;/a&gt;&lt;/p&gt;— Wall St Engine (@wallstengine) &lt;a href="https://twitter.com/wallstengine/status/2045163733203501378?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;April 17, 2026&lt;/a&gt;
&lt;/blockquote&gt; 
&lt;h2&gt;
  
  
  What Are Claude Design's Limitations Right Now?
&lt;/h2&gt;

&lt;p&gt;Claude Design is a research preview as of April 2026. Real constraints worth knowing before you try it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No multiplayer.&lt;/strong&gt; For a design team on a shared canvas, Figma still wins cleanly. &lt;sup id="fnref5"&gt;5&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token burn is heavy.&lt;/strong&gt; Claude Design runs on Opus 4.7 and is metered separately from your chat and Claude Code usage. Pro is described as "quick explorations, one-off use." One user reported two design sessions consuming 58% of their weekly Pro allowance. &lt;sup id="fnref7"&gt;7&lt;/sup&gt; To use it regularly, you need Max.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototyping-level output, not production polish.&lt;/strong&gt; The design system extraction makes things brand-consistent, but it is not a replacement for a designer's eye on the final layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export options are practical&lt;/strong&gt; but limited: PDF, PPTX, standalone HTML, Canva. &lt;sup id="fnref8"&gt;8&lt;/sup&gt; The HTML export is also how the Claude Code handoff closes the loop. Anthropic's own ecosystem, end to end.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A 5-Question Claude Design Readiness Check
&lt;/h2&gt;

&lt;p&gt;Before you open it, ask these. If you answer yes to three or more, Claude Design fits your workflow today. If not, Figma, Lovable, or v0 is probably the better tool for the job.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Do you already have a shipped product in a GitHub repo?&lt;/strong&gt; Claude Design starts from code that exists. No repo, no extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is your design system encoded in Tailwind config, CSS variables, or component names?&lt;/strong&gt; That is what the extractor reads. Design tokens locked in a Figma file alone will not transfer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are you extending an existing product rather than starting from zero?&lt;/strong&gt; The tool's edge is fit to what is already there. For greenfield work, Lovable or v0 is closer to the job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can one person own the design-to-code loop, or does it need multiplayer?&lt;/strong&gt; No shared canvas. If three designers need to work on the same file, Figma still wins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are you on Max, or willing to rate-limit yourself on Pro?&lt;/strong&gt; Two sessions burned 58% of a weekly Pro allowance. Regular use needs Max.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three or more yeses and the translation step this tool removes is a real one for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Figma, Lovable, v0, and Claude Design &lt;strong&gt;pick different starting points.&lt;/strong&gt; Different starting points, different jobs.&lt;/li&gt;
&lt;li&gt;Figma treats the &lt;strong&gt;design file as canonical.&lt;/strong&gt; Claude Design treats the &lt;strong&gt;codebase as canonical.&lt;/strong&gt; Neither is wrong; they suit different teams.&lt;/li&gt;
&lt;li&gt;Claude Design's design system extraction reads your Tailwind, CSS, and component names to generate &lt;strong&gt;on-brand output from the first prompt&lt;/strong&gt;, without manual configuration.&lt;/li&gt;
&lt;li&gt;Each tool fits a different starting point: &lt;strong&gt;Figma for collaborative design work, Lovable for greenfield apps, v0 for quick components, Claude Design for extending an existing codebase.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token burn is real.&lt;/strong&gt; Claude Design is metered separately. Pro is for one-off use. Regular use requires Max.&lt;/li&gt;
&lt;li&gt;Anthropic's CPO &lt;strong&gt;resigned from Figma's board three days before launch.&lt;/strong&gt; Figma's stock dropped 5 to 7% on launch day, a read of the adjacency, not a verdict on either product.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Shipping something where this trade-off matters and want a second read on it? &lt;a href="mailto:hi@monkfrom.earth?subject=Claude%20Design%20take"&gt;Get in touch&lt;/a&gt;. I reply to every thoughtful email.&lt;/p&gt;

&lt;p&gt;I post builder-first takes on AI tooling on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;. The kind that skip the hype and go straight to what changes for people who ship. If that is useful, a follow goes a long way.&lt;/p&gt;







&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://www.figma.com/make/" rel="noopener noreferrer"&gt;Figma Make: AI-powered design tools&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://www.nocode.mba/articles/lovable-vs-V0" rel="noopener noreferrer"&gt;Lovable vs v0: Which AI Builder Is Better? nocode.mba&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://support.claude.com/en/articles/14604397-set-up-your-design-system-in-claude-design" rel="noopener noreferrer"&gt;Set up your design system in Claude Design, Claude Help Center&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=o4jIKc_DIoM" rel="noopener noreferrer"&gt;Claude Design vs Lovable: live side-by-side comparison, YouTube&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://www.eigent.ai/blog/claude-design-vs-figma-make" rel="noopener noreferrer"&gt;Claude Design vs Figma Make: 2026 AI Design Tool Comparison, eigent.ai&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;a href="https://techcrunch.com/2026/04/16/anthropic-cpo-leaves-figmas-board-after-reports-he-will-offer-a-competing-product/" rel="noopener noreferrer"&gt;Anthropic CPO leaves Figma's board after reports he will offer a competing product, TechCrunch&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;a href="https://www.testingcatalog.com/anthropic-launches-claude-design-ai-tool-for-paid-plans/" rel="noopener noreferrer"&gt;Anthropic launches Claude Design AI tool for paid plans, Testing Catalog&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;&lt;a href="https://www.anthropic.com/news/claude-design-anthropic-labs" rel="noopener noreferrer"&gt;Introducing Claude Design by Anthropic Labs&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>design</category>
      <category>ui</category>
      <category>ux</category>
    </item>
    <item>
      <title>Meta Muse Spark: What Meta Is Actually Betting On</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Fri, 17 Apr 2026 12:33:55 +0000</pubDate>
      <link>https://forem.com/monkfromearth/meta-muse-spark-what-meta-is-actually-betting-on-1794</link>
      <guid>https://forem.com/monkfromearth/meta-muse-spark-what-meta-is-actually-betting-on-1794</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Meta launched Muse Spark on April 8, 2026. Most commentary split into two camps. Meta went closed because Meta won. Meta went closed because Meta lost. Both miss what Meta actually built. Muse Spark does frontier-class reasoning in &lt;strong&gt;less than half the output tokens&lt;/strong&gt; Claude Opus 4.6 and GPT-5.4 spend on the same benchmark, and Meta AI, the product serving roughly &lt;strong&gt;three billion daily active users&lt;/strong&gt;, runs on it. Read Muse Spark as an efficiency-first, patiently sequenced, consumer-scale bet, and the choices that look strange on their own start fitting together.&lt;/p&gt;

&lt;p&gt;The week Muse Spark launched, the conversation split almost immediately. One camp said Meta finally caught up and closed the doors. Another said Meta finally fell behind and is hiding it. Both sides were arguing about the license. Neither was arguing about the model.&lt;/p&gt;

&lt;p&gt;The bet Meta actually made isn't captured by the license. It's captured by three choices that are easy to miss through the open-weights lens. Muse Spark is designed for &lt;strong&gt;fewer tokens per query&lt;/strong&gt;. It is framed as &lt;strong&gt;step one of a long sequence&lt;/strong&gt;. And it is shipping first as the engine of a consumer product reaching &lt;strong&gt;three billion daily active users&lt;/strong&gt;. Those three choices, taken together, describe a different game than the one most labs are playing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Muse Spark is
&lt;/h2&gt;

&lt;p&gt;Muse Spark is Meta Superintelligence Labs' first model, shipped April 8 after a nine-month rebuild of Meta's AI infrastructure. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; It is a natively multimodal reasoning model with three modes. Instant for fast responses. Thinking for reasoning-heavy queries. Contemplating, positioned against Gemini Deep Think and GPT Pro for long scientific work. It supports tool use, visual chain of thought, and multi-agent orchestration. &lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Meta AI, the consumer product on meta.ai and the Meta AI app, runs on it today. The Muse Spark API is in private preview for selected partners. Alexandr Wang, Meta's Chief AI Officer, has said broader API access is coming. &lt;sup id="fnref3"&gt;3&lt;/sup&gt; The weights have not been released, and Meta has not committed to whether or when they will be.&lt;/p&gt;

&lt;p&gt;On the Artificial Analysis Intelligence Index v4.0, Muse Spark scores 52. GPT-5.4 and Gemini 3.1 Pro Preview score 57. Claude Opus 4.6 scores 53. &lt;sup id="fnref4"&gt;4&lt;/sup&gt; Fourth at the frontier, as the frontier is currently measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Efficiency is the number that matters
&lt;/h2&gt;

&lt;p&gt;Meta's headline technical claim is that Muse Spark reaches its capabilities with over an order of magnitude less compute than Llama 4 Maverick, the prior Meta flagship. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; That is a training-side claim. The more interesting number sits on the inference side.&lt;/p&gt;

&lt;p&gt;To complete the Artificial Analysis Intelligence Index v4.0 run, Muse Spark used &lt;strong&gt;58 million output tokens&lt;/strong&gt;. Claude Opus 4.6 used &lt;strong&gt;157 million&lt;/strong&gt;. GPT-5.4 used &lt;strong&gt;120 million&lt;/strong&gt;. &lt;sup id="fnref4"&gt;4&lt;/sup&gt; Muse Spark reaches roughly the same tier of performance while spending less than half the output tokens of its closest competitors.&lt;/p&gt;
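&lt;p&gt;&lt;em&gt;The "less than half" claim checks out arithmetically against the cited counts:&lt;/em&gt;&lt;/p&gt;

```python
# Muse Spark's output tokens as a fraction of each competitor's,
# using the benchmark-run counts cited above.
spark, opus, gpt = 58e6, 157e6, 120e6
print(f"vs Claude Opus 4.6: {spark / opus:.0%}")
print(f"vs GPT-5.4: {spark / gpt:.0%}")
```

&lt;p&gt;About 37% of Opus 4.6's token spend and 48% of GPT-5.4's.&lt;/p&gt;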

&lt;p&gt;Meta calls the mechanism &lt;strong&gt;thought compression&lt;/strong&gt;. During reinforcement learning, the model is penalized for excessive reasoning tokens. It is trained to reach the same answer with fewer intermediate steps. &lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;
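&lt;p&gt;Meta has not published the objective, so treat this as a minimal sketch of the idea with an invented penalty weight: a correct answer earns reward, every reasoning token subtracts a little, and the shorter chain to the same answer scores higher.&lt;/p&gt;

```python
# Minimal sketch of a length-penalized RL reward in the spirit of
# "thought compression". TOKEN_PENALTY is invented for illustration.
TOKEN_PENALTY = 0.001

def reward(correct, reasoning_tokens):
    base = 1.0 if correct else 0.0
    return base - TOKEN_PENALTY * reasoning_tokens

# Two rollouts that both reach the right answer:
verbose = reward(True, 800)
terse = reward(True, 200)
print(round(verbose, 3), round(terse, 3))  # the terse chain scores higher
```

&lt;p&gt;Under an objective shaped like this, the policy drifts toward fewer intermediate steps for the same answers, which is exactly the inference-side saving the benchmark numbers show.&lt;/p&gt;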

&lt;p&gt;Zoom out. Llama 4 Maverick scored &lt;strong&gt;18&lt;/strong&gt; on the same index. Muse Spark scores &lt;strong&gt;52&lt;/strong&gt;. &lt;sup id="fnref4"&gt;4&lt;/sup&gt; &lt;strong&gt;Nearly 3x jump&lt;/strong&gt; in one release, using roughly a tenth of the training compute, producing a model that serves answers in less than half the output tokens of its peers. That is not a fourth-place story. It is a different-axis story.&lt;/p&gt;

&lt;p&gt;Thought compression isn't the only lever. Fei Xia, a Meta researcher, showed Muse Spark tackling a hard visual counting task using parallel subagents: divide the image into a grid, assign a subagent per tile, merge the counts. &lt;sup id="fnref5"&gt;5&lt;/sup&gt; That is a second axis of test-time compute scaling. Not fewer tokens per query, but many smaller queries instead of one large one. Both compound efficiency at inference time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhve0hn6bw8c8pt0pztvr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhve0hn6bw8c8pt0pztvr.jpg" alt="Fei Xia, a Meta researcher, used Muse Spark to count birds in a dense flock. The model divided the image into a 4x4 grid, ran parallel subagents per tile, and returned per-tile counts of 24, 50, 46, 11 across the top row, summing to 431 across the frame. A note flags the count as a conservative lower bound because of overlapping birds and sub-threshold specks." width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;
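&lt;p&gt;The orchestration shape is simple enough to sketch. A toy version of the grid-and-merge pattern, with a 0/1 matrix standing in for the image and a counting function standing in for each subagent (all values invented):&lt;/p&gt;

```python
# Toy version of the grid-and-merge pattern: divide the image into
# tiles, let one "subagent" count its own tile, then sum the partials.
# Real subagents would be model calls; here each is a counting function.
image = [
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]

def tile(img, r0, r1, c0, c1):
    return [row[c0:c1] for row in img[r0:r1]]

def count_subagent(t):
    # Each subagent only ever sees its own tile.
    return sum(sum(row) for row in t)

# A 2x2 grid of tiles over the 4x6 image.
tiles = [tile(image, r, r + 2, c, c + 3) for r in (0, 2) for c in (0, 3)]
total = sum(count_subagent(t) for t in tiles)
print(total)  # matches a whole-image count
```

&lt;p&gt;The merge step is where the efficiency lives: four small contexts instead of one large one, run in parallel.&lt;/p&gt;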

&lt;p&gt;Matt Ridley, in &lt;em&gt;How Innovation Works&lt;/em&gt;, argues that real technological progress almost never looks like a breakthrough in the moment. It looks like &lt;strong&gt;compounded efficiency&lt;/strong&gt;. &lt;sup id="fnref6"&gt;6&lt;/sup&gt; The Wright brothers didn't fly higher than their competitors; they iterated longer. Meta's claim with Muse Spark is that the same mechanism is back in large language models as the active design constraint. Fewer tokens per query, optimized over releases, compounded.&lt;/p&gt;

&lt;p&gt;Under the efficiency thesis, the contribution is the training recipe, not the weights. The productized result at three billion DAUs is what the recipe is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patience as a structural choice
&lt;/h2&gt;

&lt;p&gt;Wang's launch thread called Muse Spark &lt;strong&gt;"step one."&lt;/strong&gt; &lt;sup id="fnref3"&gt;3&lt;/sup&gt; Meta has named three modes, shipped two of them, and placed Contemplating on a published roadmap. The release itself followed a &lt;strong&gt;nine-month rebuild&lt;/strong&gt; of Meta's internal AI infrastructure before any new model went out. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;That pattern is uncommon. Labs announce quarterly, deprecate on shorter cycles, and swap naming schemes every six weeks. A frontier lab committing to a staged ladder with named but unbuilt later steps is the exception.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrphsou62v1tyd0t033t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrphsou62v1tyd0t033t.png" alt="Muse Spark Contemplating mode benchmark table from Meta's launch. On Humanity's Last Exam (No Tools), Muse Spark scores 50.2, Gemini 3.1 Deep Think 48.4, GPT 5.4 Pro 43.9. With Tools: 58.4, 53.4, 58.7. IPhO 2025 Theory: 82.6, 87.7, 93.5. FrontierScience Research: 38.3, 23.3, 36.7. The ladder's later rungs already produce numbers that compete with Deep Think and GPT Pro." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jeff Bezos's 1997 shareholder letter made a version of this argument: "We will continue to make investment decisions in light of long-term market leadership considerations rather than short-term profitability considerations."&lt;sup id="fnref7"&gt;7&lt;/sup&gt; Most companies quote the line. Very few behave like it. Muse Spark is Meta behaving like it. A nine-month silence, a named sequence, an efficiency-first architecture that only pays back at scale.&lt;/p&gt;

&lt;p&gt;Patience has a failure mode. If the ladder breaks, the gap widens. If competitors keep improving quarterly and Muse Spark's step two arrives in 2027, the index score will read worse, not better. That is the actual risk of the strategy. Not the license. The cadence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The game Meta is actually playing
&lt;/h2&gt;

&lt;p&gt;Roughly three billion daily active users touch Meta's products. Muse Spark powers Meta AI across them. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; Every prompt, every caption suggestion, every smart reply, every image generation across meta.ai, Instagram, WhatsApp, and Facebook is a query served at Meta's cost.&lt;/p&gt;

&lt;p&gt;Reread the efficiency numbers with that denominator. 58 million output tokens per benchmark run is interesting when you run one benchmark. It is structural when you run hundreds of billions of inferences. Cutting thinking time by more than half is how inference economics actually move at Meta's scale.&lt;/p&gt;
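&lt;p&gt;A back-of-envelope sketch makes "structural" concrete. Only the 58 million versus 157 million benchmark totals below come from the launch coverage; the query volume, per-query token count, and price per million tokens are hypothetical placeholders, there just to show how the arithmetic scales.&lt;/p&gt;

```python
# Back-of-envelope inference economics. The benchmark totals come from
# the article (58M output tokens for Muse Spark vs 157M for Claude Opus
# 4.6 on the same evaluation); everything else is a hypothetical
# placeholder, not a real Meta number.

BENCH_TOKENS_EFFICIENT = 58_000_000    # Muse Spark, per the article
BENCH_TOKENS_BASELINE = 157_000_000    # Claude Opus 4.6, per the article
ratio = BENCH_TOKENS_EFFICIENT / BENCH_TOKENS_BASELINE  # roughly 0.37

QUERIES_PER_DAY = 3_000_000_000        # about one query per DAU, hypothetical
TOKENS_PER_QUERY_BASELINE = 400        # hypothetical average response length
PRICE_PER_MILLION_TOKENS = 2.00        # hypothetical serving cost, USD

def daily_cost(tokens_per_query):
    """Daily serving cost for a given average response length."""
    total_tokens = QUERIES_PER_DAY * tokens_per_query
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

baseline = daily_cost(TOKENS_PER_QUERY_BASELINE)
efficient = daily_cost(TOKENS_PER_QUERY_BASELINE * ratio)

print(f"baseline:  ${baseline:,.0f}/day")
print(f"efficient: ${efficient:,.0f}/day")
print(f"saved:     ${baseline - efficient:,.0f}/day")
```

&lt;p&gt;With these placeholder numbers, the roughly 2.7x token gap is worth seven figures per day in serving cost. Change any placeholder and the absolute dollars move, but the ratio, which is the part the training recipe controls, does not.&lt;/p&gt;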

&lt;p&gt;The API is a secondary product. The primary product is a feature inside applications people already use. That framing answers most of the questions that the closed-weights decision seems to raise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why closed:&lt;/strong&gt; releasing the weights gives up the only part that is uniquely Meta: distribution plus efficient inference under Meta's control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why efficiency-first:&lt;/strong&gt; cost-per-query is the load-bearing variable at three billion users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why fourth on the index:&lt;/strong&gt; the index measures capability, not capability per dollar of inference. Meta is not optimizing for the thing the index measures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why patience:&lt;/strong&gt; product cycles at Meta's scale run in quarters and years, not weeks. A staged ladder matches the cadence of the products that will ship the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenAI, Anthropic, and Google primarily sell access. Meta does not. Meta bundles. A closed, efficient model embedded in consumer distribution is a product shape no other frontier lab has a direct answer to right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Muse Spark bets against
&lt;/h2&gt;

&lt;p&gt;Muse Spark bets against three premises that have held in AI for three years. That benchmark rank drives strategic outcomes. That fast iteration beats staged iteration. That serving the weights is the dominant form of distribution.&lt;/p&gt;

&lt;p&gt;If Meta is right, competitors re-architect. Expect tokens-per-benchmark to become a reported number. Expect ladder-style release roadmaps. Expect fewer labs selling raw access and more labs selling integrated products.&lt;/p&gt;

&lt;p&gt;If Meta is wrong, Muse Spark stays fourth on the index, the efficiency claim gets normalized by competitors' next releases, and the Scale-era thesis fades into another nine-month rebuild.&lt;/p&gt;

&lt;p&gt;Deedy, in a popular thread after launch, called Muse Spark's reasoning "solid but not best in class." &lt;sup id="fnref5"&gt;5&lt;/sup&gt; That read is fair if you are benchmarking reasoning. It is beside the point if you are measuring how to serve reasoning to three billion people.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency is the headline, not the license.&lt;/strong&gt; Muse Spark uses 58 million output tokens where Claude Opus 4.6 uses 157 million on the same evaluation. &lt;sup id="fnref4"&gt;4&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training efficiency is roughly ten times Llama 4 Maverick.&lt;/strong&gt; The index score nearly tripled in one release. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; &lt;sup id="fnref4"&gt;4&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Patience is the structural bet.&lt;/strong&gt; A nine-month rebuild, a three-mode ladder, a second-step roadmap that is named but not shipped. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; &lt;sup id="fnref3"&gt;3&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialization explains the choices.&lt;/strong&gt; Meta AI reaches three billion DAUs, and inference economics at that scale reward low tokens per query, not high leaderboard rank. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The license is a symptom of the strategy.&lt;/strong&gt; If efficiency plus distribution plus patience is the bet, releasing the weights gives the bet away.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've been writing about &lt;a href="https://monkfrom.earth/blogs/good-products-hard-to-vary" rel="noopener noreferrer"&gt;how constraints shape design, not features&lt;/a&gt; for a while, and Muse Spark is a useful instance of the pattern. The interesting move in AI this year might not be the model that scores higher. It might be the model that answers in fewer tokens and ships inside an application a billion people already open every day.&lt;/p&gt;

&lt;p&gt;I break things like this down on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;. Usually shorter, sometimes as carousels. If this read resonated, you'd probably like those.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://ai.meta.com/blog/introducing-muse-spark-msl/" rel="noopener noreferrer"&gt;Meta AI, "Introducing Muse Spark"&lt;/a&gt;, April 8, 2026 ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2026/Apr/8/muse-spark/" rel="noopener noreferrer"&gt;Simon Willison, "Meta's new model is Muse Spark"&lt;/a&gt;, April 8, 2026 ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://x.com/alexandr_wang/status/2041909376508985381" rel="noopener noreferrer"&gt;Alexandr Wang on X, launch thread and API update&lt;/a&gt;, April 2026 ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://www.datacamp.com/blog/muse-spark" rel="noopener noreferrer"&gt;Muse Spark: Features, Benchmarks, and How to Use It, DataCamp&lt;/a&gt;, April 2026 ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://x.com/xf1280/status/2043730980264128673" rel="noopener noreferrer"&gt;Fei Xia and Deedy Das on Muse Spark capabilities (thread)&lt;/a&gt;, April 13, 2026 ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;Matt Ridley, &lt;em&gt;How Innovation Works: And Why It Flourishes in Freedom&lt;/em&gt; (HarperCollins, 2020) ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;a href="https://www.sec.gov/Archives/edgar/data/1018724/000119312513151836/d511111dex991.htm" rel="noopener noreferrer"&gt;Jeff Bezos, 1997 Letter to Shareholders&lt;/a&gt;, Amazon ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>GPT-5.4-Cyber explained: OpenAI's cyber-only AI</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Wed, 15 Apr 2026 08:33:55 +0000</pubDate>
      <link>https://forem.com/monkfromearth/gpt-54-cyber-explained-openais-cyber-only-ai-1nhn</link>
      <guid>https://forem.com/monkfromearth/gpt-54-cyber-explained-openais-cyber-only-ai-1nhn</guid>
      <description>&lt;p&gt;Two days ago I wrote about &lt;a href="https://monkfrom.earth/blogs/claude-mythos-autonomous-cyberattack" rel="noopener noreferrer"&gt;Claude Mythos completing AISI's 32-step cyberattack chain end-to-end&lt;/a&gt;. On April 14, OpenAI put out the clearest signal yet that the labs are reading the same capability curve and building the defender track in advance.&lt;/p&gt;

&lt;p&gt;They announced &lt;strong&gt;GPT-5.4-Cyber&lt;/strong&gt;, a version of GPT-5.4 fine-tuned to be "cyber-permissive," and scaled up their &lt;strong&gt;Trusted Access for Cyber (TAC)&lt;/strong&gt; program to thousands of verified individual defenders and hundreds of teams defending critical software.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; In their own words, this is shipping "in preparation for increasingly more capable models over the next few months."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; This is defender tooling shipped before the next capability jump, not after. The model is the headline. The real story is a fine-tuned permissive variant named, tiered, and published as a product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1p49fy2jdmp8yhcz84a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1p49fy2jdmp8yhcz84a.png" alt="OpenAI's April 14, 2026 announcement: Trusted access for the next era of cyber defense" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Primary source: &lt;a href="https://openai.com/index/scaling-trusted-access-for-cyber-defense/" rel="noopener noreferrer"&gt;OpenAI on scaling trusted access for cyber defense&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does GPT-5.4-Cyber Actually Unlock for Defenders?
&lt;/h2&gt;

&lt;p&gt;Same base model as GPT-5.4, different refusal boundary. OpenAI's description: a model that "lowers the refusal boundary for legitimate cybersecurity work" and adds capabilities like binary reverse engineering. It can analyze compiled software for malware, vulnerabilities, and robustness without access to source code.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Binary reverse engineering is the concrete unlock, and it is not small. It is one of the highest-leverage things a defender can automate, and it is exactly the kind of request that trips every refusal classifier ever built. The same prompt from a malicious actor yields the same output. The model cannot tell them apart. The verification layer can.&lt;/p&gt;

&lt;p&gt;Everything else in the envelope is less dramatic but more useful at scale. Vulnerability research without the hedging. Security education that answers the question instead of warning about it. Defensive programming help that does not refuse to describe the attack it is trying to prevent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Was Refusal Always a Bad Safeguard?
&lt;/h2&gt;

&lt;p&gt;For three years, the default safety move has been to push risk into the model through refusal training. It was the cheapest thing to ship and the easiest thing to measure. It also quietly assumed attackers and defenders use the same tool, so making the tool worse would hurt both evenly.&lt;/p&gt;

&lt;p&gt;That assumption was always wrong. Attackers run local models, jailbroken models, and purpose-built tooling. Refusals mostly tax the defenders trying to follow the rules.&lt;/p&gt;

&lt;p&gt;GPT-5.4 (classified "high" cyber capability under OpenAI's Preparedness Framework) keeps its refusal boundary for the public. The permissive variant ships only to people who have agreed to be identified. This is closer to how physical-world dual-use actually works. Pharmacies stock dangerous drugs behind an identity check, not behind a refusal. Labs buy restricted reagents with a license. The safeguard is not the molecule. It is the paperwork.&lt;/p&gt;
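&lt;p&gt;The structural shift is easy to sketch: the safeguard moves out of the model's refusal training and into an identity check around it. The toy routing function below is purely illustrative; the names, tiers, and task labels are hypothetical, not OpenAI's actual TAC implementation.&lt;/p&gt;

```python
# Toy sketch of an identity-gated capability tier. All names and tiers
# are hypothetical illustrations of the pattern described above.

from dataclasses import dataclass

@dataclass
class Requester:
    user_id: str
    verified_defender: bool  # established by an out-of-band KYC process

def select_policy(requester, task):
    """Route to a refusal policy based on identity, not prompt content."""
    if task == "binary_reverse_engineering" and requester.verified_defender:
        return "permissive"   # gated variant: answers in full
    if task == "binary_reverse_engineering":
        return "refuse"       # public variant keeps the boundary
    return "default"          # ordinary requests are unaffected

public = Requester("u1", verified_defender=False)
vetted = Requester("u2", verified_defender=True)

print(select_policy(public, "binary_reverse_engineering"))  # refuse
print(select_policy(vetted, "binary_reverse_engineering"))  # permissive
```

&lt;p&gt;The point of the shape: the decision keys on who is asking, which a verification process can establish, rather than on what is asked, which the model alone cannot disambiguate.&lt;/p&gt;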

&lt;h2&gt;
  
  
  GPT-5.4-Cyber and the Mythos Parallel
&lt;/h2&gt;

&lt;p&gt;My last three posts on Claude Mythos describe the same shape from different angles. &lt;a href="https://monkfrom.earth/blogs/claude-mythos-system-card" rel="noopener noreferrer"&gt;The system card&lt;/a&gt; showed a model with enough situational awareness to conceal its own actions. &lt;a href="https://monkfrom.earth/blogs/anthropic-glasswing-ai-cybersecurity" rel="noopener noreferrer"&gt;Project Glasswing&lt;/a&gt; showed the same model finding thousands of zero-days in critical open-source infrastructure. The &lt;a href="https://monkfrom.earth/blogs/claude-mythos-autonomous-cyberattack" rel="noopener noreferrer"&gt;AISI cyber range&lt;/a&gt; showed it running a full 32-step autonomous cyberattack. Mythos itself is gated. Anthropic ships it only through its own trust program.&lt;/p&gt;

&lt;p&gt;So both frontier labs already operate the same access model: dual-use capability behind verified identity. What is new with GPT-5.4-Cyber is that OpenAI is the first to take the defender side of that model and publish it as a product tier: a named, fine-tuned, cyber-permissive variant with its own enrollment path and its own preparedness designation. Anthropic's gating is a policy. OpenAI's is a SKU.&lt;/p&gt;

&lt;p&gt;You can see the same bet in the numbers they quietly dropped in the same post. Codex Security has contributed to &lt;strong&gt;over 3,000 critical and high vulnerability fixes&lt;/strong&gt; since launch. Codex for Open Source has reached &lt;strong&gt;more than 1,000 open source projects&lt;/strong&gt;. The &lt;strong&gt;$10M Cybersecurity Grant Program&lt;/strong&gt; keeps funding defender tooling.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; In the Mythos cyberattack post I wrote: &lt;em&gt;"I'd bet on it eventually, but 'eventually' and 'right now' are different things in security."&lt;/em&gt; This is a lab betting "right now," on the defender side, and betting it visibly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Verifies the Verifier?
&lt;/h2&gt;

&lt;p&gt;The uncomfortable follow-up to any identity-gated safeguard. OpenAI is now the identity layer for a meaningful slice of the security industry. Every defender applying for the permissive tier is trusting one company's KYC pipeline to decide who counts as a defender, and trusting OpenAI's interpretation of "legitimate use" to hold up over time.&lt;/p&gt;

&lt;p&gt;This is the part of the announcement I would most want to see discussed over the next few weeks. It is also the part nobody will discuss, because the new model is shinier than the policy question behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.4-Cyber&lt;/strong&gt; is a fine-tuned GPT-5.4 with fewer capability restrictions, shipped only to vetted defenders under the Trusted Access for Cyber program.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preemptive, not reactive.&lt;/strong&gt; OpenAI is shipping this ahead of more capable base models coming in the next few months, in their own words.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Both labs already gate dual-use.&lt;/strong&gt; Mythos is restricted through Anthropic's trust program. What is new is OpenAI naming a fine-tuned permissive variant as a product tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open question:&lt;/strong&gt; who audits the identity layer when OpenAI and Anthropic become the KYC gate for a chunk of the security industry?&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I break down AI safety and capability stories on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;. If this resonated, you would probably like those too.&lt;/p&gt;


&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://openai.com/index/scaling-trusted-access-for-cyber-defense/" rel="noopener noreferrer"&gt;OpenAI on scaling trusted access for cyber defense (April 14, 2026)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Claude Mythos Is the First AI to Complete a Full Corporate Cyberattack End-to-End</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:35:25 +0000</pubDate>
      <link>https://forem.com/monkfromearth/claude-mythos-is-the-first-ai-to-complete-a-full-corporate-cyberattack-end-to-end-3mk5</link>
      <guid>https://forem.com/monkfromearth/claude-mythos-is-the-first-ai-to-complete-a-full-corporate-cyberattack-end-to-end-3mk5</guid>
      <description>&lt;p&gt;The UK's AI Security Institute confirmed this week that Claude Mythos, an Anthropic model, became the first AI to complete their cyber range end-to-end.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; The range is a &lt;strong&gt;32-step corporate network attack&lt;/strong&gt; scenario. Human experts estimate the same attack would take them &lt;strong&gt;20 hours&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The institute's recommendation to organizations: keep your software updated. Use access controls. Enable logging.&lt;/p&gt;

&lt;p&gt;The gap between those two sentences is the part of this story I keep returning to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Claude Mythos ran a full autonomous cyberattack, 32 steps, end-to-end, in a scenario that takes human experts 20 hours. It is the first AI to complete AISI's cyber range. The official response was to recommend basic security hygiene. The mismatch between the capability and the response is where the real story lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Did AI Go From Basic Cyber Tasks to a Full Autonomous Cyberattack?
&lt;/h2&gt;

&lt;p&gt;Self-driving cars give me the cleanest parallel here.&lt;/p&gt;

&lt;p&gt;For a decade, every individual piece of the self-driving puzzle existed as a demo. Lane-keeping worked. Adaptive cruise worked. Automated parking worked. What didn't exist, for years, was the full ride. Door to door, no human touching the wheel. When Waymo's first commercial robotaxi picked up a passenger in 2020, what changed wasn't the individual capabilities. It was the threshold: chaining all of them into one uninterrupted ride.&lt;/p&gt;

&lt;p&gt;The same thing just happened in offensive cybersecurity.&lt;/p&gt;

&lt;p&gt;Each step of a network attack has been within reach of AI models for a while. Reconnaissance. Crafting payloads. Pivoting through a subnet. Covering tracks. What didn't exist was a model that could chain all 32 of those steps together without a human stepping in between. Claude Mythos did.&lt;/p&gt;

&lt;p&gt;In 2023, leading AI models struggled with basic cybersecurity tasks. Not sophisticated ones. Basic ones. Three years later, one of them drove the entire route.&lt;/p&gt;

&lt;p&gt;AISI published the actual curve, and it is worth looking at directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgraaer1ehytkwzcembp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgraaer1ehytkwzcembp.jpeg" alt="AISI evaluation showing average steps completed on 'The Last Ones' cyber range per spent tokens. Claude Mythos Preview reaches around 22 steps on average and a maximum of roughly 32, clearly above Claude Opus 4.6, GPT-5.4, GPT-5.3 Codex, Claude Opus 4.5, Claude Sonnet 4.5, and GPT-4o." width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The red line is Mythos. GPT-4o sits near the bottom, completing around three steps before running out of tokens. Sonnet 4.5 gets to roughly 11. Opus 4.5 and the GPT-5 family cluster in the mid teens. Opus 4.6 pushes past 16. Mythos is the only line that clears the middle milestones: C2 reverse engineering, advanced persistence, infrastructure compromise, and eventually M9, "Full network takeover."&lt;sup id="fnref1"&gt;1&lt;/sup&gt; The shape of that curve is what "first AI to complete the range end-to-end" actually looks like.&lt;/p&gt;

&lt;p&gt;AISI is careful about the current scope. The capability applies to "small, weakly defended, and vulnerable systems" given network access. Think of it as the robotaxi that only works on mapped, sunny, well-marked urban grids. Hardened enterprise infrastructure with proper controls is still a different problem, the same way a snowy mountain pass is still a different problem for Waymo.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The trajectory is what matters. 2023 to 2026 is three years.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does an Autonomous Cyberattack Change the Security Equation?
&lt;/h2&gt;

&lt;p&gt;The asymmetry in security has always been simple: attackers need to find one gap, defenders need to close every door.&lt;/p&gt;

&lt;p&gt;AI doesn't change that asymmetry. It changes the cost of running an attack. An automated system doesn't need domain expertise to chain 32 steps. It doesn't get tired halfway through. It doesn't hesitate at unfamiliar territory.&lt;/p&gt;

&lt;p&gt;What previously required a skilled adversary with deep knowledge, time, and custom tools now requires API access and a goal.&lt;/p&gt;

&lt;p&gt;The same model AISI tested on offense has been used defensively in &lt;a href="https://dev.to/blogs/anthropic-glasswing-ai-cybersecurity"&gt;Anthropic's Project Glasswing&lt;/a&gt; to find thousands of zero-days in critical open-source infrastructure. Offense and defense, same capability, same model. The dual-use nature isn't incidental. It's structural. Whoever has the model has both sides.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Should Organizations Do After Claude Mythos Ran a Full Cyberattack?
&lt;/h2&gt;

&lt;p&gt;Patch your systems. Use MFA. Enable logging. AISI's recommendations are correct.&lt;/p&gt;

&lt;p&gt;But they were correct before this evaluation too. That's the part I can't get past.&lt;/p&gt;

&lt;p&gt;These recommendations address the baseline: opportunistic attackers, misconfigured systems, low-skill adversaries. They don't address the shift in assumption that happens when a fully autonomous cyberattack chain becomes possible. Hygiene is still necessary. It is no longer sufficient as a strategy.&lt;/p&gt;

&lt;p&gt;AISI published a joint piece with the UK's National Cyber Security Centre on preparing defenders for frontier AI systems.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; That collaboration exists because the people closest to this problem know the defensive tooling gap is real. The open question is whether the defensive side of AI moves as fast as the offensive side. I'd bet on it eventually, but "eventually" and "right now" are different things in security.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does the Claude Mythos Evaluation Pattern Reveal?
&lt;/h2&gt;

&lt;p&gt;This is the third notable evaluation result for Claude Mythos in April alone. &lt;a href="https://dev.to/blogs/claude-mythos-system-card"&gt;The system card&lt;/a&gt; showed a model with enough situational awareness to conceal its own actions. Project Glasswing showed it finding thousands of vulnerabilities in critical infrastructure. The AISI cyber range shows it running a full autonomous cyberattack.&lt;/p&gt;

&lt;p&gt;These aren't contradictions. They are the same underlying capability applied in different contexts. A model capable enough for complex multi-step reasoning is capable enough to create real problems at scale.&lt;/p&gt;

&lt;p&gt;The value of these evaluations is that they name what's happening before it becomes a crisis, even when the recommendations that follow don't match the scale of what was just described. Naming it first is not nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Mythos&lt;/strong&gt; became the first AI to complete a &lt;strong&gt;32-step corporate cyberattack&lt;/strong&gt; chain end-to-end in AISI's cyber range&lt;/li&gt;
&lt;li&gt;Human experts estimate the same operation takes &lt;strong&gt;20 hours&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In 2023, leading models couldn't complete basic cybersecurity tasks. Three years later, one completed a full autonomous cyberattack&lt;/li&gt;
&lt;li&gt;Current capability is scoped to "small, weakly defended" systems, not enterprise infrastructure with proper controls&lt;/li&gt;
&lt;li&gt;The trajectory matters more than the current benchmark: three years of rapid progress, with no signs of slowing&lt;/li&gt;
&lt;li&gt;AISI's defensive recommendations (patch, use MFA, enable logging) are correct but baseline — they predate this evaluation&lt;/li&gt;
&lt;li&gt;AISI and the UK NCSC published joint guidance on preparing defenders for frontier AI systems&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I break down things like this on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt; — usually shorter, sometimes as carousels. If this resonated, you'd probably like those too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://x.com/AISecurityInst/status/2043683577594794183" rel="noopener noreferrer"&gt;AI Security Institute (@AISecurityInst) — Claude Mythos cyber range evaluation, April 13, 2026&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>architecture</category>
      <category>news</category>
    </item>
    <item>
      <title>Zuckerberg Is Writing Code Again. With Claude Code.</title>
      <dc:creator>Sameer Khan</dc:creator>
      <pubDate>Sun, 05 Apr 2026 10:24:18 +0000</pubDate>
      <link>https://forem.com/monkfromearth/zuckerberg-is-writing-code-again-with-claude-code-26b1</link>
      <guid>https://forem.com/monkfromearth/zuckerberg-is-writing-code-again-with-claude-code-26b1</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Mark Zuckerberg shipped 3 diffs to Meta's monorepo last month, his first code in 20 years. He's a heavy user of Claude Code CLI. One of his diffs got 200+ approvals from engineers who wanted to say they reviewed the CEO's code. He's not the only one. Garry Tan at Y Combinator is doing the same thing. The pattern is clear: AI coding tools are pulling founders back into the codebase.&lt;/p&gt;




&lt;h2&gt;
  
  
  What happened?
&lt;/h2&gt;

&lt;p&gt;Gergely Orosz at The Pragmatic Engineer &lt;a href="https://newsletter.pragmaticengineer.com/p/the-pulse-industry-leaders-return" rel="noopener noreferrer"&gt;reported this week&lt;/a&gt; that Mark Zuckerberg is back to writing code. Three diffs landed in Meta's monorepo in March 2026. His tool of choice: &lt;strong&gt;Claude Code CLI&lt;/strong&gt;, Anthropic's terminal-based AI coding assistant. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;To put the scale in perspective: Meta's monorepo now has &lt;strong&gt;close to 100 million diffs&lt;/strong&gt;. Back in 2006, the entire Facebook codebase had fewer than 10,000. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Zuckerberg's last meaningful code contributions were in 2006. That's a 20-year gap. The fact that he's back, and using an AI tool to do it, says something about where we are.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2010 diff that got force-merged
&lt;/h2&gt;

&lt;p&gt;This isn't Zuckerberg's first time making waves in code review.&lt;/p&gt;

&lt;p&gt;In 2010, he submitted a diff that made profile photos clickable on the profile page. Michael Novati, a senior engineer who would become the first person to hold Meta's L7 "coding machine" archetype, &lt;a href="https://newsletter.pragmaticengineer.com/p/the-coding-machine-at-meta" rel="noopener noreferrer"&gt;blocked it&lt;/a&gt;. The reason: formatting issues everywhere. &lt;sup id="fnref1"&gt;1&lt;/sup&gt; &lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Zuckerberg overrode the block and force-merged it. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Novati spent eight years at Meta and was recognized as the top code committer company-wide for several of them. The Pragmatic Engineer did &lt;a href="https://newsletter.pragmaticengineer.com/p/the-coding-machine-at-meta" rel="noopener noreferrer"&gt;a full episode&lt;/a&gt; with him about what it means to be a "coding machine" at that scale. &lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The 2010 story is funny in hindsight. But the 2026 version is different. This time, Zuckerberg isn't force-merging past reviewers. He's using AI to write code that engineers actually want to approve. &lt;strong&gt;One of his March diffs got more than 200 approvals&lt;/strong&gt;, with devs jumping at the chance to say they'd reviewed the CEO's work. &lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond the anecdote
&lt;/h2&gt;

&lt;p&gt;Three diffs from the CEO of a 70,000-employee company are a footnote in a 100-million-diff monorepo. The signal isn't the code. It's the behavior.&lt;/p&gt;

&lt;p&gt;Zuckerberg isn't the only founder pulled back into the codebase by AI tools. Garry Tan, CEO of Y Combinator, &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;returned to coding&lt;/a&gt; after 15 years and open-sourced gstack, a Claude Code system with 23 specialist tools that turns the CLI into a virtual engineering team: code reviewer, QA lead, security auditor, release engineer. &lt;sup id="fnref3"&gt;3&lt;/sup&gt; &lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Tobias Lütke, CEO of Shopify, has been running experiments with &lt;a href="https://dev.to/blogs/karpathy-autoresearch-explained-ml-to-marketing"&gt;Karpathy's AutoResearch&lt;/a&gt; on internal company data. 37 experiments overnight. 19% performance gain.&lt;/p&gt;

&lt;p&gt;I wrote about &lt;a href="https://dev.to/blogs/karpathy-autoresearch-explained-ml-to-marketing"&gt;how AutoResearch works&lt;/a&gt; a few days ago. The throughline is the same: AI tools are collapsing the gap between "person with ideas" and "person who ships code." Founders used to be the first type. AI is turning them back into the second.&lt;/p&gt;

&lt;h2&gt;
  Meta's bet: AI writes most of the code
&lt;/h2&gt;

&lt;p&gt;Zuckerberg coding again isn't a hobby. It's a signal of where Meta is heading.&lt;/p&gt;

&lt;p&gt;Leaked internal documents from March 2026 show aggressive targets. Meta's creation org wants &lt;strong&gt;65% of engineers writing 75% or more of their committed code using AI&lt;/strong&gt; by mid-2026. The Scalable Machine Learning org set a target of 50-80% AI-assisted code. &lt;sup id="fnref5"&gt;5&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Zuckerberg himself said on Dwarkesh Patel's podcast that "in the next year, maybe half the development will be done by AI as opposed to people, and that will kind of increase from there." &lt;sup id="fnref6"&gt;6&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;He's not predicting this from the sidelines. He's using Claude Code in the terminal to ship diffs to his own monorepo. The CEO is the pilot customer.&lt;/p&gt;

&lt;h2&gt;
  The pattern worth watching
&lt;/h2&gt;

&lt;p&gt;There's a recurring shape here.&lt;/p&gt;

&lt;p&gt;Karpathy builds AutoResearch. Constrains the agent to one file, one metric, one 5-minute cycle. The constraint is the invention. Lütke runs it on Shopify data overnight. Marketers adapt it for landing pages.&lt;/p&gt;

&lt;p&gt;Anthropic builds Claude Code. Tan wraps it in 23 specialist agents. Zuckerberg uses it to ship his first code in 20 years.&lt;/p&gt;

&lt;p&gt;The tools don't just help engineers code faster. They re-open coding to people who stopped. Founders who moved into strategy, management, fundraising. People who haven't touched a codebase in a decade. The barrier to re-entry used to be months of catching up on tooling, frameworks, and conventions. Now it's a terminal and a prompt.&lt;/p&gt;

&lt;p&gt;That's a different kind of disruption than "AI replaces developers." It's closer to: AI brings back the builder-CEO. The person who can see a problem, describe a solution, and ship it before the meeting ends.&lt;/p&gt;

&lt;p&gt;Whether Zuckerberg's three diffs were good code is beside the point. The 200 engineers who approved them probably weren't reviewing for correctness. But the fact that a CEO can sit down with Claude Code and produce something that compiles, passes CI, and lands in a 100-million-diff monorepo? That's the new baseline.&lt;/p&gt;

&lt;h2&gt;
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zuckerberg shipped 3 diffs&lt;/strong&gt; to Meta's monorepo in March 2026, his first code in ~20 years, using the Claude Code CLI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One diff got 200+ approvals&lt;/strong&gt; from engineers eager to review the CEO's code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Garry Tan&lt;/strong&gt; (Y Combinator) also returned to coding after 15 years, open-sourcing gstack for Claude Code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meta targets 65% of engineers writing 75%+ of their code with AI&lt;/strong&gt; by mid-2026, with other orgs targeting 50-80% AI-assisted code&lt;/li&gt;
&lt;li&gt;AI coding tools are pulling &lt;strong&gt;founders back into codebases&lt;/strong&gt; they left years ago&lt;/li&gt;
&lt;li&gt;The disruption isn't "AI replaces developers," it's &lt;strong&gt;"AI re-opens development"&lt;/strong&gt; to people who stopped&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I break down things like this on &lt;a href="https://linkedin.com/in/monkfromearth" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://x.com/monkfromearth" rel="noopener noreferrer"&gt;X&lt;/a&gt;, and &lt;a href="https://instagram.com/monkfrom.earth" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;. If this resonated, you'd probably like those too.&lt;/p&gt;







&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://newsletter.pragmaticengineer.com/p/the-pulse-industry-leaders-return" rel="noopener noreferrer"&gt;The Pulse: Industry leaders return to coding with AI — The Pragmatic Engineer&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://newsletter.pragmaticengineer.com/p/the-coding-machine-at-meta" rel="noopener noreferrer"&gt;"The Coding Machine" at Meta with Michael Novati — The Pragmatic Engineer&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;gstack — Garry Tan's Claude Code setup (GitHub)&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://techcrunch.com/2026/03/17/why-garry-tans-claude-code-setup-has-gotten-so-much-love-and-hate/" rel="noopener noreferrer"&gt;Why Garry Tan's Claude Code setup has gotten so much love, and hate — TechCrunch&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://www.theweek.in/news/sci-tech/2026/03/27/how-aggressive-is-mark-zuckerberg-s-ai-native-push-for-meta-leaked-documents-offer-new-details-on-coding-targets.html" rel="noopener noreferrer"&gt;How aggressive is Mark Zuckerberg's 'AI-native' push for Meta? — The Week&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;a href="https://www.dwarkesh.com/p/mark-zuckerberg-2" rel="noopener noreferrer"&gt;Mark Zuckerberg — AI will write most Meta code in 18 months — Dwarkesh Patel&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>career</category>
    </item>
  </channel>
</rss>
