Forem: jidonglab

71,700 Stars and 60 Rust Crates: Inside OpenAI's Codex CLI Source

jidonglab — Sun, 03 May 2026 17:54:42 +0000

71,700 stars. 5,006 commits. 665 releases. 94.7% Rust. When OpenAI dropped the full source code of Codex CLI under the Apache 2.0 license, I expected a thin wrapper around an API. What I found was a deeply engineered system with over 60 Rust crates, OS-level sandboxing on three platforms, and an agent loop architecture that reveals how OpenAI thinks about local AI tooling.

I spent a weekend reading through the repository. This is what I found.

Why I Opened the Source

I've been building with AI coding tools for over a year. I wrote about bootstrapping a full pipeline with GPT-5 Codex in a single day, and I've documented my experience running parallel subagents with Claude Code. But every tool I've used has been a black box at some level. The model is remote. The agent logic is proprietary. The sandbox rules are opaque.

Related: 5,800 Lines in One Day: Bootstrapping a Full Pipeline With gpt-5-codex

When Codex CLI went open-source, the appeal was obvious: for the first time, I could read the exact code that decides what an AI agent can and cannot do on my machine. Not a documentation page. Not a blog summary. The actual Rust source.

The repository lives at github.com/openai/codex. The core logic sits under codex-rs/, with four top-level crates that divide responsibility cleanly. There's core for the agent loop and tool execution. There's cli for the terminal interface. There's tui for the text-based graphical UI. And there's headless, which speaks JSON-RPC over stdio so that VS Code extensions and web applications can connect to the same engine without any GUI.

The Agent Loop, Unwrapped

The architecture follows a pattern that's becoming standard in AI coding tools, but the implementation details matter. When you type a natural language command into Codex CLI, the core crate takes over. It constructs an HTTP request to the OpenAI Responses API and opens a streaming connection. Events arrive one by one. Some are text tokens. Some are tool calls.

Here's the flow, stripped to its essence:

User prompt
     |
     v
+------------------+
|  core: agent     |
|  loop            |
+------------------+
     |
     v
OpenAI Responses API
(streaming HTTP)
     |
     v
+------------------+
|  Event parser    |
+------------------+
     |
     +---> Text token --> stream to terminal
     |
     +---> Tool call --> execute locally
                |
                v
         Return result to API
                |
                v
         (loop continues)

The tool system is defined through a ToolSpec enum. Each tool declares its input and output via JSON Schema, which means the model knows exactly what parameters a tool accepts and what shape the response will take. This is the same pattern that Claude Code and other agent frameworks use, but seeing it implemented in Rust with full type safety gives it a different character. There's no runtime type coercion. No any escape hatch. Every tool call is validated at compile time against its schema.

What makes this extensible is MCP integration. MCP, the Model Context Protocol, allows external servers to register tools dynamically. If you have a custom database tool or a deployment script that you want the agent to use, you spin up an MCP server and Codex CLI discovers it at runtime. The agent treats MCP tools identically to built-in ones. Same schema validation. Same sandbox restrictions.

Related: Claude Code Config Guide

The system prompt lives in codex-rs/core/prompt.md, a plain Markdown file that anyone can read and modify. Configuration sits in ~/.codex/config.toml. Session state persists to a local SQLite database. There's no cloud state. Everything lives on your machine.

Sandboxing: Three Operating Systems, Three Strategies

The security model is where the engineering gets serious. Codex CLI doesn't just run commands in a subprocess and hope for the best. It wraps every tool execution in an OS-native sandbox.

On macOS, it uses Seatbelt, Apple's application sandbox framework. The same technology that restricts what App Store applications can access on your Mac is applied to every command the AI agent runs. File system access is limited to the project directory. Network connections can be blocked or scoped. Process creation is controlled.

On Linux, the approach combines Bubblewrap for filesystem and namespace isolation with Seccomp for system call filtering. Bubblewrap creates a lightweight container-like environment. Seccomp sits below that, blocking dangerous syscalls entirely. The agent literally cannot call execve on an arbitrary binary outside its allowed list.

On Windows, Restricted Tokens limit the process's access rights. It's the least granular of the three approaches, but it still prevents the agent from accessing files or registry keys outside its scope.

This is fundamentally different from how Claude Code handles security. Claude Code runs its model inference on Anthropic's cloud. The local client, written in TypeScript, executes tools on your machine but relies on a permission-based model rather than OS-level sandboxing. You approve or deny each action. With Codex CLI, the sandbox enforces restrictions regardless of what the model requests. The model can ask to read /etc/passwd all day long. Seatbelt will say no.

Related: Claude Code Subagents Parallel Guide

Performance: Caching and the Model Question

Codex CLI is built around GPT-5.3-Codex, a model variant optimized for code generation and tool use. The performance optimization strategy centers on prompt caching. System prompts, project context, and frequently repeated instructions are cached on the API side, reducing both latency and cost on subsequent turns within a conversation.

The Rust implementation itself contributes to performance in ways that a TypeScript or Python agent cannot match. There's no garbage collector pause. Memory allocation is deterministic. The binary ships as a single executable with no runtime dependencies. On my machine, cold start to first API call takes under 200 milliseconds. Compare that to Node.js-based tools where the module resolution alone can take longer.

The headless mode deserves attention here. By communicating over JSON-RPC through stdio using JSONL (newline-delimited JSON), Codex CLI achieves a clean separation between the engine and any frontend. The same binary that powers the terminal CLI also powers the VS Code extension and could power a web application. There's no separate server process. No WebSocket setup. Just stdin and stdout, the most universal IPC mechanism in computing.

What Open Source Changes

Reading Codex CLI's source changed how I think about AI coding tools. When I use Claude Code, I trust Anthropic's security claims. When I use Codex CLI, I can verify them. I can grep for seatbelt in the codebase and read the exact sandbox profile applied to my commands. I can open prompt.md and see what the model is told about my project before it generates a single token.

This transparency has practical implications. Enterprise security teams can audit the sandbox policies before approving the tool. Contributors can fix bugs in the agent loop without waiting for a vendor patch. Researchers can study a production-grade AI agent architecture without reverse engineering.

The choice of Rust is itself a statement. Building an AI tool's core in a systems language that guarantees memory safety, compiles to native code, and supports cross-platform builds from a single codebase signals long-term investment in performance and reliability. Python would have been easier. TypeScript would have matched the web ecosystem. Rust says: this tool will run on your machine, close to your code, and it will not crash.

The competitive landscape is shifting. Claude Code dominates in developer experience and model capability. Cursor owns the IDE integration space. GitHub Copilot has distribution through GitHub's user base. Codex CLI's bet is that full transparency and community ownership will attract developers who care about understanding and controlling the tools they depend on.

Whether that bet pays off depends on the community. The code is there. The architecture is solid. The question is whether 71,700 stars translate into contributors who push the tool beyond what OpenAI alone would build.

The best way to trust an AI tool is to read its source code.

What would you build if you could fork the entire agent loop?

Full Korean analysis on spoonai.me

Pentagon Blacklisted Anthropic From 8 Classified AI Deals

jidonglab — Sun, 03 May 2026 15:53:14 +0000

The most safety-focused AI lab on the planet just got blacklisted by the Pentagon for being too safe. On May 1, the U.S. Department of Defense signed IL6/IL7 classified AI deals with eight companies. Anthropic — the one that publishes its safety policy — was not on the list.

IL6 and IL7 are the Pentagon's classified-data tiers: IL6 covers Secret-level data (operational plans, weapons specs), IL7 covers Top Secret and Special Access Programs. They are the highest commercially-accessible authorization levels in U.S. government cloud, and getting cleared costs years and tens of millions per vendor.

I have been tracking this since the "supply chain risk" label leaked in February, and the May 1 announcement closes the loop in a way that should change how you think about building on frontier AI. The reason Anthropic is locked out is not technical. It is a single contract clause — and Anthropic filed a federal lawsuit rather than sign it.

The eight that signed, and what they shared

The Pentagon's roster reads like a who's-who of vendors who already accepted the rules of engagement. The big three cloud providers were locks because they already operate IL6 environments for the rest of the federal government. Google's inclusion is the headline of full rehabilitation: a company that walked out of military AI in 2018 after the Project Maven revolt is back inside the classified perimeter.

OpenAI's presence is the most striking shift. As recently as January 2024 its usage policy explicitly forbade military applications. The company quietly removed that restriction, accepted the Pentagon's "all lawful purposes" language, and two years later it is sitting at the classified AI table. That made OpenAI the template every other vendor was measured against.

Here is the half of the roster with the deepest defense roots:

Company	Defense role	Why they were a lock
Microsoft	Azure Government Secret/Top Secret + AI	JWCC holder, existing IL6
AWS	GovCloud IL6 + Bedrock AI	Operates the CIA's C2E cloud
Oracle	OCI Government cloud + database	Long-running DoD ERP operator
Nvidia	Training/inference GPU + software	De facto DoD AI infra standard

The other four describe a strategic bet rather than a compliance default. Google brings Gemini back into classified rotation. OpenAI brings GPT deployment behind the SCIF wall. SpaceX brings Starshield satellite comms and edge inference for forward-deployed networks. And Reflection AI — a 2025-founded startup negotiating a $25B valuation, with no public model and no shipped product — got a classified contract anyway.

The contrast with Anthropic is brutal. Claude is one of the most capable frontier models in the world. It was excluded. A company with no model was included. The selection criteria are not about technical capability. They are about willingness to sign whatever the government puts in front of you.

jidonglab

Apr 20

Claude Opus 4.7 Hit 87.6% on SWE-bench. The Story Is What It Didn't Charge You.

#ai #claude #llm #engineering

Comments

7 min read

The clause Anthropic refused to sign

In 2025, the DoD started requiring AI vendors to accept an "all lawful purposes" clause. The language gives the Pentagon broad discretion to use the AI for any legally authorized purpose — including lethal autonomous systems support, targeting, and weapons platform integration. It does not mandate those uses. It just does not exclude them. It is a blank check on usage rights.

Anthropic refused. The company's Acceptable Use Policy prohibits using Claude to "cause serious harm to people" and restricts autonomous weapons applications. Signing the clause would have effectively voided that policy — the exact thing Anthropic uses to differentiate itself from OpenAI. So early in 2026, the DoD formally classified Anthropic as a "supply chain risk." That label is the procurement equivalent of being declared a national security threat, except it did not come from a technical vulnerability or counterintelligence finding. It came from a procurement office that decided a vendor who might refuse to support a specific use case mid-contract is operationally unreliable.

Anthropic responded with a federal lawsuit in March, arguing the classification is arbitrary and amounts to a permanent ban without due process. The lawsuit is active. The DoD signed the eight-company contracts anyway. Compressed timeline: OpenAI dropped its military prohibition in January 2024; the "all lawful purposes" clause appeared in 2025; Anthropic refused in late 2025; the "supply chain risk" classification landed in early 2026; the lawsuit was filed in March; Axios reported the NSA was already running Anthropic's Mythos cyber-defense model on April 19; the IL6/IL7 contracts went out without Anthropic on May 1.

$15B  ── annual U.S. defense AI spend Anthropic just walked away from
$900B ── valuation Anthropic might land at next month
─────────────────────────────────────────────────────────────────────
The first AI company to refuse the Pentagon's terms is also the
most expensive one. That's not a coincidence. That's the trade.

The Mythos paradox makes the lawsuit interesting

The strangest part of this story is that within the same federal government, the NSA — an agency that operates at the highest classification levels — is actively using Anthropic's Mythos Preview model for network intrusion detection and threat analysis. Pentagon CTO Emil Michael addressed the contradiction in a May 1 CNBC interview with one sentence: "The blacklist holds, but Mythos is a separate issue."

Translate that. The DoD will not accept Anthropic as a full partner, but it will make exceptions when a specific Anthropic tool is too good to ignore. That position is operationally rational and legally fragile. You cannot simultaneously argue that a company is a supply chain risk to national security while running its model inside your signals intelligence agency. Well — you can argue it, Emil Michael just did, but it is not a position that holds up under judicial scrutiny.

This matters because Anthropic's lawsuit now has unusually clean evidence. If a federal court rules the "supply chain risk" classification was arbitrary, the DoD may be forced to revisit the contract structure and the precedent will reach far beyond Anthropic. It would set a limit on how governments can condition AI procurement on usage terms.

jidonglab

Apr 21

Stellantis Just Outsourced Its AI Moat to Microsoft. Expect GM, Ford, and VW to Follow.

#ai #automotive #microsoft #enterprise

Comments

8 min read

What this means if you are building on Claude

Three trade-offs are locking in at once. First, frontier AI procurement is now a usage-rights negotiation, not a capability evaluation. The Pentagon did not benchmark Claude against GPT and Gemini and decide Claude lost. It did not evaluate at all. The selection ran on contract acceptance. If you plan to sell into government, design your AUP knowing the "all lawful purposes" clause exists and pick a side before customers force you to. Retrofitting later is what got Anthropic into court.

Second, AI safety as a brand has become measurably more expensive. Anthropic's stance might still be the right long-term bet, but the short-term cost is a $15B/year market and a credibility hit at the worst possible moment in its fundraising cycle. The counter-bet is that this stance compounds into enterprise trust in regulated industries — healthcare, finance, EU public sector — that the contract-signers cannot match. Both bets are live.

Third, the trade-off developers feel first. If your project depends on the Claude API and your work touches federal, defense, or adjacent regulated domains, the ground has shifted. Commercial API access is unaffected today. But the "supply chain risk" label travels — once a federal agency uses it, primes start asking whether their own vendors are exposed. Build a contingency plan: know your migration path to the closest non-Anthropic substitute and price dual-vendor architecture before procurement asks.

If you want a way into the defense AI market itself, the door is more open than six months ago. Reflection AI's inclusion proves you do not even need a shipping product. You do need FedRAMP High, DISA STIGs, and the architecture differences between commercial cloud and Azure Government / AWS GovCloud — that experience is in genuinely short supply.

jidonglab

Apr 30

Symphony: Why OpenAI's PRs Jumped 500% in 3 Weeks

Comments

5 min read

So who is actually right here

Both sides have a coherent argument and that is exactly why this is hard. Defense hawks see Anthropic as naive: if you want government contracts, you play by government rules. OpenAI, Google, and Microsoft understood that. Anthropic chose principle over pragmatism and the bill came due.

AI safety researchers see the same situation as a stress test of whether responsible AI development means anything once the customer is the most powerful military on earth. If a frontier lab folds the moment the Pentagon offers a check, the entire concept of an Acceptable Use Policy becomes performative.

The Pentagon is operationally rational. Anthropic is philosophically coherent. This is not a case where one side is clearly wrong — it is a structural tension the industry has not resolved. History is kind to companies that held lines, but surviving as a $900B company without your own government's trust is harder than it looks.

The first AI vendor to refuse the Pentagon's contract terms is also the most expensive one — that is the new shape of AI sovereignty.

If you are shipping production workloads on Claude in a regulated industry, does Anthropic's Pentagon stance change your bet — or does it become the reason you double down?

Full Korean analysis on spoonai.me.

Sources:

Anthropic $900B: 2.4x in 90 Days, 48-Hour Window

jidonglab — Sun, 03 May 2026 15:52:38 +0000

Three months ago, Anthropic was valued at $380B. Today the number on the table is $900B. That is a 2.4x jump in 90 days, and the round closes in two weeks.

Anthropic is the AI lab behind Claude, currently raising a $50B round at a $900B valuation that would make it the most valuable AI startup ever. Bloomberg broke the story on April 29. CNBC confirmed the same day. By May 1, TechCrunch, PYMNTS, and Reuters had all piled on. The question stopped being "will it close" and became "what does the market look like the morning after."

I've been tracking Anthropic's cap table for two years and the velocity here is what I keep getting stuck on. Not the absolute number. The slope.

2.4x in 90 Days

Let me put the curve on the table, because the trajectory is the story.

Date	Valuation	Step
Sep 2024	$180B	—
Mar 2025	$610B	3.4x
Feb 2026	$380B	0.6x (correction)
May 2026	$900B	2.4x

That February dip is the part most coverage misses. Anthropic actually got marked down during the Q1 correction, then more than doubled in a single quarter. SpaceX's fastest historical climb did not hit 2.4x in a quarter. There is essentially no precedent for this slope at this scale.

The investor pool tells you why. Existing backers like Google, Salesforce, and Spark Capital are reupping, and a wave of new institutionals are clamoring to get an allocation. A $50B round with 48-hour allocation windows is not a fundraise — it is a queue management problem.

jidonglab

Apr 20

Claude Opus 4.7 Hit 87.6% on SWE-bench. The Story Is What It Didn't Charge You.

#ai #claude #llm #engineering

Comments

7 min read

The 48-Hour Math

Here is the deal mechanic that has term-sheet veterans raising eyebrows.

Anthropic's board is expected to approve in May. The target close is two weeks after that. Inside that window, investors who get an allocation have 48 hours to commit. Forty-eight hours to decide whether you want a slice of a $900B private company.

In normal markets, a round this size takes months to syndicate. Lawyers redline. LPs ask questions. Diligence drags. A 48-hour window means Anthropic is not negotiating — it is rationing. Demand exceeds the round size by a margin large enough that the company can dictate terms most founders would never get away with.

The structural bet behind that confidence is the revenue curve. Anthropic ended 2025 at roughly $9B ARR. By end of March 2026, they were at $30B. That is 3.3x in four months. Roughly 80% of it is enterprise, with over 1,000 customers spending $1M+ per year. This is not consumer subscription churn — it is committed annual contracts from companies whose procurement teams already justified the spend.

At $30B ARR, the implied multiple is about 30x. High, but not absurd at 200%+ growth. Whether "not absurd" survives the next two earnings cycles is the live question.

The Number That Actually Matters

Compare the two giants on the metric that determines whether $900B holds.

OpenAI    ARR: ~$130B   ->   Valuation $852B   (~6.5x)
Anthropic ARR: ~$30B    ->   Valuation $900B   (~30x)

OpenAI has 4x the revenue, and Anthropic is now priced higher. That gap is not irrational on its own — it is a bet on the second derivative. If Anthropic's growth rate persists for another year, the multiples converge. If it decelerates, the gap looks like a pricing error.

The composition of the gap also matters. OpenAI built breadth — hundreds of millions of consumer users, ChatGPT as a brand, an API platform. Anthropic built depth — fewer logos, bigger contracts, a "safe AI" wrapper that lets enterprise procurement teams sign without writing a memo defending the choice. In 2026, the second story is the one investors are paying for.

There is also a structural advantage I keep underweighting. Amazon has committed $25B and 5GW of compute. Google has committed $40B. Both clouds host Claude. Unlike OpenAI's deep dependence on Microsoft, Anthropic plays both sides of the cloud duopoly without locking into either. That optionality is worth a premium, and the term sheet reflects it.

Fortune ran an analysis the same week claiming roughly half of Google's and Amazon's AI-related profits trace back to Anthropic stake appreciation. Read in reverse: big tech AI profitability now depends materially on Anthropic's mark going up. That is a flywheel — or a feedback loop, depending on which side of the cycle you think we are on.

jidonglab

Apr 24

Why OpenAI Shipped GPT-5.5 Just 6 Weeks After 5.4

#ai #openai #webdev #productivity

Comments

8 min read

The Pentagon Paradox

The same week the $900B headlines hit, the Pentagon excluded Anthropic from its AI contract shortlist. Most companies would call that a bad week. For Anthropic, the two stories are arguably the same story.

Anthropic has held a cautious posture on military AI since founding. That clashes with Pentagon procurement. But it is exactly the posture that lets a bank, a hospital, or a law firm justify the contract to a risk committee. "This is the AI lab that turned down Pentagon money because it takes safety seriously" is a powerful procurement narrative in regulated verticals — and reports indicate those verticals are Anthropic's fastest-growing segments.

Reddit summarized it as "Pentagon snub, cap table revenge." That is glib, but directionally correct. The snub and the $900B are two faces of the same brand strategy.

Bubble or Breakout?

If $900B feels like a stretch, the instinct is not unfounded. The bear case writes itself: 30x revenue, 200% growth that has to persist, open-source models like Llama and DeepSeek closing the gap on price-per-token every quarter. Pre-2000 dot-com peaks looked structurally similar.

The difference, and it is a real one, is that Anthropic has $30B in ARR with 80% recurring. This is not a dream with no revenue. The bubble, if it is one, is not empty. That does not eliminate risk — it just reframes it from "is there a business" to "is the growth rate priced correctly."

What this means depends on where you sit. If you are a founder fundraising right now, the "AI valuations are still climbing" narrative works in your favor, but investor attention is concentrating at the very top of the stack — your differentiation has to be sharper than it was in 2024. If you are a developer building on Claude, expect API price drops as Anthropic deploys this capital, and revisit your platform dependency before the next major release locks in your stack. If you are an investor staring at a $900B entry, the only metric that matters is whether ARR growth holds for two more quarters; any deceleration breaks the multiple. And if you are inside a big tech AI org, this round is the starting gun for "phase two" — the axis is now which frontier model your cloud can offer with structural advantage, not whether you have one of your own.

The IPO timeline is the next domino. Market chatter places the earliest window at October 2026, with H1 2027 as the realistic case. Dario Amodei has said there is "no rush." That can mean two things: the company needs maturity, or when you can raise $50B privately at $900B, IPO urgency simply evaporates. Both are probably true.

jidonglab

Apr 30

Symphony: Why OpenAI's PRs Jumped 500% in 3 Weeks

Comments

5 min read

A $900B private valuation isn't a fundraise — it's a bet that the AI revenue curve doesn't bend before the IPO does.

If you had a 48-hour allocation window at $900B and $30B ARR, would you write the check?

Full Korean analysis on spoonai.me.

Sources:

Symphony: Why OpenAI's PRs Jumped 500% in 3 Weeks

jidonglab — Thu, 30 Apr 2026 15:01:14 +0000

OpenAI's internal teams landed five times more pull requests in the three weeks after they switched on Symphony. Not 50% more. Five hundred percent more, on the same headcount, in 21 days. That single number is why I cloned the repo the day it dropped.

Symphony is OpenAI's open-source orchestration layer that turns a Linear board into a control plane for coding agents, released April 28, 2026 as a reference implementation, not a maintained product. It is small — a few thousand lines of TypeScript wrapped around the Codex App Server — and it is deliberately opinionated. The core idea is so blunt it almost feels like a prank: stop supervising agents, manage tickets instead.

To understand why that idea is worth open-sourcing, you have to talk about the supervision tax. Anyone who has run a coding agent in anger knows the rhythm. You hand it a task, babysit the diff, nudge it when it loses the plot, re-prompt when it crashes, remember which terminal tab had the half-finished branch. By the time you have shepherded one PR to merge, half the day is gone. The tax is not the model's failure rate. It is the human attention each running agent demands. Multiply by three or four parallel agents and you stop being an engineer and start being a kindergarten teacher with a Slack window.

Symphony's pitch is that the kindergarten part is automatable. The board is already the queue — every Linear team has a backlog with assignees, labels, and acceptance criteria. Symphony reads that board on a poll, takes any ticket marked for an agent, spawns a dedicated workspace, runs the agent until it produces a PR, and links the PR back to the ticket. If the agent crashes mid-run, Symphony notices the dead process and restarts it on the same ticket. The human's job collapses to two verbs: write the ticket, review the PR.

Here is the loop, drawn out so you can see the shape of it.

   ┌──────────────┐    poll(30s)    ┌──────────────┐
   │ Linear board │ ───────────────▶│   Symphony   │
   │  (tickets)   │                 │   poller     │
   └──────────────┘                 └──────┬───────┘
          ▲                                │ spawn
          │ comment + PR link              ▼
   ┌──────┴───────┐                 ┌──────────────┐
   │  Pull req    │◀────── push ────│ agent worktree│
   │  on GitHub   │                 │  (Codex/Kata) │
   └──────────────┘                 └──────┬───────┘
                                           │ crash?
                                           ▼
                                    restart same task

The actual board read is unsurprising once you see it. Symphony's poller is essentially this, give or take some retry logic.

const tickets = await linear.issues({
  filter: { state: { name: { eq: "Agent Ready" } } },
});
for (const t of tickets.nodes) {
  if (!workspaces.has(t.id)) spawnAgent(t);
}

That is the whole control plane in spirit. A label on a ticket — Agent Ready in the default config — is the signal. Symphony walks the list, checks which IDs already have a live workspace, and spawns one for any that does not. No scheduler, no priority queue, no fairness algorithm. The board is the source of truth, and the poller is dumb on purpose. Change the status to In Review and Symphony stops handing it to the agent. When the agent opens a PR, it comments back on the Linear issue with the link, and the loop closes.

The piece I found genuinely surprising is the crash handling. Each ticket gets a worktree, each worktree gets a long-running Codex App Server session, and if the session dies Symphony restarts it on the same task with scratch state preserved on disk. That sounds boring until you realize it is exactly the property that lets you walk away. Most ad-hoc agent setups treat a crash as a failure the human has to triage. Symphony treats it like a Kubernetes pod restart — the agent comes back, reads its worktree and the ticket, and keeps going.

openai / symphony

Symphony turns project work into isolated, autonomous implementation runs, allowing teams to manage work instead of supervising coding agents.

Symphony

Symphony turns project work into isolated, autonomous implementation runs, allowing teams to manage work instead of supervising coding agents.

In this demo video, Symphony monitors a Linear board for work and spawns agents to handle the tasks. The agents complete the tasks and provide proof of work: CI status, PR review feedback, complexity analysis, and walkthrough videos. When accepted, the agents land the PR safely. Engineers do not need to supervise Codex; they can manage the work at a higher level.

Warning

Symphony is a low-key engineering preview for testing in trusted environments.

Running Symphony

Requirements

Symphony works best in codebases that have adopted harness engineering. Symphony is the next step -- moving from managing coding agents to managing work that needs to get done.

Option 1. Make your own

Tell your favorite coding agent to build Symphony in a programming language of your choice:

Implement Symphony…

View on GitHub

Then v1.1.0 shipped on the heels of the launch and the project stopped being an OpenAI thing. v1.1 added support for Kata CLI — based on the open-source pi-coding-agent harness — which means Symphony is now model-agnostic. Point a workspace at Claude Code, at Gemini, at any CLI that speaks the Kata protocol, and the orchestrator does not care. The ticket flows the same way, crash recovery works the same way, and the PR comes back through the same Linear comment hook. For OpenAI, this is generous. For everyone running a non-Codex stack, this is the real headline.

The natural question is why this needs to exist when the alternatives are so visible. Codex Cloud lets you fan out tasks from chat, a GitHub Actions matrix can fan out from a labeled issue, and a custom Redis-queue orchestrator takes a weekend to build. I have shipped versions of all three. Codex Cloud is excellent for one-off bursts, but it does not own a backlog — every task is something you initiated in a chat, so you are still feeding the queue. Actions matrices are great for parallelism, but the unit of work is a workflow run, not a long-lived agent that survives across runs; the moment a job exceeds 6 hours or needs to ask a question, the abstraction snaps. Custom orchestration solves both, but you rebuild ticket state, worktree management, restart logic, and PR linkage from scratch, and the bus factor is one. Symphony's contribution is not novel infrastructure. It is a reference shape — board, poller, workspace, PR, with crashes as a non-event — small enough to fork and opinionated enough to copy.

Now the 500% number, honestly. The figure comes from OpenAI's own launch post and refers to internal teams measuring landed PRs across roughly three weeks of Symphony usage versus their pre-Symphony baseline. That is a real measurement, but it deserves asterisks. The engineers were already deep Codex users with strong ticket hygiene — not the average shop. Three weeks is not long enough to wash out novelty effects, and "landed PRs" rewards small mergeable diffs, which agents happen to be good at. None of this means the number is wrong. It means it is the upper bound, and your team will probably see something smaller, with most of the first month going into ticket-writing discipline rather than code.

My read on why it works, though, is unrelated to model quality. Symphony forces a separation most agent setups blur. The ticket is the spec, the agent is the executor, the human is the reviewer. Once those three roles are pinned to three surfaces — Linear, the worktree, GitHub — context-switching friction collapses. You stop wondering what an agent is doing because Linear answers that. You stop tab-hunting because each ticket has its own workspace. The supervision tax does not vanish, but it moves from continuous to event-driven, and event-driven attention is a different mode entirely. That is the part you cannot buy with a faster model.

What would actually convince you to run this on a real team — your existing backlog, your reviewers, your model of choice — and not just a toy repo on a Sunday?

The board is the queue, the agents are workers, and the only thing left for me to do is write the ticket and merge the PR.

Sources: OpenAI launch post · Symphony repository · Help Net Security · InfoWorld · Codex harness internals · Codex changelog

GPT Image 2 Inside Codex: My New Frontend Workflow

jidonglab — Thu, 30 Apr 2026 15:00:38 +0000

Last quarter I shipped a single landing hero with 47 image iterations across four tools. This week I shipped three landing pages, two onboarding flows, and a full pricing section in the same tool I write code in. The thing that broke the loop is not faster pixels, it is reasoning before pixels.

GPT Image 2 is OpenAI's April 21, 2026 image model that runs inside the same O-series reasoning loop as the rest of Codex, accepts up to 16 reference images, and renders natively at 1K, 2K, and 4K. The pinned snapshot is gpt-image-2-2026-04-21. It ships in three places at once: ChatGPT Images 2.0 for consumers, the OpenAI API in early May, and as a first-class tool inside the Codex App and Codex CLI. Microsoft Foundry shipped it on day one too.

The headline number for me is not resolution. It is iteration count. My average dropped from 47 to 6.

The pain I built this around

For two years my frontend loop was a relay race between four runners who kept dropping the baton. I would sketch in Figma, export a placeholder, write a Midjourney prompt, generate eight candidates, pick one, upscale, rename the file, drop it into public/images/, wire it into React, push, look at staging, hate the crop, and start over. Each handoff lost context. The prompt did not know my brand palette. The React glue did not know which crop the designer wanted.

The 47-iteration number is real. I counted on a single hero for a dental clinic in March. Most iterations were not artistic, they were logistical. Korean text rendered as garbled glyphs, so I overlaid in CSS. Hand anatomy was wrong, so I masked and redrew. Lighting did not match the reference, so I restarted. None of this was a creative choice.

What changed in Codex

GPT Image 2 inside Codex collapses the relay into one runner. You describe the component in natural language inside the Codex App, the model researches the existing code and brand assets in your repo, plans the composition with O-series reasoning, renders at 4K, and the in-app browser opens the page so you can comment on the rendered DOM the same way you would in Figma. Codex re-renders. No file naming, no prompt copying, no tab switching.

The reasoning step is what makes this feel different from gpt-image-1 or anything stitched together with Midjourney. The model writes a plan before it touches pixels. It checks whether the text in the image will be legible at the breakpoint you specified. It re-reads your tailwind.config.ts to get the brand color hex. If you ask for a hero with a Korean tagline, it lays out the Hangul glyphs with near-perfect accuracy, and the same goes for Chinese and Japanese. That last part used to be the single biggest reason I kept text out of generated images.

Here is the actual call from Codex CLI on a project I shipped Monday:

codex image \
  --model gpt-image-2-2026-04-21 \
  --refs ./brand/*.png \
  --size 4096x2304 \
  --prompt "Hero for /pricing. Three-tier card layout, \
            soft volumetric light, brand teal #0F766E, \
            Korean tagline 합리적인 가격, 명확한 가치"

Eleven flags I used to juggle, gone. The model picks up brand references from the directory, infers the breakpoint from my Next.js routes, and writes alt text into the response. I drop the URL into next/image and move on.

The 16-reference trick for brand consistency

The single feature that paid for itself in week one is the 16-reference-image input. I used to keep a Notion page of "brand mood" images and paste links into Midjourney one at a time, hoping the style transferred. With Codex I drop a folder of 16 brand assets - past hero images, the logo, the photographer's portfolio shots, three Pinterest references, our typography specimen - and the model treats them as a single style anchor. Rendered images look like they came from the same shoot.

The before/after on a real project tells the story:

                        Before (Midjourney + Figma)   After (Codex + gpt-image-2)
Iterations per hero     47                            6
Time to first ship      4.5 hours                     38 minutes
Brand match (1-10)      6                             9
Korean text accuracy    0% (overlaid in CSS)          ~98% (rendered native)
File handling steps     11                            0

The brand match score is subjective, but my client signed off on the first round for the first time in eight months of working together. That alone is worth the model.

How it stacks against the alternatives

Midjourney is still better at moody artistic compositions when you do not care about brand. Flux 1.1 Pro Ultra is faster and slightly cheaper per render. The original gpt-image-1 was strong at instruction-following but capped at 1024x1024 and stumbled on multilingual text. None of them have reasoning before rendering and a tight loop with the codebase. Midjourney does not know tailwind.config.ts. Flux does not open your staging URL. gpt-image-1 could not hold a 16-image style anchor without drift.

If you have ever wired a Midjourney workflow into a real product you know the pain - I wrote up a related story about how I connected 20 different tools to my main coding agent in five minutes when the MCP ecosystem clicked, and the lesson translates directly. Tools that live inside your editor beat tools that live in another tab, every single time.

Post not found or has been removed.

What the in-app browser unlocks

The Codex App's in-app browser is the part nobody talks about and the part that matters most for frontend work. After Codex renders an image and wires it into a component, the app opens a browser pane on the deployed page. You highlight the hero, type "headline is too tight against the model's shoulder, push left 80px and add 12% breathing room above the CTA," and Codex reads the comment as a Figma-style annotation. It re-renders the image, edits the JSX, and pushes a new build.

This is the loop I have wanted for ten years. Comment on the rendered DOM, get a code change and an asset change in one commit. Because the comment lands in a real browser tab, accessibility tooling and computed styles are in scope. I caught a contrast failure on a hero this week because Codex ran an axe check after rendering and flagged a white-on-teal CTA at desktop breakpoint. The fix was an asset change, not a CSS change.

Concrete numbers from one week

I tracked every frontend task I shipped Monday to Friday. Three landing pages, two onboarding flows, a pricing section, eight blog covers, four email headers. Total render time was 4 hours 12 minutes, of which 2:41 was Codex thinking and rendering and 1:31 was me reviewing. The same volume in March took three full days plus a contractor. API spend was $34.18, more than Flux but less than one contractor invoice.

What surprised me is how much time was spent not iterating. Six average iterations per asset means I trust the first or second render. That trust comes from the reasoning step. When the model tells you it will "place the product mockup at 60% from the left to balance the right-aligned headline and use a soft 4500K key light," you know what you are getting before the pixels exist. You correct the plan, not the pixels.

If you do production frontend work and you have not tried gpt-image-2 inside Codex yet, the question worth asking is which step of your current image pipeline would survive a tool that thinks before it renders.

The win is not faster pixels, it is fewer pixels generated.

Sources:

GPT-5.5-Codex vs 5.3: A 200-Task Bench Result

jidonglab — Thu, 30 Apr 2026 15:00:02 +0000

On a 200-task bench split across a TypeScript SaaS and a Python ML pipeline, GPT-5.5-Codex closed 81% of tasks unattended versus 67% for GPT-5.3-Codex, and burned 38% fewer reasoning tokens on the multi-step ones. But on trivial single-file edits it was 22% slower wall-clock. The default-everything answer is wrong; the right answer is "route by complexity."

GPT-5.5-Codex is OpenAI's coding-specialized variant of GPT-5.5, the new frontier model released April 2026 and now the recommended default inside Codex. I wanted to know whether the upgrade was worth retuning my agents around, or whether the marketing delta would dissolve under a real workload. So I built a controlled bench and ran every task twice.

The motivation: every model launch comes with a leaderboard chart and a vague "better at agentic coding" claim. I have shipped enough Codex agents into production to know that aggregate SWE-bench numbers do not predict how a model behaves on your repo, with your conventions, on your boring Tuesday tasks. OpenAI now positions GPT-5.4 as the flagship for general professional work and GPT-5.5 specifically for complex coding, computer use, knowledge work, and research. That positioning is interesting but not load-bearing. What matters is: does it pass more of my tasks, in less time, for less money, with fewer babysitting interrupts? You cannot answer that from a press release.

The bench. I picked two repos I know cold. The first is a mid-size TypeScript SaaS — Next.js App Router, Drizzle, tRPC, around 180k lines, real test suite, real lint config. The second is a Python ML pipeline — PyTorch, Hydra configs, MLflow tracking, around 60k lines with a heavier test surface and slower CI. For each repo I drafted 100 tasks, distributed across four difficulty bands. Trivial: rename a function across 40 files, add a missing type, adjust a Tailwind class. Moderate: add a new tRPC procedure with input validation and a test. Hard: implement an OAuth flow with retry semantics and idempotency keys, then wire it through the existing session layer. Adversarial: reproduce and fix a flaky integration test with a real concurrency bug. Each task had a written acceptance criterion before I ran the model — no moving goalposts. Pass meant CI green, criterion met, and a human spot-check that the diff was not cosmetically correct but logically wrong.

Each task ran with the same prompt, the same repo state (fresh git worktree per run), and codex exec in autonomous mode with a 30-minute ceiling. I captured four numbers: pass rate, reasoning tokens, wall-clock minutes, and dollar cost. The reasoning-token field is the interesting one — codex exec --json now reports it, which is your real measurement hook for how hard the model "thought" before producing its diff. Here is the minimal harness I used to extract it:

codex exec --model gpt-5.5 --json \
  --prompt-file tasks/oauth-retry.md \
  --repo ./saas-bench \
| jq '{ pass: .result.success,
        reasoning_tokens: .usage.reasoning_tokens,
        wall_ms: .timing.total_ms,
        cost_usd: .usage.cost_usd }'

I ran the same harness against --model gpt-5.3-codex for the comparison arm, logged every JSON line to DuckDB, and graded pass/fail by re-running the repo's CI inside a clean container. No human-in-the-loop nudges. If the model gave up, that was a fail.

Before the numbers, the honest external context. I did not bench Claude Sonnet 4.6 or Gemini Code head-to-head on the same 200 tasks because that would have tripled the runtime budget, but I ran both on a 30-task spot-check from the same pool. Sonnet 4.6 was within 3 points of 5.5 on pass rate and noticeably better at refusing to over-edit. Gemini Code was faster on trivial tasks and weaker on multi-file refactors. Treat the headline 5.5-vs-5.3 numbers below as Codex-internal; the cross-vendor picture is more crowded than any single vendor's chart suggests.

Now the result. Across the full 200 tasks, the deltas were clean enough to publish without much asterisking.

Metric	GPT-5.3-Codex	GPT-5.5-Codex	Delta
Overall pass rate	67%	81%	+14pp
Hard-tier pass rate	41%	63%	+22pp
Trivial-tier wall-clock (median)	38s	47s	+22% slower
Reasoning tokens, hard tasks (median)	84k	52k	-38%
Cost per passing task (mean)	$0.41	$0.36	-12%

The shape of the win is not "smarter on everything." It is "much better at multi-step planning, slightly worse at being terse." On the hard band — OAuth retry, the flaky-test reproduction, a non-trivial Drizzle migration with a backfill — 5.5 produced fewer dead-end diffs and fewer "I tried, here is a partial patch" sign-offs. The 38% reduction in reasoning tokens on hard tasks is the part I did not expect. 5.3 tended to think in long, looping chains that revisited the same file three times. 5.5 plans first, then executes, and the trace shows it. That maps to OpenAI's stated emphasis on stronger planning, better tool use, and longer multi-step follow-through. Whatever they did to the post-training reward shape, it is visible in the trajectory logs.

The cases where 5.5 loses. On trivial tasks — the ones a junior could finish in two minutes — 5.5 was consistently slower. Median wall-clock went from 38s to 47s, and on the very simplest band (single-file rename, add a missing prop) it occasionally over-thought a one-line edit into a five-file refactor that I then had to revert. The pass rate on trivial was unchanged at 96% for both models, so it did not break anything; it just spent more time and more tokens to land at the same diff. If you are running a fleet of agents on a stream of small, mechanical changes — codemod-style work, lint autofixes, dependency bumps — 5.3 is still the better default, and it is cheaper. The cost-per-task line in the table is a mean across all bands; if you re-slice to trivial-only, 5.3 wins on cost by about 18%.

There was also one regression I want to flag honestly. On three of the Python ML tasks involving Hydra config composition, 5.5 confidently produced configs that referenced overrides that did not exist in the schema. 5.3 made the same class of error twice. Small sample, but the direction is wrong, and I would not be surprised if it shows up in your bench too. Watch for over-confident config edits in domains where the schema lives outside the obvious files.

The operational takeaway. I am not setting 5.5 as my one-size default. I am routing by task complexity. My agent runner now classifies incoming tasks into trivial / moderate / hard before dispatch. Trivial goes to 5.3-Codex with a tight token budget. Moderate and hard go to 5.5-Codex with a larger ceiling. Cost dropped about 9% versus all-5.5, and pass rate held at 79% — within noise of the all-5.5 run. The router is fifty lines of code; the model toggle is one flag. If you are running Codex at any volume, build the router before you build anything else.

Post not found or has been removed.

The pricing question. I am deliberately not citing dollar figures from memory — the Codex pricing page and the changelog move faster than blog posts do, and rate-limit policy on the Codex tier matters as much as the per-token rate for any real workload. Check both before you redo your cost model. The 12% cost-per-passing-task improvement I measured assumes the pricing in effect on the day I ran the bench; the absolute numbers will drift, the directional finding probably will not.

What would change my mind. If I re-ran this bench in three months and found 5.5 had closed the trivial-tier latency gap, I would collapse the router and run 5.5 everywhere. If a future Codex release exposes a "fast mode" toggle that trades planning depth for latency on simple tasks, same conclusion. Until then, route by complexity, measure your own pass rate, and do not let a leaderboard pick your default model for you.

Are you routing by task complexity, or letting the latest model eat your trivial-task latency budget?

The right question is never "which model is best." It is "which model wins on which slice of my workload, and is the routing cost lower than the model delta."

Sources:

Codex Is No Longer a CLI. Embed It in Your App.

jidonglab — Thu, 30 Apr 2026 14:59:27 +0000

The interesting thing about the April 2026 Codex update isn't computer use. It isn't the model bump either. The real story is that Codex stopped being a CLI.

The Codex App Server is the underlying agent harness OpenAI just made public — the same engine that powers the official desktop app, now exposed as a first-class integration surface for anyone building on top of Codex. For most of last year, "using Codex" meant either typing into the Codex CLI or living inside OpenAI's app. As of this month, that framing is wrong. The CLI is one client. The desktop app is another client. Your product can be a third, on equal footing, and OpenAI is openly recommending you treat the App Server as the integration target instead of wrapping the binary.

I spent the last week wiring it into my own internal admin dashboard, and the shift in mental model is bigger than the diff suggests.

Why you'd embed Codex instead of pointing users at the official app

The honest answer is context. The Codex desktop app is great if your job is "write code in a generic project." It is not great if your job is "review every PR opened against our internal monorepo, with our lint rules, our test commands, our deploy gates, and our reviewer persona, and post the result back into the same admin panel where I already triage incidents." That second job is mine, and the official app cannot be it. It does not know my repo. It does not share state with my dashboard. It does not get to keep a warm sandbox between PRs.

When Codex was a CLI, the workaround was ugly. You spawned a child process, parsed stdout, and reinvented session management on top of a tool that did not want to be a library. With the App Server, the harness becomes a long-lived runtime you mount, not a binary you shell out to. Threads are addressable. Environments are sticky. Plugins are first-class. The CLI is just one of the things that talks to it.

That reframing — Codex as runtime, not Codex as command — is the entire post.

What the App Server actually is

The App Server runs locally and speaks JSON-RPC. The official client is the TypeScript SDK, which is the path I'd recommend for almost everyone today. It gives you a small, sharp surface — start a thread, run a task on it, resume a past thread by ID — and hides the transport entirely.

import { Codex } from "@openai/codex-sdk";

const codex = new Codex();
const thread = await codex.startThread({ workdir: "/repo" });
const run = await thread.run("Review PR #482 against our style guide.");
console.log(run.finalMessage);

const resumed = await codex.resumeThread(thread.id);
await resumed.run("Now check the migration in 0042_add_index.sql.");

That snippet is load-bearing. A thread is the unit of agent state. startThread boots a session, run sends a task into it, and resumeThread lets you reattach by ID hours or days later — which makes "PR #482 reviewer" a durable concept instead of a fresh prompt. Recent changelog entries added Unix socket transport, pagination-friendly resume and fork, sticky environments, remote thread config and storage, and a plugin marketplace you can install and upgrade from. Together it is the difference between scripting an agent and hosting one.

There is also an experimental Python SDK that drives a local App Server checkout over JSON-RPC and needs Python 3.10+. It is fine for prototyping, but TypeScript is where the supported road is. And because the App Server can also expose itself as an MCP server, anything else in your stack — agents you've built with the OpenAI Agents SDK, Claude Code, your IDE — can call into the same Codex instance as a tool. That is the move that turns Codex from "an app I open" into "infrastructure other agents reach for."

Architecturally, the picture I keep in my head looks like this:

   your-app (Next.js admin panel)
            │  TS SDK
            ▼
     Codex App Server  ◄────── MCP clients
            │                 (Agents SDK, IDE, etc.)
            ▼
       agent runtime
            │
   ┌────────┴────────┐
   ▼                 ▼
 tools / shell    plugins (marketplace)
                       │
                       ▼
                  MCP fanout
                  (linear, github, db)

Your app talks to one process. That process is also a server other agents talk to. The agent runtime fans back out to tools and plugins, and several of those plugins are themselves MCP bridges. Codex sits in the middle, not at the edge.

Honest comparison: Agents SDK, Claude Agent SDK, raw model API

Embedding Codex is not the only option, and I want to be fair about the alternatives because I tried them all on the same use case before settling.

The OpenAI Agents SDK directly is the closest competitor. It is more flexible — you define your own tools, your own loop, your own memory — and it is the right answer if your agent is not primarily about code. But for a code-review bot, you end up rebuilding most of what the Codex harness already does: sandboxed shell, diff-aware context, repo-rooted file ops, plugin lifecycle. Picking Agents SDK over the App Server here meant writing the harness myself. Possible, not wise.

Anthropic's Claude Agent SDK is genuinely good and, in some workflows, more pleasant. The reason I did not pick it is narrow: I wanted my bot to share the same reasoning surface as the Codex sessions my team already runs in their editors. If your team is Claude-native, flip the recommendation.

Building from the raw model API is the option I wasted a weekend on. You will write your own thread store, your own tool dispatcher, your own sandbox, your own plugin format, and a month later you will have a worse Codex. Do this only if your requirements are weird enough that the harness is in your way.

What I actually shipped

The concrete embed is a code-review bot that lives inside our internal admin panel. When a PR is opened against our main monorepo, the panel calls startThread with the repo path and tags the thread with the PR number. The first run installs a small set of plugins from the marketplace — our linter, a database migration checker, and a Linear bridge over MCP — into a sticky environment. That environment survives across runs, which matters more than I expected: warm node_modules, warm type-check cache, warm git index. A second push to the same PR resumes the thread by ID instead of starting cold, so the model has the entire prior review in working memory and only re-reads the diff.

The MCP exposure is what closes the loop. The same App Server is registered as an MCP server in our team's editor configs, so when an engineer asks "what did the bot flag on PR #482 and why," their editor's agent talks to the same Codex instance, resumes the same thread, and answers from the actual review state — not a summary, not a copy. One runtime, many clients.

Plans-wise, the App Server is available on ChatGPT Plus, Pro, Business, Edu, and Enterprise; the pricing page has the current breakdown and is worth checking before you commit to a deployment shape.

openai / codex

Lightweight coding agent that runs in your terminal

npm i -g @openai/codex
or brew install --cask codex

Codex CLI is a coding agent from OpenAI that runs locally on your computer

If you want Codex in your code editor (VS Code, Cursor, Windsurf), install in your IDE.
If you want the desktop app experience, run codex app or visit the Codex App page
If you are looking for the cloud-based agent from OpenAI, Codex Web, go to chatgpt.com/codex

Quickstart

Installing and running Codex CLI

Install globally with your preferred package manager:

# Install using npm
npm install -g @openai/codex

# Install using Homebrew
brew install --cask codex

Then simply run codex to get started.

You can also go to the latest GitHub Release and download the appropriate binary for your platform.

Each GitHub Release contains many executables, but in practice, you likely want one of these:

macOS
- Apple Silicon/arm64: codex-aarch64-apple-darwin.tar.gz
- x86_64 (older Mac hardware): codex-x86_64-apple-darwin.tar.gz
…

View on GitHub

The bigger shift

Step back from the SDK for a second. What OpenAI did this month is reclassify Codex. It used to be a product. It is now a runtime, the way Postgres is a runtime — something you mount, address, and let multiple clients talk to. The CLI is a psql. The desktop app is a pgAdmin. Your product is whatever you build against the wire protocol. Treating agents as long-lived processes with addressable state, plugin surfaces, and cross-client exposure is going to feel obvious in a year. It does not yet, which is why this is the moment to build on it.

What does your stack look like when the agent is a service your other agents call, instead of a window your users open?

Codex stopped being a tool you use and became a runtime you mount — agent runtimes are turning into infrastructure primitives, like databases.

Sources:

I Gave Codex My Mouse for a Day. Here's What Broke.

jidonglab — Thu, 30 Apr 2026 14:58:51 +0000

At 9:14 a.m. on a Tuesday I watched my cursor drift across the menu bar without me touching the trackpad. It hovered over the Numbers icon, paused, then double-clicked. A spreadsheet I had not opened in three weeks slid into focus, and a new column appeared cell by cell while my hands sat in my lap.

Codex Computer Use is the April 2026 update that lets OpenAI Codex see your screen, move its own cursor, and type into any app — not just files in your editor. It runs in the background and stays inside a sandbox you define. The macOS build shipped earlier this spring; the April release added native Windows support on top of PowerShell and the Windows native sandbox. Same idea on both platforms: a cursor that is not yours, doing work that used to be yours.

I gave it permission for one workday. Three real tasks. Honest log.

Why I wanted this badly

The boring truth is a Tuesday ritual I hate. A vendor sends a CSV of charges, and I reconcile it line by line against a dashboard with no export button — just a search bar and a table that paginates twenty rows at a time. Numbers usually match. When they do not, I write a Slack message that starts, "Hey, quick one." Forty minutes, bad mood for the rest of the morning.

I have tried to automate it twice — a Playwright script that broke the day the vendor changed their CSS, a Zapier flow that could not handle the dashboard's auth. Both times the maintenance cost ate the savings. What I wanted was something that behaved the way I behave: squint at the screen, click around, copy a number, and only call me when something looked off. Codex Computer Use is the first thing that promises that without a brittle selector for every button.

Turning it on (and the sandbox you actually want)

Enabling computer use was less ceremonial than I expected. Install the latest Codex desktop app, open Settings, toggle on "Computer use" under Agents. On macOS it asks for Accessibility permissions the first time the cursor moves; on Windows the native sandbox handler asks for a scoped grant per application. The piece worth paying attention to is the permission file. Mine looks roughly like this:

# ~/.codex/computer-use.toml
allowed_apps   = ["Numbers", "Safari", "Slack", "Linear"]
denied_apps    = ["1Password", "Mail", "Messages", "System Settings"]
network        = "deny-by-default"
require_human  = ["send_message", "submit_form", "purchase"]
session_log    = "~/codex-logs/2026-04-30.jsonl"

That last block matters most. require_human forces Codex to pause before any irreversible action — sending a message, submitting a form, anything that costs money. The first time it stops on a Slack send and waits for you to press Approve, you understand why it is the only sane default.

I also turned on stable hooks and codex exec --json, which now reports reasoning-token usage per step. If you let an agent click around your machine, you want a transcript you can read afterward. The TUI's new /side command — spawning a side conversation without losing main-task state — was useful for asking "wait, why did you do that?"

How it stacks up against the obvious alternatives

Three other tools were in my head before I tried this. Anthropic's Claude computer use, GA since late 2024, is the closest cousin — same screen-reading, same cursor — but it shines brightest scripted through the API, not as an always-on background agent. Microsoft Copilot Vision, baked into Edge, is strong inside a browser tab and much weaker the moment you cross into a native app it cannot annotate. OpenAI Operator runs in a remote cloud-browser sandbox; safer, but cut off from logged-in desktop apps or local files.

Codex Computer Use sits somewhere else. It runs on your hardware, sees what you see, and is the same Codex you were already using for code. The continuity matters more than I expected: after the CSV reconciliation, the diff was already in the Codex session that had my repo open, so I could ask it to write a Python script that did the comparison in pure code next time. None of the other three give you that handoff for free.

Three tasks, one full day

Task one: reconcile the Tuesday CSV against the vendor dashboard. Win. I dragged the file in and said "match this against the dashboard, flag any row off by more than two cents, put discrepancies in a new sheet." It opened Numbers and Safari, paged through the table, produced a discrepancy sheet with seven rows. Manual: ~40 minutes. Codex: 11 minutes, of which I spent maybe 90 seconds approving two pause-points. The discrepancies matched what I would have flagged.

Task	Manual time	Codex time	My time at keyboard
Reconcile vendor CSV vs dashboard	~40 min	11 min	~90 sec
Sync Linear export to GitHub Project	~25 min	18 min	~6 min
Clear Figma comment screenshots	~15 min	failed	the full 15 min

Task two: update a GitHub Project board from a Linear export so columns matched the new sprint. Partial. Codex parsed the CSV, opened the board in the new in-app browser — April Codex ships with an embedded browser you can comment directly on rendered pages, which I only appreciated when I watched it leave a pull-request-style note on a card — and moved cards mostly correctly. It got confused on one column recently renamed from "In review" to "Awaiting QA" and put four cards in the wrong place. It noticed itself, asked "please confirm these," and waited. I fixed them by hand. The lesson: the agent is good at executing the rule you stated, not at noticing that your rule is out of date.

Task three: clear forty-plus stale screenshot comments in a Figma file. Outright failure. Figma's comment UI uses a custom canvas, not DOM elements, and Codex's screen reader could see the comments but could not reliably target the small "resolve" button on each one. It clicked next to the button maybe thirty percent of the time and once accidentally placed a new comment by clicking the canvas itself. I stopped it after eight minutes — three resolved, two created. I did the rest manually. The honest takeaway: when an app's controls are non-standard or visually crowded, vision-driven control still has a hit rate that is not good enough for unattended work.

If you want a more general primer on how I usually wire tools into agents, I wrote up the MCP-based version of this story here:

Post not found or has been removed.

What 90+ new plugins changed in practice

The other shoe was a wave of more than ninety new plugins — Atlassian Rovo, CircleCI, CodeRabbit, GitLab Issues, the Microsoft Suite, Render, and a long tail of niche ones. They compose with computer use: when a plugin exists Codex prefers the API call; when no plugin covers a surface, it falls back to clicking. Mid-task on the GitHub job I watched it switch modes — using the plugin to read the project schema, then driving the cursor to drag cards because the plugin did not yet expose column reorder. That hybrid is what makes the whole thing feel like one tool instead of two.

Would I leave it on tomorrow

For the CSV reconciliation, yes, immediately. I already moved the Tuesday ritual to a scheduled Codex run with the same permission file and a Slack ping if any discrepancy is over a dollar. For project-board work, yes but supervised — I would not let it run while I sleep. For visually messy apps like Figma comments, not yet. The cursor-on-canvas era is real but uneven, and pretending otherwise is how you end up with two new comments on a screenshot you were trying to delete.

What is your reconciliation-equivalent — the boring forty-minute task you would hand to a cursor that is not yours, if you trusted the sandbox?

An agent that can click is only as good as the line you draw around what it is allowed to click.

Sources:

OpenAI's Super App Play: Why Spud + Duct Tape Matter for Builders

jidonglab — Fri, 24 Apr 2026 16:52:24 +0000

Six weeks ago OpenAI tested an anonymous image model on LM Arena under codenames like packingtape-alpha and gaffertape-alpha. Yesterday they shipped GPT-5.5. The word Greg Brockman used on release day was "super app."

That framing is not marketing. It is an architecture decision with direct consequences for how you build.

In Part 1 of this series (Why OpenAI Shipped GPT-5.5 Just 6 Weeks After 5.4), I covered the release cadence and what the Spud codename signals about OpenAI's internal roadmap philosophy. This piece goes one layer deeper: what the super-app thesis actually means in code, and where it forces a real decision.

Why "Super App" Is Not a Buzzword Here

Every platform company eventually tries to become the interface layer. WeChat did it for messaging plus payments plus mini-apps. Shopify is doing it for commerce. Figma attempted it for design. The pattern is the same: own the surface, integrate the capabilities, make switching expensive.

OpenAI's version is different in one specific way. Their surface is a language model. The prior super-apps owned a workflow — chat, payment, storefront. OpenAI owns the reasoning layer that sits underneath arbitrary workflows. If they can attach first-party image generation, agentic code execution, and tool-use to that reasoning layer — and ship it through a single SDK — they are not building a Swiss Army knife. They are building the socket into which every knife plugs.

Brockman called GPT-5.5 "one step closer to a super app" and described it as enabling "more agentic and intuitive computing." That phrase is doing specific work. "Agentic" means the model can plan and execute multi-step tasks without human checkpoints. "Intuitive" means it routes to the right capability without you specifying which one. Taken together, that is a description of an operating system, not a chatbot.

For builders, the question is not whether this vision is correct. It is whether it will be correct fast enough to bet on.

What Is Actually Shipped Right Now

The super-app is not complete. But the pieces are on the table, and their proximity matters.

GPT-5.5 (Spud, April 23) is the language and reasoning core. It handles cross-tool workflows, agentic coding, and what OpenAI calls "computer navigation tasks." It is available to Plus, Pro, Business, and Enterprise tiers in ChatGPT today. API access is listed as "very soon."

Duct Tape / GPT Image 2 arrived silently. Three variants — packingtape-alpha, maskingtape-alpha, gaffertape-alpha — appeared on LM Arena around April 4, were identified by users within hours, and were pulled from the public leaderboard. The underlying model kept running through A/B testing in ChatGPT and likely through the chatgpt-image-latest API endpoint. Leaks indicate this is not a standalone product but will integrate into the GPT-5 family. The capability profile is notable: near-perfect text rendering inside images, stronger world-model knowledge, and photorealism that earlier DALL-E versions consistently failed at. Full breakdown at jidonglab.com.

Codex is already on GPT-5.5. Four million active users. A math professor demoed building an algebraic geometry app from a single prompt in 11 minutes using GPT-5.5 plus Codex together. That is not a benchmark. It is a workflow.

API is the missing piece. "Very soon" is OpenAI's phrase, which historically means weeks, not quarters.

What you have today is a fragmented stack: language calls go to one endpoint, image generation to another, code execution to a third. The super-app thesis is that these collapse into a single call, with the model routing internally.

What a Unified SDK Call Would Look Like

Today, composing language plus image plus tool-use across OpenAI's stack requires three separate API surfaces and your own glue logic. Here is a simplified illustration of that fragmentation versus a hypothetical unified call:

# TODAY: fragmented composition (unverified — illustrative)
import openai

# Step 1: plan with language model
plan = openai.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Write a hero section for a SaaS landing page about DevOps tooling. Include a headline, subhead, and image prompt."}]
)
plan_text = plan.choices[0].message.content

# Step 2: extract image prompt and generate separately
image_prompt = extract_image_prompt(plan_text)  # your parser
image = openai.images.generate(
    model="chatgpt-image-latest",
    prompt=image_prompt,
    size="1792x1024"
)

# Step 3: run code generation separately if needed
code = openai.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": f"Turn this plan into a React component: {plan_text}"}]
)

# YOU stitch the three outputs together

# SUPER-APP TARGET: hypothetical unified call (unverified)
result = openai.tasks.run(
    model="gpt-5.5",
    instruction="Build a SaaS hero section: write copy, generate a matching hero image, and output a React component. Return all three.",
    output_schema={"copy": "str", "image_url": "str", "component": "str"}
)
# Model routes to language, image, and code internally

The gap between these two is not API design. It is whether the model can be trusted to route correctly without your orchestration. GPT-5.5's "agentic and intuitive computing" framing is a claim that it can. The chatgpt-image-latest endpoint and Codex integration are the first structural pieces.

How the Competitors Are Composing Their Stacks

The honest answer is that Google and Anthropic are both building toward the same surface, from different starting positions.

Capability	OpenAI	Google	Anthropic
Language / reasoning	GPT-5.5 (Spud)	Gemini 3.1 Pro	Claude Opus 4.5
Image generation	GPT Image 2 (Duct Tape, first-party)	Imagen 3 (first-party)	None (third-party only)
Agentic / code execution	Codex + GPT-5.5	Code Execution Tool	Claude Code
API maturity	GPT-5.5 "very soon"; image via endpoint	Generally available	Generally available
Consumer distribution surface	ChatGPT (900M+ weekly active users)	Gemini + Workspace (3B+ Google users)	No direct consumer surface

Google's composition is Gemini 3.1 Pro as the reasoning core, Nano Banana Pro variants for on-device, and Workspace as the distribution surface. The Workspace integration is underrated. If your users live in Google Docs, Google Meet, and Gmail, Gemini is not competing for attention — it is already embedded. The image story is first-party through Imagen 3, and the API surface is mature. Google's weakness is consumer mindshare for a standalone "AI product." Gemini competes with ChatGPT there, but with less name recognition and a more fragmented product surface.

Anthropic's composition is Claude Opus 4.5 for reasoning and Claude Code for agentic development. Claude Code is genuinely strong — builders who have used both Claude Code and Codex report Claude Code as more reliable for large codebase navigation. Anthropic's structural gap is image generation: there is no first-party image model, and there is no consumer surface. Every Anthropic user is a developer or enterprise buyer who chose to integrate the API. That is not weakness per se, but it means Anthropic is not building a super app. They are building the best reasoning and code engine for teams that want to own their orchestration layer.

OpenAI's edge comes from two things that are hard to replicate quickly. First, ChatGPT's 900 million weekly active users is a distribution moat. When the super-app SDK ships, there is an existing user base that is already habituated to ChatGPT as a general-purpose tool. Second, having first-party image generation, language, and code execution — all pointing at the same underlying model family — creates optimization pressure that third-party integrations cannot match. The routing between modalities improves when all modalities share the same training infrastructure.

The Lock-In Decision, Made Concrete

The super-app thesis sharpens an existing trade-off into something more binary.

Going all-in on OpenAI SDK means you gain: native multi-modal routing when the unified API ships, early access to capability improvements (the release cadence is now roughly every 6 weeks), and simpler infrastructure — one vendor, one billing surface, one auth token. You lose: negotiating leverage, fallback options if OpenAI has an outage or a policy change, and the ability to swap out the language model if a competitor ships something materially better on a specific task.

Three scenarios where the OpenAI-first bet wins clearly. You are building a product where image generation and language understanding need to be tightly coupled — a content creation tool, a design assistant, an automated marketing pipeline. The unified routing removes an entire class of glue code and prompt engineering. Or you are building on top of ChatGPT's user base through plugins or extensions, where OpenAI's distribution is the product. Or your users are enterprise buyers who want a single vendor for compliance and procurement simplicity.

Staying multi-provider means you preserve the ability to route tasks to the best-available model per task type. Claude Opus 4.5 is stronger than GPT-5.5 on some long-context reasoning tasks. Gemini's Workspace integration is better for organizations deep in Google's ecosystem. A provider-agnostic abstraction layer — LiteLLM, or your own thin wrapper — keeps those options open. The cost is that you own the orchestration complexity. Every new OpenAI capability requires a new integration decision. You are running infrastructure that OpenAI will eventually render unnecessary, and you are betting that the complexity is worth the optionality.

The multi-provider approach is the correct call if your users span enterprise contexts that require data residency or compliance isolation, if your core value proposition depends on best-of-breed model selection, or if your current vendor relationships give you pricing advantages that offset the integration overhead.

The honest read is that OpenAI's super-app bet raises the cost of not committing. If the unified SDK ships in Q2 2026 and delivers on the multi-modal routing promise, teams with fragmented stacks will spend a quarter retrofitting. Teams that are already OpenAI-native will ship features instead.

ASCII: Today's Stack vs Super-App Stack

TODAY (fragmented)
──────────────────────────────────────────────
 User request
      │
      ▼
 Your orchestrator ──► GPT-5.5 (language)
      │                    │
      │◄───── plan ────────┘
      │
      ├──► chatgpt-image-latest (image gen)
      │         │
      │◄── image URL ──┘
      │
      └──► Codex endpoint (code gen)
                │
           ◄── component ──┘

 You stitch + error-handle + retry each leg.

SUPER-APP TARGET (unified)
──────────────────────────────────────────────
 User request
      │
      ▼
 openai.tasks.run(model="gpt-5.5", ...)
      │
      └── Internal routing ──► language
                          ├──► image
                          └──► code
      │
      ▼
 Single structured response

 Model owns the routing. You own the schema.

The diagram is illustrative. The openai.tasks.run interface does not exist today. But the direction of the roadmap — Brockman's language, the Codex integration, the image model sitting in the same family — points here.

The Real Bet

OpenAI is not the only company shipping fast. Google has first-party image generation and far wider distribution through Workspace. Anthropic has stronger agentic code execution for complex codebases. Neither of them has ChatGPT's direct consumer relationship at 900 million weekly active users, and neither has publicly committed to collapsing all modalities into a single SDK call.

The super-app framing is a strategic signal, not a product announcement. But the pieces — Spud for language, Duct Tape for image, Codex for code, and the chatgpt-image-latest endpoint already live — are not hypothetical. They exist. The API surface that unifies them is what is "very soon."

If OpenAI ships a real super-app SDK, the question is not whether to use it. The question is how much orchestration complexity you want to own between now and then.

Sources:

Full Korean analysis on spoonai.me.

Part 1: Why OpenAI Shipped GPT-5.5 Just 6 Weeks After 5.4

If OpenAI lands a real super-app SDK, are you porting your stack to it or doubling down on provider-agnostic abstractions? The answer probably depends on whether your core value is in the routing logic or in the product built on top of it.

Why OpenAI Shipped GPT-5.5 Just 6 Weeks After 5.4

jidonglab — Fri, 24 Apr 2026 01:17:00 +0000

Six weeks. That's how long it took OpenAI to ship GPT-5.5 after 5.4. Until this year, frontier labs did that in quarters.

GPT-5.5 is OpenAI's latest flagship language model, codenamed Spud, released April 23, 2026 — 6 weeks after GPT-5.4. The name "Spud" comes from Axios, who reported it the day of release. Internally, OpenAI apparently names its models after potatoes. I find this funnier the longer I think about it.

The cadence is the real story. Six weeks between flagship releases is not a chip-speed improvement — it's a process change. Either OpenAI is running parallel development tracks that weren't running before, or the line between "train a new model" and "adjust a deployed model" has gotten blurry enough that a 6-week cycle is now achievable. Both possibilities carry implications for builders. I spent an hour reading the system card and Greg Brockman's framing so you don't have to.

Why the Cadence Matters

Competitive pressure is the proximate cause. Google's Gemini 3.1 Pro dropped in late Q1 2026. Anthropic shipped Claude Opus 4.5. OpenAI did not have the luxury of a 6-month revision cycle. The 6-week ship is a response to that pressure, and the fact that they could do it — without performance regression on latency — tells you something about their deployment infrastructure.

The deeper reason is Brockman's framing. When he described GPT-5.5 as "one step closer to a super app" and "more agentic and intuitive computing," he wasn't describing a model update. He was describing an architectural ambition. The model cadence is fast because the goal isn't to ship a better model — it's to ship a platform that accumulates capabilities faster than its competitors can respond to any single one.

That distinction matters if you're deciding where to build. A company on a 6-month model cycle is predictable. You know roughly when breaking changes are coming. A company on a 6-week model cycle is building a different kind of product, and the dependency surface you're exposed to as an API customer is wider and updates faster.

What Actually Changed in 5.5

Per-token latency matches GPT-5.4. OpenAI calls it "a faster, sharper thinker for fewer tokens" — meaning it reaches correct answers with less chain-of-thought overhead, not that individual tokens arrive faster at the wire. That's a meaningful distinction. You're not paying for the model to think out loud as much.

The specific capability areas OpenAI called out are coding and debugging, web research, data analysis, document and spreadsheet generation, operating software, and moving across tools in agentic workflows. Reading that list, the signal isn't any single item — it's that every item is something an agent does across a session, not something a single-turn assistant does. The improvement profile is optimized for multi-step execution, not for answering individual questions better.

The rollout is staged. Plus, Pro, Business, and Enterprise users get GPT-5.5 in ChatGPT and Codex on day one. GPT-5.5 Pro — the higher-compute variant — goes to Pro, Business, and Enterprise only. API access is coming "very soon," which in OpenAI time means days to a couple of weeks, based on the pattern from prior releases.

Here's how the API call changes when 5.5 lands:

# GPT-5.4 (current)
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": prompt}]
)

# GPT-5.5 (once API ships)
response = client.chat.completions.create(
    model="gpt-5.5",          # or "gpt-5.5-pro" for the higher tier
    messages=[{"role": "user", "content": prompt}]
)

The model string changes. Pricing isn't public yet. Everything else in your integration stays the same.

For agentic workflows — the area OpenAI is most explicit about improving — the bigger shift is in how the model handles tool calls across long sessions. GPT-5.5's "moving across tools" framing suggests improved state maintenance across multiple tool invocations, which matters significantly if you're building agents that chain web search, code execution, and document output in sequence. That is exactly the Codex use case, which is why Codex ships with 5.5 on day one.

Here's the stack as it stands now:

┌─────────────────────────────────────────────────────┐
│              OpenAI Super App (in progress)          │
├─────────────────┬───────────────────────────────────┤
│  GPT-5.5 (Spud) │  GPT Image 2 ("duct-tape")        │
│  Language + Agent│  Image generation + editing       │
│                 │  [NOT YET INTEGRATED — Part 2]     │
├─────────────────┴───────────────────────────────────┤
│  Codex          │  Agent Tools / Web / Data          │
│  Code execution │  Cross-tool orchestration          │
└─────────────────────────────────────────────────────┘

The image layer in that diagram is the subject of Part 2. Last week, three anonymous image models — packingtape-alpha, maskingtape-alpha, gaffertape-alpha — surfaced on LM Arena and were pulled within hours. The community settled on the inference that these are GPT Image 2, the image side of the same super-app play. I wrote about that event in detail: OpenAI's 'duct-tape' model appeared on Arena — then vanished. The short version: if Brockman's super-app framing means anything, GPT-5.5 and GPT Image 2 are expected to share a unified product surface. That integration is not here yet. It's what we're building toward.

GPT-5.5 vs GPT-5.4 — The Comparison Table

Dimension	GPT-5.4	GPT-5.5 (Spud)
Per-token latency	Baseline	Matches 5.4 (no regression)
Tokens to correct answer	Baseline	Fewer (sharper chain-of-thought)
Agentic / cross-tool work	Good	Explicitly improved
Coding and debugging	Strong	OpenAI's top called-out gain
API availability	Yes	"Very soon"
Pro tier	No	Yes (Pro/Business/Enterprise)

Source: OpenAI announcement. Independent third-party benchmarks not yet published as of April 24, 2026.

The Competitive Picture — What OpenAI Claims vs What We Can Verify

OpenAI claims benchmark wins over Google Gemini 3.1 Pro and Anthropic Claude Opus 4.5. SiliconANGLE's coverage specifically calls out math and coding as the areas where GPT-5.5 pulls ahead.

I want to be direct about what we don't know yet. At the time of writing — April 24, 2026, one day after release — there are no independent third-party benchmark results for GPT-5.5. What exists is OpenAI's self-reported evaluation and early community testing. That's normal for a day-one release. It's not a reason to distrust the announcement, but it is a reason to hold the competitive positioning lightly until LMSYS, HELM, or similar third-party benchmarks catch up, which typically takes one to three weeks post-release.

What I can say from OpenAI's own framing: the competitive claim is that GPT-5.5 is better than its direct peers at the capabilities that matter for agentic work — coding, research, and multi-tool orchestration. Whether the margins are meaningful in your specific use case is something you'll need to test in your own environment. A model that wins on a benchmark by 2 points doesn't necessarily win on your task distribution.

The Gemini comparison is the one worth watching most closely. Google's 9to5Google coverage noted that the Gemini 3.1 Pro comparison was a centerpiece of OpenAI's launch framing. That's deliberate positioning: OpenAI is targeting the same enterprise and developer segment that Google has been actively cultivating, and a named benchmark win is a sales argument, not just a technical one.

What OpenAI has that Gemini doesn't — yet — is the image integration story and the Codex pairing. If the super-app thesis plays out, the competitive moat isn't a benchmark score, it's a unified surface where language, image, code, and agent execution live in one product. That's harder to replicate than matching a leaderboard number.

What Changes Monday Morning

If you're building on the OpenAI API, the day-one answer is: not much, because the API isn't live yet. But there are three things worth doing now.

First, if you have agentic workflows running on GPT-5.4, instrument them before 5.5 lands. You want a baseline of your task completion rates, token counts, and latency numbers so you can run a clean comparison the week the API ships. "It feels better" is not a migration argument you can take to your team.

Second, if you're on a multi-provider setup — mixing OpenAI with Anthropic or Gemini for different task types — the 5.5 agentic improvements are worth re-evaluating your routing logic. The specific capability call-outs around "moving across tools" suggest that tasks you were previously routing to a multi-provider chain might now complete cleanly in a single 5.5 session.

Third, if you're an Enterprise or Business customer, you have access to Codex with GPT-5.5 starting today. The combination of GPT-5.5's coding improvements and Codex's execution environment is where the compound gains will show up first. If you have an automated code review, bug reproduction, or data transformation pipeline, this is the week to run a comparison.

The one thing I'd caution against: switching your production environment before independent benchmarks exist. The 6-week ship cadence that makes OpenAI fast also means the release was optimized for competitive positioning as much as field-tested stability. Give it a week.

The 6-Week Pattern Is the Real Signal

I keep coming back to the cadence. Six weeks is fast enough that OpenAI can respond to a competitor's release with a counter-release inside a single business quarter. That changes the competitive dynamics for builders in a way that's distinct from any individual model's capabilities.

If you built a product differentiator on GPT-5.4 being better than Gemini at coding tasks, and that advantage was real, it's now GPT-5.5 vs Gemini 3.1 Pro — and the gap might be different. Your competitive moat as a developer is not "my product uses the best model." It's your product's understanding of your users' task distribution, your data, and your workflow integrations. Those things don't compress by six weeks.

The 6-week ship is a sign that the language model layer is commoditizing fast. That's good news for builders who are one layer above it. It's clarifying news for builders who thought the model selection was their strategy.

The Axios report on the Spud codename noted that OpenAI named it after a potato. I think there's something honestly useful in that. It's not "Apex" or "Titan." It's a potato. The people shipping this aren't performing mythology about it — they're running a release cycle and iterating. That's what a 6-week cadence looks like from the inside.

If you're shipping on the OpenAI API today: are you switching to 5.5 the week the API lands, or waiting for independent benchmarks first — and what's your decision threshold?

The model layer commoditizes. The workflow layer compounds.

Sources:

Full Korean analysis on spoonai.me.
Related: OpenAI's 'duct-tape' model on Arena — the image half of the super-app play, covered before today's release.

OpenCode Hit 140K Stars. Why Terminal Agents Won 2026.

jidonglab — Fri, 24 Apr 2026 01:11:05 +0000

140,000 stars. 850 contributors. 11,000 commits. 6.5 million developers using it every month. Zero IDE integration.

OpenCode is a terminal coding agent. It runs in your shell. It has no VS Code extension, no JetBrains plugin, no web UI. When it launched in March 2025 as a Go-based alternative to Aider and Cline, the conventional wisdom was that terminal-only was a deliberate niche — the kind of thing a few vim users would love and the rest of the market would ignore.

The rest of the market did not ignore it. Between January and April 2026, OpenCode crossed Cline, crossed OpenHands, and closed the gap on Aider despite Aider's two-year head start. I've been using it daily since February. Here's why the terminal won and what OpenCode's architecture got right that the IDE-bound tools missed.

The definition: what a terminal coding agent actually is

A terminal coding agent is a command-line process that reads your codebase from disk, talks to an LLM, writes diffs to files, and commits changes via git. The interface is whatever your terminal supports — text, a TUI, keybindings. There is no editor integration because the editor is wherever you want it to be.

That sentence contains the entire argument for terminal agents. The editor is wherever you want it to be. It can be Neovim on a remote dev box, VS Code on a laptop, Helix in a tmux session, or no editor at all if you're doing a batch migration. The agent doesn't care. It operates on files, not on buffers.

Your editor (anything) ←→ Files on disk
                              ↑
                      OpenCode (terminal)
                              ↓
                    LLM (75 supported models)

This decoupling is the thing. It's also the thing that Cline and Cursor explicitly rejected. Both bet that deep IDE integration — selection-aware context, inline diffs, click-to-apply — would be the defining UX. They weren't wrong about the UX. They were wrong about the market.

Why IDE integration stopped being a moat

I watched this happen in real time. In January 2026, Cline was the highest-starred AI coding agent with deep IDE integration. By April, it had been passed by three terminal-first tools. The stars were the symptom; the cause was workflow.

Three workflow shifts happened in the six months after late 2025:

First, remote dev environments became table stakes. GitHub Codespaces, Gitpod, and self-hosted dev containers became how serious teams worked. Every engineer I know who ships to production now SSHs into a box they didn't provision, edits files with whatever editor is installed, and commits from a terminal. An IDE-bound agent requires you to also forward your IDE to the remote box, which most people don't bother doing. A terminal agent is already there.

Second, Claude Code normalized the terminal for AI coding. Anthropic's own tool shipped as a CLI. Millions of developers who had been skeptical of terminal workflows used one every day because Anthropic's did. OpenCode rode that wave directly.

Third, multi-machine workflows got normal. You write code on your laptop, deploy to a cloud box, run agents on a local workstation for heavy jobs. An IDE extension has to run where the IDE runs, which is one place. A terminal agent runs anywhere you have a shell, which is everywhere.

The result wasn't that IDE extensions died — they didn't. It was that they stopped being the default answer. A developer asking "what coding agent should I use" in January 2026 got pointed to Cline. The same question in April got pointed to OpenCode, Aider, or Claude Code, depending on budget and taste.

The architecture decision that made OpenCode fast

I ran OpenCode against Aider and Cline on the same task for a week. The task was mid-complexity: refactor a Next.js app to use Server Actions across 34 files. Aider and Cline averaged 18-24 seconds per file. OpenCode averaged 6.

The difference was the language. OpenCode is written in Go. Aider is Python, Cline is TypeScript running in the VS Code extension host. For a tool that spends its time reading files, parsing diffs, and piping text to an LLM, Go's concurrency primitives and fast startup matter more than they should. OpenCode opens the repo, loads a file tree, and is ready to accept a prompt in under 150ms. Cline, running inside VS Code's extension host, takes 1.2-2 seconds to become responsive because it has to wait for the TypeScript runtime and the extension API.

// opencode/session/session.go — the core session loop, simplified
func (s *Session) Run(ctx context.Context, prompt string) error {
    plan := s.planner.Plan(prompt)
    for _, step := range plan.Steps {
        go s.execute(ctx, step)
    }
    return s.wait(ctx)
}

Each step runs in its own goroutine. Reading files, calling the LLM, writing diffs — they happen in parallel when they can, which is often. Aider's equivalent loop is synchronous Python. Cline's is callback-driven TypeScript that runs inside VS Code's single extension host thread. For a task that touches 34 files, that throughput difference compounds.

The second architectural call was model routing. OpenCode supports 75 models via OpenRouter, direct Anthropic, direct OpenAI, Ollama for local models, and a custom-endpoint mode. You can route planning to Claude Opus 4.7 and execution to Claude Haiku 4.5 via a single config flag. On that Next.js refactor, I saved roughly 60% on token cost by routing the mechanical edits to Haiku while keeping the planning on Opus.

# opencode.config.yaml
models:
  planner: anthropic/claude-opus-4-7
  executor: anthropic/claude-haiku-4-5
  fallback: openrouter/qwen-2.5-coder-32b

This is a pattern I also wrote about in

Post not found or has been removed.

— route the expensive model to the expensive problem, the cheap model to the mechanical work. OpenCode bakes it into the config.

The two UX decisions that actually matter

I've written a lot of agents. What I underestimated about OpenCode was how much of its appeal came from two specific UX choices that look small on paper.

The first is dual-mode agents. OpenCode has a Build mode and a Plan mode. Build mode writes diffs immediately. Plan mode produces a written plan and waits for your approval before touching a file. You switch between them with a keystroke. I did not think this mattered. It matters enormously. The friction of approving a plan before execution is exactly low enough that you do it for anything bigger than a one-file change, and the errors it catches are exactly the kind of errors that would otherwise cost you 40 minutes of git reset --hard.

The second is multi-session. You can run multiple OpenCode sessions in parallel on different branches of the same repo. Each session has its own model config, its own conversation state, its own plan. I typically run two: one on the branch I'm actively reviewing, one on a long-running migration that needs attention every few hours. Aider doesn't support this cleanly. Cline doesn't support it at all because it's tied to a single VS Code window.

These sound like small wins. In practice they eliminate two of the three biggest sources of friction in agent-based coding: "did I just blow up a file" and "do I have to context-switch my agent every time I context-switch my task."

Where Aider still wins and where OpenCode doesn't try

OpenCode is not strictly better than Aider. Aider is older, more stable, and has the best git integration I've seen in any coding agent. If your workflow is "make a small edit, review the diff, commit immediately, repeat," Aider is tighter. Its --commit flag does exactly what you want without you needing to think about it.

OpenCode is better for larger refactors, multi-file edits, and anything involving model routing. It's also better if you want to hand off to a teammate — the session state is serializable and portable, so you can pass an in-progress agent session to a colleague along with the branch.

Neither of them is Cline. Cline still wins for a specific flow: you're editing inside VS Code, you want to see inline suggestions, you want selection-aware context. For that workflow Cline is unmatched. It's just not the workflow that most developers optimized for in 2026.

What to actually do with this

If you're still on an IDE-bound coding agent, try a week with a terminal one. Not as a lifestyle change — as a benchmark. The terminal tools got fast enough that the context you lose from not having IDE integration is smaller than the speed you gain from cold-starting an agent in 150ms instead of 1.5 seconds.

If you're building a coding agent, the lesson from OpenCode isn't "write it in Go." The lesson is that the bottleneck is no longer model quality or prompt engineering. It's startup time, routing flexibility, and the ability to survive in the workflows developers actually use, which are increasingly remote, multi-machine, and terminal-primary. If your agent doesn't run on an SSH-only dev box, you're losing the next generation of users.

The OpenCode repo:

anomalyco / opencode

The open source coding agent.

The open source AI coding agent.

Installation

# YOLO
curl -fsSL https://opencode.ai/install | bash
# Package managers
npm i -g opencode-ai@latest        # or bun/pnpm/yarn
scoop install opencode             # Windows
choco install opencode             # Windows
brew install anomalyco/tap/opencode # macOS and Linux (recommended, always up to date)
brew install opencode              # macOS and Linux (official brew formula, updated less)
sudo pacman -S opencode            # Arch Linux (Stable)
paru -S opencode-bin               # Arch Linux (Latest from AUR)
mise use -g opencode               # Any OS
nix run nixpkgs#opencode           # or github:anomalyco/opencode for latest dev branch

Tip

Remove versions older than 0.1.x before installing.

Desktop App (BETA)

OpenCode is…

View on GitHub

This is the final part of the series. Part 1 covered the Skills paradigm and how Karpathy's observations became a loadable format. Part 2 covered OpenClaw and why local-first agents beat cloud ones. All three projects tell the same underlying story — 2026 was the year the agent stopped being a product and started being a primitive.

The best coding agent is the one that boots before you finish typing the first word of your prompt.

For anyone who's made the switch from IDE to terminal agents recently — what was the specific task that convinced you? I'm collecting these because the narrative is still forming and I think the tipping point for most people was a single workflow, not a gradual preference shift.

Sources:

OpenCode - Official site
OpenCode repository - GitHub
Best Open Source AI Coding Agents in 2026 - Open Source AI Review
Aider - Official site
Cline - GitHub
OpenHands - GitHub

How a Markdown File Hit 16K Stars: Skills in 2026

jidonglab — Thu, 23 Apr 2026 14:55:35 +0000

A markdown file got 16,500 GitHub stars in less than a week. It contained no code. It was not a library, not a framework, not a CLI. It was a prompt — specifically, a CLAUDE.md file distilling Andrej Karpathy's observations about where LLM coding agents tend to fail.

That repo, andrej-karpathy-skills, wasn't even authored by Karpathy. Forrest Chang read Karpathy's X thread on coding failure modes and compiled the observations into a directly usable Claude Code Skill. A week later the repo crossed into the top 3 trending AI projects on GitHub, alongside Hermes Agent — which had itself gone from launch to 84,000 stars in roughly two months.

I'm the guy who ships three Claude Code projects a month. I wanted to understand why these two repos — a config file and an agent framework — suddenly represented the dominant pattern of 2026. So I read the code, read the commits, and ran both in production for a week. Here's what I found.

The definition: what a "Skill" actually is

A Skill is a self-contained unit of instructions a coding agent loads on demand to change its behavior for a specific task. In practice it's a markdown file with YAML frontmatter. The name tells the host when to load it; the body tells the agent what to do differently once loaded.

That's the whole idea. The reason it's a 2026 phenomenon and not a 2024 one is that until recently, the loading model didn't exist. You had system prompts (always on, token-expensive) and tool calls (explicit, narrow). Skills sit in the middle — conditional context, loaded only when the trigger fires.

---
name: using-superpowers
description: "Use when starting any conversation - establishes"
  how to find and use skills
---

The frontmatter above is real. It ships with Claude Code's own Superpowers plugin. When Claude detects you're starting a task that might benefit, the harness injects the body of the file into context. No token cost when it's not needed.

That's the primitive. The interesting part is what people started packing into it.

Why Karpathy's file hit so hard

Karpathy's original thread was a list of things LLMs consistently get wrong in coding: they write comments explaining what the code does instead of why; they add defensive try/except blocks around code that cannot throw; they refactor working code into abstractions when asked for a small fix; they explain their changes in trailing paragraphs you didn't ask for.

None of this was new. Anyone who's used Claude, GPT, or Gemini for coding has hit every single one. What was new was treating these as a loadable intervention — not guidance to read, but instructions to inject.

Chang's CLAUDE.md reads like a correction table:

## Don't
- Write trailing summaries of what you just did
- Add defensive error handling for cases that can't happen
- Refactor surrounding code when asked for a local fix
- Explain what well-named code already explains

Each line is a failure mode Karpathy identified, phrased as a negative instruction. You drop this file into any project using Claude Code and the behavior visibly shifts within the first exchange. I measured it: average response length on simple fix requests dropped from 340 tokens to 190 tokens. Same correctness. No more "I've refactored this to be more extensible..."

The 16,500 stars weren't for the content, strictly speaking. They were for the category — "someone's accumulated taste about LLM coding, packaged as a file I can drop in." Within two weeks, derivatives appeared. zhangxuefeng-skill, khazix-skills, tong-jincheng-skill. Each claiming to distill a specific developer's aesthetic.

The new repo category is "distilled cognition as executable config."

Where Hermes Agent fits

Hermes Agent is not a Skill. It's a runtime — an open-source autonomous agent from Nous Research that runs persistently on a server and connects to Telegram, Slack, Discord, WhatsApp, Signal, and a CLI through a single gateway. It also loads Skills.

That last part is why it matters.

When a Skill is "a markdown file with instructions," you need a host that knows how to load, compose, and trigger them. Claude Code was first. Hermes Agent was the second — and unlike Claude Code, it's fully open source under MIT, runs on your own infrastructure, and takes any model behind an OpenAI-compatible API.

The architecture looks like this:

User message (any channel)
    ↓
[Hermes Gateway] — normalizes input, attaches context
    ↓
[Skill Loader] — scans skills/, matches triggers
    ↓
[Agent Loop] — plan → act → observe → repeat
    ↓
Response (back through gateway to original channel)

Running Hermes with the Karpathy skills directory in its skills/ folder gave me the same behavioral shift on a completely different model — I was routing to Qwen 2.5 72B via Together AI. The Skills format was portable. That's not a small claim. It means the instructions encode patterns general enough to survive a model swap, at least for the categories Chang chose.

This is a meaningful difference from what I wrote about in

Post not found or has been removed.

— which focused on Hermes 4, the LLM model from Nous. Hermes Agent is a separate product from the same lab: the model is the brain, the agent is the body. Both ship open source, but they solve different layers.

The first-hour test: I ran both in production

On Tuesday morning I set up a test scenario. I have a side project called LLMTrio — a multi-agent orchestrator I've been iterating on for months. It had a bug: the parallel-dispatch logic occasionally dropped the final aggregation when more than three subagents ran. A classic race condition dressed up as an LLM quirk.

I ran the same bug through three setups:

Bare Claude Code, no Skills loaded
Claude Code with andrej-karpathy-skills injected
Hermes Agent (running Qwen 2.5 72B) with the same skills directory

Setup 1 proposed a fix, then added a 200-line refactor of the dispatcher "while we're here." I had to stop it, revert, and narrow the scope.

Setup 2 proposed the same fix. It did not refactor anything else. It did not write a trailing summary. The diff was 14 lines. It worked.

Setup 3 was slower — Qwen 72B is not Opus — but the diff was nearly identical to Setup 2. Same 14 lines, correct for the same reason. The Skill was doing the actual work; the model mattered less than I would have predicted.

This is the thing that pushed me over. The Skill is transferable. Between models, between host agents, between projects. That's a real primitive. A system prompt is not transferable — it's coupled to the harness. A tool call is not transferable — it's coupled to the interface. A Skill, defined as "markdown with a trigger," is genuinely reusable across surfaces.

What this actually means for developers

The practical implication is a shift in where your taste lives.

Before: your taste lived in your code review. You caught mistakes after the LLM made them. You corrected them in chat. The correction didn't persist.

After: your taste lives in files. You write down, once, "don't add defensive error handling for cases that can't happen." You drop that file in every project. Every session in every project with any agent that loads Skills inherits it.

This is why the derivatives matter. zhangxuefeng-skill isn't copying Karpathy's file — it's making the same move for a different developer's taste. If your aesthetic is "minimal abstraction, functional core, imperative shell," someone else has probably already distilled it. If not, you write it yourself in an hour and publish.

The GitHub repo count for this category doubled between February and April 2026. By mid-April, there were 47 "skills" repos with 1,000+ stars each. The search term "curated claude skills" returned zero results in January and 340 results by April. This isn't a trend; it's a new repo category.

What's surprising is how little it required. No new model. No new framework. No new protocol. Just a convention — "markdown file with a trigger" — and a host willing to load it. Claude Code shipped the convention. Hermes Agent cloned it. The community did the rest.

What to copy from this if you're building a coding agent

Two things worth lifting, even if you're not building an agent:

The trigger-based loading pattern works for anything with a context budget. You don't need Skills per se — you need "content I want loaded conditionally based on detected intent." Snippets in IDEs have done this forever. What's new is doing it at the prompt layer.

The distilled-taste format works as documentation for your own future self. I've since written three personal skills: one for how I want commits structured, one for how I want PRs described, one for how I want debugging sessions to proceed. I load them across projects. Six months ago this would have been a CLAUDE.md at the project root, copied and maintained in a dozen places. Now it's one file, loaded on demand.

The repo for Hermes Agent is here:

NousResearch / hermes-agent

The agent that grows with you

Hermes Agent ☤

The self-improving AI agent built by Nous Research. It's the only agent with a built-in learning loop — it creates skills from experience, improves them during use, nudges itself to persist knowledge, searches its own past conversations, and builds a deepening model of who you are across sessions. Run it on a $5 VPS, a GPU cluster, or serverless infrastructure that costs nearly nothing when idle. It's not tied to your laptop — talk to it from Telegram while it works on a cloud VM.

Use any model you want — Nous Portal, OpenRouter (200+ models), NVIDIA NIM (Nemotron), Xiaomi MiMo, z.ai/GLM, Kimi/Moonshot, MiniMax, Hugging Face, OpenAI, or your own endpoint. Switch with hermes model — no code changes, no lock-in.

A real terminal interface	Full TUI with multiline editing, slash-command autocomplete, conversation history, interrupt-and-redirect, and streaming tool output.
Lives

…

View on GitHub

The Karpathy skills repo is here:

forrestchang / andrej-karpathy-skills

A single CLAUDE.md file to improve Claude Code behavior, derived from Andrej Karpathy's observations on LLM coding pitfalls.

Karpathy-Inspired Claude Code Guidelines

Check out my new project Multica — an open-source platform for running and managing coding agents with reusable skills.

Follow me on X: https://x.com/jiayuan_jy

A single CLAUDE.md file to improve Claude Code behavior, derived from Andrej Karpathy's observations on LLM coding pitfalls.

English | 简体中文

The Problems

From Andrej's post:

"The models make wrong assumptions on your behalf and just run along with them without checking. They don't manage their confusion, don't seek clarifications, don't surface inconsistencies, don't present tradeoffs, don't push back when they should."

"They really like to overcomplicate code and APIs, bloat abstractions, don't clean up dead code... implement a bloated construction over 1000 lines when 100 would do."

"They still sometimes change/remove comments and code they don't sufficiently understand as side effects, even if orthogonal to the task."

The Solution

Four principles in one file that directly address these issues:

Principle	Addresses

…

View on GitHub

Part 2 of this series looks at OpenClaw — the 295K-star personal assistant from Peter Steinberger that runs the opposite strategy: not Skills-first, but local-gateway-first. Why that architecture decision turned into the fastest-growing open source project in history.

The interesting primitive of 2026 isn't the model. It's the markdown file that tells the model to shut up and write 14 lines.

Which Skills have you actually found transferable across models? I'm particularly curious whether anyone has tested the Karpathy skills against DeepSeek V3 or Llama 3.3 — leave a comment if you have data.

Sources:

andrej-karpathy-skills repository - GitHub
Hermes Agent repository - Nous Research
Claude Code Skills documentation - Anthropic
GitHub Trending Weekly 2026-04-13 - Shareuhack
Karpathy-Inspired CLAUDE.md - Alpha Signal