Forem: 정상록

Microsoft-OpenAI Partnership Amendment (April 2026): What Indie Devs Need to Know About Non-Exclusive Licenses and Revenue Caps

정상록 — Tue, 28 Apr 2026 01:46:32 +0000

On April 27, 2026, Microsoft published an amendment to its OpenAI partnership. The headline reads "long-term clarity," but the actual changes restructure how AI infrastructure is distributed across cloud providers.

This post breaks down what changed and why it matters for indie developers and small teams building on OpenAI APIs.

The 5 Changes That Matter

Aspect	Before	After
Azure preference	Exclusive	Primary (multi-cloud allowed)
IP license	Through 2032, exclusive	Through 2032, non-exclusive
Microsoft → OpenAI revenue share	Active	Ended
OpenAI → Microsoft revenue share	Until AGI	Until 2030 + total cap
Microsoft's OpenAI stake	~27% ($135B value)	Maintained

Two words changed everything: "exclusive" was removed, and "cap" was added.

Change 1: Azure Primary, Not Exclusive

OpenAI still launches products on Azure first. But here's the structural shift: OpenAI can now serve any product on any cloud provider. AWS Bedrock, Google Cloud Vertex AI, self-hosted infrastructure—all legally possible.

TechCrunch headlined this as "OpenAI ends Microsoft legal peril over its $50B Amazon deal." The previously blocked OpenAI-Amazon $50B agreement is no longer constrained by exclusivity terms.

What This Means for Your Stack

Today, if you use OpenAI APIs, you typically pick:

Direct OpenAI platform access
Azure OpenAI Service (for enterprise/regulated industries)

Soon, you'll likely add:

AWS Bedrock (with native OpenAI model availability)
GCP Vertex AI (same)

More options means actual price negotiation power and region-level latency optimization.

Change 2: Non-Exclusive IP License (The Biggest Structural Shift)

The IP license through 2032 stays, but it's now non-exclusive. The gatekeeper role Microsoft played for OpenAI tech access is over.

Microsoft Copilot vs. OpenAI ChatGPT: Stop Treating Them As the Same Product

Many users assumed "Copilot and ChatGPT are the same GPT model with different interfaces." Under non-exclusivity, this assumption breaks down.

Both companies can now develop their products in different directions:

Microsoft can invest more in proprietary models (MAI, Phi series)
OpenAI can distribute GPT successors freely across channels

For your stack: evaluate them as separate product lines. "I heard Copilot is good, so ChatGPT must be similar" is increasingly wrong.

Change 3: Revenue Share Restructuring + AGI Decoupling

This is the subtle but most important change.

Microsoft → OpenAI: Ended

Microsoft no longer pays a portion of its AI revenue to OpenAI.

OpenAI → Microsoft: Until 2030 + Cap, AGI-Decoupled

The old contract had this revenue share ending "at AGI achievement." But AGI definitions were a constant legal friction point between the two companies.

The amendment explicitly decouples it: "regardless of OpenAI's technology progress."

Whether AGI is achieved or not
Sharing continues until 2030
Same percentage, but with a total cap

Why Both Companies Win Here

The AGI argument was slowing down both companies. Every release required legal teams to evaluate "does this trigger the AGI clause?" That overhead is gone.

Expect both companies to ship faster.

Change 4: Clear End Dates for Enterprises

For enterprise architecture decisions, this is the best news.

2030: OpenAI → Microsoft revenue share ends + cap may be reached
2032: IP license expires

Previously, "AGI achievement" was an undefined trigger. Now there's a clear clock for migration planning, license renewal negotiations, and alternative model evaluation.

Change 5: Microsoft's 27% Stake Stays — This Is Not a Separation

Microsoft maintains its ~27% stake (~$135B value) as a major shareholder. Joint work continues on:

Gigawatt-scale data center buildout
Next-gen AI silicon
AI cybersecurity

This is relationship redefinition, not separation. Like a couple acknowledging each other's careers and friendships rather than divorcing.

Practical Action Items for Indie Devs

1. Re-evaluate Multi-Cloud Options

If you're running on Azure OpenAI Service today, prepare for AWS Bedrock and GCP Vertex AI to add OpenAI models in the next 6-12 months. Pre-document:

Pricing comparison
Region availability
Data governance policy differences

2. Treat Copilot and ChatGPT as Separate Products

Use these criteria:

Microsoft Copilot: Office 365/Windows integration, enterprise data governance
OpenAI ChatGPT: Latest model availability, API flexibility, direct integration

3. Mark Calendar Triggers

For any service with deep OpenAI dependency:

2029-2030: Pricing policy may shift before revenue share ends
2031-2032: License terms in Microsoft channels may shift before IP license ends

These dates are negotiation cards if you're signing 5-year contracts now.

4. Avoid AGI Marketing Hype

Both companies are now legally free from AGI definition battles. Expect a shift from "AGI proximity" marketing to concrete capability benchmarks (coding, reasoning, domain expertise). That's a more honest comparison framework anyway.

TL;DR

Old model: Exclusive license + AGI-triggered revenue share = locked-in vendor + fuzzy timeline
New model: Non-exclusive license + capped revenue share + clear 2030/2032 = open vendor + predictable horizon

For indie devs and small teams: vendor lock-in is the killer. More OpenAI distribution channels means more pricing leverage for everyone.

This is the official end of the single-vendor era for OpenAI APIs.

Sources

What's your take? Are you planning to evaluate AWS Bedrock or GCP Vertex AI for OpenAI workloads once they're available?

TradingAgents v0.2.4: A Multi-Agent LLM Framework That Simulates an Entire Trading Firm

정상록 — Tue, 28 Apr 2026 01:35:34 +0000

TL;DR

UCLA Tauric Research released TradingAgents v0.2.4 (2026-04-25) — a LangGraph-based multi-agent LLM framework that mimics a real trading firm with 5 layers and ~12 agents. The new release adds Pydantic-typed structured outputs, LangGraph checkpoint resumption, a persistent decision-memory file, 5-tier rating, and 10 LLM provider integrations. Backtest on AAPL/GOOGL/AMZN shows 23-27% cumulative return.

⚠️ Disclaimer: backtest only. Not financial advice. Paper trade before any real capital.

Why this is interesting beyond "another trading bot"

Most LLM trading bots are a single model with a giant prompt. They suffer hard from confirmation bias — once they form an initial thesis, they cherry-pick evidence to support it.

TradingAgents counters this structurally with 5 layers of explicit role-based agents that argue with each other:

[Analyst Team x4] -> [Bull vs Bear debate]
        |
        v
   [Trader (3-tier)]
        |
        v
[Risk Mgmt: Aggressive vs Conservative vs Neutral]
        |
        v
   [Portfolio Manager (5-tier)] -> Buy / Overweight / Hold / Underweight / Sell

The whole pipeline runs on LangGraph state graph with explicit handoffs. You can replace any node, log any state, and resume from any checkpoint.

v0.2.4 highlights for developers

1. Structured output decision agents

The Research Manager, Trader, and Portfolio Manager now use llm.with_structured_output(Schema) with Pydantic schemas. This works across:

OpenAI: json_schema
Google Gemini: response_schema
Anthropic Claude: tool-use
Other models: function-calling fallback

No more brittle text parsing for decision values.

2. LangGraph checkpoint resumption

Pass --checkpoint to enable per-node state persistence:

~/.tradingagents/cache/checkpoints/<TICKER>.db

If your run crashes after the Bull/Bear debate but before Risk Management, you resume from there instead of paying for the full pipeline again. Big API cost savings.

3. Persistent decision memory

~/.tradingagents/memory/trading_memory.md

Every run appends a decision record. On the next run for the same ticker, the framework auto-injects:

Previous decision
Actual realized return (raw + alpha vs SPY)
A one-paragraph retrospective

into the Portfolio Manager prompt. This is automated trading-journal-as-context — the framework literally learns from its own past mistakes.

4. 5-tier rating system

The Portfolio Manager now outputs:

Rating	Meaning
Buy	strong buy
Overweight	increase position
Hold	maintain
Underweight	reduce position
Sell	exit

The Trader still uses 3-tier (Buy/Hold/Sell). Only the final Portfolio Manager gets the finer granularity.

5. 10 LLM providers

config["llm_provider"] = "anthropic"  # or:
# openai, google, anthropic, xai, deepseek,
# qwen, glm, openrouter, ollama, azure

Local Ollama is the cost-killer if you don't need top-tier reasoning.

Quick start

from tradingagents.graph.trading_graph import TradingAgentsGraph
from tradingagents.default_config import DEFAULT_CONFIG

config = DEFAULT_CONFIG.copy()
config["llm_provider"] = "anthropic"
config["deep_think_llm"] = "claude-opus-4-7"
config["quick_think_llm"] = "claude-haiku-4-5"
config["max_debate_rounds"] = 2
config["checkpoint_enabled"] = True

ta = TradingAgentsGraph(debug=True, config=config)
_, decision = ta.propagate("NVDA", "2026-01-15")
print(decision)

The split between deep_think_llm and quick_think_llm is the cost-optimization sweet spot:

deep_think_llm — used for Bull/Bear debate, Risk debate, Portfolio Manager (heavy reasoning)
quick_think_llm — used for Analyst data summarization (lightweight)

For Claude users, Opus 4.7 + Haiku 4.5 is a clean combo.

CLI mode

tradingagents

Interactive prompt asks for ticker, date, LLM provider, and research depth. Great for quick experiments.

Backtest performance (paper figures)

On AAPL / GOOGL / AMZN:

Metric	TradingAgents	Baseline
Cumulative Return	23.21~26.62%	lower
Annualized Return	up to 30.5%	lower
Sharpe Ratio	improved	baseline
Max Drawdown	improved	baseline

Source: arXiv 2412.20138 v7 (Yijia Xiao et al.).

What's NOT in the paper

The honest limitations:

Slippage is not modeled
Taxes are not deducted
Market impact for non-trivial positions is ignored
Live data latency is assumed-zero in backtest
All numbers are AAPL/GOOGL/AMZN only — large-cap US tech, the easiest regime

Translation: treat it as a research framework. Don't put real capital based on the paper figures alone.

Generalizing the pattern

The interesting part isn't trading — it's the multi-agent debate pattern itself. The same 5-layer structure could be applied to:

Content production (writer + editor + SEO + reviewer + final approver)
Marketing campaigns (strategy + copy + design + measurement)
Hiring decisions (technical + culture-fit + reference + final)
Pricing decisions (cost + market + competitive + customer)

Anywhere a single human or single LLM tends to confirm-bias their initial take, structured debate between 2+ adversarial agents helps.

Final thoughts

TradingAgents is not a "make money with AI" toolkit. It's a research framework that demonstrates the multi-agent paradigm works in a high-stakes domain. The v0.2.4 additions (structured outputs, checkpoints, persistent memory) make it actually usable for serious experimentation — not just paper-friendly demos.

Worth cloning, reading the LangGraph code, and stealing patterns for your own multi-agent systems.

Links

⚠️ Disclaimer: This article analyzes the TradingAgents framework from a technical perspective. It is not investment advice. Backtest numbers are historical and do not guarantee future returns. Always paper trade extensively before deploying real capital.

free-claude-code: Route Claude Code Through NVIDIA NIM, OpenRouter, and Ollama for $0

정상록 — Tue, 28 Apr 2026 01:35:01 +0000

If you've wanted to try Claude Code but balked at the $20 Pro subscription or API credit commitment, there's an open-source proxy worth knowing about: free-claude-code.

It's a small FastAPI server that sits on localhost:8082. You set ANTHROPIC_BASE_URL=http://localhost:8082 and Claude Code's API calls get rerouted to whatever backend you've mapped. The proxy handles bidirectional translation between Anthropic's SSE format and OpenAI chat / Anthropic Messages.

Why it's interesting

Two reasons.

First, the backends. NVIDIA NIM gives you 40 req/min for free with just an nvapi- key. OpenRouter has free models like DeepSeek R1. LM Studio, llama.cpp, and Ollama run locally with no rate limits at all.

Second, the model tier mapping. Claude Code internally picks between Opus/Sonnet/Haiku based on task complexity. The proxy intercepts that decision and routes each tier to a different backend you choose.

MODEL_OPUS="nvidia_nim/moonshotai/kimi-k2.5"
MODEL_SONNET="open_router/deepseek/deepseek-r1-0528:free"
MODEL_HAIKU="lmstudio/unsloth/GLM-4.7-Flash-GGUF"
MODEL="nvidia_nim/z-ai/glm4.7"

Heavy reasoning → Kimi K2.5 on NIM. Regular work → free DeepSeek R1 on OpenRouter. Fast responses → local LM Studio. Total cost: $0. And critically, it preserves Claude Code's own tier-routing logic, which is the part you can't easily replicate by manually configuring a single model.

Built-in features that aren't obvious

Thinking token conversion: <think> tags and reasoning_content from reasoning models get converted to Claude's native thinking blocks. You don't lose the reasoning trace.
Heuristic tool parser: Some free models emit tool calls as text. The proxy parses them back into structured tool_use blocks. Imperfect but helpful.
Subagent control: It intercepts Claude Code's Task tool and forces run_in_background=False. Without this, free-tier quotas can disappear in seconds when subagents spawn in parallel.
Smart rate limiting: Rolling-window throttling and 429 exponential backoff are baked in.
Discord/Telegram bots: Optional. Tree-based threading, session persistence, voice note transcription. You can trigger coding sessions remotely.

Setup

# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install 3.14

# 2. Clone and configure
git clone https://github.com/Alishahryar1/free-claude-code.git
cd free-claude-code
cp .env.example .env
# Edit .env: add NVIDIA_NIM_API_KEY=nvapi-xxx

# 3. Run the proxy
uv run uvicorn server:app --host 0.0.0.0 --port 8082

Then in another terminal:

ANTHROPIC_AUTH_TOKEN="freecc" \
ANTHROPIC_BASE_URL="http://localhost:8082" \
claude

ANTHROPIC_AUTH_TOKEN can be any string. The proxy doesn't validate it.

For convenience, alias it:

fcc() {
  ANTHROPIC_AUTH_TOKEN="freecc" \
  ANTHROPIC_BASE_URL="http://localhost:8082" \
  claude "$@"
}

There's also claude-pick, an fzf-based interactive selector if you don't want to keep editing .env to swap models.

Honest tradeoffs

This isn't a free Claude Code. It's a way to route Claude Code's interface to different backends, with the quality and reliability tradeoffs that implies.

Tool use accuracy: The hardest part. Anthropic's models are tuned heavily for accurate tool calling. Free models are usually weaker, and the heuristic parser only goes so far.
Latency: NIM's 40 req/min cap means heavy parallel work will throttle.
Translation loss: The OpenAI chat → Anthropic conversion can drop features in edge cases.
Self-hosted reliability: Uptime is on you.

If you ship production code, Claude Pro at $20/month is still the right answer. Where this shines:

Trying Claude Code's workflow before committing to a subscription.
Sensitive code (client data, internal repos) that you want processed entirely locally via Ollama.
Hybrid setups: real Anthropic for the Opus tier, free backends for Sonnet/Haiku. Cuts cost 70-90% with surprisingly small quality loss for boilerplate work.
Remote autonomous coding via the Discord/Telegram bots.

Where I'd start

NIM-only setup. Get an nvapi- key, set all three MODEL_* to NIM models, run the proxy, point Claude Code at it. Twenty minutes of setup, zero cost, and you can immediately tell whether Claude Code's workflow fits how you work.

Once that's working, layer in Ollama for sensitive work and OpenRouter free models for variety. Keep a separate .env per project so you can swap the routing strategy per repo.

Repo: https://github.com/Alishahryar1/free-claude-code

Matt Pocock's Agent Skills — 30K Stars and the Start of the Skill Economy

정상록 — Tue, 28 Apr 2026 01:32:36 +0000

If you've ever caught yourself thinking "AI coding tools are great, but I'm doing the same prompt setup again and again," you'll want to look at what Matt Pocock just shipped.

The Total TypeScript founder open-sourced his personal .claude/skills/ folder on 2026-02-03. Three months later: 30,344 stars, 2,380 forks, still actively pushed. No marketing campaign. No launch post. Just a directory dump that resonated.

The repo is called, simply, mattpocock/skills. The slogan in the README is the whole thesis:

"My agent skills that I use every day to do real engineering — not vibe coding."

That distinction — real engineering vs vibe coding — is doing a lot of work. Let's unpack it.

Skills aren't prompts

Most people use AI coding tools by writing prompts. The better the prompt, the better the output. Matt's framing is different.

A skill is a process. It tells the AI how to work, not what to say. Same input, different process, fundamentally different output. Think of it less like writing a message to the AI and more like designing the workshop the AI lives in.

Let's see what that looks like in practice.

The 16 skills, by category

Planning & Design (5)

Skill	What it does
`to-prd`	Synthesizes the current conversation into a PRD, files it as a GitHub issue. No interview — just synthesis of what's already been discussed.
`to-issues`	Breaks PRDs/specs into independent vertical-slice GitHub issues.
`grill-me`	Interrogates you until every branch of the decision tree is resolved.
`design-an-interface`	Uses parallel sub-agents to generate radically different interface designs.
`request-refactor-plan`	Interview → small-commit refactor plan → GitHub issue.

The standout here is grill-me. It's like having a skeptical tech lead permanently stationed in your terminal. Why are you doing this now? What's the cost of this decision? What edge case haven't you considered? It won't shut up until you've answered everything. It's exhausting in the best possible way.

Development (5)

Skill	What it does
`tdd`	Red-Green-Refactor loop, vertical slice at a time.
`triage-issue`	Explore codebase → root cause → TDD fix plan as issue.
`improve-codebase-architecture`	Finds deepening opportunities from `CONTEXT.md` + `docs/adr/`.
`migrate-to-shoehorn`	Migrates `as` assertions in tests to `@total-typescript/shoehorn`.
`scaffold-exercises`	Generates exercise directory structure.

tdd is brilliant. Before any code is written, it forces 3 questions:

Is this code designed to be testable?
What is the core behavior?
Does this require an interface change?

Notice where the friction is. It's not "before every action." It's "before code-writing — the place where rushing produces bad outcomes." You buckle the seatbelt where the road is dangerous, not where it's flat.

Tooling & Setup (2)

Skill	What it does
`setup-pre-commit`	Husky + lint-staged + Prettier + typecheck + tests.
`git-guardrails-claude-code`	Blocks dangerous git commands at the hook level.

git-guardrails-claude-code is the safety harness for autonomous loops. It blocks git push --force, git reset --hard, git clean -fd, etc. at the Claude Code hook layer — so your AI literally cannot nuke main at 3am.

Writing & Knowledge (4)

Skill	What it does
`write-a-skill`	Generates new skills with progressive disclosure structure.
`edit-article`	Restructures sections, improves clarity, compresses prose.
`ubiquitous-language`	Extracts DDD ubiquitous language glossary from current conversation.
`obsidian-vault`	Manages Obsidian vault with wikilinks and index notes.

write-a-skill is the meta layer. Skills that make skills. This is why the repo doesn't get stale: any user can fork it, run write-a-skill, and add their own workflow. The system extends itself.

Installation

The repo uses vercel-labs/skills (582K weekly downloads on npm):

# Install one skill in current project
npx skills@latest add mattpocock/skills/tdd

# Install globally (~/.claude/skills/)
npx skills add mattpocock/skills/tdd -g

# Install all 16 at once
npx skills add mattpocock/skills --all

Compatible with 49+ agents: Claude Code, Codex, Cursor, OpenCode, Gemini CLI, GitHub Copilot, OpenClaw, Antigravity, and more.

The four design principles

If you read the SKILL.md files across the 16 skills, four principles repeat:

1. Process > Prompt

The skill defines the workflow. The prompt is just a trigger.

2. Progressive Disclosure

Each SKILL.md root is under 50 lines. Detail goes into references/. The AI loads detail only when needed. This saves context tokens and keeps the surface area readable.

3. Vertical Slice

Every workflow that could be horizontal (TDD, refactoring, issue breakdown) is sliced vertically instead. One full feature end-to-end, then the next.

4. Self-extending

write-a-skill is itself a skill. The system extends itself. This is the secret sauce of the 30K stars — adoption snowballs because users add to the library.

What's the "Skill Economy"?

Vibe Sparking coined the phrase. The trajectory looks like this:

2025: Prompt engineering (individual know-how)
  ↓
2026 Q1: AGENTS.md / CLAUDE.md (project-level context)
  ↓
2026 Q2: Agent Skills (reusable workflows)
  ↓
2026+: Skill registry (npm-like ecosystem)

Two months ago, your AI workflow was implicit knowledge in your prompt. Today, it's a versioned artifact you can fork, share, and review. That's a fundamental change.

Matt also released a course in April 2026 — "Claude Code for Real Engineers" — covering Plan/Execute/Clear workflows, AGENTS.md patterns, and Ralph Wiggum autonomous loops. The skills repo is the practical foundation of that course.

Where to start (5 skills, 30 minutes)

If you're new to this, install these five and you'll see the difference within a week:

npx skills add mattpocock/skills/tdd -g
npx skills add mattpocock/skills/grill-me -g
npx skills add mattpocock/skills/to-prd -g
npx skills add mattpocock/skills/write-a-skill -g
npx skills add mattpocock/skills/git-guardrails-claude-code -g

Workflow examples:

New feature: grill-me → to-prd → tdd (with git-guardrails-claude-code running quietly in the background)
Custom workflow: write-a-skill → share the skill with your team or open-source it back

Closing thoughts

The shift here isn't about better tooling. It's about AI as a collaborator with defined processes, not AI as an oracle you prompt. That distinction is what 30K stars are voting on.

Try it for a week. The change will compound.

Source repo: github.com/mattpocock/skills

Canva Magic Layers — How AI Decomposes Flat Images Back into Editable Layers

정상록 — Mon, 27 Apr 2026 00:43:58 +0000

TL;DR

Canva launched Magic Layers in March 2026. It takes any flat JPG/PNG and decomposes it into editable text, object, and background layers using Canva's proprietary foundation model. 9M+ uses in the first month. Now part of Canva AI 2.0 (announced April 16, 2026).

The interesting part for developers and automation builders: the direction is opposite to Photoshop Generative Fill. Generative Fill creates new content. Magic Layers reverses an existing image back into editable form.

What Magic Layers actually does

Upload a JPG or PNG. Click "Edit image" → "Magic Layers". In ~5 seconds:

Text becomes live text boxes (not OCR — original font, size, color preserved)
Objects become individual movable layers
Background sits behind cleanly separated foreground

You can then change a font, swap a background, or replace product photography without re-generating the source image.

Architecture: Canva Design Model

This isn't built on top of Stable Diffusion or DALL-E. It runs on Canva's own Canva Design Model, a foundation model trained to understand design structure rather than generate pixels.

Canva AI Stack
├── External (OpenAI, Anthropic) ── text generation, general AI
├── Canva Magic Media ───────────── image/video generation
└── Canva Design Model (in-house) ── Magic Layers (decomposition)
                                   ── Agentic editing
                                   ── Structural design

Canva publicly framed this as their proprietary moat. OpenAI and Anthropic still power Canva's text features, but the layering capability is built and trained internally.

Why this matters for automation pipelines

If you're building any kind of content automation — and especially if you generate carousels, social posts, or programmatic banners — you've hit this wall:

AI generates a flawless image
You need to change one word of text
...you're back to prompt-rewriting and re-rolling

The cost-per-variant of AI imagery has been bottlenecked at editing, not generation. Magic Layers attacks the editing bottleneck head-on.

For a Python automation pipeline that today does:

# Current AI image flow
prompt = build_prompt(brand, product, variant)
image = midjourney_generate(prompt)  # ~30s
# To create a variant: re-run from line 1

The post-Magic-Layers conceptual flow looks more like:

# Hypothetical post-Magic-Layers flow
master = midjourney_generate(prompt)  # generate once
for variant in variants:
    layered = canva_magic_layers(master)  # decompose once
    swap_text(layered, variant.copy)       # cheap edit
    swap_background(layered, variant.bg)   # cheap edit
    export(layered)

Note: Canva does not publicly expose Magic Layers via API yet (as of April 2026). The above is conceptual until they ship it.

Comparison table

Feature	Canva Magic Layers	Photoshop Generative Fill	Adobe Firefly	Figma AI
Direction	Decompose	Generate	Generate	Generate layouts
Skill barrier	Click once	High (masking, selection)	Medium	Medium
Output	Editable Canva file	PSD layers	New asset	Figma frames
Best for	Marketers, solo builders	Pro photo retouching	Brand teams	UI/UX designers
API access	Not yet	Photoshop SDK	Firefly API	Figma plugin API

Limitations engineers should know

Format support: JPEG and PNG only. No WEBP, HEIC, PDF, or SVG yet.
Edge cases: Hair, transparent objects, and heavy shadow overlap reduce mask quality.
No SDK/API: Magic Layers is UI-only inside Canva editor at the moment. No programmatic access for automation pipelines.
Region locked: US, UK, Canada, Australia public beta. Asian markets not yet supported.
Language support: English plus 6 European languages. Korean/Japanese/Chinese text layer extraction unverified.
Credits: Free plan = 50 shared AI credits/month. Magic Layers consumes more credits per call than text generation. Pro plan ($120/year) recommended for any sustained use.

What this signals about the AI image market

The last 18 months of AI image tooling competed on generation quality — Midjourney v6, DALL-E 3, SDXL, FLUX, and so on. The actual workflow blocker for everyone shipping content was post-edit. Canva moved first into that gap with a proprietary foundation model rather than fine-tuning an open one.

Adobe's response will likely come from the Photoshop side (extending Generative Fill into structural decomposition). Figma might extend AI to broader image-to-frame conversion. Watch for someone shipping an open-source equivalent within 6-12 months — the architecture (segmentation + OCR + layout reconstruction) is reproducible if not trivial.

What I'm doing about it as a solo builder

Today: monitoring Canva newsroom for global rollout including Korea
At Korean launch: testing Korean text layer extraction quality on Pro plan
Until then: keeping my existing card-news pipeline (Gemini + HTML rendering) as the production path. Magic Layers becomes a workflow integration when API ships and Korean support lands.

Source

Originally published on qjc.app/blog. Discussion welcome below.

GPT-5.5 Released: What the Marketing Headlines Don't Tell You

정상록 — Mon, 27 Apr 2026 00:42:55 +0000

OpenAI announced GPT-5.5 on April 23, 2026. The API rolled out one day later on April 24. Four days in, the marketing claims and benchmark hype are everywhere — but the picture is more nuanced than headlines suggest.

This post is a 1st-source-only digest. Everything here is cross-validated against openai.com, developers.openai.com, CNBC, TechCrunch, Fortune, Help Net Security, and the OpenAI/Codex GitHub issue tracker.

What Was Actually Announced

Announcement: April 23, 2026 (Brockman, Glaese, Chen, Pachocki — not Sam Altman)
API release: April 24, 2026 (one day later, separate safeguard process)
Model IDs: gpt-5.5, snapshot gpt-5.5-2026-04-23
Knowledge cutoff: December 1, 2025
ChatGPT availability: Plus, Pro, Business, Enterprise (immediate)
Codex availability: Plus, Pro, Business, Enterprise, Edu, Go

Benchmarks: Where GPT-5.5 Is SOTA

All scores below are at reasoning effort xhigh per OpenAI's official tables.

Benchmark	GPT-5.4	GPT-5.5	Delta
Terminal-Bench 2.0	75.1%	82.7%	+7.6pp (SOTA)
Expert-SWE (Internal)	68.5%	73.1%	+4.6pp
GDPval (knowledge work)	83.0%	84.9%	beats Claude Opus 4.7 (80.3%)
OSWorld-Verified	75.0%	78.7%	beats Claude (78.0%)
Tau2-bench Telecom	92.8%	98.0%	no prompt tuning
FrontierMath Tier 4	27.1%	35.4%	beats Claude (22.9%)
ARC-AGI-2	73.3%	85.0%	+11.7pp
MRCR v2 8-needle 512K-1M	36.6%	74.0%	+37.4pp (2x recovery)

The MRCR v2 long-context recovery is particularly impressive. GPT-5.4 was losing more than half the needles in the 512K-1M range; GPT-5.5 retains roughly three-quarters.

Where GPT-5.5 Is NOT #1

This is the part most marketing posts skip. Per OpenAI's own published comparison tables:

Benchmark	GPT-5.5	Leader	Lead Margin
SWE-Bench Pro	58.6%	Claude Opus 4.7 (64.3%)	-5.7pp
GPQA Diamond	93.6%	Claude Opus 4.7 (94.2%)	-0.6pp
Humanity's Last Exam (with tools)	52.2%	Claude Opus 4.7 (54.7%)	-2.5pp
ARC-AGI-1 (Verified)	95.0%	Gemini 3.1 Pro (98.0%)	-3.0pp
BrowseComp	84.4%	Gemini 3.1 Pro (85.9%)	-1.5pp

OpenAI itself notes in the announcement that SWE-Bench Pro has potential memorization concerns documented in the literature. Take any single benchmark with appropriate skepticism.

The 1M Context Catch

OpenAI markets GPT-5.5 with a "1M context window." The exact number per developer docs is 1,050,000 tokens. But this number depends heavily on where you use it.

Environment	Context	Source
API (`gpt-5.5`)	1,050,000 tokens	developers.openai.com
Codex (official)	400,000 tokens	OpenAI announcement
Codex (measured)	258,400 tokens (bug report)	openai/codex#19319
Max output	128,000 tokens	developers.openai.com

Users in the GitHub issue are reporting "exceeds the context window" errors at unexpectedly low input sizes. If you're building tooling that depends on the full 1M window, validate the actual environment, not the marketing claim.

Pricing: 2x Increase + Long Context Premium

The published API pricing:

gpt-5.5
  Input:        $5.00 / 1M tokens
  Output:      $30.00 / 1M tokens
  Cached input: $0.50 / 1M tokens

gpt-5.5-pro (parallel test-time compute variant)
  Input:       $30.00 / 1M tokens
  Output:     $180.00 / 1M tokens

This is exactly 2x GPT-5.4's pricing ($2.50 input / $10 output).

The hidden premium: inputs over 272K tokens get 2x input cost and 1.5x output cost. So if you actually use the full 1M window, you're paying double on input. This makes "1M context is essentially priced twice" a fair characterization.

Other pricing modifiers:

Batch / Flex: 50% of standard
Priority processing: 250% of standard
Regional processing (data residency): +10%
Codex Fast mode: 1.5x speed at 2.5x cost

OpenAI argues that token efficiency improvements offset the price hike for many workloads. Your mileage will depend heavily on the work type.

Safety: AISI Found a Universal Jailbreak in 6 Hours

The UK AI Security Institute (AISI) ran a 6-hour expert red team and found a universal jailbreak before launch. OpenAI says they fixed it before release. However, per Transformer News, AISI did not directly verify the fix in the final deployment configuration.

GPT-5.5 is rated "High" on OpenAI's Preparedness Framework for both cybersecurity and biology (below "Critical" but above prior models). OpenAI launched a Bio Bug Bounty program for finding biology safeguard bypasses.

For cybersecurity, OpenAI is positioning defensively via the Trusted Access for Cyber program — vetted defenders get expanded access to GPT-5.5's cyber capabilities. SecureBio evaluation reportedly found "wet-lab virology troubleshooting assistance above expert level," which is the basis for the High rating.

Practical Guidance (Day 4)

Real-world feedback is still thin. Based on what OpenAI has published:

Use GPT-5.5 when:

Agentic coding workflows (Terminal-Bench-style tasks)
Computer use / OS automation (OSWorld-Verified)
Long-context recall in the 512K-1M range
Tier-4 frontier mathematics
Knowledge work where GDPval is representative

Consider Claude Opus 4.7 when:

Pure SWE-Bench Pro-style coding tasks
Academic reasoning (GPQA Diamond)
Humanity's Last Exam-style questions

Cost optimization:

Stay below 272K input tokens to avoid the long context premium
Use Batch/Flex modes for 50% off when latency is flexible
Cached input drops cost to $0.50/1M (90% savings)
In Codex, plan for ~258-400K context, not 1M

Caveats Worth Repeating

All benchmarks above are at reasoning effort xhigh. Default API settings will likely produce lower scores.
We're 4 days post-launch. External reproduction and independent evaluation are pending.
OpenAI's comparison tables have empty cells (-) for some Claude/Gemini entries, so "SOTA across the board" is an overstatement of what was actually published.
Korean and other non-English language performance is not specifically benchmarked in the announcement.

References

Disclaimer: AI-assisted research digest. Verify primary sources before making decisions. We're 4 days into the release; expect updates as third-party evaluations come in.

Hermes Agent v0.11.0: How a Self-Improving Open-Source AI Agent Hit 105K GitHub Stars in 7 Weeks

정상록 — Mon, 27 Apr 2026 00:37:51 +0000

Hermes Agent v0.11.0: How a Self-Improving Open-Source AI Agent Hit 105K GitHub Stars in 7 Weeks

If you missed it: Nous Research dropped Hermes Agent v0.11.0 on April 23, 2026, and the project crossed 105,000 GitHub stars in just 7 weeks since its February 25 launch. That's faster than AutoGPT, faster than CrewAI, and arguably the most significant release in the open-source agent space this year.

I spent the weekend digging into the actual code and benchmarks. Here's what I found, and why I think the GEPA self-improvement loop is the part most articles are underselling.

Why this matters more than typical "another agent framework" launches

Most "agent frameworks" published in 2024-2025 were orchestration layers — they call LLM APIs in sequence and manage state. Hermes Agent does that, but adds something new: the agent literally rewrites its own prompts and skills as it works.

This isn't a marketing claim. It's GEPA (Generative Embedding Prompt Adaptation), accepted as an Oral paper at ICLR 2026.

The GEPA Loop in Detail

Task complete (5+ tool calls)
      ↓
Trace analysis (which tools, in what order, with what context)
      ↓
Skill file auto-generated (.md format, human-readable)
      ↓
System prompt auto-tuned (small deltas)
      ↓
SQLite FTS5 index updated for retrieval
      ↓
Next similar task → 40% faster (after ~20 skills accumulated)

The skill files end up in ~/.hermes/skills/ and look like this:

---
name: market-research-pipeline
trigger_keywords: ["market trends", "research", "competitive analysis"]
tool_sequence: [web-search, extract-content, summarize, translate, image-gen]
---

# Market Research Pipeline

When user asks for market research:
1. Run web-search with date filter (last 30 days)
2. Extract from top 5 results
3. Summarize in target language
4. Generate infographic if data is quantitative
5. Format as markdown report

Crucially, you can edit these files manually. The agent's "memory" is just a directory of markdown files. No proprietary vector store, no opaque embeddings.

Benchmarks: Where GEPA Actually Wins

Benchmark	Hermes Agent (GEPA)	Comparison Point
MATH	93%	Base CoT on same model: 67% (+26pt)
AIME-2025	MIPROv2 +12%	vs leading prompt optimizer
GEPA vs GRPO	avg +6%, max +20%	with 35x fewer rollouts
RefusalBench (Hermes 4.3 36B)	57%+	GPT-4o / Claude: ~17%

The RefusalBench result is the one I'd verify independently if I were betting production budget on this. 3.4x less false refusal vs the major closed models is a big claim. But if it holds, the enterprise implications are significant.

What's New in v0.11 Specifically

React/Ink TUI v2 (Complete Rewrite)

The terminal interface was completely rebuilt on React + Ink. New capabilities:

Sticky composer: Input area stays at the bottom even with long output streams
OSC-52 clipboard: Click any code block to copy to system clipboard
Live streaming: Tool call results render in real-time with progress indicators

If you've used the TUI in v0.10, this feels like a different product.

`/steer` — Mid-Execution Intervention

This one's underrated. Traditional agent frameworks make you wait for the entire run to complete before you can correct course. With /steer, you intervene right before the next tool call:

Agent: "I'll translate all 10 articles next."
You: /steer only translate 5
Agent: [adjusts plan, translates 5]

The implementation uses a queue inspection at the tool dispatch boundary. Clean.

Unlimited Sub-Agent Recursion

Sub-agents can spawn sub-agents to arbitrary depth and breadth. Example pipeline:

main-agent
  ├── researcher
  │   ├── web-scraper-agent
  │   └── pdf-extractor-agent
  ├── analyst
  │   └── data-validator-agent
  └── writer
      └── editor-agent

In v0.10 this was capped. v0.11 removes the cap.

Five New Model Providers

Provider	Use Case
GPT-5.5 (Codex OAuth)	OpenAI's latest coding-specialized model
AWS Bedrock (Converse API)	Enterprise AWS infrastructure integration
NVIDIA NIM	NVIDIA inference containers
Arcee AI	Small specialized models
Vercel ai-gateway	Multi-provider routing

The AWS Bedrock integration is the one I'd watch closely. It enables truly private deployments inside enterprise VPCs — which is what most regulated industries need.

QQBot (17th Messaging Platform)

Tencent QQ integration is the new platform. Combined with Discord, Slack, Telegram, Line, WhatsApp, WeChat, etc., Hermes Agent now covers basically every major chat surface globally.

The Cost Story That Should Make You Pay Attention

Per Nous Research's published numbers, equivalent enterprise tasks cost:

Volume	Hermes Agent (Local 4.3 36B)	GPT-5.5 / Claude (API)
Single 5-tool task	$0.001	$0.02 - $0.09
1,000 tasks/day	$1	$20 - $90
Monthly	$30	$600 - $2,700

That's a 20-90x cost differential. For a 10-person engineering team running agents 24/7, the math becomes obvious quickly. The catch: you need a 24GB+ GPU for local Hermes 4.3 36B. If you're running on API providers, you lose most of the cost advantage.

Architecture Highlights

┌─────────────────────────────────────────┐
│  Messaging Platforms (17)               │
│  Telegram / Discord / Slack / LINE / ... │
└─────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│  Hermes Agent Core                       │
│                                          │
│  ┌────────────────────────────────────┐ │
│  │  Memory (3-tier)                    │ │
│  │  • Short: in-memory context         │ │
│  │  • Medium: SQLite FTS5 (cross-session) │
│  │  • Long: skills/personas/profiles   │ │
│  └────────────────────────────────────┘ │
│                                          │
│  ┌────────────────────────────────────┐ │
│  │  GEPA Loop                          │ │
│  │  • Trace capture                    │ │
│  │  • Skill generation                 │ │
│  │  • Prompt tuning                    │ │
│  └────────────────────────────────────┘ │
│                                          │
│  ┌────────────────────────────────────┐ │
│  │  Tool Gateway (v0.10+)              │ │
│  │  • Web search (Firecrawl)           │ │
│  │  • Image gen (FAL FLUX 2 Pro)       │ │
│  │  • TTS (OpenAI)                     │ │
│  │  • Browser (Browser Use)            │ │
│  └────────────────────────────────────┘ │
└─────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────┐
│  Model Providers (multiplexed)          │
│  Local 4.3 / GPT-5.5 / Claude / Gemini  │
└─────────────────────────────────────────┘

Getting Started in 30 Minutes

# Install
npm install -g @nousresearch/hermes-agent

# Initialize
hermes init my-agent
cd my-agent

# Choose provider (pick one)
hermes config set provider local        # Free, needs GPU
hermes config set provider openai        # Easy, costs money
hermes config set provider anthropic
hermes config set provider google

# Connect Telegram (easiest messaging platform)
hermes connect telegram --token $TELEGRAM_BOT_TOKEN

# Start
hermes start

# Now message your bot on Telegram

For Tool Gateway (web search, image gen, TTS, browser):

hermes login nous-portal
hermes config set tool-gateway nous-portal

Or BYO API keys:

hermes config set tools.web-search.firecrawl-key $FIRECRAWL_KEY
hermes config set tools.image.fal-key $FAL_KEY

What I'm Still Watching

Independent benchmark verification: GEPA numbers come from Nous Research themselves. Would be valuable to see third-party reproduction.
GEPA in production: Does the 40% speedup on repeat tasks materialize after 1-3 months of real usage, or is it a benchmark artifact?
Tool Gateway availability: Nous Portal is currently the easiest path. Is it stable enough for production SLAs?

Verdict

Hermes Agent v0.11.0 is the most significant open-source agent release of 2026 so far. The GEPA self-improvement loop is genuinely novel, the 20-90x cost advantage opens up agent use cases that didn't pencil out before, and the 17-platform messaging integration makes consumer-facing deployments trivial.

If you're building agents in 2026, you owe it to yourself to spend a weekend with this.

Sources:

Codex CLI 다 알려드릴게요. 그래도 메인은 Claude Code입니다

정상록 — Mon, 27 Apr 2026 00:35:44 +0000

Codex CLI 다 알려드릴게요. 그래도 메인은 Claude Code입니다

요즘 OpenAI Codex CLI에 대한 관심이 많습니다. "Claude Code에서 갈아타야 하는가"라는 질문도 자주 보이고요. 결론부터 말씀드리면, 갈아탈 필요 없습니다. 코덱스 CLI는 잘 만든 도구이지만, 메인 도구를 바꿀 정도는 아닙니다. 메인은 여전히 Claude Code입니다.

이 글에서는 두 가지를 다룹니다. 먼저 Codex CLI의 핵심 활용법을 정리하고, 그 다음 왜 여전히 Claude Code가 메인이어야 하는지 다섯 가지 근거로 분석합니다.

"Claude Code 품질 저하" 논란부터 정리합니다

코덱스로 옮겨가려는 분들 대부분이 한 번쯤 겪으셨을 이슈입니다. 한동안 Claude Code 품질이 떨어진다는 체감이 분명히 있었습니다. Anthropic이 공식적으로 원인을 밝혔고, 세 가지가 함께 작용했다고 발표했습니다.

추론 난이도 설정이 중간에서 높음으로 변경됐던 시기
캐싱 동작 변경
시스템 프롬프트 변경

이 세 가지 모두 원상 복구가 완료된 상태입니다. 즉, 많은 사용자가 "코덱스로 갈아타야 하나" 고민할 무렵 Claude Code는 이미 정상화되어 있었습니다. 일시적 이슈에 끌려 메인 도구를 바꾸는 의사결정은 신중해야 합니다.

Codex CLI 핵심 활용법

설치와 첫 실행

설치는 npm 한 줄로 끝납니다.

npm install -g @openai/codex

터미널에서 codex를 입력하면 실행됩니다. UI는 Claude Code와 거의 흡사해서 적응 비용은 낮은 편입니다. 키보드 단축키 체계도 비슷하고, 인터랙티브 방식도 유사합니다.

권한 모드 3가지

/permissions 명령어로 권한 모드를 설정할 수 있습니다.

모드	동작
Default	파일 읽기/수정은 자유. 외부 인터넷 접근이나 작업 디렉토리 외부 접근은 사용자 승인 필요
Auto Review	자율 동작 + 서브에이전트가 명령어의 안전성을 자체 판단
Full Access	YOLO 모드. 모든 명령을 자율 실행

일상 작업은 Default나 Auto Review로 충분합니다. Full Access는 격리된 컨테이너나 가상머신에서만 사용을 권장합니다.

슬래시 명령어 정리

코덱스 CLI에서 자주 쓰게 되는 명령어들입니다.

/statusline: 모델, 추론 단계, 컨텍스트, 패스트 모드 여부, 프로젝트명을 한 줄에 표시
/copy (단축키 Ctrl+O): 마지막 응답을 클립보드로 복사
/clear: 컨텍스트를 0%로 리셋
/resume: 기존 세션 복원. /resume --all은 다른 프로젝트 세션까지 포함
/fork: 현재 세션을 복제. 분기 작업할 때 유용
/status: 컨텍스트 윈도우 사용량과 위클리 리밋 확인
/model: 모델 변경 + 추론 단계(low/medium/high) 설정
/fast: 패스트 모드 토글
/plan (단축키 Shift+Tab): 계획 문서를 먼저 작성하고 코드는 나중
/review: 4가지 리뷰 모드 (브랜치 비교, 커밋 직전 셀프 리뷰, 커스텀 인스트럭션 등)
/compact: 대화 내역을 압축해 필요한 맥락만 새 컨텍스트로 추리기

단축키 정리

Tab: 후속 작업을 큐에 적재. 현재 작업 진행 중에도 다음 작업을 미리 등록 가능
Ctrl+R: 프롬프트 히스토리 검색
Ctrl+C (작성 중): 프롬프트 임시 삭제. 위 방향키로 복원 가능
Ctrl+C (대기 중): 코덱스 종료
위/아래 방향키: 프롬프트 히스토리 탐색
! 입력: 배시 모드. 셸 명령을 직접 실행
$ 입력: 코덱스 스킬 호출 (예: $image)

Tab으로 큐를 쌓아두는 패턴이 의외로 편합니다. AI가 작업하는 동안 다음 단계를 미리 적어두는 식입니다.

모델 라인업

모델	특징	크레딧
GPT-5.5	최신, 패스트 모드 지원	2.5배
GPT-5.4	패스트 모드 지원	2배
GPT-5.3 Spark	속도 빠름	별도 리밋

대부분의 작업은 5.4로 충분합니다. 5.5는 정말 어려운 작업에만 쓰는 게 효율적입니다.

$image 스킬로 이미지 생성

코덱스 CLI 안에서 $image 스킬을 호출하면 이미지를 생성할 수 있습니다. Image 2.0 모델을 쓰는데, 한글 텍스트 렌더링이 거의 완벽한 수준입니다. 코덱스 요금제에 포함이라 별도 결제는 필요 없습니다.

로고나 배너를 개발 워크플로우 안에서 빠르게 뽑을 때 활용도가 있습니다.

Plan 모드 워크플로우

Shift+Tab 또는 /plan으로 진입합니다. 흐름은 이렇습니다.

계획 문서(plan)부터 작성합니다
모호한 부분이 있으면 에이전트가 역질문합니다
완성된 plan을 검토한 뒤 실행 옵션 3가지 중 선택합니다
- 현재 세션에서 바로 실행
- 컨텍스트를 초기화하고 plan만 가지고 실행 (권장)
- plan을 더 수정

권장 패턴은 두 번째입니다. "feature 단위로 plan을 만들고, 컨텍스트를 초기화한 뒤 실행"하는 방식이 일관성과 품질 모두에서 가장 안정적이라는 게 정설입니다.

Review 모드

/review 명령어로 진입합니다.

현재 브랜치 vs 베이스 브랜치 비교
커밋되지 않은 변경사항 셀프 리뷰
커스텀 리뷰 인스트럭션 적용

PR을 올리기 직전 셀프 리뷰 용도로 쓰면 한 번 더 점검하는 효과가 있습니다.

Codex Cloud로 PR 자동 리뷰

이게 코덱스에서 꽤 강력한 기능입니다.

chat.openai.com/codex → Connectors → GitHub 연결을 거쳐 코드 검토 옵션을 켭니다. 트리거는 세 가지 중 선택할 수 있습니다.

PR이 열릴 때
푸시할 때
스마트 트리거 (변경 규모/유형에 따라)

PR에 코덱스 봇이 코멘트를 달고, 이슈 수정을 직접 요청하면 코덱스 클라우드가 자동으로 커밋해줍니다. 솔직히 이 부분은 편합니다.

그래서 왜 여전히 Claude Code가 메인인가

여기까지 보면 코덱스가 매력적으로 보일 수 있습니다. 하지만 다음 다섯 가지 이유로 메인은 여전히 Claude Code입니다.

1. 응답 속도가 빠릅니다

일상 개발에서 체감 응답 속도는 Claude Code가 더 빠릅니다. 코덱스는 패스트 모드를 켜야 비슷한 수준이 되는데, 패스트 모드는 크레딧을 2~2.5배 소비합니다. 즉, 같은 속도를 얻으려면 비용이 두 배 이상 들어가는 구조입니다.

코드 생산성은 결국 "한 번의 응답을 기다리는 시간"의 누적입니다. 하루에 수백 번 반복되는 인터랙션에서 0.5초 차이가 쌓이면 큰 차이가 됩니다.

2. 대규모 리팩토링에서 안정적입니다

코드베이스 전체를 이해하고 일관된 변경을 적용하는 능력에서 Claude Code가 한 발 앞섭니다. 흥미로운 점은, 코덱스를 적극 추천하는 분들도 "일상 개발과 대규모 리팩토링은 Claude Code를 쓴다"고 자주 인정한다는 겁니다.

이건 단순히 모델 성능 차이가 아니라 도구 설계 철학 차이입니다. Claude Code는 처음부터 코드베이스 단위 작업을 염두에 두고 만들어졌습니다.

3. 생태계가 압도적입니다

이게 결정적입니다.

MCP 표준화: Anthropic이 만든 Model Context Protocol이 사실상 업계 표준이 됐습니다. 외부 툴 연동의 공통 언어 역할을 합니다
서브에이전트 / Agent Teams: 역할별 에이전트를 조합해 복잡한 워크플로우를 구성할 수 있습니다
스킬 시스템: 재사용 가능한 자동화 단위를 만들고 공유할 수 있습니다
훅(Hook): SessionStart, PreToolUse, PostToolUse 등 라이프사이클에 맞춰 자동화를 끼워 넣을 수 있습니다
Claude Code Plugin 생태계가 빠르게 성장 중입니다

코덱스가 이 수준에 도달하려면 상당한 시간이 필요합니다.

4. 코덱스의 강점은 "보조 영역"입니다

코덱스를 추천할 때 자주 거론되는 두 가지 장점은 이렇습니다.

토큰 비용 절감
장기 자율 실행

그런데 두 가지 모두 Claude Code에서 충분히 대응됩니다.

토큰 절감: Claude Pro/Max 정액제는 사실상 무제한에 가까운 사용량을 제공합니다. 토큰 단위 과금 비교는 의미가 줄어듭니다
장기 자율 실행: Claude Code의 SDK, 헤드리스 모드, 백그라운드 작업으로 동일한 시나리오를 구현할 수 있습니다

결국 코덱스의 차별화 지점은 "second opinion"으로 좁아집니다. 다른 시각의 코드 분석, 다른 모델의 PR 리뷰가 필요할 때 보조로 쓰는 도구로 포지셔닝됩니다.

5. 도구를 늘리되 주축은 지켜야 합니다

여러 AI 도구를 동시에 쓰는 건 좋습니다. 하지만 메인 도구를 바꾸는 건 학습 비용, 워크플로우 재구성, 팀 동기화 비용을 동반합니다. "코덱스가 더 좋다고 들어서 옮겼더니 한 달 동안 적응만 하다 끝났다"는 사례가 드물지 않습니다.

지금 권장되는 구도는 명확합니다.

메인: Claude Code (일상 개발 + 리팩토링 + 자동화 파이프라인)
보조: Codex (특정 PR 리뷰, 가끔 다른 시각의 코드 분석)

권장 활용 시나리오

코덱스를 보조로 쓸 때 가장 효율적인 시점을 정리하면 다음과 같습니다.

상황	도구	이유
일상적인 코드 작성	Claude Code	응답 속도, 친숙한 UX
대규모 리팩토링	Claude Code	코드베이스 이해도
자동화 파이프라인 구축	Claude Code	MCP, 서브에이전트, 스킬
중요 PR의 second opinion	Codex Cloud	다른 모델 시각의 리뷰
가끔의 plan 검토	Codex `/plan`	다른 설계 시각
격리된 백엔드 잡 자동 실행	둘 다 가능	환경에 맞춰 선택

이 구도가 현재 가장 합리적입니다.

결론

Codex CLI는 잘 만든 도구입니다. Plan 모드, Review 모드, Codex Cloud 모두 충분히 매력적입니다. 하지만 메인 도구를 바꿔야 할 만큼은 아닙니다.

Claude Code의 일시적 품질 저하 이슈는 이미 복구됐고, 응답 속도와 생태계는 여전히 한 단계 앞서 있습니다. 코덱스로 갈아타려는 시간을 Claude Code를 더 잘 쓰는 데 투자하는 쪽이 훨씬 더 큰 생산성 향상을 가져옵니다.

도구를 늘리되 주축은 흔들지 마세요. 그게 생산성을 지키는 가장 빠른 길입니다.

Schedule a One-Off Cloud Job in Claude Code with /schedule

정상록 — Sat, 25 Apr 2026 04:54:05 +0000

Schedule a One-Off Cloud Job in Claude Code with `/schedule`

There are two flavors of automation tools: alarm clocks that ring every day at the same time (cron), and parcel pickup reservations that fire exactly once. Both are useful. The trick is having the right tool for each.

Anthropic's Claude Code Routines, launched on April 14, 2026, recently added the second flavor under the name "Schedule a One-Off Run." It fires a routine once at a specific future timestamp, then auto-disables.

What's a Routine?

A routine is a saved Claude Code configuration — a self-contained prompt plus one or more repositories plus connectors — that runs autonomously on Anthropic's cloud. Your laptop can be closed and it still fires.

A routine packages five things:

Prompt — the self-contained instructions Claude receives at every fire
Repositories — cloned at the start of each run from default branch
Environment — network access, env vars, setup script
Connectors — Slack, Linear, Google Drive (MCP)
Triggers — schedule, API (/fire endpoint), or GitHub events

There are three trigger types: Schedule, API, and GitHub events. The schedule trigger has two flavors — recurring (cron-like) and one-off (today's topic).

How One-Off Runs Work

A one-off schedule fires the routine a single time at a specific timestamp.
— Anthropic docs

After firing, the routine is marked "Ran" in the web UI. To run it again, you have to edit the routine and set a new one-off time, or create a brand new routine.

Timezones are handled for you. You enter a local time, it gets stored as UTC, and Anthropic guarantees wall-clock execution regardless of where the underlying infra lives.

Using `/schedule` (CLI)

Inside a Claude Code session:

/schedule in 2 weeks, open a cleanup PR that removes the SIGNUP_V2_ENABLED flag

/schedule tomorrow at 9am, summarize yesterday's merged PRs

/schedule next Monday at 3pm KST, run smoke tests on production

Claude converts the natural-language timestamp into an absolute one and asks for confirmation:

Scheduled to fire at: 2026-05-09T15:00:00+09:00 (2026-05-09 06:00 UTC)
Repositories: my-app
Prompt summary: Remove SIGNUP_V2_ENABLED feature flag and open cleanup PR
Confirm? [y/N]

If GitHub isn't connected yet, you'll be guided through /web-setup. Adding API or GitHub-event triggers requires the web UI (claude.ai/code/routines).

The Quota Detail That Matters

Recurring routines hit a daily firing cap. One-off runs do not count against that cap. They consume normal plan usage (Pro/Max/Team/Enterprise) just like an interactive session, but they don't deplete the routine quota.

That means you can be at 100% routine usage and still schedule a cleanup task for 2 weeks from now without it being rejected.

One-Off vs Recurring

	One-Off	Recurring
Fires	Once	Every cadence
Auto-disables after firing	Yes	No
Counts against daily cap	No	Yes
Minimum interval	N/A	1 hour
Custom cron expression	N/A	Yes (5-field via `/schedule update`)

Where One-Off Runs Beat Other Tools

Tool	Where it runs	When session ends	Maintenance burden
Routines One-Off Run	Anthropic cloud	Keeps running	None
Claude `/loop`	Local terminal	Stops immediately	Active session required
Desktop scheduled task	Your machine	Machine must stay on	OS-dependent
Cron daemon	Your server	Server must stay on	You operate it

If you've ever maintained a cron server just to run a once-a-month cleanup, this is the rebuke.

Realistic Use Cases

1. Cleanup after rollout

The canonical use case. Ship a feature behind a flag, then schedule its removal:

/schedule in 2 weeks, open a cleanup PR that removes the SIGNUP_V2_ENABLED flag from frontend and backend repos. Update tests. Branch name: claude/remove-signup-v2-flag.

2. Post-deploy follow-up

/schedule in 1 hour, run smoke tests on production and post the result to #deploy-alerts.

3. Forget-me-not reminder

/schedule next Friday at 4pm, summarize this week's merged PRs and draft a release note.

4. Long-horizon tracking

Decision in a meeting that needs revisiting in 30 days? Drop a routine and forget about it.

Research Preview Caveats

This feature is still in research preview. Before relying on it in production:

Beta header: experimental-cc-routine-2026-04-01. Request shapes, response shapes, rate limits, and token semantics may change. Only the last two versions are kept.
API trigger tokens: shown once at creation. Store them in a password manager or your secrets manager immediately — there's no recovery.
GitHub webhook caps: per-routine and per-account hourly limits. Excess events are dropped.
Branch protection: by default, routines can only push to branches starting with claude/. There's an opt-out (Allow unrestricted branch pushes), but I'd leave it on.
Author attribution: PRs, commits, and Slack messages from a routine are recorded under your user, not a service account. Plan team conventions accordingly.

A Self-Contained Prompt Template

For the cleanup-PR scenario, this is what an actually-runnable prompt looks like:

1. Search the codebase for usages of the feature flag SIGNUP_V2_ENABLED in `frontend/` and `backend/`.
2. Remove all references, treating the flag as permanently enabled.
3. Update unit tests that depended on the previous behavior.
4. Run `npm test` (frontend) and `pytest` (backend). If tests fail, fix and rerun.
5. Open a PR titled "chore: remove SIGNUP_V2_ENABLED feature flag" on branch `claude/remove-signup-v2-flag`.
6. PR body should summarize removed code paths and test results.
7. Add the label "automated-cleanup".

The rule of thumb: write the prompt as if you won't be there to answer follow-up questions.

Wrap-up

One-off runs aren't flashy. They fill the small gap between "do it now" and "set up a cron." But that gap matters in practice — feature-flag cleanups, post-deploy follow-ups, meeting follow-ups all live there.

If you're already using Claude Code, try one this week. Schedule a cleanup PR for 2 weeks from now, then forget about it. When Monday comes around, the PR will be waiting.

Source: code.claude.com/docs/en/routines

DeepSeek V4 is out: 1.6T parameters, 1M context, MIT license — and it costs 1/6 of Opus 4.7

정상록 — Sat, 25 Apr 2026 04:52:48 +0000

DeepSeek released V4 on April 24, 2026 as an open-weight model under MIT license. The headline numbers reset expectations for what frontier-grade AI costs:

V4-Pro: 1.6T total params (49B active), 1M context, $1.74/$3.48 per 1M tokens
V4-Flash: 284B total (13B active), 1M context, $0.14/$0.28 per 1M tokens
License: MIT (full commercial + fine-tuning)
Training hardware: Huawei Ascend 950PR — zero NVIDIA chips
Weights: HuggingFace

For context, Claude Opus 4.7 sits at $15/$75 per 1M tokens. V4-Pro is roughly 22x cheaper at frontier-adjacent quality. That's not an incremental change — it's a category shift.

The pricing reality check

Let me put this in concrete terms. Imagine a SaaS service processing 100M tokens per month (a moderate usage tier for AI features in a B2B product):

Model	Monthly cost (input)	Monthly cost (output, 50M)
Claude Opus 4.7	$1,500	$3,750
GPT-5.5	$500	$1,500
DeepSeek V4-Pro	$174	$174
DeepSeek V4-Flash	$14	$14

That's the API price. With open weights, you can self-host V4-Flash (284B, 13B active) on a single H100 cluster and amortize the cost further. For a startup running a coding assistant or a document analysis SaaS, this means moving from "AI is the dominant infrastructure cost" to "AI is a rounding error."

What's actually new in the architecture

V4 isn't just V3 scaled up. There are two fundamental architectural changes worth understanding.

1. CSA + HCA hybrid attention

V3.2 used DSA (DeepSeek Sparse Attention). V4 introduces a two-level compression:

CSA (Compressed Sparse Attention): groups tokens along the sequence dimension to reduce attention FLOPs
HCA (Heavily Compressed Attention): additionally compresses the token dimension to shrink the KV cache

The result at 1M context length:

Inference FLOPs: 27% of V3.2 (V4-Flash: 10%)
KV cache: 10% of V3.2 (V4-Flash: 7%)

So context length grew 8x (128K → 1M) while memory pressure stayed roughly the same. The implication is that future scaling to 10M context shouldn't break the bank — the architecture is built for it.

2. Manifold-constrained hyper-connections (mHC)

DeepSeek replaced standard residual connections with mHC. The exact mechanism needs more study from the technical report, but the high-level effect is improved information flow stability in deep networks while increasing expressive capacity. This matters for stability when training trillion-parameter models with smaller compute budgets.

Benchmarks

Here's where V4-Pro lands:

Benchmark	V4-Pro	Notes
MMLU	90.1%	~2pp behind Gemini 3.1 Pro
MMLU-Pro	87.5%	Knowledge-intensive gap remains
HumanEval	90.0%	Top-tier coding
SWE-bench Verified	80.6%	Open-source #1
LiveCodeBench	93.5%	Best open model
GPQA Diamond	90.1%	Behind Gemini on hard science
GDPval (professional)	1554 pts	Open-source #1, overall #6

DeepSeek's own technical report says V4 is "narrowly behind" GPT-5.4 and Gemini 3.1 Pro. They estimate the gap with frontier closed-source at 3-6 months. For coding workloads specifically, V4 is at parity or slightly ahead.

The Huawei angle (this is the bigger story)

V4 was trained entirely on Huawei Ascend 950PR — the first frontier-grade open model trained without any NVIDIA hardware. Last year, Jensen Huang publicly described this scenario as a "disaster." It's now reality.

US export controls on AI chips were designed to slow down Chinese AI development. V4 demonstrates that the constraints have been routed around, at least at the model training level. For anyone building infrastructure decisions, this means:

NVIDIA's effective monopoly on AI training is over at the prosumer/enterprise tier
Chinese AI ecosystem is now fully self-sufficient for frontier model development
Geopolitical assumptions in your AI roadmap need updating if you assumed China would lag 2-3 years

Practical takeaways

If you're building with LLMs in 2026, here's what V4 changes:

For API consumers

Re-benchmark your stack: Run V4-Flash against your current Claude/GPT workload. For coding, document analysis, and summarization tasks, you'll likely find quality is acceptable at 1/50 the cost.
Consider hybrid routing: V4-Flash for high-volume routine tasks, Claude/GPT for the hardest reasoning. This routing pattern can cut total AI costs by 80%+.
Re-evaluate margin economics: If your SaaS product was barely viable due to AI costs, V4 might unlock previously infeasible business models.

For sovereign AI / regulated industries

Open weights + MIT license means:

Download V4-Flash, deploy on-prem
Never send sensitive data to external APIs
Comply with GDPR / HIPAA / industry-specific data residency rules
Fine-tune on proprietary data without licensing constraints

This is a meaningful unlock for finance, healthcare, legal, and government use cases that have been blocked from frontier AI by data residency requirements.

For the 1M context window

You can now load entire codebases, multi-year contract sets, or full project documentation into a single context. This doesn't kill RAG entirely (it's still more efficient at scale), but it does eliminate the need for RAG infrastructure for many lightweight document analysis tasks.

A practical example: instead of building a RAG system over your company's design docs, you can now load all docs (within 1M tokens) into context and ask questions directly. For internal tools, this is often simpler and produces better results.

For fine-tuning

MIT license + 1.6T base parameters opens the door to:

Domain-specific models (legal, medical, financial)
Language-specific models (e.g., Korean, Japanese, Spanish optimization)
Task-specific specialists (e.g., contract review, code review agents)

V3.2 fine-tunes have already shown IMO gold medal-level math performance via fine-tuning alone. V4 raises the ceiling further.

Closing thought

The AI market shifted yesterday. The closed-source moat narrative — "frontier models will always be 12-18 months ahead of open source" — is now harder to defend. DeepSeek's own admission of a 3-6 month gap, combined with MIT licensing and 1/6 pricing, means the practical advantage of using closed-source frontier models has shrunk to specific use cases.

For most builders, V4 is now the default open option. For sovereign AI deployments, it might be the default option, period.

References:

DeepSeek V4-Pro: 1.6T MIT-Licensed Open Weights at 88% Less Than Opus 4.7

정상록 — Sat, 25 Apr 2026 04:52:20 +0000

While OpenAI builds a desktop super-app, Anthropic doubles down on enterprise lock-in with Opus 4.7, and Google pushes Gemini 3.1-Pro paid tier, DeepSeek went the exact opposite direction. On April 24, 2026, they released V4-Pro and V4-Flash on Hugging Face under MIT license. 1.6 trillion parameters. Currently the largest open-weight model ever published.

If you've been watching the open-source vs frontier gap close in slow motion since Llama 405B, this is the moment that gap collapsed for coding workloads.

The Numbers That Matter

Total parameters:   1.6T
Active per token:   49B (MoE)
Context window:     1M tokens (native, not retrofitted)
Pretrain corpus:    33T tokens
License:            MIT (commercial use OK)

Pricing vs Big 3

Model	Input ($/M)	Output ($/M)
V4-Pro	$1.74	$3.48
GPT-5.5	~$10	$30
Claude Opus 4.7	~$15	$25
V4-Flash	$0.14	$0.28

V4-Pro output is 88% cheaper than Opus 4.7. V4-Flash input is roughly 1.4% the cost of GPT-5.5 input. Fortune's startup desk literally wrote that this "embarrasses Western AI labs' pricing pages."

This isn't a loss-leader. It's structural — and that's the engineering story worth your time.

How They Got There — Hybrid Attention

The headline architectural change is CSA + HCA interleaved attention (Compressed Sparse Attention + Heavily Compressed Attention). Two attention variants stacked in alternating layers.

The result at 1M context, vs DeepSeek V3.2:

Single-token inference FLOPs: 27% of V3.2
KV cache memory: 10% of V3.2

V4-Flash is even more aggressive: 10% FLOPs, 7% KV cache.

For anyone who has tried to run long-context production workloads, this is the difference between "interesting demo" and "we can actually serve this." The bottleneck for 1M context has always been KV cache memory linearly blowing up. Compress that to 1/10, and a single GPU can serve sessions that previously needed entire boxes.

Add to that:

FP4 quantization-aware training (precision baked in from pretraining, not bolted on after)
A new optimizer (details in the partial tech report)
Reworked residual connections

The architecture doesn't just scale — it scales efficiently.

Where V4-Pro Wins

Coding (#1 across the board):

LiveCodeBench:   93.5%   (vs Gemini 91.7%, Claude 88.8%)
Codeforces:      3206    (#1)
BrowseComp:      83.4%   (vs Opus 4.7 at 79.3%)
Terminal-Bench:  67.9%   (close to Opus 4.7's 69.4%)

If your stack is heavy on code generation, code review, or autonomous coding agents, V4-Pro is now the cost-performance leader by a wide margin.

Where V4-Pro Loses

Knowledge reasoning and complex agent workflows still belong to Big 3:

GPQA Diamond:     90.1%   (Opus 4.7: 94.2%)
SWE-bench Pro:    55.4%   (Opus 4.7: 64.3%)
HLE:              behind frontier

The honest summary: coding is the blade, knowledge reasoning is still catching up.

What This Means for Self-Hosting

Three concrete changes:

1. Data-sovereign workloads finally have a frontier-class option

If your org can't send data to external APIs (healthcare, finance, public sector, legal), Llama 405B was the previous best self-host coding option — and it lagged frontier by a meaningful margin. V4-Pro closes that gap on the workloads where it matters most.

2. Token-cost-sensitive products can rebuild on V4-Flash

If you're a SaaS startup paying $X0K/month for Haiku or 4o-mini at scale, V4-Flash at $0.14/M input is roughly 1/10th the cost. Self-hosted, the marginal cost approaches zero.

3. The hardware story is about to get interesting

DeepSeek hinted at Huawei Ascend 950 integration. If that lands, the implication is "frontier-class model on non-NVIDIA silicon at lower TCO" — which would be the first credible break in NVIDIA dependency for inference.

Try It

# Pull from Hugging Face
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro

# Or use the API
curl https://api.deepseek.com/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Open Questions

Will the GA release close the GPQA / SWE-bench gap with Opus 4.7?
How quickly will domain-specific fine-tunes outperform Big 3 in verticals?
Does this pressure OpenAI / Anthropic to release older model weights?

If you're running self-host benchmarks on actual workloads, drop your numbers in the comments. Curious what the production-side picture looks like.

Sources

DeepSeek API Docs: https://api-docs.deepseek.com/news/news260424
VentureBeat: https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5
TechCrunch: https://techcrunch.com/2026/04/24/deepseek-previews-new-ai-model-that-closes-the-gap-with-frontier-models/
Simon Willison: https://simonwillison.net/2026/Apr/24/deepseek-v4/
MarkTechPost (architecture deep-dive): https://www.marktechpost.com/2026/04/24/deepseek-ai-releases-deepseek-v4-compressed-sparse-attention-and-heavily-compressed-attention-enable-one-million-token-contexts/

Claude Managed Agents now have Built-in Memory — Rakuten cut first-try errors by 97%

정상록 — Sat, 25 Apr 2026 04:46:27 +0000

TL;DR

Anthropic shipped Memory Stores for Claude Managed Agents on 2026-04-23 (Public Beta). Instead of standing up your own RAG stack, memory is mounted as a filesystem directory at /mnt/memory/{store-name} and the agent uses standard Read/Write/Bash tools.

Production metrics so far:

Rakuten: 97% fewer first-try errors, 27% lower cost, 34% lower latency
Wisedocs: 30% faster document validation
Netflix: eliminated manual prompt/skill updates
Ando: stopped building memory infra entirely

The problem before

Managed Agents reset every session. Anything an agent learned about a user, a project, or a workflow vanished when the conversation ended. To bridge the gap, teams had to build:

A vector database
An embedding pipeline
A retrieval API
Custom audit logs
Custom permission scoping
Custom concurrency controls

That's a lot of yak shaving for what is essentially "remember what you learned yesterday".

What Memory Stores actually is

A Memory Store is a workspace-scoped collection of text documents, mounted into the agent's session container as a directory. The agent doesn't learn a new memory API — it just uses the filesystem.

Three concepts:

Concept	Description
Memory Store	Container holding text documents
Memory	Path-addressable text file (max 100 KB / ~25K tokens)
Memory Version	Immutable snapshot created on every change

The mount path, access mode, description, and instructions are auto-injected into the system prompt. No prompting tricks needed.

Quickstart (Python SDK)

1. Set the beta header

If you use the SDK, this is automatic:

managed-agents-2026-04-01

2. Create a store

store = client.beta.memory_stores.create(
    name="User Preferences",
    description="Per-user preferences and project context.",
)

3. (Optional) Seed with reference content

client.beta.memory_stores.memories.create(
    store.id,
    path="/formatting_standards.md",
    content="All reports use GAAP formatting...",
)

4. Attach to a session

session = client.beta.sessions.create(
    agent=agent.id,
    environment_id=environment.id,
    resources=[
        {
            "type": "memory_store",
            "memory_store_id": store.id,
            "access": "read_write",  # or "read_only"
            "instructions": "User preferences and project context. Check before starting any task.",
        }
    ],
)

Critical constraint: stores can only be attached at session creation time. No runtime add/remove.

5. Agent-side usage

The agent just runs:

ls /mnt/memory/user-preferences/
cat /mnt/memory/user-preferences/formatting_standards.md

The prompt-injection footgun

Default access is read_write. If your agent processes untrusted input — user prompts, web content, third-party tool output — malicious content can land in memory and the next session reads it as trusted reference material.

This is the kind of bug that doesn't show up in your dev environment.

Mitigation rule of thumb: any store the agent doesn't strictly need to write to should be read_only. Examples:

Standards / conventions
Domain glossaries
Shared lookup data
External reference docs

Concurrency: optimistic locking built in

Multiple agents can hit the same store simultaneously. To avoid lost updates, use content_sha256 as a precondition:

client.beta.memory_stores.memories.update(
    memory_id=mem.id,
    memory_store_id=store.id,
    content="CORRECTED: " + new_content,
    precondition={
        "type": "content_sha256",
        "content_sha256": mem.content_sha256,
    },
)

If another session updated first, the call fails with a hash mismatch — re-read and retry.

Audit and rollback

Every change creates an immutable version, retained for 30 days.

versions = client.beta.memory_stores.memory_versions.list(
    store.id,
    memory_id=mem.id,
)

old = client.beta.memory_stores.memory_versions.retrieve(
    version_id=versions.data[1].id,
    memory_store_id=store.id,
)

# Restore by writing the old content back via memories.update()

There's no dedicated restore endpoint — you update() with the old content. PII can be scrubbed via memory_versions.redact() while keeping the audit trail.

Memory Stores vs traditional RAG

Aspect	Traditional RAG	Memory Stores
Access	Separate search API	Filesystem mount
Learning curve	Embeddings, indexing, retrieval queries	bash/Read/Write
Infra	Vector DB + embedding model	Managed by Claude Platform
Audit	DIY	Built-in immutable versions
Permissions	DIY	scoped (read_only/read_write)
Concurrency	DIY locking	`content_sha256` precondition

Beta limits

Limit	Value
Memory stores per organization	1,000
Memories per store	2,000
Storage per store	100 MB
Versions per store	250,000
Size per memory	100 KB
Version retention	30 days
Stores per session	8
`instructions` field	4,096 chars

The 30-day version retention is the one to plan around. If you need longer audit trails, export via the API.

Closing thought

"Memory lets us stop building memory infra and focus on the product itself" — Sara Du, Founder, Ando

Persistent learning is now a platform primitive, not a side project. Two-digit operational wins (Rakuten 97%, Wisedocs 30%, Netflix's eliminated manual updates) suggest this isn't marginal.

Worth integrating, especially for any agent that already has the same conversation with the same user every day.

Source: Claude Managed Agents — Memory (official docs)
Launch blog: Anthropic — Claude Managed Agents Memory (2026-04-23)