Forem: Jun seo

How to Actually Design an AI Agent: Tools and the Starting Loop (Part 2)

Jun seo — Tue, 19 May 2026 17:57:01 +0000

TL;DR

The model matters, but tools matter at least as much. Weak tool descriptions are one of the easiest agent failures to diagnose, and one of the most common.

Design the tools before the agent. If you cannot answer "what can this agent do that a general LLM cannot on its own?", you do not have an agent yet.

Ship a small, focused loop with 2-3 well-described tools and a hard iteration cap. Watch traces. Iterate.

In Part 1 I laid out the four levels of AI agents I keep seeing in production, and argued that most shipped "AI agents" are stuck on the lower rungs. This post is about how to build one that is not.

One assumption underneath everything below:

Tools are the center of an agent. Not the system prompt.

Most tutorials start with "let's write the system prompt." Wrong starting point. Start with tools.

Start With Tool Design

A weak tool description looks like this:

search_documents:
  description: "Search documents."

The model has no idea when to use it, what to put in query, or what the output means. So it guesses. Badly.

A good description looks closer to this:

search_documents:
  description: |
    Use this tool when the user's question requires evidence from
    internal documents, policies, or technical references.

    Do NOT pass the full user question as the query. Extract the
    core concepts and keywords. If the first result set is weak,
    rewrite the query and search again before answering.

    Cite the retrieved documents as evidence in your final answer.
  parameters:
    query:
      type: string
      description: "2-6 keywords, not a full sentence."

Same tool. Completely different agent behavior.

The good description does three jobs the bad one skips:

When to use it — a trigger condition, not just a name.
How to call it — concrete shape of arguments, with anti-patterns called out.
What to do with the result — how the output feeds back into the final answer.
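
As a rough sketch of how those three jobs can sit next to the tool's schema, here is the same search_documents tool written as a plain Python dictionary, plus a guard that enforces the "2-6 keywords" rule before the call reaches the backend. The dictionary layout and the `check_query` helper are assumptions for illustration, not any particular framework's API.

```python
from typing import Optional

# Hypothetical tool spec; the layout is illustrative, not a specific framework's format.
SEARCH_DOCUMENTS = {
    "name": "search_documents",
    "description": (
        # When to use it
        "Use this tool when the user's question requires evidence from "
        "internal documents, policies, or technical references. "
        # How to call it
        "Do NOT pass the full user question as the query; extract the core "
        "concepts and keywords. If the first result set is weak, rewrite the "
        "query and search again before answering. "
        # What to do with the result
        "Cite the retrieved documents as evidence in your final answer."
    ),
    "parameters": {
        "query": {"type": "string", "description": "2-6 keywords, not a full sentence."},
    },
}


def check_query(query: str) -> Optional[str]:
    """Return an error message the model can read, or None if the query is fine."""
    words = query.split()
    if not 2 <= len(words) <= 6:
        return f"query must be 2-6 keywords, got {len(words)} words; extract keywords and retry"
    if query.rstrip().endswith("?"):
        return "query looks like a question; pass keywords instead"
    return None
```

Returning the rejection as text instead of raising means the model sees why the call failed and can rewrite the query on the next turn.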

Claude's Skills system is worth studying for the shape, not the brand. A Skill packages task-specific instructions, files, scripts, and example workflows behind a short trigger description. The heavy content only loads when the agent decides the Skill is relevant. The pattern has a name: progressive disclosure. You can reproduce it in any framework that lets you gate long instructions behind a short surface.
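
A minimal sketch of progressive disclosure, assuming nothing about how Skills are implemented internally: only the name and trigger line go into the prompt, and the long body is fetched the first time the agent asks for it. The `Skill` class, `prompt_surface`, and `open_skill` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    trigger: str                   # short description, always in the prompt
    load_body: Callable[[], str]   # heavy instructions, fetched only on demand


def prompt_surface(skills: list[Skill]) -> str:
    """Only names and trigger lines go into the system prompt."""
    return "\n".join(f"- {s.name}: {s.trigger}" for s in skills)


def open_skill(skills: list[Skill], name: str) -> str:
    """Called once the agent decides a skill is relevant."""
    for s in skills:
        if s.name == name:
            return s.load_body()
    raise KeyError(f"unknown skill: {name}")


# Illustrative usage: track when the heavy body actually gets loaded.
loads = []

def _refund_body() -> str:
    loads.append("refunds")
    return "Step 1: look up the order. Step 2: check shipping status."

skills = [Skill("refunds", "Use when the user asks to refund or cancel an order.", _refund_body)]
```

Building the prompt surface never touches `load_body`, so the cost of a long Skill is only paid on the turns that need it.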

Two tips that have saved me more time than anything else in this area.

1. When you're stuck, let skill-creator rewrite your prompt.

Anthropic ships an official Skill called skill-creator whose job is to make and improve other Skills. Its SKILL.md is, almost accidentally, the best prompt-design tutorial I have read. The patterns it pushes are exactly the ones that hold up under load: explain why a rule matters instead of writing rigid ALWAYS/NEVER directives; design for the smart model you actually have, not the rote one you imagine; generalize past your test cases rather than overfit to them; cut anything not pulling its weight.

What I do now whenever I have to write a non-trivial agent system prompt: write a first draft myself, then ask Claude to rewrite it "following the skill-creator SKILL.md guidelines." The result has beaten my draft every single time. Sometimes embarrassingly so.

2. Read how Claude Code itself is built.

Anthropic's Claude Code documentation walks through how their own coding agent is wired — system prompt shape, tool surfaces, subagent boundaries, context management, the whole stack. If you have only ever read agent tutorials, reading the docs for a real production Level 4 agent is the cheapest level-up I know.

Practical translation: do not dump every possible instruction into the system prompt. Expose short names and descriptions. Load the deep stuff on demand.

A scar from one of my own v1s. We added six tools, expecting the agent to compose them in interesting ways. It did not. It picked the wrong one, called it with the user's entire question as the query, and kept looping. The tool name was the lie. It promised "search", but the implementation could only handle keywords, and nothing in its description said so. We cut from six tools to two, rewrote the descriptions, and the loop stopped.

The bug was not reasoning. The bug was that the tool name lied.

Design the Tools Before the Agent

If you only remember one thing from this post: design tools first, agent second.

Before writing a single line, ask:

What can this agent do that GPT, Claude, or Gemini cannot do on their own?
What external data does it need access to?
What real actions should it be allowed to take?
What workflow is it automating end-to-end?

If the honest answer is "it talks nicely," the user has no reason to use it. General-purpose LLMs already talk nicely. The differentiation comes from tools that connect to your system:

Query the company database
Modify project code
Search internal policy documents
Look up a customer's order history
Create a Jira ticket
Summarize a Slack thread into action items
Run a domain-specific validation script

Tools are how an agent becomes useful in domains where a general LLM cannot act. They are also the surface where security, permissions, and auditing actually live — which is another reason "we'll figure out tools later" is the wrong order.
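
To make the "permissions and auditing live at the tool surface" point concrete, here is a small sketch of a wrapper that checks a scope and records every call before the tool body runs. The scope strings, the `guarded` helper, and the ticket tool are all made up for illustration; a real system would plug in its own auth and logging.

```python
from typing import Callable

AUDIT_LOG: list[dict] = []


def guarded(tool_name: str, required_scope: str, fn: Callable[..., str]) -> Callable[..., str]:
    """Wrap a tool so every call is permission-checked and recorded."""
    def call(user_scopes: set, **kwargs) -> str:
        allowed = required_scope in user_scopes
        AUDIT_LOG.append({"tool": tool_name, "args": kwargs, "allowed": allowed})
        if not allowed:
            # Returned as text so the agent loop can show it to the model.
            return f"ERROR: {tool_name} requires scope '{required_scope}'"
        return fn(**kwargs)
    return call


# Hypothetical tool body standing in for a real ticketing integration.
def _create_ticket(title: str) -> str:
    return f"created ticket: {title}"


create_ticket = guarded("create_ticket", "tickets:write", _create_ticket)
```

Because the check wraps the tool rather than the prompt, it holds no matter what the model decides to call.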

A Reasonable Starting Architecture

If I were building a v1 today:

User
  ↓
Root agent (system prompt + context manager)
  ↓
Planner / agent loop
  ↓
Tool selector → Tool execution → Observation
  ↓
(loop until done or iteration cap)
  ↓
Final response

A good v1 has:

A focused system prompt — describes the agent's job, not its biography.
Conversation context with a sliding window — keep head + tail, compress the middle.
2 or 3 well-described tools, not 20. Each one earns its slot.
An agent loop with a hard iteration cap. 10 is a good default; tune from traces.
Tool results fed back into the next model call, with clear delimiters.
Trace logging from day one. You will need it.

That's it. Ship that, watch traces, then improve.
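
The list above fits in a few dozen lines. This is a sketch under stated assumptions: `decide` stands in for the model call and returns a small action dictionary, the `<tool_result>` delimiter format is my own choice, and the scripted model and search result are invented stubs.

```python
from typing import Callable

MAX_ITERATIONS = 10  # the default suggested above; tune from traces


def run_agent(goal: str,
              decide: Callable[[list], dict],   # stand-in for the model call
              tools: dict,
              max_iterations: int = MAX_ITERATIONS):
    """Return (final_answer, trace). Stops at the iteration cap no matter what."""
    context = [f"<goal>{goal}</goal>"]
    trace = []
    for step in range(1, max_iterations + 1):
        action = decide(context)
        if action["type"] == "final":
            trace.append({"step": step, "final": action["text"]})
            return action["text"], trace
        name = action["tool"]
        result = tools[name](**action["args"])
        trace.append({"step": step, "tool": name, "args": action["args"], "result": result})
        # Feed the observation into the next model call with clear delimiters.
        context.append(f'<tool_result name="{name}">{result}</tool_result>')
    return "Stopped: iteration cap reached.", trace


# Hypothetical scripted "model": search once, then answer.
def scripted_decide(context: list) -> dict:
    if len(context) == 1:
        return {"type": "tool", "tool": "search_documents", "args": {"query": "refund policy"}}
    return {"type": "final", "text": "Refunds are allowed within 30 days."}


tools = {"search_documents": lambda query: "doc-12: refunds within 30 days"}
answer, trace = run_agent("What is the refund policy?", scripted_decide, tools)
```

The trace is a plain list of dictionaries on purpose: it is the thing you will read when the agent misbehaves, so it should be trivially printable.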

The traps I see most often in v1s:

Too many tools. The model wastes turns choosing between near-duplicates. Merge or delete.
No iteration cap. One bad tool call and the agent burns your budget in a loop.
Tool errors swallowed silently. The model retries blindly because it never saw the error. Always surface error messages back into the loop.
System prompt growing every time something breaks. Each new instruction makes the previous ones less salient. Fix the tool description instead.
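
For the swallowed-errors trap specifically, one possible shape is a thin wrapper that turns every exception into text the model can read. The `call_tool` helper and the `flaky_lookup` tool are illustrative, not from any library.

```python
from typing import Callable


def call_tool(fn: Callable[..., str], **kwargs) -> str:
    """Run a tool and always return text the model can see, including failures."""
    try:
        return f"OK: {fn(**kwargs)}"
    except Exception as exc:  # broad on purpose: the loop should never lose an error
        return f"ERROR: {type(exc).__name__}: {exc}"


# Hypothetical tool that fails on bad input.
def flaky_lookup(order_id: str) -> str:
    if not order_id.isdigit():
        raise ValueError("order_id must be numeric")
    return f"order {order_id} has not shipped"
```

With this in place, a bad argument produces an observation like "ERROR: ValueError: order_id must be numeric", which gives the model something to correct instead of something to retry blindly.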

What "Done" Looks Like

You'll know you've built a Level 4 agent when:

A user describes a goal, not a step. ("Cancel the order I placed yesterday if it hasn't shipped.")
The agent chooses which tools to call, in what order, without you hard-coding the path.
It recovers from at least one bad tool result without giving up.
The trace shows decisions, not a script. Goals broken into steps. A look at the actual result after each tool call. An explicit "this is done." Not a fixed sequence in JSON costume.

If you cannot yet point at a trace that satisfies all four, keep going. That is the gap worth closing.

If you missed it, Part 1 covers the 4-level taxonomy and why most "AI agents" you encounter in real products are stuck at Level 1 or 2.

Tell me what your current v1 looks like — especially the tool list. Most of the interesting design happens there, and most of the bugs do too.

The 4 Levels of AI Agents: Why Most Service AIs Still Feel Dumb (Part 1)

Jun seo — Tue, 19 May 2026 17:39:05 +0000

TL;DR

AI agents in real products fall into 4 levels: LLM wrapper → intent classifier → context-aware → agent loop.

Most "AI agents" you meet in production are stuck at level 1 or 2, which is why they feel dumb on top of very smart models.

The gap between levels is rarely the model. It's context management and the agent loop. Part 2 covers how to climb the levels — this post is about what the levels are and why most products stall.

Last month I tried to cancel an order through a SaaS company's shiny new "AI agent."

Me: "Cancel my latest order if it hasn't shipped yet."
Agent: "Here is our refund policy: [link]. Is there anything else I can help with?"

Then I asked a customer-support bot a follow-up referencing my previous message. It had already forgotten.

We are in the middle of an AI boom. Frontier models can write production code and reason through multi-step problems. And yet the AI agents shipped inside real products often feel like 2018-era chatbots in a new costume.

Here is the part I find funny. After two messages with one of these things, you can usually guess the architecture from the outside. The second message is almost always the one that gives it away.

Why? Because most of what people call an "AI agent" is not one. It is an LLM API call wearing the word as a marketing label.

A note before we start: classifications of agents already exist — Anthropic's workflows vs. agents, Harrison Chase's cognitive architectures, Lilian Weng's planning / memory / tool use decomposition, and a handful of "SAE-style autonomy levels" posts. The four levels below are not new theory. They are the four shapes I keep seeing when I poke at real production agents from the outside.

Level 1 — The LLM API Chatbot

The most basic form.

User input → LLM API call → Response

System prompt, maybe. No tools. No memory. No retrieval. No state.

I am honestly not sure this should be called an agent at all. It is an LLM API wrapper with a friendly UI on top. But at the product level, plenty of teams still call this "our AI agent."

It can handle FAQ-style questions. The moment a user says something like:

"Can you do the same thing with the settings I mentioned earlier?"

…or:

"Check my latest order and cancel it if it hasn't shipped yet."

…the seams show immediately. Nothing is connected to anything. The model is guessing.

This is the level most "AI features bolted onto an existing SaaS" sit at. And it is exactly why users walk away thinking AI is overhyped.

Level 2 — Intent Classification Agent

Probably the most common level in production today.

User input
  → Intent classification
  → Intent-specific handler
  → Response

For a customer-support bot the intents might look like:

Refund request
Shipping question
Payment issue
Account problem
Escalate to human

Within a tightly scoped domain this works surprisingly well. If you know your user requests fall into a small number of buckets, intent classification is cheap, fast, and easy to monitor.

The weakness shows up the moment users do what users actually do: combine intents.

"I want to cancel the thing I paid for yesterday, but I think it may have already shipped."

That is payment, cancellation, refund, and shipping in one sentence. A classic single-intent classifier routes this to one handler and ignores the rest. The user gets half an answer and gives up.
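
A toy version of that failure, assuming a keyword matcher in place of a real classifier (the keyword table and function names are invented for this sketch). The message hits several intents, but a first-match router keeps only one.

```python
# Hypothetical keyword table; a production classifier would be a model, not substrings.
INTENT_KEYWORDS = {
    "refund": ["refund", "money back"],
    "shipping": ["shipped", "shipping", "delivery"],
    "payment": ["paid", "payment", "charged"],
    "cancellation": ["cancel"],
}


def matched_intents(message: str) -> list:
    text = message.lower()
    return [intent for intent, words in INTENT_KEYWORDS.items()
            if any(w in text for w in words)]


def route_single_intent(message: str) -> str:
    """A classic single-intent router: first match wins, the rest are dropped."""
    hits = matched_intents(message)
    return hits[0] if hits else "escalate_to_human"


msg = "I want to cancel the thing I paid for yesterday, but I think it may have already shipped."
```

Here `matched_intents(msg)` finds shipping, payment, and cancellation, while `route_single_intent(msg)` hands the whole message to the shipping handler. The implied refund never matches a keyword at all, which is the "only as good as the intents you predefined" ceiling in miniature.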

Modern multi-intent classifiers help. They do not fix the ceiling: the agent is only as good as the intents you predefined. Anything outside the schema falls off a cliff.

If you ask Claude or Codex to "build me an AI agent," there is a good chance you will end up here. It is a fine starting point. It is not a high-level agent.

Level 3 — Context-Aware Agent

Level 3 is the first level where the user stops feeling like they are filling out a form.

User input
  + Previous conversation context
  + Stored facts / preferences
  → Reasoning
  → Response

By "context" here I mean conversational memory across turns. What the user said two messages ago. Their stated preferences. The entities they already referred to. Not the runtime working context an agent loop carries between tool calls. (That shows up in Level 4.)

The agent maintains state across the conversation. "Use the option I mentioned earlier" actually works. The user does not have to repeat themselves every turn.

The prompt is the easy part. Context management is where it gets ugly. LLMs do not have infinite memory, and naively stuffing the full history into every call breaks cost and quality at the same time.

The usual strategies:

Keep the most recent N messages verbatim
Summarize older messages into a compact form
Extract durable facts ("user prefers email over Slack", "ordered SKU 1042") and store them separately
Slide the context window: keep the head (system + key facts) and the tail (recent turns), compress the middle
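
The last strategy can be sketched in a few lines. This assumes messages are plain strings and uses a placeholder summarizer; a real system would call a summarization model where the placeholder is. `build_context` is a hypothetical helper.

```python
def build_context(system: str, facts: list, turns: list,
                  keep_recent: int = 4, summarize=None) -> list:
    """Head (system + durable facts) and tail (recent turns) stay verbatim;
    everything in the middle is compressed into one summary entry."""
    if summarize is None:
        # Placeholder only; swap in a real summarization call.
        summarize = lambda old: f"[summary of {len(old)} earlier messages]"
    head = [system] + [f"fact: {f}" for f in facts]
    if len(turns) <= keep_recent:
        return head + list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return head + [summarize(older)] + recent
```

For example, with ten turns and `keep_recent=3`, the call sends the system prompt, the stored facts, one summary line covering seven messages, and the last three turns verbatim.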

A Level 3 agent feels like it remembers you. A Level 2 agent feels like every message is its first day on the job.

Level 4 — Agent Loop

This is where an agent becomes an actual agent.

The model does not just generate a reply. It decides what action to take, executes a tool, observes the result, and decides again — until the task is done or a budget is hit.

Say the user asks:

"Find and fix the login bug in this project."

A Level 1 chatbot guesses. A Level 4 agent does something like:

1. Inspect the project structure
2. Search for login-related files
3. Read the relevant code
4. Check error logs or failing tests
5. Hypothesize the root cause
6. Edit the code
7. Run tests
8. If tests fail, loop back to step 3
9. Report

At this point the agent is no longer answering questions. It is doing work.

The part nobody warns you about when you start building one: the model matters, but tools matter at least as much. A great model with badly designed tools will pick the wrong one, pass the wrong arguments, or loop forever. A merely-good model with well-designed tools punches far above its weight.

That is the topic of Part 2.

A Note on the "Levels"

These are not strict maturity levels. Memory (Level 3) and the agent loop (Level 4) are independent axes, not stacked floors. A stateless coding agent can run a strong Level 4 loop inside a single task with almost no conversational memory. A customer support assistant can have rich user memory and no autonomous loop at all. I am using "levels" as shorthand for product behavior the user perceives, not a formal architecture ladder where each rung technically depends on the one below.

If you take only the ladder away from this post, you took the wrong thing. Take the four shapes.

Why Most Products Stall at Level 1–2

If Level 4 is so clearly better, why are most shipped "AI agents" stuck on the bottom two rungs?

A few honest reasons:

Level 1 is one weekend of work. A system prompt and a chat.completions call. Demos beautifully. Falls over the moment a real user shows up.
Level 2 fits how PMs already think. Intents map cleanly onto support tickets, KPIs, and existing decision trees. The org chart pulls toward Level 2.
Level 3 requires unsexy infra. Context summarization, fact extraction, durable per-user state. None of it is one prompt away. All of it is operationally annoying. (Note: this is conversational memory, not RAG. Retrieval is a separate axis, not a prerequisite.)
Level 4 requires real tools. A loop is meaningless without things to call. Building, scoping, and securing tools that touch your production systems is the part that scares teams — so they ship the chatbot version and call it a day.

The result is a market full of "AI agents" that share the badge but not the behavior. The badge is cheap. The badge has been cheap for two years. The behavior is what users are still waiting for.

Part 2 — "How to actually design a Level 4 agent" — covers tool design, a reasonable starting architecture, and the mistakes that make agents loop forever. Coming next.

If this matched your experience building or using AI agents, I'd love to hear which level your current project is at — and what broke when you tried to climb to the next one.

Why AI Coding Tools Over-engineer Your MVP — And the One Fix

Jun seo — Sat, 16 May 2026 07:01:53 +0000

TL;DR — For reversible, stage-sensitive engineering decisions, AI assistants default to production-grade advice unless you specify business context. This isn't a model intelligence problem you can wait out. It's an objective-function problem you can fix in the next prompt. Below: the mechanism (with appropriate hedging), a before/after example, and a concrete taxonomy of what "context" actually means.

1. A Scene You've Probably Seen

Fifty users. MVP stage. Hypothesis validation is the only thing that matters.

You ask Claude Code to do a security review. You get back:

"Move the database into a separate VPC and use VPC Peering or PrivateLink."
"Wrap every external call in mTLS with automated cert rotation."
"Stream audit logs to a separate AWS account for SOC2 readiness."

None of it is wrong. But if you do all of it at this stage, you'll burn your runway on infra migration before validating whether anyone wants the product.

The same pattern shows up outside security:

Redis cluster with read replicas recommended for a 30 RPS service
Hexagonal architecture proposed for a 100-line script
GitOps + ArgoCD + Terraform module separation recommended for a 3-person team

The usual conclusion follows:
"AI can't do trade-offs. Humans need to decide."

The direction is right. The diagnosis is too vague to act on — so let's narrow it.

2. Scoping the Claim

This article is about a specific class of decisions: reversible and stage-sensitive.

Reversible: you can undo it without losing data, breaking contracts with users, or rewriting half the codebase. (Adding Redis is reversible. Changing your primary key strategy is not.)
Stage-sensitive: the right answer depends on where you are (MVP / growth / scale), not on universal best practice. (Caching layers, auth hardening depth, observability depth, infra topology.)

For irreversible or stage-insensitive decisions (DB engine, public API contracts, auth model, multi-tenancy boundaries, anything touching PII or payments), AI's conservative reflex is closer to right. That's covered in Section 8.

Within scope, the claim is:

AI assistants default to production-grade advice. You can change that, but only by stating your stage, scale, and trade-off weights explicitly.

3. Why the Common Diagnosis Falls Short

"AI only has the codebase as context, so it can't reason about trade-offs."

Three problems with this framing.

(a) Trade-offs are value judgments, not capability tests.
Risk preference, time preference, capital allocation — these are objective-function definitions, not things a model can derive from code alone. Two senior engineers reading the same code reach opposite conclusions. The CTO says "ship," the security lead says "no" — neither is smarter. They optimize different functions. Asking AI to "make the trade-off" without naming the objective is asking it to optimize without a loss function.

(b) You're not unable to give it context. You're choosing not to.
Every major coding assistant has a context-injection slot: CLAUDE.md, .cursorrules, AGENTS.md. Most "AI gave a bad recommendation" stories are at least partially "the user didn't specify context" stories — not all of them (models also hallucinate, miss repo state, or carry generic safety bias), but more than people admit.

(c) Humans over-engineer too.
Senior engineers carry trauma from past outages and pre-armor their code. AI's over-engineering and human over-engineering have different failure modes — AI tends toward boilerplate hardening, humans toward sticky abstractions — but neither is automatically easier to undo. The real axis isn't AI vs. human. It's context-aware vs. context-blind decisions.

4. A Hypothesis: The Production-Mature Prior

Here I have to hedge, because nobody outside the labs knows training mixes for sure. But the publicly visible candidates for what feeds these models are skewed:

High-star GitHub repos (already scaled, already hardened)
Vendor docs (AWS, GCP, k8s) — best-practice prose, not MVP code
High-vote Stack Overflow answers (often "the robust way")
Tech blog post-mortems ("here's what we should have done")

What these share: they document code that survived long enough to need scale, reliability, and compliance vocabulary. MVP-shaped code — single-file Flask apps, a docker-compose.yml with ten environment variables, a single RDS instance, hand-rolled session cookies — exists in training data too, but the advice prose attached to it is rare. The model has read a lot more "you should harden this" than "this is fine for now."

This is a hypothesis about a contributing factor, not a proven mechanism. Outputs also reflect instruction tuning, RLHF, system prompts, and safety policies — any of which can independently push toward caution. But it lines up with the observed behavior, and it's testable: try the before/after in Section 5 yourself.

Security review amplifies the effect. Threat catalogs (OWASP, CVE) are by construction lists of things that went wrong. A model trained heavily on those, when asked "is this secure?" without a threat model, hedges toward more findings — false positives increase when the model can't price the cost of being wrong in either direction.

5. Before / After — Same Model, Different Context

Here is the actual move. Same model (Claude Sonnet 4.6), same review request, same code. The only difference is whether business context was supplied.

Code under review: a 60-line Express endpoint that accepts a JSON payload, looks up a user by email in Postgres, and returns a JWT.

Before — no context

Prompt: "Review the security of this endpoint."

Response (paraphrased, typical shape):

Add rate limiting with Redis token bucket
Move JWT secret to AWS Secrets Manager
Enforce mTLS between service and database
Audit-log every auth attempt to a separate AWS account
Add a WAF in front of the load balancer
Implement refresh token rotation with revocation list
Add CAPTCHA on repeated failures

Seven recommendations. None wrong in absolute terms. All sized for a company past Series A.

After — with context

Add this paragraph to the prompt (or to CLAUDE.md):

Stage: MVP, 50 users, solo dev, $200/month infra budget, 8-month runway.
Threat model in scope: of the OWASP Top 10 (2021), prioritize Injection (A03),
Identification and Authentication Failures (A07, formerly Broken
Authentication), and Cryptographic Failures (A02, formerly Sensitive Data
Exposure). Out of scope:
nation-state attackers, insider threats, denial-of-service.
Anti-goals: do NOT recommend mTLS, WAF, separate AWS accounts, or anything
requiring a dedicated infra hire.

Response (paraphrased, typical shape):

Confirm parameterized queries (looks fine)
Hash passwords with bcrypt cost ≥12 — current code uses cost 8, raise it
Validate JWT signing alg explicitly; reject alg: none
Put the JWT signing secret in an env var; document a rotation procedure (manual is fine at this stage)
Log auth failures to existing application logs; add a counter for "review at 5k MAU"

Five items, all things a solo dev can do this afternoon. Same model, same code.

That's the entire claim of this article, demonstrated. Try it on your own code; the shape of the answer changes.

6. The Reframe

The loose version:

"AI can't do trade-offs. Humans must decide."

The sharper version:

For reversible, stage-sensitive decisions, AI defaults to production-grade advice. The intervention point is supplying business context — stage, scale, trade-off weights, anti-goals.

This framing is more useful than the original because:

It points to an action. The responsibility moves from "AI is limited" to "I haven't told it where I am." Whether or not that's the whole story, it's the part you can fix in the next prompt.
It's durable, but not "permanently true." Future tooling will surely improve at inferring stage from repo shape, asking clarifying questions, and pulling org context from product telemetry. But humans will remain accountable for the objective function, even when they delegate parts of stating it.
It admits scope. It's a claim about reversible × stage-sensitive decisions, not all decisions.

7. What "Context" Actually Means — a Taxonomy

"Give the AI more context" is vague advice. Useful context has four layers:

Layer	What it answers	Where it lives
Stage	MVP / growth / scale. Reversibility budget.	`CLAUDE.md` top section
Constraints	Runway, team size, infra budget, latency targets	`CLAUDE.md` or per-prompt
Trade-off weights	Ship speed vs. quality vs. scalability ordering	`CLAUDE.md`
Anti-goals	Explicit list of recommendations to skip	`CLAUDE.md` "do NOT" list

A working CLAUDE.md snippet for an MVP:

## Project Context

- **Stage**: MVP — validating hypothesis. 50 users, target 200 MAU.
- **Team**: Solo developer. No ops headcount.
- **Constraints**: 8-month runway. Infra budget under $200/month.
- **Trade-off weights** (highest to lowest): ship speed, code clarity,
  scalability. Latency p95 under 1s is fine.
- **Security scope**: OWASP Top 10, prioritized — Injection, Broken Auth,
  Sensitive Data Exposure. Out of scope: nation-state, insider threats, DoS.
- **Anti-goals — do NOT recommend**:
  - Microservices, k8s, service mesh, mTLS
  - VPC Peering / PrivateLink / separate AWS accounts
  - Architecture patterns for files under 200 LOC
  - Caching, queues, or workers for features without measured load
- **Re-evaluation trigger**: revisit these weights at 5,000 MAU or when
  payments ship.

The anti-goals list is the unusual part. Most people skip it. It's the highest-leverage line in the file: it removes a class of recommendations the model would otherwise default to.
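
If you would rather keep the four layers in code than in a markdown file, here is one way to render them into a prompt preamble. The `ProjectContext` class and `with_context` helper are hypothetical, and the output wording is just one plausible format.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectContext:
    stage: str
    constraints: list
    tradeoff_weights: list                      # highest priority first
    anti_goals: list = field(default_factory=list)

    def to_preamble(self) -> str:
        lines = [
            f"Stage: {self.stage}",
            "Constraints: " + "; ".join(self.constraints),
            "Trade-off weights (highest to lowest): " + ", ".join(self.tradeoff_weights),
        ]
        if self.anti_goals:
            lines.append("Anti-goals, do NOT recommend: " + ", ".join(self.anti_goals))
        return "\n".join(lines)


def with_context(ctx: ProjectContext, request: str) -> str:
    """Prepend the business context to any review request."""
    return ctx.to_preamble() + "\n\n" + request


ctx = ProjectContext(
    stage="MVP, 50 users, solo dev",
    constraints=["8-month runway", "infra under $200/month"],
    tradeoff_weights=["ship speed", "code clarity", "scalability"],
    anti_goals=["mTLS", "WAF", "separate AWS accounts"],
)
prompt = with_context(ctx, "Review the security of this endpoint.")
```

Keeping the layers as separate fields makes it harder to forget one, and the anti-goals line only appears when there is something to put in it.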

8. When AI's Default Is Actually Right

The thesis is scoped to reversible × stage-sensitive decisions. Outside that scope, AI's conservative bias is an asset:

Non-recoverable downside — anything touching PII, payment data, health data, customer secrets, auth tokens. Also: tenant isolation, key management, backup/restore, audit logs, retention/deletion compliance, breach notification readiness, secrets handling, vendor and supply-chain risk.
Regulated industries — finance, healthcare, government, ed-tech with minors. The default prior may even underestimate what's required.
Expensive-to-reverse decisions — primary DB engine, auth model, multi-tenancy boundaries, public API contracts, event schemas, ID strategy, billing model, permission model, observability foundations.

For these, lean into the conservative recommendation. Override only with a written reason.

The honest summary: this article is a heuristic for one specific quadrant of decisions, not a universal law.

9. What Would Change This?

It's tempting to add "at least for now" at the end and move on. Worth a beat instead.

Things that would partially close the gap:

Stage inference from repo shape — a meta-layer that looks at commit cadence, test coverage, observability stack, and recommends differently for "solo founder Express app" vs. "Series B microservice fleet."
Org-aware agents — read access to product metrics, infra spend, roadmap, risk policy. So the model can reason about cost the way a senior engineer does.
Policy profiles — "MVP mode" / "scale mode" / "compliance mode" as first-class settings.

What won't change: humans remain accountable for the objective function. Even if the tool infers your stage, you still own the decision of whether to accept the inference. So the practical claim — supply context, or don't be surprised by the defaults — survives most plausible improvements.

10. Summary

Loose framing	Sharper framing
AI can't do trade-offs	AI optimizes the objective you give it; default objective is production-grade
Humans must decide	Humans must specify stage, constraints, trade-off weights, anti-goals
"AI's limitation"	"Missing intervention at the context layer"
Time-bounded ("for now")	Humans stay accountable for objectives regardless of model progress

One actionable rule: before your next "review this" prompt, write four lines — stage, constraints, trade-off weights, anti-goals — into the prompt or into CLAUDE.md. Re-run. If the recommendations don't shift, your context is probably still too thin or you've hit a different failure mode (the model ignoring the file, generic safety bias, or a genuine capability limit). At that point, you have a real problem to debug instead of a vague "AI gave bad advice."
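
To check whether the recommendations actually shifted, a crude comparison is often enough. The sketch below flags any recommendation that still mentions an anti-goal term; it uses plain substring matching, which is an assumption of convenience and will miss paraphrases. The sample lists are taken from the before/after in Section 5.

```python
def anti_goal_hits(recommendations: list, anti_goals: list) -> list:
    """Return the recommendations that mention any anti-goal term (case-insensitive)."""
    terms = [t.lower() for t in anti_goals]
    return [r for r in recommendations if any(t in r.lower() for t in terms)]


anti_goals = ["mTLS", "WAF", "separate AWS account"]

before = [
    "Add rate limiting with Redis token bucket",
    "Enforce mTLS between service and database",
    "Audit-log every auth attempt to a separate AWS account",
    "Add a WAF in front of the load balancer",
]
after = [
    "Confirm parameterized queries (looks fine)",
    "Validate JWT signing alg explicitly; reject alg: none",
    "Log auth failures to existing application logs",
]
```

If the "after" run still returns hits, that is a signal the model ignored the context file, which is one of the separate failure modes mentioned above.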

Open to feedback. Especially curious: (a) does the before/after replicate cleanly on your codebase? (b) where does the "Production-Mature Prior" hypothesis break — concrete counter-examples wanted. (c) is the same pattern visible in adjacent tooling — data engineering, MLOps, security scanners?

Why every Claude Code-built site looks the same — and the image layer that breaks it

Jun seo — Sat, 16 May 2026 05:18:06 +0000

TL;DR:

AI-built sites look uncannily similar because they share the same defaults — Tailwind + shadcn/ui + Lucide + the same gradients. It's not a placeholder problem; it's a visual-stack problem. Real, project-specific images are the cheapest way out.
I wrote a small Claude Code skill that wraps Codex CLI's gpt-image-2 and triggers on natural-language asks. Drop a DESIGN.md at the project root, tell Claude to insert images, and you get a coherent, on-brand set across the site.
Biggest win for solo developers shipping without a designer. Repo: github.com/JunSeo99/claude-skill-codex-imagegen — install takes 30 seconds (or just hand the URL to Claude Code itself).

Claude Code can build a working site in one session. The structure, the routing, the component library — it all comes together fine on the first pass. The problem is more subtle: most of these sites end up looking like each other.

The reason is the stack. Claude reaches for the same defaults every time — Tailwind, shadcn/ui, Lucide icons, a slate-or-zinc palette, a hero with a soft purple-to-blue gradient, cards with a 1px border and rounded-2xl corners, an abstract SVG blob somewhere in the header. None of those choices are bad. But across hundreds of vibe-coded sites, the cumulative effect is that someone landing on one feels like they've been on this site before — even when they haven't. Visitors don't say "this is shadcn." They say "this feels AI-generated." And the surface they're reacting to is mostly visual: the same component library, the same icon language, the same illustration-less spaces.

The cheapest way out of that uniformity, I've found, is real images. Not stock. Not Unsplash. Project-specific, style-consistent images generated to match a brand voice. Three or four of them placed where default vibe-coded sites would have left a Lucide icon over a gradient, and the "feels AI-generated" reaction collapses. The site stops reading as a template.

I wanted that to stop being a manual step.

In April, OpenAI shipped gpt-image-2 and bundled an $imagegen skill into Codex CLI. That gave me what I needed: a real image model I could shell out to from inside Claude Code. So I wrote a Claude Code skill that triggers on natural-language asks like "make a hero image for this landing page" and dispatches the actual generation to Codex.

Then I spent a weekend learning why nobody had a clean solution yet.

gpt-image-2 has three sharp edges and none of them are documented loudly

These are the things I hit, in order, on the first day:

Size requests are advisory, not enforced. I asked for 256×256. Got 1254×1254. Asked for 1024×1024 — also 1254×1254. The model picks its own dimensions based on what it thinks the prompt needs. If you actually need a specific size for a CSS slot, you resize after, not before. You can't prompt your way out of it.

Transparent PNGs aren't supported. gpt-image-2 will not emit alpha. Only gpt-image-1.5 does. This is buried in the OpenAI image-generation guide. The first time I asked for an icon "on transparent background," I got a perfectly nice icon sitting on a solid white square. The workaround is to generate on a flat removable background — green or pure white — and chroma-key it out locally. Fine, but you need to know that going in.

The PNG doesn't land where you asked. It lands at ~/.codex/generated_images/<session-uuid>/ig_*.png. Telling Codex "save to assets/hero.png" doesn't move the file there. You move it yourself afterwards.

Each of those is a 20-minute debug session if you don't know about it. Stacked, they make image generation feel "kind of broken" when it's actually working as designed, just badly documented.

And then there's the prompt itself

Even if you handle all three edges above, your output is only as good as your prompt. And gpt-image-2 punishes keyword soup.

The "stunning cinematic 8K masterpiece volumetric lighting" energy that worked on Midjourney v5 produces visibly worse output here. The OpenAI cookbook recommends a five-part structure (Scene → Subject → Details → Use case → Constraints) and front-loading the first 50 words, because the model weights the opening more heavily. This is real. I A/B'd it, and the five-part prompt won every time.
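The five-part structure is easy to keep honest if the prompt is assembled from named fields rather than typed freehand. The sketch below is a minimal illustration. The `ImagePrompt` class and the labeled-line output format are assumptions for the example, not the cookbook's required syntax. The only thing it enforces is the order, so scene and subject always land in the opening words.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    scene: str
    subject: str
    details: str
    use_case: str
    constraints: str

    def render(self):
        """Join the five parts in order, so scene and subject come first."""
        parts = [
            ("Scene", self.scene),
            ("Subject", self.subject),
            ("Details", self.details),
            ("Use case", self.use_case),
            ("Constraints", self.constraints),
        ]
        return "\n".join(f"{label}: {text.strip()}" for label, text in parts)


prompt = ImagePrompt(
    scene="A quiet desk by a window in soft morning light",
    subject="A single origami crane folded from off-white paper",
    details="Gentle shadows, light from the upper left, plenty of whitespace",
    use_case="Hero image for a landing page",
    constraints="No text in the image, no busy background",
)
print(prompt.render())
```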

For text in images (logos, banners, posters), wrap the literal text in double quotes or write it in ALL CAPS so the model knows what's literal vs. descriptive. gpt-image-2 is genuinely strong here: short labels, signs, and UI mockups land at near-perfect spelling across Latin and CJK scripts, which is a meaningful jump from older models. Where it still wobbles is (a) long multi-line paragraphs baked into the image, (b) brand names and uncommon spellings, and (c) very small text inside dense layouts. For brand names, the OpenAI prompting guide recommends spelling the tricky word out letter-by-letter in the prompt ("the word ACME spelled A-C-M-E"). For paragraph-length text, render it as an HTML/CSS overlay over the generated image instead of asking the model to bake it in. That's the workflow gpt-image-2's own docs recommend.
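Both text conventions above are mechanical enough to script. The helpers below are a minimal sketch with hypothetical names. They assume a single-word brand name and replace any embedded double quotes so the literal stays unambiguous.

```python
def literal_text(text):
    """Wrap text that must appear verbatim in double quotes."""
    return '"' + text.replace('"', "'") + '"'


def spell_out(word):
    """Spell a tricky word letter by letter, e.g. ACME -> A-C-M-E."""
    return "-".join(word)


def brand_clause(word):
    """Build the 'spelled out' phrase for a single-word brand name."""
    return f"the word {word} spelled {spell_out(word)}"


print(f"A storefront sign that reads {literal_text('OPEN LATE')}")
print(brand_clause("ACME"))  # the word ACME spelled A-C-M-E
```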

For edits, the trick is "change only X, keep everything else identical." The model only loosely preserves what you don't mention, but it preserves what you explicitly tell it to keep very well.

None of this lives in the recipes that just say "run codex exec and you're done." So I baked all of it into the skill's playbook.

What the skill actually does

One SKILL.md plus two reference files (prompting-guide.md, cli-reference.md) that Claude Code auto-loads from ~/.claude/skills/codex-imagegen/. No Node, no install step beyond git clone && ln -s.

When you say something like "make a hero image of an origami crane for the landing page, save to assets/hero.png at 1600×900," the skill:

Rewrites your request into the five-part structure (Scene → Subject → Details → Use case → Constraints), front-loaded.
Runs codex exec --sandbox workspace-write '$imagegen <prompt>. Print only the absolute path on the last line.' Codex generates the image but doesn't move it.
Parses the path from stdout. Runs cp and sips -z 900 1600 (macOS) or convert -resize 1600x900 (Linux) to land the file where you actually asked.
Prints the final path. Done.
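The steps above can be sketched as three small functions that build the commands and parse the output without executing anything. This is an illustration under assumptions, not the skill's actual helper. The function names are hypothetical, the path check assumes a Unix-style absolute path, and the ImageMagick branch writes back over the input file. One detail worth knowing: `sips -z` takes height then width and resamples to exactly that size, while ImageMagick's `-resize WxH` keeps the aspect ratio by default. The two branches can therefore give different results for a square source.

```python
import shlex


def codex_command(prompt):
    """Build the codex exec argv described above (not executed here)."""
    instruction = (
        f"$imagegen {prompt.rstrip('.')}. "
        "Print only the absolute path on the last line."
    )
    return ["codex", "exec", "--sandbox", "workspace-write", instruction]


def parse_image_path(stdout):
    """Return the last non-empty stdout line if it is an absolute .png path."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("codex produced no output")
    last = lines[-1]
    if not (last.startswith("/") and last.lower().endswith(".png")):
        raise ValueError(f"last line is not an absolute .png path: {last!r}")
    return last


def resize_command(path, width, height, macos=True):
    """sips takes height then width; ImageMagick takes WIDTHxHEIGHT."""
    if macos:
        return ["sips", "-z", str(height), str(width), path]
    return ["convert", path, "-resize", f"{width}x{height}", path]


print(shlex.join(codex_command("an origami crane on a warm off-white desk")))
print(shlex.join(resize_command("assets/hero.png", 1600, 900)))
```

Validating the last line before copying matters because the model may print commentary instead of a bare path. Failing loudly there is cheaper than copying the wrong file.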

The natural-language trigger is the part that matters most to my actual goal. I want Claude Code, mid-build, to decide on its own that this <section> needs a hero image, and just generate one. Not "user types a special slash command." The skill fires from phrases like "generate an image," "make an icon," "create a banner," "OG image," "hero illustration." Claude calls it the same way it calls anything else in its toolkit.

That's the whole point. The site shouldn't end up looking like every other vibe-coded site because the agent never broke out of its default visual stack. The agent building the site should be reaching for project-specific imagery on its own.

The trick that changes everything: DESIGN.md

Here's the bit I didn't expect to matter as much as it does.

If you drop a DESIGN.md at the root of your project — palette, type, illustration style, tone — and then ask Claude Code:

Using DESIGN.md as the style reference, insert images that fit the site.

…it just works. Really well.

Claude reads DESIGN.md, decides which slots in the codebase need imagery, writes prompts that incorporate the palette and tone, calls the skill, and inserts the resulting paths into the right <img> tags. The hero image, the empty-state illustration, the OG card, and the favicon all end up looking like they belong to the same product. Without DESIGN.md it still works, but each image drifts a little — palette, mood, lighting are all slightly off across slots, and you can feel it even if you can't immediately name what's wrong.

DESIGN.md doesn't have to be fancy. Here's a trimmed version of one I'm using right now:

# Design

## Concept
Calm, considered, modern. The kind of feel that gets out of the user's
way instead of demanding attention.

## Palette
- Surface (main):  #F4F1ED  — warm off-white
- Surface (cards): #FFFFFF
- Text:            #1A1A1A  — near-black, not pure
- Accent / CTA:    #C46A4E  — soft terracotta, used sparingly

## Typography
- Inter, system-ui sans-serif

## Illustration style
- Single subject, plenty of whitespace, no busy backgrounds
- Soft natural light from upper left, gentle shadows
- Hand-folded paper / origami feel where applicable
- No text inside images unless explicitly asked
- Avoid stock-photo vibes and over-saturated colors

That's 20-ish lines. But Claude treats it as a hard constraint when writing prompts, and the visual consistency across a 4–5 page site is night and day vs. asking for each image cold. The "Illustration style" block is doing about 80% of the work — palette obviously matters, but the qualitative instructions ("hand-folded paper feel," "no busy backgrounds") are what stop each image from feeling like it came from a different stock-image library.
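Claude reads DESIGN.md directly, so no parsing is needed for the workflow above. If you wanted to apply the same style anchor from a script, a minimal sketch might pull the bullets under one heading and fold them into the Constraints part of the prompt. The function names and the "join bullets with semicolons" choice are assumptions for illustration.

```python
def parse_design_md(text):
    """Split a DESIGN.md into {section title: [lines]} keyed by '## ' headings."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None and line:
            sections[current].append(line)
    return sections


def style_constraints(sections, heading="Illustration style"):
    """Turn the bullets under one heading into a single constraints string."""
    bullets = [
        line[2:].strip()
        for line in sections.get(heading, [])
        if line.startswith("- ")
    ]
    return "; ".join(bullets)


design = """# Design

## Palette
- Accent / CTA: #C46A4E

## Illustration style
- Single subject, plenty of whitespace, no busy backgrounds
- No text inside images unless explicitly asked
"""

print(style_constraints(parse_design_md(design)))
```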

Why this is the year vibe-coded sites stop looking vibe-coded

A year ago this would have been a different post. Back then, even if you wanted to break out of the shadcn-default look, generated images weren't the answer. The available models produced output that screamed AI louder than the layout did — slightly melted typography, off-axis lighting, the same handful of obvious tells. So the fastest path was usually "just don't add an image," and the result was a sea of sites that all leaned on the same component library to do all the visual work.

gpt-image-2 changes the math. With a tight DESIGN.md and a five-part prompt, generated images now look like they came from a brand, not from a model. Text spells correctly. Light angles agree across slots. Subject framing is intentional. They're not hand-crafted illustrations from an agency, but they no longer carry the "AI tell" that earlier generations did. And once those images sit alongside the shadcn cards and the Lucide icons, they shift where the eye lands. A visitor reads the hero illustration, the OG card, the empty-state graphic — slots that on a default vibe-coded site were either missing or generic — and the site registers as a product instead of a template.

The interesting part isn't any individual image. It's that the gap between "site built by a small team with a designer on call" and "site built solo with Claude Code overnight" is mostly carried by image quality and visual specificity. The structure is solved. The components are solved. What's left, and what was carrying most of the "feels AI-generated" signal, was the image layer — and that's the slot this skill fills.

If you ship as a solo developer — no designer on call, no illustration budget, no Figma file from a teammate — this is the part of the workflow that used to force a compromise. Either you paid for stock images that didn't quite match the rest of the site, or you pulled an SVG from Heroicons and called it a hero. With gpt-image-2 plus a DESIGN.md, that compromise mostly goes away. The same person who writes the code can produce custom, on-brand visuals in the same session, without leaving the editor and without commissioning anyone. That's the audience I built this skill for, and the audience it changes the most for. Designers will always have an edge on intentional taste — I'm not pretending otherwise — but for the long tail of side projects, landing pages, and internal tools that were never going to get a designer in the first place, the bar just moved.

Once you have this loop — Claude builds the site, reads DESIGN.md, decides where images belong, generates them with consistent style, drops them in place — visitors stop registering that AI built the site. Which is the bar.

Caveats

gpt-image-2 turns burn 3–5× as much of your Codex usage limit as a plain text turn. If you're iterating a lot, set OPENAI_API_KEY and switch to per-image API billing.
macOS is primary. Linux works via ImageMagick. Windows is not on the roadmap.
The skill is around 200 lines of markdown plus a small shell helper. If you don't like a default, edit it. There's no framework to wrestle.
For small text or dense multi-font layouts, bump quality to medium or high. Those are the slots where the extra compute visibly pays off.

Repo

github.com/JunSeo99/claude-skill-codex-imagegen

git clone https://github.com/JunSeo99/claude-skill-codex-imagegen \
  ~/.claude/skills/codex-imagegen

If even that feels like effort, just hand the repo URL to Claude Code itself and tell it to install the skill — something like "install this Claude Code skill: https://github.com/JunSeo99/claude-skill-codex-imagegen". It'll read the README, run the clone-and-symlink, and the next session will just have it. Mildly recursive — using Claude Code to install something Claude Code is going to use — but it works, and honestly it's how I install most of my own skills these days.

Once it's installed, restart Claude Code. Drop a DESIGN.md at the root of your project. Build your site. Then say: "Using DESIGN.md as the style reference, insert images that fit the site."

Curious if anyone else is doing the DESIGN.md-as-style-anchor pattern for AI-generated assets — I'd love to compare notes on which fields actually move the needle and which are noise. The "Illustration style" block is doing 80% of the work in my setup, but I haven't tested it across enough projects to call it.

And feedback on the skill itself is genuinely welcome — issues, PRs, "this default is wrong," "this caveat is missing," "this prompt pattern didn't work for me." It's still early, and I plan to keep iterating on it as people actually run it in their own projects. If you try it and something breaks or feels off, please tell me — that's the fastest way I'll make it better.