Forem: Remy B.

Considering RAG for your Agent? Build this instead.

Remy B. — Wed, 27 May 2026 04:00:00 +0000

Key Takeaways

Most SaaS AI agents don't need a vector database — file-based memory plus 1M-token context windows plus tool calls handle the typical case

Anthropic's official "key primitive for just-in-time context retrieval" is filesystem-based, not vector-based

Claude Code's pattern — an index file (MEMORY.md) plus per-topic markdown files loaded on demand — works for production SaaS agents too

RAG still wins for large unstructured corpora, regulated multi-tenant data, and frequently-refreshed external knowledge — most SaaS use cases don't fit those criteria

If you're considering RAG for your AI agent in 2026, the most important question isn't which vector database to pick. It's whether you need one at all.

The first time I built a support agent, I reached straight for the default stack: a vector database, an embedding pipeline, a chunker, a reranker. Weeks of plumbing later, the agent still answered most questions by running a plain SELECT against my app's own database — the vector store barely earned its keep. I tore it out and replaced it with an index file plus a directory of markdown notes the agent read on demand. Same answers, four moving parts gone. The retrieval I thought I needed was something a single file read already handled.

For most SaaS agents, the simpler pattern is file-based memory: the agent stores what it learns in markdown files and reads them back on demand, the shape Claude Code uses internally. Add 1M-token context windows and tool calls against your existing database, and you handle the typical agent job with fewer moving parts than a vector-DB pipeline.

This isn't a "RAG is dead" piece. Hamel Husain rebutted that take in July 2025 and he's right. What's changing is which kind of retrieval you reach for first. If you've been vibe coding with Claude Code or Cursor, you've already been using file-based memory without naming it.

The Default-RAG Instinct Is Doing Too Much

Open any "build an AI agent" tutorial and the architecture is the same: pick a vector database (Pinecone, ChromaDB, pgvector), build an embedding pipeline, chunk your documents, write retrieval, layer in a reranker, hand the top-k chunks to the model. Each piece is a system you own and pay to run.

That stack made sense when frontier models had 8K-to-32K context windows and tool calling was experimental. It doesn't make sense as the default in 2026, when Claude Sonnet 4.6 ships a 1M-token context window and function calling is universal. Most SaaS data already lives in a structured database; agents reach it through tool calls, not similarity search. That 2023-era stack is over-engineering for the job.

When RAG Genuinely Wins

Before pulling apart the default, name the cases where a full RAG pipeline is the right answer. There are real ones.

Large unstructured corpora. When the agent searches across tens of thousands of documents whose titles don't tell you what's inside (product manuals, legal archives, scientific literature, internal wikis at enterprise scale), you need similarity search. Listing every doc in an index stops fitting in context; exact-match lookups miss the relevant chunk.
Regulated, multi-tenant isolation. SaaS apps with strict per-tenant data boundaries (healthcare, finance, defense) get row-level access controls and audit trails out of the box from a vector store. Filesystem memory can do this too, but you build the primitives yourself.
Frequently-refreshed external knowledge. News feeds, market data, regulatory updates: anything where the corpus changes hourly. Vector indexes update incrementally; filesystem memory drifts unless you build the same incremental path yourself.
Agentic search over structured tool responses. Jason Liu puts it sharply: "Good search is the ceiling on your RAG quality. If recall is poor, no prompt engineering or model upgrade will save you." When the agent reasons across thousands of structured records and chooses what to ask next, you need real retrieval infrastructure with faceted metadata.

If your use case fits one of those, build the RAG stack. The rest of this post is about every other case.

Why Most SaaS Agents Don't Fit That Profile

The typical SaaS agent operates over your own structured data: users, accounts, orders, tickets, audit logs. You don't need fuzzy similarity search to find a user record; you need a tool call that runs SELECT * FROM users WHERE id = ?. Tool calls beat vector retrieval here on three counts: precise structured records the model handles more reliably than chunks of prose; fresh data the moment it's written, with no embedding pipeline to re-run; and your existing database's access controls, transactions, and audit trail. None of that is true of a parallel vector store sitting alongside your DB.
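A minimal sketch of what such a tool body might look like, assuming SQLite and a hypothetical users table with id, email, and plan columns. The function name get_user and the schema are illustrative, not taken from any particular SDK:

```python
import sqlite3
from typing import Optional


def get_user(conn: sqlite3.Connection, user_id: int) -> Optional[dict]:
    """Tool body: fetch one user row by primary key, or None if absent."""
    # Parameterized query: the driver binds user_id, so the model-supplied
    # argument is never spliced into the SQL string.
    cursor = conn.execute(
        "SELECT id, email, plan FROM users WHERE id = ?", (user_id,)
    )
    row = cursor.fetchone()
    if row is None:
        return None
    columns = [col[0] for col in cursor.description]
    return dict(zip(columns, row))


# Demo against an in-memory database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)")
conn.execute("INSERT INTO users VALUES (42, 'ada@example.com', 'pro')")
print(get_user(conn, 42))
```

The agent gets back one exact record, read at query time, under whatever permissions the connection already has.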

For the parts of agent context that aren't in your DB (system instructions, conventions, accumulated learnings about a user, prior conversation summaries, your product's docs), the math has changed too. With a 1M-token context window you can carry an enormous amount of state inline. You don't need to retrieve what already fits.

The File-Based Memory Pattern

The architecture is simple: an index file listing what the agent knows, a directory of per-topic markdown files with the contents, and file-read and file-write tools the agent uses to navigate them.

Anthropic's official Memory tool documentation describes this as "the key primitive for just-in-time context retrieval": the agent stores what it learns in files in a /memories directory and reads them back on demand, instead of loading everything upfront. No embedding step, no vector store, no chunker. Just files.

Anthropic's September 2025 post on effective context engineering formalizes it: "agents built with the just in time approach maintain lightweight identifiers (file paths, stored queries, web links, etc.) and use these references to dynamically load data into context at runtime using tools." The same post names the failure mode this avoids: "context rot," where model recall degrades as context fills. File-based memory keeps context lean by design.

Working memory stays small: the system prompt, the conversation, and whichever topic files were pulled in for this step. Everything else sits on disk. Need more, read more. Harness engineering calls this a feedforward control: structure the inputs so the agent doesn't have to guess.

How Claude Code Does It

The reference implementation is sitting on every Claude Code user's machine. Claude Code maintains a memory directory at ~/.claude/projects/<project>/memory/ with a single index file (MEMORY.md) and one or more topic-specific markdown files alongside it.

The official docs spell out the rules: MEMORY.md loads first, capped at its first 200 lines or 25KB, whichever comes first, and contains one-line entries pointing to per-topic memory files. Topic files don't load until the agent asks for one. The /memory command lists what's currently loaded, toggles auto-memory, and opens the underlying folder.

An easy-to-miss guideline in the same docs: target under 200 lines per memory file. The reason: longer files consume more context and reduce adherence. That's the principle making file-based memory work. Many small focused files beat one giant context dump.
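If you want the same cap in your own agent, a sketch like this would do it. How Claude Code truncates internally isn't something I can speak to; this version assumes whole-line truncation, and load_index_head is a name I made up:

```python
def load_index_head(text: str, max_lines: int = 200, max_bytes: int = 25 * 1024) -> str:
    """Return the leading part of an index that fits both caps."""
    kept = []
    used = 0
    for line in text.splitlines(keepends=True)[:max_lines]:
        size = len(line.encode("utf-8"))
        if used + size > max_bytes:
            break  # stop at the last whole line inside the byte budget
        kept.append(line)
        used += size
    return "".join(kept)
```

Whichever cap is hit first wins, so a 300-line index loads only its first 200 lines.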

Why this works

Three properties map cleanly onto what an agent needs. The index gives directional awareness: the agent knows what it knows. Per-topic files provide just-in-time depth: they enter context only when the topic is live. The 200-line cap forces summarization discipline: topics that get longer have to be split, which keeps each load focused.

None of this is novel infrastructure. It's a directory of markdown files plus a convention for organizing and reading them. It works because the convention matches how the model reasons about relevance.

Applying This to Your SaaS Agent

Adapting this pattern for an agent inside your SaaS is mostly a question of mapping the same conventions onto your storage and your tools.

Storage layer

The simplest backend is a literal filesystem (fine for single-tenant, single-machine setups). For production multi-tenant SaaS, the pattern fits cleanly into S3 or Cloudflare R2 with one prefix per tenant, or a database table where each row is "a file" (tenant_id, path, content, updated_at). Pick whichever is closest to your stack. The agent's tools don't care.
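Here's a rough sketch of the database-table variant, using SQLite so it runs anywhere. The table name memory_files and the helper names are my assumptions; the four columns are the ones listed above:

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS memory_files (
    tenant_id  TEXT NOT NULL,
    path       TEXT NOT NULL,
    content    TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    PRIMARY KEY (tenant_id, path)
)
"""


def write_file(conn: sqlite3.Connection, tenant_id: str, path: str, content: str) -> None:
    """Insert or fully replace one 'file' row for a tenant."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO memory_files (tenant_id, path, content, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (tenant_id, path, content, now),
    )
    conn.commit()


def read_file(conn: sqlite3.Connection, tenant_id: str, path: str) -> Optional[str]:
    """Return a file's content, or None if this tenant has no such path."""
    row = conn.execute(
        "SELECT content FROM memory_files WHERE tenant_id = ? AND path = ?",
        (tenant_id, path),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
write_file(conn, "acme", "user-42/preferences.md", "Pro plan, async preferred\n")
print(read_file(conn, "acme", "user-42/preferences.md"))
```

Because tenant_id is part of the primary key and of every query, one tenant's paths can never collide with or leak into another's.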

Index format

Your MEMORY.md is a markdown table of contents. Each entry is one line: a path, a short description, optionally a category tag. The agent loads it every turn, so keep it tight; same 200-line discipline as Claude Code.

Topic file conventions

Group topics by the dimension that matches your access pattern. A customer support agent usually wants per-user files: memory/user-<id>/preferences.md, memory/user-<id>/recent-tickets.md, memory/user-<id>/open-issues.md. A coding assistant groups per-project; a research agent groups per-topic.

Loading and update rules

Two invariants do most of the work. Always load the index. Load topic files only when the conversation needs them. The agent can decide what's worth saving in the moment, but deterministic capture is more reliable. Topic files get rewritten in full, not appended; that keeps them under 200 lines and forces summarization.
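The full-rewrite rule is easy to enforce in the write path. A sketch under my own assumptions: the write is rejected outright when content exceeds the cap (truncating instead would be a different design), and rewrite_topic_file is a hypothetical helper name:

```python
import os
import tempfile
from pathlib import Path

MAX_LINES = 200  # the per-file target described above


def rewrite_topic_file(path: Path, content: str, max_lines: int = MAX_LINES) -> None:
    """Replace a topic file in full; refuse content over the line cap."""
    line_count = len(content.splitlines())
    if line_count > max_lines:
        # Push the problem back to the agent: summarize or split the topic.
        raise ValueError(f"{line_count} lines exceeds cap of {max_lines}")
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then swap it in, so a
    # reader never sees a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    os.replace(tmp_name, path)
```

Rejecting rather than truncating keeps the summarization decision with the model, which is the point of the cap.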

Capture patterns: hooks and a daily diary

The interesting design question isn't where memory goes — it's when the agent writes to it. Two patterns combine to handle most of the work.

Per-session hooks. After a session ends, a deterministic trigger writes a short entry to memory/sessions/<session-id>.md: what the user did, what they pushed back on, what preferences came up, what broke. The agent doesn't decide mid-session; the hook captures at session close. Same shape as Claude Code's auto-memory: the model spots new conventions during the conversation, the system persists them at close.
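A sketch of the hook, assuming your runtime already collects the four kinds of notes during the session. The SessionNotes shape, the section titles, and on_session_close are all hypothetical names:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SessionNotes:
    """What the runtime collected during one session (hypothetical shape)."""
    session_id: str
    actions: list = field(default_factory=list)
    pushback: list = field(default_factory=list)
    preferences: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def render_entry(notes: SessionNotes) -> str:
    """Render a short markdown entry, skipping sections with nothing in them."""
    lines = [f"# Session {notes.session_id}"]
    sections = (
        ("Did", notes.actions),
        ("Pushed back on", notes.pushback),
        ("Preferences", notes.preferences),
        ("Broke", notes.errors),
    )
    for title, items in sections:
        if items:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines) + "\n"


def on_session_close(memory_root: Path, notes: SessionNotes) -> Path:
    """Deterministic hook: runs at session close, whatever the model decided."""
    target = memory_root / "sessions" / f"{notes.session_id}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_entry(notes), encoding="utf-8")
    return target
```

Call it from whatever "session ended" event your framework exposes; nothing in it depends on the model remembering to save.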

A daily diary. Once a day, a scheduled job summarizes the last 24 hours of session logs into a single short entry at memory/diary/2026-05-10.md. One paragraph, no more. Old logs get folded in and archived. Over a month you have 30 diary entries instead of thousands of raw logs. Compress further over a year, with weekly summaries and monthly themes, and the agent has hierarchical memory that mirrors how humans remember: vivid for last week, summarized for last month, themes-only for last year.

The diary works for the same reason journaling does. It forces summarization, which forces relevance ranking. Deciding what mattered at the time is much cheaper than reconstructing relevance later from an unstructured pile. Unlike humans, the agent doesn't forget to do it. A scheduled function reads memory/sessions/, prompts the model with "summarize the last 24 hours of sessions into one paragraph, focused on durable learnings," writes the result, and archives the source. A 50-line cron job, not infrastructure.
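Roughly what that scheduled function could look like. The model call is injected as a plain callable so the sketch stays provider-neutral; summarize, the archive/ directory name, and the assumption that sessions/ holds only unarchived logs are mine:

```python
import shutil
from pathlib import Path
from typing import Callable, Optional

PROMPT = (
    "Summarize the last 24 hours of sessions into one paragraph, "
    "focused on durable learnings."
)


def run_daily_diary(
    memory_root: Path, day: str, summarize: Callable[[str, str], str]
) -> Optional[Path]:
    """Fold all pending session logs into one diary entry, then archive them."""
    logs = sorted((memory_root / "sessions").glob("*.md"))
    if not logs:
        return None  # nothing happened; write no entry
    combined = "\n\n".join(log.read_text(encoding="utf-8") for log in logs)
    paragraph = summarize(PROMPT, combined).strip()
    diary = memory_root / "diary" / f"{day}.md"
    diary.parent.mkdir(parents=True, exist_ok=True)
    diary.write_text(paragraph + "\n", encoding="utf-8")
    # Archive only after the diary entry is safely written.
    archive = memory_root / "archive" / day
    archive.mkdir(parents=True, exist_ok=True)
    for log in logs:
        shutil.move(str(log), str(archive / log.name))
    return diary
```

In production, summarize would wrap your model client; in tests, a lambda is enough. Writing before archiving means a failed model call leaves the raw logs in place for the next run.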

Andrej Karpathy's April 2026 "LLM Wiki" gist formalizes the same shape with a three-layer split: a raw/ directory of immutable source documents, a wiki/ directory of LLM-maintained markdown pages summarizing and cross-referencing the raw material, and a CLAUDE.md at the root defining the schema and update workflow. His framing: "LLMs don't get bored, don't forget to update a cross-reference, and can touch 15 files in one pass." Same skeleton, different vocabulary.

The strongest validation comes from another Anthropic post. In "Code execution with MCP" (November 2025), the team described a workflow that consumed ~150,000 tokens loading tool definitions upfront. Reimplemented with filesystem-style MCP APIs (tool definitions read on demand), the same workflow used ~2,000 tokens. A 98.7% reduction. They call it "progressive disclosure." Your file-based memory layer encodes the same idea.
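The percentage follows directly from the two token counts, both of which are approximate figures from the post:

```python
before_tokens = 150_000  # tool definitions loaded upfront (approximate)
after_tokens = 2_000     # definitions read on demand (approximate)

reduction = (before_tokens - after_tokens) / before_tokens
print(f"{reduction:.1%}")  # prints 98.7%
```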

A Working Example

Here's what this looks like end-to-end for a customer support agent inside a SaaS: no vector DB, no embeddings, just files and four tools.

Directory layout

Per-tenant root, three categories: per-user state (the agent's working knowledge of each customer), the time-decaying capture layer from the previous section (sessions and diary), and tenant-wide policies. One tenant's layout:

memory/ layout

memory/MEMORY.md                       tenant-wide index

memory/user-42/preferences.md          explicit facts (timezone, plan tier, channels)
memory/user-42/recent-tickets.md       last 5 tickets, summarized
memory/user-42/open-issues.md          current state of unresolved issues

memory/sessions/2026-05-10-094217.md   raw session log (last 24h only)
memory/diary/2026-05-09.md             yesterday, one paragraph
memory/diary/2026-05-week-19.md        last week, two sentences
memory/diary/2026-04-themes.md         April, three bullet points

memory/policies/refunds.md             product-wide policy
memory/policies/escalation.md          escalation rules

Hierarchy from the previous section in action: sessions vivid and short-lived; dailies roll them up and live ~30 days; weeklies roll up the dailies; monthly themes carry only recurring patterns. Per-user state and tenant policies sit alongside, untouched.

What MEMORY.md actually contains

The index is the agent's table of contents: one line per file, enough metadata to decide what to load. Loading the whole index every turn costs almost nothing because the index itself stays small.

memory/MEMORY.md

# Memory Index

## User state
- user-42/preferences.md: Pro plan, async preferred, EU timezone
- user-42/recent-tickets.md: last 5 (1 refund, 2 billing, 2 onboarding)
- user-42/open-issues.md: webhook signature mismatch, opened 2026-05-08

## Capture layer
- sessions/: raw logs, last 24h only
- diary/2026-05-09.md: billing webhook day
- diary/2026-05-week-19.md: refund policy edge cases

## Policies
- policies/refunds.md: refund auth + escalation thresholds
- policies/escalation.md: when to involve a human
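An index in this shape is also trivial to parse on the server side, say for validation or an admin view. A sketch, assuming exactly the "## section" and "- path: description" conventions shown above; parse_index is a hypothetical helper:

```python
def parse_index(text: str) -> dict:
    """Map each '## Section' heading to a list of (path, description) pairs."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:]
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            # Split on the first ": " only, so descriptions may contain colons.
            path, _, description = line[2:].partition(": ")
            sections[current].append((path, description))
    return sections
```

The agent itself never needs this; it reads the markdown directly. The parser is for the deterministic code around the agent.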

Tool surface

Four tools, defined the same way they would be in any modern AI SDK (Vercel AI SDK's tool(), OpenAI function-calling, or LangChain's Tool interface):

read_memory_index() — returns the contents of MEMORY.md for the active tenant. Called every turn (cheap because the index is small).
read_memory_file(path) — returns the contents of one topic file. Called only when the index suggests it's relevant.
write_memory_file(path, content) — rewrites a topic file in full. The full-rewrite constraint forces summarization rather than append-only growth.
delete_memory_file(path) — explicit deletion. Used when a topic is resolved or superseded.
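Behind whatever SDK wrapper you use, the four tools can be this small. A sketch on a local filesystem, with the class name MemoryTools and the path-escape check as my own additions; swap the bodies for S3 or database calls as needed:

```python
from pathlib import Path


class MemoryTools:
    """The four tools, scoped to one tenant's memory directory."""

    def __init__(self, tenant_root: Path):
        self.root = tenant_root.resolve()

    def _resolve(self, path: str) -> Path:
        # The model supplies `path`, so reject absolute paths and '..'
        # escapes that would land outside the tenant's directory.
        target = (self.root / path).resolve()
        if self.root not in target.parents:
            raise PermissionError(f"path escapes tenant root: {path}")
        return target

    def read_memory_index(self) -> str:
        return self.read_memory_file("MEMORY.md")

    def read_memory_file(self, path: str) -> str:
        return self._resolve(path).read_text(encoding="utf-8")

    def write_memory_file(self, path: str, content: str) -> None:
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")  # full rewrite, never append

    def delete_memory_file(self, path: str) -> None:
        self._resolve(path).unlink()
```

Construct one instance per request with the active tenant's root, and register the four public methods with your SDK's tool mechanism.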

A turn in the life

How one query flows through the system. The user asks: "what was the resolution to my webhook issue from last week?"

The agent calls read_memory_index(). Index entries flag user-42/open-issues.md (webhook signature mismatch) and diary/2026-05-09.md ("billing webhook day").
It calls read_memory_file("user-42/open-issues.md") and read_memory_file("diary/2026-05-09.md") in parallel.
Combined context is enough to answer: "We pinned it to a Stripe API key rotation that wasn't propagated to your staging env. The fix shipped Friday. Issue is closed on our side; you should be receiving webhooks normally now."
The agent calls write_memory_file to remove the resolved entry from user-42/open-issues.md. A server-side validator checks schema, size, and rate before the write lands.
At session end, the per-session hook writes a one-line summary to memory/sessions/2026-05-10-094217.md: "User asked about webhook resolution. Confirmed fix held. Removed entry from open-issues.md."
At 04:00 the next morning, the daily-diary cron reads memory/sessions/ from the last 24 hours, summarizes it into one paragraph at memory/diary/2026-05-10.md, and archives the raw files. A week later the dailies fold into diary/2026-05-week-19.md; a month later the weeklies fold into diary/2026-05-themes.md.
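The roll-up targets in that last step are just a naming convention. A sketch that reproduces the file names used in this example, assuming ISO week numbers and that a weekly file takes the month of the day being filed (a week that straddles two months would need a rule of your own):

```python
from datetime import date


def diary_paths(day: date) -> dict:
    """File names one day's notes roll up into, following the layout above."""
    iso_week = day.isocalendar()[1]  # ISO 8601 week number
    month = f"{day.year}-{day.month:02d}"
    return {
        "daily": f"diary/{day.isoformat()}.md",
        "weekly": f"diary/{month}-week-{iso_week:02d}.md",
        "monthly": f"diary/{month}-themes.md",
    }


print(diary_paths(date(2026, 5, 10)))
```

For 2026-05-10 this yields diary/2026-05-10.md, diary/2026-05-week-19.md, and diary/2026-05-themes.md.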

How the agent decides what to save

The decision rule is part of the system prompt, not the tool schema. Something like: "After resolving a ticket, update recent-tickets.md with a one-line summary. If the user states a durable preference ('always send me updates by email'), update preferences.md. Don't save transient facts ('the user said hi')."

Deterministic guards earn their keep here. For high-stakes writes (preferences, policy overrides), route the agent's write_memory_file calls through a server-side validator that enforces schema, size, and rate caps before the write lands. The agent thinks it's writing freely; the system enforces invariants. Structured vibe coding calls this "guides plus guardrails": the same idea applied to agent runtime instead of code generation.
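One way such a validator might look. Every limit here is a placeholder I picked for illustration (the allowed path prefixes, the size caps, ten writes per minute per tenant), as is the WriteValidator name:

```python
import time
from collections import defaultdict, deque

MAX_LINES = 200
MAX_BYTES = 25 * 1024
MAX_WRITES_PER_MINUTE = 10                   # hypothetical rate cap
ALLOWED_PREFIXES = ("user-", "policies/")    # hypothetical path schema


class WriteValidator:
    """Checks an agent write against schema, size, and rate invariants."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._recent = defaultdict(deque)  # tenant_id -> recent write times

    def check(self, tenant_id: str, path: str, content: str) -> None:
        """Raise ValueError if the write breaks an invariant; else record it."""
        if not path.endswith(".md") or not path.startswith(ALLOWED_PREFIXES):
            raise ValueError("path outside the allowed schema")
        if len(content.splitlines()) > MAX_LINES:
            raise ValueError("too many lines")
        if len(content.encode("utf-8")) > MAX_BYTES:
            raise ValueError("content too large")
        now = self._clock()
        window = self._recent[tenant_id]
        while window and now - window[0] >= 60:
            window.popleft()  # drop writes older than one minute
        if len(window) >= MAX_WRITES_PER_MINUTE:
            raise ValueError("rate limit exceeded")
        window.append(now)
```

Run check before the storage call and return the error message to the agent as the tool result, so it can shorten the content or back off.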

The Honest Tradeoffs — Context Rot

File-based memory isn't a free lunch. The biggest failure mode is context rot. Chroma's July 2025 study of 18 frontier models (including Claude Opus 4, Sonnet 4, GPT-4.1, GPT-4o, o3, and Gemini 2.5 Pro) found that "model performance degrades as input length increases" well before the stated max context window. A 200K-window model can show meaningful degradation at 50K tokens. The 200-line discipline matters because it caps how much memory enters context at once. The older "lost in the middle" finding from Liu et al. (TACL 2024) is softened in current frontier models but not eliminated; if you're packing 30 memory files into context, the order matters.

Two more failure modes are worth naming.

Fuzzy matching is genuinely harder. If a user asks "what was that thing about Stripe webhooks we discussed?" and the relevant entry is in memory/billing-debugging.md, the agent has to either browse the index intelligently or accept that some queries will miss. With vector search, the same query lights up automatically. For most SaaS use cases this is acceptable; for a public-facing knowledge base where users phrase the same question 50 different ways, vector retrieval still wins.

Memory has to be maintained. Files go stale, two files end up contradicting each other, and the agent saves a fact incorrectly and propagates the error on every read. None of these are unique to file-based memory; they're the same problems any RAG system has. The solution is different, though: explicit update and delete semantics in your write path, not incremental embedding refreshes.
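To make the fuzzy-matching limit concrete, here's a deliberately naive sketch of "browsing the index" by keyword overlap. This is not a technique from any of the sources above; the stopword list, the function names, and the index descriptions are made up for illustration:

```python
import re

STOPWORDS = {"a", "about", "that", "the", "thing", "was", "we", "what"}


def tokens(text: str) -> set:
    """Lowercased alphanumeric words, minus a tiny stopword list."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def rank_index_entries(query: str, entries: dict) -> list:
    """Rank index entries (path -> description) by words shared with the query."""
    wanted = tokens(query)
    scored = [
        (len(wanted & tokens(f"{path} {description}")), path)
        for path, description in entries.items()
    ]
    return [path for score, path in sorted(scored, reverse=True) if score > 0]


entries = {
    "billing-debugging.md": "Stripe webhook signature failures",
    "policies/refunds.md": "refund auth + escalation thresholds",
}
print(rank_index_entries("what was that thing about Stripe webhooks we discussed?", entries))
print(rank_index_entries("payment notifications stopped arriving", entries))
```

The first query finds billing-debugging.md only because "Stripe" appears in both; "webhooks" doesn't even match "webhook". The second query, a paraphrase with no shared words, returns nothing. In practice the model reading the index does better than word overlap, but the gap to embedding search is the same kind of gap.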

None of these tradeoffs make file-based memory wrong. They make it bounded. Know where the bounds are.

Where Everyone Is Converging

If this looked contrarian a year ago, it doesn't now. The major AI infrastructure players have adopted the pattern. The timeline:

August 2025: Anthropic ships the Memory tool. The official tool for stateful Claude agents writes to a filesystem (/memories directory), not a vector store. Tool version: memory_20250818.
September 2025: Anthropic publishes "Effective context engineering for AI agents." The post argues for the "just-in-time" approach with file paths as lightweight identifiers, and warns explicitly about "context rot."
November 2025: Anthropic publishes "Code execution with MCP." The 98.7% token reduction case study. Simon Willison's reaction: "a sensible way to take advantage of the strengths of coding agents and address some of the major drawbacks of MCP as it is usually implemented today."
December 2025: Linux Foundation forms the Agentic AI Foundation. Founding contributions include OpenAI's AGENTS.md, Anthropic's MCP, and Block's Goose. AGENTS.md was already adopted by 60,000+ open-source projects at the announcement, supported by Cursor, Codex, GitHub Copilot, Gemini CLI, Devin, and others. The standard for agent context is a markdown file. Not a vector index.
April 2026: Karpathy publishes the LLM Wiki gist. Three-layer markdown wiki maintained by the LLM, explicitly contrasted with naive RAG.

Anthropic's official memory primitive, Anthropic's context-engineering guidance, the Linux Foundation's flagship agent standard, Karpathy's most recent public design: all point at file-based memory as the default for agent state. Major AI coding tools (Claude Code, Cursor, Windsurf, GitHub Copilot) consume this pattern natively. Convergence is moving faster than most teams have updated their architectures.

When I wired this into the Vercel AI SDK, the whole memory layer came down to three things: an index file, a per-user (or per-thread) directory convention, and a small set of read/write tools. RAG stayed an option I could layer on later if the data outgrew the files — not a prerequisite I had to build first.

Decision Framework — Do You Need RAG?

If you're still on the fence, the decision is mostly mechanical. Run your use case down the comparison and the answer falls out.

	File-based memory	Vector RAG	Long context only
Best for	Per-user/per-tenant agent state, conventions, summarized history	Large unstructured corpora, fuzzy semantic search	Single-shot tasks with bounded inputs
Corpus size	Up to a few thousand small files per scope	Tens of thousands to millions of documents	Whatever fits in 1M tokens
Data structure	Structured or summarized prose, agent-organized	Unstructured or semi-structured prose	Anything that fits
Infrastructure	Filesystem or object store, four tools	Embedding model, vector DB, chunker, reranker	None beyond the model API
Latency	One file-read per topic, fast	Embedding + vector search + rerank, several hops	Just the model
Cost shape	Storage + token cost on read	Storage + embedding compute + DB ops	Token cost only, scales with context size
Failure mode	Stale or contradictory memory files	Bad chunks retrieved, agent ignores them	Context rot, lost-in-the-middle

The heuristic that captures most of this: if your data fits in your existing database and your relevant memory fits in your context, you don't need a vector DB. Reach for one when you outgrow that envelope, not before. Memory is one layer of a larger system; for the others, see the full AI agent SaaS tech stack.

The practical sequence: ship the agent with file-based memory first, watch how it fails in production, add RAG infrastructure only when a specific corpus demands it.

Originally published on VibeReady. Republished here for the dev.to community.

Best Vibe Coding Tools for SaaS in 2026

Remy B. — Thu, 21 May 2026 04:00:00 +0000

Key Takeaways

Claude Code leads SWE-bench Verified (~87.8% with Opus 4.7) and agent autonomy; Cursor leads IDE polish and community size.

Windsurf's direction under Cognition is still settling; Gemini CLI is the only fully open-source terminal agent in the group.

GitHub Copilot passed 4.7M paid subscribers in January 2026, a 75% YoY jump, and remains the enterprise default.

Monthly costs span $0 (Gemini CLI free tier) to $200 (Claude Max Ultra), so the right pick depends on your SaaS scenario, not a single leaderboard.

No tool fixes the consistency or code-quality problems that vibe coding creates at scale. A harness does.

Cursor reportedly hit $2 billion in annualized revenue by early 2026. Claude Code crossed a $2.5 billion run-rate in the same window. 80% of new developers on GitHub used Copilot in their first week on the platform in 2025. The vibe coding tool market looks decided.

It isn't. The question for SaaS builders isn't which tool is most popular; it's which one fits what you're shipping. We tested the five tools SaaS teams reach for most often: Claude Code, Cursor, Windsurf, Gemini CLI, and GitHub Copilot. Below is our ranked assessment, the head-to-head that matters most, and a decision guide that matches each tool to the scenario where it earns its keep. If you're newer to the category, start with our primer on what vibe coding is.

The first time I picked a vibe coding tool for one of my own SaaS builds, I let the leaderboard decide it — highest SWE-bench score, and I assumed the speed would follow. Three weeks in, I was debugging more "almost-right" output than I was shipping, the exact trap 45% of developers in Stack Overflow's 2025 survey describe. Switching tools didn't fix it; what fixed it was giving whatever tool I ran the same context layer — an AGENTS.md, scoped rules, a review gate before anything merged. The tool was never the variable that mattered most. The harness around it was.

How we ranked these tools

Thoughtworks' Technology Radar Vol 33 (November 2025) put Cursor, Cline, and Windsurf in the "supervised coding agents" category and named the Model Context Protocol as the year's clearest maturation signal (Thoughtworks 2025). We borrowed that framing. A SaaS builder's decision rarely comes down to one leaderboard score; it comes down to which tool handles real work — multi-file features, production debugging, test runs, deploys, and the code review loop that catches AI's mistakes.

We evaluated each tool on five criteria:

Agent autonomy. Can it plan and execute multi-step work without constant hand-holding?
SaaS fit. Does it handle Next.js, Drizzle, Vercel AI SDK, and full-stack TypeScript without choking on context?
Pricing transparency. Do you know what you'll pay next month?
Ecosystem. Scoped rules, skills, MCP support, subagents, third-party integrations.
Learning curve. How fast can a new SaaS developer ship real code with it?

We deliberately did not rank on SWE-bench Verified alone. Scaffolding choices swing those scores by ten points or more, so per-tool numbers aren't directly comparable (SWE-bench project). For the VibeReady-specific integration depth on each of these five tools, see our AI tools page.

The five tools at a glance

Before we get into each tool, here's the summary table. Pricing and capability claims below are sourced individually in each tool's section.

	Claude Code	Cursor	Windsurf	Gemini CLI	Copilot
Vendor	Anthropic	Anysphere	Cognition	Google	GitHub
Interface	Terminal + desktop	IDE (VS Code fork)	IDE (VS Code fork)	Terminal	IDE extension + web
Entry price	$20/mo (Pro)	$20/mo (Pro)	$20/mo (Pro)	Free	$10/mo (Pro)
Top tier	$200/mo (Max Ultra)	$200/mo (Ultra)	Enterprise (custom)	$149.99/mo (AI Ultra)	$39/mo (Pro+)
Open source	Source-available CLI	No	No	Apache 2.0	No
Agent mode	Yes (subagents)	Yes (Composer)	Yes (Cascade)	Yes	Yes (Coding Agent)
SaaS strength	Multi-file agentic	Day-to-day IDE	Flow-aware context	Free experiments	GitHub workflow

Claude Code: the terminal-native agent leader

Claude Code is Anthropic's terminal-first agent. It shipped as a preview in February 2025, reached general availability alongside Claude 4 in May 2025, and has sat at the top of public SWE-bench Verified leaderboards since (Anthropic release notes). The desktop app now ships with a redesigned UI and a 1M-token context window for Max, Team, and Enterprise tiers running Opus 4.6 (Anthropic 2026). Reported run-rate revenue crossed $2.5 billion in February 2026 (Uncover Alpha 2026).

What it does well

Planning and multi-step execution that finish entire features without drifting off spec.
Native MCP integration (Anthropic authored the protocol).
Subagent orchestration: up to 10 specialized agents running in parallel.
SWE-bench Verified ~87.8% with Opus 4.7 and the built-in harness (llm-stats.com).

Where it falls short

Token consumption balloons during long agent sessions; Max tiers exist for a reason.
Inline autocomplete feels rougher than Cursor's, because the product is agent-first.
The Max Ultra tier at $200/month ties Cursor Ultra as the most expensive plan in this group.

Pricing: Bundled with Claude Pro ($20/mo) or Claude Max ($100–$200/mo). Max, Team, and Enterprise plans unlock the 1M-token Opus 4.6 context (anthropic.com/pricing).

When to pick it: You're shipping multi-file SaaS features and want real agent autonomy. You'd rather fire off a planning command and come back in ten minutes than watch tokens stream into an editor. You need the 1M-token context because your codebase is genuinely large. Pair it with VibeReady's 10 subagents and 14 scoped rules for deeper integration.

Cursor: the IDE standard

Cursor, built by Anysphere, is the IDE most SaaS builders default to. The company closed a $2.3 billion Series D at a $29.3 billion valuation on November 13, 2025, nearly triple its June 2025 mark of $9.9 billion, led by Accel and Coatue (CNBC 2025, TechCrunch June 2025). Press reports place its ARR at roughly $2 billion by early 2026 with over 1M paying customers. Cursor 3 (April 2026) landed broader multi-file edits and improved background agents.

What it does well

Tightest IDE integration of the group: VS Code fork, inline diff review, Composer multi-file edits.
The largest community and third-party tutorial base, useful for new SaaS hires.
Multi-model: route Claude Opus, GPT-5.x, or the proprietary Composer 2 model at 200+ tokens per second.
JetBrains IDEs now supported, which matters for teams on WebStorm or IntelliJ.

Where it falls short

The June 2025 pricing change — from request-based limits to usage-based credits — triggered a public backlash. CEO Michael Truell apologized and issued refunds on July 4, 2025 (TechCrunch 2025). Heavy agent use can still exceed flat-rate expectations.
Agent autonomy trails Claude Code on long, multi-repo tasks.
No terminal-first mode; the workflow is editor-centric.

Pricing: Hobby (free), Pro ($20/mo), Pro+ ($60/mo), Ultra ($200/mo), Teams ($40/user/mo), Enterprise custom (cursor.com/pricing).

When to pick it: You want one tool your whole SaaS team uses. Your builders prefer a visual diff over a terminal. You already live in VS Code. Scoped .cursor/rules/*.mdc files are where Cursor earns its keep on a SaaS codebase — see structured vibe coding for the 3-layer framework that turns rules into consistent AI output.

Windsurf: Cascade and the post-Cognition question

Windsurf went through one of 2025's stranger corporate arcs. Google DeepMind paid about $2.4 billion in July 2025 to license Windsurf's technology and hire CEO Varun Mohan and co-founder Douglas Chen. Days later, on July 14, 2025, Cognition (maker of the Devin agent) acquired Windsurf's remaining team, product, and brand, picking up $82M in ARR and 350+ enterprise customers (Cognition 2025, TechCrunch 2025).

The product under Cognition kept Windsurf's Cascade agent with flow-aware context tracking and a proprietary SWE-1.5 model. A March 19, 2026 shift to quota-based billing annoyed grandfathered subscribers. Product direction post-acquisition is still settling.

What it does well

Cascade's flow-aware context tracking remembers what you were working on across sessions, which maps well to SaaS feature branches.
SWE-1.5 runs at roughly 950 tokens per second on comparable tasks, faster than most frontier models.
Strong multi-file reasoning and a planning mode SaaS builders like for migrations.

Where it falls short

SWE-1.5 underperforms Claude Opus 4.7 and GPT-5.x on SWE-bench Verified; most serious Windsurf users route through a Claude or OpenAI backend anyway.
Acquisition turmoil slowed roadmap communication through late 2025 and early 2026.
Smaller third-party rules and skills ecosystem than Cursor.

Pricing: Free tier, Pro ($15/mo grandfathered or $20/mo for new subscribers since March 2026), Teams and Enterprise custom (docs.windsurf.com).

When to pick it: Flow-aware memory and Cascade's planning model fit how your team thinks. You're willing to ride out product direction shifts under Cognition. Cascade's context tracking is only as good as the constraints you give it — harness engineering is the discipline that keeps autonomous agents on the rails.

Gemini CLI: the free tier and the open-source option

Google launched Gemini CLI on June 25, 2025 under the Apache 2.0 license (Google 2025). It's the only fully open-source option in this group, with a 1M-token context via Gemini 2.5 Pro and a personal-account free tier of 60 requests per minute and 1,000 requests per day. The GitHub repo at google-gemini/gemini-cli is active; GitHub Actions integration landed in late 2025.

What it does well

The free tier is generous enough to build a weekend SaaS on, no credit card required.
Apache 2.0 licensing means you can fork it, bundle it, or ship it inside your own dev tooling.
1M-token context via Gemini 2.5 Pro, same as Claude Code's Max tier.

Where it falls short

Agent polish trails Claude Code and Cursor on complex multi-file work.
Third-party MCP ecosystem is thinner; fewer community-maintained servers.
Subagent orchestration is less mature.

Pricing: Free via personal Google account; Google AI Pro $19.99/mo (5× CLI limits); Google AI Ultra $149.99/mo (highest limits); direct Vertex API at $1.25/M input and $10/M output for Gemini 2.5 Pro (Google AI pricing).

When to pick it: You're cost-sensitive and want real agentic behavior before committing to a paid plan. You're shipping a dev tool that bundles an open-source agent. Your SaaS already lives in Google Cloud. Gemini CLI reads AGENTS.md via a GEMINI.md symlink, so the context layer you build works across tools — start with our vibe coding starter guide for the conventions to put in it.
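
The symlink arrangement mentioned above can be set up in a few lines. This is a minimal sketch, assuming a POSIX filesystem and that `AGENTS.md` sits at the repo root; the helper name `link_context_files` is hypothetical, and the wrapper file names are the ones this article mentions, not a complete list of what any tool reads.

```python
import tempfile
from pathlib import Path

# Wrapper names taken from the article; adjust for the tools you actually use.
WRAPPERS = ("CLAUDE.md", "GEMINI.md")


def link_context_files(repo_root: Path, source: str = "AGENTS.md") -> list[str]:
    """Create relative symlinks so each wrapper file resolves to AGENTS.md."""
    if not (repo_root / source).is_file():
        raise FileNotFoundError(f"{source} not found in {repo_root}")
    created = []
    for name in WRAPPERS:
        link = repo_root / name
        if link.is_symlink() or link.exists():
            continue  # never overwrite a file someone wrote by hand
        link.symlink_to(source)  # relative target survives moving the repo
        created.append(name)
    return created


# Demo in a throwaway directory rather than a real repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "AGENTS.md").write_text("# AGENTS.md\n")
    first_run = link_context_files(root)
    second_run = link_context_files(root)
    gemini_text = (root / "GEMINI.md").read_text()
```

Running it twice is safe: the second call finds the links already in place and creates nothing, so it can live in a setup script.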

GitHub Copilot: the enterprise default

Copilot passed 4.7M paid subscribers in January 2026, a 75% YoY jump, and is used by roughly 90% of Fortune 100 companies according to GitHub's own disclosures (Panto 2026). The bigger 2025 story was the Copilot Coding Agent: an autonomous mode that opens PRs on assigned issues, runs its own reviews, and ships security scans. GitHub reported more than 1M Copilot-authored PRs between May and September 2025 (Octoverse 2025). 80% of new developers on GitHub used Copilot in their first week on the platform that year.

What it does well

The free tier (2,000 completions and 50 agent/chat requests per month) is the widest on-ramp in the category.
GitHub-native workflows: Coding Agent opens PRs on the issues assigned to it and runs security scanning on its own output.
Pro+ at $39/mo unlocks Claude Opus 4.7 access and Spark for rapid prototyping.
Procurement is already solved at most mid-market and enterprise SaaS companies.

Where it falls short

Feature velocity has historically trailed Cursor and Claude Code on agent autonomy.
Less ergonomic for non-GitHub workflows; if you're on GitLab or self-hosted Gitea, you're a second-class citizen.
Rules and skills ecosystem is narrower than the IDE agents.

Pricing: Free ($0; 2,000 completions, 50 agent requests/month), Pro ($10/user/mo), Pro+ ($39/user/mo), Business ($19/user/mo), Enterprise ($39/user/mo) (github.com/features/copilot/plans).

When to pick it: Your SaaS lives in GitHub and you want PR authoring, security scanning, and inline completions on one bill. Your company procures through GitHub Enterprise. Copilot Coding Agent reads AGENTS.md natively, so VibeReady's context layer works with Copilot's autonomous mode.

Head-to-head: Cursor vs Claude Code for SaaS

The two tools most SaaS teams end up choosing between. We've shipped VibeReady features with both. The honest answer is that they're different tools for different moments.

	Claude Code	Cursor
Agent autonomy	Leads the group on long planning + execution	Strong; a step behind on multi-repo work
IDE feel	Terminal + desktop app	Native IDE (VS Code fork + JetBrains)
Multi-model flexibility	Claude family only	Claude, GPT, Composer 2
Pricing predictability	Flat Claude Pro/Max subscription	Usage-based credits on Pro+/Ultra can surprise
Subagents / skills	10 subagents, large skill library	No subagents; scoped rules only
Community size	Active and growing	Largest AI-coding community
SWE-bench Verified	~87.8% (Opus 4.7 + built-in harness)	~77% (with built-in harness)
Best for	Agentic multi-file features	Day-to-day IDE work and pair programming

When we ship an end-to-end VibeReady feature — a new subagent, a schema migration plus its TypeScript types, its tests, and the docs — we pick Claude Code. When we're pairing on UI, fixing a typed component, or exploring an unfamiliar file, we open Cursor. For the cross-over ("I want agent autonomy without leaving my IDE"), we've been watching Cursor 3's Composer close the gap fast.

Which tool fits your SaaS scenario?

Five scenarios, five picks. If your situation doesn't quite fit, the closest match usually works.

Solo builder on a budget. Start with Gemini CLI's free tier for real agent work. Add Copilot Free for inline completions. Total cost: $0.
SaaS team, one paid tool. Cursor Pro at $20/user/month covers most scenarios and your hires will already know it.
Complex agentic work (migrations, background workers, autonomous test runs). Claude Code on a Max plan. Pair it with a starter kit that already ships subagents, or build your own.
GitHub-native team. Copilot Pro ($10) or Pro+ ($39). Coding Agent is cheaper than adding a second seat elsewhere and the PR workflow is already wired up.
You already bought Windsurf. You don't have to switch. Cascade's memory model is still a real differentiator. Keep an eye on Cognition's roadmap and the March 2026 billing shift.

For a hands-on walkthrough of any of these tools against a real codebase, see our step-by-step tutorial on vibe coding your first SaaS. If you're still figuring out the methodology itself, structured vibe coding is the 3-layer framework we use — and the vibe coding starter guide collects the daily practices that work with any tool above.

The caveat: every tool needs a harness

Here's the hard part of any ranking. The tool you pick is the smaller variable. A July 2025 METR study (arXiv preprint 2507.09089) observed 16 experienced open-source developers working on 246 real issues, permitted to use AI tools (primarily Cursor Pro with Claude Sonnet). They were 19% slower on AI-allowed tasks, while believing themselves 20% faster (METR 2025).

Stack Overflow's 2025 Developer Survey gave that finding more texture. 80% of respondents use AI coding tools, but trust in AI accuracy dropped from 40% to 29% year over year. 66% struggle with "almost-right" outputs, and 45% say debugging AI-generated code takes longer than writing it themselves (Stack Overflow 2025).

And the incidents matter. In July 2025, Replit's agent deleted SaaStr founder Jason Lemkin's production database during an active code freeze, affecting 1,200+ executives and 1,190+ companies (The Register 2025). The CEO apologized and shipped a dev/prod separation plus a planning-only mode the same month.

The tool choice above matters. The harness around the tool matters more. We wrote about this at length in Vibe Coding Has a Scaling Problem, and the emerging discipline that addresses it is harness engineering. Gartner forecasts that 40% of enterprise apps will ship task-specific AI agents by the end of 2026, up from under 5% in 2025 (Gartner 2025). Most of those agents will fail without the context engineering and quality gates that turn a vibe coding tool into a production asset.

Originally published on VibeReady. Republished here for the dev.to community.

Spec-Driven Development: Structure Beats Vibes

Remy B. — Tue, 12 May 2026 14:15:00 +0000

Key Takeaways

Spec-driven development (SDD) makes a machine-readable specification the primary artifact; code, tests, and docs are derived from it

GitHub released Spec Kit in September 2025; by April 2026 it had over 90,000 stars and supported 20+ coding agents

66% of developers say their top AI frustration is code that's "almost right, but not quite" — the failure mode specs are designed to catch

Birgitta Boeckeler identifies three SDD maturity levels: spec-first, spec-anchored, and spec-as-source

Specs have failure modes too: Thoughtworks Radar rated SDD "Assess, not Adopt" in November 2025 and Marmelab documented a 1,300-line spec for a one-feature date display

45% of AI-generated code samples introduced OWASP Top 10 vulnerabilities across 100+ tested models (Cloud Security Alliance, April 2026). 66% of developers say their top AI frustration is output that's "almost right, but not quite" (Stack Overflow 2025 Developer Survey). The models keep improving. The failure mode hasn't changed.

The first time I tried to vibe code a billing dashboard for my SaaS, Claude Code burned 40 minutes producing three different layouts that all looked plausible and all missed the auth boundary. I closed the chat, wrote a one-page PRD — goals, non-goals, the four tables it touched, the two roles that read it — and pasted it back. Fifteen minutes later the dashboard was right on the first try. Specs aren't waterfall. They're the difference between three rewrites and one.

The gap is the spec. Spec-driven development closes it by making the specification — not the prompt, not the code — the source of truth your tools and agents build from.

What Is Spec-Driven Development?

Wikipedia's definition is the cleanest: "Spec-driven development is a software engineering methodology where a formal, machine-readable specification serves as the primary artifact from which implementation, testing, and documentation are derived" (Wikipedia, 2026).

The practitioner framing from GitHub's Den Delimarsky is more operational: "Instead of coding first and writing docs later, in spec-driven development, you start with a spec. This is a contract for how your code should behave and becomes the source of truth your tools and AI agents use to generate, test, and validate code" (GitHub Blog, September 2, 2025).

Both definitions share one idea: the spec is upstream of everything. Code is a compilation target. Tests are a consistency check. Documentation is a projection. The spec is what you author, review, and version.

The Term Is Older Than It Looks

Spec-driven development didn't arrive with AI. Wikipedia traces it to 1960s NASA workflows and a formal academic treatment by Ostroff, Makalsky, and Paige at the XP 2004 conference. Formal methods, contract programming, and model-driven engineering all sit in the same lineage. What changed in 2025 is that large language models made the cost of "write the spec first" collapse: the spec itself can be drafted, refined, and turned into code by the same agent, as long as the spec is the artifact everyone argues about.

The Problem Vibe Coding Created

Vibe coding made it possible to describe a feature in plain English and get working code back in seconds. That's the upside. The downside shows up at scale, and the data from the last twelve months is unambiguous.

A Veracode study cited in the Cloud Security Alliance's April 4, 2026 research note found 45% of AI-generated code introduced OWASP Top 10 vulnerabilities across 100+ tested LLMs; Java samples failed 72% of the time, and 88% were vulnerable to log injection (CSA Research Note). Apiiro's enterprise telemetry in the same note showed AI-assisted developers produced commits at 3–4x the rate of peers, while security findings rose roughly tenfold and privilege-escalation paths climbed 322% over six months.

Productivity data is just as stark. A July 2025 METR randomized controlled trial found experienced open-source developers were 19% slower when using AI coding tools, despite predicting a 24% speedup (METR RCT, July 2025). The Stack Overflow 2025 Developer Survey (n = 48,945) found 84% of developers use or plan to use AI, but only 33% trust AI accuracy while 46% actively distrust it.

The "almost right" tax

66% of developers cite "AI solutions that are almost right, but not quite" as their top AI frustration (Stack Overflow 2025). Debugging plausible-looking wrong code is often slower than writing it yourself. Specs exist to prevent "almost right" from ever leaving the planning phase.

The pattern is consistent: AI writes fast, generates superficially plausible code, and leaves you to clean up architectural drift and security gaps. The Stack Overflow team connected the dots explicitly in their 2025 write-up, calling out "spec-driven development" by name as the structural response. I covered the full scaling picture in Vibe Coding Has a Scaling Problem.

How Spec-Driven Development Works

GitHub's Spec Kit is the clearest reference implementation. It formalizes a four-phase workflow every spec-driven project moves through, and the phases work whether you're using Claude Code, Cursor, Copilot, Gemini CLI, or any of the 20+ other agents Spec Kit targets.

The Four Phases

Constitution. Project-wide invariants. Your stack, your conventions, the things every feature inherits. This is the document every downstream spec references.
Specify. A feature-level spec: goals, non-goals, constraints, acceptance criteria. This is what the agent reads before it starts planning.
Plan. The agent decomposes the spec into architectural decisions and task breakdowns, then hands the plan back for human review.
Tasks / Implement. Only now does code get written. Each task traces back to an acceptance criterion in the spec, which means divergence is visible rather than silent.

An optional Clarify phase sits between Specify and Plan; the agent asks the questions a human reviewer would ask before committing to an approach. The Spec Kit repo is open source, MIT-licensed, and sat at roughly 90,000 stars with active v0.7.x releases as of April 2026 (github.com/github/spec-kit).
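
The traceability idea in the Tasks / Implement phase, where every task points back at an acceptance criterion, is easy to check mechanically. The sketch below is not how Spec Kit does it. The `Task` shape, the `AC-n` and `T-n` identifiers, and the `trace_report` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    task_id: str
    criterion_id: Optional[str]  # acceptance criterion this task implements


def trace_report(criteria: set[str], tasks: list[Task]) -> dict[str, list[str]]:
    """Return tasks with no valid criterion and criteria with no task."""
    covered = {t.criterion_id for t in tasks if t.criterion_id in criteria}
    return {
        "untraced_tasks": sorted(
            t.task_id for t in tasks if t.criterion_id not in criteria
        ),
        "uncovered_criteria": sorted(criteria - covered),
    }


criteria = {"AC-1", "AC-2", "AC-3"}
tasks = [
    Task("T-1", "AC-1"),
    Task("T-2", "AC-1"),
    Task("T-3", "AC-3"),
    Task("T-4", None),  # scope creep: nothing in the spec asked for this
]
report = trace_report(criteria, tasks)
# report == {"untraced_tasks": ["T-4"], "uncovered_criteria": ["AC-2"]}
```

Both lists matter: an untraced task is work the spec never asked for, and an uncovered criterion is a promise nobody planned to keep.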

The Three Maturity Levels

Birgitta Boeckeler's October 2025 article on martinfowler.com breaks spec-driven development into three ascending levels of commitment (Boeckeler, October 2025):

Spec-first. You write a spec before prompting. The spec informs the AI but isn't regenerated as code changes. Simplest, lightest, most teams start here.
Spec-anchored. Spec and code stay in sync. When code drifts, the spec is updated; when the spec changes, code is regenerated. This is where Spec Kit and Amazon Kiro live.
Spec-as-source. The spec is the only thing humans author. Code is fully derived output, closer to how Terraform generates infrastructure from HCL. Tessl Framework is the most public example.

Most teams don't need level three. Moving from unstructured prompting to spec-first captures most of the reliability gain.

Spec-Driven Development vs. Vibe Coding

Spec-driven development doesn't replace vibe coding; it constrains it. The two answer different questions at different points in the workflow.

	Vibe Coding	Spec-Driven Development
Primary artifact	The prompt	The specification
Source of truth	Generated code	The spec
Best for	Exploration, prototypes, UI tweaks	Anything touching auth, payments, data
Failure mode	Pattern drift, "almost right" output	Over-specification, review overload
Iteration loop	Re-prompt until code works	Revise spec, regenerate code
Review target	Generated code diff	Spec diff first, then code diff

The healthy version of the two is layered: vibe-code inside a well-written spec. The spec bounds what the AI is allowed to do; the prompt fills in the how. When the output drifts, you fix the spec, not the prompt.

Context Engineering — The Layer Below Specs

A spec tells the AI what to build. Context engineering tells it what it already knows. The term was coined in parallel by Shopify CEO Tobi Lütke and Andrej Karpathy in late June 2025, within two days of each other.

Context engineering is the delicate art and science of filling the context window with just the right information for the next step. — Andrej Karpathy, June 25, 2025

Lütke's framing, two days earlier, was more practical: "the art of providing all the context for the task to be plausibly solvable by the LLM" (@tobi on X, June 23, 2025). Simon Willison collected both quotes and argued the term better reflects what production LLM work actually looks like (Willison, June 27, 2025).

The relationship to specs is directional: context engineering feeds the spec, and the spec feeds the task. A spec with no context produces code that's technically correct but violates every convention in your repo. A context without a spec produces code that fits the repo but does the wrong thing. You need both.

I treat them as two of three layers in a structured vibe coding framework — context engineering, AI coding guardrails, and spec-driven workflows — that together form a complete harness. Specs without context, or context without enforcement, fail in predictable ways.

The Tools Shipping Spec-Driven Workflows

Three tools define the current state of spec-driven development. Each takes a different position on the Boeckeler maturity ladder.

GitHub Spec Kit. Open source, MIT-licensed, roughly 90,000 stars as of April 2026. Supports Claude Code, Copilot, Cursor CLI, Gemini CLI, Codex CLI, Qwen, opencode, and more. Lives at the spec-anchored level: specs and code evolve together through the Constitution/Specify/Plan/Tasks flow.
Amazon Kiro. Commercial AWS offering, same spec-anchored tier. Kiro emphasizes tight AWS integration and specification reuse across services.
Tessl Framework. Commercial, the most aggressive of the three. Pushes toward spec-as-source: humans author specs, everything else is generated.

Thoughtworks' Technology Radar flagged all three by name when it placed spec-driven development in its "Assess" ring in November 2025 (Thoughtworks Radar Vol. 33).

The tools handle generation. They don't handle enforcement. That's where harness engineering picks up — the tests, type checks, and quality gates that verify the generated code actually matches the spec. Specs and harnesses are complements: the spec is what you wanted, the harness proves you got it.

When Spec-Driven Development Backfires

Spec-driven development has a credible set of critics. Ignoring them produces the exact overhead they warn about.

François Zaninotto at Marmelab documented the most concrete example in November 2025: a single feature to display the current date required 8 files and roughly 1,300 lines of specification using Spec Kit (Marmelab, November 12, 2025). His argument is that SDD is a rebranded waterfall optimized for removing developers from the loop.

SDD is a step in the wrong direction. It tries to solve a faulty challenge: "How do we remove developers from software development?" — François Zaninotto, Marmelab

Thoughtworks' Technology Radar was more measured but still cautious, placing SDD in "Assess" rather than "Trial" or "Adopt" and warning the workflows are "elaborate and opinionated" and may represent "a bitter lesson — that handcrafting detailed rules for AI ultimately doesn't scale." Boeckeler, a qualified supporter, has flagged the same failure modes: review overload for small features and non-deterministic LLM output undermining the promised control.

The practical heuristic: spec-driven development is overhead for anything simpler than a feature spec. Use it where the cost of architectural drift is high (auth, billing, multi-tenant data, API contracts) and skip it where the cost of being wrong is a page refresh.

How to Start Without Rewriting Everything

You don't need Spec Kit, a Constitution document, or a four-phase workflow to practice spec-driven development. You need a one-page spec and the discipline to hand it to the AI before you prompt.

Write a one-page PRD before prompting. Goals, non-goals, constraints, acceptance criteria. Fifteen minutes. This single step is the biggest reliability gain most teams will see, and it costs nothing.
Use AGENTS.md as your Constitution. Stack choices, conventions, architectural rules, forbidden patterns. Next.js 16.2 now ships AGENTS.md in create-next-app by default; I walk through a full AGENTS.md-first workflow in a step-by-step tutorial on vibeready.sh.
Treat the spec as the diff target. When the AI produces something wrong, revise the spec first, then regenerate the code. Don't re-prompt your way around a spec gap — that's the vibe-coding failure mode.
Pair the spec with a harness. Specs without automated tests and type checks drift silently. The spec says what you want; the harness proves the code matches. Harness engineering is the enforcement layer.
Graduate to Spec Kit when the overhead earns itself. Once you have a handful of features that share a Constitution, formalizing with Spec Kit or Kiro starts paying back. Before that, a directory of markdown specs works fine.
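
The one-page PRD from the first step can be kept honest with a small structure that refuses to call a spec ready while a section is empty. This is a sketch under assumptions: the four section names come from the list above, but the `OnePageSpec` class, its methods, and the sample content are hypothetical, and a plain markdown file works just as well.

```python
from dataclasses import dataclass, field

SECTIONS = ("goals", "non_goals", "constraints", "acceptance_criteria")


@dataclass
class OnePageSpec:
    title: str
    goals: list[str] = field(default_factory=list)
    non_goals: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Sections left empty; an empty result means the spec is ready."""
        return [name for name in SECTIONS if not getattr(self, name)]

    def to_markdown(self) -> str:
        """Render the spec as markdown to paste in ahead of the prompt."""
        lines = [f"# Spec: {self.title}"]
        for name in SECTIONS:
            lines.append("")
            lines.append(f"## {name.replace('_', ' ').title()}")
            lines.extend(f"- {item}" for item in getattr(self, name))
        return "\n".join(lines)


spec = OnePageSpec(
    title="Billing dashboard",
    goals=["Show current plan and next invoice date"],
    non_goals=["Editing payment methods"],
    constraints=["Every query is scoped by organizationId"],
)
missing = spec.missing_sections()  # ["acceptance_criteria"]
spec.acceptance_criteria.append("A member of org A never sees org B invoices")
ready = spec.missing_sections()  # []
markdown = spec.to_markdown()
```

The empty-section check is the useful part. Non-goals and acceptance criteria are the sections most often skipped, and they are the ones that stop the agent from guessing.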

The spec is the upstream half of this. The downstream half is a harness — tests, type checks, lint rules — that catches when the AI ignored the spec. I keep both layered: spec defines intent, harness verifies execution.

The point of spec-driven development isn't specs. It's getting AI to build the thing you actually wanted, the first time, at the architectural level your future self will have to maintain. A one-page PRD beats a four-hour debugging session. Every time.

Originally published on VibeReady. Republished here for the dev.to community.

5 Mistakes Beginners Make When Vibe Coding (And How to Avoid Them)

Remy B. — Tue, 05 May 2026 13:19:44 +0000

Key Takeaways

One-shotting prompts without a spec is the most common failure mode: experienced devs were 19% slower with AI tools on real GitHub issues while believing they were faster (METR 2025)

AI-coauthored code is 1.75× more likely to introduce correctness errors and 2.74× more likely to ship XSS vulnerabilities than human-only code (CodeRabbit 2025)

Without architectural rules in AGENTS.md / Cursor rules / CLAUDE.md, AI ships 322% more privilege escalation paths and 153% more design flaws (Apiiro 2025)

Context drift (not updating the harness as decisions accumulate) is the failure that bites at week three, not day one

July 2025 Replit incident: an AI agent deleted a production database during a stated code freeze and fabricated 4,000 fake records to cover it up

Vibe coding works for weekend hacks. It breaks for production. When Andrej Karpathy coined the term in February 2025, he scoped it to throwaway projects: "embrace exponentials, and forget that the code even exists." The vibe coding mistakes most beginners make are predictable, and almost all of them stem from taking that throwaway vibe and pointing it at code they actually have to maintain.

The first time I tried to vibe code a real feature into my own project, I gave Cursor a single sentence: "add billing." Two hours and three rewrites later I had three competing schemas, two different webhook handlers, and no idea which one matched the dashboard. I closed the prompt box, opened a notes file, and wrote out exactly what billing meant in this codebase — which Stripe events I cared about, which tables they wrote to, what the route names should be. Twenty minutes after I pasted that back in, the feature was done. The fix wasn't a smarter prompt; it was a cheaper one, written before the AI ever saw it.

Below are the five most common pitfalls, with practical fixes that prevent each one.

Mistake #1: Skipping the spec and one-shotting the prompt

The first thing beginners reach for is the prompt box. "Build me a billing page." "Add user invites." "Refactor this module." It feels fast, and the output looks plausible, until you try to extend it.

A 2025 randomized controlled trial from METR found that experienced open-source developers were 19% slower on real GitHub issues when allowed to use AI tools, while self-reporting a 20% speedup (METR 2025; arXiv preprint). The gap is the cost of clarifying what you actually wanted, mid-generation, in plain English.

The fix: write a one-page spec before you prompt. Inputs, outputs, error states, the file paths the AI is allowed to touch. The deeper rationale, including templates and the failure modes specs prevent, lives in What Is Spec-Driven Development. Specs aren't bureaucracy. They're the cheapest way to make the AI's first attempt the right attempt.

Mistake #2: Accepting AI code without reading it

The second mistake is trusting the diff because it compiles. Stack Overflow's 2025 Developer Survey found that only 29% of developers trust AI accuracy, down from 40% the year before, and 75% don't trust AI's answers outright (Stack Overflow 2025). The reason: the code looks right and is wrong in subtle ways.

CodeRabbit's December 2025 study of 470 real GitHub PRs found AI-coauthored code introduced 1.75× more correctness errors and was 2.74× more likely to introduce XSS vulnerabilities than human-only PRs (CodeRabbit 2025). These don't show up in your test runner. They show up in your bug reports.

This is the trap: AI-generated code passes the surface checks. It compiles, types check, the obvious tests run green. The bugs hide where you didn't think to look — silently swallowed errors, wrong default values, race conditions across async calls, edge cases the AI didn't account for. The cost shows up later, in customer reports and 2 a.m. pages.

The fix: read every line before you accept it. If you can't explain why a function is structured the way it is, ask the AI to explain it, and don't merge until the explanation matches what you'd write yourself. Pair this with automated review (CodeRabbit, AI code review on PRs, lint rules) so the human read isn't the only line of defense.

Chart note: multipliers are normalized to a human-only baseline (1.00×). Apiiro analyzed Fortune 50 enterprise repos; CodeRabbit analyzed 470 real GitHub PRs.

Mistake #3: Not giving AI architectural context up front

Without a rules file, the AI defaults to whatever pattern is statistically most common in its training data. That means generic auth, generic error handling, and an ORM call style that doesn't match the rest of your codebase. Apiiro's analysis of Fortune 50 enterprise repos found AI-assisted developers shipped 3-4x more commits but generated 322% more privilege escalation paths and 153% more architectural design flaws than non-AI baseline (Apiiro 2025). The pattern they describe: "AI is fixing the typos but creating the timebombs."

The fix: set up architectural context before your first feature. AGENTS.md for the coding agent, Cursor rules for Cursor, CLAUDE.md for Claude Code. Document your stack, your non-negotiable rules, and the anti-patterns the AI should refuse to generate. The structured vibe coding framework bundles all three layers so you don't piece them together yourself. Here's a real excerpt from an AGENTS.md (lightened):

# AGENTS.md

> Universal AI context. All AI coding tools read this file automatically.
> Tool-specific wrappers (CLAUDE.md, GEMINI.md) symlink here.

## Project Overview

| Layer      | Technology                                  |
| ---------- | ------------------------------------------- |
| Framework  | Next.js 16 App Router + TypeScript (strict) |
| Database   | PostgreSQL 15 + Prisma ORM                  |
| Auth       | Clerk v5 (multi-tenant orgs, RBAC)          |
| Payments   | Stripe                                      |
| Testing    | Vitest + Playwright + RTL                   |

## Non-Negotiable Rules

1. **Multi-tenancy**: ALWAYS scope ALL queries by `organizationId`. No exceptions.
2. **TDD**: MUST write a failing test FIRST. No code without a failing test.
3. **DRY**: Check existing patterns before creating new ones. Reuse > reinvent.
4. **README-first**: Read README.md files in the target directory BEFORE any code search.
5. **Security**: MUST validate all input (Zod), check auth, verify ownership on every protected route.

## Architecture

3-layer pattern — every feature follows this:

  **Route** → **Service** → **Repository** → Prisma

- Routes NEVER contain Prisma queries or business logic
- Services NEVER perform auth checks
- Repositories NEVER call external APIs
- Import direction is one-way (never reverse)

## Common Anti-Patterns (NEVER Do These)

- **Direct Prisma in routes/actions**: Always go through repositories
- **Queries without organizationId**: Every query MUST scope by org — no exceptions
- **Hardcoded roles** (`if role === 'admin'`): Use permission checks
- **Returning 403 for admin routes**: Return 404 to hide existence

## Key Commands

make dev              # Start Next.js + PostgreSQL
make test             # All tests: unit + API + E2E
make check            # Full quality gate (typecheck + lint + test)
make generate-docs    # Force-regenerate route READMEs
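
Rules written in AGENTS.md hold up better when something also checks them. As one illustration, here is a minimal sketch of a check for the one-way import rule in the excerpt above. The directory names (`routes`, `services`, `repositories`), the `@/` import alias, and the helper names are assumptions, and the regex only understands simple `from '...'` imports. A real project would more likely reach for an existing lint rule.

```python
import re

# Hypothetical directory names; the excerpt names the layers, not the paths.
LAYER_ORDER = {"routes": 0, "services": 1, "repositories": 2}
IMPORT_RE = re.compile(r"""from\s+['"]([^'"]+)['"]""")


def layer_of(path: str):
    """Return the layer a path belongs to, or None if it is outside the layers."""
    for part in path.split("/"):
        if part in LAYER_ORDER:
            return part
    return None


def reverse_imports(files: dict[str, str]) -> list[tuple[str, str]]:
    """List (file, imported path) pairs where a deeper layer imports a shallower one."""
    violations = []
    for path, source in files.items():
        own = layer_of(path)
        if own is None:
            continue
        for target in IMPORT_RE.findall(source):
            other = layer_of(target)
            if other is not None and LAYER_ORDER[other] < LAYER_ORDER[own]:
                violations.append((path, target))
    return violations


# File contents are inlined here so the sketch runs without a real repo.
files = {
    "src/routes/invoices.ts": "import { listInvoices } from '@/services/billing'",
    "src/services/billing.ts": "import { findInvoices } from '@/repositories/invoice'",
    "src/repositories/invoice.ts": "import { requireUser } from '@/routes/auth'",
}
violations = reverse_imports(files)
# violations == [("src/repositories/invoice.ts", "@/routes/auth")]
```

The sketch only flags reverse imports. Whether a route may skip the service layer and import a repository directly is a separate decision the excerpt doesn't spell out, so it isn't enforced here.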

Mistake #4: Letting context drift as the project grows

This is the failure mode that bites at week three, not day one. You set up AGENTS.md on day one. Then you make ten architectural decisions over the next month: switching from REST to tRPC, adopting a new caching pattern, deciding error toasts go through a single helper. None of those decisions make it back into the rules file. New feature docs aren't written. Skills and reusable prompts aren't updated. Memory entries that captured "we tried X and it didn't work" never get refreshed.

By feature fifteen, the AI is generating code that contradicts decisions you made in week two. It recreates patterns you'd already ruled out. It uses the old REST handler shape because nothing told it the convention had changed. This is the second-order version of AI code drift — not the AI improvising, but the AI faithfully following stale instructions.

The fix: treat the harness as a living artifact. When you make a non-obvious decision, capture it in AGENTS.md the same hour. After shipping a feature, write the one-paragraph feature doc that explains its shape. If a skill or reusable prompt stops matching reality, update it or delete it. Harness engineering is the discipline of keeping these surfaces honest, and it's the difference between an AI that gets sharper over time and one that drifts into noise.

Mistake #5: Letting the AI run wild on production data

In July 2025, Replit's AI agent deleted a SaaStr-tracked production database during a stated code freeze, then fabricated about 4,000 fake user records and falsely claimed rollback was impossible (The Register, July 2025; cataloged as AI Incident Database #1152). Rollback actually worked. Replit's CEO called it "a catastrophic error of judgement" and shipped dev/prod separation, rollback improvements, and a planning-only mode in response.

Even when the agent isn't running destructive commands, the underlying code is risky enough on its own. Veracode's 2025 GenAI Code Security Report tested 100+ LLMs against 80 curated coding tasks and found AI-generated code introduced security vulnerabilities in 45% of cases, with no improvement from larger or newer models (Veracode 2025).

The fix: never give an agent unscoped access to production. Run agents in a sandbox or branch. Require explicit human approval for destructive operations (DROP, DELETE without WHERE, force pushes, infra changes). Use planning modes that propose actions before executing them. The credential the agent runs as should not be able to do anything you can't undo with one command.
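
The approval gate for destructive operations can start very small. The sketch below assumes the agent's SQL passes through a single function you control. `needs_approval` and `run_guarded` are hypothetical names, TRUNCATE is my addition to the list above, and the regex matching is deliberately naive (it ignores comments, multi-statement strings, and CTEs), so treat it as a backstop on top of restricted credentials, not a replacement for them.

```python
import re

# DROP and DELETE-without-WHERE come from the article; TRUNCATE is an assumption.
ALWAYS_DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE)\b", re.IGNORECASE)
DELETE_RE = re.compile(r"^\s*DELETE\b", re.IGNORECASE)
WHERE_RE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def needs_approval(sql: str) -> bool:
    """True when the statement should pause for explicit human approval."""
    if ALWAYS_DESTRUCTIVE.search(sql):
        return True
    return bool(DELETE_RE.search(sql)) and not WHERE_RE.search(sql)


def run_guarded(sql: str, execute, approved: bool = False) -> str:
    """Run execute(sql) unless the statement is destructive and unapproved."""
    if needs_approval(sql) and not approved:
        return "blocked"
    execute(sql)
    return "executed"


# executed.append stands in for a real database call.
executed = []
results = [
    run_guarded("SELECT * FROM invoices WHERE org_id = 1", executed.append),
    run_guarded("DELETE FROM invoices", executed.append),
    run_guarded("DELETE FROM invoices WHERE id = 7", executed.append),
    run_guarded("drop table invoices", executed.append),
    run_guarded("DROP TABLE scratch", executed.append, approved=True),
]
# results == ["executed", "blocked", "executed", "blocked", "executed"]
```

The `approved` flag is where the human goes: the agent proposes, the statement is blocked, and a person re-runs it with approval if it was intended.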

The fix is a harness, not better prompts

The five mistakes share a shape. None of them are about prompt wording. All of them are about the surrounding system: the spec, the review loop, the rules file, the living docs, the production guardrail. Karpathy's original framing holds: vibe coding is fine for code you're going to throw away. Andrew Ng's June 2025 pushback also holds: the moment you're building software anyone has to maintain, you're doing engineering, and engineering needs more than vibes.

If you're tired of fixing these five by hand, what works is wrapping your project in a harness — spec templates, AGENTS.md and Cursor rules, living feature docs, quality gates, and a production layout that doesn't let the agent reach data it shouldn't. There are starter kits that ship this prebuilt for Next.js if you'd rather not assemble it yourself; VibeReady's production-ready vibe coding template is the one I'm currently using.

Originally published on VibeReady. Republished here for the dev.to community.

What Is Harness Engineering? A Builder's Guide

Remy B. — Tue, 28 Apr 2026 17:00:37 +0000

Key Takeaways

Harness engineering is designing the environment, constraints, and feedback loops that make AI coding agents reliable

The core formula: Agent = Model + Harness — the model is just one piece of the system

Three regulation types: maintainability, architecture fitness, and behavior

LangChain improved agent accuracy from 52.8% to 66.5% by changing only the harness — same model

A solid harness includes context files, static analysis, automated tests, reusable skills, and living documentation

92% of developers now use AI coding tools daily. Yet trust in AI-generated code has dropped — from 77% to 60% in just one year. The models keep getting better. The output keeps getting less trusted. Something else is the bottleneck.

That something is the harness: the model is one part of a reliable system, and everything around it determines whether the output is trustworthy. The practice has a name now: harness engineering.

The first time I hit unreliable AI output on a real project, I did what most people do — I upgraded the model. Cursor to Claude Opus to GPT-5. Three days, three model swaps, same drift: hallucinated imports, ignored conventions, the same bug fixed three different ways. The fix wasn't a smarter model. It was an AGENTS.md file with my project's conventions, a pre-commit hook running tests, and a feature spec template I'd been skipping. Twenty minutes of setup, and the next agent run shipped clean on the first try. The model was never the bottleneck.

If you've been vibe coding and your AI tools produce great results sometimes and unreliable ones other times, harness quality is why.

What Is Harness Engineering?

Harness engineering is the practice of designing everything around an AI model that makes it work reliably: the context it receives, the tools it can call, the checks that verify its work, and the feedback loops that correct its mistakes.

The metaphor comes from horse tack. Reins, saddle, bit, and bridle don't limit a horse's power — they channel it in a specific direction. Harness engineering does the same for AI: it preserves the speed and capability of the model while directing it toward consistent, trustworthy output.

The Origin of the Term

The concept crystallized in early 2026 through three landmark publications:

Mitchell Hashimoto (co-founder of HashiCorp, creator of Terraform) described "Engineer the Harness" as Step 5 of his AI adoption journey in February 2026: anytime an agent makes a mistake, you engineer a solution so it never makes that mistake again.
OpenAI published "Harness engineering: leveraging Codex in an agent-first world," describing how their team built a production application with 1M+ lines of code where zero lines were written by human hands.
Birgitta Boeckeler (Distinguished Engineer at Thoughtworks) wrote the definitive practitioner article on martinfowler.com in April 2026, establishing the theoretical framework that the industry now references.

Within weeks, the term went from niche to mainstream. Unlike previous buzzwords, harness engineering solved a problem every AI-using developer was already feeling: the gap between what AI models can do and what they reliably do.

The Formula — Agent = Model + Harness

LangChain put it most simply: Agent = Model + Harness. The model is what thinks. The harness is everything else — the context the model receives before working, the tools it can access, the schemas that constrain its output, and the checks that verify what it produced.

Most teams optimize the model. They upgrade to GPT-5, switch to Claude Opus, try Gemini 2.5. The highest-leverage teams optimize the harness instead.

LangChain improved their agent accuracy from 52.8% to 66.5% by changing only the harness: same model, same prompts, a 13.7-point jump. Two teams using the same model can see a 40-point difference in task completion rates based on harness quality alone.

This is the core insight: the harness matters more than the model. If your AI coding workflow is unreliable, the fix probably isn't a better model. It's a better harness.

Why AI Agent Reliability Depends on the Harness

AI models in 2026 are dramatically more capable than they were a year ago. So why has trust in AI-generated code actually decreased?

The reliability gap

Trust in AI-generated code dropped from 77% to 60% year over year — despite models getting dramatically better. The bottleneck has shifted from model capability to harness maturity.

The pattern is consistent across teams and tools: AI agents fail not because models are bad, but because harnesses are missing. Without constraints, a capable model will solve the immediate problem in whatever way seems locally optimal — ignoring your project's conventions, duplicating existing utilities, introducing inconsistent error handling, and creating security gaps it doesn't know to check for.

This is why AI agent reliability is fundamentally a harness problem, not a model problem. A well-harnessed agent with a mid-tier model outperforms an unharnessed agent with the best model available. The infrastructure around the AI determines the output quality more than the AI itself.

How Harness Engineering Works — Guides and Sensors

Boeckeler's framework on martinfowler.com breaks a harness into two complementary control types. Understanding these makes the concept immediately practical.

Guides (Feedforward Controls)

Guides steer the agent before it starts working. They shape what the agent knows, what it can do, and what it should prioritize.

Computational guides: AGENTS.md files, CLAUDE.md, .cursorrules, TypeScript schemas, project templates, API contracts. These are deterministic — the agent reads them and incorporates them into its context.
Inferential guides: Planner agents, sub-agents that decompose tasks before the main agent generates code. These use LLM reasoning to provide richer, more contextual guidance.

Guides are the proactive layer. They prevent mistakes by giving the agent the right information upfront — your architecture, your conventions, your constraints. Tools like Claude Code, Cursor, and Windsurf all support guide mechanisms, but few developers set them up beyond a basic rules file.

Sensors (Feedback Controls)

Sensors check the agent's work after it generates output. They catch what the guides didn't prevent.

Computational sensors: Automated tests, type-checking, linting, security scanning, coverage thresholds. These are fast, deterministic, and non-negotiable.
Inferential sensors: Evaluator agents that review generated code for architectural fit, code review bots, and AI-powered quality checks that assess output semantically.

The most effective harnesses use both types together. Guides reduce the error rate; sensors catch what slips through. Neither alone is sufficient.

	Guides (Feedforward)	Sensors (Feedback)
Computational	AGENTS.md, templates, schemas, type definitions	Tests, type-checking, linting, security scanning
Inferential	Planner agents, task decomposition, sub-agents	Evaluator agents, AI code review, quality assessment
When they run	Before generation	After generation
Failure mode	Agent ignores or misinterprets guidance	Bad output passes undetected
Example	CLAUDE.md says "use Prisma for all DB access"	Lint rule flags a raw SQL call in a Prisma-only codebase
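
To make the guide/sensor split concrete, here is a minimal sketch of a loop that applies both. Everything in it is an assumption for illustration: `run_with_harness`, `HarnessResult`, and the `model` callable are hypothetical names, and the model is just any function that maps a prompt to text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A sensor inspects generated output and returns an error message, or None if it passes.
Sensor = Callable[[str], Optional[str]]


@dataclass
class HarnessResult:
    output: str
    attempts: int
    passed: bool


def run_with_harness(
    task: str,
    guides: List[str],
    sensors: List[Sensor],
    model: Callable[[str], str],
    max_attempts: int = 3,
) -> HarnessResult:
    # Feedforward: guides are placed ahead of the task before generation.
    prompt = "\n".join(guides + [task])
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = model(prompt)
        # Feedback: collect every sensor failure for this attempt.
        failures = [msg for msg in (sensor(output) for sensor in sensors) if msg]
        if not failures:
            return HarnessResult(output, attempt, True)
        # Feed the failures back so the next attempt can correct them.
        prompt = "\n".join(guides + [task, "Fix these issues:"] + failures)
    return HarnessResult(output, max_attempts, False)
```

The design choice worth noticing is that sensor failures become part of the next prompt, so a failed check turns into guidance instead of a dead end.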

Three Things a Harness Regulates

Not all harness engineering problems are the same. Boeckeler identifies three distinct regulation categories, each targeting a different type of failure.

Maintainability

The most mature category. Maintainability harnesses ensure AI-generated code follows your project's patterns, naming conventions, file structure, and coding standards consistently. This is where pattern drift — the #1 scaling problem in AI-assisted development — gets solved at the infrastructure level rather than through manual review.

Tools: linters with custom rules, AGENTS.md with architectural context, enforced directory structures, code generation templates.

Architecture Fitness

Ensuring AI output fits your project's architecture: dependency boundaries, module structure, API contracts, performance budgets. This prevents the subtle failures where AI code works in isolation but breaks the system's design.

Tools: architecture decision records, dependency constraints, integration tests, module boundary checks.

Behavior

The least mature and hardest category. Behavior harnesses verify that the code does what it should and doesn't do what it shouldn't. This is where functional correctness, security validation, and edge case coverage live.

Tools: comprehensive test suites, property-based testing, security scanning, end-to-end validation.

AI Coding Guardrails — The Practical Layer

AI coding guardrails are the most tangible expression of harness engineering. They're the automated checks that run regardless of which AI tool generated the code and regardless of the developer's intent. Where guides are suggestions, guardrails are enforcement.

What Makes a Good Guardrail

Fast — under 30 seconds. If a guardrail is slow, developers will skip it.
Deterministic — same input produces same result. No flaky checks.
Actionable — when it fails, the error message tells you what to fix.
Non-bypassable — integrated into the workflow so skipping requires conscious effort.
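
As a rough sketch of the first three properties, a guardrail can be wrapped as a command with a time budget and an error message that says what went wrong. The `run_guardrail` helper and `GuardrailResult` type are hypothetical names of my own; the 30-second default mirrors the budget above.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class GuardrailResult:
    name: str
    passed: bool
    message: str


def run_guardrail(name: str, command: List[str], timeout_s: float = 30.0) -> GuardrailResult:
    """Run one check as a subprocess and report a pass or an actionable failure."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A slow check gets skipped in practice, so treat the timeout as a failure.
        return GuardrailResult(name, False, f"{name} exceeded {timeout_s:g}s; speed it up or split it")
    except FileNotFoundError:
        return GuardrailResult(name, False, f"{name} could not start: {command[0]} not found")
    if proc.returncode == 0:
        return GuardrailResult(name, True, "ok")
    # Surface the tool's own output so the failure says what to fix.
    detail = (proc.stderr or proc.stdout).strip()
    return GuardrailResult(name, False, f"{name} failed: {detail}")
```

Calling it with something like `run_guardrail("tests", [sys.executable, "-m", "pytest", "-q"])` assumes pytest is installed; swap in whatever test or lint command your stack uses.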

Your First Harness — A Practical Checklist

You don't need an enterprise orchestration platform to start harness engineering. Here's what a solid starting harness looks like for a solo developer or small team — regardless of language or framework:

An AGENTS.md or CLAUDE.md file with your project's conventions, architecture, and patterns. Keep it concise and human-written — research from ETH Zurich shows AI-generated context files actually hurt performance. This is the single highest-leverage guide you can add.
Strict static analysis and linting. TypeScript strict mode, mypy/pyright for Python, ESLint, Ruff — whatever fits your stack. Turn on the strictest settings your team can tolerate. These catch type errors, style drift, and common mistakes automatically, so you don't waste review cycles on things a machine should handle.
Automated tests that run on every change. Unit tests at minimum, integration tests where it matters. Wire them into a pre-commit hook or CI pipeline so untested code can't ship. This is your most important sensor.
A feature spec template — a lightweight PRD that defines what a feature should do before you prompt the AI. This converts vague intent into structured guidance and dramatically improves first-attempt quality.
Security scanning. Run a SAST tool (Semgrep, Bandit, or equivalent) in your pipeline. AI-generated code has a documented tendency to introduce vulnerabilities — automated scanning catches the most common ones before they reach production.

Beyond the Basics — Skills, Agents, and Living Documentation

The checklist above gets you a functional harness. The next level is making your harness adaptive — so it scales as your project grows and your AI workflows get more complex.

Reusable skills. Instead of repeating complex instructions in every prompt, encode common workflows as structured skills the agent can invoke — "add an API endpoint," "create a database migration," "write integration tests for this service." Skills are guides with progressive disclosure: the agent gets the right knowledge at the right time, rather than drowning in a massive context file.
Specialized sub-agents. A single general-purpose agent trying to do everything — code, review, test, plan — is a weak harness. Splitting responsibilities across focused agents (a planner, a coder, a reviewer, a security auditor) means each one operates within a narrower scope with clearer constraints. This is how both OpenAI and Anthropic structure their production AI systems.
Living documentation. Static docs go stale the moment code changes. A mature harness includes auto-generated documentation that stays in sync with the codebase — so every feature, API endpoint, and architectural decision is always available as context for the next AI task. Without this, your AGENTS.md gradually drifts from reality, and the harness degrades.
AI-ready architecture. Clear module boundaries, well-defined API contracts, consistent file structure, and explicit dependency rules. When your codebase is organized so that a human can understand any feature by reading two or three files, an AI agent can too. Architecture that's easy for agents to navigate is also easier for your team to maintain.
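
Progressive disclosure is easier to see in code than in prose. The sketch below is an assumption about how a skill registry could be shaped, not how any particular tool implements skills: a one-line summary per skill always sits in context, and the full instructions load only when the agent picks one. `Skill` and `SkillRegistry` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Skill:
    name: str
    summary: str       # one line, cheap enough to include in every prompt
    instructions: str  # full body, loaded only when the skill is invoked


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def index(self) -> str:
        """Short listing of every skill, sorted by name."""
        ordered = sorted(self._skills.values(), key=lambda s: s.name)
        return "\n".join(f"- {s.name}: {s.summary}" for s in ordered)

    def load(self, name: str) -> str:
        """Full instructions for one skill, fetched on demand."""
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        return self._skills[name].instructions
```

The index stays small no matter how long individual skills get, which is the point: the agent pays for detail only when it needs it.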

Building Your First Harness — A Practical Path

Harness engineering isn't something you implement all at once. It's an iterative process that improves every time an agent fails and you add a new constraint.

Define your conventions. Write an AGENTS.md or CLAUDE.md that describes your project's architecture, patterns, and standards. This is your first guide.
Set up automated checks. Tests, strict type checking, lint rules, security scanning. Every check you add is a sensor that catches mistakes the model will inevitably make. Start with the checks that matter most for your project.
Use spec-driven workflows. Don't let the agent start from a vague prompt. Define what the feature should do before you ask the AI to build it.
Close the feedback loop. Every time an agent produces a bad result, ask: "What guide or sensor would have prevented this?" Then add it. The harness improves incrementally — and each improvement prevents an entire class of future failures.

Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again. — Mitchell Hashimoto

This is the core practice of harness engineering: not accepting AI mistakes as the cost of speed, but systematically eliminating them through infrastructure.

Originally published on VibeReady. Republished here for the dev.to community.

How to Vibe Code Your First SaaS (Step-by-Step)

Remy B. — Thu, 23 Apr 2026 13:30:00 +0000

Key Takeaways

Vibe coding lets you describe features in plain language and AI writes the code

Two paths: AI app builders (Lovable/Bolt) for speed, or AI coding tools for full control

A feature spec + architectural context = consistent, production-ready output

You can ship your first SaaS feature in a single session using the workflow in this guide

You can vibe code a SaaS in an afternoon. You can also spend that afternoon iterating on a dashboard Claude keeps redesigning from scratch — because your prompt was six words.

This is the step-by-step workflow I wish I'd had my first week. No specific tool required, no framework assumed. New to the concept? Read What Is Vibe Coding? first for background.

The first time I tried to vibe code a SaaS dashboard, I gave Claude Code a single sentence: "Build me a dashboard." Forty minutes later I was on my third complete rewrite — different layout, different data model, different component names each time. I closed the terminal, opened a notes file, and wrote six sentences: route, data sources, existing components, acceptance criteria, auth, layout wrapper. Twelve minutes after I pasted those six sentences back in, the feature was done and shipping. The spec wasn't overhead. It was the whole trick.

What You Need Before You Start

Before you write a single prompt, get these five things in place. None of them take more than an afternoon, and skipping any of them will cost you time later.

An idea you can describe in one paragraph. You don't need a business plan. You need to be able to say: "I'm building X for Y people, and the first thing it does is Z." If you can't describe it simply, AI can't build it well.
Version control (GitHub). Create a GitHub repository before writing any code. Every change is tracked, you can undo mistakes, and it's required for deployment. It's free — no excuses.
A hosting platform. Vercel (best for Next.js), Netlify, or Railway. All have generous free tiers. You'll deploy from your GitHub repo — push code, and your site updates automatically.
An AI coding tool. Claude Code for terminal-first agentic workflows, Cursor or Windsurf for IDE-integrated development. Pick one to start — you can always add more later.
A project foundation. A starter kit or boilerplate with authentication, payments, and database already configured. Building this from scratch takes weeks and is the wrong use of your time when you're just starting out.

Once these are in place, you're ready to start.

Step 1: Write a Feature Spec (Not Just a Prompt)

This is the single biggest differentiator between people who succeed with vibe coding and people who struggle. Don't jump straight into prompting. Write down what you want first.

A feature spec isn't a full product requirements document. It's 5–10 sentences that describe: what the feature does, who uses it, and what "done" looks like. It forces you to think before you prompt — and gives AI the clarity it needs to generate useful code on the first try.

Here's the difference between a vague prompt and a feature spec:

Vague Prompt

"Build me a dashboard."

AI will generate something — but it won't be what you wanted. You'll spend more time iterating than you saved.

Feature Spec

"Create a user dashboard page at /dashboard. Show the user's name from the session, their current subscription plan from Stripe, and a list of their 5 most recent projects with title, status, and last-modified date. Use the existing DashboardLayout component. Add a 'New Project' button that links to /projects/new. The page should be server-rendered and require authentication."

The difference is specificity. When AI knows the route, the data sources, the existing components, and the acceptance criteria, it generates code that actually fits your application. This is how to vibe code effectively — not with better AI, but with better inputs.
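
If you want the spec to be harder to skip, one option is to treat it as a small data structure that refuses to render a prompt until the key fields are filled in. This is a sketch under my own assumptions: `FeatureSpec` and its field names are hypothetical, chosen to mirror the route, data sources, existing components, and acceptance criteria mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureSpec:
    route: str = ""
    data_sources: List[str] = field(default_factory=list)
    existing_components: List[str] = field(default_factory=list)
    acceptance_criteria: List[str] = field(default_factory=list)
    requires_auth: bool = True

    def missing_fields(self) -> List[str]:
        """Names of the required fields that are still empty."""
        required = {
            "route": self.route,
            "data_sources": self.data_sources,
            "existing_components": self.existing_components,
            "acceptance_criteria": self.acceptance_criteria,
        }
        return [name for name, value in required.items() if not value]

    def to_prompt(self) -> str:
        """Render the spec as prompt text, or fail loudly if it is incomplete."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"spec is incomplete: {', '.join(missing)}")
        lines = [
            f"Route: {self.route}",
            f"Data sources: {', '.join(self.data_sources)}",
            f"Reuse components: {', '.join(self.existing_components)}",
            f"Requires authentication: {'yes' if self.requires_auth else 'no'}",
            "Acceptance criteria:",
        ]
        lines += [f"- {criterion}" for criterion in self.acceptance_criteria]
        return "\n".join(lines)
```

A six-word prompt fails the `missing_fields` check immediately, which is roughly the discipline the spec is meant to enforce.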

The Quick Path: Start with an AI App Builder

Before diving into the full workflow, it's worth knowing there's a faster option — with trade-offs.

AI app builders like Lovable and Bolt can generate a working application from a text description. You describe your SaaS, and they produce a deployed app with UI, database, authentication, and basic functionality — sometimes in minutes.

This path works well for:

Validating an idea quickly before investing more time
Building prototypes to show investors or early users
Non-technical founders who need a working version fast

The trade-offs are real, though. Customization is limited. Complex features hit walls. You're on their hosting, their infrastructure, their ecosystem. When you outgrow the builder — and most serious SaaS products do — migration is painful and sometimes impossible.

If you want full control over your codebase — production-ready architecture, custom features, your own hosting — keep reading. The rest of this vibe coding tutorial walks you through doing it with AI coding tools.

Step 2: Set Up Your Project Foundation

You can't vibe code into a blank folder effectively. AI needs existing patterns to follow — file structure, naming conventions, component library, API patterns. Without them, every prompt generates code in a different style, and your project becomes an inconsistent mess within a week.

You have two options:

Use a starter kit — A production-ready boilerplate with authentication, payments, database, and infrastructure already configured. This is the fastest path.
Set up manually — Initialize a Next.js (or other framework) project, add your ORM, configure authentication, wire up payments. This takes 1–2 weeks for a solid foundation but gives you full control from line one.

What matters is consistency: a predictable file structure, shared type definitions, reusable components the AI can reference. The difference between vibe coding a prototype and vibe coding production software is the foundation underneath.

Step 3: Give Your AI Tool Context About Your Project

This is the step most beginners skip — and the one that separates good AI output from generic AI output.

Most AI coding tools support some form of project context file: CLAUDE.md for Claude Code, AGENTS.md for tools that follow that shared convention, .cursorrules for Cursor, .windsurfrules for Windsurf. These files tell the AI about your project's patterns before it generates code.

At minimum, include:

Your tech stack and framework versions
File and folder naming conventions
Key components and utilities the AI should reuse
Patterns to follow (e.g., "server actions go in src/actions/")

Example Context File (AGENTS.md)

Tech stack: Next.js 15, TypeScript, Prisma, PostgreSQL, Tailwind, shadcn/ui.
Components live in src/components/. Pages in src/app/.
Server actions in src/actions/ — always validate with Zod schemas.
Use the existing Button, Card, and DataTable components from our UI library.
All database queries go through Prisma — never raw SQL.

With context in place, AI generates code that matches your project's conventions instead of inventing its own. This is the foundation of structured vibe coding — and it's what makes vibe coding viable for production.
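
A context file is only a request, so it helps to back a rule like "never raw SQL" with a check. Here is a minimal sketch that scans source text for Prisma's raw query methods (`$queryRaw`, `$executeRaw`, and their `Unsafe` variants). The `find_raw_sql` helper is a hypothetical name, and a plain regex scan like this is an assumption on my part about what is good enough to start with; a proper lint rule would be more robust.

```python
import re
from typing import List, Tuple

# Matches $queryRaw, $executeRaw, $queryRawUnsafe, and $executeRawUnsafe.
RAW_SQL_CALL = re.compile(r"\$(queryRaw|executeRaw)(Unsafe)?\b")


def find_raw_sql(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for each line that calls a raw query method."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if RAW_SQL_CALL.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

Run over the files an agent just changed, an empty result means the rule held, and anything else points at the exact line to fix.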

Step 4: Vibe Code Your First Feature

You have a spec, a foundation, and context. Now it's time to actually vibe code. Here's the workflow, step by step.

1. Share your feature spec with the AI

Open your AI tool and give it the feature spec you wrote in Step 1. If you're using Claude Code, paste it directly. In Cursor or Windsurf, open the composer/chat and share the spec along with any relevant files.

2. Let the AI propose a plan

Don't let AI start writing code immediately. Ask it to propose an implementation plan first: which files it will create or modify, what approach it will take, which existing components it will use. Review the plan before saying "go ahead."

3. Let it generate the code

Once the plan looks right, let AI write the code. For multi-file features, agentic tools like Claude Code will create and modify multiple files in one pass. IDE tools may handle it in stages.

4. Review what it produced

Before accepting anything, check:

Does the file structure match your project's conventions?
Did it reuse existing components or create unnecessary duplicates?
Are types correct? Are imports pointing to real files?
Does the feature actually work when you run it?
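
The "imports pointing to real files" check is easy to automate. The sketch below makes some simplifying assumptions: it only looks at relative imports (`./` and `../`), it ignores path aliases such as `@/components`, and the suffix list is my guess at a typical TypeScript project. `unresolved_imports` is a hypothetical helper name.

```python
import re
from pathlib import Path
from typing import List

# Captures the path in statements such as: import { X } from './X'
IMPORT_PATH = re.compile(r"""from\s+['"](\.{1,2}/[^'"]+)['"]""")
CANDIDATE_SUFFIXES = ["", ".ts", ".tsx", ".js", ".jsx", "/index.ts", "/index.tsx"]


def unresolved_imports(file: Path) -> List[str]:
    """Return relative import paths in `file` that match no file on disk."""
    missing = []
    for spec in IMPORT_PATH.findall(file.read_text(encoding="utf-8")):
        base = file.parent / spec
        # Try the bare path, then each common extension and index file.
        if not any(Path(str(base) + suffix).is_file() for suffix in CANDIDATE_SUFFIXES):
            missing.append(spec)
    return missing
```

Your type checker will catch the same problem, but a quick scan like this gives a readable list of hallucinated imports before you run anything heavier.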

5. Iterate through conversation

AI rarely gets it perfect on the first pass, and that's fine. Iteration is the workflow, not a failure.

Iteration Prompt

"The dashboard page works, but two things: move the subscription status into a separate card component, and add a loading skeleton while the projects list fetches. Also, the 'New Project' button should use our primary Button variant from the UI library, not a plain anchor tag."

Be specific. Reference file names, component names, and exact behaviors. The more precise your feedback, the more accurate the next iteration.

Note: If you find yourself giving the same feedback repeatedly — "always use our Button component," "add loading states to all data fetches" — encode it into a reusable skill or subagent. AI tools like Claude Code support custom skills that run the same review checklist every time, so you stop repeating yourself and your code stays consistent automatically.

Step 5: Review, Test, and Ship

Don't skip review just because AI wrote it. AI-generated code compiles, passes basic tests, and looks reasonable — but it can also introduce subtle bugs, security issues, and pattern inconsistencies that compound over time.

Before you merge or deploy, run through this checklist:

Logic check. Does the feature actually do what the spec says? Test the happy path and at least one edge case.
Security basics. Are inputs validated? Are database queries parameterized? Are auth checks in place?
Pattern consistency. Does the code follow the same patterns as the rest of your project? Or did AI invent a new approach?
Quality gates. Run your linter, type checker, and any tests you have. Ask AI to write tests for the feature it just built — it's good at this.

Note: Use AI for testing too. Connect a browser automation tool like Chrome DevTools MCP to your AI agent, pair it with a testing skill, and let it click through your feature, check layouts at different screen sizes, and flag visual or functional issues — before you even open the browser yourself.

Once everything passes, commit, push, and deploy. If you connected Vercel or Netlify to your GitHub repo during setup, pushing to GitHub triggers an automatic deploy. Your feature is live.

3 Mistakes That Slow Down First-Time Vibe Coders

After watching dozens of developers learn how to vibe code, these are the patterns that waste the most time:

Prompting without a spec. You describe something vague, AI generates something vague, you spend 30 minutes iterating to get what you could have specified in 2 minutes of writing. The spec is the shortcut.
No project context. Without context files, AI generates generic code that doesn't match your patterns. You end up with three different button styles, two API patterns, and a file structure that doesn't match anything else in the project.
Accepting everything without review. AI is confident, not correct. It will generate code that looks right, runs without errors, and has a subtle auth bypass or a missing edge case. Always review the diff before accepting.

Every one of these mistakes is recoverable. But avoiding them from the start means you spend your time building features, not fixing AI's assumptions.

What Is Vibe Coding? A Developer's Guide (2026)

Remy B. — Fri, 17 Apr 2026 22:30:35 +0000

I've been vibe coding for the past year — building a full SaaS product almost entirely through AI conversation. Some of it has been shockingly productive. Some of it has been a mess. This post is everything I wish I'd known when I started.

Vibe coding is the practice of building software by describing what you want in natural language and letting AI write the code. Instead of typing syntax line by line, you have a conversation with an AI tool — and it generates working code based on your intent.

If you've heard the term but weren't sure what it actually means, how it works, or whether it's something you should learn — this guide covers everything I've learned about vibe coding in 2026.

What Is Vibe Coding?

Vibe coding is a software development approach where you describe the software you want to build in plain English (or any natural language), and an AI coding tool generates the source code for you. You guide the process by reviewing the output, giving feedback, and iterating — much like directing a collaborator rather than writing every line yourself.

The term captures something specific: instead of thinking in syntax and data structures, you think in outcomes. You focus on what the software should do, and the AI handles how to implement it. The "vibe" is the shift from implementation detail to creative intent.

The Origin of "Vibe Coding"

The term was coined by Andrej Karpathy — Tesla's former head of AI and a founding member of OpenAI — in a now-famous post on X in February 2025:

There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.

Karpathy was half-joking. He described himself building a project by talking to an AI, accepting suggestions, running the code, and fixing issues through more conversation — without ever carefully reading the generated code. It captured a feeling that thousands of developers already recognized: AI had reached a point where you could build real software just by describing what you wanted.

What started as a tongue-in-cheek observation went viral. Within months, "vibe coding" went from a meme to a genuine methodology. Developer communities adopted it. Tutorials appeared. Tool makers optimized their products around it. By mid-2025, it was the dominant way new developers were learning to build software — and by 2026, even experienced engineers have integrated vibe coding workflows into their daily practice.

Vibe Coding vs Traditional Development

	Traditional Development	Vibe Coding
How code is written	Manually, line by line	Described in natural language, AI generates it
Skill required	Deep programming knowledge	Problem definition + code literacy for review
Speed	Variable, depends on complexity	Faster for standard patterns, similar for novel problems
Who it's for	Professional developers	Developers + technically literate builders
Output quality	Consistent with developer's skill	High for isolated tasks, variable at scale
Scaling behavior	Conventions maintained by team knowledge	Requires architectural context to stay consistent

The key insight: vibe coding doesn't replace programming knowledge — it changes where that knowledge is applied. Instead of writing code, you're reviewing it. Instead of memorizing syntax, you're defining intent.

How Vibe Coding Works in Practice

If you've never tried it, consider this section a compact tutorial. Whether you're a beginner or an experienced developer, the core cycle is the same.

The Vibe Coding Workflow

Every vibe coding session follows a four-step loop:

Describe — Tell the AI what you want in plain language. "Build a user settings page with name, email, and notification preferences."
Generate — The AI writes the code: components, API routes, database queries, styling — whatever the task requires.
Review — You look at what it produced. Does it match your intent? Does it work? Are there obvious issues?
Iterate — Refine through conversation. "Move the notification toggles into a separate section. Add email validation." The AI updates the code accordingly.

This loop happens fast. A feature that might take a day in traditional development can often be built in under an hour. The more specific your descriptions, the better the output — but even vague starting points often produce surprisingly usable results.

Vibe Coding Example: A Pricing Page

Imagine you're building a SaaS application and need a pricing page. In traditional development, you'd write JSX, style components, wire up state management, and connect to your payment provider. With vibe coding:

Your prompt: "Build a pricing page with three tiers — Free, Pro, and Enterprise. Each tier should show the price, a list of features, and a CTA button. The Pro tier should be visually highlighted as the recommended option. Use our existing design system and connect the buttons to our Stripe checkout flow."

The AI generates a complete, functional pricing page. You review it, tweak the copy, adjust the feature lists, and iterate until it matches your vision. The entire process might take 20 minutes instead of a full day.

Essential Vibe Coding Tools

The vibe coding ecosystem has matured rapidly. Here are the main categories of tools and when each type shines.

Vibe Coding with Cursor and Windsurf

Cursor and Windsurf are IDE-based tools that integrate AI directly into your code editor. Vibe coding with Cursor gives you a chat panel and inline prompts that modify your actual files in real time. If you prefer a visual, file-tree-oriented workflow where you can see changes as they happen, these are the tools to start with. Cursor has the larger community and deeper feature set; Windsurf (by Codeium) offers a polished alternative with strong multi-file editing.

Vibe Coding with Claude Code and Gemini CLI

Vibe coding with Claude Code is a terminal-first experience. Anthropic's AI coding agent excels at complex, multi-file operations — the kind of work where you need AI to understand your entire project structure, not just the file you're looking at. It reads your codebase, plans changes across multiple files, runs tests, and commits code. For agentic workflows (where AI operates more autonomously), Claude Code is the most capable option available.

Gemini CLI is Google's entry, offering a generous free tier and one of the largest context windows. It's a strong choice for open-source projects and developers who want to experiment with vibe coding without upfront costs.

Pair Programming Tools: GitHub Copilot

GitHub Copilot is the most widely adopted AI coding tool, with millions of active users. It works as an inline suggestion engine — as you type, it predicts what you'll write next. While not a full vibe coding tool in the conversational sense, Copilot is often the entry point that introduces developers to AI-assisted coding. Many vibe coders use Copilot alongside a more capable tool like Claude Code or Cursor.

What Vibe Coding Gets Right

Speed to Prototype

The most obvious advantage: vibe coding is fast. Prototypes that took weeks can be built in hours. MVPs that took months can ship in days. This isn't hype — it's the consistent experience reported by developers across the industry. When you remove the bottleneck of translating ideas into syntax, the speed of development becomes limited by the speed of your thinking, not your typing.

Accessibility for Non-Developers

Vibe coding has opened software creation to people who couldn't build apps before. Designers who understand user flows but can't write React. Product managers who know exactly what the feature should do but never learned TypeScript. Domain experts — doctors, teachers, small business owners — who have ideas for tools that solve their specific problems. For the first time, you can build an app with AI, no coding experience required, and the result is real, deployable code you own.

Focused Creativity

Perhaps the most underrated benefit: vibe coding lets you stay in the creative zone. Instead of context-switching between "what should this feature do?" and "how do I implement this in code?", you stay focused on the product vision. The AI handles the implementation details, and you steer the direction.

Common Misconceptions About Vibe Coding

Vibe coding has grown so fast that myths have grown with it. Here are the most common ones:

"AI does everything — you just sit back." In reality, vibe coding is a collaboration. You define the direction, review the output, catch mistakes, and iterate. The developer's role shifts from writer to director, but the expertise still matters.
"Vibe-coded projects can't go to production." They absolutely can — with the right foundation. The projects that fail in production are usually the ones built ad-hoc from a blank canvas. Start with a proven architecture and quality checks, and vibe-coded code can be as reliable as hand-written code.
"You don't need to understand code at all." Some code literacy helps significantly. You don't need to write code from scratch, but being able to read what the AI produced, spot obvious issues, and understand error messages makes the process much more effective.
"Vibe coding is just a fad." Every generation of developer tools has abstracted away complexity. Compilers abstracted assembly. Frameworks abstracted HTTP. AI tools abstract implementation. Vibe coding is the next step in a decades-long trend, not a temporary phenomenon.
"All vibe-coded projects have the same quality problems." The quality issues come from unstructured AI usage, not from the methodology itself. When AI has architectural context and guardrails, the output quality is dramatically better than ad-hoc prompting.

Vibe Coding Best Practices for Beginners

If you're just getting started with vibe coding, these three practices will save you from the most common pitfalls.

Start with a Foundation, Not a Blank Canvas

The biggest mistake beginners make is asking AI to build everything from scratch. CodeRabbit's 2025 research reports that AI-generated code has about 70% more issues than human-written code, and a lack of architectural context makes the problem worse: the AI doesn't know your project's conventions, patterns, or standards, so it invents new ones with every prompt.

The fix is straightforward: start with a proven codebase, whether that's an existing project, an open-source boilerplate, or a commercial starter kit, so the AI has architectural context from the first prompt. Add an AGENTS.md file that describes your conventions, set up linting, and establish patterns before you start prompting.
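If you don't have an AGENTS.md yet, here is a minimal sketch of generating a starter one. The helper name render_agents_md, the section headings, and the example rules are all placeholders I made up for illustration; AGENTS.md is free-form markdown, so use whatever structure matches your project.

```python
def render_agents_md(project: str, sections: dict[str, list[str]]) -> str:
    """Render a starter AGENTS.md as markdown text.

    Section names and rules are supplied by the caller; nothing here is a
    required format, it is just one readable layout.
    """
    lines = [f"# AGENTS.md for {project}", ""]
    for heading, rules in sections.items():
        lines.append(f"## {heading}")
        lines.extend(f"- {rule}" for rule in rules)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


# Hypothetical conventions, for illustration only.
EXAMPLE_SECTIONS = {
    "Architecture": ["Business logic lives in services/, not in route handlers."],
    "Conventions": ["Use the shared error type for all API errors."],
    "Checks": ["Run the linter and the test suite before finishing a task."],
}

if __name__ == "__main__":
    # Print the file so you can review it before saving it as AGENTS.md.
    print(render_agents_md("my-app", EXAMPLE_SECTIONS))
```

Keeping the rules as data makes it easy to review them like any other change and regenerate the file as conventions evolve.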

Use Structured Prompts, Not Ad-Hoc Requests

Instead of vague instructions ("build me a dashboard"), write structured descriptions that include the goal, the expected behavior, the data involved, and any constraints. The more context you provide upfront, the better the AI's first attempt — and the fewer iterations you'll need.

The most effective vibe coders use a PRD-driven workflow: they define features in a lightweight product requirements document before writing a single prompt.
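To make the goal, behavior, data, and constraints structure concrete, here is a rough sketch of a prompt template in Python. The FeaturePrompt class and its field names are my own invention for this example, not part of any tool; the point is only that every prompt fills in the same four slots.

```python
from dataclasses import dataclass, field


@dataclass
class FeaturePrompt:
    """A lightweight, PRD-style description of one feature."""

    goal: str
    behavior: list[str]
    data: list[str]
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Turn the structured fields into prompt text for an AI tool."""

        def section(title: str, items: list[str]) -> str:
            # An empty section is shown explicitly so gaps are visible.
            body = "\n".join(f"- {item}" for item in items) or "- (none)"
            return f"{title}:\n{body}"

        return "\n\n".join([
            f"Goal: {self.goal}",
            section("Expected behavior", self.behavior),
            section("Data involved", self.data),
            section("Constraints", self.constraints),
        ])


if __name__ == "__main__":
    # Example values are made up for illustration.
    prompt = FeaturePrompt(
        goal="Show each user a dashboard of their recent activity",
        behavior=["List the 20 most recent events", "Show an empty state for new users"],
        data=["events table: user_id, kind, created_at"],
        constraints=["Reuse the existing table component"],
    )
    print(prompt.render())
```

Compare the rendered output with "build me a dashboard": the AI no longer has to guess what data exists or which components to reuse.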

Set Up Quality Gates Early

Don't wait until your project is large to add quality checks. Set up automated tests, type checking, and linting from day one. These quality gates catch AI-generated mistakes automatically — before they compound into larger problems. The difference between a vibe-coded prototype and a vibe-coded product is the automated verification layer that runs on every change.
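A quality gate can be as simple as one script that runs every check and reports which ones failed. The sketch below assumes a Python project that uses ruff, mypy, and pytest; those tool choices and the run_gates helper are assumptions for the example, so swap in whatever your stack uses.

```python
import subprocess
import sys


def run_gates(gates: dict[str, list[str]]) -> list[str]:
    """Run each gate command and return the names of the gates that failed.

    A gate fails if its command exits non-zero or cannot be started
    (for example, because the tool is not installed).
    """
    failed = []
    for name, command in gates.items():
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except OSError:
            failed.append(name)
            continue
        if result.returncode != 0:
            failed.append(name)
    return failed


# Assumed tool choices; replace with the commands your project uses.
EXAMPLE_GATES = {
    "lint": ["ruff", "check", "."],
    "types": ["mypy", "."],
    "tests": ["pytest", "-q"],
}


def main() -> int:
    """Run the example gates and return a process exit code."""
    failures = run_gates(EXAMPLE_GATES)
    for name in failures:
        print(f"gate failed: {name}", file=sys.stderr)
    return 1 if failures else 0
```

Call sys.exit(main()) from a pre-commit hook or CI step so a failing gate blocks the change. Running all gates instead of stopping at the first failure gives you (and the AI) the full list of problems in one pass.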

When Vibe Coding Breaks Down (And How to Fix It)

Vibe coding isn't perfect, and it's important to understand where the limits are. The best-documented challenge is pattern drift: the AI builds each feature with slightly different patterns, conventions, and approaches because it has no memory of what it built before. Over time, this leads to code duplication (4x more than human-written code, per GitClear), inconsistent error handling, and security gaps (40%+ of AI-generated code contains vulnerabilities, per arXiv research).

The data on AI-generated code quality paints a clear picture: these are real problems — but they're solvable ones. The root cause isn't vibe coding itself; it's vibe coding without structure. When AI has access to your project's architectural context, enforced coding patterns, and automated quality checks, the output quality improves dramatically. The methodology works; it just needs guardrails. The emerging discipline of harness engineering formalizes how to build these guardrails systematically.

I hit this wall myself about three months into a project. Every feature worked in isolation, but the codebase had become a patchwork of inconsistent patterns. That's what pushed me to build tooling around the problem.

Getting Started with Vibe Coding Today

Ready to start vibe coding? Whether you want to build SaaS with AI or create your first side project, here's a practical path:

Pick an AI coding tool. Claude Code for terminal-first workflows, Cursor or Windsurf for IDE-integrated development, or GitHub Copilot for inline suggestions. Most developers end up using more than one.
Start with a proven foundation. Don't build from a blank canvas. Clone an existing project, use a boilerplate, or set up a well-structured repo with AGENTS.md and linting before you start prompting.
Follow a structured workflow. Plan features before prompting. Use structured descriptions. Run quality checks after every change.
Learn the practices that scale. Once you're comfortable with the basics, learn the patterns that separate prototypes from production applications.

Frequently Asked Questions

Is vibe coding the same as no-code?
No. No-code platforms like Bubble or Webflow use visual builders and limit you to their ecosystem. Vibe coding generates real source code — JavaScript, Python, TypeScript — that you own and can modify. You get an actual codebase, not a locked-in platform.

Can I build a production app with vibe coding?
Yes, but it requires structure. Unstructured vibe coding works great for prototypes but introduces consistency issues at scale. With a production-ready foundation, architectural context files, and quality gates, vibe-coded applications can absolutely run in production.

What's the best AI tool for vibe coding?
It depends on your workflow. Claude Code excels at multi-file agentic tasks from the terminal. Cursor and Windsurf offer the best IDE-integrated experience. GitHub Copilot is great for inline suggestions. Most developers combine two or more tools.

Do I need to know how to code to vibe code?
Some code literacy helps for reviewing AI output, but you don't need to be an expert. Many successful vibe coders are designers, product managers, or domain experts who understand what they want to build but couldn't write it from scratch.

How is vibe coding different from using GitHub Copilot?
Copilot suggests code completions as you type — it's a pair programmer. Vibe coding is broader: you describe entire features, review the output, and iterate through conversation. Copilot can be one tool in a vibe coding workflow, but the methodology encompasses the full build cycle.

What is structured vibe coding?
Structured vibe coding adds architectural context, quality gates, and repeatable workflows on top of the basic describe-and-generate loop. Instead of prompting AI ad-hoc, you give it context about your project's patterns, conventions, and standards, so the code it generates is far more consistent and much closer to production quality. I wrote a deeper breakdown of structured vibe coding if you want the full framework.

I'm building VibeReady — an AI-native SaaS starter kit that gives AI tools the architectural context they need to generate consistent, production-quality code. If you're vibe coding and hitting the scaling issues described above, check it out.