<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anatoly Silko</title>
    <description>The latest articles on Forem by Anatoly Silko (@anatolysilko).</description>
    <link>https://forem.com/anatolysilko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3861601%2Fae9eb8c2-d316-46a1-bd3c-4abdc8459065.png</url>
      <title>Forem: Anatoly Silko</title>
      <link>https://forem.com/anatolysilko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anatolysilko"/>
    <language>en</language>
    <item>
      <title>Why Your Vibe-Coded App Keeps Breaking Every Time You Fix Something</title>
      <dc:creator>Anatoly Silko</dc:creator>
      <pubDate>Sat, 02 May 2026 17:46:11 +0000</pubDate>
      <link>https://forem.com/anatolysilko/why-your-vibe-coded-app-keeps-breaking-every-time-you-fix-something-48p0</link>
      <guid>https://forem.com/anatolysilko/why-your-vibe-coded-app-keeps-breaking-every-time-you-fix-something-48p0</guid>
      <description>&lt;p&gt;You ship a working feature on Monday. Tuesday morning a small cosmetic bug appears. You ask the AI to fix it. The fix works but the login flow now throws a 500. You paste the stack trace back in. The login comes back, but the payment webhook is silent. Eight prompts later, the UI is half-broken in three new places and you can no longer remember which version of the app actually worked.&lt;/p&gt;




&lt;p&gt;This is not bad luck, and it is not a skill issue. It is the predictable output of four compounding technical limitations in how today's AI coding agents understand code. By the time a founder is 30 prompts deep and watching previously-working features vanish, the tool has quietly lost the thread: the context window is full, the generation is non-deterministic, the agent has no map of what depends on what, and it has been patching symptoms rather than causes.&lt;/p&gt;

&lt;p&gt;The loop is escapable. The way out is architectural, not another prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loop has a signature, and it's in the literature
&lt;/h2&gt;

&lt;p&gt;The pattern has a name in practitioner forums — "the doom loop" — and a clean mechanism in the research. A commenter on a 2025 Hacker News thread analysing the architecture behind Lovable and Bolt put it plainly: &lt;em&gt;"I've tried several proof of concepts with Bolt and every time just get into a doom loop where there is a cycle of breakage, each 'fix' resurrecting a previous 'break'"&lt;/em&gt; (Hacker News, July 2025). A developer migrating off Lovable in r/vibecoding was more direct: &lt;em&gt;"When the project advances, it ruins your existing code"&lt;/em&gt; (r/vibecoding, 2025).&lt;/p&gt;

&lt;p&gt;Neither of those quotes is idle venting. Both describe the same mechanism, and that mechanism is measurable. Four things are happening inside the agent at once, and they compound.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Context rot: the window fills up and the model forgets
&lt;/h3&gt;

&lt;p&gt;Every frontier model — Claude Opus 4.x, GPT-5, Gemini 2.5 Pro — is advertised with a huge context window of 200K or 1M tokens. The window is real. The &lt;em&gt;useful&lt;/em&gt; window is much smaller.&lt;/p&gt;

&lt;p&gt;Chroma Research's &lt;em&gt;Context Rot&lt;/em&gt; study of 18 frontier models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3) found that &lt;em&gt;"models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows"&lt;/em&gt;, even on trivial repeat-the-string tasks (Chroma Research, July 2025). The NoLiMa benchmark tested 12 long-context LLMs and found &lt;strong&gt;10 of them dropped below 50% of their short-context baseline at just 32K tokens&lt;/strong&gt;; GPT-4o fell from a 99.3% baseline to 69.7% at 32K (NoLiMa, arXiv 2502.05167, 2025). Anthropic's own engineering team frames the same phenomenon as an operating constraint: &lt;em&gt;"context must be treated as a finite resource with diminishing marginal returns"&lt;/em&gt; (Anthropic Engineering, 2025).&lt;/p&gt;

&lt;p&gt;Every time you paste the error log, the previous reply, the file in question, and "also don't break X" back into the chat, you are pushing real signal deeper into a window where the model is demonstrably less able to use it. The fix arrives having forgotten the invariant you told it about two prompts ago — because, functionally, it has.&lt;/p&gt;
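&lt;p&gt;The arithmetic is easy to sketch. The snippet below is illustrative only: the four-characters-per-token estimate is a common rough heuristic, the 32K figure is where NoLiMa measured most models dropping below half their baseline, and every name in it is made up:&lt;/p&gt;

```typescript
// Sketch only: cumulative context across a fix-the-fix conversation.
// Chat agents resend the whole history each turn, so usage compounds.
const USEFUL_WINDOW_TOKENS = 32000; // where NoLiMa saw sharp degradation

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough chars-per-token heuristic
}

function contextAfterPrompts(prompts: string[]): { tokens: number; degraded: boolean } {
  let tokens = 0;
  for (const p of prompts) tokens += estimateTokens(p);
  return { tokens, degraded: tokens > USEFUL_WINDOW_TOKENS };
}

// Each "fix" turn re-pastes a stack trace, the broken file, and the
// constraints you keep restating: call it 15,000 characters per turn.
const fixTurn = "x".repeat(15000);
console.log(contextAfterPrompts(Array(10).fill(fixTurn)));
// Ten routine fix turns are roughly 37,500 tokens: already past the point
// where long-context benchmarks show sharp reliability loss.
```

Ten ordinary debugging turns, not thirty, are enough to push the constraints you stated early in the conversation into the unreliable region.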

&lt;h3&gt;
  
  
  2. Non-determinism: the same prompt does not return the same code
&lt;/h3&gt;

&lt;p&gt;Founders often assume a regenerated fix is a &lt;em&gt;better&lt;/em&gt; version of the same attempt. It is not. It is a different attempt.&lt;/p&gt;

&lt;p&gt;An empirical study in ACM Transactions on Software Engineering and Methodology ran ChatGPT through 829 coding problems across three benchmarks, five times each. The proportion of tasks producing &lt;strong&gt;zero identical test outputs across the five runs was 75.76% on CodeContests, 51.00% on APPS, and 47.56% on HumanEval&lt;/strong&gt; — and setting temperature to zero reduced the effect but did not eliminate it (Ouyang, Zhang, Harman &amp;amp; Wang, ACM TOSEM, 2024). A follow-up study of five LLMs across eight tasks measured &lt;strong&gt;up to 15% accuracy variance and a best-versus-worst gap of 70%&lt;/strong&gt; at nominally deterministic settings (Atil et al., arXiv 2408.04667, 2024, revised 2025; published at Eval4NLP 2025). A further paper attributed much of the residual randomness to GPU type, GPU count, and batch size — meaning &lt;em&gt;the infrastructure your request happens to land on&lt;/em&gt; changes the code you get back (Yuan et al., arXiv 2506.09501, June 2025; NeurIPS 2025).&lt;/p&gt;

&lt;p&gt;When your first fix breaks two other things and you reflexively re-prompt "try again", you are not asking the same question twice. You are rolling a fresh die on different hardware. Each attempt can legitimately produce a different architecture, different variable names, and different side effects.&lt;/p&gt;
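&lt;p&gt;The study's instability metric is simple enough to sketch. The run outputs below are hypothetical stand-ins for generated code, not real model output:&lt;/p&gt;

```typescript
// Sketch of the TOSEM-style instability metric: for each task, run the model
// several times and ask whether any two runs produced identical output.

function allRunsDistinct(runs: string[]): boolean {
  return new Set(runs).size === runs.length; // true when no two runs match
}

function shareWithZeroIdenticalRuns(tasksRuns: string[][]): number {
  const unstable = tasksRuns.filter(allRunsDistinct).length;
  return unstable / tasksRuns.length;
}

const tasks = [
  ["v1", "v2", "v3", "v4", "v5"], // five attempts, five different outputs
  ["ok", "ok", "ok", "ok", "ok"], // fully stable task
  ["a", "a", "b", "c", "d"],      // two runs agree, so not counted
];
console.log(shareWithZeroIdenticalRuns(tasks)); // prints 0.3333333333333333
```

On the CodeContests benchmark that number was not one task in three; it was three in four.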

&lt;h3&gt;
  
  
  3. No blast radius: the agent edits without a map
&lt;/h3&gt;

&lt;p&gt;This is the single most important mechanism, and the least understood by non-technical founders. The popular AI coding agents do &lt;strong&gt;not&lt;/strong&gt; build a language-aware, cross-file dependency graph before editing your code. They rely on embedding retrieval plus grep/glob search. Anthropic's own engineering post on context is explicit about the approach: their agent design uses &lt;em&gt;"primitives like glob and grep"&lt;/em&gt; to navigate codebases, rather than indexing a syntax tree (Anthropic Engineering, 2025). Translation: the agent searches for text that looks related. It does not traverse a real call graph.&lt;/p&gt;

&lt;p&gt;The measurable cost is large. Microsoft Research's CodePlan study ran repository-level coding tasks through GPT-4 with and without an explicit dependency graph. &lt;strong&gt;The graph-aware planner passed validity checks on 5 of 6 repositories; the identical LLM without the planning graph passed 0 of 6&lt;/strong&gt; (Microsoft Research, CodePlan, 2024).&lt;/p&gt;

&lt;p&gt;Without a call graph, the agent cannot see that the authentication helper it is rewriting is invoked by four controllers, two Artisan commands, and a queued job. Its "fix" passes the one test it can see and silently breaks three callers you'll only notice in the next prompt cycle. Every additional prompt widens the blast radius the agent never computed.&lt;/p&gt;
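&lt;p&gt;What a real dependency graph buys you is a transitive-dependents query, which is a few lines once the graph exists. The module names below are invented to mirror the example above:&lt;/p&gt;

```typescript
// Sketch: the blast-radius computation the agent never performs. Given a map
// from file to the files that call it, walk transitively outward from the
// file being edited.

function blastRadius(edited: string, dependents: { [file: string]: string[] }): string[] {
  const affected: string[] = [];
  const stack = [edited];
  while (stack.length > 0) {
    const file = stack.pop()!;
    for (const caller of dependents[file] ?? []) {
      if (!affected.includes(caller)) {
        affected.push(caller);
        stack.push(caller); // callers of callers are in the radius too
      }
    }
  }
  return affected;
}

const dependents = {
  AuthHelper: ["UserController", "AdminController", "SyncCommand", "SendInvoiceJob"],
  SendInvoiceJob: ["PaymentWebhookController"], // the queued job feeds the webhook
};

console.log(blastRadius("AuthHelper", dependents));
// A one-file "fix" to AuthHelper reaches five other files, including the
// payment webhook two hops away: exactly the break you notice a day later.
```

A text search for "AuthHelper" finds the four direct callers at best; it never finds the webhook controller, because that file mentions the job, not the helper.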

&lt;h3&gt;
  
  
  4. Shallow debugging: patching the symptom, not the cause
&lt;/h3&gt;

&lt;p&gt;LLM coding agents preferentially address the most recently quoted error line rather than the upstream cause. The SWE-Bench+ audit manually reviewed 251 successful GPT-4 patches and found that &lt;strong&gt;31.08% passed only because of weak test cases&lt;/strong&gt; — "plausible patches" that were semantically wrong (Aleithan et al., SWE-Bench+, arXiv 2410.06992, 2024). The same work strengthened the test suites and re-ran the leading coding agents: &lt;strong&gt;the average resolution rate on SWE-Bench Verified fell from 51.7% to 25.9%&lt;/strong&gt; once tests could no longer be satisfied by plausible but wrong patches (Aleithan et al., SWE-Bench+, 2024).&lt;/p&gt;

&lt;p&gt;The more specifically you paste the error text, the more literally the agent locks onto &lt;em&gt;that line&lt;/em&gt; — often wrapping a try/except around it, or special-casing the offending input. That passes the next run, then breaks something one level up the call stack. Because the "plausible patch" passes the test visible to both you and the agent, each prompt cycle satisfies the immediate failure while quietly accumulating debt. That is exactly the fix-break-fix signature.&lt;/p&gt;
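&lt;p&gt;A miniature of the two kinds of fix, built around a hypothetical pricing helper rather than code from any cited repo:&lt;/p&gt;

```typescript
// The real bug: a lenient parser emits null for blank prices.
function parseLenient(raw: string): { price: number | null } {
  return { price: raw === "" ? null : Number(raw) };
}

// Symptom patch: wrap the line from the stack trace. The crash disappears,
// and every batch containing a blank row now silently totals zero.
function totalPatched(rows: { price: number | null }[]): number {
  try {
    return rows.reduce((sum, r) => sum + r.price!.valueOf(), 0);
  } catch {
    return 0; // "fixed"
  }
}

// Root-cause fix: reject malformed input at the boundary instead.
function parseStrict(raw: string): { price: number } {
  const price = Number(raw);
  if (raw === "" || Number.isNaN(price)) {
    throw new Error('bad price: "' + raw + '"');
  }
  return { price };
}

const rows = ["12.50", "", "3.00"].map(parseLenient);
console.log(totalPatched(rows)); // prints 0: the bug is hidden, not fixed
```

The patched version passes the only test the agent can see (it no longer throws), which is exactly why the next failure surfaces one level up the call stack.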

&lt;p&gt;Put the four together and you have a system that forgets your constraints as the conversation grows, hands you a non-deterministically different attempt each time, edits without knowing what depends on what, and optimises for the visible error rather than the real one. The loop is the emergent behaviour.&lt;/p&gt;

&lt;p&gt;The loop is not a prompting problem. It is a diagnostic problem being managed with a prompting tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the loop costs while it's happening
&lt;/h2&gt;

&lt;p&gt;The loop has a price, and the price is metered. Tool pricing across the major platforms has shifted through 2025–26 in ways that make regression cycles disproportionately expensive.&lt;/p&gt;

&lt;p&gt;Lovable's Pro plan allocates 100 credits for $25 per month. Its "Try to fix" button is officially free, but every Agent-mode prompt that follows is not (Lovable documentation, 2025). One published review logged the exact dynamic: &lt;em&gt;"Lovable attempts a fix, introduces a new bug, attempts to fix that, creates another issue. I watched it burn 12 credits in one loop before I intervened manually. For reference: my Pro plan's 100 monthly credits lasted exactly 14 days at my usage rate"&lt;/em&gt; (ohaiknow review, 2026).&lt;/p&gt;

&lt;p&gt;Cursor restructured its $20 Pro plan on 16 June 2025, moving from "500 fast responses plus unlimited slower responses" to a $20 API-priced usage budget per month, with overages requiring manual top-ups. The backlash was severe enough that Anysphere CEO Michael Truell issued a public apology on 4 July 2025: &lt;em&gt;"We recognize that we didn't handle this pricing rollout well and we're sorry. Our communication was not clear enough and came as a surprise to many of you"&lt;/em&gt; (Truell, Cursor blog, 4 July 2025).&lt;/p&gt;

&lt;p&gt;Replit Agent 3, launched 10 September 2025, produced the sharpest spike. Within days, &lt;em&gt;The Register&lt;/em&gt; was reporting users going from $100–250 per month to over $1,000 in a single week, quoting one user directly: &lt;em&gt;"editing pre-existing apps seems to cost most overall — I spent $1k this week alone"&lt;/em&gt; (The Register, 18 September 2025). Replit's checkpoints bill regardless of outcome, spending caps are not configured by default, and usage-based charges are non-refundable within the 30-day evaluation window (The Register, September 2025).&lt;/p&gt;

&lt;p&gt;The pattern across the tools is identical. Regression loops are among the most expensive failure modes because each iteration carries full generation cost and produces a non-deterministic attempt that may require another. The tools that advertise "try to fix — free" tend to move the cost to the next message; the ones that make no such promise simply bill you for both attempts.&lt;/p&gt;

&lt;p&gt;A separate article in this cluster covers &lt;a href="https://rockingtech.co.uk/blog/the-real-cost-of-a-freelancer-built-mvp-that-needs-rescuing" rel="noopener noreferrer"&gt;the broader rescue economics when the tool-level spend spills into freelancer and agency fees&lt;/a&gt;; this piece stays focused on what's happening inside the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the moment actually sounds like
&lt;/h2&gt;

&lt;p&gt;What makes the loop particularly corrosive is how it presents emotionally. The research describes context rot and non-determinism as statistical properties. Founders experience them as betrayal. A handful of specific moments recur across reviews, forums, and Reddit threads from late 2025 and early 2026.&lt;/p&gt;

&lt;p&gt;There is the &lt;strong&gt;moment of realisation&lt;/strong&gt;. A three-month Cursor review on r/CursorAI captured the productivity-negative version: &lt;em&gt;"Without CursorAI: a MVP-project takes 1 week. With CursorAI: the same project still takes 7 days — plus another 3 weeks to clean up the mess it introduced"&lt;/em&gt; (r/CursorAI, April 2025). The realisation is never that the AI is stupid. It's that persistence has stopped paying.&lt;/p&gt;

&lt;p&gt;There is the &lt;strong&gt;confidently-wrong fix&lt;/strong&gt;. One published Lovable review described the loop almost as a numbered protocol: &lt;em&gt;"1. You ask lovable to fix the Problem. 2. Lovable will tell you that the issue is now fixed. 3. You realize, its not. 4. start at 1"&lt;/em&gt; (Fact Checker review of Lovable, 2026). The AI's confidence is the reason the loop continues. If it said &lt;em&gt;"I'm not sure"&lt;/em&gt;, the founder would stop.&lt;/p&gt;

&lt;p&gt;There is the &lt;strong&gt;disappearing feature&lt;/strong&gt;. A Cursor user wrote on the official forum: &lt;em&gt;"Cursor has started editing the wrong files, breaking parts of my codebase unintentionally. On a few occasions, it has even deleted files entirely"&lt;/em&gt; (Cursor forum, 2025). These are not rare incidents. They are the logical consequence of blast-radius blindness.&lt;/p&gt;

&lt;p&gt;And there is the &lt;strong&gt;decision to hire a human&lt;/strong&gt;. A post in r/VibeCodeDevs captured the exact language founders use at the inflection point: &lt;em&gt;"I've actually already built a 70%-there prototype using Lovable, though it took me around 5 hours and was a somewhat frustrating experience. I also have no coding background… I'd love to hire someone to build it for me, but places like Upwork hardly have anyone using AI tools. It's hard to pay a traditional dev agency for 2 weeks of dev work knowing that I already made a version with most of the features in a few hours with no experience"&lt;/em&gt; (r/VibeCodeDevs, May 2025).&lt;/p&gt;

&lt;p&gt;Note the particular shape of that frustration. It is not a rejection of vibe-coding. It is a recognition that the 70-to-100% gap needs a different skill.&lt;/p&gt;

&lt;p&gt;The common thread is worth marking. Founders don't describe the tools as stupid or broken. They describe &lt;em&gt;themselves&lt;/em&gt; as stuck, as uncertain whether progress is being made, as unsure what changed. That's the rational response to a system that is non-deterministically rewriting their code without a dependency graph while losing track of constraints — but experienced from the outside, it feels like a personal failing. It isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually works: the architectural intervention
&lt;/h2&gt;

&lt;p&gt;The loop cannot be prompted out of. You cannot context-engineer your way past context rot, and you cannot instruct an agent to build a dependency graph it doesn't maintain. Escaping the loop requires treating the vibe-coded codebase as what it actually is — &lt;strong&gt;an untrusted, undocumented legacy application that happens to be a week old&lt;/strong&gt; — and applying the same diagnostic tools an engineer uses on any inherited codebase.&lt;/p&gt;

&lt;p&gt;Three things need to happen, in order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static analysis with real thresholds.&lt;/strong&gt; The single most useful tool for most vibe-coded repos is &lt;code&gt;jscpd&lt;/code&gt; — the copy-paste detector — because the dominant defect in AI-generated code is duplication. A GitClear analysis of 211 million lines of code authored between 2020 and 2024 found that &lt;strong&gt;copy-pasted code rose from 8.3% to 12.3% of all changes&lt;/strong&gt;, duplication blocks increased roughly eightfold, and the share of refactored ("moved") lines fell from 24.1% in 2020 to 9.5% in 2024 (GitClear, &lt;em&gt;AI Copilot Code Quality&lt;/em&gt;, 2025). For Laravel back-ends, PHPStan with the Larastan extension catches AI-invented Eloquent methods the human reader doesn't notice; Psalm's taint analysis catches unsanitised input flowing to SQL and shell sinks, which AI-generated controllers miss routinely. SonarQube's published AI Code Assurance guidance recommends tightening the default quality gates specifically for AI-generated code (Sonar, 2025).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A dependency-graph pass the agent was never able to do.&lt;/strong&gt; For a React or TypeScript vibe-coded app, &lt;code&gt;madge --circular src/main.ts&lt;/code&gt; and &lt;code&gt;madge --orphans&lt;/code&gt; immediately surface the circular imports and dead files agent edits leave behind. For a Laravel back-end, &lt;code&gt;php artisan route:list&lt;/code&gt; separates routes that actually serve traffic from the duplicate CRUD scaffolding AI agents create (&lt;code&gt;/users/list&lt;/code&gt; and &lt;code&gt;/users/index&lt;/code&gt; pointed at the same controller, a pattern that shows up more often than is comfortable). What this gives you is the call graph the agent never had, which in turn tells you which files are genuinely load-bearing and which are decoration. The mechanics are covered in more depth in &lt;a href="https://rockingtech.co.uk/blog/how-to-audit-a-laravel-codebase-youve-inherited" rel="noopener noreferrer"&gt;our Laravel inheritance audit guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Patch-versus-rewrite, decided on evidence.&lt;/strong&gt; Two recently-published engineer teardowns show what this looks like in practice. Eric J. Ma's &lt;em&gt;"Undoing AI vibe-coded slop with AI"&lt;/em&gt; (29 March 2026) documents the refactor of canvas-chat, a project he had built with heavy AI assistance. His own framing: &lt;em&gt;"a functional, but tangled, 8,500-line &lt;code&gt;app.js&lt;/code&gt; monolith"&lt;/em&gt; that he then wrangled &lt;em&gt;"into a clean, modular plugin system"&lt;/em&gt; (Ma, 2026). His conclusion is diagnostic: &lt;em&gt;"The AI could add features, fix bugs, and split files when prompted. But it couldn't see the latent architecture — the system that would make the whole thing maintainable"&lt;/em&gt; (Ma, 2026). The mechanism he describes is the same one in the literature above — the AI executes architecture, it does not design it.&lt;/p&gt;

&lt;p&gt;A separate review of six unrelated vibe-coded codebases found near-identical patterns: &lt;strong&gt;four of the six had authorisation gaps where API routes authenticated users but did not authorise them against specific resources; five of the six had the same generic &lt;code&gt;try/catch&lt;/code&gt; block swallowing every async failure&lt;/strong&gt; — the exact pattern &lt;code&gt;catch (error) { console.error(error); return { error: "Something went wrong" } }&lt;/code&gt;, repeated in nearly every async function (JSGuruJobs, &lt;em&gt;I Reviewed 6 Vibe Coded Codebases&lt;/em&gt;, dev.to, 12 February 2026). (&lt;a href="https://rockingtech.co.uk/blog/ai-generated-code-security-what-we-find" rel="noopener noreferrer"&gt;We covered the security side of this pattern separately&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;What both teardowns make clear is that the intervention is not "fix the bugs". It is &lt;strong&gt;imposing the architecture the AI could not see&lt;/strong&gt; — typically by extracting pure functions, isolating feature modules behind explicit interfaces, and introducing an end-to-end test suite &lt;em&gt;first&lt;/em&gt; so the subsequent refactor (even an AI-assisted one) cannot silently regress. Martin Fowler's Strangler Fig pattern, originally from 2004, applies almost without modification: front the vibe-coded monolith with a thin facade, route one endpoint at a time through a properly-architected replacement, and retire the original incrementally (Fowler, 2004). The pattern works because it replaces non-deterministic edits-in-place with deterministic route-level migrations that can be verified.&lt;/p&gt;
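&lt;p&gt;In code terms, the facade is tiny. A sketch, with endpoint names and handler shapes that are illustrative rather than from any real codebase:&lt;/p&gt;

```typescript
// The Strangler Fig facade at its smallest: route migrated endpoints to the
// rebuilt implementation, everything else to the legacy monolith untouched.

type Handler = (path: string) => string;

const legacyApp: Handler = (path) => "legacy:" + path;

// Endpoints move into this table one at a time, each verified before the next.
const rebuilt: { [path: string]: Handler } = {
  "/login": (path) => "rebuilt:" + path,
};

const facade: Handler = (path) => (rebuilt[path] ?? legacyApp)(path);

console.log(facade("/login"));    // prints "rebuilt:/login"
console.log(facade("/invoices")); // prints "legacy:/invoices"
```

Each entry added to the table is a deterministic, reviewable migration of one route; removing the entry rolls it back. That is the property edits-in-place never have.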

&lt;h2&gt;
  
  
  The rewrite-versus-patch test, honestly
&lt;/h2&gt;

&lt;p&gt;The hardest conversation with a founder in the loop is not technical. It is whether to keep patching at all.&lt;/p&gt;

&lt;p&gt;The evidence-based answer comes from the diagnostics above, not intuition. A patch is defensible when duplication is low and localised, there is a coherent module boundary still visible in the code, the authorisation gaps are contained to a small set of endpoints, and the test suite can be backfilled around the existing shape. That profile fits most first-month Lovable and Bolt apps, and a disciplined refactor usually lands in 1–3 weeks.&lt;/p&gt;

&lt;p&gt;A rewrite is cheaper when duplication is pervasive, the call graph shows no coherent module boundaries, there is no test suite to anchor a refactor, and the authorisation model needs redesigning rather than patching. That profile fits most third-month vibe-coded apps — and specifically, most apps where the founder has been in the loop long enough to get here.&lt;/p&gt;

&lt;p&gt;The signal to look for is not how broken the app feels. It is how much of the code the diagnostics say is load-bearing versus decorative. When Madge and &lt;code&gt;route:list&lt;/code&gt; tell you 40% of the code is unreachable, patching it is expensive nostalgia.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do this week
&lt;/h2&gt;

&lt;p&gt;If you are currently 20 prompts deep into a fix-break-fix cycle, the single most valuable hour you can spend is not another prompt. It is running three diagnostics against your repo — &lt;code&gt;jscpd&lt;/code&gt; for duplication percentage, &lt;code&gt;madge --circular&lt;/code&gt; for cyclic and dead-file imports, and your framework's equivalent of &lt;code&gt;php artisan route:list&lt;/code&gt; — and writing down the three numbers they produce.&lt;/p&gt;
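&lt;p&gt;Run from the repo root, that hour looks roughly like this. The flags shown are the tools' documented options; the 5% duplication threshold and the &lt;code&gt;src/main.ts&lt;/code&gt; entry point are placeholders for your own:&lt;/p&gt;

```shell
# 1. Duplication: report copy-paste percentage, fail if it exceeds 5%
npx jscpd src --min-tokens 50 --threshold 5

# 2. Dependency graph: circular imports, then files nothing imports
npx madge --circular src/main.ts
npx madge --orphans src/

# 3. Route inventory (Laravel shown; use your framework's equivalent)
php artisan route:list
```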

&lt;p&gt;Those three numbers will tell you more about whether you are one prompt or one rewrite away from a working app than another £200 of credits will.&lt;/p&gt;

&lt;p&gt;The loop is not a prompting problem. It is a diagnostic problem being managed with a prompting tool. Once you have the diagnosis, the intervention — patch, refactor, or Strangler Fig rewrite — is a decision you can make with numbers instead of frustration.&lt;/p&gt;

&lt;p&gt;More prompting makes AI code regression loops worse because four compounding mechanisms — context rot, non-determinism, absent dependency graphs, and symptom-patching bias — each make every additional prompt &lt;em&gt;more&lt;/em&gt; likely to break something working rather than less. The literature now measures all four. The cost is in credits for as long as you stay in the loop, and in unshipped weeks for as long as you stay in denial about the loop.&lt;/p&gt;

&lt;p&gt;The exit is architectural. Diagnose with static analysis and dependency tools first. Decide patch-versus-rewrite on evidence. Then intervene with the discipline the AI was never going to supply itself.&lt;/p&gt;

&lt;p&gt;Vibe coding built the app. It cannot, on current evidence, finish it — not because it's stupid, but because it's blind to exactly the thing that now matters most: the shape of what you already have.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Anatoly Silko, founder of &lt;a href="https://rockingtech.co.uk" rel="noopener noreferrer"&gt;Rocking Tech&lt;/a&gt; — a UK Laravel agency that builds production platforms, increasingly from AI-generated starting points. &lt;a href="https://rockingtech.co.uk/blog/why-your-vibe-coded-app-keeps-breaking" rel="noopener noreferrer"&gt;The original version of this article&lt;/a&gt; has more detail on next steps if you've hit the wall yourself.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>vibecoding</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your Lovable App Hit a Wall — Here's What to Do Next</title>
      <dc:creator>Anatoly Silko</dc:creator>
      <pubDate>Wed, 15 Apr 2026 10:56:35 +0000</pubDate>
      <link>https://forem.com/anatolysilko/your-lovable-app-hit-a-wall-heres-what-to-do-next-1o0g</link>
      <guid>https://forem.com/anatolysilko/your-lovable-app-hit-a-wall-heres-what-to-do-next-1o0g</guid>
      <description>&lt;p&gt;Security firms audited thousands of Lovable, Bolt.new and Cursor apps. The same three failures appear in nearly every one — most are fixable without starting over. Here's what actually goes wrong, what the research shows, and how to think about whether to patch, refactor, or rebuild.&lt;/p&gt;




&lt;p&gt;This is not a rare experience. Escape.tech scanned 5,600 publicly deployed vibe-coded applications (October 2025) and found over 2,000 vulnerabilities, more than 400 exposed secrets, and 175 instances of exposed personal data — including medical records and bank account numbers. A separate study by Tenzai built fifteen identical test apps across five leading AI coding tools and found 69 vulnerabilities (CSO Online, December 2025). Not one of the fifteen apps had CSRF protection. Not one had rate limiting on login. Not one set security headers.&lt;/p&gt;

&lt;p&gt;These are not edge cases. They are the default output.&lt;/p&gt;

&lt;p&gt;This article explains what actually goes wrong — architecturally — when an AI tool builds your application. Not to make you feel bad about it. The tools are genuinely useful for prototyping, and the work you did has real value. But prototyping tools produce prototyping code, and the gap between "works in preview" and "works in production" is specific, predictable, and well-documented.&lt;/p&gt;




&lt;h2&gt;
  
  
  The database is wide open — and the tools don't tell you
&lt;/h2&gt;

&lt;p&gt;The single most dangerous pattern in vibe-coded applications is a misconfigured database. If you built with Lovable or Bolt.new, your app almost certainly uses Supabase as its backend. Supabase is a solid product. The problem is not Supabase itself — it is what happens when AI generates the connection between your app and the database without implementing the security layer that Supabase requires you to configure manually.&lt;/p&gt;

&lt;p&gt;That security layer is called Row Level Security, or RLS. It controls which users can read, write, and delete which rows in your database. Without it, anyone who knows your Supabase URL — which is visible in your app's JavaScript — can query your entire database directly. Not theoretically. Literally.&lt;/p&gt;

&lt;p&gt;In May 2025, a security researcher scanned 1,645 applications from Lovable's own showcase and found that 170 of them — 10.3% — had critical RLS failures (CVE-2025-48757, CVSS 8.26). The data exposed included names, email addresses, phone numbers, home addresses, and financial records including personal debt amounts.&lt;/p&gt;

&lt;p&gt;Independently, an engineer at a major technology company reproduced the attack during a lunch break. Using fifteen lines of Python and forty-seven minutes of effort, he extracted personal data and API keys from multiple Lovable showcase sites.&lt;/p&gt;

&lt;p&gt;The problem continued into 2026. In February, a researcher found sixteen vulnerabilities — six of them critical — in a single educational app featured on Lovable's Discover page (The Register, February 2026). That app had over 100,000 views. It exposed 18,000 users including students and educators at multiple US universities: 14,928 email addresses, 4,538 student accounts, and 870 records with full personally identifiable information.&lt;/p&gt;

&lt;p&gt;The most dramatic documented case is Moltbook, an AI social network whose founder stated publicly that he wrote no code — the entire platform was vibe-coded. In January 2026, Wiz Research discovered that a Supabase API key exposed in client-side JavaScript, combined with RLS completely disabled, granted full read and write access to the production database. The breach exposed 1.5 million API authentication tokens (for services including OpenAI, Anthropic, AWS, and Google Cloud), 35,000 email addresses, approximately 4,000 private messages, and 4.75 million database records in total.&lt;/p&gt;

&lt;p&gt;At scale, the SupaExplorer project scanned 20,000 indie launch URLs (January 2026) and found that 11% expose Supabase credentials in their frontend code, with a significant portion containing service_role keys — keys that bypass all RLS entirely, granting unrestricted database access.&lt;/p&gt;

&lt;p&gt;Bill Harmer, CISO of Supabase, has stated publicly that Row Level Security is "simple, powerful, and too often ignored." Supabase has since published dedicated resources for vibe coders, including a master security checklist and AI-specific prompts for generating RLS policies. But the tools that generate the code still do not enforce these policies by default.&lt;/p&gt;

&lt;p&gt;If you are running a Supabase-backed application built with AI tools, checking your RLS configuration is not optional. It is the single most urgent thing you can do.&lt;/p&gt;
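&lt;p&gt;For orientation, a minimal RLS setup has the shape below. This is a sketch for a hypothetical &lt;code&gt;notes&lt;/code&gt; table with an &lt;code&gt;owner_id&lt;/code&gt; column, following the pattern Supabase's own security resources recommend; adapt table, column, and policy names to your schema and test against a real second account:&lt;/p&gt;

```sql
-- Enabling RLS with no policies defined blocks all direct access through
-- the anon key; policies then re-open exactly the rows each user may touch.
alter table public.notes enable row level security;

-- Owners can read only their own rows.
create policy "notes_select_own"
  on public.notes for select
  using (auth.uid() = owner_id);

-- Inserts must be made as yourself.
create policy "notes_insert_own"
  on public.notes for insert
  with check (auth.uid() = owner_id);
```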




&lt;h2&gt;
  
  
  Authentication that works for one user in preview — and breaks for everyone in production
&lt;/h2&gt;

&lt;p&gt;The second consistent failure pattern is authentication. Not the absence of authentication — most vibe-coded apps have a login screen. The problem is that the authentication implementation is shallow. It works when you test it yourself, in a single browser, with one account. It breaks under every real-world condition: multiple simultaneous users, token expiry, session handling across devices, password reset flows, and rate limiting.&lt;/p&gt;

&lt;p&gt;Lovable defaults to Supabase Auth. Bolt.new uses Supabase Auth or its own database layer. Cursor generates whatever auth pattern the prompt suggests, with no enforced standard. Vercel's v0 generates no backend logic at all — it is purely frontend.&lt;/p&gt;

&lt;p&gt;Dynamic application security testing by a major security firm (Bright Security, November 2025) revealed what these defaults actually produce when deployed. Testing identical forum-style apps generated by four leading AI tools, they found broken authentication enabling user impersonation, missing access control, no rate limiting (meaning brute-force attacks face no resistance), and weak session handling — across every platform tested. One tool's own built-in static scanner reported zero vulnerabilities in the same codebase that dynamic testing found to contain four critical and one high-severity flaw. The internal scanner was checking syntax. The security firm was testing behaviour.&lt;/p&gt;

&lt;p&gt;A particularly well-documented example: Lovable generates an asynchronous callback inside Supabase's &lt;code&gt;onAuthStateChange()&lt;/code&gt; listener that makes database calls during the authentication flow. Supabase's own documentation explicitly warns against this pattern — it causes deadlocks. The app freezes completely after login. A developer documented this bug publicly (Tomás Pozo, 2025) and reported that Lovable's AI attempted six separate fixes without identifying the root cause, repeatedly adjusting loading states instead of recognising the async callback issue.&lt;/p&gt;
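&lt;p&gt;The shape of the fix is structural rather than clever: do no database work inside the callback itself; record the event and do the work after the callback has returned. The sketch below mocks the listener rather than using the real Supabase client, and every name in it is illustrative:&lt;/p&gt;

```typescript
// Mock of the listener registration; stands in for
// supabase.auth.onAuthStateChange(cb), which invokes cb while the auth
// machinery is still mid-flight.
function onAuthStateChange(cb: (event: string) => void): void {
  cb("SIGNED_IN");
}

const pendingEvents: string[] = [];
const work: string[] = [];

// Safe shape: the callback only records the event and returns immediately.
onAuthStateChange((event) => {
  pendingEvents.push(event); // no awaits, no database calls in here
});

// The expensive part (e.g. loading the user's profile row) happens later,
// outside the callback, once the listener has returned.
for (const event of pendingEvents.splice(0)) {
  if (event === "SIGNED_IN") work.push("loadProfile");
}

console.log(work); // prints [ 'loadProfile' ]
```

The six failed AI fixes in the documented case all adjusted loading states inside the callback; the actual fix is moving the work out of it.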

&lt;p&gt;A study of 100 vibe-coded apps (VibeWrench, March 2026) confirmed the pattern at scale: 70% lacked CSRF protection, 41% had exposed secrets or API keys, 21% had no authentication on API endpoints, and 12% had exposed Supabase credentials. The Tenzai study — fifteen test apps, five tools — independently confirmed: zero had CSRF protection, zero had login rate limiting, and zero set Content Security Policy headers. Every single tool introduced Server-Side Request Forgery vulnerabilities.&lt;/p&gt;

&lt;p&gt;The most instructive public case involved a SaaS founder who built his entire product with Cursor and deployed it without handwritten code. The AI placed all security logic in frontend JavaScript. Within seventy-two hours of launch, users bypassed all payment restrictions by changing a single value in the browser console. The founder publicly announced the shutdown, writing: "I shouldn't have deployed unsecured code to production."&lt;/p&gt;

&lt;p&gt;The pattern is consistent: AI tools generate authentication that looks correct — a login form, a session token, a protected route — but omits the enforcement layer. There is no server-side validation. There is no token refresh logic. There is no protection against automated attacks. The login screen is a door with a lock but no deadbolt. It stops honest people. It does not stop anyone who tries the handle.&lt;/p&gt;
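&lt;p&gt;The missing deadbolt is small in code terms. A minimal, framework-agnostic sketch (the names &lt;code&gt;requireAuth&lt;/code&gt; and &lt;code&gt;verifySession&lt;/code&gt; are assumptions, not any tool's output): every protected route runs through a server-side check before the handler executes, so access is decided on the server rather than by hiding buttons in the frontend.&lt;/p&gt;

```javascript
// Minimal sketch of the enforcement layer AI tools omit. verifySession
// stands in for whatever your auth provider uses to validate a token
// server-side; handler is the protected route logic.
function requireAuth(verifySession, handler) {
  return function (req) {
    const session = verifySession(req.headers.authorization)
    if (!session) {
      return { status: 401, body: 'unauthorised' } // enforced, not cosmetic
    }
    return handler(req, session)
  }
}
```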




&lt;h2&gt;Code that nobody — including the AI — can maintain&lt;/h2&gt;

&lt;p&gt;The third failure is structural. It does not cause a security breach or a crash. It causes something slower and more corrosive: the codebase becomes unmaintainable. Every fix introduces a new bug. Every new feature takes longer than the last. The AI starts contradicting its own earlier decisions. You are not imagining this. It is a documented, measurable phenomenon.&lt;/p&gt;

&lt;p&gt;CodeRabbit analysed 470 GitHub pull requests (December 2025), comparing AI-generated code against human-written code. AI-co-authored code contained 1.7 times more major issues per pull request, with approximately eight times more excessive I/O operations. The single biggest difference across the entire dataset was readability — AI code that technically works but that no human (and often no subsequent AI session) can efficiently understand or modify.&lt;/p&gt;

&lt;p&gt;Faros AI tracked over 10,000 developers across 1,255 teams (2025) and found that developers using AI tools completed 21% more tasks and merged 98% more pull requests. That sounds positive until you see the other side: pull request review time increased by 91%. The bottleneck shifted from writing code to reviewing code — and much of the review time was spent untangling AI decisions that made no architectural sense.&lt;/p&gt;

&lt;p&gt;The dependency problem compounds this. Endor Labs analysed 10,663 GitHub repositories (November 2025) and found that only one in five dependency versions recommended by AI coding assistants was safe — neither hallucinated (pointing to a package that does not exist) nor carrying known security vulnerabilities. Between 44% and 49% of dependencies imported by AI agents contained known vulnerabilities. Your app may technically run, but the libraries it relies on are a minefield.&lt;/p&gt;

&lt;p&gt;At the code level, one practitioner who runs weekly audits of vibe-coded apps published a sample scoring 62 out of 100 — a "Caution" rating (Beesoul, January 2026). Specific findings included 47 database calls per single page request, admin routes accessible without valid session tokens, and search functions with no input sanitisation. In a SaaS startup built with Cursor, a live Stripe secret key was embedded directly in a React payment component — visible to anyone who opened browser developer tools.&lt;/p&gt;

&lt;p&gt;The same auditor estimates that only about 10% of vibe-coded apps pass a clean audit. The ones that do "usually involve a technical co-founder" who understood the output well enough to catch and correct the AI's mistakes before deployment.&lt;/p&gt;

&lt;p&gt;A dual-model audit experiment (Building Burrow, January 2026) ran a vibe-coded project through two leading AI models simultaneously. Both flagged issues. Then a human engineer reviewed the same codebase and found "a lot of very basic issues that were overlooked" by both models — including violations of the Single Source of Truth principle (competing state stores managing the same data), copy-paste code where shared utilities should exist, and significant dead code including deprecated functions and unused exports that inflated the codebase and confused future AI sessions.&lt;/p&gt;

&lt;p&gt;Carnegie Mellon University researchers studied 807 GitHub repositories using Cursor (2025) and concluded that AI tools were functionally correct 61% of the time but produced secure code only 10.5% of the time. Their summary: "AI briefly accelerates code generation, but the underlying code quality trends continue to move in the wrong direction."&lt;/p&gt;

&lt;p&gt;This is the context behind the "fix one thing, break ten others" experience. It is not randomness. It is architectural debt accumulating faster than the AI can pay it down. Each prompt adds code without integrating it into a coherent structure. The codebase grows, but it does not improve. Eventually, complexity reaches what one auditor calls the "Spaghetti Code Limit" — the threshold beyond which every new feature takes exponentially longer to implement, and every fix introduces new breakage.&lt;/p&gt;




&lt;h2&gt;What the preview-to-production gap actually looks like&lt;/h2&gt;

&lt;p&gt;Everything discussed so far — open databases, broken auth, unmaintainable code — exists in your application right now, in development. But the gap widens dramatically when you move from preview to production. Vibe-coding tools generate no CI/CD pipelines, no database migration scripts, no logging or monitoring, and no environment variable management by default.&lt;/p&gt;

&lt;p&gt;The most publicly documented production failure involved a well-known SaaS founder whose AI agent wiped data for over 1,200 executives and 1,190 companies from a live database during a designated code freeze (Fortune, July 2025). The agent then fabricated approximately 4,000 fake database records. When confronted, it admitted to running unauthorised commands and "lying on purpose."&lt;/p&gt;

&lt;p&gt;At enterprise scale, Amazon disclosed in March 2026 that AI-generated code changes caused two major production incidents within three days (Business Insider, March 2026). The first resulted in approximately 120,000 lost orders due to incorrect delivery times. The second — a production change deployed without formal documentation — caused a 99% drop in orders across North American marketplaces, representing 6.3 million lost orders. Amazon's CTO warned publicly that language models "sometimes make assumptions you do not realise they are making."&lt;/p&gt;

&lt;p&gt;These are extreme cases. But they illustrate the same structural problem that affects every vibe-coded app moving from development to production: the AI generates code for a single-user, single-environment context. It does not generate the infrastructure that makes code work reliably across environments, at scale, over time. Empty try/catch blocks swallow errors silently, meaning your app crashes in production with no logs to diagnose the failure. Context retention in AI tools degrades noticeably once projects exceed fifteen to twenty components. And the thousand-user milestone — often the first real stress test — is typically when database queries without pagination, synchronous external dependencies, and absent monitoring become visible simultaneously.&lt;/p&gt;
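&lt;p&gt;The empty try/catch problem in particular is cheap to fix once you can see it. A minimal sketch (the &lt;code&gt;logger&lt;/code&gt; parameter stands in for whatever logging you have in place): the catch block records the error with context and returns an explicit failure value instead of swallowing the exception.&lt;/p&gt;

```javascript
// Sketch of the silent-failure pattern corrected. An empty catch block
// means the app fails in production with nothing in the logs; at minimum,
// record the error with context and return an explicit result.
function parseWebhookPayload(raw, logger) {
  try {
    return { ok: true, data: JSON.parse(raw) }
  } catch (err) {
    logger(`webhook payload rejected: ${err.message}`) // context, not silence
    return { ok: false, data: null }                   // explicit failure value
  }
}
```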




&lt;h2&gt;What is actually salvageable — and what the three options look like&lt;/h2&gt;

&lt;p&gt;The question founders ask most often is: do I need to start over?&lt;/p&gt;

&lt;p&gt;Usually, no.&lt;/p&gt;

&lt;p&gt;The emerging consensus from practitioners who assess vibe-coded apps professionally is clear: frontend components are largely salvageable. The problems are almost always in the backend — authentication, database design, security, and error handling. A 2026 survey (The New Stack) found that 76% of developers report having to rewrite or refactor at least half of AI-generated code before it reaches production. But "at least half" also means "not all." The frontend — the screens, the layouts, the user interface that you spent weeks refining — is frequently worth keeping.&lt;/p&gt;

&lt;p&gt;Your vibe-coded app served a purpose that a blank page never could. It validated your idea with real users. It clarified requirements that no written specification could have captured. It proved that people want what you are building. That is not wasted work. It is the most expensive part of building a product — market validation — accomplished at a fraction of the traditional cost.&lt;/p&gt;

&lt;p&gt;The decision framework has three options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Patch&lt;/strong&gt; means fixing specific, isolated issues without changing the underlying architecture. This works when the problems are surface-level: a missing RLS policy that can be added, an exposed API key that can be moved to environment variables, a specific authentication bug that can be resolved. Patching is appropriate when the architecture is fundamentally sound and the technical debt is contained. In practice, this applies to roughly 10% of vibe-coded apps — the ones where a technical co-founder caught most issues early, or where the app's scope is genuinely simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refactor&lt;/strong&gt; means keeping the working parts — typically the frontend and validated business logic — while rebuilding the backend architecture. This is the most common path. The frontend your users already know and use stays intact. The database gets proper schema design, indexing, and RLS policies. Authentication gets server-side enforcement. Error handling gets implemented throughout. The result is the same product, with the same user experience, running on a foundation that can actually handle production traffic. Refactoring typically involves modifying 30–50% of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rebuild&lt;/strong&gt; means starting the technical implementation from scratch, using the existing application as a living requirements document. This is appropriate when the technical debt exceeds 80% of the codebase — when the architecture is so tangled that fixing individual components would take longer than rebuilding (BayOne, 2026). Even in a full rebuild, nothing is truly lost: validated user flows, design patterns that work, business logic that users have confirmed, and the market understanding you gained are all preserved. The rebuild is faster and more accurate than building from a written specification, because you have a working prototype to reference instead of a document to interpret.&lt;/p&gt;

&lt;p&gt;The critical point: you cannot determine which option is right without assessing what is actually in the codebase. An AI tool will not give you an honest answer — its incentive is to keep generating fixes. A quick-fix freelancer will not give you a structural answer — their incentive is to bill hours on individual bugs. The assessment itself is the first step.&lt;/p&gt;




&lt;h2&gt;The tools are not the villain. The gap is the gap.&lt;/h2&gt;

&lt;p&gt;Collins Dictionary named "vibe coding" its Word of the Year for 2026. Cursor has crossed a million daily active users (Contrary Research, December 2025). Bolt.new added five million registered users in its first five months. Replit now claims over fifty million accounts and has generated nine million complete applications (Forbes, March 2026). These tools are not going away, and they should not. They have democratised the ability to build and test ideas at a speed that was unimaginable three years ago.&lt;/p&gt;

&lt;p&gt;But prototyping tools produce prototyping code. That is not a criticism — it is a description. The same way a sketch is not a blueprint, a vibe-coded app is not a production system. The sketch has value. The blueprint has different value. The gap between them is specific, measurable, and closable.&lt;/p&gt;

&lt;p&gt;The research is unambiguous on what that gap contains: misconfigured database security, shallow authentication, unmaintainable code structure, and absent deployment infrastructure. These are not random failures. They are the predictable output of tools optimised for speed of generation rather than reliability of operation. Understanding this means you can stop blaming yourself for hitting the wall — and start making a clear-eyed decision about what to do next.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Anatoly Silko, founder of &lt;a href="https://rockingtech.co.uk" rel="noopener noreferrer"&gt;Rocking Tech&lt;/a&gt; — a UK Laravel agency that builds production platforms, increasingly from AI-generated starting points. The &lt;a href="https://rockingtech.co.uk/blog/your-lovable-app-hit-a-wall" rel="noopener noreferrer"&gt;original version of this article&lt;/a&gt; has more detail on next steps if you've hit the wall yourself.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>ai</category>
      <category>security</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Your AI-Generated Code Isn't Secure — Here's What We Find Every Time</title>
      <dc:creator>Anatoly Silko</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:52:43 +0000</pubDate>
      <link>https://forem.com/anatolysilko/your-ai-generated-code-isnt-secure-heres-what-we-find-every-time-3h63</link>
      <guid>https://forem.com/anatolysilko/your-ai-generated-code-isnt-secure-heres-what-we-find-every-time-3h63</guid>
      <description>&lt;p&gt;Veracode tested 150+ AI models and found 45% of generated code introduces OWASP Top 10 vulnerabilities. The failure rate for cross-site scripting defences is 86% — and it isn't improving with newer models. Here's what that looks like inside a real codebase, what you can check yourself in 30 minutes, and what the UK's National Cyber Security Centre is now saying about it.&lt;/p&gt;




&lt;p&gt;If you built something with Lovable, Bolt.new, Cursor, Replit, or v0 — and it's live, or about to be — six specific security problems are almost certainly sitting in your codebase right now.&lt;/p&gt;

&lt;p&gt;That's not opinion. It's the consistent finding across every major independent security study published in the past twelve months: Veracode's 150-model benchmark, DryRun Security's assessment of three leading AI agents, Apiiro's scan of 62,000 enterprise repositories, and a Georgia Tech research team tracking real vulnerabilities in real time. The tools write code that runs. They don't write code that's safe.&lt;/p&gt;

&lt;p&gt;This article gives you the practitioner's view: what the six problems are, how to check for them yourself in 30 minutes using free tools, what the UK's own National Cyber Security Centre said about it, and what the independent research actually found.&lt;/p&gt;




&lt;h2&gt;The six things we find in every assessment&lt;/h2&gt;

&lt;p&gt;The same six security failures appear in virtually every AI-generated codebase. They're not exotic exploits — they're the security equivalent of leaving the front door unlocked. And they're the first things attackers look for because they're the easiest to find.&lt;/p&gt;

&lt;h3&gt;1. Your secret keys are in the code anyone can read&lt;/h3&gt;

&lt;p&gt;When you tell an AI tool to "connect to Stripe" or "add OpenAI," it pastes the secret key directly into a JavaScript file that ships to every user's browser — visible to anyone who opens developer tools.&lt;/p&gt;
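&lt;p&gt;The correct shape is a thin server-side proxy: the secret lives in an environment variable on the server, and the browser only ever calls your own endpoint. This is an illustrative sketch — the variable name and provider URL are assumptions, not any tool's generated code.&lt;/p&gt;

```javascript
// Minimal sketch: the key is read from a server-side environment variable
// and never appears in code shipped to the browser. fetchFn is injected so
// the function stays testable; in an app it would be the global fetch.
function callProvider(fetchFn, payload) {
  const key = process.env.PROVIDER_API_KEY // read on the server only
  if (!key) throw new Error('PROVIDER_API_KEY is not set')
  return fetchFn('https://api.example-provider.com/v1/complete', {
    method: 'POST',
    headers: { Authorization: `Bearer ${key}` },
    body: JSON.stringify(payload),
  })
}
```

&lt;p&gt;The browser then posts to your own route (for example &lt;code&gt;/api/complete&lt;/code&gt;), and only the server talks to the provider.&lt;/p&gt;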

&lt;p&gt;GitGuardian's 2026 analysis of public GitHub found 28.65 million new hardcoded secrets pushed in 2025 — a 34% increase year-on-year (GitGuardian, State of Secrets Sprawl 2026). AI-assisted commits leaked secrets at 3.2% versus the 1.5% baseline: more than double the rate. Supabase credential leaks specifically rose 992%.&lt;/p&gt;

&lt;p&gt;A SaaS founder who built his entire product with Cursor was attacked within days of sharing it publicly. Attackers found his exposed API keys, maxed out his usage, and ran up a $14,000 OpenAI bill. He shut down permanently.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If your Stripe secret key is in your frontend code, anyone can issue refunds to themselves. If your OpenAI key is exposed, anyone can run your API credits to zero overnight.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;2. User input goes straight to the database without checks&lt;/h3&gt;

&lt;p&gt;AI generates the shortest path to working code. That means pasting user input directly into database queries instead of using parameterised queries — the standard defence against SQL injection that has existed for over twenty years. It also means rendering user-submitted text without escaping it, creating cross-site scripting vulnerabilities.&lt;/p&gt;
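&lt;p&gt;The difference is one line. In this sketch, &lt;code&gt;db.query(sql, params)&lt;/code&gt; is a generic stand-in for whatever database driver your app uses (placeholder syntax varies by database; &lt;code&gt;?&lt;/code&gt; is common):&lt;/p&gt;

```javascript
// BAD (shape of the generated code): user input spliced into the SQL string,
// so crafted input becomes part of the query itself.
//   db.query("SELECT id FROM users WHERE email = '" + input + "'")

// GOOD: a parameterised query. The driver sends the value separately from
// the SQL text, so it can never be interpreted as SQL.
function findUserByEmail(db, email) {
  return db.query('SELECT id, email FROM users WHERE email = ?', [email])
}
```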

&lt;p&gt;Veracode found an 86% failure rate on XSS defences across all 150+ models tested — with no improvement in the latest generation (Veracode, GenAI Code Security Report, July 2025). These are among the oldest and most exploited vulnerabilities on the internet, and AI tools are reintroducing them at industrial scale.&lt;/p&gt;

&lt;h3&gt;3. Your APIs have no speed limit&lt;/h3&gt;

&lt;p&gt;An API without rate limiting is an open invitation. Attackers can try thousands of passwords per second. Competitors can scrape every record. Bots can flood expensive AI features and run up cloud bills.&lt;/p&gt;

&lt;p&gt;DryRun Security's March 2026 study found the most telling detail: rate-limiting middleware was defined in every codebase. The AI wrote the code for it. But not a single agent actually connected it to the application. The safety net existed in the files — it just didn't work (DryRun Security, Agentic Coding Security Report, March 2026).&lt;/p&gt;
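&lt;p&gt;A working limiter is small. This is a minimal fixed-window sketch (in an Express app the equivalent of "actually connecting it" is one &lt;code&gt;app.use(...)&lt;/code&gt; line); the &lt;code&gt;now&lt;/code&gt; parameter is injected so the window logic is testable.&lt;/p&gt;

```javascript
// Minimal fixed-window rate limiter. Returns a function that answers
// "may this client make another request right now?". The crucial step the
// agents skipped is calling this from every request path.
function makeLimiter(maxPerWindow, windowMs, now) {
  const hits = new Map()
  return function allow(clientId) {
    const t = now()
    const entry = hits.get(clientId)
    if (!entry) { hits.set(clientId, { start: t, count: 1 }); return true }
    if (t - entry.start >= windowMs) { entry.start = t; entry.count = 1; return true }
    entry.count += 1
    return maxPerWindow >= entry.count
  }
}
```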

&lt;h3&gt;4. File uploads accept anything&lt;/h3&gt;

&lt;p&gt;When AI builds an upload feature — profile pictures, documents, attachments — it saves whatever file the user provides without checking the type, size, or filename. This opens the door to uploading executable scripts, overwriting server files, or crashing the application with oversized files.&lt;/p&gt;

&lt;p&gt;JFrog's research found that even when the AI does add file validation, it generates naive checks that block only the most literal attack patterns and can be bypassed with encoding or absolute paths (JFrog, Analyzing Common Vulnerabilities Introduced by Code-Generative AI).&lt;/p&gt;

&lt;h3&gt;5. No browser-level security headers&lt;/h3&gt;

&lt;p&gt;Every modern browser supports security headers — single-line configuration directives that control which scripts can run, whether to force HTTPS, and whether the site can be framed. Content-Security-Policy, Strict-Transport-Security, X-Frame-Options. AI tools never add them.&lt;/p&gt;

&lt;p&gt;In the Tenzai study — fifteen apps built by five major AI coding tools — not one set any security headers. Zero out of fifteen (Tenzai, Secure Coding Comparison, December 2025).&lt;/p&gt;
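&lt;p&gt;For reference, the headers in question fit in a single map applied by one middleware. The CSP value below is a deliberately strict starting point, not a drop-in policy — most apps will need to loosen it for their own scripts and assets.&lt;/p&gt;

```javascript
// The browser-level protections as one header map. Apply once, on every
// response, from a single middleware.
const SECURITY_HEADERS = {
  'Content-Security-Policy': "default-src 'self'",
  'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
  'X-Frame-Options': 'DENY',
  'X-Content-Type-Options': 'nosniff',
  'Referrer-Policy': 'no-referrer',
}

function applySecurityHeaders(res) {
  for (const [name, value] of Object.entries(SECURITY_HEADERS)) {
    res.setHeader(name, value)
  }
  return res
}
```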

&lt;h3&gt;6. Server-side request forgery on every URL feature&lt;/h3&gt;

&lt;p&gt;When AI builds a feature that fetches data from a URL — link previews, image proxies, webhooks — it makes the server request whatever URL the user provides, including internal cloud metadata endpoints that expose full infrastructure credentials.&lt;/p&gt;

&lt;p&gt;The AppSec Santa 2026 study found SSRF was the single most common vulnerability across all six models tested, with 32 confirmed findings (AppSec Santa, AI Code Security Study, 2026). The Capital One breach — 100 million records, an $80 million fine — started with exactly this vulnerability class.&lt;/p&gt;
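&lt;p&gt;The minimum viable guard parses the URL and refuses obvious internal targets before the server fetches anything. This is the shape of the check, not a complete defence — a production version must also resolve DNS and re-validate the resulting IP, since a hostname can point anywhere.&lt;/p&gt;

```javascript
// Minimal SSRF guard: allow only http(s) and refuse hostnames that point
// at obvious internal targets, including the cloud metadata endpoint.
function isFetchableUrl(raw) {
  let url
  try { url = new URL(raw) } catch { return false }
  if (url.protocol !== 'http:') {
    if (url.protocol !== 'https:') return false    // no file:, gopher:, etc.
  }
  const host = url.hostname
  if (host === 'localhost') return false
  if (host === '169.254.169.254') return false     // cloud metadata endpoint
  const blockedPrefixes = ['127.', '10.', '192.168.', '169.254.', '0.']
  if (blockedPrefixes.some((p) => host.startsWith(p))) return false
  const parts = host.split('.')
  if (parts[0] === '172') {                        // 172.16.0.0/12 range
    const octet = Number(parts[1])
    if (octet >= 16) { if (31 >= octet) return false }
  }
  return true
}
```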




&lt;h2&gt;How to check yours in the next 30 minutes&lt;/h2&gt;

&lt;p&gt;You don't need a developer for this. The checks below use free, public tools and take less than 30 minutes combined. They won't catch everything, but they'll tell you whether you have an immediate problem.&lt;/p&gt;

&lt;h3&gt;Check 1: Security headers&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://securityheaders.com" rel="noopener noreferrer"&gt;securityheaders.com&lt;/a&gt; or &lt;a href="https://developer.mozilla.org/en-US/observatory" rel="noopener noreferrer"&gt;Mozilla HTTP Observatory&lt;/a&gt;. Enter your URL. You'll get a letter grade from A+ to F. If you score D, E, or F, your app is missing critical browser-level protections. Most vibe-coded apps score F.&lt;/p&gt;

&lt;h3&gt;Check 2: Exposed secrets in source code&lt;/h3&gt;

&lt;p&gt;In Chrome, press Ctrl+U (Cmd+Option+U on Mac) to view page source. Search for: &lt;code&gt;sk_live&lt;/code&gt; (Stripe secret key), &lt;code&gt;sk-&lt;/code&gt; (OpenAI), &lt;code&gt;AKIA&lt;/code&gt; (AWS), &lt;code&gt;password&lt;/code&gt;, &lt;code&gt;secret&lt;/code&gt;, &lt;code&gt;api_key&lt;/code&gt;. Public keys like Stripe's &lt;code&gt;pk_live_&lt;/code&gt; are expected. Secret keys should never appear in frontend code.&lt;/p&gt;
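&lt;p&gt;The same check can be scripted against the served HTML and JavaScript. The regexes below are rough approximations of the key shapes named above, not authoritative formats — treat any match as a prompt to investigate, not proof of a leak.&lt;/p&gt;

```javascript
// Scan a page's source text for secret-shaped strings. Patterns are
// approximate sketches of common key prefixes, not official formats.
const SECRET_PATTERNS = [
  /sk_live_[0-9a-zA-Z]+/,   // Stripe secret key
  /sk-[0-9a-zA-Z_-]{20,}/,  // OpenAI-style secret key
  /AKIA[0-9A-Z]{16}/,       // AWS access key ID
]

function findSecretLike(text) {
  return SECRET_PATTERNS
    .map((p) => text.match(p))
    .filter((m) => m !== null)
    .map((m) => m[0])
}
```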

&lt;h3&gt;Check 3: Exposed .env file&lt;/h3&gt;

&lt;p&gt;Type your domain followed by &lt;code&gt;/.env&lt;/code&gt; — for example, &lt;code&gt;https://yourapp.com/.env&lt;/code&gt;. If you see anything other than a 404 page, your secrets file is publicly accessible. This is a critical emergency. Also try &lt;code&gt;/.env.local&lt;/code&gt; and &lt;code&gt;/.env.production&lt;/code&gt;. In 2024, Palo Alto Networks documented a campaign that exploited exposed .env files across more than 110,000 domains.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you type your domain followed by /.env and see database passwords instead of a 404 page, stop reading this article and fix that first.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;Check 4: Supabase database security&lt;/h3&gt;

&lt;p&gt;If your app uses Supabase, log into the Dashboard → Database → Security Advisor. Look for check 0013: "RLS disabled in public." If any table shows Row Level Security disabled, anyone on the internet can read the entire contents using nothing more than the URL visible in your app's JavaScript.&lt;/p&gt;

&lt;h3&gt;Check 5: SSL certificate&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://www.ssllabs.com/ssltest/" rel="noopener noreferrer"&gt;ssllabs.com/ssltest&lt;/a&gt; and enter your domain. Takes two minutes. Most modern hosting should give an automatic A. Anything below that indicates a misconfiguration.&lt;/p&gt;

&lt;h3&gt;Check 6: Debug mode in production&lt;/h3&gt;

&lt;p&gt;Visit a non-existent page on your site — something like &lt;code&gt;/this-does-not-exist-12345&lt;/code&gt;. If you see file paths, stack traces, or database details instead of a simple 404, debug mode is enabled. This exposes your application's internals to anyone who triggers an error.&lt;/p&gt;

&lt;h3&gt;Check 7: What Google has indexed&lt;/h3&gt;

&lt;p&gt;Type &lt;code&gt;site:yourapp.com&lt;/code&gt; into Google. Then try &lt;code&gt;site:yourapp.com inurl:admin&lt;/code&gt; for exposed admin panels, or &lt;code&gt;site:yourapp.com filetype:env&lt;/code&gt; for indexed secrets files. Any result you didn't expect to be public shouldn't be.&lt;/p&gt;




&lt;h2&gt;What the UK government said&lt;/h2&gt;

&lt;p&gt;On 24 March 2026, NCSC CEO Richard Horne addressed vibe coding directly at the RSA Conference. The companion blog post by NCSC CTO Dave Chismon described AI-generated code as presenting "intolerable risks" for many organisations and warned that within five years it will become common to see AI-written code in production that a human has never reviewed (NCSC, "Vibe check: AI may replace SaaS (but not for a while)," March 2026).&lt;/p&gt;

&lt;p&gt;That phrasing — "intolerable risks" — came from the UK government's own cybersecurity authority. Not a vendor. Not a consultant. The NCSC.&lt;/p&gt;

&lt;h3&gt;What the ICO expects from you&lt;/h3&gt;

&lt;p&gt;Existing obligations under Article 32 of UK GDPR — requiring "appropriate technical and organisational measures" to protect personal data — are technology-neutral. The ICO does not distinguish between human-written and AI-generated code when assessing whether your security is adequate.&lt;/p&gt;

&lt;p&gt;The enforcement record makes the consequences concrete. Advanced Computer Software Group was fined £3.07 million in March 2025 for failing to implement multi-factor authentication, vulnerability scanning, and adequate patch management — exactly the kinds of controls AI-generated code consistently omits. No UK business has yet been fined specifically for a breach caused by AI-generated code. But the vulnerabilities the ICO penalises are precisely what every study cited in this article finds in vibe-coded applications.&lt;/p&gt;

&lt;h3&gt;What your insurer may not cover&lt;/h3&gt;

&lt;p&gt;42% of UK organisations report their cyber insurance policy now specifically excludes liabilities associated with AI misuse (SecurityBrief UK, 2025–2026). If your app is built with AI tools and your insurer doesn't know, your coverage may not be what you think it is.&lt;/p&gt;




&lt;h2&gt;The research behind the numbers&lt;/h2&gt;

&lt;p&gt;Everything above rests on independent research with disclosed methodology and large sample sizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Veracode:&lt;/strong&gt; 150+ models, 80 tasks, 45% failure rate. Java fared worst at 72%. XSS defences failed in 86% of samples. Model size made no meaningful difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apiiro:&lt;/strong&gt; 62,000 repos across Fortune 50 enterprises. AI-assisted developers introduced 10,000+ new security findings per month by mid-2025 — a tenfold increase. Privilege escalation paths jumped 322%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DryRun Security:&lt;/strong&gt; Three AI agents, two apps each, 30 pull requests. 26 of 30 contained at least one vulnerability. Four authentication weaknesses appeared in every final codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitGuardian:&lt;/strong&gt; 1.94 billion public GitHub commits analysed. 28.65 million leaked secrets in 2025, up 34%. AI-assisted commits leaked at double the baseline rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Georgia Tech:&lt;/strong&gt; 74 confirmed AI-linked CVEs from 43,849 advisories. Monthly growth: 6 in January, 15 in February, 35 in March 2026. 39 rated Critical or High. Researchers estimate the actual number is 5–10× higher.&lt;/p&gt;




&lt;h2&gt;The scale of what's been built&lt;/h2&gt;

&lt;p&gt;Cursor confirmed over one million daily active users by late 2025 and now reports seven million monthly. Lovable was closing in on eight million users, generating over 100,000 new projects every day. Bolt.new reached three million registered users within five months.&lt;/p&gt;

&lt;p&gt;Collins Dictionary named "vibe coding" its Word of the Year for 2026. Google reports AI now generates 41% of all code written globally. The security gap documented by every study in this article is baked into the output of tools used by tens of millions of people, at rates between 45% and 87% depending on methodology.&lt;/p&gt;




&lt;h2&gt;What to do about it&lt;/h2&gt;

&lt;p&gt;The patterns described in this article are not exotic. They're the security equivalent of leaving the front door unlocked — basic hygiene that professional developers implement as a matter of course, and that AI tools systematically skip because they optimise for "does it run?" rather than "is it safe?"&lt;/p&gt;

&lt;p&gt;That's actually good news. It means the problems are fixable.&lt;/p&gt;

&lt;p&gt;The AI tools that built your app are not villains. They did exactly what they were designed to do: generate working code quickly from a natural-language prompt. The gap isn't a bug — it's a design choice. Prototyping tools optimise for speed. Production systems require security. No amount of re-prompting closes that gap, because the tools don't have the context about your business, your users, or your regulatory obligations that security decisions require.&lt;/p&gt;

&lt;p&gt;That context is what a human assessment provides.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Anatoly Silko, founder of &lt;a href="https://rockingtech.co.uk" rel="noopener noreferrer"&gt;Rocking Tech&lt;/a&gt; — a UK-based agency that builds production Laravel platforms, increasingly from AI-generated starting points. If you've built something with AI tools and want to know whether it's production-ready, the &lt;a href="https://rockingtech.co.uk/blog/ai-generated-code-security-what-we-find" rel="noopener noreferrer"&gt;original version of this article&lt;/a&gt; has more detail on next steps.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>webdev</category>
      <category>vibecoding</category>
    </item>
  </channel>
</rss>
