<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: hargurjeet singh</title>
    <description>The latest articles on Forem by hargurjeet singh (@gurjeet333).</description>
    <link>https://forem.com/gurjeet333</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808127%2F8ecc00cb-74b6-4a9b-89fa-107b8c17984e.png</url>
      <title>Forem: hargurjeet singh</title>
      <link>https://forem.com/gurjeet333</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gurjeet333"/>
    <language>en</language>
    <item>
      <title>Vibe Coding in Production: How to Ship AI-Generated Code Responsibly</title>
      <dc:creator>hargurjeet singh</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:01:33 +0000</pubDate>
      <link>https://forem.com/gurjeet333/vibe-coding-in-production-how-to-ship-ai-generated-code-responsibly-174m</link>
      <guid>https://forem.com/gurjeet333/vibe-coding-in-production-how-to-ship-ai-generated-code-responsibly-174m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Notes from a recent developer conference from AWS and Anthropic — practical wisdom for engineers navigating the AI-assisted coding era.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1555066931-4365d14bab8c%3Fw%3D1000%26auto%3Dformat%26fit%3Dcrop" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1555066931-4365d14bab8c%3Fw%3D1000%26auto%3Dformat%26fit%3Dcrop" alt="Developer working with AI-generated code" width="1000" height="667"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The era of AI-assisted coding is here — but shipping it responsibly requires a new mindset.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Elephant in the Room
&lt;/h2&gt;

&lt;p&gt;Let's not sugarcoat it — vibe coding is controversial.&lt;/p&gt;

&lt;p&gt;A lot of developers hear "vibe coding" and immediately picture someone blindly prompting an AI, copy-pasting whatever comes out, and calling it a day. And honestly? That fear isn't entirely unfounded.&lt;/p&gt;

&lt;p&gt;But here's the thing: &lt;strong&gt;AI is going to generate a massive amount of code in the near future.&lt;/strong&gt; We're talking about AI systems that can already handle tasks taking a human an hour — and that capability is doubling roughly every 7 months, according to METR's 2025 benchmark study.&lt;/p&gt;

&lt;p&gt;The question isn't whether you'll encounter AI-generated code in production — it's whether you'll know how to work with it responsibly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📊 &lt;strong&gt;By the numbers:&lt;/strong&gt; 42% of all code committed today is AI-assisted (expected to rise to 65% by 2027). 84% of developers are already using or planning to use AI tools in their workflow. Yet 96% say they don't fully trust the output.&lt;br&gt;
&lt;em&gt;(Sources: Sonar State of Code 2025, Stack Overflow Developer Survey 2025)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzpgmgosew366f54i9mx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzpgmgosew366f54i9mx.png" alt="Stack Overflow 2025: breakdown of how frequently developers use AI tools — 47% daily, 18% weekly, 14% monthly, 5% plan to, 16% don't plan to" width="800" height="370"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;84% of developers use or plan to use AI tools — with 47% already using them daily. Source: &lt;a href="https://survey.stackoverflow.co/2025/ai" rel="noopener noreferrer"&gt;Stack Overflow Developer Survey 2025&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgu99yxo0wwd2jzeakfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgu99yxo0wwd2jzeakfq.png" alt="Sonar State of Code 2025: where developers use AI — 88% for prototypes, 83% for internal production systems, 73% for customer-facing apps, 58% for business-critical services" width="800" height="372"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI is no longer just for experiments — 58% of developers use it in business-critical services. Source: &lt;a href="https://shiftmag.dev/state-of-code-2025-7978/" rel="noopener noreferrer"&gt;Sonar State of Code 2025&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozp3tlen9k42zoz0p2n0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozp3tlen9k42zoz0p2n0.png" alt="Sonar State of Code 2025: 96% of developers doubt the reliability of AI-generated code, citing subtle errors and hidden flaws" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Exponential You Can't Ignore
&lt;/h2&gt;

&lt;p&gt;Researchers at METR tracked how long a task an AI agent can complete at 50% reliability. The finding: this "time horizon" has been growing exponentially for six straight years — doubling approximately every 7 months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma1i9evl4bl38q3ff8gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma1i9evl4bl38q3ff8gb.png" alt="The length of tasks AIs can complete is doubling every 7 months" width="800" height="478"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI task-completion time horizon, doubling every ~7 months since 2019. Source: &lt;a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/" rel="noopener noreferrer"&gt;METR, March 2025&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the time horizon currently sitting at around two hours, extrapolations suggest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early 2027:&lt;/strong&gt; ~16 hours of work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early 2028:&lt;/strong&gt; ~5 days of work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Within a decade:&lt;/strong&gt; Multi-week software projects, handled autonomously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't science fiction. It's a trend that has remained consistent since 2019, and there's no evidence of it plateauing. In fact, in 2024–2025 the doubling rate &lt;em&gt;accelerated&lt;/em&gt; to roughly every 4 months.&lt;/p&gt;

&lt;p&gt;As a software engineer, this is the single most important number you should internalize. Your workflows need to evolve ahead of this curve — not behind it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Vibe Coding Actually Works Today
&lt;/h2&gt;

&lt;p&gt;The most successful use cases right now tend to be in low-stakes, high-experimentation environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proof-of-concept projects&lt;/strong&gt; (POCs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Game development and creative side projects&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Controlled, sandboxed environments&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internal tooling with limited blast radius&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These contexts share a common trait: the cost of failure is low and the feedback loop is fast. You can let the AI run, see what it produces, verify the outcome, and iterate. That's where vibe coding shines today.&lt;/p&gt;

&lt;p&gt;It's no coincidence that younger developers are the fastest adopters. Stack Overflow's 2025 survey found developers aged 18–24 are &lt;strong&gt;twice as likely&lt;/strong&gt; to use AI daily compared to developers over 45.&lt;/p&gt;

&lt;p&gt;But production systems are a different beast. Higher stakes demand a higher level of responsibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight: Trust the System, Not Every Line
&lt;/h2&gt;

&lt;p&gt;Here's a mental model that clicked at the conference:&lt;/p&gt;

&lt;p&gt;Think back to when compilers were first introduced. Early programmers were skeptical. They wanted to read and verify the assembly output by hand. But as complexity scaled, that became impossible. At some point, you &lt;em&gt;had&lt;/em&gt; to trust the compiler. You shifted your verification to the &lt;strong&gt;output behavior&lt;/strong&gt;, not the internal mechanism.&lt;/p&gt;

&lt;p&gt;We're at a similar inflection point with AI-generated code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"We have to start learning that the code does not exist — but the product does."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the mindset shift. You're not the author of every line anymore. You're the &lt;strong&gt;owner of the outcome&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  This Problem Is Older Than Software
&lt;/h2&gt;

&lt;p&gt;Managing things you don't fully understand is not a new problem. It's as old as civilization itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb1g6b5xizeo4scnfswi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb1g6b5xizeo4scnfswi.png" alt="AI models succeeding at increasingly longer tasks over time" width="800" height="422"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Models are succeeding at increasingly long tasks — the gap between AI and human task lengths is closing fast. Source: METR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Consider:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;What they manage&lt;/th&gt;
&lt;th&gt;What they &lt;em&gt;don't&lt;/em&gt; fully know&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CTO&lt;/td&gt;
&lt;td&gt;Engineering teams and systems&lt;/td&gt;
&lt;td&gt;Deep domain expertise in every stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product Manager&lt;/td&gt;
&lt;td&gt;Product features and roadmap&lt;/td&gt;
&lt;td&gt;Full implementation details&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CEO&lt;/td&gt;
&lt;td&gt;Company finances and strategy&lt;/td&gt;
&lt;td&gt;The intricacies of accounting&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And yet, these people ship products, close quarters, and lead organizations successfully every day. How?&lt;/p&gt;

&lt;p&gt;They don't verify &lt;em&gt;everything&lt;/em&gt;. They verify &lt;strong&gt;the right abstraction&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CTO writes &lt;strong&gt;acceptance tests&lt;/strong&gt; — they don't read every PR line by line.&lt;/li&gt;
&lt;li&gt;The PM &lt;strong&gt;uses the product&lt;/strong&gt; — they don't audit the codebase.&lt;/li&gt;
&lt;li&gt;The CEO does &lt;strong&gt;fact-checks and sanity checks&lt;/strong&gt; on financial data — they don't reconcile every ledger entry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As engineers moving into an AI-assisted world, we need to adopt the same mindset.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Trust Gap Is Real
&lt;/h2&gt;

&lt;p&gt;The data backs this up. From the &lt;strong&gt;Stack Overflow 2025 Developer Survey&lt;/strong&gt; (49,000+ respondents):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;66%&lt;/strong&gt; of developers say their #1 frustration is AI solutions that are "almost right, but not quite"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;45%&lt;/strong&gt; say debugging AI-generated code takes &lt;em&gt;longer&lt;/em&gt; than writing it themselves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;46%&lt;/strong&gt; actively distrust AI output accuracy&lt;/li&gt;
&lt;li&gt;Positive sentiment toward AI tools dropped from &lt;strong&gt;70%+ in 2023–2024 to just 60% in 2025&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7r77i0p7s9x32pi0edrl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7r77i0p7s9x32pi0edrl.png" alt="AI model success rate vs task length" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI success rate drops sharply as task length increases — a pattern every developer working with vibe coding needs to understand. Source: METR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And from CodeRabbit's independent analysis: pull requests containing AI-generated code have roughly &lt;strong&gt;1.7× more issues&lt;/strong&gt; than human-written code alone.&lt;/p&gt;

&lt;p&gt;This is the core challenge of vibe coding in production. The code &lt;em&gt;looks&lt;/em&gt; fine. It often &lt;em&gt;runs&lt;/em&gt; fine on the happy path. But it hides subtle bugs, edge cases, and architectural landmines that only surface later.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding Your Abstraction Layer
&lt;/h2&gt;

&lt;p&gt;The practical challenge is this: &lt;strong&gt;what is the right abstraction layer for verifying AI-generated code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is still an open question in the industry. There's currently no standardized unit for measuring technical debt introduced by AI. But here's a working framework:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Focus on "Leaf Nodes", Not Architecture
&lt;/h3&gt;

&lt;p&gt;AI is generally good at implementing isolated, well-scoped functionality — the leaf nodes of your system. It's less reliable for core architectural decisions. Your job is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Guard the architecture yourself.&lt;/strong&gt; High-level design, data flow, system boundaries — these must still be understood by a human.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let AI handle the leaves.&lt;/strong&gt; Functions, utilities, boilerplate, CRUD operations, transformations — these are safer territory for AI generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Verifiability Over Comprehension
&lt;/h3&gt;

&lt;p&gt;You don't need to understand every line. You need to be able to &lt;strong&gt;verify the behavior&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing clear acceptance tests &lt;em&gt;before&lt;/em&gt; generating code&lt;/li&gt;
&lt;li&gt;Defining inputs and expected outputs upfront&lt;/li&gt;
&lt;li&gt;Using integration tests to validate system behavior end-to-end&lt;/li&gt;
&lt;li&gt;Designing for human-readable output so verification is fast&lt;/li&gt;
&lt;/ul&gt;
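
&lt;p&gt;Concretely, "acceptance tests before code" can be as simple as pinning the behavior down in a test file first, then asking the AI for an implementation that makes it pass. A minimal pytest sketch (&lt;code&gt;slugify&lt;/code&gt; is a hypothetical function, used purely for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# test_slugify.py: written BEFORE any implementation exists.
# The AI is then prompted to produce a slugify() that makes these pass.
import pytest

from slugify_impl import slugify  # hypothetical module the AI will generate


def test_basic_lowercasing_and_hyphens():
    assert slugify("Hello World") == "hello-world"


def test_strips_punctuation():
    assert slugify("Rust &amp; Go: a comparison!") == "rust-go-a-comparison"


@pytest.mark.parametrize("bad_input", ["", "   ", "!!!"])
def test_degenerate_inputs_return_empty(bad_input):
    # Edge cases are defined up front, not discovered after the fact
    assert slugify(bad_input) == ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You verify the behavior the tests encode; the individual generated lines matter far less.&lt;/p&gt;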

&lt;h3&gt;
  
  
  3. Stress-Test for Stability
&lt;/h3&gt;

&lt;p&gt;AI-generated code can look clean on the surface but fail under load or edge cases. Build carefully designed stress tests into your workflow, especially for anything hitting production.&lt;/p&gt;
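
&lt;p&gt;One lightweight version of this is a randomized harness that hammers the generated code with adversarial inputs and checks invariants rather than exact outputs. A sketch, reusing the hypothetical &lt;code&gt;slugify&lt;/code&gt; from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# stress_slugify.py: property-style checks over many random inputs.
import random
import string

from slugify_impl import slugify  # hypothetical AI-generated module


def random_text(max_len=200):
    alphabet = string.printable + "éüñ日本語"
    return "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))


def test_invariants_hold_under_random_input():
    for _ in range(10_000):
        out = slugify(random_text())
        # Invariants: lowercased output, no leading or trailing hyphens
        assert out == out.lower()
        assert not out.startswith("-") and not out.endswith("-")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;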

&lt;h3&gt;
  
  
  4. Keep Some Human Review in the Loop
&lt;/h3&gt;

&lt;p&gt;Even in heavily AI-assisted workflows, having human eyes on leaf nodes before they're merged is valuable — not to read every line, but to catch obvious red flags.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Data point:&lt;/strong&gt; GitHub Copilot shows a 46% code completion rate, but developers accept only about &lt;strong&gt;30%&lt;/strong&gt; of its suggestions. Human review remains the final gate — and it should be. &lt;em&gt;(Source: Second Talent 2026)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The "Be Claude's PM" Mental Model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1498050108023-c5249f4df085%3Fw%3D1000%26auto%3Dformat%26fit%3Dcrop" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1498050108023-c5249f4df085%3Fw%3D1000%26auto%3Dformat%26fit%3Dcrop" alt="Software developer reviewing AI output on screen" width="1000" height="666"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Treat your AI like a capable engineer — your job is to be the PM: define clearly, verify rigorously.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the most memorable framings from the conference was this: &lt;strong&gt;treat your AI coding assistant like a very capable engineer who needs a good PM.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be precise about &lt;em&gt;what you want&lt;/em&gt;, not &lt;em&gt;how to build it&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Define acceptance criteria clearly&lt;/li&gt;
&lt;li&gt;Review the output from a product/behavior perspective&lt;/li&gt;
&lt;li&gt;Give feedback and iterate — don't accept the first output blindly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI generates the implementation. You own the specification and the verification.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Caveat: Technical Debt Is Invisible
&lt;/h2&gt;

&lt;p&gt;Here's the honest caveat that deserves its own section:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensibility cannot be easily verified.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you vibe code a feature, you might get working code today that's a nightmare to extend in six months. AI tends to optimize for "works now" rather than "works cleanly at scale." The lack of a standardized way to measure technical debt in AI-generated code is a real, unsolved problem.&lt;/p&gt;

&lt;p&gt;From independent research: code duplication has increased &lt;strong&gt;4× with AI-assisted coding&lt;/strong&gt;, and short-term code churn is rising — suggesting more copy-paste patterns, less maintainable design.&lt;/p&gt;

&lt;p&gt;Until the tooling catches up, the practical mitigation is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep core architecture off-limits to AI autonomy&lt;/li&gt;
&lt;li&gt;Regularly schedule architectural review sessions&lt;/li&gt;
&lt;li&gt;Be transparent with your team about which parts of the codebase were AI-generated&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing Thoughts: Remember the Exponential
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjveitwl02pwi8m6tye96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjveitwl02pwi8m6tye96.png" alt="AI performance benchmarks across domains" width="800" height="565"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI performance has increased rapidly across benchmarks — translating this into real-world workflow impact is the engineering challenge of our era. Source: METR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The METR chart tells a clear story. In under a decade, AI agents are projected to independently complete a large fraction of software tasks that currently take humans days or weeks.&lt;/p&gt;

&lt;p&gt;Here are the four takeaways to keep close:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Be Claude's PM&lt;/strong&gt; — specify clearly, verify rigorously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on leaf nodes, not architecture&lt;/strong&gt; — protect the structure, delegate the implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design for verifiability&lt;/strong&gt; — if you can't verify it, you can't ship it responsibly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remember the exponential&lt;/strong&gt; — the tools are getting dramatically better; your workflows need to evolve with them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The engineers who will thrive in this era aren't the ones who resist AI or blindly trust it. They're the ones who learn to &lt;strong&gt;manage implementations they don't fully understand&lt;/strong&gt; — which, as we've established, is a problem as old as civilization.&lt;/p&gt;

&lt;p&gt;The only real disadvantage is falling behind on learning this skill altogether.&lt;/p&gt;




&lt;h2&gt;
  
  
  References &amp;amp; Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/" rel="noopener noreferrer"&gt;METR: Measuring AI Ability to Complete Long Tasks&lt;/a&gt; — the source of the 7-month doubling benchmark&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://survey.stackoverflow.co/2025/ai" rel="noopener noreferrer"&gt;Stack Overflow 2025 Developer Survey&lt;/a&gt; — 49,000+ developers on AI adoption and trust&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://shiftmag.dev/state-of-code-2025-7978/" rel="noopener noreferrer"&gt;Sonar: State of Code 2025&lt;/a&gt; — the 96% distrust statistic&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.technologyreview.com/2025/12/15/1128352/rise-of-ai-coding-developers-2026/" rel="noopener noreferrer"&gt;MIT Technology Review: AI coding is now everywhere&lt;/a&gt; — nuanced view of AI coding's real-world impact&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.secondtalent.com/resources/ai-coding-assistant-statistics/" rel="noopener noreferrer"&gt;Second Talent: AI Coding Assistant Statistics 2026&lt;/a&gt; — adoption and productivity stats&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;These notes were compiled from a developer conference session on AI-assisted engineering practices. Statistics sourced from Stack Overflow 2025 Developer Survey, METR (March 2025), Sonar State of Code 2025, and Second Talent 2026 compilation.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#ai&lt;/code&gt; &lt;code&gt;#productivity&lt;/code&gt; &lt;code&gt;#webdev&lt;/code&gt; &lt;code&gt;#programming&lt;/code&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Running LLMs Locally: A Rigorous Benchmark of Phi-3, Mistral, and Llama 3.2 on Ollama</title>
      <dc:creator>hargurjeet singh</dc:creator>
      <pubDate>Sun, 15 Mar 2026 01:08:17 +0000</pubDate>
      <link>https://forem.com/gurjeet333/running-llms-locally-a-rigorous-benchmark-of-phi-3-mistral-and-llama-32-on-ollama-2289</link>
      <guid>https://forem.com/gurjeet333/running-llms-locally-a-rigorous-benchmark-of-phi-3-mistral-and-llama-32-on-ollama-2289</guid>
      <description>&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;This report presents a comprehensive evaluation of three small language models (SLMs) – Llama 3.2 (3B), Phi-3 mini, and Mistral 7B – running locally via Ollama. A FastAPI-based benchmarking framework was developed to measure inference speed, resource consumption, and the models' ability to produce valid JSON outputs as defined by Pydantic schemas. A retry mechanism with reprompting was implemented to handle malformed responses. The models were tested on a suite of 30 prompts spanning general knowledge, mathematics, coding, reasoning, and creative writing. Results highlight trade-offs between speed, accuracy, and resource usage, providing actionable insights for deploying local AI assistants in production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Local deployment of small language models offers privacy, low latency, and cost advantages over cloud-based APIs. However, ensuring consistent, structured outputs is essential for integration into applications. This project benchmarks three popular SLMs on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference speed&lt;/strong&gt;: tokens per second, time to first token (TTFT), total response latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource usage&lt;/strong&gt;: CPU and memory utilization during inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output quality&lt;/strong&gt;: JSON schema compliance with retry-based correction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark application enforces deterministic JSON outputs using Pydantic validation and a retry mechanism that reprompts the model with stricter instructions upon failure. This mimics real-world production requirements where structured data is mandatory.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Test Environment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: Mac mini (Apple Silicon, 16 GB RAM)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OS&lt;/strong&gt;: macOS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ollama (v0.1.32)&lt;/li&gt;
&lt;li&gt;Python 3.10&lt;/li&gt;
&lt;li&gt;FastAPI + Uvicorn&lt;/li&gt;
&lt;li&gt;Pydantic, psutil, requests&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Benchmark Application
&lt;/h3&gt;

&lt;p&gt;A FastAPI server (&lt;code&gt;benchmark_app.py&lt;/code&gt;) exposes two endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /models&lt;/code&gt; – lists available Ollama models.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /benchmark/all-tests&lt;/code&gt; – runs all 30 test prompts on a specified model, returning per-test and aggregate metrics.&lt;/li&gt;
&lt;/ul&gt;
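
&lt;p&gt;Assuming the server is started on uvicorn's default port 8000 (an assumption; adjust to your setup), a full run is a single &lt;code&gt;requests&lt;/code&gt; call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Hypothetical local invocation; host/port depend on how uvicorn was launched.
resp = requests.post(
    "http://localhost:8000/benchmark/all-tests",
    params={"model": "llama3.2:latest", "max_tokens": 1024, "max_retries": 2},
)
resp.raise_for_status()
print(resp.json())  # per-test and aggregate metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;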

&lt;p&gt;For each prompt:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model is invoked with a streaming chat completion.&lt;/li&gt;
&lt;li&gt;Time to first token and total time are recorded.&lt;/li&gt;
&lt;li&gt;The response is validated against a Pydantic schema (strict JSON, no markdown allowed).&lt;/li&gt;
&lt;li&gt;If validation fails, the model is retried (up to 2 times) with a more explicit instruction to output pure JSON.&lt;/li&gt;
&lt;li&gt;System resource usage (CPU, memory) is sampled before and after each test.&lt;/li&gt;
&lt;/ol&gt;
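
&lt;p&gt;Step 5 is the least standard part of the loop, so here is roughly what the sampling looks like; a simplified sketch with &lt;code&gt;psutil&lt;/code&gt;, not the exact benchmark code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import psutil


def sample_resources():
    # cpu_percent(interval=None) reports usage since the previous call,
    # so it must be called once beforehand to prime the counter
    return psutil.cpu_percent(interval=None), psutil.virtual_memory().percent


def timed_inference(run_fn):
    sample_resources()                 # prime the CPU counter
    start = time.time()
    result = run_fn()                  # streams the chat completion
    total_s = time.time() - start
    cpu_pct, mem_pct = sample_resources()
    return result, total_s, cpu_pct, mem_pct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;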

&lt;h3&gt;
  
  
  2.3 Test Suite (&lt;code&gt;prompts.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Thirty prompts are categorized into five groups, each with a dedicated Pydantic schema:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Schema&lt;/th&gt;
&lt;th&gt;Example Prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General Knowledge&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GeneralResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"What is the capital of Japan?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MathResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Solve for x: 3x + 7 = 22"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CodeResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Write a Python function to reverse a string"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReasoningResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"All blurgs are red. ... Are all blurgs heavy?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative Storytelling&lt;/td&gt;
&lt;td&gt;&lt;code&gt;StoryResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Write a 3-sentence story about an astronaut..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each prompt includes a strict instruction to return only the JSON object, and the expected field names/types are defined in the schema.&lt;/p&gt;
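
&lt;p&gt;As an illustration (the exact wording in &lt;code&gt;prompts.py&lt;/code&gt; may differ), a math prompt with its JSON-only instruction might look like this, with field names matching the &lt;code&gt;MathResponse&lt;/code&gt; schema shown in Section 4.4:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative prompt entry; fields mirror the MathResponse schema.
MATH_PROMPT = """Solve for x: 3x + 7 = 22

Respond with ONLY a JSON object, no markdown and no extra text, shaped as:
{"question": "...", "answer": 0.0, "explanation": "...", "steps": ["..."]}"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;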

&lt;h3&gt;
  
  
  2.4 Model Comparison Study
&lt;/h3&gt;

&lt;p&gt;The script &lt;code&gt;model_comparison_study.py&lt;/code&gt; automates benchmarking across multiple models. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verifies server availability.&lt;/li&gt;
&lt;li&gt;Runs the full test suite on each specified model (Llama 3.2 3B, Phi-3 mini, Mistral 7B).&lt;/li&gt;
&lt;li&gt;Aggregates metrics and computes averages.&lt;/li&gt;
&lt;li&gt;Saves detailed results as JSON and a summary CSV.&lt;/li&gt;
&lt;li&gt;Prints a comparison table with performance awards.&lt;/li&gt;
&lt;/ul&gt;
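
&lt;p&gt;A condensed sketch of that driver loop, assuming the same local endpoint as above (the CSV column names here are illustrative, not necessarily the script's actual keys):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv
import json

import requests

BASE = "http://localhost:8000"  # assumed local server
MODELS = ["llama3.2:latest", "phi3:mini", "mistral:7b"]

requests.get(f"{BASE}/models").raise_for_status()  # verify server availability

results = {m: requests.post(f"{BASE}/benchmark/all-tests",
                            params={"model": m}).json() for m in MODELS}

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

with open("summary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "avg_tokens_per_sec", "success_rate"])  # illustrative
    for m, r in results.items():
        writer.writerow([m, r.get("avg_tokens_per_sec"), r.get("success_rate")])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;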

&lt;h2&gt;
  
  
  3. Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Performance Metrics
&lt;/h3&gt;

&lt;p&gt;The table below summarizes average performance across all 30 tests. Measurements were taken on a Mac mini (Apple Silicon, 16 GB RAM) with all models running on CPU.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tokens/sec&lt;/th&gt;
&lt;th&gt;TTFT (ms)&lt;/th&gt;
&lt;th&gt;Total Time (s)&lt;/th&gt;
&lt;th&gt;CPU %&lt;/th&gt;
&lt;th&gt;Memory %&lt;/th&gt;
&lt;th&gt;Success Rate (%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama3.2:latest&lt;/td&gt;
&lt;td&gt;22.24&lt;/td&gt;
&lt;td&gt;427.29&lt;/td&gt;
&lt;td&gt;4.68&lt;/td&gt;
&lt;td&gt;14.6&lt;/td&gt;
&lt;td&gt;88.8&lt;/td&gt;
&lt;td&gt;100.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;phi3:mini&lt;/td&gt;
&lt;td&gt;22.70&lt;/td&gt;
&lt;td&gt;323.99&lt;/td&gt;
&lt;td&gt;6.81&lt;/td&gt;
&lt;td&gt;13.0&lt;/td&gt;
&lt;td&gt;90.4&lt;/td&gt;
&lt;td&gt;46.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral:7b&lt;/td&gt;
&lt;td&gt;10.98&lt;/td&gt;
&lt;td&gt;1115.96&lt;/td&gt;
&lt;td&gt;12.47&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;94.4&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tokens/sec&lt;/strong&gt;: Measured as total tokens generated divided by total inference time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTFT (Time to First Token)&lt;/strong&gt;: Latency until the first token is produced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total Time&lt;/strong&gt;: Average response generation time per test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU/Memory %&lt;/strong&gt;: Average utilization during inference (note that memory usage includes model loading and OS overhead).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Success Rate&lt;/strong&gt;: Percentage of tests that passed JSON validation after up to two retries.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 JSON Compliance and Retry Effectiveness
&lt;/h3&gt;

&lt;p&gt;The following table details the retry counts and final compliance rates. Retries were attempted only when the initial response failed validation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;JSON Compliance (%)&lt;/th&gt;
&lt;th&gt;Total Retries Used&lt;/th&gt;
&lt;th&gt;Retries per Prompt (avg)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama3.2:latest&lt;/td&gt;
&lt;td&gt;100.0&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;1.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;phi3:mini&lt;/td&gt;
&lt;td&gt;46.7&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral:7b&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;llama3.2&lt;/strong&gt; achieved perfect compliance but required an average of 1.6 retries per prompt, indicating that while it often produced malformed JSON initially, the retry mechanism corrected it every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;phi3:mini&lt;/strong&gt; had the lowest compliance; retries were used sparingly (15 in total across 30 prompts) and failed to salvage most of its invalid outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;mistral:7b&lt;/strong&gt; never used a retry: all 27 successes were first‑try, and the three failing prompts recorded no retries either, suggesting they failed in a way the reprompting loop could not catch or correct.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Resource Utilization
&lt;/h3&gt;

&lt;p&gt;All models consumed significant memory due to being loaded simultaneously in the Ollama server. Memory usage ranged from &lt;strong&gt;88.8% (Llama 3.2)&lt;/strong&gt; to &lt;strong&gt;94.4% (Mistral 7B)&lt;/strong&gt; of available RAM, indicating that running larger models on a &lt;strong&gt;16 GB system pushes memory limits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;CPU usage remained moderate (&lt;strong&gt;13–15%&lt;/strong&gt;) as inference is primarily &lt;strong&gt;memory-bound on Apple Silicon&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fhargurjeet%2Flocal_slm_experiments%2Fmain%2Fresults%2Fcharts%2Fresource_usage.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fhargurjeet%2Flocal_slm_experiments%2Fmain%2Fresults%2Fcharts%2Fresource_usage.png" alt="Resource Usage Comparison" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Average CPU and memory usage per model. &lt;em&gt;Llama 3.2 shows the lowest memory footprint, while Mistral 7B consumes the most.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Ranking Summary
&lt;/h3&gt;

&lt;p&gt;A multi‑criteria ranking was computed, considering speed (tokens/sec), latency (TTFT), success rate, and a combined efficiency score (inverse of resource usage). Lower overall score is better.&lt;/p&gt;
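
&lt;p&gt;The overall score works out to a simple rank sum across the four criteria; a minimal reproduction from the tables above (efficiency ranks are taken directly from the table, since the underlying resource score is not republished here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rank each criterion 1..3 (1 = best), then add the ranks per model.
metrics = {  # values from Section 3.1
    "phi3:mini":       {"tokens_per_sec": 22.70, "ttft_ms": 323.99,  "success": 46.7},
    "llama3.2:latest": {"tokens_per_sec": 22.24, "ttft_ms": 427.29,  "success": 100.0},
    "mistral:7b":      {"tokens_per_sec": 10.98, "ttft_ms": 1115.96, "success": 90.0},
}

def ranks(key, reverse):
    ordered = sorted(metrics, key=lambda m: metrics[m][key], reverse=reverse)
    return {m: i + 1 for i, m in enumerate(ordered)}

speed   = ranks("tokens_per_sec", reverse=True)   # higher is better
latency = ranks("ttft_ms",        reverse=False)  # lower is better
success = ranks("success",        reverse=True)
efficiency = {"phi3:mini": 1, "llama3.2:latest": 2, "mistral:7b": 3}  # from the table

for m in metrics:
    print(m, speed[m] + latency[m] + success[m] + efficiency[m])
# phi3:mini 6, llama3.2:latest 7, mistral:7b 11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;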

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Rank Speed&lt;/th&gt;
&lt;th&gt;Rank Latency&lt;/th&gt;
&lt;th&gt;Rank Success&lt;/th&gt;
&lt;th&gt;Rank Efficiency&lt;/th&gt;
&lt;th&gt;Overall Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;phi3:mini&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama3.2:latest&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral:7b&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;phi3:mini&lt;/strong&gt; ranks best in speed, latency, and efficiency, but worst in success rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama3.2:latest&lt;/strong&gt; ranks second in speed and latency, but first in success rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mistral:7b&lt;/strong&gt; consistently ranks third in all categories except success rate, where it places second.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.5 Radar Chart Overview
&lt;/h3&gt;

&lt;p&gt;A radar chart was generated to visualize the trade‑offs across four normalized metrics: Speed, Latency (inverse), Efficiency (inverse), and JSON Compliance. Each model's polygon reveals its strengths and weaknesses at a glance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sjt8mnu75r6slxi4ttx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sjt8mnu75r6slxi4ttx.png" alt="Figure 2: Radar chart comparing models on speed, latency, efficiency, and compliance. Larger area indicates better overall balance." width="800" height="659"&gt;&lt;/a&gt;&lt;/p&gt;
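
&lt;p&gt;For readers who want to reproduce the figure, a chart like this can be drawn with matplotlib's polar axes. A sketch with illustrative, not measured, normalized scores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt

labels = ["Speed", "Latency (inv)", "Efficiency (inv)", "JSON Compliance"]
scores = {  # normalized 0-1 values; illustrative placeholders only
    "llama3.2:latest": [0.98, 0.76, 0.90, 1.00],
    "phi3:mini":       [1.00, 1.00, 1.00, 0.47],
    "mistral:7b":      [0.48, 0.29, 0.80, 0.90],
}

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close each polygon

ax = plt.subplot(polar=True)
for name, vals in scores.items():
    closed = vals + vals[:1]
    ax.plot(angles, closed, label=name)
    ax.fill(angles, closed, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.legend(loc="lower right")
plt.savefig("radar_chart.png", bbox_inches="tight")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;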

&lt;h2&gt;
  
  
  4. Code Implementation
&lt;/h2&gt;

&lt;p&gt;The core of the benchmarking system consists of three main components: the FastAPI server, the validation logic, and the retry mechanism. Below are the key code snippets.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 FastAPI Server Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simple Ollama Benchmark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BenchmarkRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
    &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/benchmark/all-tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_all_tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Implementation details in full codebase
&lt;/span&gt;    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 JSON Validation Function
&lt;/h3&gt;

&lt;p&gt;The validation function strictly checks for pure JSON—no markdown, no extra text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_json_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Strictly validate response - must be pure JSON, no extraction&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Try to parse as JSON - if this fails, it's not valid JSON
&lt;/span&gt;        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Validate against schema
&lt;/span&gt;        &lt;span class="n"&gt;validated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;schema_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid JSON (must be pure JSON, no markdown or extra text): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Schema validation failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unexpected error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.3 Retry Mechanism with Reprompting
&lt;/h3&gt;

&lt;p&gt;The retry logic gives the model a second chance with a stricter prompt when validation fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_model_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                         &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run model with retry mechanism for strict JSON validation&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Prepare messages based on retry count
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# More strict reprompt on failure
&lt;/span&gt;            &lt;span class="n"&gt;retry_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Your previous response was not valid JSON.
            Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last_error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

            You MUST respond with ONLY a valid JSON object. No markdown, 
            no backticks, no additional text, no explanations.
            Just the raw JSON object.

            Original instruction:
            &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retry_prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

        &lt;span class="c1"&gt;# Stream to measure first token
&lt;/span&gt;        &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Lower temperature for more consistent JSON
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Try to prevent markdown
&lt;/span&gt;            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Collect response and measure timing
&lt;/span&gt;        &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="n"&gt;first_token_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_token_time&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;first_token_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Check for markdown indicators (immediate fail)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;```

&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;

```json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response contains markdown code blocks. Must be pure JSON only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Validate JSON
&lt;/span&gt;        &lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_json_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;schema_model&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;success_result&lt;/span&gt;

        &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt;

    &lt;span class="c1"&gt;# All retries failed
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;failure_result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Defining Pydantic Schemas
&lt;/h3&gt;

&lt;p&gt;Example schemas from &lt;code&gt;prompts.py&lt;/code&gt; that enforce the expected output structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MathResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;explanation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CodeResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;explanation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;time_complexity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;space_complexity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoryResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;plot_summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;moral&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
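
&lt;p&gt;To see what this buys you, here is a quick usage sketch (assuming Pydantic v2): a response missing a required field is rejected with a precise error instead of slipping downstream.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pydantic import ValidationError

good = '{"question": "2+2?", "answer": 4.0, "explanation": "Basic addition.", "steps": ["Add 2 and 2"]}'
bad = '{"question": "2+2?", "explanation": "No answer field."}'

print(MathResponse.model_validate_json(good).answer)  # 4.0

try:
    MathResponse.model_validate_json(bad)
except ValidationError as e:
    print(e)  # reports 'answer' as a missing required field
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;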



&lt;h3&gt;
  
  
  4.5 Running the Benchmark
&lt;/h3&gt;

&lt;p&gt;The comparison script orchestrates testing across multiple models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From model_comparison_study.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;benchmark_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/benchmark/all-tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_tokens_per_second&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;averages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_tokens_per_second&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_time_to_first_token_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;averages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_time_to_first_token_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;successful_tests&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; 
                           &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tests_run&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Additional metrics...
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
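
&lt;p&gt;Tying it together, a small driver can sweep all three models and print a comparison. This is an illustrative sketch: the &lt;code&gt;ModelComparisonStudy&lt;/code&gt; class name, base URL, and Ollama model tags are assumptions, not verbatim from the repo.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical driver around benchmark_model() above.
models = ["llama3.2:3b", "phi3:mini", "mistral:7b"]  # assumed Ollama tags

runner = ModelComparisonStudy(base_url="http://localhost:8000")
results = [runner.benchmark_model(m) for m in models]
results = [r for r in results if r]  # drop failed runs (non-200 responses)

# Rank by throughput and print one line per model.
for r in sorted(results, key=lambda r: r["avg_tokens_per_second"], reverse=True):
    print(f"{r['model']:&lt;15} "
          f"{r['avg_tokens_per_second']:&gt;6.2f} tok/s  "
          f"{r['avg_time_to_first_token_ms']:&gt;8.1f} ms TTFT  "
          f"{r['success_rate']:&gt;5.1f}% valid JSON")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;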



&lt;p&gt;The complete implementation is available in the &lt;a href="https://github.com/hargurjeet/local_slm_experiments/tree/main" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Discussion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Speed vs. Accuracy Trade‑off
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Llama 3.2 3B&lt;/strong&gt; strikes an excellent balance: high speed (22.24 tokens/sec) and perfect compliance after retries, though it needed many retries to get there. With the retry mechanism in place, it is a robust choice for most applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phi-3 mini&lt;/strong&gt; offers the best raw speed and lowest latency, but its poor compliance (46.7%) makes it unreliable for structured output tasks without additional fallback logic. Its low CPU usage and quick time to first token are attractive for interactive applications where occasional failures can be tolerated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistral 7B&lt;/strong&gt; delivers high first‑try compliance (90%), and its successful responses needed no retries at all, but it runs at roughly half the speed with a noticeable delay to first token. It is best suited for offline batch processing or applications where correctness outweighs latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Resource Constraints
&lt;/h3&gt;

&lt;p&gt;Memory usage is a key constraint on edge devices. On a 16 GB Mac mini, all three models consumed over 88% of RAM, leaving little headroom for other processes. For deployment on memory‑limited hardware, &lt;strong&gt;Llama 3.2&lt;/strong&gt; is the most memory‑efficient of the three while still maintaining perfect compliance. &lt;strong&gt;Phi-3's&lt;/strong&gt; higher memory footprint (90.4%) combined with its low success rate makes it less attractive unless its speed advantage is essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.3 Retry Mechanism Value
&lt;/h3&gt;

&lt;p&gt;The retry mechanism proved essential for &lt;strong&gt;Llama 3.2&lt;/strong&gt;, converting many initially invalid responses into valid ones. For &lt;strong&gt;Mistral&lt;/strong&gt;, it was unnecessary. For &lt;strong&gt;Phi-3&lt;/strong&gt;, it was largely ineffective, suggesting that the model struggles to follow the "pure JSON" instruction even when prompted more strictly. This highlights the importance of model selection for tasks requiring strict format adherence.&lt;/p&gt;
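
&lt;p&gt;The visible part of the retry loop only records &lt;code&gt;last_error&lt;/code&gt;; its value comes from feeding that error back into the next attempt. A minimal sketch of that feedback step using the official &lt;code&gt;ollama&lt;/code&gt; Python client (the correction wording and the &lt;code&gt;prompt&lt;/code&gt; variable are placeholders; the exact phrasing lives in the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ollama

# At the top of each retry attempt: remind the model what went wrong last time.
messages = [{"role": "user", "content": prompt}]
if last_error:
    messages.append({
        "role": "user",
        "content": f"Your previous response was rejected: {last_error} "
                   "Respond again with pure JSON only, no markdown.",
    })

response = ollama.chat(model=model_name, messages=messages, stream=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;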

&lt;h3&gt;
  
  
  5.4 Structured Output Enforcement
&lt;/h3&gt;

&lt;p&gt;Pydantic validation with strict JSON‑only requirements effectively ensures that downstream systems receive predictable data. The retry mechanism adds robustness, but as seen with &lt;strong&gt;Phi-3&lt;/strong&gt;, it cannot compensate for a model's fundamental inability to follow format instructions. In production, combining validation with a fallback parser (e.g., extracting JSON from markdown) could salvage some otherwise failed responses, though this compromises the purity of the structured output guarantee.&lt;/p&gt;
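
&lt;p&gt;As a concrete illustration of that fallback idea (a sketch, not part of the benchmark), a salvage step can strip a markdown fence before re-running validation. Anything recovered this way should still be counted as a format failure so the compliance numbers stay honest.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Matches ```json ... ``` or plain ``` ... ``` fences.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)

def extract_json_candidate(response_text: str) -&gt; str:
    """Return the fenced body if the model wrapped its JSON in markdown."""
    match = FENCE_RE.search(response_text)
    return match.group(1) if match else response_text.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running the salvaged text through &lt;code&gt;validate_json_response&lt;/code&gt; again keeps the schema guarantee intact while still flagging the response as non-compliant at the formatting level.&lt;/p&gt;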

&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;This benchmark demonstrates that local SLMs can deliver both reasonable performance and structured outputs, but with significant variance across models. Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Llama 3.2 3B&lt;/strong&gt; is the overall winner when paired with retries: 22.24 tokens/sec, 100% final compliance, and moderate memory usage. It is the recommended choice for applications requiring reliable structured output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mistral 7B&lt;/strong&gt; provides high first‑try compliance (90%) but at lower speed and higher memory cost; it suits accuracy‑critical tasks where latency is not the primary concern.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Phi-3 mini&lt;/strong&gt; excels in speed and low latency but suffers from poor format adherence, limiting its direct use unless supplemented by robust post‑processing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmarking framework built for this study is reusable for testing new models or prompts. Future work could explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU acceleration to reduce memory pressure and improve speed.&lt;/li&gt;
&lt;li&gt;Prompt engineering techniques (e.g., few‑shot examples, system prompts) to boost compliance for models like Phi-3; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;Integration with function‑calling APIs to enforce schemas more naturally.&lt;/li&gt;
&lt;/ul&gt;
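
&lt;p&gt;For the prompt‑engineering item, one untested starting point is a system prompt that pairs the JSON-only rule with a single worked example (the fields here echo &lt;code&gt;MathResponse&lt;/code&gt; above; &lt;code&gt;user_prompt&lt;/code&gt; is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;FEW_SHOT_SYSTEM_PROMPT = """You are a JSON-only assistant.
Rules:
- Respond with a single JSON object and nothing else.
- Never wrap the JSON in markdown code fences.

Example:
User: What is 2 + 2?
Assistant: {"question": "What is 2 + 2?", "answer": 4.0, "explanation": "Basic addition.", "steps": ["Add 2 and 2"]}"""

messages = [
    {"role": "system", "content": FEW_SHOT_SYSTEM_PROMPT},
    {"role": "user", "content": user_prompt},
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;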




&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Numerical results were obtained on a Mac mini (16 GB RAM) running Ollama with CPU inference. Actual performance may vary with hardware and Ollama version.&lt;/p&gt;

&lt;p&gt;All code is available in the &lt;a href="https://github.com/hargurjeet/local_slm_experiments/tree/main" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. Give it a ⭐ if you find it useful!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
