<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: hargurjeet singh</title>
    <description>The latest articles on Forem by hargurjeet singh (@gurjeet333).</description>
    <link>https://forem.com/gurjeet333</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808127%2F8ecc00cb-74b6-4a9b-89fa-107b8c17984e.png</url>
      <title>Forem: hargurjeet singh</title>
      <link>https://forem.com/gurjeet333</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gurjeet333"/>
    <language>en</language>
    <item>
      <title>Vibe Coding in Production: How to Ship AI-Generated Code Responsibly</title>
      <dc:creator>hargurjeet singh</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:01:33 +0000</pubDate>
      <link>https://forem.com/gurjeet333/vibe-coding-in-production-how-to-ship-ai-generated-code-responsibly-174m</link>
      <guid>https://forem.com/gurjeet333/vibe-coding-in-production-how-to-ship-ai-generated-code-responsibly-174m</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Notes from a recent developer conference from AWS and Anthropic — practical wisdom for engineers navigating the AI-assisted coding era.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1555066931-4365d14bab8c%3Fw%3D1000%26auto%3Dformat%26fit%3Dcrop" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1555066931-4365d14bab8c%3Fw%3D1000%26auto%3Dformat%26fit%3Dcrop" alt="Developer working with AI-generated code" width="1000" height="667"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The era of AI-assisted coding is here — but shipping it responsibly requires a new mindset.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Elephant in the Room
&lt;/h2&gt;

&lt;p&gt;Let's not sugarcoat it — vibe coding is controversial.&lt;/p&gt;

&lt;p&gt;A lot of developers hear "vibe coding" and immediately picture someone blindly prompting an AI, copy-pasting whatever comes out, and calling it a day. And honestly? That fear isn't entirely unfounded.&lt;/p&gt;

&lt;p&gt;But here's the thing: &lt;strong&gt;AI is going to generate a massive amount of code in the near future.&lt;/strong&gt; We're talking about AI systems that can already handle tasks taking a human an hour — and that capability is doubling roughly every 7 months, according to METR's 2025 benchmark study.&lt;/p&gt;

&lt;p&gt;The question isn't whether you'll encounter AI-generated code in production — it's whether you'll know how to work with it responsibly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📊 &lt;strong&gt;By the numbers:&lt;/strong&gt; 42% of all code committed today is AI-assisted (expected to rise to 65% by 2027). 84% of developers are already using or planning to use AI tools in their workflow. Yet 96% say they don't fully trust the output.&lt;br&gt;
&lt;em&gt;(Sources: Sonar State of Code 2025, Stack Overflow Developer Survey 2025)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzpgmgosew366f54i9mx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzpgmgosew366f54i9mx.png" alt="Stack Overflow 2025: breakdown of how frequently developers use AI tools — 47% daily, 18% weekly, 14% monthly, 5% plan to, 16% don't plan to" width="800" height="370"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;84% of developers use or plan to use AI tools — with 47% already using them daily. Source: &lt;a href="https://survey.stackoverflow.co/2025/ai" rel="noopener noreferrer"&gt;Stack Overflow Developer Survey 2025&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgu99yxo0wwd2jzeakfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgu99yxo0wwd2jzeakfq.png" alt="Sonar State of Code 2025: where developers use AI — 88% for prototypes, 83% for internal production systems, 73% for customer-facing apps, 58% for business-critical services" width="800" height="372"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI is no longer just for experiments — 58% of developers use it in business-critical services. Source: &lt;a href="https://shiftmag.dev/state-of-code-2025-7978/" rel="noopener noreferrer"&gt;Sonar State of Code 2025&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozp3tlen9k42zoz0p2n0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozp3tlen9k42zoz0p2n0.png" alt="Sonar State of Code 2025: 96% of developers doubt the reliability of AI-generated code, citing subtle errors and hidden flaws" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Exponential You Can't Ignore
&lt;/h2&gt;

&lt;p&gt;Researchers at METR tracked how long a task an AI agent can complete at 50% reliability. The finding: this "time horizon" has been growing exponentially for six straight years — doubling approximately every 7 months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma1i9evl4bl38q3ff8gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma1i9evl4bl38q3ff8gb.png" alt="The length of tasks AIs can complete is doubling every 7 months" width="800" height="478"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI task-completion time horizon, doubling every ~7 months since 2019. Source: &lt;a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/" rel="noopener noreferrer"&gt;METR, March 2025&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the time horizon currently sitting at around two hours, extrapolations suggest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early 2027:&lt;/strong&gt; ~16 hours of work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early 2028:&lt;/strong&gt; ~5 days of work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Within a decade:&lt;/strong&gt; Multi-week software projects, handled autonomously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't science fiction. It's a trend that has remained consistent since 2019, and there's no evidence of it plateauing. In fact, in 2024–2025 the doubling rate &lt;em&gt;accelerated&lt;/em&gt; to roughly every 4 months.&lt;/p&gt;

&lt;p&gt;As a software engineer, this is the single most important number you should internalize. Your workflows need to evolve ahead of this curve — not behind it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Vibe Coding Actually Works Today
&lt;/h2&gt;

&lt;p&gt;The most successful use cases right now tend to be in low-stakes, high-experimentation environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proof-of-concept projects&lt;/strong&gt; (POCs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Game development and creative side projects&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Controlled, sandboxed environments&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internal tooling with limited blast radius&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These contexts share a common trait: the cost of failure is low and the feedback loop is fast. You can let the AI run, see what it produces, verify the outcome, and iterate. That's where vibe coding shines today.&lt;/p&gt;

&lt;p&gt;It's no coincidence that younger developers are the fastest adopters. Stack Overflow's 2025 survey found developers aged 18–24 are &lt;strong&gt;twice as likely&lt;/strong&gt; to use AI daily compared to developers over 45.&lt;/p&gt;

&lt;p&gt;But production systems are a different beast. Higher stakes demand a higher level of responsibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight: Trust the System, Not Every Line
&lt;/h2&gt;

&lt;p&gt;Here's a mental model that clicked at the conference:&lt;/p&gt;

&lt;p&gt;Think back to when compilers were first introduced. Early programmers were skeptical. They wanted to read and verify the assembly output by hand. But as complexity scaled, that became impossible. At some point, you &lt;em&gt;had&lt;/em&gt; to trust the compiler. You shifted your verification to the &lt;strong&gt;output behavior&lt;/strong&gt;, not the internal mechanism.&lt;/p&gt;

&lt;p&gt;We're at a similar inflection point with AI-generated code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"We have to start learning that the code does not exist — but the product does."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the mindset shift. You're not the author of every line anymore. You're the &lt;strong&gt;owner of the outcome&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  This Problem Is Older Than Software
&lt;/h2&gt;

&lt;p&gt;Managing things you don't fully understand is not a new problem. It's as old as civilization itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb1g6b5xizeo4scnfswi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb1g6b5xizeo4scnfswi.png" alt="AI models succeeding at increasingly longer tasks over time" width="800" height="422"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Models are succeeding at increasingly long tasks — the gap between AI and human task lengths is closing fast. Source: METR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Consider:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;What they manage&lt;/th&gt;
&lt;th&gt;What they &lt;em&gt;don't&lt;/em&gt; fully know&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CTO&lt;/td&gt;
&lt;td&gt;Engineering teams and systems&lt;/td&gt;
&lt;td&gt;Deep domain expertise in every stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product Manager&lt;/td&gt;
&lt;td&gt;Product features and roadmap&lt;/td&gt;
&lt;td&gt;Full implementation details&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CEO&lt;/td&gt;
&lt;td&gt;Company finances and strategy&lt;/td&gt;
&lt;td&gt;The intricacies of accounting&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And yet, these people ship products, close quarters, and lead organizations successfully every day. How?&lt;/p&gt;

&lt;p&gt;They don't verify &lt;em&gt;everything&lt;/em&gt;. They verify &lt;strong&gt;the right abstraction&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CTO writes &lt;strong&gt;acceptance tests&lt;/strong&gt; — they don't read every PR line by line.&lt;/li&gt;
&lt;li&gt;The PM &lt;strong&gt;uses the product&lt;/strong&gt; — they don't audit the codebase.&lt;/li&gt;
&lt;li&gt;The CEO does &lt;strong&gt;fact-checks and sanity checks&lt;/strong&gt; on financial data — they don't reconcile every ledger entry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As engineers moving into an AI-assisted world, we need to adopt the same mindset.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Trust Gap Is Real
&lt;/h2&gt;

&lt;p&gt;The data backs this up. From the &lt;strong&gt;Stack Overflow 2025 Developer Survey&lt;/strong&gt; (49,000+ respondents):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;66%&lt;/strong&gt; of developers say their #1 frustration is AI solutions that are "almost right, but not quite"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;45%&lt;/strong&gt; say debugging AI-generated code takes &lt;em&gt;longer&lt;/em&gt; than writing it themselves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;46%&lt;/strong&gt; actively distrust AI output accuracy&lt;/li&gt;
&lt;li&gt;Positive sentiment toward AI tools dropped from &lt;strong&gt;70%+ in 2023–2024 to just 60% in 2025&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7r77i0p7s9x32pi0edrl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7r77i0p7s9x32pi0edrl.png" alt="AI model success rate vs task length" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI success rate drops sharply as task length increases — a pattern every developer working with vibe coding needs to understand. Source: METR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And from CodeRabbit's independent analysis: pull requests containing AI-generated code have roughly &lt;strong&gt;1.7× more issues&lt;/strong&gt; than human-written code alone.&lt;/p&gt;

&lt;p&gt;This is the core challenge of vibe coding in production. The code &lt;em&gt;looks&lt;/em&gt; fine. It often &lt;em&gt;runs&lt;/em&gt; fine on the happy path. But it hides subtle bugs, edge cases, and architectural landmines that only surface later.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding Your Abstraction Layer
&lt;/h2&gt;

&lt;p&gt;The practical challenge is this: &lt;strong&gt;what is the right abstraction layer for verifying AI-generated code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is still an open question in the industry. There's currently no standardized unit for measuring technical debt introduced by AI. But here's a working framework:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Focus on "Leaf Nodes", Not Architecture
&lt;/h3&gt;

&lt;p&gt;AI is generally good at implementing isolated, well-scoped functionality — the leaf nodes of your system. It's less reliable for core architectural decisions. Your job is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Guard the architecture yourself.&lt;/strong&gt; High-level design, data flow, system boundaries — these must still be understood by a human.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let AI handle the leaves.&lt;/strong&gt; Functions, utilities, boilerplate, CRUD operations, transformations — these are safer territory for AI generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Verifiability Over Comprehension
&lt;/h3&gt;

&lt;p&gt;You don't need to understand every line. You need to be able to &lt;strong&gt;verify the behavior&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing clear acceptance tests &lt;em&gt;before&lt;/em&gt; generating code&lt;/li&gt;
&lt;li&gt;Defining inputs and expected outputs upfront&lt;/li&gt;
&lt;li&gt;Using integration tests to validate system behavior end-to-end&lt;/li&gt;
&lt;li&gt;Designing for human-readable output so verification is fast&lt;/li&gt;
&lt;/ul&gt;
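
&lt;p&gt;Concretely, "acceptance tests before code" can be as simple as pinning the behavior down in a test file first, then asking the AI for an implementation that makes it pass. A minimal pytest sketch (&lt;code&gt;slugify&lt;/code&gt; is a hypothetical function, used purely for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# test_slugify.py: written BEFORE any implementation exists.
# The AI is then prompted to produce a slugify() that makes these pass.
import pytest

from slugify_impl import slugify  # hypothetical module the AI will generate


def test_basic_lowercasing_and_hyphens():
    assert slugify("Hello World") == "hello-world"


def test_strips_punctuation():
    assert slugify("Rust &amp; Go: a comparison!") == "rust-go-a-comparison"


@pytest.mark.parametrize("bad_input", ["", "   ", "!!!"])
def test_degenerate_inputs_return_empty(bad_input):
    # Edge cases are defined up front, not discovered after the fact
    assert slugify(bad_input) == ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You verify the behavior the tests encode; the individual generated lines matter far less.&lt;/p&gt;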

&lt;h3&gt;
  
  
  3. Stress-Test for Stability
&lt;/h3&gt;

&lt;p&gt;AI-generated code can look clean on the surface but fail under load or edge cases. Build carefully designed stress tests into your workflow, especially for anything hitting production.&lt;/p&gt;
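
&lt;p&gt;One lightweight version of this is a randomized harness that hammers the generated code with adversarial inputs and checks invariants rather than exact outputs. A sketch, reusing the hypothetical &lt;code&gt;slugify&lt;/code&gt; from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# stress_slugify.py: property-style checks over many random inputs.
import random
import string

from slugify_impl import slugify  # hypothetical AI-generated module


def random_text(max_len=200):
    alphabet = string.printable + "éüñ日本語"
    return "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))


def test_invariants_hold_under_random_input():
    for _ in range(10_000):
        out = slugify(random_text())
        # Invariants: lowercased output, no leading or trailing hyphens
        assert out == out.lower()
        assert not out.startswith("-") and not out.endswith("-")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;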

&lt;h3&gt;
  
  
  4. Keep Some Human Review in the Loop
&lt;/h3&gt;

&lt;p&gt;Even in heavily AI-assisted workflows, having human eyes on leaf nodes before they're merged is valuable — not to read every line, but to catch obvious red flags.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Data point:&lt;/strong&gt; GitHub Copilot shows a 46% code completion rate, but developers accept only about &lt;strong&gt;30%&lt;/strong&gt; of its suggestions. Human review remains the final gate — and it should be. &lt;em&gt;(Source: Second Talent 2026)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The "Be Claude's PM" Mental Model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1498050108023-c5249f4df085%3Fw%3D1000%26auto%3Dformat%26fit%3Dcrop" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1498050108023-c5249f4df085%3Fw%3D1000%26auto%3Dformat%26fit%3Dcrop" alt="Software developer reviewing AI output on screen" width="1000" height="666"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Treat your AI like a capable engineer — your job is to be the PM: define clearly, verify rigorously.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the most memorable framings from the conference was this: &lt;strong&gt;treat your AI coding assistant like a very capable engineer who needs a good PM.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be precise about &lt;em&gt;what you want&lt;/em&gt;, not &lt;em&gt;how to build it&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Define acceptance criteria clearly&lt;/li&gt;
&lt;li&gt;Review the output from a product/behavior perspective&lt;/li&gt;
&lt;li&gt;Give feedback and iterate — don't accept the first output blindly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI generates the implementation. You own the specification and the verification.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Caveat: Technical Debt Is Invisible
&lt;/h2&gt;

&lt;p&gt;Here's the honest caveat that deserves its own section:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensibility cannot be easily verified.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you vibe code a feature, you might get working code today that's a nightmare to extend in six months. AI tends to optimize for "works now" rather than "works cleanly at scale." The lack of a standardized way to measure technical debt in AI-generated code is a real, unsolved problem.&lt;/p&gt;

&lt;p&gt;From independent research: code duplication has increased &lt;strong&gt;4× with AI-assisted coding&lt;/strong&gt;, and short-term code churn is rising — suggesting more copy-paste patterns, less maintainable design.&lt;/p&gt;

&lt;p&gt;Until the tooling catches up, the practical mitigation is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep core architecture off-limits to AI autonomy&lt;/li&gt;
&lt;li&gt;Regularly schedule architectural review sessions&lt;/li&gt;
&lt;li&gt;Be transparent with your team about which parts of the codebase were AI-generated&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing Thoughts: Remember the Exponential
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjveitwl02pwi8m6tye96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjveitwl02pwi8m6tye96.png" alt="AI performance benchmarks across domains" width="800" height="565"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI performance has increased rapidly across benchmarks — translating this into real-world workflow impact is the engineering challenge of our era. Source: METR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The METR chart tells a clear story. In under a decade, AI agents are projected to independently complete a large fraction of software tasks that currently take humans days or weeks.&lt;/p&gt;

&lt;p&gt;Here are the four takeaways to keep close:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Be Claude's PM&lt;/strong&gt; — specify clearly, verify rigorously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on leaf nodes, not architecture&lt;/strong&gt; — protect the structure, delegate the implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design for verifiability&lt;/strong&gt; — if you can't verify it, you can't ship it responsibly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remember the exponential&lt;/strong&gt; — the tools are getting dramatically better; your workflows need to evolve with them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The engineers who will thrive in this era aren't the ones who resist AI or blindly trust it. They're the ones who learn to &lt;strong&gt;manage implementations they don't fully understand&lt;/strong&gt; — which, as we've established, is a problem as old as civilization.&lt;/p&gt;

&lt;p&gt;The only real disadvantage is falling behind on learning this skill altogether.&lt;/p&gt;




&lt;h2&gt;
  
  
  References &amp;amp; Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/" rel="noopener noreferrer"&gt;METR: Measuring AI Ability to Complete Long Tasks&lt;/a&gt; — the source of the 7-month doubling benchmark&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://survey.stackoverflow.co/2025/ai" rel="noopener noreferrer"&gt;Stack Overflow 2025 Developer Survey&lt;/a&gt; — 49,000+ developers on AI adoption and trust&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://shiftmag.dev/state-of-code-2025-7978/" rel="noopener noreferrer"&gt;Sonar: State of Code 2025&lt;/a&gt; — the 96% distrust statistic&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.technologyreview.com/2025/12/15/1128352/rise-of-ai-coding-developers-2026/" rel="noopener noreferrer"&gt;MIT Technology Review: AI coding is now everywhere&lt;/a&gt; — nuanced view of AI coding's real-world impact&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.secondtalent.com/resources/ai-coding-assistant-statistics/" rel="noopener noreferrer"&gt;Second Talent: AI Coding Assistant Statistics 2026&lt;/a&gt; — adoption and productivity stats&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;These notes were compiled from a developer conference session on AI-assisted engineering practices. Statistics sourced from Stack Overflow 2025 Developer Survey, METR (March 2025), Sonar State of Code 2025, and Second Talent 2026 compilation.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#ai&lt;/code&gt; &lt;code&gt;#productivity&lt;/code&gt; &lt;code&gt;#webdev&lt;/code&gt; &lt;code&gt;#programming&lt;/code&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Running LLMs Locally: A Rigorous Benchmark of Phi-3, Mistral, and Llama 3.2 on Ollama</title>
      <dc:creator>hargurjeet singh</dc:creator>
      <pubDate>Sun, 15 Mar 2026 01:08:17 +0000</pubDate>
      <link>https://forem.com/gurjeet333/running-llms-locally-a-rigorous-benchmark-of-phi-3-mistral-and-llama-32-on-ollama-2289</link>
      <guid>https://forem.com/gurjeet333/running-llms-locally-a-rigorous-benchmark-of-phi-3-mistral-and-llama-32-on-ollama-2289</guid>
      <description>&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;This report presents a comprehensive evaluation of three small language models (SLMs) – Llama 3.2 (3B), Phi-3 mini, and Mistral 7B – running locally via Ollama. A FastAPI-based benchmarking framework was developed to measure inference speed, resource consumption, and the models' ability to produce valid JSON outputs as defined by Pydantic schemas. A retry mechanism with reprompting was implemented to handle malformed responses. The models were tested on a suite of 30 prompts spanning general knowledge, mathematics, coding, reasoning, and creative writing. Results highlight trade-offs between speed, accuracy, and resource usage, providing actionable insights for deploying local AI assistants in production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Local deployment of small language models offers privacy, low latency, and cost advantages over cloud-based APIs. However, ensuring consistent, structured outputs is essential for integration into applications. This project benchmarks three popular SLMs on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference speed&lt;/strong&gt;: tokens per second, time to first token (TTFT), total response latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource usage&lt;/strong&gt;: CPU and memory utilization during inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output quality&lt;/strong&gt;: JSON schema compliance with retry-based correction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark application enforces deterministic JSON outputs using Pydantic validation and a retry mechanism that reprompts the model with stricter instructions upon failure. This mimics real-world production requirements where structured data is mandatory.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Test Environment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: Mac mini (Apple Silicon, 16 GB RAM)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OS&lt;/strong&gt;: macOS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ollama (v0.1.32)&lt;/li&gt;
&lt;li&gt;Python 3.10&lt;/li&gt;
&lt;li&gt;FastAPI + Uvicorn&lt;/li&gt;
&lt;li&gt;Pydantic, psutil, requests&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Benchmark Application
&lt;/h3&gt;

&lt;p&gt;A FastAPI server (&lt;code&gt;benchmark_app.py&lt;/code&gt;) exposes two endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /models&lt;/code&gt; – lists available Ollama models.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /benchmark/all-tests&lt;/code&gt; – runs all 30 test prompts on a specified model, returning per-test and aggregate metrics.&lt;/li&gt;
&lt;/ul&gt;
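
&lt;p&gt;Assuming the server is started on uvicorn's default port 8000 (an assumption; adjust to your setup), a full run is a single &lt;code&gt;requests&lt;/code&gt; call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Hypothetical local invocation; host/port depend on how uvicorn was launched.
resp = requests.post(
    "http://localhost:8000/benchmark/all-tests",
    params={"model": "llama3.2:latest", "max_tokens": 1024, "max_retries": 2},
)
resp.raise_for_status()
print(resp.json())  # per-test and aggregate metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;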

&lt;p&gt;For each prompt:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model is invoked with a streaming chat completion.&lt;/li&gt;
&lt;li&gt;Time to first token and total time are recorded.&lt;/li&gt;
&lt;li&gt;The response is validated against a Pydantic schema (strict JSON, no markdown allowed).&lt;/li&gt;
&lt;li&gt;If validation fails, the model is retried (up to 2 times) with a more explicit instruction to output pure JSON.&lt;/li&gt;
&lt;li&gt;System resource usage (CPU, memory) is sampled before and after each test.&lt;/li&gt;
&lt;/ol&gt;
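
&lt;p&gt;Step 5 is the least standard part of the loop, so here is roughly what the sampling looks like; a simplified sketch with &lt;code&gt;psutil&lt;/code&gt;, not the exact benchmark code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import psutil


def sample_resources():
    # cpu_percent(interval=None) reports usage since the previous call,
    # so it must be called once beforehand to prime the counter
    return psutil.cpu_percent(interval=None), psutil.virtual_memory().percent


def timed_inference(run_fn):
    sample_resources()                 # prime the CPU counter
    start = time.time()
    result = run_fn()                  # streams the chat completion
    total_s = time.time() - start
    cpu_pct, mem_pct = sample_resources()
    return result, total_s, cpu_pct, mem_pct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;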

&lt;h3&gt;
  
  
  2.3 Test Suite (&lt;code&gt;prompts.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Thirty prompts are categorized into five groups, each with a dedicated Pydantic schema:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Schema&lt;/th&gt;
&lt;th&gt;Example Prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General Knowledge&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GeneralResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"What is the capital of Japan?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MathResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Solve for x: 3x + 7 = 22"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CodeResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Write a Python function to reverse a string"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ReasoningResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"All blurgs are red. ... Are all blurgs heavy?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative Storytelling&lt;/td&gt;
&lt;td&gt;&lt;code&gt;StoryResponse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Write a 3-sentence story about an astronaut..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each prompt includes a strict instruction to return only the JSON object, and the expected field names/types are defined in the schema.&lt;/p&gt;
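
&lt;p&gt;As an illustration (the exact wording in &lt;code&gt;prompts.py&lt;/code&gt; may differ), a math prompt with its JSON-only instruction might look like this, with field names matching the &lt;code&gt;MathResponse&lt;/code&gt; schema shown in Section 4.4:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative prompt entry; fields mirror the MathResponse schema.
MATH_PROMPT = """Solve for x: 3x + 7 = 22

Respond with ONLY a JSON object, no markdown and no extra text, shaped as:
{"question": "...", "answer": 0.0, "explanation": "...", "steps": ["..."]}"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;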

&lt;h3&gt;
  
  
  2.4 Model Comparison Study
&lt;/h3&gt;

&lt;p&gt;The script &lt;code&gt;model_comparison_study.py&lt;/code&gt; automates benchmarking across multiple models. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verifies server availability.&lt;/li&gt;
&lt;li&gt;Runs the full test suite on each specified model (Llama 3.2 3B, Phi-3 mini, Mistral 7B).&lt;/li&gt;
&lt;li&gt;Aggregates metrics and computes averages.&lt;/li&gt;
&lt;li&gt;Saves detailed results as JSON and a summary CSV.&lt;/li&gt;
&lt;li&gt;Prints a comparison table with performance awards.&lt;/li&gt;
&lt;/ul&gt;
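
&lt;p&gt;A condensed sketch of that driver loop, assuming the same local endpoint as above (the CSV column names here are illustrative, not necessarily the script's actual keys):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv
import json

import requests

BASE = "http://localhost:8000"  # assumed local server
MODELS = ["llama3.2:latest", "phi3:mini", "mistral:7b"]

requests.get(f"{BASE}/models").raise_for_status()  # verify server availability

results = {m: requests.post(f"{BASE}/benchmark/all-tests",
                            params={"model": m}).json() for m in MODELS}

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

with open("summary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "avg_tokens_per_sec", "success_rate"])  # illustrative
    for m, r in results.items():
        writer.writerow([m, r.get("avg_tokens_per_sec"), r.get("success_rate")])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;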

&lt;h2&gt;
  
  
  3. Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Performance Metrics
&lt;/h3&gt;

&lt;p&gt;The table below summarizes average performance across all 30 tests. Measurements were taken on a Mac mini (Apple Silicon, 16 GB RAM) with all models running on CPU.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tokens/sec&lt;/th&gt;
&lt;th&gt;TTFT (ms)&lt;/th&gt;
&lt;th&gt;Total Time (s)&lt;/th&gt;
&lt;th&gt;CPU %&lt;/th&gt;
&lt;th&gt;Memory %&lt;/th&gt;
&lt;th&gt;Success Rate (%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama3.2:latest&lt;/td&gt;
&lt;td&gt;22.24&lt;/td&gt;
&lt;td&gt;427.29&lt;/td&gt;
&lt;td&gt;4.68&lt;/td&gt;
&lt;td&gt;14.6&lt;/td&gt;
&lt;td&gt;88.8&lt;/td&gt;
&lt;td&gt;100.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;phi3:mini&lt;/td&gt;
&lt;td&gt;22.70&lt;/td&gt;
&lt;td&gt;323.99&lt;/td&gt;
&lt;td&gt;6.81&lt;/td&gt;
&lt;td&gt;13.0&lt;/td&gt;
&lt;td&gt;90.4&lt;/td&gt;
&lt;td&gt;46.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral:7b&lt;/td&gt;
&lt;td&gt;10.98&lt;/td&gt;
&lt;td&gt;1115.96&lt;/td&gt;
&lt;td&gt;12.47&lt;/td&gt;
&lt;td&gt;14.7&lt;/td&gt;
&lt;td&gt;94.4&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Tokens/sec&lt;/strong&gt;: Measured as total tokens generated divided by total inference time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTFT (Time to First Token)&lt;/strong&gt;: Latency until the first token is produced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total Time&lt;/strong&gt;: Average response generation time per test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU/Memory %&lt;/strong&gt;: Average utilization during inference (note that memory usage includes model loading and OS overhead).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Success Rate&lt;/strong&gt;: Percentage of tests that passed JSON validation after up to two retries.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 JSON Compliance and Retry Effectiveness
&lt;/h3&gt;

&lt;p&gt;The following table details the retry counts and final compliance rates. Retries were attempted only when the initial response failed validation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;JSON Compliance (%)&lt;/th&gt;
&lt;th&gt;Total Retries Used&lt;/th&gt;
&lt;th&gt;Retries per Prompt (avg)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama3.2:latest&lt;/td&gt;
&lt;td&gt;100.0&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;1.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;phi3:mini&lt;/td&gt;
&lt;td&gt;46.7&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral:7b&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;llama3.2&lt;/strong&gt; achieved perfect compliance but required an average of 1.6 retries per prompt, indicating that while it often produced malformed JSON initially, the retry mechanism corrected it every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;phi3:mini&lt;/strong&gt; had the lowest compliance; retries were used sparingly (15 in total across 30 prompts) and failed to salvage most of its invalid outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;mistral:7b&lt;/strong&gt; never used a retry: all 27 successes were first‑try, and the three failing prompts recorded no retries either, suggesting they failed in a way the reprompting loop could not catch or correct.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Resource Utilization
&lt;/h3&gt;

&lt;p&gt;All models consumed significant memory due to being loaded simultaneously in the Ollama server. Memory usage ranged from &lt;strong&gt;88.8% (Llama 3.2)&lt;/strong&gt; to &lt;strong&gt;94.4% (Mistral 7B)&lt;/strong&gt; of available RAM, indicating that running larger models on a &lt;strong&gt;16 GB system pushes memory limits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;CPU usage remained moderate (&lt;strong&gt;13–15%&lt;/strong&gt;) as inference is primarily &lt;strong&gt;memory-bound on Apple Silicon&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fhargurjeet%2Flocal_slm_experiments%2Fmain%2Fresults%2Fcharts%2Fresource_usage.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fhargurjeet%2Flocal_slm_experiments%2Fmain%2Fresults%2Fcharts%2Fresource_usage.png" alt="Resource Usage Comparison" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Average CPU and memory usage per model. &lt;em&gt;Llama 3.2 shows the lowest memory footprint, while Mistral 7B consumes the most.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Ranking Summary
&lt;/h3&gt;

&lt;p&gt;A multi‑criteria ranking was computed, considering speed (tokens/sec), latency (TTFT), success rate, and a combined efficiency score (inverse of resource usage). Lower overall score is better.&lt;/p&gt;
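
&lt;p&gt;The overall score works out to a simple rank sum across the four criteria; a minimal reproduction from the tables above (efficiency ranks are taken directly from the table, since the underlying resource score is not republished here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rank each criterion 1..3 (1 = best), then add the ranks per model.
metrics = {  # values from Section 3.1
    "phi3:mini":       {"tokens_per_sec": 22.70, "ttft_ms": 323.99,  "success": 46.7},
    "llama3.2:latest": {"tokens_per_sec": 22.24, "ttft_ms": 427.29,  "success": 100.0},
    "mistral:7b":      {"tokens_per_sec": 10.98, "ttft_ms": 1115.96, "success": 90.0},
}

def ranks(key, reverse):
    ordered = sorted(metrics, key=lambda m: metrics[m][key], reverse=reverse)
    return {m: i + 1 for i, m in enumerate(ordered)}

speed   = ranks("tokens_per_sec", reverse=True)   # higher is better
latency = ranks("ttft_ms",        reverse=False)  # lower is better
success = ranks("success",        reverse=True)
efficiency = {"phi3:mini": 1, "llama3.2:latest": 2, "mistral:7b": 3}  # from the table

for m in metrics:
    print(m, speed[m] + latency[m] + success[m] + efficiency[m])
# phi3:mini 6, llama3.2:latest 7, mistral:7b 11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;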

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Rank Speed&lt;/th&gt;
&lt;th&gt;Rank Latency&lt;/th&gt;
&lt;th&gt;Rank Success&lt;/th&gt;
&lt;th&gt;Rank Efficiency&lt;/th&gt;
&lt;th&gt;Overall Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;phi3:mini&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama3.2:latest&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral:7b&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;phi3:mini&lt;/strong&gt; ranks best in speed, latency, and efficiency, but worst in success rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama3.2:latest&lt;/strong&gt; ranks second in speed and latency, but first in success rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mistral:7b&lt;/strong&gt; consistently ranks third in all categories except success rate, where it places second.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.5 Radar Chart Overview
&lt;/h3&gt;

&lt;p&gt;A radar chart was generated to visualize the trade‑offs across four normalized metrics: Speed, Latency (inverse), Efficiency (inverse), and JSON Compliance. Each model's polygon reveals its strengths and weaknesses at a glance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sjt8mnu75r6slxi4ttx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sjt8mnu75r6slxi4ttx.png" alt="Figure 2: Radar chart comparing models on speed, latency, efficiency, and compliance. Larger area indicates better overall balance." width="800" height="659"&gt;&lt;/a&gt;&lt;/p&gt;
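
&lt;p&gt;For readers who want to reproduce the figure, a chart like this can be drawn with matplotlib's polar axes. A sketch with illustrative, not measured, normalized scores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt

labels = ["Speed", "Latency (inv)", "Efficiency (inv)", "JSON Compliance"]
scores = {  # normalized 0-1 values; illustrative placeholders only
    "llama3.2:latest": [0.98, 0.76, 0.90, 1.00],
    "phi3:mini":       [1.00, 1.00, 1.00, 0.47],
    "mistral:7b":      [0.48, 0.29, 0.80, 0.90],
}

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close each polygon

ax = plt.subplot(polar=True)
for name, vals in scores.items():
    closed = vals + vals[:1]
    ax.plot(angles, closed, label=name)
    ax.fill(angles, closed, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.legend(loc="lower right")
plt.savefig("radar_chart.png", bbox_inches="tight")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;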

&lt;h2&gt;
  
  
  4. Code Implementation
&lt;/h2&gt;

&lt;p&gt;The core of the benchmarking system consists of three main components: the FastAPI server, the validation logic, and the retry mechanism. Below are the key code snippets.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 FastAPI Server Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simple Ollama Benchmark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BenchmarkRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
    &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/benchmark/all-tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_all_tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Implementation details in full codebase
&lt;/span&gt;    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 JSON Validation Function
&lt;/h3&gt;

&lt;p&gt;The validation function strictly checks for pure JSON—no markdown, no extra text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_json_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Strictly validate response - must be pure JSON, no extraction&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Try to parse as JSON - if this fails, it's not valid JSON
&lt;/span&gt;        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Validate against schema
&lt;/span&gt;        &lt;span class="n"&gt;validated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;schema_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid JSON (must be pure JSON, no markdown or extra text): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Schema validation failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unexpected error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.3 Retry Mechanism with Reprompting
&lt;/h3&gt;

&lt;p&gt;The retry logic gives the model a second chance with a stricter prompt when validation fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_model_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                         &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run model with retry mechanism for strict JSON validation&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Prepare messages based on retry count
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# More strict reprompt on failure
&lt;/span&gt;            &lt;span class="n"&gt;retry_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Your previous response was not valid JSON.
            Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last_error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

            You MUST respond with ONLY a valid JSON object. No markdown, 
            no backticks, no additional text, no explanations.
            Just the raw JSON object.

            Original instruction:
            &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retry_prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

        &lt;span class="c1"&gt;# Stream to measure first token
&lt;/span&gt;        &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Lower temperature for more consistent JSON
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Try to prevent markdown
&lt;/span&gt;            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Collect response and measure timing
&lt;/span&gt;        &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="n"&gt;first_token_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_token_time&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;first_token_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Check for markdown indicators (immediate fail)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;```

&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;

```json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response contains markdown code blocks. Must be pure JSON only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Validate JSON
&lt;/span&gt;        &lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_json_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;schema_model&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;success_result&lt;/span&gt;

        &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt;

    &lt;span class="c1"&gt;# All retries failed
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;failure_result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Defining Pydantic Schemas
&lt;/h3&gt;

&lt;p&gt;Example schemas from &lt;code&gt;prompts.py&lt;/code&gt; that enforce the expected output structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MathResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;explanation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CodeResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;explanation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;time_complexity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;space_complexity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoryResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;plot_summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;story&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;moral&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
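
&lt;p&gt;To see what this buys you, here is a quick usage sketch (assuming Pydantic v2): a response missing a required field is rejected with a precise error instead of slipping downstream.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pydantic import ValidationError

good = '{"question": "2+2?", "answer": 4.0, "explanation": "Basic addition.", "steps": ["Add 2 and 2"]}'
bad = '{"question": "2+2?", "explanation": "No answer field."}'

print(MathResponse.model_validate_json(good).answer)  # 4.0

try:
    MathResponse.model_validate_json(bad)
except ValidationError as e:
    print(e)  # reports 'answer' as a missing required field
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;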



&lt;h3&gt;
  
  
  4.5 Running the Benchmark
&lt;/h3&gt;

&lt;p&gt;The comparison script orchestrates testing across multiple models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From model_comparison_study.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;benchmark_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/benchmark/all-tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_tokens_per_second&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;averages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_tokens_per_second&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_time_to_first_token_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;averages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_time_to_first_token_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;successful_tests&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; 
                           &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tests_run&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Additional metrics...
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
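
&lt;p&gt;Tying it together, a small driver can sweep all three models and print a comparison. This is an illustrative sketch: the &lt;code&gt;ModelComparisonStudy&lt;/code&gt; class name, base URL, and Ollama model tags are assumptions, not verbatim from the repo.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical driver around benchmark_model() above.
models = ["llama3.2:3b", "phi3:mini", "mistral:7b"]  # assumed Ollama tags

runner = ModelComparisonStudy(base_url="http://localhost:8000")
results = [runner.benchmark_model(m) for m in models]
results = [r for r in results if r]  # drop failed runs (non-200 responses)

# Rank by throughput and print one line per model.
for r in sorted(results, key=lambda r: r["avg_tokens_per_second"], reverse=True):
    print(f"{r['model']:&lt;15} "
          f"{r['avg_tokens_per_second']:&gt;6.2f} tok/s  "
          f"{r['avg_time_to_first_token_ms']:&gt;8.1f} ms TTFT  "
          f"{r['success_rate']:&gt;5.1f}% valid JSON")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;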



&lt;p&gt;The complete implementation is available in the &lt;a href="https://github.com/hargurjeet/local_slm_experiments/tree/main" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Discussion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Speed vs. Accuracy Trade‑off
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Llama 3.2 3B&lt;/strong&gt; strikes an excellent balance: high speed (22.24 tokens/sec) and perfect compliance after retries, though it needed many retries to get there. With the retry mechanism in place, it is a robust choice for most applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phi-3 mini&lt;/strong&gt; offers the best raw speed and lowest latency, but its poor compliance (46.7%) makes it unreliable for structured output tasks without additional fallback logic. Its low CPU usage and quick time to first token are attractive for interactive applications where occasional failures can be tolerated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistral 7B&lt;/strong&gt; delivers high first‑try compliance (90%), and its successful responses needed no retries at all, but it runs at roughly half the speed with a noticeable delay to first token. It is best suited for offline batch processing or applications where correctness outweighs latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 Resource Constraints
&lt;/h3&gt;

&lt;p&gt;Memory usage is a key constraint on edge devices. On a 16 GB Mac mini, all three models consumed over 88% of RAM, leaving little headroom for other processes. For deployment on memory‑limited hardware, &lt;strong&gt;Llama 3.2&lt;/strong&gt; is the most memory‑efficient of the three while still maintaining perfect compliance. &lt;strong&gt;Phi-3's&lt;/strong&gt; higher memory footprint (90.4%) combined with its low success rate makes it less attractive unless its speed advantage is essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.3 Retry Mechanism Value
&lt;/h3&gt;

&lt;p&gt;The retry mechanism proved essential for &lt;strong&gt;Llama 3.2&lt;/strong&gt;, converting many initially invalid responses into valid ones. For &lt;strong&gt;Mistral&lt;/strong&gt;, it was unnecessary. For &lt;strong&gt;Phi-3&lt;/strong&gt;, it was largely ineffective, suggesting that the model struggles to follow the "pure JSON" instruction even when prompted more strictly. This highlights the importance of model selection for tasks requiring strict format adherence.&lt;/p&gt;
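
&lt;p&gt;The visible part of the retry loop only records &lt;code&gt;last_error&lt;/code&gt;; its value comes from feeding that error back into the next attempt. A minimal sketch of that feedback step using the official &lt;code&gt;ollama&lt;/code&gt; Python client (the correction wording and the &lt;code&gt;prompt&lt;/code&gt; variable are placeholders; the exact phrasing lives in the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ollama

# At the top of each retry attempt: remind the model what went wrong last time.
messages = [{"role": "user", "content": prompt}]
if last_error:
    messages.append({
        "role": "user",
        "content": f"Your previous response was rejected: {last_error} "
                   "Respond again with pure JSON only, no markdown.",
    })

response = ollama.chat(model=model_name, messages=messages, stream=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;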

&lt;h3&gt;
  
  
  5.4 Structured Output Enforcement
&lt;/h3&gt;

&lt;p&gt;Pydantic validation with strict JSON‑only requirements effectively ensures that downstream systems receive predictable data. The retry mechanism adds robustness, but as seen with &lt;strong&gt;Phi-3&lt;/strong&gt;, it cannot compensate for a model's fundamental inability to follow format instructions. In production, combining validation with a fallback parser (e.g., extracting JSON from markdown) could salvage some otherwise failed responses, though this compromises the purity of the structured output guarantee.&lt;/p&gt;
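
&lt;p&gt;As a concrete illustration of that fallback idea (a sketch, not part of the benchmark), a salvage step can strip a markdown fence before re-running validation. Anything recovered this way should still be counted as a format failure so the compliance numbers stay honest.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Matches ```json ... ``` or plain ``` ... ``` fences.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)

def extract_json_candidate(response_text: str) -&gt; str:
    """Return the fenced body if the model wrapped its JSON in markdown."""
    match = FENCE_RE.search(response_text)
    return match.group(1) if match else response_text.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running the salvaged text through &lt;code&gt;validate_json_response&lt;/code&gt; again keeps the schema guarantee intact while still flagging the response as non-compliant at the formatting level.&lt;/p&gt;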

&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;This benchmark demonstrates that local SLMs can deliver both reasonable performance and structured outputs, but with significant variance across models. Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Llama 3.2 3B&lt;/strong&gt; is the overall winner when paired with retries: 22.24 tokens/sec, 100% final compliance, and moderate memory usage. It is the recommended choice for applications requiring reliable structured output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mistral 7B&lt;/strong&gt; provides high first‑try compliance (90%) but at lower speed and higher memory cost; it suits accuracy‑critical tasks where latency is not the primary concern.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Phi-3 mini&lt;/strong&gt; excels in speed and low latency but suffers from poor format adherence, limiting its direct use unless supplemented by robust post‑processing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmarking framework built for this study is reusable for testing new models or prompts. Future work could explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU acceleration to reduce memory pressure and improve speed.&lt;/li&gt;
&lt;li&gt;Prompt engineering techniques (e.g., few‑shot examples, system prompts) to boost compliance for models like Phi-3; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;Integration with function‑calling APIs to enforce schemas more naturally.&lt;/li&gt;
&lt;/ul&gt;
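
&lt;p&gt;For the prompt‑engineering item, one untested starting point is a system prompt that pairs the JSON-only rule with a single worked example (the fields here echo &lt;code&gt;MathResponse&lt;/code&gt; above; &lt;code&gt;user_prompt&lt;/code&gt; is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;FEW_SHOT_SYSTEM_PROMPT = """You are a JSON-only assistant.
Rules:
- Respond with a single JSON object and nothing else.
- Never wrap the JSON in markdown code fences.

Example:
User: What is 2 + 2?
Assistant: {"question": "What is 2 + 2?", "answer": 4.0, "explanation": "Basic addition.", "steps": ["Add 2 and 2"]}"""

messages = [
    {"role": "system", "content": FEW_SHOT_SYSTEM_PROMPT},
    {"role": "user", "content": user_prompt},
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;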




&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Numerical results were obtained on a Mac mini (16 GB RAM) running Ollama with CPU inference. Actual performance may vary with hardware and Ollama version.&lt;/p&gt;

&lt;p&gt;All code is available in the &lt;a href="https://github.com/hargurjeet/local_slm_experiments/tree/main" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. Give it a ⭐ if you find it useful!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
