<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Patel Darshit</title>
    <description>The latest articles on Forem by Patel Darshit (@darshitp091).</description>
    <link>https://forem.com/darshitp091</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3774384%2Fd7cd4c9b-e94a-42eb-b57e-b3a4da58f4c2.jpeg</url>
      <title>Forem: Patel Darshit</title>
      <link>https://forem.com/darshitp091</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/darshitp091"/>
    <language>en</language>
    <item>
      <title>README.3D — The AI-Powered Documentation &amp; Global Repo Analysis Engine You Didn't Know You Needed</title>
      <dc:creator>Patel Darshit</dc:creator>
      <pubDate>Sun, 08 Mar 2026 06:27:33 +0000</pubDate>
      <link>https://forem.com/darshitp091/readme3d-the-ai-powered-documentation-global-repo-analysis-engine-you-didnt-know-you-needed-3dbj</link>
      <guid>https://forem.com/darshitp091/readme3d-the-ai-powered-documentation-global-repo-analysis-engine-you-didnt-know-you-needed-3dbj</guid>
      <description>&lt;h2&gt;
  
  
  The Problem Every Developer Knows Too Well
&lt;/h2&gt;

&lt;p&gt;You've just shipped a solid project. The code is clean, the architecture is sound, and the logic is airtight. But then comes the part most developers dread: &lt;strong&gt;writing the README&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A great README is the front door to your project. It's what determines whether a potential contributor clicks "Star" or clicks away. Yet despite its importance, documentation is consistently treated as an afterthought — rushed, incomplete, or skipped entirely.&lt;/p&gt;

&lt;p&gt;That's exactly the gap that &lt;strong&gt;&lt;a href="https://readme-3d.vercel.app/" rel="noopener noreferrer"&gt;README.3D&lt;/a&gt;&lt;/strong&gt; is designed to close.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is README.3D?
&lt;/h2&gt;

&lt;p&gt;README.3D is an &lt;strong&gt;AI Documentation &amp;amp; Global Repo Analysis Engine&lt;/strong&gt; built to automate, enhance, and elevate the way developers document their projects.&lt;/p&gt;

&lt;p&gt;At its core, README.3D combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-driven content generation&lt;/strong&gt; — intelligently analyzes your repository and produces structured, professional documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global repo analysis&lt;/strong&gt; — goes beyond surface-level scanning to understand your project's architecture, dependencies, and purpose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3D-enhanced visual presentation&lt;/strong&gt; — brings a modern, immersive aesthetic to the traditionally flat world of markdown documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're building an open-source library, a SaaS product, or a personal portfolio project, README.3D transforms your codebase into clear, compelling documentation in seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🤖 AI-Powered Documentation Generation
&lt;/h3&gt;

&lt;p&gt;README.3D uses advanced AI to analyze your repository's structure, detect frameworks and dependencies, and generate contextually accurate documentation — including installation guides, usage examples, feature breakdowns, and API references.&lt;/p&gt;

&lt;h3&gt;
  
  
  🌐 Global Repository Analysis Engine
&lt;/h3&gt;

&lt;p&gt;The platform doesn't just skim your &lt;code&gt;package.json&lt;/code&gt;. It performs a deep-dive analysis across your entire codebase to surface meaningful insights and translate them into documentation that truly reflects what your project does.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧊 3D Visual Experience
&lt;/h3&gt;

&lt;p&gt;True to its name, README.3D brings a visually distinctive interface to the documentation workflow. The 3D design philosophy makes the tool not just functional, but genuinely engaging to use.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚡ Instant Output
&lt;/h3&gt;

&lt;p&gt;No lengthy setup. No complex configuration. Paste your repository URL, let the engine run, and walk away with a production-ready README that you can customize, copy, or deploy immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Is This For?
&lt;/h2&gt;

&lt;p&gt;README.3D is built for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solo developers&lt;/strong&gt; who want polished documentation without spending hours writing it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source maintainers&lt;/strong&gt; who need consistent, high-quality docs across multiple repositories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering teams&lt;/strong&gt; looking to standardize documentation practices at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job seekers &amp;amp; portfolio builders&lt;/strong&gt; who understand that a well-documented project speaks louder than code alone&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why It Stands Out
&lt;/h2&gt;

&lt;p&gt;The README generator space has grown significantly, but most tools offer surface-level templating with minimal intelligence. README.3D differentiates itself through its &lt;strong&gt;global analysis engine&lt;/strong&gt; — meaning it understands your project holistically rather than just filling in a template with variable placeholders.&lt;/p&gt;

&lt;p&gt;The result is documentation that reads like it was written by someone who actually understands the project — because, in a sense, the AI does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Getting up and running with README.3D takes less than a minute:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Visit &lt;strong&gt;&lt;a href="https://readme-3d.vercel.app/" rel="noopener noreferrer"&gt;readme-3d.vercel.app&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Paste your GitHub repository URL&lt;/li&gt;
&lt;li&gt;Let the AI analysis engine do its work&lt;/li&gt;
&lt;li&gt;Review, customize, and export your generated README&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No account required to get started. No friction between you and better documentation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Documentation debt is real, and it silently kills otherwise great projects. README.3D is a practical, well-designed tool that removes the most common excuse developers have for skipping documentation: &lt;em&gt;"I don't have time."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With AI handling the heavy lifting and a global repo analysis engine ensuring accuracy, there's no longer a reason to ship a project with a bare-bones or missing README.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it out → &lt;a href="https://readme-3d.vercel.app/" rel="noopener noreferrer"&gt;readme-3d.vercel.app&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>productivity</category>
      <category>github</category>
    </item>
    <item>
      <title>I Spent 3 Months Compressing AI Models So You Don't Have To – Here's What I Learned</title>
      <dc:creator>Patel Darshit</dc:creator>
      <pubDate>Sun, 15 Feb 2026 19:02:20 +0000</pubDate>
      <link>https://forem.com/darshitp091/i-spent-3-months-compressing-ai-models-so-you-dont-have-to-heres-what-i-learned-d12</link>
      <guid>https://forem.com/darshitp091/i-spent-3-months-compressing-ai-models-so-you-dont-have-to-heres-what-i-learned-d12</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Deployed 100+ AI models to edge devices. Discovered the hard way that manual optimization sucks. Built a tool to automate it. Sharing everything I learned.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;You spend weeks training the perfect computer vision model. 98% accuracy. Beautiful loss curves. Your team is celebrating.&lt;/p&gt;

&lt;p&gt;Then someone asks: &lt;em&gt;"Can we run this on a Jetson Nano?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And suddenly, your 2GB PyTorch masterpiece becomes a 500MB problem.&lt;/p&gt;

&lt;p&gt;This was me six months ago. I had a YOLOv8 model that needed to run on edge hardware for a robotics project. The model worked perfectly in the cloud. On a Jetson Nano? &lt;strong&gt;12 FPS&lt;/strong&gt;. Unusable.&lt;/p&gt;

&lt;p&gt;I needed 30+ FPS for real-time detection.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Manual Optimization Rabbit Hole
&lt;/h2&gt;

&lt;p&gt;Here's what I tried first (spoiler: it was painful):&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt 1: TensorFlow Lite Conversion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;

&lt;span class="n"&gt;converter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TFLiteConverter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_saved_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;saved_model_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Optimize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;tflite_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model.tflite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tflite_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Model size went from 980MB → 780MB. Not enough. Inference time barely improved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time wasted:&lt;/strong&gt; 8 hours fighting compatibility issues between TensorFlow versions.&lt;/p&gt;




&lt;h3&gt;
  
  
  Attempt 2: Manual INT8 Quantization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;supported_ops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpsSet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TFLITE_BUILTINS_INT8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference_input_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference_output_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int8&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;representative_dataset&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;representative_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;representative_dataset&lt;/span&gt;
&lt;span class="n"&gt;tflite_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Model crashed on inference. Accuracy dropped to 73% (from 98%). Completely broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time wasted:&lt;/strong&gt; Another 12 hours debugging why my calibration dataset wasn't working.&lt;/p&gt;




&lt;h3&gt;
  
  
  Attempt 3: ONNX Runtime + TensorRT
&lt;/h3&gt;

&lt;p&gt;This one actually worked. But here's what it took:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Convert PyTorch → ONNX&lt;/strong&gt; (3 hours, fighting version conflicts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize ONNX graph&lt;/strong&gt; (2 hours, manual layer fusion)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert to TensorRT engine&lt;/strong&gt; (4 hours, hardware-specific tuning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile and fix precision issues&lt;/strong&gt; (6 hours of trial and error)&lt;/li&gt;
&lt;/ol&gt;
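
&lt;p&gt;For anyone retracing step 3: stripped of all the trial and error, the TensorRT build boils down to something like this with the TensorRT 8.x Python API. A rough sketch, not my exact script; the file names are placeholders and error handling is trimmed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 &amp;lt;&amp;lt; int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model exported in step 1
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# FP16 is where most of the Jetson speedup comes from
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;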

&lt;p&gt;&lt;strong&gt;Total time:&lt;/strong&gt; 3 days for ONE model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Size: 245MB (75% reduction)&lt;/li&gt;
&lt;li&gt;✅ Latency: 33ms (2.5x faster)&lt;/li&gt;
&lt;li&gt;✅ Accuracy: 97.2% (0.8% loss)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It worked! But I had &lt;strong&gt;15 more models&lt;/strong&gt; to optimize.&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Oh Crap" Moment
&lt;/h2&gt;

&lt;p&gt;At that rate, optimizing all my models would take &lt;strong&gt;45 days of full-time work&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I started Googling for tools. Found OctoAI. Perfect solution.&lt;/p&gt;

&lt;p&gt;Then I read: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"OctoAI acquired by NVIDIA. Platform shutting down October 2024."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Great. 😑&lt;/p&gt;

&lt;p&gt;Neural Magic? Enterprise-only. $50K minimum.&lt;/p&gt;

&lt;p&gt;Edge Impulse? Microcontroller focus, not for my use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There was no affordable, automated solution for regular developers.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned From 100+ Model Compressions
&lt;/h2&gt;

&lt;p&gt;Over the next 3 months, I compressed over 100 different models. Here's the non-obvious stuff nobody tells you:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. INT8 Quantization is Magic (When Done Right)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Average results across 100+ models:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 Compression: &lt;strong&gt;4x smaller&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;⚡ Speedup: &lt;strong&gt;2-3x faster&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🎯 Accuracy loss: &lt;strong&gt;0.5-1.5%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here's the catch: &lt;strong&gt;calibration dataset matters more than model architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bad calibration = 10% accuracy loss 😱&lt;br&gt;&lt;br&gt;
Good calibration = 0.5% accuracy loss 🎉&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My calibration strategy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use 1000 representative samples from validation set
&lt;/span&gt;&lt;span class="n"&gt;calibration_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validation_set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;representative_dataset_gen&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;calibration_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;representative_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;representative_dataset_gen&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Game changer. Accuracy loss went from 3% to 0.5%.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Not All Layers Should Be Quantized
&lt;/h3&gt;

&lt;p&gt;I was quantizing everything to INT8. &lt;strong&gt;Rookie mistake.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some layers (especially first conv and last FC layers) are super sensitive to quantization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better approach: Mixed precision quantization&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer Type&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First Conv&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;Sensitive to input variations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Middle Conv Layers&lt;/td&gt;
&lt;td&gt;INT8&lt;/td&gt;
&lt;td&gt;Biggest size savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attention Layers&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;Critical for accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final FC Layer&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;Output quality matters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch Norm&lt;/td&gt;
&lt;td&gt;INT8&lt;/td&gt;
&lt;td&gt;Can be fused anyway&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 6x compression with 0.3% accuracy loss instead of 3%.&lt;/p&gt;
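
&lt;p&gt;If you want to reproduce this policy yourself, ONNX Runtime's static quantization lets you exclude specific nodes from INT8. A minimal sketch, assuming a model with one input named &lt;code&gt;input&lt;/code&gt;; the excluded node names are hypothetical, and &lt;code&gt;calibration_samples&lt;/code&gt; is your own list of preprocessed arrays:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from onnxruntime.quantization import (
    CalibrationDataReader, QuantType, quantize_static,
)

class ValReader(CalibrationDataReader):
    """Feeds real validation samples to the calibrator."""
    def __init__(self, samples):
        self.it = iter(samples)

    def get_next(self):
        sample = next(self.it, None)
        return None if sample is None else {"input": sample}

# Quantize everything to INT8 except the sensitive layers,
# which stay in float (node names below are placeholders)
quantize_static(
    "model.onnx",
    "model_int8.onnx",
    ValReader(calibration_samples),
    weight_type=QuantType.QInt8,
    nodes_to_exclude=["/conv1/Conv", "/fc/Gemm"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;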




&lt;h3&gt;
  
  
  3. Framework Conversion is a Minefield
&lt;/h3&gt;

&lt;p&gt;Success rates I observed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PyTorch → TFLite directly: &lt;strong&gt;60% success&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;PyTorch → ONNX → TFLite: &lt;strong&gt;85% success&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;TensorFlow → ONNX → TFLite: &lt;strong&gt;90% success&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The trick:&lt;/strong&gt; Always go through ONNX as an intermediate step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Export PyTorch to ONNX
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onnx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dummy_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model.onnx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;opset_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;output_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;dynamic_axes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Optimize ONNX graph
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;onnxruntime&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ort&lt;/span&gt;

&lt;span class="n"&gt;sess_options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SessionOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sess_options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;graph_optimization_level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GraphOptimizationLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ORT_ENABLE_ALL&lt;/span&gt;
&lt;span class="n"&gt;sess_options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimized_model_filepath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_optimized.onnx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InferenceSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model.onnx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sess_options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free &lt;strong&gt;15-20% speedup&lt;/strong&gt; just from graph fusion.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Hardware-Specific Optimization is Non-Negotiable
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Generic optimization:&lt;/strong&gt; 2x speedup&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Hardware-specific optimization:&lt;/strong&gt; 5x speedup&lt;/p&gt;

&lt;p&gt;Here's what works for each platform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Best Framework&lt;/th&gt;
&lt;th&gt;Key Optimization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA Jetson&lt;/td&gt;
&lt;td&gt;TensorRT&lt;/td&gt;
&lt;td&gt;FP16 + layer fusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raspberry Pi&lt;/td&gt;
&lt;td&gt;TFLite + XNNPACK&lt;/td&gt;
&lt;td&gt;INT8 quantization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;iOS (iPhone)&lt;/td&gt;
&lt;td&gt;CoreML&lt;/td&gt;
&lt;td&gt;Neural Engine offload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Android&lt;/td&gt;
&lt;td&gt;TFLite + NNAPI&lt;/td&gt;
&lt;td&gt;GPU delegate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge TPU&lt;/td&gt;
&lt;td&gt;TFLite + Edge TPU compiler&lt;/td&gt;
&lt;td&gt;INT8 required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Don't try to use the same optimized model everywhere.&lt;/strong&gt; It won't work.&lt;/p&gt;
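
&lt;p&gt;As one concrete example, here's roughly what targeting a Coral Edge TPU looks like from Python. The model path is a placeholder; on a plain ARM CPU you'd drop the delegate and let XNNPACK (the default in recent TFLite builds) do the work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import tensorflow as tf

# Load the INT8 model with the Edge TPU delegate attached
interpreter = tf.lite.Interpreter(
    model_path="model_int8_edgetpu.tflite",
    experimental_delegates=[
        tf.lite.experimental.load_delegate("libedgetpu.so.1")
    ],
)
interpreter.allocate_tensors()

# One smoke-test inference with a dummy input
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;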


&lt;h3&gt;
  
  
  5. Pruning is Overrated (for Most Use Cases)
&lt;/h3&gt;

&lt;p&gt;Everyone talks about pruning. I tried it extensively:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured pruning:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ 30% size reduction&lt;/li&gt;
&lt;li&gt;❌ 5% accuracy loss&lt;/li&gt;
&lt;li&gt;❌ Marginal speedup on real hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Unstructured pruning:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ 50% size reduction&lt;/li&gt;
&lt;li&gt;❌ No speedup on real hardware (sparse ops aren't optimized)&lt;/li&gt;
&lt;li&gt;❌ Complicated to maintain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; Quantization first. Pruning only if you REALLY need that extra 20% size reduction.&lt;/p&gt;
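
&lt;p&gt;For reference, the unstructured experiments were essentially PyTorch's built-in pruning utilities, something like the sketch below (assuming &lt;code&gt;model&lt;/code&gt; is your loaded network):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.nn as nn
import torch.nn.utils.prune as prune

# Zero out 50% of each conv's weights by L1 magnitude.
# This shrinks the compressed file, but without sparse kernels
# it does NOT speed up inference, which is the problem above.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the mask into the weights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;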


&lt;h2&gt;
  
  
  The Breaking Point
&lt;/h2&gt;

&lt;p&gt;After manually optimizing 47 models, I hit a wall.&lt;/p&gt;

&lt;p&gt;Each model took &lt;strong&gt;4-8 hours&lt;/strong&gt;. I was burning out. And I still had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ 15 models for the robotics project&lt;/li&gt;
&lt;li&gt;❌ 22 models for a computer vision pipeline&lt;/li&gt;
&lt;li&gt;❌ 18 models for a client's IoT deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I did the math: &lt;strong&gt;220+ hours of manual work remaining.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's when I decided to automate it.&lt;/p&gt;


&lt;h2&gt;
  
  
  Building the Solution
&lt;/h2&gt;

&lt;p&gt;I built a pipeline that handles everything I was doing manually:&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Model Analysis
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Auto-detect framework and analyze architecture&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Detect framework
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.pt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.pth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;framework&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pytorch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.h5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;framework&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tensorflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;framework&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;onnx&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;onnx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Count parameters
&lt;/span&gt;    &lt;span class="n"&gt;total_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Identify quantization-sensitive layers
&lt;/span&gt;    &lt;span class="n"&gt;sensitive_layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;identify_sensitive_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;framework&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_params&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sensitive_layers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sensitive_layers&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: Automated Compression
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compress_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_hardware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compression_level&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Apply mixed precision quantization&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate calibration dataset
&lt;/span&gt;    &lt;span class="n"&gt;calibration_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_calibration_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Define quantization config
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;compression_level&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aggressive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;default_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;sensitive_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fp16&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;compression_level&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;default_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;sensitive_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fp16&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# conservative
&lt;/span&gt;        &lt;span class="n"&gt;default_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fp16&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;sensitive_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fp16&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply mixed precision
&lt;/span&gt;    &lt;span class="n"&gt;quantized_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_mixed_precision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default_precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sensitive_layers_precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sensitive_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;calibration_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;calibration_data&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Hardware-specific compilation
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_hardware&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jetson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;final_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compile_tensorrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantized_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;target_hardware&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raspberry_pi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;final_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compile_tflite_xnnpack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantized_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;target_hardware&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mobile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;final_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compile_coreml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantized_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3: Validation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_compression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compressed_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Benchmark and validate accuracy&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Size comparison
&lt;/span&gt;    &lt;span class="n"&gt;original_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_model_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;compressed_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_model_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compressed_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;compression_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;compressed_size&lt;/span&gt;

    &lt;span class="c1"&gt;# Latency benchmark
&lt;/span&gt;    &lt;span class="n"&gt;original_latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;benchmark_latency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;compressed_latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;benchmark_latency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compressed_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;speedup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_latency&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;compressed_latency&lt;/span&gt;

    &lt;span class="c1"&gt;# Accuracy validation
&lt;/span&gt;    &lt;span class="n"&gt;original_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_accuracy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;compressed_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_accuracy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compressed_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;accuracy_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_acc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;compressed_acc&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;compression_ratio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;compression_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;speedup&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;speedup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy_loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;accuracy_loss&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Tech stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend: FastAPI (async job queue with Redis)&lt;/li&gt;
&lt;li&gt;ML: ONNX Runtime, TFLite, PyTorch, llama.cpp&lt;/li&gt;
&lt;li&gt;Compute: RunPod GPU instances (70% cheaper than AWS)&lt;/li&gt;
&lt;li&gt;Storage: S3-compatible object storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole pipeline runs in &lt;strong&gt;3-8 minutes&lt;/strong&gt; depending on model size.&lt;/p&gt;

&lt;p&gt;What took me 4-8 hours manually now takes under 10 minutes.&lt;/p&gt;
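
&lt;p&gt;The web layer is conceptually tiny. A simplified sketch of the submit endpoint; &lt;code&gt;compress_model_job&lt;/code&gt; is a stand-in for the worker, and in production the hand-off goes through Redis rather than FastAPI's in-process background tasks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

@app.post("/compress")
def submit_job(model_url: str, target_hardware: str, background: BackgroundTasks):
    # Enqueue the compression job and return immediately;
    # a GPU worker picks it up and writes results to object storage
    background.add_task(compress_model_job, model_url, target_hardware)
    return {"status": "queued"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;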


&lt;h2&gt;
  
  
  Real-World Results
&lt;/h2&gt;

&lt;p&gt;Here are some standout cases from the 100+ models I've compressed:&lt;/p&gt;
&lt;h3&gt;
  
  
  Case 1: YOLOv8-Large (Object Detection)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;980MB&lt;/td&gt;
&lt;td&gt;245MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;85ms&lt;/td&gt;
&lt;td&gt;33ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mAP&lt;/td&gt;
&lt;td&gt;98.2%&lt;/td&gt;
&lt;td&gt;97.4%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.8% loss&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Jetson Nano, real-time drone detection&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Case 2: BERT-Base (NLP)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;440MB&lt;/td&gt;
&lt;td&gt;110MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;120ms&lt;/td&gt;
&lt;td&gt;45ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.7x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;94.3%&lt;/td&gt;
&lt;td&gt;93.8%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.5% loss&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Raspberry Pi 4, on-device sentiment analysis&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Case 3: MobileNetV3 (Image Classification)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;21MB&lt;/td&gt;
&lt;td&gt;5.2MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;18ms&lt;/td&gt;
&lt;td&gt;7ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.6x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-1 Acc&lt;/td&gt;
&lt;td&gt;75.2%&lt;/td&gt;
&lt;td&gt;74.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6% loss&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Android app, 100M+ users&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pattern:&lt;/strong&gt; Consistent &lt;strong&gt;4x compression&lt;/strong&gt; with &lt;strong&gt;2-3x speedup&lt;/strong&gt; and &lt;strong&gt;&amp;lt;1% accuracy loss&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Edge AI is exploding. The market is going from &lt;strong&gt;$24.9B (2025)&lt;/strong&gt; to &lt;strong&gt;$118.69B by 2033&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But here's the problem: &lt;strong&gt;most AI is still trained in the cloud and deployed in the cloud.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real world needs AI at the edge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚁 Drones that detect objects in real-time&lt;/li&gt;
&lt;li&gt;🤖 Robots that navigate autonomously&lt;/li&gt;
&lt;li&gt;🏥 Medical devices that process data locally (HIPAA compliance)&lt;/li&gt;
&lt;li&gt;📷 Smart cameras that work without internet&lt;/li&gt;
&lt;li&gt;🔋 IoT sensors that run for years on battery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And getting models onto these devices is &lt;strong&gt;still way too hard.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Wish Someone Had Told Me 6 Months Ago
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Start with quantization, not pruning
&lt;/h3&gt;

&lt;p&gt;It's the 80/20 solution. You'll get 4x compression with minimal effort.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Always use ONNX as an intermediate format
&lt;/h3&gt;

&lt;p&gt;It saves so much pain with framework conversions.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Calibration dataset quality &amp;gt; model architecture
&lt;/h3&gt;

&lt;p&gt;Use 1000+ representative samples from your actual validation set.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Hardware-specific optimization is non-optional
&lt;/h3&gt;

&lt;p&gt;Generic models won't cut it. Optimize for your target hardware.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Measure everything on real hardware
&lt;/h3&gt;

&lt;p&gt;Latency, throughput, memory, power consumption. Not just in theory.&lt;/p&gt;
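
&lt;p&gt;For latency, the measurement loop I run on the device itself looks roughly like this (the 224x224 input shape assumes an ImageNet-style model; adjust for yours):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model_optimized.onnx")
inp = session.get_inputs()[0]
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm up first so lazy initialization doesn't skew the numbers
for _ in range(10):
    session.run(None, {inp.name: x})

# Then measure wall-clock latency over many runs
times = []
for _ in range(200):
    start = time.perf_counter()
    session.run(None, {inp.name: x})
    times.append((time.perf_counter() - start) * 1000)

print(f"p50: {np.percentile(times, 50):.1f} ms, p95: {np.percentile(times, 95):.1f} ms")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;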
&lt;h3&gt;
  
  
  6. Don't trust the accuracy number from quantization tools
&lt;/h3&gt;

&lt;p&gt;Always validate on your actual test set with your actual metrics.&lt;/p&gt;
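
&lt;p&gt;Concretely, that means running the compressed model through the same evaluation loop as the original and diffing the numbers yourself. A minimal sketch (&lt;code&gt;test_samples&lt;/code&gt;, the interpreter, and the label handling are placeholders for your own pipeline):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

correct = 0
for x, y in test_samples:  # your real test set, not the tool's proxy data
    interpreter.set_tensor(inp["index"], x[None].astype(inp["dtype"]))
    interpreter.invoke()
    pred = np.argmax(interpreter.get_tensor(out["index"]))
    correct += int(pred == y)

print(f"quantized accuracy: {correct / len(test_samples):.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;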


&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;If you're deploying models to edge devices, here's your action plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The compression playbook
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;optimize_for_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_hardware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Convert to ONNX first (universal format)
&lt;/span&gt;    &lt;span class="n"&gt;onnx_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;convert_to_onnx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Apply INT8 quantization with good calibration
&lt;/span&gt;    &lt;span class="n"&gt;calibration_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sample_validation_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;quantized_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;quantize_int8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;onnx_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Hardware-specific compilation
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_hardware&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jetson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;final_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compile_tensorrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantized_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;target_hardware&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raspberry_pi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;final_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compile_tflite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantized_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Validate on real hardware
&lt;/span&gt;    &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;benchmark_on_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_hardware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4x smaller models&lt;/li&gt;
&lt;li&gt;2-3x faster inference&lt;/li&gt;
&lt;li&gt;&amp;lt;1% accuracy loss&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;If you're facing similar challenges, I built this into a platform that automates the entire process.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;Try it here:&lt;/strong&gt; &lt;a href="https://edge-ai-alpha.vercel.app/" rel="noopener noreferrer"&gt;https://edge-ai-alpha.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; 5 compressions/month, no credit card needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supports:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ PyTorch, TensorFlow, ONNX models&lt;/li&gt;
&lt;li&gt;✅ Export to TFLite, CoreML, TensorRT, ONNX Runtime&lt;/li&gt;
&lt;li&gt;✅ Automatic calibration dataset generation&lt;/li&gt;
&lt;li&gt;✅ Hardware-specific optimization profiles&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;I'm working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔧 Automated hyperparameter tuning for compression&lt;/li&gt;
&lt;li&gt;🌐 Federated learning support (train on edge, aggregate in cloud)&lt;/li&gt;
&lt;li&gt;🎛️ Custom hardware profiles (add your own device specs)&lt;/li&gt;
&lt;li&gt;📦 Multi-model ensemble optimization&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Let's Discuss!
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's your experience with model deployment?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ever tried quantization? How did it go?&lt;/li&gt;
&lt;li&gt;What's your biggest pain point with edge AI?&lt;/li&gt;
&lt;li&gt;Any horror stories to share? 😅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop a comment below! I'll try to respond to everyone.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Thanks for reading!&lt;/strong&gt; If you found this helpful, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❤️ Reacting with a ❤️ or 🦄&lt;/li&gt;
&lt;li&gt;💬 Sharing your own edge deployment experiences&lt;/li&gt;
&lt;li&gt;🔖 Bookmarking for later reference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy optimizing! 🚀&lt;/p&gt;

&lt;p&gt;— Patel Darshit&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tensorflow</category>
    </item>
  </channel>
</rss>
