<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: divyaprakash D</title>
    <description>The latest articles on Forem by divyaprakash D (@divyaprakash_d_2d5d085bd4).</description>
    <link>https://forem.com/divyaprakash_d_2d5d085bd4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1761017%2Ff2cb9107-da61-4982-977d-2099598f1e5d.jpg</url>
      <title>Forem: divyaprakash D</title>
      <link>https://forem.com/divyaprakash_d_2d5d085bd4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/divyaprakash_d_2d5d085bd4"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>divyaprakash D</dc:creator>
      <pubDate>Sat, 14 Feb 2026 05:18:06 +0000</pubDate>
      <link>https://forem.com/divyaprakash_d_2d5d085bd4/-3pkp</link>
      <guid>https://forem.com/divyaprakash_d_2d5d085bd4/-3pkp</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/divyaprakash_d_2d5d085bd4" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1761017%2Ff2cb9107-da61-4982-977d-2099598f1e5d.jpg" alt="divyaprakash_d_2d5d085bd4"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/divyaprakash_d_2d5d085bd4/stop-editing-start-playing-meet-autoshorts-the-ai-gaming-editor-4mbp" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Stop Editing. Start Playing. Meet AutoShorts: The AI Gaming Editor 🎮&lt;/h2&gt;
      &lt;h3&gt;divyaprakash D ・ Feb 13&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#devchallenge&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#githubchallenge&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#cli&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#githubcopilot&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
      <category>cli</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>Stop Editing. Start Playing. Meet AutoShorts: The AI Gaming Editor 🎮</title>
      <dc:creator>divyaprakash D</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:25:52 +0000</pubDate>
      <link>https://forem.com/divyaprakash_d_2d5d085bd4/stop-editing-start-playing-meet-autoshorts-the-ai-gaming-editor-4mbp</link>
      <guid>https://forem.com/divyaprakash_d_2d5d085bd4/stop-editing-start-playing-meet-autoshorts-the-ai-gaming-editor-4mbp</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/github-2026-01-21"&gt;GitHub Copilot CLI Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AutoShorts&lt;/strong&gt; is an AI-powered pipeline that automatically transforms long-form gameplay footage into viral-ready vertical clips. It uses &lt;strong&gt;Vision AI&lt;/strong&gt; to semantically understand content—distinguishing between "action," "clutch plays," and "WTF moments"—then adds &lt;strong&gt;AI-generated captions&lt;/strong&gt; and &lt;strong&gt;AI voiceovers&lt;/strong&gt; with matching energy and personality.&lt;/p&gt;

&lt;p&gt;The result? Hours of gameplay → polished TikTok/Shorts/Reels-ready clips, with minimal human intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;View Project on GitHub:&lt;/strong&gt; &lt;a href="https://github.com/divyaprakash0426/autoshorts" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo Video:&lt;/strong&gt; 

  &lt;iframe src="https://www.youtube.com/embed/JZawIDjbxCg"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h3&gt;
  
  
  🎥 Showcase: Multi-Language &amp;amp; Style Generation
&lt;/h3&gt;

&lt;p&gt;AutoShorts automatically adapts its editing style, captions, and voiceover personality based on the content and target language. Here are some examples generated entirely by the pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Style&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Video&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fortnite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Story Roast&lt;/td&gt;
&lt;td&gt;🇺🇸 English&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.youtube.com/shorts/tTUipTAdBlk" rel="noopener noreferrer"&gt;Watch Part 1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Indiana Jones&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GenZ Slang&lt;/td&gt;
&lt;td&gt;🇺🇸 English&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.youtube.com/shorts/VAOlR5RAX14" rel="noopener noreferrer"&gt;Watch Part 1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Battlefield 6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dramatic Story&lt;/td&gt;
&lt;td&gt;🇯🇵 Japanese&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.youtube.com/shorts/DYNEr1CzTpY" rel="noopener noreferrer"&gt;Watch Part 1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Indiana Jones&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Story News&lt;/td&gt;
&lt;td&gt;🇨🇳 Chinese&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.youtube.com/shorts/kGRrpu66fpk" rel="noopener noreferrer"&gt;Watch Part 1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fortnite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Story Roast&lt;/td&gt;
&lt;td&gt;🇪🇸 Spanish&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.youtube.com/shorts/5QcelWS1oSo" rel="noopener noreferrer"&gt;Watch Part 1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fortnite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Story Roast&lt;/td&gt;
&lt;td&gt;🇷🇺 Russian&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.youtube.com/shorts/A06FdnycTYo" rel="noopener noreferrer"&gt;Watch Part 1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Indiana Jones&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto Gameplay&lt;/td&gt;
&lt;td&gt;🇧🇷 Portuguese&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.youtube.com/shorts/qDFsTnH9qxc" rel="noopener noreferrer"&gt;Watch Part 1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  📸 Dashboard Interface
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Generate Page&lt;/strong&gt;&lt;br&gt;
The command center for creating new content. Simply drop a video or select an existing one, choose your analysis mode (Local vs. Cloud), and hit "Find Clips."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hlsak6ol3o8kdmuj07k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hlsak6ol3o8kdmuj07k.png" alt="Generate Page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Settings &amp;amp; Cost Control&lt;/strong&gt;&lt;br&gt;
Full control over which AI models are used, with strict management of API costs. You can toggle between OpenAI, Gemini, or efficient Local Heuristics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6iicn8whq7qav401owe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6iicn8whq7qav401owe.png" alt="Settings Page"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I had a problem that every content creator knows: &lt;strong&gt;hours of gameplay footage, but no time to edit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Recording gameplay is the easy part. The hard part is scrubbing through 2-hour VODs looking for that one clutch moment, that hilarious fail, or that "wait, what just happened?" clip. Then you need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find the moment&lt;/li&gt;
&lt;li&gt;Crop to vertical (9:16)&lt;/li&gt;
&lt;li&gt;Add captions that match the vibe&lt;/li&gt;
&lt;li&gt;Maybe add commentary or voiceover&lt;/li&gt;
&lt;li&gt;Export and repeat... dozens of times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I was spending 3-4 hours editing for every hour of footage. That's backwards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I wanted a system where I could:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop a raw gameplay file&lt;/li&gt;
&lt;li&gt;Walk away&lt;/li&gt;
&lt;li&gt;Come back to ready-to-upload clips with captions and voiceovers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AutoShorts is that system.&lt;/p&gt;


&lt;h2&gt;
  
  
  How I Built It (Technical Deep-Dive)
&lt;/h2&gt;

&lt;p&gt;Building AutoShorts was a rollercoaster of "this is genius" moments immediately followed by "why is everything on fire." Here's the real story — the problems nobody warns you about, and the solutions that made it all work.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Architecture Challenge
&lt;/h3&gt;

&lt;p&gt;When the feature set started growing — Vision AI analysis, TTS voice synthesis, story narration, cross-clip narrative arcs — it became clear that a single orchestration file wasn't going to cut it. Every new feature touched everything else, and debugging felt like untangling Christmas lights.&lt;/p&gt;

&lt;p&gt;The fix was &lt;strong&gt;Domain-Driven Design&lt;/strong&gt; — splitting the logic into focused modules, each owning its piece of the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── shorts.py              # Orchestration &amp;amp; rendering
├── ai_providers.py        # Gemini/OpenAI abstraction
├── tts_generator.py       # Qwen3-TTS voice synthesis
├── subtitle_generator.py  # Caption generation &amp;amp; timing
└── story_narrator.py      # Cross-clip narrative generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation seemed like overkill at first. Then I discovered I needed to load and unload AI models from GPU memory between pipeline stages — TTS has to yield VRAM for rendering, which has to yield for AI analysis — and suddenly having clean boundaries between modules was the only thing keeping me sane.&lt;/p&gt;

&lt;h3&gt;
  
  
  The VRAM Juggling Act
&lt;/h3&gt;

&lt;p&gt;Here's the thing about running AI models on consumer GPUs: &lt;strong&gt;they don't share nicely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Qwen3-TTS (voice synthesis) needs ~4GB VRAM. Video rendering with PyTorch needs ~2GB. These models don't politely step aside for each other — they sit in VRAM until you physically evict them.&lt;/p&gt;

&lt;p&gt;The solution was &lt;strong&gt;aggressive model lifecycle management&lt;/strong&gt; — singleton patterns with explicit cleanup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After TTS generation completes
&lt;/span&gt;&lt;span class="n"&gt;QwenTTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear_instance&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TTS model unloaded — VRAM freed for rendering&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, the pipeline would OOM (out-of-memory crash) after processing 2-3 clips. Fun times at 2 AM when you're wondering why clip #3 always segfaults.&lt;/p&gt;
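The singleton-with-explicit-cleanup pattern behind `clear_instance()` can be sketched in a few lines. This is an illustrative stand-in, not the exact AutoShorts class — the real one loads actual model weights — but the lifecycle shape is the same:

```python
import gc


class QwenTTS:
    """Process-wide singleton so repeated TTS calls reuse the loaded
    model instead of paying the load cost (and VRAM) every time."""

    _instance = None

    def __init__(self):
        # Stand-in for the real model load, which allocates GPU memory.
        self.model = object()

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    @classmethod
    def clear_instance(cls):
        # Drop the only strong reference, then force a collection pass
        # so the memory is actually released before the next stage runs.
        cls._instance = None
        gc.collect()
```

In the real pipeline this is followed by `torch.cuda.empty_cache()`, since PyTorch's caching allocator holds freed VRAM until explicitly told to release it.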

&lt;h3&gt;
  
  
  The Qwen3-VL Dead End: When "Local" Goes Too Far
&lt;/h3&gt;

&lt;p&gt;I desperately wanted the entire video analysis to happen locally. I actually got &lt;strong&gt;Qwen3-VL&lt;/strong&gt; (video-language model) integrated and working, but it was a textbook case of &lt;em&gt;"just because you can, doesn't mean you should."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Qwen3-VL is a monster. It’s not just big; it's VRAM-hungry beyond reason. My 12GB RTX 4080 laptop didn't stand a chance, and even on high-end 24GB cards, it would regularly hit the OOM wall during long video sequences.&lt;/p&gt;

&lt;p&gt;I attempted a last-ditch effort using &lt;strong&gt;Qwen3-VL-4B-Instruct-FP8&lt;/strong&gt;, but even with quantization, the stability wasn't there—it still occasionally nuked the pipeline. Worse, the analysis quality didn't justify the struggle; the results were underwhelming compared to the resource cost. It felt like I was trying to race a semi-truck on a go-kart track.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pivot:&lt;/strong&gt; This failure is actually what led to the &lt;strong&gt;Deep Analysis Proxy&lt;/strong&gt; system. I realized that instead of fighting 30GB models locally, I could spend those dev cycles on intelligent preprocessing (the 15MB proxy) and let a cloud model do the heavy lifting for pennies. The result was a pipeline that's actually accessible to people with consumer GPUs, rather than just data center owners.&lt;/p&gt;

&lt;h3&gt;
  
  
  The TTS Timing Nightmare
&lt;/h3&gt;

&lt;p&gt;This was the most infuriating bug I encountered, and it took three separate debugging sessions to crack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Subtitles and voiceover were drifting out of sync in story mode. By the end of a 60-second clip, subtitles were 3-4 seconds ahead of the voice. Not great when you're going for "professional esports broadcast" and getting "badly dubbed foreign film."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The investigation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Story mode generates a continuous narration (like a broadcaster). The TTS engine reads all sentences as one flowing piece. But subtitles were timed by probing each sentence &lt;em&gt;individually&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Subtitle timing (probed separately):
  "The player approaches" → 2.3s
  "An incredible shot"    → 1.8s
  Total: 4.1s

TTS (generated as merged text):
  "The player approaches an incredible shot" → 3.6s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the problem? When you join sentences, the TTS naturally flows faster — no pause between them. That 0.5s error &lt;em&gt;accumulated&lt;/em&gt; across every sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Probe the &lt;em&gt;merged&lt;/em&gt; narration once, then distribute timing proportionally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong: probe each sentence separately
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;probe_tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Accumulated error!
&lt;/span&gt;
&lt;span class="c1"&gt;# ✅ Right: probe merged text, distribute proportionally
&lt;/span&gt;&lt;span class="n"&gt;full_narration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;probe_tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_narration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sentence_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_duration&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_chars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One of those fixes where you stare at the solution and think &lt;em&gt;"why didn't I see this three days ago?"&lt;/em&gt;&lt;/p&gt;
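As a self-contained sketch of the proportional distribution (with the probe result passed in, since the real `probe_tts` measures actual synthesized audio):

```python
def distribute_timings(sentences, total_duration):
    """Split one measured narration duration across sentences in
    proportion to character count, returning (start, end) pairs
    that sum exactly to the merged TTS duration — so no drift
    can accumulate."""
    total_chars = sum(len(s) for s in sentences)
    timings, cursor = [], 0.0
    for s in sentences:
        duration = total_duration * (len(s) / total_chars)
        timings.append((cursor, cursor + duration))
        cursor += duration
    return timings
```

Character count is a crude proxy for speaking time, but because the endpoints are anchored to the single measured duration, any per-sentence error stays local instead of compounding.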

&lt;h3&gt;
  
  
  The "TTS Longer Than Video" Problem
&lt;/h3&gt;

&lt;p&gt;Sometimes the AI writes an essay when you asked for a tweet. A 45-second gameplay clip ends up with 52 seconds of narration. Now what?&lt;/p&gt;

&lt;p&gt;Three options on the table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option A:&lt;/strong&gt; Truncate the voiceover → Loses content, sounds cut off&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option B:&lt;/strong&gt; Speed up the voice → Sounds like a chipmunk reading the news&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option C:&lt;/strong&gt; Extend the video to match → 🤔&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Option C won, but with nuance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tts_duration&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;clip_duration&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Big gap: go back to source video, extract more footage
&lt;/span&gt;    &lt;span class="nf"&gt;rerender_clip_for_tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;render_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tts_duration&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Small gap: freeze last frame using FFmpeg tpad
&lt;/span&gt;    &lt;span class="n"&gt;ffmpeg_filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tpad=stop_mode=clone:stop_duration=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The re-render logic reaches back into the &lt;em&gt;original source video&lt;/em&gt; and extracts more footage — even beyond the original scene boundaries. This required tracking render metadata (start time, source file, scene duration) through the entire pipeline. Worth it though. No more cut-off narration.&lt;/p&gt;
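The metadata being threaded through is just a small record — a sketch with illustrative field names (the actual `render_meta` structure in AutoShorts may differ):

```python
from dataclasses import dataclass


@dataclass
class RenderMeta:
    """Carried alongside each clip so a later stage can re-cut
    footage from the original source if the narration runs long."""
    source_file: str       # original VOD the clip was extracted from
    start_time: float      # clip start offset in the source, in seconds
    scene_duration: float  # how long the detected scene actually runs

    def can_extend_to(self, needed_duration: float) -> bool:
        # A re-render only helps if the source scene has enough
        # footage beyond the clip to cover the longer narration.
        return self.scene_duration >= needed_duration
```

When `can_extend_to` fails, freezing the last frame with FFmpeg's `tpad` filter remains the fallback.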

&lt;h3&gt;
  
  
  FlashAttention: When Your RAM Isn't Enough
&lt;/h3&gt;

&lt;p&gt;Qwen3-TTS performs best with FlashAttention 2 — a CUDA kernel that speeds up attention computation by 3-4x. One problem: building it from source requires compiling CUDA code, which needs &lt;strong&gt;125GB+ RAM&lt;/strong&gt; during compilation. On machines with less than 32GB RAM, the build takes &lt;strong&gt;24 hours or more&lt;/strong&gt; — if the OOM killer doesn't murder it first.&lt;/p&gt;

&lt;p&gt;My machine has 16GB. &lt;code&gt;Killed&lt;/code&gt; — my favorite one-word error message.&lt;/p&gt;

&lt;p&gt;The solution? Prebuilt wheels. Someone lovely had already compiled FlashAttention for various PyTorch + CUDA combinations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;install_flash_attn&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nv"&gt;PYVER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;$$(&lt;/span&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import sys; print(f'cp{sys.version_info.major&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;{sys.version_info.minor}')"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    pip &lt;span class="nb"&gt;install &lt;/span&gt;https://github.com/.../flash_attn-2.6.3+cu128torch2.10-&lt;span class="nv"&gt;$$&lt;/span&gt;PYVER-linux_x86_64.whl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. No compilation. No 125GB RAM requirement. Installation went from "impossible on my hardware" to "done in 30 seconds."&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep Analysis: Letting AI See the Full Picture
&lt;/h3&gt;

&lt;p&gt;Here's an insight that changed everything: &lt;strong&gt;short clips lack context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the default mode, each candidate clip is analyzed independently — the AI sees 2 minutes of footage and scores it. But it doesn't know what happened &lt;em&gt;before&lt;/em&gt; or &lt;em&gt;after&lt;/em&gt;. A celebration makes no sense without the clutch play that preceded it.&lt;/p&gt;

&lt;p&gt;Deep Analysis mode fixes this by letting Gemini see the &lt;strong&gt;entire video&lt;/strong&gt; — but we're not about to upload a multi-GB 4K recording raw. That would take forever and burn through API quotas.&lt;/p&gt;

&lt;p&gt;Instead, we generate a &lt;strong&gt;lightweight proxy&lt;/strong&gt; first using GPU-accelerated FFmpeg:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GPU-accelerated proxy: 4K@60fps → 640p@1fps, high compression
&lt;/span&gt;&lt;span class="n"&gt;gpu_cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ffmpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-hwaccel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-hwaccel_output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-i&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-vf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scale_cuda=640:-2,fps=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# 640px wide, 1 frame per second
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c:v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hevc_nvenc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-qp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;35&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                         &lt;span class="c1"&gt;# Aggressive compression
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c:a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aac&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-b:a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ac&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Mono 32kbps audio
&lt;/span&gt;    &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 2-hour 4K gameplay recording (~30GB) becomes a ~15MB proxy. Same content, same timeline, same audio cues — just tiny enough to upload in seconds. The proxy is also cached by file hash, so re-runs skip the generation step entirely.&lt;/p&gt;
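The cache key can be as simple as a content hash of the source file. A minimal sketch — the exact hashing scheme AutoShorts uses may differ (hashing path, size, and mtime would be cheaper for multi-GB files):

```python
import hashlib
from pathlib import Path


def proxy_cache_path(video_path: Path, cache_dir: Path) -> Path:
    """Derive a stable proxy filename from the source file's bytes,
    so an unchanged VOD reuses its proxy across runs."""
    digest = hashlib.sha256()
    with open(video_path, "rb") as f:
        # Read in 1 MiB chunks to keep memory flat on huge files.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return cache_dir / (digest.hexdigest()[:16] + "_proxy.mp4")
```

Before invoking FFmpeg, the pipeline can simply check whether that path already exists and skip generation if it does.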

&lt;p&gt;The AI can now identify narrative arcs — the setup, the payoff, the aftermath. It finds moments that a clip-by-clip analysis would miss entirely. The quality jump is &lt;em&gt;dramatic&lt;/em&gt;, and all it costs is a ~15MB upload instead of 30GB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice Design: From Text to Personality
&lt;/h3&gt;

&lt;p&gt;The most "wow" feature. Instead of picking from generic preset voices, you describe the voice you want in natural language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VOICE_PRESET_MAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        gender: Male.
        pitch: Dynamic, high-energy with excitement.
        speed: Brisk, fast-paced, maintaining high momentum.
        emotion: Hype, adrenaline, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unbelievable play&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; excitement.
        personality: Charismatic, knowledgeable, maximum energy.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_dramatic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        gender: Female.
        pitch: Rich, resonant mid-range with expressive depth.
        speed: Measured, deliberate pacing with dramatic pauses for impact.
        emotion: Intense, evocative, drawing listeners into the story.
        personality: Wise, commanding, magnetic storyteller presence.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Qwen3-TTS reads this description and synthesizes a matching voice. The same caption sounds completely different between "esports broadcaster" and "creepypasta narrator" — and it all happens locally. No cloud TTS API, no per-word billing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Slang Preprocessing: Making TTS Sound Natural
&lt;/h3&gt;

&lt;p&gt;TTS engines and internet slang do not get along. "rn" becomes "urn." "lol" becomes "loll." "fr fr" sounds like a French car brand.&lt;/p&gt;

&lt;p&gt;The fix is a preprocessing layer that expands slang before TTS sees it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess_tts_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\brn\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;right now&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\blol\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;L O L&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\bidk\b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Qwen3-TTS doesn't pause at dashes, so swap them for ellipses
&lt;/span&gt;    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; -- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;... &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;... &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Small detail, huge impact. GenZ-style captions like "bro that was lowkey insane rn fr fr" actually &lt;em&gt;sound&lt;/em&gt; right when spoken aloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  CJK Subtitle Handling: When Words Don't Have Spaces
&lt;/h3&gt;

&lt;p&gt;English subtitles are easy — split on spaces, chunk into 7-word captions, done. But Japanese, Chinese, and Korean (CJK languages) don't use spaces between words. A sentence is one continuous stream of characters.&lt;/p&gt;

&lt;p&gt;This completely broke the subtitle chunking logic. A 40-character Japanese sentence would appear as one massive wall of text filling the entire screen.&lt;/p&gt;

&lt;p&gt;The fix was &lt;strong&gt;character-based splitting with language detection&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Detect CJK characters in the sentence
&lt;/span&gt;&lt;span class="n"&gt;is_cjk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u4e00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u9fff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt;  &lt;span class="c1"&gt;# Chinese
&lt;/span&gt;              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u3040&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u30ff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="c1"&gt;# Japanese
&lt;/span&gt;              &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;MAX_CJK_CHARS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;  &lt;span class="c1"&gt;# Characters per line for CJK
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_cjk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Character-based splitting instead of word-based
&lt;/span&gt;    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;MAX_CJK_CHARS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
              &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;MAX_CJK_CHARS&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="c1"&gt;# Distribute TTS duration proportionally by character count
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunk_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chunk_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tts_duration&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;chunk_ratio&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sentence splitter also handles CJK punctuation (&lt;code&gt;。！？&lt;/code&gt;), which doesn't follow the English pattern of period-then-whitespace. These characters terminate sentences directly, no space required.&lt;/p&gt;
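&lt;p&gt;A minimal sketch of such a splitter (my own illustration, not AutoShorts' actual code): English sentences split on a terminator followed by whitespace, while CJK terminators split directly with no space required.&lt;/p&gt;

```python
import re

# Split after ". ", "! ", "? " (English: terminator + whitespace),
# or directly after CJK terminators (no trailing space needed).
# Illustrative sketch -- not the project's actual splitter.
_SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+|(?<=[。！？])')

def split_sentences(text):
    parts = _SENT_BOUNDARY.split(text)
    return [p.strip() for p in parts if p.strip()]
```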

&lt;p&gt;One of those "obvious in hindsight" fixes that makes multi-language support actually work instead of just being a checkbox on a feature list.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Experience with GitHub Copilot CLI
&lt;/h2&gt;

&lt;p&gt;Everything above? That's the engineering. But I'd be lying if I said I did it alone. GitHub Copilot CLI was my pair programmer through most of this — and here's how it actually helped.&lt;/p&gt;

&lt;p&gt;Copilot CLI wasn't just autocomplete — it was a debugging partner, architecture consultant, and documentation writer rolled into one.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Worked Exceptionally Well
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Plan Mode for Complex Changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;[[PLAN]]&lt;/code&gt; prefix before major refactors gave me a structured approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[[PLAN]] Migrate from ChatterBox TTS to Qwen3-TTS VoiceDesign
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copilot generated a 6-phase plan covering dependency changes, API migration, FlashAttention setup, testing checkpoints, and rollback strategies. I could review and edit the plan before implementation started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Debugging Across Sessions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The checkpoint system was crucial. When investigating the subtitle timing bug, I could reference earlier sessions:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Check checkpoint 012-tts-subtitle-sync for what we tried before"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Copilot would review the history and avoid repeating failed approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Parallel Exploration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I wasn't sure which approach to take, I'd ask Copilot to spin up explore agents to investigate multiple paths simultaneously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task agent_type: explore
prompt: "How does generate_for_captions() handle timing in story mode vs normal mode?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This let me understand the codebase faster than reading linearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Test Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After making changes, Copilot helped write comprehensive tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_preprocess_tts_text_em_dash&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;preprocess_tts_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wait — what&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;50 tests covering subtitle formatting, TTS preprocessing, voice description generation, and scene combination logic — all generated from understanding the code context.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Learned
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Be specific about constraints.&lt;/strong&gt; "Fix the OOM error" is less useful than "We have 10GB VRAM, model A needs 8GB, model B needs 4GB, how do we sequence them?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use checkpoints liberally.&lt;/strong&gt; Complex debugging spans sessions. Good checkpoints save hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let Copilot see the errors.&lt;/strong&gt; Pasting full stack traces and logs gives it the context to diagnose accurately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust but verify.&lt;/strong&gt; Copilot's suggestions are usually good, but always run the tests.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Pipeline Today
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you drop a gameplay video into AutoShorts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scene Detection&lt;/strong&gt; — GPU-accelerated analysis finds candidate moments using audio spikes + motion detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Ranking&lt;/strong&gt; — Vision AI (Gemini/OpenAI) watches each clip and scores it across 7 semantic categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Analysis&lt;/strong&gt; &lt;em&gt;(optional)&lt;/em&gt; — GPU-downscaled proxy uploaded to Gemini for context-aware moment detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Selection&lt;/strong&gt; — Diverse category selection ensures variety (not just all "action" clips)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU Rendering&lt;/strong&gt; — NVENC hardware encoding creates vertical crops with blurred backgrounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caption Generation&lt;/strong&gt; — AI writes contextual captions matching the clip's energy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice Synthesis&lt;/strong&gt; — Qwen3-TTS creates matching voiceovers with style-appropriate personalities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timing Sync&lt;/strong&gt; — Subtitle timing synchronized with actual TTS audio duration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Mixing&lt;/strong&gt; — Game audio ducked during voiceover, video extended if TTS runs long&lt;/li&gt;
&lt;/ol&gt;
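&lt;p&gt;As a toy illustration of the "Smart Mixing" step (ducking game audio while the voiceover is speaking), here is a per-sample gain sketch. The function and the 0.3 gain are my own illustration; a real pipeline would use an ffmpeg filter such as &lt;code&gt;sidechaincompress&lt;/code&gt;.&lt;/p&gt;

```python
def duck(game_samples, vo_active, duck_gain=0.3):
    """Attenuate game audio wherever the voiceover is active.

    game_samples: per-sample amplitudes; vo_active: parallel list of
    booleans marking voiceover regions. Illustrative only.
    """
    return [s * (duck_gain if active else 1.0)
            for s, active in zip(game_samples, vo_active)]
```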

&lt;p&gt;Total processing time: ~5-7 minutes per clip on an RTX 3080.&lt;/p&gt;




&lt;h2&gt;
  
  
  Analysis Modes &amp;amp; Cost
&lt;/h2&gt;

&lt;p&gt;AutoShorts supports four analysis modes, each with different tradeoffs between &lt;strong&gt;cost&lt;/strong&gt;, &lt;strong&gt;accuracy&lt;/strong&gt;, and &lt;strong&gt;speed&lt;/strong&gt;. You choose the mode via environment variables — no code changes needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Each Mode Works
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;🔧 Local Heuristics Only&lt;/strong&gt; (&lt;code&gt;AI_PROVIDER=local&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;Zero API calls. Scenes are scored purely on GPU-computed signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audio RMS&lt;/strong&gt; — Loudness spikes (explosions, crowd reactions, voice peaks).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spectral Flux&lt;/strong&gt; — Sudden frequency changes (gunshots, impacts, glass breaking).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Motion&lt;/strong&gt; — Pixel-diff action scoring via GPU-accelerated grayscale diffing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three signals are computed in a single pass using PyTorch on GPU. Scenes are ranked by a combined &lt;strong&gt;&lt;code&gt;0.6 × Audio (RMS + Flux) + 0.4 × Visual Motion&lt;/code&gt;&lt;/strong&gt; score. Fast, free, and surprisingly effective for high-action content — but blind to &lt;em&gt;context&lt;/em&gt; (it can't tell a celebration from a firefight).&lt;/p&gt;
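&lt;p&gt;The combined score is a one-liner. Note the audio term below is my reading of "RMS + Flux" (a simple sum); the real weighting inside the audio term may differ.&lt;/p&gt;

```python
def heuristic_score(audio_rms, spectral_flux, visual_motion):
    # 0.6 * audio + 0.4 * motion, per the formula above.
    # Summing RMS and flux inside the audio term is an assumption.
    audio = audio_rms + spectral_flux
    return 0.6 * audio + 0.4 * visual_motion
```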

&lt;p&gt;&lt;strong&gt;🖼️ OpenAI Vision&lt;/strong&gt; (&lt;code&gt;AI_PROVIDER=openai&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;Heuristics first narrow the field using &lt;strong&gt;Smart Selection&lt;/strong&gt; (70% top scores + 30% random exploration), then candidates are sent to OpenAI. OpenAI's API doesn't accept video, so we extract &lt;strong&gt;8 keyframe JPEGs&lt;/strong&gt; per clip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Extract 8 static frames as base64 JPEGs
&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ffmpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-i&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clip_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-vf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fps=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-frames:v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;span class="c1"&gt;# Send as image_url content to GPT-4o
&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI scores each clip across 7 semantic categories (action, funny, clutch, wtf, epic_fail, hype, skill). Good accuracy from static frames alone, but it &lt;em&gt;can't hear audio&lt;/em&gt; and misses motion-dependent moments like glitches or physics bugs.&lt;/p&gt;
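&lt;p&gt;The Smart Selection split mentioned above (roughly 70% top-scoring clips, 30% random exploration) can be sketched like this; the function and field names are illustrative, not the project's API.&lt;/p&gt;

```python
import random

def smart_select(scored, k, explore_frac=0.3):
    """Pick k candidates: mostly top heuristic scores, plus a few
    random picks so the AI sees moments heuristics undervalue."""
    ranked = sorted(scored, key=lambda s: s["score"], reverse=True)
    n_top = max(1, round(k * (1 - explore_frac)))
    top, rest = ranked[:n_top], ranked[n_top:]
    explore = random.sample(rest, min(k - n_top, len(rest)))
    return top + explore
```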

&lt;p&gt;&lt;strong&gt;🎬 Gemini Per-Clip&lt;/strong&gt; (&lt;code&gt;AI_PROVIDER=gemini&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;Uses the same &lt;strong&gt;Smart Selection&lt;/strong&gt; (mixing high-heuristic clips with &lt;strong&gt;random segments&lt;/strong&gt; for diversity), but uploads each candidate as &lt;strong&gt;actual video&lt;/strong&gt; (downscaled to 640px wide). Gemini sees motion, timing, and audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Each candidate clip: 640p downscaled, ~30-60s, uploaded as MP4
&lt;/span&gt;&lt;span class="n"&gt;video_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clip_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mime_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video/mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;video_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Significantly better at detecting &lt;em&gt;funny&lt;/em&gt;, &lt;em&gt;wtf&lt;/em&gt;, and &lt;em&gt;clutch&lt;/em&gt; moments that depend on temporal context. Clips are analyzed in parallel (3 concurrent threads) to keep latency manageable.&lt;/p&gt;
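&lt;p&gt;The 3-way parallel analysis maps naturally onto a thread pool (a sketch; &lt;code&gt;analyze_clip&lt;/code&gt; is a stand-in for the real upload-and-score call):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_all(clips, analyze_clip, workers=3):
    # Upload/score calls are I/O-bound, so threads (not processes)
    # are enough to overlap the network waits; map preserves order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_clip, clips))
```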

&lt;p&gt;&lt;strong&gt;🧠 Gemini Deep Analysis&lt;/strong&gt; (&lt;code&gt;GEMINI_DEEP_ANALYSIS=true&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;The nuclear option. Instead of pre-filtering with heuristics then analyzing clips, Deep Analysis lets Gemini see the &lt;strong&gt;entire video&lt;/strong&gt; — but not the raw multi-GB 4K file. A GPU-accelerated proxy is generated first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4K @ 60fps → 640p @ 1fps, QP 35, mono 32kbps audio
~30GB gameplay recording → ~15MB proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
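&lt;p&gt;An illustrative ffmpeg invocation matching those proxy specs (the exact flags AutoShorts uses may differ; the NVENC codec choice here is an assumption):&lt;/p&gt;

```python
def build_proxy_cmd(src, dst):
    # 1 fps, 640px wide, constant QP 35, mono 32 kbps audio --
    # mirroring the proxy specs quoted above.
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", "fps=1,scale=640:-2",
        "-c:v", "h264_nvenc", "-rc", "constqp", "-qp", "35",
        "-ac", "1", "-b:a", "32k",
        dst,
    ]
```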



&lt;p&gt;Gemini watches the whole thing and returns timestamped moments with categories and scores. No heuristic bias, no missed context. The AI finds narrative arcs — the buildup before a clutch play, the reaction after an epic fail — that clip-by-clip analysis simply can't detect.&lt;/p&gt;

&lt;p&gt;Deep Analysis moments are scored with a &lt;code&gt;+200&lt;/code&gt; bias to ensure they rank above any heuristic candidate. A few high-action heuristic backups are still included as safety net clips.&lt;/p&gt;
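&lt;p&gt;In scoring terms, that bias is just an additive offset (an illustrative sketch, not the project's exact code):&lt;/p&gt;

```python
def final_score(score, from_deep_analysis):
    # Deep Analysis moments get +200 so they always outrank
    # heuristic-only candidates.
    return score + (200 if from_deep_analysis else 0)
```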

&lt;h3&gt;
  
  
  Comparison Summary (1-hour 4K gameplay)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Analysis Cost&lt;/th&gt;
&lt;th&gt;Creative Cost*&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;th&gt;Data Uploaded&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local Heuristics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Free&lt;/strong&gt; (Whisper)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Vision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;td&gt;~$0.15&lt;/td&gt;
&lt;td&gt;~$0.15&lt;/td&gt;
&lt;td&gt;~$0.30&lt;/td&gt;
&lt;td&gt;~6MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini Per-Clip&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;~$0.08&lt;/td&gt;
&lt;td&gt;~$0.08&lt;/td&gt;
&lt;td&gt;~$0.16&lt;/td&gt;
&lt;td&gt;~90MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini Deep Analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;~$0.05&lt;/td&gt;
&lt;td&gt;~$0.08&lt;/td&gt;
&lt;td&gt;~$0.13&lt;/td&gt;
&lt;td&gt;~60MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;*Creative Cost:&lt;/strong&gt; AI caption generation (LLM API call) plus voiceover synthesis, which runs locally (free).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The counterintuitive result:&lt;/strong&gt; Deep Analysis is the most cost-effective mode because it replaces 15 individual analysis uploads with one optimized proxy upload, while still delivering superior context-aware detection.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Roadmap &amp;amp; Vision
&lt;/h2&gt;

&lt;p&gt;AutoShorts works today as a local pipeline for content creators. But the underlying engine — scene detection, AI ranking, voice synthesis, smart cropping — is a &lt;strong&gt;general-purpose highlight extraction backend&lt;/strong&gt;. Here's where this is heading:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔮 What's Next
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v2.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Universal Video Type Support (Podcasts, Sports, Entertainment, etc.)&lt;/td&gt;
&lt;td&gt;🔜 Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v2.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SFX generation — AI-generated sound effects matched to on-screen action&lt;/td&gt;
&lt;td&gt;🔜 Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v2.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud API mode (submit video URL → get clips back)&lt;/td&gt;
&lt;td&gt;📐 Designing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v3.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Live stream monitoring (detect highlights in real-time)&lt;/td&gt;
&lt;td&gt;🔬 Research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v3.x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-platform auto-upload (TikTok, YouTube Shorts, Reels)&lt;/td&gt;
&lt;td&gt;📋 Backlog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  🎮 Platform Integration Potential
&lt;/h3&gt;

&lt;p&gt;The most exciting future isn't AutoShorts as a standalone tool — it's AutoShorts as a &lt;strong&gt;backend engine&lt;/strong&gt; embedded in platforms millions of gamers already use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Xbox Game Bar&lt;/strong&gt; — The overlay already captures screenshots and gameplay recordings (&lt;code&gt;Win+G&lt;/code&gt;). Imagine a "Generate Highlights" button that takes your captured footage and produces ready-to-share clips with captions and voiceover — &lt;em&gt;without ever leaving the overlay.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA ShadowPlay&lt;/strong&gt; — ShadowPlay's Instant Replay already silently records the last 30 seconds to 20 minutes of gameplay. Pair that buffer with AutoShorts' AI ranking, and ShadowPlay could &lt;em&gt;automatically identify and export your best moments&lt;/em&gt; with professional-grade overlays and narration. No scrubbing through footage. No editing. Just play.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discord Activity Integration&lt;/strong&gt; — Post-session highlight reels generated from screen shares, dropped directly into your server channel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core thesis: &lt;strong&gt;highlight detection + voice synthesis + smart cropping&lt;/strong&gt; is infrastructure, not an app. Every platform that captures gameplay footage could use this engine to turn passive recording into active content creation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The best highlight reel is the one you never had to make.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Acknowledgements
&lt;/h2&gt;

&lt;p&gt;This project builds upon:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/artryazanov/shorts-maker-gpu" rel="noopener noreferrer"&gt;artryazanov/shorts-maker-gpu&lt;/a&gt;&lt;/strong&gt; — GPU-accelerated clip extraction using heuristic scoring (audio dB + motion detection).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Binary-Bytes/Auto-YouTube-Shorts-Maker" rel="noopener noreferrer"&gt;Binary-Bytes/Auto-YouTube-Shorts-Maker&lt;/a&gt;&lt;/strong&gt; — Original concept and inspiration for the automated short-form content pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/QwenLM/Qwen3-TTS" rel="noopener noreferrer"&gt;Qwen3-TTS&lt;/a&gt;&lt;/strong&gt; — Voice synthesis with natural language design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/francozanardi/pycaps" rel="noopener noreferrer"&gt;PyCaps&lt;/a&gt;&lt;/strong&gt; — Animated subtitle rendering&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Improvements Over Base Project
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Base Project&lt;/th&gt;
&lt;th&gt;AutoShorts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monolithic script&lt;/td&gt;
&lt;td&gt;Modular package with lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scene Scoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Audio dB + motion only&lt;/td&gt;
&lt;td&gt;Hybrid: heuristics + Vision AI semantic analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deep Analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Full-video Gemini analysis for context-aware detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voiceover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Qwen3-TTS with style-adaptive voice design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Captions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;AI-generated, 10+ styles including story modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CJK Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Character-based subtitle chunking for CJK languages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single model&lt;/td&gt;
&lt;td&gt;VRAM-aware model sequencing (unload between phases)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTS Sync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Per-sentence TTS generation for accurate timing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overflow Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Re-render clips when TTS &amp;gt; video length&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/divyaprakash0426/autoshorts.git
&lt;span class="nb"&gt;cd &lt;/span&gt;autoshorts

&lt;span class="c"&gt;# Setup environment variables&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env and add your API keys (Gemini/OpenAI) &lt;/span&gt;

&lt;span class="c"&gt;# Option 1: Using Makefile (Recommended)&lt;/span&gt;

make &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Option 2: Using Shell Script&lt;/span&gt;
./install.sh

&lt;span class="c"&gt;# Drop videos in gameplay/, then run:&lt;/span&gt;
./.venv/bin/python run.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or launch the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./.venv/bin/streamlit run src/dashboard/About.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🛡️ Battle Tested On
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Asus Zephyrus G16&lt;/strong&gt; (RTX 4080 Mobile, Intel Ultra 9) running Arch Linux.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with frustration, caffeine, and GitHub Copilot CLI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
      <category>cli</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>Building AutoShorts: A High-Performance AI Pipeline for Automated Viral Content 🎬🤖</title>
      <dc:creator>divyaprakash D</dc:creator>
      <pubDate>Sat, 24 Jan 2026 14:59:45 +0000</pubDate>
      <link>https://forem.com/divyaprakash_d_2d5d085bd4/building-autoshorts-a-high-performance-ai-pipeline-for-automated-viral-content-g5i</link>
      <guid>https://forem.com/divyaprakash_d_2d5d085bd4/building-autoshorts-a-high-performance-ai-pipeline-for-automated-viral-content-g5i</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: Content Creation is a Bottleneck
&lt;/h2&gt;

&lt;p&gt;Every creator knows the "highlight reel" struggle. You have hours of high-quality gameplay footage, but finding that perfect 30-second clip, cropping it, adding subtitles, and layering a voiceover takes hours of manual labor.&lt;br&gt;
I wanted to see if I could build a &lt;strong&gt;fully automated, high-performance pipeline&lt;/strong&gt; to handle this from start to finish. Today, I'm open-sourcing &lt;strong&gt;AutoShorts&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgsxcr48ugtjwf0ibmso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgsxcr48ugtjwf0ibmso.png" alt="AutoShorts Architecture Architecture" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AutoShorts?
&lt;/h2&gt;

&lt;p&gt;AutoShorts is a GPU-optimized CLI tool that analyzes long-form video, identifies high-engagement scenes using AI, and synthesizes them into ready-to-upload vertical shorts. &lt;br&gt;
It doesn't just "cut" video; it understands it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Deep Dive 🛠️
&lt;/h2&gt;

&lt;p&gt;To keep processing times low and avoid massive cloud API bills, I focused heavily on local processing and hardware acceleration:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. GPU Scene Analysis ⚡
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;decord&lt;/code&gt; and &lt;code&gt;PyTorch&lt;/code&gt;, the pipeline performs frame extraction and visual feature analysis directly on the GPU. We calculate visual action density and audio spectral flux to find "fast" or "loud" moments before the text-based AI even sees the clip.&lt;/p&gt;
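&lt;p&gt;The metric itself is simple. Here is a minimal, CPU-only sketch of the action-density idea (the function name is illustrative; the real pipeline decodes with &lt;code&gt;decord&lt;/code&gt; and runs this math on GPU tensors):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def action_density(frames):
    # frames: list of 2-D grids of grayscale pixel values in 0..255.
    # AutoShorts does this on the GPU via decord + PyTorch tensors;
    # plain Python here just to show the metric itself.
    scores = []
    for prev, cur in zip(frames, frames[1:]):
        total = sum(abs(a - b)
                    for row_p, row_c in zip(prev, cur)
                    for a, b in zip(row_p, row_c))
        pixels = len(prev) * len(prev[0])
        scores.append(total / (255.0 * pixels))
    return scores

# Three static black frames, then a hard cut to white:
black = [[0, 0], [0, 0]]
white = [[255, 255], [255, 255]]
print(action_density([black, black, black, white]))  # [0.0, 0.0, 1.0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Spikes in this per-transition score mark candidate "action" moments worth sending to the LLM stage.&lt;/p&gt;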

&lt;h3&gt;
  
  
  2. Dual-AI Intelligence 🧠
&lt;/h3&gt;

&lt;p&gt;The pipeline integrates with &lt;strong&gt;OpenAI (GPT-4o)&lt;/strong&gt; and &lt;strong&gt;Google Gemini&lt;/strong&gt;. We pass the metadata and scene descriptions to the LLM to score segments based on:&lt;br&gt;
&lt;strong&gt;Hook Potential&lt;/strong&gt;: Is the start grabby?&lt;br&gt;
&lt;strong&gt;Relevance&lt;/strong&gt;: Does the action make sense?&lt;br&gt;
&lt;strong&gt;Emotional Impact&lt;/strong&gt;: Is it funny, impressive, or a "fail"?&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Smart Subtitles &amp;amp; Neural TTS 🗣️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Local TTS&lt;/strong&gt;: Instead of paid APIs, we use &lt;strong&gt;ChatterBox&lt;/strong&gt; locally. It supports emotional prosody, so the voiceover doesn't sound like a monotone robot.&lt;br&gt;
&lt;strong&gt;PyCaps Renderer&lt;/strong&gt;: We use a custom Playwright-based renderer to create those "MrBeast style" word-by-word animated captions that are essential for mobile retention.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. NVENC Rendering 🎞️
&lt;/h3&gt;

&lt;p&gt;Final assembly—including audio mixing, blurring backgrounds (for the vertical look), and burning in subtitles—is offloaded to &lt;strong&gt;NVIDIA’s NVENC hardware&lt;/strong&gt;. This keeps the CPU free for other tasks and slashes render times.&lt;/p&gt;
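&lt;p&gt;For the curious, here is one plausible &lt;code&gt;ffmpeg&lt;/code&gt; invocation for that blurred-background vertical look with NVENC encoding (the filter values and bitrate are illustrative, not AutoShorts' exact graph):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def nvenc_vertical_cmd(src, dst):
    # Blurred, stretched copy behind a centered foreground, 9:16 canvas,
    # encoded on the GPU with h264_nvenc.
    vf = (
        "split[bg][fg];"
        "[bg]scale=1080:1920,boxblur=20[b];"
        "[fg]scale=1080:-2[f];"
        "[b][f]overlay=(W-w)/2:(H-h)/2"
    )
    return ["ffmpeg", "-y", "-i", src,
            "-filter_complex", vf,
            "-c:v", "h264_nvenc", "-preset", "p5", "-b:v", "8M",
            "-c:a", "aac", dst]

print(" ".join(nvenc_vertical_cmd("clip.mp4", "short.mp4")))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;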

&lt;h2&gt;
  
  
  🚧 What’s Next? (The Roadmap)
&lt;/h2&gt;

&lt;p&gt;This is a v1.0 release, and while the pipeline is robust, the potential for enhancement is huge. I’m looking for contributors to help with:&lt;br&gt;
&lt;strong&gt;Upgrading the Voice Engine&lt;/strong&gt;: Integrating more recent open-source models like &lt;strong&gt;ChatterBoxTurbo&lt;/strong&gt;, &lt;strong&gt;Qwen-TTS&lt;/strong&gt;, or &lt;strong&gt;NVIDIA’s latest TTS&lt;/strong&gt; for even more realistic voice cloning and prosody.&lt;br&gt;
&lt;strong&gt;Intelligent Auto-Zoom&lt;/strong&gt;: Currently, the 9:16 crop is centered. Adding object detection (YOLO/RT-DETR) would let the crop window &lt;strong&gt;follow the action&lt;/strong&gt;, dynamically tracking a character or a vehicle.&lt;br&gt;
&lt;strong&gt;Advanced Transition Styles&lt;/strong&gt;: Adding AI-generated transitions between merged scenes.&lt;/p&gt;
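&lt;p&gt;For the auto-zoom idea, the tracking half could be as simple as exponentially smoothing the detected subject's x-center so the crop window follows the action without jittering. A hypothetical sketch (this helper does not exist in the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def smooth_crop_centers(detections, width, crop_w, alpha=0.2):
    # detections: per-frame x-centers of the detected subject.
    # Returns smoothed crop-window centers, clamped so the 9:16
    # window never leaves the source frame.
    half = crop_w / 2
    centers, c = [], detections[0]
    for x in detections:
        c = (1 - alpha) * c + alpha * x
        centers.append(min(max(c, half), width - half))
    return centers

# Subject jumps from center toward the right of a 1920-wide frame:
print(smooth_crop_centers([960, 1200, 1200], 1920, 608))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;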

&lt;h2&gt;
  
  
  Build With Me 🚀
&lt;/h2&gt;

&lt;p&gt;The project is fully dockerized and open for contributions. Whether you're interested in machine learning, computer vision, or just want to automate your own YouTube channel, I'd love to see you in the PRs.&lt;br&gt;
&lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/divyaprakash0426/autoshorts" rel="noopener noreferrer"&gt;github.com/divyaprakash0426/autoshorts&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A huge thanks to the original concepts from &lt;a href="https://github.com/artryazanov/shorts-maker-gpu" rel="noopener noreferrer"&gt;artryazanov&lt;/a&gt; and &lt;a href="https://github.com/Binary-Bytes/Auto-YouTube-Shorts-Maker" rel="noopener noreferrer"&gt;Binary-Bytes&lt;/a&gt; which provided the foundation for this refactored release.&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;What features would you add to an AI video pipeline like this? Let's discuss in the comments! 👇&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
