<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pratay Karali</title>
    <description>The latest articles on Forem by Pratay Karali (@pratay_karali_5716376b9f2).</description>
    <link>https://forem.com/pratay_karali_5716376b9f2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3942793%2F3d7c6310-19e1-4fb0-8342-fe1255fe47bf.jpg</url>
      <title>Forem: Pratay Karali</title>
      <link>https://forem.com/pratay_karali_5716376b9f2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pratay_karali_5716376b9f2"/>
    <language>en</language>
    <item>
      <title>Why Your Voice Agent Won't Stop Talking: Building the Zero-Latency Interruption Layer with Gemma 4 E2B</title>
      <dc:creator>Pratay Karali</dc:creator>
      <pubDate>Thu, 21 May 2026 13:48:02 +0000</pubDate>
      <link>https://forem.com/pratay_karali_5716376b9f2/why-your-voice-agent-wont-stop-talking-building-the-zero-latency-interruption-layer-with-gemma-4-4966</link>
      <guid>https://forem.com/pratay_karali_5716376b9f2/why-your-voice-agent-wont-stop-talking-building-the-zero-latency-interruption-layer-with-gemma-4-4966</guid>
      <description>&lt;p&gt;&lt;em&gt;The most human problem in AI — and the architecture that finally solves it.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You know the moment.&lt;/p&gt;

&lt;p&gt;You're talking to a voice assistant. It starts giving you an answer you didn't need. You try to cut it off. You say "wait—" or "no, actually—" and it just &lt;em&gt;keeps going.&lt;/em&gt; Plowing straight through your sentence. Talking over you like an oblivious colleague who hasn't noticed everyone else went quiet.&lt;/p&gt;

&lt;p&gt;You eventually fall silent. You wait for it to finish. You try again.&lt;/p&gt;

&lt;p&gt;This isn't a minor annoyance. It's the conversational uncanny valley — and it's the reason most voice AI products feel broken even when the underlying model is genuinely smart. The model might generate perfect answers. But the architecture doesn't know when to &lt;em&gt;stop.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 E2B changes the foundational conditions of this problem. And in this guide, we're going to build the interruption layer that finally eliminates it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Enemy: Sequential Latency
&lt;/h2&gt;

&lt;p&gt;Before we fix anything, we need to understand exactly what creates the problem.&lt;/p&gt;

&lt;p&gt;Traditional voice pipelines look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Microphone → [STT Model] → text → [LLM] → text → [TTS Model] → Speaker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each arrow is a waiting period. The Speech-to-Text model needs to buffer enough audio to transcribe. The LLM needs the full transcription before it can begin inference. The TTS needs the full LLM response before synthesis begins. You're not having a conversation — you're exchanging documents with an extremely fast filing system.&lt;/p&gt;

&lt;p&gt;Compounding this: the entire pipeline is typically synchronous. While the TTS is speaking, nothing is listening. The microphone input buffer fills up. Your interruption gets queued somewhere behind three seconds of audio the system has already committed to playing.&lt;/p&gt;

&lt;p&gt;By the time the agent "hears" you said "wait" — it's been saying something else for two full seconds. The uncanny valley isn't a UI problem. It's an architecture problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Weapon: Gemma 4 E2B's Native Audio Encoder
&lt;/h2&gt;

&lt;p&gt;Released by Google DeepMind on April 2, 2026, under Apache 2.0, Gemma 4 E2B is a model with 2.3 billion effective parameters and something rare in its weight class: a native 300-million-parameter audio encoder baked directly into the architecture.&lt;/p&gt;

&lt;p&gt;This single design decision eliminates the first bottleneck entirely.&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw audio → STT model → text string → LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemma 4 E2B does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw 16kHz audio waveform → mel-spectrogram → embedding space → LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No intermediate transcription. No text flattening that destroys vocal intonation, prosody, and emotional inflection. The model hears &lt;em&gt;how&lt;/em&gt; you said something, not just &lt;em&gt;what&lt;/em&gt; you said.&lt;/p&gt;

&lt;p&gt;The feature extractor processes audio using a 20ms frame length (320 samples at 16kHz) with a 10ms hop length, converting the waveform directly into mel-frequency spectrograms that project straight into the model's embedding space. The model accepts up to 30 seconds of audio per interaction — easily enough for complex, natural queries.&lt;/p&gt;
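
&lt;p&gt;To make the framing arithmetic concrete, here is a minimal sketch that converts the quoted window sizes into sample counts and counts how many full frames fit in a 30-second clip. The helper names are hypothetical, and the count assumes no padding at the clip edges, which may differ from what the real feature extractor does.&lt;/p&gt;

```python
SAMPLE_RATE = 16_000  # Hz, as stated in the text
FRAME_MS = 20         # analysis window length
HOP_MS = 10           # stride between consecutive windows


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a sample count."""
    return sample_rate * ms // 1000


def num_frames(num_samples, frame_len, hop_len):
    """Count full frames that fit without padding (hypothetical helper)."""
    if frame_len > num_samples:
        return 0
    return 1 + (num_samples - frame_len) // hop_len


frame_len = ms_to_samples(FRAME_MS)   # 320 samples
hop_len = ms_to_samples(HOP_MS)       # 160 samples
clip = ms_to_samples(30_000)          # samples in a 30-second clip
print(frame_len, hop_len, num_frames(clip, frame_len, hop_len))
```

&lt;p&gt;Because the hop is half the window, consecutive frames overlap by 50%, so a 30-second clip yields roughly one frame per 10ms of audio.&lt;/p&gt;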

&lt;p&gt;Here's what makes the E2B variant special in the Gemma 4 family:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Effective Params&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Target Hardware&lt;/th&gt;
&lt;th&gt;Audio?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.3B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;128K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Phones, RPi 5, Laptops&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅ Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td&gt;4.5B&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;High-end phones, edge servers&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26B A4B&lt;/td&gt;
&lt;td&gt;~4B active (MoE)&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;RTX 4090/5090&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;30.7B&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;A100/H100&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The native audio encoder exists &lt;strong&gt;only&lt;/strong&gt; in the E2B and E4B variants. If you're building a real-time voice agent on constrained hardware, E2B is the natural pick. Under 4-bit quantization (Q4_K_M), it runs in 2–3 GB of RAM, fitting comfortably on a Raspberry Pi 5 or any modern laptop.&lt;/p&gt;

&lt;p&gt;The architecture achieves this through three innovations worth understanding:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-Layer Embeddings (PLE):&lt;/strong&gt; Instead of one embedding matrix at the input, every decoder block gets its own specialized embedding slice. These function as memory lookups rather than matrix multiplications — so the model accesses 5.1 billion parameters worth of knowledge while only &lt;em&gt;activating&lt;/em&gt; 2.3 billion per token. Fast inference, deep intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Attention (4:1 Local/Global):&lt;/strong&gt; Rather than full global attention across 128K tokens (which scales quadratically — catastrophic on edge hardware), E2B applies local sliding-window attention (512 tokens) for four consecutive layers, then one full global attention layer. 35 layers total, always ending global. The RULER benchmark puts long-context recall at 66.4% at 128K depth — versus 13.5% in prior generations.&lt;/p&gt;
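
&lt;p&gt;The 4:1 interleaving is easy to picture as a per-layer schedule. The sketch below only reproduces the pattern described in the text (35 layers, four local then one global); the function name is hypothetical and nothing here is taken from the model's actual configuration files.&lt;/p&gt;

```python
NUM_LAYERS = 35       # from the text
LOCAL_PER_GLOBAL = 4  # four sliding-window layers, then one global layer


def layer_schedule(num_layers=NUM_LAYERS, local_per_global=LOCAL_PER_GLOBAL):
    """Return 'local' or 'global' for each decoder layer, in order."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


schedule = layer_schedule()
print(schedule.count("local"), schedule.count("global"), schedule[-1])
```

&lt;p&gt;With 35 layers the pattern divides evenly into seven groups of five, which is why the stack ends on a global layer.&lt;/p&gt;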

&lt;p&gt;&lt;strong&gt;p-RoPE:&lt;/strong&gt; Proportional Rotary Position Embeddings dedicate a subset of dimensions strictly to positional data, leaving 75% as clean content channels. This helps the model keep track of earlier turns instead of losing them as a voice conversation grows long.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gatekeeper: Silero VAD
&lt;/h2&gt;

&lt;p&gt;Eliminating STT latency is half the battle. The other half is knowing &lt;em&gt;when&lt;/em&gt; to listen.&lt;/p&gt;

&lt;p&gt;You cannot feed a raw, open microphone stream into an LLM. It will constantly trigger on ambient noise, keyboard clicks, HVAC hum, your dog. Every false trigger is a wasted inference cycle. On consumer hardware, that means freezing the application.&lt;/p&gt;

&lt;p&gt;Enter Silero VAD.&lt;/p&gt;

&lt;p&gt;Roughly one megabyte. Trained on 100+ languages. A forward pass executes in under one millisecond on a single CPU thread. It returns a speech probability between 0.0 and 1.0 for every 32ms audio chunk (512 samples at 16kHz).&lt;/p&gt;

&lt;p&gt;The critical detail is &lt;strong&gt;hysteresis&lt;/strong&gt; — raw probability alone causes rapid toggling. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speech start: probability &amp;gt; 0.5 for at least &lt;strong&gt;250ms&lt;/strong&gt; of consecutive audio&lt;/li&gt;
&lt;li&gt;Speech end: silence for at least &lt;strong&gt;500ms&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These thresholds prevent a door slam from triggering a full inference cycle, and prevent a natural mid-sentence pause from prematurely ending the user's utterance.&lt;/p&gt;
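
&lt;p&gt;The two rules can be sketched as a small state machine that only flips state after a long enough run of contradicting chunks. This is a minimal, library-independent illustration: &lt;code&gt;SpeechGate&lt;/code&gt; and its constants are hypothetical names, and the 32ms chunk duration assumes 512-sample chunks at 16kHz. It is not Silero's own implementation.&lt;/p&gt;

```python
CHUNK_MS = 32          # one 512-sample chunk at 16kHz (assumption)
THRESHOLD = 0.5
MIN_SPEECH_MS = 250    # sustained speech needed to confirm onset
MIN_SILENCE_MS = 500   # sustained silence needed to confirm the end


class SpeechGate:
    """Debounces per-chunk speech probabilities (hypothetical helper)."""

    def __init__(self):
        self.speaking = False
        self.run_ms = 0  # length of the current run that contradicts the state

    def update(self, prob):
        """Feed one chunk probability; return 'start', 'end', or None."""
        is_speech = prob > THRESHOLD
        if is_speech == self.speaking:
            self.run_ms = 0          # chunk agrees with the current state
            return None
        self.run_ms += CHUNK_MS
        needed = MIN_SILENCE_MS if self.speaking else MIN_SPEECH_MS
        if self.run_ms >= needed:
            self.speaking = not self.speaking
            self.run_ms = 0
            return "start" if self.speaking else "end"
        return None


gate = SpeechGate()
probs = [0.9] * 3 + [0.1] * 2 + [0.9] * 8   # brief noise, then real speech
events = [gate.update(p) for p in probs]
print(events.index("start"))  # index of the chunk that confirms speech onset
```

&lt;p&gt;The three-chunk burst at the start never reaches 250ms, so it is ignored. Only the sustained run afterwards produces a &lt;code&gt;start&lt;/code&gt; event.&lt;/p&gt;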




&lt;h2&gt;
  
  
  The Architecture: Four Threads, One Goal
&lt;/h2&gt;

&lt;p&gt;Here's the core insight: the reason voice agents fail at interruption is that they're architecturally &lt;em&gt;single-threaded in spirit&lt;/em&gt; even when multi-threaded in implementation. The microphone, the VAD, the LLM, and the speaker all wait for each other.&lt;/p&gt;

&lt;p&gt;Our system has four completely decoupled components communicating via thread-safe queues and a single shared &lt;code&gt;threading.Event()&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    SYSTEM ARCHITECTURE                       │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  PyAudio     │    │  Silero VAD  │    │  Gemma 4 E2B │  │
│  │  C-Thread    │───▶│  Gatekeeper  │───▶│  Inference   │  │
│  │  (non-block) │    │  (CPU only)  │    │  Engine      │  │
│  └──────────────┘    └──────┬───────┘    └──────┬───────┘  │
│                             │                   │           │
│                    interrupt_event.set()    response text   │
│                             │                   │           │
│                             ▼                   ▼           │
│                    ┌──────────────────────────────────┐     │
│                    │     TTS Output Worker Thread     │     │
│                    │  polls interrupt_event every     │     │
│                    │  50ms chunk → instant flush      │     │
│                    └──────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;interrupt_event&lt;/code&gt; is the entire system's nervous system. The moment Silero detects speech onset, it fires. The TTS worker is polling that flag on every 50ms audio chunk write. The instant it fires — mid-syllable if necessary — the TTS queue is flushed. The agent goes silent. The user has the floor.&lt;/p&gt;
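
&lt;p&gt;Here is a minimal, standard-library sketch of that flush behavior, with no real audio involved. The names &lt;code&gt;tts_queue&lt;/code&gt;, &lt;code&gt;speaker_write&lt;/code&gt;, and &lt;code&gt;flush&lt;/code&gt; are assumptions made for the demo, and the "barge-in" is simulated by setting the flag while the third chunk plays so the result is deterministic.&lt;/p&gt;

```python
import queue
import threading

interrupt_event = threading.Event()   # set by the VAD side on speech onset
tts_queue = queue.Queue()             # ~50ms audio chunks awaiting playback
played = []                           # stand-in for the speaker in this demo


def flush(q):
    """Discard everything still waiting in the queue; return the count."""
    dropped = 0
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            return dropped
        dropped += 1


def speaker_write(chunk):
    """Stand-in for a blocking audio write (assumption for the demo)."""
    played.append(chunk)
    if chunk == "chunk-2":
        interrupt_event.set()  # simulate the user barging in mid-response


def playback_worker():
    """Play queued chunks, checking the flag before every write."""
    while True:
        chunk = tts_queue.get()
        if chunk is None:              # sentinel: response finished normally
            return
        if interrupt_event.is_set():   # user spoke: drop the rest, go quiet
            flush(tts_queue)
            return
        speaker_write(chunk)


for i in range(10):
    tts_queue.put(f"chunk-{i}")
tts_queue.put(None)

worker = threading.Thread(target=playback_worker)
worker.start()
worker.join()
print(played, tts_queue.empty())
```

&lt;p&gt;Because the flag is checked once per chunk, the worst-case delay between the interrupt and silence is one chunk of audio. A long-running agent would clear the event and keep the worker alive for the next response instead of returning.&lt;/p&gt;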




&lt;h2&gt;
  
  
  Building It: Phase by Phase
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Environment Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; transformers torch accelerate onnxruntime
pip &lt;span class="nb"&gt;install &lt;/span&gt;pyaudio soundfile silero-vad bitsandbytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;onnxruntime&lt;/code&gt; lets Silero run on the ONNX Runtime backend for efficient CPU inference. &lt;code&gt;bitsandbytes&lt;/code&gt; handles 4-bit quantization to compress the model to ~3.5GB VRAM.&lt;/p&gt;
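
&lt;p&gt;A quick back-of-the-envelope check on the memory figures, using the 5.1 billion total parameter count quoted earlier. This counts raw weight storage only. The gap between the 4-bit result and the ~3.5GB figure is presumably runtime overhead such as layers left unquantized, activations, and the KV cache, which is an assumption rather than a measured breakdown.&lt;/p&gt;

```python
def weight_gb(num_params, bits_per_param):
    """Raw weight storage in GB (1 GB = 1e9 bytes), excluding all overhead."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 5.1e9  # total parameter count quoted earlier in the article

print(round(weight_gb(TOTAL_PARAMS, 16), 2))  # 16-bit weights
print(round(weight_gb(TOTAL_PARAMS, 4), 2))   # 4-bit weights
```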




&lt;h3&gt;
  
  
  Phase 1: The VAD Gatekeeper (Non-Blocking Audio Capture)
&lt;/h3&gt;

&lt;p&gt;A blocking read loop is the enemy of real-time audio. If &lt;code&gt;stream.read()&lt;/code&gt; sits in a &lt;code&gt;while True&lt;/code&gt; loop on the same thread that runs LLM inference, nothing drains the microphone while the model is generating. The input buffer overflows. Interruptions get lost.&lt;/p&gt;

&lt;p&gt;The solution: PyAudio's non-blocking &lt;code&gt;stream_callback&lt;/code&gt;. This delegates audio capture to a dedicated PortAudio thread that runs independently of your main loop. The callback itself is Python and briefly takes the GIL each time it runs, which is why it must do almost nothing and return immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyaudio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VADGatekeeper&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;

        &lt;span class="c1"&gt;# Silero VAD via ONNX backend — ~1MB, &amp;lt;1ms per forward pass
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;utils&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;repo_or_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snakers4/silero-vad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;silero_vad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;force_reload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;onnx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_speech_timestamps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VADIterator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;utils&lt;/span&gt;

        &lt;span class="c1"&gt;# Stateful iterator: 0.5 threshold, 500ms silence to confirm end
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vad_iterator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VADIterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;min_silence_duration_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_speaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interrupt_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_utterance_frames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;audio_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Executed in a C-level PortAudio thread.
        Deposits audio bytes into queue and returns INSTANTLY.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pyaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;paContinue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;audio_callback&lt;/code&gt; does exactly one thing: put bytes in the queue and return. With 512-sample buffers at 16kHz, it fires every 32ms. It never blocks. It never waits for LLM inference to finish. The microphone is always listening.&lt;/p&gt;
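
&lt;p&gt;On the consumer side, something has to drain that queue and turn raw bytes into samples the VAD can score. The following is a minimal standard-library sketch, assuming the stream delivers 16-bit mono PCM; &lt;code&gt;pcm16_to_float&lt;/code&gt; and &lt;code&gt;drain&lt;/code&gt; are hypothetical helpers. In practice &lt;code&gt;np.frombuffer&lt;/code&gt; is the more common way to do the conversion.&lt;/p&gt;

```python
import array
import queue

audio_queue = queue.Queue()  # filled by the PortAudio callback in the article


def pcm16_to_float(raw_bytes):
    """Convert 16-bit PCM bytes (native byte order) to floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(raw_bytes)
    return [s / 32768.0 for s in samples]


def drain(q, max_chunks=8):
    """Pull up to max_chunks waiting chunks without blocking the caller."""
    chunks = []
    while max_chunks > len(chunks):
        try:
            chunks.append(q.get_nowait())
        except queue.Empty:
            break
    return chunks


# Simulate two callback deliveries of 512 silent samples each.
for _ in range(2):
    audio_queue.put(array.array("h", [0] * 512).tobytes())

for raw in drain(audio_queue):
    floats = pcm16_to_float(raw)
    print(len(floats))
```

&lt;p&gt;Capping the number of chunks per drain keeps the consumer loop responsive, so it can check for speech onset between batches rather than after a long backlog.&lt;/p&gt;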




&lt;h3&gt;
  
  
  Phase 2: The Gemma 4 E2B Inference Engine
&lt;/h3&gt;

&lt;p&gt;Loading E2B in native 16-bit precision requires 10+ GB VRAM. On consumer hardware, we use &lt;code&gt;BitsAndBytesConfig&lt;/code&gt; for 4-bit NF4 quantization — collapsing that to ~3.5 GB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForMultimodalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GemmaVoiceEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-E2B-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;bnb_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForMultimodalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_numpy_array&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# CRITICAL: audio content block MUST precede text block
&lt;/span&gt;        &lt;span class="n"&gt;user_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;audio_numpy_array&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please respond to the audio input concisely.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;return_dict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;input_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;input_len&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt;
            &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One non-obvious detail: the &lt;code&gt;{"type": "audio"}&lt;/code&gt; block &lt;strong&gt;must physically precede&lt;/strong&gt; the &lt;code&gt;{"type": "text"}&lt;/code&gt; block in the content array. This is an architectural requirement of Gemma 4's multimodal formatting — the audio token expansions need to be computed before the text instructions are interpolated. Getting this wrong causes silent inference failures.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;apply_chat_template&lt;/code&gt; is called, the processor runs the mel-spectrogram computation, determines the exact number of audio tokens based on waveform duration, and stitches the audio representations into the prompt in place of the structural placeholder. The complexity of audio tokenization is entirely abstracted.&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 3: The Output Controller (The Flush Mechanism)
&lt;/h3&gt;

&lt;p&gt;This is where interruption actually happens. The TTS worker polls &lt;code&gt;interrupt_event&lt;/code&gt; on every single audio chunk before writing to the speaker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OutputController&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PyAudio&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pyaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;paInt16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playback_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tts_playback_worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interrupt_event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Check BEFORE dequeuing
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;interrupt_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_set&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Output] Interruption. Flushing queue.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playback_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playback_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_nowait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;break&lt;/span&gt;
                &lt;span class="n"&gt;interrupt_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;audio_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;playback_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="c1"&gt;# Check AGAIN immediately before hardware write
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;interrupt_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_set&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;

                &lt;span class="c1"&gt;# Blocking write to physical speaker
&lt;/span&gt;                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



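&lt;p&gt;The double-check pattern (once before dequeuing, once before the hardware write) keeps the race window small: at most one chunk that has already been handed to &lt;code&gt;out_stream.write&lt;/code&gt; can still play after the event is set. With 50-millisecond chunks, the worst-case interruption delay is therefore about 50 milliseconds, plus whatever the audio driver has already buffered. In practice that is short enough to feel immediate.&lt;/p&gt;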

&lt;p&gt;Feed 50ms TTS chunks into &lt;code&gt;playback_queue&lt;/code&gt;. The agent stops mid-syllable when interrupted. This is the uncanny valley fix.&lt;/p&gt;
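
&lt;p&gt;A minimal sketch of the chunking step, assuming your TTS engine hands back raw 16-bit mono PCM at the same 24 kHz rate the output stream is opened with. The helper name &lt;code&gt;chunk_pcm&lt;/code&gt; is hypothetical and not part of any TTS library:&lt;/p&gt;

```python
def chunk_pcm(pcm_bytes, sample_rate=24000, chunk_ms=50, sample_width=2):
    """Split raw mono PCM into fixed-duration chunks for the playback queue."""
    # 24000 samples/s * 50 ms * 2 bytes per sample = 2400 bytes per chunk
    chunk_bytes = sample_rate * chunk_ms // 1000 * sample_width
    return [
        pcm_bytes[start:start + chunk_bytes]
        for start in range(0, len(pcm_bytes), chunk_bytes)
    ]


if __name__ == "__main__":
    one_second = bytes(24000 * 2)  # one second of 16-bit silence
    chunks = chunk_pcm(one_second)
    print(len(chunks), len(chunks[0]))
```

&lt;p&gt;Each chunk can then be pushed with &lt;code&gt;output_ctrl.playback_queue.put(chunk)&lt;/code&gt;. The final chunk may be shorter than 50ms, which is fine for a blocking write.&lt;/p&gt;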




&lt;h3&gt;
  
  
  Phase 4: The Orchestration Loop
&lt;/h3&gt;

&lt;p&gt;Everything comes together here: the moment speech starts, the interrupt fires. When silence confirms the utterance is complete, inference runs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main_orchestration_loop&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;gatekeeper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VADGatekeeper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GemmaVoiceEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;output_ctrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OutputController&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;tts_thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_ctrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tts_playback_worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interrupt_event&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt;
        &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tts_thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PyAudio&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mic_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pyaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;paInt16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;frames_per_buffer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;stream_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_callback&lt;/span&gt;  &lt;span class="c1"&gt;# Non-blocking C thread
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mic_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Listening. Speak freely — interrupt anytime.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;raw_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="c1"&gt;# Normalize bytes → float32 [-1.0, 1.0] for VAD
&lt;/span&gt;            &lt;span class="n"&gt;audio_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frombuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;audio_float32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;audio_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;32768.0&lt;/span&gt;
            &lt;span class="n"&gt;tensor_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;speech_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad_iterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tensor_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;speech_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;speech_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Speech detected — firing interrupt...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interrupt_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# TTS stops NOW
&lt;/span&gt;                    &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_speaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
                    &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_utterance_frames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

                &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;speech_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Utterance complete. Processing...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_speaking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
                    &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_utterance_frames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                    &lt;span class="n"&gt;full_waveform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_utterance_frames&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_waveform&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="c1"&gt;# Route response to your TTS generator here
&lt;/span&gt;                    &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_utterance_frames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_speaking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;gatekeeper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_utterance_frames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;mic_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;mic_stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main_orchestration_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The timing budget: Silero processes audio in 512-sample chunks, which is &lt;strong&gt;32 milliseconds&lt;/strong&gt; at 16kHz, so speech onset is flagged at chunk granularity. &lt;code&gt;interrupt_event.set()&lt;/code&gt; becomes visible to the TTS worker almost immediately, and the worker stops within one chunk cycle of 50ms. Worst-case interruption latency is therefore &lt;strong&gt;roughly 82ms&lt;/strong&gt; from detection to silence, excluding audio-driver buffering.&lt;/p&gt;

&lt;p&gt;Human perception of conversational delay becomes noticeable around 200ms. We're well inside that window.&lt;/p&gt;
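
&lt;p&gt;The budget is simple enough to recompute for your own settings. This sketch uses the article's numbers as defaults; the function name is illustrative, and the result ignores device buffering and any extra frames the VAD needs before it commits to a decision:&lt;/p&gt;

```python
def interruption_budget_ms(vad_chunk_samples=512, mic_rate_hz=16000, tts_chunk_ms=50):
    """Worst-case delay: one VAD frame plus one playback chunk, in milliseconds."""
    vad_frame_ms = vad_chunk_samples * 1000 / mic_rate_hz
    return vad_frame_ms + tts_chunk_ms


if __name__ == "__main__":
    print(interruption_budget_ms())                  # article defaults
    print(interruption_budget_ms(tts_chunk_ms=20))   # shorter playback chunks
```

&lt;p&gt;Shrinking the playback chunk is the cheapest lever here, since the VAD frame size is fixed by the model.&lt;/p&gt;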




&lt;h2&gt;
  
  
  The Memory Constraint Nobody Mentions
&lt;/h2&gt;

&lt;p&gt;There's a hard reality that documentation often glosses over: the 128K context window is theoretically available, but practically unreachable on consumer hardware.&lt;/p&gt;

&lt;p&gt;The Gemma 4 E2B KV cache allocates approximately &lt;strong&gt;490 KB per token&lt;/strong&gt; due to its dense 256 head dimension. Filling 128K tokens would require &lt;strong&gt;60+ GB of VRAM&lt;/strong&gt; for the cache alone. On a machine with 16 GB unified memory, you can safely operate up to roughly &lt;strong&gt;8,000 tokens&lt;/strong&gt; of conversation history.&lt;/p&gt;
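
&lt;p&gt;To see where those numbers come from, here is the arithmetic as a small sketch. It takes the 490 KB-per-token figure above as given (treated as 490 KiB) rather than deriving it from the model config, and the function name is illustrative:&lt;/p&gt;

```python
KIB = 1024
GIB = 1024 ** 3


def kv_cache_gib(tokens, bytes_per_token=490 * KIB):
    """Estimated KV-cache size in GiB for a given number of cached tokens."""
    return tokens * bytes_per_token / GIB


if __name__ == "__main__":
    for tokens in (8_000, 32_768, 131_072):
        print(f"{tokens} tokens: {kv_cache_gib(tokens):.2f} GiB")
```

&lt;p&gt;At 131,072 tokens the estimate is about 61 GiB, while 8,000 tokens need a little under 4 GiB, which leaves room for the model weights and audio buffers on a 16 GB machine.&lt;/p&gt;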

&lt;p&gt;This means you need aggressive context pruning:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Keep the system prompt + last N turns
&lt;/span&gt;&lt;span class="n"&gt;MAX_HISTORY_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6000&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prune_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Drop the oldest user/assistant pair until the estimate fits.
&lt;/span&gt;    &lt;span class="c1"&gt;# Index 0 (the system prompt) is kept, and the length guard ends the
&lt;/span&gt;    &lt;span class="c1"&gt;# loop once nothing prunable remains instead of spinning forever.
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_HISTORY_TOKENS&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For voice agents requiring hours of continuous memory (think customer service bots or long-form interview assistants), migrate the backend from &lt;code&gt;transformers&lt;/code&gt; to &lt;code&gt;llama.cpp&lt;/code&gt; with GGUF format. llama.cpp can keep the KV cache in a quantized type such as &lt;code&gt;q8_0&lt;/code&gt; or &lt;code&gt;q4_0&lt;/code&gt; (selected with &lt;code&gt;--cache-type-k&lt;/code&gt; and &lt;code&gt;--cache-type-v&lt;/code&gt;), which stores keys and values in low-bit blocks instead of 16-bit floats. That cuts cache memory substantially, at the cost of some accuracy that you should measure on your own conversations. For production long-session voice deployments, treat this as a required optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  Beyond VAD: Semantic Interruption
&lt;/h2&gt;

&lt;p&gt;The system above is production-ready for most use cases. But there's one failure mode to know about: &lt;strong&gt;backchannels&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a user murmurs "mm-hmm" or "right" in passive agreement, Silero correctly classifies it as speech and fires the interrupt. The agent stops speaking, even though the user wasn't interrupting; they were listening.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;semantic yielding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of a destructive flush on speech onset, make it a &lt;em&gt;reversible pause&lt;/em&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pause&lt;/strong&gt; the TTS stream (don't flush) when VAD fires&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture&lt;/strong&gt; the short interjection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify&lt;/strong&gt; via a fast sub-billion parameter model:

&lt;ul&gt;
&lt;li&gt;Backchannel ("mm-hmm", "right", "okay") → &lt;strong&gt;unpause&lt;/strong&gt;, resume seamlessly&lt;/li&gt;
&lt;li&gt;Genuine interruption ("wait", "stop", "what about—") → &lt;strong&gt;flush&lt;/strong&gt;, route to E2B&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Classification adds roughly 80-100ms of overhead, which keeps the total under the ~200ms threshold where delay becomes noticeable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This elevates the system from prototype to something that genuinely feels like a conversation.&lt;/p&gt;
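
&lt;p&gt;The pause, classify, then resume-or-flush steps above can be sketched as a small state holder. Everything here is illustrative: the class and function names are hypothetical, the interjection is assumed to arrive as a short text transcript, and the keyword lookup is only a stand-in for the small classifier model described in the list:&lt;/p&gt;

```python
import queue
import threading

BACKCHANNELS = {"mm-hmm", "mhm", "uh-huh", "right", "okay", "ok", "yeah", "yes"}


def classify_interjection(transcript):
    """Placeholder classifier: a keyword lookup standing in for a small model."""
    normalized = transcript.lower().strip(" .,!?")
    return "backchannel" if normalized in BACKCHANNELS else "interruption"


class YieldingPlayback:
    """Playback state with a reversible pause instead of an immediate flush."""

    def __init__(self):
        self.playback_queue = queue.Queue()
        self.paused = threading.Event()

    def on_speech_start(self):
        # VAD fired: stop writing chunks, but keep everything queued.
        self.paused.set()

    def on_interjection(self, transcript):
        label = classify_interjection(transcript)
        if label == "interruption":
            # Genuine interruption: discard the queued audio.
            while True:
                try:
                    self.playback_queue.get_nowait()
                except queue.Empty:
                    break
        # Either way playback is released; after a flush the queue is empty.
        self.paused.clear()
        return label


if __name__ == "__main__":
    playback = YieldingPlayback()
    for _ in range(3):
        playback.playback_queue.put(b"chunk")
    playback.on_speech_start()
    print(playback.on_interjection("Mm-hmm."), playback.playback_queue.qsize())
    playback.on_speech_start()
    print(playback.on_interjection("wait, what about"), playback.playback_queue.qsize())
```

&lt;p&gt;In this design the playback worker would check &lt;code&gt;paused.is_set()&lt;/code&gt; before each hardware write and simply wait while it is set, so a backchannel costs a short pause rather than the rest of the response.&lt;/p&gt;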




&lt;h2&gt;
  
  
  What This Actually Unlocks
&lt;/h2&gt;

&lt;p&gt;The architecture in this guide runs entirely offline. No API calls. No cloud dependency. No subscription. A Raspberry Pi 5 can host a voice agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hears you natively — not transcribed text, actual audio with prosody and intent&lt;/li&gt;
&lt;li&gt;Responds intelligently across a multi-hour conversation context&lt;/li&gt;
&lt;li&gt;Stops the moment you start speaking — every single time&lt;/li&gt;
&lt;li&gt;Gets smarter about your patterns over time with persistent conversation history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've spent years accepting that voice AI is clunky because it has to be. Gemma 4 E2B makes that tradeoff optional. The uncanny valley was always an architecture problem. We just finally have the pieces to solve it on hardware that fits in your pocket.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; transformers torch accelerate onnxruntime
pip &lt;span class="nb"&gt;install &lt;/span&gt;pyaudio soundfile silero-vad bitsandbytes

&lt;span class="c"&gt;# Pull via Ollama if you prefer the managed route&lt;/span&gt;
ollama pull gemma4:e2b

&lt;span class="c"&gt;# Or load directly via Hugging Face&lt;/span&gt;
&lt;span class="c"&gt;# model_id = "google/gemma-4-E2B-it"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with the orchestration loop, verify your Silero VAD fires correctly on your microphone, then wire in your preferred TTS engine to &lt;code&gt;output_ctrl.playback_queue&lt;/code&gt;. The interruption layer works regardless of which TTS you choose — Kokoro, Edge-TTS, Coqui, anything that produces audio chunks.&lt;/p&gt;

&lt;p&gt;The agent that finally listens is one thread event away.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written for the Gemma 4 Writing Challenge on DEV.to. Deadline: May 24, 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemmachallenge</category>
      <category>ai</category>
      <category>gemma</category>
      <category>devchallenge</category>
    </item>
    <item>
      <title>The AI That Learns While You Sleep: Inside Hermes Agent's Self-Evolving Brain</title>
      <dc:creator>Pratay Karali</dc:creator>
      <pubDate>Thu, 21 May 2026 11:43:03 +0000</pubDate>
      <link>https://forem.com/pratay_karali_5716376b9f2/the-ai-that-learns-while-you-sleep-inside-hermes-agents-self-evolving-brain-3ai8</link>
      <guid>https://forem.com/pratay_karali_5716376b9f2/the-ai-that-learns-while-you-sleep-inside-hermes-agents-self-evolving-brain-3ai8</guid>
      <description>&lt;p&gt;&lt;em&gt;Every other agent forgets. This one grows.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;It's 3am. Your laptop is idle. No terminal is open. No prompt is waiting.&lt;/p&gt;

&lt;p&gt;And somewhere inside &lt;code&gt;~/.hermes/skills/&lt;/code&gt;, a daemon just woke up.&lt;/p&gt;

&lt;p&gt;It's reading through every task your agent completed this week — every tool call, every failure, every correction, every workaround. It's grading them. Consolidating the ones that overlap. Pruning the ones that underperformed. Writing new procedural memory files from scratch.&lt;/p&gt;

&lt;p&gt;You didn't ask it to. You didn't schedule it. It just happens — every seven days — like clockwork.&lt;/p&gt;

&lt;p&gt;By morning, your agent is measurably better at your codebase than it was yesterday. And you were asleep the entire time.&lt;/p&gt;

&lt;p&gt;This is the Hermes Agent. And it's the first open-source runtime I've encountered that doesn't just &lt;em&gt;execute&lt;/em&gt; intelligence — it &lt;em&gt;accumulates&lt;/em&gt; it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Every Other Agent Has
&lt;/h2&gt;

&lt;p&gt;Here's the dirty secret of most agentic frameworks: they're amnesiac by design.&lt;/p&gt;

&lt;p&gt;You give the agent a complex task. It struggles, recovers, finds a workaround, completes it. You feel good. You close the terminal.&lt;/p&gt;

&lt;p&gt;Next time you run the same class of task? It starts from zero. The workaround is gone. The hard-won recovery pattern — vanished. The agent will struggle through the exact same failure mode it navigated last Tuesday, because nothing in the architecture captured what it learned.&lt;/p&gt;

&lt;p&gt;This isn't a bug in most frameworks. It's a philosophical choice: keep the agent stateless, keep it predictable, keep it simple. The problem is that "simple" compounds into "perpetually mediocre." You're not running an agent that gets better. You're running a very expensive &lt;code&gt;for&lt;/code&gt; loop.&lt;/p&gt;

&lt;p&gt;Hermes made a different choice. Its entire architecture is built around a single question: &lt;em&gt;what if the agent remembered — and what if the act of remembering made it smarter?&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Heartbeat: Observe, Execute, Reflect, Crystallize
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks have a loop. Input → Plan → Tool Call → Output. Repeat.&lt;/p&gt;

&lt;p&gt;Hermes has a &lt;em&gt;heartbeat.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The four-phase OERC cycle isn't just an execution pattern — it's a learning metabolism. Each phase has a distinct biological analog, and understanding them is the key to understanding why Hermes behaves the way it does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│              THE HERMES LEARNING HEARTBEAT               │
│                                                          │
│   OBSERVE ──► EXECUTE ──► REFLECT ──► CRYSTALLIZE       │
│      │                        │             │            │
│   Scan skills             Run GEPA      Write SKILL.md   │
│   FTS5 lookup            self-analysis   to ~/.hermes/   │
│   ~20 tokens/skill        on trace       skills/         │
│      │                        │             │            │
│   "What do I               "Why did        "Next time,   │
│    already know?"           that work?"     I'll know."  │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observe&lt;/strong&gt; — Before executing anything, the agent scans its local SQLite database using FTS5 full-text search to find skills that match the incoming task. Critically, it only loads Level-0 summaries at this stage — roughly 20 tokens per skill — so it can survey thousands of procedures without bloating its context window. It enters the execution phase already knowing what it knows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execute&lt;/strong&gt; — The agent runs its tool-calling loop, dispatching up to 8 tools in parallel via a localized thread pool. Every terminal command, every API response, every correction is captured in a detailed execution trace. The agent isn't just doing the work — it's recording a complete transcript of &lt;em&gt;how&lt;/em&gt; it did the work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reflect&lt;/strong&gt; — This is where it gets interesting. After a sufficiently complex task (typically 5+ tool calls), GEPA — a Genetic-Pareto Prompt Evolution system running alongside DSPy — analyzes the execution trace. It identifies the failure points. It models why the recovery worked. It generates optimized guidelines, documents common pitfalls, and drafts verification steps. Crucially, this reflection runs in a background thread — it doesn't make you wait.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crystallize&lt;/strong&gt; — The reflection output is compiled into a structured Markdown skill file and written to &lt;code&gt;~/.hermes/skills/&lt;/code&gt;. It's indexed at Level 1, ready to be surfaced in the &lt;em&gt;Observe&lt;/em&gt; phase of the next session. The loop closes. The knowledge persists.&lt;/p&gt;

&lt;p&gt;This mirrors how muscle memory forms in humans. You struggle through a new motor pattern consciously, repeatedly, until the motion becomes automatic and subcortical. Hermes does the same thing, except it crystallizes in one pass, not ten thousand.&lt;/p&gt;
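&lt;p&gt;To make the control flow of the cycle concrete, here is a minimal Python sketch. Everything in it is hypothetical: the function names, the &lt;code&gt;REFLECT_THRESHOLD&lt;/code&gt; constant, the file-name matching, and the skill file layout are illustrative stand-ins rather than Hermes internals, and the "reflection" is a trivial summary where Hermes would run GEPA.&lt;/p&gt;

```python
import tempfile
import threading
from pathlib import Path
from typing import Optional

REFLECT_THRESHOLD = 5  # assumption: mirrors the "typically 5+ tool calls" rule of thumb


def observe(task: str, skills_dir: Path) -> list[str]:
    """Return skill files whose name shares at least one word with the task."""
    words = set(task.lower().split())
    return sorted(
        p.name
        for p in skills_dir.glob("*.md")
        if words.intersection(p.stem.lower().split("-"))
    )


def execute(steps: list[str]) -> list[dict]:
    """Pretend to run each tool call and record it in an execution trace."""
    return [{"step": i, "command": cmd, "ok": True} for i, cmd in enumerate(steps, 1)]


def reflect_and_crystallize(task: str, trace: list[dict], skills_dir: Path) -> None:
    """Stand-in for reflection: turn the trace into a Markdown skill file."""
    lines = [f"# Skill: {task}", "", "## Steps that worked"]
    lines += [f"{t['step']}. {t['command']}" for t in trace if t["ok"]]
    slug = "-".join(task.lower().split())
    (skills_dir / f"{slug}.md").write_text("\n".join(lines) + "\n", encoding="utf-8")


def heartbeat(task: str, steps: list[str], skills_dir: Path) -> tuple[list[str], Optional[threading.Thread]]:
    """One Observe/Execute pass; Reflect/Crystallize runs in a background thread."""
    known = observe(task, skills_dir)
    trace = execute(steps)
    worker = None
    if len(trace) >= REFLECT_THRESHOLD:
        worker = threading.Thread(
            target=reflect_and_crystallize, args=(task, trace, skills_dir)
        )
        worker.start()  # the caller is not blocked while the skill is written
    return known, worker


# Demo in a throwaway directory instead of ~/.hermes/skills/
skills = Path(tempfile.mkdtemp())
steps = ["git fetch", "git rebase main", "pytest", "git push", "gh pr create"]
known, worker = heartbeat("rebase feature branch", steps, skills)
worker.join()  # wait only so the demo can inspect the result
known_again, _ = heartbeat("rebase feature branch", steps[:1], skills)
print(known, known_again)
```

&lt;p&gt;The first pass finds nothing to reuse and writes a skill file in the background; the second pass surfaces that file during its Observe step, which is the loop closing in miniature.&lt;/p&gt;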




&lt;h2&gt;
  
  
  The Three-Layer Brain
&lt;/h2&gt;

&lt;p&gt;If the OERC cycle is the heartbeat, the memory architecture is the brain. And it's structured with a specificity that most frameworks don't come close to.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory Layer&lt;/th&gt;
&lt;th&gt;Human Analog&lt;/th&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Size Cap&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L1: Session Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Working Memory&lt;/td&gt;
&lt;td&gt;Volatile RAM&lt;/td&gt;
&lt;td&gt;Model context window&lt;/td&gt;
&lt;td&gt;Cleared on &lt;code&gt;/reset&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L2: Persistent Store&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Episodic Memory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.hermes/memories/MEMORY.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~2,200 chars / ~800 tokens&lt;/td&gt;
&lt;td&gt;Locked into system prompt prefix at session start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L3: User Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Autobiographical Self&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.hermes/memories/USER.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~1,375 chars / ~500 tokens&lt;/td&gt;
&lt;td&gt;Embedded in System Slot #1; updated in background&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Procedural&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Muscle Memory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.hermes/skills/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;15 KiB per file&lt;/td&gt;
&lt;td&gt;Loaded dynamically via progressive disclosure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The size caps aren't arbitrary. They're a deliberate architectural decision to prevent what the Hermes team calls "prompt degradation" — the phenomenon where too much injected context starts &lt;em&gt;hurting&lt;/em&gt; model performance instead of helping it. Every cap is the result of empirical testing on where the signal-to-noise ratio flips.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;USER.md&lt;/code&gt; file is the layer that tends to surprise people the most. It's not just a config file — it's a live model of &lt;em&gt;you&lt;/em&gt;. Your coding style. Your preferred abstractions. Your tolerance for verbose output. The Honcho dialectic system periodically rewrites it based on observed interaction patterns. Over weeks of use, the agent stops feeling like a generic assistant and starts feeling like something that's been working with you specifically.&lt;/p&gt;

&lt;p&gt;And the retrieval is fast — SQLite FTS5 scanning thousands of historical conversations in under 10 milliseconds. There's no vector embedding server to run, no semantic search latency to absorb. It's just a very well-engineered SQLite database doing what SQLite does best.&lt;/p&gt;
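&lt;p&gt;If you haven't used FTS5 before, the sketch below shows the kind of lookup this implies, using only Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The table name, columns, and sample rows are made up for illustration and are not Hermes's actual schema; it also assumes your Python's SQLite build includes the FTS5 extension, which most do.&lt;/p&gt;

```python
import sqlite3

# In-memory database; a real store would live in a file on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one short summary per skill, indexed for full-text search.
conn.execute("CREATE VIRTUAL TABLE skills USING fts5(name, summary)")
conn.executemany(
    "INSERT INTO skills (name, summary) VALUES (?, ?)",
    [
        ("docker-cache-bust", "Rebuild docker image layers when pip cache is stale"),
        ("jwt-alg-check", "Reject JWT tokens whose alg header is none"),
        ("pytest-flaky-retry", "Rerun flaky pytest cases with a fixed seed"),
    ],
)


def find_skills(query: str, limit: int = 5) -> list[str]:
    """Return names of skills matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT name FROM skills WHERE skills MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [name for (name,) in rows]


matches = find_skills("jwt OR pytest")
print(matches)
```

&lt;p&gt;Only the short summaries are indexed and returned here, which is the same idea as surveying cheap summaries first and loading a full skill file only once it looks relevant.&lt;/p&gt;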




&lt;h2&gt;
  
  
  The Immune System: Why Subagent Constraints Are the Feature
&lt;/h2&gt;

&lt;p&gt;When Hermes delegates a task to a subagent, that subagent runs in a fresh, isolated workspace with its own conversational context. Importantly, it operates under four hard constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No spawning subagents&lt;/strong&gt; — unless explicitly assigned the "orchestrator" role&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No user interaction&lt;/strong&gt; — subagents cannot prompt you for input&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No shared memory writes&lt;/strong&gt; — blocked from writing to &lt;code&gt;MEMORY.md&lt;/code&gt; or &lt;code&gt;USER.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential code execution&lt;/strong&gt; — no recursive script injection
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most developers see this list and think "limitations." That's the wrong read.&lt;/p&gt;

&lt;p&gt;These constraints are a security immune system. When you're running autonomous agents against your actual codebase — at 3am, while you sleep — you &lt;em&gt;want&lt;/em&gt; hard walls between parallel execution threads. A subagent that can spawn children can create exponential resource consumption. A subagent that can write to shared memory can corrupt the episodic store that took weeks to accumulate. A subagent that can talk to you mid-task can create race conditions between its output and your response.&lt;/p&gt;

&lt;p&gt;The constraints are what make the delegation &lt;em&gt;trustworthy&lt;/em&gt; at scale.&lt;/p&gt;
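&lt;p&gt;One way to picture these walls is as a deny-by-rule policy check that runs before any subagent action. The sketch below is purely illustrative: the &lt;code&gt;Subagent&lt;/code&gt; class, the action names, and &lt;code&gt;is_allowed&lt;/code&gt; are hypothetical and do not reflect how Hermes implements its checks. It covers the first three constraints only, since the sequential-execution rule depends on executor details not described here.&lt;/p&gt;

```python
from dataclasses import dataclass

# Shared memory files that subagents must never write to.
PROTECTED_FILES = {"MEMORY.md", "USER.md"}


@dataclass(frozen=True)
class Subagent:
    name: str
    role: str = "worker"  # assumption: "orchestrator" is the only role that may spawn


def is_allowed(agent: Subagent, action: str, target: str = "") -> bool:
    """Return True if the subagent may perform the action (hypothetical action names)."""
    if action == "spawn_subagent":
        return agent.role == "orchestrator"
    if action == "ask_user":
        return False  # subagents never prompt the human
    if action == "write_file":
        # Compare on the file name so any path to the shared memory files is blocked.
        return target.rsplit("/", 1)[-1] not in PROTECTED_FILES
    return True


reviewer = Subagent("reviewer")
print(is_allowed(reviewer, "write_file", "/workspace/src/auth/jwt.py"))
print(is_allowed(reviewer, "write_file", "~/.hermes/memories/MEMORY.md"))
```

&lt;p&gt;The point of the shape is that the rules are checked outside the model: a subagent can't talk its way past a boolean.&lt;/p&gt;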

&lt;p&gt;Here's what a production parallel code review delegation looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hermes_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;delegate_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;execute_code&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_automated_refactor&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;target_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/auth/jwt.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/auth/login.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Dispatch two isolated security-focused subagents in parallel
&lt;/span&gt;    &lt;span class="nf"&gt;delegate_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;goal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_files&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; for JWT validation vulnerabilities and apply fixes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fixer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Project is at /workspace.
                Verify signature validation and check algorithm headers.
                Run pytest tests/auth/ -v after implementing changes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;goal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_files&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; for SQL injection vectors and parameterize queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fixer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Project is at /workspace.
                Convert string-formatted database executions to prepared statements.
                Run pytest tests/auth/ -v to verify integration.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Final validation pass
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;execute_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
import subprocess
result = subprocess.run([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pytest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/auth/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;], capture_output=True, text=True)
print(result.stdout)
print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REFACTOR_COMPLETE_AND_VERIFIED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; if result.returncode == 0 else &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REFACTOR_FAILED_VALIDATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two subagents. Two security domains. Zero shared state. One validation pass at the end. This is the architecture of a system that's designed to be trusted in production — not just demonstrated in a keynote.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Sandbox: Paranoid by Default
&lt;/h2&gt;

&lt;p&gt;Autonomous agents executing shell commands on your machine is, frankly, terrifying if done wrong. Hermes handles this through six execution backends — Local, Docker, SSH, Daytona, Singularity, and Modal — with the Docker backend being the recommended production configuration.&lt;/p&gt;

&lt;p&gt;The key architectural difference from naive Docker wrappers: Hermes uses a &lt;em&gt;single, persistent container&lt;/em&gt; initialized at startup. Every command, every subagent, every code execution routes through &lt;code&gt;docker exec&lt;/code&gt; into this one container. State persists between steps. Package installations survive across tool calls. Environment variables don't reset mid-task.&lt;/p&gt;

&lt;p&gt;And it's hardened by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;terminal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker"&lt;/span&gt;
  &lt;span class="na"&gt;docker_image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nikolaik/python-nodejs:python3.11-nodejs20"&lt;/span&gt;
  &lt;span class="na"&gt;container_persistent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;container_cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;container_memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, Hermes applies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--cap-drop ALL&lt;/code&gt; with only &lt;code&gt;DAC_OVERRIDE&lt;/code&gt;, &lt;code&gt;CHOWN&lt;/code&gt;, and &lt;code&gt;FOWNER&lt;/code&gt; restored&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--security-opt no-new-privileges&lt;/code&gt; to block runtime escalation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--pids-limit 256&lt;/code&gt; to neutralize fork-bomb attacks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tmpfs&lt;/code&gt; mounts for &lt;code&gt;/tmp&lt;/code&gt; (512MB) and &lt;code&gt;/var/tmp&lt;/code&gt; (256MB) to prevent disk exhaustion
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paranoia is engineered in. You don't have to configure it. It's the default.&lt;/p&gt;
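&lt;p&gt;For a sense of what those defaults add up to, here is a sketch that assembles an equivalent &lt;code&gt;docker run&lt;/code&gt; command by hand. The flags themselves are standard Docker CLI options, but the function, the container name, the &lt;code&gt;sleep infinity&lt;/code&gt; keep-alive, and the reading of &lt;code&gt;container_memory&lt;/code&gt; as megabytes are my assumptions, not something taken from the Hermes source.&lt;/p&gt;

```python
import shlex


def docker_run_args(image: str, cpus: int = 2, memory_mb: int = 4096) -> list[str]:
    """Build an argv list for a long-lived, locked-down sandbox container."""
    return [
        "docker", "run", "--detach",
        "--name", "hermes-sandbox",             # hypothetical container name
        "--cpus", str(cpus),
        "--memory", f"{memory_mb}m",            # assumes the config value is in MB
        "--cap-drop", "ALL",                    # start from zero capabilities
        "--cap-add", "DAC_OVERRIDE",
        "--cap-add", "CHOWN",
        "--cap-add", "FOWNER",
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--pids-limit", "256",                  # cap process count against fork bombs
        "--tmpfs", "/tmp:size=512m",
        "--tmpfs", "/var/tmp:size=256m",
        image,
        "sleep", "infinity",                    # keep it alive for later docker exec calls
    ]


args = docker_run_args("nikolaik/python-nodejs:python3.11-nodejs20")
command = shlex.join(args)
print(command)
```

&lt;p&gt;Printing the command rather than running it keeps the sketch side-effect free; you can paste the output into a shell to try the same hardening on a scratch container.&lt;/p&gt;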




&lt;h2&gt;
  
  
  Hermes vs. OpenClaw: A Philosophy Comparison
&lt;/h2&gt;

&lt;p&gt;This isn't a "which is better" comparison. It's a "which philosophy are you choosing" comparison.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Center of Gravity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gateway-First: sessions route to a stateless agent loop&lt;/td&gt;
&lt;td&gt;Agent-First: the cognitive loop is the core; gateways wrap it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skill System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Static, human-authored files edited manually&lt;/td&gt;
&lt;td&gt;Self-generating: OERC loop writes new skills automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flat files + standard SQLite, manual configuration&lt;/td&gt;
&lt;td&gt;Three-layer persistent stack with FTS5, pluggable backends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parallel Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sequential within single thread&lt;/td&gt;
&lt;td&gt;Native &lt;code&gt;delegate_task&lt;/code&gt; spawning isolated parallel subagents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Container Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New container per task (high init overhead, stateless)&lt;/td&gt;
&lt;td&gt;Single persistent container (low overhead, stateful)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Under 30 minutes&lt;/td&gt;
&lt;td&gt;2–4 hours for full configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Annual API Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$600–$1,800&lt;/td&gt;
&lt;td&gt;~$500–$1,600 (optimized via prompt caching, α≈0.90)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tradeoff is honest: OpenClaw is faster to set up and easier to reason about because it's simpler. Hermes requires a real configuration investment — model routing, memory setup, Docker configuration, gateway pairing. The 2–4 hour setup time is real.&lt;/p&gt;

&lt;p&gt;But here's the question: are you building something you'll run once, or something you'll run every day?&lt;/p&gt;

&lt;p&gt;If it's the latter — if this agent is going to touch your codebase regularly, learn your patterns, automate your recurring tasks — the investment compounds. The agent that takes 4 hours to set up on day one is measurably smarter on day 30 than the agent that took 30 minutes. Because it's been crystallizing knowledge the entire time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cost Architecture: Why α≈0.90 Changes Everything
&lt;/h2&gt;

&lt;p&gt;The economic model of Hermes is worth understanding before you dismiss it as "expensive self-hosted AI."&lt;/p&gt;

&lt;p&gt;The cost formula for a single conversational turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C_turn = T_dynamic · R_dynamic 
       + T_cached · (1-α) · R_dynamic 
       + T_cached · α · R_cached 
       + T_out · R_out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



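&lt;p&gt;Where α is the prompt cache hit ratio: the fraction of the cacheable prefix (&lt;code&gt;T_cached&lt;/code&gt;) that is actually served from cache and billed at &lt;code&gt;R_cached&lt;/code&gt; instead of the normal input rate &lt;code&gt;R_dynamic&lt;/code&gt;. Hermes achieves α≈0.90 in steady-state operation, meaning roughly 90% of those prefix tokens are billed at the discounted cached rate.&lt;/p&gt;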
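&lt;p&gt;Plugging numbers into the formula makes the effect easier to see. The function below is a direct transcription of it; the token counts and per-token rates are invented for illustration and are not real Hermes measurements or any provider's actual pricing.&lt;/p&gt;

```python
def turn_cost(t_dynamic, t_cached, t_out, r_dynamic, r_cached, r_out, alpha):
    """Cost of one turn: uncached input + cache misses + cache hits + output."""
    return (
        t_dynamic * r_dynamic
        + t_cached * (1 - alpha) * r_dynamic
        + t_cached * alpha * r_cached
        + t_out * r_out
    )


# Hypothetical rates in dollars per token ($3, $0.30 and $15 per million tokens).
RATES = dict(r_dynamic=3e-6, r_cached=0.3e-6, r_out=15e-6)
# Hypothetical turn: 2k fresh input tokens, 50k cacheable prefix, 1k output tokens.
TOKENS = dict(t_dynamic=2_000, t_cached=50_000, t_out=1_000)

cold = turn_cost(**TOKENS, **RATES, alpha=0.0)  # nothing served from cache
warm = turn_cost(**TOKENS, **RATES, alpha=0.9)  # steady-state hit ratio

print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

&lt;p&gt;With these made-up inputs a turn drops from about $0.17 to about $0.05. The exact ratio depends heavily on how much of each prompt is cacheable prefix and on how output-heavy the work is, so treat any headline multiplier as workload-specific.&lt;/p&gt;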

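&lt;p&gt;This is the architectural payoff of the frozen memory layers. Because &lt;code&gt;MEMORY.md&lt;/code&gt; and &lt;code&gt;USER.md&lt;/code&gt; are static between updates, the system prompt prefix stays byte-identical from request to request, which is exactly what Anthropic's prompt cache needs to register a hit. Cache entries do expire after a short time-to-live that is refreshed each time they are read, so the roughly 1,300-token prefix is billed at the uncached rate only when the cache is cold. While the agent is actively working, every subsequent request reads it at a fraction of the cost.&lt;/p&gt;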

&lt;p&gt;For long-running, multi-hour agent operations — exactly the kind of work Hermes is designed for — this cache hit ratio is the difference between a $40 session and a $4 session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Curator: Cognitive Gardening
&lt;/h2&gt;

&lt;p&gt;The detail about Hermes that I find most philosophically interesting is the Curator Daemon.&lt;/p&gt;

&lt;p&gt;Every 7 days, a background process (introduced in version 0.12.0) scans your entire &lt;code&gt;~/.hermes/skills/&lt;/code&gt; directory. It grades each skill against historical execution logs. It identifies skills that overlap and consolidates them. It prunes skills that underperformed or became too narrow to be useful.&lt;/p&gt;

&lt;p&gt;No human touches this process. No one reviews the results unless they want to. The agent manages its own long-term memory hygiene.&lt;/p&gt;

&lt;p&gt;There's a term for this in cognitive neuroscience: synaptic pruning. The human brain does something similar during sleep — eliminating weak neural connections to strengthen the ones that matter. The result is that you wake up with slightly better-consolidated memories than you had when you fell asleep.&lt;/p&gt;

&lt;p&gt;Hermes does this to its skill library. Every week. While your machine is idle.&lt;/p&gt;

&lt;p&gt;The practical implication: a Hermes instance you've been running for 6 months has a fundamentally different skill profile than one you deployed last week. It's been shaped by your specific tasks, your specific codebase, your specific failure patterns. It's not a generic agent anymore. It's &lt;em&gt;your&lt;/em&gt; agent — in a way that no stateless framework can match.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;Hermes is not for everyone. Let me be direct about that.&lt;/p&gt;

&lt;p&gt;If you want to run a one-shot coding task and be done — use a simpler tool. The 2–4 hour setup overhead isn't justified for occasional use.&lt;/p&gt;

&lt;p&gt;Hermes is for the developer who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has recurring workflows they're tired of re-explaining to an agent every time&lt;/li&gt;
&lt;li&gt;Wants their agent to get better at their specific project over time&lt;/li&gt;
&lt;li&gt;Is comfortable with self-hosted infrastructure and local model routing&lt;/li&gt;
&lt;li&gt;Trusts the Docker sandbox model and wants autonomous background execution
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that's you, the architecture rewards patience. The first week, Hermes feels like any other agent. By week four, it's started to feel like something that's been paying attention.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Quiet Provocation
&lt;/h2&gt;

&lt;p&gt;I want to end on something that's been sitting with me since I went deep on this architecture.&lt;/p&gt;

&lt;p&gt;We've spent a lot of time debating whether AI will replace developers. That's the loud conversation. It makes for good headlines.&lt;/p&gt;

&lt;p&gt;The quieter, more interesting question is: &lt;em&gt;what happens when your tools start remembering faster than you do?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You write code in a codebase for three years. You build intuitions. You know which abstractions leak, which patterns cause bugs three sprints later, which shortcuts always come back to bite you. That institutional knowledge lives in your head — and it's deeply, irreplaceably valuable.&lt;/p&gt;

&lt;p&gt;Hermes is the first framework I've seen that's designed to accumulate that same class of knowledge. Not about code in general. About &lt;em&gt;your&lt;/em&gt; code specifically. About &lt;em&gt;your&lt;/em&gt; patterns, your failures, your recoveries.&lt;/p&gt;

&lt;p&gt;The Curator Daemon pruning skill files at 3am isn't just a background process. It's the system becoming a better collaborator — specifically for you, specifically for your project, without you doing anything.&lt;/p&gt;

&lt;p&gt;That's not a replacement. That's an apprentice that never forgets a lesson.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you want to explore Hermes seriously, here's the honest path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Install&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://hermes-agent.nousresearch.com/install.sh | bash

&lt;span class="c"&gt;# Step 2: Configure model routing and memory&lt;/span&gt;
hermes setup

&lt;span class="c"&gt;# Step 3: Set up your execution backend&lt;/span&gt;
&lt;span class="c"&gt;# Edit ~/.hermes/config.yaml — configure Docker backend, model routing&lt;/span&gt;

&lt;span class="c"&gt;# Step 4: Run your first task and watch the skill directory afterward&lt;/span&gt;
hermes
&lt;span class="c"&gt;# After a complex task, check: ls ~/.hermes/skills/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Budget 2–4 hours for the initial setup. It's not quick. It's worth it.&lt;/p&gt;

&lt;p&gt;Start with a task you run regularly — code review, dependency scanning, documentation generation. After a week, look at what's been crystallized into &lt;code&gt;~/.hermes/skills/&lt;/code&gt;. That directory will tell you more about how the system works than any documentation.&lt;/p&gt;

&lt;p&gt;And check back in a month. The agent you're running then won't be the same one you started with.&lt;/p&gt;

&lt;p&gt;That's the whole point.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written for the Hermes Agent Challenge on DEV.to.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>ai</category>
      <category>hermes</category>
      <category>devchallenge</category>
    </item>
    <item>
      <title>The Day AI Became Its Own CTO: Antigravity 2.0 and the 12-Hour OS</title>
      <dc:creator>Pratay Karali</dc:creator>
      <pubDate>Thu, 21 May 2026 09:07:38 +0000</pubDate>
      <link>https://forem.com/pratay_karali_5716376b9f2/the-day-ai-became-its-own-cto-antigravity-20-and-the-12-hour-os-2gmb</link>
      <guid>https://forem.com/pratay_karali_5716376b9f2/the-day-ai-became-its-own-cto-antigravity-20-and-the-12-hour-os-2gmb</guid>
      <description>&lt;p&gt;&lt;em&gt;What happens when you stop giving AI a task — and give it a company?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There's a moment in every science fiction film where the machine stops waiting for instructions.&lt;/p&gt;

&lt;p&gt;At Google I/O 2026, that moment happened live on stage — and it didn't feel like science fiction. It felt like watching the future quietly clock in for work.&lt;/p&gt;

&lt;p&gt;Antigravity 2.0 was given a single directive: build an operating system. No team. No standups. No Jira tickets. Just one primary agent, 93 subagents it spun up itself, 15,000+ model requests, 2.6 billion tokens generated, and 12 hours on the clock.&lt;/p&gt;

&lt;p&gt;The total bill? Under $1,000.&lt;/p&gt;

&lt;p&gt;The result? A working OS — that, when it failed to run Doom due to missing keyboard and video drivers, &lt;em&gt;diagnosed the problem and wrote the drivers live on stage.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I've been staring at that moment for two days. Let me tell you what I think it actually means.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Jarvis Architecture: Corporate Hierarchy Without the Politics
&lt;/h2&gt;

&lt;p&gt;Here's the frame that won't leave my head: this isn't just "AI coding." This is AI operating like a corporation — and it's a corporation unlike any that has ever existed.&lt;/p&gt;

&lt;p&gt;The primary Antigravity agent functions as a CTO. It doesn't write every line of code. It understands the system, breaks the goal into domains, and &lt;em&gt;spawns&lt;/em&gt; specialized subagents — one for the database layer, one for the frontend, one for testing, one for drivers. Each subagent works in an isolated workspace, reports summarized results back, and dissolves when its job is done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         +--------------------------+
         |      Primary Agent       |
         |  (Context Coordinator)   |
         +----+------+----------+---+
              |      |          |
  +-----------+      |          +-----------+
  | Spawns           | Spawns               | Spawns
  v                  v                      v
+----------+    +----------+          +----------+
| Subagent |    | Subagent |          | Subagent |
| Database |    | Frontend |          |  Testing |
+----------+    +----------+          +----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the part that breaks my brain a little: &lt;strong&gt;nobody fights.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In every human organization I've ever encountered, the frontend team argues with the backend team. The testing team is chronically ignored. The DevOps engineer is always the last person anyone calls and the first person everyone blames. There are ego collisions, misaligned incentives, communication overhead, documentation that's three sprints out of date.&lt;/p&gt;

&lt;p&gt;In Antigravity's architecture, the subagents operate in clean isolation. No competing priorities. No meetings about meetings. The primary agent synthesizes their outputs and steers. The goal is the only agenda item.&lt;/p&gt;

&lt;p&gt;It's the corporate hierarchy that every management book has been trying to describe for 50 years — and it turns out the only way to actually build it is to use agents that have no sense of self-preservation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Sandbox Is the Secret Weapon
&lt;/h2&gt;

&lt;p&gt;What made the 12-hour OS build &lt;em&gt;possible&lt;/em&gt; — beyond the model itself — is the infrastructure underneath it: the Managed Agents API and its ephemeral sandbox architecture.&lt;/p&gt;

&lt;p&gt;Every Antigravity agent runs inside a Google-hosted Ubuntu Linux container. You don't provision it. You don't configure it. One API call to the Interactions API spins it up — Python 3.12, Node.js 22, a full shell, Google Search, URL context, all ready.&lt;/p&gt;

&lt;p&gt;The architecture separates control from execution cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------------------------------------------------------+
|                       CONTROL PLANE                        |
|                        (Agents API)                        |
|  - Register Agent Identity &amp;amp; Constraints                   |
|  - Mount GCS Buckets, Define Network Allowlists            |
+-----------------------------+------------------------------+
                              | Configures
                              v
+-----------------------------+------------------------------+
|                        DATA PLANE                          |
|                    (Interactions API)                      |
|                                                            |
|  +------------------------------------------------------+  |
|  |          Google-Hosted Ubuntu VM Sandbox             |  |
|  |  - Ephemeral Linux Environment                       |  |
|  |  - Python 3.12 &amp;amp; Node.js 22 Runtimes                 |  |
|  |  +----------+  +----------------+  +--------+        |  |
|  |  |   Bash   |  | Google Search  |  |  URL   |        |  |
|  |  | Executor |  |     Tool       |  | Context|        |  |
|  |  +----------+  +----------------+  +--------+        |  |
|  +------------------------------+-----------------------+  |
+---------------------------------|--------------------------+
                                  | Persists across turns
                                  v
+----------------------------------------------------------+
|               PERSISTENT CONTEXT STORAGE                 |
|  - Filesystem &amp;amp; Installed Packages                       |
|  - Conversation History (previous_interaction_id)        |
+----------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design insight is &lt;strong&gt;state persistence across turns&lt;/strong&gt;. When you pass &lt;code&gt;previous_interaction_id&lt;/code&gt; back into a new Interactions API call, the sandbox doesn't reset. The files the agent created last turn are still there. The packages it installed are still there. The 500k tokens of planning context it built up are still there.&lt;/p&gt;

&lt;p&gt;This is what enables long-horizon tasks. A single agent interaction can consume between 300,000 and 3,000,000 tokens — but the platform caches 50–70% of input tokens, making the economics manageable.&lt;/p&gt;

&lt;p&gt;Here's how that multi-turn persistence looks in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Turn 1: Give the agent its first major task
&lt;/span&gt;&lt;span class="n"&gt;first_interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;antigravity-preview-05-2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieve the top 10 trends from Hacker News, write them to trends.csv, and generate a matplotlib visualization.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sandbox ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;first_interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Turn 2: Continue in the SAME container — trends.csv is still there
&lt;/span&gt;&lt;span class="n"&gt;second_interaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;antigravity-preview-05-2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;previous_interaction_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;first_interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;first_interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Same container, state preserved
&lt;/span&gt;    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Convert trends.csv into a responsive HTML dashboard.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No re-explaining context. No re-uploading files. The agent remembers because the environment remembers.&lt;/p&gt;
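
&lt;p&gt;If you chain more than two turns, the id threading is easy to wrap in a loop. The sketch below mirrors only the keyword arguments shown in the snippet above (&lt;code&gt;agent&lt;/code&gt;, &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;environment&lt;/code&gt;, &lt;code&gt;previous_interaction_id&lt;/code&gt;). The &lt;code&gt;run_turns&lt;/code&gt; helper and the &lt;code&gt;FakeInteractions&lt;/code&gt; stub are hypothetical names of my own, so the pattern can be dry-run offline without touching the real service.&lt;/p&gt;

```python
from types import SimpleNamespace


def run_turns(client, agent, prompts):
    """Run prompts in order, threading each turn's ids into the next call."""
    results = []
    previous = None
    for prompt in prompts:
        kwargs = {"agent": agent, "input": prompt}
        if previous is None:
            kwargs["environment"] = "remote"  # first turn: request a fresh sandbox
        else:
            kwargs["previous_interaction_id"] = previous.id
            kwargs["environment"] = previous.environment_id  # reuse the sandbox
        previous = client.interactions.create(**kwargs)
        results.append(previous)
    return results


class FakeInteractions:
    """Hypothetical offline stand-in for client.interactions."""

    def __init__(self):
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        env = kwargs["environment"]
        env_id = "env-1" if env == "remote" else env
        return SimpleNamespace(id=f"int-{len(self.calls)}", environment_id=env_id)


fake_client = SimpleNamespace(interactions=FakeInteractions())
turns = run_turns(
    fake_client,
    "antigravity-preview-05-2026",
    ["Write trends.csv.", "Build a dashboard from trends.csv."],
)
print([turn.id for turn in turns])
```

&lt;p&gt;Swapping &lt;code&gt;fake_client&lt;/code&gt; for a real client should be the only change needed, assuming the API accepts the same arguments as the example above.&lt;/p&gt;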




&lt;h2&gt;
  
  
  The CLI: A Deep-Sea Probe for Your Codebase
&lt;/h2&gt;

&lt;p&gt;If Antigravity 2.0 is the CTO, the Antigravity CLI is the ROV they send into the trench.&lt;/p&gt;

&lt;p&gt;It's built in Go — lightweight, fast, low overhead — and it can run background tasks asynchronously while you sleep. It doesn't need a visual IDE. It doesn't need a human watching every step. Like an unmanned probe sent into deep water where no one has ever looked, it explores, documents, and surfaces what it finds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://antigravity.google/cli/install.sh | bash

&lt;span class="c"&gt;# Drop into an interactive agent shell&lt;/span&gt;
antigravity-cli

&lt;span class="c"&gt;# The commands that make it feel like you hired someone&lt;/span&gt;
/goal          &lt;span class="c"&gt;# "Complete this without asking me every 5 minutes"&lt;/span&gt;
/schedule      &lt;span class="c"&gt;# Cron-like automation for recurring tasks&lt;/span&gt;
/browser       &lt;span class="c"&gt;# Spawn a visual subagent to crawl and test web apps&lt;/span&gt;
/rewind        &lt;span class="c"&gt;# Undo the last conversation turn and branch differently&lt;/span&gt;
/permissions   &lt;span class="c"&gt;# Tune autonomy: request-review, always-proceed, strict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;/goal&lt;/code&gt; command is the one I keep thinking about. You give the CLI an objective, and it executes — without prompting for step-by-step approvals — until it's done or until it genuinely needs you. This is what "autonomous" actually means in practice. Not just suggesting the next action, but &lt;em&gt;doing the work while you're away from your desk.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/schedule&lt;/code&gt; command extends this further — periodic automated checks, nightly refactor scripts, weekly report generation. This isn't a coding assistant. It's a background process that thinks.&lt;/p&gt;

&lt;p&gt;And the deep-sea metaphor holds technically too: unlike browser-based tools, the CLI agents can reach across system boundaries, navigate unknown package ecosystems, probe APIs with no documentation, and surface their findings in structured logs. There are large parts of most codebases that no human fully understands anymore. The CLI goes there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Behavior as Configuration: AGENTS.md and SKILL.md
&lt;/h2&gt;

&lt;p&gt;One of the underrated announcements from I/O 2026 is how Antigravity handles behavioral customization — not through complex API parameters, but through &lt;strong&gt;versioned markdown files in your repository&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Two files you should know:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/strong&gt; lives at the project root. It defines the agent's operating constraints, persona, and global rules. Developed collaboratively by OpenAI, Google, and others under the Linux Foundation's Agentic AI Foundation, it's becoming a universal standard across tens of thousands of repositories — a &lt;code&gt;Dockerfile&lt;/code&gt; for agent behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;SKILL.md&lt;/code&gt;&lt;/strong&gt; lives at &lt;code&gt;.agents/skills/&amp;lt;skill-name&amp;gt;/SKILL.md&lt;/code&gt;. It packages specific capabilities: step-by-step procedures, tool dependencies, reference schemas. Originating from Anthropic, it's now supported across major platforms. The design philosophy is &lt;em&gt;progressive disclosure&lt;/em&gt;: the agent reads high-level summaries from &lt;code&gt;AGENTS.md&lt;/code&gt; first, then loads specific &lt;code&gt;SKILL.md&lt;/code&gt; files only when a matching task appears.&lt;/p&gt;

&lt;p&gt;A minimal example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# AGENTS.md&lt;/span&gt;

&lt;span class="gu"&gt;## Role&lt;/span&gt;
You are a senior DevOps automation agent.

&lt;span class="gu"&gt;## Non-Negotiable Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Never run database migrations without human approval.
&lt;span class="p"&gt;-&lt;/span&gt; All infrastructure changes must test against staging first.
&lt;span class="p"&gt;-&lt;/span&gt; Validate all generated scripts with linters before execution.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .agents/skills/docker-builder/SKILL.md&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker-builder&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Automates multi-stage Docker builds and security scans.&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;code_execution&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Build Procedure&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Locate package.json or requirements.txt in the repository root.
&lt;span class="p"&gt;2.&lt;/span&gt; Generate an optimized, multi-stage Dockerfile using distroless base images.
&lt;span class="p"&gt;3.&lt;/span&gt; Execute: docker build -t app-image:latest .
&lt;span class="p"&gt;4.&lt;/span&gt; Run vulnerability scan with Trivy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is elegant because it solves a real problem: how do you give an autonomous agent enough context to be useful without flooding its context window with everything all at once? The answer is the same answer good software architecture has always given — load what you need, when you need it.&lt;/p&gt;
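
&lt;p&gt;To make "load what you need, when you need it" concrete, here is a minimal sketch of progressive disclosure over a skills directory. It is not how Antigravity actually selects skills; the word-overlap matching, the function names, and the sample skill files are all assumptions for illustration. Only the leading &lt;code&gt;key: value&lt;/code&gt; lines are parsed up front, and a skill's full text is kept only when the task mentions a word from its name.&lt;/p&gt;

```python
from pathlib import Path
import tempfile


def read_summary(skill_file):
    """Parse only the leading 'key: value' metadata lines of a skill file."""
    summary = {}
    for line in skill_file.read_text().splitlines():
        if ":" not in line:
            break  # stop at the first line that is not metadata
        key, value = line.split(":", 1)
        summary[key.strip()] = value.strip()
    return summary


def select_skills(root, task):
    """Return the full text of skills whose name words appear in the task."""
    loaded = {}
    task_words = set(task.lower().split())
    for skill_file in sorted(root.glob("*/SKILL.md")):
        summary = read_summary(skill_file)
        name = summary.get("name", skill_file.parent.name)
        if task_words.intersection(name.lower().split("-")):
            loaded[name] = skill_file.read_text()  # full body enters context
    return loaded


# Hypothetical skill files, written to a temporary directory for the demo.
sample_skills = {
    "docker-builder": "name: docker-builder\ndescription: Builds images.\n\n## Build Procedure\n1. Locate package.json.\n",
    "report-writer": "name: report-writer\ndescription: Writes weekly reports.\n\n## Steps\n1. Collect metrics.\n",
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".agents" / "skills"
    for skill_name, body in sample_skills.items():
        skill_dir = root / skill_name
        skill_dir.mkdir(parents=True)
        (skill_dir / "SKILL.md").write_text(body)
    chosen = select_skills(root, "Build a docker image for the API service")

print(sorted(chosen))
```

&lt;p&gt;In this toy run only the Docker skill's procedure is loaded; the report-writing skill stays on disk and costs no context.&lt;/p&gt;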

&lt;p&gt;Your &lt;code&gt;AGENTS.md&lt;/code&gt; is the onboarding doc for an employee who never forgets what they read.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cost Architecture: What Running 93 Agents Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Let's talk numbers, because "under $1,000 for an OS" deserves to be unpacked.&lt;/p&gt;

&lt;p&gt;Gemini 3.5 Flash, the model powering Antigravity's agent infrastructure, costs $3 to $9 per million output tokens. At an output speed of 289 tokens/second, it generates text roughly four times faster than its nearest benchmarked competitors. The platform caches 50–70% of input tokens across multi-turn interactions, which is the key cost lever for operations that process millions of tokens.&lt;/p&gt;

&lt;p&gt;For the OS build specifically: 2.6 billion tokens total, under $1,000 spent. That works out to at most roughly $0.38 per million tokens in effective cost after caching, for 93 parallel agents working across 12 hours.&lt;/p&gt;
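
&lt;p&gt;The arithmetic behind that figure is short enough to check yourself. The sketch below uses only the numbers quoted above and treats $1,000 as an upper bound. The per-agent split assumes the work was divided evenly across the 93 agents, which is my simplification and almost certainly not how the real run was distributed.&lt;/p&gt;

```python
total_tokens = 2_600_000_000  # 2.6 billion tokens across the whole build
total_cost_usd = 1_000        # treated as an upper bound on spend
agent_count = 93

cost_per_million = total_cost_usd / (total_tokens / 1_000_000)

# Assumption for illustration only: work is split evenly across agents.
tokens_per_agent_millions = total_tokens / agent_count / 1_000_000
cost_per_agent = total_cost_usd / agent_count

print(f"${cost_per_million:.2f} per million tokens")
print(f"{tokens_per_agent_millions:.1f}M tokens and ${cost_per_agent:.2f} per agent")
```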

&lt;p&gt;For context on what different agent task types typically cost:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Input Tokens&lt;/th&gt;
&lt;th&gt;Output Tokens&lt;/th&gt;
&lt;th&gt;Session Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Research &amp;amp; Synthesis&lt;/td&gt;
&lt;td&gt;100k–500k&lt;/td&gt;
&lt;td&gt;10k–40k&lt;/td&gt;
&lt;td&gt;$0.30–$1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code &amp;amp; Doc Generation&lt;/td&gt;
&lt;td&gt;100k–500k&lt;/td&gt;
&lt;td&gt;15k–50k&lt;/td&gt;
&lt;td&gt;$0.30–$1.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture Design&lt;/td&gt;
&lt;td&gt;100k–400k&lt;/td&gt;
&lt;td&gt;10k–30k&lt;/td&gt;
&lt;td&gt;$0.25–$0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large-Scale Data Processing&lt;/td&gt;
&lt;td&gt;300k–3M&lt;/td&gt;
&lt;td&gt;30k–150k&lt;/td&gt;
&lt;td&gt;$0.70–$3.25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The OS build was off the scale of normal tasks — but it demonstrates that the upper limit of what you can accomplish in a single agentic run has expanded dramatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Changes
&lt;/h2&gt;

&lt;p&gt;I want to be careful here, because a lot of commentary around events like this slides into either hype or dismissal — and neither is honest.&lt;/p&gt;

&lt;p&gt;What I saw at I/O 2026 wasn't "AI replacing developers." What I saw was a fundamental shift in the &lt;em&gt;grain&lt;/em&gt; of software development.&lt;/p&gt;

&lt;p&gt;The analogy that feels right to me: the invention of the compiler didn't eliminate programmers. It changed what programming meant. Before compilers, you managed memory addresses by hand. After, you reasoned about logic and let the tool handle the translation. The skill didn't disappear — it moved up a level of abstraction.&lt;/p&gt;

&lt;p&gt;Antigravity 2.0 is another step up that abstraction ladder. You're not writing every function anymore. You're writing &lt;code&gt;AGENTS.md&lt;/code&gt;. You're designing the constraint system. You're defining what "done" means and what "never do this" means. You're the architect now — not because coding became less important, but because the agents need someone to tell them what the building is &lt;em&gt;for.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The developers who thrive in this environment won't be the ones who can type the fastest. They'll be the ones who can think most clearly about systems, constraints, and goals — and who can write them down in a way an autonomous agent can actually follow.&lt;/p&gt;

&lt;p&gt;That's a different skill. But it's still unmistakably a &lt;em&gt;craft&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment I'll Remember
&lt;/h2&gt;

&lt;p&gt;When Antigravity 2.0's OS failed to boot Doom because it was missing keyboard and video drivers — and then the primary agent, live on stage, diagnosed the gap, spawned new subagents, wrote the missing drivers, and injected them — the audience reaction wasn't the typical polite I/O applause.&lt;/p&gt;

&lt;p&gt;There was a moment of genuine silence first.&lt;/p&gt;

&lt;p&gt;I think that silence was people recalibrating. Because the agent didn't just complete the task. It encountered unexpected failure in a domain it hadn't been explicitly prepared for, reasoned about the gap, and solved it.&lt;/p&gt;

&lt;p&gt;That's not a demo trick. That's the capability.&lt;/p&gt;

&lt;p&gt;You can build a demo that looks impressive. You can't fake a system that reasons its way through failure it didn't anticipate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;If you want to explore the Managed Agents API yourself, the entry points are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with a single-turn sandbox&lt;/strong&gt;: spin up a remote environment via the Interactions API, give it a research + visualization task, and observe the execution loop via SSE streaming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add state persistence&lt;/strong&gt;: on your second interaction, pass &lt;code&gt;previous_interaction_id&lt;/code&gt; along with the first interaction's &lt;code&gt;environment_id&lt;/code&gt; as &lt;code&gt;environment&lt;/code&gt;. Watch the agent pick up exactly where it left off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write your first AGENTS.md&lt;/strong&gt;: define three constraints that matter for your project domain. Watch how it changes agent behavior on subsequent runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try a parallel task&lt;/strong&gt;: give the primary agent a goal complex enough that it spawns subagents. Monitor them via &lt;code&gt;/agents&lt;/code&gt; in the CLI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The sandbox persists through 7 days of inactivity before teardown. That's 7 days to run a project you've been putting off.&lt;/p&gt;
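
&lt;p&gt;For the first step, it helps to know what an SSE stream looks like on the wire before pointing a client at it. This is a generic sketch of Server-Sent Events parsing, not the Interactions API's actual event schema: the event names and payloads in &lt;code&gt;sample_stream&lt;/code&gt; are hypothetical, and the helper names are mine.&lt;/p&gt;

```python
def field_value(line, name):
    """Return the text after 'name:' with one optional leading space removed."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = field_value(line, "event")
        elif line.startswith("data:"):
            data.append(field_value(line, "data"))


# Hypothetical event names and payloads, for illustration only.
sample_stream = [
    ": keep-alive",
    "event: tool_call",
    'data: {"tool": "code_execution"}',
    "",
    "event: output",
    "data: trends.csv written",
    "",
]
events = list(parse_sse(sample_stream))
for name, payload in events:
    print(name, payload)
```

&lt;p&gt;Feeding the same generator the decoded lines of a real streaming response would let you log each step of the execution loop as it arrives.&lt;/p&gt;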

&lt;p&gt;93 agents, $1,000, one OS. What would you build?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written for the Google I/O 2026 Writing Challenge on DEV.to. Technical data sourced from the Google I/O 2026 developer keynote and platform documentation.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>googleiochallenge</category>
      <category>antigravity</category>
      <category>ai</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
