Forem: Pratay Karali

Why Your Voice Agent Won't Stop Talking: Building the Zero-Latency Interruption Layer with Gemma 4 E2B

Pratay Karali — Thu, 21 May 2026 13:48:02 +0000

The most human problem in AI — and the architecture that finally solves it.

You know the moment.

You're talking to a voice assistant. It starts giving you an answer you didn't need. You try to cut it off. You say "wait—" or "no, actually—" and it just keeps going. Plowing straight through your sentence. Talking over you like an oblivious colleague who hasn't noticed everyone else went quiet.

You eventually fall silent. You wait for it to finish. You try again.

This isn't a minor annoyance. It's the conversational uncanny valley — and it's the reason most voice AI products feel broken even when the underlying model is genuinely smart. The model might generate perfect answers. But the architecture doesn't know when to stop.

Gemma 4 E2B changes the foundational conditions of this problem. And in this guide, we're going to build the interruption layer that finally eliminates it.

The Enemy: Sequential Latency

Before we fix anything, we need to understand exactly what creates the problem.

Traditional voice pipelines look like this:

Microphone → [STT Model] → text → [LLM] → text → [TTS Model] → Speaker

Each arrow is a waiting period. The Speech-to-Text model needs to buffer enough audio to transcribe. The LLM needs the full transcription before it can begin inference. The TTS needs the full LLM response before synthesis begins. You're not having a conversation — you're exchanging documents with an extremely fast filing system.

Compounding this: the entire pipeline is typically synchronous. While the TTS is speaking, nothing is listening. The microphone input buffer fills up. Your interruption gets queued somewhere behind three seconds of audio the system has already committed to playing.

By the time the agent "hears" you said "wait" — it's been saying something else for two full seconds. The uncanny valley isn't a UI problem. It's an architecture problem.

The Weapon: Gemma 4 E2B's Native Audio Encoder

Released by Google DeepMind on April 2, 2026, under Apache 2.0, Gemma 4 E2B is a 2.3 billion effective parameter model with something no model in its weight class has ever had: a native 300-million parameter audio encoder baked directly into the architecture.

This single design decision eliminates the first bottleneck entirely.

Instead of:

Raw audio → STT model → text string → LLM

Gemma 4 E2B does:

Raw 16kHz audio waveform → mel-spectrogram → embedding space → LLM

No intermediate transcription. No text flattening that destroys vocal intonation, prosody, and emotional inflection. The model hears how you said something, not just what you said.

The feature extractor processes audio using a 20ms frame length (320 samples at 16kHz) with a 10ms hop length, converting the waveform directly into mel-frequency spectrograms that project straight into the model's embedding space. The model accepts up to 30 seconds of audio per interaction — easily enough for complex, natural queries.
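To make the framing numbers concrete, here is a minimal sketch of the arithmetic. It assumes plain framing with no edge padding, which may differ from what the actual feature extractor does; `frame_count` is a name made up for this illustration.

```python
SAMPLE_RATE = 16_000  # Hz, from the text
FRAME_MS = 20         # frame length, from the text
HOP_MS = 10           # hop length, from the text


def frame_count(duration_s: float) -> int:
    """Number of full frames in a clip, assuming no padding at the edges."""
    n_samples = int(duration_s * SAMPLE_RATE)
    frame_len = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples
    hop_len = SAMPLE_RATE * HOP_MS // 1000      # 160 samples
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len


print(frame_count(1.0))   # 99
print(frame_count(30.0))  # 2999
```

So a maximum-length 30 second clip yields roughly three thousand spectrogram frames before the encoder does any downsampling of its own.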

Here's what makes the E2B variant special in the Gemma 4 family:

Variant	Effective Params	Context	Target Hardware	Audio?
E2B	2.3B	128K	Phones, RPi 5, Laptops	✅ Yes
E4B	4.5B	128K	High-end phones, edge servers	✅ Yes
26B A4B	~4B active (MoE)	256K	RTX 4090/5090	❌ No
31B	30.7B	256K	A100/H100	❌ No

The native audio encoder exists only in the E2B and E4B variants. If you're building a real-time voice agent for the smallest hardware, E2B is the natural choice. Under 4-bit quantization (Q4_K_M), it runs in 2–3 GB of RAM, fitting comfortably on a Raspberry Pi 5 or any modern laptop.

The architecture achieves this through three innovations worth understanding:

Per-Layer Embeddings (PLE): Instead of one embedding matrix at the input, every decoder block gets its own specialized embedding slice. These function as memory lookups rather than matrix multiplications — so the model accesses 5.1 billion parameters worth of knowledge while only activating 2.3 billion per token. Fast inference, deep intelligence.

Hybrid Attention (4:1 Local/Global): Rather than full global attention across 128K tokens (which scales quadratically — catastrophic on edge hardware), E2B applies local sliding-window attention (512 tokens) for four consecutive layers, then one full global attention layer. 35 layers total, always ending global. The RULER benchmark puts long-context recall at 66.4% at 128K depth — versus 13.5% in prior generations.

p-RoPE: Proportional Rotary Position Embeddings dedicate a subset of dimensions strictly to positional data, leaving 75% as clean content channels. This prevents the catastrophic forgetting that typically afflicts long voice conversations.
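The 4:1 local/global interleaving described under Hybrid Attention is easy to picture as a list. This is only a sketch of the layer ordering as the text describes it, not the model's actual configuration code; `layer_pattern` is a hypothetical helper.

```python
def layer_pattern(num_layers: int = 35, local_per_global: int = 4) -> list[str]:
    """Repeat `local_per_global` local layers followed by one global layer."""
    period = local_per_global + 1
    return [
        "global" if (i + 1) % period == 0 else "local"
        for i in range(num_layers)
    ]


pattern = layer_pattern()
print(pattern[:5])  # ['local', 'local', 'local', 'local', 'global']
print(pattern.count("local"), pattern.count("global"))  # 28 7
```

With 35 layers the pattern divides evenly into seven groups of five, which is why the stack ends on a global layer.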

The Gatekeeper: Silero VAD

Eliminating STT latency is half the battle. The other half is knowing when to listen.

You cannot feed a raw, open microphone stream into an LLM. It will constantly trigger on ambient noise, keyboard clicks, HVAC hum, your dog. Every false trigger is a wasted inference cycle. On consumer hardware, that means freezing the application.

Enter Silero VAD.

One megabyte. Trained on 100+ languages. Forward pass executes in under one millisecond on a single CPU thread. It returns a speech probability between 0.0 and 1.0 for every 32ms audio chunk (512 samples at 16kHz).

The critical detail is hysteresis — raw probability alone causes rapid toggling. You need:

Speech start: probability > 0.5 for at least 250ms of consecutive audio
Speech end: silence for at least 500ms

These thresholds prevent a door slam from triggering a full inference cycle, and prevent a natural mid-sentence pause from prematurely ending the user's utterance.
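The start and end rules above can be sketched as a small state machine that sits on top of the per-chunk probabilities. `HysteresisGate` is a hypothetical helper written for this illustration, not part of Silero, and it assumes one probability arrives per 32ms chunk.

```python
from dataclasses import dataclass
from typing import Optional

CHUNK_MS = 32  # one 512-sample chunk at 16 kHz


@dataclass
class HysteresisGate:
    threshold: float = 0.5
    min_speech_ms: int = 250
    min_silence_ms: int = 500
    speaking: bool = False
    run_ms: int = 0  # length of the current run that contradicts `speaking`

    def update(self, prob: float) -> Optional[str]:
        """Feed one chunk probability; return 'start', 'end', or None."""
        is_speech = prob > self.threshold
        if is_speech != self.speaking:
            self.run_ms += CHUNK_MS   # evidence for a state change accumulates
        else:
            self.run_ms = 0           # evidence was interrupted, start over
        if not self.speaking and self.run_ms >= self.min_speech_ms:
            self.speaking, self.run_ms = True, 0
            return "start"
        if self.speaking and self.run_ms >= self.min_silence_ms:
            self.speaking, self.run_ms = False, 0
            return "end"
        return None


gate = HysteresisGate()
events = [gate.update(0.9) for _ in range(8)]
print(events[-1])  # start
```

Note the tradeoff: requiring 250ms of speech delays the "start" event by that much, so a stricter start threshold buys fewer false triggers at the cost of slower interruption.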

The Architecture: Four Threads, One Goal

Here's the core insight: the reason voice agents fail at interruption is that they're architecturally single-threaded in spirit even when multi-threaded in implementation. The microphone, the VAD, the LLM, and the speaker all wait for each other.

Our system has four completely decoupled components communicating via thread-safe queues and a single shared threading.Event() flag:

┌─────────────────────────────────────────────────────────────┐
│                    SYSTEM ARCHITECTURE                       │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  PyAudio     │    │  Silero VAD  │    │  Gemma 4 E2B │  │
│  │  C-Thread    │───▶│  Gatekeeper  │───▶│  Inference   │  │
│  │  (non-block) │    │  (CPU only)  │    │  Engine      │  │
│  └──────────────┘    └──────┬───────┘    └──────┬───────┘  │
│                             │                   │           │
│                    interrupt_event.set()    response text   │
│                             │                   │           │
│                             ▼                   ▼           │
│                    ┌──────────────────────────────────┐     │
│                    │     TTS Output Worker Thread     │     │
│                    │  polls interrupt_event every     │     │
│                    │  50ms chunk → instant flush      │     │
│                    └──────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────┘

The interrupt_event is the entire system's nervous system. The moment Silero detects speech onset, it fires. The TTS worker is polling that flag on every 50ms audio chunk write. The instant it fires — mid-syllable if necessary — the TTS queue is flushed. The agent goes silent. The user has the floor.

Building It: Phase by Phase

Environment Setup

pip install -U transformers torch accelerate onnxruntime
pip install pyaudio soundfile silero-vad bitsandbytes

onnxruntime lets Silero run on the ONNX Runtime backend for efficient CPU inference. bitsandbytes handles 4-bit quantization, which brings the model down to ~3.5GB of VRAM.

Phase 1: The VAD Gatekeeper (Non-Blocking Audio Capture)

A blocking read loop is the enemy of real-time audio. If you call stream.read() in a while True loop on the same thread that runs LLM inference, nothing drains the microphone while the model is generating. The microphone buffer overflows. Interruptions get lost.

The solution: PyAudio's non-blocking stream_callback. PortAudio drives capture from its own C-level thread and invokes the callback for each buffer. The callback body is still Python and briefly takes the GIL, which is why it has to stay tiny.

import torch
import numpy as np
import pyaudio
import queue
import threading

class VADGatekeeper:
    def __init__(self, sample_rate=16000, chunk_size=512):
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size

        # Silero VAD via ONNX backend — ~1MB, <1ms per forward pass
        self.model, utils = torch.hub.load(
            repo_or_dir='snakers4/silero-vad',
            model='silero_vad',
            force_reload=False,
            onnx=True
        )
        (self.get_speech_timestamps, _, _, self.VADIterator, _) = utils

        # Stateful iterator: 0.5 threshold, 500ms silence to confirm end
        self.vad_iterator = self.VADIterator(
            self.model,
            threshold=0.5,
            min_silence_duration_ms=500
        )

        self.audio_queue = queue.Queue()
        self.is_speaking = False
        self.interrupt_event = threading.Event()
        self.current_utterance_frames = []

    def audio_callback(self, in_data, frame_count, time_info, status):
        """
        Executed in a C-level PortAudio thread.
        Deposits audio bytes into queue and returns INSTANTLY.
        """
        self.audio_queue.put(in_data)
        return (None, pyaudio.paContinue)

The audio_callback does exactly one thing: put bytes in the queue and return. The callback executes every 32ms. It never blocks. It never waits for LLM inference to finish. The microphone is always listening.

Phase 2: The Gemma 4 E2B Inference Engine

Loading E2B in native 16-bit precision requires 10+ GB VRAM. On consumer hardware, we use BitsAndBytesConfig for 4-bit NF4 quantization — collapsing that to ~3.5 GB:

from transformers import AutoProcessor, AutoModelForMultimodalLM, BitsAndBytesConfig

class GemmaVoiceEngine:
    def __init__(self, model_id="google/gemma-4-E2B-it"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )

        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = AutoModelForMultimodalLM.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            device_map="auto"
        )
        self.conversation_history = []

    def generate_response(self, audio_numpy_array):
        # CRITICAL: audio content block MUST precede text block
        user_message = {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_numpy_array},
                {"type": "text", "text": "Please respond to the audio input concisely."}
            ]
        }
        self.conversation_history.append(user_message)

        inputs = self.processor.apply_chat_template(
            self.conversation_history,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
            add_generation_prompt=True
        ).to(self.device)

        input_len = inputs["input_ids"].shape[-1]
        outputs = self.model.generate(**inputs, max_new_tokens=256)

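        # generate() returns a (batch, seq_len) tensor: take the first
        # sequence, then slice off the prompt tokens to keep only the reply.
        response_text = self.processor.decode(
            outputs[0][input_len:],
            skip_special_tokens=True
        )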
        self.conversation_history.append(
            {"role": "assistant", "content": response_text}
        )
        return response_text

One non-obvious detail: the {"type": "audio"} block must physically precede the {"type": "text"} block in the content array. This is an architectural requirement of Gemma 4's multimodal formatting — the audio token expansions need to be computed before the text instructions are interpolated. Getting this wrong causes silent inference failures.

When apply_chat_template is called, the processor runs the mel-spectrogram computation, determines the exact number of audio tokens based on waveform duration, and stitches the audio representations into the prompt in place of the structural placeholder. The complexity of audio tokenization is entirely abstracted.

Phase 3: The Output Controller (The Flush Mechanism)

This is where interruption actually happens. The TTS worker polls interrupt_event on every single audio chunk before writing to the speaker:

class OutputController:
    def __init__(self):
        self.audio_out = pyaudio.PyAudio()
        self.out_stream = self.audio_out.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=24000,
            output=True
        )
        self.playback_queue = queue.Queue()

    def tts_playback_worker(self, interrupt_event):
        while True:
            # Check BEFORE dequeuing
            if interrupt_event.is_set():
                print("[Output] Interruption. Flushing queue.")
                while not self.playback_queue.empty():
                    try:
                        self.playback_queue.get_nowait()
                    except queue.Empty:
                        break
                interrupt_event.clear()
                continue

            try:
                audio_chunk = self.playback_queue.get(timeout=0.05)

                # Check AGAIN immediately before hardware write
                if interrupt_event.is_set():
                    continue

                # Blocking write to physical speaker
                self.out_stream.write(audio_chunk)

            except queue.Empty:
                continue

The double-check pattern (once before dequeuing, once before the hardware write) closes the race condition window about as tightly as a polling design allows. The worker adds at most one audio chunk of delay, 50 milliseconds, on top of whatever audio the output device has already buffered. That's imperceptible to humans.

Feed 50ms TTS chunks into playback_queue. The agent stops mid-syllable when interrupted. This is the uncanny valley fix.
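Most TTS engines hand back a whole utterance, so you will probably need to slice it yourself. Here is a minimal sketch, assuming 16-bit mono PCM at the 24kHz output rate used above; `split_pcm` is a hypothetical helper, not part of any TTS library.

```python
def split_pcm(pcm: bytes, sample_rate: int = 24_000, chunk_ms: int = 50,
              bytes_per_sample: int = 2):
    """Yield consecutive slices of 16-bit mono PCM, each at most chunk_ms long."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample  # 2400 bytes
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


one_second = bytes(24_000 * 2)      # 1 s of silence as placeholder audio
chunks = list(split_pcm(one_second))
print(len(chunks), len(chunks[0]))  # 20 2400
```

In the worker above you would push each slice with output_ctrl.playback_queue.put(chunk), so the interrupt flag gets checked once per 50ms of audio.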

Phase 4: The Orchestration Loop

Everything unified — the moment speech starts, the interrupt fires. When silence confirms the utterance is complete, inference runs:

def main_orchestration_loop():
    gatekeeper = VADGatekeeper()
    agent = GemmaVoiceEngine()
    output_ctrl = OutputController()

    tts_thread = threading.Thread(
        target=output_ctrl.tts_playback_worker,
        args=(gatekeeper.interrupt_event,),
        daemon=True
    )
    tts_thread.start()

    p = pyaudio.PyAudio()
    mic_stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=gatekeeper.sample_rate,
        input=True,
        frames_per_buffer=gatekeeper.chunk_size,
        stream_callback=gatekeeper.audio_callback  # Non-blocking C thread
    )
    mic_stream.start_stream()
    print("Listening. Speak freely — interrupt anytime.")

    try:
        while True:
            raw_chunk = gatekeeper.audio_queue.get()

            # Normalize bytes → float32 [-1.0, 1.0] for VAD
            audio_data = np.frombuffer(raw_chunk, dtype=np.int16)
            audio_float32 = audio_data.astype(np.float32) / 32768.0
            tensor_chunk = torch.from_numpy(audio_float32)

            speech_dict = gatekeeper.vad_iterator(tensor_chunk)

            if speech_dict:
                if 'start' in speech_dict:
                    print("Speech detected — firing interrupt...")
                    gatekeeper.interrupt_event.set()  # TTS stops NOW
                    gatekeeper.is_speaking = True
                    gatekeeper.current_utterance_frames = []

                elif 'end' in speech_dict:
                    print("Utterance complete. Processing...")
                    gatekeeper.is_speaking = False
                    gatekeeper.current_utterance_frames.append(audio_float32)

                    full_waveform = np.concatenate(
                        gatekeeper.current_utterance_frames
                    )
                    response = agent.generate_response(full_waveform)
                    print(f"Agent: {response}")
                    # Route response to your TTS generator here
                    gatekeeper.current_utterance_frames = []

            if gatekeeper.is_speaking:
                gatekeeper.current_utterance_frames.append(audio_float32)

    except KeyboardInterrupt:
        mic_stream.stop_stream()
        mic_stream.close()
        p.terminate()

if __name__ == "__main__":
    main_orchestration_loop()

The timing budget: Silero flags speech onset within one 512-sample chunk, which is 32 milliseconds at 16kHz. interrupt_event.set() becomes visible to the TTS worker within microseconds. The TTS flush completes within one chunk cycle, 50ms. Worst-case interruption latency: roughly 82ms (32ms + 50ms) from first syllable to silence, not counting output-device buffering.

Human perception of conversational delay becomes noticeable around 200ms. We're well inside that window.

The Memory Constraint Nobody Mentions

There's a hard reality that documentation often glosses over: the 128K context window is theoretically available, but practically unreachable on consumer hardware.

The Gemma 4 E2B KV cache allocates approximately 490 KB per token due to its dense 256 head dimension. Filling 128K tokens would require 60+ GB of VRAM for the cache alone. On a machine with 16 GB unified memory, you can safely operate up to roughly 8,000 tokens of conversation history.
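You can sanity-check those numbers in a few lines. This sketch takes the ~490 KB per token figure from the text at face value and ignores model weights and activations.

```python
KV_BYTES_PER_TOKEN = 490 * 1024  # ~490 KB per token, the figure quoted above


def kv_cache_gib(num_tokens: int) -> float:
    """KV cache size in GiB for a given number of cached tokens."""
    return num_tokens * KV_BYTES_PER_TOKEN / 1024**3


print(f"{kv_cache_gib(128 * 1024):.2f} GiB")  # 61.25 GiB
print(f"{kv_cache_gib(8_000):.2f} GiB")       # 3.74 GiB
```

At 8,000 tokens the cache alone is already close to 4 GiB, which on a 16 GB machine has to share space with the weights, the OS, and everything else.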

This means you need aggressive context pruning:

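# Keep the system prompt + last N turns
MAX_HISTORY_TOKENS = 6000

def prune_history(self):
    # Intended as a method on GemmaVoiceEngine. Assumes that
    # conversation_history[0] is a system prompt and that estimate_tokens()
    # is a helper you supply (it is not defined in this guide).
    while self.estimate_tokens() > MAX_HISTORY_TOKENS:
        # Nothing left to trim: stop, otherwise this loop never terminates
        if len(self.conversation_history) <= 2:
            break
        # Remove oldest user/assistant turn pair
        self.conversation_history.pop(1)
        self.conversation_history.pop(1)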

For voice agents requiring hours of continuous memory (think customer service bots or long-form interview assistants), migrate the backend from transformers to llama.cpp with GGUF format. llama.cpp can store the KV cache itself in a quantized type such as q4_0, which cuts cache memory to a fraction of its 16-bit size at a small cost in accuracy. For production long-session voice deployments, treat this as a required optimization.

Beyond VAD: Semantic Interruption

The system above is production-ready for most use cases. But there's one failure mode to know about: backchannels.

When a user murmurs "mm-hmm" or "right" in passive agreement, Silero correctly detects acoustic energy and fires the interrupt. The agent stops speaking. The user wasn't actually interrupting — they were listening.

The fix: semantic yielding.

Instead of a destructive flush on speech onset, make it a reversible pause:

Pause the TTS stream (don't flush) when VAD fires
Capture the short interjection
Classify via a fast sub-billion parameter model:
- Backchannel ("mm-hmm", "right", "okay") → unpause, resume seamlessly
- Genuine interruption ("wait", "stop", "what about—") → flush, route to E2B
Classification adds ~80-100ms overhead, still within the imperceptible window

This elevates the system from prototype to something that genuinely feels like a conversation.
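Here is a rough sketch of the pause-then-decide flow. Everything in it is an assumption for illustration: the keyword set stands in for the small classifier model, the interjection arrives as text rather than audio, and `YieldingPlayback` is a made-up class, not part of the code above.

```python
import threading

# Placeholder vocabulary; a real system would use a small classifier model.
BACKCHANNELS = {"mm-hmm", "mhm", "uh-huh", "right", "okay", "ok", "yeah"}


def is_backchannel(interjection: str) -> bool:
    """Hypothetical stand-in for the sub-billion parameter classifier."""
    return interjection.strip().lower().rstrip(".!,") in BACKCHANNELS


class YieldingPlayback:
    """Tracks paused state and pending audio for reversible interruption."""

    def __init__(self):
        self.paused = threading.Event()
        self.pending = []  # queued audio chunks not yet played

    def on_speech_start(self):
        self.paused.set()  # reversible: pending audio stays intact

    def on_interjection(self, text: str) -> str:
        if is_backchannel(text):
            self.paused.clear()  # resume where playback left off
            return "resume"
        self.pending.clear()     # destructive flush, user takes the floor
        self.paused.clear()
        return "flush"


playback = YieldingPlayback()
playback.pending = [b"chunk-1", b"chunk-2"]
playback.on_speech_start()
print(playback.on_interjection("Mm-hmm."))  # resume
```

The playback worker would skip writes while paused is set instead of flushing, and only the "flush" branch hands the utterance to E2B.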

What This Actually Unlocks

The architecture in this guide runs entirely offline. No API calls. No cloud dependency. No subscription. A Raspberry Pi 5 can host a voice agent that:

Hears you natively — not transcribed text, actual audio with prosody and intent
Responds intelligently across a multi-hour conversation context
Stops the moment you start speaking — every single time
Gets smarter about your patterns over time with persistent conversation history

We've spent years accepting that voice AI is clunky because it has to be. Gemma 4 E2B makes that tradeoff optional. The uncanny valley was always an architecture problem. We just finally have the pieces to solve it on hardware that fits in your pocket.

Quick Start

# Install dependencies
pip install -U transformers torch accelerate onnxruntime
pip install pyaudio soundfile silero-vad bitsandbytes

# Pull via Ollama if you prefer the managed route
ollama pull gemma4:e2b

# Or load directly via Hugging Face
# model_id = "google/gemma-4-E2B-it"

Start with the orchestration loop, verify your Silero VAD fires correctly on your microphone, then wire in your preferred TTS engine to output_ctrl.playback_queue. The interruption layer works regardless of which TTS you choose — Kokoro, Edge-TTS, Coqui, anything that produces audio chunks.

The agent that finally listens is one thread event away.

Written for the Gemma 4 Writing Challenge on DEV.to. Deadline: May 24, 2026.

The AI That Learns While You Sleep: Inside Hermes Agent's Self-Evolving Brain

Pratay Karali — Thu, 21 May 2026 11:43:03 +0000

Every other agent forgets. This one grows.

It's 3am. Your laptop is idle. No terminal is open. No prompt is waiting.

And somewhere inside ~/.hermes/skills/, a daemon just woke up.

It's reading through every task your agent completed this week — every tool call, every failure, every correction, every workaround. It's grading them. Consolidating the ones that overlap. Pruning the ones that underperformed. Writing new procedural memory files from scratch.

You didn't ask it to. You didn't schedule it. It just happens — every seven days — like clockwork.

By morning, your agent is measurably better at your codebase than it was yesterday. And you were asleep the entire time.

This is the Hermes Agent. And it's the first open-source runtime I've encountered that doesn't just execute intelligence — it accumulates it.

The Problem Every Other Agent Has

Here's the dirty secret of most agentic frameworks: they're amnesiac by design.

You give the agent a complex task. It struggles, recovers, finds a workaround, completes it. You feel good. You close the terminal.

Next time you run the same class of task? It starts from zero. The workaround is gone. The hard-won recovery pattern — vanished. The agent will struggle through the exact same failure mode it navigated last Tuesday, because nothing in the architecture captured what it learned.

This isn't a bug in most frameworks. It's a philosophical choice: keep the agent stateless, keep it predictable, keep it simple. The problem is that "simple" compounds into "perpetually mediocre." You're not running an agent that gets better. You're running a very expensive for loop.

Hermes made a different choice. Its entire architecture is built around a single question: what if the agent remembered — and what if the act of remembering made it smarter?

The Heartbeat: Observe, Execute, Reflect, Crystallize

Most agent frameworks have a loop. Input → Plan → Tool Call → Output. Repeat.

Hermes has a heartbeat.

The four-phase OERC cycle isn't just an execution pattern — it's a learning metabolism. Each phase has a distinct biological analog, and understanding them is the key to understanding why Hermes behaves the way it does.

┌─────────────────────────────────────────────────────────┐
│              THE HERMES LEARNING HEARTBEAT               │
│                                                          │
│   OBSERVE ──► EXECUTE ──► REFLECT ──► CRYSTALLIZE       │
│      │                        │             │            │
│   Scan skills             Run GEPA      Write SKILL.md   │
│   FTS5 lookup            self-analysis   to ~/.hermes/   │
│   ~20 tokens/skill        on trace       skills/         │
│      │                        │             │            │
│   "What do I               "Why did        "Next time,   │
│    already know?"           that work?"     I'll know."  │
└─────────────────────────────────────────────────────────┘

Observe — Before executing anything, the agent scans its local SQLite database using FTS5 full-text search to find skills that match the incoming task. Critically, it only loads Level-0 summaries at this stage — roughly 20 tokens per skill — so it can survey thousands of procedures without bloating its context window. It enters the execution phase already knowing what it knows.

Execute — The agent runs its tool-calling loop, dispatching up to 8 tools in parallel via a localized thread pool. Every terminal command, every API response, every correction is captured in a detailed execution trace. The agent isn't just doing the work — it's recording a complete transcript of how it did the work.

Reflect — This is where it gets interesting. After a sufficiently complex task (typically 5+ tool calls), GEPA — a Genetic-Pareto Prompt Evolution system running alongside DSPy — analyzes the execution trace. It identifies the failure points. It models why the recovery worked. It generates optimized guidelines, documents common pitfalls, and drafts verification steps. Crucially, this reflection runs in a background thread — it doesn't make you wait.

Crystallize — The reflection output is compiled into a structured Markdown skill file and written to ~/.hermes/skills/. It's indexed at Level 1, ready to be surfaced in the Observe phase of the next session. The loop closes. The knowledge persists.

This is how muscle memory works in humans. You struggle through a new motor pattern consciously, repeatedly, until the motion becomes automatic and sub-cortical. Hermes does the same thing — except it crystallizes in one pass, not ten thousand.

The Three-Layer Brain

If the OERC cycle is the heartbeat, the memory architecture is the brain. And it's structured with a specificity that most frameworks don't come close to.

Memory Layer	Human Analog	Location	Size Cap	Behavior
L1: Session Context	Working Memory	Volatile RAM	Model context window	Cleared on `/reset`
L2: Persistent Store	Episodic Memory	`~/.hermes/memories/MEMORY.md`	~2,200 chars / ~800 tokens	Locked into system prompt prefix at session start
L3: User Model	Autobiographical Self	`~/.hermes/memories/USER.md`	~1,375 chars / ~500 tokens	Embedded in System Slot #1; updated in background
Procedural	Muscle Memory	`~/.hermes/skills/`	15 KiB per file	Loaded dynamically via progressive disclosure

The size caps aren't arbitrary. They're a deliberate architectural decision to prevent what the Hermes team calls "prompt degradation" — the phenomenon where too much injected context starts hurting model performance instead of helping it. Every cap is the result of empirical testing on where the signal-to-noise ratio flips.

The USER.md file is the layer that tends to surprise people the most. It's not just a config file — it's a live model of you. Your coding style. Your preferred abstractions. Your tolerance for verbose output. The Honcho dialectic system periodically rewrites it based on observed interaction patterns. Over weeks of use, the agent stops feeling like a generic assistant and starts feeling like something that's been working with you specifically.

And the retrieval is fast — SQLite FTS5 scanning thousands of historical conversations in under 10 milliseconds. There's no vector embedding server to run, no semantic search latency to absorb. It's just a very well-engineered SQLite database doing what SQLite does best.
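If you haven't used FTS5 before, this is roughly what a skill lookup looks like with Python's built-in sqlite3 module. The table name, columns, and rows are invented for the example and are not Hermes's actual schema; it also assumes your SQLite build ships with FTS5, which most do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one short summary per skill, indexed for full-text search.
conn.execute("CREATE VIRTUAL TABLE skills USING fts5(name, summary)")
conn.executemany(
    "INSERT INTO skills (name, summary) VALUES (?, ?)",
    [
        ("docker-debug", "diagnose failing docker builds and cache misses"),
        ("jwt-audit", "review jwt signature validation in auth code"),
        ("pytest-flaky", "stabilize flaky pytest tests with retries"),
    ],
)


def find_skills(query: str, limit: int = 5) -> list[str]:
    """Return skill names matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT name FROM skills WHERE skills MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [name for (name,) in rows]


print(find_skills("jwt"))  # ['jwt-audit']
print(find_skills("docker OR pytest"))
```

Only the short summary column is searched and returned here, which mirrors the idea of surveying many skills cheaply before loading any full skill file.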

The Immune System: Why Subagent Constraints Are the Feature

When Hermes delegates a task to a subagent, that subagent runs in a fresh, isolated workspace with its own conversational context. Importantly, it operates under four hard constraints:

No spawning subagents: unless explicitly assigned the "orchestrator" role
No user interaction: subagents cannot prompt you for input
No shared memory writes: blocked from writing to MEMORY.md or USER.md
Sequential code execution: no recursive script injection

Most developers see this list and think "limitations." That's the wrong read.

These constraints are a security immune system. When you're running autonomous agents against your actual codebase — at 3am, while you sleep — you want hard walls between parallel execution threads. A subagent that can spawn children can create exponential resource consumption. A subagent that can write to shared memory can corrupt the episodic store that took weeks to accumulate. A subagent that can talk to you mid-task can create race conditions between its output and your response.

The constraints are what make the delegation trustworthy at scale.

Here's what a production parallel code review delegation looks like:

from hermes_tools import delegate_task, execute_code

def run_automated_refactor():
    target_files = ["src/auth/jwt.py", "src/auth/login.py"]

    # Dispatch two isolated security-focused subagents in parallel
    delegate_task(
        tasks=[
            {
                "goal": f"Analyze {target_files} for JWT validation vulnerabilities and apply fixes",
                "profile": "fixer",
                "context": """Project is at /workspace.
                Verify signature validation and check algorithm headers.
                Run pytest tests/auth/ -v after implementing changes."""
            },
            {
                "goal": f"Analyze {target_files} for SQL injection vectors and parameterize queries",
                "profile": "fixer",
                "context": """Project is at /workspace.
                Convert string-formatted database executions to prepared statements.
                Run pytest tests/auth/ -v to verify integration."""
            }
        ]
    )

    # Final validation pass
    return execute_code(script="""
import subprocess
result = subprocess.run(["pytest", "tests/auth/", "-v"], capture_output=True, text=True)
print(result.stdout)
print("REFACTOR_COMPLETE_AND_VERIFIED" if result.returncode == 0 else "REFACTOR_FAILED_VALIDATION")
""")

Two subagents. Two security domains. Zero shared state. One validation pass at the end. This is the architecture of a system that's designed to be trusted in production — not just demonstrated in a keynote.

The Sandbox: Paranoid by Default

Autonomous agents executing shell commands on your machine is, frankly, terrifying if done wrong. Hermes handles this through six execution backends — Local, Docker, SSH, Daytona, Singularity, and Modal — with the Docker backend being the recommended production configuration.

The key architectural difference from naive Docker wrappers: Hermes uses a single, persistent container initialized at startup. Every command, every subagent, every code execution routes through docker exec into this one container. State persists between steps. Package installations survive across tool calls. Environment variables don't reset mid-task.

And it's hardened by default:

terminal:
  backend: "docker"
  docker_image: "nikolaik/python-nodejs:python3.11-nodejs20"
  container_persistent: true
  container_cpu: 2
  container_memory: 4096

Under the hood, Hermes applies:

--cap-drop ALL with only DAC_OVERRIDE, CHOWN, and FOWNER restored
--security-opt no-new-privileges to block runtime escalation
--pids-limit 256 to neutralize fork-bomb attacks
tmpfs mounts for /tmp (512MB) and /var/tmp (256MB) to prevent disk exhaustion

The paranoia is engineered in. You don't have to configure it. It's the default.
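For a sense of what those settings amount to, here is a sketch that assembles an equivalent docker run command line. It is an approximation built from standard Docker flags, not Hermes's actual invocation: the container name, the `sleep infinity` keep-alive, and the reading of container_memory as megabytes are all assumptions.

```python
import shlex


def hardened_run_args(image: str, name: str = "hermes-sandbox") -> list[str]:
    """Build a `docker run` argv mirroring the hardening flags listed above."""
    return [
        "docker", "run", "-d", "--name", name,  # name is hypothetical
        "--cap-drop", "ALL",
        "--cap-add", "DAC_OVERRIDE", "--cap-add", "CHOWN", "--cap-add", "FOWNER",
        "--security-opt", "no-new-privileges",
        "--pids-limit", "256",
        "--tmpfs", "/tmp:size=512m",
        "--tmpfs", "/var/tmp:size=256m",
        "--cpus", "2", "--memory", "4096m",     # assumes the config value is in MB
        image, "sleep", "infinity",             # assumed keep-alive command
    ]


print(shlex.join(hardened_run_args("nikolaik/python-nodejs:python3.11-nodejs20")))
```

Printing the command instead of running it makes it easy to compare against what docker inspect reports for the real container.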

Hermes vs. OpenClaw: A Philosophy Comparison

This isn't a "which is better" comparison. It's a "which philosophy are you choosing" comparison.

Dimension	OpenClaw	Hermes Agent
Center of Gravity	Gateway-First: sessions route to a stateless agent loop	Agent-First: the cognitive loop is the core; gateways wrap it
Skill System	Static, human-authored files edited manually	Self-generating: OERC loop writes new skills automatically
Memory	Flat files + standard SQLite, manual configuration	Three-layer persistent stack with FTS5, pluggable backends
Parallel Execution	Sequential within single thread	Native `delegate_task` spawning isolated parallel subagents
Container Model	New container per task (high init overhead, stateless)	Single persistent container (low overhead, stateful)
Setup Time	Under 30 minutes	2–4 hours for full configuration
Annual API Cost	~$600–$1,800	~$500–$1,600 (optimized via prompt caching, α≈0.90)

The tradeoff is honest: OpenClaw is faster to set up and easier to reason about because it's simpler. Hermes requires a real configuration investment — model routing, memory setup, Docker configuration, gateway pairing. The 2–4 hour setup time is real.

But here's the question: are you building something you'll run once, or something you'll run every day?

If it's the latter — if this agent is going to touch your codebase regularly, learn your patterns, automate your recurring tasks — the investment compounds. The agent that takes 4 hours to set up on day one is measurably smarter on day 30 than the agent that took 30 minutes. Because it's been crystallizing knowledge the entire time.

The Cost Architecture: Why α≈0.90 Changes Everything

The economic model of Hermes is worth understanding before you dismiss it as "expensive self-hosted AI."

The cost formula for a single conversational turn:

C_turn = T_dynamic · R_dynamic 
       + T_cached · (1-α) · R_dynamic 
       + T_cached · α · R_cached 
       + T_out · R_out

Where α is the prompt cache hit ratio. Hermes achieves α≈0.90 in steady-state operation — meaning 90% of input tokens hit the cache and are billed at the discounted cached rate.
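The formula is easier to feel with numbers plugged in. The sketch below implements it directly. The per-token rates and token counts are placeholders I picked for illustration, not published pricing for any provider.

```python
def turn_cost(t_dynamic, t_cached, t_out, r_dynamic, r_cached, r_out, alpha):
    """Cost of one conversational turn, term by term as in the formula above."""
    return (
        t_dynamic * r_dynamic                   # fresh input, always full price
        + t_cached * (1 - alpha) * r_dynamic    # cacheable input that missed the cache
        + t_cached * alpha * r_cached           # cacheable input that hit the cache
        + t_out * r_out                         # generated output
    )


# Hypothetical rates in dollars per token (illustrative only).
R_DYNAMIC = 3.00 / 1_000_000
R_CACHED = 0.30 / 1_000_000
R_OUT = 15.00 / 1_000_000

if __name__ == "__main__":
    for alpha in (0.0, 0.90):
        cost = turn_cost(2_000, 50_000, 1_000, R_DYNAMIC, R_CACHED, R_OUT, alpha)
        print(f"alpha={alpha:.2f}: ${cost:.4f} per turn")
```

With these placeholder numbers the cached-prefix term dominates the bill at α=0 and shrinks to a minor line item at α=0.90. The size of the saving depends on how large the cached prefix is relative to the rest of the turn.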

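This is the architectural payoff of the frozen memory layers. Because MEMORY.md and USER.md are static between updates, the prompt prefix they form stays identical from turn to turn and keeps hitting Anthropic's prompt cache for as long as the cache entry stays warm. Cache entries expire after a short idle period, so a cold start pays the full rate again. The system prompt that took 1,300 tokens to construct is billed at full price only on a cache miss. Every subsequent turn inside the cache window loads it at a fraction of the cost.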

For long-running, multi-hour agent operations — exactly the kind of work Hermes is designed for — this cache hit ratio is the difference between a $40 session and a $4 session.

The Curator: Cognitive Gardening

The detail about Hermes that I find most philosophically interesting is the Curator Daemon.

Every 7 days, a background process (introduced in version 0.12.0) scans your entire ~/.hermes/skills/ directory. It grades each skill against historical execution logs. It identifies skills that overlap and consolidates them. It prunes skills that underperformed or became too narrow to be useful.

No human touches this process. No one reviews the results unless they want to. The agent manages its own long-term memory hygiene.
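I don't know how the Curator scores skills internally, so treat the following as a toy model of the grade-and-prune step only. The SkillStats record, the thresholds, and the success-rate metric are all my assumptions, and consolidation of overlapping skills is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class SkillStats:
    """Hypothetical per-skill summary derived from execution logs."""
    name: str
    runs: int
    successes: int

    @property
    def success_rate(self):
        return self.successes / self.runs if self.runs else 0.0


def curate(skills, min_runs=3, min_success_rate=0.6):
    """Split skills into (kept, pruned) names based on their execution history."""
    kept, pruned = [], []
    for skill in skills:
        # Keep a skill only if it is used often enough AND usually succeeds.
        if skill.runs >= min_runs and skill.success_rate >= min_success_rate:
            kept.append(skill.name)
        else:
            pruned.append(skill.name)
    return kept, pruned


if __name__ == "__main__":
    library = [
        SkillStats("dependency-scan", runs=20, successes=18),
        SkillStats("legacy-migration", runs=1, successes=1),   # too narrow
        SkillStats("flaky-deploy", runs=10, successes=3),      # underperforming
    ]
    print(curate(library))
```

The point of the toy is the shape of the decision: usage history goes in, a smaller library comes out, and nobody has to review it.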

There's a term for this in cognitive neuroscience: synaptic pruning. The human brain does something similar during sleep — eliminating weak neural connections to strengthen the ones that matter. The result is that you wake up with slightly better-consolidated memories than you had when you fell asleep.

Hermes does this to its skill library. Every week. While your machine is idle.

The practical implication: a Hermes instance you've been running for 6 months has a fundamentally different skill profile than one you deployed last week. It's been shaped by your specific tasks, your specific codebase, your specific failure patterns. It's not a generic agent anymore. It's your agent — in a way that no stateless framework can match.

Who This Is For

Hermes is not for everyone. Let me be direct about that.

If you want to run a one-shot coding task and be done — use a simpler tool. The 2–4 hour setup overhead isn't justified for occasional use.

Hermes is for the developer who:

Has recurring workflows they're tired of re-explaining to an agent every time
Wants their agent to get better at their specific project over time
Is comfortable with self-hosted infrastructure and local model routing
Trusts the Docker sandbox model and wants autonomous background execution

If that's you, the architecture rewards patience. The first week, Hermes feels like any other agent. By week four, it's started to feel like something that's been paying attention.

The Quiet Provocation

I want to end on something that's been sitting with me since I went deep on this architecture.

We've spent a lot of time debating whether AI will replace developers. That's the loud conversation. It makes for good headlines.

The quieter, more interesting question is: what happens when your tools start remembering faster than you do?

You write code in a codebase for three years. You build intuitions. You know which abstractions leak, which patterns cause bugs three sprints later, which shortcuts always come back to bite you. That institutional knowledge lives in your head — and it's deeply, irreplaceably valuable.

Hermes is the first framework I've seen that's designed to accumulate that same class of knowledge. Not about code in general. About your code specifically. About your patterns, your failures, your recoveries.

The Curator Daemon pruning skill files at 3am isn't just a background process. It's the system becoming a better collaborator — specifically for you, specifically for your project, without you doing anything.

That's not a replacement. That's an apprentice that never forgets a lesson.

Getting Started

If you want to explore Hermes seriously, here's the honest path:

# Step 1: Install
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash

# Step 2: Configure model routing and memory
hermes setup

# Step 3: Set up your execution backend
# Edit ~/.hermes/config.yaml — configure Docker backend, model routing

# Step 4: Run your first task and watch the skill directory afterward
hermes
# After a complex task, check: ls ~/.hermes/skills/

Budget 2–4 hours for the initial setup. It's not quick. It's worth it.

Start with a task you run regularly — code review, dependency scanning, documentation generation. After a week, look at what's been crystallized into ~/.hermes/skills/. That directory will tell you more about how the system works than any documentation.

And check back in a month. The agent you're running then won't be the same one you started with.

That's the whole point.

Written for the Hermes Agent Challenge on DEV.to.

The Day AI Became Its Own CTO: Antigravity 2.0 and the 12-Hour OS

Pratay Karali — Thu, 21 May 2026 09:07:38 +0000

What happens when you stop giving AI a task — and give it a company?

There's a moment in every science fiction film where the machine stops waiting for instructions.

At Google I/O 2026, that moment happened live on stage — and it didn't feel like science fiction. It felt like watching the future quietly clock in for work.

Antigravity 2.0 was given a single directive: build an operating system. No team. No standups. No Jira tickets. Just one primary agent, 93 subagents it spun up itself, 15,000+ model requests, 2.6 billion tokens generated, and 12 hours on the clock.

The total bill? Under $1,000.

The result? A working OS — that, when it failed to run Doom due to missing keyboard and video drivers, diagnosed the problem and wrote the drivers live on stage.

I've been staring at that moment for two days. Let me tell you what I think it actually means.

The Jarvis Architecture: Corporate Hierarchy Without the Politics

Here's the frame that won't leave my head: this isn't just "AI coding." This is AI operating like a corporation — and it's a corporation unlike any that has ever existed.

The primary Antigravity agent functions as a CTO. It doesn't write every line of code. It understands the system, breaks the goal into domains, and spawns specialized subagents — one for the database layer, one for the frontend, one for testing, one for drivers. Each subagent works in an isolated workspace, reports summarized results back, and dissolves when its job is done.

         +--------------------------+
         |      Primary Agent       |
         |  (Context Coordinator)   |
         +----+------+----------+---+
              |      |          |
  +-----------+      |          +-----------+
  | Spawns           | Spawns               | Spawns
  v                  v                      v
+----------+    +----------+          +----------+
| Subagent |    | Subagent |          | Subagent |
| Database |    | Frontend |          |  Testing |
+----------+    +----------+          +----------+
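The fan-out and fan-in shape in that diagram is an old pattern, and it can be sketched with nothing but the Python standard library. This is not Antigravity's API. The run_subagent and primary_agent functions are hypothetical stand-ins, with a temporary directory playing the role of each isolated workspace.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_subagent(domain):
    """Stand-in subagent: works in its own directory and returns only a summary."""
    with tempfile.TemporaryDirectory(prefix=f"{domain}-") as workspace:
        notes = Path(workspace) / "notes.txt"
        notes.write_text(f"work log for {domain}\n")
        size = len(notes.read_text())
        # Only the summary leaves the workspace; the directory is deleted on exit.
        return {"domain": domain, "summary": f"{domain}: {size} chars of notes"}


def primary_agent(goal, domains):
    """Fan out one subagent per domain, then merge their summaries."""
    with ThreadPoolExecutor(max_workers=len(domains)) as pool:
        results = list(pool.map(run_subagent, domains))  # map preserves input order
    return {"goal": goal, "reports": {r["domain"]: r["summary"] for r in results}}


if __name__ == "__main__":
    report = primary_agent("build an OS", ["database", "frontend", "testing"])
    for domain, summary in report["reports"].items():
        print(domain, "->", summary)
```

The coordinator never sees a subagent's scratch files, only the summary it hands back. That is the property that keeps the primary agent's context small.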

This is the part that breaks my brain a little: nobody fights.

In every human organization I've ever encountered, the frontend team argues with the backend team. The testing team is chronically ignored. The DevOps engineer is always the last person anyone calls and the first person everyone blames. There are ego collisions, misaligned incentives, communication overhead, documentation that's three sprints out of date.

In Antigravity's architecture, the subagents operate in clean isolation. No competing priorities. No meetings about meetings. The primary agent synthesizes their outputs and steers. The goal is the only agenda item.

It's the corporate hierarchy that every management book has been trying to describe for 50 years — and it turns out the only way to actually build it is to use agents that have no sense of self-preservation.

The Sandbox Is the Secret Weapon

What made the 12-hour OS build possible — beyond the model itself — is the infrastructure underneath it: the Managed Agents API and its ephemeral sandbox architecture.

Every Antigravity agent runs inside a Google-hosted Ubuntu Linux container. You don't provision it. You don't configure it. One API call to the Interactions API spins it up — Python 3.12, Node.js 22, a full shell, Google Search, URL context, all ready.

The architecture separates control from execution cleanly:

+------------------------------------------------------------+
|                       CONTROL PLANE                        |
|                        (Agents API)                        |
|  - Register Agent Identity & Constraints                   |
|  - Mount GCS Buckets, Define Network Allowlists            |
+-----------------------------+------------------------------+
                              | Configures
                              v
+-----------------------------+------------------------------+
|                        DATA PLANE                          |
|                    (Interactions API)                      |
|                                                            |
|  +------------------------------------------------------+  |
|  |          Google-Hosted Ubuntu VM Sandbox             |  |
|  |  - Ephemeral Linux Environment                       |  |
|  |  - Python 3.12 & Node.js 22 Runtimes                 |  |
|  |  +----------+  +----------------+  +--------+        |  |
|  |  |   Bash   |  | Google Search  |  |  URL   |        |  |
|  |  | Executor |  |     Tool       |  | Context|        |  |
|  |  +----------+  +----------------+  +--------+        |  |
|  +------------------------------+-----------------------+  |
+---------------------------------|--------------------------+
                                  | Persists across turns
                                  v
+----------------------------------------------------------+
|               PERSISTENT CONTEXT STORAGE                 |
|  - Filesystem & Installed Packages                       |
|  - Conversation History (previous_interaction_id)        |
+----------------------------------------------------------+

The key design insight is state persistence across turns. When you pass previous_interaction_id back into a new Interactions API call, the sandbox doesn't reset. The files the agent created last turn are still there. The packages it installed are still there. The 500k tokens of planning context it built up are still there.

This is what enables long-horizon tasks. A single agent interaction can consume between 300,000 and 3,000,000 tokens — but the platform caches 50–70% of input tokens, making the economics manageable.

Here's how that multi-turn persistence looks in practice:

from google import genai

client = genai.Client()

# Turn 1: Give the agent its first major task
first_interaction = client.interactions.create(
    agent="antigravity-preview-05-2026",
    input="Retrieve the top 10 trends from Hacker News, write them to trends.csv, and generate a matplotlib visualization.",
    environment="remote",
    tools=[
        {"type": "code_execution"},
        {"type": "google_search"},
        {"type": "url_context"}
    ]
)

print(f"Sandbox ID: {first_interaction.environment_id}")

# Turn 2: Continue in the SAME container — trends.csv is still there
second_interaction = client.interactions.create(
    agent="antigravity-preview-05-2026",
    previous_interaction_id=first_interaction.id,
    environment=first_interaction.environment_id,  # Same container, state preserved
    input="Convert trends.csv into a responsive HTML dashboard."
)

No re-explaining context. No re-uploading files. The agent remembers because the environment remembers.

The CLI: A Deep-Sea Probe for Your Codebase

If Antigravity 2.0 is the CTO, the Antigravity CLI is the ROV they send into the trench.

It's built in Go — lightweight, fast, low overhead — and it can run background tasks asynchronously while you sleep. It doesn't need a visual IDE. It doesn't need a human watching every step. Like an unmanned probe sent into deep water where no one has ever looked, it explores, documents, and surfaces what it finds.

# Install the CLI
curl -fsSL https://antigravity.google/cli/install.sh | bash

# Drop into an interactive agent shell
antigravity-cli

# The commands that make it feel like you hired someone
/goal          # "Complete this without asking me every 5 minutes"
/schedule      # Cron-like automation for recurring tasks
/browser       # Spawn a visual subagent to crawl and test web apps
/rewind        # Undo the last conversation turn and branch differently
/permissions   # Tune autonomy: request-review, always-proceed, strict

The /goal command is the one I keep thinking about. You give the CLI an objective, and it executes — without prompting for step-by-step approvals — until it's done or until it genuinely needs you. This is what "autonomous" actually means in practice. Not just suggesting the next action, but doing the work while you're away from your desk.

The /schedule command extends this further — periodic automated checks, nightly refactor scripts, weekly report generation. This isn't a coding assistant. It's a background process that thinks.

And the deep-sea metaphor holds technically too: unlike browser-based tools, the CLI agents can reach across system boundaries, navigate unknown package ecosystems, probe APIs with no documentation, and surface their findings in structured logs. There are large parts of most codebases that no human fully understands anymore. The CLI goes there.

Behavior as Configuration: AGENTS.md and SKILL.md

One of the underrated announcements from I/O 2026 is how Antigravity handles behavioral customization — not through complex API parameters, but through versioned markdown files in your repository.

Two files you should know:

AGENTS.md lives at the project root. It defines the agent's operating constraints, persona, and global rules. Developed collaboratively by OpenAI, Google, and others under the Linux Foundation's Agentic AI Foundation, it's becoming a universal standard across tens of thousands of repositories — a Dockerfile for agent behavior.

SKILL.md lives at .agents/skills/<skill-name>/SKILL.md. It packages specific capabilities: step-by-step procedures, tool dependencies, reference schemas. Originating from Anthropic, it's now supported across major platforms. The design philosophy is progressive disclosure — the agent reads high-level summaries from AGENTS.md first, then loads specific SKILL.md files only when a matching task appears.

A minimal example:

# AGENTS.md (project root)

## Role
You are a senior DevOps automation agent.

## Non-Negotiable Constraints
- Never run database migrations without human approval.
- All infrastructure changes must be tested against staging first.
- Validate all generated scripts with linters before execution.

# .agents/skills/docker-builder/SKILL.md
---
name: docker-builder
description: Automates multi-stage Docker builds and security scans.
tools: [code_execution]
---

## Build Procedure
1. Locate package.json or requirements.txt in the repository root.
2. Generate an optimized, multi-stage Dockerfile using distroless base images.
3. Execute: docker build -t app-image:latest .
4. Run vulnerability scan with Trivy.

This is elegant because it solves a real problem: how do you give an autonomous agent enough context to be useful without flooding its context window with everything all at once? The answer is the same answer good software architecture has always given — load what you need, when you need it.
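Here is roughly what "load what you need, when you need it" could look like in code. It is a sketch under my own assumptions, not how Antigravity implements it: the function names are hypothetical, the header parser only handles simple key: value lines, and matching a skill by finding its name in the task text is a crude stand-in for whatever matching the real agent does.

```python
import tempfile
from pathlib import Path


def read_header(skill_file):
    """Return the leading `key: value` lines of a SKILL.md as a dict."""
    header = {}
    for line in skill_file.read_text().splitlines():
        line = line.strip()
        if line == "---":               # optional front-matter delimiter
            continue
        if not line or ":" not in line:
            break                       # header ends at the first blank or non key/value line
        key, value = line.split(":", 1)
        header[key.strip()] = value.strip()
    return header


def build_index(skills_root):
    """Index every skill by name, keeping only its short description and path."""
    index = {}
    for skill_file in sorted(Path(skills_root).glob("*/SKILL.md")):
        header = read_header(skill_file)
        index[header["name"]] = (header["description"], skill_file)
    return index


def load_matching(index, task):
    """Read full skill bodies only for skills whose name appears in the task."""
    task = task.lower()
    return {
        name: path.read_text()
        for name, (_, path) in index.items()
        if name.lower() in task
    }


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    skill_dir = root / "docker-builder"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "name: docker-builder\n"
        "description: Automates multi-stage Docker builds and security scans.\n"
        "\n## Build Procedure\n1. Locate package.json or requirements.txt.\n"
    )
    index = build_index(root)
    print("summaries:", {name: desc for name, (desc, _) in index.items()})
    print("loaded:", list(load_matching(index, "Use docker-builder to package the app")))
```

The index costs a line or two of context per skill. The full procedure text only enters the context window when a task actually calls for it.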

Your AGENTS.md is the onboarding doc for an employee who never forgets what they read.

The Cost Architecture: What Running 93 Agents Actually Looks Like

Let's talk numbers, because "under $1,000 for an OS" deserves to be unpacked.

Gemini 3.5 Flash — the model powering Antigravity's agent infrastructure — costs $3 to $9 per million output tokens. At 289 tokens/second output speed, it processes roughly four times faster than its nearest benchmarked competitors. The platform cached 50–70% of input tokens across the multi-turn interactions, which is the key cost lever on operations that process millions of tokens.

For the OS build specifically: 2.6 billion tokens total, just under $1,000 spent. That works out to at most about $0.38 per million tokens in effective cost after caching, for 93 parallel agents working across 12 hours.
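The arithmetic behind that per-million figure is a one-liner, shown here using the keynote numbers quoted above. The function name is mine, and $1,000 is treated as an upper bound because the reported bill was under that.

```python
def effective_cost_per_million(spend_usd, total_tokens):
    """Blended dollars per million tokens across the whole run."""
    return spend_usd / (total_tokens / 1_000_000)


if __name__ == "__main__":
    ceiling = effective_cost_per_million(1_000, 2_600_000_000)
    print(f"at most ${ceiling:.2f} per million tokens")
```

Note that this is a blended figure across input, cached input, and output tokens, so it is not directly comparable to a list price for output tokens alone.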

For context on what different agent task types typically cost:

Task Type	Input Tokens	Output Tokens	Session Cost
Research & Synthesis	100k–500k	10k–40k	$0.30–$1.00
Code & Doc Generation	100k–500k	15k–50k	$0.30–$1.30
Architecture Design	100k–400k	10k–30k	$0.25–$0.80
Large-Scale Data Processing	300k–3M	30k–150k	$0.70–$3.25

The OS build was off the scale of normal tasks — but it demonstrates that the upper limit of what you can accomplish in a single agentic run has expanded dramatically.

What This Actually Changes

I want to be careful here, because a lot of commentary around events like this slides into either hype or dismissal — and neither is honest.

What I saw at I/O 2026 wasn't "AI replacing developers." What I saw was a fundamental shift in the grain of software development.

The analogy that feels right to me: the invention of the compiler didn't eliminate programmers. It changed what programming meant. Before compilers, you managed memory addresses by hand. After, you reasoned about logic and let the tool handle the translation. The skill didn't disappear — it moved up a level of abstraction.

Antigravity 2.0 is another step up that abstraction ladder. You're not writing every function anymore. You're writing AGENTS.md. You're designing the constraint system. You're defining what "done" means and what "never do this" means. You're the architect now — not because coding became less important, but because the agents need someone to tell them what the building is for.

The developers who thrive in this environment won't be the ones who can type the fastest. They'll be the ones who can think most clearly about systems, constraints, and goals — and who can write them down in a way an autonomous agent can actually follow.

That's a different skill. But it's still unmistakably a craft.

The Moment I'll Remember

When Antigravity 2.0's OS failed to boot Doom because it was missing keyboard and video drivers — and then the primary agent, live on stage, diagnosed the gap, spawned new subagents, wrote the missing drivers, and injected them — the audience reaction wasn't the typical polite I/O applause.

There was a moment of genuine silence first.

I think that silence was people recalibrating. Because the agent didn't just complete the task. It encountered unexpected failure in a domain it hadn't been explicitly prepared for, reasoned about the gap, and solved it.

That's not a demo trick. That's the capability.

You can build a demo that looks impressive. You can't fake a system that reasons its way through failure it didn't anticipate.

Where to Start

If you want to explore the Managed Agents API yourself, the entry points are:

Start with a single-turn sandbox — spin up a remote environment via the Interactions API, give it a research + visualization task, and observe the execution loop via SSE streaming.
Add state persistence — on your second interaction, pass previous_interaction_id and environment_id. Watch the agent pick up exactly where it left off.
Write your first AGENTS.md — define three constraints that matter for your project domain. Watch how it changes agent behavior on subsequent runs.
Try a parallel task — give the primary agent a goal complex enough that it spawns subagents. Monitor them via /agents in the CLI.

The sandbox persists for 7 days of inactivity before teardown. That's 7 days to run a project you've been putting off.

93 agents, $1,000, one OS. What would you build?

Written for the Google I/O 2026 Writing Challenge on DEV.to. Technical data sourced from the Google I/O 2026 developer keynote and platform documentation.