<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kunal Jaiswal</title>
    <description>The latest articles on Forem by Kunal Jaiswal (@ljkunal).</description>
    <link>https://forem.com/ljkunal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852052%2F90b66918-7075-4354-a828-873703958bac.jpeg</url>
      <title>Forem: Kunal Jaiswal</title>
      <link>https://forem.com/ljkunal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ljkunal"/>
    <language>en</language>
    <item>
      <title>Claude Code Was Getting Dumber. Semantic Memory Fixed It.</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Wed, 22 Apr 2026 17:21:38 +0000</pubDate>
      <link>https://forem.com/ljkunal/claude-code-was-getting-dumber-semantic-memory-fixed-it-1ch1</link>
      <guid>https://forem.com/ljkunal/claude-code-was-getting-dumber-semantic-memory-fixed-it-1ch1</guid>
      <description>&lt;p&gt;I use Claude Code as my primary development tool. It manages a home automation stack spread across five machines — camera monitors, WhatsApp agents, LLM inference pipelines, job scrapers, diet trackers. Over two months, the codebase grew to 30+ services with their own ports, configs, credentials, and war stories.&lt;/p&gt;

&lt;p&gt;To keep Claude informed, I maintained documentation in &lt;code&gt;.md&lt;/code&gt; files. &lt;code&gt;camera_monitor.md&lt;/code&gt;. &lt;code&gt;dgx_inference.md&lt;/code&gt;. &lt;code&gt;openclaw_agents.md&lt;/code&gt;. One file per system, each containing architecture decisions, port numbers, credentials, known bugs, and fix history.&lt;/p&gt;

&lt;p&gt;It worked great at 5 files. At 30, Claude started losing the plot.&lt;/p&gt;

&lt;h2&gt;The Problem With Files&lt;/h2&gt;

&lt;p&gt;Claude Code reads &lt;code&gt;CLAUDE.md&lt;/code&gt; at session start. I added rules there: "read &lt;code&gt;camera_monitor.md&lt;/code&gt; before touching the camera system." "Check &lt;code&gt;dgx_inference.md&lt;/code&gt; for port mappings." Reasonable instructions.&lt;/p&gt;

&lt;p&gt;But Claude's context window is finite. Each &lt;code&gt;.md&lt;/code&gt; file averaged 200-400 lines. Loading 10 of them for a cross-system task consumed 3,000-4,000 tokens before Claude wrote a single line of code. And it had to &lt;em&gt;decide&lt;/em&gt; which files to read — sometimes guessing wrong, reading 5 files before finding the answer in the 6th.&lt;/p&gt;

&lt;p&gt;The symptoms were subtle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude would re-explore code I'd already documented&lt;/li&gt;
&lt;li&gt;It would suggest ports that were already in use&lt;/li&gt;
&lt;li&gt;It would miss dependencies between services ("that endpoint moved to the Dell server last week")&lt;/li&gt;
&lt;li&gt;It would launch Explore agents to grep the codebase for answers that existed in a &lt;code&gt;.md&lt;/code&gt; file it hadn't loaded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each session started with a tax: Claude burning context on orientation instead of doing work. The more documentation I wrote, the worse it got. A classic information retrieval problem disguised as a context window problem.&lt;/p&gt;

&lt;h2&gt;What RAG Doesn't Solve&lt;/h2&gt;

&lt;p&gt;The obvious answer is "just use RAG." Embed the docs, retrieve the relevant chunks, inject them into context. Every AI wrapper does this.&lt;/p&gt;

&lt;p&gt;But RAG over documentation files has a specific failure mode: &lt;strong&gt;your retrieval unit is the wrong size.&lt;/strong&gt; A &lt;code&gt;.md&lt;/code&gt; file is either too big (wastes context with irrelevant sections) or you chunk it and lose the structural relationships between sections. The camera monitor doc has 15 sections — contour detection constants, RTSP URLs, the KV cache leak fix, the PyAV deadlock bug. A chunk retriever might return the RTSP URLs when you asked about the deadlock, because they share keywords like "camera" and "connection."&lt;/p&gt;

&lt;p&gt;What I actually needed was a &lt;strong&gt;memory system&lt;/strong&gt; — not document retrieval, but a knowledge base where each entry is a self-contained fact with metadata, and search is semantic, not keyword.&lt;/p&gt;

&lt;h2&gt;The Memory Server&lt;/h2&gt;

&lt;p&gt;I built a 767-line Python server that gives Claude six MCP tools: &lt;code&gt;memory_search&lt;/code&gt;, &lt;code&gt;memory_save&lt;/code&gt;, &lt;code&gt;memory_list&lt;/code&gt;, &lt;code&gt;memory_update&lt;/code&gt;, &lt;code&gt;memory_delete&lt;/code&gt;, &lt;code&gt;memory_stats&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code ──MCP/SSE──▶ memory_server.py (port 8042)
                              │
                              ├── sentence-transformers (all-MiniLM-L6-v2)
                              │     └── 384-dim embeddings, runs on CPU
                              │
                              ├── TurboQuant (4-bit vector compression)
                              │     └── in-memory index, persisted to disk
                              │
                              └── MySQL
                                    └── content, tags, category, agent_id, timestamps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each memory is a self-contained knowledge unit — a bug fix, a port mapping, an architecture decision, a credential, a "never do this" lesson. Not a document chunk. A &lt;em&gt;fact&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The server exposes both MCP (for Claude Code) and REST (for other agents). Search is cosine similarity on MiniLM-L6-v2 embeddings, compressed to 4-bit with TurboQuant for a smaller memory footprint. Metadata lives in MySQL for filtering by category, tags, and agent.&lt;/p&gt;
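&lt;p&gt;The search path is small enough to sketch. The following is a minimal stand-in, not the server's actual code: &lt;code&gt;cosine_search&lt;/code&gt; is a hypothetical name, and toy 4-dimensional vectors replace the real 384-dimensional MiniLM embeddings that &lt;code&gt;sentence-transformers&lt;/code&gt; would produce.&lt;/p&gt;

```python
import math

def _unit(v):
    """L2-normalize so cosine similarity reduces to a dot product."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_search(query_vec, index, top_k=3, min_score=0.4):
    """Score every stored memory against the query; keep only confident hits."""
    q = _unit(query_vec)
    scored = [(mid, sum(a * b for a, b in zip(q, vec))) for mid, vec in index]
    scored = [(mid, s) for mid, s in scored if s >= min_score]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

# Toy 4-dim vectors stand in for the real 384-dim MiniLM embeddings,
# pre-normalized at save time as the server would do.
index = [
    ("rtsp-deadlock-fix", _unit([0.9, 0.1, 0.0, 0.1])),
    ("dgx-port-map",      _unit([0.1, 0.9, 0.2, 0.0])),
]
results = cosine_search(_unit([1.0, 0.0, 0.0, 0.2]), index)
```

&lt;p&gt;Because vectors are normalized when saved, scoring is a plain dot product, and the &lt;code&gt;min_score&lt;/code&gt; cutoff keeps tangential memories out of Claude's context.&lt;/p&gt;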

&lt;h2&gt;The Memory-First Rule&lt;/h2&gt;

&lt;p&gt;The server alone isn't enough. Claude needs to be &lt;em&gt;told&lt;/em&gt; to use it. In &lt;code&gt;CLAUDE.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### MANDATORY: Memory-First Rule&lt;/span&gt;
&lt;span class="ge"&gt;**&lt;/span&gt;BEFORE reading any files, exploring code, or launching agents
— ALWAYS use &lt;span class="sb"&gt;`memory_search`&lt;/span&gt; first.&lt;span class="ge"&gt;**&lt;/span&gt;

&lt;span class="gs"&gt;**Order of operations for ANY question:**&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="sb"&gt;`memory_search`&lt;/span&gt; with relevant keywords
&lt;span class="p"&gt;2.&lt;/span&gt; Only if memory has no results → then read files/explore code
&lt;span class="p"&gt;3.&lt;/span&gt; If memory server is down → fall back to local .md files

&lt;span class="gs"&gt;**DO NOT:**&lt;/span&gt; Launch Explore agents, read source files, or grep
the codebase as a first step. Memory has it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single rule changed everything. Instead of reading 5 files to find a port number, Claude runs one semantic search and gets a scored result in &amp;lt;100ms. The context window stays clean for actual work.&lt;/p&gt;

&lt;h2&gt;What Went Into Memory&lt;/h2&gt;

&lt;p&gt;I wrote an import script that parsed every &lt;code&gt;.md&lt;/code&gt; documentation file, split them by &lt;code&gt;##&lt;/code&gt; headers into self-contained sections, and bulk-loaded them as individual memories. 30 files became 200+ memories, each tagged with source file and category.&lt;/p&gt;
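&lt;p&gt;The splitting step can be sketched as follows. This is an illustrative reconstruction, not the import script itself; the section-to-memory mapping and field names are assumptions.&lt;/p&gt;

```python
def split_sections(markdown_text, source_file):
    """One memory per `##` section: title plus body, tagged with its source."""
    memories, title, lines = [], None, []

    def flush():
        body = "\n".join(lines).strip()
        if title and body:
            memories.append({"content": f"{title}: {body}",
                             "tags": [source_file]})

    for line in markdown_text.splitlines():
        if line.startswith("## "):
            flush()                         # close out the previous section
            title, lines = line[3:].strip(), []
        elif title is not None:
            lines.append(line)
    flush()                                 # close out the final section
    return memories

doc = """# Camera Monitor
## RTSP URLs
rtsp://cam1.local/stream

## KV Cache Leak Fix
PARAMETER num_ctx 4096
"""
mems = split_sections(doc, "camera_monitor.md")
```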

&lt;p&gt;Some examples of what a single memory entry looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"DGX Spark Ollama runs on port 11434, WebSocket proxy on 8765, adapter on 8091. Request flow: gate_monitor → adapter HTTP → WebSocket → proxy → Ollama."&lt;/li&gt;
&lt;li&gt;"KV Cache Leak Fix: Ollama allocates full KV cache based on model's context_length field. Fix: create derived Modelfile with &lt;code&gt;PARAMETER num_ctx 4096&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;"Cross-thread RTSP kill bug: analysis_loop calling container.close() from wrong thread → PyAV deadlock at 300% CPU. Fix: threading.Event, rtsp_loop checks between frames."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one is a complete thought. No "see section 3.2 of camera_monitor.md." No dependency on having read the parent document. Claude searches "camera monitor RTSP bug" and gets exactly the fix history — nothing more, nothing less.&lt;/p&gt;

&lt;h2&gt;Per-Agent Isolation&lt;/h2&gt;

&lt;p&gt;The server enforces agent isolation. Every API call requires an &lt;code&gt;agent&lt;/code&gt; parameter — &lt;code&gt;claude&lt;/code&gt;, &lt;code&gt;skippy&lt;/code&gt;, &lt;code&gt;jot&lt;/code&gt;, &lt;code&gt;hermes&lt;/code&gt;. Each agent only sees its own memories. &lt;code&gt;agent="global"&lt;/code&gt; bypasses filtering for debugging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;global&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AND agent_id = %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because I run multiple AI agents with different roles. Claude Code manages infrastructure. Skippy handles WhatsApp conversations. Each needs different knowledge, and no agent should see another's private data. One memory server, multiple isolated namespaces.&lt;/p&gt;

&lt;h2&gt;The Before and After&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before (file-based):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude reads &lt;code&gt;CLAUDE.md&lt;/code&gt; (200 lines)&lt;/li&gt;
&lt;li&gt;Claude decides which &lt;code&gt;.md&lt;/code&gt; files might be relevant&lt;/li&gt;
&lt;li&gt;Claude reads 3-5 files (600-2000 lines)&lt;/li&gt;
&lt;li&gt;Claude sometimes reads the wrong files, backtracks&lt;/li&gt;
&lt;li&gt;Claude finally has enough context, starts working&lt;/li&gt;
&lt;li&gt;Context window: 2,000-4,000 tokens consumed on orientation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;After (memory-first):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude reads &lt;code&gt;CLAUDE.md&lt;/code&gt; (200 lines, includes memory-first rule)&lt;/li&gt;
&lt;li&gt;Claude calls &lt;code&gt;memory_search("camera monitor RTSP port")&lt;/code&gt; → 3 results, 50 lines&lt;/li&gt;
&lt;li&gt;Claude has the answer, starts working&lt;/li&gt;
&lt;li&gt;Context window: ~250 tokens consumed on orientation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The difference isn't just speed. It's &lt;em&gt;accuracy&lt;/em&gt;. Memory search returns scored results ranked by semantic similarity. File reading returns entire documents and hopes Claude finds the relevant paragraph. Memory search at 0.4+ similarity threshold almost always returns the right answer. File reading sometimes returns the right file but the wrong section.&lt;/p&gt;

&lt;h2&gt;What I'd Change&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Score tuning matters.&lt;/strong&gt; I started with &lt;code&gt;min_score: 0.3&lt;/code&gt; which returned too many tangential results. Bumping to &lt;code&gt;0.4&lt;/code&gt; cut noise significantly. Your threshold depends on your embedding model and memory granularity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory hygiene is real work.&lt;/strong&gt; Memories go stale. Ports change, services get decommissioned, bugs get fixed. You need to &lt;code&gt;memory_update&lt;/code&gt; old entries or they'll mislead future sessions. I treat it like documentation — when I change a service, I update both the code and the memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The import granularity is critical.&lt;/strong&gt; Too coarse (full documents) and you're back to RAG's chunking problem. Too fine (individual config values) and you lose relationships. &lt;code&gt;##&lt;/code&gt; header sections turned out to be the right unit for my documentation style — each section is typically one concept with enough context to stand alone.&lt;/p&gt;

&lt;h2&gt;The Stack&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;memory_server.py         — 767 lines, Python (Starlette + uvicorn)
sentence-transformers    — all-MiniLM-L6-v2, 384-dim embeddings
turboquant-vectors       — 4-bit vector compression + cosine search
MySQL                    — metadata, tags, categories, agent ownership
MCP SSE transport        — Claude Code native tool integration
REST API                 — /api/search, /api/save, /api/health (for other agents)
Dell R740 (Ubuntu)       — always-on server, port 8042
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;202 memories. 6 tools. One rule in &lt;code&gt;CLAUDE.md&lt;/code&gt;. Claude went from spending its first 30 seconds reading the wrong files to spending 100ms finding the right answer.&lt;/p&gt;

&lt;p&gt;The irony isn't lost on me: I built an AI memory system to make a different AI smarter. But that's the actual state of the art — AI systems that get better not from bigger models, but from better access to the right information at the right time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>devops</category>
      <category>homelab</category>
    </item>
    <item>
      <title>My Security Cameras Were Dead for 3 Days. Now They Fix Themselves.</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Wed, 22 Apr 2026 17:12:23 +0000</pubDate>
      <link>https://forem.com/ljkunal/my-security-cameras-were-dead-for-3-days-now-they-fix-themselves-139c</link>
      <guid>https://forem.com/ljkunal/my-security-cameras-were-dead-for-3-days-now-they-fix-themselves-139c</guid>
      <description>&lt;p&gt;I run three AI-powered security cameras at home. RTSP streams feed into a Python daemon that runs OpenCV contour detection, sends cropped regions to a vision LLM on an NVIDIA DGX Spark, and fires WhatsApp alerts when it spots something.&lt;/p&gt;

&lt;p&gt;It works great — until it doesn't.&lt;/p&gt;

&lt;p&gt;On April 13th, the cameras silently died. No alerts. No crash. No logs. The process was "running." &lt;code&gt;launchctl&lt;/code&gt; showed a healthy PID. The dashboard showed the last captured frame — frozen, three days old.&lt;/p&gt;

&lt;p&gt;Nobody noticed until April 16th.&lt;/p&gt;

&lt;h2&gt;The Bug That Looked Like Nothing&lt;/h2&gt;

&lt;p&gt;The camera monitor runs multiple threads: an &lt;code&gt;rtsp_loop&lt;/code&gt; per camera that decodes RTSP frames, and an &lt;code&gt;analysis_loop&lt;/code&gt; that sends frames to the vision model. When the analysis loop decided a camera needed reconnection — stale frames, RTSP errors — it called &lt;code&gt;container.close()&lt;/code&gt; on the PyAV RTSP container.&lt;/p&gt;

&lt;p&gt;The problem: &lt;code&gt;container.close()&lt;/code&gt; was called from &lt;code&gt;analysis_loop&lt;/code&gt;'s thread. The RTSP container was being read by &lt;code&gt;rtsp_loop&lt;/code&gt;'s thread. PyAV wraps FFmpeg's C-level network read, which can't be safely interrupted cross-thread.&lt;/p&gt;

&lt;p&gt;The result: deadlock. Both threads frozen. Process stuck at 300% CPU doing nothing. macOS &lt;code&gt;launchctl&lt;/code&gt; saw a running PID and was satisfied. KeepAlive didn't trigger. Logs stopped flowing, but nobody reads logs at 3 AM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# What launchctl saw:
2081    0    com.gate.monitor     ← "running"

# What was actually happening:
PID 2081   300% CPU   state: R (running — spinning in deadlock)
Last log line: 2026-04-13 05:27:21   ← 3 days ago
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I fixed the cross-thread bug — replaced &lt;code&gt;container.close()&lt;/code&gt; with a &lt;code&gt;threading.Event&lt;/code&gt; that &lt;code&gt;rtsp_loop&lt;/code&gt; checks between frames and breaks cleanly from its own thread. But I knew there'd be a next bug. There's always a next bug.&lt;/p&gt;
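&lt;p&gt;The shape of the fix, with a simulated decode loop standing in for PyAV (class and method names here are illustrative, not the monitor's actual code):&lt;/p&gt;

```python
import threading
import time

class CameraWorker:
    """Sketch of the fix: only the reader thread touches its own container."""

    def __init__(self):
        self.stop_event = threading.Event()
        self.frames = 0

    def rtsp_loop(self):
        # This thread owns the (simulated) PyAV container, so the close
        # happens on the same thread that performs the blocking network reads.
        while not self.stop_event.is_set():
            self.frames += 1       # stands in for decoding one RTSP frame
            time.sleep(0.01)
        # container.close() would go here, safely on the reader's own thread

    def request_reconnect(self):
        # What analysis_loop calls instead of container.close()
        self.stop_event.set()

worker = CameraWorker()
t = threading.Thread(target=worker.rtsp_loop)
t.start()
time.sleep(0.05)               # let a few simulated frames through
worker.request_reconnect()
t.join(timeout=1)
```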

&lt;h2&gt;What Traditional Monitoring Misses&lt;/h2&gt;

&lt;p&gt;Here's what every basic health check would have said during those 3 days:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Reality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Process running?&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Zombie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exit code 0?&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Never exited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Port responding?&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No HTTP server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk space OK?&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Irrelevant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failure was &lt;strong&gt;behavioral&lt;/strong&gt;, not structural. The process was alive but brain-dead. You need a watchdog that understands what "healthy" actually means for your specific service.&lt;/p&gt;

&lt;h2&gt;The Watchdog&lt;/h2&gt;

&lt;p&gt;I wrote &lt;code&gt;camera_watchdog.py&lt;/code&gt; — a separate process that runs every 12 hours via LaunchAgent. It checks four things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Process alive + CPU zombie detection&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_process_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ps aux | grep &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | grep -v grep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Parse PID and CPU% from ps output
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;procs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_process_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_monitor.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;procs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_monitor process NOT RUNNING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;procs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# CPU_ZOMBIE_THRESHOLD
&lt;/span&gt;    &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_monitor is a ZOMBIE (PID &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;procs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;procs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% CPU)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A process at 300% CPU with no log output for 10 minutes isn't "running." It's a zombie.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Log freshness&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_log_age_minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/gate_monitor.log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# LOG_STALE_MIN
&lt;/span&gt;    &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_monitor log is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; min stale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The camera monitor writes log lines every few seconds. If the last line is older than 10 minutes, something is wrong — even if the process has a PID.&lt;/p&gt;
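&lt;p&gt;The &lt;code&gt;get_log_age_minutes&lt;/code&gt; helper referenced above could be as simple as an mtime check (one possible implementation, not necessarily the watchdog's):&lt;/p&gt;

```python
import os
import time

def get_log_age_minutes(path):
    """Minutes since the log file last changed; infinite if it is missing."""
    try:
        return (time.time() - os.path.getmtime(path)) / 60.0
    except OSError:
        return float("inf")  # no log file at all is also a failure signal
```

&lt;p&gt;Using the file's mtime is cheaper than parsing the timestamp of the last line, and accurate enough when the monitor normally writes every few seconds.&lt;/p&gt;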

&lt;p&gt;&lt;strong&gt;3. Error filtering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all errors are equal. RTSP drops, connection resets, and timeouts are transient — the monitor handles them internally. The watchdog only flags errors that &lt;em&gt;aren't&lt;/em&gt; in the known-transient list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TRANSIENT_PATTERNS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection reset by peer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RTSP error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timed out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection refused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Broken pipe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Only flag errors NOT matching transient patterns
&lt;/span&gt;&lt;span class="n"&gt;serious_errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error_lines&lt;/span&gt;
                  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TRANSIENT_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Dependency checks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The camera monitor depends on an LLM adapter (localhost:8091) and DGX Ollama (remote GPU server). If these are down, the monitor will silently stop analyzing frames. The watchdog checks both and can restart the local adapter.&lt;/p&gt;
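&lt;p&gt;A dependency probe needs nothing more than an HTTP request with a short timeout. A sketch; the health-check URLs shown in the comment are guesses, not the actual endpoints:&lt;/p&gt;

```python
import urllib.request

def dependency_up(url, timeout=5):
    """True if the dependency answers HTTP at all; False on refusal/timeout."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except OSError:  # URLError subclasses OSError: covers refused, DNS, timeout
        return False

# e.g. dependency_up("http://localhost:8091/health") for the adapter, and
# dependency_up("http://dgx-spark:11434/api/tags") for Ollama (URLs are guesses)
```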

&lt;h2&gt;The Escalation Hierarchy&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. The watchdog doesn't just check — it fixes. In three tiers:&lt;/p&gt;

&lt;h3&gt;Tier 1: Simple Fix (up to 3 attempts)&lt;/h3&gt;

&lt;p&gt;Kill the zombie. Restart the LaunchAgent. Restart dependencies. Wait 15 seconds. Check if logs are flowing again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zombie_pid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;kill_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zombie_pid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# LaunchAgent auto-restarts
&lt;/span&gt;
&lt;span class="c1"&gt;# Verify it came back
&lt;/span&gt;&lt;span class="n"&gt;procs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_process_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_monitor.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;procs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;restart_launchagent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.gate.monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most issues stop here. Dead process? Restart it. Zombie? Kill it and let KeepAlive do its job. Adapter down? Reload its plist.&lt;/p&gt;
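&lt;p&gt;The &lt;code&gt;kill_process&lt;/code&gt; and &lt;code&gt;restart_launchagent&lt;/code&gt; helpers used above could look like this. A sketch under stated assumptions: the &lt;code&gt;launchctl kickstart&lt;/code&gt; form is macOS-specific and may need adjusting for older launchd versions.&lt;/p&gt;

```python
import os
import signal
import subprocess

def kill_process(pid):
    """SIGKILL the zombie; SIGTERM cannot interrupt a thread deadlocked in C."""
    try:
        os.kill(int(pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # already gone

def restart_launchagent(label):
    """Force launchd to restart the agent (macOS; flags may vary by version)."""
    subprocess.run(
        ["launchctl", "kickstart", "-k", f"gui/{os.getuid()}/{label}"],
        check=False,
    )
```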

&lt;h3&gt;Tier 2: Claude Code&lt;/h3&gt;

&lt;p&gt;If three restart attempts fail, the problem isn't operational — it's in the code. The watchdog invokes Claude Code to autonomously diagnose and fix the bug:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/homebrew/bin/claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--print&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--dangerously-skip-permissions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 5 min max
&lt;/span&gt;    &lt;span class="n"&gt;cwd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/Users/chimpoo/repos/camera-monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt includes the last 100 lines of logs, process status, and instructions to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the source code&lt;/li&gt;
&lt;li&gt;Diagnose the root cause from logs + code&lt;/li&gt;
&lt;li&gt;Fix the bug if it's a code issue&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git commit&lt;/code&gt; and &lt;code&gt;git push&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Kill the old process (LaunchAgent restarts with new code)&lt;/li&gt;
&lt;li&gt;Verify logs are flowing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Claude gets &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; because it's running unattended at 3 AM. There's no human to click "approve." It has 5 minutes to read, reason, patch, commit, and deploy.&lt;/p&gt;
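&lt;p&gt;The prompt itself is plain string assembly. A minimal sketch of the log-gathering half, with a hypothetical helper name and condensed wording (the real prompt is longer):&lt;/p&gt;

```python
from pathlib import Path

def build_fix_prompt(log_path: str, service: str) -> str:
    # Tail the last 100 log lines as diagnostic context
    lines = Path(log_path).read_text(errors="replace").splitlines()[-100:]
    return (
        f"Service '{service}' is unhealthy. Recent logs:\n"
        + "\n".join(lines)
        + "\n\nRead the source, diagnose the root cause from logs and code, "
        "fix the bug if it is a code issue, git commit and git push, "
        "kill the old process, then verify logs are flowing."
    )
```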

&lt;h3&gt;
  
  
  Tier 3: Human
&lt;/h3&gt;

&lt;p&gt;If Claude can't fix it either, a Telegram message arrives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚨 Camera Watchdog — Needs manual intervention

Problems: gate_monitor log is 3842 min stale
Restart attempts: 3
Claude attempted fix but failed:
[Claude's analysis of why it couldn't fix the issue]

Check: ssh chimpoo@192.168.0.26
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Reporting
&lt;/h2&gt;

&lt;p&gt;Every heartbeat ends with a Telegram message. You always know what happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ All systems nominal          — nothing to do
🔧 Issue detected and fixed    — restarted something
🤖 Claude auto-fix applied     — code was patched
🚨 Needs manual intervention   — you need to SSH in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
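&lt;p&gt;The report goes out through stdlib &lt;code&gt;urllib&lt;/code&gt;, no Telegram SDK. A sketch of the Bot API call; the helper names are mine, and the token/chat-ID plumbing is up to you:&lt;/p&gt;

```python
import json
import urllib.request

def build_telegram_request(token: str, chat_id: str, text: str):
    # Telegram Bot API sendMessage endpoint plus JSON body
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode()
    return url, body

def send_telegram(token: str, chat_id: str, text: str) -> dict:
    url, body = build_telegram_request(token, chat_id, text)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```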



&lt;p&gt;State persists between runs in &lt;code&gt;/tmp/camera_watchdog_state.json&lt;/code&gt; — restart count, last Claude fix timestamp, history of fixes applied. The restart counter resets on the next healthy heartbeat, so a recovered service doesn't carry baggage.&lt;/p&gt;
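&lt;p&gt;The state handling is a few lines of &lt;code&gt;json&lt;/code&gt;. A sketch with assumed field names (the real file layout may differ):&lt;/p&gt;

```python
import json

STATE_PATH = "/tmp/camera_watchdog_state.json"

def load_state(path=STATE_PATH) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"restarts": 0, "last_claude_fix": 0, "fix_history": []}

def save_heartbeat(state: dict, healthy: bool, path=STATE_PATH) -> dict:
    if healthy:
        state["restarts"] = 0  # a recovered service carries no baggage
    with open(path, "w") as f:
        json.dump(state, f)
    return state
```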

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run it more often.&lt;/strong&gt; Every 12 hours means worst-case you lose 12 hours of camera coverage before the watchdog notices. I'm considering dropping it to every 30 minutes. The checks themselves take &amp;lt;5 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log rotation.&lt;/strong&gt; The camera monitor wipes its log on restart. If the watchdog kills a zombie before capturing logs, the evidence disappears. A proper log rotation would preserve the forensics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test the Claude tier.&lt;/strong&gt; I've only seen Tier 1 fire in production. The Claude escalation path is written and theoretically sound, but I haven't had a code-level bug recur since the PyAV fix. Which is either good engineering or an untested code path — depending on your perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Part
&lt;/h2&gt;

&lt;p&gt;Giving an AI agent &lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; over your codebase, with the ability to commit and deploy, at 3 AM with nobody watching — that should make you uncomfortable. It makes me uncomfortable.&lt;/p&gt;

&lt;p&gt;But here's the trade-off: my cameras were dead for 3 days and nobody noticed. The cost of unattended downtime, for a security system, is higher than the cost of a bad auto-fix. Claude can't brick the system worse than "not running." And if it makes a wrong fix, I have git history and the Telegram receipt.&lt;/p&gt;

&lt;p&gt;The watchdog is 280 lines of stdlib Python. No frameworks, no dependencies, no infrastructure. Just &lt;code&gt;subprocess&lt;/code&gt;, &lt;code&gt;urllib&lt;/code&gt;, and the knowledge that every system eventually breaks — and the interesting question is what happens next.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;camera_watchdog.py        — 280 lines, stdlib Python, LaunchAgent
gate_monitor.py           — RTSP + OpenCV + vision LLM pipeline
DGX Spark (Blackwell GPU) — Ollama + gemma4:31b-4k vision model
Claude Code CLI            — autonomous diagnosis + code fix
Telegram Bot API           — reporting (stdlib urllib, no SDK)
macOS LaunchAgent          — scheduling + KeepAlive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full watchdog source is straightforward enough to adapt for any daemon you run. The three-tier pattern — restart, AI fix, human escalation — is the part worth stealing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>selfhealing</category>
      <category>homelab</category>
    </item>
    <item>
      <title>The Wrong GUID: How a Single Constant Broke WebSocket in Every Browser But Not Python</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:24:49 +0000</pubDate>
      <link>https://forem.com/ljkunal/the-wrong-guid-how-a-single-constant-broke-websocket-in-every-browser-but-not-python-1ojd</link>
      <guid>https://forem.com/ljkunal/the-wrong-guid-how-a-single-constant-broke-websocket-in-every-browser-but-not-python-1ojd</guid>
      <description>&lt;p&gt;I run a home automation setup with multiple RTSP IP cameras. The camera dashboard shows a grid of all cameras with a &lt;strong&gt;"Stream All"&lt;/strong&gt; button. Each stream is an MJPEG feed served through ffmpeg via a Python HTTP server.&lt;/p&gt;

&lt;p&gt;Click "Stream All" and you'd expect every feed to light up. Instead, 5 or 6 cameras would load and the rest would stay black forever. Refresh, and a &lt;em&gt;different&lt;/em&gt; set of cameras would load.&lt;/p&gt;

&lt;p&gt;The culprit was well-known: &lt;strong&gt;browsers limit ~6 concurrent HTTP/1.1 connections per origin&lt;/strong&gt;. Each MJPEG stream is a long-lived HTTP response that never closes. Camera 7 has to wait for camera 1 to finish — which is never.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: WebSocket Multiplexing
&lt;/h2&gt;

&lt;p&gt;WebSocket connections don't count against the browser's per-origin HTTP/1.1 limit, and a single connection can multiplex any number of streams. The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;StreamManager&lt;/strong&gt; — a shared ffmpeg process pool. One ffmpeg per camera, regardless of how many clients are watching. Frames broadcast to all subscribers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/ws/cameras&lt;/code&gt; endpoint&lt;/strong&gt; — clients subscribe to cameras via JSON commands, receive binary JPEG frames with a camera ID prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side JS&lt;/strong&gt; — single WebSocket connection, Blob URLs for rendering frames to &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; elements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The constraint: &lt;strong&gt;stdlib only&lt;/strong&gt;. No pip dependencies. This is a single-file Python HTTP server running as a macOS LaunchAgent. I implemented the WebSocket protocol (RFC 6455) from scratch — handshake, frame encoding/decoding, ping/pong, binary and text frames.&lt;/p&gt;

&lt;h3&gt;
  
  
  The frame format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Server → Client binary frame:&lt;/span&gt;
&lt;span class="c1"&gt;// [1 byte: camera ID length] [N bytes: camera IP] [JPEG bytes]&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;camId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextDecoder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jpeg&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Render to &amp;lt;img&amp;gt; via Blob URL&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;jpeg&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image/jpeg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createObjectURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
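&lt;p&gt;The server side of this payload is the mirror image. A sketch in Python (function names are mine; this is the application payload, not the RFC 6455 wire frame):&lt;/p&gt;

```python
def encode_camera_frame(camera_ip: str, jpeg: bytes) -> bytes:
    # Payload layout: [1 byte: ID length] [N bytes: camera IP] [JPEG bytes]
    cam_id = camera_ip.encode("utf-8")
    if len(cam_id) > 255:
        raise ValueError("camera ID must fit in one length byte")
    return bytes([len(cam_id)]) + cam_id + jpeg

def decode_camera_frame(payload: bytes):
    # Inverse operation, matching the JavaScript decoder above
    id_len = payload[0]
    return payload[1:1 + id_len].decode("utf-8"), payload[1 + id_len:]
```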



&lt;h2&gt;
  
  
  Python Says: Works Perfectly
&lt;/h2&gt;

&lt;p&gt;I wrote a Python test client that connects over TLS, subscribes to all cameras, and counts frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Subscribe to ALL cameras over single WebSocket
&lt;/span&gt;&lt;span class="nf"&gt;ws_send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cameras&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;camera_ips&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="c1"&gt;# Result after 12 seconds:
# Camera IP          Frames
# x.x.x.11              19
# x.x.x.13              20
# x.x.x.16              19
# ... (all cameras streaming)
# TOTAL               209 frames from all cameras
# Total data: ~8 MB in 12s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Every camera streaming. ~8 MB over a single WebSocket in 12 seconds.&lt;/strong&gt; The StreamManager, ffmpeg pool, frame extraction, binary WebSocket framing — all working perfectly.&lt;/p&gt;

&lt;p&gt;Time to open it in a browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  Browsers Say: No
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;WebSocket Test
04:46:59.972 Connecting...
04:47:00.148 ERROR: {"isTrusted":true}
04:47:00.148 CLOSED code=1006 reason= clean=false
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code &lt;strong&gt;1006&lt;/strong&gt;. Abnormal closure. No reason. Not clean. The &lt;code&gt;onopen&lt;/code&gt; callback never fires. Both Safari and Brave. Every single time.&lt;/p&gt;

&lt;p&gt;The server logs told a different story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ws] Client connected from x.x.x.x
[ws] Sending 101 (129 bytes)
[ws] Reader error: ConnectionError: WebSocket connection closed
[ws] Client disconnected, was watching 0 cameras
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server &lt;em&gt;sent&lt;/em&gt; the 101 Switching Protocols response. The client &lt;em&gt;connected&lt;/em&gt; at the TCP level. But the browser never acknowledged the upgrade. It just... closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Spiral
&lt;/h2&gt;

&lt;p&gt;What followed was hours of systematically ruling out every possible cause:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1: TLS Certificate&lt;/strong&gt; ❌&lt;br&gt;
Self-signed cert? Generated proper certs with mkcert, installed CA in system keychain. Pages loaded without warnings. WebSocket still failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2: Buffering&lt;/strong&gt; ❌&lt;br&gt;
Maybe &lt;code&gt;wfile&lt;/code&gt; is buffering the 101? Tried &lt;code&gt;handler.wfile.write()&lt;/code&gt; + &lt;code&gt;flush()&lt;/code&gt;, then &lt;code&gt;handler.connection.sendall()&lt;/code&gt;, then &lt;code&gt;handler.request.sendall()&lt;/code&gt;. All sent the bytes. Browser still rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3: HTTP Protocol Version&lt;/strong&gt; ❌&lt;br&gt;
Python's &lt;code&gt;BaseHTTPRequestHandler.protocol_version&lt;/code&gt; defaults to &lt;code&gt;"HTTP/1.0"&lt;/code&gt;. WebSocket requires HTTP/1.1. Set it to &lt;code&gt;"HTTP/1.1"&lt;/code&gt;. Then built the response manually with raw bytes. Still failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 4: ALPN Negotiation&lt;/strong&gt; ❌&lt;br&gt;
Maybe the browser is negotiating HTTP/2 via ALPN? Added &lt;code&gt;ctx.set_alpn_protocols(["http/1.1"])&lt;/code&gt;. No change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 5: Bypass the HTTP handler entirely&lt;/strong&gt; ❌&lt;br&gt;
Built a standalone raw WebSocket server on a separate port — pure socket, no &lt;code&gt;BaseHTTPRequestHandler&lt;/code&gt;. Read the HTTP request manually, send 101 manually. &lt;strong&gt;Still failed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 6: Mixed content / port issues&lt;/strong&gt; ❌&lt;br&gt;
Tried &lt;code&gt;ws://&lt;/code&gt; on the HTTP port (Brave auto-upgraded to HTTPS). Tried serving from HTTP. Tried different ports. Nothing.&lt;/p&gt;

&lt;p&gt;At this point I had verified the 101 response byte-by-byte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Captured via Python, raw bytes from the server:
&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HTTP/1.1 101 Switching Protocols&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Upgrade: websocket&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Connection: Upgrade&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sec-WebSocket-Accept: MuIAfeA8S6DsJZLE/8a3flJsJzM=&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# 129 bytes. Correct CRLF. Correct headers. Correct format.
# Python clients: works. Browsers: 1006.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The response was &lt;strong&gt;byte-for-byte correct&lt;/strong&gt;. Correct HTTP version. Correct headers. Correct line endings. Correct empty line. And yet every browser on Earth rejected it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Dad Steps In
&lt;/h2&gt;

&lt;p&gt;I shared the full debugging context with my dad — who happens to run a swarm of AI agents for exactly this kind of problem. His analysis was surgical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sec-WebSocket-Accept is not byte-perfect.&lt;/strong&gt; The computation must use the exact GUID from RFC 6455. Even if the format looks right, if the GUID constant is wrong, the Accept value will be wrong. Python clients don't validate the Accept header. Browsers do. This is the #1 hidden killer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I looked at my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- _WS_MAGIC = b"258EAFA5-E914-47DA-95CA-5AB5F43F86A2"
&lt;/span&gt;&lt;span class="gi"&gt;+ _WS_MAGIC = b"258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic GUID. The one constant that every WebSocket implementation on the planet must agree on. &lt;strong&gt;I had it wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not slightly wrong. Not a typo in one character. The entire last segment was different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5AB5F43F86A2  ← what I had (WRONG)
C5AB0DC85B11  ← what RFC 6455 specifies (CORRECT)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Python Didn't Care
&lt;/h2&gt;

&lt;p&gt;The WebSocket handshake works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client sends &lt;code&gt;Sec-WebSocket-Key: &amp;lt;random base64&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Server concatenates it with the magic GUID&lt;/li&gt;
&lt;li&gt;Server SHA-1 hashes the result, base64 encodes it&lt;/li&gt;
&lt;li&gt;Server sends back &lt;code&gt;Sec-WebSocket-Accept: &amp;lt;hash&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Client verifies the Accept value matches what it expects&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 5 is where the divergence happens. My hand-rolled Python test client — like many minimal WebSocket clients — &lt;strong&gt;skipped the Accept validation&lt;/strong&gt;. It saw "101 Switching Protocols" and proceeded. The Accept header was there, but nobody checked it.&lt;/p&gt;

&lt;p&gt;Browsers check it. Strictly. Silently. If it doesn't match, they close the TCP connection without sending a close frame — which is why you get code 1006 ("abnormal closure") with no reason string. The browser doesn't even tell you &lt;em&gt;what&lt;/em&gt; was wrong.&lt;/p&gt;
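&lt;p&gt;Steps 2–4 are three lines of stdlib Python, and RFC 6455 ships a worked example key that doubles as a self-test for the constant:&lt;/p&gt;

```python
import base64
import hashlib

_WS_MAGIC = b"258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # RFC 6455 GUID

def ws_accept(sec_websocket_key: str) -> str:
    # SHA-1 over key + GUID, then base64: the Sec-WebSocket-Accept value
    digest = hashlib.sha1(sec_websocket_key.encode("ascii") + _WS_MAGIC).digest()
    return base64.b64encode(digest).decode("ascii")
```

&lt;p&gt;The RFC's sample key &lt;code&gt;dGhlIHNhbXBsZSBub25jZQ==&lt;/code&gt; must produce &lt;code&gt;s3pPLMBiTxaQ9kYGzzhZRbK+xOo=&lt;/code&gt;. If it doesn't, your GUID is wrong, and a strict client will close with 1006.&lt;/p&gt;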

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt; One line. One constant. Changed the GUID to the correct RFC 6455 value. All cameras streaming in every browser instantly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Browsers are strict. Clients are lenient.
&lt;/h3&gt;

&lt;p&gt;Don't assume your test client validates what a browser validates. The &lt;code&gt;Sec-WebSocket-Accept&lt;/code&gt; header exists specifically so the client can verify the server understood the WebSocket protocol. Python clients being lenient masked a fatal bug for hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Error code 1006 is useless
&lt;/h3&gt;

&lt;p&gt;1006 means "I closed the connection abnormally." It doesn't say why. It could be a network error, a TLS issue, a protocol violation, or a wrong Accept header. Browser DevTools don't show the specific validation failure. This is a spec decision — 1006 is never sent over the wire, it's generated locally — but it makes debugging nearly impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Magic constants are the worst kind of bug
&lt;/h3&gt;

&lt;p&gt;The GUID &lt;code&gt;258EAFA5-E914-47DA-95CA-C5AB0DC85B11&lt;/code&gt; is an arbitrary string chosen by the RFC authors. It has no structure, no checksum, no way to validate it in isolation. If you copy it wrong, everything looks correct until a strict client rejects it. Use a well-tested library, or copy the constant from the actual RFC text — not from memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Test with the real consumer
&lt;/h3&gt;

&lt;p&gt;My Python test proved the architecture worked: shared ffmpeg pool, subscriber queues, frame broadcast, binary WebSocket framing. All solid. But I should have tested with a browser from minute one. The gap between "works in Python" and "works in Chrome" was exactly one wrong constant.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The architecture was right
&lt;/h3&gt;

&lt;p&gt;Despite the handshake bug, the design held up perfectly once the GUID was fixed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All cameras, 1 WebSocket, ~2 fps each&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared ffmpeg pool&lt;/strong&gt; — one process per camera regardless of viewer count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-second grace period&lt;/strong&gt; on unsubscribe — handles page refresh without killing ffmpeg&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-restart&lt;/strong&gt; on ffmpeg crash (max 3 in 30s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MJPEG fallback&lt;/strong&gt; for clients that can't do WebSocket&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero pip dependencies&lt;/strong&gt; — pure stdlib Python&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;For anyone building something similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Server: Python stdlib HTTP server with manual WebSocket
# Camera: RTSP → ffmpeg → MJPEG frames
# Transport: WebSocket binary frames (1-byte ID prefix + JPEG)
# Client: Blob URLs → &amp;lt;img&amp;gt; elements
# Infra: macOS LaunchAgent, mkcert TLS
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StreamManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# One ffmpeg per camera, broadcast to N subscribers
&lt;/span&gt;    &lt;span class="c1"&gt;# subscribe(cam_id, queue) → start ffmpeg if first
&lt;/span&gt;    &lt;span class="c1"&gt;# unsubscribe(cam_id, queue) → stop ffmpeg if last (after 10s grace)
&lt;/span&gt;    &lt;span class="c1"&gt;# Reader thread: extract JPEGs from ffmpeg stdout
&lt;/span&gt;    &lt;span class="c1"&gt;# Queue per client, maxsize=100, drop on full
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CameraWS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Client-side JavaScript
&lt;/span&gt;    &lt;span class="c1"&gt;# Single WebSocket to /ws/cameras
&lt;/span&gt;    &lt;span class="c1"&gt;# subscribe([ips]) / unsubscribe([ips])
&lt;/span&gt;    &lt;span class="c1"&gt;# onmessage → Blob URL → img.src
&lt;/span&gt;    &lt;span class="c1"&gt;# Auto-reconnect, 3-fail MJPEG fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The WebSocket approach is the correct way to beat the browser connection limit for multi-camera MJPEG streaming. Just make sure you copy the GUID correctly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For the record, the WebSocket magic GUID from RFC 6455 Section 4.2.2 is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;258EAFA5-E914-47DA-95CA-C5AB0DC85B11&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Commit it to memory. Or better yet, don't — use a library.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>websocket</category>
      <category>python</category>
      <category>debugging</category>
      <category>homeautomation</category>
    </item>
    <item>
      <title>Building a Real-Time Security Camera System with Local Vision LLMs</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:19:05 +0000</pubDate>
      <link>https://forem.com/ljkunal/building-a-real-time-security-camera-system-with-local-vision-llms-2kgj</link>
      <guid>https://forem.com/ljkunal/building-a-real-time-security-camera-system-with-local-vision-llms-2kgj</guid>
      <description>&lt;p&gt;I replaced my Lorex NVR's motion detection — which alerted me 40 times a day about swaying trees and shadows — with a pipeline that uses a vision language model to understand what it's actually seeing. It runs entirely on local hardware, costs nothing after setup, and sends me a WhatsApp message only when something real happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3× Lorex 4K cameras (RTSP)
    ↓
gate_monitor.py (Mac Studio, M2 Ultra)
    ├── OpenCV: frame capture every 5s per camera
    ├── OpenCV: contour-based motion detection (frame N vs N-1)
    ├── Crop: extract largest changed region
    ├── VLM: qwen2.5vl:7b on DGX Spark (Blackwell, 10GbE link)
    │   └── "Classify this crop: ALERT or CLEAR?"
    ├── Alert: annotate frame with contour boxes
    │   ├── WiiM speaker announcement (TTS)
    │   └── WhatsApp message with image
    └── Audio: faster-whisper transcription (gate camera only)
        └── Gated by visual confirmation (120s window)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three cameras — front gate, backyard, driveway — each running in parallel threads. The system processes about &lt;strong&gt;50,000 VLM inference calls per day&lt;/strong&gt; and has been running 24/7 for weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use YOLO?
&lt;/h2&gt;

&lt;p&gt;Traditional object detection (YOLO, SSD) tells you &lt;em&gt;what&lt;/em&gt; is in a frame. A vision language model tells you &lt;em&gt;what's happening&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;My gate camera watches a residential street. YOLO would detect "person" for the mail carrier, the neighbor walking their dog, someone cutting through to the next street, and an actual trespasser — all equally. A VLM can distinguish:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"A delivery driver placing a package at the door" → alert&lt;/li&gt;
&lt;li&gt;"A person walking on the public sidewalk beyond the gate" → not relevant&lt;/li&gt;
&lt;li&gt;"The shadow of a tree branch moving across the driveway" → clear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: I don't need the VLM to be fast (it runs at ~15 tok/s). I need it to be smart. By using OpenCV contour detection as a fast pre-filter, the VLM only sees cropped regions where something actually changed — typically 2–5 calls per camera per minute instead of 12.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contour Detection Layer
&lt;/h2&gt;

&lt;p&gt;Before any AI touches a frame, OpenCV does the heavy lifting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture frame, convert to grayscale, resize to 640px width&lt;/li&gt;
&lt;li&gt;Compute absolute difference against previous analyzed frame&lt;/li&gt;
&lt;li&gt;Apply binary threshold (25) and dilation (3 iterations) to merge nearby changes&lt;/li&gt;
&lt;li&gt;Find contours, filter by area (min 150px², max 40% of frame)&lt;/li&gt;
&lt;li&gt;Merge nearby bounding boxes (within 50px)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If no contours survive filtering: &lt;strong&gt;CLEAR&lt;/strong&gt; — zero VLM calls. This happens 70%+ of the time (still frames, minor lighting shifts).&lt;/p&gt;

&lt;p&gt;If contours are found: crop the largest region, send to VLM for classification.&lt;/p&gt;

&lt;p&gt;Every 60 seconds, a fallback full-frame check catches anything that appeared between frames but hasn't moved (a parked car that wasn't there before, a person standing still).&lt;/p&gt;
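&lt;p&gt;Step 5 of the pipeline, merging nearby boxes, is the fiddliest part. A minimal pure-Python sketch (helper names are mine):&lt;/p&gt;

```python
def boxes_touch(a, b, gap=50):
    # Boxes are (x, y, w, h); disjoint if one starts more than `gap`
    # pixels past the other's end on either axis
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    if ax > bx + bw + gap or bx > ax + aw + gap:
        return False
    if ay > by + bh + gap or by > ay + ah + gap:
        return False
    return True

def merge_boxes(boxes, gap=50):
    # Greedily fold each box into the first merged box it touches
    merged = []
    for box in boxes:
        for i, m in enumerate(merged):
            if boxes_touch(box, m, gap):
                x = min(box[0], m[0])
                y = min(box[1], m[1])
                x2 = max(box[0] + box[2], m[0] + m[2])
                y2 = max(box[1] + box[3], m[1] + m[3])
                merged[i] = (x, y, x2 - x, y2 - y)
                break
        else:
            merged.append(tuple(box))
    return merged
```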

&lt;h2&gt;
  
  
  Exclusion Zones
&lt;/h2&gt;

&lt;p&gt;Not all motion is interesting. I built a polygon zone editor (web UI at &lt;code&gt;/zones&lt;/code&gt;) that lets me draw exclusion and inclusion zones on camera frames — similar to professional NVR software.&lt;/p&gt;

&lt;p&gt;Current exclusion zones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gate camera:&lt;/strong&gt; the road beyond the gate (top portion of frame) — cars passing on the street aren't security events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driveway:&lt;/strong&gt; a steam pipe and stone wall fixture that cause constant false triggers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backyard:&lt;/strong&gt; a kamado BBQ grill and tree branches that sway in wind&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The zones are stored as JSON polygons. At runtime, &lt;code&gt;cv2.fillPoly&lt;/code&gt; builds a binary mask, which is applied to the thresholded diff before contour detection. Masked pixels are zeroed — contours in excluded areas never form.&lt;/p&gt;

&lt;h2&gt;
  
  
  False Positive War Stories
&lt;/h2&gt;

&lt;p&gt;Vision LLMs hallucinate. In security camera analysis, this means phantom alerts. Here are the patterns I found and fixed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The negation problem.&lt;/strong&gt; The VLM would say "No people, vehicles, or animals are visible in the frame" and my classifier would see "people, vehicles, animals" and trigger an alert. Fix: expanded the negation lookback from 25 to 60 characters and added sentence-level negation detection ("if sentence starts with no/not/without AND ends with visible/present/found → CLEAR").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hedge problem.&lt;/strong&gt; The VLM would output both "ALERT" and "CLEAR" in the same response when it was uncertain. Fix: if both keywords appear on the same line, CLEAR wins. It's better to miss an event than to false-alert at 3 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The location confusion.&lt;/strong&gt; "A vehicle on the road beyond the gate" was triggering alerts for the gate camera. But the road isn't my property. Fix: added location-based negation — "beyond the gate", "past the gate", "on the road", "on the street" → CLEAR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shadow/reflection problem.&lt;/strong&gt; "A shadow of a person" would alert. Fix: added "shadow of" as a negation pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The phantom description.&lt;/strong&gt; This was the most insidious. When the VLM received a nearly-black night frame, it would occasionally hallucinate vivid descriptions of people or vehicles. Fix: contour detection at night produces zero contours (no pixel changes in darkness), so the VLM is never called — the contour pre-filter eliminates this class of error entirely.&lt;/p&gt;
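&lt;p&gt;Taken together, the text-side fixes amount to a small rule-based classifier on top of the VLM output. A sketch with illustrative pattern lists (the real ones are longer):&lt;/p&gt;

```python
NEGATION_PREFIXES = ("no ", "not ", "without ")
NEGATION_SUFFIXES = ("visible", "present", "found")
LOCATION_NEGATIONS = ("beyond the gate", "past the gate",
                      "on the road", "on the street", "shadow of")

def classify(vlm_response: str) -> str:
    text = vlm_response.lower()
    for line in text.splitlines():
        # Hedge rule: if the model emits both verdicts, CLEAR wins
        if "alert" in line and "clear" in line:
            return "CLEAR"
    # Location-based negation: motion outside the property is not an event
    for phrase in LOCATION_NEGATIONS:
        if phrase in text:
            return "CLEAR"
    # Sentence-level negation: "No people ... are visible" must not alert
    for sentence in text.replace("\n", " ").split(". "):
        s = sentence.strip().rstrip(".")
        if s.startswith(NEGATION_PREFIXES) and s.endswith(NEGATION_SUFFIXES):
            return "CLEAR"
    return "ALERT" if "alert" in text else "CLEAR"
```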

&lt;h2&gt;
  
  
  Audio Intelligence
&lt;/h2&gt;

&lt;p&gt;The gate camera has a microphone. &lt;code&gt;faster-whisper&lt;/code&gt; (medium.en model) transcribes 15-second audio chunks, but audio alerts are &lt;strong&gt;gated by visual confirmation&lt;/strong&gt; — a speech transcription only fires if there was a visually-confirmed alert within the last 120 seconds. This prevents phantom audio alerts from wind, distant traffic, or radio.&lt;/p&gt;

&lt;p&gt;Urgent keywords (help, emergency, fire) bypass the gate.&lt;/p&gt;

&lt;p&gt;The transcription pipeline: PCM audio from RTSP → WAV → &lt;code&gt;faster-whisper&lt;/code&gt; → filter noise phrases ("thank you for watching", street chatter) → if visually gated → WiiM speaker announcement + WhatsApp message with OGG audio clip.&lt;/p&gt;
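&lt;p&gt;The gate itself is a timestamp comparison plus a keyword bypass. A sketch with assumed names:&lt;/p&gt;

```python
import time

URGENT_KEYWORDS = ("help", "emergency", "fire")
GATE_WINDOW_S = 120

def should_fire_audio_alert(transcript: str, last_visual_alert_ts: float,
                            now=None) -> bool:
    now = time.time() if now is None else now
    text = transcript.lower()
    # Urgent keywords bypass the visual gate entirely
    if any(k in text for k in URGENT_KEYWORDS):
        return True
    # Otherwise require a visually-confirmed alert within the last 120 s
    if now - last_visual_alert_ts > GATE_WINDOW_S:
        return False
    return True
```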

&lt;h2&gt;
  
  
  The Alert Review Tool
&lt;/h2&gt;

&lt;p&gt;50,000 VLM calls per day generates a lot of classification data. I built a daily review tool (&lt;code&gt;/alerts/review&lt;/code&gt;) that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parses the last 24 hours of &lt;code&gt;CONFIRMED ALERT&lt;/code&gt; lines from the log&lt;/li&gt;
&lt;li&gt;Groups by camera + normalized description&lt;/li&gt;
&lt;li&gt;Sends all patterns to &lt;code&gt;qwen3.5:35b&lt;/code&gt; for meta-classification: REAL / FALSE_POSITIVE / NOISE&lt;/li&gt;
&lt;li&gt;Presents a web UI with tabs (Needs Review / AI Flagged / Suppressed / Acknowledged)&lt;/li&gt;
&lt;li&gt;One-click suppress permanently filters a pattern from future alerts&lt;/li&gt;
&lt;/ol&gt;
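&lt;p&gt;Steps 1–2 boil down to a parse-and-count pass over the log. A sketch, assuming a log line format I made up for illustration (the real format isn't shown here):&lt;/p&gt;

```python
import re
from collections import Counter

# Assumed log format (illustrative, not the production format):
# 2026-04-22 03:14:07 [gate] CONFIRMED ALERT: A person walking near the gate.
LINE_RE = re.compile(r"\[(\w+)\] CONFIRMED ALERT: (.+)")

def normalize(desc):
    """Collapse near-duplicate descriptions: lowercase, strip digits and punctuation."""
    desc = re.sub(r"[^a-z ]", " ", desc.lower())
    return " ".join(desc.split())

def group_alerts(log_lines):
    """Count alerts per (camera, normalized description) pattern."""
    groups = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            groups[(m.group(1), normalize(m.group(2)))] += 1
    return groups
```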

&lt;p&gt;The LLM classifier is given context: Calgary's snowy conditions, known permanent features (kamado BBQ, stone wall, gate post lights), and typical neighborhood activity. It correctly flags 80%+ of false positives for one-click suppression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cameras&lt;/td&gt;
&lt;td&gt;3 (4K, RTSP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frame interval&lt;/td&gt;
&lt;td&gt;5 seconds per camera&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VLM calls/day&lt;/td&gt;
&lt;td&gt;~50,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VLM model&lt;/td&gt;
&lt;td&gt;qwen2.5vl:7b-4k (14.5 GB on DGX Spark)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference latency&lt;/td&gt;
&lt;td&gt;~200ms per crop (10GbE link)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positive rate&lt;/td&gt;
&lt;td&gt;&amp;lt;5% after zone exclusions + negation fixes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total system cost&lt;/td&gt;
&lt;td&gt;$0/month (all local hardware)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with contour detection, not VLM.&lt;/strong&gt; I initially sent every frame to the VLM. The 70.8 GB memory leak I found in Ollama (separate blog post) was partly caused by this constant load. Contour pre-filtering reduced VLM calls by 70%+ and made the whole system viable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a smaller VLM for classification, larger for description.&lt;/strong&gt; A 3B model could handle binary ALERT/CLEAR classification. Reserve the 7B model for generating the detailed description that goes into the WhatsApp alert.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Night mode needs a different approach.&lt;/strong&gt; IR cameras produce grayscale footage that confuses vision LLMs trained on color images. Thermal cameras or dedicated night-vision models would work better.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
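&lt;p&gt;The production pre-filter uses OpenCV contour detection; as a dependency-free stand-in, here is the same idea as a frame-difference check. Frames are flat lists of grayscale values, and a frame only goes to the VLM if enough pixels changed (thresholds are illustrative):&lt;/p&gt;

```python
def motion_score(prev_frame, frame, pixel_threshold=25):
    """Count pixels that changed by more than pixel_threshold between two frames.

    Stand-in for the OpenCV contour pre-filter: frames are flat lists of
    grayscale values in [0, 255].
    """
    return sum(
        1 for a, b in zip(prev_frame, frame) if abs(a - b) > pixel_threshold
    )

def should_call_vlm(prev_frame, frame, min_changed_pixels=50):
    # A pitch-black night frame produces zero changed pixels, so the VLM is
    # never called on it. That also eliminates the hallucinated-description
    # failure mode described earlier.
    return motion_score(prev_frame, frame) > min_changed_pixels
```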

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;All of this runs on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mac Studio M2 Ultra (128 GB) — camera capture, OpenCV, audio processing, web UI&lt;/li&gt;
&lt;li&gt;NVIDIA DGX Spark (120 GB) — VLM inference via Ollama&lt;/li&gt;
&lt;li&gt;10GbE direct link between the two machines&lt;/li&gt;
&lt;li&gt;Raspberry Pi — WhatsApp gateway&lt;/li&gt;
&lt;li&gt;WiiM speaker — voice announcements&lt;/li&gt;
&lt;li&gt;Python 3.9, stdlib only — no pip dependencies in any production script&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zero cloud APIs. Zero subscriptions. Full privacy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full pipeline code and zone editor are on &lt;a href="https://github.com/kjaiswal" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt;. If you're running local vision models for home automation, I'd like to hear what models and pre-filters work for you.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>homelab</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Distributed LLM Inference Across NVIDIA Blackwell and Apple Silicon Over 10GbE</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:15:15 +0000</pubDate>
      <link>https://forem.com/ljkunal/distributed-llm-inference-across-nvidia-blackwell-and-apple-silicon-over-10gbe-2feg</link>
      <guid>https://forem.com/ljkunal/distributed-llm-inference-across-nvidia-blackwell-and-apple-silicon-over-10gbe-2feg</guid>
      <description>&lt;p&gt;I connected an NVIDIA DGX Spark to a Mac Studio with a direct 10-gigabit Ethernet cable and split a large language model across both GPUs. Here's what actually happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I have two machines that are excellent at different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA DGX Spark&lt;/strong&gt; (GB10 Blackwell, 120 GB unified memory) — screaming fast tensor cores, CUDA 13&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mac Studio&lt;/strong&gt; (M2 Ultra, 128 GB unified memory) — great Metal GPU, massive memory bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined: &lt;strong&gt;248 GB&lt;/strong&gt; of GPU-accessible memory. Enough to run models that don't fit on either machine alone — 100B+ parameter models at reasonable quantization levels.&lt;/p&gt;

&lt;p&gt;The question: can you actually get useful performance by splitting a model across heterogeneous GPUs over a network link?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Physical Setup
&lt;/h2&gt;

&lt;p&gt;I connected both machines with a direct 10GbE cable — no switch, no router. Just a CAT6A cable between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DGX: Realtek 10GbE NIC (&lt;code&gt;enP7s7&lt;/code&gt;) → &lt;code&gt;192.168.100.2/24&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Mac Studio: 10GbE port (&lt;code&gt;en0&lt;/code&gt;) → &lt;code&gt;192.168.100.1/24&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Measured throughput: &lt;strong&gt;9.41 Gbps&lt;/strong&gt;. Both machines keep WiFi for LAN/internet access — the direct cable is a dedicated inference-only link.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why llama.cpp RPC (and Why Not Exo)
&lt;/h2&gt;

&lt;p&gt;I tried two approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  Exo (MLX Ring) — Failed
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/exo-explore/exo" rel="noopener noreferrer"&gt;Exo&lt;/a&gt; is a distributed inference framework that uses MLX on both Metal and CUDA backends. I got peer discovery working, placed a 128 GB MiniMax M2.5 model across both nodes, but hit a wall: &lt;strong&gt;&lt;code&gt;mx.distributed.init(backend="ring")&lt;/code&gt; hangs indefinitely on the CUDA backend&lt;/strong&gt;. The MLX CUDA ring implementation simply doesn't work yet (as of MLX 0.31.1). Even single-node ring init hangs on DGX.&lt;/p&gt;

&lt;p&gt;I fixed several other bugs along the way (election instability, edge oscillation, model path mismatches, Linux interface detection) and &lt;a href="https://github.com/exo-explore/exo/pull/1809" rel="noopener noreferrer"&gt;submitted a P2P model distribution PR&lt;/a&gt;, but the core distributed inference path is blocked until Apple adds CUDA ring support to MLX.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp RPC — Works
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp's RPC backend&lt;/a&gt; takes a different approach. Instead of requiring the same ML framework on both ends, it exposes a simple RPC server that provides raw compute. The host machine (Mac Studio) runs &lt;code&gt;llama-server&lt;/code&gt;, loads the model, and offloads layers to remote RPC servers (DGX) as needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# DGX — start RPC server&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; /home/kjaiswal/llama.cpp
&lt;span class="nv"&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;build/bin build/bin/rpc-server &lt;span class="nt"&gt;-H&lt;/span&gt; 192.168.100.2 &lt;span class="nt"&gt;-p&lt;/span&gt; 50052

&lt;span class="c"&gt;# Mac Studio — start llama-server with RPC&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; /Users/chimpoo/llama.cpp
build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rpc&lt;/span&gt; 192.168.100.2:50052 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both were built from the same commit (&lt;code&gt;b0f0dd3e5&lt;/code&gt;) with their respective GPU backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mac Studio: &lt;code&gt;GGML_METAL=ON GGML_RPC=ON&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DGX: &lt;code&gt;GGML_CUDA=ON GGML_RPC=ON&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model file only needs to exist on the Mac Studio. llama.cpp automatically splits layers across available compute based on memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Qwen2.5-7B Q4_K_M (4.4 GB) — Fits one machine easily
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Prompt Processing&lt;/th&gt;
&lt;th&gt;Token Generation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local Metal only&lt;/td&gt;
&lt;td&gt;76 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC (Metal + CUDA)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;318 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;53 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen2.5-72B Q4_K_M (44.2 GB) — Fits Mac Studio alone
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Prompt Processing&lt;/th&gt;
&lt;th&gt;Token Generation&lt;/th&gt;
&lt;th&gt;Model Split&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local Metal only&lt;/td&gt;
&lt;td&gt;28 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;44 GB on Metal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC (Metal + CUDA)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6 tok/s&lt;/td&gt;
&lt;td&gt;31 GB Metal + 14 GB CUDA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What the Numbers Mean
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt processing (prefill) benefits from RPC.&lt;/strong&gt; The DGX Blackwell tensor cores accelerate the matrix multiplications needed to process input tokens. For the 7B model, prefill was &lt;strong&gt;4.2x faster&lt;/strong&gt; with RPC. Even the 72B model saw a slight improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token generation (decode) is slower with RPC.&lt;/strong&gt; Each generated token requires a round-trip over the network to synchronize KV cache states. At 10 Gbps, this adds ~0.2ms per layer per token. With 80 layers, that's 16ms of network overhead per token — enough to cut generation speed roughly in half.&lt;/p&gt;
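&lt;p&gt;The back-of-envelope arithmetic above can be written as a tiny estimator. This is a rough model, not a measurement; it assumes the ~0.2 ms per-layer overhead quoted above and ignores everything else the link does:&lt;/p&gt;

```python
def rpc_decode_rate(local_tok_per_s, n_layers, per_layer_rtt_ms=0.2):
    """Rough decode-rate estimate once per-layer network round-trips are added.

    Back-of-envelope only: local decode time per token plus n_layers times
    the assumed per-layer link overhead.
    """
    local_ms_per_tok = 1000.0 / local_tok_per_s
    network_ms_per_tok = n_layers * per_layer_rtt_ms
    return 1000.0 / (local_ms_per_tok + network_ms_per_tok)
```

&lt;p&gt;For an 80-layer model decoding at 11 tok/s locally, this predicts roughly 9 tok/s over RPC; the measured 6 tok/s suggests additional overheads beyond the raw link latency.&lt;/p&gt;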

&lt;p&gt;&lt;strong&gt;For models that fit one machine, local is faster.&lt;/strong&gt; The 72B model runs at 11 tok/s locally vs 6 tok/s over RPC. The network overhead isn't worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real value is models that DON'T fit one machine.&lt;/strong&gt; With 248 GB combined, I can run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MiniMax M2.5 Q4_K_M (138 GB) — 230B parameters, 10B active MoE&lt;/li&gt;
&lt;li&gt;Qwen3-235B Q4_K_M (132 GB) — 235B parameters, 22B active MoE&lt;/li&gt;
&lt;li&gt;DeepSeek-R1 at higher quantization than either machine could handle alone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At Q4 quantization, a 200B+ MoE model should generate at ~4–8 tok/s across both machines. Not fast, but usable for batch processing, code review, and complex reasoning tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct cables beat switches.&lt;/strong&gt; A direct 10GbE link has lower latency and jitter than going through a network switch. For latency-sensitive distributed inference, every microsecond matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prefill and decode have opposite scaling characteristics.&lt;/strong&gt; Prefill is embarrassingly parallel and benefits from more compute. Decode is sequential and bottlenecked by network latency. This suggests a potential disaggregated architecture: use the DGX for prefill, Mac Studio for decode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not every GGUF is interchangeable.&lt;/strong&gt; Ollama GGUFs carry custom metadata that upstream llama.cpp can't read (e.g., &lt;code&gt;rope.dimension_sections&lt;/code&gt; with the wrong array length). Always use HuggingFace community GGUFs (bartowski, etc.) with llama.cpp.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heterogeneous distributed inference works today&lt;/strong&gt; — but only with frameworks that abstract the GPU backend behind a network protocol (like llama.cpp RPC). Frameworks that require the same ML runtime on all nodes (like Exo with MLX) are blocked on backend parity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark MiniMax M2.5 (138 GB) split across both machines — the first model that actually needs distributed inference&lt;/li&gt;
&lt;li&gt;Test disaggregated prefill (DGX) + decode (Mac Studio) once both run the same framework&lt;/li&gt;
&lt;li&gt;Explore vLLM's distributed serving for production workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full setup — including the Exo debugging saga and the 5 bugs I fixed — is documented in my infrastructure notes. Happy to share details if you're working on similar multi-GPU setups.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're working on similar multi-GPU setups, I'd love to hear what's working for you. The full setup notes and Exo bug fixes are on &lt;a href="https://github.com/kjaiswal" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>homelab</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>How Ollama Silently Ate 65GB of My VRAM (And How I Fixed It)</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:28:08 +0000</pubDate>
      <link>https://forem.com/ljkunal/how-ollama-silently-ate-65gb-of-my-vram-and-how-i-fixed-it-22pf</link>
      <guid>https://forem.com/ljkunal/how-ollama-silently-ate-65gb-of-my-vram-and-how-i-fixed-it-22pf</guid>
      <description>&lt;p&gt;I run a vision-language model (&lt;code&gt;qwen2.5vl:7b&lt;/code&gt;) on an NVIDIA DGX Spark for automated camera analysis — three RTSP cameras, one inference call every 5 seconds, 24/7. The model weights are about 6GB. It should use maybe 8-10GB total.&lt;/p&gt;

&lt;p&gt;After a week of running, I checked memory usage: &lt;strong&gt;70.8GB out of 120GB.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's 65GB of VRAM consumed by a 6GB model. Here's what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptom
&lt;/h2&gt;

&lt;p&gt;Everything was working fine. Inference was fast, results were accurate. I only noticed the problem because I wanted to load a second model and got an out-of-memory error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ollama ps
&lt;span class="go"&gt;NAME              SIZE     PROCESSOR
qwen2.5vl:7b     70.8GB   100% GPU
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70.8GB for a 7B model. That's not right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the Cause
&lt;/h2&gt;

&lt;p&gt;The VRAM breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model weights: ~6 GB&lt;/li&gt;
&lt;li&gt;KV cache: &lt;strong&gt;~65 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Overhead: ~0.5 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The KV cache was the problem. But why was it so large?&lt;/p&gt;

&lt;p&gt;Every transformer model has a &lt;strong&gt;context length&lt;/strong&gt; — the maximum number of tokens it can process at once. Ollama pre-allocates a KV cache for the &lt;strong&gt;full declared context length&lt;/strong&gt; when a model first loads. And &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; declares a context length of &lt;strong&gt;131,072 tokens&lt;/strong&gt; (128K) in its GGUF metadata.&lt;/p&gt;

&lt;p&gt;My requests used about 1,000 tokens each. Ollama allocated memory for 131,072.&lt;/p&gt;
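&lt;p&gt;The KV cache grows linearly with context length, which is why the allocation is so dramatic. A sketch of the standard fp16 sizing formula; the dimensions below are illustrative, not qwen2.5vl's exact config (and Ollama adds parallelism and overhead on top), but the linear dependence on context is the point:&lt;/p&gt;

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """fp16 KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative mid-size model with full (non-grouped) attention:
full = kv_cache_bytes(n_layers=28, n_kv_heads=28, head_dim=128, ctx_len=131072)
trimmed = kv_cache_bytes(n_layers=28, n_kv_heads=28, head_dim=128, ctx_len=4096)
print(f"{full / 2**30:.1f} GiB at 128K vs {trimmed / 2**30:.1f} GiB at 4K")
```

&lt;p&gt;Dropping the context from 131,072 to 4,096 tokens cuts this term by a factor of 32, which is the whole fix.&lt;/p&gt;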

&lt;h2&gt;
  
  
  Why Didn't It Shrink?
&lt;/h2&gt;

&lt;p&gt;I tried every obvious fix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What I tried&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;OLLAMA_NUM_CTX=4096&lt;/code&gt; environment variable&lt;/td&gt;
&lt;td&gt;Ignored — doesn't override per-model defaults&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;"num_ctx": 4096&lt;/code&gt; in &lt;code&gt;/api/chat&lt;/code&gt; request body&lt;/td&gt;
&lt;td&gt;Doesn't shrink an already-loaded model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Using &lt;code&gt;/v1/chat/completions&lt;/code&gt; (OpenAI-compatible API)&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;num_ctx&lt;/code&gt; parameter available at all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restarting Ollama&lt;/td&gt;
&lt;td&gt;Works temporarily — but model reloads at 128K on first request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The root cause: Ollama reads the model's context length from the GGUF file and allocates the full KV cache on first load. &lt;strong&gt;There is no way to override this at request time for an already-loaded model.&lt;/strong&gt; And in an automated pipeline where requests come every 5 seconds, the model never unloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;The only reliable solution is to create a &lt;strong&gt;derived model&lt;/strong&gt; with the context size baked into the model definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Save as Modelfile.vision&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; qwen2.5vl:7b&lt;/span&gt;
PARAMETER num_ctx 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama create qwen2.5vl:7b-4k &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile.vision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now use &lt;code&gt;qwen2.5vl:7b-4k&lt;/code&gt; instead of &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; in your API calls.&lt;/p&gt;
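&lt;p&gt;For example, a request body for Ollama's &lt;code&gt;/api/chat&lt;/code&gt; endpoint pointing at the derived model might look like this (a sketch; the &lt;code&gt;images&lt;/code&gt; field is how a vision pipeline would attach a base64-encoded frame):&lt;/p&gt;

```python
import json

def chat_request(prompt, model="qwen2.5vl:7b-4k", image_b64=None):
    """Build an Ollama /api/chat request body targeting the 4K-context model."""
    message = {"role": "user", "content": prompt}
    if image_b64:
        # Vision models accept base64-encoded images on the message.
        message["images"] = [image_b64]
    return json.dumps({"model": model, "messages": [message], "stream": False})
```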

&lt;p&gt;For extra safety, I also set a global default in Ollama's systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/ollama.service
&lt;/span&gt;&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"OLLAMA_NUM_CTX=4096"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches any model that doesn't have an explicit &lt;code&gt;num_ctx&lt;/code&gt; — at least it won't silently balloon to 128K.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total VRAM used&lt;/td&gt;
&lt;td&gt;70.8 GB&lt;/td&gt;
&lt;td&gt;14.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache context&lt;/td&gt;
&lt;td&gt;131,072 tokens&lt;/td&gt;
&lt;td&gt;4,096 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free VRAM&lt;/td&gt;
&lt;td&gt;37 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference speed&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output quality&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;56GB of VRAM recovered&lt;/strong&gt; with zero impact on inference. My requests never used more than ~1K tokens — the other 127K were allocated for nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Is Affected?
&lt;/h2&gt;

&lt;p&gt;This matters if you're running Ollama for &lt;strong&gt;automated workloads&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API servers handling frequent requests (the model stays loaded)&lt;/li&gt;
&lt;li&gt;Chatbots, agents, or monitoring pipelines&lt;/li&gt;
&lt;li&gt;Multiple models on the same GPU&lt;/li&gt;
&lt;li&gt;Any setup where you need predictable memory usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interactive chat sessions are less affected because Ollama unloads models after an idle timeout. But if your requests keep the model hot, the full KV cache lives in VRAM permanently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models to Watch Out For
&lt;/h2&gt;

&lt;p&gt;Many popular models declare 128K context by default:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Default Context&lt;/th&gt;
&lt;th&gt;Approx KV Cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5vl:7b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~65 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:32b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama3.1:70b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral-large&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Check your model's declared context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama show &amp;lt;model&amp;gt; &lt;span class="nt"&gt;--modelfile&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; ctx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:11434/api/show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"qwen2.5vl:7b"}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool | &lt;span class="nb"&gt;grep &lt;/span&gt;context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Rule
&lt;/h2&gt;

&lt;p&gt;For any Ollama model used in automated pipelines:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Always create a derived Modelfile with an explicit &lt;code&gt;num_ctx&lt;/code&gt; matching your actual needs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some guidelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vision/camera analysis: &lt;strong&gt;2K–4K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;Chatbot or agent: &lt;strong&gt;4K–8K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;Document analysis: &lt;strong&gt;8K–16K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;RAG with large context: &lt;strong&gt;16K–32K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Never leave a model at its default 128K context unless you actually need 128K. The KV cache allocation is proportional to context size — halving the context roughly halves the memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Isn't a Bug (But Maybe Should Be)
&lt;/h2&gt;

&lt;p&gt;Ollama's behavior is technically correct — pre-allocating the KV cache avoids the overhead of dynamic resizing during inference. For interactive use, where you might paste a long document or have a deep conversation, having the full context available makes sense.&lt;/p&gt;

&lt;p&gt;But for API workloads, it's a footgun. The mismatch between "model supports 128K context" and "my requests use 1K context" is common, and the memory cost is hidden. You don't see it in &lt;code&gt;nvidia-smi&lt;/code&gt; as a separate allocation — it's all lumped under the model.&lt;/p&gt;

&lt;p&gt;A dynamic or configurable allocation per-request would fix this for API users. Until then, the Modelfile workaround is the best approach.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I documented the full benchmarks and fix in my &lt;a href="https://github.com/kjaiswal/llama-cpp-distributed-benchmarks" rel="noopener noreferrer"&gt;llama-cpp-distributed-benchmarks&lt;/a&gt; repo, which also covers distributed inference across Apple Silicon + NVIDIA Blackwell over 10GbE.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#ollama&lt;/code&gt; &lt;code&gt;#llm&lt;/code&gt; &lt;code&gt;#vram&lt;/code&gt; &lt;code&gt;#inference&lt;/code&gt; &lt;code&gt;#nvidia&lt;/code&gt; &lt;code&gt;#machinelearning&lt;/code&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
