<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kunal Jaiswal</title>
    <description>The latest articles on Forem by Kunal Jaiswal (@ljkunal).</description>
    <link>https://forem.com/ljkunal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852052%2F90b66918-7075-4354-a828-873703958bac.jpeg</url>
      <title>Forem: Kunal Jaiswal</title>
      <link>https://forem.com/ljkunal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ljkunal"/>
    <language>en</language>
    <item>
      <title>The Wrong GUID: How a Single Constant Broke WebSocket in Every Browser But Not Python</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:24:49 +0000</pubDate>
      <link>https://forem.com/ljkunal/the-wrong-guid-how-a-single-constant-broke-websocket-in-every-browser-but-not-python-1ojd</link>
      <guid>https://forem.com/ljkunal/the-wrong-guid-how-a-single-constant-broke-websocket-in-every-browser-but-not-python-1ojd</guid>
      <description>&lt;p&gt;I run a home automation setup with multiple RTSP IP cameras. The camera dashboard shows a grid of all cameras with a &lt;strong&gt;"Stream All"&lt;/strong&gt; button. Each stream is an MJPEG feed served through ffmpeg via a Python HTTP server.&lt;/p&gt;

&lt;p&gt;Click "Stream All" and you'd expect every feed to light up. Instead, 5 or 6 cameras would load and the rest would stay black forever. Refresh, and a &lt;em&gt;different&lt;/em&gt; set of cameras would load.&lt;/p&gt;

&lt;p&gt;The culprit was well-known: &lt;strong&gt;browsers limit ~6 concurrent HTTP/1.1 connections per origin&lt;/strong&gt;. Each MJPEG stream is a long-lived HTTP response that never closes. Camera 7 has to wait for camera 1 to finish — which is never.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: WebSocket Multiplexing
&lt;/h2&gt;

&lt;p&gt;WebSocket connections aren't subject to the six-per-origin HTTP/1.1 limit, and a single connection can multiplex any number of logical streams. The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;StreamManager&lt;/strong&gt; — a shared ffmpeg process pool. One ffmpeg per camera, regardless of how many clients are watching. Frames broadcast to all subscribers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/ws/cameras&lt;/code&gt; endpoint&lt;/strong&gt; — clients subscribe to cameras via JSON commands, receive binary JPEG frames with a camera ID prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side JS&lt;/strong&gt; — single WebSocket connection, Blob URLs for rendering frames to &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; elements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The constraint: &lt;strong&gt;stdlib only&lt;/strong&gt;. No pip dependencies. This is a single-file Python HTTP server running as a macOS LaunchAgent. I implemented the WebSocket protocol (RFC 6455) from scratch — handshake, frame encoding/decoding, ping/pong, binary and text frames.&lt;/p&gt;
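&lt;p&gt;As a reference point, the server-to-client frame encoding from RFC 6455 Section 5.2 fits in a few lines of stdlib Python. A sketch of the encode side only (the function name is mine):&lt;/p&gt;

```python
import struct

def encode_ws_frame(payload: bytes, opcode: int = 0x2) -> bytes:
    """Encode one server-to-client WebSocket frame (RFC 6455 Section 5.2).

    Byte 0: FIN bit set plus opcode (0x1 text, 0x2 binary).
    Byte 1: payload length, with 126/127 escapes for longer payloads.
    Servers MUST NOT mask, so no masking key is included.
    """
    header = bytes([0x80 | opcode])
    n = len(payload)
    if n > 65535:
        header += bytes([127]) + struct.pack(">Q", n)  # 8-byte extended length
    elif n > 125:
        header += bytes([126]) + struct.pack(">H", n)  # 2-byte extended length
    else:
        header += bytes([n])
    return header + payload
```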

&lt;h3&gt;
  
  
  The frame format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Server → Client binary frame:&lt;/span&gt;
&lt;span class="c1"&gt;// [1 byte: camera ID length] [N bytes: camera IP] [JPEG bytes]&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;camId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextDecoder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jpeg&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Render to &amp;lt;img&amp;gt; via Blob URL&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;jpeg&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image/jpeg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createObjectURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
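&lt;p&gt;The server side of this application-level framing is symmetric. A minimal sketch (function names are mine, not the production code):&lt;/p&gt;

```python
def pack_camera_frame(camera_id: str, jpeg: bytes) -> bytes:
    """[1 byte: ID length][N bytes: camera ID][JPEG bytes]"""
    cid = camera_id.encode("ascii")
    if len(cid) > 255:
        raise ValueError("camera ID does not fit a 1-byte length prefix")
    return bytes([len(cid)]) + cid + jpeg

def unpack_camera_frame(buf: bytes) -> tuple:
    """Inverse of pack_camera_frame; mirrors the slice() calls in the JS."""
    id_len = buf[0]
    return buf[1:1 + id_len].decode("ascii"), buf[1 + id_len:]
```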



&lt;h2&gt;
  
  
  Python Says: Works Perfectly
&lt;/h2&gt;

&lt;p&gt;I wrote a Python test client that connects over TLS, subscribes to all cameras, and counts frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Subscribe to ALL cameras over single WebSocket
&lt;/span&gt;&lt;span class="nf"&gt;ws_send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cameras&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;camera_ips&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="c1"&gt;# Result after 12 seconds:
# Camera IP          Frames
# x.x.x.11              19
# x.x.x.13              20
# x.x.x.16              19
# ... (all cameras streaming)
# TOTAL               209 frames from all cameras
# Total data: ~8 MB in 12s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Every camera streaming. ~8 MB over a single WebSocket in 12 seconds.&lt;/strong&gt; The StreamManager, ffmpeg pool, frame extraction, binary WebSocket framing — all working perfectly.&lt;/p&gt;

&lt;p&gt;Time to open it in a browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  Browsers Say: No
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;WebSocket Test
04:46:59.972 Connecting...
04:47:00.148 ERROR: {"isTrusted":true}
04:47:00.148 CLOSED code=1006 reason= clean=false
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code &lt;strong&gt;1006&lt;/strong&gt;. Abnormal closure. No reason. Not clean. The &lt;code&gt;onopen&lt;/code&gt; callback never fires. Both Safari and Brave. Every single time.&lt;/p&gt;

&lt;p&gt;The server logs told a different story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ws] Client connected from x.x.x.x
[ws] Sending 101 (129 bytes)
[ws] Reader error: ConnectionError: WebSocket connection closed
[ws] Client disconnected, was watching 0 cameras
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server &lt;em&gt;sent&lt;/em&gt; the 101 Switching Protocols response. The client &lt;em&gt;connected&lt;/em&gt; at the TCP level. But the browser never acknowledged the upgrade. It just... closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Spiral
&lt;/h2&gt;

&lt;p&gt;What followed was hours of systematically ruling out every possible cause:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1: TLS Certificate&lt;/strong&gt; ❌&lt;br&gt;
Self-signed cert? Generated proper certs with mkcert, installed CA in system keychain. Pages loaded without warnings. WebSocket still failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2: Buffering&lt;/strong&gt; ❌&lt;br&gt;
Maybe &lt;code&gt;wfile&lt;/code&gt; is buffering the 101? Tried &lt;code&gt;handler.wfile.write()&lt;/code&gt; + &lt;code&gt;flush()&lt;/code&gt;, then &lt;code&gt;handler.connection.sendall()&lt;/code&gt;, then &lt;code&gt;handler.request.sendall()&lt;/code&gt;. All sent the bytes. Browser still rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3: HTTP Protocol Version&lt;/strong&gt; ❌&lt;br&gt;
Python's &lt;code&gt;BaseHTTPRequestHandler.protocol_version&lt;/code&gt; defaults to &lt;code&gt;"HTTP/1.0"&lt;/code&gt;. WebSocket requires HTTP/1.1. Set it to &lt;code&gt;"HTTP/1.1"&lt;/code&gt;. Then built the response manually with raw bytes. Still failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 4: ALPN Negotiation&lt;/strong&gt; ❌&lt;br&gt;
Maybe the browser is negotiating HTTP/2 via ALPN? Added &lt;code&gt;ctx.set_alpn_protocols(["http/1.1"])&lt;/code&gt;. No change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 5: Bypass the HTTP handler entirely&lt;/strong&gt; ❌&lt;br&gt;
Built a standalone raw WebSocket server on a separate port — pure socket, no &lt;code&gt;BaseHTTPRequestHandler&lt;/code&gt;. Read the HTTP request manually, send 101 manually. &lt;strong&gt;Still failed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 6: Mixed content / port issues&lt;/strong&gt; ❌&lt;br&gt;
Tried &lt;code&gt;ws://&lt;/code&gt; on the HTTP port (Brave auto-upgraded to HTTPS). Tried serving from HTTP. Tried different ports. Nothing.&lt;/p&gt;

&lt;p&gt;At this point I had verified the 101 response byte-by-byte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Captured via Python, raw bytes from the server:
&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HTTP/1.1 101 Switching Protocols&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Upgrade: websocket&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Connection: Upgrade&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sec-WebSocket-Accept: MuIAfeA8S6DsJZLE/8a3flJsJzM=&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# 129 bytes. Correct CRLF. Correct headers. Correct format.
# Python clients: works. Browsers: 1006.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The response was &lt;strong&gt;byte-for-byte correct&lt;/strong&gt;. Correct HTTP version. Correct headers. Correct line endings. Correct empty line. And yet every browser on Earth rejected it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Dad Steps In
&lt;/h2&gt;

&lt;p&gt;I shared the full debugging context with my dad — who happens to run a swarm of AI agents for exactly this kind of problem. His analysis was surgical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sec-WebSocket-Accept is not byte-perfect.&lt;/strong&gt; The computation must use the exact GUID from RFC 6455. Even if the format looks right, if the GUID constant is wrong, the Accept value will be wrong. Python clients don't validate the Accept header. Browsers do. This is the #1 hidden killer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I looked at my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- _WS_MAGIC = b"258EAFA5-E914-47DA-95CA-5AB5F43F86A2"
&lt;/span&gt;&lt;span class="gi"&gt;+ _WS_MAGIC = b"258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic GUID. The one constant that every WebSocket implementation on the planet must agree on. &lt;strong&gt;I had it wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not slightly wrong. Not a typo in one character. The entire last segment was different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5AB5F43F86A2  ← what I had (WRONG)
C5AB0DC85B11  ← what RFC 6455 specifies (CORRECT)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Python Didn't Care
&lt;/h2&gt;

&lt;p&gt;The WebSocket handshake works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client sends &lt;code&gt;Sec-WebSocket-Key: &amp;lt;random base64&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Server concatenates it with the magic GUID&lt;/li&gt;
&lt;li&gt;Server SHA-1 hashes the result, base64 encodes it&lt;/li&gt;
&lt;li&gt;Server sends back &lt;code&gt;Sec-WebSocket-Accept: &amp;lt;hash&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Client verifies the Accept value matches what it expects&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
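&lt;p&gt;Steps 2–4 are three lines of stdlib Python. The sample key and expected Accept value below are the ones published in RFC 6455 itself:&lt;/p&gt;

```python
import base64
import hashlib

WS_MAGIC = b"258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # the RFC 6455 GUID

def ws_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a client's key."""
    digest = hashlib.sha1(sec_websocket_key.encode("ascii") + WS_MAGIC).digest()
    return base64.b64encode(digest).decode("ascii")

# Sample handshake from RFC 6455 Section 1.3:
assert ws_accept("dGhlIHNhbXBsZSBub25jZQ==") == "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```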

&lt;p&gt;Step 5 is where the divergence happens. My Python test client — like many WebSocket libraries — &lt;strong&gt;skips the Accept validation&lt;/strong&gt;. It sees "101 Switching Protocols" and proceeds. The Accept header is there, but nothing checks it.&lt;/p&gt;

&lt;p&gt;Browsers check it. Strictly. Silently. If it doesn't match, they close the TCP connection without sending a close frame — which is why you get code 1006 ("abnormal closure") with no reason string. The browser doesn't even tell you &lt;em&gt;what&lt;/em&gt; was wrong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt; One line. One constant. Changed the GUID to the correct RFC 6455 value. All cameras streaming in every browser instantly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Browsers are strict. Clients are lenient.
&lt;/h3&gt;

&lt;p&gt;Don't assume your test client validates what a browser validates. The &lt;code&gt;Sec-WebSocket-Accept&lt;/code&gt; header exists specifically so the client can verify the server understood the WebSocket protocol. Python clients being lenient masked a fatal bug for hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Error code 1006 is useless
&lt;/h3&gt;

&lt;p&gt;1006 means "I closed the connection abnormally." It doesn't say why. It could be a network error, a TLS issue, a protocol violation, or a wrong Accept header. Browser DevTools don't show the specific validation failure. This is a spec decision — 1006 is never sent over the wire; it's generated locally — but it makes debugging nearly impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Magic constants are the worst kind of bug
&lt;/h3&gt;

&lt;p&gt;The GUID &lt;code&gt;258EAFA5-E914-47DA-95CA-C5AB0DC85B11&lt;/code&gt; is an arbitrary string chosen by the RFC authors. It has no structure, no checksum, no way to validate it in isolation. If you copy it wrong, everything looks correct until a strict client rejects it. Use a well-tested library, or copy the constant from the actual RFC text — not from memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Test with the real consumer
&lt;/h3&gt;

&lt;p&gt;My Python test proved the architecture worked: shared ffmpeg pool, subscriber queues, frame broadcast, binary WebSocket framing. All solid. But I should have tested with a browser from minute one. The gap between "works in Python" and "works in a browser" was exactly one wrong constant.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The architecture was right
&lt;/h3&gt;

&lt;p&gt;Despite the handshake bug, the design held up perfectly once the GUID was fixed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All cameras, 1 WebSocket, ~2 fps each&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared ffmpeg pool&lt;/strong&gt; — one process per camera regardless of viewer count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-second grace period&lt;/strong&gt; on unsubscribe — handles page refresh without killing ffmpeg&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-restart&lt;/strong&gt; on ffmpeg crash (max 3 in 30s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MJPEG fallback&lt;/strong&gt; for clients that can't do WebSocket&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero pip dependencies&lt;/strong&gt; — pure stdlib Python&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;For anyone building something similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Server: Python stdlib HTTP server with manual WebSocket
# Camera: RTSP → ffmpeg → MJPEG frames
# Transport: WebSocket binary frames (1-byte ID prefix + JPEG)
# Client: Blob URLs → &amp;lt;img&amp;gt; elements
# Infra: macOS LaunchAgent, mkcert TLS
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StreamManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# One ffmpeg per camera, broadcast to N subscribers
&lt;/span&gt;    &lt;span class="c1"&gt;# subscribe(cam_id, queue) → start ffmpeg if first
&lt;/span&gt;    &lt;span class="c1"&gt;# unsubscribe(cam_id, queue) → stop ffmpeg if last (after 10s grace)
&lt;/span&gt;    &lt;span class="c1"&gt;# Reader thread: extract JPEGs from ffmpeg stdout
&lt;/span&gt;    &lt;span class="c1"&gt;# Queue per client, maxsize=100, drop on full
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CameraWS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Client-side JavaScript
&lt;/span&gt;    &lt;span class="c1"&gt;# Single WebSocket to /ws/cameras
&lt;/span&gt;    &lt;span class="c1"&gt;# subscribe([ips]) / unsubscribe([ips])
&lt;/span&gt;    &lt;span class="c1"&gt;# onmessage → Blob URL → img.src
&lt;/span&gt;    &lt;span class="c1"&gt;# Auto-reconnect, 3-fail MJPEG fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The WebSocket approach is the correct way to beat the browser connection limit for multi-camera MJPEG streaming. Just make sure you copy the GUID correctly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For the record, the WebSocket magic GUID from RFC 6455 Section 4.2.2 is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;258EAFA5-E914-47DA-95CA-C5AB0DC85B11&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Commit it to memory. Or better yet, don't — use a library.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>websocket</category>
      <category>python</category>
      <category>debugging</category>
      <category>homeautomation</category>
    </item>
    <item>
      <title>Building a Real-Time Security Camera System with Local Vision LLMs</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:19:05 +0000</pubDate>
      <link>https://forem.com/ljkunal/building-a-real-time-security-camera-system-with-local-vision-llms-2kgj</link>
      <guid>https://forem.com/ljkunal/building-a-real-time-security-camera-system-with-local-vision-llms-2kgj</guid>
      <description>&lt;p&gt;I replaced my Lorex NVR's motion detection — which alerted me 40 times a day about swaying trees and shadows — with a pipeline that uses a vision language model to understand what it's actually seeing. It runs entirely on local hardware, costs nothing after setup, and sends me a WhatsApp message only when something real happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3× Lorex 4K cameras (RTSP)
    ↓
gate_monitor.py (Mac Studio, M2 Ultra)
    ├── OpenCV: frame capture every 5s per camera
    ├── OpenCV: contour-based motion detection (frame N vs N-1)
    ├── Crop: extract largest changed region
    ├── VLM: qwen2.5vl:7b on DGX Spark (Blackwell, 10GbE link)
    │   └── "Classify this crop: ALERT or CLEAR?"
    ├── Alert: annotate frame with contour boxes
    │   ├── WiiM speaker announcement (TTS)
    │   └── WhatsApp message with image
    └── Audio: faster-whisper transcription (gate camera only)
        └── Gated by visual confirmation (120s window)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three cameras — front gate, backyard, driveway — each running in parallel threads. The system processes about &lt;strong&gt;50,000 VLM inference calls per day&lt;/strong&gt; and has been running 24/7 for weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use YOLO?
&lt;/h2&gt;

&lt;p&gt;Traditional object detection (YOLO, SSD) tells you &lt;em&gt;what&lt;/em&gt; is in a frame. A vision language model tells you &lt;em&gt;what's happening&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;My gate camera watches a residential street. YOLO would detect "person" for the mail carrier, the neighbor walking their dog, someone cutting through to the next street, and an actual trespasser — all equally. A VLM can distinguish:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"A delivery driver placing a package at the door" → alert&lt;/li&gt;
&lt;li&gt;"A person walking on the public sidewalk beyond the gate" → not relevant&lt;/li&gt;
&lt;li&gt;"The shadow of a tree branch moving across the driveway" → clear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: I don't need the VLM to be fast (it runs at ~15 tok/s). I need it to be smart. By using OpenCV contour detection as a fast pre-filter, the VLM only sees cropped regions where something actually changed — typically 2–5 calls per camera per minute instead of 12.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contour Detection Layer
&lt;/h2&gt;

&lt;p&gt;Before any AI touches a frame, OpenCV does the heavy lifting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture frame, convert to grayscale, resize to 640px width&lt;/li&gt;
&lt;li&gt;Compute absolute difference against previous analyzed frame&lt;/li&gt;
&lt;li&gt;Apply binary threshold (25) and dilation (3 iterations) to merge nearby changes&lt;/li&gt;
&lt;li&gt;Find contours, filter by area (min 150px², max 40% of frame)&lt;/li&gt;
&lt;li&gt;Merge nearby bounding boxes (within 50px)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If no contours survive filtering: &lt;strong&gt;CLEAR&lt;/strong&gt; — zero VLM calls. This happens 70%+ of the time (still frames, minor lighting shifts).&lt;/p&gt;

&lt;p&gt;If contours are found: crop the largest region, send to VLM for classification.&lt;/p&gt;

&lt;p&gt;Every 60 seconds, a fallback full-frame check catches anything that appeared between frames but hasn't moved (a parked car that wasn't there before, a person standing still).&lt;/p&gt;
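&lt;p&gt;The diff-and-threshold core of steps 2–4 is easy to sketch without OpenCV. The real pipeline uses &lt;code&gt;cv2.absdiff&lt;/code&gt;, &lt;code&gt;cv2.threshold&lt;/code&gt;, dilation, and contour merging; this dependency-free stand-in just finds the bounding box of changed pixels:&lt;/p&gt;

```python
def motion_bbox(prev, curr, thresh=25):
    """Bounding box (x, y, w, h) of pixels whose grayscale value changed
    by more than `thresh`, or None if nothing changed.

    prev/curr are 2-D lists of ints: a simplified stand-in for the
    absdiff -> threshold -> boundingRect steps (no dilation, no area filter).
    """
    xs, ys = [], []
    for y, (prow, crow) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(prow, crow)):
            if abs(p - c) > thresh:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
```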

&lt;h2&gt;
  
  
  Exclusion Zones
&lt;/h2&gt;

&lt;p&gt;Not all motion is interesting. I built a polygon zone editor (web UI at &lt;code&gt;/zones&lt;/code&gt;) that lets me draw exclusion and inclusion zones on camera frames — similar to professional NVR software.&lt;/p&gt;

&lt;p&gt;Current exclusion zones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gate camera:&lt;/strong&gt; the road beyond the gate (top portion of frame) — cars passing on the street aren't security events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driveway:&lt;/strong&gt; a steam pipe and stone wall fixture that cause constant false triggers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backyard:&lt;/strong&gt; a kamado BBQ grill and tree branches that sway in wind&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The zones are stored as JSON polygons. At runtime, &lt;code&gt;cv2.fillPoly&lt;/code&gt; builds a binary mask, which is applied to the thresholded diff before contour detection. Masked pixels are zeroed — contours in excluded areas never form.&lt;/p&gt;
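&lt;p&gt;For anyone implementing this without OpenCV: the mask reduces to a point-in-polygon test. A ray-casting sketch (the production code uses &lt;code&gt;cv2.fillPoly&lt;/code&gt; to rasterize the whole frame mask at once):&lt;/p&gt;

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: cast a ray in the +x direction and count edge
    crossings. `poly` is a list of (x, y) vertices; odd crossings = inside."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside
```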

&lt;h2&gt;
  
  
  False Positive War Stories
&lt;/h2&gt;

&lt;p&gt;Vision LLMs hallucinate. In security camera analysis, this means phantom alerts. Here are the patterns I found and fixed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The negation problem.&lt;/strong&gt; The VLM would say "No people, vehicles, or animals are visible in the frame" and my classifier would see "people, vehicles, animals" and trigger an alert. Fix: expanded the negation lookback from 25 to 60 characters and added sentence-level negation detection ("if sentence starts with no/not/without AND ends with visible/present/found → CLEAR").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hedge problem.&lt;/strong&gt; The VLM would output both "ALERT" and "CLEAR" in the same response when it was uncertain. Fix: if both keywords appear on the same line, CLEAR wins. It's better to miss an event than to false-alert at 3 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The location confusion.&lt;/strong&gt; "A vehicle on the road beyond the gate" was triggering alerts for the gate camera. But the road isn't my property. Fix: added location-based negation — "beyond the gate", "past the gate", "on the road", "on the street" → CLEAR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shadow/reflection problem.&lt;/strong&gt; "A shadow of a person" would alert. Fix: added "shadow of" as a negation pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The phantom description.&lt;/strong&gt; This was the most insidious. When the VLM received a nearly-black night frame, it would occasionally hallucinate vivid descriptions of people or vehicles. Fix: contour detection at night produces zero contours (no pixel changes in darkness), so the VLM is never called — the contour pre-filter eliminates this class of error entirely.&lt;/p&gt;
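&lt;p&gt;Taken together, those fixes amount to a small post-filter over the VLM's text. A sketch (the pattern lists here are illustrative, not the full production set):&lt;/p&gt;

```python
import re

NEGATION = re.compile(r"(no|not|without)\b")
LOCATION_CLEAR = ("beyond the gate", "past the gate", "on the road", "on the street")
NEGATED_ENDINGS = ("visible", "present", "found")

def classify(vlm_response: str) -> str:
    """Map a free-text VLM response to ALERT or CLEAR, biased toward CLEAR."""
    low = vlm_response.lower()
    # Hedge rule: if a line contains both keywords, CLEAR wins
    for line in low.splitlines():
        if "alert" in line and "clear" in line:
            return "CLEAR"
    # Location and shadow rules
    if any(loc in low for loc in LOCATION_CLEAR) or "shadow of" in low:
        return "CLEAR"
    # Sentence-level negation: "No people ... are visible."
    for sentence in re.split(r"[.!?]", low):
        s = sentence.strip()
        if NEGATION.match(s) and s.endswith(NEGATED_ENDINGS):
            return "CLEAR"
    return "ALERT" if "alert" in low else "CLEAR"
```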

&lt;h2&gt;
  
  
  Audio Intelligence
&lt;/h2&gt;

&lt;p&gt;The gate camera has a microphone. &lt;code&gt;faster-whisper&lt;/code&gt; (medium.en model) transcribes 15-second audio chunks, but audio alerts are &lt;strong&gt;gated by visual confirmation&lt;/strong&gt; — a speech transcription only fires if there was a visually-confirmed alert within the last 120 seconds. This prevents phantom audio alerts from wind, distant traffic, or radio.&lt;/p&gt;

&lt;p&gt;Urgent keywords (help, emergency, fire) bypass the gate.&lt;/p&gt;

&lt;p&gt;The transcription pipeline: PCM audio from RTSP → WAV → &lt;code&gt;faster-whisper&lt;/code&gt; → filter noise phrases ("thank you for watching", street chatter) → if visually gated → WiiM speaker announcement + WhatsApp message with OGG audio clip.&lt;/p&gt;
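&lt;p&gt;The gating logic itself is tiny. A sketch of the 120-second visual-confirmation window (class and method names are mine):&lt;/p&gt;

```python
import time

URGENT_WORDS = ("help", "emergency", "fire")

class AudioGate:
    """Pass audio alerts only within `window` seconds of a visual alert.
    Urgent keywords bypass the gate entirely."""

    def __init__(self, window=120.0):
        self.window = window
        self.last_visual = None

    def visual_alert(self, now=None):
        """Record the time of a visually-confirmed alert."""
        self.last_visual = time.monotonic() if now is None else now

    def allow(self, transcript, now=None):
        """Should this transcription fire an alert?"""
        now = time.monotonic() if now is None else now
        if any(word in transcript.lower() for word in URGENT_WORDS):
            return True
        if self.last_visual is None:
            return False
        return self.window >= now - self.last_visual  # still inside the window
```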

&lt;h2&gt;
  
  
  The Alert Review Tool
&lt;/h2&gt;

&lt;p&gt;50,000 VLM calls per day generates a lot of classification data. I built a daily review tool (&lt;code&gt;/alerts/review&lt;/code&gt;) that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parses the last 24 hours of &lt;code&gt;CONFIRMED ALERT&lt;/code&gt; lines from the log&lt;/li&gt;
&lt;li&gt;Groups by camera + normalized description&lt;/li&gt;
&lt;li&gt;Sends all patterns to &lt;code&gt;qwen3.5:35b&lt;/code&gt; for meta-classification: REAL / FALSE_POSITIVE / NOISE&lt;/li&gt;
&lt;li&gt;Presents a web UI with tabs (Needs Review / AI Flagged / Suppressed / Acknowledged)&lt;/li&gt;
&lt;li&gt;One-click suppress permanently filters a pattern from future alerts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM classification is given context: Calgary snowy conditions, known permanent features (kamado BBQ, stone wall, gate post lights), typical neighborhood activity. It correctly flags 80%+ of false positives for one-click suppression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cameras&lt;/td&gt;
&lt;td&gt;3 (4K, RTSP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frame interval&lt;/td&gt;
&lt;td&gt;5 seconds per camera&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VLM calls/day&lt;/td&gt;
&lt;td&gt;~50,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VLM model&lt;/td&gt;
&lt;td&gt;qwen2.5vl:7b-4k (14.5 GB on DGX Spark)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference latency&lt;/td&gt;
&lt;td&gt;~200ms per crop (10GbE link)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positive rate&lt;/td&gt;
&lt;td&gt;&amp;lt;5% after zone exclusions + negation fixes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total system cost&lt;/td&gt;
&lt;td&gt;$0/month (all local hardware)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with contour detection, not VLM.&lt;/strong&gt; I initially sent every frame to the VLM. The 70.8 GB memory leak I found in Ollama (separate blog post) was partly caused by this constant load. Contour pre-filtering reduced VLM calls by 70%+ and made the whole system viable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a smaller VLM for classification, larger for description.&lt;/strong&gt; A 3B model could handle binary ALERT/CLEAR classification. Reserve the 7B model for generating the detailed description that goes into the WhatsApp alert.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Night mode needs a different approach.&lt;/strong&gt; IR cameras produce grayscale footage that confuses vision LLMs trained on color images. Thermal cameras or dedicated night-vision models would work better.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;All of this runs on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mac Studio M2 Ultra (128 GB) — camera capture, OpenCV, audio processing, web UI&lt;/li&gt;
&lt;li&gt;NVIDIA DGX Spark (120 GB) — VLM inference via Ollama&lt;/li&gt;
&lt;li&gt;10GbE direct link between the two machines&lt;/li&gt;
&lt;li&gt;Raspberry Pi — WhatsApp gateway&lt;/li&gt;
&lt;li&gt;WiiM speaker — voice announcements&lt;/li&gt;
&lt;li&gt;Python 3.9 — stdlib plus OpenCV and faster-whisper; no other pip dependencies in any production script&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zero cloud APIs. Zero subscriptions. Full privacy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full pipeline code and zone editor are on &lt;a href="https://github.com/kjaiswal" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt;. If you're running local vision models for home automation, I'd like to hear what models and pre-filters work for you.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>homelab</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Distributed LLM Inference Across NVIDIA Blackwell and Apple Silicon Over 10GbE</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:15:15 +0000</pubDate>
      <link>https://forem.com/ljkunal/distributed-llm-inference-across-nvidia-blackwell-and-apple-silicon-over-10gbe-2feg</link>
      <guid>https://forem.com/ljkunal/distributed-llm-inference-across-nvidia-blackwell-and-apple-silicon-over-10gbe-2feg</guid>
      <description>&lt;p&gt;I connected an NVIDIA DGX Spark to a Mac Studio with a direct 10-gigabit Ethernet cable and split a large language model across both GPUs. Here's what actually happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I have two machines that are excellent at different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA DGX Spark&lt;/strong&gt; (GB10 Blackwell, 120 GB unified memory) — screaming fast tensor cores, CUDA 13&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mac Studio&lt;/strong&gt; (M2 Ultra, 128 GB unified memory) — great Metal GPU, massive memory bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined: &lt;strong&gt;248 GB&lt;/strong&gt; of GPU-accessible memory. Enough to run models that don't fit on either machine alone — 100B+ parameter models at reasonable quantization levels.&lt;/p&gt;
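&lt;p&gt;As a rough sizing rule, a Q4_K_M GGUF works out to about 4.8–5 bits per weight (my approximation; K-quants mix 4- and 6-bit blocks, so real files vary by a few percent):&lt;/p&gt;

```python
def gguf_size_gb(n_params_billion, bits_per_weight=4.85):
    """Approximate GGUF file size in GB for a Q4_K_M quant.

    4.85 bits/weight is a rough average across the mixed tensor types.
    """
    return n_params_billion * bits_per_weight / 8

print(round(gguf_size_gb(230)))  # 139: close to MiniMax M2.5's 138 GB
print(round(gguf_size_gb(72)))   # 44: matches Qwen2.5-72B's 44.2 GB
```

&lt;p&gt;So 248 GB of pooled memory puts 200B+ parameter models in reach at Q4, with room left for the KV cache.&lt;/p&gt;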

&lt;p&gt;The question: can you actually get useful performance by splitting a model across heterogeneous GPUs over a network link?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Physical Setup
&lt;/h2&gt;

&lt;p&gt;I connected both machines with a direct 10GbE cable — no switch, no router. Just a CAT6A cable between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DGX: Realtek 10GbE NIC (&lt;code&gt;enP7s7&lt;/code&gt;) → &lt;code&gt;192.168.100.2/24&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Mac Studio: 10GbE port (&lt;code&gt;en0&lt;/code&gt;) → &lt;code&gt;192.168.100.1/24&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Measured throughput: &lt;strong&gt;9.41 Gbps&lt;/strong&gt;. Both machines keep WiFi for LAN/internet access — the direct cable is a dedicated inference-only link.&lt;/p&gt;
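&lt;p&gt;For reference, a sketch of the link configuration (interface name and addresses are the ones above; adjust for your hardware). iperf3 is how I'd verify a figure like the 9.41 Gbps:&lt;/p&gt;

```shell
# DGX (Linux) side: static address on the 10GbE NIC.
# enP7s7 is the interface name from above; check `ip link` on your box.
sudo ip addr add 192.168.100.2/24 dev enP7s7
sudo ip link set enP7s7 up

# Mac Studio side: set 192.168.100.1/24 manually on the 10GbE port
# (System Settings → Network, or `networksetup -setmanual`).

# Verify raw throughput: run `iperf3 -s` on the DGX, then from the Mac:
iperf3 -c 192.168.100.2
```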

&lt;h2&gt;
  
  
  Why llama.cpp RPC (and Why Not Exo)
&lt;/h2&gt;

&lt;p&gt;I tried two approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  Exo (MLX Ring) — Failed
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/exo-explore/exo" rel="noopener noreferrer"&gt;Exo&lt;/a&gt; is a distributed inference framework that uses MLX on both Metal and CUDA backends. I got peer discovery working and placed the 138 GB MiniMax M2.5 model across both nodes, but hit a wall: &lt;strong&gt;&lt;code&gt;mx.distributed.init(backend="ring")&lt;/code&gt; hangs indefinitely on the CUDA backend&lt;/strong&gt;. The MLX CUDA ring implementation simply doesn't work yet (as of MLX 0.31.1). Even single-node ring init hangs on DGX.&lt;/p&gt;

&lt;p&gt;I fixed several other bugs along the way (election instability, edge oscillation, model path mismatches, Linux interface detection) and &lt;a href="https://github.com/exo-explore/exo/pull/1809" rel="noopener noreferrer"&gt;submitted a P2P model distribution PR&lt;/a&gt;, but the core distributed inference path is blocked until Apple adds CUDA ring support to MLX.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp RPC — Works
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp's RPC backend&lt;/a&gt; takes a different approach. Instead of requiring the same ML framework on both ends, it exposes a simple RPC server that provides raw compute. The host machine (Mac Studio) runs &lt;code&gt;llama-server&lt;/code&gt;, loads the model, and offloads layers to remote RPC servers (DGX) as needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# DGX — start RPC server&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; /home/kjaiswal/llama.cpp
&lt;span class="nv"&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;build/bin build/bin/rpc-server &lt;span class="nt"&gt;-H&lt;/span&gt; 192.168.100.2 &lt;span class="nt"&gt;-p&lt;/span&gt; 50052

&lt;span class="c"&gt;# Mac Studio — start llama-server with RPC&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; /Users/chimpoo/llama.cpp
build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rpc&lt;/span&gt; 192.168.100.2:50052 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both were built from the same commit (&lt;code&gt;b0f0dd3e5&lt;/code&gt;) with their respective GPU backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mac Studio: &lt;code&gt;GGML_METAL=ON GGML_RPC=ON&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DGX: &lt;code&gt;GGML_CUDA=ON GGML_RPC=ON&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model file only needs to exist on the Mac Studio. llama.cpp automatically splits layers across available compute based on memory.&lt;/p&gt;
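&lt;p&gt;If the automatic split isn't what you want, llama.cpp's &lt;code&gt;--tensor-split&lt;/code&gt; flag pins the proportions manually; a sketch (the 2:1 ratio is illustrative):&lt;/p&gt;

```shell
# --tensor-split takes proportions per device (ratios, not gigabytes).
# Check llama-server's startup log to confirm which device is listed first.
build/bin/llama-server \
  -m /path/to/model.gguf \
  --rpc 192.168.100.2:50052 \
  --tensor-split 2,1 \
  -ngl 99 \
  --host 0.0.0.0 --port 9999 \
  -c 4096
```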

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Qwen2.5-7B Q4_K_M (4.4 GB) — Fits one machine easily
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Prompt Processing&lt;/th&gt;
&lt;th&gt;Token Generation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local Metal only&lt;/td&gt;
&lt;td&gt;76 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC (Metal + CUDA)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;318 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;53 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen2.5-72B Q4_K_M (44.2 GB) — Fits Mac Studio alone
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Prompt Processing&lt;/th&gt;
&lt;th&gt;Token Generation&lt;/th&gt;
&lt;th&gt;Model Split&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local Metal only&lt;/td&gt;
&lt;td&gt;28 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;44 GB on Metal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC (Metal + CUDA)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6 tok/s&lt;/td&gt;
&lt;td&gt;31 GB Metal + 14 GB CUDA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What the Numbers Mean
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt processing (prefill) benefits from RPC.&lt;/strong&gt; The DGX Blackwell tensor cores accelerate the matrix multiplications needed to process input tokens. For the 7B model, prefill was &lt;strong&gt;4.2x faster&lt;/strong&gt; with RPC. Even the 72B model saw a slight improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token generation (decode) is slower with RPC.&lt;/strong&gt; Each generated token requires a round-trip over the network to synchronize KV cache states. At 10 Gbps, this adds ~0.2ms per layer per token, or 16ms across 80 layers; together with serialization and synchronization overhead, that's enough to roughly halve generation speed.&lt;/p&gt;
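&lt;p&gt;A quick back-of-envelope check of that wire-latency model (throughput numbers are from the benchmarks above; the per-layer round-trip cost is my estimate):&lt;/p&gt;

```python
# Illustrative model of RPC decode overhead for the 72B benchmark.
local_tok_s = 11.0               # measured local decode rate
layers = 80
per_layer_rtt_ms = 0.2           # rough 10GbE round-trip cost per layer

base_ms = 1000 / local_tok_s             # ~91 ms per token locally
overhead_ms = layers * per_layer_rtt_ms  # 16 ms of wire time per token
rpc_tok_s = 1000 / (base_ms + overhead_ms)
print(round(rpc_tok_s, 1))               # 9.4
```

&lt;p&gt;Wire latency alone would leave ~9.4 tok/s; the measured 6 tok/s means per-token serialization and synchronization cost more than the raw round-trips.&lt;/p&gt;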

&lt;p&gt;&lt;strong&gt;For models that fit one machine, local is faster.&lt;/strong&gt; The 72B model runs at 11 tok/s locally vs 6 tok/s over RPC. The network overhead isn't worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real value is models that DON'T fit one machine.&lt;/strong&gt; With 248 GB combined, I can run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MiniMax M2.5 Q4_K_M (138 GB) — 230B parameters, 10B active MoE&lt;/li&gt;
&lt;li&gt;Qwen3-235B Q4_K_M (132 GB) — 235B parameters, 22B active MoE&lt;/li&gt;
&lt;li&gt;DeepSeek-R1 at higher quantization than either machine could handle alone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At Q4 quantization, a 200B+ MoE model should generate at ~4–8 tok/s across both machines. Not fast, but usable for batch processing, code review, and complex reasoning tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct cables beat switches.&lt;/strong&gt; A direct 10GbE link has lower latency and jitter than going through a network switch. For latency-sensitive distributed inference, every microsecond matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prefill and decode have opposite scaling characteristics.&lt;/strong&gt; Prefill is embarrassingly parallel and benefits from more compute. Decode is sequential and bottlenecked by network latency. This suggests a potential disaggregated architecture: use the DGX for prefill, Mac Studio for decode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GGUF is universal in theory, not always in practice.&lt;/strong&gt; Ollama GGUFs can carry custom metadata that upstream llama.cpp can't read (e.g., &lt;code&gt;rope.dimension_sections&lt;/code&gt; with the wrong array length). Stick to HuggingFace community GGUFs (bartowski, etc.) for llama.cpp.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heterogeneous distributed inference works today&lt;/strong&gt; — but only with frameworks that abstract the GPU backend behind a network protocol (like llama.cpp RPC). Frameworks that require the same ML runtime on all nodes (like Exo with MLX) are blocked on backend parity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark MiniMax M2.5 (138 GB) split across both machines — the first model that actually needs distributed inference&lt;/li&gt;
&lt;li&gt;Test disaggregated prefill (DGX) + decode (Mac Studio) once both run the same framework&lt;/li&gt;
&lt;li&gt;Explore vLLM's distributed serving for production workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full setup — including the Exo debugging saga and the 5 bugs I fixed — is documented in my infrastructure notes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're working on similar multi-GPU setups, I'd love to hear what's working for you. The full setup notes and Exo bug fixes are on &lt;a href="https://github.com/kjaiswal" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>homelab</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>How Ollama Silently Ate 65GB of My VRAM (And How I Fixed It)</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:28:08 +0000</pubDate>
      <link>https://forem.com/ljkunal/how-ollama-silently-ate-65gb-of-my-vram-and-how-i-fixed-it-22pf</link>
      <guid>https://forem.com/ljkunal/how-ollama-silently-ate-65gb-of-my-vram-and-how-i-fixed-it-22pf</guid>
      <description>&lt;p&gt;I run a vision-language model (&lt;code&gt;qwen2.5vl:7b&lt;/code&gt;) on an NVIDIA DGX Spark for automated camera analysis — three RTSP cameras, one inference call every 5 seconds, 24/7. The model weights are about 6GB. It should use maybe 8-10GB total.&lt;/p&gt;

&lt;p&gt;After a week of running, I checked memory usage: &lt;strong&gt;70.8GB out of 120GB.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's 65GB of VRAM consumed by a 6GB model. Here's what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptom
&lt;/h2&gt;

&lt;p&gt;Everything was working fine. Inference was fast, results were accurate. I only noticed the problem because I wanted to load a second model and got an out-of-memory error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ollama ps
&lt;span class="go"&gt;NAME              SIZE     PROCESSOR
qwen2.5vl:7b     70.8GB   100% GPU
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70.8GB for a 7B model. That's not right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the Cause
&lt;/h2&gt;

&lt;p&gt;The VRAM breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model weights: ~6 GB&lt;/li&gt;
&lt;li&gt;KV cache: &lt;strong&gt;~65 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Overhead: ~0.5 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The KV cache was the problem. But why was it so large?&lt;/p&gt;

&lt;p&gt;Every transformer model has a &lt;strong&gt;context length&lt;/strong&gt; — the maximum number of tokens it can process at once. Ollama pre-allocates a KV cache for the &lt;strong&gt;full declared context length&lt;/strong&gt; when a model first loads. And &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; declares a context length of &lt;strong&gt;131,072 tokens&lt;/strong&gt; (128K) in its GGUF metadata.&lt;/p&gt;

&lt;p&gt;My requests used about 1,000 tokens each. Ollama allocated memory for 131,072.&lt;/p&gt;
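&lt;p&gt;The KV cache scales linearly with context: two tensors (K and V) per layer, each context_len × n_kv_heads × head_dim. A sketch with illustrative 7B-class geometry (not qwen2.5vl's exact config; Ollama's real allocation for this model is larger still, but the scaling is the point):&lt;/p&gt;

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """fp16 KV cache size: 2 tensors (K and V) per layer,
    each ctx_len x n_kv_heads x head_dim elements."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(28, 4, 128, 131072)   # declared 128K context
small = kv_cache_bytes(28, 4, 128, 4096)    # what my requests need
print(full // small)                        # 32
```

&lt;p&gt;Whatever the constant factor for a given model, dropping from 131,072 to 4,096 tokens shrinks the cache 32x.&lt;/p&gt;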

&lt;h2&gt;
  
  
  Why Didn't It Shrink?
&lt;/h2&gt;

&lt;p&gt;I tried every obvious fix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What I tried&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;OLLAMA_NUM_CTX=4096&lt;/code&gt; environment variable&lt;/td&gt;
&lt;td&gt;Ignored — doesn't override per-model defaults&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;"num_ctx": 4096&lt;/code&gt; in &lt;code&gt;/api/chat&lt;/code&gt; request body&lt;/td&gt;
&lt;td&gt;Doesn't shrink an already-loaded model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Using &lt;code&gt;/v1/chat/completions&lt;/code&gt; (OpenAI-compatible API)&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;num_ctx&lt;/code&gt; parameter available at all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restarting Ollama&lt;/td&gt;
&lt;td&gt;Works temporarily — but model reloads at 128K on first request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The root cause: Ollama reads the model's context length from the GGUF file and allocates the full KV cache on first load. &lt;strong&gt;There is no way to override this at request time for an already-loaded model.&lt;/strong&gt; And in an automated pipeline where requests come every 5 seconds, the model never unloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;The only reliable solution is to create a &lt;strong&gt;derived model&lt;/strong&gt; with the context size baked into the model definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Save as Modelfile.vision&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; qwen2.5vl:7b&lt;/span&gt;
PARAMETER num_ctx 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama create qwen2.5vl:7b-4k &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile.vision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now use &lt;code&gt;qwen2.5vl:7b-4k&lt;/code&gt; instead of &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; in your API calls.&lt;/p&gt;
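&lt;p&gt;In an automated pipeline the switch is just a model-name change. A minimal stdlib-only sketch of building the &lt;code&gt;/api/chat&lt;/code&gt; request (the prompt and helper function are illustrative):&lt;/p&gt;

```python
import json
import urllib.request

def build_chat_request(prompt, model="qwen2.5vl:7b-4k",
                       host="http://localhost:11434"):
    """Build an Ollama /api/chat request for the derived 4K-context model."""
    payload = {
        "model": model,   # the derived model, not the 128K base one
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        host + "/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Describe this camera frame.")
print(json.loads(req.data)["model"])   # qwen2.5vl:7b-4k
```

&lt;p&gt;No per-request &lt;code&gt;num_ctx&lt;/code&gt; needed any more; it's baked into the derived model.&lt;/p&gt;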

&lt;p&gt;For extra safety, I also set a global default in Ollama's systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/ollama.service
&lt;/span&gt;&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"OLLAMA_NUM_CTX=4096"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches any model that doesn't set an explicit &lt;code&gt;num_ctx&lt;/code&gt; of its own; given that the same variable was ignored for &lt;code&gt;qwen2.5vl:7b&lt;/code&gt;'s declared 128K default, treat it as a backstop rather than the fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total VRAM used&lt;/td&gt;
&lt;td&gt;70.8 GB&lt;/td&gt;
&lt;td&gt;14.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache context&lt;/td&gt;
&lt;td&gt;131,072 tokens&lt;/td&gt;
&lt;td&gt;4,096 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free VRAM&lt;/td&gt;
&lt;td&gt;37 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference speed&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output quality&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;56GB of VRAM recovered&lt;/strong&gt; with zero impact on inference. My requests never used more than ~1K tokens — the other 127K were allocated for nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Is Affected?
&lt;/h2&gt;

&lt;p&gt;This matters if you're running Ollama for &lt;strong&gt;automated workloads&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API servers handling frequent requests (the model stays loaded)&lt;/li&gt;
&lt;li&gt;Chatbots, agents, or monitoring pipelines&lt;/li&gt;
&lt;li&gt;Multiple models on the same GPU&lt;/li&gt;
&lt;li&gt;Any setup where you need predictable memory usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interactive chat sessions are less affected because Ollama unloads models after an idle timeout. But if your requests keep the model hot, the full KV cache lives in VRAM permanently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models to Watch Out For
&lt;/h2&gt;

&lt;p&gt;Many popular models declare 128K context by default:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Default Context&lt;/th&gt;
&lt;th&gt;Approx KV Cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5vl:7b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~65 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:32b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama3.1:70b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral-large&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Check your model's declared context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama show &amp;lt;model&amp;gt; &lt;span class="nt"&gt;--modelfile&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; ctx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:11434/api/show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"qwen2.5vl:7b"}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool | &lt;span class="nb"&gt;grep &lt;/span&gt;context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Rule
&lt;/h2&gt;

&lt;p&gt;For any Ollama model used in automated pipelines:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Always create a derived Modelfile with an explicit &lt;code&gt;num_ctx&lt;/code&gt; matching your actual needs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some guidelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vision/camera analysis: &lt;strong&gt;2K–4K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;Chatbot or agent: &lt;strong&gt;4K–8K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;Document analysis: &lt;strong&gt;8K–16K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;RAG with large context: &lt;strong&gt;16K–32K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Never leave a model at its default 128K context unless you actually need 128K. The KV cache allocation is proportional to context size — halving the context roughly halves the memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Isn't a Bug (But Maybe Should Be)
&lt;/h2&gt;

&lt;p&gt;Ollama's behavior is technically correct — pre-allocating the KV cache avoids the overhead of dynamic resizing during inference. For interactive use, where you might paste a long document or have a deep conversation, having the full context available makes sense.&lt;/p&gt;

&lt;p&gt;But for API workloads, it's a footgun. The mismatch between "model supports 128K context" and "my requests use 1K context" is common, and the memory cost is hidden. You don't see it in &lt;code&gt;nvidia-smi&lt;/code&gt; as a separate allocation — it's all lumped under the model.&lt;/p&gt;

&lt;p&gt;A dynamic or configurable allocation per-request would fix this for API users. Until then, the Modelfile workaround is the best approach.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I documented the full benchmarks and fix in my &lt;a href="https://github.com/kjaiswal/llama-cpp-distributed-benchmarks" rel="noopener noreferrer"&gt;llama-cpp-distributed-benchmarks&lt;/a&gt; repo, which also covers distributed inference across Apple Silicon + NVIDIA Blackwell over 10GbE.&lt;/em&gt;&lt;/p&gt;





</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
