<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kunal Jaiswal</title>
    <description>The latest articles on Forem by Kunal Jaiswal (@ljkunal).</description>
    <link>https://forem.com/ljkunal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852052%2F90b66918-7075-4354-a828-873703958bac.jpeg</url>
      <title>Forem: Kunal Jaiswal</title>
      <link>https://forem.com/ljkunal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ljkunal"/>
    <language>en</language>
    <item>
      <title>The Wrong GUID: How a Single Constant Broke WebSocket in Every Browser But Not Python</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:24:49 +0000</pubDate>
      <link>https://forem.com/ljkunal/the-wrong-guid-how-a-single-constant-broke-websocket-in-every-browser-but-not-python-1ojd</link>
      <guid>https://forem.com/ljkunal/the-wrong-guid-how-a-single-constant-broke-websocket-in-every-browser-but-not-python-1ojd</guid>
      <description>&lt;p&gt;I run a home automation setup with multiple RTSP IP cameras. The camera dashboard shows a grid of all cameras with a &lt;strong&gt;"Stream All"&lt;/strong&gt; button. Each stream is an MJPEG feed served through ffmpeg via a Python HTTP server.&lt;/p&gt;

&lt;p&gt;Click "Stream All" and you'd expect every feed to light up. Instead, 5 or 6 cameras would load and the rest would stay black forever. Refresh, and a &lt;em&gt;different&lt;/em&gt; set of cameras would load.&lt;/p&gt;

&lt;p&gt;The culprit was well-known: &lt;strong&gt;browsers limit ~6 concurrent HTTP/1.1 connections per origin&lt;/strong&gt;. Each MJPEG stream is a long-lived HTTP response that never closes. Camera 7 has to wait for camera 1 to finish — which is never.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: WebSocket Multiplexing
&lt;/h2&gt;

&lt;p&gt;WebSocket connections aren't subject to the six-per-origin HTTP/1.1 limit, and a single connection can multiplex any number of logical streams. The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;StreamManager&lt;/strong&gt; — a shared ffmpeg process pool. One ffmpeg per camera, regardless of how many clients are watching. Frames broadcast to all subscribers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;/ws/cameras&lt;/code&gt; endpoint&lt;/strong&gt; — clients subscribe to cameras via JSON commands, receive binary JPEG frames with a camera ID prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side JS&lt;/strong&gt; — single WebSocket connection, Blob URLs for rendering frames to &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; elements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The constraint: &lt;strong&gt;stdlib only&lt;/strong&gt;. No pip dependencies. This is a single-file Python HTTP server running as a macOS LaunchAgent. I implemented the WebSocket protocol (RFC 6455) from scratch — handshake, frame encoding/decoding, ping/pong, binary and text frames.&lt;/p&gt;
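&lt;p&gt;As a reference point, the server-to-client frame encoding from RFC 6455 Section 5.2 fits in a few lines of stdlib Python. A sketch of the encode side only (the function name is mine):&lt;/p&gt;

```python
import struct

def encode_ws_frame(payload: bytes, opcode: int = 0x2) -> bytes:
    """Encode one server-to-client WebSocket frame (RFC 6455 Section 5.2).

    Byte 0: FIN bit set plus opcode (0x1 text, 0x2 binary).
    Byte 1: payload length, with 126/127 escapes for longer payloads.
    Servers MUST NOT mask, so no masking key is included.
    """
    header = bytes([0x80 | opcode])
    n = len(payload)
    if n > 65535:
        header += bytes([127]) + struct.pack(">Q", n)  # 8-byte extended length
    elif n > 125:
        header += bytes([126]) + struct.pack(">H", n)  # 2-byte extended length
    else:
        header += bytes([n])
    return header + payload
```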

&lt;h3&gt;
  
  
  The frame format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Server → Client binary frame:&lt;/span&gt;
&lt;span class="c1"&gt;// [1 byte: camera ID length] [N bytes: camera IP] [JPEG bytes]&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Uint8Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;camId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextDecoder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jpeg&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;idLen&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Render to &amp;lt;img&amp;gt; via Blob URL&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;jpeg&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image/jpeg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createObjectURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
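&lt;p&gt;The server side of this application-level framing is symmetric. A minimal sketch (function names are mine, not the production code):&lt;/p&gt;

```python
def pack_camera_frame(camera_id: str, jpeg: bytes) -> bytes:
    """[1 byte: ID length][N bytes: camera ID][JPEG bytes]"""
    cid = camera_id.encode("ascii")
    if len(cid) > 255:
        raise ValueError("camera ID does not fit a 1-byte length prefix")
    return bytes([len(cid)]) + cid + jpeg

def unpack_camera_frame(buf: bytes) -> tuple:
    """Inverse of pack_camera_frame; mirrors the slice() calls in the JS."""
    id_len = buf[0]
    return buf[1:1 + id_len].decode("ascii"), buf[1 + id_len:]
```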



&lt;h2&gt;
  
  
  Python Says: Works Perfectly
&lt;/h2&gt;

&lt;p&gt;I wrote a Python test client that connects over TLS, subscribes to all cameras, and counts frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Subscribe to ALL cameras over single WebSocket
&lt;/span&gt;&lt;span class="nf"&gt;ws_send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cameras&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;camera_ips&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="c1"&gt;# Result after 12 seconds:
# Camera IP          Frames
# x.x.x.11              19
# x.x.x.13              20
# x.x.x.16              19
# ... (all cameras streaming)
# TOTAL               209 frames from all cameras
# Total data: ~8 MB in 12s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Every camera streaming. ~8 MB over a single WebSocket in 12 seconds.&lt;/strong&gt; The StreamManager, ffmpeg pool, frame extraction, binary WebSocket framing — all working perfectly.&lt;/p&gt;

&lt;p&gt;Time to open it in a browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  Browsers Say: No
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;WebSocket Test
04:46:59.972 Connecting...
04:47:00.148 ERROR: {"isTrusted":true}
04:47:00.148 CLOSED code=1006 reason= clean=false
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code &lt;strong&gt;1006&lt;/strong&gt;. Abnormal closure. No reason. Not clean. The &lt;code&gt;onopen&lt;/code&gt; callback never fires. Both Safari and Brave. Every single time.&lt;/p&gt;

&lt;p&gt;The server logs told a different story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ws] Client connected from x.x.x.x
[ws] Sending 101 (129 bytes)
[ws] Reader error: ConnectionError: WebSocket connection closed
[ws] Client disconnected, was watching 0 cameras
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server &lt;em&gt;sent&lt;/em&gt; the 101 Switching Protocols response. The client &lt;em&gt;connected&lt;/em&gt; at the TCP level. But the browser never acknowledged the upgrade. It just... closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Spiral
&lt;/h2&gt;

&lt;p&gt;What followed was hours of systematically ruling out every possible cause:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1: TLS Certificate&lt;/strong&gt; ❌&lt;br&gt;
Self-signed cert? Generated proper certs with mkcert, installed CA in system keychain. Pages loaded without warnings. WebSocket still failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2: Buffering&lt;/strong&gt; ❌&lt;br&gt;
Maybe &lt;code&gt;wfile&lt;/code&gt; is buffering the 101? Tried &lt;code&gt;handler.wfile.write()&lt;/code&gt; + &lt;code&gt;flush()&lt;/code&gt;, then &lt;code&gt;handler.connection.sendall()&lt;/code&gt;, then &lt;code&gt;handler.request.sendall()&lt;/code&gt;. All sent the bytes. Browser still rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3: HTTP Protocol Version&lt;/strong&gt; ❌&lt;br&gt;
Python's &lt;code&gt;BaseHTTPRequestHandler.protocol_version&lt;/code&gt; defaults to &lt;code&gt;"HTTP/1.0"&lt;/code&gt;. WebSocket requires HTTP/1.1. Set it to &lt;code&gt;"HTTP/1.1"&lt;/code&gt;. Then built the response manually with raw bytes. Still failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 4: ALPN Negotiation&lt;/strong&gt; ❌&lt;br&gt;
Maybe the browser is negotiating HTTP/2 via ALPN? Added &lt;code&gt;ctx.set_alpn_protocols(["http/1.1"])&lt;/code&gt;. No change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 5: Bypass the HTTP handler entirely&lt;/strong&gt; ❌&lt;br&gt;
Built a standalone raw WebSocket server on a separate port — pure socket, no &lt;code&gt;BaseHTTPRequestHandler&lt;/code&gt;. Read the HTTP request manually, send 101 manually. &lt;strong&gt;Still failed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 6: Mixed content / port issues&lt;/strong&gt; ❌&lt;br&gt;
Tried &lt;code&gt;ws://&lt;/code&gt; on the HTTP port (Brave auto-upgraded to HTTPS). Tried serving from HTTP. Tried different ports. Nothing.&lt;/p&gt;

&lt;p&gt;At this point I had verified the 101 response byte-by-byte:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Captured via Python, raw bytes from the server:
&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HTTP/1.1 101 Switching Protocols&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Upgrade: websocket&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Connection: Upgrade&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sec-WebSocket-Accept: MuIAfeA8S6DsJZLE/8a3flJsJzM=&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# 129 bytes. Correct CRLF. Correct headers. Correct format.
# Python clients: works. Browsers: 1006.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The response was &lt;strong&gt;byte-for-byte correct&lt;/strong&gt;. Correct HTTP version. Correct headers. Correct line endings. Correct empty line. And yet every browser on Earth rejected it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Dad Steps In
&lt;/h2&gt;

&lt;p&gt;I shared the full debugging context with my dad — who happens to run a swarm of AI agents for exactly this kind of problem. His analysis was surgical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sec-WebSocket-Accept is not byte-perfect.&lt;/strong&gt; The computation must use the exact GUID from RFC 6455. Even if the format looks right, if the GUID constant is wrong, the Accept value will be wrong. Python clients don't validate the Accept header. Browsers do. This is the #1 hidden killer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I looked at my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- _WS_MAGIC = b"258EAFA5-E914-47DA-95CA-5AB5F43F86A2"
&lt;/span&gt;&lt;span class="gi"&gt;+ _WS_MAGIC = b"258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic GUID. The one constant that every WebSocket implementation on the planet must agree on. &lt;strong&gt;I had it wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not slightly wrong. Not a typo in one character. The entire last segment was different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5AB5F43F86A2  ← what I had (WRONG)
C5AB0DC85B11  ← what RFC 6455 specifies (CORRECT)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Python Didn't Care
&lt;/h2&gt;

&lt;p&gt;The WebSocket handshake works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client sends &lt;code&gt;Sec-WebSocket-Key: &amp;lt;random base64&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Server concatenates it with the magic GUID&lt;/li&gt;
&lt;li&gt;Server SHA-1 hashes the result, base64 encodes it&lt;/li&gt;
&lt;li&gt;Server sends back &lt;code&gt;Sec-WebSocket-Accept: &amp;lt;hash&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Client verifies the Accept value matches what it expects&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
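&lt;p&gt;Steps 2–4 are three lines of stdlib Python. The sample key and expected Accept value below are the ones published in RFC 6455 itself:&lt;/p&gt;

```python
import base64
import hashlib

WS_MAGIC = b"258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # the RFC 6455 GUID

def ws_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a client's key."""
    digest = hashlib.sha1(sec_websocket_key.encode("ascii") + WS_MAGIC).digest()
    return base64.b64encode(digest).decode("ascii")

# Sample handshake from RFC 6455 Section 1.3:
assert ws_accept("dGhlIHNhbXBsZSBub25jZQ==") == "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```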

&lt;p&gt;Step 5 is where the divergence happens. My Python test client — like many WebSocket libraries — &lt;strong&gt;skips the Accept validation&lt;/strong&gt;. It sees "101 Switching Protocols" and proceeds. The Accept header is there, but nothing checks it.&lt;/p&gt;

&lt;p&gt;Browsers check it. Strictly. Silently. If it doesn't match, they close the TCP connection without sending a close frame — which is why you get code 1006 ("abnormal closure") with no reason string. The browser doesn't even tell you &lt;em&gt;what&lt;/em&gt; was wrong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt; One line. One constant. Changed the GUID to the correct RFC 6455 value. All cameras streaming in every browser instantly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Browsers are strict. Clients are lenient.
&lt;/h3&gt;

&lt;p&gt;Don't assume your test client validates what a browser validates. The &lt;code&gt;Sec-WebSocket-Accept&lt;/code&gt; header exists specifically so the client can verify the server understood the WebSocket protocol. Python clients being lenient masked a fatal bug for hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Error code 1006 is useless
&lt;/h3&gt;

&lt;p&gt;1006 means "I closed the connection abnormally." It doesn't say why. It could be a network error, a TLS issue, a protocol violation, or a wrong Accept header. Browser DevTools don't show the specific validation failure. This is a spec decision — 1006 is never sent over the wire; it's generated locally — but it makes debugging nearly impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Magic constants are the worst kind of bug
&lt;/h3&gt;

&lt;p&gt;The GUID &lt;code&gt;258EAFA5-E914-47DA-95CA-C5AB0DC85B11&lt;/code&gt; is an arbitrary string chosen by the RFC authors. It has no structure, no checksum, no way to validate it in isolation. If you copy it wrong, everything looks correct until a strict client rejects it. Use a well-tested library, or copy the constant from the actual RFC text — not from memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Test with the real consumer
&lt;/h3&gt;

&lt;p&gt;My Python test proved the architecture worked: shared ffmpeg pool, subscriber queues, frame broadcast, binary WebSocket framing. All solid. But I should have tested with a browser from minute one. The gap between "works in Python" and "works in a browser" was exactly one wrong constant.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The architecture was right
&lt;/h3&gt;

&lt;p&gt;Despite the handshake bug, the design held up perfectly once the GUID was fixed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All cameras, 1 WebSocket, ~2 fps each&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared ffmpeg pool&lt;/strong&gt; — one process per camera regardless of viewer count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-second grace period&lt;/strong&gt; on unsubscribe — handles page refresh without killing ffmpeg&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-restart&lt;/strong&gt; on ffmpeg crash (max 3 in 30s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MJPEG fallback&lt;/strong&gt; for clients that can't do WebSocket&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero pip dependencies&lt;/strong&gt; — pure stdlib Python&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;For anyone building something similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Server: Python stdlib HTTP server with manual WebSocket
# Camera: RTSP → ffmpeg → MJPEG frames
# Transport: WebSocket binary frames (1-byte ID prefix + JPEG)
# Client: Blob URLs → &amp;lt;img&amp;gt; elements
# Infra: macOS LaunchAgent, mkcert TLS
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StreamManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# One ffmpeg per camera, broadcast to N subscribers
&lt;/span&gt;    &lt;span class="c1"&gt;# subscribe(cam_id, queue) → start ffmpeg if first
&lt;/span&gt;    &lt;span class="c1"&gt;# unsubscribe(cam_id, queue) → stop ffmpeg if last (after 10s grace)
&lt;/span&gt;    &lt;span class="c1"&gt;# Reader thread: extract JPEGs from ffmpeg stdout
&lt;/span&gt;    &lt;span class="c1"&gt;# Queue per client, maxsize=100, drop on full
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CameraWS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Client-side JavaScript
&lt;/span&gt;    &lt;span class="c1"&gt;# Single WebSocket to /ws/cameras
&lt;/span&gt;    &lt;span class="c1"&gt;# subscribe([ips]) / unsubscribe([ips])
&lt;/span&gt;    &lt;span class="c1"&gt;# onmessage → Blob URL → img.src
&lt;/span&gt;    &lt;span class="c1"&gt;# Auto-reconnect, 3-fail MJPEG fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The WebSocket approach is the correct way to beat the browser connection limit for multi-camera MJPEG streaming. Just make sure you copy the GUID correctly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For the record, the WebSocket magic GUID from RFC 6455 Section 4.2.2 is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;258EAFA5-E914-47DA-95CA-C5AB0DC85B11&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Commit it to memory. Or better yet, don't — use a library.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>websocket</category>
      <category>python</category>
      <category>debugging</category>
      <category>homeautomation</category>
    </item>
    <item>
      <title>Building a Real-Time Security Camera System with Local Vision LLMs</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:19:05 +0000</pubDate>
      <link>https://forem.com/ljkunal/building-a-real-time-security-camera-system-with-local-vision-llms-2kgj</link>
      <guid>https://forem.com/ljkunal/building-a-real-time-security-camera-system-with-local-vision-llms-2kgj</guid>
      <description>&lt;p&gt;I replaced my Lorex NVR's motion detection — which alerted me 40 times a day about swaying trees and shadows — with a pipeline that uses a vision language model to understand what it's actually seeing. It runs entirely on local hardware, costs nothing after setup, and sends me a WhatsApp message only when something real happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3× Lorex 4K cameras (RTSP)
    ↓
gate_monitor.py (Mac Studio, M2 Ultra)
    ├── OpenCV: frame capture every 5s per camera
    ├── OpenCV: contour-based motion detection (frame N vs N-1)
    ├── Crop: extract largest changed region
    ├── VLM: qwen2.5vl:7b on DGX Spark (Blackwell, 10GbE link)
    │   └── "Classify this crop: ALERT or CLEAR?"
    ├── Alert: annotate frame with contour boxes
    │   ├── WiiM speaker announcement (TTS)
    │   └── WhatsApp message with image
    └── Audio: faster-whisper transcription (gate camera only)
        └── Gated by visual confirmation (120s window)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three cameras — front gate, backyard, driveway — each running in parallel threads. The system processes about &lt;strong&gt;50,000 VLM inference calls per day&lt;/strong&gt; and has been running 24/7 for weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use YOLO?
&lt;/h2&gt;

&lt;p&gt;Traditional object detection (YOLO, SSD) tells you &lt;em&gt;what&lt;/em&gt; is in a frame. A vision language model tells you &lt;em&gt;what's happening&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;My gate camera watches a residential street. YOLO would detect "person" for the mail carrier, the neighbor walking their dog, someone cutting through to the next street, and an actual trespasser — all equally. A VLM can distinguish:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"A delivery driver placing a package at the door" → alert&lt;/li&gt;
&lt;li&gt;"A person walking on the public sidewalk beyond the gate" → not relevant&lt;/li&gt;
&lt;li&gt;"The shadow of a tree branch moving across the driveway" → clear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: I don't need the VLM to be fast (it runs at ~15 tok/s). I need it to be smart. By using OpenCV contour detection as a fast pre-filter, the VLM only sees cropped regions where something actually changed — typically 2–5 calls per camera per minute instead of 12.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contour Detection Layer
&lt;/h2&gt;

&lt;p&gt;Before any AI touches a frame, OpenCV does the heavy lifting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture frame, convert to grayscale, resize to 640px width&lt;/li&gt;
&lt;li&gt;Compute absolute difference against previous analyzed frame&lt;/li&gt;
&lt;li&gt;Apply binary threshold (25) and dilation (3 iterations) to merge nearby changes&lt;/li&gt;
&lt;li&gt;Find contours, filter by area (min 150px², max 40% of frame)&lt;/li&gt;
&lt;li&gt;Merge nearby bounding boxes (within 50px)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If no contours survive filtering: &lt;strong&gt;CLEAR&lt;/strong&gt; — zero VLM calls. This happens 70%+ of the time (still frames, minor lighting shifts).&lt;/p&gt;

&lt;p&gt;If contours are found: crop the largest region, send to VLM for classification.&lt;/p&gt;

&lt;p&gt;Every 60 seconds, a fallback full-frame check catches anything that appeared between frames but hasn't moved (a parked car that wasn't there before, a person standing still).&lt;/p&gt;
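&lt;p&gt;The diff-and-threshold core of steps 2–4 is easy to sketch without OpenCV. The real pipeline uses &lt;code&gt;cv2.absdiff&lt;/code&gt;, &lt;code&gt;cv2.threshold&lt;/code&gt;, dilation, and contour merging; this dependency-free stand-in just finds the bounding box of changed pixels:&lt;/p&gt;

```python
def motion_bbox(prev, curr, thresh=25):
    """Bounding box (x, y, w, h) of pixels whose grayscale value changed
    by more than `thresh`, or None if nothing changed.

    prev/curr are 2-D lists of ints: a simplified stand-in for the
    absdiff -> threshold -> boundingRect steps (no dilation, no area filter).
    """
    xs, ys = [], []
    for y, (prow, crow) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(prow, crow)):
            if abs(p - c) > thresh:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
```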

&lt;h2&gt;
  
  
  Exclusion Zones
&lt;/h2&gt;

&lt;p&gt;Not all motion is interesting. I built a polygon zone editor (web UI at &lt;code&gt;/zones&lt;/code&gt;) that lets me draw exclusion and inclusion zones on camera frames — similar to professional NVR software.&lt;/p&gt;

&lt;p&gt;Current exclusion zones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gate camera:&lt;/strong&gt; the road beyond the gate (top portion of frame) — cars passing on the street aren't security events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driveway:&lt;/strong&gt; a steam pipe and stone wall fixture that cause constant false triggers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backyard:&lt;/strong&gt; a kamado BBQ grill and tree branches that sway in wind&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The zones are stored as JSON polygons. At runtime, &lt;code&gt;cv2.fillPoly&lt;/code&gt; builds a binary mask, which is applied to the thresholded diff before contour detection. Masked pixels are zeroed — contours in excluded areas never form.&lt;/p&gt;
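&lt;p&gt;For anyone implementing this without OpenCV: the mask reduces to a point-in-polygon test. A ray-casting sketch (the production code uses &lt;code&gt;cv2.fillPoly&lt;/code&gt; to rasterize the whole frame mask at once):&lt;/p&gt;

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: cast a ray in the +x direction and count edge
    crossings. `poly` is a list of (x, y) vertices; odd crossings = inside."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside
```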

&lt;h2&gt;
  
  
  False Positive War Stories
&lt;/h2&gt;

&lt;p&gt;Vision LLMs hallucinate. In security camera analysis, this means phantom alerts. Here are the patterns I found and fixed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The negation problem.&lt;/strong&gt; The VLM would say "No people, vehicles, or animals are visible in the frame" and my classifier would see "people, vehicles, animals" and trigger an alert. Fix: expanded the negation lookback from 25 to 60 characters and added sentence-level negation detection ("if sentence starts with no/not/without AND ends with visible/present/found → CLEAR").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hedge problem.&lt;/strong&gt; The VLM would output both "ALERT" and "CLEAR" in the same response when it was uncertain. Fix: if both keywords appear on the same line, CLEAR wins. It's better to miss an event than to false-alert at 3 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The location confusion.&lt;/strong&gt; "A vehicle on the road beyond the gate" was triggering alerts for the gate camera. But the road isn't my property. Fix: added location-based negation — "beyond the gate", "past the gate", "on the road", "on the street" → CLEAR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shadow/reflection problem.&lt;/strong&gt; "A shadow of a person" would alert. Fix: added "shadow of" as a negation pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The phantom description.&lt;/strong&gt; This was the most insidious. When the VLM received a nearly-black night frame, it would occasionally hallucinate vivid descriptions of people or vehicles. Fix: contour detection at night produces zero contours (no pixel changes in darkness), so the VLM is never called — the contour pre-filter eliminates this class of error entirely.&lt;/p&gt;
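&lt;p&gt;Taken together, those fixes amount to a small post-filter over the VLM's text. A sketch (the pattern lists here are illustrative, not the full production set):&lt;/p&gt;

```python
import re

NEGATION = re.compile(r"(no|not|without)\b")
LOCATION_CLEAR = ("beyond the gate", "past the gate", "on the road", "on the street")
NEGATED_ENDINGS = ("visible", "present", "found")

def classify(vlm_response: str) -> str:
    """Map a free-text VLM response to ALERT or CLEAR, biased toward CLEAR."""
    low = vlm_response.lower()
    # Hedge rule: if a line contains both keywords, CLEAR wins
    for line in low.splitlines():
        if "alert" in line and "clear" in line:
            return "CLEAR"
    # Location and shadow rules
    if any(loc in low for loc in LOCATION_CLEAR) or "shadow of" in low:
        return "CLEAR"
    # Sentence-level negation: "No people ... are visible."
    for sentence in re.split(r"[.!?]", low):
        s = sentence.strip()
        if NEGATION.match(s) and s.endswith(NEGATED_ENDINGS):
            return "CLEAR"
    return "ALERT" if "alert" in low else "CLEAR"
```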

&lt;h2&gt;
  
  
  Audio Intelligence
&lt;/h2&gt;

&lt;p&gt;The gate camera has a microphone. &lt;code&gt;faster-whisper&lt;/code&gt; (medium.en model) transcribes 15-second audio chunks, but audio alerts are &lt;strong&gt;gated by visual confirmation&lt;/strong&gt; — a speech transcription only fires if there was a visually-confirmed alert within the last 120 seconds. This prevents phantom audio alerts from wind, distant traffic, or radio.&lt;/p&gt;

&lt;p&gt;Urgent keywords (help, emergency, fire) bypass the gate.&lt;/p&gt;

&lt;p&gt;The transcription pipeline: PCM audio from RTSP → WAV → &lt;code&gt;faster-whisper&lt;/code&gt; → filter noise phrases ("thank you for watching", street chatter) → if visually gated → WiiM speaker announcement + WhatsApp message with OGG audio clip.&lt;/p&gt;
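&lt;p&gt;The gating logic itself is tiny. A sketch of the 120-second visual-confirmation window (class and method names are mine):&lt;/p&gt;

```python
import time

URGENT_WORDS = ("help", "emergency", "fire")

class AudioGate:
    """Pass audio alerts only within `window` seconds of a visual alert.
    Urgent keywords bypass the gate entirely."""

    def __init__(self, window=120.0):
        self.window = window
        self.last_visual = None

    def visual_alert(self, now=None):
        """Record the time of a visually-confirmed alert."""
        self.last_visual = time.monotonic() if now is None else now

    def allow(self, transcript, now=None):
        """Should this transcription fire an alert?"""
        now = time.monotonic() if now is None else now
        if any(word in transcript.lower() for word in URGENT_WORDS):
            return True
        if self.last_visual is None:
            return False
        return self.window >= now - self.last_visual  # still inside the window
```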

&lt;h2&gt;
  
  
  The Alert Review Tool
&lt;/h2&gt;

&lt;p&gt;50,000 VLM calls per day generates a lot of classification data. I built a daily review tool (&lt;code&gt;/alerts/review&lt;/code&gt;) that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parses the last 24 hours of &lt;code&gt;CONFIRMED ALERT&lt;/code&gt; lines from the log&lt;/li&gt;
&lt;li&gt;Groups by camera + normalized description&lt;/li&gt;
&lt;li&gt;Sends all patterns to &lt;code&gt;qwen3.5:35b&lt;/code&gt; for meta-classification: REAL / FALSE_POSITIVE / NOISE&lt;/li&gt;
&lt;li&gt;Presents a web UI with tabs (Needs Review / AI Flagged / Suppressed / Acknowledged)&lt;/li&gt;
&lt;li&gt;One-click suppress permanently filters a pattern from future alerts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM classification is given context: Calgary snowy conditions, known permanent features (kamado BBQ, stone wall, gate post lights), typical neighborhood activity. It correctly flags 80%+ of false positives for one-click suppression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cameras&lt;/td&gt;
&lt;td&gt;3 (4K, RTSP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frame interval&lt;/td&gt;
&lt;td&gt;5 seconds per camera&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VLM calls/day&lt;/td&gt;
&lt;td&gt;~50,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VLM model&lt;/td&gt;
&lt;td&gt;qwen2.5vl:7b-4k (14.5 GB on DGX Spark)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference latency&lt;/td&gt;
&lt;td&gt;~200ms per crop (10GbE link)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positive rate&lt;/td&gt;
&lt;td&gt;&amp;lt;5% after zone exclusions + negation fixes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total system cost&lt;/td&gt;
&lt;td&gt;$0/month (all local hardware)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with contour detection, not VLM.&lt;/strong&gt; I initially sent every frame to the VLM. The 70.8 GB memory leak I found in Ollama (separate blog post) was partly caused by this constant load. Contour pre-filtering reduced VLM calls by 70%+ and made the whole system viable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use a smaller VLM for classification, larger for description.&lt;/strong&gt; A 3B model could handle binary ALERT/CLEAR classification. Reserve the 7B model for generating the detailed description that goes into the WhatsApp alert.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Night mode needs a different approach.&lt;/strong&gt; IR cameras produce grayscale footage that confuses vision LLMs trained on color images. Thermal cameras or dedicated night-vision models would work better.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;All of this runs on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mac Studio M2 Ultra (128 GB) — camera capture, OpenCV, audio processing, web UI&lt;/li&gt;
&lt;li&gt;NVIDIA DGX Spark (120 GB) — VLM inference via Ollama&lt;/li&gt;
&lt;li&gt;10GbE direct link between the two machines&lt;/li&gt;
&lt;li&gt;Raspberry Pi — WhatsApp gateway&lt;/li&gt;
&lt;li&gt;WiiM speaker — voice announcements&lt;/li&gt;
&lt;li&gt;Python 3.9 — stdlib plus OpenCV and faster-whisper; no other pip dependencies in any production script&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zero cloud APIs. Zero subscriptions. Full privacy.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The full pipeline code and zone editor are on &lt;a href="https://github.com/kjaiswal" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt;. If you're running local vision models for home automation, I'd like to hear what models and pre-filters work for you.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>homelab</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Distributed LLM Inference Across NVIDIA Blackwell and Apple Silicon Over 10GbE</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:15:15 +0000</pubDate>
      <link>https://forem.com/ljkunal/distributed-llm-inference-across-nvidia-blackwell-and-apple-silicon-over-10gbe-2feg</link>
      <guid>https://forem.com/ljkunal/distributed-llm-inference-across-nvidia-blackwell-and-apple-silicon-over-10gbe-2feg</guid>
      <description>&lt;p&gt;I connected an NVIDIA DGX Spark to a Mac Studio with a direct 10-gigabit Ethernet cable and split a large language model across both GPUs. Here's what actually happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I have two machines that are excellent at different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA DGX Spark&lt;/strong&gt; (GB10 Blackwell, 120 GB unified memory) — screaming fast tensor cores, CUDA 13&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mac Studio&lt;/strong&gt; (M2 Ultra, 128 GB unified memory) — great Metal GPU, massive memory bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined: &lt;strong&gt;248 GB&lt;/strong&gt; of GPU-accessible memory. Enough to run models that don't fit on either machine alone — 100B+ parameter models at reasonable quantization levels.&lt;/p&gt;
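&lt;p&gt;As a rough sizing rule, a Q4_K_M GGUF works out to about 4.8–5 bits per weight (my approximation; K-quants mix 4- and 6-bit blocks, so real files vary by a few percent):&lt;/p&gt;

```python
def gguf_size_gb(n_params_billion, bits_per_weight=4.85):
    """Approximate GGUF file size in GB for a Q4_K_M quant.

    4.85 bits/weight is a rough average across the mixed tensor types.
    """
    return n_params_billion * bits_per_weight / 8

print(round(gguf_size_gb(230)))  # 139: close to MiniMax M2.5's 138 GB
print(round(gguf_size_gb(72)))   # 44: matches Qwen2.5-72B's 44.2 GB
```

&lt;p&gt;So 248 GB of pooled memory puts 200B+ parameter models in reach at Q4, with room left for the KV cache.&lt;/p&gt;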

&lt;p&gt;The question: can you actually get useful performance by splitting a model across heterogeneous GPUs over a network link?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Physical Setup
&lt;/h2&gt;

&lt;p&gt;I connected both machines with a direct 10GbE cable — no switch, no router. Just a CAT6A cable between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DGX: Realtek 10GbE NIC (&lt;code&gt;enP7s7&lt;/code&gt;) → &lt;code&gt;192.168.100.2/24&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Mac Studio: 10GbE port (&lt;code&gt;en0&lt;/code&gt;) → &lt;code&gt;192.168.100.1/24&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Measured throughput: &lt;strong&gt;9.41 Gbps&lt;/strong&gt;. Both machines keep WiFi for LAN/internet access — the direct cable is a dedicated inference-only link.&lt;/p&gt;
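&lt;p&gt;For reference, a sketch of the link configuration (interface name and addresses are the ones above; adjust for your hardware). iperf3 is how I'd verify a figure like the 9.41 Gbps:&lt;/p&gt;

```shell
# DGX (Linux) side: static address on the 10GbE NIC.
# enP7s7 is the interface name from above; check `ip link` on your box.
sudo ip addr add 192.168.100.2/24 dev enP7s7
sudo ip link set enP7s7 up

# Mac Studio side: set 192.168.100.1/24 manually on the 10GbE port
# (System Settings → Network, or `networksetup -setmanual`).

# Verify raw throughput: run `iperf3 -s` on the DGX, then from the Mac:
iperf3 -c 192.168.100.2
```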

&lt;h2&gt;
  
  
  Why llama.cpp RPC (and Why Not Exo)
&lt;/h2&gt;

&lt;p&gt;I tried two approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  Exo (MLX Ring) — Failed
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/exo-explore/exo" rel="noopener noreferrer"&gt;Exo&lt;/a&gt; is a distributed inference framework that uses MLX on both Metal and CUDA backends. I got peer discovery working and placed the 138 GB MiniMax M2.5 model across both nodes, but hit a wall: &lt;strong&gt;&lt;code&gt;mx.distributed.init(backend="ring")&lt;/code&gt; hangs indefinitely on the CUDA backend&lt;/strong&gt;. The MLX CUDA ring implementation simply doesn't work yet (as of MLX 0.31.1). Even single-node ring init hangs on DGX.&lt;/p&gt;

&lt;p&gt;I fixed several other bugs along the way (election instability, edge oscillation, model path mismatches, Linux interface detection) and &lt;a href="https://github.com/exo-explore/exo/pull/1809" rel="noopener noreferrer"&gt;submitted a P2P model distribution PR&lt;/a&gt;, but the core distributed inference path is blocked until Apple adds CUDA ring support to MLX.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp RPC — Works
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp's RPC backend&lt;/a&gt; takes a different approach. Instead of requiring the same ML framework on both ends, it exposes a simple RPC server that provides raw compute. The host machine (Mac Studio) runs &lt;code&gt;llama-server&lt;/code&gt;, loads the model, and offloads layers to remote RPC servers (DGX) as needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# DGX — start RPC server&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; /home/kjaiswal/llama.cpp
&lt;span class="nv"&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;build/bin build/bin/rpc-server &lt;span class="nt"&gt;-H&lt;/span&gt; 192.168.100.2 &lt;span class="nt"&gt;-p&lt;/span&gt; 50052

&lt;span class="c"&gt;# Mac Studio — start llama-server with RPC&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; /Users/chimpoo/llama.cpp
build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rpc&lt;/span&gt; 192.168.100.2:50052 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both were built from the same commit (&lt;code&gt;b0f0dd3e5&lt;/code&gt;) with their respective GPU backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mac Studio: &lt;code&gt;GGML_METAL=ON GGML_RPC=ON&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DGX: &lt;code&gt;GGML_CUDA=ON GGML_RPC=ON&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model file only needs to exist on the Mac Studio. llama.cpp automatically splits layers across available compute based on memory.&lt;/p&gt;
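&lt;p&gt;If the automatic split isn't what you want, llama.cpp's &lt;code&gt;--tensor-split&lt;/code&gt; flag pins the proportions manually; a sketch (the 2:1 ratio is illustrative):&lt;/p&gt;

```shell
# --tensor-split takes proportions per device (ratios, not gigabytes).
# Check llama-server's startup log to confirm which device is listed first.
build/bin/llama-server \
  -m /path/to/model.gguf \
  --rpc 192.168.100.2:50052 \
  --tensor-split 2,1 \
  -ngl 99 \
  --host 0.0.0.0 --port 9999 \
  -c 4096
```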

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Qwen2.5-7B Q4_K_M (4.4 GB) — Fits one machine easily
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Prompt Processing&lt;/th&gt;
&lt;th&gt;Token Generation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local Metal only&lt;/td&gt;
&lt;td&gt;76 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC (Metal + CUDA)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;318 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;53 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen2.5-72B Q4_K_M (44.2 GB) — Fits Mac Studio alone
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Prompt Processing&lt;/th&gt;
&lt;th&gt;Token Generation&lt;/th&gt;
&lt;th&gt;Model Split&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local Metal only&lt;/td&gt;
&lt;td&gt;28 tok/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;44 GB on Metal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC (Metal + CUDA)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6 tok/s&lt;/td&gt;
&lt;td&gt;31 GB Metal + 14 GB CUDA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What the Numbers Mean
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt processing (prefill) benefits from RPC.&lt;/strong&gt; The DGX Blackwell tensor cores accelerate the matrix multiplications needed to process input tokens. For the 7B model, prefill was &lt;strong&gt;4.2x faster&lt;/strong&gt; with RPC. Even the 72B model saw a slight improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token generation (decode) is slower with RPC.&lt;/strong&gt; Each generated token requires a round-trip over the network to synchronize KV cache states. At 10 Gbps, this adds ~0.2ms per layer per token, or 16ms across 80 layers; together with serialization and synchronization overhead, that's enough to roughly halve generation speed.&lt;/p&gt;
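&lt;p&gt;A quick back-of-envelope check of that wire-latency model (throughput numbers are from the benchmarks above; the per-layer round-trip cost is my estimate):&lt;/p&gt;

```python
# Illustrative model of RPC decode overhead for the 72B benchmark.
local_tok_s = 11.0               # measured local decode rate
layers = 80
per_layer_rtt_ms = 0.2           # rough 10GbE round-trip cost per layer

base_ms = 1000 / local_tok_s             # ~91 ms per token locally
overhead_ms = layers * per_layer_rtt_ms  # 16 ms of wire time per token
rpc_tok_s = 1000 / (base_ms + overhead_ms)
print(round(rpc_tok_s, 1))               # 9.4
```

&lt;p&gt;Wire latency alone would leave ~9.4 tok/s; the measured 6 tok/s means per-token serialization and synchronization cost more than the raw round-trips.&lt;/p&gt;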

&lt;p&gt;&lt;strong&gt;For models that fit one machine, local is faster.&lt;/strong&gt; The 72B model runs at 11 tok/s locally vs 6 tok/s over RPC. The network overhead isn't worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real value is models that DON'T fit one machine.&lt;/strong&gt; With 248 GB combined, I can run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MiniMax M2.5 Q4_K_M (138 GB) — 230B parameters, 10B active MoE&lt;/li&gt;
&lt;li&gt;Qwen3-235B Q4_K_M (132 GB) — 235B parameters, 22B active MoE&lt;/li&gt;
&lt;li&gt;DeepSeek-R1 at higher quantization than either machine could handle alone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At Q4 quantization, a 200B+ MoE model should generate at ~4–8 tok/s across both machines. Not fast, but usable for batch processing, code review, and complex reasoning tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct cables beat switches.&lt;/strong&gt; A direct 10GbE link has lower latency and jitter than going through a network switch. For latency-sensitive distributed inference, every microsecond matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prefill and decode have opposite scaling characteristics.&lt;/strong&gt; Prefill is embarrassingly parallel and benefits from more compute. Decode is sequential and bottlenecked by network latency. This suggests a potential disaggregated architecture: use the DGX for prefill, Mac Studio for decode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GGUF is universal in theory, not always in practice.&lt;/strong&gt; Ollama GGUFs can carry custom metadata that upstream llama.cpp can't read (e.g., &lt;code&gt;rope.dimension_sections&lt;/code&gt; with the wrong array length). Stick to HuggingFace community GGUFs (bartowski, etc.) for llama.cpp.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heterogeneous distributed inference works today&lt;/strong&gt; — but only with frameworks that abstract the GPU backend behind a network protocol (like llama.cpp RPC). Frameworks that require the same ML runtime on all nodes (like Exo with MLX) are blocked on backend parity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark MiniMax M2.5 (138 GB) split across both machines — the first model that actually needs distributed inference&lt;/li&gt;
&lt;li&gt;Test disaggregated prefill (DGX) + decode (Mac Studio) once both run the same framework&lt;/li&gt;
&lt;li&gt;Explore vLLM's distributed serving for production workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full setup — including the Exo debugging saga and the 5 bugs I fixed — is documented in my infrastructure notes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're working on similar multi-GPU setups, I'd love to hear what's working for you. The full setup notes and Exo bug fixes are on &lt;a href="https://github.com/kjaiswal" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>homelab</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>How Ollama Silently Ate 65GB of My VRAM (And How I Fixed It)</title>
      <dc:creator>Kunal Jaiswal</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:28:08 +0000</pubDate>
      <link>https://forem.com/ljkunal/how-ollama-silently-ate-65gb-of-my-vram-and-how-i-fixed-it-22pf</link>
      <guid>https://forem.com/ljkunal/how-ollama-silently-ate-65gb-of-my-vram-and-how-i-fixed-it-22pf</guid>
      <description>&lt;p&gt;I run a vision-language model (&lt;code&gt;qwen2.5vl:7b&lt;/code&gt;) on an NVIDIA DGX Spark for automated camera analysis — three RTSP cameras, one inference call every 5 seconds, 24/7. The model weights are about 6GB. It should use maybe 8-10GB total.&lt;/p&gt;

&lt;p&gt;After a week of running, I checked memory usage: &lt;strong&gt;70.8GB out of 120GB.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's 65GB of VRAM consumed by a 6GB model. Here's what happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Symptom
&lt;/h2&gt;

&lt;p&gt;Everything was working fine. Inference was fast, results were accurate. I only noticed the problem because I wanted to load a second model and got an out-of-memory error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ollama ps
&lt;span class="go"&gt;NAME              SIZE     PROCESSOR
qwen2.5vl:7b     70.8GB   100% GPU
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70.8GB for a 7B model. That's not right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the Cause
&lt;/h2&gt;

&lt;p&gt;The VRAM breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model weights: ~6 GB&lt;/li&gt;
&lt;li&gt;KV cache: &lt;strong&gt;~65 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Overhead: ~0.5 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The KV cache was the problem. But why was it so large?&lt;/p&gt;

&lt;p&gt;Every transformer model has a &lt;strong&gt;context length&lt;/strong&gt; — the maximum number of tokens it can process at once. Ollama pre-allocates a KV cache for the &lt;strong&gt;full declared context length&lt;/strong&gt; when a model first loads. And &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; declares a context length of &lt;strong&gt;131,072 tokens&lt;/strong&gt; (128K) in its GGUF metadata.&lt;/p&gt;

&lt;p&gt;My requests used about 1,000 tokens each. Ollama allocated memory for 131,072.&lt;/p&gt;
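&lt;p&gt;The KV cache scales linearly with context: two tensors (K and V) per layer, each context_len × n_kv_heads × head_dim. A sketch with illustrative 7B-class geometry (not qwen2.5vl's exact config; Ollama's real allocation for this model is larger still, but the scaling is the point):&lt;/p&gt;

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """fp16 KV cache size: 2 tensors (K and V) per layer,
    each ctx_len x n_kv_heads x head_dim elements."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(28, 4, 128, 131072)   # declared 128K context
small = kv_cache_bytes(28, 4, 128, 4096)    # what my requests need
print(full // small)                        # 32
```

&lt;p&gt;Whatever the constant factor for a given model, dropping from 131,072 to 4,096 tokens shrinks the cache 32x.&lt;/p&gt;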

&lt;h2&gt;
  
  
  Why Didn't It Shrink?
&lt;/h2&gt;

&lt;p&gt;I tried every obvious fix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What I tried&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;OLLAMA_NUM_CTX=4096&lt;/code&gt; environment variable&lt;/td&gt;
&lt;td&gt;Ignored — doesn't override per-model defaults&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;"num_ctx": 4096&lt;/code&gt; in &lt;code&gt;/api/chat&lt;/code&gt; request body&lt;/td&gt;
&lt;td&gt;Doesn't shrink an already-loaded model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Using &lt;code&gt;/v1/chat/completions&lt;/code&gt; (OpenAI-compatible API)&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;num_ctx&lt;/code&gt; parameter available at all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restarting Ollama&lt;/td&gt;
&lt;td&gt;Works temporarily — but model reloads at 128K on first request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The root cause: Ollama reads the model's context length from the GGUF file and allocates the full KV cache on first load. &lt;strong&gt;There is no way to override this at request time for an already-loaded model.&lt;/strong&gt; And in an automated pipeline where requests come every 5 seconds, the model never unloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;The only reliable solution is to create a &lt;strong&gt;derived model&lt;/strong&gt; with the context size baked into the model definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Save as Modelfile.vision&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; qwen2.5vl:7b&lt;/span&gt;
PARAMETER num_ctx 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama create qwen2.5vl:7b-4k &lt;span class="nt"&gt;-f&lt;/span&gt; Modelfile.vision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now use &lt;code&gt;qwen2.5vl:7b-4k&lt;/code&gt; instead of &lt;code&gt;qwen2.5vl:7b&lt;/code&gt; in your API calls.&lt;/p&gt;
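&lt;p&gt;In an automated pipeline the switch is just a model-name change. A minimal stdlib-only sketch of building the &lt;code&gt;/api/chat&lt;/code&gt; request (the prompt and helper function are illustrative):&lt;/p&gt;

```python
import json
import urllib.request

def build_chat_request(prompt, model="qwen2.5vl:7b-4k",
                       host="http://localhost:11434"):
    """Build an Ollama /api/chat request for the derived 4K-context model."""
    payload = {
        "model": model,   # the derived model, not the 128K base one
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        host + "/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Describe this camera frame.")
print(json.loads(req.data)["model"])   # qwen2.5vl:7b-4k
```

&lt;p&gt;No per-request &lt;code&gt;num_ctx&lt;/code&gt; needed any more; it's baked into the derived model.&lt;/p&gt;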

&lt;p&gt;For extra safety, I also set a global default in Ollama's systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/ollama.service
&lt;/span&gt;&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"OLLAMA_NUM_CTX=4096"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches any model that doesn't set an explicit &lt;code&gt;num_ctx&lt;/code&gt; of its own; given that the same variable was ignored for &lt;code&gt;qwen2.5vl:7b&lt;/code&gt;'s declared 128K default, treat it as a backstop rather than the fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total VRAM used&lt;/td&gt;
&lt;td&gt;70.8 GB&lt;/td&gt;
&lt;td&gt;14.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache context&lt;/td&gt;
&lt;td&gt;131,072 tokens&lt;/td&gt;
&lt;td&gt;4,096 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free VRAM&lt;/td&gt;
&lt;td&gt;37 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference speed&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output quality&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;td&gt;No change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;56GB of VRAM recovered&lt;/strong&gt; with zero impact on inference. My requests never used more than ~1K tokens — the other 127K were allocated for nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Is Affected?
&lt;/h2&gt;

&lt;p&gt;This matters if you're running Ollama for &lt;strong&gt;automated workloads&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API servers handling frequent requests (the model stays loaded)&lt;/li&gt;
&lt;li&gt;Chatbots, agents, or monitoring pipelines&lt;/li&gt;
&lt;li&gt;Multiple models on the same GPU&lt;/li&gt;
&lt;li&gt;Any setup where you need predictable memory usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interactive chat sessions are less affected because Ollama unloads models after an idle timeout. But if your requests keep the model hot, the full KV cache lives in VRAM permanently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models to Watch Out For
&lt;/h2&gt;

&lt;p&gt;Many popular models declare 128K context by default:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Default Context&lt;/th&gt;
&lt;th&gt;Approx KV Cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5vl:7b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~65 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:32b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama3.1:70b&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral-large&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~130 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Check your model's declared context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama show &amp;lt;model&amp;gt; &lt;span class="nt"&gt;--modelfile&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; ctx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:11434/api/show &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"qwen2.5vl:7b"}'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool | &lt;span class="nb"&gt;grep &lt;/span&gt;context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Rule
&lt;/h2&gt;

&lt;p&gt;For any Ollama model used in automated pipelines:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Always create a derived Modelfile with an explicit &lt;code&gt;num_ctx&lt;/code&gt; matching your actual needs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some guidelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vision/camera analysis: &lt;strong&gt;2K–4K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;Chatbot or agent: &lt;strong&gt;4K–8K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;Document analysis: &lt;strong&gt;8K–16K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;li&gt;RAG with large context: &lt;strong&gt;16K–32K&lt;/strong&gt; tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Never leave a model at its default 128K context unless you actually need 128K. The KV cache allocation is proportional to context size — halving the context roughly halves the memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Isn't a Bug (But Maybe Should Be)
&lt;/h2&gt;

&lt;p&gt;Ollama's behavior is technically correct — pre-allocating the KV cache avoids the overhead of dynamic resizing during inference. For interactive use, where you might paste a long document or have a deep conversation, having the full context available makes sense.&lt;/p&gt;

&lt;p&gt;But for API workloads, it's a footgun. The mismatch between "model supports 128K context" and "my requests use 1K context" is common, and the memory cost is hidden. You don't see it in &lt;code&gt;nvidia-smi&lt;/code&gt; as a separate allocation — it's all lumped under the model.&lt;/p&gt;

&lt;p&gt;A dynamic or configurable allocation per-request would fix this for API users. Until then, the Modelfile workaround is the best approach.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I documented the full benchmarks and fix in my &lt;a href="https://github.com/kjaiswal/llama-cpp-distributed-benchmarks" rel="noopener noreferrer"&gt;llama-cpp-distributed-benchmarks&lt;/a&gt; repo, which also covers distributed inference across Apple Silicon + NVIDIA Blackwell over 10GbE.&lt;/em&gt;&lt;/p&gt;





</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
