<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vitalii Buhaiov</title>
    <description>The latest articles on Forem by Vitalii Buhaiov (@vitaliibuhaiovmarkettrace).</description>
    <link>https://forem.com/vitaliibuhaiovmarkettrace</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3931543%2F525cf842-5da8-4942-aa2c-5887778df3cd.jpg</url>
      <title>Forem: Vitalii Buhaiov</title>
      <link>https://forem.com/vitaliibuhaiovmarkettrace</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vitaliibuhaiovmarkettrace"/>
    <language>en</language>
    <item>
      <title>ChunkLoadError on every deploy: the in-place rebuild trap in Next.js standalone</title>
      <dc:creator>Vitalii Buhaiov</dc:creator>
      <pubDate>Mon, 18 May 2026 08:07:04 +0000</pubDate>
      <link>https://forem.com/markettrace/chunkloaderror-on-every-deploy-the-in-place-rebuild-trap-in-nextjs-standalone-1d8i</link>
      <guid>https://forem.com/markettrace/chunkloaderror-on-every-deploy-the-in-place-rebuild-trap-in-nextjs-standalone-1d8i</guid>
      <description>&lt;p&gt;We run a Next.js 16 site behind nginx on a single VPS. Recently Google Search Console reported a single &lt;code&gt;500&lt;/code&gt; on one of our locale-prefixed pages. The page was working fine by the time I clicked through. I almost ignored it. I'm glad I didn't. The trail led to a bug that fires on every deploy, and the fix is short.&lt;/p&gt;

&lt;p&gt;Here's the story and what the fix cost us.&lt;/p&gt;

&lt;h2&gt;
  
  
  The single 500
&lt;/h2&gt;

&lt;p&gt;Search Console flagged a locale-prefixed product route. The URL returned a clean &lt;code&gt;200&lt;/code&gt; when I curled it. So either the indexer hit a transient blip, or something in our deploy flow occasionally leaks a 500 to whichever request happens to be in flight at the wrong second.&lt;/p&gt;

&lt;p&gt;The nginx access log made it concrete. One &lt;code&gt;500&lt;/code&gt; for that URL, single timestamp, never before or after:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[06:58:05]  GET /es/products/details  500
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the matching &lt;code&gt;journalctl -u frontend&lt;/code&gt; output for the same moment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;06:58:04  Error [ChunkLoadError]: Failed to load chunk
          server/chunks/ssr/messages_es_json_[json]_cjs_xxxxxxxx._.js
          from module 83578
   [cause]: Error: Cannot find module
            '/opt/app/frontend/.next/standalone/.next/server/chunks/ssr/...'
06:58:04  Error [ChunkLoadError]: Failed to load chunk
          server/chunks/ssr/[root-of-the-server]__xxxxxxx._.js ...
06:58:04  ⨯ unhandledRejection: ChunkLoadError ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hundreds of these in a five-second window, then silence. That five-second window matched the deploy run from earlier that morning to the second. A later deploy left a bigger spread of &lt;code&gt;500&lt;/code&gt;s across other locale-prefixed routes. Same root cause, same five seconds, more URLs simply because more requests landed in the window.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the rebuild was doing
&lt;/h2&gt;

&lt;p&gt;Our deploy on master push was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/app/frontend
npm ci &lt;span class="nt"&gt;--prefer-offline&lt;/span&gt;
npm run build              &lt;span class="c"&gt;# writes .next/standalone/ + .next/static/ + .next/server/&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; .next/static .next/standalone/.next/static
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; public        .next/standalone/public
systemctl restart frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;WorkingDirectory&lt;/code&gt; of the systemd unit was &lt;code&gt;.next/standalone/&lt;/code&gt;. &lt;code&gt;next build&lt;/code&gt; overwrites that directory in place. So during a 30–60-second rebuild, the running Node process held a heap full of in-memory references to chunk filenames (say, &lt;code&gt;server/chunks/ssr/messages_es_json_[json]_cjs_xxxxxxxx._.js&lt;/code&gt;) that the new build had just deleted and replaced with a different hash. Only then did &lt;code&gt;systemctl restart&lt;/code&gt; kill the old process and start a new one.&lt;/p&gt;

&lt;p&gt;Any SSR request that hit the old process during that ~5-second window between "files replaced" and "process restarted" tried to lazy-load a chunk by its old filename. Node went to disk, didn't find it, threw &lt;code&gt;ChunkLoadError&lt;/code&gt;. Next.js doesn't handle that in the SSR path. It bubbles up as a 500.&lt;/p&gt;

&lt;p&gt;In-memory code that pre-loaded its chunks at boot kept working. Anything that touched a route that lazy-loaded (a different locale, an MDX-rendered page, a dynamic import) was a coin flip.&lt;/p&gt;

&lt;p&gt;This isn't a Next.js bug. It's the cost of in-place rebuild deploys for any Node.js process that uses dynamic imports. We had lots of them: one per locale message bundle, one per MDX route, one per locale-prefixed page.&lt;/p&gt;
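
&lt;p&gt;The failure is easy to reproduce outside Next.js. Here is a minimal sketch in Python rather than Node, with made-up chunk filenames and a made-up &lt;code&gt;manifest&lt;/code&gt; dict standing in for whatever the server remembers about its lazy chunks. It only illustrates the mechanism: a path remembered before the rebuild no longer resolves after it.&lt;/p&gt;

```python
import os
import tempfile


def simulate_in_place_rebuild() -> str:
    """Lazy-load a file by a path that an in-place rebuild has replaced."""
    with tempfile.TemporaryDirectory() as build_dir:
        # Build N writes a hashed chunk (hypothetical filename).
        old_chunk = os.path.join(build_dir, "messages_es.aaaa1111.js")
        with open(old_chunk, "w") as f:
            f.write("old build")

        # The running server remembers the path but has not loaded it yet.
        manifest = {"messages_es": old_chunk}

        # Build N+1 runs in the same directory: old hash out, new hash in.
        os.remove(old_chunk)
        new_chunk = os.path.join(build_dir, "messages_es.bbbb2222.js")
        with open(new_chunk, "w") as f:
            f.write("new build")

        # A request arrives before the restart and triggers the lazy load.
        try:
            with open(manifest["messages_es"]) as f:
                return f.read()
        except FileNotFoundError:
            return "chunk missing"


if __name__ == "__main__":
    print(simulate_in_place_rebuild())
```

&lt;p&gt;Anything the process had already read into memory is unaffected; only the loads deferred past the rebuild fail, which is why the errors cluster on lazily loaded routes.&lt;/p&gt;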

&lt;h2&gt;
  
  
  What we considered
&lt;/h2&gt;

&lt;p&gt;Four options, in increasing order of "actually adequate":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Stop, then build, then start. &lt;code&gt;systemctl stop&lt;/code&gt; before &lt;code&gt;npm run build&lt;/code&gt;. The running process never sees mismatched chunks because it isn't running. Cost: nginx returns &lt;code&gt;502&lt;/code&gt; for 30–60 seconds while the build runs. &lt;code&gt;502&lt;/code&gt; is "bad gateway": the proxy is up but the upstream isn't, which Google treats as transient. Much friendlier than &lt;code&gt;500&lt;/code&gt;. Users still see a maintenance-ish page for a minute.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Atomic directory swap. Build into a sibling directory, then &lt;code&gt;mv .next/standalone .next/standalone-old &amp;amp;&amp;amp; mv .next/standalone-new .next/standalone &amp;amp;&amp;amp; systemctl restart&lt;/code&gt;. The running process keeps reading its old (now-renamed) directory until restart. Window shrinks from 30 seconds of &lt;code&gt;502&lt;/code&gt; to 3–5 seconds of &lt;code&gt;502&lt;/code&gt;. Still some downtime, no &lt;code&gt;500&lt;/code&gt;s.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;proxy_next_upstream&lt;/code&gt; with a backup server. Tell nginx to retry on a backup if the primary returns &lt;code&gt;500&lt;/code&gt;. Requires keeping two upstream instances in sync forever, including during deploys. That sync is exactly the problem we were trying to solve, so this just relocates it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blue-green at the systemd + nginx layer. Two long-running pools on different ports. Build into the idle one. Health-check it. Atomically swap nginx upstream. Drain. Stop the old. Zero failed requests during deploy.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We chose 4. The first three each shave a different chunk off the failure window; 4 closes it entirely. And it costs almost nothing on a 16 GB box (more on this below).&lt;/p&gt;

&lt;h2&gt;
  
  
  The pieces
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two systemd instances from one template unit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/frontend@.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Frontend (Next.js standalone, %i pool)&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;
&lt;span class="py"&gt;ConditionPathExists&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/opt/app/frontend/pools/%i/server.js&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;app&lt;/span&gt;
&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/opt/app/frontend/pools/%i&lt;/span&gt;
&lt;span class="py"&gt;EnvironmentFile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/etc/frontend-%i.env&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;NODE_ENV=production&lt;/span&gt;
&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;HOSTNAME=127.0.0.1&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/node server.js&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;%i&lt;/code&gt; is the instance name. &lt;code&gt;frontend@blue&lt;/code&gt; runs from &lt;code&gt;pools/blue/&lt;/code&gt;, &lt;code&gt;frontend@green&lt;/code&gt; from &lt;code&gt;pools/green/&lt;/code&gt;. The per-color env files supply &lt;code&gt;PORT=3000&lt;/code&gt; and &lt;code&gt;PORT=3001&lt;/code&gt; respectively, kept VPS-local because they don't belong in git.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ConditionPathExists&lt;/code&gt; is doing real work. Without it, an empty pool slot (fresh install, partial deploy) would loop on &lt;code&gt;Restart=always&lt;/code&gt;. With it, systemd just doesn't start the unit until the path appears.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nginx upstream as an include file
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /etc/nginx/conf.d/frontend-upstream.conf&lt;/span&gt;
&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;include&lt;/span&gt; &lt;span class="n"&gt;/etc/nginx/frontend-upstream-active.inc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /etc/nginx/frontend-upstream-active.inc&lt;/span&gt;
&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;127.0.0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3001&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deploy script never edits &lt;code&gt;frontend-upstream.conf&lt;/code&gt;. It writes a new &lt;code&gt;frontend-upstream-active.inc&lt;/code&gt; via temp-file + &lt;code&gt;mv&lt;/code&gt; (which is atomic on a single filesystem), then sends &lt;code&gt;nginx -s reload&lt;/code&gt;. The underlying &lt;code&gt;rename(2)&lt;/code&gt; flips the on-disk upstream pointer in one atomic syscall; &lt;code&gt;reload&lt;/code&gt; then gracefully rotates the workers onto it.&lt;/p&gt;
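
&lt;p&gt;For anyone scripting this outside the shell, the same temp-file-then-rename pattern can be sketched in Python. The function name &lt;code&gt;write_upstream&lt;/code&gt; and the &lt;code&gt;0o644&lt;/code&gt; mode are my assumptions for illustration; the only thing carried over from our setup is the one-line &lt;code&gt;server 127.0.0.1:PORT;&lt;/code&gt; format of the include file.&lt;/p&gt;

```python
import os
import tempfile


def write_upstream(include_path: str, port: int) -> None:
    """Atomically replace the nginx include file with a single server line."""
    directory = os.path.dirname(os.path.abspath(include_path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".upstream-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(f"server 127.0.0.1:{port};\n")
        os.chmod(tmp_path, 0o644)  # mkstemp defaults to 0600
        os.replace(tmp_path, include_path)  # rename(2) under the hood
    except BaseException:
        # Never leave a half-written temp file next to the real include.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        target = os.path.join(demo_dir, "frontend-upstream-active.inc")
        write_upstream(target, 3001)
        with open(target) as f:
            print(f.read(), end="")
```

&lt;p&gt;A reader of the include file sees either the old line or the new one, never a partial write. nginx itself only picks the change up on the next reload.&lt;/p&gt;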

&lt;p&gt;One trap: name the include file with an extension that &lt;em&gt;isn't&lt;/em&gt; &lt;code&gt;.conf&lt;/code&gt;, or put it outside &lt;code&gt;/etc/nginx/conf.d/&lt;/code&gt;. Otherwise the top-level &lt;code&gt;include /etc/nginx/conf.d/*.conf&lt;/code&gt; will try to load it as a standalone config and choke on the bare &lt;code&gt;server&lt;/code&gt; directive. We used &lt;code&gt;.inc&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy script flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;NGINX_UPSTREAM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/nginx/frontend-upstream-active.inc
&lt;span class="nv"&gt;ACTIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;pools/active-color 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;blue&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ACTIVE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"blue"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nv"&gt;IDLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;green&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;IDLE_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3001&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;ACTIVE_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3000
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nv"&gt;IDLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;blue&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="nv"&gt;IDLE_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3000&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;ACTIVE_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3001
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Sanity check: abort if the world is inconsistent.&lt;/span&gt;
&lt;span class="nv"&gt;NGINX_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-oE&lt;/span&gt; &lt;span class="s1"&gt;'127\.0\.0\.1:[0-9]+'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NGINX_UPSTREAM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;: &lt;span class="nt"&gt;-f2&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NGINX_PORT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ACTIVE_PORT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FAIL: marker mismatch"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Build (only writes inside .next/, not pools/).&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"pools/&lt;/span&gt;&lt;span class="nv"&gt;$IDLE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
npm run build

&lt;span class="c"&gt;# Stage build into idle pool.&lt;/span&gt;
&lt;span class="nb"&gt;mv&lt;/span&gt; .next/standalone &lt;span class="s2"&gt;"pools/&lt;/span&gt;&lt;span class="nv"&gt;$IDLE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Bring idle online and prove it works.&lt;/span&gt;
systemctl restart &lt;span class="s2"&gt;"frontend@&lt;/span&gt;&lt;span class="nv"&gt;$IDLE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 60&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-sf&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;--max-time&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Host: example.com'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"http://127.0.0.1:&lt;/span&gt;&lt;span class="nv"&gt;$IDLE_PORT&lt;/span&gt;&lt;span class="s2"&gt;/"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;break
  sleep &lt;/span&gt;1
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Multi-route smoke (locale-prefixed + MDX + dynamic) before cutover.&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;route &lt;span class="k"&gt;in&lt;/span&gt; / /es/ /products/details /docs/guide /blog&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;--max-time&lt;/span&gt; 5 &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s1"&gt;'%{http_code}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
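    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Host: example.com'&lt;/span&gt; &lt;span class="s2"&gt;"http://127.0.0.1:&lt;/span&gt;&lt;span class="nv"&gt;$IDLE_PORT$route&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qE&lt;/span&gt; &lt;span class="s1"&gt;'^[23]'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FAIL: smoke test &lt;/span&gt;&lt;span class="nv"&gt;$route&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; systemctl stop &lt;span class="s2"&gt;"frontend@&lt;/span&gt;&lt;span class="nv"&gt;$IDLE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;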
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Atomic upstream swap + reload.&lt;/span&gt;
&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'server 127.0.0.1:%s;\n'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IDLE_PORT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NGINX_UPSTREAM&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.new"&lt;/span&gt;
&lt;span class="nb"&gt;mv&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NGINX_UPSTREAM&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.new"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NGINX_UPSTREAM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
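nginx &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; nginx &lt;span class="nt"&gt;-s&lt;/span&gt; reload &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FAIL: nginx reload"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;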

&lt;span class="c"&gt;# Mark, drain, retire.&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IDLE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pools/active-color
systemctl &lt;span class="nb"&gt;enable&lt;/span&gt;  &lt;span class="s2"&gt;"frontend@&lt;/span&gt;&lt;span class="nv"&gt;$IDLE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
systemctl disable &lt;span class="s2"&gt;"frontend@&lt;/span&gt;&lt;span class="nv"&gt;$ACTIVE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;sleep &lt;/span&gt;30                              &lt;span class="c"&gt;# drain in-flight requests on the old pool&lt;/span&gt;
systemctl stop &lt;span class="s2"&gt;"frontend@&lt;/span&gt;&lt;span class="nv"&gt;$ACTIVE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The order matters more than it looks. Two specifics.&lt;/p&gt;

&lt;p&gt;Write the marker file &lt;em&gt;immediately&lt;/em&gt; after the nginx reload, before the drain. If the script crashes during the sleep or the systemctl stop, the marker reflects what nginx is doing right now. The next deploy reads truth, not stale state.&lt;/p&gt;

&lt;p&gt;Sanity-check before destructive ops. &lt;code&gt;rm -rf pools/$IDLE&lt;/code&gt; is fine if &lt;code&gt;$IDLE&lt;/code&gt; really is idle. If the marker file lies (say a previous rollback was incomplete), &lt;code&gt;$IDLE&lt;/code&gt; could be the pool that's serving traffic. The pre-flight check compares the marker against nginx's upstream port and refuses to proceed on a mismatch.&lt;/p&gt;
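
&lt;p&gt;The pre-flight logic is small enough to restate as a pure function, which also makes it easy to unit-test. This is a sketch in Python, not our actual script: the &lt;code&gt;PORTS&lt;/code&gt; table mirrors the blue/green ports from the env files, and the function names are invented for the example.&lt;/p&gt;

```python
import re

# Pool color to port, matching the per-color env files (blue=3000, green=3001).
PORTS = {"blue": 3000, "green": 3001}


def upstream_port(include_text: str) -> int:
    """Extract the port nginx is pointed at from the include file's text."""
    match = re.search(r"127\.0\.0\.1:(\d+)", include_text)
    if match is None:
        raise ValueError("no upstream server line found")
    return int(match.group(1))


def preflight(active: str, include_text: str) -> str:
    """Return the pool that is safe to wipe, or raise if marker and nginx disagree."""
    if active not in PORTS:
        raise ValueError(f"unknown pool color: {active!r}")
    if upstream_port(include_text) != PORTS[active]:
        raise RuntimeError("marker mismatch: refusing to touch any pool")
    return "green" if active == "blue" else "blue"


if __name__ == "__main__":
    print(preflight("blue", "server 127.0.0.1:3000;\n"))
```

&lt;p&gt;The point of the design is that the destructive step only ever receives a pool name that two independent sources (the marker file and the live nginx include) agree is idle.&lt;/p&gt;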

&lt;h2&gt;
  
  
  What it costs us
&lt;/h2&gt;

&lt;p&gt;Measured on the live VPS. Your absolute numbers will vary with bundle size and traffic; the ratios won't:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After (steady)&lt;/th&gt;
&lt;th&gt;After (during 30-s cutover)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frontend RAM (RSS)&lt;/td&gt;
&lt;td&gt;241 MB&lt;/td&gt;
&lt;td&gt;241 MB&lt;/td&gt;
&lt;td&gt;482 MB (both pools running)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk used by pool dirs&lt;/td&gt;
&lt;td&gt;173 MB&lt;/td&gt;
&lt;td&gt;346 MB (active + previous, kept for rollback)&lt;/td&gt;
&lt;td&gt;346 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend CPU&lt;/td&gt;
&lt;td&gt;~0 % (idle)&lt;/td&gt;
&lt;td&gt;~0 %&lt;/td&gt;
&lt;td&gt;~0 % (both pools idle during cutover)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build-phase RAM peak&lt;/td&gt;
&lt;td&gt;~1.0–1.5 GB&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Against a 16 GB / 150 GB box where Redis already eats 4–5 GB resident, this is rounding error. The build itself is the expensive part of any deploy and it didn't change.&lt;/p&gt;

&lt;p&gt;What it bought us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero &lt;code&gt;500&lt;/code&gt;s during deploy. The old pool keeps serving its own (unchanged) chunks until it's gracefully stopped. The new pool starts from a complete on-disk build before nginx ever sends it a request.&lt;/li&gt;
&lt;li&gt;Zero &lt;code&gt;502&lt;/code&gt;s during deploy. No restart window. &lt;code&gt;nginx -s reload&lt;/code&gt; is graceful and doesn't drop in-flight connections.&lt;/li&gt;
&lt;li&gt;Cheap rollback. The previous pool's directory is retained until the next deploy. To revert, the rollback script starts the old pool, writes the include file back, reloads nginx. No rebuild needed. About 10 seconds end-to-end.&lt;/li&gt;
&lt;li&gt;Honest failure mode. If the build fails, the script aborts before touching nginx; the old pool is still serving. If the new pool fails health-check, the script stops it and exits non-zero; the old pool is still serving. There's no state in which the deploy can take the site down mid-flight.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Gotchas
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;next build&lt;/code&gt; cleans &lt;code&gt;.next/&lt;/code&gt; at start
&lt;/h3&gt;

&lt;p&gt;The first cut of this had pool directories at &lt;code&gt;.next/standalone-blue/&lt;/code&gt; and &lt;code&gt;.next/standalone-green/&lt;/code&gt;. They got wiped on every rebuild. &lt;code&gt;next build&lt;/code&gt; does a recursive clean of &lt;code&gt;.next/&lt;/code&gt; before running. If you want anything to survive across builds, keep it &lt;em&gt;outside&lt;/em&gt; &lt;code&gt;.next/&lt;/code&gt;. We moved pools to &lt;code&gt;pools/&amp;lt;color&amp;gt;/&lt;/code&gt; (sibling of &lt;code&gt;.next/&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Not Next.js-specific. Most build tools assume their output dir is theirs to own. Don't squat in it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;mv&lt;/code&gt; is safe under a running Linux process
&lt;/h3&gt;

&lt;p&gt;While migrating prod to the new layout I had to move &lt;code&gt;pools/blue/&lt;/code&gt; while &lt;code&gt;frontend@blue&lt;/code&gt; was actively serving from inside it. Linux inode semantics make this fine: a process holds inode references through its open file descriptors and CWD, not path strings. &lt;code&gt;mv&lt;/code&gt; within a single filesystem is just a &lt;code&gt;rename(2)&lt;/code&gt;; the inodes don't move. The running pool kept serving without noticing.&lt;/p&gt;
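
&lt;p&gt;A quick way to convince yourself is a throwaway script. The sketch below (Python, POSIX filesystems assumed, directory and file names invented) opens a file, renames its parent directory, and then reads through the handle it already held. It also checks the old path, as a reminder that the guarantee covers what is already open, not fresh lookups by the old name.&lt;/p&gt;

```python
import os
import tempfile


def read_after_rename() -> tuple[str, bool]:
    """Rename a directory while holding an open file inside it."""
    with tempfile.TemporaryDirectory() as root:
        old_dir = os.path.join(root, "blue")
        new_dir = os.path.join(root, "blue-moved")
        os.mkdir(old_dir)
        chunk_path = os.path.join(old_dir, "chunk.js")
        with open(chunk_path, "w") as f:
            f.write("module.exports = 1\n")

        handle = open(chunk_path)    # stands in for the running process
        os.rename(old_dir, new_dir)  # same filesystem, so a plain rename(2)
        try:
            content = handle.read()  # still readable: the inode did not move
        finally:
            handle.close()

        old_path_resolves = os.path.exists(chunk_path)
        return content, old_path_resolves


if __name__ == "__main__":
    print(read_after_rename())
```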

&lt;p&gt;Same reason &lt;code&gt;tail -f&lt;/code&gt; keeps working when you rotate a log file by renaming it. Useful primitive once you remember it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't put backup files in &lt;code&gt;sites-enabled/&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;I made a backup of &lt;code&gt;/etc/nginx/sites-enabled/default&lt;/code&gt; &lt;em&gt;next to&lt;/em&gt; the original, then &lt;code&gt;nginx -t&lt;/code&gt; started warning about "conflicting server name" entries. The top-level &lt;code&gt;include /etc/nginx/sites-enabled/*&lt;/code&gt; was loading my &lt;code&gt;.bak&lt;/code&gt; as a config. Move backups elsewhere or rename them so the glob misses.&lt;/p&gt;

&lt;h3&gt;
  
  
  systemd templates aren't auto-enabled by &lt;code&gt;enable --now&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Our CI workflow has a generic loop that auto-enables newly-installed singleton units. Templates (&lt;code&gt;foo@.service&lt;/code&gt;) are explicitly skipped because they need an instance name. That's the right behavior for our case: we want exactly one of blue/green enabled at a time, and the deploy script decides which.&lt;/p&gt;

&lt;h3&gt;
  
  
  Health-check should match production conditions
&lt;/h3&gt;

&lt;p&gt;A bare &lt;code&gt;curl http://127.0.0.1:$PORT/&lt;/code&gt; will succeed in a lot of cases where production is broken. Add &lt;code&gt;-H 'Host: example.com'&lt;/code&gt; if you're behind a reverse proxy, follow redirects with &lt;code&gt;-L&lt;/code&gt;, and probe routes that exercise the middleware / SSR / MDX paths that you care about. We had a Next.js + Cloudflare + nginx interaction bug that only surfaced when the request &lt;code&gt;Host&lt;/code&gt; header didn't match &lt;code&gt;127.0.0.1&lt;/code&gt;. A localhost-only health check wouldn't have caught it.&lt;/p&gt;
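
&lt;p&gt;If the smoke test outgrows a shell loop, the same idea fits in a few lines of Python. This is a sketch under assumptions: &lt;code&gt;fetch_status&lt;/code&gt; and &lt;code&gt;failed_routes&lt;/code&gt; are names I made up, the accepted range is 2xx/3xx like the &lt;code&gt;grep&lt;/code&gt; in the deploy script, and the fetcher is injectable so the pass/fail logic can be tested without a running pool.&lt;/p&gt;

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def fetch_status(url: str, host: str, timeout: float = 5.0) -> int:
    """GET a URL with an explicit Host header; return the status, or 0 on no response."""
    request = urllib.request.Request(url, headers={"Host": host})
    try:
        # urlopen follows redirects by default, similar to curl -L.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure


def failed_routes(
    port: int,
    routes: Iterable[str],
    host: str,
    fetch: Callable[[str, str], int] = fetch_status,
) -> list[str]:
    """Return every route whose status is outside 2xx/3xx."""
    failures = []
    for route in routes:
        status = fetch(f"http://127.0.0.1:{port}{route}", host)
        if status not in range(200, 400):
            failures.append(route)
    return failures


if __name__ == "__main__":
    bad = failed_routes(3001, ["/", "/es/", "/blog"], "example.com")
    print("failed:", bad)
```

&lt;p&gt;An empty list means the idle pool is safe to cut over to; anything else should abort the deploy before nginx is touched.&lt;/p&gt;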

&lt;h2&gt;
  
  
  When this isn't worth it
&lt;/h2&gt;

&lt;p&gt;This pattern is small enough to recommend for any single-VPS deploy that uses &lt;code&gt;systemd&lt;/code&gt; + a reverse proxy. It scales to multiple boxes the same way. Replace "two systemd instances on one box" with "two server fleets behind a load balancer" and the swap mechanic is identical.&lt;/p&gt;

&lt;p&gt;It is &lt;em&gt;not&lt;/em&gt; worth it if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your app boots in under a second and has no in-flight state that matters across restarts. A plain restart is simpler and the failure window is too short to care.&lt;/li&gt;
&lt;li&gt;You already use a real orchestrator. Kubernetes, Nomad, ECS: all of them do this for you as a rolling deploy. If you have it, use it.&lt;/li&gt;
&lt;li&gt;You're on a serverless platform where the runtime owns the deploy lifecycle. Same reason.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a single-VPS Node.js process behind nginx, though, blue-green is the proportionate fix. Half a day of work, no new dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed for us, concretely
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The five-second &lt;code&gt;ChunkLoadError&lt;/code&gt; window is gone by construction. The old pool's chunks never get touched; the new pool starts from a complete build before nginx ever sends it a request.&lt;/li&gt;
&lt;li&gt;Rollback is a 10-second nginx-upstream rewrite, not a rebuild.&lt;/li&gt;
&lt;li&gt;The next time &lt;code&gt;next build&lt;/code&gt; evicts a critical chunk filename (and it will), nobody outside our journal will know.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We spent longer figuring out the original bug than implementing the fix. If you're running anything stateful behind nginx and your deploy is &lt;code&gt;git pull &amp;amp;&amp;amp; build &amp;amp;&amp;amp; restart&lt;/code&gt;, check what your own single-&lt;code&gt;500&lt;/code&gt; window looks like.&lt;/p&gt;

</description>
      <category>nextjs</category>
      <category>development</category>
      <category>devops</category>
      <category>nginx</category>
    </item>
    <item>
      <title>Why a single timestamp breaks real-time aggregation</title>
      <dc:creator>Vitalii Buhaiov</dc:creator>
      <pubDate>Thu, 14 May 2026 21:14:13 +0000</pubDate>
      <link>https://forem.com/markettrace/why-a-single-timestamp-breaks-real-time-aggregation-3ill</link>
      <guid>https://forem.com/markettrace/why-a-single-timestamp-breaks-real-time-aggregation-3ill</guid>
      <description>&lt;p&gt;During volatile moves, the aggregator could show a "consensus" order book that never existed on any exchange at any single instant. The bug: one &lt;code&gt;timestamp&lt;/code&gt; field hiding three different "nows", one per venue.&lt;/p&gt;

&lt;p&gt;I learned this aggregating live order books. The pattern generalizes to any multi-source pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;I run a service that joins live order books from Binance, Bybit, and OKX into one view. Each producer is a small daemon (one per exchange/asset pair) that holds a websocket open, applies snapshot+diff updates, and publishes the top 200 levels to Redis. A 10 Hz aggregator reads the three Redis keys, bins prices into per-asset buckets ($1 for BTC, $0.10 for ETH, etc.), and writes one unified snapshot.&lt;/p&gt;

&lt;p&gt;There's an obvious question every consumer asks:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;what time is this unified snapshot from?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you're paying attention, this question has two wrong answers before it has a right one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrong answer #1: now()
&lt;/h2&gt;

&lt;p&gt;The aggregator runs every 100 ms. So the snapshot is from… now? Sort of. It's from the moment the aggregator built it. The &lt;em&gt;underlying data&lt;/em&gt; is older. Each producer has its own publish cadence, the websocket has its own jitter, and the exchange itself stamped the event some milliseconds before that.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;now()&lt;/code&gt; is fine as an audit-log field ("the aggregator emitted this snapshot at t=…"). It's wrong if a consumer wants to know how old the &lt;em&gt;data&lt;/em&gt; is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrong answer #2: max(producer.ts)
&lt;/h2&gt;

&lt;p&gt;Better: each producer stamps &lt;code&gt;ts = int(time.time() * 1000)&lt;/code&gt; on its publish. The aggregator picks the freshest producer and calls that the snapshot's time.&lt;/p&gt;

&lt;p&gt;This is what I shipped first. It's wrong for a subtler reason: producer &lt;code&gt;ts&lt;/code&gt; is &lt;em&gt;wall time on the producer machine&lt;/em&gt;, not the exchange's event time. Two producers can be sitting on data that the exchange stamped 200 ms apart yet publish to Redis 5 ms apart on their wall clocks. The snapshot looks synchronized because producer clocks are close, even though the underlying exchange events are not. Even perfectly synchronized producer clocks can't reconstruct exchange event ordering after the fact.&lt;/p&gt;

&lt;p&gt;In quiet markets this is invisible. In high-volatility moments (a Fed print, a liquidation cascade) Binance, Bybit, and OKX can stamp their depth events tens to hundreds of milliseconds apart. Producer &lt;code&gt;ts&lt;/code&gt; hides this completely. The consensus snapshot reads as a single instant when in fact it's stitched from three.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two-field rule
&lt;/h2&gt;

&lt;p&gt;The fix that finally stuck: &lt;strong&gt;every payload carries two timestamps&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# producer wall time (publish moment)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;# exchange-stamped event time
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;exchange emits event  →  producer receives  →  producer publishes  →  aggregator joins
       event_ts                                       ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ts&lt;/code&gt; is the producer's wall clock at publish. It controls Redis TTL ("is this producer alive?") and is the right field for staleness gates. If a producer dies, &lt;code&gt;ts&lt;/code&gt; stops moving, which is the signal to exclude it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;event_ts&lt;/code&gt; is whatever the &lt;em&gt;exchange&lt;/em&gt; called the time of the underlying event. It's the right field for cross-source alignment. Two producers with &lt;code&gt;event_ts&lt;/code&gt; 200 ms apart are showing different instants of the market, even if their &lt;code&gt;ts&lt;/code&gt; is identical.&lt;/p&gt;

&lt;p&gt;These do different jobs. Conflating them, picking one field to do both, leaks one job into the other. Gate staleness on &lt;code&gt;event_ts&lt;/code&gt;, and you get false alarms when a venue throttles its push rate. Align on &lt;code&gt;ts&lt;/code&gt;, and you get fake consensus during a vol spike.&lt;/p&gt;

&lt;p&gt;Two fields. Two jobs. Don't combine them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Treasure hunt: what each venue actually sends
&lt;/h2&gt;

&lt;p&gt;The two-field rule pushes complexity into the producers. Each one has to know what its exchange calls "event time". The answer is different at every venue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binance&lt;/strong&gt; futures &lt;code&gt;depth&lt;/code&gt; stream: each event carries &lt;code&gt;E&lt;/code&gt; (exchange-emitted event timestamp) and &lt;code&gt;T&lt;/code&gt; (transaction time). I prefer &lt;code&gt;E&lt;/code&gt;, fall back to &lt;code&gt;T&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ev_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are ms epoch ints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bybit&lt;/strong&gt; &lt;code&gt;orderbook.50.&amp;lt;SYMBOL&amp;gt;&lt;/code&gt; on v5 linear: a top-level &lt;code&gt;ts&lt;/code&gt; field on the wrapping frame, int ms epoch. The inner &lt;code&gt;data&lt;/code&gt; carries &lt;code&gt;u&lt;/code&gt; (update id) but no separate event timestamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OKX&lt;/strong&gt; &lt;code&gt;books&lt;/code&gt; channel on v5: the timestamp lives &lt;em&gt;inside&lt;/em&gt; each &lt;code&gt;data[]&lt;/code&gt; entry, and it arrives as a &lt;em&gt;string&lt;/em&gt; of ms epoch.&lt;/p&gt;

&lt;p&gt;Three venues, three shapes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Venue&lt;/th&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Wire type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Binance&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;E&lt;/code&gt; (fallback &lt;code&gt;T&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;per-event object&lt;/td&gt;
&lt;td&gt;int ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bybit&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;top-level wrapper&lt;/td&gt;
&lt;td&gt;int ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OKX&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;per-entry inside &lt;code&gt;data[]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;string ms (parse)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After extraction, every producer normalizes to int ms epoch stamped on a field literally named &lt;code&gt;event_ts&lt;/code&gt; in Redis. Downstream code never branches on exchange to interpret time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Normalize exchange-specific timestamp semantics at the producer boundary, not in downstream consumers.&lt;/p&gt;
&lt;/blockquote&gt;
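
&lt;p&gt;A minimal sketch of that boundary, assuming each producer hands a decoded websocket frame to one shared helper. The names &lt;code&gt;to_ms&lt;/code&gt; and &lt;code&gt;extract_event_ts&lt;/code&gt; are hypothetical, and so is the choice to take the latest &lt;code&gt;ts&lt;/code&gt; when an OKX frame has several &lt;code&gt;data[]&lt;/code&gt; entries. Only the field names and locations come from the table above:&lt;/p&gt;

```python
from typing import Any, Optional


def to_ms(value: Any) -> Optional[int]:
    """Coerce an int or numeric string to int ms epoch; None if missing or malformed."""
    if value is None:
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def extract_event_ts(venue: str, frame: dict) -> Optional[int]:
    """Return the exchange-stamped event time for one decoded frame.

    Frame shapes beyond the timestamp fields are assumptions for illustration.
    """
    if venue == "binance":
        # Per-event object: prefer E, fall back to T.
        return to_ms(frame.get("E") or frame.get("T"))
    if venue == "bybit":
        # Top-level ts on the wrapping frame.
        return to_ms(frame.get("ts"))
    if venue == "okx":
        # String ts inside each data[] entry; keep the latest one (assumption).
        stamps = [to_ms(entry.get("ts")) for entry in frame.get("data", [])]
        stamps = [s for s in stamps if s is not None]
        return max(stamps) if stamps else None
    raise ValueError(f"unknown venue: {venue}")
```

&lt;p&gt;The result goes straight into the payload's &lt;code&gt;event_ts&lt;/code&gt; field. A missing or malformed timestamp becomes &lt;code&gt;None&lt;/code&gt; rather than an exception, so the aggregator can still run its stale gate on &lt;code&gt;ts&lt;/code&gt;.&lt;/p&gt;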

&lt;h2&gt;
  
  
  The aggregator: two jobs, two fields
&lt;/h2&gt;

&lt;p&gt;With the invariant in place, the aggregator's logic separates cleanly. Stale gate uses &lt;code&gt;ts&lt;/code&gt;. Alignment uses &lt;code&gt;event_ts&lt;/code&gt;. They never cross paths.&lt;/p&gt;

&lt;p&gt;The stale gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;snap_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;age_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now_ms&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;snap_ts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;snap_ts&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;age_ms&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;age_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;STALE_THRESHOLD_S&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sources_status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;age_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The threshold is 60 s. If a producer's &lt;code&gt;ts&lt;/code&gt; hasn't advanced in a minute, the producer is dead, so exclude it from the union. This is the right measure of &lt;em&gt;producer health&lt;/em&gt; and nothing else.&lt;/p&gt;

&lt;p&gt;For the live sources, extract &lt;code&gt;event_ts&lt;/code&gt; and the lag it implies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ev_ts_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;event_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev_ts_raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ev_ts_raw&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;event_age_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;now_ms&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;event_ts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event_ts&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;event_age_ms&lt;/code&gt; is the honest measure of how far behind this venue's underlying data is. A producer can be perfectly healthy (recent &lt;code&gt;ts&lt;/code&gt;) yet show data that's 300 ms behind because the exchange itself is slow under load. That's a different failure mode, and the frontend needs to surface it differently. Not "Bybit is down" but "Bybit is lagging."&lt;/p&gt;
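
&lt;p&gt;One way to keep the two failure modes apart is a small classifier that reads each field for its own job. This is a sketch, not the production code. &lt;code&gt;STALE_THRESHOLD_S&lt;/code&gt; matches the stale gate above, while &lt;code&gt;LAG_THRESHOLD_MS&lt;/code&gt;, the function name, and the &lt;code&gt;"lagging"&lt;/code&gt; and &lt;code&gt;"ok"&lt;/code&gt; labels are assumptions for illustration:&lt;/p&gt;

```python
from typing import Optional

STALE_THRESHOLD_S = 60    # same threshold as the stale gate
LAG_THRESHOLD_MS = 250    # hypothetical cutoff; tune per venue


def source_status(now_ms: int, ts: Optional[int], event_ts: Optional[int]) -> str:
    """Classify one source: ts decides liveness, event_ts decides lag."""
    # Liveness is decided by producer wall time only.
    if not ts or now_ms - ts > STALE_THRESHOLD_S * 1000:
        return "stale"
    # Lag is decided by exchange event time only.
    if event_ts is not None and now_ms - event_ts > LAG_THRESHOLD_MS:
        return "lagging"
    return "ok"
```

&lt;p&gt;A source with a fresh &lt;code&gt;ts&lt;/code&gt; and an old &lt;code&gt;event_ts&lt;/code&gt; comes back as &lt;code&gt;"lagging"&lt;/code&gt;, never &lt;code&gt;"stale"&lt;/code&gt;, which is the distinction the UI needs.&lt;/p&gt;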

&lt;h2&gt;
  
  
  cross_exchange_skew_ms as honesty metric
&lt;/h2&gt;

&lt;p&gt;Once each live venue has an &lt;code&gt;event_ts&lt;/code&gt;, the skew across them falls out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ok_event_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event_ts_per_exch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ok_exchanges&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;cross_exchange_skew_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ok_event_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cross_exchange_skew_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ok_event_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ok_event_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single integer. Spread between earliest and latest exchange-stamped event in the "consensus" snapshot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Binance event_ts = 12:00:00.100
Bybit   event_ts = 12:00:00.420
OKX     event_ts = 12:00:00.160

cross_exchange_skew_ms = 320
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That "consensus" snapshot spans almost a third of a second. Skew is a different signal from transport latency. It captures how far apart the venues' own clocks place their events, regardless of how fast your producers ran. If the exchanges themselves stamped events 320 ms apart, the snapshot is 320 ms wide.&lt;/p&gt;

&lt;p&gt;Practical thresholds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;100 ms     = normal
100–300 ms  = degraded
&amp;gt;300 ms     = unsafe for microstructure signals
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This metric does one important thing: &lt;strong&gt;it surfaces dishonesty in the consensus view&lt;/strong&gt;. A naive aggregator presents a unified snapshot as if it's a single instant. It isn't. By emitting &lt;code&gt;cross_exchange_skew_ms&lt;/code&gt; on every payload, every consumer picks its own policy. A 1-second chart can ignore 200 ms of skew. A spoof-detection feature has to discard the snapshot and wait. A "live consensus" UI can display the skew as a number. &lt;em&gt;"Consensus over a 47 ms window"&lt;/em&gt; is honest; hiding it isn't.&lt;/p&gt;
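
&lt;p&gt;The thresholds above translate into a few lines on the consumer side. A sketch under stated assumptions: the function name &lt;code&gt;skew_grade&lt;/code&gt; and the &lt;code&gt;"unknown"&lt;/code&gt; label for a missing skew are hypothetical, and the boundary values 100 and 300 are treated as degraded:&lt;/p&gt;

```python
from typing import Optional


def skew_grade(cross_exchange_skew_ms: Optional[int]) -> str:
    """Map the skew metric onto the practical thresholds."""
    if cross_exchange_skew_ms is None:
        # Fewer than two live venues reported an event_ts.
        return "unknown"
    if cross_exchange_skew_ms > 300:
        return "unsafe"
    if cross_exchange_skew_ms >= 100:
        return "degraded"
    return "normal"
```

&lt;p&gt;A chart consumer might accept anything but &lt;code&gt;"unknown"&lt;/code&gt;, while a microstructure consumer might require &lt;code&gt;"normal"&lt;/code&gt;. The point is that the policy lives with the consumer, not in the aggregator.&lt;/p&gt;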

&lt;p&gt;The principle generalizes: when a system can't be honest about precision, it should at least be honest about its imprecision.&lt;/p&gt;

&lt;h2&gt;
  
  
  When a venue's clock drifts
&lt;/h2&gt;

&lt;p&gt;The two-field rule also protects against a failure mode that took me embarrassingly long to notice: an exchange's clock can be wrong.&lt;/p&gt;

&lt;p&gt;Most of the time venue clocks are NTP-disciplined and accurate to a few ms. But under load, after a maintenance window, or around an NTP step, a venue's &lt;code&gt;event_ts&lt;/code&gt; can lurch forward (or backward) by hundreds of ms relative to wall time. The producer faithfully forwards the bad timestamp because that's its job.&lt;/p&gt;

&lt;p&gt;With one timestamp, you can't tell whether the producer is slow or the venue is, so you either drop the venue or accept a torn consensus snapshot. With two timestamps the failure is &lt;em&gt;visible&lt;/em&gt;: &lt;code&gt;event_age_ms&lt;/code&gt; goes negative (the venue claims data from the future), or it spikes asymmetrically vs other venues. The skew metric lights up, and you can downgrade &lt;em&gt;that venue specifically&lt;/em&gt;, not the whole pipeline.&lt;/p&gt;
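
&lt;p&gt;A rough sketch of that per-venue check, assuming the aggregator already has an &lt;code&gt;event_age_ms&lt;/code&gt; value for each live venue. The function name, the &lt;code&gt;DRIFT_TOLERANCE_MS&lt;/code&gt; constant, and the use of the median as the reference point are illustrative choices, not the method described in this post:&lt;/p&gt;

```python
from statistics import median
from typing import Dict, List

DRIFT_TOLERANCE_MS = 250  # hypothetical; distance from the median age that looks suspicious


def suspect_venues(event_age_ms: Dict[str, int]) -> List[str]:
    """Return venues whose event age is negative or far from the median age."""
    if not event_age_ms:
        return []
    mid = median(event_age_ms.values())
    flagged = []
    for venue, age in sorted(event_age_ms.items()):
        from_future = 0 > age  # negative age: the venue claims data from the future
        outlier = abs(age - mid) > DRIFT_TOLERANCE_MS
        if from_future or outlier:
            flagged.append(venue)
    return flagged
```

&lt;p&gt;With only two live venues the median sits halfway between them, so the outlier test is weak there. The negative-age test still works with a single venue.&lt;/p&gt;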

&lt;h2&gt;
  
  
  Beyond crypto: any multi-source pipeline
&lt;/h2&gt;

&lt;p&gt;The pattern isn't crypto-specific. Anywhere you aggregate real-time data from N independent sources, the same problem shows up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed logs: pod wall-clock time vs the event's actual occurrence (or the trace span's start).&lt;/li&gt;
&lt;li&gt;Sensor fusion: each sensor's local clock vs the moment your gateway received it.&lt;/li&gt;
&lt;li&gt;IoT telemetry: device clock (often horribly skewed) vs gateway ingestion time.&lt;/li&gt;
&lt;li&gt;Cross-region replication: source DB commit time vs replica apply time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same three temptations. Same three wrong answers. Same fix: carry both timestamps to the joining layer. Ingestion time for liveness and TTL. Source time for alignment. Spread between sources as a first-class metric so consumers decide what to trust.&lt;/p&gt;

&lt;p&gt;This is also the framing Apache Flink and Apache Beam ship with: &lt;em&gt;event time&lt;/em&gt; vs &lt;em&gt;processing time&lt;/em&gt;, with watermarks to track how far event time has progressed. Most ad-hoc real-time systems converge to the same dual-timestamp model eventually. You can skip the eventually.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past me
&lt;/h2&gt;

&lt;p&gt;Two things.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;In the very first payload schema, write both &lt;code&gt;ts&lt;/code&gt; and &lt;code&gt;event_ts&lt;/code&gt;. Migrating later means rewriting every producer, the aggregator, and every consumer. Adding it on day one is two extra lines per producer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Emit the skew metric on every aggregated payload, even when it's "always low". The day skew matters, you'll wish you had it on the wire.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The single-&lt;code&gt;timestamp&lt;/code&gt; field is one of those defaults that looks fine until it doesn't.&lt;/p&gt;

&lt;p&gt;One timestamp tells you when you saw the data.&lt;/p&gt;

&lt;p&gt;Two timestamps tell you when the market happened.&lt;/p&gt;

&lt;p&gt;Carry both.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>dataengineering</category>
      <category>webdev</category>
      <category>cryptocurrency</category>
    </item>
  </channel>
</rss>
