<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Zhdankov</title>
    <description>The latest articles on Forem by Alex Zhdankov (@alex_zhdankov).</description>
    <link>https://forem.com/alex_zhdankov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3929959%2Fb8e74cb6-a829-45e7-820e-5ae2678fb69a.png</url>
      <title>Forem: Alex Zhdankov</title>
      <link>https://forem.com/alex_zhdankov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alex_zhdankov"/>
    <language>en</language>
    <item>
      <title>Why your SSH scripts will fail in production</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Mon, 18 May 2026 15:41:40 +0000</pubDate>
      <link>https://forem.com/alex_zhdankov/why-your-ssh-scripts-will-fail-in-production-4cb8</link>
      <guid>https://forem.com/alex_zhdankov/why-your-ssh-scripts-will-fail-in-production-4cb8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Remote command execution looks trivial — until unstable networks, retries, long-running commands, and half-open connections turn it into a reliability problem.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We use Paramiko with a thin supervision layer on top.&lt;br&gt;
The same operational problems apply to AsyncSSH, Fabric, or plain OpenSSH subprocesses.&lt;/p&gt;

&lt;p&gt;At first, the implementation looked completely straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;paramiko&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SSHClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;systemctl restart postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In development, this worked perfectly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then production happened&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of hosts.&lt;/li&gt;
&lt;li&gt;Unstable networks.&lt;/li&gt;
&lt;li&gt;Long-running commands.&lt;/li&gt;
&lt;li&gt;Frozen sessions.&lt;/li&gt;
&lt;li&gt;Half-open connections.&lt;/li&gt;
&lt;li&gt;Retries.&lt;/li&gt;
&lt;li&gt;Partial execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point this stopped being &lt;em&gt;“SSH scripting”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It became a distributed systems problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  SSH is deceptively simple
&lt;/h2&gt;

&lt;p&gt;Most developers intuitively model SSH like this:&lt;br&gt;
&lt;code&gt;local subprocess, but remote&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But production SSH execution is actually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;network transport
+ stateful session
+ interactive channel
+ remote process lifecycle
+ unreliable infrastructure
+ partial execution visibility
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And failures can happen independently at every layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application
    ↓
SSH Client
    ↓
TCP transport        ← packets can vanish
    ↓
SSH session          ← can hang without closing
    ↓
Remote shell         ← can ignore commands
    ↓
Process execution    ← may continue after disconnect
    ↓
stdout/stderr        ← can block forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This distinction changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure mode #1 — execution uncertainty
&lt;/h2&gt;

&lt;p&gt;This was the first major production lesson.&lt;/p&gt;

&lt;p&gt;If the SSH transport dies, you do not know whether the command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;succeeded&lt;/li&gt;
&lt;li&gt;failed&lt;/li&gt;
&lt;li&gt;partially executed&lt;/li&gt;
&lt;li&gt;is still running remotely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That uncertainty completely changes retry semantics.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the connection drops immediately after sending the command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did restart begin?&lt;/li&gt;
&lt;li&gt;is postgres still restarting?&lt;/li&gt;
&lt;li&gt;did it already succeed?&lt;/li&gt;
&lt;li&gt;is the service now dead?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You no longer have execution certainty.&lt;/p&gt;

&lt;p&gt;This is not a &lt;em&gt;“Paramiko problem”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is a distributed systems problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry is dangerous
&lt;/h2&gt;

&lt;p&gt;Retries sound harmless until commands become stateful.&lt;/p&gt;

&lt;p&gt;Some operations are naturally &lt;em&gt;idempotent&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/meminfo
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /etc
systemctl status postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Others change state, so a blind retry can fail or repeat a side effect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;useradd deploy
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /some/path
systemctl restart postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A failed transport does not imply failed execution.&lt;/p&gt;

&lt;p&gt;That means naive retry logic can create destructive side effects.&lt;/p&gt;

&lt;p&gt;This forced us to separate failures into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transport uncertainty&lt;/li&gt;
&lt;li&gt;command failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are fundamentally different operational states.&lt;/p&gt;
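
&lt;p&gt;A minimal sketch of that split, under stated assumptions: &lt;code&gt;Outcome&lt;/code&gt;, &lt;code&gt;Result&lt;/code&gt;, &lt;code&gt;classify&lt;/code&gt; and &lt;code&gt;may_retry&lt;/code&gt; are illustrative names, not part of Paramiko, and the retry policy shown is one possible choice rather than the only reasonable one.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"            # exit status 0 was received
    COMMAND_FAILED = "command_failed"  # a non-zero exit status was received
    UNKNOWN = "unknown"                # transport died before any exit status


@dataclass
class Result:
    exit_status: object = None      # int when the remote side reported one
    transport_error: object = None  # exception raised by the SSH layer, if any


def classify(result):
    """Map a raw result onto one of the three operational states."""
    if result.exit_status is None:
        # No exit status: the command may have run, partially run, or never started.
        return Outcome.UNKNOWN
    if result.exit_status == 0:
        return Outcome.SUCCEEDED
    return Outcome.COMMAND_FAILED


def may_retry(outcome, idempotent):
    """Allow a retry only when repeating the command cannot add a side effect."""
    if outcome is Outcome.UNKNOWN:
        return idempotent  # uncertain execution: safe only for idempotent commands
    # SUCCEEDED needs no retry; COMMAND_FAILED ran and reported failure,
    # so this sketch surfaces it to the caller instead of rerunning it.
    return False
```

&lt;p&gt;With this shape, a dropped connection during &lt;code&gt;systemctl status postgres&lt;/code&gt; can be retried, while the same drop during &lt;code&gt;useradd deploy&lt;/code&gt; has to be resolved by inspecting the host first.&lt;/p&gt;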

&lt;h2&gt;
  
  
  Timeouts are not one thing
&lt;/h2&gt;

&lt;p&gt;One of the most common mistakes in SSH automation is treating timeout as a single concept.&lt;/p&gt;

&lt;p&gt;Production systems usually need several independent timeout layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TCP connect timeout&lt;/li&gt;
&lt;li&gt;SSH handshake timeout&lt;/li&gt;
&lt;li&gt;authentication timeout&lt;/li&gt;
&lt;li&gt;command execution timeout&lt;/li&gt;
&lt;li&gt;idle/read timeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each failure means something different operationally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;banner_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But even that is insufficient.&lt;/p&gt;

&lt;p&gt;A command may still hang forever while the socket technically remains alive.&lt;/p&gt;

&lt;p&gt;That distinction matters a lot in production.&lt;/p&gt;
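
&lt;p&gt;The remaining two layers, command execution timeout and idle timeout, have to be enforced by the caller. The sketch below is one way to do that, assuming a Paramiko-style channel that exposes &lt;code&gt;recv_ready()&lt;/code&gt;, &lt;code&gt;recv()&lt;/code&gt;, &lt;code&gt;exit_status_ready()&lt;/code&gt;, &lt;code&gt;recv_exit_status()&lt;/code&gt; and &lt;code&gt;close()&lt;/code&gt;. The names &lt;code&gt;supervise&lt;/code&gt;, &lt;code&gt;expired&lt;/code&gt;, &lt;code&gt;CommandTimeout&lt;/code&gt; and &lt;code&gt;IdleTimeout&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import time


class CommandTimeout(Exception):
    """The command exceeded its total execution budget."""


class IdleTimeout(Exception):
    """The command produced no output for too long."""


def expired(deadline, now):
    """True once `now` has reached `deadline`."""
    return min(deadline, now) == deadline


def supervise(channel, on_data, command_timeout, idle_timeout,
              clock=time.monotonic, poll=0.05):
    """Stream channel output under two independent deadlines."""
    command_deadline = clock() + command_timeout
    idle_deadline = clock() + idle_timeout
    try:
        while True:
            now = clock()
            if expired(command_deadline, now):
                raise CommandTimeout(command_timeout)
            if channel.recv_ready():
                on_data(channel.recv(4096))
                idle_deadline = now + idle_timeout  # output resets the idle clock
                continue
            if channel.exit_status_ready():
                while channel.recv_ready():  # drain output that raced the exit
                    on_data(channel.recv(4096))
                return channel.recv_exit_status()
            if expired(idle_deadline, now):
                raise IdleTimeout(idle_timeout)
            time.sleep(poll)  # nothing to do yet: avoid a busy loop
    finally:
        # Closes the local side only; the remote process may keep running.
        channel.close()
```

&lt;p&gt;Either timeout leaves the command in the uncertain state described earlier: the channel is closed locally, but nothing here proves the remote process stopped.&lt;/p&gt;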

&lt;h2&gt;
  
  
  Half-open connections are nasty
&lt;/h2&gt;

&lt;p&gt;This became one of the hardest reliability problems.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TCP stays alive&lt;/li&gt;
&lt;li&gt;SSH transport stays alive&lt;/li&gt;
&lt;li&gt;but the remote process is effectively dead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;packets silently disappear&lt;/li&gt;
&lt;li&gt;the remote kernel freezes&lt;/li&gt;
&lt;li&gt;stdout stops forever&lt;/li&gt;
&lt;li&gt;but the socket never closes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the application perspective:&lt;br&gt;
&lt;code&gt;everything looks connected&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;while the operation is permanently stalled.&lt;/p&gt;

&lt;p&gt;This is the classic half-open connection problem.&lt;/p&gt;
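
&lt;p&gt;One partial mitigation is SSH-level keepalives. Paramiko's &lt;code&gt;Transport.set_keepalive(interval)&lt;/code&gt; sends a small packet after &lt;code&gt;interval&lt;/code&gt; seconds without outgoing traffic, which gives the TCP stack something that can fail on a dead path. The helper below is a sketch: &lt;code&gt;enable_keepalive&lt;/code&gt; is a hypothetical name and the 30-second default is an assumption.&lt;/p&gt;

```python
def enable_keepalive(client, interval=30):
    """Turn on SSH keepalives for a connected Paramiko-style client."""
    transport = client.get_transport()
    if transport is None or not transport.is_active():
        raise RuntimeError("client has no active transport")
    # Send a keepalive packet after `interval` seconds without other traffic.
    transport.set_keepalive(interval)
    return transport
```

&lt;p&gt;Keepalives only exercise the network path. A remote process that hangs while the connection stays healthy still needs an application-level idle timeout.&lt;/p&gt;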
&lt;h2&gt;
  
  
  Blocking reads break automation
&lt;/h2&gt;

&lt;p&gt;This code looks innocent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But under real workloads it becomes dangerous.&lt;/p&gt;

&lt;p&gt;If:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the command hangs&lt;/li&gt;
&lt;li&gt;stdout stops producing data&lt;/li&gt;
&lt;li&gt;the socket remains alive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then:&lt;br&gt;
&lt;code&gt;the thread blocks forever&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We eventually moved to &lt;em&gt;streaming execution&lt;/em&gt; instead of buffered reads.&lt;/p&gt;
&lt;h2&gt;
  
  
  Streaming changes the execution model
&lt;/h2&gt;

&lt;p&gt;Long-running commands fundamentally change how remote execution must be handled.&lt;/p&gt;

&lt;p&gt;Operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pg_dump&lt;/li&gt;
&lt;li&gt;VACUUM&lt;/li&gt;
&lt;li&gt;package upgrades&lt;/li&gt;
&lt;li&gt;log exports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can run for minutes or hours.&lt;/p&gt;

&lt;p&gt;Buffering all output in memory is unreliable.&lt;br&gt;
Blocking until completion destroys &lt;em&gt;observability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Instead we switched to chunked streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
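&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit_status_ready&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv_ready&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# idle: do not spin a CPU core&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv_ready&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  &lt;span class="c1"&gt;# drain output buffered before exit&lt;/span&gt;
    &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;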
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solved several production problems simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;realtime progress visibility&lt;/li&gt;
&lt;li&gt;lower memory usage&lt;/li&gt;
&lt;li&gt;cancellation support&lt;/li&gt;
&lt;li&gt;dead session detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming ended up being much more operationally stable than buffered execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security becomes infrastructure, not validation
&lt;/h2&gt;

&lt;p&gt;Another important lesson:&lt;/p&gt;

&lt;p&gt;SSH automation is remote code execution infrastructure.&lt;/p&gt;

&lt;p&gt;That means command construction rules matter enormously.&lt;/p&gt;

&lt;p&gt;This is catastrophic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rm -rf &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because eventually someone passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/home/user; rm -rf /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We ended up treating all remote commands as infrastructure-sensitive operations.&lt;/p&gt;

&lt;p&gt;Input validation alone was insufficient.&lt;/p&gt;

&lt;p&gt;Every dynamic argument had to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validated&lt;/li&gt;
&lt;li&gt;escaped&lt;/li&gt;
&lt;li&gt;constrained
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;safe_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shlex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even simple automation eventually becomes security-critical.&lt;/p&gt;
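
&lt;p&gt;The three rules can be combined in one small builder. This is a sketch, not a complete policy: &lt;code&gt;build_rm_command&lt;/code&gt;, &lt;code&gt;ALLOWED_PREFIX&lt;/code&gt; and &lt;code&gt;SAFE_PATH&lt;/code&gt; are hypothetical, and the allowed directory is an invented example.&lt;/p&gt;

```python
import re
import shlex

# Hypothetical policy: only paths under this prefix may be removed.
ALLOWED_PREFIX = "/srv/app/tmp/"
# Conservative character allowlist: letters, digits, dot, underscore, slash, dash.
SAFE_PATH = re.compile(r"[A-Za-z0-9._/-]+")


def build_rm_command(user_input):
    """Validate, constrain, then escape a path before it reaches a remote shell."""
    if not SAFE_PATH.fullmatch(user_input):
        raise ValueError("path contains characters outside the allowlist")
    if ".." in user_input.split("/"):
        raise ValueError("path traversal is not allowed")
    if not user_input.startswith(ALLOWED_PREFIX):
        raise ValueError("path is outside the allowed directory")
    # `--` ends option parsing, so a leading dash cannot be read as a flag.
    return "rm -rf -- " + shlex.quote(user_input)
```

&lt;p&gt;Quoting stays in place even though validation already rejects shell metacharacters, so a later loosening of the allowlist does not silently reopen the injection path.&lt;/p&gt;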

&lt;h2&gt;
  
  
  Resource cleanup matters more than expected
&lt;/h2&gt;

&lt;p&gt;SSH resources leak surprisingly easily.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Channels.&lt;/li&gt;
&lt;li&gt;Sockets.&lt;/li&gt;
&lt;li&gt;Transports.&lt;/li&gt;
&lt;li&gt;PTY buffers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under load, forgotten cleanup accumulates fast.&lt;/p&gt;

&lt;p&gt;We eventually standardized all operations around explicit lifecycle management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;ssh_operation&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ssh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ssh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part was not aesthetics.&lt;/p&gt;

&lt;p&gt;It was guaranteeing cleanup under:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exceptions&lt;/li&gt;
&lt;li&gt;timeouts&lt;/li&gt;
&lt;li&gt;partial failures&lt;/li&gt;
&lt;li&gt;interrupted execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production automation lives or dies on &lt;em&gt;cleanup guarantees&lt;/em&gt;.&lt;/p&gt;
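
&lt;p&gt;One way such a wrapper could look, as a sketch under assumptions: the real &lt;code&gt;ssh_operation&lt;/code&gt; is not shown in this article, so the version below takes a hypothetical &lt;code&gt;connect&lt;/code&gt; factory that returns a connected, Paramiko-style client with a &lt;code&gt;close()&lt;/code&gt; method.&lt;/p&gt;

```python
from contextlib import contextmanager


@contextmanager
def ssh_operation(connect):
    """Yield a connected client and always close it afterwards.

    `connect` is a hypothetical zero-argument factory returning a connected
    client; anything with a close() method works for this sketch.
    """
    client = connect()
    try:
        yield client
    finally:
        # Runs on success, on any exception, and on KeyboardInterrupt.
        client.close()
```

&lt;p&gt;The &lt;code&gt;finally&lt;/code&gt; block is what turns cleanup from a convention into a guarantee for every exit path of the &lt;code&gt;with&lt;/code&gt; body.&lt;/p&gt;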

&lt;h2&gt;
  
  
  The architecture we ended up with
&lt;/h2&gt;

&lt;p&gt;Over time the system evolved into several independent layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connection management
    ↓
Retry classification
    ↓
Execution supervision
    ↓
Streaming transport
    ↓
Resource cleanup
    ↓
Observability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important realization was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;remote execution is not a helper function&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final insight
&lt;/h2&gt;

&lt;p&gt;The happy path is trivial.&lt;/p&gt;

&lt;p&gt;Production architecture begins where execution certainty ends.&lt;/p&gt;

&lt;p&gt;SSH automation fails when treated like scripting.&lt;/p&gt;

&lt;p&gt;Because it is not scripting.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;remote process orchestration&lt;/li&gt;
&lt;li&gt;over unreliable transport&lt;/li&gt;
&lt;li&gt;with partial execution visibility&lt;/li&gt;
&lt;li&gt;inside a distributed system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And once you accept that,&lt;br&gt;
the architecture changes completely.&lt;/p&gt;

</description>
      <category>ssh</category>
      <category>python</category>
      <category>distributedsystems</category>
      <category>devops</category>
    </item>
    <item>
      <title>We built a real psql terminal in the browser. Here’s what made it unexpectedly hard.</title>
      <dc:creator>Alex Zhdankov</dc:creator>
      <pubDate>Wed, 13 May 2026 20:53:07 +0000</pubDate>
      <link>https://forem.com/alex_zhdankov/we-built-a-real-psql-terminal-in-the-browser-heres-what-made-it-unexpectedly-hard-57a1</link>
      <guid>https://forem.com/alex_zhdankov/we-built-a-real-psql-terminal-in-the-browser-heres-what-made-it-unexpectedly-hard-57a1</guid>
      <description>&lt;p&gt;&lt;strong&gt;A PTY-backed PostgreSQL console running in the browser using reverse WebSockets, Redis Streams, and xterm.js — designed around centralized control-plane constraints and production failure modes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We needed a real PostgreSQL terminal inside the browser.&lt;/p&gt;

&lt;p&gt;Not a SQL editor.&lt;br&gt;
Not a query API.&lt;br&gt;
A real psql session with full terminal semantics.&lt;/p&gt;

&lt;p&gt;That requirement immediately forced several architectural constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a real PTY&lt;/li&gt;
&lt;li&gt;a long-lived stateful process&lt;/li&gt;
&lt;li&gt;bidirectional streaming&lt;/li&gt;
&lt;li&gt;terminal resize handling&lt;/li&gt;
&lt;li&gt;signal forwarding (&lt;code&gt;Ctrl+C&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;native &lt;code&gt;psql&lt;/code&gt; behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then the infrastructure constraints made things significantly more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agents live in internal networks&lt;/li&gt;
&lt;li&gt;all traffic must go through the Control Plane&lt;/li&gt;
&lt;li&gt;xterm.js needs a bidirectional transport, which in practice meant WebSockets&lt;/li&gt;
&lt;li&gt;we could not emulate &lt;code&gt;psql&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, this stopped being a “web feature”.&lt;br&gt;
It became a distributed terminal runtime problem.&lt;/p&gt;
&lt;h2&gt;
  
  
  High-level architecture
&lt;/h2&gt;

&lt;p&gt;This system only makes sense if you read it as a &lt;em&gt;dataflow graph&lt;/em&gt;, not as isolated services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (xterm.js)
    │
    │ WebSocket (terminal I/O)
    ▼
Control Plane
    │
    │ session management + auth
    ▼
Redis Streams (output buffer)
    │
    │ coordination + async delivery
    ▼
Agent WebSocket channel
    │
    │ PTY stdin/stdout bridge
    ▼
PTY → real psql process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical architectural decision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the browser never connects to the agent directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Control Plane is the only public entrypoint in the entire system.&lt;br&gt;
Everything flows through it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why the architecture looks “backwards”
&lt;/h2&gt;

&lt;p&gt;The surprising part is that the agent initiates the terminal transport.&lt;/p&gt;

&lt;p&gt;Not because NAT traversal was impossible.&lt;/p&gt;

&lt;p&gt;But because the system was intentionally designed around a centralized Control Plane.&lt;/p&gt;

&lt;p&gt;Agents sit in internal networks.&lt;br&gt;
The browser has no direct visibility into them.&lt;/p&gt;

&lt;p&gt;So instead of:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Browser → Agent&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;the architecture becomes:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Browser → Control Plane ← Agent&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The Control Plane acts as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;session coordinator&lt;/li&gt;
&lt;li&gt;auth boundary&lt;/li&gt;
&lt;li&gt;transport router&lt;/li&gt;
&lt;li&gt;lifecycle owner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once that decision is made, reverse WebSockets become the natural transport model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Session establishment
&lt;/h2&gt;

&lt;p&gt;The session lifecycle happens in multiple stages.&lt;/p&gt;

&lt;p&gt;Importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the PTY process does not exist when the browser first connects&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Only a logical session exists.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1 — Browser creates a logical session
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser
  │
  │ WebSocket connect
  ▼
Control Plane
  ├── creates session_id
  ├── registers browser handler
  └── starts auth timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At this point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no PTY exists&lt;/li&gt;
&lt;li&gt;no psql exists&lt;/li&gt;
&lt;li&gt;no database connection exists&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Control Plane only knows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“a browser wants a terminal session”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 2 — Control Plane signals the agent
&lt;/h2&gt;

&lt;p&gt;The Control Plane sends a lightweight HTTP request:&lt;br&gt;
&lt;code&gt;POST /terminal?session_id=&amp;lt;uuid&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is intentionally the only HTTP hop in the entire terminal lifecycle.&lt;/p&gt;

&lt;p&gt;The request does not carry terminal traffic.&lt;/p&gt;

&lt;p&gt;It only means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“establish terminal transport for this session”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 3 — Agent opens reverse WebSocket
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
  │
  │ outbound WebSocket
  ▼
Control Plane
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now the system has two independent transport channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser WS → Control Plane&lt;/li&gt;
&lt;li&gt;Agent WS   → Control Plane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they are still disconnected.&lt;/p&gt;

&lt;p&gt;The system is in a &lt;em&gt;half-connected state&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Session stitching
&lt;/h2&gt;

&lt;p&gt;This is the moment where the architecture becomes interesting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser Handler ───────┐
                       ├── session binding
Agent Handler ─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the Control Plane stops being a transport endpoint and becomes a message router&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It now forwards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser input → agent&lt;/li&gt;
&lt;li&gt;agent output → browser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But critically:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;not directly&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All terminal output passes through an asynchronous buffering layer.&lt;/p&gt;

&lt;p&gt;That layer ended up being one of the most important production decisions in the system.&lt;/p&gt;
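
&lt;p&gt;The binding step itself can be pictured as a small registry keyed by &lt;code&gt;session_id&lt;/code&gt;. This is an in-memory sketch with hypothetical names (&lt;code&gt;SessionRegistry&lt;/code&gt; and its methods); it leaves out auth timeouts, locking and the buffering layer.&lt;/p&gt;

```python
class SessionRegistry:
    """Pairs a browser handler with an agent handler under one session_id."""

    def __init__(self):
        self._sessions = {}

    def register_browser(self, session_id, browser):
        # Step 1: only a logical session exists, no agent yet.
        self._sessions[session_id] = {"browser": browser, "agent": None}

    def attach_agent(self, session_id, agent):
        # Step 3: the agent's reverse WebSocket arrives with the same session_id.
        session = self._sessions.get(session_id)
        if session is None:
            raise KeyError("unknown or expired session_id")
        if session["agent"] is not None:
            raise RuntimeError("session already has an agent attached")
        session["agent"] = agent
        return session["browser"]

    def is_bound(self, session_id):
        """True once both sides of the session are present."""
        session = self._sessions.get(session_id)
        return session is not None and session["agent"] is not None

    def close(self, session_id):
        """Forget the session and return its handlers for teardown."""
        return self._sessions.pop(session_id, None)
```

&lt;p&gt;Rejecting an unknown &lt;code&gt;session_id&lt;/code&gt; matters here: the agent connection is only accepted when a browser has already asked for that session.&lt;/p&gt;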

&lt;h2&gt;
  
  
  PTY process creation
&lt;/h2&gt;

&lt;p&gt;Once the session is fully initialized, the agent forks a real PTY:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pty&lt;/span&gt;

&lt;span class="n"&gt;child_pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pty&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;child_pid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Child: exec replaces this process, so psql itself owns the PTY.&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execvp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;psql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-U&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbname&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point the architecture fundamentally changes.&lt;/p&gt;

&lt;p&gt;This is no longer “web infrastructure”.&lt;/p&gt;

&lt;p&gt;It becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY supervision&lt;/li&gt;
&lt;li&gt;file descriptor management&lt;/li&gt;
&lt;li&gt;process lifecycle handling&lt;/li&gt;
&lt;li&gt;signal propagation&lt;/li&gt;
&lt;li&gt;backpressure management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most complexity appeared after this step.&lt;/p&gt;

&lt;p&gt;Not before it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real data pipeline
&lt;/h2&gt;

&lt;p&gt;This is the most important flow in the system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser
  │
  │ keystroke
  ▼
Control Plane
  │
  ▼
Agent WS handler
  │
  │ write(fd)
  ▼
PTY → psql
  │
  │ stdout
  ▼
PTY reader thread
  │
  │ Redis XADD
  ▼
Redis Streams
  │
  │ async consumer
  ▼
Control Plane
  │
  │ WS push
  ▼
Browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important line in the entire architecture is this:&lt;br&gt;
&lt;code&gt;PTY reader → Redis XADD → async consumer → WebSocket&lt;/code&gt;&lt;br&gt;
That line is the system’s stability boundary.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Redis Streams became mandatory
&lt;/h2&gt;

&lt;p&gt;The original implementation directly forwarded PTY output into WebSocket writes:&lt;br&gt;
&lt;code&gt;PTY → WebSocket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It worked in development.&lt;/p&gt;

&lt;p&gt;It failed in production.&lt;br&gt;
The issue was subtle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY reads are synchronous&lt;/li&gt;
&lt;li&gt;WebSocket writes can block&lt;/li&gt;
&lt;li&gt;backpressure propagates backwards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resulting failure mode was catastrophic for terminal UX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slow network
    ↓
blocked WS writes
    ↓
frozen PTY reader
    ↓
terminal stalls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal looked dead while psql was still running underneath.&lt;/p&gt;

&lt;p&gt;Redis Streams solved this by introducing a decoupling boundary.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the PTY reader never waits on a WebSocket write&lt;/li&gt;
&lt;li&gt;network latency becomes isolated&lt;/li&gt;
&lt;li&gt;consumers can temporarily lag&lt;/li&gt;
&lt;li&gt;output survives reconnects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The additional latency was negligible.&lt;/p&gt;

&lt;p&gt;The operational stability improvement was enormous.&lt;/p&gt;
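
&lt;p&gt;The decoupling idea can be shown without Redis. In the sketch below an in-process &lt;code&gt;queue.Queue&lt;/code&gt; stands in for the stream, so this illustrates the boundary, not the production setup: the reader only appends, and a separate consumer drains at whatever pace the network allows. &lt;code&gt;pty_reader&lt;/code&gt; and &lt;code&gt;consumer&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import os
import queue
import threading


def pty_reader(fd, buffer):
    """Read from the PTY and append chunks to the buffer; never touch the network."""
    while True:
        try:
            chunk = os.read(fd, 4096)
        except OSError:
            chunk = b""  # on Linux a closed PTY surfaces as an I/O error
        buffer.put(chunk)  # unbounded queue: put() does not wait for the consumer
        if not chunk:      # empty chunk marks end of stream
            return


def consumer(buffer, send):
    """Drain the buffer and push each chunk to the client via `send`."""
    while True:
        chunk = buffer.get()
        if not chunk:
            return
        send(chunk)  # may be slow; only this loop is affected


def start_reader(fd, buffer):
    """Run the reader on its own thread so it is independent of delivery."""
    thread = threading.Thread(target=pty_reader, args=(fd, buffer), daemon=True)
    thread.start()
    return thread
```

&lt;p&gt;The queue here is unbounded, which is exactly why the real system needs the memory and retention limits listed under Redis failure below. With Redis Streams the same role is played by capping the stream length when appending.&lt;/p&gt;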

&lt;h2&gt;
  
  
  The architecture is actually two independent loops
&lt;/h2&gt;

&lt;p&gt;This is the part most terminal architectures hide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;Browser → Control Plane → Agent → PTY&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;PTY → Redis → Control Plane → Browser&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;These loops are intentionally independent.&lt;/p&gt;

&lt;p&gt;That separation is what allows the system to survive partial failures.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why we split browser and agent handlers
&lt;/h2&gt;

&lt;p&gt;We intentionally kept browser-facing and agent-facing handlers separate.&lt;/p&gt;

&lt;p&gt;Because they solve fundamentally different problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser Handler&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;auth&lt;/li&gt;
&lt;li&gt;user session ownership&lt;/li&gt;
&lt;li&gt;browser disconnect semantics&lt;/li&gt;
&lt;li&gt;user errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agent Handler&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PTY lifecycle&lt;/li&gt;
&lt;li&gt;process supervision&lt;/li&gt;
&lt;li&gt;reconnect semantics&lt;/li&gt;
&lt;li&gt;infrastructure errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trying to merge them created tightly coupled failure modes and significantly more lifecycle complexity.&lt;/p&gt;

&lt;p&gt;Separating them made the system dramatically easier to reason about.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure modes that mattered in production
&lt;/h2&gt;

&lt;p&gt;The hardest problems were not PostgreSQL problems.&lt;/p&gt;

&lt;p&gt;They were long-lived process problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A. Redis failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;output pipeline breaks&lt;/li&gt;
&lt;li&gt;PTY continues running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory limits&lt;/li&gt;
&lt;li&gt;retention limits&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;bounded stream lifetime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;B. Agent disconnect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transport disappears&lt;/li&gt;
&lt;li&gt;PTY may still be alive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reconnect window&lt;/li&gt;
&lt;li&gt;session reattachment&lt;/li&gt;
&lt;li&gt;delayed teardown&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;C. Process explosion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory exhaustion&lt;/li&gt;
&lt;li&gt;PostgreSQL connection storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mitigation:&lt;br&gt;
&lt;code&gt;BoundedSemaphore(10)&lt;/code&gt;, a hard cap of 10 concurrent sessions.&lt;br&gt;
This was one of the simplest and most effective safeguards in the system.&lt;/p&gt;
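
&lt;p&gt;A minimal sketch of such a session cap, assuming the thread-based &lt;code&gt;threading.BoundedSemaphore&lt;/code&gt; from the standard library. The names &lt;code&gt;session_slot&lt;/code&gt; and &lt;code&gt;TooManySessions&lt;/code&gt; are hypothetical, and rejecting immediately instead of queueing is a design assumption:&lt;/p&gt;

```python
import threading
from contextlib import contextmanager

MAX_SESSIONS = 10
_session_slots = threading.BoundedSemaphore(MAX_SESSIONS)


class TooManySessions(RuntimeError):
    """Raised when the session cap has been reached."""


@contextmanager
def session_slot():
    # Non-blocking acquire: reject new sessions instead of queueing them,
    # so a burst of requests cannot pile up behind the cap.
    if not _session_slots.acquire(blocking=False):
        raise TooManySessions(f"limit of {MAX_SESSIONS} sessions reached")
    try:
        yield
    finally:
        # Always return the slot, even if the session crashed.
        _session_slots.release()
```

&lt;p&gt;A session handler would wrap its whole lifetime in &lt;code&gt;with session_slot():&lt;/code&gt;, so the slot is held for as long as the PTY and its database connection exist.&lt;/p&gt;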

&lt;p&gt;&lt;strong&gt;D. xterm resize storms&lt;/strong&gt;&lt;br&gt;
xterm.js emits resize events aggressively during browser resizing.&lt;/p&gt;

&lt;p&gt;Impact:&lt;br&gt;
Each resize triggers an &lt;code&gt;ioctl(TIOCSWINSZ)&lt;/code&gt; call on the PTY. Without throttling, the PTY spent significant time processing resize events instead of actual terminal traffic.&lt;/p&gt;

&lt;p&gt;Mitigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple debounce logic on resize events completely fixed the issue.&lt;/li&gt;
&lt;/ul&gt;
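
&lt;p&gt;A debounce of this kind could look roughly like the sketch below. It is an illustration rather than the production code: &lt;code&gt;ResizeDebouncer&lt;/code&gt;, &lt;code&gt;set_winsize&lt;/code&gt;, and the 100 ms delay are assumptions. Only the last size requested within the quiet period reaches the PTY.&lt;/p&gt;

```python
import fcntl
import struct
import termios
import threading
from typing import Callable, Optional


def set_winsize(fd: int, rows: int, cols: int) -> None:
    # struct winsize: rows, cols, xpixel, ypixel (pixel sizes unused here).
    fcntl.ioctl(fd, termios.TIOCSWINSZ, struct.pack("HHHH", rows, cols, 0, 0))


class ResizeDebouncer:
    """Applies only the most recent resize after a quiet period."""

    def __init__(self, apply: Callable[[int, int], None], delay: float = 0.1):
        self._apply = apply
        self._delay = delay
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def resize(self, rows: int, cols: int) -> None:
        with self._lock:
            # Each new event cancels the pending one and restarts the timer.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(
                self._delay, self._apply, args=(rows, cols)
            )
            self._timer.daemon = True
            self._timer.start()
```

&lt;p&gt;For a real PTY, the callback would be something like &lt;code&gt;lambda rows, cols: set_winsize(master_fd, rows, cols)&lt;/code&gt;, where &lt;code&gt;master_fd&lt;/code&gt; is the PTY master file descriptor.&lt;/p&gt;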
&lt;h2&gt;
  
  
  Scaling reality
&lt;/h2&gt;

&lt;p&gt;The system does not scale like a normal WebSocket service.&lt;/p&gt;

&lt;p&gt;Each session includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a real &lt;code&gt;psql&lt;/code&gt; process&lt;/li&gt;
&lt;li&gt;a PTY&lt;/li&gt;
&lt;li&gt;multiple threads&lt;/li&gt;
&lt;li&gt;Redis streams&lt;/li&gt;
&lt;li&gt;two WebSocket channels&lt;/li&gt;
&lt;li&gt;a database connection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scaling bottleneck is not Redis.&lt;/p&gt;

&lt;p&gt;It is not CPU.&lt;/p&gt;

&lt;p&gt;It is not WebSockets.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how many real PostgreSQL sessions the infrastructure can sustain&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Why HTTP and SSE were rejected
&lt;/h2&gt;

&lt;p&gt;We evaluated both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;br&gt;
Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stateless&lt;/li&gt;
&lt;li&gt;no streaming terminal semantics&lt;/li&gt;
&lt;li&gt;no signal handling&lt;/li&gt;
&lt;li&gt;no persistent shell state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SSE&lt;/strong&gt;&lt;br&gt;
Failed because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one-directional transport&lt;/li&gt;
&lt;li&gt;incompatible with terminal interaction patterns&lt;/li&gt;
&lt;li&gt;xterm.js expects bidirectional communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, terminals naturally map onto WebSockets.&lt;/p&gt;

&lt;p&gt;Trying to avoid that only complicates the architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  What this system actually is
&lt;/h2&gt;

&lt;p&gt;If you remove all abstractions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this is a distributed process supervisor for a PTY running &lt;code&gt;psql&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Everything else is transport, routing, buffering, and failure handling around that core idea.&lt;/p&gt;
&lt;h2&gt;
  
  
  Final architecture insight
&lt;/h2&gt;

&lt;p&gt;The system is ultimately defined by three separations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection separation&lt;/strong&gt;&lt;br&gt;
The Control Plane isolates browsers from agents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Process separation&lt;/strong&gt;&lt;br&gt;
The PTY isolates PostgreSQL from the web layer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flow separation&lt;/strong&gt;&lt;br&gt;
Redis isolates terminal I/O from network I/O.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Final mental model
&lt;/h2&gt;

&lt;p&gt;If you understand only one thing, understand this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser ↔ Control Plane ↔ Agent ↔ PTY ↔ psql
                     ↑
              Redis is the buffer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else is lifecycle management around this chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;We did not build a “web UI for PostgreSQL”.&lt;/p&gt;

&lt;p&gt;We built a distributed, fault-tolerant runtime for a stateful terminal process.&lt;/p&gt;

&lt;p&gt;PostgreSQL just happened to be the process attached to it.&lt;/p&gt;

</description>
      <category>websockets</category>
      <category>redis</category>
      <category>architecture</category>
      <category>python</category>
    </item>
  </channel>
</rss>
