Forem: Alex Zhdankov

Why your SSH scripts will fail in production

Alex Zhdankov — Mon, 18 May 2026 15:41:40 +0000

Remote command execution looks trivial — until unstable networks, retries, long-running commands, and half-open connections turn it into a reliability problem.

We use Paramiko with a thin supervision layer on top.
The same operational problems apply to AsyncSSH, Fabric, or plain OpenSSH subprocesses.

At first, the implementation looked completely straightforward:

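import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()  # without known host keys, connect() rejects the server
client.connect(hostname=host, username=user)

stdin, stdout, stderr = client.exec_command(
    "systemctl restart postgres"
)

output = stdout.read().decode()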

In development, this worked perfectly.

Then production happened.

Hundreds of hosts.
Unstable networks.
Long-running commands.
Frozen sessions.
Half-open connections.
Retries.
Partial execution.

At that point this stopped being “SSH scripting”.

It became a distributed systems problem.

SSH is deceptively simple

Most developers intuitively model SSH like this:
local subprocess, but remote

But production SSH execution is actually:

network transport
+ stateful session
+ interactive channel
+ remote process lifecycle
+ unreliable infrastructure
+ partial execution visibility

And failures can happen independently at every layer.

Application
    ↓
SSH Client
    ↓
TCP transport        ← packets can vanish
    ↓
SSH session          ← can hang without closing
    ↓
Remote shell         ← can ignore commands
    ↓
Process execution    ← may continue after disconnect
    ↓
stdout/stderr        ← can block forever

This distinction changes everything.

Failure mode #1 — execution uncertainty

This was the first major production lesson.

If the SSH transport dies, you do not know whether the command:

succeeded
failed
partially executed
is still running remotely

That uncertainty completely changes retry semantics.

For example:

systemctl restart postgres

If the connection drops immediately after sending the command:

did restart begin?
is postgres still restarting?
did it already succeed?
is the service now dead?

You no longer have execution certainty.

This is not a “Paramiko problem”.

This is a distributed systems problem.

Retry is dangerous

Retries sound harmless until commands become stateful.

Some operations are naturally idempotent:

cat /proc/meminfo
ls -la /etc
systemctl status postgres

Others change state, and repeating them blindly is not safe:

useradd deploy
rm -rf /some/path
systemctl restart postgres

A failed transport does not imply failed execution.

That means naive retry logic can create destructive side effects.

This forced us to separate failures into two categories:

transport uncertainty
command failure

Those are fundamentally different operational states.
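
One way to encode that split is to make the outcome explicit and let the retry decision depend on it. The sketch below is illustrative only: `Outcome`, `CommandSpec`, and `should_retry` are hypothetical names, not part of Paramiko, and the policy (never blind-retry a non-idempotent command after transport loss) is an assumption about what a conservative default could look like.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    SUCCESS = auto()              # exit status 0 was observed
    COMMAND_FAILED = auto()       # a non-zero exit status was observed
    TRANSPORT_UNCERTAIN = auto()  # connection lost before any exit status


@dataclass(frozen=True)
class CommandSpec:
    text: str
    idempotent: bool  # declared by the caller, never guessed


def should_retry(spec: CommandSpec, outcome: Outcome) -> bool:
    """Conservative policy: blind-retry only when repeating is harmless."""
    if outcome is Outcome.SUCCESS:
        return False
    if outcome is Outcome.COMMAND_FAILED:
        # The remote side answered, so there is no uncertainty to resolve.
        # Whether to try again is left to the caller in this sketch.
        return False
    # TRANSPORT_UNCERTAIN: the command may have run, or may still be running.
    return spec.idempotent
```

For a non-idempotent command that ends in `TRANSPORT_UNCERTAIN`, the safer path is usually to reconnect and inspect remote state first (for example with `systemctl is-active postgres`) before deciding what to do next.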

Timeouts are not one thing

One of the most common mistakes in SSH automation is treating timeout as a single concept.

Production systems usually need several independent timeout layers:

TCP connect timeout
SSH handshake timeout
authentication timeout
command execution timeout
idle/read timeout

Each failure means something different operationally.

client.connect(
    hostname=host,
    username=username,
    timeout=10,
    banner_timeout=15,
    auth_timeout=15
)

But even that is insufficient.

A command may still hang forever while the socket technically remains alive.

That distinction matters a lot in production.
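
Keeping the layers as separate, named values makes it harder to collapse them into one number by accident. This is a minimal sketch: `TimeoutBudget` is a hypothetical helper, and the `command` and `idle` defaults are assumed values chosen for illustration. Only the three connect-phase values map onto the `connect()` arguments shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutBudget:
    connect: float = 10.0   # TCP connect
    banner: float = 15.0    # waiting for the SSH banner
    auth: float = 15.0      # authentication
    command: float = 600.0  # total wall-clock time for the command (assumed)
    idle: float = 60.0      # maximum silence on stdout/stderr (assumed)

    def connect_kwargs(self) -> dict:
        """Keyword arguments for the connect() call shown above."""
        return {
            "timeout": self.connect,
            "banner_timeout": self.banner,
            "auth_timeout": self.auth,
        }
```

The `command` and `idle` values are not enforced by `connect()` at all. They have to be checked by whatever loop supervises the running command.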

Half-open connections are nasty

This became one of the hardest reliability problems.

Sometimes:

TCP stays alive
SSH transport stays alive
but the remote process is effectively dead

Or:

packets silently disappear
the remote kernel freezes
stdout stops forever
but the socket never closes

From the application perspective:
everything looks connected

while the operation is permanently stalled.

The second case is the classic half-open connection problem: one side is gone, and the other side never finds out.
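
One partial defense is to let the kernel probe idle connections. The sketch below uses only the standard `socket` module. `enable_tcp_keepalive` is a hypothetical helper and the interval values are assumptions. Paramiko's `SSHClient.connect()` accepts an already connected socket through its `sock` argument, and `Transport.set_keepalive()` provides a similar probe at the SSH layer.

```python
import socket


def enable_tcp_keepalive(sock, idle=30, interval=10, probes=3):
    """Ask the kernel to probe an idle peer and fail the socket if it is gone."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are platform specific, so each one is guarded.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before probing
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Keepalives only detect a peer that has stopped answering. They do nothing for a command that hangs while the connection itself is healthy, so an idle timeout on the output stream is still needed.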

Blocking reads break automation

This code looks innocent:

stdout.read()

But under real workloads it becomes dangerous.

If:

the command hangs
stdout stops producing data
the socket remains alive

then:
the thread blocks forever

We eventually moved to streaming execution instead of buffered reads.

Streaming changes the execution model

Long-running commands fundamentally change how remote execution must be handled.

Operations like:

pg_dump
VACUUM
package upgrades
log exports

can run for minutes or hours.

Buffering all output in memory breaks down once the output is large or unbounded.
Blocking until completion destroys observability.

Instead we switched to chunked streaming:

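import time
while not channel.exit_status_ready():
    if channel.recv_ready():
        callback(channel.recv(4096))
    else:
        time.sleep(0.05)  # idle: avoid a busy loop
while channel.recv_ready():  # exited: drain what is still buffered
    callback(channel.recv(4096))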

This solved several production problems simultaneously:

realtime progress visibility
lower memory usage
cancellation support
dead session detection

Streaming ended up being much more operationally stable than buffered execution.
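
Dead session detection needs one more ingredient than the loop above: a notion of how long silence is acceptable. The following is a sketch under stated assumptions, not the supervision layer described in this post. `stream_command` and `CommandStalled` are hypothetical names, the 60 second default is arbitrary, and `channel` is duck-typed against the four `paramiko.Channel` methods it calls.

```python
import time


class CommandStalled(Exception):
    """No output and no exit status within the allowed idle window."""


def stream_command(channel, callback, idle_timeout=60.0, poll=0.05,
                   clock=time.monotonic, sleep=time.sleep):
    """Forward output chunks to callback and return the exit status.

    `channel` is any object with recv_ready(), recv(n), exit_status_ready()
    and recv_exit_status(), the subset of paramiko.Channel used here.
    """
    last_activity = clock()
    while not channel.exit_status_ready():
        if channel.recv_ready():
            callback(channel.recv(4096))
            last_activity = clock()
        elif clock() - last_activity > idle_timeout:
            raise CommandStalled(f"no output for {idle_timeout}s")
        else:
            sleep(poll)
    # The command has exited: drain whatever is still buffered.
    while channel.recv_ready():
        callback(channel.recv(4096))
    return channel.recv_exit_status()
```

A stalled command surfaces as an exception the caller can classify, instead of a thread that never returns. This sketch ignores stderr. A fuller version would drain it the same way so the remote process cannot block on a full stderr buffer.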

Security becomes infrastructure, not validation

Another important lesson:

SSH automation is remote code execution infrastructure.

That means command construction rules matter enormously.

This is catastrophic:

cmd = f"rm -rf {user_input}"

Because eventually someone passes:

/home/user; rm -rf /

We ended up treating all remote commands as infrastructure-sensitive operations.

Input validation alone was insufficient.

Every dynamic argument had to be:

validated
escaped
constrained

safe_value = shlex.quote(user_input)

Even simple automation eventually becomes security-critical.
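
Put together, the three rules might look like the sketch below. `build_rm_command` and the `allowed_root` parameter are hypothetical, introduced only to show validation, constraint, and escaping in one place. `shlex.quote` targets POSIX shells, so this assumes the remote side runs one.

```python
import posixpath
import shlex


def build_rm_command(path: str, allowed_root: str) -> str:
    """Return a quoted `rm -rf` command, refusing paths outside allowed_root."""
    normalized = posixpath.normpath(path)
    root = posixpath.normpath(allowed_root)
    # Constrain: the target must sit strictly below the allowed root.
    if not normalized.startswith(root + "/"):
        raise ValueError(f"refusing to remove {path!r}: outside {root!r}")
    # Escape: quote the argument so the remote shell sees a single word.
    # "--" stops rm from treating a leading "-" as an option.
    return "rm -rf -- " + shlex.quote(normalized)
```

With this in place, `/home/user; rm -rf /` is rejected outright, and a path with shell metacharacters inside the allowed root reaches `rm` as one literal argument. Normalization happens locally and does not resolve symlinks on the remote host, so it narrows the attack surface without closing it.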

Resource cleanup matters more than expected

SSH resources leak surprisingly easily.

Channels.
Sockets.
Transports.
PTY buffers.

Under load, forgotten cleanup accumulates fast.

We eventually standardized all operations around explicit lifecycle management:

with ssh_operation(...) as ssh:
    ssh.execute(...)

The important part was not aesthetics.

It was guaranteeing cleanup under:

exceptions
timeouts
partial failures
interrupted execution

Production automation lives or dies on cleanup guarantees.
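
The post does not show how `ssh_operation` is implemented, so the following is only a guess at its smallest possible shape. The `connect` parameter is an assumption: a zero-argument callable that returns a connected client with a `close()` method, such as a function that builds and connects a `paramiko.SSHClient`.

```python
from contextlib import contextmanager


@contextmanager
def ssh_operation(connect):
    """Yield a connected client and always close it afterwards."""
    client = connect()
    try:
        yield client
    finally:
        # Runs on normal exit, on exceptions, and on KeyboardInterrupt alike.
        client.close()
```

The value is in the `finally` clause: the caller cannot forget the cleanup, and a timeout raised halfway through the block still releases the connection.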

The architecture we ended up with

Over time the system evolved into several independent layers:

Connection management
    ↓
Retry classification
    ↓
Execution supervision
    ↓
Streaming transport
    ↓
Resource cleanup
    ↓
Observability

The important realization was:

remote execution is not a helper function

It is infrastructure.

Final insight

The happy path is trivial.

Production architecture begins where execution certainty ends.

SSH automation fails when treated like scripting.

Because it is not scripting.

It is:

remote process orchestration
over unreliable transport
with partial execution visibility
inside a distributed system

And once you accept that,
the architecture changes completely.

We built a real psql terminal in the browser. Here’s what made it unexpectedly hard.

Alex Zhdankov — Wed, 13 May 2026 20:53:07 +0000

A PTY-backed PostgreSQL console running in the browser using reverse WebSockets, Redis Streams, and xterm.js — designed around centralized control-plane constraints and production failure modes.

We needed a real PostgreSQL terminal inside the browser.

Not a SQL editor.
Not a query API.
A real psql session with full terminal semantics.

That requirement immediately forced several architectural constraints:

a real PTY
a long-lived stateful process
bidirectional streaming
terminal resize handling
signal forwarding (Ctrl+C)
native psql behavior

And then the infrastructure constraints made things significantly more interesting:

agents live in internal networks
all traffic must go through the Control Plane
xterm.js needs a bidirectional browser transport, which in practice means WebSocket
we could not emulate psql

At that point, this stopped being a “web feature”.
It became a distributed terminal runtime problem.

High-level architecture

This system only makes sense if you read it as a dataflow graph, not as isolated services.

Browser (xterm.js)
    │
    │ WebSocket (terminal I/O)
    ▼
Control Plane
    │
    │ session management + auth
    ▼
Redis Streams (output buffer)
    │
    │ coordination + async delivery
    ▼
Agent WebSocket channel
    │
    │ PTY stdin/stdout bridge
    ▼
PTY → real psql process

The critical architectural decision:

the browser never connects to the agent directly.

The Control Plane is the only public entrypoint in the entire system.
Everything flows through it.

Why the architecture looks “backwards”

The surprising part is that the agent initiates the terminal transport.

Not because NAT traversal was impossible.

But because the system was intentionally designed around a centralized Control Plane.

Agents sit in internal networks.
The browser has no direct visibility into them.

So instead of:

Browser → Agent

the architecture becomes:

Browser → Control Plane ← Agent

The Control Plane acts as:

session coordinator
auth boundary
transport router
lifecycle owner

Once that decision is made, reverse WebSockets become the natural transport model.

Session establishment

The session lifecycle happens in multiple stages.

Importantly:

the PTY process does not exist when the browser first connects

Only a logical session exists.

Step 1 — Browser creates a logical session

Browser
  │
  │ WebSocket connect
  ▼
Control Plane
  ├── creates session_id
  ├── registers browser handler
  └── starts auth timeout

At this point:

no PTY exists
no psql exists
no database connection exists

The Control Plane only knows:

“a browser wants a terminal session”

Step 2 — Control Plane signals the agent

The Control Plane sends a lightweight HTTP request:
POST /terminal?session_id=<uuid>

This is intentionally the only HTTP hop in the entire terminal lifecycle.

The request does not carry terminal traffic.

It only means:

“establish terminal transport for this session”

Step 3 — Agent opens reverse WebSocket

Agent
  │
  │ outbound WebSocket
  ▼
Control Plane

Now the system has two independent transport channels:

Browser WS → Control Plane
Agent WS → Control Plane

But they are still disconnected.

The system is in a half-connected state.

Session stitching

This is the moment where the architecture becomes interesting.

Browser Handler ───────┐
                       ├── session binding
Agent Handler ─────────┘

At this point:

the Control Plane stops being a transport endpoint and becomes a message router

It now forwards:

browser input → agent
agent output → browser

But critically:

not directly

All terminal output passes through an asynchronous buffering layer.

That layer ended up being one of the most important production decisions in the system.
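
The lifecycle up to this point can be sketched as a small registry. Everything here is hypothetical: `SessionRegistry`, `Session`, and `State` are illustrative names, the handlers are opaque objects, and the sketch collapses "agent connected" and "session bound" into a single step for brevity.

```python
import uuid
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class State(Enum):
    LOGICAL = auto()  # browser connected, no agent transport yet
    BOUND = auto()    # both handlers attached, traffic can be routed


@dataclass
class Session:
    session_id: str
    browser: object
    agent: Optional[object] = None

    @property
    def state(self) -> State:
        return State.BOUND if self.agent is not None else State.LOGICAL


class SessionRegistry:
    def __init__(self):
        self._sessions = {}

    def open(self, browser) -> Session:
        """Step 1: the browser connects and a logical session is created."""
        session = Session(str(uuid.uuid4()), browser)
        self._sessions[session.session_id] = session
        return session

    def bind_agent(self, session_id: str, agent) -> Session:
        """Stitching: attach the agent's reverse WebSocket to the session."""
        session = self._sessions.get(session_id)
        if session is None:
            raise KeyError(f"unknown session {session_id}")
        if session.agent is not None:
            raise RuntimeError("session already has an agent attached")
        session.agent = agent
        return session
```

Rejecting an unknown `session_id` and a second agent for the same session matters here, because the agent connection is inbound to the only public entrypoint in the system.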

PTY process creation

Once the session is fully initialized, the agent forks a real PTY:

import os
import pty

child_pid, fd = pty.fork()

if child_pid == 0:
    # Child: exec psql so it becomes the process attached to the PTY.
    os.execvp("psql", ["psql", "-U", user, "-d", dbname])

At this point the architecture fundamentally changes.

This is no longer “web infrastructure”.

It becomes:

PTY supervision
file descriptor management
process lifecycle handling
signal propagation
backpressure management

Most complexity appeared after this step.

Not before it.
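
To make those responsibilities concrete, here is a minimal sketch of the parent side for a Unix host: read from the PTY, close the descriptor, reap the child. `run_in_pty` is a hypothetical helper, not the agent's actual code. It blocks in `os.read`, whereas the pipeline described in this post uses a dedicated reader thread.

```python
import os
import pty


def run_in_pty(argv, on_output):
    """Run argv on a fresh PTY, stream its output, return the exit code."""
    pid, fd = pty.fork()
    if pid == 0:
        # Child: replace this process image and never return into parent code.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    try:
        while True:
            try:
                data = os.read(fd, 4096)
            except OSError:
                break  # Linux reports EIO once the child side is closed
            if not data:
                break  # other platforms signal EOF with an empty read
            on_output(data)
    finally:
        os.close(fd)  # file descriptor management
    _, status = os.waitpid(pid, 0)  # reap the child to avoid a zombie
    return os.waitstatus_to_exitcode(status)
```

For example, `run_in_pty(["psql", "-U", user, "-d", dbname], forward)` would stream psql output to a `forward` callback. Signal forwarding and input handling are left out here.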

The real data pipeline

This is the most important flow in the system.

Browser
  │
  │ keystroke
  ▼
Control Plane
  │
  ▼
Agent WS handler
  │
  │ write(fd)
  ▼
PTY → psql
  │
  │ stdout
  ▼
PTY reader thread
  │
  │ Redis XADD
  ▼
Redis Streams
  │
  │ async consumer
  ▼
Control Plane
  │
  │ WS push
  ▼
Browser

The most important line in the entire architecture is this:
PTY reader → Redis XADD → async consumer → WebSocket
That line is the system’s stability boundary.

Why Redis Streams became mandatory

The original implementation directly forwarded PTY output into WebSocket writes:
PTY → WebSocket

It worked in development.

It failed in production.

The issue was subtle:

PTY reads are synchronous
WebSocket writes can block
backpressure propagates backwards

The resulting failure mode was catastrophic for terminal UX:

slow network
    ↓
blocked WS writes
    ↓
frozen PTY reader
    ↓
terminal stalls

The terminal looked dead while psql was still running underneath.

Redis Streams solved this by introducing a decoupling boundary.

Now:

the PTY reader never blocks on the network
network latency becomes isolated
consumers can temporarily lag
output survives reconnects

The additional latency was negligible.

The operational stability improvement was enormous.
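
The decoupling property itself does not depend on Redis, so it can be shown with an in-memory stand-in. `OutputBuffer` below is a hypothetical class, not the Redis API. Its `append` plays the role of `XADD` with a `MAXLEN` cap, and `read_after` plays the role of reading the stream from the last ID a consumer has seen.

```python
import itertools
import threading
from collections import deque


class OutputBuffer:
    """Bounded, append-only log with resumable reads (in-memory stand-in)."""

    def __init__(self, maxlen=1000):
        self._entries = deque(maxlen=maxlen)  # oldest entries are dropped
        self._ids = itertools.count(1)
        self._lock = threading.Lock()

    def append(self, data: bytes) -> int:
        """Producer side: never waits for consumers. Returns the entry id."""
        with self._lock:
            entry_id = next(self._ids)
            self._entries.append((entry_id, data))
            return entry_id

    def read_after(self, last_id: int = 0):
        """Consumer side: everything newer than last_id, oldest first."""
        with self._lock:
            return [(i, d) for i, d in self._entries if i > last_id]
```

The PTY reader only ever calls `append`, so a slow browser cannot stall it. A consumer that reconnects passes the last ID it delivered and resumes from there, losing output only if it lagged past the retention bound.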

The architecture is actually two independent loops

This is the part most terminal architectures hide.

Input loop
Browser → Control Plane → Agent → PTY

Output loop
PTY → Redis → Control Plane → Browser

These loops are intentionally independent.

That separation is what allows the system to survive partial failures.

Why we split browser and agent handlers

We intentionally kept browser-facing and agent-facing handlers separate.

Because they solve fundamentally different problems.

Browser Handler:

auth
user session ownership
browser disconnect semantics
user errors

Agent Handler:

PTY lifecycle
process supervision
reconnect semantics
infrastructure errors

Trying to merge them created tightly coupled failure modes and significantly more lifecycle complexity.

Separating them made the system dramatically easier to reason about.

Failure modes that mattered in production

The hardest problems were not PostgreSQL problems.

They were long-lived process problems.

A. Redis failure

Impact:

output pipeline breaks
PTY continues running

Mitigation:

memory limits
retention limits
monitoring
bounded stream lifetime

B. Agent disconnect

Impact:

transport disappears
PTY may still be alive

Mitigation:

reconnect window
session reattachment
delayed teardown

C. Process explosion

Impact:

memory exhaustion
PostgreSQL connection storms

Mitigation:

threading.BoundedSemaphore(10)  # at most 10 concurrent sessions

This was one of the simplest and most effective safeguards in the system.
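
A minimal sketch of such a guard, assuming the agent should refuse new sessions rather than queue them. `SessionLimiter` is a hypothetical wrapper around the standard library's `threading.BoundedSemaphore`.

```python
import threading


class SessionLimiter:
    """Reject new terminal sessions once max_sessions are running."""

    def __init__(self, max_sessions=10):
        self._slots = threading.BoundedSemaphore(max_sessions)

    def try_acquire(self) -> bool:
        # Non-blocking: a full agent refuses instead of queueing requests.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # BoundedSemaphore raises ValueError on a release without a matching
        # acquire, which surfaces double-release bugs early.
        self._slots.release()
```

The release belongs in the same cleanup path that reaps the psql process, so a crashed session cannot leak a slot.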

D. xterm resize storms

xterm.js emits resize events aggressively during browser resizing.

Impact:

each resize triggers ioctl(TIOCSWINSZ)
without throttling, the PTY spent significant time processing resize events instead of actual terminal traffic

Mitigation:

debounce resize events before they reach the PTY

Simple debounce logic completely fixed the issue.
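
A debounce of this kind can live on the agent side, right in front of the ioctl. The sketch below is one possible shape, not the fix that was actually deployed. `ResizeDebouncer` and `set_winsize` are hypothetical names, the 100 ms quiet period is an assumed value, and the code targets Unix hosts.

```python
import fcntl
import struct
import termios


def set_winsize(fd, rows, cols):
    """Apply a window size to the PTY via ioctl(TIOCSWINSZ)."""
    fcntl.ioctl(fd, termios.TIOCSWINSZ, struct.pack("HHHH", rows, cols, 0, 0))


class ResizeDebouncer:
    """Keep only the latest requested size until the event stream goes quiet."""

    def __init__(self, quiet_period=0.1):
        self.quiet_period = quiet_period
        self._pending = None
        self._last_event = 0.0

    def submit(self, rows, cols, now):
        # Newer events overwrite older ones; nothing is applied yet.
        self._pending = (rows, cols)
        self._last_event = now

    def poll(self, now):
        """Return the size to apply, or None if nothing is due yet."""
        if self._pending and now - self._last_event >= self.quiet_period:
            size, self._pending = self._pending, None
            return size
        return None
```

The agent loop would call `poll(time.monotonic())` periodically and pass any returned size to `set_winsize`, so a burst of resize events turns into a single ioctl.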

Scaling reality

The system does not scale like a normal WebSocket service.

Each session includes:

a real psql process
a PTY
multiple threads
Redis streams
two WebSocket channels
a database connection

The scaling bottleneck is not Redis.

It is not CPU.

It is not WebSockets.

It is:

how many real PostgreSQL sessions the infrastructure can sustain

Why HTTP and SSE were rejected

We evaluated both.

HTTP
Failed because:

stateless
no streaming terminal semantics
no signal handling
no persistent shell state

SSE
Failed because:

one-directional transport
incompatible with terminal interaction patterns
xterm.js expects bidirectional communication

In the end, terminals naturally map onto WebSockets.

Trying to avoid that only complicates the architecture.

What this system actually is

If you remove all abstractions:

this is a distributed process supervisor for a PTY running psql

Everything else is transport, routing, buffering, and failure handling around that core idea.

Final architecture insight

The system is ultimately defined by three separations.

Connection separation
The Control Plane isolates browsers from agents.

Process separation
The PTY isolates PostgreSQL from the web layer.

Flow separation
Redis isolates terminal I/O from network I/O.

Final mental model

If you understand only one thing, understand this:

Browser ↔ Control Plane ↔ Agent ↔ PTY ↔ psql
                     ↑
              Redis is the buffer

Everything else is lifecycle management around this chain.

Final thought

We did not build a “web UI for PostgreSQL”.

We built a distributed, fault-tolerant runtime for a stateful terminal process.

PostgreSQL just happened to be the process attached to it.