Forem: Ilya R.

Hardware-backed SSH keys, end to end: YubiKey, PIV, software alternatives, and where SSH CAs fit in

Ilya R. — Sat, 09 May 2026 18:31:36 +0000

This is a working guide to using a YubiKey for SSH on a real Linux fleet, plus the surrounding landscape — PIV, software-only alternatives, and SSH certificate authorities. The goal is to retire file-based SSH keys without breaking daily operations.

The article is structured around four questions:

What does a hardware-backed key actually do, and what knobs do you control?
How do you combine those knobs into a policy that works for both root login and Ansible?
What if you can't ship YubiKeys?
When should you stop managing keys yourself and adopt an SSH CA?

The problem with file-based keys

Every classic SSH key is a file in ~/.ssh/. That file holds the private key. To log in to a server, your SSH client reads the file and produces a cryptographic signature.

There are really two issues here, and they compound:

A file-based key can leak. It exists in the filesystem, can be read by anything with sufficient access, and can be copied, backed up, accidentally committed, or extracted via a misconfigured recovery scenario. This is a fundamental property of where the key lives.
The discipline that would mitigate this rarely survives daily work. The cryptography is fine; the operational reality isn't.
- What works in theory: a passphrase-protected key combined with ssh-agent -t 10m is genuinely close to unbreakable. The key is decrypted briefly, signs what it needs, and the agent forgets it.
- What happens in practice: engineers drop passphrases for convenience, or load the key into ssh-agent on first use and leave the agent running for the entire session.
- Agent forwarding compounds it: with ssh -A, a key that's been unlocked once can sign on the operator's behalf from any forwarded host for the rest of the agent's lifetime.

Hardware-backed keys remove the need for that discipline. The private key never leaves the device, and signing requires the device's physical presence — there's nothing to forget to passphrase, nothing to leave running for too long, nothing for a forwarded host to sign with silently.

YubiKey is the most flexible option because the same device works on Linux, macOS, Windows, iOS, and Android with the same protocol and the same key files. Most of this article is about YubiKey + FIDO2; the alternatives come later.

How a YubiKey actually signs

The SSH client never sees the private key. It hands the YubiKey a small piece of data to sign (a "nonce"), the YubiKey signs internally and returns the signature. If the device is unplugged, signing is impossible regardless of what's on the laptop.

This article uses FIDO2 (the modern protocol; SSH key types sk-ssh-ed25519@openssh.com and sk-ecdsa-sha2-nistp256@openssh.com, generated with ssh-keygen -t ed25519-sk or -t ecdsa-sk). FIDO2 has been first-class in OpenSSH since version 8.2 (February 2020). PIV — the older smartcard protocol — is covered later as an alternative.

The four knobs

When you generate a FIDO2 key on a YubiKey, four properties determine how it behaves:

Resident vs non-resident — where the credential is stored.
Touch — does signing require a tap on the YubiKey?
PIN — does signing require the FIDO2 PIN?
ssh-agent — is the key loaded into ssh-agent, or used directly?

These are independent yes/no choices. Combined, they describe what it takes to sign with that particular key. The next four sections take them one at a time.

Knob 1: resident vs non-resident

This is the one most people get wrong, so it gets the most space.

Resident (created with -O resident): the credential lives on the YubiKey itself. The file in ~/.ssh/ is just a pointer — a label that says "ask the device for credential 0xA3F2…". If you delete the file, you can recreate it on any machine by running ssh-keygen -K, which queries the YubiKey for all its resident credentials and writes them to disk.

Non-resident (the default): the credential is split. The YubiKey has a master secret used to derive credentials on demand. The file on disk holds an encrypted handle. To sign, the YubiKey needs the handle from the file plus its own master secret. Without the file, the YubiKey doesn't know which credential to derive. Without the YubiKey, the file is gibberish.

The practical consequences:

Question	Resident	Non-resident
Can the file be reconstructed from the YubiKey?	Yes (`ssh-keygen -K`)	No
Does losing the file matter?	No	Yes (re-enroll)
Does a passphrase on the file add real security?	No — file holds an identifier, not a secret	Yes — file holds the encrypted credential handle
Is ssh-agent needed?	No, the YubiKey is the agent	Usually yes, to avoid re-typing the passphrase

The headline rule:

Resident keys don't need a file passphrase, because the file holds nothing secret. Non-resident keys do, because the file holds the part of the credential that isn't on the YubiKey.

A non-resident key with a passphrase is conceptually identical to a classic passphrase-protected file SSH key — except the actual signing material never leaves the YubiKey. Same mental model, with the YubiKey as a hard-bound second factor.

Knob 2: touch

When you sign with a key, the YubiKey can require you to physically touch the gold disc. This is "user presence" — proof that a human is at the device.

Touch required (default): every signing produces a touch prompt. The YubiKey's LED blinks, you tap it, the signing completes. Failure to touch within ~15 seconds aborts the signing.
No touch: signings happen automatically as long as the YubiKey is plugged in. Set with -O no-touch-required at generation. The server's authorized_keys must also have no-touch-required for OpenSSH to accept the signature.

You turn touch off when an operation produces many signings — Ansible across hundreds of hosts, an rsync of 100k files, a deploy that opens 50 sessions. None of these can realistically prompt for a touch each time.

Disable touch only if you plan to use a short-lived ssh-agent with a passphrase-protected, non-resident key file!

Touch is a defense against silent malicious signing on a host you've connected to (with agent forwarding) or on a compromised laptop you happen to be at. It is not a defense against device theft — someone holding the device can touch it.

Knob 3: PIN

The YubiKey has a FIDO2 PIN, set once with ykman fido access change-pin. It's separate from touch.

PIN required (-O verify-required at generation): every signing prompts for the PIN.
PIN not required: signing happens without a PIN (subject to touch policy).

PIN is a defense against device theft. Touch alone doesn't help you here — the thief can touch. PIN does, because the thief doesn't know it.

The same device PIN gates ssh-keygen -K and FIDO2 credential management generally. Even for credentials that don't require PIN to sign, the device PIN is required to extract them. This becomes important in the four-mode model below.

Knob 4: ssh-agent

ssh-agent is a small process that holds keys in memory and signs on behalf of SSH clients that ask. It exists for two reasons:

You don't want to re-enter a file passphrase on every connection. Load the key once, use it many times.
You want agent forwarding (ssh -A). You connect to host A and then, from inside that session, to host B; the ssh client running on A asks your laptop's agent for signatures back through the forwarded socket.

For YubiKey-backed keys, whether you need an agent depends on the connection pattern, not just on storage:

Connection pattern	Agent needed?
Direct SSH (`ssh host`)	No — ssh client talks to the YubiKey directly
ProxyJump (`ssh -J jump target`)	No — local ssh signs each hop directly
Agent forwarding (`ssh -A`, in-session multi-hop)	Yes — remote host needs to reach your agent
Non-resident key with passphrase	Yes — to avoid retyping on every connection

ProxyJump is the modern multi-hop pattern: the local ssh client opens each connection in sequence, signing each against the YubiKey directly. Nothing is exposed on intermediate hosts. Agent forwarding is the older pattern, used when you're already inside a remote shell and need to reach further (e.g., on host1, running scp host2:file ./).

When forwarding is needed, resident keys can be loaded into the agent without a passphrase:

ssh-add ~/.ssh/id_sudo  # No passphrase prompt; the file holds a reference, not encrypted material.

Never do this with no-touch, no-PIN keys! They must be passphrase-protected and added to the agent with a short lifetime, for example ssh-add -t 10m ~/.ssh/id_wheel

Touch-required keys make agent forwarding safe again. With file-based keys, an unlocked agent signs anything it's asked to, silently, for the agent's lifetime — agent forwarding became dangerous because a compromised forwarded host could sign as you on every other host you have access to. With FIDO2 touch-required keys, every signing request from a forwarded host produces a touch prompt on your laptop. If you didn't initiate the action, you don't touch, and the signing fails. The classic "never use -A" advice no longer applies once credentials are hardware-backed and touch-gated.

This refines the rule:

Resident is the default. Non-resident is reserved for keys that must live in ssh-agent for the wheel-style mass-automation use case — explained next.

The four-mode model

A single key configuration cannot serve both rare root login and Ansible across a fleet. Different operations have different blast radius and different frequency, and they want different policies. The pragmatic answer is four keys, each a deliberate combination of the four knobs.

The model as a table:

Key	Touch	PIN	Storage	File pass	ssh-agent	Use
`root`	yes	yes	resident	no	no	Direct root SSH login
`sudo`	yes	no	resident	no	no	Daily admin (TOTP at host)
`wheel`	no	no	non-resident	yes	yes	NOPASSWD mass automation
`robo`	no	no	resident	no	no	Backups, sftp, stage deploys

Why three are resident and one isn't

wheel is the deliberate exception, for three reasons that compound:

1. Mass automation must use ssh-agent. Ansible across 300 hosts produces thousands of signing operations per run. A touch on each is unworkable. So wheel is generated with no-touch-required and without verify-required. Once it's loaded into ssh-agent (so it can be reused across the run), the agent holds the key in memory.

2. The file on disk needs a passphrase. The passphrase prevents accidental loading and forces the operator to deliberately type something before the agent gets the key.

3. The passphrase needs a forcing function. ssh-keygen -K on a new machine writes resident credentials into ~/.ssh — id_root, id_sudo, id_robo — none needing passphrases, because they're just references to material on the device. The flow trains you that "resident-export-without-passphrase is safe."

If wheel were resident, the same command would write id_wheel, and you'd have to remember the one exception: passphrase this file, the others are fine. Humans don't reliably catch that exception.
Non-resident wheel is structurally outside that flow: ssh-keygen -K can't produce it, and the file you copy from your existing setup already has a passphrase. A physical equivalent: keep wheel on a separate YubiKey with a "passphrase required" sticker.

Generation commands

# root: resident, touch + PIN
ssh-keygen -t ed25519-sk -O resident -O verify-required \
  -N "" -f ~/.ssh/id_root -C "laptop-root"

# sudo: resident, touch only
ssh-keygen -t ed25519-sk -O resident \
  -N "" -f ~/.ssh/id_sudo -C "laptop-sudo"

# wheel: non-resident, no touch, no PIN, passphrase, used via ssh-agent -t 10m
ssh-keygen -t ed25519-sk -O no-touch-required \
  -f ~/.ssh/id_wheel -C "laptop-wheel"
# Set a real passphrase when prompted.

# robo: resident, no touch, no PIN
ssh-keygen -t ed25519-sk -O resident -O no-touch-required \
  -N "" -f ~/.ssh/id_robo -C "laptop-robo"

-N "" skips the file passphrase prompt. Used for the three resident keys. wheel is the only one without -N "" — you'll be prompted, and you set a real passphrase.

Server-side `authorized_keys`

Keys generated with -O no-touch-required need a matching no-touch-required option in authorized_keys, otherwise OpenSSH rejects the signature.

root

/root/.ssh/authorized_keys:

sk-ssh-ed25519@openssh.com AAAA... laptop-root

wheel

~wheel/.ssh/authorized_keys:

no-touch-required sk-ssh-ed25519@openssh.com AAAA... laptop-wheel

sudo

~admin/.ssh/authorized_keys (the daily-admin user with sudo privileges):

sk-ssh-ed25519@openssh.com AAAA... laptop-sudo

Pair the sudo key with pam_google_authenticator.so in the host's sudo PAM stack:

# /etc/pam.d/sudo
auth required pam_google_authenticator.so

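Per-user TOTP secrets stored under /etc/google_authenticator (readable only by root; this non-default location has to be set with the module's secret= option) protect against a stolen YubiKey, since touch alone is not enough for sudo. The extra prompt also guards against an accidental sudo rm -rf /.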

robo

~robo/.ssh/authorized_keys (the most restricted entry, for the non-prod fleet, constrained by source IP and forced command):

no-touch-required,from="10.0.0.0/8",command="/usr/local/bin/backup-shell" sk-ssh-ed25519@openssh.com AAAA... laptop-robo

PIV — the older alternative protocol

YubiKey supports a second SSH path: PIV (Personal Identity Verification), a US-government smartcard standard that predates FIDO2 by about a decade.

PIV-on-YubiKey gives you:

Multiple "slots" (9a, 9c, 9d, 9e, plus retired 82–95) — each holds a separate certificate and key pair.
Three touch policies per slot: never, cached (15-second window), always.
PIN policies: default, once, always, never.
Standard X.509 certificates, which integrate nicely if your environment already uses smartcards for things like email signing, S/MIME, or government identity.

A typical setup:

# Generate ECCP256 key in slot 9a, with cached touch and PIN-once
ykman piv keys generate \
  --algorithm ECCP256 \
  --touch-policy CACHED \
  --pin-policy ONCE \
  9a /tmp/pubkey.pem

# Self-signed certificate (or sign with a corporate CA)
ykman piv certificates generate \
  --subject "CN=admin" \
  9a /tmp/pubkey.pem

# Use it directly via PKCS#11
ssh -I /usr/lib/x86_64-linux-gnu/libykcs11.so user@host

# Or load into ssh-agent
ssh-add -s /usr/lib/x86_64-linux-gnu/libykcs11.so

On paper, the cached touch policy is exactly what you want. One touch unlocks signing for 15 seconds, then it locks again — ideal for rsync or scp of many files where one logical operation triggers many SSH transactions.

In practice, the cache behavior depends on how your SSH client handles the PKCS#11 session. Different clients open and close PKCS#11 sessions differently:

Some open the session once per ssh invocation and keep it open, so the cache works as advertised.
Some open and close per cryptographic operation, which resets the cache and produces a touch prompt every signing.
Behavior varies between OpenSSH versions, between using ssh-agent vs. direct PKCS#11, between Linux distributions and OS package builds.

For a single user on one machine, PIV with cached can be made to work once you've found the right combination. For a fleet with mixed client versions across Linux, macOS, and Windows, the behavior isn't predictable. You'll get bug reports for years, and your runbooks will accumulate "if your client is X, do Y" branches.

FIDO2 sidesteps this entirely. Per-credential policy is set at generation time, OpenSSH speaks the protocol natively without PKCS#11 in the middle, and behavior is consistent across clients and platforms.

Use PIV if you already have smartcard tooling, X.509 workflows, or a strong organizational reason to use the existing standard.

Use FIDO2 if you're starting fresh and want predictable behavior across a heterogeneous fleet.

Software-only alternatives

Hardware tokens cost money and procurement takes time. For distributed contractors, BYOD policies, or organizations without an IT budget for keys, you're sometimes deploying software-only solutions. The options below all keep your private key better-protected than a plain file in ~/.ssh/, but with different trade-offs.

The dimension that matters: can the private key be extracted from where it lives?

Solution	Key storage	Extractable?	Notes
Secretive (macOS)	Apple Secure Enclave	No	Touch ID per signing. Open source.
Windows Hello SSH	Windows TPM	No	TPM-bound; biometric/PIN per signing. Caveats below.
KeePassXC SSH agent	Encrypted KDBX database	Yes (when DB unlocked)	Keys are read from disk; the DB is just an extra layer.
1Password SSH agent	1Password vault (cloud-synced)	Yes (extractable when vault is unlocked locally)	Convenient. You're trusting their infrastructure.
LastPass SSH agent	LastPass vault (cloud-synced)	Yes (2022 breach; weak master passwords brute-forced offline)	LastPass had a major vault-data breach in 2022.

The categories sort cleanly:

Hardware-backed (Secretive, Windows Hello). The private key is generated inside a secure element and never leaves it. Same security model as a YubiKey, but tied to one device. Strong for "I always work from this laptop"; weaker for "I work from three machines."

Note on Windows Hello SSH. "Windows Hello SSH" gets used to describe three different things, only one of which is genuinely the macOS-Secretive equivalent:

TPM-backed via Virtual Smart Card — the actual TPM-bound SSH path. Requires tpmvscmgr.exe to create a virtual smart card, a self-signed cert via the Microsoft Smart Card Key Storage Provider, and PuTTY/Pageant rather than the default OpenSSH client. tpmvscmgr.exe is Pro/Enterprise/Education only — not available on Windows 11 Home.
Windows Hello for Business — the corporate path, requires Entra ID or AD join. Out of scope for a personal laptop.
ssh-keygen -t ed25519-sk with Windows Hello as the UV layer — the most-documented "Windows Hello SSH" path, but Windows Hello is just the UI layer asking for your PIN. The actual FIDO2 authenticator is still a USB device (typically a YubiKey). On Windows 11 Home, this is effectively the only available option, which means you need external hardware anyway.

The takeaway: on macOS, software-only hardware-backed SSH is one click in Secretive. On Windows it's an enterprise feature with awkward retrofitting, and Home users are pushed toward an external YubiKey regardless. This is one of the practical reasons a YubiKey wins on cross-platform — the same device works the same way on every OS, no per-OS puzzle to solve.

Software-encrypted (KeePassXC). The key is a normal SSH private key, encrypted in a database. Strictly better than a naked file because there's a master password gating access, but the key is still extractable any time the DB is open. Reasonable when you already use KeePassXC for password management.

Cloud-synced (1Password, LastPass). The key is stored in the provider's vault. Whoever can read the vault can read the key. You're trusting the provider's infrastructure and operational security. 1Password's design (Secret Key + master password) makes server-side decryption genuinely difficult; LastPass's 2022 breach demonstrated that vault contents can leak in practice. The convenience is real; the trust assumption is non-trivial.

Pick the strongest option you can ship to your team, and back it with a multi-mode model along the same lines as the YubiKey one — different keys for different operation classes, with the most automated keys getting the strongest restrictions at the server side.

SSH CAs — Teleport, step-ca, HashiCorp Boundary

Everything above is about credential custody: where the private key lives and what's required to use it.

Teleport, step-ca (Smallstep's open-source CA), and HashiCorp Boundary solve a related but distinct problem: credential lifecycle and access control. Instead of long-lived keys, they issue short-lived SSH certificates that expire automatically. They integrate with identity providers (Okta, Google Workspace, Entra ID), log session activity, and can grant just-in-time access that revokes itself.

Whether you need this depends on scale.

Team size	Typical reality	Recommendation
Solo or up to ~15 people	You know who has access. `authorized_keys` is auditable by reading. Offboarding is manual but tractable.	YubiKey + four-mode model is enough. A CA adds operational overhead without proportional security gain.
15–100 people, growing	New hires need access; departures need offboarding; "who can SSH to production?" stops being answerable from `authorized_keys` alone. Onboarding takes a day per person.	Adopt a CA system. Pain is real and pays back the investment.
Hundreds of devs, regulated industry	Manual key management is impossible. You can't audit it, you can't rotate it, you can't prove who logged into what after the fact.	CA system is mandatory. Plan around it from day one.

The operational pain shows up in roughly this order as you grow:

Adding a key to N hosts requires Ansible discipline. Doable.
Removing a key from N hosts requires the same discipline. Often skipped on departures.
Rotating keys regularly across the whole fleet is a project.
Answering "is this person's access still active?" requires querying every host. Expensive.
Proving to an auditor what happened in a session three months ago requires session logging that authorized_keys doesn't provide.

These problems arrive in a predictable order as the team grows, and each has a CA-shaped solution.

The common confusion: SSH CAs don't replace hardware keys. They complement them.

When you use a CA, the long-term identity authenticates to the CA's enrollment endpoint and gets a short-lived SSH certificate in return. That long-term identity needs to be protected — if it's a file-based key, an attacker who steals it can request fresh certificates indefinitely. The CA system has moved the problem rather than solved it.

The right shape:

Long-term identity: YubiKey + the four-mode model (or just sudo/root keys, depending on what the CA expects).
Short-term access: SSH certificates issued by the CA, valid for hours, scoped to specific hosts.
Audit: CA logs the issuance; session recording captures what happened during use.

The hardware-backed identity is the foundation. The CA is the access plane on top of it.

TL;DR

The four knobs:

Resident vs non-resident — where the credential lives. Resident is the default; the file is a label, no passphrase needed. Non-resident is for keys that must be in ssh-agent; the file holds encrypted material and must have a passphrase.
Touch — physical proof of presence. Defends against silent signing on a forwarded or compromised host. Not a defense against device theft.
PIN — defense against device theft. Also gates ssh-keygen -K extraction of resident credentials.
ssh-agent — not needed for direct SSH or ProxyJump. Needed for agent forwarding (-A, including in-session multi-hop) and for non-resident keys with passphrases. With FIDO2 + touch-required keys, agent forwarding is safe again because every signing requires a touch on your laptop — silent signing isn't possible.

The four-mode model:

root — resident, PIN + touch. Direct root login, rare.
sudo — resident, touch only. Daily admin. Pair with PAM TOTP at the host.
wheel — non-resident, no touch, passphrase + ssh-agent. NOPASSWD mass automation. Non-resident specifically so device + PIN cannot extract it.
robo — resident, no touch, no PIN. Convenience tier, restricted at the server with from= and command=.

Other paths and where they fit:

PIV is theoretically cleaner (slots, certificates, cached touch policy) but its caching depends on PKCS#11 session handling that drifts between SSH client versions. Avoid for heterogeneous fleets.
Software alternatives sort by extractability. Secretive and Windows Hello are hardware-backed (non-extractable). KeePassXC, 1Password, and LastPass keys are extractable once the database or vault is unlocked, and the cloud-synced ones add trust in the provider on top.
SSH CAs (Teleport, step-ca, HashiCorp Boundary) solve access management at scale. They don't replace hardware keys — they sit on top of them. Adopt when manual authorized_keys management starts hurting, typically around 15–100 engineers.

The shortest possible version: hardware key first, multi-mode policy second, CA system if and when scale demands it.

Note about YubiKey Bio

The article above covers YubiKey 5C variants only — Bio Series is out of scope. The Bio Edition is genuinely more convenient for verify-required SSH keys since one fingerprint tap collapses PIN+touch into a single interaction, but it costs noticeably more than a 5C and the widely available FIDO Edition is FIDO2/U2F only — no PGP, OATH, or PIV. Multi-protocol Edition exists but is sold only via enterprise subscription.

When TLS 1.3 Silently Dies Inside Your Android Proxy

Ilya R. — Fri, 20 Mar 2026 16:59:43 +0000

We run iProxy.online, a mobile proxy infrastructure. Our Android app turns phones into proxy servers across 100+ countries. Last year we shipped an advanced network health checker that runs a lot of probes through these proxies to a controlled server. That’s when things got weird.

A small but noticeable percentage of devices started failing HTTPS checks. HTTP worked fine. The failure was always at the TLS handshake stage. And the behavior was completely non-deterministic: broken for two hours, then fine, then broken for five minutes, then fine again.

What we saw

The correlations were weak and noisy. Android 8 and 9 devices showed up more often. Cheaper, lower-spec phones were overrepresented. The strongest signal was memory pressure on the device. When we could catch the failure in real time, the device was almost always low on available RAM. But metrics from memory-starved phones are unreliable by definition, so we couldn’t be sure this wasn’t survivorship bias.

We ran two tracks of investigation: finding correlations (inconclusive, as described above) and understanding what actually breaks.

The second track was harder than it sounds. These are not our devices. They sit in remote locations. Physical access happens maybe once a month. We can’t deploy debug builds on demand. The bug is intermittent with no predictable trigger. Direct on-device debugging was effectively impossible.

Narrowing it down

Our metrics pointed at TLS handshake failure, so we tried to reproduce manually. curl through the same proxy, same server (Caddy, default Ubuntu 24.04 repos):

curl -x socks5://proxy:port https://our-server.example.com

Works perfectly. TLS 1.3, clean handshake, 200 OK. Every time.

Our network checker is written in Go (1.24 at the time). I built a minimal Go client to isolate the behavior. Here’s where it got interesting:

Client	TLS version	Result
curl	1.3	works
Go	1.3	hangs
Go	1.2 (forced)	works

Forcing TLS 1.2 in Go:

tlsConfig := &tls.Config{
    MaxVersion: tls.VersionTLS12,
}

This consistently fixed the issue on affected devices.

A tcpdump on the client side showed the Go client sending ClientHello and then… nothing. No ServerHello came back to the client. The proxy app just sat there.

It gets weirder

The problem was server-dependent. Go + TLS 1.3 failed against our Caddy server, against Cloudflare, against Google. But it worked against some other sites. So the failure depended on the specific TLS implementation on the remote end, not just the client.

And this wasn’t limited to our checker. When the bug was active on a device, Chrome via the same mobile proxy couldn’t open Google over TLS 1.3 either. TLS 1.2 sites loaded fine. This was a device/app-level issue, not something specific to our Go code.

Why Go and curl behave differently

This is the part that took the longest to reason about.

Go’s crypto/tls and curl’s underlying OpenSSL/BoringSSL produce different ClientHello messages. Go’s ClientHello has always been somewhat larger due to different extension sets and key share choices. But there’s a much bigger factor here that we didn’t initially consider.

Starting with Go 1.23, the post-quantum key exchange X25519Kyber768Draft00 is enabled by default when Config.CurvePreferences is nil (which is the standard case). In Go 1.24, this became X25519MLKEM768. The ML-KEM public key alone is 1184 bytes. This makes the ClientHello big enough to exceed a single TCP packet at a typical 1500-byte MTU.

curl (as of the versions we tested) does not send post-quantum key shares by default. Its ClientHello is much smaller and fits comfortably in a single packet.

This size difference matters because our Android app is acting as a TCP proxy. It reads data from one socket and writes it to another. The proxy doesn’t terminate TLS; it just forwards bytes. But it still needs to buffer them.

The (probable) mechanism

Here’s our working theory. There are two chokepoints, not one.

Chokepoint 1: the ClientHello. Go 1.24 sends a very big ClientHello due to the ML-KEM post-quantum key share (1184 bytes for the public key alone). curl’s OpenSSL sends ~500-700 bytes with a 32-byte X25519 key share. Chrome 124+ also sends post-quantum key shares by default, producing similarly large ClientHello messages. Our tcpdump on the client side showed the ClientHello leaving the client, then silence. This means the proxy either failed to forward the ClientHello to the server, or forwarded it and failed to relay the server’s response back. We can’t distinguish these two cases from a client-side capture alone, but both point to the same thing: the proxy is the bottleneck.

Chokepoint 2: the server flight. This explains why TLS 1.3 worked with some servers but not others, even when the ClientHello is identical.

In TLS 1.3 the server responds with a single flight: ServerHello + EncryptedExtensions + Certificate + CertificateVerify + Finished, all at once. The size of this response depends on the server’s certificate chain. A small site with a single cert and short chain might send 2-3 KB. Google or Cloudflare with full certificate chains send 4-6+ KB. Unfortunately, I don't remember what kind of sites worked with TLS 1.3 and what kind of certificate chain they had, so this is just a guess.

So even when the ClientHello makes it through the proxy, the server response might not. A memory-starved proxy can relay a 2 KB response but chokes on a 5 KB one. That would explain why some sites worked and others didn’t with the exact same client. It sounds dubious, since the numbers are so small, but I didn’t have any other hypotheses.

TLS 1.2 avoids both problems. Its ClientHello is smaller (no PQ key shares), and the handshake is split across multiple round trips with smaller messages in each direction. No single message is large enough to stress the proxy’s buffers.

Why restart fixes it: killing the app clears leaked memory, resets all socket state, and gives the proxy fresh buffer capacity. The fact that this consistently works is the strongest evidence that the root cause is resource exhaustion, not a protocol bug.

The fix

We needed a production fix, not a research paper. We did two things:

Immediate mitigation: capped TLS to 1.2 for the checker and for proxy traffic where possible.

tlsConfig := &tls.Config{
    MaxVersion: tls.VersionTLS12,
}

Observability-driven restart: the checker now detects when TLS 1.3 fails but TLS 1.2 succeeds on the same device. When this pattern appears, we send a remote command to fully restart the app (kill the process, clear memory, relaunch). This consistently fixes the problem, which further supports the memory pressure hypothesis.

We also found and fixed memory leaks in our app that were contributing to the pressure. The correlation between leak fixes and reduced TLS 1.3 failures was visible in our dashboards.

Why we didn’t dig deeper

We don’t have a verified root cause. We have a strong hypothesis, consistent correlations, and effective mitigations that confirm the hypothesis indirectly.

The devices are remote, not ours, and physically accessible about once a month. The bug is intermittent with no reliable trigger. Production users were affected. We needed a fast fix, not deep research.

We’re now investing in better telemetry and the ability to capture targeted diagnostics remotely. Next time something like this happens, we’ll have the data to pin it down.

Key takeaways

For proxy developers on constrained devices: TLS 1.3 messages are larger than TLS 1.2, and post-quantum key exchange makes them much larger. If your proxy buffers TCP data, make sure your buffers can handle multi-packet TLS records, especially under memory pressure.

For Go developers proxying TLS: be aware that Go 1.23+ sends post-quantum key shares by default. If you’re running through proxies or middleboxes, this can break things. Set CurvePreferences explicitly or use GODEBUG=tlskyber=0 (Go 1.23) / GODEBUG=tlsmlkem=0 (Go 1.24+) to disable it.

For anyone debugging intermittent TLS failures: if TLS 1.2 works and TLS 1.3 doesn’t, the problem is almost certainly not in TLS itself. It’s in something between client and server that can’t handle the larger messages. Check your middleboxes, proxies, and buffer sizes.

The Systemd Bug That Nobody Wants to Own

Ilya R. — Fri, 27 Feb 2026 19:18:00 +0000

TL;DR: There’s a namespace bug affecting Ubuntu 20.04, 22.04, and 24.04 servers that causes random service failures. It’s been reported since 2021 across systemd, Ubuntu, Fedora, and Red Hat trackers. Most reports are either expired or labeled “not-our-bug.” Only a reboot fixes it.

If you’re running Ubuntu servers and have ever seen this in your logs:

Failed to set up mount namespacing: /run/systemd/unit-root/dev: Invalid argument
Failed at step NAMESPACE spawning: Invalid argument
Main process exited, code=exited, status=226/NAMESPACE

Congratulations. You’ve encountered one of the most frustrating bugs in the Linux ecosystem — one that’s been bouncing between the kernel and systemd teams for years with no resolution.

What Happens

Random systemd services — including critical ones like systemd-resolved, systemd-timesyncd, systemd-journald, and your own custom services — suddenly refuse to start. The error mentions “mount namespacing” and “Invalid argument.”

Restarting the service doesn’t help. systemctl daemon-reload doesn’t help. The only reliable fix is a full system reboot.

If you’re running containerized workloads (LXC, LXD, Proxmox), it gets worse: the bug can affect the entire host node, and container reboots won’t fix it — you need to reboot the hypervisor itself.

The Blame Game

I’ve tracked this bug across multiple issue trackers:

systemd/systemd #24798 — Ubuntu 20.04, September 2022
systemd/systemd #19926 — Labeled not-our-bug, June 2021
Ubuntu Launchpad #1990659 — Expired due to inactivity
Fedora CoreOS #1296 — Affects PXE/diskless boot
Red Hat Bugzilla #2111863 — Migrated to Jira, status unknown
dbus-broker #297 — CentOS Stream 9

The pattern is always the same: a user reports the bug, maintainers ask for debug logs, the user either provides them or doesn’t respond fast enough, and the bug expires or gets closed as “not-our-bug.”

The systemd team says it’s a kernel issue. The kernel team… well, I haven’t found anyone from the kernel team actively investigating this.

Root Causes (As Best We Can Tell)

The bug appears to involve:

Race conditions in mount namespace setup — systemd tries to remount /sys and /dev while other unmount operations are happening
Mount propagation issues — systemd changes the default from MS_PRIVATE to MS_SHARED, causing unexpected interactions
Resource exhaustion — sometimes related to inotify limits (fs.inotify.max_user_instances)
Container/virtualization edge cases — more prevalent in LXC/LXD environments

But nobody has done a definitive root cause analysis. The bug is intermittent, hard to reproduce on demand, and affects systems that have been running fine for weeks or months.

The Irony

Remember when /etc/init.d/ scripts “just worked”? When starting a service meant running a shell script that executed a binary?

Systemd brought us dependency management, socket activation, cgroups integration, and dozens of security features like PrivateDevices=, ProtectSystem=, and PrivateTmp=. These are genuinely useful features.

But they also introduced complexity. The namespace isolation that causes this bug exists because systemd creates a private mount namespace for services with security hardening enabled. It’s a feature. Until it breaks.

The old init system didn’t have this bug because it didn’t have namespaces. Services ran in the global namespace. Less secure? Yes. But also fewer moving parts to fail.

Workarounds

If you’re affected, here are your options:

1. Disable namespace isolation for affected services:

sudo systemctl edit your-service.service

[Service]
PrivateDevices=no
ProtectHome=no
ProtectSystem=no

2. Clear corrupted systemd state:

sudo rm -rf /run/systemd/unit-root/
sudo systemctl daemon-reload

3. Increase inotify limits:

echo "fs.inotify.max_user_instances=512" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

4. Monitor and auto-restart:

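# root crontab: every 3 hours, reboot if this boot's journal shows a NAMESPACE failure in that window
0 */3 * * * journalctl -b -q --since "3 hours ago" | grep -q 'status=226/NAMESPACE' && systemctl reboot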

Yes, that last one is a scheduled reboot. That’s where we are.
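A less blunt variant of that check could first report which units actually hit the failure, so the reboot decision is at least logged with a reason. The Go sketch below assumes journal text is piped in on stdin (for example from journalctl -b) and that each failure line has the shape shown in the log excerpt earlier, with the unit name followed by a colon. namespaceFailures is a hypothetical helper, and only .service units are matched.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// marker is the exit status systemd logs when namespace setup fails.
const marker = "status=226/NAMESPACE"

// namespaceFailures returns the distinct service units, in order of first
// appearance, that logged a 226/NAMESPACE exit in the given journal text.
func namespaceFailures(journal string) []string {
	seen := make(map[string]bool)
	var units []string
	sc := bufio.NewScanner(strings.NewReader(journal))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, marker) {
			continue
		}
		// Assumed shape: "... systemd[1]: foo.service: Main process exited, ..."
		for _, field := range strings.Fields(line) {
			if !strings.HasSuffix(field, ".service:") {
				continue
			}
			unit := strings.TrimSuffix(field, ":")
			if !seen[unit] {
				seen[unit] = true
				units = append(units, unit)
			}
			break
		}
	}
	return units
}

func main() {
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read stdin:", err)
		os.Exit(2)
	}
	units := namespaceFailures(string(data))
	for _, u := range units {
		fmt.Println(u)
	}
	if len(units) > 0 {
		os.Exit(1) // non-zero exit signals "failures found" to the caller
	}
}
```

The exit code lets a wrapper script or monitoring agent decide what to do next; whether that is an alert or a reboot is a policy choice outside this sketch.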

What Should Happen

Someone — Canonical, Red Hat, or the systemd team — needs to:

Create a reliable reproduction case
Add instrumentation to capture the exact kernel/systemd state when the failure occurs
Do a proper root cause analysis
Fix it in either the kernel, systemd, or both

Until then, we’re all just rebooting servers and hoping.

Have you encountered this bug? What’s your workaround?

I’d love to hear from anyone who has done deeper investigation or found a permanent fix.