<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Patryk Zawadzki</title>
    <description>The latest articles on Forem by Patryk Zawadzki (@patrys).</description>
    <link>https://forem.com/patrys</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2492456%2Fa525bfa5-4cf8-41cc-8198-386af6f11f98.jpeg</url>
      <title>Forem: Patryk Zawadzki</title>
      <link>https://forem.com/patrys</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/patrys"/>
    <language>en</language>
    <item>
      <title>Building Reliable Software: The Trap of Convenience</title>
      <dc:creator>Patryk Zawadzki</dc:creator>
      <pubDate>Mon, 16 Feb 2026 14:53:09 +0000</pubDate>
      <link>https://forem.com/saleor/building-reliable-software-the-trap-of-convenience-26jo</link>
      <guid>https://forem.com/saleor/building-reliable-software-the-trap-of-convenience-26jo</guid>
      <description>&lt;p&gt;When I started to learn programming a PC (as opposed to programming an Amiga), it was still the 20th century. In the days of yore, when electricity was a novel concept and computer screens had to be illuminated by candlelight in the evenings, we'd use languages like C or Pascal. While the standard libraries of those languages provided most of the needed primitives, it was by no means a "batteries included" situation. And even where a standard library solution existed, we'd still drop to inline assembly for performance-critical sections because those computers were definitely not fast enough to spare any CPU cycles. PCs were also still similar enough that they used the same CPU architecture, and thus the same machine code, so the assembly sections were not that hard to maintain.&lt;/p&gt;

&lt;p&gt;Today, the x86 architecture comes with dozens of optional extensions, and you're not even guaranteed to encounter an "Intel" machine (technically referred to as "amd64"). RISC CPUs are coming back to reclaim computing thanks to the efforts of Apple (Apple Silicon), Amazon (AWS Graviton), Microsoft (Azure Cobalt), and other ARM licensees. In 2026, writing assembly code is something you only do if there is absolutely no other solution. The number of versions of that inline section keeps growing with every new CPU family. Meanwhile, modern compilers have gotten so good at optimizing the resulting machine code, and computers so fast, that manual optimization is usually not worth the effort. Unless it absolutely is.&lt;/p&gt;

&lt;p&gt;So the modern programming languages split. There are systems programming languages that optimize for performance with an extra focus on safety, like Rust. And there are application programming languages that optimize for "productivity", that is, the speed at which we produce useful software, rather than the speed at which said software runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Productivity Through Convenience
&lt;/h2&gt;

&lt;p&gt;Productivity demands higher-order abstractions. Instead of representing how the underlying hardware works, the programming languages and their libraries instead model how people &lt;em&gt;think&lt;/em&gt; about the problems.&lt;/p&gt;

&lt;p&gt;Thanks to this, instead of writing several pages of C code to allocate a send buffer and a receive buffer, open a socket, set its options, resolve the target hostname, establish a connection, and so on, you can fetch and parse a web resource in a few lines of Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With just a few keystrokes I can achieve what used to take me hours just to type out. Thanks to both Python (and C#, and TypeScript, and likely also your favorite language) and the &lt;code&gt;requests&lt;/code&gt; library (and its equivalents) being open-source and available for everyone to use for free, we can all collectively and individually build more complex systems with less effort.&lt;/p&gt;

&lt;p&gt;Except that, as I mentioned in my previous post, &lt;a href="https://dev.to/saleor/building-reliable-software-planning-for-things-to-break-19nh"&gt;it's systems all the way down&lt;/a&gt;. And all those systems make a (conscious or not) choice on what it means to be a reliable tool.&lt;/p&gt;

&lt;p&gt;As a fun exercise, look at the above example and try to figure out what the biggest problem with that bit of code is. It certainly works for the happy path, which would make it pass a lot of the unit tests!&lt;/p&gt;

&lt;p&gt;Let's walk through several (but not all, the complete list would be way too long) of the things that can go wrong in just two lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Litany of Failure Modes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;An error will be raised if the URL is not a valid URL (or not even a string, yay modern languages!).&lt;/li&gt;
&lt;li&gt;An error will be raised if the target hostname cannot be resolved (either the domain does not exist, or your DNS server cannot be reached).&lt;/li&gt;
&lt;li&gt;If the target hostname resolves to multiple IP (IPv6 or IPv4) addresses, then each one is tried in sequence until one accepts the connection on the destination port. Since no timeout is specified, the default system timeout for TCP is used (6 connection attempts totalling about 127 seconds on modern Linux systems) &lt;em&gt;for each individual IP address&lt;/em&gt;. If none of the IP addresses end up accepting, an error is raised.&lt;/li&gt;
&lt;li&gt;If the target system does not speak our desired protocol and responds with random gibberish, an error is raised.&lt;/li&gt;
&lt;li&gt;An error is raised if the protocol is secure (like HTTPS) and the target server does not offer any of the TLS variants we trust.&lt;/li&gt;
&lt;li&gt;If the protocol is secure (like HTTPS) and the target system responds with an invalid TLS certificate (either broken, expired, or not trusted by any of the certificates our system trusts), an error is raised.&lt;/li&gt;
&lt;li&gt;An error is raised if the certificate is valid but does not match the target hostname.&lt;/li&gt;
&lt;li&gt;If the target system stops responding, an error is raised. Since no timeout is specified, the default system TCP read timeout is used (60 seconds on modern Linux systems). If the target system sends &lt;em&gt;anything&lt;/em&gt; during that time window, the system timer will be reset as the read was successful.&lt;/li&gt;
&lt;li&gt;If the response is a valid HTTP redirect response, the process is restarted from step 1, using the redirect URL as the new target URL.&lt;/li&gt;
&lt;li&gt;If we somehow get to this point (as unlikely as it seems) and the response is not a valid JSON string, an error is raised.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Different parts of the above are problematic, or &lt;em&gt;extremely&lt;/em&gt; problematic, depending on what your code is attempting to achieve.&lt;/p&gt;

&lt;p&gt;If your goal is to download a movie to watch, preserve artifacts of a system you're about to delete, or create complete copies of websites for a project like the Internet Archive's Wayback Machine, then chances are you want the code to take all the necessary time, perhaps even multiple attempts instead of giving up on the first transient error. The desired outcome is to access the resource at all costs.&lt;/p&gt;

&lt;p&gt;But if your goal is to figure out if an order is eligible for free delivery, you probably don't want to keep the user waiting for literally &lt;em&gt;minutes&lt;/em&gt; just because some external server crashed. By the time you send your fallback response, the user will be nowhere to be found, having abandoned their order and long since closed the browser tab.&lt;/p&gt;

&lt;p&gt;A web API that takes minutes to respond is a useless one, but the same is true for many other use cases. Imagine having to stand in front of an ATM for several minutes before the machine finally spits your card out with an error. All the while people behind you start to make arrangements for your upcoming funeral.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failing Fast
&lt;/h2&gt;

&lt;p&gt;If you guessed that the problem with the code is that it could crash, you probably guessed wrong. I'm going with "probably", because &lt;em&gt;I&lt;/em&gt; don't know what &lt;em&gt;your&lt;/em&gt; use case is. But most systems handle failing fast rather gracefully.&lt;/p&gt;

&lt;p&gt;A simple &lt;code&gt;try&lt;/code&gt;/&lt;code&gt;except&lt;/code&gt; block wrapped around the call to our function could take care of specifying the fallback behavior. And even if that is absent, the underlying framework is likely built to withstand the failure and return an error instead of crashing, like in the old times. What it can't do is rewind the &lt;em&gt;time&lt;/em&gt; it took the code to fail.&lt;/p&gt;
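
&lt;p&gt;As a sketch (not the original snippet from above), here is what a fail-fast variant of &lt;code&gt;fetch_json&lt;/code&gt; could look like; the timeout values and the fallback policy are illustrative, not recommendations:&lt;/p&gt;

```python
import requests

def fetch_json_or_default(url, default=None, timeout=(3.05, 10)):
    # Bound both the connect and the read phase explicitly instead of
    # relying on operating system defaults that can stretch to minutes.
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP error statuses as failures
        return response.json()
    except (requests.RequestException, ValueError):
        # DNS errors, refused connections, timeouts, broken TLS, and
        # non-JSON bodies all fall back fast instead of crashing.
        return default
```

&lt;p&gt;Note that the read timeout in &lt;code&gt;requests&lt;/code&gt; applies to each individual socket read, not to the whole response, so a slow-drip server can still stretch the total time.&lt;/p&gt;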

&lt;h2&gt;
  
  
  Resisting Abuse
&lt;/h2&gt;

&lt;p&gt;You can't have reliability without at least some resilience (but I guess &lt;em&gt;failing reliably&lt;/em&gt; is also a form of consistency). So you need to teach the system how to defend itself against undesirable behaviors. Some of them are outright malicious, some not.&lt;/p&gt;

&lt;p&gt;In the above example, an extremely malicious behavior would be to take a domain name and configure its zone record to resolve to 511 different IP addresses, all from non-routable network segments such as &lt;code&gt;192.168.0.0/16&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Or to have a domain resolve to 9 non-routable IPs and one that returns an HTTP redirect to the same domain.&lt;/p&gt;

&lt;p&gt;Or to point the URL to a server that streams the response by sending one byte every 50 seconds, thus never triggering a read timeout.&lt;/p&gt;

&lt;p&gt;If those numbers sound oddly specific, it's because we did all those things internally at Saleor. I don't remember why 511 was the number of IPs we went for; maybe Cloudflare didn't allow more records to be added, maybe it didn't matter because no one was going to wait for that test to time out anyway.&lt;/p&gt;

&lt;p&gt;But a malicious actor could also ask your system to access a URL of an internal system they can't access directly. If the URL comes from an untrusted source, it could be used to probe your internal network for open ports, based on the error codes you send back. And if your system is foolish enough to show the entire "unexpected response" from such a URL, it could also be used to steal your credentials.&lt;/p&gt;

&lt;p&gt;Did you know that any EC2 instance on AWS can make a request to &lt;code&gt;http://169.254.169.254/latest/meta-data/&lt;/code&gt; to learn about its own roles? And that a subsequent call to &lt;code&gt;http://169.254.169.254/latest/meta-data/iam/security-credentials/&amp;lt;role-id&amp;gt;/&lt;/code&gt; returns both an AWS access key and its corresponding secret? Yikes on leaking that!&lt;/p&gt;
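
&lt;p&gt;A minimal sketch of a guard against this class of server-side request forgery, using only Python's standard library (the function name is made up for illustration):&lt;/p&gt;

```python
import ipaddress
import socket

def is_safe_target(hostname):
    # Hypothetical SSRF guard: resolve the hostname and reject anything
    # that lands in a private, loopback, or link-local range (the AWS
    # metadata endpoint 169.254.169.254 is link-local, for example).
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return bool(infos)
```

&lt;p&gt;A real guard also has to pin the resolved address for the actual connection; otherwise a malicious DNS server can return a safe answer for the check and a private one for the request (a classic DNS rebinding attack).&lt;/p&gt;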

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Convenience is the biggest pitfall of modern high-abstraction productivity. All the important bits and compromises are buried deep in the convenience layers, making it impossible to reason about systems without popping the hood. Meanwhile, your IDE, your code review tools, and your whiteboard interviews surface the types of problems that—in the grand scheme of things—don't matter all that much: the ones your system can recover from automatically.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you ever find yourself accessing &lt;em&gt;the great unknown&lt;/em&gt; from Python code, take a look at the &lt;a href="https://github.com/saleor/requests-hardened" rel="noopener noreferrer"&gt;&lt;code&gt;requests-hardened&lt;/code&gt;&lt;/a&gt; wrapper we created for &lt;code&gt;requests&lt;/code&gt;. It makes it safe to point the library at untrusted URLs from code that doesn't have forever to wait for the outcome. It also works around a &lt;a href="https://github.com/python/cpython/issues/106283" rel="noopener noreferrer"&gt;DoS potential in Python's standard library&lt;/a&gt; that we also reported (responsibly, it's only public because we were asked by the maintainers to make it public).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Make sure your team doesn't mistake simple code for the simplicity of the underlying systems. The reassurance offered by complexity becoming invisible is a false one.&lt;/p&gt;

&lt;p&gt;Happy failures. Farewell and until next time!&lt;/p&gt;

</description>
      <category>sre</category>
      <category>architecture</category>
      <category>security</category>
      <category>software</category>
    </item>
    <item>
      <title>Building Reliable Software: Planning for Things to Break</title>
      <dc:creator>Patryk Zawadzki</dc:creator>
      <pubDate>Fri, 13 Feb 2026 15:18:24 +0000</pubDate>
      <link>https://forem.com/saleor/building-reliable-software-planning-for-things-to-break-19nh</link>
      <guid>https://forem.com/saleor/building-reliable-software-planning-for-things-to-break-19nh</guid>
<description>&lt;p&gt;We often joke that software is usually implemented in two steps: the first 80% of the time is spent on making it work, and then the second 80% of the time is spent on making it work well. People mistake demos, proofs-of-concept, and walking skeletons for products because the optimistic path is often realized in full, so under ideal lab conditions, a PoC behaves just like the full product.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://saleor.io/" rel="noopener noreferrer"&gt;Saleor&lt;/a&gt;, where I act as a CTO, we spend a significant part of our engineering effort embracing the different failure states and making sure the unhappy paths are covered as well as the happy ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embracing the Failure
&lt;/h2&gt;

&lt;p&gt;Because it does not matter how good your software is or how expensive your hardware is, something will eventually break. The only systems that never break are the ones that are never used. Amazon's AWS spends more money on preventive measures than you will ever be able to, and yet a major outage took out the entire &lt;code&gt;us-east-1&lt;/code&gt; region just last October. In 2016, the world's largest particle accelerator, CERN's Large Hadron Collider, was taken offline by a single weasel. Google's Chromecast service was down for days because someone forgot to renew an intermediate CA certificate, something that needs to be done once every 10 years.&lt;/p&gt;

&lt;p&gt;The question is not &lt;em&gt;if&lt;/em&gt; but &lt;em&gt;when&lt;/em&gt;. Reliability is both about pushing that point as far as practically possible and about planning what happens when it inevitably comes. And both suffer from brutally diminishing returns.&lt;/p&gt;

&lt;p&gt;Every additional "nine" in your uptime—getting from 90% to 99%, from 99% to 99.9%, and so on—requires ten times the resources of the previous one. Getting from one nine to two is usually trivial and gives you roughly 33 days of additional uptime per year. The next step is ten times as much work for only 3.2 additional days. Then it's even more expensive and results in just under 8 hours of additional uptime. You then get to 47 minutes, 4.7 minutes, 37 seconds, and so on. At some point, the cost of getting to the next step exceeds the losses from being unavailable.&lt;/p&gt;
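
&lt;p&gt;The arithmetic behind those numbers is simple enough to sketch (the helper name is made up):&lt;/p&gt;

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(nines):
    # Two nines means 99% uptime, three nines 99.9%, and so on.
    unavailability = 10 ** -nines
    return unavailability * HOURS_PER_YEAR

# One nine allows 876 hours of downtime a year, two nines 87.6,
# three nines 8.76: each step removes 90% of what remains, while
# the effort roughly multiplies by ten.
```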

&lt;p&gt;It's similar with your firefighting tools. You can get from multiple business days down to one per simple fix with relatively simple measures. It takes more expensive tools, stricter procedures, and paid on-call duty to guarantee a same-day attempt at fixing. Shortening it further requires investing in even more specialized (and costlier) tools, better training for engineers, and a lot of upfront work on observability. And again, at some point the cost of lowering the downtime even further is guaranteed to exceed the cost of any prevented downtime.&lt;/p&gt;

&lt;p&gt;Some of the component failures you'll encounter will be self-inflicted. Because one day you'll discover that a database server needs to be brought offline to upgrade it to the newer version that fixes a critical CVE.&lt;/p&gt;

&lt;p&gt;Given all the above, the pragmatic approach dictates that instead of trying to achieve the impossible, we should build systems that anticipate failures, and, ideally, recover from them without human intervention. While every component's availability is capped by the product of all the SLOs of its direct dependencies, the larger system can be built to tolerate at least some failing components.&lt;/p&gt;
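
&lt;p&gt;That cap is just multiplication; a quick sketch makes it concrete:&lt;/p&gt;

```python
from math import prod

def availability_ceiling(dependency_slos):
    # A component that needs all of its direct dependencies up at once
    # can be available at most the product of their availabilities.
    return prod(dependency_slos)

# Three respectable 99.9% dependencies already cap the component
# below 99.8%, before counting any failures of its own.
```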

&lt;h2&gt;
  
  
  The CAP Theorem
&lt;/h2&gt;

&lt;p&gt;The CAP theorem dictates that any distributed stateful system can achieve at most two of the three guarantees: consistency, availability, and partition tolerance.&lt;/p&gt;

&lt;p&gt;What is a distributed stateful system? Anything that stores any data and consists of more than one component. A shell script accessing a database is such a system, and so is a Kubernetes service talking to a serverless database.&lt;/p&gt;

&lt;p&gt;The consistency guarantee demands that every time the system returns data, it either returns the most up-to-date data, or the read fails. Under no circumstances can the system return a stale copy as doing so could break an even larger system for which your system is a dependency.&lt;/p&gt;

&lt;p&gt;The availability guarantee dictates that if the system receives a request, it must not fail to provide a response.&lt;/p&gt;

&lt;p&gt;Partition tolerance means the system needs to remain fully operational even if some of its components are unable to communicate with some other components.&lt;/p&gt;

&lt;p&gt;I think it's clear that it's impossible for a system to always return the latest data and never return an error while it can't reach its main database. That's why you can only pick two of the virtues and in most cases it's only practical to achieve one.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's Systems All the Way Down
&lt;/h2&gt;

&lt;p&gt;It's also important to note that any complex solution is usually a multitude of smaller systems in a trench coat. You can have systems within systems, and you can pick different corners of the CAP triangle for every individual subsystem.&lt;/p&gt;

&lt;p&gt;A practical example may be an online store that uses an external system to figure out if a given order qualifies for free shipping. The free shipping decision is delegated to a third-party system, a black box only accessed through its API. The order lines and the cost of regular shipping are stored in some sort of a database, and the storefront is backed by a web service that needs to return the valid shipping methods.&lt;/p&gt;

&lt;p&gt;Now we have the following systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The external shipping discount service that we don't control and that can provide any combination of the CAP guarantees. Whatever it does is beyond our reach.&lt;/li&gt;
&lt;li&gt;Our internal free shipping eligibility service that depends on the database (as it needs to be able to send the cart contents) and the external service (as it needs to receive the response).&lt;/li&gt;
&lt;li&gt;Our public web service that tells the storefront what shipping methods are available that depends on our internal free shipping eligibility service and the database (to figure out the cost of regular shipping).&lt;/li&gt;
&lt;li&gt;The entire store that depends on the storefront running in the client's browser being able to communicate, over the internet, with our public web service.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since we can't do much about the external system (and if it goes down, fixing it is beyond our reach), we can make the pragmatic decision to make any system that depends on it focus on partition tolerance. For example, we could decide that if the external system can't be reached, any order is eligible for free shipping. This way, when the external system inevitably goes down, we can err on the side of generosity and lose some money on shipping but keep our store transactional (which usually more than makes up for the shipping cost). We could also decide the opposite, that if the service is down, no order can be shipped for free, potentially upsetting some customers, but still taking orders from everyone else.&lt;/p&gt;
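
&lt;p&gt;Either policy fits in a few lines once it's an explicit decision rather than an accident. A sketch of the "err on the side of generosity" variant (all names here are hypothetical):&lt;/p&gt;

```python
def qualifies_for_free_shipping(order, check_with_provider, timeout_seconds=2.0):
    # `check_with_provider` stands in for the call to the external
    # discount service; it may raise on timeouts or connection errors.
    try:
        return check_with_provider(order, timeout=timeout_seconds)
    except Exception:
        # The external system is unreachable: stay transactional and
        # eat the shipping cost rather than block the checkout.
        return True
```

&lt;p&gt;The opposite policy is the same code with &lt;code&gt;return False&lt;/code&gt; in the fallback branch; what matters is that someone chose it on purpose.&lt;/p&gt;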

&lt;h2&gt;
  
  
  Better Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;I think it's clear that whichever way we choose is preferable to the entire store becoming unavailable and thus accepting no orders at all.&lt;/p&gt;

&lt;p&gt;If we broaden partition tolerance to general fault tolerance, we can design systems that are internally as fault-tolerant as is pragmatic and externally as available as practically possible. This prevents cracks from propagating from component to component, which gives the larger system a chance of staying transactional even while some of its individual subsystems struggle to stay online.&lt;/p&gt;

&lt;p&gt;Fault tolerance can be achieved through documented fallbacks and software design patterns. It's a process that needs to start during the design stages as it's not easy to bolt onto an existing system. All external communication has to be safeguarded and time-boxed, with timeouts short enough not to grind the larger system to a halt. Repeated failures can temporarily eliminate the external dependency through patterns like the circuit breaker.&lt;/p&gt;
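
&lt;p&gt;A deliberately minimal sketch of the circuit breaker idea; real implementations (and the libraries that provide them) also reopen the circuit automatically after a cool-down period:&lt;/p&gt;

```python
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, func, *args, **kwargs):
        # Once the failure budget is spent, fail immediately instead of
        # making the caller wait through yet another timeout.
        if self.failures == self.max_failures:
            raise RuntimeError("circuit open, skipping the dependency")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures = min(self.failures + 1, self.max_failures)
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```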

&lt;p&gt;High availability is usually achieved through redundancy. If a single component has a 1% chance of randomly failing, adding a second duplicate as a fallback reduces that chance to 0.01%. With proper load balancing, it also provides additional capacity and is a first step toward auto-scaling. Of course, failure is rarely &lt;em&gt;truly&lt;/em&gt; random and is often tied to the underlying hardware or other components, so those, too, may need to be made redundant. Multi-zone or multi-region deployments, database clustering: those are all tools that let you lower the chance of things going south at the expense of hard-earned cash.&lt;/p&gt;
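
&lt;p&gt;The redundancy math mirrors the SLO math, under the (optimistic) assumption that the failures really are independent:&lt;/p&gt;

```python
def combined_failure_chance(per_copy_chance, copies):
    # The whole redundant set only fails if every copy fails at once,
    # which assumes the copies don't share a common failure cause.
    return per_copy_chance ** copies

# A 1% chance per copy drops to 0.01% with a single duplicate.
```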

&lt;p&gt;It's up to you to figure out the sweet spot that offers you relative peace of mind while still keeping the operational expenses below the potential losses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Healing Systems
&lt;/h2&gt;

&lt;p&gt;Given that we can't fully prevent components from failing, what if we at least eliminated the necessity of a human tending to them once they do? A self-healing system is one that is designed to recover from failures without external intervention. I'm not talking about self-adapting code paths that the prophets of AGI promise; I'm talking about automatic retry mechanisms, dead letter queues for unprocessable events, and robust work queues that guarantee at-least-once delivery.&lt;/p&gt;
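
&lt;p&gt;The simplest of these mechanisms, a retry with exponential backoff, fits in a few lines (a sketch, not production code; it retries on any exception, which is only safe for idempotent operations):&lt;/p&gt;

```python
import time

def retry(operation, attempts=3, base_delay=0.1):
    # Retry an idempotent operation, doubling the pause between
    # attempts, and re-raise the last error once the budget runs out.
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```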

&lt;p&gt;A good system is one that fails in a predictable manner and recovers from the failure in a similarly predictable manner. Eventual consistency is much easier to achieve than immediate consistency. Exactly-once delivery is often impossible to guarantee but at-least-once beats at-most-once under most circumstances.&lt;/p&gt;

&lt;p&gt;Design your systems with idempotency in mind so it's safe to retry partial successes. Use fair queues to prevent a single noisy task from adding hours of wait time for all its neighbors. Treat every component as if it were malfunctioning or outright malicious and ask yourself, "How can I have the system not only tolerate this but also fully recover?"&lt;/p&gt;

&lt;p&gt;Perhaps the most extreme version of this is the Chaos Monkey from Netflix, a tool designed to break your system's components in a controlled yet unpredictable way. The engineers behind Chaos Monkey theorized that in a system designed around reliability, the actions of the Monkey should be completely invisible from the outermost system's perspective. True, with an asterisk: if you get anything wrong, your services are down and you're losing money. Perhaps not everyone can afford that.&lt;/p&gt;

&lt;p&gt;And getting it right is often more about being smart than clever. The self-healing part could be as easy as implementing a health check and restarting the component. Or it could mean dropping the cache if you're unable to deserialize its contents, because maybe you forgot that caches can persist across schema changes. Or even restarting your HTTP server every 27 requests while you're figuring out why the 29th request always causes it to crash. Observe your systems and learn from their failures, adding preventive measures for similar classes of future problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remain Vigilant
&lt;/h2&gt;

&lt;p&gt;In 2026, perhaps more than ever, remain vigilant. With the advent of generative AI, some parts of your service will likely end up being written by an LLM. That model, like all models, was trained on a large corpus of code, both purely commercial and Open Source. You have to remember that most of this code, even if it didn't completely neglect its reliability engineering homework, may have vastly different assumptions about where it stands with regard to the CAP theorem.&lt;/p&gt;

&lt;p&gt;You cannot blindly transplant code from one project to another, from an AI chatbot, or from a StackOverflow answer, without also consciously asking yourself, "How does this code anticipate and deal with failures? And does it fit my goals for this particular subsystem?"&lt;/p&gt;

&lt;p&gt;Happy failures. Farewell and until next time!&lt;/p&gt;

</description>
      <category>sre</category>
      <category>webdev</category>
      <category>architecture</category>
      <category>software</category>
    </item>
  </channel>
</rss>
