<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Saeed Habibi</title>
    <description>The latest articles on Forem by Saeed Habibi (@saeedhbi).</description>
    <link>https://forem.com/saeedhbi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3688402%2Fda099745-a61d-40a6-b626-0de07ed04e04.jpg</url>
      <title>Forem: Saeed Habibi</title>
      <link>https://forem.com/saeedhbi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/saeedhbi"/>
    <language>en</language>
    <item>
      <title>Reversibility Decays: What Bezos’s Framework Misses About Technical Decisions</title>
      <dc:creator>Saeed Habibi</dc:creator>
      <pubDate>Tue, 03 Feb 2026 14:36:31 +0000</pubDate>
      <link>https://forem.com/saeedhbi/reversibility-decays-what-bezoss-framework-misses-about-technical-decisions-4kna</link>
      <guid>https://forem.com/saeedhbi/reversibility-decays-what-bezoss-framework-misses-about-technical-decisions-4kna</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr7min0qpscj6ffcvr3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr7min0qpscj6ffcvr3w.png" alt="Reversibility: The Most Underrated Property of Good Architecture" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Architecture of Decisions, Part 2&lt;/h2&gt;

&lt;p&gt;In Part 1, I argued that every technical decision is a bet on a future you cannot fully see. But not all bets are equal. Some bets you can walk back when new information arrives. Others lock you in the moment you commit.&lt;/p&gt;

&lt;p&gt;The difference is reversibility. And it matters more than whether your decision is correct.&lt;/p&gt;

&lt;p&gt;Jeff Bezos talks about one-way doors and two-way doors. A two-way door is a decision you can reverse. Walk through, look around, walk back if you don’t like what you see. A one-way door locks behind you. The cost of reversal is so high that you effectively cannot reverse.&lt;/p&gt;

&lt;p&gt;The insight is simple. The application is not, because most decisions don’t announce which type of door they are. And the consequences of getting it wrong compound over years.&lt;/p&gt;

&lt;h2&gt;The Spectrum of Reversibility&lt;/h2&gt;

&lt;p&gt;When engineers talk about reversible decisions, they usually mean “technically possible to undo.” And by that definition, almost everything is reversible. You can rewrite the service. You can migrate the database. You can deprecate the API. Given enough time and resources, you can reverse almost any technical decision.&lt;/p&gt;

&lt;p&gt;But that framing misses the point entirely.&lt;/p&gt;

&lt;p&gt;The question isn’t whether you can reverse a decision. It’s what reversal actually costs. Time. Money. Team morale. User trust. Opportunity cost of everything else you’re not doing while you’re unwinding a previous choice.&lt;/p&gt;

&lt;p&gt;I think of reversibility as a spectrum with several dimensions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct cost of reversal.&lt;/strong&gt; How much engineering effort does unwinding require? A feature flag toggle is nearly free. A database migration across billions of rows is months of work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indirect cost of reversal.&lt;/strong&gt; What else breaks when you reverse? Changing an internal implementation might have no effect. Changing a public API might break hundreds of integrations you don’t control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time sensitivity.&lt;/strong&gt; Does the reversal cost stay constant or grow? Some decisions get harder to undo over time. Data accumulates. Dependencies form. Teams build on top of your choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coordination cost.&lt;/strong&gt; Can you reverse unilaterally, or do you need alignment across teams, organizations, or external parties? Solo decisions are cheap to reverse. Decisions that create contracts with others are expensive.&lt;/p&gt;

&lt;p&gt;Most technical decisions sit somewhere on this spectrum. The skill is recognizing where and adjusting your decision-making process accordingly.&lt;/p&gt;

&lt;h2&gt;Why Engineers Misjudge Reversibility&lt;/h2&gt;

&lt;p&gt;There’s a pattern I’ve seen repeatedly. Engineers treat irreversible decisions as reversible, and reversible decisions as irreversible. Both mistakes are costly, but in different ways.&lt;/p&gt;

&lt;p&gt;Treating irreversible decisions as reversible leads to moving too fast on choices that will haunt you. “We can always change it later” becomes the mantra that justifies insufficient analysis. Then later arrives, and changing it turns out to require a six-month migration that nobody wants to prioritize.&lt;/p&gt;

&lt;p&gt;Treating reversible decisions as irreversible leads to analysis paralysis. Teams spend weeks debating choices that could be tested and adjusted in days. They build elaborate decision matrices for problems that would resolve themselves with a quick experiment.&lt;/p&gt;

&lt;p&gt;I think the misjudgment happens because engineers focus on technical possibilities rather than practical cost. Yes, you can change the database schema. But will you? When the migration requires coordinated downtime across three services and a month of testing, the answer is usually no.&lt;/p&gt;

&lt;p&gt;The decisions that don’t get reversed aren’t the ones that can’t be reversed. They’re the ones where the cost of reversal exceeds the pain of living with the original choice.&lt;/p&gt;

&lt;h2&gt;The Four Questions&lt;/h2&gt;

&lt;p&gt;When I’m evaluating any significant technical decision, I ask four questions about reversibility:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does unwinding this look like?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not abstractly. Concretely. If we decide this is wrong in six months, what are the actual steps to reverse it? Who needs to be involved? What systems need to change? What data needs to be migrated?&lt;/p&gt;

&lt;p&gt;If you can’t articulate a concrete reversal path, you’re probably underestimating the difficulty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does the reversal cost compare to the original decision?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the original decision takes two weeks to implement, and reversal would take two days, that’s a genuine two-way door. If reversal would take six months, you’re looking at something closer to a one-way door, regardless of what you call it.&lt;/p&gt;

&lt;p&gt;The ratio matters. A 10:1 reversal-to-implementation ratio should trigger much more careful analysis than a 1:1 ratio.&lt;/p&gt;
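&lt;p&gt;A back-of-the-envelope sketch of that ratio check. The numeric thresholds below are my own illustrative cut-offs, not rules from this article:&lt;/p&gt;

```python
def door_type(implement_days: float, reverse_days: float) -> str:
    """Classify a decision by its reversal-to-implementation ratio.

    Thresholds are illustrative: a reversal costing many multiples
    of the original work behaves like a one-way door, whatever you
    call it.
    """
    ratio = reverse_days / implement_days
    if ratio >= 10:
        return "one-way door: slow down, analyze carefully"
    if ratio >= 3:
        return "decaying door: watch the reversal window"
    return "two-way door: decide quickly, adjust later"

# Two weeks to implement, two days to reverse: a genuine two-way door.
print(door_type(10, 2))
```

&lt;p&gt;The value of writing it down, even this crudely, is that it forces you to estimate both numbers instead of just the first one.&lt;/p&gt;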

&lt;p&gt;&lt;strong&gt;Does the reversal cost increase over time?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the ratchet effect, and it’s the most commonly overlooked dimension. Some decisions have stable reversal costs. Changing an internal algorithm costs about the same whether you do it next month or next year.&lt;/p&gt;

&lt;p&gt;But many decisions have increasing reversal costs. The database schema gets harder to change as data accumulates. The API contract gets harder to break as more consumers depend on it. The service boundary gets harder to move as more teams build around it.&lt;/p&gt;

&lt;p&gt;For decisions with increasing reversal costs, you have a window. The longer you wait to reverse, the more expensive it becomes. Eventually, the cost exceeds any realistic budget, and the “reversible” decision becomes permanent by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What information would make you want to reverse?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This question forces you to think about learning. If you knew X, would you change this decision? If yes, how will you learn X? How long until you know?&lt;/p&gt;

&lt;p&gt;Decisions with a short learning timeline are good candidates for fast action. Ship it, learn it, adjust it. Decisions where the learning timeline is long or where you might never know the outcome deserve more upfront analysis.&lt;/p&gt;

&lt;h2&gt;One-Way Doors in Technical Architecture&lt;/h2&gt;

&lt;p&gt;Some decisions are genuinely hard to reverse. Not impossible, but expensive enough that reversal is unlikely to happen even when it should.&lt;/p&gt;

&lt;p&gt;Public API contracts are the classic example. Once external developers build against your API, changing it breaks their software. You can version it, but now you’ll have to maintain multiple versions indefinitely. You can deprecate it, but “deprecation” often means “running forever because someone important still uses it.”&lt;/p&gt;

&lt;p&gt;The API surface you ship becomes a commitment. Not because you can’t change it technically, but because the coordination cost of changing it (communicating with consumers, managing migration timelines, handling the ones who never migrate) exceeds the cost of living with the design.&lt;/p&gt;

&lt;p&gt;Database schemas for production data sit in a similar territory. The schema itself is easy to change. The data is the problem. Once you have millions of rows in a particular shape, reshaping them requires migration. Migrations require testing. Testing requires representative data. Large migrations require batching, monitoring, and rollback plans.&lt;/p&gt;

&lt;p&gt;I’ve seen teams avoid schema changes for years because the migration effort seemed disproportionate to the benefit. The original schema decision, made when the table had zero rows, became effectively permanent once it had a billion.&lt;/p&gt;

&lt;p&gt;Service boundaries can surprise you. They feel reversible because services are just software. You can merge, split, or reorganize them. But service boundaries create organizational boundaries. Teams form around services. Ownership models emerge. Other teams build integrations assuming your service exists.&lt;/p&gt;

&lt;p&gt;Changing a service boundary isn’t just a technical change. It’s an organizational change. And organizations are much more complex to refactor than code.&lt;/p&gt;

&lt;p&gt;Data deletion is the one truly irreversible decision in computing. Everything else can theoretically be reconstructed given enough effort. Deleted data is gone. (Yes, backups exist. But if you deleted it intentionally, your backups will eventually cycle out too.)&lt;/p&gt;

&lt;h2&gt;Two-Way Doors Worth Recognizing&lt;/h2&gt;

&lt;p&gt;On the other end of the spectrum, many decisions are more reversible than teams treat them.&lt;/p&gt;

&lt;p&gt;Internal implementation details behind stable interfaces can usually change freely. If your API contract stays constant, consumers don’t care whether you’re using PostgreSQL or MongoDB underneath. The database choice might feel momentous, but if it’s hidden behind a well-designed abstraction, it’s surprisingly reversible.&lt;/p&gt;

&lt;p&gt;I’ve seen teams spend months evaluating database options for services that would take two weeks to rewrite entirely if the choice turned out wrong.&lt;/p&gt;

&lt;p&gt;Feature flags are explicitly designed to make decisions reversible. Ship the feature to 1% of users. Learn. Adjust. Roll back if needed. The whole point is to reduce the cost of reversal to nearly zero.&lt;/p&gt;

&lt;p&gt;Teams that don’t use feature flags effectively are giving up one of the most powerful tools for treating decisions as two-way doors.&lt;/p&gt;
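&lt;p&gt;A minimal sketch of such a rollout gate, assuming a hand-rolled hash bucket rather than any particular feature-flag library:&lt;/p&gt;

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """Deterministic percentage rollout: hash the user into a stable
    bucket from 0 to 99, so each user always sees the same variant."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return percent > bucket

# Reversal is a config change, not a deploy: set percent back to 0.
show_new_checkout = in_rollout("user-42", "new-checkout", 1)
```

&lt;p&gt;Because the bucket is derived from the user ID, ramping from 1% to 10% only adds users to the rollout; nobody flips back and forth between variants as the percentage moves up.&lt;/p&gt;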

&lt;p&gt;Configuration and tuning decisions are usually cheap to change. Timeout values, retry counts, cache sizes, and thread pool configurations. These are knobs, not architecture. Turn them, observe results, adjust.&lt;/p&gt;

&lt;p&gt;Yet I’ve seen teams debate configuration decisions with the intensity reserved for fundamental architectural choices. It’s a misallocation of decision-making energy.&lt;/p&gt;
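&lt;p&gt;Treating these knobs as environment-driven configuration keeps the adjustment loop cheap. The names and defaults below are illustrative, not from any particular system:&lt;/p&gt;

```python
import os

# Tuning knobs read from the environment: turning one is an
# observe-and-adjust loop, not an architectural commitment.
HTTP_TIMEOUT_SECONDS = float(os.getenv("HTTP_TIMEOUT_SECONDS", "5.0"))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
CACHE_TTL_SECONDS = int(os.getenv("CACHE_TTL_SECONDS", "300"))
```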

&lt;p&gt;Library and framework choices, with good abstraction, can be more reversible than they appear. If your business logic is cleanly separated from your framework, swapping frameworks is localized work. If your business logic is entangled with framework specifics, you’ve accidentally converted a two-way door into a one-way door.&lt;/p&gt;

&lt;p&gt;The reversibility of library choices depends on how you use them, not on the library itself.&lt;/p&gt;

&lt;h2&gt;The Hidden Cost of “Reversible”&lt;/h2&gt;

&lt;p&gt;Here’s something I’ve noticed that complicates the model: decisions that are technically reversible but practically permanent.&lt;/p&gt;

&lt;p&gt;You can reverse them. The engineering cost is reasonable. The coordination is manageable. But you don’t. Because reversing a decision admits the original decision was wrong. Because reversing requires someone to champion the reversal. Because the pain of the current state is distributed across many people, while the effort to reverse it would be concentrated on a few.&lt;/p&gt;

&lt;p&gt;These zombie decisions, reversible but never reversed, are everywhere. The “temporary” solution has been running for three years. The migration that was planned but never prioritized. The deprecated system that still handles 30% of traffic.&lt;/p&gt;

&lt;p&gt;Reversibility on paper means nothing if the organization lacks the will or the mechanisms to actually reverse decisions when they should be reversed.&lt;/p&gt;

&lt;p&gt;I’ve started thinking about this in terms of effective reversibility versus theoretical reversibility. Theoretical reversibility asks, “Can we?” Effective reversibility asks, “Will we, given how our organization actually makes decisions?”&lt;/p&gt;

&lt;h2&gt;Designing for Reversibility&lt;/h2&gt;

&lt;p&gt;If reversibility is so valuable, how do you build systems that preserve it?&lt;/p&gt;

&lt;p&gt;Put abstraction boundaries around decisions. The narrower the blast radius of a decision, the cheaper the reversal. If your database choice is hidden behind a repository interface, changing databases is localized. If SQL queries are scattered throughout your codebase, you’ve welded yourself to that database.&lt;/p&gt;

&lt;p&gt;Every architectural boundary is a potential reversal point. Design boundaries with future reversibility in mind, not just current separation of concerns.&lt;/p&gt;
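&lt;p&gt;A sketch of that repository boundary, with hypothetical names (nothing here is from a real codebase):&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class UserRepository(ABC):
    """Hypothetical boundary: call sites depend on this interface,
    never on a specific database client."""

    @abstractmethod
    def get(self, user_id: str) -> dict: ...

    @abstractmethod
    def save(self, user: dict) -> None: ...

class InMemoryUserRepository(UserRepository):
    """Swapping this for a Postgres- or Mongo-backed implementation
    changes one class, not every call site."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def get(self, user_id: str) -> dict:
        return self._rows[user_id]

    def save(self, user: dict) -> None:
        self._rows[user["id"]] = user
```

&lt;p&gt;The point is the interface, not the in-memory toy: the narrower this surface, the cheaper a future database reversal becomes.&lt;/p&gt;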

&lt;p&gt;Delay irreversible decisions as long as you responsibly can. Not indefinitely. Not past the point where the decision is needed. But recognize that information arrives over time. A decision made with more information is usually a better decision.&lt;/p&gt;

&lt;p&gt;This doesn’t mean avoiding decisions. It means distinguishing between “we need to decide this now” and “we’re deciding this now because deciding feels productive.”&lt;/p&gt;

&lt;p&gt;Use the strangler fig pattern for migrations. When you need to reverse a decision that’s become entangled, don’t try to flip it all at once. Build the new approach alongside the old. Migrate traffic gradually. Let the old system shrink until it can be removed.&lt;/p&gt;

&lt;p&gt;The strangler fig turns a one-way door into a series of small two-way doors. Each step is reversible, even if the overall direction is committed.&lt;/p&gt;
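&lt;p&gt;The traffic dial at the heart of the pattern can be sketched like this, assuming hypothetical old_system and new_system entry points:&lt;/p&gt;

```python
import random

def old_system(request: dict) -> str:   # legacy path, shrinking over time
    return "old"

def new_system(request: dict) -> str:   # replacement path, growing
    return "new"

def handle(request: dict, migrated_fraction: float) -> str:
    """Route a configurable fraction of traffic to the new system.

    Each step of the dial is reversible: turning migrated_fraction
    back down restores the old path with no code change.
    """
    if migrated_fraction > random.random():
        return new_system(request)
    return old_system(request)
```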

&lt;p&gt;Make reversibility explicit in decision records. When documenting an architectural decision, include the reversal path. What would trigger reconsideration? What would unwinding look like? What’s the expected cost?&lt;/p&gt;

&lt;p&gt;Writing this down forces you to think about it. It also creates a record that future teams can reference when they’re considering whether to reverse.&lt;/p&gt;
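&lt;p&gt;One possible shape for such a record, with illustrative field names and an invented example rather than any standard ADR schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    """A decision record extended with an explicit reversal section.
    Field names and the example below are illustrative only."""
    title: str
    decision: str
    reversal_path: str                    # concrete steps to unwind
    reversal_triggers: list = field(default_factory=list)
    estimated_reversal_cost: str = "unknown"

adr = DecisionRecord(
    title="Adopt an append-only event log for orders",
    decision="Store order state as events; project read models from them.",
    reversal_path="Materialize a relational snapshot, repoint readers, drop the log.",
    reversal_triggers=["replay time dominates deploys", "on-call debugging cost grows"],
    estimated_reversal_cost="roughly six engineer-weeks (a guess; revisit quarterly)",
)
```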

&lt;h2&gt;The API Contract We Couldn’t Change&lt;/h2&gt;

&lt;p&gt;I want to tell you about a decision I got wrong. Not wrong in the “made the wrong choice” sense. Wrong in the “misjudged the reversibility” sense.&lt;/p&gt;

&lt;p&gt;We were building a public API for a platform. Moving fast, early stage, lots of uncertainty. The mantra was iterate quickly, learn from users, adjust based on feedback.&lt;/p&gt;

&lt;p&gt;We shipped the first version of the API with a response structure that made sense at the time. Resources had IDs, attributes, and relationships. Standard stuff. But we made some choices about naming and nesting that reflected our internal domain model, not the mental model of our API consumers.&lt;/p&gt;

&lt;p&gt;Within a few weeks, we realized some of the naming was confusing. The nesting made certain common operations awkward. We had better ideas.&lt;/p&gt;

&lt;p&gt;But by the time we had the better design ready, a few dozen developers had built against the original API. Not a huge number. Early adopters, mostly. But they’d written code. Their code worked. Changing our API would break their code.&lt;/p&gt;

&lt;p&gt;We had a choice: break the early adopters and ship the better design, or preserve compatibility and live with the awkward design.&lt;/p&gt;

&lt;p&gt;We chose compatibility. It felt like the responsible choice. We didn’t want to punish the people who’d trusted us early.&lt;/p&gt;

&lt;p&gt;But here’s what I didn’t fully appreciate: that decision was itself an irreversible decision. Every day we kept the original API, more developers built against it. The cost of changing grew continuously. What was dozens of developers became hundreds. What was hundreds became thousands.&lt;/p&gt;

&lt;p&gt;Three years later, we still had that original API structure. We’d added a “v2” for new resources, but the original endpoints stayed frozen. The awkward naming was documented in tutorials all over the internet. The confusing nesting was baked into SDKs that third parties maintained.&lt;/p&gt;

&lt;p&gt;The “reversible” decision to ship quickly and iterate became a permanent decision by accumulation. Not because we couldn’t reverse it, but because the cost of reversal grew faster than our willingness to pay it.&lt;/p&gt;

&lt;p&gt;What should we have done? I’m still not entirely sure. Probably, we should have broken the early adopters when we had dozens of them, not thousands. The cost of reversal was high even then, but it was much lower than it would ever be again.&lt;/p&gt;

&lt;p&gt;Or maybe we should have treated the API design as a one-way door from the start. Spent more time upfront on the naming and structure. Consulted with potential consumers before shipping. Moved slower on the interface even while moving fast on the implementation.&lt;/p&gt;

&lt;p&gt;What I learned is that the door type isn’t fixed at the moment of decision. Reversibility decays. Two-way doors can become one-way doors while you’re not paying attention. The window for reversal is often shorter than you think, and it closes quietly.&lt;/p&gt;

&lt;h2&gt;What Changes When You Think in Reversibility&lt;/h2&gt;

&lt;p&gt;Early in my career, I evaluated decisions primarily on whether they were “right.”&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Good architecture meant making correct choices. Experience meant having sufficient pattern recognition to make the right decisions faster.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;Now I think about decisions differently. Correctness matters, but correctness is always provisional. What you know today might be wrong tomorrow. What’s right for current requirements might be wrong for future requirements. The environment changes. Your understanding deepens.&lt;/p&gt;

&lt;p&gt;Given that uncertainty, the most valuable property of any decision is often not whether it’s correct, but whether you can update it when you learn more.&lt;/p&gt;

&lt;p&gt;This shifts how I approach architecture:&lt;/p&gt;

&lt;p&gt;For genuine one-way doors, I slow down. I gather more information. I consult more broadly. I try to reduce uncertainty before committing.&lt;/p&gt;

&lt;p&gt;For two-way doors, I speed up. I make a reasonable choice and move. I set up mechanisms to learn quickly and reverse cheaply if needed.&lt;/p&gt;

&lt;p&gt;For decisions with decaying reversibility, I watch the window. I set explicit triggers for reconsideration. I try to reverse early if reversal seems likely, before the cost makes it impractical.&lt;/p&gt;

&lt;p&gt;A decision that’s easy to reverse is a decision that’s safe to make quickly. A decision that’s hard to reverse is a decision worth taking slowly. A decision whose reversibility decays is a decision that demands attention before the window closes.&lt;/p&gt;

&lt;p&gt;This is not about being cautious. It’s about matching your decision-making process to the nature of the decision. Some doors deserve careful study before you walk through. Some doors you should walk through immediately because you can always walk back. The skill is knowing which is which.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reversibility doesn’t tell you what to decide. It tells you how to decide.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;This is Part 2 of the “Architecture of Decisions” series. Part 1 explored why every technical decision is a bet on the future. Part 3 will examine blast radius: what happens when you can’t reverse, and how to contain the damage.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>software</category>
      <category>systemdesign</category>
      <category>designpatterns</category>
    </item>
    <item>
      <title>Why Your Best Technical Decisions Will Eventually Be Wrong</title>
      <dc:creator>Saeed Habibi</dc:creator>
      <pubDate>Wed, 28 Jan 2026 11:27:46 +0000</pubDate>
      <link>https://forem.com/saeedhbi/why-your-best-technical-decisions-will-eventually-be-wrong-1jdl</link>
      <guid>https://forem.com/saeedhbi/why-your-best-technical-decisions-will-eventually-be-wrong-1jdl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67bfkdkmpf5832fm5wc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67bfkdkmpf5832fm5wc9.png" alt="The Architecture of Decisions" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Architecture of Decisions, Part 1&lt;/h2&gt;

&lt;p&gt;There is a mass grave of technical decisions buried in every codebase. Some of them were wrong from the start. Others were right once, then the world changed. A few were brilliant and remain brilliant. Most live somewhere in between, quietly accumulating interest on debt nobody remembers taking on.&lt;/p&gt;

&lt;p&gt;The decision that felt obvious in 2019 becomes the migration nobody wants to own in 2024. The architecture that scaled beautifully for three years collapses under requirements that didn’t exist when you designed it. The framework everyone recommended becomes the framework everyone is migrating away from.&lt;/p&gt;

&lt;p&gt;This is not a failure of engineering. This is the nature of technical decisions. Every choice you make is a bet on a future you cannot fully see.&lt;/p&gt;




&lt;p&gt;Technical decisions are predictions. Not guesses, not preferences, not best practices applied blindly. Predictions. When you choose PostgreSQL over MongoDB, you are predicting that your data will remain relational, that your access patterns will favor complex queries over document lookups, and that the join performance will matter more than horizontal write scaling. When you define service boundaries around “users” and “orders,” you are predicting that these domains will evolve independently, that the team structure will mirror this separation, and that the contract between them will remain stable.&lt;/p&gt;

&lt;p&gt;Most engineers don’t think about decisions this way. They think about decisions as choices between options, evaluated by criteria, selected by judgment. And that framing is not wrong. But it misses something fundamental.&lt;/p&gt;

&lt;p&gt;The criteria you use to evaluate options are themselves predictions. “We need horizontal scalability” is a prediction about load that may never materialize. “We need strong consistency” is a prediction about what correctness means for users who haven’t complained yet. “We need to move fast” is a prediction about how long the current architecture will matter before the next rewrite.&lt;/p&gt;

&lt;p&gt;This is why two equally skilled engineers can look at the same problem, apply sound reasoning, and reach opposite conclusions. One looks at your 10,000 daily active users and sees a system that needs to scale to millions. The other looks at the same data and sees a system that might stay this size forever. They are not disagreeing about the present. They disagree about the future. And the future has not yet voted.&lt;/p&gt;

&lt;p&gt;The uncomfortable truth is that you cannot know if a technical decision is good until time has passed. Sometimes years. The decision to use Kubernetes might look brilliant at month six when you’re deploying twelve services, and catastrophic at month eighteen when you realize you have three services and a full-time infrastructure engineer managing cluster complexity. The decision to stay on Heroku might feel limiting in year one and wise in year three, when your competitor is still debugging their Kubernetes networking.&lt;/p&gt;

&lt;p&gt;What you can do is understand the properties of your decisions. Not whether they are right, because that requires information you don’t have. But how they will behave as the future unfolds. How much room do they leave for course correction? How far does the damage spread if they turn out to be wrong? How much are you betting on things you cannot currently know?&lt;/p&gt;




&lt;p&gt;Some decisions age well. They remain good choices even as requirements shift, teams change, and the technology landscape evolves. Other decisions decay. They were reasonable once, but the context that made them reasonable has disappeared, and now they are obstacles rather than foundations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The difference is rarely about the decision itself. It is about the relationship between the decision and time.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;This is what I want the Architecture of Decisions series to explore. Not a catalog of correct answers, because right answers depend on context that I cannot know. Instead, a framework for thinking about decisions. A way to evaluate not just what to choose, but how to choose. And more importantly, how to understand what you are actually betting on when you commit to a path.&lt;/p&gt;

&lt;p&gt;The framework has three components. Each one addresses a different dimension of how decisions interact with uncertainty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reversibility&lt;/strong&gt; asks: if this decision turns out to be wrong, how hard is it to change course? Some decisions are two-way doors. You can walk through, look around, and walk back if you don’t like what you see. Other decisions are one-way doors. Once you’re through, the door locks behind you. The cost of reversal becomes so high that you effectively cannot reverse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blast radius&lt;/strong&gt; asks: if this decision fails, how much of the system fails with it? Some decisions are contained. They affect a single component, a single team, a single workflow. Other choices are foundational. They propagate through everything. When they go wrong, everything downstream goes wrong too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Information asymmetry&lt;/strong&gt; asks: how much do you NOT know, and does that matter? Every decision involves incomplete information. But for some decisions, the missing information is not critical. For others, the missing information is precisely what determines whether the decision will succeed or fail.&lt;/p&gt;

&lt;p&gt;These three properties do not tell you what to decide. They tell you how to decide. They tell you how much caution a decision deserves, how much validation it needs, and how much reversibility you should preserve.&lt;/p&gt;




&lt;p&gt;Reversibility is the most underrated property of good architecture. &lt;a href="https://www.youtube.com/watch?v=DcWqzZ3I2cY&amp;amp;t=4288s" rel="noopener noreferrer"&gt;Jeff Bezos&lt;/a&gt; talks about one-way and two-way doors, and the distinction is helpful, but it’s more nuanced than a binary.&lt;/p&gt;

&lt;p&gt;Some decisions look reversible but aren’t. You can technically migrate from PostgreSQL to MongoDB. But if you have three years of queries written against relational assumptions, stored procedures that encode business logic, and reporting tools that expect SQL, the reversal cost is so high that the decision is effectively permanent. The door looks like it swings both ways, but the hinges have rusted shut.&lt;/p&gt;

&lt;p&gt;Other decisions look permanent but aren’t. Switching from Python to Go feels like a massive undertaking. But if your architecture is twelve microservices with clean API boundaries, you can rewrite them one at a time. Six months later, you’ve migrated without ever stopping the system. The door looked like a one-way gate, but there was a side entrance that nobody mentioned in the architecture review.&lt;/p&gt;

&lt;p&gt;I’ve seen teams spend months debating database choices as if the decision were irreversible, then casually adopt a new frontend framework every six months. The actual reversibility was inverted from their perception. The database, abstracted behind a repository layer, could have been swapped with moderate effort. The frontend, with its tendrils reaching into every component, was the real lock-in.&lt;/p&gt;

&lt;p&gt;What makes reversibility powerful is not just the ability to undo. It is the ability to learn. Reversible decisions let you gather information that was unavailable when you made the original choice. You ship, you observe, you adjust. The decision becomes a hypothesis you can test rather than a commitment you must defend.&lt;/p&gt;

&lt;p&gt;Part 2 of this series will go deep into reversibility: how to identify which type of door you are walking through, how to preserve optionality when you cannot avoid irreversible choices, and how to avoid mistaking cosmetic reversibility for actual reversibility.&lt;/p&gt;




&lt;p&gt;Blast radius is about failure containment. Every decision will eventually interact with failure, either its own failure or the failure of something adjacent. The question is how far that failure propagates.&lt;/p&gt;

&lt;p&gt;A contained failure is annoying. A propagating failure is existential.&lt;/p&gt;

&lt;p&gt;Part 3 will explore blast radius in detail.&lt;/p&gt;

&lt;p&gt;Information asymmetry is the most philosophical of the three properties, and in some ways the most important. It asks what you are betting on that you cannot currently verify.&lt;/p&gt;

&lt;p&gt;Every decision involves unknowns. That is the nature of predicting the future. But the unknowns are not equally distributed. Some decisions depend heavily on information you don’t have. Others are robust to the unknowns because the unknowns don’t affect the core value proposition.&lt;/p&gt;

&lt;p&gt;When you chose Angular in 2016, you were betting on Google’s commitment to the framework. You didn’t know that Angular 2 would be a complete rewrite incompatible with Angular 1. You didn’t know the community would fragment. You couldn’t know, because Google hadn’t decided yet. That’s information asymmetry: the critical information exists in someone else’s future decisions.&lt;/p&gt;

&lt;p&gt;When you choose Redis for caching, the information asymmetry is lower. Redis has been stable for over a decade. The API is mature. The failure modes are well-documented. You’re still betting on the future, but you’re betting on a trajectory with a long, observable history.&lt;/p&gt;

&lt;p&gt;The skill is not to eliminate information asymmetry, because you can’t. The skill is to recognize where it is highest and to calibrate your confidence accordingly. Decisions with high information asymmetry deserve more hedging, more reversibility, and more caution. That shiny new framework with eighteen months of history and one major corporate sponsor? High information asymmetry. PostgreSQL? Low information asymmetry. Your confidence in each decision should reflect that difference.&lt;/p&gt;

&lt;p&gt;Part 4 will explore information asymmetry: how to identify what you don’t know, how to distinguish uncertainty that matters from uncertainty that doesn’t, and how to make good decisions when you cannot wait for the information you wish you had.&lt;/p&gt;




&lt;p&gt;I should be honest about something. I have gotten this wrong more times than I have gotten it right.&lt;/p&gt;

&lt;p&gt;In 2019, I was part of a team that decided to break a monolith into microservices. The decision made sense at the time. We had a Node.js application that had grown to 200,000 lines of code. Deployments took 20 minutes. Two teams were stepping on each other’s changes. We had read the articles and watched the conference talks. Microservices were the answer to our problems.&lt;/p&gt;

&lt;p&gt;We identified what we thought were clean boundaries. Users. Orders. Notifications. Payments. Each would become its own service with its own repository, deployment pipeline, and database.&lt;/p&gt;

&lt;p&gt;What we actually built was a distributed monolith. The “Users” service needed to validate orders, so it called the Orders service. The Orders service needed to check payment status, so it called Payments. Payments needed to send receipts, so it called Notifications. Notifications needed user preferences, so it called Users. We had drawn boxes on a whiteboard, but each box had arrows pointing to the others.&lt;/p&gt;

&lt;p&gt;What used to be function calls became HTTP requests. A user sign-up that took 50 milliseconds now took 400 milliseconds because it crossed four network boundaries. Debugging became archaeology: which service logged the error? Which request ID do I trace? Why is this field null when the other service swears it sent a value?&lt;/p&gt;

&lt;p&gt;The decision felt reversible. We could always recombine services if needed. In theory, yes. In practice, once you have five teams building on five services with five deployment processes, the political and organizational costs of reversal exceed the technical costs. The door had closed behind us while we were busy decorating the new rooms.&lt;/p&gt;

&lt;p&gt;The blast radius was larger than we anticipated. We thought we were isolating components. We created a runtime dependency graph in which every service required every other service to function. A deploy to Notifications that introduced a 500ms latency spike cascaded into timeouts across the entire system. Our “independent” services were independent only in their git repositories.&lt;/p&gt;

&lt;p&gt;And the information asymmetry was enormous. We made the decision based on a predicted scale that never materialized. We designed for 10 million users and peaked at 300,000. We made it based on team structures that changed six months later when the company reorganized. We made it based on deployment independence, which we never actually used because features still required coordinated releases across three or four services.&lt;/p&gt;

&lt;p&gt;Three years later, parts of that system are still running. We never fully unwound it. The cost of reversal kept growing, and eventually, we just stopped trying. We learned to live with the distributed monolith, adding workarounds and accepting the latency tax.&lt;/p&gt;

&lt;p&gt;That experience changed how I think about technical decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson is not to be paralyzed by fear of getting it wrong, because you will get things wrong.&lt;/strong&gt; The lesson is to understand what you are betting on. To know which doors are one-way. To design for the blast radius of your own potential mistakes.&lt;/p&gt;




&lt;p&gt;Early in your career, technical decisions feel like puzzles with correct answers. You learn the patterns, apply the principles, and arrive at solutions. Experience teaches you that the answers depend on questions that haven’t been asked yet.&lt;/p&gt;

&lt;p&gt;Intermediate engineers evaluate decisions based on technical merit. Is this the right tool? Is this the correct pattern? Is this the right architecture? These are important questions, but they are incomplete.&lt;/p&gt;

&lt;p&gt;Senior engineers evaluate decisions based on properties. How reversible is this? What is the blast radius if we’re wrong? What are we betting on that we cannot currently verify? These questions acknowledge that correctness is not a property of decisions in isolation. &lt;strong&gt;It is a property of decisions in context, and context changes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The shift is not about being smarter. It is about being honest. Honest about uncertainty. Honest about the limits of prediction. Honest about the fact that some decisions will be wrong regardless of how carefully you make them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A system that handles requests is functional.&lt;br&gt;
A system that survives wrong decisions is resilient.&lt;br&gt;
A system where decision quality improves over time demonstrates architectural maturity.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;That maturity comes not from making fewer mistakes, but from making survivable mistakes. From preserving the ability to learn. From understanding that every technical decision is a bet on a future that has not yet arrived.&lt;/p&gt;

&lt;p&gt;The future will vote eventually. Your job is to stay in the game long enough to hear the results.&lt;/p&gt;

&lt;p&gt;What’s your take? What experiences have you had?&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>microservices</category>
      <category>programming</category>
    </item>
    <item>
      <title>Three Signs Your Microservices Are Actually a Distributed Monolith</title>
      <dc:creator>Saeed Habibi</dc:creator>
      <pubDate>Tue, 20 Jan 2026 13:53:45 +0000</pubDate>
      <link>https://forem.com/saeedhbi/three-signs-your-microservices-are-actually-a-distributed-monolith-177j</link>
      <guid>https://forem.com/saeedhbi/three-signs-your-microservices-are-actually-a-distributed-monolith-177j</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua3xs5l1x6mqjpcxo2bp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua3xs5l1x6mqjpcxo2bp.png" alt="You split the code. You didn’t split the coupling." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You split the monolith. You have separate repositories, separate deployment pipelines, and separate teams. Everything looks like microservices on the architecture diagram.&lt;/p&gt;

&lt;p&gt;But something feels wrong.&lt;/p&gt;

&lt;p&gt;Deployments still require coordination across multiple teams. A bug in one service cascades into failures across the system. Your “independent” services move in lockstep, and nobody can explain exactly why.&lt;/p&gt;

&lt;p&gt;Here’s the uncomfortable truth: you might have made a monolith worse. You’ve distributed the complexity without distributing the independence. You’ve kept all the coupling and added network latency.&lt;/p&gt;

&lt;p&gt;I’ve seen this pattern at three different companies now. Every time, the symptoms were identical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sign One: Your Deployments Require a Group Chat
&lt;/h2&gt;

&lt;p&gt;The clearest sign of a distributed monolith is deployment coordination.&lt;/p&gt;

&lt;p&gt;If you cannot deploy Service A without also deploying Service B, those services are not independent. They’re a monolith that happens to communicate over HTTP rather than via function calls.&lt;/p&gt;

&lt;p&gt;I watched a team spend six months “breaking up” their monolith into twelve services. They were proud of the architecture diagram. Clean boxes, clean arrows. But every Thursday, their deploy channel looked like a military operation. “I’m deploying User Service, hold off on Orders.” “Wait, I need to push Notifications first, or your changes will break.” “Can everyone sync up at 3 pm for the coordinated release?”&lt;/p&gt;

&lt;p&gt;They had replaced compile-time dependencies with runtime dependencies. The coupling was still there. They’d just made it invisible to the compiler and visible only in production.&lt;/p&gt;

&lt;p&gt;The test is simple. Can any team deploy its service at any time without coordinating with anyone else? If the answer involves “well, usually, but…” then you have a distributed monolith.&lt;/p&gt;

&lt;p&gt;True service independence means you can deploy whenever you want. The interfaces between services are stable contracts. Changes are backward compatible. Nobody needs a calendar invite.&lt;/p&gt;
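&lt;p&gt;The “stable contract” idea can be sketched as a tolerant consumer: read only the fields you need, ignore extras, and treat newer fields as optional. The payload shape below (&lt;code&gt;parseOrder&lt;/code&gt; and its fields) is hypothetical.&lt;/p&gt;

```javascript
// A tolerant consumer of a hypothetical Orders API. Because it ignores
// unknown fields and defaults optional ones, the producer can add fields
// (a backward-compatible change) and deploy without coordinating.
function parseOrder(payload) {
  // Required field: fail loudly only when the contract is actually broken.
  if (payload.id === undefined) {
    throw new Error("contract violation: missing id");
  }
  return {
    id: payload.id,
    status: payload.status ?? "unknown", // optional field added in a later version
  };
}
```

&lt;p&gt;Old and new payload versions both parse cleanly, which is exactly what lets either side deploy without a calendar invite.&lt;/p&gt;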

&lt;h2&gt;
  
  
  Sign Two: One Database to Rule Them All
&lt;/h2&gt;

&lt;p&gt;This one is subtle because it comes in degrees.&lt;/p&gt;

&lt;p&gt;The obvious case: multiple services reading and writing to the same tables. I don’t see this as often anymore because everyone knows it’s wrong. But the less obvious cases are everywhere.&lt;/p&gt;

&lt;p&gt;Services that share a database schema, even if they “own” different tables. Services that join across each other’s data. Services with foreign key constraints that span ownership boundaries. Services that read from replicas of another service’s data.&lt;/p&gt;

&lt;p&gt;At a previous company (I think this was 2020), we had what appeared to be clean service boundaries. Each service had its own set of tables. But the analytics service needed to join user data with transaction data with inventory data. So we gave it read access to everything.&lt;/p&gt;

&lt;p&gt;That analytics service became the spider at the center of the web. We couldn’t change the user schema without coordinating with analytics. We couldn’t change the transaction schema without coordinating with analytics. The service that was supposed to just “read some data” had become an implicit contract with every other service.&lt;/p&gt;

&lt;p&gt;The database was our distributed monolith’s shared memory.&lt;/p&gt;

&lt;p&gt;When services truly own their data, they own it completely. Other services get access through APIs, not through database connections. Changes to the internal schema are invisible to the outside world. This adds latency. This adds complexity in some ways. But it’s the only way to achieve real independence.&lt;/p&gt;

&lt;p&gt;(Though I should mention: sometimes a shared database is the right answer. If your services are small, your team is small, and you’re not planning to scale them independently, the ceremony of full isolation might not be worth it. The problem isn’t shared databases. The problem is pretending you have independence when you don’t.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Sign Three: Everything Calls Everything
&lt;/h2&gt;

&lt;p&gt;Draw a graph of which services call which other services. If it looks like a dense mesh rather than a layered hierarchy, you probably have a distributed monolith.&lt;/p&gt;

&lt;p&gt;The tell is in the transitive dependencies. Service A calls B and C. Service B calls D and E. Service C also calls D. Service D calls F and G. And somewhere in there, G calls back to A for some reason.&lt;/p&gt;

&lt;p&gt;When you deploy a change to Service D, you’re implicitly changing the behavior of A, B, and C. When D has a latency spike, A, B, and C all slow down. When D goes down, the failure cascades through the entire graph.&lt;/p&gt;
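&lt;p&gt;That graph is easy to audit mechanically. Here is a sketch of a depth-first cycle check over a service call graph; the adjacency list mirrors the A-through-G example above.&lt;/p&gt;

```javascript
// Depth-first search for a cycle in a service dependency graph.
// Returns the cycle as a list of service names, or null if the graph is layered.
function findCycle(graph) {
  const visiting = new Set(); // nodes on the current DFS path
  const done = new Set();     // nodes fully explored, known cycle-free
  const path = [];
  function dfs(node) {
    if (done.has(node)) return null;
    if (visiting.has(node)) {
      // Back edge found: return the portion of the path that loops.
      return path.slice(path.indexOf(node)).concat(node);
    }
    visiting.add(node);
    path.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = dfs(dep);
      if (cycle) return cycle;
    }
    path.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  }
  for (const node of Object.keys(graph)) {
    const cycle = dfs(node);
    if (cycle) return cycle;
  }
  return null;
}

// The example from the text: G calling back to A closes the loop.
const services = {
  A: ["B", "C"], B: ["D", "E"], C: ["D"],
  D: ["F", "G"], E: [], F: [], G: ["A"],
};
```

&lt;p&gt;On the example graph this finds the loop running back through G to A; on a properly layered graph it returns null. Feeding it your real call graph is one honest way to check sign three.&lt;/p&gt;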

&lt;p&gt;I’ve been guilty of this myself. It’s easy to add one more HTTP call. “We just need to check the user’s permissions.” “We just need to look up the product details.” Each call makes sense in isolation. But the compound effect is a system where everything depends on everything.&lt;/p&gt;

&lt;p&gt;The distributed monolith emerges gradually, one convenient API call at a time.&lt;/p&gt;

&lt;p&gt;In a well-designed system, dependencies flow in one direction. Higher-level services depend on lower-level services. The graph has clear layers. And crucially, no service needs to know about the internal structure of services it calls.&lt;/p&gt;

&lt;p&gt;When you find yourself adding a call from a “lower” service to a “higher” one, stop. That’s the call that closes the loop and creates the distributed monolith.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Actually Going Wrong
&lt;/h2&gt;

&lt;p&gt;These three signs (deployment coupling, data coupling, and call coupling) are symptoms of the same root cause: the services don’t have clean boundaries.&lt;/p&gt;

&lt;p&gt;A true service boundary is defined by:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent deployment:&lt;/strong&gt; Changes inside the boundary don’t require changes outside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent data:&lt;/strong&gt; The service owns its data completely and exposes it only through interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent failure:&lt;/strong&gt; The service can fail without cascading failures to its callers (and can handle failures of services it depends on).&lt;/p&gt;
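&lt;p&gt;For the independent-failure property, one common containment technique is a timeout with a fallback, so a slow or failed dependency degrades the response instead of stalling the caller. This is a sketch, not a full circuit breaker, and all names are illustrative.&lt;/p&gt;

```javascript
// Wrap a dependency call so its failure degrades rather than cascades:
// whichever settles first wins the race, and any rejection yields the fallback.
async function withFallback(call, { timeoutMs, fallback }) {
  const timeout = new Promise((resolve) =>
    setTimeout(() => resolve(fallback), timeoutMs)
  );
  try {
    // A slow dependency loses the race and the caller gets the fallback value.
    return await Promise.race([call(), timeout]);
  } catch {
    return fallback; // a failed dependency also degrades, not cascades
  }
}
```

&lt;p&gt;The caller decides what a tolerable degraded answer looks like (a cached value, an empty list, a default), which is precisely the boundary-level thinking the three properties above demand.&lt;/p&gt;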

&lt;p&gt;When you break up a monolith, the tempting approach is to draw boxes around code that “seems related.” User code goes here. Order code goes here. But that’s not what defines a good boundary.&lt;/p&gt;

&lt;p&gt;Good boundaries are defined by what can change independently. If two pieces of code always change together, they belong in the same service, regardless of what they’re “about.”&lt;/p&gt;

&lt;p&gt;I’m still not sure where to draw these lines. Every system I’ve worked on has had at least a few boundaries in the wrong place. The skill is recognizing when the coupling has gotten bad enough to justify paying to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Question
&lt;/h2&gt;

&lt;p&gt;If your microservices are actually a distributed monolith, what do you do?&lt;/p&gt;

&lt;p&gt;Sometimes the answer is: merge them back together. I know this feels like failure. You spent months (or years) breaking things apart. Admitting it didn’t work is hard.&lt;/p&gt;

&lt;p&gt;But a well-structured monolith is better than a distributed monolith. At least with a monolith, the compiler catches your mistakes. You can refactor with confidence. Latency is in nanoseconds instead of milliseconds. And you don’t need to debug distributed transactions.&lt;/p&gt;

&lt;p&gt;The microservices architecture is valuable when you need independent deployment, scaling, and technology choices. If you don’t need those things, the complexity isn’t paying for itself.&lt;/p&gt;

&lt;p&gt;Look at your deploy channel. Look at your database connections. Look at your service graph. Be honest about what you see.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sometimes the best architecture decision is admitting the last architecture decision was wrong.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>softwareengineering</category>
      <category>microservices</category>
      <category>architecture</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
