<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Iyanu David</title>
    <description>The latest articles on Forem by Iyanu David (@iyanu_david).</description>
    <link>https://forem.com/iyanu_david</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3739117%2Ff9c3761d-694f-48d6-bc18-bd7f078d06e7.png</url>
      <title>Forem: Iyanu David</title>
      <link>https://forem.com/iyanu_david</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/iyanu_david"/>
    <language>en</language>
    <item>
      <title>Designing Systems That Expect Their Own Assumptions to Break</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Fri, 06 Mar 2026 05:41:29 +0000</pubDate>
      <link>https://forem.com/iyanu_david/designing-systems-that-expect-their-own-assumptions-to-break-hg2</link>
      <guid>https://forem.com/iyanu_david/designing-systems-that-expect-their-own-assumptions-to-break-hg2</guid>
      <description>&lt;p&gt;There is a particular kind of confidence that infects system design at the beginning of a project — a confidence that feels earned, because it is, briefly, correct. You have studied the traffic patterns. You have mapped the service interactions. You know which team owns what, which permissions flow where, which storage tier handles which load shape. The architecture diagram is clean. It makes sense. Someone has drawn boxes and arrows, and the boxes and arrows correspond to reality.&lt;/p&gt;

&lt;p&gt;That correspondence, in my experience, rarely survives contact with the third year of a system's life.&lt;/p&gt;

&lt;p&gt;I have spent a long time building infrastructure that other people eventually inherit, and longer still inheriting infrastructure other people built. The pattern is not subtle. Systems designed to be correct calcify. They become artifacts of the assumptions that shaped them — assumptions which, over time, stop being true without anyone noticing, or without anyone having the standing to say so out loud. A service that was designed for 40 requests per second is now handling 400. A permissions boundary that made sense when there were two teams is now threaded through six. A deployment script written for a specific Kubernetes version is quietly failing in half a dozen edge cases against the version that replaced it eighteen months ago, and no one knows because the alerting wasn't wired to the right signal.&lt;/p&gt;

&lt;p&gt;This is not negligence. It is entropy. It is the ordinary consequence of building something that works and then letting time pass.&lt;/p&gt;

&lt;p&gt;The question worth asking — not at the architecture review, but at 11pm when something is paging you that shouldn't be — is whether the system was designed to survive its own obsolescence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Assumption Inventory&lt;/strong&gt;&lt;br&gt;
Every non-trivial system embeds assumptions the way sedimentary rock embeds fossils — layered, implicit, and only visible when you cut a cross-section. Some of those assumptions concern load: this service will receive traffic shaped roughly like this, at this volume, with this distribution across endpoints. Others concern topology: service A will talk to service B through this interface, and service C is downstream of both. Others concern ownership: team X understands the operational posture of component Y and will respond when it misbehaves.&lt;/p&gt;

&lt;p&gt;The insidious ones are the ownership assumptions, because those are the ones that dissolve fastest and leave the least evidence.&lt;/p&gt;

&lt;p&gt;When the team that built a service turns over — and it will turn over, because people leave, reorg, get promoted into roles where they stop writing code — the operational knowledge they carried goes with them. What remains is a binary. Either it was externalized, in runbooks or documentation or architecture decision records or at minimum in code comments that a careful person left behind, or it wasn't. If it wasn't, the new team inherits a system they can operate by rote but cannot reason about. They learn which buttons to press. They do not learn why the buttons are there or what happens when the buttons stop being the right buttons.&lt;/p&gt;

&lt;p&gt;That gap — between operational rote and operational understanding — is where a lot of incidents incubate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Containment, or: Why Blast Radius Is the First Question&lt;/strong&gt;&lt;br&gt;
The instinct, when building a system, is to optimize for the happy path. Make the common case fast. Make the expected interactions smooth. Handle the known failure modes.&lt;/p&gt;

&lt;p&gt;The second instinct — which takes longer to develop, and which you tend to acquire by being the person paged at 2am — is to obsess over blast radius.&lt;/p&gt;

&lt;p&gt;Blast radius is the answer to: if this thing fails, how much of everything else fails with it? The question sounds simple. The answer is almost always more complicated than you expect, because systems in production have a way of developing undocumented dependencies that the architecture diagram never captured. Service A calls service B, yes — but service A also has a three-year-old cron job that queries service B's database directly, bypassing the API, because someone needed a fast read path for a reporting feature and no one ever went back to do it properly. Service B's database is now a hidden dependency of service A's availability. The dependency diagram does not know this. The on-call runbook does not mention it. It will be discovered at the worst possible moment.&lt;/p&gt;

&lt;p&gt;Circuit breakers help. Not because they prevent failures — they don't — but because they prevent a failure in one service from propagating into a cascade that takes down everything downstream. A properly implemented circuit breaker trips when error rates exceed a threshold, fails fast, and stops sending traffic into a degraded dependency. The service that depended on it degrades gracefully, returns a cached response or a sensible default, and keeps serving its own callers.&lt;/p&gt;
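
&lt;p&gt;A minimal sketch of that tripping behavior in Python, assuming illustrative threshold and cooldown values, with &lt;code&gt;fallback&lt;/code&gt; standing in for whatever cached response or sensible default the service can serve:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

class CircuitBreaker:
    """Trips after consecutive failures, fails fast while open,
    and allows a trial call after a cooldown."""
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at &amp;lt; self.reset_timeout_s:
                return fallback()      # open: skip the degraded dependency
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures &amp;gt;= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()          # degrade gracefully, keep serving
        self.failures = 0
        return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;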

&lt;p&gt;Rate limiting is a cousin of this thinking. Rate limits are not primarily about protecting you from your worst users. They are about ensuring that a spike in one part of the system — whether from a rogue client, a misconfigured retry loop, or a sudden legitimate surge — cannot monopolize resources shared by every other part. A service without rate limits is a service with an implicit, unlimited blast radius. Someone will eventually find its actual capacity limit. The question is whether they find it before or after it causes a problem.&lt;/p&gt;
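
&lt;p&gt;The standard mechanism here is a token bucket. A rough sketch, with the refill rate and burst capacity as placeholders to be tuned per endpoint:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

class TokenBucket:
    """Refills at a steady rate, absorbs bursts up to `capacity`,
    sheds traffic beyond that instead of passing the spike along."""
    def __init__(self, rate_per_s, capacity):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens &amp;gt;= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller responds 429 / sheds load here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;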

&lt;p&gt;The principle underneath both is simple enough to put on a notecard: failures should stop. Not propagate. Not cascade. Stop. Contain them at the boundary, pay the local cost, and keep the rest of the system working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intent That Hides Is Intent That Rots&lt;/strong&gt;&lt;br&gt;
The second major fracture point in aging systems is not a technical failure. It is a documentation failure — or more precisely, a failure of externalizing intent.&lt;/p&gt;

&lt;p&gt;Policy-as-code is the phrase people reach for here, and it's correct as far as it goes. When your security policy lives in a document somewhere, it has a quiet decoupling problem: the document and the infrastructure can diverge without anyone noticing. The document says egress is restricted to these IP ranges. The infrastructure, which was modified by three people over two years in response to operational exigencies, no longer reflects that. You don't know until an audit, or an incident.&lt;/p&gt;

&lt;p&gt;When policy is code — when it is evaluated against real infrastructure state on every deployment, when it generates findings when it drifts — the divergence is no longer quiet. It is loud. It fails a check. Someone has to look at it. That is the point.&lt;/p&gt;
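
&lt;p&gt;In its smallest form, that check can be a few lines run in CI: a hypothetical comparison of declared egress policy against observed firewall state, with invented CIDR ranges for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import ipaddress

# Declared intent: the only egress ranges the policy permits (invented values).
ALLOWED_EGRESS = ["10.0.0.0/8", "203.0.113.0/24"]

def egress_violations(observed_cidrs):
    """Return every observed egress rule not covered by declared policy."""
    allowed = [ipaddress.ip_network(c) for c in ALLOWED_EGRESS]
    return [c for c in observed_cidrs
            if not any(ipaddress.ip_network(c).subnet_of(a) for a in allowed)]

# Run on every deployment against real firewall state.
drift = egress_violations(["10.2.0.0/16", "198.51.100.0/24"])
# drift == ["198.51.100.0/24"]: the check fails loudly instead of diverging quietly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;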

&lt;p&gt;Dependency graphs are a related problem. Most organizations know, roughly, what their service topology looks like. What they know less well is what the actual topology looks like at 3pm on a Thursday, under load. Service meshes and distributed tracing tools are not primarily monitoring tools — they are topology discovery tools. They let you see what is actually talking to what, in what volumes, with what latencies, as an empirical fact rather than an architectural assumption. The gap between the diagram on the whiteboard and the traces in the APM tool is one of the most informative gaps in modern infrastructure. It tells you what has grown up between the walls that no one planned for.&lt;/p&gt;

&lt;p&gt;Intent visible. Reality visible. The comparison between them: this is where a careful operator lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On Ownership, and What Happens When It Vaporizes&lt;/strong&gt;&lt;br&gt;
There is a systems reliability principle that sounds almost sociological: the operational burden of a service is inversely proportional to how well its design is understood by the people currently running it.&lt;/p&gt;

&lt;p&gt;This is obvious once stated. It took me years to really believe it.&lt;/p&gt;

&lt;p&gt;The practical implication is that architectural decisions have a carrying cost that is not paid by the people who made them. It is paid, often much later, by people who inherit the system in a different operational context, with different traffic, different adjacent services, and no particular reason to know why the original decisions were made. If those decisions are inscrutable — if the code is clever in ways that require context the new team doesn't have, if the failure modes require institutional knowledge that left with the previous team — then the operational cost compounds.&lt;/p&gt;

&lt;p&gt;Ownership metadata is an underrated intervention here. Not in the abstract sense of "someone is responsible" but in the concrete sense: a machine-readable record of who owns this service, what its criticality tier is, where its runbook lives, who the secondary contact is, what its dependencies are. This isn't bureaucracy. It is the thing that prevents a new team from spending their first three hours of an incident trying to figure out whether they are the right people to be debugging this service.&lt;/p&gt;
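
&lt;p&gt;The record itself can be tiny. A sketch with hypothetical field names; the point is that a machine can verify its completeness, not that these exact keys are the right ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REQUIRED = {"owner", "secondary_contact", "criticality_tier",
            "runbook_url", "dependencies"}

# Hypothetical catalog entry, versioned alongside the service's code.
service = {
    "owner": "payments-team",
    "secondary_contact": "platform-oncall",
    "criticality_tier": "tier-1",
    "runbook_url": "https://wiki.example.com/runbooks/payment-processor",
    "dependencies": ["stripe-api", "ledger-db"],
}

missing = REQUIRED - set(service)
assert not missing, f"ownership metadata incomplete: {missing}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;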

&lt;p&gt;Standardized runbooks are a related thing that people resist because they feel like overhead and are overhead, until they aren't. Until the person who knows the system is not on call that week. Until the incident is happening at 6am and the person paged has never seen this service before and the runbook is the difference between thirty minutes and three hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability as Architecture, Not Afterthought&lt;/strong&gt;&lt;br&gt;
Observability has become a fashionable word, which means it has started to mean less. Let me be specific about what I mean.&lt;/p&gt;

&lt;p&gt;A system is observable to the degree that you can understand its internal state from its external outputs without having to modify it or guess. That sounds abstract. In practice it means: when something goes wrong, can you tell what went wrong, where, when, and why, using the artifacts the system already produces? Or do you have to add instrumentation, redeploy, reproduce the problem, and hope it happens again while you're watching?&lt;/p&gt;

&lt;p&gt;Most systems, if you're honest, are in the second category more than they are in the first.&lt;/p&gt;

&lt;p&gt;The failure mode here is treating instrumentation as a feature to be added later, after the system is "working." But a system that works under normal conditions but is opaque under abnormal ones is not really working — it is working and unverifiable, which is a different thing. You don't know it's working. You're assuming it's working because you haven't seen it fail. When it does fail, you will be flying mostly blind.&lt;/p&gt;

&lt;p&gt;Structured logs matter more than most people realize until they don't have them. The difference between a log line that says &lt;code&gt;ERROR: request failed&lt;/code&gt; and one that says&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"level":"error","service":"payment-processor","trace_id":"abc123","user_id":"u_789","error_code":"UPSTREAM_TIMEOUT","dependency":"stripe-api","duration_ms":5043}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is the difference between an investigation that takes forty minutes and one that takes four. Every field in that structured log is a dimension you can query. Every field that isn't there is a question you can't answer.&lt;/p&gt;

&lt;p&gt;Metrics are the high-level signal. Traces are the mechanism. Logs are the evidence. A system that has all three, wired correctly into the operational toolchain, allows you to move from symptom to cause without spelunking through servers or guessing. A system that has none of them — or has some of them wired poorly, alerting on the wrong signals, aggregating in ways that lose resolution — gives you the sensation of observability without the substance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human Error Is Not a Root Cause&lt;/strong&gt;&lt;br&gt;
The phrase "human error" in a postmortem is almost always a failure of analysis. Not because humans don't make mistakes — they do, constantly, predictably, in patterns that are well-studied — but because "human error" as a terminal finding implies there is nothing to learn, nothing to change, nothing to protect against. It implies the solution is to hire better humans or remind existing ones to be more careful. Neither of those works.&lt;/p&gt;

&lt;p&gt;What works is designing the system to absorb the errors that humans reliably make.&lt;/p&gt;

&lt;p&gt;Configuration mistakes are the most common class. The protection is not "be more careful with configuration" — it is: validate configuration before deployment, ideally in a way that catches the specific class of errors that keep happening. Deployment rollbacks are not a hedge against unlikely failures; they are the expected path for a predictable percentage of deployments, and the rollback mechanism should be tested regularly, not dusted off in desperation.&lt;/p&gt;
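
&lt;p&gt;A toy version of that pre-deployment gate, with invented bounds; the real list of checks should grow out of your own recurring mistakes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate_config(cfg):
    """Reject the configuration mistakes that keep recurring, pre-deploy."""
    errors = []
    if cfg.get("replicas", 0) &amp;lt; 1:
        errors.append("replicas must be at least 1")
    if not 100 &amp;lt;= cfg.get("upstream_timeout_ms", 0) &amp;lt;= 30_000:
        errors.append("upstream_timeout_ms outside sane bounds")
    if cfg.get("environment") not in {"staging", "production"}:
        errors.append("unknown environment")
    return errors

errors = validate_config({"replicas": 0, "upstream_timeout_ms": 5000,
                          "environment": "production"})
# errors == ["replicas must be at least 1"]: the deploy halts here, not in prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;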

&lt;p&gt;Progressive rollouts — canary deployments, blue-green deployments, whatever variant fits the system — are one of the more underappreciated tools in this category. The cost of a bad deployment is proportional to the exposure before the problem is caught. If you can route 1% of traffic to a new version, observe its behavior for twenty minutes, and only then proceed to 10%, 50%, 100%, you have dramatically reduced the blast radius of your own mistakes. This is not principally about trusting the software less. It is about designing a system that catches errors before they are total.&lt;/p&gt;
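
&lt;p&gt;The staged progression itself is almost trivial to express; the hard part is the health signal feeding it. A sketch that assumes some external check supplies &lt;code&gt;healthy&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STAGES = [0.01, 0.10, 0.50, 1.00]  # fraction of traffic at each step

def next_stage(current, healthy):
    """Advance only while the canary looks healthy over its observation
    window; any regression routes straight back to 0% (roll back)."""
    if not healthy:
        return 0.0
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]

# next_stage(0.01, healthy=True)  returns 0.10
# next_stage(0.50, healthy=False) returns 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;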

&lt;p&gt;Feature flags sit in the same conceptual family. Not just as product tools — the ability to ship code that isn't yet exposed, to decouple deployment from release — but as operational safety mechanisms. A flag that allows you to disable a code path without redeploying is a circuit breaker under human control. That matters when a new integration is behaving badly and the fastest remediation is to turn it off.&lt;/p&gt;
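
&lt;p&gt;One possible shape for that kill switch, with a hypothetical flag file and flag name. The property that matters is that flipping the flag needs no redeploy, and that an unreadable flag source fails toward the safe default:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json, pathlib

FLAGS_PATH = pathlib.Path("/etc/myservice/flags.json")  # hypothetical location

def flag_enabled(name, default=False):
    """Evaluated at request time: flipping the flag source disables a
    code path without a redeploy. Unreadable flags fall to the default."""
    try:
        return bool(json.loads(FLAGS_PATH.read_text()).get(name, default))
    except (OSError, ValueError):
        return default

if flag_enabled("new_tax_integration"):
    ...  # new code path, dark until the flag says otherwise
else:
    ...  # stable fallback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;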

&lt;p&gt;&lt;strong&gt;The Continuous Review Problem&lt;/strong&gt;&lt;br&gt;
Architecture reviews have a bad reputation in some engineering cultures, which usually means they've been done badly — treated as gatekeeping rituals rather than genuine inquiry, run by people with authority to block but not enough context to evaluate, disconnected from the operational reality of running the systems in question.&lt;/p&gt;

&lt;p&gt;Done well, they are something else. They are the mechanism by which a team periodically asks: what do we believe about this system, and is that belief current?&lt;/p&gt;

&lt;p&gt;The cadence matters. An annual architecture review of a system that's changed significantly in the last twelve months is nearly useless — you're reviewing a snapshot of something that no longer exists. A quarterly review of a stable system is probably more overhead than it's worth. The useful question is: what rate of change does this system experience, and how often do we need to reconcile our model of it with its actual state?&lt;/p&gt;

&lt;p&gt;Security boundary audits are a specific case of this, and one that gets neglected more than most. Permissions have a tendency to accumulate in one direction. They are added when needed and rarely removed when no longer needed, because removing permissions is work, carries risk of breaking something, and generates no visible reward when it goes well. The practical consequence is that systems in production tend to be over-permissioned relative to what they actually need. That over-permission is a liability — it expands the attack surface, it means a compromised service can reach further than it should — and it doesn't announce itself. It requires deliberate, periodic audit.&lt;/p&gt;

&lt;p&gt;The reliable way to do these reviews is to make them boring. Routine. Scheduled. Not triggered by incidents or compliance pressure, but by calendar. The interesting insights tend to come from the reviews that happened when nothing was obviously wrong, when someone noticed a dependency that didn't make sense or a permission that had no current justification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You Would Change on Monday Morning&lt;/strong&gt;&lt;br&gt;
Suppose you have read this and found yourself nodding. Suppose you are now thinking about a system you maintain, and you are aware — with the specific, uncomfortable awareness of someone who knows their own codebase — that some of the fractures described above are present in it. What do you actually do?&lt;/p&gt;

&lt;p&gt;Not everything. Not at once. The practical failure mode of this kind of systemic thinking is paralysis — deciding that the system needs too much improvement to improve incrementally, and therefore improving nothing.&lt;/p&gt;

&lt;p&gt;The concrete interventions, in rough order of accessibility:&lt;/p&gt;

&lt;p&gt;If you have no structured logging, start there. Pick a format, wire it into the logging framework, start emitting machine-readable log lines from the service you're most likely to debug next. Not every service. One service. See how it changes what you can answer during an investigation. Then expand.&lt;/p&gt;

&lt;p&gt;If you have no deployment rollback, the next change you make to production is also a change to your incident response procedure, whether you intended it or not. Build rollback before you need it. Test rollback when nothing is failing.&lt;/p&gt;

&lt;p&gt;If your service has no owner metadata — no machine-readable record of who is responsible, where the runbook is, what the criticality tier is — add it. It costs an afternoon and it will eventually save someone hours.&lt;/p&gt;

&lt;p&gt;If your blast radius analysis has not been revisited since the system was originally designed, map the actual dependencies. Not the intended ones. The actual ones. Use your traces, your logs, your network flow data. Find the undocumented connections. At least know what they are.&lt;/p&gt;

&lt;p&gt;If the last time anyone read the assumptions embedded in your security policy was when it was written, schedule an hour to read them with someone who wasn't in the room when they were made.&lt;/p&gt;

&lt;p&gt;None of this is transformational. Transformation is for people who have more leverage over their systems than most of us have over the ones we inherit. The thing that is actually available to most practitioners is: make it incrementally harder for the system to fail silently. Make one more thing visible. Contain one more failure mode. Document one assumption that is currently implicit.&lt;/p&gt;

&lt;p&gt;The systems that survive — the ones you encounter after ten years and find they are still running, still comprehensible, still operable by people who didn't build them — are not the ones that were designed perfectly. They are the ones where someone, repeatedly, asked whether the assumptions were still current. Where the culture of the teams that maintained them included, as a normal practice, the willingness to revisit.&lt;/p&gt;

&lt;p&gt;Not blame past decisions. Revisit them. Those are not the same thing.&lt;/p&gt;

&lt;p&gt;The past decision was probably correct when it was made. The question is whether the world in which it was correct still exists.&lt;/p&gt;

&lt;p&gt;Usually, if enough time has passed, it doesn't.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>sre</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>The Economics of Reliability: Cost, Risk, and Architectural Tradeoffs</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Thu, 05 Mar 2026 21:16:57 +0000</pubDate>
      <link>https://forem.com/iyanu_david/the-economics-of-reliability-cost-risk-and-architectural-tradeoffs-4k8n</link>
      <guid>https://forem.com/iyanu_david/the-economics-of-reliability-cost-risk-and-architectural-tradeoffs-4k8n</guid>
      <description>&lt;p&gt;There's a particular kind of meeting that happens in every engineering organization eventually. Someone puts a slide on the screen showing quarterly infrastructure spend. The numbers are climbing. A VP — almost always someone whose mental model of software was formed before Kubernetes existed — asks why the monitoring bill is larger than the compute bill. The room goes quiet in a specific way. The engineers know the answer. They're trying to figure out whether this is a safe room to say it in.&lt;/p&gt;

&lt;p&gt;That silence is where reliability goes to die.&lt;/p&gt;

&lt;p&gt;I've been in enough of those rooms to have developed a kind of diagnostic reflex. When I hear the phrase "right-size our observability footprint," I mentally note it the way a cardiologist notes a patient describing occasional chest tightness. Could be nothing. Probably isn't nothing. The framing itself — observability as a footprint to be right-sized — reveals a category error that will eventually cost ten times whatever the proposed savings are. But I've also learned that explaining this at the moment of the meeting rarely works. The incentive structures around that table are arranged against you, and they've been arranged that way deliberately, often by people who would insist they care deeply about uptime.&lt;/p&gt;

&lt;p&gt;Reliability is not primarily a technical problem. It never was. The systems I've watched fail catastrophically over the years — the multi-hour outage that eviscerated a Black Friday revenue quarter, the cascading database failover that turned a 15-minute incident into a 6-hour one because nobody had actually tested the failover path since 2019, the gradual degradation nobody noticed because the dashboards had been thinned to cut Datadog costs — almost none of them failed because the engineers didn't know how to build resilient systems. They failed because the organization had been making a long series of individually defensible decisions that collectively eroded the architecture's tolerance for surprise.&lt;/p&gt;

&lt;p&gt;This is worth sitting with. Individually defensible. That's the insidious part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Resilience Actually Costs, and Why the Bill Is Always Surprising&lt;/strong&gt;&lt;br&gt;
Let me be concrete about what investing in reliability requires, because the abstract framing of "multi-region infrastructure" and "redundant pipelines" obscures what you're actually buying.&lt;/p&gt;

&lt;p&gt;Multi-region deployment means you are running your entire production topology — compute, storage, networking, secrets management, deployment tooling — in at least two geographically separated environments simultaneously. You are paying for idle capacity that exists specifically to absorb a failure you hope never occurs. You are maintaining parity between those environments as your application evolves, which means every deployment, every schema migration, every config change has to be tested against a more complex matrix of states. You are dealing with the genuinely hard distributed systems problem of data consistency across regions, which introduces latency trade-offs and replication lag that your application code must be written to tolerate. And you're doing all of this while your product team is asking why the new checkout flow hasn't shipped yet.&lt;/p&gt;

&lt;p&gt;That's before you touch the human infrastructure. Chaos engineering — real chaos engineering, not the theater version where you run a tool against a staging environment and check the box — requires dedicated engineering time to design failure scenarios, execute them against production, analyze the results, and iterate. It requires an organizational culture where engineers aren't penalized for the incidents those experiments occasionally produce. It requires incident rehearsal, which means pulling senior engineers out of feature work to run tabletop exercises for failure modes that may never materialize. The return on that investment is invisible until the moment it isn't, and by then, you're too busy fighting the incident to feel grateful for the preparation.&lt;/p&gt;

&lt;p&gt;The bill compounds. That's the thing organizations keep rediscovering. It's not a one-time investment. Resilience requires maintenance the same way a physical plant requires maintenance. The runbooks go stale. The on-call rotation gets thin when engineers leave and aren't replaced at the same rate. The canary deployment that was carefully tuned to catch regressions silently starts evaluating against the wrong metric after a service rename. I've inherited systems where the "comprehensive observability" was technically still in place — the dashboards existed, the alerts were configured — but nobody had reviewed them in 18 months and half the panels were displaying data from a service that had been deprecated.&lt;/p&gt;

&lt;p&gt;For early-stage systems, none of this makes sense. The blast radius of a startup's outage is bounded by the number of users who are currently trying to use it, which is probably small, and the cost of the outage is mostly reputational in a space where users have modest expectations. The right call at that stage genuinely is to move fast. The mistake is failing to notice when that calculus changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability Debt: How It Accumulates, Why It's Invisible Until It's Catastrophic&lt;/strong&gt;&lt;br&gt;
Technical debt gets discussed constantly. Reliability debt is the same phenomenon operating in a different register, and it's harder to see because the damage doesn't manifest incrementally — it manifests in sudden, expensive discontinuities.&lt;/p&gt;

&lt;p&gt;Here's the mechanism. Over the course of a year, your team makes a dozen small decisions under resource pressure:&lt;/p&gt;

&lt;p&gt;You consolidate three microservices onto a single compute cluster to reduce costs. This makes sense — those services were underutilized, the consolidation is clean, and you save a meaningful amount on cloud spend. What you've also done is increase the blast radius of any incident affecting that cluster. Previously, a bad deployment to service A didn't threaten services B and C. Now it might.&lt;/p&gt;

&lt;p&gt;You reduce the sampling rate on your distributed traces from 100% to 10% to bring the observability bill down. Still enough data to analyze normal operation. But under failure conditions, when you need the traces most, you're now looking at a 10x coarser picture of what your system is doing, and the specific transactions that are failing may fall into the 90% you're not capturing.&lt;/p&gt;

&lt;p&gt;You remove the staging environment because it was perpetually out-of-date with production anyway, and the engineers weren't using it consistently. Fair criticism of the staging environment as it existed. But now the first place your changes encounter production-like traffic is production.&lt;/p&gt;

&lt;p&gt;You centralize IAM permissions to simplify the access management overhead. Fewer roles, cleaner policy documents, easier auditing. Also, when a credential is compromised or a permissions bug ships, the scope of what that principal can affect has grown.&lt;/p&gt;

&lt;p&gt;None of these decisions is obviously wrong in isolation. Some of them have genuine merit. Collectively, they have reshaped your system's failure profile in ways that are almost impossible to hold in your head simultaneously. The architecture is now operating with higher blast radii, lower diagnostic resolution, reduced isolation between deployment paths, and broader credential scopes than it was a year ago. It looks mostly the same. It is substantially more fragile.&lt;/p&gt;

&lt;p&gt;The debt shows up in a specific way: the system's first serious incident will be worse than it should be. The response team will lack the observability data they need to diagnose quickly. The failure will spread further than it would have if the blast radii hadn't expanded. The recovery will take longer because the staged rollback path has degraded. And by the time the post-mortem happens, it will be genuinely difficult to attribute the severity to any individual decision, because each decision was reasonable and the compounding effect was never explicitly modeled.&lt;/p&gt;

&lt;p&gt;This is why I'm skeptical of cost optimization work that doesn't explicitly surface reliability implications. Not because the optimization is wrong, but because the reliability implications are real and need to be priced into the decision. The question isn't "does this save money" — it's "does this save money after accounting for the change in expected incident cost."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Incentive Structure Problem Is Not Fixable Through Awareness Alone&lt;/strong&gt;&lt;br&gt;
I want to be careful here, because the standard move in reliability discourse is to identify the incentive misalignment and then imply that organizations simply need to do better at recognizing it. As if the problem is primarily cognitive. As if the VP who asked about the observability bill just needs to be educated.&lt;/p&gt;

&lt;p&gt;That framing is comforting and largely wrong.&lt;/p&gt;

&lt;p&gt;The incentive structures that deprioritize reliability are doing exactly what incentive structures are designed to do. Engineering teams are measured on delivery velocity, feature throughput, deployment frequency. Reliability investments don't move those metrics. In fact, they temporarily depress them — time spent on resilience work is time not spent shipping features, and every hour of chaos engineering that doesn't produce an incident looks, in retrospect, like a particularly expensive way to confirm the system was working fine. The engineers who are most disciplined about reliability investment are the ones whose work is hardest to justify in planning cycles, because the counterfactual — what would have happened if they hadn't done it — is invisible by construction.&lt;/p&gt;

&lt;p&gt;The people making prioritization decisions are not, for the most part, irrational. They're responding to the measurement system they operate in. You can run all the awareness campaigns you want about reliability debt; until the measurement system changes, the behavior won't.&lt;/p&gt;

&lt;p&gt;What actually works, in my experience, is translation. Not education about reliability engineering concepts, but translation of risk into the language the organization already speaks.&lt;/p&gt;

&lt;p&gt;When I've had success getting reliability investment funded, it's been by doing something unglamorous: modeling the expected cost of specific failure scenarios. Not in abstract terms — not "a multi-hour outage could be costly" — but as concretely as the data allows. How many transactions per hour does the checkout flow process? What's the average transaction value? What's the support ticket cost per affected user? What does engineering time cost during an extended incident? What's the realistic time-to-detection under current observability coverage, and how does that change under reduced coverage? What's the mean time to recovery if the on-call engineer can see 10% of traces versus 100%?&lt;/p&gt;

&lt;p&gt;These numbers are imprecise. But imprecise numbers are far more persuasive than precise abstractions, because they put the discussion on the same ground as the budget conversation. You're not asking for reliability investment as a matter of engineering principle. You're presenting a comparison of expected values. The proposed cost reduction saves X per month. The change in risk profile costs an estimated Y per month in increased expected incident severity. The sign of X minus Y determines whether the optimization makes sense.&lt;/p&gt;
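&lt;p&gt;A minimal sketch of that expected-value comparison. Every figure below is hypothetical; the point is the shape of the calculation, not the numbers:&lt;/p&gt;

```python
# Hypothetical expected-value comparison for a cost optimization,
# following the X-minus-Y framing above. All figures are illustrative.

def expected_incident_cost(incidents_per_month, avg_cost_per_incident):
    """Expected monthly incident cost = frequency x severity."""
    return incidents_per_month * avg_cost_per_incident

# Baseline: current observability coverage.
baseline = expected_incident_cost(incidents_per_month=0.5,
                                  avg_cost_per_incident=40_000)

# After the optimization: same failure rate, but slower detection and
# diagnosis raise the average cost per incident.
reduced = expected_incident_cost(incidents_per_month=0.5,
                                 avg_cost_per_incident=90_000)

monthly_savings = 12_000          # X: telemetry spend saved per month
added_risk = reduced - baseline   # Y: increase in expected incident cost

net = monthly_savings - added_risk
print(f"X - Y = {net}")  # prints X - Y = -13000.0
```

&lt;p&gt;The specific numbers don't matter. What matters is that X and Y are now in the same units, which is what puts the discussion on the budget conversation's ground.&lt;/p&gt;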

&lt;p&gt;Sometimes the math favors the optimization. Sometimes it doesn't. Either way, you've moved from a values argument to a financial one, and financial arguments are winnable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Post-Incident Investment Cycle: A Pattern Worth Naming&lt;/strong&gt;&lt;br&gt;
There's a cycle I've watched repeat so many times that I've started thinking of it as a law rather than a tendency.&lt;/p&gt;

&lt;p&gt;Phase one: the system is operating with insufficient resilience. The team knows this, but the investment to fix it keeps getting deprioritized. The system doesn't fail obviously, which makes it easy to argue that the risk is acceptable.&lt;/p&gt;

&lt;p&gt;Phase two: a significant incident occurs. The kind that generates a post-mortem with an executive summary. The kind that results in a retrospective where someone says "how did we not catch this earlier" and everyone in the room understands, but nobody says out loud, that the answer is "because we kept deferring the work that would have caught it."&lt;/p&gt;

&lt;p&gt;Phase three: emergency investment. Suddenly there's budget. Headcount gets pulled from feature work. Platform improvements that have been in the backlog for 18 months get prioritized. New observability tooling is evaluated and purchased. An incident commander role gets created. The SLOs that existed on paper get actual alerting attached to them.&lt;/p&gt;

&lt;p&gt;Phase four: stability. The system is genuinely more resilient now. The incident rate drops. Post-mortems start looking less severe. The engineering organization feels like it's operating with more slack.&lt;/p&gt;

&lt;p&gt;Phase five: the pressure shifts back. The business environment hasn't changed. Investors or board members are still asking about delivery velocity. Product roadmaps are still packed. The phase three investments improved reliability, but they also increased operational cost and consumed engineering capacity. Gradually — through attrition, through deprioritization, through legitimate competing pressures — the reliability investments begin to erode.&lt;/p&gt;

&lt;p&gt;And then you're back in phase one.&lt;/p&gt;

&lt;p&gt;I don't think this cycle is inevitable, but I think breaking it requires organizational maturity that's genuinely rare. It requires leadership that holds reliability investment constant through the stability phase, not just through the crisis phase. It requires measurement systems that make reliability degradation visible in real-time, not just after a major incident. It requires engineers who are willing to sound alarms about accumulating debt before those alarms are obviously warranted, which means being willing to be wrong sometimes and dealing with the credibility cost of that.&lt;/p&gt;

&lt;p&gt;The organizations that escape the cycle are the ones that have internalized a particular idea: stability is a product, not a state. It requires ongoing investment to maintain, just like any other product. The moment you stop investing in it, it begins to decay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability Is Infrastructure, Not Monitoring&lt;/strong&gt;&lt;br&gt;
I want to dwell on the observability economics question specifically, because it's where I see the most consequential misunderstanding in practice.&lt;/p&gt;

&lt;p&gt;The framing of "observability as a monitoring cost" treats telemetry as a reporting layer that sits on top of the real system. Under this framing, you can tune the telemetry fidelity to manage costs without affecting the system's behavior. The metrics and traces and logs are just recordings of what happened; the system itself is unchanged.&lt;/p&gt;

&lt;p&gt;This is technically true and operationally misleading.&lt;/p&gt;

&lt;p&gt;What observability actually determines is your ability to operate the system. Not to run it — to operate it. To detect when it's behaving unexpectedly. To diagnose the root cause of a failure. To verify that a change produced the intended effect. To distinguish between correlated failures and coincidental concurrent failures. To understand whether a gradual degradation represents a trend or a transient fluctuation.&lt;/p&gt;

&lt;p&gt;When you reduce telemetry fidelity, you don't reduce the frequency of incidents. You increase the time it takes to detect and diagnose them. And time is money in ways that are more concrete than the telemetry cost itself.&lt;/p&gt;

&lt;p&gt;The numbers I've seen in practice suggest that a 2x increase in mean time to detection, compounded by a 1.5x increase in mean time to diagnosis, produces an increase in expected incident cost that substantially exceeds the telemetry savings. But these numbers are organization-specific, and you have to actually compute them. Asserting the principle in a budget meeting doesn't work. Showing the calculation does.&lt;/p&gt;
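&lt;p&gt;That calculation takes a few lines. The baseline figures here are hypothetical stand-ins for whatever your own incident data shows:&lt;/p&gt;

```python
# Rough model: incident cost scales with total incident duration.
# All baseline figures are hypothetical; substitute your own incident data.

cost_per_hour = 9_000       # revenue + engineering + support, per hour down
mttd_hours = 0.5            # mean time to detection, full telemetry
diag_hours = 1.0            # mean time to diagnosis, full telemetry
fix_hours = 0.5             # remediation time, unaffected by telemetry

baseline_cost = cost_per_hour * (mttd_hours + diag_hours + fix_hours)

# Reduced telemetry: detection takes 2x as long, diagnosis 1.5x as long.
degraded_cost = cost_per_hour * (2.0 * mttd_hours + 1.5 * diag_hours + fix_hours)

extra_cost_per_incident = degraded_cost - baseline_cost
print(extra_cost_per_incident)  # prints 9000.0: added cost per incident
```

&lt;p&gt;Multiply that delta by your incident frequency and compare it to the monthly telemetry savings. That comparison, not the principle, is what survives a budget meeting.&lt;/p&gt;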

&lt;p&gt;There's a subtler point underneath this. Observability isn't just useful during incidents. It's what allows engineers to move quickly with confidence during normal operation. The ability to deploy a change and immediately see whether it affected error rates, latency distributions, and business metrics is what makes fast iteration safe. Organizations that cut observability in the name of enabling velocity often produce the opposite effect: engineers move more slowly because they're less confident, or they move quickly and break things that wouldn't have broken if they'd had better feedback loops.&lt;/p&gt;

&lt;p&gt;Visibility is expensive. Blindness is more expensive. This isn't a slogan. It's a calculation, and I've watched enough incidents to have substantial confidence in the direction of the inequality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Leadership Translation Problem&lt;/strong&gt;&lt;br&gt;
Senior engineers and engineering leaders occupy an uncomfortable position in reliability economics. They understand the technical risk clearly enough to know what should be done. They often lack the organizational authority to simply do it, which means they need to advocate for it effectively.&lt;/p&gt;

&lt;p&gt;Most technical advocacy fails not because the substance is wrong but because it operates in the wrong register. When an engineering leader says "we need multi-region redundancy," the implicit model is that leadership will evaluate the technical argument and reach the right conclusion. But leadership is already operating with a full attention budget, a financial planning cycle with constraints, and a set of competing priorities that are also being framed compellingly. The reliability argument is one of many, and "we need redundancy" competes poorly against "we need to ship this feature to capture this market."&lt;/p&gt;

&lt;p&gt;The translation requirement is genuine. Technical risk has to become business exposure. "A single-region outage could halt revenue generation" is better than "we need redundancy," but it's still abstract. Better is: "We processed 200,000 transactions last quarter in our primary region, with an average transaction value of $85. That's roughly $7,800 of transaction volume per hour. Our historical MTTD in single-region failure scenarios is approximately 40 minutes, and MTTR has averaged 2.5 hours, so each incident puts about $25,000 of transaction volume at risk, before we count support load, engineering time, and churn, from a failure mode our current architecture isn't isolated from." Now you're in the same room as the CFO's model.&lt;/p&gt;
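&lt;p&gt;The arithmetic behind that pitch can be checked directly from the quoted inputs:&lt;/p&gt;

```python
# Single-region outage exposure, recomputed from the inputs quoted above.

transactions_per_quarter = 200_000
avg_transaction_value = 85          # dollars
hours_per_quarter = 13 * 7 * 24     # about 2,184

hourly_volume = transactions_per_quarter / hours_per_quarter * avg_transaction_value

mttd_hours = 40 / 60                # historical time to detection
mttr_hours = 2.5                    # historical time to recovery
outage_hours = mttd_hours + mttr_hours

exposure = hourly_volume * outage_hours
print(round(exposure))  # prints 24649: dollars of volume at risk per incident
```

&lt;p&gt;Interrupted transaction volume is only the floor of the exposure; support, engineering time, and churn stack on top of it. But it's a floor the finance model can actually ingest.&lt;/p&gt;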

&lt;p&gt;This translation work is uncomfortable for many engineers, for good reasons. It requires making estimates with uncertain inputs and presenting them with false precision. It requires framing technical concerns in financial terms that feel reductive. It requires operating as an advocate rather than as an analyst. None of these are natural modes for people who were attracted to engineering because of its rigor and objectivity.&lt;/p&gt;

&lt;p&gt;But I've watched the alternative play out. Technical correctness without business fluency results in reliability work being perpetually deferred, periodically rediscovered after major incidents, and never institutionalized. The pattern repeats indefinitely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Monday Morning Looks Like&lt;/strong&gt;&lt;br&gt;
If you've read this far and found it resonant rather than abstract, you're probably sitting with a system that has some of what I've described: accumulated reliability debt, under-instrumented failure paths, observability that's been thinned for cost reasons, a post-incident investment backlog that hasn't been touched since the last crisis, incentive structures that don't reward the work you know needs doing.&lt;/p&gt;

&lt;p&gt;What do you actually do?&lt;/p&gt;

&lt;p&gt;First: make the debt legible. Not to advocate for fixing it — not yet — but to understand it. Take the time to write down, concretely, what your current architecture's blast radii are. Which component failures are isolated? Which ones are shared? What's the observability coverage on your critical paths? When did you last actually test your failover procedures? Not document them. Test them. Meaning run them against a production-like environment and time the recovery.&lt;/p&gt;

&lt;p&gt;Second: identify the three failure scenarios that would hurt the most. Not the most likely — the most consequential. Price them. What does an hour of that specific failure actually cost, in revenue impact, engineering time, customer support volume, and reputational exposure? Keep the numbers rough. Rough-but-honest beats precise-but-fake.&lt;/p&gt;

&lt;p&gt;Third: find the cheapest reliability investment you haven't made yet that would materially reduce one of those three scenarios. Not the comprehensive solution. The cheapest meaningful step. This is where the argument for action becomes easy to win, because you're not asking for the full investment — you're asking for the part of the investment that has the best return.&lt;/p&gt;

&lt;p&gt;The full program of resilience takes years to build and requires sustained organizational commitment that has to be re-earned continuously. But there's almost always something you can do this week that moves the needle on your most acute risk. And doing that thing builds the track record of reliability work delivering value, which is the prerequisite for getting the bigger investments funded.&lt;/p&gt;

&lt;p&gt;Every architecture reflects what an organization actually valued, not what it said it valued. The post-mortem document where everyone agrees that observability is critical doesn't mean much if the next budget cycle cuts the telemetry spend. The reliability roadmap that's been in the planning doc for two years without a single item getting built is telling you something accurate about organizational priorities, regardless of what people say when the topic comes up.&lt;/p&gt;

&lt;p&gt;Outages are revealing in this way. They expose not just what failed technically, but what the organization had implicitly decided it was willing to risk. The failure was authorized, in a sense — not deliberately, not in any single decision, but through the accumulated weight of choices about where to spend time and money and attention.&lt;/p&gt;

&lt;p&gt;The organizations that build durable reliability aren't the ones that respond best to incidents. They're the ones that have made the reliability investments boring because they're consistently funded, consistently practiced, and consistently measured — even when there's no visible crisis demanding them.&lt;/p&gt;

&lt;p&gt;That's harder than it sounds. It requires leadership that can hold a long time horizon through short-term pressure. It requires engineers who can translate risk into business language without losing technical precision. It requires incentive structures that reward prevention rather than just heroic recovery.&lt;/p&gt;

&lt;p&gt;But it's achievable. I've seen it. The organizations that get there tend to share one characteristic: they treat reliability as an obligation to the people who depend on their systems, not as a cost center to be optimized.&lt;/p&gt;

&lt;p&gt;That framing change — subtle, almost philosophical — turns out to matter enormously when the quarterly infrastructure review comes around and someone puts the observability bill on the screen.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>sre</category>
      <category>security</category>
    </item>
    <item>
      <title>Complexity Is a Liability (Until It Isn't)</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Wed, 04 Mar 2026 20:07:25 +0000</pubDate>
      <link>https://forem.com/iyanu_david/complexity-is-a-liability-until-it-isnt-a4e</link>
      <guid>https://forem.com/iyanu_david/complexity-is-a-liability-until-it-isnt-a4e</guid>
      <description>&lt;p&gt;Every mature system accumulates complexity the way old buildings accumulate load-bearing walls that nobody drew on the original blueprints — quietly, necessarily, and in ways that become genuinely dangerous to remove.&lt;/p&gt;

&lt;p&gt;More services. More integrations. More policies, more environments, more abstraction layers stacked on top of abstraction layers that were themselves stacked on top of something someone built at 2am in 2019 and never properly documented. The conventional wisdom, the thing you hear in every architecture review when someone senior leans back and sighs, is that you should reduce complexity to improve reliability. And that's directionally correct. It's also incomplete in ways that cause real damage when teams take it literally.&lt;/p&gt;

&lt;p&gt;Complexity is not inherently harmful. Unmanaged complexity is. And sometimes — this is the part that gets elided in the simplification rhetoric — removing complexity creates new fragility that won't announce itself until the worst possible moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Two Kinds, and Why the Distinction Matters More Than Almost Anything&lt;/strong&gt;&lt;br&gt;
There's a taxonomy here that most organizations gesture at but rarely operationalize with any rigor. Architectural complexity falls into two categories, and conflating them is how you end up with a post-incident review that says "we over-simplified" with no apparent awareness of the irony.&lt;/p&gt;

&lt;p&gt;The first kind is accidental complexity. Inconsistent patterns — one team using REST, another gRPC, a third using some homebrew message protocol because they hired a Kafka evangelist in 2021 and he's since moved on. Duplicate services that do nominally different things but solve the same underlying problem. Unclear ownership, which sounds like a people problem and is actually a systems problem, because unclear ownership means nobody knows which service to page when something breaks at 3am and the blast radius expands while everyone figures it out. Redundant pipelines. Legacy scaffolding that was supposed to be temporary in 2017 and has since become so load-bearing that nobody touches it, which means nobody understands it, which means it functions as a kind of institutional learned helplessness.&lt;/p&gt;

&lt;p&gt;None of this adds capability. It's drift. It's what entropy looks like in production. It should be removed, aggressively, and the only reason it usually isn't is that removal carries risk and carries cost and the people with context to do it safely are always busy with something more urgent. This is how accidental complexity compounds. You don't build it intentionally. You inherit it incrementally, one expedient decision at a time, and then one day you're onboarding a new engineer and you realize you cannot explain why any of this exists.&lt;/p&gt;

&lt;p&gt;The second kind is essential complexity. Multi-region failover — which means you have to deal with split-brain scenarios, with replication lag, with the gnarly question of what "current" means when your data is in three regions and the network between two of them just partitioned. Zero-trust network segmentation, which is genuinely complicated to implement and maintain and test. Fine-grained IAM boundaries. Event-driven architectures where the complexity lives in the guarantees — exactly-once delivery, ordering within partitions, dead-letter queues that someone has to actually drain. Data lineage enforcement for regulated industries where you need to be able to prove, forensically, what touched what and when.&lt;/p&gt;

&lt;p&gt;This complexity exists because the problem space demands it. Removing it simplifies your architecture diagram. It also increases your risk profile in ways that may not be immediately visible, which is exactly what makes simplification seductive and occasionally catastrophic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Simplification Trap, Which Is Real and Which I Have Watched Happen&lt;/strong&gt;&lt;br&gt;
Teams pursue simplification as a blanket objective. Usually this follows a rough trigger: a new VP of Engineering who came from a startup and finds the accumulated weight of the existing system alarming. A cost-optimization initiative where someone runs a spreadsheet and finds that infrastructure spend has grown 40% year-over-year. A postmortem where the incident was traced to complexity — a cascading failure that propagated through too many services — and the action item is to have fewer services.&lt;/p&gt;

&lt;p&gt;The typical moves: collapse microservices into a monolith, because microservices have overhead and inter-service latency and distributed transactions are a nightmare to debug. Centralize permissions, because fine-grained IAM is hard to reason about and someone just got paged because a service couldn't read from a bucket it should have been able to read from. Remove staging environments, because staging is expensive and it never quite matches production anyway. Flatten network boundaries. Reduce observability to "core metrics" because the telemetry bill is too high.&lt;/p&gt;

&lt;p&gt;Short term, clarity genuinely improves. You can hold more of the system in your head. Deployments are faster. The architecture diagram is legible.&lt;/p&gt;

&lt;p&gt;Long term, resilience declines. And it declines quietly, in ways that are easy to attribute to bad luck rather than structural degradation. You removed a circuit breaker here, a bulkhead there, a redundant path whose redundancy you never needed until you did. You collapsed the blast radius boundaries and didn't notice because nothing went wrong for six months.&lt;/p&gt;

&lt;p&gt;Why does this happen? Because some complexity absorbs failure. Redundancy is complex — running multiple instances of the same service, maintaining consistency across them, handling failover, all of this is overhead that only pays off during incidents. Isolation is complex — network segmentation, separate databases per service, independent deployment pipelines, all of this costs money and time and engineering attention during normal operations. Defense-in-depth is complex — it means you have multiple mechanisms that each partially mitigate a class of failure, which means you have to understand all of those mechanisms, which means cognitive load. But these mechanisms reduce blast radius. They turn a total outage into a degraded service. They turn a compromised service into a contained breach rather than a full network traversal.&lt;/p&gt;
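&lt;p&gt;One concrete form of isolation, the bulkhead mentioned earlier, can be as small as a per-dependency semaphore. This is an illustrative sketch, not a production implementation; the class name and pool sizes are invented for the example:&lt;/p&gt;

```python
import threading

# Bulkhead sketch: each downstream dependency gets its own capacity pool,
# so a slow or failing dependency can exhaust only its own slots.
# Pool sizes here are illustrative.

class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        # Fail fast instead of queueing when the pool is exhausted.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: shedding load")
        try:
            return fn(*args)
        finally:
            self._slots.release()

# Separate pools: a slow recommendations service cannot starve
# calls to the payments service.
payments_pool = Bulkhead(max_concurrent=20)
recs_pool = Bulkhead(max_concurrent=5)
```

&lt;p&gt;Twenty lines of overhead during normal operation. The payoff only appears during the incident it contains.&lt;/p&gt;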

&lt;p&gt;&lt;strong&gt;Complexity as Shock Absorber&lt;/strong&gt;&lt;br&gt;
High-reliability systems — the term has a specific origin in organizational theory, Perrow and Weick and the research on nuclear plants and aircraft carriers, but it applies directly here — intentionally introduce structured complexity. Circuit breakers that stop calling a failing dependency rather than queuing up an ever-growing backlog of requests that will never succeed. Bulkheads that partition thread pools so a slow downstream doesn't exhaust the whole application's capacity. Rate limits, retries with exponential backoff and jitter, fallback paths that serve stale data rather than errors.&lt;/p&gt;

&lt;p&gt;Every one of these mechanisms complicates control flow. You have to test them. You have to make sure they're configured correctly, because a circuit breaker with the wrong threshold will trip too eagerly during normal transient errors, or not trip at all when you actually need it. You have to reason about what the fallback behavior means for your users and your data consistency guarantees. This is real overhead.&lt;/p&gt;
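&lt;p&gt;A minimal circuit breaker makes the configuration question concrete. The threshold and cooldown below are exactly the values the previous paragraph warns about getting wrong; both are illustrative:&lt;/p&gt;

```python
import time

# Minimal circuit breaker sketch. The threshold and cooldown are the
# configuration decisions discussed above: too low and it trips on normal
# transient errors, too high and it never protects you. Values illustrative.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.opened_at = None  # half-open: allow one trial call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

&lt;p&gt;Note what the fallback behavior means for callers: they get a fast, explicit failure instead of a hung request, and the failing dependency gets room to recover.&lt;/p&gt;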

&lt;p&gt;But what you're buying with that overhead is failure containment. The absence of visible complexity in your system is often — not always, but often — the absence of protection. The system that looks clean in the diagram is frequently the system that cascades catastrophically when one component fails because you removed all the circuit breakers in the simplification initiative.&lt;/p&gt;

&lt;p&gt;I think about this in terms of what I'd call structural load distribution. In a building, you spread load across multiple paths. If one fails, the structure redistributes. If you've streamlined everything through a single critical path — which is cleaner, architecturally elegant, fewer components — then you've also made that single path a single point of failure. The elegance is real. So is the fragility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cost Dimension, Which Is Where Idealism Goes to Die&lt;/strong&gt;&lt;br&gt;
Complexity carries economic weight, and anyone who tells you otherwise has never had to justify an infrastructure bill to a CFO.&lt;/p&gt;

&lt;p&gt;More services means higher infrastructure cost — more compute, more networking, more storage, more licensing. More observability means higher telemetry spend — traces and logs are not free, especially at scale, and the first thing that gets cut in a cost optimization is usually the instrumentation that would have told you about the next incident. More isolation means resource duplication. More review layers mean slower delivery, which has its own cost in opportunity terms.&lt;/p&gt;

&lt;p&gt;So you have this genuine tension between cost optimization, operational resilience, and delivery velocity, and the three don't resolve cleanly. Over-optimizing for cost removes protective complexity — you cut the redundancy, you consolidate the services, you turn off the detailed logging, and everything looks fine right up until it doesn't. Over-optimizing for resilience bloats operational overhead — you run redundant everything, you maintain elaborate failover mechanisms for failure modes that have never actually occurred, you spend more engineering time maintaining protective infrastructure than building product.&lt;/p&gt;

&lt;p&gt;Sustainable architecture balances both, which is genuinely hard and not a thing you achieve once and then maintain passively. It requires revisiting trade-offs as the system evolves, as load patterns change, as the business context shifts. The right amount of complexity for a system handling ten thousand requests per day is different from the right amount for one handling ten million. But those revisions require organizational discipline that most teams don't have, because the pressure is always toward the immediate cost saving or the immediate velocity gain, not the long-horizon resilience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cognitive Complexity Versus Structural Complexity&lt;/strong&gt;&lt;br&gt;
Here's a distinction that I think is more important than the accidental/essential taxonomy, and which gets less attention than it deserves.&lt;/p&gt;

&lt;p&gt;A system can be structurally complex — many services, many integration points, many mechanisms — but cognitively simple, if the patterns are consistent. If every service exposes its health state the same way, deploys via the same pipeline, emits structured logs in the same format, uses the same circuit breaker library configured the same way, then an engineer who understands one service can reason about all of them. The cognitive overhead per component is low because the patterns transfer.&lt;/p&gt;

&lt;p&gt;Conversely, a system can be structurally simple — few services, minimal integrations — but cognitively complex if the behaviors are unpredictable or inconsistent. One service that fails silently. Another that retries aggressively rather than circuit-breaking. A third whose logging is inconsistent depending on the code path. A fourth that has some undocumented special behavior during weekends because it was originally deployed as a batch job and still has that heritage baked into its scheduler logic. Each of these is tractable in isolation, but the accumulated cognitive load of keeping track of all the exceptions, all the "this one is different because," is genuinely degrading.&lt;/p&gt;

&lt;p&gt;Cognitive complexity is what actually degrades reliability. You can see it in specific indicators. Engineers who are reluctant to deploy on Fridays — not because Friday deployments are inherently unsafe, but because the system's behavior during failures is unpredictable enough that they want a full week of support buffer. On-call rotations where certain services are dreaded, where people angle to not be the one holding the pager when service X alerts because service X is a nightmare to debug. Tribal knowledge clustering around individuals — "only Sarah really understands the event processing pipeline" — which means the system's reliability is now contingent on Sarah's availability and tenure. These are diagnostic signals. They tell you where cognitive complexity has accumulated to a level that's affecting human decision-making, which means it's affecting reliability.&lt;/p&gt;

&lt;p&gt;Consistency reduces cognitive burden even when systems are large. This is why investing in internal developer platforms, standardized deployment tooling, shared libraries for common concerns — observability, circuit breaking, authentication — pays reliability dividends that are hard to measure directly but very real. You're not reducing structural complexity so much as you're reducing the cognitive overhead per unit of structural complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Entropy Curve&lt;/strong&gt;&lt;br&gt;
Over time, accidental complexity increases. This is the default trajectory if you're not actively working against it. Essential complexity stays roughly constant or grows slowly as the system takes on new capabilities. And cognitive clarity decreases, because the documentation that existed at the start is now stale, the engineers who held context have moved on, and the patterns that were once consistent have been gradually diverged from as teams made expedient local decisions.&lt;/p&gt;

&lt;p&gt;Without intentional pruning — not occasional, not when things get bad enough to force a migration, but regular, systematic — accidental complexity eventually overwhelms essential complexity. You can't see the protective mechanisms through the noise anymore. The circuit breakers are buried under six layers of abstraction and nobody's sure if they're actually configured. The redundant data path exists on paper but hasn't been tested in two years and probably doesn't work.&lt;/p&gt;

&lt;p&gt;This is when incident frequency increases. Not dramatically, not in a way that's easy to attribute to the structural decay rather than to bad luck or increased traffic. Change failure rate rises. Bugs that should have been caught by staging slip to production because staging is too different from production, or because people have started skipping it because it's flaky. MTTR expands because debugging requires tribal knowledge that's no longer distributed across the team. Teams slow down — not because the architecture was wrong when it was designed, but because it was never simplified intentionally as it evolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to Reduce, When to Preserve&lt;/strong&gt;&lt;br&gt;
Simplify when redundancy no longer provides meaningful isolation — when you have two services that both have the same database as a single point of failure, so the service-level redundancy is illusory. When a service exists only to support a legacy integration that no longer exists on the other end. When two systems solve the same problem differently and the difference is historical accident rather than intentional differentiation — the merge will be painful, but the ongoing maintenance cost exceeds the migration cost. When policies overlap without increasing security. When the cost of maintaining a boundary genuinely exceeds its protective value.&lt;/p&gt;

&lt;p&gt;The critical qualifier: simplification should be strategic, not aesthetic. The goal is not to produce a clean architecture diagram for the next engineering all-hands. The goal is to reduce accidental complexity while preserving essential complexity. These are different goals and they sometimes point in opposite directions.&lt;/p&gt;

&lt;p&gt;Retain complexity when it limits blast radius. When it reduces cross-team coordination during incidents — because the blast radius boundary is also a coordination boundary, and contained incidents don't require the whole organization to mobilize. When it enforces explicit ownership in ways that make on-call handoffs clean. When it protects critical data paths from noisy neighbors. When it creates resilience under load rather than cascading degradation.&lt;/p&gt;

&lt;p&gt;Removing these layers will improve short-term velocity. It usually does. The system is simpler to reason about, faster to deploy to, cheaper to run. The cost shows up later, often after the people who made the simplification decision have moved on and the people who inherited the system are dealing with incidents that the old protective complexity would have contained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Maturity Shift&lt;/strong&gt;&lt;br&gt;
Early-stage systems optimize for speed. This is correct. When you don't know what your system needs to be, the cost of elaborate protective mechanisms is too high — you're buying resilience for a system that might not survive to need it. Move fast, accept fragility, keep the system small enough to reason about without the scaffolding.&lt;/p&gt;

&lt;p&gt;Mature systems must optimize for survivability. The shift is subtle and most organizations make it too late, or make it rhetorically without making it structurally. From feature velocity to failure containment — accepting that a slower, more careful deployment process is worth it because the cost of failures, in customer trust and incident response and revenue impact, exceeds the cost of the slower cycle. From architectural elegance to operational durability — preferring the solution that is debuggable at 3am over the solution that is beautiful in a whiteboard drawing. From minimal surface area to managed isolation — accepting that more components means more things to maintain, and deciding that the blast radius reduction is worth that overhead.&lt;/p&gt;

&lt;p&gt;Complexity, in a mature system, becomes a tool. Not an accident, not something you're apologetic about, but a deliberate choice made with awareness of the trade-off. The circuit breaker isn't there because someone overengineered it. It's there because you analyzed a failure mode, decided it was worth protecting against, and built the protection. The staging environment isn't there because it's traditional. It's there because the cost of production incidents exceeds the cost of maintaining staging, and you know this because you've measured it.&lt;/p&gt;
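&lt;p&gt;To make the circuit-breaker trade-off concrete, here is a minimal sketch of the pattern in Python. The threshold and cooldown values are illustrative, not recommendations, and this is nowhere near a production implementation: the point is only that you are deliberately adding a component in exchange for failing fast instead of hammering a dependency that cannot serve you.&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after repeated failures,
    then allows a trial call once a cooldown has elapsed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at > self.reset_timeout:
                self.opened_at = None  # half-open: allow one trial call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

&lt;p&gt;Notice that the sketch embodies the essay's point: it exists because someone analyzed a failure mode and decided the protection was worth an extra moving part.&lt;/p&gt;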

&lt;p&gt;&lt;strong&gt;What You'd Change on Monday&lt;/strong&gt;&lt;br&gt;
The practical question is the one that actually matters after all the analysis.&lt;/p&gt;

&lt;p&gt;Start with a cognitive complexity audit, not a structural one. Walk through your on-call rotation and ask your engineers where they dread getting paged. Ask them which services they'd rather not touch. Identify the tribal knowledge clusters — the "only one person understands this" systems. Those are your highest-leverage intervention points, because cognitive complexity is what actually degrades reliability, and you can often address it without architectural changes by investing in documentation, runbooks, and observability improvements.&lt;/p&gt;

&lt;p&gt;Then distinguish, rigorously, between your accidental complexity and your essential complexity. For every service, every integration, every policy layer, ask: does this provide meaningful isolation, or does it exist because nobody cleaned it up? The answer is often neither clean nor obvious. Some things are essential complexity that has accumulated accidental complexity on top — the core mechanism is protective, but it's surrounded by legacy scaffolding that makes it harder to understand and maintain than it should be.&lt;/p&gt;

&lt;p&gt;Then, and only then, start simplification — but measure the blast radius before you reduce it. If you're collapsing two services into one, understand what that does to failure isolation. If you're removing a network boundary, understand what lateral movement becomes possible in an incident. If you're turning off staging, understand what your change failure rate might look like without it. These aren't reasons not to simplify. Sometimes the analysis comes back in favor of simplification. But you should be making that call consciously, with a model of the trade-off, rather than following the local pressure toward cleaner-looking systems.&lt;/p&gt;

&lt;p&gt;The deepest pattern here isn't really about complexity. It's about the way systems accumulate history, and the way that history contains both drift and intention in proportions that are hard to distinguish from the outside. Your job, as someone who builds and maintains production systems, is to develop the judgment to tell the difference — to see which complexity is protecting you and which complexity is just noise, and to have the organizational patience to remove one while preserving the other.&lt;/p&gt;

&lt;p&gt;Systems do not collapse under visible complexity. They collapse under neglected complexity — the complexity nobody is maintaining, nobody is questioning, nobody is sure is still doing what it was originally built to do. The answer is not minimalism. It is attention.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>sre</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Automation Scales Decisions, Not Understanding</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Wed, 04 Mar 2026 19:21:25 +0000</pubDate>
      <link>https://forem.com/iyanu_david/automation-scales-decisions-not-understanding-3p12</link>
      <guid>https://forem.com/iyanu_david/automation-scales-decisions-not-understanding-3p12</guid>
      <description>&lt;p&gt;There's a particular kind of confidence that settles over an engineering team after they've fully automated their deployment pipeline. The dashboards are green. The pipelines pass. On-call is quiet. And somewhere in the back of their minds, they've filed the whole infrastructure question under solved. This is not hubris, exactly — it's the reasonable conclusion of months of careful work. It just happens to be wrong.&lt;/p&gt;

&lt;p&gt;Automation is a force multiplier. What most teams don't sit with long enough is what, precisely, it multiplies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You Actually Encoded&lt;/strong&gt;&lt;br&gt;
Every automated system is a fossilized argument. Someone, at some point, looked at a problem and made a series of judgment calls: how often to retry a failing request, at what CPU threshold to trigger a scale-out event, which IAM roles should cascade from which parent policies, what constitutes a "safe" deployment gate versus an unnecessarily paranoid one. Those judgments got encoded. They became executable. And then — because they worked — they became invisible.&lt;/p&gt;

&lt;p&gt;This is the central mechanism people keep missing. Automation doesn't replace human judgment. It preserves human judgment and applies it at machine speed, indefinitely, without asking whether the original judgment still holds.&lt;/p&gt;

&lt;p&gt;A retry policy written in 2019 for a monolith that handled 50k requests per day doesn't automatically recalibrate when that same retry logic gets applied, years later, to a distributed mesh of thirty microservices processing 5 million events per hour. It just runs. Faithfully. Catastrophically, sometimes.&lt;/p&gt;

&lt;p&gt;The assumption that got encoded — that retrying N times with a fixed backoff was sensible — was probably correct at the time of writing. The system it was written for no longer exists. The automation does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On Blast Radius, and Why It Compounds&lt;/strong&gt;&lt;br&gt;
Manual errors have a natural blast radius constraint: a human can only do so much wrong per unit time. They get tired. They second-guess themselves. They stop and ask someone. The failure surface of a person is bounded by human cognitive throughput.&lt;/p&gt;

&lt;p&gt;Automation has no such constraint.&lt;/p&gt;

&lt;p&gt;A misconfigured Terraform module that gets instantiated across forty services doesn't fail forty times slower than a human would. It fails all at once, consistently, with the full authority of a system that everyone has agreed to trust. A flawed IAM permission template applied organization-wide doesn't announce itself with an error message you'd catch during code review — it propagates quietly through every role that inherits from it, waiting for the exact moment someone tries to do something the template has accidentally made impossible, or worse, accidentally made possible for everyone.&lt;/p&gt;

&lt;p&gt;I've seen this pattern play out with auto-scaling configurations more times than I care to count. Someone sets a scaling trigger based on memory utilization, which made sense when the application had a roughly linear relationship between memory and load. Then the application changes — a new caching layer gets introduced, memory utilization flattens out at a lower baseline — and now the auto-scaler is perpetually convinced the system is underloaded. It doesn't provision enough capacity during traffic spikes. The service degrades. On-call gets paged. Someone spends three hours tracing what looks like a load balancer problem before they find a scaling threshold that hasn't been reviewed in fourteen months.&lt;/p&gt;
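&lt;p&gt;The shape of that failure fits in a few lines. Here is a toy scaler with a hypothetical memory threshold (the numbers are invented for illustration, not drawn from any real autoscaler): the rule encodes the assumption that memory tracks load, and it keeps executing that assumption faithfully after the caching layer has made it false.&lt;/p&gt;

```python
def desired_replicas(current_replicas, memory_utilization, scale_out_above=0.75):
    """Naive scaler: add a replica when memory crosses a fixed threshold.
    The threshold encodes an assumption that memory rises with load."""
    if memory_utilization > scale_out_above:
        return current_replicas + 1
    return current_replicas

# Before the caching layer: memory rises with load, the trigger fires.
assert desired_replicas(4, 0.82) == 5

# After the caching layer flattens memory at a lower baseline, a real
# traffic spike no longer moves the metric. The scaler sees 0.55 and
# concludes, wrongly, that the system is underloaded.
assert desired_replicas(4, 0.55) == 4
```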

&lt;p&gt;Centralized logic centralizes failure. This is not a bug in automation — it's a structural property of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Retry Storm, in Enough Detail to Be Useful&lt;/strong&gt;&lt;br&gt;
The retry storm pattern deserves more than a diagram. Here's what it actually looks like from the inside.&lt;/p&gt;

&lt;p&gt;A service starts rejecting requests — say, because its database connection pool is exhausted. Upstream callers, all of which have been configured with retry logic (as they should be), begin retrying. Good. That's the intended behavior. But the retry logic doesn't know why requests are failing — it just knows they're failing and that it should try again. So it does. Immediately, or with a brief backoff. The database, which was already saturated, receives more requests. The connection pool stays exhausted. More retries fire.&lt;/p&gt;

&lt;p&gt;Meanwhile, the auto-scaler, observing high request latency and elevated CPU on the service, decides to provision more instances. Those new instances spin up and immediately begin processing the backlogged request queue — which means they immediately hit the same saturated database. More connections requested. Same exhausted pool. The new instances are now also retrying, which means the total retry volume has increased. The database is receiving orders of magnitude more traffic than the original load would have generated.&lt;/p&gt;

&lt;p&gt;Every component behaved exactly as designed. The retry policy did what it was supposed to do. The auto-scaler did what it was supposed to do. The connection pool limit was set to a reasonable value. The emergent behavior — a cascading amplification loop that makes a localized failure systemic — was not designed. It wasn't modeled. Nobody sat down and worked through what would happen when these three independently reasonable systems ran concurrently under a specific failure condition.&lt;/p&gt;

&lt;p&gt;The absence of systemic modeling is the actual problem. Not the retry policy. Not the auto-scaler. The failure to ask: what does this automation look like in conversation with everything else?&lt;/p&gt;

&lt;p&gt;Exponential backoff with jitter helps. Circuit breakers help more. But both of those are still point solutions applied to individual services, not a rethinking of how the behaviors interact at a system level.&lt;/p&gt;
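&lt;p&gt;For readers who haven't implemented it, the full-jitter variant of exponential backoff is a short sketch (the base, cap, and attempt budget here are placeholders, and the retry wrapper is deliberately simplistic): each caller sleeps a random amount up to an exponentially growing ceiling, so a crowd of synchronized retriers spreads out instead of hitting a recovering dependency in lockstep.&lt;/p&gt;

```python
import random
import time

def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full-jitter backoff: a random delay between 0 and
    min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

def call_with_retries(fn, max_attempts=5):
    """Retry with jittered backoff, then surface the failure rather
    than mask it once the attempt budget is exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

&lt;p&gt;Even this correct-looking sketch is still a point solution: it makes each caller politer, but it does not model what forty such callers do to a shared database at once.&lt;/p&gt;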

&lt;p&gt;&lt;strong&gt;The Silence Problem&lt;/strong&gt;&lt;br&gt;
There's a failure mode that doesn't page anyone, doesn't trip any alerts, and doesn't appear on any dashboard. It accrues slowly, over months or years, and by the time it manifests it's been so thoroughly normalized that it's nearly impossible to distinguish from intended behavior.&lt;/p&gt;

&lt;p&gt;Automation creates a perception of control that outlasts the validity of the assumptions it was built on.&lt;/p&gt;

&lt;p&gt;The thresholds are outdated. But the alerts are quiet, so nobody looks. The monitoring was designed for a system architecture that's been refactored twice since the dashboards were built. The metrics that would indicate something is wrong aren't being collected, because nobody thought to add them when the new service was deployed and nobody thought to ask whether the old metrics still meant what they used to mean. The alerts that do fire get routed to a Slack channel that three people have muted because it pages too frequently and is almost always a false positive.&lt;/p&gt;

&lt;p&gt;Automated systems fail quietly first. They degrade. They drift from their intended operating envelope, gradually, in ways that are individually imperceptible. The failure that eventually surfaces isn't usually a sudden collapse — it's the last increment in a long accumulation of small divergences. And when that failure hits, the task of diagnosing it is effectively archaeological. You're not debugging a system; you're reconstructing the history of a system from evidence it didn't know to preserve.&lt;/p&gt;

&lt;p&gt;This is what I mean when I say automation becomes invisible. It's not that people forget it exists. It's that they stop interrogating it. It works, so nobody asks whether it's still doing what they think it's doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Automation Outpaces the Team's Mental Model&lt;/strong&gt;&lt;br&gt;
A team of eight engineers builds a deployment pipeline. They all understand it. They know why the staging gate exists, what the integration test suite is actually validating, why the rollback trigger is configured the way it is. The pipeline reflects their collective understanding of the system.&lt;/p&gt;

&lt;p&gt;The team grows. People leave, people join. The system evolves. New services get added. The pipeline gets extended, incrementally, by people who understand the new additions but not necessarily the original context. A year later, you have a pipeline that nobody fully understands end-to-end. Not because the engineers are careless — they're not — but because the complexity of the whole has grown beyond what any individual holds in their head.&lt;/p&gt;

&lt;p&gt;Now ask: when something breaks in that pipeline, who can diagnose it?&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. I've been in incident bridges where eight engineers were staring at a failed deployment and nobody was confident about what the failing stage was actually doing, why it was sequenced where it was, or what the safe intervention path looked like. The automation had accumulated organizational knowledge that was no longer distributed across any of the people responsible for it.&lt;/p&gt;

&lt;p&gt;Automation abstracts complexity. Abstraction is useful right up until the abstraction fails and you need to understand what's underneath it.&lt;/p&gt;

&lt;p&gt;The engineers who built the original system knew. Some of them left. The ones who joined after the fact learned to operate the system — to feed it inputs and read its outputs — without necessarily understanding the reasoning that shaped it. Which is fine, until it isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails, Not Guarantees&lt;/strong&gt;&lt;br&gt;
Here's the conceptual error that gets teams into trouble: treating automation as a guarantee rather than a guardrail.&lt;/p&gt;

&lt;p&gt;A guardrail is a constraint that makes a bad outcome less likely. A guarantee is a promise that a specific outcome will occur. These are not the same thing, and conflating them is where the real risk lives.&lt;/p&gt;

&lt;p&gt;An automated deployment gate that checks for test coverage thresholds is a guardrail. It reduces the probability of shipping undertested code. It doesn't guarantee quality. It doesn't account for tests that pass but don't actually validate the behavior you care about. It doesn't know that the test suite you wrote eighteen months ago was designed to test a system that has since been refactored into something structurally different.&lt;/p&gt;

&lt;p&gt;The guardrail is only as good as the model it encodes. And models age.&lt;/p&gt;

&lt;p&gt;The dangerous thing about treating automation as a guarantee is that it atrophies the human judgment that should be running in parallel. Teams stop asking whether the deployment gate is checking the right things because they've tacitly decided the deployment gate handles it. The gate passes, so the code ships. The implicit assumption is that if the automation didn't catch it, there's nothing to catch.&lt;/p&gt;

&lt;p&gt;This is the illusion of safety. Not that everything is fine. That someone else — the system — is responsible for knowing whether everything is fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Careful Builders Actually Do&lt;/strong&gt;&lt;br&gt;
None of this is an argument against automation. Automation is correct. The alternative — scaling human operations linearly with system complexity — isn't viable. The question isn't whether to automate; it's whether you're maintaining an honest relationship with what you've automated.&lt;/p&gt;

&lt;p&gt;Some patterns that hold up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make intent explicit and durable.&lt;/strong&gt; Every automated rule should have a comment that explains not what it does but why it exists — what failure mode it was designed to prevent, what assumptions it makes about the system, when those assumptions should be revisited. This sounds obvious. Almost nobody does it consistently. The rule that's been "always there" is exactly the one that accumulated without documentation, and it's exactly the one you'll be most confused by when it misfires.&lt;/p&gt;
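&lt;p&gt;One way to make this habit structural rather than aspirational is to let the rule carry its own rationale. A hypothetical shape, with invented names and an invented incident history, purely to show the idea of pairing the executable value with the judgment that produced it:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class AutomationRule:
    """Pairs an executable parameter with the judgment behind it."""
    name: str
    value: float
    why: str           # the failure mode this rule was built to prevent
    assumes: str       # what must stay true for the value to stay valid
    revisit_when: str  # the signal that should trigger a re-evaluation

# Hypothetical example rule; every field below is invented for illustration.
RETRY_ATTEMPTS = AutomationRule(
    name="orders-api.retry.max_attempts",
    value=3,
    why="transient 503s during database failover",
    assumes="failovers complete quickly and callers are idempotent",
    revisit_when="database topology or failover mechanism changes",
)
```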

&lt;p&gt;&lt;strong&gt;Treat automated thresholds as hypotheses, not facts.&lt;/strong&gt; A scaling threshold, a retry count, a timeout value — these are predictions about system behavior under specific conditions. Conditions change. The prediction needs to be re-evaluated. Building in a scheduled review cycle for critical automation parameters is not bureaucratic overhead; it's the minimum viable hygiene for systems that run at scale.&lt;/p&gt;
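&lt;p&gt;A scheduled review cycle can be as unglamorous as a dated registry plus a check that flags anything past its review budget. A sketch, with hypothetical parameter names, dates, and a 180-day budget chosen arbitrarily for illustration:&lt;/p&gt;

```python
import datetime

# Hypothetical registry: each critical parameter carries the date it was
# last validated against the live system.
LAST_REVIEWED = {
    "autoscaler.memory_scale_out": datetime.date(2024, 1, 15),
    "orders-api.retry.max_attempts": datetime.date(2025, 6, 2),
}

def stale_parameters(today, max_age_days=180):
    """Return parameters whose last review is older than the budget."""
    cutoff = today - datetime.timedelta(days=max_age_days)
    return sorted(name for name, reviewed in LAST_REVIEWED.items()
                  if cutoff >= reviewed)
```

&lt;p&gt;Run from a cron job or CI, a check like this turns "we should revisit that threshold" from a sentiment into a ticket.&lt;/p&gt;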

&lt;p&gt;&lt;strong&gt;Design for legibility, not just correctness.&lt;/strong&gt; An automated action that's correct but untraceable is a liability. When something goes wrong — and it will — you need to be able to reconstruct what the automation did, when it did it, and why it decided to do it. Every automated action should emit a structured log entry that captures the input condition that triggered it, the decision made, and the outcome. Not just that the auto-scaler scaled out, but what metric reading at what threshold at what time caused it to scale from how many instances to how many.&lt;/p&gt;
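&lt;p&gt;The minimum viable version of this is one structured record per automated decision. A sketch (field names and the scaling scenario are invented; any real system would route this to its logging pipeline rather than stdout):&lt;/p&gt;

```python
import json
import time

def log_automated_action(component, trigger_metric, reading, threshold,
                         action, before, after):
    """Emit one structured record per automated decision, so the action
    can be reconstructed later without archaeology."""
    entry = {
        "ts": time.time(),
        "component": component,
        "trigger_metric": trigger_metric,
        "reading": reading,
        "threshold": threshold,
        "action": action,
        "before": before,
        "after": after,
    }
    print(json.dumps(entry))
    return entry

# Not just "the auto-scaler scaled out", but which reading crossed which
# threshold and what the fleet looked like on either side of the change.
log_automated_action("autoscaler", "cpu_utilization", 0.91, 0.80,
                     "scale_out", before=6, after=8)
```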

&lt;p&gt;&lt;strong&gt;Preserve the manual override path.&lt;/strong&gt; Automation should degrade gracefully toward human control, not away from it. A self-healing system that's impossible to pause or override during an incident is not a resilient system — it's a runaway process with administrative access. Manual override paths rust quickly if they're never exercised; some teams have found it useful to deliberately invoke them during game days to confirm they still work and that the team still knows how to use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modularize blast radius.&lt;/strong&gt; Centralized automation templates are efficient right up until they're catastrophically wrong. The Terraform module reused across forty services is a productivity win until the day someone introduces a misconfiguration and forty services are affected simultaneously. The efficiency-to-blast-radius tradeoff doesn't have a universal answer, but it should be a deliberate choice rather than an accidental consequence of how the automation was structured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monday Morning&lt;/strong&gt;&lt;br&gt;
If I were sitting down on Monday to think about what to actually do with this, the honest answer is that I'd start with an inventory exercise nobody wants to do.&lt;/p&gt;

&lt;p&gt;Pull up every piece of automation that touches production — every scaling policy, every CI gate, every IAM template, every retry configuration, every auto-remediation runbook. For each one, ask: does anyone on the current team understand why this exists? When was it last reviewed? What assumptions does it encode, and are those assumptions still accurate?&lt;/p&gt;

&lt;p&gt;Most teams, if they're honest, will find they can't answer those questions for a significant fraction of their automation. That fraction is risk.&lt;/p&gt;

&lt;p&gt;The goal isn't to tear out the automation and rebuild it. It's to close the gap between what the automation assumes about the system and what the system actually is. In some cases that means updating thresholds. In some cases it means adding observability that was never there. In some cases it means documenting intent that lives only in the head of someone who left eighteen months ago.&lt;/p&gt;

&lt;p&gt;The point is not to slow down. It's to stop accumulating invisible debt at machine speed.&lt;/p&gt;

&lt;p&gt;Automation preserves yesterday's decisions at tomorrow's scale. That's the core of it. The decisions were probably good when they were made. The question — the one most teams aren't asking rigorously enough — is whether they're still good now, applied to a system that has grown and changed in ways that no single decision-maker fully anticipated.&lt;/p&gt;

&lt;p&gt;Revisit what you automated. Not because automation is dangerous, but because the alternative is letting old arguments run your infrastructure indefinitely, at scale, unchallenged.&lt;/p&gt;

&lt;p&gt;That's not reliability. That's momentum.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>sre</category>
      <category>security</category>
    </item>
    <item>
      <title>Reliability Is a Socio-Technical Problem</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Tue, 03 Mar 2026 19:38:14 +0000</pubDate>
      <link>https://forem.com/iyanu_david/reliability-is-a-socio-technical-problem-2ihi</link>
      <guid>https://forem.com/iyanu_david/reliability-is-a-socio-technical-problem-2ihi</guid>
      <description>&lt;p&gt;When systems fail, we reach for the obvious instruments. Logs. Metrics. The deployment timeline. A frantic scroll through configuration diffs at two in the morning while someone on the bridge call asks for an ETA you cannot give. The forensic instinct is understandable — code is legible, traceable, blamable in ways that feel satisfying after an outage. Find the line that caused this. Roll it back. Write the postmortem. Close the ticket.&lt;/p&gt;

&lt;p&gt;But I've spent enough time doing this to know that the line is rarely the story.&lt;br&gt;
The line is where the story ended. Where it started is usually somewhere murkier — a Slack thread nobody followed up on, an ownership boundary that two teams interpreted differently, a runbook last touched fourteen months ago by an engineer who left the company in the spring. The code ran exactly as written. The system failed anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Postmortem That Doesn't Ask Enough&lt;/strong&gt;&lt;br&gt;
Here's a pattern I keep seeing in postmortems, including ones I've written myself: the contributing factors section lists three or four items — the dependency timeout, the cascading retry amplification, the misconfigured IAM permission — and then, almost as an afterthought, something vague about "communication gaps" or "unclear ownership." The technical findings get RCA tickets and follow-up tasks. The organizational findings get a sentence and a shrug.&lt;/p&gt;

&lt;p&gt;This is not cynicism. It's prioritization pressure. Engineers know how to write tickets for code. Writing a ticket for "our team boundaries are ambiguous and nobody owns the alert routing for the payment namespace" is harder to scope, slower to resolve, and invisible to sprint velocity metrics.&lt;/p&gt;

&lt;p&gt;So we fix the timeout. We leave the coordination problem intact.&lt;/p&gt;

&lt;p&gt;And six months later, a different timeout, a different team, roughly the same failure mode. The postmortem looks strangely familiar.&lt;/p&gt;

&lt;p&gt;The technical trigger rotates. The organizational substrate stays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conway's Law as a Diagnostic Tool&lt;/strong&gt;&lt;br&gt;
Most engineers know Conway's Law as a software design observation: systems tend to mirror the communication structures of the organizations that build them. It's often cited as a curiosity. It should be treated as a diagnostic instrument.&lt;/p&gt;

&lt;p&gt;When I look at a service topology and see tightly coupled dependencies between services owned by teams that rarely talk to each other, I'm not looking at a technical architecture problem in isolation. I'm looking at a map of organizational friction rendered in YAML and gRPC. The service boundary got drawn where it did because that's where the team boundary happened to be when someone was in a hurry. The tight coupling persists because the two teams have different sprint cycles, different on-call rotations, different definitions of what "breaking change" means. They're not being careless. They're operating in structures that make careful coordination expensive.&lt;/p&gt;

&lt;p&gt;Siloed teams produce systems with ownership overlaps and fragmented responsibilities not because engineers are bad at design but because the organizational geometry makes clean boundaries hard to maintain. When a team is perpetually overloaded — ten services, three engineers, an alert queue that never quite drains — documentation is the first thing that ages. Monitoring degrades next. Not because anyone decided to neglect it, but because the backlog of urgent things always outweighs the backlog of important things, and keeping runbooks current is perpetually important and rarely urgent until the moment it becomes both simultaneously.&lt;/p&gt;

&lt;p&gt;You cannot architect your way out of an organizational problem. The architecture expresses the organization. If you want a different architecture, you often need a different organization first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cognitive Load Ceiling&lt;/strong&gt;&lt;br&gt;
There's a ceiling — not a theoretical one, a practical one with hard consequences — to how much system complexity a human brain can hold in operational readiness at any given time.&lt;/p&gt;

&lt;p&gt;This sounds obvious. It matters in ways that aren't obvious.&lt;/p&gt;

&lt;p&gt;Modern production systems are deeply heterogeneous. A senior SRE on an aggressive incident might need to traverse, in a single investigation: a distributed tracing graph across eight services, a Kubernetes topology they didn't design, a Terraform module authored by a platform team whose Slack channel they're not in, an IAM permission boundary written by security six months ago and never documented in the runbook, a CI/CD pipeline with a conditional artifact promotion step that behaves differently in prod than in staging for reasons that live entirely in one engineer's head.&lt;/p&gt;

&lt;p&gt;The engineer on call is not failing if they struggle with this. They're bumping against a cognitive load ceiling that the system's designers never accounted for — in part because no individual designer was responsible for the whole thing, which is itself the problem.&lt;/p&gt;

&lt;p&gt;When cognitive load exceeds working capacity, reliability degrades. Diagnosis slows. Remediation becomes tentative. Escalations happen later than they should because the engineer on the hook doesn't yet know which team to escalate to, because service ownership is ambiguous, because the service catalog hasn't been updated since the reorganization.&lt;/p&gt;

&lt;p&gt;This is not a training problem. You cannot train humans to have larger working memories. You can design systems to be less cognitively punishing — to surface intent, to contain blast radius, to fail in legible ways that constrain the diagnostic search space. Some teams do this deliberately. Most don't. Most teams are too busy keeping the system running to redesign the system for the humans who run it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context as Invisible Infrastructure&lt;/strong&gt;&lt;br&gt;
The things that allow an incident to be resolved quickly are often not technical. They are contextual. Who owns this service? What does it do, in plain language, and what does it do for downstream systems? What's the right escalation path if the owning team is unavailable at 3am? What changed recently? What's the known failure mode when the upstream dependency degrades?&lt;/p&gt;

&lt;p&gt;This context is infrastructure. Not metaphorically — functionally. It is load-bearing. When it erodes, response times increase, investigations branch into dead ends, fixes become localized patches that address symptoms without understanding causes, and institutional knowledge migrates from documentation into the heads of specific individuals who eventually leave.&lt;/p&gt;

&lt;p&gt;I've seen teams with genuinely excellent engineers spend forty-five minutes in an incident trying to answer "who do we page?" Because the service had been migrated between teams in a reorg, the PagerDuty routing hadn't been updated, the owner listed in the service catalog was a team name that no longer existed, and the person who knew all of this was on vacation. The code was fine. The context was gone. Those forty-five minutes were not recoverable.&lt;/p&gt;

&lt;p&gt;Runbooks are not busywork. Architecture diagrams are not bureaucracy. They are a hedge against the brittleness of human memory in high-pressure situations. They're cheap to maintain and catastrophically expensive to lack when you need them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coordination Under Pressure Is a Skill, and Skills Require Practice&lt;/strong&gt;&lt;br&gt;
Incidents are not normal operating conditions. Time compresses. Decision-making must accelerate. Ambiguity that's tolerable in a planning meeting becomes operationally crippling when you're staring at a customer-facing degradation and a growing incident channel.&lt;/p&gt;

&lt;p&gt;Under that pressure, the weaknesses in coordination structures surface rapidly and uncharitably.&lt;/p&gt;

&lt;p&gt;Misaligned incentives appear: the infrastructure team wants to understand root cause before doing anything irreversible; the product team wants to restore service immediately regardless of how. Neither is wrong. They just haven't established which posture governs which situations, so the argument happens during the incident rather than before it. Authority boundaries become unclear: who can approve an emergency change? Who can pull the rollback trigger on a service owned by a team that isn't paged? Communication gaps widen: the incident commander doesn't know who to ask because the people who know the system best aren't in the bridge call because nobody knew to page them.&lt;/p&gt;

&lt;p&gt;These aren't exotic failure modes. I've watched every one of them happen in organizations with talented, motivated engineers. The problem isn't the people. The problem is that coordination under pressure requires practiced, legible protocols — and most organizations don't treat incident coordination as a learnable skill that needs rehearsal. They treat it as something that should happen naturally when the stakes are high enough.&lt;/p&gt;

&lt;p&gt;It doesn't. High stakes make everything harder, including coordination. Especially coordination.&lt;/p&gt;

&lt;p&gt;Gamedays, fire drills, tabletop exercises — these feel like organizational overhead when everything is fine. They are exactly what determines whether an organization can function when everything isn't. The teams I've seen recover well from serious incidents are rarely the ones with the most technically sophisticated systems. They're the ones that have practiced being wrong together. That have worked through who says what to whom in the first fifteen minutes. That have identified the coordination breakdowns in advance, in low-stakes simulations, so they don't discover them for the first time during an actual outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation's Hidden Bargain&lt;/strong&gt;&lt;br&gt;
Automation has a compelling value proposition: eliminate manual error. Reduce toil. Increase deployment frequency. Make repetitive human judgment unnecessary.&lt;/p&gt;

&lt;p&gt;All of this is real. And all of it comes with a less-advertised cost.&lt;/p&gt;

&lt;p&gt;Every layer of automation increases cognitive distance between engineers and runtime behavior. When a deployment pipeline runs forty steps across three environments and conditionally applies different configurations based on whether it's a canary or a full rollout, the engineer who wrote the pipeline understands it. The engineer who inherits it, or the one who's on call at eleven on a Friday when it behaves unexpectedly, understands it much less. The automation has removed manual error from the happy path and created a diagnostic labyrinth on every unhappy path.&lt;/p&gt;

&lt;p&gt;This is not an argument against automation. It's an argument for automation that is legible — that surfaces its own intent, that fails loudly and clearly, that doesn't distribute its logic across six Lambda functions and a state machine in ways that require a specific kind of architectural archaeology to understand. Engineers still design workflows, still define permissions, still interpret alerts, still approve changes, still diagnose anomalies. Automation changes the surface area of mistakes. It doesn't remove the human element. Sometimes it makes that element harder to find when things go wrong.&lt;/p&gt;

&lt;p&gt;The dangerous assumption is that a highly automated system is a self-managing one. It isn't. It's a system whose failure modes have shifted from "someone did the wrong thing" to "someone designed something whose failure mode wasn't anticipated." The second category is often harder to see coming and harder to diagnose after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership Drift: The Slow Accumulation of Nobody's Problem&lt;/strong&gt;&lt;br&gt;
As systems grow — and they always grow — ownership becomes unstable. New services appear. Teams reorganize. Responsibilities migrate without documentation. Platform capabilities expand in ways that overlap with what used to be service-team responsibilities but nobody draws the boundary explicitly. An engineer who owned a critical subsystem gets promoted and their knowledge doesn't transfer in any structured way.&lt;/p&gt;

&lt;p&gt;The result is ownership drift: a slow accumulation of services that are sort of owned by multiple teams, or sort of owned by none. Each team assumes partial responsibility. Nobody assumes full responsibility. There are no explicit handoffs because nobody recognizes a handoff is needed.&lt;/p&gt;

&lt;p&gt;These are your reliability blind spots. When something breaks in a drifted ownership zone, the incident starts with a question that shouldn't need asking: whose problem is this? While that question is being resolved — while people are checking the service catalog, hunting through old Slack messages, reasoning about who probably knows this system — the incident is progressing. Every minute spent on "who?" is a minute not spent on "what?" and "how do we fix it?"&lt;/p&gt;

&lt;p&gt;Deliberate ownership tracking is not a bureaucratic luxury. It is a load-bearing operational practice. It is also, frankly, one of the cheapest investments a team can make in reliability and one of the most consistently deferred.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to Measure That Most Dashboards Don't Show&lt;/strong&gt;&lt;br&gt;
The reliability metrics most teams track are software metrics: uptime, error rates, deployment frequency, mean time to recovery, change failure rate. These are important. They're also lagging indicators of problems that are often socio-technical in origin.&lt;/p&gt;

&lt;p&gt;There are other measurements worth taking:&lt;/p&gt;

&lt;p&gt;How many services does each team own? Not as a judgment about team size, but as a signal of cognitive load distribution. A team carrying forty services cannot maintain each of them with the operational attention required for reliable operation of any of them.&lt;/p&gt;

&lt;p&gt;How many teams are involved in an average incident? A high number suggests unclear ownership boundaries and tight cross-team coupling. This is not a performance metric for those teams. It's a structural diagnosis.&lt;/p&gt;

&lt;p&gt;What percentage of services have clear, current, confirmed owners? Not organizational attribution on paper, but actual acknowledged ownership with real on-call responsibilities. The gap between "someone is listed" and "someone knows they're responsible" is where incidents go to become major outages.&lt;/p&gt;

&lt;p&gt;When was the last permission review for each critical service? Identity and access configurations are the kind of thing that accumulate technical debt silently, through incremental additions and never-quite-completed cleanups, until an escalated privilege becomes an exploit vector or an overly broad permission becomes a blast radius amplifier.&lt;/p&gt;

&lt;p&gt;What is the alert volume per on-call engineer over a given period? Alert fatigue is not a metaphor. It is a documented degradation in signal detection that occurs when humans are exposed to high volumes of low-quality alerts. An on-call engineer who acknowledged a hundred alerts last week is a less reliable responder than an engineer who processed five meaningful ones. The reliability of the monitoring system is inseparable from the reliability of the service it monitors.&lt;/p&gt;
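&lt;p&gt;Most of these measurements are a short script away once a service catalog exists in any machine-readable form. A minimal sketch of the first two, assuming a simple catalog shape (the field names here are hypothetical):&lt;/p&gt;

```python
# Computing structural signals from a service catalog. The catalog shape
# and field names here are assumptions, not a standard.
from collections import Counter

catalog = [
    {"service": "billing",  "owner": "payments", "owner_confirmed": True},
    {"service": "invoices", "owner": "payments", "owner_confirmed": False},
    {"service": "gateway",  "owner": None,       "owner_confirmed": False},
]

# Cognitive load distribution: how many services each team carries.
services_per_team = Counter(s["owner"] for s in catalog if s["owner"])

# Ownership signal: share of services with confirmed, acknowledged owners.
confirmed = sum(1 for s in catalog if s["owner_confirmed"])
confirmed_pct = 100 * confirmed / len(catalog)

print(dict(services_per_team))  # {'payments': 2}
print(f"{confirmed_pct:.0f}% of services have confirmed owners")
```

&lt;p&gt;The uncomfortable part is rarely the script. It is populating the &lt;code&gt;owner_confirmed&lt;/code&gt; field honestly.&lt;/p&gt;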

&lt;p&gt;&lt;strong&gt;Designing for Human Limits&lt;/strong&gt;&lt;br&gt;
This is the piece that most systems skip, not from negligence but from a kind of architectural optimism: the idea that if the system is well-designed technically, humans will figure out how to operate it. They usually do, eventually, and at some cost.&lt;/p&gt;

&lt;p&gt;Designing for human limits means treating cognitive overhead as a first-class engineering constraint. It means asking, when adding a service: does this have clear ownership, clear failure semantics, a blast radius that a team can contain? It means designing failure modes to be legible — to contain enough diagnostic signal at the moment of failure that a reasonably informed engineer can orient quickly, rather than staring at a wall of correlated alerts with no clear starting point.&lt;/p&gt;

&lt;p&gt;It means maintaining ownership as a living artifact, not a field in a YAML file that gets set once and forgotten. It means keeping runbooks close enough to the service that they stay current — reviewed when the service changes, not when someone notices they're stale. It means, periodically, simulating the loss of context: what would a new engineer need to diagnose and recover this service? If the answer involves asking someone specific, that's a fragility worth documenting.&lt;/p&gt;

&lt;p&gt;The goal is not to eliminate complexity. Production systems are complex because the problems they solve are complex. The goal is to ensure the complexity is manageable — not just by the engineers who built it, but by the engineers who inherit it, who are on call for it at 2am, who have to reason about it under pressure with imperfect information.&lt;/p&gt;

&lt;p&gt;That is a different design target than most teams aim for. It's harder to specify, harder to test, harder to demonstrate in a sprint demo. But it's the one that matters most when something goes wrong. Which it will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monday Morning&lt;/strong&gt;&lt;br&gt;
So what does this actually change about how you build on Monday?&lt;/p&gt;

&lt;p&gt;Start by making invisible things visible. Which services in your portfolio have ownership that's genuinely confirmed, not just nominally attributed? Which runbooks have been touched in the last six months? What's your alert volume per on-call engineer? These are not rhetorical questions. They have answers. Generating them is a morning's work, and the findings are usually uncomfortable and immediately actionable.&lt;/p&gt;

&lt;p&gt;Then pick the highest-consequence ambiguity and resolve it explicitly. Not in a meeting to plan a future meeting. Decide: this service is owned by this team, this is the escalation path when that team is unavailable, this is what changes require cross-team coordination and this is the protocol for making that fast. Write it down. Put it somewhere engineers will actually find it at 2am, which is not Confluence.&lt;/p&gt;

&lt;p&gt;And consider — seriously, not performatively — running an incident simulation. Not a polished gameday with a pre-scripted scenario. A genuinely disruptive exercise: pull the on-call engineer into a fake severity-two, give them a service they know less well than they think they do, and watch where the coordination breaks down. The gaps that emerge are the gaps that will surface in real incidents. Finding them first, in a low-stakes context, is considerably better than finding them in the other kind.&lt;/p&gt;

&lt;p&gt;None of this is glamorous. It doesn't show up in your deployment frequency metrics or your architecture diagrams. It doesn't make for a compelling tech blog post or a satisfying sprint retrospective.&lt;/p&gt;

&lt;p&gt;But reliability is not a software property. It is an organizational outcome. The code is part of it. The humans are part of it. The structures that shape how humans coordinate, communicate, and carry context — those are part of it too, and they are at least as determinative as the code when things go wrong.&lt;/p&gt;

&lt;p&gt;The system will eventually remind you of this. The question is whether you choose to act before it does.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>architecture</category>
      <category>sre</category>
      <category>security</category>
    </item>
    <item>
      <title>The Architecture Drift Nobody Measures</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Tue, 03 Mar 2026 19:02:26 +0000</pubDate>
      <link>https://forem.com/iyanu_david/the-architecture-drift-nobody-measures-1g22</link>
      <guid>https://forem.com/iyanu_david/the-architecture-drift-nobody-measures-1g22</guid>
      <description>&lt;p&gt;Systems rarely collapse suddenly. I know that sounds obvious — every engineer has read the postmortem that opens with "a cascade of small failures" — but the knowledge doesn't seem to change how we build or how we watch. We still instrument for the acute. We still treat the chronic as background noise.&lt;/p&gt;

&lt;p&gt;Drift is chronic. And chronic things are genuinely hard to see, not because we lack the tools but because they change too slowly to register against the baseline of what we already expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Word Engineers Already Own — And Why It's the Wrong One&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Say "drift" in a room full of platform engineers and you'll get a specific, conditioned response: configuration drift. A node that someone SSH'd into at 2 AM and patched by hand. A Helm value overridden in production because the feature flag wasn't ready. A Terraform state that diverged from what's actually running in the account. That kind of drift is real, it causes real incidents, and — critically — it's detectable. You can diff against a desired state. You can run a reconciliation loop. You can build a dashboard and put a red cell in it when something diverges.&lt;/p&gt;

&lt;p&gt;Architectural drift doesn't give you that. There is no desired-state document for trust models. There is no reconciliation loop for ownership boundaries. Nobody is running &lt;code&gt;architecture --plan&lt;/code&gt; and watching for a delta.&lt;/p&gt;

&lt;p&gt;What I mean by architectural drift is subtler and nastier: it's what happens when the structure of a system — its service topology, its permission surfaces, its implicit contracts between teams, its assumptions about what can reach what and who is responsible for what — evolves away from the intent that shaped it, without anyone making a deliberate decision to change that intent.&lt;/p&gt;

&lt;p&gt;The permissions weren't widened in a single bad commit. They were widened in fourteen reasonable commits over eight months, each one unblocking something real, each one reviewed and approved, none of them understood as part of a cumulative pattern.&lt;/p&gt;

&lt;p&gt;That's drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture as Fossilized Reasoning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every architectural decision is a theory. Not a fact — a theory. A claim about the world that seemed well-supported at the time: the expected traffic profile, the team size and structure, the threat model, what failure modes were considered credible, how much operational maturity you were willing to assume. Those theories get encoded.&lt;/p&gt;

&lt;p&gt;They get encoded in IAM policies that grant a CI service account write access to the specific S3 buckets that existed at the time of writing, plus a &lt;code&gt;*&lt;/code&gt; wildcard added six months later for expediency. They get encoded in network segmentation rules designed around a monolith that has since been decomposed into eleven services. They get encoded in deployment pipelines whose scope of authority was never explicitly defined because nobody imagined the pipeline would eventually be able to trigger changes in four environments, two cloud accounts, and a Kubernetes cluster that didn't exist when the pipeline was written.&lt;/p&gt;

&lt;p&gt;The decisions were reasonable. Locally rational. The problem is that local rationality doesn't compound into global coherence. Reality keeps moving and the architecture doesn't move with it — not in any deliberate way. The gaps widen quietly. You don't get a notification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Success Is the Accelerant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the part that took me an embarrassingly long time to internalize: successful systems drift faster than struggling ones. Not slower. Faster.&lt;/p&gt;

&lt;p&gt;When a system is visibly struggling — reliability is poor, latency is bad, deployments are flaky — there's organizational permission to stop and fix things. Engineers can make the case for redesign. The pain is a forcing function.&lt;/p&gt;

&lt;p&gt;When a system is working, the pressure calculus inverts. Features need shipping. The platform is fine. Why would you spend two weeks re-examining service ownership when nothing is broken? That instinct is not stupid — it's a reasonable allocation of attention under real constraints. But it has a structural consequence: the things that would compound into future fragility don't get addressed, because there's no present signal that they matter.&lt;/p&gt;

&lt;p&gt;Temporary permissions become permanent through sheer persistence — nobody revokes them because nothing has gone wrong yet. Workarounds that were introduced as three-week bridges are still running three years later, load-bearing and undocumented. Ownership of a service migrates informally when a team reorganizes, but the old team's name is still in the runbook and the PagerDuty rotation. A shared library accumulates four internal consumers, then six, then eleven, and at some point it becomes a platform primitive with none of the stability guarantees a platform primitive should carry.&lt;/p&gt;

&lt;p&gt;Each of these is an individually manageable problem. Collectively, they constitute a system that is very hard to reason about and very easy to miscalculate during an incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Nonlinearity Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's a combinatorics issue hiding underneath all of this that I think doesn't get talked about enough in concrete terms.&lt;/p&gt;

&lt;p&gt;When you add a new service, you don't add one thing. You add one service plus up to N new interactions, where N is the number of existing services it might talk to, plus the infrastructure components it touches, the IAM policies it requires, the CI pipeline permissions it needs, and the failure modes it introduces into every service that depends on it. The number of possible system states doesn't grow linearly with the number of components. It grows exponentially — or faster, depending on how tightly coupled the components are.&lt;/p&gt;
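&lt;p&gt;Even the weakest version of this claim, counting only potential pairwise interactions and ignoring state entirely, grows quadratically:&lt;/p&gt;

```python
# Even counting only potential pairwise interactions, growth is quadratic.
def potential_pairs(n):
    """Possible service-to-service interaction pairs among n services."""
    return n * (n - 1) // 2

for n in (5, 10, 20, 40):
    print(n, "services:", potential_pairs(n), "possible pairs")
# 5 services imply 10 pairs; 40 imply 780. The full state space, where each
# component can independently be healthy or degraded, grows faster still.
```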

&lt;p&gt;Most observability tooling is designed to measure behavior: latency distributions, error rates, saturation, throughput. The RED method, the USE method — good frameworks, genuinely useful, not the point. What they capture is how a system performs given its current structure. They don't capture anything about the structure itself. High fan-in on a single service? Not a latency metric. Overly broad blast radius from a CI pipeline? Not an error rate. Seventeen implicit callers depending on an undocumented internal API contract? Not visible in a Grafana dashboard.&lt;/p&gt;

&lt;p&gt;We've invested heavily in observability for runtime behavior and almost nothing in observability for structural health. That's a meaningful gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the Org Chart Rewires and the Architecture Doesn't&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Conway's Law is usually taught as a prescription — design your system the way you want your organization to look. What gets less attention is its corollary: when your organization changes, your system silently breaks that correspondence. And organizations change constantly.&lt;/p&gt;

&lt;p&gt;A team of eight splits into two teams of four. The service that team owned is now jointly owned, which in practice means ambiguously owned. An on-call rotation that used to involve five people who all held full context now involves ten people, half of whom have only ever seen the service through the lens of recent incidents, not its original design. A new team inherits a microservice because the person who wrote it moved to a different organization, and that team's understanding of the service's upstream dependencies is incomplete in ways nobody realizes until 3 AM when a dependency behaves unexpectedly.&lt;/p&gt;

&lt;p&gt;None of this shows up in the code. The code doesn't reorganize when the org does. The implicit knowledge — why a particular retry strategy was chosen, what constraint drove a specific timeout value, which downstream service has a known quirk that the original author knew to compensate for — that knowledge disperses. Sometimes it survives in documentation. Often it doesn't.&lt;/p&gt;

&lt;p&gt;Organizational drift is architectural drift. The system's resilience degrades not because the code changed but because the humans who understood the code and its operational context no longer hold that understanding collectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation: The Visibility Tax&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automation is genuinely good. I want to be careful here not to slide into a lazy critique of something that has made software reliability substantially better across the industry. Automated deployments are more reliable than manual ones. Infrastructure-as-code produces more consistent environments than hand-provisioned servers. Reconciliation loops catch configuration drift faster than human audits.&lt;/p&gt;

&lt;p&gt;But automation has a visibility tax that I don't think we account for carefully enough.&lt;/p&gt;

&lt;p&gt;When a deployment workflow becomes complex — many stages, many conditional branches, cross-account promotion, post-deploy validation hooks — and you wrap that complexity in a pipeline that produces a green checkmark, you've created something that works reliably until its underlying assumptions are violated in a way the pipeline can't detect. The complexity didn't go away. It got abstracted behind an interface that only shows you outcomes.&lt;/p&gt;

&lt;p&gt;Overly broad IAM permissions still work. Pipelines succeed. The blast radius isn't visible until something uses it. Fragile dependencies still pass the happy path. Retries compensate for flakiness that should be investigated. You're looking at surface reliability — the rate of green checkmarks — while structural fragility accumulates underneath.&lt;/p&gt;

&lt;p&gt;This is not a reason to avoid automation. It is a reason to build automation that exposes its own structural surface area — pipelines that report the scope of environments they can affect, permission policies that log the breadth of their authority, dependency graphs that are computed and versioned alongside the code they describe.&lt;/p&gt;
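&lt;p&gt;The dependency-graph idea, for instance, can be as modest as a CI check that renders the declared graph in a stable form and diffs it against a committed snapshot, so a structural change shows up in code review like any other change. The config shape below is an assumption, not a real tool:&lt;/p&gt;

```python
# Render the declared dependency graph in a stable form and diff it against
# a committed snapshot. The config shape is an assumption, not a real tool.
import json

declared = {
    "checkout": ["inventory", "payments"],
    "payments": ["ledger"],
}

def graph_snapshot(deps):
    """Stable, diff-friendly rendering of the dependency graph."""
    return json.dumps(deps, sort_keys=True, indent=2)

committed = graph_snapshot({"checkout": ["inventory", "payments"]})
current = graph_snapshot(declared)

if current != committed:
    # In CI this would fail the build until the snapshot is updated,
    # turning silent structural drift into an explicit, reviewable diff.
    print("dependency graph changed; update the committed snapshot")
```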

&lt;p&gt;Most pipelines don't do this. It's not hard to add. It's just not where the attention has gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift Doesn't Alert&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core danger is mundane: drift doesn't page anyone.&lt;/p&gt;

&lt;p&gt;There is no threshold for "your service boundaries have diverged meaningfully from team ownership structure." Nobody gets a Slack message that says "your trust model was designed for a 40-engineer organization and you now have 200 engineers." No SLO fires because a CI pipeline's authority has expanded beyond what was originally scoped.&lt;/p&gt;

&lt;p&gt;The system looks stable. Dashboards are green. Deployments are succeeding. On-call load is manageable. From the outside, and even from the inside, there's no signal that the structural assumptions are aging faster than the system is being maintained.&lt;/p&gt;

&lt;p&gt;And then something breaks. Usually something small — a bad deploy, a dependency timeout, a retry storm, a permissions misconfiguration that was always possible but hadn't been triggered. The triggering event is mundane. The consequences are not, because drift has changed the geometry of the system in ways that weren't modeled and weren't understood.&lt;/p&gt;

&lt;p&gt;The incident investigation finds it: ownership was unclear so mitigation was slow. A coupling that nobody had mapped amplified the failure across services that had no business being affected. A pipeline that was never scoped to the blast radius it had quietly accumulated propagated an incorrect state into three environments before anyone caught it.&lt;/p&gt;

&lt;p&gt;The trigger was small. Drift made it catastrophic. And nobody knew the fragility was there because nothing had measured it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to Actually Do on Monday&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm suspicious of framework proposals that require organizational consensus to implement, because organizational consensus takes months and architectural debt compounds daily. What follows are things a careful engineer can start doing with existing authority, this week, without waiting for a working group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Map your blast radius.&lt;/strong&gt; Pick one CI pipeline — ideally the one that feels the most like infrastructure rather than application deployment — and write down, concretely, every environment it can modify, every cloud account it has credentials for, every approval gate it bypasses in practice versus in theory. The act of writing it down is diagnostic. If you can't complete the list in thirty minutes because you're not sure what the pipeline can reach, that's the finding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit permissions by age, not by scope.&lt;/strong&gt; Most permission review processes ask "is this permission appropriate?" which is hard to evaluate in the abstract. A more tractable question is "when was this permission last reviewed, and what was the context at that time?" Permissions that are six months old and were granted in a different organizational context are candidates for revalidation regardless of whether they've caused a problem. They are structural liabilities with no expiration date.&lt;/p&gt;
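&lt;p&gt;This is mechanical enough to script. A sketch, assuming grant records carry a last-reviewed date; the 180-day threshold is an arbitrary starting point, not a recommendation:&lt;/p&gt;

```python
# Flag grants for revalidation by review age. The record shape and the
# 180-day threshold are illustrative assumptions.
from datetime import date

grants = [
    {"principal": "ci-deployer", "action": "s3:PutObject",
     "reviewed": date(2025, 7, 1)},
    {"principal": "ci-deployer", "action": "s3:*",
     "reviewed": date(2023, 6, 2)},
]

def stale(grant, today, max_age_days=180):
    """A grant becomes a revalidation candidate once its review is too old."""
    age = (today - grant["reviewed"]).days
    return age > max_age_days

today = date(2025, 9, 1)
for g in grants:
    if stale(g, today):
        print(f"revalidate: {g['principal']} {g['action']}")
```

&lt;p&gt;Note that the question the script asks is not "is this appropriate?" but "how long has nobody looked?" The first requires judgment; the second requires only a date.&lt;/p&gt;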

&lt;p&gt;&lt;strong&gt;Find the undocumented load-bearing components.&lt;/strong&gt; Every system has them — services or libraries that have accreted callers without ever being designed for broad consumption. A rough proxy: look for internal components with no SLO, no on-call owner, and more than three consumers. That combination suggests something that is being treated as infrastructure but without infrastructure-level rigor. It will matter during an incident.&lt;/p&gt;
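&lt;p&gt;The proxy translates directly into a query. A sketch against an assumed component inventory:&lt;/p&gt;

```python
# The rough proxy: no SLO, no on-call owner, more than three consumers.
# The inventory shape is an assumed example.
components = [
    {"name": "auth-lib",      "slo": None, "oncall": None,       "consumers": 11},
    {"name": "billing",       "slo": 99.9, "oncall": "payments", "consumers": 6},
    {"name": "feature-flags", "slo": None, "oncall": None,       "consumers": 2},
]

def undocumented_load_bearing(inventory):
    return [
        c["name"]
        for c in inventory
        if c["slo"] is None and c["oncall"] is None and c["consumers"] > 3
    ]

print(undocumented_load_bearing(components))  # ['auth-lib']
```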

&lt;p&gt;&lt;strong&gt;Reread your last three postmortems for structural signals.&lt;/strong&gt; Not for the contributing factors that were already documented, but for the things the investigation surfaced and then didn't pursue: the ownership ambiguity that got noted but not resolved, the coupling that was surprising but got attributed to "we need better monitoring" rather than "we need to reduce this coupling," the permission that allowed unexpected reach and was remediated but never questioned as a class of problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tie architecture review to scale events, not just feature events.&lt;/strong&gt; Code review is continuous. Architecture review is usually either nonexistent or triggered by major new projects. The more useful trigger is scale: when your user count doubles, when your engineering org grows by 50%, when your deployment frequency crosses a threshold. Those are the moments when the gap between current architecture and current reality is most likely to be meaningful.&lt;/p&gt;

&lt;p&gt;None of this requires a new platform. None of it requires executive sponsorship. It requires treating structural health as a real engineering concern with the same seriousness as runtime reliability — which means measuring it, reviewing it on a cadence, and accepting that visibility into it is a form of risk management, not overhead.&lt;/p&gt;

&lt;p&gt;The most dangerous systems I've encountered weren't the ones with poor uptime metrics. They were the ones where everyone felt confident, the dashboards were clean, and nobody had looked carefully at the structure in a year and a half. By the time the confidence was revealed to be misplaced, the conditions for failure had been present for a long time.&lt;/p&gt;

&lt;p&gt;That's the thing about drift. It's already happened before you notice it.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>sre</category>
      <category>security</category>
    </item>
    <item>
      <title>Assumptions Do</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Wed, 25 Feb 2026 09:45:47 +0000</pubDate>
      <link>https://forem.com/iyanu_david/assumptions-do-26hk</link>
      <guid>https://forem.com/iyanu_david/assumptions-do-26hk</guid>
      <description>&lt;p&gt;&lt;code&gt;This article is part of the “When Systems Age” series exploring how assumptions, automation, and architectural drift shape modern system failures.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;There's a particular kind of silence before a large system fails. Not the silence of nothing happening — the opposite, actually. Everything is humming. Metrics are green. The on-call engineer is halfway through a cup of coffee. And somewhere in the stack, a belief that has been quietly wrong for eleven months is about to introduce itself.&lt;/p&gt;

&lt;p&gt;I've been in that silence enough times to stop blaming the proximate cause.&lt;/p&gt;

&lt;p&gt;When the postmortem gets written, it will name a culprit. An expired certificate. A deployment that bumped a queue depth past an undocumented limit. A retry storm that someone designed on purpose, for a topology that no longer exists. The culprit is real. The finding is technically accurate. And it misses the point almost entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Actually Ages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Software doesn't rot the way people mean when they say it rots. The bits don't degrade. What rots is the correspondence — the relationship between the system as built and the world it was built to operate inside.&lt;/p&gt;

&lt;p&gt;Every non-trivial architecture is, at its foundation, a collection of propositions. Not written ones, usually. The propositions live in configuration files and runbook assumptions and the spatial memory of engineers who've since moved to other companies. They are things like: this service will never be called more than a thousand times per second, so we sized the thread pool at thirty. Or: these two systems are always deployed together, so we didn't bother with a version handshake. Or, more dangerously: we trust everything inside this subnet because in 2019 only internal services could reach it.&lt;/p&gt;

&lt;p&gt;Propositions like those are not bugs. They were correct. The thread pool was fine for years. The coupling was invisible until deployment pipelines diverged. The trust boundary made perfect sense when the perimeter was coherent.&lt;/p&gt;

&lt;p&gt;What changes is reality. The propositions don't update themselves.&lt;/p&gt;

&lt;p&gt;This is what I mean by assumption drift, and it differs from technical debt in ways that matter. Technical debt you can see, at least in principle — deprecated dependencies, test coverage gaps, a module nobody's touched since the original author left. Assumption drift is submerged. The code looks fine. The architecture diagram (if one exists at all) looks consistent. The system continues to function. But the gap between what the system believes about its environment and what is actually true has been widening for months, and nobody has been measuring it, because the metrics you'd need to measure it are not the metrics anyone thought to instrument.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Seduction of Stability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the uncomfortable part: the systems most likely to harbor deep assumption drift are the ones that have been working reliably for a long time.&lt;/p&gt;

&lt;p&gt;Stability is, perversely, a form of concealment. When a system processes requests correctly for eighteen months, you stop looking at it carefully. You stop interrogating its dependencies. You stop asking whether the access patterns from five years ago still describe what the system is actually doing. You let the workarounds calcify into infrastructure. You let the temporary IAM permissions become permanent because nobody can remember what breaks if you revoke them, and the cost of finding out feels higher than the cost of carrying them.&lt;/p&gt;

&lt;p&gt;I've watched teams operate a service in production for two years without anyone being able to precisely answer the question: what does this thing actually depend on? Not what the documentation says. What actually happens when you trace the calls. The documentation was written during the initial deployment, before two major refactors and a cloud migration that moved three downstream dependencies to a different availability zone. Nobody updated the docs because the service kept working.&lt;/p&gt;

&lt;p&gt;This is the operational equivalent of flying on instruments in clear weather. Everything looks fine from the cockpit. You stop cross-checking.&lt;/p&gt;

&lt;p&gt;And then the weather changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation's Hidden Clause&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automation is where assumption drift becomes genuinely dangerous, because automation scales whatever belief is encoded inside it — correct or not.&lt;/p&gt;

&lt;p&gt;Think about what a deployment pipeline is, mechanically. It's a set of assertions about what a valid deployment looks like, operationalized as code. It checks certain things, skips others, applies changes to systems in a specific order predicated on a specific understanding of how those systems relate to each other. When that pipeline was written, the understanding was accurate. The checks were sufficient. The order was correct.&lt;/p&gt;

&lt;p&gt;Now imagine the architecture has evolved — new services added, dependencies reshuffled, a critical stateful component moved from one team's ownership to another's without the pipeline being updated to reflect the new blast radius. The pipeline doesn't know any of this. It executes. It marks itself successful. It has applied an outdated model of the system to a system that no longer matches that model, at machine speed, with no hesitation.&lt;/p&gt;

&lt;p&gt;A human operator making the same series of decisions would at least pause at an unexpected output. They might notice that something seems off. They might ask a question. Automation doesn't ask questions. It doesn't feel the slight wrongness that precedes a bad outcome. It just scales.&lt;/p&gt;

&lt;p&gt;This is not an argument against automation — I'd sooner argue against gravity. It's an argument for treating automation as a bet: you're betting that the assumptions baked into the automation remain valid. And like any bet, you need to know what you've wagered and how often to reassess whether the odds have shifted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Postmortems Miss It&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The postmortem process, as practiced at most organizations, has a structural bias toward symptoms.&lt;/p&gt;

&lt;p&gt;It's not malicious. It's practical. When a system has just failed, the people in the room have real obligations — restore service, communicate with stakeholders, identify the immediate change that unlocked the failure. Those are legitimate priorities. And they produce a legitimate artifact: a clear account of what happened, with action items that address the thing that happened.&lt;/p&gt;

&lt;p&gt;The problem is that the thing that happened is usually not the thing that caused the failure. The thing that happened was a triggering event — it found a vulnerability. But the vulnerability was there before the trigger arrived, and it will still be there, in some form, after the trigger is patched.&lt;/p&gt;

&lt;p&gt;I've been in postmortems where we correctly identified a misconfigured circuit breaker as the root cause, wrote an action item to fix the configuration, fixed it, closed the ticket, and then had a substantively identical failure eight months later from a different triggering event finding the same underlying belief — that a particular downstream service would recover within a thirty-second window — which was no longer true after that service was migrated to a platform with slower cold-start behavior.&lt;/p&gt;

&lt;p&gt;We fixed the circuit breaker twice. We never questioned the thirty-second assumption. Why would we? The assumption wasn't in the postmortem. It wasn't a named thing. It was just how the system thought about the world.&lt;/p&gt;
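&lt;p&gt;One small discipline that would have helped: attach the belief to the configuration that depends on it, so revalidation has something to point at. A hypothetical sketch; the field names and numbers are mine, not from any real circuit-breaker library:&lt;/p&gt;

```python
# Sketch: carry the world-model behind a circuit breaker's timeout alongside
# the timeout itself, so re-checking the number is a named step, not folklore.
# Values and fields are illustrative.
import operator
from dataclasses import dataclass

@dataclass
class BreakerConfig:
    open_duration_s: int   # how long we wait before retrying the downstream
    assumes: str           # the belief this number depends on

cfg = BreakerConfig(
    open_duration_s=30,
    assumes="downstream recovers within 30s (pre-migration cold starts)",
)

def config_still_valid(cfg, measured_recovery_s):
    """Re-check the assumption against what the downstream actually does now."""
    return operator.ge(cfg.open_duration_s, measured_recovery_s)

# After the downstream moved to a platform with slower cold starts:
print(config_still_valid(cfg, measured_recovery_s=75))   # False
```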

&lt;p&gt;&lt;strong&gt;What Breaks, Concretely&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me be specific about failure modes, because generality is the enemy of action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust boundaries&lt;/strong&gt;. In 2019, many organizations built service meshes or network policies around the assumption that the internal network was trustworthy, that mutual TLS was overhead that slowed things down and wasn't worth it, that the implicit perimeter was sufficient. By 2022, those same organizations had contractors, acquired companies, multi-cloud deployments, and third-party SaaS tools with internal network access through various integration mechanisms. The perimeter was meaningless. The trust boundaries hadn't moved. The systems designed to trust each other still trusted each other — including the ones that shouldn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership maps.&lt;/strong&gt; A service is owned by a team. The team reorganizes. Some members go to a new team, some go to a different area. The service gets assigned to whoever seems closest. The runbook has the previous team's Slack channel. The alert routing goes to a queue nobody checks on weekends. This is not hypothetical — I have personally been the on-call person for a service I had never touched, because the alert routing hadn't been updated through two reorgs and the monitoring system didn't distinguish between "team owns this alert" and "someone at this company owns this alert."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capacity assumptions.&lt;/strong&gt; A cache was sized for a certain volume. That volume was sensible in 2021, when the feature was new and the user base was limited. The feature grew. The cache was never resized because it was working — hit rates looked fine in aggregate. They looked fine in aggregate because the cache was still hitting on the popular items; the long tail was going to the database, which was quietly absorbing the difference, which nobody noticed until a new user cohort with different access patterns inverted the hit rate and the database began responding with five-second latencies instead of fifteen-millisecond ones.&lt;/p&gt;
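&lt;p&gt;The masking effect is worth seeing as arithmetic. With made-up numbers, a collapsing cohort barely dents the aggregate:&lt;/p&gt;

```python
# Sketch: an aggregate cache hit rate can stay green while one cohort falls
# off a cliff. All numbers are invented to show the masking effect.

cohorts = {
    # cohort: (cache hits, total requests)
    "established_users": (95_000, 100_000),   # 95% hit rate
    "new_cohort":        (500, 5_000),        # 10%: long-tail keys miss
}

hits = sum(h for h, _ in cohorts.values())
total = sum(t for _, t in cohorts.values())
print(f"aggregate: {hits / total:.1%}")       # about 91%, still looks fine
for name, (h, t) in cohorts.items():
    print(f"{name}: {h / t:.1%}")
```

&lt;p&gt;Nine in ten requests still hit, so the dashboard stays green while the database quietly absorbs 4,500 misses from the new cohort.&lt;/p&gt;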

&lt;p&gt;&lt;strong&gt;Interface contracts.&lt;/strong&gt; Service A expects a response field from Service B. Service B deprecated that field eighteen months ago, but kept emitting it for backward compatibility. A new version of Service B, deployed quietly during a maintenance window, didn't emit it anymore — the compatibility shim had been removed as part of a codebase cleanup. Service A started returning malformed data to its own callers. Nobody caught it immediately because the field was optional in A's schema, which meant it parsed fine; it just wasn't being used to populate a field that turned out to matter a great deal.&lt;/p&gt;
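&lt;p&gt;A consumer-side contract check can catch this class of break: declare which fields your code actually relies on, regardless of what the schema calls optional. A sketch with invented field names:&lt;/p&gt;

```python
# Sketch: distinguish "optional in the schema" from "optional to the business
# logic". The field names below are invented for illustration.

RELIED_UPON = {"account_id", "legacy_region"}   # fields our code actually uses

def contract_violations(payload):
    """Fields we depend on that the upstream stopped sending."""
    return sorted(RELIED_UPON - payload.keys())

old_response = {"account_id": "a1", "legacy_region": "eu-1", "name": "x"}
new_response = {"account_id": "a1", "name": "x"}   # compat shim removed upstream

print(contract_violations(old_response))   # []
print(contract_violations(new_response))   # ['legacy_region']
```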

&lt;p&gt;Each of these is a story of a belief that was valid, stopped being valid, and wasn't retired.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Measurement Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reason assumption drift accumulates is that we measure the wrong things.&lt;/p&gt;

&lt;p&gt;Performance dashboards measure latency, throughput, error rates. These are current-state metrics — they tell you whether the system is doing its job right now. They don't tell you whether the system's model of itself matches the system's actual behavior. They don't tell you whether the trust boundaries are still coherent. They don't tell you whether the ownership map reflects who would actually respond to an incident.&lt;/p&gt;

&lt;p&gt;Some organizations do architecture reviews, but they're usually triggered by new projects rather than elapsed time. The review is for the thing being built, not for the thing that's been running for two years while the environment around it shifted.&lt;/p&gt;

&lt;p&gt;The gap is fundamentally an information problem. You cannot revalidate assumptions you're not aware of holding. And most teams, if you asked them to enumerate the operational assumptions embedded in their largest production services, would produce a list that's both incomplete and already partially wrong before they finish writing it down.&lt;/p&gt;

&lt;p&gt;I've started, in my own work, keeping what I'd call an assumption register — not a formal document, just a running file of decisions made about how a system relates to its environment, with a timestamp and a rough expiration condition: this is valid as long as traffic stays below X, as long as we're on a single cloud provider, as long as the deployment cycle for Service Y is less than five minutes. When the conditions are revisited — quarterly, or when a significant architectural change happens — we go back to the register and ask which of these are still true.&lt;/p&gt;
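&lt;p&gt;The register doesn't need tooling; a data structure and a loop are enough. A sketch of the idea, with invented entries and expiry conditions:&lt;/p&gt;

```python
# Sketch: an assumption register as plain data, with a periodic revalidation
# pass. Entries, fields, and the sample environment are all illustrative.
import operator
from datetime import date

REGISTER = [
    {"belief": "peak traffic stays below 500 rps",
     "recorded": date(2024, 3, 1),
     "still_true": lambda env: operator.lt(env["rps_peak"], 500)},
    {"belief": "single cloud provider",
     "recorded": date(2023, 11, 7),
     "still_true": lambda env: env["clouds"] == 1},
    {"belief": "Service Y deploys in under five minutes",
     "recorded": date(2024, 6, 20),
     "still_true": lambda env: operator.lt(env["svc_y_deploy_min"], 5)},
]

def quarterly_review(env):
    """Return the beliefs that have quietly expired."""
    return [e["belief"] for e in REGISTER if not e["still_true"](env)]

env = {"rps_peak": 800, "clouds": 2, "svc_y_deploy_min": 3}
for belief in quarterly_review(env):
    print("expired:", belief)
```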

&lt;p&gt;It doesn't catch everything. Nothing does. But it shortens the mean time between assumption expiration and assumption discovery, and that interval is where the fragility lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monday Morning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So what actually changes on Monday? Not the architecture — that takes months. Not the tooling — that takes longer.&lt;/p&gt;

&lt;p&gt;What changes is a habit of questioning.&lt;/p&gt;

&lt;p&gt;Before you add the next alert, ask what assumption the alert is protecting, and whether that assumption is still valid. Before you extend an existing automation to handle a new case, ask whether the model of the system that the automation encodes is the model the system actually exhibits now. Before you close a postmortem, spend fifteen minutes on a question the postmortem doesn't ask: what had to be true for this triggering event to find anything to trigger?&lt;/p&gt;

&lt;p&gt;That last question is harder and slower than writing action items. It resists being put in a ticket. But it's the question that finds the belief underneath the incident, and without it you're fixing the lock while leaving the window open.&lt;/p&gt;

&lt;p&gt;A small, concrete discipline: when a service changes hands — a team reorg, an acquisition, a project sunset — do a one-hour review of what that service assumes about its environment before closing the handoff. Not a full architecture review. Just: what does this system trust? Who does it depend on? What operational assumptions are baked into its configuration that the new owners might not know to question?&lt;/p&gt;

&lt;p&gt;Most teams skip this because they're busy and the service is working. That's exactly the condition that makes the review worth doing.&lt;/p&gt;

&lt;p&gt;The instinct to move fast is reasonable. The instinct to declare a system stable and stop looking at it is where the gap opens. Stability is not a terminal state. It's a temporary alignment between a system and a world that keeps changing.&lt;/p&gt;

&lt;p&gt;Your job — the unglamorous part of it — is to keep measuring that alignment, long after anyone stops asking you to.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>security</category>
      <category>sre</category>
    </item>
    <item>
      <title>The Control Plane Is Your Real Production System</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Sat, 14 Feb 2026 18:24:03 +0000</pubDate>
      <link>https://forem.com/iyanu_david/the-control-plane-is-your-real-production-system-3ic1</link>
      <guid>https://forem.com/iyanu_david/the-control-plane-is-your-real-production-system-3ic1</guid>
      <description>&lt;p&gt;For fifteen years I thought production meant the place where code executes. The machines serving HTTP requests at three in the morning. The databases holding customer records. The message queues processing payments while engineers sleep.&lt;/p&gt;

&lt;p&gt;I was tracking the wrong system.&lt;/p&gt;

&lt;p&gt;Modern infrastructure has undergone a phase transition—not in what it does, but in what decides what it does. Runtime environments haven't become less important. They've become subordinate. Downstream. Governed.&lt;/p&gt;

&lt;p&gt;The actual production system, the one that determines whether your application lives or dies, whether your weekend stays quiet or explodes, is the control plane. The layer that continuously rewrites what runtime becomes.&lt;/p&gt;

&lt;p&gt;Most of us still haven't adjusted our mental models to match this reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Substrate Beneath the Substrate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every meaningful change in a contemporary system flows through machinery that predates execution:&lt;/p&gt;

&lt;p&gt;CI/CD pipelines compile source into deployable artifacts. Infrastructure-as-Code templates declare compute, storage, networking topologies. Git repositories encode the desired shape of reality. Deployment orchestrators promote releases through environments—dev to staging to production, gated by approval workflows and automated verification. Reconciliation loops detect divergence and correct it.&lt;/p&gt;

&lt;p&gt;None of this is remarkable anymore. It's ambient. Infrastructure that thinks for itself.&lt;/p&gt;

&lt;p&gt;But here's what that ubiquity obscures: runtime environments no longer evolve through human decision-making applied directly to machines. They evolve because automation interprets declarations and enforces them. Continuously. Relentlessly.&lt;/p&gt;

&lt;p&gt;Which means runtime is no longer the source of truth about what your system is.&lt;/p&gt;

&lt;p&gt;The control plane is the source of truth.&lt;/p&gt;

&lt;p&gt;Runtime is output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Production Became Read-Only&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I remember the before-times. You'd SSH into a production box, adjust a configuration file, restart a service. The change was immediate, tangible, yours. The system evolved through accumulated operational interventions—some documented, many not.&lt;/p&gt;

&lt;p&gt;That model died quietly.&lt;/p&gt;

&lt;p&gt;Today's runtime environments are generated from upstream sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git commits land in main branches&lt;/li&gt;
&lt;li&gt;Merge events trigger pipeline executions&lt;/li&gt;
&lt;li&gt;Pipelines synthesize infrastructure from templates&lt;/li&gt;
&lt;li&gt;Terraform provisions resources, Kubernetes manifests define workloads&lt;/li&gt;
&lt;li&gt;Deployment workflows roll out application versions&lt;/li&gt;
&lt;li&gt;Reconciliation controllers watch for state drift and eliminate it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you manually modify a production server now—say you patch a library or adjust memory limits—automation detects the discrepancy within minutes. It compares observed state against declared state. Then it reverts your change.&lt;/p&gt;

&lt;p&gt;The server doesn't obey you anymore.&lt;/p&gt;

&lt;p&gt;It obeys the control plane.&lt;/p&gt;
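&lt;p&gt;The whole mechanism fits in a few lines. A toy reconciliation pass, with illustrative keys:&lt;/p&gt;

```python
# Sketch: the reconciliation loop in miniature. Declared state wins; a manual
# edit is just drift to be corrected. Keys and values are illustrative.

declared = {"memory_limit_mb": 512, "replicas": 3}

def reconcile(observed, declared):
    """Return the patch that forces observed state back to declared state."""
    return {k: v for k, v in declared.items() if observed.get(k) != v}

# An operator SSHes in and bumps the memory limit by hand:
observed = {"memory_limit_mb": 1024, "replicas": 3}
print(reconcile(observed, declared))   # {'memory_limit_mb': 512}
```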

&lt;p&gt;&lt;strong&gt;Git as Operational Interface&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a growing number of organizations, merging a pull request initiates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load balancer reconfiguration&lt;/li&gt;
&lt;li&gt;Firewall rule modifications&lt;/li&gt;
&lt;li&gt;IAM policy updates&lt;/li&gt;
&lt;li&gt;Database schema migrations&lt;/li&gt;
&lt;li&gt;Secret rotations&lt;/li&gt;
&lt;li&gt;Application deployment across availability zones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A PR is no longer merely a code review. It's an operational change request that triggers cascading infrastructure mutations.&lt;/p&gt;

&lt;p&gt;Git has become the command surface. Repositories declare intent. Commits are orders. Approval workflows replace sudo access.&lt;/p&gt;

&lt;p&gt;This is elegant until it isn't.&lt;/p&gt;

&lt;p&gt;Because now a typo in a YAML file—an extra indent, a missing hyphen—can delete your production database. And it will do so cleanly, atomically, exactly as designed. The control plane doesn't distinguish between intentional and accidental. It just executes.&lt;/p&gt;
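&lt;p&gt;You can teach it to hesitate, though. One common mitigation is a pre-apply gate that treats destruction as a special case requiring explicit acknowledgment. A sketch; the plan format here is invented, though real plan outputs (Terraform's JSON plan, for instance) carry equivalent information:&lt;/p&gt;

```python
# Sketch: a pre-apply policy gate that refuses destructive changes unless they
# were explicitly declared. The plan structure and action names are invented.

def destructive_actions(plan):
    return [c for c in plan if c["action"] == "delete"]

def gate(plan, destroys_acknowledged=False):
    """Block the apply if it would delete anything nobody asked to delete."""
    doomed = destructive_actions(plan)
    if doomed and not destroys_acknowledged:
        return ("blocked", [c["resource"] for c in doomed])
    return ("allowed", [])

plan = [
    {"resource": "aws_db_instance.main", "action": "delete"},   # the typo
    {"resource": "aws_db_instance.mian", "action": "create"},
]
print(gate(plan))   # ('blocked', ['aws_db_instance.main'])
```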

&lt;p&gt;&lt;strong&gt;The Illusion of Runtime Security&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security teams invest extraordinary effort hardening runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network segmentation isolates workloads&lt;/li&gt;
&lt;li&gt;Container security platforms scan images&lt;/li&gt;
&lt;li&gt;Endpoint detection tools monitor for anomalous behavior&lt;/li&gt;
&lt;li&gt;Service meshes enforce mTLS between microservices&lt;/li&gt;
&lt;li&gt;Runtime application self-protection instruments code paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is wasted. Runtime defenses matter.&lt;/p&gt;

&lt;p&gt;But they operate downstream of intent.&lt;/p&gt;

&lt;p&gt;If your control plane deploys a container with overly permissive IAM policies, runtime security faithfully enforces those permissions. If automation writes a network policy that accidentally exposes an internal API, your intrusion detection system sees legitimate traffic patterns.&lt;/p&gt;

&lt;p&gt;Runtime security cannot correct upstream mistakes. It can only amplify them with precision.&lt;/p&gt;

&lt;p&gt;I've watched teams spend months building runtime guardrails while their CI/CD pipeline has admin credentials to every production account. The threat model was inverted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Modern Incidents Actually Start&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Outages used to begin with hardware failures. Disk crashes. Memory corruption. Network partitions. Servers dying spontaneously because entropy is real.&lt;/p&gt;

&lt;p&gt;Now incidents typically originate in automation:&lt;/p&gt;

&lt;p&gt;A developer updates a Terraform module version. The new version changes resource behavior in subtle ways the author didn't document. Your next deployment scales down a database cluster during peak traffic.&lt;/p&gt;

&lt;p&gt;A security team tightens IAM policies in a shared library. The change propagates through dependency updates. Three weeks later, a batch job fails because it can't access S3 anymore. Nobody connects the dots for hours.&lt;/p&gt;

&lt;p&gt;A deployment workflow has a conditional: if it's Tuesday and traffic is below threshold, roll out the canary faster. That logic made sense once. Now it's buried in YAML nobody reviews. One Tuesday it misbehaves spectacularly.&lt;/p&gt;

&lt;p&gt;The runtime environment executes the new state exactly as specified. It isn't failing. It's succeeding at the wrong thing.&lt;/p&gt;

&lt;p&gt;Automation accelerates both recovery and failure. It just doesn't care which.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concentrated Authority, Distributed Ownership&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider the permissions your control plane possesses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create or destroy infrastructure across regions&lt;/li&gt;
&lt;li&gt;Modify IAM policies that govern access to customer data&lt;/li&gt;
&lt;li&gt;Rotate cryptographic material&lt;/li&gt;
&lt;li&gt;Deploy application code to millions of users&lt;/li&gt;
&lt;li&gt;Promote artifacts from development to production globally&lt;/li&gt;
&lt;li&gt;Roll back entire environments atomically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Few human operators have comparable authority. Even senior engineers typically need approval chains to perform high-risk operations manually.&lt;/p&gt;

&lt;p&gt;Automation doesn't.&lt;/p&gt;

&lt;p&gt;A GitHub Actions workflow running in your CI/CD pipeline might have more power than your VP of Engineering. It just exercises that power quietly, thousands of times per day, in ways that usually work.&lt;/p&gt;

&lt;p&gt;Until they don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Traditional Security Models Struggle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security thinking evolved around protecting runtime assets. Firewalls keep bad actors off your network. IAM prevents unauthorized access to databases. Monitoring detects anomalous behavior patterns.&lt;/p&gt;

&lt;p&gt;Control planes break these models because change is their purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They're supposed to modify infrastructure at scale&lt;/li&gt;
&lt;li&gt;They legitimately create and destroy resources&lt;/li&gt;
&lt;li&gt;Their behavior resembles administrative activity&lt;/li&gt;
&lt;li&gt;Malicious actions and operational accidents look identical to normal automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Distinguishing intended change from dangerous change becomes a signal detection problem with terrible base rates. When automation makes ten thousand legitimate modifications daily, how do you spot the one that's quietly catastrophic?&lt;/p&gt;

&lt;p&gt;You can't rely on anomaly detection. Large-scale change is the baseline.&lt;/p&gt;
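&lt;p&gt;The base-rate problem is worth working through with numbers. Even a detector far better than anything realistic drowns the real signal; the figures here are hypothetical:&lt;/p&gt;

```python
# Sketch: why anomaly detection struggles here, as a base-rate calculation.
# The change volume and detector accuracy are made-up illustrative numbers.

daily_changes  = 10_000
bad_changes    = 1           # the quietly catastrophic one
sensitivity    = 0.99        # detector catches 99% of bad changes
false_pos_rate = 0.01        # and flags 1% of good ones

alerts_from_bad  = bad_changes * sensitivity
alerts_from_good = (daily_changes - bad_changes) * false_pos_rate
precision = alerts_from_bad / (alerts_from_bad + alerts_from_good)

print(f"alerts per day: {alerts_from_bad + alerts_from_good:.0f}")
print(f"chance a given alert is the real one: {precision:.1%}")
```

&lt;p&gt;Roughly a hundred alerts a day, and about a one-percent chance any given alert is the one that matters. Nobody triages that queue for long.&lt;/p&gt;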

&lt;p&gt;Some teams try to solve this with approval workflows. Every infrastructure change requires human review. But humans are terrible at reviewing YAML diffs for subtle logical errors. We pattern-match, we skim, we approve things that look structurally similar to previous changes.&lt;/p&gt;

&lt;p&gt;The control plane is too fast and too privileged for human review to be the primary safety mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Drift Has Moved Upstream&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Drift used to mean: someone manually changed a production server and forgot to document it. Over time, systems diverged from their nominal specifications.&lt;/p&gt;

&lt;p&gt;Modern drift originates differently:&lt;/p&gt;

&lt;p&gt;Your CI/CD pipeline gradually accumulates permissions. Initially it needed S3 access for artifact storage. Then someone added CloudWatch logging. Later, ECR for container images. Now it has broad read-write across AWS services, and nobody remembers the accretion history.&lt;/p&gt;

&lt;p&gt;Deployment workflows grow baroque. Feature flags control rollout percentages, but the flag evaluation logic becomes a miniature state machine with twelve branches. One branch has a bug. It only executes under specific conditions nobody tests for.&lt;/p&gt;

&lt;p&gt;Infrastructure templates age. They encode assumptions that were true when written—instance types, API versions, availability zones. Those assumptions quietly become false. The templates keep working, mostly, until they encounter an edge case.&lt;/p&gt;

&lt;p&gt;The control plane itself drifts.&lt;/p&gt;

&lt;p&gt;And when it does, every runtime environment inherits that drift automatically, at scale, faster than you can react.&lt;/p&gt;
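&lt;p&gt;The permission accretion described above is at least measurable: periodically diff what the pipeline identity is granted against what it has actually used. A sketch with invented grants; cloud providers expose the underlying usage data through their access-analysis tooling:&lt;/p&gt;

```python
# Sketch: a periodic audit comparing a pipeline identity's grants with its
# recent activity. Action names mimic AWS style but the data is invented.

granted = {"s3:*", "logs:*", "ecr:*", "rds:*", "cloudfront:*"}
used_last_90_days = {"s3:PutObject", "logs:PutLogEvents", "ecr:PutImage"}

def unused_grants(granted, used):
    """Wildcard families with no recorded use are candidates for removal."""
    used_families = {action.split(":")[0] for action in used}
    return sorted(g for g in granted if g.split(":")[0] not in used_families)

print(unused_grants(granted, used_last_90_days))   # ['cloudfront:*', 'rds:*']
```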

&lt;p&gt;&lt;strong&gt;Observability's Blind Spot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most monitoring focuses on runtime behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application latency percentiles&lt;/li&gt;
&lt;li&gt;Error rates and status codes&lt;/li&gt;
&lt;li&gt;CPU, memory, disk utilization&lt;/li&gt;
&lt;li&gt;Database query performance&lt;/li&gt;
&lt;li&gt;Cache hit ratios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tells you what happened to your application.&lt;/p&gt;

&lt;p&gt;It doesn't tell you why the control plane decided to make it happen that way.&lt;/p&gt;

&lt;p&gt;Modern reliability requires visibility into upstream decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which pipeline execution deployed this version&lt;/li&gt;
&lt;li&gt;What infrastructure changes occurred in the last hour&lt;/li&gt;
&lt;li&gt;Which identity provisioned these resources&lt;/li&gt;
&lt;li&gt;Why did the deployment workflow choose this rollout strategy&lt;/li&gt;
&lt;li&gt;What policy decisions governed these security settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without control plane observability, you're debugging symptoms while the cause remains invisible. You see elevated latency and assume your application regressed. Actually, Terraform just scaled down your database because a developer typo'd a variable definition.&lt;/p&gt;

&lt;p&gt;I've spent hours debugging application logs only to discover the real failure was three layers upstream in Kubernetes manifest templating logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating Control Planes as Production Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the control plane defines reality, it requires production-grade operational discipline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership must be explicit.&lt;/strong&gt;&lt;br&gt;
Pipelines, workflows, infrastructure definitions—these aren't shared commons. They need owners who understand their behavior, monitor their health, and respond when they misbehave. Shared responsibility means no responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change management applies to automation logic.&lt;/strong&gt;&lt;br&gt;
Your deployment workflow is code. It has bugs. It makes assumptions. It will fail in ways you didn't anticipate. Changes to workflow logic are production changes. They deserve the same review rigor, testing discipline, and rollout caution as application code. Probably more, actually, because workflow logic has more authority than most application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Least privilege for automation identities.&lt;/strong&gt;&lt;br&gt;
Don't grant your CI/CD pipeline admin access and call it done. Scope permissions narrowly: this workflow can deploy to production, this one can only update staging, this identity provisions infrastructure but cannot modify IAM policies.&lt;/p&gt;

&lt;p&gt;Treat automation identities like high-privilege user accounts, because that's what they are.&lt;/p&gt;
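&lt;p&gt;As a sketch, the target shape is a table of narrow identities rather than one broad one; the names and scopes here are invented, not any real cloud's policy syntax:&lt;/p&gt;

```python
# Sketch: modeling automation identities with narrow scopes instead of one
# admin credential. Identity names and scope strings are illustrative.

IDENTITIES = {
    "deploy-staging":    {"deploy:staging"},
    "deploy-production": {"deploy:production"},
    "provision-infra":   {"infra:create", "infra:update"},   # no IAM rights
}

def allowed(identity, action):
    return action in IDENTITIES.get(identity, set())

print(allowed("deploy-staging", "deploy:production"))    # False
print(allowed("provision-infra", "iam:PutPolicy"))       # False
print(allowed("deploy-production", "deploy:production")) # True
```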

&lt;p&gt;&lt;strong&gt;Continuous auditability for deployment decisions.&lt;/strong&gt;&lt;br&gt;
Every time automation changes production, you should be able to reconstruct:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which commit triggered the change&lt;/li&gt;
&lt;li&gt;Who approved it (or which automated gate passed)&lt;/li&gt;
&lt;li&gt;What actually changed&lt;/li&gt;
&lt;li&gt;Which identity executed the change&lt;/li&gt;
&lt;li&gt;Which systems were affected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Git history provides some of this. But you need the execution trace too—pipeline logs, deployment events, infrastructure state transitions. The control plane's decision-making process must be auditable.&lt;/p&gt;
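&lt;p&gt;One way to get that execution trace is to emit a structured event on every control-plane change, so the five questions above become queries. A sketch with invented fields:&lt;/p&gt;

```python
# Sketch: a structured deploy event recorded on every control-plane change.
# Field names and sample values are illustrative.
from dataclasses import dataclass, asdict
import json

@dataclass
class DeployEvent:
    commit: str            # which commit triggered the change
    approved_by: str       # human approver, or the automated gate that passed
    diff_summary: str      # what actually changed
    identity: str          # which identity executed it
    systems: tuple         # which systems were affected

event = DeployEvent(
    commit="9f2c1ab",
    approved_by="gate:integration-tests",
    diff_summary="scale api deployment 3 to 5 replicas",
    identity="ci-deploy-production",
    systems=("api", "autoscaler"),
)
print(json.dumps(asdict(event)))   # ship this to whatever you can query
```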

&lt;p&gt;&lt;strong&gt;Resilience mechanisms at the control plane layer.&lt;/strong&gt;&lt;br&gt;
Safe rollbacks. Progressive rollouts with automated health checks. Policy enforcement that blocks dangerous changes before they execute. Rate limiting on infrastructure modifications. Circuit breakers for automation that's failing repeatedly.&lt;/p&gt;

&lt;p&gt;You build these safeguards into applications. They belong in control planes too.&lt;/p&gt;
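&lt;p&gt;The circuit-breaker idea transfers almost directly from application code to automation. A minimal sketch, with an invented threshold:&lt;/p&gt;

```python
# Sketch: a circuit breaker for automation itself. After repeated failed
# applies, the pipeline stops retrying and escalates. Threshold is invented.
import operator

class AutomationBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, success):
        self.failures = 0 if success else self.failures + 1

    @property
    def open(self):
        # True means: stop executing, page a human instead.
        return operator.ge(self.failures, self.max_failures)

breaker = AutomationBreaker()
for outcome in (False, False, False):   # three failed applies in a row
    breaker.record(outcome)
print(breaker.open)   # True
```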

&lt;p&gt;&lt;strong&gt;The Reframe&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We used to think: production is where systems run.&lt;/p&gt;

&lt;p&gt;Better framing: production is where system state is decided.&lt;/p&gt;

&lt;p&gt;Runtime executes. The control plane governs.&lt;/p&gt;

&lt;p&gt;Once you internalize this, modern failure modes make sense. They weren't runtime failures. They were control plane failures expressed at scale, propagated through automation, faithfully executed by infrastructure doing exactly what it was told.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Changed on a Fundamental Level&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Infrastructure used to be operated. You made decisions, executed them, observed results. The system evolved through human judgment applied continuously.&lt;/p&gt;

&lt;p&gt;Now infrastructure is declared. You describe the desired state. Automation interprets those declarations and drives reality toward them. The system evolves through reconciliation loops that compare intent against observation.&lt;/p&gt;

&lt;p&gt;This is more reliable when it works. Changes are reproducible, auditable, reversible. Infrastructure becomes code: versionable, testable, composable.&lt;/p&gt;

&lt;p&gt;But it inverts the authority structure.&lt;/p&gt;

&lt;p&gt;Humans no longer command systems directly. We program the automation that commands systems. We've inserted an interpretive layer between intent and execution.&lt;/p&gt;

&lt;p&gt;That layer is now the most critical component in your architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to Do Monday Morning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you operate modern infrastructure, the control plane is already your real production system. You just might not be treating it that way.&lt;/p&gt;

&lt;p&gt;Start treating it that way:&lt;/p&gt;

&lt;p&gt;Audit your pipeline permissions. Most are overprovisioned. Fix that.&lt;/p&gt;

&lt;p&gt;Add observability to deployment workflows: when they ran, what changed, and why they made the decisions they did. Make this queryable.&lt;/p&gt;

&lt;p&gt;Review automation logic the way you review application code. It's more important, actually, because it has more authority.&lt;/p&gt;

&lt;p&gt;Test failure modes in your control plane. What happens if your CI/CD pipeline is compromised? Can you still deploy? Can you roll back? Do you have a break-glass procedure that doesn't depend on the control plane?&lt;/p&gt;

&lt;p&gt;Establish ownership boundaries. No more "the platform team maintains everything." Specific people own specific automation.&lt;/p&gt;

&lt;p&gt;Build progressive rollout into infrastructure changes, not just application deploys. When Terraform wants to modify 200 database instances, maybe do five first and verify nothing exploded.&lt;/p&gt;
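&lt;p&gt;The five-then-verify discipline is a short loop. A sketch, with a placeholder health check:&lt;/p&gt;

```python
# Sketch: progressive rollout for infrastructure changes, not just app
# deploys. The health check and batch size are placeholders.

def rollout(instances, apply_change, healthy, first_batch=5):
    """Change a small canary batch, verify, then do the rest."""
    canary, rest = instances[:first_batch], instances[first_batch:]
    for inst in canary:
        apply_change(inst)
    if not all(healthy(inst) for inst in canary):
        return ("halted", len(canary))          # nothing else was touched
    for inst in rest:
        apply_change(inst)
    return ("complete", len(instances))

instances = [f"db-{i}" for i in range(200)]
changed = []
print(rollout(instances, changed.append, healthy=lambda inst: True))
# ('complete', 200)
```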

&lt;p&gt;These aren't novel ideas. They're boring operational discipline.&lt;/p&gt;

&lt;p&gt;But we forgot to apply them upstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If runtime infrastructure is your application's body, the control plane is its nervous system. It decides what moves, how fast, in response to what stimuli.&lt;/p&gt;

&lt;p&gt;You can harden the body all you want. But if the nervous system is compromised—if automation is misconfigured, over-privileged, or poorly understood—the body will execute destructive commands perfectly.&lt;/p&gt;

&lt;p&gt;Protecting production means protecting the system that decides what production becomes.&lt;/p&gt;

&lt;p&gt;Today, that system is the control plane.&lt;/p&gt;

&lt;p&gt;Treat it accordingly.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>cicd</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Your Deployment Pipeline Is a Privileged Identity System</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Sat, 14 Feb 2026 12:04:36 +0000</pubDate>
      <link>https://forem.com/iyanu_david/your-deployment-pipeline-is-a-privileged-identity-system-m6o</link>
      <guid>https://forem.com/iyanu_david/your-deployment-pipeline-is-a-privileged-identity-system-m6o</guid>
      <description>&lt;p&gt;We treat deployment pipelines like automation.&lt;/p&gt;

&lt;p&gt;They are not.&lt;/p&gt;

&lt;p&gt;They are identity systems.&lt;/p&gt;

&lt;p&gt;Every time a pipeline runs, it answers a critical question: Who is allowed to change production? And increasingly, the answer is: the pipeline. Not a human. Not an admin. Not a ticket approval. The pipeline identity.&lt;/p&gt;

&lt;p&gt;Not because we chose this architecture deliberately. Because we arrived here through a thousand small decisions that felt like operational improvements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Shift We Didn't Fully Acknowledge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Historically, humans logged into production. Engineers ran deployment scripts from jump boxes. Admins approved infrastructure changes through ticketing systems that everyone hated but at least understood. The trust model was explicit: this person, with these credentials, at this terminal, making this change.&lt;/p&gt;

&lt;p&gt;Now: a commit merges. A workflow triggers. Automation deploys. Infrastructure updates itself.&lt;/p&gt;

&lt;p&gt;Humans design the change. Pipelines execute it.&lt;/p&gt;

&lt;p&gt;That seems like a productivity win—and it is. But it fundamentally relocates where authority lives. Pipelines don't just move code from one place to another. They act on behalf of delegated privilege. That is identity, whether we acknowledge it or not.&lt;/p&gt;

&lt;p&gt;The thing is, we gutted the old identity model without building a replacement. We removed direct human access to production, celebrated the reduction in operational risk, and then... concentrated all that authority inside automation we barely monitor. The privilege didn't vanish. It migrated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Makes Something an Identity System?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An identity system does four things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authenticates&lt;/strong&gt;. Proves who or what is making a request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authorizes&lt;/strong&gt;. Determines what that actor is allowed to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acts with delegated privilege&lt;/strong&gt;. Exercises permissions beyond its own inherent capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Establishes trust boundaries&lt;/strong&gt;. Creates a perimeter of assumed safety around certain operations.&lt;/p&gt;

&lt;p&gt;Modern deployment pipelines do all four.&lt;/p&gt;

&lt;p&gt;When a pipeline deploys, it authenticates to a cloud provider—usually by assuming a role through workload identity federation or presenting long-lived credentials someone uploaded six months ago and forgot about. It authorizes itself to apply infrastructure changes, publish artifacts to registries, modify runtime configuration, rotate secrets, update DNS records. It acts with delegated privilege that often exceeds what any individual engineer possesses. And it establishes trust boundaries: "If this workflow ran, the change is approved."&lt;/p&gt;
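&lt;p&gt;Workload identity federation replaces the forgotten static credential with short-lived tokens whose claims are checked against a trust policy. A sketch of that check as data; the issuer and claims are invented, though real federation (GitHub Actions OIDC into a cloud role, say) works in the same spirit:&lt;/p&gt;

```python
# Sketch: claim-checked, short-lived trust instead of a static key someone
# uploaded and forgot. The issuer, claims, and repo names are invented.

TRUST_POLICY = {
    "issuer": "https://token.ci.example",
    "require": {"repository": "org/payments", "ref": "refs/heads/main"},
}

def may_assume_role(token_claims):
    """Only tokens minted for this repo and branch can assume the role."""
    return all(token_claims.get(k) == v for k, v in TRUST_POLICY["require"].items())

print(may_assume_role({"repository": "org/payments", "ref": "refs/heads/main"}))  # True
print(may_assume_role({"repository": "org/fork", "ref": "refs/heads/main"}))      # False
```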

&lt;p&gt;It is not "running a script." It is asserting identity and exercising authority across production systems.&lt;/p&gt;

&lt;p&gt;The problem is we designed these systems like they were utilities—background processes that make deployments faster. We didn't design them like we were creating synthetic superusers with cross-environment reach and the ability to reconfigure foundational infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Delegated Authority Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the architectural blind spot: pipelines often hold broader authority than any individual engineer.&lt;/p&gt;

&lt;p&gt;An engineer cannot directly apply Terraform in production. Policy forbids it. But the pipeline can—because it needs to, and we trust that only approved code reaches the pipeline. An engineer cannot directly push images to the production container registry. But the pipeline can. An engineer cannot modify IAM policies or security groups or KMS key permissions. But the pipeline can, because infrastructure-as-code workflows require those capabilities.&lt;/p&gt;

&lt;p&gt;We removed human privilege. Then concentrated it inside automation. That is not inherently wrong—in fact, it's probably necessary for operating at scale. But it must be explicitly modeled as an identity architecture decision, not treated as a deployment convenience.&lt;/p&gt;

&lt;p&gt;I've seen production environments where the CI service account has broader permissions than the entire engineering team combined. Not because anyone intended that. Because permission grants accreted over time. A developer needed to add a CloudFront distribution, so the pipeline got &lt;code&gt;cloudfront:*&lt;/code&gt;. Someone else needed to configure an RDS instance, so it got &lt;code&gt;rds:*&lt;/code&gt;. Six months later, the pipeline can provision anything, and no one remembers why or questions whether it should.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipelines as Synthetic Superusers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In many systems, the deployment pipeline becomes a synthetic superuser. It has cross-environment access—promoting artifacts from staging to production. It can rotate credentials in secret managers. It can provision infrastructure in multiple AWS accounts or GCP projects. It can roll back production to previous states. It can modify ingress controllers, update certificate authorities, reconfigure observability backends.&lt;/p&gt;

&lt;p&gt;That means compromising the pipeline identity is equivalent to compromising a production admin account. Except the pipeline is easier to reach, because it executes code from repositories.&lt;/p&gt;

&lt;p&gt;Think about the attack surface: a malicious npm dependency executes during a build step. It extracts the &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; or GitLab's &lt;code&gt;CI_JOB_TOKEN&lt;/code&gt; or whatever credential the CI runner uses. That token allows the attacker to trigger workflows, modify environment variables, or—depending on configuration—assume roles directly into production accounts. No password cracking required. No human compromise required. Just delegated authority inherited from automation.&lt;/p&gt;

&lt;p&gt;The pipeline identity becomes the attack path. And we've architected it to be maximally reachable: it runs arbitrary code on every pull request, every commit, every merge. We designed a superuser that executes untrusted inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity Without Visibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We typically monitor human logins. SSO events go to a SIEM. Privilege escalations trigger alerts. MFA challenges get logged and audited. That's identity hygiene we learned over decades.&lt;/p&gt;

&lt;p&gt;But do we monitor pipeline role assumptions? Deployment identity behavior changes? Workflow-level permission shifts? Unexpected cross-account actions from CI service principals?&lt;/p&gt;

&lt;p&gt;Often, no. Pipeline identities operate with minimal scrutiny. They are considered trusted automation—part of the infrastructure, not part of the threat model. But trust without monitoring is assumption. And assumptions age poorly.&lt;/p&gt;

&lt;p&gt;I've investigated incidents where pipelines were compromised for weeks before anyone noticed. Why? Because no one was watching. The SIEM had rules for unusual human logins, failed SSH attempts, privilege escalation via &lt;code&gt;sudo&lt;/code&gt;. It had nothing for "CI runner assumed production role at 3 AM on a Sunday and modified fourteen security groups."&lt;/p&gt;

&lt;p&gt;When you ask security teams what identities exist in their environment, they'll list users, service accounts, maybe API keys. They rarely list deployment workflows as first-class identities, even though those workflows hold more authority than most of the humans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Just Automation" Fallacy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Calling pipelines "just automation" hides risk.&lt;/p&gt;

&lt;p&gt;Automation does not eliminate identity. It concentrates it. Every deployment workflow has a defined trust boundary, a set of permissions, a scope of authority, a blast radius. That is identity architecture, whether you design it intentionally or inherit it accidentally.&lt;/p&gt;

&lt;p&gt;The "just automation" framing also obscures accountability. If a human misconfigures infrastructure, we know who to talk to. If automation misconfigures infrastructure... who owns that? The developer who wrote the Terraform? The platform team that maintains the pipeline? The security team that approved the permissions? The answer is usually unclear, which means the risk goes unowned.&lt;/p&gt;

&lt;p&gt;I've seen organizations where deployment pipelines have the highest privilege in the entire system, and no one has explicit responsibility for auditing or governing that privilege. It's just... there. A dependency. An assumption. Something that has to work, so it has broad permissions, and everyone hopes it's configured correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Pipelines Become Lateral Movement Vectors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine this scenario—because I've seen variations of it multiple times:&lt;br&gt;
A developer adds a new package to a project. It's a popular library, thousands of downloads, looks legitimate. But it was compromised two weeks ago. The malicious code executes during CI—perhaps in a &lt;code&gt;postinstall&lt;/code&gt; script, perhaps in a build step. It extracts a deployment token from environment variables. That token allows role assumption into production.&lt;/p&gt;

&lt;p&gt;The attacker doesn't immediately deploy malicious infrastructure. That would trigger alerts. Instead, they modify a single IAM policy, granting themselves persistent access through a separate backdoor. Then they wait. When they're ready, they deploy modified infrastructure—maybe a Lambda function that exfiltrates data, maybe a container that mines cryptocurrency, maybe just a persistence mechanism for later use.&lt;/p&gt;

&lt;p&gt;No password cracking required. No phishing campaign. No human compromise. Just delegated authority inherited from automation, exploited through supply chain insertion.&lt;/p&gt;

&lt;p&gt;The pipeline identity becomes the pivot point. And because pipelines have broad cross-environment access, a compromise in one workflow can cascade into multiple production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modeling Pipelines as First-Class Identities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If pipelines are identity systems, they require identity design principles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Least Privilege Per Workflow&lt;/strong&gt;&lt;br&gt;
A test workflow should not deploy infrastructure. It should not publish production images. It should not modify IAM roles or security groups or DNS records.&lt;/p&gt;

&lt;p&gt;Segment permissions by purpose. A workflow that runs unit tests needs to read code and write test results. That's it. A workflow that deploys to staging needs permissions scoped to the staging environment—not production, not development, certainly not cross-account access. A workflow that publishes container images needs write access to a specific registry namespace, not &lt;code&gt;ecr:*&lt;/code&gt; across all regions.&lt;/p&gt;
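&lt;p&gt;As a sketch of what "a specific registry namespace, not &lt;code&gt;ecr:*&lt;/code&gt;" looks like in practice, here is a hypothetical IAM policy for an image-publishing workflow. The account ID and repository name are placeholders; the action list is what an ECR image push needs, and &lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt; is the one call that has to remain account-wide:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PushToOneRepositoryOnly",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/team-a/web-app"
    },
    {
      "Sid": "AuthTokenCannotBeResourceScoped",
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    }
  ]
}
```

&lt;p&gt;A workflow holding this policy can push one image to one repository and nothing else. Compromise it, and the attacker gets exactly that.&lt;/p&gt;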

&lt;p&gt;This is harder than it sounds, because modern pipelines often reuse the same runner identity across multiple workflows. You end up with a single service account that has the union of all permissions any workflow might need. That's convenient. It's also an identity design failure.&lt;/p&gt;

&lt;p&gt;The alternative: workload identity federation with dynamic permission grants based on workflow context. GitHub Actions supports OIDC-based role assumption where permissions can be scoped to specific repositories, branches, or even workflow files. GitLab CI can use JWT-based identity with claims that map to granular IAM policies. These aren't perfect—the configuration is finicky, and the documentation assumes you already understand federated identity—but they allow per-workflow privilege scoping.&lt;/p&gt;

&lt;p&gt;I've seen teams reduce pipeline blast radius by 80% just by segmenting workflow permissions. The test suite no longer has production deploy rights. The dependency update bot can't modify infrastructure. The documentation build can't access secret managers.&lt;/p&gt;

&lt;p&gt;It requires upfront design. But the alternative is a single compromise granting access to everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Short-Lived, Context-Aware Credentials&lt;/strong&gt;&lt;br&gt;
Pipeline identity should be issued per run, expire automatically, and be bound to repository context.&lt;/p&gt;

&lt;p&gt;Permanent access keys are identity debt. They sit in CI environment variables, maybe encrypted, maybe not. They get rotated... eventually. Maybe every 90 days if you have a good compliance program. Maybe never if you're honest.&lt;/p&gt;

&lt;p&gt;Every permanent credential is a persistent attack surface. If an attacker extracts it, they have access until someone manually revokes it. That could be hours. Could be months. I've seen access keys in CI systems that were created three years ago and have never been rotated because no one wants to risk breaking the deployment process.&lt;/p&gt;

&lt;p&gt;Short-lived credentials issued through workload identity mean an attacker has to maintain access to the CI system itself, not just steal a static token. That's a higher bar. It also means credentials automatically expire—usually within an hour—which limits the window for abuse.&lt;/p&gt;

&lt;p&gt;Context-aware credentials go further: they embed claims about the repository, branch, workflow file, even the specific commit SHA. You can write IAM policies that say "this role can only be assumed by workflows running in the &lt;code&gt;main&lt;/code&gt; branch of &lt;code&gt;org/repo&lt;/code&gt;" or "this role can only deploy infrastructure if triggered by a tag matching &lt;code&gt;v*&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;That limits both accidental misconfiguration and deliberate abuse. A developer can't just fork the repository and run the production deployment workflow from their fork. The identity system checks the context and denies it.&lt;/p&gt;
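&lt;p&gt;Expressed as an AWS trust policy, the "only the &lt;code&gt;main&lt;/code&gt; branch of &lt;code&gt;org/repo&lt;/code&gt;" rule looks roughly like this. The account ID is a placeholder; the &lt;code&gt;sub&lt;/code&gt; claim follows GitHub's standard &lt;code&gt;repo:owner/name:ref:...&lt;/code&gt; format:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:org/repo:ref:refs/heads/main"
        }
      }
    }
  ]
}
```

&lt;p&gt;A fork, a feature branch, or a different repository presents a different &lt;code&gt;sub&lt;/code&gt; claim and is denied at the identity layer, before any workflow logic runs.&lt;/p&gt;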

&lt;p&gt;&lt;strong&gt;3. Environment-Level Isolation&lt;/strong&gt;&lt;br&gt;
Development, staging, and production pipelines should not share identity roles.&lt;/p&gt;

&lt;p&gt;Cross-environment authority increases blast radius catastrophically. If a single pipeline identity can deploy to both staging and production, then compromising staging—often less protected, sometimes running outdated dependencies, occasionally accessible by contractors—grants production access.&lt;/p&gt;

&lt;p&gt;I've investigated breaches where the entry point was a compromised staging environment, but the actual damage occurred in production because the deployment pipeline had cross-environment permissions. The attacker didn't need to pivot. The pipeline did the pivoting for them.&lt;/p&gt;

&lt;p&gt;Isolate identity by environment. Production deployment workflows assume roles that can only operate in production. Staging workflows assume separate roles with zero production access. Development workflows—if they even have deployment capabilities—operate in entirely separate accounts or projects.&lt;/p&gt;

&lt;p&gt;This creates operational friction. You can't easily promote an artifact from staging to production with a single workflow. You need separate workflows, separate identities, explicit promotion gates. That friction is the point. It forces intentionality.&lt;/p&gt;
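&lt;p&gt;One hedged sketch of that separation in GitHub Actions: two jobs, two GitHub environments, two roles in two different AWS accounts. All names and account IDs below are hypothetical; the point is that neither identity can touch the other environment:&lt;/p&gt;

```yaml
jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging          # staging-scoped secrets and approvals only
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111111111111:role/staging-deploy  # staging account
          aws-region: us-east-1

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production       # separate reviewers, separate promotion gate
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::222222222222:role/prod-deploy     # different account entirely
          aws-region: us-east-1
```

&lt;p&gt;The promotion gate lives in the &lt;code&gt;environment&lt;/code&gt; configuration, not in the workflow file, so a code change alone cannot remove it.&lt;/p&gt;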

&lt;p&gt;&lt;strong&gt;4. Observability for Automation Identity&lt;/strong&gt;&lt;br&gt;
Log and alert on unusual role assumptions, off-hours deployments, unexpected permission escalations, drift in workflow privilege definitions.&lt;/p&gt;

&lt;p&gt;Automation must be observable like any other identity. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudTrail or equivalent audit logs&lt;/strong&gt; that capture every API call made by pipeline identities, tagged with workflow context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting on anomalies&lt;/strong&gt;: if a deployment workflow assumes a role it's never assumed before, that's worth investigating. If deployments happen at 2 AM on a Saturday when no human is working, that's worth investigating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift detection&lt;/strong&gt;: if the permissions granted to a pipeline identity change—someone adds &lt;code&gt;s3:*&lt;/code&gt; when it only had &lt;code&gt;s3:GetObject&lt;/code&gt; before—that should trigger review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral baselines&lt;/strong&gt;: establish what "normal" looks like for pipeline activity, then alert on deviations. Most pipelines operate on predictable schedules. A sudden spike in deployment frequency or cross-region API calls deserves scrutiny.&lt;/li&gt;
&lt;/ul&gt;
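&lt;p&gt;Drift detection in particular is cheap to start. On AWS, an EventBridge event pattern like the following (the role name is hypothetical, and this is only a sketch) fires whenever someone attaches or inlines a new policy on the pipeline role, which is exactly the moment a human should review the change:&lt;/p&gt;

```json
{
  "source": ["aws.iam"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["iam.amazonaws.com"],
    "eventName": ["AttachRolePolicy", "PutRolePolicy"],
    "requestParameters": {
      "roleName": ["ci-deploy-role"]
    }
  }
}
```

&lt;p&gt;Route the matched events to a ticket queue or chat channel and the review happens when the grant happens, not during next year's audit.&lt;/p&gt;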

&lt;p&gt;The challenge: most logging systems aren't configured to treat pipeline identities as suspicious by default. They're infrastructure. They're trusted. They generate enormous log volume, so alerts get tuned to ignore them to reduce noise.&lt;/p&gt;

&lt;p&gt;You have to intentionally model them as high-privilege actors whose behavior should be scrutinized, not assumed safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Explicit Blast Radius Mapping&lt;/strong&gt;&lt;br&gt;
Ask: if this deployment workflow is compromised, what can it change?&lt;/p&gt;

&lt;p&gt;If the answer is "everything," your identity model is underdefined.&lt;/p&gt;

&lt;p&gt;Map the blast radius explicitly. Document what each pipeline identity can access, what it can modify, what it can delete. Include indirect access—if the pipeline can modify IAM policies, it can grant itself additional permissions, which means its blast radius includes anything those policies could grant.&lt;/p&gt;

&lt;p&gt;I've done this exercise with teams, and it's usually revelatory. They discover that a documentation deployment workflow has &lt;code&gt;s3:*&lt;/code&gt; permissions because someone needed to upload to a specific bucket and just granted everything. Or that the infrastructure provisioning workflow can assume roles in accounts the team didn't even know existed.&lt;/p&gt;

&lt;p&gt;Once you map the blast radius, you can start reducing it. Scope permissions to specific resources. Remove unnecessary cross-account access. Segment workflows so that each one has the minimum authority required for its specific purpose.&lt;/p&gt;

&lt;p&gt;This takes time. It requires understanding the actual operations each workflow performs, not just guessing. But it converts implicit risk into explicit design decisions.&lt;/p&gt;
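&lt;p&gt;Even a crude script makes the exercise concrete. The sketch below makes no AWS calls and uses a made-up policy; it just flags the wildcard grants that usually dominate a pipeline role's blast radius:&lt;/p&gt;

```python
# Sketch: surface the riskiest grants in a pipeline role's policy documents.
# Input mirrors the IAM policy JSON shape; the sample policy is hypothetical.

def wildcard_grants(policy):
    """Return every Allow action in an IAM-style policy that ends in '*'."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # IAM permits a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):   # IAM permits a single action string
            actions = [actions]
        findings.extend(a for a in actions if a.endswith("*"))
    return findings

# A hypothetical docs-deployment policy that accreted too much authority.
docs_policy = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
print(wildcard_grants(docs_policy))  # → ['s3:*']
```

&lt;p&gt;Feed it the output of your cloud provider's policy listing APIs and every &lt;code&gt;service:*&lt;/code&gt; it prints is a question someone should be able to answer.&lt;/p&gt;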

&lt;p&gt;&lt;strong&gt;GitOps Complicates the Picture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitOps models push even more authority into pipelines. Merged code becomes declarative truth. Controllers reconcile desired state continuously. Infrastructure updates automatically based on repository contents.&lt;/p&gt;

&lt;p&gt;This increases safety in some ways—every change is auditable, versioned, reviewable. But it also means the identity that reconciles state becomes deeply privileged. Whether that identity lives in a CI platform or a cluster controller like ArgoCD or Flux, it is a powerful delegated actor.&lt;/p&gt;

&lt;p&gt;GitOps controllers typically need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read access to Git repositories to fetch desired state.&lt;/li&gt;
&lt;li&gt;Broad permissions in the target environment to apply changes.&lt;/li&gt;
&lt;li&gt;The ability to create, modify, and delete resources continuously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you compromise the GitOps controller—or the repository it watches—you control the entire declarative infrastructure. The controller will helpfully reconcile whatever malicious configuration you commit, because that's its job.&lt;/p&gt;

&lt;p&gt;I've seen organizations treat their GitOps repositories with less security than they treat production credentials, because "it's just configuration files." But those configuration files define production state, and the controller that applies them has the authority to reconfigure everything.&lt;/p&gt;

&lt;p&gt;GitOps doesn't eliminate the pipeline identity problem. It relocates it into a continuously running reconciliation loop.&lt;/p&gt;
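&lt;p&gt;If you run ArgoCD, the &lt;code&gt;AppProject&lt;/code&gt; resource is where that relocated authority can be fenced in. A hedged example, with project, repository, and namespace names as placeholders, that pins the controller to one repo, one namespace, and no cluster-scoped resources:&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: payments
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/org/payments-config.git  # only this repo is reconciled
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments                          # only this namespace
  clusterResourceWhitelist: []                     # no cluster-scoped resources at all
```

&lt;p&gt;Compromising the watched repository now yields authority over one namespace, not the cluster. The reconciliation loop still runs; its blast radius is declared.&lt;/p&gt;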

&lt;p&gt;&lt;strong&gt;The Architectural Reframe&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security discussions often focus on users, services, APIs. Pipelines sit in the middle, serving multiple roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Users of cloud APIs&lt;/strong&gt;: they authenticate, assume roles, invoke operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Issuers of deployments&lt;/strong&gt;: they publish artifacts, trigger updates, propagate changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publishers of artifacts&lt;/strong&gt;: they write to registries, storage buckets, CDN origins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforcers of infrastructure state&lt;/strong&gt;: they apply Terraform, Helm charts, Kubernetes manifests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are not background tooling. They are privileged actors with multi-faceted authority.&lt;/p&gt;

&lt;p&gt;If you don't model them as such, your system has an undocumented superuser. One that runs arbitrary code. One that often has broader permissions than your actual administrators. One that's reachable through supply chain attacks, insider threats, or misconfigured repository permissions.&lt;/p&gt;

&lt;p&gt;The reframe: every deployment workflow is an identity. It needs an identity profile, a privilege scope, monitoring, incident response procedures. When you grant permissions to a pipeline, you're granting them to every developer who can modify the code that pipeline runs, every dependency that code imports, every CI plugin that workflow uses.&lt;/p&gt;

&lt;p&gt;That's a much broader trust boundary than most teams acknowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In modern infrastructure, humans design change. Pipelines enact it.&lt;/p&gt;

&lt;p&gt;That makes pipelines identity systems. And identity systems require explicit privilege design, continuous monitoring, scoped authority, architectural ownership.&lt;/p&gt;

&lt;p&gt;If your deployment workflow can reconfigure production, it is not "just CI." It is a privileged identity. Treat it like one.&lt;/p&gt;

&lt;p&gt;That means: audit its permissions like you'd audit a superuser account. Monitor its behavior like you'd monitor administrative access. Scope its authority like you'd scope a service principal. Respond to anomalies like you'd respond to suspicious login attempts.&lt;/p&gt;

&lt;p&gt;And when something breaks—because in complex systems, things always eventually break—you'll understand exactly what authority was exercised, by which identity, under what context. You won't be searching through logs trying to figure out how your infrastructure got reconfigured without any human touching it.&lt;/p&gt;

&lt;p&gt;You'll know. Because you modeled the pipeline as what it actually is: a synthetic superuser with delegated authority to change production systems.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>cicd</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Secrets in Pipelines Are an Architectural Smell</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Sat, 14 Feb 2026 09:01:15 +0000</pubDate>
      <link>https://forem.com/iyanu_david/secrets-in-pipelines-are-an-architectural-smell-3jnl</link>
      <guid>https://forem.com/iyanu_david/secrets-in-pipelines-are-an-architectural-smell-3jnl</guid>
      <description>&lt;p&gt;Modern CI/CD pipelines are powerful. They build software, provision infrastructure, deploy production systems, promote artifacts across environments. And almost every pipeline relies on one thing to function: secrets.&lt;/p&gt;

&lt;p&gt;API keys. Cloud credentials. Registry tokens. Signing keys. Database passwords.&lt;/p&gt;

&lt;p&gt;We treat secret injection as normal pipeline design.&lt;br&gt;
But that normalization obscures a deeper issue—secrets in pipelines aren't just a security concern. They're often an architectural smell, a visible symptom of invisible design debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Pipelines Devour Secrets in the First Place&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pipelines require authority to act. They must push images to registries, deploy infrastructure, access cloud APIs, publish packages, run migrations, configure environments. Historically, the simplest solution was storing credentials as environment variables or secret store entries. Inject the secret at runtime. Let automation proceed. Problem solved.&lt;/p&gt;

&lt;p&gt;Except it isn't.&lt;/p&gt;

&lt;p&gt;The pattern feels clean—declarative YAML, a reference to &lt;code&gt;${{ secrets.AWS_ACCESS_KEY }}&lt;/code&gt;, execution proceeds without friction. But beneath that syntax lives a trust model we've stopped interrogating. Every secret injected into a pipeline runtime becomes ambient authority, accessible not just to the workflow logic you authored but to everything that workflow transitively executes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets Expand Invisible Trust Boundaries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a pipeline receives credentials, the following instantly gain access:&lt;/p&gt;

&lt;p&gt;The runner environment. All executed scripts. Dependencies fetched during build. Third-party tooling invoked by automation—Terraform providers, Helm, kubectl, language-specific package managers. The secret is no longer owned by a system. It is temporarily owned by everything the pipeline executes.&lt;/p&gt;

&lt;p&gt;And pipelines execute a lot.&lt;/p&gt;

&lt;p&gt;Consider a typical Node.js deployment workflow. You install dependencies via &lt;code&gt;npm ci&lt;/code&gt;. Somewhere in that dependency graph—maybe five transitive layers deep—sits a package you've never audited. It runs a postinstall script. That script executes in the same runtime context as your deployment logic. It can read &lt;code&gt;process.env.AWS_SECRET_ACCESS_KEY&lt;/code&gt; just as easily as your intended code can.&lt;/p&gt;

&lt;p&gt;No privilege escalation required. No sophisticated exploit. Just ordinary execution in a trusted context.&lt;/p&gt;
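&lt;p&gt;It really is that mundane. This sketch (the variable names are chosen for illustration) is everything a hostile install hook needs; swap the &lt;code&gt;print&lt;/code&gt; for an HTTP POST and you have an exfiltration primitive:&lt;/p&gt;

```python
# Ordinary code, ordinary permissions. Anything that executes inside the
# CI job, including a transitive dependency's install hook, can do this.
import os

def what_a_postinstall_hook_could_see():
    # Hypothetical probe: real malware would ship these values off-box.
    interesting = ("AWS_SECRET_ACCESS_KEY", "GITHUB_TOKEN", "NPM_TOKEN")
    return {name: os.environ.get(name) for name in interesting}

print(what_a_postinstall_hook_could_see())
```

&lt;p&gt;Nothing here trips a scanner. It is indistinguishable from legitimate tooling reading its own configuration.&lt;/p&gt;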

&lt;p&gt;&lt;strong&gt;The Environment Variable Illusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many teams believe secrets are safe because they're "masked." GitHub Actions, GitLab CI, CircleCI—they all redact secret values in logs. But masking protects logs, not execution.&lt;/p&gt;

&lt;p&gt;Any process running inside the pipeline can still read environment variables, export them elsewhere, send them over network requests, write them into build artifacts. A compromised dependency doesn't need privileged filesystem access or container escape capabilities. It only needs runtime execution, which you've already granted by including it in your workflow.&lt;/p&gt;

&lt;p&gt;The illusion of safety comes from visibility controls—we can't see the secret in logs, therefore it must be protected. But visibility and access are orthogonal concerns. The secret remains in-memory, readable by any code that runs. Logging controls stop shoulder-surfing; they don't stop exfiltration.&lt;/p&gt;

&lt;p&gt;I've debugged incidents where secrets appeared in Docker layer caches because a build step echoed environment state for debugging. The logs were clean. The layer cache was not. Masking is theater when the trust boundary includes untrusted code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation Multiplies Privilege&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Humans typically operate under constrained permissions. Pipelines often do not.&lt;/p&gt;

&lt;p&gt;To reduce friction—because blocked deployments at 3 AM generate escalations—pipelines are granted broad authority: cross-environment deployment rights, infrastructure modification permissions, registry publishing access, secrets manager read capability. The result is privilege concentration. One pipeline identity may hold more operational authority than an entire engineering team.&lt;/p&gt;

&lt;p&gt;Automation becomes a superuser.&lt;/p&gt;

&lt;p&gt;This isn't irrational. When a deployment fails because the pipeline lacked a specific IAM permission, the quickest fix is expanding the policy. Do that fifty times over eighteen months and you've built a godmode service account that can create S3 buckets, modify DNS, deploy Kubernetes workloads, and rotate database credentials. Individually reasonable decisions compound into systemic risk.&lt;/p&gt;

&lt;p&gt;The uncomfortable truth: we grant pipelines elevated privilege because we trust them less than humans to handle friction. A human can escalate, negotiate, file a ticket. A pipeline just fails. So we remove obstacles preemptively by removing constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets Create Persistence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Secrets introduce something modern infrastructure tries to eliminate: persistent trust.&lt;/p&gt;

&lt;p&gt;Even when runners are ephemeral—GitHub-hosted runners destroyed after each job, self-hosted runners cycling through fresh containers—secrets often are not. They may live indefinitely in secret managers, be reused across workflows, exist across multiple repositories, remain valid long after their purpose ends.&lt;/p&gt;

&lt;p&gt;If leaked once, they remain exploitable until rotated. And rotation is rarely automated well.&lt;/p&gt;

&lt;p&gt;I've seen AWS access keys that outlived the services they originally provisioned. They were created during a migration three years prior, stored in a secret manager, referenced in a workflow that ran twice and then got archived. Nobody rotated them because the workflow didn't fail. The keys accumulated implicit authority as IAM policies evolved—originally scoped to S3 and EC2, they eventually inherited permissions for Lambda and RDS through policy inheritance nobody audited.&lt;/p&gt;

&lt;p&gt;Static credentials age like code comments—they describe an intention that's no longer accurate, but they still execute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Real Problem: Identity vs Credential Design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Secrets exist because pipelines authenticate using credentials instead of identity.&lt;/p&gt;

&lt;p&gt;Credentials answer: "What password proves I'm allowed?"&lt;br&gt;
Identity answers: "Who am I right now, and what am I allowed to do?"&lt;/p&gt;

&lt;p&gt;Modern cloud systems increasingly support identity federation—short-lived tokens, workload identity, OIDC-based authentication, role assumption with expiration. These approaches remove stored secrets entirely. The pipeline proves identity dynamically rather than presenting stored credentials.&lt;/p&gt;

&lt;p&gt;GitHub Actions can authenticate to AWS via OIDC without ever possessing an AWS secret. The workflow requests a token from GitHub's OIDC provider, AWS validates that token against a configured trust relationship, and returns temporary credentials scoped to the specific job. Validity window: minutes. Reusability: none. Exfiltration value: minimal.&lt;/p&gt;

&lt;p&gt;But this requires thinking about pipelines as principals, not as scripts that happen to need keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Secret Injection Persists&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If better models exist, why do secrets remain everywhere?&lt;/p&gt;

&lt;p&gt;Because secrets optimize for convenience. They are easy to configure, tool-agnostic, backward compatible, predictable. You can copy a secret from 1Password into GitHub's secret UI and have a working deployment in three minutes. Identity-based systems require architectural thinking—trust relationships, IAM role mappings, OIDC issuer configuration, policy scoping.&lt;/p&gt;

&lt;p&gt;Secrets require copy-paste. Speed wins in early design phases. Risk appears later, often in a postmortem titled "How we got breached via a compromised build dependency."&lt;/p&gt;

&lt;p&gt;There's also a tooling gap. Not every system supports workload identity. Legacy on-prem services, certain SaaS APIs, third-party registries—they still expect HTTP Basic auth or static tokens. So we build hybrid systems where half the infrastructure uses federated identity and half uses long-lived secrets, and the complexity of managing both models incentivizes defaulting to the simpler (more dangerous) one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hidden Failure Mode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most teams ask: Are our secrets stored securely?&lt;/p&gt;

&lt;p&gt;The better question is: Why does this pipeline need a long-lived secret at all?&lt;/p&gt;

&lt;p&gt;If a pipeline needs permanent credentials, it often indicates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing workload identity design&lt;/li&gt;
&lt;li&gt;Overly broad deployment authority&lt;/li&gt;
&lt;li&gt;Lack of environment segmentation&lt;/li&gt;
&lt;li&gt;Legacy authentication assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The secret is not the root problem. It's a symptom of an architecture that hasn't embraced identity-native design. You're treating pipelines like cron jobs running on a server somewhere—stateful, persistent, requiring ambient credentials. But modern pipelines are ephemeral workloads. They should authenticate like services do: with provable identity and just-in-time permissions.&lt;/p&gt;

&lt;p&gt;When you see &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; in a workflow file, you're looking at a design decision that optimized for immediate functionality over long-term security posture. That's not always wrong—startups have different risk profiles than regulated enterprises—but it's worth naming explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Secrets Become Attack Paths&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pipeline compromises frequently follow this pattern:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; Malicious dependency executes during build&lt;br&gt;
&lt;strong&gt;2.&lt;/strong&gt; Environment variables are accessed (&lt;code&gt;process.env&lt;/code&gt;, &lt;code&gt;os.environ&lt;/code&gt;, &lt;code&gt;ENV&lt;/code&gt;)&lt;br&gt;
&lt;strong&gt;3.&lt;/strong&gt; Tokens are exfiltrated to attacker-controlled infrastructure&lt;br&gt;
&lt;strong&gt;4.&lt;/strong&gt; External systems are accessed using valid credentials&lt;br&gt;
&lt;strong&gt;5.&lt;/strong&gt; Attackers move laterally using trusted automation identity&lt;/p&gt;

&lt;p&gt;No exploit required. Just inherited trust.&lt;/p&gt;

&lt;p&gt;The SolarWinds breach followed roughly this pattern—compromise the build environment, inject malicious code, distribute it via trusted update mechanisms. The build pipeline wasn't breached through sophisticated zero-day exploits. It was breached because it had authority to sign and distribute software, and that authority was accessible to anything executing within the build context.&lt;/p&gt;

&lt;p&gt;Supply chain attacks work because we've built systems where build tooling inherits operational authority. Your deployment pipeline can probably deploy to production. Therefore, any code that runs inside that pipeline can also deploy to production, assuming it can access the same credentials.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical. The &lt;code&gt;event-stream&lt;/code&gt; incident, &lt;code&gt;node-ipc&lt;/code&gt;, &lt;code&gt;colors&lt;/code&gt; and &lt;code&gt;faker&lt;/code&gt;, &lt;code&gt;ua-parser-js&lt;/code&gt;—all involved malicious code in dependencies exfiltrating data or attempting lateral movement. The attack surface isn't the pipeline platform. It's the trust model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Designing Pipelines Without Secrets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Eliminating secrets doesn't mean eliminating authentication. It means changing the model from "what do I know?" to "who am I?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Federated Identity (OIDC)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Allow pipelines to request short-lived credentials directly from cloud providers. No stored keys. No reusable tokens.&lt;/p&gt;

&lt;p&gt;GitHub Actions to AWS:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yaml
permissions:
  id-token: write
  contents: read

- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsDeployRole
    aws-region: us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; stored anywhere. The workflow gets a token from GitHub's OIDC provider, presents it to AWS, assumes a role scoped to exactly what that workflow needs. Validity: 1 hour maximum. After job completion, credentials expire automatically.&lt;/p&gt;

&lt;p&gt;This works for GCP (Workload Identity Federation), Azure (Managed Identities, federated credentials), HashiCorp Vault (JWT auth), and increasingly for third-party services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefer Role Assumption Over Static Access Keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even if you must use AWS access keys somewhere, don't grant them direct permissions. Grant them permission to assume roles with expiration windows measured in minutes.&lt;/p&gt;

&lt;p&gt;The key itself becomes a minimally-privileged bootstrap credential. Actual operational authority is time-bound.&lt;/p&gt;
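&lt;p&gt;A sketch of what that looks like with the AWS CLI (role name and account ID are placeholders)—the static key can do nothing except assume the role, and the session it gets back is measured in minutes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The static key's only permission is sts:AssumeRole.
# Operational authority lives on the role, time-bound per session.
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/DeployRole \
  --role-session-name ci-deploy \
  --duration-seconds 900
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;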

&lt;p&gt;&lt;strong&gt;Scope Authority Per Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A testing workflow should never hold deployment permissions. Separate identities by purpose.&lt;/p&gt;

&lt;p&gt;In practice: different IAM roles or GCP service accounts per workflow type. Test workflows can read test infrastructure and write test results. Deployment workflows can write production infrastructure. Neither should hold both capabilities.&lt;/p&gt;
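&lt;p&gt;A sketch of that separation in GitHub Actions terms (role names and region are illustrative, not a prescription):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/workflows/test.yml — reads test infra, writes test results
jobs:
  test:
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/TestRole
          aws-region: us-east-1

# .github/workflows/deploy.yml — writes production infra, nothing test-side
jobs:
  deploy:
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ProdDeployRole
          aws-region: us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;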

&lt;p&gt;This is tedious to configure initially—more YAML, more IAM policies, more cognitive overhead. But it means a compromised test dependency can't deploy malicious code to production. The blast radius is pre-constrained by design, not by hoping the test suite doesn't run untrusted code (it does).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remove Cross-Environment Credentials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Development pipelines should not deploy production infrastructure. Period.&lt;/p&gt;

&lt;p&gt;If your staging deployment workflow holds credentials for production, you've created an unnecessary privilege escalation path. Compromising staging shouldn't compromise production, but if the credentials are shared, it does.&lt;/p&gt;

&lt;p&gt;Physical segmentation matters. Different AWS accounts, different GCP projects, different Azure subscriptions—with trust boundaries enforced at the cloud provider level, not just within your CI/CD configuration.&lt;/p&gt;
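&lt;p&gt;With OIDC, the trust boundary can be written into the role's trust policy itself. A minimal sketch (account ID, org, and repo are placeholders): the production role is only assumable by workflows running on &lt;code&gt;main&lt;/code&gt; in one specific repository, so staging credentials simply cannot satisfy the condition:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:example-org/example-app:ref:refs/heads/main"
      }
    }
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;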

&lt;p&gt;&lt;strong&gt;Treat Pipelines as Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apply the same identity principles used for services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Least privilege&lt;/li&gt;
&lt;li&gt;Short-lived credentials&lt;/li&gt;
&lt;li&gt;Explicit trust relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've normalized the idea that a Kubernetes pod shouldn't run as root, shouldn't mount the Docker socket, shouldn't have cluster-admin. We apply least-privilege thinking to containerized workloads.&lt;/p&gt;

&lt;p&gt;Pipelines are workloads. They execute code in a runtime environment. They access external systems. The same principles apply—we've just been slower to adopt them because pipelines feel like infrastructure tooling rather than attack surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Architectural Reframe&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Secrets in pipelines feel normal because pipelines were originally tooling—Jenkins running on a server somewhere, TeamCity with persistent agents, stored credentials managed like any other application config.&lt;/p&gt;

&lt;p&gt;But pipelines are now infrastructure. They're ephemeral compute workloads with identity, execution context, and blast radius. Infrastructure should not rely on shared passwords. It should rely on verifiable identity.&lt;/p&gt;

&lt;p&gt;When secrets disappear from pipelines, something important happens: trust becomes observable and temporary. You can audit who accessed what by examining CloudTrail logs of role assumptions, not by trying to trace where a static key might have been copied. You can enforce expiration automatically because credentials are issued per-job, not stored indefinitely. You can scope permissions narrowly because you're not trying to accommodate every possible workflow with a single set of credentials.&lt;/p&gt;

&lt;p&gt;That's architectural maturity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to Change Monday Morning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're maintaining pipelines that rely on secrets, here's where I'd start:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Audit existing secrets.&lt;/strong&gt; How many live in your CI/CD platform? How old are they? When were they last rotated? Which workflows use them? You can't improve what you haven't measured.&lt;br&gt;
&lt;strong&gt;2. Identify the lowest-risk migration candidate.&lt;/strong&gt; Don't start with production deployment. Start with something like "deploy to staging" or "publish test coverage reports." Lower stakes, same pattern.&lt;br&gt;
&lt;strong&gt;3. Configure OIDC trust relationships.&lt;/strong&gt; GitHub, GitLab, and Bitbucket all support OIDC now. AWS, GCP, and Azure all accept OIDC tokens. The documentation exists. It's not trivial, but it's not exotic anymore either.&lt;br&gt;
&lt;strong&gt;4. Migrate one workflow completely.&lt;/strong&gt; Remove the secret, configure workload identity, verify it works. You'll learn about edge cases—token refresh, permission boundaries, error messages that don't quite explain what's wrong.&lt;br&gt;
&lt;strong&gt;5. Document the pattern.&lt;/strong&gt; Not for compliance theater—for the next engineer who needs to add a workflow. If the identity-based approach is harder to copy-paste than "add secret, reference secret," the team will default to secrets.&lt;br&gt;
&lt;strong&gt;6. Repeat.&lt;/strong&gt; Incremental migration beats grand architecture plans that never ship.&lt;/p&gt;
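&lt;p&gt;For step 1, if you're on GitHub, the inventory can be scripted—the API exposes secret names and last-updated timestamps (never values); the org and repo names below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Repository secrets: name plus when each was last rotated
gh api repos/example-org/example-app/actions/secrets \
  --jq '.secrets[] | "\(.name)\t\(.updated_at)"'

# Don't forget organization-level secrets
gh api orgs/example-org/actions/secrets --jq '.secrets[].name'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;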

&lt;p&gt;The goal isn't perfection. It's reducing the number of persistent credentials that live in CI/CD systems, because every one represents latent risk that compounds over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Secrets are comfortable because they're familiar. But familiarity is not safety.&lt;/p&gt;

&lt;p&gt;Every injected credential represents inherited trust—trust extended not just to the workflow logic you wrote, but to every transitive dependency, every script invoked, every tool executed within that runtime context. And inherited trust is where modern compromises begin.&lt;/p&gt;

&lt;p&gt;Supply chain attacks work because we've built systems where execution implies authority. If your code runs in the pipeline, it can access pipeline credentials. That's not a bug in the tooling. It's a consequence of the trust model.&lt;/p&gt;

&lt;p&gt;If your pipeline needs secrets to operate, the question is not how to store them better—encrypted at rest, rotated quarterly, access-logged. The question is why they exist at all. What architectural decision led to a design where persistent credentials felt necessary? Can that decision be revisited?&lt;/p&gt;

&lt;p&gt;Sometimes the answer is "no"—legacy systems exist, third-party APIs have constraints, regulatory requirements impose strange boundaries. But often the answer is "we optimized for speed when we built this, and we never revisited the design."&lt;/p&gt;

&lt;p&gt;Secrets in pipelines are an architectural smell. Not because they're always wrong, but because they're often a symptom of something deeper—a system that hasn't caught up to modern identity-native design, where trust is temporary and provable rather than persistent and ambient.&lt;/p&gt;

&lt;p&gt;That's the conversation worth having.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>architecture</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Build Systems Have More Power Than Production</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Thu, 12 Feb 2026 08:32:48 +0000</pubDate>
      <link>https://forem.com/iyanu_david/build-systems-have-more-power-than-production-4ol</link>
      <guid>https://forem.com/iyanu_david/build-systems-have-more-power-than-production-4ol</guid>
      <description>&lt;p&gt;If CI/CD is a control plane, then the build system is its forge. And the forge decides what becomes reality.&lt;/p&gt;

&lt;p&gt;We obsess over production security—runtime policies, network segmentation, Zero Trust architectures, container isolation, service mesh controls. The infrastructure that serves traffic gets the scrutiny. The monitoring. The war rooms when things break.&lt;/p&gt;

&lt;p&gt;But the build system?&lt;/p&gt;

&lt;p&gt;It often has more authority than production itself. And attackers know it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Power Differential Nobody Talks About&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Production systems are constrained by design. Services run with scoped identities. IAM roles follow least privilege—at least in theory. Network boundaries are defined, even if they're occasionally porous. Observability has matured enough that you can usually reconstruct what happened, even if you can't always prevent it. Runtime detection exists, though its efficacy varies wildly based on how much signal you're willing to drown in.&lt;/p&gt;

&lt;p&gt;Build systems operate differently.&lt;/p&gt;

&lt;p&gt;They execute arbitrary code from repositories—code that hasn't been vetted yet, that's the entire point. They pull dependencies from public registries where typosquatting and namespace confusion are facts of life, not edge cases. They hold artifact signing keys, often in environment variables or mounted volumes, because that's the path of least friction. They access secrets for packaging, for publishing, for promoting artifacts across environments that span development, staging, and production.&lt;/p&gt;

&lt;p&gt;Production consumes artifacts. Build systems create them.&lt;/p&gt;

&lt;p&gt;Creation power exceeds runtime power. Always.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Can Only Run What Build Produces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the asymmetry: production doesn't decide what it runs. The build pipeline does.&lt;/p&gt;

&lt;p&gt;If an attacker compromises production, they control one environment. Maybe they pivot to adjacent services if your lateral movement controls are weak. Maybe they exfiltrate data. It's bad. You declare an incident, page the team, start containment.&lt;/p&gt;

&lt;p&gt;If they compromise the build system, they control every future deployment. Every downstream environment. Every customer update. Every signed artifact that carries your organization's cryptographic blessing.&lt;/p&gt;

&lt;p&gt;That's not an incident. That's a generational compromise.&lt;/p&gt;

&lt;p&gt;The SolarWinds attack demonstrated this with surgical clarity. Compromise the build process, inject malicious code into signed releases, distribute at scale through trusted update mechanisms. By the time defenders noticed, the payload had reached 18,000 organizations. Production security didn't matter. The artifacts themselves were poisoned at the source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build Systems Execute Untrusted Code by Design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CI systems exist to run code that is not yet trusted. That's their function. Every pull request—from employees, from contractors, sometimes from external contributors—triggers dependency installation, script execution, compilation, testing, packaging.&lt;/p&gt;

&lt;p&gt;Which means third-party packages run inside your build environment. Preinstall and postinstall scripts execute with whatever privileges the build runner has. Toolchains fetch and execute remote binaries because modern development requires it. Container base images are pulled dynamically from registries you don't control.&lt;/p&gt;

&lt;p&gt;A single malicious dependency can exfiltrate environment variables. Access injected credentials. Modify artifacts in ways that survive code review because the review happens before the build, not after. Alter build outputs subtly enough that static analysis misses it but the runtime payload activates exactly when designed.&lt;/p&gt;

&lt;p&gt;The Codecov breach in 2021 showed how this works in practice. Attackers modified a Bash Uploader script used in CI pipelines. The script extracted environment variables—including credentials—from build environments and sent them to an attacker-controlled server. For months. Across hundreds of customer networks. The build system was the vector. Everything else was downstream consequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Signing Authority Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Artifact signing is meant to create trust. Cryptographic proof that this binary, this container image, this deployment package came from your organization and hasn't been tampered with.&lt;/p&gt;

&lt;p&gt;But who controls the signing keys?&lt;/p&gt;

&lt;p&gt;Often: the build system.&lt;/p&gt;

&lt;p&gt;Which means if an attacker controls the build, they control trust itself. They sign malicious artifacts with your keys. Your infrastructure validates those signatures and deploys with confidence. Runtime security sees valid signatures and assumes safety. The entire trust chain is predicated on build integrity, and build integrity is frequently assumed rather than enforced.&lt;/p&gt;

&lt;p&gt;SLSA—Supply-chain Levels for Software Artifacts—emerged specifically to address this gap. It defines graduated levels of build provenance, from "no guarantees" to "signed provenance from hardened, isolated build platforms." Sigstore provides the cryptographic infrastructure for verifiable signing. These frameworks exist because the industry recognized that signing without build integrity is theater. A performance that creates the appearance of security while leaving the actual attack surface unaddressed.&lt;/p&gt;

&lt;p&gt;Most organizations operate at SLSA Level 1 or below. They sign things. They don't verify the build environment that produced those things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Is Observable. Build Is Often Not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Production environments typically have centralized logging. Alerting pipelines that page people at 3 AM. Runtime monitoring that tracks anomalies, even if half of it is tuned to reduce noise. Incident response playbooks, varying in quality but at least documented. When something breaks in production, you have forensic data. You can reconstruct the timeline.&lt;/p&gt;

&lt;p&gt;Build systems often have logs no one reviews. Ephemeral runners that self-destruct after each job, taking their filesystem state with them. Shared infrastructure where multiple teams' builds execute on the same underlying compute, separated by assumptions about container isolation. Minimal anomaly detection because "builds are supposed to do weird things." Limited forensic retention because storage is expensive and who's really going to investigate a build unless it fails?&lt;/p&gt;

&lt;p&gt;Yet the build system can alter everything production becomes.&lt;/p&gt;

&lt;p&gt;We instrument runtime heavily. We often under-instrument artifact creation. The thing that determines what runs gets less observability than the thing that runs it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Ephemeral Fallacy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many teams assume ephemeral runners are inherently safer. Fresh compute for every job. No persistent state. What could go wrong?&lt;/p&gt;

&lt;p&gt;Ephemeral doesn't mean isolated.&lt;/p&gt;

&lt;p&gt;It doesn't mean credential scope is limited. It doesn't mean network egress is restricted. It doesn't mean artifact outputs are verified. It doesn't mean dependencies are trustworthy or even checksummed against known-good hashes.&lt;/p&gt;

&lt;p&gt;Short-lived infrastructure with broad privilege is still broad privilege. The runner exists for fifteen minutes, but in those fifteen minutes it has access to your container registry, your artifact storage, your signing keys, your cloud provider credentials, your internal APIs.&lt;/p&gt;

&lt;p&gt;The CircleCI security incident in January 2023 illustrated this perfectly. Attackers gained access to encryption keys used to protect customer secrets stored in CircleCI's environment variable system. Those secrets—API tokens, cloud credentials, database passwords—were intended for ephemeral runners. But the runners needed access to them, which means the secrets had to be retrievable, which means they became a target. Ephemeral execution didn't protect against persistent credential theft.&lt;/p&gt;

&lt;p&gt;If your build environment can pull secrets, it can leak secrets. Duration doesn't change that equation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supply Chain Attacks Scale Differently&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional breach pattern: compromise a server, move laterally through the network, escalate privileges, establish persistence, exfiltrate data or deploy ransomware.&lt;/p&gt;

&lt;p&gt;Supply chain breach pattern: compromise the build, inject into an artifact, let the organization's own deployment automation distribute your payload.&lt;/p&gt;

&lt;p&gt;The second scales faster. And it bypasses runtime defenses entirely.&lt;/p&gt;

&lt;p&gt;When you compromise production, defenders can isolate the affected systems, rotate credentials, rebuild from known-good images. When you compromise the build, defenders have to question every artifact produced since the compromise began. Which deployments are safe? Which container images are clean? How far back do we roll? Do we even know when the compromise started?&lt;/p&gt;

&lt;p&gt;The forensic problem becomes exponentially harder because the attack happened upstream of where you instrument. Your runtime logs are clean. Your network monitoring shows normal traffic. Everything looks fine because the malicious code was baked in before it reached production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Question That Changes the Conversation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of asking "Is our production hardened?" ask: "Can our build environment publish something malicious without being detected?"&lt;/p&gt;

&lt;p&gt;If the answer is yes—and for most organizations it is—your security posture is incomplete. You've fortified the castle while leaving the weapon forge unguarded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Designing Build Systems as High-Privilege Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If build systems hold creation authority, they require architectural intent. Not best practices applied as afterthoughts. Not security bolted on when compliance demands it. Intent from the beginning.&lt;/p&gt;

&lt;p&gt;That means minimizing credential scope. Use short-lived identities instead of stored secrets. OIDC tokens from your CI provider to your cloud platform, bound to specific repositories and branches. Credentials that exist for the duration of a job and self-revoke, not API keys that live in environment variable configuration for years.&lt;/p&gt;

&lt;p&gt;Reduce network access during builds. If your build doesn't need to call external APIs, block egress. If it needs specific dependencies, allowlist those registries and reject everything else. Defense in depth assumes compromise; network segmentation limits what an attacker can do post-compromise.&lt;/p&gt;
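&lt;p&gt;If builds run as pods on Kubernetes, that allowlist can be expressed with a standard NetworkPolicy. A minimal sketch, assuming build pods carry a &lt;code&gt;role: build&lt;/code&gt; label and the internal registry lives in 10.0.0.0/16:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: build-egress-allowlist
spec:
  podSelector:
    matchLabels:
      role: build
  policyTypes:
    - Egress
  egress:
    # DNS to anywhere
    - ports:
        - protocol: UDP
          port: 53
    # Internal registry range; all other egress is dropped
    - to:
        - ipBlock:
            cidr: 10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;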

&lt;p&gt;Separate build execution from signing authority. Don't let the same compute that runs arbitrary code from pull requests also hold the keys that sign production artifacts. Use isolated signing infrastructure that receives artifact hashes from builds and returns signatures, never exposing key material to the build environment itself.&lt;/p&gt;

&lt;p&gt;Generate verifiable artifact provenance. Use in-toto or SLSA attestations that capture what was built, where, from what source, using what dependencies. Make the provenance unforgeable and independently verifiable. When something breaks, you need to know what you deployed. When something is compromised, you need to know what to untrust.&lt;/p&gt;
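&lt;p&gt;On GitHub Actions, one low-ceremony way to get signed provenance is the first-party attestation action (available at the time of writing; the artifact path is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;permissions:
  id-token: write      # identity for Sigstore signing
  attestations: write  # store the provenance
  contents: read

steps:
  - uses: actions/attest-build-provenance@v1
    with:
      subject-path: dist/app.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;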

&lt;p&gt;Document blast radius explicitly. If this build system is compromised, what can an attacker reach? Which secrets? Which networks? Which downstream systems? Threat modeling isn't about paranoia; it's about honest accounting of what's actually exposed.&lt;/p&gt;

&lt;p&gt;Build systems should be treated as production-grade infrastructure because they decide what production becomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Closing Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Production systems enforce trust. Build systems define it.&lt;/p&gt;

&lt;p&gt;If you control production, you control an environment. If you control the build, you control the future of every environment.&lt;/p&gt;

&lt;p&gt;That's more power. And power requires architecture.&lt;/p&gt;

&lt;p&gt;Not eventually. Not when you have time. Monday morning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.cisa.gov/news-events/cybersecurity-advisories/aa20-352a" rel="noopener noreferrer"&gt;CISA Alert AA20-352A — SolarWinds Supply Chain Compromise&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://about.codecov.io/security-update/" rel="noopener noreferrer"&gt;Codecov Security Incident Report (2021)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://circleci.com/blog/january-4-2023-security-alert/" rel="noopener noreferrer"&gt;CircleCI Security Alert (January 2023)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://slsa.dev" rel="noopener noreferrer"&gt;SLSA Framework (Supply-chain Levels for Software Artifacts)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.sigstore.dev" rel="noopener noreferrer"&gt;Sigstore Project&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>cicd</category>
      <category>architecture</category>
    </item>
    <item>
      <title>CI/CD Is Not a Toolchain—It's a Control Plane</title>
      <dc:creator>Iyanu David</dc:creator>
      <pubDate>Wed, 11 Feb 2026 12:08:16 +0000</pubDate>
      <link>https://forem.com/iyanu_david/cicd-is-not-a-toolchain-its-a-control-plane-39mi</link>
      <guid>https://forem.com/iyanu_david/cicd-is-not-a-toolchain-its-a-control-plane-39mi</guid>
      <description>&lt;p&gt;For years, we treated CI/CD as delivery automation.&lt;/p&gt;

&lt;p&gt;A toolchain. A convenience layer. A faster path from commit to production.&lt;/p&gt;

&lt;p&gt;That framing is outdated—not because the tools changed, but because what they do changed, and we kept pretending they hadn't.&lt;/p&gt;

&lt;p&gt;Modern CI/CD systems don't just ship code. They provision infrastructure. They rotate secrets. They apply IAM policies, run database migrations, configure networking, trigger rollbacks, and deploy across multiple environments. They make decisions about what exists and what doesn't. They hold keys that unlock more doors than most engineers will ever touch.&lt;/p&gt;

&lt;p&gt;That's not a toolchain.&lt;/p&gt;

&lt;p&gt;That's a control plane.&lt;/p&gt;

&lt;p&gt;And we're still securing it like internal plumbing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Makes a Control Plane&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A control plane has three characteristics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; It can change system state&lt;br&gt;
&lt;strong&gt;2.&lt;/strong&gt; It holds privileged authority&lt;br&gt;
&lt;strong&gt;3.&lt;/strong&gt; It affects multiple environments&lt;/p&gt;

&lt;p&gt;Modern pipelines meet all three. When a pipeline runs, it doesn't just "build." It decides. It allocates compute. It writes firewall rules. It stamps certificates. It mutates the topology of systems that serve actual users.&lt;/p&gt;

&lt;p&gt;That authority often exceeds the permissions of individual engineers—by design. The pipeline needs to reach across boundaries that humans can't. It needs to touch production. It needs to rewrite DNS. It needs to push artifacts into registries that gate what runs where.&lt;/p&gt;

&lt;p&gt;And yet we talk about it as though it's just Jenkins with better syntax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Power Asymmetry Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In many organizations, engineers have scoped access. They can read logs from their service. They can deploy to staging. They can query metrics. Production? That requires approvals. IAM changes? Those go through a ticketing process. Cross-account modifications? Forget it.&lt;/p&gt;

&lt;p&gt;Services have bounded roles. An API server can write to its own database. It can call specific downstream dependencies. It can't touch S3 buckets it doesn't own. It can't assume roles in other accounts. The principle of least privilege is gospel.&lt;/p&gt;

&lt;p&gt;Production environments have layered controls. Network segmentation. Private subnets. Security groups. Bastion hosts. VPNs. You don't just SSH into prod anymore.&lt;/p&gt;

&lt;p&gt;But pipelines?&lt;/p&gt;

&lt;p&gt;They often hold cross-environment credentials, deployment authority, artifact signing keys, infrastructure modification rights. They can create load balancers, delete databases, rotate encryption keys, push container images, apply Terraform plans, update DNS records, modify IAM policies, and invalidate CDN caches.&lt;/p&gt;

&lt;p&gt;Why? Because pipelines need to "just work." Because friction in CI slows down shipping. Because nobody wants to manually approve every deployment step.&lt;/p&gt;

&lt;p&gt;The result is a power asymmetry: the automation layer has more authority than the humans it serves.&lt;/p&gt;

&lt;p&gt;That should make us uncomfortable.&lt;/p&gt;

&lt;p&gt;It doesn't—yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Trusted Runner" Is a Dangerous Phrase&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We casually refer to CI environments as trusted.&lt;/p&gt;

&lt;p&gt;But trusted by whom? Against what threat model?&lt;/p&gt;

&lt;p&gt;Modern pipelines run on third-party infrastructure—GitHub Actions, GitLab runners, CircleCI agents, and cloud-hosted build farms. They execute code from pull requests, often before any human review. They fetch remote dependencies from npm, PyPI, Maven Central, and Docker Hub—registries we don't control. They interact with SaaS APIs. They store cached artifacts across builds. They pull secrets from vaults that assume the runner ID is proof of identity.&lt;/p&gt;

&lt;p&gt;They are exposed to dependency poisoning, malicious forks, compromised tokens, lateral movement paths, and supply chain injection. A single poisoned npm package in a pre-build script can exfiltrate AWS credentials. A malicious PR can rewrite pipeline configuration to dump secrets. A compromised runner can pivot to internal services.&lt;/p&gt;

&lt;p&gt;Calling this environment "trusted" is less a statement of fact and more a leftover assumption from when CI ran in the basement on hardware we owned.&lt;/p&gt;

&lt;p&gt;That world is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipelines Now Define Blast Radius&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When incidents happen today, the blast radius often traces back to CI/CD.&lt;/p&gt;

&lt;p&gt;A compromised token pushes a poisoned artifact that propagates through staging, then production, across six services before anyone notices.&lt;/p&gt;

&lt;p&gt;A misconfigured pipeline applies IAM changes globally because the Terraform workspace wasn't parameterized correctly and nobody caught it in review.&lt;/p&gt;

&lt;p&gt;A broad secret—a set of AWS access keys attached to a policy scoped to &lt;code&gt;*&lt;/code&gt;—enables cross-service access because the pipeline needed to deploy "everything," and that was easier than per-service roles.&lt;/p&gt;

&lt;p&gt;An automated rollback reintroduces a vulnerability because the rollback logic doesn't check CVE databases; it just redeploys the last known-good SHA.&lt;/p&gt;

&lt;p&gt;The failure isn't in runtime code. Runtime code is sandboxed, logged, and monitored. It runs with constrained permissions. It's been through static analysis, dependency scanning, and peer review.&lt;/p&gt;

&lt;p&gt;The failure is in deployment authority.&lt;/p&gt;

&lt;p&gt;And deployment authority lives in the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Architectural Shift No One Acknowledged&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud-native architecture evolved. Microservices replaced monoliths. Infrastructure became code. Environments became ephemeral. We got better at blast radius containment—network policies, service meshes, zero-trust networking, and identity-aware proxies.&lt;/p&gt;

&lt;p&gt;But pipeline trust models didn't evolve at the same speed.&lt;/p&gt;

&lt;p&gt;We expanded their power without redesigning their boundaries. We gave them more keys without rethinking the locks. We automated more surfaces without segmenting the automation itself.&lt;/p&gt;

&lt;p&gt;A pipeline that deploys a single microservice now might also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply Kubernetes manifests via kubectl&lt;/li&gt;
&lt;li&gt;Provision cloud resources via Terraform or Pulumi&lt;/li&gt;
&lt;li&gt;Update service mesh policies&lt;/li&gt;
&lt;li&gt;Rotate database credentials&lt;/li&gt;
&lt;li&gt;Push container images to multiple registries&lt;/li&gt;
&lt;li&gt;Update feature flags in a remote config service&lt;/li&gt;
&lt;li&gt;Invalidate CDN caches&lt;/li&gt;
&lt;li&gt;Send deployment notifications to Slack, PagerDuty, Datadog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those actions requires credentials. Each of those credentials is a pivot point. Each pivot point is a potential compromise.&lt;/p&gt;

&lt;p&gt;That's the architectural gap. We designed resilient runtime systems. We forgot to design resilient deployment systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Question Teams Rarely Ask&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of asking, &lt;em&gt;Is our pipeline fast?&lt;/em&gt;, we should be asking:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If this pipeline were compromised, what could it change?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the honest answer is "almost everything," then the system is not segmented—it's automated.&lt;/p&gt;

&lt;p&gt;And automation without scoped authority is concentrated risk. It's a single choke point with god-mode privileges. It's a target.&lt;/p&gt;

&lt;p&gt;The calculus is simple: attackers don't need to compromise every service if they can compromise the thing that deploys every service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Practitioners Actually Do on Monday Morning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't theoretical. Here's what changes when you treat CI/CD as a control plane:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You scope pipeline permissions&lt;/strong&gt;. Not later. Now. Each pipeline gets exactly the permissions it needs to deploy its specific service. No wildcards. No &lt;code&gt;AdministratorAccess&lt;/code&gt;. No shared service accounts.&lt;/p&gt;
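&lt;p&gt;In GitHub Actions terms, the first concrete move is pinning the workflow token down—default it to nothing at the workflow level, then grant per job (a sketch, not a template):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Workflow level: GITHUB_TOKEN starts with no permissions at all
permissions: {}

jobs:
  deploy:
    permissions:
      contents: read   # enough to check out the repo
      id-token: write  # OIDC instead of stored cloud keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;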

&lt;p&gt;&lt;strong&gt;You separate build from deploy&lt;/strong&gt;. Build environments shouldn't hold deployment credentials. They shouldn't need them. Builds produce artifacts. Deployments consume artifacts. Different trust domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You gate production access&lt;/strong&gt;. Pipelines don't get automatic production deploy rights. They request them. Approval workflows, break-glass procedures, and time-bounded tokens. Production is not just another environment variable.&lt;/p&gt;
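&lt;p&gt;In GitHub Actions this maps onto environments: required reviewers and wait timers are configured on the environment in repository settings, and the job simply declares which one it deploys to (names below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  deploy-prod:
    environment:
      name: production   # protection rules enforced here
    steps:
      - run: ./deploy.sh # runs only after required approvals
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;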

&lt;p&gt;&lt;strong&gt;You audit pipeline changes like code changes&lt;/strong&gt;. Every modification to &lt;code&gt;.github/workflows&lt;/code&gt; or &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; or &lt;code&gt;Jenkinsfile&lt;/code&gt; goes through review. Those files define what can change production. Treat them accordingly.&lt;/p&gt;
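&lt;p&gt;One lightweight way to enforce that review is a CODEOWNERS rule, so pipeline definitions cannot merge without sign-off from the owning team (the team name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .github/CODEOWNERS
/.github/workflows/  @example-org/platform-team
Jenkinsfile          @example-org/platform-team
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;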

&lt;p&gt;&lt;strong&gt;You inventory secrets&lt;/strong&gt;. What does each pipeline actually have access to? Where are credentials stored? How are they rotated? Who can read them? If you can't answer these questions in under five minutes, you have a visibility problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You assume compromise&lt;/strong&gt;. Design pipelines so that a compromised runner can't pivot laterally. Network segmentation, egress controls, short-lived credentials, and least-privilege IAM policies. Defense in depth isn't just for runtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You monitor pipeline behavior&lt;/strong&gt;. What repositories are being cloned? What APIs are being called? What resources are being created? Deployment telemetry is production telemetry.&lt;/p&gt;

&lt;p&gt;None of this is exotic. It's just... deliberate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Uncomfortable Truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pipelines are easier to compromise than production systems.&lt;/p&gt;

&lt;p&gt;They run untrusted code by design. They have broad permissions by necessity. They're less monitored, less segmented, and less reviewed. They're infrastructure we inherited rather than infrastructure we designed.&lt;/p&gt;

&lt;p&gt;And they're the keys to the kingdom.&lt;/p&gt;

&lt;p&gt;The industry spent a decade hardening runtime security—sandboxing, SELinux, seccomp, capabilities, namespaces, firewall rules, and intrusion detection. We got pretty good at it.&lt;/p&gt;

&lt;p&gt;We spent comparatively little time hardening deployment security.&lt;/p&gt;

&lt;p&gt;That asymmetry is showing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reframing CI/CD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CI/CD isn't a supporting utility. It's the system that assembles artifacts, defines infrastructure, enforces policy, and deploys change. It's the mechanism by which intent becomes reality.&lt;/p&gt;

&lt;p&gt;That makes it production infrastructure.&lt;/p&gt;

&lt;p&gt;And production infrastructure deserves explicit trust modeling, scoped permissions, ownership, review, and architectural design.&lt;/p&gt;

&lt;p&gt;Not just YAML.&lt;/p&gt;

&lt;p&gt;Not just "works on my machine."&lt;/p&gt;

&lt;p&gt;Not just "trusted by default."&lt;/p&gt;

&lt;p&gt;If your pipeline can rewrite production, it is production. Treat it that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Up Next (Day 2)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If CI/CD is a control plane, then the build stage isn’t harmless either.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Day 2 explores why build systems often have more practical power than production—and why supply chain attacks work so well.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;📌 &lt;em&gt;If you’re new here, start with the previous series (pinned).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cicd</category>
      <category>security</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
