<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Steven Stuart</title>
    <description>The latest articles on Forem by Steven Stuart (@stevenstuartm).</description>
    <link>https://forem.com/stevenstuartm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3674180%2Fb36cef0c-b6e7-41b3-a1a0-8535c7a3b691.jpg</url>
      <title>Forem: Steven Stuart</title>
      <link>https://forem.com/stevenstuartm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/stevenstuartm"/>
    <language>en</language>
    <item>
      <title>Reporting and Production Make Terrible Roommates</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Wed, 11 Mar 2026 14:03:38 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/reporting-and-production-make-terrible-roommates-30m7</link>
      <guid>https://forem.com/stevenstuartm/reporting-and-production-make-terrible-roommates-30m7</guid>
      <description>&lt;p&gt;A transactional schema optimizes for write consistency, referential integrity, and the access patterns of the application that owns it. A reporting schema optimizes for read throughput, aggregation, and the access patterns of analysts and dashboards. When both concerns share a single schema, every design decision becomes a negotiation between them, and reporting usually wins because it's the most visible to leadership and the most painful to change after the fact. These concerns can and should be separated so each model serves the workload it was designed for.&lt;/p&gt;

&lt;p&gt;Consider what happens when the analytics team asks for a denormalized &lt;code&gt;order_summary&lt;/code&gt; view on the production database so their dashboards load faster. The DBA obliges, adds a materialized view, and now every schema migration has to account for it. Six months later the team wants to split the &lt;code&gt;orders&lt;/code&gt; table into &lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;order_line_items&lt;/code&gt;, but the view is embedded in 10 dashboard queries and a nightly export job. The refactor stalls, and the production schema fossilizes around a reporting concern.&lt;/p&gt;

&lt;p&gt;Not every system needs a separation on day one. A small team with a single database, low reporting complexity, and a schema that's still fluid can query production directly without meaningful friction. But this distortion is predictable, not surprising. It emerges when reporting consumers multiply, when dashboards become load-bearing, and when schema changes require cross-team coordination. Architects who recognize this trajectory can keep the door open for separation without building the full pipeline prematurely, by resisting the urge to denormalize production schemas for reporting convenience and by keeping reporting access patterns from becoming implicit contracts on the production schema. When the separation does happen, it can be reactive, tapping into what the database already captures, or intentional, making the application responsible for producing reporting-quality records in the write path.&lt;/p&gt;

&lt;h2&gt;Reactive Separation&lt;/h2&gt;

&lt;h3&gt;A Dedicated Reporting Replica&lt;/h3&gt;

&lt;p&gt;The simplest place to start is to point reporting tools at a read replica of the production database. Many teams already have replicas for distributing query load, so dedicating one to reporting keeps analytical queries from competing with production traffic. No new infrastructure, no async pipeline, no application code changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ┌─────────────┐         ┌─────────────────┐
  │ Application │────────&amp;gt;│  Production DB   │
  └─────────────┘  writes │  (Primary)       │
                          └────────┬─────────┘
                                   │ replication
                                   v
                          ┌─────────────────┐
                          │  Read Replica    │
                          │  (Same Schema)   │
                          └────────┬─────────┘
                                   │ direct queries
                          ┌────────┴─────────┐
                          │  BI / Dashboards  │
                          └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a good fit when reporting needs are straightforward, the production schema is close enough to what reporting consumers need, and data that's a few seconds stale is acceptable. "A few seconds stale" is the optimistic case, though. Heavy analytical queries on the replica can cause replication lag to spike well beyond that, especially during peak reporting windows. Still, it's the path of least resistance, and for many systems it works well enough that teams never move beyond it.&lt;/p&gt;

&lt;p&gt;The replica also serves as the foundation for ETL. Rather than querying the replica live, teams extract data from it on a schedule, transform it into reporting-friendly shapes, and load it into a warehouse or data lake. Same infrastructure, different consumption pattern. Live queries hit the replica directly for near-real-time results while ETL jobs use it as a source for batch aggregation and historical snapshots. Both approaches keep analytical workloads off the primary.&lt;/p&gt;
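&lt;p&gt;A minimal sketch of that batch pass, with sqlite3 standing in for both the replica and the warehouse; the &lt;code&gt;orders&lt;/code&gt; table and the aggregate shape are illustrative, not a prescribed schema:&lt;/p&gt;

```python
# Minimal batch ETL pass: extract from a reporting replica, transform into
# a daily aggregate, load into a warehouse table. sqlite3 stands in for
# both stores; table and column names are illustrative.
import sqlite3

def run_etl(replica, warehouse):
    # Extract: read current state from the replica (never the primary).
    rows = replica.execute(
        "SELECT substr(placed_at, 1, 10) AS day, status, total_amount "
        "FROM orders"
    ).fetchall()

    # Transform: aggregate into the shape dashboards actually query.
    daily = {}
    for day, status, amount in rows:
        key = (day, status)
        count, total = daily.get(key, (0, 0.0))
        daily[key] = (count + 1, total + amount)

    # Load: rebuild the reporting table wholesale for simplicity.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS daily_order_summary "
        "(day TEXT, status TEXT, order_count INT, total_amount REAL)"
    )
    warehouse.execute("DELETE FROM daily_order_summary")
    warehouse.executemany(
        "INSERT INTO daily_order_summary VALUES (?, ?, ?, ?)",
        [(d, s, c, t) for (d, s), (c, t) in daily.items()],
    )
    warehouse.commit()
```

&lt;p&gt;A production job would track a watermark and upsert incrementally instead of rebuilding the table, but the extract, transform, and load boundaries stay the same.&lt;/p&gt;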

&lt;p&gt;The replica breaks down, for both live queries and ETL, when reporting needs diverge far enough from the production schema's shape. Reporting consumers write increasingly complex queries with multiple joins, or they start requesting schema changes to production to make their queries simpler, which is exactly the distortion this post is about. The replica also can't capture history. It mirrors current state, so if a record changes twice between queries the intermediate state is gone.&lt;/p&gt;

&lt;h3&gt;Change Data Capture&lt;/h3&gt;

&lt;p&gt;CDC tools like Debezium tap the database's transaction log and emit changes as events without any application code changes. The application writes normally to whatever schema makes sense, and CDC streams those changes to a separate store. The stream is async by default, and unlike the replica approach, CDC captures every intermediate state change because it reads from the transaction log rather than polling snapshots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ┌─────────────┐         ┌─────────────────┐
  │ Application │────────&amp;gt;│  Production DB   │
  └─────────────┘  writes └────────┬─────────┘
                                   │ transaction log
                                   v
                          ┌─────────────────┐
                          │  CDC Connector   │
                          │  (e.g. Debezium) │
                          └────────┬─────────┘
                                   │ change events
                                   v
                          ┌─────────────────┐
                          │  Stream / Queue  │
                          │  (Kafka, Kinesis)│
                          └────────┬─────────┘
                                   │
                     ┌─────────────┴─────────────┐
                     v                           v
            ┌────────────────┐          ┌────────────────┐
            │  Transform (T) │          │  Schema        │
            │  Reshape/Join  │          │  Registry      │
            └───────┬────────┘          └────────────────┘
                    v
            ┌────────────────┐
            │ Reporting Store│
            │ (Warehouse/DL) │
            └────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CDC's greatest strength is that it requires no application code changes, no additional transaction overhead, and no new abstractions in the write path. For legacy systems where the risk of changing the write path is too high, or for teams that need separation now and can't afford to modify every service that writes data, CDC is often the only viable option. It also solves payload completeness for free: the transaction log captures the full row state after each write regardless of whether the application only updated a single field, so downstream consumers never have to wonder whether a missing field means "unchanged" or "removed."&lt;/p&gt;
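&lt;p&gt;That full-row property makes the downstream side a plain upsert. Here is a sketch of a consumer applying change events to a reporting store; the envelope shape (&lt;code&gt;op&lt;/code&gt;, &lt;code&gt;before&lt;/code&gt;, &lt;code&gt;after&lt;/code&gt;) is loosely modeled on Debezium's, and the field names are illustrative rather than a wire contract:&lt;/p&gt;

```python
# Sketch of a downstream consumer applying CDC change events to a
# reporting store. The envelope shape (op, before, after) is loosely
# modeled on Debezium's; field names here are illustrative.
def apply_change(event, reporting_rows):
    op = event["op"]  # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        # The log carries the full row state after the write, so this is
        # an upsert; no need to diff against what is already stored.
        row = event["after"]
        reporting_rows[row["order_id"]] = row
    elif op == "d":
        # Deletes carry only the prior state.
        reporting_rows.pop(event["before"]["order_id"], None)
    return reporting_rows
```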

&lt;p&gt;CDC does have limitations.&lt;/p&gt;

&lt;p&gt;The first limitation is semantic. CDC events originate from the database layer, so they capture &lt;em&gt;what&lt;/em&gt; changed but not &lt;em&gt;why&lt;/em&gt; it changed. A row update that represents a customer canceling an order looks identical to a row update that represents a system correcting a data entry error. The database can't distinguish between them because it only sees the state change, not the business intent. For domains where that distinction matters, like financial ledgers or audit-critical workflows, event sourcing is the appropriate tool because it captures the intent as the primary record.&lt;/p&gt;

&lt;p&gt;The second limitation is the absence of a contract boundary. The table structure &lt;em&gt;is&lt;/em&gt; the contract, implicitly. When that schema changes, nothing fails at build time. The CDC pipeline either silently emits differently shaped events or breaks at runtime, and reporting consumers discover the problem in production rather than in development. A schema registry can partially close this gap by enforcing compatibility rules at deserialization, but that's added infrastructure catching incompatibility at runtime rather than at build time.&lt;/p&gt;

&lt;p&gt;The third limitation is database dependency. Not every database has a strong CDC story. PostgreSQL and DynamoDB have mature options, but databases with weaker change-stream capabilities can push teams toward application-layer alternatives earlier than expected.&lt;/p&gt;

&lt;h2&gt;Intentional Separation&lt;/h2&gt;

&lt;p&gt;Reactive approaches separate the workload but not the context. They can tell you &lt;em&gt;what&lt;/em&gt; changed, but not &lt;em&gt;who&lt;/em&gt; changed it or &lt;em&gt;why&lt;/em&gt;. That context exists at the application layer when the write happens, and it's lost the moment the data hits the database unless someone deliberately captures it. An intentional separation captures that context at the source, treating reporting and production as equally vital concerns and giving each a model designed for its workload.&lt;/p&gt;

&lt;h3&gt;The Outbox Pattern&lt;/h3&gt;

&lt;p&gt;The outbox pattern makes the application responsible for producing reporting-quality records. Instead of letting the database schema define the downstream contract implicitly, the application writes a versioned record to an outbox table within the same database transaction as the domain state change. Either both commit or neither does, so consistency is guaranteed. A separate process reads from the outbox and projects into whatever reporting store analytics needs. The application controls the payload shape, the versioning, and the context included in each record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ┌─────────────┐
  │ Application │
  └──────┬──────┘
         │ single transaction
         v
  ┌──────────────────────────────────────┐
  │          Production DB               │
  │                                      │
  │  ┌──────────────┐  ┌──────────────┐  │
  │  │ Domain Table  │  │ Outbox Table │  │
  │  │ (orders,      │  │ (versioned   │  │
  │  │  customers)   │  │  records)    │  │
  │  └──────────────┘  └──────┬───────┘  │
  └───────────────────────────┼──────────┘
                              │ poll / stream
                              v
                     ┌─────────────────┐
                     │ Relay Process   │
                     │ (reads outbox)  │
                     └────────┬────────┘
                              │ publish
                              v
                     ┌─────────────────┐
                     │ Reporting Store │
                     └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outbox table in a relational database:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt;        &lt;span class="n"&gt;UUID&lt;/span&gt;          &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="c1"&gt;-- globally unique, used downstream&lt;/span&gt;
    &lt;span class="n"&gt;aggregate_type&lt;/span&gt;  &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- e.g. 'Order', 'Customer'&lt;/span&gt;
    &lt;span class="n"&gt;aggregate_id&lt;/span&gt;    &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- e.g. order ID&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt;      &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- e.g. 'OrderCancelled'&lt;/span&gt;
    &lt;span class="n"&gt;schema_version&lt;/span&gt;  &lt;span class="nb"&gt;INT&lt;/span&gt;           &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- contract versioning&lt;/span&gt;
    &lt;span class="n"&gt;occurred_at&lt;/span&gt;     &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;initiated_by&lt;/span&gt;    &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;            &lt;span class="c1"&gt;-- who: user ID, system name&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt;          &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;            &lt;span class="c1"&gt;-- why: 'customer_request', 'admin_override'&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;         &lt;span class="n"&gt;JSONB&lt;/span&gt;         &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- full state snapshot + context&lt;/span&gt;
    &lt;span class="n"&gt;published&lt;/span&gt;       &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outbox record published to a Kafka/Kinesis stream (JSON envelope):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a1b2c3d4-e5f6-7890-abcd-ef1234567890"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"aggregateType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Order"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"aggregateId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ORD-20260218-4417"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"eventType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OrderCancelled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occurredAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-18T14:32:08.771Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"initiatedBy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user:jsmith"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customer_request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"orderId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ORD-20260218-4417"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"customerId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CUST-8821"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"previousStatus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Confirmed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"newStatus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Cancelled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lineItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"totalAmount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;284.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the application controls the payload, the outbox captures context that reactive approaches cannot: who initiated a change, whether it was a customer action or an admin override or a system timeout, and why it happened. The application has this context at write time, and it's impossible to recover after the fact.&lt;/p&gt;

&lt;p&gt;This also gives the outbox an explicit, versionable contract boundary. The application decides what the downstream record looks like and versions it independently. A breaking change to the outbox record is a code change that has to compile, pass tests, and go through review. If a developer renames a column in the production schema, the outbox record doesn't change unless someone deliberately updates it. And because the outbox doesn't rely on transaction log capabilities or vendor-specific change feed APIs, any database that supports transactions supports the pattern.&lt;/p&gt;
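&lt;p&gt;Version handling on the consumer side can be equally explicit. A sketch, assuming a hypothetical v1-to-v2 change in which the &lt;code&gt;reason&lt;/code&gt; field was added:&lt;/p&gt;

```python
# Sketch of a consumer dispatching on the outbox record's schemaVersion.
# Known old versions are upgraded to the current shape; unknown versions
# fail loudly instead of being silently misread. The v1-to-v2 change
# (adding "reason") is a hypothetical example.
def normalize_record(record):
    version = record["schemaVersion"]
    if version == 1:
        # v1 lacked the reason field; fill a sentinel so downstream
        # consumers see one consistent shape.
        upgraded = dict(record)
        upgraded["reason"] = "unknown"
        upgraded["schemaVersion"] = 2
        return upgraded
    if version == 2:
        return record
    raise ValueError(f"unsupported outbox schema version: {version}")
```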

&lt;p&gt;The outbox does not require a record for every database write. It fires only on meaningful state changes, and only for the entity types reporting cares about. Background jobs updating internal timestamps produce nothing. Deletions, on the other hand, do produce a record, because knowing that an entity was removed and who removed it is itself meaningful. This keeps the coupling concentrated in the write paths that produce meaningful state transitions rather than spread across every query and update in the codebase.&lt;/p&gt;

&lt;p&gt;For most teams that have outgrown a read replica but don't need full event sourcing, the outbox is my recommendation. It provides intentional separation, explicit contracts, and rich context without the architectural commitment of an append-only event store.&lt;/p&gt;
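&lt;p&gt;The pattern's two moving parts, the transactional dual write and the relay, can be sketched with sqlite3 standing in for the production database and simplified versions of the tables above:&lt;/p&gt;

```python
# Sketch of the outbox dual write and a single relay pass, using sqlite3
# and simplified tables. The domain row and the outbox record commit in
# one transaction; the relay later publishes and marks rows published.
import json
import sqlite3

def cancel_order(db, order_id, user, reason):
    with db:  # one transaction: both writes commit or neither does
        db.execute(
            "UPDATE orders SET status = 'Cancelled' WHERE order_id = ?",
            (order_id,),
        )
        db.execute(
            "INSERT INTO outbox (event_type, aggregate_id, payload, published) "
            "VALUES (?, ?, ?, 0)",
            ("OrderCancelled", order_id,
             json.dumps({"initiatedBy": user, "reason": reason})),
        )

def relay_once(db, publish):
    # Poll unpublished records in insertion order, publish, then mark.
    rows = db.execute(
        "SELECT id, event_type, aggregate_id, payload "
        "FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, aggregate_id, payload in rows:
        publish(event_type, aggregate_id, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

&lt;p&gt;Note the delivery semantics: if the relay dies after publishing but before marking the row, the record is published again on the next pass, so downstream consumers should treat outbox records as at-least-once and deduplicate on the event ID.&lt;/p&gt;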

&lt;h3&gt;CQRS and Event Sourcing&lt;/h3&gt;

&lt;p&gt;CQRS (Command Query Responsibility Segregation) formally separates the write model from the read model. The write side accepts commands and persists state. The read side maintains whatever views consumers need, shaped however they need them, updated as fast or as lazily as the use case demands. The two sides share no schema and no storage. What CQRS adds is the explicit acknowledgment that "what happened" and "what is the current state" are different questions that deserve different models. CQRS does not require event sourcing. It can sit in front of a traditional stateful database where the write side persists state normally and the read side maintains separate, denormalized views optimized for queries.&lt;/p&gt;

&lt;p&gt;Event sourcing takes this further by changing what the write side stores. Instead of persisting current state and producing reporting records alongside it, every state mutation is recorded as an immutable event, and current state is derived by replaying those events. The event log becomes the source of truth, not the current snapshot. Nothing is overwritten. Every transition is preserved in the order it occurred. Production state is a projection of the event stream, and so is reporting state, and so is any other view you need. If the analytics team changes their requirements six months from now, you replay the same events through a new projection and the full history is there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          Commands
                              │
                              v
                     ┌─────────────────┐
                     │   Write Side    │
                     │ (Command Handler│
                     │  + Aggregates)  │
                     └────────┬────────┘
                              │ append events
                              v
                     ┌─────────────────┐
                     │  Event Store    │
                     │  (append-only)  │
                     └────────┬────────┘
                              │ project
                              v
                     ┌─────────────────┐
                     │  Projected      │
                     │  State Tables   │
                     │  (per entity)   │
                     └────────┬────────┘
                              │ query (read side)
              ┌───────────────┼───────────────┐
              v               v               v
     ┌────────────┐  ┌─────────────┐  ┌────────────────┐
     │  Prod API  │  │  Reporting  │  │  Audit /       │
     │  Queries   │  │  Store      │  │  Compliance    │
     └────────────┘  └─────────────┘  └────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Event store document (append-only, NoSQL):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"streamId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Order-ORD-4417"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"eventType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OrderCancelled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occurredAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-18T14:32:07Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"initiatedBy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user:jsmith"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customer_request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"previousStatus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Shipped"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"newStatus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Cancelled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lineItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"totalAmount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;284.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"correlationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"req-88a1c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"causationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cmd-cancel-4417"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user:jsmith"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Projected state table (derived from events, used by reporting/ETL):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;order_projections&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;         &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;      &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_status&lt;/span&gt;   &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;item_count&lt;/span&gt;       &lt;span class="nb"&gt;INT&lt;/span&gt;          &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt;     &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;currency&lt;/span&gt;         &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;placed_at&lt;/span&gt;        &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;shipped_at&lt;/span&gt;       &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cancelled_at&lt;/span&gt;     &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cancelled_by&lt;/span&gt;     &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;cancel_reason&lt;/span&gt;    &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;last_event_pos&lt;/span&gt;   &lt;span class="nb"&gt;INT&lt;/span&gt;          &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- tracks replay position&lt;/span&gt;
    &lt;span class="n"&gt;projected_at&lt;/span&gt;     &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, reporting consumers rarely subscribe to the event stream directly. Event sourcing produces projected state tables, one per entity, where each row represents the current state derived from the event history. Reporting and ETL pull from these projections rather than from raw events. This keeps the event stream internal to the domain, which matters because not everything in the stream is a clean domain event. The projections give reporting consumers a familiar, queryable surface while the event stream retains full history for replay and audit.&lt;/p&gt;
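&lt;p&gt;A projection is essentially a fold over the event stream into a row. The sketch below, in Python with SQLite and hypothetical event names and shapes (a simplified version of the table above, not the article's actual system), shows the shape: each event either creates or updates the row, and &lt;code&gt;last_event_pos&lt;/code&gt; makes redelivered or replayed events safe to apply again.&lt;/p&gt;

```python
import sqlite3

# Minimal projection-builder sketch. The event names and payload shapes
# ("OrderPlaced", "OrderShipped") are hypothetical stand-ins.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_projections (
        order_id       TEXT PRIMARY KEY,
        current_status TEXT NOT NULL,
        total_amount   REAL NOT NULL,
        last_event_pos INTEGER NOT NULL
    )""")

def apply_event(event):
    """Fold one domain event into the projection, idempotently."""
    row = conn.execute(
        "SELECT last_event_pos FROM order_projections WHERE order_id = ?",
        (event["order_id"],)).fetchone()
    # Replay safety: skip anything at or before the recorded position.
    if row is not None and row[0] >= event["pos"]:
        return
    if event["type"] == "OrderPlaced":
        conn.execute(
            "INSERT INTO order_projections VALUES (?, 'placed', ?, ?)",
            (event["order_id"], event["total"], event["pos"]))
    elif event["type"] == "OrderShipped":
        conn.execute(
            "UPDATE order_projections SET current_status = 'shipped', "
            "last_event_pos = ? WHERE order_id = ?",
            (event["pos"], event["order_id"]))
    conn.commit()

for e in [
    {"type": "OrderPlaced",  "order_id": "o-1", "total": 42.50, "pos": 1},
    {"type": "OrderShipped", "order_id": "o-1", "pos": 2},
    {"type": "OrderShipped", "order_id": "o-1", "pos": 2},  # redelivered: ignored
]:
    apply_event(e)
```

&lt;p&gt;Reporting then queries &lt;code&gt;order_projections&lt;/code&gt; like any ordinary table, with no knowledge of the event stream behind it.&lt;/p&gt;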

&lt;p&gt;This is a good fit for domains where the complete history of state transitions is genuinely valuable, like financial ledgers, audit-critical workflows, or systems where "undo" and "replay" are first-class requirements. The combination of event sourcing and CQRS provides the most complete separation: full history, arbitrary projections, and independent evolution of read and write models.&lt;/p&gt;

&lt;p&gt;Most teams should not reach for this combination. Martin Fowler has &lt;a href="https://martinfowler.com/bliki/CQRS.html" rel="noopener noreferrer"&gt;warned consistently&lt;/a&gt; that CQRS is misapplied far more often than it's applied well. Many systems fit a CRUD mental model and should stay that way. CQRS should only apply to specific bounded contexts where the read and write access patterns are genuinely different, not across entire applications. Event sourcing compounds the cost: events are immutable and permanent so schema design requires careful thought, aggregate replay gets expensive without snapshotting, and debugging production issues means reasoning about event sequences rather than inspecting current state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Separate Early or Pay Later
&lt;/h2&gt;

&lt;p&gt;A read replica is enough to start, but every shortcut that ties these workloads together makes the eventual separation harder. Both production and reporting deserve to be first-class concerns, and treating them that way means decoupling from the schema entirely.&lt;/p&gt;

&lt;p&gt;Production databases can now optimize for their inserts and their queries. Dev teams can now deploy and evolve a component's database as needs are discovered, without asking permission. Reporting teams can now get richer, more contextual insights that are readily available. And the two groups can now stop being at each other's throats, because they're no longer competing for the same resource.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>designpatterns</category>
      <category>database</category>
    </item>
    <item>
      <title>How One Screen Holds the Entire Industry Hostage</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Mon, 23 Feb 2026 21:53:28 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/how-one-screen-holds-the-entire-industry-hostage-5fi0</link>
      <guid>https://forem.com/stevenstuartm/how-one-screen-holds-the-entire-industry-hostage-5fi0</guid>
      <description>&lt;p&gt;Frameworks like React Native, Flutter, and MAUI keep promising to end the "write it twice" problem across mobile platforms. One codebase, every platform, native-quality results. Yet every time, the abstraction leaks, and then it floods so fast that bailing water is all you have time to do. I've been working with MAUI recently, and the experience crystallized a question I should have asked sooner: why am I not just building a website?&lt;/p&gt;

&lt;p&gt;Once you pull that thread, it unravels fast. The web platform's capability surface is far larger than the industry acknowledges, and nearly everything preventing universal web adoption is inertia, business incentives, or mental models rather than real technical constraints.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The web can do the job. One company made sure you'd never trust it to.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't an argument that native apps are obsolete or that local executables should disappear. There are good reasons to run code on your own hardware, and the pure thin-client terminal hasn't arrived yet; maybe it shouldn't. But when teams default to native without questioning it, they accept costs and constraints on the client side that the backend abandoned years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Web Platform Can Actually Do
&lt;/h2&gt;

&lt;p&gt;The capabilities list for the modern web is longer than most developers and decision-makers expect. For the typical business application, whether it runs on a phone, a tablet, or a desktop, the web platform already covers the core requirements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Web Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Offline support&lt;/td&gt;
&lt;td&gt;Service Workers, Cache API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push notifications&lt;/td&gt;
&lt;td&gt;Push API (iOS 16.4+, March 2023)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Camera, microphone, biometrics&lt;/td&gt;
&lt;td&gt;getUserMedia, WebAuthn/Passkeys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Payment processing&lt;/td&gt;
&lt;td&gt;Payment Request API (includes Apple Pay)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Home screen installation&lt;/td&gt;
&lt;td&gt;Web App Manifest, standalone window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU-accelerated graphics and compute&lt;/td&gt;
&lt;td&gt;WebGPU (all major browsers, Nov 2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peripheral device access&lt;/td&gt;
&lt;td&gt;WebUSB, WebSerial, WebBluetooth, WebHID (Chromium)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local file access&lt;/td&gt;
&lt;td&gt;File System Access API, Origin Private File System&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Near-native performance&lt;/td&gt;
&lt;td&gt;WebAssembly, Web Workers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time communication&lt;/td&gt;
&lt;td&gt;WebRTC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That list covers what the vast majority of apps actually do. Most are thin clients over an API: authenticate a user, fetch data, display it, let the user interact with it. The web handles all of that with a single codebase on every platform with a browser, and the deployment model alone should give teams pause. No App Store review cycles, no waiting days for a critical bug fix to clear approval, no separate release pipelines for each platform.&lt;/p&gt;
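&lt;p&gt;Several of those rows cost almost nothing to exercise. "Home screen installation," for example, is a small static file rather than a build pipeline: a minimal &lt;code&gt;manifest.json&lt;/code&gt; sketch like the one below (the name, colors, and icon paths are placeholders) is enough for most browsers to offer an install prompt.&lt;/p&gt;

```json
{
  "name": "Example Orders",
  "short_name": "Orders",
  "start_url": "/",
  "display": "standalone",
  "background_color": "#ffffff",
  "theme_color": "#0f172a",
  "icons": [
    { "src": "/icons/icon-192.png", "sizes": "192x192", "type": "image/png" },
    { "src": "/icons/icon-512.png", "sizes": "512x512", "type": "image/png" }
  ]
}
```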

&lt;h2&gt;
  
  
  What Genuinely Requires Native
&lt;/h2&gt;

&lt;p&gt;The web can't do everything. Some capabilities have no web equivalent and genuinely require native development.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wearable integration and health data&lt;/strong&gt; like Apple Watch complications, Wear OS tiles, HealthKit, and Google Health Connect require platform SDKs with no web alternative&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced augmented reality&lt;/strong&gt; using LiDAR scanning, scene understanding, and body tracking exceeds what WebXR currently offers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep OS integration&lt;/strong&gt; like Siri Shortcuts, Google Assistant routines, home screen widgets, and inter-app communication remains outside the web's reach&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True background processing&lt;/strong&gt; for geofencing, long-running background jobs, and persistent location tracking requires native APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specific hardware access&lt;/strong&gt; like NFC writing on iOS, advanced camera controls, and screenshot blocking are native-only capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list is relevant, but it's also narrow. Look at the apps on your phone and the software on your desktop, and count how many actually need any of these features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Platform Frameworks Are the Wrong Answer
&lt;/h2&gt;

&lt;p&gt;Cross-platform frameworks don't eliminate the two-codebase problem; they disguise it. React Native's bridge, Flutter's rendering engine, and MAUI's handler pattern each introduce their own category of bugs that don't exist in either native platform. You haven't removed the platform differences; you've added a third abstraction layer and inherited all three bug surfaces.&lt;/p&gt;

&lt;p&gt;The technical debt is impossible to forecast because you don't control the framework's roadmap. When Apple changes iOS, you wait for the framework to catch up. When the framework ships breaking changes, you're locked into an unplanned upgrade. When a critical bug sits in the issue tracker for months, your only options are workarounds or forks.&lt;/p&gt;

&lt;p&gt;The original justification was that specialized native developers are expensive, so share code to reduce cost. AI code generation has collapsed that constraint. A competent developer with AI assistance can ship Swift or Kotlin without years of platform experience, but all the original disadvantages of cross-platform remain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Companies, Two Arcs
&lt;/h2&gt;

&lt;p&gt;To understand why the web hasn't become the default, it helps to look at how the two most influential companies in software development have traded places.&lt;/p&gt;

&lt;p&gt;In the early 2000s, Microsoft was the villain. They owned the desktop, the browser, the runtime, and the development tools, and the DOJ antitrust case in 2001 was about exactly this: using a Windows monopoly to crush Netscape. Apple was the scrappy alternative making beautiful things for creative people, and when the iPhone launched in 2007 it felt like liberation from the carrier-controlled mobile landscape.&lt;/p&gt;

&lt;p&gt;Then each company lost something important, and their responses tell you everything.&lt;/p&gt;

&lt;p&gt;Microsoft lost mobile, and Windows 8 alienated a growing share of desktop users. Their response was to stop trying to own the screen and instead to compete on the stack. .NET went open source, Visual Studio Code became the most popular editor in the world, they acquired GitHub and kept it open, and Azure now runs more Linux workloads than Windows. The company that once tried to kill Linux now employs more Linux kernel contributors than most Linux companies.&lt;/p&gt;

&lt;p&gt;I am still baffled that Microsoft did not backpedal on its bloated OS and clunky desktop UX as soon as its market assumptions proved so wrong. Windows seems to have gotten worse with each version, with no sign of redemption. Two steps forward, one step back.&lt;/p&gt;

&lt;p&gt;Apple very quickly went the other direction. When the iPhone became the dominant computing device, Apple discovered what Microsoft had known in the 1990s: if you control the platform people depend on, you don't have to compete on openness. You compete on control.&lt;/p&gt;

&lt;p&gt;I write .NET code for a living and I choose to do it on a Mac because the experience is genuinely better. Notice what that reveals about both companies though. Microsoft made it possible by building .NET and VS Code to run everywhere. Try the reverse: building an iOS app without a Mac, submitting to the App Store without Xcode, running Swift on Windows with the same support .NET has on macOS. You can't. Microsoft earns developers by being useful everywhere. Apple captures them by being mandatory.&lt;/p&gt;

&lt;p&gt;Apple's products deserve their loyalty. The Mac is excellent, the ecosystem integration is seamless, and users trust the brand for good reasons. That trust is exactly what makes the constraint so effective. When a company makes products this good, people don't scrutinize the walls. They assume the walls exist for good reasons.&lt;/p&gt;

&lt;p&gt;But look at what Apple controls versus what they build. Siri has been outperformed by competitors for over a decade, and it doesn't matter because Siri doesn't need to be good; it needs to be on the iPhone. Owning the screen means you don't have to be the best at anything that runs on it; you just need to be good enough at the thing people hold, and everything else flows through you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Apple doesn't compete on technology. They compete on constraint ownership. The phone is the aperture, and Apple controls the aperture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Walls Apple Built
&lt;/h2&gt;

&lt;p&gt;The walls Apple has constructed around iOS are higher than anything Microsoft built around Windows in the 1990s, and they're more sophisticated because they're framed as user protection rather than vendor control.&lt;/p&gt;

&lt;p&gt;Every browser on iOS must use Apple's WebKit rendering engine. Chrome on your iPhone isn't really Chrome. It's a WebKit skin with Chrome's UI on top. Firefox, Edge, Brave: all WebKit underneath. This means Apple alone controls what web capabilities exist on every iOS device, regardless of which browser icon a user taps.&lt;/p&gt;

&lt;p&gt;On Chrome and Android, web apps can access over 47 Web APIs including Bluetooth, NFC, Background Sync, USB, and serial devices. On iOS, none of those APIs are available on any browser. In June 2020, Apple publicly rejected 16 Web APIs citing "privacy and fingerprinting concerns." Android handles the same APIs with straightforward permission prompts. The privacy argument doesn't hold up when every other platform manages these capabilities without the problems Apple claims are unsolvable.&lt;/p&gt;

&lt;p&gt;Chrome on Android supported push notifications in 2015. iOS didn't get web push until March 2023, and even then Apple requires users to install the web app to their home screen first. On Android, any website can request push permission.&lt;/p&gt;

&lt;p&gt;The EU's Digital Markets Act forced Apple's hand on browser engine choice in 2024, but the response was revealing. Rather than comply, Apple attempted to remove PWA support entirely in the EU, converting installed web apps into simple bookmarks. Their justification was "complex security and privacy concerns." After an open letter gathered over 4,200 signatures and the European Commission sent formal inquiries, Apple reversed the decision within two weeks. Genuine security concerns don't evaporate under public pressure.&lt;/p&gt;

&lt;p&gt;And even after the DMA technically required browser engine choice, as of early 2026 zero browsers have shipped a non-WebKit engine on iOS in the EU. The regulation exists on paper. The monopoly persists in practice.&lt;/p&gt;

&lt;p&gt;The financial incentive is straightforward. The App Store generated approximately $27 billion in commissions in 2024 on a 30% cut. Every app that ships as a web app is revenue Apple doesn't collect. The U.S. Department of Justice made this connection explicit in their March 2024 antitrust lawsuit, which specifically cites the WebKit requirement as part of Apple's monopoly maintenance strategy.&lt;/p&gt;

&lt;p&gt;Android doesn't have these restrictions. Chrome supports the full suite of web APIs and PWAs work as first-class applications. But it doesn't matter. No product leader will ship something that doesn't work on iPhones, and Apple's users represent the highest-value demographic in every Western market. The most constrained major platform sets the ceiling for what anyone builds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Circular Logic of "Users Prefer Native"
&lt;/h2&gt;

&lt;p&gt;The most common justification for building native apps is market data showing that users spend 88-92% of their mobile time in apps and only 8-12% in browsers. Native retains users at 32% after 90 days compared to 20% for web. The data seems decisive.&lt;/p&gt;

&lt;p&gt;But this is a post-hoc fallacy dressed up as market research. Of course the native experience retains users better; it received ten times the investment. Of course users spend more time in apps; they were never given an equivalent web alternative. Native gets the discovery mechanisms, the design talent, and the push notification support. Web gets a fraction of the budget and is treated as a fallback. You cannot measure user preference when one option was deliberately hobbled by the platform owner and underfunded by the developer.&lt;/p&gt;

&lt;p&gt;The developer survey data has the same circularity. Flutter and React Native adoption is growing, but these frameworks exist because Apple won't let the web do what it already does on every other platform. A developer checks iOS web capabilities, finds background sync missing and Bluetooth unavailable, builds native instead, and that decision gets counted as evidence that the web isn't ready. The constraint creates the behavior that justifies the constraint.&lt;/p&gt;

&lt;p&gt;The counterfactual has never been tested at scale because Apple has prevented it. Equivalent web and native experiences have never existed on iOS. The assumption that native is inherently superior has become so embedded that most teams skip straight to "which framework?" without ever stopping at "does this need to be an app?"&lt;/p&gt;

&lt;p&gt;The few times the counterfactual has been tested, the results are telling. The Financial Times left the App Store in 2011 and is still web-first over a decade later. Starbucks built a PWA 99.84% smaller than their iOS app and doubled daily active users. But Starbucks kept the native app too, which raises an important question I can't answer: did they keep it because native was genuinely better, or because no one was willing to ask "why do we still have this?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anxiety That Predates Mobile
&lt;/h2&gt;

&lt;p&gt;When the iPhone launched in 2007, Steve Jobs told developers to build web apps. The web genuinely wasn't ready, and the App Store arrived a year later. But the response to that gap matters more than the gap itself. Rather than rallying behind closing it, the industry built an entirely parallel native ecosystem.&lt;/p&gt;

&lt;p&gt;This follows a pattern that has repeated since the 1960s: every generation of computing produces a viable thin-client model, and every generation finds reasons to reject it. Mainframe terminals gave way to PCs. Sun's network computer was technically sound and commercially dead. Chromebooks were dismissed as laptops that couldn't work offline, even as every application was migrating to the browser. The anxiety is always the same: if computation lives somewhere else, you lose control. Companies that profit from local-first computing have always been happy to amplify that fear.&lt;/p&gt;

&lt;p&gt;The backend already completed the thin-client transition. Cloud won decisively; nobody serious argues for on-premises-first anymore. But the frontend is frozen at the same conceptual barrier that existed when the first PC replaced the first terminal. We accepted that our servers are someone else's computers. We haven't accepted that our applications could be someone else's rendering.&lt;/p&gt;

&lt;p&gt;Mobile is also the reason the web became capable enough to challenge native at all. Service workers, WebGL, touch APIs, and WebAssembly weren't inevitable. They were a competitive response to native threatening to make the web irrelevant. The ecosystem that pressured the web into becoming a genuine application platform is now the same ecosystem preventing it from being used as one.&lt;/p&gt;

&lt;p&gt;Cloud broke through because no single company controlled the server. The web can't break through until it works on Apple's phone, and Apple decides what works on Apple's phone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Progress Often Comes by Getting Out of Its Way
&lt;/h2&gt;

&lt;p&gt;Before writing that new shiny app, ask yourselves: "Do we have a specific, documented constraint that the web platform cannot satisfy?"&lt;/p&gt;

&lt;p&gt;For most mobile software needs, the answer is no. The web runs everywhere, deploys instantly, requires no framework intermediary, and its capability surface grows with every browser release. Cross-platform frameworks tried to solve platform fragmentation by adding another platform on top. The web solved it by being the platform that was already there. In Android-dominant markets like India and Southeast Asia, companies like Flipkart and JioSaavn have already proven this works: one codebase, instant deployment, no App Store tax.&lt;/p&gt;

&lt;p&gt;The immediate objection is discoverability. People find apps by searching the App Store, so if you're not in the store, you're invisible. But most app discovery doesn't actually happen through store browsing; it happens through web search, social media, ads, and word of mouth. The store is more of a checkout counter than a shopping mall. Google Play already supports Trusted Web Activities, which let PWAs appear as store listings. The Microsoft Store accepts PWAs directly. For enterprise and B2B products, store discovery was never relevant to begin with. The discoverability argument is narrower than it sounds, and it gets narrower every year as deep links, QR codes, and social sharing put users directly into web experiences without a store in between.&lt;/p&gt;

&lt;p&gt;The pragmatic strategy might be web-first. Build for the browser as the default platform, and only build native when a specific capability genuinely can't be delivered through the web. The web app is your product. The native app, if you need one at all, exists only for the features that Apple won't let the browser handle.&lt;/p&gt;

&lt;p&gt;Cost, velocity, and agility shouldn't be values we only demand from our backend infrastructure. The same expectations that drove the industry from on-premises servers to cloud should apply to how we build and deliver client software. Native apps aren't going away, and they shouldn't. But we should be progressing toward both efficiency and sustainability rather than accepting a status quo where one company's business model determines how the entire industry ships code.&lt;/p&gt;

</description>
      <category>mobile</category>
      <category>webdev</category>
      <category>website</category>
      <category>frontend</category>
    </item>
    <item>
      <title>Observability Is Authored, Not Installed</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Mon, 16 Feb 2026 19:04:51 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/observability-is-authored-not-installed-4lcc</link>
      <guid>https://forem.com/stevenstuartm/observability-is-authored-not-installed-4lcc</guid>
      <description>&lt;p&gt;I have been a part of a dev team where poor observability constantly brought us to a standstill. Not because the tooling was missing, but because the data it collected never carried meaningful context. Alerts fired constantly, so operation teams ignored them, and dashboards existed for every service, but none of them answered the questions that mattered during incidents. Investigations that should have taken minutes took hours. It got bad enough that observability failures alone caused significant SLA violations.&lt;/p&gt;

&lt;p&gt;We questioned the choice of platforms, dashboards, and alerting rules. Yet none of those could help, because the problem was never the tooling. The problem was upstream: our code didn't know the difference between "I handled this correctly" and "something is actually broken."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Classification Problem
&lt;/h2&gt;

&lt;p&gt;Consider a payment processing system. A customer's card gets declined for insufficient funds. The payment gateway returns a rejection, and the system logs it as an ERROR.&lt;/p&gt;

&lt;p&gt;But this is the system working correctly. The card was declined because it should have been declined. Insufficient funds is a handled business case, not an exception. Because it's logged as an error, though, it shows up in error dashboards, triggers error-rate alerts, and adds to the ambient noise that operators learn to tune out.&lt;/p&gt;

&lt;p&gt;Over time, "payment errors" become background radiation. The team knows most of them are just declined cards, so they stop investigating. Then the gateway starts timing out, or a partner pushes a breaking change, and the real problem gets buried. Nobody notices because "payment errors are always high."&lt;/p&gt;

&lt;p&gt;The usual response is to blame the team for ignoring alerts. It is a discipline problem, yes, but the discipline that's missing is upstream, in the code that treats expected outcomes as errors. Alert fatigue is the predictable consequence.&lt;/p&gt;

&lt;p&gt;The fix is upstream of your alerting platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expected success&lt;/strong&gt;: The happy path. Logged at DEBUG if at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected failure&lt;/strong&gt;: Business logic correctly rejecting something, like declined payments, validation failures, or rate limiting. This is INFO, not ERROR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Degraded but functional&lt;/strong&gt;: The system recovered, but something is wearing thin. Retries succeeding after multiple attempts, response times approaching SLA thresholds, connection pools running hot. This is WARN: not broken yet, but worth watching before it becomes broken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unexpected failure&lt;/strong&gt;: Something genuinely went wrong that demands investigation. This is the only category that should be ERROR.&lt;/li&gt;
&lt;/ul&gt;
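&lt;p&gt;As a minimal sketch, the classification can live in one place at the call site so the log level is decided by the outcome category rather than by habit. The outcome names below are hypothetical stand-ins, and the example uses Python's stdlib &lt;code&gt;logging&lt;/code&gt; module:&lt;/p&gt;

```python
import logging

# Map outcome categories to log levels in one place. The outcome names
# ("approved", "declined", "retried") are illustrative placeholders.
OUTCOME_LEVELS = {
    "approved": logging.DEBUG,    # expected success: happy path
    "declined": logging.INFO,     # expected failure: business rule worked
    "retried":  logging.WARNING,  # degraded but functional
}

def record_outcome(order_id, outcome):
    # Anything unclassified is an unexpected failure and demands investigation.
    level = OUTCOME_LEVELS.get(outcome, logging.ERROR)
    logging.getLogger("payments").log(
        level, "payment outcome=%s order_id=%s", outcome, order_id)
    return level
```

&lt;p&gt;With this shape, a new expected outcome is a one-line dictionary entry, and the ERROR path is what remains when nothing expected matched.&lt;/p&gt;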

&lt;p&gt;When the system correctly declines a card for insufficient funds, it's tempting to log that as WARN because you want the metric reviewed often. But a correctly handled decline is the system working as designed, not degrading. Whether the decline rate is "concerning" is a business question that changes with strategy and context; log levels shouldn't encode that judgment. Leave business interpretation to reports and dashboards where it can evolve, not to code where it gets baked in and forgotten.&lt;/p&gt;

&lt;p&gt;This is one of the places where result types earn their keep. When expected failures are returned as typed results rather than thrown as exceptions, the classification is baked into the code's structure. A declined payment returns a result; a gateway timeout throws an exception. The distinction is explicit at the point where it matters most, and logging infrastructure can respect it without guessing.&lt;/p&gt;
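&lt;p&gt;A sketch of that structure, with hypothetical types: the declined card comes back as a value the caller must inspect, while the gateway timeout is raised and escapes to the error path.&lt;/p&gt;

```python
from dataclasses import dataclass

# Result-vs-exception sketch. All names here are illustrative.
@dataclass
class PaymentResult:
    approved: bool
    reason: str = ""

class GatewayTimeout(Exception):
    """Infrastructure failure: the gateway never answered."""

def charge(card_balance, amount, gateway_up=True):
    if not gateway_up:
        # Unexpected failure: raised, logged as ERROR by the caller.
        raise GatewayTimeout("no response from gateway")
    if card_balance >= amount:
        return PaymentResult(approved=True)
    # Expected failure: returned as a value, logged as INFO by the caller.
    return PaymentResult(approved=False, reason="insufficient_funds")
```

&lt;p&gt;The type system now enforces the classification: callers cannot confuse a decline with an outage, because the two arrive through different channels.&lt;/p&gt;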

&lt;p&gt;When classification is right, every downstream tool benefits. Dashboards that track error rates become genuine health indicators because errors represent actual unexpected failures, not business logic working as designed. Log queries become surgical because structured errors with proper context let you filter to a specific tenant or operation in minutes. Alerts become actionable because they fire only for conditions that demand investigation.&lt;/p&gt;

&lt;p&gt;When classification is wrong, the opposite happens. Alerts fire for expected outcomes, so operators learn to ignore them. Dashboards become decoration because nobody trusts what the numbers represent. Every investigation becomes archaeology because the data that should answer your questions is buried under noise. No monitoring platform compensates for what the code got wrong at the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Is Authored, Not Accumulated
&lt;/h2&gt;

&lt;p&gt;Getting the classification right is only half of it. The other half is what you include when something does fail.&lt;/p&gt;

&lt;p&gt;The instinct is to compensate with volume: write verbose logs everywhere so you'll have context when you need it. But a trace log is not a dump file. Every bug I've seen diagnosed from trace logs involved information that should have already been in the error or warning itself. The problem was never insufficient logging volume; it was that nobody authored the context where it mattered.&lt;/p&gt;

&lt;p&gt;What actually solves bugs is understanding what the user did and what they sent, not tracing the code's internal flow. If your logs carry a correlation key across services (most structured logging libraries support this out of the box) and your errors capture the operation, the input, and what went wrong, you have what you need to reproduce the problem. The approach is the same one that makes event-sourcing systems reliable: capture the context that led to a state so you can replay it. You don't need to trace every intermediate step if you can reconstruct the scenario from the input and the outcome.&lt;/p&gt;
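&lt;p&gt;With Python's stdlib &lt;code&gt;logging&lt;/code&gt;, for instance, a correlation key can be attached through a &lt;code&gt;Filter&lt;/code&gt; so every line a request produces carries the same id across components. The logger and field names below are hypothetical:&lt;/p&gt;

```python
import logging
import uuid
from contextvars import ContextVar

# Sketch: thread one correlation id through every log line of a request.
correlation_id = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True  # never suppress the record, only annotate it

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s %(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(payload):
    correlation_id.set(str(uuid.uuid4()))  # one id per incoming request
    logger.info("order received items=%s", payload["items"])
```

&lt;p&gt;Grepping the aggregated logs for one id then reconstructs the request's full path, which is usually all an investigation needs.&lt;/p&gt;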

&lt;p&gt;Failures should carry their own context. When an operation fails, the error log should include what was being attempted, what went wrong, and enough identifying information to correlate it. What gets logged must be intentional. You know the domain, so you know the potential inputs, what's valid, and what's sensitive. That knowledge lets you author a safe context: enough to reproduce the problem without exposing data that shouldn't be in a log. If you don't understand the domain well enough to make that distinction, that's the source of the problem, not the logging infrastructure. Trace-level logging has its place for diagnosing specific flows when you can toggle it on temporarily, but it shouldn't be your primary mechanism for understanding what your system did.&lt;/p&gt;

&lt;p&gt;The difference between a useful error and a useless one is whether someone authored the context intentionally or hoped that raw volume would cover it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Black Box Test
&lt;/h2&gt;

&lt;p&gt;Classification and context are design decisions, but most developers never test whether their logging actually answers the questions it needs to. One reason is the debugger habit. When something behaves unexpectedly, the instinct is to attach a debugger, set breakpoints, and step through execution rather than read the outputs.&lt;/p&gt;

&lt;p&gt;Some organizations extend this habit into production with remote debugging, but that's a security liability. Direct access to a running container, or any production process, exposes the environment regardless of the layer. You should be observing system outputs, not attaching to live processes.&lt;/p&gt;

&lt;p&gt;Production should be a black box. If your default instinct when something breaks is to attach a debugger rather than read the outputs, you'll never feel the pressure to make those outputs useful. The classification stays sloppy, the context stays thin, and the errors stay vague. Not because you don't know better, but because you've never needed better.&lt;/p&gt;

&lt;p&gt;Developers who diagnose from observable behavior, whether testing locally against containerized dependencies or against remote systems, build the discipline naturally. They feel the pain of vague errors and missing context firsthand, and they fix it at the source because they have no other option.&lt;/p&gt;

&lt;p&gt;The practical test is straightforward: when something breaks, can you diagnose it from the system's outputs alone? Or do you need to add logging, redeploy, and wait for it to happen again? If the answer is the latter, your code doesn't explain itself yet.&lt;/p&gt;

&lt;p&gt;That core discipline compounds when builders own what they operate. You don't log payment declines as errors when you're the one who gets paged for "high error rate on payment service." You don't dump verbose logs instead of authoring context when you're the one parsing them at 3 AM. The feedback loop between writing code and living with it in production is what makes classification honest, context intentional, and alerts worth waking up for.&lt;/p&gt;

&lt;p&gt;Better tooling alone won't create that loop; only ownership will.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>devops</category>
      <category>architecture</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>How Shared Libraries Become Shared Shackles</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Tue, 03 Feb 2026 16:36:37 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/how-shared-libraries-become-shared-shackles-3f5o</link>
      <guid>https://forem.com/stevenstuartm/how-shared-libraries-become-shared-shackles-3f5o</guid>
      <description>&lt;p&gt;This is a highly opinionated take on shared libraries and the damage they do to team autonomy and development tempo.&lt;/p&gt;

&lt;p&gt;Teams deliver value faster and more consistently when they can make decisions, ship changes, and evolve their domains without coordinating across organizational boundaries. Shared libraries erode exactly that independence.&lt;/p&gt;

&lt;p&gt;The principle applies anywhere domains and teams need independence, but this post focuses on distributed architectures because that's where the consequences are most severe. When independently deployable components, owned and operated by different teams, get bound together by shared packages, those packages undermine the very independence the architecture was designed to provide.&lt;/p&gt;

&lt;p&gt;After watching costs explode for trivial tasks and critical production updates miss their deadlines in nearly every organization I have worked in, I am willing to take a rather "extreme" stance on the subject.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Libraries Violate Core Principles
&lt;/h2&gt;

&lt;p&gt;Distributing components isn't just about distributing work. It's about the Single Responsibility Principle applied at the system level: clear ownership, implementation isolation, and infrastructural independence. These benefits are often implicit in the decision to distribute, but they're the whole point. The share-nothing principle makes this explicit. Services should be autonomous, independently deployable, and free from implementation coupling. When services share nothing, teams can deploy, scale, and evolve on their own terms, at their own tempo.&lt;/p&gt;

&lt;p&gt;Shared libraries violate these principles. They couple teams through shared implementation despite being distributed in name, creating little monoliths that bind development tempo across teams that were meant to operate independently. What's at stake isn't code organization; it's each team's ability to make decisions, ship changes, and evolve their domain without waiting on teams that have different priorities and different timelines.&lt;/p&gt;

&lt;p&gt;Yet the pitch keeps coming: "We have this code in three places. Let's consolidate it into a shared library. We'll save time, ensure consistency, and make everyone's life easier." It sounds reasonable, it really does, yet it ignores decades of architectural pain and lessons learned. The decision only calculates the cost of duplication while potentially ignoring or incorrectly calculating the cost of sharing across teams, domains, and technical boundaries.&lt;/p&gt;

&lt;p&gt;There are important distinctions to draw here, like external libraries versus internal ones, SDKs versus shared packages, and whether this applies beyond distributed systems. We'll address all of those. But first, the costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Costs Nobody Calculates
&lt;/h2&gt;

&lt;p&gt;When someone proposes a shared library, they calculate the savings: "This code exists in five services. If we consolidate, we only maintain it once."&lt;/p&gt;

&lt;p&gt;What they don't always sufficiently calculate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version conflicts and upgrade pain.&lt;/strong&gt; Five teams now depend on your library. They release on different cadences, and at some point one or more of them require a breaking change. Now you're either maintaining multiple versions indefinitely or forcing upgrades on teams that have other priorities. The "one place to maintain" becomes "one place that blocks everyone."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teams blocked waiting for changes.&lt;/strong&gt; A team needs functionality the library doesn't have. They can't just add it. They need to coordinate with the library owners, get the change approved, wait for a release, and then upgrade. What would have been a two-hour change becomes a two-week dependency chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging across boundaries.&lt;/strong&gt; When something breaks, the investigation now spans your code and the library code. Your team doesn't own the library. Maybe they don't fully understand it. The abstraction that was supposed to simplify their lives has added a layer they have to dig through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bloat or fragmentation, pick your poison.&lt;/strong&gt; The library starts focused. Then another team needs something slightly different. Then another. The library accumulates features to serve multiple masters, becoming a grab-bag of loosely related functionality coupled together because they share a package, not because they belong together. The disciplined alternative is to split it into many small, focused packages, but that creates its own problem: an entourage of dependencies that each consuming team must track, version, and coordinate with. Instead of one bloated library blocking you, ten focused ones collectively recreate the same burden.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obscured accountability.&lt;/strong&gt; Shared libraries don't reduce your quality burden; they move it somewhere less visible. If the library has a bug, your service has a bug. Every service still needs its own load testing, chaos testing, penetration testing, and UAT regardless of whether the underlying code is shared or duplicated. The library doesn't absorb responsibility for your service's behavior. It just adds a dependency you don't own and can't fully verify.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cohesion and Coupling Diagnosis
&lt;/h2&gt;

&lt;p&gt;If two services genuinely need the same function, you have three possibilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's a cohesion problem.&lt;/strong&gt; That function belongs in one place and should be called, not duplicated. Extract it into a service with an API. Now there's a clear owner, a clear contract, and no shared implementation coupling consumers together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's a coupling problem.&lt;/strong&gt; You've drawn your boundaries wrong. The services that "need" the same code are actually more related than you thought. Reconsider where the boundary belongs rather than papering over the boundary violation with a shared dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's genuinely independent.&lt;/strong&gt; The similarity is coincidental. Both services need to format dates or parse JSON or validate email addresses. Copy the code. Move on. The duplication costs less than the coordination, and the implementations can evolve independently as each service's needs diverge.&lt;/p&gt;

&lt;p&gt;A shared library is almost never the right answer because the problem it solves (duplicated code) rarely justifies the problems it creates (coupling, versioning, blocked teams).&lt;/p&gt;

&lt;p&gt;The common rebuttal is "but if there's a bug, I fix it once and it propagates everywhere." Consider what code that would actually be in a well-architected distributed system. Cross-cutting concerns like logging, networking, and observability are handled by infrastructure through sidecars and service meshes. Security is already an acknowledged exception. Third-party libraries have their own maintenance cycles. What remains is business logic, and if your business logic is so coupled across services that a single bug requires simultaneous fixes everywhere, you don't have a sharing problem, you have a boundary problem, which brings you back to the diagnosis above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't Reinvent the Wheel vs. Don't Share Internal Types
&lt;/h2&gt;

&lt;p&gt;There's a meaningful distinction between using established external libraries and sharing internal abstractions.&lt;/p&gt;

&lt;p&gt;Using mature, well-tested libraries for universal problems makes sense. Logging frameworks, HTTP clients, serialization libraries, and authentication middleware exist because these problems are universal and well-understood. Someone else solved them better than you would, and the cost of depending on their solution is low because the solution is stable.&lt;/p&gt;

&lt;p&gt;Sharing your internal &lt;code&gt;CustomerDto&lt;/code&gt; across services is different. Sharing your "standard" repository pattern is different. Sharing your domain models between bounded contexts is different. These aren't universal problems with stable solutions. They're your internal abstractions, and forcing them on other teams assumes those teams should think the same way you do.&lt;/p&gt;

&lt;p&gt;The distinction matters: external libraries abstract universal problems. Internal shared libraries impose your specific mental model on teams that might have legitimately different needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  SDKs Are Different
&lt;/h2&gt;

&lt;p&gt;There's also an important distinction between shared libraries and SDKs published for external consumers.&lt;/p&gt;

&lt;p&gt;An SDK abstracts what you expose: the public contract of a service or platform. A good SDK earns its existence by encoding integration complexity that would be expensive and error-prone for every consumer to reimplement: orchestrating multi-step workflows, managing state across API calls, handling idempotency, and abstracting version differences. The value isn't hiding HTTP calls (documentation handles that); it's centralizing integration logic complex enough to justify the maintenance cost across supported runtimes.&lt;/p&gt;

&lt;p&gt;An SDK also has a different lifecycle. The platform is built first; the SDK comes afterward for a different audience. Its development and release cycles are separate from the internal teams building features, because the dynamics with external customers differ from the dynamics between internal teams.&lt;/p&gt;

&lt;p&gt;A shared library abstracts how you think internally: your domain models, your patterns, your "standard way" of doing things. It exists because someone decided other teams should think the same way. The shared library serves a governance impulse, not the consumer. And unlike an SDK, it tries to couple internal teams to the same release cycle and the same implementation decisions.&lt;/p&gt;

&lt;p&gt;The SDK says: "Here's how to use our thing."&lt;br&gt;
The shared library says: "Here's how you should build your thing."&lt;/p&gt;

&lt;p&gt;One is a service to consumers. The other is an imposition on autonomous teams disguised as help.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Runtime Already Solved This
&lt;/h2&gt;

&lt;p&gt;The shared library pitch often targets "utility code" that your runtime already provides. If you're using .NET, the framework gives you HTTP clients, JSON serialization, logging abstractions, dependency injection, and configuration management. Why would you need an internal shared library wrapping &lt;code&gt;HttpClient&lt;/code&gt; when &lt;code&gt;HttpClient&lt;/code&gt; exists and is battle-tested by millions of applications?&lt;/p&gt;

&lt;p&gt;The urge to share usually targets exactly this kind of code: wrappers, helpers, and utilities that add a thin layer over framework primitives. But the framework primitives are already shared. They're already tested. They're already documented. Your wrapper just adds coordination overhead on top of something that didn't need wrapping.&lt;/p&gt;
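&lt;p&gt;As a sketch of the point above, here is roughly what calling an internal API with framework primitives alone looks like. The endpoint URL and &lt;code&gt;WeatherDto&lt;/code&gt; shape are invented for illustration; the HTTP call, JSON mapping, and cancellation are all handled by &lt;code&gt;HttpClient&lt;/code&gt; and &lt;code&gt;System.Net.Http.Json&lt;/code&gt;, leaving nothing for a shared "HttpHelper" wrapper to add but coordination overhead.&lt;/p&gt;

```csharp
// Minimal sketch: the URL and DTO are illustrative assumptions.
// GetFromJsonAsync ships with .NET 5+; no internal wrapper required.
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.Tasks;

public record WeatherDto(string City, double TempC);

public static class WeatherCalls
{
    // One line covers the request, deserialization, and cancellation.
    public static Task<WeatherDto?> GetAsync(HttpClient http, string city, CancellationToken ct = default) =>
        http.GetFromJsonAsync<WeatherDto>($"https://weather.internal/v1/{Uri.EscapeDataString(city)}", ct);
}
```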

&lt;p&gt;This varies by ecosystem. For example, Python's dependency management is notoriously painful, and shared internal libraries compound the problem. You're coordinating versions across teams in an ecosystem that already struggles with version conflicts. The runtime that makes sharing easiest is often the one where sharing is least necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Principle Is Broader Than Distribution
&lt;/h2&gt;

&lt;p&gt;An obvious question: if shared libraries are a problem in distributed systems, were they also a problem in the modular monolith that preceded them?&lt;/p&gt;

&lt;p&gt;Wherever different teams own different domains, yes. In a modular monolith, shared packages between domains still couple teams to the same change cycles. The difference is severity. In a monolith, the blast radius is contained: teams share a deployable and version conflicts manifest as build errors rather than runtime failures. The pain is real but manageable. In a distributed system, that same coupling spans deployment pipelines, release cadences, and versioning strategies. A change that would have been a merge conflict in a monolith becomes a multi-team coordination effort with blocked releases and stale dependencies.&lt;/p&gt;

&lt;p&gt;Layered architectures sidestep this by design because layers already enforce separation; sharing across layers is a violation of the architecture itself, not a shared library problem. But in domain-oriented architectures, the discipline matters regardless of deployment topology. If Domain A and Domain B need to evolve independently, coupling them through shared implementation undermines that independence whether they're projects in the same solution or services in different repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  No Architecture Style Wants This
&lt;/h2&gt;

&lt;p&gt;The shared library pitch assumes that code reuse across boundaries is inherently valuable. But examine any coherent architectural paradigm and the opposite becomes clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layered architecture&lt;/strong&gt; separates concerns into distinct layers. If your presentation layer and your data layer share a library, you've coupled what you explicitly designed to be independent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain-driven architecture&lt;/strong&gt; creates autonomous domains with clear boundaries. If Domain A and Domain B share implementation code, they're not really autonomous. They're a distributed monolith with extra steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional/technical architecture&lt;/strong&gt; defines components accessed through explicit interfaces. The behavior should live in a component that others call, not in a library that everyone imports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Polyglot architectures make it worse.&lt;/strong&gt; The shared library pitch assumes a homogeneous technology landscape that rarely exists. If your organization has services in C#, Java, Python, and Go, do you maintain four versions of every shared library and keep them in sync? In polyglot environments, the "shared" library becomes a second-class citizen in every language except the one the authoring team actually uses. The promise of consistency becomes a guarantee of inconsistency across language boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The API Client Library Obsession
&lt;/h2&gt;

&lt;p&gt;The most common incarnation of shared library dysfunction is the API client package: a library containing contracts, DTOs, and client code that consumers are expected to import when calling your service. I have never seen this pattern result in anything but chaos.&lt;/p&gt;

&lt;p&gt;The pitch sounds reasonable: "We'll publish a client library so consumers don't have to write their own HTTP calls or define their own contracts." But this solves a problem that doesn't exist while creating several that do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every API should have documentation describing its contracts.&lt;/strong&gt; If your API is well-documented with clear schemas, consumers can generate or write their own clients trivially. The documentation is the contract. A client library doesn't replace documentation; it's a poor substitute for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every consumer has different needs.&lt;/strong&gt; Service A might need three fields from one endpoint. Service B might need ten fields from a different endpoint. Service C might need to call the same endpoint but transform the response differently. When you force everyone to use your client library, you're imposing your view of how your API should be consumed. But consumers know their own needs better than you do.&lt;/p&gt;
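&lt;p&gt;Writing a consumer-owned mapping against a documented contract is a small amount of code. In this hypothetical sketch (the &lt;code&gt;orders&lt;/code&gt; endpoint and field names are assumptions), the producer's documented response may carry dozens of fields, but the consumer models only the three it actually uses:&lt;/p&gt;

```csharp
// Hypothetical sketch: a consumer reads the documented contract and maps
// only what it needs. The field names are illustrative assumptions.
using System;
using System.Text.Json;

// The consumer's own view of the order, not the producer's full DTO.
public record OrderView(string Id, decimal Total, string Status);

public static class OrderMapper
{
    public static OrderView FromJson(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        // Unknown or extra fields in the response are simply ignored.
        return new OrderView(
            root.GetProperty("id").GetString()!,
            root.GetProperty("total").GetDecimal(),
            root.GetProperty("status").GetString()!);
    }
}
```

&lt;p&gt;When the producer adds fields, this consumer doesn't need a new package version; when this consumer needs a different shape, it changes its own mapping.&lt;/p&gt;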

&lt;p&gt;&lt;strong&gt;Client libraries confuse application concerns with infrastructure concerns.&lt;/strong&gt; Teams building client libraries inevitably add caching strategies, retry policies, circuit breakers, and connection pooling configurations. These aren't client concerns. They're infrastructure concerns that belong in service meshes, sidecars, and API gateways where they can be configured, observed, and tuned without redeploying applications.&lt;/p&gt;

&lt;p&gt;A client library buries these decisions in application code where they're invisible to operations and impossible to change without a coordinated release across every consumer. The library author predicts traffic patterns and failure modes as if every consumer will behave identically. They won't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The absurdity becomes obvious with frontend consumers.&lt;/strong&gt; Nobody would publish an npm package for their React app to import API contracts, or a Swift package for iOS. Frontend teams read documentation, call endpoints, and map responses to whatever structures suit their application. Backend services have the same needs. The consumer's requirements don't change based on what language they're written in.&lt;/p&gt;

&lt;p&gt;This reflexive reach for client libraries has been conditioned by years of cargo-culting patterns from contexts where they made sense (public cloud SDKs with complex auth flows) into contexts where they don't (internal services with straightforward REST endpoints). It's a tax on every consumer and a maintenance burden on every producer, justified by an efficiency that never materializes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Governance Theater Problem
&lt;/h2&gt;

&lt;p&gt;Shared libraries often emerge from a governance impulse: "Teams are doing things inconsistently. We need to standardize."&lt;/p&gt;

&lt;p&gt;The instinct isn't wrong, and consistency matters. But shared libraries are governance theater. They create the appearance of consistency without addressing the underlying problem.&lt;/p&gt;

&lt;p&gt;If teams are building things inconsistently, the question is why. Usually it's because they don't share the same understanding of what matters, what the tradeoffs are, and what "good" looks like. That's an alignment problem. It requires conversation, documentation, and shared values.&lt;/p&gt;

&lt;p&gt;Forcing everyone to use the same library doesn't create alignment. It creates compliance. Teams will use your library and still build inconsistent systems, because the library doesn't encode the thinking behind it.&lt;/p&gt;

&lt;p&gt;Governance through values: "Here's why we authenticate this way, here are the tradeoffs, here's what we're optimizing for. Align your implementation to these principles."&lt;/p&gt;

&lt;p&gt;Governance through code: "Use this library or you're non-compliant."&lt;/p&gt;

&lt;p&gt;The first creates alignment while preserving autonomy. Teams understand the principles and can make good decisions in novel situations. The second creates coupling while providing the illusion of alignment. Teams comply without understanding, and the moment they hit a situation the library doesn't cover, they're lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Exception: Security Protocols
&lt;/h2&gt;

&lt;p&gt;There's one domain where shared libraries make sense: security protocols such as ingress handling, service-to-service authentication, and encryption standards.&lt;/p&gt;

&lt;p&gt;Why security is different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The domain is stable and well-understood.&lt;/strong&gt; Authentication patterns don't change week to week. The library doesn't need constant evolution to serve its consumers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cost of getting it wrong is catastrophic.&lt;/strong&gt; Security isn't a place for teams to make independent decisions and learn from mistakes. The blast radius is too large.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The surface area is thin and focused.&lt;/strong&gt; A good security library does one thing. It's not a grab-bag of utilities that grows to serve multiple purposes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomy isn't the goal.&lt;/strong&gt; You actually want teams to do security the same way. The coupling is a feature, not a bug.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even here, the library should be as minimal as possible. Provide the security primitive and get out of the way. The moment it starts accumulating "helpful" utilities beyond its core purpose, it's sliding toward the problems that plague other shared libraries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do Instead
&lt;/h2&gt;

&lt;p&gt;When you feel the urge to create a shared library, pause and diagnose the actual problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it's a capability multiple services need:&lt;/strong&gt; Build a service, not a library. Expose an API. Now there's clear ownership, independent deployment, and consumers that can't get version-locked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it's a pattern you want to standardize:&lt;/strong&gt; Write documentation. Explain the principles, the tradeoffs, and the reasoning. Let teams implement the pattern in their own codebases. They'll understand it better than if they'd just imported your abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it's truly just duplicated code:&lt;/strong&gt; Let it be duplicated. The coordination cost of sharing exceeds the maintenance cost of duplication. And the duplicates can evolve independently as needs diverge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it's a security primitive:&lt;/strong&gt; Fine. Build the library. Keep it minimal, stable, and focused. Recognize it's a necessary evil, not a model to emulate.&lt;/p&gt;

&lt;p&gt;The shared library is a solution to a problem that rarely exists in the form people imagine. Code duplication isn't what slows teams down. Coordination overhead is. Obsessing over shared code compliance and version alignment diverts attention from what actually produces consistency: shared understanding of principles, tradeoffs, and what "good" looks like. Teams that understand the reasoning make good decisions without needing a library to make decisions for them.&lt;/p&gt;

&lt;p&gt;Share values, and the shared library usually becomes unnecessary.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Making Invalid States Unrepresentable: The Billion-Dollar Mistake That Wasn't</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Thu, 29 Jan 2026 23:39:45 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/making-invalid-states-unrepresentable-the-billion-dollar-mistake-that-wasnt-2b6m</link>
      <guid>https://forem.com/stevenstuartm/making-invalid-states-unrepresentable-the-billion-dollar-mistake-that-wasnt-2b6m</guid>
      <description>&lt;p&gt;The billion-dollar mistake. That's what Tony Hoare called his invention of the null reference in 1965. The quote gets repeated so often that "null is dangerous" has become conventional wisdom, especially among entry-level and intermediate developers who hear it as dogma without understanding the context or the alternatives that can be far worse.&lt;/p&gt;

&lt;p&gt;But I think we're blaming the wrong villain. Null may have saved far more than it ever cost. Every null reference exception that crashed a system may also have prevented that system from proceeding with corrupted data and invalid logical decisions. The billion-dollar mistake framing counts the crashes but ignores the corruption that never happened. You can count the cost of bug fixes, but what about the disasters those "bugs" prevented?&lt;/p&gt;

&lt;p&gt;Information security professionals value the CIA triad of Confidentiality, Integrity, and Availability. Software developers tend to obsess over availability, and that's understandable since a crashed service is visible, embarrassing, and can violate business SLAs. But integrity failures are far worse. Data that looks valid but isn't can corrupt your system just as surely as SQL injection or a man-in-the-middle attack. The corruption just compounds slower and is harder to detect. This is the lens through which the null debate should be understood.&lt;/p&gt;

&lt;p&gt;The most common way that developers try to avoid the null issue entirely is to implement default values. However, default values may allow invalid data to flow silently through the system until it corrupts something important. Null crashes loudly at the point of misuse, which is an availability problem you can see and fix. A default value that masks missing data? That proceeds quietly until it causes a security vulnerability, a financial miscalculation, or data corruption that might require weeks to detect and more to fix.&lt;/p&gt;
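&lt;p&gt;A minimal sketch of the contrast, assuming a hypothetical pricing calculation (the names are illustrative): the defaulted version lets a missing price silently become a free line item, while the strict version fails loudly at the point of misuse.&lt;/p&gt;

```csharp
// Illustrative sketch: the pricing scenario is an assumption, not from the
// article. A default masks missing data; the strict version fails loudly.
using System;

public static class Pricing
{
    // Default value: a missing price silently becomes a free line item.
    public static decimal WithDefault(decimal? unitPrice, int qty) => (unitPrice ?? 0m) * qty;

    // No default: missing data fails at the first point of misuse,
    // an availability problem you can see and fix.
    public static decimal Strict(decimal? unitPrice, int qty) =>
        unitPrice is decimal p ? p * qty
        : throw new InvalidOperationException("unitPrice missing");
}
```

&lt;p&gt;&lt;code&gt;Pricing.WithDefault(null, 3)&lt;/code&gt; quietly returns &lt;code&gt;0&lt;/code&gt; and the invoice looks plausible; &lt;code&gt;Pricing.Strict(null, 3)&lt;/code&gt; throws immediately instead of letting the corruption compound.&lt;/p&gt;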

&lt;h2&gt;
  
  
  What "Making Invalid States Unrepresentable" Actually Means
&lt;/h2&gt;

&lt;p&gt;The phrase comes from type theory and functional programming, but the concept is practical: structure your data so that invalid combinations cannot exist. Invalid states should fail at construction time, not at runtime deep in business logic.&lt;/p&gt;

&lt;p&gt;Consider a user registration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserRegistration&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Email&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Password&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This class allows every invalid state imaginable. Empty email, empty password, any combination. The defaults make it easy to construct an object that looks valid but isn't. Code that receives this object has no way to know whether the empty string represents "not provided" or "explicitly set to empty" or "bug in upstream code."&lt;/p&gt;

&lt;p&gt;Compare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserRegistration&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Email&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Password&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;UserRegistration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ArgumentException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Email is required"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;nameof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ArgumentException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Password is required"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;nameof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

        &lt;span class="n"&gt;Email&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;Password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now invalid states cannot be constructed. There's no default email that masks a missing value. There's no way to create a registration without providing required data. The validation happens once, at the boundary, and everything downstream can trust the object is valid.&lt;/p&gt;

&lt;p&gt;C# 11 introduced the &lt;code&gt;required&lt;/code&gt; keyword, which moves this enforcement to compile time for simpler cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserRegistration&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Email&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Password&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compiler refuses to let you construct a &lt;code&gt;UserRegistration&lt;/code&gt; without setting both properties. This is "making invalid states unrepresentable" at its purest: the invalid state literally cannot be expressed in code that compiles.&lt;/p&gt;

&lt;p&gt;The distinction between &lt;code&gt;required&lt;/code&gt; and constructor validation is straightforward: &lt;code&gt;required&lt;/code&gt; enforces &lt;em&gt;presence&lt;/em&gt;, constructors enforce &lt;em&gt;validity&lt;/em&gt;. Use &lt;code&gt;required&lt;/code&gt; when presence is all you need. Use constructors when you need validation logic, like checking that the email contains an &lt;code&gt;@&lt;/code&gt; or that the password meets complexity requirements.&lt;/p&gt;

&lt;p&gt;However, &lt;code&gt;required&lt;/code&gt; works best for internal domain objects and state that you control. Enforcement at system boundaries is uneven: System.Text.Json honors &lt;code&gt;required&lt;/code&gt; on .NET 7 and later and fails deserialization when a member is missing, but Newtonsoft.Json, most ORMs, some dependency injection containers, and mocking frameworks materialize objects through parameterless constructors and reflection, bypassing the compile-time enforcement entirely. For API contracts and external data, you still need runtime validation with &lt;code&gt;[Required]&lt;/code&gt; attributes or explicit checks. The compiler enforcement is powerful, but it only reaches code paths that go through normal construction.&lt;/p&gt;
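&lt;p&gt;As a quick illustration of that unevenness, System.Text.Json on .NET 7 or later does honor the modifier during deserialization and throws when a required member is absent. This is a minimal sketch assuming that runtime; other serializers behave differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Text.Json;

// .NET 7+: System.Text.Json enforces `required` members at deserialization time
try
{
    JsonSerializer.Deserialize&amp;lt;UserRegistration&amp;gt;("{\"Email\":\"a@b.com\"}");
}
catch (JsonException)
{
    // thrown because the payload omits the required member 'Password'
}

public class UserRegistration
{
    public required string Email { get; init; }
    public required string Password { get; init; }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;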

&lt;h3&gt;
  
  
  Where the Problem Usually Starts: API Contracts
&lt;/h3&gt;

&lt;p&gt;The domain model above is clean, but most developers encounter this tension at the API boundary first. Consider a typical request DTO:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreateUserRequest&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Email&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Password&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;RegisterForAlerts&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;[Required]&lt;/code&gt; attribute signals intent, but the developer adds &lt;code&gt;= ""&lt;/code&gt; out of habit or a misguided sense of defensive coding. Now there's a contradiction: the attribute says "required" while the code says "default to empty string."&lt;/p&gt;

&lt;p&gt;The default also doesn't behave the way most developers expect. When the JSON omits a field, the deserializer still runs the property initializer, so the object materializes with &lt;code&gt;Email&lt;/code&gt; set to an empty string rather than null. &lt;code&gt;[Required]&lt;/code&gt; happens to reject empty strings by default, so framework validation catches the API case, but every path that skips validation (internal construction, test fixtures, manual mapping) sees a value that looks deliberately set. The default guarantees that a bypassed check yields silent emptiness instead of a loud null, and &lt;code&gt;[Required]&lt;/code&gt; is runtime validation, not compile-time enforcement.&lt;/p&gt;

&lt;p&gt;The fix is simple: don't add the default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreateUserRequest&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Email&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Password&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;//The business could decide that this should default to false but test that assumption first!&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;RegisterForAlerts&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;[Required]&lt;/code&gt; attribute ensures the framework validates these fields before your code ever touches them. If validation is bypassed and the property is accessed while null, you get a null reference exception rather than an empty string that looks valid. Null forces handling; the empty string would have propagated silently.&lt;/p&gt;
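&lt;p&gt;The same check can be exercised directly, outside the framework's model-binding pipeline, with &lt;code&gt;Validator&lt;/code&gt; from &lt;code&gt;System.ComponentModel.DataAnnotations&lt;/code&gt; (a minimal sketch; the DTO here uses nullable strings at the ingress boundary, per the guideline later in this article):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

var request = new CreateUserRequest();  // Email and Password are left null
var results = new List&amp;lt;ValidationResult&amp;gt;();
bool isValid = Validator.TryValidateObject(
    request, new ValidationContext(request), results, validateAllProperties: true);
// isValid is false; results carries one error per missing [Required] field

public class CreateUserRequest
{
    // Nullable at the ingress boundary; [Required] rejects null at runtime
    [Required] public string? Email { get; set; }
    [Required] public string? Password { get; set; }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;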

&lt;h2&gt;
  
  
  Why Defaults Are More Dangerous Than Null
&lt;/h2&gt;

&lt;p&gt;Default values create four categories of problems that null avoids.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent propagation of invalid state.&lt;/strong&gt; When a required field defaults to an empty string or zero, the invalid state propagates through the system. Each layer assumes the previous layer validated the data. Nobody validated it because it never looked invalid. The corruption accumulates until something finally breaks far from the source.&lt;/p&gt;

&lt;p&gt;Consider a payment processing system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PaymentRequest&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;decimal&lt;/span&gt; &lt;span class="n"&gt;Amount&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Currency&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"USD"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;MerchantId&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A bug upstream fails to set the amount. The payment proceeds with &lt;code&gt;Amount = 0&lt;/code&gt;. No crash, no exception, no alert. The transaction logs show a valid-looking payment. Days later, someone notices revenue is wrong. The investigation takes hours because nothing obviously failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambiguous semantics.&lt;/strong&gt; Does &lt;code&gt;Amount = 0&lt;/code&gt; mean "free transaction," "not set," or "bug"? Does &lt;code&gt;Email = ""&lt;/code&gt; mean "user declined to provide" or "form field wasn't rendered"? Default values overload meaning. Null is unambiguous: this value is absent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A default value claims knowledge it doesn't have. Null admits ignorance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This ambiguity becomes critical in update operations. When a client submits an update request, the API needs to distinguish between "set this field to empty" and "don't touch this field." Without null, there's no way to express that distinction. Entity Framework relies on this exact semantic: when you load an entity without its relationships, those navigation properties are null. EF interprets null as "not loaded, don't modify" rather than "delete all relationships." If null didn't exist, EF would need the client to re-submit every relationship on every update, or every API would need to accept key-value collections instead of typed objects. Null isn't just tolerable here; it's the simplest solution to a problem that has no good alternatives.&lt;/p&gt;
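&lt;p&gt;A hypothetical PATCH-style handler makes the distinction concrete (the types and names here are illustrative, not an Entity Framework API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;public class User
{
    public required string DisplayName { get; set; }
    public required string Bio { get; set; }
}

public class UpdateUserRequest
{
    public string? DisplayName { get; set; }  // null = "field omitted, don't touch"
    public string? Bio { get; set; }          // "" = "explicitly clear the bio"
}

public static class UserPatcher
{
    public static void Apply(UpdateUserRequest dto, User entity)
    {
        // Only apply fields the client actually sent; null means "leave as-is"
        if (dto.DisplayName is not null) entity.DisplayName = dto.DisplayName;
        if (dto.Bio is not null) entity.Bio = dto.Bio;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;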

&lt;p&gt;&lt;strong&gt;Validation bypass.&lt;/strong&gt; Code that checks &lt;code&gt;if (amount != null)&lt;/code&gt; correctly identifies missing data. Code that checks &lt;code&gt;if (amount != 0)&lt;/code&gt; conflates "missing" with "zero." Legitimate zero values become impossible to represent. Business logic contorts to handle the ambiguity that defaults introduced.&lt;/p&gt;
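&lt;p&gt;Making the amount nullable restores both meanings (a sketch reusing the illustrative &lt;code&gt;PaymentRequest&lt;/code&gt; shape from above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

public class PaymentRequest
{
    public decimal? Amount { get; set; }  // null = "never set"; 0m = "genuinely free"
}

public static class PaymentProcessor
{
    public static void Process(PaymentRequest request)
    {
        if (request.Amount is null)
            throw new ArgumentException("Amount was never provided", nameof(request));

        if (request.Amount == 0m)
        {
            // Legitimate zero is representable again: the free-transaction path
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;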

&lt;p&gt;&lt;strong&gt;Security vulnerabilities.&lt;/strong&gt; Default values can silently create security holes. Consider a &lt;code&gt;RateLimitPerMinute&lt;/code&gt; field that defaults to &lt;code&gt;0&lt;/code&gt;. In some systems, zero means "no limit," so a malformed request that should be rejected instead gets unlimited access. Or a &lt;code&gt;Permissions&lt;/code&gt; string that defaults to empty, which a downstream parser interprets as "inherit all permissions from parent." The request looked valid, passed through every layer, and granted access it shouldn't have. With null, the missing field would have forced explicit handling: reject the request, require the field, or make a conscious decision about what absence means.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Billion-Dollar Mistake
&lt;/h2&gt;

&lt;p&gt;Hoare called null his billion-dollar mistake, and the criticism was valid for its time. ALGOL W and its descendants treated every reference as implicitly nullable. There was no type-level distinction between "this can be null" and "this is never null." The compiler couldn't help you, and nothing forced developers to consider absence. In that context, null was dangerous because the type system provided no guardrails.&lt;/p&gt;

&lt;p&gt;But modern type systems solved this problem without eliminating null. C# 8.0 introduced nullable reference types that distinguish &lt;code&gt;string&lt;/code&gt; (never null) from &lt;code&gt;string?&lt;/code&gt; (might be null). Kotlin distinguishes &lt;code&gt;String&lt;/code&gt; from &lt;code&gt;String?&lt;/code&gt;. TypeScript has strict null checks. These languages preserve null's benefits while adding type safety. The billion-dollar mistake wasn't null itself; it was nullable references in type systems that didn't require handling.&lt;/p&gt;
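&lt;p&gt;With nullable reference types enabled, the C# compiler's flow analysis does the policing (a sketch; the warning numbers are the ones the compiler emits):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;#nullable enable
using System;

string name = null;       // warning CS8600: converting null to non-nullable type
string? maybe = GetNickname();

Console.WriteLine(maybe.Length);      // warning CS8602: possible null dereference

if (maybe is not null)
    Console.WriteLine(maybe.Length);  // no warning: flow analysis proved non-null

static string? GetNickname() =&amp;gt; null;  // nullability is part of the signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;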

&lt;blockquote&gt;
&lt;p&gt;Hoare's mistake wasn't inventing null. It was inventing null without inventing &lt;code&gt;string?&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The mistake we keep making today is different. It's the pattern of masking errors with defaults instead of failing fast. Every system that returned &lt;code&gt;-1&lt;/code&gt; instead of throwing an exception. Every API that substituted empty arrays for error responses. Every constructor that initialized required fields to placeholder values. These patterns hide bugs rather than reveal them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Counterarguments and When Defaults Make Sense
&lt;/h2&gt;

&lt;p&gt;This isn't a blanket condemnation of all default values. Some counterarguments deserve serious consideration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Null reference exceptions are the most common runtime error."&lt;/strong&gt; They are, and that's actually the point. Null crashes at the point of use when absence wasn't handled upstream, revealing the bug rather than hiding it. The frequency of null reference exceptions reflects how often code fails to handle absent values, not a flaw in null itself. The alternative isn't fewer bugs; it's bugs that manifest as data corruption instead of crashes.&lt;/p&gt;

&lt;p&gt;But a high frequency of null reference exceptions also signals something deeper: continuous misalignment between the development team and stakeholders about what the system should accept and produce. Unit tests exist to test assumptions and prove agreement in both application logic and API contracts. If null reference exceptions keep appearing, the team hasn't captured those agreements in tests, or the agreements themselves are unclear. The exceptions are symptoms of a collaboration problem, not just a coding problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Option/Maybe types are strictly better than null."&lt;/strong&gt; For representing intentional absence, they genuinely are better. &lt;code&gt;Option&amp;lt;User&amp;gt;&lt;/code&gt; makes it explicit that a user might not exist, and pattern matching forces you to handle both cases. Functional programmers rightly point out that &lt;code&gt;Option.getOrElse(default)&lt;/code&gt; is a code smell because the whole point is to force handling, not to provide an escape hatch.&lt;/p&gt;

&lt;p&gt;But this proves my argument rather than refuting it. Option types work precisely because they make you handle absence explicitly. They fail at compile time if you ignore the &lt;code&gt;None&lt;/code&gt; case. That's the same principle I'm advocating: force handling, don't mask absence. The problem isn't null versus Option. It's whether your system forces you to confront missing data or lets you paper over it. An Option that returns a default value on &lt;code&gt;None&lt;/code&gt; has the same problem as a nullable field with a default. Most runtimes depend on null, and used correctly with modern type systems, it fulfills the same purpose that Option types serve in functional languages.&lt;/p&gt;
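&lt;p&gt;The discipline Option types enforce can be sketched in C# with records and pattern matching (a hand-rolled illustration, not a library API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

public abstract record OptionalEmail;
public sealed record Some(string Value) : OptionalEmail;
public sealed record None : OptionalEmail;

public static class Greeter
{
    public static string Describe(OptionalEmail email) =&amp;gt; email switch
    {
        Some s =&amp;gt; $"Email: {s.Value}",
        None =&amp;gt; "No email provided",  // the caller must decide what absence means
        _ =&amp;gt; throw new InvalidOperationException("unreachable")
    };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;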

&lt;p&gt;&lt;strong&gt;"Defensive programming means providing safe defaults."&lt;/strong&gt; This conflates two different concerns. Resilience at system boundaries means handling malformed external input gracefully, but that's different from masking bugs internally. External APIs should validate input and return clear errors. Internal code should fail fast on invalid state. Providing "safe" defaults inside the system just moves the failure somewhere harder to diagnose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Users shouldn't see crashes."&lt;/strong&gt; Correct, which is why you handle errors at system boundaries. But the crash should still happen internally. Catch exceptions at the API layer, log the details, return a user-friendly error. The internal crash gave you the information to fix the bug. A silent default would have hidden it. And consider: even in runtimes that avoid null entirely, you still need this exception handling infrastructure. Network failures, file system errors, out-of-memory conditions, and database constraint violations all throw exceptions regardless of your null strategy. The boundary handling you need for those exceptional failures handles null reference exceptions too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Some fields genuinely have sensible defaults."&lt;/strong&gt; True. A &lt;code&gt;CreatedAt&lt;/code&gt; timestamp defaulting to &lt;code&gt;DateTime.UtcNow&lt;/code&gt; makes sense. A &lt;code&gt;RetryCount&lt;/code&gt; defaulting to &lt;code&gt;0&lt;/code&gt; represents legitimate initial state. The distinction is between defaults that represent valid initial state versus defaults that mask missing required data. Configuration values, counters, and timestamps often have legitimate defaults. User-provided data, external inputs, and required business fields typically don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exceptions vs. Result Types
&lt;/h2&gt;

&lt;p&gt;If failing loudly is the goal, why not use exceptions everywhere? The distinction is semantic: exceptions for bugs, Result types for expected outcomes.&lt;/p&gt;

&lt;p&gt;A null on a required field represents a violated constraint, something the system was promised it wouldn't receive. That's a bug. The correct response is to crash, log, and fix the code. A Result type represents an expected domain outcome: "user not found" or "validation failed" aren't bugs, they're legitimate results that correct code produced from valid input.&lt;/p&gt;

&lt;p&gt;Ask whether the failure represents a bug or a legitimate outcome. If correct code with valid input could produce this result, use a Result type. If not, fail fast with an exception. Both approaches force handling; neither lets you ignore failure and proceed with corrupted state. The danger is when either mechanism gets misused to mask absence: catching exceptions and substituting defaults, or calling &lt;code&gt;Result.GetValueOrDefault()&lt;/code&gt; without handling the failure case.&lt;/p&gt;
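&lt;p&gt;A minimal Result shape shows the contrast (an illustrative sketch, not a specific library):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

public abstract record LookupResult;
public sealed record Found(string Email) : LookupResult;
public sealed record NotFound(string Query) : LookupResult;

public static class Handler
{
    public static string Handle(LookupResult result) =&amp;gt; result switch
    {
        // Expected domain outcomes: both cases are handled, neither is a bug
        Found f =&amp;gt; $"Welcome back, {f.Email}",
        NotFound n =&amp;gt; $"No account matches {n.Query}",

        // A case nobody anticipated is a bug: fail fast, don't default
        _ =&amp;gt; throw new InvalidOperationException("Unhandled lookup result")
    };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;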

&lt;h2&gt;
  
  
  Construction vs. Consumption
&lt;/h2&gt;

&lt;p&gt;The confusion around null often stems from conflating two different phases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At construction time&lt;/strong&gt;, invalid states should fail immediately. Required fields should not have defaults that mask their absence. Validation should happen once, at the boundary, with clear errors for invalid input. Objects that exist should be valid by construction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At consumption time&lt;/strong&gt;, code shouldn't need to check validity. If an object exists, it's valid. The null checks happen at construction and boundaries. Internal code that receives a &lt;code&gt;UserRegistration&lt;/code&gt; shouldn't need to re-validate the email because the constructor already guarantees it's present and valid.&lt;/p&gt;

&lt;p&gt;This unsettles developers who've been taught to validate defensively at every layer. But spreading validation across layers is itself a source of bugs. When validation logic lives in the controller, the service, the repository, and the domain model, you've scattered what should be encapsulated business rules across your entire codebase. When validation rules change, you update three places and miss the fourth. When different layers implement slightly different rules, you get inconsistent behavior that's nearly impossible to debug. The same principle applies whether you prefer exceptions or Result types: you don't handle every possible failure at every layer. You propagate failures up to the correct boundary where they can be handled appropriately. Validation belongs at trust boundaries, not scattered throughout internal code that should be able to assume valid input.&lt;/p&gt;

&lt;p&gt;Null is dangerous when it appears unexpectedly in consumption code because construction failed to validate. Null is valuable when it represents intentional absence or when it forces construction to fail on invalid input.&lt;/p&gt;

&lt;p&gt;This doesn't mean a single validation layer. Systems have multiple trust boundaries: the API gateway, service boundaries, aggregate roots, database constraints. Each boundary validates what it needs to trust. The principle is that once data crosses a boundary and is accepted, code on the inside shouldn't re-validate it. Validate at each door, trust everyone inside that room.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Guidelines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enforce requirements at the correct layer.&lt;/strong&gt; At ingress boundaries (API DTOs, deserialization), fields may be nullable because input might be missing. After validation, domain objects should have non-nullable required fields because their existence proves validity. A nullable &lt;code&gt;int?&lt;/code&gt; signals "this is optional." If a value is required, it should be non-nullable in the domain model because validation already guaranteed its presence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reserve defaults for genuinely optional fields with valid initial states.&lt;/strong&gt; Retry counts, timestamps, configuration values, and accumulator fields often have legitimate defaults that represent real initial state, not masked absence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate at boundaries, trust internally.&lt;/strong&gt; System boundaries (API endpoints, message handlers, deserialization) should validate everything and reject invalid input. Internal code should trust that objects exist because they're valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefer crashes to silent corruption.&lt;/strong&gt; A null reference exception in development or staging catches bugs immediately. A default value that hides the bug lets it reach production and corrupt data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know when to use Result types versus exceptions.&lt;/strong&gt; When an operation might legitimately fail, use Result types. When something unexpected happens, fail fast with an exception.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failing Loudly Is a Feature
&lt;/h2&gt;

&lt;p&gt;The real billion-dollar mistake isn't null. It's the widespread practice of substituting defaults for validation, prioritizing code that runs over code that runs correctly. Given the choice between an availability problem you can see and fix, and an integrity problem that compounds invisibly until something important breaks, I'll take the availability problem every time.&lt;/p&gt;

</description>
      <category>security</category>
      <category>dotnet</category>
      <category>csharp</category>
      <category>designpatterns</category>
    </item>
    <item>
      <title>SEO Still Works, Just Not How We Hoped</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Mon, 26 Jan 2026 17:50:45 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/seo-still-works-just-not-how-we-hoped-3nij</link>
      <guid>https://forem.com/stevenstuartm/seo-still-works-just-not-how-we-hoped-3nij</guid>
      <description>&lt;p&gt;A few months ago, I read a post about how SEO is dead and that we need to let the past die and find greener pastures. I'm finally getting around to writing my thoughts on it. That contention was correct in many ways, but it also drew an undeserved binary between SEO and social media as two means to the same end. I think that framing misses the point.&lt;/p&gt;

&lt;p&gt;SEO does have a real problem. The timeline to build authority hasn't changed (it still takes 2-5 years), but the gap keeps widening. A startup in 2015 competed against sites with 5-10 years of accumulated authority. A startup in 2025 competes against sites with 15-25 years. Authority signals that once filtered spam now create insurmountable barriers for newcomers because incumbents have had decades to compound their advantages.&lt;/p&gt;

&lt;p&gt;And yet, SEO still works perfectly well for what it has become good at. Search for CNN, Mayo Clinic, or Amazon and established brands rank exactly as designed. When you need something tested and proven and you already know what you're looking for, SEO delivers authority and trust.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The answer isn't choosing between SEO and social channels. It's recognizing what each does well: social channels for democratic reach and speed, SEO for authority and trust.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you are just getting your start as a business or source of information, you should play all angles. Grow quickly through the more democratic and expedited nature of social platforms while positioning yourself to benefit from SEO's authority signals as they compound over time. The binary framing gets it wrong because it treats these as competing alternatives rather than complementary strategies serving different purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changed
&lt;/h2&gt;

&lt;p&gt;What changed isn't SEO's rules; it's business expectations. In the 2010s, investors tolerated long growth curves. Venture capital funded multi-year SEO strategies. By the early 2020s, interest rates had risen, funding contracted, and investors demanded profitability over growth. Show traction in 6 months or you're dead. SEO's timeline didn't adjust. The business environment did.&lt;/p&gt;

&lt;p&gt;So businesses fragmented discovery. When you need customers in 6 months and SEO takes 3 years, you adopt whatever works now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parasitic SEO&lt;/strong&gt;: Publish on Medium, LinkedIn, Substack, and dev.to to piggyback on domains that already rank. You sacrifice ownership, but gain immediate visibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social-first distribution&lt;/strong&gt;: TikTok, YouTube, and Instagram function as discovery mechanisms independent of traditional search. Engagement elevates content immediately rather than waiting years for authority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community platforms&lt;/strong&gt;: Discord, Reddit, and Hacker News provide direct access to target audiences without algorithmic intermediaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct advertising&lt;/strong&gt;: When organic timelines don't align with business needs, companies pay for visibility through Google Ads and social media advertising.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The flood toward alternative platforms doesn't prove SEO is broken; it proves businesses can't wait for authority to compound.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broken Promise of Democratization
&lt;/h2&gt;

&lt;p&gt;SEO never delivered on the promise of democratization. Tim Berners-Lee designed the web with an explicit rejection of gatekeeping, making it royalty-free and open so anyone could publish and be found. Google's PageRank promised the "democracy of the web", where community links would surface quality over editorial control.&lt;/p&gt;

&lt;p&gt;But those democratic "votes" became the gatekeeping mechanism they were supposed to replace. Links became currency to game. Authority signals made sense when the web was young and spam was rampant, but they compounded over time into insurmountable advantages for incumbents.&lt;/p&gt;

&lt;p&gt;The promise decayed into algorithmic gatekeeping that serves Google's advertising revenue. Search advertising generated approximately $175 billion for Google in 2023, roughly 58% of Alphabet's total revenue (&lt;a href="https://abc.xyz/investor/" rel="noopener noreferrer"&gt;Alphabet 2023 Annual Report&lt;/a&gt;). Google's search engine exists primarily to serve ads. This explains why Google killed Google Reader (RSS let readers bypass ad-serving pages), why AMP attempted to keep content within Google's ecosystem, and why search results increasingly feature Google-owned properties.&lt;/p&gt;

&lt;p&gt;This mirrors pre-internet gatekeeping, when only established publishers could reach mass audiences. The internet promised to democratize that access. SEO re-centralized it through algorithmic authority. The difference is that SEO's barriers are algorithmic and opaque. You can't argue with an algorithm or pitch your case to a human gatekeeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;The solution isn't "fix SEO" or "abandon SEO." SEO became constrained, but it still provides real value for authority and trust. The solution is treating discovery as genuinely multi-channel and funding it accordingly.&lt;/p&gt;

&lt;p&gt;Many organizations still treat alternative platforms as afterthoughts. They produce content for SEO, then repurpose scraps for social media and community engagement. This inverts the reality: if SEO takes years and businesses need reach now, alternative channels deserve primary investment, not leftover budget.&lt;/p&gt;

&lt;p&gt;The web has already started routing around authority-based discovery. TikTok, Discord, and Reddit elevate content through engagement rather than accumulated authority. YouTube prioritizes watch time over channel age. Hacker News can surface a blog post from an unknown developer based on upvotes alone. The alternatives exist. What's missing is the budget allocation to use them properly.&lt;/p&gt;

&lt;p&gt;SEO's trajectory follows the natural lifecycle of human systems: innovation solves a real problem, adoption scales it, power consolidates, incentives shift toward self-preservation, and the system decays until disruption restarts the cycle. SEO now optimizes for protecting its own integrity more than for democratizing discovery. But it still delivers authority and trust for established sources, and that value doesn't disappear just because the system became constrained. Whatever eventually disrupts SEO will follow the same path. Engagement-based ranking will get gamed. New gatekeepers will emerge. The cycle continues.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Take From This
&lt;/h2&gt;

&lt;p&gt;SEO has a real problem: it became more about protecting its own integrity than providing the discovery service it promised. Authority signals that once filtered spam now exclude newcomers. But SEO still excels at what it has become: a system for surfacing established, authoritative sources when you need something proven and trustworthy.&lt;/p&gt;

&lt;p&gt;The "SEO is dead" narrative gets it wrong by framing this as a binary choice. Social channels and SEO serve different purposes: social platforms provide democratic reach and speed while SEO provides authority and trust.&lt;/p&gt;

&lt;p&gt;There's another reason to pursue SEO excellence even if you won't rank soon: the standards themselves are a roadmap for good UX. Accessibility, mobile responsiveness, page speed, clear structure, quality content. These practices make your product better regardless of whether search engines reward you for them.&lt;/p&gt;

&lt;p&gt;Stop treating these as competing alternatives. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use social channels for democratic reach and speed; treat them as primary investment, not leftover scraps&lt;/li&gt;
&lt;li&gt;Build platform-native content for TikTok, YouTube, Discord, and Reddit&lt;/li&gt;
&lt;li&gt;Let SEO compound in the background as a long-term asset for authority and trust&lt;/li&gt;
&lt;li&gt;Recognize that newcomers need both: social to grow quickly, SEO to establish credibility over time&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>seo</category>
      <category>marketing</category>
      <category>socialmedia</category>
    </item>
    <item>
      <title>Why 'Tech Debt' Does Not Get Fixed</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Tue, 20 Jan 2026 23:49:04 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/why-tech-debt-does-not-get-fixed-1kh1</link>
      <guid>https://forem.com/stevenstuartm/why-tech-debt-does-not-get-fixed-1kh1</guid>
      <description>&lt;p&gt;Most engineering teams have a backlog of work they call "tech debt." Developers understand how it can slow down feature development, increase support costs, and threaten system stability. Yet when they bring these concerns to stakeholders, the work often stays deprioritized indefinitely. So why is that? Why would something so obviously important be ignored. In most cases, it is because the term 'tech debt' positions engineering work as backward-looking cleanup rather than forward-looking value creation. It's defensive, it's ambiguous, and it guarantees the work never gets prioritized.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Tech debt" is a self-fulfilling prophecy that perpetuates the communication gap that created it in the first place.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why "Tech Debt" Guarantees Deprioritization
&lt;/h2&gt;

&lt;p&gt;The metaphor creates the outcome everyone complains about by shaping how people think about and discuss the work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The term is too ambiguous to be actionable.&lt;/strong&gt; When someone says "tech debt," what do they actually mean? Intentional tradeoffs made under time constraints? Unanticipated consequences of reasonable decisions? Outright mistakes? Code that worked fine but is now outdated due to evolving requirements? The term conflates deliberate strategy with failure, which makes it impossible to have productive conversations about what to do next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The metaphor obscures actual costs and consequences.&lt;/strong&gt; Real financial debt has clear terms: borrow $100K at 5% interest, pay it back over 5 years. "Tech debt" has no such clarity. What's the interest rate? When is it due? What happens if we don't pay it? The metaphor lets everyone avoid confronting actual costs and timelines. Without clear costs, there's no urgency, and without clear consequences, there's no accountability. Stakeholders hear "the code is messy" and think "so what?" They don't hear "we're losing $50K per month in support costs because this implementation is brittle, and we can't ship the feature roadmap because every change breaks three other things."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The debt metaphor implies inevitability.&lt;/strong&gt; "We'll always accumulate some debt; that's just how software works." This defeatist framing makes people accept poor decisions as unavoidable rather than asking "why are we making decisions without enough information?" The term normalizes dysfunction instead of demanding clarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The term frames it as engineering's problem.&lt;/strong&gt; When you say "we have debt to pay down," stakeholders hear "you made a mess, now clean it up." This doesn't invite collaborative problem-solving. It creates an adversarial dynamic where engineering owns the problem and stakeholders reluctantly allocate time to "let them fix their mistakes."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing architectural context creates an assumption of incompetence.&lt;/strong&gt; When architectural decision records don't exist, future teams assume incompetence rather than recognizing intentional tradeoffs. The original context disappears: why this approach was chosen, what constraints existed at the time, what the intended evolution path was. Without that clarity, the current team either blindly perpetuates a bad pattern because they don't understand the original intent, or rewrites everything because they assume the previous team didn't know what they were doing. Both outcomes are expensive.&lt;/p&gt;

&lt;p&gt;Consider how different this looks with context. If the architect had documented "We chose NoSQL here because we needed to ship in 3 months with the team we had. The long-term design uses relational storage; we've isolated this behind an interface so we can swap it later without touching business logic," the team has a roadmap instead of a mystery. The architect becomes the translator between constraints, decisions, and evolution paths. Without that translation, the cycle repeats: poor communication creates problems, vague language prevents fixes, and the gap widens.&lt;/p&gt;

&lt;h2&gt;
  
  
  An Alternative: Categories That Communicate Impact
&lt;/h2&gt;

&lt;p&gt;Developers can keep using "tech debt" internally as shorthand within engineering teams, but when talking to product owners and stakeholders, retire the term entirely.&lt;/p&gt;

&lt;p&gt;One approach is to categorize work by business impact: &lt;strong&gt;Corrections&lt;/strong&gt;, &lt;strong&gt;Optimizations&lt;/strong&gt;, and &lt;strong&gt;Re-Alignments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Corrections: Problems Causing Harm Now&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Mistakes, tradeoffs, or outdated decisions actively harming the business right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security vulnerabilities exposing customer data&lt;/li&gt;
&lt;li&gt;Bugs causing support escalations or customer churn&lt;/li&gt;
&lt;li&gt;Reliability issues causing downtime or SLA violations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: Stakeholders already understand bugs and security problems as priorities because they're causing measurable harm today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language to use&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"We have a security vulnerability that exposes customer payment data. The fix takes 2 weeks."&lt;/li&gt;
&lt;li&gt;"This bug is costing us $30K per month in support escalations. Fixing it unblocks the support team."&lt;/li&gt;
&lt;li&gt;"The authentication service has 99.5% uptime. Our SLA guarantees 99.9%. The gap creates $100K annual credit exposure. Fixing the root cause takes 3 weeks and eliminates the SLA risk."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Corrections communicate urgency. The business is being hurt now, and addressing it stops the bleeding immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizations: Improving Efficiency and Cost&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Mistakes, tradeoffs, or outdated decisions affecting cost, performance, or efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database queries causing slow page loads (affecting conversion rates)&lt;/li&gt;
&lt;li&gt;Infrastructure configuration costing more than necessary (budget impact)&lt;/li&gt;
&lt;li&gt;Manual deployment process taking hours per release (velocity impact)&lt;/li&gt;
&lt;li&gt;Inefficient algorithms causing excessive cloud compute costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: Stakeholders understand optimization as improving what exists. It's not "paying debt," it's "increasing margin" or "improving user experience."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language to use&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Our cloud costs are $50K per month. A 3-week optimization brings that to $20K per month, saving $360K annually."&lt;/li&gt;
&lt;li&gt;"Checkout page loads in 8 seconds. Optimizing to 2 seconds increases conversion by 15% based on industry benchmarks. The work takes 4 weeks and projects to $500K additional annual revenue."&lt;/li&gt;
&lt;li&gt;"Automating deployments cuts release time from 4 hours to 15 minutes, letting us ship features faster. The automation work takes 2 weeks and doubles deployment frequency."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Optimizations communicate efficiency gains with measurable ROI. The business improves margins, performance, or velocity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Re-Alignments: Unlocking Future Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Mistakes, tradeoffs, or outdated decisions that, when fixed, unblock new features, integrations, or business capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monolithic architecture preventing independent team scaling&lt;/li&gt;
&lt;li&gt;API design preventing mobile app development&lt;/li&gt;
&lt;li&gt;Data model preventing real-time analytics feature&lt;/li&gt;
&lt;li&gt;Vendor lock-in preventing multi-cloud strategy&lt;/li&gt;
&lt;li&gt;Legacy authentication system preventing enterprise SSO integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: Stakeholders understand opportunity cost. If the current architecture blocks a $2M revenue opportunity, fixing it isn't "paying debt"; it's "unlocking growth."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language to use&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"We can't build the mobile app until we redesign the API. The redesign takes 6 weeks and unblocks a $2M annual opportunity."&lt;/li&gt;
&lt;li&gt;"Our current data model prevents real-time dashboards (top customer request). Re-aligning the schema takes 4 weeks and delivers the feature."&lt;/li&gt;
&lt;li&gt;"The monolith prevents us from scaling the checkout team independently. Splitting it out takes 8 weeks and doubles that team's velocity."&lt;/li&gt;
&lt;li&gt;"Moving to OAuth 2.0 unblocks enterprise SSO integrations. The $500K deal waiting on this capability closes once we deliver it. The migration takes 5 weeks."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Re-Alignments communicate strategic value. The business unlocks capabilities that enable growth, close deals, or meet customer demands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking the Cycle
&lt;/h2&gt;

&lt;p&gt;The self-fulfilling prophecy persists because both sides perpetuate it. Developers tend to use vague language, stakeholders have little choice but to ignore those vague requests, and the cycle continues.&lt;/p&gt;

&lt;p&gt;Break it by being the solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace "tech debt" with Corrections, Optimizations, and Re-Alignments when talking to stakeholders&lt;/li&gt;
&lt;li&gt;Communicate business value from the start in architectural proposals and decisions&lt;/li&gt;
&lt;li&gt;Mentor developers on translating technical concerns into stakeholder priorities&lt;/li&gt;
&lt;li&gt;Enforce quality standards at every increment through code reviews, architecture reviews, and quality gates&lt;/li&gt;
&lt;li&gt;Document decisions with ADRs so context doesn't disappear and future teams have roadmaps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't debts to be paid; they're opportunities for value. If the term itself guarantees the problem, replace it.&lt;/p&gt;

</description>
      <category>leadership</category>
      <category>techdebt</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Package Updates Are Investments, Not Hygiene Tasks</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Mon, 12 Jan 2026 15:13:44 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/package-updates-are-investments-not-hygiene-tasks-2166</link>
      <guid>https://forem.com/stevenstuartm/package-updates-are-investments-not-hygiene-tasks-2166</guid>
      <description>&lt;p&gt;It is time to update a third-party package in your repository, or at least to consider it. So how do you know what is safe, what is needed, what is prudent, and what will keep our company from melting down in record time? To address these questions, most teams pick one of two general reflexes: always update immediately to "stay current," or ignore updates entirely until forced.&lt;/p&gt;

&lt;p&gt;Both approaches treat package updates like chores, something to batch process or avoid. But package updates are investments. They consume time and introduce risk, which means they deserve the same deliberate evaluation you'd apply to any other technical decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Distributed Systems Uniformity Trap
&lt;/h2&gt;

&lt;p&gt;In distributed systems, a curious assumption often takes hold: all services must run the same package versions to maintain debuggability and behavioral consistency. Teams lacking clear governance reinforce it by treating version alignment as a proxy for unity and control.&lt;/p&gt;

&lt;p&gt;This assumption fails on multiple fronts. Distributed systems with shared-nothing architectures don't gain meaningful debugging benefits from version uniformity. Service A running on library v2.1 and Service B on v2.3 rarely creates the problems teams fear. Each service operates independently, communicates through well-defined contracts, and fails or succeeds on its own terms.&lt;/p&gt;

&lt;p&gt;Version uniformity does matter in specific contexts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shared libraries and contracts&lt;/strong&gt;: When services share a common library that defines data contracts or communication protocols, mismatched versions can cause subtle serialization bugs or contract violations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security vulnerabilities&lt;/strong&gt;: When a CVE affects multiple services, coordinated updates prevent attackers from exploiting the weakest link&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework-level breaking changes&lt;/strong&gt;: When a platform upgrade (like .NET major versions) requires coordinated migration across services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Outside of these cases, enforcing uniformity wastes time and introduces unnecessary risk. Governance clarity (understanding which dependencies matter for coordination and which don't) beats version number theater. When coordination does matter, focus on the boundaries: version your APIs explicitly, pin shared contract libraries, and establish migration windows rather than demanding instant synchronization across all services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Intentional Update Decisions
&lt;/h2&gt;

&lt;p&gt;Before updating any dependency, evaluate the change type and context. Semantic versioning provides a starting framework, but not all maintainers follow it rigorously, and even those who do sometimes misjudge what constitutes a breaking change. Read the changelog, not just the version number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Patch updates (x.y.Z)&lt;/strong&gt; should favor security fixes and critical bug patches, but verify relevance first. If a patch fixes a theoretical vulnerability in code you don't execute, the risk of updating may exceed the risk of staying put. Check whether the vulnerability applies to your usage patterns, whether the bug affects code paths you use, and whether the community has reported regressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minor updates (x.Y.z)&lt;/strong&gt; require evaluating value against risk. New features and non-breaking changes matter only if they solve problems you have or deliver performance improvements that affect your workload. Check community adoption rates and feedback; minor updates with low adoption and thin feedback deserve skepticism. Let others find the edge cases first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Major updates (X.y.z)&lt;/strong&gt; demand a business case. Breaking changes consume significant engineering time for migration, testing, and bug fixes. The value must justify the investment. Ask what capabilities become available, what technical debt gets resolved, and what risk comes from delaying (losing vendor support, missing future security patches). Treat major updates as planned initiatives with dedicated time and clear success criteria, not as squeezed-in tasks during feature development.&lt;/p&gt;
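&lt;p&gt;As a rough sketch of this triage (the helper name and guidance strings are illustrative, not a real policy engine), classifying an update by which semver component changed might look like:&lt;/p&gt;

```python
# Sketch: route a proposed dependency update to the right level of
# scrutiny based on its semantic-version bump. Helper name and guidance
# strings are hypothetical; real decisions still require reading the
# changelog, since not every maintainer follows semver rigorously.

def classify_update(current: str, proposed: str) -> str:
    cur = [int(part) for part in current.split(".")]
    new = [int(part) for part in proposed.split(".")]
    if new[0] != cur[0]:
        return "major: demand a business case and plan it as an initiative"
    if new[1] != cur[1]:
        return "minor: weigh feature value against adoption and risk"
    if new[2] != cur[2]:
        return "patch: favor security fixes, but verify relevance first"
    return "no change"
```

&lt;p&gt;Any major bump routes to the heaviest process; everything else still gets a changelog read, because version numbers alone can lie.&lt;/p&gt;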

&lt;p&gt;For any update, walk through these core questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem definition&lt;/strong&gt;: What specific problem does this solve (security, feature, bug, performance, vendor requirement)? If there's no clear problem, question the update.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt;: Review changelogs for breaking changes, deprecations, and known issues. Check security scan results. Monitor community feedback (GitHub issues, forums, Stack Overflow).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt;: Focus regression tests on affected code paths. Run load tests against production traffic patterns to validate SLA compliance (response times, throughput, error rates). Ensure you have a rollback plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollout&lt;/strong&gt;: Test in a canary environment first if possible. For distributed systems, roll out incrementally (one service at a time). Define who monitors the rollout and what metrics matter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This framework doesn't guarantee perfection, and not every step is always needed. But it does at least encourage deliberate decisions instead of reflexive action.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost of Delay
&lt;/h2&gt;

&lt;p&gt;Delaying updates indefinitely creates different risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security exposure&lt;/strong&gt;: Unpatched vulnerabilities accumulate; attackers target known CVEs in outdated packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vendor abandonment&lt;/strong&gt;: Falling too far behind loses access to vendor support and community knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compounding migration cost&lt;/strong&gt;: The longer you wait, the larger the gap between current and target versions, making eventual migration more painful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem drift&lt;/strong&gt;: New libraries and tools may assume newer dependency versions, limiting your options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recognizing when delay shifts from prudent caution to mounting debt requires regular review cycles. Quarterly or semi-annual assessments help teams determine whether staying put still makes sense or whether the debt is growing.&lt;/p&gt;

&lt;p&gt;If you're already multiple versions behind, don't try to catch up all at once. Audit your dependencies, identify the high-risk gaps (unpatched CVEs, unsupported versions, libraries blocking other upgrades), and create a prioritized update roadmap. Treat it like technical debt: chip away systematically rather than attempting a big-bang migration that creates more risk than it resolves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Based on What Changed
&lt;/h2&gt;

&lt;p&gt;Fear drives teams toward exhaustive testing: "We changed a dependency, so we need to test everything." This wastes time and often misses the actual risks.&lt;/p&gt;

&lt;p&gt;Target your testing based on what changed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regression tests&lt;/strong&gt;: Focus on code paths that use the updated dependency directly or indirectly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load tests&lt;/strong&gt;: Replicate production traffic patterns against the specific features that changed; validate SLA compliance (response times, throughput, error rates)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration tests&lt;/strong&gt;: If the dependency handles I/O (databases, APIs, file systems), test those boundaries thoroughly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Load testing deserves special attention. Bugs that surface only under concurrent load won't appear in functional tests. Functional tests with serial requests can pass cleanly while hiding race conditions, deadlocks, or resource exhaustion that only manifest under production concurrency. Load tests should mirror production traffic volume and patterns, not arbitrary "stress everything" scenarios.&lt;/p&gt;
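&lt;p&gt;A cheap local approximation of this idea (with a hypothetical stub standing in for the real service call) is to drive the affected code path with concurrent workers rather than a serial loop, then assert on errors and elapsed time:&lt;/p&gt;

```python
# Sketch: exercise an updated code path under concurrency, the way
# production traffic would, instead of serially. `call_service` is a
# stand-in for whatever uses the updated dependency; deadlocks and
# race conditions that serial functional tests miss can surface here.
import time
from concurrent.futures import ThreadPoolExecutor

def call_service(i: int) -> bool:
    time.sleep(0.001)  # simulate I/O through the updated dependency
    return True

def run_load(requests: int = 200, workers: int = 20):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(call_service, range(requests)))
    elapsed = time.monotonic() - start
    return results.count(False), elapsed

errors, elapsed = run_load()
assert errors == 0, f"{errors} failures under concurrent load"
```

&lt;p&gt;Tune the request count and worker pool to mirror real traffic volume and concurrency, not arbitrary stress numbers.&lt;/p&gt;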

&lt;p&gt;Avoid the temptation to test everything out of fear. Exhaustive testing creates a false sense of security while consuming time better spent on targeted, high-value validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AWS SDK Lesson
&lt;/h2&gt;

&lt;p&gt;Even when you decide updates don't require cross-team coordination, you still need deliberate evaluation. The assumption that trusted vendors always ship safe updates fails regularly.&lt;/p&gt;

&lt;p&gt;Recently, AWS released version 4 of many of their .NET SDK packages with a series of breaking changes. Teams that treated AWS as a trusted source and updated without thorough review faced a flood of critical errors that were hard to detect and thus sometimes made their way to production.&lt;/p&gt;

&lt;p&gt;The most damaging change wasn't a breaking API; it was a critical bug introduced in the SDK core authentication workflow. The bug created silent deadlocks when calling AWS services under specific load conditions. Services appeared healthy in development and early testing but locked up under production traffic patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two Truths From the AWS SDK Incident&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upfront due diligence has limits&lt;/strong&gt;: You can review changelogs, run regression tests, and validate functionality, but some bugs only surface under production conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ongoing vigilance matters&lt;/strong&gt;: Staying plugged into ticket systems, community forums, and issue trackers helps you catch problems before they spread&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even trusted sources ship bugs. Intentional updates include monitoring what happens after updates ship, not just before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Objections
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"We don't have time for this level of due diligence. Just like TDD, it sounds good in theory but slows us down in practice."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The framework above takes 15-30 minutes per update decision, not hours. Compare that to the time spent dealing with broken production deployments, emergency rollbacks, and firefighting that follows hasty updates. Spending 20 minutes reading a changelog and running targeted tests beats spending 4 hours debugging a silent authentication deadlock at 2 AM.&lt;/p&gt;

&lt;p&gt;Deliberate updates consume predictable, scheduled time during normal work hours. Autopilot updates consume unpredictable, high-stress time during incidents. The time spent is roughly equivalent, but one approach happens during office hours with full context, while the other happens during outages with incomplete information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Our security team requires us to apply all patches within 48 hours of release. We don't have a choice."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security policies that mandate blanket timelines without risk assessment create more risk than they prevent. A policy that forces teams to apply an untested patch faster than they can validate it treats all vulnerabilities as equally critical, which they aren't.&lt;/p&gt;

&lt;p&gt;When possible, present data to security leadership. Show the difference between a critical remote code execution vulnerability in your authentication layer (apply immediately) versus a theoretical XSS vulnerability in a library function your codebase never calls (evaluate deliberately). Most security teams will adjust policies when presented with risk-based reasoning rather than blanket compliance.&lt;/p&gt;

&lt;p&gt;If your organization won't budge, at least apply the intentional framework to prioritize which updates get thorough validation versus rubber-stamp approval. Not every patch deserves the same scrutiny.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Automated tooling already handles this for us."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automation helps with detection and scanning: finding available updates, flagging known CVEs, checking for outdated versions. What automation cannot do is decide whether an update makes sense for your context.&lt;/p&gt;

&lt;p&gt;Security scanners tell you a vulnerability exists. They don't tell you whether it affects code paths you actually execute. Automated PRs surface new versions. They don't evaluate community feedback, breaking changes, or production risk.&lt;/p&gt;

&lt;p&gt;Auto-applying updates (even patch versions) without review is worse than no automation. A tool that automatically merges dependency updates trades predictable, bounded risk (staying on a known version) for unpredictable, unbounded risk (silently introducing bugs you didn't test for). Automation should notify, not decide.&lt;/p&gt;

&lt;p&gt;Treat automated tools as early-warning systems. When a security scan flags a CVE or a bot opens a PR, use it as a trigger to walk through the intentional update framework. The automation saves you from manually checking for updates; it doesn't save you from thinking about whether the update makes sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"We have too many dependencies to evaluate each one individually. We'd spend all day reviewing changelogs."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all dependencies deserve equal attention. Apply the Pareto principle: 20% of your dependencies (authentication libraries, database drivers, core frameworks, HTTP clients) account for 80% of your risk. Focus your evaluation effort there.&lt;/p&gt;

&lt;p&gt;For lower-risk dependencies (date formatting libraries, color palette utilities, markdown parsers), batch review them during scheduled maintenance windows. Check for breaking changes and security issues in aggregate, test once across the batch, and apply together. Reserve deep evaluation for high-impact dependencies where bugs cause production incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Our competitors ship faster because they don't overthink updates like this."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You don't know what your competitors do internally. You see their marketing velocity, not their operational reality. Companies that ship fast and stay fast do so because they avoid the context-switching cost of constant firefighting. They make fewer unforced errors, which means they spend less time recovering from self-inflicted wounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leadership Sets the Tone
&lt;/h2&gt;

&lt;p&gt;Team leads determine how their teams approach updates. If leadership treats updates as chores to batch and rush through, teams will cut corners. If leadership asks hard questions, prioritizes based on value, and accepts that "not yet" is sometimes the right answer, then teams will follow their example.&lt;/p&gt;

&lt;p&gt;Shipping fast and thinking deliberately aren't opposites. Teams that update thoughtfully ship faster over time because they spend less time debugging mysterious production issues traced back to an unconsidered dependency change two sprints ago.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Package updates are investment decisions, not hygiene tasks. Treat them with the same rigor you apply to feature development.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>leadership</category>
      <category>softwaredevelopment</category>
      <category>testing</category>
    </item>
    <item>
      <title>Adaptability Over Cleverness: What Makes Code Actually Good</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Mon, 05 Jan 2026 15:56:07 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/adaptability-over-cleverness-what-makes-code-actually-good-50bj</link>
      <guid>https://forem.com/stevenstuartm/adaptability-over-cleverness-what-makes-code-actually-good-50bj</guid>
      <description>&lt;h2&gt;
  
  
  Adaptability Over Cleverness
&lt;/h2&gt;

&lt;p&gt;Systems that survive aren't the ones written perfectly from the start. They're the ones that bend without breaking when requirements shift, technologies evolve, and teams discover what they didn't know upfront. Building for change beats chasing premature perfection every time.&lt;/p&gt;

&lt;p&gt;You will never get it right the first time. That's not a failure; it's how software development works. Requirements clarify through building, edge cases emerge through usage, and performance issues surface under real load. Teams that treat first attempts as gospel spend months polishing solutions to the wrong problem.&lt;/p&gt;

&lt;p&gt;Instead, build systems that can evolve. Give yourself room to deliver, learn, and adapt as reality proves what matters and what doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle and Practice
&lt;/h2&gt;

&lt;p&gt;Adaptable code isn't magic. It follows principles that reduce coupling, isolate change, and make breakage obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single Responsibility Principle&lt;/strong&gt; keeps each component focused on one job, whether that's a microservice owning a bounded context or a class handling a single concern. When requirements change, you modify the piece responsible for that concern without cascading edits across the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean interfaces and separation of concerns&lt;/strong&gt; apply at both macro and micro levels. Services communicate through contracts, not implementation details. Business logic doesn't know about HTTP; database layers don't make authorization decisions. Abstractions at the right granularity let you pivot implementations as understanding evolves without rewriting everything upstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Externalize what changes&lt;/strong&gt; by making variable behavior configurable. Environment-specific settings keep the same code deployable to dev, staging, and production. Tunable timeouts, batch sizes, and retry policies let operations adapt the system's behavior without engineering involvement.&lt;/p&gt;
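&lt;p&gt;A minimal sketch of this, assuming hypothetical setting names, reads tunables from the environment with sensible defaults:&lt;/p&gt;

```python
# Sketch: tunable behavior externalized to the environment, so operators
# can adjust timeouts, batch sizes, and retries without a code change.
# The setting names are hypothetical.
import os

HTTP_TIMEOUT_SECONDS = float(os.environ.get("HTTP_TIMEOUT_SECONDS", "5.0"))
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "100"))
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", "3"))

def make_batches(items):
    # The same code runs in dev, staging, and production;
    # only the externalized settings differ per environment.
    return [items[i:i + BATCH_SIZE] for i in range(0, len(items), BATCH_SIZE)]
```

&lt;p&gt;The point is the boundary: behavior that varies by environment or workload lives in configuration, not in conditionals scattered through the code.&lt;/p&gt;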

&lt;p&gt;&lt;strong&gt;Fail fast and loud&lt;/strong&gt; surfaces problems immediately. Silent failures cascade into confusing bugs far from their source. Explicit validation at system boundaries, defensive assertions in critical paths, and structured logging create clear signals when something breaks.&lt;/p&gt;
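&lt;p&gt;As a small illustration (the payload and field names are hypothetical), validating at the boundary and raising immediately might look like:&lt;/p&gt;

```python
# Sketch: explicit validation at a system boundary. A missing or
# nonsensical field stops processing here, with a message pointing at
# the actual problem, instead of cascading into a confusing bug deeper
# in the system. Field names are hypothetical.

def parse_order(payload: dict) -> dict:
    if "order_id" not in payload:
        raise ValueError("order payload missing required field 'order_id'")
    qty = payload.get("quantity", 0)
    if not isinstance(qty, int) or qty <= 0:
        raise ValueError(f"invalid quantity {qty!r}: must be a positive int")
    return {"order_id": payload["order_id"], "quantity": qty}
```

&lt;p&gt;A loud ValueError at the edge is far cheaper to diagnose than a silently corrupted order surfacing three services downstream.&lt;/p&gt;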

&lt;p&gt;&lt;strong&gt;Tests that give you confidence&lt;/strong&gt; let you refactor with impunity. Unit tests verify component behavior. Integration tests catch interface mismatches. End-to-end tests confirm critical workflows still work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent naming and structure&lt;/strong&gt; reduce cognitive load. Developers understand the codebase faster when patterns repeat. Services follow the same lifecycle. Repositories expose the same CRUD operations. Consistency makes the unfamiliar feel familiar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seek Balance
&lt;/h2&gt;

&lt;p&gt;The goal isn't maximum flexibility; it's appropriate flexibility. Optimize for the changes you can reasonably anticipate based on domain knowledge and past experience. A payments system will need to support new payment providers. An internal admin tool probably won't need a plugin architecture.&lt;/p&gt;

&lt;p&gt;Perfect code written for yesterday's requirements fails when reality shifts. Over-engineered code collapses under its own weight. Adaptable code finds the balance: flexible where change is likely, simple where it isn't.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>bestpractices</category>
      <category>designpatterns</category>
    </item>
    <item>
      <title>TDD Tests Assumptions, Not Just Code</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Fri, 02 Jan 2026 15:13:40 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/tdd-tests-assumptions-not-just-code-15pg</link>
      <guid>https://forem.com/stevenstuartm/tdd-tests-assumptions-not-just-code-15pg</guid>
<description>&lt;p&gt;The real problem in software development isn't buggy code; it's building the wrong thing. We should be focused on preventing the failures that matter: delivering features that don't match what users need, implementing requirements that were misunderstood, or discovering halfway through that the domain model was wrong. Test-Driven Development (TDD) is one solution to this problem, but it is often poorly understood, and even those who do understand it can still practice it badly.&lt;/p&gt;

&lt;p&gt;The promise of TDD is that tests guide design and catch bugs early. The reality, sometimes, is that teams write tests for features they don't yet understand, design interfaces around incomplete requirements, and spend hours on tests that get thrown away when understanding finally arrives. The resulting debate often gets heated. Advocates measure test coverage and celebrate red-green-refactor. Skeptics count rewritten tests as waste. Both sides miss what actually happened: when done right, those rewritten tests forced understanding before the wrong system got built and delivered. When done wrong, they were just ceremony.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TDD's value isn't in the tests. It's in the understanding that writing them demands.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Testing Assumptions, Not Just Code
&lt;/h2&gt;

&lt;p&gt;Most discussions frame TDD as a code quality tool: write tests first, implement to pass, refactor for quality. Coverage metrics become the measure of success. But every test is also a test of assumptions about user needs and business logic. When you write a test asserting business rules, you're testing assumptions about how the system should behave. Not just your assumptions as a developer, but the assumptions baked into the requirements themselves. Writing the test first means you don't build for days before discovering misalignment. You might find that an entire scope of work needs to go back for reconsideration, and finding that on day one is obviously better than finding it on day ten.&lt;/p&gt;

&lt;p&gt;Consider a test asserting that &lt;code&gt;CalculateDiscount(customer)&lt;/code&gt; returns 15% for "premium" customers. That test encodes assumptions about what "premium" means, what discount they deserve, and whether discounts even work this way. Discovery can happen immediately: writing the test forces you to define "premium" concretely, and you realize the customer object has no tier field, or that the pricing service doesn't support percentage-based discounts, or that the requirement contradicts existing business rules. The act of writing the test surfaces the gap before any implementation begins. Or discovery happens later when product clarifies that discounts are tiered by purchase history rather than customer tier. Either way, the test change isn't waste. It's learning captured before shipping wrong behavior.&lt;/p&gt;
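
&lt;p&gt;The &lt;code&gt;CalculateDiscount(customer)&lt;/code&gt; example can be sketched test-first. This is a hypothetical Python rendering; the tier field and the rates are exactly the assumptions under test, not established requirements:&lt;/p&gt;

```python
# Hypothetical test-first sketch. Writing the assertion forces "premium"
# to be defined concretely; here it is assumed to be a tier field on the
# customer, which is itself an assumption the test makes visible.

def calculate_discount(customer):
    # Minimal implementation written only to satisfy the tests below.
    return 0.15 if customer.get("tier") == "premium" else 0.0


def test_premium_customers_get_15_percent():
    assert calculate_discount({"tier": "premium"}) == 0.15


def test_other_customers_get_no_discount():
    assert calculate_discount({"tier": "standard"}) == 0.0
```

&lt;p&gt;If product later clarifies that discounts are tiered by purchase history, these tests change, and that change is the captured learning, not waste.&lt;/p&gt;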

&lt;p&gt;Changed tests aren't waste; they're evidence of learning. The earlier you surface wrong assumptions, the cheaper they are to fix. Tests force specific questions that conversation alone won't surface. Writing assertions exposes complexity that wasn't obvious when discussing requirements abstractly.&lt;/p&gt;

&lt;p&gt;This doesn't mean tests must stay purely conceptual to avoid waste. Mocked code and implementation details in tests encode their own assumptions that sometimes only get validated through actual implementation. Some test code will get thrown away. That's fine. A little code waste is a small price compared to the larger waste of building the wrong system because critical misalignments went undiscovered. That said, when large test refactors happen repeatedly, it might signal something else: the developer who wrote the initial tests wasn't focused on alignment and was just going through the motions.&lt;/p&gt;

&lt;p&gt;TDD treated as a checklist rather than a discipline for understanding will produce tests that don't surface assumptions early. The waste isn't in TDD itself; it's in treating TDD as compliance rather than inquiry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop Measuring Success by Tests
&lt;/h2&gt;

&lt;p&gt;This reframing should change what we measure, but not by replacing one test metric with another. Using test coverage as a success metric is a distraction. Measuring "assumptions caught" would be too. The only thing worth measuring success by is value delivered.&lt;/p&gt;

&lt;p&gt;Coverage metrics can still provide useful insight into quality gaps, but think in terms of use case coverage rather than line coverage. Are the critical business scenarios tested? Are the edge cases stakeholders care about covered? That's a different question than "what percentage of lines have tests?"&lt;/p&gt;

&lt;p&gt;Tests are a tool, not an outcome. When teams treat coverage percentages as goals or count rewritten tests as waste, they've confused the means for the end. The question isn't "how many tests do we have?" or even "how many wrong assumptions did we catch?" The question is "did we deliver what users actually needed?"&lt;/p&gt;

&lt;p&gt;The test suite does have secondary value as documentation that new developers can read to understand system constraints without digging through old conversations and tickets. But that's a side effect, not a success metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means Practically
&lt;/h2&gt;

&lt;p&gt;Focus on testing assumptions that matter most. Not all assumptions carry equal risk. Prioritize tests that validate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Business rules&lt;/strong&gt; — How discounts work, what triggers notifications, when transactions are valid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge cases stakeholders haven't considered&lt;/strong&gt; — What happens when the cart is empty? When the user has no purchase history?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data validity assumptions&lt;/strong&gt; — What "valid" input looks like, what formats are acceptable, what happens with missing fields&lt;/li&gt;
&lt;/ul&gt;
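
&lt;p&gt;The edge cases above translate directly into assertions. A hedged sketch with an invented checkout rule; note that the empty-cart behavior is itself an assumption the test makes explicit and visible to stakeholders:&lt;/p&gt;

```python
def apply_cart_discount(cart, discount_rate):
    """Hypothetical checkout rule: the empty cart is a decision, not a crash."""
    if len(cart) == 0:
        return 0.0  # assumption to confirm with stakeholders
    subtotal = sum(item["price"] for item in cart)
    return round(subtotal * (1 - discount_rate), 2)


# Edge cases made explicit as assertions:
assert apply_cart_discount([], 0.15) == 0.0
assert apply_cart_discount([{"price": 100.0}], 0.15) == 85.0
```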

&lt;p&gt;If requirements are clear and stable, write tests to validate implementation. If requirements are uncertain, write tests to validate assumptions and expect them to change as understanding develops. This distinction matters more than any debate about test-first versus test-after.&lt;/p&gt;

&lt;p&gt;The better question is what assumptions you're making about user needs and how to validate them fastest. Sometimes that's writing a test first, other times it's building a prototype first, and sometimes it's showing mockups to users first. The goal isn't perfect tests; it's validated understanding.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The greatest value of TDD isn't in the tests that pass. It's in the tests that change because they revealed assumptions worth questioning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Start with Understanding
&lt;/h2&gt;

&lt;p&gt;Writing the test first forces a question: what should this code actually do? That question demands understanding before implementation. The passing assertion represents a commitment: this is the contract we're building to. Implementation honors that commitment by making the test pass.&lt;/p&gt;

&lt;p&gt;When tests change during development, you're realigning based on discovery. When tests fail after changes, they're surfacing broken commitments that need attention. The discipline isn't about tests; it's about starting with understanding, securing genuine commitment to what you're building, and then honoring what was agreed.&lt;/p&gt;

&lt;p&gt;The real problem in software development is building the wrong thing. TDD addresses this by forcing clarity before code, but only when practiced as inquiry rather than compliance. Tests that surface wrong assumptions early are valuable even when they get rewritten. Tests written as ceremony produce waste without insight.&lt;/p&gt;

&lt;p&gt;Stop treating coverage as a goal. Stop counting rewritten tests as failure. If requirements are uncertain, write tests to validate assumptions and expect them to change. If requirements are clear, write tests to validate implementation. Either way, measure what matters: did you deliver what users actually needed?&lt;/p&gt;

&lt;p&gt;TDD isn't just a testing practice. It's a discipline for understanding.&lt;/p&gt;

</description>
      <category>tdd</category>
      <category>unittest</category>
      <category>testing</category>
    </item>
    <item>
      <title>Rebuild Success Often Comes from Realignment, Not New Technology</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Wed, 31 Dec 2025 13:42:22 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/rebuild-success-often-comes-from-realignment-not-new-technology-3f95</link>
      <guid>https://forem.com/stevenstuartm/rebuild-success-often-comes-from-realignment-not-new-technology-3f95</guid>
<description>&lt;p&gt;Business- and technical-minded people alike tend to credit new technology for the gains seen after a system or tool rebuild, and they often blame the technology when a rebuild goes awry. But when you examine what actually changed, the technology rarely drove the gains or caused the failure. The improvements (or lack thereof) came from alignment with business value and the application of operational discipline. Often the gains could have been achieved without the rebuild, or the failure could have been recognized long before further development was wasted.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The critical error is assuming the new runtime, framework, or platform created the success. Ignorance was the actual constraint, and rebuilding forced tech and business teams to confront it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This misattribution creates dangerous organizational patterns. Teams propose rebuilds when the real problem is dysfunction, not technical limitations. The rebuild becomes a moving target that allows leadership to avoid accountability, celebrate "innovation" while making things worse, and mask problems that were never technical to begin with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Examples of Misattributed Success
&lt;/h2&gt;

&lt;p&gt;The pattern repeats across different technical domains:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure migrations&lt;/strong&gt;: An organization blames rising cloud costs on the provider's pricing model and proposes migrating to on-premises infrastructure. Eighteen months later, leadership celebrates reduced hosting bills without mentioning the tripled operations team, degraded availability, and manual processes replacing what cloud automation previously managed.&lt;/p&gt;

&lt;p&gt;The root cause was never the cloud provider. It was absence of operational accountability—no one tracked which resources provided value, right-sized instances, or decommissioned abandoned experiments. The migration forced this discipline, but the same discipline applied to existing infrastructure would have achieved the savings without the rebuild.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime rewrites&lt;/strong&gt;: Teams celebrate performance gains after rewriting in a faster language. But was it the new runtime, or was it the rewrite that forced them to finally address inefficient database access patterns, redundant service calls, and missing caches?&lt;/p&gt;
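
&lt;p&gt;To make the point concrete, here is a hedged sketch (invented function names) of the kind of fix a rewrite forces but that the existing codebase could have absorbed directly: caching a repeated lookup so the backend is hit once instead of on every call:&lt;/p&gt;

```python
from functools import lru_cache

CALLS = {"count": 0}


def fetch_exchange_rate(currency):
    """Stands in for a network or database round trip."""
    CALLS["count"] += 1
    return {"EUR": 0.92, "GBP": 0.79}[currency]


@lru_cache(maxsize=128)
def cached_exchange_rate(currency):
    # Same codebase, one decorator: redundant calls collapse to one.
    return fetch_exchange_rate(currency)


for _ in range(1000):
    cached_exchange_rate("EUR")
print(CALLS["count"])  # the backend was hit once, not 1000 times
```

&lt;p&gt;No new runtime was required for that gain; only attention to the access pattern.&lt;/p&gt;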

&lt;p&gt;&lt;strong&gt;Framework modernizations&lt;/strong&gt;: Teams credit the new frontend framework for improved responsiveness. But was it the framework's rendering model, or was it the rebuild that forced them to eliminate wasteful re-renders and implement proper state management?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The new runtime gets credit, the new provider gets credit, and the new framework gets credit. But the realignment did the work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You could have achieved the same benefits by fixing the queries, right-sizing the infrastructure, and optimizing the existing code. The technology wasn't the constraint. Ignorance was.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Rebuilds Masquerade as Solutions to Organizational Problems
&lt;/h2&gt;

&lt;p&gt;Rebuilds often hide deeper organizational failures. Understanding these patterns reveals why the cycle repeats.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Documentation Failure
&lt;/h3&gt;

&lt;p&gt;When architecture decision records (ADRs) don't exist, future teams assume incompetence rather than recognizing intentional tradeoffs. Someone looks at the current architecture and declares "This is bad" when they really mean "I don't understand why it's designed this way." Without ADRs documenting the constraints and reasoning behind key decisions, every new technical lead assumes the previous team made poor choices rather than reasonable compromises.&lt;/p&gt;

&lt;p&gt;The hidden cost is compounding technical debt. The old system still needs maintenance during migration. The new system accumulates debt rapidly because you're learning as you build. You end up with two systems, both worse than if you had invested in understanding and improving one.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Coherence Failure
&lt;/h3&gt;

&lt;p&gt;User stories are slices of an agreement, not the agreement itself. They capture deliverable increments but provide no holistic understanding of what you're building or what you built. When organizations treat user stories as the entire agreement rather than fragments of it, coherence collapses.&lt;/p&gt;

&lt;p&gt;Teams implement individual stories without understanding how they relate. Related capabilities get built across separate stories with no unifying vision. Each works in isolation, but together they create conflicting mental models. Flows spanning many stories across multiple sprints become impossible to explain end-to-end because no document describes them end-to-end. Each story made sense locally, but globally the system is incoherent.&lt;/p&gt;

&lt;p&gt;Eventually someone proposes a rebuild to "clean up the mess." But the mess wasn't created by technical limitations. It was created by treating slices as the whole, by never establishing the agreement that user stories were supposed to slice.&lt;/p&gt;

&lt;p&gt;The hidden cost is degraded security and availability. New systems have immature operational practices. Rushed migrations skip security reviews. Unfamiliar platforms lead to misconfigurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Alignment Drift Failure
&lt;/h3&gt;

&lt;p&gt;Systems drift from business needs when there's no mechanism to maintain alignment. Priorities change, but the system keeps implementing old priorities while never removing obsolete ones. New capabilities get added but old ones never get decommissioned. Integrations accumulate. Code paths multiply. Each addition made sense at the time, but no one maintains the whole.&lt;/p&gt;

&lt;p&gt;Eventually the system does too much, costs too much, and serves unclear purposes. The rebuild proposal emerges naturally. "Let's start fresh with current priorities," someone suggests. But without changing the process that allowed the drift, the new system will accumulate the same cruft.&lt;/p&gt;

&lt;p&gt;The hidden cost is enormous opportunity cost. The months or years spent on a misguided rebuild could have been spent delivering actual business value. You're not just wasting the rebuild time; you're wasting all the value you could have created instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Accountability Avoidance Failure
&lt;/h3&gt;

&lt;p&gt;When leadership constantly shifts priorities without acknowledging past commitments, teams can never succeed or fail definitively. Every problem becomes "we were working on the wrong thing" rather than "we failed to deliver what we committed to." Rebuilds fit perfectly into this pattern because they're the ultimate moving target. By the time the rebuild completes, requirements have shifted again, and the cycle continues.&lt;/p&gt;

&lt;p&gt;This connects to how leadership rewards visible heroics over invisible prevention. The engineers who prevented the fire through good design, monitoring, and operational discipline get ignored. The engineers who fought the fire through a hasty rebuild get celebrated. This teaches the organization that creating problems and fixing them dramatically is more valuable than preventing problems quietly. Rebuilds become performative rather than necessary.&lt;/p&gt;

&lt;p&gt;Good developers and architects can identify these problems and push for accountability, but without leadership commitment their efforts fail. Leadership must ask hard questions when rebuilds are proposed, acknowledge failures when commitments aren't met, and maintain clarity on what matters. Both leadership and technical teams are needed. Technical teams must articulate problems clearly while leadership creates an environment where solving the right problem matters more than creating the appearance of progress.&lt;/p&gt;

&lt;p&gt;The hidden cost is eroded organizational trust. When rebuilds fail to deliver on commitments but get celebrated anyway, teams learn that outcomes don't matter. This produces learned helplessness where engineers stop fighting for quality because leadership doesn't care.&lt;/p&gt;

&lt;p&gt;Without addressing these root causes, the new system will develop the same problems. In three years, someone will propose another rebuild. The organization learns that rebuilds are how you "fix" things, entrenching a cycle of waste.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Rebuilds Are Justified
&lt;/h2&gt;

&lt;p&gt;Rebuilds aren't always wrong; some situations demand them. When your runtime or infrastructure reaches end-of-support and security patches stop flowing, you must migrate. Staying on unsupported platforms creates unacceptable risk.&lt;/p&gt;

&lt;p&gt;When the system's core architecture cannot support required characteristics, incremental refactoring may cost more than rebuilding. Some architectural shifts are fundamental enough that preserving the old system while transforming it creates more complexity than starting fresh.&lt;/p&gt;

&lt;p&gt;New regulations sometimes demand capabilities the current system cannot provide. Compliance requirements may force architectural changes that touch every layer. When merging systems from acquired companies, rebuilding to a common platform may be necessary for operational efficiency and reducing long-term maintenance burden.&lt;/p&gt;

&lt;p&gt;The difference between justified and unjustified rebuilds is honest assessment. Justified rebuilds have clear, measurable forcing functions. Unjustified rebuilds have vague dissatisfaction and organizational dysfunction masked as technical problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AAA Discipline: How to Know If You Need a Rebuild
&lt;/h2&gt;

&lt;p&gt;The AAA Cycle (Align, Agree, Apply) prevents rebuild disasters by forcing honest assessment before action.&lt;/p&gt;

&lt;h3&gt;
  
  
  Align: Understand Before You Prescribe
&lt;/h3&gt;

&lt;p&gt;Before proposing a rebuild, align on reality. What actually provides value? Which features drive business outcomes versus exist because no one removed them? Why was the current architecture chosen? What problems was it designed to solve? What constraints existed? Which tradeoffs were intentional? Read the ADRs if they exist. Interview people who built the system.&lt;/p&gt;

&lt;p&gt;Most importantly, determine whether the problem is technical or organizational. Are costs high because of the technology, or because no one is accountable for managing costs? Is the system slow because of architectural limitations, or because of fixable inefficiencies? Most problems that look technical are actually process failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agree: Get Real Commitment, Not Permission
&lt;/h3&gt;

&lt;p&gt;Once you understand reality, agree on what actually matters. State the real problem, not the symptom. "The platform is expensive" is a symptom, while "We have no operational accountability for cost management" is the problem. Define measurable success criteria with explicit tradeoffs that acknowledge what you're willing to sacrifice and what you're not.&lt;/p&gt;

&lt;p&gt;Evaluate alternatives. What could you do besides rebuild? What would those approaches cost? Acknowledge actual constraints: time, budget, team capacity, acceptable risk. Rebuilds hide behind "strategic investment" language to avoid honest resource conversations.&lt;/p&gt;

&lt;p&gt;Assign specific ownership. Not "the team" but specific people accountable for specific metrics. If costs don't decrease, who failed? Without genuine agreement, rebuilds become exercises in diffused responsibility where no one can be held accountable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apply: Execute with Integrity or Stop
&lt;/h3&gt;

&lt;p&gt;The Apply phase tests whether the agreement was real. Implement what you agreed to. If cost reduction was the priority, instrument cost tracking first. Track against the agreement continuously. When metrics diverge from commitments, pause and realign. Don't celebrate "completed migration" when you violated core commitments.&lt;/p&gt;

&lt;p&gt;Recognize when agreements were wrong. If the rebuild isn't solving the real problem, stop. "We committed to this" isn't a valid reason to continue when reality invalidates the premise. Stopping a failed rebuild is success, not failure. Update ADRs and share learnings so the organization doesn't repeat the mistake.&lt;/p&gt;

&lt;p&gt;The Apply phase makes accountability real. When rebuilds fail to deliver on commitments, AAA makes that failure visible instead of letting it hide behind "strategic transformation" language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Realign Before You Rebuild
&lt;/h2&gt;

&lt;p&gt;Rebuilds can solve the wrong problem. They succeed not because of new technology, but because they force teams to understand what they're building, align with business value, and apply best practices. That work could have happened without the rebuild.&lt;/p&gt;

&lt;p&gt;Use the AAA Cycle to test whether a rebuild addresses real problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Align&lt;/strong&gt; on what matters and what the actual problems are&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agree&lt;/strong&gt; on measurable success criteria and who owns the outcomes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply&lt;/strong&gt; with integrity and recognize when agreements were wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fix the organization, and the technology often fixes itself. Rebuild without fixing the organization, and you'll be proposing another rebuild in three years.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The question isn't "Should we rebuild?" The question is "Do we understand why we're considering it, and are we willing to be accountable for the outcome?"&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>leadership</category>
      <category>architecture</category>
      <category>projectmanagement</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Why Configuration Files Don't Belong With Your Code</title>
      <dc:creator>Steven Stuart</dc:creator>
      <pubDate>Tue, 30 Dec 2025 13:20:19 +0000</pubDate>
      <link>https://forem.com/stevenstuartm/why-configuration-files-dont-belong-with-your-code-484g</link>
      <guid>https://forem.com/stevenstuartm/why-configuration-files-dont-belong-with-your-code-484g</guid>
<description>&lt;p&gt;When you first create that shiny new code project, your configuration requirements seem obvious and straightforward. You think: "I have a single behavior that just needs this one setting, or I'm calling one external resource that just needs this one API URL." Often your needs are indeed simple, but simple does not translate directly to easy. Even before the future flood of settings and complex use cases arrives, you have already created a problem. The local approach feels intuitive because config stays close to the code that uses it, but real business use cases, daily team dynamics, production deployment pipelines, and distributed architectures are just not that simple.&lt;/p&gt;

&lt;p&gt;This localized intuition leads to predictable problems: secrets leak into Git history and persist indefinitely, environment-specific values multiply into templating systems that nobody fully understands, and configuration changes require full application redeployments. These aren't edge cases or signs of poor discipline; they're structural consequences of coupling configuration to code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Costs of Config-in-Code
&lt;/h2&gt;

&lt;p&gt;When configuration lives alongside application code, the problems compound in ways that aren't obvious until you're deep into them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets inevitably leak.&lt;/strong&gt; Teams start with good intentions like &lt;code&gt;.gitignore&lt;/code&gt; files, environment variables, and careful code reviews, but secrets find their way into version control anyway. A developer copies a working config to test something, a merge conflict gets resolved wrong, or someone commits from a branch that predates the &lt;code&gt;.gitignore&lt;/code&gt; update. Once a secret hits Git history, it's there forever unless you rewrite history (which breaks everyone's local repos) or rotate the credential (which means tracking down every service that uses it).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment parity becomes impossible.&lt;/strong&gt; Each environment needs different database hosts, API endpoints, feature flags, and connection pool sizes. Teams respond with templating systems that substitute values at build time, or multiple config files with naming conventions like &lt;code&gt;config.dev.json&lt;/code&gt; and &lt;code&gt;config.prod.json&lt;/code&gt;, or environment variables that multiply until nobody remembers what &lt;code&gt;SERVICE_ENDPOINT_OVERRIDE_V2&lt;/code&gt; was supposed to do. The same Docker image behaves differently in staging and production, and debugging requires understanding not just the code but the entire config transformation pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployments couple to configuration.&lt;/strong&gt; When config lives with code, changing a timeout value means rebuilding and redeploying the application. When five services share a database connection string and it needs to change, coordinating five deployments in sequence becomes necessary. Incident response now requires understanding deployment ordering, not just the config change itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration fragments across components instead of cohering around domains.&lt;/strong&gt; When each service owns its config files, configuration naturally organizes around components rather than business concepts. A tenant's feature flags, rate limits, and integration credentials get scattered across every service that handles that tenant's requests, and internal implementation details leak into operational concerns that ops personnel shouldn't need to understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config drift becomes invisible.&lt;/strong&gt; Without a central view of configuration, environments gradually diverge in ways nobody fully understands. A value gets changed manually in production to fix an urgent issue and never makes it back to staging. A developer updates the dev config but forgets to propagate it to the template. Over time, the gap between what's documented and what's deployed grows until debugging requires archaeology rather than analysis.&lt;/p&gt;
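
&lt;p&gt;Making drift visible doesn't require heavy tooling. A minimal sketch, assuming each environment's resolved configuration can be exported as a flat dictionary:&lt;/p&gt;

```python
def config_drift(env_a, env_b):
    """Report keys whose resolved values differ between two environments."""
    keys = set(env_a) | set(env_b)
    drift = {}
    for key in sorted(keys):
        if env_a.get(key) != env_b.get(key):
            drift[key] = (env_a.get(key), env_b.get(key))
    return drift


staging = {"timeout": 5, "pool_size": 10}
production = {"timeout": 30, "pool_size": 10, "hotfix_flag": True}
print(config_drift(staging, production))
```

&lt;p&gt;Run regularly, a diff like this turns archaeology back into analysis: the manual production fix that never reached staging shows up the day it happens.&lt;/p&gt;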

&lt;h2&gt;
  
  
  So I Need a Custom Solution, Right?
&lt;/h2&gt;

&lt;p&gt;The natural engineering instinct is to solve this problem yourself by storing configs in a database, building a config API, or using Redis as a config cache. These approaches fail for reasons that become obvious only after you've tried them.&lt;/p&gt;

&lt;p&gt;A database-backed config store creates a circular dependency: your service needs the database connection string to start, but the connection string is in the database. You end up hardcoding bootstrap credentials anyway, which defeats the purpose. Every service that needs config now depends on the database being available, and rotating database credentials becomes a chicken-and-egg problem.&lt;/p&gt;

&lt;p&gt;A custom config API sounds cleaner until you realize the config service itself needs configuration, deployment sequencing, and high availability. When the payment service needs a new config value, you deploy the config API first, then the payment service. If the config API is down during a scaling event, new instances can't start. You've traded one problem for a more complex one.&lt;/p&gt;

&lt;p&gt;The fundamental issue is bootstrapping: how does the application know where to find its configuration before it has any configuration? Cloud-native config stores solve this by using identity that's already present in the runtime environment, such as IAM roles in AWS, managed identities in Azure, or service accounts in GCP. The application authenticates using credentials the platform provides automatically, which means no bootstrap secrets to manage.&lt;/p&gt;
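
&lt;p&gt;The startup flow can be sketched roughly as follows. This assumes AWS Parameter Store via &lt;code&gt;boto3&lt;/code&gt;, with the client injected so the bootstrap logic is testable; the parameter path and names are invented for illustration:&lt;/p&gt;

```python
# Bootstrap sketch: the service knows only its environment name; every
# other value comes from the config store, fetched with the identity the
# platform already provides. Paths and parameter names are invented.

def load_config(ssm_client, environment):
    path = f"/myapp/{environment}/"
    config = {}
    for name in ("db_connection", "api_base_url", "request_timeout"):
        response = ssm_client.get_parameter(Name=path + name, WithDecryption=True)
        config[name] = response["Parameter"]["Value"]
    return config
```

&lt;p&gt;In production you would pass &lt;code&gt;boto3.client("ssm")&lt;/code&gt;, which authenticates through the instance's IAM role; nothing secret appears in the code or its repository.&lt;/p&gt;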

&lt;h2&gt;
  
  
  What External Config Stores Provide
&lt;/h2&gt;

&lt;p&gt;Every major cloud platform offers configuration stores. AWS has Parameter Store and Secrets Manager, Azure has Key Vault and App Configuration, GCP has Secret Manager, and HashiCorp Vault works across all of them. The specific product matters less than the pattern they all share.&lt;/p&gt;

&lt;p&gt;Sensitive data never touches the codebase because it never needs to. Credentials live in the config store, encrypted at rest, with access controlled through the same IAM policies that govern your other cloud resources. When someone leaves the team, you revoke their access in one place rather than hoping they don't have local copies of config files. Audit trails show exactly who accessed what and when, which proves invaluable when security asks "who has seen the production database password in the last 90 days?"&lt;/p&gt;

&lt;p&gt;Applications become genuinely portable because the same artifact runs everywhere. A Docker image built once deploys to dev, staging, and production without modification. The application asks "what environment am I in?" at startup and fetches the appropriate config, eliminating templating systems, environment-specific builds, and "works on my machine" debugging sessions caused by config drift.&lt;/p&gt;

&lt;h2&gt;Domain-Oriented Configuration&lt;/h2&gt;

&lt;p&gt;One of the strongest arguments for external config stores is organizational: configuration should reflect business domains, not component architecture.&lt;/p&gt;

&lt;p&gt;When configuration lives with code, it naturally organizes around components: the payment API has its config, the notification service has its config, and the reporting batch job has its config. This feels logical until you realize that configuration often cuts across components. A tenant's feature flags, rate limits, and integration credentials apply to multiple services. A domain like "payments" might span three APIs, two background processors, and a gateway configuration.&lt;/p&gt;

&lt;p&gt;Component-oriented config creates problems at scale. With 50 APIs and 100 tenants, ops teams need to understand which services handle which tenants, how configuration flows through the architecture, and where to make changes when business requirements shift. When the payment domain gets restructured into separate authorization and settlement services, all the tenant configs need redistribution. The internal architecture leaks into operational concerns.&lt;/p&gt;

&lt;p&gt;Domain-oriented config inverts this relationship. Configuration organizes around business concepts like tenants, products, or integration partners rather than implementation details. The same tenant configuration applies to whichever services handle that tenant's requests. When architecture changes, the domain config stays stable while service-level mappings adjust. Ops personnel work with business concepts they understand rather than implementation details they shouldn't need to know.&lt;/p&gt;

&lt;p&gt;Component-specific configuration still exists, but only for genuinely component-specific concerns: performance tuning, networking behavior, resource limits, internal timeouts. These are implementation details that don't belong in domain configuration. The distinction matters: domain config answers "what does this tenant need?" while component config answers "how does this service behave?"&lt;/p&gt;
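&lt;p&gt;A toy sketch of that split, with invented tenant and service names: the tenant's domain config merges with whichever service handles the request, so restructuring payments into authorization and settlement services only touches the component side.&lt;/p&gt;

```python
# Domain config: organized around business concepts (tenants), stable
# across architectural changes. Names here are hypothetical.
DOMAIN_CONFIG = {
    "tenants/acme-corp": {"rate_limit": 100, "feature_flags": ["fast-checkout"]},
}

# Component config: genuinely service-specific implementation details.
COMPONENT_CONFIG = {
    "settlement-api": {"thread_pool_size": 16},
    "authorization-api": {"thread_pool_size": 8},
}

def effective_config(service, tenant):
    """Merge the tenant's domain config with the service's own settings.

    If the payment domain splits into more services, only
    COMPONENT_CONFIG grows; the tenant entries stay untouched.
    """
    return {
        **DOMAIN_CONFIG[f"tenants/{tenant}"],
        **COMPONENT_CONFIG[service],
    }
```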

&lt;h2&gt;Hierarchical Config in Practice&lt;/h2&gt;

&lt;p&gt;External config stores enable domain-oriented organization through hierarchy. Consider how AWS Parameter Store structures configuration (the pattern applies to other platforms).&lt;/p&gt;

&lt;p&gt;Parameters follow a path structure like &lt;code&gt;/production/tenants/acme-corp/feature-flags&lt;/code&gt; or &lt;code&gt;/production/domains/payments/integration-credentials&lt;/code&gt;. This hierarchy enables powerful access patterns: granting developers read access to &lt;code&gt;/dev/*&lt;/code&gt; while restricting &lt;code&gt;/production/*&lt;/code&gt; to deployment pipelines, updating all services in an environment by changing &lt;code&gt;/production/shared/*&lt;/code&gt;, and giving domain teams ownership of their domain's config subtree while platform teams manage cross-cutting infrastructure config.&lt;/p&gt;
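&lt;p&gt;The layering that the hierarchy enables can be sketched in a few lines; the paths and values below are invented, and a real implementation would call the store's get-by-path API instead of walking a dict:&lt;/p&gt;

```python
# Hypothetical Parameter Store style paths, stubbed locally.
PARAMETERS = {
    "/production/shared/log-level": "warn",
    "/production/shared/region": "us-east-1",
    "/production/tenants/acme-corp/log-level": "debug",
}

def resolve(env, tenant):
    """Layer shared values under tenant-specific ones.

    More specific paths win, so updating /production/shared/* changes
    every service at once unless a tenant overrides the key.
    """
    merged = {}
    for prefix in (f"/{env}/shared/", f"/{env}/tenants/{tenant}/"):
        for path, value in PARAMETERS.items():
            if path.startswith(prefix):
                merged[path[len(prefix):]] = value
    return merged
```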

&lt;p&gt;The cost concern that often comes up is a non-issue in practice. AWS Parameter Store's standard tier provides 10,000 free parameters with no throughput charges. Azure Key Vault and GCP Secret Manager have similar pricing models. A single security incident from leaked credentials costs more than decades of config storage fees.&lt;/p&gt;

&lt;p&gt;Native integration with compute services removes friction. ECS tasks, Lambda functions, and Kubernetes pods can reference parameters directly in their definitions. The platform handles fetching and injecting values at startup. For services that need to refresh config without restarting, SDKs provide straightforward polling or event-driven updates.&lt;/p&gt;
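&lt;p&gt;For example, an ECS task definition can reference a Parameter Store value directly in a container's &lt;code&gt;secrets&lt;/code&gt; block, and the platform injects it as an environment variable at startup. The account ID and parameter path below are placeholders:&lt;/p&gt;

```json
{
  "containerDefinitions": [
    {
      "name": "payments-api",
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/production/domains/payments/db-password"
        }
      ]
    }
  ]
}
```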

&lt;h2&gt;The Trade-offs&lt;/h2&gt;

&lt;p&gt;Moving to a distributed config store trades one form of complexity for another. Configuration updates feel heavier because they are heavier, more like database migrations than editing a text file. You'll write change requests, get approvals, coordinate timing with deployments, and verify rollback procedures work.&lt;/p&gt;

&lt;p&gt;This friction is the point, not a bug. Configuration changes in production systems &lt;em&gt;should&lt;/em&gt; be intentional and controlled because a misconfigured timeout can cascade into an outage and a wrong feature flag can expose unfinished functionality to customers. The ceremony forces you to think about backwards compatibility (will existing instances handle this new format?), rollback procedures (how do we revert if this breaks something?), and dependency ordering (which services need to restart first?).&lt;/p&gt;

&lt;p&gt;What feels like bureaucracy is actually the appropriate amount of care for changes that can bring down production systems. Config changes deserve deliberate thought rather than a quick file edit.&lt;/p&gt;

&lt;h2&gt;Do We Always Avoid Local Configs?&lt;/h2&gt;

&lt;p&gt;The goal isn't to externalize &lt;em&gt;everything&lt;/em&gt; since that would create unnecessary dependencies and slow down development. Some configuration genuinely belongs with the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application defaults&lt;/strong&gt; like log formats, default retry counts, or internal timeout values that don't change between environments &lt;em&gt;might&lt;/em&gt; belong in code. These define how the application behaves, not how it connects to external systems. I would still hesitate before coupling them to your code, however.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework configuration&lt;/strong&gt; for HTTP server settings, thread pool sizes, or middleware ordering rarely varies by environment. When it does, you can override specific values from the external store while keeping sensible defaults local.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development shortcuts&lt;/strong&gt; that let engineers run the application locally without network dependencies make sense for rapid iteration. The key is ensuring the code path to fetch external config exists and works; you're just bypassing it locally for convenience.&lt;/p&gt;

&lt;p&gt;The decision rule is straightforward: if configuration is sensitive, environment-specific, or shared across services, externalize it. If it's about internal application behavior that stays constant everywhere, keep it local. For prototypes and throwaway code, do whatever gets you moving fastest, but establish the external config discipline before real users or sensitive data enter the picture.&lt;/p&gt;
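&lt;p&gt;That decision rule can be expressed as a layered lookup; this is a sketch, with the external store stubbed as a dict and the keys and values invented:&lt;/p&gt;

```python
# In-code defaults: internal behavior that stays constant everywhere.
LOCAL_DEFAULTS = {
    "log_format": "json",   # application default: stays in code
    "retry_count": 3,       # internal behavior: stays in code
}

# Stand-in for the external store: environment-specific and sensitive
# values live here, never in the codebase.
EXTERNAL_STORE = {
    "db_host": "payments-db.internal",
    "api_key": "stored-in-secrets",
}

def get(key):
    """External values win; local defaults only cover internal behavior."""
    if key in EXTERNAL_STORE:
        return EXTERNAL_STORE[key]
    return LOCAL_DEFAULTS[key]
```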

&lt;h2&gt;Preventing Configs from Creeping Back into Code&lt;/h2&gt;

&lt;p&gt;Moving to external config stores doesn't automatically stop developers from committing secrets. Without active governance, you'll find credentials in code reviews months after "completing" the migration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code review culture&lt;/strong&gt; matters more than tooling: values over process. When reviewers consistently flag hardcoded configs, the team internalizes the discipline. Make it part of your definition of done: "configuration fetched from external store, no hardcoded environment-specific values."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-commit hooks&lt;/strong&gt; can scan commits for patterns that look like secrets, including API keys, connection strings, and private keys. A developer who accidentally pastes a credential gets immediate feedback rather than finding out during a security audit. Teams should be able to govern themselves, but for high-risk products this extra gate is worth having.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository templates&lt;/strong&gt; ensure new projects start with proper &lt;code&gt;.gitignore&lt;/code&gt; files excluding &lt;code&gt;.env&lt;/code&gt;, &lt;code&gt;config.local.*&lt;/code&gt;, and &lt;code&gt;secrets.*&lt;/code&gt;. When developers create repos from templates, protection comes built-in rather than requiring someone to remember to add it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD validation&lt;/strong&gt; catches anything that slips past local checks. Pipelines can scan for hardcoded values, verify that config references resolve to external stores, and fail builds that contain suspicious patterns. This is your safety net when pre-commit hooks get bypassed.&lt;/p&gt;
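&lt;p&gt;The scanning step that pre-commit hooks and pipelines perform can be sketched with a few regexes. Real scanners such as gitleaks or detect-secrets cover far more patterns; the ones below are simplified illustrations:&lt;/p&gt;

```python
import re

# Naive secret-shaped patterns, for illustration only.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key id
    re.compile(r"-----BEGIN (RSA|EC) PRIVATE KEY-----"),  # private key header
    re.compile(r"""password\s*=\s*['"].+['"]""", re.IGNORECASE),
]

def scan(text):
    """Return True if the text contains something secret-looking."""
    return any(pattern.search(text) for pattern in SECRET_PATTERNS)
```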

&lt;h2&gt;Practical Concerns and How to Handle Them&lt;/h2&gt;

&lt;p&gt;The concerns that come up when teams consider this migration are legitimate, even if the solutions are straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What happens when the config store is unavailable?"&lt;/strong&gt; In practice, cloud config stores have better uptime than the applications that depend on them. If AWS Parameter Store or Azure Key Vault is down, you likely have bigger problems since the entire region is probably degraded. For transient issues, cache config after the first fetch and use circuit breakers to serve stale values during brief outages. The dependency you're adding is on infrastructure that's more reliable than almost anything you'd build yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Startup is slower now."&lt;/strong&gt; True. Fetching config over the network adds latency compared to reading a local file. For most applications, we're talking about tens to hundreds of milliseconds during startup, which is imperceptible to users and irrelevant compared to database connection pool initialization, dependency injection setup, and cache warming. Cache aggressively and fetch once; the performance concern disappears.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Local development gets harder."&lt;/strong&gt; It gets &lt;em&gt;different&lt;/em&gt;, not harder. Modern SDKs and credential chains make fetching config from external stores transparent once you've set up local credentials. Many teams use local overrides for convenience while ensuring the external fetch code path stays exercised. The initial setup takes an afternoon; the debugging time saved from "works locally, fails in production" issues pays that back quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"We can't work offline."&lt;/strong&gt; How often do you actually develop offline? Installing packages requires connectivity. So does pulling dependencies, accessing Jira, communicating with teammates, running CI/CD, and pushing code. The offline development scenario is increasingly rare. Yet, using local development solutions like .NET Aspire can provide local git controlled registries and configuration meant only for pure local connectivity.&lt;/p&gt;

&lt;h2&gt;Version Control for Configuration&lt;/h2&gt;

&lt;p&gt;Config changes need the same versioning discipline as code changes, and external config stores make this possible in ways that local files don't.&lt;/p&gt;

&lt;p&gt;When config lives in files, version history shows what changed but not why. When config lives in an external store with change management, each update can include context: who approved it, what ticket it relates to, what the rollback plan is. Some teams adopt GitOps patterns where config &lt;em&gt;references&lt;/em&gt; live in Git (pointing to external store paths) while actual values stay external. This approach gives you code review for config changes without putting secrets in repositories.&lt;/p&gt;

&lt;p&gt;Rollback becomes straightforward when the config store maintains version history. Bad deploy? Roll back the config to the previous version without touching application code. This separation lets you fix config problems in seconds rather than the minutes-to-hours a full redeploy requires.&lt;/p&gt;

&lt;p&gt;Config drift becomes detectable. Environments gradually diverge in ways nobody fully understands, and external config stores let you see exactly what's different between staging and production. No more "that value got changed manually and nobody remembers why."&lt;/p&gt;
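&lt;p&gt;Drift detection reduces to diffing the resolved config of two environments. A minimal sketch, with the environment snapshots passed in as dicts and the keys invented:&lt;/p&gt;

```python
def drift(staging, production):
    """Return keys whose values differ or exist in only one environment."""
    keys = set(staging) | set(production)
    report = {}
    for key in sorted(keys):
        a = staging.get(key, "(missing)")
        b = production.get(key, "(missing)")
        if a != b:
            report[key] = (a, b)
    return report
```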

&lt;h2&gt;The Daily Experience Changes&lt;/h2&gt;

&lt;p&gt;The shift to external config stores changes how teams work in ways that compound over time.&lt;/p&gt;

&lt;p&gt;Testing becomes more realistic because senior engineers can run local code against remote staging or remote dev configs (with appropriate permissions) rather than maintaining local approximations that drift from reality. When something works locally and fails in staging, config drift is rarely the culprit.&lt;/p&gt;

&lt;p&gt;Security incidents from leaked credentials drop dramatically: secrets stop appearing in Git history, and credential sharing over Slack disappears because there is no file to share; permissions grant access directly. When you need to rotate a credential, you update one place and know it propagates everywhere, so the question "did we rotate that key in all environments?" finally has a definitive answer.&lt;/p&gt;

&lt;p&gt;Deployments simplify because the same artifact works everywhere. Build pipelines no longer template config files or maintain environment-specific variations, so you build once, deploy many times, and config comes from the environment at runtime with faster deployments and fewer things that can go wrong.&lt;/p&gt;

&lt;h2&gt;Making the Shift&lt;/h2&gt;

&lt;p&gt;Local config files feel simple and convenient right up until they aren't. The pattern creates security risks that compound over time, deployment complexity that frustrates teams, and environment drift that causes mysterious production-only failures. External config stores trade familiar convenience for operational discipline, and that trade-off is worth making.&lt;/p&gt;

&lt;p&gt;The friction of external config management isn't overhead; it's the appropriate level of care for changes that can bring down production systems. Configuration errors cause outages as often as code bugs, and they deserve the same rigor: version control, change management, audit trails, and rollback procedures.&lt;/p&gt;

&lt;p&gt;Start with secrets by moving credentials out of the codebase and into a config store. Once that discipline is established, expand to environment-specific values, and eventually the pattern becomes natural: config comes from the environment, not from files.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
