<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Seth Orell</title>
    <description>The latest articles on Forem by Seth Orell (@setho).</description>
    <link>https://forem.com/setho</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1143273%2F0a3864c0-9a33-4697-a24a-990cc510753a.jpeg</url>
      <title>Forem: Seth Orell</title>
      <link>https://forem.com/setho</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/setho"/>
    <language>en</language>
    <item>
      <title>Proudly Found Elsewhere</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Mon, 30 Mar 2026 13:38:23 +0000</pubDate>
      <link>https://forem.com/aws-builders/proudly-found-elsewhere-2fnl</link>
      <guid>https://forem.com/aws-builders/proudly-found-elsewhere-2fnl</guid>
      <description>&lt;p&gt;I've helped many companies ship better software faster. One negative trend I've observed at many of these is what's termed the "not invented here" (NIH) syndrome. That is, they tend to build or self-manage software rather than let another company provide it as a paid service.&lt;/p&gt;

&lt;p&gt;I've observed this in attitudes toward cloud-managed services like Lambda or S3 ("We should stick to Kubernetes because we can easily port this to other cloud platforms if necessary"), toward database clusters ("Seriously, how hard is it to run MongoDB?"), and toward source control ("We can run GitLab ourselves and keep things secure"). I take a different approach, one aligned with the concept of "the division of labor," in which specialized companies trade services with one another efficiently.&lt;/p&gt;

&lt;p&gt;By Division of Labor, I mean "the separation of the tasks in any economic system or organization so that participants may specialize." (&lt;a href="https://en.wikipedia.org/wiki/Division_of_labour" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;). It's the specialization that provides the value. Those specialists then voluntarily trade with one another to mutual benefit. When organizations choose to specialize and trade, they cure themselves of NIH. There's even a term for this new approach: "proudly found elsewhere" (PFE). I like that.&lt;/p&gt;

&lt;h1&gt;
  
  
  Not everything
&lt;/h1&gt;

&lt;p&gt;However, these companies did not consistently apply NIH. They all used other companies specializing in parts of the business they would otherwise own. For example, they used authentication/authorization providers like Auth0, webinar providers like Zoom, email providers like MailChimp, and, most conspicuously, data center providers like AWS.&lt;/p&gt;

&lt;p&gt;My litmus test: "Is this service/app a distinguishing, separately marketable feature of our business? If so, build it. Otherwise, let's find a provider." For example, if we don't have a world-class authentication platform that we could actually market and sell (we don't), then we should look to lease one (we did: Auth0). This helps keep expensive engineering and operations work focused on key business initiatives.&lt;/p&gt;

&lt;h1&gt;
  
  
  Win-win
&lt;/h1&gt;

&lt;p&gt;When I specialize in building software for a business and let Amazon specialize in providing on-demand data centers, who wins and who loses? When my company and Amazon trade (my company's dollars for Amazon's infrastructure), we both benefit. We both win.&lt;/p&gt;

&lt;p&gt;Let's look at a more canonical example that I learned from Dr. Yaron Brook. You want to buy an iPhone. It costs $1,000. As you walk out of the Apple Store with your new iPhone, did you "win"? Yes! You valued the iPhone more than you valued your $1,000 (or you wouldn't have bought it). Did Apple win? Yes! They valued your $1,000 more than the iPhone (or they wouldn't have sold it). Both sides voluntarily exchanged a lesser value for a greater one.&lt;/p&gt;

&lt;p&gt;This mutually beneficial nature is the essence of trade. It happens when you are trading for iPhones, groceries, or software. And specialization (division of labor) allows both Apple and my company to maximize the value we can bring to the trade. Win-win.&lt;/p&gt;

&lt;h1&gt;
  
  
  Should I trade?
&lt;/h1&gt;

&lt;p&gt;Some considerations when approaching build/buy:&lt;/p&gt;

&lt;p&gt;Speed. It is often much faster to integrate with an existing SaaS provider than to build your own. I once worked with a startup whose CEO repeatedly told me, "Time is not our friend."&lt;/p&gt;

&lt;p&gt;Cost. The provider of a SaaS product knows that domain, whatever it is. They have economies of scale and offer a slice of that scale at an affordable per-unit price.&lt;/p&gt;

&lt;p&gt;Expertise. The SaaS provider is declaring: "I am good at XYZ; I can deliver it better than any of my competitors, and I constantly work to improve how I deliver it." Who do you think can better run GitLab, your already overworked Operations team, or &lt;a href="https://about.gitlab.com/" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt; itself?&lt;/p&gt;

&lt;p&gt;Security. Your SaaS provider's incentive aligns with yours: protect customer data or go out of business. They focus on their security while you focus on yours. For example, Amazon's AWS access management is world-class. You could spend years on your own and never come close.&lt;/p&gt;

&lt;p&gt;Complexity. The more responsibilities we heap on engineers, the less they can retain about any particular part of the system. For example, if engineers already run and maintain their backend applications and we then start hosting our own MongoDB servers, we add MongoDB DBA tasks to the body of knowledge they must keep current. Often, a company in this situation ends up hiring someone with specialized knowledge (a MongoDB DBA), which further drives up the cost of not using a SaaS provider.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://www.reddit.com/r/devops/comments/7r5kyd/gitlabcom_vs_selfhosted_gitlab_ce/" rel="noopener noreferrer"&gt;Reddit thread&lt;/a&gt; on self-hosted vs. GitLab.com captured this well: "If gitlab.com crashes, which isn’t impossible, you’ll have the same problem. But given that their core business is running GitLab servers and your core business is probably something else, they are probably better at running GitLab servers than you are."&lt;/p&gt;

&lt;p&gt;The same thread also surfaced the opposing view: "For us always self hosted but we’re just like that. I prefer to be in control so if something breaks it can only be my fault." This is a revealing response. "Always self-hosted" and "we’re just like that" signal convention masquerading as strategy. And the reasoning behind "if something breaks it can only be my fault" doesn’t hold up — self-hosting increases the likelihood of something breaking in the first place. Accepting blame isn’t a benefit; it’s a cost.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Law of Comparative Advantage
&lt;/h1&gt;

&lt;p&gt;The idea of focusing your efforts on what you are good at and trading with others (who are doing the same) is not new. Adam Smith hinted at the concept in his 1776 book, The Wealth of Nations. It was most famously characterized in 1817 by the English economist David Ricardo as "The Law of Comparative Advantage."&lt;/p&gt;

&lt;p&gt;Ricardo was interested in international trade. He demonstrated that, given two countries and two goods, even if one country could produce both goods more cheaply than the other, both countries are better off when each specializes in the good it produces most efficiently and trades for the other. It is a bit counterintuitive, but extremely powerful.&lt;/p&gt;
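&lt;p&gt;Ricardo's famous cloth-and-wine illustration makes the arithmetic concrete (the figures below are the standard textbook numbers: labor hours needed to produce one unit of each good):&lt;/p&gt;

```latex
% Labor hours needed to produce one unit:
%                 Cloth   Wine
%   England        100     120
%   Portugal        90      80     (cheaper at BOTH goods)
%
% Without trade, each country makes one unit of each good:
%   combined output = 2 cloth + 2 wine.
%
% With specialization, spending the same total hours:
\text{England (all 220 h on cloth):}\quad \frac{100 + 120}{100} = 2.2 \ \text{cloth}
\text{Portugal (all 170 h on wine):}\quad \frac{90 + 80}{80} = 2.125 \ \text{wine}
% Combined output rises to 2.2 cloth + 2.125 wine: more of both goods,
% even though Portugal holds an absolute advantage in both.
```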

&lt;p&gt;I recently read an example in an &lt;a href="https://harrybinswanger.substack.com/p/what-if-robots-take-all-the-jobs" rel="noopener noreferrer"&gt;article&lt;/a&gt; from Dr. Harry Binswanger about the subject, from which I quote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Law of Comparative Advantage applies much more widely than to international trade. It used to be illustrated by considering a CEO and his typist—back when there were typewriters and typists. Even if the CEO can type faster than a given typist, it pays him to hire that typist because off-loading the typing work saves him time. He can then use that saved time in the area of his comparative advantage: running the company.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  AI and the new NIH
&lt;/h1&gt;

&lt;p&gt;The same syndrome is alive and well in the AI era. As large language models have reshaped software development, teams are facing a new wave of build-or-buy decisions: should we use the OpenAI or Anthropic API, or train our own model? Should we use a managed vector database, or build one in-house? Should we license an AI coding assistant, or build our own?&lt;/p&gt;

&lt;p&gt;The reasoning sounds strategic: data privacy, cost control, and avoiding vendor lock-in. But the underlying economics haven't changed. Consider AI coding assistants. Companies are building internal "code copilots" instead of licensing Claude Code, Cursor, or GitHub Copilot. Their engineering teams spend months fine-tuning models, wrangling GPU clusters, and building IDE integrations—time that isn't going into their actual product. Meanwhile, the companies building those tools are doing nothing else. Their entire existence is solving that one problem. Their comparative advantage in this domain is overwhelming.&lt;/p&gt;

&lt;p&gt;Or consider the companies standing up their own GPU clusters to run open-source LLMs, because "we don't want to depend on a third-party API." Fair enough—data privacy and compliance are real concerns in specific cases. But for most teams, this means hiring ML infrastructure engineers, managing CUDA drivers, tracking model updates, and absorbing the operational overhead of a business they are not in. AWS, Azure, and GCP have entire divisions whose comparative advantage is exactly this. Trade with them.&lt;/p&gt;

&lt;p&gt;The litmus test still applies: Is this a distinguishing, separately marketable feature of our business? If you're not building the next foundation model, you probably shouldn't be operating one.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping Up
&lt;/h1&gt;

&lt;p&gt;So, what would I tell business leaders who want to build it themselves? The advice is the same whether we're talking cloud infrastructure or AI: use GitLab.com instead of hosting your own GitLab. Use MongoDB Atlas—or better yet, migrate to DynamoDB. Pick a cloud provider that offers a broad array of fully managed services and start consuming them (Hello, AWS!). And when your team starts debating whether to build your own LLM, vector store, or AI coding assistant, ask the litmus test question first. If the answer is "no, this isn't what we sell," find a provider.&lt;/p&gt;

&lt;p&gt;The division of labor is not just an economic abstraction—it's a practical engineering strategy. Specialization creates depth. Trade creates efficiency. When you focus your team on what makes your product unique and let specialists handle everything else, you ship faster, operate more reliably, and scale more sustainably. Both sides win. That's the whole point.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;p&gt;Wikipedia: &lt;a href="https://en.wikipedia.org/wiki/Division_of_labour" rel="noopener noreferrer"&gt;Division of Labour&lt;/a&gt;&lt;br&gt;
Ayn Rand Lexicon: &lt;a href="https://courses.aynrand.org/lexicon/trader-principle/" rel="noopener noreferrer"&gt;Trader Principle&lt;/a&gt;&lt;br&gt;
Harry Binswanger: &lt;a href="https://harrybinswanger.substack.com/p/what-if-robots-take-all-the-jobs" rel="noopener noreferrer"&gt;What If Robots Take All the Jobs?&lt;/a&gt;&lt;br&gt;
Wikipedia: &lt;a href="https://en.wikipedia.org/wiki/Comparative_advantage" rel="noopener noreferrer"&gt;Comparative advantage&lt;/a&gt;&lt;br&gt;
Yaron Brook: &lt;a href="https://www.youtube.com/yaronbrook" rel="noopener noreferrer"&gt;The Yaron Brook Show&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>cloud</category>
      <category>management</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Mastering AppSync: Unions and Interfaces</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Thu, 12 Feb 2026 13:36:53 +0000</pubDate>
      <link>https://forem.com/aws-builders/mastering-appsync-unions-and-interfaces-441a</link>
      <guid>https://forem.com/aws-builders/mastering-appsync-unions-and-interfaces-441a</guid>
      <description>&lt;p&gt;Introducing AppSync to your team has its challenges. But it's doubly so if the team is unfamiliar with GraphQL (GQL). As the AppSync "expert," you will be expected to explain not only AppSync's &lt;a href="https://sethorell.substack.com/p/testing-aws-appsync-javascript-resolvers" rel="noopener noreferrer"&gt;APPSYNC_JS resolvers&lt;/a&gt;, &lt;a href="https://sethorell.substack.com/p/appsync-merged-api-as-code" rel="noopener noreferrer"&gt;merged APIs&lt;/a&gt;, and &lt;a href="https://sethorell.substack.com/p/appsync-schema-checks" rel="noopener noreferrer"&gt;how to test changes&lt;/a&gt;, but also GQL types, schemas, and queries. &lt;/p&gt;

&lt;p&gt;In my experience, most teams pick up GQL basics quickly, but struggle with Union and Interface types. When should they choose one over the other? What makes them different? Does the team need them at all?&lt;/p&gt;

&lt;h1&gt;
  
  
  None of the above
&lt;/h1&gt;

&lt;p&gt;Let’s tackle the last question first. Do we need to use an interface or union at all? In short, no, you don’t. But interfaces and unions add something to your schema that is otherwise difficult to express. Let’s look at an example.&lt;/p&gt;

&lt;p&gt;Imagine an error type that informs the user that his input is invalid. In its basic form, it has the following properties:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffs7goh8jrp8b9wol1qm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffs7goh8jrp8b9wol1qm7.png" alt=" " width="381" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In GQL, that can easily become a type as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lmhgdt43zs22mgrmbkt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lmhgdt43zs22mgrmbkt.png" alt=" " width="255" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Specific types of this error may extend this basic set of properties to provide relevant details. Let's look at two: a "missing properties" error and an "improper age" error. They both share the properties &lt;code&gt;code&lt;/code&gt; and &lt;code&gt;message&lt;/code&gt;, so we can add optional properties like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic6v3ahisz6c48nm506q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic6v3ahisz6c48nm506q.png" alt=" " width="356" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The caller can check the error type and look for a new property to get more details about the response. But what can we &lt;em&gt;not&lt;/em&gt; express with this schema?&lt;/p&gt;

&lt;p&gt;Other than through simple (English) semantics, the schema doesn’t tell you that the properties &lt;code&gt;minimumAge&lt;/code&gt; and &lt;code&gt;missingProperties&lt;/code&gt; are mutually exclusive; you’ll get one error type or the other, not both. Other schema tools, such as OpenAPI, include constructs like &lt;code&gt;oneOf&lt;/code&gt; to address this. GQL has to use other approaches to make this exclusivity explicit: interface and union types.&lt;/p&gt;
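&lt;p&gt;To make the problem concrete, here is one way the overloaded type could be written out in SDL (the type name and field types here are my assumptions for illustration):&lt;/p&gt;

```graphql
# Nothing in this schema says minimumAge and missingProperties
# are mutually exclusive; only the prose documentation does.
type InputError {
  code: String!
  message: String!
  minimumAge: Int              # present only on "improper age" errors
  missingProperties: [String!] # present only on "missing properties" errors
}
```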

&lt;h1&gt;
  
  
  The GQL Interface
&lt;/h1&gt;

&lt;p&gt;The idea of an interface is well known to most programmers. In Object-Oriented languages, an interface is &lt;a href="https://en.wikipedia.org/wiki/Interface_(object-oriented_programming)" rel="noopener noreferrer"&gt;a data type that acts as an abstraction of a class&lt;/a&gt;. It defines methods and properties that an implementing class must support.&lt;/p&gt;

&lt;p&gt;A GQL interface is even simpler. It defines a set of fields that any implementing type must expose. It only defines properties (no methods). Let’s look at an example that builds on the error response from earlier.&lt;/p&gt;

&lt;p&gt;We recognized that we want some common error properties, but we also want to display additional information relevant to the type of error returned. We can use a GQL interface to represent the common properties, with implementing types carrying the specifics, like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febobqgn19g5j4k2xeygz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febobqgn19g5j4k2xeygz.png" alt=" " width="474" height="407"&gt;&lt;/a&gt;&lt;/p&gt;
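&lt;p&gt;Sketched in SDL (the interface and type names are assumed for illustration), that looks like:&lt;/p&gt;

```graphql
interface Error {
  code: String!
  message: String!
}

type MinimumAgeError implements Error {
  code: String!
  message: String!
  minimumAge: Int!
}

type MissingPropertiesError implements Error {
  code: String!
  message: String!
  missingProperties: [String!]!
}
```

&lt;p&gt;Note that each implementing type must redeclare the interface's fields; GQL does not inherit them implicitly.&lt;/p&gt;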

&lt;p&gt;Now we could define a mutation with the following signature:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6h82lzgulh8prl5k6tz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6h82lzgulh8prl5k6tz.png" alt=" " width="381" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And execute the query with the following projection:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qvlezgqdds4rmcm8ait.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qvlezgqdds4rmcm8ait.png" alt=" " width="339" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is where GQL interfaces start to diverge from OO interfaces. With an OO interface, you cannot (or rather, you ought not) tell what the implementing type is. In GQL, the implementing type is always available through a special, undeclared property called &lt;code&gt;__typename&lt;/code&gt;. This lets us add inline fragments to pull in the additional properties. For example, we could expand our &lt;code&gt;saveInput&lt;/code&gt; projection like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw18jlpfrjcoylg7tl7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw18jlpfrjcoylg7tl7n.png" alt=" " width="381" height="370"&gt;&lt;/a&gt;&lt;/p&gt;
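&lt;p&gt;In text form, the expanded projection looks something like this (the mutation's argument shape is invented for illustration):&lt;/p&gt;

```graphql
mutation {
  saveInput(input: { age: 12 }) {
    __typename   # e.g. "MinimumAgeError"
    code         # interface fields: always available
    message
    ... on MinimumAgeError {
      minimumAge # populated only when __typename matches
    }
    ... on MissingPropertiesError {
      missingProperties
    }
  }
}
```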

&lt;p&gt;When you get a result back from this query, you will have your base properties alongside either of the named error properties. The fragments indicate the either-or nature of &lt;code&gt;minimumAge&lt;/code&gt; and &lt;code&gt;missingProperties&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Can’t we do the same thing with a GQL union type? Yes, with some caveats. Let’s look at that implementation.&lt;/p&gt;

&lt;h1&gt;
  
  
  The GQL Union
&lt;/h1&gt;

&lt;p&gt;A GQL union looks similar to a GQL interface, but it serves a different purpose. An interface defines what is common. A union defines separate, and (hopefully) disparate, types.&lt;/p&gt;

&lt;p&gt;With one notable exception (detailed below), I consider it poor API design for your endpoint to return dissimilar types. Imagine a REST API that could return Shoes, Ships, or Sealing Wax. How useful is that? Not very. Normally, I would want separate &lt;code&gt;GET&lt;/code&gt; routes for &lt;code&gt;/shoes&lt;/code&gt;, &lt;code&gt;/ships&lt;/code&gt;, and &lt;code&gt;/sealing-wax&lt;/code&gt;. That’s a better design!&lt;/p&gt;

&lt;p&gt;The big exception to this rule is full-text search. It often returns anything. If you are building a GQL API that touches search, you will be hard-pressed to avoid using a union type. Given that, let’s look at what would happen if we implemented our existing error example with unions.&lt;/p&gt;

&lt;p&gt;We could express our errors using unions if we make three distinct types and one union from them:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsckf6mj9diq91hr98cb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsckf6mj9diq91hr98cb0.png" alt=" " width="660" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
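&lt;p&gt;Sketched in SDL (again, the names are assumed), the three types and the union look like:&lt;/p&gt;

```graphql
type GeneralError {
  code: String!
  message: String!
}

type MinimumAgeError {
  code: String!
  message: String!
  minimumAge: Int!
}

type MissingPropertiesError {
  code: String!
  message: String!
  missingProperties: [String!]!
}

# The union carries no common fields; it only lists its members.
union SaveError = GeneralError | MinimumAgeError | MissingPropertiesError
```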

&lt;p&gt;Now we can invoke our mutation with almost the same projection. I say "almost" because we can no longer ask for the "common" properties outside of a fragment. This is because we are no longer defining anything in common; each type is as separate as a shoe or a ship. Here is the new projection:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4af5valhed36jd6sas3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4af5valhed36jd6sas3.png" alt=" " width="381" height="482"&gt;&lt;/a&gt;&lt;/p&gt;
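&lt;p&gt;In text form, the union projection must repeat the formerly common fields inside every fragment; only &lt;code&gt;__typename&lt;/code&gt; may be selected outside them (argument shape invented for illustration):&lt;/p&gt;

```graphql
mutation {
  saveInput(input: { age: 12 }) {
    __typename
    ... on GeneralError { code message }
    ... on MinimumAgeError { code message minimumAge }
    ... on MissingPropertiesError { code message missingProperties }
  }
}
```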

&lt;p&gt;We have maintained the mutual exclusivity, but we have lost the commonality. This is key to your decision between a union and an interface: is there something in common? If so, use an interface; otherwise, use a union.&lt;/p&gt;

&lt;h1&gt;
  
  
  GQL union/interface code smells
&lt;/h1&gt;

&lt;p&gt;Kent Beck defines a &lt;a href="https://wiki.c2.com/?CodeSmell" rel="noopener noreferrer"&gt;code smell&lt;/a&gt; as “a hint that something has gone wrong somewhere in your code.” Here are a few GQL smells to watch out for when dealing with interfaces and unions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The suppressed interface
&lt;/h2&gt;

&lt;p&gt;If your type has optional properties that are mutually exclusive, it is a good candidate for an interface. An example is the overloaded Error type we saw earlier:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq51nwvuad94c3rh1j5eq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq51nwvuad94c3rh1j5eq.png" alt=" " width="356" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The anemic interface
&lt;/h2&gt;

&lt;p&gt;Just because you share some properties between types doesn’t mean you have a candidate for an interface. Interfaces require query fragments and add cognitive load. Don’t use them if you don’t need them. An example could be types that all have an id property. Adding an &lt;code&gt;identifiable&lt;/code&gt; interface, with one property (&lt;code&gt;id&lt;/code&gt;), buys you little while adding complexity.&lt;/p&gt;
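&lt;p&gt;A hypothetical sketch of the anemic case:&lt;/p&gt;

```graphql
# One shared field buys little and forces fragments onto every query.
interface Identifiable {
  id: ID!
}

type Shoe implements Identifiable {
  id: ID!
  size: Int!
}
```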

&lt;h2&gt;
  
  
  Interfaces on full-text search results
&lt;/h2&gt;

&lt;p&gt;In most cases, you should treat interfaces on your full-text search results with suspicion. This is where unions typically shine. Remember our example with shoes, ships, and sealing wax? Use a union here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unions on non-full-text search results
&lt;/h2&gt;

&lt;p&gt;The corollary to the above is also true: any non-full-text search endpoint that has a union is suspect. You are probably looking for an interface here. However, if your source API has already blown it and has one badly designed endpoint for fetching shoes, ships, or sealing wax, you may be stuck.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Understanding the differences (and similarities) between GQL unions and interfaces is valuable when introducing AppSync to a team. They will look to you for answers and advice. Having examples and a few rules of thumb to guide decision-making can go a long way toward helping your team succeed.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;p&gt;AWS AppSync GraphQL Developer Guide: &lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/interfaces-and-unions.html" rel="noopener noreferrer"&gt;Interfaces and unions in GraphQL&lt;/a&gt;&lt;br&gt;
GraphQL.org: &lt;a href="https://graphql.org/learn/schema/" rel="noopener noreferrer"&gt;Schemas and Types&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>graphql</category>
      <category>appsync</category>
    </item>
    <item>
      <title>You Don't Have to Test All of Your Code...</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Mon, 05 Jan 2026 14:14:54 +0000</pubDate>
      <link>https://forem.com/aws-builders/you-dont-have-to-test-all-of-your-code-m97</link>
      <guid>https://forem.com/aws-builders/you-dont-have-to-test-all-of-your-code-m97</guid>
      <description>&lt;p&gt;Earlier this year, I noticed a sign in my dentist’s office that said, “You don’t have to brush all of your teeth, just the ones you want to keep.” I smiled. I like this saying for many reasons. It serves as a good reminder that all values require choice (&lt;em&gt;if&lt;/em&gt; you want to keep your teeth, &lt;em&gt;then&lt;/em&gt; you should brush them). It also reminded me of a recent conversation with a tech leader who wanted to get her engineers involved in testing. “Which parts should we test?” she asked. As I stood there in the dentist’s office, I said to myself, “Test the code you want to work.”&lt;/p&gt;

&lt;p&gt;On its own, it’s a bit pithy. We can expand on the idea. Let’s look at some ways we can get from zero to fully tested.&lt;/p&gt;

&lt;h1&gt;
  
  
  The finish line
&lt;/h1&gt;

&lt;p&gt;Teams that self-test release higher-quality software faster. DORA's research provides evidence for this year after year. It's not a tradeoff between speed and quality; it's both.&lt;/p&gt;

&lt;p&gt;To ensure quality, we want fast validation that our software isn’t broken. To this end, we want our tests to run as part of our build pipeline, giving the team immediate feedback when something is off.&lt;/p&gt;

&lt;p&gt;In fact, my typical pipelines are dominated by validation. Even before I’m through with my DEV environment, look at all the validation steps I have:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mvqx50komy0k7jv5s9m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mvqx50komy0k7jv5s9m.jpg" alt=" " width="764" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each validation step is designed to give fast feedback. The faster the feedback, the earlier the step is in the pipeline. Linting and unit testing are lightning-fast; they run first. Integration and acceptance tests take longer to run and are executed later.&lt;/p&gt;

&lt;h1&gt;
  
  
  The constraints
&lt;/h1&gt;

&lt;p&gt;Some of the following constraints are metaphysical (you cannot escape them); others are practical. All of them are good: the best-performing teams I've encountered use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time/Money
&lt;/h2&gt;

&lt;p&gt;Engineering teams never have enough time to do everything they’d like to do. This is normal. We have to carefully choose what we spend our time on. Your tests should be pushed down to the appropriate layer (unit/integration/acceptance). Sometimes this means you don’t test everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorship
&lt;/h2&gt;

&lt;p&gt;Your engineers should write the tests. Or, put another way, whoever writes the test should write the code. Your test code should be as ruthlessly clean and organized as your application code.&lt;/p&gt;

&lt;p&gt;Additionally, your engineers should own the build pipeline. They don’t need to own the platform the pipeline runs on (e.g., GitHub Actions or Jenkins), but they should control the pipeline’s structure and be able to make changes to keep it fast and functional for the applications they own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mockless
&lt;/h2&gt;

&lt;p&gt;Do not mock things you own. This goes for classes, modules, and services. Mocks are harmful. You may consider mocking stuff you don’t own, but only if necessary. This is (currently) an unpopular viewpoint and will be the topic of a forthcoming article, so hold on to your pushback until then.&lt;/p&gt;

&lt;h1&gt;
  
  
  The starting line
&lt;/h1&gt;

&lt;p&gt;If your team is writing its first test right at the beginning of the application’s development, congratulations. You are on the path to success from day one. This is the best position to be in.&lt;/p&gt;

&lt;p&gt;Unfortunately, many teams have started development with either no testing or with an external team handling testing. For all of them, they are now behind the pack and need to catch up. What to do?&lt;/p&gt;

&lt;h1&gt;
  
  
  The race
&lt;/h1&gt;

&lt;p&gt;Write a test.&lt;/p&gt;

&lt;p&gt;The best time to write your first test is today. The next best time to write your first test is tomorrow. That is, start now; it never gets any easier.&lt;/p&gt;

&lt;p&gt;What should I test? If you are building a new feature, test that feature. If you aren’t in the midst of a feature, pick the easiest existing feature to validate and test that. Building a test suite is hard at the beginning, so start small and build more tests as you gain confidence.&lt;/p&gt;

&lt;p&gt;Won’t this slow me down? Yes. And you will make it all back as you go forward. An F1 driver will take a few seconds away from the race to make a pit stop and change tires so he can finish even faster than he would have without the pit stop. It’s an investment toward winning, not a sacrifice.&lt;/p&gt;

&lt;h1&gt;
  
  
  The strategy
&lt;/h1&gt;

&lt;p&gt;Each new feature or bug fix brings a corresponding test (or tests) to validate it. Slowly, the percentage of self-tested features will increase.&lt;/p&gt;

&lt;p&gt;You can backfill missing test cases as you have time, or you can wait for a problem to emerge before covering it with a test. Both are viable strategies. The risk of the first is that you test something that may never break, or that would have little impact if it did. In other words, you might have wasted your time. The risk of the second is that, while you now know this is an excellent place for a test, the customer who found the problem may be upset.&lt;/p&gt;

&lt;p&gt;Write the test first. This helps reduce the scope of work and keeps you focused on the task at hand. Once the test goes green, tidy up and push to master. Which leads us to...&lt;/p&gt;

&lt;p&gt;Practice continuous integration. Merge your changes into master at least once a day. Stop using feature branches. Continuous integration is closely aligned with a test-first methodology; they work well together. Again, this is (currently) an unpopular viewpoint that deserves an entire article. I will have more to say on this in future writings.&lt;/p&gt;

&lt;h1&gt;
  
  
  The payoff
&lt;/h1&gt;

&lt;p&gt;Better software faster. Who doesn’t want that?&lt;/p&gt;

&lt;p&gt;Testing plays a role in both the “better” and the “faster.” You are better because your automated test suite prevents you from backsliding (breaking existing functionality) and, therefore, maintains the application’s high quality. You are faster because your tests are automated and part of your build pipeline. And, because your engineers own the code under test, the tests, and the pipeline, they have both the incentive and the capabilities to keep it running.&lt;/p&gt;

&lt;p&gt;Once a team achieves self-testing, it will start running circles around teams that don’t. And from personal experience, it’s a whole lot more fun to build this way.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://dora.dev/" rel="noopener noreferrer"&gt;DORA&lt;/a&gt;&lt;br&gt;
Yan Cui: &lt;a href="https://theburningmonk.com/2022/05/my-testing-strategy-for-serverless-applications/" rel="noopener noreferrer"&gt;My testing strategy for serverless applications&lt;/a&gt;&lt;br&gt;
Dave Farley: &lt;a href="https://youtube.com/playlist?list=PLwLLcwQlnXByKR1Fo7UnE6gQAbx-JfYJZ&amp;amp;si=RBQ36CCVRFKG9zS7" rel="noopener noreferrer"&gt;Acceptance Testing &amp;amp; BDD (playlist)&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/serverless-is-not-a-primary" rel="noopener noreferrer"&gt;Serverless Is Not A Primary&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/testing-eventbridge-with-serverless" rel="noopener noreferrer"&gt;Testing EventBridge with Serverless&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/code-review-musings" rel="noopener noreferrer"&gt;Code Review Musings&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/how-not-to-test-with-dynamodb" rel="noopener noreferrer"&gt;How Not to Test with DynamoDB&lt;/a&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Two Easy Ways to Cut Costs in AppSync Merged APIs</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Mon, 29 Sep 2025 14:16:34 +0000</pubDate>
      <link>https://forem.com/aws-builders/two-easy-ways-to-cut-costs-in-appsync-merged-apis-3k94</link>
      <guid>https://forem.com/aws-builders/two-easy-ways-to-cut-costs-in-appsync-merged-apis-3k94</guid>
      <description>&lt;p&gt;AWS released Merged APIs for AppSync in mid-2023. The Merged API acts as a unified endpoint over disparate AppSync Source APIs. This works well with a microservice architecture, as each source API can live near its resource service. I’ve written about AppSync Merged APIs in the past and have included links to those articles in the references section, below.&lt;/p&gt;

&lt;p&gt;The Merged API works by directly loading the schema and resolvers of its Source APIs. It differs from tools like Apollo Federation in that it eliminates an additional network hop. The AppSync Merged API invokes the source resolver directly; it doesn’t make an extra wire call to your Source API. How AppSync does this is not widely publicized and can lead to some inefficiencies in your AppSync architecture. I’ve experienced these inefficiencies firsthand and want to share my solutions. Let’s look at the first problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudWatch Logging
&lt;/h2&gt;

&lt;p&gt;AppSync can log to CloudWatch, and these logs can be critical for debugging problems. In AppSync, you have three controls you can use to affect logging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable / Disable&lt;/li&gt;
&lt;li&gt;Field resolver log level (All/Debug/Info/Error/None)&lt;/li&gt;
&lt;li&gt;Include verbose content (e.g., request/response headers, context)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these settings are available on both the Source and the Merged APIs.&lt;/p&gt;

&lt;p&gt;How do the settings interact between a source and the Merged API? That is, if I change a log setting in the source API, is that change reflected in the Merged API? Today I am here to tell you, “No, it is not.” The settings are completely independent. Each AppSync API gets its own log group.&lt;/p&gt;

&lt;p&gt;For example, if you have full logging in your Merged API (Enabled, All, &amp;amp; Include verbose content) and more restricted logging in your source API (Enabled, Error, Exclude verbose content), you will still end up with two independent CloudWatch log groups. The Merged API log group will have more content than the source API log group.&lt;/p&gt;

&lt;p&gt;What happens if you completely turn off logging in the source API? This was the question I found myself asking recently. The happy answer is that the Merged API logs everything you ask it to. This leads us to our first cost-cutting measure for AppSync: turn off &lt;em&gt;source&lt;/em&gt; API logging in production. This can reduce your AppSync-generated CloudWatch bill by up to half. You can adjust log levels in the Merged API to your preference and disable logging in the source API.&lt;/p&gt;

&lt;p&gt;If you deploy to multiple environments, such as dev/staging/production, you can adjust this setting per environment. For example, I deploy (and test) my source APIs in isolation, before merging. I do this in my dev environment. For these tests, having the source API log in CloudWatch is essential for debugging. I set the log levels differently as the app progresses through each environment on its way to production, where the source API logging is fully disabled.&lt;/p&gt;

&lt;h2&gt;
  
  
  AppSync Caching
&lt;/h2&gt;

&lt;p&gt;AppSync has a built-in cache. It is easy to use, and AWS handles most of the underlying cache management for you. You are responsible for telling AppSync the caching behavior you want to use (Full Request / Per-resolver / Operation) and the size of the cache to provision.&lt;/p&gt;

&lt;p&gt;Because you are selecting cache sizes from a tier (ranging from &lt;code&gt;small&lt;/code&gt; to &lt;code&gt;12xlarge&lt;/code&gt;) and the cache incurs an hourly charge (ranging from $0.044 to $6.775), this part of AppSync begins to run counter to my definition of “serverless.” Don’t lose hope, however. AppSync launched in 2017 without caching; AWS added server-side caching in 2019. More recently, AWS introduced ElastiCache Serverless in late 2023. Don’t be surprised if we get a much more “serverless” caching option in AppSync in the near future.&lt;/p&gt;

&lt;p&gt;Anyway, back to the case at hand. Cache settings, like log settings, are configured at the API level. Both AppSync source APIs and Merged APIs have cache settings. And, like we’ve already discovered with log settings, the configuration of one API’s cache does not affect the other. Your caching can be &lt;em&gt;completely turned off&lt;/em&gt; on your source APIs, yet it will still work with the Merged API if you enable it there. Nice!&lt;/p&gt;

&lt;p&gt;This means you don’t need to pay hourly charges for source API caches and still retain the ability to cache at the Merged API, taking pressure off of your underlying resource services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Savings
&lt;/h2&gt;

&lt;p&gt;You can use your infrastructure-as-code framework to configure log levels and cache sizes in each environment. Whether you are using Serverless Framework, CDK, SAM, or Terraform, you have easy ways to group configuration settings per deployment environment. You will likely want to set your log levels to &lt;code&gt;DEBUG&lt;/code&gt; in your dev and staging environments, but to something higher (&lt;code&gt;INFO&lt;/code&gt; or &lt;code&gt;ERROR&lt;/code&gt;) in production, where you (hopefully) have higher traffic.&lt;/p&gt;

&lt;p&gt;Similarly, for caches, you don’t need &lt;code&gt;4xlarge&lt;/code&gt; cache instances in your dev environment. Set those to &lt;code&gt;small&lt;/code&gt; or &lt;code&gt;medium&lt;/code&gt; and save.&lt;/p&gt;
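
&lt;p&gt;As a sketch of this idea in Python with boto3 (the stage names, role ARN placeholder, and helper function are illustrative assumptions, not a prescribed layout):&lt;/p&gt;

```python
# Derive AppSync log settings per deployment stage. The stage names and
# fallback level here are assumptions; adjust to your own environments.
FIELD_LOG_LEVELS = {
    "dev": "DEBUG",
    "staging": "DEBUG",
    "production": "NONE",  # source APIs: logging fully disabled in prod
}

def log_config_for(stage, role_arn):
    level = FIELD_LOG_LEVELS.get(stage, "ERROR")
    return {
        "fieldLogLevel": level,
        "cloudWatchLogsRoleArn": role_arn,
        "excludeVerboseContent": stage == "production",
    }

# Applying it to an existing source API with boto3 might look like:
#   appsync = boto3.client("appsync")
#   appsync.update_graphql_api(
#       apiId=api_id, name=api_name,
#       logConfig=log_config_for("staging", role_arn),
#   )
```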

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To get the most out of AppSync Merged API, you need to understand the difference between it and “federated” solutions. Because it pulls in resolvers directly, you get logging and caching at the Merged API level. This allows you to disable logging and caching on all your AppSync Source APIs without losing any functionality.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;AWS AppSync GQL Developer Guide: &lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/merged-api.html" rel="noopener noreferrer"&gt;Merging APIs in AWS AppSync&lt;/a&gt;&lt;br&gt;
AWS AppSync GQL Developer Guide: &lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/monitoring.html" rel="noopener noreferrer"&gt;Using CloudWatch to monitor and log GraphQL API data&lt;/a&gt;&lt;br&gt;
Benoit Boure: &lt;a href="https://blog.graphbolt.dev/debugging-aws-appsync-apis-with-cloudwatch" rel="noopener noreferrer"&gt;Debugging AWS AppSync APIs With CloudWatch&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/appsync-merged-api-as-code" rel="noopener noreferrer"&gt;AppSync Merged API as Code&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/appsync-cloudformation-scaling-revisited" rel="noopener noreferrer"&gt;AppSync CloudFormation Scaling Revisited&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>appsync</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Queues, Buses, and Streams</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Wed, 10 Sep 2025 13:11:32 +0000</pubDate>
      <link>https://forem.com/aws-builders/queues-buses-and-streams-49g2</link>
      <guid>https://forem.com/aws-builders/queues-buses-and-streams-49g2</guid>
      <description>&lt;p&gt;AWS recently released a new feature to its venerable SQS service named "Fair Queues." In conversations with engineers about its behaviors, I found some general confusion regarding the various messaging systems in AWS, their functions, and how they differ from one another. In this article, I aim to provide an overview of the different types of AWS messaging services, including some examples you can use to teach others.&lt;/p&gt;

&lt;p&gt;For people new to AWS architectures, the myriad of options (SQS, SNS, EventBridge, Kinesis, MSK) for data movement can be overwhelming. I've found it helpful to categorize these in broad terms and then tie them to specific examples of real-world use cases.&lt;/p&gt;

&lt;p&gt;The thing all these services have in common is that they move data. They differ in rate, capacity, and destination. They also differ along another axis: how "managed" the service is. I'll indicate this as we go along.&lt;/p&gt;

&lt;p&gt;I use the broad classifications "Queues," "Buses," and "Streams." Let's examine each and see where these services fall.&lt;/p&gt;

&lt;h2&gt;
  
  
  Queues
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlfauq7cmbgjmvykmv31.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlfauq7cmbgjmvykmv31.jpeg" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;br&gt;
The term "queue" in computer science refers to a collection of entities that "can be modified by the addition of entities at one end of the sequence and removal of entities at the other end of the sequence." The name comes from the real-world phenomenon of people lining up for something. Imagine a grocery store and people lining up (or queuing) at a checkout counter. New carts line up at the back, and the cashier processes the carts at the front.&lt;/p&gt;

&lt;p&gt;A queue can also be considered a buffer. It can absorb excess capacity (shoppers) that would otherwise overwhelm the limited resource (cashiers). The more cashiers (or the faster each works), the faster we can process the people in the queue. The queue helps to smooth out irregular or bursty workloads.&lt;/p&gt;

&lt;p&gt;One of the first managed services offered by AWS was its Simple Queue Service (SQS), back in 2006. SQS is fully managed and scales transparently as usage changes. Consumers use a pull-based model to access messages and only consume when they have the capacity to do so. Put another way, consumers have to ask, "Do you have any messages for me?" to receive anything.&lt;/p&gt;

&lt;p&gt;In 2025, AWS added "fairness" to SQS. Out of the box, the queue processes messages (roughly) in the order ingested. With SQS fair queues, you specify how SQS groups the messages to more evenly distribute processing resources. This can help mitigate noisy-neighbor impacts in multi-tenant queues.&lt;/p&gt;

&lt;p&gt;I group a queue with its consumer; it is part of the receiver's architecture, not the sender's.&lt;/p&gt;

&lt;p&gt;Consider a queue if you need to slow things down. If you find yourself saying, "I cannot process this event just yet, but I don't want to lose it," reach for a queue.&lt;/p&gt;
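
&lt;p&gt;The grocery-store picture fits in a few lines of Python with a plain in-memory queue (a toy stand-in for SQS, not its API):&lt;/p&gt;

```python
from collections import deque

# A queue as a buffer: a bursty producer enqueues faster than the
# consumer (the cashier) can process, and nothing is lost.
checkout_line = deque()

# Burst: ten carts arrive at once
for cart_id in range(10):
    checkout_line.append(cart_id)

# The cashier drains the line at its own pace, front first (FIFO)
processed = []
while checkout_line:
    processed.append(checkout_line.popleft())

# processed preserves arrival order: [0, 1, 2, ..., 9]
```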

&lt;h2&gt;
  
  
  Buses
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfnr4hi8lei5uhe0k9yz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfnr4hi8lei5uhe0k9yz.jpeg" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
There are many types of message buses, and they can include features typically found in queues or streams. Still, I want to focus on what I consider the fundamental characteristics of a message bus: fan-out and receiver agnosticism.&lt;/p&gt;

&lt;p&gt;Fan-out refers to the situation where a single input message generates multiple output messages. For example, when I place an &lt;code&gt;OrderReceived&lt;/code&gt; message on the bus, many independent listeners will each receive a copy of that message. This is often referred to as "pub-sub," which is short for "Publish/Subscribe." In pub-sub, the publisher should be unaware of the subscriber. It is receiver agnostic.&lt;/p&gt;

&lt;p&gt;Receiver agnosticism is a fancy way of saying, "I do not know who may consume this message." In other words, the producer isn't saying, "Send this message to Frank." The producer says, "Send this message to anyone who happens to be listening." I've heard this compared to a radio, where anyone who is tuned in to the broadcast will listen to the show, but if you miss it, it's gone.&lt;/p&gt;

&lt;p&gt;AWS unveiled its Simple Notification Service (SNS) in 2010. At the time, it was the only fully managed pub-sub service in the cloud world. As such, it was a pioneer in the space and set the bar (along with SQS) for what a fully managed service should do.&lt;/p&gt;

&lt;p&gt;SNS uses a push-based model. Subscribers receive messages in real time as they are published. They don't need to ask, "Do you have any messages for me?" Since messages are not persisted (remember the radio analogy?), you often see architectures that wire up an SQS queue as a subscriber to an SNS topic. Then, the consuming service pulls its messages from the SQS queue for processing.&lt;/p&gt;
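
&lt;p&gt;Fan-out and receiver agnosticism can be sketched in miniature (a toy in-memory topic for illustration, not the SNS API):&lt;/p&gt;

```python
# The publisher only knows the topic, never the subscribers, and every
# subscriber receives its own copy of each message (fan-out).
class Topic:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for deliver in self.subscribers:
            deliver(message)

orders = Topic()
billing, shipping = [], []
orders.subscribe(billing.append)
orders.subscribe(shipping.append)

orders.publish("OrderReceived:1234")
# Both billing and shipping got a copy. A subscriber added after this
# publish would never see the message: like the radio, miss it and it's gone.
```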

&lt;p&gt;In 2016, AWS released "CloudWatch Events" to enable users to receive notifications about the systems they run on AWS. It included EC2 shutdown and autoscale events, new service creation or deletion events, and many others. AWS chief evangelist Jeff Barr said, "You can think of CloudWatch Events as the central nervous system for your AWS environment. It is wired to every nook and cranny of the supported services, and becomes aware of operational changes as they happen."&lt;/p&gt;

&lt;p&gt;By 2019, AWS realized that the system underpinning CloudWatch Events could become a first-class product of its own, and EventBridge was born. Since then, it has received a steady stream of enhancements, making it my preferred service for event-driven architectures.&lt;/p&gt;

&lt;p&gt;I group a bus with its producer; it is part of the sender's architecture, not the receiver's.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streams
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97esczvangpcgh4weje1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97esczvangpcgh4weje1.jpeg" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
At first glance, you may have a difficult time discerning a stream from a queue, and this is understandable. They share many of the same traits, such as ordering and persistence, but they differ primarily in how they are used.&lt;/p&gt;

&lt;p&gt;A stream is concerned with data flows, a zoomed-out view of how data moves through a system. The &lt;em&gt;relationships&lt;/em&gt; between the data elements are essential to stream analysis. In contrast, a queue is interested in each discrete data message as its own unit, apart from the others.&lt;/p&gt;

&lt;p&gt;Think about a highway system in a city. Its operators are interested in things like: "How many cars come by each hour?", "What are the times with the highest rates of traffic?", and "What is the ratio of cars to big-rig trucks per hour?" In this case, we are less concerned with each vehicle (message) and more concerned with traffic (the stream). This stream, if recorded, could be "replayed" to answer new questions at a later time.&lt;/p&gt;
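
&lt;p&gt;The highway example can be sketched as a small aggregation over a recorded stream (the record shape is invented for illustration):&lt;/p&gt;

```python
from collections import Counter

# Stream thinking: aggregate over the flow rather than handling each
# vehicle alone. A recorded stream can be replayed later to answer
# questions nobody thought to ask at capture time.
records = [
    {"hour": 8, "vehicle": "car"},
    {"hour": 8, "vehicle": "truck"},
    {"hour": 8, "vehicle": "car"},
    {"hour": 9, "vehicle": "car"},
]

vehicles_per_hour = Counter(r["hour"] for r in records)
trucks = sum(1 for r in records if r["vehicle"] == "truck")
# vehicles_per_hour counts 3 in hour 8 and 1 in hour 9; trucks == 1
```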

&lt;p&gt;AWS launched its managed real-time streaming service, Kinesis, in 2013. It could capture and process gigabytes of data per second from multiple sources. Since then, AWS has offered different variations of Kinesis to focus on different streaming concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kinesis Data Streams (the original)&lt;/li&gt;
&lt;li&gt;Kinesis Data Firehose (for moving large volumes of data to a destination)&lt;/li&gt;
&lt;li&gt;Kinesis Data Analytics (for Flink-like analysis of streams)&lt;/li&gt;
&lt;li&gt;Kinesis Video Streams (for capturing/processing video)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another AWS stream I often encounter is the DynamoDB Stream. Introduced in 2014, it provides a time-ordered sequence of item-level modifications to a DynamoDB table as a stream. I regularly use this as a source to emit events to other services (DynamoDB Stream -&amp;gt; EventBridge). This stream is less configurable (more managed) than Kinesis.&lt;/p&gt;

&lt;p&gt;AWS also supports a service named MSK, which stands for Managed Streaming for Apache Kafka. It simplifies the operation of an Apache Kafka cluster, although you are still responsible for instance sizes and EBS storage configuration. There is also a more managed, "serverless" version of MSK that removes some of the configuration but does not scale to zero (thus, in my opinion, it does not qualify for the title "serverless"). Consider MSK if you are already using Apache Kafka and are seeking a quick operational win.&lt;/p&gt;

&lt;p&gt;Architecturally, grouping streams is not as clear-cut as with queues and buses. I have seen examples where the stream is part of the consumer, examples where the stream is part of the producer (Hi, DynamoDB streams!), and applications that were built around the stream itself, like event processors for stock analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka confusion
&lt;/h2&gt;

&lt;p&gt;Apache Kafka can act like a queue, a bus, or a stream depending upon its use. In fact, the original conversations that prompted this article were with engineers who had experience with Kafka but limited exposure to the more targeted AWS components. Under the hood, Kafka is combining buses, queues, and streams.&lt;/p&gt;

&lt;p&gt;You can build similar behaviors by combining AWS components to compose more complicated message handlers. For example, back when I was a kid, if I didn't want to miss a radio broadcast of King Biscuit Flour Hour (look it up), I would take a tape recorder, press the record button, and place it next to the radio's speaker. This is analogous to wiring a queue to a bus. We've come a long way, baby.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Queues, buses, and streams share some common functionality; they all move messages. The differences emerge when we look at how each treats those messages, and how we group them architecturally.&lt;/p&gt;

&lt;p&gt;If you are asked to explain the differences between buses, queues, and streams (or any other set of related concepts), have examples on hand to help you illustrate the similarities and differences. By moving away from the hand-wavy talk of software architecture and toward real-world examples, such as grocery stores, radios, and highways, you can help your engineers ground their thinking.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;AWS Documentation: &lt;a href="https://docs.aws.amazon.com/decision-guides/latest/sns-or-sqs-or-eventbridge/sns-or-sqs-or-eventbridge.html" rel="noopener noreferrer"&gt;SQS, SNS, or EventBridge?&lt;/a&gt;&lt;br&gt;
AWS: &lt;a href="https://press.aboutamazon.com/2010/4/amazon-web-services-introduces-amazon-simple-notification-service-amazon-sns" rel="noopener noreferrer"&gt;Original SNS Press Release&lt;/a&gt;&lt;br&gt;
AWS: &lt;a href="https://aws.amazon.com/blogs/aws/new-cloudwatch-events-track-and-respond-to-changes-to-your-aws-resources/" rel="noopener noreferrer"&gt;Original CloudWatch Events (EventBridge) Press Release&lt;/a&gt;&lt;br&gt;
AWS: &lt;a href="https://aws.amazon.com/about-aws/whats-new/2019/07/introducing-amazon-eventbridge/" rel="noopener noreferrer"&gt;Original EventBridge Press Release&lt;/a&gt;&lt;br&gt;
Yan Cui: &lt;a href="https://theburningmonk.com/2025/05/understanding-push-vs-poll-in-event-driven-architectures/" rel="noopener noreferrer"&gt;Understanding push vs. poll in event-driven architectures&lt;/a&gt;&lt;br&gt;
Svix: &lt;a href="https://www.svix.com/resources/faq/event-streaming-vs-message-queue/" rel="noopener noreferrer"&gt;Event Streaming vs Message Queue&lt;/a&gt;&lt;br&gt;
Wikipedia: &lt;a href="https://en.wikipedia.org/wiki/Queue_(abstract_data_type)" rel="noopener noreferrer"&gt;Queue (abstract data type)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Serverless Is Not A Primary</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Mon, 28 Jul 2025 13:48:14 +0000</pubDate>
      <link>https://forem.com/aws-builders/serverless-is-not-a-primary-4ap9</link>
      <guid>https://forem.com/aws-builders/serverless-is-not-a-primary-4ap9</guid>
      <description>&lt;p&gt;I love serverless. With it, I help teams deliver value quickly, without the need to spend time maintaining servers and back-end infrastructure. But serverless is not enough.&lt;/p&gt;

&lt;p&gt;You cannot lead an engineering team to success with serverless if they don't test the code they write, deploy the apps they build, or integrate changes quickly. Each of these areas must be addressed first. These are the fundamentals upon which serverless can shine.&lt;/p&gt;

&lt;p&gt;As DORA has consistently demonstrated, year after year, continuous integration and continuous delivery substantially increase code maintainability, job satisfaction, and software quality. The goal is team Agility, Autonomy, and Ownership. Serverless can play a role, but it is not sufficient for success.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the fundamentals
&lt;/h2&gt;

&lt;p&gt;A hierarchy is at play here. Processes build on previous processes. You cannot deploy continuously until you can deliver continuously. You cannot deliver continuously until you integrate continuously. You cannot integrate continuously until you solve self-testing, fast (now even faster!) code reviews, and robust build pipelines, among other things. Where does serverless fit in?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;"You build it, you run it."&lt;br&gt;
  -Werner Vogels&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Serverless is a type of software architecture. The term says, "Sure, there are servers somewhere, but you don't have to worry about them; we'll handle that part. You go build." Serverless allows you and your team to run what you build easily.&lt;/p&gt;

&lt;p&gt;However, running software is the final stage of the journey. What happens during the "you build it" phase is more critical than the "you run it" phase. If you haven't nailed down your fundamentals, serverless solves very little.&lt;/p&gt;

&lt;p&gt;What are these fundamentals? Continuous integration, continuous delivery, and continuous deployment. Each of these fundamentals is itself a bundle of primaries. For example, to achieve continuous integration (merging changes into master at least once per day), you must address the following challenges: version control, self-testing, fast code reviews, and build pipelines. Each of these takes thought and hard work to achieve. If you are interested in how I approach the thorny issue of code reviews, read &lt;a href="https://sethorell.substack.com/p/code-review-musings" rel="noopener noreferrer"&gt;my first article&lt;/a&gt; on the subject.&lt;/p&gt;

&lt;p&gt;I have worked with organizations that had the fundamentals but were not utilizing serverless, as well as organizations that lacked the fundamentals but were fully committed to serverless. The developer experience between the two is night and day: the team with the fundamentals delivered better software faster with fewer bugs, while the team without them missed release dates, had ongoing quality issues, and required separate teams to test, deploy, and run their app.&lt;/p&gt;

&lt;p&gt;The DORA metrics are a good measure of how well your team has mastered these fundamentals. As I explained in my 2023 article, "The Other Side of DORA Metrics - The Virtuous Cycle", good DORA numbers reflect excellence at the fundamentals. You can't achieve high DORA scores unless you've tackled the basics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless for ownership
&lt;/h2&gt;

&lt;p&gt;Werner Vogels, Amazon's CTO, famously declared that, at Amazon, "You build it, you run it." He contrasted this approach with the typical "hand-it-off-to-operations" approach most other companies at the time were using. He found that getting engineering involved in the entire software lifecycle enhanced quality and connected the developers with what their customers experience. That's ownership.&lt;/p&gt;

&lt;p&gt;When I refer to "ownership," I mean the team's ability to take full responsibility for the applications they code, encompassing testing, deployment, and monitoring. No additional teams (Operations, QA, SRE, DevOps) are required to run and improve the app. For this, serverless shines.&lt;/p&gt;

&lt;p&gt;Serverless architectures make running systems easier. No load-balancers, VPCs, or Availability Zones need to be configured. Just build your application and go.&lt;/p&gt;

&lt;p&gt;This doesn't mean that serverless is easy. It's not. It takes practice and guidance. With serverless, your engineering team can run everything. Without it, you will have a hard time and will likely need a supporting team (likely some form of Operations) to co-manage the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fundamentals for culture
&lt;/h2&gt;

&lt;p&gt;Serverless is about architecture, but the fundamentals — continuous integration, self-testing, continuous code reviews, and continuous delivery — are about culture. To implement the fundamentals, you will affect how your engineering teams design, verify, build, and release applications. This could be a significant change!&lt;/p&gt;

&lt;p&gt;Culture is opt-in. That is, each individual must decide for herself whether she supports the culture or not. Leadership can define the culture, but it will be adopted only if each engineer and manager in the organization truly believes in it. This is where you can lean on others for persuasion.&lt;/p&gt;

&lt;p&gt;If you have access to someone with experience, use him. For example, I have helped transform many teams and organizations from a traditional, vertical-silo, throw-it-over-the-wall deployment to a self-tested, continuous deployment. Someone like me can answer questions, give advice, and map a path forward.&lt;/p&gt;

&lt;p&gt;In addition, understand what DORA has illustrated. Teams that adopt these fundamentals are happier, more productive, have higher quality code, and ship faster. DORA also demonstrates a strong correlation between the impact of these software delivery practices and business goals, including revenue growth, market competitiveness, and customer satisfaction. Take that to your C-suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  Engineering biomes: forest &amp;amp; desert
&lt;/h2&gt;

&lt;p&gt;However, depending on your company's current culture, your message may not be well received. Beth Andres-Beck quoted some user pushback in her clarifying article, "Forest &amp;amp; Desert": "This architecture you've described sounds like a lush forest, but we are living in the desert. I don't see how this will work here." How do these environments impact culture?&lt;/p&gt;

&lt;p&gt;As Martin Fowler summarizes it, "The Forest and the Desert is a metaphor for thinking about software development processes, developed by Beth Andres-Beck and her father Kent Beck. It posits that two communities of software developers have great difficulty communicating with each other because they live in very different contexts, so advice that applies to one sounds like nonsense to the other."&lt;/p&gt;

&lt;p&gt;I have been mainly speaking to the forest dwellers so far. Now, I want to direct a message to anyone stuck in the desert: lead your team to the forest or get out. The desert will suck out your soul. If nobody in your desert-dwelling company is interested in what you have to say on this, quit.&lt;/p&gt;

&lt;p&gt;There are too many companies that are either squarely in the forest or want to move there. They would love to have an engineer like you who is interested in that kind of culture. Don't believe the people who say, "This is just the way it is," or "This is just the way enterprise builds software." They are wrong. Strive for something better.&lt;/p&gt;

&lt;p&gt;If you are new to CI/CD, check out Dave Farley's &lt;a href="https://www.youtube.com/@ModernSoftwareEngineeringYT" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;, "Modern Software Engineering," for bite-sized lessons. Read through the treasure trove that is Martin Fowler's blog, including his &lt;a href="https://martinfowler.com/articles/continuousIntegration.html" rel="noopener noreferrer"&gt;seminal article&lt;/a&gt; on continuous integration (updated in 2024). Hear Jez Humble address claims of "Continuous Delivery: Sounds Great But It Won't Work Here" in &lt;a href="https://youtu.be/IvWr29afDF8?si=pZ6ldsWwcwz-BitF" rel="noopener noreferrer"&gt;devastating fashion&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, place serverless on top of all this. You will be unstoppable.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Martin Fowler: &lt;a href="https://martinfowler.com/articles/continuousIntegration.html" rel="noopener noreferrer"&gt;Continuous Integration&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/code-review-musings" rel="noopener noreferrer"&gt;Code Review Musings&lt;/a&gt;&lt;br&gt;
DORA: &lt;a href="https://dora.dev/" rel="noopener noreferrer"&gt;Get better at getting better&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/the-other-side-of-dora-metrics-the" rel="noopener noreferrer"&gt;The Other Side of DORA Metrics - The Virtuous Cycle&lt;/a&gt;&lt;br&gt;
Martin Fowler: &lt;a href="https://martinfowler.com/bliki/ForestAndDesert.html" rel="noopener noreferrer"&gt;Forest And Desert&lt;/a&gt;&lt;br&gt;
Kent Beck and Beth Andres-Beck: &lt;a href="https://tidyfirst.substack.com/p/forest-and-desert" rel="noopener noreferrer"&gt;Forest &amp;amp; Desert&lt;/a&gt;&lt;br&gt;
YouTube: &lt;a href="https://www.youtube.com/@ModernSoftwareEngineeringYT" rel="noopener noreferrer"&gt;Modern Software Engineering&lt;/a&gt;&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>architecture</category>
      <category>ci</category>
    </item>
    <item>
      <title>How to Crater Your Database, Part Five - Summary</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Mon, 30 Jun 2025 21:13:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-to-crater-your-database-part-five-summary-37n7</link>
      <guid>https://forem.com/aws-builders/how-to-crater-your-database-part-five-summary-37n7</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-an-introduction-3pep"&gt;Part One&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-two-aggregations-322d"&gt;Part Two&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-three-normalization-2mpo"&gt;Part Three&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-four-consistency-5dib"&gt;Part Four&lt;/a&gt;&lt;br&gt;
Part Five &amp;lt;-&lt;/p&gt;

&lt;p&gt;We are at the end of this whirlwind tutorial on turning your database into a smoking hole in the ground. Along the way, we discussed:&lt;/p&gt;

&lt;p&gt;If you need to scale, predictability is paramount (Part one).&lt;br&gt;
&lt;code&gt;COUNT&lt;/code&gt; and &lt;code&gt;JOIN&lt;/code&gt; don't scale. With poor scalability, your biggest customers get the worst performance. This is a bad business model (Parts two and three).&lt;br&gt;
You don't need airtight consistency. Watch out for excessive SQL transactions, and don't integrate at the data layer (Part four).&lt;/p&gt;

&lt;h2&gt;
  
  
  DynamoDB is different
&lt;/h2&gt;

&lt;p&gt;Although I've mentioned it throughout this series, it bears repeating: DynamoDB is different. It deliberately omits many of the functions a SQL database performs, and in exchange, it scales predictably.&lt;/p&gt;

&lt;p&gt;I've given examples of these differences throughout this series of articles. If you've worked with SQL before, all the examples should be familiar; they are commonly used. Yet when I first point out the problems with things like aggregations, joins, or heavy transaction use, developers give me looks of surprise. Nobody has ever advised them against doing those things.&lt;/p&gt;

&lt;p&gt;I have a colleague who, during discussions about this topic, asked me, "If you're going to avoid all the things that don't scale well in SQL, why don't you just use DynamoDB in the first place?" This is the point of my summary today.&lt;/p&gt;

&lt;p&gt;You can use SQL more effectively than you do today and achieve fast, predictable scaling, but you will have to remain eternally vigilant for non-scalable actions creeping back in. A junior dev will reach for a &lt;code&gt;COUNT&lt;/code&gt; without a second thought, and you'll have to scrutinize every commit to ensure it doesn't happen.&lt;/p&gt;

&lt;p&gt;Or, you can learn the basics of DynamoDB and let its built-in guardrails keep your applications scalable as you build. As Alex DeBrie wrote, "DynamoDB won't let you write a bad query," where "bad query" is defined as a "query that won't scale."&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Think of this as a "pit of success" where it is easy to do the right things and annoying to do the wrong things.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;
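To make the "pit of success" idea concrete, here is a small illustrative sketch, plain Python with no AWS involved (`GuardedTable` and its methods are hypothetical names, not a real API), of a store that, like DynamoDB, happily answers key-based reads but refuses the kind of full-table query that won't scale:

```python
class BadQueryError(Exception):
    """Raised when a query cannot scale (no key specified)."""

class GuardedTable:
    """A toy key-value table that, like DynamoDB, refuses unscalable reads."""

    def __init__(self):
        self._items = {}  # partition key -> item

    def put(self, key, item):
        self._items[key] = item

    def get(self, key):
        # O(1) key-based read: allowed, scales predictably.
        return self._items.get(key)

    def scan_for(self, predicate):
        # O(N) full scan: the "bad query" the guardrails reject.
        raise BadQueryError("full scans don't scale; look up by key instead")

table = GuardedTable()
table.put("user#42", {"name": "Ada"})

print(table.get("user#42"))  # fast, key-based read succeeds
try:
    table.scan_for(lambda item: item["name"] == "Ada")
except BadQueryError as err:
    print("rejected:", err)
```

The guardrail is annoying exactly once, at design time, instead of painfully later, at scale.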

&lt;p&gt;I didn't even get into all the other benefits of DynamoDB: it is fully managed (no servers to configure), it is pay-per-use when using its On-Demand mode, it has a built-in change stream for publishing events, and it doesn't rely on networking tricks to keep it secure (zero trust). See my article, "Why DynamoDB Is (Still) My First Pick for Serverless," in References, below, for more details. If you haven't started building with DynamoDB, you are missing out. Please give it a hard look.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-defined access patterns
&lt;/h2&gt;

&lt;p&gt;Before wrapping up, I want to address a common point of contention when comparing DynamoDB to SQL: pre-defined access patterns. In DynamoDB's documentation&lt;sup id="fnref3"&gt;3&lt;/sup&gt;, AWS encourages you to identify your data access patterns before building anything. This idea often receives pushback from SQLites, who argue that it is inflexible and impractical. Their anger is misguided. This constraint has nothing to do with DynamoDB specifically; it comes with scalability itself.&lt;/p&gt;

&lt;p&gt;Let's say you are using SQL in a scalable manner. You use no aggregations, foreign keys, or joins. I guarantee you that your indexes will follow your access patterns. This applies to any datastore, whether SQL, MongoDB, or Elasticsearch. It would be like DynamoDB, but without the guardrails.&lt;/p&gt;

&lt;p&gt;And if a new access pattern were introduced, you'd have to do the work to incorporate it. You would figure out specific indexes and generate composite data keys for this new access pattern. You are optimizing for how your app works, not for flexible access patterns. This was hard for me to unlearn, but it's an essential distinction between OLTP and OLAP systems. Your app requires an OLTP database to scale, and you need to design it accordingly.&lt;/p&gt;

&lt;p&gt;As Alex DeBrie wrote in The DynamoDB Book, "[in DynamoDB,] You design your data to handle the specific access patterns you have, rather than designing for flexibility in the future."&lt;sup id="fnref4"&gt;4&lt;/sup&gt; I want to expand this to say, "In any scalable database system, you must design your data to handle the specific access patterns you have." If you have a problem with pre-defining your access patterns, you also have a problem with scalability.&lt;/p&gt;
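Designing for a specific access pattern can be sketched in a few lines. Below is an illustrative Python toy (the key format, customer IDs, and `orders_in_range` helper are all invented for this example) showing how a composite key built for the known pattern "orders for a customer within a date range" turns the query into cheap seeks on a sorted index, whether that index lives in DynamoDB, SQL, or anywhere else:

```python
from bisect import bisect_left, bisect_right

# Pre-defined access pattern: "orders for a customer within a date range".
# We build a composite key (customer id + ISO date) up front, exactly the way
# we would design a DynamoDB sort key or a SQL composite index for this pattern.
orders = [
    ("CUST#1#2025-01-03", "order-a"),
    ("CUST#1#2025-02-11", "order-b"),
    ("CUST#1#2025-05-20", "order-c"),
    ("CUST#2#2025-01-15", "order-d"),
]
orders.sort()                      # the "index": kept sorted by composite key
keys = [k for k, _ in orders]

def orders_in_range(customer_id, start_date, end_date):
    # Two O(log N) seeks into the sorted index; no scan, no join.
    lo = bisect_left(keys, f"CUST#{customer_id}#{start_date}")
    hi = bisect_right(keys, f"CUST#{customer_id}#{end_date}")
    return [v for _, v in orders[lo:hi]]

print(orders_in_range(1, "2025-01-01", "2025-03-01"))  # → ['order-a', 'order-b']
```

A new access pattern means a new key design, in any scalable store; the work just happens up front instead of in production.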

&lt;h2&gt;
  
  
  Summary (of the summary)
&lt;/h2&gt;

&lt;p&gt;Build for scalability. Consider using DynamoDB for its built-in guardrails and all the additional benefits it offers. Scale your business to new heights and profit from all your happy customers. And, after you do, I'd like to meet for coffee and hear all about it.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Alex DeBrie: &lt;a href="https://www.dynamodbbook.com" rel="noopener noreferrer"&gt;The DynamoDB Book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ownership Matters: &lt;a href="https://sethorell.substack.com/p/why-dynamodb-is-still-my-first-pick" rel="noopener noreferrer"&gt;Why DynamoDB Is (Still) My First Pick for Serverless&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alex DeBrie: &lt;a href="https://www.alexdebrie.com/posts/dynamodb-no-bad-queries/" rel="noopener noreferrer"&gt;SQL, NoSQL, and Scale: How DynamoDB scales where relational databases don't&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jeff Atwood: &lt;a href="https://blog.codinghorror.com/falling-into-the-pit-of-success/" rel="noopener noreferrer"&gt;Falling Into The Pit of Success&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Alex DeBrie: SQL, NoSQL, and Scale: &lt;a href="https://www.alexdebrie.com/posts/dynamodb-no-bad-queries/" rel="noopener noreferrer"&gt;How DynamoDB scales where relational databases don't&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Jeff Atwood: &lt;a href="https://blog.codinghorror.com/falling-into-the-pit-of-success/" rel="noopener noreferrer"&gt;Falling Into The Pit of Success&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;AWS Documentation: &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/dynamodb-data-modeling/step3.html" rel="noopener noreferrer"&gt;DynamoDB data modeling&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;Alex DeBrie, The DynamoDB Book, p. 137 ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>dynamodb</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to Crater Your Database, Part Four - Consistency</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Wed, 28 May 2025 13:52:18 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-to-crater-your-database-part-four-consistency-5dib</link>
      <guid>https://forem.com/aws-builders/how-to-crater-your-database-part-four-consistency-5dib</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-an-introduction-3pep"&gt;Part One&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-two-aggregations-322d"&gt;Part Two&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-three-normalization-2mpo"&gt;Part Three&lt;/a&gt;&lt;br&gt;
Part Four &amp;lt;-&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-five-summary-37n7"&gt;Part Five&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;At first, life was simple. You had a web server backed by a SQL database, and your prototype was successful. Every action was 100% immediately consistent because you wrapped it in a SQL transaction. You went live. Your early customers loved your product. You grew. You needed to scale. You may have even hired someone like me to help. One of the first things out of my mouth would be, "You do everything in a transaction. You integrate at the data layer. You can't scale like this." "But," you may reply, "We need consistency!" You do, but not like this and not to this extent. Consistency matters, but you don't need complete consistency; context is king.&lt;/p&gt;

&lt;p&gt;Where is consistency essential? Where is it not? What are the costs of consistency? I want to address these questions to provide you with an indication of how I approach the issue. But first, let's define our terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is consistency?
&lt;/h2&gt;

&lt;p&gt;Consistency is the 'C' in ACID, which outlines properties of certain database transactions. These transactions aim to ensure data validity across four dimensions: atomicity, consistency, isolation, and durability.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Here, I use the term "transaction" more broadly than the SQL begin/commit/rollback transaction commands with which some of you may be familiar.&lt;/p&gt;

&lt;p&gt;As part of ACID, consistency means that your change does not violate any database constraints or invariants, and that future interactions will reflect your change. This helps to prevent data corruption.&lt;/p&gt;

&lt;p&gt;Consistency is also the 'C' in CAP, which describes the trade-off among three guarantees: consistency, availability, and partition tolerance. As used in CAP, consistency means that "Every read receives the most recent write or an error."&lt;sup id="fnref2"&gt;2&lt;/sup&gt; Note that this differs from the ACID definition.&lt;/p&gt;

&lt;p&gt;Consistency is not inherently bad. Further, it is easy in the beginning. Most databases offer efficient consistency at a row or record level. You can insert a record and read it immediately. This is sometimes referred to as "read-after-write consistency."&lt;/p&gt;

&lt;p&gt;Problems arise as developers strive for business-level consistency. For example, suppose I have an e-commerce site and want my Order service to complete an order only if there is inventory available, with no possibility of an out-of-stock sale. In that case, I will need to tightly bind (couple) my Inventory service to my Order service. Many engineers would implement this with a single database transaction across the order and inventory tables. Now, the team that runs the Orders service and the team that runs the Inventory service must share the same database server. Furthermore, because they are integrating at the data layer (please do not integrate at the data layer), they cannot scale independently of each other.&lt;/p&gt;

&lt;p&gt;If you cling tightly to consistency, you will lose the ability to scale. The corollary is also true: those who have successfully scaled have carefully constrained how they use consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  You don't need complete consistency
&lt;/h2&gt;

&lt;p&gt;Let's revisit the example I used above about orders and inventory. While we would prefer only to allow a sale when we have an in-stock item in the warehouse, this isn't easy to enforce. I argue that it isn't necessary.&lt;/p&gt;

&lt;p&gt;Imagine having a loosely coupled, scalable system where it's possible to occasionally oversell an item. What of it? If you find that you cannot fulfill an order, issue an apology to the customer and inform her that her order will be fulfilled as soon as the item is back in stock. You could also include a discount coupon for a future purchase. This "apology workflow" is significantly less expensive than engineering a fully consistent order and inventory system and attempting to scale it. Use it.&lt;/p&gt;
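A minimal sketch of that apology workflow, in illustrative Python (the `fulfill` function, SKU names, and coupon code are invented for the example, not a real order system): accept the order either way, and compensate on the rare oversell instead of coupling Orders to Inventory in one big transaction.

```python
# Loosely coupled fulfillment: the order is never rejected outright; an
# oversell triggers an apology plus a coupon instead of a distributed lock.
inventory = {"widget": 1}

def fulfill(order_id, sku):
    on_hand = inventory.get(sku, 0)
    if on_hand > 0:
        inventory[sku] = on_hand - 1
        return {"order": order_id, "status": "shipped"}
    # Oversold: remedy the rare failure rather than prevent it everywhere.
    return {
        "order": order_id,
        "status": "backordered",
        "message": "Sorry! We'll ship as soon as it's back in stock.",
        "coupon": "SORRY10",
    }

print(fulfill("A-1", "widget"))  # in stock: shipped
print(fulfill("A-2", "widget"))  # oversold: apology + coupon
```

The apology path costs a coupon now and then; the airtight-transaction path costs you scalability all the time.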

&lt;p&gt;Airlines and hotels often do this. They don't need airtight consistency when selling seats or rooms. They only need a way to remedy a problem when it does arise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real world example: Starbucks
&lt;/h2&gt;

&lt;p&gt;As Gregor Hohpe points out in his 2020 book, The Software Architect Elevator, there are ample sources of design guidance in the real world that we should consider. One of these came from his observations about how Starbucks handles orders to maximize throughput.&lt;/p&gt;

&lt;p&gt;First, the Starbucks cashier takes your order (and your money). Then, your order is completed by the barista, and you finalize the transaction by picking up your drink. The two actions are decoupled and independently scalable. If you were to wait at the cashier for your drink before you handed over your money, you would have an airtight transaction, but this would ruin throughput and scale poorly. Starbucks figured this out and jettisoned "consistency" to achieve high-volume output. I daresay the approach has been a smashing success.&lt;/p&gt;
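The Starbucks pattern is, at heart, a queue between two independently paced workers. Here is a simple illustrative sketch (the `cashier`/`barista` functions and drink names are invented for this example) using Python's standard `queue.Queue` as the row of cups:

```python
from queue import Queue

# The cashier and the barista are decoupled by a queue (the row of cups),
# so taking money never waits on drink-making.
cup_queue = Queue()

def cashier(name, drink):
    # Fast, consistent step: take the money, record the order, move on.
    cup_queue.put((name, drink))
    return f"{name} paid for a {drink}"

def barista():
    # Independent, eventually-consistent step: work the queue at its own pace.
    name, drink = cup_queue.get()
    return f"{drink} ready for {name}"

receipts = [cashier("Ann", "latte"), cashier("Bo", "mocha")]
handoffs = [barista(), barista()]
print(receipts)   # both customers paid immediately
print(handoffs)   # drinks completed later, in order
```

The cashier's throughput no longer depends on how long a drink takes; each side scales on its own.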

&lt;p&gt;Gregor first published this example in an essay,&lt;sup id="fnref3"&gt;3&lt;/sup&gt; which is freely accessible. I highly recommend reading and re-reading it until it sinks in. You don't need complete consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real world example: Car Dealership
&lt;/h2&gt;

&lt;p&gt;If you've read Martin Kleppmann's Designing Data-Intensive Applications, you may recall an example of his in the "Transactions" chapter about car sales (p. 369, Kindle edition). He asserts that once a car sells, it should no longer be listed. So far, so good. However, he proposes that this is a canonical example of why you should use a DB transaction that covers both the car's title and its advertising. He is wrong; coupling the two is a mistake.&lt;/p&gt;

&lt;p&gt;Kleppmann's goal is to prevent the car from being sold twice. This is crucial to the business; having two people pay for the same vehicle would be a disaster. However, the importance wanes when we get to the advertisement about the car. Is it so terrible that the ad lives an extra minute or two before it is removed? If someone calls about the vehicle, a salesman can look up the title and tell the caller, "I'm sorry. That vehicle has sold." (apology workflow)&lt;/p&gt;

&lt;p&gt;The cost of wrapping the advertisement and the title in the same transaction (and therefore residing in the same database server) is enormous. If you follow Kleppmann's advice, you cannot scale marketing independently of titling. In effect, you are saying, "The advertisement's consistency is as important as the title's consistency," and that is flat-out false.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eventual consistency
&lt;/h2&gt;

&lt;p&gt;Let's review the three examples I've included: e-commerce inventory, Starbucks fulfillment, and a car dealership. In each, there are essential operations that ought to be consistent. In each case, this centers on payment. We want to ensure that the customer has paid for her item exactly once and that no other customer has paid for the same item.&lt;/p&gt;

&lt;p&gt;Here, the ACID properties are helpful. We want this payment to be atomic, consistent, isolated, and durable. However, adjusting inventory, handing off your drink, or retiring an auto listing are all handled later. And, once complete, they make the overarching transaction consistent. This part of the interaction is made consistent "eventually."&lt;/p&gt;

&lt;p&gt;Eventual consistency is all around us. If I buy tacos for lunch, the charge will eventually show up on my monthly statement. If I hire a roofer to replace my roof, my house will eventually get a new roof. If I borrow money to pay for the new roof, the lender will eventually be repaid the principal plus interest. In all these cases, the entire transaction becomes consistent over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DynamoDB is different
&lt;/h2&gt;

&lt;p&gt;AWS recognized that if it wanted DynamoDB to scale, not all of its operations could be strongly consistent. They designed DynamoDB to be consistent where it counts, and less so where it doesn't, similar to the examples I presented above. They have constrained where consistency is applied.&lt;/p&gt;

&lt;p&gt;For example, all of DynamoDB's Global Secondary Indexes (GSIs) and event streams are eventually consistent. This is done to support predictable write times while keeping each write fully ACID compliant.&lt;/p&gt;

&lt;p&gt;Since 2018, DynamoDB has supported transactions. However, the intent was not to lock your inventory to your orders or your titles to your advertisements, but to support features such as unique properties, idempotency, and authorization checks. For detailed examples (with code) of these use cases, see Alex DeBrie's 2020 article, "DynamoDB Transactions: Use Cases and Examples" (link in References, below).&lt;/p&gt;
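As a hedged sketch of the kind of transaction DynamoDB's API is meant for, here is a request payload enforcing a unique username alongside the user item itself. The table name, key schema, and items are assumptions invented for this example; the request shape follows the TransactWriteItems API, which a boto3 client would send via `client.transact_write_items(TransactItems=transact_items)`.

```python
# Two conditional Puts in one transaction: the user item plus a marker item
# that reserves the username. If either condition fails, the whole
# transaction is canceled, so the username stays unique.
transact_items = [
    {
        "Put": {
            "TableName": "users",
            "Item": {"PK": {"S": "USER#ada"}, "email": {"S": "ada@example.com"}},
            "ConditionExpression": "attribute_not_exists(PK)",  # no overwrite
        }
    },
    {
        "Put": {
            "TableName": "users",
            "Item": {"PK": {"S": "USERNAME#ada"}},  # marker reserves the name
            "ConditionExpression": "attribute_not_exists(PK)",  # uniqueness
        }
    },
]

print([item["Put"]["Item"]["PK"]["S"] for item in transact_items])
```

Note what this transaction protects: a uniqueness invariant within one service's data, not a lock spanning two services' tables.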

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;You don't need consistency everywhere and over everything. Identify where consistency matters and ensure it at that point. Let techniques like the "apology workflow" and eventual consistency help smooth out your distributed and now scalable systems. Be cautious when reaching for database transactions; use them judiciously.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Alex DeBrie: &lt;a href="https://www.alexdebrie.com/posts/dynamodb-transactions/" rel="noopener noreferrer"&gt;DynamoDB Transactions: Use Cases and Examples&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/how-not-to-test-with-dynamodb" rel="noopener noreferrer"&gt;How Not to Test with DynamoDB&lt;/a&gt;&lt;br&gt;
Gregor Hohpe: &lt;a href="https://www.amazon.com/Software-Architect-Elevator-Redefining-Architects/dp/1492077542/ref=sr_1_1?sr=8-1" rel="noopener noreferrer"&gt;The Software Architect Elevator&lt;/a&gt;&lt;br&gt;
Gregor Hohpe: &lt;a href="https://www.enterpriseintegrationpatterns.com/ramblings/18_starbucks.html" rel="noopener noreferrer"&gt;Starbucks Does Not Use Two-Phase Commit&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/ACID" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/ACID&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/CAP_theorem" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/CAP_theorem&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://www.enterpriseintegrationpatterns.com/ramblings/18_starbucks.html" rel="noopener noreferrer"&gt;https://www.enterpriseintegrationpatterns.com/ramblings/18_starbucks.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>dynamodb</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to Crater your Database, Part Three - Normalization</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Tue, 15 Apr 2025 13:26:32 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-to-crater-your-database-part-three-normalization-2mpo</link>
      <guid>https://forem.com/aws-builders/how-to-crater-your-database-part-three-normalization-2mpo</guid>
      <description>&lt;p&gt;&lt;a href="https://sethorell.substack.com/p/how-to-crater-your-database-an-introduction?r=k6fll" rel="noopener noreferrer"&gt;Part One&lt;/a&gt;&lt;br&gt;
&lt;a href="https://sethorell.substack.com/p/how-to-crater-your-database-part?r=k6fll" rel="noopener noreferrer"&gt;Part Two&lt;/a&gt;&lt;br&gt;
Part Three &amp;lt;-&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-four-consistency-5dib"&gt;Part Four&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-five-summary-37n7"&gt;Part Five&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  So far
&lt;/h2&gt;

&lt;p&gt;In the first two articles in this series, I've stressed that the secret to cratering your database is to do things that don't scale and then try to scale. In the last article, I argued that database aggregation operations like &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, and &lt;code&gt;MAX&lt;/code&gt; introduce unpredictable scaling into your system. Today, I want to address Normalization and how it introduces a new way for your database to crater under load.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why normalize?
&lt;/h2&gt;

&lt;p&gt;Why do we normalize SQL database tables? Initially, we did this to maximize the most constrained resource in the system: storage. Disk space used to be the most expensive part of a computer. In 1982, PC Magazine&lt;sup id="fnref1"&gt;1&lt;/sup&gt; ran hard drive advertisements offering a 6MB drive for $2995. That's a staggering $500,000 per gigabyte (not that you could have bought a 1GB drive even if you had the money). At the time, people were willing to trade extra compute cycles for more efficient storage.&lt;/p&gt;

&lt;p&gt;Today, our disk cost per MB is less than $0.000011 (see figure below).&lt;sup id="fnref2"&gt;2&lt;/sup&gt; Storage has become one of the most cost-&lt;em&gt;efficient&lt;/em&gt; resources in our systems, while compute has become the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e5kd2paalnid3qlrboc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7e5kd2paalnid3qlrboc.png" alt="Image description" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another reason normalization is still somewhat popular is that it is theoretical and mathematical. It does not care about what the data is for; it only cares about data. This allows normalization to be managed far away from the applications that use it. A database administrator (DBA) can fully control the table schema and ensure a provable third normal form (3NF) &lt;em&gt;without knowing how the data will be used&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The 3NF was designed to reduce data duplication and help preserve referential integrity.&lt;sup id="fnref3"&gt;3&lt;/sup&gt; Other versions exist,&lt;sup id="fnref4"&gt;4&lt;/sup&gt; but 3NF is widely known, and I'll use it as a stand-in for all such storage-reducing data schemas.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why is this a problem?
&lt;/h2&gt;

&lt;p&gt;Splitting the data into separate tables is easy on disk space but creates a problem. The end user never wants to see "normalized" data; she wants to see "un-normalized" data. The DBMS must stitch the various tables into a single result for each request. This is the purpose of SQL's &lt;code&gt;JOIN&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;JOIN&lt;/code&gt; clause combines columns from two or more tables into a unified, un-normalized result. This is precisely what our end user is looking for. So why is this a bad thing?&lt;/p&gt;

&lt;p&gt;Joins are computationally expensive. They trade efficient at-rest storage (disk) for runtime manipulation (compute). Let's look at an example.&lt;sup id="fnref5"&gt;5&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Given a typical eCommerce store in 3NF, here is how we would find a particular book by title:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;PRODUCTS&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;BOOKS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;
&lt;span class="n"&gt;productId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;productId&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"The Fountainhead"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming a proper index exists on name and productId (a best-case scenario), our time complexity will be &lt;code&gt;O(log(N) + Nlog(M))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As we need to pull in more tables, our time complexity gets worse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;PRODUCTS&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;ALBUMS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;
&lt;span class="n"&gt;productId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;productId&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;TRACKS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;
&lt;span class="n"&gt;albumId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;albumId&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"Achtung Baby"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, assuming proper indexes, this has a time complexity of &lt;code&gt;O(log(N) + Nlog(M) + Nlog(M))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This gets dramatically worse if you include non-indexed properties in your &lt;code&gt;JOIN&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unpredictability
&lt;/h2&gt;

&lt;p&gt;A linear time complexity (&lt;code&gt;O(N)&lt;/code&gt;) or worse indicates that we've relinquished predictable performance.&lt;sup id="fnref6"&gt;6&lt;/sup&gt; Once you hit &lt;code&gt;O(N)&lt;/code&gt;, the performance of a query depends on the number of rows on disk. A large eCommerce store will perform more poorly than a small one. Your biggest customers get the worst performance. This is a terrible business model.&lt;/p&gt;

&lt;p&gt;Let's look at how a JOIN affects performance. Here's a simple query&lt;sup id="fnref7"&gt;7&lt;/sup&gt; that finds all Sales Orders for a Customer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalesOrderID&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Customer&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;Sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalesOrderHeader&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CustomerID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11111&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assume there is an index on &lt;code&gt;CustomerID&lt;/code&gt; in the SalesOrderHeader table and that &lt;code&gt;CustomerID&lt;/code&gt; is the primary key in the Customer table, making this a best-case scenario. Let's look at the query plan for this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9p0c4r9l8nm68izeax1x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9p0c4r9l8nm68izeax1x.jpg" alt="Image description" width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;
Note that we have a nested loop. That means for each record in the first table, we must perform a lookup in the second table. With an index on the join column, your complexity is &lt;code&gt;O(NlogN)&lt;/code&gt;. Without one, each lookup becomes a full table scan, and your complexity approaches &lt;code&gt;O(N^2)&lt;/code&gt;, guaranteeing a problem as your data grows.&lt;/p&gt;
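To see the nested loop's cost concretely, here is a small counting experiment in Python (toy data; the row counts and helper functions are invented for illustration) contrasting how many Orders rows each approach touches when joining for a single customer:

```python
# Toy tables: 1,000 customers, 5,000 orders spread across them.
customers = [{"CustomerID": i} for i in range(1000)]
orders = [{"SalesOrderID": i, "CustomerID": i % 1000} for i in range(5000)]

def rows_touched_without_index(customer_id):
    # No index on the join column: every probe scans the whole Orders table.
    touched = 0
    for order in orders:
        touched += 1
    return touched

def rows_touched_with_index(customer_id):
    # With an index on Orders.CustomerID (a real database maintains this
    # ahead of time; we build it here only for the demo), one seek returns
    # just the matching rows.
    index = {}
    for order in orders:
        index.setdefault(order["CustomerID"], []).append(order)
    return len(index[customer_id])

print(rows_touched_without_index(11))  # 5000 rows scanned for one customer
print(rows_touched_with_index(11))     # 5 matching rows
```

Now multiply the unindexed case by every customer in the outer loop and the O(N^2) blow-up is plain.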

&lt;h2&gt;
  
  
  DynamoDB is different
&lt;/h2&gt;

&lt;p&gt;DynamoDB does not support &lt;code&gt;JOIN&lt;/code&gt; clauses. This is on purpose. Joins don't scale. DynamoDB doesn't allow you to write a query that won't scale.&lt;/p&gt;

&lt;p&gt;To represent the SQL example above, you must design your DynamoDB table to pre-join Customer and Sales. This is sometimes called "denormalization."&lt;sup id="fnref8"&gt;8&lt;/sup&gt; The methods for doing so are beyond the scope of this article. Please see Alex DeBrie's excellent &lt;a href="https://www.dynamodbbook.com" rel="noopener noreferrer"&gt;The DynamoDB Book&lt;/a&gt; for examples and deeper explanations.&lt;/p&gt;
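&lt;p&gt;To make the pre-join concrete, here is a minimal in-memory sketch of a single-table item collection (the key names and values are illustrative, not from a real schema): the customer profile and its orders share a partition key, so one query by partition key returns everything the SQL &lt;code&gt;JOIN&lt;/code&gt; produced.&lt;/p&gt;

```javascript
// Pre-joined ("denormalized") item collection: the customer and its orders
// live under the same partition key (PK). A DynamoDB Query on PK returns
// them together in one request, with no read-time JOIN.
const table = [
  { PK: 'CUSTOMER#11111', SK: 'PROFILE', name: 'Jane Doe' },
  { PK: 'CUSTOMER#11111', SK: 'ORDER#2026-01-15', total: 59.99 },
  { PK: 'CUSTOMER#11111', SK: 'ORDER#2026-02-03', total: 120.0 },
];

// Stand-in for Query(PK = :pk); real code would use the AWS SDK.
const queryByPk = (pk) => table.filter((item) => item.PK === pk);

const customerWithOrders = queryByPk('CUSTOMER#11111');
```

&lt;p&gt;The cost of this query tracks the size of one item collection, not the size of the whole table, which is the root of DynamoDB's predictable scaling.&lt;/p&gt;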

&lt;p&gt;If this sounds like extra work, you are right. But it's work that buys you consistent, predictable scaling that you don't get with 3NF and JOIN clauses. This "shift left" of complexity buys you peace of mind while your system is under load in production.&lt;sup id="fnref9"&gt;9&lt;/sup&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Normalization was designed to solve a problem we no longer have. The penalty you pay for normalization (and having to "un-normalize" it for users) is in computational complexity. All JOIN clauses suffer from this penalty.&lt;/p&gt;

&lt;p&gt;This penalty becomes more severe as you grow. Remember, business growth means data growth, &lt;em&gt;and you want business growth&lt;/em&gt;. Even &lt;code&gt;O(N log N)&lt;/code&gt; will begin to perform poorly as your data grows.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Wikipedia: &lt;a href="https://en.wikipedia.org/wiki/Join_(SQL)" rel="noopener noreferrer"&gt;Join (SQL)&lt;/a&gt;&lt;br&gt;
YouTube: &lt;a href="https://youtu.be/6yqfmXiZTlM?si=nbHgKEEZP5c9tgxj" rel="noopener noreferrer"&gt;AWS re:Invent 2019: Amazon DynamoDB deep dive: Advanced design patterns (DAT403-R1)&lt;/a&gt;&lt;br&gt;
Alex DeBrie: &lt;a href="https://www.alexdebrie.com/posts/dynamodb-no-bad-queries/" rel="noopener noreferrer"&gt;SQL, NoSQL, and Scale: How DynamoDB scales where relational databases don't&lt;/a&gt;&lt;br&gt;
Yan Cui: &lt;a href="https://www.linkedin.com/posts/theburningmonk_serverless-aws-devops-activity-7311673515307335681-J_bh/" rel="noopener noreferrer"&gt;Runtime vs. Author-time Complexity&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://retrocomputing.stackexchange.com/a/17165" rel="noopener noreferrer"&gt;https://retrocomputing.stackexchange.com/a/17165&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage" rel="noopener noreferrer"&gt;https://ourworldindata.org/grapher/historical-cost-of-computer-memory-and-storage&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Third_normal_form" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Third_normal_form&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form" rel="noopener noreferrer"&gt;Boyce-Codd normal form&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Fourth_normal_form" rel="noopener noreferrer"&gt;4NF&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Fifth_normal_form" rel="noopener noreferrer"&gt;5NF&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;This is the example Rick Houlihan used in his 2019 re:Invent talk on DynamoDB. See References for the link. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;See &lt;a href="https://sethorell.substack.com/p/how-to-crater-your-database-part?r=k6fll" rel="noopener noreferrer"&gt;Part Two&lt;/a&gt; of this series ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;Thomas LeBlanc, &lt;a href="https://www.linkedin.com/pulse/execution-plans-nested-loop-thomas-leblanc/" rel="noopener noreferrer"&gt;Execution Plans: Nested Loop&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;"De-normalization" vs. "Un-normalization" - I like the term denormalization for the pattern of storing pre-joined data on disk, while I use "un-normalize" to describe the JOIN process in a DBMS ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;See Yan Cui’s post on &lt;a href="https://www.linkedin.com/posts/theburningmonk_serverless-aws-devops-activity-7311673515307335681-J_bh/" rel="noopener noreferrer"&gt;Runtime vs. Author-time Complexity&lt;/a&gt;  ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>dynamodb</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to Crater Your Database, Part Two - Aggregations</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Tue, 18 Mar 2025 13:41:39 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-to-crater-your-database-part-two-aggregations-322d</link>
      <guid>https://forem.com/aws-builders/how-to-crater-your-database-part-two-aggregations-322d</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-an-introduction-3pep"&gt;Part One&lt;/a&gt;&lt;br&gt;
Part Two &amp;lt;-&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-three-normalization-2mpo"&gt;Part Three&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-four-consistency-5dib"&gt;Part Four&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-five-summary-37n7"&gt;Part Five&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Plot spoiler
&lt;/h2&gt;

&lt;p&gt;As I mentioned in the introductory article to this series, the secret to cratering your database is to do things that don't scale and then try to scale. In this article, we'll discuss aggregations, one of the most common non-scalable actions you can perform with your database.&lt;/p&gt;

&lt;p&gt;The operations &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, and &lt;code&gt;MAX&lt;/code&gt; are all aggregation functions. Most SQL (and many NoSQL) databases support aggregations. Some databases, like Amazon DynamoDB, do not.&lt;/p&gt;

&lt;p&gt;I routinely hear engineers complain&lt;sup id="fnref1"&gt;1&lt;/sup&gt; about DynamoDB's lack of built-in aggregations. This article stresses, "It's on purpose." Or, to put it more colloquially, "It's a feature, not a bug."&lt;/p&gt;

&lt;p&gt;Further, scale and aggregations are mutually exclusive. DynamoDB scales &lt;em&gt;because&lt;/em&gt; it doesn't support operations like aggregations. If it did, it couldn't scale. The same works in reverse: your SQL (and many NoSQL) databases will scale more predictably if you &lt;em&gt;don't&lt;/em&gt; use aggregations. Don't use them.&lt;/p&gt;



&lt;p&gt;&lt;em&gt;Aggregations are bear traps for the unwary, ready to take down your database right when you hit scale. - Alex DeBrie&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why does count (and avg/sum/min/max) hurt?
&lt;/h2&gt;

&lt;p&gt;All database aggregation functions share the same input: all items in the set. None of them can operate on a partial result set. For example, if you ask the database, "How many User records do you have where the User lives in Maryland?" it first has to find all the records &lt;code&gt;WHERE user.state = 'MD'&lt;/code&gt; and then iterate over that set to determine the count.&lt;/p&gt;
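&lt;p&gt;In JavaScript terms, the work the database does looks like this (names are illustrative):&lt;/p&gt;

```javascript
// An aggregate must visit every matching item: first find the rows,
// then iterate over all of them. Cost grows linearly with the match count.
const countUsersInState = (users, state) =>
  users.filter((u) => u.state === state).length;

const users = [
  { name: 'Ann', state: 'MD' },
  { name: 'Bob', state: 'VA' },
  { name: 'Cal', state: 'MD' },
];
const marylandUsers = countUsersInState(users, 'MD');
```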

&lt;p&gt;If you have an index on &lt;code&gt;user.state&lt;/code&gt;, this will be faster than if you don't. In either case, the number of matching users drives the response time: the larger your dataset, the slower the response.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;COUNT&lt;/code&gt; (and its aggregate cousins) almost always deliver excellent performance during development when data sets are small. But as the number of items increases (which is what you want in Production, right? Millions of happy, paying customers), performance will suffer accordingly. Aggregate functions are time bombs that a future engineer (possibly you) must defuse.&lt;/p&gt;
&lt;h2&gt;
  
  
  You already guard against this in your code
&lt;/h2&gt;

&lt;p&gt;Let's say you are reviewing some code, and you come across the following block in a colleague's change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What would come to mind (other than "Why did you re-write a bubble sort?") when you see this? My attention is immediately drawn toward the nested for loops. This has a quadratic, or O(n^2), complexity. Given a sufficiently large input, this will perform poorly. It does not scale well.&lt;/p&gt;

&lt;p&gt;During my computer science education, this ability to spot nested for loops was drilled into me by professor after professor. Today, I hold this pattern as something to generally&lt;sup id="fnref2"&gt;2&lt;/sup&gt; avoid. Unfortunately, we are rarely trained to avoid similar complexity traps in SQL.&lt;/p&gt;

&lt;p&gt;Can you spot it here?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;   &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
         &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;Total_Revenue&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt;     &lt;span class="n"&gt;sales&lt;/span&gt;  
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both &lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;SUM&lt;/code&gt; are linear, O(n) operations. As the &lt;code&gt;sales&lt;/code&gt; table grows, these aggregations will perform more and more slowly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why should anyone care?
&lt;/h2&gt;

&lt;p&gt;I advocate, "You build it, you run it." If you're running a system, you want it to behave consistently at your expected loads—that is, you want it to scale predictably. If you use aggregate functions, you have relinquished a predictability variable to the user: volume. With aggregations, your system will behave &lt;em&gt;differently&lt;/em&gt; for two customers depending on how much data each has in the database.&lt;/p&gt;

&lt;p&gt;For example, a customer with a small online store could have dozens of orders a month. Any aggregate function over that data set will perform decently well. However, a customer with a booming online store could have thousands (or even millions) of monthly orders. For that customer, a lurking &lt;code&gt;COUNT&lt;/code&gt; will become less and less performant. As the e-commerce engineer who inserts an aggregate function into your code, you optimize for your &lt;em&gt;least valuable customer&lt;/em&gt;. How do you explain that to your CTO or your board of directors?&lt;/p&gt;

&lt;p&gt;Worse, you are now running a system that could crater at any moment (remember, your customers' data volumes now determine when your DB overheats). This turns on-call rotation into Russian roulette. Strive to keep your systems predictable and your on-call engineer unaffected (and well-rested).&lt;/p&gt;

&lt;h2&gt;
  
  
  Then, I'll use NoSQL!
&lt;/h2&gt;

&lt;p&gt;You might now think, "Well, if all these SQL aggregate functions are so bad, I'll just use NoSQL." Unfortunately, with one notable exception (more on this below), this won’t let you escape the problem.&lt;/p&gt;

&lt;p&gt;Let's take our friend &lt;code&gt;COUNT&lt;/code&gt;. Popular NoSQL databases like Elasticsearch, Cassandra, MongoDB, Couchbase, and Neo4j all support it. However, each system has the same "bear trap" penalty. You don't escape the problem by just moving to NoSQL unless...&lt;/p&gt;

&lt;h2&gt;
  
  
  DynamoDB is different
&lt;/h2&gt;

&lt;p&gt;If you pick DynamoDB as your NoSQL database, you are immune to these aggregate-function time bombs precisely because it doesn't allow them. I (and many others) have written about this before. Please look at the links below in References.&lt;/p&gt;

&lt;p&gt;DynamoDB is designed to perform consistent OLTP&lt;sup id="fnref3"&gt;3&lt;/sup&gt; interactions whether the data set has 10 or 10,000,000,000 items. DynamoDB can scale &lt;em&gt;predictably&lt;/em&gt;, with no "gotchas" from data volumes.&lt;/p&gt;
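&lt;p&gt;When you do need a count or a sum with DynamoDB, the usual pattern is to maintain the aggregate yourself at write time (in practice, an atomic counter update on a summary item). A minimal in-memory sketch, with illustrative names, of the idea:&lt;/p&gt;

```javascript
// Keep the running total as its own record, updated in O(1) per sale,
// instead of recomputing SUM over every sale at read time.
const totals = new Map();

function recordSale(product, revenue) {
  totals.set(product, (totals.get(product) ?? 0) + revenue);
}

recordSale('widget', 10);
recordSale('widget', 15);
recordSale('gadget', 7);

const widgetRevenue = totals.get('widget'); // 25
```

&lt;p&gt;Reads stay constant-time no matter how many sales exist; the aggregation cost is paid in tiny, predictable increments at write time.&lt;/p&gt;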

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;If you want to scale, aggregations are bear traps. Avoid them. Just because you &lt;em&gt;can&lt;/em&gt; use &lt;code&gt;COUNT&lt;/code&gt; doesn't mean you &lt;em&gt;should&lt;/em&gt;. If you use SQL (or MongoDB or Elasticsearch or ...), be vigilant that aggregations don't sneak into your code base. Or you can use DynamoDB and focus on creating your company's next big hit.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Ownership Matters: &lt;a href="https://sethorell.substack.com/p/why-dynamodb-is-still-my-first-pick?r=k6fll" rel="noopener noreferrer"&gt;Why DynamoDB Is (Still) My First Pick for Serverless&lt;/a&gt;&lt;br&gt;
Alex DeBrie: &lt;a href="https://dynamodbbook.com/" rel="noopener noreferrer"&gt;The DynamoDB Book&lt;/a&gt;&lt;br&gt;
Alex DeBrie: &lt;a href="https://www.alexdebrie.com/posts/dynamodb-no-bad-queries" rel="noopener noreferrer"&gt;SQL, NoSQL, and Scale: How DynamoDB scales where relational databases don't&lt;/a&gt;&lt;br&gt;
Clint Fontanella: &lt;a href="https://www.geeksforgeeks.org/aggregate-functions-in-sql/" rel="noopener noreferrer"&gt;What are Aggregate SQL Functions?&lt;/a&gt;&lt;br&gt;
Plamen Ratchev: &lt;a href="https://www.red-gate.com/simple-talk/databases/sql-server/t-sql-programming-sql-server/ten-common-sql-programming-mistakes/" rel="noopener noreferrer"&gt;Ten Common SQL Programming Mistakes&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://www.hydrick.net/2016/02/23/dont-use-dynamodb-as-a-production-database/" rel="noopener noreferrer"&gt;Don't use DynamoDB as a production database&lt;/a&gt;, &lt;a href="https://medium.com/@dotronglong/5-reasons-not-to-use-dynamodb-fee97195136a" rel="noopener noreferrer"&gt;5 reasons NOT to use DynamoDB&lt;/a&gt;, &lt;a href="https://dev.to/sebekz/dynamodb-and-the-art-of-knowing-your-limits-when-database-bites-back-532h"&gt;DynamoDB and the Art of Knowing Your Limits: When Database Bites Back&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;If you know your inputs are constrained to a small size, a nested loop may well be fine ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;AWS: &lt;a href="https://aws.amazon.com/compare/the-difference-between-olap-and-oltp/" rel="noopener noreferrer"&gt;The difference between OLAP and OLTP&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>dynamodb</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to Crater Your Database, Part One - An Introduction</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Tue, 25 Feb 2025 14:42:46 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-to-crater-your-database-an-introduction-3pep</link>
      <guid>https://forem.com/aws-builders/how-to-crater-your-database-an-introduction-3pep</guid>
      <description>&lt;p&gt;Part One &amp;lt;-&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-two-aggregations-322d"&gt;Part Two&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-three-normalization-2mpo"&gt;Part Three&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-four-consistency-5dib"&gt;Part Four&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/aws-builders/how-to-crater-your-database-part-five-summary-37n7"&gt;Part Five&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've worked with lots of engineers who never consider scale. That's OK. Not everyone has to. However, someone on the team must consider scaling, or your carefully constructed application, which is now starting to grow, could collapse in on itself like a sinkhole. This series of articles is for both classes of engineers: those new to scaling and the experts who might need a refresher.&lt;/p&gt;

&lt;p&gt;I'll give away the secret before I even start. If you want to crater&lt;sup id="fnref1"&gt;1&lt;/sup&gt; your database, do things that don't scale, and then try to scale. That's it!&lt;/p&gt;

&lt;p&gt;"But," you may ask, "how do I know which things scale and which don't?" That is a great question, and it will be the subject of several articles, starting with this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  But I don't need scale
&lt;/h2&gt;

&lt;p&gt;Not every company needs to invest in scale right now. If you are prototyping for market fit, you can use any data store you want, even a flat file.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Kent Beck has an interesting metaphor that helps classify three broad phases of a business: Explore, Expand, and Extract.&lt;sup id="fnref3"&gt;3&lt;/sup&gt; He refers to this as "3X" and uses it to delineate different approaches to building software.&lt;/p&gt;

&lt;p&gt;In the first phase, "Explore," you don't care about scale or load; you care about quickly expressing ideas to generate feedback about what to do next. Companies in this phase should optimize their choices to favor fast prototypes.&lt;/p&gt;

&lt;p&gt;The second phase, "Expand," is a growth phase. Scale suddenly becomes paramount. Your load is increasing, and the focus should be on eliminating bottlenecks.&lt;/p&gt;

&lt;p&gt;He calls the last phase "Extract." As your scaling rate becomes manageable, efficiency will matter more than growth.&lt;/p&gt;

&lt;p&gt;I will be gearing these articles toward software engineers in companies moving into (or neck deep in) phase 2, Expand. If you tell me, "I just need the functionality; it will handle our MVP just fine," you are in the Explore phase. That's fine. Keep these articles bookmarked for when your product becomes successful.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you want to crater your database, do things that don't scale, and then try to scale.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Predictable scaling
&lt;/h2&gt;

&lt;p&gt;Scale means significant growth. Not 10%, but 10x (or more). It's when your biggest customer plans an ambitious expansion and wants to sign a new contract. Here's how it can look.&lt;/p&gt;

&lt;p&gt;Sales: "Acme plans to roll out our product to 50 new locations!"&lt;br&gt;
Engineer: "How many locations do they use us on today?"&lt;br&gt;
Sales: "Five."&lt;br&gt;
Engineer: "(to herself), but Acme already eats up 80% of our database capacity..."&lt;/p&gt;

&lt;p&gt;Part of the problem here is that the engineer doesn't know how the system will behave after a 10x increase. She expects it to suffer, but in what way? To put it another way, the system scales unpredictably.&lt;/p&gt;

&lt;p&gt;Designing a system to scale predictably is an achievement. You must carefully examine many of your previous assumptions--things that worked for a prototype--and ask yourself, "How will this scale?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The database
&lt;/h2&gt;

&lt;p&gt;The database is a known bottleneck. Every company I've worked with that has experienced scaling events has fought hard-to-scale database constraints. The architectural advances in scaling stateless resources like compute and networking have exposed the one component that hasn't made similar progress: the database.&lt;/p&gt;

&lt;p&gt;Note that you can broaden the term "database" to "datastore." The database problems I will describe also occur in search engines and file systems. Using the same bad techniques, you can crater these systems, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The sweet spot: Serverless
&lt;/h2&gt;

&lt;p&gt;Energy spent building scalable systems prematurely is wasted. Since you don't yet know if your product will be a hit, you need fast prototypes and quick pivots. Choose technologies that allow you to build quickly.&lt;/p&gt;

&lt;p&gt;To scale up a system, you must make architectural and technological choices that support predictable scaling.&lt;/p&gt;

&lt;p&gt;The "sweet spot" between these two approaches is to use technologies that let you build fast and offer predictable scaling. If you are comfortable with fully managed AWS services like Lambda, API Gateway, S3, and DynamoDB, you can build fast with predictable scaling from the beginning. In other words, consider a "serverless first" approach.&lt;/p&gt;

&lt;p&gt;Any investment in becoming more proficient with these services will pay off in spades as you find yourself in more "Explore" and "Expand" situations.&lt;sup id="fnref4"&gt;4&lt;/sup&gt; It has for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Up
&lt;/h2&gt;

&lt;p&gt;In the next installment, we will examine aggregations, one of the most common techniques for cratering your database.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Kent Beck's Substack: &lt;a href="https://tidyfirst.substack.com/" rel="noopener noreferrer"&gt;Software Design: Tidy First?&lt;/a&gt;&lt;br&gt;
Robert C. Martin: &lt;a href="https://blog.cleancoder.com/uncle-bob/2012/05/15/NODB.html" rel="noopener noreferrer"&gt;No DB&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/everything-suffers-from-cold-starts?r=k6fll" rel="noopener noreferrer"&gt;Everything Suffers from Cold Starts&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Merriam-Webster &lt;a href="https://www.merriam-webster.com/dictionary/crater" rel="noopener noreferrer"&gt;defines&lt;/a&gt; crater, used as a verb, to mean “to fail or fall suddenly and dramatically: Collapse, Crash.” ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;A tip of the hat to Uncle Bob here. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Kent Beck: &lt;a href="https://tidyfirst.substack.com/p/the-product-development-triathlon" rel="noopener noreferrer"&gt;The Product Development Triathlon&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;I will end with a brief prediction of how serverless interacts with the "Extract" phase at the end of the series. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>serverless</category>
      <category>architecture</category>
      <category>database</category>
      <category>dynamodb</category>
    </item>
    <item>
      <title>Testing AppSync Subscriptions</title>
      <dc:creator>Seth Orell</dc:creator>
      <pubDate>Thu, 09 Jan 2025 14:05:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/testing-appsync-subscriptions-3jf1</link>
      <guid>https://forem.com/aws-builders/testing-appsync-subscriptions-3jf1</guid>
<description>&lt;p&gt;AppSync has three types of GraphQL (GQL) operations: Queries, Mutations, and Subscriptions. These operations fall into two broad categories: synchronous and asynchronous.&lt;/p&gt;

&lt;p&gt;Queries and Mutations are straightforward to implement and test due to their synchronous nature. Subscriptions are more complicated, both in how they are put together and how they are validated. Today, I want to focus on that validation: how to test AppSync subscriptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subscription Complexity
&lt;/h2&gt;

&lt;p&gt;What makes GQL subscriptions so complex? First, they are tied to a mutation. This mutation is often implemented as an internal-only, "dummy"&lt;sup id="fnref1"&gt;1&lt;/sup&gt; mutation to guarantee the projected properties. See Tamas Sallai's excellent article on why this is so, linked below in References.&lt;/p&gt;

&lt;p&gt;Now, we want to test this complex thing we just built. Subscriptions work off of asynchronous events. I have previously written&lt;sup id="fnref2"&gt;2&lt;/sup&gt; about whether you should test your application's eventing. Sometimes, it's not worth the effort. However, with GQL subscriptions, I have much more to test than a mere call to EventBridge. There is too much that can go wrong to forgo these tests.&lt;/p&gt;

&lt;p&gt;The last thing that makes testing GQL subscriptions difficult is the tooling. I experienced issues combining multiple subscriptions per Jest test fixture. Plus, teardown can take many seconds per fixture as the connections clean themselves up on termination. I expect this to get better as the tooling and techniques mature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Example
&lt;/h2&gt;

&lt;p&gt;Let's assume that our user, Jack, uses a popular e-commerce site and wants to be alerted if one of his favorite stores, Acme, adds a new product. Here is a diagram of the basic flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7wzizyytnqsgun2iu3y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7wzizyytnqsgun2iu3y.jpg" alt="Basic Flow of The System" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The products service emits a "New Product Added to Acme" event.&lt;/li&gt;
&lt;li&gt;This event triggers a Lambda in our AppSync project, listening for new product events.&lt;/li&gt;
&lt;li&gt;This Lambda invokes a "dummy" mutation, &lt;code&gt;notifyProductAdded&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The dummy mutation triggers the subscription.&lt;/li&gt;
&lt;li&gt;Jack receives his notification.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is a lot that can go wrong. We need tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing Your Testing Width
&lt;/h2&gt;

&lt;p&gt;Let's examine two subscription testing approaches: "wide" and "narrow."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygvtqhr9yn56254dcm37.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygvtqhr9yn56254dcm37.jpg" alt="Wide testing points" width="800" height="393"&gt;&lt;/a&gt;&lt;br&gt;
The wide test involves both the resource service and AppSync. In this case, we call the resource service with an operation, "Create a Product," and let the event flow through AppSync.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ght45lhoeapjgyeqdzs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ght45lhoeapjgyeqdzs.jpg" alt="Narrow testing points" width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
The narrow approach stays within the AppSync application and begins by directly invoking the event-handling Lambda. In both cases, we spin up a GQL subscription before starting the test, then validate the response.&lt;/p&gt;

&lt;p&gt;Let's examine some of the techniques for setting up tests like this. Later, we'll consider why you might choose "wide" over "narrow" or vice versa.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subscription Testing Techniques
&lt;/h2&gt;

&lt;p&gt;I will be testing in NodeJS using the Jest framework. Many GQL libraries in NPM have great support for subscriptions. However, almost all are intended to work in a Web browser, not NodeJS. I had some success using the &lt;code&gt;aws-amplify&lt;/code&gt; package. I handle the entire subscription setup using one method: &lt;code&gt;setUpSubscription&lt;/code&gt;. Let's look at the key sections of this method.&lt;/p&gt;

&lt;p&gt;Before you go any further, you must import &lt;a href="https://www.npmjs.com/package/ws" rel="noopener noreferrer"&gt;WS&lt;/a&gt;, a Node.js WebSocket library. Amplify is also intended for a browser and expects the browser-native WebSocket object. To test in NodeJS, you must include this line.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floyu8ln7oukejsrbqvau.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floyu8ln7oukejsrbqvau.jpg" alt="Importing WS" width="564" height="38"&gt;&lt;/a&gt;&lt;/p&gt;
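&lt;p&gt;As a sketch, assuming the &lt;code&gt;ws&lt;/code&gt; package is installed as a dev dependency, that line amounts to the following:&lt;/p&gt;

```javascript
// Amplify expects the browser-native WebSocket object; in Node.js,
// back it with the `ws` package before configuring Amplify.
global.WebSocket = require('ws');
```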

&lt;p&gt;The first thing I do is configure Amplify for my application. In this case, I am using an API key to authorize the call to subscribe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw2zq7cvmoq4hlvufscp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw2zq7cvmoq4hlvufscp.jpg" alt="Amplify configuration" width="279" height="206"&gt;&lt;/a&gt;&lt;/p&gt;
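&lt;p&gt;A minimal sketch of that configuration, assuming Amplify v6; the endpoint, region, and key values are placeholders you would replace with your own stack's values:&lt;/p&gt;

```javascript
const { Amplify } = require('aws-amplify');

// Placeholder values; substitute your API's endpoint, region, and key.
Amplify.configure({
  API: {
    GraphQL: {
      endpoint: 'https://example.appsync-api.us-east-1.amazonaws.com/graphql',
      region: 'us-east-1',
      defaultAuthMode: 'apiKey',
      apiKey: 'da2-example-key',
    },
  },
});
```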

&lt;p&gt;The next section spins up a Hub listener, which emits events about the connection itself. I use this to ensure I am fully connected before triggering the dummy mutation. The setup returns a function I call &lt;code&gt;stopHubListener&lt;/code&gt; to cleanly shut down the Hub after the test completes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf4g5ivnrisk930lgihl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf4g5ivnrisk930lgihl.jpg" alt="Hub listener" width="486" height="167"&gt;&lt;/a&gt;&lt;/p&gt;
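&lt;p&gt;A sketch of that listener, assuming Amplify v6, where connection-state changes for subscriptions arrive on the &lt;code&gt;api&lt;/code&gt; Hub channel:&lt;/p&gt;

```javascript
const { Hub } = require('aws-amplify/utils');
const { CONNECTION_STATE_CHANGE } = require('aws-amplify/api');

let connectionState = 'Disconnected';

// Hub.listen returns a function that removes the listener; returning it
// as stopHubListener lets the test shut the Hub down cleanly afterward.
const stopHubListener = Hub.listen('api', ({ payload }) => {
  if (payload.event === CONNECTION_STATE_CHANGE) {
    connectionState = payload.data.connectionState;
  }
});
```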

&lt;p&gt;Next, I set up the subscription itself. This portion is straightforward and looks like any GQL invocation. Note that the &lt;code&gt;next()&lt;/code&gt; handler adds each received message to the array the caller passed in; the test later checks that array to validate that a message was received through the subscription.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flawu5877s3mb9yzalses.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flawu5877s3mb9yzalses.jpg" alt="Amplify client" width="387" height="204"&gt;&lt;/a&gt;&lt;/p&gt;
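&lt;p&gt;A sketch of the subscribe call, again assuming Amplify v6; &lt;code&gt;onCreateProduct&lt;/code&gt; stands in for whatever subscription document your schema defines, and &lt;code&gt;results&lt;/code&gt; is the array the caller passed in:&lt;/p&gt;

```javascript
const { generateClient } = require('aws-amplify/api');

const client = generateClient();

const subscription = client
  .graphql({ query: onCreateProduct }) // hypothetical subscription document
  .subscribe({
    next: ({ data }) => results.push(data), // each message lands in results
    error: (err) => console.error('Subscription error', err),
  });
// Teardown calls subscription.unsubscribe() when the test completes.
```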

&lt;p&gt;Finally, I wait until my Hub listener is ready. When it is safe to exercise the subscription, the listener will report a connection state of "Connected."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltro1noxs3ahh7qd6w6f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltro1noxs3ahh7qd6w6f.jpg" alt="Await connection state" width="379" height="124"&gt;&lt;/a&gt;&lt;/p&gt;
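&lt;p&gt;The wait itself needs no Amplify API at all; a small polling helper along these lines does the job. &lt;code&gt;waitForState&lt;/code&gt; is a hypothetical name, and the working repository may implement the wait differently:&lt;/p&gt;

```javascript
// Poll a state getter until it returns the desired value, or time out.
const waitForState = async (getState, desired, timeoutMs = 10000, intervalMs = 100) => {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (getState() === desired) return;
    if (Date.now() > deadline) {
      throw new Error(`Timed out waiting for state "${desired}"`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
};

// In setUpSubscription, where the Hub listener updates connectionState:
// await waitForState(() => connectionState, 'Connected');
```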

&lt;p&gt;The examples I used above are in a working application you can review at &lt;a href="https://github.com/sls-reference-architectures/appsync-to-http" rel="noopener noreferrer"&gt;this GitHub repository&lt;/a&gt;. Please see the source code for more details than I provided today.&lt;/p&gt;

&lt;p&gt;From here, the test is easy. Depending on your starting point, wide or narrow, you invoke the resource service or the Lambda, respectively. Then, you wait for the expected message to appear in your results array.&lt;/p&gt;
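&lt;p&gt;Put together, a narrow test might look roughly like this Jest sketch. &lt;code&gt;setUpSubscription&lt;/code&gt; and &lt;code&gt;invokeEventHandler&lt;/code&gt; are stand-ins for helpers in your own suite, and the asserted field names are hypothetical:&lt;/p&gt;

```javascript
test('delivers created product to subscribers', async () => {
  const results = [];
  // Hypothetical helper: configures Amplify, starts the Hub listener,
  // subscribes, and resolves once the connection state is Connected.
  const { subscription, stopHubListener } = await setUpSubscription(results);

  try {
    // Narrow starting point: invoke the event-handling Lambda directly.
    await invokeEventHandler(createProductEvent);

    // Wait for the expected message to appear in the results array.
    let attempts = 100; // roughly ten seconds at 100 ms per poll
    while (results.length === 0) {
      attempts -= 1;
      if (attempts === 0) throw new Error('Timed out waiting for message');
      await new Promise((resolve) => setTimeout(resolve, 100));
    }
    expect(results[0].onCreateProduct.name).toEqual('Widget'); // hypothetical shape
  } finally {
    subscription.unsubscribe();
    stopHubListener();
  }
});
```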

&lt;h2&gt;
  
  
  A Note of Caution
&lt;/h2&gt;

&lt;p&gt;I had difficulties combining multiple subscription tests into the same test fixture. I don't know whether this was a Jest/Amplify limitation or my limited knowledge. Breaking them into many fixtures (files in Jest) works fine for me, as I run fixtures in parallel with my test runners.&lt;/p&gt;

&lt;p&gt;When you have problems, especially lingering WebSocket connections, your tests will hang even after passing or failing, and those hangs are difficult to unwind. Go slow. Build things in small chunks, and keep your test suite healthy as you go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Auth Mechanisms
&lt;/h2&gt;

&lt;p&gt;The example above uses an API key as its primary authorization. What if you don't use an API key? The Amplify library has you covered. It supports all the auth schemes that AppSync does, including custom (Lambda) authorization.&lt;/p&gt;

&lt;p&gt;To support custom/Lambda authorization for your subscription, you must make two small changes to your subscription setup. The first is to simplify the call to &lt;code&gt;Amplify.configure()&lt;/code&gt; and only pass in the endpoint and region.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy6mg3w7k4yj2iuyl17h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy6mg3w7k4yj2iuyl17h.jpg" alt="Amplify configuration with custom auth" width="272" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other change is to the call to subscribe. Here, you pass in the authorization mode and token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kq96y99d8pxzeq8rtwy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kq96y99d8pxzeq8rtwy.jpg" alt="Amplify client with custom auth" width="398" height="243"&gt;&lt;/a&gt;&lt;/p&gt;
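&lt;p&gt;Together, assuming Amplify v6, the two changes might look like this sketch. The token value is a placeholder that your authorizer Lambda would accept, and depending on your Amplify version you may also need to declare a default auth mode in the configuration:&lt;/p&gt;

```javascript
const { Amplify } = require('aws-amplify');
const { generateClient } = require('aws-amplify/api');

// Change 1: a simpler configuration with no key material.
Amplify.configure({
  API: {
    GraphQL: {
      endpoint: 'https://example.appsync-api.us-east-1.amazonaws.com/graphql',
      region: 'us-east-1',
      defaultAuthMode: 'lambda',
    },
  },
});

// Change 2: pass the authorization mode and token on the subscribe call.
const client = generateClient();
const subscription = client
  .graphql({
    query: onCreateProduct, // hypothetical subscription document
    authMode: 'lambda',
    authToken: 'custom-authorizer-token', // placeholder
  })
  .subscribe({ next: ({ data }) => results.push(data) });
```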

&lt;p&gt;The remainder of the method should look identical to the API-key version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wide or Narrow?
&lt;/h2&gt;

&lt;p&gt;Where should your tests begin, at the resource service (wide) or with your event handler (narrow)? There are pros and cons to each approach.&lt;/p&gt;

&lt;p&gt;A wide test tells you the most information. However, it couples your test to the resource service. If that service is experiencing problems, a wide test will fail and block your deployment pipeline.&lt;/p&gt;

&lt;p&gt;A narrow test can be implemented ahead of time and independently from the resource service. It provides information about your AppSync API but not its interaction with other resource services.&lt;/p&gt;

&lt;p&gt;The fundamental guideline is that your testing width should follow your deployment topology. Is the AppSync API part of the resource service’s deployment? Use wide tests. Is it an independently deployed unit? Use narrow tests. Keeping your deployment pipelines independent is crucial to managing a microservices architecture. Watch out for unnecessary coupling between projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;AppSync subscriptions can be complicated. In the projects I've built, they were the most complex endpoints to set up. Knowing they work as expected goes a long way to building confidence that your AppSync API is solid. I hope this post helps you build that confidence.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Ownership Matters: &lt;a href="https://sethorell.substack.com/p/testing-eventbridge-with-serverless" rel="noopener noreferrer"&gt;Testing EventBridge with Serverless&lt;/a&gt;&lt;br&gt;
GitHub: Serverless Reference Architectures &lt;a href="https://github.com/sls-reference-architectures/appsync-to-http" rel="noopener noreferrer"&gt;appsync-to-http&lt;/a&gt;&lt;br&gt;
Tamas Sallai: &lt;a href="https://advancedweb.hu/real-time-data-with-appsync-subscriptions/" rel="noopener noreferrer"&gt;Real-time data with AppSync subscriptions&lt;/a&gt;&lt;br&gt;
Ownership Matters: &lt;a href="https://sethorell.substack.com/p/testing-aws-appsync-javascript-resolvers?r=k6fll" rel="noopener noreferrer"&gt;Testing AppSync JavaScript Resolvers&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;I first heard Yan Cui use this term, but I don't know if it originated with him. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://sethorell.substack.com/p/testing-eventbridge-with-serverless?r=k6fll" rel="noopener noreferrer"&gt;Testing EventBridge with Serverless&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>appsync</category>
      <category>ownership</category>
    </item>
  </channel>
</rss>
