<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ryan Batchelder</title>
    <description>The latest articles on Forem by Ryan Batchelder (@rdbatch).</description>
    <link>https://forem.com/rdbatch</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F212549%2F229200fb-f887-4dce-be2c-fbdb8340ba89.jpg</url>
      <title>Forem: Ryan Batchelder</title>
      <link>https://forem.com/rdbatch</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rdbatch"/>
    <language>en</language>
    <item>
      <title>AWS ECS Managed Instances</title>
      <dc:creator>Ryan Batchelder</dc:creator>
      <pubDate>Mon, 06 Oct 2025 01:53:56 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-ecs-managed-instances-25</link>
      <guid>https://forem.com/aws-builders/aws-ecs-managed-instances-25</guid>
      <description>&lt;p&gt;If you've been particularly focused in the Kubernetes world, you may have missed AWS continuing to roll out new features to their (imo) excellent ECS service with &lt;a href="https://aws.amazon.com/blogs/aws/announcing-amazon-ecs-managed-instances-for-containerized-applications/" rel="noopener noreferrer"&gt;ECS Managed Instances&lt;/a&gt;. Let's take a look at who this feature is for and why you may (or may not) want to use it.&lt;/p&gt;

&lt;h2&gt;Who is this for?&lt;/h2&gt;

&lt;p&gt;At the highest level, ECS Managed Instances offloads the management of EC2 resources in your clusters to AWS. This might sound like the value proposition of Fargate, but Managed Instances takes a different, complementary approach. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2g1rhdoc4jk1m1vjjaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2g1rhdoc4jk1m1vjjaq.png" alt="Serverless Spectrum showing serverful compute to the left starting with EC2 and moving toward serverless Lambda on the right. AWS ECS Managed Instances is placed in the middle." width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fargate brings the serverless experience to container orchestration - running your workloads on compute that's entirely outside of your purview and charging you only for the compute time that your container uses. As a trade-off, you have effectively no control over the compute that's provisioned behind the scenes to run your containers. This is totally fine for many general-purpose workloads that don't have particular hardware requirements or aren't sensitive to where they're running. &lt;/p&gt;

&lt;h2&gt;But what if you did need specific hardware...?&lt;/h2&gt;

&lt;p&gt;Unfortunately, not all web services are simple apps serving up memes and cat videos. Sometimes you might need to run a workload that needs access to GPUs, high-throughput networking, or really fast local storage.&lt;/p&gt;

&lt;p&gt;Historically, the only answer to this would be to manage your own fleet of EC2 instances and bring them into an ECS cluster. While this allows you to have all of the nice benefits of the ECS container orchestration platform, it comes with all the downsides of managing your own infrastructure. OS configuration, patching, security, etc. are all your responsibility when using EC2 directly to back ECS clusters. &lt;a href="https://aws.amazon.com/about-aws/whats-new/2019/12/amazon-ecs-capacity-providers-now-available/" rel="noopener noreferrer"&gt;Capacity Providers&lt;/a&gt; were eventually released to make scaling clusters easier, but they still relied on you maintaining Launch Templates for the compute that they... provided.&lt;/p&gt;

&lt;h2&gt;I don't need specific hardware, should I just use Fargate?&lt;/h2&gt;

&lt;p&gt;Maybe! Fargate is excellent at providing a way to run containers without having to really think about the compute underneath, but that management doesn't come for free. Fargate tends to be more expensive per vCPU and GB of memory than the equivalent capacity on EC2, with the benefit being that you save yourself (or your ops/infra teams) time not worrying about management. For a small number of containers this price premium is probably a welcome trade-off, but the math doesn't always work out as you scale up. Being able to bin-pack a number of containers onto one instance can make EC2-backed ECS clusters quite cost-efficient, and Fargate generally can't keep up on this front, at least for containers that need to run 24/7.&lt;/p&gt;
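To make the bin-packing math concrete, here's a rough sketch. All prices below are illustrative placeholders rather than real AWS rates, and the task sizes are made up; plug in your own region's current pricing to run the comparison for real.

```python
# Rough cost comparison: always-on tasks on Fargate vs. bin-packed onto EC2.
# All rates are hypothetical placeholders, NOT real AWS pricing.
FARGATE_VCPU_HOUR = 0.04048   # assumed $/vCPU-hour
FARGATE_GB_HOUR = 0.004445    # assumed $/GB-hour of memory
EC2_INSTANCE_HOUR = 0.17      # assumed $/hour for a 4 vCPU / 16 GB instance

HOURS_PER_MONTH = 730

def fargate_monthly(vcpu, gb):
    # Fargate bills per task on vCPU and memory consumed.
    return (vcpu * FARGATE_VCPU_HOUR + gb * FARGATE_GB_HOUR) * HOURS_PER_MONTH

def ec2_monthly():
    # One instance's cost is shared by every container packed onto it.
    return EC2_INSTANCE_HOUR * HOURS_PER_MONTH

# Eight always-on tasks at 0.5 vCPU / 1 GB each...
fargate_cost = 8 * fargate_monthly(0.5, 1)
# ...versus the same eight tasks bin-packed onto one 4 vCPU / 16 GB instance.
ec2_cost = ec2_monthly()
```

With these made-up numbers the bin-packed instance comes out cheaper for 24/7 workloads; for a single small task that only runs occasionally, the comparison flips.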

&lt;p&gt;Fargate has a few other drawbacks as well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No access to the underlying host means you can't run daemon tasks in the background. This means you need to configure things like log or metric collection per-task rather than having it run at the instance-level like you can with EC2-backed clusters.&lt;/li&gt;
&lt;li&gt;Privileged containers aren't supported on Fargate.&lt;/li&gt;
&lt;li&gt;Fargate compute isn't always on the latest generation hardware. A vCPU on an m5 instance has very different performance characteristics from an m7i or m7a vCPU, and Fargate can schedule your tasks to run on any sort of compute that AWS chooses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The Best of Both Worlds - ECS Managed Instances&lt;/h2&gt;

&lt;p&gt;ECS Managed Instances slides further along the serverless spectrum and solves the problem of needing to manage compute whenever your workloads call for more specific hardware, while still letting you offload management to AWS. On top of that, the instances provisioned and managed on your behalf are still regular EC2 instances joined to your cluster. This means they can run as many containers as reasonably fit, dividing the cost between multiple workloads and likely saving you money compared to Fargate. Layer on the fact that managed instances take advantage of all your existing EC2 reservations and savings plans, and you can potentially be looking at significant savings.&lt;/p&gt;

&lt;p&gt;It's important to note that ECS Managed Instances isn't free. There is an additional cost overhead on top of the EC2 sticker price to cover the management, but all in all it's still cheaper than Fargate. How much cheaper? For most scenarios I checked, only about 3% at face value, but given that EC2 savings plans can carry stronger discounts than Fargate's, that difference can grow significantly depending on your particular scenario. As part of this management fee, you get security patching every 14 days, and behind the scenes ECS will still try to cost-optimize your compute by consolidating containers onto underutilized instances.&lt;/p&gt;

&lt;h2&gt;Where can I learn more?&lt;/h2&gt;

&lt;p&gt;I'd highly recommend taking a read of the &lt;a href="https://aws.amazon.com/blogs/aws/announcing-amazon-ecs-managed-instances-for-containerized-applications/" rel="noopener noreferrer"&gt;official AWS launch announcement&lt;/a&gt; first. Thankfully this release came with &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-properties-ecs-capacityprovider-managedinstancesprovider.html" rel="noopener noreferrer"&gt;CloudFormation&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_ecs.ManagedInstancesCapacityProvider.html" rel="noopener noreferrer"&gt;CDK&lt;/a&gt; support from day one, so you can start playing with it right away via infrastructure-as-code.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>containers</category>
      <category>ecs</category>
      <category>ec2</category>
    </item>
    <item>
      <title>Passing the AWS Solutions Architect Pro Exam</title>
      <dc:creator>Ryan Batchelder</dc:creator>
      <pubDate>Sat, 26 Jul 2025 23:34:12 +0000</pubDate>
      <link>https://forem.com/aws-builders/passing-the-aws-solutions-architect-pro-exam-alg</link>
      <guid>https://forem.com/aws-builders/passing-the-aws-solutions-architect-pro-exam-alg</guid>
      <description>&lt;p&gt;After many years of telling myself "I should take the SA Pro exam," I finally made the leap to take (and pass!) the exam this morning. While there's no shortage of exam prep info out there, one thing that helped me feel comfortable going in was finding similar stories to mine. So in the interest of giving that back: here's mine.&lt;/p&gt;

&lt;h2&gt;Some Context - Why did I take the test in the first place?&lt;/h2&gt;

&lt;p&gt;I hate tests. I've never been a good test taker, and the thought of sitting for a three-hour exam on a Saturday morning isn't exactly my idea of a nice weekend activity. I'm also not someone who particularly chases certifications. For me, it's more of a way to validate that I have the skills at a particular level, especially when I can't always apply that full skillset in my day-to-day work.&lt;/p&gt;

&lt;p&gt;That said, I've taken the Solutions Architect Associate exam twice in the past, and with my associate certification coming up for renewal, I felt that I couldn't just take the associate-level exam a third time. I've been working with AWS for the past 10 years and have helped teach an exam prep cohort internally at my company three times over, so it felt like I was overdue for a professional-level certification.&lt;/p&gt;

&lt;p&gt;Solutions Architect best reflects the work that I do "in my day job," so that felt like a natural choice (plus I needed to renew that associate-level cert). I'm exposed to many, but not all, of the concepts and services that the solutions architect track covers. At this stage, I'm not sure if I'll also pursue the DevOps Engineer track, but I haven't ruled out trying some of the specialty certifications.&lt;/p&gt;

&lt;h2&gt;What did my prep look like?&lt;/h2&gt;

&lt;p&gt;Effective test prep looks different for everyone, and I think a good way to plan your exam prep is by looking around at how others have tackled it. I spent just over a month preparing for my exam, and concentrated most of my practice in the two weeks leading up to exam day.&lt;/p&gt;

&lt;p&gt;No matter what, I think the best place to start any AWS Certification exam prep is by reviewing the &lt;a href="https://d1.awsstatic.com/onedam/marketing-channels/website/aws/en_US/certification/approved/pdfs/docs-sa-pro/AWS-Certified-Solutions-Architect-Professional_Exam-Guide.pdf" rel="noopener noreferrer"&gt;AWS-provided Exam Guide&lt;/a&gt;. From there, I opted to review using Stephane Maarek's excellent &lt;a href="https://www.udemy.com/share/102IMA3@P3JheebLG8p-b3_67Lbc6h9tL3JwCkgbb1NsdP7CLUdZnhdF1kUkezVeNX-JY44p/" rel="noopener noreferrer"&gt;Ultimate AWS Certified Solutions Architect Professional 2025&lt;/a&gt; course on Udemy. I watched through all of the lectures, even for the services that I thought I was very comfortable with (just in case).&lt;/p&gt;

&lt;p&gt;The only section I didn't spend a lot of time with was the "Exam Preparation" section, which deep-dives on a handful of sample questions. If you haven't taken an AWS Certification exam in a while, this is a great walkthrough on the mindset that you should take into answering the questions. I instead opted to replace this section with &lt;a href="https://portal.tutorialsdojo.com/courses/aws-certified-solutions-architect-professional-practice-exams/" rel="noopener noreferrer"&gt;Jon Bonso's practice exams&lt;/a&gt; on Tutorials Dojo. In helping people prepare for the associate-level exam, I've found that Jon's practice exams are some of the best in terms of reflecting the difficulty level of the real exam, and sometimes are even more difficult. I was really glad to see that the professional-level practice exams were no different.&lt;/p&gt;

&lt;p&gt;Now that's not to say that I was confident walking into that exam... The practice exam questions were challenging, and working through them in "Review Mode" was a great learning tool, but I got a lot of questions wrong in the process. Ultimately, I did one final walkthrough in timed mode the day before, which I passed comfortably, and then decided to stop thinking about AWS for the rest of the day. 😅&lt;/p&gt;

&lt;h2&gt;Exam day&lt;/h2&gt;

&lt;p&gt;If you're reading this and preparing for the SA Pro exam, you've certainly taken an AWS Certification exam before. Structurally, this one isn't much different from SA Associate, but of course it's longer and more difficult. The high-level specs are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;75 questions, where 10 unidentified questions are not scored&lt;/li&gt;
&lt;li&gt;180 minutes total, with a few extra minutes allotted for reading the terms and completing a survey at the end&lt;/li&gt;
&lt;li&gt;Tests can be taken either in person at a Pearson VUE testing center or proctored virtually on your own device&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The vast majority of the questions involve reading through a fairly involved scenario with lots of detail. Sometimes you need to lean on all of the provided detail; other times it's given as "fluff" and you need to see through it. Many of the answers will be mostly or nearly correct, and you need to find the key detail that makes a given answer "most correct." Working through all of this in the time allotment &lt;em&gt;can&lt;/em&gt; be a challenge, but I personally found that there were a few questions with shorter scenarios that I could answer quickly to earn back some time to invest in others.&lt;/p&gt;
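The time pressure above works out to a simple back-of-the-envelope budget:

```python
# Back-of-the-envelope time budget for the SA Pro exam.
questions = 75
total_minutes = 180
minutes_per_question = total_minutes / questions  # 2.4 minutes per question

# Every short-scenario question answered in under a minute banks
# extra time for the longer, more involved scenarios.
```

About two and a half minutes per question on average, which is why banking time on the shorter scenarios matters.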

&lt;p&gt;I always choose to take my tests at a testing center. I've heard too many stories of virtually proctored tests being invalidated over accidents like background noise or someone walking into the testing room. I also feel that I can focus better in a purpose-built testing setting. The only downside I ran into is that the machines at my testing center were quite old, with what seemed to be 720p displays that made the dense questions a little tiring to read for three hours, especially after being spoiled with nice 4K displays at home.&lt;/p&gt;

&lt;h2&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;I walked away feeling decent about my exam, but I definitely wasn't confident that I had passed. Considering that Jon's practice exams seemed to trend more difficult, it made me feel like I'd taken an easier test on exam day. AWS tells you that most exam results are delivered within 24 hours, with some taking up to 5 days. My results came about 5 hours later, so thankfully I didn't have to wait in suspense for too long.&lt;/p&gt;

&lt;p&gt;As I mentioned before, I'm not particularly chasing all of the certification exams, so I'm not sure what I'll tackle next. For now, I'm happy to have this one in the bag for the next three years.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awscertified</category>
    </item>
    <item>
      <title>Resilience in the Cloud - Making Things Better</title>
      <dc:creator>Ryan Batchelder</dc:creator>
      <pubDate>Mon, 31 Mar 2025 23:24:23 +0000</pubDate>
      <link>https://forem.com/aws-builders/resilience-in-the-cloud-making-things-better-2egp</link>
      <guid>https://forem.com/aws-builders/resilience-in-the-cloud-making-things-better-2egp</guid>
      <description>&lt;p&gt;If you've read my last two posts in this series, you might be starting to reflect on systems that you manage or are currently building and how resilient they really are. If you haven't read the other posts in my AWS Resilience Series, here's where you can find them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rdbatch/resilience-in-the-cloud-availability-vs-recoverability-4jkp"&gt;Availability vs. Recoverability&lt;/a&gt; discusses the difference between making a system highly available and being able to recover from a failure when one inevitably does happen.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/rdbatch/resilience-in-the-cloud-fault-isolation-boundaries-4nid"&gt;Fault Isolation Boundaries&lt;/a&gt; covers what these boundaries are, how AWS uses them to provide us resilient cloud services, and how we can leverage that architecture to improve our own systems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now this third and final post in the series will discuss how you can take the learnings from the previous two and turn them into actionable improvements in your workloads.&lt;/p&gt;

&lt;p&gt;Cloud resilience is one of those topics where lots of people have their own idea of what "good" looks like, and all of their paths to get there look a bit different. Along those paths, a number of mistakes and misconceptions tend to crop up, though not without good reason. Building resilient applications is complex, but an even greater challenge is that measuring success in resilience often demands a level of investment and maturity beyond what many applications can justify. Because of this, system resilience can be an elusive topic fraught with false senses of security and frustration. &lt;/p&gt;

&lt;p&gt;So how do you evaluate your applications as they stand today, and make a plan for where you want your architecture to be tomorrow? How do you ultimately know what "good" should look like for you? &lt;/p&gt;

&lt;p&gt;I've spoken on this topic a few times in the past (at least &lt;a href="https://www.youtube.com/watch?v=chm396TYELc" rel="noopener noreferrer"&gt;once publicly&lt;/a&gt;), and historically I've answered this question with the following four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with backups&lt;/li&gt;
&lt;li&gt;Identify potential failure points&lt;/li&gt;
&lt;li&gt;Implement mitigations and redundancies&lt;/li&gt;
&lt;li&gt;Plan for disaster and test&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you do just one thing, it should be addressing your backup strategy. Today's technology landscape has a huge focus and emphasis on data; data is the foundation of today's business and it's crucial to ensure that data is protected. Data loss can happen in a number of different ways, but you can distill the problem down by looking at every data storage resource in your application and asking yourself: "what would I do if this data disappeared tomorrow?". If the answer is anything besides "I could easily recreate it", you almost certainly need to back it up. &lt;/p&gt;
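That distilled question can even be expressed as a trivial triage pass over your data stores. This is just a sketch; the store names and answers below are hypothetical examples, not a real inventory.

```python
# Minimal sketch of the backup-triage question above, applied store by store.
# Store names and answers are hypothetical examples.
data_stores = {
    "orders-db": "irreplaceable",          # customer data: must be backed up
    "session-cache": "easily recreated",   # ephemeral: safe to skip
    "ml-feature-store": "rebuildable, but slowly",
}

def needs_backup(answer):
    # Anything besides "I could easily recreate it" needs a backup.
    return answer != "easily recreated"

to_back_up = [name for name, answer in data_stores.items() if needs_backup(answer)]
```

Walking every store through this single question is a surprisingly effective first pass before any deeper resilience analysis.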

&lt;p&gt;AWS has services with some amazingly good data durability SLAs, but those serve the AWS side of the shared responsibility model, not ours as customers. Having a well-designed backup strategy should protect you against all data loss: storage medium failure/corruption, inadvertent modification or deletion, or even ransomware attacks. Ensure that you're thinking of all of these when designing a data protection strategy and don't just rely on vendor durability SLAs.&lt;/p&gt;

&lt;p&gt;Moving beyond backups, next up is to identify other potential failure points in your architecture. For a long time, I've generally followed a similar methodical approach to backups: walk your infrastructure, identify how each component could possibly fail, then mitigate that failure. It turns out that some of the brilliant minds at AWS wrote up similar guidance with way more structure and rigor than I ever did in their prescriptive guidance titled: &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/resilience-analysis-framework/introduction.html" rel="noopener noreferrer"&gt;Resilience analysis framework&lt;/a&gt;. If you're working to improve system reliability at any level of scale, I would &lt;em&gt;highly&lt;/em&gt; recommend reading about this framework and understanding how it can be best applied in your own organization. &lt;/p&gt;

&lt;p&gt;At a high-level, the Resilience analysis framework focuses on the following characteristics of highly available systems: &lt;strong&gt;redundancy&lt;/strong&gt;, &lt;strong&gt;sufficient capacity&lt;/strong&gt;, &lt;strong&gt;timely and correct output&lt;/strong&gt;, and &lt;strong&gt;fault isolation&lt;/strong&gt;. The expectation is that a system that exhibits these characteristics would be considered "resilient". For each of these characteristics, the framework prescribes failure categories which help connect the dots between your infrastructure as it exists and a system that actually embodies these characteristics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdei7gyrkgq0xizspeost.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdei7gyrkgq0xizspeost.png" alt="Relationships of the desired resilience properties reprinted from Resilience analysis framework - Overview of the framework" width="737" height="627"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Relationships of the desired resilience properties reprinted from &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/resilience-analysis-framework/overview.html" rel="noopener noreferrer"&gt;Resilience analysis framework - Overview of the framework&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When it comes to applying the framework, the first step that I always recommend to teams is to start with an up-to-date architecture context diagram that shows all of the pertinent components of your system. I believe the framework delivers the most value when it &lt;em&gt;prompts conversations amongst teams&lt;/em&gt; (versus a solo effort), and having a diagram that everyone can use as a frame of reference is key to keeping the dialogue focused and productive.&lt;/p&gt;

&lt;p&gt;From this point, the framework recommends using user stories to evaluate individual business processes, with a focus on the value to the end user. For many applications I think this is a great approach, since ultimately your users care whether or not they can use your system for useful work; they're usually not concerned with the specifics of your API server. One of the understated benefits of the framework is that it easily scales up and down to systems of all different sizes and scopes. For larger (and particularly older) systems where ownership doesn't necessarily lie with the folks who built all of that functionality, attempting to focus on user stories might be a challenge. In these cases, you can still get a ton of value out of evaluating at the component and interaction level instead.&lt;/p&gt;

&lt;p&gt;This blog post isn't intended to get into all of the detail of the framework; you should &lt;em&gt;really&lt;/em&gt; go read what &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/resilience-analysis-framework/introduction.html" rel="noopener noreferrer"&gt;the experts have to say here&lt;/a&gt;. But it does bring us to another really important discussion point: how far do I go with my mitigations? The framework has an entire page titled &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/resilience-analysis-framework/tradeoffs.html" rel="noopener noreferrer"&gt;"Understanding trade-offs and risks"&lt;/a&gt; which discusses this, but I have some of my own thoughts to add on top of that. In my &lt;a href="https://dev.to/rdbatch/resilience-in-the-cloud-fault-isolation-boundaries-4nid"&gt;discussions on this topic in the past&lt;/a&gt;, I've talked about the notion of resilience being a spectrum. As you move to higher availability (and thus, higher levels of resilience), the cost in both cloud spend and operations overhead increases dramatically. In fact, AWS even includes a spectrum in their &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html" rel="noopener noreferrer"&gt;Disaster Recovery of Workloads on AWS: Recovery in the Cloud&lt;/a&gt; whitepaper.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflo110ht093l98bzvjkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflo110ht093l98bzvjkf.png" alt="Disaster recovery strategies figure 6 reprinted from Disaster Recovery of Workloads on AWS: Recovery in the Cloud" width="800" height="360"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Disaster recovery strategies figure 6 reprinted from &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html" rel="noopener noreferrer"&gt;Disaster Recovery of Workloads on AWS: Recovery in the Cloud&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Taking the point I made earlier about cost, you should expect your costs to increase as you move to the right on this spectrum. This means you should consider the criticality of your system when deciding on the appropriate level of mitigations to apply. Many companies use a tiering system to classify the criticality of their applications. If yours does, you will probably be best served by mapping those tiers to corresponding resilience configurations.&lt;/p&gt;
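As one hypothetical sketch of that mapping, here's how criticality tiers might line up against the four DR strategies from the whitepaper's spectrum above. The tier names and the particular assignments are assumptions for illustration, not a prescription; your own tiers and budgets will dictate the real mapping.

```python
# The four DR strategies from the AWS whitepaper's spectrum, ordered from
# cheapest/simplest to most expensive/most complex.
DR_STRATEGIES = [
    "backup and restore",
    "pilot light",
    "warm standby",
    "multi-site active/active",
]

# Hypothetical mapping from an organization's criticality tiers to a strategy.
TIER_TO_STRATEGY = {
    "tier-3 (internal tools)": "backup and restore",
    "tier-2 (important)": "pilot light",
    "tier-1 (business critical)": "warm standby",
    "tier-0 (revenue critical)": "multi-site active/active",
}

def cost_rank(tier):
    # Higher rank = further right on the spectrum = more cost and complexity.
    return DR_STRATEGIES.index(TIER_TO_STRATEGY[tier])
```

The point of encoding it this way is that the rank makes the cost/complexity trade-off explicit: moving a system up a tier is a deliberate decision to spend more, not a default.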

&lt;p&gt;For &lt;em&gt;most&lt;/em&gt; applications, a multi-region active/active configuration is going to be the most expensive &lt;em&gt;and&lt;/em&gt; the most complex to implement. Especially for systems that aren't "cloud native", it's rare to find a system that supports running in a multi-active configuration really well. Newer AWS capabilities such as &lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;DynamoDB Global Tables&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/aurora-dsql/latest/userguide/what-is-aurora-dsql.html" rel="noopener noreferrer"&gt;Aurora DSQL&lt;/a&gt; are making this easier for applications that can build for it from the beginning, however. Even if you aren't using these technologies, you can still make your systems more resilient by applying mitigations at the appropriate level of investment for the value a system provides to your customers or business.&lt;/p&gt;

&lt;p&gt;The final point that I'll make is that you should always test any recovery plans and mitigation strategies that you implement on a regular basis. Just like you would test a piece of software to ensure it behaves the way you intend, you should ensure that your resilience strategy actually works for you. The last thing you want in the midst of an outage is to discover that your recovery plan relies on some IAM permission or security group configuration that you didn't account for, and now you're fighting two fires instead of one. Test as close to production as you possibly can, and test often. &lt;a href="https://docs.aws.amazon.com/fis/latest/userguide/what-is.html" rel="noopener noreferrer"&gt;AWS Fault Injection Service&lt;/a&gt; can help with this step by making it easy to set up fault experiments and see how both your infrastructure and your runbooks respond to a real failure in a safe manner.&lt;/p&gt;

&lt;p&gt;Are you really confident that you've done everything you can to ensure your on-call pager never rings, or are you worried that the next big incident could be right around the corner? Hopefully the information in this series has helped you answer these questions and make things better for both your users and your operations team. How do you and your teams handle this? Let me know in the comments!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>resilience</category>
    </item>
    <item>
      <title>My (non-AI) AWS re:Invent 24 picks</title>
      <dc:creator>Ryan Batchelder</dc:creator>
      <pubDate>Tue, 10 Dec 2024 13:17:15 +0000</pubDate>
      <link>https://forem.com/rdbatch/my-non-ai-aws-reinvent-24-picks-5bdh</link>
      <guid>https://forem.com/rdbatch/my-non-ai-aws-reinvent-24-picks-5bdh</guid>
      <description>&lt;p&gt;re:Invent 24 has come to a close and as predicted, a huge portion of the event was devoted to generative AI. While there were some pretty interesting announcements surrounding Bedrock and Q, hidden amongst the RAGs were a few of my favorite releases that had nothing to do with AI.&lt;/p&gt;

&lt;p&gt;Starting with the data-oriented announcements from Matt Garman's keynote, what better service to talk about than the biggest home for your data: S3. Everyone's favorite blob store got some really interesting data science-oriented features this year that reinforced something that I've come to realize over the past few years: data lakes live in S3. First up is &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/" rel="noopener noreferrer"&gt;&lt;strong&gt;S3 Table Buckets&lt;/strong&gt;&lt;/a&gt;, a new bucket type that brings a number of improvements for buckets storing &lt;a href="https://iceberg.apache.org" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; files. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F276phgqqjde4j9z76tf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F276phgqqjde4j9z76tf6.png" alt="AWS S3 Table Buckets announcement by Matt Garman" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This enables higher performance for queries against Parquet files stored in S3 automatically just by using a different bucket type. In addition, table buckets automatically take care of compaction and cleanup of unreferenced files to help keep your data lake optimized with continued scale.&lt;/p&gt;

&lt;p&gt;S3 Metadata solves a problem that many of us have probably implemented custom solutions for in the past, likely through some separate "sidecar" datastore that works in conjunction with your S3 bucket blobs. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti3dd7nohdg6adj8a55i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti3dd7nohdg6adj8a55i.png" alt="AWS S3 Metadata announcement by Matt Garman" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This feature builds on the S3 Table Buckets feature to make blob metadata easily queryable and S3 will keep that metadata up to date for you as bucket contents change. While this might seem like one of the less "flashy" announcements of this year's re:Invent, I think this is one of the most broadly applicable for a large swath of S3 users. This is also a classic example of AWS identifying a piece of undifferentiated workload that their customers are building and finding ways to remove that burden. We like this, AWS. Do more of this, please!&lt;/p&gt;

&lt;p&gt;Moving away from S3 for the last of the keynote announcements: Aurora DSQL. This is the closest we have as of now to a SQL version of DynamoDB, and that alone is a pretty exciting prospect. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrpurs8poz5dbyludgn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrpurs8poz5dbyludgn3.png" alt="Aurora DSQL announcement by Matt Garman" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Aurora DSQL leverages Amazon's Time Sync service to eliminate clock drift across regions, enabling strong consistency even on multi-region tables. As fancy as this sounds, there are some necessary tradeoffs that come with building a globally available datastore. Note that the slide says "PostgreSQL-compatible" rather than implying that Aurora DSQL perfectly implements PostgreSQL. There are a number of PostgreSQL features that &lt;a href="https://docs.aws.amazon.com/aurora-dsql/latest/userguide/working-with-postgresql-compatibility-unsupported-features.html" rel="noopener noreferrer"&gt;remain unsupported&lt;/a&gt;, and only a subset of the &lt;a href="https://docs.aws.amazon.com/aurora-dsql/latest/userguide/working-with-postgresql-compatibility-supported-sql-features.html" rel="noopener noreferrer"&gt;SQL dialect&lt;/a&gt; is supported. I suspect both of these lists will change slightly as the service moves toward general availability, though some omissions are almost certainly required as a byproduct of enabling this level of consistent multi-region operation. As someone who's been focused quite a bit on resilience lately, I'll be very interested to see how this service evolves. For workloads that can operate within the constraints, Aurora DSQL could be a game changer for systems looking for multi-region active/active operations.&lt;/p&gt;

&lt;p&gt;If you're looking for some deep-dive content on Aurora DSQL, Marc Brooker has some &lt;a href="https://brooker.co.za/blog/2024/12/03/aurora-dsql.html" rel="noopener noreferrer"&gt;fantastic&lt;/a&gt; &lt;a href="https://brooker.co.za/blog/2024/12/04/inside-dsql.html" rel="noopener noreferrer"&gt;content&lt;/a&gt; &lt;a href="https://brooker.co.za/blog/2024/12/05/inside-dsql-writes.html" rel="noopener noreferrer"&gt;on his&lt;/a&gt; &lt;a href="https://brooker.co.za/blog/2024/12/06/inside-dsql-cap.html" rel="noopener noreferrer"&gt;blog&lt;/a&gt;. He also has two great breakout sessions from re:Invent on the topic: &lt;a href="https://www.youtube.com/watch?v=9wx5qNUJdCE" rel="noopener noreferrer"&gt;DAT424&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=huGmR_mi5dQ" rel="noopener noreferrer"&gt;DAT427&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That wraps up the major announcements that came out of the keynotes this year. Werner's keynote on Thursday morning didn't contain a single announcement, instead inspiring us builders to embrace "simplexity". To me this seems like a by-product of cloud becoming "boring", and this isn't a bad thing! As the technology continues to mature, I'd expect to see fewer big announcements and instead a continued refinement of the services that we're all building on today. It's clear that Generative AI is the current industry trend, and I applaud AWS for taking a full-stack approach to supporting that space from the custom-built training hardware all the way up to providing tools such as Q.&lt;/p&gt;

&lt;p&gt;As is tradition for re:Invent, there were also a number of announcements that came in the weeks leading up to the event in Vegas (known as "pre:Invent"). &lt;/p&gt;

&lt;p&gt;One that I've been really excited for is the ability to now &lt;a href="https://aws.amazon.com/blogs/compute/implementing-custom-domain-names-for-private-endpoints-with-amazon-api-gateway/" rel="noopener noreferrer"&gt;set up custom domain names for private API Gateways&lt;/a&gt;. This is a feature that has been sorely needed to make API Gateway easier to use without exposing the gateway to the internet. Previously, private gateways could only use their auto-assigned URL, which had the format &lt;code&gt;https://{rest-api-id}-{vpce-id}.execute-api.{region}.amazonaws.com/{stage}&lt;/code&gt;, or one of the other variations that involved passing special headers as part of the request. I know that I personally have implemented a number of different workarounds in the past to get friendlier URLs for my endpoints, and it will be really nice to be able to do this natively now.&lt;/p&gt;
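&lt;p&gt;As a rough illustration of the difference, here's the legacy-style URL construction next to what a custom domain lets clients call instead. The IDs and domain below are made-up placeholders, not real resources:&lt;/p&gt;

```python
def private_api_url(rest_api_id: str, vpce_id: str, region: str, stage: str) -> str:
    """Build the auto-assigned URL for a private REST API reached
    through a VPC endpoint (the pre-custom-domain format)."""
    return f"https://{rest_api_id}-{vpce_id}.execute-api.{region}.amazonaws.com/{stage}"

# Hypothetical identifiers, for illustration only.
legacy = private_api_url("a1b2c3d4e5", "vpce-0abc123", "us-east-1", "prod")
print(legacy)

# With a custom domain name, clients can instead call something like:
custom = "https://api.internal.example.com/prod"
```

No more teaching every consumer what a VPC endpoint ID is just to call an internal API.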

&lt;p&gt;Serverless services got a host of nice features and improvements to continue rounding out their offerings. Aurora Serverless v2 can now &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-aurora-serverless-v2-scaling-zero-capacity/" rel="noopener noreferrer"&gt;scale all the way down to zero ACUs&lt;/a&gt;, which finally addresses one of the chief complaints about that service. Previously the minimum was 0.5 ACUs, which meant the baseline ACU cost was nearly $44/month, generally running counter to the serverless tenet of only paying for execution time.&lt;/p&gt;
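&lt;p&gt;To put numbers on that baseline: assuming an on-demand rate of about $0.12 per ACU-hour (a typical us-east-1 figure; pricing varies by region, so check yours), the old 0.5 ACU floor costs roughly $44/month even when the database is fully idle:&lt;/p&gt;

```python
ACU_PRICE_PER_HOUR = 0.12  # assumed us-east-1 on-demand rate; varies by region
HOURS_PER_MONTH = 730      # average hours in a month

def idle_monthly_cost(min_acus: float) -> float:
    """Baseline cost of holding a cluster at its minimum capacity all month."""
    return min_acus * ACU_PRICE_PER_HOUR * HOURS_PER_MONTH

print(round(idle_monthly_cost(0.5), 2))  # old 0.5 ACU floor: about 43.80/month
print(idle_monthly_cost(0.0))            # new zero-capacity floor: nothing while idle
```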

&lt;p&gt;Speaking of price savings in serverless-land, &lt;a href="https://aws.amazon.com/blogs/database/new-amazon-dynamodb-lowers-pricing-for-on-demand-throughput-and-global-tables/" rel="noopener noreferrer"&gt;DynamoDB on-demand got significant price cuts&lt;/a&gt; with no changes needed from users. AWS doesn't get into details as to &lt;em&gt;why&lt;/em&gt; this price cut has come about, but presumably they've found ways to significantly reduce the service's operating costs and are passing those savings along to us as customers. Benefitting from the optimizations that the awesome engineers over at AWS make is an understated perk of building on a public cloud, and this is once again the type of announcement that I think everyone will agree is a great thing.&lt;/p&gt;

&lt;p&gt;Finally, it wouldn't be a discussion about serverless without including everyone's favorite compute service: Lambda! The improvements to Lambda are smaller but great to see nonetheless. For Python and .NET users, &lt;a href="https://aws.amazon.com/blogs/aws/aws-lambda-snapstart-for-python-and-net-functions-is-now-generally-available/" rel="noopener noreferrer"&gt;SnapStart is now available&lt;/a&gt;. &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html" rel="noopener noreferrer"&gt;SnapStart&lt;/a&gt; was announced last year for Java as a way to reduce cold start times by pre-warming and caching a function just after initialization so that invocations could load the cached state rather than re-running initialization on every cold start. Unlike the Java implementation, Python and .NET come with some additional cost, so you'll need to do some evaluation to determine if it's worth it for your functions. That said, I can see this being useful for certain applications, particularly in the Python world where certain data science libraries can get quite large. The last main Lambda announcement was the release of the &lt;a href="https://aws.amazon.com/blogs/compute/python-3-13-runtime-now-available-in-aws-lambda/" rel="noopener noreferrer"&gt;Python 3.13&lt;/a&gt; and &lt;a href="https://aws.amazon.com/blogs/compute/node-js-22-runtime-now-available-in-aws-lambda/" rel="noopener noreferrer"&gt;Node 22&lt;/a&gt; runtimes. New runtimes come all the time as languages get updated, so this isn't groundbreaking, but it's still nice to see for users of both languages who haven't otherwise moved on to providing their own runtimes.&lt;/p&gt;

&lt;p&gt;While this was certainly not an exhaustive list (that post would be far too long!), these are some of the announcements that I'm personally really excited about from this year's re:Invent cycle. If you're looking for more, I recommend checking out &lt;a href="https://aws-news.com" rel="noopener noreferrer"&gt;https://aws-news.com&lt;/a&gt; for all the good stuff that's been announced over the last few weeks. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsreinvent</category>
      <category>lambda</category>
      <category>s3</category>
    </item>
    <item>
      <title>Resilience in the Cloud - Fault Isolation Boundaries</title>
      <dc:creator>Ryan Batchelder</dc:creator>
      <pubDate>Tue, 08 Oct 2024 00:12:03 +0000</pubDate>
      <link>https://forem.com/rdbatch/resilience-in-the-cloud-fault-isolation-boundaries-4nid</link>
      <guid>https://forem.com/rdbatch/resilience-in-the-cloud-fault-isolation-boundaries-4nid</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Unlike the previous post in this series, this one will get deeper into AWS-specific nuances, and not all of the points here will be directly translatable to other infrastructure providers. That said, the concept of fault isolation boundaries is broadly applicable across many types of software and infrastructure designs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After my &lt;a href="https://dev.to/rdbatch02/resilience-in-the-cloud-availability-vs-recoverability-4jkp"&gt;last post&lt;/a&gt;, hopefully you're at least devoting some thought to your application's recoverability. While I tried to stress the importance of being prepared to recover your application in the event of a failure, availability is still a key trait for many applications.&lt;/p&gt;

&lt;p&gt;For many folks, there's likely a thought that high availability = run more than one instance of my application. On the surface, that's true! The more instances available to handle your workload, the more things have to fail before your workload cannot be handled. Perfect, so this means that optimizing availability is an exercise in adding as many instances as possible without going bankrupt, right?&lt;/p&gt;

&lt;p&gt;Well, your infrastructure provider would certainly be thrilled with this approach...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6og3qeb5jv8qvomu1t9l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6og3qeb5jv8qvomu1t9l.jpg" alt="Popular cloud providers cast as Scrooge McDuck and his grand-nephews" width="800" height="400"&gt;&lt;/a&gt;&lt;em&gt;(This is what I'm guessing cloud providers look like when you run lots of redundant instances for resilience without corresponding traffic)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What if we could instead be smarter about adding redundancy while stretching the utility of your infrastructure dollars? Fault isolation boundaries give us a means to ensure that we're building workloads that are "redundant enough" to mitigate entire classes of failures without being cost-inefficient.&lt;/p&gt;

&lt;p&gt;But first, what are fault isolation boundaries?&lt;/p&gt;

&lt;h2&gt;
  
  
  What's in a Name?
&lt;/h2&gt;

&lt;p&gt;At a high level, fault isolation boundaries are logical divisions within your infrastructure that are intended to contain the impact of component failure. If you've ever had a conversation about "limiting the blast radius" of a potential failure in your application, you've likely implemented a fault isolation boundary to do it. AWS provides us with a number of infrastructure-level fault isolation boundaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Availability Zones&lt;/li&gt;
&lt;li&gt;Regions&lt;/li&gt;
&lt;li&gt;Accounts&lt;/li&gt;
&lt;li&gt;Partitions&lt;/li&gt;
&lt;li&gt;Local Zones&lt;/li&gt;
&lt;li&gt;Control Plane vs. Data Plane Separation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These boundaries serve multiple purposes. Regions and local zones can enable operations in certain jurisdictions which have legal requirements around where data is processed and stored. Accounts (and to an extent, partitions) are important security boundaries which ensure that access is scoped appropriately. But how do these boundaries fit into a conversation around resilience? &lt;/p&gt;

&lt;p&gt;From a resilience perspective, AWS &lt;a href="https://youtu.be/es9527rA_8I?si=jcm2P6h1OQZyfRvK&amp;amp;t=2361" rel="noopener noreferrer"&gt;uses them internally to roll out new capabilities in a safe manner&lt;/a&gt; so that any failed change not only can be rolled back quickly, but also impacts only a small number of users. When a new feature is being rolled out, availability zones will get the update one at a time (meeting certain success metrics at each step) until it has been rolled out to the entire region. More than just feature releases though, availability zones and regions in particular are designed with specific constraints to ensure they provide resilience value to us as the customer.&lt;/p&gt;
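&lt;p&gt;That wave-by-wave rollout can be sketched as a simple loop. The deploy, health check, and rollback callables here are toy stand-ins for real deployment tooling, not anything AWS-specific:&lt;/p&gt;

```python
def staged_rollout(azs, deploy, healthy, rollback):
    """Roll a change out one AZ at a time, gating each wave on a health
    check; on the first failure, roll back every AZ touched and stop."""
    completed = []
    for az in azs:
        deploy(az)
        if healthy(az):
            completed.append(az)
        else:
            for touched in completed + [az]:
                rollback(touched)
            return False, completed
    return True, completed

# Toy run: the change turns out to be bad in az-b, so az-a gets rolled
# back too and az-c is never touched.
state = {}
ok, waves = staged_rollout(
    ["az-a", "az-b", "az-c"],
    deploy=lambda az: state.__setitem__(az, "new"),
    healthy=lambda az: az != "az-b",
    rollback=lambda az: state.__setitem__(az, "old"),
)
print(ok, waves, state)
```

The point of the boundary is visible in the result: only users in the AZs that received the change could have been affected.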

&lt;h2&gt;
  
  
  Infrastructure Designed for Reliability
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vw31tr4w5izwuoi9mir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vw31tr4w5izwuoi9mir.png" alt="Map of the world showing where AWS regions are located" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS has &lt;a href="https://aws.amazon.com/about-aws/global-infrastructure/regions_az/" rel="noopener noreferrer"&gt;regions all over the globe&lt;/a&gt; and they're constantly working to add new ones. If you're new to AWS, it might be easy to view regions as simply a way to choose where your workloads run. Whether it's to provide lower latency by putting your workload physically closer to your users or ensuring that data stays where it is legally required to, regions dictate the physical location of your infrastructure. Regions are more than this though; they are a key resilience feature for AWS. They are deliberately designed to be hundreds of miles apart (within the same partition) such that a natural disaster impacting one region should not impact its neighbors. If one does experience a failure, each region is designed with independent control planes such that any outages should remain isolated (with some notable exceptions for &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/global-services.html" rel="noopener noreferrer"&gt;global services&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The next smaller boundary within regions is the &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html" rel="noopener noreferrer"&gt;availability zone&lt;/a&gt; (often abbreviated AZ). In contrast to regions, availability zones are separated by just tens of miles and interconnected by redundant fiber connections, so data replication between them remains fast. Despite being closer together, they are designed to avoid "shared fate" failure scenarios such as localized natural disasters or power disruptions. Many applications should be able to operate across multiple availability zones within the same region without inter-zone latency concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Planar Separation and Static Stability
&lt;/h2&gt;

&lt;p&gt;Now that we've covered physical infrastructure, let's talk about what it means to actually &lt;em&gt;run&lt;/em&gt; something on it. AWS services are implemented with a division between their control and data planes. The former is what allows us to create, update, or delete resources in our AWS accounts. Control planes provide the APIs that the Console, CLI, SDKs, and infrastructure as code interact with. Conversely, data planes provide the ongoing functionality of a given resource. Think of the control plane as the component that facilitates the creation of an EC2 instance, while the data plane is what keeps the instance running and able to serve traffic.&lt;/p&gt;

&lt;p&gt;This planar separation is important for a number of reasons, one of which being that it enables a concept called &lt;strong&gt;static stability&lt;/strong&gt;. This is a pattern which aims to provide higher levels of reliability by designing applications that can recover from failure without the need to make any changes. In the case of an application that runs across multiple availability zones, static stability means having infrastructure provisioned and ready to handle traffic in each availability zone, with no manual intervention needed to fail over between them. In context this means that, since data planes are simpler than control planes and thus less likely to fail, designing an application with static stability will help guard you against an outage when a control plane becomes unavailable.&lt;/p&gt;
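&lt;p&gt;A deliberately simplified sketch of the idea: capacity is pre-provisioned in every AZ, so absorbing an AZ failure requires no scaling calls at all (i.e., no dependency on a control plane that might itself be impaired):&lt;/p&gt;

```python
class StaticallyStableService:
    """Toy model of static stability: enough capacity is pre-provisioned
    in every AZ that losing one requires no control-plane action."""

    def __init__(self, azs, capacity_per_az):
        self.capacity = {az: capacity_per_az for az in azs}

    def fail_az(self, az):
        # The data plane in this AZ is gone; note we do NOT call any
        # scaling API in response. Survival is purely pre-provisioned.
        self.capacity[az] = 0

    def available_capacity(self):
        return sum(self.capacity.values())

# Provisioned for peak load of 100 units even with one AZ down.
svc = StaticallyStableService(["a", "b", "c"], capacity_per_az=50)
svc.fail_az("b")
print(svc.available_capacity())  # 100 units still serving, no intervention
```

The over-provisioning (150 units of capacity for a 100-unit peak) is the price of not needing the control plane during a failure.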

&lt;h2&gt;
  
  
  What does it all mean?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4g50b1pd7vj0sex89to.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4g50b1pd7vj0sex89to.jpg" alt="Meme of Dory and Marlin from Finding Nemo saying " width="600" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've made it this far, you might be thinking "this all sounds great, but what should I do with all this?" Since we set out trying to work out how to make our systems more resilient while also remaining cost effective, we can look at fault isolation boundaries as a way to quantify risk.&lt;/p&gt;

&lt;p&gt;Statistically speaking, failures of (or within) individual availability zones are more likely than entire regions. Fittingly, many services in AWS make it really easy to operate across multiple availability zones at once. Because of this, implementing statically stable workloads across multiple AZs in a single region is a good first step. Of course the additional infrastructure isn't free, but the added operational overhead is minimal. For many applications this might be enough, but the most critical workloads might demand multi-region operations. What's important here is that you're intentional with &lt;em&gt;where&lt;/em&gt; you're running your workloads. Running 3 redundant instances split across 3 AZs is far more reliable than running all 3 in the same AZ.&lt;/p&gt;
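&lt;p&gt;A quick back-of-the-envelope calculation shows why placement matters. If we (simplistically) assume AZ failures are independent with some probability p, a workload spread across n AZs only goes fully down when all n fail at once:&lt;/p&gt;

```python
def outage_probability(p_az: float, n_azs: int) -> float:
    """Probability the workload is fully down, assuming it survives as
    long as at least one of its AZs is up and that AZ failures are
    independent (a simplification, but fine for building intuition)."""
    return p_az ** n_azs

p = 0.001  # assumed chance an AZ is down at any given moment (made up)
print(outage_probability(p, 1))  # all three instances in one AZ
print(outage_probability(p, 3))  # same three instances, one per AZ
```

The specific numbers are invented; the takeaway is the exponent. The same three instances are a million times less likely to all be down together when each sits in a different AZ.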

&lt;p&gt;If using more AZs is good, then more regions is surely better, right? Regional service failures tend to be more rare than AZ-level outages, though they are more impactful. Unlike AZ failovers, most services do not provide an easy way to operate out of multiple regions at the same time so there are usually more moving parts to account for in a regional failover. Because of this, it's best to account for multi-region operations early in the design process so you can choose services and patterns that are conducive to this level of resilience. It's not impossible to retrofit later on, but it can require a lot more refactoring than just adding multi-AZ resilience. The operational complexity combined with the additional infrastructure needed makes multi-region resilience significantly more expensive than multi-AZ.&lt;/p&gt;

&lt;p&gt;Think about AZs like having fire doors in a building. In the event of a fire, these doors help to prevent the fire from spreading to other parts of the building. Relative to the cost of the entire building, fire doors are a cheap way to help isolate the potential damage that a fire can cause (and help to keep occupants safer). Running in multiple regions is like constructing an entirely separate building to protect against a fire in the first. If one building burns down, you still have a second usable building, but the cost to do so is very high.&lt;/p&gt;

&lt;p&gt;If you want to read more on this topic, I highly recommend the entire &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/abstract-and-introduction.html" rel="noopener noreferrer"&gt;AWS Whitepaper on Fault Isolation Boundaries&lt;/a&gt;. Even if you've been working in AWS for a long time, it is a great resource to dive deeper into how various AWS services are built for resilience (and some hidden gotchas that you may not be thinking about).&lt;/p&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/rdbatch02/resilience-in-the-cloud-availability-vs-recoverability-4jkp"&gt;previous post&lt;/a&gt; I covered the difference between availability and recoverability. Fault isolation boundaries give us a way to quantify how we improve our availability. For the next and final post in this series, I'll cover the resilience spectrum and how we can effectively place our workloads on that spectrum.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>resilience</category>
    </item>
    <item>
      <title>Resilience in the Cloud - Availability vs Recoverability</title>
      <dc:creator>Ryan Batchelder</dc:creator>
      <pubDate>Tue, 24 Sep 2024 01:12:32 +0000</pubDate>
      <link>https://forem.com/rdbatch/resilience-in-the-cloud-availability-vs-recoverability-4jkp</link>
      <guid>https://forem.com/rdbatch/resilience-in-the-cloud-availability-vs-recoverability-4jkp</guid>
      <description>&lt;p&gt;&lt;em&gt;Note: the following post is going to focus on AWS in any discussion of examples, but the overall concepts here are meant to be vendor-agnostic and can be applied to any cloud environment.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest points of confusion I've run into when working with teams on their resilience strategies in the past is the difference between (high) availability and recoverability. I don't blame them; lots of the buzz around cloud apps focuses on building things that are highly available and often eschews the notion of recoverability entirely. I mean, sure, if your service never goes down, then there's never anything to recover, right?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflhph2ay0np1svq60yl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflhph2ay0np1svq60yl3.png" alt="tapping head meme which reads " width="600" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If that were true, I'd stop writing here and you all could close this tab and return to your regularly scheduled cat videos. You could certainly do that anyway, but when your video is over I'd recommend maybe coming back here and making sure you know how to keep those cat videos rolling even on the rainiest of days.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shared Responsibility Model for Resiliency
&lt;/h2&gt;

&lt;p&gt;Before we get into the differences here, I think it's important to start with the &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/shared-responsibility-model-for-resiliency.html" rel="noopener noreferrer"&gt;Shared Responsibility Model for Resiliency&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8vpfwlf63o4dlmmqa7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8vpfwlf63o4dlmmqa7x.png" alt="The AWS Shared Responsibility Model for Resiliency" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've worked in AWS for a while, you're probably aware of the &lt;a href="https://aws.amazon.com/compliance/shared-responsibility-model/" rel="noopener noreferrer"&gt;Shared Responsibility Model&lt;/a&gt; in the context of security. The resiliency model is thematically similar: AWS is responsible for the resilience "of" the cloud, while we as customers are responsible for the resilience of our workloads "in" the cloud. I like to start with this because it helps to set the stage for debunking some of the myths around cloud resilience that can leave you in a tough position if you're not prepared for a disaster. The biggest culprits here are the durability and availability guarantees provided at the service level.&lt;/p&gt;

&lt;p&gt;Let me be clear that I'm not intending to imply that AWS (or any other cloud vendor) is attempting to mislead customers. The opposite is actually true: the fact that S3 has a 99.999999999% durability guarantee is an incredible feat of engineering that should be celebrated! But when using S3, keep in mind that the durability guarantee serves AWS' side of the shared responsibility model - it's extremely unlikely that data stored in S3 will be lost &lt;strong&gt;at the fault of AWS&lt;/strong&gt;. So our data is all safe in S3 and we don't need to worry about it, right...? &lt;/p&gt;

&lt;p&gt;Not quite. If you read that sentence over again, you'll notice that I said "...at the fault of AWS". Those impressive durability numbers will do nothing to protect you against a catastrophic bug in your code, ransomware attacks, or simple human error of your users or someone on your team. These types of failures fall on &lt;strong&gt;your&lt;/strong&gt; side of the shared responsibility model, and directly impact the &lt;strong&gt;recoverability&lt;/strong&gt; of data in your application.&lt;/p&gt;

&lt;p&gt;There's a similar comparison to be made on the &lt;strong&gt;availability&lt;/strong&gt; side of the equation with the various service level agreements (SLAs) that apply. For example, EC2 has an instance-level SLA of at least 99.95% uptime, but that SLA does not apply to scenarios in which your instance fails due to something on your side of the model. An example of this could be your workload exhausting the available memory on an instance and causing a crash, or a &lt;a href="https://en.wikipedia.org/wiki/2024_CrowdStrike_incident" rel="noopener noreferrer"&gt;security vendor pushing an improperly tested patch to your machines&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability
&lt;/h2&gt;

&lt;p&gt;"High Availability" is one of the big buzzword concepts of modern cloud architectures. This is for good reason - the commoditization of compute resources has made it really easy to simply run more instances of your workloads to ensure that there's always a node available when needed. The cloud made it so that suddenly we could solve all sorts of performance and availability problems by simply throwing more computers at them. Sure you can make an argument that this has made it easier to write software poorly, but there's no denying that the ease of scale provided by cloud computing is a net-positive for service availability.&lt;/p&gt;

&lt;p&gt;Beyond sheer scale, cloud computing allows us to put our workloads in geographically disparate places which has its own share of benefits. By changing just a few lines of code, AWS (and other cloud providers) allow us to spread our workloads across different physical locations, insulating catastrophic failures in one location from impacting our workloads in their entirety.&lt;/p&gt;

&lt;p&gt;I could (and will) write an entire post on the different types of fault-isolation boundaries provided in AWS. For the purposes of this discussion however, it's important to consider that you generally want to reduce single points of failure in your applications whenever it's possible and feasible. This means running more than one instance of any compute resources, and at least leveraging database configurations that maintain availability across multiple zones. For data, this usually comes with some amount of replication as well to ensure that there are multiple copies of your data being kept synchronized and ready for use should a primary dataset become unavailable. &lt;/p&gt;

&lt;p&gt;The extent to which you configure your resources to be highly available has a direct correlation to your bill, so there's a constant balancing act to be performed to ensure you're not overspending but also not underdelivering on availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recoverability
&lt;/h2&gt;

&lt;p&gt;If high availability is one of the hot buzzwords of cloud computing, recoverability is the complete opposite in that it's often ignored (or at least under-appreciated)! Despite this, I'm of the opinion that recoverability is actually &lt;em&gt;more important&lt;/em&gt; for most workloads.&lt;/p&gt;

&lt;p&gt;At the most fundamental level, recoverability is your ability to restore your workload and its data from any sort of incident which might cause it to be inconsistent with its desired state. For your running application, this can be a crash or performance degradation due to resource exhaustion, operator error, or a bad code change. For data, this can be data loss due to inadvertent deletion, ransomware (or some other security incident), or even accidental modification. In either scenario, it's important to keep in mind that even in the seemingly infinite realm of the cloud, there is still physical infrastructure behind the scenes that's &lt;a href="https://www.theregister.com/2021/03/10/ovh_strasbourg_fire/" rel="noopener noreferrer"&gt;susceptible to catastrophic failure&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Put simply: when something bad happens, recoverability represents your ability to make things good again.&lt;/p&gt;

&lt;p&gt;In a world where businesses are placing huge values on their data (and in many cases, data can be quite literally everything to a business), it doesn't matter how available a system is if the data contained within it can be lost at a moment's notice. From a data perspective, recoverability usually comes in the form of backups. It might be tempting to think that your data backup needs are covered by synchronized replicas (&lt;em&gt;surely you've just finished reading the section on availability, right?&lt;/em&gt;). Yes, synchronized replicas &lt;em&gt;do&lt;/em&gt; increase availability and durability (which serves the AWS side of the shared responsibility model!), but because they're synchronized, any failure condition of your data is immediately replicated to all of your copies. That table you just accidentally dropped in production was dropped from every replica within milliseconds, leaving you with just as much of a mess to clean up. To truly recover, you must have some sort of backup of that data that you can restore. &lt;/p&gt;
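&lt;p&gt;A toy model makes the failure mode concrete: synchronous replication faithfully copies your mistakes to every replica, and only a point-in-time backup lets you undo them:&lt;/p&gt;

```python
import copy

class Replica:
    def __init__(self):
        self.tables = {}

class Primary:
    """Toy datastore whose writes are mirrored to replicas instantly."""
    def __init__(self, replicas):
        self.tables, self.replicas = {}, replicas

    def put(self, table, rows):
        self.tables[table] = rows
        for r in self.replicas:
            r.tables[table] = rows   # synchronous replication

    def drop(self, table):
        del self.tables[table]
        for r in self.replicas:
            del r.tables[table]      # the mistake replicates too!

replica = Replica()
db = Primary([replica])
db.put("users", ["alice", "bob"])
backup = copy.deepcopy(db.tables)    # point-in-time backup, kept separate

db.drop("users")                     # oops, wrong environment
print("users" in replica.tables)     # the replica lost it within "milliseconds"
db.tables = copy.deepcopy(backup)    # only the backup can bring it back
```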

&lt;p&gt;If there's one thing you learn from this entire post, let it be this: &lt;a href="https://www.raidisnotabackup.com" rel="noopener noreferrer"&gt;like RAID&lt;/a&gt;, &lt;strong&gt;replicas are not backups&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Recovering compute is a different conversation entirely. Thanks to the &lt;a href="https://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/" rel="noopener noreferrer"&gt;"treat your servers like cattle, not pets"&lt;/a&gt; analogy and the prevalence of &lt;a href="https://12factor.net" rel="noopener noreferrer"&gt;twelve-factor methodologies&lt;/a&gt;, modern cloud applications are typically built such that compute is fungible. That's to say: if your server somehow fails, you should be able to recover it by simply deploying another one. If your application is built in such a way that losing your compute is anything more than a minor annoyance, you may want to consider why that's the case and seek ways to improve your architecture. The best-case scenario (besides no failure) is that automation has you covered and compute is kept healthy without any intervention; this overlaps right into the high availability camp. If you cannot get to that point for some reason, then you want to at least make sure any manual interventions are as quick, easy, and repeatable as possible. Automate and script as much as possible to minimize any potential for human error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;By this point I hope I've been able to make it clear that availability and recoverability are two separate but intertwined concepts. Well architected cloud applications need to consider both, especially when your application's data is business critical. Remember that it's easy to get lulled into a false sense of security by high durability numbers and an architecture that scatters replicas of your data across the globe. &lt;/p&gt;

&lt;p&gt;Always ask yourself: what would happen if I woke up and this data was gone tomorrow? If the answer is anything besides "nothing", back it up (and then &lt;a href="https://www.backblaze.com/blog/the-3-2-1-backup-strategy/" rel="noopener noreferrer"&gt;maybe copy that backup somewhere else&lt;/a&gt;, too). Of course you should also be doing &lt;a href="https://aws.amazon.com/architecture/well-architected" rel="noopener noreferrer"&gt;Well Architected&lt;/a&gt; evaluations periodically, which includes the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html" rel="noopener noreferrer"&gt;Reliability Pillar&lt;/a&gt; that covers this and more.&lt;/p&gt;

&lt;p&gt;Overall, high availability is a desirable trait - your systems should be available whenever a user needs them. Just keep in mind that even highly available systems can and will fail at some point, at which time the ease and efficacy of your systems' recoverability are far more important than any availability features they have.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>backup</category>
      <category>cloud</category>
      <category>resilience</category>
    </item>
  </channel>
</rss>
