<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: will peixoto</title>
    <description>The latest articles on Forem by will peixoto (@_willpeixoto).</description>
    <link>https://forem.com/_willpeixoto</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F500557%2F90d05836-31da-4959-86f5-ae678b03b74d.jpeg</url>
      <title>Forem: will peixoto</title>
      <link>https://forem.com/_willpeixoto</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/_willpeixoto"/>
    <language>en</language>
    <item>
      <title>The Price of High Availability: Resilience in Architecture: Decisions Are Strategic, Not Just Technical</title>
      <dc:creator>will peixoto</dc:creator>
      <pubDate>Thu, 23 Oct 2025 16:26:18 +0000</pubDate>
      <link>https://forem.com/aws-builders/the-price-of-high-availability-resilience-in-architecture-decisions-are-strategic-not-just-4hhf</link>
      <guid>https://forem.com/aws-builders/the-price-of-high-availability-resilience-in-architecture-decisions-are-strategic-not-just-4hhf</guid>
      <description>&lt;h2&gt;
  
  
  High Availability Has a Price — and Resilience Begins With Decisions, Not Stacks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://willpeixoto.hashnode.dev/resiliencia-custo-ha-multi-regiao-ou-on-premise" rel="noopener noreferrer"&gt;🇧🇷 Read in Portuguese&lt;/a&gt; →&lt;/p&gt;

&lt;p&gt;After a major cloud outage — like the ones we’ve seen at AWS — the same questions always resurface in developer forums and company meetings:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Should we move to a multi-region setup immediately?”&lt;br&gt;&lt;br&gt;
Or worse —&lt;br&gt;&lt;br&gt;
“Should we go back to on-premise?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The correct answer is rarely technical.&lt;br&gt;&lt;br&gt;
It is &lt;strong&gt;strategic&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As Werner Vogels (Amazon CTO) often says:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;“Everything fails, all the time.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The question is not &lt;em&gt;if&lt;/em&gt; you will fail — it’s &lt;em&gt;when&lt;/em&gt; you will fail, and how prepared you are when it happens.&lt;br&gt;&lt;br&gt;
Because it will happen, whether you are in the cloud, on-premise, or multi-cloud.&lt;/p&gt;

&lt;p&gt;What truly separates resilient teams is not the absence of failure, but their &lt;strong&gt;speed, clarity, and effectiveness&lt;/strong&gt; in response and recovery.&lt;/p&gt;

&lt;p&gt;True architectural maturity isn't about choosing "multi-region" or "on-premise" — it’s about understanding the inherent risk, documenting the decision transparently, and reacting with a plan.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Paradox of Visible Failure
&lt;/h2&gt;

&lt;p&gt;Every time a major outage occurs, technical and executive teams tend to split between two extreme reactions: rushing into multi-region or retreating to on-premise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crucial Reflection:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The cloud doesn’t fail more often than a traditional data center; it simply fails in a more visible, shared, and, ironically, democratic way.&lt;/p&gt;

&lt;p&gt;In AWS, problems escalate globally and become trending topics in minutes.&lt;br&gt;&lt;br&gt;
Do you honestly believe your company has superior capacity to manage the resilience of infrastructure at a global scale compared to a major cloud provider?&lt;/p&gt;

&lt;p&gt;The debate isn’t “Cloud vs. Data Center.”&lt;br&gt;&lt;br&gt;
It’s a strategic game of &lt;strong&gt;Conscious Resilience vs. The Comfort Zone.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Cost vs. Continuity: The 9s Game
&lt;/h2&gt;

&lt;p&gt;In the world of infrastructure, every additional "9" in your SLA (Service Level Agreement) cuts the allowed downtime by a factor of ten, and the cost of delivering it tends to rise steeply.&lt;/p&gt;

&lt;p&gt;To illustrate the real impact of each availability level, see the maximum allowed downtime per year:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SLA Level&lt;/th&gt;
&lt;th&gt;Downtime per Year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;99% (Two 9s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~3.65 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;99.9% (Three 9s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~8 hours 46 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;99.99% (Four 9s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~52 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;99.999% (Five 9s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
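
&lt;p&gt;The downtime budgets above follow from simple arithmetic. A minimal Python sketch, assuming a 365-day year (the name &lt;code&gt;downtime_hours_per_year&lt;/code&gt; is illustrative, not part of any library):&lt;/p&gt;

```python
# Sketch: convert an availability target into a yearly downtime budget.
# Assumes a 365-day year; the function name is illustrative only.
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability_percent):
    """Return the maximum downtime, in hours per year, for a given SLA."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction


for sla in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(sla)
    print(f"{sla}% allows about {hours:.2f} h ({hours * 60:.1f} min) per year")
```

&lt;p&gt;Running it prints the yearly budget for each of the four levels, which makes it easy to redo the math for a monthly or quarterly window by changing the constant.&lt;/p&gt;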

&lt;p&gt;Each jump in level typically requires not only duplicated or triplicated infrastructure but also greater operational sophistication.&lt;br&gt;&lt;br&gt;
The added cost must be justified by &lt;strong&gt;ROI (Return on Investment)&lt;/strong&gt; — never by technical pride.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📢 &lt;strong&gt;The Non-Negotiable Factor: Regulation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For regulated sectors, the SLA choice is often mandated by law or industry standards.&lt;br&gt;&lt;br&gt;
The debate then becomes how to meet the mandated SLA, because the regulatory fine typically far exceeds any infrastructure savings.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Architectural Patterns: Naming Resilience
&lt;/h2&gt;

&lt;p&gt;Resilience is codified through established patterns, each suited to a different failure scope and recovery target.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resilience Pattern&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Failure Type Protected Against&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-AZ&lt;/td&gt;
&lt;td&gt;HA baseline (99.9% to 99.99%)&lt;/td&gt;
&lt;td&gt;Hardware, data center, or single-AZ failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pilot Light&lt;/td&gt;
&lt;td&gt;RTO requirement of several hours&lt;/td&gt;
&lt;td&gt;Complete regional failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active-Active&lt;/td&gt;
&lt;td&gt;RTO/RPO near zero&lt;/td&gt;
&lt;td&gt;Regional failure (also enables global load balancing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Circuit Breaker&lt;/td&gt;
&lt;td&gt;Any microservice dependency&lt;/td&gt;
&lt;td&gt;Cascading failures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
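
&lt;p&gt;Of the patterns in the table, the Circuit Breaker is the one that lives in application code. The following is a minimal Python sketch, not a production implementation: the class names, the default threshold of 3 failures, and the 30-second reset timeout are all assumptions made for illustration.&lt;/p&gt;

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Minimal circuit breaker sketch (hypothetical, not a library API).

    After `failure_threshold` consecutive failures the circuit opens and
    calls fail fast. Once `reset_timeout` seconds have passed, one trial
    call is allowed; success closes the circuit, failure reopens it.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            elapsed = self.clock() - self.opened_at
            if self.reset_timeout > elapsed:
                raise CircuitOpenError("dependency is failing; skipping call")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or reopen) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    """Stand-in for a failing downstream dependency."""
    raise ConnectionError("downstream unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
for attempt in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError as exc:
        print(f"attempt {attempt}: fast fail ({exc})")
    except ConnectionError as exc:
        print(f"attempt {attempt}: real failure ({exc})")
```

&lt;p&gt;In the usage loop, the first two attempts reach the failing dependency; the third is rejected immediately by the breaker, which is what keeps one slow or broken service from tying up its callers.&lt;/p&gt;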




&lt;h2&gt;
  
  
  4. Conscious Decisions: The Virtue of ADRs
&lt;/h2&gt;

&lt;p&gt;This is where ADRs (Architecture Decision Records) become crucial.&lt;br&gt;&lt;br&gt;
They capture the decision, the reason, and the accepted risk.&lt;/p&gt;

&lt;p&gt;Here's an example of a well-formed ADR:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ADR-014: No Multi-Region Replication for MVP&lt;br&gt;
Context: Traffic &amp;lt; 10 req/s. Replication cost is estimated at &amp;gt; 3x current cost.&lt;br&gt;
Decision: Maintain single-region (Multi-AZ), with daily cross-region backup.&lt;br&gt;
Review Trigger: When average traffic reaches 100 req/s OR when downtime allowed by the 99.95% SLA starts causing business impact.&lt;br&gt;
Accepted Risk: Risk of total service downtime in case of a full regional outage (RTO estimated at 4 hours).&lt;/p&gt;
&lt;/blockquote&gt;
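
&lt;p&gt;A review trigger is only useful if someone actually checks it. One possible approach, sketched below in Python, is to mirror the ADR as a small data structure so the trigger can be evaluated against real metrics. The &lt;code&gt;DecisionRecord&lt;/code&gt; class and its field names are hypothetical; only the values come from the ADR above.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical in-code mirror of an ADR; field names are illustrative."""
    adr_id: str
    title: str
    decision: str
    accepted_risk: str
    review_at_req_per_s: float  # traffic level that triggers a review

    def needs_review(self, avg_req_per_s, sla_hurts_business=False):
        """True when either review trigger from the ADR has fired."""
        return sla_hurts_business or avg_req_per_s >= self.review_at_req_per_s


adr_014 = DecisionRecord(
    adr_id="ADR-014",
    title="No Multi-Region Replication for MVP",
    decision="Single-region (Multi-AZ) with daily cross-region backup",
    accepted_risk="Full regional outage means total downtime (RTO ~4 hours)",
    review_at_req_per_s=100,
)

# Current traffic is well below the trigger, so no review is due yet.
print(adr_014.needs_review(avg_req_per_s=8))
# Traffic has grown past the trigger, so the decision should be revisited.
print(adr_014.needs_review(avg_req_per_s=120))
```

&lt;p&gt;The point of the sketch is the design choice, not the code: the decision and its expiry condition sit side by side, so the accepted risk is revisited when the context changes rather than after the next outage.&lt;/p&gt;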




&lt;h2&gt;
  
  
  5. Selective Resilience: Not Everything Needs HA (and that’s fine)
&lt;/h2&gt;

&lt;p&gt;Selective resilience is a virtue of efficiency.&lt;br&gt;&lt;br&gt;
Allocating finite resources (money and engineering attention) to unnecessary redundancy is one of the biggest wastes in architecture.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Prioritize High Availability (HA) only for what truly matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct Revenue Functions:&lt;/strong&gt; Components critical for financial transactions (e.g., checkout and payment APIs).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critical Customer Journey:&lt;/strong&gt; Functions whose failure prevents customers from using the core value of the product (e.g., login or main catalog).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory &amp;amp; Legal Risk:&lt;/strong&gt; Services where failure results in legal fines or breaches a contractual SLA that carries penalties.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critical Data Integrity:&lt;/strong&gt; Data stores where a loss would exceed the acceptable RPO.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;High availability without a purpose is like putting an airbag on a bicycle.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Managed ≠ Immune: The Serverless Mindset
&lt;/h2&gt;

&lt;p&gt;A common mistake is believing that using serverless services (Lambda, DynamoDB, SQS) grants immunity to failure.&lt;br&gt;&lt;br&gt;
It does not.&lt;/p&gt;

&lt;p&gt;Failure will still come — and often from where you least expect it.&lt;/p&gt;

&lt;p&gt;Managed services reduce the operational surface area, but they do not replace sound design and preparation.&lt;br&gt;&lt;br&gt;
Real resilience does not come from the cloud provider; it comes from the architecture you design on top of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Maturity is in the Question: Architecture as Influence and Translation
&lt;/h2&gt;

&lt;p&gt;The difference between &lt;em&gt;having an opinion&lt;/em&gt; and &lt;em&gt;having influence&lt;/em&gt; lies in your capacity for strategic clarity.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❓ Where is Your Team's Maturity?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Immature Teams Ask:&lt;/th&gt;
&lt;th&gt;Mature Teams Ask:&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"What stack solves this?"&lt;/td&gt;
&lt;td&gt;"What risk are we willing to accept for this cost?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Should we use K8S or Lambda?"&lt;/td&gt;
&lt;td&gt;"What RTO/RPO does the customer expect?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What does Netflix do?"&lt;/td&gt;
&lt;td&gt;"What does our business need to survive a disaster?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Common Trap:&lt;br&gt;&lt;br&gt;
Letting the engineering team answer both questions. Your team defines the &lt;strong&gt;HOW&lt;/strong&gt; (the stack), but the business must define the &lt;strong&gt;WHAT&lt;/strong&gt; (the acceptable RTO and RPO).&lt;br&gt;&lt;br&gt;
It is your job to hand the question back so that the &lt;strong&gt;risk decision&lt;/strong&gt; lies with the business.&lt;/p&gt;




&lt;h3&gt;
  
  
  Translating Resilience Concepts for Leadership
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Translation / Strategic Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Region Failover&lt;/td&gt;
&lt;td&gt;"The catastrophe insurance." Reduces downtime from days to hours. Q: How many hours can we afford to be down?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTO / RPO&lt;/td&gt;
&lt;td&gt;"Defining the loss limit." Q: How much data (RPO) can we afford to lose and how much time (RTO) to recover?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPOF (Single Point of Failure)&lt;/td&gt;
&lt;td&gt;"The Achilles' Heel of Revenue." Q: How much do we lose per hour if this fails?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
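
&lt;p&gt;The strategic questions in the table can be turned into a rough break-even check. The Python sketch below is deliberately simplistic: the function names and all figures are made up for illustration, and it only counts lost revenue, ignoring fines, reputational damage, and recovery effort.&lt;/p&gt;

```python
def expected_annual_loss(revenue_per_hour, outage_hours_per_year):
    """Revenue at risk per year for a given amount of expected downtime."""
    return revenue_per_hour * outage_hours_per_year


def ha_pays_off(revenue_per_hour, hours_without_ha, hours_with_ha, ha_cost_per_year):
    """True when the loss avoided by HA exceeds what HA costs per year."""
    avoided = expected_annual_loss(revenue_per_hour, hours_without_ha - hours_with_ha)
    return avoided > ha_cost_per_year


# Made-up numbers for illustration only.
# A checkout API earning 5,000 per hour: the extra redundancy is justified.
print(ha_pays_off(revenue_per_hour=5_000, hours_without_ha=8.76,
                  hours_with_ha=0.88, ha_cost_per_year=30_000))
# An internal report earning 200 per hour: the same spend is not justified.
print(ha_pays_off(revenue_per_hour=200, hours_without_ha=8.76,
                  hours_with_ha=0.88, ha_cost_per_year=30_000))
```

&lt;p&gt;Even a back-of-the-envelope model like this moves the conversation with leadership from "which stack" to "which risk, at which price".&lt;/p&gt;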




&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;There is no such thing as a &lt;strong&gt;fail-proof architecture&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
But there is such a thing as a &lt;strong&gt;surprise-proof organization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It begins with conscious decisions, documented context (ADRs), and the technical humility to accept that error is part of the equation.&lt;br&gt;&lt;br&gt;
Teams that understand the &lt;em&gt;why&lt;/em&gt; before the &lt;em&gt;how&lt;/em&gt; build systems that not only scale — &lt;strong&gt;they survive&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Essential References
&lt;/h2&gt;

&lt;p&gt;These are the documents that define the best practices for resilience in the cloud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/reliability.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework – Reliability Pillar&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/introduction.html" rel="noopener noreferrer"&gt;Disaster Recovery of Workloads on AWS (Whitepaper)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/pt/dynamodb/global-tables/" rel="noopener noreferrer"&gt;DynamoDB Global Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-resilience.html" rel="noopener noreferrer"&gt;EventBridge Resilience Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. Resilience Glossary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RTO (Recovery Time Objective):&lt;/strong&gt; The maximum acceptable time a system can be down after a failure.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RPO (Recovery Point Objective):&lt;/strong&gt; The maximum acceptable amount of data (measured in time) that can be lost during a disaster.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ADR (Architecture Decision Record):&lt;/strong&gt; A short document recording a technical decision, the reason, and the accepted risk.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPOF (Single Point of Failure):&lt;/strong&gt; A single component that, if it fails, takes down the entire system or revenue stream.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breaker:&lt;/strong&gt; A software pattern that stops calling a failing dependency for a cooldown period so that its failures do not cascade to callers.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  11. Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;Interested in orchestration, resilience, and the strategic role of the developer in the serverless era?&lt;/p&gt;

&lt;p&gt;🎤 I'll be at &lt;strong&gt;&lt;a href="https://sdsp.io/" rel="noopener noreferrer"&gt;ServerlessDays São Paulo&lt;/a&gt;&lt;/strong&gt; on &lt;strong&gt;November 8th&lt;/strong&gt;, at &lt;strong&gt;Cubo Itaú&lt;/strong&gt;, discussing how to go beyond the stack and build systems that not only function — &lt;strong&gt;but thrive in chaos.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Come join the conversation! 🚀&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
