<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kris Iyer</title>
    <description>The latest articles on Forem by Kris Iyer (@krisiye).</description>
    <link>https://forem.com/krisiye</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F504554%2Fc00bdf38-7085-4a4c-8906-712c29268bb8.JPG</url>
      <title>Forem: Kris Iyer</title>
      <link>https://forem.com/krisiye</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/krisiye"/>
    <language>en</language>
    <item>
      <title>This PHD from AWS Might Save Your Weekend!</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Mon, 10 Nov 2025 23:48:42 +0000</pubDate>
      <link>https://forem.com/aws-builders/this-phd-from-aws-might-save-your-weekend-2a2c</link>
      <guid>https://forem.com/aws-builders/this-phd-from-aws-might-save-your-weekend-2a2c</guid>
      <description>&lt;p&gt;There are some lessons you only learn the hard way in cloud operations — and this was one of them.&lt;/p&gt;

&lt;p&gt;A few weeks ago, one of our &lt;strong&gt;Amazon RDS databases restarted itself&lt;/strong&gt; in the middle of the night.&lt;br&gt;&lt;br&gt;
No deploys. No CloudWatch alarms. Just… downtime.&lt;/p&gt;

&lt;p&gt;When I checked the console later, I saw the engine version had been upgraded — &lt;strong&gt;even though Auto Minor Version Upgrade was disabled.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
That’s not supposed to happen, right?&lt;/p&gt;




&lt;h2&gt;The Mystery&lt;/h2&gt;

&lt;p&gt;My first reaction: “We must’ve messed up the configuration.”&lt;br&gt;&lt;br&gt;
But after a deep dive, I realized AWS had &lt;strong&gt;forced the upgrade&lt;/strong&gt; because the version we were running had reached &lt;strong&gt;end-of-support&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Apparently, this is &lt;a href="https://repost.aws/articles/ARZFO9SPbbS0Cv3GddKZJcug/rds-database-upgraded-and-restarted-automatically-despite-having-auto-minor-version-upgrade-disabled" rel="noopener noreferrer"&gt;expected behavior&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
When your RDS or Aurora version goes out of support, AWS reserves the right to automatically upgrade it to a supported version — even with auto-upgrade turned off.&lt;/p&gt;

&lt;p&gt;That’s when I learned my biggest oversight:&lt;br&gt;&lt;br&gt;
The &lt;strong&gt;AWS Personal Health Dashboard (PHD)&lt;/strong&gt; had already warned me about the change.&lt;br&gt;&lt;br&gt;
I just hadn’t been looking.&lt;/p&gt;




&lt;h2&gt;What Really Happens During a Forced Upgrade&lt;/h2&gt;

&lt;p&gt;Here’s where AWS is very clear about what happens — when an Aurora or RDS version reaches &lt;em&gt;End of Standard Support (EOSS)&lt;/em&gt;, &lt;strong&gt;Aurora performs an automatic upgrade to keep the cluster compliant&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;And even if Auto Minor Version Upgrade is disabled, the process still triggers restarts across the cluster.&lt;br&gt;&lt;br&gt;
AWS describes it like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When a version reaches EOSS, Aurora performs an automatic upgrade to keep the cluster compliant with supported versions, even if Auto Minor Version Upgrade is disabled.&lt;br&gt;&lt;br&gt;
During this process, cluster nodes are restarted sequentially, and DNS endpoints are briefly remapped to the new hosts, which can cause temporary connection errors such as: &lt;code&gt;connection refused&lt;/code&gt; or &lt;code&gt;host not resolving&lt;/code&gt;.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That short sentence hides a lot of operational pain.  &lt;/p&gt;
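&lt;p&gt;A minimal sketch of the client-side antidote: wrap database calls in retries with exponential backoff and jitter, so a brief endpoint remap surfaces as a short delay instead of an error page. This is illustrative Python, not AWS code; the error types and delays are assumptions you would tune for your own driver.&lt;/p&gt;

```python
import random
import time

# OSError covers "connection refused", connection resets, and DNS
# resolution failures (socket.gaierror), the errors seen while the
# cluster endpoints are being remapped to new hosts.
TRANSIENT_ERRORS = (OSError,)

def with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Run operation(), retrying transient connection errors with
    exponential backoff plus full jitter; re-raise once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # jitter avoids thundering herds
```

&lt;p&gt;Retries only help if they land on a fresh connection, which is exactly where connection pooling gets interesting.&lt;/p&gt;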




&lt;h2&gt;The Connection Pooling Trap&lt;/h2&gt;

&lt;p&gt;If your application uses &lt;strong&gt;connection pooling&lt;/strong&gt; (as most do), this restart can leave behind &lt;strong&gt;stale or dead connections&lt;/strong&gt; that linger long after the cluster is back up.  &lt;/p&gt;

&lt;p&gt;Here’s what I saw in logs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;connection refused&lt;/code&gt; errors when the DNS endpoint switched hosts.
&lt;/li&gt;
&lt;li&gt;Application threads stuck waiting on sockets that would never recover.
&lt;/li&gt;
&lt;li&gt;Connection pools holding references to the old host until they were recycled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interestingly, &lt;strong&gt;HikariCP&lt;/strong&gt; handled it gracefully — it dropped the bad connections and automatically re-established new ones.&lt;br&gt;&lt;br&gt;
But some other clients we had running didn’t recover cleanly; their pools held onto dead connections until we performed an &lt;strong&gt;application rolling restart&lt;/strong&gt; to clear them.  &lt;/p&gt;

&lt;p&gt;It was a subtle but painful reminder that even small differences in connection management can turn a “brief restart” into a customer-facing outage.&lt;/p&gt;

&lt;p&gt;The fix?  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement &lt;strong&gt;aggressive connection validation&lt;/strong&gt; and &lt;strong&gt;retry logic&lt;/strong&gt; in your database clients.
&lt;/li&gt;
&lt;li&gt;For pooled connections, ensure you’re using health checks like &lt;code&gt;connectionTestQuery&lt;/code&gt; or &lt;code&gt;validationTimeout&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Consider &lt;strong&gt;shorter max lifetime settings&lt;/strong&gt; in your pool so old connections are recycled faster after restarts.
&lt;/li&gt;
&lt;li&gt;And of course — &lt;strong&gt;know when AWS is about to restart your cluster&lt;/strong&gt; (that’s what PHD is for).&lt;/li&gt;
&lt;/ul&gt;
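&lt;p&gt;To make the first three fixes concrete, here is a toy pool showing validate-on-borrow plus a max-lifetime cap, the same ideas behind HikariCP&amp;#39;s &lt;code&gt;connectionTestQuery&lt;/code&gt; and &lt;code&gt;maxLifetime&lt;/code&gt; settings. Illustrative Python only; a real pool adds locking, sizing, and timeouts.&lt;/p&gt;

```python
import time

class PooledConn:
    """A raw connection plus its creation time, so age can be checked."""
    def __init__(self, raw):
        self.raw = raw
        self.born = time.monotonic()

class RecyclingPool:
    def __init__(self, connect, validate, max_lifetime=300.0):
        self._connect = connect          # factory that opens a new connection
        self._validate = validate        # returns True if a connection is alive
        self._max_lifetime = max_lifetime
        self._idle = []                  # idle PooledConn objects

    def acquire(self):
        while self._idle:
            pc = self._idle.pop()
            # Recycle anything past its max lifetime, and test the rest
            # before reuse, so a connection to the old (pre-upgrade) host
            # never reaches the application.
            expired = time.monotonic() - pc.born > self._max_lifetime
            if not expired and self._validate(pc.raw):
                return pc
        return PooledConn(self._connect())  # nothing reusable: open fresh

    def release(self, pc):
        self._idle.append(pc)
```

&lt;p&gt;With a short &lt;code&gt;max_lifetime&lt;/code&gt;, even connections that pass validation get rotated soon after a restart instead of lingering for hours.&lt;/p&gt;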




&lt;h2&gt;The Bigger Picture — It’s Not Just RDS&lt;/h2&gt;

&lt;p&gt;While my story centers around RDS, &lt;strong&gt;the AWS Personal Health Dashboard covers far more than databases.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Nearly every critical service you run in production can appear here — and if you’re not watching, you can miss events that matter.&lt;/p&gt;

&lt;p&gt;A few common examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EC2:&lt;/strong&gt; Instance retirements, hardware maintenance, or networking reconfigurations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Aurora &amp;amp; RDS:&lt;/strong&gt; Engine upgrades, SSL/TLS certificate rotations, or storage maintenance.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon SQS &amp;amp; SNS:&lt;/strong&gt; Service endpoint updates, &lt;em&gt;feature deprecations&lt;/em&gt;, and regional throttling events.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda &amp;amp; EventBridge:&lt;/strong&gt; Deprecation of older runtimes, event delivery changes, or region-specific service modifications.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EKS &amp;amp; ECS:&lt;/strong&gt; Control plane upgrades, patching windows, or underlying node retirements.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deprecation notices are one of the most valuable (and easily missed) categories in PHD. They often appear weeks or months in advance, giving teams the time to plan migrations or version upgrades — &lt;em&gt;before&lt;/em&gt; a breaking change occurs.&lt;/p&gt;

&lt;p&gt;Each of these can show up in the Personal Health Dashboard &lt;strong&gt;before&lt;/strong&gt; they impact you — but only if you’re subscribed, alerting, and paying attention.&lt;/p&gt;




&lt;h2&gt;What the Personal Health Dashboard Actually Does&lt;/h2&gt;

&lt;p&gt;Most teams glance at the Personal Health Dashboard once and forget it exists.&lt;br&gt;&lt;br&gt;
But if you dig in, you’ll realize it’s a quiet goldmine of visibility into what AWS is doing to &lt;em&gt;your&lt;/em&gt; infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💡 &lt;strong&gt;Proactive warnings&lt;/strong&gt; about maintenance, deprecations, and upcoming service changes.
&lt;/li&gt;
&lt;li&gt;🔔 &lt;strong&gt;Integration options&lt;/strong&gt; via EventBridge or SNS for automated alerts (to Slack, email, etc.).
&lt;/li&gt;
&lt;li&gt;🧾 &lt;strong&gt;Historical logs&lt;/strong&gt; of past events so you can correlate AWS maintenance with your own incidents.
&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Account-level insights&lt;/strong&gt;, not just global AWS status updates.
&lt;/li&gt;
&lt;/ul&gt;
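&lt;p&gt;The EventBridge integration boils down to an event pattern on the &lt;code&gt;aws.health&lt;/code&gt; source. The pattern below follows the AWS Health event format; the service list is an example to narrow the noise, and the matcher is a deliberately simplified stand-in for EventBridge&amp;#39;s own matching, included only to show the semantics.&lt;/p&gt;

```python
# EventBridge pattern matching account-specific AWS Health events.
# Filtering on detail.service keeps Slack noise down; service codes
# here ("RDS", "EC2", "LAMBDA") are examples, not a complete list.
HEALTH_EVENT_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {"service": ["RDS", "EC2", "LAMBDA"]},
}

def matches(pattern, event):
    """Tiny subset of EventBridge matching: every pattern key must be
    present in the event, and its value must be one of the listed options."""
    for key, allowed in pattern.items():
        if isinstance(allowed, dict):
            nested = event.get(key)
            if not isinstance(nested, dict) or not matches(allowed, nested):
                return False
        elif event.get(key) not in allowed:
            return False
    return True
```

&lt;p&gt;Point a rule with that pattern at an SNS topic or a Lambda that posts to Slack, and PHD stops being something you have to remember to check.&lt;/p&gt;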

&lt;p&gt;It’s basically AWS’s way of saying:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Hey, we’re about to touch something you own. Just so you know.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;What I Changed After That&lt;/h2&gt;

&lt;p&gt;After that forced upgrade, I decided I’d never let another PHD event go unnoticed.&lt;br&gt;&lt;br&gt;
Here’s what I put in place (and what I’d recommend others do too):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set up PHD notifications&lt;/strong&gt; using SNS → EventBridge → Slack/Teams.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a weekly check&lt;/strong&gt; for open/scheduled events in your ops stand-up.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add version lifecycle tracking&lt;/strong&gt; for all RDS, Aurora, EC2, and other managed services.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update your incident runbook&lt;/strong&gt; to check PHD during investigations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag PHD alerts&lt;/strong&gt; in PagerDuty with “AWS Provider Change” so you can separate them from internal incidents.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch for deprecation notices&lt;/strong&gt; — these are often the first sign a version, runtime, or API will be retired.
&lt;/li&gt;
&lt;/ol&gt;
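&lt;p&gt;Step 2, the weekly check, is easy to script. The helper below filters event dicts shaped like the AWS Health &lt;code&gt;describe_events&lt;/code&gt; response for scheduled changes landing soon. The key names are an assumed shape to adapt to the real payload, and note the Health API itself requires a Business or Enterprise support plan.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def due_soon(events, within_days=14, now=None):
    """Return scheduled-change events starting within the horizon,
    soonest first. Events are dicts with 'eventTypeCategory' and a
    timezone-aware 'startTime' (an assumed shape based on the AWS
    Health describe_events response)."""
    now = now or datetime.now(timezone.utc)
    horizon = now + timedelta(days=within_days)
    upcoming = [
        e for e in events
        if e["eventTypeCategory"] == "scheduledChange"
        and horizon >= e["startTime"] >= now
    ]
    return sorted(upcoming, key=lambda e: e["startTime"])
```

&lt;p&gt;Run it against the previous week&amp;#39;s events in your ops stand-up and the forced-upgrade surprise above becomes a planned maintenance item instead.&lt;/p&gt;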

&lt;p&gt;Now, when AWS schedules something, we know &lt;em&gt;before&lt;/em&gt; production does.&lt;/p&gt;




&lt;h2&gt;Looking Ahead — AI, Automation, and MCP&lt;/h2&gt;

&lt;p&gt;The next step for me has been making all this smarter.&lt;br&gt;&lt;br&gt;
AWS gives you the data through PHD, but there’s still a lot of noise.&lt;br&gt;&lt;br&gt;
AI and automation can help make sense of it.&lt;/p&gt;

&lt;p&gt;With tools like &lt;strong&gt;Amazon Q&lt;/strong&gt;, &lt;strong&gt;QuickSight&lt;/strong&gt;, and &lt;strong&gt;Bedrock&lt;/strong&gt;, you can summarize or query health events in natural language — “What’s changing next week in us-east-1?” or “Which clusters are nearing end-of-support?”&lt;br&gt;&lt;br&gt;
And if you want to take it a step further, AWS Labs has released the &lt;a href="https://github.com/awslabs/mcp" rel="noopener noreferrer"&gt;&lt;strong&gt;MCP Servers project&lt;/strong&gt;&lt;/a&gt;, which defines a standard interface that allows AI assistants or bots to securely access AWS data (like the Health API) and answer those same questions automatically.&lt;/p&gt;

&lt;p&gt;It’s early days, but it’s easy to see how something like this could become an &lt;em&gt;Ops Copilot&lt;/em&gt; — a chat assistant that not only reports on AWS health events but also suggests who owns the impacted resource, what runbook applies, and what to do next.&lt;/p&gt;




&lt;h2&gt;Why It Matters&lt;/h2&gt;

&lt;p&gt;Cloud automation is great — until it surprises you.&lt;br&gt;&lt;br&gt;
The Personal Health Dashboard is your early warning system for when AWS changes something under the hood.  &lt;/p&gt;

&lt;p&gt;If you’re running production workloads, it’s not optional.&lt;br&gt;&lt;br&gt;
It’s as essential as CloudWatch or your favorite APM.&lt;/p&gt;

&lt;p&gt;Here’s the mindset shift that helped me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Automation doesn’t remove responsibility.&lt;br&gt;&lt;br&gt;
It just changes where you have to pay attention.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS can still upgrade your RDS/Aurora cluster when versions hit end-of-support.
&lt;/li&gt;
&lt;li&gt;The Personal Health Dashboard will warn you — if you’re watching.
&lt;/li&gt;
&lt;li&gt;Connection pools can fail during DNS endpoint remapping; some (like HikariCP) recover automatically, others may need a restart.
&lt;/li&gt;
&lt;li&gt;PHD isn’t just for RDS — it covers EC2, EventBridge, SQS, SNS, Lambda, EKS, and more.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deprecation notices&lt;/strong&gt; are critical early warnings — act on them before they become incidents.
&lt;/li&gt;
&lt;li&gt;Automate those alerts via SNS/EventBridge to avoid surprises.
&lt;/li&gt;
&lt;li&gt;Tools like &lt;strong&gt;AI and MCP&lt;/strong&gt; can make this even smarter — turning AWS health data into insight.
&lt;/li&gt;
&lt;li&gt;Treat PHD as a core monitoring tool, not an afterthought.
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Since that incident, I’ve made the PHD part of our &lt;strong&gt;daily ops hygiene&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It’s not flashy, but it’s saved us from more than one nasty surprise.&lt;/p&gt;

&lt;p&gt;If you’ve been ignoring it (like I did), maybe it’s time to give it another look.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you ever been surprised by an AWS upgrade, deprecation, or hidden maintenance event? I’d love to hear how you handled it — drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>reliability</category>
      <category>rds</category>
    </item>
    <item>
      <title>Kubernetes Features for operating resilient workloads on Amazon EKS</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Fri, 14 Mar 2025 13:02:31 +0000</pubDate>
      <link>https://forem.com/aws-builders/kubernetes-features-for-operating-resilient-workloads-on-amazon-eks-3j76</link>
      <guid>https://forem.com/aws-builders/kubernetes-features-for-operating-resilient-workloads-on-amazon-eks-3j76</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AEcCpjK1HuKaIFoST" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AEcCpjK1HuKaIFoST" width="800" height="546"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Yomex Owo on Unsplash&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Introduction&lt;/h3&gt;

&lt;p&gt;In Kubernetes, managing highly available applications is critical for maintaining service reliability and resiliency. A scalable and resilient architecture keeps your applications and services running without disruptions, which keeps your customers and users happy! Thankfully, there are several configuration options to meet these important non-functional requirements (NFRs) for your k8s workloads.&lt;/p&gt;
&lt;h3&gt;Control Plane&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/eks/" rel="noopener noreferrer"&gt;Amazon Elastic Kubernetes Service&lt;/a&gt; (EKS) is a managed Kubernetes service that makes it easy to run Kubernetes on AWS without installing, operating, and maintaining your own Kubernetes control plane or worker nodes. The EKS architecture is designed to eliminate any single points of failure that may compromise the availability and durability of the Kubernetes control plane and offers an &lt;a href="https://aws.amazon.com/eks/sla" rel="noopener noreferrer"&gt;SLA of 99.95% for API server endpoint availability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Last December at re:Invent 2024, AWS announced a new mode for managing Amazon EKS clusters: &lt;a href="https://aws.amazon.com/blogs/containers/getting-started-with-amazon-eks-auto-mode" rel="noopener noreferrer"&gt;Amazon EKS Auto Mode&lt;/a&gt; promises simplified cluster operations; improved application performance, availability, and security; and continuous compute cost optimization. With EKS Auto Mode, you can focus on running your applications without worrying about the underlying infrastructure and resilience of the control plane.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanbdsnbn18znr03b0268.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanbdsnbn18znr03b0268.png" alt="Shows AWS EKS Auto mode architecture with control plane and capabilities." width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AWS EKS Auto mode&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The EKS control plane is designed to be highly available, fault-tolerant, and scalable. Following the recommendations of the Well-Architected Framework, Amazon EKS runs the Kubernetes control plane across multiple AWS Availability Zones (AZs) to ensure high availability. The control plane auto-scales based on load, and any unhealthy control plane instances are replaced automatically. The availability of EC2 instances attached to an EKS cluster is covered under the &lt;a href="https://aws.amazon.com/compute/sla/" rel="noopener noreferrer"&gt;Amazon Compute SLA&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In a nutshell, as far as the resilience of the control plane is concerned, EKS Auto Mode handles all of that for you, allowing engineering teams to focus on building applications and business logic rather than managing the infrastructure!&lt;/p&gt;
&lt;h3&gt;Data Plane&lt;/h3&gt;
&lt;h4&gt;Multi-Region Kubernetes Clusters&lt;/h4&gt;

&lt;p&gt;If you have stringent availability requirements, you may choose to operate across multiple AWS Regions. This approach protects against larger-scale disasters or regional outages, but the cost and complexity of implementing it can be significantly higher, so the pattern is typically reserved for disaster recovery and business continuity. Therefore, &lt;strong&gt;understanding the spectrum of resilience strategies is very important:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can categorize resilience strategies along four dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recovery Time Objective (RTO):&lt;/strong&gt; How quickly do you need to restore service?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery Point Objective (RPO):&lt;/strong&gt; How much data loss can you tolerate?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; The expense of implementing and maintaining the strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity:&lt;/strong&gt; The difficulty of setup and ongoing management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s a breakdown of the strategies, moving from “cold” to “hot” standby:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Backup and Restore (Very Cold Passive):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Regularly backing up data and infrastructure configurations to a separate region or storage location. If a disaster occurs, you restore the backups to a new environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTO:&lt;/strong&gt; Very high (hours to days).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RPO:&lt;/strong&gt; High (potential for significant data loss, depending on backup frequency).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; Lowest (storage costs for backups, minimal infrastructure costs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity:&lt;/strong&gt; Low (relatively simple backup and restore procedures).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Cases:&lt;/strong&gt; Suitable for non-critical workloads with relaxed RTO/RPO requirements, where cost is the primary concern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shades:&lt;/strong&gt; Frequency of backup. Daily, Hourly, etc.
&lt;/li&gt;
&lt;li&gt;Location of backup. S3, Glacier, another region.
&lt;/li&gt;
&lt;li&gt;Automation of restore. Manual vs Fully automated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Pilot Light (Active/Passive — Cold Standby):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Maintaining a minimal “pilot light” environment in a secondary region, including core infrastructure components. When a disaster occurs, you scale up the pilot light to full capacity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTO:&lt;/strong&gt; Moderate (minutes to hours).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RPO:&lt;/strong&gt; Low (data replication is typically used, minimizing data loss).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; Low to moderate (cost of the pilot light environment, data replication).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity:&lt;/strong&gt; Moderate (setup and testing of failover procedures).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Cases:&lt;/strong&gt; Suitable for workloads that require faster recovery than backup and restore, but can tolerate some downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shades:&lt;/strong&gt; Amount of infrastructure kept running. Minimal core services, or almost a full duplicate.
&lt;/li&gt;
&lt;li&gt;Automation of scaling. Manual, semi, or fully automated.
&lt;/li&gt;
&lt;li&gt;Data replication type. Async, or sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Warm Standby (Active/Passive — Warm Standby):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Maintaining a scaled-down, but fully functional, environment in a secondary region. Data is continuously replicated. When a disaster occurs, you switch traffic to the warm standby.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTO:&lt;/strong&gt; Low (minutes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RPO:&lt;/strong&gt; Very low (near-zero data loss).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; Moderate to high (cost of the warm standby environment, data replication).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity:&lt;/strong&gt; High (complex failover and failback procedures).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Cases:&lt;/strong&gt; Suitable for critical workloads that require minimal downtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shades:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Amount of traffic the warm standby receives. No traffic, or a small amount of test traffic.&lt;/li&gt;
&lt;li&gt;Testing of failover. Frequent, or infrequent.&lt;/li&gt;
&lt;li&gt;Data replication consistency. Strong or eventual consistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Active/Active (Hot Standby):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Running identical environments in multiple regions simultaneously, with traffic distributed across them. If one region fails, traffic is automatically routed to the remaining regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTO:&lt;/strong&gt; Very low (seconds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RPO:&lt;/strong&gt; Very low (near-zero data loss).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; Highest (cost of running full environments in multiple regions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity:&lt;/strong&gt; Highest (complex traffic management, data synchronization, and application design).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Cases:&lt;/strong&gt; Suitable for mission-critical workloads that require continuous availability and cannot tolerate any downtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shades:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Traffic distribution. Weighted, or even distribution.&lt;/li&gt;
&lt;li&gt;Data replication. Synchronous, or asynchronous.&lt;/li&gt;
&lt;li&gt;Application design. Region-aware applications, or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Expanding on the Cost-Benefit Trade-Off:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost of Downtime:&lt;/strong&gt; The “cost of being down” is not just financial. It includes reputational damage, customer churn, and lost productivity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload Characteristics:&lt;/strong&gt; The nature of your workload influences the appropriate strategy. Real-time applications require lower RTO/RPO than batch processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Requirements:&lt;/strong&gt; Regulatory requirements may dictate specific RTO/RPO targets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing and Validation:&lt;/strong&gt; Regularly testing failover procedures is crucial to ensure they work as expected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geographic Distribution:&lt;/strong&gt; Active/active can also improve performance by serving users from the closest region.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By carefully evaluating these pillars, you can choose the resilience strategy that best balances cost, complexity, and risk for your specific needs for your Multi-regional deployments.&lt;/p&gt;
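&lt;p&gt;The trade-offs above can be condensed into a first-pass selector. The thresholds are illustrative assumptions mirroring the RTO ranges listed, not AWS guidance; cost, compliance, and workload characteristics should still drive the final call.&lt;/p&gt;

```python
def pick_dr_strategy(rto_minutes, rpo_minutes):
    """First-pass mapping from recovery targets to the four strategies.
    Thresholds are illustrative, following the RTO/RPO ranges above."""
    if rto_minutes > 240:                     # hours-to-days recovery acceptable
        return "backup and restore"
    if rto_minutes > 30:                      # minutes-to-hours
        return "pilot light"
    if rto_minutes > 1 or rpo_minutes > 5:    # minutes, near-zero data loss
        return "warm standby"
    return "active/active"                    # seconds, near-zero data loss
```

&lt;p&gt;Treat the output as a starting point for the conversation, not the answer: a workload with relaxed RTO but strict compliance rules may still warrant a hotter standby.&lt;/p&gt;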
&lt;h4&gt;Multi-cluster&lt;/h4&gt;

&lt;p&gt;This is a popular pattern where we deploy workloads across multiple Amazon EKS clusters to eliminate a Kubernetes cluster from being a single point of failure. Multi-cluster architecture also provides opportunities for testing, maintenance, and upgrades without disrupting production environments. By diverting traffic or workloads to a set of clusters during planned maintenance activities, one can ensure continuous service availability and achieve near-zero downtime.&lt;/p&gt;

&lt;p&gt;In practice, this means using an Application Load Balancer (ALB) or Network Load Balancer (NLB) to distribute traffic to replicas of services running inside a cluster, or even to load balance traffic across multiple clusters. When using an ALB, we can create dedicated target groups for each cluster; with weighted target groups, we can then control the percentage of traffic each cluster receives. For workloads that use an NLB, we can use &lt;a href="https://docs.aws.amazon.com/global-accelerator/?icmpid=docs_homepage_networking" rel="noopener noreferrer"&gt;AWS Global Accelerator&lt;/a&gt; to distribute traffic across multiple clusters.&lt;/p&gt;
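&lt;p&gt;As a sketch, the weighted split across clusters is expressed as an ELBv2 &lt;code&gt;forward&lt;/code&gt; action with one target group per cluster. The dict shape follows the boto3 ELBv2 &lt;code&gt;modify_listener&lt;/code&gt; API; the ARNs and weights are placeholders.&lt;/p&gt;

```python
def weighted_forward_action(cluster_weights):
    """Build the ELBv2 'forward' action that splits traffic across
    per-cluster target groups by weight (dict: target group ARN to weight)."""
    if not cluster_weights:
        raise ValueError("at least one target group is required")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": arn, "Weight": weight}
                for arn, weight in cluster_weights.items()
            ]
        },
    }
```

&lt;p&gt;Shifting, say, 10% of traffic to a second cluster during maintenance is then just a weight change passed to &lt;code&gt;modify_listener&lt;/code&gt;.&lt;/p&gt;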

&lt;p&gt;Note: Avoid potential pitfalls with configuration drifts across clusters. When adopting a multi-cluster architecture for resiliency, it is essential to reduce the operational overhead of managing clusters individually. The idea is to treat clusters as a unit. Should an issue arise in the deployment, it’s easier to fix if all clusters share the same workload version and configuration.&lt;/p&gt;
&lt;h4&gt;Compute Resources (Nodes/Node Groups)&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Cluster AutoScaler (CA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;The Kubernetes Cluster Autoscaler&lt;/a&gt; is a popular Cluster Autoscaling solution maintained by &lt;a href="https://github.com/kubernetes/community/tree/master/sig-autoscaling" rel="noopener noreferrer"&gt;SIG Autoscaling&lt;/a&gt;. It is responsible for ensuring that your cluster has enough nodes to schedule your pods without wasting resources. It watches for pods that fail to schedule and for underutilized nodes. It then simulates the addition or removal of nodes before applying the change to your cluster. The AWS Cloud Provider implementation within CA controls the .DesiredReplicas field of your EC2 Auto Scaling Groups.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated Node Scaling:&lt;/strong&gt; The CA automatically adjusts the number of nodes in your cluster based on the resource requests of pending pods. If pods are waiting to be scheduled and there aren’t enough resources, the CA will provision new nodes. Conversely, if nodes are underutilized, the CA can scale down the cluster, removing unnecessary nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Resource Utilization:&lt;/strong&gt; By dynamically scaling the number of nodes, the CA helps to optimize resource utilization. You avoid over-provisioning nodes, which can lead to wasted resources and increased costs. The cluster more closely matches the resource demands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Optimization:&lt;/strong&gt; Scaling down the number of nodes when they are not needed can significantly reduce your cloud computing costs. You only pay for the resources you use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified Cluster Management:&lt;/strong&gt; The CA automates the process of adding and removing nodes, reducing the manual effort required to manage your Kubernetes infrastructure. This frees up your operations team for other tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increased Application Availability:&lt;/strong&gt; By ensuring that there are enough resources available to run your applications, the CA can help to improve application availability and prevent resource starvation. Pods are more likely to be scheduled quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support for Diverse Environments:&lt;/strong&gt; The CA is designed to work with various cloud providers (AWS, Azure, GCP) and even on-premises Kubernetes clusters. This makes it a versatile solution for managing Kubernetes in different environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with Kubernetes:&lt;/strong&gt; The CA is a core Kubernetes component and integrates seamlessly with other Kubernetes features, such as the scheduler and the HPA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Flexibility:&lt;/strong&gt; The CA offers a range of configuration options, allowing you to customize its behavior to meet your specific needs. You can control the minimum and maximum number of nodes, the types of instances to use, and other parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Draining:&lt;/strong&gt; When scaling down, the CA gracefully drains nodes by evicting pods before terminating the instance. This prevents application disruptions and ensures a smooth scaling process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Faster:&lt;/strong&gt; There are a couple of things you can do to make sure your data plane scales faster and recovers quickly from failures in individual Nodes, and to make CA more efficient:
 — &lt;a href="https://aws.amazon.com/blogs/containers/eliminate-kubernetes-node-scaling-lag-with-pod-priority-and-over-provisioning/" rel="noopener noreferrer"&gt;Over-provision capacity at the Node level&lt;/a&gt;
 — Reduce Node startup time by using &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami-bottlerocket.html" rel="noopener noreferrer"&gt;Bottlerocket&lt;/a&gt; or the vanilla &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html" rel="noopener noreferrer"&gt;Amazon EKS optimized Linux&lt;/a&gt; AMI
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Karpenter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt; was introduced in this space as an AWS alternative to the CA which allowed for greater flexibility with fine-grained control over cluster and node management like never before for EKS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster and More Efficient Scaling:&lt;/strong&gt; Karpenter directly provisions EC2 instances based on the needs of pending pods, which eliminates the overhead of managing node groups and significantly speeds up the scaling process, especially for workloads with fluctuating demands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained Control:&lt;/strong&gt; Control over the types of instances Karpenter provisions. You can specify instance types, availability zones, architectures (e.g., ARM64), and other instance properties through “provisioners.” This allows you to optimize resource utilization and cost efficiency for different workloads, tailoring the compute resources to the workload. Karpenter can efficiently provision worker nodes for a wide variety of workloads, including those with specialized hardware requirements (like GPUs) or architectural needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified Cluster Management&lt;/strong&gt; by dynamically provisioning nodes. You don’t need to pre-configure and manage multiple node groups with varying instance types. This reduces operational overhead and makes it easier to manage your EKS cluster. Less configuration and management are required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Cost Optimization:&lt;/strong&gt; Karpenter integrates well with AWS cost-saving features. It can automatically provision nodes using Spot Instances, Savings Plans, or other cost-effective options, helping you minimize your EKS spending.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better Integration with EKS:&lt;/strong&gt; As an AWS-developed tool, Karpenter seamlessly integrates with EKS and other AWS services. This leads to a smoother experience and allows Karpenter to leverage AWS-specific features and best practices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-Sized Instances:&lt;/strong&gt; Karpenter provisions instances that precisely match the resource requests of pending pods. This avoids over-provisioning and improves resource utilization. You don’t get instances that are too large or too small.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Node Draining:&lt;/strong&gt; When scaling down, Karpenter gracefully drains nodes by evicting pods before terminating the instance. This prevents application disruptions and ensures a smooth scaling process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Declarative Configuration:&lt;/strong&gt; Karpenter uses declarative configuration through provisioners, making it easy to manage and version your worker node configurations. With Karpenter, you can define NodePools with constraints on node provisioning like taints, labels, requirements (instance types, zones, etc.), and limits on total provisioned resources. When deploying workloads, you can specify various scheduling constraints in the pod specifications like resource requests/limits, node selectors, node/pod affinities, tolerations, and topology spread constraints. Karpenter will then provision right-sized nodes based on these specifications.&lt;/li&gt;
&lt;/ul&gt;
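&lt;p&gt;To make the declarative model concrete, here is a minimal NodePool sketch using the Karpenter v1 API. The pool name, zones, and limits are illustrative placeholders, and the referenced EC2NodeClass (which holds AMI and networking settings) is assumed to exist in your cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default # illustrative name
spec:
  template:
    spec:
      nodeClassRef: # references an EC2NodeClass with AMI/subnet settings
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # allow Spot for cost savings
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b"] # illustrative zones
  limits:
    cpu: "100" # cap total provisioned vCPUs for this pool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Karpenter then provisions right-sized instances within these constraints whenever pods are pending.&lt;/p&gt;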

&lt;p&gt;&lt;strong&gt;Here is a quick high-level comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly8o89i9hhvt2njn0tyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly8o89i9hhvt2njn0tyi.png" width="800" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EKS Auto Mode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As discussed under the control plane section above, EKS Auto Mode extends into the data plane with its powerful features that simplify worker node management by leveraging Karpenter under the hood. So in essence what you get is &lt;strong&gt;Managed Karpenter&lt;/strong&gt; where AWS manages the Karpenter installation and configuration for you. EKS Auto Mode provides a superior experience compared to manually configuring the Cluster Autoscaler.&lt;/p&gt;

&lt;p&gt;For the vast majority of EKS users, especially those starting new clusters or looking for a simplified solution, &lt;strong&gt;EKS Auto Mode&lt;/strong&gt; is the recommended approach for autoscaling worker nodes.&lt;/p&gt;

&lt;p&gt;Let’s break down some examples of situations where EKS Auto Mode might not be sufficient and you might need more direct control via Karpenter or (less commonly) the Cluster Autoscaler:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyeto3lnu2i7e6l4ehsm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyeto3lnu2i7e6l4ehsm3.png" alt="shows a decision tree graph for EKS Auto Mode and when to choose between karpenter and cluster auto scaler." width="800" height="308"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A decision tree for EKS Auto Mode&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Fine-Grained Instance Type Control:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Specific Instance Families:&lt;/strong&gt; Auto Mode lets you choose instance types, but you might need a &lt;em&gt;very&lt;/em&gt; specific generation (e.g., c5.xlarge vs. c7g.xlarge for Graviton) or a particular instance family due to workload requirements (e.g., memory-optimized instances). Auto Mode's selection might not always align perfectly with your needs and complex computing use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heterogeneous Instance Types within a Provisioner:&lt;/strong&gt; While Karpenter can provision nodes with different instance types, Auto Mode simplifies this and may not offer the same level of granular control. You might want a mix of instances within a single provisioner based on cost or performance needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom AMIs:&lt;/strong&gt; You might require a custom Amazon Machine Image (AMI) with specific software pre-installed or security hardening applied. Auto Mode typically uses standard AMIs, so you’d need more control for custom AMIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Taints and Tolerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node Taints:&lt;/strong&gt; Taints are used to repel pods from specific nodes. You might need to taint nodes for specialized workloads (e.g., GPU nodes) and then use tolerations in your pod specifications to allow those pods to run on the tainted nodes. Auto Mode might not offer the fine-grained control to apply specific taints during node provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Tolerations:&lt;/strong&gt; Auto Mode does consider tolerations in pod specs, but if you have a complex set of tolerations, direct Karpenter configuration is sometimes better to ensure the right nodes are provisioned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Node Labels:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application-Specific Labels:&lt;/strong&gt; Labels are used to organize and select nodes. You might need to apply specific labels to nodes for application-specific purposes (e.g., environment, team, or application name). Auto Mode’s labeling might not be flexible enough for all cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Pool Management:&lt;/strong&gt; If you want to create and manage distinct sets of nodes (node pools) with different labels and configurations, you might need more direct control than Auto Mode provides.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Advanced Karpenter Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provisioner Prioritization:&lt;/strong&gt; If you want to prioritize certain provisioners (sets of instance types and configurations) over others, you would need to configure Karpenter directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Scheduling Logic:&lt;/strong&gt; For very specialized scheduling needs beyond what Kubernetes natively provides, you might need to use advanced Karpenter features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with other tools:&lt;/strong&gt; If you have existing infrastructure-as-code (IaC) or configuration management tools that directly manage Karpenter, switching to Auto Mode could require significant changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Node Lifecycle Management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node Draining Customization:&lt;/strong&gt; While Auto Mode handles node draining, you might have specific requirements for how nodes are drained (e.g., specific pod eviction policies or pre-shutdown scripts).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Replacement Strategies:&lt;/strong&gt; You might need to define custom node replacement strategies based on your application’s requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, Auto Mode is excellent for most common scenarios. However, if your worker node requirements involve very specific instance types, complex taints and tolerations, application-specific labels, advanced Karpenter features, or customized node lifecycle management, then configuring Karpenter directly gives you the necessary fine-grained control. You’ll know you need it when the Auto Mode configuration options simply don’t offer the knobs you need to turn.&lt;/p&gt;

&lt;p&gt;Note: EKS Auto Mode is still under active development, and AWS is continuously adding new features and improvements, so it’s always a good idea to check the latest AWS documentation to see if Auto Mode meets your specific needs.&lt;/p&gt;
&lt;h4&gt;
  
  
  Deployment Strategies
&lt;/h4&gt;

&lt;p&gt;In Kubernetes, several options for deploying applications are available, each suited for different needs and scenarios. Some of the popular techniques include &lt;em&gt;blue-green deployments, canary deployments, and rolling updates.&lt;/em&gt; Each method is unique in what it has to offer for managing updates and minimizing downtime.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Rolling Updates&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Kubernetes Rolling Updates are a critical feature for deploying and updating applications with zero downtime. They work by gradually replacing old pods with new ones, ensuring a smooth transition and minimizing disruption to users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing the Right Values is critical.&lt;/strong&gt; The optimal values for maxSurge (the number of extra pods allowed above the desired replica count during an update) and maxUnavailable (the number of pods allowed to be unavailable during an update) depend on your application's specific requirements and resource constraints. Here are some factors to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application Sensitivity:&lt;/strong&gt; If your application is critical and cannot tolerate any downtime or performance degradation, you should use conservative values for maxSurge and maxUnavailable (e.g., 10-20%).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Availability:&lt;/strong&gt; If your cluster has limited resources, you should use lower values for maxSurge to avoid resource exhaustion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update Speed:&lt;/strong&gt; If you need to deploy updates quickly, you can use higher values for maxSurge and maxUnavailable to speed up the process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you’re first setting up rolling updates, it’s best to start with conservative values for maxSurge and maxUnavailable, then gradually adjust them as you gain confidence, monitoring the application and fine-tuning to your needs.&lt;/p&gt;
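&lt;p&gt;A conservative starting point might look like the following sketch; the deployment name, image, and percentages are illustrative, so tune them per the factors above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app # illustrative name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%       # at most 1 extra pod above the desired 10
      maxUnavailable: 10% # at most 1 pod below the desired 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2 # the new version being rolled out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;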

&lt;p&gt;&lt;strong&gt;2. Canary Deployments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Canary deployment strategy&lt;/strong&gt; is a &lt;em&gt;gradual&lt;/em&gt; rollout where you deploy a small percentage of the new version (the “canary”) alongside the existing version of your service. It is typically used when deploying new features or updates to a subset of users or servers to test them in a live environment, and is often used for applications that require frequent updates. This strategy allows new features to be tested with minimal impact on the production environment and can help identify issues before they affect the entire system.&lt;/p&gt;

&lt;p&gt;A service mesh provides advanced traffic management features, including fine-grained traffic splitting, header-based routing, and more, and is highly recommended for more sophisticated canary deployments. For instance, &lt;a href="https://istio.io/" rel="noopener noreferrer"&gt;Istio&lt;/a&gt; offers a VirtualService that defines how traffic is routed to your services, where you may choose to route most traffic (e.g., 90%) to v1 of your service and a small percentage (e.g., 10%) to v2 (the canary). If no service mesh is involved, you have to back this with an ingress controller or load balancer that can do the same. For instance, Ambassador, a popular ingress controller for Kubernetes, offers &lt;a href="https://www.getambassador.io/docs/edge-stack/latest/topics/using/canary" rel="noopener noreferrer"&gt;canary releases&lt;/a&gt; based on weights.&lt;/p&gt;
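&lt;p&gt;As a sketch of the Istio approach, a VirtualService can split traffic 90/10 between two subsets. This assumes a DestinationRule that defines the v1 and v2 subsets, and the service name is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 90 # stable version keeps most traffic
        - destination:
            host: my-service
            subset: v2
          weight: 10 # the canary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Shifting the weights gradually (10 → 25 → 50 → 100) completes the rollout once the canary proves healthy.&lt;/p&gt;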

&lt;p&gt;&lt;strong&gt;3. Blue-Green Deployments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;the Blue-Green deployment strategy,&lt;/strong&gt; you have two complete environments (blue and green). One environment (e.g., blue) is live, serving all traffic. You deploy the new version to the other environment (green). After testing in green, you switch &lt;em&gt;all&lt;/em&gt; traffic from blue to green. Blue/green is about &lt;em&gt;faster deployments&lt;/em&gt; and &lt;em&gt;simplified rollbacks&lt;/em&gt;. The new release candidate is tested before being switched into the production environment, allowing for a smooth transition without downtime or errors. This approach also heavily depends on your load balancer or ingress controller and the ability to switch traffic from blue to green or vice versa in the event of a rollback.&lt;/p&gt;
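&lt;p&gt;In its simplest Kubernetes form (no mesh), the switch can be a Service selector flip: both Deployments carry a version label, and editing one line cuts all traffic over. Names and labels here are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue # change to "green" to switch all traffic; revert to roll back
  ports:
    - port: 80
      targetPort: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;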

&lt;p&gt;In summary, choosing a deployment strategy depends on the specific requirements and characteristics of the application or service being deployed. Canary deployments are great for services with frequent updates and testing, Rolling deployments are a great choice for zero-downtime deployments, and Blue-Green deployments are ideal for minimizing downtime during deployments.&lt;/p&gt;
&lt;h4&gt;
  
  
  Topology Spread Constraints (TSC)
&lt;/h4&gt;

&lt;p&gt;Kubernetes Topology Spread Constraints (TSC) ensure pods are spread across zones during scale-up. However, they don’t guarantee balanced distribution during scale-down. The Kubernetes descheduler can be used to address this imbalance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment # Or StatefulSet, etc.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: &amp;lt;number&amp;gt; # Maximum difference in pods across topologies
          topologyKey: &amp;lt;string&amp;gt; # The topology domain (e.g., zone, node)
          whenUnsatisfiable: &amp;lt;string&amp;gt; # How to handle if constraints can't be met
          labelSelector: # Selects the pods to which this applies
            matchLabels:
              &amp;lt;label-key&amp;gt;: &amp;lt;label-value&amp;gt;
          # Optional: minDomains: &amp;lt;number&amp;gt; # Minimum number of domains to spread across
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break down the key parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;maxSkew:&lt;/strong&gt; This is the &lt;em&gt;most crucial&lt;/em&gt; parameter. It defines the &lt;em&gt;maximum&lt;/em&gt; difference in the number of pods between any two topologies (e.g., zones). A maxSkew of 1 means that the difference in pod count between any two zones should be no more than 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;topologyKey:&lt;/strong&gt; This specifies the topology domain. Common values:
&lt;ul&gt;
&lt;li&gt;kubernetes.io/hostname: Spreads pods across nodes.&lt;/li&gt;
&lt;li&gt;topology.kubernetes.io/zone: Spreads pods across availability zones.&lt;/li&gt;
&lt;li&gt;topology.kubernetes.io/region: Spreads pods across regions. (Less common)&lt;/li&gt;
&lt;li&gt;Custom labels: You can use any label as a topology key, giving you very flexible control.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;whenUnsatisfiable:&lt;/strong&gt; This dictates what Kubernetes should do if the constraint cannot be satisfied when a pod is scheduled:
&lt;ul&gt;
&lt;li&gt;DoNotSchedule: (Recommended in most cases) Prevents the pod from being scheduled if the constraint cannot be met. This ensures the spread is maintained.&lt;/li&gt;
&lt;li&gt;ScheduleAnyway: Allows the pod to be scheduled even if the constraint is violated. This is generally &lt;em&gt;not&lt;/em&gt; recommended as it defeats the purpose of the constraint.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;labelSelector:&lt;/strong&gt; This uses standard Kubernetes label selectors to specify which pods the constraint applies to. This is essential to target the constraint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;minDomains (Optional):&lt;/strong&gt; Specifies the minimum number of topology domains (like zones) that pods should be spread across. This is useful for high availability, ensuring your application runs in a minimum number of zones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following constraint ensures that no two zones differ by more than one pod for pods labeled app: my-app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Likewise, you could spread by nodes to limit the difference in the number of app: my-app pods on any two nodes to a maximum of 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important Considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DoNotSchedule is Crucial:&lt;/strong&gt; In almost all cases, you should use whenUnsatisfiable: DoNotSchedule. Otherwise, the constraint becomes meaningless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Label Selectors are Essential:&lt;/strong&gt; Without a labelSelector, the constraint will apply to &lt;em&gt;all&lt;/em&gt; pods, which is rarely what you want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;maxSkew and Replica Count:&lt;/strong&gt; maxSkew interacts with the number of replicas. If you have fewer replicas than topology domains, perfect spreading might not be possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planning:&lt;/strong&gt; Think about your failure domains (zones, nodes) and how you want your application to behave in the event of a failure. This will help you determine the appropriate maxSkew and topologyKey.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Descheduler
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/kubernetes-sigs/descheduler" rel="noopener noreferrer"&gt;Descheduler&lt;/a&gt; is a valuable tool for managing and optimizing Kubernetes clusters. By evicting the appropriate pods, the Descheduler can help improve resource utilization, maintain the desired state of the cluster, and enhance the scalability and security of the cluster in the face of node failures and vulnerabilities. It becomes very useful for maintaining balance and spread across your zones, specifically during scale-down events.&lt;/p&gt;
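&lt;p&gt;For example, the descheduler’s RemovePodsViolatingTopologySpreadConstraint strategy evicts pods that break your spread constraints so the scheduler can re-place them. A minimal sketch using the v1alpha1 policy format follows; the exact schema varies across descheduler versions, so check the docs for the version you deploy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: false # only act on DoNotSchedule constraints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run the descheduler as a CronJob so the balance is re-checked periodically after scale-down events.&lt;/p&gt;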

&lt;h4&gt;
  
  
  Topology Aware Routing on Amazon EKS
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/" rel="noopener noreferrer"&gt;&lt;em&gt;Topology Aware Routing&lt;/em&gt;&lt;/a&gt; &lt;em&gt;(Also referred to as Topology Aware Hints or TAH before v1.27)&lt;/em&gt; provides a mechanism to help keep network traffic within the zone where it originated. Preferring same-zone traffic between Pods in your cluster can help with reliability, performance (network latency and throughput), or cost.&lt;/p&gt;

&lt;p&gt;Kubernetes clusters are increasingly deployed in multi-zone environments. &lt;em&gt;Topology Aware Routing&lt;/em&gt; provides a mechanism to help keep traffic within the zone it originated from. When calculating the endpoints for a &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/" rel="noopener noreferrer"&gt;Service&lt;/a&gt;, the EndpointSlice controller considers the topology (region and zone) of each endpoint and populates the hints field to allocate it to a zone. Cluster components such as &lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/" rel="noopener noreferrer"&gt;kube-proxy&lt;/a&gt; can then consume those hints, and use them to influence how the traffic is routed (favoring topologically closer endpoints).&lt;/p&gt;

&lt;p&gt;You can enable Topology Aware Routing for a Service by setting the service.kubernetes.io/topology-mode annotation to Auto. When there are enough endpoints available in each zone, Topology Hints will be populated on EndpointSlices to allocate individual endpoints to specific zones, resulting in traffic being routed closer to where it originated from.&lt;/p&gt;
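&lt;p&gt;Enabling it is a one-line annotation on the Service; the service and app names below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: Auto # populate zone hints on EndpointSlices
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;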

&lt;p&gt;When using Horizontal Pod Autoscaler, topology spread constraints ensure newly created pods are spread among AZs during scaling out. However, when scaling in, the deployment controller won’t consider AZ balance, and instead randomly terminates pods. This may cause the endpoints in each AZ to be disproportionate and disable Topology Aware Routing. The &lt;a href="https://github.com/kubernetes-sigs/descheduler" rel="noopener noreferrer"&gt;descheduler&lt;/a&gt; tool can help you re-balance pods by evicting improperly placed pods so that the Kubernetes scheduler can reschedule them with the appropriate constraints in effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topology Aware Routing and Node Affinity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While node affinity is &lt;em&gt;not required&lt;/em&gt; for &lt;em&gt;basic&lt;/em&gt; topology-aware routing, it becomes very useful in the following scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More Control Over Pod Placement:&lt;/strong&gt; Node affinity gives you fine-grained control over &lt;em&gt;where&lt;/em&gt; pods are initially scheduled. While basic topology-aware routing ensures traffic is preferentially routed to the same zone, it doesn’t guarantee that pods will be evenly distributed across zones. Node affinity allows you to express preferences or requirements for pod placement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Taints and Tolerations:&lt;/strong&gt; If you use taints to restrict pods to specific nodes (e.g., GPU nodes), you’ll need tolerations in your pods to allow them to run on those nodes. Node affinity can then be used to further refine placement within those tainted nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's consider the following scenario: &lt;strong&gt;Regional Redundancy with Specialized Hardware for your ML Training job&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Regional Redundancy:&lt;/strong&gt; Your application must be resilient to zone failures. You want pods to be spread across at least three availability zones in a region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Hardware:&lt;/strong&gt; Your training jobs require GPUs. You have a set of nodes in each zone equipped with GPUs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Locality (Performance):&lt;/strong&gt; For performance reasons, you want training jobs to run on GPU nodes &lt;em&gt;in the same zone&lt;/em&gt; where the training data resides (assuming data locality is a factor in your application’s architecture).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this scenario, basic Topology Aware Routing is not sufficient; you also need to combine it with node affinity for the specialized node types, matching node labels as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-app
spec:
  replicas: 9 # 3 replicas per zone (adjust as needed)
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      affinity:
        nodeAffinity: # Ensure pods are on GPU nodes
          requiredDuringSchedulingIgnoredDuringExecution: # Must be on a GPU node
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu # Label on your GPU nodes (e.g., nvidia.com/gpu)
                operator: Exists
          preferredDuringSchedulingIgnoredDuringExecution: # Preference for same zone
          - weight: 100
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - $(ZONE) # Placeholder for zone label
      topologySpreadConstraints: # Ensure spread across zones
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: ml-training
      containers:
      # ... (your container definition, including the ZONE environment variable injection as before)
      initContainers:
      # ... (init container for ZONE injection as before)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Pod Disruption Budget (PDB)
&lt;/h4&gt;

&lt;p&gt;A &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/" rel="noopener noreferrer"&gt;pod disruption budget (PDB)&lt;/a&gt; is a Kubernetes policy that helps ensure the high availability of your applications running on the platform. It defines the minimum number of pods from a specific deployment that must be available at any given time. This ensures that even during maintenance operations or unforeseen disruptions, your application remains functional with minimal downtime.&lt;/p&gt;

&lt;p&gt;Here’s a breakdown of how PDBs work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimum Available Pods:&lt;/strong&gt; You define the minimum number of pods that your application needs to function properly. This minimum acceptable number is specified in the PDB configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voluntary Disruptions:&lt;/strong&gt; PDBs primarily target voluntary disruptions, which are planned events initiated by the cluster administrator or automated processes. These disruptions could include:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node Drain:&lt;/strong&gt; Taking a node out of service for maintenance requires draining the pods running on it. A PDB can prevent evictions from exceeding a safe limit to ensure application functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling Updates:&lt;/strong&gt; Upgrading deployments often involves rolling restarts where new pods are introduced while old ones are terminated. A PDB can pace this rollout to avoid overwhelming the system.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not for Involuntary Disruptions:&lt;/strong&gt; PDBs don’t have control over involuntary disruptions caused by hardware failures, network issues, or software crashes. These events can still cause your application to become unavailable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benefits of using Pod Disruption Budgets:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Availability:&lt;/strong&gt; PDBs minimize downtime during planned maintenance or upgrades by preventing accidental pod evictions beyond a safe threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Efficiency:&lt;/strong&gt; They provide a safety net for cluster administrators, allowing them to perform maintenance tasks with confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Optimization:&lt;/strong&gt; PDBs can help prevent unnecessary pod evictions, leading to more efficient resource utilization within the cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Here are some additional points to remember:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDB Definition:&lt;/strong&gt; PDBs are defined as YAML or JSON objects and applied using the kubectl apply command.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disruption Budget:&lt;/strong&gt; The budget can be specified as an absolute number of pods or a percentage of the total replicas in the deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always Allow Unhealthy Pod Eviction:&lt;/strong&gt; It’s recommended to set unhealthyPodEvictionPolicy: AlwaysAllow in your PDB spec. This allows eviction of misbehaving (unhealthy) pods during a node drain to proceed without waiting for them to become healthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing &amp;amp; Simulation:&lt;/strong&gt; Test PDB configurations thoroughly to ensure they align with your application’s availability requirements and architecture. Simulate disruptions and verify that the desired number of Pods remains available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor and Alert:&lt;/strong&gt; Implement monitoring and alerting mechanisms to detect PDB violations. This enables proactive management and ensures timely intervention in case of availability issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful Shutdowns:&lt;/strong&gt; Configure your applications to handle graceful shutdowns when evicted by a PDB. This allows them to complete ongoing tasks, release resources, and avoid data loss or corruption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing pod disruption budgets, you can enhance the resilience and availability of your applications running on Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Pod Disruption Budget (PDB) in Kubernetes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hello-world
spec:
  selector:
    matchLabels:
      app: hello-world
  minAvailable: 30%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pod Disruption Budgets (PDBs) work alongside deployment strategies like rolling updates with maxSurge and maxUnavailable settings to manage application availability during upgrades.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pod Disruption Budget (PDB) and Rolling Update Strategy
&lt;/h4&gt;

&lt;p&gt;You don’t necessarily need both a pod disruption budget (PDB) and a rolling update strategy, but they can work together effectively to achieve different goals for application availability in Kubernetes. They can be complementary for robust application availability during both maintenance and upgrades:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDB sets a safety floor:&lt;/strong&gt; It defines the minimum number of pods that must be available even during a rolling update with maxUnavailable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling update controls the rollout:&lt;/strong&gt; It manages the pace of pod replacement within the boundaries set by the PDB’s minimum availability.&lt;/li&gt;
&lt;li&gt;While maxSurge and maxUnavailable define the deployment's rollout strategy, a PDB sets a &lt;strong&gt;hard minimum&lt;/strong&gt; on the number of available pods. To avoid conflicts, this minimum should be compatible with the rollout, i.e., at most the replica count minus maxUnavailable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scenario:&lt;/strong&gt; Let’s say your deployment has 5 replicas, maxSurge is set to 1 (allowing 1 extra pod), and maxUnavailable is set to 2 (allowing 2 pods to be unavailable).

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without PDB:&lt;/strong&gt; The deployment could potentially terminate 2 pods and create 1 new one, leaving only 3 pods available (5 total - 2 unavailable) while the surge pod starts up, and a concurrent node drain could push availability even lower. This might violate application requirements for minimum availability.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With PDB:&lt;/strong&gt; If a PDB is defined with a minimum of 3 available pods, voluntary evictions can only take the deployment down to 3 running pods (5 total - 2 disruptable). This ensures your application remains functional with at least 3 pods even during the update.
&lt;/li&gt;
&lt;li&gt;In case of a conflict between maxUnavailable and the PDB's minimum available pods, the stricter setting takes precedence. This ensures the PDB's minimum availability requirement is met.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Keep the PDB’s minimum available pods at or below the replica count minus maxUnavailable. A stricter PDB can stall node drains and slow the deployment’s rolling updates, because evictions stay blocked until replacement pods become ready.&lt;/li&gt;

&lt;li&gt;Using percentages for both PDB’s minimum available and deployment’s maxUnavailable allows for flexibility as you scale your application.&lt;/li&gt;

&lt;/ul&gt;
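&lt;p&gt;As a sketch of the scenario above, the PDB and the deployment’s rollout strategy might look like this (names and labels are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 3        # voluntary evictions are blocked below 3 ready pods
  selector:
    matchLabels:
      app: my-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # allow 1 extra pod during the rollout
      maxUnavailable: 2  # at most 2 pods may be unavailable
  # ... selector, template, and other deployment specifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;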

&lt;h4&gt;
  
  
  Pod Priority and Preemption
&lt;/h4&gt;

&lt;p&gt;Kubernetes supports prioritizing your pods when it comes to scheduling. A critical workload or job can be marked as a higher-priority pod, increasing its chances of being scheduled ahead of lower-priority pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 10000
globalDefault: false # Not the default priority class
description: "This priority class is for high-priority pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: my-high-priority-app
spec:
  priorityClassName: high-priority # Assign the priority class
  containers:
  - name: my-container
    image: my-image
    # ... other pod specifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important factors:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource quotas&lt;/strong&gt; can interact with pod priorities. A higher-priority pod might still be subject to resource quotas.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PodDisruptionBudgets&lt;/strong&gt; can protect even low-priority pods from being preempted if they are part of a critical application.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Quality of Service (QoS)
&lt;/h4&gt;

&lt;p&gt;QoS classes define how Kubernetes handles resource requests and limits for pods. They influence how Kubernetes schedules pods and how it handles resource contention. Kubernetes automatically assigns a QoS class to a pod based on its resource requests and limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QoS Classes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guaranteed: Pods that have both resource requests and limits specified for all containers, and the requests are equal to the limits. These pods are given the highest priority for resources. They are less likely to be evicted.&lt;/li&gt;
&lt;li&gt;Burstable: Pods that have resource requests and limits specified, but the requests are less than the limits. They can "burst" up to their limits if resources are available. They are more likely to be evicted than Guaranteed pods.&lt;/li&gt;
&lt;li&gt;BestEffort: Pods that do &lt;em&gt;not&lt;/em&gt; have resource requests or limits specified. They are given the lowest priority for resources and are the most likely to be evicted.&lt;/li&gt;
&lt;/ul&gt;
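&lt;p&gt;As an illustrative sketch, the resource settings below would place pods in the Guaranteed and Burstable classes respectively (names and values are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Guaranteed: requests equal limits for every container
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: my-image
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
---
# Burstable: requests below limits, so the pod can burst up to the limit
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: my-image
    resources:
      requests:
        cpu: "250m"
        memory: "128Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;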

&lt;p&gt;&lt;strong&gt;QoS and Pod Priority&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When scheduling, Kubernetes considers pod priority first: higher-priority pods are scheduled ahead of lower-priority ones. Within a priority level, Kubernetes uses QoS classes to determine how to allocate resources, giving preference to Guaranteed pods. When resources are scarce, Kubernetes uses both pod priority and QoS to decide which pods to evict. Lower-priority pods are evicted first. Within a priority level, BestEffort pods are evicted before Burstable pods, and Burstable pods are evicted before Guaranteed pods.&lt;/p&gt;

&lt;p&gt;As a starting point, make sure to set appropriate resource requests and limits for your pods. This is crucial for QoS and for ensuring that your applications have the resources they need. Use the Guaranteed QoS class for critical system pods or applications that require predictable performance. Burstable is often a good balance for most applications, allowing them to burst up to their limits when resources are available. However, only use BestEffort for non-critical pods or background tasks that can tolerate resource scarcity and potential eviction.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pod Affinity and Node Affinity
&lt;/h4&gt;

&lt;p&gt;Pod affinity and node affinity are Kubernetes features that allow you to control how pods are scheduled onto nodes. They are essential tools for building resilient applications on EKS (or any Kubernetes cluster).&lt;/p&gt;

&lt;p&gt;Pod affinity allows you to specify rules about &lt;em&gt;where&lt;/em&gt; a pod should be scheduled based on &lt;em&gt;other pods&lt;/em&gt; that are already running in the cluster. You can use it to attract pods to the same node or zone as other pods (co-location) or to repel pods from the same node or zone (anti-affinity).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Co-location (Attraction):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Placing related pods (e.g., a web server and its database) on the same node to reduce latency.&lt;/li&gt;
&lt;li&gt;Ensuring that pods that communicate frequently are located close to each other.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Anti-affinity (Repulsion):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spreading replicas of a pod across different nodes (or zones) to increase availability and fault tolerance. If one node fails, the other replicas will continue to run.&lt;/li&gt;
&lt;li&gt;Preventing pods that consume a lot of resources from being scheduled on the same node and avoiding the bad-neighbor effect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Types of Pod Affinity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requiredDuringSchedulingIgnoredDuringExecution: The rule &lt;em&gt;must&lt;/em&gt; be satisfied during scheduling. If the rule cannot be met, the pod will not be scheduled. IgnoredDuringExecution means that if the affinity rule becomes violated &lt;em&gt;after&lt;/em&gt; the pod is scheduled (e.g., the other pod is terminated), the pod will continue to run but will not be rescheduled if it is evicted.&lt;/li&gt;
&lt;li&gt;preferredDuringSchedulingIgnoredDuringExecution: The rule is &lt;em&gt;preferred&lt;/em&gt; but not required. Kubernetes will try to satisfy the rule, but if it cannot, the pod will still be scheduled (potentially on a different node). IgnoredDuringExecution has the same meaning as above.&lt;/li&gt;
&lt;li&gt;requiredDuringSchedulingRequiredDuringExecution and preferredDuringSchedulingRequiredDuringExecution: Planned variants in which the rule &lt;em&gt;must&lt;/em&gt; continue to be satisfied during execution, with pods evicted if the rule is later violated. Note that Kubernetes has not implemented the RequiredDuringExecution variants yet; only the two IgnoredDuringExecution forms above are currently available.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname # Spread across nodes
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-app
  # ... other pod specifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Node Affinity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Node affinity allows you to specify rules about &lt;em&gt;which nodes&lt;/em&gt; a pod should be scheduled on, based on &lt;em&gt;labels&lt;/em&gt; attached to the nodes. For instance, you can schedule pods on nodes with specific hardware (e.g., GPUs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Node Affinity:&lt;/strong&gt; Similar to pod affinity, you have requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution (the RequiredDuringExecution variants are planned but not yet implemented).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: Exists # Node must have the 'gpu' label
  # ... other pod specifications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anti-affinity (both pod and node) is crucial for high availability. By spreading your pods across nodes and availability zones, you ensure that your application can survive node or zone failures: if a node fails, the pods running on other nodes continue to serve traffic. Affinity can also help you optimize resource utilization by co-locating related pods and ensuring that pods are scheduled on nodes with the appropriate resources. Co-locating pods that communicate frequently can reduce latency and improve performance. If you use node affinity, you might also need to use tolerations to allow pods to be scheduled on nodes with taints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: Prefer preferred over required.&lt;/strong&gt; Unless it's essential, use preferred affinity rules instead of required rules. Required rules can make it impossible to schedule pods if the constraints cannot be met.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1ky8s665w6eseywlkso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1ky8s665w6eseywlkso.png" width="800" height="914"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Horizontal Pod Auto Scaling (HPA)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;HPA&lt;/a&gt; is one of the best knobs to control in the Data Plane that allows scaling your deployments horizontally based on metrics. You have out-of-the-box metrics like CPU and Memory available that you could scale on or even combine metrics for the scaler. In addition custom metrics relevant to your applications such as connection_pool, requests_per_second etc. could be configured that may provide a better trigger for your applications to inform the scaler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tune behavior for HPA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HPA supports &lt;strong&gt;scaleUp&lt;/strong&gt; and &lt;strong&gt;scaleDown&lt;/strong&gt; behaviors that can be configured per your needs through scaling policies. One or more scaling policies can be specified in the &lt;strong&gt;behavior&lt;/strong&gt; section of the spec. When multiple policies are specified, the policy that allows the highest amount of change is selected by default. The &lt;strong&gt;stabilizationWindowSeconds&lt;/strong&gt; setting restricts &lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#flapping" rel="noopener noreferrer"&gt;flapping&lt;/a&gt; of the replica count when the scaling metrics keep fluctuating, and it is important to get it right for your workloads.&lt;/p&gt;

&lt;p&gt;Note: When HPA is enabled, it is recommended that the value of &lt;strong&gt;spec.replicas&lt;/strong&gt; of the Deployment and / or StatefulSet be removed from their &lt;a href="https://kubernetes.io/docs/reference/glossary/?all=true#term-manifest" rel="noopener noreferrer"&gt;manifest(s)&lt;/a&gt;. If this isn't done, any time a change to that object is applied, this will instruct Kubernetes to scale the current number of Pods to the value of the spec.replicas key. This may not be desired and could be troublesome when an HPA is active, resulting in thrashing or flapping behavior.&lt;/p&gt;
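&lt;p&gt;For illustration, the behavior section could be configured roughly as follows (the target deployment, thresholds, and windows are placeholders to tune per workload):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to load spikes immediately
      policies:
      - type: Percent
        value: 100        # at most double the replicas...
        periodSeconds: 60 # ...per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
      - type: Pods
        value: 1          # remove at most 1 pod...
        periodSeconds: 60 # ...per minute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;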

&lt;h4&gt;
  
  
  Vertical Pod Auto Scaling (VPA)
&lt;/h4&gt;

&lt;p&gt;The Vertical Pod Autoscaler (VPA) is a Kubernetes component that automatically adjusts the CPU and memory requests and limits of your pods. It analyzes the historical resource consumption of your pods such as peak usage, average usage, and resource usage trends, and then recommends or optionally can apply new resource settings to optimize resource utilization and improve overall cluster efficiency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pair this with the Pod Disruption Budget as VPA’s Updater component could apply recommendations that require pod restarts to take effect.&lt;/li&gt;
&lt;li&gt;Start in Recommender-only mode: review the recommendations and carefully monitor your application before letting VPA apply changes automatically.&lt;/li&gt;
&lt;li&gt;A good starting point would be with either HPA or VPA, depending on your needs. If you need to scale based on traffic, use HPA. If you need to optimize per-pod resources, use VPA.&lt;/li&gt;
&lt;li&gt;In some specific use cases combining both may be required but note that contradicting recommendations for the scalers could make it complex to manage and troubleshoot. For instance, If both HPA and VPA are set to scale based on CPU or memory usage, they might contradict each other, leading to inefficient resource allocation. When using both HPA and VPA, consider using custom metrics for HPA to avoid conflicts with VPA’s resource adjustments.&lt;/li&gt;
&lt;/ul&gt;
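&lt;p&gt;A sketch of a VPA in recommendation-only mode (names and bounds are placeholders; the VPA components must be installed in the cluster):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # recommend only; do not evict/restart pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;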

&lt;h4&gt;
  
  
  Multi-Dimensional Pod Auto Scaler (Feature Request for EKS)
&lt;/h4&gt;

&lt;p&gt;This is a &lt;a href="https://github.com/aws/containers-roadmap/issues/2051" rel="noopener noreferrer"&gt;feature request&lt;/a&gt; for EKS, and it would be great to see it supported in a future release: a single autoscaler would spare EKS users from juggling HPA and VPA and would eliminate contradicting scaling decisions.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Kubernetes Event-driven Autoscaling&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://keda.sh/" rel="noopener noreferrer"&gt;KEDA&lt;/a&gt; (Kubernetes Event-driven Autoscaling) plays a crucial role in enhancing resilience on EKS (and Kubernetes in general) by enabling your applications to automatically scale based on various event triggers, rather than just CPU or memory usage. This event-driven autoscaling is key to building more responsive, robust, and cost-effective systems. You could certainly argue that custom metrics may come in handy and may suffice. However, when considering Event-Driven architectures, KEDA specifically shines in this space as it can scale based on a variety of &lt;a href="https://keda.sh/docs/2.16/concepts/#event-sources-and-scalers" rel="noopener noreferrer"&gt;scalers&lt;/a&gt; (e.g., RabbitMQ, Kafka, AWS SQS, PostgreSQL, Datadog, and many more), ensuring that your application can process messages as they arrive at scale. Combined with HPA, KEDA could prove to be a powerful tool in your kit to build resilient workloads!&lt;/p&gt;

&lt;h4&gt;
  
  
  Managing Computational Resources
&lt;/h4&gt;

&lt;p&gt;Running resilient workloads also means thinking about your current computational resource utilization, requests, limits, quotas, QoS, and more. Please do check out &lt;a href="https://hmh.engineering/dive-into-managing-kubernetes-computational-resources-73283c048360" rel="noopener noreferrer"&gt;Dive into managing Kubernetes computational resources&lt;/a&gt;, which we published on this topic earlier. It dives into a lot more detail and helps you gain insights into computational resources.&lt;/p&gt;

&lt;h4&gt;
  
  
  Self Healing with Kubernetes Probes
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/" rel="noopener noreferrer"&gt;Kubernetes Probes&lt;/a&gt; allows Kubernetes to monitor the health and readiness of your pods and take action when issues arise, ensuring your application remains available and responsive. Needless to say how important probes are for running resilient and self-healing workloads. It is a good practice to configure these to strike a balance between speed and reliability, as you don't want to configure thresholds too small that it takes several restarts to start one, nor do you want to bump thresholds up too much that delays traffic being routed to the pod that has been ready for a while. Highly recommend reading &lt;a href="https://hmh.engineering/dive-into-kubernetes-healthchecks-part-1-73a900fa6dbd" rel="noopener noreferrer"&gt;Dive into Kubernetes Healthchecks&lt;/a&gt; (2-part series) published earlier that will help you gain a solid understanding of the various probes and their impact on your workloads.&lt;/p&gt;

&lt;h4&gt;
  
  
  Run Lean or Distroless images
&lt;/h4&gt;

&lt;p&gt;Running lean images or distroless images plays a significant role in EKS resiliency by improving security, reducing resource consumption, and speeding up deployments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lean images and distroless images contain only the essential components needed to run your application. They eliminate unnecessary libraries, tools, and system utilities that might be present in traditional base images. This significantly reduces the attack surface, minimizing potential vulnerabilities that attackers could exploit. Lean images also make it easier to audit your container images and ensure that they comply with your security policies.&lt;/li&gt;
&lt;li&gt;Lean images and distroless images are significantly smaller in size compared to traditional base images and hence can be pulled more quickly from container registries, reducing the time it takes to deploy your application. This is especially important for scaling and rolling updates. Lean images often have a smaller memory footprint, allowing your applications to run more efficiently and potentially allowing you to run more pods on the same node.&lt;/li&gt;
&lt;li&gt;Smaller images lead to faster image pulls, which speeds up the deployment process and often results in faster startup times because there are fewer components to initialize. This can be crucial for applications that need to scale quickly, for microservices that are deployed frequently, and for getting the most out of HPA.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Use a Service Mesh/Service Network
&lt;/h4&gt;

&lt;p&gt;Service meshes like &lt;a href="https://istio.io/" rel="noopener noreferrer"&gt;Istio&lt;/a&gt;, &lt;a href="https://www.consul.io/" rel="noopener noreferrer"&gt;Consul&lt;/a&gt;, and &lt;a href="http://linkerd.io/" rel="noopener noreferrer"&gt;Linkerd&lt;/a&gt;, or a &lt;a href="https://docs.aws.amazon.com/vpc-lattice/latest/ug/service-networks.html" rel="noopener noreferrer"&gt;service network&lt;/a&gt; like VPC Lattice, enable service-to-service communication and increase the observability and resiliency of your microservices network. Most service mesh products work by running a small network proxy alongside each service that intercepts and inspects the application’s network traffic, so you can place your application in a mesh without modifying it. Using the service proxy’s built-in features, you can generate network statistics, create access logs, add HTTP headers to outbound requests for distributed tracing, enable automatic request retries, timeouts, circuit-breaking, and rate-limiting, and improve your security posture.&lt;/p&gt;

&lt;p&gt;Service Networks such as VPC Lattice have a different architecture that does not require one to configure proxies or sidecars. Instead, VPC Lattice provides a managed control plane and data plane, eliminating the need for additional components within your Pods.&lt;/p&gt;

&lt;h4&gt;
  
  
  Circuit Breaking, Retry, and Backoff
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/circuit-breaker.html" rel="noopener noreferrer"&gt;Circuit breakers&lt;/a&gt; are a powerful technique to improve resilience. By preventing additional connections or requests to an overloaded service, circuit breakers limit the &lt;a href="https://www.ibm.com/garage/method/practices/manage/practice_limited_blast_radius" rel="noopener noreferrer"&gt;“blast radius”&lt;/a&gt; of an overloaded service. The circuit-breaker pattern could be applied within your application that communicates with various upstream services, and or at the ingress controller or the service mesh if one was supported. This should also be paired with an appropriate Retry/Backoff mechanism where Retry/Backoff attempts to recover from temporary issues, and the circuit breaker steps in to prevent further attempts when the problem is more persistent. This combination makes your system more resilient to failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Well-Architected Framework
&lt;/h3&gt;

&lt;p&gt;I highly recommend considering the &lt;a href="https://aws.amazon.com/architecture/well-architected/?wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&amp;amp;wa-lens-whitepapers.sort-order=desc&amp;amp;wa-guidance-whitepapers.sort-by=item.additionalFields.sortDate&amp;amp;wa-guidance-whitepapers.sort-order=desc" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt;, which provides a structured approach to designing and operating secure, high-performing, resilient, and efficient infrastructure for your applications and workloads in the AWS cloud. By adhering to its principles, you can build systems that are better equipped to withstand failures, recover quickly from disruptions, and provide consistent availability to your users. It’s about designing for failure, not just hoping it won’t happen. Note that it’s not a managed service but a conceptual model that provides a consistent approach to evaluating and improving your cloud architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilience Assessment and Design For Failure
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AgzsmQoxCSu9ndX-8" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AgzsmQoxCSu9ndX-8" width="800" height="570"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Francisco De Legarreta C. on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is best to assume that anything that can go wrong will go wrong&lt;/p&gt;

&lt;p&gt;The National Academy of Sciences defines resilience as “&lt;em&gt;the ability to prepare and plan for, absorb, recover from, or more successfully adapt to actual or potential events&lt;/em&gt;”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's always a good practice to test how your design choices hold up when failures are induced. &lt;a href="https://principlesofchaos.org/" rel="noopener noreferrer"&gt;Chaos engineering&lt;/a&gt; can be used to validate the effectiveness of design choices made with &lt;strong&gt;designing for failure&lt;/strong&gt; in mind. Chaos can be induced into your workloads and clusters in various ways, ranging from node restarts and pod restarts to triggering failover. It is also a good idea to use something like the &lt;a href="https://aws.amazon.com/fis/" rel="noopener noreferrer"&gt;AWS Fault Injection Service&lt;/a&gt; in combination with Behavior Driven Development (BDD) or similar for managing your experiments and, most importantly, orchestrating the expected behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Resilience Hub
&lt;/h3&gt;

&lt;p&gt;Leverage AWS Resilience Hub to manage and improve the resilience posture of your applications on AWS. AWS Resilience Hub enables you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the &lt;a href="https://aws.amazon.com/architecture/well-architected/" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt;. Within AWS Resilience Hub, you can also create and run &lt;a href="https://aws.amazon.com/fis/" rel="noopener noreferrer"&gt;AWS Fault Injection Service&lt;/a&gt; (AWS FIS) experiments, which mimic real-life disruptions to your application to help you better understand dependencies and uncover potential weaknesses. &lt;strong&gt;AWS Resilience Hub provides you with the services and tooling you need to continuously strengthen your resilience posture, all in a single place.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Secure your traffic with AWS WAF and AWS Shield
&lt;/h3&gt;

&lt;p&gt;WAFs are a critical component of resilient workloads. They protect against application-level attacks, enhance availability, support incident response, enable safe deployments, and help organizations meet compliance requirements. While not a replacement for other security measures, a properly configured and maintained WAF provides a valuable layer of defense for any publicly accessible web application on EKS.&lt;/p&gt;

&lt;p&gt;While &lt;a href="https://aws.amazon.com/shield/" rel="noopener noreferrer"&gt;AWS Shield&lt;/a&gt; provides DDoS protection at the network layer, &lt;a href="https://aws.amazon.com/waf/" rel="noopener noreferrer"&gt;AWS WAF&lt;/a&gt; can protect against application-layer DDoS attacks. Combining these AWS solutions with EKS can help harden your services, and improve your availability and resilience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Golden Signals
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;Golden Signals&lt;/a&gt; are &lt;em&gt;high-level indicators&lt;/em&gt; of service health and performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AFH7SU17GmDADszX0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AFH7SU17GmDADszX0" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Mikail McVerry on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Data plane observability provides the &lt;em&gt;underlying data&lt;/em&gt; that feeds these signals. By monitoring the Golden Signals, you can gain a quick understanding of the state of your services and take action to ensure their reliability and resilience. Some key aspects for operating and managing production workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Setting SLOs (Service Level Objectives):&lt;/strong&gt; Define target values for your Golden Signals (e.g., “99.9% of requests should have a latency of less than 200ms”). These SLOs become your key performance indicators (KPIs) for service reliability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting:&lt;/strong&gt; Set up alerts based on your SLOs. If a Golden Signal deviates significantly from its target, you’ll be notified so you can investigate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Troubleshooting:&lt;/strong&gt; When an issue occurs, use the Golden Signals to quickly understand the impact. For example, if you see a spike in latency, you can investigate further using traces and logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity Planning:&lt;/strong&gt; Use traffic and saturation metrics to understand your capacity needs and plan for future growth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimization:&lt;/strong&gt; Identify bottlenecks by analyzing latency and saturation metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling:&lt;/strong&gt; Use appropriate tooling such as APM to provide end-to-end observability and traceability for your infrastructure. &lt;a href="https://aws.amazon.com/xray/" rel="noopener noreferrer"&gt;AWS X-Ray&lt;/a&gt; is a good choice and an invaluable tool for enhancing observability in your EKS environment. It provides detailed distributed tracing, improves your understanding of Golden Signals, and ultimately contributes to building more resilient and performant applications. By enabling faster root cause analysis, proactive issue detection, and performance optimization, X-Ray empowers you to operate your workloads more effectively and ensure a better user experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;As you may have figured out by now, operating resilient workloads requires quite a bit of planning and effort. Of course, EKS makes life easier with EKS Auto Mode and Karpenter, but one still has to focus on optimizing and securing the data plane. I hope this article leaves you with a solid understanding of the knobs you can turn to make your critical services highly available and resilient! Lastly, I’m leaving a bunch of reference materials that inspired this article and may come in handy along your journey of operating resilient workloads on Kubernetes. Cheers!&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/containers/operating-resilient-workloads-on-amazon-eks/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/containers/operating-resilient-workloads-on-amazon-eks/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/running-containerized-microservices/design-for-failure.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/whitepapers/latest/running-containerized-microservices/design-for-failure.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws-observability.github.io/observability-best-practices/recipes/eks/" rel="noopener noreferrer"&gt;https://aws-observability.github.io/observability-best-practices/recipes/eks/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/architecture/behavior-driven-chaos-with-aws-fault-injection-simulator/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/architecture/behavior-driven-chaos-with-aws-fault-injection-simulator/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/containers/managing-pod-scheduling-constraints-and-groupless-node-upgrades-with-karpenter-in-amazon-eks/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/containers/managing-pod-scheduling-constraints-and-groupless-node-upgrades-with-karpenter-in-amazon-eks/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/eks/auto-mode/" rel="noopener noreferrer"&gt;https://aws.amazon.com/eks/auto-mode/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/containers/blue-green-or-canary-amazon-eks-clusters-migration-for-stateless-argocd-workloads/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/containers/blue-green-or-canary-amazon-eks-clusters-migration-for-stateless-argocd-workloads/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/best-practices/cas.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/eks/latest/best-practices/cas.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/best-practices/application.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/eks/latest/best-practices/application.html&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks to &lt;a href="https://medium.com/u/70b89d28e6a7" rel="noopener noreferrer"&gt;Jonathan Dawson&lt;/a&gt; for the feedback on this article!&lt;/p&gt;




</description>
      <category>aws</category>
      <category>resilience</category>
      <category>kubernetes</category>
      <category>awseks</category>
    </item>
    <item>
      <title>GP2 to GP3 for AWS RDS Postgres</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Fri, 21 Feb 2025 05:01:56 +0000</pubDate>
      <link>https://forem.com/aws-builders/gp2-to-gp3-for-aws-rds-postgres-18lm</link>
      <guid>https://forem.com/aws-builders/gp2-to-gp3-for-aws-rds-postgres-18lm</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64bgicz8fryls5ttzvw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64bgicz8fryls5ttzvw0.png" alt="AWS GP3 Storage" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AWS GP3&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;Recently I had an opportunity to work on a migration of RDS Postgres storage from GP2 to GP3 for a large database. The migration was motivated mostly by potential performance improvements: getting past throughput limits on GP2, better IOPS, and, of course, getting onto a more modern storage architecture offered by AWS at a lower cost. This post highlights some challenges you may encounter during the migration process that are not obvious until you run into them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration Options
&lt;/h3&gt;

&lt;p&gt;Depending on the size of your database and your migration window and strategy, you may choose one of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modify Storage Type:&lt;/strong&gt; This is the simplest approach. You can directly modify the storage type of your primary instance to GP3 through the AWS Management Console, CLI, or SDK. It’s generally non-disruptive, with minimal performance impact during the conversion. Use this option when:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need a simple, non-disruptive upgrade:&lt;/strong&gt; This method is the easiest way to convert your storage and incurs minimal downtime. It’s suitable for most GP2 to GP3 migration scenarios.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downtime tolerance is low:&lt;/strong&gt; The modification is typically non-disruptive and in place, with only a brief period of potential performance impact while the underlying storage configuration is adjusted (Storage Optimization).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your database size is manageable:&lt;/strong&gt; Modifying the storage type works well for databases of various sizes. For very large databases, however, a snapshot/restore approach might offer more control over the migration process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Migration Service (DMS):&lt;/strong&gt; Use DMS to migrate your data from the GP2 instance to a newly created GP3 instance. This is flexible for complex migrations and minimizes downtime, but it requires more configuration, may incur additional costs, and can put load on the primary depending on how it is set up. Ways to reduce that impact:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replication task scheduling:&lt;/strong&gt; Schedule the initial data load for off-peak hours when the writer’s workload is lighter.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth throttling:&lt;/strong&gt; Limit the amount of data DMS reads from the source per unit of time to minimize performance impact.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC (Change Data Capture):&lt;/strong&gt; For ongoing replication, DMS uses CDC techniques to capture only the changes made to the source data, reducing the load compared to full table scans.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pglogical:&lt;/strong&gt; Can be used as a plugin for &lt;a href="https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.Security" rel="noopener noreferrer"&gt;Postgres Logical Replication&lt;/a&gt; with DMS, reducing the impact on the writer further.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot and Restore:&lt;/strong&gt; If you have a very strict downtime window and modifying the storage type directly isn’t feasible because of the brief outage it can cause, snapshot and restore offers more control. Create a snapshot of the GP2 volume during one maintenance window, then restore it to a newly provisioned GP3 instance during another. This minimizes downtime on the primary instance by performing the restoration on a separate instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replica with GP3 storage:&lt;/strong&gt; While keeping your primary instance on GP2, adding a replica with GP3 storage can be a viable strategy for transitioning to GP3. Some considerations:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; This approach requires additional instance(s), which adds cost.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rehydration:&lt;/strong&gt; Lazy loading from S3 to EBS can be a significant factor; you may need tooling to speed up rehydration and/or AWS support tickets to track its status.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
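&lt;p&gt;As a quick sanity check on the motivation, the baseline numbers can be compared directly. Here is a minimal sketch using the published EBS figures (3 IOPS per GiB for gp2 within a 100 to 16,000 range; a flat 3,000-IOPS baseline for gp3 that is provisionable independently of size). RDS adds its own wrinkles such as volume striping, so treat these as approximations:&lt;/p&gt;

```python
def gp2_baseline_iops(size_gib: int) -> int:
    # gp2 scales at 3 IOPS per GiB, with a 100 IOPS floor and a 16,000 ceiling
    return min(max(3 * size_gib, 100), 16_000)

def gp3_baseline_iops(size_gib: int) -> int:
    # gp3 delivers a flat 3,000 IOPS baseline regardless of capacity
    # (provisionable up to 16,000 IOPS independently of volume size)
    return 3_000

for size_gib in (100, 500, 1_000, 4_000):
    print(f"{size_gib:>5} GiB: gp2={gp2_baseline_iops(size_gib):>6} IOPS, "
          f"gp3={gp3_baseline_iops(size_gib):>6} IOPS")
```

&lt;p&gt;Below roughly 1,000 GiB, gp3’s baseline alone beats gp2’s; beyond that, gp3 still wins on cost and lets you provision IOPS and throughput independently of volume size.&lt;/p&gt;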

&lt;h4&gt;
  
  
  Some scenarios and comparisons
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gmdjee47fxjmhz0qeta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gmdjee47fxjmhz0qeta.png" width="800" height="907"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hydration after restore
&lt;/h3&gt;

&lt;p&gt;For Amazon RDS instances that are restored from snapshots (automated and manual), the instances are made available as soon as the needed infrastructure is provisioned. However, there is an ongoing process that continues to copy the storage blocks from Amazon S3 to the EBS volume; this is called &lt;strong&gt;&lt;em&gt;lazy loading&lt;/em&gt;&lt;/strong&gt;. While lazy loading is in progress, I/O operations might need to wait for the blocks being accessed to be first read from Amazon S3. This causes increased I/O latency, which doesn’t always have an impact on applications using the Amazon RDS instance. If you want to reduce any slowness due to hydration, read all the data blocks as soon as the restore is complete.&lt;/p&gt;

&lt;h4&gt;
  
  
  Mitigating Effects of Lazy Loading
&lt;/h4&gt;

&lt;p&gt;A few strategies help mitigate the impact of lazy loading or, put differently, help speed it up. For &lt;a href="https://aws.amazon.com/rds/postgresql/" rel="noopener noreferrer"&gt;Amazon RDS for PostgreSQL&lt;/a&gt;, the following options are available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use the pg_prewarm shared library module&lt;/strong&gt; to read through all the tables. Note that pg_prewarm doesn’t pre-fetch the following:
&lt;ul&gt;
&lt;li&gt;TOAST tables [RDS limitation]: no workaround
&lt;/li&gt;
&lt;li&gt;Indexes [pg_prewarm limitation]: no workaround
&lt;/li&gt;
&lt;li&gt;DB objects owned by other users [RDS limitation]: the workaround is to re-run the SQL once as each DB user that owns any table. Useful scripts can be found here:
&lt;a href="https://github.com/robins/PrewarmRDSPostgres/blob/master/singledb.sql" rel="noopener noreferrer"&gt;https://github.com/robins/PrewarmRDSPostgres/blob/master/singledb.sql&lt;/a&gt;
&lt;a href="https://github.com/robins/PrewarmRDSPostgres/blob/master/toast.sql" rel="noopener noreferrer"&gt;https://github.com/robins/PrewarmRDSPostgres/blob/master/toast.sql&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the pg_dump utility&lt;/strong&gt; with the jobs and data-only parameters to perform an export of all application schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Perform an explicit SELECT&lt;/strong&gt; on all the large and heavily used tables individually, with parallelism. For large tables, you may be able to split the query into ranges based on the primary key. For example, the query below yields 4 ranges with an equal number of rows (primary key col1):
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select nt,max(col1),count(*) 
from (SELECT col1, Ntile(4) over(ORDER BY col1) nt FROM testuser.test_table)st 
group by nt 
order by nt;

        NT MAX(COL1) COUNT(*)
---------- ---------- ----------
         1 125000 125000
         2 250000 125000
         3 375000 125000
         4 500000 125000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NOTE: If you decide to use this query, test it before running it in production.&lt;/p&gt;
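&lt;p&gt;The same splitting can also be done client-side before issuing the parallel SELECTs. A small sketch (the table and primary-key names are illustrative) that divides a known min/max key range into contiguous, non-overlapping chunks:&lt;/p&gt;

```python
def pk_ranges(min_pk: int, max_pk: int, n: int):
    """Split [min_pk, max_pk] into n contiguous, non-overlapping ranges."""
    span = max_pk - min_pk + 1
    base, extra = divmod(span, n)
    ranges, start = [], min_pk
    for i in range(n):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Each (lo, hi) pair becomes one worker's
#   SELECT * FROM testuser.test_table WHERE col1 BETWEEN lo AND hi
for lo, hi in pk_ranges(1, 500_000, 4):
    print(lo, hi)
```

&lt;p&gt;Each range can then be prewarmed by a separate connection in parallel.&lt;/p&gt;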

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DMS&lt;/strong&gt;, by contrast, acts as a conduit for transferring data between databases. During a GP2 to GP3 migration using DMS, data is transferred directly from the source GP2 instance to the target GP3 instance. DMS doesn’t stage the entire migrated dataset in S3 as an intermediate step, eliminating the need to rehydrate from S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  PostgreSQL Error Conflict Recovery
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
CONTEXT: SQL statement "select pg_prewarm(temprow.tablename)"
PL/pgSQL function inline_code_block line 6 at SQL statement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://repost.aws/knowledge-center/rds-postgresql-error-conflict-recovery" rel="noopener noreferrer"&gt;Conflict Recovery Error&lt;/a&gt; might occur due to the lack of visibility from the primary instance over the activity that’s happening on the read replica. The conflict with recovery occurs when WAL information can’t be applied on the read replica because the changes might obstruct an activity that’s happening on the read replica.&lt;/li&gt;
&lt;li&gt;Query conflict might happen when a transaction on the read replica is reading tuples that are set for deletion on the primary instance. The deletion of tuples followed by vacuuming on the primary instance causes a conflict with the SELECT query that’s still running on the replica. In this case, the SELECT query on the replica is terminated with the following error message:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  max_standby_archive_delay
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;max_standby_archive_delay&lt;/strong&gt; is a configuration parameter on a PostgreSQL replica instance in RDS that controls how long the replica pauses Write-Ahead Log (WAL) replay to let conflicting queries finish before canceling those queries and applying the pending WAL segment. In other words, it lets you manage the trade-off between data freshness on the replica and accommodating long-running queries on it.&lt;/p&gt;

&lt;p&gt;If WAL data is read from the archive location in Amazon Simple Storage Service (Amazon S3), then use the max_standby_archive_delay parameter.&lt;/p&gt;

&lt;h4&gt;
  
  
  max_standby_streaming_delay
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;max_standby_streaming_delay&lt;/strong&gt; on the replica primarily affects its behavior and data consistency with the primary. However, in high-load scenarios or during failovers, it can have indirect consequences for the writer (primary) due to potential replication lag. For instance, if the replica falls significantly behind because of frequent replay pauses caused by query conflicts, the primary might experience increased load, since it must retain more WAL until it can be applied on the replica.&lt;/p&gt;

&lt;p&gt;If you are increasing &lt;a href="https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-STANDBY-ARCHIVE-DELAY" rel="noopener noreferrer"&gt;&lt;strong&gt;max_standby_archive_delay&lt;/strong&gt;&lt;/a&gt; to avoid canceling queries that conflict with reading WAL archive entries, then consider increasing &lt;a href="https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-STANDBY-STREAMING-DELAY" rel="noopener noreferrer"&gt;&lt;strong&gt;max_standby_streaming_delay&lt;/strong&gt;&lt;/a&gt; as well to avoid cancelations linked to conflict with streaming WAL entries.&lt;/p&gt;

&lt;p&gt;If WAL data is read from streaming replication, then use the max_standby_streaming_delay parameter.&lt;/p&gt;
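&lt;p&gt;On RDS you change these settings through a DB parameter group rather than postgresql.conf. A rough sketch (the parameter group name and the 30-second values below are assumptions; both parameters take milliseconds, and -1 means wait forever):&lt;/p&gt;

```shell
# Raise both conflict-recovery delays to 30 s on the replica's parameter group.
# "my-replica-pg" is a placeholder; substitute your own DB parameter group name.
aws rds modify-db-parameter-group \
  --db-parameter-group-name my-replica-pg \
  --parameters \
    "ParameterName=max_standby_archive_delay,ParameterValue=30000,ApplyMethod=immediate" \
    "ParameterName=max_standby_streaming_delay,ParameterValue=30000,ApplyMethod=immediate"
```

&lt;p&gt;Both parameters are dynamic, so ApplyMethod=immediate applies the change without a reboot.&lt;/p&gt;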

&lt;h4&gt;
  
  
  vacuum_defer_cleanup_age
&lt;/h4&gt;

&lt;p&gt;NOTE: Applies to Postgres versions &amp;lt; 16 (vacuum_defer_cleanup_age was removed in PostgreSQL 16).&lt;/p&gt;

&lt;p&gt;With vacuum_defer_cleanup_age, you could specify a time delay (in seconds) for how long the replica would defer cleaning up certain types of data during auto vacuum. The purpose of this deferral was to potentially avoid conflicts between ongoing queries on the replica and the auto vacuum process cleaning up data that those queries might still be accessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For versions ≥ 16&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The combination of the statement_timeout parameter and the hot_standby_feedback feature can achieve a similar outcome to vacuum_defer_cleanup_age.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lazy Loading and Multi-AZ
&lt;/h3&gt;

&lt;p&gt;When you change your Single-AZ instance to Multi-AZ, Amazon RDS creates a snapshot of the instance’s volumes. The snapshot is used to create new volumes in another Availability Zone. Although these new volumes are immediately available for use, you might experience a performance impact. This impact occurs because the new volume’s data is still loading from Amazon Simple Storage Service (Amazon S3). Meanwhile, the DB instance continues to load data in the background. This process might lead to elevated write latency and a performance impact during and after the modification process.&lt;/p&gt;

&lt;p&gt;a. Initiate a failover so that the new AZ becomes the primary AZ.&lt;/p&gt;

&lt;p&gt;b. Perform read operations on your reader instance: either perform an explicit SELECT on all the large and heavily used tables individually with parallelism, or use the pg_prewarm shared library module to read through all the tables.&lt;/p&gt;

&lt;p&gt;c. Confirm that the write latency has returned to normal levels by reviewing the WriteLatency metric in Amazon CloudWatch.&lt;/p&gt;

&lt;p&gt;For more information, refer to “&lt;a href="https://repost.aws/knowledge-center/rds-convert-single-az-multi-az#:~:text=Reduce%20latency%20if%20your%20instance%20is%20already%20Multi%2DAZ" rel="noopener noreferrer"&gt;What’s the impact of modifying my Single-AZ Amazon RDS instance to a Multi-AZ instance and vice versa?&lt;/a&gt;”&lt;/p&gt;

&lt;h3&gt;
  
  
  Cascading Replicas and Rollback Scenarios
&lt;/h3&gt;

&lt;p&gt;Note: This only applies to Postgres versions ≥14.1.&lt;/p&gt;

&lt;p&gt;As you make your transition from GP2 to GP3, plan a fallback or rollback strategy as a best practice. As you promote your GP3-based replica to primary (standalone), make sure to keep a GP2 replica just in case. You can achieve this with &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.Replication.ReadReplicas.html#USER_PostgreSQL.Replication.ReadReplicas.Configuration.cascading" rel="noopener noreferrer"&gt;cascading replicas&lt;/a&gt;, which can be set up ahead of time along with the GP3 replica. Rehydration and lazy loading apply here just as they do elsewhere, including to any Multi-AZ standby instances.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With cascading read replicas, RDS for PostgreSQL DB instance sends WAL data to the first read replica in the chain. That read replica then sends WAL data to the second replica in the chain, and so on. The end result is that all read replicas in the chain have the changes from the RDS for PostgreSQL DB instance, but without the overhead solely on the source DB instance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Choosing the right strategy depends on, among other things, how critical minimizing downtime is, cost, and scalability. Choose wisely: consult AWS Support, and engage your AWS TAM and AWS Solutions Architects to discuss the impact of the migration on your databases and SLAs, weighing lazy loading versus full initialization of the RDS GP3 volume as you plan yours!&lt;/p&gt;




</description>
      <category>gp3</category>
      <category>postgres</category>
      <category>aws</category>
      <category>rds</category>
    </item>
    <item>
      <title>Analyzing Amazon Load Balancer Access Logs</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Mon, 05 Feb 2024 22:11:19 +0000</pubDate>
      <link>https://forem.com/aws-builders/analyzing-amazon-load-balancer-access-logs-2j50</link>
      <guid>https://forem.com/aws-builders/analyzing-amazon-load-balancer-access-logs-2j50</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Analyzing access logs may be required for several reasons, and it's a great practice in general to stay on top of them to understand traffic, distribution, user agents, and URI classification (client IP addresses, latencies, request paths, and server responses). Overall, you can use access logs to analyze traffic patterns and troubleshoot issues.&lt;/p&gt;

&lt;p&gt;Access logs are not activated by default. Once enabled, access logs are shipped to Amazon S3. Before you enable them on your load balancer, please read the following carefully (including the S3 costs): &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html"&gt;https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's important to also note that the access log files are compressed. If you open the files using the Amazon S3 console, they are uncompressed and the information is displayed. If you download the files, you must uncompress them to view the information. Depending on your use case, your access logs could run into gigabytes of data, and processing and analyzing them could be challenging.&lt;/p&gt;

&lt;p&gt;There are several ways you could approach analyzing access logs. Below is a summary of some of the options, in no particular order.&lt;/p&gt;
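&lt;p&gt;For a sense of what these analyzers parse under the hood, here is a minimal hand-rolled sketch that splits a synthetic ALB access log entry into its leading fields (real entries carry additional trailing fields beyond the user agent):&lt;/p&gt;

```python
import shlex

# The first 14 fields of an ALB access log entry, in order.
FIELDS = [
    "type", "time", "elb", "client_port", "target_port",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
]

def parse_alb_line(line):
    # shlex honors the double quotes around "request" and "user_agent"
    return dict(zip(FIELDS, shlex.split(line)))

# Synthetic example entry (abridged; real lines have more trailing fields)
line = ('http 2024-02-05T22:11:19.000000Z app/my-alb/abc123 '
        '203.0.113.10:46532 10.0.1.5:8080 0.001 0.048 0.000 200 200 '
        '34 366 "GET https://example.com:443/api/items HTTP/1.1" '
        '"curl/8.0.1"')
entry = parse_alb_line(line)
print(entry["elb_status_code"], entry["request"])
```

&lt;p&gt;Aggregating such parsed entries by status code, latency, or URI is essentially what every tool below does at scale.&lt;/p&gt;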

&lt;h2&gt;
  
  
  AWS Based Log Analyzers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Log Analytics with Amazon Athena&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since load balancer access logs are shipped to S3, you can use the power of Athena to query them in place. You can then slice this data along various dimensions using plain SQL, which works well and is effective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://repost.aws/knowledge-center/athena-analyze-access-logs"&gt;https://repost.aws/knowledge-center/athena-analyze-access-logs&lt;/a&gt;&lt;br&gt;
&lt;a href="https://repost.aws/knowledge-center/analyze-logs-athena"&gt;https://repost.aws/knowledge-center/analyze-logs-athena&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may further choose to combine this with &lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/create-a-data-set-athena.html"&gt;Amazon QuickSight&lt;/a&gt; to build powerful dashboards for BI use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon OpenSearch Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) operates OpenSearch and open-source Elasticsearch, making it easy to search, visualize, and analyze your data across multiple use cases such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast, Scalable Full-Text Search&lt;/li&gt;
&lt;li&gt;Application and Infrastructure Monitoring&lt;/li&gt;
&lt;li&gt;Security and Event Information Management&lt;/li&gt;
&lt;li&gt;Operational Health Tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CloudWatch Log Insights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This does need some extra work on our part to first &lt;a href="https://medium.com/@xinweiiiii/forwards-logs-from-aws-s3-to-aws-cloudwatch-real-time-5934287ca1f"&gt;transform and forward access logs from S3 to CloudWatch in JSON format&lt;/a&gt;. However, once the logs are in CloudWatch, we can use &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html"&gt;CloudWatch Logs Insights&lt;/a&gt; and its capabilities to analyze the data. Optionally, you can also use natural language (with the AI assistant) to create CloudWatch Logs Insights queries that may otherwise be challenging to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  External Log Analyzers
&lt;/h2&gt;

&lt;p&gt;Several enterprise solutions on the market let you ingest and analyze logs (not limited to access logs). &lt;/p&gt;

&lt;p&gt;Some popular integrations for your review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.loggly.com/docs/s3-ingestion-auto/"&gt;Loggly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.splunk.com/Documentation/AddOns/released/AWS/S3"&gt;Splunk&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sumologic.com/application/elb/"&gt;Sumo logic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.datadoghq.com/observability_pipelines/guide/ingest_aws_s3_logs_with_the_observability_pipelines_worker/"&gt;DataDog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Opensource Log Analyzers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;elb-log-analyzer (py)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dmdhrumilmistry/elb-log-analyzer"&gt;elb-log-analyzer&lt;/a&gt; is a Python-based utility that lets you connect to your origin (s3) and analyze logs. In addition, it does have several features including, downloading logs from s3, analyzing logs, streamlit integration for dashboards, slack integration for setting up anomaly alerts, docker integration, and more!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;elb-rebar (Rust)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://lib.rs/crates/elb-rebar"&gt;elb-rebar&lt;/a&gt; is a parallel AWS Elastic Load Balancing log analyzer for quick statistics on web requests. This Rust-based utility is easy to install and run!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;elb-log-analyzer (NPM)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ozantunca/elb-log-analyzer"&gt;elb-log-analyzer&lt;/a&gt; is an NPM-based utility that lets you quickly install and be up and running by parsing your logs with various dimensions. I find it very flexible regarding usage and its ability to sort, limit, or even filter our search by prefix (this is extremely useful when there is a high volume of unique URIs to track due to request parameters or similar). &lt;/p&gt;

&lt;h2&gt;
  
  
  Anomaly Detection
&lt;/h2&gt;

&lt;p&gt;ML-backed Anomaly detection with access logs could come in handy in some use cases. There are a few that already offer this capability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/quicksight/latest/user/anomaly-detection.html"&gt;Amazon QuickSight&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/LogsAnomalyDetection.html"&gt;Log Anomaly Detector&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datadoghq.com/blog/accelerate-incident-investigations-with-log-anomaly-detection/"&gt;Datadog Log Anomaly Detection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully, this article leaves you with a range of AWS, enterprise, and open-source options for a common but challenging problem space of ever-growing data, logs, and analytics requirements! Lastly, make sure to stay on top of your access logs, in addition to metrics and application logs, for increased reliability, security, stability, and scalability of your applications and services!&lt;/p&gt;

</description>
      <category>alb</category>
      <category>elb</category>
      <category>aws</category>
      <category>analytics</category>
    </item>
    <item>
      <title>AWS Pricing Calculator needs hardening</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Thu, 03 Aug 2023 17:09:08 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-pricing-calculator-needs-hardening-4n2k</link>
      <guid>https://forem.com/aws-builders/aws-pricing-calculator-needs-hardening-4n2k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Background&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AWS Pricing Calculator allows you to configure a cost estimate that fits your unique business or personal needs with AWS products and services. In the past we had the &lt;a href="http://calculator.s3.amazonaws.com/calc5.html"&gt;Simple Monthly Calculator (SMC)&lt;/a&gt; (retired as of 3/31/23), which provided estimates and has since been replaced by the &lt;a href="https://calculator.aws/#/"&gt;AWS Pricing Calculator&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why should you use AWS Pricing Calculator?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AWS Pricing Calculator has a simplified web interface and now supports cost estimates for more than 150 AWS services. It also enables cost estimates at scale, like bulk import for EC2 instances. The AWS Pricing Calculator is accessible to all users, prospects, or AWS customers, without an AWS account. It provides cost estimates for your workloads using the public AWS prices, and it's the best source for the most up-to-date and comprehensive pricing estimates in one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Areas of the AWS Pricing Calculator that need hardening&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Experience Related&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one has been around for a while: it's easy to configure your estimate and do all the hard work to get to the point where you have a number, only to click outside the primary form and find out you have just lost all your great work. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vM9QoHpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0k4jin1a46vzt4ff5pyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vM9QoHpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0k4jin1a46vzt4ff5pyv.png" alt="AWS Pricing Calculator form" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Make sure to not click anywhere outside of the primary overlay or form for the calculator without saving your work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing Related&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one in particular has been even more frustrating in many ways. I noticed that form fields reset automatically when you change a few options in your service configuration, which can lead to a price that is not accurate. &lt;/p&gt;

&lt;p&gt;Here is an example where I tried to compare the price of the two Aurora PostgreSQL offerings (Aurora Standard vs. Aurora I/O-Optimized). I started by selecting Aurora Standard and noting the monthly price ($894.98 USD/month for a db.r7g.2xlarge).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rmWvyJrx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/373pbermgysghl89o5bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rmWvyJrx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/373pbermgysghl89o5bx.png" alt="Aurora Standard Configuration" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All good until this point. Next, I updated from Aurora Standard to I/O-Optimized to compare, and the cost bump really caught me (and a few others) by surprise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_Tc9XD_t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/88ewivtd7yc8ot9sbv8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_Tc9XD_t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/88ewivtd7yc8ot9sbv8h.png" alt="Aurora I/O-Optimized" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pricing calculator resets the instance type and picks an alternate type, which is not what you expect. This easily goes unnoticed when your focus is a cost comparison and you are looking at the delta. I will drop in some support cases with AWS next to see if we can resolve these discrepancies and improve the user experience!&lt;/p&gt;

&lt;p&gt;Hope this saves some of you some time!&lt;/p&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>How to build an AI chatbot with Openfire and OpenAI Chat Completion</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Fri, 24 Mar 2023 11:02:53 +0000</pubDate>
      <link>https://forem.com/krisiye/how-to-build-an-ai-chatbot-with-openfire-and-openai-chat-completion-23fa</link>
      <guid>https://forem.com/krisiye/how-to-build-an-ai-chatbot-with-openfire-and-openai-chat-completion-23fa</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QN6CdayB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A3cg6LMG-ClblqIq2" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QN6CdayB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A3cg6LMG-ClblqIq2" alt="" width="880" height="515"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Eric Krull on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Responsible use of artificial intelligence (AI) and ML technologies is key to fostering continued innovation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI chatbots are here, there, and everywhere! Ever since the introduction of ChatGPT in November 2022, tech companies of all sizes have been racing to build AI-powered tools and solutions.&lt;/p&gt;

&lt;p&gt;This article describes some high-level building blocks to help you build a chatbot powered by Openfire that could be wired up to OpenAI APIs to provide chat completions!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H9gzSmWA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A-Gp_vl4IfiSW_iFeBDA6Qg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H9gzSmWA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A-Gp_vl4IfiSW_iFeBDA6Qg.png" alt="An architecture diagram showing a chatbot and users using conversejs to connect to openfire, integrated with the Botz library (capable of managing logins and presence for the chatbot as well as intercepts messages) and is able to use openAI through the java sdk to gather responses." width="880" height="570"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;High-level Architecture for building a chatbot with Openfire.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  OpenAI API
&lt;/h3&gt;

&lt;p&gt;The recently released OpenAI APIs are a game changer for enterprises small and large, making it possible to offer in-app ChatGPT-like experiences powered by the same APIs and models that power ChatGPT. Check out the &lt;a href="https://platform.openai.com/docs/api-reference/chat"&gt;API reference&lt;/a&gt; and the API &lt;a href="https://platform.openai.com/playground?mode=chat"&gt;playground&lt;/a&gt; for more details on the Chat Completions API and more.&lt;/p&gt;

&lt;p&gt;The early adopters of the OpenAI &lt;a href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis"&gt;APIs&lt;/a&gt;, such as &lt;a href="https://newsroom.snap.com/say-hi-to-my-ai"&gt;My AI&lt;/a&gt; on Snapchat, &lt;a href="https://www.instacart.com/"&gt;Instacart&lt;/a&gt;, &lt;a href="https://shop.app/"&gt;Shop&lt;/a&gt; by Shopify, and &lt;a href="https://www.speak.com/"&gt;Speak&lt;/a&gt;, are great references for building the next generation of AI tools and solutions. For edtech, the virtual tutors and learning assistants offered by &lt;a href="https://quizlet.com/labs/qchat"&gt;Quizlet&lt;/a&gt;, &lt;a href="https://openai.com/customer-stories/duolingo"&gt;DuoLingo&lt;/a&gt;, and &lt;a href="https://openai.com/customer-stories/khan-academy"&gt;Khan Academy&lt;/a&gt; are creative examples of GPT-3 and GPT-4 capabilities!&lt;/p&gt;

&lt;p&gt;To make life easier, the APIs come with a variety of &lt;a href="https://platform.openai.com/docs/libraries/community-libraries"&gt;SDKs&lt;/a&gt;, including community-based projects that are well-documented and easy to integrate with your applications!&lt;/p&gt;
&lt;h3&gt;
  
  
  Openfire
&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://www.igniterealtime.org/projects/openfire/index.jsp"&gt;Openfire&lt;/a&gt; is a real-time collaboration (RTC) server licensed under the Open Source Apache License. It uses the only widely adopted open protocol for instant messaging, XMPP (also called Jabber). Openfire is incredibly easy to set up and administer but offers rock-solid security and performance.&lt;/p&gt;

&lt;p&gt;Originated by Jive Software around 2002, the project continues to thrive under a community model as part of the Ignite Realtime Foundation, which does a fantastic job!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some of the core features (not an exhaustive list):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XMPP server written in Java and licensed under the Apache License 2.0&lt;/li&gt;
&lt;li&gt;User-friendly web-based installation and administration panel&lt;/li&gt;
&lt;li&gt;Shared groups for easy roster deploying&lt;/li&gt;
&lt;li&gt;Plugin interface&lt;/li&gt;
&lt;li&gt;SSL/TLS support&lt;/li&gt;
&lt;li&gt;Offline Messages support&lt;/li&gt;
&lt;li&gt;Server-to-Server connectivity&lt;/li&gt;
&lt;li&gt;Database connectivity for storing messages and user details (including the embedded HSQL database and support for MySQL, PostgreSQL, etc.)&lt;/li&gt;
&lt;li&gt;LDAP integration&lt;/li&gt;
&lt;li&gt;Platform independent (with the installers for different platforms)&lt;/li&gt;
&lt;li&gt;Connection manager for load balancing&lt;/li&gt;
&lt;li&gt;Clustering support (hazelcast)&lt;/li&gt;
&lt;li&gt;Message archiving-logging&lt;/li&gt;
&lt;li&gt;Content filtering, packet rules&lt;/li&gt;
&lt;li&gt;Pluggable Roster Module&lt;/li&gt;
&lt;li&gt;Custom Authentication Provider&lt;/li&gt;
&lt;li&gt;Support for BOSH as well as WebSockets.&lt;/li&gt;
&lt;li&gt;File Sharing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To get started, visit &lt;a href="https://download.igniterealtime.org/openfire/docs/latest/documentation/working-with-openfire.html"&gt;working with Openfire&lt;/a&gt; to get your chat server up in a few minutes!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VUwUyHNn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Aa9a_queuzFCbl1xCbFPZSw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VUwUyHNn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Aa9a_queuzFCbl1xCbFPZSw.gif" alt="A picture showing the openfire admin dashboard and configuration options." width="880" height="815"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An admin dashboard for managing openfire!&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  XMPP Clients
&lt;/h4&gt;

&lt;p&gt;The Openfire server can be paired with a JavaScript XMPP client of your choice. Some popular projects and plugins to get you started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/conversejs/converse.js/"&gt;ConverseJS&lt;/a&gt; is a popular Javascript XMPP client that implements a full range of &lt;a href="https://github.com/conversejs/converse.js/#supported-xmpp-extensions"&gt;XMPP extensions&lt;/a&gt;. Also available as a plugin for openfire — &lt;a href="https://github.com/igniterealtime/openfire-inverse-plugin"&gt;inverse-openfire-plugin&lt;/a&gt; and can be installed on openfire with a few clicks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/legastero/stanza"&gt;StanzaJS&lt;/a&gt; is a JavaScript/TypeScript library for using modern XMPP, and it does that by exposing everything as JSON. Unless you insist, you have no need to ever see or touch any XML when using StanzaJS.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.jsxc.org/"&gt;JSXC&lt;/a&gt; is a Javascript XMPP client that is also available as a &lt;a href="https://github.com/igniterealtime/openfire-jsxc-plugin"&gt;jsxc openfire plugin&lt;/a&gt; that could be installed with a few clicks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lOd2O074--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AKWCE8XFwhQ75k2fl8HsVvA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lOd2O074--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AKWCE8XFwhQ75k2fl8HsVvA.png" alt="shows javascript plugins such as inverse and jsxc installed on openfire." width="880" height="254"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The plugin page in openfire with inverse and jsxc installed.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Plugins
&lt;/h4&gt;

&lt;p&gt;Plugins are a great way to extend or customize capabilities on your Openfire server.&lt;/p&gt;

&lt;p&gt;There are a number of &lt;a href="https://www.igniterealtime.org/projects/openfire/plugins.jsp"&gt;community-based plugins&lt;/a&gt; that you can readily install and use on your Openfire server. Adding a new plugin for your custom needs requires packaging a jar/war per the &lt;a href="https://download.igniterealtime.org/openfire/docs/latest/documentation/plugin-dev-guide.html"&gt;Openfire plugin specification&lt;/a&gt; and deploying it to your Openfire server.&lt;/p&gt;
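&lt;p&gt;As a rough illustration, a plugin ships as a jar containing a &lt;code&gt;plugin.xml&lt;/code&gt; descriptor along these lines (the elements follow the Openfire plugin guide; the class name and version values here are placeholders, not from a real project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;plugin&amp;gt;
    &amp;lt;!-- Main plugin class implementing org.jivesoftware.openfire.container.Plugin --&amp;gt;
    &amp;lt;class&amp;gt;com.example.openfire.MyBotPlugin&amp;lt;/class&amp;gt;
    &amp;lt;name&amp;gt;My Bot Plugin&amp;lt;/name&amp;gt;
    &amp;lt;description&amp;gt;A chatbot plugin.&amp;lt;/description&amp;gt;
    &amp;lt;author&amp;gt;Your Name&amp;lt;/author&amp;gt;
    &amp;lt;version&amp;gt;1.0.0&amp;lt;/version&amp;gt;
    &amp;lt;minServerVersion&amp;gt;4.7.0&amp;lt;/minServerVersion&amp;gt;
&amp;lt;/plugin&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;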

&lt;p&gt;You are going to need one for a chatBot!&lt;/p&gt;
&lt;h3&gt;
  
  
  Botz
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.igniterealtime.org/projects/botz/"&gt;Botz&lt;/a&gt; library adds to the already rich and extensible Openfire with the ability to create internal user bots. With the Botz library, programmers may choose to develop a virtual user or a chatbot as a plugin. Although Openfire does not really distinguish this virtual user from the real users, one could intercept messages to the chatBot from your users, and be able to respond per your needs. An example would be to integrate the Botz library with the relatively new &lt;a href="https://github.com/TheoKanning/openai-java"&gt;OpenAI Java SDK&lt;/a&gt; to provide an AI chatBot or a ChatGPT experience for your Openfire users.&lt;/p&gt;
&lt;h4&gt;
  
  
  Setup BOTZ within your Openfire plugin
&lt;/h4&gt;

&lt;a href="https://medium.com/media/1d6aea2fff779da1bf9c68905cf6bb02/href"&gt;https://medium.com/media/1d6aea2fff779da1bf9c68905cf6bb02/href&lt;/a&gt;

&lt;p&gt;&lt;a href="https://discourse.igniterealtime.org/t/botz-version-1-2-0-release/92649"&gt;Botz version 1.2.0&lt;/a&gt; was released recently and can be used alongside &lt;a href="https://github.com/igniterealtime/Openfire/releases/tag/v4.7.4"&gt;Openfire 4.7.4&lt;/a&gt;. Thanks to &lt;a href="https://github.com/guusdk"&gt;@guusdk&lt;/a&gt; !!!&lt;/p&gt;
&lt;h4&gt;
  
  
  OpenAI API integration
&lt;/h4&gt;

&lt;p&gt;Add the OpenAI Java dependency to your project. The example below uses the community openai-gpt3-java client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
 &amp;lt;groupId&amp;gt;com.theokanning.openai-gpt3-java&amp;lt;/groupId&amp;gt;&amp;lt;!-- use latest as needed --&amp;gt;
 &amp;lt;artifactId&amp;gt;service&amp;lt;/artifactId&amp;gt;
 &amp;lt;version&amp;gt;0.11.0&amp;lt;/version&amp;gt;&amp;lt;!-- use latest as needed --&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a helper to wrap any customizations for your service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/730ea9ec8030560d8d1030e6097baf74/href"&gt;&lt;/a&gt;&lt;a href="https://medium.com/media/730ea9ec8030560d8d1030e6097baf74/href"&gt;https://medium.com/media/730ea9ec8030560d8d1030e6097baf74/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, packaging your plugin code and deploying it to the Openfire server should enable your BOT user and set up the presence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto Display Bot for your users
&lt;/h3&gt;

&lt;p&gt;Now that we have a way to add a virtual BOT user, we need a mechanism to make the BOT appear in the contact lists of your real chat users. To do that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a group in Openfire’s admin console&lt;/li&gt;
&lt;li&gt;Make the bot a member of that group&lt;/li&gt;
&lt;li&gt;Enable contact list group sharing, to make the group appear on the contact list of every user in the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lT11tFZj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJH60om9ySk1L-eUk89EU7A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lT11tFZj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJH60om9ySk1L-eUk89EU7A.png" alt="Shows the Openfire group and contact listing configuration that allows us to inject the bot user to show up for our real users." width="880" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Openfire group and contact listing for bot user.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrap up
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uvy8cguu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Af8Ze1_QKOnD9Qc1EDE1CPA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uvy8cguu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Af8Ze1_QKOnD9Qc1EDE1CPA.png" alt="inverse plugin showing converse js connected to openfire with the bot user exchanging messages with openAI chat completions." width="880" height="537"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;chat completions with openAI with converse (inverse openfire plugin)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What you see above is a real user chatting with the virtual user myaibot.&lt;/li&gt;
&lt;li&gt;Since we made the BOT a member of a new group and enabled contact list group sharing for all users, the BOT is listed under each user's contact list, along with the online presence that was set up using Botz.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope some of these building blocks help you build the next cool AI BOT for your application powered by Openfire, BOTZ, and openAI!&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://medium.com/u/ac63ac3175a7"&gt;Houghton Mifflin Harcourt&lt;/a&gt; we’ve only just begun to explore ideas for embedding AI and ChatGPT-like tools into our platform and products, tools that could benefit learners and educators as well as help find efficiencies in internal workflows. &lt;a href="https://medium.com/u/549404f77a2c"&gt;Julie Miles&lt;/a&gt; (SVP, Learning Sciences), my colleagues, and I really think of this as the tip of the proverbial iceberg. Some of the ideas we have been working on, across some broad buckets of research, are listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce teacher effort in lesson planning&lt;/li&gt;
&lt;li&gt;Use assessment data to group students or differentiate assignments to assist teachers&lt;/li&gt;
&lt;li&gt;Provide feedback on areas to improve to both the student and the teacher on the student’s work, recommendations, etc&lt;/li&gt;
&lt;li&gt;Provide intelligent tutoring to students (e.g., encouragement, hints, feedback on essays, etc.) while they’re working through a lesson&lt;/li&gt;
&lt;li&gt;Create efficiencies in functional workflows so that we free up more time for thought leadership, which leads to more innovation. Some areas we have already experimented with include:
&lt;ul&gt;
&lt;li&gt;Writing first drafts of content or assessment items&lt;/li&gt;
&lt;li&gt;Translating content into other languages&lt;/li&gt;
&lt;li&gt;Inserting explanatory comments into existing code&lt;/li&gt;
&lt;li&gt;Asking how to fix code that doesn’t work&lt;/li&gt;
&lt;li&gt;Checking code for bugs and auto-completion&lt;/li&gt;
&lt;li&gt;Drafting marketing names for new products, and many more&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With OpenAI leading the way with GPT-3 and GPT-4, the future looks promising as well as exciting! Looking forward to learning and experimenting with building safe, accountable, and trustworthy AI solutions!&lt;/p&gt;

&lt;h4&gt;
  
  
  Helpful Resources
&lt;/h4&gt;

&lt;p&gt;There is a plethora of open-source/no-code/low-code solutions out there that you could play with to spin up your AI BOT. I want to leave you with some helpful resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/RocketChat"&gt;RocketChat&lt;/a&gt; recently came out with an &lt;a href="https://github.com/RocketChat/Rocket.Chat.OpenAI.Completions.App"&gt;OpenAI chat completion app&lt;/a&gt;, and browser-based no-code app builder&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://flutterflow.io/"&gt;flutter.io&lt;/a&gt; announced easy additions for &lt;a href="https://community.flutterflow.io/c/community-tutorials/add-ai-completion-to-your-flutterflow-project"&gt;OpenAI chat completions&lt;/a&gt; to their projects.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.twilio.com/blog/sms-chatbot-openai-api-node"&gt;sms-chatbot with nodejs and twillio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.twilio.com/blog/python-whatsapp-chef-bot-openai-gpt3"&gt;Your Personal Michelin Star Chef with OpenAI’s GPT-3 Engine, python and twillio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.igniterealtime.org/projects/botz/"&gt;Learn more about Botz and Openfire&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a wrap for this article. Good Luck and Stay Safe!&lt;/p&gt;

&lt;p&gt;Thanks to &lt;a href="https://medium.com/u/549404f77a2c"&gt;Julie Miles&lt;/a&gt; and &lt;a href="https://medium.com/u/c5193d2b3e93"&gt;Tom Holt&lt;/a&gt; for their contributions to this post!&lt;/p&gt;




</description>
      <category>openfire</category>
      <category>chatbots</category>
      <category>chatgpt</category>
      <category>openai</category>
    </item>
    <item>
      <title>High CPU and zombie threads on Amazon Aurora Mysql 5.6</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Thu, 22 Sep 2022 21:12:38 +0000</pubDate>
      <link>https://forem.com/aws-builders/high-cpu-and-zombie-threads-on-amazon-aurora-mysql-56-1mbj</link>
      <guid>https://forem.com/aws-builders/high-cpu-and-zombie-threads-on-amazon-aurora-mysql-56-1mbj</guid>
      <description>&lt;p&gt;Recently noticed some high avg CPU utilization on an Amazon Aurora Mysql Databases running &lt;code&gt;Mysql 5.6 (oscar:5.6.mysql_aurora.1.22.2)&lt;/code&gt;. Something that was noticed that I thought was interesting to share were zombie threads or threads that were running for a long period of time and never finished as well as threads that were not possible to be killed. &lt;/p&gt;

&lt;p&gt;These were simple DDL statements that were triggered by a little reporting engine that created a bunch of temporary tables to gather some aggregations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7veyzyv4inf6h1fm1bno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7veyzyv4inf6h1fm1bno.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A quick look at the process list shows a DDL statement that has been stuck for over 4 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; show full processlist;
| Id       | User            | Host                | db         | Command | Time   | State                     | Info                
| 77569519 | app        | x.x.x.x:yyyyy | test | Query   | 404949 | init                      | DROP TEMPORARY TABLE IF EXISTS temp1
:::::
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TRX status for the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX where trx_mysql_thread_id = 77569519 \G
*************************** 1. row ***************************
                    trx_id: 124803462108
                 trx_state: RUNNING
               trx_started: 2022-09-17 21:01:45
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 33614
       trx_mysql_thread_id: 77569519
                 trx_query: DROP TEMPORARY TABLE IF EXISTS temp1
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 14
     trx_lock_memory_bytes: 376
           trx_rows_locked: 0
         trx_rows_modified: 33600
   trx_concurrency_tickets: 0
       trx_isolation_level: READ COMMITTED
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The initial suspect was disk issues causing these long-running queries, but that was ruled out: the metrics looked fine and the database appeared to have plenty of local storage for temporary tables. The next attempt to recover was to kill the long-running query and free up the CPU cycles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; call mysql.rds_kill(77569519);
Query OK, 0 rows affected (0.00 sec)

mysql&amp;gt; call mysql.rds_kill_query(77569519);
Query OK, 0 rows affected (0.00 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No luck, despite attempts to kill both the query and the connection. While &lt;code&gt;rds_kill_query&lt;/code&gt; did not change anything, &lt;code&gt;rds_kill&lt;/code&gt; did change the command status from &lt;code&gt;Query&lt;/code&gt; to &lt;code&gt;killed&lt;/code&gt;. Neither was helpful in this case, and the &lt;code&gt;trx_state&lt;/code&gt; continued to be &lt;code&gt;RUNNING&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mysql&amp;gt; show full processlist;
| Id       | User            | Host                | db         | Command | Time   | State                     | Info                
| 77569519 | app        | x.x.x.x:yyyyy | test | Killed  | 422937 | init     | DROP TEMPORARY TABLE IF EXISTS temp1
:::::
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I sought help from AWS Support and gathered the recommendations below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reboot the Amazon Aurora Cluster (or trigger a failover).&lt;/li&gt;
&lt;li&gt;Upgrade from Amazon Aurora &lt;code&gt;1.x&lt;/code&gt; to Amazon Aurora &lt;code&gt;2.x&lt;/code&gt;, particularly &lt;code&gt;2.07.8&lt;/code&gt;, which has some fixes from the community edition for stability around temporary tables.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that Aurora &lt;code&gt;2.x&lt;/code&gt; would mean an upgrade to MySQL &lt;code&gt;5.7.x&lt;/code&gt; from a compatibility standpoint.&lt;/p&gt;
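&lt;p&gt;Before planning the upgrade, you can confirm what a cluster is currently running from any client session (&lt;code&gt;AURORA_VERSION()&lt;/code&gt; is an Aurora-specific function):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Aurora engine version (e.g. 1.x vs 2.x)
mysql&amp;gt; SELECT AURORA_VERSION();

-- MySQL compatibility version (e.g. 5.6.x vs 5.7.x)
mysql&amp;gt; SELECT VERSION();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;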

&lt;p&gt;Hope this helps!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>rds</category>
      <category>auroramysql</category>
    </item>
    <item>
      <title>Tracking down high CPU Utilization on Amazon Aurora PostgreSQL</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Tue, 30 Aug 2022 16:10:12 +0000</pubDate>
      <link>https://forem.com/krisiye/tracking-down-high-cpu-utilization-on-amazon-aurora-postgresql-5ae4</link>
      <guid>https://forem.com/krisiye/tracking-down-high-cpu-utilization-on-amazon-aurora-postgresql-5ae4</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p0mYLxsv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AxQ_VHEQhpOty5k-r3h06Vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p0mYLxsv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AxQ_VHEQhpOty5k-r3h06Vg.png" alt="An image showing the aws.rds.cpuutilization representing high CPU utilization." width="880" height="259"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;High CPU utilization on Aurora RDS&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In one of my previous &lt;a href="https://dev.to/aws-builders/amazon-aurora-and-local-storage-50of"&gt;articles&lt;/a&gt;, I discuss some interesting ways we can troubleshoot high local storage utilization on Amazon Aurora PostgreSQL. In this article, I share some thoughts on troubleshooting high CPU utilization as well as some best practices for Amazon Aurora PostgreSQL.&lt;/p&gt;
&lt;h3&gt;
  
  
  Keep it Simple
&lt;/h3&gt;

&lt;p&gt;The PostgreSQL built-in views and extensions &lt;a href="https://www.postgresql.org/docs/current/pgstatstatements.html"&gt;pg_stat_statements&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/14/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW"&gt;pg_stat_activity&lt;/a&gt;, and &lt;a href="https://www.postgresql.org/docs/current/monitoring-stats.html"&gt;pg_stat_user_tables&lt;/a&gt; are great starting points: they can quickly surface your top SQL and missing indexes, provide insight into locking, and identify blocked queries along with the blocking PIDs/queries.&lt;/p&gt;

&lt;a href="https://medium.com/media/2f24ca9c7a72fe833d90317710c48f82/href"&gt;https://medium.com/media/2f24ca9c7a72fe833d90317710c48f82/href&lt;/a&gt;
&lt;h3&gt;
  
  
  Slow Query Logging
&lt;/h3&gt;

&lt;p&gt;For heavy and concurrent workloads, slow query logging can provide some great insights. Go ahead and turn on your slow query logs, but make sure to set reasonable thresholds so you catch just enough. Note that logging all statements can have a huge impact on performance and result in high resource utilization. Logging lock waits can also be a useful addition, to see whether lock waits are contributing to your performance issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example that logs statements executing for &amp;gt; 500ms.
log_min_duration_statement=500

# Useful in determining if lock waits are causing poor performance 
log_lock_waits=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow up with EXPLAIN ANALYZE and look for improvements.&lt;/p&gt;

&lt;p&gt;Some pointers for you to look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look for a difference in estimated vs actual rows.&lt;/li&gt;
&lt;li&gt;No index, wrong index (cardinality)&lt;/li&gt;
&lt;li&gt;A large number of buffers read (check whether the working set fits in shared_buffers)&lt;/li&gt;
&lt;li&gt;A large number of rows filtered by a post-join predicate.&lt;/li&gt;
&lt;li&gt;Reading more data than necessary (pruning, clustering, index-only)&lt;/li&gt;
&lt;li&gt;Look for slow nodes in your plans (SORT[AGG], NOT IN, OR, large seq_scans, CTEs, COUNT, function usage in filters, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that a sequential scan may sometimes be faster than an index scan, specifically when the SELECT returns more than approximately 5–10% of all rows in the table. This is because an index scan requires several IO operations per row, whereas a sequential scan needs a single IO per row, or even less, since a block (page) on disk contains more than one row and several rows can be fetched with a single IO operation. More tightly bounded queries give the optimizer better information and help it pick the right scan strategy.&lt;/p&gt;
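&lt;p&gt;For example (the table and predicate here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Compare estimated vs. actual rows and inspect buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE created_at &amp;gt;= now() - interval '1 day';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;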

&lt;p&gt;Analyzing a plan can be overwhelming at times; tools such as &lt;a href="https://explain.dalibo.com/"&gt;dalibo&lt;/a&gt; and &lt;a href="https://explain.depesz.com/"&gt;depesz&lt;/a&gt; help visualize your explain plans. (Make sure to read the data retention policies of these tools, and ideally anonymize your queries for security reasons before you upload your plans!)&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Insights
&lt;/h3&gt;

&lt;p&gt;Turning on Performance Insights on your Aurora PostgreSQL cluster is another great way to get detailed insights into your performance and resource utilization. With Performance Insights, you have a quick way to slice your queries by top SQL, top wait, etc., which comes in handy for continually monitoring your production workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ka3rm7Ia--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A--OS4eDalv_Chh3cJV0LDQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ka3rm7Ia--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A--OS4eDalv_Chh3cJV0LDQ.png" alt="A screen shot from AWS Performance Insights showing high CPU utilization sliced by top SQL." width="880" height="374"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Performance Insights showing high CPU utilization sliced by top SQL.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another great dimension to examine is waits: identify where your database spends the most time. The metrics below are broken down by Top SQL and sorted by the top wait.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Fre7LU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A9VJTcHWX9UOclP4b-0WIuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Fre7LU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A9VJTcHWX9UOclP4b-0WIuw.png" alt="Screenshot from Performance Insights showing metrics sliced by top SQL and wait." width="880" height="464"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Performance Insights showing metrics sliced by top SQL and wait.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you need a good overview and understanding of Performance Insights I highly recommend watching the &lt;a href="https://www.youtube.com/watch?v=RyX9tPxffmw"&gt;talk on Performance Insights at AWS re:Invent&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;
&lt;h4&gt;
  
  
  shared_buffers
&lt;/h4&gt;

&lt;p&gt;One of the common pitfalls with setting shared_buffers very large is that the memory is nailed down for page caching, and can’t be used for other purposes, such as temporary memory for sorts, hashing, and materialization (work_mem) or vacuuming and index build (maintenance_work_mem).&lt;/p&gt;

&lt;p&gt;If you can’t fit your entire workload within shared_buffers, then there are a number of reasons to keep it relatively small. If the working set is larger than shared_buffers, most buffer accesses will miss the database buffer cache and fault a page in from the OS; clearly, it makes no sense to allocate a large amount of memory to a cache with a low hit rate.&lt;/p&gt;

&lt;a href="https://medium.com/media/1d635554846d95a294535ffb62d847dd/href"&gt;https://medium.com/media/1d635554846d95a294535ffb62d847dd/href&lt;/a&gt;

&lt;h4&gt;
  
  
  wal_buffers
&lt;/h4&gt;

&lt;p&gt;PostgreSQL backend processes initially write their write-ahead log records into these buffers, and then the buffers are flushed to the disk. Once the contents of any given 8kB buffer are durably on disk, the buffer can be reused. Since insertions and writes are both sequential, the WAL buffers are in effect a ring buffer, with insertions filling the buffer and WAL flushes draining it. Performance suffers when the buffer fills up and no more WAL can be inserted until the current flush is complete. The effects are mitigated by the fact that, when synchronous_commit is not turned off, every transaction commit waits for its WAL record to be flushed to disk; thus, with small transactions at low concurrency levels, a large buffer is not critical. With PostgreSQL 14 you can now get more insights into your wal_buffers with &lt;a href="https://www.postgresql.org/docs/14/monitoring-stats.html#MONITORING-PG-STAT-WAL-VIEW"&gt;pg_stat_wal&lt;/a&gt;.&lt;/p&gt;
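&lt;p&gt;On PostgreSQL 14+, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- wal_buffers_full counts how often WAL data was written to disk
-- because the WAL buffers filled up; a steadily growing value
-- suggests wal_buffers may be too small.
SELECT wal_records, wal_bytes, wal_buffers_full, wal_write, wal_sync
FROM pg_stat_wal;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;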

&lt;p&gt;Below you can see high CPU alongside a high WAL:write wait, which hints at further tuning, such as setting aside some extra memory for wal_buffers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S1zHuXqw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AnJ8QRWmCIKuluJmnkkgsVw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S1zHuXqw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AnJ8QRWmCIKuluJmnkkgsVw.png" alt="Screenshot from Performance Insights showing metrics sliced by top waits as well as showing high CPU and high wal:write" width="880" height="466"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Performance Insights showing metrics sliced by top waits.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;em&gt;random_page_cost&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;Defaults to 4.0. Storage that has a low random read cost relative to sequential reads, such as solid-state drives, can be better modeled with a lower value for random_page_cost, e.g. 1.0. It is best to configure Aurora databases with 1.0 and measure the improvement.&lt;/p&gt;
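&lt;p&gt;As a sketch, you can check and apply this at the database level and then re-measure your query plans (the database name below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- check the current setting
SHOW random_page_cost;

-- lower it for a specific database and measure before/after
ALTER DATABASE mydb SET random_page_cost = 1.0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;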
&lt;h4&gt;
  
  
  &lt;strong&gt;work_mem, max_parallel_workers_per_gather&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For a complex query, several sort or hash operations may be running in parallel. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of &lt;a href="https://aws.amazon.com/blogs/database/tune-sorting-operations-in-postgresql-with-work_mem/"&gt;work_mem&lt;/a&gt;, and it should be tuned appropriately. Setting it too low or too high can have an impact on performance. The default (4 MB) is a good starting point for OLTP workloads. It can be increased to a much higher value for non-OLTP workloads.&lt;/p&gt;

&lt;p&gt;Similarly, a parallel query using 4 workers may use up to 5 times as much CPU time, memory, I/O bandwidth, and so forth. &lt;a href="https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-MAX-PARALLEL-WORKERS-PER-GATHER"&gt;max_parallel_workers_per_gather&lt;/a&gt; defaults to 2. For highly concurrent OLTP workloads spanning several connections, the recommended configuration is to set it to 0, i.e. turn parallel gather off. For low concurrency, the default may suffice. For non-OLTP workloads, you may want to increase it slowly and evaluate performance.&lt;/p&gt;
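&lt;p&gt;Both parameters can be scoped to a single session, which is a low-risk way to experiment before changing the cluster parameter group (the values below are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- raise work_mem for one reporting/batch session only
SET work_mem = '64MB';

-- disable parallel gather for a highly concurrent OLTP session
SET max_parallel_workers_per_gather = 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;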
&lt;h3&gt;
  
  
  Prevent The Bloat
&lt;/h3&gt;

&lt;p&gt;The importance of removing dead tuples is twofold. Dead tuples not only waste space but can also lead to database performance issues. When a table has a large number of dead tuples, its size grows much more than it actually needs to, a condition usually called &lt;em&gt;bloat&lt;/em&gt;. Bloat has a cascading effect on your database: a sequential scan on a bloated table has more pages to go through, costing additional I/O and taking longer; a bloated index causes more unnecessary I/O fetches, slowing down index lookups and scans; and so on.&lt;/p&gt;

&lt;p&gt;For databases that have high volumes of write operations, the growth rate of dead tuples can be high. In addition, the default configuration of autovacuum_max_workers is 3. We recommend monitoring bloat on the database by inspecting dead tuples across the tables that deal with high concurrency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- monitor dead tuples
SELECT relname, n_dead_tup FROM pg_stat_user_tables;

-- monitor auto vacuum
SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While increasing autovacuum_max_workers may be needed in some cases, it also means increased resource utilization. Careful tuning might result in an overall performance improvement by cleaning up dead tuples faster and being able to keep up.&lt;/p&gt;
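&lt;p&gt;Instead of raising autovacuum_max_workers globally, you can also make autovacuum more aggressive on just the busiest tables (the table name and thresholds below are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- vacuum this table after ~5% of rows (plus 1000) become dead
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_threshold = 1000
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;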

&lt;h4&gt;
  
  
  Write Amplification, fillfactor, and HOT Updates
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;“ &lt;strong&gt;&lt;em&gt;fillfactor&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;for a table is a percentage between 10 and 100. 100 (complete packing) is the default. When a smaller fillfactor is specified, INSERT operations pack table pages only to the indicated percentage; the remaining space on each page is reserved for updating rows on that page&lt;/em&gt;”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a scenario where a row is updated and at least one of the indexed columns is among the updated columns, PostgreSQL needs to update all indexes on the table to point to the latest version of the row. This phenomenon is called &lt;strong&gt;write amplification&lt;/strong&gt;, and it is one of the bigger architectural trade-offs of PostgreSQL’s heap storage compared to engines that use clustered indexes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heap-only Tuple&lt;/strong&gt; ( &lt;strong&gt;HOT&lt;/strong&gt; ) updates are an efficient way to prevent write amplification within PostgreSQL. Lowering the fillfactor can have a positive impact by increasing the percentage of HOT updates. A lower fillfactor can encourage more HOT updates, i.e. fewer write operations. Since we write less, we also generate fewer WAL writes. Another benefit of HOT updates is that they ease maintenance tasks on the table. After a HOT update, the old and new versions of the row are on the same page. This makes single-page cleanup more efficient, and the vacuum operation has less work to do.&lt;/p&gt;

&lt;p&gt;HOT updates help to limit table and index bloat. Since HOT updates do not update the index at all, we don’t add any bloat to the indexes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- review your fillFactor
SELECT 
 pc.relname AS ObjectName,
    pc.reloptions AS ObjectOptions
FROM pg_class AS pc
INNER JOIN pg_namespace AS pns 
 ON pns.oid = pc.relnamespace
WHERE pns.nspname = 'public';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A lower fillfactor comes with trade-offs of its own: less densely packed pages can affect read performance for index and sequential scans. So carefully reduce the fillfactor (to 85%, for example) on tables that get a lot of updates and measure the performance difference!&lt;/p&gt;
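&lt;p&gt;To try this out, lower the fillfactor on a hot table and watch the ratio of HOT updates to total updates (the table name below is illustrative; the new fillfactor only applies to newly written pages):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- reserve 15% of each new page for same-page (HOT) updates
ALTER TABLE orders SET (fillfactor = 85);

-- compare HOT updates to total updates over time
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;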

&lt;p&gt;Hopefully, something like the &lt;a href="https://cybertec-postgresql.github.io/zheap/"&gt;zheap&lt;/a&gt; storage engine initiative will help us get past these bottlenecks in the future. Until then, we may not be able to prevent bloat entirely, but we can certainly minimize its impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Enhanced Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pIz0Apbg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AYrn4pCrAAPqARnmicJGyBQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pIz0Apbg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AYrn4pCrAAPqARnmicJGyBQ.png" alt="" width="880" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Turning on Enhanced Monitoring can provide useful insights at the database host level, including the process list. It is especially useful when you have to track down a specific process (PID) consuming a lot of CPU and map it to pg_stat_activity for more details on the query. It also gives you a great metric dimension for comparing Read/Write IOPS, memory, etc. against CPU utilization.&lt;/p&gt;
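&lt;p&gt;Once you have a suspect PID from the Enhanced Monitoring process list, mapping it to the running query is a one-liner (the PID below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- map an OS process ID to its session and query
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE pid = 12345;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;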

&lt;h3&gt;
  
  
  Query Plan Management (QPM)
&lt;/h3&gt;

&lt;p&gt;A major cause of response time variability is query plan instability. There are various factors that can unexpectedly change the execution plan of queries. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change in optimizer statistics (manually or automatically)&lt;/li&gt;
&lt;li&gt;Changes to the query planner configuration parameters&lt;/li&gt;
&lt;li&gt;Changes to the schema, such as adding a new index&lt;/li&gt;
&lt;li&gt;Changes to the bind variables used in the query&lt;/li&gt;
&lt;li&gt;Minor or major version upgrades of the PostgreSQL database (the ANALYZE operation is not performed automatically after an upgrade to refresh the pg_statistic table)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER"&gt;Planner configuration&lt;/a&gt; options such as default_statistics_target, from_collapse_limit, join_collapse_limit etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;QPM is a great feature that allows us to manage query plans, prevent plan regressions, and improve plan stability. QPM collects plan statistics and gives us the controls needed to approve plans that have a lower cost estimate, and/or let Aurora adapt automatically and run the minimal-cost plan.&lt;/p&gt;

&lt;p&gt;QPM is available on Amazon Aurora PostgreSQL version 10.5-compatible (Aurora 2.1.0) and later. It can be enabled in production (with minimal overhead) and exercised against your test working sets with tools such as &lt;a href="https://github.com/akopytov/sysbench"&gt;sysbench&lt;/a&gt;. I highly recommend turning it on in your test environments and practicing plan evolution (reviewing and approving plans) before applying QPM in production. Once it is applied to production, a periodic review will be necessary to see if the optimizer has found better plans with a lower cost estimate that need to be approved.&lt;/p&gt;
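&lt;p&gt;As a rough sketch, QPM is driven by the apg_plan_mgmt extension; the parameter values and view below are taken from the Aurora documentation, so double-check them against your engine version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- install the extension backing QPM (enable it in the parameter group first)
CREATE EXTENSION IF NOT EXISTS apg_plan_mgmt;

-- capture plan baselines and enforce approved plans
SET apg_plan_mgmt.capture_plan_baselines = automatic;
SET apg_plan_mgmt.use_plan_baselines = true;

-- review captured plans and their status
SELECT sql_hash, plan_hash, status, estimated_total_cost
FROM apg_plan_mgmt.dba_plans;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;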

&lt;h3&gt;
  
  
  Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;CPU or resource utilization issues that are predictable and repeatable for testing purposes are usually easier to deal with and tune for. Issues that happen once in a while, or are not repeatable, need a more careful inspection of metrics and some tooling to troubleshoot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/devops-guru/features/devops-guru-for-rds/"&gt;AWS DevOps Guru&lt;/a&gt; is one of the alternatives that is ML-based and used to identify anomalies such as increased latency, error rates, and resource constraints and then send alerts with a description and actionable recommendations for remediation. From a database perspective, you could for example alert on high-load wait events (based on the db_load metric) and CPU capacity exceeded. In addition, DevOps Guru can catch anomalies in logs which is a useful feature to have. For example, you could now alert on any abnormal error rate seen in PostgreSQL logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Amazon Aurora on PostgreSQL 14&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Amazon Aurora announced support for PostgreSQL major version 14 (14.3) recently! &lt;a href="https://www.postgresql.org/docs/14/release-14.html"&gt;PostgreSQL 14&lt;/a&gt; includes performance improvements for parallel queries, heavily-concurrent workloads, partitioned tables, logical replication, and vacuuming. In addition, this release includes enhancements to observability, developer experience, and security.&lt;/p&gt;

&lt;p&gt;For workloads that use many connections, the PostgreSQL 14 upgrade has achieved an improvement of 2x on some benchmarks. Heavy workloads, and workloads with many small write operations, also benefit from the new ability to pipeline queries to a database, which can boost performance over high-latency connections. This client-side feature can be used with any modern PostgreSQL database with the version 14 client or a client driver built with version 14 of libpq.&lt;/p&gt;

&lt;p&gt;Another big plus in PostgreSQL 14 is that dead tuples are automatically detected and removed even between vacuums, allowing for a reduced number of page splits, which in turn reduces index bloat.&lt;/p&gt;

&lt;p&gt;For distributed workloads, the use of logical replication can stream in-progress transactions to subscribers, with performance benefits for applying large transactions.&lt;/p&gt;

&lt;p&gt;Please refer to &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQLReleaseNotes/AuroraPostgreSQL.Updates.html#AuroraPostgreSQL.Updates.20180305.143X"&gt;Amazon Aurora PostgreSQL updates&lt;/a&gt; for more information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Right Sizing and Alternative Architecture Patterns
&lt;/h3&gt;

&lt;p&gt;In some cases, you may have a reasonably optimized database and queries but are still looking for that extra bit of performance improvement. You have a number of choices to better manage your workloads and improve performance and resource utilization. Often this means right sizing your database, database upgrades, or even choosing a better architecture for your applications. Check out my previous &lt;a href="https://dev.to/aws-builders/right-sizing-aws-rds-59hb"&gt;post on Right Sizing&lt;/a&gt; for some recommendations on this and more!&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Try to keep things simple, and start with the built-in tools and extensions available with your database engine on the cloud to quickly pinpoint and fix resource utilization issues. Continue to monitor your databases with slow query logs, CloudWatch, Performance Insights, Enhanced Monitoring, DevOps Guru, or an APM of your choice. Last but not least, reduce bloat for a bigger impact on overall performance. While this post is mostly centered around Aurora PostgreSQL, you can get similar insights on Aurora MySQL as well as RDS. Lastly, I want to leave some useful references for you to read up on this topic. Good luck!&lt;/p&gt;

&lt;h4&gt;
  
  
  References
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/rds-aurora-postgresql-high-cpu/"&gt;Troubleshoot high CPU utilization for Amazon RDS or Aurora for PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/database/a-case-study-of-tuning-autovacuum-in-amazon-rds-for-postgresql/"&gt;A Case Study of Tuning Autovacuum in Amazon RDS for PostgreSQL | Amazon Web Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/database/amazon-devops-guru-for-rds-under-the-hood/"&gt;Amazon DevOps Guru for RDS under the hood | Amazon Web Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://muratbuffalo.blogspot.com/2022/03/amazon-aurora-design-considerations-and.html"&gt;Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/06/amazon-aurora-supports-postgresql-14/"&gt;Amazon Aurora now supports PostgreSQL 14&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/database/introduction-to-aurora-postgresql-query-plan-management/"&gt;Introduction to Aurora PostgreSQL Query Plan Management | Amazon Web Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.adyen.com/blog/postgresql-hot-updates"&gt;Fighting PostgreSQL write amplification with HOT updates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cybertec-postgresql.com/en/zheap-reinvented-postgresql-storage/"&gt;zheap: Reinvented PostgreSQL storage - CYBERTEC&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you to my co-workers &lt;a href="https://medium.com/u/bd1907ce5879"&gt;Sasha Mandich&lt;/a&gt; and &lt;a href="https://medium.com/u/34d2b81e5dc4"&gt;Francislainy Campos&lt;/a&gt; for their feedback on this post!&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>awsaurora</category>
      <category>aws</category>
    </item>
    <item>
      <title>Testcontainers for Hashicorp Consul and Vault</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Wed, 16 Feb 2022 15:59:39 +0000</pubDate>
      <link>https://forem.com/krisiye/testcontainers-for-hashicorp-consul-and-vault-5hf7</link>
      <guid>https://forem.com/krisiye/testcontainers-for-hashicorp-consul-and-vault-5hf7</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ufREJj2M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A92KE3Rr_GSdODI_zY5wvHw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ufREJj2M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A92KE3Rr_GSdODI_zY5wvHw.png" alt="3 cubes representing testcontainers for Vault, Consul and LocalStack in a Docker-In-Docker environment with host.testcontainers.internal pointing from Vault to the other testcontainer cubes." width="880" height="677"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Testcontainers for Vault, Consul and LocalStack with DIND.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In my earlier post, I touched on some interesting architectural patterns for &lt;a href="https://dev.to/krisiye/spring-boot-configuration-and-secret-management-patterns-on-kubernetes-1179"&gt;Configuration and Secret Management&lt;/a&gt; for your microservices on k8s. I highly recommend reading it if you haven’t already. This article builds on the same idea and highlights the importance of integration tests for your service, and how we can leverage Testcontainers to ease challenges such as divergent test environments and configuration, or mocked components in your tests that may not be enough to catch issues early in your continuous integration pipelines and provide adequate code coverage.&lt;/p&gt;
&lt;h3&gt;
  
  
  Testcontainers
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.testcontainers.org/"&gt;Testcontainers&lt;/a&gt; is a JVM library that allows users to run and manage Docker images with Java code and frameworks such as JUnit. The most common use-cases for Testcontainers include integration tests against microservices with external dependencies such as Vault, Consul, AWS, Databases, Cache Frameworks, and more. With an API-based approach, it is easy to manage the lifecycle of a container along with the configuration that may be required for your services.&lt;/p&gt;
&lt;h3&gt;
  
  
  Integration tests with Testcontainers
&lt;/h3&gt;

&lt;p&gt;Some obvious benefits for integration testing with Testcontainers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TestContainers can help your tests mirror managed environments and components as closely as possible.

&lt;ul&gt;
&lt;li&gt;For sourcing secrets and configuration from Vault (including the various secret backends) and Consul.
&lt;/li&gt;
&lt;li&gt;Integration with cloud providers such as AWS (most commonly used services if not all), GCloud (incubating), Azure (incubating).
&lt;/li&gt;
&lt;li&gt;Databases such as PostgreSQL or MySQL etc.
&lt;/li&gt;
&lt;li&gt;Kafka, RabbitMQ for distributed messaging
&lt;/li&gt;
&lt;li&gt;The list goes on. Check out the full list of Testcontainer &lt;a href="https://github.com/testcontainers/testcontainers-java/tree/master/modules"&gt;modules&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Testing compatibility and tech stack upgrades for client libraries and dependencies such as spring cloud (Vault, Consul, AWS), AWS SDK, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why Vault and Consul?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Secret Management&lt;/strong&gt; in microservices needs to be high on your priority list to build secure and scalable microservices. Secrets can be &lt;strong&gt;sensitive&lt;/strong&gt;, &lt;strong&gt;dynamic, and time-bound&lt;/strong&gt;. They require proper &lt;strong&gt;access control&lt;/strong&gt; models with &lt;strong&gt;audit logs&lt;/strong&gt; and &lt;strong&gt;encryption&lt;/strong&gt;. We also need to support unique &lt;strong&gt;life-cycle&lt;/strong&gt; policies and rotations for microservices. While there are a few options that may work for your needs, &lt;a href="https://www.hashicorp.com/products/vault"&gt;&lt;strong&gt;Hashicorp Vault&lt;/strong&gt;&lt;/a&gt; is certainly the most popular and comprehensive solution in this space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Management&lt;/strong&gt; also presents a set of challenges with microservices. Support for static and dynamic configuration, externalized configurations, watching for changes, and updating application configuration without any service disruption are some of the key features we would expect from microservices. In addition, the microservice ecosystem in mature organizations often leads to a web of many inter-connected microservices, where &lt;a href="https://www.consul.io/use-cases/discover-services"&gt;&lt;strong&gt;Service Discovery&lt;/strong&gt;&lt;/a&gt; or even a &lt;a href="https://www.consul.io/use-cases/multi-platform-service-mesh"&gt;&lt;strong&gt;Service Mesh&lt;/strong&gt;&lt;/a&gt; becomes important in a cloud infrastructure to manage endpoint configuration, load balancing, security, etc. &lt;a href="https://www.consul.io/"&gt;&lt;strong&gt;Hashicorp Consul&lt;/strong&gt;&lt;/a&gt; is a great fit for meeting these requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Testcontainers with Vault
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.testcontainers.org/modules/vault/"&gt;Hashicorp Vault Testcontainer&lt;/a&gt; module aims to solve your app’s integration testing with Vault. You can use it to source static and dynamic credentials for your application as well as test how your application may behave with Vault by writing different test scenarios in Junit such as corner cases like lease rotations, lease expiry, exception handling, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;org.testcontainers&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;vault&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;${testcontainers.version}&amp;lt;/version&amp;gt;
    &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
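&lt;p&gt;A minimal JUnit-style sketch of the above (the image tag, token, and secret path are illustrative, not prescriptive):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// start Vault with a known root token and seed a secret for the test
VaultContainer&amp;lt;?&amp;gt; vault = new VaultContainer&amp;lt;&amp;gt;("vault:1.1.3")
        .withVaultToken("root")
        .withSecretInVault("secret/myapp", "username=demo");
vault.start();

// point your Vault client at the mapped endpoint
String endpoint = "http://" + vault.getHost() + ":" + vault.getMappedPort(8200);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;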



&lt;h3&gt;
  
  
  Testcontainers for Consul
&lt;/h3&gt;

&lt;p&gt;One of the challenges we hit was that there wasn’t a Testcontainers module for Consul. Based on some discussions with the Testcontainers team (&lt;a href="https://github.com/testcontainers/testcontainers-java/issues/4860"&gt;#4860&lt;/a&gt;), we decided to fork an existing &lt;a href="https://github.com/denverx/testcontainers-consul"&gt;project&lt;/a&gt;, polish it a bit, and publish this artifact for the OSS community as part of &lt;a href="https://www.hmhco.com/"&gt;Houghton Mifflin Harcourt&lt;/a&gt;. This also turns out to be a tiny milestone in some ways, as it is our first (of many more to come) OSS artifacts on Sonatype!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
 &amp;lt;groupId&amp;gt;com.hmhco&amp;lt;/groupId&amp;gt;
 &amp;lt;artifactId&amp;gt;testcontainers-consul&amp;lt;/artifactId&amp;gt;
 &amp;lt;version&amp;gt;0.0.4&amp;lt;/version&amp;gt;
 &amp;lt;scope&amp;gt;test&amp;lt;/scope&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The module supports legacy and newer versions of Consul, ACL, Clustering, and more. The project can be found on &lt;a href="https://github.com/hmhco/testcontainers-consul"&gt;Github&lt;/a&gt; and we welcome contributions via pull request and/or the discussion forum for any issues or improvements! Eventually, the hope would be to get this module added to &lt;a href="https://github.com/testcontainers/testcontainers-java"&gt;testcontainers-java&lt;/a&gt; and be supported alongside the rest of the modules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some useful pointers when working with Testcontainers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;There may be test scenarios that require a Testcontainer to be able to talk to another in the scope of a test. TestContainers &lt;a href="https://www.testcontainers.org/features/networking/"&gt;networking&lt;/a&gt; support makes it easy for you to do that with a generic host address “ &lt;strong&gt;host.testcontainers.internal:{port}&lt;/strong&gt; ” when looking up containers and ports that may be exposed.
&lt;/li&gt;
&lt;li&gt;An example would be to write a test when integrating Vault and Consul KV via ACL or even Vault integrating with other secret backends such as AWS (localstack) or Database backends (PostgreSQL or similar.)&lt;/li&gt;
&lt;li&gt;Supports Junit 4 and Jupiter/Junit 5 and Spock. Choose what’s best for your test framework.&lt;/li&gt;
&lt;li&gt;Manage &lt;a href="https://www.testcontainers.org/test_framework_integration/manual_lifecycle_control/"&gt;Testcontainer life-cycles&lt;/a&gt; appropriately.
&lt;/li&gt;
&lt;li&gt;Often you may want to re-use a Testcontainer across your tests. This may also help speed up your test phase.&lt;/li&gt;
&lt;li&gt;Use containers for your tests and pipelines. Various patterns including &lt;a href="https://www.testcontainers.org/supported_docker_environment/continuous_integration/dind_patterns/"&gt;DIND&lt;/a&gt; are available.
&lt;/li&gt;
&lt;li&gt;You can run an image per test method, an image per class, or even one image for all integration test executions. When sharing an image, pay close attention to test data and roll back to clean up state after test execution.&lt;/li&gt;
&lt;li&gt;Configure testcontainers.properties when working with private docker registries.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hub.image.name.prefix&lt;/strong&gt; ={your_private_registry}
&lt;/li&gt;
&lt;li&gt;Also see image &lt;a href="https://www.testcontainers.org/supported_docker_environment/image_registry_rate_limiting/"&gt;dependencies&lt;/a&gt; for what you may need on your private registry for Testcontainers and getting around any Docker Rate Limiting.&lt;/li&gt;
&lt;li&gt;Configuring logback to see Testcontainer logs is useful at development time and troubleshooting any issues. You could also &lt;a href="https://www.testcontainers.org/features/container_logs/"&gt;stream container logs&lt;/a&gt; if you choose to.&lt;/li&gt;
&lt;li&gt;Use container labels and image pull policy as appropriate to make sure you benefit from any caching for images that are not changing often. Please see &lt;a href="https://www.testcontainers.org/features/advanced_options/"&gt;advanced&lt;/a&gt; options for more details on this.&lt;/li&gt;
&lt;li&gt;Configure &lt;a href="https://www.testcontainers.org/features/startup_and_waits/"&gt;wait_timeouts&lt;/a&gt; for containers. Also not a bad idea to Assert the container is up before you kick off your tests.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Assert.assertTrue(yourContainer.isRunning()));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
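&lt;p&gt;For instance, the private-registry configuration mentioned above is a one-line entry in testcontainers.properties (the registry host below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ~/.testcontainers.properties
hub.image.name.prefix=registry.example.com/mirror/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;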



&lt;ul&gt;
&lt;li&gt;Follow the &lt;strong&gt;FIRST&lt;/strong&gt; principle for your tests as defined in the book &lt;a href="https://www.oreilly.com/library/view/clean-code-a/9780136083238/"&gt;&lt;strong&gt;&lt;em&gt;Clean Code: A Handbook of Agile Software Craftsmanship&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; written by Robert C. Martin. It’s a great read for coding best practices and I highly recommend it!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;a href="https://github.com/krisiye/springcloud_vault_consul"&gt;quick start&lt;/a&gt; for a spring-boot service with Spring Cloud Vault and Consul has been provided.
&lt;/li&gt;
&lt;li&gt;This example includes integration tests for Consul KV, Vault Secret Backends (KV, Consul, AWS).
&lt;/li&gt;
&lt;li&gt;It also includes a &lt;a href="https://github.com/krisiye/springcloud_vault_consul/tree/main/spring_cloud_vault_consul_with_acl/docker-compose"&gt;docker-compose&lt;/a&gt; recipe for running all of these integrations on your local development environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;The addition of Consul as an independent Testcontainers module allows you to integrate Vault and Consul in your test pipelines and add a lot more coverage for your code! I also hope the quickstart examples provided above serve as a good starting point for adding integration tests to your microservice and standardizing your development and test pipelines with best practices for configuration and secret management.&lt;/p&gt;

&lt;p&gt;Also, tune in to &lt;a href="https://events.hashicorp.com/hashitalks2022"&gt;HashiTalks 2022&lt;/a&gt; on Feb 17/18, 2022 if you are interested in learning more about HashiCorp Vault, Consul, and many more services for your cloud infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uYN6Iw_J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A0mlYcsgPTjzjdmz95mBILw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uYN6Iw_J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A0mlYcsgPTjzjdmz95mBILw.png" alt="Speaker card for Hashitalks 2022." width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Speaker card for Hashitalks 2022.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you would like to learn more about Spring Cloud and its integration with Vault and Consul for your microservice, please join me at HashiTalks on Feb 17, 12:05–12:35 GMT. We have a great lineup of speakers and topics this year, and I am looking forward to speaking at the event as well as learning a lot more from the Hashicorp user group!&lt;/p&gt;

&lt;p&gt;See you there!&lt;/p&gt;




</description>
      <category>vault</category>
      <category>consul</category>
      <category>microservices</category>
      <category>springcloud</category>
    </item>
    <item>
      <title>Amazon Aurora and Local Storage</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Tue, 25 May 2021 08:12:07 +0000</pubDate>
      <link>https://forem.com/aws-builders/amazon-aurora-and-local-storage-50of</link>
      <guid>https://forem.com/aws-builders/amazon-aurora-and-local-storage-50of</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yjBYN1s1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Alc0aMAyK27gpPorZlGNyHA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yjBYN1s1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Alc0aMAyK27gpPorZlGNyHA.png" alt="Amazon Aurora RDS Metrics."&gt;&lt;/a&gt;Amazon Aurora RDS Metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  “ERROR: could not write block &lt;code&gt;n&lt;/code&gt; of temporary file: No space left on device.”
&lt;/h3&gt;

&lt;p&gt;Sounds familiar? &lt;em&gt;“No space left on device”&lt;/em&gt; is certainly not common when it comes to &lt;a href="https://aws.amazon.com/rds/aurora/"&gt;Amazon Aurora&lt;/a&gt;, as &lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-aurora-increases-maximum-storage-size-128tb/"&gt;storage scales automatically up to 128TB&lt;/a&gt; and you’re unlikely to reach the limit when you scale up your application on a single Amazon Aurora database cluster. There is no need to delete data or split the database across multiple instances for storage purposes, which is great. What’s going on with local storage then? We could still be left with low or no local storage, leading to a failover, if we are running databases that are generally used for OLTP but periodically need to run a few large jobs that push the local storage limits.&lt;/p&gt;

&lt;p&gt;This post attempts to shed some more light on Amazon Aurora and its local storage architecture, as well as some options to improve performance, make better use of local storage, lower I/O costs, and avoid running into local storage limits. For simplicity, I will focus on Amazon Aurora for PostgreSQL for engine-specific examples and references on storage architecture, memory, and optimizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Aurora Storage Architecture and IO
&lt;/h3&gt;

&lt;p&gt;Amazon Aurora is backed by a robust, scalable, and distributed storage architecture. One of the big advantages of Amazon Aurora has been elastic storage that scales with your data, eliminating the need to provision large storage capacity only to utilize some percentage of it. For a while, when you deleted data from Aurora clusters, such as by dropping a table or partition, the overall allocated storage space remained the same. Since October 2020, the storage space allocated to your &lt;a href="https://aws.amazon.com/rds/aurora/"&gt;Amazon Aurora&lt;/a&gt; database cluster decreases dynamically when you delete data from the cluster. The storage space still automatically increases up to a maximum size of 128 tebibytes (TiB), and now also automatically decreases when data is deleted. Please refer to &lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/10/amazon-aurora-enables-dynamic-resizing-database-storage-space/"&gt;Dynamic Resizing of Database Storage Space&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s5e3FRg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A7E4bCCRobTxn9Ks5" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s5e3FRg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A7E4bCCRobTxn9Ks5" alt="Amazon Aurora Storage Architecture."&gt;&lt;/a&gt;Amazon Aurora Storage Architecture.&lt;/p&gt;

&lt;h4&gt;
  
  
  Storage Types
&lt;/h4&gt;

&lt;p&gt;Amazon Aurora clusters have two types of storage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage used for persistent data (shared cluster storage). For more information, see &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Overview.StorageReliability.html#aurora-storage-contents"&gt;What the cluster volume contains&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Storage used for temporary data and logs (local storage). All DB files (for example, error logs, temporary files, and temporary tables) are stored in the DB instance's local storage. This includes the sorting, hashing, and grouping operations that SQL queries require, the space that error logs occupy, and any temporary tables that are created. Simply put, temp space uses the local “ephemeral” volume on the instance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In general, the components that contribute to local storage consumption depend on the engine. For example, on Amazon Aurora PostgreSQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any temp tables created by PostgreSQL transactions (includes implicit and explicit user-defined temp tables)&lt;/li&gt;
&lt;li&gt;Data files&lt;/li&gt;
&lt;li&gt;WAL logs&lt;/li&gt;
&lt;li&gt;DB logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The storage architecture for PostgreSQL on RDS differs from Amazon Aurora: on RDS, temporary tables use persistent storage, while on Aurora they use local instance storage. Depending on the nature of your application and workloads, this can be a big change. However, ephemeral storage is faster and cheaper (Amazon Aurora does not charge for I/O against local storage) than persistent storage, which makes these queries run faster at lower cost.&lt;/p&gt;

&lt;p&gt;On the flip side, there is a vertical limit, based on instance size, to what can be processed in memory and on local storage in Amazon Aurora. Each Aurora DB instance contains a limited amount of local storage determined by its DB instance class; typically, the amount of local storage is about 2x the RAM for your instance class. Approximate values for the db.r5 instance classes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.r5.large ~ 31 GiB
db.r5.xlarge ~ 62 GiB
db.r5.2xlarge ~ 124 GiB
db.r5.4xlarge ~ 249 GiB
db.r5.8xlarge ~ 499 GiB
db.r5.12xlarge ~ 748 GiB
db.r5.24xlarge ~ 1500 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K1mYlyt---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AGaBfVtrVSWi7reAa" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K1mYlyt---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AGaBfVtrVSWi7reAa" alt="Free Local Storage metric from an example running sort operations against a large table against default database settings on Amazon Aurora on PostgreSQL."&gt;&lt;/a&gt;Free Local Storage metric from an example running sort operations against a large table against default database settings on Amazon Aurora on PostgreSQL.&lt;/p&gt;

&lt;p&gt;In a traditional PostgreSQL installation, you could overcome this limitation by creating tablespaces on an appropriately sized storage volume. Aurora PostgreSQL does not expose a filesystem for tablespaces: the primary data volume is an elastic block store and cannot be used for this purpose. While you can still create tablespaces on Amazon Aurora (and even configure temp_tablespaces), they exist mostly for compatibility.&lt;/p&gt;

&lt;p&gt;This is a significant shortcoming of the storage architecture: Aurora users may need to upgrade the instance class just to get more local storage, incurring additional cost (note: I/O against local storage is not charged) while being left with unused compute capacity. The preferred approach is to optimize our workloads and database configuration to minimize or completely avoid spilling to disk, and to keep multiple instances in the cluster for failover in case you do run low on local disk space.&lt;/p&gt;

&lt;p&gt;I hope local instance storage on Amazon Aurora will be considered for auto-scaling and/or be made configurable in the future!&lt;/p&gt;
&lt;h4&gt;
  
  
  IO
&lt;/h4&gt;

&lt;p&gt;As folks migrate to Amazon Aurora, it is really useful to understand the I/O subsystem from both a performance and a cost perspective.&lt;/p&gt;

&lt;p&gt;I/Os are input/output operations performed by the Aurora database engine against its SSD-based virtualized storage layer. Every database page read counts as one I/O: the Aurora database engine issues reads against the storage layer to fetch database pages that are not present in the cache. If your query traffic can be served entirely from memory, you will not be charged for page reads; if it cannot, you will be charged for each data page retrieved from storage. Each database page is 8KB in Aurora PostgreSQL.&lt;/p&gt;

&lt;p&gt;Amazon Aurora was designed to eliminate unnecessary I/O operations in order to reduce costs and to ensure resources are available for serving read/write traffic. Write I/Os are only consumed when persisting write-ahead log records in Aurora PostgreSQL to the storage layer for the purpose of making writes durable. Write I/Os are counted in 4KB units. For example, a log record that is 1024 bytes will count as one write I/O operation. However, if the log record is larger than 4KB, more than one write I/O operation will be needed to persist it. Concurrent write operations whose log records are less than 4KB may be batched together by the Aurora database engine in order to optimize I/O consumption if they are persisted on the same storage protection groups. It is also important to note that, unlike traditional database engines, Aurora never flushes dirty data pages to storage.&lt;/p&gt;
&lt;h3&gt;
  
  
  PostgreSQL Memory Architecture
&lt;/h3&gt;

&lt;p&gt;It is important to have a good understanding of the PostgreSQL memory components and architecture as we work through SQL optimizations related to performance and IO. Let us take a look at the big pieces next!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T-l3JDyC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/858/0%2A5eK-wULvDdpSEKXh" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T-l3JDyC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/858/0%2A5eK-wULvDdpSEKXh" alt="A simplified representation of the PostgreSQL memory architecture."&gt;&lt;/a&gt;A simplified representation of the PostgreSQL memory architecture.&lt;/p&gt;
&lt;h4&gt;
  
  
  Work_Mem
&lt;/h4&gt;

&lt;p&gt;work_mem is probably the most important of all the memory settings in PostgreSQL, and it needs tuning in most cases when reducing I/O or improving query performance. It defaults to 4MB, which means each operation within a query (each join, some sorts, etc.) can consume 4MB before it starts spilling to disk. Honestly, that is a bit low for many modern use cases: once Postgres starts writing temp files to disk, things are obviously much slower than in memory. One size does not fit all, and it is always tough to get work_mem exactly right. A lot depends on your workload: fewer connections with large queries (worth tuning), many connections with many small queries (defaults will work), or a combination of both (probably tuned defaults plus session-level overrides). A sane default (higher than the PostgreSQL default) can usually be found with appropriate testing and effective monitoring.&lt;/p&gt;

&lt;p&gt;Note that starting with PostgreSQL 10, parallel query execution is enabled by default and can make a significant difference in query processing and resource utilization. When adjusting work_mem, we need to factor in the overall memory our processes will use: work_mem limits memory per operation in each process, not per query, so work_mem * processes * operations per query can add up to significant memory usage. A few additional parameters to tune alongside work_mem are listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.postgresql.org/docs/11/runtime-config-resource.html#GUC-MAX-PARALLEL-WORKERS-PER-GATHER"&gt;max_parallel_workers_per_gather&lt;/a&gt;: the maximum number of workers the executor will use for the parallel execution of a single planner node&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.postgresql.org/docs/11/runtime-config-resource.html#GUC-MAX-WORKER-PROCESSES"&gt;max_worker_processes&lt;/a&gt;: the total number of background worker processes the system supports, typically sized relative to the number of CPU cores on the server&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.postgresql.org/docs/11/runtime-config-resource.html#GUC-MAX-PARALLEL-WORKERS"&gt;max_parallel_workers&lt;/a&gt;: the maximum number of workers the system can support for parallel queries&lt;/li&gt;
&lt;/ul&gt;
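&lt;p&gt;As a sketch of putting this into practice, work_mem overrides are often best scoped to the session or transaction that actually needs them rather than raised globally; the values below are illustrative, not recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Raise work_mem for the current session only (illustrative value)
SET work_mem = '256MB';

-- Or scope the override to a single transaction with SET LOCAL,
-- so it reverts automatically at COMMIT or ROLLBACK
BEGIN;
SET LOCAL work_mem = '256MB';
SET LOCAL max_parallel_workers_per_gather = 4;
-- run the large sort/join-heavy query here
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;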
&lt;h4&gt;
  
  
  Maintenance_Work_Mem
&lt;/h4&gt;

&lt;p&gt;Defaults to 64MB and is used by maintenance operations such as VACUUM, CREATE INDEX, CREATE MATERIALIZED VIEW, and ALTER TABLE ADD FOREIGN KEY. Note that the actual memory used also depends on the number of autovacuum workers (defaults to 3). For better control of the memory available to VACUUM operations, we can configure autovacuum_work_mem, which defaults to -1 (meaning it falls back to maintenance_work_mem).&lt;/p&gt;
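&lt;p&gt;For example, a one-off index build can borrow a larger maintenance_work_mem for just its own session; a minimal sketch where the table and index names are hypothetical and the 1GB value is purely illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Session-level override for a large maintenance operation
SET maintenance_work_mem = '1GB';
CREATE INDEX idx_orders_created_at ON orders (created_at);
-- Revert to the configured default afterwards
RESET maintenance_work_mem;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;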
&lt;h4&gt;
  
  
  Temp_Buffers
&lt;/h4&gt;

&lt;p&gt;The default is 8MB, assuming a BLCKSZ of 8KB. These are session-local buffers used only for access to temporary tables. Note that this value can only be changed before the first use of temporary tables within a session; any subsequent attempt has no effect. For data-heavy queries that use temporary tables, configuring a higher value can improve performance and minimize spilling to disk.&lt;/p&gt;
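&lt;p&gt;Because temp_buffers is fixed at the first temp-table use, it has to be set at the start of the session; a minimal sketch with an illustrative value and a hypothetical table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Must run before any temporary table is touched in this session
SET temp_buffers = '64MB';
CREATE TEMP TABLE staging_rows (id bigint, payload text);
-- subsequent temp-table access in this session uses the larger buffer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;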

&lt;blockquote&gt;
&lt;p&gt;The temp buffers are only used for access to temporary tables in a user session. There is no relation between temp buffers in memory and the temporary files that are created under the pgsql_tmp directory during large sort and hash table operations.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Shared Buffer
&lt;/h4&gt;

&lt;p&gt;shared_buffers defines how much dedicated system memory PostgreSQL uses for its cache. There is also a difference here compared to PostgreSQL on RDS: Aurora PostgreSQL eliminates double buffering and does not use the filesystem cache, so &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/rds-aurora-postgresql-shared-buffers/"&gt;Aurora PostgreSQL can increase shared_buffers to improve performance&lt;/a&gt;. It is a best practice to keep the default of roughly 75% of instance memory (the parameter default is SUM({DBInstanceClassMemory/12038},-50003)) for the shared_buffers DB parameter when using Aurora PostgreSQL. A smaller value can degrade performance by reducing the memory available for data pages while also increasing I/O on the Aurora storage subsystem. Tuning shared_buffers may still be warranted in some cases, such as lowering it to 50% of available memory when running fewer connections that need a larger session-level work_mem for optimal performance.&lt;/p&gt;
&lt;h4&gt;
  
  
  WAL Buffer
&lt;/h4&gt;

&lt;p&gt;WAL buffers hold write-ahead log (WAL) records that have not yet been written to storage; their size is controlled by the wal_buffers setting. Aurora uses a log-based storage engine, and changes are sent to the storage nodes for persistence. Given this difference in how writes are handled by the Aurora storage engine, this parameter should be left unchanged (defaults to 16MB) when using Aurora PostgreSQL.&lt;/p&gt;
&lt;h4&gt;
  
  
  CLOG Buffers
&lt;/h4&gt;

&lt;p&gt;CLOG (commit log) buffers are an area in operating system RAM dedicated to holding commit log pages. The commit log pages contain a log of transaction metadata and differ from the WAL data. The commit logs have the commit status of all transactions and indicate whether or not a transaction has been completed (committed). There is no specific parameter to control this area of memory. This is automatically managed by the database engine in tiny amounts. This is a shared memory component, which is accessible to all the background server and user processes of a PostgreSQL database.&lt;/p&gt;
&lt;h4&gt;
  
  
  Memory for Locks / Lock Space
&lt;/h4&gt;

&lt;p&gt;This memory space stores all heavyweight locks used by the PostgreSQL instance. These locks are shared across all the background server and user processes connecting to the database. Larger, non-default settings of two database parameters, max_locks_per_transaction (defaults to 64) and max_pred_locks_per_transaction (defaults to 64), influence the size of this memory component.&lt;/p&gt;
&lt;h3&gt;
  
  
  Monitoring Queries and disk usage
&lt;/h3&gt;

&lt;p&gt;This is an interesting space, and almost all Aurora engines (MySQL, PostgreSQL, etc.) have a rich set of capabilities for monitoring and provisioning adequate memory buffers for optimal processing. I will mostly use PostgreSQL as the example for the rest of this post, which IMHO is a great showcase of comprehensive built-in monitoring and statistics capabilities.&lt;/p&gt;
&lt;h4&gt;
  
  
  pg_stat_statements
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=&amp;gt; \d pg_stat_statements;
                    View "public.pg_stat_statements"
       Column | Type | Collation | Nullable | Default 
--------------------------+------------------+-----------+----------+---------
 userid | oid | | | 
 dbid | oid | | | 
 queryid | bigint | | | 
 query | text | | | 
 calls | bigint | | | 
 total_time | double precision | | | 
 min_time | double precision | | | 
 max_time | double precision | | | 
 mean_time | double precision | | | 
 stddev_time | double precision | | | 
 rows | bigint | | | 
 shared_blks_hit | bigint | | | 
 shared_blks_read | bigint | | | 
 shared_blks_dirtied | bigint | | | 
 shared_blks_written | bigint | | | 
 local_blks_hit | bigint | | | 
 local_blks_read | bigint | | | 
 local_blks_dirtied | bigint | | | 
 local_blks_written | bigint | | | 
 temp_blks_read | bigint | | | 
 temp_blks_written | bigint | | | 
 blk_read_time | double precision | | | 
 blk_write_time | double precision | | |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;shared blocks&lt;/strong&gt; contain data from regular tables and indexes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;local blocks&lt;/strong&gt; contain data from temporary tables and indexes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;temp blocks&lt;/strong&gt; contain short-term working data used in sorts, hashes, Materialize plan nodes, and similar cases.&lt;/li&gt;
&lt;/ul&gt;
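&lt;p&gt;Putting these counters to work, a query along the following lines (a sketch against the pg_stat_statements view shown above, using the 8KB block size) surfaces the statements that spill the most to temp files along with their cache hit ratio:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Top statements by temp block writes (i.e. disk spill)
SELECT queryid,
       calls,
       temp_blks_written,
       temp_blks_written * 8 / 1024 AS temp_mb_written,
       round(shared_blks_hit::numeric /
             NULLIF(shared_blks_hit + shared_blks_read, 0), 4) AS hit_ratio
FROM pg_stat_statements
ORDER BY temp_blks_written DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;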

&lt;p&gt;Another tried and tested approach is to enable log_temp_files on your database server, which logs every query that creates a temporary file, along with the file's size. Sample log:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-04-10 09:30:15 UTC:xx.xx.x.xx(xxxxx):postgres@example:[xxxxx]: **LOG** : 00000: **temporary file** : path "base/pgsql_tmp/pgsql_tmp20362.65", size **1073741824**  
2021-04-10 09:30:15 UTC:xx.xx.x.xx(xxxxx):postgres@example:[xxxxx]: **CONTEXT** : SQL statement xxxx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Additionally, we could combine these metrics with native tools for a deep dive on log analysis for your engine such as &lt;a href="https://github.com/darold/pgbadger"&gt;pgbadger&lt;/a&gt; for PostgreSQL.&lt;/p&gt;
&lt;h4&gt;
  
  
  pg_statio
&lt;/h4&gt;

&lt;p&gt;The pg_statio_ views are primarily useful for determining the effectiveness of the buffer cache. When the number of actual disk reads is much smaller than the number of buffer hits, the cache is satisfying most read requests without invoking a kernel call. However, these statistics do not give the entire story: due to the way in which PostgreSQL handles disk I/O, data that is not in the PostgreSQL buffer cache might still reside in the kernel's I/O cache, and might therefore still be fetched without requiring a physical read.&lt;/p&gt;
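&lt;p&gt;For instance, a rough per-table cache hit ratio can be derived from pg_statio_user_tables; a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Tables with the most physical reads, with their buffer cache hit ratio
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit::numeric /
             NULLIF(heap_blks_hit + heap_blks_read, 0), 4) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;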
&lt;h4&gt;
  
  
  Useful Tips
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Enable &lt;a href="https://www.postgresql.org/docs/current/auto-explain.html"&gt;auto_explain&lt;/a&gt; with auto_explain.log_nested_statements = onallows you to see the duration and the execution plans of the SQL statements inside the function in the PostgreSQL log file.&lt;/li&gt;
&lt;li&gt;Enable &lt;a href="https://www.postgresql.org/docs/current/pgstatstatements.html"&gt;pg_stat_statements&lt;/a&gt; and set the parameter pg_stat_statements.track = all. pg_stat_statements will track information for the SQL statements inside a function. That way you can see metrics at the SQL statement level within your stored procedure or function. In addition, consider configuring pg_stat_statements.max to a higher value than defaults (5000) if you would like to keep a larger dataset to compare. Otherwise, information about the least-executed statements is discarded.&lt;/li&gt;
&lt;li&gt;Track queries and cache hit ratios. It is important to get a good hit ratio such as &amp;gt; 90% in most cases.&lt;/li&gt;
&lt;li&gt;Track top queries and slice by number of executions, average execution time, CPU, etc. CPU-heavy queries are often an indicator of a performance problem; they may not be the root cause, but can be a symptom of an I/O-related issue. We can easily see these using pg_stat_statements.&lt;/li&gt;
&lt;li&gt;Use pg_statio_all_tables and pg_statio_all_indexes to track IO metrics for tables along with indexes.&lt;/li&gt;
&lt;li&gt;It is a common practice to create temporary tables, insert data and then add an index. However, depending on our data set and key length, we could end up overflowing to disk.&lt;/li&gt;
&lt;li&gt;A longer log retention also means more local storage consumption, so workloads and log retention should be adjusted appropriately. Note that Amazon Aurora compresses (gzip) older logs when storage is low, and deletes them when necessary or when storage is critically low. See &lt;a href="https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/USER_LogAccess.Concepts.PostgreSQL.html"&gt;PostgreSQL database log files&lt;/a&gt; for more details. For longer retention periods, consider &lt;a href="https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.CloudWatch.html"&gt;offloading logs to CloudWatch&lt;/a&gt; and/or your monitoring/log-analysis tool of choice.&lt;/li&gt;
&lt;li&gt;Tune conservatively to begin with: set shared_buffers and work_mem to reasonable values based on database size and workload, and set maintenance_work_mem, effective_cache_size, effective_io_concurrency, temp_buffers, and random_page_cost according to your instance size. Note that Amazon Aurora defaults are often derived from the instance class (memory, CPU, etc.). Review carefully, back these configuration updates with a good performance test representative of your data sets and workloads, benchmark the results, and iterate.&lt;/li&gt;
&lt;li&gt;Eliminate ORDER BY clauses where possible. Sorts in queries spanning several joins and/or large tables are one of the most common causes of disk spill: operations that cannot fit in the configured memory buffers (work_mem, temp_buffers, etc.) end up creating temporary files. Please see PostgreSQL Memory Architecture above for more details. In many cases applications can deal with unsorted query results (or simply do not care), which reduces memory utilization and saves us from spilling to disk on the database.&lt;/li&gt;
&lt;li&gt;Watch out for the aggregate strategy as well as the sort method in your explain plans around GROUP BY and ORDER BY clauses. Shown below is an explain plan for a query against a large table with a GROUP BY that uses GroupAggregate with an external merge on disk as its sort method:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Finalize GroupAggregate** (cost=92782922.93..126232488.05 rows=116059400 width=119) (actual time=2414955.083..2820682.212 rows=59736353 loops=1)
Group Key: t1.id, t1.name, t1.group
-&amp;gt; **Gather Merge** (cost=92782922.93..122750706.05 rows=232118800 width=119) (actual time=2414955.070..2792088.265 rows=108495098 loops=1)
     Workers Planned: 2
     Workers Launched: 2
    -&amp;gt; **Partial GroupAggregate** (cost=92781922.91..95957437.06 rows=116059400 width=119) (actual time=2417780.397..2660540.817 rows=36173064 loops=3)
      Group Key: t1.id, t1.name, t1.group
        -&amp;gt; Sort (cost=92781922.91..93184906.94 rows=161193612 width=119) (actual time=2417780.369..2607915.169 rows=264507496 loops=3)
          Sort Key: t1.id, t1.name, t1.group
          **Sort Method: external merge Disk: 33520120kB**
          **Worker 0: Sort Method: external merge Disk: 33514064kB**
          **Worker 1: Sort Method: external merge Disk: 33903440kB**
           -&amp;gt; Hash Join (cost=31.67..50973458.94 rows=161193612 width=119) (actual time=123.618..260171.993 rows=264507496 loops=3)
           :::
 Planning Time: 0.413 ms
 **Execution Time: 2977412.203 ms**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shown below is an explain plan with an optimal work_mem configuration. We can see a huge improvement: this plan uses HashAggregate instead of GroupAggregate and processes everything in memory, thus reducing I/O and also improving performance on Amazon Aurora.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**HashAggregate** (cost=79383833.03..80544427.03 rows=116059400 width=119) (actual time=1057250.962..1086043.902 rows=59736353 loops=1)
    Group Key: t1.id, t1.name, t1.group
    -&amp;gt; **Hash Join** (cost=31.67..75515186.35 rows=386864668 width=119) (actual time=272.742..755349.784 rows=793522488 loops=1)
    :::
Planning Time: 0.402 ms
**Execution Time: 1237705.369 ms**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Performance Insights
&lt;/h4&gt;

&lt;p&gt;Amazon Aurora makes many of the key engine-specific metrics available as dashboards. Performance Insights is currently available for Amazon Aurora PostgreSQL compatible Edition, MySQL compatible Edition, PostgreSQL, MySQL, Oracle, SQL Server, and MariaDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OkHuR4vT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AxCU3Rss-YuqHpVD2Hzvt_w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OkHuR4vT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AxCU3Rss-YuqHpVD2Hzvt_w.png" alt="Performance Insights for Aurora PostgreSQL."&gt;&lt;/a&gt;Performance Insights for Aurora PostgreSQL.&lt;/p&gt;

&lt;p&gt;For PostgreSQL, stats from pg_stat_statements as well as the active processes and live queries from pg_stat_activity are available on the Performance Insights dashboard.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Performance Insights can only collect statistics for queries in pg_stat_activity that aren't truncated. By default, PostgreSQL databases truncate queries longer than 1,024 bytes. To increase the query size, change the track_activity_query_size parameter in the DB parameter group associated with your DB instance. When you change this parameter, a DB instance reboot is required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Testing Query Optimizations and Caching
&lt;/h3&gt;

&lt;p&gt;We always need a good way to test our optimizations, and it is quite common to get different results from a cold start versus a warmed-up database that is using its cache effectively. This certainly makes optimizations harder to compare. At runtime, however, you will benefit from caching, so tests should generally be run with the cache in effect.&lt;/p&gt;

&lt;p&gt;There are some good reasons, though, for testing without the cache, and the capabilities for that depend on your engine. In Amazon Aurora MySQL, you can run &lt;strong&gt;RESET QUERY CACHE&lt;/strong&gt; between runs to make tests comparable.&lt;/p&gt;

&lt;p&gt;In Amazon Aurora PostgreSQL, there is no OS-level cache to deal with: I/O is handled by the Aurora storage driver, and there is no filesystem or second level of caching for tables or indexes (another reason to increase shared_buffers). However, we can use EXPLAIN (ANALYZE, BUFFERS) to gain insight into the query plan, buffer usage, and execution times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=&amp;gt; **EXPLAIN (ANALYZE,BUFFERS) SELECT \* FROM foo;**

QUERY PLAN
------------------------------------------------------------------------ **Seq Scan** on foo ( **cost** =0.00..715.04 **rows** =25004 width=172) (actual time=0.011..3.037 **rows** =25000 **loops** =1)

**Buffers** : shared hit=465
**Planning Time** : 0.038 ms
**Execution Time** : 4.252 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows how the buffers serve a query and how that translates to caching. To clear the session cache and the query plan cache, you can use DISCARD PLAN, or DISCARD ALL if you want to clear everything. Please see &lt;a href="https://www.postgresql.org/docs/13/sql-discard.html"&gt;SQL-DISCARD&lt;/a&gt; for more information on the options.&lt;/p&gt;

&lt;p&gt;I hope this post gave you some insight into Amazon Aurora, the PostgreSQL architecture, the challenges with local storage, and some useful tips to help you along this journey! Next up, I plan to look at architecture patterns for transparent read/write splitting on Amazon Aurora, as well as benchmarking the &lt;a href="https://github.com/awslabs/aws-postgresql-jdbc"&gt;AWS-provided JDBC driver for PostgreSQL&lt;/a&gt; that supports fast failover!&lt;/p&gt;

&lt;p&gt;Stay Safe!&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3PshvYmTv9M"&gt;Deep Dive on Amazon Aurora with PostgreSQL Compatibility&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.BestPractices.html"&gt;Amazon Aurora PostgreSQL Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/database/amazon-aurora-postgresql-parameters-part-1-memory-and-query-plan-management/"&gt;Aurora PostgreSQL Memory and Query Plan Management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora_PostgreSQL_Performance_Assessment_Benchmarking_V1-0.pdf"&gt;Aurora PostgreSQL benchmarking guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.percona.com/blog/2019/02/21/parallel-queries-in-postgresql/"&gt;Parallel Queries in PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/12/monitoring-stats.html"&gt;PostgreSQL Stats Monitoring&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you &lt;a href="https://medium.com/u/2053aaf853f5"&gt;Attila Vágó&lt;/a&gt;, &lt;a href="https://medium.com/u/34d2b81e5dc4"&gt;Francislâiny Campos&lt;/a&gt;, and &lt;a href="https://medium.com/u/c989a91ec1f9"&gt;Andrew Brand&lt;/a&gt; for your feedback on this post!&lt;/p&gt;




</description>
      <category>aws</category>
      <category>amazonaurora</category>
      <category>postgres</category>
    </item>
    <item>
      <title>AWS STS with Spring Cloud Vault</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Tue, 30 Mar 2021 08:31:25 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-sts-with-spring-cloud-vault-1e5g</link>
      <guid>https://forem.com/aws-builders/aws-sts-with-spring-cloud-vault-1e5g</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fproxy%2F0%2An-Aqmi5lNsCCAfh_" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fproxy%2F0%2An-Aqmi5lNsCCAfh_" alt="AWS STS with Spring Cloud Vault"&gt;&lt;/a&gt;AWS STS with Spring Cloud Vault&lt;/p&gt;

&lt;p&gt;In my last post “&lt;a href="https://dev.to/krisiye/spring-boot-configuration-and-secret-management-patterns-on-kubernetes-1179"&gt;Spring Boot Configuration and Secret Management Patterns on Kubernetes&lt;/a&gt;” I touched on some integration patterns for secret management with Spring Cloud Vault. I also highlighted that one of the issues I was working on was enabling &lt;a href="https://github.com/spring-cloud/spring-cloud-vault/pull/575" rel="noopener noreferrer"&gt;AWS STS for Spring Cloud Vault&lt;/a&gt;. This is now available with Spring Cloud 2020.0.2!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
 &amp;lt;groupId&amp;gt;org.springframework.cloud&amp;lt;/groupId&amp;gt;
 &amp;lt;artifactId&amp;gt;spring-cloud-dependencies&amp;lt;/artifactId&amp;gt;
 &amp;lt;version&amp;gt;2020.0.2&amp;lt;/version&amp;gt;
 &amp;lt;type&amp;gt;pom&amp;lt;/type&amp;gt;
 &amp;lt;scope&amp;gt;import&amp;lt;/scope&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice the new &lt;a href="https://github.com/spring-cloud/spring-cloud-release/wiki/Release-Train-Naming-Convention" rel="noopener noreferrer"&gt;Release Train versioning&lt;/a&gt; naming convention!&lt;/p&gt;
&lt;h3&gt;
  
  
  AWS Security Token Service (AWS STS)
&lt;/h3&gt;

&lt;p&gt;AWS Security Token Service (AWS STS) is a web service that enables you to request temporary, limited-privilege credentials for AWS Identity and Access Management (IAM) users or for users that you authenticate (federated users). The key purpose of AWS STS is to allow a user or an application to assume a role and obtain access to AWS services or resources. For more information, see &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html" rel="noopener noreferrer"&gt;Temporary Security Credentials&lt;/a&gt; in the &lt;em&gt;IAM User Guide&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F748%2F0%2AUceGoK0Y7NClA3Kj" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F748%2F0%2AUceGoK0Y7NClA3Kj" alt="IAM user assuming role via STS."&gt;&lt;/a&gt;IAM user assuming role via STS.&lt;/p&gt;

&lt;p&gt;For applications it's no different, and we could have the application assume the role and request temporary credentials to AWS resources such as EC2, S3, etc. This is where Spring Cloud Vault combined with AWS Secrets backend on Vault provides the capability for a Spring Boot application to use dynamic credentials.&lt;/p&gt;
&lt;h3&gt;
  
  
  Vault AWS Secret Backend
&lt;/h3&gt;

&lt;p&gt;The AWS secrets engine generates AWS access credentials dynamically based on IAM policies. The AWS IAM credentials are time-based and are automatically revoked when the Vault lease expires.&lt;/p&gt;

&lt;p&gt;Vault supports three different types of credentials to retrieve from AWS:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.vaultproject.io/docs/secrets/aws#iam_user" rel="noopener noreferrer"&gt;iam_user&lt;/a&gt;: Vault will create an IAM user for each lease, attach the managed and inline IAM policies as specified in the role to the user, and if a &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_boundaries.html" rel="noopener noreferrer"&gt;permissions boundary&lt;/a&gt; is specified on the role, the permissions boundary will also be attached. Vault will then generate an access key and secret key for the IAM user and return them to the caller. IAM users have no session tokens and so no session token will be returned. Vault will delete the IAM user upon reaching the TTL expiration.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.vaultproject.io/docs/secrets/aws#assumed_role" rel="noopener noreferrer"&gt;assumed_role&lt;/a&gt;: Vault will call &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html" rel="noopener noreferrer"&gt;sts:AssumeRole&lt;/a&gt; and return the access key, secret key, and session token to the caller.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.vaultproject.io/docs/secrets/aws#federation_token" rel="noopener noreferrer"&gt;federation_token&lt;/a&gt;: Vault will call &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_GetFederationToken.html" rel="noopener noreferrer"&gt;sts:GetFederationToken&lt;/a&gt; passing in the supplied AWS policy document and return the access key, secret key, and session token to the caller.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;More details on the setup can be found under &lt;a href="https://www.vaultproject.io/docs/secrets/aws#setup" rel="noopener noreferrer"&gt;AWS Secrets Engine&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Spring Cloud Vault
&lt;/h3&gt;

&lt;p&gt;The AWS secrets engine integration can be enabled by adding the spring-cloud-vault-config-aws dependency:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependencies&amp;gt;
    &amp;lt;dependency&amp;gt;
        &amp;lt;groupId&amp;gt;org.springframework.cloud&amp;lt;/groupId&amp;gt;
        &amp;lt;artifactId&amp;gt;spring-cloud-vault-config-aws&amp;lt;/artifactId&amp;gt;
        &amp;lt;version&amp;gt;3.0.2&amp;lt;/version&amp;gt;
    &amp;lt;/dependency&amp;gt;
&amp;lt;/dependencies&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The AWS secret integration now supports the notion of a credential-type, which defaults to iam_user for backward compatibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample iam_user configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spring.cloud.vault:
  aws:
    enabled: true
    role: readonly
    backend: aws
    access-key-property: cloud.aws.credentials.accessKey
    secret-key-property: cloud.aws.credentials.secretKey
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;enabled set to true enables the AWS backend&lt;/li&gt;
&lt;li&gt;role sets the role name of the AWS role definition in Vault&lt;/li&gt;
&lt;li&gt;backend sets the path of the AWS mount to use&lt;/li&gt;
&lt;li&gt;access-key-property sets the property name under which the AWS access key is stored&lt;/li&gt;
&lt;li&gt;secret-key-property sets the property name under which the AWS secret key is stored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For AWS STS, the supported values for credential-type are assumed_role and federation_token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample assumed_role configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spring.cloud.vault:
  aws:
    enabled: true
    role: sts-vault-role
    backend: aws
    credential-type: assumed_role
    access-key-property: cloud.aws.credentials.accessKey
    secret-key-property: cloud.aws.credentials.secretKey
    session-token-key-property: cloud.aws.credentials.sessionToken
    ttl: 3600s
    role-arn: arn:aws:iam::${AWS_ACCOUNT}:role/sts-app-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;New STS configuration additions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;session-token-key-property sets the property name under which the AWS STS session token is stored&lt;/li&gt;
&lt;li&gt;credential-type sets the AWS credential type to use for this backend; defaults to iam_user&lt;/li&gt;
&lt;li&gt;ttl sets the TTL for the STS token when using assumed_role or federation_token; defaults to the TTL specified by the Vault role. Min/max values are limited to what AWS supports for STS&lt;/li&gt;
&lt;li&gt;role-arn sets the IAM role to assume when using assumed_role, if more than one role is configured for the Vault role&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Please read &lt;a href="https://docs.spring.io/spring-cloud-vault/docs/current/reference/html/#vault.config.backends.aws" rel="noopener noreferrer"&gt;Spring Cloud Vault AWS Backend&lt;/a&gt; for more details on this integration.&lt;/p&gt;
&lt;h3&gt;
  
  
  Lease Rotation and Property Sources
&lt;/h3&gt;

&lt;p&gt;STS credentials default to a TTL of 60 minutes, which you can adjust to your requirements. It’s important to note that the minimum and maximum TTL values allowed are bounded by what AWS STS supports.&lt;/p&gt;

&lt;p&gt;For assumed_role, the TTL can be set between a minimum of 900 seconds (15 minutes) and the maximum session duration configured on the role, which can be anywhere between 3,600 seconds (1 hour) and 43,200 seconds (12 hours). The default expiration period for federation_token is substantially longer (12 hours instead of the 1 hour for assumed_role), and the duration can be set between 900 seconds (15 minutes) and 129,600 seconds (36 hours).&lt;/p&gt;
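&lt;p&gt;To make those bounds concrete, here is a small illustrative helper (not part of Spring Cloud Vault or the AWS SDK; the class and method names are mine) that clamps a requested TTL into the ranges above:&lt;/p&gt;

```java
// Illustrative only: clamp a requested STS TTL (in seconds) into the
// ranges AWS allows for each credential type, as described above.
public class StsTtl {
    static final long MIN_SECONDS = 900L;                 // 15 minutes, both types
    static final long FEDERATION_MAX_SECONDS = 129_600L;  // 36 hours

    // assumed_role: 900s up to the role's configured max session duration
    static long clampAssumedRole(long requested, long roleMaxSessionSeconds) {
        return Math.max(MIN_SECONDS, Math.min(requested, roleMaxSessionSeconds));
    }

    // federation_token: 900s up to 129,600s
    static long clampFederationToken(long requested) {
        return Math.max(MIN_SECONDS, Math.min(requested, FEDERATION_MAX_SECONDS));
    }
}
```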

&lt;p&gt;Spring Cloud Vault managed leases can either be RENEWED (if they are renewable) or ROTATED based on the vault lifecycle configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Vault lifecycle:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vault:
  enabled: true
  host: 127.0.0.1
  port: 8200
  scheme: http
  uri: http://127.0.0.1:8200/
  config:
    lifecycle:
      min-renewal: 1m
      expiry-threshold: 5m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;min-renewal makes sure that leases are not renewed or rotated too frequently and stick around for at least the configured duration. expiry-threshold is the configured duration before lease expiry at which a lease is renewed or rotated.&lt;/p&gt;

&lt;p&gt;Spring Cloud Vault and the LeaseContainer will make sure the property sources are updated with the new set of credentials upon lease expiry. However, it is the application’s responsibility to make sure that any properties updated in the property sources and environment are propagated to any Spring beans initialized with those credentials.&lt;/p&gt;

&lt;p&gt;Let us assume a Spring Boot application that manages AWS credentials through a ConfigurationProperties class such as the one below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
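&lt;p&gt;A minimal sketch of such a class (a hypothetical reconstruction; the property prefix follows the access-key-property and secret-key-property names from the sample configuration above):&lt;/p&gt;

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Binds the Vault-provided AWS credential properties into an injectable bean.
@ConfigurationProperties("cloud.aws.credentials")
public class AwsConfigurationProperties {

    private String accessKey;
    private String secretKey;
    private String sessionToken;

    public String getAccessKey() { return accessKey; }
    public void setAccessKey(String accessKey) { this.accessKey = accessKey; }
    // ... similar getters and setters for secretKey and sessionToken
}
```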



&lt;p&gt;Let’s also assume there is another refresh-scope bean that has an autowired dependency on AwsConfigurationProperties.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
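&lt;p&gt;As a hypothetical sketch (the bean names basicAWSCredentials and amazonS3Client are the ones this example uses; the AWS SDK v1 S3 client builder is used for illustration):&lt;/p&gt;

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicSessionCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AwsConfiguration {

    @Autowired
    private AwsConfigurationProperties awsProperties;

    // Refresh-scoped so rotated credentials can be picked up on refresh.
    @Bean
    @RefreshScope
    public BasicSessionCredentials basicAWSCredentials() {
        return new BasicSessionCredentials(awsProperties.getAccessKey(),
                awsProperties.getSecretKey(), awsProperties.getSessionToken());
    }

    @Bean
    @RefreshScope
    public AmazonS3 amazonS3Client(BasicSessionCredentials credentials) {
        return AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .build();
    }
}
```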


&lt;p&gt;In this scenario, it becomes important to listen to SecretLeaseCreatedEvent and rebind/refresh the respective configuration properties and any other refresh scoped beans within the application that may need updated properties, such as AWS credentials injected. Let us review how we can achieve this next.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
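&lt;p&gt;A hedged sketch of what that listener could look like (bean and class names are the ones used in this example; error handling omitted):&lt;/p&gt;

```java
import javax.annotation.PostConstruct;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.context.properties.ConfigurationPropertiesRebinder;
import org.springframework.cloud.context.scope.refresh.RefreshScope;
import org.springframework.context.annotation.Configuration;
import org.springframework.vault.core.lease.SecretLeaseContainer;
import org.springframework.vault.core.lease.event.SecretLeaseCreatedEvent;
import org.springframework.vault.core.lease.event.SecretLeaseEvent;

@Configuration
public class VaultAwsConfiguration {

    @Autowired
    private SecretLeaseContainer leaseContainer;

    @Autowired
    private ConfigurationPropertiesRebinder rebinder;

    @Autowired
    private RefreshScope refreshScope;

    @PostConstruct
    public void postConstruct() {
        // Register a lease listener so we are notified when rotated secrets arrive.
        leaseContainer.addLeaseListener(this::onLeaseEvent);
    }

    private void onLeaseEvent(SecretLeaseEvent event) {
        if (event instanceof SecretLeaseCreatedEvent) {
            // Rebind the configuration properties with the rotated credentials.
            rebinder.rebind("awsConfigurationProperties");
            // Refresh the refresh-scoped beans that hold AWS credentials/clients.
            refreshScope.refresh("basicAWSCredentials");
            refreshScope.refresh("amazonS3Client");
        }
    }
}
```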


&lt;ul&gt;
&lt;li&gt;VaultAwsConfiguration shown above registers a lease listener during postConstruct()&lt;/li&gt;
&lt;li&gt;Rebinds any ConfigurationProperties (AwsConfigurationProperties) using a ConfigurationPropertiesRebinder&lt;/li&gt;
&lt;li&gt;Refreshes any refresh scoped beans (AwsConfiguration, basicAWSCredentials, amazonS3Client) on the ApplicationContext upon receiving a SecretLeaseCreatedEvent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wondering why not just use Spring Cloud AWS? If you are, you are absolutely right! At the moment, Spring Cloud AWS v2.3.0 only supports AWS access keys and secrets; it also does not yet integrate with Vault and lease events. I do have an issue logged for &lt;a href="https://github.com/awspring/spring-cloud-aws/issues/73" rel="noopener noreferrer"&gt;supporting STS Session Token&lt;/a&gt;, and it would be a nice addition for Spring Cloud AWS to integrate with Spring Cloud Vault for its credential manager implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graceful Shutdown
&lt;/h3&gt;

&lt;p&gt;Spring Cloud Vault revokes any active leases as the application container shuts down. The Vault role will need the appropriate permissions to perform sys/leases/revoke so that Spring Cloud Vault can revoke leases.&lt;/p&gt;

&lt;p&gt;Something I ran across with Spring Boot 2.4 and legacy bootstrap is that /actuator/refresh ends up closing the context, which also triggers destroy() on the LeaseContainer, resulting in a revoke. There isn’t a fix or a workaround for this yet under legacy bootstrap; the recommendation is to cut over to the Config Data API. Note that spring.config imports are processed in reverse order. For instance, if we were using multiple sources such as Vault and Consul (with ACLs) for imports and would like Vault secrets to be resolved and imported before the others, they would have to be set up as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spring:
  config:
    import: consul://,vault://
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unrelated to STS and vault I have a &lt;a href="https://github.com/spring-projects/spring-boot/issues/25705" rel="noopener noreferrer"&gt;spring boot issue&lt;/a&gt; raised for ordered dependency resolution with config data API where we need property sources updated to be honored before we process imports (Such as the consul ACL token from the vault consul backend).&lt;/p&gt;

&lt;h3&gt;
  
  
  Known Issues
&lt;/h3&gt;

&lt;p&gt;With Spring Cloud 2020.0.2 there is a known issue (&lt;em&gt;java.lang.NoSuchMethodError&lt;/em&gt;) that stems from spring-cloud-config due to incorrect dependency resolution for spring-vault-core. See &lt;a href="https://github.com/spring-cloud/spring-cloud-config/issues/1841" rel="noopener noreferrer"&gt;Vault core dependency resolution causing java.lang.NoSuchMethodError&lt;/a&gt; for more details. This will be corrected in a subsequent release, but in the meantime you can work around it by overriding the spring-vault-core version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
 &amp;lt;groupId&amp;gt;org.springframework.vault&amp;lt;/groupId&amp;gt;
 &amp;lt;artifactId&amp;gt;spring-vault-core&amp;lt;/artifactId&amp;gt;
 &amp;lt;version&amp;gt;2.3.2&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you enjoyed the read and this helps you on your journey to build secured cloud applications using temporary credentials with AWS STS!&lt;/p&gt;

&lt;p&gt;Thank you and stay safe!&lt;/p&gt;

&lt;p&gt;Thanks to &lt;a href="https://medium.com/u/2053aaf853f5" rel="noopener noreferrer"&gt;Attila Vágó&lt;/a&gt;, &lt;a href="https://medium.com/u/cc5d990750e3" rel="noopener noreferrer"&gt;Darragh Grace&lt;/a&gt;, and &lt;a href="https://medium.com/u/34d2b81e5dc4" rel="noopener noreferrer"&gt;Francislâiny Campos&lt;/a&gt; for their feedback on this post!&lt;/p&gt;




</description>
      <category>springboot</category>
      <category>springcloudvault</category>
      <category>vault</category>
      <category>awsiam</category>
    </item>
    <item>
      <title>Spring Boot Configuration and Secret Management Patterns on Kubernetes</title>
      <dc:creator>Kris Iyer</dc:creator>
      <pubDate>Wed, 24 Feb 2021 09:11:05 +0000</pubDate>
      <link>https://forem.com/krisiye/spring-boot-configuration-and-secret-management-patterns-on-kubernetes-1179</link>
      <guid>https://forem.com/krisiye/spring-boot-configuration-and-secret-management-patterns-on-kubernetes-1179</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0tXd8ynq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ARfWGnaA2olNUeD2YuZqxmQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0tXd8ynq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ARfWGnaA2olNUeD2YuZqxmQ.png" alt=""&gt;&lt;/a&gt;Popular tools and frameworks on k8s&lt;/p&gt;

&lt;p&gt;Spring Boot has been a very popular framework for building microservices in the cloud. Working with Spring Boot on Kubernetes has always been fun, but it also comes with its own set of challenges and presents numerous architectural options, ranging from application security, package management, and containers to service configuration and secrets management. More recently I have had a chance to evaluate and implement many of these patterns. In this post, I share some of my encounters and learnings in the service configuration and secret management space that I hope will help you on your K8s journey!&lt;/p&gt;

&lt;h3&gt;
  
  
  The Norm
&lt;/h3&gt;

&lt;p&gt;A typical &lt;em&gt;Spring Boot&lt;/em&gt; microservice configuration includes a bunch of profiles set up through application or bootstrap YAML files. One could improve on that with a pattern where the profiles and application YAML files are externalized through &lt;em&gt;Spring Cloud Config&lt;/em&gt;. Now that configuration is externalized, how about dynamically reloading it without an application restart? Sure. One could do that with some kind of job or task that knows how to perform an &lt;a href="https://www.baeldung.com/spring-reloading-properties"&gt;&lt;em&gt;/actuator/refresh&lt;/em&gt;&lt;/a&gt; once the updated configs have been deployed. Others may choose a more automated approach with Spring Cloud Bus over Kafka or RabbitMQ to publish state-change events and reload applications appropriately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Meet the k8s counterpart
&lt;/h3&gt;

&lt;p&gt;At the core of the Kubernetes are concepts such as &lt;a href="https://kubernetes.io/docs/concepts/configuration/configmap/"&gt;ConfigMaps&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/configuration/secret/"&gt;Secrets&lt;/a&gt; that provide a clean separation between sensitive and non-sensitive configurations for your services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ConfigMaps and Secrets as Environment variables or files.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The&lt;/strong&gt; &lt;a href="https://12factor.net/"&gt;&lt;strong&gt;twelve-factor&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;app stores config in &lt;em&gt;environment variables&lt;/em&gt;&lt;/strong&gt; (often shortened to &lt;em&gt;env vars&lt;/em&gt; or &lt;em&gt;env&lt;/em&gt;). Env vars are easy to change between deploys without changing any code; unlike config files, there is little chance of them being checked into the code repo accidentally; and unlike custom config files, or other config mechanisms such as Java System Properties, they are a language- and OS-agnostic standard.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On Kubernetes, we have a choice to make configurations sourced from ConfigMaps and Secrets available to the application as environment variables through deployment descriptor fields such as configMapKeyRef or secretKeyRef.&lt;/p&gt;
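&lt;p&gt;For example, a deployment fragment along these lines (the ConfigMap, Secret, key, and variable names here are illustrative, not from a real manifest):&lt;/p&gt;

```yaml
# Container spec fragment: one env var from a ConfigMap, one from a Secret.
env:
  - name: APP_GREETING
    valueFrom:
      configMapKeyRef:
        name: app-config      # illustrative ConfigMap name
        key: greeting
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: app-secrets     # illustrative Secret name
        key: db-password
```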

&lt;p&gt;Another alternative to environment variables is to load &lt;em&gt;Configmaps&lt;/em&gt; or &lt;em&gt;Secrets&lt;/em&gt; as a file through k8s volumes and volumeMounts where applications also get the benefit of listening to file system change events as configurations may be updated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spring Cloud Kubernetes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://spring.io/projects/spring-cloud-kubernetes"&gt;&lt;em&gt;Spring Cloud Kubernetes&lt;/em&gt;&lt;/a&gt; makes it a lot easier for Spring Boot applications on Kubernetes to leverage these features for service configuration as well as secret management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S3ZAXllu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A9PNFQZF18ii8RUoGl440kA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S3ZAXllu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A9PNFQZF18ii8RUoGl440kA.png" alt=""&gt;&lt;/a&gt;In-Built integration pattern with spring cloud kubernetes for ConfigMaps and Secrets&lt;/p&gt;

&lt;p&gt;Along with reading ConfigMaps and Secrets, the application can also watch for changes and reload itself when they occur. This is a pretty cool feature. However, it comes with a level of coupling, as well as required Kubernetes permissions (such as access to the API server), that may not be desired for applications. Starting with the Spring Cloud 2020.0 release, this feature has been deprecated. &lt;a href="https://github.com/spring-cloud/spring-cloud-kubernetes/#53-propertysource-reload"&gt;More details on the deprecation notice can be found here&lt;/a&gt;. What next?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config Watcher&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is now a config-watcher available that can be deployed as a sidecar application; it watches for changes to ConfigMaps or Secrets and is capable of issuing a reload to your application across your &lt;em&gt;replicaset&lt;/em&gt;. There are a couple of ways to achieve this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP — sidecar for config watcher&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q--cuGa6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A922NFz6nNFV4NQ0aaSL4JQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q--cuGa6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A922NFz6nNFV4NQ0aaSL4JQ.png" alt=""&gt;&lt;/a&gt;HTTP based config-watcher sidecar&lt;/p&gt;

&lt;p&gt;This integration is enabled through the DiscoveryClient where applications can enable this feature with an @EnableDiscoveryClient annotation. The config-watcher can discover pods that match the namespace or pod label configured and issue a refresh against the pods upon a configuration change event for a ConfigMap or Secret.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spring Cloud Bus — sidecar for config watcher&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PBmvWzMK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AvxDpcWsmkuF5kEY52WyT7Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PBmvWzMK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AvxDpcWsmkuF5kEY52WyT7Q.png" alt=""&gt;&lt;/a&gt;spring-cloud-bus based config-watcher sidecar&lt;/p&gt;

&lt;p&gt;Just like the HTTP option discussed above, the config-watcher also supports Spring Cloud Bus integration over RabbitMQ and Kafka (which I personally had the privilege to contribute to recently; see &lt;a href="https://github.com/spring-cloud/spring-cloud-kubernetes/issues/654"&gt;Support Kafka for Spring Cloud Kubernetes Configuration Watcher&lt;/a&gt; for more details). Upon detecting a change, the config-watcher publishes a &lt;em&gt;ReloadEvent&lt;/em&gt; that the pods can listen to and perform an application reload. See the &lt;a href="https://docs.spring.io/spring-cloud-kubernetes/docs/current/reference/html/index.html#messaging-implementation"&gt;documentation on cloud bus integration&lt;/a&gt; for configuration details.&lt;/p&gt;

&lt;p&gt;Another useful feature that I was also able to contribute to in this space is the ability to choose between enabling ConfigMaps and/or Secrets (See &lt;a href="https://github.com/spring-cloud/spring-cloud-kubernetes/issues/635"&gt;spring-cloud-kubernetes-config options and autoconfigure not working with reload&lt;/a&gt; for more details).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spring:
  cloud:
    kubernetes:
      config:
        enabled: true
      secrets:
        enabled: false
      discovery:
        enabled: false
      reload:
        enabled: true
        monitoring-config-maps: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows extra flexibility for applications that may need to enable only one of the two, or both if you wish.&lt;/p&gt;

&lt;p&gt;You can already see the benefits of using a &lt;em&gt;sidecar&lt;/em&gt; container such as the config-watcher, whether over HTTP or Spring Cloud Bus: it gets us a level of decoupling that is more secure and spares the application from the finer details of watching for configuration changes. While this is a step forward, it does not completely get rid of the dependency on the K8s API server. See &lt;a href="https://github.com/spring-cloud/spring-cloud-kubernetes/issues/461"&gt;Strongly discourage applications from talking to the Kubernetes API Server&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;In general, applications should not need to know that they are running inside Kubernetes. However, &lt;em&gt;Spring Cloud Kubernetes&lt;/em&gt; allows you to detect this by automatically enabling a &lt;em&gt;Kubernetes&lt;/em&gt; profile (adds onto existing active profiles if you have any already) that one could use to determine the deploy environment. This is useful if we had applications that are deployed on multiple platforms.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Consul&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.consul.io/"&gt;Consul&lt;/a&gt; from Hashicorp is a service mesh solution providing a full-featured control plane with service discovery, configuration, and segmentation functionality.&lt;/p&gt;

&lt;p&gt;With the &lt;a href="https://www.consul.io/docs/dynamic-app-config/kv"&gt;Consul KV&lt;/a&gt; store, we can keep plain key/values or files such as .yaml or .json. Spring Cloud Consul makes it a lot easier for Spring Boot applications to integrate with Consul: .yaml files are automatically loaded during the Spring bootstrap phase and attached to the appropriate profiles, making it a great option for externalizing non-sensitive configuration while still benefiting from Spring profiles. In addition, it supports an out-of-the-box watcher that keeps track of changes and automatically reloads any Spring components annotated with @RefreshScope.&lt;/p&gt;
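&lt;p&gt;As a sketch, the client side of that integration might be configured like this (property names are from Spring Cloud Consul; the host and port values are illustrative):&lt;/p&gt;

```yaml
spring:
  cloud:
    consul:
      host: localhost        # illustrative Consul agent address
      port: 8500
      config:
        enabled: true
        format: yaml         # load Consul KV values as YAML documents
        watch:
          enabled: true      # poll for KV changes and refresh @RefreshScope beans
```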

&lt;p&gt;Consul service mesh enables service-to-service communication with authorization and encryption. Applications can use sidecar proxies in a service mesh configuration to automatically establish TLS connections for inbound and outbound connections without being aware of the network configuration and topology. A great way to secure your services with managed configurations or &lt;a href="https://www.consul.io/docs/connect/intentions"&gt;Intentions&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Vault&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.vaultproject.io/"&gt;Vault&lt;/a&gt; is another product from HashiCorp that brings in capabilities for secrets management with a variety of secret engine backends such as consul (infra), database (infra), aws (cloud), and its own kv (generic) among others.&lt;/p&gt;

&lt;p&gt;Vault KV can be very useful to store simple key/values and supports spring profiles for property sources.&lt;/p&gt;

&lt;p&gt;The Database backend supports generating dynamic credentials for your applications and saves you from generating one and persisting it. This can be injected into your data source and also be eligible for lease rotation.&lt;/p&gt;

&lt;p&gt;The AWS backend can be used to generate dynamic credentials based on IAM user or STS (assumed_role and federation_token) for AWS service integration.&lt;/p&gt;

&lt;p&gt;The Consul backend can be used to secure application access to consul. For instance, applications would need a consul token to integrate with consul kv where the token itself is sensitive information and applications need a mechanism to securely request consul tokens and thus integrate with consul kv.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Spring Cloud Consul and Spring Cloud Vault&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--41EAm1x6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Ay3XZNUXAlnGyUXOJbNoNdA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--41EAm1x6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Ay3XZNUXAlnGyUXOJbNoNdA.png" alt=""&gt;&lt;/a&gt;A sample microservice template on k8s&lt;/p&gt;

&lt;p&gt;Spring Cloud Vault and Spring Cloud Consul can be used together and work pretty well and are great choices for Spring Boot applications integrating with Consul and Vault for service configuration and secrets management. There are some challenges at bootstrap to make sure the consul token is retrieved before asking for non-sensitive configuration from consul kv (Workaround available. See &lt;a href="https://github.com/spring-cloud/spring-cloud-vault/issues/58"&gt;Consul Tokens from Spring Vault do not get picked up by Spring Cloud Config Consul&lt;/a&gt; for more details.).&lt;/p&gt;

&lt;p&gt;Recently I ran into some issues with Spring Cloud Vault not supporting AWS STS for temporary credentials behind the AWS secrets engine. This is being discussed under &lt;a href="https://github.com/spring-cloud/spring-cloud-vault/issues/572"&gt;Support AWS STS for the vault secrets backend for aws&lt;/a&gt;, and I have a &lt;a href="https://github.com/spring-cloud/spring-cloud-vault/pull/575"&gt;pull request&lt;/a&gt; in the works!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Vault Agent Injector Sidecar&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Another pattern that can be used with Vault and pods on k8s is the sidecar injector option. It provides annotation-driven secret injection for your pods, which can be a pretty handy way to decouple applications from secret management. It is also a great choice for applications and tech stacks that do not have a comprehensive SDK/library supporting Vault integration, as it abstracts away the details and allows pods to consume secrets as files through mounts, with support for lease rotation. See &lt;a href="https://www.vaultproject.io/docs/platform/k8s/injector"&gt;Agent Sidecar Injector&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Config file processing on Spring Boot 2.4
&lt;/h3&gt;

&lt;p&gt;Starting with Spring Boot 2.4, there is a significant change in how config files are processed. For example, Vault can now be used in conjunction with spring.config.import, providing a secured replacement for bootstrap.yaml.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spring:
  config:                           
    import: vault://secret/app/pres/dev                              
    activate:                             
      on-profile: "dev"
  datasource:
    username: ${username}
    password: ${password}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read &lt;a href="https://spring.io/blog/2020/08/14/config-file-processing-in-spring-boot-2-4"&gt;config file processing&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;If you got this far, you already know by now that there is a plethora of architectural patterns in this space. :-) There are no rights or wrongs here, and it’s mostly up to you to pick what’s best for you, your team, and your projects based on the capability, flexibility, and, most importantly, the security you are looking for in your configuration and secret management. I hope this post helps you a tiny bit along your k8s journey!&lt;/p&gt;

&lt;p&gt;Good Luck and Stay Safe!&lt;/p&gt;

&lt;p&gt;Thanks to &lt;a href="https://medium.com/u/2053aaf853f5"&gt;Attila Vágó&lt;/a&gt;, &lt;a href="https://medium.com/u/cc5d990750e3"&gt;Darragh Grace&lt;/a&gt;, and &lt;a href="https://medium.com/u/34d2b81e5dc4"&gt;Francislâiny Campos&lt;/a&gt; for their feedback on this post!&lt;/p&gt;




</description>
      <category>vault</category>
      <category>springboot</category>
      <category>kubernetes</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
