<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Zippy Wachira</title>
    <description>The latest articles on Forem by Zippy Wachira (@yaddah).</description>
    <link>https://forem.com/yaddah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2663492%2F14938c31-24e7-496a-84bc-701bbf18aac7.jpg</url>
      <title>Forem: Zippy Wachira</title>
      <link>https://forem.com/yaddah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yaddah"/>
    <language>en</language>
    <item>
      <title>Getting Started with AWS S3 Versioning</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 19:41:29 +0000</pubDate>
      <link>https://forem.com/yaddah/getting-started-with-aws-s3-versioning-5bbp</link>
      <guid>https://forem.com/yaddah/getting-started-with-aws-s3-versioning-5bbp</guid>
      <description>&lt;p&gt;One of the more interesting features of Amazon S3 buckets is bucket versioning. Once enabled for a bucket, this feature allows a user to store multiple versions of the same object within the same bucket. Since the feature enables the bucket to preserve, retrieve, and restore every version of every object, it is much easier recover from both unintended user actions, such as accidental deletions, and application failures.&lt;/p&gt;

&lt;p&gt;Uploading objects to S3 is quite a straightforward process. However, by default, if you upload an object with the same key name as an existing object, the original object is overwritten. Once you enable versioning, new objects are automatically assigned a version ID to distinguish them from the other objects.&lt;/p&gt;
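&lt;p&gt;As an illustration, versioning can be switched on from the AWS CLI (the bucket and object names below are hypothetical):&lt;/p&gt;

```shell
# Enable versioning on an existing bucket (bucket name is illustrative)
aws s3api put-bucket-versioning \
  --bucket my-example-bucket \
  --versioning-configuration Status=Enabled

# With versioning on, each upload of the same key gets its own version ID
aws s3api put-object --bucket my-example-bucket --key report.csv --body report.csv
aws s3api list-object-versions --bucket my-example-bucket --prefix report.csv
```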

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuveujwvhh9u95qyi7fwi.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuveujwvhh9u95qyi7fwi.webp" alt=" " width="272" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the bucket above, notice that both objects have the same key name but different version IDs. If another object is added, it is assigned its own unique Version ID.&lt;/p&gt;

&lt;p&gt;Simple enough, right?&lt;/p&gt;

&lt;p&gt;Now, the next question is: how do you interact with objects stored in a versioned bucket?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Adding an object to a versioned bucket.&lt;br&gt;
Adding an object to a bucket follows the normal process of uploading an object to the bucket. Once you upload the object, it is given a unique version ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrieving an object from a versioned bucket.&lt;br&gt;
A simple GET request will retrieve the most current version of the object.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtng339gnqlgdzvb9joc.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtng339gnqlgdzvb9joc.webp" alt=" " width="495" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To retrieve other versions, specify the version ID you want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F129tkjv1ukvufmqsc3qq.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F129tkjv1ukvufmqsc3qq.webp" alt=" " width="504" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Deleting an object from a versioned bucket.
Unlike objects in buckets that are not enabled for versioning, a simple DELETE request cannot permanently delete an object. When S3 receives the DELETE request, it places a &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/ManagingDelMarkers.html" rel="noopener noreferrer"&gt;delete marker&lt;/a&gt; in the bucket. All versions of the deleted object remain in the bucket, and the delete marker becomes the current version of the object. However, the delete marker does not have any data; any GET request for the object returns a 404 error. If you remove the marker, a GET request will again retrieve the most recent version of the object.&lt;/li&gt;
&lt;/ol&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm36hlt6inq170jh2b8a.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm36hlt6inq170jh2b8a.webp" alt=" " width="378" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above, notice that the DELETE request on the bucket has resulted in the creation of a delete marker in the bucket.&lt;/p&gt;

&lt;p&gt;To permanently delete versioned objects, you have to specify the object version with the DELETE request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0is9n0oqiwhwacsf77o.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0is9n0oqiwhwacsf77o.webp" alt=" " width="373" height="273"&gt;&lt;/a&gt;&lt;br&gt;
To delete a Delete Marker, you must specify its Version ID in the DELETE request.&lt;/p&gt;
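&lt;p&gt;Sketched with the CLI (names and version IDs are illustrative), the three delete scenarios look like this:&lt;/p&gt;

```shell
# 1. A plain DELETE only adds a delete marker; all versions remain
aws s3api delete-object --bucket my-example-bucket --key report.csv

# 2. Specifying a version ID permanently deletes that version
aws s3api delete-object --bucket my-example-bucket --key report.csv \
  --version-id 3sL4kqtJlcpXroDTDmJ.rmSpXd3dIbrHY

# 3. Deleting the delete marker (by its own version ID) restores the object
aws s3api delete-object --bucket my-example-bucket --key report.csv \
  --version-id MARKER-VERSION-ID
```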

&lt;p&gt;Interesting right? I certainly hope you think so too.😊&lt;/p&gt;

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>cloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Configuring Nginx Files and Directories</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 19:35:01 +0000</pubDate>
      <link>https://forem.com/yaddah/configuring-nginx-files-and-directories-261p</link>
      <guid>https://forem.com/yaddah/configuring-nginx-files-and-directories-261p</guid>
      <description>&lt;p&gt;Web Servers: the powerful force behind every search result on a browser. Truth is, most of us do not give a second thought to the mechanism that makes web requests successful as long as we get the desired results. But web servers are quite intriguing, and if you are as curious as I am, you might want to know a few tweaks to help you configure and troubleshoot a simple web server. So let's get right into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Webservers do&lt;/strong&gt;&lt;br&gt;
If you type mango in your browser at this moment, you will get a variety of results, each with a unique URL that will lead you to the main web page for the result. For example, from my end, the first three results for mango are: an online fashion store in Kenya, a Wikipedia link for mango the fruit, and a Twitter link to a page with the handle Mango. Chances are, the results of your search differ from mine. Search engines use information such as your search history, your location, language, and popular searches, among other things, to determine what results to display, hence the difference.&lt;/p&gt;

&lt;p&gt;When you click on one of the results, the web browser first runs a domain name resolution to obtain the IP address of the web server hosting the webpage. The browser then connects to the web server via either port 80 (HTTP) or port 443 (HTTPS) and requests the specified file. The web server uses the same protocol to respond to the browser, which then displays the result. If the page does not exist or an error occurs, the web server returns an error message. Seems simple enough, doesn&#8217;t it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx vs Apache&lt;/strong&gt;&lt;br&gt;
While there are many web servers out there, Nginx and Apache are two of the most commonly used; at least 50% of the world&#8217;s websites run on one of the two. But let's start with a short introduction. Apache came first and was the backbone of the early World Wide Web. It is an open-source, high-performing web server maintained by the Apache Software Foundation. It is a top choice for sysadmins because of its cross-platform support, flexibility, and simplicity. It is also one of the key components of the LAMP stack and is packaged with most Linux distros. &lt;br&gt;
Nginx (pronounced as ‘Engine X’ 🤦‍♀️), on the other hand, was released in 2004 by Igor Sysoev. Since it was developed to specifically address the limitations of the Apache server, it became very popular, even surpassing Apache.&lt;/p&gt;

&lt;p&gt;While there are several differences between the two web servers, the key difference lies in how the servers handle client requests. Apache uses a process-driven architecture, which means that each request is handled by a different process. A parent process receives the connection requests and creates a child process to handle them. When it receives a new request, it spawns a new child process to handle it. This results in heavy usage of server resources such as memory. Nginx, on the other hand, uses an event-driven architecture where a single process is used to handle multiple requests. Like Apache, a master process receives connections. However, each worker process can handle thousands of requests simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx Configuration Files&lt;/strong&gt;&lt;br&gt;
To configure Nginx, there are several key directories and files you need to be familiar with; these are the files you customize to serve your specific website. Depending on how you installed Nginx, the default configuration file is located at &lt;em&gt;/etc/nginx/nginx.conf&lt;/em&gt; (most distributions), at &lt;em&gt;/usr/local/etc/nginx/nginx.conf&lt;/em&gt;, or at &lt;em&gt;/usr/local/nginx/conf/nginx.conf&lt;/em&gt;. To find the path on your local machine, you can use either &lt;em&gt;nginx -t&lt;/em&gt; or &lt;em&gt;whereis nginx&lt;/em&gt;.&lt;/p&gt;
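&lt;p&gt;For example, on a machine with Nginx installed, the two commands behave roughly as follows:&lt;/p&gt;

```shell
# Validate the configuration and print the path of the file being tested
sudo nginx -t
# Typical output:
#   nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
#   nginx: configuration file /etc/nginx/nginx.conf test is successful

# Locate the nginx binary, configuration directory, and man pages
whereis nginx
```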

&lt;p&gt;Nginx configurations have two key concepts: directives, which are configuration options, and blocks (also called contexts), which are the groups in which directives are organized. To better understand this, consider the contents of the /etc/nginx/nginx.conf file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhkywohf4v0auhe8ewic.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhkywohf4v0auhe8ewic.webp" alt=" " width="774" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above snippet, user, worker_processes, pid, and include are directives in the main context. The main context is not contained within a block and holds details that affect all applications. Common directives set here are user and group details, the number of worker processes, and the file in which to save the PID of the main process. The events context sets global options for how Nginx handles connections. Recall that Nginx uses an event-driven model; the directives set here determine how worker processes handle connections.&lt;/p&gt;

&lt;p&gt;The HTTP context includes directives for handling web traffic. The directives set in this context are passed on to all websites that are served by the server. Common directives set in this block are access and error logs, error pages, TCP keep-alive settings, among others. Within the http context, you may notice an include directive. This directive tells Nginx where the configuration files for the website are located:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you installed from the official Nginx repository, the directive will point to &lt;em&gt;/etc/nginx/conf.d/&lt;/em&gt;. Each website you host on Nginx has its own configuration file within this directory, named along the lines of &lt;em&gt;/etc/nginx/conf.d/blue.com.conf&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you installed from the Debian repository, the directive points to &lt;em&gt;/etc/nginx/sites-enabled/&lt;/em&gt;. With this structure, individual configuration files are stored in the &lt;em&gt;/etc/nginx/sites-available&lt;/em&gt; directory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, there is the mail context, which sets directives for using Nginx as a mail proxy server. It provides connectivity to POP3 and IMAP mail servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document Root Directories&lt;/strong&gt;&lt;br&gt;
By default, Nginx serves documents out of the /var/www/html directory. To host multiple sites, you create separate root directories within the &lt;em&gt;/var/www/&lt;/em&gt; directory, e.g., to host two sites, you can create &lt;em&gt;/var/www/site1.com/html&lt;/em&gt; and &lt;em&gt;/var/www/site2.com/html&lt;/em&gt;. The index files for the sites are then placed within these directories, e.g., &lt;em&gt;/var/www/site1.com/html/index.html&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Server Blocks&lt;/strong&gt;&lt;br&gt;
Server blocks are the Nginx feature that allows you to host multiple websites on a single server. Each server block holds information about a website, such as the location of its document root, security policies, and the SSL certificates used. By default, Nginx has one server block called default: &lt;em&gt;/etc/nginx/sites-available/default&lt;/em&gt;. You create a server block for your website by adding a file named after your website to the directory, e.g., /etc/nginx/sites-available/site1.com. The structure of a server block is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xahc8akaxtdwodrp23w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xahc8akaxtdwodrp23w.webp" alt=" " width="432" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The listen directive tells Nginx the IP address and TCP port where requests are received. The server name identifies the domain being served, e.g., site1.com. When it receives a request, Nginx first matches it to the IP and port listed in the listen directive. If several server blocks share the same listen directive, it checks the Host header of the request and matches it to a server_name directive. If multiple blocks share the same IP, port, and server_name, it chooses the first one defined. Finally, if no server_name directive matches the Host header, it checks for a default_server parameter.&lt;/p&gt;

&lt;p&gt;Root specifies the path of the document root, and index specifies the name of the index file for the site. Location directives let you specify how Nginx responds to requests for resources within the server; the locations shown here are prefix strings matched against the request URI. Consider the following location configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom89eo0mo0k3mnpcqjtk.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom89eo0mo0k3mnpcqjtk.webp" alt=" " width="368" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, a request to &lt;em&gt;&lt;a href="http://site1.com/planet/blog" rel="noopener noreferrer"&gt;http://site1.com/planet/blog&lt;/a&gt;&lt;/em&gt; or to &lt;em&gt;&lt;a href="http://site1.com/planet/blog/events" rel="noopener noreferrer"&gt;http://site1.com/planet/blog/events&lt;/a&gt;&lt;/em&gt; will be served by location &lt;em&gt;/planet/blog/&lt;/em&gt; rather than location /planet. The try_files directive specifies the files and directories Nginx should check when a request for the specified location is received. The default location / above matches all requests.&lt;/p&gt;

&lt;p&gt;To enable a server block, the final step is to create a symbolic link to it in the /etc/nginx/sites-enabled directory. By default, Nginx checks the sites-enabled directory during startup. Creating symlinks to the configuration files in the sites-available directory allows you to manage your vhosts more easily: to disable a block, all you have to do is delete the symlink. You can optionally use the conf.d directory to manage your server blocks, but disabling a site then means deleting its configuration file outright. To manage multiple virtual hosts (websites), the sites-enabled approach is recommended; conf.d is better suited to configuration that is not tied to a single virtual host.&lt;/p&gt;
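&lt;p&gt;Putting the last step together (the site name is illustrative):&lt;/p&gt;

```shell
# Activate the site by linking its config into sites-enabled
sudo ln -s /etc/nginx/sites-available/site1.com /etc/nginx/sites-enabled/

# Validate the configuration, then reload without dropping connections
sudo nginx -t
sudo systemctl reload nginx

# To disable the site later, remove only the symlink
sudo rm /etc/nginx/sites-enabled/site1.com
```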

&lt;p&gt;There is definitely a lot more to configuring web servers than this! But I certainly hope this has provided you with a place to start 😊.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>devops</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Breaking Down AWS IAM</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 18:46:38 +0000</pubDate>
      <link>https://forem.com/yaddah/breaking-down-aws-iam-5hfn</link>
      <guid>https://forem.com/yaddah/breaking-down-aws-iam-5hfn</guid>
      <description>&lt;p&gt;AWS has a large variety of security offerings. Among these, however, none is as extensive as IAM. Besides integrating with all AWS services, IAM also enables fine-grained access control, which means that permissions can be managed up to an individual user’s or individual resource’s level. This is also accomplished by one of IAM’s best practices, which requires the assignment of permissions according to the &lt;a href="https://aws.amazon.com/blogs/security/techniques-for-writing-least-privilege-iam-policies/#:~:text=Least%20privilege%20is%20a%20principle,build%20securely%20in%20the%20cloud." rel="noopener noreferrer"&gt;principle of least privilege&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's start with the basic components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM identities&lt;/strong&gt;&lt;br&gt;
There are three types of IAM identities: users, groups, and roles.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;IAM users often represent people interacting with AWS, but can also represent a service. IAM users have long-term credentials, which come in the form of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Username and password for use with the management console&lt;/li&gt;
&lt;li&gt;Access keys for use with the AWS CLI and SDKs&lt;/li&gt;
&lt;li&gt;SSH keys for use with AWS CodeCommit&lt;/li&gt;
&lt;li&gt;Server certificates used to authenticate to some AWS services, such as websites&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IAM groups are collections of users who share the same permissions. They make it easier to assign and manage permissions for a large number of users. All users in a group automatically inherit the permissions attached to the group.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IAM roles are similar to IAM users, but with temporary credentials issued via AWS STS. A role can be assumed by users who are permitted to, letting them temporarily take on different permissions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;IAM Policies&lt;/strong&gt;&lt;br&gt;
IAM policies are IAM entities that are attached to IAM identities and define the kind of permissions that the identity has. When an identity makes a request, AWS evaluates the policies attached to the identity to determine if the actions the principal is requesting are allowed. The policies attached to a principal apply across all access methods: Console, CLI, and SDKs.&lt;/p&gt;
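&lt;p&gt;As an illustration, a minimal identity-based policy granting read-only access to a hypothetical bucket looks like this:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    }
  ]
}
```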

&lt;p&gt;Now, let us dig a bit deeper into IAM users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Federated User Access
&lt;/h2&gt;

&lt;p&gt;Federated users are users who are managed in an external directory and require access to AWS resources. Federation eliminates the need to recreate the users in your AWS account: you continue using your existing user directory and only assign users temporary permissions to accomplish the tasks they need on the AWS cloud. There are two approaches to federation: using AWS Single Sign-On (SSO) and using AWS IAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Using AWS SSO to Manage Federation&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/blogs/security/how-to-create-and-manage-users-within-aws-sso/" rel="noopener noreferrer"&gt;AWS Identity Center&lt;/a&gt; is a service that allows you to assign and manage access and user permissions across all your accounts in AWS Organizations. SSO also supports identity federation using Security Assertion Markup Language (SAML). SAML is an industry standard that enables the secure exchange of credentials between an identity provider (IdP) and a SAML consumer (service provider, SP). SSO works with an IdP of your choice, e.g., Azure Active Directory, and leverages IAM permissions and policies to manage federated access. With SSO, you assign permissions based on the group memberships in the IdP’s directory and control their access by modifying users and groups in the IdP. You can also use AWS SSO as an IdP to authenticate users to SSO-integrated applications, such as Salesforce, and also to authenticate users to the Management Console and CLI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Using AWS IAM to Manage Federation&lt;/strong&gt;&lt;br&gt;
You can use IAM Identity Providers to manage user identities outside your organization. With the IAM IdP, there is no need to create custom sign-in codes or manage user identities since the IdP does that for you. With IAM IdP, there are two types of federation supported:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Web Identity Federation&lt;/strong&gt;&lt;br&gt;
If you are writing an application to be used by a large number of users, e.g., a game that runs on mobile devices but stores data on Amazon S3, a web identity federation would be a good option. Web Identity Federation allows you to use IdPs such as Facebook, Google, or any other OpenID Connect (OIDC)-compatible IdP. Users receive an authentication token, which is then exchanged for temporary security credentials that map to an existing IAM role in AWS with the required permissions.&lt;/p&gt;

&lt;p&gt;Note: Rather than directly using Web Identity Federation, it is recommended to use Amazon Cognito for mobile apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) SAML 2.0-based Federation&lt;/strong&gt;&lt;br&gt;
IAM supports identity federation using SAML to enable single sign-on for users to log into the management console or call API operations. This type of federation has two main use cases. The first is to allow users within your organization to call AWS API operations, e.g., enabling users within your corporate IdP to back up data to an S3 bucket. The second is to allow users registered in a SAML 2.0-compatible IdP to sign in to the management console.&lt;/p&gt;

&lt;p&gt;And now let’s see what roles can do:&lt;/p&gt;

&lt;h2&gt;
  
  
  IAM Roles
&lt;/h2&gt;

&lt;p&gt;IAM roles can be used for a variety of cases, including the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grant IAM users in the same account as the role access to resources within the account&lt;/li&gt;
&lt;li&gt;Grant users access to resources in a different account&lt;/li&gt;
&lt;li&gt;Grant access to AWS resources to identities outside AWS&lt;/li&gt;
&lt;li&gt;Grant access to third parties, e.g., auditors&lt;/li&gt;
&lt;li&gt;Allow an AWS service to access other services on your behalf&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A role that a service assumes to perform actions within your account on your behalf is called a service role. If a role serves a specialized purpose for a service, it is referred to as a service-linked role. Users can assume a role from either the console or from the CLI/API by using the AssumeRole API.&lt;/p&gt;
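&lt;p&gt;From the CLI, assuming a role looks roughly like this (account ID, role, and session name are illustrative):&lt;/p&gt;

```shell
# Request temporary credentials for the role via AWS STS
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/ExampleRole \
  --role-session-name demo-session
# The response contains a temporary AccessKeyId, SecretAccessKey, and SessionToken
```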

&lt;h2&gt;
  
  
  Policies and Permissions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Identity-based policies are attached to IAM users, groups, and roles to define what these identities can do, on which resources, and under which circumstances. Identity-based policies are of two types:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;a) Managed policies are standalone policies. You can use AWS-managed policies or create your own customer-managed policies, which you maintain yourself. AWS-managed policies provide permissions for common tasks, such as granting administrative permissions, and are useful when starting out, before you are ready to write your own policies.&lt;/p&gt;

&lt;p&gt;b) Inline policies are embedded in an identity and provide a strict one-to-one relationship between the identity and the policy. They are applicable in scenarios where you want a policy to be attached to one specific identity and no other.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Resource-based policies are a type of inline policy that is attached to a resource. A common use case of these policies is in enabling cross-account access to a principal. IAM supports one type of resource-based policy called the trust policy, which is usually attached to an IAM role.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Permission boundaries are used to set the maximum permissions that an identity-based policy can grant to an entity. The effective permissions of the principal are the intersection of all the policies that affect the principal, such as identity-based policies, resource-based policies, session policies, and SCPs. Working with permission boundaries can be tricky if you don't understand how they interact with other types of policies. To see how AWS calculates effective permissions, see &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_boundaries.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service control policies (SCPs) are used to control permissions for an AWS organization or organizational unit (OU). They determine the maximum permissions for the accounts in the organization. A unique feature of SCPs is that they do not grant permissions; rather, they limit the permissions that resource-based and identity-based policies can grant to identities in the account. The effective permission for an identity is the intersection of what is allowed by the SCP and what is allowed by the identity-based and resource-based policies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access control lists (ACLs) control which principals in another account can access the resource the ACL is attached to. They cannot be used for principals in the same account as the resource.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Session policies are inline permissions policies that users pass to the session when they assume a role, or as a federated user, when using the CLI or API. Session policies can be passed using the AssumeRole, AssumeRoleWithSAML, and AssumeRoleWithWebIdentity API operations. Like SCPs, session policies do not grant permissions; they only limit the permissions for a session. The resulting session permissions are the intersection of the session policies and the resource-based and identity-based policies.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
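&lt;p&gt;To make the trust policy mentioned above concrete, here is a sketch of one that allows the EC2 service to assume a role:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```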

&lt;p&gt;As much as this seems like a lot, it is but the tip of the iceberg. Still, it's a good place to start; don&#8217;t you think?☺&lt;/p&gt;

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>security</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A Defense in Depth Approach to Cloud Security</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 18:19:31 +0000</pubDate>
      <link>https://forem.com/yaddah/a-defense-in-depth-approach-to-cloud-security-4078</link>
      <guid>https://forem.com/yaddah/a-defense-in-depth-approach-to-cloud-security-4078</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In an era marked by pervasive digital connectivity and evolving cyber threats, ensuring the security of sensitive information and critical infrastructure has become paramount. Traditional security approaches centered around perimeter defenses alone are no longer sufficient to withstand sophisticated attacks and safeguard against data breaches. Instead, organizations must adopt a multi-layered security strategy known as defense in depth.&lt;/p&gt;

&lt;p&gt;Defense in Depth (DiD) is a proactive and comprehensive security framework that employs multiple layers of defense mechanisms to protect against a wide range of threats. By diversifying security controls across networks, systems, applications, and data, defense in depth aims to create overlapping layers of protection that collectively strengthen the security posture of an organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principles of DiD
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;“It is not just multiple layers of controls to collectively mitigate one or more risks, but rather multiple layers of interlocking or inter-linked controls.” — Phil Venables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Controls at different points should be complementary, i.e., every preventative control should have a detective control at the same level and/or one level downstream in the architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Controls need to be continuously assessed to validate their correct deployment and operation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  DiD in Cloud
&lt;/h2&gt;

&lt;p&gt;DiD layers security measures across seven key domains, as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv67wit9tpualje5lfqu5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv67wit9tpualje5lfqu5.webp" alt=" " width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Physical Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What: Physical security measures applied at AWS data centers, e.g., biometrics, surveillance, etc.&lt;/p&gt;

&lt;p&gt;How: AWS Responsibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Perimeter Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What: Perimeter security is your first line of defense as a customer. It allows you to define who has access to your environment, how they access the environment, and what level of access to assign them. For example, an administrator will need full access to the environment, a project/team lead may need full access to the resources that pertain to their project as well as the ability to assign and revoke access for their team. On the other hand, a contractor may need temporary read/write access to only specific services, while an auditor would only require temporary read-only access for the period of the assessment.&lt;/p&gt;

&lt;p&gt;How:&lt;br&gt;
&lt;strong&gt;1. Preventative Controls&lt;/strong&gt;&lt;br&gt;
1.1. Who has Access?&lt;/p&gt;

&lt;p&gt;1.1.1. Internal staff, e.g., engineers, system administrators, and security team who build, manage, and govern the environment.&lt;/p&gt;

&lt;p&gt;1.1.2. Business stakeholders: review performance metrics, monitor resource usage, and make data-driven decisions related to business operations and strategy.&lt;/p&gt;

&lt;p&gt;1.1.3. Clients: consume services/applications deployed in the environment.&lt;/p&gt;

&lt;p&gt;1.1.4. Third-party contractors/vendors/partners: temporary access for project-related tasks.&lt;/p&gt;

&lt;p&gt;1.1.5. Legal consultants/advisors/auditors&lt;/p&gt;

&lt;p&gt;1.2. How do they access the environment?&lt;/p&gt;

&lt;p&gt;1.2.1. IAM User Accounts: long-term access for internal users/staff, e.g., engineers who are responsible for managing, configuring, deploying, and monitoring AWS infrastructure, applications, and services.&lt;/p&gt;

&lt;p&gt;1.2.2. Federation: for users with identities managed in an external IdP, e.g., AD, Facebook, Amazon, or Google.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using IAM Identity Center/SSO: consistent, synchronized access to multiple AWS accounts and applications.&lt;/li&gt;
&lt;li&gt;Cognito Identity Pools: identity federation for authenticated and unauthenticated users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.2.3. IAM Roles: temporary, short-lived access credentials that can be assumed by an authorized identity.&lt;/p&gt;

&lt;p&gt;1.2.4. Restricted Access Channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPNs: secure, encrypted communications channels between on-premises networks and AWS&lt;/li&gt;
&lt;li&gt;PrivateLink: private connectivity between VPCs, supported AWS services, and your on-premises networks without exposing your traffic to the public internet.&lt;/li&gt;
&lt;li&gt;Dedicated Audit Accounts: give security and compliance teams read and write access to all accounts for audits and security remediations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.3. What is their level of access?&lt;/p&gt;

&lt;p&gt;1.3.1. RBAC and Least Privilege: restrict access based on the identity’s roles/responsibilities within an organization.&lt;/p&gt;

&lt;p&gt;1.3.2. Policy-Based Access Control: assign access controls to resources based on IAM policies for users, groups, and roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Detective Controls&lt;/strong&gt;&lt;br&gt;
2.1. AWS CloudTrail: capture API activity and logs pertaining to access activity.&lt;/p&gt;

&lt;p&gt;2.2. MFA: restrict access to services to only users with MFA enabled.&lt;/p&gt;

&lt;p&gt;2.3. AWS Config: enforce compliance with IAM best practices, e.g., ensuring MFA is enabled or restricting the use of insecure IAM policies.&lt;/p&gt;
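&lt;p&gt;The MFA and policy-based controls described above are typically enforced with an IAM policy document. As a minimal sketch (the Sid and the NotAction exemptions are illustrative, not a prescribed AWS policy), a deny-unless-MFA statement built in Python might look like:&lt;/p&gt;

```python
import json

# Illustrative deny-unless-MFA policy: denies everything except basic
# session-setup actions when no MFA device was used to authenticate.
# The aws:MultiFactorAuthPresent condition key is a real IAM key; the
# Sid and exempted actions below are example choices only.
mfa_guard_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAllExceptSessionSetupWithoutMFA",
            "Effect": "Deny",
            "NotAction": [
                "iam:ChangePassword",
                "iam:GetUser",
                "sts:GetSessionToken",
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
            },
        }
    ],
}

print(json.dumps(mfa_guard_policy, indent=2))
```

&lt;p&gt;Attached to a user or group, a statement like this makes MFA a preventative control rather than a purely detective one.&lt;/p&gt;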

&lt;h2&gt;
  
  
  &lt;strong&gt;Network Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What: The network layer focuses on safeguarding the communication and data exchange between devices, systems, and services within the organization’s network, as well as controlling the flow of traffic entering and leaving the network.&lt;/p&gt;

&lt;p&gt;How:&lt;br&gt;
&lt;strong&gt;1. Preventative Controls&lt;/strong&gt;&lt;br&gt;
1.1. Network Access Control: who has access to network resources and how they access these resources.&lt;/p&gt;

&lt;p&gt;1.1.1. IAM: access management&lt;/p&gt;

&lt;p&gt;1.1.2. VPNs/Direct Connect: private, encrypted connectivity between on-premises environments and VPC resources.&lt;/p&gt;

&lt;p&gt;1.1.3. PrivateLink: private connectivity between VPCs and AWS services or endpoints without traversing the public internet.&lt;/p&gt;

&lt;p&gt;1.2. Network Segmentation and Isolation: partition the network into distinct zones based on security requirements, workloads, trust levels, and data sensitivity.&lt;/p&gt;

&lt;p&gt;1.2.1. VPCs, Subnets &amp;amp; AZs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each VPC is a logically isolated container for network resources.&lt;/li&gt;
&lt;li&gt;Subnets provide segmentation at the network level and allow you to isolate resources based on their function, security requirements, or access control policies.&lt;/li&gt;
&lt;li&gt;AZs provide redundant power, networking, and connectivity in an AWS Region and allow high availability, fault tolerance, and scalability for applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.2.2. NACLs, Security Groups, Route Tables: control how traffic is routed between various network segments.&lt;/p&gt;

&lt;p&gt;1.2.3. VPC Peering and Transit Gateway:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peering allows you to route traffic between VPCs privately using private IP addresses.&lt;/li&gt;
&lt;li&gt;TGW enables central management and connectivity scaling across multiple VPCs, accounts, and networks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.3. Traffic Filtering&lt;/p&gt;

&lt;p&gt;1.3.1. Network Firewall: a stateful, managed network firewall and intrusion detection and prevention service for your VPC.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass traffic only from known AWS service domains or IP address endpoints.&lt;/li&gt;
&lt;li&gt;Perform deep packet inspection on traffic entering or leaving your VPC.&lt;/li&gt;
&lt;li&gt;Use stateful protocol detection to filter protocols like HTTPS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.3.2. Web Application Firewall: a firewall that protects web applications hosted on AWS against common web-based attacks.&lt;/p&gt;

&lt;p&gt;1.3.3. Virtual Security Appliances, e.g., firewalls and IDS/IPS/DPI systems from vendors such as Cisco, Palo Alto, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Detective Controls&lt;/strong&gt;&lt;br&gt;
2.1. VPC Flow Logs: monitor the IP traffic going to and from a VPC, subnet, or network interface.&lt;/p&gt;

&lt;p&gt;2.2. Network Access Analyzer:&lt;br&gt;
Improve your network security posture by identifying unintended network access.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify that your production environment VPCs and development environment VPCs are isolated from one another.&lt;/li&gt;
&lt;li&gt;Verify that network paths are secured, e.g., that controls such as network firewalls and NAT gateways have been set up where necessary.&lt;/li&gt;
&lt;li&gt;Verify that your resources have network access only from a trusted IP address range, over specific ports, and protocols.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Host and Application Layers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What: The host layer focuses on security measures implemented on individual compute resources, e.g., EC2, ECS, EKS, RDS.&lt;/p&gt;

&lt;p&gt;How:&lt;br&gt;
&lt;strong&gt;1. Preventative Controls&lt;/strong&gt;&lt;br&gt;
1.1. Vulnerability Management:&lt;br&gt;
1.1.1. Regularly scan and patch compute resources, e.g., EC2, ECS, EKS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inspector: Automatically discovers workloads, such as Amazon EC2 instances, containers, and Lambda functions, and scans them for software vulnerabilities and unintended network exposure.&lt;/li&gt;
&lt;li&gt;Systems Manager: patch management for your compute resources&lt;/li&gt;
&lt;li&gt;Security Hub: collects security data across AWS accounts, AWS services, and supported third-party products and helps you analyze your security trends and identify the highest priority security issues.&lt;/li&gt;
&lt;li&gt;CodeGuru: scans code libraries and dependencies for issues and defects that are difficult for developers to find and offers suggestions for improving your Java and Python code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.1.2. Configure maintenance windows for AWS-managed resources, e.g., RDS.&lt;/p&gt;

&lt;p&gt;1.2. Reduced Attack Surface:&lt;/p&gt;

&lt;p&gt;1.2.1. Hardened Operating Systems, e.g., using CIS images for workload instances.&lt;/p&gt;

&lt;p&gt;1.2.2. EC2 Image Builder: ease creation of custom patched AMIs. When software updates become available, Image Builder automatically produces a new image without requiring users to manually initiate image builds.&lt;/p&gt;

&lt;p&gt;1.2.3. ECR Image scanning for identifying software vulnerabilities in your container images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Detective Controls&lt;/strong&gt;&lt;br&gt;
2.1. Config: monitor changes to application configurations, code deployments, etc., to detect unauthorized modifications or unusual application behavior.&lt;/p&gt;

&lt;p&gt;2.2. CloudWatch Logs: monitor system-level logs generated by EC2 instances, including authentication, application, and system logs, to detect security incidents, anomalous behavior, and operational issues.&lt;/p&gt;

&lt;p&gt;2.3. CloudTrail: detailed records of actions taken by users, roles, and services, including caller identity, the time of request, and the actions performed. You can track user activity, identify unauthorized access attempts, and investigate security incidents.&lt;/p&gt;

&lt;p&gt;2.4. GuardDuty: threat detection service that continuously monitors for malicious activity and unauthorized behavior across your environment. GuardDuty generates findings and alerts for suspicious activity, enabling you to investigate and remediate security incidents promptly.&lt;/p&gt;

&lt;p&gt;2.5. Inspector: analyzes the network, operating system, and application configurations to identify potential security issues.&lt;/p&gt;

&lt;p&gt;2.6. Third-party Security Solutions: third-party security solutions available on Marketplace that offer advanced threat detection, vulnerability management, and security analytics capabilities for environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Layer
&lt;/h2&gt;

&lt;p&gt;What: The data layer encompasses all aspects of data security, including data storage, transmission, access, and usage. It focuses on safeguarding sensitive information, such as customer data, intellectual property, financial records, and other confidential or regulated data, from unauthorized access, disclosure, alteration, or loss.&lt;/p&gt;

&lt;p&gt;How:&lt;br&gt;
&lt;strong&gt;1. Preventative Controls&lt;/strong&gt;&lt;br&gt;
1.1. Data Confidentiality: protecting data against unintentional, unlawful, or unauthorized access, disclosure, or theft.&lt;/p&gt;

&lt;p&gt;1.1.1. Data access: Define authorized principals in access policies, follow least privilege principles.&lt;/p&gt;

&lt;p&gt;1.1.2. Encryption&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encryption at rest using KMS/SSE&lt;/li&gt;
&lt;li&gt;Encryption in transit using SSL/TLS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.2. Data Integrity: ensuring the accuracy, completeness, consistency, and validity of data.&lt;br&gt;
1.2.1. Regular Backups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Backup: a fully managed backup service that makes it easy to centralize and automate the backing up of data across AWS services.&lt;/li&gt;
&lt;li&gt;S3 Versioning: preserve historical versions of objects, enabling you to recover from accidental deletions, modifications, or data corruption.&lt;/li&gt;
&lt;li&gt;Cross-region replication: ensuring data availability in multiple geographic regions and protecting against regional outages.&lt;/li&gt;
&lt;li&gt;Automated/Manual Snapshots: available for EBS, RDS, to allow for PIT recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.2.2. Immutable Storage: services such as S3 Object Lock prevent objects from being deleted or modified for a specified retention period, protecting data integrity from accidental or malicious changes.&lt;/p&gt;

&lt;p&gt;1.2.3. Data Validation and verification: checksums, digital signatures, and cryptographic hashes to verify the integrity of data during transmission and storage.&lt;/p&gt;
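&lt;p&gt;The checksum idea above is easy to demonstrate. As a minimal sketch (the data and function name are illustrative), integrity verification means recomputing a hash on retrieval and comparing it to the value recorded at write time:&lt;/p&gt;

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    # A SHA-256 digest acts as a fingerprint of the content.
    return hashlib.sha256(data).hexdigest()

original = b"quarterly-financials-v3"
stored_checksum = sha256_hex(original)  # recorded when the object is written

# On retrieval, recompute and compare: any single-bit change breaks the match.
assert sha256_hex(b"quarterly-financials-v3") == stored_checksum
assert sha256_hex(b"quarterly-financials-v4") != stored_checksum
```

&lt;p&gt;S3 applies the same principle natively by letting you attach content checksums to objects and validating them on upload and download.&lt;/p&gt;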

&lt;p&gt;1.3. Data Availability: a measure of how often your data is available for use.&lt;/p&gt;

&lt;p&gt;1.3.1. Highly available and fault-tolerant architectures.&lt;/p&gt;

&lt;p&gt;1.3.2. Backups and DR&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Detective Controls&lt;/strong&gt;&lt;br&gt;
2.1. CloudWatch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alarms and health checks to monitor the health and availability of AWS resources hosting critical data.&lt;/li&gt;
&lt;li&gt;Automated alerts for performance degradation, service disruptions, or availability issues affecting data access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2.2. CloudTrail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trails to deliver log files to Amazon S3 and set up S3 event notifications or CloudWatch Events to trigger alerts for specific API activity or security events.&lt;/li&gt;
&lt;li&gt;Visibility into user activity and API calls, allowing you to detect and investigate unauthorized access attempts or security incidents affecting data confidentiality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2.3. GuardDuty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously monitor malicious activity and unauthorized behavior within your environment.&lt;/li&gt;
&lt;li&gt;Detects anomalies and security threats targeting data confidentiality, e.g., unauthorized access attempts, data exfiltration, or communication with known malicious IP addresses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2.4. Config:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Config rules to assess compliance with security best practices and detect misconfigurations affecting data integrity.&lt;/li&gt;
&lt;li&gt;Config rules to detect deviations from security best practices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Policies and Procedures&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Security objectives and standards: define the organization’s security objectives and best practices for cloud environments, i.e., data protection, access control, network security, incident response, and compliance requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Risk Management: define roles and responsibilities for identifying, assessing, and mitigating security risks associated with cloud deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access Control: define user roles, permissions, and authentication mechanisms, such as multi-factor authentication (MFA) and identity federation, to prevent unauthorized access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Incident Response and DR: define escalation paths, communication protocols, and remediation steps for containing and mitigating security incidents, restoring services, and minimizing the impact on business operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User Training: educate users about cloud security risks, best practices, and compliance requirements to help raise awareness of security policies, reinforce security behaviors, and empower personnel to recognize and report security incidents effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>architecture</category>
      <category>cloud</category>
      <category>cybersecurity</category>
      <category>security</category>
    </item>
    <item>
      <title>AWS CSI - Investigating Cloud Conundrums (CloudTrail)</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 17:53:12 +0000</pubDate>
      <link>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudtrail-ed0</link>
      <guid>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudtrail-ed0</guid>
      <description>&lt;p&gt;Imagine this: you come home after a long day and find that your house is a complete mess. You have absolutely no clue what happened. Who created the mess in your house? How and when did they get access to your house? What did the intruder take? What did they displace/destroy in the house?&lt;/p&gt;

&lt;p&gt;If you’re lucky, you may have a security system in place, e.g., cameras that recorded the incident and that you can use to trace back and identify how and when the destruction happened. But if not, then it becomes a complete nightmare to try to figure out what happened. And this is exactly what it feels like to try to troubleshoot an event in AWS without CloudTrail.&lt;/p&gt;

&lt;p&gt;Think of CloudTrail as that indoor security camera that captures who came into your house, what they touched, changed, added, or even removed from the environment, and the exact day/time that all these activities occurred. So, just like reviewing the security footage will help you understand the break-in event, reviewing CloudTrail logs will provide the vital evidence you need to investigate and resolve any issues within your AWS environment.&lt;/p&gt;

&lt;p&gt;The good news is that a lot of people know exactly &lt;strong&gt;WHAT&lt;/strong&gt; CloudTrail is and what it is meant to be used for. The bad news is that not enough people know &lt;strong&gt;HOW&lt;/strong&gt; to use CloudTrail to derive useful insights from the captured logs. So, let’s dive into how exactly you use AWS CloudTrail to Investigate Cloud Conundrums.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Basics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CloudTrail is enabled by default for your account, which means you automatically have access to CloudTrail Event History.&lt;/li&gt;
&lt;li&gt;CloudTrail Event History provides an immutable record of events from the past 90 days. These are events captured from the Console, CLI, SDKs, and APIs.&lt;/li&gt;
&lt;li&gt;You are not charged for viewing CloudTrail Event History.&lt;/li&gt;
&lt;li&gt;CloudTrail events are regional.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Layout
&lt;/h2&gt;

&lt;p&gt;Let’s start with the standard layout of the CloudTrail Console:&lt;br&gt;
Note: For this piece, we are focusing on the Event History Section of CloudTrail on the Console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvd8e1ocu3b60yid2i7kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvd8e1ocu3b60yid2i7kp.png" alt=" " width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Display Customization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The settings icon at the far right [3] allows you to customize the fields that are displayed. Options are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sc7hmch8njrod0dd65m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sc7hmch8njrod0dd65m.jpg" alt=" " width="698" height="685"&gt;&lt;/a&gt;&lt;br&gt;
You can read about what each of the fields represents &lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events-console.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Filtering&lt;/strong&gt;&lt;br&gt;
CloudTrail tracks every API call within your AWS account, resulting in a historical record that can grow significantly, even for small accounts with limited users. Of course, the more activity in the account, the larger the volume of events recorded. This can make investigating a singular event a nightmare, as you’d need to comb through hundreds of records to get to the record of interest.&lt;/p&gt;

&lt;p&gt;This is where fields [1] and [2] come in. [1] provides a range of parameters you can use to filter through your events, as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mdorjbfgi5ar8nzngv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mdorjbfgi5ar8nzngv3.png" alt=" " width="365" height="321"&gt;&lt;/a&gt;&lt;br&gt;
For instance, let’s say you want to investigate who created a new user account and when the user was created, you can apply a filter as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33hwqnq2evo59ak4k43b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33hwqnq2evo59ak4k43b.png" alt=" " width="604" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the case above, we apply a filter based on the Event name and search specifically for &lt;em&gt;&lt;strong&gt;CreateUser&lt;/strong&gt;&lt;/em&gt; events.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;User name&lt;/em&gt; will typically provide the name of the IAM principal that performed the &lt;em&gt;CreateUser&lt;/em&gt; action, i.e., an IAM user or a service. The &lt;em&gt;Resource Name&lt;/em&gt;, on the other hand, is the actual AWS resource that the action was performed on, i.e., for the case above, this would be the name of the user that was created.&lt;/p&gt;

&lt;p&gt;Another good example would be if you want to see what activities a specific user or service performed in the account in a given period, you can apply a filter such as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aotb51svbsanx81egzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aotb51svbsanx81egzr.png" alt=" " width="604" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: The date/time filter supports both a relative and an absolute range.&lt;/p&gt;
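&lt;p&gt;The same name-plus-time filtering shown above can be applied programmatically once you have events exported (e.g., via the CLI or a trail). A minimal sketch, using fabricated sample events with the console’s field names:&lt;/p&gt;

```python
from datetime import datetime, timezone

# Fabricated sample events for illustration; exported CloudTrail events
# carry an event name and timestamp like these.
events = [
    {"EventName": "CreateUser", "EventTime": datetime(2026, 2, 9, 10, 5, tzinfo=timezone.utc)},
    {"EventName": "DeleteUser", "EventTime": datetime(2026, 2, 9, 11, 0, tzinfo=timezone.utc)},
    {"EventName": "CreateUser", "EventTime": datetime(2026, 2, 10, 9, 30, tzinfo=timezone.utc)},
]

def filter_events(events, name, start, end):
    # Keep events matching the name whose timestamp falls within [start, end].
    return [
        e for e in events
        if e["EventName"] == name and e["EventTime"] >= start and end >= e["EventTime"]
    ]

window_start = datetime(2026, 2, 9, 0, 0, tzinfo=timezone.utc)
window_end = datetime(2026, 2, 9, 23, 59, tzinfo=timezone.utc)
print(filter_events(events, "CreateUser", window_start, window_end))
```

&lt;p&gt;This mirrors applying an Event name filter plus an absolute date/time range in the console.&lt;/p&gt;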

&lt;h2&gt;
  
  
  Event Types
&lt;/h2&gt;

&lt;p&gt;In my investigations, I found that I lean more towards filtering by Event name as I often find myself looking for a specific event from a specific service. The challenge I found with this, though, was in identifying what events are recorded for each service and what the event names are.&lt;/p&gt;

&lt;p&gt;For starters, it is important to note that CloudTrail Event History only supports &lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-concepts.html" rel="noopener noreferrer"&gt;management events&lt;/a&gt;. Note that to capture data events, you must create a trail and explicitly add each resource type for which you want to collect data plane activity. Second, AWS provides a comprehensive list of API Actions for each service through the &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/APIReference/OperationList-query-ec2.html" rel="noopener noreferrer"&gt;API Reference documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The API Reference documentation for a service provides descriptions, API request parameters, and the XML response for the service’s API actions. For example, from the EC2 API Reference documentation, we see the following example EC2 API Actions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuy7dpy9j5v8byb9wcv0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuy7dpy9j5v8byb9wcv0u.png" alt=" " width="604" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So now, from the CloudTrail Event History console, I can easily investigate an event such as when an Elastic IP Address was created using the below filter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4dtvphakxuu3k0wt9pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4dtvphakxuu3k0wt9pk.png" alt=" " width="604" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: to view the API Reference for an AWS service, simply search for the service name followed by “API Reference” in your favorite browser😊&lt;/p&gt;

&lt;h2&gt;
  
  
  Event Sources
&lt;/h2&gt;

&lt;p&gt;This filter comes in handy if you want to view a general record of all activity performed on a specific AWS service, e.g., S3.&lt;/p&gt;

&lt;p&gt;If you select the Event source filter on the CloudTrail history console, you can view a list of all the service names available in AWS that you can filter by.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yq80u8yks4ibuhpypoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yq80u8yks4ibuhpypoh.png" alt=" " width="415" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the event source takes the form of service-prefix.amazonaws.com, e.g., s3.amazonaws.com, so simply type the service in your search, and the full source name is autocompleted for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;Now let’s have a look at a few real-world use cases and how you’d typically use CloudTrail to investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Scenario 1: An administrator receives a notification for a sudden spike in failed login attempts for a critical IAM user account.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Troubleshooting:&lt;/strong&gt;&lt;br&gt;
The above scenario is a &lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-aws-console-sign-in-events.html" rel="noopener noreferrer"&gt;&lt;em&gt;ConsoleLogin&lt;/em&gt; Event&lt;/a&gt;. What we’d want to determine here is if this is a legitimate login attempt from the user or if it could potentially be a brute force attack on the account. So essentially, we are looking for details such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source IP Address: Multiple login attempts originating from a single, unexpected IP address can be a red flag.&lt;/li&gt;
&lt;li&gt;User Agent: this is information relating to the device used for the login attempt, e.g., browser and OS. This could reveal inconsistencies compared to the expected login patterns for the user (hint: check previous successful login attempts to identify the usual patterns).&lt;/li&gt;
&lt;li&gt;Timestamp: A rapid succession of failed login attempts within a short timeframe is a strong indicator of a brute-force attack.&lt;/li&gt;
&lt;li&gt;Number of failed login attempts: A high count, especially combined with the signals above, strengthens the case for a brute-force attempt.&lt;/li&gt;
&lt;/ul&gt;
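&lt;p&gt;Once you have the ConsoleLogin events exported, the indicators above can be tallied mechanically. A minimal sketch (the records are fabricated, but real ConsoleLogin events report the outcome in the same responseElements field):&lt;/p&gt;

```python
from collections import Counter

# Fabricated sample records; real ConsoleLogin events use these field names.
records = [
    {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.9",
     "responseElements": {"ConsoleLogin": "Failure"}},
    {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.9",
     "responseElements": {"ConsoleLogin": "Failure"}},
    {"eventName": "ConsoleLogin", "sourceIPAddress": "198.51.100.4",
     "responseElements": {"ConsoleLogin": "Success"}},
]

# Count failed logins per source IP.
failed = Counter(
    r["sourceIPAddress"]
    for r in records
    if r["eventName"] == "ConsoleLogin"
    and r["responseElements"].get("ConsoleLogin") == "Failure"
)

# Repeated failures from one IP within a short window suggest brute force;
# the threshold of 2 here is arbitrary for the sake of the example.
suspicious = [ip for ip, n in failed.items() if n >= 2]
print(suspicious)  # ['203.0.113.9']
```

&lt;p&gt;Pairing this count with the timestamps and user agents from the same records gives you the full picture described above.&lt;/p&gt;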

&lt;p&gt;To get the above information, we can proceed as follows:&lt;/p&gt;

&lt;p&gt;i. Filter&lt;br&gt;
We can apply a filter as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmrcirsr41pigtqatjkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmrcirsr41pigtqatjkr.png" alt=" " width="604" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ii. View Details&lt;br&gt;
From the list, select the event of interest. For this case, we’d identify this event using the User name field.&lt;/p&gt;

&lt;p&gt;Once you click on the event, you get access to a more detailed event record as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw77f9wl3wpihsjhy6mp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw77f9wl3wpihsjhy6mp.png" alt=" " width="357" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, you can pick out the relevant details and determine if this is a security event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Scenario 2: An application deployed on EC2 instances suddenly experiences slow loading times and high error rates. You suspect a recent configuration change might be to blame.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Troubleshooting:&lt;/strong&gt;&lt;br&gt;
i. Identify the timeframe: Narrow down when the application issues began. We need to apply a time filter for this period.&lt;/p&gt;

&lt;p&gt;ii. Filter Events: Depending on your architecture, you want to check for a couple of things, e.g.,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration changes made to the instance itself, i.e., instance resources, which could reveal why the slow loading times occur.&lt;/li&gt;
&lt;li&gt;Configuration changes made to the instance’s networking, e.g., changes to security groups, route tables, NACLs, and any applicable policies.&lt;/li&gt;
&lt;li&gt;Configuration changes made to other services that the application communicates with, e.g., if the application reads from or writes to an S3 bucket, it could be that the instance profile was changed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Event filters can then be applied as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyjrr47oya4m5nj0qaj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyjrr47oya4m5nj0qaj2.png" alt=" " width="604" height="48"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check for configuration changes made to the security group.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wltnldo4ho059bnx1lj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wltnldo4ho059bnx1lj.png" alt=" " width="604" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check for configuration changes made to the EC2 instance.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntibt8oj9xojz201n900.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntibt8oj9xojz201n900.png" alt=" " width="604" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check for configuration changes made to the EC2 instance role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the CloudTrail console offers a convenient way to investigate AWS events, its filtering capabilities are currently limited to a single field at a time. This can be restrictive when you need to refine your search based on multiple criteria.&lt;/p&gt;

&lt;p&gt;For more comprehensive filtering, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;View the event history via the AWS CLI, e.g., with the aws cloudtrail lookup-events command.&lt;/li&gt;
&lt;li&gt;Download your CloudTrail events as a CSV file and leverage the powerful filtering and analysis features of tools like Excel.&lt;/li&gt;
&lt;li&gt;Save logs to an S3 bucket and use an Athena table for filtering.&lt;/li&gt;
&lt;/ul&gt;
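&lt;p&gt;For example, a minimal boto3 sketch (assuming AWS credentials and a region are configured) could fetch events for the timeframe via the LookupEvents API and then apply several filters client-side, something the console cannot do in one pass:&lt;/p&gt;

```python
def multi_filter(events, username=None, event_name=None, resource_name=None):
    """Apply several criteria at once, which the console's
    single-attribute filter cannot do."""
    def keep(e):
        if username and e.get("Username") != username:
            return False
        if event_name and e.get("EventName") != event_name:
            return False
        if resource_name and not any(
            r.get("ResourceName") == resource_name for r in e.get("Resources", [])
        ):
            return False
        return True
    return [e for e in events if keep(e)]

def fetch_ec2_events(start, end):
    """Pull one page of EC2 events for the timeframe (requires AWS
    credentials). The LookupEvents API accepts only one attribute
    filter per call, so further narrowing happens client-side."""
    import boto3
    ct = boto3.client("cloudtrail")
    page = ct.lookup_events(
        StartTime=start,
        EndTime=end,
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": "ec2.amazonaws.com"}
        ],
    )
    return page["Events"]
```

&lt;p&gt;Calling multi_filter on the fetched events with, say, both a username and an event name gives you the multi-criteria view the console lacks.&lt;/p&gt;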

&lt;p&gt;Stop Guessing, Start Tracking: Enable CloudTrail today and gain visibility into your AWS activity.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>monitoring</category>
      <category>security</category>
    </item>
    <item>
      <title>AWS CSI - Investigating Cloud Conundrums (CloudWatch - Part 3)</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 09 Feb 2026 20:43:44 +0000</pubDate>
      <link>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-3-2fp9</link>
      <guid>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-3-2fp9</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-1-40po"&gt;Part 1&lt;/a&gt; of this series, we looked at the basics of CloudWatch metrics and one example of how you can leverage CloudWatch metrics to troubleshoot performance issues on AWS. In &lt;a href="https://dev.to/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-2-5c5d"&gt;Part 2&lt;/a&gt;, we delved a bit deeper into some more examples and scenarios that allowed us to get a better understanding of how to leverage CloudWatch metrics.&lt;/p&gt;

&lt;p&gt;In this third piece, we are going to take a step back. Whether you’re an AWS novice or an expert, identifying WHICH metrics to look at when troubleshooting can pose a real challenge. Below, we will look at some strategies and best practices to help you identify the right metrics for troubleshooting performance issues on AWS.&lt;/p&gt;

&lt;p&gt;So, let’s dive in!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The bigger picture: Application Architecture&lt;/strong&gt;&lt;br&gt;
Before diving into metrics and troubleshooting performance issues in AWS, it’s essential to have a comprehensive understanding of your application’s architecture. Identify the key components, e.g., where is your compute layer hosted, where is your database, your storage, etc. Your architecture acts as a blueprint that outlines how different components (services, databases, etc.) interact to deliver your application’s functionality. By comprehending this blueprint, you can map potential performance issues to specific components and identify the relevant CloudWatch metrics for each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Identify the affected service&lt;/strong&gt;&lt;br&gt;
Is it a slow website, sluggish database queries, or high latency in your Lambda functions? When troubleshooting performance issues in your AWS environment, identifying the affected service is a crucial step. It’s like being a detective at a crime scene — knowing where to look is only half the battle. CloudWatch offers a vast array of metrics across different categories like compute, network, database, and more. Knowing the affected service allows you to filter out irrelevant categories and focus on the metrics most likely to pinpoint the issue. For example, CPU utilization for EC2 instances wouldn’t be relevant if you’re investigating slow database queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Identify Common Performance Issues and Related Metrics&lt;/strong&gt;&lt;br&gt;
Understanding common problems that can arise and the relevant CloudWatch metrics that can help diagnose these issues is crucial when looking into performance problems in your AWS environment. By understanding these bottlenecks and their corresponding CloudWatch metrics, you can swiftly determine possible causes and take corrective action. For example:&lt;/p&gt;

&lt;p&gt;For High Latency or Slow Performance, you need to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elastic Load Balancer (ELB): TargetResponseTime&lt;/li&gt;
&lt;li&gt;API Gateway: Latency&lt;/li&gt;
&lt;li&gt;EC2 Instances: CPUUtilization, DiskReadOps, DiskWriteOps, NetworkIn, NetworkOut&lt;/li&gt;
&lt;li&gt;RDS Instances: ReadLatency, WriteLatency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For High Error Rates, you need to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ELB: HTTPCode_ELB_4XX_Count, HTTPCode_ELB_5XX_Count&lt;/li&gt;
&lt;li&gt;API Gateway: 4XXError, 5XXError&lt;/li&gt;
&lt;li&gt;Lambda: Errors, Throttles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Traffic Spikes or Sudden Increase in Load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ELB/API Gateway: RequestCount&lt;/li&gt;
&lt;li&gt;EC2 Instances: NetworkIn, NetworkOut&lt;/li&gt;
&lt;li&gt;RDS Instances: DatabaseConnections, NetworkReceiveThroughput, NetworkTransmitThroughput&lt;/li&gt;
&lt;/ul&gt;
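&lt;p&gt;Once you know which metrics matter for a symptom, you can pull several of them in one call. Below is a sketch that builds the MetricDataQueries structure expected by CloudWatch’s GetMetricData API; the load balancer, instance, and database identifiers are made up for illustration:&lt;/p&gt;

```python
def build_queries(metrics, period=300, stat="Average"):
    """Turn (Namespace, MetricName, dimensions) triples into the
    MetricDataQueries structure expected by cloudwatch.get_metric_data()."""
    queries = []
    for i, (namespace, name, dims) in enumerate(metrics):
        queries.append({
            "Id": f"m{i}",  # ids must be unique and start with a letter
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": name,
                    "Dimensions": [{"Name": k, "Value": v} for k, v in dims.items()],
                },
                "Period": period,
                "Stat": stat,
            },
        })
    return queries

# One query set for the "high latency" checklist above
# (resource identifiers are placeholders).
latency_checks = build_queries([
    ("AWS/ApplicationELB", "TargetResponseTime", {"LoadBalancer": "app/my-alb/123"}),
    ("AWS/EC2", "CPUUtilization", {"InstanceId": "i-0123456789abcdef0"}),
    ("AWS/RDS", "ReadLatency", {"DBInstanceIdentifier": "my-db"}),
])
```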

&lt;p&gt;A grasp of your application architecture and common performance pitfalls empowers you to swiftly identify the right CloudWatch metrics for troubleshooting. Over time, this process becomes more intuitive, allowing you to troubleshoot efficiently and maintain optimal performance for your AWS environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Leverage Existing CloudWatch Documentation&lt;/strong&gt;&lt;br&gt;
CloudWatch documentation serves as your trusty roadmap when navigating the vast world of CloudWatch metrics. It helps you make sense of the data, find the right metrics for troubleshooting, and fix performance problems in your AWS environment. Here’s how CloudWatch documentation can assist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metric Descriptions provide a clear explanation of what each metric represents and how it’s measured.&lt;/li&gt;
&lt;li&gt;Dimensional Breakdown often details the dimensions associated with each metric. Understanding dimensions allows you to filter and analyze metrics with greater granularity.&lt;/li&gt;
&lt;li&gt;Best Practices: CloudWatch documentation outlines best practices for collecting, monitoring, and analyzing metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Follow the User Experience&lt;/strong&gt;&lt;br&gt;
In the realm of AWS performance troubleshooting, User Experience is regarded as the “Golden Signal”. This underscores the paramount importance of focusing on metrics that directly or indirectly impact how your users interact with your applications. Ultimately, the success of your applications hinges on user satisfaction. A slow website, unresponsive interface, or delayed responses can lead to frustration and user churn.&lt;/p&gt;

&lt;p&gt;CloudWatch offers various metrics that directly or indirectly impact user experience. Some key examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website Load Time is the time it takes for a web page to fully load on a user’s device. Slow load times can lead to user abandonment and negatively impact conversion rates.&lt;/li&gt;
&lt;li&gt;Database Query Latency is the time it takes for a database to respond to a query. High latency can result in sluggish application performance and delayed responses for users.&lt;/li&gt;
&lt;li&gt;API Response Times is the time it takes for your API to respond to a request. Slow API response times can hinder the overall performance of applications that rely on APIs.&lt;/li&gt;
&lt;li&gt;Application Error Rates are the frequency of errors encountered by users within your applications. Frequent errors can disrupt user workflows and damage trust in your services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Monitor Dependencies and Downstream Services&lt;/strong&gt;&lt;br&gt;
Application problems can ripple outwards. Dependencies (like databases and message queues) and downstream services (what your application interacts with) can significantly impact overall performance and reliability. For example, if an application is slow, check not just the EC2 metrics but also the RDS metrics if your application relies on a database.&lt;/p&gt;

&lt;p&gt;Key dependencies and downstream services to monitor include databases, message queues, caches, storage, networking components, etc. By keeping an eye on these components, you can quickly identify and address performance issues that may not be immediately apparent from primary application metrics alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Compare Against Historical Data&lt;/strong&gt;&lt;br&gt;
For troubleshooting and monitoring your AWS environment, comparing current metrics to historical data is key. This reveals trends and anomalies, helping you distinguish between normal fluctuations and potential issues requiring attention. Comparing metric data against historical data is important for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establishing a Baseline: Historical data helps establish normal operating conditions. Comparing current metrics against this baseline allows you to determine if the current performance is within expected ranges.&lt;/li&gt;
&lt;li&gt;Identifying Anomalies: Anomalies are data points that deviate significantly from the norm. By comparing current metrics with historical data, you can quickly spot unusual behaviour that might indicate issues.&lt;/li&gt;
&lt;li&gt;Understanding Trends: Trends show the general direction in which a metric is moving over time. Identifying trends helps you anticipate future behaviour, such as increasing resource usage that might eventually lead to performance bottlenecks if not addressed.&lt;/li&gt;
&lt;/ul&gt;
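&lt;p&gt;The baseline-and-anomaly idea can be sketched in a few lines. This is a simple z-score check, not CloudWatch’s built-in anomaly detection, and the numbers are invented for illustration:&lt;/p&gt;

```python
from statistics import mean, stdev

def flag_anomalies(history, current, threshold=3.0):
    """Flag current data points that deviate from the historical baseline
    by more than `threshold` standard deviations (a simple z-score check)."""
    baseline = mean(history)
    spread = stdev(history)
    return [
        (i, value) for i, value in enumerate(current)
        if spread and abs(value - baseline) / spread > threshold
    ]

# Example: a week of hourly CPU averages hovering around 40%...
history = [38, 41, 40, 39, 42, 40, 41, 39, 40, 38]
# ...then today's readings include a sudden jump.
today = [40, 41, 85, 39]
print(flag_anomalies(history, today))  # the 85% reading stands out
```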

&lt;p&gt;By following these strategies and best practices, you can transform CloudWatch metrics from a vast dataset into a powerful troubleshooting tool. With a targeted approach to metric selection, you can acquire more insight into the performance of your AWS environment, spot possible bottlenecks early on, and guarantee a seamless and effective user experience. Remember, effective troubleshooting is an ongoing process. With more AWS resources and CloudWatch expertise under your belt, you’ll become more adept at picking the pertinent metrics for every circumstance.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>AWS CSI - Investigating Cloud Conundrums (CloudWatch-Part 2)</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 09 Feb 2026 20:36:25 +0000</pubDate>
      <link>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-2-5c5d</link>
      <guid>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-2-5c5d</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-1-40po"&gt;Part 1&lt;/a&gt; of this series, we looked at the basics of CloudWatch metrics and one example of how you can leverage CloudWatch metrics to troubleshoot performance issues on AWS. In this second piece, we’ll dive a little deeper and investigate a few more examples.&lt;/p&gt;

&lt;p&gt;So, let’s dive in!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Scenario 2: You have a microservices-based application running on Amazon ECS (Elastic Container Service). Users have reported that the application becomes unresponsive after running for a few hours.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt;&lt;br&gt;
A memory leak is a type of resource leak that occurs when a program allocates memory but fails to release it back to the system after it is no longer needed. The result is that, over time, the program consumes more and more memory, leading to resource exhaustion. As memory becomes scarce, the application may slow down due to increased garbage collection activity or the need to swap memory to disk.&lt;/p&gt;

&lt;p&gt;Memory leaks typically cause a gradual increase in memory usage. The application may start normally but degrade over time as memory is exhausted. If the application becomes unresponsive after a consistent period, it suggests a pattern where memory consumption reaches a critical threshold, causing the failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation:&lt;/strong&gt;&lt;br&gt;
Occasionally, a memory allocation spike can cause a one-time spike in the amount of memory being used by a resource in your AWS environment. For an allocation spike, restarting the service will temporarily resolve the issue. However, if the problem recurs, it could be an indication that the underlying issue is a memory leak rather than a one-time allocation spike.&lt;/p&gt;

&lt;p&gt;For either case, you need to look at the &lt;strong&gt;&lt;em&gt;‘MemoryUtilization’&lt;/em&gt;&lt;/strong&gt; metrics. The &lt;strong&gt;&lt;em&gt;‘MemoryUtilization’&lt;/em&gt;&lt;/strong&gt; metric shows the percentage of memory that is used by tasks in the specific dimension. For statistics, you’d need to look at the average and maximum utilization over the period of interest.&lt;/p&gt;
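&lt;p&gt;The distinction between a leak and a one-time spike can be captured with a rough heuristic over the MemoryUtilization data points, sketched below (the threshold and sample values are illustrative):&lt;/p&gt;

```python
def looks_like_leak(samples, min_rise=5.0):
    """Distinguish a memory leak from a one-time spike: a leak shows a
    sustained upward trend, while a spike rises and then settles back.
    `samples` is a time-ordered list of MemoryUtilization percentages."""
    half = len(samples) // 2
    early = sum(samples[:half]) / half
    late = sum(samples[half:]) / (len(samples) - half)
    return late - early >= min_rise  # the later average keeps climbing

leak = [30, 35, 41, 47, 52, 58, 64, 71]    # steadily climbing
spike = [30, 31, 78, 33, 30, 32, 31, 30]   # jumps once, then recovers
```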

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Scenario 3: Your e-commerce website, hosted on Amazon EC2 instances behind an Application Load Balancer (ALB), is experiencing a sudden spike in traffic. Customers report slow loading times and intermittent outages. You suspect a Distributed Denial of Service (DDoS) attack.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt;&lt;br&gt;
A Distributed Denial of Service (DDoS) attack is a malicious attempt to disrupt the normal traffic of a targeted server, service, or network by overwhelming it with a flood of internet traffic. This flood typically originates from a network of compromised computers or devices, making it difficult to pinpoint and block the source. The sheer volume of illegitimate traffic can overload resources, making the website or service inaccessible to legitimate users. End users might encounter slow loading times, error messages, or complete outages.&lt;/p&gt;

&lt;p&gt;In the context of AWS, a DDoS attack can target various services such as EC2 instances, load balancers, or even the application running on AWS infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation:&lt;/strong&gt;&lt;br&gt;
While a sudden spike in traffic can occur during legitimate events, e.g., sales or promotional campaigns, there are key patterns that can help identify a possible DDoS attack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Traffic Patterns:&lt;/strong&gt; While legitimate spikes may follow a more gradual increase and decrease in traffic, a DDoS attack will typically involve a sudden and sustained surge in traffic, often exceeding normal peak usage patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Source of Traffic:&lt;/strong&gt; The source of legitimate traffic can usually be traced back to a diverse set of users and locations. DDoS traffic on the other hand, might originate from a limited number of IP addresses or geographical locations, indicating a coordinated attack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Application Impact:&lt;/strong&gt; DDoS attacks usually target specific web applications or services. Legitimate traffic spikes might affect overall website performance but wouldn’t target specific applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Increased Error Rates:&lt;/strong&gt; Along with high traffic, you may observe an increase in 4xx (client error) and 5xx (server error) HTTP status codes, indicating that the backend servers are overwhelmed and unable to process the requests.&lt;/p&gt;

&lt;p&gt;Key metrics to monitor when investigating a possible DDoS attack include:&lt;br&gt;
&lt;strong&gt;1. Number of Requests Received:&lt;/strong&gt;&lt;br&gt;
If your application is fronted by an Application Load Balancer, then you need to look at the RequestCount metric. The RequestCount metric shows the number of requests processed over IPv4 and IPv6. A sudden and unusual spike in request count is a primary indicator of a potential DDoS attack. For API Gateway, this would be the Count metric.&lt;/p&gt;

&lt;p&gt;For the RequestCount metrics, the statistics of interest would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sum: the total number of requests over a period will help in understanding the overall traffic volume.&lt;/li&gt;
&lt;li&gt;Average: the average number of requests per second helps to identify spikes relative to normal traffic patterns.&lt;/li&gt;
&lt;li&gt;Maximum: the peak number of requests received in the given period is useful for identifying the highest load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Network Traffic&lt;/strong&gt;&lt;br&gt;
For the instances hosting the application, you need to check the &lt;strong&gt;&lt;em&gt;NetworkIn&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;NetworkOut&lt;/em&gt;&lt;/strong&gt; metrics. If these also show a sharp increase, it may be indicative of a DDoS attack.&lt;/p&gt;

&lt;p&gt;For network traffic metrics, we need to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sum: the total amount of data transferred in and out, respectively, which helps quantify the scale of traffic.&lt;/li&gt;
&lt;li&gt;Average: the average data transfer rate, useful for comparing against baseline traffic levels.&lt;/li&gt;
&lt;li&gt;Maximum: the peak data transfer rate, which can indicate periods of intense activity typical of a DDoS attack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. HTTP Error Rates&lt;/strong&gt;&lt;br&gt;
An increase in HTTP error rates can indicate that your servers are struggling to handle the incoming requests. To check the error rates, you can check the &lt;strong&gt;&lt;em&gt;HTTPCode_ELB_4XX_Count&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;HTTPCode_ELB_5XX_Count&lt;/em&gt;&lt;/strong&gt; metrics for your ALB, or &lt;strong&gt;&lt;em&gt;4XXError&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;5XXError&lt;/em&gt;&lt;/strong&gt; if using API Gateway.&lt;/p&gt;

&lt;p&gt;For HTTP Error metrics, we need to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sum: the total number of server and client errors over a period. A significant increase in server errors (5xx) can indicate that the backend is overwhelmed, while a rise in client errors (4xx) can point to a flood of malformed requests.&lt;/li&gt;
&lt;li&gt;Average: the average rate of errors, useful for comparing against normal error rates.&lt;/li&gt;
&lt;li&gt;Maximum: the peak error rate, which can indicate the most stressful/problematic periods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Target Response Time&lt;/strong&gt;&lt;br&gt;
The ALB’s &lt;strong&gt;&lt;em&gt;TargetResponseTime&lt;/em&gt;&lt;/strong&gt; metric shows the time elapsed, in seconds, after the request leaves the load balancer until the target starts to send the response headers. Increased response times can signal that your application is under strain.&lt;/p&gt;

&lt;p&gt;The key statistics to look at for this metric include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average: the average response time, helping to identify trends in performance degradation.&lt;/li&gt;
&lt;li&gt;Maximum: the longest response time recorded, which can indicate extreme cases of backend strain.&lt;/li&gt;
&lt;li&gt;P95 or P99: Percentile metrics show response times at the 95th or 99th percentile, useful for identifying the response times experienced by the top 5% or 1% of requests, which can be heavily affected during an attack.&lt;/li&gt;
&lt;/ul&gt;
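&lt;p&gt;Pulling these statistics together, here is a small sketch of how you might summarize RequestCount data points and compare the current peak against a normal baseline (all numbers invented for illustration):&lt;/p&gt;

```python
def traffic_summary(request_counts):
    """Summarize per-minute RequestCount data points the way CloudWatch's
    Sum / Average / Maximum statistics would."""
    return {
        "Sum": sum(request_counts),
        "Average": sum(request_counts) / len(request_counts),
        "Maximum": max(request_counts),
    }

def spike_ratio(current, baseline):
    """How many times above the normal peak the current peak sits; a
    large, sustained ratio is one of the DDoS signals described above."""
    return traffic_summary(current)["Maximum"] / traffic_summary(baseline)["Maximum"]

normal = [120, 130, 115, 140, 125]
attack = [120, 3400, 5200, 4900, 5100]  # sudden, sustained surge
```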

&lt;h2&gt;
  
  
  CloudWatch Statistics
&lt;/h2&gt;

&lt;p&gt;When it comes to trying to make sense of CloudWatch metrics, statistics can be a powerful ally. The Sum, Average, Minimum, and Maximum statistics are the most commonly used, but there are other powerful statistics that you can leverage. For example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Percentiles&lt;/strong&gt;&lt;br&gt;
Percentiles help you understand the relative standing of a value in a dataset, i.e., how a particular value compares to the rest of the data. For example, imagine you are in a race with 100 participants. If your finishing time beats 95 of the other runners, you are among the top 5: 95 runners are slower than you, and only 4 are faster. Your time sits at the 95th percentile (p95).&lt;/p&gt;

&lt;p&gt;Similarly, in CloudWatch, p95 would mean that 95 percent of the data within the specified period is lower than this value, and 5 percent of the data is higher than this value. Let’s say, for example, that you’re monitoring the latency (response time) of your game servers using CloudWatch. You have checked the average latency for the application, and it is 50ms. Is this good? Is this bad? The average latency would not be able to show you the entire picture as there could be a significant variation in individual player experiences.&lt;/p&gt;

&lt;p&gt;Let’s say instead that you filter the metric using the p90 statistic. This statistic will show the experience of most players. So, for example, if the p90 response time is 100 ms, this means that 90% of the requests were completed in 100 ms or less, and only 10% of the requests took longer than 100 ms. Similarly, if the p50 response time is 50 ms, it means that 50% of requests were completed in 50 ms or less.&lt;/p&gt;

&lt;p&gt;Percentiles help you understand the typical performance and identify outliers. For example, while the average (mean) response time might be 50 ms, the p90 being 100 ms indicates that some requests take significantly longer.&lt;/p&gt;
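&lt;p&gt;The idea behind a percentile can be shown with a tiny nearest-rank implementation (CloudWatch’s exact interpolation may differ):&lt;/p&gt;

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the value below which roughly p% of the
    data falls. CloudWatch's pNN statistics follow the same idea."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 latency samples: most around 50 ms, with a slow tail near 200 ms.
latencies = [50] * 90 + [200] * 10
print(percentile(latencies, 50))   # typical request
print(percentile(latencies, 95))   # tail experience
```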

&lt;p&gt;To understand more about CloudWatch statistics, view &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this article, we’ve explored several real-world scenarios where CloudWatch metrics empower you to investigate and troubleshoot performance issues within your AWS environment. But a crucial question remains: how do you identify the right metrics to look at for a specific issue?&lt;/p&gt;

&lt;p&gt;Well, worry not, help is on the way! In our next blog post, we’ll delve into practical strategies and best practices to guide you in selecting the most relevant CloudWatch metrics for troubleshooting various performance concerns in your AWS infrastructure. Stay tuned!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AWS CSI -Investigating Cloud Conundrums (CloudWatch - Part 1)</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 09 Feb 2026 20:02:16 +0000</pubDate>
      <link>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-1-40po</link>
      <guid>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-1-40po</guid>
      <description>&lt;p&gt;If you’re anything like me, you absolutely hate going to the doctors. Unfortunately, (&lt;em&gt;and at least until we can make ourselves indestructible🤞&lt;/em&gt;), every so often, you will always find yourself in a doctor’s office. Now, for the doctor to accurately diagnose your illness and prescribe the right treatment, they need to first collect a range of vitals — your temperature, blood pressure, heart rate, and so on. These vital signs provide crucial insights into your health, and tracking them over time helps the doctor identify patterns, detect issues early, and understand the overall state of your body.&lt;/p&gt;

&lt;p&gt;Similarly, CloudWatch acts like your AWS environment’s diagnostic physician. It collects a comprehensive set of data points like system metrics (CPU usage, memory allocation, network latency) and logs (application errors, API calls, resource utilization) that serve as vital signs. By analyzing these metrics and logs, CloudWatch helps you diagnose the health of your application. An unexpected surge in CPU usage might point to inefficient code, while frequent errors in the logs could indicate configuration issues.&lt;/p&gt;

&lt;p&gt;In this blog, we will delve into CloudWatch metrics and explore how you can leverage these metrics to understand the performance of your AWS Services as well as detect potential issues. Whether it’s preventing a minor symptom from becoming a major outage or optimizing your resources for peak performance, CloudWatch is your go-to solution for maintaining the well-being of your cloud infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Basics
&lt;/h2&gt;

&lt;p&gt;Let’s start with a few important details about metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A metric is a quantitative measure of a system’s characteristic over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The majority of AWS services provide a set of free metrics under basic monitoring. However, to monitor a parameter that is not covered by the free metrics, you can enable detailed monitoring or set up custom metrics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metrics are collected as a set of time-ordered data points. The period over which data points are collected ranges from one second (for high-resolution metrics) up to one hour. The retention period of a metric depends on how frequently data points are published. See &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metrics exist only in the Region in which they are created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metrics are categorized into dimensions, i.e., you can monitor the CPU utilization of EC2 instances, RDS databases, ECS clusters, etc. However, when you want to view the CPU utilization for only one, or all, of your RDS databases, you’d view this under the ‘Across All Databases’ dimension or the ‘DBInstanceIdentifier’ dimension.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding CloudWatch Metrics
&lt;/h2&gt;

&lt;p&gt;The good news is that AWS maintains exhaustive documentation for the supported metrics for each service. Additionally, each metric is explained in detail, so it’s easy to understand exactly what it measures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For a list of services that publish their metrics to CloudWatch, see &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To understand the specific metrics that are supported for a particular service, search for ‘Available Metrics for’ followed by the service name, e.g., ‘Available Metrics for API Gateway’. For most services, this page will also display the namespaces and dimensions available for the service.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;E.g.,&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l2uqsjq0ij44wljkk0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l2uqsjq0ij44wljkk0h.png" alt=" " width="800" height="323"&gt;&lt;/a&gt;&lt;br&gt;
1: This is the name of the metric, i.e., the characteristic that is being measured.&lt;/p&gt;

&lt;p&gt;2: The description of the metric, i.e., what it is and what it measures. For some metrics, the description will also include other notable details, e.g., recommendations, when to use the metric, exceptions, etc.&lt;/p&gt;

&lt;p&gt;3: The unit of a metric is the scale of measurement of that metric. e.g., For EC2 instance metrics, the BurstBalance has the unit ‘Percent’. This tells you that the BurstBalance metric is measured as a percentage value. Units provide context and meaning to the raw numerical values you see, e.g., you could compare BurstBalance (percentage) with CPUUtilization (percentage) to see if high CPU usage is depleting your burstable credits.&lt;/p&gt;

&lt;p&gt;4: CloudWatch provides several statistics for a metric’s data points, e.g., sum, average, minimum, maximum, etc. See all available statistics here. Statistics are crucial to understanding a metric’s behaviour, e.g., the average helps to identify a baseline for the metric’s typical behaviour. Meaningful Statistics for a metric are the statistics that are considered the most useful for that metric.&lt;/p&gt;

&lt;p&gt;5: For RDS, some metrics are only available for a specific database engine. The ‘Applies to’ column indicates the database engine for which the metric can be collected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graphing Metrics
&lt;/h2&gt;

&lt;p&gt;Trying to understand what a set of data is trying to tell you purely by looking at rough numbers can leave you feeling foggy. Visuals, on the other hand, are like a lightbulb moment, illuminating complex ideas in a clear and memorable way. On CloudWatch, you can use graphs to view metrics over a period.&lt;/p&gt;

&lt;p&gt;Say, for example, you want to view the average write I/O operations on your EBS volume for a period. You can access the metric on the console as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa8pn5w4q0pej9h5tz2l.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa8pn5w4q0pej9h5tz2l.PNG" alt=" " width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1: You can use the time filter to granularize your search to a specific period. The custom option allows you to specify a custom period, e.g., view metrics over 3 weeks.&lt;/p&gt;

&lt;p&gt;2: The Actions/Options tabs allow you to customize your widget, i.e., specify how you want your data to be displayed. The Options tab provides more customization for your graph, e.g., labels to add to the axis, units, etc.&lt;/p&gt;

&lt;p&gt;3: The Graphed Metrics tab allows you to customize the graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe95b04morcw00yiawcjs.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe95b04morcw00yiawcjs.PNG" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can change the statistic being displayed, e.g., change from average to maximum, or view a sum. You can also change the period, which alters the data points on the graph, e.g., to view the maximum values at each hour, you can filter as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nyeaytgnvevir7vusve.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nyeaytgnvevir7vusve.PNG" alt=" " width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;Scenario 1: Your users are reporting that your web application is responding slowly. You need to determine the cause of the high latency and resolve it quickly.&lt;/p&gt;

&lt;p&gt;Resolution:&lt;/p&gt;

&lt;p&gt;There are 2 main reasons for slow response times in an application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Resource limitations:&lt;/strong&gt; When the resources assigned to the compute infrastructure where the application is running are insufficient to sustain the load. For example, using a small instance for a high-load application may result in CPU overload and memory bottlenecks. This can also occur if the database is overloaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Application Code Issues:&lt;/strong&gt; Poorly written code with logic flaws, e.g., code that does not properly release memory after use, can lead to memory depletion and slow performance.&lt;/p&gt;

&lt;p&gt;To check if the lag is a result of resource constraints, we can examine the compute service’s CPU utilization and disk I/O. So far, we have looked at how to access and view different metrics for a service. The next big question is: how do you interpret CloudWatch data and derive meaningful insights from it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;i. CPU Utilization&lt;/strong&gt;&lt;br&gt;
As previously mentioned, there are various statistics available to you for each metric. For this case, to determine if CPU Utilization is the reason for latency, we need to look at the following 3 statistics over the given period:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Average CPU Utilization:&lt;/strong&gt; This statistic helps in understanding the general load on your instance over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Maximum CPU Utilization:&lt;/strong&gt; This statistic shows the peak CPU usage within a specified period. It is useful to identify if there are any spikes that might correlate with periods of high latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- CPU Credit Balance&lt;/strong&gt; (only for burstable instances): If you’re using burstable instances (e.g., T2, T3 instances), running out of CPU credits can cause the instance to throttle and result in increased latency.&lt;/p&gt;

&lt;p&gt;An important thing to remember here is the unit used to measure the metric, which can be found in the service’s official documentation. CPU Utilization is measured as a percentage; thus, the output would look something like the below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxagz81t0n1snqr2y30h.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxagz81t0n1snqr2y30h.PNG" alt=" " width="800" height="498"&gt;&lt;/a&gt;&lt;br&gt;
Average CPU Utilization&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0e4qo9gx95c7ri54tocm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0e4qo9gx95c7ri54tocm.PNG" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;br&gt;
Maximum CPU Utilization&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i2898nlakt7z3syhf0y.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i2898nlakt7z3syhf0y.PNG" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;br&gt;
CPUCreditBalance&lt;/p&gt;

&lt;p&gt;From the above, we can see that CPU utilization spiked on three occasions, which correspond to the periods when the burst credits were depleted the fastest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ii. Disk I/O&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Disk I/O metrics reflect the performance and usage of your disk. Key Disk I/O metrics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DiskReadOps:&lt;/strong&gt; The number of read operations performed on the disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DiskWriteOps&lt;/strong&gt;: The number of write operations performed on the disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DiskReadBytes&lt;/strong&gt;: The amount of data read from the disk, in bytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DiskWriteBytes&lt;/strong&gt;: The amount of data written to the disk, in bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: the above metrics are only available for instance store volumes. If using EBS, you’d be looking at the EBSReadOps, EBSWriteOps, EBSReadBytes, and EBSWriteBytes metrics.&lt;/p&gt;

&lt;p&gt;Key statistics to measure include:&lt;br&gt;
&lt;strong&gt;- Sum:&lt;/strong&gt; For DiskReadOps and DiskWriteOps, the sum statistic helps you understand the total number of I/O operations over a period.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Average&lt;/strong&gt;: For DiskReadBytes and DiskWriteBytes, the average statistic provides insight into the average data throughput over a period.&lt;/p&gt;

&lt;p&gt;Note: &lt;em&gt;DiskReadOps and DiskWriteOps show the number of completed operations across all volumes in a specified period. To obtain the average IOPS, divide the total number of operations by the period in seconds. For example, a DiskReadOps sum of 100,000 over a period of 1 hour gives 100,000/3,600 = ~28 read operations per second.&lt;/em&gt;&lt;/p&gt;
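&lt;p&gt;That conversion is a one-liner, shown here as a small sketch (the helper name is just for illustration):&lt;/p&gt;

```python
def average_iops(total_ops, period_seconds):
    """Convert a summed Disk*Ops datapoint into average IOPS."""
    return total_ops / period_seconds

# 100,000 read operations summed over 1 hour:
print(round(average_iops(100_000, 3600)))  # prints 28
```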

&lt;p&gt;In most cases, it is not possible to troubleshoot an issue by examining a single metric. In the scenario above, for example, we can’t conclude that the application lag is due to a resource constraint just by looking at CPU utilization, even if the spikes in utilization align with periods of latency.&lt;/p&gt;

&lt;p&gt;To get the full picture, we need to analyse multiple metrics together. Let’s say users report slowdowns. Examining both CPU utilization and disk I/O during those periods can reveal if spikes or abnormal patterns in both metrics coincide with the latency. If you have the CloudWatch agent installed, you can also compare these against memory utilization metrics. This combined view strengthens the case for resource limitations being the root cause.&lt;/p&gt;
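&lt;p&gt;Programmatically, this combined view boils down to finding the periods where both metrics breach a threshold at the same time. A minimal, illustrative sketch (the data points and thresholds are made up):&lt;/p&gt;

```python
def coinciding_spikes(cpu_points, disk_points, cpu_threshold, disk_threshold):
    """Timestamps where CPU and disk metrics are both above their thresholds."""
    hot_cpu = {t for t, v in cpu_points if v >= cpu_threshold}
    hot_disk = {t for t, v in disk_points if v >= disk_threshold}
    return sorted(hot_cpu.intersection(hot_disk))

# (timestamp, value) pairs, e.g. as returned in CloudWatch Datapoints
cpu = [("10:00", 95), ("11:00", 40), ("12:00", 98)]     # CPUUtilization %
disk = [("10:00", 5200), ("11:00", 900), ("12:00", 6100)]  # DiskWriteOps sum
print(coinciding_spikes(cpu, disk, 90, 5000))  # prints ['10:00', '12:00']
```

&lt;p&gt;Periods that appear in the result are the ones worth investigating first, since both resources were saturated at the same time.&lt;/p&gt;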

&lt;p&gt;This article has provided a foundational understanding of CloudWatch metrics and logs. However, the vast capabilities of CloudWatch extend far beyond what we’ve covered here. In a future article, we’ll delve deeper into advanced techniques for leveraging CloudWatch logs and metrics to troubleshoot issues and ensure the optimal health of your AWS resources. &lt;strong&gt;Stay tuned!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>cloud</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Navigating Disaster Recovery in the Digital Age: Choosing the Right Approach – Part 6</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Fri, 14 Feb 2025 18:19:42 +0000</pubDate>
      <link>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-6-4dj6</link>
      <guid>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-6-4dj6</guid>
      <description>&lt;p&gt;Welcome to the final chapter of our journey! Over the past blogs, we've broken down every aspect, looked at the subtleties, and investigated how two possible solutions—AWS Elastic Disaster Recovery (DRS) and Veeam—compare with our client's needs. &lt;br&gt;
&lt;em&gt;You can review the case study &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-1-4cn4"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s cut through the noise and get to the heart of the matter—because the right choice isn’t just about features or costs. It’s about finding the best fit for the client’s unique needs. Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS-Native vs Third-Party Service
&lt;/h2&gt;

&lt;p&gt;Now this is an easy one – ain’t it? &lt;/p&gt;

&lt;p&gt;We’ve explored all the factors and compared two potential solutions: AWS’s DRS and the third-party Veeam. &lt;/p&gt;

&lt;p&gt;From our comparison table, it seems that Veeam is the obvious winner. But is it really? &lt;/p&gt;

&lt;p&gt;Before we pick, let’s explore one more factor—cost.&lt;/p&gt;

&lt;p&gt;While not always explicitly stated as a requirement, the cost of the solution is a crucial factor to consider when designing a solution. After all, this will be a business expense for the client, so it’s important to provide not only a functional solution but also a cost-effective one. &lt;br&gt;
When comparing DRS and Veeam, the cost structure for each is different. With DRS, there is a flat per-hour fee for each server being replicated to AWS, and you also pay for the replication instance(s), underlying volumes, and any recovery instances created during the recovery process in AWS. For each recovery instance, you incur charges for compute, memory, and storage. &lt;/p&gt;

&lt;p&gt;On the other hand, with Veeam, you pay for the Veeam license for your source servers. In addition to that, you pay for storage and API calls to and from S3. You’ll also incur costs for any recovery instances provisioned from the backups stored in S3. In both cases, recovery costs are only incurred when recovery is initiated into the AWS environment.&lt;/p&gt;
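&lt;p&gt;To make the comparison concrete for a client, it helps to sketch both cost structures side by side. The rates below are placeholders, not actual AWS or Veeam pricing; substitute the current published prices for a real estimate:&lt;/p&gt;

```python
def drs_monthly_estimate(servers, hourly_rate_per_server, replication_infra):
    """Rough monthly DRS steady-state cost: per-server fee plus replication
    instance(s) and volumes. Recovery-instance costs apply only on recovery."""
    hours = 730  # average hours in a month
    return servers * hourly_rate_per_server * hours + replication_infra

def veeam_monthly_estimate(license_per_server, servers, s3_storage, s3_requests):
    """Rough monthly Veeam steady-state cost: licensing plus S3 storage and
    API requests. Recovery-instance costs apply only on recovery."""
    return license_per_server * servers + s3_storage + s3_requests

# Placeholder rates for three replicated/backed-up servers:
print(drs_monthly_estimate(3, 0.028, 60.0))
print(veeam_monthly_estimate(40.0, 3, 25.0, 2.0))
```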

&lt;p&gt;So, back to our options. From the comparison chart, Veeam is taking the lead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2lyfe9urbog2xyd5gca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2lyfe9urbog2xyd5gca.png" alt="Image description" width="548" height="678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, let’s go back to our case study and check if Veeam meets all our requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Comprehensive Backups – &lt;strong&gt;&lt;em&gt;Yes&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enhanced Recovery Capabilities – &lt;strong&gt;&lt;em&gt;Yes&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An RTO of 1-2 hours for the ERP System – &lt;strong&gt;&lt;em&gt;No&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recovery options that include both on-premises restoration and the possibility of running the ERP in the cloud – &lt;strong&gt;&lt;em&gt;Yes&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexible RTO and RPO for Other Systems – &lt;strong&gt;&lt;em&gt;Yes&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The verdict? The ERP system is the outlier—and as the most critical system, we can’t ignore the need for a 2-hour RTO.&lt;/p&gt;

&lt;p&gt;So, where does that leave us? My vote? A hybrid approach. DRS for the mission-critical ERP system and Veeam for the more flexible Information and Library systems.&lt;/p&gt;

&lt;p&gt;Of course, there’s more to this decision than meets the eye. A hybrid solution can bring added complexity and cost, so as an architect, your job is to present all viable options along with their pros and cons. In this case, we’re looking at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Veeam only&lt;/li&gt;
&lt;li&gt;DRS only&lt;/li&gt;
&lt;li&gt;Hybrid approach with Veeam and DRS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, what do you think? Which solution would you have recommended to the client and why? &lt;/p&gt;

</description>
      <category>aws</category>
      <category>backup</category>
      <category>disasterrecovery</category>
    </item>
    <item>
      <title>Navigating Disaster Recovery in the Digital Age: Choosing the Right Approach – Part 5</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 27 Jan 2025 19:56:05 +0000</pubDate>
      <link>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-5-59ok</link>
      <guid>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-5-59ok</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-4-3fj4"&gt;Part 4&lt;/a&gt;, we laid the foundation by examining the critical differences between Backup and Disaster Recovery and analyzing the role of Scheduling and Automation in choosing a solution. In this installment, we’re taking the next step by evaluating the remaining factors. &lt;br&gt;
&lt;strong&gt;&lt;em&gt;You can review the case study &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-1-4cn4"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;As usual, we'll apply these considerations to the case study, analyzing how each solution measures up. Every element will help us get closer to the final question: Which solution best suits the client's needs? &lt;/p&gt;

&lt;p&gt;So, let’s dive in and uncover what these next factors reveal!&lt;/p&gt;

&lt;h2&gt;
  
  
  RPO vs RTO Requirements
&lt;/h2&gt;

&lt;p&gt;When considering the client’s case study, what RTO and RPO strategy do you think best meets their needs? How might their requirements for different systems influence the solution?&lt;/p&gt;

&lt;p&gt;From the case study, the client has varying requirements for their three systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ERP Application&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTO: 1-2 hours.&lt;/li&gt;
&lt;li&gt;RPO: Not explicitly mentioned but can be inferred as low, given the critical nature of the system.&lt;/li&gt;
&lt;li&gt;Additional Recovery Needs: Flexibility to restore either to the cloud or on-premises.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Information System and Library System&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More flexible RTO/RPO requirements, as these systems are less critical than the ERP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These differences mean the solution needs to be adaptable, balancing stringent recovery metrics for the ERP with cost-effective approaches for the less critical systems.&lt;/p&gt;

&lt;p&gt;Now, let’s compare AWS Elastic Disaster Recovery Service (DRS) and Veeam Backup and Replication Service to assess their suitability for meeting the client’s RPO/RTO requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkczbtwrpu94azoj9m2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkczbtwrpu94azoj9m2x.png" alt="Image description" width="748" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From an RPO/RTO perspective, DRS excels in providing ultra-low RPO and RTO, especially for critical systems due to continuous replication. &lt;/p&gt;

&lt;h2&gt;
  
  
  Physical vs Virtual Servers
&lt;/h2&gt;

&lt;p&gt;Based on our case study, the client is managing a mix of virtualized and physical environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ERP application is hosted in a virtualized environment, which is critical for the business and requires a more robust disaster recovery solution.&lt;/li&gt;
&lt;li&gt;The information system and library system, however, are running on physical servers, which adds complexity when considering recovery strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client's mixed infrastructure means we need a tool that supports both physical and virtual environments. &lt;/p&gt;

&lt;p&gt;So, let’s see how our two solutions compare:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23h63tebxvmes2ktjorw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23h63tebxvmes2ktjorw.png" alt="Image description" width="647" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both DRS and Veeam offer comprehensive coverage for virtual environments, and both support physical servers. However, Veeam provides broader support for diverse physical server configurations and operating systems, giving it a slight edge in flexibility for the client’s mixed infrastructure. &lt;/p&gt;

&lt;h2&gt;
  
  
  On-Premises vs Cloud Restores
&lt;/h2&gt;

&lt;p&gt;The client is exploring both on-premises and cloud recovery options for their systems, with a specific emphasis on restoring their ERP system, which is critical for their business operations.&lt;/p&gt;

&lt;p&gt;For cloud-based recovery, AWS Elastic Disaster Recovery (DRS) offers a streamlined process with the potential to quickly spin up instances in the cloud. However, while DRS can restore to an on-premises environment, this approach adds significant complexity and cost.&lt;/p&gt;

&lt;p&gt;The client would need to first initiate recovery in the cloud (spin up recovery instances sized to match the on-premises servers), then initiate failback to the on-premises environment. This process would not only be costly (the client would incur the unnecessary cost of recovery instances in the cloud, plus hefty data transfer fees) but would also significantly increase the RTO for restoring to the on-premises environment. Furthermore, considerations around bandwidth, network setup, and the configuration of recovery infrastructure make an on-premises restore with DRS less attractive for this client.&lt;/p&gt;

&lt;p&gt;Veeam, on the other hand, offers a more straightforward and cost-effective approach for on-premises recovery.&lt;/p&gt;

&lt;p&gt;Veeam has a strong focus on both cloud and on-premises backup and recovery, with features designed for quick restoration to both environments. Its restore process is far more simplified, and it provides the flexibility to recover systems back to on-premises environments with minimal complexity. Additionally, Veeam offers tools to handle the nuances of restoring large data volumes, which can ease the recovery process from a cloud backup to on-premises hardware.&lt;/p&gt;

&lt;p&gt;Additionally, Veeam Backup &amp;amp; Replication allows you to restore different workloads (VMs, Google VM instances, physical servers, etc.) to Amazon EC2 instances. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgnwi5b67t06ne1bc9sv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgnwi5b67t06ne1bc9sv.png" alt="Image description" width="611" height="753"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s a wrap for Part 5! We’ve tackled some heavy hitters—RTO and RPO, the unique dynamics of physical and virtual servers, and the ever-relevant debate of on-premises vs cloud restores. These are crucial factors that bring us closer to deciding which solution might best meet our client’s needs.&lt;/p&gt;

&lt;p&gt;But the story doesn’t end here. In the final part of this series, we’ll zoom out to look at the bigger picture: AWS-native solutions vs third-party alternatives in the context of Veeam and AWS Elastic Disaster Recovery (DRS). It’s a showdown that will weigh the pros and cons of these two approaches to help us determine the ultimate recommendation for the client.&lt;/p&gt;

&lt;p&gt;So, what’s your call so far? Are you team AWS-native or team third-party? Stick around—Part 6 is where everything comes together for the grand finale! Don’t miss it.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>backup</category>
      <category>disasterrecovery</category>
    </item>
    <item>
      <title>Navigating Disaster Recovery in the Digital Age: Choosing the Right Approach – Part 4</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 27 Jan 2025 19:33:02 +0000</pubDate>
      <link>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-4-3fj4</link>
      <guid>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-4-3fj4</guid>
      <description>&lt;p&gt;Over the last three blogs, we have established the groundwork for creating a strong backup and disaster recovery (DR) solution. In &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-1-4cn4"&gt;Part 1&lt;/a&gt;, we explored the fundamentals of disaster recovery in today’s world and introduced six key factors to consider when developing a Backup/DR solution. In &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-2-5423"&gt;Part 2&lt;/a&gt;, we examined the differences between backup and disaster recovery, as well as the advantages and disadvantages of third-party versus AWS-native solutions. Then, in &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-3-3gbh"&gt;Part 3&lt;/a&gt;, we took a closer look at other crucial considerations, including scheduling, automation, RTO/RPO, and how the client’s physical or virtual environment influences the choice of tools.&lt;/p&gt;

&lt;p&gt;Now, it’s time to shift gears and get into the heart of this series: the case study that inspired it all.&lt;/p&gt;

&lt;p&gt;This blog will revisit the real-world client scenario that posed this exciting challenge. We'll use the elements covered in the previous sections to examine the client's needs, constraints, and objectives. How do the considerations we’ve explored shape the final solution? What trade-offs were necessary, and how were they balanced?&lt;/p&gt;

&lt;p&gt;By the end of this post, you’ll have a clear picture of how theory meets practice when designing a customized DR/backup solution—and why no two solutions are ever quite the same.&lt;/p&gt;

&lt;p&gt;Let’s dive into the case study and start piecing it all together!&lt;/p&gt;

&lt;h2&gt;
  
  
  Recap of the Case Study
&lt;/h2&gt;

&lt;p&gt;Our client sought to conduct a Proof of Concept (PoC) for a disaster recovery solution on AWS for three critical on-premises systems, each with unique characteristics and requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ERP Application: The crown jewel of their operations, hosted in a virtualized environment. This system was mission-critical, demanding a stringent Recovery Time Objective (RTO) of 1–2 hours.&lt;/li&gt;
&lt;li&gt;Information System: A physical server housing essential data and workflows.&lt;/li&gt;
&lt;li&gt;Library System: Another physical server, supporting key business functions but with more flexible recovery requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges with the Existing Solution&lt;/strong&gt;&lt;br&gt;
The client’s existing backup approach relied on native backup software to perform daily full backups. These backups were retained for only 24 hours before being discarded. Unfortunately, this setup introduced significant limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited Retention: The 24-hour backup retention window left the systems vulnerable to data loss if issues went undetected for longer periods.&lt;/li&gt;
&lt;li&gt;Unreliable Recovery: The manual restore process was cumbersome and prone to failures, undermining their ability to recover effectively when needed.&lt;/li&gt;
&lt;li&gt;Critical ERP Recovery Needs: The ERP system required an RTO of 1–2 hours, a demand far beyond what the current setup could reliably support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements for the New Solution&lt;/strong&gt;&lt;br&gt;
The client’s objectives were clear: they needed a comprehensive and reliable disaster recovery solution that could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A robust backup system to ensure reliable and complete backups.&lt;/li&gt;
&lt;li&gt;Efficient and dependable restoration processes to minimize downtime and avoid failed restores.&lt;/li&gt;
&lt;li&gt;A solution for the ERP system that would support a stringent RTO of 1–2 hours and both on-premises and cloud-based recovery options.&lt;/li&gt;
&lt;li&gt;Flexible RPO/RTO metrics for Other Systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to presenting a technical challenge, this case study offered a chance to create a solution that satisfied a variety of operational requirements while striking a balance between cost, complexity, and dependability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Analysis
&lt;/h2&gt;

&lt;p&gt;For this case study, we are primarily going to be looking at two possible solutions: AWS Elastic Disaster Recovery Service (DRS) and Veeam Backup and Replication Service. For each of the factors, we will evaluate how well each solution aligns with the client’s requirements, gradually narrowing down the options to determine the most suitable final solution. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Backup vs Disaster Recovery&lt;/strong&gt;&lt;br&gt;
Take a moment and think about it. Based on the details of the client’s requirements and existing setup, would you classify their need as a Backup solution or a Disaster Recovery (DR) solution?&lt;/p&gt;

&lt;p&gt;The client initially requested a DR solution, but as we’ve discussed in previous blogs, clients often use “backup” and “disaster recovery” interchangeably. So, let’s dig deeper.&lt;/p&gt;

&lt;p&gt;For starters, we know that in their on-premises environment, the client seemed to be operating a simple backup and restore system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They performed manual daily backups using native software.&lt;/li&gt;
&lt;li&gt;These backups were used to restore data to their systems in the event of a failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, certain aspects of their request strongly indicated a need for disaster recovery rather than a simple backup solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stringent RTO for the ERP System:&lt;/strong&gt; The client emphasized a Recovery Time Objective (RTO) of 1–2 hours for their critical ERP system. This requirement goes beyond what simple backups can deliver and aligns with DR strategies designed to minimize downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Recovery Options:&lt;/strong&gt; The customer wanted better recovery methods, especially for the ERP system. In earlier blogs, we noted that DR involves restoring not just data but also the entire system and application infrastructure to operational status—a significant distinction from backups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Recovery Considerations:&lt;/strong&gt; In the event of a recovery, the client made it clear that they were open to running the ERP system on the cloud. This is indicative of a DR strategy as it entails restoring and operating crucial workloads in a different location.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on Business Continuity:&lt;/strong&gt; The ERP’s critical nature indicates that the client is likely prioritizing seamless business continuity, a hallmark of DR solutions. With its manual procedures and extended restoration times, a backup system by itself would not suffice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailored Solutions for Other Systems:&lt;/strong&gt; While the ERP has stringent recovery metrics, the client’s more flexible RTO/RPO requirements for other systems suggest they are looking for a solution that balances DR for critical workloads with backup solutions for less critical ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s compare AWS Elastic Disaster Recovery Service (DRS) and Veeam Backup and Replication Service to assess their suitability for meeting the client’s DR requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3sqsgq4ns64z9pqzteg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3sqsgq4ns64z9pqzteg.png" alt="Image description" width="595" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, both Veeam and DRS align well with the client’s needs, each excelling in different areas. DRS offers fast and efficient failover capabilities, making it an excellent choice for the ERP system’s stringent recovery requirements. Meanwhile, Veeam delivers a robust backup solution, ensuring reliable and comprehensive backups for all three systems. Furthermore, Veeam also supports recovery, adding versatility to its functionality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Scheduling and Automation&lt;/strong&gt;&lt;br&gt;
Considering the client’s request for a comprehensive backup process and enhanced recovery mechanisms, what level of scheduling and automation do you think would best suit their needs?&lt;br&gt;
In their on-premises setup, the client relies on manual daily backups, with the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backups are initiated and managed manually.&lt;/li&gt;
&lt;li&gt;The restore process is also manual and prone to errors, with instances of incomplete backups and unsuccessful recovery attempts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This manual approach introduces inefficiencies and increases the likelihood of human error, both of which are particularly problematic for critical systems like their ERP application.&lt;/p&gt;

&lt;p&gt;Several factors in the client’s requirements strongly indicate the need for automation in their backup and recovery processes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Backup Requirement:&lt;/strong&gt; The client’s stated desire for a more comprehensive solution implies they need a system that goes beyond basic backups, incorporating robust policies for retention, versioning, and automated execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on RTO/RPO Metrics:&lt;/strong&gt; Automation directly supports achieving the low RTO for their ERP system by streamlining recovery steps and reducing delays caused by manual processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desire for Enhanced Recovery:&lt;/strong&gt; The issues faced in their manual restore process (incomplete backups, failed restores) further highlight the necessity of an automated recovery system that removes guesswork and error.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s compare AWS Elastic Disaster Recovery Service (DRS) and Veeam Backup and Replication Service to assess their suitability for meeting the client’s scheduling and automation requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvk1kr2bzibhvbfu5re2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvk1kr2bzibhvbfu5re2.png" alt="Image description" width="748" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, we see that while DRS does provide automation for data replication and failover, it has limited flexibility due to its ‘&lt;strong&gt;always-on&lt;/strong&gt;’ replication model. Veeam, on the other hand, is more customizable, providing high flexibility in tailoring backup processes as well as automating backup and recovery.&lt;/p&gt;

&lt;p&gt;In this part of our blog series, we’ve taken a deep dive into the Backup vs Disaster Recovery factor and explored the importance of Scheduling and Automation when evaluating solutions. By applying these factors to the case study, we’ve not only clarified the client’s requirements but also laid the groundwork for comparing two potential solutions: AWS Elastic Disaster Recovery (DRS) and Veeam Backup and Replication.&lt;/p&gt;

&lt;p&gt;The analysis so far shows us that while DRS excels in automation and seamless disaster recovery, Veeam shines in customizable scheduling and robust backup capabilities. But the decision-making process doesn’t end here.&lt;/p&gt;

&lt;p&gt;In the following blogs, we’ll continue to dissect the remaining factors. &lt;br&gt;
So, stay tuned as we work toward identifying the most effective solution for the client’s needs. &lt;/p&gt;

&lt;p&gt;Which solution do you think is pulling ahead so far – DRS or Veeam? Take your pick and don’t forget to check back for the next installment!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>disasterrecovery</category>
      <category>backup</category>
    </item>
    <item>
      <title>Navigating Disaster Recovery in the Digital Age: Choosing the Right Approach – Part 3</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 14 Jan 2025 19:14:04 +0000</pubDate>
      <link>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-3-3gbh</link>
      <guid>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-3-3gbh</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-1-4cn4"&gt;first installment&lt;/a&gt; of this blog series, we introduced the concept of disaster recovery (DR) and highlighted six key factors to consider when designing a robust Backup/DR solution. These factors serve as a guide for architects to evaluate and tailor solutions that align with a client’s unique needs and objectives.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-2-5423"&gt;second part&lt;/a&gt;, we took a closer look at the first two factors: the critical distinction between Backups and DR, and the choice between AWS-native and Third-party solutions. These foundational considerations set the stage for understanding the broader landscape of options and how they align with different use cases.&lt;/p&gt;

&lt;p&gt;Now, in this third installment, we turn our attention to the remaining factors. Join us as we explore why these factors matter, how they impact the decision-making process, and the role they play in designing a solution that delivers resilience and reliability. Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheduling and Automation
&lt;/h2&gt;

&lt;p&gt;One of the critical factors to consider when designing a backup/DR solution is the level of control and automation the client needs over the backup process. Scheduling and automation capabilities vary widely between solutions, and understanding the client’s expectations is key to selecting the right tool.&lt;/p&gt;

&lt;p&gt;Some solutions offer flexible scheduling options, giving clients the ability to define tailored backup policies based on their specific needs. For example, &lt;a href="https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html" rel="noopener noreferrer"&gt;AWS Backup&lt;/a&gt; allows users to create custom backup plans where they can specify the frequency (e.g., daily, weekly, monthly), retention periods, and assign different schedules to various resources. Once configured, these backups occur automatically according to the set policies, requiring minimal ongoing management from the client.&lt;/p&gt;
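&lt;p&gt;As a rough sketch, such a backup plan can be expressed as a plan document in the shape the AWS Backup &lt;code&gt;CreateBackupPlan&lt;/code&gt; API expects. The plan name, vault name, schedule, and retention period below are illustrative values for this example, not recommendations:&lt;/p&gt;

```python
# Sketch of an AWS Backup plan document (illustrative values).
# The structure follows the CreateBackupPlan API; the plan name,
# vault name, schedule, and retention are assumptions for this example.
backup_plan = {
    "BackupPlanName": "erp-daily-plan",          # hypothetical name
    "Rules": [
        {
            "RuleName": "DailyBackups",
            "TargetBackupVaultName": "Default",  # the default vault
            # Run every day at 02:00 UTC (AWS Backup cron syntax:
            # minute hour day-of-month month day-of-week year)
            "ScheduleExpression": "cron(0 2 * * ? *)",
            "StartWindowMinutes": 60,
            # Retention: delete recovery points after 35 days
            "Lifecycle": {"DeleteAfterDays": 35},
        }
    ],
}

# With boto3 and valid credentials, the plan could be registered via:
#   import boto3
#   boto3.client("backup").create_backup_plan(BackupPlan=backup_plan)
print(backup_plan["Rules"][0]["ScheduleExpression"])
```

&lt;p&gt;Once a plan like this is in place, backups fire on the cron schedule with no manual intervention, which is exactly the hands-off behaviour described above.&lt;/p&gt;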

&lt;p&gt;In contrast, some solutions provide fixed or limited scheduling options that may not offer the same level of customization. For instance, &lt;a href="https://docs.aws.amazon.com/drs/latest/userguide/what-is-drs.html" rel="noopener noreferrer"&gt;Elastic Disaster Recovery Service&lt;/a&gt; continuously performs block-level replication of source server volumes. While this ensures real-time data protection, it doesn’t provide the client with the ability to set specific backup schedules or retention policies, as the process is designed to operate continuously without manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters&lt;/strong&gt;&lt;br&gt;
Failing to address scheduling and automation needs during the design phase can lead to operational inefficiencies or, worse, missed recovery points. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a solution lacks the ability to automate backups during off-peak hours, it might interfere with production workloads.&lt;/li&gt;
&lt;li&gt;Limited retention options could result in insufficient data points for recovery during audits or post-disaster analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By understanding how a client wants to manage the backup process—whether they need granular control or prefer a hands-off approach—you can select a solution that not only meets their operational requirements but also positions them for long-term success.&lt;/p&gt;

&lt;h2&gt;
  
  
  Physical vs Virtual Servers
&lt;/h2&gt;

&lt;p&gt;A client’s existing infrastructure plays a significant role in determining the appropriate solution. Whether that landscape consists of physical or virtualized servers is another key factor in choosing the most fitting Backup/DR option.&lt;/p&gt;

&lt;p&gt;Some tools are versatile enough to cater to both physical and virtual environments, providing flexibility for hybrid infrastructures. For example, &lt;a href="https://docs.aws.amazon.com/drs/latest/userguide/what-is-drs.html" rel="noopener noreferrer"&gt;Elastic Disaster Recovery Service&lt;/a&gt; supports both physical and virtual servers, making it a strong candidate for clients with mixed environments. &lt;/p&gt;

&lt;p&gt;Other tools are designed specifically for virtualized environments, limiting their applicability for clients with physical infrastructure. For instance, &lt;a href="https://aws.amazon.com/storagegateway/#:~:text=AWS%20Storage%20Gateway%20gives%20your,Private%20Cloud%20(Amazon%20VPC)." rel="noopener noreferrer"&gt;AWS Storage Gateway&lt;/a&gt; is optimized for virtualized environments and works with platforms like VMware ESXi, Hyper-V, and KVM. Similarly, &lt;a href="https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html" rel="noopener noreferrer"&gt;AWS Backup&lt;/a&gt;, when used to back up an on-premises environment, requires the environment to be a VMware setup (specifically VMware ESXi).&lt;/p&gt;

&lt;p&gt;Third-party solutions often provide extensive compatibility, making them suitable for clients with diverse setups. For example, &lt;a href="https://www.acronis.com/en-us/" rel="noopener noreferrer"&gt;Acronis&lt;/a&gt; supports physical, virtual, and cloud environments, offering a one-size-fits-all approach for hybrid infrastructures. &lt;a href="https://www.arcserve.com/" rel="noopener noreferrer"&gt;Arcserve&lt;/a&gt; has support for VMware ESX/vSphere, Microsoft Hyper-V, Citrix XenServer, and Red Hat Enterprise Virtualization (RHEV) environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it Matters&lt;/strong&gt;&lt;br&gt;
Selecting a tool that doesn’t align with the client’s current setup can lead to inefficiencies, increased costs, or even the inability to implement a functional solution. By thoroughly understanding the client’s environment and matching it to the capabilities of the solution, you can ensure compatibility, seamless integration, and a tailored approach that addresses both current and future requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  RPO/RTO Requirements
&lt;/h2&gt;

&lt;p&gt;Recovery Time Objective (RTO) dictates how quickly systems need to be restored after a failure, while Recovery Point Objective (RPO) dictates how much data loss is acceptable. AWS offers a range of DR strategies to meet varying RTO/RPO requirements, each with its own implementation complexity and cost considerations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup and Restore is best suited for low-priority use cases where the client can tolerate longer recovery times. With this strategy, backups are taken periodically, and in the event of failure, systems are restored from these backups. The RTO and RPO for Backup and Restore can be quite high (e.g., several hours), making it suitable for cases where the recovery window is not critical. This strategy typically offers the lowest cost but is less suited for mission-critical applications.&lt;/li&gt;
&lt;li&gt;Pilot Light is ideal for environments that require a moderate RPO/RTO (in the range of tens of minutes). With Pilot Light, a minimal version of the application runs on AWS at all times. In the event of a failure, the necessary resources are quickly spun up to restore full functionality. This strategy ensures faster recovery than Backup and Restore but still allows for some downtime, which makes it a cost-effective option for many organizations.&lt;/li&gt;
&lt;li&gt;Warm Standby takes the Pilot Light concept further by keeping a scaled-down version of the entire environment always running. This ensures much faster recovery, with RTO/RPO in the range of minutes. The environment is pre-configured, so failover happens quickly, and systems can be rapidly scaled up in the event of a disaster. Warm Standby is a good middle ground for clients who require fast recovery but don’t always need a fully active system.&lt;/li&gt;
&lt;li&gt;Active/Active is the most complex and costly solution, designed for scenarios that demand zero downtime and real-time backups. In an Active/Active setup, systems are fully mirrored across AWS and on-premises (or across multiple AWS regions). This allows for immediate failover with zero disruption to service. The RTO and RPO are close to zero, but this approach incurs the highest costs due to the need for continuous synchronization and infrastructure running at full capacity.&lt;/li&gt;
&lt;/ul&gt;
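&lt;p&gt;The trade-off behind these four strategies boils down to a simple selection rule: pick the cheapest strategy whose typical recovery time still meets the target RTO. The minute figures below are rough illustrative ranges for this sketch, not AWS SLAs:&lt;/p&gt;

```python
# Toy selector for the four DR strategies above: given a target RTO,
# return the least costly strategy whose typical recovery time fits.
# The minute figures are rough, illustrative ranges, not AWS SLAs.
STRATEGIES = [
    # (name, typical max RTO in minutes, relative cost rank)
    ("Backup and Restore", 24 * 60, 1),  # hours; cheapest
    ("Pilot Light",        40,      2),  # tens of minutes
    ("Warm Standby",       10,      3),  # minutes
    ("Active/Active",      1,       4),  # near zero; most expensive
]

def pick_strategy(target_rto_minutes: float) -> str:
    """Return the cheapest strategy whose typical RTO meets the target."""
    for name, rto, _cost in STRATEGIES:  # ordered cheapest first
        if rto <= target_rto_minutes:
            return name
    return "Active/Active"  # only option left for near-zero RTO

print(pick_strategy(120))  # -> "Pilot Light" for a two-hour RTO target
```

&lt;p&gt;The same logic applies when evaluating real tools: tighten the RTO target and the viable (and more expensive) options narrow quickly.&lt;/p&gt;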

&lt;p&gt;&lt;strong&gt;Why it Matters&lt;/strong&gt;&lt;br&gt;
Different tools align with varying RPO/RTO requirements. For instance, AWS Elastic Disaster Recovery is ideal for scenarios requiring low RPO/RTO, such as Pilot Light or Warm Standby strategies, as it ensures continuous block-level replication of source servers. AWS Backup, on the other hand, is better suited for Backup and Restore use cases, offering flexible backup schedules but longer recovery times. Third-party solutions like &lt;a href="https://helpcenter.veeam.com/docs/vbaws/guide/welcome.html" rel="noopener noreferrer"&gt;Veeam&lt;/a&gt; and &lt;a href="https://www.zerto.com/resources/a-to-zerto/backup-and-recovery/" rel="noopener noreferrer"&gt;Zerto&lt;/a&gt; provide robust options for Pilot Light and Warm Standby configurations, often including advanced features such as automated failover and failback to support tighter RPO/RTO objectives.&lt;/p&gt;

&lt;h2&gt;
  
  
  On Premises vs Cloud Restores
&lt;/h2&gt;

&lt;p&gt;The restore process is a critical factor that differs from tool to tool, and the complexity of restoring data to different environments—whether on-premises or in the cloud—varies as well. This difference in complexity, cost, and ease of restore should play a significant role in selecting the right tool for the job.&lt;/p&gt;

&lt;p&gt;For example, when backing up data to AWS and later needing to restore it back to an on-premises environment, organizations must consider data transfer costs. Moving large amounts of data from AWS back to the local environment can incur significant bandwidth costs, depending on the amount of data being restored. Additionally, the time required for restoring the data also becomes a key factor, particularly if the transfer involves multiple terabytes or requires the use of slower mediums. This restoration process may introduce delays, which is a crucial consideration for businesses with stringent recovery time objectives (RTOs).&lt;/p&gt;
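&lt;p&gt;A quick back-of-envelope calculation makes these transfer considerations concrete. The egress rate used below is an assumed ballpark figure; actual AWS data-transfer-out pricing varies by region and volume tier:&lt;/p&gt;

```python
# Back-of-envelope estimate for restoring data from AWS to on-premises:
# transfer time at a given link speed plus data-transfer-out charges.
# The $0.09/GB egress rate is an assumed ballpark; real AWS pricing
# varies by region and volume tier.
def restore_estimate(data_gb: float, link_mbps: float,
                     egress_per_gb: float = 0.09):
    """Return (hours to transfer, egress cost in USD)."""
    megabits = data_gb * 8 * 1000          # GB -> megabits (decimal units)
    hours = megabits / link_mbps / 3600    # seconds -> hours
    return hours, data_gb * egress_per_gb

# Restoring 5 TB over a 500 Mbps link:
hours, cost = restore_estimate(5000, 500)
print(f"~{hours:.1f} h, ~${cost:.0f} egress")  # ~22.2 h, ~$450 egress
```

&lt;p&gt;Even at a generous 500 Mbps, a 5 TB restore takes the better part of a day before any verification work begins, which is why these numbers belong in the RTO conversation from the start.&lt;/p&gt;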

&lt;p&gt;On the other hand, restoring data within AWS presents different challenges. While the cost of transferring data within AWS itself is usually lower than moving data from AWS to an on-premises location, you still need to think about the recovery resources that need to be launched. This includes creating EC2 instances, setting up databases, or even configuring network access to ensure users can interact with the recovered applications and data. Furthermore, if the goal is to continue operations entirely within AWS, you'll need to ensure proper connectivity between the cloud-based recovery resources and any on-premises systems that need to interact with them. &lt;/p&gt;

&lt;p&gt;Different tools provide varying levels of support for cloud vs on-premises restores. Some tools offer seamless, automated restores to cloud environments, while others focus more on on-premises environments and might lack cloud-native features or optimizations. For instance, AWS Backup provides strong cloud recovery capabilities but would require additional steps and consideration when restoring back to an on-premises environment. Veeam Backup &amp;amp; Replication, on the other hand, offers more flexibility, supporting restores both to AWS and on-premises environments with robust options for data migration and failover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it Matters&lt;/strong&gt;&lt;br&gt;
The complexity of the restore process and the associated costs should be factored into the decision-making process when selecting the right disaster recovery tool. If quick restoration to an on-premises environment is required, the tool must support efficient data recovery methods, while tools designed for cloud restores should account for the setup and management of cloud infrastructure during the recovery. &lt;br&gt;
Understanding the nuances of each option—whether considering the cost and complexity of cloud vs on-premises restores—will ensure that the solution is tailored to the client’s operational needs and recovery objectives.&lt;/p&gt;

&lt;p&gt;In this post, we’ve explored the remaining key factors to consider when designing a disaster recovery and backup solution, from understanding the complexities of RTO/RPO to weighing the differences between on-premises and cloud restores. Each of these factors plays a crucial role in building a solution that aligns with both technical requirements and business needs.&lt;/p&gt;

&lt;p&gt;In our next installment, we’ll return to the case study and apply these factors to analyze the client’s specific requirements, ultimately crafting a tailored solution for their disaster recovery and backup needs. Stay tuned as we turn theory into practice and bring the solution to life.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to see how the puzzle pieces fit together? Let’s dive into the final design in the next blog—don’t miss it!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>backup</category>
      <category>disasterrecovery</category>
    </item>
  </channel>
</rss>
