<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Zippy Wachira</title>
    <description>The latest articles on Forem by Zippy Wachira (@yaddah).</description>
    <link>https://forem.com/yaddah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2663492%2F14938c31-24e7-496a-84bc-701bbf18aac7.jpg</url>
      <title>Forem: Zippy Wachira</title>
      <link>https://forem.com/yaddah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/yaddah"/>
    <language>en</language>
    <item>
      <title>Getting Started with AWS S3 Versioning</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 19:41:29 +0000</pubDate>
      <link>https://forem.com/yaddah/getting-started-with-aws-s3-versioning-5bbp</link>
      <guid>https://forem.com/yaddah/getting-started-with-aws-s3-versioning-5bbp</guid>
      <description>&lt;p&gt;One of the more interesting features of Amazon S3 buckets is bucket versioning. Once enabled for a bucket, this feature allows a user to store multiple versions of the same object within the same bucket. Since the feature enables the bucket to preserve, retrieve, and restore every version of every object, it is much easier recover from both unintended user actions, such as accidental deletions, and application failures.&lt;/p&gt;

&lt;p&gt;Uploading objects to S3 is quite a straightforward process. However, by default, if you upload an object with the same key name as an existing object, the original object is overwritten. Once you enable versioning, new objects are automatically assigned a version ID to distinguish them from the other objects.&lt;/p&gt;
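&lt;p&gt;As an illustration, versioning can be switched on from the AWS CLI (the bucket and object names below are hypothetical):&lt;/p&gt;

```shell
# Enable versioning on an existing bucket (bucket name is illustrative)
aws s3api put-bucket-versioning \
  --bucket my-example-bucket \
  --versioning-configuration Status=Enabled

# With versioning on, each upload of the same key gets its own version ID
aws s3api put-object --bucket my-example-bucket --key report.csv --body report.csv
aws s3api list-object-versions --bucket my-example-bucket --prefix report.csv
```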

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuveujwvhh9u95qyi7fwi.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuveujwvhh9u95qyi7fwi.webp" alt=" " width="272" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the bucket above, notice that both objects have the same key name but different version IDs. If another object is added, it is assigned its own unique Version ID.&lt;/p&gt;

&lt;p&gt;Simple enough, right?&lt;/p&gt;

&lt;p&gt;Now, the next question is: how do you interact with objects stored in a versioned bucket?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Adding an object to a versioned bucket.&lt;br&gt;
Adding an object to a bucket follows the normal process of uploading an object to the bucket. Once you upload the object, it is given a unique version ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrieving an object from a versioned bucket.&lt;br&gt;
A simple GET request will retrieve the most current version of the object.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtng339gnqlgdzvb9joc.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtng339gnqlgdzvb9joc.webp" alt=" " width="495" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To retrieve other versions, specify the version ID you want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F129tkjv1ukvufmqsc3qq.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F129tkjv1ukvufmqsc3qq.webp" alt=" " width="504" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Deleting an object from a versioned bucket.
Unlike objects in buckets that are not enabled for versioning, a simple DELETE request cannot permanently delete an object. When S3 receives the DELETE request, it places a &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/ManagingDelMarkers.html" rel="noopener noreferrer"&gt;delete marker&lt;/a&gt; in the bucket. All versions of the deleted object remain in the bucket, and the delete marker becomes the current version of the object. However, the delete marker does not have any data; any GET request for the object returns a 404 error. If you remove the marker, a GET request will again retrieve the most recent version of the object.&lt;/li&gt;
&lt;/ol&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm36hlt6inq170jh2b8a.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm36hlt6inq170jh2b8a.webp" alt=" " width="378" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above, notice that the DELETE request on the bucket has resulted in the creation of a delete marker in the bucket.&lt;/p&gt;

&lt;p&gt;To permanently delete versioned objects, you have to specify the object version with the DELETE request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0is9n0oqiwhwacsf77o.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0is9n0oqiwhwacsf77o.webp" alt=" " width="373" height="273"&gt;&lt;/a&gt;&lt;br&gt;
To delete a Delete Marker, you must specify its Version ID in the DELETE request.&lt;/p&gt;
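&lt;p&gt;Sketched with the CLI (names and version IDs are illustrative), the three delete scenarios look like this:&lt;/p&gt;

```shell
# 1. A plain DELETE only adds a delete marker; all versions remain
aws s3api delete-object --bucket my-example-bucket --key report.csv

# 2. Specifying a version ID permanently deletes that version
aws s3api delete-object --bucket my-example-bucket --key report.csv \
  --version-id 3sL4kqtJlcpXroDTDmJ.rmSpXd3dIbrHY

# 3. Deleting the delete marker (by its own version ID) restores the object
aws s3api delete-object --bucket my-example-bucket --key report.csv \
  --version-id MARKER-VERSION-ID
```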

&lt;p&gt;Interesting right? I certainly hope you think so too.😊&lt;/p&gt;

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>cloud</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Configuring Nginx Files and Directories</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 19:35:01 +0000</pubDate>
      <link>https://forem.com/yaddah/configuring-nginx-files-and-directories-261p</link>
      <guid>https://forem.com/yaddah/configuring-nginx-files-and-directories-261p</guid>
      <description>&lt;p&gt;Web Servers: the powerful force behind every search result on a browser. Truth is, most of us do not give a second thought to the mechanism that makes web requests successful as long as we get the desired results. But web servers are quite intriguing, and if you are as curious as I am, you might want to know a few tweaks to help you configure and troubleshoot a simple web server. So let's get right into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Webservers do&lt;/strong&gt;&lt;br&gt;
If you type mango in your browser at this moment, you will get a variety of results, each with a unique URL that will lead you to the main web page for the result. For example, from my end, the first three results for mango are: an online fashion store in Kenya, a Wikipedia link for mango the fruit, and a Twitter link to a page with the handle Mango. Chances are, the results of your search differ from mine. Search engines use information such as your search history, your location, language, and popular searches, among other things, to determine what results to display, hence the difference.&lt;/p&gt;

&lt;p&gt;When you click on one of the results, the web browser first runs a domain name resolution to obtain the IP address of the web server hosting the webpage. The browser then connects to the web server via either port 80 (HTTP) or port 443 (HTTPS) and requests the specified file. The web server uses the same protocol to respond to the browser, which then displays the result. If the page does not exist or an error occurs, the web server returns an error message. Seems simple enough, doesn&#8217;t it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx vs Apache&lt;/strong&gt;&lt;br&gt;
While there are many web servers out there, Nginx and Apache are two of the most commonly used; at least 50% of the world&#8217;s websites run on one of the two. But let's start with a short introduction. Apache came first and was the backbone of the early World Wide Web. It is an open-source, high-performing web server maintained by the Apache Software Foundation. It is a top choice for sysadmins because of its cross-platform support, flexibility, and simplicity. It is also one of the key components of the LAMP stack and is packaged with most Linux distros. &lt;br&gt;
Nginx (pronounced as ‘Engine X’ 🤦‍♀️), on the other hand, was released in 2004 by Igor Sysoev. Since it was developed to specifically address the limitations of the Apache server, it became very popular, even surpassing Apache.&lt;/p&gt;

&lt;p&gt;While there are several differences between the two web servers, the key difference lies in how the servers handle client requests. Apache uses a process-driven architecture, which means that each request is handled by a different process. A parent process receives the connection requests and creates a child process to handle them. When it receives a new request, it spawns a new child process to handle it. This results in heavy usage of server resources such as memory. Nginx, on the other hand, uses an event-driven architecture where a single process is used to handle multiple requests. Like Apache, a master process receives connections. However, each worker process can handle thousands of requests simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx Configuration Files&lt;/strong&gt;&lt;br&gt;
To configure Nginx, there are several key directories and files you need to be familiar with; these are the files you customize to serve your specific website. Depending on how you installed Nginx, the default configuration file is located at &lt;em&gt;/etc/nginx/nginx.conf&lt;/em&gt; (most distributions), at &lt;em&gt;/usr/local/etc/nginx/nginx.conf&lt;/em&gt;, or at &lt;em&gt;/usr/local/nginx/conf/nginx.conf&lt;/em&gt;. To find the path on your local machine, you can use either &lt;em&gt;nginx -t&lt;/em&gt; or &lt;em&gt;whereis nginx&lt;/em&gt;.&lt;/p&gt;
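&lt;p&gt;For example, on a machine with Nginx installed, the two commands behave roughly as follows:&lt;/p&gt;

```shell
# Validate the configuration and print the path of the file being tested
sudo nginx -t
# Typical output:
#   nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
#   nginx: configuration file /etc/nginx/nginx.conf test is successful

# Locate the nginx binary, configuration directory, and man pages
whereis nginx
```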

&lt;p&gt;Nginx configurations have two key concepts: directives, which are configuration options, and blocks (also called contexts), which are the groups in which directives are organized. To better understand this, consider the contents of the /etc/nginx/nginx.conf file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhkywohf4v0auhe8ewic.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhkywohf4v0auhe8ewic.webp" alt=" " width="774" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above snippet, user, worker_processes, pid, and include are directives in the main context. The main context is not contained within a block and holds details that affect all applications. Common directives set here are user and group details, the number of worker processes, and the file in which to save the PID of the main process. The events context sets global options for how Nginx handles connections. Recall that Nginx uses an event-driven model; the directives set here determine how worker processes handle connections.&lt;/p&gt;

&lt;p&gt;The HTTP context includes directives for handling web traffic. The directives set in this context are passed on to all websites that are served by the server. Common directives set in this block are access and error logs, error pages, TCP keep-alive settings, among others. Within the http context, you may notice an include directive. This directive tells Nginx where the configuration files for the website are located:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you installed from the official Nginx repository, the directive will point to &lt;em&gt;/etc/nginx/conf.d/&lt;/em&gt;. Each website you host on Nginx has its own configuration file within this directory, named along the lines of &lt;em&gt;/etc/nginx/conf.d/blue.com.conf&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you installed from the Debian repository, the directive points to &lt;em&gt;/etc/nginx/sites-enabled/&lt;/em&gt;. With this structure, individual configuration files are stored in the &lt;em&gt;/etc/nginx/sites-available&lt;/em&gt; directory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, there is the mail context, which sets directives for using Nginx as a mail proxy server. It provides connectivity to POP3 and IMAP mail servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document Root Directories&lt;/strong&gt;&lt;br&gt;
By default, Nginx serves documents out of the /var/www/html directory. To host multiple sites, you create separate root directories within the &lt;em&gt;/var/www/&lt;/em&gt; directory, e.g., to host two sites, you can create &lt;em&gt;/var/www/site1.com/html&lt;/em&gt; and &lt;em&gt;/var/www/site2.com/html&lt;/em&gt;. The index files for the sites are then placed within these directories, e.g., &lt;em&gt;/var/www/site1.com/html/index.html&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Server Blocks&lt;/strong&gt;&lt;br&gt;
Server blocks are the Nginx feature that allows you to host multiple websites on a single server. Each server block holds information about a website, such as the location of its document root, security policies, and the SSL certificates used. By default, Nginx has one server block called default: &lt;em&gt;/etc/nginx/sites-available/default&lt;/em&gt;. You create a server block for your website by adding a file named after your website to the directory, e.g., /etc/nginx/sites-available/site1.com. The structure of a server block is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xahc8akaxtdwodrp23w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xahc8akaxtdwodrp23w.webp" alt=" " width="432" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The listen directive tells Nginx the IP address and TCP port where requests are received. The server name identifies the domain being served, e.g., site1.com. When it receives a request, Nginx first matches it to the IP and port listed in the listen directive. If several server blocks share the same listen directive, it checks the Host header of the request and matches it to a server_name directive. If multiple blocks share the same IP, port, and server_name, it chooses the first one defined. Finally, if no server_name directive matches the Host header, it checks for a default_server parameter.&lt;/p&gt;

&lt;p&gt;Root specifies the path of the document root, and index specifies the name of the index file for the site. Location directives let you specify how Nginx responds to requests for resources within the server; the locations shown here are prefix strings matched against the request URI. Consider the following location configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom89eo0mo0k3mnpcqjtk.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom89eo0mo0k3mnpcqjtk.webp" alt=" " width="368" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, a request to &lt;em&gt;&lt;a href="http://site1.com/planet/blog" rel="noopener noreferrer"&gt;http://site1.com/planet/blog&lt;/a&gt;&lt;/em&gt; or to &lt;em&gt;&lt;a href="http://site1.com/planet/blog/events" rel="noopener noreferrer"&gt;http://site1.com/planet/blog/events&lt;/a&gt;&lt;/em&gt; will be served by location &lt;em&gt;/planet/blog/&lt;/em&gt; rather than location /planet. The try_files directive specifies the files and directories Nginx should check when a request for the specified location is received. The default location / above matches all requests.&lt;/p&gt;

&lt;p&gt;To enable a server block, the final step is to create a symbolic link to it in the /etc/nginx/sites-enabled directory. By default, Nginx checks the sites-enabled directory during startup. Creating symlinks to the configuration files in the sites-available directory allows you to manage your vhosts more easily: to disable a block, all you have to do is delete the symlink. You can optionally use the conf.d directory to manage your server blocks, but disabling a site then means deleting its configuration file outright. To manage multiple virtual hosts (websites), the sites-enabled approach is recommended; conf.d is better suited to configuration that is not tied to a single virtual host.&lt;/p&gt;
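&lt;p&gt;Putting the last step together (the site name is illustrative):&lt;/p&gt;

```shell
# Activate the site by linking its config into sites-enabled
sudo ln -s /etc/nginx/sites-available/site1.com /etc/nginx/sites-enabled/

# Validate the configuration, then reload without dropping connections
sudo nginx -t
sudo systemctl reload nginx

# To disable the site later, remove only the symlink
sudo rm /etc/nginx/sites-enabled/site1.com
```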

&lt;p&gt;There is definitely a lot more to configuring web servers than this! But I certainly hope this has provided you with a place to start 😊.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>devops</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Breaking Down AWS IAM</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 18:46:38 +0000</pubDate>
      <link>https://forem.com/yaddah/breaking-down-aws-iam-5hfn</link>
      <guid>https://forem.com/yaddah/breaking-down-aws-iam-5hfn</guid>
      <description>&lt;p&gt;AWS has a large variety of security offerings. Among these, however, none is as extensive as IAM. Besides integrating with all AWS services, IAM also enables fine-grained access control, which means that permissions can be managed up to an individual user’s or individual resource’s level. This is also accomplished by one of IAM’s best practices, which requires the assignment of permissions according to the &lt;a href="https://aws.amazon.com/blogs/security/techniques-for-writing-least-privilege-iam-policies/#:~:text=Least%20privilege%20is%20a%20principle,build%20securely%20in%20the%20cloud." rel="noopener noreferrer"&gt;principle of least privilege&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's start with the basic components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM identities&lt;/strong&gt;&lt;br&gt;
There are three types of IAM identities: users, groups, and roles.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;IAM users often represent people interacting with AWS, but can also represent a service. IAM users have long-term credentials, which come in the form of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Username and password for use with the management console&lt;/li&gt;
&lt;li&gt;Access keys for use with the AWS CLI and SDKs&lt;/li&gt;
&lt;li&gt;SSH keys for use with AWS CodeCommit&lt;/li&gt;
&lt;li&gt;Server certificates used to authenticate to some AWS services, such as websites&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IAM groups are collections of users who share the same permissions. They make it easier to assign and manage permissions for a large number of users. All users in a group automatically inherit the permissions attached to the group.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IAM roles are similar to IAM users, but with temporary credentials issued via AWS STS. A role can be assumed by users who are permitted to, letting them temporarily take on different permissions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;IAM Policies&lt;/strong&gt;&lt;br&gt;
IAM policies are IAM entities that are attached to IAM identities and define the kind of permissions that the identity has. When an identity makes a request, AWS evaluates the policies attached to the identity to determine if the actions the principal is requesting are allowed. The policies attached to a principal apply across all access methods: Console, CLI, and SDKs.&lt;/p&gt;
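&lt;p&gt;As an illustration, a minimal identity-based policy granting read-only access to a hypothetical bucket looks like this:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    }
  ]
}
```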

&lt;p&gt;Now, let us dig a bit deeper into IAM users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Federated User Access
&lt;/h2&gt;

&lt;p&gt;Federated users are users who are managed in an external directory and require access to AWS resources. Federation eliminates the need to recreate the users in your AWS account: you continue using your existing user directory and only assign users temporary permissions to accomplish the tasks they need on the AWS cloud. There are two approaches to federation: using AWS Single Sign-On (SSO) and using AWS IAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Using AWS SSO to Manage Federation&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/blogs/security/how-to-create-and-manage-users-within-aws-sso/" rel="noopener noreferrer"&gt;AWS Identity Center&lt;/a&gt; is a service that allows you to assign and manage access and user permissions across all your accounts in AWS Organizations. SSO also supports identity federation using Security Assertion Markup Language (SAML). SAML is an industry standard that enables the secure exchange of credentials between an identity provider (IdP) and a SAML consumer (service provider, SP). SSO works with an IdP of your choice, e.g., Azure Active Directory, and leverages IAM permissions and policies to manage federated access. With SSO, you assign permissions based on the group memberships in the IdP’s directory and control their access by modifying users and groups in the IdP. You can also use AWS SSO as an IdP to authenticate users to SSO-integrated applications, such as Salesforce, and also to authenticate users to the Management Console and CLI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Using AWS IAM to Manage Federation&lt;/strong&gt;&lt;br&gt;
You can use IAM Identity Providers to manage user identities outside your organization. With the IAM IdP, there is no need to create custom sign-in codes or manage user identities since the IdP does that for you. With IAM IdP, there are two types of federation supported:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Web Identity Federation&lt;/strong&gt;&lt;br&gt;
If you are writing an application to be used by a large number of users, e.g., a game that runs on mobile devices but stores data on Amazon S3, a web identity federation would be a good option. Web Identity Federation allows you to use IdPs such as Facebook, Google, or any other OpenID Connect (OIDC)-compatible IdP. Users receive an authentication token, which is then exchanged for temporary security credentials that map to an existing IAM role in AWS with the required permissions.&lt;/p&gt;

&lt;p&gt;Note: Rather than directly using Web Identity Federation, it is recommended to use Amazon Cognito for mobile apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) SAML 2.0-based Federation&lt;/strong&gt;&lt;br&gt;
IAM supports identity federation using SAML to enable single sign-on for users to log into the management console or call API operations. This type of federation has two main use cases. The first is to allow users within your organization to call AWS API operations, e.g., enabling users within your corporate IdP to back up data to an S3 bucket. The second is to allow users registered in a SAML 2.0-compatible IdP to sign in to the management console.&lt;/p&gt;

&lt;p&gt;And now let’s see what roles can do:&lt;/p&gt;

&lt;h2&gt;
  
  
  IAM Roles
&lt;/h2&gt;

&lt;p&gt;IAM roles can be used for a variety of cases, including the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grant IAM users in the same account as the role access to resources within the account&lt;/li&gt;
&lt;li&gt;Grant users access to resources in a different account&lt;/li&gt;
&lt;li&gt;Grant access to AWS resources to identities outside AWS&lt;/li&gt;
&lt;li&gt;Grant access to third parties, e.g., auditors&lt;/li&gt;
&lt;li&gt;Allow an AWS service to access other services on your behalf&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A role that a service assumes to perform actions within your account on your behalf is called a service role. If a role serves a specialized purpose for a service, it is referred to as a service-linked role. Users can assume a role from either the console or from the CLI/API by using the AssumeRole API.&lt;/p&gt;
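&lt;p&gt;From the CLI, assuming a role looks roughly like this (account ID, role, and session name are illustrative):&lt;/p&gt;

```shell
# Request temporary credentials for the role via AWS STS
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/ExampleRole \
  --role-session-name demo-session
# The response contains a temporary AccessKeyId, SecretAccessKey, and SessionToken
```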

&lt;h2&gt;
  
  
  Policies and Permissions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Identity-based policies are attached to IAM users, groups, and roles to define what these identities can do, on which resources, and under which circumstances. Identity-based policies are of two types:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;a) Managed policies are standalone policies. You can use AWS-managed policies or create your own customer-managed policies, which you maintain yourself. AWS-managed policies provide permissions for common tasks, such as granting administrative permissions, and are useful when starting out, before you are ready to write your own policies.&lt;/p&gt;

&lt;p&gt;b) Inline policies are embedded in an identity and provide a strict one-to-one relationship between the identity and the policy. They are applicable in scenarios where you want a policy to be attached to one specific identity and no other.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Resource-based policies are a type of inline policy that is attached to a resource. A common use case of these policies is in enabling cross-account access to a principal. IAM supports one type of resource-based policy called the trust policy, which is usually attached to an IAM role.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Permission boundaries are used to set the maximum permissions that an identity-based policy can grant to an entity. The effective permissions of the principal are the intersection of all the policies that affect the principal, such as identity-based policies, resource-based policies, session policies, and SCPs. Working with permission boundaries can be tricky if you don't understand how they interact with other types of policies. To see how AWS calculates effective permissions, see &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_boundaries.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service control policies (SCPs) are used to control permissions for an AWS organization or organizational unit (OU). They determine the maximum permissions for the accounts in the organization. A unique feature of SCPs is that they do not grant permissions; rather, they limit the permissions that resource-based and identity-based policies can grant to identities in the account. The effective permission for an identity is the intersection of what is allowed by the SCP and what is allowed by the identity-based and resource-based policies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access control lists (ACLs) control which principals in another account can access the resource the ACL is attached to. They cannot be used for principals in the same account as the resource.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Session policies are inline permissions policies that users pass to the session when they assume a role, or as a federated user, when using the CLI or API. Session policies can be passed using the AssumeRole, AssumeRoleWithSAML, and AssumeRoleWithWebIdentity API operations. Like SCPs, session policies do not grant permissions; they only limit the permissions for a session. The resulting session permissions are the intersection of the session policies and the resource-based and identity-based policies.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
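&lt;p&gt;To make the trust policy mentioned above concrete, here is a sketch of one that allows the EC2 service to assume a role:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```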

&lt;p&gt;As much as this seems like a lot, it is but the tip of the iceberg. Still, it's a good place to start; don&#8217;t you think?☺&lt;/p&gt;

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>security</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A Defense in Depth Approach to Cloud Security</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 18:19:31 +0000</pubDate>
      <link>https://forem.com/yaddah/a-defense-in-depth-approach-to-cloud-security-4078</link>
      <guid>https://forem.com/yaddah/a-defense-in-depth-approach-to-cloud-security-4078</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In an era marked by pervasive digital connectivity and evolving cyber threats, ensuring the security of sensitive information and critical infrastructure has become paramount. Traditional security approaches centered around perimeter defenses alone are no longer sufficient to withstand sophisticated attacks and safeguard against data breaches. Instead, organizations must adopt a multi-layered security strategy known as defense in depth.&lt;/p&gt;

&lt;p&gt;Defense in Depth (DiD) is a proactive and comprehensive security framework that employs multiple layers of defense mechanisms to protect against a wide range of threats. By diversifying security controls across networks, systems, applications, and data, defense in depth aims to create overlapping layers of protection that collectively strengthen the security posture of an organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principles of DiD
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;“It is not just multiple layers of controls to collectively mitigate one or more risks, but rather multiple layers of interlocking or inter-linked controls.” — Phil Venables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Controls at different points should be complementary, i.e., every preventative control should have a detective control at the same level and/or one level downstream in the architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Controls need to be continuously assessed to validate their correct deployment and operation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  DiD in Cloud
&lt;/h2&gt;

&lt;p&gt;DiD layers security measures across seven key domains, as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv67wit9tpualje5lfqu5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv67wit9tpualje5lfqu5.webp" alt=" " width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Physical Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What: Physical security measures applied at AWS data centers, e.g., biometrics, surveillance, etc.&lt;/p&gt;

&lt;p&gt;How: AWS Responsibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Perimeter Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What: Perimeter security is your first line of defense as a customer. It allows you to define who has access to your environment, how they access the environment, and what level of access to assign them. For example, an administrator will need full access to the environment, a project/team lead may need full access to the resources that pertain to their project as well as the ability to assign and revoke access for their team. On the other hand, a contractor may need temporary read/write access to only specific services, while an auditor would only require temporary read-only access for the period of the assessment.&lt;/p&gt;

&lt;p&gt;How:&lt;br&gt;
&lt;strong&gt;1. Preventative Controls&lt;/strong&gt;&lt;br&gt;
1.1. Who has Access?&lt;/p&gt;

&lt;p&gt;1.1.1. Internal staff, e.g., engineers, system administrators, and security team who build, manage, and govern the environment.&lt;/p&gt;

&lt;p&gt;1.1.2. Business stakeholders: review performance metrics, monitor resource usage, and make data-driven decisions related to business operations and strategy.&lt;/p&gt;

&lt;p&gt;1.1.3. Clients: consume services/applications deployed in the environment.&lt;/p&gt;

&lt;p&gt;1.1.4. Third-party contractors/vendors/partners: temporary access for project-related tasks.&lt;/p&gt;

&lt;p&gt;1.1.5. Legal consultants/advisors/auditors&lt;/p&gt;

&lt;p&gt;1.2. How do they access the environment?&lt;/p&gt;

&lt;p&gt;1.2.1. IAM User Accounts: long-term access for internal users/staff, e.g., engineers who are responsible for managing, configuring, deploying, and monitoring AWS infrastructure, applications, and services.&lt;/p&gt;

&lt;p&gt;1.2.2. Federation: for users with identities managed in an external IdP, e.g., AD, Facebook, Amazon, or Google.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using IAM Identity Center/SSO: consistent, synchronized access to multiple AWS accounts and applications.&lt;/li&gt;
&lt;li&gt;Cognito Identity Pools: identity federation for authenticated and unauthenticated users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.2.3. IAM Roles: temporary, short-lived access credentials that can be assumed by an authorized identity.&lt;/p&gt;

&lt;p&gt;1.2.4. Restricted Access Channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPNs: secure, encrypted communications channels between on-premises networks and AWS&lt;/li&gt;
&lt;li&gt;PrivateLink: private connectivity between VPCs, supported AWS services, and your on-premises networks without exposing your traffic to the public internet.&lt;/li&gt;
&lt;li&gt;Dedicated Audit Accounts: give security and compliance teams read and write access to all accounts for audits and security remediations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.3. What is their level of access?&lt;/p&gt;

&lt;p&gt;1.3.1. RBAC and Least Privilege: restrict access based on the identity’s roles/responsibilities within an organization.&lt;/p&gt;

&lt;p&gt;1.3.2. Policy-Based Access Control: assign access controls to resources based on IAM policies for users, groups, and roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Detective Controls&lt;/strong&gt;&lt;br&gt;
2.1. AWS CloudTrail: capture API activity and logs pertaining to access activity.&lt;/p&gt;

&lt;p&gt;2.2. MFA: restrict access to services to only users with MFA enabled.&lt;/p&gt;

&lt;p&gt;2.3. AWS Config: enforce compliance with IAM best practices, e.g., ensuring MFA is enabled or restricting the use of insecure IAM policies.&lt;/p&gt;
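&lt;p&gt;The MFA and policy-based controls described above are typically enforced with an IAM policy document. As a minimal sketch (the Sid and the NotAction exemptions are illustrative, not a prescribed AWS policy), a deny-unless-MFA statement built in Python might look like:&lt;/p&gt;

```python
import json

# Illustrative deny-unless-MFA policy: denies everything except basic
# session-setup actions when no MFA device was used to authenticate.
# The aws:MultiFactorAuthPresent condition key is a real IAM key; the
# Sid and exempted actions below are example choices only.
mfa_guard_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAllExceptSessionSetupWithoutMFA",
            "Effect": "Deny",
            "NotAction": [
                "iam:ChangePassword",
                "iam:GetUser",
                "sts:GetSessionToken",
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
            },
        }
    ],
}

print(json.dumps(mfa_guard_policy, indent=2))
```

&lt;p&gt;Attached to a user or group, a statement like this makes MFA a preventative control rather than a purely detective one.&lt;/p&gt;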

&lt;h2&gt;
  
  
  &lt;strong&gt;Network Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What: The network layer focuses on safeguarding the communication and data exchange between devices, systems, and services within the organization’s network, as well as controlling the flow of traffic entering and leaving the network.&lt;/p&gt;

&lt;p&gt;How:&lt;br&gt;
&lt;strong&gt;1. Preventative Controls&lt;/strong&gt;&lt;br&gt;
1.1. Network Access Control: who has access to network resources and how they access these resources.&lt;/p&gt;

&lt;p&gt;1.1.1. IAM: access management&lt;/p&gt;

&lt;p&gt;1.1.2. VPNs/Direct Connect: private, encrypted connectivity between on-premises environments and VPC resources.&lt;/p&gt;

&lt;p&gt;1.1.3. PrivateLink: private connectivity between VPCs and AWS services or endpoints without traversing the public internet.&lt;/p&gt;

&lt;p&gt;1.2. Network Segmentation and Isolation: partition the network into distinct zones based on security requirements, workloads, trust levels, and data sensitivity.&lt;/p&gt;

&lt;p&gt;1.2.1. VPCs, Subnets &amp;amp; AZs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each VPC is a logically isolated container for network resources.&lt;/li&gt;
&lt;li&gt;Subnets provide segmentation at the network level and allow you to isolate resources based on their function, security requirements, or access control policies.&lt;/li&gt;
&lt;li&gt;AZs provide redundant power, networking, and connectivity in an AWS Region and allow high availability, fault tolerance, and scalability for applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.2.2. NACLs, Security Groups, Route Tables: control how traffic is routed between various network segments.&lt;/p&gt;

&lt;p&gt;1.2.3. VPC Peering and Transit Gateway:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peering allows you to route traffic between VPCs privately using private IP addresses.&lt;/li&gt;
&lt;li&gt;TGW enables central management and connectivity scaling across multiple VPCs, accounts, and networks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.3. Traffic Filtering&lt;/p&gt;

&lt;p&gt;1.3.1. Network Firewall: a stateful, managed network firewall and intrusion detection and prevention service for your VPC.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass traffic only from known AWS service domains or IP address endpoints.&lt;/li&gt;
&lt;li&gt;Perform deep packet inspection on traffic entering or leaving your VPC.&lt;/li&gt;
&lt;li&gt;Use stateful protocol detection to filter protocols like HTTPS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.3.2. Web Application Firewall: a firewall that protects web applications hosted on AWS against common web-based attacks.&lt;/p&gt;

&lt;p&gt;1.3.3. Virtual Security Appliances, e.g., firewalls and IDS/IPS/DPI systems from vendors such as Cisco, Palo Alto, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Detective Controls&lt;/strong&gt;&lt;br&gt;
2.1. VPC Flow Logs: monitor the IP traffic going to and from a VPC, subnet, or network interface.&lt;/p&gt;

&lt;p&gt;2.2. Network Access Analyzer:&lt;br&gt;
Improve your network security posture by identifying unintended network access.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify that your production environment VPCs and development environment VPCs are isolated from one another.&lt;/li&gt;
&lt;li&gt;Verify that network paths are secured, e.g., that controls such as network firewalls and NAT gateways have been set up where necessary.&lt;/li&gt;
&lt;li&gt;Verify that your resources have network access only from a trusted IP address range, over specific ports, and protocols.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Host and Application Layers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What: The host layer focuses on security measures implemented on individual compute resources, e.g., EC2, ECS, EKS, RDS.&lt;/p&gt;

&lt;p&gt;How:&lt;br&gt;
&lt;strong&gt;1. Preventative Controls&lt;/strong&gt;&lt;br&gt;
1.1. Vulnerability Management:&lt;br&gt;
1.1.1. Regularly scan and patch compute resources, e.g., EC2, ECS, EKS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inspector: Automatically discovers workloads, such as Amazon EC2 instances, containers, and Lambda functions, and scans them for software vulnerabilities and unintended network exposure.&lt;/li&gt;
&lt;li&gt;Systems Manager: patch management for your compute resources&lt;/li&gt;
&lt;li&gt;Security Hub: collects security data across AWS accounts, AWS services, and supported third-party products and helps you analyze your security trends and identify the highest priority security issues.&lt;/li&gt;
&lt;li&gt;CodeGuru: scans code libraries and dependencies for issues and defects that are difficult for developers to find and offers suggestions for improving your Java and Python code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.1.2. Configure maintenance windows for AWS-managed resources, e.g., RDS.&lt;/p&gt;

&lt;p&gt;1.2. Reduced Attack Surface:&lt;/p&gt;

&lt;p&gt;1.2.1. Hardened Operating Systems, e.g., using CIS images for workload instances.&lt;/p&gt;

&lt;p&gt;1.2.2. EC2 Image Builder: ease creation of custom patched AMIs. When software updates become available, Image Builder automatically produces a new image without requiring users to manually initiate image builds.&lt;/p&gt;

&lt;p&gt;1.2.3. ECR Image scanning for identifying software vulnerabilities in your container images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Detective Controls&lt;/strong&gt;&lt;br&gt;
2.1. Config: monitor changes to application configurations, code deployments, etc., to detect unauthorized modifications or unusual application behavior.&lt;/p&gt;

&lt;p&gt;2.2. CloudWatch Logs: monitor system-level logs generated by EC2 instances, including authentication, application, and system logs, to detect security incidents, anomalous behavior, and operational issues.&lt;/p&gt;

&lt;p&gt;2.3. CloudTrail: detailed records of actions taken by users, roles, and services, including caller identity, the time of request, and the actions performed. You can track user activity, identify unauthorized access attempts, and investigate security incidents.&lt;/p&gt;

&lt;p&gt;2.4. GuardDuty: threat detection service that continuously monitors for malicious activity and unauthorized behavior across your environment. GuardDuty generates findings and alerts for suspicious activity, enabling you to investigate and remediate security incidents promptly.&lt;/p&gt;

&lt;p&gt;2.5. Inspector: analyzes the network, operating system, and application configurations to identify potential security issues.&lt;/p&gt;

&lt;p&gt;2.6. Third-party Security Solutions: third-party security solutions available on Marketplace that offer advanced threat detection, vulnerability management, and security analytics capabilities for environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Layer
&lt;/h2&gt;

&lt;p&gt;What: The data layer encompasses all aspects of data security, including data storage, transmission, access, and usage. It focuses on safeguarding sensitive information, such as customer data, intellectual property, financial records, and other confidential or regulated data, from unauthorized access, disclosure, alteration, or loss.&lt;/p&gt;

&lt;p&gt;How:&lt;br&gt;
&lt;strong&gt;1. Preventative Controls&lt;/strong&gt;&lt;br&gt;
1.1. Data Confidentiality: protecting data against unintentional, unlawful, or unauthorized access, disclosure, or theft.&lt;/p&gt;

&lt;p&gt;1.1.1. Data access: Define authorized principals in access policies, follow least privilege principles.&lt;/p&gt;

&lt;p&gt;1.1.2. Encryption&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encryption at rest using KMS/SSE&lt;/li&gt;
&lt;li&gt;Encryption in transit using SSL/TLS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.2. Data Integrity: ensuring the accuracy, completeness, consistency, and validity of data.&lt;br&gt;
1.2.1. Regular Backups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Backup: a fully managed backup service that makes it easy to centralize and automate the backing up of data across AWS services.&lt;/li&gt;
&lt;li&gt;S3 Versioning: preserve historical versions of objects, enabling you to recover from accidental deletions, modifications, or data corruption.&lt;/li&gt;
&lt;li&gt;Cross-region replication: ensuring data availability in multiple geographic regions and protecting against regional outages.&lt;/li&gt;
&lt;li&gt;Automated/Manual Snapshots: available for EBS, RDS, to allow for PIT recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1.2.2. Immutable Storage: services such as S3 Object Lock prevent objects from being deleted or modified for a specified retention period, protecting data integrity from accidental or malicious changes.&lt;/p&gt;

&lt;p&gt;1.2.3. Data Validation and verification: checksums, digital signatures, and cryptographic hashes to verify the integrity of data during transmission and storage.&lt;/p&gt;
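&lt;p&gt;The checksum idea above is easy to demonstrate. As a minimal sketch (the data and function name are illustrative), integrity verification means recomputing a hash on retrieval and comparing it to the value recorded at write time:&lt;/p&gt;

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    # A SHA-256 digest acts as a fingerprint of the content.
    return hashlib.sha256(data).hexdigest()

original = b"quarterly-financials-v3"
stored_checksum = sha256_hex(original)  # recorded when the object is written

# On retrieval, recompute and compare: any single-bit change breaks the match.
assert sha256_hex(b"quarterly-financials-v3") == stored_checksum
assert sha256_hex(b"quarterly-financials-v4") != stored_checksum
```

&lt;p&gt;S3 applies the same principle natively by letting you attach content checksums to objects and validating them on upload and download.&lt;/p&gt;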

&lt;p&gt;1.3. Data Availability: a measure of how often your data is available for use.&lt;/p&gt;

&lt;p&gt;1.3.1. Highly available and fault-tolerant architectures.&lt;/p&gt;

&lt;p&gt;1.3.2. Backups and DR&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Detective Controls&lt;/strong&gt;&lt;br&gt;
2.1. CloudWatch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alarms and health checks to monitor the health and availability of AWS resources hosting critical data.&lt;/li&gt;
&lt;li&gt;Automated alerts for performance degradation, service disruptions, or availability issues affecting data access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2.2. CloudTrail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trails to deliver log files to Amazon S3 and set up S3 event notifications or CloudWatch Events to trigger alerts for specific API activity or security events.&lt;/li&gt;
&lt;li&gt;Visibility into user activity and API calls, allowing you to detect and investigate unauthorized access attempts or security incidents affecting data confidentiality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2.3. GuardDuty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously monitor malicious activity and unauthorized behavior within your environment.&lt;/li&gt;
&lt;li&gt;Detects anomalies and security threats targeting data confidentiality, e.g., unauthorized access attempts, data exfiltration, or communication with known malicious IP addresses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2.4. Config:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Config rules to assess compliance with security best practices and detect misconfigurations affecting data integrity.&lt;/li&gt;
&lt;li&gt;Config rules to detect deviations from security best practices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Policies and Procedures&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Security objectives and standards: define the organization’s security objectives and best practices for cloud environments, i.e., data protection, access control, network security, incident response, and compliance requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Risk Management: define roles and responsibilities for identifying, assessing, and mitigating security risks associated with cloud deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access Control: define user roles, permissions, and authentication mechanisms, such as multi-factor authentication (MFA) and identity federation, to prevent unauthorized access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Incident Response and DR: define escalation paths, communication protocols, and remediation steps for containing and mitigating security incidents, restoring services, and minimizing the impact on business operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User Training: educate users about cloud security risks, best practices, and compliance requirements to help raise awareness of security policies, reinforce security behaviors, and empower personnel to recognize and report security incidents effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>architecture</category>
      <category>cloud</category>
      <category>cybersecurity</category>
      <category>security</category>
    </item>
    <item>
      <title>AWS CSI - Investigating Cloud Conundrums (CloudTrail)</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 10 Feb 2026 17:53:12 +0000</pubDate>
      <link>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudtrail-ed0</link>
      <guid>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudtrail-ed0</guid>
      <description>&lt;p&gt;Imagine this: you come home after a long day and find that your house is a complete mess. You have absolutely no clue what happened. Who created the mess in your house? How and when did they get access to your house? What did the intruder take? What did they displace/destroy in the house?&lt;/p&gt;

&lt;p&gt;If you’re lucky, you may have a security system in place, e.g., cameras that recorded the incident and that you can use to trace back and identify how and when the destruction happened. But if not, then it becomes a complete nightmare to try to figure out what happened. And this is exactly what it feels like to try to troubleshoot an event in AWS without CloudTrail.&lt;/p&gt;

&lt;p&gt;Think of CloudTrail as that indoor security camera that captures who came into your house, what they touched, changed, added, or even removed from the environment, and the exact day/time that all these activities occurred. So, just like reviewing the security footage will help you understand the break-in event, reviewing CloudTrail logs will provide the vital evidence you need to investigate and resolve any issues within your AWS environment.&lt;/p&gt;

&lt;p&gt;The good news is that a lot of people know exactly &lt;strong&gt;WHAT&lt;/strong&gt; CloudTrail is and what it is meant to be used for. The bad news is that not enough people know &lt;strong&gt;HOW&lt;/strong&gt; to use CloudTrail to derive useful insights from the captured logs. So, let’s dive into how exactly you use AWS CloudTrail to Investigate Cloud Conundrums.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Basics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CloudTrail is enabled by default for your account, which means you automatically have access to CloudTrail Event History.&lt;/li&gt;
&lt;li&gt;CloudTrail Event History provides an immutable record of events from the past 90 days. These are events captured from the Console, CLI, SDKs, and APIs.&lt;/li&gt;
&lt;li&gt;You are not charged for viewing CloudTrail Event History.&lt;/li&gt;
&lt;li&gt;CloudTrail events are regional.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Layout
&lt;/h2&gt;

&lt;p&gt;Let’s start with the standard layout of the CloudTrail Console:&lt;br&gt;
Note: For this piece, we are focusing on the Event History Section of CloudTrail on the Console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvd8e1ocu3b60yid2i7kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvd8e1ocu3b60yid2i7kp.png" alt=" " width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Display Customization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The settings icon at the far right [3] allows you to customize the fields that are displayed. Options are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sc7hmch8njrod0dd65m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sc7hmch8njrod0dd65m.jpg" alt=" " width="698" height="685"&gt;&lt;/a&gt;&lt;br&gt;
You can read about what each of the fields represents &lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events-console.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event Filtering&lt;/strong&gt;&lt;br&gt;
CloudTrail tracks every API call within your AWS account, resulting in a historical record that can grow significantly, even for small accounts with limited users. Of course, the more activity in the account, the larger the volume of events recorded. This can make investigating a singular event a nightmare, as you’d need to comb through hundreds of records to get to the record of interest.&lt;/p&gt;

&lt;p&gt;This is where fields [1] and [2] come in. [1] provides a range of parameters you can use to filter through your events, as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mdorjbfgi5ar8nzngv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mdorjbfgi5ar8nzngv3.png" alt=" " width="365" height="321"&gt;&lt;/a&gt;&lt;br&gt;
For instance, let’s say you want to investigate who created a new user account and when the user was created, you can apply a filter as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33hwqnq2evo59ak4k43b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33hwqnq2evo59ak4k43b.png" alt=" " width="604" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the case above, we apply a filter based on the Event name and search specifically for &lt;em&gt;&lt;strong&gt;CreateUser&lt;/strong&gt;&lt;/em&gt; events.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;User name&lt;/em&gt; will typically provide the name of the IAM principal that performed the &lt;em&gt;CreateUser&lt;/em&gt; action, i.e., an IAM user or a service. The &lt;em&gt;Resource Name&lt;/em&gt;, on the other hand, is the actual AWS resource that the action was performed on, i.e., for the case above, this would be the name of the user that was created.&lt;/p&gt;

&lt;p&gt;Another good example would be if you want to see what activities a specific user or service performed in the account in a given period, you can apply a filter such as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aotb51svbsanx81egzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aotb51svbsanx81egzr.png" alt=" " width="604" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: The date/time filter supports both a relative and an absolute range.&lt;/p&gt;
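&lt;p&gt;The same name-plus-time filtering shown above can be applied programmatically once you have events exported (e.g., via the CLI or a trail). A minimal sketch, using fabricated sample events with the console’s field names:&lt;/p&gt;

```python
from datetime import datetime, timezone

# Fabricated sample events for illustration; exported CloudTrail events
# carry an event name and timestamp like these.
events = [
    {"EventName": "CreateUser", "EventTime": datetime(2026, 2, 9, 10, 5, tzinfo=timezone.utc)},
    {"EventName": "DeleteUser", "EventTime": datetime(2026, 2, 9, 11, 0, tzinfo=timezone.utc)},
    {"EventName": "CreateUser", "EventTime": datetime(2026, 2, 10, 9, 30, tzinfo=timezone.utc)},
]

def filter_events(events, name, start, end):
    # Keep events matching the name whose timestamp falls within [start, end].
    return [
        e for e in events
        if e["EventName"] == name and e["EventTime"] >= start and end >= e["EventTime"]
    ]

window_start = datetime(2026, 2, 9, 0, 0, tzinfo=timezone.utc)
window_end = datetime(2026, 2, 9, 23, 59, tzinfo=timezone.utc)
print(filter_events(events, "CreateUser", window_start, window_end))
```

&lt;p&gt;This mirrors applying an Event name filter plus an absolute date/time range in the console.&lt;/p&gt;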

&lt;h2&gt;
  
  
  Event Types
&lt;/h2&gt;

&lt;p&gt;In my investigations, I found that I lean more towards filtering by Event name as I often find myself looking for a specific event from a specific service. The challenge I found with this, though, was in identifying what events are recorded for each service and what the event names are.&lt;/p&gt;

&lt;p&gt;For starters, it is important to note that CloudTrail Event History only supports &lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-concepts.html" rel="noopener noreferrer"&gt;management events&lt;/a&gt;. Note that to capture data events, you must create a trail and explicitly add each resource type for which you want to collect data plane activity. Second, AWS provides a comprehensive list of API Actions for each service through the &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/APIReference/OperationList-query-ec2.html" rel="noopener noreferrer"&gt;API Reference documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The API Reference documentation for a service provides descriptions, API request parameters, and the XML response for the service’s API actions. For example, from the EC2 API Reference documentation, we see the following example EC2 API Actions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuy7dpy9j5v8byb9wcv0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuy7dpy9j5v8byb9wcv0u.png" alt=" " width="604" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So now, from the CloudTrail Event History console, I can easily investigate an event such as when an Elastic IP Address was created using the below filter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4dtvphakxuu3k0wt9pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4dtvphakxuu3k0wt9pk.png" alt=" " width="604" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: to view the API Reference for an AWS service, simply search for the service name followed by “API Reference” in your favorite browser😊&lt;/p&gt;

&lt;h2&gt;
  
  
  Event Sources
&lt;/h2&gt;

&lt;p&gt;This filter comes in handy if you want to view a general record of all activity performed on a specific AWS service, e.g., S3.&lt;/p&gt;

&lt;p&gt;If you select the Event source filter on the CloudTrail history console, you can view a list of all the service names available in AWS that you can filter by.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yq80u8yks4ibuhpypoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yq80u8yks4ibuhpypoh.png" alt=" " width="415" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the event source takes the form of service-prefix.amazonaws.com, e.g., s3.amazonaws.com, so simply type the service in your search, and the full source name is autocompleted for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;Now let’s have a look at a few real-world use cases and how you’d typically use CloudTrail to investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Scenario 1: An administrator receives a notification for a sudden spike in failed login attempts for a critical IAM user account.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Troubleshooting:&lt;/strong&gt;&lt;br&gt;
The above scenario is a &lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-aws-console-sign-in-events.html" rel="noopener noreferrer"&gt;&lt;em&gt;ConsoleLogin&lt;/em&gt; Event&lt;/a&gt;. What we’d want to determine here is if this is a legitimate login attempt from the user or if it could potentially be a brute force attack on the account. So essentially, we are looking for details such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source IP Address: Multiple login attempts originating from a single, unexpected IP address can be a red flag.&lt;/li&gt;
&lt;li&gt;User Agent: this is information relating to the device used for the login attempt, e.g., browser and OS. This could reveal inconsistencies compared to the expected login patterns for the user (hint: check previous successful login attempts to identify the usual patterns).&lt;/li&gt;
&lt;li&gt;Timestamp: A rapid succession of failed login attempts within a short timeframe is a strong indicator of a brute-force attack.&lt;/li&gt;
&lt;li&gt;Number of failed login attempts: A high count, especially combined with the signals above, strengthens the case for a brute-force attempt.&lt;/li&gt;
&lt;/ul&gt;
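&lt;p&gt;Once you have the ConsoleLogin events exported, the indicators above can be tallied mechanically. A minimal sketch (the records are fabricated, but real ConsoleLogin events report the outcome in the same responseElements field):&lt;/p&gt;

```python
from collections import Counter

# Fabricated sample records; real ConsoleLogin events use these field names.
records = [
    {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.9",
     "responseElements": {"ConsoleLogin": "Failure"}},
    {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.9",
     "responseElements": {"ConsoleLogin": "Failure"}},
    {"eventName": "ConsoleLogin", "sourceIPAddress": "198.51.100.4",
     "responseElements": {"ConsoleLogin": "Success"}},
]

# Count failed logins per source IP.
failed = Counter(
    r["sourceIPAddress"]
    for r in records
    if r["eventName"] == "ConsoleLogin"
    and r["responseElements"].get("ConsoleLogin") == "Failure"
)

# Repeated failures from one IP within a short window suggest brute force;
# the threshold of 2 here is arbitrary for the sake of the example.
suspicious = [ip for ip, n in failed.items() if n >= 2]
print(suspicious)  # ['203.0.113.9']
```

&lt;p&gt;Pairing this count with the timestamps and user agents from the same records gives you the full picture described above.&lt;/p&gt;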

&lt;p&gt;To get the above information, we can proceed as follows:&lt;/p&gt;

&lt;p&gt;i. Filter&lt;br&gt;
We can apply a filter as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmrcirsr41pigtqatjkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmrcirsr41pigtqatjkr.png" alt=" " width="604" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ii. View Details&lt;br&gt;
From the list, select the event of interest. For this case, we’d identify this event using the User name field.&lt;/p&gt;

&lt;p&gt;Once you click on the event, you get access to a more detailed event record as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw77f9wl3wpihsjhy6mp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw77f9wl3wpihsjhy6mp.png" alt=" " width="357" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, you can pick out the relevant details and determine if this is a security event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Scenario 2: An application deployed on EC2 instances suddenly experiences slow loading times and high error rates. You suspect a recent configuration change might be to blame.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Troubleshooting:&lt;/strong&gt;&lt;br&gt;
i. Identify the timeframe: Narrow down when the application issues began. We need to apply a time filter for this period.&lt;/p&gt;

&lt;p&gt;ii. Filter Events: Depending on your architecture, you want to check for a couple of things, e.g.,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration changes made to the instance itself, i.e., instance resources, which could reveal why the slow loading times occur.&lt;/li&gt;
&lt;li&gt;Configuration changes made to the instance’s networking, e.g., changes to security groups, route tables, NACLs, and any applicable policies.&lt;/li&gt;
&lt;li&gt;Configuration changes made to other services that the application communicates with, e.g., if the application reads from or writes to an S3 bucket, it could be that the instance profile was changed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Event filters can then be applied as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyjrr47oya4m5nj0qaj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feyjrr47oya4m5nj0qaj2.png" alt=" " width="604" height="48"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check for configuration changes made to the security group.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wltnldo4ho059bnx1lj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wltnldo4ho059bnx1lj.png" alt=" " width="604" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check for configuration changes made to the EC2 instance.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntibt8oj9xojz201n900.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntibt8oj9xojz201n900.png" alt=" " width="604" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check for configuration changes made to the EC2 instance role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the CloudTrail console offers a convenient way to investigate AWS events, its filtering capabilities are currently limited to a single field at a time. This can be restrictive when you need to refine your search based on multiple criteria.&lt;/p&gt;

&lt;p&gt;For more comprehensive filtering, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;View the event history via the AWS CLI, e.g., with the aws cloudtrail lookup-events command.&lt;/li&gt;
&lt;li&gt;Download your CloudTrail events as a CSV file and leverage the powerful filtering and analysis features of tools like Excel.&lt;/li&gt;
&lt;li&gt;Save logs to an S3 bucket and use an Athena table for filtering.&lt;/li&gt;
&lt;/ul&gt;
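&lt;p&gt;For example, a minimal boto3 sketch (assuming AWS credentials and a region are configured) could fetch events for the timeframe via the LookupEvents API and then apply several filters client-side, something the console cannot do in one pass:&lt;/p&gt;

```python
def multi_filter(events, username=None, event_name=None, resource_name=None):
    """Apply several criteria at once, which the console's
    single-attribute filter cannot do."""
    def keep(e):
        if username and e.get("Username") != username:
            return False
        if event_name and e.get("EventName") != event_name:
            return False
        if resource_name and not any(
            r.get("ResourceName") == resource_name for r in e.get("Resources", [])
        ):
            return False
        return True
    return [e for e in events if keep(e)]

def fetch_ec2_events(start, end):
    """Pull one page of EC2 events for the timeframe (requires AWS
    credentials). The LookupEvents API accepts only one attribute
    filter per call, so further narrowing happens client-side."""
    import boto3
    ct = boto3.client("cloudtrail")
    page = ct.lookup_events(
        StartTime=start,
        EndTime=end,
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": "ec2.amazonaws.com"}
        ],
    )
    return page["Events"]
```

&lt;p&gt;Calling multi_filter on the fetched events with, say, both a username and an event name gives you the multi-criteria view the console lacks.&lt;/p&gt;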

&lt;p&gt;Stop Guessing, Start Tracking: Enable CloudTrail today and gain visibility into your AWS activity.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>monitoring</category>
      <category>security</category>
    </item>
    <item>
      <title>AWS CSI - Investigating Cloud Conundrums (CloudWatch - Part 3)</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 09 Feb 2026 20:43:44 +0000</pubDate>
      <link>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-3-2fp9</link>
      <guid>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-3-2fp9</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-1-40po"&gt;Part 1&lt;/a&gt; of this series, we looked at the basics of CloudWatch metrics and one example of how you can leverage CloudWatch metrics to troubleshoot performance issues on AWS. In &lt;a href="https://dev.to/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-2-5c5d"&gt;Part 2&lt;/a&gt;, we delved a bit deeper into some more examples and scenarios that allowed us to get a better understanding of how to leverage CloudWatch metrics.&lt;/p&gt;

&lt;p&gt;In this third piece, we are going to take a step back. Whether you’re an AWS novice or an expert, identifying WHICH metrics to look at when troubleshooting can pose a real challenge. Below, we will look at some strategies and best practices to help you identify the right metrics for troubleshooting performance issues on AWS.&lt;/p&gt;

&lt;p&gt;So, let’s dive in!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The bigger picture: Application Architecture&lt;/strong&gt;&lt;br&gt;
Before diving into metrics and troubleshooting performance issues in AWS, it’s essential to have a comprehensive understanding of your application’s architecture. Identify the key components, e.g., where is your compute layer hosted, where is your database, your storage, etc. Your architecture acts as a blueprint that outlines how different components (services, databases, etc.) interact to deliver your application’s functionality. By comprehending this blueprint, you can map potential performance issues to specific components and identify the relevant CloudWatch metrics for each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Identify the affected service&lt;/strong&gt;&lt;br&gt;
Is it a slow website, sluggish database queries, or high latency in your Lambda functions? When troubleshooting performance issues in your AWS environment, identifying the affected service is a crucial step. It’s like being a detective at a crime scene — knowing where to look is only half the battle. CloudWatch offers a vast array of metrics across different categories like compute, network, database, and more. Knowing the affected service allows you to filter out irrelevant categories and focus on the metrics most likely to pinpoint the issue. For example, CPU utilization for EC2 instances wouldn’t be relevant if you’re investigating slow database queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Identify Common Performance Issues and Related Metrics&lt;/strong&gt;&lt;br&gt;
Understanding common problems that can arise and the relevant CloudWatch metrics that can help diagnose these issues is crucial when looking into performance problems in your AWS environment. By understanding these bottlenecks and their corresponding CloudWatch metrics, you can swiftly determine possible causes and take corrective action. For example:&lt;/p&gt;

&lt;p&gt;For High Latency or Slow Performance, you need to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elastic Load Balancer (ELB): TargetResponseTime&lt;/li&gt;
&lt;li&gt;API Gateway: Latency&lt;/li&gt;
&lt;li&gt;EC2 Instances: CPUUtilization, DiskReadOps, DiskWriteOps, NetworkIn, NetworkOut&lt;/li&gt;
&lt;li&gt;RDS Instances: ReadLatency, WriteLatency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For High Error Rates, you need to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ELB: HTTPCode_ELB_4XX_Count, HTTPCode_ELB_5XX_Count&lt;/li&gt;
&lt;li&gt;API Gateway: 4XXError, 5XXError&lt;/li&gt;
&lt;li&gt;Lambda: Errors, Throttles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Traffic Spikes or Sudden Increase in Load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ELB/API Gateway: RequestCount&lt;/li&gt;
&lt;li&gt;EC2 Instances: NetworkIn, NetworkOut&lt;/li&gt;
&lt;li&gt;RDS Instances: DatabaseConnections, NetworkReceiveThroughput, NetworkTransmitThroughput&lt;/li&gt;
&lt;/ul&gt;
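&lt;p&gt;Once you know which metrics matter for a symptom, you can pull several of them in one call. Below is a sketch that builds the MetricDataQueries structure expected by CloudWatch’s GetMetricData API; the load balancer, instance, and database identifiers are made up for illustration:&lt;/p&gt;

```python
def build_queries(metrics, period=300, stat="Average"):
    """Turn (Namespace, MetricName, dimensions) triples into the
    MetricDataQueries structure expected by cloudwatch.get_metric_data()."""
    queries = []
    for i, (namespace, name, dims) in enumerate(metrics):
        queries.append({
            "Id": f"m{i}",  # ids must be unique and start with a letter
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": name,
                    "Dimensions": [{"Name": k, "Value": v} for k, v in dims.items()],
                },
                "Period": period,
                "Stat": stat,
            },
        })
    return queries

# One query set for the "high latency" checklist above
# (resource identifiers are placeholders).
latency_checks = build_queries([
    ("AWS/ApplicationELB", "TargetResponseTime", {"LoadBalancer": "app/my-alb/123"}),
    ("AWS/EC2", "CPUUtilization", {"InstanceId": "i-0123456789abcdef0"}),
    ("AWS/RDS", "ReadLatency", {"DBInstanceIdentifier": "my-db"}),
])
```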

&lt;p&gt;A grasp of your application architecture and common performance pitfalls empowers you to swiftly identify the right CloudWatch metrics for troubleshooting. Over time, this process becomes more intuitive, allowing you to troubleshoot efficiently and maintain optimal performance for your AWS environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Leverage Existing CloudWatch Documentation&lt;/strong&gt;&lt;br&gt;
CloudWatch documentation serves as your trusty roadmap when navigating the vast world of CloudWatch metrics. It helps you make sense of the data, find the right metrics for troubleshooting, and fix performance problems in your AWS environment. Here’s how CloudWatch documentation can assist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metric Descriptions provide a clear explanation of what each metric represents and how it’s measured.&lt;/li&gt;
&lt;li&gt;Dimensional Breakdown often details the dimensions associated with each metric. Understanding dimensions allows you to filter and analyze metrics with greater granularity.&lt;/li&gt;
&lt;li&gt;Best Practices: CloudWatch documentation outlines best practices for collecting, monitoring, and analyzing metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Follow the User Experience&lt;/strong&gt;&lt;br&gt;
In the realm of AWS performance troubleshooting, User Experience is regarded as the “Golden Signal”. This underscores the paramount importance of focusing on metrics that directly or indirectly impact how your users interact with your applications. Ultimately, the success of your applications hinges on user satisfaction. A slow website, unresponsive interface, or delayed responses can lead to frustration and user churn.&lt;/p&gt;

&lt;p&gt;CloudWatch offers various metrics that directly or indirectly impact user experience. Some key examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website Load Time is the time it takes for a web page to fully load on a user’s device. Slow load times can lead to user abandonment and negatively impact conversion rates.&lt;/li&gt;
&lt;li&gt;Database Query Latency is the time it takes for a database to respond to a query. High latency can result in sluggish application performance and delayed responses for users.&lt;/li&gt;
&lt;li&gt;API Response Times is the time it takes for your API to respond to a request. Slow API response times can hinder the overall performance of applications that rely on APIs.&lt;/li&gt;
&lt;li&gt;Application Error Rates are the frequency of errors encountered by users within your applications. Frequent errors can disrupt user workflows and damage trust in your services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Monitor Dependencies and Downstream Services&lt;/strong&gt;&lt;br&gt;
Application problems can ripple outwards. Dependencies (like databases and message queues) and downstream services (what your application interacts with) can significantly impact overall performance and reliability. For example, if an application is slow, check not just the EC2 metrics but also the RDS metrics if your application relies on a database.&lt;/p&gt;

&lt;p&gt;Key dependencies and downstream services to monitor include databases, message queues, caches, storage, networking components, etc. By keeping an eye on these components, you can quickly identify and address performance issues that may not be immediately apparent from primary application metrics alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Compare Against Historical Data&lt;/strong&gt;&lt;br&gt;
For troubleshooting and monitoring your AWS environment, comparing current metrics to historical data is key. This reveals trends and anomalies, helping you distinguish between normal fluctuations and potential issues requiring attention. Comparing metric data against historical data is important for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establishing a Baseline: Historical data helps establish normal operating conditions. Comparing current metrics against this baseline allows you to determine if the current performance is within expected ranges.&lt;/li&gt;
&lt;li&gt;Identifying Anomalies: Anomalies are data points that deviate significantly from the norm. By comparing current metrics with historical data, you can quickly spot unusual behaviour that might indicate issues.&lt;/li&gt;
&lt;li&gt;Understanding Trends: Trends show the general direction in which a metric is moving over time. Identifying trends helps you anticipate future behaviour, such as increasing resource usage that might eventually lead to performance bottlenecks if not addressed.&lt;/li&gt;
&lt;/ul&gt;
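&lt;p&gt;The baseline-and-anomaly idea can be sketched in a few lines. This is a simple z-score check, not CloudWatch’s built-in anomaly detection, and the numbers are invented for illustration:&lt;/p&gt;

```python
from statistics import mean, stdev

def flag_anomalies(history, current, threshold=3.0):
    """Flag current data points that deviate from the historical baseline
    by more than `threshold` standard deviations (a simple z-score check)."""
    baseline = mean(history)
    spread = stdev(history)
    return [
        (i, value) for i, value in enumerate(current)
        if spread and abs(value - baseline) / spread > threshold
    ]

# Example: a week of hourly CPU averages hovering around 40%...
history = [38, 41, 40, 39, 42, 40, 41, 39, 40, 38]
# ...then today's readings include a sudden jump.
today = [40, 41, 85, 39]
print(flag_anomalies(history, today))  # the 85% reading stands out
```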

&lt;p&gt;By following these strategies and best practices, you can transform CloudWatch metrics from a vast dataset into a powerful troubleshooting tool. With a targeted approach to metric selection, you can acquire more insight into the performance of your AWS environment, spot possible bottlenecks early on, and guarantee a seamless and effective user experience. Remember, effective troubleshooting is an ongoing process. With more AWS resources and CloudWatch expertise under your belt, you’ll become more adept at picking the pertinent metrics for every circumstance.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>AWS CSI - Investigating Cloud Conundrums (CloudWatch-Part 2)</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 09 Feb 2026 20:36:25 +0000</pubDate>
      <link>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-2-5c5d</link>
      <guid>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-2-5c5d</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-1-40po"&gt;Part 1&lt;/a&gt; of this series, we looked at the basics of CloudWatch metrics and one example of how you can leverage CloudWatch metrics to troubleshoot performance issues on AWS. In this second piece, we’ll dive a little deeper and investigate a few more examples.&lt;/p&gt;

&lt;p&gt;So, let’s dive in!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Scenario 2: You have a microservices-based application running on Amazon ECS (Elastic Container Service). Users have reported that the application becomes unresponsive after running for a few hours.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt;&lt;br&gt;
A memory leak is a type of resource leak that occurs when a program allocates memory but fails to release it back to the system after it is no longer needed. The result is that, over time, the program consumes more and more memory, leading to resource exhaustion. As memory becomes scarce, the application may slow down due to increased garbage collection activity or the need to swap memory to disk.&lt;/p&gt;

&lt;p&gt;Memory leaks typically cause a gradual increase in memory usage. The application may start normally but degrade over time as memory is exhausted. If the application becomes unresponsive after a consistent period, it suggests a pattern where memory consumption reaches a critical threshold, causing the failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation:&lt;/strong&gt;&lt;br&gt;
Occasionally, a memory allocation spike can cause a one-time spike in the amount of memory being used by a resource in your AWS environment. For an allocation spike, restarting the service will temporarily resolve the issue. However, if the problem recurs, it could be an indication that the underlying issue is a memory leak rather than a one-time allocation spike.&lt;/p&gt;

&lt;p&gt;For either case, you need to look at the &lt;strong&gt;&lt;em&gt;‘MemoryUtilization’&lt;/em&gt;&lt;/strong&gt; metrics. The &lt;strong&gt;&lt;em&gt;‘MemoryUtilization’&lt;/em&gt;&lt;/strong&gt; metric shows the percentage of memory that is used by tasks in the specific dimension. For statistics, you’d need to look at the average and maximum utilization over the period of interest.&lt;/p&gt;
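&lt;p&gt;The distinction between a leak and a one-time spike can be captured with a rough heuristic over the MemoryUtilization data points, sketched below (the threshold and sample values are illustrative):&lt;/p&gt;

```python
def looks_like_leak(samples, min_rise=5.0):
    """Distinguish a memory leak from a one-time spike: a leak shows a
    sustained upward trend, while a spike rises and then settles back.
    `samples` is a time-ordered list of MemoryUtilization percentages."""
    half = len(samples) // 2
    early = sum(samples[:half]) / half
    late = sum(samples[half:]) / (len(samples) - half)
    return late - early >= min_rise  # the later average keeps climbing

leak = [30, 35, 41, 47, 52, 58, 64, 71]    # steadily climbing
spike = [30, 31, 78, 33, 30, 32, 31, 30]   # jumps once, then recovers
```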

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Scenario 3: Your e-commerce website, hosted on Amazon EC2 instances behind an Application Load Balancer (ALB), is experiencing a sudden spike in traffic. Customers report slow loading times and intermittent outages. You suspect a Distributed Denial of Service (DDoS) attack.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt;&lt;br&gt;
A Distributed Denial of Service (DDoS) attack is a malicious attempt to disrupt the normal traffic of a targeted server, service, or network by overwhelming it with a flood of internet traffic. This flood typically originates from a network of compromised computers or devices, making it difficult to pinpoint and block the source. The sheer volume of illegitimate traffic can overload resources, making the website or service inaccessible to legitimate users. End users might encounter slow loading times, error messages, or complete outages.&lt;/p&gt;

&lt;p&gt;In the context of AWS, a DDoS attack can target various services such as EC2 instances, load balancers, or even the application running on AWS infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Investigation:&lt;/strong&gt;&lt;br&gt;
While a sudden spike in traffic can occur during legitimate events, e.g., sales or promotional campaigns, there are key patterns that can help identify a possible DDoS attack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Traffic Patterns:&lt;/strong&gt; While legitimate spikes may follow a more gradual increase and decrease in traffic, a DDoS attack will typically involve a sudden and sustained surge in traffic, often exceeding normal peak usage patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Source of Traffic:&lt;/strong&gt; The source of legitimate traffic can usually be traced back to a diverse set of users and locations. DDoS traffic on the other hand, might originate from a limited number of IP addresses or geographical locations, indicating a coordinated attack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Application Impact:&lt;/strong&gt; DDoS attacks usually target specific web applications or services. Legitimate traffic spikes might affect overall website performance but wouldn’t target specific applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Increased Error Rates:&lt;/strong&gt; Along with high traffic, you may observe an increase in 4xx (client error) and 5xx (server error) HTTP status codes, indicating that the backend servers are overwhelmed and unable to process the requests.&lt;/p&gt;

&lt;p&gt;Key metrics to monitor when investigating a possible DDoS attack include:&lt;br&gt;
&lt;strong&gt;1. Number of Requests Received:&lt;/strong&gt;&lt;br&gt;
If your application is fronted by an Application Load Balancer, then you need to look at the RequestCount metric. The RequestCount metric shows the number of requests processed over IPv4 and IPv6. A sudden and unusual spike in request count is a primary indicator of a potential DDoS attack. For API Gateway, this would be the Count metric.&lt;/p&gt;

&lt;p&gt;For the RequestCount metrics, the statistics of interest would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sum: the total number of requests over a period will help in understanding the overall traffic volume.&lt;/li&gt;
&lt;li&gt;Average: the average number of requests per second helps to identify spikes relative to normal traffic patterns.&lt;/li&gt;
&lt;li&gt;Maximum: the peak number of requests received in the given period is useful for identifying the highest load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Network Traffic&lt;/strong&gt;&lt;br&gt;
For the instances hosting the application, you need to check the &lt;strong&gt;&lt;em&gt;NetworkIn&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;NetworkOut&lt;/em&gt;&lt;/strong&gt; metrics. If these also show a sharp increase, it may be indicative of a DDoS attack.&lt;/p&gt;

&lt;p&gt;For network traffic metrics, we need to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sum: the total amount of data transferred in and out, respectively, which helps quantify the scale of traffic.&lt;/li&gt;
&lt;li&gt;Average: the average data transfer rate, useful for comparing against baseline traffic levels.&lt;/li&gt;
&lt;li&gt;Maximum: the peak data transfer rate, which can indicate periods of intense activity typical of a DDoS attack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. HTTP Error Rates&lt;/strong&gt;&lt;br&gt;
An increase in HTTP error rates can indicate that your servers are struggling to handle the incoming requests. To check the error rates, you can check the &lt;strong&gt;&lt;em&gt;HTTPCode_ELB_4XX_Count&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;HTTPCode_ELB_5XX_Count&lt;/em&gt;&lt;/strong&gt; metrics for your ALB, or &lt;strong&gt;&lt;em&gt;4XXError&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;5XXError&lt;/em&gt;&lt;/strong&gt; if using API Gateway.&lt;/p&gt;

&lt;p&gt;For HTTP Error metrics, we need to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sum: the total number of server and client errors over a period. A significant increase in server errors (5xx) can indicate that the backend is overwhelmed, while a rise in client errors (4xx) can point to a flood of malformed requests.&lt;/li&gt;
&lt;li&gt;Average: the average rate of errors, useful for comparing against normal error rates.&lt;/li&gt;
&lt;li&gt;Maximum: the peak error rate, which can indicate the most stressful/problematic periods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Target Response Time&lt;/strong&gt;&lt;br&gt;
The ALB’s &lt;strong&gt;&lt;em&gt;TargetResponseTime&lt;/em&gt;&lt;/strong&gt; metric shows the time elapsed, in seconds, after the request leaves the load balancer until the target starts to send the response headers. Increased response times can signal that your application is under strain.&lt;/p&gt;

&lt;p&gt;The key statistics to look at for this metric include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average: the average response time, helping to identify trends in performance degradation.&lt;/li&gt;
&lt;li&gt;Maximum: the longest response time recorded, which can indicate extreme cases of backend strain.&lt;/li&gt;
&lt;li&gt;P95 or P99: Percentile metrics show response times at the 95th or 99th percentile, useful for identifying the response times experienced by the top 5% or 1% of requests, which can be heavily affected during an attack.&lt;/li&gt;
&lt;/ul&gt;
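&lt;p&gt;Pulling these statistics together, here is a small sketch of how you might summarize RequestCount data points and compare the current peak against a normal baseline (all numbers invented for illustration):&lt;/p&gt;

```python
def traffic_summary(request_counts):
    """Summarize per-minute RequestCount data points the way CloudWatch's
    Sum / Average / Maximum statistics would."""
    return {
        "Sum": sum(request_counts),
        "Average": sum(request_counts) / len(request_counts),
        "Maximum": max(request_counts),
    }

def spike_ratio(current, baseline):
    """How many times above the normal peak the current peak sits; a
    large, sustained ratio is one of the DDoS signals described above."""
    return traffic_summary(current)["Maximum"] / traffic_summary(baseline)["Maximum"]

normal = [120, 130, 115, 140, 125]
attack = [120, 3400, 5200, 4900, 5100]  # sudden, sustained surge
```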

&lt;h2&gt;
  
  
  CloudWatch Statistics
&lt;/h2&gt;

&lt;p&gt;When it comes to trying to make sense of CloudWatch metrics, statistics can be a powerful ally. The Sum, Average, Minimum, and Maximum statistics are the most commonly used, but there are other powerful statistics that you can leverage. For example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Percentiles&lt;/strong&gt;&lt;br&gt;
Percentiles help you understand the relative standing of a value in a dataset, i.e., how a particular value compares to the rest of the data. For example, imagine you are in a race with 100 participants. If your finishing time beats 95 of the other runners, you are among the top 5: 95 runners are slower than you, and only 4 are faster. Your time sits at the 95th percentile (p95).&lt;/p&gt;

&lt;p&gt;Similarly, in CloudWatch, p95 would mean that 95 percent of the data within the specified period is lower than this value, and 5 percent of the data is higher than this value. Let’s say, for example, that you’re monitoring the latency (response time) of your game servers using CloudWatch. You have checked the average latency for the application, and it is 50ms. Is this good? Is this bad? The average latency would not be able to show you the entire picture as there could be a significant variation in individual player experiences.&lt;/p&gt;

&lt;p&gt;Let’s say instead that you filter the metric using the p90 statistic. This statistic will show the experience of most players. So, for example, if the p90 response time is 100 ms, this means that 90% of the requests were completed in 100 ms or less, and only 10% of the requests took longer than 100 ms. Similarly, if the p50 response time is 50 ms, it means that 50% of requests were completed in 50 ms or less.&lt;/p&gt;

&lt;p&gt;Percentiles help you understand the typical performance and identify outliers. For example, while the average (mean) response time might be 50 ms, the p90 being 100 ms indicates that some requests take significantly longer.&lt;/p&gt;
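&lt;p&gt;The idea behind a percentile can be shown with a tiny nearest-rank implementation (CloudWatch’s exact interpolation may differ):&lt;/p&gt;

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the value below which roughly p% of the
    data falls. CloudWatch's pNN statistics follow the same idea."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 latency samples: most around 50 ms, with a slow tail near 200 ms.
latencies = [50] * 90 + [200] * 10
print(percentile(latencies, 50))   # typical request
print(percentile(latencies, 95))   # tail experience
```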

&lt;p&gt;To understand more about CloudWatch statistics, view &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Statistics-definitions.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this article, we’ve explored several real-world scenarios where CloudWatch metrics empower you to investigate and troubleshoot performance issues within your AWS environment. But a crucial question remains: how do you identify the right metrics to look at for a specific issue?&lt;/p&gt;

&lt;p&gt;Well, worry not, help is on the way! In our next blog post, we’ll delve into practical strategies and best practices to guide you in selecting the most relevant CloudWatch metrics for troubleshooting various performance concerns in your AWS infrastructure. Stay tuned!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>monitoring</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AWS CSI -Investigating Cloud Conundrums (CloudWatch - Part 1)</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 09 Feb 2026 20:02:16 +0000</pubDate>
      <link>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-1-40po</link>
      <guid>https://forem.com/yaddah/aws-csi-investigating-cloud-conundrums-cloudwatch-part-1-40po</guid>
      <description>&lt;p&gt;If you’re anything like me, you absolutely hate going to the doctors. Unfortunately, (&lt;em&gt;and at least until we can make ourselves indestructible🤞&lt;/em&gt;), every so often, you will always find yourself in a doctor’s office. Now, for the doctor to accurately diagnose your illness and prescribe the right treatment, they need to first collect a range of vitals — your temperature, blood pressure, heart rate, and so on. These vital signs provide crucial insights into your health, and tracking them over time helps the doctor identify patterns, detect issues early, and understand the overall state of your body.&lt;/p&gt;

&lt;p&gt;Similarly, CloudWatch acts like your AWS environment’s diagnostic physician. It collects a comprehensive set of data points like system metrics (CPU usage, memory allocation, network latency) and logs (application errors, API calls, resource utilization) that serve as vital signs. By analyzing these metrics and logs, CloudWatch helps you diagnose the health of your application. An unexpected surge in CPU usage might point to inefficient code, while frequent errors in the logs could indicate configuration issues.&lt;/p&gt;

&lt;p&gt;In this blog, we will delve into CloudWatch metrics and explore how you can leverage these metrics to understand the performance of your AWS Services as well as detect potential issues. Whether it’s preventing a minor symptom from becoming a major outage or optimizing your resources for peak performance, CloudWatch is your go-to solution for maintaining the well-being of your cloud infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Basics
&lt;/h2&gt;

&lt;p&gt;Let’s start with a few important details about metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A metric is a quantitative measure of a system’s characteristic over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The majority of AWS services provide a set of free metrics under basic monitoring. However, to monitor a parameter that is not covered by the free metrics, you can enable detailed monitoring or set up custom metrics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metrics are collected as a set of time-ordered data points. The period over which data points are collected ranges from one second (for high-resolution metrics) up to one hour. The retention period of a metric depends on how frequently data points are published. See &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metrics exist only in the Region in which they are created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metrics are categorized into dimensions, i.e., you can monitor the CPU utilization of EC2 instances, RDS databases, ECS clusters, etc. However, when you want to view the CPU utilization for only one, or all, of your RDS databases, you’d view this under the ‘Across All Databases’ dimension or the ‘DBInstanceIdentifier’ dimension.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding CloudWatch Metrics
&lt;/h2&gt;

&lt;p&gt;The good news is that AWS maintains exhaustive documentation for the supported metrics for each service. Additionally, each metric is explained in detail, so it’s easy to understand exactly what it measures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For a list of services that publish their metrics to CloudWatch, see &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To understand the specific metrics that are supported for a particular service, search for ‘Available Metrics for’ followed by the service name, e.g., ‘Available Metrics for API Gateway’. For most services, this page will also display the namespaces and dimensions available for the service.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;E.g.,&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l2uqsjq0ij44wljkk0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l2uqsjq0ij44wljkk0h.png" alt=" " width="800" height="323"&gt;&lt;/a&gt;&lt;br&gt;
1: This is the name of the metric, i.e., the characteristic that is being measured.&lt;/p&gt;

&lt;p&gt;2: The description of the metric, i.e., what it is and what it measures. For some metrics, the description will also include other notable details, e.g., recommendations, when to use the metric, exceptions, etc.&lt;/p&gt;

&lt;p&gt;3: The unit of a metric is the scale of measurement of that metric. e.g., For EC2 instance metrics, the BurstBalance has the unit ‘Percent’. This tells you that the BurstBalance metric is measured as a percentage value. Units provide context and meaning to the raw numerical values you see, e.g., you could compare BurstBalance (percentage) with CPUUtilization (percentage) to see if high CPU usage is depleting your burstable credits.&lt;/p&gt;

&lt;p&gt;4: CloudWatch provides several statistics for a metric’s data points, e.g., sum, average, minimum, maximum, etc. See all available statistics here. Statistics are crucial to understanding a metric’s behaviour, e.g., the average helps to identify a baseline for the metric’s typical behaviour. Meaningful Statistics for a metric are the statistics that are considered the most useful for that metric.&lt;/p&gt;

&lt;p&gt;5: For RDS, some metrics are only available for a specific database engine. The ‘Applies to’ column indicates the database engine for which the metric can be collected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graphing Metrics
&lt;/h2&gt;

&lt;p&gt;Trying to understand what a set of data is trying to tell you purely by looking at rough numbers can leave you feeling foggy. Visuals, on the other hand, are like a lightbulb moment, illuminating complex ideas in a clear and memorable way. On CloudWatch, you can use graphs to view metrics over a period.&lt;/p&gt;

&lt;p&gt;Say, for example, you want to view the average write I/O operations on your EBS volume for a period. You can access the metric on the console as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa8pn5w4q0pej9h5tz2l.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa8pn5w4q0pej9h5tz2l.PNG" alt=" " width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1: You can use the time filter to granularize your search to a specific period. The custom option allows you to specify a custom period, e.g., view metrics over 3 weeks.&lt;/p&gt;

&lt;p&gt;2: The Actions/Options tabs allow you to customize your widget, i.e., specify how you want your data to be displayed. The Options tab provides more customization for your graph, e.g., labels to add to the axis, units, etc.&lt;/p&gt;

&lt;p&gt;3: The Graphed Metrics tab allows you to customize the graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe95b04morcw00yiawcjs.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe95b04morcw00yiawcjs.PNG" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can change the statistic being displayed, e.g., change from average to maximum, or view a sum. You can also change the period, which alters the data points on the graph, e.g., to view the maximum values at each hour, you can filter as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nyeaytgnvevir7vusve.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nyeaytgnvevir7vusve.PNG" alt=" " width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;Scenario 1: Your users are reporting that your web application is responding slowly. You need to determine the cause of the high latency and resolve it quickly.&lt;/p&gt;

&lt;p&gt;Resolution:&lt;/p&gt;

&lt;p&gt;There are 2 main reasons for slow response times in an application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Resource limitations:&lt;/strong&gt; When the resources assigned to the compute infrastructure where the application is running are insufficient to sustain the load. For example, using a small instance for a high-load application may result in CPU overload and memory bottlenecks. This can also occur if the database is overloaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Application Code Issues:&lt;/strong&gt; Poorly written code with logic flaws, e.g., code that does not properly release memory after use, can lead to memory depletion and slow performance.&lt;/p&gt;

&lt;p&gt;To check if the lag is a result of resource constraints, we can examine the compute service’s CPU utilization and disk I/O. So far, we have looked at how to access and view different metrics for a service. The next big question is: how do you interpret CloudWatch data and derive meaningful insights from it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;i. CPU Utilization&lt;/strong&gt;&lt;br&gt;
As previously mentioned, there are various statistics available to you for each metric. For this case, to determine if CPU Utilization is the reason for latency, we need to look at the following 3 statistics over the given period:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Average CPU Utilization:&lt;/strong&gt; This statistic helps in understanding the general load on your instance over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Maximum CPU Utilization:&lt;/strong&gt; This statistic shows the peak CPU usage within a specified period. It is useful to identify if there are any spikes that might correlate with periods of high latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- CPU Credit Balance&lt;/strong&gt; (only for burstable instances): If you’re using burstable instances (e.g., T2, T3 instances), running out of CPU credits can cause the instance to throttle and result in increased latency.&lt;/p&gt;

&lt;p&gt;An important thing to remember here is the unit used to measure the metric, which can be found in the service’s official documentation. CPU Utilization is measured as a percentage; thus, the output would look something like the below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxagz81t0n1snqr2y30h.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxagz81t0n1snqr2y30h.PNG" alt=" " width="800" height="498"&gt;&lt;/a&gt;&lt;br&gt;
Average CPU Utilization&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0e4qo9gx95c7ri54tocm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0e4qo9gx95c7ri54tocm.PNG" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;br&gt;
Maximum CPU Utilization&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i2898nlakt7z3syhf0y.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i2898nlakt7z3syhf0y.PNG" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;br&gt;
CPUCreditBalance&lt;/p&gt;

&lt;p&gt;From the above, we can see that CPU utilization spiked on three occasions, which correspond to the periods when the burst credits were depleted the fastest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ii. Disk I/O&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Disk I/O metrics reflect the performance and usage of your disk. Key Disk I/O metrics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DiskReadOps:&lt;/strong&gt; The number of read operations performed on the disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DiskWriteOps&lt;/strong&gt;: The number of write operations performed on the disk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DiskReadBytes&lt;/strong&gt;: The amount of data read from the disk, in bytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DiskWriteBytes&lt;/strong&gt;: The amount of data written to the disk, in bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: the above metrics are only available for instance store volumes. If using EBS, you’d be looking at the EBSReadOps, EBSWriteOps, EBSReadBytes, and EBSWriteBytes metrics.&lt;/p&gt;

&lt;p&gt;Key statistics to measure include:&lt;br&gt;
&lt;strong&gt;- Sum:&lt;/strong&gt; For DiskReadOps and DiskWriteOps, the sum statistic helps you understand the total number of I/O operations over a period.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Average&lt;/strong&gt;: For DiskReadBytes and DiskWriteBytes, the average statistic provides insight into the average data throughput over a period.&lt;/p&gt;

&lt;p&gt;Note: &lt;em&gt;DiskReadOps and DiskWriteOps show the number of completed operations across all volumes in a specified period. To obtain the average IOPS, divide the total number of operations by the period in seconds. For example, a DiskReadOps sum of 100,000 over a period of 1 hour gives 100,000/3,600 = ~28 read operations per second.&lt;/em&gt;&lt;/p&gt;
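&lt;p&gt;That conversion is a one-liner, shown here as a small sketch (the helper name is just for illustration):&lt;/p&gt;

```python
def average_iops(total_ops, period_seconds):
    """Convert a summed Disk*Ops datapoint into average IOPS."""
    return total_ops / period_seconds

# 100,000 read operations summed over 1 hour:
print(round(average_iops(100_000, 3600)))  # prints 28
```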

&lt;p&gt;In most cases, it is not possible to troubleshoot an issue by examining a single metric. In the scenario above, for example, we can’t conclude that the application lag is due to a resource constraint just by looking at CPU utilization, even if the spikes in utilization align with periods of latency.&lt;/p&gt;

&lt;p&gt;To get the full picture, we need to analyse multiple metrics together. Let’s say users report slowdowns. Examining both CPU utilization and disk I/O during those periods can reveal if spikes or abnormal patterns in both metrics coincide with the latency. If you have the CloudWatch agent installed, you can also compare these against memory utilization metrics. This combined view strengthens the case for resource limitations being the root cause.&lt;/p&gt;
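&lt;p&gt;Programmatically, this combined view boils down to finding the periods where both metrics breach a threshold at the same time. A minimal, illustrative sketch (the data points and thresholds are made up):&lt;/p&gt;

```python
def coinciding_spikes(cpu_points, disk_points, cpu_threshold, disk_threshold):
    """Timestamps where CPU and disk metrics are both above their thresholds."""
    hot_cpu = {t for t, v in cpu_points if v >= cpu_threshold}
    hot_disk = {t for t, v in disk_points if v >= disk_threshold}
    return sorted(hot_cpu.intersection(hot_disk))

# (timestamp, value) pairs, e.g. as returned in CloudWatch Datapoints
cpu = [("10:00", 95), ("11:00", 40), ("12:00", 98)]     # CPUUtilization %
disk = [("10:00", 5200), ("11:00", 900), ("12:00", 6100)]  # DiskWriteOps sum
print(coinciding_spikes(cpu, disk, 90, 5000))  # prints ['10:00', '12:00']
```

&lt;p&gt;Periods that appear in the result are the ones worth investigating first, since both resources were saturated at the same time.&lt;/p&gt;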

&lt;p&gt;This article has provided a foundational understanding of CloudWatch metrics and logs. However, the vast capabilities of CloudWatch extend far beyond what we’ve covered here. In a future article, we’ll delve deeper into advanced techniques for leveraging CloudWatch logs and metrics to troubleshoot issues and ensure the optimal health of your AWS resources. &lt;strong&gt;Stay tuned!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>beginners</category>
      <category>cloud</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Navigating Disaster Recovery in the Digital Age: Choosing the Right Approach – Part 6</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Fri, 14 Feb 2025 18:19:42 +0000</pubDate>
      <link>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-6-4dj6</link>
      <guid>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-6-4dj6</guid>
      <description>&lt;p&gt;Welcome to the final chapter of our journey! Over the past blogs, we've broken down every aspect, looked at the subtleties, and investigated how two possible solutions—AWS Elastic Disaster Recovery (DRS) and Veeam—compare with our client's needs. &lt;br&gt;
&lt;em&gt;You can review the case study &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-1-4cn4"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s cut through the noise and get to the heart of the matter—because the right choice isn’t just about features or costs. It’s about finding the best fit for the client’s unique needs. Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS-Native vs Third-Party Service
&lt;/h2&gt;

&lt;p&gt;Now this is an easy one – ain’t it? &lt;/p&gt;

&lt;p&gt;We’ve explored all the factors and compared two potential solutions: AWS’s DRS and the third-party Veeam. &lt;/p&gt;

&lt;p&gt;From our comparison table, it seems that Veeam is the obvious winner. But is it really? &lt;/p&gt;

&lt;p&gt;Before we pick, let’s explore one more factor—cost.&lt;/p&gt;

&lt;p&gt;While not always explicitly stated as a requirement, the cost of the solution is a crucial factor to consider when designing a solution. After all, this will be a business expense for the client, so it’s important to provide not only a functional solution but also a cost-effective one. &lt;br&gt;
When comparing DRS and Veeam, the cost structure for each is different. With DRS, there is a flat per-hour fee for each server being replicated to AWS, and you also pay for the replication instance(s), underlying volumes, and any recovery instances created during the recovery process in AWS. For each recovery instance, you incur charges for compute, memory, and storage. &lt;/p&gt;

&lt;p&gt;On the other hand, with Veeam, you pay for the Veeam license for your source servers. In addition to that, you pay for storage and API calls to and from S3. You’ll also incur costs for any recovery instances provisioned from the backups stored in S3. In both cases, recovery costs are only incurred when recovery is initiated into the AWS environment.&lt;/p&gt;
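&lt;p&gt;To make the comparison concrete for a client, it helps to sketch both cost structures side by side. The rates below are placeholders, not actual AWS or Veeam pricing; substitute the current published prices for a real estimate:&lt;/p&gt;

```python
def drs_monthly_estimate(servers, hourly_rate_per_server, replication_infra):
    """Rough monthly DRS steady-state cost: per-server fee plus replication
    instance(s) and volumes. Recovery-instance costs apply only on recovery."""
    hours = 730  # average hours in a month
    return servers * hourly_rate_per_server * hours + replication_infra

def veeam_monthly_estimate(license_per_server, servers, s3_storage, s3_requests):
    """Rough monthly Veeam steady-state cost: licensing plus S3 storage and
    API requests. Recovery-instance costs apply only on recovery."""
    return license_per_server * servers + s3_storage + s3_requests

# Placeholder rates for three replicated/backed-up servers:
print(drs_monthly_estimate(3, 0.028, 60.0))
print(veeam_monthly_estimate(40.0, 3, 25.0, 2.0))
```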

&lt;p&gt;So, back to our options. From the comparison chart, Veeam is taking the lead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2lyfe9urbog2xyd5gca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2lyfe9urbog2xyd5gca.png" alt="Image description" width="548" height="678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, let’s go back to our case study and check if Veeam meets all our requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Comprehensive Backups – &lt;strong&gt;&lt;em&gt;Yes&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enhanced Recovery Capabilities – &lt;strong&gt;&lt;em&gt;Yes&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An RTO of 1-2 hours for the ERP System – &lt;strong&gt;&lt;em&gt;No&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recovery options that include both on-premises restoration and the possibility of running the ERP in the cloud – &lt;strong&gt;&lt;em&gt;Yes&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexible RTO and RPO for Other Systems – &lt;strong&gt;&lt;em&gt;Yes&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The verdict? The ERP system is the outlier—and as the most critical system, we can’t ignore the need for a 2-hour RTO.&lt;/p&gt;

&lt;p&gt;So, where does that leave us? My vote? A hybrid approach. DRS for the mission-critical ERP system and Veeam for the more flexible Information and Library systems.&lt;/p&gt;

&lt;p&gt;Of course, there’s more to this decision than meets the eye. A hybrid solution can bring added complexity and cost, so as an architect, your job is to present all viable options along with their pros and cons. In this case, we’re looking at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Veeam only&lt;/li&gt;
&lt;li&gt;DRS only&lt;/li&gt;
&lt;li&gt;Hybrid approach with Veeam and DRS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, what do you think? Which solution would you have recommended to the client and why? &lt;/p&gt;

</description>
      <category>aws</category>
      <category>backup</category>
      <category>disasterrecovery</category>
    </item>
    <item>
      <title>Navigating Disaster Recovery in the Digital Age: Choosing the Right Approach – Part 5</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 27 Jan 2025 19:56:05 +0000</pubDate>
      <link>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-5-59ok</link>
      <guid>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-5-59ok</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-4-3fj4"&gt;Part 4&lt;/a&gt;, we laid the foundation by examining the critical differences between Backup and Disaster Recovery and analyzing the role of Scheduling and Automation in choosing a solution. In this installment, we’re taking the next step by evaluating the remaining factors. &lt;br&gt;
&lt;strong&gt;&lt;em&gt;You can review the case study &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-1-4cn4"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;As usual, we'll apply these considerations to the case study, analyzing how each solution measures up. Every element will help us get closer to the final question: Which solution best suits the client's needs? &lt;/p&gt;

&lt;p&gt;So, let’s dive in and uncover what these next factors reveal!&lt;/p&gt;

&lt;h2&gt;
  
  
  RPO vs RTO Requirements
&lt;/h2&gt;

&lt;p&gt;When considering the client’s case study, what RTO and RPO strategy do you think best meets their needs? How might their requirements for different systems influence the solution?&lt;/p&gt;

&lt;p&gt;From the case study, the client has varying requirements for their three systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ERP Application&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTO: 1-2 hours.&lt;/li&gt;
&lt;li&gt;RPO: Not explicitly mentioned but can be inferred as low, given the critical nature of the system.&lt;/li&gt;
&lt;li&gt;Additional Recovery Needs: Flexibility to restore either to the cloud or on-premises.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Information System and Library System&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More flexible RTO/RPO requirements, as these systems are less critical than the ERP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These differences mean the solution needs to be adaptable, balancing stringent recovery metrics for the ERP with cost-effective approaches for the less critical systems.&lt;/p&gt;

&lt;p&gt;Now, let’s compare AWS Elastic Disaster Recovery Service (DRS) and Veeam Backup and Replication Service to assess their suitability for meeting the client’s RPO/RTO requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkczbtwrpu94azoj9m2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkczbtwrpu94azoj9m2x.png" alt="Image description" width="748" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From an RPO/RTO perspective, DRS excels in providing ultra-low RPO and RTO, especially for critical systems due to continuous replication. &lt;/p&gt;

&lt;h2&gt;
  
  
  Physical vs Virtual Servers
&lt;/h2&gt;

&lt;p&gt;Based on our case study, the client is managing a mix of virtualized and physical environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ERP application is hosted in a virtualized environment, which is critical for the business and requires a more robust disaster recovery solution.&lt;/li&gt;
&lt;li&gt;The information system and library system, however, are running on physical servers, which adds complexity when considering recovery strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client's mixed infrastructure means we need a tool that supports both physical and virtual environments. &lt;/p&gt;

&lt;p&gt;So, let’s see how our two solutions compare:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23h63tebxvmes2ktjorw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23h63tebxvmes2ktjorw.png" alt="Image description" width="647" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both DRS and Veeam offer comprehensive coverage for virtual environments, and both support physical servers. However, Veeam provides broader support for diverse physical server configurations and operating systems, giving it a slight edge in flexibility for the client’s mixed infrastructure. &lt;/p&gt;

&lt;h2&gt;
  
  
  On-Premises vs Cloud Restores
&lt;/h2&gt;

&lt;p&gt;The client is exploring both on-premises and cloud recovery options for their systems, with a specific emphasis on restoring their ERP system, which is critical for their business operations.&lt;/p&gt;

&lt;p&gt;For cloud-based recovery, AWS Elastic Disaster Recovery (DRS) offers a streamlined process with the potential to quickly spin up instances in the cloud. However, while DRS can restore to an on-premises environment, this approach adds significant complexity and cost.&lt;/p&gt;

&lt;p&gt;The client would need to first initiate recovery in the cloud (spin up recovery instances sized to match the on-premises servers), then initiate failback to the on-premises environment. This process would not only be costly (the client would incur the unnecessary cost of recovery instances in the cloud, plus hefty data transfer fees) but would also significantly increase the RTO for restoring to the on-premises environment. Furthermore, considerations around bandwidth, network setup, and the configuration of recovery infrastructure make an on-premises restore with DRS less attractive for this client.&lt;/p&gt;

&lt;p&gt;Veeam, on the other hand, offers a more straightforward and cost-effective approach for on-premises recovery.&lt;/p&gt;

&lt;p&gt;Veeam has a strong focus on both cloud and on-premises backup and recovery, with features designed for quick restoration to both environments. Its restore process is far more simplified, and it provides the flexibility to recover systems back to on-premises environments with minimal complexity. Additionally, Veeam offers tools to handle the nuances of restoring large data volumes, which can ease the recovery process from a cloud backup to on-premises hardware.&lt;/p&gt;

&lt;p&gt;Additionally, Veeam Backup &amp;amp; Replication allows you to restore different workloads (VMs, Google VM instances, physical servers, etc.) to Amazon EC2 instances. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgnwi5b67t06ne1bc9sv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgnwi5b67t06ne1bc9sv.png" alt="Image description" width="611" height="753"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s a wrap for Part 5! We’ve tackled some heavy hitters—RTO and RPO, the unique dynamics of physical and virtual servers, and the ever-relevant debate of on-premises vs cloud restores. These are crucial factors that bring us closer to deciding which solution might best meet our client’s needs.&lt;/p&gt;

&lt;p&gt;But the story doesn’t end here. In the final part of this series, we’ll zoom out to look at the bigger picture: AWS-native solutions vs third-party alternatives in the context of Veeam and AWS Elastic Disaster Recovery (DRS). It’s a showdown that will weigh the pros and cons of these two approaches to help us determine the ultimate recommendation for the client.&lt;/p&gt;

&lt;p&gt;So, what’s your call so far? Are you team AWS-native or team third-party? Stick around—Part 6 is where everything comes together for the grand finale! Don’t miss it.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>backup</category>
      <category>disasterrecovery</category>
    </item>
    <item>
      <title>Navigating Disaster Recovery in the Digital Age: Choosing the Right Approach – Part 4</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Mon, 27 Jan 2025 19:33:02 +0000</pubDate>
      <link>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-4-3fj4</link>
      <guid>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-4-3fj4</guid>
      <description>&lt;p&gt;Over the last three blogs, we have established the groundwork for creating a strong backup and disaster recovery (DR) solution. In &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-1-4cn4"&gt;Part 1&lt;/a&gt;, we explored the fundamentals of disaster recovery in today’s world and introduced six key factors to consider when developing a Backup/DR solution. In &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-2-5423"&gt;Part 2&lt;/a&gt;, we examined the differences between backup and disaster recovery, as well as the advantages and disadvantages of third-party versus AWS-native solutions. Then, in &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-3-3gbh"&gt;Part 3&lt;/a&gt;, we took a closer look at other crucial considerations, including scheduling, automation, RTO/RPO, and how the client’s physical or virtual environment influences the choice of tools.&lt;/p&gt;

&lt;p&gt;Now, it’s time to shift gears and get into the heart of this series: the case study that inspired it all.&lt;/p&gt;

&lt;p&gt;This blog will revisit the real-world client scenario that posed this exciting challenge. We'll use the elements covered in the previous sections to examine the client's needs, constraints, and objectives. How do the considerations we’ve explored shape the final solution? What trade-offs were necessary, and how were they balanced?&lt;/p&gt;

&lt;p&gt;By the end of this post, you’ll have a clear picture of how theory meets practice when designing a customized DR/backup solution—and why no two solutions are ever quite the same.&lt;/p&gt;

&lt;p&gt;Let’s dive into the case study and start piecing it all together!&lt;/p&gt;

&lt;h2&gt;
  
  
  Recap of the Case Study
&lt;/h2&gt;

&lt;p&gt;Our client sought to conduct a Proof of Concept (PoC) for a disaster recovery solution on AWS for three critical on-premises systems, each with unique characteristics and requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ERP Application: The crown jewel of their operations, hosted in a virtualized environment. This system was mission-critical, demanding a stringent Recovery Time Objective (RTO) of 1–2 hours.&lt;/li&gt;
&lt;li&gt;Information System: A physical server housing essential data and workflows.&lt;/li&gt;
&lt;li&gt;Library System: Another physical server, supporting key business functions but with more flexible recovery requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges with the Existing Solution&lt;/strong&gt;&lt;br&gt;
The client’s existing backup approach relied on native backup software to perform daily full backups. These backups were retained for only 24 hours before being discarded. Unfortunately, this setup introduced significant limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited Retention: The 24-hour backup retention window left the systems vulnerable to data loss if issues went undetected for longer periods.&lt;/li&gt;
&lt;li&gt;Unreliable Recovery: The manual restore process was cumbersome and prone to failures, undermining their ability to recover effectively when needed.&lt;/li&gt;
&lt;li&gt;Critical ERP Recovery Needs: The ERP system required an RTO of 1–2 hours, a demand far beyond what the current setup could reliably support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Requirements for the New Solution&lt;/strong&gt;&lt;br&gt;
The client’s objectives were clear: they needed a comprehensive and reliable disaster recovery solution that could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A robust backup system to ensure reliable and complete backups.&lt;/li&gt;
&lt;li&gt;Efficient and dependable restoration processes to minimize downtime and avoid failed restores.&lt;/li&gt;
&lt;li&gt;A solution for the ERP system that would support a stringent RTO of 1–2 hours and both on-premises and cloud-based recovery options.&lt;/li&gt;
&lt;li&gt;Flexible RPO/RTO metrics for Other Systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to presenting a technical challenge, this case study offered a chance to create a solution that satisfied a variety of operational requirements while striking a balance between cost, complexity, and dependability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Analysis
&lt;/h2&gt;

&lt;p&gt;For this case study, we are primarily going to be looking at two possible solutions: AWS Elastic Disaster Recovery Service (DRS) and Veeam Backup and Replication Service. For each of the factors, we will evaluate how well each solution aligns with the client’s requirements, gradually narrowing down the options to determine the most suitable final solution. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Backup vs Disaster Recovery&lt;/strong&gt;&lt;br&gt;
Take a moment and think about it. Based on the details of the client’s requirements and existing setup, would you classify their need as a Backup solution or a Disaster Recovery (DR) solution?&lt;/p&gt;

&lt;p&gt;The client initially requested a DR solution, but as we’ve discussed in previous blogs, clients often use “backup” and “disaster recovery” interchangeably. So, let’s dig deeper.&lt;/p&gt;

&lt;p&gt;For starters, we know that in their on-premises environment, the client seemed to be operating a simple backup and restore system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They performed manual daily backups using native software.&lt;/li&gt;
&lt;li&gt;These backups were used to restore data to their systems in the event of a failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, certain aspects of their request strongly indicated a need for disaster recovery rather than a simple backup solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stringent RTO for the ERP System:&lt;/strong&gt; The client emphasized a Recovery Time Objective (RTO) of 1–2 hours for their critical ERP system. This requirement goes beyond what simple backups can deliver and aligns with DR strategies designed to minimize downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Recovery Options:&lt;/strong&gt; The customer wanted better recovery methods, especially for the ERP system. In earlier blogs, we noted that DR involves restoring not just data but also the entire system and application infrastructure to operational status—a significant distinction from backups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Recovery Considerations:&lt;/strong&gt; In the event of a recovery, the client made it clear that they were open to running the ERP system on the cloud. This is indicative of a DR strategy as it entails restoring and operating crucial workloads in a different location.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on Business Continuity:&lt;/strong&gt; The ERP’s critical nature indicates that the client is likely prioritizing seamless business continuity, a hallmark of DR solutions. With its manual procedures and extended restoration times, a backup system by itself would not suffice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailored Solutions for Other Systems:&lt;/strong&gt; While the ERP has stringent recovery metrics, the client’s more flexible RTO/RPO requirements for other systems suggest they are looking for a solution that balances DR for critical workloads with backup solutions for less critical ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s compare AWS Elastic Disaster Recovery Service (DRS) and Veeam Backup and Replication Service to assess their suitability for meeting the client’s DR requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3sqsgq4ns64z9pqzteg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3sqsgq4ns64z9pqzteg.png" alt="Image description" width="595" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, both Veeam and DRS align well with the client’s needs, each excelling in different areas. DRS offers fast and efficient failover capabilities, making it an excellent choice for the ERP system’s stringent recovery requirements. Meanwhile, Veeam delivers a robust backup solution, ensuring reliable and comprehensive backups for all three systems. Furthermore, Veeam also supports recovery, adding versatility to its functionality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Scheduling and Automation&lt;/strong&gt;&lt;br&gt;
Considering the client’s request for a comprehensive backup process and enhanced recovery mechanisms, what level of scheduling and automation do you think would best suit their needs?&lt;br&gt;
In their on-premises setup, the client relies on manual daily backups, with the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backups are initiated and managed manually.&lt;/li&gt;
&lt;li&gt;The restore process is also manual and prone to errors, with instances of incomplete backups and unsuccessful recovery attempts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This manual approach introduces inefficiencies and increases the likelihood of human error, both of which are particularly problematic for critical systems like their ERP application.&lt;/p&gt;

&lt;p&gt;Several factors in the client’s requirements strongly indicate the need for automation in their backup and recovery processes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Backup Requirement:&lt;/strong&gt; The client’s stated desire for a more comprehensive solution implies they need a system that goes beyond basic backups, incorporating robust policies for retention, versioning, and automated execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on RTO/RPO Metrics:&lt;/strong&gt; Automation directly supports achieving the low RTO for their ERP system by streamlining recovery steps and reducing delays caused by manual processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desire for Enhanced Recovery:&lt;/strong&gt; The issues faced in their manual restore process (incomplete backups, failed restores) further highlight the necessity of an automated recovery system that removes guesswork and error.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s compare AWS Elastic Disaster Recovery Service (DRS) and Veeam Backup and Replication Service to assess their suitability for meeting the client’s scheduling and automation requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvk1kr2bzibhvbfu5re2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvk1kr2bzibhvbfu5re2.png" alt="Image description" width="748" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above, we see that while DRS does provide automation for data replication and failover, it has limited flexibility due to its ‘&lt;strong&gt;always-on&lt;/strong&gt;’ replication model. Veeam, on the other hand, is more customizable, providing high flexibility in tailoring backup processes as well as automating backup and recovery.&lt;/p&gt;

&lt;p&gt;In this part of our blog series, we’ve taken a deep dive into the Backup vs Disaster Recovery factor and explored the importance of Scheduling and Automation when evaluating solutions. By applying these factors to the case study, we’ve not only clarified the client’s requirements but also laid the groundwork for comparing two potential solutions: AWS Elastic Disaster Recovery (DRS) and Veeam Backup and Replication.&lt;/p&gt;

&lt;p&gt;The analysis so far shows us that while DRS excels in automation and seamless disaster recovery, Veeam shines in customizable scheduling and robust backup capabilities. But the decision-making process doesn’t end here.&lt;/p&gt;

&lt;p&gt;In the following blogs, we’ll continue to dissect the remaining factors. &lt;br&gt;
So, stay tuned as we work toward identifying the most effective solution for the client’s needs. &lt;/p&gt;

&lt;p&gt;Which solution do you think is pulling ahead so far – DRS or Veeam? Take your pick and don’t forget to check back for the next installment!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>disasterrecovery</category>
      <category>backup</category>
    </item>
    <item>
      <title>Navigating Disaster Recovery in the Digital Age: Choosing the Right Approach – Part 3</title>
      <dc:creator>Zippy Wachira</dc:creator>
      <pubDate>Tue, 14 Jan 2025 19:14:04 +0000</pubDate>
      <link>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-3-3gbh</link>
      <guid>https://forem.com/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-3-3gbh</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-1-4cn4"&gt;first installment&lt;/a&gt; of this blog series, we introduced the concept of disaster recovery (DR) and highlighted six key factors to consider when designing a robust Backup/DR solution. These factors serve as a guide for architects to evaluate and tailor solutions that align with a client’s unique needs and objectives.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/yaddah/navigating-disaster-recovery-in-the-digital-age-choosing-the-right-approach-part-2-5423"&gt;second part&lt;/a&gt;, we took a closer look at the first two factors: the critical distinction between Backups and DR, and the choice between AWS-native and Third-party solutions. These foundational considerations set the stage for understanding the broader landscape of options and how they align with different use cases.&lt;/p&gt;

&lt;p&gt;Now, in this third installment, we turn our attention to the remaining factors. Join us as we explore why these factors matter, how they impact the decision-making process, and the role they play in designing a solution that delivers resilience and reliability. Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  Scheduling and Automation
&lt;/h2&gt;

&lt;p&gt;One of the critical factors to consider when designing a backup/DR solution is the level of control and automation the client needs over the backup process. Scheduling and automation capabilities vary widely between solutions, and understanding the client’s expectations is key to selecting the right tool.&lt;/p&gt;

&lt;p&gt;Some solutions offer flexible scheduling options, giving clients the ability to define tailored backup policies based on their specific needs. For example, &lt;a href="https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html" rel="noopener noreferrer"&gt;AWS Backup&lt;/a&gt; allows users to create custom backup plans where they can specify the frequency (e.g., daily, weekly, monthly), retention periods, and assign different schedules to various resources. Once configured, these backups occur automatically according to the set policies, requiring minimal ongoing management from the client.&lt;/p&gt;
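&lt;p&gt;As a rough sketch, such a backup plan can be expressed as a plan document in the shape the AWS Backup &lt;code&gt;CreateBackupPlan&lt;/code&gt; API expects. The plan name, vault name, schedule, and retention period below are illustrative values for this example, not recommendations:&lt;/p&gt;

```python
# Sketch of an AWS Backup plan document (illustrative values).
# The structure follows the CreateBackupPlan API; the plan name,
# vault name, schedule, and retention are assumptions for this example.
backup_plan = {
    "BackupPlanName": "erp-daily-plan",          # hypothetical name
    "Rules": [
        {
            "RuleName": "DailyBackups",
            "TargetBackupVaultName": "Default",  # the default vault
            # Run every day at 02:00 UTC (AWS Backup cron syntax:
            # minute hour day-of-month month day-of-week year)
            "ScheduleExpression": "cron(0 2 * * ? *)",
            "StartWindowMinutes": 60,
            # Retention: delete recovery points after 35 days
            "Lifecycle": {"DeleteAfterDays": 35},
        }
    ],
}

# With boto3 and valid credentials, the plan could be registered via:
#   import boto3
#   boto3.client("backup").create_backup_plan(BackupPlan=backup_plan)
print(backup_plan["Rules"][0]["ScheduleExpression"])
```

&lt;p&gt;Once a plan like this is in place, backups fire on the cron schedule with no manual intervention, which is exactly the hands-off behaviour described above.&lt;/p&gt;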

&lt;p&gt;In contrast, some solutions provide fixed or limited scheduling options that may not offer the same level of customization. For instance, &lt;a href="https://docs.aws.amazon.com/drs/latest/userguide/what-is-drs.html" rel="noopener noreferrer"&gt;Elastic Disaster Recovery Service&lt;/a&gt; continuously performs block-level replication of source server volumes. While this ensures real-time data protection, it doesn’t provide the client with the ability to set specific backup schedules or retention policies, as the process is designed to operate continuously without manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters&lt;/strong&gt;&lt;br&gt;
Failing to address scheduling and automation needs during the design phase can lead to operational inefficiencies or, worse, missed recovery points. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a solution lacks the ability to automate backups during off-peak hours, it might interfere with production workloads.&lt;/li&gt;
&lt;li&gt;Limited retention options could result in insufficient data points for recovery during audits or post-disaster analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By understanding how a client wants to manage the backup process—whether they need granular control or prefer a hands-off approach—you can select a solution that not only meets their operational requirements but also positions them for long-term success.&lt;/p&gt;

&lt;h2&gt;
  
  
  Physical vs Virtual Servers
&lt;/h2&gt;

&lt;p&gt;A client’s existing infrastructure plays a significant role in determining the appropriate solution. Whether that landscape consists of physical or virtualized servers is another key factor in choosing the most fitting Backup/DR option.&lt;/p&gt;

&lt;p&gt;Some tools are versatile enough to cater to both physical and virtual environments, providing flexibility for hybrid infrastructures. For example, &lt;a href="https://docs.aws.amazon.com/drs/latest/userguide/what-is-drs.html" rel="noopener noreferrer"&gt;Elastic Disaster Recovery Service&lt;/a&gt; supports both physical and virtual servers, making it a strong candidate for clients with mixed environments. &lt;/p&gt;

&lt;p&gt;Other tools are designed specifically for virtualized environments, limiting their applicability for clients with physical infrastructure. For instance, &lt;a href="https://aws.amazon.com/storagegateway/#:~:text=AWS%20Storage%20Gateway%20gives%20your,Private%20Cloud%20(Amazon%20VPC)." rel="noopener noreferrer"&gt;AWS Storage Gateway&lt;/a&gt; is optimized for virtualized environments and works with platforms like VMware ESXi, Hyper-V, and KVM. Similarly, &lt;a href="https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html" rel="noopener noreferrer"&gt;AWS Backup&lt;/a&gt;, when used to back up an on-premises environment, requires the environment to be a VMware setup (specifically VMware ESXi).&lt;/p&gt;

&lt;p&gt;Third-party solutions often provide extensive compatibility, making them suitable for clients with diverse setups. For example, &lt;a href="https://www.acronis.com/en-us/" rel="noopener noreferrer"&gt;Acronis&lt;/a&gt; supports physical, virtual, and cloud environments, offering a one-size-fits-all approach for hybrid infrastructures. &lt;a href="https://www.arcserve.com/" rel="noopener noreferrer"&gt;Arcserve&lt;/a&gt; has support for VMware ESX/vSphere, Microsoft Hyper-V, Citrix XenServer, and Red Hat Enterprise Virtualization (RHEV) environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it Matters&lt;/strong&gt;&lt;br&gt;
Selecting a tool that doesn’t align with the client’s current setup can lead to inefficiencies, increased costs, or even the inability to implement a functional solution. By thoroughly understanding the client’s environment and matching it to the capabilities of the solution, you can ensure compatibility, seamless integration, and a tailored approach that addresses both current and future requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  RPO/RTO Requirements
&lt;/h2&gt;

&lt;p&gt;Recovery Time Objective (RTO) dictates how quickly systems need to be restored after a failure, while Recovery Point Objective (RPO) dictates how much data loss is acceptable. AWS offers a range of DR strategies to meet varying RTO/RPO requirements, each with its own implementation complexity and cost considerations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup and Restore is best suited for low-priority use cases where the client can tolerate longer recovery times. With this strategy, backups are taken periodically, and in the event of failure, systems are restored from these backups. The RTO and RPO for Backup and Restore can be quite high (e.g., several hours), making it suitable for cases where the recovery window is not critical. This strategy typically offers the lowest cost but is less suited for mission-critical applications.&lt;/li&gt;
&lt;li&gt;Pilot Light is ideal for environments that require a moderate RPO/RTO (in the range of tens of minutes). With Pilot Light, a minimal version of the application runs on AWS at all times. In the event of a failure, the necessary resources are quickly spun up to restore full functionality. This strategy ensures faster recovery than Backup and Restore but still allows for some downtime, which makes it a cost-effective option for many organizations.&lt;/li&gt;
&lt;li&gt;Warm Standby takes the Pilot Light concept further by keeping a scaled-down version of the entire environment always running. This ensures much faster recovery, with RTO/RPO in the range of minutes. The environment is pre-configured, so failover happens quickly, and systems can be rapidly scaled up in the event of a disaster. Warm Standby is a good middle ground for clients who require fast recovery but don’t always need a fully active system.&lt;/li&gt;
&lt;li&gt;Active/Active is the most complex and costly solution, designed for scenarios that demand zero downtime and real-time backups. In an Active/Active setup, systems are fully mirrored across AWS and on-premises (or across multiple AWS regions). This allows for immediate failover with zero disruption to service. The RTO and RPO are close to zero, but this approach incurs the highest costs due to the need for continuous synchronization and infrastructure running at full capacity.&lt;/li&gt;
&lt;/ul&gt;
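&lt;p&gt;The trade-off behind these four strategies boils down to a simple selection rule: pick the cheapest strategy whose typical recovery time still meets the target RTO. The minute figures below are rough illustrative ranges for this sketch, not AWS SLAs:&lt;/p&gt;

```python
# Toy selector for the four DR strategies above: given a target RTO,
# return the least costly strategy whose typical recovery time fits.
# The minute figures are rough, illustrative ranges, not AWS SLAs.
STRATEGIES = [
    # (name, typical max RTO in minutes, relative cost rank)
    ("Backup and Restore", 24 * 60, 1),  # hours; cheapest
    ("Pilot Light",        40,      2),  # tens of minutes
    ("Warm Standby",       10,      3),  # minutes
    ("Active/Active",      1,       4),  # near zero; most expensive
]

def pick_strategy(target_rto_minutes: float) -> str:
    """Return the cheapest strategy whose typical RTO meets the target."""
    for name, rto, _cost in STRATEGIES:  # ordered cheapest first
        if rto <= target_rto_minutes:
            return name
    return "Active/Active"  # only option left for near-zero RTO

print(pick_strategy(120))  # -> "Pilot Light" for a two-hour RTO target
```

&lt;p&gt;The same logic applies when evaluating real tools: tighten the RTO target and the viable (and more expensive) options narrow quickly.&lt;/p&gt;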

&lt;p&gt;&lt;strong&gt;Why it Matters&lt;/strong&gt;&lt;br&gt;
Different tools align with varying RPO/RTO requirements. For instance, AWS Elastic Disaster Recovery is ideal for scenarios requiring low RPO/RTO, such as Pilot Light or Warm Standby strategies, as it ensures continuous block-level replication of source servers. AWS Backup, on the other hand, is better suited for Backup and Restore use cases, offering flexible backup schedules but longer recovery times. Third-party solutions like &lt;a href="https://helpcenter.veeam.com/docs/vbaws/guide/welcome.html" rel="noopener noreferrer"&gt;Veeam&lt;/a&gt; and &lt;a href="https://www.zerto.com/resources/a-to-zerto/backup-and-recovery/" rel="noopener noreferrer"&gt;Zerto&lt;/a&gt; provide robust options for Pilot Light and Warm Standby configurations, often including advanced features such as automated failover and failback to support tighter RPO/RTO objectives.&lt;/p&gt;

&lt;h2&gt;
  
  
  On Premises vs Cloud Restores
&lt;/h2&gt;

&lt;p&gt;The restore process is a critical factor that differs from tool to tool, and the complexity of restoring data to different environments—whether on-premises or in the cloud—varies as well. This difference in complexity, cost, and ease of restore should play a significant role in selecting the right tool for the job.&lt;/p&gt;

&lt;p&gt;For example, when backing up data to AWS and later needing to restore it back to an on-premises environment, organizations must consider data transfer costs. Moving large amounts of data from AWS back to the local environment can incur significant bandwidth costs, depending on the amount of data being restored. Additionally, the time required for restoring the data also becomes a key factor, particularly if the transfer involves multiple terabytes or requires the use of slower mediums. This restoration process may introduce delays, which is a crucial consideration for businesses with stringent recovery time objectives (RTOs).&lt;/p&gt;
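&lt;p&gt;A quick back-of-envelope calculation makes these transfer considerations concrete. The egress rate used below is an assumed ballpark figure; actual AWS data-transfer-out pricing varies by region and volume tier:&lt;/p&gt;

```python
# Back-of-envelope estimate for restoring data from AWS to on-premises:
# transfer time at a given link speed plus data-transfer-out charges.
# The $0.09/GB egress rate is an assumed ballpark; real AWS pricing
# varies by region and volume tier.
def restore_estimate(data_gb: float, link_mbps: float,
                     egress_per_gb: float = 0.09):
    """Return (hours to transfer, egress cost in USD)."""
    megabits = data_gb * 8 * 1000          # GB -> megabits (decimal units)
    hours = megabits / link_mbps / 3600    # seconds -> hours
    return hours, data_gb * egress_per_gb

# Restoring 5 TB over a 500 Mbps link:
hours, cost = restore_estimate(5000, 500)
print(f"~{hours:.1f} h, ~${cost:.0f} egress")  # ~22.2 h, ~$450 egress
```

&lt;p&gt;Even at a generous 500 Mbps, a 5 TB restore takes the better part of a day before any verification work begins, which is why these numbers belong in the RTO conversation from the start.&lt;/p&gt;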

&lt;p&gt;On the other hand, restoring data within AWS presents different challenges. While the cost of transferring data within AWS itself is usually lower than moving data from AWS to an on-premises location, you still need to think about the recovery resources that need to be launched. This includes creating EC2 instances, setting up databases, or even configuring network access to ensure users can interact with the recovered applications and data. Furthermore, if the goal is to continue operations entirely within AWS, you'll need to ensure proper connectivity between the cloud-based recovery resources and any on-premises systems that need to interact with them. &lt;/p&gt;

&lt;p&gt;Different tools provide varying levels of support for cloud vs on-premises restores. Some tools offer seamless, automated restores to cloud environments, while others focus more on on-premises environments and might lack cloud-native features or optimizations. For instance, AWS Backup provides strong cloud recovery capabilities but would require additional steps and consideration when restoring back to an on-premises environment. Veeam Backup &amp;amp; Replication, on the other hand, offers more flexibility, supporting restores both to AWS and on-premises environments with robust options for data migration and failover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it Matters&lt;/strong&gt;&lt;br&gt;
The complexity of the restore process and the associated costs should be factored into the decision-making process when selecting the right disaster recovery tool. If quick restoration to an on-premises environment is required, the tool must support efficient data recovery methods, while tools designed for cloud restores should account for the setup and management of cloud infrastructure during the recovery. &lt;br&gt;
Understanding the nuances of each option—whether considering the cost and complexity of cloud vs on-premises restores—will ensure that the solution is tailored to the client’s operational needs and recovery objectives.&lt;/p&gt;

&lt;p&gt;In this post, we’ve explored the remaining key factors to consider when designing a disaster recovery and backup solution, from understanding the complexities of RTO/RPO to weighing the differences between on-premises and cloud restores. Each of these factors plays a crucial role in building a solution that aligns with both technical requirements and business needs.&lt;/p&gt;

&lt;p&gt;In our next installment, we’ll return to the case study and apply these factors to analyze the client’s specific requirements, ultimately crafting a tailored solution for their disaster recovery and backup needs. Stay tuned as we turn theory into practice and bring the solution to life.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to see how the puzzle pieces fit together? Let’s dive into the final design in the next blog—don’t miss it!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>backup</category>
      <category>disasterrecovery</category>
    </item>
  </channel>
</rss>
