<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: augusthottie</title>
    <description>The latest articles on Forem by augusthottie (@augusthottie).</description>
    <link>https://forem.com/augusthottie</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1713210%2Ffac3c0eb-0bc1-4abe-bd08-eedbf1072d8c.jpeg</url>
      <title>Forem: augusthottie</title>
      <link>https://forem.com/augusthottie</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/augusthottie"/>
    <language>en</language>
    <item>
      <title>I Built a Serverless Event-Driven Pipeline on AWS</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Mon, 06 Apr 2026 16:31:27 +0000</pubDate>
      <link>https://forem.com/augusthottie/i-built-a-serverless-event-driven-pipeline-on-aws-3d05</link>
      <guid>https://forem.com/augusthottie/i-built-a-serverless-event-driven-pipeline-on-aws-3d05</guid>
      <description>&lt;p&gt;After five projects deep on containers and Kubernetes, I needed to add range to my portfolio. Every DevOps role I've looked at mentions serverless somewhere, Lambda for glue code, API Gateway for webhooks, DynamoDB for low-latency lookups, SQS for decoupling services. So this week I built a serverless event-driven pipeline from scratch.&lt;/p&gt;

&lt;p&gt;The use case is a URL shortener with click analytics, but the architecture pattern is the same one you'd use for payment processing, IoT ingestion, audit logging, or any event-driven system. The interesting part isn't the URL shortener; it's the async decoupling, the atomic operations, the failure handling, and the IAM patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A four-Lambda pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shortener&lt;/strong&gt; (&lt;code&gt;POST /shorten&lt;/code&gt;): generates a short code, writes to DynamoDB with a conditional write to handle collisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redirect&lt;/strong&gt; (&lt;code&gt;GET /{code}&lt;/code&gt;): reads the URL, sends a fire-and-forget click event to SQS, returns a 302 in under 100ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics&lt;/strong&gt; (SQS-triggered): processes click events in batches of 10, writes details to a clicks table, atomically increments the counter on the urls table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stats&lt;/strong&gt; (&lt;code&gt;GET /stats/{code}&lt;/code&gt;): queries both tables, aggregates top user agents and referers, returns JSON&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All Python 3.12, all behind API Gateway HTTP API v2, all defined in Terraform with reusable modules. 38 resources total. Deploys in under two minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /shorten      →  Lambda (shortener)  →  DynamoDB (urls table)

GET /{code}        →  Lambda (redirect)   →  DynamoDB → SQS → 302
                                                         ↓
                                              Lambda (analytics)
                                                         ↓
                                              DynamoDB (clicks + counter)

GET /stats/{code}  →  Lambda (stats)      →  DynamoDB (both tables)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight is the SQS queue between the redirect and analytics Lambdas. The redirect Lambda doesn't wait for analytics processing; it sends a message to SQS and immediately returns the 302. Whether analytics takes 10ms or 10 seconds, users get redirected instantly. Click event processing happens entirely in the background.&lt;/p&gt;
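
&lt;p&gt;For illustration, here's a minimal sketch of the redirect handler pattern. Field names, environment variables, and the injectable clients are simplifications for readability, not the exact repo code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os
import time

def handler(event, context, table=None, sqs=None):
    """Redirect Lambda: look up the URL, queue the click, return a 302.

    table (a DynamoDB Table) and sqs (an SQS client) are injectable for
    testing; a real deployment creates them at module scope with boto3.
    """
    code = event["pathParameters"]["code"]
    item = table.get_item(Key={"code": code}).get("Item")
    if not item:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown code"})}

    # Fire-and-forget: the 302 never waits on analytics processing.
    sqs.send_message(
        QueueUrl=os.environ.get("CLICKS_QUEUE_URL", ""),
        MessageBody=json.dumps({
            "code": code,
            "ts": time.time(),
            "ua": event.get("headers", {}).get("user-agent", ""),
        }),
    )
    return {"statusCode": 302, "headers": {"Location": item["long_url"]}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;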

&lt;h2&gt;
  
  
  The Patterns That Make This Production-Realistic
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Conditional writes for collision handling.&lt;/strong&gt; The shortener generates a 7-character random code, but two simultaneous requests could generate the same one. DynamoDB's &lt;code&gt;ConditionExpression: "attribute_not_exists(code)"&lt;/code&gt; makes the write fail atomically if the code already exists. Combined with retry logic, this is collision-safe at any concurrency level.&lt;/p&gt;
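
&lt;p&gt;A simplified sketch of that write path (the code generator and field names here are illustrative, and the table handle is passed in so the logic is testable):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def generate_code(length=7):
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def create_short_url(table, long_url, max_attempts=5):
    # The condition turns a collision into a retryable error
    # instead of a silent overwrite.
    for _ in range(max_attempts):
        code = generate_code()
        try:
            table.put_item(
                Item={"code": code, "long_url": long_url, "clicks": 0},
                ConditionExpression="attribute_not_exists(code)",
            )
            return code
        except table.meta.client.exceptions.ConditionalCheckFailedException:
            continue  # someone else owns this code; try a fresh one
    raise RuntimeError("could not allocate a unique code")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;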

&lt;p&gt;&lt;strong&gt;Atomic counters with UpdateExpression.&lt;/strong&gt; When the analytics Lambda increments the click count, it uses &lt;code&gt;UpdateExpression: "ADD clicks :inc"&lt;/code&gt;. This is atomic at the database level, no read-modify-write race conditions. If 100 clicks come in simultaneously, all 100 get counted correctly.&lt;/p&gt;
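
&lt;p&gt;The increment itself is a single call, sketched here with an injectable table handle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def record_click(urls_table, code):
    # ADD is applied server-side by DynamoDB, so there is no
    # read-modify-write window for concurrent updates to race in.
    urls_table.update_item(
        Key={"code": code},
        UpdateExpression="ADD clicks :inc",
        ExpressionAttributeValues={":inc": 1},
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;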

&lt;p&gt;&lt;strong&gt;Partial batch failure for SQS.&lt;/strong&gt; The default SQS → Lambda trigger fails the entire batch if any message errors. That's wasteful and creates retry storms. By setting &lt;code&gt;function_response_types = ["ReportBatchItemFailures"]&lt;/code&gt; and returning &lt;code&gt;{"batchItemFailures": [{"itemIdentifier": messageId}]}&lt;/code&gt; from the Lambda, only the failed messages go back to the queue. The successful 9 out of 10 stay processed.&lt;/p&gt;
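
&lt;p&gt;The handler shape looks like this, sketched with the per-message processor injected so the reporting logic stands alone:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

def analytics_handler(event, context, process=None):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message goes back to the queue;
            # the rest of the batch stays consumed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;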

&lt;p&gt;&lt;strong&gt;Dead Letter Queue with 3-retry redrive.&lt;/strong&gt; Failed messages get retried 3 times before being moved to a dead letter queue. The DLQ holds them for 14 days so you can investigate without blocking the main queue. This is the difference between "the pipeline is broken" and "we have visibility into 12 failed messages from yesterday."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Least-privilege IAM per Lambda.&lt;/strong&gt; Each function has its own role with the minimum permissions it needs. The shortener can only &lt;code&gt;PutItem&lt;/code&gt; on the urls table. The redirect can only &lt;code&gt;GetItem&lt;/code&gt; on urls and &lt;code&gt;SendMessage&lt;/code&gt; to SQS. The analytics Lambda can read from SQS and write to both tables. The stats Lambda can only read from both tables. If any function gets compromised, the blast radius is contained.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Terraform Modules
&lt;/h2&gt;

&lt;p&gt;I built four reusable modules:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;lambda/&lt;/code&gt;&lt;/strong&gt; takes a source directory, handler name, IAM policy JSON, and environment variables. It packages the code into a zip, creates an IAM role, attaches the basic execution policy plus the custom one, creates the function, and sets up a CloudWatch log group with 14-day retention. Adding a fifth Lambda would be 15 lines of root-level code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;dynamodb/&lt;/code&gt;&lt;/strong&gt; creates the urls and clicks tables. The clicks table has a Global Secondary Index on &lt;code&gt;(code, timestamp)&lt;/code&gt; so the stats Lambda can query recent clicks efficiently without scanning the whole table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;sqs/&lt;/code&gt;&lt;/strong&gt; creates the main queue and the dead letter queue, with the redrive policy linking them. Visibility timeout is 60 seconds (it must be at least the Lambda timeout); long polling is enabled at 20 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;api_gateway/&lt;/code&gt;&lt;/strong&gt; creates an HTTP API v2 (cheaper than REST API), three integrations, three routes (&lt;code&gt;POST /shorten&lt;/code&gt;, &lt;code&gt;GET /{code}&lt;/code&gt;, &lt;code&gt;GET /stats/{code}&lt;/code&gt;), the auto-deploy stage with access logging, and the Lambda permissions allowing API Gateway to invoke each function.&lt;/p&gt;

&lt;p&gt;The root &lt;code&gt;main.tf&lt;/code&gt; composes them and passes the outputs between modules. There's also a &lt;code&gt;null_resource&lt;/code&gt; with &lt;code&gt;local-exec&lt;/code&gt; that copies the shared utilities folder into each Lambda's source directory before packaging — Lambda doesn't have a native way to share source code between functions without using Layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Live Demo
&lt;/h2&gt;

&lt;p&gt;I built an interactive HTML page that calls the live API directly so you don't have to take my word for it. There's a "Shorten URL" button, a "Generate 10 Clicks" button, and a "Fetch Stats" button. There's a real-time event log at the bottom showing every request as it happens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltcra6arhrnu9eu54id0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltcra6arhrnu9eu54id0.jpeg" alt="Demo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The whole demo is a single HTML file with no build step. CSS uses a Fraunces serif display font paired with JetBrains Mono, an orange accent on a dark grid background, and a noise overlay for texture. It looks like a developer tool, not a generic landing page.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging That Taught Me the Most
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HTTP API v2 payload format is different.&lt;/strong&gt; I had a shared &lt;code&gt;log_event()&lt;/code&gt; helper that read &lt;code&gt;event.get("httpMethod")&lt;/code&gt; from API Gateway. That worked fine in REST API v1, but in HTTP API v2 the method is at &lt;code&gt;event["requestContext"]["http"]["method"]&lt;/code&gt;. The result was that every log entry showed &lt;code&gt;"event_type": "sqs"&lt;/code&gt; even for HTTP requests, because &lt;code&gt;httpMethod&lt;/code&gt; was missing and the default was "sqs". Subtle bug, easy to miss until you're trying to debug something else.&lt;/p&gt;
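
&lt;p&gt;The fix was a helper that checks both payload formats, simplified here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def http_method(event):
    # REST API / payload format v1.0 puts the method at the top level.
    if "httpMethod" in event:
        return event["httpMethod"]
    # HTTP API payload format v2.0 nests it under requestContext;
    # non-HTTP events (like SQS) fall through to None.
    return event.get("requestContext", {}).get("http", {}).get("method")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;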

&lt;p&gt;&lt;strong&gt;Lambda cold starts cause race conditions in test scripts.&lt;/strong&gt; My test script does &lt;code&gt;POST /shorten&lt;/code&gt; immediately followed by &lt;code&gt;GET /{code}&lt;/code&gt;. If the redirect Lambda is cold, the first GET happens before the Lambda has finished initializing, and somehow this causes API Gateway to return a 404 without invoking the Lambda. I confirmed this by checking CloudWatch logs: no log entries for the failed requests. Adding a 1-2 second delay between the shorten and the first redirect fixed it. In production this isn't an issue because Lambdas stay warm under continuous load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partial batch failure requires opt-in.&lt;/strong&gt; I assumed SQS → Lambda would handle failures at the message level by default. It doesn't. You have to explicitly set &lt;code&gt;function_response_types = ["ReportBatchItemFailures"]&lt;/code&gt; on the event source mapping AND return the failed message IDs in the right format from your Lambda. Without both, one bad message fails the entire batch and you get retry storms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared code in Lambda is harder than it should be.&lt;/strong&gt; Python doesn't have a clean way to share modules between Lambda functions without using Layers (which add deployment complexity). I ended up using a Terraform &lt;code&gt;null_resource&lt;/code&gt; with &lt;code&gt;local-exec&lt;/code&gt; to copy &lt;code&gt;src/shared/&lt;/code&gt; into each function's source directory before packaging. Hacky but effective.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Running this pipeline with 1 million requests per month costs approximately $3.60. That includes API Gateway, all four Lambdas, DynamoDB on-demand, SQS, and CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;For context, my EKS cluster from Projects 3, 4, and 5 costs about $213 per month, and that's whether it's serving zero requests or a million. Serverless is genuinely cheaper for event-driven workloads, especially during development when traffic is sporadic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for My Portfolio
&lt;/h2&gt;

&lt;p&gt;Before this project, my portfolio was container-heavy. Five projects on EKS, ECS, Helm, ArgoCD. Strong on Kubernetes, weak on serverless. This adds a completely different dimension: event-driven architecture, async processing, NoSQL design, reusable IaC modules, and security patterns specific to AWS managed services.&lt;/p&gt;

&lt;p&gt;In an interview, if someone asks "tell me about a serverless project," I now have a 90-second answer that hits async decoupling, atomic operations, partial batch failures, dead letter queues, and least-privilege IAM. And I have a live demo URL they can click.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt;: &lt;a href="https://github.com/augusthottie/aws-serverless-event-pipeline" rel="noopener noreferrer"&gt;aws-serverless-event-pipeline&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live demo&lt;/strong&gt;: &lt;a href="https://augusthottie.github.io/aws-serverless-event-pipeline/" rel="noopener noreferrer"&gt;augusthottie.com/serverless-demo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 5 (Loki)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-added-log-aggregation-to-my-eks-observability-stack-metrics-logs-in-one-dashboard-3i3a"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 4 (Observability)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-added-prometheus-grafana-and-custom-alerting-to-my-eks-cluster-heres-how-observability-59n3"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 3 (GitOps on EKS)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-set-up-gitops-on-eks-with-argocd-heres-what-kubernetes-actually-looks-like-in-production-1961"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio&lt;/strong&gt;: &lt;a href="https://www.augusthottie.com/" rel="noopener noreferrer"&gt;augusthottie.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Six projects deep into my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect on &lt;a href="https://www.linkedin.com/in/jessica-chioma-chimex/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, I'd love to hear what you're building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Added Log Aggregation to My EKS Observability Stack, Metrics + Logs in One Dashboard</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Wed, 01 Apr 2026 17:32:43 +0000</pubDate>
      <link>https://forem.com/augusthottie/i-added-log-aggregation-to-my-eks-observability-stack-metrics-logs-in-one-dashboard-3i3a</link>
      <guid>https://forem.com/augusthottie/i-added-log-aggregation-to-my-eks-observability-stack-metrics-logs-in-one-dashboard-3i3a</guid>
      <description>&lt;p&gt;Last week I built an observability stack with Prometheus, Grafana, and custom alerting on EKS. The LinkedIn post got more engagement than anything I'd posted before, and two comments suggested the same thing: "Integrate Loki for logs."&lt;/p&gt;

&lt;p&gt;They were right. Metrics tell you &lt;em&gt;that&lt;/em&gt; something is wrong. Logs tell you &lt;em&gt;why&lt;/em&gt;. Without both in the same place, you're switching between &lt;code&gt;kubectl logs&lt;/code&gt; and Grafana dashboards trying to correlate timestamps manually. That's not a workflow, that's a scavenger hunt.&lt;/p&gt;

&lt;p&gt;So I added Loki.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Added
&lt;/h2&gt;

&lt;p&gt;Loki and Promtail, deployed via ArgoCD alongside the existing Prometheus stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Promtail&lt;/strong&gt; runs as a DaemonSet on every node, tailing container logs from &lt;code&gt;/var/log/pods&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loki&lt;/strong&gt; stores and indexes the logs, queryable via LogQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; gets a new "Logs &amp;amp; Metrics Correlation" dashboard with metrics and logs side by side&lt;/li&gt;
&lt;li&gt;A new &lt;strong&gt;Loki datasource&lt;/strong&gt; in Grafana so both Prometheus and Loki are available in the same dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire addition was three files: an ArgoCD application for the Loki stack, a Grafana datasource ConfigMap, and a logs dashboard ConfigMap. Push to main, ArgoCD syncs, done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Logs &amp;amp; Metrics Correlation Dashboard
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7b1wifp0u7laedju0k2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7b1wifp0u7laedju0k2.png" alt="Loki" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the dashboard I wish I'd had from the start. Seven panels in five rows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 1: Metrics&lt;/strong&gt;: API Request Rate and Error Rate from Prometheus. See the traffic pattern and spot anomalies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 2: API Logs&lt;/strong&gt;: Live log stream from the gitops-api containers via Loki. When you see a spike in the metrics above, scroll down and the logs from that exact time range are right there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 3: Infrastructure Logs&lt;/strong&gt;: PostgreSQL logs on the left, Redis logs on the right. Database checkpoint warnings, connection events, cache operations, all visible without running &lt;code&gt;kubectl logs&lt;/code&gt; across multiple pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 4: Error Logs&lt;/strong&gt;: A filtered view showing only lines matching &lt;code&gt;error&lt;/code&gt;, &lt;code&gt;fail&lt;/code&gt;, &lt;code&gt;panic&lt;/code&gt;, &lt;code&gt;crash&lt;/code&gt;, or &lt;code&gt;exception&lt;/code&gt; across all containers. This is the "something is broken, show me what" panel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 5: Log Volume&lt;/strong&gt;: Lines per second per container. A sudden spike in log volume often means something is throwing errors in a loop.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;time-synced panels&lt;/strong&gt;. When you drag to select a time range on the metrics graph, the log panels update to show logs from that exact window. That's the metric-to-log correlation workflow: see a spike, select the time range, read the logs. Root cause in under two minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  LogQL: The Query Language
&lt;/h2&gt;

&lt;p&gt;If you know PromQL, LogQL feels familiar. Stream selectors use curly braces like Prometheus label matchers:&lt;/p&gt;

&lt;p&gt;All logs from the three-tier namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;namespace=&lt;/span&gt;&lt;span class="s2"&gt;"three-tier"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just the API container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;namespace=&lt;/span&gt;&lt;span class="s2"&gt;"three-tier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;container=&lt;/span&gt;&lt;span class="s2"&gt;"gitops-api"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filter for errors using a pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{namespace="three-tier"} |~ "(?i)(error|fail|panic|crash|exception)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log volume as a metric (for the timeseries panel):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate({namespace="three-tier"}[5m])) by (container)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last one is interesting: &lt;code&gt;rate()&lt;/code&gt; on a log stream gives you lines per second, which you can graph just like a Prometheus metric. Useful for spotting error storms.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Went Wrong (And What I Learned)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Node Capacity
&lt;/h3&gt;

&lt;p&gt;Loki wouldn't schedule. The two t3.medium nodes were already running the app, Prometheus, Grafana, Alertmanager, Node Exporter, kube-state-metrics, ArgoCD, cert-manager, and the LB controller. Too many pods. I had to scale the node group to 3 nodes before Loki could start.&lt;/p&gt;

&lt;p&gt;This is something you don't think about until it happens: t3.medium supports around 17 pods per node, and a monitoring stack eats through that fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  EBS CSI Driver (Again)
&lt;/h3&gt;

&lt;p&gt;The Loki PVC was stuck in Pending even after adding a third node. The EBS CSI driver's IAM role still had the old cluster's OIDC provider URL. Third time hitting this issue — by now the fix is muscle memory: delete the IAM service account, recreate it, reinstall the addon with &lt;code&gt;--resolve-conflicts OVERWRITE&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grafana Datasource Provisioning
&lt;/h3&gt;

&lt;p&gt;The Loki datasource ConfigMap existed in Kubernetes but Grafana's sidecar didn't pick it up. After a restart, the Prometheus datasource also disappeared. I ended up adding both datasources manually through the Grafana UI.&lt;/p&gt;

&lt;p&gt;The lesson: Grafana's sidecar provisioning is convenient when it works, but when it doesn't, just add datasources manually and move on. The dashboards are what matter, not how the datasource was configured.&lt;/p&gt;

&lt;h3&gt;
  
  
  Curly Quotes in Dashboard JSON
&lt;/h3&gt;

&lt;p&gt;When importing dashboard JSON through Grafana's UI, the PromQL queries ended up with corrupted quotes: curly "smart" quotes instead of straight ones. Every panel showed a parse error. The fix was to edit each panel and retype the query by hand.&lt;/p&gt;

&lt;p&gt;This is a subtle one. If you copy-paste JSON through a text editor or chat that auto-converts quotes, your Grafana panels will break silently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Logs Complete the Observability Story
&lt;/h2&gt;

&lt;p&gt;With just Prometheus, my monitoring answer was: "The error rate spiked at 3:42 PM." With Loki added, it becomes: "The error rate spiked at 3:42 PM because PostgreSQL was restarting after an OOM kill, here are the exact log lines."&lt;/p&gt;

&lt;p&gt;That's the difference between detecting a problem and diagnosing it. In an interview, being able to describe a workflow that goes from alert → metric → log → root cause shows you've actually operated production systems, not just set up dashboards.&lt;/p&gt;

&lt;p&gt;The full observability stack now covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Prometheus + custom application instrumentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt;: Loki + Promtail collecting from every container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting&lt;/strong&gt;: PrometheusRules with 9 custom alerts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards&lt;/strong&gt;: Two Grafana dashboards, one for metrics, one for metric-to-log correlation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All deployed via GitOps&lt;/strong&gt;: ArgoCD managing four applications from a single repo&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt;: &lt;a href="https://github.com/augusthottie/gitops-eks" rel="noopener noreferrer"&gt;gitops-eks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 4 (Observability)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-added-prometheus-grafana-and-custom-alerting-to-my-eks-cluster-heres-how-observability-59n3"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 3 (GitOps on EKS)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-set-up-gitops-on-eks-with-argocd-heres-what-kubernetes-actually-looks-like-in-production-1961"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio&lt;/strong&gt;: &lt;a href="https://www.augusthottie.com/" rel="noopener noreferrer"&gt;augusthottie.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Building my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect on &lt;a href="https://www.linkedin.com/in/jessica-chioma-chimex/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, I'd love to hear what you're building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>observability</category>
      <category>terraform</category>
    </item>
    <item>
      <title>I Added Prometheus, Grafana, and Custom Alerting to My EKS Cluster, Here's How Observability Actually Works</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Wed, 25 Mar 2026 11:41:47 +0000</pubDate>
      <link>https://forem.com/augusthottie/i-added-prometheus-grafana-and-custom-alerting-to-my-eks-cluster-heres-how-observability-59n3</link>
      <guid>https://forem.com/augusthottie/i-added-prometheus-grafana-and-custom-alerting-to-my-eks-cluster-heres-how-observability-59n3</guid>
      <description>&lt;p&gt;After building three projects: a CI/CD pipeline, a 3-tier architecture, and GitOps on EKS, I had one obvious gap: observability. I could deploy things, but I couldn't answer "is it healthy?" beyond checking if pods were running.&lt;/p&gt;

&lt;p&gt;"How do you monitor your services?" is an interview question I wasn't ready for. I'd used Grafana dashboards other people built. I'd looked at CloudWatch metrics someone else configured. But I'd never instrumented an application, written PromQL queries, or set up alerting rules from scratch.&lt;/p&gt;

&lt;p&gt;So I did all of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I took the GitOps EKS project from last week and added a complete observability layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instrumented the Node.js API with &lt;code&gt;prom-client&lt;/code&gt;, 7 custom metrics covering HTTP requests, database queries, cache operations, and connection pools&lt;/li&gt;
&lt;li&gt;Deployed kube-prometheus-stack (Prometheus, Grafana, Alertmanager, Node Exporter, kube-state-metrics) via ArgoCD&lt;/li&gt;
&lt;li&gt;Built a 9-panel Grafana dashboard showing request rate, error rate, latency, cache hit/miss ratio, DB query performance, and pod resources&lt;/li&gt;
&lt;li&gt;Wrote 9 custom alert rules across API health, database performance, cache efficiency, and pod stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything deployed via GitOps, push to main, ArgoCD syncs the monitoring stack and custom configs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting the Application
&lt;/h2&gt;

&lt;p&gt;The first step was making the app emit metrics. I added &lt;code&gt;prom-client&lt;/code&gt; and created a metrics module with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP middleware&lt;/strong&gt; that wraps every request, tracking method, route, status code, and duration. The &lt;code&gt;/metrics&lt;/code&gt; endpoint itself is excluded so Prometheus scraping doesn't inflate the numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database helpers&lt;/strong&gt; that time every query and track success/failure by operation type (select, insert, delete). This means I can see not just "is the database slow?" but "are inserts slower than selects?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache tracking&lt;/strong&gt; on every Redis operation: get (hit or miss), set, and invalidate. This shows whether the caching layer is actually working or if every request hits the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection pool gauge&lt;/strong&gt; that samples active database connections every 5 seconds. When this approaches the pool limit (10), something is holding connections open.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/metrics&lt;/code&gt; endpoint exposes everything in Prometheus text format. Hit it once and you get counters, histograms, and gauges: about 100 lines of metrics per scrape.&lt;/p&gt;

&lt;h2&gt;
  
  
  ServiceMonitor: The Right Way to Scrape
&lt;/h2&gt;

&lt;p&gt;My first attempt used &lt;code&gt;additionalScrapeConfigs&lt;/code&gt; in the Prometheus values, a raw scrape config injected into the Prometheus config. It didn't work. The operator didn't pick it up, and debugging why was a dead end.&lt;/p&gt;

&lt;p&gt;The correct approach is a &lt;strong&gt;ServiceMonitor&lt;/strong&gt; — a Kubernetes CRD that tells the Prometheus operator what to scrape. It uses label selectors to find Services and endpoints automatically. Mine looks for any Service with &lt;code&gt;app: gitops-api&lt;/code&gt; in the &lt;code&gt;three-tier&lt;/code&gt; namespace, scrapes port &lt;code&gt;http&lt;/code&gt; on path &lt;code&gt;/metrics&lt;/code&gt; every 15 seconds.&lt;/p&gt;

&lt;p&gt;One detail that took me longer than I'd like to admit: the Service needs a &lt;strong&gt;named port&lt;/strong&gt;. Not just &lt;code&gt;port: 80&lt;/code&gt; but &lt;code&gt;name: http, port: 80&lt;/code&gt;. The ServiceMonitor references the port by name, and without it, Prometheus silently ignores the target.&lt;/p&gt;
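
&lt;p&gt;For illustration, the relevant part of such a Service (the &lt;code&gt;targetPort&lt;/code&gt; here is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: gitops-api
  namespace: three-tier
  labels:
    app: gitops-api
spec:
  selector:
    app: gitops-api
  ports:
    - name: http      # the ServiceMonitor matches this name, not the number
      port: 80
      targetPort: 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;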

&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2s6tzpt636k4us9k7rm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2s6tzpt636k4us9k7rm.png" alt="Grafana-Dashboard" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I built the dashboard as a JSON ConfigMap deployed via ArgoCD. Nine panels in three rows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 1: HTTP layer:&lt;/strong&gt;&lt;br&gt;
Request Rate (per route), Error Rate (percentage of 5xx responses), P95 Latency (per route). These tell you if the API is serving traffic, how much is failing, and how fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 2: Data layer:&lt;/strong&gt;&lt;br&gt;
Requests by Status (pie chart showing 200/201/404/503 distribution), Cache Hit/Miss (pie chart — green means Redis is working), DB Query Duration (p95 for inserts vs selects), DB Active Connections (gauge per pod, 0-10 scale with yellow at 6, red at 8).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 3: Infrastructure:&lt;/strong&gt;&lt;br&gt;
DB Queries per Second (insert and select rates), Pod Memory Usage, Pod CPU Usage. These show whether the workload needs more resources.&lt;/p&gt;

&lt;p&gt;The moment all 9 panels lit up with real data was genuinely satisfying. The error rate panel showed a real 503 spike from when PostgreSQL was still starting; that's not test data, that's the actual system behavior captured in metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PromQL Behind Each Panel
&lt;/h2&gt;

&lt;p&gt;For anyone building their own dashboard, here are the exact queries:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request Rate&lt;/strong&gt;: requests per second, broken down by route:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Error Rate&lt;/strong&gt;: percentage of 5xx responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=~&lt;/span&gt;&lt;span class="s2"&gt;"5.."&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="mi"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; 
&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;P95 Latency&lt;/strong&gt;: 95th percentile response time per route:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nb"&gt;histogram_quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http_request_duration_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;le&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache Hit/Miss&lt;/strong&gt;: Redis cache effectiveness over the last hour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;increase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_operations_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"get"&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="mi"&gt;1h&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DB Query Duration (p95)&lt;/strong&gt;: insert vs select latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nb"&gt;histogram_quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_query_duration_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;le&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DB Queries per Second&lt;/strong&gt;: operation throughput:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_queries_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pod Memory&lt;/strong&gt;: working set per container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;container_memory_working_set_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"three-tier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pod CPU&lt;/strong&gt;: usage rate per pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container_cpu_usage_seconds_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"three-tier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="mi"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key thing to understand: &lt;code&gt;rate()&lt;/code&gt; calculates per-second averages over a window, while &lt;code&gt;increase()&lt;/code&gt; gives you the total count over a window. Use &lt;code&gt;rate()&lt;/code&gt; for time-series graphs and &lt;code&gt;increase()&lt;/code&gt; for pie charts and totals. And &lt;code&gt;histogram_quantile()&lt;/code&gt; is how you get percentiles from histogram buckets; you can't just average latency and get useful numbers.&lt;/p&gt;
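&lt;p&gt;To make that last point concrete, here's a minimal Python sketch of the bucket interpolation that &lt;code&gt;histogram_quantile()&lt;/code&gt; performs. The bucket boundaries, counts, and the &lt;code&gt;_sum&lt;/code&gt;/&lt;code&gt;_count&lt;/code&gt; values are invented for illustration; the point is that the mean sits far below the p95, which is exactly why averaging hides tail latency:&lt;/p&gt;

```python
# Cumulative histogram buckets: (upper bound "le" in seconds, cumulative count).
# These numbers are invented for illustration.
buckets = [(0.05, 60), (0.1, 80), (0.5, 96), (1.0, 100)]

def quantile_from_buckets(q, buckets):
    """Estimate the q-quantile by linear interpolation inside the bucket
    that contains the q-th observation (the same idea histogram_quantile uses)."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # Interpolate between the bucket's lower and upper bound
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

p95 = quantile_from_buckets(0.95, buckets)
mean = 12.0 / 100  # hypothetical _sum / _count over the same window
print(f"p95 = {p95:.3f}s, mean = {mean:.2f}s")  # p95 = 0.475s, mean = 0.12s
```

&lt;p&gt;The mean says 120ms; the p95 says 475ms. Only the histogram tells you what the slowest users actually experience.&lt;/p&gt;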

&lt;h2&gt;
  
  
  Alert Rules
&lt;/h2&gt;

&lt;p&gt;I wrote 9 PrometheusRules covering the scenarios I'd actually want to be woken up for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Health:&lt;/strong&gt; Error rate above 5% (critical), P95 latency above 1 second (warning), metrics endpoint unreachable (critical). The error rate alert uses a 2-minute &lt;code&gt;for&lt;/code&gt; duration so a single failed request doesn't trigger it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database:&lt;/strong&gt; Query error rate above 1% (critical), P95 query time above 500ms (warning), connection pool above 8/10 active (warning). The connection pool alert is the early warning: if you're at 80% capacity, the next traffic spike will exhaust it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache:&lt;/strong&gt; Miss rate above 80% for 10 minutes (warning). A high miss rate means either Redis is down, the cache TTL is too short, or the data is never being cached. The 10-minute window avoids alerting during cold starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods:&lt;/strong&gt; Crash-looping (more than 3 restarts in 15 minutes), memory above 85% of limit. Crash loops are critical because they mean the service is fundamentally broken. Memory warnings give you time to increase limits before OOMKills start.&lt;/p&gt;
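&lt;p&gt;As a concrete example, the API error rate alert looks roughly like this as a PrometheusRule (a sketch, not my exact rule; the metric names match the queries above, and the group and label names are assumptions):&lt;/p&gt;

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: three-tier
spec:
  groups:
    - name: api-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) * 100 > 5
          for: 2m          # a single failed request won't fire the alert
          labels:
            severity: critical
          annotations:
            summary: "API 5xx error rate above 5% for 2 minutes"
```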

&lt;h2&gt;
  
  
  The Debugging That Taught Me the Most
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The OIDC mismatch.&lt;/strong&gt; I reused the EKS Terraform from Project 3, but the EBS CSI driver's IAM role still had the old cluster's OIDC provider URL. Every &lt;code&gt;AssumeRoleWithWebIdentity&lt;/code&gt; call failed with AccessDenied, but the error doesn't say "wrong OIDC provider"; it just says "not authorized." I had to compare the role's trust policy against the current cluster's OIDC issuer to find the mismatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The empty service account annotation.&lt;/strong&gt; After reinstalling the AWS Load Balancer Controller with Helm, the service account had &lt;code&gt;eks.amazonaws.io/role-arn: ""&lt;/code&gt;, an empty string instead of the actual ARN. The controller fell back to the node role, which didn't have ELB permissions. A &lt;code&gt;kubectl annotate --overwrite&lt;/code&gt; fixed it, but I only found it by checking the SA's YAML directly.&lt;/p&gt;
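&lt;p&gt;The healthy state looks like this (the role ARN and account ID below are placeholders; check your own with &lt;code&gt;kubectl get sa -n kube-system aws-load-balancer-controller -o yaml&lt;/code&gt;):&lt;/p&gt;

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    # Must be the real IRSA role ARN, never an empty string
    eks.amazonaws.io/role-arn: arn:aws:iam::111122223333:role/lb-controller-irsa
```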

&lt;p&gt;&lt;strong&gt;ServiceMonitor vs additionalScrapeConfigs.&lt;/strong&gt; I spent time trying to make &lt;code&gt;additionalScrapeConfigs&lt;/code&gt; work before learning that the Prometheus operator intentionally manages config through CRDs. &lt;code&gt;ServiceMonitor&lt;/code&gt; is the right abstraction: it's declarative, it uses label selectors, and the operator reconciles it automatically.&lt;/p&gt;
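&lt;p&gt;For reference, a minimal &lt;code&gt;ServiceMonitor&lt;/code&gt; looks like this (a sketch; the &lt;code&gt;app&lt;/code&gt; label and port name are assumptions, and the &lt;code&gt;release&lt;/code&gt; label must match whatever your Prometheus operator's &lt;code&gt;serviceMonitorSelector&lt;/code&gt; expects):&lt;/p&gt;

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: three-tier-api
  namespace: three-tier
  labels:
    release: kube-prometheus-stack  # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: notes-api                # assumed label on the API's Service
  endpoints:
    - port: http                    # named port on the Service
      path: /metrics
      interval: 30s
```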

&lt;h2&gt;
  
  
  Why This Matters for Interviews
&lt;/h2&gt;

&lt;p&gt;Before this project, my monitoring answer was "we used Grafana and Prometheus." Now I can explain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to instrument an application with counters (requests), histograms (latency), and gauges (connections)&lt;/li&gt;
&lt;li&gt;Why &lt;code&gt;rate()&lt;/code&gt; over &lt;code&gt;increase()&lt;/code&gt; for dashboards, and why &lt;code&gt;histogram_quantile()&lt;/code&gt; for latency percentiles&lt;/li&gt;
&lt;li&gt;How ServiceMonitors work with the Prometheus operator for service discovery&lt;/li&gt;
&lt;li&gt;What alerts are worth setting up and why each threshold was chosen&lt;/li&gt;
&lt;li&gt;How to deploy and manage a monitoring stack via GitOps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When an interviewer asks "how would you know if your service is having issues?", I have a 9-panel dashboard and 9 alert rules to point to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt;: &lt;a href="https://github.com/augusthottie/gitops-eks" rel="noopener noreferrer"&gt;gitops-eks&lt;/a&gt; (includes observability stack)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 3 (GitOps on EKS)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-set-up-gitops-on-eks-with-argocd-heres-what-kubernetes-actually-looks-like-in-production-1961"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 2 (3-Tier Architecture)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-built-a-3-tier-aws-architecture-with-terraform-modules-ecs-fargate-rds-and-elasticache-pj3"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 1 (CI/CD Pipeline)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-built-a-full-aws-cicd-pipeline-with-bluegreen-deployments-heres-everything-i-learned-5899"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio&lt;/strong&gt;: &lt;a href="https://www.augusthottie.com/" rel="noopener noreferrer"&gt;augusthottie.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Building my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect on &lt;a href="https://www.linkedin.com/in/jessica-chioma-chimex/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;; I'd love to hear what you're building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>grafana</category>
      <category>prometheus</category>
      <category>kubernetes</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>I Set Up GitOps on EKS with ArgoCD, Here's What Kubernetes Actually Looks Like in Production</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Tue, 17 Mar 2026 13:18:56 +0000</pubDate>
      <link>https://forem.com/augusthottie/i-set-up-gitops-on-eks-with-argocd-heres-what-kubernetes-actually-looks-like-in-production-1961</link>
      <guid>https://forem.com/augusthottie/i-set-up-gitops-on-eks-with-argocd-heres-what-kubernetes-actually-looks-like-in-production-1961</guid>
      <description>&lt;p&gt;I had a Kubernetes problem. I could talk about it, reference it on my resume, and deploy to existing clusters. But I'd never provisioned one from scratch, written a Helm chart, or set up GitOps. That gap showed, and I knew interviewers could smell it.&lt;/p&gt;

&lt;p&gt;So I built the whole thing: EKS cluster with Terraform, custom Helm chart, ArgoCD for GitOps, and a real application that talks to PostgreSQL and Redis. Push to &lt;code&gt;main&lt;/code&gt;, ArgoCD syncs, pods roll out. No &lt;code&gt;kubectl apply&lt;/code&gt; in sight. This is what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A GitOps pipeline where Git is the only way deployments happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push → GitHub → ArgoCD (polls every 3m) → Helm sync → EKS cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The application is a Notes API running on Bun/Express with PostgreSQL for persistence and Redis for caching, both running as pods inside the cluster. Two API replicas sit behind an ALB provisioned automatically by the AWS Load Balancer Controller from a Kubernetes Ingress resource.&lt;/p&gt;

&lt;p&gt;Every response includes a &lt;code&gt;pod&lt;/code&gt; field showing which replica served the request. Hit the endpoint twice and you'll see different pod names: load balancing in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Provisioning EKS with Terraform
&lt;/h2&gt;

&lt;p&gt;I wrote two Terraform modules: one for the VPC and one for the EKS cluster.&lt;/p&gt;

&lt;p&gt;The VPC module creates public and private subnets across two AZs, with a NAT Gateway for outbound traffic from private subnets. The critical detail for EKS: every subnet needs specific tags so Kubernetes knows which subnets to use for load balancers.&lt;/p&gt;

&lt;p&gt;Public subnets get &lt;code&gt;kubernetes.io/role/elb = 1&lt;/code&gt; (for internet-facing ALBs). Private subnets get &lt;code&gt;kubernetes.io/role/internal-elb = 1&lt;/code&gt;. Miss these tags and the AWS Load Balancer Controller silently fails to create ALBs.&lt;/p&gt;

&lt;p&gt;The EKS module creates the cluster, a managed node group (2x t3.medium), an OIDC provider for IRSA, and IAM roles for both the nodes and the LB controller.&lt;/p&gt;

&lt;p&gt;27 resources. &lt;code&gt;terraform apply&lt;/code&gt; takes about 15-20 minutes because EKS clusters are slow to provision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing the Helm Chart
&lt;/h2&gt;

&lt;p&gt;This was the part I had the least experience with. The chart has 8 templates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Namespace&lt;/strong&gt;: isolates everything in &lt;code&gt;three-tier&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ConfigMap&lt;/strong&gt;: database host, Redis host, app version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret&lt;/strong&gt;: database password&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;: API with 2 replicas, health checks, resource limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service&lt;/strong&gt;: ClusterIP exposing port 80 internally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress&lt;/strong&gt;: annotated for the AWS Load Balancer Controller to create an ALB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL StatefulSet&lt;/strong&gt;: with a PersistentVolumeClaim for data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis Deployment&lt;/strong&gt;: ephemeral cache with LRU eviction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything configurable lives in &lt;code&gt;values.yaml&lt;/code&gt;. Want 3 replicas? Change one number. Different image tag? One line. That's the power of Helm!&lt;/p&gt;
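&lt;p&gt;A trimmed &lt;code&gt;values.yaml&lt;/code&gt; sketch (the key names are assumptions based on the templates above, not the chart's exact schema):&lt;/p&gt;

```yaml
app:
  version: "1.0.0"       # surfaced to the API via the ConfigMap
replicaCount: 2          # want 3 replicas? change this one number
image:
  repository: notes-api  # ECR repository name is a placeholder
  tag: "1.0.0"           # different image? one line
resources:
  limits:
    memory: 256Mi
    cpu: 250m
```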

&lt;h2&gt;
  
  
  ArgoCD: The GitOps Engine
&lt;/h2&gt;

&lt;p&gt;ArgoCD watches a Git repo and makes the cluster match what's in Git. The application definition is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/augusthottie/gitops-eks&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm/three-tier-app&lt;/span&gt;

  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;      &lt;span class="c1"&gt;# Delete resources removed from Git&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# Revert manual kubectl changes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;prune: true&lt;/code&gt; means if you delete a template from the Helm chart and push, ArgoCD deletes the corresponding resource from the cluster. &lt;code&gt;selfHeal: true&lt;/code&gt; means if someone runs &lt;code&gt;kubectl edit&lt;/code&gt; to change something manually, ArgoCD reverts it back to what Git says. Git is the source of truth, always!&lt;/p&gt;

&lt;p&gt;The ArgoCD UI is beautiful. You get a tree view of every resource, their health status, sync status, and the Git commit that triggered each sync.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq721t9dt6labogrrhg18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq721t9dt6labogrrhg18.png" alt="HELM CHART"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems I Hit (All of Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  EBS CSI Driver
&lt;/h3&gt;

&lt;p&gt;My PostgreSQL StatefulSet stayed in &lt;code&gt;Pending&lt;/code&gt; forever. The PersistentVolumeClaim couldn't bind because EKS doesn't include the EBS CSI driver by default; it's an addon you install separately with its own IAM role.&lt;/p&gt;

&lt;p&gt;The fix: install the &lt;code&gt;aws-ebs-csi-driver&lt;/code&gt; addon via &lt;code&gt;aws eks create-addon&lt;/code&gt;. But even that failed initially because &lt;code&gt;eksctl create iamserviceaccount&lt;/code&gt; created a service account that conflicted with the addon. I had to delete and recreate with &lt;code&gt;--resolve-conflicts OVERWRITE&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  IRSA Not Working for LB Controller
&lt;/h3&gt;

&lt;p&gt;The AWS Load Balancer Controller kept failing with "AccessDenied" errors, but the IAM policy was correct. The problem: it was using the node role instead of the IRSA role. The service account had the right annotation, but the pods were created before the annotation was applied.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;kubectl rollout restart deployment/aws-load-balancer-controller&lt;/code&gt;. The new pods picked up the service account annotation and used the correct IRSA role.&lt;/p&gt;

&lt;h3&gt;
  
  
  ArgoCD CRD Too Large
&lt;/h3&gt;

&lt;p&gt;Installing ArgoCD with &lt;code&gt;kubectl apply&lt;/code&gt; failed because the &lt;code&gt;applicationsets&lt;/code&gt; CRD exceeded the annotation size limit for client-side apply. The fix: &lt;code&gt;--server-side=true --force-conflicts&lt;/code&gt;. This is a known ArgoCD issue that everyone hits.&lt;/p&gt;

&lt;h3&gt;
  
  
  ConfigMap Changes Don't Restart Pods
&lt;/h3&gt;

&lt;p&gt;I changed the app version in &lt;code&gt;values.yaml&lt;/code&gt;, pushed to GitHub, and ArgoCD synced the ConfigMap, but the pods kept serving the old version. Kubernetes doesn't restart pods when a ConfigMap changes; you need a checksum annotation on the pod template that changes whenever the ConfigMap does, forcing a rollout.&lt;/p&gt;
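&lt;p&gt;The standard Helm trick (the "automatically roll deployments" pattern from the Helm docs) is to hash the ConfigMap template into a pod annotation, so any ConfigMap change changes the pod spec and triggers a rollout. A sketch, assuming the ConfigMap template is named &lt;code&gt;configmap.yaml&lt;/code&gt;:&lt;/p&gt;

```yaml
# In the Deployment template's pod metadata
spec:
  template:
    metadata:
      annotations:
        # Re-renders to a new hash whenever configmap.yaml changes
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
```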

&lt;h3&gt;
  
  
  exec format error (Again)
&lt;/h3&gt;

&lt;p&gt;Same issue as Project 2: I built the Docker image on my Mac (ARM), but the Kubernetes nodes are x86_64. The fix is &lt;code&gt;--platform linux/amd64&lt;/code&gt; on every build. I'll never forget this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the GitOps Flow
&lt;/h2&gt;

&lt;p&gt;The demo that proves everything works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Change &lt;code&gt;app.version&lt;/code&gt; in &lt;code&gt;values.yaml&lt;/code&gt; from &lt;code&gt;1.0.0&lt;/code&gt; to &lt;code&gt;1.1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;git push origin main&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Wait for ArgoCD to sync (up to 3 minutes)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;curl /info&lt;/code&gt; returns &lt;code&gt;"version": "1.1.0"&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No CI/CD pipeline. No deployment commands. No SSH. Git push is the deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Before this project, my Kubernetes answer in interviews was... vague. I'd deployed to existing clusters, used Helm to install other people's charts, and read a lot of docs. But I hadn't actually provisioned a cluster, written my own Helm chart, or set up GitOps from scratch.&lt;/p&gt;

&lt;p&gt;Now I can talk about EKS provisioning, OIDC providers, IRSA, and why your node group needs specific IAM policies. I can explain why StatefulSets exist (PostgreSQL needs stable storage) and why Deployments don't care (Redis can lose its data). I know that the EBS CSI driver isn't installed by default and that it will silently break your PVCs if you forget it.&lt;/p&gt;

&lt;p&gt;Every problem I hit (the exec format error, the IRSA annotation not being picked up, ArgoCD CRDs too large for client-side apply) is a real production issue, not a textbook scenario. It's the kind of stuff that comes up in interviews when someone asks "tell me about a time you debugged something in Kubernetes."&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Warning
&lt;/h2&gt;

&lt;p&gt;EKS is expensive for learning: ~$181/month (control plane $73, nodes $60, NAT $32, ALB $16). Always &lt;code&gt;terraform destroy&lt;/code&gt; when you're not working. You can bring it back in 20 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt;: &lt;a href="https://github.com/augusthottie/gitops-eks" rel="noopener noreferrer"&gt;gitops-eks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 1 (CI/CD Pipeline)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-built-a-full-aws-cicd-pipeline-with-bluegreen-deployments-heres-everything-i-learned-5899"&gt;Part 1&lt;/a&gt; | &lt;a href="https://dev.to/augusthottie/i-broke-my-aws-pipeline-on-purpose-and-codified-everything-in-terraform-n7k"&gt;Part 2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 2 (3-Tier Architecture)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-built-a-3-tier-aws-architecture-with-terraform-modules-ecs-fargate-rds-and-elasticache-pj3"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio&lt;/strong&gt;: &lt;a href="https://www.augusthottie.com/" rel="noopener noreferrer"&gt;augusthottie.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Building my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect on &lt;a href="https://www.linkedin.com/in/jessica-chioma-chimex/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;; I'd love to hear what you're building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
      <category>terraform</category>
    </item>
    <item>
      <title>I Built a 3-Tier AWS Architecture with Terraform Modules, ECS Fargate, RDS, and ElastiCache</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Tue, 10 Mar 2026 15:08:54 +0000</pubDate>
      <link>https://forem.com/augusthottie/i-built-a-3-tier-aws-architecture-with-terraform-modules-ecs-fargate-rds-and-elasticache-pj3</link>
      <guid>https://forem.com/augusthottie/i-built-a-3-tier-aws-architecture-with-terraform-modules-ecs-fargate-rds-and-elasticache-pj3</guid>
      <description>&lt;p&gt;My last project was a CI/CD pipeline with blue/green deployments. It taught me CodeDeploy, CodePipeline, and a lot about IAM. But it ran on EC2 instances in a default VPC, no custom networking, no containers, no database tier.&lt;/p&gt;

&lt;p&gt;This time I wanted to build what companies actually run in production: a 3-tier architecture with proper network isolation, serverless containers, a managed database, and an in-memory cache. All codified in Terraform modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A Node.js API running on ECS Fargate that talks to PostgreSQL (RDS) and Redis (ElastiCache), deployed inside a custom VPC with public and private subnets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet → ALB (public subnets)
              ↓ :3000
         ECS Fargate (private subnets)
         Bun + Express API
              ↓ :5432          ↓ :6379
         RDS PostgreSQL    ElastiCache Redis
         (private subnets) (private subnets)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ALB is the only thing exposed to the internet. ECS, RDS, and Redis all sit in private subnets with no public IP addresses. Each tier's security group only allows traffic from the tier above it. The entire infrastructure is defined in 6 Terraform modules: 37 resources created with one command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Architecture Matters
&lt;/h2&gt;

&lt;p&gt;If you're interviewing for DevOps or cloud engineering roles, "I deployed an app to EC2" doesn't differentiate you. Interviewers want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can you design a VPC from scratch with proper subnet segmentation?&lt;/li&gt;
&lt;li&gt;Do you understand why databases belong in private subnets?&lt;/li&gt;
&lt;li&gt;Can you explain the difference between an Internet Gateway and a NAT Gateway?&lt;/li&gt;
&lt;li&gt;Have you actually worked with ECS Fargate, not just read about it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project answers all of those with working code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Network Layer
&lt;/h2&gt;

&lt;p&gt;This was the foundation everything else depended on. I created a VPC with &lt;code&gt;10.0.0.0/16&lt;/code&gt; split across two availability zones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Subnet&lt;/th&gt;
&lt;th&gt;CIDR&lt;/th&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Internet Access&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Public 1a&lt;/td&gt;
&lt;td&gt;10.0.0.0/20&lt;/td&gt;
&lt;td&gt;ALB, NAT Gateway&lt;/td&gt;
&lt;td&gt;Direct via IGW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public 1b&lt;/td&gt;
&lt;td&gt;10.0.16.0/20&lt;/td&gt;
&lt;td&gt;ALB (multi-AZ)&lt;/td&gt;
&lt;td&gt;Direct via IGW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private 1a&lt;/td&gt;
&lt;td&gt;10.0.32.0/20&lt;/td&gt;
&lt;td&gt;ECS, RDS&lt;/td&gt;
&lt;td&gt;Outbound only via NAT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private 1b&lt;/td&gt;
&lt;td&gt;10.0.48.0/20&lt;/td&gt;
&lt;td&gt;ECS, ElastiCache&lt;/td&gt;
&lt;td&gt;Outbound only via NAT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39em3kxh46n73asfz4w2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39em3kxh46n73asfz4w2.png" alt="VPC FLOW"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key design decision: &lt;strong&gt;everything except the ALB goes in private subnets.&lt;/strong&gt; The ECS tasks need outbound internet access (to pull images from ECR), so they route through a NAT Gateway in the public subnet. But nothing on the internet can reach them directly.&lt;/p&gt;

&lt;p&gt;Each Terraform module is self-contained. The VPC module outputs subnet IDs and the VPC ID. Other modules consume those outputs without knowing anything about how the network is built.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Group Boundaries
&lt;/h2&gt;

&lt;p&gt;This is the part that makes this a real 3-tier architecture, not just "three things in the same VPC." Each tier has its own security group, and the rules enforce strict boundaries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Security Group&lt;/th&gt;
&lt;th&gt;Allows Inbound&lt;/th&gt;
&lt;th&gt;From&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;alb-sg&lt;/td&gt;
&lt;td&gt;TCP 80&lt;/td&gt;
&lt;td&gt;0.0.0.0/0 (the internet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ecs-sg&lt;/td&gt;
&lt;td&gt;TCP 3000&lt;/td&gt;
&lt;td&gt;alb-sg only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rds-sg&lt;/td&gt;
&lt;td&gt;TCP 5432&lt;/td&gt;
&lt;td&gt;ecs-sg only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;redis-sg&lt;/td&gt;
&lt;td&gt;TCP 6379&lt;/td&gt;
&lt;td&gt;ecs-sg only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No security group references a CIDR block except the ALB's; everything else references another security group. This means nothing on the internet can reach an ECS task directly, and if a task is compromised, the only internal services that will accept its traffic are the database and the cache, not other subnets, not other services.&lt;/p&gt;

&lt;p&gt;This is how production environments are designed, and explaining it in an interview immediately signals you understand network security beyond "I opened port 22."&lt;/p&gt;

&lt;h2&gt;
  
  
  ECS Fargate (Containers Without Servers)
&lt;/h2&gt;

&lt;p&gt;I used Fargate instead of EC2 for the compute layer. No instances to patch, no AMIs to maintain, no Auto Scaling Groups to configure. You define a task (CPU, memory, container image, environment variables) and Fargate runs it.&lt;/p&gt;

&lt;p&gt;The task definition connects the app to both RDS and Redis through environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;DB_HOST&lt;/span&gt;     → &lt;span class="n"&gt;RDS&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt; (&lt;span class="n"&gt;injected&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Terraform&lt;/span&gt;)
&lt;span class="n"&gt;DB_PASSWORD&lt;/span&gt; → &lt;span class="n"&gt;Secrets&lt;/span&gt; &lt;span class="n"&gt;Manager&lt;/span&gt; &lt;span class="n"&gt;ARN&lt;/span&gt; (&lt;span class="n"&gt;resolved&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="n"&gt;launch&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;ECS&lt;/span&gt;)
&lt;span class="n"&gt;REDIS_HOST&lt;/span&gt;  → &lt;span class="n"&gt;ElastiCache&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt; (&lt;span class="n"&gt;injected&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Terraform&lt;/span&gt;)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database password never touches Terraform state as plaintext and never appears in environment variable logs. ECS resolves it from Secrets Manager at runtime using the task execution role's IAM permissions.&lt;/p&gt;
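&lt;p&gt;In the task definition this split looks roughly like the following (a sketch with illustrative variable names): plain values go in &lt;code&gt;environment&lt;/code&gt;, while the password goes in &lt;code&gt;secrets&lt;/code&gt; with the Secrets Manager ARN as &lt;code&gt;valueFrom&lt;/code&gt;:&lt;/p&gt;

```hcl
# Sketch of the container definition wiring (variable names illustrative).
# DB_HOST and REDIS_HOST are plain environment variables; DB_PASSWORD is
# a "secrets" entry, so ECS resolves it from Secrets Manager at task launch.
container_definitions = jsonencode([{
  name  = "app"
  image = var.container_image
  environment = [
    { name = "DB_HOST", value = var.db_host },
    { name = "REDIS_HOST", value = var.redis_host },
  ]
  secrets = [
    { name = "DB_PASSWORD", valueFrom = var.db_secret_arn },
  ]
}])
```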

&lt;p&gt;One thing I enabled that's worth mentioning: the &lt;strong&gt;deployment circuit breaker&lt;/strong&gt; with rollback. If a new task definition fails to start (bad image, crash loop, health check failure), ECS automatically stops the deployment and rolls back to the last working version. Same concept as the CodeDeploy auto-rollback from my first project, but built into ECS.&lt;/p&gt;
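&lt;p&gt;Enabling it in Terraform is a small block on the service (sketch; other required arguments omitted):&lt;/p&gt;

```hcl
# Sketch: deployment circuit breaker with automatic rollback on the
# ECS service (attribute names match aws_ecs_service; values illustrative).
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = var.cluster_id
  task_definition = var.task_definition_arn
  desired_count   = 1

  deployment_circuit_breaker {
    enable   = true
    rollback = true # roll back to the last working task definition
  }
}
```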

&lt;h2&gt;
  
  
  The Application (Proving the Architecture Works)
&lt;/h2&gt;

&lt;p&gt;I built a fresh Express API specifically designed to demonstrate all three tiers working together. The key endpoint is &lt;code&gt;GET /items&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;First request: queries PostgreSQL, caches the result in Redis for 30 seconds, returns &lt;code&gt;"source": "database"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Second request (within 30s): returns the cached data from Redis, &lt;code&gt;"source": "cache"&lt;/code&gt;, with 1ms latency.&lt;/p&gt;

&lt;p&gt;Any write operation (POST, PUT, DELETE) invalidates the Redis cache so the next read gets fresh data from PostgreSQL. This is a standard cache-aside pattern used in production systems.&lt;/p&gt;
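&lt;p&gt;Stripped of the Redis and PostgreSQL clients, the cache-aside flow is small enough to sketch in plain JavaScript (a &lt;code&gt;Map&lt;/code&gt; with expiry timestamps stands in for Redis here; the control flow is the point, not the clients):&lt;/p&gt;

```javascript
// Simplified cache-aside sketch. A Map with expiry timestamps stands in
// for Redis, and any async function can stand in for the PostgreSQL query.
const cache = new Map();
const TTL_MS = 30_000; // mirror the 30-second Redis TTL

async function getItems(queryDb) {
  const hit = cache.get("items");
  if (hit && hit.expires > Date.now()) {
    return { source: "cache", data: hit.data }; // cache hit
  }
  const data = await queryDb(); // cache miss: go to the database
  cache.set("items", { data, expires: Date.now() + TTL_MS });
  return { source: "database", data };
}

function invalidateItems() {
  cache.delete("items"); // any write clears the cached read
}
```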

&lt;p&gt;The &lt;code&gt;/health&lt;/code&gt; endpoint checks both database and cache connectivity. If either is down, it returns a 503, which the ALB detects and stops routing traffic to that task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"healthy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"services"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"connected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-09T14:50:37.136Z"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cache"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"connected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"latency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Terraform Modules (Reusable Infrastructure)
&lt;/h2&gt;

&lt;p&gt;Instead of one giant Terraform file, I split everything into 6 modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modules/
├── vpc/             # Network foundation
├── security-groups/ # Tier boundaries
├── alb/             # Load balancing
├── ecs/             # Container orchestration
├── rds/             # Database
└── elasticache/     # Caching
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each module has its own &lt;code&gt;variables.tf&lt;/code&gt;, &lt;code&gt;main.tf&lt;/code&gt;, and &lt;code&gt;outputs.tf&lt;/code&gt;. The root &lt;code&gt;main.tf&lt;/code&gt; wires them together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"ecs"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/ecs"&lt;/span&gt;
  &lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_id&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;security_groups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_sg_id&lt;/span&gt;
  &lt;span class="nx"&gt;target_group_arn&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;target_group_arn&lt;/span&gt;
  &lt;span class="nx"&gt;db_host&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;
  &lt;span class="nx"&gt;redis_host&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;elasticache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;
  &lt;span class="nx"&gt;db_secret_arn&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secret_arn&lt;/span&gt;
  &lt;span class="nx"&gt;container_image&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${aws_ecr_repository.app.repository_url}:latest"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage of modules: you can reuse the VPC module for a completely different project, or create dev/staging/prod environments by calling the same modules with different variables.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems I Hit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  exec format error
&lt;/h3&gt;

&lt;p&gt;I built the Docker image on my Mac (Apple Silicon = ARM) and pushed it to ECR. Fargate runs x86_64. The container started and immediately crashed with &lt;code&gt;exec format error&lt;/code&gt;, no other context.&lt;/p&gt;

&lt;p&gt;The fix: &lt;code&gt;docker build --platform linux/amd64&lt;/code&gt;. Always specify the platform when building for Fargate.&lt;/p&gt;

&lt;h3&gt;
  
  
  no pg_hba.conf entry
&lt;/h3&gt;

&lt;p&gt;RDS PostgreSQL requires SSL by default. My app was connecting without it. The error message is a PostgreSQL internals reference that doesn't mention SSL at all.&lt;/p&gt;

&lt;p&gt;The fix: add &lt;code&gt;ssl: { rejectUnauthorized: false }&lt;/code&gt; to the connection pool config.&lt;/p&gt;
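&lt;p&gt;The relevant part of the pool options looks like this (a sketch; the helper name is mine). Note that &lt;code&gt;rejectUnauthorized: false&lt;/code&gt; satisfies RDS's SSL requirement but skips certificate verification, which is fine for a demo; in production you'd verify against the RDS CA bundle instead:&lt;/p&gt;

```javascript
// Sketch of the pg Pool options. buildPoolConfig is a hypothetical helper,
// not a function from the project. The ssl block requests TLS (which RDS
// enforces) but skips CA verification -- demo-only; production should
// verify against the RDS CA bundle.
function buildPoolConfig(env) {
  return {
    host: env.DB_HOST,
    port: 5432,
    user: env.DB_USER,
    password: env.DB_PASSWORD,
    database: env.DB_NAME,
    ssl: { rejectUnauthorized: false }, // demo-only: trusts any server cert
  };
}
```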

&lt;h3&gt;
  
  
  CannotPullContainerError
&lt;/h3&gt;

&lt;p&gt;I deployed the ECS service before pushing the Docker image to ECR. Fargate couldn't find the image, retried 7 times, and tripped the circuit breaker. After pushing the correct image, new deployments still failed because the breaker was already tripped.&lt;/p&gt;

&lt;p&gt;The fix: &lt;code&gt;aws ecs update-service --force-new-deployment&lt;/code&gt; resets the circuit breaker and triggers a fresh deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Target type: ip vs instance
&lt;/h3&gt;

&lt;p&gt;Fargate requires &lt;code&gt;target_type = "ip"&lt;/code&gt; on the ALB target group. EC2-based services use &lt;code&gt;"instance"&lt;/code&gt;. Using the wrong one causes silent registration failures where ECS reports the task as running but the ALB never sees it.&lt;/p&gt;
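&lt;p&gt;The corresponding target group looks roughly like this (illustrative values):&lt;/p&gt;

```hcl
# Sketch: Fargate tasks register by IP address, so the target group must
# use target_type = "ip" (values illustrative).
resource "aws_lb_target_group" "app" {
  name        = "app-tg"
  port        = 3000
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip" # "instance" silently fails for Fargate

  health_check {
    path    = "/health"
    matcher = "200"
  }
}
```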

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;p&gt;For anyone worried about the AWS bill:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;td&gt;~$32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALB&lt;/td&gt;
&lt;td&gt;~$16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElastiCache&lt;/td&gt;
&lt;td&gt;~$12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS Fargate&lt;/td&gt;
&lt;td&gt;~$9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RDS db.t3.micro&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECR + Secrets Manager&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$70/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The NAT Gateway is the biggest surprise: it's more expensive than the ALB. In production you'd need it, but for learning, &lt;code&gt;terraform destroy&lt;/code&gt; when you're not working saves real money.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Add HTTPS from the start.&lt;/strong&gt; ACM + Route53 would make this production-ready. HTTP-only is fine for a demo but wouldn't pass a security review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Terraform workspaces for multi-environment.&lt;/strong&gt; Right now it's a single environment. The module structure supports dev/staging/prod, just pass different variables. That's the next iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto Scaling for ECS.&lt;/strong&gt; One task is fine for a demo, but production needs scaling policies based on CPU and request count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD integration.&lt;/strong&gt; This project deploys manually with &lt;code&gt;docker push&lt;/code&gt; and &lt;code&gt;ecs update-service&lt;/code&gt;. Connecting it to CodePipeline (from Project 1) would complete the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Proves on a Resume
&lt;/h2&gt;

&lt;p&gt;This project covers territory that most junior/mid-level candidates don't demonstrate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom VPC design&lt;/strong&gt;: with proper public/private subnet segmentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ECS Fargate&lt;/strong&gt;: serverless containers, not just EC2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tier security&lt;/strong&gt;: security groups referencing other security groups, not CIDRs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed data services&lt;/strong&gt;: RDS + ElastiCache with proper secret handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform modules&lt;/strong&gt;: reusable, composable infrastructure, not flat files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real debugging&lt;/strong&gt;: ARM vs x86, SSL requirements, circuit breakers, NAT Gateway necessity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an interviewer asks "tell me about a complex AWS architecture you've built," this project gives you 20 minutes of material.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt;: &lt;a href="https://github.com/augusthottie/aws-3-tier-project" rel="noopener noreferrer"&gt;three-tier-aws&lt;/a&gt; (Terraform + application code)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Pipeline (CI/CD Pipeline)&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-built-a-full-aws-cicd-pipeline-with-bluegreen-deployments-heres-everything-i-learned-5899"&gt;Part 1&lt;/a&gt; | &lt;a href="https://dev.to/augusthottie/i-broke-my-aws-pipeline-on-purpose-and-codified-everything-in-terraform-n7k"&gt;Part 2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio&lt;/strong&gt;: &lt;a href="https://www.augusthottie.com/" rel="noopener noreferrer"&gt;augusthottie.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Building my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect with me on &lt;a href="https://www.linkedin.com/in/jessica-chioma-chimex/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, I'd love to hear what you're building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>terraform</category>
      <category>ecs</category>
    </item>
    <item>
      <title>I Broke My AWS Pipeline on Purpose and Codified Everything in Terraform</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Thu, 05 Mar 2026 16:36:54 +0000</pubDate>
      <link>https://forem.com/augusthottie/i-broke-my-aws-pipeline-on-purpose-and-codified-everything-in-terraform-n7k</link>
      <guid>https://forem.com/augusthottie/i-broke-my-aws-pipeline-on-purpose-and-codified-everything-in-terraform-n7k</guid>
      <description>&lt;p&gt;Testing automatic rollback by deploying broken code, then codifying 30 AWS resources into Terraform so the entire CI/CD pipeline can be created with one command.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is Part 2. &lt;a href="https://dev.to/augusthottie/i-built-a-full-aws-cicd-pipeline-with-bluegreen-deployments-heres-everything-i-learned-5899"&gt;Read Part 1 here: I Built a Full AWS CI/CD Pipeline with Blue/Green Deployments&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Last week I built an AWS-native CI/CD pipeline with blue/green deployments. Someone on LinkedIn asked a great question: "Did you test that rollback works without manual intervention?"&lt;/p&gt;

&lt;p&gt;I hadn't. So I did. Then I codified the entire infrastructure in Terraform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proving Rollback Actually Works
&lt;/h2&gt;

&lt;p&gt;It's one thing to configure auto-rollback. It's another to watch it fire. I needed to deploy a broken version and confirm the system recovers on its own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Layers of Safety
&lt;/h3&gt;

&lt;p&gt;The pipeline has two layers of protection, and I accidentally discovered the first one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Tests catch bad code in CodeBuild.&lt;/strong&gt; My first attempt was changing the &lt;code&gt;/health&lt;/code&gt; endpoint to return a 500. But &lt;code&gt;bun test&lt;/code&gt; tests that endpoint, so the build failed before the code ever reached deployment. The pipeline stopped at the Build stage. Approval and Deploy never ran.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0aol8wv6wnssjlo95nz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0aol8wv6wnssjlo95nz.png" alt="Architecture diagram" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's actually a great result, the first safety net caught the problem before it could go anywhere. But it wasn't what I wanted to test.&lt;/p&gt;
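&lt;p&gt;Reduced to its essence, the gate is nothing more than an assertion on the handler (a simplified sketch; the real suite runs under &lt;code&gt;bun test&lt;/code&gt; against an Express route):&lt;/p&gt;

```javascript
// Simplified stand-in for the Express /health route. The "broken" version
// I first tried returned a 500 here, which fails the assertion in the test
// suite and stops the pipeline at the Build stage.
function healthHandler() {
  return { status: 200, body: { status: "healthy" } };
}
```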

&lt;p&gt;&lt;strong&gt;Layer 2: CodeDeploy catches failures during deployment.&lt;/strong&gt; To test this, I needed code that passes tests but fails during deployment. I left the app healthy and instead broke the &lt;code&gt;validate_service.sh&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Simulating deployment validation failure"&lt;/span&gt;
&lt;span class="nb"&gt;exit &lt;/span&gt;1

&lt;span class="c"&gt;# ... rest of the script never executes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the build passed, tests passed, I approved the deployment, and CodeDeploy deployed to the green environment. Then &lt;code&gt;ValidateService&lt;/code&gt; ran, hit the &lt;code&gt;exit 1&lt;/code&gt;, and the deployment failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Result
&lt;/h3&gt;

&lt;p&gt;CodeDeploy detected the failure and triggered an automatic rollback:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger:&lt;/strong&gt; AutomatedRollback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution type:&lt;/strong&gt; ROLLBACK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status:&lt;/strong&gt; Succeeded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1pkyjxqfrvfrwx9cde5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1pkyjxqfrvfrwx9cde5.png" alt="Architecture diagram" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No SSH. No console clicks. No human intervention. The green instances were terminated, traffic stayed on the original blue environment, and users were never affected.&lt;/p&gt;

&lt;p&gt;The rollback cycle took about 15 minutes, mostly ASG provisioning and the 5-minute termination wait. In production you'd tune those timers, but the mechanics are sound.&lt;/p&gt;

&lt;h3&gt;
  
  
  One More IAM Surprise
&lt;/h3&gt;

&lt;p&gt;The automated rollback initially failed because the CodePipeline service role was missing &lt;code&gt;codedeploy:GetApplicationRevision&lt;/code&gt;. AWS doesn't tell you this is needed until rollback actually fires.&lt;/p&gt;

&lt;p&gt;This is a theme with this project: &lt;strong&gt;you don't discover the permission you're missing until you trigger the specific action that needs it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code with Terraform
&lt;/h2&gt;

&lt;p&gt;After proving everything works, I codified the entire infrastructure. The goal: anyone should be able to recreate this pipeline with &lt;code&gt;terraform apply&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform/
├── main.tf              # Provider, default VPC data sources
├── variables.tf         # Configurable inputs
├── outputs.tf           # ALB URL, resource names
├── iam.tf               # All IAM roles + policies
├── s3.tf                # Artifact bucket with lifecycle rules
├── sns.tf               # Approval notification topic
├── security-groups.tf   # ALB + EC2 security groups
├── alb.tf               # Load balancer, target group, listener
├── asg.tf               # Launch template + Auto Scaling Group
├── codebuild.tf         # Build project with S3 caching
├── codedeploy.tf        # App + blue/green deployment group
└── codepipeline.tf      # Pipeline + CloudWatch Event trigger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;30 resources. One command. The full pipeline, IAM roles, S3 bucket, SNS topic, security groups, ALB, ASG with launch template, CodeBuild project, CodeDeploy blue/green deployment group, CodePipeline with all four stages, and the CloudWatch Event rule to auto-trigger on pushes.&lt;/p&gt;

&lt;h3&gt;
  
  
  The IAM Battle (Round 2)
&lt;/h3&gt;

&lt;p&gt;I thought I'd learned all the IAM lessons during the manual build. Terraform taught me new ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blue/green needs broad Auto Scaling permissions.&lt;/strong&gt; I started with a carefully scoped list of 17 specific &lt;code&gt;autoscaling:&lt;/code&gt; actions. The deployment group wouldn't even create, it needed &lt;code&gt;autoscaling:RecordLifecycleActionHeartbeat&lt;/code&gt;, which isn't in any AWS documentation for CodeDeploy. Even after adding that, deployments failed with the vague error "does not give you permission to perform operations in AmazonAutoScaling." The fix: &lt;code&gt;autoscaling:*&lt;/code&gt;. AWS's blue/green implementation calls undocumented internal actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same story with Elastic Load Balancing.&lt;/strong&gt; Specific ELB permissions let CodeDeploy create the deployment group but failed during actual traffic shifting, the replacement instances couldn't register in the target group. The fix: &lt;code&gt;elasticloadbalancing:*&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;iam:PassRole&lt;/code&gt; needs wide scope.&lt;/strong&gt; I initially scoped &lt;code&gt;PassRole&lt;/code&gt; to specific role ARNs. Blue/green deployments pass roles to service-linked roles with unpredictable ARN patterns. The practical approach is &lt;code&gt;Resource: "*"&lt;/code&gt; with a condition restricting which services can receive the role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;Effect&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
  &lt;span class="nx"&gt;Action&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"iam:PassRole"&lt;/span&gt;
  &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt;
  &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;StringLike&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"iam:PassedToService"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"ec2.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"autoscaling.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"codedeploy.amazonaws.com"&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;IAM propagation causes race conditions.&lt;/strong&gt; The pipeline triggered immediately after Terraform created it, before IAM policies had propagated. The Source stage failed with "Insufficient permissions" even though the policy was correct. Retrying a minute later worked fine. This is a Terraform-specific issue, manual builds don't hit it because there's a human-speed delay between creating roles and using them.&lt;/p&gt;
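&lt;p&gt;A possible mitigation, sketched here but not part of the repo: an explicit delay via the &lt;code&gt;hashicorp/time&lt;/code&gt; provider between the IAM resources and the pipeline, so policies can propagate before the first execution:&lt;/p&gt;

```hcl
# Possible mitigation (not in the project repo): delay pipeline creation
# until the IAM policy has had time to propagate. Uses the hashicorp/time
# provider; resource names are illustrative.
resource "time_sleep" "iam_propagation" {
  depends_on      = [aws_iam_role_policy.codepipeline]
  create_duration = "30s"
}

resource "aws_codepipeline" "pipeline" {
  # ...pipeline configuration...
  depends_on = [time_sleep.iam_propagation]
}
```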

&lt;h3&gt;
  
  
  CodeDeploy Agent Gotchas on Amazon Linux 2023
&lt;/h3&gt;

&lt;p&gt;The launch template user data needs careful handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a 30-second sleep&lt;/strong&gt; before installing packages. The instance needs time to initialize before package managers work reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;/tmp&lt;/code&gt; for downloads&lt;/strong&gt;, not &lt;code&gt;/home/ec2-user&lt;/code&gt;, the home directory may not exist during early boot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log everything&lt;/strong&gt; with &lt;code&gt;exec &amp;gt; /var/log/user-data.log 2&amp;gt;&amp;amp;1&lt;/code&gt; so you can debug via SSM if the agent doesn't start.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, instances boot but the CodeDeploy agent silently fails to install, and deployments time out with "CodeDeploy agent was not able to receive the lifecycle event."&lt;/p&gt;
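&lt;p&gt;Put together, the user data portion of the launch template looks roughly like this (a sketch; the agent download URL shown is the documented pattern for us-east-1, adjust for your region):&lt;/p&gt;

```hcl
# Sketch of launch template user data with all three fixes applied:
# logging, the startup sleep, and /tmp for downloads. AL2023 uses dnf.
user_data = base64encode(<<-EOF
  #!/bin/bash
  exec > /var/log/user-data.log 2>&1   # log everything for SSM debugging
  sleep 30                             # let the instance finish initializing
  dnf install -y ruby wget
  cd /tmp                              # home dirs may not exist yet
  wget https://aws-codedeploy-us-east-1.s3.us-east-1.amazonaws.com/latest/install
  chmod +x ./install
  ./install auto
  systemctl start codedeploy-agent
  EOF
)
```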

&lt;h3&gt;
  
  
  Orphaned ASGs, The Undocumented Gotcha
&lt;/h3&gt;

&lt;p&gt;When blue/green deployments fail mid-way, they leave behind orphaned Auto Scaling Groups with names like &lt;code&gt;CodeDeploy_my-dg_d-ABC123&lt;/code&gt;. These block subsequent deployments because instances from the orphaned ASG are still registered in the target group as unhealthy.&lt;/p&gt;

&lt;p&gt;The fix: check for and delete them before retrying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find orphaned ASGs&lt;/span&gt;
aws autoscaling describe-auto-scaling-groups &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"AutoScalingGroups[?contains(AutoScalingGroupName, 'CodeDeploy')].AutoScalingGroupName"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text

&lt;span class="c"&gt;# Delete them&lt;/span&gt;
aws autoscaling delete-auto-scaling-group &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--auto-scaling-group-name&lt;/span&gt; &lt;span class="s2"&gt;"NAME_HERE"&lt;/span&gt; &lt;span class="nt"&gt;--force-delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cost me hours of debugging. I'm writing it down so it doesn't cost you the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Advantage of Terraform
&lt;/h2&gt;

&lt;p&gt;With everything in Terraform, the economics change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Done for the day? Tear it all down.&lt;/span&gt;
terraform destroy

&lt;span class="c"&gt;# Ready to work again? Bring it all back.&lt;/span&gt;
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ALB alone costs ~$16/month. When you're learning and not actively using the pipeline, that adds up. Terraform lets you pay only for the hours you're actually working.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently Next Time
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with &lt;code&gt;autoscaling:*&lt;/code&gt; and &lt;code&gt;elasticloadbalancing:*&lt;/code&gt;&lt;/strong&gt;, then tighten permissions later using CloudTrail logs to see which actions are actually called. Fighting undocumented permissions one at a time wastes hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add CloudWatch alarms for post-deploy monitoring.&lt;/strong&gt; Right now rollback only triggers if lifecycle hooks fail during deployment. In production, you'd want alarms monitoring error rates and latency after traffic shifts, with automatic rollback if metrics spike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement canary deployments.&lt;/strong&gt; Instead of shifting 100% of traffic at once, shift 10% first, monitor for a few minutes, then complete the rollout. CodeDeploy supports this with custom deployment configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test your failure modes.&lt;/strong&gt; Don't assume rollback works, prove it! Deploy a broken version and watch it recover. That confidence is worth more than any passing test.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Terraform turns a project into a product.&lt;/strong&gt; A manually-built pipeline is a demo. A Terraform-codified pipeline is something a team can use. And &lt;code&gt;terraform destroy&lt;/code&gt; changes the economics of learning on AWS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start broad with IAM, tighten later.&lt;/strong&gt; For blue/green deployments specifically, AWS calls undocumented internal actions. Scoped permissions fail silently with vague errors. Get it working first, then use CloudTrail to scope down.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document the gotchas nobody else does.&lt;/strong&gt; Orphaned ASGs, agent installation timing, IAM propagation race conditions, these are the problems you'll actually hit in production, and they're not in any tutorial.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt;: &lt;a href="https://github.com/augusthottie/aws-pipeline-project" rel="noopener noreferrer"&gt;aws-pipeline-demo&lt;/a&gt; (includes Terraform)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: &lt;a href="https://dev.to/augusthottie/i-built-a-full-aws-cicd-pipeline-with-bluegreen-deployments-heres-everything-i-learned-5899"&gt;I Built a Full AWS CI/CD Pipeline with Blue/Green Deployments&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio&lt;/strong&gt;: &lt;a href="https://www.augusthottie.com/" rel="noopener noreferrer"&gt;augusthottie.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm building my DevOps portfolio ahead of targeting the AWS DevOps Professional certification. If you're on a similar journey, I'd love to connect, drop a comment or find me on &lt;a href="https://www.linkedin.com/in/jessica-chioma-chimex/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>I Built a Full AWS CI/CD Pipeline with Blue/Green Deployments, Here's Everything I Learned</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Tue, 03 Mar 2026 16:51:08 +0000</pubDate>
      <link>https://forem.com/augusthottie/i-built-a-full-aws-cicd-pipeline-with-bluegreen-deployments-heres-everything-i-learned-5899</link>
      <guid>https://forem.com/augusthottie/i-built-a-full-aws-cicd-pipeline-with-bluegreen-deployments-heres-everything-i-learned-5899</guid>
      <description>&lt;p&gt;A hands-on walkthrough of building an end-to-end AWS-native CI/CD pipeline using CodeCommit, CodeBuild, CodeDeploy, and CodePipeline with zero-downtime blue/green deployments.&lt;/p&gt;

&lt;p&gt;Most DevOps tutorials stop at "set up a GitHub Actions workflow." That's fine, but if you're preparing for the AWS DevOps Professional exam or interviewing for AWS-heavy roles, you need to know the AWS-native CI/CD stack inside and out.&lt;/p&gt;

&lt;p&gt;So I built a full pipeline from scratch! No GitHub Actions, no Jenkins, no third-party tools. Just AWS services talking to each other, ending with zero-downtime blue/green deployments.&lt;/p&gt;

&lt;p&gt;This post walks through the entire build, the problems I ran into, and what I'd do differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A simple Node.js/Express API deployed through a fully automated pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CodeCommit → CodeBuild → Manual Approval → CodeDeploy (Blue/Green)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every &lt;code&gt;git push&lt;/code&gt; to &lt;code&gt;main&lt;/code&gt; triggers the pipeline. CodeBuild installs dependencies and runs tests using Bun, then waits for manual approval via an SNS email notification. Once approved, CodeDeploy spins up a fresh Auto Scaling Group, deploys the new version, validates health checks through an ALB, shifts traffic, and terminates the old instances.&lt;/p&gt;

&lt;p&gt;Zero downtime. Fully automated after approval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AWS-Native Instead of GitHub Actions?
&lt;/h2&gt;

&lt;p&gt;I already had several projects using GitHub Actions. That's great, but it only tells half the story. AWS has its own CI/CD ecosystem: CodeCommit, CodeBuild, CodeDeploy, and CodePipeline. Companies running on AWS often use these services because they integrate tightly with IAM, VPCs, and other AWS infrastructure.&lt;/p&gt;

&lt;p&gt;Understanding both gives you range. And for the AWS DevOps Professional exam, these services are tested heavily.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Application
&lt;/h2&gt;

&lt;p&gt;I kept the app intentionally simple; the pipeline is the project, not the app. It's an Express API with four endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /health&lt;/code&gt;: returns a 200 with status info (used by the ALB and CodeDeploy for validation)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /&lt;/code&gt;: welcome message with available endpoints&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /info&lt;/code&gt;: app version, uptime, memory usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /deploy-info&lt;/code&gt;: deployment metadata from CodeDeploy environment variables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;/health&lt;/code&gt; endpoint is the most important one. It's what the ALB target group uses for health checks, and it's what the &lt;code&gt;validate_service.sh&lt;/code&gt; lifecycle script hits after deployment to confirm everything is working. If it fails, CodeDeploy rolls back automatically.&lt;/p&gt;
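&lt;p&gt;To make that concrete, here's a minimal sketch of what the health payload might look like. The handler shape and field names are illustrative, not necessarily what's in the repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch of the health payload; not the repo's actual code.
function healthPayload() {
  return {
    status: 'ok',                        // the ALB only needs the 200; the body is for humans
    uptime: process.uptime(),            // seconds since the process started
    timestamp: new Date().toISOString(),
  };
}

// Wired into Express (assumes `app` is an Express instance):
// app.get('/health', (req, res) =&amp;gt; res.status(200).json(healthPayload()));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;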

&lt;h2&gt;
  
  
  Setting Up the Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CodeCommit
&lt;/h3&gt;

&lt;p&gt;Nothing fancy here: create a repo and push your code. I used &lt;code&gt;git-remote-codecommit&lt;/code&gt; for authentication instead of HTTPS Git credentials because it uses your existing AWS CLI credentials and doesn't require generating separate passwords.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git-remote-codecommit
git remote add origin codecommit::us-east-1://aws-pipeline-demo
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CodeBuild with Bun
&lt;/h3&gt;

&lt;p&gt;CodeBuild uses a &lt;code&gt;buildspec.yml&lt;/code&gt; file; think of it as a GitHub Actions workflow, but for AWS. Mine installs Bun, runs &lt;code&gt;bun install --frozen-lockfile&lt;/code&gt;, executes tests with &lt;code&gt;bun test&lt;/code&gt;, and packages the artifact to S3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.2&lt;/span&gt;

&lt;span class="na"&gt;phases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;install&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runtime-versions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;nodejs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;18&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;curl -fsSL https://bun.sh/install | bash&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export BUN_INSTALL="$HOME/.bun"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export PATH="$BUN_INSTALL/bin:$PATH"&lt;/span&gt;

  &lt;span class="na"&gt;pre_build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export BUN_INSTALL="$HOME/.bun"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export PATH="$BUN_INSTALL/bin:$PATH"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bun install --frozen-lockfile&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;bun test&lt;/span&gt;

  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export BUN_INSTALL="$HOME/.bun"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export PATH="$BUN_INSTALL/bin:$PATH"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Build started on $(date)"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;export APP_VERSION=$(bun -e "console.log(require('./package.json').version)")-$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | head -c 8)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "App version $APP_VERSION"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing that caught me early: &lt;code&gt;bun install --frozen-lockfile&lt;/code&gt; requires a &lt;code&gt;bun.lockb&lt;/code&gt; file in the repo. If you forget to commit it, the build fails with no helpful error message. Run &lt;code&gt;bun install&lt;/code&gt; locally first and push the lockfile.&lt;/p&gt;
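&lt;p&gt;The fix is a one-time local step (assuming Git is already set up for the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun install                  # generates/updates bun.lockb locally
git add bun.lockb
git commit -m "Add lockfile for --frozen-lockfile builds"
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;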

&lt;p&gt;I also enabled &lt;strong&gt;S3 caching&lt;/strong&gt; for &lt;code&gt;node_modules&lt;/code&gt; and the Bun binary directory. The first build downloads everything, but subsequent builds skip the install step entirely when dependencies haven't changed. The actual build commands run in about 7 seconds; the rest of the ~4-minute build time is CodeBuild provisioning its container, which is unavoidable with on-demand compute.&lt;/p&gt;
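&lt;p&gt;In &lt;code&gt;buildspec.yml&lt;/code&gt;, the cache section is just a list of paths. The exact paths below are an approximation of my setup, not a verbatim copy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;cache:
  paths:
    - 'node_modules/**/*'      # skip dependency install when nothing changed
    - '/root/.bun/**/*'        # keep the Bun binary between builds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;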

&lt;h3&gt;
  
  
  The Approval Gate
&lt;/h3&gt;

&lt;p&gt;Between build and deploy, I added a manual approval stage with SNS notifications. When the build passes, CodePipeline sends an email asking you to approve or reject the deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7g1x6yy1oh87wvjps46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7g1x6yy1oh87wvjps46.png" alt=" " width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86ydtj6cayqdhdojrlmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86ydtj6cayqdhdojrlmd.png" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5iq6kbkgzhiw03vnkmxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5iq6kbkgzhiw03vnkmxh.png" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is common in production pipelines: you don't always want every passing build to go straight to production. The approval stage gives you a checkpoint to review changes, run additional validation, or coordinate deployment timing.&lt;/p&gt;

&lt;p&gt;One gotcha: the &lt;code&gt;sns:Publish&lt;/code&gt; permission needs to be on the &lt;strong&gt;CodePipeline service role&lt;/strong&gt;, not the CodeBuild role. The approval action is triggered by CodePipeline, not CodeBuild. I initially added it to the wrong role and spent time debugging why emails weren't sending.&lt;/p&gt;
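&lt;p&gt;The statement itself is small; it just has to live on the pipeline's role. The topic name and account ID below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Effect": "Allow",
  "Action": "sns:Publish",
  "Resource": "arn:aws:sns:us-east-1:123456789012:pipeline-approvals"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;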

&lt;h2&gt;
  
  
  Blue/Green Deployments (The Core of the Project)
&lt;/h2&gt;

&lt;p&gt;This is the part I'm most proud of. In-place deployments are simpler, but blue/green is what production environments use when downtime isn't acceptable.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;CodeDeploy uses an &lt;code&gt;appspec.yml&lt;/code&gt; file that defines where files go and which scripts to run at each lifecycle stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ApplicationStop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/stop_app.sh&lt;/span&gt;
  &lt;span class="na"&gt;BeforeInstall&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/before_install.sh&lt;/span&gt;
  &lt;span class="na"&gt;AfterInstall&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/after_install.sh&lt;/span&gt;
  &lt;span class="na"&gt;ApplicationStart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/start_app.sh&lt;/span&gt;
  &lt;span class="na"&gt;ValidateService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/validate_service.sh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each script handles a specific step: stop the old app, clean up, install dependencies, start the new version, and validate it's healthy.&lt;/p&gt;
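&lt;p&gt;As an example, a &lt;code&gt;validate_service.sh&lt;/code&gt; along these lines is enough to trigger CodeDeploy's automatic rollback on failure. The port and retry counts are assumptions, not the repo's exact script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/bash
# Retry /health until it returns a 2xx, or fail the deployment so CodeDeploy rolls back.
for i in $(seq 1 30); do
  if curl -sf http://localhost:3000/health &amp;gt; /dev/null; then
    echo "Service healthy"
    exit 0
  fi
  sleep 2
done
echo "Health check failed" &amp;gt;&amp;amp;2
exit 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;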

&lt;h3&gt;
  
  
  The Deployment Flow
&lt;/h3&gt;

&lt;p&gt;When CodeDeploy runs a blue/green deployment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clones the ASG&lt;/strong&gt;: creates a brand new Auto Scaling Group with the same configuration as the original&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launches fresh instances&lt;/strong&gt;: new EC2 instances spin up with the CodeDeploy agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs lifecycle hooks&lt;/strong&gt;: your scripts execute in order (stop → before_install → after_install → start → validate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health check&lt;/strong&gt;: the ALB confirms the new instances return 200 on &lt;code&gt;/health&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shifts traffic&lt;/strong&gt;: the ALB moves all traffic from the old target group to the new one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminates the old environment&lt;/strong&gt;: after a 5-minute wait, the original instances are killed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Users never see downtime because traffic only shifts after the new instances are confirmed healthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Thing Nobody Tells You About Blue/Green
&lt;/h3&gt;

&lt;p&gt;I was confused when my first blue/green deployment created a second Auto Scaling Group. I thought, "Why is it making a new one? I already have one."&lt;/p&gt;

&lt;p&gt;Here's what nobody explains upfront: &lt;strong&gt;your original ASG is just a template.&lt;/strong&gt; CodeDeploy copies it on the first deployment and replaces it. On the second deployment, it copies the replacement and replaces that. Every deployment creates a new ASG and destroys the old one. Your original ASG only survives until the first successful deployment.&lt;/p&gt;

&lt;p&gt;Once I understood this, the whole model clicked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The IAM Nightmare (And How to Survive It)
&lt;/h2&gt;

&lt;p&gt;IAM was the single biggest time sink in this project. Blue/green deployments require an unusually broad set of permissions because CodeDeploy needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create and delete Auto Scaling Groups&lt;/li&gt;
&lt;li&gt;Launch and terminate EC2 instances&lt;/li&gt;
&lt;li&gt;Modify ALB target groups and listeners&lt;/li&gt;
&lt;li&gt;Pass IAM roles to new instances (&lt;code&gt;iam:PassRole&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Read artifacts from S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is a separate IAM action, and missing any one of them produces a vague error message. Here's what my CodeDeploy service role ended up needing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full Auto Scaling permissions (create, update, delete ASGs, lifecycle hooks, scaling policies)&lt;/li&gt;
&lt;li&gt;EC2 permissions (describe, run, terminate instances, create tags)&lt;/li&gt;
&lt;li&gt;Elastic Load Balancing permissions (describe and modify target groups, register/deregister targets)&lt;/li&gt;
&lt;li&gt;S3 read access to the artifact bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;iam:PassRole&lt;/code&gt; for EC2 and Auto Scaling services&lt;/li&gt;
&lt;li&gt;SNS publish for notifications&lt;/li&gt;
&lt;li&gt;CloudWatch for alarms&lt;/li&gt;
&lt;/ul&gt;
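&lt;p&gt;The &lt;code&gt;iam:PassRole&lt;/code&gt; piece is the one that trips most people up. A scoped-down statement looks something like this (role name and account ID are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "arn:aws:iam::123456789012:role/ec2-instance-profile-role",
  "Condition": {
    "StringEquals": {
      "iam:PassedToService": [
        "ec2.amazonaws.com",
        "autoscaling.amazonaws.com"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;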

&lt;p&gt;My advice: start with the AWS managed &lt;code&gt;AWSCodeDeployRole&lt;/code&gt; policy and add custom permissions for blue/green. Don't try to build the policy from scratch; you'll miss something and spend hours debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Optimization
&lt;/h2&gt;

&lt;p&gt;A few things I did to keep builds fast and costs low:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 caching&lt;/strong&gt;: CodeBuild caches &lt;code&gt;node_modules&lt;/code&gt; and the Bun binary between builds. When dependencies haven't changed, &lt;code&gt;bun install&lt;/code&gt; completes almost instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bun over npm&lt;/strong&gt;: Bun's install and test execution are noticeably faster than npm/Jest. The entire install + test cycle takes about 7 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I tried using a custom Docker image&lt;/strong&gt; (&lt;code&gt;oven/bun:1&lt;/code&gt;) to skip installing Bun on every build. It worked locally, but Docker Hub's unauthenticated pull rate limit blocked it in CodeBuild. The fix would be pushing the image to Amazon ECR, but for this project the S3 cache approach was simpler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;p&gt;For anyone worried about AWS bills:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CodePipeline&lt;/td&gt;
&lt;td&gt;1 free pipeline/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodeBuild&lt;/td&gt;
&lt;td&gt;100 free build minutes/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EC2 t3.micro&lt;/td&gt;
&lt;td&gt;Free tier eligible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALB&lt;/td&gt;
&lt;td&gt;~$16/month (the main cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 + SNS&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ALB is the biggest expense. Tear it down when you're not actively working on the project, and spin it back up when needed.&lt;/p&gt;
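&lt;p&gt;Teardown is quick with the CLI (the ARNs below are placeholders; if you manage the stack with Terraform, &lt;code&gt;terraform destroy&lt;/code&gt; is cleaner):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Deleting the load balancer removes its listeners; then the target group can go too.
aws elbv2 delete-load-balancer --load-balancer-arn "$ALB_ARN"
aws elbv2 delete-target-group --target-group-arn "$TG_ARN"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;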

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with Terraform from day one.&lt;/strong&gt; I set everything up manually in the console first to learn how each service works, and that was valuable. But recreating the setup would mean clicking through dozens of console screens. If I were doing this again, I'd codify everything in Terraform as I go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use ECR for custom images.&lt;/strong&gt; Instead of Docker Hub, I'd push the Bun image to ECR to avoid rate limits and get faster pulls within AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add CloudWatch alarms for auto-rollback.&lt;/strong&gt; Right now, rollback only triggers if the health check fails during deployment. In production, you'd want CloudWatch alarms monitoring error rates and latency post-deployment, with automatic rollback if metrics spike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;If you're learning DevOps or preparing for AWS certifications, build something like this yourself. This pipeline taught me more than any course or practice exam:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IAM is the real skill.&lt;/strong&gt; Anyone can configure a pipeline in the console. Understanding which roles need which permissions, and why, is what separates junior from mid-level DevOps engineers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blue/green isn't magic.&lt;/strong&gt; It's just two environments, a load balancer, and a traffic switch. Once you understand the ASG cloning model, it's straightforward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS-native CI/CD has tradeoffs.&lt;/strong&gt; It integrates beautifully with IAM, VPCs, and other AWS services. But it's more complex to set up than GitHub Actions, and CodeBuild's container provisioning adds latency. Choose based on your environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The debugging is the learning.&lt;/strong&gt; Every failed deployment, every permission error, every misconfigured health check taught me something I wouldn't have learned from documentation alone.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub/CodeCommit repo&lt;/strong&gt;: &lt;a href="https://github.com/augusthottie/aws-pipeline-project" rel="noopener noreferrer"&gt;aws-pipeline-demo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portfolio&lt;/strong&gt;: &lt;a href="https://www.augusthottie.com/" rel="noopener noreferrer"&gt;augusthottie.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Certifications&lt;/strong&gt;: &lt;a href="https://www.credly.com/badges/0f5c3227-ce72-482c-b360-f903a84248d2/public_url" rel="noopener noreferrer"&gt;AWS Cloud Practitioner&lt;/a&gt; | &lt;a href="https://www.credly.com/badges/c2b8eb5b-1f71-4c62-8bcf-0e0214c9c478/public_url" rel="noopener noreferrer"&gt;AWS Solutions Architect Associate&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm currently building my DevOps portfolio as I work toward the AWS DevOps Professional certification. If you're on a similar journey, I'd love to connect: drop a comment or find me on &lt;a href="https://www.linkedin.com/in/jessica-chioma-chimex/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Building a Command-Line Calculator in Zig - Tutorial</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Thu, 19 Dec 2024 17:28:44 +0000</pubDate>
      <link>https://forem.com/augusthottie/building-a-command-line-calculator-in-zig-tutorial-k26</link>
      <guid>https://forem.com/augusthottie/building-a-command-line-calculator-in-zig-tutorial-k26</guid>
      <description>&lt;p&gt;This tutorial will guide you through creating a simple command-line calculator using Zig programming language. We'll cover installation, project setup, implementation, and testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Installing Zig&lt;/li&gt;
&lt;li&gt;Project Setup&lt;/li&gt;
&lt;li&gt;Understanding the Code&lt;/li&gt;
&lt;li&gt;Building and Running&lt;/li&gt;
&lt;li&gt;Testing&lt;/li&gt;
&lt;li&gt;Common Issues&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Installing Zig
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Visit &lt;a href="https://ziglang.org/download/" rel="noopener noreferrer"&gt;ziglang.org/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Download the appropriate version for your operating system (0.11.0 or later)&lt;/li&gt;
&lt;li&gt;Extract the archive to a location of your choice&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add Zig to your system's PATH:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Windows&lt;/strong&gt;: Edit system environment variables and add the path to Zig&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux/MacOS&lt;/strong&gt;: Add to ~/.bashrc or ~/.zshrc:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;:/path/to/zig
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify installation:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   zig version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Project Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create a new project directory:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;mkdir &lt;/span&gt;calculator
   &lt;span class="nb"&gt;cd &lt;/span&gt;calculator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Initialize the project structure:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;mkdir &lt;/span&gt;src
   &lt;span class="nb"&gt;touch &lt;/span&gt;src/calculator.zig
   &lt;span class="nb"&gt;touch &lt;/span&gt;src/calculator_test.zig
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a build.zig file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight zig"&gt;&lt;code&gt;   &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;@import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"std"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

   &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;standardTargetOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;{});&lt;/span&gt;
       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;optimize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;standardOptimizeOption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;{});&lt;/span&gt;

       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;exe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addExecutable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"calculator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;root_source_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"src/calculator.zig"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;optimize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;});&lt;/span&gt;

       &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;installArtifact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exe&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;run_cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRunArtifact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exe&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="n"&gt;run_cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dependOn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getInstallStep&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;run_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"run"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Run the app"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="n"&gt;run_step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dependOn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;run_cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;step&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;exe_unit_tests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addTest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;root_source_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"src/calculator_test.zig"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;optimize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;optimize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;});&lt;/span&gt;

       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;run_exe_unit_tests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRunArtifact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exe_unit_tests&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;test_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Run unit tests"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="n"&gt;test_step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dependOn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;run_exe_unit_tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;step&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
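&lt;p&gt;With this file in place, the steps defined above map directly to build commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zig build        # compile and install the executable into zig-out/
zig build run    # build, then execute the "run" step
zig build test   # build and run the unit tests in calculator_test.zig
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;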



&lt;h2&gt;
  
  
  Understanding the Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Calculator Implementation (src/calculator.zig)
&lt;/h3&gt;

&lt;p&gt;The calculator consists of two main parts:&lt;/p&gt;

&lt;p&gt;a) The calculate function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight zig"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;f64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sc"&gt;'+'&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;num1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;num2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sc"&gt;'-'&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;num1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;num2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sc"&gt;'*'&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;num1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sc"&gt;'/'&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;DivisionByZero&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;num1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;num2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sc"&gt;'%'&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;@mod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;InvalidOperation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes two numbers and an operation as input&lt;/li&gt;
&lt;li&gt;Uses a switch statement to perform the calculation&lt;/li&gt;
&lt;li&gt;Handles errors like division by zero&lt;/li&gt;
&lt;li&gt;Returns the result as an f64 (double-precision float)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight zig"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getStdOut&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;stdin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getStdIn&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Calculator (enter 'q' to quit)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Enter first number: "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;first_input_buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;u8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;first_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readUntilDelimiter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;first_input_buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sc"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;len&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;first_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sc"&gt;'q'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;num1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Enter operation (+, -, *, /): "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;op_buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;u8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readUntilDelimiter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;op_buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sc"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;len&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error: No operation entered!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Enter second number: "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;second_input_buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;u8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;num2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parseFloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readUntilDelimiter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;second_input_buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sc"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;DivisionByZero&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error: Division by zero!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="k"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="py"&gt;InvalidOperation&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error: Invalid operation '{s}'!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Result: {d}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;b) The main function (shown above):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates an interactive command-line interface&lt;/li&gt;
&lt;li&gt;Reads user input for numbers and operations&lt;/li&gt;
&lt;li&gt;Handles errors and displays results&lt;/li&gt;
&lt;li&gt;Provides a quit option ('q')&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Testing (src/calculator_test.zig)
&lt;/h3&gt;

&lt;p&gt;Tests are written using Zig's built-in testing framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight zig"&gt;&lt;code&gt;&lt;span class="k"&gt;test&lt;/span&gt; &lt;span class="s"&gt;"basic addition"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sc"&gt;'+'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expectEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;@as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key testing concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each test is a function marked with the &lt;code&gt;test&lt;/code&gt; keyword&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;try&lt;/code&gt; for error handling&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;testing.expectEqual()&lt;/code&gt; for assertions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building and Running
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Build the project:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   zig build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Run the calculator:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   zig build run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Using the calculator:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Calculator (enter 'q' to quit)
   Enter first number: 10
   Enter operation (+, -, *, /, %): +
   Enter second number: 5
   Result: 15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;Run the test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zig build &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tests will verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic arithmetic operations&lt;/li&gt;
&lt;li&gt;Error handling for division by zero&lt;/li&gt;
&lt;li&gt;Invalid operation handling&lt;/li&gt;
&lt;li&gt;Edge cases&lt;/li&gt;
&lt;/ul&gt;
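
&lt;p&gt;The test file above only shows the addition case; the error paths can be exercised the same way with &lt;code&gt;testing.expectError&lt;/code&gt;. A sketch (assuming &lt;code&gt;calculate&lt;/code&gt; is public in &lt;code&gt;calculator.zig&lt;/code&gt;, as shown earlier):&lt;/p&gt;

```zig
const std = @import("std");
const testing = std.testing;
const calculator = @import("calculator.zig");

test "division by zero is an error" {
    // calculate returns error.DivisionByZero when the divisor is 0
    try testing.expectError(error.DivisionByZero, calculator.calculate(5, 0, '/'));
}

test "unknown operator is an error" {
    // '?' is not a case in the switch, so calculate returns error.InvalidOperation
    try testing.expectError(error.InvalidOperation, calculator.calculate(5, 3, '?'));
}
```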

&lt;h2&gt;
  
  
  Common Issues
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Compilation Errors&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Ensure Zig version 0.11.0 or later&lt;/li&gt;
&lt;li&gt;Check file paths in build.zig&lt;/li&gt;
&lt;li&gt;Verify syntax, especially in error handling&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Runtime Errors&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Division by zero is handled with error.DivisionByZero&lt;/li&gt;
&lt;li&gt;Invalid operations return error.InvalidOperation&lt;/li&gt;
&lt;li&gt;Number-parsing errors from &lt;code&gt;parseFloat&lt;/code&gt; propagate via &lt;code&gt;try&lt;/code&gt; and end the program&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Build Issues&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Make sure build.zig is in the root directory&lt;/li&gt;
&lt;li&gt;Verify project structure matches the tutorial&lt;/li&gt;
&lt;li&gt;Check that all source files are present&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;To enhance the calculator, consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Adding more operations (power, square root)&lt;/li&gt;
&lt;li&gt;Implementing memory functions&lt;/li&gt;
&lt;li&gt;Supporting complex numbers&lt;/li&gt;
&lt;li&gt;Adding a graphical user interface&lt;/li&gt;
&lt;li&gt;Implementing scientific calculator features&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ziglang.org/documentation/master/" rel="noopener noreferrer"&gt;Zig Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ziglang.org/documentation/master/std/" rel="noopener noreferrer"&gt;Zig Standard Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/zig" rel="noopener noreferrer"&gt;Zig Discord Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/augusthottie/calculator.git" rel="noopener noreferrer"&gt;Project Repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You've now built a functional command-line calculator in Zig! This project demonstrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic Zig syntax and features&lt;/li&gt;
&lt;li&gt;Error handling&lt;/li&gt;
&lt;li&gt;Testing&lt;/li&gt;
&lt;li&gt;Command-line I/O&lt;/li&gt;
&lt;li&gt;Project organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep exploring Zig's features and build upon this foundation to create more complex applications!&lt;/p&gt;

&lt;p&gt;Find me on X and Discord &lt;a class="mentioned-user" href="https://dev.to/augusthottie"&gt;@augusthottie&lt;/a&gt; &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Messaging System with RabbitMQ, Celery, and Flask</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Fri, 02 Aug 2024 17:14:26 +0000</pubDate>
      <link>https://forem.com/augusthottie/messaging-system-with-rabbitmq-celery-and-flask-2f8n</link>
      <guid>https://forem.com/augusthottie/messaging-system-with-rabbitmq-celery-and-flask-2f8n</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This article walks you through setting up a messaging system using Flask, RabbitMQ, and Celery. The system handles asynchronous tasks like sending emails and logging timestamps. You can host the application on an AWS EC2 instance and use &lt;code&gt;screen&lt;/code&gt; to keep it running persistently, while exposing it to the internet using ngrok.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Repo
&lt;/h3&gt;

&lt;p&gt;For code details visit my repo ⬇️:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AugustHottie/devops-stage3" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Features&lt;/li&gt;
&lt;li&gt;Technologies Used&lt;/li&gt;
&lt;li&gt;Installation&lt;/li&gt;
&lt;li&gt;Configuration&lt;/li&gt;
&lt;li&gt;Running the Application&lt;/li&gt;
&lt;li&gt;Endpoints&lt;/li&gt;
&lt;li&gt;Logging&lt;/li&gt;
&lt;li&gt;Using ngrok&lt;/li&gt;
&lt;li&gt;Hosting on AWS EC2&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Send Emails&lt;/strong&gt;: Use the &lt;code&gt;?sendmail&lt;/code&gt; parameter to send emails via SMTP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Timestamps&lt;/strong&gt;: Use the &lt;code&gt;?talktome&lt;/code&gt; parameter to log the current time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;View Logs&lt;/strong&gt;: Access application logs via the &lt;code&gt;/logs&lt;/code&gt; endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous Tasks&lt;/strong&gt;: Manage tasks asynchronously with RabbitMQ and Celery.&lt;/li&gt;
&lt;/ul&gt;
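
&lt;p&gt;Under the hood, the main route simply inspects the query string and decides which task to queue. The dispatch logic can be sketched framework-free (a minimal sketch with a hypothetical &lt;code&gt;dispatch&lt;/code&gt; helper; the actual handler names and Flask wiring live in the repo):&lt;/p&gt;

```python
# Hypothetical helper sketching the main route's decision; in the real app,
# Flask's request.args supplies this dictionary.
def dispatch(args):
    """Map query parameters to the action the route should take."""
    if "sendmail" in args:
        # Queue an email task for Celery; the address is the parameter value.
        return ("send_email", args["sendmail"])
    if "talktome" in args:
        # Queue a task that logs the current timestamp.
        return ("log_time", None)
    # No recognized parameter: just serve the index response.
    return ("index", None)

print(dispatch({"sendmail": "user@example.com"}))  # ('send_email', 'user@example.com')
print(dispatch({"talktome": ""}))  # ('log_time', None)
```

&lt;p&gt;In the real app, the returned action is handed off to a Celery task so the HTTP response isn't blocked on SMTP.&lt;/p&gt;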

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flask&lt;/strong&gt;: A lightweight framework for building web applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celery&lt;/strong&gt;: An asynchronous task queue for managing background tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RabbitMQ&lt;/strong&gt;: A messaging broker that handles message queuing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMTP&lt;/strong&gt;: Protocol used for sending emails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ngrok&lt;/strong&gt;: A tool to expose your local server to the internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.7 or higher&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pip&lt;/code&gt; (Python package installer)&lt;/li&gt;
&lt;li&gt;RabbitMQ server installed and running&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Clone the Repository
&lt;/h3&gt;

&lt;p&gt;First, clone the repository and navigate to the project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/AugustHottie/devops-stage3.git
&lt;span class="nb"&gt;cd &lt;/span&gt;devops-stage3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Create a Virtual Environment
&lt;/h3&gt;

&lt;p&gt;Create and activate a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows use `venv\Scripts\activate`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Install Dependencies
&lt;/h3&gt;

&lt;p&gt;Install the necessary packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Set Up Environment Variables&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file or set the variables directly in your shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   MAIL_ADDRESS=your-email@gmail.com
   APP_PASSWORD=your-google-app-password
   LOG_FILE_PATH=/var/log/messaging_system.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Ensure RabbitMQ is Running&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start the RabbitMQ server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start rabbitmq-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the Application
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Start the Celery Worker
&lt;/h3&gt;

&lt;p&gt;Open a new terminal, activate your virtual environment, and start the Celery worker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;celery &lt;span class="nt"&gt;-A&lt;/span&gt; app.celery worker &lt;span class="nt"&gt;--loglevel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;info
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
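
&lt;p&gt;The worker consumes tasks as RabbitMQ delivers them, so the web process returns immediately instead of blocking on slow work. The producer/consumer decoupling can be sketched with the standard library alone (illustration only; Celery plus RabbitMQ replace the in-process queue and thread in the real setup):&lt;/p&gt;

```python
import queue
import threading

# In the real system, RabbitMQ is the queue and a Celery worker is the consumer.
tasks = queue.Queue()
results = []

def worker():
    """Consume tasks until a None sentinel arrives (stands in for a Celery worker)."""
    while True:
        task = tasks.get()
        if task is None:
            break
        name, payload = task
        results.append(f"{name} done for {payload}")
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

# The Flask route only enqueues and returns immediately.
tasks.put(("send_email", "user@example.com"))
tasks.put(("log_time", "now"))
tasks.put(None)  # sentinel: tell the worker to stop
t.join()
print(results)  # ['send_email done for user@example.com', 'log_time done for now']
```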



&lt;h3&gt;
  
  
  Step 2: Start the Flask Application
&lt;/h3&gt;

&lt;p&gt;In another terminal window, run the Flask application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Endpoints
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main route to send emails or log time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http://localhost:8000/?sendmail=your_email@example.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main route to log the current time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http://localhost:8000/?talktome&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/logs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View application logs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;http://localhost:8000/logs&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Logging
&lt;/h2&gt;

&lt;p&gt;Logs are saved to the path specified by &lt;code&gt;LOG_FILE_PATH&lt;/code&gt;. Make sure the application has permission to write to this location.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using ngrok
&lt;/h2&gt;

&lt;p&gt;To expose your local application to the internet, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Download and Install ngrok&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   wget https://bin.equinox.io/c/4b0e5f0d1d6e/ngrok-stable-linux-amd64.zip
   unzip ngrok-stable-linux-amd64.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Add Your ngrok Authtoken&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sign up or log in to your &lt;a href="https://dashboard.ngrok.com/signup" rel="noopener noreferrer"&gt;ngrok account&lt;/a&gt; to get your authtoken. Once you have it, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ./ngrok authtoken your_ngrok_auth_token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;your_ngrok_auth_token&lt;/code&gt; with the token you received from the ngrok dashboard.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Expose Your Flask Application&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ./ngrok http 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Use the Provided ngrok URL&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use the URL provided by ngrok to access your application externally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosting on AWS EC2
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Launch an EC2 Instance
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log in to AWS Management Console&lt;/strong&gt; and navigate to EC2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launch a new EC2 instance&lt;/strong&gt; with your preferred configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure security groups&lt;/strong&gt; to allow HTTP (port 80) and custom TCP (port 8000) traffic.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2: Connect to Your EC2 Instance
&lt;/h3&gt;

&lt;p&gt;Use SSH to connect to your EC2 instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &lt;span class="nt"&gt;-i&lt;/span&gt; /path/to/your-key.pem ec2-user@your-ec2-public-dns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Install Dependencies on EC2
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Update the package list and install necessary packages:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;yum update &lt;span class="nt"&gt;-y&lt;/span&gt;
   &lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-pip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Clone the repository, create a virtual environment, and install dependencies:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/AugustHottie/devops-stage3.git
   &lt;span class="nb"&gt;cd &lt;/span&gt;devops-stage3
   python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
   &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
   pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Ensure RabbitMQ is installed and running:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; rabbitmq-server
   &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start rabbitmq-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Set Up &lt;code&gt;screen&lt;/code&gt; to Persist the Application
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install &lt;code&gt;screen&lt;/code&gt; if it is not already installed:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; screen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Start a new screen session:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   screen &lt;span class="nt"&gt;-S&lt;/span&gt; myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Run the Flask application within the screen session:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Detach from the screen session by pressing &lt;code&gt;Ctrl+A&lt;/code&gt;, then &lt;code&gt;D&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;To reattach to the screen session:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   screen &lt;span class="nt"&gt;-r&lt;/span&gt; myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
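
&lt;p&gt;If you forget the session name later, you can list all running screen sessions. This is a small convenience command, not one of the original steps:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   screen -ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;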



&lt;h3&gt;
  
  
  Step 5: Set Up ngrok on EC2
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Download and install ngrok:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   wget https://bin.equinox.io/c/4b0e5f0d1d6e/ngrok-stable-linux-amd64.zip
   unzip ngrok-stable-linux-amd64.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Add your ngrok authtoken:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ./ngrok authtoken your_ngrok_auth_token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;your_ngrok_auth_token&lt;/code&gt; with the token you received from the ngrok dashboard.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Expose your Flask application using ngrok:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ./ngrok http 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
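
&lt;p&gt;While the tunnel is up, ngrok also runs a local inspection interface on port 4040. If you need the public URL from a script rather than from the terminal UI, you can query its local API; this is a supplementary check, not part of the original setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl http://127.0.0.1:4040/api/tunnels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;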



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Use the provided ngrok URL&lt;/strong&gt; to access your application externally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This guide should help you get your messaging system up and running both locally and on an AWS EC2 instance. Feel free to share your feedback or suggestions in the comment section! 🚀&lt;/p&gt;

</description>
      <category>flask</category>
      <category>celery</category>
      <category>devops</category>
    </item>
    <item>
      <title>Introducing DevOpsFetch: A Comprehensive Server Monitoring Tool</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Fri, 02 Aug 2024 15:59:09 +0000</pubDate>
      <link>https://forem.com/augusthottie/introducing-devopsfetch-a-comprehensive-server-monitoring-tool-45o2</link>
      <guid>https://forem.com/augusthottie/introducing-devopsfetch-a-comprehensive-server-monitoring-tool-45o2</guid>
      <description>&lt;p&gt;Welcome to DevOpsFetch! This is a Bash tool designed to streamline the process of retrieving and monitoring server information. With DevOpsFetch, you can easily display active ports, user logins, Nginx configurations, Docker images, and container statuses. Additionally, the tool includes a systemd service for continuous monitoring and logging, ensuring your server remains under constant observation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Features&lt;/li&gt;
&lt;li&gt;
Installation

&lt;ul&gt;
&lt;li&gt;Dependencies&lt;/li&gt;
&lt;li&gt;Setup&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Usage

&lt;ul&gt;
&lt;li&gt;Display Active Ports&lt;/li&gt;
&lt;li&gt;Port Information&lt;/li&gt;
&lt;li&gt;Docker Information&lt;/li&gt;
&lt;li&gt;Nginx Information&lt;/li&gt;
&lt;li&gt;User Logins&lt;/li&gt;
&lt;li&gt;Time Range Activities&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Logging&lt;/li&gt;

&lt;li&gt;Help&lt;/li&gt;

&lt;li&gt;Full Script Code&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;DevOpsFetch offers a suite of features to help you monitor and manage your server effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Display all active ports and services&lt;/li&gt;
&lt;li&gt;Provide detailed information about a specific port&lt;/li&gt;
&lt;li&gt;List all Docker images and containers&lt;/li&gt;
&lt;li&gt;Provide detailed information about a specific Docker container&lt;/li&gt;
&lt;li&gt;Display all Nginx domains and their ports&lt;/li&gt;
&lt;li&gt;Provide detailed configuration information for a specific Nginx domain&lt;/li&gt;
&lt;li&gt;List all users and their last login times&lt;/li&gt;
&lt;li&gt;Provide detailed information about a specific user&lt;/li&gt;
&lt;li&gt;Display activities within a specified time range&lt;/li&gt;
&lt;li&gt;Continuous monitoring and logging with log rotation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dependencies
&lt;/h3&gt;

&lt;p&gt;Before you begin, ensure the following packages are installed on your system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;net-tools&lt;/code&gt; for &lt;code&gt;netstat&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docker.io&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nginx&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;jq&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;finger&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the repository:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/AugustHottie/devopsfetch-stage5a.git
   &lt;span class="nb"&gt;cd &lt;/span&gt;devopsfetch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Run the installation script to set up dependencies and the systemd service:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo&lt;/span&gt; ./install.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script will install the necessary dependencies, set up the &lt;code&gt;devopsfetch&lt;/code&gt; command, and enable the &lt;code&gt;devopsfetch&lt;/code&gt; systemd service.&lt;/p&gt;
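
&lt;p&gt;Once the installer finishes, you can verify that the service came up and is writing logs. These checks are supplementary and are not performed by the install script itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sudo systemctl status devopsfetch
sudo tail /var/log/devopsfetch.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;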

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Command-line Options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-p, --port&lt;/code&gt; : Display all active ports and services.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-p &amp;lt;port_number&amp;gt;&lt;/code&gt; : Display detailed information about a specific port.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-d, --docker&lt;/code&gt; : List all Docker images and containers.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-d &amp;lt;container_name&amp;gt;&lt;/code&gt; : Display detailed information about a specific Docker container.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-n, --nginx&lt;/code&gt; : Display all Nginx domains and their ports.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-n &amp;lt;domain&amp;gt;&lt;/code&gt; : Display detailed configuration information for a specific Nginx domain.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-u, --users&lt;/code&gt; : List all users and their last login times.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-u &amp;lt;username&amp;gt;&lt;/code&gt; : Display detailed information about a specific user.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-t, --time &amp;lt;start&amp;gt; &amp;lt;end&amp;gt;&lt;/code&gt; : Display activities within a specified time range.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-h, --help&lt;/code&gt; : Display usage instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Display Active Ports
&lt;/h3&gt;

&lt;p&gt;To display all active ports and services, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-p&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Port Information
&lt;/h3&gt;

&lt;p&gt;To provide detailed information about a specific port, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt;port_number&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-p&lt;/span&gt; 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Information
&lt;/h3&gt;

&lt;p&gt;To list all Docker images and containers, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To provide detailed information about a specific Docker container, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-d&lt;/span&gt; &amp;lt;container_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-d&lt;/span&gt; my_container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Nginx Information
&lt;/h3&gt;

&lt;p&gt;To display all Nginx domains and their ports, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To provide detailed configuration information for a specific Nginx domain, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;domain&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-n&lt;/span&gt; example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  User Logins
&lt;/h3&gt;

&lt;p&gt;To list all users and their last login times, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-u&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To provide detailed information about a specific user, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-u&lt;/span&gt; &amp;lt;username&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-u&lt;/span&gt; myuser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Time Range Activities
&lt;/h3&gt;

&lt;p&gt;To display activities within a specified time range, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;start_time&amp;gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;end_time&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"2024-07-21 00:00:00"&lt;/span&gt; &lt;span class="s2"&gt;"2024-07-22 00:00:00"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Logging
&lt;/h2&gt;

&lt;p&gt;Logs are written to the &lt;code&gt;/var/log/devopsfetch.log&lt;/code&gt; file. Log rotation is implemented to keep the log file size manageable: logs are rotated daily, and up to 7 days of logs are retained.&lt;/p&gt;
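
&lt;p&gt;For reference, a minimal logrotate rule matching this policy (daily rotation, 7 days retained) would look roughly like the following; the exact rule installed by the script may differ:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight"&gt;&lt;code&gt;/var/log/devopsfetch.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;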

&lt;h2&gt;
  
  
  Help
&lt;/h2&gt;

&lt;p&gt;To display the help message with usage instructions, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./devopsfetch.sh &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Full Script Code
&lt;/h2&gt;

&lt;p&gt;To view the full script code, check my GitHub repo: &lt;a href="https://github.com/AugustHottie/devopsfetch-stage5a" rel="noopener noreferrer"&gt;DevOpsFetch GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By following this guide, you should be able to effectively install, configure, and use DevOpsFetch to monitor your server. For any issues or questions, refer to the comment section or consult the script documentation on my GitHub. Happy monitoring!🔍🚀&lt;/p&gt;

</description>
      <category>bash</category>
      <category>tooling</category>
      <category>linux</category>
    </item>
    <item>
      <title>How to Deploy a Static Website on AWS EC2 Using Apache</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Fri, 02 Aug 2024 15:42:47 +0000</pubDate>
      <link>https://forem.com/augusthottie/how-to-deploy-a-static-website-on-aws-ec2-using-apache-23f2</link>
      <guid>https://forem.com/augusthottie/how-to-deploy-a-static-website-on-aws-ec2-using-apache-23f2</guid>
      <description>&lt;p&gt;Deploying a static website on an AWS EC2 instance using Apache is a straightforward process that you can follow to get your site live. This guide will walk you through each step using an example where the static website includes an HTML file hosted on GitHub. The example website mentions the HNG Internship which I took part in, and contains a link to &lt;a href="https://hng.tech" rel="noopener noreferrer"&gt;https://hng.tech&lt;/a&gt;, but you can adapt these steps to deploy any static website.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Setup AWS EC2 Instance&lt;/li&gt;
&lt;li&gt;Install and Configure Apache&lt;/li&gt;
&lt;li&gt;Deploy the Static Website&lt;/li&gt;
&lt;li&gt;Access the Website&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Deploying a static website on an AWS EC2 instance using Apache is a straightforward process. This guide will walk you through each step, ensuring your website is live and accessible. The static website includes an HTML file hosted on GitHub, mentioning the HNG Internship and linking to &lt;a href="https://hng.tech" rel="noopener noreferrer"&gt;https://hng.tech&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you begin, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account&lt;/li&gt;
&lt;li&gt;Basic knowledge of AWS EC2 and SSH&lt;/li&gt;
&lt;li&gt;An SSH key pair for accessing the EC2 instance&lt;/li&gt;
&lt;li&gt;A static website (HTML, CSS, JavaScript files)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup AWS EC2 Instance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Launch an EC2 Instance
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to your AWS Management Console.&lt;/li&gt;
&lt;li&gt;Navigate to the EC2 Dashboard.&lt;/li&gt;
&lt;li&gt;Click on "Launch Instance".&lt;/li&gt;
&lt;li&gt;Choose an Amazon Machine Image (AMI) (e.g., Amazon Linux 2 AMI).&lt;/li&gt;
&lt;li&gt;Select an Instance Type (e.g., t2.micro for free tier eligibility).&lt;/li&gt;
&lt;li&gt;Configure Instance Details (default settings are usually fine).&lt;/li&gt;
&lt;li&gt;Add Storage (default settings are fine).&lt;/li&gt;
&lt;li&gt;Add Tags (optional).&lt;/li&gt;
&lt;li&gt;Configure Security Group:

&lt;ul&gt;
&lt;li&gt;Add a rule to allow HTTP traffic on port 80.&lt;/li&gt;
&lt;li&gt;Add a rule to allow SSH traffic on port 22.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Review and Launch the instance.&lt;/li&gt;
&lt;li&gt;Select your SSH key pair for the instance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. Connect to the EC2 Instance
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open a terminal (or use PuTTY if on Windows).&lt;/li&gt;
&lt;li&gt;Connect to your instance using SSH:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ssh &lt;span class="nt"&gt;-i&lt;/span&gt; /path/to/your-key-pair.pem ec2-user@your-instance-public-ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Install and Configure Apache
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Update Packages
&lt;/h3&gt;

&lt;p&gt;First, update your package list to ensure you have the latest versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;yum update &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Install Apache
&lt;/h3&gt;

&lt;p&gt;Install the Apache web server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install &lt;/span&gt;httpd &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Start Apache
&lt;/h3&gt;

&lt;p&gt;Start the Apache service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start httpd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Enable Apache to Start on Boot
&lt;/h3&gt;

&lt;p&gt;Enable Apache to start on boot to ensure it runs automatically when the instance is restarted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;httpd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
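
&lt;p&gt;To confirm Apache is actually running before deploying any files, you can check the service state. This is a quick sanity check, not a required step:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status httpd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;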



&lt;h2&gt;
  
  
  Deploy the Static Website
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Navigate to the Apache Directory
&lt;/h3&gt;

&lt;p&gt;Change the directory to where Apache serves files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /var/www/html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Download the &lt;code&gt;index.html&lt;/code&gt; File Using &lt;code&gt;wget&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Download your HTML file from GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;wget https://raw.githubusercontent.com/AugustHottie/devops-task0/master/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Set the Correct Permissions
&lt;/h3&gt;

&lt;p&gt;Set the appropriate permissions for the &lt;code&gt;index.html&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown &lt;/span&gt;apache:apache /var/www/html/index.html
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;644 /var/www/html/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
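
&lt;p&gt;Before testing from a browser, you can confirm the file is being served locally on the instance. This &lt;code&gt;curl&lt;/code&gt; check is supplementary and should report an HTTP 200 status:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -I http://localhost/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;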



&lt;h2&gt;
  
  
  Access the Website
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Open a Web Browser
&lt;/h3&gt;

&lt;p&gt;Navigate to the public IP address of your EC2 instance. Your static website should be displayed. In our example, it mentions the HNG Internship and links to &lt;a href="https://hng.tech" rel="noopener noreferrer"&gt;https://hng.tech&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By following the steps outlined in this guide, you can successfully deploy any static website on an AWS EC2 instance using Apache. Your website should now be accessible via the public IP address of your EC2 instance. This guide uses an example from an HNG Internship task, but the steps are adaptable to any static site!&lt;/p&gt;

&lt;p&gt;If you have any questions or need further assistance, feel free to reach out. Happy deploying!🚀&lt;/p&gt;

</description>
      <category>aws</category>
      <category>webdeployment</category>
      <category>awsec2</category>
      <category>cloud</category>
    </item>
    <item>
      <title>User Management Automation: Bash Script Guide</title>
      <dc:creator>augusthottie</dc:creator>
      <pubDate>Mon, 01 Jul 2024 18:46:28 +0000</pubDate>
      <link>https://forem.com/augusthottie/user-management-automation-bash-script-guide-14pl</link>
      <guid>https://forem.com/augusthottie/user-management-automation-bash-script-guide-14pl</guid>
      <description>&lt;p&gt;Managing user accounts in a Linux environment can be a stressful, time-consuming, and error-prone process, especially in a large organization with many users. Automating it not only saves time and effort but also ensures consistency and accuracy. In this article, we'll look at a Bash script designed to automate user creation, group assignment, and password management. The script reads a file containing user information, creates users and groups as specified, sets up home directories with appropriate permissions, generates random passwords, and logs all actions.&lt;/p&gt;
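
&lt;p&gt;The script expects an input file in which each line holds a username and a comma-separated group list, separated by a semicolon. The snippet below builds a hypothetical example file (the usernames are made up) and demonstrates how the script's &lt;code&gt;cut&lt;/code&gt;-based parsing splits a line:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical users.txt - one entry per line: username;group1,group2
cat &amp;gt; users.txt &amp;lt;&amp;lt;'EOF'
light;sudo,dev
idimma;sudo
mayowa;dev,www-data
EOF

# Parse the first line the same way the script does
line=$(head -n 1 users.txt)
username=$(echo "$line" | xargs | cut -d';' -f1)
groups=$(echo "$line" | xargs | cut -d';' -f2)
echo "$username -&amp;gt; $groups"   # prints: light -&amp;gt; sudo,dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;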

&lt;h2&gt;
  
  
  Script Overview
&lt;/h2&gt;

&lt;p&gt;Here's a breakdown of the bash script used to automate the user management process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# Constants for the script&lt;/span&gt;
&lt;span class="nv"&gt;SECURE_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/secure"&lt;/span&gt;  &lt;span class="c"&gt;# The path to the secure folder where user information will be stored&lt;/span&gt;
&lt;span class="nv"&gt;LOG_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/user_management.log"&lt;/span&gt;  &lt;span class="c"&gt;# The path to the log file for recording script execution&lt;/span&gt;
&lt;span class="nv"&gt;PASSWORD_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/secure/user_passwords.csv"&lt;/span&gt;  &lt;span class="c"&gt;# The path to the file where user passwords will be stored&lt;/span&gt;

log_message&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;# Function to log a message with a timestamp to the log file&lt;/span&gt;
    &lt;span class="c"&gt;# Arguments:&lt;/span&gt;
    &lt;span class="c"&gt;#   $1: The message to be logged&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="s1"&gt;'+%Y-%m-%d %H:%M:%S'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; - &lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Function to generate a random password&lt;/span&gt;
generate_password&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-dc&lt;/span&gt; &lt;span class="s1"&gt;'A-Za-z0-9!@#$%^&amp;amp;*()-_'&lt;/span&gt; &amp;lt; /dev/urandom | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 16
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SECURE_FOLDER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="c"&gt;# Check if the secure folder exists, if not, create it&lt;/span&gt;
    &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$SECURE_FOLDER&lt;/span&gt;
    log_message &lt;span class="s2"&gt;"Secure folder created."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Check for command-line argument&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"$#"&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 1 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Usage: &lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt; &amp;lt;path_to_user_file&amp;gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nv"&gt;USER_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Check if file exists&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"File not found: &lt;/span&gt;&lt;span class="nv"&gt;$USER_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Create the log file and password file if they do not exist&lt;/span&gt;
&lt;span class="nb"&gt;touch&lt;/span&gt; &lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;
&lt;span class="nb"&gt;touch&lt;/span&gt; &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;

&lt;span class="c"&gt;# Set the permissions on the password file to be read/write only by the user executing the script&lt;/span&gt;
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;

&lt;span class="c"&gt;# Write the header to the password file if it is empty&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Username,Password"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Add new line to USER_FILE to avoid error in while loop&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$USER_FILE&lt;/span&gt;

&lt;span class="c"&gt;# read one by one, there is no seperator so it will read line by line&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; line&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="c"&gt;# Trim whitespace, and seperate them via ;&lt;/span&gt;
    &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | xargs | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;';'&lt;/span&gt; &lt;span class="nt"&gt;-f1&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;groups&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | xargs | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;';'&lt;/span&gt; &lt;span class="nt"&gt;-f2&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;# Skip empty lines&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        continue
    fi&lt;/span&gt;

    &lt;span class="c"&gt;# Create user group (personal group)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; getent group &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;groupadd &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        log_message &lt;span class="s2"&gt;"Group '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' created."&lt;/span&gt;
    &lt;span class="k"&gt;else
        &lt;/span&gt;log_message &lt;span class="s2"&gt;"Group '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' already exists."&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;

    &lt;span class="c"&gt;# Create user if not exists&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null 2&amp;gt;&amp;amp;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /bin/bash &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        log_message &lt;span class="s2"&gt;"User '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' created with home directory."&lt;/span&gt;

        &lt;span class="c"&gt;# Generate and set password&lt;/span&gt;
        &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;generate_password&lt;span class="si"&gt;)&lt;/span&gt;
        &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$password&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | chpasswd
        &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="nv"&gt;$password&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;
        log_message &lt;span class="s2"&gt;"Password set for user '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;'."&lt;/span&gt;

        &lt;span class="c"&gt;# Set permissions for home directory&lt;/span&gt;
        &lt;span class="nb"&gt;chmod &lt;/span&gt;700 &lt;span class="s2"&gt;"/home/&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"/home/&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        log_message &lt;span class="s2"&gt;"Home directory permissions set for user '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;'."&lt;/span&gt;
    &lt;span class="k"&gt;else
        &lt;/span&gt;log_message &lt;span class="s2"&gt;"User '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' already exists."&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;

    &lt;span class="c"&gt;# Add user to additional groups&lt;/span&gt;
    &lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-ra&lt;/span&gt; group_array &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$groups&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;group &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;group_array&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
            &lt;/span&gt;&lt;span class="nv"&gt;group&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | xargs&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# Trim whitespace&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
                if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; getent group &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
                    &lt;/span&gt;groupadd &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
                    log_message &lt;span class="s2"&gt;"Group '&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;' created."&lt;/span&gt;
                &lt;span class="k"&gt;fi
                &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
                log_message &lt;span class="s2"&gt;"User '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' added to group '&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;'."&lt;/span&gt;
            &lt;span class="k"&gt;fi
    done
done&lt;/span&gt; &amp;lt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Script Breakdown
&lt;/h2&gt;

&lt;p&gt;Let's break down the key components of this script and understand how they work together to automate user management.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting Up Directories and Files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The script begins by defining the locations of the secure folder, the log file, and the password file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SECURE_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/secure"&lt;/span&gt;
&lt;span class="nv"&gt;LOG_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/user_management.log"&lt;/span&gt;
&lt;span class="nv"&gt;PASSWORD_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/secure/user_passwords.csv"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A helper function, log_message(), prepends a timestamp and uses tee -a to write each message to both the console and the log file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;log_message&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="s1"&gt;'+%Y-%m-%d %H:%M:%S'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; - &lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script checks if the secure folder exists and creates it if it doesn't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SECURE_FOLDER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;$SECURE_FOLDER&lt;/span&gt;
    log_message &lt;span class="s2"&gt;"Secure folder created."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Validating Input&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The script expects a single argument: the path to the user file. It checks if the argument is provided and if the file exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"$#"&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 1 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Usage: &lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt; &amp;lt;path_to_user_file&amp;gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nv"&gt;USER_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"File not found: &lt;/span&gt;&lt;span class="nv"&gt;$USER_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
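&lt;p&gt;The guard can be exercised on its own. The sketch below is a standalone illustration of the "$#" check; the check_args name is hypothetical and exists only for this demo:&lt;/p&gt;

```shell
# hypothetical helper mirroring the script's argument-count guard
check_args() {
    if [ "$#" -ne 1 ]; then
        echo "Usage: script path_to_user_file"
        return 1
    fi
}

check_args || echo "rejected: no arguments"
check_args users.txt
echo "exit status with one argument: $?"   # prints 0
```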



&lt;ul&gt;
&lt;li&gt;Preparing Log and Password Files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The script creates and secures the log and password files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;touch&lt;/span&gt; &lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;
&lt;span class="nb"&gt;touch&lt;/span&gt; &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Username,Password"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Processing the User File&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main part of the script reads the user file line by line, processes each user, and performs the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read Username and Groups: It reads the username and groups from each line, trimming any whitespace:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;';'&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; username &lt;span class="nb"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$username&lt;/span&gt; | xargs&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;groups&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$groups&lt;/span&gt; | xargs&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        continue
    fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
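&lt;p&gt;The loop above implies an input layout of one "username; group1,group2" entry per line. A minimal sample file, parsed the same way (this standalone demo splits on the first semicolon with parameter expansion rather than IFS=';' read, but the result is identical):&lt;/p&gt;

```shell
# create a sample input file in the layout the loop expects:
# one "username; comma-separated,groups" entry per line
printf '%s\n' "alice; sudo,dev" "bob; dev,www-data" "charlie; admin" > /tmp/users.txt

# parse the first line: split on the first ';', then trim with xargs
first_line=$(head -n 1 /tmp/users.txt)
username=$(echo "${first_line%%;*}" | xargs)
groups=$(echo "${first_line#*;}" | xargs)
echo "$username -> $groups"   # prints "alice -> sudo,dev"
```

&lt;p&gt;Running the full script against this file (as root, since useradd and chpasswd require it) would create alice, bob, and charlie with the listed supplementary groups, e.g. sudo bash create_users.sh /tmp/users.txt, where the script filename is a placeholder.&lt;/p&gt;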



&lt;ul&gt;
&lt;li&gt;Create User Group: If the user's personal group does not exist, it creates the group:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; getent group &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;groupadd &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    log_message &lt;span class="s2"&gt;"Group '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' created."&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;log_message &lt;span class="s2"&gt;"Group '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' already exists."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create User: If the user does not exist, it creates the user with a home directory and assigns the personal group. It also generates a random password, sets it for the user, and secures the home directory:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null 2&amp;gt;&amp;amp;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /bin/bash &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    log_message &lt;span class="s2"&gt;"User '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' created with home directory."&lt;/span&gt;

    &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-dc&lt;/span&gt; &lt;span class="s1"&gt;'A-Za-z0-9!@#$%^&amp;amp;*()-_'&lt;/span&gt; &amp;lt; /dev/urandom | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 16&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$password&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | chpasswd
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="nv"&gt;$password&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;
    log_message &lt;span class="s2"&gt;"Password set for user '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;'."&lt;/span&gt;

    &lt;span class="nb"&gt;chmod &lt;/span&gt;700 &lt;span class="s2"&gt;"/home/&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"/home/&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    log_message &lt;span class="s2"&gt;"Home directory permissions set for user '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;'."&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;log_message &lt;span class="s2"&gt;"User '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' already exists."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
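&lt;p&gt;The full script wraps this /dev/urandom pipeline in a generate_password helper. A minimal sketch of that helper, assuming the same approach (the punctuation set is slightly trimmed here to keep the demo self-contained):&lt;/p&gt;

```shell
# sketch of the generate_password helper, assuming the same urandom pipeline
generate_password() {
    # draw random bytes, keep only the allowed characters, take the first 16
    head -c 4096 /dev/urandom | tr -dc 'A-Za-z0-9!@#$%_-' | head -c 16
}

pw=$(generate_password)
echo "${#pw}"   # prints 16
```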



&lt;ul&gt;
&lt;li&gt;Assign Groups: The script reads the additional groups for the user and assigns them, creating any missing groups:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-ra&lt;/span&gt; group_array &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$groups&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;group &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;group_array&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;group&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$group&lt;/span&gt; | xargs&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; getent group &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;groupadd &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        log_message &lt;span class="s2"&gt;"Group '&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;' created."&lt;/span&gt;
    &lt;span class="k"&gt;fi
    &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    log_message &lt;span class="s2"&gt;"User '&lt;/span&gt;&lt;span class="nv"&gt;$username&lt;/span&gt;&lt;span class="s2"&gt;' added to group '&lt;/span&gt;&lt;span class="nv"&gt;$group&lt;/span&gt;&lt;span class="s2"&gt;'."&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
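&lt;p&gt;The split-and-trim step can be seen in isolation. This standalone sketch splits on commas with tr instead of the bash-only here-string used in the script, but the xargs trimming behaves the same way:&lt;/p&gt;

```shell
# splitting a comma-separated group list and trimming each entry
groups=" sudo, dev ,www-data"
echo "$groups" | tr ',' '\n' | while read -r group; do
    group=$(echo "$group" | xargs)   # xargs trims surrounding whitespace
    echo "[$group]"                  # prints [sudo], [dev], [www-data]
done
```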



&lt;ul&gt;
&lt;li&gt;Finalize: The script re-applies restrictive permissions so the password file remains readable by its owner only:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;600 &lt;span class="nv"&gt;$PASSWORD_FILE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chmod 600 above ensures that only the file's owner (the user running the script, typically root) can read or write the password CSV file.&lt;/p&gt;
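&lt;p&gt;The resulting permission bits can be verified with stat. This sketch assumes GNU coreutils (stat -c); BSD/macOS stat uses a different flag syntax:&lt;/p&gt;

```shell
# verify the permission bits on a throwaway file; stat -c is GNU coreutils
f=$(mktemp)
chmod 600 "$f"
stat -c '%a' "$f"   # prints 600
rm -f "$f"
```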

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This script provides a robust solution for managing user accounts in a Linux environment. By automating user creation, group assignment, and password management, it ensures consistency and saves valuable time for system administrators. The use of logging and secure password storage further enhances the reliability and security of the user management process.&lt;/p&gt;

&lt;p&gt;For more information about automation and system administration, check out the HNG Internship program at &lt;a href="https://hng.tech/internship" rel="noopener noreferrer"&gt;HNG Internship&lt;/a&gt;. The program offers a wealth of resources and opportunities for aspiring DevOps engineers. Additionally, you can explore hiring opportunities at &lt;a href="https://hng.tech/hire" rel="noopener noreferrer"&gt;HNG Hire&lt;/a&gt; and premium services at &lt;a href="https://hng.tech/premium" rel="noopener noreferrer"&gt;HNG Premium&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By implementing such automation scripts, you can streamline your operations, reduce errors, and improve overall system security and efficiency.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>linux</category>
      <category>sysops</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
