<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Habeeb</title>
    <description>The latest articles on Forem by Habeeb (@dap0am_).</description>
    <link>https://forem.com/dap0am_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3297571%2Ff895b45c-0a0c-4af5-8551-51ffae8fd79d.png</url>
      <title>Forem: Habeeb</title>
      <link>https://forem.com/dap0am_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dap0am_"/>
    <language>en</language>
    <item>
      <title>EKS Networking Explained: Why am I running out of IPs? (Part 1)</title>
      <dc:creator>Habeeb</dc:creator>
      <pubDate>Mon, 03 Nov 2025 16:54:51 +0000</pubDate>
      <link>https://forem.com/dap0am_/eks-networking-explained-why-am-i-running-out-of-ips-part-1-8f3</link>
      <guid>https://forem.com/dap0am_/eks-networking-explained-why-am-i-running-out-of-ips-part-1-8f3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a two part series, Part 1 explains WHY IP exhaustion happens, Part 2 covers solutions and prevention.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's a casual morning. Your staging environment is working perfectly, and you have just deployed 5 new microservices to test a feature. Then suddenly, pods are stuck in the '&lt;strong&gt;Pending&lt;/strong&gt;' state. You describe a pod and see:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;0/3 nodes are available: 3 too many pods&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But you only have 20 pods total and 3 m5.large nodes. What gives?&lt;/p&gt;

&lt;p&gt;Welcome to EKS IP exhaustion: one of the most confusing problems for beginners. You have plenty of CPU and memory available, but Kubernetes refuses to schedule pods. The culprit? IP addresses.&lt;/p&gt;

&lt;p&gt;In this two-part series, we'll demystify EKS networking completely. Part 1 (this post) helps you understand WHY this happens and how to diagnose it. Part 2 covers solutions and prevention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Apartment Building Analogy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's make this sink in instantly with an easy analogy: think of your EKS cluster as a city with apartment buildings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nodes&lt;/strong&gt; = Apartment Buildings&lt;br&gt;
Each building (an EC2 instance) can hold multiple tenants and has a specific infrastructure capacity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pods&lt;/strong&gt; = Apartments/Units&lt;br&gt;
Where your containerized applications actually live.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IP Addresses&lt;/strong&gt; = Mailing Addresses&lt;br&gt;
Each apartment needs its unique address to receive mail (network traffic).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ENIs&lt;/strong&gt; = Mailbox Clusters&lt;br&gt;
Limited number of mailbox installations per building floor (Elastic Network Interfaces).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VPC CIDR&lt;/strong&gt; = City's Zip Code Range&lt;br&gt;
Finite pool of available addresses for the entire city.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The problem in simpler terms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your city (VPC) only has so many mailing addresses, each building (node) can only install limited mailbox clusters (ENIs), and each cluster serves a specific number of apartments (IPs per ENI).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here is the kicker:&lt;/strong&gt; Even if your building has empty apartments (available CPU/Memory), you can't rent them out if you've run out of mailboxes! That's exactly what happens with EKS.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: In EKS, IP address limits often become the bottleneck before CPU or memory. This is counterintuitive for beginners.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How EKS Actually Assigns IPs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's peek behind the curtain and see what really happens when you create a pod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Pod Creation Flow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You create a pod (via a YAML manifest and kubectl apply).&lt;/li&gt;
&lt;li&gt;Kubernetes scheduler finds a node with available capacity.&lt;/li&gt;
&lt;li&gt;VPC CNI plugin needs to assign the pod an IP from your VPC.&lt;/li&gt;
&lt;li&gt;VPC CNI checks: do I have a free IP on this node?&lt;br&gt;
YES: Assign an IP immediately; the pod starts running.&lt;br&gt;
NO: Check whether another ENI can be attached:&lt;br&gt;
Under the ENI limit: request a new ENI from AWS and assign an IP.&lt;br&gt;
At the ENI limit: the pod stays Pending indefinitely.&lt;/li&gt;
&lt;/ol&gt;
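
&lt;p&gt;&lt;em&gt;The decision flow above can be sketched as a tiny simulation. This is purely illustrative; the function and its inputs are stand-ins, not real VPC CNI code:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative sketch of the VPC CNI scheduling decision (not actual CNI code).
def assign_pod_ip(free_ips_on_node, enis_attached, max_enis):
    """Decide what happens to one new pod landing on this node."""
    if free_ips_on_node:              # a free secondary IP exists: assign it
        return "ASSIGN_IP"
    if enis_attached != max_enis:     # still under the instance's ENI limit
        return "ATTACH_NEW_ENI"       # request a new ENI from AWS, then assign
    return "PENDING"                  # out of IPs AND ENIs: the pod is stuck

# An m5.large-style node (3 ENIs max) with every ENI attached and no free IPs:
print(assign_pod_ip(free_ips_on_node=0, enis_attached=3, max_enis=3))  # PENDING
```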

&lt;p&gt;&lt;strong&gt;What is VPC CNI?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The VPC Container Network Interface (CNI) is AWS's networking plugin for EKS. Unlike other Kubernetes CNI plugins that use overlay networks, VPC CNI gives pods real IP addresses from your VPC subnets.&lt;br&gt;
Why? So your pods can communicate directly with other AWS resources without network translation. It's operationally brilliant, but it comes with IP limits.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Technical detail: VPC CNI uses 'Secondary IP mode' by default. Each pod gets a secondary IP from an ENI attached to the node, the node uses the primary IP of the ENI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The math behind it all&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time to get mathematical. Let's calculate exactly how many pods you can run on a node.&lt;/p&gt;

&lt;p&gt;The formula:&lt;/p&gt;

&lt;p&gt;Max pods = (number of ENIs x (IPs per ENI - 1)) + 2&lt;/p&gt;

&lt;p&gt;Let's break it down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Number of ENIs: Max network interfaces for your instance type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IPs per ENI: Max IPs each interface supports.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why -1?: One IP per ENI is the ENI's primary address.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Why +2?: kube-proxy and the VPC CNI pod run in host network mode, so they don't consume pod IPs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quick Example: m5.large instance type&lt;/p&gt;

&lt;p&gt;Specification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Max ENIs: 3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IPs per ENI: 10&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common wrong calculation: 3 x 10 = 30 pods&lt;/p&gt;

&lt;p&gt;Correct calculation using our formula:&lt;br&gt;
(3 x (10 - 1)) + 2&lt;br&gt;
= (3 x 9) + 2&lt;br&gt;
= 27 + 2 = 29 pods (max)&lt;/p&gt;
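
&lt;p&gt;&lt;em&gt;The same arithmetic as a small helper, so you can check any instance type once you know its ENI and IP limits (the m5.xlarge figures below come from AWS's published per-instance ENI limits):&lt;/em&gt;&lt;/p&gt;

```python
# Max pods = (number of ENIs x (IPs per ENI - 1)) + 2
def max_pods(max_enis, ips_per_eni):
    return max_enis * (ips_per_eni - 1) + 2

print(max_pods(3, 10))   # m5.large: 29
print(max_pods(4, 15))   # m5.xlarge: 58
```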

&lt;p&gt;&lt;strong&gt;Troubleshooting: Is this the error I am facing?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Check pod status&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Look for these in events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"0/X nodes available: X too many pods"&lt;/li&gt;
&lt;li&gt;"Unable to allocate IP address"&lt;/li&gt;
&lt;li&gt;"InsufficientFreeAddressesInSubnet"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Check node capacity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.capacity.pods&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Count pods per node&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods -A -o wide | awk '{print $8}' | sort | uniq -c&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Check VPC CNI logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl logs -n kube-system -l k8s-app=aws-node | grep NetworkInterfaceLimitExceeded&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Check subnet availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws ec2 describe-subnets --subnet-ids &amp;lt;id-of-your-subnet&amp;gt; --query 'Subnets[0].AvailableIpAddressCount'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If under 100 IPs are available, you're running low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next: Part 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You now understand WHY this happens and HOW to diagnose it.&lt;/p&gt;

&lt;p&gt;In part 2, we'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Quick fixes (10-30 minutes).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prefix delegation (from 29 to 110 pods per node).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced solutions and prevention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The cost connection (how this reveals wasteful spending).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Coming next: Part 2: 'Solving EKS IP exhaustion' will show you exactly how to fix this and prevent it forever.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy clustering&lt;/strong&gt;...&lt;/p&gt;

</description>
      <category>eks</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>The DevOps Reality: When "Simple" Tasks Become Learning Adventures</title>
      <dc:creator>Habeeb</dc:creator>
      <pubDate>Wed, 02 Jul 2025 11:27:32 +0000</pubDate>
      <link>https://forem.com/dap0am_/the-devops-reality-when-simple-tasks-become-learning-adventures-4h94</link>
      <guid>https://forem.com/dap0am_/the-devops-reality-when-simple-tasks-become-learning-adventures-4h94</guid>
      <description>&lt;p&gt;Let's talk about how nothing is ever truly simple in engineering and honestly, that's exactly why we get paid to solve these puzzles.&lt;br&gt;
Yesterday, I picked up what seemed like a straightforward task: updating our EKS node AMIs from Amazon Linux 2 to Amazon Linux 2023.&lt;/p&gt;

&lt;p&gt;The documentation made it look like a simple one-liner change. Classic mistake – trusting that "easy" means easy in the cloud world.&lt;br&gt;
I made the changes in our dev environment, deployed successfully, and started testing. That's when things took an interesting turn.&lt;/p&gt;

&lt;p&gt;My test was simple: create a new pod to trigger node scaling. The new node spun up as expected, then... got stuck. Cue the troubleshooting session that every DevOps engineer knows too well.&lt;/p&gt;

&lt;p&gt;After digging into the logs, I discovered kubelet was failing to start. Further investigation revealed the culprit: &lt;code&gt;bootstrap.sh&lt;/code&gt; had been removed from AL2023, and our configuration was still trying to use it. Amazon Linux 2023 now uses &lt;code&gt;nodeadm&lt;/code&gt; instead, a completely different approach to node bootstrapping.&lt;/p&gt;

&lt;p&gt;So there I was, diving deep into nodeadm documentation, learning a new tool, and redesigning our node initialization process.&lt;/p&gt;
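
&lt;p&gt;&lt;em&gt;For anyone hitting the same wall: AL2023 nodes consume a nodeadm NodeConfig document instead of bootstrap.sh flags. A minimal sketch of the shape, with every value a placeholder you'd replace with your cluster's real details:&lt;/em&gt;&lt;/p&gt;

```yaml
# Minimal nodeadm NodeConfig sketch; all values are placeholders.
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-cluster                                      # placeholder
    apiServerEndpoint: https://EXAMPLE.eks.amazonaws.com  # placeholder
    certificateAuthority: BASE64-ENCODED-CA-BUNDLE        # placeholder
    cidr: 10.100.0.0/16                                   # service CIDR, placeholder
```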

&lt;p&gt;&lt;strong&gt;The Real Point&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the essence of DevOps and platform engineering. What started as a "simple AMI update" quickly became a journey through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform configuration updates&lt;/li&gt;
&lt;li&gt;EKS cluster management&lt;/li&gt;
&lt;li&gt;EC2 instance troubleshooting&lt;/li&gt;
&lt;li&gt;Linux system administration&lt;/li&gt;
&lt;li&gt;Kubernetes kubelet debugging&lt;/li&gt;
&lt;li&gt;Learning an entirely new bootstrapping tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Lesson&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This wasn't a failure, it was just a normal day. In our field, every "simple" task is potentially a rabbit hole of learning opportunities. The key is embracing the complexity, staying curious, and remembering that this constant evolution and problem-solving is exactly what makes our work both challenging and rewarding.&lt;/p&gt;

&lt;p&gt;The ability to quickly switch context between different tools, debug across multiple layers of the stack, and adapt to changing technologies isn't just a nice-to-have skill, it's the core of what we do.&lt;/p&gt;

&lt;p&gt;What's your latest "simple" task that turned into an unexpected learning adventure? I'd love to hear about it in the comments.&lt;/p&gt;

</description>
      <category>eks</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>linux</category>
    </item>
    <item>
      <title>Save Costs with Automation: Using Lambda to Manage EC2 Uptime</title>
      <dc:creator>Habeeb</dc:creator>
      <pubDate>Thu, 26 Jun 2025 20:11:30 +0000</pubDate>
      <link>https://forem.com/dap0am_/save-costs-with-automation-using-lambda-to-manage-ec2-uptime-3c9b</link>
      <guid>https://forem.com/dap0am_/save-costs-with-automation-using-lambda-to-manage-ec2-uptime-3c9b</guid>
      <description>&lt;p&gt;Hear me out: It’s Monday morning, and you’re sipping your coffee as you would normally while also checking your AWS billing dashboard. Your heart sinks as you see that last month’s bill is way more than expected. The culprit? A couple of m5.xlarge instances running 24/7 for the past month, quietly burning through your budget while sitting completely idle outside business hours.&lt;/p&gt;

&lt;p&gt;If this sounds familiar, you’re not alone. In my team, we’ve all been there. Our data processing workloads only run during business hours, yet our EC2 instances were living their best life 24/7, accumulating costs like a taxi meter stuck in traffic. Even with the best intentions and sticky notes reminding us to “TURN OFF THE INSTANCES!”, human error inevitably crept in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Simple Solution That Saved Us Up to 70% on EC2 Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After one too many awkward meetings explaining why our AWS costs were through the roof, I decided enough was enough. The solution? A simple yet effective automation using AWS Lambda and Terraform that automatically stops our EC2 instances based on a schedule. No more forgotten instances, no more weekend charges for idle resources, and definitely no more awkward budget meetings.&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk you through exactly how to implement this solution, complete with tested Python code and Terraform configurations. By the end, you’ll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Lambda function that intelligently manages your EC2 instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Terraform code to deploy everything with a single command.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A scheduling system that works around your team’s actual working hours.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Peace of mind knowing your instances will never be forgotten again.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Approach Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before diving into the code, let’s talk about why Lambda + EventBridge (formerly CloudWatch Events) is the perfect combo for this use case:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-Effective&lt;/strong&gt;: Lambda charges only for execution time (a few cents per month for this use case).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliable&lt;/strong&gt;: AWS manages the infrastructure, so your scheduler won’t fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible&lt;/strong&gt;: Easy to adjust schedules for holidays, maintenance windows, or team changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auditable&lt;/strong&gt;: CloudTrail logs every start/stop action for compliance and debugging.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;How It All Works Together&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define EC2 Instances&lt;/strong&gt;&lt;br&gt;
You tag the EC2 instances that should be managed by the automation (e.g., AutoStop = true).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write Lambda Function&lt;/strong&gt;&lt;br&gt;
The Python code reads instance tags and starts or stops the instances accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set Up Scheduled Rules&lt;/strong&gt;&lt;br&gt;
Using EventBridge, you create cron-style schedules to invoke the Lambda function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provision Everything with Terraform&lt;/strong&gt;&lt;br&gt;
All the resources, including IAM permissions are managed through Terraform, making the setup portable and easy to manage.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the purpose of this blog, I will assume you already have Terraform, the AWS CLI, and Python configured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a new directory for this project&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir ec2-scheduler
cd ec2-scheduler
# create a directory for the Python scripts
mkdir python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the project structure we want to work with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ec2-scheduler/
 ├── python/
 │   ├── EC2InstanceStart.py
 │   ├── EC2InstanceStop.py
 │   └── requirements.txt
 ├── lambda_layer/
 │   └── python/
 │       └── (requests library files)
 ├── lambda_layer.zip
 ├── main.tf
 ├── eventbridge.tf
 ├── variables.tf
 ├── outputs.tf
 ├── provider.tf
 └── backend.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The requirements.txt file should contain only the line below, as requests is the sole extra library we use (boto3 comes pre-installed in the Lambda runtime). We will get to the lambda_layer directory later in this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;requests==2.31.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Terraform Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s start with the infrastructure. Here’s how we set up the Lambda functions, the necessary permissions, and the EventBridge rules that trigger them.&lt;/p&gt;

&lt;p&gt;Lambda function, IAM Role and Policy:&lt;/p&gt;

&lt;p&gt;Create a .tf file in your working directory and name it main.tf; it will contain the main components (the IAM role, policy, and Lambda functions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_iam_role" "LambdaEC2Role" {
  name = "LambdaEC2Role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Action = "sts:AssumeRole",
      Effect = "Allow",
      Principal = {
        Service = "lambda.amazonaws.com"
      }
    }]
  })

  tags = {
    Project = "EC2-Scheduler"
    Purpose = "Cost-Optimization"
  }
}

resource "aws_iam_policy" "LambdaEC2StartStopPolicy" {
  name = "LambdaEC2StartStopPolicy"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = ["ec2:StartInstances", "ec2:StopInstances"],
        Resource = "arn:aws:ec2:*:*:instance/*",
        Effect = "Allow"
      },
      {
        Action = ["ec2:DescribeInstances", "ec2:DescribeTags", "ec2:DescribeInstanceStatus"],
        Resource = "*",
        Effect = "Allow"
      },
      {
        Action = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
        Resource = "*",
        Effect = "Allow"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "attach_lambda_policy_to_lambda_role" {
  role       = aws_iam_role.LambdaEC2Role.name
  policy_arn = aws_iam_policy.LambdaEC2StartStopPolicy.arn
}

data "archive_file" "lambda_package" {
  type        = "zip"
  source_dir  = "${path.module}/python/"
  output_path = "${path.module}/lambda_package.zip"
}

# Lambda layer for the requests library, only created when Teams alerts are enabled
resource "aws_lambda_layer_version" "requests_layer" {
  count               = var.teams_webhook_url != "" ? 1 : 0
  layer_name          = "requests-layer"
  filename            = "${path.module}/lambda_layer.zip"
  compatible_runtimes = ["python3.12"]
}

resource "aws_lambda_function" "EC2AutoStopLambda" {
  filename                       = data.archive_file.lambda_package.output_path
  function_name                  = "EC2AutoStopLambda"
  role                           = aws_iam_role.LambdaEC2Role.arn
  runtime                        = "python3.12"
  handler                        = "EC2InstanceStop.lambda_handler"
  memory_size                    = 128
  timeout                        = 60
  reserved_concurrent_executions = 10
  source_code_hash               = data.archive_file.lambda_package.output_base64sha256

  # AWS_REGION is set automatically by the Lambda runtime, so we only pass the webhook URL
  environment {
    variables = {
      TEAMS_WEBHOOK_URL = var.teams_webhook_url
    }
  }

  # Attach the requests layer only when Teams notifications are configured
  layers = var.teams_webhook_url != "" ? [aws_lambda_layer_version.requests_layer[0].arn] : []

  tags = {
    Project = "EC2-Scheduler"
    Purpose = "Stop-Instances"
  }

  depends_on = [aws_iam_role_policy_attachment.attach_lambda_policy_to_lambda_role]
}

resource "aws_lambda_function" "EC2AutoStartLambda" {
  filename                       = data.archive_file.lambda_package.output_path
  function_name                  = "EC2AutoStartLambda"
  role                           = aws_iam_role.LambdaEC2Role.arn
  runtime                        = "python3.12"
  handler                        = "EC2InstanceStart.lambda_handler"
  memory_size                    = 128
  timeout                        = 60
  reserved_concurrent_executions = 10
  source_code_hash               = data.archive_file.lambda_package.output_base64sha256

  tags = {
    Project = "EC2-Scheduler"
    Purpose = "Start-Instances"
  }

  depends_on = [aws_iam_role_policy_attachment.attach_lambda_policy_to_lambda_role]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EventBridge rules that trigger the Lambdas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create another file in the root of your project and name it eventbridge.tf:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_cloudwatch_event_rule" "EC2AutoStopRule" {
  name                = "EC2AutoStopRule"
  description         = "Rule to trigger Lambda to Stop EC2 Instances"
  schedule_expression = var.stop_schedule

  tags = {
    Project = "EC2-Scheduler"
  }
}

resource "aws_cloudwatch_event_rule" "EC2AutoStartRule" {
  name                = "EC2AutoStartRule"
  description         = "Rule to trigger Lambda to start EC2 instances"
  schedule_expression = var.start_schedule

  tags = {
    Project = "EC2-Scheduler"
  }
}

resource "aws_cloudwatch_event_target" "EC2AutoStopRuleTarget" {
  target_id = "EC2AutoStopLambda"
  arn       = aws_lambda_function.EC2AutoStopLambda.arn
  rule      = aws_cloudwatch_event_rule.EC2AutoStopRule.name
}

resource "aws_cloudwatch_event_target" "EC2AutoStartRuleTarget" {
  target_id = "EC2AutoStartLambda"
  arn       = aws_lambda_function.EC2AutoStartLambda.arn
  rule      = aws_cloudwatch_event_rule.EC2AutoStartRule.name
}

resource "aws_lambda_permission" "EC2AutoStopLambdaPermission" {
  statement_id  = "AllowExecutionFromEventbridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.EC2AutoStopLambda.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.EC2AutoStopRule.arn
}

resource "aws_lambda_permission" "EC2AutoStartLambdaPermission" {
  statement_id  = "AllowExecutionFromEventbridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.EC2AutoStartLambda.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.EC2AutoStartRule.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we define variables to keep the code flexible. Create a file named variables.tf with the below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "aws_region" {
  description = "AWS region for resources"
  type        = string
}

variable "start_schedule" {
  description = "Cron expression for starting instances"
  type        = string
}

variable "stop_schedule" {
  description = "Cron expression for stopping instances"
  type        = string
}

variable "teams_webhook_url" {
  description = "Microsoft Teams webhook URL for notifications"
  type        = string
  default     = ""  # Set this via terraform.tfvars or environment variable
  sensitive   = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
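
&lt;p&gt;&lt;em&gt;A terraform.tfvars sketch with hypothetical weekday schedules. Note that EventBridge cron expressions are evaluated in UTC, so shift the hours for your timezone:&lt;/em&gt;&lt;/p&gt;

```hcl
# Hypothetical values; adjust the region and hours to your team's working day.
aws_region     = "us-east-1"
start_schedule = "cron(0 7 ? * MON-FRI *)"   # start at 07:00 UTC, Mon-Fri
stop_schedule  = "cron(0 19 ? * MON-FRI *)"  # stop at 19:00 UTC, Mon-Fri
# teams_webhook_url is optional; leave it unset to skip Teams alerts
```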



&lt;p&gt;Provider and backend config&lt;/p&gt;

&lt;p&gt;provider.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~&amp;gt; 5.0"
    }
  }
  required_version = "&amp;gt;= 1.0"
}

provider "aws" {
  region = var.aws_region  # We'll define this in variables.tf
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;backend.tf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "ec2-auto-scheduler/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the DynamoDB table and S3 bucket exist beforehand; setting them up could be a topic for a future post.&lt;/p&gt;

&lt;p&gt;Finally, I like to create an outputs.tf file to display useful information, below is what I think is needed for this deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "start_function_arn" {
  description = "ARN of the start Lambda function"
  value       = aws_lambda_function.start_instances.arn
}

output "stop_function_arn" {
  description = "ARN of the stop Lambda function"
  value       = aws_lambda_function.stop_instances.arn
}

output "start_schedule" {
  description = "Cron expression for start schedule"
  value       = aws_cloudwatch_event_rule.start_schedule.schedule_expression
}

output "stop_schedule" {
  description = "Cron expression for stop schedule"
  value       = aws_cloudwatch_event_rule.stop_schedule.schedule_expression
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Quick explanation on the Lambda layer&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;If you’re using Teams notifications, you’ll need to create a Lambda layer with the requests library. Run these commands from your project root (ec2-scheduler/):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From the ec2-scheduler/ directory
mkdir -p lambda_layer/python

# Install requests to the directory
pip install requests -t lambda_layer/python/

# Create the layer zip
cd lambda_layer
zip -r ../lambda_layer.zip .
cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once all of this is in place, up next is the actual Python Lambda code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Lambda Functions: Where the Magic Happens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now for the fun part, the actual code that will save you thousands of dollars. We’ll create two Lambda functions: one to stop instances and another to start them. The stop function has a bonus feature: it monitors instances whose AutoStop tag is set to false and alerts you if they’ve been running for more than 48 hours.&lt;/p&gt;

&lt;p&gt;The Stop Lambda Function: Your evening watchdog&lt;/p&gt;

&lt;p&gt;With the infrastructure defined in Terraform, the core logic lives inside the AWS Lambda function. This Python script performs two tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Automatically stops EC2 instances that are running and tagged for auto-stop.&lt;/li&gt;
&lt;li&gt;Sends a Microsoft Teams alert if any untagged instance has been running for more than 48 hours.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a folder named python in your current working directory, cd into it, then create a file named EC2InstanceStop.py with the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import logging
import os
import requests
from datetime import datetime, timedelta

logger = logging.getLogger()
logger.setLevel(logging.INFO)

region = os.environ['AWS_REGION']
ec2 = boto3.resource('ec2', region_name=region)

# Teams webhook URL, supplied by Terraform via the TEAMS_WEBHOOK_URL environment variable
teams_url = os.environ.get('TEAMS_WEBHOOK_URL', '')

def send_teams_alert(instance_id):
    if not teams_url:
        return  # Teams notifications not configured
    message = f"Instance {instance_id} has been running for more than 48 hours."
    headers = {"Content-Type": "application/json"}
    payload = {"text": message}
    requests.post(teams_url, headers=headers, json=payload)

def lambda_handler(event, context):

    filters = [
        {
            'Name': 'tag:AutoStop',
            'Values': ['TRUE','True','true']
        },
        {
            'Name': 'instance-state-name',
            'Values': ['running']
        }
    ]

    instances = ec2.instances.filter(Filters=filters)
    RunningInstances = [instance.id for instance in instances]
    print("Running Instances with AutoStop Tag : " + str(RunningInstances))

    if len(RunningInstances) &amp;gt; 0:
        for instance in instances:
            if instance.state['Name'] == 'running':
                print("Stopping Instance : " + instance.id)
        AutoStopping = ec2.instances.filter(InstanceIds=RunningInstances).stop()
        print("Stopped Instances : " + str(RunningInstances))
    else:
        print("Instance not in Running state or AutoStop Tag not set...")

    # Check for instances running for more than 48 hours
    filters = [
        {
            'Name': 'tag:AutoStop',
            'Values': ['FALSE','False','false']
        },
        {
            'Name': 'instance-state-name',
            'Values': ['running']
        }
    ]

    instances = ec2.instances.filter(Filters=filters)
    RunningInstances = [instance.id for instance in instances]

    for instance in instances:
        launch_time = instance.launch_time
        if datetime.now(launch_time.tzinfo) - launch_time &amp;gt; timedelta(hours=48):
            print("Instance running for more than 48 hours : " + instance.id)
            send_teams_alert(instance.id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What the code does:&lt;/p&gt;

&lt;p&gt;The Python script filters for instances with the tag AutoStop = true that are in a running state, then stops any instances that meet these criteria using the ec2.instances.stop() call. This lets you opt in, via tags, only the instances you want this automation to manage.&lt;/p&gt;

&lt;p&gt;The script also checks for instances not tagged for auto-stop (AutoStop = false); if they’ve been running for more than 48 hours, it sends an alert to a configured Microsoft Teams channel.&lt;/p&gt;

&lt;p&gt;We pick the instances we want this Lambda to manage by setting tags on them as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Tag Key    | Tag Value | Behavior                            |
| ---------- | --------- | ----------------------------------- |
| `AutoStop` | `true`    | Will be stopped automatically       |
| `AutoStop` | `false`   | Will be left running, but monitored |
| (missing)  | —         | Ignored entirely                    |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Start Function: Your Morning Coffee Companion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The start function is much simpler; it just needs to wake your instances from their slumber:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

region = os.environ['AWS_REGION']
ec2 = boto3.resource('ec2', region_name=region)

def lambda_handler(event, context):

    filters = [
        {
            'Name': 'tag:AutoStart',
            'Values': ['TRUE','True','true']
        },
        {
            'Name': 'instance-state-name',
            'Values': ['stopped']
        }
    ]

    instances = ec2.instances.filter(Filters=filters)
    StoppedInstances = [instance.id for instance in instances]
    print("Stopped Instances with AutoStart Tag : " + str(StoppedInstances))

    if len(StoppedInstances) &amp;gt; 0:
        for instance in instances:
            if instance.state['Name'] == 'stopped':
                print("Starting Instance : " + instance.id)
        AutoStarting = ec2.instances.filter(InstanceIds=StoppedInstances).start()
        print("Started Instances : " + str(StoppedInstances))
    else:
        print("Instance not in Stopped state or AutoStart Tag not set...")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets you apply the automation without losing control. Using tags keeps things flexible: you never have to hardcode instance IDs, and new instances can be enrolled simply by tagging them, which is very handy when you run different kinds of workloads.&lt;/p&gt;
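<p>As a sketch, the same opt-in tags can also be applied programmatically with boto3 (the helper names here are illustrative, not part of the article's code):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>def scheduler_tags(auto_start=True, auto_stop=True):
    # Build the tag set the start/stop Lambdas filter on
    return [
        {"Key": "AutoStart", "Value": str(auto_start).lower()},
        {"Key": "AutoStop", "Value": str(auto_stop).lower()},
    ]

def enroll_instance(instance_id, region="us-east-1", **tag_kwargs):
    # Requires AWS credentials; boto3 is imported lazily so the
    # pure helper above stays usable without the SDK installed
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    ec2.create_tags(Resources=[instance_id], Tags=scheduler_tags(**tag_kwargs))
</code></pre>

</div>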

&lt;p&gt;&lt;strong&gt;Deployment: Making It All Come Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now the fun part: putting everything we’ve done so far together so our automation comes to life.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Prepare the instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We want to make sure our EC2 instances are properly tagged. You can tag them with the commands below, or do it in the Management Console.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Tag instances you want to auto-start and auto-stop
aws ec2 create-tags \
    --resources "$INSTANCE_ID" \
    --tags Key=AutoStart,Value=true Key=AutoStop,Value=true

# For instances that should only be stopped (not auto-started)
aws ec2 create-tags \
    --resources "$INSTANCE_ID" \
    --tags Key=AutoStop,Value=true

# For instances you want to monitor but not auto-stop
aws ec2 create-tags \
    --resources "$INSTANCE_ID" \
    --tags Key=AutoStop,Value=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Configure Your Variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;terraform.tfvars&lt;/code&gt; file with your environment-specific settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_region = "us-east-1"

# Adjust these times for your timezone (these are in UTC)
start_schedule = "cron(0 13 ? * MON-FRI *)"  # 8 AM EST
stop_schedule  = "cron(0 23 ? * MON-FRI *)"  # 6 PM EST

# Add your Teams webhook for notifications
teams_webhook_url = "https://outlook.office.com/webhook/YOUR-WEBHOOK-URL"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
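<p>EventBridge cron expressions are evaluated in UTC, so it's worth double-checking the conversion from your local time. A quick sanity check with the standard library (the date below is just an arbitrary winter Monday):<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>from datetime import datetime
from zoneinfo import ZoneInfo

# 8 AM Eastern on a Monday in January (EST, UTC-5)
local = datetime(2025, 1, 6, 8, 0, tzinfo=ZoneInfo("America/New_York"))
utc_hour = local.astimezone(ZoneInfo("UTC")).hour

print(utc_hour)  # 13, matching cron(0 13 ? * MON-FRI *)
</code></pre>

</div>

<p>Keep in mind that a fixed UTC schedule does not follow daylight saving time: once EDT starts, the same rule fires at 9 AM Eastern.</p>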



&lt;p&gt;&lt;strong&gt;Step 3: Initialize and Deploy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time to deploy! Run these commands from your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize Terraform
terraform init

# Create the Lambda layer (if using Teams notifications)

pip install requests -t lambda_layer/python/
cd lambda_layer &amp;amp;&amp;amp; zip -r ../lambda_layer.zip . &amp;amp;&amp;amp; cd ..

# Review what will be created
terraform plan

# Deploy everything
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Terraform asks for confirmation, review the resources being created; in our case you should see the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2 Lambda functions
2 EventBridge rules
1 IAM role and policy
Various permissions and attachments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Type yes to proceed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Verify Your Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After a successful deployment, Terraform will print the outputs we defined in the &lt;code&gt;outputs.tf&lt;/code&gt; file earlier, and you can also check the Management Console to confirm the resources have been created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bottom Line: Your AWS Bill Will Thank You&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s recap what you’ve just built:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Automated EC2 scheduling that never forgets.
✅ Proactive monitoring for runaway instances.
✅ Infrastructure as code for easy replication.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But here’s the real win: peace of mind. No more weekend anxiety about whether someone left instances running. No more awkward conversations with finance. Just predictable, optimized AWS costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As engineers, we’re problem solvers. This solution started from a real pain point and evolved into something that saves our team thousands annually. Your feedback and experiences make these solutions better for everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remember&lt;/strong&gt;: The best time to optimize your AWS costs was yesterday. The second best time is now.&lt;/p&gt;

&lt;p&gt;Happy cost saving, and may your AWS bills be forever low! 🎉&lt;/p&gt;

&lt;p&gt;If you found it helpful, please let me know in the comments. Your feedback helps me improve and motivates me to share more solutions.&lt;/p&gt;

&lt;p&gt;Please follow me for more AWS optimization tips. Next up: “How to configure lifecycle rule on S3 buckets” — another fun task for us to explore.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>lambda</category>
      <category>ec2</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
