<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sushanta Paudel</title>
    <description>The latest articles on Forem by Sushanta Paudel (@sushanta_paudel).</description>
    <link>https://forem.com/sushanta_paudel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2934615%2F1ab33037-e8c3-4133-bf39-a40a3e5cbb12.jpg</url>
      <title>Forem: Sushanta Paudel</title>
      <link>https://forem.com/sushanta_paudel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sushanta_paudel"/>
    <language>en</language>
    <item>
      <title>Building a Real-Time Data Streaming Pipeline on AWS</title>
      <dc:creator>Sushanta Paudel</dc:creator>
      <pubDate>Wed, 29 Oct 2025 03:41:36 +0000</pubDate>
      <link>https://forem.com/sushanta_paudel/building-a-real-time-data-streaming-pipeline-on-aws-i3n</link>
      <guid>https://forem.com/sushanta_paudel/building-a-real-time-data-streaming-pipeline-on-aws-i3n</guid>
      <description>&lt;p&gt;Let’s imagine we are building a system to handle continuous streams of IoT sensor data, such as temperature, humidity, and more. In this blog, I will create a data pipeline using AWS Kinesis Data Streams and AWS Lambda.&lt;/p&gt;

&lt;p&gt;For demonstration purposes, I’ll simulate IoT sensor data using a producer Lambda, which will act as our IoT device and push sensor data into the Kinesis stream. On the other end, a consumer Lambda will process the incoming events in real time, logging them to CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;To ensure no data is lost, any failed events will be redirected to an SQS Dead Letter Queue (DLQ) for later analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create the Kinesis Stream:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the AWS Management Console, navigate to Kinesis and select Data streams. Click Create, then provide a name for your stream (for example, iot-sensor-stream). Choose On-demand as the capacity mode or Provisioned if you can predict the amount of data you will be streaming. Finally, click Create to set up the data stream.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ji8oopp8v8j4g71m2d8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ji8oopp8v8j4g71m2d8.png" alt="Creating a Kinesis Data Stream" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Create a Producer Lambda that simulates IoT sensors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8a7djh5kbnnw0zbu7kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs8a7djh5kbnnw0zbu7kx.png" alt="Creating Producer Lambda" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code for the Lambda Function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import json
import boto3
import random
import time

# Initialize Kinesis client
kinesis = boto3.client('kinesis')

# Stream ARN is stored in Parameter Store
ssm = boto3.client('ssm')
stream_arn = ssm.get_parameter(Name="/kinesis/iot_sensor_stream_arn")['Parameter']['Value']
stream_name = stream_arn.split("/")[-1]  # extract stream name from ARN

def lambda_handler(event, context):
    # Simulate IoT sensor readings
    sensor_data = {
        "sensor_id": f"sensor-{random.randint(1, 5)}",
        "temperature": round(random.uniform(18.0, 30.0), 2),
        "humidity": round(random.uniform(30.0, 60.0), 2),
        "timestamp": int(time.time())
    }

    # Put record into Kinesis
    response = kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(sensor_data),
        PartitionKey=sensor_data["sensor_id"]
    )

    print(f"Produced sensor data: {sensor_data}")
    return {"statusCode": 200, "body": json.dumps(sensor_data)}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
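As a quick local sanity check, the stream-name extraction from the ARN (the last setup step in the code above) can be verified without any AWS access; the ARN below is a placeholder value, not a real resource:

```python
# Deriving the stream name from a Kinesis stream ARN, as the producer does.
# The ARN below is a placeholder, not a real resource.
stream_arn = "arn:aws:kinesis:us-east-1:123456789012:stream/iot-sensor-stream"
stream_name = stream_arn.split("/")[-1]  # everything after "stream/"
print(stream_name)  # iot-sensor-stream
```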



&lt;p&gt;We need to attach an IAM role to this Lambda with the following permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kinesis:PutRecord",
                "kinesis:PutRecords"
            ],
            "Resource": "arn:aws:kinesis:&amp;lt;region&amp;gt;:&amp;lt;account-id&amp;gt;:stream/iot-sensor-stream"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:GetParameters"
            ],
            "Resource": "arn:aws:ssm:&amp;lt;region&amp;gt;:&amp;lt;account-id&amp;gt;:parameter/&amp;lt;parameter-name&amp;gt;"
        }
    ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to replace 'region' with our AWS region, 'account-id' with our AWS account ID, and 'parameter-name' with the name of the SSM parameter our Lambda needs to access. This gives the Lambda permission to put records into the specific Kinesis stream and to read parameters from SSM Parameter Store.&lt;/p&gt;
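If we script our setup, these Resource ARNs can be assembled from the placeholders. This is only a sketch: the region, account ID, and parameter name below are example values, not real resources.

```python
# Build the policy's Resource ARNs from placeholder values (examples only).
def kinesis_stream_arn(region, account_id, stream_name):
    return f"arn:aws:kinesis:{region}:{account_id}:stream/{stream_name}"

def ssm_parameter_arn(region, account_id, parameter_name):
    # SSM parameter ARNs drop the parameter name's leading slash
    return f"arn:aws:ssm:{region}:{account_id}:parameter/{parameter_name.lstrip('/')}"

print(kinesis_stream_arn("us-east-1", "123456789012", "iot-sensor-stream"))
print(ssm_parameter_arn("us-east-1", "123456789012", "/kinesis/iot_sensor_stream_arn"))
```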

&lt;p&gt;&lt;strong&gt;Step 3: Create Consumer Lambda that will process IoT data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsaeckxra39tdprnecqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsaeckxra39tdprnecqp.png" alt="Creating Consumer Lambda" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code for the consumer Lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import base64
import json

def lambda_handler(event, context):
    for record in event['Records']:
        # Kinesis data is base64 encoded
        payload = base64.b64decode(record['kinesis']['data']).decode('utf-8')
        sensor_data = json.loads(payload)

        # For demo: just log it
        print(f"Consumed sensor data: {sensor_data}")

        # Example future processing:
        # - store in DynamoDB
        # - send alerts if thresholds exceeded
        # - forward to analytics pipeline

    return {"statusCode": 200}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
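We can exercise the consumer's decoding logic locally by hand-building an event with the same shape Kinesis delivers (base64-encoded data under each record's kinesis.data field); the sensor values below are made up:

```python
import base64
import json

# Build a fake Kinesis event and decode it the way the consumer does.
sensor_data = {"sensor_id": "sensor-1", "temperature": 21.5, "humidity": 44.2}
encoded = base64.b64encode(json.dumps(sensor_data).encode("utf-8")).decode("utf-8")
event = {"Records": [{"kinesis": {"data": encoded}}]}

for record in event["Records"]:
    payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
    decoded = json.loads(payload)
    print(f"Consumed sensor data: {decoded}")
```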



&lt;p&gt;We need to attach the AWSLambdaKinesisExecutionRole managed policy and the sqs:SendMessage permission (for the DLQ) to this Lambda's role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Connecting Kinesis Stream to the Consumer Lambda&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Lambda console, we need to go to consumer-lambda. In the Configuration section, we will find Triggers and add Kinesis as a trigger, selecting our stream iot-sensor-stream as the source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdo5weinqa9m0o1ck02b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdo5weinqa9m0o1ck02b.png" alt="Adding Kinesis Stream as a trigger in Consumer Lambda" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will keep the Starting position as Latest, and under the on-failure settings, we will select the SQS DLQ and enter the ARN of iot-sensor-dlq.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6u1rusbz4az8b8b41csh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6u1rusbz4az8b8b41csh.png" alt="SQS DLQ settings" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producer Monitoring&lt;/strong&gt;&lt;br&gt;
We can go to CloudWatch Logs and open the log group /aws/lambda/lambda-producer. Verify that the logs show entries like:&lt;/p&gt;

&lt;p&gt;Produced sensor data:&lt;br&gt;
&lt;code&gt;{'sensor_id': 'sensor-3', 'temperature': 24.1, ...}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, in the Kinesis console, we can navigate to our data stream and open the Monitoring tab. Watching metrics such as IncomingRecords and PutRecord.Success ensures that the producer is sending data successfully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Monitoring&lt;/strong&gt;&lt;br&gt;
For the consumer, we can go to CloudWatch Logs → /aws/lambda/lambda-consumer and verify that entries like the following appear:&lt;/p&gt;

&lt;p&gt;Consumed sensor data:&lt;br&gt;
&lt;code&gt;{'sensor_id': 'sensor-3', 'temperature': 24.1, ...}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the Kinesis console’s Monitoring tab, we can check metrics such as GetRecords.Success and IteratorAgeMilliseconds. These help us ensure that the consumer is processing records in near real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DLQ Monitoring&lt;/strong&gt;&lt;br&gt;
Finally, in the SQS console, we can open our dead letter queue (iot-sensor-dlq) and check the Approximate number of messages. If the consumer fails (for example, if you intentionally break JSON parsing in the code), failed batches will appear here for later inspection.&lt;/p&gt;

&lt;p&gt;So with this setup, we have a working IoT sensor streaming pipeline: a Producer Lambda (the IoT simulator) pushes readings into a Kinesis stream for buffering, and a Consumer Lambda processes the stream in real time. We have also provisioned an SQS DLQ for failures and integrated CloudWatch and Kinesis metrics for monitoring.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awskinesis</category>
      <category>datapipeline</category>
      <category>lambda</category>
    </item>
    <item>
      <title>Transforming and Querying JSON Data in AWS S3 with Glue and Athena</title>
      <dc:creator>Sushanta Paudel</dc:creator>
      <pubDate>Wed, 15 Oct 2025 10:34:36 +0000</pubDate>
      <link>https://forem.com/aws-builders/transforming-and-querying-json-data-in-aws-s3-with-glue-and-athena-15ep</link>
      <guid>https://forem.com/aws-builders/transforming-and-querying-json-data-in-aws-s3-with-glue-and-athena-15ep</guid>
      <description>&lt;p&gt;AWS Glue is an AWS service that helps discover, prepare, and integrate all your data at any scale. It can aid in the process of transformation, discovering, and integrating data from multiple sources.&lt;/p&gt;

&lt;p&gt;In this blog, we have a data source that periodically uploads JSON files to an Amazon S3 bucket named uploads-bucket. These JSON files contain log entries.&lt;/p&gt;

&lt;p&gt;Below is an example of one of these log files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "timestamp": "2025-10-04T10:15:30Z",
  "level": "ERROR",
  "service": "authentication",
  "message": "Failed login attempt",
  "user": {
    "id": "4626",
    "username": "johndoe",
    "ip_address": "192.168.1.101"
  },
  "metadata": {
    "session_id": "abc123xyz",
    "location": "Kathmandu"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
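Later, the processing Lambda will flatten nested JSON like this with pd.json_normalize; the effect can be sketched in plain Python (abridged sample below):

```python
# A plain-Python sketch of the flattening pd.json_normalize performs:
# nested keys are joined with "." into a single flat record.
def flatten(obj, parent_key=""):
    out = {}
    for key, value in obj.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            out.update(flatten(value, full_key))
        else:
            out[full_key] = value
    return out

log_entry = {
    "level": "ERROR",
    "service": "authentication",
    "user": {"id": "4626", "username": "johndoe"},
}
print(flatten(log_entry))
```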



&lt;p&gt;&lt;strong&gt;Step 1: Create the processing Lambda Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, we will implement an AWS Lambda function that processes Amazon S3 object-created events for the JSON files. The function will transform the data and write it to a second Amazon S3 bucket named target-bucket. Specifically, it will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transform the JSON data.&lt;/li&gt;
&lt;li&gt;Write it to a target S3 bucket named target-bucket in Parquet format.&lt;/li&gt;
&lt;li&gt;Catalog the data in AWS Glue for querying.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import boto3
import awswrangler as wr
import pandas as pd

GLUE_DATABASE = "log_database"
TARGET_BUCKET = "target-bucket"

def parse_event(event):
    # EventBridge S3 object-created structure
    key = event['detail']['object']['key']
    bucket = event['detail']['bucket']['name']
    return key, bucket

def read_object(bucket, key):
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, key)
    return obj.get()['Body'].read().decode('utf-8')

def create_database():
    databases = wr.catalog.databases()
    if GLUE_DATABASE not in databases['Name'].values:
        wr.catalog.create_database(GLUE_DATABASE)
        print(f"Database {GLUE_DATABASE} created")
    else:
        print(f"Database {GLUE_DATABASE} already exists")

def lambda_handler(event, context):
    key, bucket = parse_event(event)
    object_body = read_object(bucket, key)

    create_database()

    log_entry = json.loads(object_body)

    # Flatten JSON
    log_df = pd.json_normalize(log_entry)

    # Create partition columns for Glue
    log_df['log_level'] = log_df['level']
    log_df['service_name'] = log_df['service']

    # Write logs to Glue catalog
    wr.s3.to_parquet(
        df=log_df.astype(str),
        path=f"s3://{TARGET_BUCKET}/data/logs/",
        dataset=True,
        database=GLUE_DATABASE,
        table="logs",
        mode="append",
        partition_cols=["log_level", "service_name"],
        description="Application log table",
        parameters={
            "source": "Application",
            "class": "log"
        },
        columns_comments={
            "timestamp": "Time of the log event",
            "level": "Log level (INFO, WARN, ERROR, etc.)",
            "service": "Service generating the log",
            "message": "Log message",
            "user.id": "User identifier",
            "user.username": "Username"
        }
    )

    return {"statusCode": 200}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
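The parse_event helper can be checked locally against the shape of an EventBridge "Object Created" event; the key and bucket values below are examples:

```python
# Local check of parse_event against the EventBridge S3 event shape.
def parse_event(event):
    key = event["detail"]["object"]["key"]
    bucket = event["detail"]["bucket"]["name"]
    return key, bucket

sample_event = {
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "uploads-bucket"},
        "object": {"key": "logs/2025-10-04.json"},
    },
}
print(parse_event(sample_event))  # ('logs/2025-10-04.json', 'uploads-bucket')
```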


&lt;p&gt;We need to attach an IAM role with the necessary permissions to the Lambda function. The function needs access to both S3 buckets and AWS Glue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Creating an Amazon EventBridge Rule&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now we will create an Amazon EventBridge rule to invoke our AWS Lambda function when an object is uploaded to the Amazon S3 bucket. We will create a new Amazon EventBridge rule as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name: s3-uploads&lt;/li&gt;
&lt;li&gt;Event bus: default&lt;/li&gt;
&lt;li&gt;Rule type: Rule with an event pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftbass14ovsax9j5zgjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftbass14ovsax9j5zgjy.png" alt="Creating an AWS EventBridge Rule" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scroll down to the Event pattern section, and enter and select:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event source: AWS Services&lt;/li&gt;
&lt;li&gt;AWS service: S3&lt;/li&gt;
&lt;li&gt;Event type: Object Created&lt;/li&gt;
&lt;li&gt;Specific Bucket: uploads-bucket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xxl8m5kbbllduw52poe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xxl8m5kbbllduw52poe.png" alt="Specifying the S3 bucket" width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Target Configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Target types: AWS Lambda&lt;/li&gt;
&lt;li&gt;Function: Enter our Lambda function name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfikm7ymsoxs3xyxpt88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfikm7ymsoxs3xyxpt88.png" alt="Selecting the Lambda as target" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The event pattern JSON here will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["uploads-bucket"]
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
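Conceptually, EventBridge delivers an event when every field in the pattern contains the event's value. A simplified sketch of that matching (real EventBridge patterns support many more operators):

```python
# Simplified sketch of EventBridge pattern matching for the rule above:
# each top-level field must equal one of the allowed values, and the
# nested bucket name must match.
pattern = {"source": ["aws.s3"], "detail-type": ["Object Created"]}

def matches(event):
    for field, allowed in pattern.items():
        if event.get(field) not in allowed:
            return False
    return event.get("detail", {}).get("bucket", {}).get("name") == "uploads-bucket"

event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "uploads-bucket"}},
}
print(matches(event))  # True
```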



&lt;p&gt;&lt;strong&gt;Step 3: Add index partition to the logs table using AWS Glue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the Lambda function we created runs, it will create a table named logs in the Glue database.&lt;/p&gt;

&lt;p&gt;In the AWS Console, go to the Glue dashboard. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To see databases, in the left-hand menu under Data Catalog, click Databases. We will find the database our Lambda function created in this section. Click on this database.&lt;/li&gt;
&lt;li&gt;Click on the logs table. The table data is stored in the target Amazon S3 bucket named target-bucket.&lt;/li&gt;
&lt;li&gt;We will now add a partition index in this table for faster query performance.&lt;/li&gt;
&lt;li&gt;To begin creating an index, click Add index.&lt;/li&gt;
&lt;li&gt;Enter and select the following to configure the index:
Index name: log_level_index
Index keys: log_level
Then click Update.&lt;/li&gt;
&lt;li&gt;We can also view this data using Amazon Athena: at the top of the page, click Actions and then View data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Searching within your Indexed AWS S3 data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the index now in place, we can query efficiently using log_level as a filter.&lt;/p&gt;

&lt;p&gt;We first need to open the Athena query editor and select our database.&lt;/p&gt;

&lt;p&gt;Example query: Retrieve all ERROR logs from the authentication service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT timestamp, service, message, user_username, metadata_location
FROM logs
WHERE log_level = 'ERROR'
  AND service_name = 'authentication';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
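The query's effect on the flattened rows can be illustrated in memory; the records below are made-up sample data:

```python
# Sample rows mirroring the flattened logs table (made-up data).
rows = [
    {"log_level": "ERROR", "service_name": "authentication", "message": "Failed login attempt"},
    {"log_level": "INFO", "service_name": "billing", "message": "Invoice created"},
]

# Equivalent of: WHERE log_level = 'ERROR' AND service_name = 'authentication'
errors = [r for r in rows
          if r["log_level"] == "ERROR" and r["service_name"] == "authentication"]
print(errors)
```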



&lt;p&gt;In conclusion, we can combine AWS Lambda, Amazon S3, AWS Glue, and Athena to build a fully serverless, scalable data pipeline that transforms, catalogs, and queries log data in near real time. Leveraging Glue partitions and indexes allows for efficient storage and faster analytics, while Athena enables ad-hoc queries without the overhead of managing traditional databases.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>s3</category>
      <category>data</category>
      <category>awsglue</category>
    </item>
    <item>
      <title>ETL Made Easy: Integrating Multi-Source Data with AWS Glue</title>
      <dc:creator>Sushanta Paudel</dc:creator>
      <pubDate>Fri, 03 Oct 2025 07:23:37 +0000</pubDate>
      <link>https://forem.com/sushanta_paudel/etl-made-easy-integrating-multi-source-data-with-aws-glue-29p1</link>
      <guid>https://forem.com/sushanta_paudel/etl-made-easy-integrating-multi-source-data-with-aws-glue-29p1</guid>
      <description>&lt;p&gt;AWS Glue is a serverless data integration service that you can use to perform Extract, Transform, and Load (ETL) jobs. It is often used to handle large datasets and you can use it with a large variety of data sources and formats. It can be used with data lake services like Amazon S3 , Amazon Redshift, DynamoDB, data pipelines and other data warehouses.&lt;/p&gt;

&lt;p&gt;The ETL process in AWS Glue is usually as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracting information from different data stores like relational databases, NoSQL databases or object stores like Amazon S3.&lt;/li&gt;
&lt;li&gt;Transforming the information by converting to required data formats or combining different data sets to compute new values&lt;/li&gt;
&lt;li&gt;Loading information into a new data store as required for the use case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this guide, we will be unifying multi-store order and product data with AWS Glue for analytics and marketing insights.&lt;/p&gt;

&lt;p&gt;Step 1: Log in to your AWS account and open the DynamoDB service. Then, find and click on the book_orders table. Once inside, you’ll be able to view all the records stored in the table, which look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxu8w0k7bigvmtv9y8g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxu8w0k7bigvmtv9y8g9.png" alt="Book Order Data stored in DynamoDB" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 2: Go to the S3 service in AWS and locate the relevant buckets. Open each bucket to explore and review the data stored inside.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghyt5mhnnby50cnb1ptv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghyt5mhnnby50cnb1ptv.png" alt="Product Data stored in AWS S3" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data in these files is stored in JSON format and looks like the following example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
  "book_id": 101,&lt;br&gt;
  "title": "Learn Python",&lt;br&gt;
  "author": "John Doe",&lt;br&gt;
  "weight_kg": 0.5,&lt;br&gt;
  "category": "Programming"&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Now let's create the destination DynamoDB Table&lt;/p&gt;

&lt;p&gt;This is where the transformed data will go.&lt;/p&gt;

&lt;p&gt;a. Create a new DynamoDB table: book_orders_transformed&lt;/p&gt;

&lt;p&gt;b. Partition key: order_id (Number)&lt;/p&gt;

&lt;p&gt;c. Leave the defaults and click Create table.&lt;/p&gt;

&lt;p&gt;Initially, it will have no items.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4j7yk80w4j12pc35qazp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4j7yk80w4j12pc35qazp.png" alt="Destination Dynamo DB table for transformed data" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 4: Now let us create an AWS Glue Job&lt;/p&gt;

&lt;p&gt;a. Open the AWS Glue Console&lt;br&gt;
In the AWS Console, let's go to the AWS Glue dashboard, click on ETL jobs, and create a new ETL job.&lt;/p&gt;

&lt;p&gt;b. Configure the Job&lt;br&gt;
Let us configure the job with the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name: book_orders_etl&lt;/li&gt;
&lt;li&gt;IAM role: select an existing role with S3 &amp;amp; DynamoDB permissions or create one.&lt;/li&gt;
&lt;li&gt;Type: Spark Python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 5&lt;/strong&gt;: Implementing the Glue ETL Script&lt;br&gt;
Here’s the Python Spark script for our data transformation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

# Accept parameters at runtime
args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_bucket'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Function to calculate total price and total weight
def calculate_totals(row):
    row["total_price"] = round(row["quantity"] * row["unit_price"], 2)
    row["total_weight_kg"] = round(row["quantity"] * row["weight_kg"], 2)
    return row

# Load product data from S3
input_path = f"s3://{args['s3_bucket']}/data/"
books_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [input_path]},
    format="json",
    recurse=True
)

# Load order data from DynamoDB
orders_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "book_orders",
        "dynamodb.throughput.read.percent": "1.0"
    }
)

# Join orders with product info
joined_frame = Join.apply(
    frame1=orders_frame,
    frame2=books_frame,
    keys1=['book_id'],
    keys2=['book_id']
)

# Convert DynamicFrame to DataFrame to apply map
df = joined_frame.toDF()

# Apply transformation
transformed_df = df.rdd.map(lambda row: calculate_totals(row.asDict())).toDF()

# Convert back to DynamicFrame
transformed_frame = DynamicFrame.fromDF(transformed_df, glueContext, "transformed_frame")

# Write transformed data back to DynamoDB
glueContext.write_dynamic_frame.from_options(
    frame=transformed_frame,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "book_orders_transformed",
        "dynamodb.throughput.write.percent": "1.0"
    }
)

job.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
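The calculate_totals transformation can be verified locally on one joined row, using values from the sample data above:

```python
# Local check of the totals computed per joined order row.
def calculate_totals(row):
    row["total_price"] = round(row["quantity"] * row["unit_price"], 2)
    row["total_weight_kg"] = round(row["quantity"] * row["weight_kg"], 2)
    return row

row = {"order_id": 1, "quantity": 2, "unit_price": 12.5, "weight_kg": 0.5}
result = calculate_totals(row)
print(result["total_price"], result["total_weight_kg"])  # 25.0 1.0
```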



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feh9nrk5o1o415xhdjjth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feh9nrk5o1o415xhdjjth.png" alt="AWS Glue Job ETL script" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6&lt;/strong&gt;: Run the Glue Job&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We will open our job and click the Run with parameters option&lt;/li&gt;
&lt;li&gt;Add parameter:
Key: --s3_bucket
Value: my-glue-books-data&lt;/li&gt;
&lt;li&gt;Click Run job&lt;/li&gt;
&lt;li&gt;We can then monitor the job in Run details and wait until the Status changes to Succeeded&lt;/li&gt;
&lt;li&gt;Now we can open the book_orders_transformed table in DynamoDB and verify that it contains our transformed data fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 7&lt;/strong&gt;: Validate Your ETL Output&lt;/p&gt;

&lt;p&gt;The DynamoDB table book_orders_transformed will now have our transformed data fields.&lt;/p&gt;

&lt;p&gt;Example records in the table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
  "order_id": 1,&lt;br&gt;
  "customer_name": "Alice",&lt;br&gt;
  "book_id": 101,&lt;br&gt;
  "quantity": 2,&lt;br&gt;
  "unit_price": 12.5,&lt;br&gt;
  "order_date": "2025-10-01",&lt;br&gt;
  "title": "Learn Python",&lt;br&gt;
  "author": "John Doe",&lt;br&gt;
  "weight_kg": 0.5,&lt;br&gt;
  "category": "Programming",&lt;br&gt;
  "total_price": 25.0,&lt;br&gt;
  "total_weight_kg": 1.0&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So in this guide we have successfully unified and transformed multi-source order and product data using AWS Glue. The ETL workflow efficiently extracted data from both DynamoDB and S3, applied necessary transformations including calculations for total price and total weight, and loaded the enriched dataset into a new DynamoDB table ready for analytics and marketing insights.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>etl</category>
      <category>awsglue</category>
      <category>dynamodb</category>
    </item>
    <item>
      <title>Just published a new blog on Blue-Green Deployment for Next.js apps using AWS EC2, CodeDeploy, and Auto Scaling! Learn how to achieve zero-downtime deployments with a scalable, production-grade CI/CD pipeline on AWS. Perfect for teams looking to level up</title>
      <dc:creator>Sushanta Paudel</dc:creator>
      <pubDate>Tue, 29 Jul 2025 06:27:19 +0000</pubDate>
      <link>https://forem.com/sushanta_paudel/just-published-a-new-blog-on-blue-green-deployment-for-nextjs-apps-using-aws-ec2-codedeploy-and-4adc</link>
      <guid>https://forem.com/sushanta_paudel/just-published-a-new-blog-on-blue-green-deployment-for-nextjs-apps-using-aws-ec2-codedeploy-and-4adc</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/aws-builders/blue-green-deployment-of-a-nextjs-app-on-ec2-using-aws-codedeploy-34ik" class="crayons-story__hidden-navigation-link"&gt;Blue-Green Deployment of a Next.js App on EC2 Using AWS CodeDeploy&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;
          &lt;a class="crayons-logo crayons-logo--l" href="/aws-builders"&gt;
            &lt;img alt="AWS Community Builders  logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png" class="crayons-logo__image"&gt;
          &lt;/a&gt;

          &lt;a href="/sushanta_paudel" class="crayons-avatar  crayons-avatar--s absolute -right-2 -bottom-2 border-solid border-2 border-base-inverted  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2934615%2F1ab33037-e8c3-4133-bf39-a40a3e5cbb12.jpg" alt="sushanta_paudel profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/sushanta_paudel" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Sushanta Paudel
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Sushanta Paudel
                
              
              &lt;div id="story-author-preview-content-2729937" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/sushanta_paudel" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2934615%2F1ab33037-e8c3-4133-bf39-a40a3e5cbb12.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Sushanta Paudel&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

            &lt;span&gt;
              &lt;span class="crayons-story__tertiary fw-normal"&gt; for &lt;/span&gt;&lt;a href="/aws-builders" class="crayons-story__secondary fw-medium"&gt;AWS Community Builders &lt;/a&gt;
            &lt;/span&gt;
          &lt;/div&gt;
          &lt;a href="https://dev.to/aws-builders/blue-green-deployment-of-a-nextjs-app-on-ec2-using-aws-codedeploy-34ik" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 29 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/aws-builders/blue-green-deployment-of-a-nextjs-app-on-ec2-using-aws-codedeploy-34ik" id="article-link-2729937"&gt;
          Blue-Green Deployment of a Next.js App on EC2 Using AWS CodeDeploy
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/aws"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;aws&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/cicd"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;cicd&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/nextjs"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;nextjs&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/aws-builders/blue-green-deployment-of-a-nextjs-app-on-ec2-using-aws-codedeploy-34ik" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/aws-builders/blue-green-deployment-of-a-nextjs-app-on-ec2-using-aws-codedeploy-34ik#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>aws</category>
      <category>devops</category>
      <category>cicd</category>
      <category>nextjs</category>
    </item>
    <item>
      <title>Blue-Green Deployment of a Next.js App on EC2 Using AWS CodeDeploy</title>
      <dc:creator>Sushanta Paudel</dc:creator>
      <pubDate>Tue, 29 Jul 2025 06:24:46 +0000</pubDate>
      <link>https://forem.com/aws-builders/blue-green-deployment-of-a-nextjs-app-on-ec2-using-aws-codedeploy-34ik</link>
      <guid>https://forem.com/aws-builders/blue-green-deployment-of-a-nextjs-app-on-ec2-using-aws-codedeploy-34ik</guid>
      <description>&lt;p&gt;In this blog, we'll walk through setting up a Blue-Green deployment strategy for a Next.js application hosted on Amazon EC2 instances.&lt;/p&gt;

&lt;p&gt;Blue-Green deployment lets us release application updates with zero downtime. We will use AWS services including EC2, Auto Scaling, Elastic Load Balancing, and CodeDeploy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Setup AMI Image for Auto-Scaling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In AWS Auto Scaling, an Amazon Machine Image (AMI) acts as the blueprint for launching new EC2 instances. Setting up a custom AMI ensures that every instance in the Auto Scaling group launches with the exact configuration needed (OS, applications, libraries, security settings, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Steps:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launch a Base EC2 Instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start an EC2 instance with a base AMI (like Amazon Linux or Ubuntu).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customize the Instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install required software, configure settings, apply security patches, and set up your application.&lt;/p&gt;

&lt;p&gt;Here, we need to make sure the CodeDeploy agent is installed and running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Update packages
sudo apt-get update -y

# Install dependencies
sudo apt-get install ruby-full wget unzip -y

# Download and install the CodeDeploy agent
cd /home/ubuntu
wget https://aws-codedeploy-&amp;lt;region&amp;gt;.s3.amazonaws.com/latest/install
chmod +x ./install
sudo ./install auto

# Start and enable the agent
sudo systemctl start codedeploy-agent
sudo systemctl enable codedeploy-agent

# Verify status
sudo systemctl status codedeploy-agent

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create the AMI&lt;/li&gt;
&lt;/ul&gt;
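
&lt;p&gt;As a rough sketch, the AMI can also be baked from the CLI once the instance is configured. The instance ID and image name below are placeholders, and the aws call is guarded so the snippet is safe to run without credentials:&lt;/p&gt;

```shell
# Bake an AMI from the configured base instance (IDs and names are placeholders)
IMAGE_NAME="nextjs-bluegreen-base-v1"

if command -v aws >/dev/null; then
  aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "$IMAGE_NAME" \
    --description "Ubuntu base with CodeDeploy agent and app runtime" \
    || echo "aws call not completed (credentials/permissions required)"
else
  echo "aws CLI not available; would create AMI named $IMAGE_NAME"
fi
```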

&lt;p&gt;&lt;strong&gt;Step 2: Use the AMI to create a Launch Template&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When creating a launch template or configuration for your Auto Scaling group, specify the custom AMI ID.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo371k3885o2a4ve613ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo371k3885o2a4ve613ey.png" alt="Creating a launch template" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We do this to make sure all the instances launched by Auto Scaling are identical, reliable, and ready to serve traffic immediately upon startup.&lt;/p&gt;
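
&lt;p&gt;The console steps above can be sketched with the CLI as well. The AMI ID, security group ID, and template name here are placeholders for your own values:&lt;/p&gt;

```shell
# Describe the instances the Auto Scaling group should launch
# (all IDs below are placeholders for your own AMI and security group)
printf '%s\n' \
  '{' \
  '  "ImageId": "ami-0123456789abcdef0",' \
  '  "InstanceType": "t3.small",' \
  '  "SecurityGroupIds": ["sg-0123456789abcdef0"]' \
  '}' > launch-template-data.json

# Sanity-check the JSON before handing it to the CLI
python3 -m json.tool launch-template-data.json

if command -v aws >/dev/null; then
  aws ec2 create-launch-template \
    --launch-template-name nextjs-bluegreen-lt \
    --launch-template-data file://launch-template-data.json \
    || echo "aws call not completed (credentials/permissions required)"
fi
```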

&lt;p&gt;&lt;strong&gt;Step 3: Create a Load balancer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a load balancer from the EC2 dashboard. Select Internet-facing as the scheme and add a listener on port 80 (HTTP) or 443 (HTTPS). Under Availability Zones, select your VPC and subnets across at least two Availability Zones. Then configure a security group that allows inbound traffic on the listener port (HTTP: 80).&lt;/p&gt;

&lt;p&gt;Create a new Target Group or select an existing one, choosing Instance or IP as the target type. Set the health check path to the app’s health check endpoint.&lt;/p&gt;
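
&lt;p&gt;A minimal CLI sketch of the target group setup; the /api/health path, names, and VPC ID are assumptions to adapt to your app:&lt;/p&gt;

```shell
# Create the target group the load balancer forwards to; the health check path
# /api/health is an assumption -- point it at whatever route your app exposes.
HEALTH_PATH="/api/health"

if command -v aws >/dev/null; then
  aws elbv2 create-target-group \
    --name nextjs-bluegreen-tg \
    --protocol HTTP \
    --port 3000 \
    --vpc-id vpc-0123456789abcdef0 \
    --target-type instance \
    --health-check-path "$HEALTH_PATH" \
    --health-check-interval-seconds 30 \
    --healthy-threshold-count 2 \
    || echo "aws call not completed (credentials/permissions required)"
else
  echo "aws CLI not available; target group would health-check $HEALTH_PATH"
fi
```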

&lt;p&gt;&lt;strong&gt;Step 4: Set up Auto Scaling Group with the launch template&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigate to Auto Scaling Groups in the EC2 dashboard and click Create Auto Scaling Group. Choose the appropriate VPC and subnets for the instances.&lt;/p&gt;

&lt;p&gt;Select the launch template we created earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9kcoy1winebclo14l3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9kcoy1winebclo14l3m.png" alt="Choose the launch template" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Attach the internet-facing Application Load Balancer we created earlier to the auto scaling group. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xi89mdoi5ocimy9a5u1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xi89mdoi5ocimy9a5u1.png" alt="Attach the load balancer to the Auto Scaling Group" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Define the scaling capacity as required; here, minimum: 2, desired: 2, and maximum: 3 instances. Finally, review all settings and click "Create Auto Scaling Group" to complete the setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cea0ov2c2dhjnsy9624.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cea0ov2c2dhjnsy9624.png" alt="Define the scaling parameters" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to make sure our source code repository includes a valid appspec.yml file in the root directory. This file tells CodeDeploy how to deploy your application and when to run specific scripts during the deployment lifecycle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 0.0
os: linux
files:
  - source: /
    destination: /home/ubuntu/app

hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh
      timeout: 60
  AfterInstall:
    - location: scripts/install_dependencies.sh
      timeout: 120
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 60

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
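
&lt;p&gt;The lifecycle scripts referenced by appspec.yml also live in the repository. A minimal sketch of scripts/start_server.sh, assuming the app is deployed to /home/ubuntu/app and managed with pm2 (both assumptions), could be:&lt;/p&gt;

```shell
#!/bin/bash
# scripts/start_server.sh -- ApplicationStart hook sketch.
# APP_DIR and the pm2 process name are assumptions; adjust to your setup.
set -e
APP_DIR="/home/ubuntu/app"

if [ -d "$APP_DIR" ]; then
  cd "$APP_DIR"
  npm ci --omit=dev          # install production dependencies
  npm run build              # build the Next.js app
  # Restart if the process already exists, otherwise start it fresh
  pm2 restart nextjs-app 2>/dev/null || pm2 start npm --name nextjs-app -- start
else
  echo "APP_DIR $APP_DIR not found; skipping (not running on a deployment instance)"
fi
```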



&lt;p&gt;&lt;strong&gt;Step 5: Create application and deployment group with CodeDeploy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigate to AWS CodeDeploy, go to Deploy &amp;gt; Applications, and click "Create Application". After creating the application, create a Deployment Group, filling in the required details such as the service role, deployment type, and environment configuration. The deployment type should be set to Blue/Green.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoq4623inyy6umccs2z5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoq4623inyy6umccs2z5.png" alt="Deployment Type" width="684" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the Load Balancer we created earlier and attached to the Auto Scaling group, then click Create Deployment Group to complete the setup.&lt;/p&gt;

&lt;p&gt;Make sure to select the correct deployment settings as per requirement. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7iv1mjm2z7auzyzj84il.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7iv1mjm2z7auzyzj84il.png" alt="Deployment Settings" width="684" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also select how long to keep the original instances running after a successful deployment; this wait time determines when the pipeline run completes. Here I have chosen to terminate the instances immediately.&lt;/p&gt;
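
&lt;p&gt;For reference, these settings map onto CodeDeploy's blue/green configuration. A hedged CLI sketch (the application name, deployment group name, and role ARN are placeholders; a real call also needs load balancer info):&lt;/p&gt;

```shell
# Blue/green settings: terminate old instances immediately after success,
# reroute traffic as soon as the replacement fleet is ready.
printf '%s\n' \
  '{' \
  '  "terminateBlueInstancesOnDeploymentSuccess": {' \
  '    "action": "TERMINATE",' \
  '    "terminationWaitTimeInMinutes": 0' \
  '  },' \
  '  "deploymentReadyOption": { "actionOnTimeout": "CONTINUE_DEPLOYMENT" },' \
  '  "greenFleetProvisioningOption": { "action": "COPY_AUTO_SCALING_GROUP" }' \
  '}' > bluegreen-config.json

# Validate the configuration JSON locally
python3 -m json.tool bluegreen-config.json

if command -v aws >/dev/null; then
  aws deploy create-deployment-group \
    --application-name nextjs-bluegreen \
    --deployment-group-name nextjs-bluegreen-dg \
    --service-role-arn arn:aws:iam::123456789012:role/CodeDeployServiceRole \
    --deployment-style deploymentType=BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL \
    --blue-green-deployment-configuration file://bluegreen-config.json \
    || echo "aws call not completed (credentials/permissions required)"
fi
```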

&lt;p&gt;&lt;strong&gt;Step 6: Integrate this CodeDeploy application to the CodePipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To integrate CodeDeploy with CodePipeline, navigate to AWS CodePipeline and create or edit a pipeline. In the Deploy stage, choose AWS CodeDeploy as the deployment provider and select the previously created CodeDeploy application and deployment group. This links the pipeline to the deployment process, triggering automated deployments to the EC2 instances whenever new changes are pushed to the source repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Blue/Green Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now we can monitor our blue/green deployment process in the CodeDeploy console. Blue/Green deployments allow quick rollback if something fails.&lt;/p&gt;

&lt;p&gt;In the first step, a replacement instance is provisioned.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxethibjp99sfqoxhq0md.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxethibjp99sfqoxhq0md.jpg" alt="Replacement Instance being created" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the second step, our application is installed on the newly launched replacement instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhk7v0jcbfkm1dajio5t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhk7v0jcbfkm1dajio5t.jpg" alt="Application being installed in the replacement instance" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, traffic to the original instance is blocked and rerouted to the replacement instance. We can track the progress of this in the CodeDeploy console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrs52as73j7idx8qo1y8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrs52as73j7idx8qo1y8.jpg" alt="Traffic being moved to the replacement instance" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, once traffic has shifted to the replacement instance, the original instance is terminated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhy91g83hronviristvqw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhy91g83hronviristvqw.jpg" alt="Deleting the original instance" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By combining CodeDeploy Blue-Green deployments with EC2, Auto Scaling, Elastic Load Balancing, and CodePipeline, we achieve seamless, zero-downtime deployments for our Next.js application, with high availability, easy rollbacks, and production-grade scalability on AWS.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>cicd</category>
      <category>nextjs</category>
    </item>
  </channel>
</rss>
