<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jessica Tiwari</title>
    <description>The latest articles on Forem by Jessica Tiwari (@jessica_tiwari_dec39541e2).</description>
    <link>https://forem.com/jessica_tiwari_dec39541e2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3692779%2F546a0cb5-76f1-4792-9192-23d310b8820c.png</url>
      <title>Forem: Jessica Tiwari</title>
      <link>https://forem.com/jessica_tiwari_dec39541e2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jessica_tiwari_dec39541e2"/>
    <language>en</language>
    <item>
      <title>Automating the EC2 Instance</title>
      <dc:creator>Jessica Tiwari</dc:creator>
      <pubDate>Thu, 12 Feb 2026 04:20:24 +0000</pubDate>
      <link>https://forem.com/jessica_tiwari_dec39541e2/automating-the-ec2-instance-abc</link>
      <guid>https://forem.com/jessica_tiwari_dec39541e2/automating-the-ec2-instance-abc</guid>
      <description>&lt;p&gt;This week, as a part of MS2V Technologies, we embarked on a small yet informative project focusing on automating the E2 web server using user data. The main objective was to launch an EC2 instance that would automatically install NGINX and display a simple message in the browser, eliminating the need for manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture Design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The architecture design was straightforward. The user would connect to the public IP through which the EC2 instance was configured. Once connected, NGINX would be installed, and our simple HTML page would load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Launching the EC2 Instance&lt;/strong&gt;&lt;br&gt;
To initiate the project, I launched an EC2 instance by selecting an appropriate AMI (Amazon Machine Image). Given that this was a basic project with minimal content, I opted for a t2.micro instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network and Security Setup&lt;/strong&gt;&lt;br&gt;
While configuring the network and security group, I allowed SSH access on port 22 only from my IP, and HTTP traffic on port 80 from anywhere. Restricting SSH keeps administrative access secure, while the open HTTP rule lets any browser load my web page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Data Script&lt;/strong&gt;&lt;br&gt;
The user data script created for this project was minimal, focusing on displaying a simple text message upon accessing the web server. &lt;/p&gt;
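
&lt;p&gt;The script itself isn't reproduced here, but a minimal sketch of the idea, assuming Amazon Linux and a hypothetical page message, could be generated in Python like this:&lt;/p&gt;

```python
# Sketch of the kind of user-data script described above; the package
# manager, paths, and page text are assumptions, not the author's originals.
def build_user_data(message: str) -> str:
    """Return a bash user-data script that installs NGINX on Amazon Linux
    and serves a simple message from the default index page."""
    lines = [
        "#!/bin/bash",
        "yum update -y",           # refresh Amazon Linux packages
        "yum install -y nginx",    # install the web server
        "systemctl enable nginx",  # start NGINX on every boot
        "systemctl start nginx",
        # Overwrite the default index page with our message.
        'echo "{}" | tee /usr/share/nginx/html/index.html'.format(message),
    ]
    return "\n".join(lines)

print(build_user_data("Hello from my automated EC2 instance!"))
```

&lt;p&gt;Pasting a script like this into the user-data field at launch is what lets the instance configure itself with no manual intervention.&lt;/p&gt;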

&lt;p&gt;&lt;strong&gt;Creating a Baked AMI&lt;/strong&gt;&lt;br&gt;
After successfully connecting to the instance and verifying that NGINX was operational, I created a baked AMI from the existing EC2 instance configuration. This process takes a snapshot of the instance's volumes, producing a new AMI that can be reused to launch future instances without manual setup.&lt;/p&gt;
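
&lt;p&gt;As a sketch, this step maps onto EC2's &lt;code&gt;CreateImage&lt;/code&gt; API; the instance ID and image name below are hypothetical, and the actual call would go through a boto3 EC2 client:&lt;/p&gt;

```python
# Hedged sketch of the baked-AMI step. With boto3 the call would be:
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.create_image(**ami_request("i-0123456789abcdef0"))
def ami_request(instance_id: str) -> dict:
    """Build the parameters for EC2 CreateImage, which snapshots the
    instance's volumes and registers a reusable AMI."""
    return {
        "InstanceId": instance_id,
        "Name": "nginx-baked-ami",                       # hypothetical name
        "Description": "NGINX pre-installed via user data",
        "NoReboot": True,  # snapshot without stopping the running instance
    }
```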

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Overall, this project was an excellent opportunity to understand the features of AMIs and the automation capabilities within AWS. The ability to launch instances with pre-installed software and configurations greatly simplifies the process, making rapid deployment of web servers both efficient and reliable.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudnative</category>
      <category>fellowship</category>
    </item>
    <item>
      <title>Building a Batch Data Pipeline on AWS</title>
      <dc:creator>Jessica Tiwari</dc:creator>
      <pubDate>Mon, 05 Jan 2026 15:19:22 +0000</pubDate>
      <link>https://forem.com/jessica_tiwari_dec39541e2/building-a-batch-data-pipeline-on-aws-2fkb</link>
      <guid>https://forem.com/jessica_tiwari_dec39541e2/building-a-batch-data-pipeline-on-aws-2fkb</guid>
      <description>&lt;p&gt;This is how I approached as a beginner. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define the Data Flow and Storage&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Created an S3-based data lake with three zones: raw for incoming data, processed for cleaned data, and curated for query-ready datasets.&lt;/li&gt;
&lt;li&gt;Enabled versioning on the raw bucket to preserve original data for reprocessing.&lt;/li&gt;
&lt;/ol&gt;
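
&lt;p&gt;The zone layout above can be sketched as a key-building helper; the dataset name and file names are illustrative assumptions, but the date-folder structure is what the partition discovery mentioned below relies on:&lt;/p&gt;

```python
from datetime import date

# Illustrative sketch of the three-zone data-lake layout; bucket-internal
# prefixes and the "orders" dataset name are assumptions.
ZONES = ("raw", "processed", "curated")

def object_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Build a date-partitioned S3 key such as raw/orders/2026/01/05/part-0.csv,
    so date folders can be discovered as partitions."""
    if zone not in ZONES:
        raise ValueError("unknown zone: " + zone)
    return "{}/{}/{:%Y/%m/%d}/{}".format(zone, dataset, day, filename)

print(object_key("raw", "orders", date(2026, 1, 5), "part-0.csv"))
```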

&lt;p&gt;&lt;strong&gt;Catalog and Schema&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Created a Glue Data Catalog database.&lt;/li&gt;
&lt;li&gt;Used Glue Crawlers to scan raw data and infer schemas.&lt;/li&gt;
&lt;li&gt;Enabled automatic partition discovery based on date folders.&lt;/li&gt;
&lt;li&gt;Scheduled crawlers to run after each data ingestion.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;ETL Transformation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Implemented AWS Glue Jobs using PySpark.&lt;/li&gt;
&lt;li&gt;Transformation steps: read raw CSV/JSON data from S3, standardize column names and data types, handle null and malformed records, and convert the data to Parquet with Snappy compression.&lt;/li&gt;
&lt;li&gt;Enabled job bookmarks to ensure incremental processing.&lt;/li&gt;
&lt;/ol&gt;
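
&lt;p&gt;The real job runs as Glue PySpark; this pure-Python sketch just mirrors the same cleaning steps on a list of dicts, with hypothetical column names:&lt;/p&gt;

```python
# Pure-Python mirror of the Glue cleaning steps; "order_id" and "amount"
# are hypothetical column names, not the author's actual schema.
def clean_records(rows):
    """Standardize column names, cast types, and drop malformed records."""
    cleaned = []
    for row in rows:
        # Standardize column names: lower-case with underscores.
        row = {k.strip().lower().replace(" ", "_"): v for k, v in row.items()}
        # Drop records missing the required key.
        if not row.get("order_id"):
            continue
        # Handle nulls and cast amount to float; skip malformed values.
        try:
            row["amount"] = float(row.get("amount") or 0.0)
        except (TypeError, ValueError):
            continue
        cleaned.append(row)
    return cleaned

rows = [
    {"Order ID": "A1", "Amount": "10.5"},
    {"Order ID": "", "Amount": "3"},       # malformed: empty id
    {"Order ID": "A2", "Amount": "oops"},  # malformed: bad amount
]
print(clean_records(rows))
```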

&lt;p&gt;&lt;strong&gt;Query and Validation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configured Amazon Athena to use the Glue Data Catalog.&lt;/li&gt;
&lt;li&gt;Ran validation queries on processed and curated datasets.&lt;/li&gt;
&lt;li&gt;Used partition filters to minimize scanned data and reduce cost.&lt;/li&gt;
&lt;li&gt;Verified record counts and schema consistency.&lt;/li&gt;
&lt;/ol&gt;
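
&lt;p&gt;A validation query with a partition filter might be built like this; the table and partition column names are assumptions, but the point is that the predicate restricts how much S3 data Athena scans:&lt;/p&gt;

```python
# Hypothetical sketch of a partition-filtered Athena validation query;
# "curated.orders" and the year/month partition columns are assumptions.
def validation_query(table: str, year: int, month: int) -> str:
    """Count records for a single partition so the query scans only that
    slice of the data instead of the whole table."""
    return (
        "SELECT COUNT(*) AS record_count "
        "FROM {} "
        "WHERE year = '{}' AND month = '{:02d}'"
    ).format(table, year, month)

print(validation_query("curated.orders", 2026, 1))
```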

&lt;p&gt;&lt;strong&gt;Automation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Triggered Glue Jobs using EventBridge schedules.&lt;/li&gt;
&lt;li&gt;Monitored job execution and failures via CloudWatch.&lt;/li&gt;
&lt;li&gt;Configured SNS alerts for ETL failures.&lt;/li&gt;
&lt;li&gt;Archived older raw data to lower-cost S3 storage classes.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>How I implemented ETL Pipeline Using AWS Glue</title>
      <dc:creator>Jessica Tiwari</dc:creator>
      <pubDate>Sun, 04 Jan 2026 15:22:24 +0000</pubDate>
      <link>https://forem.com/jessica_tiwari_dec39541e2/how-i-implemented-etl-pipeline-using-aws-glue-nh6</link>
      <guid>https://forem.com/jessica_tiwari_dec39541e2/how-i-implemented-etl-pipeline-using-aws-glue-nh6</guid>
      <description>&lt;p&gt;&lt;strong&gt;- Step 1: I considered:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spark on EC2 (high control, high ops)&lt;/li&gt;
&lt;li&gt;Databricks &lt;/li&gt;
&lt;li&gt;AWS Glue
I selected AWS Glue to minimize operational complexity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Ingestion Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data lands in raw/&lt;/li&gt;
&lt;li&gt;Glue Crawlers detect schema changes&lt;/li&gt;
&lt;li&gt;Catalog updated automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Transformation Logic&lt;/strong&gt;&lt;br&gt;
Glue Jobs perform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Type casting&lt;/li&gt;
&lt;li&gt;Null handling&lt;/li&gt;
&lt;li&gt;Deduplication&lt;/li&gt;
&lt;li&gt;Format conversion (CSV → Parquet)&lt;/li&gt;
&lt;/ol&gt;
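
&lt;p&gt;The job itself runs in Glue PySpark, but the deduplication step can be sketched in plain Python; the &lt;code&gt;order_id&lt;/code&gt; key is a hypothetical example:&lt;/p&gt;

```python
# Plain-Python sketch of the deduplication step, analogous to calling
# dropDuplicates() on a Spark DataFrame. The key name is hypothetical.
def deduplicate(rows, key="order_id"):
    """Keep the first record seen for each key, dropping later duplicates."""
    seen = set()
    unique = []
    for row in rows:
        k = row.get(key)
        if k in seen:
            continue
        seen.add(k)
        unique.append(row)
    return unique

rows = [{"order_id": "A1"}, {"order_id": "A2"}, {"order_id": "A1"}]
print(deduplicate(rows))
```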

&lt;p&gt;&lt;strong&gt;Step 4: Performance Optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enabled job bookmarks&lt;/li&gt;
&lt;li&gt;Tuned DPUs&lt;/li&gt;
&lt;li&gt;Used Parquet + Snappy compression&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Output Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Processed data written to S3&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>awsdatalake</category>
      <category>awsglue</category>
    </item>
  </channel>
</rss>
