<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sumbul Naqvi</title>
    <description>The latest articles on Forem by Sumbul Naqvi (@sumbul_naqvi).</description>
    <link>https://forem.com/sumbul_naqvi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2988577%2Fb1bfe99b-445d-4585-ac7c-c7755f5691f9.png</url>
      <title>Forem: Sumbul Naqvi</title>
      <link>https://forem.com/sumbul_naqvi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sumbul_naqvi"/>
    <language>en</language>
    <item>
      <title>Story of AWS Storage - Understanding by asking and answering!</title>
      <dc:creator>Sumbul Naqvi</dc:creator>
      <pubDate>Sat, 04 Oct 2025 08:35:52 +0000</pubDate>
      <link>https://forem.com/sumbul_naqvi/story-of-aws-storage-understanding-by-asking-and-answering-1jm7</link>
      <guid>https://forem.com/sumbul_naqvi/story-of-aws-storage-understanding-by-asking-and-answering-1jm7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qarfk0u1l7365mrjbf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qarfk0u1l7365mrjbf5.png" alt=" " width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two main members/types of storage in the AWS Storage family, &lt;strong&gt;Uncle (block)&lt;/strong&gt; and &lt;strong&gt;Aunty (object)&lt;/strong&gt;, who store information depending upon their nature and temperament. And you need to keep their nature in mind while trying to retrieve information from them. 😉&lt;/p&gt;

&lt;p&gt; If you give any information (say a 1 GB file) to &lt;strong&gt;uncle/block&lt;/strong&gt;, he is an easygoing guy. The file is split into fixed-size chunks of data (blocks) and then stored. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aunty/Object&lt;/strong&gt; storage, on the other hand, treats each file like a single unit of data. This might seem like a small difference, but it can change how you access and work with your data.&lt;/p&gt;

&lt;p&gt;Let's say I want to change one character out of that 1 GB file. If my file is stored in block storage, changing that one character is simple, because we can change just the block (the piece of the file that the character is in) and leave the rest of the file alone. &lt;strong&gt;Uncle&lt;/strong&gt; is carefree 😊&lt;br&gt;
 Whereas, in object storage, if I want to change that one character, I have to update the entire file instead. &lt;strong&gt;Aunty&lt;/strong&gt; mentality, you know: everything is treated as an object. Hundreds of questions asked, and change is not easy. 😎&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Therefore, for static data like the employee photos, which will most likely be accessed often but modified rarely, storing in object storage is fine.&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;For more frequently updated data, or data that has high transaction rates, like our application or system files, block storage will perform better.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Jokes apart, as I move forward I will give a detailed view of different storage concepts in AWS, using my standard way of asking and answering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's try understanding the storage solutions offered by AWS by asking and answering a few questions:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 1-&lt;/strong&gt; How do you choose the right storage service for your needs?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; It depends upon what we want to store, how we want to store it, and how often we want to access it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 2-&lt;/strong&gt; Into how many categories can AWS storage services be grouped?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; AWS storage services are grouped into three categories: file storage, block storage, and object storage. &lt;br&gt;
 In file storage (EFS), data is stored as files in a hierarchy. In block storage (EBS), data is stored in fixed-size blocks. And in object storage (S3), data is stored as objects in buckets.&lt;/p&gt;
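As a rough sketch, the decision from Que 1 and Que 2 can be expressed as a small rule of thumb. The function name and arguments here are hypothetical, purely for illustration, not any AWS API:

```python
# Illustrative rule of thumb for picking an AWS storage category, based on
# what we store, how we store it, and how often we access it.
def pick_storage_category(shared_file_tree, single_instance_low_latency):
    """Return a rough category name; real decisions weigh cost and access patterns too."""
    if shared_file_tree:
        return "file storage (EFS)"   # hierarchy of files shared by many instances
    if single_instance_low_latency:
        return "block storage (EBS)"  # fixed-size blocks attached to one EC2 instance
    return "object storage (S3)"      # whole objects in buckets, accessed over an API

print(pick_storage_category(False, True))  # → block storage (EBS)
```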

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv4speib24k419z5stfm.png" alt=" " width="800" height="216"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 3-&lt;/strong&gt; What is the main difference between file storage and object storage?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; In object storage, files are stored as objects. Objects, much like files, are treated as a single, distinct unit of data when stored. However, unlike file storage, these objects are stored in a bucket using a flat structure, meaning there are no folders, directories, or complex hierarchies. Each object contains a unique identifier. This identifier, along with any additional metadata, is bundled with the data and stored.&lt;/p&gt;
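The flat, key-addressed model described above can be sketched in a few lines with a toy in-memory "bucket". The names are hypothetical, not the S3 API:

```python
# Toy sketch of object storage: a bucket is a flat mapping from a unique key
# to data plus metadata. "Folders" are just key prefixes, not real directories.
bucket = {}

def put_object(key, data, metadata=None):
    # The unique identifier (key), the data, and any metadata are stored together.
    bucket[key] = {"data": data, "metadata": metadata or {}}

put_object("photos/employees/anita.png", b"...bytes...", {"content-type": "image/png"})
put_object("photos/logo.png", b"...bytes...")

# Listing by prefix only *looks* hierarchical; it is really a string filter.
employee_photos = [k for k in bucket if k.startswith("photos/employees/")]
print(employee_photos)  # → ['photos/employees/anita.png']
```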




&lt;p&gt;&lt;strong&gt;Que 4-&lt;/strong&gt; Explain Amazon Elastic File System.&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Amazon Elastic File System (Amazon EFS) is a file system that automatically grows and shrinks as you add and remove files. There is no need for provisioning or managing storage capacity and performance. Amazon EFS can be used with any number of AWS compute services and on-premises resources, and can provide consistent performance to each compute instance. EFS works with EC2 instances across multiple Availability Zones.&lt;br&gt;
It is a managed service, designed to provide serverless, fully elastic file storage that lets you share file data without provisioning or managing storage capacity and performance. It is the only cloud-native shared file system with fully automatic lifecycle management.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 5-&lt;/strong&gt; When should I use Amazon EFS vs. Amazon Elastic Block Store (Amazon EBS) vs. Amazon S3?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; &lt;br&gt;
 &lt;strong&gt;EFS&lt;/strong&gt; is a file storage service for use with Amazon compute (EC2, containers, serverless) and on-premises servers. EFS provides a file system interface, file system access semantics (such as strong consistency and file locking), and concurrently accessible storage for up to thousands of EC2 instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon EBS&lt;/strong&gt; is a block-level storage service for use with EC2. EBS can deliver performance for workloads that require the lowest-latency access to data from a single EC2 instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt; is an object storage service. S3 makes data available through an internet API that can be accessed anywhere.&lt;br&gt;
 Note: Unlike Amazon EBS, Amazon Simple Storage Service (Amazon S3) is a standalone storage solution that isn't tied to compute.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdypdgxmuih6q4j685qnl.png" alt=" " width="800" height="245"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 6-&lt;/strong&gt; How does pricing work in Amazon S3?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; As far as S3 pricing is concerned, you pay for all bandwidth into and out of Amazon S3, except for the following:&lt;br&gt;
 • Data transferred out to the internet for the first 100 GB per month, aggregated across all AWS services and Regions (except China and GovCloud).&lt;br&gt;
 • Data transferred in from the internet.&lt;br&gt;
 • Data transferred between S3 buckets in the same AWS Region.&lt;br&gt;
 • Data transferred from an Amazon S3 bucket to any AWS service(s) within the same AWS Region as the S3 bucket (including to a different account in the same AWS Region).&lt;br&gt;
 • Data transferred out to Amazon CloudFront.&lt;/p&gt;
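As a hedged sketch, the free-transfer cases above can be written as a small decision function. This is illustrative only; real S3 billing has more dimensions, such as the 100 GB/month free egress tier, which is not modelled here:

```python
# Illustrative check of whether an S3 data transfer falls in a free category
# per the bullet list above (not an official price calculator).
def transfer_is_free(direction, destination, same_region=False):
    if direction == "in":
        return True   # data transferred in from the internet is free
    if destination == "cloudfront":
        return True   # data transferred out to Amazon CloudFront is free
    if destination in ("s3", "aws-service") and same_region:
        return True   # same-Region bucket-to-bucket and bucket-to-service transfers
    return False      # otherwise billable (beyond any monthly free tier)

print(transfer_is_free("out", "internet"))               # → False
print(transfer_is_free("out", "s3", same_region=True))   # → True
```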




&lt;p&gt;&lt;strong&gt;Que 7-&lt;/strong&gt; Compare the cost of the basic storage options available.&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50bz8nzyeyit1yzp1bwe.png" alt=" " width="800" height="325"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 8-&lt;/strong&gt; What are a few performance parameters with respect to storage?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq78ammua1t2tpbbuklw.png" alt=" " width="800" height="167"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 9-&lt;/strong&gt; Explain availability and durability with respect to storage.&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Durability refers to how safe data is from being lost, while availability refers to how often you can access your stored data.&lt;/p&gt;
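Some quick arithmetic makes these two definitions concrete:

```python
# 99.99 percent availability allows roughly 52.6 minutes of downtime per year,
# while eleven nines of durability means an expected loss of about one object
# per ten billion stored objects per year.
availability = 0.9999
downtime_minutes_per_year = (1 - availability) * 365 * 24 * 60

durability = 0.99999999999                             # the famous "11 nines"
expected_losses = (1 - durability) * 10_000_000_000    # out of 10 billion objects

print(round(downtime_minutes_per_year, 1))   # → 52.6
print(round(expected_losses, 1))             # → 0.1
```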

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30hfktb2bn4a1x8in62p.png" alt=" " width="800" height="273"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 10-&lt;/strong&gt; What are a few use cases of the above?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; A few use cases of EBS, S3, and EFS are as follows:&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfifhgs4j4fpwlecutt9.png" alt=" " width="800" height="310"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 11-&lt;/strong&gt; What makes AWS S3 the most sought-after choice for storage?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Some of the benefits of AWS S3 are:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Durability:&lt;/em&gt;&lt;/strong&gt; S3 provides 99.999999999 percent durability.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Low cost:&lt;/em&gt;&lt;/strong&gt; S3 lets you store data in a range of "storage classes." These classes are based on the frequency and immediacy you require in accessing files.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Scalability:&lt;/em&gt;&lt;/strong&gt; S3 charges you only for what resources you actually use, and there are no hidden fees or overage charges. You can scale your storage resources to easily meet your organization's ever-changing demands.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Availability:&lt;/em&gt;&lt;/strong&gt; S3 offers 99.99 percent availability of objects.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Security:&lt;/em&gt;&lt;/strong&gt; S3 offers an impressive range of access management tools and encryption features that provide top-notch security.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Flexibility:&lt;/em&gt;&lt;/strong&gt; S3 is ideal for a wide range of uses like data storage, data backup, software delivery, data archiving, disaster recovery, website hosting, mobile applications, IoT devices, and much more.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Simple data transfer:&lt;/em&gt;&lt;/strong&gt; You don't have to be an IT genius to execute data transfers on S3. The service revolves around simplicity and ease of use.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 12-&lt;/strong&gt; Differentiate between S3 buckets and objects.&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; An object consists of data, a key (assigned name), and metadata. A bucket is used to store objects. When versioning is enabled and data is added to a bucket, Amazon S3 creates a unique version ID and allocates it to the object.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 13-&lt;/strong&gt; What are the different storage classes offered by Amazon S3?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Amazon S3 offers a range of storage classes designed for different use cases:&lt;br&gt;
&lt;strong&gt;S3 Standard&lt;/strong&gt; for general-purpose storage of frequently accessed data&lt;br&gt;
&lt;strong&gt;S3 Intelligent-Tiering&lt;/strong&gt; for data with unknown or changing access patterns&lt;br&gt;
&lt;strong&gt;S3 Standard-Infrequent Access (S3 Standard-IA)&lt;/strong&gt; and &lt;strong&gt;S3 One Zone-Infrequent Access (S3 One Zone-IA)&lt;/strong&gt; for long-lived but less frequently accessed data; One Zone-IA stores the data in a single Availability Zone rather than across multiple zones.&lt;br&gt;
&lt;strong&gt;S3 Glacier Instant Retrieval&lt;/strong&gt; for long-lived data that is rarely accessed and requires retrieval in milliseconds.&lt;br&gt;
&lt;strong&gt;S3 Glacier Flexible Retrieval (formerly S3 Glacier)&lt;/strong&gt; for archive data that is accessed 1–2 times per year and is retrieved asynchronously.&lt;br&gt;
&lt;strong&gt;S3 Glacier Deep Archive&lt;/strong&gt; for long-term archive and digital preservation.&lt;/p&gt;
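As a rough rule of thumb, the list above can be turned into a selector over the S3 API's storage class identifiers. The access-frequency thresholds below are made up for illustration; real choices also weigh retrieval fees, minimum storage durations, and object sizes:

```python
# Rough access-pattern-to-storage-class mapping (illustrative thresholds).
# The return values are actual S3 StorageClass identifiers.
def suggest_storage_class(accesses_per_year, retrieval_must_be_instant=True):
    if accesses_per_year >= 52:       # roughly weekly or more often
        return "STANDARD"
    if accesses_per_year >= 12:       # monthly-ish, long-lived data
        return "STANDARD_IA"
    if retrieval_must_be_instant:
        return "GLACIER_IR"           # rare access, millisecond retrieval
    return "DEEP_ARCHIVE"             # long-term archive, asynchronous retrieval

print(suggest_storage_class(2, retrieval_must_be_instant=False))  # → DEEP_ARCHIVE
```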

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8j2auitk5ofh69oen1s.png" alt=" " width="800" height="356"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 14-&lt;/strong&gt; What is an Amazon EFS Access Point?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; An EFS Access Point is a network endpoint that users and applications can use to access an EFS file system and enforce file- and folder-level permissions (POSIX) based on fine-grained access control and policy-based permissions defined in IAM.&lt;br&gt;
 When you create an Amazon EFS Access Point, you can configure an operating system user and group, and a root directory for all connections that use it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 15-&lt;/strong&gt; What is Amazon's solution for Windows and other popular file systems?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Amazon FSx is a fully managed service that offers reliability, security, scalability, and a broad set of capabilities that make it convenient and cost effective to launch, run, and scale high-performance file systems in the cloud. &lt;br&gt;
With Amazon FSx, you can choose between four widely used file systems: Lustre, NetApp ONTAP, OpenZFS, and Windows File Server. You can choose based on your familiarity with a file system or based on your workload requirements for feature sets, performance profiles, and data management capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmf19rs3f8v96t0289ta.png" alt=" " width="800" height="286"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 16-&lt;/strong&gt; Briefly explain the services in the Amazon FSx data storage family.&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; The &lt;strong&gt;Amazon FSx&lt;/strong&gt; data storage family comes in four options that align to popular file systems:&lt;br&gt;
· Amazon FSx for Windows File Server provides fully managed Microsoft Windows file servers, backed by a fully native Windows file system.&lt;br&gt;
· Amazon FSx for Lustre allows you to launch and run the high-performance Lustre file system.&lt;br&gt;
· Amazon FSx for OpenZFS is a fully managed file storage service that enables you to move data to AWS from on-premises ZFS or other Linux-based file servers.&lt;br&gt;
· Amazon FSx for NetApp ONTAP is a fully managed service that provides highly reliable, scalable, high-performing, and feature-rich file storage built on NetApp's popular ONTAP file system.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 17-&lt;/strong&gt; What is AWS Storage Gateway?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; AWS Storage Gateway is like a bridge between on-premises data and cloud data.&lt;br&gt;
 Storage Gateway comes in the following types:&lt;br&gt;
 • S3 File Gateway&lt;br&gt;
 • FSx File Gateway&lt;br&gt;
 • Volume Gateway&lt;br&gt;
 • Tape Gateway&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3nk504xruacstp6b312.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3nk504xruacstp6b312.png" alt=" " width="800" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5i721t5szthm5anfcs4.png" alt=" " width="800" height="434"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 18-&lt;/strong&gt; What is Amazon EC2 instance store ?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Amazon Elastic Compute Cloud (Amazon EC2) instance store provides temporary block-level storage for an instance. This storage is located on disks that are physically attached to the host computer. This ties the lifecycle of the data to the lifecycle of the EC2 instance. If you delete the instance, the instance store is also deleted. Because of this, instance store is considered ephemeral block storage. This is preconfigured storage that exists on the same physical server that hosts the EC2 instance and cannot be detached from Amazon EC2. You can think of it as a built-in drive or RAM for your EC2 instance.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 19-&lt;/strong&gt; What is the difference between EBS and the instance store?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; We use the instance store for temporary storage. Data that's stored in instance store volumes isn't persistent through instance stops, terminations, or hardware failures. For data that you want to retain longer, or if you want to encrypt the data, use Amazon EBS volumes instead.&lt;br&gt;
 Thus, we can say EBS is like the hard drive and the instance store is like the RAM of our system.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 20-&lt;/strong&gt; How much data can I store in Amazon S3?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; The total volume of data and number of objects you can store in Amazon S3 are unlimited. Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. The largest object that can be uploaded in a single PUT is 5 GB. For objects larger than 100 MB, customers should consider using the multipart upload capability.&lt;/p&gt;
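The limits quoted above can be encoded as a small sanity-check helper. This is illustrative only, and treats GB/TB as binary units for simplicity:

```python
# Sanity-check helper for the S3 limits quoted above: 5 TB max object size,
# 5 GB max single PUT, multipart upload recommended beyond 100 MB.
GB = 1024 ** 3
TB = 1024 ** 4

def upload_strategy(size_bytes):
    if size_bytes > 5 * TB:
        return "too large for a single S3 object"
    if size_bytes > 5 * GB:
        return "multipart upload required"
    if size_bytes > 100 * 1024 ** 2:
        return "multipart upload recommended"
    return "single PUT is fine"

print(upload_strategy(200 * 1024 ** 2))  # → multipart upload recommended
```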




&lt;p&gt;&lt;strong&gt;Que 21-&lt;/strong&gt; What is S3 Transfer Acceleration?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Transfer Acceleration is designed for specific use cases, to optimize transfer speed from across the world into S3 buckets. Transfer Acceleration takes advantage of the globally distributed edge locations in Amazon CloudFront. As the data arrives at an edge location, it is routed to Amazon S3 over an optimized network path. We enable Transfer Acceleration at the bucket level. This creates fast, easy, and secure transfers of files over long distances between your client and your Amazon S3 bucket.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 22-&lt;/strong&gt; What is the difference between AWS Global Accelerator and S3 Transfer Acceleration?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; S3 Transfer Acceleration helps speed up long-distance object transfers between S3 buckets, while Global Accelerator helps manage traffic across multiple AWS regions. AWS Global Accelerator is a networking service that helps you improve the availability, performance, and security of your public applications.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 23-&lt;/strong&gt; What are S3 lifecycle policies?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Lifecycle policies help manage and automate the life of your objects within S3, so you don't leave data sitting around unnecessarily. They make it possible to move data that must be retained to cheaper storage options, while at the same time adopting the additional access controls of Glacier.&lt;/p&gt;
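An example of what such a lifecycle configuration looks like, in the shape Amazon S3's lifecycle API expects. The rule ID, prefix, and day counts here are made up for illustration:

```python
# A lifecycle configuration in the structure S3's
# PutBucketLifecycleConfiguration API accepts: transition objects under a
# prefix to cheaper classes over time, then expire (delete) them.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},   # delete after a year
        }
    ]
}
# With boto3 this would be applied via (bucket name is hypothetical):
# s3.put_bucket_lifecycle_configuration(Bucket="my-bucket",
#                                       LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["Status"])  # → Enabled
```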




&lt;p&gt;&lt;strong&gt;Que 24-&lt;/strong&gt; What is multipart upload in S3?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order.&lt;br&gt;
This is useful if you have files larger than 5 GB, which is currently the maximum size for a single PUT operation in S3. Multipart uploads also offer enhanced throughput and the ability to stop and resume uploads if there are any network issues.&lt;/p&gt;
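A quick sketch of the part-planning arithmetic: S3 caps a multipart upload at 10,000 parts, and the 64 MB part size below is an arbitrary illustrative choice:

```python
# Plan how a large file splits into multipart-upload parts: each part is a
# contiguous portion of the object, and S3 allows at most 10,000 parts.
import math

MB = 1024 ** 2

def plan_parts(size_bytes, part_size=64 * MB):
    n_parts = math.ceil(size_bytes / part_size)
    if n_parts > 10_000:
        raise ValueError("increase part_size: S3 caps uploads at 10,000 parts")
    last_part = size_bytes - (n_parts - 1) * part_size  # final (possibly short) part
    return n_parts, last_part

print(plan_parts(6 * 1024 ** 3))  # 6 GB file → (96, 67108864)
```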




&lt;p&gt;&lt;strong&gt;Que 25-&lt;/strong&gt; What is difference between AWS Global Accelerator and CloudFront?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; CloudFront uses edge locations to cache content, while Global Accelerator uses edge locations to find an optimal pathway to the nearest regional endpoint. CloudFront is designed to handle the HTTP protocol, while Global Accelerator is best used for both HTTP and non-HTTP protocols such as TCP and UDP.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsgr14hm1zenhdtcsm0b.png" alt=" " width="800" height="285"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 26-&lt;/strong&gt; Explain the concept of AWS DataSync.&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; DataSync is used to move large amounts of data to and from:&lt;br&gt;
 • On-premises / other clouds to AWS (NFS, SMB, HDFS, S3 API…) - needs an agent.&lt;br&gt;
 • AWS to AWS (different storage services) - no agent needed.&lt;br&gt;
 Replication tasks can be scheduled hourly, daily, or weekly, and file permissions and metadata are preserved (NFS POSIX, SMB…). One agent task can use up to 10 Gbps, and you can set up a bandwidth limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqg08i1vfjjzqm3kctr2.png" alt=" " width="800" height="340"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Que 27-&lt;/strong&gt; What is the Snow Family in AWS?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; The AWS Snow Family is a service that helps customers who need to run operations in austere, non-data center environments, and in locations where there's no consistent network connectivity.&lt;br&gt;
 The Snow Family (comprising AWS Snowcone, Snowball, and AWS Snowmobile) offers a number of physical devices and capacity profiles, most with built-in computing capabilities.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Que 28-&lt;/strong&gt; How do I choose between the Snow Family and other AWS data migration services?&lt;br&gt;
&lt;strong&gt;Ans-&lt;/strong&gt; When there are no network capacity restrictions, the services offered by AWS Storage Gateway and AWS Direct Connect are appropriate choices. When network capacity is limited or unavailable, the Snow Family offers suitable options for effective data transfer. In short, AWS provides a variety of solutions to help you move data via networks, roads, and technology partners.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1k7dpv4gyy6v05qr0o5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1k7dpv4gyy6v05qr0o5.png" alt=" " width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgeaicpfbkdff9pl53u5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgeaicpfbkdff9pl53u5.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdavzx1pb7h3it3r9o9zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdavzx1pb7h3it3r9o9zz.png" alt=" " width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r8ydplswgr91qcvcb29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r8ydplswgr91qcvcb29.png" alt=" " width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope this article helps in adding to your knowledge and understanding of AWS storage.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br&gt;
Sumbul&lt;/p&gt;

</description>
      <category>aws</category>
      <category>storage</category>
      <category>s3</category>
      <category>ebs</category>
    </item>
    <item>
      <title>Amazon SageMaker Built-in Models &amp; all about ML Lifecycle in action- Mini Project.</title>
      <dc:creator>Sumbul Naqvi</dc:creator>
      <pubDate>Fri, 19 Sep 2025 20:19:31 +0000</pubDate>
      <link>https://forem.com/sumbul_naqvi/amazon-sagemaker-built-in-models-all-about-ml-lifecycle-in-action-mini-project-4dlp</link>
      <guid>https://forem.com/sumbul_naqvi/amazon-sagemaker-built-in-models-all-about-ml-lifecycle-in-action-mini-project-4dlp</guid>
      <description>&lt;p&gt;&lt;strong&gt;Machine learning or ML&lt;/strong&gt; can be complex. We can breakdown this complexity to an extent by thinking of ML as a lifecycle. Like a baby growing into infant to adolecent and finally a mature adult .&lt;br&gt;
This growth of ML problem from our aim or goal to a functional solution structured around lifecycle is an iterative process, complete with best practices to help our machine learning projects succeed. &lt;/p&gt;

&lt;p&gt;💡 ML lifecycle includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Business goal identification&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ML problem formulating&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data processing (collection, preprocessing, feature engineering)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model/Solution development (training, tuning, evaluation)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model deployment (inference, prediction)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model monitoring&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hq7w5ltwj5mjxpak5n6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3hq7w5ltwj5mjxpak5n6.png" alt=" " width="800" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of these steps, such as data processing, deployment, and monitoring, are iterative processes that you might cycle through as an ML engineer, often in the same machine learning project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lets throw some light into these phases of ML development.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✔️ &lt;strong&gt;Business goal identification&lt;/strong&gt;&lt;br&gt;
We should have a clear idea of the problem for which we are considering an ML solution, plus the measurable business value we aim to gain by solving that problem. &lt;/p&gt;

&lt;p&gt;✔️ &lt;strong&gt;ML problem framing&lt;/strong&gt;&lt;br&gt;
In this phase, the business problem is framed as a machine learning problem: given a set of data or observations, what should be predicted (known as the label or target variable)? &lt;br&gt;
Determining what to predict and how performance and error metrics must be optimized is a key step in this phase.&lt;/p&gt;

&lt;p&gt;✔️ &lt;strong&gt;Data processing&lt;/strong&gt; is foundational to training an accurate ML model. Here we explore the specifics of collecting, preparing, and engineering features from raw data, followed by &lt;strong&gt;EDA&lt;/strong&gt; (Exploratory Data Analysis). The main aim here is to convert data into a usable format.&lt;br&gt;
&lt;strong&gt;Data Collection:&lt;/strong&gt; Collected data should be relevant, diverse, and sufficient.&lt;br&gt;
&lt;strong&gt;Data Cleaning:&lt;/strong&gt; Address issues such as missing values, outliers, and inconsistencies in the data.&lt;br&gt;
&lt;strong&gt;Data Preprocessing:&lt;/strong&gt; Standardize formats, scale values, and encode categorical variables for consistency.&lt;br&gt;
&lt;strong&gt;EDA:&lt;/strong&gt; EDA is used to uncover insights and understand the dataset's structure. During EDA, patterns, trends, and insights are surfaced using statistical and visual tools.&lt;/p&gt;
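A tiny plain-Python illustration of the preprocessing steps above, scaling numeric values and one-hot encoding a categorical column. The toy data is made up:

```python
# Min-max scale a numeric column to the 0-to-1 range and one-hot encode a
# categorical column, two of the standard preprocessing steps.
ages = [20, 35, 50]
cities = ["Delhi", "Mumbai", "Delhi"]

lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]          # min-max scaling

categories = sorted(set(cities))                       # stable category order
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cities]

print(scaled)   # → [0.0, 0.5, 1.0]
print(one_hot)  # → [[1, 0], [0, 1], [1, 0]]
```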

&lt;p&gt;✔️ &lt;strong&gt;Model development&lt;/strong&gt; includes model building, training, tuning and evaluation. Our aim is developing a model that performs a specific task or solves a particular problem.&lt;/p&gt;

&lt;p&gt;✔️ &lt;strong&gt;Deploying trained models&lt;/strong&gt; into production for making predictions and inferences involves choosing among deployment strategies on AWS. These strategies help with integrating models into production environments.&lt;/p&gt;

&lt;p&gt;✔️ Finally, &lt;strong&gt;Monitoring models&lt;/strong&gt; after they are deployed into the production environment is an important task for implementing robust monitoring systems, facilitating early detection of deviations, and supporting timely mitigation measures.&lt;/p&gt;
&lt;h2&gt;
  
  
  Well-Architected machine learning design principles to facilitate good design in the cloud
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31lu3atdgqvssb5nm8wo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31lu3atdgqvssb5nm8wo.png" alt=" " width="800" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡Well-Architected machine learning design principles, often guided by frameworks like the AWS Well-Architected Framework's Machine Learning Lens, aim to facilitate good design in the cloud by focusing on specific considerations for ML workloads. Few ML design principles are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assign ownership-&lt;/strong&gt; Apply the right skills and the right number of resources along with accountability and empowerment to increase productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provide protection-&lt;/strong&gt; Apply security controls to systems and services hosting model data, algorithms, computation, and endpoints. This ensures secure and uninterrupted operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable resiliency-&lt;/strong&gt; Ensure fault tolerance and the recoverability of ML models through version control, traceability, and explainability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable reusability-&lt;/strong&gt; Use independent modular components that can be shared and reused. This helps enable reliability, improve productivity, and optimize cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable reproducibility-&lt;/strong&gt; Use version control across components, such as infrastructure, data, models, and code. Track changes back to a point-in-time release. This approach enables model governance and audit standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize resources-&lt;/strong&gt; Perform trade-off analysis across available resources and configurations to achieve an optimal outcome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduce cost-&lt;/strong&gt; Identify the potential for reducing cost by automating or optimizing processes, resources, and operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable automation-&lt;/strong&gt; Use technologies such as pipelining, scripting, continuous integration (CI), continuous delivery (CD), and continuous training (CT) to increase agility, improve performance, sustain resiliency, and reduce cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable continuous improvement-&lt;/strong&gt; Evolve and improve the workload through continuous monitoring, analysis, and learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimize environmental impact-&lt;/strong&gt; Establish sustainability goals and understand the impact of ML models. Use managed services, adopt efficient hardware and software, and maximize their utilization.&lt;/p&gt;
&lt;h2&gt;
  
  
  How does the model work?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89f93ft10wj1por89izn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89f93ft10wj1por89izn.png" alt=" " width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡It focuses on two key components of the model: &lt;strong&gt;features&lt;/strong&gt; and &lt;strong&gt;weights&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt; are identified parts of your dataset that are important in determining accurate outcomes. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weights&lt;/strong&gt; represent how important an associated feature is for determining the accuracy of that outcome. A higher likelihood of accuracy results in a higher weight.&lt;/p&gt;

&lt;p&gt;For instance, suppose we want a recommendation system in which the model predicts whether a particular model of car is worth buying. We can form a mathematical equation of the form (&lt;strong&gt;y = mx + c&lt;/strong&gt;), say:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;y = m1x1 + m2x2 + m3x3 + c&lt;/strong&gt;&lt;br&gt;
where &lt;strong&gt;x&lt;/strong&gt; is the set of &lt;strong&gt;features&lt;/strong&gt; (cost, make, type, etc.) and &lt;strong&gt;m&lt;/strong&gt; is the set of &lt;strong&gt;weights&lt;/strong&gt; we assign to each feature.&lt;/p&gt;

&lt;p&gt;Suppose this model only recommends products with a total value greater than one. It does a quick calculation of its features and weights and produces a final value of 1.05, so Product Y is an acceptable recommendation. &lt;br&gt;
&lt;strong&gt;&lt;em&gt;Given data (labelled or unlabelled), a model is first trained to make predictions on test data. Once it is trained, we can use it on actual data.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
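&lt;p&gt;The weighted-sum calculation above can be sketched in plain Python. The feature values, weights, and bias below are illustrative, not taken from a trained model:&lt;/p&gt;

```python
# Hypothetical recommendation score: y = m1*x1 + m2*x2 + m3*x3 + c
# Feature values (normalized) and weights are illustrative only.
features = {"cost": 0.9, "make": 0.7, "type": 0.5}   # x values
weights  = {"cost": 0.5, "make": 0.4, "type": 0.3}   # m values
bias = 0.17                                          # c

score = sum(weights[name] * value for name, value in features.items()) + bias
print(f"score = {score:.2f}")          # 0.5*0.9 + 0.4*0.7 + 0.3*0.5 + 0.17 = 1.05
print("recommend" if score > 1.0 else "skip")
```

&lt;p&gt;With these numbers the score comes out to 1.05, matching the worked example: above the threshold of 1, so the product is recommended.&lt;/p&gt;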
&lt;h2&gt;
  
  
  Developing ML Solutions with SageMaker Studio
&lt;/h2&gt;

&lt;p&gt;✔️ &lt;strong&gt;Amazon SageMaker&lt;/strong&gt; is a managed machine learning (ML) service used to build and train ML models and deploy them right away into a production-ready hosted environment.&lt;br&gt;
✔️ &lt;strong&gt;Amazon Bedrock&lt;/strong&gt;, by contrast, is designed for use cases where you want to build generative AI applications without investing heavily in custom model development.&lt;br&gt;
SageMaker AI is a good choice for unique or &lt;strong&gt;specialized AI/ML&lt;/strong&gt; needs that require custom-built models.&lt;/p&gt;

&lt;p&gt;Let’s get started with our task for today.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
a) Sign into the AWS Management Console.&lt;br&gt;
b) Create a SageMaker notebook instance, open conda_python3&lt;br&gt;
c) Prepare the Dataset and Upload the data to S3&lt;br&gt;
d) Train a model using SageMaker’s built-in XGBoost&lt;br&gt;
e) Train a model using Scikit-Learn Random Forest &lt;br&gt;
f) Deploy the model and make predictions.&lt;br&gt;
g) Clean-Up the resource&lt;br&gt;
h) Check data in S3&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi49j7bcuyontams87xj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi49j7bcuyontams87xj.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Solution&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) 👉 Sign into the AWS Management Console.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sign in to the AWS Management Console and set the default AWS region to &lt;strong&gt;US East (N. Virginia) us-east-1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) 👉 Create a SageMaker notebook instance, open conda_python3.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Amazon SageMaker AI&lt;/strong&gt; by clicking on the Services menu at the top, then click on &lt;strong&gt;Amazon SageMaker AI&lt;/strong&gt;. Then, click on &lt;strong&gt;Notebooks&lt;/strong&gt; on the left panel.&lt;/li&gt;
&lt;li&gt;Click on the &lt;strong&gt;Create notebook Instance&lt;/strong&gt; button.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2wmh16e0p90eqkmrn1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2wmh16e0p90eqkmrn1j.png" alt=" " width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the Notebook instance settings section, set:&lt;br&gt;
Notebook instance name: Enter &lt;strong&gt;Instance-BuiltIn-Model&lt;/strong&gt;&lt;br&gt;
Notebook instance type: Select &lt;strong&gt;ml.t2.medium&lt;/strong&gt;&lt;br&gt;
IAM role (existing): &lt;strong&gt;AmazonSagemakeroRole-(RANDOMNUMBER)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5myrjubc2ptcsr754oc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5myrjubc2ptcsr754oc.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Leave other options as default, click on the &lt;strong&gt;Create notebook instance&lt;/strong&gt; button.&lt;/p&gt;

&lt;p&gt;It will take up to &lt;strong&gt;5 minutes&lt;/strong&gt; for the notebook instance to be up and running. Wait until the Status changes to &lt;strong&gt;InService&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9i4k19y38armrdgrxcup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9i4k19y38armrdgrxcup.png" alt=" " width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on the &lt;strong&gt;Open JupyterLab&lt;/strong&gt; button. You will be redirected to the run environment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Select &lt;strong&gt;conda_python3&lt;/strong&gt; kernel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tjiqmrqsb0vbj5b1zo3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tjiqmrqsb0vbj5b1zo3.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Name: &lt;strong&gt;SageMaker_ML_Lab.ipynb&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c) 👉 Prepare the Dataset and Upload the data to S3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will be using the &lt;strong&gt;California Housing&lt;/strong&gt; dataset. It is publicly available for experimenting through scikit-learn's &lt;strong&gt;fetch_california_housing()&lt;/strong&gt; function.&lt;/p&gt;

&lt;p&gt;Copy the below code in cell for execution and run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import sagemaker
from sagemaker import get_execution_role

# Load California housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.DataFrame(california.target, columns=["MedHouseVal"])

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save to CSV (SageMaker expects CSV format)
train_data = pd.concat([y_train, X_train], axis=1)
test_data = pd.concat([y_test, X_test], axis=1)

train_data.to_csv("california_train.csv", index=False, header=False)
test_data.to_csv("california_test.csv", index=False, header=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of above code:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Import required libraries:&lt;/strong&gt;&lt;br&gt;
NumPy and pandas for numerical and tabular data handling.&lt;br&gt;
scikit-learn for the dataset and the train/test split utility.&lt;br&gt;
SageMaker for interacting with AWS SageMaker services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Functions Used:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;fetch_california_housing()&lt;/strong&gt;- to load a real-world regression dataset with housing features (e.g., income, location).&lt;br&gt;
&lt;strong&gt;train_test_split()&lt;/strong&gt;- to divide the dataset so that part is used for training and the rest for testing.&lt;br&gt;
&lt;strong&gt;get_execution_role()&lt;/strong&gt;- to allow SageMaker to securely access AWS resources like S3 and training instances.&lt;/p&gt;

&lt;p&gt;Note: The dataset contains features such as MedInc, HouseAge, and AveRooms.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;target variable&lt;/strong&gt; is the median house value (in $100,000s).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Here we,&lt;br&gt;
Assign the &lt;strong&gt;features&lt;/strong&gt; to a DataFrame X.&lt;br&gt;
Assign the &lt;strong&gt;target values&lt;/strong&gt; to a DataFrame y.&lt;br&gt;
Use &lt;strong&gt;80%&lt;/strong&gt; of the data for &lt;strong&gt;training&lt;/strong&gt; the model.&lt;br&gt;
Use &lt;strong&gt;20%&lt;/strong&gt; of the data to &lt;strong&gt;evaluate model performance&lt;/strong&gt;.&lt;br&gt;
Set &lt;strong&gt;random_state=42&lt;/strong&gt; in &lt;strong&gt;train_test_split()&lt;/strong&gt; to ensure reproducibility.&lt;br&gt;
Place the &lt;strong&gt;label/target&lt;/strong&gt; in the first column of the &lt;strong&gt;CSV&lt;/strong&gt; file, as expected by &lt;strong&gt;SageMaker&lt;/strong&gt;.&lt;br&gt;
Remove headers from the &lt;strong&gt;CSV&lt;/strong&gt; files because the &lt;strong&gt;XGBoost&lt;/strong&gt; built-in algorithm does not expect them.&lt;br&gt;
Later save the files locally to be uploaded to &lt;strong&gt;Amazon S3&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
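&lt;p&gt;A quick sketch of the CSV layout the steps above produce, using a tiny synthetic frame in place of the real train_data (values are illustrative):&lt;/p&gt;

```python
import pandas as pd

# Tiny synthetic frame standing in for train_data: label (target) first, then features.
df = pd.DataFrame({"MedHouseVal": [4.5, 1.2],
                   "MedInc": [8.3, 2.1],
                   "HouseAge": [41.0, 12.0]})

csv_text = df.to_csv(index=False, header=False)
print(csv_text)
# The first value on each row is the label, and no header row is written --
# the layout SageMaker's built-in XGBoost expects for CSV input.
```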

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fveup64encz7l6uyga3uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fveup64encz7l6uyga3uc.png" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upload the data to S3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Go to your &lt;strong&gt;S3&lt;/strong&gt; console and create your bucket. Copy its name and paste it into the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import sagemaker

# Initialize S3 client
s3 = boto3.client('s3')

bucket_name = '&amp;lt;Your-S3-Bucket_name&amp;gt;'  # Replace with your bucket name
prefix = 'california-housing'

# Upload to S3 using SageMaker session
sagemaker_session = sagemaker.Session()

train_path = sagemaker_session.upload_data(path='california_train.csv', bucket=bucket_name, key_prefix=prefix)
test_path = sagemaker_session.upload_data(path='california_test.csv', bucket=bucket_name, key_prefix=prefix)

print(f"Training data uploaded to: {train_path}")
print(f"Test data uploaded to: {test_path}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo2xyihjt8e2qn9x3dht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo2xyihjt8e2qn9x3dht.png" alt=" " width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;br&gt;
&lt;strong&gt;SageMaker&lt;/strong&gt; jobs read data directly from &lt;strong&gt;Amazon S3&lt;/strong&gt;, not from your local machine.&lt;br&gt;
&lt;strong&gt;upload_data()&lt;/strong&gt; handles the transfer of local CSVs to S3.&lt;br&gt;
prefix determines the folder path inside the S3 bucket.&lt;/em&gt;&lt;/p&gt;
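&lt;p&gt;As a small illustration of how the prefix shapes the object path, upload_data() writes each file under s3://{bucket}/{prefix}/ (the bucket name below is a placeholder, not a real bucket):&lt;/p&gt;

```python
# How the uploaded object's S3 URI is composed: s3://{bucket}/{prefix}/{filename}
bucket_name = "example-bucket"        # placeholder, not a real bucket
prefix = "california-housing"
filename = "california_train.csv"

s3_uri = f"s3://{bucket_name}/{prefix}/{filename}"
print(s3_uri)   # s3://example-bucket/california-housing/california_train.csv
```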

&lt;p&gt;&lt;strong&gt;d) 👉 Train a model using SageMaker’s built-in XGBoost&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.amazon.amazon_estimator import get_image_uri

# Get XGBoost container (adjust region if needed)
container = get_image_uri(sagemaker_session.boto_region_name, 'xgboost', '1.0-1')

# Define SageMaker estimator
xgb = sagemaker.estimator.Estimator(
    container,
    get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket_name}/{prefix}/output',
    sagemaker_session=sagemaker_session
)

# Set hyperparameters (for regression)
xgb.set_hyperparameters(
    objective="reg:squarederror",
    num_round=100,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8
)

# Define data channels
train_input = sagemaker.TrainingInput(train_path, content_type='csv')
test_input = sagemaker.TrainingInput(test_path, content_type='csv')

# Train the model
xgb.fit({'train': train_input, 'validation': test_input})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SageMaker&lt;/strong&gt; uses pre-built containers (Docker images) to run popular algorithms like &lt;strong&gt;XGBoost&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fetch the URI of the &lt;strong&gt;XGBoost&lt;/strong&gt; container image for your AWS region.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use version ‘1.0–1’ of the &lt;strong&gt;XGBoost container&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use an Estimator, a SageMaker abstraction that manages training jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Key parameters to set in the Estimator include:&lt;br&gt;
&lt;strong&gt;container&lt;/strong&gt;: the machine learning algorithm to use (XGBoost here).&lt;br&gt;
&lt;strong&gt;instance_type&lt;/strong&gt;: the compute resource (e.g., ml.m5.large which has 2 vCPUs and 8 GiB RAM).&lt;br&gt;
&lt;strong&gt;output_path&lt;/strong&gt;: the S3 location where the trained model will be saved.&lt;br&gt;
&lt;strong&gt;get_execution_role()&lt;/strong&gt;: to ensure the SageMaker job has permissions to access AWS services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tune the model for best results by specifying hyperparameters such as:&lt;br&gt;
&lt;strong&gt;objective&lt;/strong&gt;: set to regression.&lt;br&gt;
&lt;strong&gt;eta&lt;/strong&gt;: learning rate.&lt;br&gt;
&lt;strong&gt;max_depth&lt;/strong&gt;: controls how deep the decision trees can grow.&lt;br&gt;
&lt;strong&gt;subsample&lt;/strong&gt;: percentage of data used per tree.&lt;br&gt;
These hyperparameters help control &lt;strong&gt;overfitting&lt;/strong&gt; and &lt;strong&gt;underfitting&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vbgg3328pmo3mqzdewq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vbgg3328pmo3mqzdewq.png" alt=" " width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;e) 👉 Train a model using Scikit-Learn Random Forest&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Train model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train.values.ravel())

# Evaluate
predictions = rf_model.predict(X_test)
mse = mean_squared_error(y_test, predictions)

print(f"Mean Squared Error: {mse}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Run a local model using &lt;strong&gt;Scikit-learn’s Random Forest&lt;/strong&gt; algorithm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use this local model to benchmark performance against the &lt;strong&gt;XGBoost&lt;/strong&gt; model trained on SageMaker.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use &lt;strong&gt;mean_squared_error&lt;/strong&gt; to compare and evaluate model &lt;strong&gt;accuracy&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqodot35j9fbbu9vweibx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqodot35j9fbbu9vweibx.png" alt=" " width="800" height="256"&gt;&lt;/a&gt;&lt;br&gt;
We can see &lt;strong&gt;mean_squared_error()&lt;/strong&gt; gives a value of approximately 0.25. Note that this is a squared-error measure, not an accuracy percentage.&lt;/p&gt;
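&lt;p&gt;Since the target is expressed in units of $100,000, an MSE is easier to interpret via its square root (RMSE). A minimal sketch with illustrative values:&lt;/p&gt;

```python
import numpy as np

# Interpreting MSE on this dataset: targets are in units of $100,000,
# so an MSE of ~0.25 corresponds to an RMSE of ~0.5, i.e. roughly $50,000
# of typical prediction error. The arrays below are illustrative.
y_true = np.array([2.0, 3.5, 1.2, 4.1])
y_pred = np.array([2.4, 3.1, 1.5, 4.8])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(f"MSE  = {mse:.4f}")
print(f"RMSE = {rmse:.4f}  (~${rmse * 100_000:,.0f} in house value)")
```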

&lt;p&gt;&lt;strong&gt;f) 👉 Deploy the model and make predictions.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Deploy
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

# Sample prediction
sample = X_test.iloc[0].values.tolist()
payload = ",".join(map(str, sample))

response = xgb_predictor.predict(payload, initial_args={'ContentType': 'text/csv'})

print(f"Predicted MedHouseVal: {float(response)}")
print(f"Actual MedHouseVal: {y_test.iloc[0].values[0]}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deploy&lt;/strong&gt; the trained model to a real-time &lt;strong&gt;HTTPS endpoint&lt;/strong&gt; in SageMaker.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This &lt;strong&gt;endpoint&lt;/strong&gt; allows sending requests to get &lt;strong&gt;predictions&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select one &lt;strong&gt;test row&lt;/strong&gt; and format it as a &lt;strong&gt;CSV&lt;/strong&gt; string (the format required by the endpoint).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Send a &lt;strong&gt;predict()&lt;/strong&gt; request to the deployed model endpoint using the formatted test data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compare the model’s &lt;strong&gt;prediction&lt;/strong&gt; with the &lt;strong&gt;actual value&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
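&lt;p&gt;The payload formatting in step 3 can be sketched on its own. The feature row below mirrors the dataset's well-known first row and is shown only for illustration (in the lab the row comes from X_test):&lt;/p&gt;

```python
# Building the text/csv payload the endpoint expects: feature values only
# (no label), comma-joined into a single line.
sample = [8.3252, 41.0, 6.9841, 1.0238, 322.0, 2.5556, 37.88, -122.23]
payload = ",".join(map(str, sample))
print(payload)
```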

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ki000vnn9sd5oyg0qkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ki000vnn9sd5oyg0qkr.png" alt=" " width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;g) 👉 Clean-Up the resource&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Always &lt;strong&gt;delete SageMaker endpoints&lt;/strong&gt; when you’re done using them to avoid ongoing charges.&lt;/p&gt;

&lt;p&gt;Deleting the endpoint de-provisions the hosting instance and stops billing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_predictor.delete_endpoint()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjdasy9yeoryhk2pmppd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjdasy9yeoryhk2pmppd.png" alt=" " width="800" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;h) 👉 Check data in S3 Bucket&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your &lt;strong&gt;S3 bucket&lt;/strong&gt;, and inside the &lt;strong&gt;objects&lt;/strong&gt;, you will find a folder named &lt;strong&gt;california-housing/&lt;/strong&gt;. Click on it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrxt2gumurh0gghi3cr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrxt2gumurh0gghi3cr1.png" alt=" " width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, click on the &lt;strong&gt;output&lt;/strong&gt; folder, then select &lt;strong&gt;sagemaker-xgboost&lt;/strong&gt;, and again click on the &lt;strong&gt;output&lt;/strong&gt; folder inside it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd2geqhbjqja93jen2tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd2geqhbjqja93jen2tw.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;There, you will see the &lt;strong&gt;model.tar.gz&lt;/strong&gt; file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5owqwg6mzwtav450ycll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5owqwg6mzwtav450ycll.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;br&gt;
This file contains the trained model artifacts produced by the training job we ran from &lt;strong&gt;JupyterLab&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Conclusion&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
● We created an Amazon SageMaker notebook instance.&lt;br&gt;
● We successfully configured the conda_python3 kernel.&lt;br&gt;
● We then created the dataset and uploaded it to an S3 bucket.&lt;br&gt;
● We successfully trained the built-in XGBoost model.&lt;br&gt;
● We successfully trained a Scikit-Learn Random Forest model.&lt;br&gt;
● We successfully deployed the XGBoost model &amp;amp; used it for prediction.&lt;br&gt;
Finally, we cleaned up the resources in order to avoid incurring extra cost.&lt;/p&gt;

&lt;p&gt;Hope you found this post informative,&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br&gt;
Sumbul.&lt;/p&gt;

</description>
      <category>sagemaker</category>
      <category>machinelearning</category>
      <category>endtoendml</category>
      <category>designprinciples</category>
    </item>
    <item>
      <title>Neural Networks &amp; Generative Models</title>
      <dc:creator>Sumbul Naqvi</dc:creator>
      <pubDate>Thu, 18 Sep 2025 19:47:05 +0000</pubDate>
      <link>https://forem.com/sumbul_naqvi/neural-networks-generative-models-1mj1</link>
      <guid>https://forem.com/sumbul_naqvi/neural-networks-generative-models-1mj1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl6vc7v2zy4dafjylzz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl6vc7v2zy4dafjylzz7.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog we will be talking about the basics of neural networks and generative models, which are fundamental concepts in the field of artificial intelligence and machine learning.&lt;/p&gt;

&lt;p&gt;I have created a &lt;a href="https://youtu.be/aV481dMmerg?feature=shared" rel="noopener noreferrer"&gt;YouTube video&lt;/a&gt; along similar lines; I would recommend going through it to start with: &lt;/p&gt;

&lt;p&gt;Before I move ahead, as a one liner if we define these terms it will read like:&lt;/p&gt;

&lt;p&gt;💡&lt;strong&gt;Artificial Intelligence (AI)&lt;/strong&gt;: It's the simulation of human intelligence in machines that are programmed to think and act like humans.&lt;/p&gt;

&lt;p&gt;💡&lt;strong&gt;Machine Learning (ML)&lt;/strong&gt;: It's a subset of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior.&lt;/p&gt;

&lt;p&gt;💡&lt;strong&gt;Deep Learning (DL)&lt;/strong&gt;: It's a subset of machine learning that uses multilayered neural networks, called deep neural networks, to simulate the complex decision-making power of the human brain.&lt;/p&gt;

&lt;p&gt;💡&lt;strong&gt;Neural Networks:&lt;/strong&gt; It's a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain.&lt;/p&gt;

&lt;p&gt;💡&lt;strong&gt;Generative Models:&lt;/strong&gt; These models focus on understanding how the data is generated. They aim to learn the distribution of the data itself. For instance, if we're looking at pictures of boys and girls, a generative model would try to understand what makes a boy look like a boy and a girl look like a girl.&lt;/p&gt;

&lt;p&gt;💡&lt;strong&gt;Generative adversarial network (GAN):&lt;/strong&gt; A deep learning architecture that trains two neural networks to compete against each other to generate more authentic new data from a given training dataset. A GAN is called adversarial because it trains two different networks and pits them against each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Artificial Neural Networks (ANN)&lt;/strong&gt; are the very core of deep learning. They are versatile, powerful, scalable, and ideal for tackling large and highly complex ML tasks.&lt;br&gt;
Applications of ANNs include Google Images, powering speech recognition services (&lt;strong&gt;like Apple's Siri&lt;/strong&gt;), recommending the best video to watch to millions of users per day (&lt;strong&gt;YouTube&lt;/strong&gt;), and learning to beat the world champion at the game of &lt;strong&gt;Go&lt;/strong&gt; (by examining millions of past games and then playing against itself, e.g. &lt;strong&gt;DeepMind's AlphaGo&lt;/strong&gt;).&lt;br&gt;
Artificial neural networks were first introduced in 1943, when &lt;strong&gt;Warren McCulloch&lt;/strong&gt;, a neurophysiologist, and a young mathematician, &lt;strong&gt;Walter Pitts&lt;/strong&gt;, developed the first models of neural networks.&lt;/p&gt;

&lt;p&gt;They presented a model of &lt;strong&gt;how biological neurons might work together in animal brains to perform complex computations&lt;/strong&gt;. This was the &lt;strong&gt;first ANN architecture&lt;/strong&gt;.&lt;br&gt;
&lt;strong&gt;ANNs&lt;/strong&gt; then entered a dark era due to a lack of resources. In the 1980s there was a revival of interest in ANNs as new network architectures were invented and better training techniques were developed.&lt;br&gt;
But by the 1990s, powerful alternative ML techniques such as the &lt;strong&gt;SVM&lt;/strong&gt; (Support Vector Machine) were favoured by most researchers, as they seemed to offer better results and stronger theoretical foundations.&lt;/p&gt;

&lt;p&gt;Now the question arises &lt;strong&gt;Why are ANNs relevant today?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2ki4dplq3btdxi8zxly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2ki4dplq3btdxi8zxly.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The answer to it is:&lt;br&gt;
✔️ Firstly, huge quantities of data are available to train neural networks. &lt;br&gt;
✔️ Secondly, improved hardware and software resources, in particular more powerful GPUs. This tremendous increase in computing power has made it possible to train large neural networks in a reasonable amount of time.&lt;br&gt;
✔️ Thirdly, ANNs frequently outperform other ML techniques on very large and complex problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Warren McCulloch&lt;/strong&gt; and &lt;strong&gt;Walter Pitts&lt;/strong&gt; model of biological neuron
&lt;/h2&gt;

&lt;p&gt;The first computational model of a neuron was proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943. It may be divided into two parts: the first part, g, takes the inputs and performs an aggregation; based on the aggregated value, the second part, f, makes a decision.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yjo39ktuhq7398cdx2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yjo39ktuhq7398cdx2m.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;
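&lt;p&gt;As a minimal, illustrative sketch (the function names g and f mirror the description above; the threshold value is just an example), the two-part McCulloch-Pitts neuron can be written as:&lt;/p&gt;

```python
# McCulloch-Pitts neuron sketch: part g aggregates the binary (0/1) inputs,
# part f makes the decision by comparing the aggregate to a threshold theta.

def g(inputs):
    """Aggregation: sum of binary (0/1) inputs."""
    return sum(inputs)

def f(aggregate, theta):
    """Decision: fire (output 1) if the aggregate reaches the threshold theta."""
    return 1 if aggregate >= theta else 0

def mp_neuron(inputs, theta):
    return f(g(inputs), theta)

# With theta = 2 over two inputs, the neuron behaves like an AND gate:
print(mp_neuron([1, 1], theta=2))  # 1
print(mp_neuron([1, 0], theta=2))  # 0
```

&lt;p&gt;Choosing a different theta gives a different logic function, e.g. theta = 1 over two inputs behaves like OR.&lt;/p&gt;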

&lt;p&gt;An artificial neuron activates its output when more than a certain number of its inputs are active. &lt;br&gt;
In a simple neural network, every node in one layer is connected to every node in the next layer, and there is only a &lt;strong&gt;single hidden layer&lt;/strong&gt;.&lt;br&gt;
Deep learning systems, by contrast, have &lt;strong&gt;several hidden layers&lt;/strong&gt;, which is what makes them deep😊&lt;/p&gt;

&lt;p&gt;There are two main types of deep learning systems with differing architectures - convolutional neural networks (&lt;strong&gt;CNNs&lt;/strong&gt;) and recurrent neural networks (&lt;strong&gt;RNNs&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;The main differences between CNNs and RNNs include the following: &lt;br&gt;
✔️&lt;strong&gt;CNNs&lt;/strong&gt; are commonly used to solve problems involving spatial data, such as images. &lt;br&gt;
✔️&lt;strong&gt;RNNs&lt;/strong&gt; are better suited to analyzing temporal and sequential data, such as text or videos. They are good at natural language tasks like language modeling, speech recognition, and sentiment analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNNs and RNNs have different architectures.&lt;/strong&gt;&lt;br&gt;
Convolutional neural networks (&lt;strong&gt;CNNs&lt;/strong&gt;) are one of the most popular models used today. This computational model uses a variation of multilayer perceptrons and contains one or more convolutional layers, which can be either entirely connected or pooled. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RNNs&lt;/strong&gt; use other data points in a sequence to make better predictions. They do this by taking in input and reusing the activations of previous or later nodes in the sequence to influence the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a Perceptron?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It is one of the &lt;strong&gt;simplest ANN&lt;/strong&gt; architectures, invented in 1957 by &lt;strong&gt;Frank Rosenblatt&lt;/strong&gt;, and is based on a slightly different artificial neuron called a Linear Threshold Unit (&lt;strong&gt;LTU&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fes9st5n63wuhay4qld40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fes9st5n63wuhay4qld40.png" alt=" " width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In an &lt;strong&gt;LTU&lt;/strong&gt;, the inputs and outputs are numbers instead of binary on/off values.&lt;br&gt;
Each input connection is associated with a weight.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;It computes a weighted sum of its inputs&lt;/em&gt;&lt;/strong&gt;, then applies a &lt;strong&gt;&lt;em&gt;step function to that sum&lt;/em&gt;&lt;/strong&gt; and gives the output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6wx2a3v77gjz8r3usth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6wx2a3v77gjz8r3usth.png" alt=" " width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A simple LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and, if the result exceeds a threshold, it outputs the positive class; otherwise it outputs the negative class. Just like our logistic regression classifier or linear SVM models in ML.&lt;/p&gt;
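&lt;p&gt;The LTU described above can be sketched in a few lines of Python (the weights and bias here are arbitrary, illustrative values, not learned ones): a weighted sum of numeric inputs, followed by a step function that picks the class.&lt;/p&gt;

```python
# LTU sketch: weighted sum of the inputs plus a bias, thresholded at zero.

def step(z):
    """Heaviside step function: positive class (1) if z is non-negative."""
    return 1 if z >= 0 else 0

def ltu(inputs, weights, bias):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(weighted_sum)

# 0.5*2.0 + (-1.0)*3.0 + 1.5 = -0.5, which is negative, so the output is 0:
print(ltu([2.0, 3.0], weights=[0.5, -1.0], bias=1.5))  # 0
# 0.5*2.0 + (-1.0)*0.0 + 1.5 = 2.5, so the output is 1:
print(ltu([2.0, 0.0], weights=[0.5, -1.0], bias=1.5))  # 1
```

&lt;p&gt;Training a perceptron then amounts to adjusting the weights and bias until the step function separates the two classes.&lt;/p&gt;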

&lt;p&gt;In &lt;strong&gt;2010, Xavier Glorot and Yoshua Bengio&lt;/strong&gt; published a paper titled "&lt;strong&gt;&lt;em&gt;Understanding the difficulty of training deep feedforward neural networks&lt;/em&gt;&lt;/strong&gt;" at the International Conference on Artificial Intelligence and Statistics (AISTATS). They suggested that the &lt;strong&gt;vanishing gradient problem&lt;/strong&gt; in neural networks (the gradient becomes very small in the early layers, which leads to slow training) can be addressed by using better &lt;strong&gt;activation functions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgksb8dznqbqm22zt749u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgksb8dznqbqm22zt749u.png" alt=" " width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a neural network, we update the &lt;strong&gt;weights&lt;/strong&gt; and &lt;strong&gt;biases&lt;/strong&gt; of the neurons on the basis of the error at the output. This process is known as &lt;strong&gt;back-propagation&lt;/strong&gt;. &lt;br&gt;
Activation functions make back-propagation possible, since the &lt;strong&gt;gradients&lt;/strong&gt; are supplied along with the &lt;strong&gt;error&lt;/strong&gt; to update the &lt;strong&gt;weights&lt;/strong&gt; and &lt;strong&gt;biases&lt;/strong&gt;.&lt;br&gt;
Thus we can say that a neural network is just a BIG mathematical function. You could use different activation functions for different neurons in the same or different layers. Different activation functions allow for different non-linearities, which might work better for solving a specific problem.&lt;br&gt;
That wraps up the theory for this post; we will leave further learning for the next one. &lt;/p&gt;
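&lt;p&gt;A tiny, illustrative sketch of why the choice of activation function matters for vanishing gradients: back-propagation multiplies one activation derivative per layer, the sigmoid's derivative never exceeds 0.25, so the product shrinks rapidly with depth, while ReLU's derivative is 1 for positive inputs (the 10-layer depth and the evaluation points are arbitrary example values):&lt;/p&gt;

```python
import math

def sigmoid_derivative(z):
    """Derivative of the logistic sigmoid: s * (1 - s), at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# Multiply one derivative per layer, as back-propagation effectively does:
layers = 10
sigmoid_grad = 1.0
relu_grad = 1.0
for _ in range(layers):
    sigmoid_grad *= sigmoid_derivative(0.0)  # best case for sigmoid: 0.25
    relu_grad *= relu_derivative(1.0)        # 1.0 for a positive input

print(sigmoid_grad)  # 0.25**10, roughly 9.5e-07: the gradient has vanished
print(relu_grad)     # 1.0: the signal survives through all layers
```

&lt;p&gt;Even in the sigmoid's best case the gradient reaching the early layers is about a millionth of its original size after ten layers, which is exactly the slow-training symptom described above.&lt;/p&gt;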

&lt;p&gt;Let's talk about:&lt;/p&gt;

&lt;h2&gt;
  
  
  How can AWS help with your deep learning requirements?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Amazon Web Services (AWS)&lt;/strong&gt; has several &lt;strong&gt;Deep Learning&lt;/strong&gt; offerings. Some examples of &lt;strong&gt;AWS&lt;/strong&gt; services you can use to fully manage specific deep learning applications include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Augmented AI (Amazon A2I)&lt;/strong&gt; enables you to build the workflows required for human review of ML predictions. &lt;strong&gt;Amazon A2I&lt;/strong&gt; brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon CodeGuru Security&lt;/strong&gt; tracks, detects, and fixes code security vulnerabilities across the entire development cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Comprehend&lt;/strong&gt; uses natural language processing (&lt;strong&gt;NLP&lt;/strong&gt;) to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon DevOps Guru&lt;/strong&gt; makes it easy for developers and operators to improve the performance and availability of their applications, using ML-powered cloud operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Forecast&lt;/strong&gt; is a deep learning service for time-series forecasting; it uses ML to forecast sales, operations, and inventory needs for millions of items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Fraud Detector&lt;/strong&gt; detects online fraud with ML, enhancing business security practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Translate&lt;/strong&gt;, powered by deep-learning technologies, delivers fast, high-quality, and affordable language translation. It provides highly accurate and continually improving translations with a single API call.&lt;/p&gt;

&lt;p&gt;Hope you enjoyed going through this article.&lt;br&gt;
Thanks&lt;br&gt;
Sumbul&lt;/p&gt;

</description>
      <category>ai</category>
      <category>neuralnetwork</category>
      <category>generativemodels</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
