Forem: Meftahul Jannat Mila

Building a Machine Learning Pipeline with a Decision Tree Classifier

Meftahul Jannat Mila — Sat, 26 Jul 2025 17:35:21 +0000

In this article, we will explain how to prepare and train a machine learning model using a pipeline. We'll focus on using a Decision Tree to predict survival based on the Titanic dataset. This process involves data cleaning, preprocessing, training, and tuning, all structured within a neat and reusable pipeline.

Introduction:

A Machine Learning Pipeline is a systematic workflow designed to automate the process of building, training, and deploying ML models. It includes several steps, such as data collection, preprocessing, feature engineering, model training, evaluation, and deployment.

Pipelines simplify and standardize workflows, accelerating machine learning development. They enhance data management by enabling the extraction, transformation, and loading of data from diverse sources.

Step 1: Importing Libraries

First, we import the essential libraries for data handling, preprocessing, model training, and evaluation.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
import pickle

Step 2: Load and Clean the Data

We load the Titanic dataset and drop columns that aren’t useful for our model.

df = pd.read_csv('tested.csv')
df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

These columns are either IDs, text, or not relevant — so we remove them.

Step 3: Split the Data

We separate our dataset into features (x) and target (y), then split them into training and testing sets.

x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns=['Survived']),
    df['Survived'],
    test_size=0.2,
    random_state=42
)

This helps us train the model on one part of the data and test its performance on the unseen part.

Step 4: Build the Preprocessing Pipeline

4.1 Impute Missing Values

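We fill in missing values, using the mean for "Age" (and for "Fare", as a precaution) and the most frequent value for "Embarked". Any missing value left in a passthrough column would make the chi-squared feature selection in Step 5 fail.

columntransformer1 = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),  # Age
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),  # Embarked
    ('impute_fare', SimpleImputer(), [5])  # Fare
], remainder='passthrough')

4.2 One-Hot Encode Categorical Features

We convert text columns like "Sex" and "Embarked" into numbers using one-hot encoding. A ColumnTransformer places its transformed columns first and the passthrough columns after them, so the output of the previous step is ordered Age, Embarked, Fare, Pclass, Sex, SibSp, Parch. "Embarked" is therefore at index 1 and "Sex" at index 4.

columntransformer2 = ColumnTransformer([
    # Indices refer to the reordered output of columntransformer1: 1 = Embarked, 4 = Sex
    ('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1, 4])
], remainder='passthrough')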
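The column order produced by a ColumnTransformer is easy to check on a tiny example. The sketch below uses a made-up four-column frame (an assumption for illustration, not the Titanic file) and shows that the imputed columns come out first, with the passthrough columns appended after them:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up stand-in data; column positions: Pclass=0, Sex=1, Age=2, Embarked=3
toy = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, np.nan, 30.0],
    'Embarked': ['S', 'C', 'S'],
})

ct = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [3]),
], remainder='passthrough')

out = ct.fit_transform(toy)

# Output order is Age, Embarked, Pclass, Sex (not the original order)
print(out)
```

Any later step that selects columns by position has to use the positions in this reordered output.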

4.3 Scale Numerical Features

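We scale the columns selected by slice(0, 10) to a range between 0 and 1. A decision tree does not need scaled inputs, but the chi-squared test used for feature selection in Step 5 requires non-negative values, which this scaling guarantees on the training data.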

columntransformer3 = ColumnTransformer([
    ('scale', MinMaxScaler(), slice(0, 10))
])

Step 5: Feature Selection and Model

We select the 5 best features and use a Decision Tree for classification.

selectkbest = SelectKBest(score_func=chi2, k=5)
decisiontreeclassifier = DecisionTreeClassifier()

Step 6: Create the Pipeline

We combine all steps — preprocessing, feature selection, and modeling — into one reusable pipeline.

pipe = make_pipeline(
    columntransformer1,
    columntransformer2,
    columntransformer3,
    selectkbest,
    decisiontreeclassifier
)

Now we can treat this entire setup as a single model object.
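As a side note, make_pipeline names each step after its lowercased class name, which is where parameter names such as selectkbest__k in the tuning step come from. A minimal sketch with a smaller, hypothetical three-step pipeline (not the article's full pipeline):

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy pipeline, used only to show how steps get their names
toy_pipe = make_pipeline(
    MinMaxScaler(),
    SelectKBest(score_func=chi2, k=2),
    DecisionTreeClassifier(random_state=0),
)

step_names = list(toy_pipe.named_steps)
print(step_names)

# Parameters are addressed as <step name>__<parameter name>
toy_pipe.set_params(selectkbest__k=1, decisiontreeclassifier__max_depth=3)
print(toy_pipe.get_params()['selectkbest__k'])
```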

Step 7: Train the Model

We train the pipeline using our training data.

pipe.fit(x_train, y_train)

Step 8: Evaluate the Model

We make predictions on the test data and calculate the accuracy.

from sklearn.metrics import accuracy_score

y_pred = pipe.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

We can also evaluate performance using cross-validation:

cross_val_score(pipe, x_train, y_train, cv=5, scoring='accuracy').mean()

Step 9: Hyperparameter Tuning with GridSearchCV

We use GridSearchCV to find the best values for some parameters, like how many features to select and the maximum depth of the tree.

params = {
    'selectkbest__k': [5, 10],
    'decisiontreeclassifier__max_depth': [3, 5, 10]
}

grid = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
grid.fit(x_train, y_train)

print("Best Score:", grid.best_score_)
print("Best Parameters:", grid.best_params_)
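After the search, the tuned pipeline itself is available as best_estimator_. The sketch below illustrates this on synthetic data from make_classification (an assumption standing in for the Titanic features) with a shortened pipeline; the names toy_pipe and toy_grid are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

toy_pipe = make_pipeline(
    MinMaxScaler(),                       # chi2 needs non-negative inputs
    SelectKBest(score_func=chi2, k=5),
    DecisionTreeClassifier(random_state=0),
)

toy_grid = GridSearchCV(
    toy_pipe,
    param_grid={
        'selectkbest__k': [5, 10],
        'decisiontreeclassifier__max_depth': [3, 5, 10],
    },
    cv=5,
    scoring='accuracy',
)
toy_grid.fit(X_train, y_train)

# With the default refit=True, the best combination is refitted on all training data
best_model = toy_grid.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print("Best parameters:", toy_grid.best_params_)
print("Test accuracy of tuned pipeline:", test_accuracy)
```

Predicting with, or saving, best_model rather than the untuned pipeline is what carries the tuned parameters forward.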

Step 10: Save the Trained Pipeline

Finally, we save the trained pipeline so we can reuse it later without retraining.

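# 'pipe' is the pipeline fitted in Step 7; to keep the tuned model instead, dump grid.best_estimator_
with open('pipe.pkl', 'wb') as f:  # the with-block closes the file for us
    pickle.dump(pipe, f)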

Production Prediction

Step 1: Import Required Libraries and Load the Model

We start by importing necessary libraries and loading the saved pipeline.

import pickle
import numpy as np

with open('pipe.pkl', 'rb') as f:  # the with-block closes the file for us
    pipe = pickle.load(f)

Step 2: Create a New Input as a NumPy Array

Assume this is the data provided by the user. It must follow the same format (columns and order) as used during training.

# Assume user input
test_input2 = np.array([2, 'female', 16.0, 0, 0, 10.5, 'S'], dtype=object).reshape(1, 7)

Step 3: Make the Prediction Using the Pipeline

We call .predict() on the pipeline just like any other Scikit-learn model.

pipe.predict(test_input2)

This will output either 0 (Did Not Survive) or 1 (Survived), which you can format as needed.
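To try the save, load, and predict round trip without the Titanic file, here is a small sketch on synthetic data. The decision tree, the file name model.pkl, and the temporary folder are assumptions for illustration; a fitted pipeline pickles the same way:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a trained model
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# The reloaded model gives the same predictions as the original
new_rows = X[:3]
predictions = loaded.predict(new_rows)
same = np.array_equal(predictions, model.predict(new_rows))

# Turn 0/1 into readable labels
labels = {0: 'Did Not Survive', 1: 'Survived'}
readable = [labels[int(p)] for p in predictions]
print(same, readable)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.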

Final Summary of the Article

In this article, we built an end-to-end machine learning pipeline using the Titanic dataset. Here’s a brief overview:

Cleaned the dataset by removing irrelevant columns.
Used Scikit-learn’s Pipeline and ColumnTransformer for data preprocessing, including:
- Imputing missing values
- Encoding categorical variables
- Scaling numerical features
- Selecting important features
Trained a Decision Tree Classifier and validated it with the accuracy score and cross-validation.
Optimized hyperparameters using GridSearchCV.
Saved the trained pipeline with pickle.

Web Scraping Project: Extracting Data from Wikipedia Using Python

Meftahul Jannat Mila — Thu, 03 Jul 2025 06:44:19 +0000

In this project, I used Python to scrape a table of Bangladeshi companies from Wikipedia and convert it into a clean CSV file. The idea was to automatically collect and organize data from a web page without manually copying and pasting the information.

I'll walk you through the process step-by-step, including what each part of the code does and some challenges I faced during the project.

🔧 Tools & Libraries Used

Pandas: For handling tabular data.
Requests: To make HTTP requests and fetch web pages.
BeautifulSoup: To parse and extract data from HTML.

Step 1: Import the Required Libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

We import the necessary Python libraries to perform web scraping (requests and BeautifulSoup) and data handling (pandas).

Step 2: Request the Wikipedia Page

url = 'https://en.wikipedia.org/wiki/List_of_companies_of_Bangladesh'
page = requests.get(url, timeout=30)
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.text, 'html.parser')

We fetch the HTML content of the Wikipedia page using requests.get(), then parse it using BeautifulSoup with the HTML parser.

Step 3: Locate the Target Table

table = soup.find('table', class_='wikitable sortable')

We find the specific HTML table that contains the list of Bangladeshi companies. Wikipedia uses a table with the class 'wikitable sortable'.

Step 4: Extract Table Headers

c_titles = table.find_all('th',  attrs={"rowspan": "2"})
c_table_titles = [title.text.strip() for title in c_titles]

We extract the table headers (column titles) by finding <th> tags with rowspan="2" (which identifies the actual column names).

Step 5: Set Up the DataFrame

df = pd.DataFrame(columns=c_table_titles)

We create an empty DataFrame with the correct column names. This prepares us to insert the actual company data.

Step 6: Extract Data Rows

column_data = table.find_all('tr')
headers = [th.get_text(strip=True) for th in table.find_all('th', attrs={'rowspan': '2'})]
expected_columns = len(headers)

data_rows = []

for row in column_data[2:]:  # skip header rows
    row_data = row.find_all('td')
    individual_row_data = [td.get_text(strip=True) for td in row_data]

    # Remove extra columns if they exist
    if len(individual_row_data) > expected_columns:
        individual_row_data = individual_row_data[:expected_columns]

    # Skip rows with wrong column count
    if len(individual_row_data) != expected_columns:
        continue

    data_rows.append(individual_row_data)

We loop through all the table rows (skipping the first two header rows) and extract the text from each cell <td>. We also ensure each row matches the expected number of columns and remove any extra or irregular data.
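The trimming and skipping logic can be tried on its own, without any network access. The sketch below uses made-up rows in place of the scraped cell text, and normalize_rows is a hypothetical helper name:

```python
def normalize_rows(rows, expected_columns):
    """Trim rows that are too long and drop rows that are too short."""
    cleaned = []
    for cells in rows:
        trimmed = cells[:expected_columns]      # cut off extra trailing cells
        if len(trimmed) != expected_columns:    # still too short: skip it
            continue
        cleaned.append(trimmed)
    return cleaned


# Made-up rows standing in for scraped <td> text
raw_rows = [
    ['Acme', 'Retail', 'Dhaka'],
    ['Beta', 'Bank', 'Khulna', '[1]', 'active'],   # two extra cells
    ['Gamma', 'Textiles'],                          # one cell missing
]

clean_rows = normalize_rows(raw_rows, expected_columns=3)
print(clean_rows)
```

Trimming assumes the extra cells always sit at the end of the row, which is worth confirming against the page before relying on it.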

Step 7: Create and Save the Final DataFrame

df = pd.DataFrame(data_rows, columns=headers)
df.to_csv('Companies_Of_BD.csv', index=False)  # index=False keeps the row numbers out of the file

We create a final DataFrame using the collected data and headers, then export it to a CSV file named 'Companies_Of_BD.csv'.

Challenges I Faced

Every project has its share of hiccups. Here are some issues I ran into:

In the table I scraped from Wikipedia, there were six column headers, so each row should have six data values. However, some rows had eight values due to extra columns like status indicators and footnote references. This mismatch could cause errors when creating the DataFrame. I needed to remove the extra values to ensure each row had only six pieces of data, allowing the final CSV file to be clean and usable.

Final Output

The final result is a clean CSV file that contains a structured list of companies in Bangladesh from Wikipedia. This dataset can now be used for analysis, visualizations, or just general reference.

Conclusion

This was a great beginner-friendly project to learn about web scraping, HTML structure, and data cleaning in Python. It taught me how to be careful with real-world web data and handle unexpected formatting issues.

A Beginner’s Guide to NumPy for Data Analysis

Meftahul Jannat Mila — Thu, 20 Mar 2025 13:39:53 +0000

In this article, we’ll dive into NumPy, a must-know Python library that makes handling numbers and data simple and exciting. Whether you’re just starting with Python or curious about data analysis, we’ve got you covered with a friendly, step-by-step journey. We’ll explore how to work with arrays, perform calculations effortlessly, and use NumPy’s powerful tools to analyze data. To top it off, we’ll finish with a hands-on mini-project to bring everything together. Let’s embark on this adventure and unlock the magic of NumPy!

Environment Setup

Before we begin exploring NumPy, we’ll need to set up our environment to run the code examples and the mini-project later on. Here’s how we’ll get everything ready:

Install Python: If Python isn’t on your system yet, we can download it from python.org. During installation, we’ll ensure the option to add Python to our PATH is checked—this makes it easier to use from the terminal.
Install NumPy: We’ll open a terminal (or Command Prompt on Windows) and run:

   pip install numpy

This tells Python’s package manager (pip) to fetch and install NumPy for us.

Choose an Editor: We’ll pick a tool to write our code. Options include:
- IDLE: It comes with Python—just search for it after installation.
- VS Code: A free, popular editor available at code.visualstudio.com.
- Or any text editor we prefer!
Test the Setup: To confirm everything works, we’ll create a file (e.g., test.py) in our editor and add:

   import numpy as np
   print(np.__version__)

When we run it, seeing a version number (like 1.26.4) means we’re all set!

With our environment ready, we’re good to dive into NumPy!

What is NumPy?

NumPy is a Python library built for numerical computations. It gives us a special data structure called an ndarray (n-dimensional array), which is faster and more efficient than regular Python lists. It’s a cornerstone of data analysis in Python and pairs wonderfully with libraries like Pandas and Matplotlib.

To start using NumPy in our code, we’ll import it with:

import numpy as np  # 'np' is the common shortcut

Why Use NumPy?

Before we go further, let’s understand why NumPy is so valuable:

Speed: It’s incredibly fast for calculations, making our work efficient.
Ease: We won’t need complex loops—NumPy handles the heavy lifting for us.
Power: It offers a wealth of built-in functions to simplify data analysis.

With these benefits in mind, let’s see what NumPy can do!
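As a small illustration of the "no loops" point, here is the same calculation written once with plain Python and once with NumPy. The prices and the 10% factor are made-up values:

```python
import numpy as np

prices = [10.0, 20.0, 30.0, 40.0]

# Plain Python: we loop over the list ourselves
with_tax_list = [p * 1.1 for p in prices]

# NumPy: one vectorized expression, no loop in our code
with_tax_array = np.array(prices) * 1.1

print(with_tax_list)
print(with_tax_array)
```

On four numbers the speed difference is invisible; it starts to matter on arrays with many thousands of elements.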

1. Creating NumPy Arrays

Arrays are the foundation of NumPy, and we’ll explore several ways to make them.

From a List

We can turn a regular Python list into a NumPy array to start working with it.

# Turning a list into a 1D array
array = np.array([1, 2, 3, 4])
print(array)

Breakdown:

np.array() transforms our list into a NumPy array.
Output: [1 2 3 4] — a 1D array, like a single row of numbers.

2D Array (Matrix)

We can also build a 2D array, which looks like a grid or matrix, using a list of lists.

# Building a 2D array with rows and columns
array_2d = np.array([[1, 2], [3, 4]])
print(array_2d)

Breakdown:

Each inner list becomes a row in our 2D array.
Output:

  [[1 2]
   [3 4]]

This gives us a 2x2 matrix.

Special Arrays

NumPy lets us quickly generate arrays with specific patterns, like all zeros, ones, or a sequence.

# Generating an array of zeros
zeros = np.zeros((2, 3))  # 2 rows, 3 columns
print(zeros)

# Generating an array of ones
ones = np.ones((3, 2))   # 3 rows, 2 columns
print(ones)

# Generating a range of numbers
range_array = np.arange(0, 10, 2)  # Start at 0, stop before 10, step by 2
print(range_array)

Breakdown:

np.zeros((2, 3)): Gives us a 2x3 array filled with 0.0.
- Output: [[0. 0. 0.] [0. 0. 0.]]
np.ones((3, 2)): Creates a 3x2 array of 1.0.
- Output: [[1. 1.] [1. 1.] [1. 1.]]
np.arange(0, 10, 2): Produces [0 2 4 6 8], similar to Python’s range() but as an array.

Random Arrays

For testing or simulations, we can generate arrays with random values.

# Generating random floats between 0 and 1
random_array = np.random.rand(2, 2)  # 2x2 array
print(random_array)

Breakdown:

np.random.rand(2, 2): Creates a 2x2 array of random numbers between 0 and 1.
Output: Something like [[0.45 0.12] [0.78 0.33]] (values will differ each time).

2. Array Properties

Understanding our array’s structure is key for analysis, so let’s look at some useful properties.

# Setting up a 2D array
array = np.array([[1, 2, 3], [4, 5, 6]])

# Checking the shape: rows and columns
print("Shape:", array.shape)  # (2, 3)

# Checking the total number of elements
print("Size:", array.size)   # 6

# Checking the data type
print("Type:", array.dtype)  # int64 (or similar)

Breakdown:

shape: (2, 3) tells us we have 2 rows and 3 columns.
size: 6 is the total number of elements (2 * 3).
dtype: int64 indicates our elements are integers.

3. Basic Operations

NumPy simplifies math with vectorized operations, meaning we can skip loops entirely!

Element-wise Operations

We can apply operations to every element in an array with ease.

# Adding 2 to every element
a = np.array([1, 2, 3])
print(a + 2)  # [3 4 5]

# Multiplying every element by 3
print(a * 3)  # [3 6 9]

Breakdown:

a + 2: Adds 2 to each element: [1+2, 2+2, 3+2].
a * 3: Multiplies each element: [1*3, 2*3, 3*3].

Array-to-Array Operations

We can also combine two arrays element by element.

# Adding two arrays together
b = np.array([4, 5, 6])
print(a + b)  # [5 7 9]

# Multiplying two arrays
print(a * b)  # [ 4 10 18]

Breakdown:

a + b: Performs element-wise addition: [1+4, 2+5, 3+6].
a * b: Performs element-wise multiplication: [1*4, 2*5, 3*6].
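The two arrays do not have to have the same shape. As a short sketch of NumPy's broadcasting (the variable names here are just for illustration), a 1D array can be added to every row of a 2D array:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])              # shape (3,)

# 'row' is applied to each row of 'grid'
result = grid + row
print(result)
```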

Matrix Operations

For 2D arrays, we can perform matrix-specific operations like transposition or multiplication.

# Setting up a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]])

# Transposing (swapping rows and columns)
print(matrix.T)

# Performing matrix multiplication
print(np.dot(matrix, matrix))

Breakdown:

matrix.T: Flips [[1 2] [3 4]] to [[1 3] [2 4]].
np.dot(): Multiplies the matrix by itself, yielding:
- Output: [[ 7 10] [15 22]].
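For 2D arrays, the @ operator gives the same result as np.dot(), and it is easy to confuse with *, which multiplies element by element. A quick sketch of the difference:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

matrix_product = matrix @ matrix   # true matrix multiplication
elementwise = matrix * matrix      # each element squared

print(matrix_product)
print(elementwise)
```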

4. Key Functions for Data Analysis

Now, let’s explore NumPy’s powerful functions that make data analysis a breeze.

Indexing and Slicing

We can access specific parts of our arrays using indexing and slicing.

# Working with a 1D array
array = np.array([10, 20, 30, 40])
print(array[1])      # 20 (2nd element)
print(array[1:3])    # [20 30] (elements 2 to 3)

# Working with a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(array_2d[0, 1])  # 2 (row index 0, column index 1)
print(array_2d[:, 1])  # [2 5] (all rows, column index 1)

Breakdown:

array[1]: Retrieves the element at index 1.
array[1:3]: Slices from index 1 to 2.
array_2d[0, 1]: Fetches row 0, column 1.
array_2d[:, 1]: The colon selects all rows, and 1 picks column 1.

Statistical Functions

NumPy offers handy tools to summarize our data statistically.

# Analyzing a simple dataset
data = np.array([1, 2, 3, 4, 5])
print(np.mean(data))    # 3.0 (average)
print(np.median(data))  # 3.0 (middle value)
print(np.std(data))     # 1.414... (spread)
print(np.min(data))     # 1 (smallest)
print(np.max(data))     # 5 (largest)

Breakdown:

mean: Calculates the average by summing all values and dividing by the count.
median: Finds the middle value when sorted.
std: Measures how spread out our data is.
min/max: Identifies the smallest and largest values.
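One detail worth knowing: np.std() computes the population standard deviation by default (it divides by n). If the data is a sample and we want to divide by n - 1 instead, we can pass ddof=1. A short sketch:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

population_std = np.std(data)          # divides by n
sample_std = np.std(data, ddof=1)      # divides by n - 1

print(population_std)
print(sample_std)
```

This matters when comparing with pandas, whose std() uses ddof=1 by default.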

Filtering with `np.where()`

We can filter our data or replace values based on conditions using np.where().

# Filtering values greater than 3
data = np.array([1, 5, 3, 6, 2])
indices = np.where(data > 3)
print(indices)        # (array([1, 3]),)
print(data[indices])  # [5 6]

# Replacing values > 3 with 10
data_new = np.where(data > 3, 10, data)
print(data_new)  # [ 1 10  3 10  2]

Breakdown:

np.where(data > 3): Returns indices [1, 3] where values exceed 3.
data[indices]: Extracts those values: [5, 6].
np.where(condition, x, y): Uses x (10) where true, otherwise keeps y (original value).
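For plain filtering, a boolean mask does the same job as np.where() with an index lookup. The sketch below also combines two conditions with &, which needs parentheses around each condition:

```python
import numpy as np

data = np.array([1, 5, 3, 6, 2])

# Same result as data[np.where(data > 3)]
greater_than_3 = data[data > 3]

# Two conditions: greater than 1 AND less than 6
between = data[(data > 1) & (data < 6)]

print(greater_than_3)
print(between)
```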

Reshaping Arrays

Sometimes, we need to change an array’s shape to fit our analysis, and reshape() helps us do that.

# Reshaping a 1D array into 2D
array = np.arange(6)  # [0 1 2 3 4 5]
reshaped = array.reshape(2, 3)
print(reshaped)

Breakdown:

reshape(2, 3): Transforms 6 elements into a 2x3 array:
- Output: [[0 1 2] [3 4 5]].
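When only one dimension is known, we can pass -1 and let NumPy work out the other from the number of elements. A brief sketch:

```python
import numpy as np

array = np.arange(6)  # [0 1 2 3 4 5]

three_columns = array.reshape(-1, 3)   # NumPy infers 2 rows
two_columns = array.reshape(-1, 2)     # NumPy infers 3 rows

print(three_columns.shape)
print(two_columns.shape)
```

The total number of elements must still fit: reshaping 6 elements into rows of 4 raises a ValueError.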

Sorting

We can organize our data in order using sort().

# Sorting an unsorted array
unsorted = np.array([3, 1, 4, 2])
print(np.sort(unsorted))  # [1 2 3 4]

Breakdown:

np.sort(): Arranges the array from smallest to largest.
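Two common follow-ups are sorting in descending order and finding the positions that would sort the array. A small sketch using the same data:

```python
import numpy as np

unsorted = np.array([3, 1, 4, 2])

descending = np.sort(unsorted)[::-1]   # sort ascending, then reverse
order = np.argsort(unsorted)           # indices that would sort the array

print(descending)
print(order)
print(unsorted[order])
```

np.sort() returns a sorted copy, so unsorted itself is left unchanged.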

Unique Values

To find distinct values in our data, we use unique().

# Finding unique values
data = np.array([1, 2, 2, 3, 1])
print(np.unique(data))  # [1 2 3]

Breakdown:

np.unique(): Removes duplicates and sorts the result.
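If we also want to know how often each value occurs, np.unique() accepts return_counts=True. A short sketch:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 1])

values, counts = np.unique(data, return_counts=True)

print(values)
print(counts)
```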

Aggregation

We can summarize our data, like summing values, with aggregation functions.

# Summarizing a 2D array
array_2d = np.array([[1, 2], [3, 4]])
print(np.sum(array_2d))         # 10 (total)
print(np.sum(array_2d, axis=0)) # [4 6] (sum of columns)
print(np.sum(array_2d, axis=1)) # [3 7] (sum of rows)

Breakdown:

sum(): Adds all elements together.
axis=0: Sums down each column.
axis=1: Sums across each row.

5. Mini-Project: Analyzing Random Data

Now, let’s bring everything together with a fun mini-project!

Project Goal

We’ll generate a 3x3 array of random integers, find the maximum value in each row, replace values greater than 5 with 0, and calculate the average of the resulting array.

Project Setup

Since we’ve already set up our environment earlier, we just need to prepare a file for this project:

Create a File: In our chosen editor, we’ll make a new file called numpy_project.py.
Add the Code: We’ll copy the code below into this file and run it.

Project Code

import numpy as np  # Import NumPy

# Step 1: Generating a 3x3 array of random integers between 1 and 10
data = np.random.randint(1, 11, size=(3, 3))
print("Original array:\n", data)

# Step 2: Finding the maximum value in each row
max_per_row = np.max(data, axis=1)
print("\nMax value in each row:", max_per_row)

# Step 3: Replacing values greater than 5 with 0
filtered_data = np.where(data > 5, 0, data)
print("\nArray after replacing > 5 with 0:\n", filtered_data)

# Step 4: Calculating the average of the final array
average = np.mean(filtered_data)
print("\nAverage of final array:", average)

Example Run and Breakdown

Suppose our random array looks like this:

Original array:
 [[ 3  7  2]
  [ 9  4  6]
  [ 1  8  5]]

Step 1: np.random.randint(1, 11, size=(3, 3)) generates a 3x3 array with numbers from 1 to 10.
Step 2: np.max(data, axis=1) finds the max in each row: [7 9 8].
- axis=1 means we’re looking across rows.
Step 3: np.where(data > 5, 0, data) replaces 7, 9, 6, 8 with 0:

  [[3 0 2]
   [0 4 0]
   [1 0 5]]

Step 4: np.mean(filtered_data) computes the average: (3+0+2+0+4+0+1+0+5)/9 = 1.67.

Since the numbers are random, our output will differ, but the process remains the same!
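If we want the same "random" array on every run, for example while debugging, we can use a seeded generator. This is a sketch using NumPy's default_rng; the seed value 42 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=42)        # same seed, same numbers each run
data = rng.integers(1, 11, size=(3, 3))     # integers from 1 to 10 inclusive

filtered_data = np.where(data > 5, 0, data)

print("Original array:\n", data)
print("Filtered array:\n", filtered_data)
print("Average:", np.mean(filtered_data))
```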

Conclusion

Congratulations—we’ve just taken our first big step into the world of NumPy together! We’ve explored how to work with arrays, perform quick calculations, and analyze data with ease. The mini-project gave us a chance to apply these skills in a practical way, and now we’re equipped to dig deeper. NumPy opens the door to data analysis, and with a bit more practice, we can handle larger datasets or combine it with tools like Matplotlib for visuals or Pandas for structured data. Let’s keep experimenting and enjoy the exciting journey with Python and NumPy!

A Beginner’s Guide to Pandas for Data Analysis

Meftahul Jannat Mila — Mon, 17 Mar 2025 14:56:58 +0000

Welcome to the world of data analysis with Pandas! If you’re new to programming or data analytics, don’t worry—this guide is designed to be simple, friendly, and hands-on. Pandas is a powerful Python library that makes working with data as easy as playing with a spreadsheet—but way more fun! By the end of this article, you’ll understand the essentials of Pandas, from loading data to analyzing it, and we’ll even build a small project together, complete with a simple graph. Let’s dive in!

What is Pandas?

Pandas is a Python library that helps you work with data in tables (called DataFrames) and lists (called Series). Think of it as a supercharged version of Excel or Google Sheets, where you can clean, explore, and analyze data with code. Whether you’re handling sales figures, survey results, or anything tabular, Pandas has your back.

To use Pandas, you’ll need Python installed on your computer. If you don’t have it yet, download it from python.org, then install Pandas by running this in your terminal or command prompt:

pip install pandas

1. Getting Started with Pandas

Let’s kick things off by importing Pandas and creating some basic data structures.

# Import Pandas with the nickname 'pd' (a common shortcut)
import pandas as pd

# Create a Series (a single column of data)
series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Create a DataFrame (a table with rows and columns)
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)

# Show the results
print("Series:")
print(series)
print("\nDataFrame:")
print(df)

Breakdown:

Importing Pandas: import pandas as pd lets us use Pandas functions with the shorthand pd.
Series: A Series is like a labeled list. Here, we gave it numbers (10, 20, 30) with labels (‘a’, ‘b’, ‘c’).
DataFrame: A DataFrame is a table. We made one with two columns: ‘Name’ and ‘Age’.
Output: The Series shows values with their indices, and the DataFrame looks like a neat table.

2. Loading and Saving Data

Data analysts often work with files like CSV or Excel. Pandas makes this a breeze.

# Create a small DataFrame
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Save it to a CSV file
df.to_csv('my_data.csv', index=False)  # 'index=False' skips row numbers

# Load it back
loaded_df = pd.read_csv('my_data.csv')
print("Loaded DataFrame:")
print(loaded_df)

Breakdown:

Saving: to_csv() writes the DataFrame to a file named my_data.csv.
Loading: read_csv() reads the file back into a DataFrame.
Other Formats: You can also use pd.read_excel(), pd.read_json(), or df.to_excel() for different file types.

3. Exploring Your Data

Before analyzing, you need to know what’s in your data.

# Sample DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# See the first row
print("First row:")
print(df.head(1))

# Check structure
print("\nInfo:")
df.info()  # info() prints its own report and returns None, so no print() is needed

# Get stats for numbers
print("\nStats:")
print(df.describe())

Breakdown:

head(1): Shows the first row (you can change the number).
info(): Lists columns, their data types (e.g., ‘object’ for text, ‘int64’ for numbers), and if anything’s missing.
describe(): Gives stats like average and max, but only for numeric columns like ‘Age’.

4. Selecting Data

Pandas lets you pick exactly what you need.

# Sample DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Pick one column
print("Names only:")
print(df['Name'])

# Pick rows where Age > 25
print("\nPeople over 25:")
print(df[df['Age'] > 25])

Breakdown:

Column Selection: df['Name'] grabs the ‘Name’ column as a Series.
Row Filtering: df[df['Age'] > 25] keeps only rows where the condition is true (here, just Bob).
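Conditions can be combined, and .loc lets us pick rows and columns in one step. A small sketch with one extra made-up row (Cara) so the filter has something to exclude:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Cara'], 'Age': [25, 30, 35]})

# Rows where Age is over 25 AND under 35, keeping only the Name column
selected = df.loc[(df['Age'] > 25) & (df['Age'] < 35), ['Name']]
print(selected)
```

Each condition needs its own parentheses, and & is used instead of the word and.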

5. Cleaning Data

Real data is messy—Pandas helps fix it.

# Import NumPy for missing values (NaN)
import numpy as np

# DataFrame with missing and duplicate data
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 4, 5]})

# Fill missing values with 0
df['A'] = df['A'].fillna(0)

# Remove duplicate rows in 'B'
df = df.drop_duplicates(subset='B')
print("Cleaned DataFrame:")
print(df)

Breakdown:

Missing Values: np.nan is a missing value. fillna(0) replaces it with 0.
Duplicates: drop_duplicates() keeps only the first occurrence of repeated values in ‘B’.
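Before cleaning, it helps to see how much needs fixing. The sketch below counts missing values per column and flags repeated values in 'B', using the same small DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 4, 5]})

missing_per_column = df.isna().sum()        # count of NaN in each column
repeated_in_b = df.duplicated(subset='B')   # True for the 2nd and later occurrences

print(missing_per_column)
print(repeated_in_b)
```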

6. Manipulating Data

Change your data however you like.

# Sample DataFrame
df = pd.DataFrame({'Sales': [100, 200], 'Region': ['North', 'South']})

# Add a new column
df['Tax'] = df['Sales'] * 0.1

# Sort by Sales
df = df.sort_values('Sales')
print("Updated DataFrame:")
print(df)

Breakdown:

New Column: df['Tax'] creates a column based on a calculation.
Sorting: sort_values('Sales') arranges rows from low to high Sales.

7. Grouping and Aggregating

Summarize data by categories.

# Sample DataFrame
df = pd.DataFrame({'Region': ['North', 'South', 'North'], 'Sales': [100, 150, 200]})

# Total sales by region
sales_by_region = df.groupby('Region')['Sales'].sum()
print("Sales by Region:")
print(sales_by_region)

Breakdown:

groupby('Region'): Groups rows by the ‘Region’ column.
sum(): Adds up ‘Sales’ for each group. North: 100 + 200 = 300; South: 150.
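We are not limited to one summary per group. As a sketch, agg() can compute several at once on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Region': ['North', 'South', 'North'], 'Sales': [100, 150, 200]})

# One row per region, one column per aggregation
summary = df.groupby('Region')['Sales'].agg(['sum', 'mean', 'count'])
print(summary)
```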

8. Merging Data

Combine multiple datasets.

# Two small DataFrames
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 3], 'Score': [85, 90]})

# Merge them on 'ID'
merged = pd.merge(df1, df2, on='ID', how='inner')
print("Merged DataFrame:")
print(merged)

Breakdown:

merge(): Links df1 and df2 where ‘ID’ matches.
how='inner': Keeps only rows where ‘ID’ exists in both (here, just ID 1).
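The how argument controls which rows survive the merge. The sketch below runs the same two DataFrames through 'left' and 'outer' for comparison:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 3], 'Score': [85, 90]})

# Keep every row of df1; Bob has no score, so he gets NaN
left = pd.merge(df1, df2, on='ID', how='left')

# Keep every ID found in either DataFrame
outer = pd.merge(df1, df2, on='ID', how='outer')

print(left)
print(outer)
```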

9. Pivot Tables

Rearrange data for insights.

# Sample DataFrame
df = pd.DataFrame({'Date': ['2023-01', '2023-01'], 'Product': ['A', 'B'], 'Sales': [100, 150]})

# Pivot to spread Products across columns
pivot = df.pivot(index='Date', columns='Product', values='Sales')
print("Pivot Table:")
print(pivot)

Breakdown:

pivot(): Turns ‘Product’ values into columns, with ‘Sales’ as the data. One row for ‘2023-01’ with A and B as columns.
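pivot() raises an error when the same index and column pair appears more than once. In that case pivot_table() can combine the duplicates with an aggregation function. A sketch with one made-up duplicate entry for product A:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2023-01', '2023-01', '2023-01'],
    'Product': ['A', 'A', 'B'],       # product A appears twice for the same date
    'Sales': [100, 50, 150],
})

# Duplicates are summed instead of causing an error
table = df.pivot_table(index='Date', columns='Product', values='Sales', aggfunc='sum')
print(table)
```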

10. Time Series

Work with dates and times.

# Sample DataFrame
df = pd.DataFrame({'Date': ['2023-01-01', '2023-02-01'], 'Sales': [100, 200]})

# Convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Set Date as index
df.set_index('Date', inplace=True)

# Monthly total ('M' means month end; pandas 2.2 and later prefer the alias 'ME')
monthly = df.resample('M').sum()
print("Monthly Sales:")
print(monthly)

Breakdown:

to_datetime(): Makes ‘Date’ a proper date type.
set_index(): Uses ‘Date’ as row labels.
resample('M'): Groups by month (‘M’) and sums ‘Sales’.
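An alternative that avoids setting the index is to group by the month of each date. This sketch uses made-up sales with two entries in January so the monthly sum is visible:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-15', '2023-02-01'],
    'Sales': [100, 50, 200],
})
df['Date'] = pd.to_datetime(df['Date'])

# Convert each date to its month, then sum the sales per month
monthly = df.groupby(df['Date'].dt.to_period('M'))['Sales'].sum()
print(monthly)
```

The result is indexed by month periods (2023-01, 2023-02) rather than month-end dates.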

11. SQL in Pandas

Pandas can connect to databases or mimic SQL queries.

# Sample DataFrame
df = pd.DataFrame({'id': [1, 2], 'product': ['Laptop', 'Mouse'], 'amount': [1000, 20]})

# Filter like SQL: WHERE amount > 50
result = df[df['amount'] > 50]
print("Filtered DataFrame:")
print(result)

Breakdown:

SQL-Like: df[df['amount'] > 50] is like SELECT * FROM df WHERE amount > 50.
Database Option: Use pd.read_sql("SELECT * FROM table", connection) to pull from a real database (e.g., MySQL).
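To try pd.read_sql() without installing a database server, Python's built-in sqlite3 module can create a temporary in-memory database. In this sketch the table name sales is a made-up example:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'product': ['Laptop', 'Mouse'], 'amount': [1000, 20]})

# In-memory SQLite database: nothing is written to disk
connection = sqlite3.connect(':memory:')
df.to_sql('sales', connection, index=False)   # 'sales' is a hypothetical table name

result = pd.read_sql('SELECT * FROM sales WHERE amount > 50', connection)
connection.close()

print(result)
```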

Project: Analyzing Store Sales with a Simple Plot

Let’s put it all together with a beginner-friendly project! We’ll analyze a small store’s sales and add a basic bar chart using Matplotlib to visualize our results.

Project Setup

Install Python: Download from python.org if you haven’t.
Install Pandas: Run pip install pandas in your terminal.
Install Matplotlib: Run pip install matplotlib to add plotting (we’ll use it briefly).
Pick an Editor: Use Notepad, VS Code, or Jupyter Notebook.
Create a File: Save this code as store_sales.py or run it in a notebook.

The Code

# Import Pandas for data and Matplotlib for plotting
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Create our dataset
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Item': ['Pen', 'Notebook', 'Pen', 'Pencil'],
    'Quantity': [5, 2, 3, 4],
    'Price': [1.0, 2.5, 1.0, 0.5]
}
df = pd.DataFrame(data)  # Turn the dictionary into a table
print("Our dataset:")
print(df)

# Step 2: Explore the data
print("\nFirst 2 rows:")  # Show first 2 rows
print(df.head(2))
print("\nDataset info:")  # Check structure
df.info()  # info() prints its own report and returns None

# Step 3: Calculate total sales per row
df['Total'] = df['Quantity'] * df['Price']  # New column: Quantity × Price
print("\nWith Total column:")
print(df)
total_sales = df['Total'].sum()  # Add up all totals
print(f"\nTotal sales: ${total_sales}")

# Step 4: Find total quantity sold per item
item_sales = df.groupby('Item')['Quantity'].sum()  # Total quantity per item
print("\nQuantity sold per item:")
print(item_sales)

# Step 5: Make a simple bar chart of item sales
plt.bar(item_sales.index, item_sales)  # Items on x-axis, quantities on y-axis
plt.title('Total Quantity Sold Per Item')  # Add a title
plt.xlabel('Item')  # Label for x-axis
plt.ylabel('Quantity Sold')  # Label for y-axis
plt.show()  # Display the plot

# Step 6: Average sales per day
df['Date'] = pd.to_datetime(df['Date'])  # Make Date a proper date
daily_sales = df.groupby('Date')['Total'].sum()  # Total sales per day
print("\nTotal sales per day:")
print(daily_sales)
avg_daily_sales = daily_sales.mean()  # Average of daily totals
print(f"\nAverage daily sales: ${avg_daily_sales}")

# Step 7: Save the results
df.to_csv('store_sales.csv', index=False)  # Save to a CSV file
print("\nData saved to 'store_sales.csv'!")

Breakdown:

Step 1: We created a DataFrame with sales data (dates, items, quantities, prices).
Step 2: Explored it with head() and info() to understand our dataset.
Step 3: Added a ‘Total’ column (e.g., 5 pens × $1 = $5) and summed it for overall sales ($15).
Step 4: Grouped by ‘Item’ to see total quantities sold (e.g., Pens: 8 units).
Step 5: Used Matplotlib’s plt.bar() to make a bar chart:
- item_sales.index (item names) goes on the x-axis.
- item_sales (quantities) goes on the y-axis.
- Added labels and a title, then plt.show() displays it. You’ll see a window pop up with bars for each item!
Step 6: Calculated daily sales (Jan 1: $10, Jan 2: $5) and their average ($7.5).
Step 7: Saved our work to store_sales.csv.

Run this code, and you’ll see the outputs in your console plus a bar chart window showing quantities sold per item (Pens tallest at 8, Notebook shortest at 2). You’ll also get a CSV file in your folder!

Conclusion

Congratulations! You’ve just taken your first steps into the exciting world of data analysis with Pandas. From creating and manipulating DataFrames to exploring, cleaning, and visualizing data, you now have the foundational skills to tackle real-world datasets. The hands-on project showed how Pandas transforms raw numbers into meaningful insights—like total sales or item quantities—complete with a simple chart to bring it all to life. As you continue your journey, experiment with larger datasets, explore advanced features like joins or time series analysis, and let Pandas empower you to uncover stories hidden in the data. Happy analyzing!
