Forem: Gatusso

How I Built an End-to-End HR Attrition Dashboard Using MySQL & Power BI

Gatusso — Tue, 26 May 2026 17:37:53 +0000

Losing great employees is incredibly expensive for businesses. To show potential employers how I tackle real-world business problems using data engineering and visualization, I built an end-to-end HR Attrition Analysis project using the classic IBM HR Analytics dataset (1,470 employees, 35 features).

Here is exactly how I took this raw data from local SQL ingestion to an executive-ready Power BI dashboard.

🏗️ Step 1: Database Ingestion & Quality Checks (MySQL)

Enterprise data typically lives in relational databases rather than flat CSV files. I started by creating a local schema in MySQL Workbench and importing the raw dataset.

Before running metrics, I performed a "sanity check" to ensure data integrity. I verified that there were zero duplicate records using the unique EmployeeNumber key and checked for missing values:

SQL
-- Checking for duplicates on the primary key
SELECT EmployeeNumber, COUNT(*) 
FROM hr_employee_attrition
GROUP BY EmployeeNumber
HAVING COUNT(*) > 1;

Result: 0 duplicates. The structural data health was clean.
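
The same sanity check can be cross-checked outside the database. The snippet below is a minimal pandas sketch, not the query used in the project: the helper name `integrity_report` and the toy rows are illustrative assumptions, and the toy data deliberately contains one duplicate key and one missing value so both checks have something to report.

```python
import pandas as pd


def integrity_report(df: pd.DataFrame, key: str) -> dict:
    """Summarise duplicate keys and missing values for a table."""
    # keep=False marks every row of a duplicated key, not just the repeats
    duplicate_keys = df.loc[df[key].duplicated(keep=False), key].unique().tolist()
    missing_per_column = df.isna().sum()
    return {
        "rows": len(df),
        "duplicate_keys": duplicate_keys,
        "columns_with_missing": missing_per_column[missing_per_column > 0].to_dict(),
    }


# Toy rows standing in for hr_employee_attrition (not the real dataset).
sample = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4, 4],
    "Age": [41, None, 37, 37],
    "Department": ["Sales", "Sales", "Research & Development", "Research & Development"],
})

report = integrity_report(sample, key="EmployeeNumber")
print(report)
```

On a clean table, `duplicate_keys` would be an empty list and `columns_with_missing` an empty dictionary.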

🧹 Step 2: Data Cleaning & Transformation

A common mistake is overloading a BI tool with uncleaned data. To optimize performance, I built a database view to drop zero-variance columns (like StandardHours, which was identical for every employee) and transform text fields into binary indicators (1 and 0).

SQL
CREATE VIEW vw_hr_attrition_clean AS
SELECT 
    EmployeeNumber, Age, Department, JobRole, MonthlyIncome, YearsAtCompany,
    CASE WHEN Attrition = 'Yes' THEN 1 ELSE 0 END AS Attrition_Flag,
    CASE WHEN OverTime = 'Yes' THEN 1 ELSE 0 END AS OverTime_Flag
FROM hr_employee_attrition;

Because the flags are numeric, downstream rate calculations reduce to a simple average of Attrition_Flag instead of repeated string comparisons.
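
For readers working from a CSV export, a rough pandas analogue of this view might look like the sketch below. The helper name `clean_attrition` and the toy rows are assumptions for illustration. Unlike the view, which selects its columns explicitly, this version detects constant columns automatically.

```python
import pandas as pd


def clean_attrition(df: pd.DataFrame) -> pd.DataFrame:
    """Drop constant columns and turn the Yes/No fields into 1/0 flags."""
    # A column with a single distinct value carries no information
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    out = df.drop(columns=constant_cols)
    for col in ("Attrition", "OverTime"):
        out[f"{col}_Flag"] = (out[col] == "Yes").astype(int)
    return out.drop(columns=["Attrition", "OverTime"])


# Toy rows in the shape of the raw table (values are illustrative).
raw = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3],
    "StandardHours": [80, 80, 80],
    "Attrition": ["Yes", "No", "No"],
    "OverTime": ["Yes", "Yes", "No"],
})

cleaned = clean_attrition(raw)
print(cleaned)
```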

🔍 Step 3: Segmenting the Risk with SQL

Next, I used aggregation queries to pinpoint exactly where turnover was happening. I analyzed attrition rates across different departments and salary brackets:

SQL
-- Calculating Attrition Rate by Department
SELECT Department, COUNT(*) as Total_Employees,
       ROUND(AVG(Attrition_Flag)*100, 2) as Attrition_Rate
FROM vw_hr_attrition_clean 
GROUP BY Department 
ORDER BY Attrition_Rate DESC;
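
The department query above can be reproduced in pandas to double-check the numbers. This is a sketch on made-up rows, assuming a DataFrame with the same `Department` and `Attrition_Flag` columns as the view.

```python
import pandas as pd

# Toy stand-in for vw_hr_attrition_clean (illustrative values only).
view = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales", "HR", "HR"],
    "Attrition_Flag": [1, 0, 0, 0, 1, 0],
})

rates = (
    view.groupby("Department")["Attrition_Flag"]
    # count gives the headcount, mean of a 0/1 flag gives the attrition share
    .agg(Total_Employees="count", Attrition_Rate="mean")
    .assign(Attrition_Rate=lambda t: (t["Attrition_Rate"] * 100).round(2))
    .sort_values("Attrition_Rate", ascending=False)
)
print(rates)
```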

📊 Step 4: Connecting & Modeling in Power BI

Instead of using static exports, I connected Power BI directly to my local MySQL server using Import Mode.

To maintain clean DAX architecture, I created a dedicated measures table and wrote explicit KPIs rather than relying on default column summaries:

Total Employees = COUNT(vw_hr_attrition_clean[EmployeeNumber])

Total Attrition = SUM(vw_hr_attrition_clean[Attrition_Flag])

Attrition Rate = DIVIDE([Total Attrition], [Total Employees], 0)
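
The third argument to DIVIDE is the value returned when the denominator is zero, which matters when a slicer filters a visual down to no employees. The plain-Python sketch below mirrors that behaviour. The function `divide` and the sample flags are illustrative, not part of the Power BI model.

```python
def divide(numerator, denominator, alternate=0.0):
    """Return numerator / denominator, or `alternate` when the denominator is zero."""
    if denominator == 0:
        return alternate
    return numerator / denominator


flags = [1, 0, 0, 1, 0]             # toy Attrition_Flag values
total_employees = len(flags)        # plays the role of [Total Employees]
total_attrition = sum(flags)        # plays the role of [Total Attrition]

attrition_rate = divide(total_attrition, total_employees)
empty_rate = divide(0, 0)           # an empty filter context falls back to 0

print(attrition_rate, empty_rate)
```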

[Image: The BI Dashboard]

💡 Step 5: High-Impact Business Takeaways

Data is just noise without strategic context. Based on the dashboard interactions, I identified three massive "flight risks" and drafted immediate HR action items:

The Overtime Smoking Gun: Employees logging chronic overtime exhibit a 30.6% attrition rate (roughly 3x the rate of non-overtime peers).
Recommendation: Deploy an automated HR flag system when operational teams cross consecutive overtime thresholds.
The 1-Year Tenure Cliff: Attrition is heavily concentrated among employees in their first 12 months (>30%).
Recommendation: Revamp onboarding tracks with structured 30/60/90-day sentiment check-ins.
Sales Representative Volatility: Sales Reps had an outlier attrition rate of 39.8%, linked to low starting base pay (<$4k/month).
Recommendation: Restructure early compensation frameworks to favor a higher base salary over pure commission during year one.
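
Multipliers such as the "3x" in the overtime finding are ratios of two group-level attrition rates. A minimal sketch of that arithmetic follows, using made-up flags rather than the project data. The function names are illustrative.

```python
def attrition_rate(flags):
    """Share of 1s in a list of 0/1 attrition flags."""
    return sum(flags) / len(flags)


def rate_ratio(group, baseline):
    """How many times the baseline rate the group's rate is."""
    return attrition_rate(group) / attrition_rate(baseline)


overtime = [1, 1, 1] + [0] * 7      # toy group: 3 of 10 left
no_overtime = [1] + [0] * 9         # toy group: 1 of 10 left

ratio = rate_ratio(overtime, no_overtime)
print(f"overtime rate is {ratio:.1f}x the baseline")
```

A ratio of 3 means three times the baseline rate, which is the same as 200% higher, so it helps to state which of the two a dashboard label refers to.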

Why My Baseline Random Forest Model Beat XGBoost: A Deep Dive into the Titanic Survival Prediction Dataset

Gatusso — Sun, 24 May 2026 14:02:03 +0000

A practical look at feature engineering, model optimization, and why simpler models sometimes win on smaller datasets.

When you start out in data science, you are often led to believe that there is a strict hierarchy of algorithms. You start with Linear Regression, move up to Random Forests, and eventually reach the holy grail: Gradient Boosting models like XGBoost. The assumption is usually that more complex equals better results.

But data science in the real world rarely follows a perfect script.

I recently built a survival classification model using the classic Titanic dataset for my portfolio. I set up an end-to-end pipeline, built a solid baseline, ran a rigorous hyperparameter grid search, and threw an XGBoost classifier at the problem.

The results threw me a curveball, and they taught me a massive lesson about data scale and model variance. Here is how I built the pipeline and what the results actually mean.

Before feeding any data into a machine learning model, it’s critical to understand that algorithms are essentially giant math equations. They don't understand context, and they don't handle missing data well. My workflow followed six key stages:

Exploratory Data Analysis (EDA): Finding the historical patterns.
Missing Data Imputation: Smart strategies to fill the blanks.
Feature Engineering: Creating high-signal columns from raw text.
Categorical Encoding: Transforming strings to numbers safely.
Model Evaluation: Setting up an 80/20 train-validation split.
Hyperparameter Tuning & Comparison: Pitting baseline RF vs. GridSearch RF vs. XGBoost.

The Power of Feature Engineering

Most beginners simply drop text columns or fill missing values with a global average. To build a production-grade portfolio project, I implemented domain-specific feature engineering choices using pandas:

Smart Age Imputation via Titles: Instead of filling the 177 missing age values with the ship's average age (29), I extracted social titles (Mr., Mrs., Miss, Master) from the names. Because a "Master" is historically a young boy, filling his missing age with the median of the Master group is significantly more accurate than giving him an adult's age.
The Family Size Matrix: I combined SibSp (siblings/spouses) and Parch (parents/children) into a single FamilySize feature. Interestingly, data analysis showed that individuals traveling entirely alone or families larger than 5 had poor survival rates, whereas small families (2-4 people) fared much better.
Handling the Cabin Sparsity: Over 70% of the Cabin column was missing. Rather than dropping it, I turned it into a binary feature: Has_Cabin (1 or 0). This captured a massive socioeconomic signal, as 1st-class passengers were far more likely to have assigned, recorded cabins closer to the deck.
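
The three transformations above can be sketched in pandas roughly as follows. This is not the exact project code: the helper name `engineer_features`, the title regex, the `+ 1` that counts the passenger in FamilySize, and the fallback to the overall median are assumptions, and the rows are invented examples in the Kaggle column layout.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Title is the token between the comma and the first period in Name
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

    # Fill missing ages with the median age of passengers sharing the title,
    # falling back to the overall median if a title has no known ages
    title_median = out.groupby("Title")["Age"].transform("median")
    out["Age"] = out["Age"].fillna(title_median).fillna(out["Age"].median())

    # Assumption: family size includes the passenger, so a solo traveller is 1
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1

    # 1 when a cabin was recorded, 0 when the field is missing
    out["Has_Cabin"] = out["Cabin"].notna().astype(int)
    return out


# Invented rows (names and values are not from the real dataset).
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Master. Tim",
             "Lee, Master. Sam", "Fox, Master. Ben", "Kim, Mr. Alan", "Ray, Mr. Carl"],
    "Age": [22, 38, 1, 3, None, 40, None],
    "SibSp": [1, 1, 5, 3, 4, 0, 0],
    "Parch": [0, 0, 2, 1, 1, 0, 0],
    "Cabin": [None, "C85", None, None, None, None, None],
})

features = engineer_features(toy)
print(features[["Title", "Age", "FamilySize", "Has_Cabin"]])
```

In this toy frame the "Master" with a missing age receives the median of the other two boys rather than an adult's age, which is the point of imputing by title.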

The Showdown: Comparing 3 Architectures

After splitting the data and encoding text variables into numerical binaries using pd.get_dummies(drop_first=True), I trained and evaluated three distinct setups on my validation data.

Here is how they performed:

Strategy	Validation Accuracy	Notes / Settings
1. Baseline Random Forest	82.68%	Simple setup, `max_depth=5`
2. XGBoost Classifier	82.12%	`learning_rate=0.05`, `max_depth=4`
3. GridSearchCV Tuned RF	81.56%	Optimized via 5-Fold Cross-Validation
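
It helps to translate these percentages back into passenger counts. The sketch below assumes the 80/20 split was taken from the 891-row Kaggle training file, which leaves 179 validation rows. Under that assumption the three accuracies correspond to 148, 147, and 146 correct predictions, so each model is separated from the next by a single passenger.

```python
import math

# Assumption: 891 training rows, 20% held out, rounded up as scikit-learn does
n_val = math.ceil(0.2 * 891)

reported = {"Baseline RF": 82.68, "XGBoost": 82.12, "GridSearchCV RF": 81.56}

# Convert each reported accuracy back into a count of correct predictions
correct = {name: round(acc / 100 * n_val) for name, acc in reported.items()}

# Binomial standard error of an accuracy near 82.7% on n_val rows
p = reported["Baseline RF"] / 100
std_err = math.sqrt(p * (1 - p) / n_val)

print(n_val, correct, round(std_err, 4))
```

The standard error works out to a little under three percentage points, far larger than the gaps of about half a point between the models, so this single split cannot really rank them.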

The GridSearchCV block methodically checked variations of estimators, depths, and split criteria, ultimately landing on these optimal parameters:


python
{'criterion': 'entropy', 'max_depth': 6, 'min_samples_split': 10, 'n_estimators': 50}
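
A search of this kind can be sketched with scikit-learn as shown below. The data is synthetic and the grid is a hypothetical one that merely contains the reported values. The actual grid used in the project is not given in the post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic passengers (not the Titanic data) with a few familiar columns
raw = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Embarked": rng.choice(["S", "C", "Q"], size=n),
    "Pclass": rng.integers(1, 4, size=n),
})
# Synthetic label loosely tied to Sex and Pclass, purely for demonstration
signal = (raw["Sex"] == "female").astype(int) * 2 - raw["Pclass"] * 0.5
y = (signal + rng.normal(0, 1, size=n) > 0).astype(int)

# drop_first=True removes one dummy per category to avoid redundant columns
X = pd.get_dummies(raw, columns=["Sex", "Embarked"], drop_first=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical grid; every combination is scored with 5-fold cross-validation
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 6],
    "min_samples_split": [10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

val_accuracy = search.score(X_val, y_val)
print(search.best_params_, round(search.best_score_, 4), round(val_accuracy, 4))
```

GridSearchCV selects parameters by mean cross-validated accuracy on the training folds only, so the winning configuration is not guaranteed to score highest on one particular held-out split.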

Customer Shopping Behavior Analysis: Driving Retail Insights with Data

Gatusso — Tue, 19 May 2026 07:18:32 +0000

I recently completed an end-to-end Customer Shopping Behavior Analysis project using a dataset of 3,900 transactions. The objective was to uncover actionable insights into spending patterns, customer segments, product preferences, and subscription behavior to support strategic business decisions.

Project Approach

Data Preparation & Cleaning (Python)

Loaded and explored the dataset using pandas.

Handled 37 missing values in the Review Rating column by imputing with category-specific medians.

Standardized column names to snake_case, performed consistency checks, and dropped redundant features (e.g., promo_code_used was identical to discount_applied).
Created new features: age_group and purchase_frequency_days (converted textual frequency into numeric days).
Loaded the cleaned data into PostgreSQL for efficient querying.
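
The cleaning steps above might look roughly like this in pandas. The raw column names other than Review Rating, the frequency-to-days mapping, and the age bands are assumptions for illustration, since the post does not list them. The rows are invented.

```python
import re
import pandas as pd


def to_snake_case(name: str) -> str:
    """'Review Rating' -> 'review_rating'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


# Assumed mapping from textual frequency to an approximate number of days
FREQUENCY_TO_DAYS = {"Weekly": 7, "Fortnightly": 14, "Monthly": 30,
                     "Quarterly": 90, "Annually": 365}


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    out = df.rename(columns=to_snake_case)

    # Impute missing ratings with the median rating of the same category
    category_median = out.groupby("category")["review_rating"].transform("median")
    out["review_rating"] = out["review_rating"].fillna(category_median)

    out["purchase_frequency_days"] = out["frequency_of_purchases"].map(FREQUENCY_TO_DAYS)

    # Assumed age bands; bins are right-inclusive, so 25 falls in the first band
    out["age_group"] = pd.cut(out["age"], bins=[0, 25, 40, 60, 120],
                              labels=["25 and under", "26-40", "41-60", "61+"])
    return out


# Invented rows with assumed raw column names
toy = pd.DataFrame({
    "Category": ["Clothing", "Clothing", "Clothing", "Footwear", "Footwear"],
    "Review Rating": [4.0, None, 3.0, 2.5, None],
    "Frequency of Purchases": ["Weekly", "Monthly", "Annually", "Quarterly", "Fortnightly"],
    "Age": [22, 35, 47, 63, 25],
})

prepared = prepare(toy)
print(prepared)
```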

SQL Analysis – Answering Key Business Questions

I developed structured SQL queries in PostgreSQL to deliver clear business insights, including:

Revenue split by gender (Males: ~$157K vs. Females: ~$75K).

High-spending customers who used discounts.

Top 5 products by average review rating.

Shipping type performance (Express vs. Standard).

Subscribers vs. non-subscribers comparison.

Discount dependency by product.

Customer segmentation (New, Returning, Loyal) based on purchase history.

Revenue contribution by age group.
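
As an illustration of the segmentation step, the sketch below assigns New, Returning, and Loyal labels from a purchase count. The column name `previous_purchases` and the thresholds (1 and 10) are hypothetical, since the post does not state the cut-offs that were used, and the rows are invented.

```python
import pandas as pd


def segment(previous_purchases: int) -> str:
    """Label a customer by purchase history (hypothetical thresholds)."""
    if previous_purchases <= 1:
        return "New"
    if previous_purchases <= 10:
        return "Returning"
    return "Loyal"


# Invented customers
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "previous_purchases": [0, 1, 5, 10, 11, 40],
})

customers["segment"] = customers["previous_purchases"].apply(segment)

# Share of customers in each segment
segment_share = customers["segment"].value_counts(normalize=True)
print(segment_share)
```

The resulting shares depend entirely on where the thresholds are placed, so a figure like the size of the "Loyal" segment should be read alongside the rule that defines it.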

Visualization & Storytelling

Built an interactive Power BI dashboard featuring key KPIs, revenue breakdowns by category/age/subscription, sales trends, and filters for dynamic exploration.

Key Business Recommendations

Strengthen subscription programs with exclusive benefits to convert more loyal customers into subscribers.
Implement targeted loyalty programs to grow the “Loyal” segment (currently 80% of customers).
Review discount strategy on high-dependency items (e.g., Hats, Sneakers) to protect margins.
Focus marketing on high-revenue age groups and customers preferring Express shipping.

How Data Science and Analytics are transforming industries today

Gatusso — Wed, 23 Apr 2025 08:08:10 +0000

In today's hyperconnected, digital-first world, data science and analytics are more than technical tools for IT departments; they are essential to contemporary decision-making, innovation, and competitive advantage. With the rise of big data, artificial intelligence (AI), and machine learning (ML), organizations in almost every industry now depend on data to inform strategy, streamline operations, customize consumer experiences, and predict future trends. Data science is reshaping a wide range of industries, including marketing, manufacturing, healthcare, and finance.

Understanding Data Science and Analytics

It's crucial to understand the differences between data science and analytics before exploring their transformative potential. Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from both structured and unstructured data, while data analytics is a subset of data science that specifically refers to the process of analyzing datasets to make inferences and guide decision-making. Together, they make a powerful combination: data analytics shows "what" is happening, while data science frequently delves deeper into "why" it's happening and "what might happen next."

Healthcare: Precision and Predictive Medicine

Healthcare has seen a profound transformation thanks to data science. With the introduction of wearable technology, genomics, and electronic health records (EHRs), an unprecedented amount of data is now being collected and analyzed.
Predictive analytics in healthcare helps identify high-risk patients, potential epidemics, and the progression of diseases. For instance, by analyzing historical patient data, machine learning models can predict hospital readmissions or the onset of chronic diseases like diabetes.

Precision medicine, which customizes treatment plans based on individual variability in genes and lifestyle, is also becoming a reality thanks to advanced data analysis techniques. The Human Genome Project, and the subsequent rise of bioinformatics, owes much of its success to the computational power of data science.
Pharmaceutical companies also use data science for drug discovery and development, which can drastically cut the time and expense needed to bring new drugs to market. AI models can simulate how drugs interact with the human body, predicting side effects and success rates even before clinical trials start.

Finance: Risk Management and Algorithmic Trading

Although data has always been at the heart of the financial industry, its size and scope have significantly increased. These days, data science plays a key role in algorithmic trading, fraud detection, credit scoring, and client segmentation.

Banks can stop fraud before it causes losses by using machine learning algorithms to detect suspicious transaction patterns in real time. To evaluate credit risk more precisely than with conventional scoring techniques, credit card firms employ large databases and predictive models.

To create complex models that can evaluate market trends, forecast stock prices, and execute transactions in milliseconds, quantitative analysts, or quants, use data science in the trading industry. High-frequency trading (HFT) platforms, powered by AI, make transactions based on real-time data feeds, ensuring both speed and efficiency.

Additionally, budgeting and personal finance apps such as YNAB and Mint (which Intuit shut down in 2024) have used data analytics to provide users with personalized recommendations, spending insights, and saving techniques.

Retail: Personalization and Inventory Optimization

The clever use of customer data has permanently altered the retail environment. Retailers are better able to understand and predict customer behavior than ever before, whether through e-commerce platforms or physical storefronts with integrated digital systems.

Amazon was one of the first companies to use data science to power recommendation engines. By examining previous purchases, search histories, and browsing habits, it provides highly tailored recommendations, which raises conversion rates and improves customer satisfaction.

Analytics are also used by retailers to improve inventory control. Businesses can lessen overstock and stockouts by using predictive models that forecast demand based on geographical trends, promotions, and seasonality. By guaranteeing product availability, this raises consumer satisfaction in addition to profits.

Sentiment analysis, which examines social media posts and customer reviews, lets brands adjust their marketing strategy and product offerings in real time and stay responsive to public opinion.

Manufacturing: Predictive Maintenance and Smart Factories

Industry 4.0, the fourth industrial revolution, is driven by smart manufacturing and the Industrial Internet of Things (IIoT), both of which rely heavily on data science and analytics.

Predictive analytics can be used to predict equipment failures based on the constant data generated by sensors integrated in manufacturing equipment. Predictive maintenance extends the life of machinery, improves operating efficiency, and decreases unscheduled downtime.

Data science is also essential to quality control, as computer vision and machine learning algorithms check products for flaws faster and more accurately than a human can.

In addition, entire smart factories are being built in which every component, from supply chains to assembly lines, is digitally connected and optimized in real time. Digital twins and simulations let manufacturers streamline production and test scenarios before making any physical changes.

Marketing: Targeting and Campaign Optimization

Without data, marketing in the digital age is like driving blind. With the help of data science, marketers can gain a detailed understanding of consumer behavior, enabling hyper-targeted campaigns that send the right message to the right person at the right moment.

Marketers employ clustering algorithms to categorize consumers based on demographics, internet behavior, purchase history, and psychographics. These insights lead to more relevant adverts and content, enhancing engagement rates.

Marketers can optimize anything from landing page layouts to email subject lines by comparing performance metrics across variations using A/B testing, a basic analytical technique.
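
To make the A/B testing idea concrete, here is a minimal sketch of a two-proportion z-test, one common way to compare conversion rates between two variants. The function name and the conversion counts are made up for illustration, and real experiments also need a sample size and stopping rule planned in advance.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: variant A converts 100 of 1000 visitors, variant B 130 of 1000
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(round(z, 2), round(p_value, 4))
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone. It does not by itself say whether the lift is large enough to matter commercially.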

Additionally, social media analytics monitor brand sentiment and engagement on various platforms, providing businesses with a real-time understanding of public opinion. Data on audience demographics, reach, and engagement frequently serve as the basis for influencer marketing tactics.

Transportation and Logistics: Route Optimization and Autonomous Vehicles

Data has always been important to logistics companies like UPS and FedEx, and recent advances in analytics have sharpened their operations further. Ride-sharing services like Uber and Lyft use real-time analytics for dynamic pricing, demand forecasting, and driver allocation: these platforms predict where demand will surge and deploy drivers accordingly, minimizing wait times and maximizing profits.

In the world of autonomous vehicles, data science is essential. Self-driving cars process massive amounts of sensor data in real time, from cameras and LiDAR to radar and GPS, to make safe and effective driving decisions.

Challenges and Ethical Considerations

Data science and analytics present significant ethical and technical issues despite their advantages. Concern over data privacy is growing, especially as more businesses gather and keep private data. These issues have been addressed by regulatory frameworks such as the CCPA and GDPR, but compliance and enforcement are still difficult.

Algorithm and data bias is another significant problem. If a dataset is biased or incomplete, predictive models can reinforce existing disparities, particularly in areas like hiring, lending, and law enforcement.

Additionally, there is a bottleneck for companies looking to implement these technologies since the demand for qualified data scientists frequently outpaces the supply. For insights to be not only generated but also applied successfully, companies need to invest in data literacy for all employees.

The Future: Data Driven Everything

In the future, augmented intelligence, in which people and machines collaborate to make smarter decisions, will probably become a reality as computing power keeps increasing and data becomes more abundant. Technologies like edge computing, quantum computing, and real-time analytics will push data science to new limits.

In order to address some of the most important issues facing the globe today, from food security to climate change, emerging sectors like agritech, edtech, and climate tech are also starting to use data-driven approaches.