<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Giri Dharan</title>
    <description>The latest articles on Forem by Giri Dharan (@giridharan_devops).</description>
    <link>https://forem.com/giridharan_devops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3423616%2F9f69df69-64ff-493d-b501-fe4d70a4459f.jpg</url>
      <title>Forem: Giri Dharan</title>
      <link>https://forem.com/giridharan_devops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/giridharan_devops"/>
    <language>en</language>
    <item>
      <title>Jenkins Architecture and Working Pattern with an Example.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Fri, 13 Mar 2026 21:03:35 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/jenkins-architecture-and-working-pattern-with-an-example-58le</link>
      <guid>https://forem.com/giridharan_devops/jenkins-architecture-and-working-pattern-with-an-example-58le</guid>
      <description>&lt;p&gt;Jenkins is an open-source automation server that's a powerhouse for Continuous Integration (CI) and Continuous Delivery (CD). Think of it as the conductor of an orchestra, making sure all the different instruments (your development tasks) play together in harmony to deliver a symphony (your software)!&lt;/p&gt;

&lt;p&gt;At its core, Jenkins uses a controller-agent architecture.&lt;/p&gt;

&lt;p&gt;Here's how it generally works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jenkins Controller (formerly Master):&lt;/strong&gt; This is the brain of the operation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It manages everything: scheduling build jobs, assigning tasks to agents, and keeping track of plugins and settings.&lt;/li&gt;
&lt;li&gt;It hosts the web interface where you configure jobs and monitor progress.&lt;/li&gt;
&lt;li&gt;It stores job configurations and build history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Jenkins Agent (formerly Slave):&lt;/strong&gt; These are the workhorses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents are separate machines (physical, virtual, or even containers) that execute the tasks the controller assigns to them.&lt;/li&gt;
&lt;li&gt;They can run on different operating systems (Windows, Linux, macOS) to support cross-platform testing.&lt;/li&gt;
&lt;li&gt;Using multiple agents lets Jenkins distribute the workload and run many tasks in parallel, preventing the controller from becoming overloaded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How it works with a real-time example (using Jenkins Pipeline):&lt;/p&gt;

&lt;p&gt;The modern way to use Jenkins is with Jenkins Pipelines, which define your entire delivery process as code in a Jenkinsfile. This file lives alongside your application's code in a source control repository (like Git), meaning your automation workflow is versioned and reviewable just like any other code.&lt;/p&gt;

&lt;p&gt;Let's imagine a team developing a web application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Code Commit &amp;amp; Trigger:&lt;/strong&gt; A developer pushes new code changes to a Git repository (for example, &lt;code&gt;git push origin feature/new-login-page&lt;/code&gt;). A webhook or polling mechanism in Jenkins detects the change.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline Execution:&lt;/strong&gt; Jenkins reads the Jenkinsfile from the repository. This file defines a series of stages, such as "Build," "Test," and "Deploy." A minimal declarative Jenkinsfile for these stages might look like this (the deploy step is illustrative):
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline {
    agent any
    stages {
        stage('Build') {
            steps { sh 'mvn clean install' }
        }
        stage('Test') {
            steps { sh 'mvn test' }
        }
        stage('Deploy') {
            steps { sh 'echo Deploying to staging' }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task Delegation:&lt;/strong&gt; The Jenkins controller schedules these stages and delegates the actual execution steps to available agents. An agent checks out the code, runs &lt;code&gt;mvn clean install&lt;/code&gt; to build the application, then &lt;code&gt;mvn test&lt;/code&gt; for unit tests.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting and Artifacts:&lt;/strong&gt; The agent sends the results (success/failure, test reports) back to the controller. If the build succeeds, the agent might publish the build artifact (e.g., a .jar file or Docker image) to an artifact registry.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Further Actions:&lt;/strong&gt; The controller updates the build status in the Jenkins UI and can trigger downstream jobs or send notifications (e.g., to Slack or email). If all tests pass, the "Deploy to Staging" stage runs and then, if configured, the "Deploy to Production" stage for the main branch.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This entire process ensures that every code change is automatically built, tested, and potentially deployed, catching issues early and accelerating the software delivery process. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>cicd</category>
      <category>jenkins</category>
    </item>
    <item>
      <title>What Is a Subnet Mask? Usage and Examples, in General and in the Cloud.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Wed, 25 Feb 2026 16:12:08 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/what-is-subnet-mask-and-its-usage-with-examples-in-general-on-cloud-implications-8mo</link>
      <guid>https://forem.com/giridharan_devops/what-is-subnet-mask-and-its-usage-with-examples-in-general-on-cloud-implications-8mo</guid>
      <description>&lt;p&gt;A subnet mask is a 32-bit number that divides an IPv4 address into network and host portions. Devices use it via bitwise AND operations to identify local traffic versus packets needing a router.&lt;/p&gt;

&lt;h2&gt;
  
  
  Definition
&lt;/h2&gt;

&lt;p&gt;It consists of contiguous 1s (network bits) followed by 0s (host bits) in binary, often written in dotted decimal like 255.255.255.0 (/24 in CIDR notation). This enables subnetting to split large networks into smaller, efficient segments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Examples
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Class A: 255.0.0.0 (/8) – Supports ~16 million hosts per network.&lt;/li&gt;
&lt;li&gt;Class B: 255.255.0.0 (/16) – Supports ~65,000 hosts.&lt;/li&gt;
&lt;li&gt;Class C: 255.255.255.0 (/24) – Supports 254 hosts (limits broadcast traffic).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;For IP 192.168.1.10 with mask 255.255.255.0, the network ID is 192.168.1.0; hosts range from .1 to .254. Routers compare masks to forward traffic correctly, reducing congestion and enhancing security.&lt;/p&gt;
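
&lt;p&gt;The same check can be reproduced with Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module, which applies the mask for you:&lt;/p&gt;

```python
import ipaddress

# A host configured as 192.168.1.10 with mask 255.255.255.0 (/24).
iface = ipaddress.ip_interface("192.168.1.10/255.255.255.0")

print(iface.network)                    # the /24 network the host belongs to
print(iface.network.network_address)    # network ID: 192.168.1.0
print(iface.network.num_addresses - 2)  # usable hosts: 254 (network ID and broadcast excluded)
```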

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Notation&lt;/th&gt;
&lt;th&gt;Dotted Decimal&lt;/th&gt;
&lt;th&gt;Binary (key part)&lt;/th&gt;
&lt;th&gt;Usable Hosts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;/24&lt;/td&gt;
&lt;td&gt;255.255.255.0&lt;/td&gt;
&lt;td&gt;11111111.00000000&lt;/td&gt;
&lt;td&gt;254&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/25&lt;/td&gt;
&lt;td&gt;255.255.255.128&lt;/td&gt;
&lt;td&gt;11111111.10000000&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/26&lt;/td&gt;
&lt;td&gt;255.255.255.192&lt;/td&gt;
&lt;td&gt;11111111.11000000&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;How to Calculate a Subnet Mask from CIDR Notation&lt;/h2&gt;

&lt;p&gt;To convert CIDR notation (like /24) to a subnet mask, count the prefix number as leading 1s in a 32-bit binary string, fill the rest with 0s, then group into four 8-bit octets and convert to decimal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Take the CIDR prefix (e.g., /24 means 24 bits).&lt;/li&gt;
&lt;li&gt;Write 24 ones followed by 8 zeros: 11111111.11111111.11111111.00000000.&lt;/li&gt;
&lt;li&gt;Convert each octet to decimal: 255.255.255.0.&lt;/li&gt;
&lt;/ol&gt;
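
&lt;p&gt;These steps translate directly into a few lines of Python (a sketch using only string and integer conversion):&lt;/p&gt;

```python
def cidr_to_mask(prefix):
    # Write `prefix` ones followed by zeros to fill 32 bits,
    # group into four 8-bit octets, and convert each octet to decimal.
    bits = "1" * prefix + "0" * (32 - prefix)
    octets = [str(int(bits[i:i + 8], 2)) for i in range(0, 32, 8)]
    return ".".join(octets)

print(cidr_to_mask(24))  # 255.255.255.0
print(cidr_to_mask(16))  # 255.255.0.0
print(cidr_to_mask(27))  # 255.255.255.224
```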

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CIDR&lt;/th&gt;
&lt;th&gt;Binary (grouped by octet)&lt;/th&gt;
&lt;th&gt;Subnet Mask&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;/16&lt;/td&gt;
&lt;td&gt;11111111.11111111.00000000.00000000&lt;/td&gt;
&lt;td&gt;255.255.0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/24&lt;/td&gt;
&lt;td&gt;11111111.11111111.11111111.00000000&lt;/td&gt;
&lt;td&gt;255.255.255.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/27&lt;/td&gt;
&lt;td&gt;11111111.11111111.11111111.11100000&lt;/td&gt;
&lt;td&gt;255.255.255.224&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For octet values, remember powers of 2: 128+64+32+16+8+4+2+1 (full octet=255); partial 1s yield 240 (/28), 248 (/29), 252 (/30), etc.&lt;/p&gt;

&lt;h2&gt;Subnet Mask Configuration in the Cloud, Especially AWS&lt;/h2&gt;

&lt;p&gt;In AWS VPCs, the subnet mask is defined by the CIDR block's prefix length (e.g., /24 = 255.255.255.0), specifying the IP range available for instances in that subnet across an Availability Zone. It ensures non-overlapping addresses, reserves 5 IPs per subnet, and supports public/private isolation via route tables.&lt;/p&gt;
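
&lt;p&gt;Because AWS reserves five addresses in every subnet, usable capacity is slightly smaller than the raw CIDR math suggests. A quick sketch:&lt;/p&gt;

```python
import ipaddress

def aws_usable_ips(cidr):
    # AWS reserves 5 IPs per subnet: the network address, the VPC router,
    # the DNS resolver, one address reserved for future use, and broadcast.
    return ipaddress.ip_network(cidr).num_addresses - 5

print(aws_usable_ips("10.0.1.0/24"))  # 251
print(aws_usable_ips("10.0.2.0/26"))  # 59
```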

&lt;h2&gt;
  
  
  VPC Workflow
&lt;/h2&gt;

&lt;p&gt;VPCs use primary CIDR (e.g., 10.0.0.0/16); subnets carve out portions like 10.0.1.0/24. Configure via console (VPC &amp;gt; Subnets &amp;gt; Create), CLI (&lt;code&gt;aws ec2 create-subnet&lt;/code&gt;), or Terraform (&lt;code&gt;cidr_block&lt;/code&gt; param).&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform Snippet
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.2.0/24"  # 255.255.255.0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern fits naturally into DevOps IaC setups built on Terraform and Kubernetes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Example CIDR&lt;/th&gt;
&lt;th&gt;Mask&lt;/th&gt;
&lt;th&gt;Routing Need&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Public&lt;/td&gt;
&lt;td&gt;10.0.1.0/24&lt;/td&gt;
&lt;td&gt;255.255.255.0&lt;/td&gt;
&lt;td&gt;Internet Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private&lt;/td&gt;
&lt;td&gt;10.0.2.0/26&lt;/td&gt;
&lt;td&gt;255.255.255.192&lt;/td&gt;
&lt;td&gt;NAT Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>data</category>
      <category>cloud</category>
      <category>networking</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Linux Operating Systems and Their Impact and Role in IT.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Mon, 09 Feb 2026 04:55:43 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/linux-operating-systems-and-its-impact-role-on-it-2a8n</link>
      <guid>https://forem.com/giridharan_devops/linux-operating-systems-and-its-impact-role-on-it-2a8n</guid>
      <description>&lt;p&gt;Linux powers ~80-90% of cloud infrastructure, servers, and supercomputers, making it indispensable in the IT industry for its stability, security, and open-source nature. Its rise in desktops (4-30% among developers) and enterprise adoption in 2026 drives efficiency, privacy, and cost savings amid AI, cloud, and sustainability trends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Industry Impacts
&lt;/h2&gt;

&lt;p&gt;Linux dominates hyperscalers (AWS, Azure) and enables containerization (Docker, Kubernetes), reducing costs via no licensing fees and scalability. It supports AI/ML workloads, hybrid clouds, and edge computing, with kernel improvements boosting energy efficiency and privacy—key for ESG compliance and dev teams.&lt;/p&gt;

&lt;p&gt;In Chennai's tech hubs, it's core to AWS EKS deployments and MLOps pipelines, aligning with IaC tools like Terraform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prominent Job Roles
&lt;/h2&gt;

&lt;p&gt;Demand surges for Linux-proficient roles in DevOps/MLOps, with skills in automation, Kubernetes, and cloud yielding high employability.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Core Linux Duties&lt;/th&gt;
&lt;th&gt;Typical Tools/Tech&lt;/th&gt;
&lt;th&gt;Industry Fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DevOps Engineer&lt;/td&gt;
&lt;td&gt;Automate CI/CD, manage servers, IaC&lt;/td&gt;
&lt;td&gt;Jenkins, Terraform, Ansible, Kubernetes&lt;/td&gt;
&lt;td&gt;Cloud ops, software delivery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Site Reliability Engineer (SRE)&lt;/td&gt;
&lt;td&gt;Ensure uptime, optimize infra&lt;/td&gt;
&lt;td&gt;Prometheus, logging on Linux clusters&lt;/td&gt;
&lt;td&gt;Large-scale apps, reliability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud DevOps Engineer&lt;/td&gt;
&lt;td&gt;Provision VMs/containers&lt;/td&gt;
&lt;td&gt;AWS EC2/EKS, OpenShift&lt;/td&gt;
&lt;td&gt;Hybrid/multi-cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes Admin&lt;/td&gt;
&lt;td&gt;Orchestrate containers&lt;/td&gt;
&lt;td&gt;Kubeconfig, RBAC on Linux nodes&lt;/td&gt;
&lt;td&gt;Microservices, MLOps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>linux</category>
      <category>cloud</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
    <item>
      <title>MLOps Practices: Technologies, Tools, and Principles in Applied Real-Life Data Science Workflows.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Thu, 01 Jan 2026 14:25:56 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/mlops-practices-technologies-tools-principles-on-applied-real-life-data-science-workflows-4cf6</link>
      <guid>https://forem.com/giridharan_devops/mlops-practices-technologies-tools-principles-on-applied-real-life-data-science-workflows-4cf6</guid>
      <description>&lt;p&gt;MLOps practices turn one-off ML experiments into reliable, repeatable products by borrowing ideas from DevOps and adapting them to data, models, and continuous learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MLOps?
&lt;/h2&gt;

&lt;p&gt;MLOps is the set of &lt;strong&gt;practices&lt;/strong&gt; and tools used to manage the full ML lifecycle: data ingestion, training, deployment, monitoring, and retraining.&lt;br&gt;
It aims to shorten time-to-production while increasing reliability, similar to how DevOps improved traditional software delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core MLOps Principles
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Version everything: data, code, models, and configurations must be tracked to reproduce any model build.&lt;/li&gt;
&lt;li&gt;Automate the lifecycle: CI/CD extends to ML with automated training, testing, and deployment pipelines.&lt;/li&gt;
&lt;li&gt;Monitor in production: models are continuously watched for drift, performance degradation, and outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practice 1: Reproducible ML Environments
&lt;/h2&gt;

&lt;p&gt;Reproducibility starts with standard environments using containers (Docker) and IaC (Terraform/Kubernetes) so the same pipeline runs identically in dev, staging, and prod.&lt;br&gt;
Tools like MLflow or similar trackers store parameters, code versions, and artifacts so a specific run can be rebuilt later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time example: EKS + Kubeflow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS provides a sample where Kubeflow pipelines run on Amazon EKS, with each pipeline step packaged as a Helm chart and executed as part of a single Helm release.&lt;/li&gt;
&lt;li&gt;This design makes the ML pipeline reproducible and atomic: each step (data prep, training, evaluation) is declared as YAML, versioned in Git, and redeployable across environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practice 2: CI/CD for Models
&lt;/h2&gt;

&lt;p&gt;CI/CD for ML adds automated tests around data quality, training code, and model performance before deployment.&lt;br&gt;
Pipelines typically trigger on Git commits or new data arrivals, run training, evaluate against baselines, and only promote if metrics improve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time example: SageMaker + Kubernetes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;An AWS pattern defines a SageMaker training pipeline as JSON and wraps it in a Kubernetes custom resource (ACK for SageMaker), applied with &lt;code&gt;kubectl&lt;/code&gt; from an EKS cluster.&lt;/li&gt;
&lt;li&gt;DevOps engineers manage ML pipelines using the same GitOps/Kubernetes workflow they use for microservices, including &lt;code&gt;kubectl apply&lt;/code&gt;, describe, and delete for pipeline runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practice 3: Data &amp;amp; Model Versioning
&lt;/h2&gt;

&lt;p&gt;Data versioning (e.g., with snapshotting or dedicated tools) ensures each model is tied to the exact dataset and feature definitions used during training.&lt;br&gt;
Model registries store multiple versions, associated metadata, and stages (staging, production, archived) to control promotion and rollback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time example: Churn prediction for telco
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In a typical telco churn project, teams maintain a dataset snapshot per training run and log model versions along with ROC-AUC and precision metrics.&lt;/li&gt;
&lt;li&gt;When customer behavior shifts, they compare new models against older baselines using the same validation data slice, making it easy to justify upgrading to a new version.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practice 4: Testing and Validation
&lt;/h2&gt;

&lt;p&gt;MLOps extends testing from unit and integration tests to include data validation, training validation, and pre-deployment model checks.&lt;br&gt;
Common tests include schema checks, null/imbalance detection, and performance guardrails that must be met before release.&lt;/p&gt;
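
&lt;p&gt;As a minimal sketch (column names and types here are illustrative), a schema and null-value gate can be as simple as:&lt;/p&gt;

```python
def validate_batch(rows, schema):
    # Minimal pre-training data gate: every row must contain each required
    # column with a non-null value of the expected type. Returns a list of
    # violations; an empty list means the batch passes.
    errors = []
    for i, row in enumerate(rows):
        for column, expected_type in schema.items():
            value = row.get(column)
            if value is None:
                errors.append(f"row {i}: {column} is missing or null")
            elif not isinstance(value, expected_type):
                errors.append(f"row {i}: {column} is not {expected_type.__name__}")
    return errors

schema = {"age": int, "income": float}
rows = [{"age": 34, "income": 52000.0}, {"age": None, "income": 48000.0}]
print(validate_batch(rows, schema))  # ['row 1: age is missing or null']
```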

&lt;h3&gt;
  
  
  Real-time example: Credit risk scoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A financial services team enforces a rule that no new credit scoring model can be deployed if its default rate exceeds a defined threshold on a holdout dataset.&lt;/li&gt;
&lt;li&gt;The CI pipeline fails the deployment job if fairness or performance metrics fail, forcing data scientists to adjust features or retrain before trying again.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practice 5: Monitoring, Drift, and Feedback Loops
&lt;/h2&gt;

&lt;p&gt;Production monitoring covers both system metrics (latency, errors) and ML-specific metrics (prediction distributions, data drift, concept drift).&lt;br&gt;
Alerts notify teams when live data deviates from training data or when key KPIs like accuracy or revenue impact degrade.&lt;/p&gt;
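
&lt;p&gt;One very small drift signal (a sketch; the data and threshold are illustrative) is a z-score comparing the live mean of a feature against its training mean:&lt;/p&gt;

```python
import math
import statistics

def mean_shift_score(train_values, live_values):
    # Standardized difference between the live mean and the training mean.
    # A magnitude well above ~3 is a common trigger for a drift alert
    # (the exact threshold is a per-feature tuning decision).
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    standard_error = sigma / math.sqrt(len(live_values))
    return (statistics.mean(live_values) - mu) / standard_error

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]
print(round(mean_shift_score(train, [10.1, 10.3, 9.9, 10.0]), 2))   # small magnitude: no drift
print(round(mean_shift_score(train, [13.0, 13.4, 12.8, 13.1]), 2))  # large magnitude: alert
```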

&lt;h3&gt;
  
  
  Real-time example: Real-time recommendation system
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Streaming recommenders (e.g., in media or e-commerce) track click-through rate and engagement per model version to catch degradation quickly.&lt;/li&gt;
&lt;li&gt;When performance drops beyond a threshold, an automated retraining job runs on fresh interaction data, and a new candidate model is A/B tested against the current one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practice 6: Observability and Scaling on Kubernetes
&lt;/h2&gt;

&lt;p&gt;For teams on Kubernetes, observability stacks (Prometheus, Grafana, OpenTelemetry) are integrated with inference services and pipelines.&lt;br&gt;
Autoscaling based on CPU, GPU, or custom metrics keeps latency acceptable while controlling cost for training and inference workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time example: MLOps platform on Amazon EKS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;An AWS reference architecture shows MLOps platforms running on EKS with custom metrics (queue depth, request rate) feeding Horizontal Pod Autoscalers.&lt;/li&gt;
&lt;li&gt;This setup allows bursty training jobs and variable traffic inference endpoints to scale up and down without manual intervention.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practice 7: Governance, Security, and Compliance
&lt;/h2&gt;

&lt;p&gt;MLOps includes strict access control, audit logging, and approvals for datasets, experiments, and deployments, especially in regulated domains.&lt;br&gt;
Policy-as-code ensures only compliant models and data sources can be used in production pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time example: Healthcare diagnosis models
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Healthcare ML workflows enforce PHI handling rules, encrypt data at rest and in transit, and maintain audit logs of each training run and model promotion.&lt;/li&gt;
&lt;li&gt;Before deployment, a multi-step approval (data steward, ML lead, compliance officer) is required, codified directly into the release pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practice 8: Start Small and Iterate
&lt;/h2&gt;

&lt;p&gt;Successful teams adopt MLOps gradually, starting with one project and incrementally standardizing patterns, libraries, and platforms.&lt;br&gt;
They invest early in training and shared tooling so data scientists and engineers collaborate on a common platform rather than building siloed, one-off pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time example: First MLOps project
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A typical first project is a simple binary classifier (e.g., churn or lead scoring) where teams pilot experiment tracking, CI/CD, and monitoring end to end.&lt;/li&gt;
&lt;li&gt;Lessons from this project feed into an internal template or cookie-cutter repository that becomes the default for all future ML services.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Apply This as a DevOps/MLOps Engineer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Standardize infrastructure: Use Kubernetes, Terraform, and Helm for ML workloads to reuse your existing DevOps muscle.&lt;/li&gt;
&lt;li&gt;Add ML-aware stages to pipelines: Extend current CI/CD (e.g., Jenkins/GitHub Actions) with data checks, training jobs, and automatic evaluation gates.&lt;/li&gt;
&lt;li&gt;Build a minimal platform: Start with experiment tracking, a model registry, and basic monitoring, then layer advanced features like drift detection and A/B testing.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
      <category>aiops</category>
    </item>
    <item>
      <title>MLOps: Feature Engineering Techniques and Process, Where Most of the Time Is Spent Extracting Solutions.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Wed, 17 Dec 2025 07:49:05 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/mlops-feature-engineering-technique-and-process-where-most-of-the-time-have-been-spent-for-2g3h</link>
      <guid>https://forem.com/giridharan_devops/mlops-feature-engineering-technique-and-process-where-most-of-the-time-have-been-spent-for-2g3h</guid>
      <description>&lt;p&gt;Feature engineering is the process of transforming raw data into meaningful input features that help a machine learning model learn better and make more accurate predictions. It’s often said that “garbage in, garbage out” — no matter how fancy the algorithm, a model can only perform as well as the quality of its features.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll explain what feature engineering is, why it matters, and walk through a real‑world example using the well‑known Titanic survival dataset.&lt;/p&gt;




&lt;h3&gt;
  
  
  What is a “feature”?
&lt;/h3&gt;

&lt;p&gt;In machine learning, a &lt;strong&gt;feature&lt;/strong&gt; is an individual measurable property or characteristic of the data that the model uses as input. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a house price prediction model, features might be: &lt;code&gt;bedrooms&lt;/code&gt;, &lt;code&gt;area_sqft&lt;/code&gt;, &lt;code&gt;age_of_house&lt;/code&gt;, &lt;code&gt;location&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;In a customer churn model, features could be: &lt;code&gt;monthly_spend&lt;/code&gt;, &lt;code&gt;days_since_last_login&lt;/code&gt;, &lt;code&gt;number_of_support_tickets&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;target&lt;/strong&gt; (or label) is what we want to predict, like &lt;code&gt;price&lt;/code&gt; or &lt;code&gt;churned&lt;/code&gt; (yes/no).&lt;/p&gt;




&lt;h3&gt;
  
  
  What is feature engineering?
&lt;/h3&gt;

&lt;p&gt;Feature engineering is the art and science of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Selecting&lt;/strong&gt; which raw variables to use as features.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transforming&lt;/strong&gt; them (scaling, encoding, binning, etc.).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creating&lt;/strong&gt; new features from existing ones (e.g., ratios, aggregations, time‑based features).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal is to make the patterns in the data more obvious to the model, so it can learn faster and generalize better to unseen data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why is feature engineering important?
&lt;/h3&gt;

&lt;p&gt;Even with powerful algorithms like XGBoost or deep learning, good feature engineering often has a bigger impact on model performance than hyperparameter tuning. Here’s why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better patterns&lt;/strong&gt;: Raw data (like dates, text, or categories) may not be in a form that models can easily understand; engineering converts them into numerical signals.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles noise and missing data&lt;/strong&gt;: Techniques like imputation, outlier handling, and normalization make the data more robust.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduces dimensionality&lt;/strong&gt;: Removing irrelevant features or combining correlated ones can speed up training and reduce overfitting.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improves interpretability&lt;/strong&gt;: Well‑engineered features (like &lt;code&gt;age_group&lt;/code&gt; instead of raw &lt;code&gt;age&lt;/code&gt;) are easier for humans to understand.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In real‑time ML systems (like fraud detection or recommendations), feature engineering also affects latency and scalability, since features must be computed quickly on streaming data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Real‑world example: Titanic survival prediction
&lt;/h3&gt;

&lt;p&gt;Let’s walk through feature engineering on the classic Titanic dataset, where the goal is to predict whether a passenger survived the disaster.&lt;/p&gt;

&lt;h4&gt;
  
  
  Raw dataset (first few rows)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PassengerId&lt;/th&gt;
&lt;th&gt;Survived&lt;/th&gt;
&lt;th&gt;Pclass&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Sex&lt;/th&gt;
&lt;th&gt;Age&lt;/th&gt;
&lt;th&gt;SibSp&lt;/th&gt;
&lt;th&gt;Parch&lt;/th&gt;
&lt;th&gt;Ticket&lt;/th&gt;
&lt;th&gt;Fare&lt;/th&gt;
&lt;th&gt;Cabin&lt;/th&gt;
&lt;th&gt;Embarked&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Braund, Mr. Owen&lt;/td&gt;
&lt;td&gt;male&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;A/5 21171&lt;/td&gt;
&lt;td&gt;7.25&lt;/td&gt;
&lt;td&gt;NaN&lt;/td&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Cumings, Mrs. John&lt;/td&gt;
&lt;td&gt;female&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;PC 17599&lt;/td&gt;
&lt;td&gt;71.28&lt;/td&gt;
&lt;td&gt;C85&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Heikkinen, Miss. Laina&lt;/td&gt;
&lt;td&gt;female&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;STON/O2. 3101282&lt;/td&gt;
&lt;td&gt;7.925&lt;/td&gt;
&lt;td&gt;NaN&lt;/td&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Target: &lt;code&gt;Survived&lt;/code&gt; (0 = No, 1 = Yes)&lt;br&gt;&lt;br&gt;
Features: everything else.&lt;/p&gt;


&lt;h3&gt;
  
  
  Step 1: Handle missing values
&lt;/h3&gt;

&lt;p&gt;Real data is messy; many passengers have missing &lt;code&gt;Age&lt;/code&gt; or &lt;code&gt;Cabin&lt;/code&gt; values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Age&lt;/strong&gt;: Instead of dropping rows, impute missing ages with the median age of passengers in the same &lt;code&gt;Pclass&lt;/code&gt; (ticket class).
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pclass&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cabin&lt;/strong&gt;: Since many are missing, create a binary feature &lt;code&gt;HasCabin&lt;/code&gt; (1 if cabin is known, 0 otherwise).
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HasCabin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cabin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;notna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embarked&lt;/strong&gt;: Only 2 values are missing; fill with the most frequent port (‘S’ for Southampton).&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Step 2: Encode categorical variables
&lt;/h3&gt;

&lt;p&gt;Models like logistic regression or tree‑based models need numbers, not text.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sex&lt;/strong&gt;: Convert &lt;code&gt;male&lt;/code&gt;/&lt;code&gt;female&lt;/code&gt; to 0/1.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embarked&lt;/strong&gt;: Use one‑hot encoding to create three binary columns: &lt;code&gt;Embarked_S&lt;/code&gt;, &lt;code&gt;Embarked_C&lt;/code&gt;, &lt;code&gt;Embarked_Q&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Embarked&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Embarked&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Create new features (feature construction)
&lt;/h3&gt;

&lt;p&gt;This is where domain knowledge shines: we create features that capture meaningful patterns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Family size&lt;/strong&gt;: Combine &lt;code&gt;SibSp&lt;/code&gt; (siblings/spouses) and &lt;code&gt;Parch&lt;/code&gt; (parents/children) to get the total number of family members on board.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FamilySize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SibSp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Parch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# +1 for the passenger
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IsAlone&lt;/strong&gt;: Flag passengers traveling alone (FamilySize = 1).
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IsAlone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FamilySize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Title from Name&lt;/strong&gt;: Extract titles like “Mr.”, “Mrs.”, “Miss”, “Master” from the &lt;code&gt;Name&lt;/code&gt; field, then group rare titles into a single category.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; ([A-Za-z]+)\.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expand&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Lady&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Countess&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Capt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Major&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rev&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sir&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="s"&gt;Jonkheer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dona&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mlle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Miss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Miss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mme&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mrs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Age group&lt;/strong&gt;: Bin &lt;code&gt;Age&lt;/code&gt; into categories such as child, teen, young adult, adult, and senior.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AgeGroup&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fare per person&lt;/strong&gt;: Divide &lt;code&gt;Fare&lt;/code&gt; by &lt;code&gt;FamilySize&lt;/code&gt; to get the fare per person, which may be more meaningful than the total fare.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FarePerPerson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FamilySize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Scale numerical features (optional)
&lt;/h3&gt;

&lt;p&gt;Some algorithms (like SVM, or logistic regression with regularization) perform better when features are on a similar scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use standardization (mean=0, std=1) or min‑max scaling on continuous features like &lt;code&gt;Age&lt;/code&gt;, &lt;code&gt;Fare&lt;/code&gt;, &lt;code&gt;FarePerPerson&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
  &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FarePerPerson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fare&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FarePerPerson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Tree‑based models (Random Forest, XGBoost) are less sensitive to scaling, so this step is optional depending on the model choice.&lt;/p&gt;


&lt;h3&gt;
  
  
  Step 5: Drop irrelevant columns
&lt;/h3&gt;

&lt;p&gt;Remove columns that won’t be used as features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PassengerId&lt;/code&gt;, &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Ticket&lt;/code&gt;, &lt;code&gt;Cabin&lt;/code&gt; (we already extracted &lt;code&gt;Title&lt;/code&gt; and &lt;code&gt;HasCabin&lt;/code&gt; from them).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final feature set might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pclass, Sex, Age, SibSp, Parch, Fare, HasCabin,
FamilySize, IsAlone, AgeGroup, FarePerPerson,
Embarked_S, Embarked_C, Embarked_Q,
Title_Master, Title_Miss, Title_Mr, Title_Mrs, Title_Rare
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  How this helps the model
&lt;/h3&gt;

&lt;p&gt;After feature engineering, the model sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More signal&lt;/strong&gt;: &lt;code&gt;FamilySize&lt;/code&gt; and &lt;code&gt;IsAlone&lt;/code&gt; capture social context; &lt;code&gt;Title&lt;/code&gt; captures social status, which historically influenced survival.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleaner input&lt;/strong&gt;: Missing values are handled, categories are encoded, and numerical features are scaled.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better generalization&lt;/strong&gt;: The model can now learn that, for example, women, children, and higher‑class passengers had higher survival rates, using these engineered features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, a simple model (like Logistic Regression) trained on well‑engineered features often outperforms a complex model on raw data.&lt;/p&gt;
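
&lt;p&gt;As a minimal sketch of that claim (the feature values below are made up for illustration, not real Titanic data), a logistic regression can be fit directly on a handful of engineered columns:&lt;/p&gt;

```python
# Toy sketch: logistic regression on a few engineered features.
# The values are illustrative assumptions, not the actual Titanic dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: Sex, Pclass, FamilySize, IsAlone
X = np.array([
    [1, 1, 2, 0],
    [0, 3, 1, 1],
    [1, 2, 4, 0],
    [0, 3, 6, 0],
    [1, 1, 1, 1],
    [0, 2, 1, 1],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = survived

model = LogisticRegression().fit(X, y)
print(model.score(X, y))  # accuracy on this toy training sample
```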




&lt;h3&gt;
  
  
  Feature engineering in real‑time ML
&lt;/h3&gt;

&lt;p&gt;In production systems (e.g., real‑time fraud detection), feature engineering must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast&lt;/strong&gt;: Features are computed on the fly from streaming events (e.g., “number of transactions in the last 5 minutes”).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent&lt;/strong&gt;: The same transformations used in training must be applied at inference time.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintainable&lt;/strong&gt;: Features are often stored in a &lt;strong&gt;feature store&lt;/strong&gt; so they can be reused across models and pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, in a real‑time recommendation system, features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_avg_session_duration_last_7d&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;item_popularity_score_last_hour&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;time_since_last_purchase&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are computed continuously from event streams and served to the model with low latency.&lt;/p&gt;
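
&lt;p&gt;A minimal in‑memory sketch of one such streaming feature, “number of transactions in the last 5 minutes”, is shown below. Real systems would use a stream processor or feature store; the class name and design here are illustrative assumptions.&lt;/p&gt;

```python
from collections import deque

class SlidingWindowCounter:
    """Count events inside a trailing time window (illustrative sketch)."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()  # timestamps in arrival order

    def add(self, ts):
        self.events.append(ts)

    def count(self, now):
        # Evict events that fell out of the trailing window before answering.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        return len(self.events)

counter = SlidingWindowCounter()
for ts in [0, 100, 250, 400]:   # event times in seconds
    counter.add(ts)
print(counter.count(400))       # only events at t=250 and t=400 remain
```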




&lt;h3&gt;
  
  
  Best practices
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start simple&lt;/strong&gt;: Begin with basic transformations (handle missing values, encode categories) before adding complex features.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use domain knowledge&lt;/strong&gt;: Talk to business experts to understand what features might matter (e.g., “VIP customers” in churn prediction).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate and validate&lt;/strong&gt;: Use cross‑validation to measure whether a new feature actually improves performance.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid data leakage&lt;/strong&gt;: Never use future information (e.g., “average future spend”) in features; only use data available at prediction time.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document features&lt;/strong&gt;: Maintain a feature catalog so everyone knows what each feature means and how it’s computed.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Feature engineering turns messy, raw data into clean, meaningful inputs that help ML models learn effectively. In the Titanic example, simple steps like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imputing missing &lt;code&gt;Age&lt;/code&gt;,
&lt;/li&gt;
&lt;li&gt;Creating &lt;code&gt;FamilySize&lt;/code&gt; and &lt;code&gt;IsAlone&lt;/code&gt;,
&lt;/li&gt;
&lt;li&gt;Extracting &lt;code&gt;Title&lt;/code&gt; from names,
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;can significantly boost model performance.&lt;/p&gt;

&lt;p&gt;In real‑time ML, feature engineering becomes even more critical: it directly impacts prediction accuracy, latency, and system scalability. Investing time in thoughtful feature engineering is one of the most effective ways to build robust, high‑performing ML systems.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>aiops</category>
      <category>database</category>
    </item>
    <item>
      <title>MLOps: Exploratory Data Analysis [EDA] Deriving Solutions with Statistics Leads to Feature Engineering.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Sun, 14 Dec 2025 16:18:25 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/mlops-exploratory-data-analysis-eda-deriving-solutions-with-statistics-leads-to-fearure-35a5</link>
      <guid>https://forem.com/giridharan_devops/mlops-exploratory-data-analysis-eda-deriving-solutions-with-statistics-leads-to-fearure-35a5</guid>
      <description>&lt;p&gt;Exploratory Data Analysis (EDA) is the phase where a data scientist aggressively “interrogates” the dataset before trusting any model or dashboard. A good EDA feels like debugging reality: you move from raw, messy data to a clear mental model of how the system behaves in the real world. &lt;/p&gt;

&lt;h2&gt;
  
  
  What EDA Really Is
&lt;/h2&gt;

&lt;p&gt;EDA is a set of practices for understanding structure, quality, and signal in data using summary statistics and visualization. It helps uncover patterns, anomalies, and relationships between variables, and validates whether the data can actually answer the business question. &lt;/p&gt;

&lt;p&gt;From a data scientist’s point of view, EDA is where you translate stakeholder questions into hypotheses and test them quickly on the data.  Decisions made here directly drive feature engineering, model choice, and even whether the problem is solvable as stated. &lt;/p&gt;

&lt;h2&gt;
  
  
  A Real-World Dataset Example
&lt;/h2&gt;

&lt;p&gt;Consider an e‑commerce company that wants to reduce cart abandonment and improve revenue. The analytics team has a transactional dataset with columns like &lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;device_type&lt;/code&gt;, &lt;code&gt;traffic_source&lt;/code&gt;, and a label &lt;code&gt;order_status&lt;/code&gt; (completed/cancelled/abandoned). &lt;/p&gt;

&lt;p&gt;Alternatively, you can practice the same process on public datasets such as retail sales, wine quality, or Iris in Kaggle or learning portals, which provide realistic structure and common data issues.  For instance, wine quality data has physicochemical features (alcohol, acidity, chlorides) and a quality score, making it ideal for exploring correlations, outliers, and feature importance. &lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Clarify the Problem
&lt;/h2&gt;

&lt;p&gt;Before touching code, a data scientist frames EDA around decisions, not just plots.  For the e‑commerce case, stakeholders might ask: “Which traffic sources produce high-value customers?”, “What patterns precede abandonment?”, or “Which device types correlate with higher conversion?”. &lt;/p&gt;

&lt;p&gt;Each question becomes a hypothesis, such as “Mobile users from paid social have lower average order value than desktop users from organic search”.  EDA then becomes a structured attempt to confirm or reject such hypotheses using the dataset. &lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Load Data and Sanity Check
&lt;/h2&gt;

&lt;p&gt;Using Python, a typical workflow starts with Pandas, NumPy, and visualization libraries such as Matplotlib and Seaborn. The first inspection uses commands like &lt;code&gt;shape&lt;/code&gt;, &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;info&lt;/code&gt;, and &lt;code&gt;describe&lt;/code&gt; to understand size, schema, data types, and basic distributions.&lt;/p&gt;
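
&lt;p&gt;A minimal version of this first inspection, on a hypothetical slice of the transactional data, might look like:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical slice of the e-commerce transactions described above.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "price": ["19.99", "5.50", "120.00"],   # numbers stored as strings
    "device_type": ["mobile", "desktop", "mobile"],
})

print(df.shape)       # size: (rows, columns)
print(df.dtypes)      # schema: reveals price arrived as object, not float
df["price"] = pd.to_numeric(df["price"])  # fix the mixed type
print(df.describe())  # basic distributions of numeric columns
```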

&lt;p&gt;On real transactional data, this often reveals mixed types (numbers stored as strings), unexpected nulls in key fields, and skewed distributions (e.g., many small orders, few large ones).  At this point, a data scientist often notes potential data quality issues to discuss with data engineering or the product team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Data Cleaning in Practice
&lt;/h2&gt;

&lt;p&gt;Cleaning is not a separate pre‑processing step; it is tightly integrated into EDA loops. With the e‑commerce dataset, common actions include parsing timestamps to proper datetime, ensuring numeric types for price and quantity, standardizing categorical values, and removing or flagging clearly invalid rows (like negative quantities). &lt;/p&gt;

&lt;p&gt;Missing values are handled based on business meaning: missing &lt;code&gt;traffic_source&lt;/code&gt; might be grouped into “unknown”, while missing &lt;code&gt;price&lt;/code&gt; or &lt;code&gt;user_id&lt;/code&gt; may invalidate an order for downstream analysis and should be dropped or investigated.  For continuous features such as &lt;code&gt;alcohol&lt;/code&gt; in a wine dataset, data scientists may impute nulls using domain‑aware strategies like median or model‑based imputation, validating that this does not distort distributions.&lt;/p&gt;
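
&lt;p&gt;These cleaning decisions can be sketched on a hypothetical orders frame (column names and values assumed from the dataset description):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical raw orders with the usual problems.
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "price": [25.0, 18.0, np.nan, 15.0],
    "quantity": [1, 2, 1, -3],              # -3 is clearly invalid
    "traffic_source": ["organic", None, "paid", "organic"],
    "timestamp": ["2025-01-05 10:00", "2025-01-05 10:05",
                  "2025-01-05 10:07", "2025-01-05 10:09"],
})

orders["timestamp"] = pd.to_datetime(orders["timestamp"])  # proper datetime
orders["traffic_source"] = orders["traffic_source"].fillna("unknown")
orders = orders[orders["quantity"] > 0]     # drop clearly invalid rows
orders = orders.dropna(subset=["price"])    # missing price invalidates an order
```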

&lt;h2&gt;
  
  
  Step 4: Univariate Exploration
&lt;/h2&gt;

&lt;p&gt;Univariate EDA focuses on one variable at a time to understand its distribution and potential issues.  For numeric features (e.g., order value, alcohol content, petal length), data scientists typically use histograms, KDE plots, and box plots to assess skewness, heavy tails, and outliers. &lt;/p&gt;

&lt;p&gt;For categorical features such as device type, traffic source, or quality score, bar plots and frequency tables show dominant categories, rare levels, and potential encoding issues.  These views drive early decisions: for example, highly imbalanced quality labels may suggest resampling strategies later in modeling. &lt;/p&gt;
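
&lt;p&gt;Both views can be sketched in a few lines of pandas on synthetic stand‑ins for the real columns (plot calls left as comments, since the numeric summaries carry the same information):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical order values: many small orders, a few large ones.
rng = np.random.default_rng(0)
order_value = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=1000))

print(order_value.skew())            # positive skew = heavy right tail
# order_value.plot.hist(bins=50)     # histogram / box plot for the visual check

# Hypothetical categorical feature with a dominant and a rare level.
device_type = pd.Series(["mobile"] * 700 + ["desktop"] * 250 + ["tablet"] * 50)
print(device_type.value_counts(normalize=True))  # dominant vs rare categories
```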

&lt;h2&gt;
  
  
  Step 5: Bivariate and Multivariate Analysis
&lt;/h2&gt;

&lt;p&gt;Bivariate analysis explores relationships between two variables, often through scatter plots, grouped boxplots, and grouped aggregations.  In e‑commerce data, this might mean plotting average order value by device type or conversion rate by traffic source to detect actionable differences. &lt;/p&gt;

&lt;p&gt;Multivariate analysis adds structure using correlation matrices, pair plots, and grouped aggregations over multiple dimensions. In wine quality or Iris datasets, a correlation heatmap can highlight which physicochemical properties or flower dimensions move together and which are independent, shaping feature selection and model complexity.&lt;/p&gt;
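
&lt;p&gt;As a small sketch with assumed values: a grouped aggregation for the bivariate view, and a correlation matrix that could feed a heatmap (e.g., &lt;code&gt;seaborn.heatmap&lt;/code&gt;) for the multivariate view:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical events with a segment, a target metric, and a second numeric.
df = pd.DataFrame({
    "device_type": ["mobile", "mobile", "desktop", "desktop"],
    "order_value": [20.0, 30.0, 60.0, 80.0],
    "items": [1, 2, 3, 4],
})

aov = df.groupby("device_type")["order_value"].mean()
print(aov)   # grouped aggregation: one actionable number per segment

corr = df[["order_value", "items"]].corr()
print(corr)  # which numeric features move together
```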

&lt;h2&gt;
  
  
  Step 6: Outliers, Anomalies, and Data Quality
&lt;/h2&gt;

&lt;p&gt;Real‑time or real‑world datasets are rarely clean and often include anomalies such as duplicate orders, impossible timestamps, or extreme values from logging bugs. Data scientists detect these using visual methods (box plots, scatter plots), statistical rules (z‑score, IQR), and domain logic (e.g., orders over a certain amount must be manually verified). &lt;/p&gt;

&lt;p&gt;The treatment of outliers is a business decision: for fraud analysis, outliers might be the most important records, while for average customer behavior, they might be capped or excluded to prevent skewed metrics. EDA leads to an explicit policy on whether to keep, transform, or remove such records before modeling. &lt;/p&gt;
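
&lt;p&gt;The IQR rule mentioned above takes only a few lines of pandas (values here are a hypothetical example with one logging‑bug record):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical order values with one extreme record from a logging bug.
values = pd.Series([20, 25, 22, 30, 28, 24, 26, 5000])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[~values.between(lower, upper)]
print(outliers.tolist())  # the 5000 record is flagged for review
```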

&lt;h2&gt;
  
  
  Step 7: Feature Engineering Ideas from EDA
&lt;/h2&gt;

&lt;p&gt;Effective EDA naturally suggests transformations and new features.  In the e‑commerce example, a data scientist might derive features such as session length, number of items per order, time of day, days since last purchase, or rolling spend per user over the last 30 days. &lt;/p&gt;

&lt;p&gt;For wine quality, EDA might indicate that a ratio (like sulphates to acidity) or binned alcohol levels capture more interpretable patterns than raw continuous values.  These engineered features are grounded in observed relationships and domain intuition, improving both model performance and explainability. &lt;/p&gt;
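
&lt;p&gt;One of the e‑commerce ideas above, rolling spend per user over the last 30 days, can be sketched with a time‑based rolling window in pandas (toy data, assumed column names):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical purchases; derive rolling 30-day spend per user as a feature.
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "amount": [10.0, 20.0, 5.0, 50.0],
    "ts": pd.to_datetime(["2025-01-01", "2025-01-20", "2025-03-01", "2025-01-10"]),
}).sort_values("ts")

# For each user, sum purchases in the trailing 30-day window ending at each row.
purchases["spend_30d"] = (
    purchases.groupby("user_id", group_keys=False)
    .apply(lambda g: g.rolling("30D", on="ts")["amount"].sum())
)
```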

&lt;h2&gt;
  
  
  Step 8: Communicating EDA Findings
&lt;/h2&gt;

&lt;p&gt;EDA is only valuable if the insights reach stakeholders in a way that influences decisions. Data scientists often distill EDA into a short narrative: the business questions, key data issues, main patterns discovered, and recommendations for modeling or product changes.&lt;/p&gt;

&lt;p&gt;This narrative is typically supported by a small set of high‑signal visualizations and summary tables, rather than every chart produced during exploration. Well‑documented EDA also becomes a reference for future team members, improving reproducibility and saving time when the dataset is reused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typical EDA Focus for Data Scientists
&lt;/h2&gt;

&lt;p&gt;From a data scientist’s perspective, EDA priorities differ slightly from those of a pure analyst or data engineer. The focus is on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking whether label and features are consistent with the modeling problem (e.g., no label leakage, enough positive cases). &lt;/li&gt;
&lt;li&gt;Understanding variance and correlations to anticipate model bias, variance, and feature redundancy. &lt;/li&gt;
&lt;li&gt;Identifying data shifts or seasonality that may require time‑aware validation and monitoring strategies. &lt;/li&gt;
&lt;li&gt;Surfacing data quality risks early so that they can be mitigated via cleaning, robust metrics, or feature design. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whichever dataset you target first (for example Kaggle sales data, Iris, wine quality, or a custom CSV), a good next step is to structure an EDA notebook around these stages, with concrete Pandas and Seaborn code blocks for each one.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>devops</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Engineering Processes: From Raw Data to Cleaned, Processed, Analytics-Ready Data.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Sun, 14 Dec 2025 15:04:47 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/data-engineering-processes-from-raw-data-to-cleaned-processed-analytics-ready-data-94l</link>
      <guid>https://forem.com/giridharan_devops/data-engineering-processes-from-raw-data-to-cleaned-processed-analytics-ready-data-94l</guid>
      <description>&lt;p&gt;A practical way to explain the data engineering process is to walk through a realistic dataset end to end. This blog-style write‑up treats the journey from raw data to analytics‑ready tables from a data engineer’s point of view.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem context
&lt;/h2&gt;

&lt;p&gt;Imagine a product analytics team that wants to understand user behavior on an e‑commerce platform. The team tracks user sign‑ups, product views, cart additions, and purchases across web and mobile. As a data engineer, the goal is to design pipelines that reliably deliver clean, well‑modeled data to analysts and data scientists. The example dataset will be event data from application logs combined with reference data from operational databases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding sources and requirements
&lt;/h2&gt;

&lt;p&gt;The first step is clarifying business questions and mapping them to data sources. Typical questions include: “What is the conversion rate from product view to purchase by channel?” or “Which campaigns bring the highest lifetime value customers?”. To answer these, the pipeline must bring together events from tracking logs, user profiles from a customer database, and product metadata from a catalog system.&lt;/p&gt;

&lt;p&gt;From a data engineering perspective, this phase also includes non‑functional requirements. These cover data latency (near real‑time vs hourly), expected volume, quality SLAs, and regulatory constraints such as retention and PII handling. Clear requirements drive architectural decisions like batch vs streaming, storage layers, and orchestration tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Raw dataset example
&lt;/h2&gt;

&lt;p&gt;Consider three core datasets for this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;events_raw: clickstream‑style records with fields such as event_id, user_id, event_type (sign_up, product_view, add_to_cart, purchase), product_id, device_type, event_timestamp, and metadata (JSON).&lt;/li&gt;
&lt;li&gt;users_dim_source: a daily snapshot from the user management system with user_id, signup_date, country, marketing_channel, and is_deleted flags.&lt;/li&gt;
&lt;li&gt;products_dim_source: product catalog exports with product_id, category, brand, price, currency, and active flags.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These sources are messy in practice. Event data may arrive late or out of order, mobile apps may send malformed payloads, and operational teams might change schemas without notice. The data engineer’s job is to create a resilient ingestion layer that can tolerate these realities while preserving lineage and reproducibility.&lt;/p&gt;
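&lt;p&gt;To make these sources concrete, here is a tiny, hypothetical rendering of the three datasets in pandas (all IDs and values are invented, and a few of the real-world problems described above are deliberately included):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical miniature versions of the three raw sources.
events_raw = pd.DataFrame({
    "event_id": ["e1", "e2", "e3", "e3"],        # note the duplicate event
    "user_id": ["u1", "u2", None, None],         # missing user on some events
    "event_type": ["sign_up", "product_view", "purchase", "purchase"],
    "product_id": [None, "p9", "p9", "p9"],
    "device_type": ["Web", "ios", "ANDROID", "ANDROID"],  # inconsistent casing
    "event_timestamp": ["2024-05-01T10:00:00Z", "2024-05-01T10:05:00Z",
                        "2024-05-02T09:00:00Z", "2024-05-02T09:00:00Z"],
    "metadata": ['{"campaign": "spring"}', "{}",
                 '{"coupon": "X1"}', '{"coupon": "X1"}'],
})

users_dim_source = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "signup_date": ["2024-04-28", "2024-04-30"],
    "country": ["IN", "US"],
    "marketing_channel": ["paid_search", "organic"],
    "is_deleted": [False, False],
})

products_dim_source = pd.DataFrame({
    "product_id": ["p9"],
    "category": ["electronics"],
    "brand": ["Acme"],
    "price": [49.99],
    "currency": ["USD"],
    "active": [True],
})
```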

&lt;h2&gt;
  
  
  Ingestion and landing
&lt;/h2&gt;

&lt;p&gt;For ingestion, assume events are pushed into a message broker (like Kafka or Kinesis) and then written to cloud storage in partitioned files. A common pattern is to partition events_raw by event_date (derived from event_timestamp) and possibly by event_type. This layout improves downstream query performance and simplifies backfills.&lt;/p&gt;

&lt;p&gt;Relational sources like users_dim_source and products_dim_source are usually pulled via scheduled jobs, using CDC (change data capture) or timestamp‑based incremental extracts. In a modern stack, these extracts land in a “raw” or “bronze” layer where data is stored with minimal transformation, preserving the source shape for audit and reprocessing.&lt;/p&gt;
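&lt;p&gt;As a minimal, broker-free sketch of the landing layout (pandas and a local temp directory stand in here for the streaming stack and cloud storage), partitioning by event_date might look like:&lt;/p&gt;

```python
import os
import tempfile
import pandas as pd

events = pd.DataFrame({
    "event_id": ["e1", "e2", "e3"],
    "event_type": ["product_view", "purchase", "product_view"],
    "event_timestamp": pd.to_datetime(
        ["2024-05-01 10:00", "2024-05-01 11:00", "2024-05-02 09:00"], utc=True),
})

# Derive the partition key from the event timestamp.
events["event_date"] = events["event_timestamp"].dt.date.astype(str)

landing_dir = tempfile.mkdtemp()
for event_date, part in events.groupby("event_date"):
    # Hive-style layout: .../events_raw/event_date=YYYY-MM-DD/part-0.csv
    part_dir = os.path.join(landing_dir, "events_raw", f"event_date={event_date}")
    os.makedirs(part_dir, exist_ok=True)
    part.to_csv(os.path.join(part_dir, "part-0.csv"), index=False)

partitions = sorted(os.listdir(os.path.join(landing_dir, "events_raw")))
```

Because each day lives in its own directory, a backfill is just rewriting the affected `event_date=...` folders.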

&lt;h2&gt;
  
  
  Cleaning and standardization
&lt;/h2&gt;

&lt;p&gt;Once data lands, the next step is basic hygiene. The pipeline enforces schema, handles corrupt records, and standardizes core fields like timestamps, IDs, and currencies. For the e‑commerce dataset, this might mean casting event_timestamp to a unified time zone, normalizing device_type values (web, ios, android), and validating that event_type belongs to a controlled list.&lt;/p&gt;
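&lt;p&gt;A small pandas sketch of this hygiene step, on invented rows, could look like the following; the controlled event_type list and the quarantine approach are illustrative choices, not a prescribed standard:&lt;/p&gt;

```python
import pandas as pd

raw = pd.DataFrame({
    "event_id": ["e1", "e2", "e3"],
    "event_type": ["product_view", "purchase", "page_ping"],  # last one is unknown
    "device_type": ["Web", "IOS", "android "],
    "event_timestamp": ["2024-05-01T10:00:00+05:30", "2024-05-01T11:00:00Z",
                        "2024-05-01T12:00:00Z"],
})

VALID_EVENT_TYPES = {"sign_up", "product_view", "add_to_cart", "purchase"}

clean = raw.copy()
# Unify all timestamps to UTC regardless of the source offset.
clean["event_timestamp"] = pd.to_datetime(clean["event_timestamp"], utc=True)
# Normalize device_type values (web, ios, android).
clean["device_type"] = clean["device_type"].str.strip().str.lower()
# Quarantine rows whose event_type is outside the controlled list.
is_valid = clean["event_type"].isin(VALID_EVENT_TYPES)
quarantined = clean[~is_valid]
clean = clean[is_valid]
```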

&lt;p&gt;PII and compliance considerations also live here. Email addresses, phone numbers, and names may be tokenized or moved to restricted tables, while event payloads are checked to ensure no sensitive data slips into free‑form fields. From a data engineer’s view, baking compliance into the pipeline early avoids painful retrofits later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transformation and modeling
&lt;/h2&gt;

&lt;p&gt;With clean data, the focus shifts to turning raw assets into analytics‑ready models. A common approach is to move through “staging” to “core” or “silver/gold” layers. In staging, events_raw is exploded and normalized: JSON metadata fields are parsed into explicit columns, and invalid combinations (such as purchase events without product_id) are flagged.&lt;/p&gt;

&lt;p&gt;Core models then aggregate and join this staged data. For example, a user_events table might combine events with user attributes, while a product_performance_daily table summarizes metrics such as views, add_to_carts, and purchases per product per day. Slowly changing dimensions (SCDs) can be implemented for users and products to capture history, so analysts see attributes as they were at the time of each event.&lt;/p&gt;
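&lt;p&gt;The product_performance_daily aggregation can be sketched with a pivot over invented event rows (column names follow the example above):&lt;/p&gt;

```python
import pandas as pd

events = pd.DataFrame({
    "product_id": ["p1", "p1", "p1", "p2"],
    "event_type": ["product_view", "add_to_cart", "purchase", "product_view"],
    "event_timestamp": pd.to_datetime(
        ["2024-05-01 10:00", "2024-05-01 10:05",
         "2024-05-01 10:10", "2024-05-01 12:00"]),
})
events["event_date"] = events["event_timestamp"].dt.date

# One row per (date, product), one column per funnel stage.
product_performance_daily = (
    events.pivot_table(index=["event_date", "product_id"],
                       columns="event_type", values="event_timestamp",
                       aggfunc="count", fill_value=0)
          .rename(columns={"product_view": "views",
                           "add_to_cart": "add_to_carts",
                           "purchase": "purchases"})
          .reset_index()
)
```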

&lt;h2&gt;
  
  
  Example modeled tables
&lt;/h2&gt;

&lt;p&gt;Two key tables that emerge from this pipeline are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fact_user_session: each row represents a user session with fields like user_id, session_id, session_start, session_end, session_channel, session_device, total_events, and session_revenue. Sessions are derived by grouping events by user and breaking on inactivity thresholds.&lt;/li&gt;
&lt;li&gt;fact_product_funnel_daily: aggregated by date, product_id, and channel, containing counts of users and events at each funnel stage (viewed, added_to_cart, purchased) plus conversion rates between stages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tables sit alongside users_dim and products_dim, which are cleaned and conformed dimension tables suitable for BI tools. Together, they form a simple star schema, making it easier for downstream users to create dashboards and ad‑hoc queries without deciphering raw event structures.&lt;/p&gt;
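&lt;p&gt;The inactivity-based sessionization behind fact_user_session can be sketched in a few lines of pandas; the 30-minute threshold is a common but arbitrary choice:&lt;/p&gt;

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1"] * 4,
    "event_timestamp": pd.to_datetime(
        ["2024-05-01 10:00", "2024-05-01 10:10",    # session 1
         "2024-05-01 12:00", "2024-05-01 12:05"]),  # session 2 (gap over 30 min)
})

INACTIVITY = pd.Timedelta(minutes=30)

events = events.sort_values(["user_id", "event_timestamp"])
gap = events.groupby("user_id")["event_timestamp"].diff()
# A new session starts on a user's first event, or after a long gap.
events["session_id"] = (gap.isna() | (gap > INACTIVITY)).cumsum()

fact_user_session = (
    events.groupby(["user_id", "session_id"])
          .agg(session_start=("event_timestamp", "min"),
               session_end=("event_timestamp", "max"),
               total_events=("event_timestamp", "size"))
          .reset_index()
)
```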

&lt;h2&gt;
  
  
  Orchestration and reliability
&lt;/h2&gt;

&lt;p&gt;To keep these pipelines reliable, an orchestration layer coordinates the various steps: ingestion, staging, transformations, and quality checks. Dependencies are explicitly modeled so that failures in upstream jobs prevent incomplete data from flowing downstream. Data engineers also add monitoring on job runtimes, row counts, and key metrics like daily active users or total revenue to detect anomalies.&lt;/p&gt;

&lt;p&gt;Data quality tests are embedded as first‑class citizens. Examples include checking uniqueness of event_id, ensuring non‑null user_id for logged‑in events, and validating that revenue numbers stay within expected ranges. When tests fail, the system alerts engineers or rolls back deployments, preserving trust in the platform.&lt;/p&gt;
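&lt;p&gt;A minimal way to express such checks in code (the function name and rules are illustrative, not a specific testing framework):&lt;/p&gt;

```python
import pandas as pd

def run_quality_checks(events: pd.DataFrame) -> list[str]:
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    if events["event_id"].duplicated().any():
        failures.append("event_id is not unique")
    logged_in = events[events["event_type"] != "anonymous_view"]
    if logged_in["user_id"].isna().any():
        failures.append("null user_id on logged-in events")
    if events["revenue"].fillna(0).lt(0).any():
        failures.append("negative revenue values")
    return failures

# Deliberately broken sample data: duplicate event_id and a null user_id.
events = pd.DataFrame({
    "event_id": ["e1", "e1", "e2"],
    "event_type": ["purchase", "purchase", "product_view"],
    "user_id": ["u1", "u1", None],
    "revenue": [10.0, 10.0, None],
})
failures = run_quality_checks(events)
```

In an orchestrated pipeline, a non-empty failure list would block downstream jobs and alert the on-call engineer.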

&lt;h2&gt;
  
  
  Incremental processing and backfills
&lt;/h2&gt;

&lt;p&gt;The pipeline should be incremental to scale with data volume. For events, this means processing only new partitions (for example, yesterday’s data) while keeping the ability to reprocess historical windows when bugs or logic changes occur. Dimension tables can use CDC or surrogate keys to gracefully handle late‑arriving updates.&lt;/p&gt;

&lt;p&gt;Backfills are a fact of life in data engineering. A schema change, a tracking bug, or a new business rule can necessitate recomputing months of data. Good practice is to design transformations as idempotent and partition‑aware, so re‑running jobs for a given date range is straightforward and does not corrupt production tables.&lt;/p&gt;
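&lt;p&gt;One way to make a transformation idempotent and partition-aware is to always recompute and overwrite a whole date partition; a toy sketch (a dict stands in for the production table):&lt;/p&gt;

```python
import pandas as pd

# A toy "production table" stored as one DataFrame per date partition.
table: dict[str, pd.DataFrame] = {}

def process_partition(source: pd.DataFrame, run_date: str) -> None:
    """Recompute one date partition and overwrite it (idempotent by design)."""
    day = source[source["event_date"] == run_date]
    daily = day.groupby("product_id", as_index=False).agg(views=("event_id", "count"))
    table[run_date] = daily  # overwrite, never append

source = pd.DataFrame({
    "event_id": ["e1", "e2", "e3"],
    "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "product_id": ["p1", "p1", "p2"],
})

process_partition(source, "2024-05-01")
first = table["2024-05-01"].copy()
process_partition(source, "2024-05-01")  # re-running the same day changes nothing
```

Because each run fully replaces its partition, backfilling a date range is just looping `process_partition` over those dates.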

&lt;h2&gt;
  
  
  Serving and collaboration
&lt;/h2&gt;

&lt;p&gt;Finally, the engineered data is served to end consumers in the right formats and tools. BI analysts connect to curated schemas for dashboarding, data scientists access more granular tables for modeling, and downstream applications might read specific aggregates through APIs. As a data engineer, part of the job is to document datasets, publish examples, and collect feedback on what works and what is missing.&lt;/p&gt;

&lt;p&gt;The data engineering process is not a one‑off project but an ongoing collaboration. As product features change and business questions evolve, new events are added, models are refactored, and pipelines are optimized. The dataset example shows how a data engineer thinks in terms of systems, contracts, and lifecycle, always working to make data more trustworthy and more useful for the rest of the organization.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>devops</category>
      <category>datascience</category>
    </item>
    <item>
      <title>MLOps: Data Science Lifecycle with DataSets examples, Workflows and Pipelines.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Sun, 14 Dec 2025 14:13:16 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/mlops-data-science-lifecycle-with-datasets-examples-workflows-and-pipelines-164e</link>
      <guid>https://forem.com/giridharan_devops/mlops-data-science-lifecycle-with-datasets-examples-workflows-and-pipelines-164e</guid>
      <description>&lt;p&gt;A data science lifecycle describes how raw data moves from business problem to deployed model, while workflows and pipelines define how the work is organized and automated end to end.The CRISP‑DM framework is a widely used way to structure this lifecycle, and real datasets like the Titanic survival data or vehicle price data illustrate each phase concretely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data science lifecycle
&lt;/h2&gt;

&lt;p&gt;The CRISP‑DM lifecycle has six main phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. These phases are iterative rather than strictly linear, so projects often loop back from modeling or evaluation to earlier steps as new insights appear.&lt;/p&gt;

&lt;p&gt;Another common view is the OSEMN lifecycle: Obtain, Scrub, Explore, Model, and iNterpret. Both CRISP‑DM and OSEMN emphasize that most effort goes into understanding and preparing data rather than just training models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow vs pipeline
&lt;/h2&gt;

&lt;p&gt;A workflow is the logical sequence of tasks a team follows (for example: define KPI → collect data → clean → train → review → deploy). A pipeline is a more automated, usually code‑driven realization of this workflow, chaining steps such as preprocessing, feature engineering, model training, and evaluation so they can run repeatedly and reliably.&lt;/p&gt;

&lt;p&gt;In modern practice, workflows are designed with both business and technical constraints in mind, and then implemented as pipelines that manage dependencies and ensure data flows smoothly from one stage to the next. This separation allows experimentation at the workflow level while keeping execution consistent and reproducible at the pipeline level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example lifecycle: Titanic dataset
&lt;/h2&gt;

&lt;p&gt;A classic real dataset for end‑to‑end projects is the Titanic passenger survival dataset hosted on Kaggle. It contains information such as passenger class, sex, age, and fare, with a label indicating whether each passenger survived, making it suitable for a supervised classification pipeline.&lt;/p&gt;

&lt;p&gt;Using CRISP‑DM with this dataset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business understanding: Define the goal as predicting passenger survival given known attributes, analogous to predicting customer churn or loan default in real businesses. Success could be measured using metrics like accuracy or F1 score on unseen passengers.&lt;/li&gt;
&lt;li&gt;Data understanding: Load the CSV files from Kaggle, inspect columns, visualize distributions (for instance age distribution by survival), and check missing values in features like age and cabin. This step reveals data quality issues and signals which engineered features might be helpful, such as family size or ticket groupings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data preparation: Handle missing ages (for example by imputing based on class and sex), encode categorical variables like sex and embarked port, and create features such as “family size” or “title” extracted from names. The prepared dataset is then split into training and validation subsets while keeping the target label (survived) separate.&lt;/li&gt;
&lt;li&gt;Modeling: Train baseline models such as logistic regression, decision trees, and random forests using the engineered features. Hyperparameter tuning (for example grid search for tree depth or number of estimators) refines model performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluation: Compare models using cross‑validation and validation metrics, checking not only overall accuracy but also how well the model distinguishes survivors from non‑survivors. Feature importance analysis from tree‑based models highlights which attributes (for example sex, passenger class, and family size) drive predictions.&lt;/li&gt;
&lt;li&gt;Deployment: In the Kaggle competition context, deployment means generating predictions for a held‑out test set and submitting a CSV for scoring on a public leaderboard. In a real product, the same pipeline structure could be wrapped behind an API so new passenger‑like records receive live predictions.&lt;/li&gt;
&lt;/ul&gt;
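&lt;p&gt;The Titanic walkthrough above can be compressed into a short scikit‑learn sketch; the DataFrame below is a tiny synthetic stand‑in for the real Kaggle CSV (column names follow its schema), so the numbers are illustrative only:&lt;/p&gt;

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Tiny synthetic stand-in for the Kaggle Titanic CSV.
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2, 3, 1, 3] * 10,
    "Sex": ["female", "male", "male", "female",
            "female", "male", "male", "female"] * 10,
    "Age": [29, 22, None, 35, 27, 40, 54, 2] * 10,
    "Survived": [1, 0, 0, 1, 1, 0, 0, 1] * 10,
})

# Data preparation: impute missing ages, encode sex as 0/1.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = (df["Sex"] == "female").astype(int)

X = df[["Pclass", "Sex", "Age"]]
y = df["Survived"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Modeling and evaluation: a logistic regression baseline.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_val, model.predict(X_val))
```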

&lt;h2&gt;
  
  
  Example lifecycle: Vehicle price prediction
&lt;/h2&gt;

&lt;p&gt;A more business‑oriented example is predicting used car prices, using the “vehicle dataset from Cardekho” and its associated car price prediction project. This dataset contains details such as car brand, model year, fuel type, mileage, and selling price, making it a typical regression problem for pricing or recommendation systems.&lt;/p&gt;

&lt;p&gt;The lifecycle plays out as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business understanding: The objective is to estimate a fair selling price for a car, helping dealers or marketplaces optimize pricing and improve user trust. Success might be defined by low mean absolute error on historical sales and improved conversion rates when integrated into a platform.&lt;/li&gt;
&lt;li&gt;Data understanding and preparation: Analysts explore the distribution of prices across brands and model years, detect outliers, and handle missing or inconsistent entries. Data preparation includes encoding fuel type and transmission, deriving car age from registration year, and normalizing numerical features such as mileage and engine size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modeling and evaluation: Several regression algorithms (for example linear regression, random forest, or gradient boosting) can be trained to predict price from features. Models are evaluated with regression metrics such as mean squared error and R² on validation sets to choose the best trade‑off between bias and variance.&lt;/li&gt;
&lt;li&gt;Deployment and monitoring: A selected model can be deployed as a web service that powers a “suggested price” widget on a listing page. Ongoing monitoring checks whether prediction errors drift over time as market conditions or car portfolios change, prompting retraining when needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  From workflow to production pipeline
&lt;/h2&gt;

&lt;p&gt;To operationalize these projects, teams define pipelines that automate data ingestion, transformation, training, and deployment. For example, a Python pipeline using libraries such as scikit‑learn might encapsulate preprocessing steps (like imputation and encoding) and model training in a single object, ensuring any new data is processed identically to training data.&lt;/p&gt;
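&lt;p&gt;A sketch of such a pipeline object with scikit‑learn, using a handful of invented car rows with Cardekho‑like columns (this is a minimal illustration, not the project's actual code):&lt;/p&gt;

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Hypothetical used-car rows; the Cardekho dataset has similar columns.
cars = pd.DataFrame({
    "year": [2015, 2018, 2012, 2020, 2016, 2014],
    "km_driven": [60000, 25000, 110000, 8000, 45000, 90000],
    "fuel": ["Petrol", "Diesel", "Petrol", "Petrol", "Diesel", "Petrol"],
    "selling_price": [350000, 650000, 200000, 900000, 500000, 260000],
})
X, y = cars.drop(columns="selling_price"), cars["selling_price"]

# Preprocessing bundled with the model: imputation for numeric columns,
# one-hot encoding for categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["year", "km_driven"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])

# One object = identical preprocessing at training and prediction time.
pipeline = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)
pred = pipeline.predict(X.head(1))
```

Because preprocessing lives inside the pipeline, the serving endpoint only needs `pipeline.predict` on raw-shaped rows.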

&lt;p&gt;Beyond modeling code, a full data science pipeline integrates with storage and orchestration layers, sending ingested data through ETL or ELT processes into a data lake or warehouse before feeding models. Production pipelines typically include scheduled retraining jobs, automated evaluation against benchmarks, and deployment steps that update serving endpoints or batch scoring outputs with minimal manual intervention.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>datascience</category>
      <category>database</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AttributeError: 'int' object has no attribute 'title' in python3</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Wed, 03 Dec 2025 16:21:08 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/attributeerror-int-object-has-no-attribute-title-in-python3-4nji</link>
      <guid>https://forem.com/giridharan_devops/attributeerror-int-object-has-no-attribute-title-in-python3-4nji</guid>
      <description>&lt;p&gt;AttributeError: 'int' object has no attribute 'title' in python&lt;/p&gt;

&lt;p&gt;This error means you’re calling the string method .title on a value that is actually an int, not a str. In other words, somewhere a variable you expect to be text holds a number instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it happens
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Variable shadowing. A name you used for a string earlier was reassigned to an integer, so later obj.title tries to run on an int. For example: name = "alice"; name = 5; name.title() → error. This commonly happens when reusing names like file, time, or data for different types at different points of the code.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrong data shape. You indexed or pulled data from a dict/list/JSON where some items are ints and others are strings, then you applied .title blindly to all items.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Type assumptions. User input or parsed values were cast to int (or loaded as numbers) before string operations are applied.  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to fix quickly
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Find the line with .title and print or log the type and value just before it:

&lt;ul&gt;
&lt;li&gt;print(type(x), x) or use an assert: assert isinstance(x, str)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Convert to string if that’s acceptable for your logic:

&lt;ul&gt;
&lt;li&gt;str(x).title()&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Guard by type:

&lt;ul&gt;
&lt;li&gt;x = x.title() if isinstance(x, str) else x&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Clean data at the source:

&lt;ul&gt;
&lt;li&gt;When mapping over a collection: [s.title() for s in items if isinstance(s, str)]&lt;/li&gt;
&lt;li&gt;When reading JSON: coerce only desired fields to str before using .title&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
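&lt;p&gt;The quick fixes above can be seen together in one runnable snippet:&lt;/p&gt;

```python
# Reproduce the error, then apply the three quick fixes.
x = 42  # an int where a str was expected

try:
    x.title()
except AttributeError as exc:
    error_message = str(exc)  # "'int' object has no attribute 'title'"

as_string = str(x).title()                        # fix 1: convert first
guarded = x.title() if isinstance(x, str) else x  # fix 2: guard by type

items = ["alice", 7, "bob"]
titled = [s.title() for s in items if isinstance(s, str)]  # fix 3: filter collections
```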

&lt;h2&gt;
  
  
  Common patterns and remedies
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Reused names:

&lt;ul&gt;
&lt;li&gt;Problem: file = open(...); file = 123; file.title() → AttributeError&lt;/li&gt;
&lt;li&gt;Fix: Use distinct names (f for file handle, score for numbers).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Library/module shadowing:

&lt;ul&gt;
&lt;li&gt;Problem: import time; time = 3; time.time() breaks because time is now an int.&lt;/li&gt;
&lt;li&gt;Fix: Avoid reassigning imported names; or import as alias (import time as time_mod).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Mixed-type collections:

&lt;ul&gt;
&lt;li&gt;Problem: for v in values: v.title()&lt;/li&gt;
&lt;li&gt;Fix: inside the loop, call v.title() only when isinstance(v, str), and handle the int values separately.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  If you share a snippet
&lt;/h2&gt;

&lt;p&gt;Paste the few lines around where .title is called (and any earlier assignments to that name), and indicate the expected types. A minimal reproducible example makes it straightforward to point to the exact reassignment or coercion that produced the int. The following snippet is a concrete example:&lt;/p&gt;

&lt;p&gt;import os&lt;br&gt;
terminal_width = get_terminal_size().columns&lt;/p&gt;

&lt;p&gt;the_given_string = input("Enter your string: ")&lt;br&gt;
print(the_given_string.center(terminal_width.title()))&lt;br&gt;
print(the_given_string.ljust(terminal_width.title()))&lt;br&gt;
print(the_given_string.rjust(terminal_width.title()))&lt;/p&gt;

&lt;p&gt;The error happens because terminal_width is an int, but .title() is a string method, and center/ljust/rjust expect an int width, not a string. You should use the integer directly. (Note that the snippet also needs from shutil import get_terminal_size; importing os alone does not provide that function.)&lt;/p&gt;

&lt;p&gt;Here is the corrected code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_terminal_size&lt;/span&gt;

&lt;span class="n"&gt;terminal_width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_terminal_size&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;  &lt;span class="c1"&gt;# this is an int
&lt;/span&gt;
&lt;span class="n"&gt;the_given_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter your string: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;the_given_string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;center&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;terminal_width&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;the_given_string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ljust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;terminal_width&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;the_given_string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rjust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;terminal_width&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: if you want the text itself in title case, call .title() on the string variable, e.g. the_given_string.title().center(terminal_width).&lt;/p&gt;

&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove .title() on terminal_width; just pass terminal_width.
&lt;/li&gt;
&lt;li&gt;If you really need title case, call .title() on a string, e.g. some_string.title(), not on an int.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A follow‑up question: in Python 3, what can be used instead of the input() function?&lt;/p&gt;

&lt;p&gt;In Python 3, &lt;code&gt;input()&lt;/code&gt; is the standard way to read from the keyboard; there is no direct replacement that you “should” use instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input in Python 3
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;input()&lt;/code&gt; always returns a string, so you convert it when you need numbers:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;n = int(input("Enter an integer: "))&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x = float(input("Enter a float: "))&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;The old Python 2 functions:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;raw_input()&lt;/code&gt; → replaced by &lt;code&gt;input()&lt;/code&gt; in Python 3.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;input()&lt;/code&gt; in Python 2 (which evaluated expressions) was removed for safety; there is no direct equivalent in Python 3, and evaluating user input is strongly discouraged.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Alternatives depending on context
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;For command‑line arguments: use &lt;code&gt;sys.argv&lt;/code&gt; or &lt;code&gt;argparse&lt;/code&gt; instead of &lt;code&gt;input()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For reading from files or other streams: use file methods like &lt;code&gt;readline()&lt;/code&gt; or iterate over the file object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you explain your exact use case (e.g., need numbers, expressions, command‑line options, etc.), more tailored alternatives can be suggested.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>devbugsmash</category>
      <category>help</category>
    </item>
    <item>
      <title>Start Python: To Advanced Topics from basic Strong Foundational concepts Roadmap.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Mon, 24 Nov 2025 14:50:52 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/start-python-to-advanced-topics-from-basic-strong-foundational-concepts-roadmap-1n88</link>
      <guid>https://forem.com/giridharan_devops/start-python-to-advanced-topics-from-basic-strong-foundational-concepts-roadmap-1n88</guid>
      <description>&lt;p&gt;To Build a solid Python career for any domain, follow a staged roadmap from foundational concepts to advanced topics, focusing on real-world applications and industry best practices. Here's a clear overview and the high-level advanced subjects you should master.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Roadmap for Any Domain Expertise
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Basics: Syntax, variables, data types, operators, I/O, control flow (if/else, loops), functions, error handling, and basic OOP principles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Core Python: Data structures (lists, dictionaries, sets), file handling (including CSV and JSON), modules, packages, and virtual environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Object-Oriented Programming: Classes, objects, inheritance, encapsulation, polymorphism, and special (dunder) methods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Libraries and Frameworks: Core and popular modules (requests, NumPy, pandas, Matplotlib, Flask/Django for web development).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Working with APIs: Consuming REST APIs, serializing data, and practical HTTP requests.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Python Topics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Iterators, generators, and generator expressions&lt;/li&gt;
&lt;li&gt;Decorators and closures&lt;/li&gt;
&lt;li&gt;Context managers (with custom implementations)&lt;/li&gt;
&lt;li&gt;Multithreading, multiprocessing, and asynchronous programming (async/await)&lt;/li&gt;
&lt;li&gt;Memory management and garbage collection&lt;/li&gt;
&lt;li&gt;Metaclasses and advanced OOP patterns&lt;/li&gt;
&lt;li&gt;Design patterns in Python (Factory, Singleton, Observer, etc.)&lt;/li&gt;
&lt;li&gt;Profiling and optimizing code&lt;/li&gt;
&lt;li&gt;Understanding the Global Interpreter Lock (GIL)&lt;/li&gt;
&lt;li&gt;Advanced data structures (linked lists, trees, graphs) and algorithms&lt;/li&gt;
&lt;li&gt;Concurrency and distributed systems&lt;/li&gt;
&lt;li&gt;Type hinting and static type checking (with mypy)&lt;/li&gt;
&lt;li&gt;Building and distributing Python packages&lt;/li&gt;
&lt;/ul&gt;
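&lt;p&gt;As a small taste of two of these topics, here is a sketch combining a generator with a decorator built on a closure (names and behavior are illustrative):&lt;/p&gt;

```python
import functools
import time

def timed(func):
    """Decorator: the wrapper closure records how long each call takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    return wrapper

def countdown(n):
    """Generator: yields values lazily instead of building a full list."""
    while n > 0:
        yield n
        n -= 1

@timed
def total(n):
    return sum(countdown(n))

result = total(5)  # consumes the generator: 5 + 4 + 3 + 2 + 1
```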

&lt;h2&gt;
  
  
  Skills Expected at Senior Python Level
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Cloud integration (e.g., AWS, Azure)&lt;/li&gt;
&lt;li&gt;CI/CD pipeline design and automation&lt;/li&gt;
&lt;li&gt;Building scalable backends with frameworks like Django or FastAPI&lt;/li&gt;
&lt;li&gt;Database management (SQL and NoSQL)&lt;/li&gt;
&lt;li&gt;Security best practices in backend systems&lt;/li&gt;
&lt;li&gt;Version control workflows (Git)&lt;/li&gt;
&lt;li&gt;Code review and mentorship responsibilities&lt;/li&gt;
&lt;li&gt;Contribution to open source or creating your own packages&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Learning Approach
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Set small, achievable goals to stay motivated.&lt;/li&gt;
&lt;li&gt;Tackle hands-on projects early—web apps, automation scripts, data analysis, machine learning prototypes.&lt;/li&gt;
&lt;li&gt;Regularly review code, read advanced documentation, and solve coding challenges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These guidelines provide a proven path for continuous advancement, whether you're aiming for backend engineering, automation, data science, or DevOps with Python.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>productivity</category>
      <category>automation</category>
    </item>
    <item>
      <title>YUM TO DNF: Amazon Linux_2023 Package Manager.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Sat, 22 Nov 2025 14:42:28 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/yum-to-dnf-amazon-linux2023-package-manager-7m0</link>
      <guid>https://forem.com/giridharan_devops/yum-to-dnf-amazon-linux2023-package-manager-7m0</guid>
      <description>&lt;p&gt;Amazon Linux changed its package manager from yum to DNF starting with Amazon Linux 2023 (AL2023). The main motivation for this change was to adopt the more modern, efficient, and secure package manager that DNF provides, which is now the standard across most Red Hat-based distributions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasons for the Change
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;DNF (Dandified YUM) is the successor to yum and offers major improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster and more reliable dependency resolution, thanks to a new dependency solver and persistent metadata cache.&lt;/li&gt;
&lt;li&gt;Improved performance and lower system resource usage compared to yum.&lt;/li&gt;
&lt;li&gt;Enhanced support for parallel operations, extension/plugin development, and delta RPMs for better update efficiency.&lt;/li&gt;
&lt;li&gt;A stricter and more predictable API, facilitating the development of automation and third-party integrations.&lt;/li&gt;
&lt;li&gt;More robust security and better memory management.&lt;/li&gt;
&lt;li&gt;Aligning with industry standards, as DNF had already replaced yum as the default in Fedora (since version 22), CentOS (version 8+), Rocky Linux, and RHEL 8+.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  When Was the Change Made?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The transition occurred with the release of Amazon Linux 2023 (AL2023). Earlier releases, like Amazon Linux 2 (AL2), used yum as the default package manager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From AL2023 onward, yum-style commands should be executed using dnf. The command syntax remains almost identical, making the transition straightforward (on AL2023, yum is retained as an alias for dnf for backward compatibility).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
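&lt;p&gt;For example, everyday package operations translate one-for-one (nginx and httpd below are just sample package names):&lt;/p&gt;

```shell
# Common operations look the same under dnf as under yum (AL2023):
sudo dnf update -y                 # was: sudo yum update -y
sudo dnf install -y nginx          # was: sudo yum install -y nginx
sudo dnf remove nginx              # was: sudo yum remove nginx
dnf search httpd                   # was: yum search httpd
dnf list installed                 # was: yum list installed
```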

&lt;h3&gt;
  
  
  Summary Table: Amazon Linux Package Manager Evolution
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Package Manager&lt;/th&gt;
&lt;th&gt;Reason for Switch&lt;/th&gt;
&lt;th&gt;First Released&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Linux 2&lt;/td&gt;
&lt;td&gt;yum&lt;/td&gt;
&lt;td&gt;Older, less efficient dependency handling&lt;/td&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Linux 2023&lt;/td&gt;
&lt;td&gt;dnf&lt;/td&gt;
&lt;td&gt;Modern, faster, secure, aligns with RHEL&lt;/td&gt;
&lt;td&gt;2023&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every major Red Hat-based Linux distribution has shifted to DNF for improved reliability, performance, and future compatibility, making it the logical default for Amazon Linux going forward.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>archlinux</category>
      <category>aws</category>
      <category>dependencyinversion</category>
    </item>
    <item>
      <title>SIM CARD AND ITS WORKING PROCESS: Networks and its Components.</title>
      <dc:creator>Giri Dharan</dc:creator>
      <pubDate>Fri, 21 Nov 2025 15:21:41 +0000</pubDate>
      <link>https://forem.com/giridharan_devops/sim-card-and-its-working-process-networks-and-its-components-3m6g</link>
      <guid>https://forem.com/giridharan_devops/sim-card-and-its-working-process-networks-and-its-components-3m6g</guid>
      <description>&lt;p&gt;A SIM (Subscriber Identity Module) card is a small integrated circuit that enables mobile devices to connect securely to cellular networks. It acts as a digital identity card for the user, storing critical information that allows the device to authenticate with the network, access services, and maintain secure communication. The SIM card is essential for making calls, sending texts, and using mobile data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Functions of a SIM Card
&lt;/h3&gt;

&lt;p&gt;The primary function of a SIM card is to authenticate and identify the user on a mobile network. When a device is powered on, the SIM card communicates with the nearest cell tower to establish a secure connection. The SIM contains several key pieces of data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;International Mobile Subscriber Identity (IMSI):&lt;/strong&gt; A unique number that identifies the subscriber to the network. This is used during authentication and roaming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integrated Circuit Card Identifier (ICCID):&lt;/strong&gt; The SIM card's serial number, used to identify the physical card itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Authentication Key (Ki):&lt;/strong&gt; A secret cryptographic key stored securely on the SIM. It is never transmitted over the air and is used to verify the user's identity during network authentication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Location Area Identity (LAI):&lt;/strong&gt; Tracks the current location of the device for network routing and roaming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Service Provider Information:&lt;/strong&gt; Stores details about the user's plan, available services, and network preferences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contacts and SMS:&lt;/strong&gt; Older SIM cards could store contacts and text messages, but modern smartphones typically store this data on the device itself.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a device attempts to connect to a network, the SIM card sends its IMSI and authentication credentials to the carrier's network. The network verifies these credentials using the Ki key and, if valid, grants access to services like calls, texts, and data.&lt;/p&gt;
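&lt;p&gt;To make the IMSI concrete: it is a string of up to 15 decimal digits composed of a 3-digit Mobile Country Code (MCC), a 2- or 3-digit Mobile Network Code (MNC), and the remaining subscriber number (MSIN). A minimal Python sketch of that split (the sample IMSI below is purely hypothetical, and the MNC length must be known in advance since it varies by operator):&lt;/p&gt;

```python
def parse_imsi(imsi: str, mnc_digits: int = 3) -> dict:
    """Split an IMSI into its three standard fields.

    MCC:  Mobile Country Code (always 3 digits)
    MNC:  Mobile Network Code (2 or 3 digits, operator-dependent)
    MSIN: Mobile Subscriber Identification Number (the remainder)
    """
    if not (imsi.isdigit() and len(imsi) <= 15):
        raise ValueError("IMSI must be at most 15 decimal digits")
    return {
        "mcc": imsi[:3],
        "mnc": imsi[3:3 + mnc_digits],
        "msin": imsi[3 + mnc_digits:],
    }

# Hypothetical IMSI, for illustration only
fields = parse_imsi("310150123456789", mnc_digits=3)
print(fields)  # {'mcc': '310', 'mnc': '150', 'msin': '123456789'}
```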

&lt;h3&gt;
  
  
  Internal Components of a SIM Card
&lt;/h3&gt;

&lt;p&gt;A SIM card is a smart card built around a silicon microcontroller. Its main internal components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Microcontroller (CPU):&lt;/strong&gt; A small processor that runs the SIM's operating system and manages data storage and communication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory (ROM, RAM, EEPROM):&lt;/strong&gt; Stores the SIM's operating system, user data (like contacts and SMS), and critical network information. Modern SIMs typically have 32–256 KB of memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operating System:&lt;/strong&gt; Manages the SIM's functions, including authentication, data storage, and secure communication with the device.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security Module:&lt;/strong&gt; Handles cryptographic operations, such as generating authentication responses and protecting the Ki key.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contact Pads:&lt;/strong&gt; Physical gold-plated pads on the card's surface that connect to the device's SIM slot, enabling data transfer.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SIM card communicates with the mobile device through a standardized protocol (ISO/IEC 7816), allowing it to exchange data securely and efficiently.&lt;/p&gt;
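&lt;p&gt;That ISO/IEC 7816 exchange is built on command APDUs: short byte sequences of class, instruction, parameters, and data. As a sketch, the GSM 11.11 SELECT command a phone sends to open a file on the SIM looks like this (the class byte 0xA0 and instruction 0xA4 come from the GSM SIM specification; 0x3F00 is the well-known identifier of the card's root "Master File"):&lt;/p&gt;

```python
def select_apdu(file_id: int) -> bytes:
    """Build a GSM 11.11 SELECT command APDU for a SIM file.

    CLA 0xA0 = GSM class byte, INS 0xA4 = SELECT,
    P1/P2 = 0x00, Lc = 0x02 (length of the file identifier).
    """
    return bytes([0xA0, 0xA4, 0x00, 0x00, 0x02]) + file_id.to_bytes(2, "big")

# Select the Master File (MF), the root of the SIM's file system
apdu = select_apdu(0x3F00)
print(apdu.hex())  # a0a40000023f00
```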

&lt;h3&gt;
  
  
  How a SIM Card Detects Network Signal Strength
&lt;/h3&gt;

&lt;p&gt;The SIM card itself does not directly measure network signal strength. Instead, it relies on the mobile device's hardware (specifically the radio frequency (RF) module and antenna) to detect and report signal levels. Here's how the process works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Device Hardware:&lt;/strong&gt; The phone's RF module continuously scans for available cellular networks and measures the strength of the received signal (usually in dBm).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Registration:&lt;/strong&gt; The SIM card provides the IMSI and authentication credentials to the network, allowing the device to register and maintain a connection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Signal Reporting:&lt;/strong&gt; The device's operating system (not the SIM) displays the signal strength (bars or dBm value) based on the RF module's measurements. The SIM card is involved in maintaining the connection but does not directly measure signal strength.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the device moves between cell towers or experiences changes in signal quality, the RF module updates the signal strength, and the device's software reflects this change. The SIM card ensures that the device remains authenticated and connected to the network, but the actual detection of signal strength is handled by the phone's hardware.&lt;/p&gt;
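&lt;p&gt;The "bars" you see are simply the phone software bucketing the RF module's dBm reading. The thresholds below are illustrative only, since each OS vendor chooses its own cutoffs, but the shape of the mapping is the same everywhere (values closer to 0 mean a stronger signal):&lt;/p&gt;

```python
def dbm_to_bars(rssi_dbm: int) -> int:
    """Map a measured RSSI (in dBm) to 0-4 signal bars.

    Thresholds are illustrative; real devices use vendor-specific cutoffs.
    """
    if rssi_dbm >= -70:
        return 4   # excellent
    if rssi_dbm >= -85:
        return 3   # good
    if rssi_dbm >= -100:
        return 2   # fair
    if rssi_dbm >= -110:
        return 1   # poor
    return 0       # effectively no usable signal

print(dbm_to_bars(-65))   # 4 (strong signal)
print(dbm_to_bars(-105))  # 1 (weak signal)
```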

&lt;h3&gt;
  
  
  Security and Authentication Process
&lt;/h3&gt;

&lt;p&gt;The SIM card plays a crucial role in securing mobile communications. When a device connects to a network, the following authentication process occurs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The device sends the IMSI to the network.&lt;/li&gt;
&lt;li&gt;The network challenges the SIM card with a random number.&lt;/li&gt;
&lt;li&gt;The SIM card uses the Ki key to generate a cryptographic response.&lt;/li&gt;
&lt;li&gt;The network verifies the response. If it matches, the device is granted access.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process ensures that only authorized users can access the network, protecting against unauthorized use and SIM cloning.&lt;/p&gt;
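&lt;p&gt;The four steps above can be sketched in a few lines. Real SIMs run an operator-chosen algorithm (COMP128 on 2G, Milenage on 3G/4G/5G); HMAC-SHA256 stands in for it here purely to show the shape of the exchange: the same Ki is provisioned on the card and at the carrier, only the challenge and the short response ever cross the air.&lt;/p&gt;

```python
import hashlib
import hmac
import os

def sim_response(ki: bytes, rand: bytes) -> bytes:
    """Compute the signed response (SRES) to a network challenge.

    HMAC-SHA256 is a stand-in for the operator algorithm; the real
    SRES is likewise a short 32-bit value derived from Ki and RAND.
    """
    return hmac.new(ki, rand, hashlib.sha256).digest()[:4]

ki = os.urandom(16)                     # secret key: on the SIM and at the carrier
rand = os.urandom(16)                   # 1. network issues a random challenge
sres_card = sim_response(ki, rand)      # 2-3. SIM answers using Ki
sres_network = sim_response(ki, rand)   # 4. network computes the expected value
print(sres_card == sres_network)        # True -> access granted
```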




&lt;p&gt;The SIM card is a vital component of mobile communication, acting as a secure digital identity that enables devices to connect to cellular networks, authenticate users, and maintain secure communication. Its internal components work together to store critical data and manage network interactions, while signal strength detection is handled by the device's hardware, not the SIM itself.&lt;/p&gt;

</description>
      <category>network</category>
      <category>electron</category>
      <category>webcomponents</category>
      <category>mobile</category>
    </item>
  </channel>
</rss>
