<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Neha Gupta</title>
    <description>The latest articles on Forem by Neha Gupta (@ngneha09).</description>
    <link>https://forem.com/ngneha09</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1418384%2Fca1ec307-ef6a-4338-bc61-f6dfb2e07c60.png</url>
      <title>Forem: Neha Gupta</title>
      <link>https://forem.com/ngneha09</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ngneha09"/>
    <language>en</language>
    <item>
      <title>Why Blockchain was developed?</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Fri, 16 May 2025 16:43:23 +0000</pubDate>
      <link>https://forem.com/ngneha09/why-blockchain-was-developed--3iom</link>
      <guid>https://forem.com/ngneha09/why-blockchain-was-developed--3iom</guid>
      <description>&lt;p&gt;Hey everyone 👋&lt;/p&gt;

&lt;p&gt;Lately I have been diving into Web3. I started studying Blockchain and wondered why it was developed in the first place 😕 so I read some articles and research papers and found the answer.&lt;/p&gt;

&lt;p&gt;In this blog I will tell you the &lt;strong&gt;History of Blockchain&lt;/strong&gt;. So let’s get started 😉&lt;/p&gt;

&lt;h2&gt;
  
  
  Story of Blockchain
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Early Blockchain History (1982–2004)
&lt;/h3&gt;

&lt;p&gt;Back in &lt;strong&gt;1982&lt;/strong&gt;, cryptographer &lt;strong&gt;David Chaum&lt;/strong&gt; first proposed a blockchain-like protocol in his dissertation “&lt;em&gt;Computer Systems Established, Maintained, and Trusted by Mutually Suspicious Groups&lt;/em&gt;”. As the name suggests, it described a transparent and secure system maintained by parties that do not need to trust each other.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;1991&lt;/strong&gt;, &lt;strong&gt;Stuart Haber&lt;/strong&gt; and &lt;strong&gt;W. Scott Stornetta&lt;/strong&gt; developed a cryptographically secured chain of blocks. They wanted to build a system in which document timestamps could not be tampered with, so that no alteration could go unnoticed and document creation would be secure and transparent. In their design, blocks store the timestamps of digital documents.&lt;/p&gt;

&lt;p&gt;The Blockchain idea continued in &lt;strong&gt;1992&lt;/strong&gt; and &lt;strong&gt;Haber&lt;/strong&gt;, &lt;strong&gt;Stornetta&lt;/strong&gt;, and &lt;strong&gt;Dave Bayer&lt;/strong&gt; incorporated &lt;em&gt;Merkle trees&lt;/em&gt; into the design, which improved its efficiency by allowing several document certificates to be collected into one block.&lt;br&gt;
The idea continued thereafter, and many scientists published papers and proposed further improvements.&lt;/p&gt;
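
&lt;p&gt;To make the idea concrete, here is a tiny Python sketch of such a chain of timestamped blocks. This is only a toy illustration of the concept, not Haber and Stornetta’s actual system:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy illustration: each block stores a document hash, a timestamp, and the
# hash of the previous block, so altering any old entry breaks every later hash.
import hashlib, json, time

def make_block(document, prev_hash):
    block = {
        "doc_hash": hashlib.sha256(document.encode()).hexdigest(),
        "timestamp": time.time(),
        "prev_hash": prev_hash,
    }
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

chain = [make_block("genesis document", "0" * 64)]
chain.append(make_block("second document", chain[-1]["hash"]))
print(chain[-1]["prev_hash"] == chain[0]["hash"])  # True: the blocks are linked
&lt;/code&gt;&lt;/pre&gt;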

&lt;p&gt;In &lt;strong&gt;2004&lt;/strong&gt;, cryptographic activist &lt;strong&gt;Hal Finney&lt;/strong&gt; introduced a system for digital cash known as “&lt;em&gt;Reusable Proof of Work&lt;/em&gt;”. This step was a game-changer in the history of blockchain and cryptography. The system helped solve the &lt;strong&gt;Double Spending Problem&lt;/strong&gt; (imagine you have 1 coin and try to pay two of your friends with that same coin at the same time) by keeping the ownership of tokens registered on a trusted server.&lt;/p&gt;

&lt;p&gt;So this was the remarkable work done up to that point, but Blockchain was still nowhere near as popular as it is now. The actual breakthrough came after 2008.&lt;/p&gt;

&lt;h3&gt;
  
  
  America’s 2008 Financial Crisis
&lt;/h3&gt;

&lt;p&gt;Back in 2008 the USA faced a major banking crisis that shook the financial health of the country.&lt;/p&gt;

&lt;p&gt;In 2008 banks gave risky home loans (subprime mortgages) even to people who couldn’t afford them. These loans were bundled and sold as “safe investments”, but they were actually very risky. When many people couldn’t repay their loans, the housing market crashed. Big banks and financial institutions collapsed, causing a global financial crisis. Millions lost their jobs, homes, and savings.&lt;/p&gt;

&lt;p&gt;The 2008 crisis exposed how fragile and opaque the centralized financial system was. In response, and to bring transparency and trust back to people, a person or group under the name Satoshi Nakamoto released the Bitcoin white paper in late 2008, and the Bitcoin network went live in 2009. Bitcoin directly challenged the centralized banking system by aiming for trust and transparency. Nakamoto built on the timestamped chain of blocks and the Merkle tree design to create a more secure system that keeps a tamper-evident history of data exchange, using a peer-to-peer network for timestamping. The system proved so effective that cryptography became the backbone of blockchain.&lt;/p&gt;

&lt;p&gt;The year 2014 is often marked as a turning point for blockchain technology: the blockchain was separated from the currency, and “Blockchain 2.0” was born. Financial institutions and other industries started shifting their focus from digital currency to the development of blockchain technologies themselves.&lt;/p&gt;

&lt;p&gt;In 2015, Ethereum Frontier Network was launched, thus enabling developers to write smart contracts and dApps that could be deployed to a live network. In the same year, the Linux Foundation launched the Hyperledger project.&lt;/p&gt;

&lt;p&gt;And from then on Blockchain technology became more popular and different digital currencies started to show up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9ysukzig1odherd1342.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9ysukzig1odherd1342.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In summary the whole idea of Blockchain was proposed to enhance transparency and security and generate trust among people.&lt;/p&gt;

&lt;p&gt;I hope you liked this blog. 💚&lt;/p&gt;

</description>
      <category>web3</category>
      <category>blockchain</category>
      <category>bitcoin</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Convolutional Neural Network || Beginner’s Guide</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Wed, 23 Oct 2024 07:37:01 +0000</pubDate>
      <link>https://forem.com/ngneha09/convolutional-neural-network-beginners-guide-29m9</link>
      <guid>https://forem.com/ngneha09/convolutional-neural-network-beginners-guide-29m9</guid>
      <description>&lt;p&gt;Hey there 👋 Hope you are doing well 😃&lt;/p&gt;

&lt;p&gt;In the journey of &lt;strong&gt;Deep Learning&lt;/strong&gt;, we come across a variety of neural networks. One of the most basic and foundational types is the &lt;strong&gt;Artificial Neural Network (ANN)&lt;/strong&gt;. ANNs are great for solving simple problems, but when it comes to complex data like images, texts, and videos, ANNs might struggle to perform effectively. To handle such complex data, we’ve introduced more advanced architectures, one of which is the &lt;strong&gt;Convolutional Neural Network (CNN)&lt;/strong&gt;. 🎯&lt;/p&gt;

&lt;p&gt;CNNs are designed specifically to work with complex, high-dimensional data, especially in the field of image processing. In this blog, we’ll explore the introduction to CNNs, their history, how they work, and their applications. 🌟&lt;/p&gt;

&lt;p&gt;So, let’s dive right in! 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe02pxhsvywos9e9ollzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe02pxhsvywos9e9ollzb.png" alt="Image description" width="586" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Convolutional Neural Network?
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Convolutional Neural Networks are a special kind of neural network used for processing data that has a known grid-like topology, such as time-series data or image data. These networks use convolutional layers to process the data and make predictions from it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopvik692st9jv9od2uyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopvik692st9jv9od2uyv.png" alt="Image description" width="723" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A CNN basically consists of an input layer, convolutional layers, pooling layers, a fully connected (ANN) part, and an output layer.&lt;br&gt;
Don’t worry if you don’t get these points right now. We will discuss them later 😃&lt;/p&gt;

&lt;p&gt;A basic CNN takes an image as input, applies convolution operations on it, and then forwards the result to an ANN to generate the output.&lt;/p&gt;
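
&lt;p&gt;As a quick taste of what this looks like in code, here is a minimal sketch in Keras (the layer sizes are illustrative choices, not something fixed):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal CNN sketch (illustrative layer sizes)
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),               # input layer: 28x28 grayscale image
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # convolutional layer
    tf.keras.layers.MaxPooling2D((2, 2)),                   # pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),           # fully connected (ANN) part
    tf.keras.layers.Dense(10, activation="softmax"),        # output layer (e.g. 10 digit classes)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
&lt;/code&gt;&lt;/pre&gt;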

&lt;h3&gt;
  
  
  Why do we use CNNs?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Note -: An image is a collection of pixels.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ANNs work very well on 1D data such as loan prediction data, house price prediction data, etc. But when it comes to 2D data such as images, we need to flatten it first and then feed it to the ANN. Suppose our 2D data has shape (256,256); on flattening, its shape becomes (65536,), and the trainable parameters in the first layer will be &lt;strong&gt;65536 * number of neurons in layer 1 + one bias term per neuron&lt;/strong&gt;. Training such a large number of parameters is computationally expensive. Hence ANNs do not scale well to 2D data.&lt;/p&gt;
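
&lt;p&gt;A quick back-of-the-envelope check of that count, assuming (purely for illustration) 128 neurons in the first layer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pixels = 256 * 256                    # 65536 inputs after flattening
neurons = 128                         # illustrative choice for layer 1
params = pixels * neurons + neurons   # weights + one bias per neuron
print(params)                         # 8388736 trainable parameters in a single layer
&lt;/code&gt;&lt;/pre&gt;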

&lt;p&gt;Another problem that arises with ANNs is the loss of important information such as the spatial arrangement of pixels. When we flatten 2D data, the pixels that were arranged by location lose that arrangement.&lt;br&gt;
With so many parameters, an ANN is also prone to overfitting.&lt;/p&gt;

&lt;p&gt;For these reasons, and to process image data properly, CNNs were introduced.&lt;/p&gt;

&lt;h1&gt;
  
  
  History of CNN
&lt;/h1&gt;

&lt;p&gt;CNNs have evolved significantly over time, starting in the 1960s with &lt;strong&gt;Hubel &amp;amp; Wiesel's discovery of receptive fields&lt;/strong&gt;, which laid the foundation for feature detection. In 1980, &lt;strong&gt;Kunihiko Fukushima&lt;/strong&gt; introduced the &lt;strong&gt;Neocognitron&lt;/strong&gt;, a neural network that could recognize patterns in images. In the 1990s, &lt;strong&gt;Yann LeCun's LeNet-5&lt;/strong&gt; model was a breakthrough in handwritten digit recognition, marking the early success of CNNs in image processing.&lt;/p&gt;

&lt;p&gt;The deep learning revolution began in 2012 with &lt;strong&gt;AlexNet&lt;/strong&gt;, developed by &lt;strong&gt;Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton&lt;/strong&gt;, which significantly advanced image classification. By 2014, &lt;strong&gt;VGGNet&lt;/strong&gt; and &lt;strong&gt;GoogLeNet&lt;/strong&gt; further enhanced CNN architectures, improving efficiency and performance. In 2015, &lt;strong&gt;ResNet&lt;/strong&gt; introduced deeper networks with skip connections, addressing the vanishing gradient problem and becoming a standard in computer vision.&lt;/p&gt;

&lt;p&gt;Today, CNNs power various applications, from &lt;strong&gt;autonomous driving&lt;/strong&gt; to &lt;strong&gt;medical imaging&lt;/strong&gt;, with innovations like &lt;strong&gt;Capsule Networks&lt;/strong&gt;, &lt;strong&gt;EfficientNet&lt;/strong&gt;, and &lt;strong&gt;Transformers&lt;/strong&gt; continuing to reshape deep learning.&lt;/p&gt;

&lt;h1&gt;
  
  
  Working of CNN
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Intuition behind CNN
&lt;/h3&gt;

&lt;p&gt;As we have already seen, CNNs were initially used on handwritten digit data to recognize different digits. Now you might be wondering: different people have different styles of writing ✍ digits, so how can a model recognize a digit? This is why we need to understand the intuition behind CNNs before moving on to how they work.&lt;br&gt;
Suppose we test our model on a handwritten 9️⃣. When we feed this data to our model Ⓜ, it extracts basic features first and then, using these basic features, it extracts more complex ones. The features are extracted to find patterns in the data, and based on these patterns the model recognizes the corresponding digit.&lt;br&gt;
It is like the human brain: when we see an animal 🐤, we classify it based on its physical features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Working of CNN
&lt;/h3&gt;

&lt;p&gt;Now that we’ve understood the basic intuition behind CNNs, a question still arises: how does the model actually work? 🤔&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd7uhaub09a5q7righ5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffd7uhaub09a5q7righ5d.png" alt="Image description" width="723" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s break down the basic structure of a CNN. We start with the &lt;strong&gt;input layer&lt;/strong&gt;, which takes in a 2D grid representing a single image. This image is then passed through several layers of the CNN.&lt;/p&gt;

&lt;p&gt;First, we have the &lt;strong&gt;convolutional and pooling layers&lt;/strong&gt;. The convolutional layer applies filters (or kernels) to the image to extract features. This is done through a mathematical operation called &lt;strong&gt;convolution&lt;/strong&gt;, which slides a small filter over the image and computes a weighted sum at each position, and is designed to detect patterns such as edges, textures, or shapes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pooling layer&lt;/strong&gt; follows, typically used to downsample the result of the convolutional layer (often called the &lt;strong&gt;feature map&lt;/strong&gt;). Pooling reduces the dimensionality of the data while retaining the most important information. For instance, if we want to detect horizontal edges in an image, the convolutional layer can apply a filter designed to extract horizontal features, and pooling then keeps the strongest responses.&lt;/p&gt;
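
&lt;p&gt;If you want to see the mechanics, here is a small NumPy sketch of a convolution with a horizontal-edge filter followed by 2x2 max pooling. This is a toy example for intuition, not how frameworks actually implement it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy sketch: convolve an image with a horizontal-edge kernel, then 2x2 max-pool
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)  # sliding multiply-and-sum
    return out

def max_pool(feature_map, size=2):
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.zeros((6, 6))
image[3:, :] = 1.0                                                 # dark top half, bright bottom half
kernel = np.array([[-1., -1., -1.], [0., 0., 0.], [1., 1., 1.]])   # horizontal-edge detector
feature_map = convolve2d(image, kernel)
print(max_pool(feature_map))                                       # strong responses along the edge
&lt;/code&gt;&lt;/pre&gt;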

&lt;p&gt;Finally, we have the &lt;strong&gt;fully connected (dense) layers&lt;/strong&gt;, which act similarly to a traditional Artificial Neural Network (ANN). This part of the network takes the high-level features extracted by the convolutional and pooling layers and makes predictions based on those.&lt;/p&gt;

&lt;p&gt;In simple terms, CNNs work by scanning an image with filters to detect essential features and patterns, then passing the information through dense layers to classify or make predictions. &lt;/p&gt;

&lt;p&gt;This explanation provides a high-level overview, as this blog is intended to be an introductory guide.&lt;/p&gt;

&lt;h1&gt;
  
  
  Applications of CNN
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Image Classification&lt;/strong&gt; 🖼️ – CNNs excel at categorizing images, improving accuracy in tasks like object and handwritten digit recognition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object Detection&lt;/strong&gt; 🔍 – Used in self-driving cars to detect pedestrians, vehicles, and more by identifying and localizing objects in images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facial Recognition&lt;/strong&gt; 👤 – Powering facial recognition in smartphones and security systems by learning distinct facial features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medical Imaging&lt;/strong&gt; 🏥 – Helping in disease diagnosis through X-rays, MRIs, and CT scans by accurately identifying anomalies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Driving Cars&lt;/strong&gt; 🚗 – Performing real-time vision tasks like lane detection and obstacle recognition for safe navigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image and Video Processing&lt;/strong&gt; 🎥 – Enhancing images, segmenting, and tracking objects in real-time for video analysis and editing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NLP&lt;/strong&gt; 💬 – Applied in text classification tasks like sentiment analysis and spam detection using CNNs on word embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Art Generation&lt;/strong&gt; 🎨 – Enabling neural style transfer to create artistic visuals by blending styles and patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robotics&lt;/strong&gt; 🤖 – Assisting robots in recognizing objects and navigating environments using visual data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gaming and AR&lt;/strong&gt; 🎮 – Improving gaming realism and blending virtual and real-world elements through real-time visual data processing.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Convolutional Neural Networks (CNNs) have transformed how we process and understand complex data, especially in the fields of computer vision and beyond. From identifying objects in images to powering facial recognition systems, CNNs have become essential in solving a wide range of real-world problems.&lt;/p&gt;

&lt;p&gt;I hope you have found this blog interesting. Please leave some 💛 and don’t forget to follow me.&lt;/p&gt;

&lt;p&gt;Thankyou 💛&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>tutorial</category>
      <category>interview</category>
      <category>ai</category>
    </item>
    <item>
      <title>729. My Calendar I || Leetcode || Medium</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Fri, 27 Sep 2024 13:59:51 +0000</pubDate>
      <link>https://forem.com/ngneha09/729-my-calendar-i-leetcode-medium-e6b</link>
      <guid>https://forem.com/ngneha09/729-my-calendar-i-leetcode-medium-e6b</guid>
      <description>&lt;p&gt;Hey there 👋&lt;/p&gt;

&lt;p&gt;Hope you are doing well 😃&lt;/p&gt;

&lt;p&gt;In this blog we are going to see the complete intuition behind the Leetcode problem &lt;strong&gt;729. My Calendar I&lt;/strong&gt;. We will understand the problem statement first, then look at the solution of the problem, followed by the code.&lt;/p&gt;

&lt;p&gt;Problem link -: &lt;a href="https://leetcode.com/problems/my-calendar-i/description/" rel="noopener noreferrer"&gt;https://leetcode.com/problems/my-calendar-i/description/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s get started 🔥&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wpeg5lpme1plvsa63ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wpeg5lpme1plvsa63ag.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;You are implementing a program to use as your calendar. We can add a new event if adding the event will not cause a &lt;strong&gt;double booking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;double booking&lt;/strong&gt; happens when two events have some non-empty intersection (i.e., some moment is common to both events.).&lt;/p&gt;

&lt;p&gt;The event can be represented as a pair of integers start and end that represents a booking on the half-open interval &lt;code&gt;[start, end)&lt;/code&gt;, the range of real numbers &lt;code&gt;x&lt;/code&gt; such that &lt;code&gt;start &amp;lt;= x &amp;lt; end&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Implement the &lt;code&gt;MyCalendar&lt;/code&gt; class:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;MyCalendar()&lt;/code&gt; Initializes the calendar object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;boolean book(int start, int end)&lt;/code&gt; Returns true if the event can be added to the calendar successfully without causing a double booking. Otherwise, return false and do not add the event to the calendar.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Example 1
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggs1w9sgnryzyu57ikvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggs1w9sgnryzyu57ikvq.png" alt="Image description" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Explanation
&lt;/h2&gt;

&lt;p&gt;The problem is about implementing a calendar which holds events for particular time intervals.&lt;br&gt;
The calendar is implemented such that any given moment can hold only a single booking. For example, suppose a user has booked the time interval [1,4); this complete slot is assigned to that user. Now another user comes to book the slot [2,3), but they won’t be able to get it because this interval falls inside [1,4).&lt;/p&gt;

&lt;p&gt;Note that the slot represents a booking on the &lt;strong&gt;half-open interval&lt;/strong&gt; &lt;code&gt;[start, end)&lt;/code&gt;, the range of real numbers &lt;code&gt;x&lt;/code&gt; such that &lt;code&gt;start &amp;lt;= x &amp;lt; end&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this problem we are given a stream of time intervals, and for each one we have to check whether that slot can be assigned to the event or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach
&lt;/h2&gt;

&lt;p&gt;In the given problem the events can be booked in any order, for example [[47,50],[33,41]].&lt;br&gt;
So simply adding the slots to a set or map and checking the next slot only against the previously stored slots won’t directly give us the desired solution.&lt;/p&gt;

&lt;p&gt;Now what can we do here?🤔&lt;/p&gt;

&lt;p&gt;In this problem we have to find a way through which we can ensure that a single event is assigned to every slot. This we can do using &lt;strong&gt;prefix sum&lt;/strong&gt; and &lt;strong&gt;ordered map&lt;/strong&gt;. Huh?🙄&lt;/p&gt;

&lt;p&gt;We will make an ordered map (as it keeps entries sorted by key). Whenever a slot is to be booked, we will assume that the slot is valid and book an event between start and end: &lt;code&gt;mp[start]++&lt;/code&gt; marks the starting point of the event and &lt;code&gt;mp[end]--&lt;/code&gt; marks that the event has ended by time &lt;code&gt;end-1&lt;/code&gt; (the interval is half-open). In this way we keep track of how many events cover each point in time.&lt;br&gt;
Now we check that every moment has at most a single booking by computing the cumulative (prefix) sum over the map. Whenever &lt;code&gt;sum&amp;gt;1&lt;/code&gt;, a double booking has occurred, and we remove the time slot we just added from the map.&lt;br&gt;
Note that the slot for which the double booking is encountered is our current slot, which proves our assumption wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtvjlsw2kr0bn5rn0u7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtvjlsw2kr0bn5rn0u7o.png" alt="Image description" width="800" height="377"&gt;&lt;/a&gt;&lt;br&gt;
As you can see, this is how the approach keeps track of slots and bookings.&lt;/p&gt;

&lt;p&gt;And this is why I told you that the problem can be solved using map and prefix sum.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpmueolr5xwd0vheqlsy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpmueolr5xwd0vheqlsy.png" alt="Image description" width="800" height="763"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we have made an ordered map, and in the &lt;code&gt;book()&lt;/code&gt; method we assume that the current slot can be booked. Then we calculate the cumulative sum; if sum&amp;gt;1 it indicates a double booking, hence our assumption is wrong, so we remove this entry from the map and return false. If everything works well we return true.&lt;/p&gt;
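
&lt;p&gt;If you prefer text to a screenshot, here is a rough Python sketch of the same idea. Python has no built-in ordered map, so this sketch simply sorts the keys on every call; it is meant to mirror the approach above, not to be the exact code in the image:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class MyCalendar:
    def __init__(self):
        self.delta = {}                      # time point: +1 at a start, -1 at an end

    def book(self, start, end):
        self.delta[start] = self.delta.get(start, 0) + 1   # assume the booking is valid
        self.delta[end] = self.delta.get(end, 0) - 1
        running = 0
        for t in sorted(self.delta):         # sweep the points in time order (prefix sum)
            running += self.delta[t]
            if running &amp;gt; 1:                # some moment is covered twice: double booking
                self.delta[start] -= 1       # undo our assumption and reject
                self.delta[end] += 1
                return False
        return True

cal = MyCalendar()
print(cal.book(10, 20))   # True
print(cal.book(15, 25))   # False, overlaps [10, 20)
print(cal.book(20, 30))   # True, since [start, end) is half-open
&lt;/code&gt;&lt;/pre&gt;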

&lt;p&gt;So this was complete solution to problem &lt;strong&gt;729. My Calendar I&lt;/strong&gt;. I hope you have understood it well.&lt;br&gt;
If you like the blog please leave some ❤&lt;/p&gt;

&lt;p&gt;Thankyou 😃&lt;/p&gt;

</description>
      <category>leetcode</category>
      <category>datastructures</category>
      <category>interview</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Complete Guide to Becoming a Software Development Engineer (SDE)</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Sat, 21 Sep 2024 08:45:17 +0000</pubDate>
      <link>https://forem.com/ngneha09/the-complete-guide-to-becoming-a-software-development-engineer-sde-pn1</link>
      <guid>https://forem.com/ngneha09/the-complete-guide-to-becoming-a-software-development-engineer-sde-pn1</guid>
      <description>&lt;p&gt;Hey reader 👋&lt;/p&gt;

&lt;p&gt;Hope you are doing well 😃&lt;/p&gt;

&lt;p&gt;Becoming a &lt;strong&gt;Software Development Engineer (SDE)&lt;/strong&gt; in a top multinational company, startup, or big tech firm is a dream for many freshers, students, and professionals. However, many aspiring engineers don’t fully understand what an SDE actually does or how to climb the path to this coveted position.&lt;/p&gt;

&lt;p&gt;In this blog, I’ll walk you through the complete journey of becoming an SDE — from the skills you need to develop, to the interview preparation strategies, and finally landing your dream job.&lt;/p&gt;

&lt;p&gt;So, let’s dive in and get started on this exciting journey! 🔥&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa12qfy7li3jj2nxdv0af.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa12qfy7li3jj2nxdv0af.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What does an SDE do?🧑‍💻
&lt;/h2&gt;

&lt;p&gt;Before we dive into how to become an SDE, let’s clarify what the role involves. As a Software Development Engineer, you’ll be responsible for designing, developing, testing, and maintaining software systems that solve real-world problems. You could be working on anything from building websites and apps to designing complex backend systems, depending on the company and the project.&lt;/p&gt;

&lt;p&gt;SDEs are the masterminds behind the tech we use every day! They collaborate with cross-functional teams, solve challenging problems, and create solutions that can scale to millions of users. Sounds exciting 🤩, right?&lt;/p&gt;

&lt;h2&gt;
  
  
  Levels of SDE
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SDE1 (Software Development Engineer 1)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Experience Level: Entry-level (0–2 years of experience)&lt;/p&gt;

&lt;p&gt;Role and Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;As an SDE1, you’re just starting out in your software engineering career. This role is focused on learning, growing, and gaining experience in the field.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You’ll be working on well-defined tasks under the supervision of senior engineers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your primary responsibility is writing clean, maintainable code and following the software development lifecycle (coding, testing, debugging).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You’ll also be involved in team discussions, learning about software design, and understanding how large systems work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Expect to be given smaller, more manageable projects or parts of larger projects.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skills Required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Strong understanding of core programming concepts (OOP, DSA).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ability to write functional and efficient code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Problem-solving skills and familiarity with the tools and technologies used by the team.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Growth Focus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’ll be learning how to work in a team and gaining real-world experience with production codebases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SDE2 (Software Development Engineer 2)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Experience Level: Mid-level (2–5 years of experience)&lt;/p&gt;

&lt;p&gt;Role and Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;At this stage, you’re expected to be more independent and capable of handling larger and more complex tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SDE2 engineers are usually responsible for designing and implementing features end-to-end. You’ll work on more critical parts of the system, potentially leading small projects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You are also expected to make key architectural decisions, review code, and mentor junior engineers (SDE1).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SDE2s have a solid understanding of system design and scalability. You may also collaborate with other teams and understand the broader impact of the code you write.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skills Required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Proficiency in programming and in-depth knowledge of the tools used by the team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Experience in designing systems or components with scalability and performance in mind.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The ability to debug complex problems and suggest solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some knowledge of system design and architecture.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Growth Focus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on becoming a more independent problem-solver and team player. You’ll be deepening your system design knowledge and learning to manage more complex tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SDE3 (Software Development Engineer 3)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Experience Level: Senior-level (5+ years of experience)&lt;/p&gt;

&lt;p&gt;Role and Responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SDE3s are senior engineers who take on the most complex projects and often lead teams of engineers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You’ll be responsible for designing and building large-scale systems with high reliability, performance, and scalability in mind.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SDE3s are also responsible for making critical architectural decisions, driving technical direction, and influencing the overall strategy of the company’s engineering efforts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You’re expected to mentor other engineers, drive best practices, and ensure that the team follows industry standards in terms of code quality and architecture.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skills Required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Expert-level knowledge of system design, architecture, and scaling solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ability to handle complex challenges and guide the engineering team through problem-solving.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Experience across multiple tech stacks and a deep understanding of how different components of large systems interact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong leadership, mentorship, and communication skills.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Growth Focus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SDE3s are leaders and problem solvers at the highest technical level. The next steps in your career could include moving into engineering management or becoming a principal/lead engineer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But to get here, you need more than just coding skills — let’s explore the roadmap! 😊&lt;/p&gt;

&lt;h2&gt;
  
  
  SDE Roadmap 🚀
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Now that You Know What an SDE Does and the Levels of the Role…&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s time for you to understand the journey of becoming an SDE! The path you take will depend on the level of SDE you’re aiming for, but if you’re just starting out and looking to land an SDE1 position, then this pathway is definitely for you! 😀&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Pathway to SDE1: Your Step-by-Step Guide&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the Basics: Learn Programming&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every journey starts with the fundamentals, and for an SDE1 role, that means mastering a programming language. Whether you’re in college or learning on your own, choose a language that’s widely used in the industry. Common choices include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Python: Great for beginners and widely used in backend development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Java: Extremely popular for enterprise-level applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;C++: Known for its speed and control over system resources, great for those who want to work on performance-intensive projects.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key here is not to learn every language out there, but to become proficient in one. Once you’re comfortable coding, you can always pick up new languages along the way.&lt;/p&gt;

&lt;p&gt;Learn how to calculate the time and space complexity of a given piece of code, or of the code you are writing. Understand best-case, average-case, and worst-case complexity. This will be helpful in later stages when you try to optimize your code.&lt;/p&gt;
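
&lt;p&gt;For example, here is a tiny contrast between a linear and a binary search, with their complexities noted in the comments (just an illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def linear_search(arr, target):          # O(n) time, O(1) space
    for i, value in enumerate(arr):
        if value == target:
            return i                     # best case: target is first, O(1)
    return -1                            # worst case: scan everything, O(n)

def binary_search(arr, target):          # O(log n) time, O(1) space; needs sorted input
    lo, hi = 0, len(arr) - 1
    while lo &amp;lt;= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] &amp;lt; target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

data = [2, 5, 8, 12, 16, 23, 38]
print(linear_search(data, 23), binary_search(data, 23))   # 5 5
&lt;/code&gt;&lt;/pre&gt;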

&lt;ol start="2"&gt;
&lt;li&gt;Dive Deep into Data Structures and Algorithms (DSA)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you’ve been following the tech world, you’ve probably heard that Data Structures and Algorithms (DSA) are the bread and butter of SDE interviews. Most companies rely heavily on DSA to test your problem-solving skills and efficiency as a developer.&lt;/p&gt;

&lt;p&gt;Here’s your DSA roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start small: Begin with basic data structures like arrays, linked lists, stacks, and queues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Learn algorithms: Work on sorting algorithms (merge sort, quick sort), searching algorithms (binary search), and more advanced topics like recursion and dynamic programming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Practice: Platforms like LeetCode, HackerRank, and GeeksforGeeks are gold mines for practice. Start with easy problems, and once you get comfortable, move to more challenging ones.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don’t rush — consistency is key! Aim for solving 1–2 problems a day, and you’ll soon see your skills improve.&lt;/p&gt;

&lt;p&gt;You can read this blog to get a complete guide to DSA -: &lt;a href="https://medium.com/@akshatsharma0610/a-data-structures-and-algorithms-guide-get-interview-ready-d2426c5e30c7" rel="noopener noreferrer"&gt;https://medium.com/@akshatsharma0610/a-data-structures-and-algorithms-guide-get-interview-ready-d2426c5e30c7&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Master Core Computer Science Subjects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Knowledge of the basic computer science subjects will help in interviews as well as in professional work. They help in understanding the complete flow and workings of application development. By mastering these topics you will be able to solve complex problems, create robust software, engage in technical discussions, and more. So take some time and build a good understanding of these. Reference books, courses, college projects, etc. can prove to be helpful while learning them.&lt;/p&gt;

&lt;p&gt;The subjects you should study are -:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DBMS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computer Networks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OOPs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operating System&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computer Architecture&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;System Design&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Learn Object-Oriented Programming (OOP) Concepts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Object-Oriented Programming (OOP) is another core area that companies will test in interviews. It’s especially important if you’re applying for positions in companies that use languages like Java, C++, or Python.&lt;/p&gt;

&lt;p&gt;OOP concepts help in designing reusable, scalable, and efficient systems. Focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Classes and Objects: The building blocks of OOP. Understand how to design classes that represent real-world entities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inheritance: Allows you to reuse code by creating hierarchies and extending base classes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Polymorphism: Understand both compile-time (method overloading) and run-time (method overriding) polymorphism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encapsulation: Hiding the internal state of objects and allowing access only through methods to protect data integrity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Abstraction: Focus on defining an interface for interactions, while hiding the implementation details.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
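
&lt;p&gt;Here is a small Python sketch that touches each of these ideas; the class names are made up purely for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Small illustration of the OOP ideas above (names are made up for this example)
from abc import ABC, abstractmethod

class Account(ABC):                      # Abstraction: defines an interface
    def __init__(self, owner, balance=0):
        self.owner = owner
        self.__balance = balance         # Encapsulation: internal state is hidden

    def deposit(self, amount):
        self.__balance += amount

    def balance(self):
        return self.__balance

    @abstractmethod
    def monthly_fee(self):               # implementation details left to subclasses
        ...

class SavingsAccount(Account):           # Inheritance: reuses Account's behaviour
    def monthly_fee(self):
        return 0

class CurrentAccount(Account):
    def monthly_fee(self):
        return 10

for acct in (SavingsAccount("Neha"), CurrentAccount("Neha")):
    print(type(acct).__name__, acct.monthly_fee())   # Polymorphism: same call, different result
&lt;/code&gt;&lt;/pre&gt;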

&lt;p&gt;Practice by building small projects or applications that implement OOP principles, such as a banking system or a library management system.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Learn High-Level Design (HLD) and Low-Level Design (LLD) for Interview&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While system design is more heavily tested in SDE2 and SDE3 interviews, it’s important to have a basic understanding for SDE1 as well. You might be asked to design small-scale systems like a URL shortener or a basic messaging system.&lt;/p&gt;

&lt;p&gt;Here’s what you should focus on at this stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Scalability: Understand how to build systems that can handle an increasing number of users or requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Load Balancing: How to distribute network traffic across multiple servers to avoid overload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Caching: Techniques to store frequently accessed data for faster retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Database Design: How to create normalized database schemas and when to use denormalization for performance optimization.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a good idea to start reading books like Designing Data-Intensive Applications to get a feel for how large systems are built.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;Build Real-World Projects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While DSA helps you crack interviews, real-world projects showcase your ability to apply knowledge. Building projects not only strengthens your coding skills but also makes your resume stand out.&lt;/p&gt;

&lt;p&gt;Here are some beginner-friendly project ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Personal Portfolio Website: Build a website showcasing your skills, projects, and achievements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To-Do List App: Create a full-stack application using a backend framework like Node.js or Django.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Weather App: Use public APIs to create an app that displays real-time weather information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Freelance Project: You can also take a freelance project from different platforms or from your family or friends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can also clone an existing website and make useful changes to it. This will really help you a lot.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tip: Share your projects on GitHub, contribute to open source, and even write blogs about your learning process. Employers love seeing developers who are passionate about coding and sharing knowledge.&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;Learn Version Control (Git)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the real world, coding is rarely done in isolation. You’ll be collaborating with other developers, and that’s where Git and version control come into play. Here’s what to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Learn the basics of Git: cloning repositories, committing code, and pushing changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Get familiar with GitHub, which is used by companies for collaboration and reviewing code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you’ve learned Git, make sure to use it in your projects. This shows that you know how to work in a team and manage your code professionally.&lt;/p&gt;

&lt;ol start="8"&gt;
&lt;li&gt;Prepare for Behavioral and HR Interviews&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to technical interviews, companies will often test your communication skills, teamwork, and cultural fit through behavioral interviews. Here’s what to focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prepare your stories: Think of examples from your past where you’ve demonstrated leadership, collaboration, problem-solving, or dealt with challenges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;STAR method: When answering, use the STAR method (Situation, Task, Action, Result) to structure your responses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Show that you’re not only a good coder but also a team player with strong communication skills. This goes a long way in landing your dream job.&lt;/p&gt;

&lt;ol start="9"&gt;
&lt;li&gt;Get Certifications on Trending Technologies&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Professional certifications help in personal growth and learning new concepts from industry experts. These provide the flexibility to learn from anywhere and even professionals can use certification courses to upskill and receive promotions in their current jobs. This investment of time and resources can provide specialization in one domain and enhance our skill set and understanding of that topic.&lt;/p&gt;

&lt;ol start="10"&gt;
&lt;li&gt;Keep Learning and Stay Motivated!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tech world evolves rapidly, and being an SDE means you should always be curious and willing to learn. Whether it’s picking up a new programming language, learning cloud computing, or diving into machine learning, keep pushing yourself to learn and grow.&lt;/p&gt;

&lt;p&gt;The journey to SDE1 is challenging but incredibly rewarding. Stay consistent, practice regularly, and never lose your love for coding!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Ready to Start Your SDE1 Journey?
&lt;/h2&gt;

&lt;p&gt;If you follow this pathway, you’ll not only build the technical skills required to become an SDE1, but you’ll also develop the mindset and confidence to succeed. Remember, it’s not about perfection, it’s about progress. Take it one step at a time, and soon, you’ll be well on your way to landing that dream role as a Software Development Engineer.&lt;/p&gt;

&lt;p&gt;Good luck, and keep coding! 🚀&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>career</category>
      <category>beginners</category>
      <category>interview</category>
    </item>
    <item>
      <title>Mistakes I made while studying Machine Learning</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Tue, 20 Aug 2024 07:13:44 +0000</pubDate>
      <link>https://forem.com/ngneha09/mistakes-i-made-while-studying-machine-learning-5dn9</link>
      <guid>https://forem.com/ngneha09/mistakes-i-made-while-studying-machine-learning-5dn9</guid>
      <description>&lt;p&gt;Hey there 👋 Hope you are doing well 😊&lt;br&gt;
We all know that this is the decade of artificial intelligence, data science, machine learning and the like. These skills are very important, and when added to your resume they can make it stand out from the crowd. But while learning these skills it is very important to follow the right path; misleading paths can waste a lot of your time. In this post I'll tell you what mistakes I made while I was learning ML. This post is helpful for those who are just starting their journey in AI, and it will save you a lot of time 😌&lt;br&gt;
So let's get started 🔥&lt;/p&gt;

&lt;h2&gt;
  
  
  Didn't do much research
&lt;/h2&gt;

&lt;p&gt;When I was starting my ML journey I didn't dedicate time to doing research and collecting resources. I just jumped into it and found myself puzzled. There were times when my basics were not clear and I was overwhelmed by intermediate topics, often scratching my head over different concepts. After all this I realized that it is very important to do research before starting anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Directly jumped into ML algorithms
&lt;/h2&gt;

&lt;p&gt;Now you know that I had not done enough research, and I was so excited to study ML that I didn't bother learning the basics and started learning ML algorithms from day 1. Seriously, it was such an awful mistake. I should have started with Python, followed by maths, then EDA and Feature Engineering, and only then ML algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Following more than one playlist/course at a time
&lt;/h2&gt;

&lt;p&gt;I was studying from YouTube and started with one playlist. At the start everything was going pretty well, but later I found myself distracted from my ongoing course and started learning from the numerous other courses out there. I watched different videos for a single topic, and this took a lot of time. I was also easily drawn in by content: whenever I found a video from MIT or Stanford I would start learning from it, and as a beginner back then, jumping into those was a mistake. So it is very important to stick to the playlist or course you are following or have decided to follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everybody has their own way of getting things done
&lt;/h2&gt;

&lt;p&gt;This is one of the most important things I have understood lately. When we learn anything, we do it in our own way, whether it is web development, data structures, or anything else. When I was learning ML I used to follow different people and their techniques: when someone used a Label Encoder I started using that, and when someone used an Ordinal Encoder I switched to that, and this made me feel like I was drowning in an ocean. With time I have realized that people have their own way of doing things, and I have to find my own way too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not practicing and revising concepts
&lt;/h2&gt;

&lt;p&gt;Back then I was too lazy to implement and revise the concepts that I had learned on a particular day. When enough time had passed it felt like heavy baggage: I started forgetting things and found myself stuck. So it is very important to regularly revise and practice concepts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sticking for hours
&lt;/h2&gt;

&lt;p&gt;Whenever I didn't get a concept I kept studying it for hours and even days, and this took a lot of my time. Sticking with things and completing them is very important, but grinding on something for a long time when nothing is working out is a grave mistake. If you don't get something, give it some time or seek help from somebody else; this will save a lot of your time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leave your damn ego
&lt;/h2&gt;

&lt;p&gt;So yes, I was someone who liked to get things done on my own, and this is why I didn't like seeking help from others. Whenever I got stuck somewhere I never asked anyone to help me out, and this took a lot of my time. So don't repeat this mistake; ask for help whenever needed.&lt;/p&gt;

&lt;p&gt;So these were the mistakes I made during my ML journey. I mention them in this post because I don't want anyone to repeat them at the cost of their time.&lt;/p&gt;

&lt;p&gt;I hope you liked my post. For more follow me.&lt;br&gt;
Thankyou 💚&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>basic</category>
    </item>
    <item>
      <title>How to get started with Machine Learning?</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Thu, 15 Aug 2024 10:20:51 +0000</pubDate>
      <link>https://forem.com/ngneha09/how-to-get-started-to-machine-learning-4ngo</link>
      <guid>https://forem.com/ngneha09/how-to-get-started-to-machine-learning-4ngo</guid>
      <description>&lt;p&gt;Hey there 👋 Hope you are doing well 😊&lt;br&gt;
As you know, the AI wave is everywhere: everyone is trying different AI-based services and getting amazing results. Every platform out there is embedding AI to make itself smarter and more useful. AI is one of the most important skills of this decade, but getting started with it is really difficult and can sometimes be misleading. So it is very important to follow the right path to understand AI better. Getting into AI means getting started with Machine Learning. In this article I am going to tell you how you can start your Machine Learning journey and master it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started with Python
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae7m62ckq1irklihswab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae7m62ckq1irklihswab.png" alt="Image description" width="214" height="148"&gt;&lt;/a&gt;&lt;br&gt;
To get into Machine Learning you should know Python first. Learn the basics first: how variables are created and manipulated, how loops and conditions work, how functions are defined and used, and how lists, arrays, maps, etc. are created and used. Then you should learn about OOP in Python. Finally, you should know the Pandas and NumPy libraries well, as these are going to be very useful in your Data Science journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn Maths
&lt;/h2&gt;

&lt;p&gt;While learning Python, you should also devote time to studying maths, as this subject is the backbone of Machine Learning, Deep Learning, NLP, etc. You should be familiar with Statistics, Linear Algebra, Probability Theory, Hypothesis Testing, Calculus, and Optimization. If you know these topics well (with clarity in the basics), then you will find ML algorithms very easy. Also, don't forget to implement maths functions using Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Study EDA and Feature Engineering
&lt;/h2&gt;

&lt;p&gt;Now it is time to play with data 😈. EDA stands for Exploratory Data Analysis; it gives you important insights into your dataset and is very important for understanding the relationships between different features. Feature Engineering involves manipulating the features in your dataset to make your data more useful. This whole process is about getting to know your data and handling it efficiently. The libraries you should know here are Seaborn, Matplotlib, Missingno, and PyOD.&lt;/p&gt;
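
&lt;p&gt;A tiny sketch of what typical EDA calls look like (the CSV file and column name below are placeholders, not a real dataset):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("house_prices.csv")      # placeholder dataset
df.info()                                 # column types and missing values
print(df.describe())                      # basic statistics per numeric column

sns.histplot(df["price"])                 # distribution of one feature ("price" is a placeholder)
plt.show()
sns.heatmap(df.select_dtypes("number").corr(), annot=True)   # relationships between features
plt.show()
&lt;/code&gt;&lt;/pre&gt;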

&lt;h2&gt;
  
  
  Machine Learning Algorithms
&lt;/h2&gt;

&lt;p&gt;This is the part you have been waiting for. Start studying Machine Learning algorithms now. Study all the supervised and unsupervised learning algorithms: get the mathematical and geometric intuition, implement them from scratch, and then learn the pre-defined libraries. You should know about sklearn (scikit-learn) and its submodules here.&lt;/p&gt;
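
&lt;p&gt;For instance, a minimal scikit-learn workflow on a built-in dataset looks roughly like this (just to show the shape of the API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)     # a simple supervised algorithm
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
&lt;/code&gt;&lt;/pre&gt;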

&lt;h2&gt;
  
  
  Improve Your Model
&lt;/h2&gt;

&lt;p&gt;Now that you know about Machine Learning algorithms, it is time to learn the techniques used to improve them. Learn cross-validation techniques, hyperparameter tuning techniques, and ensembling methods. Get into the details and practice them on datasets.&lt;/p&gt;
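
&lt;p&gt;A small sketch of cross-validation and a grid search with scikit-learn (the parameter grid is an illustrative choice, not a recommendation):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation of a single model (an ensemble method, as it happens)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean())

# Grid search over a few hyperparameters, itself evaluated with cross-validation
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
&lt;/code&gt;&lt;/pre&gt;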

&lt;h2&gt;
  
  
  Auto ML
&lt;/h2&gt;

&lt;p&gt;As you have come a long way, it is now time to automate your tasks. Study AutoML, practice, and get to know about Pipelines, Optuna, etc.&lt;/p&gt;
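
&lt;p&gt;As a taste of the pipeline idea, here is a minimal scikit-learn Pipeline that chains preprocessing and a model into one object (Optuna and full AutoML tools are beyond this tiny sketch):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),   # preprocessing step
    ("model", SVC()),              # estimator step
])
print(cross_val_score(pipe, X, y, cv=5).mean())
&lt;/code&gt;&lt;/pre&gt;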

&lt;h2&gt;
  
  
  Know about Version Control
&lt;/h2&gt;

&lt;p&gt;Learn to use Git so that you can manage your code, collaborate with others, and streamline deployment and project management.&lt;/p&gt;

&lt;p&gt;And the learning path goes on....&lt;br&gt;
If you really want to master it then practice is very important. You can practice ML on different platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Platforms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kaggle -: This is one of the most famous platforms for data science. It hosts competitions and has a wide variety of datasets, tutorials, and a large community.&lt;br&gt;
Link -: &lt;a href="https://www.kaggle.com/"&gt;https://www.kaggle.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jupyter -: This is an IDE for creating Data Science projects. You can use Google Colab notebooks too.&lt;br&gt;
Link -: &lt;a href="https://jupyter.org/"&gt;https://jupyter.org/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub -: This is one of the most famous platforms. It hosts an enormous number of projects and datasets where you can practice and make contributions.&lt;br&gt;
Link -: &lt;a href="https://github.com/"&gt;https://github.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maths -: StatQuest statistics playlist on YouTube&lt;br&gt;
Link -: &lt;a href="https://youtu.be/qBigTkBLU6g?si=tn5f8dBQo_-xDfqr"&gt;https://youtu.be/qBigTkBLU6g?si=tn5f8dBQo_-xDfqr&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine Learning -: CampusX playlist (Hindi), Coursera Andrew Ng ML specialization course.&lt;br&gt;
CampusX&lt;br&gt;
Link -: &lt;a href="https://youtu.be/ZftI2fEz0Fw?si=VrfnThcAc8z0OQ2N"&gt;https://youtu.be/ZftI2fEz0Fw?si=VrfnThcAc8z0OQ2N&lt;/a&gt;&lt;br&gt;
Coursera&lt;br&gt;
Link -: &lt;a href="https://www.coursera.org/"&gt;https://www.coursera.org/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this was the complete roadmap for Machine Learning.&lt;br&gt;
Thankyou 💚&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Feature Transformation in Machine Learning || Feature Engineering</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Tue, 06 Aug 2024 11:24:41 +0000</pubDate>
      <link>https://forem.com/ngneha09/feature-transformation-in-machine-learning-feature-engineering-4of9</link>
      <guid>https://forem.com/ngneha09/feature-transformation-in-machine-learning-feature-engineering-4of9</guid>
      <description>&lt;p&gt;Hey reader 👋 Hope you are doing well 😊&lt;/p&gt;

&lt;p&gt;As you know, to get accurate predictions, our model should be trained well. For better training, our data should be processed properly. To gain valuable insights from data, we perform Exploratory Data Analysis (EDA). Using EDA, we engage in Feature Engineering to transform our data as required.&lt;/p&gt;

&lt;p&gt;In the process of Feature Engineering, we handle categorical data, missing values, outliers, feature selection, etc. Transforming numerical values is one of the critical tasks in Feature Engineering. This transformation allows us to convert all data into the same unit, making our data more efficient for model training.&lt;/p&gt;

&lt;p&gt;In this blog, we will discuss different types of transformations and their importance. So let's get started 🔥&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Transformation
&lt;/h2&gt;

&lt;p&gt;Feature Transformation refers to the process of converting data from one form to another. For example, transforming categorical data into numerical data, scaling numerical data, and converting data so that it follows the desired statistics of an algorithm (e.g., linear regression works well when the data is normally distributed).&lt;/p&gt;

&lt;p&gt;The different types of Feature Transformation are -:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Function Transformers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Power Transformers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Scaling&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encoding Categorical Data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Missing Value Imputation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Outlier Detection&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why is Feature Transformation Required?
&lt;/h2&gt;

&lt;p&gt;Imagine trying to solve a jigsaw puzzle with pieces that don’t quite fit together. In the same way, raw, unprocessed data might not fit the requirements of your machine-learning algorithms. Feature transformation is the process of reshaping those pieces, making them compatible and coherent, and ultimately, revealing the full picture.&lt;/p&gt;

&lt;p&gt;Machine learning algorithms often work better with features transformed to have similar scales or distributions. Feature transformation can lead to better model performance by improving the model’s ability to learn from the data.&lt;/p&gt;

&lt;p&gt;Feature transformation can reveal hidden patterns or relationships in the data that might not be apparent in the original feature space. By creating new features or modifying existing ones, you can expose valuable information that your model can use to make more accurate predictions.&lt;/p&gt;

&lt;p&gt;In some cases, feature transformation can help reduce the dimensionality of the data. This not only simplifies the modeling process but also helps prevent issues like the curse of dimensionality, which can lead to overfitting.&lt;/p&gt;

&lt;h2&gt;
  
  
  A brief about different Feature Transformation techniques
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Function Transformers&lt;/strong&gt; -: Function transformers apply a fixed mathematical function (such as log, square root or reciprocal) to the data to bring its distribution closer to normal (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Power Transformers&lt;/strong&gt; -: Power transformation techniques apply a power to the data observations to transform them. Techniques like the Box-Cox or Yeo-Johnson transformations are used to make data more normally distributed, which can be beneficial for certain algorithms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature Scaling&lt;/strong&gt; -: Feature scaling transforms all features onto a single scale. It either scales the data up or down as required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Encoding Categorical Data&lt;/strong&gt; -: Most machine learning algorithms work only with numerical data, so it is very important to convert categorical data into numerical form.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing Value Imputation&lt;/strong&gt; -: Sometimes our dataset may contain missing values which can affect our model significantly, so missing values should be handled properly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outlier Detection&lt;/strong&gt; -: Outliers are data points that behave very differently from the rest of the points in the dataset; they can hinder model performance, so they should be handled properly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
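
&lt;p&gt;A minimal sketch of a function transformer, a power transformer and feature scaling on a toy column (the values are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, PowerTransformer, StandardScaler

df = pd.DataFrame({"income": [25000, 32000, 40000, 52000, 250000]})
X = df[["income"]].to_numpy(dtype=float)

# Function transformer: apply log1p to reduce the right skew.
df["income_log"] = FunctionTransformer(np.log1p).fit_transform(X).ravel()

# Power transformer: Yeo-Johnson pushes the distribution closer to normal.
df["income_yeo_johnson"] = PowerTransformer(method="yeo-johnson").fit_transform(X).ravel()

# Feature scaling: standardize to zero mean and unit variance.
df["income_scaled"] = StandardScaler().fit_transform(X).ravel()

print(df)
&lt;/code&gt;&lt;/pre&gt;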

&lt;p&gt;So this is it for this blog. In the next blog we will see how &lt;strong&gt;Feature Scaling&lt;/strong&gt; is performed. Till then stay connected and don't forget to follow me.&lt;br&gt;
Thank you 💜&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Handling Missing Values || Feature Engineering || Machine Learning (Part2)</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Fri, 02 Aug 2024 08:50:09 +0000</pubDate>
      <link>https://forem.com/ngneha09/handling-missing-values-feature-engineering-machine-learning-part2-37l0</link>
      <guid>https://forem.com/ngneha09/handling-missing-values-feature-engineering-machine-learning-part2-37l0</guid>
      <description>&lt;p&gt;Hey reader👋Hope you are doing well😊&lt;br&gt;
We know that feature engineering is a crucial step for improving the performance of a machine learning model. One of the most important tasks in feature engineering is handling missing values. In this blog we are going to do a detailed discussion on handling missing values. So let's get started 🔥.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complete Case Analysis
&lt;/h2&gt;

&lt;p&gt;Complete Case Analysis (CCA), also known as &lt;strong&gt;"listwise deletion"&lt;/strong&gt;, is a method used to handle missing data. In this technique all the rows that contain one or more missing values are excluded from the dataset. &lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxu9fukol08eqziwx89i7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxu9fukol08eqziwx89i7.png" alt="Image description" width="540" height="235"&gt;&lt;/a&gt;&lt;br&gt;
So here in final dataset only those rows are included that contain complete data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Assumptions for CCA&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data should be missing completely at random.&lt;br&gt;
Suppose you have a dataset with 1000 rows and 5 columns, and 50 of those rows have missing values. If these 50 rows are random rows, you can remove them.&lt;br&gt;
Because the removed rows are random, the distribution of the data will remain (approximately) unchanged.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the proportion of missing data is small.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When simplicity and ease of implementation are prioritized.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Points of Complete Case Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Simplicity: CCA is straightforward to implement and understand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bias: If the missing data are not missing completely at random (MCAR), CCA can introduce bias into the analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Efficiency: Excluding data with missing values reduces the sample size, which can lead to a loss of statistical power.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Application: Commonly used in regression analysis, where only cases with complete data for all predictors are included.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use CCA when the missing data in a particular column is at most 5% of the rows. If a column has 95% or more of its values missing, you can drop that entire column instead. The sketch below shows how to check this before the full implementation.&lt;/p&gt;
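
&lt;p&gt;A minimal sketch for checking the percentage of missing values per column before deciding (the toy DataFrame is only an example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, 29],
    "salary": [50000, 60000, np.nan, 80000, 75000],
})

# Percentage of missing values in each column.
missing_pct = df.isnull().mean() * 100
print(missing_pct)
# Columns with a small share of missing rows are candidates for CCA;
# columns that are almost entirely missing are candidates for dropping.
&lt;/code&gt;&lt;/pre&gt;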

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Load your dataset
# Replace 'your_dataset.csv' with the actual file path
df = pd.read_csv('your_dataset.csv')

# Display the original data
print("Original Data:")
print(df.head())

# Filter out rows with any missing data (Complete Case Analysis)
df_complete_case = df.dropna()

# Display data after applying CCA
print("\nData after Complete Case Analysis:")
print(df_complete_case)

# Check the number of rows before and after CCA
print("\nNumber of rows before CCA:", len(df))
print("Number of rows after CCA:", len(df_complete_case))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that CCA can introduce bias into the data, because removing rows carries the risk of losing important information.&lt;br&gt;
Check the following notebook for an implementation of handling missing values -:&lt;br&gt;
&lt;a href="https://www.kaggle.com/code/nehagupta09/beginner-s-guide-to-handle-missing-values"&gt;https://www.kaggle.com/code/nehagupta09/beginner-s-guide-to-handle-missing-values&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope you have understood how missing values are handled in our dataset. In the next blog we will take our discussion on feature engineering further. Till then stay connected and don't forget to follow me.&lt;/p&gt;

&lt;p&gt;Thank you 💙&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Handling Missing Values || Feature Engineering || Machine Learning (Part1)</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Sat, 20 Jul 2024 07:56:22 +0000</pubDate>
      <link>https://forem.com/ngneha09/handling-missing-values-feature-engineering-machine-learning-part1-4h7b</link>
      <guid>https://forem.com/ngneha09/handling-missing-values-feature-engineering-machine-learning-part1-4h7b</guid>
      <description>&lt;p&gt;Hey reader👋Hope you are doing well😊&lt;br&gt;
We know that feature engineering is a crucial step for improving the performance of a machine learning model. One of the most important tasks in feature engineering is handling missing values. In this blog we are going to do a detailed discussion on handling missing values. So let's get started 🔥.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Missing Values?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like “NA” or “unknown.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid4xvgcaedqg9nxc31yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid4xvgcaedqg9nxc31yi.png" alt="Image description" width="727" height="370"&gt;&lt;/a&gt;&lt;br&gt;
There are many reasons for a dataset to contain missing values-:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Due to technical issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the data comes from a survey then many people can leave blank response which can lead to missing values in the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data processing issues, privacy concerns etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Missing Values
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing Completely at Random (MCAR)&lt;/strong&gt;&lt;br&gt;
MCAR is a specific type of missing data in which the probability of a data point being missing is entirely random and independent of any other variable in the dataset. In simpler terms, whether a value is missing or not has nothing to do with the values of other variables or the characteristics of the data point itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing at Random (MAR)&lt;/strong&gt;&lt;br&gt;
MAR is a type of missing data where the probability of a data point missing depends on the values of other variables in the dataset, but not on the missing variable itself. For example, if someone lost a schedule, then it may be replaced by a schedule taking at random from the set of filled schedules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing not at random (MNAR)&lt;/strong&gt;&lt;br&gt;
MNAR is the most challenging type of missing data to deal with. It occurs when the probability of a data point being missing is related to the missing value itself. This means that the reason for the missing data is informative and directly associated with the variable that is missing. For example, when smoking status is not recorded in patients admitted as an emergency, who are also more likely to have worse outcomes from surgery.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How missing values impact our dataset?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It can reduce the size of the sample or dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lack of information: if the dataset has a large amount of missing values, there is a high chance of losing useful information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the missing data is not handled properly, it can bias the results of your analysis (for example, the model may not train properly on the dataset).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some statistical techniques require complete data for all variables, making them inapplicable when missing values are present.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Identify missing values
&lt;/h2&gt;

&lt;p&gt;There are different methods in Python's pandas library to identify missing values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.isnull()&lt;/code&gt;&lt;/strong&gt; -: Identifies missing values in a Series or DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.notnull()&lt;/code&gt;&lt;/strong&gt; -: Check for missing values in a pandas Series or DataFrame. It returns a boolean Series or DataFrame, where True indicates non-missing values and False indicates missing values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.isna()&lt;/code&gt;&lt;/strong&gt; -: An alias of &lt;code&gt;.isnull()&lt;/code&gt;; it returns True for missing values and False for non-missing values (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
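
&lt;p&gt;A minimal sketch of these methods on a toy Series:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

print(s.isnull())        # True where the value is missing
print(s.notnull())       # True where the value is present
print(s.isna())          # alias of isnull(): True where the value is missing
print(s.isnull().sum())  # number of missing values (2 here)
&lt;/code&gt;&lt;/pre&gt;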

&lt;h2&gt;
  
  
  Treating Missing Values
&lt;/h2&gt;

&lt;p&gt;There are various techniques used to treat missing values in a dataset. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Remove all the missing data&lt;/strong&gt;&lt;br&gt;
If the dataset doesn't contain a significant amount of missing data, it is reasonable to remove all the missing data. The method used in Python is-:&lt;br&gt;
&lt;strong&gt;&lt;code&gt;dropna()&lt;/code&gt;&lt;/strong&gt; -: Drops rows or columns containing missing values based on custom criteria.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Imputation&lt;/strong&gt;&lt;br&gt;
Imputation means replacing a missing value with another value based on a reasonable estimate. This has a chance of introducing bias.&lt;br&gt;
Some common Imputation methods are -:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean Imputation&lt;/strong&gt; -: Replace missing values with the mean of the relevant variable. This strategy can be heavily affected by outliers.
Implementation -:
Method 1-:
&lt;code&gt;df[column_name].fillna(df[column_name].mean())&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Method 2 -:&lt;br&gt;
Using &lt;code&gt;SimpleImputer()&lt;/code&gt;-:&lt;br&gt;
It is defined in the sklearn library. It replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbey0sfjumttotosq3rp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbey0sfjumttotosq3rp.png" alt="Image description" width="800" height="224"&gt;&lt;/a&gt;&lt;br&gt;
Here we have imported numpy and SimpleImputer and then created an instance of SimpleImputer named &lt;code&gt;imp_mean&lt;/code&gt;, which replaces missing values (&lt;code&gt;np.nan&lt;/code&gt;) with the mean (&lt;code&gt;strategy="mean"&lt;/code&gt;). Then we have fitted the data to the imputer and transformed it.&lt;br&gt;
We can use different strategies to impute missing values here.&lt;/p&gt;
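
&lt;p&gt;A minimal sketch of SimpleImputer with the mean strategy (the array values are only placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0], [np.nan], [10.0], [4.0]])

imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp_mean.fit_transform(X)  # the NaN becomes the column mean, 7.0
print(X_filled)
&lt;/code&gt;&lt;/pre&gt;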

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Median Imputation&lt;/strong&gt; -: Replace missing values with the median of the relevant variable. &lt;br&gt;
Implementation -:&lt;br&gt;
&lt;code&gt;df[column_name].fillna(df[column_name].median())&lt;/code&gt;&lt;br&gt;
We can also use SimpleImputer; all we need to do is set strategy="median".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mode Imputation&lt;/strong&gt; -: Replace missing values with the mode of the relevant variable. &lt;br&gt;
Implementation -:&lt;br&gt;
&lt;code&gt;df[column_name].fillna(df[column_name].mode())&lt;/code&gt;&lt;br&gt;
We can also use SimpleImputer; all we need to do is set strategy="most_frequent".&lt;br&gt;
This strategy can be challenging in case of multimodal data (having more than one mode).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Forward and Backward Fill&lt;/strong&gt; &lt;br&gt;
Replace missing values with the previous or next non-missing value in the same variable.&lt;br&gt;
These fill methods are particularly useful when there is a logical sequence or order in the data, and missing values can be reasonably assumed to follow a pattern. The method parameter in &lt;code&gt;fillna()&lt;/code&gt; allows you to specify the filling strategy, and here it’s set to ‘ffill’ for forward fill and ‘bfill’ for backward fill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forward Fill&lt;/strong&gt;&lt;br&gt;
It replaces missing values with the last observed non-missing value in the column.&lt;br&gt;
Implementation-:&lt;br&gt;
&lt;code&gt;forward_fill = df[column_name].fillna(method='ffill')&lt;/code&gt;&lt;br&gt;
The result is stored in the variable forward_fill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backward Fill&lt;/strong&gt;&lt;br&gt;
It replaces missing values with the next observed non-missing value in the column.&lt;br&gt;
Implementation-:&lt;br&gt;
&lt;code&gt;backward_fill = df[column_name].fillna(method='bfill')&lt;/code&gt;&lt;br&gt;
The result is stored in the variable backward_fill.&lt;/p&gt;
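
&lt;p&gt;A minimal sketch of both fills on a toy column (newer versions of pandas also expose these directly as &lt;code&gt;ffill()&lt;/code&gt; and &lt;code&gt;bfill()&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, np.nan, 40, np.nan])

forward_fill = s.ffill()   # each gap takes the last observed value
backward_fill = s.bfill()  # each gap takes the next observed value

print(forward_fill.tolist())   # [10.0, 10.0, 10.0, 40.0, 40.0]
print(backward_fill.tolist())  # [10.0, 40.0, 40.0, 40.0, nan]
&lt;/code&gt;&lt;/pre&gt;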

&lt;p&gt;There are two more techniques which we will see in the next blog.&lt;br&gt;
I hope you have understood how missing values are handled in our dataset. In the next blog we will take our discussion further. Till then stay connected and don't forget to follow me.&lt;br&gt;
Thank you 💙&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Handling Outliers|| Feature Engineering || Machine Learning</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Wed, 17 Jul 2024 05:56:08 +0000</pubDate>
      <link>https://forem.com/ngneha09/handling-outliers-feature-engineering-machine-learning-3316</link>
      <guid>https://forem.com/ngneha09/handling-outliers-feature-engineering-machine-learning-3316</guid>
      <description>&lt;p&gt;Hey reader👋Hope you are doing well😊&lt;br&gt;
We know that feature engineering is a crucial step for improving the performance of a machine learning model. One of the most important tasks in feature engineering is handling outliers. In this blog we are going to do a detailed discussion on handling outliers. So let's get started 🔥.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Outliers?
&lt;/h2&gt;

&lt;p&gt;Outliers are extreme values that differ from most other data points in a dataset. They can have a big impact on statistical analysis and skew the result of any hypothesis test.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsafgxn1zh8gxpdwa3731.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsafgxn1zh8gxpdwa3731.png" alt="Image description" width="617" height="450"&gt;&lt;/a&gt;&lt;br&gt;
To understand it better let's consider an example-:&lt;br&gt;
Dataset A = [1,2,3,4,5,6]&lt;br&gt;
Mean =&amp;gt; 3.5&lt;br&gt;
Now let's add some more data points to the dataset.&lt;br&gt;
A = [1,2,3,4,5,6,100,101]&lt;br&gt;
Mean =&amp;gt; 27.75&lt;br&gt;
So here we can see that the mean shoots up just by adding two points, and these two points are very different from the rest of the points in the dataset, so they are definitely outliers.&lt;br&gt;
Outliers can negatively affect our data and modeling, so it is very important to handle them properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Outliers are introduced in Data?
&lt;/h2&gt;

&lt;p&gt;Outliers in a dataset can be introduced through various mechanisms, both intentional and unintentional. Here are some common ways outliers can be introduced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Human Error: Manual data entry mistakes, such as typing errors, can lead to outliers. For example, entering an extra zero or a decimal point in the wrong place.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Instrument Error: Faulty measurement instruments or sensors can produce erroneous values that stand out as outliers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rare Events: Some outliers occur naturally due to rare events or extreme conditions. For example, an unusually high sales figure during a holiday season.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Merging Datasets: Combining datasets with different scales or units without proper alignment or adjustment can introduce outliers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Intentional Manipulation: In some cases, outliers might be introduced intentionally, such as in fraudulent financial reporting or tampering with experimental data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Outliers
&lt;/h2&gt;

&lt;p&gt;Based on their characteristics, outliers or anomalies can be divided into three categories -:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Global Outliers&lt;/strong&gt;&lt;br&gt;
Any observations or data points are considered as global outliers if they deviate significantly from the rest of the observations or data points in a dataset. For example, if you are collecting observations of temperatures in a city, then a value of 100 degrees would be considered an outlier, as it is an extreme as well as impossible temperature value for a city.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ykkjygjc118m7yrzzjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ykkjygjc118m7yrzzjc.png" alt="Image description" width="582" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Contextual Outliers&lt;/strong&gt;&lt;br&gt;
Any data points or observations are considered as contextual outliers if their value significantly deviates from the rest of the data points in a particular context. It means that the same values may not be considered an outlier in a different context. For example, if you have observations of temperatures in a city, then a value of 40 degrees would be considered an outlier in winter, but the same value might be part of the normal observations in summer.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjghex797xmxxfp9wg6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjghex797xmxxfp9wg6y.png" alt="Image description" width="728" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Collective Outliers&lt;/strong&gt;&lt;br&gt;
Any group of observations or data points within a data set is considered collective outliers if these observations as a collection deviate significantly from the entire data set. It means that these values, individually without collection with other data points, are not considered as either contextual or global outliers.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpv220s661qsdr9h6gem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpv220s661qsdr9h6gem.png" alt="Image description" width="452" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying Outliers
&lt;/h2&gt;

&lt;p&gt;There are four ways of identifying outliers -:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Percentile Method&lt;/strong&gt; &lt;br&gt;
The percentile method identifies outliers in a dataset by comparing each observation to the rest of the data using percentiles. In this method, We first define the upper and lower bounds of a dataset using the desired percentiles. &lt;br&gt;
For example, we may use the 5th and 95th percentile for a dataset's lower and upper bounds, respectively. Any observations or data points that reside beyond and outside of these bounds can be considered outliers.&lt;br&gt;
This method is simple and useful for identifying outliers in symmetrical and normal distributions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Interquartile Range (IQR) Method&lt;/strong&gt;&lt;br&gt;
This method is similar to the percentile method; the slight difference is that here we define an interquartile range for detecting outliers.&lt;br&gt;
Q1 = 25th percentile&lt;br&gt;
Q3 = 75th percentile&lt;br&gt;
IQR = Q3-Q1&lt;br&gt;
Upper bound = Q3+1.5*(IQR)&lt;br&gt;
Lower bound = Q1-1.5*(IQR)&lt;br&gt;
We check every data point: if the point lies in the range [Lower bound, Upper bound] it is a valid point, otherwise it is an outlier.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3gswyiff9nmrwy84xfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3gswyiff9nmrwy84xfa.png" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;br&gt;
We are considering 25th and 75th percentile here because we are assuming that our data is normally distributed and most of our data resides in this range.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Using Visualization&lt;/strong&gt;&lt;br&gt;
In python we can use box plot or whisker plot to detect outliers in a dataset.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wbnwzhq766edpr7ldw8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wbnwzhq766edpr7ldw8.png" alt="Image description" width="799" height="420"&gt;&lt;/a&gt;&lt;br&gt;
The box plot just gives the visualization of IQR method.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Using Z score method&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;For a given value, the z-score represents its distance from the mean in units of the standard deviation. For example, a z-score of 2 represents that the data point is 2 standard deviations away from the mean. To detect the outliers using the z-score, we can define the lower and upper bounds of the dataset. The upper bound is defined as z = 3, and the lower bound is defined as z = -3. This means any value more than 3 standard deviations away from the mean will be considered an outlier.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3j3877wtojs3ib1q9b16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3j3877wtojs3ib1q9b16.png" alt="Image description" width="748" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Implementation for detecting outliers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkizpasolmjh7lcqvvkn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkizpasolmjh7lcqvvkn.png" alt="Image description" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Outliers
&lt;/h2&gt;

&lt;p&gt;Depending on the dataset there are various ways to handle outliers-:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Removing Outliers&lt;/strong&gt;&lt;br&gt;
If the outliers are the result of manual error, it is better to remove them entirely from the dataset. However, if the dataset contains a large number of outliers, removing them may result in a loss of data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transforming Outliers&lt;/strong&gt;&lt;br&gt;
The impact of outliers can be reduced or eliminated by transforming the feature. For example, a log transformation of a feature can reduce the skewness in the data, reducing the impact of outliers.&lt;br&gt;
(We will read about transformations in upcoming blogs)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Impute Outliers&lt;/strong&gt;&lt;br&gt;
In this approach outliers are treated as missing values, and we can replace them with the mean, median, mode, nearest neighbour, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use robust statistical methods&lt;/strong&gt;&lt;br&gt;
Some of the statistical methods are less sensitive to outliers and can provide more reliable results when outliers are present in the data. For example, we can use median and IQR for the statistical analysis as they are not affected by the outlier’s presence. This way we can minimize the impact of outliers in statistical analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python Implementation of Handling Outliers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlxrq9x5owg5yyhcrqn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlxrq9x5owg5yyhcrqn6.png" alt="Image description" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope you have understood how outliers are handled in our dataset. In the next blog we are going to read about how to handle missing values. Till then stay connected and don't forget to follow me.&lt;br&gt;
Thank you 💙&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Feature Engineering in ML</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Tue, 16 Jul 2024 09:27:47 +0000</pubDate>
      <link>https://forem.com/ngneha09/feature-engineering-in-ml-35id</link>
      <guid>https://forem.com/ngneha09/feature-engineering-in-ml-35id</guid>
      <description>&lt;p&gt;Hey reader👋&lt;br&gt;
We know that we train machine learning model on a dataset and generate prediction on any unseen data based on training. The data which we are using here must be structured and well defined so that our algorithm can work efficiently. To make our data more meaningful and useful for our algorithm we perform Feature Engineering on our dataset. Feature Engineering is one of the most important steps in Machine Learning.&lt;br&gt;
In this blog we are going to learn about Feature Engineering and its importance. So let's get started🔥&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Engineering
&lt;/h2&gt;

&lt;p&gt;Feature Engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of a machine learning algorithm.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26e1wbe1imipuze6wix4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26e1wbe1imipuze6wix4.png" alt="Image description" width="800" height="288"&gt;&lt;/a&gt;&lt;br&gt;
So here you can see that we are working on a dataset: as a very first step we process the data, then we extract the important features using feature engineering, then we scale the features, i.e. transform them into the same unit. Once feature engineering is performed on the dataset, we apply the algorithm and then evaluate the metrics. For better model performance we perform feature engineering on the dataset again until we get a good model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Feature Engineering?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improves Model Performance&lt;/strong&gt;: Well-crafted features can significantly enhance the predictive power of our models. The better the features, the more likely the model will capture the underlying patterns in the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduces Complexity&lt;/strong&gt;: By creating meaningful features, we can simplify the model's task, which often leads to better performance and reduced computational cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enhances Interpretability&lt;/strong&gt;: Good features can make our model more interpretable, allowing us to understand and explain how the model makes its predictions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Key Techniques in Feature Engineering
&lt;/h2&gt;

&lt;p&gt;The key techniques of Feature Engineering are -:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Feature Transformation&lt;/strong&gt; -: We can transform features so that our model can perform effectively on them and give better results. This generally involves -:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Missing Value Imputation -: Techniques include imputation (filling missing values with mean, median, or mode), or using algorithms that can handle missing data directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling Categorical Data -: Converting categorical variables into numerical ones using methods like one-hot encoding or label encoding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Outlier Detection -: Identifying and removing outliers can help in creating robust models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Scaling -:  Scaling features to a standard range or distribution can improve model performance, especially for distance-based algorithms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Feature Construction&lt;/strong&gt; -: Sometimes, to make our data more meaningful, we add some extra information to our data based on the existing information. This process is called Feature Construction (a short sketch follows the list below). It can be done in the following ways -:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Polynomial Features: Creating interaction terms or polynomial terms of existing features to capture non-linear relationships.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Domain-Specific Features: Using domain knowledge to create features that capture essential characteristics of the data. For example, in a financial dataset, creating features like debt-to-income ratio or credit utilization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Datetime Features: Extracting information such as day, month, year, or even whether a date falls on a weekend or holiday can provide valuable insights.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
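
&lt;p&gt;A minimal sketch of constructing datetime, domain-style and polynomial features (the columns and values are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-14", "2024-03-09"]),
    "length": [2.0, 3.0, 4.0],
    "width":  [1.0, 5.0, 2.0],
})

# Datetime features: expose parts of the date the model can use directly.
df["order_month"] = df["order_date"].dt.month
df["is_weekend"] = df["order_date"].dt.dayofweek.ge(5).astype(int)

# Domain-style feature: combine existing columns into a more meaningful one.
df["area"] = df["length"] * df["width"]

# Polynomial / interaction features on the numeric columns.
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["length", "width"]])
print(poly.get_feature_names_out(["length", "width"]))
print(poly_features)
&lt;/code&gt;&lt;/pre&gt;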

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Feature Selection&lt;/strong&gt; -: Feature Selection is the process of selecting a subset of relevant features from the dataset to be used in a machine learning model. The different techniques we use for feature selection are -:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Filter Method: Based on the statistical measure of the relationship between the feature and the target variable. Features with a high correlation are selected.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wrapper Method: Based on the evaluation of the feature subset using a specific machine learning algorithm. The feature subset that results in the best performance is selected.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Embedded Method: Based on the feature selection as part of the training process of the machine learning algorithm.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Feature Extraction&lt;/strong&gt; -: Feature Extraction is the process of creating new features from existing ones to provide more relevant information to the machine learning model. This is important because the representation of the features can strongly affect the performance of the model. The various techniques used for feature extraction are -:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dimensionality Reduction: Reducing the number of features by transforming the data into a lower-dimensional space while retaining important information. Examples are PCA and t-SNE.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Combination: Combining two or more existing features to create a new one. For example, the interaction between two features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Aggregation: Aggregating features to create a new one. For example, calculating the mean, sum, or count of a set of features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Transformation: Transforming existing features into a new representation. For example, log transformation of a feature with a skewed distribution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So this was an introduction to feature engineering. In the upcoming blogs we are going to study each technique separately. Till then stay connected and don't forget to follow me.&lt;br&gt;
Thank you ❤&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Handling Categorical Values|| Machine Learning</title>
      <dc:creator>Neha Gupta</dc:creator>
      <pubDate>Wed, 03 Jul 2024 17:25:37 +0000</pubDate>
      <link>https://forem.com/ngneha09/handling-categorical-values-machine-learning-a2</link>
      <guid>https://forem.com/ngneha09/handling-categorical-values-machine-learning-a2</guid>
      <description>&lt;p&gt;Hey reader👋 Hope you are doing well😊&lt;br&gt;
We know that machine learning is all about training our models on a given dataset and generating accurate output for any similar unseen data. There are algorithms (regression algorithms, for example) that work on numerical data only, and we know that a dataset may contain numerical as well as categorical data. So how can we use algorithms that only work on numerical data on such a dataset? To use regression algorithms on categorical data we need to transform the categorical data into numerical form. But how can we do that?🤔&lt;br&gt;
Don't worry, in this blog I am going to tell you how we can handle categorical data.&lt;br&gt;
So let's get started🔥&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Categorical Data
&lt;/h2&gt;

&lt;p&gt;Categorical data refers to the categories in the data. Example -&amp;gt; male, female, red, green, yes or no.&lt;br&gt;
(To understand the types of data that we can encounter, please read this article: &lt;a href="https://dev.to/ngneha09/day-2-of-machine-learning-582g"&gt;https://dev.to/ngneha09/day-2-of-machine-learning-582g&lt;/a&gt;)&lt;br&gt;
There are different techniques in Python's sklearn library to handle categorical data. Let's read about them-:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Label Encoder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Label Encoder identifies the unique categories within a categorical variable and then assigns a unique integer to each category. There is no strict rule on how these numerical labels are assigned; one common method is to assign labels based on the alphabetical order of the categories. It is best suited to ordinal categorical variables.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqctylo3s6trm2pzhcyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqctylo3s6trm2pzhcyj.png" alt="Image description" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Implementation-:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf4uylgq525ft26gtxog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf4uylgq525ft26gtxog.png" alt="Image description" width="800" height="165"&gt;&lt;/a&gt;&lt;br&gt;
So here you can see that we have imported the LabelEncoder from sklearn's &lt;code&gt;preprocessing&lt;/code&gt; module, then created its instance and transformed the categories into numerical labels using &lt;code&gt;fit_transform&lt;/code&gt;.&lt;/p&gt;
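
&lt;p&gt;A minimal text sketch of the same steps (the colour values are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]

le = LabelEncoder()
encoded = le.fit_transform(colors)

print(list(le.classes_))  # ['blue', 'green', 'red'] -- labels follow alphabetical order
print(list(encoded))      # [2, 1, 0, 1, 2]
&lt;/code&gt;&lt;/pre&gt;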

&lt;p&gt;Disadvantage-:&lt;br&gt;
Due to arbitrary assignment this technique may not reflect meaningful relationships in the data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. One Hot Encoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This technique creates a binary feature for each category in the original variable.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgqqhfz9wr5pc7oz1xak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgqqhfz9wr5pc7oz1xak.png" alt="Image description" width="800" height="224"&gt;&lt;/a&gt;&lt;br&gt;
So here you can see that the first row has the color red, so color_red is assigned 1 and the other columns are assigned 0.&lt;/p&gt;

&lt;p&gt;Implementation-:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuln2b822rxpr17yygq7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuln2b822rxpr17yygq7v.png" alt="Image description" width="800" height="326"&gt;&lt;/a&gt;&lt;br&gt;
Here we have imported OneHotEncoder, fitted it on the data and transformed the categories.&lt;/p&gt;
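
&lt;p&gt;A minimal text sketch of one-hot encoding (assuming a reasonably recent scikit-learn for &lt;code&gt;get_feature_names_out&lt;/code&gt;; &lt;code&gt;pd.get_dummies&lt;/code&gt; is a quick pandas alternative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[["color"]]).toarray()  # one binary column per category

print(ohe.get_feature_names_out(["color"]))
print(encoded)

# pandas one-liner that does the same job:
print(pd.get_dummies(df, columns=["color"]))
&lt;/code&gt;&lt;/pre&gt;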

&lt;p&gt;Disadvantages-:&lt;br&gt;
With high-cardinality categorical variables this can create a sparse matrix, a matrix where most of the elements are 0. It can also result in increased dimensionality of the data. Also, it is not good for ordinal data as it doesn't preserve order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Binary Encoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This technique is a combination of hashing and binary representation. In this technique the unique categories are assigned unique integers, which are then converted into binary code (bit representation), and each bit becomes a separate column.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2ixtdmdsf8rypsw87pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2ixtdmdsf8rypsw87pe.png" alt="Image description" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Implementation-:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9a0wg4g67w6zetbr1ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9a0wg4g67w6zetbr1ip.png" alt="Image description" width="800" height="376"&gt;&lt;/a&gt;&lt;br&gt;
Now you can see that the number of extra columns equals the number of bits needed to represent the largest integer assigned to the categories.&lt;br&gt;
This technique is best for nominal data where we have large number of categories.&lt;/p&gt;

&lt;p&gt;Disadvantage-:&lt;br&gt;
This technique is not good for ordinal data as it does not follow any order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Ordinal Encoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The critical aspect of Ordinal Encoding is to respect the inherent ordering of the categories. The integers should be assigned in such a way that the order of categories is preserved.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvr6hojabbhn7mmupt8lv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvr6hojabbhn7mmupt8lv.png" alt="Image description" width="327" height="137"&gt;&lt;/a&gt;&lt;br&gt;
So here you can see that Poor is assigned 1 then Good is assigned 2 and so on. So here the ordering of the categories is preserved.&lt;/p&gt;

&lt;p&gt;Implementation-:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvm32olm9okv1aqxlu1rk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvm32olm9okv1aqxlu1rk.png" alt="Image description" width="800" height="293"&gt;&lt;/a&gt;&lt;br&gt;
Here the encoder takes a 2D array. We can see that the encoded data is in alphabetical order; this is because we have not given the encoder any particular order, so it encodes the data on the basis of alphabetical order.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdouewz4y4s3oxwjn0bck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdouewz4y4s3oxwjn0bck.png" alt="Image description" width="800" height="259"&gt;&lt;/a&gt;&lt;br&gt;
Here we have created an OrdinalEncoder instance with the specified order of categories.&lt;/p&gt;
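
&lt;p&gt;A minimal text sketch of ordinal encoding with an explicit order (the category names here are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.preprocessing import OrdinalEncoder

ratings = [["Poor"], ["Good"], ["Excellent"], ["Good"]]

# Pass the desired order explicitly so it is preserved in the encoding.
oe = OrdinalEncoder(categories=[["Poor", "Good", "Excellent"]])
print(oe.fit_transform(ratings))  # [[0.], [1.], [2.], [1.]]
&lt;/code&gt;&lt;/pre&gt;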

&lt;p&gt;Disadvantages-:&lt;br&gt;
This encoding is not suitable for nominal variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Frequency Encoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is used for nominal categorical variables with high cardinality. In this technique we calculate the frequency of each category, and the encoded value is the count of that category divided by the total number of observations.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65bdu5wud8dlhuygf7k5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65bdu5wud8dlhuygf7k5.png" alt="Image description" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Implementation-:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkv3cvcflrjc1xxg6p70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkv3cvcflrjc1xxg6p70.png" alt="Image description" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Disadvantage-:&lt;br&gt;
The major disadvantage of this technique is that multiple categories can have the same frequency and, as a result, will get the same encoding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Mean Encoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this technique each category in the feature variable is replaced with the mean value of the target variable for that category. &lt;br&gt;
Example -: Suppose we are predicting the price of a car (target variable) and we have a categorical variable 'Color'. If the average price of red cars is $20,000, then 'Red' would be replaced by 20,000 in the encoded feature.&lt;br&gt;
It is useful when dealing with high cardinality Categorical features.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foahuk8ebv0527wyhrn2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foahuk8ebv0527wyhrn2n.png" alt="Image description" width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Implementation-:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjrfnyr454gth0t1kg43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjrfnyr454gth0t1kg43.png" alt="Image description" width="800" height="391"&gt;&lt;/a&gt;&lt;br&gt;
Here we have calculated the mean of the target variable for each category, mapped the original categories to their corresponding means and replaced each category with the computed mean.&lt;br&gt;
It has high chances of capturing any existing relationship between category and target variable.&lt;/p&gt;
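
&lt;p&gt;A minimal text sketch of mean (target) encoding with pandas (the colours and prices are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue"],
    "price": [21000, 15000, 19000, 17000, 16000],
})

# Mean of the target for each category, mapped back onto the feature column.
means = df.groupby("color")["price"].mean()
df["color_encoded"] = df["color"].map(means)
print(df)
&lt;/code&gt;&lt;/pre&gt;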

&lt;p&gt;Disadvantages-:&lt;br&gt;
Mean encoding can lead to overfitting, especially when categories have few observations. Regularization techniques, such as smoothing, can help mitigate this risk.&lt;/p&gt;

&lt;p&gt;So this is how we handle categorical values. I hope you have understood it well. For more, don't forget to follow me.&lt;br&gt;
Thank you❤&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
