<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: MNLKuzmin</title>
    <description>The latest articles on Forem by MNLKuzmin (@mnlkuzmin).</description>
    <link>https://forem.com/mnlkuzmin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F889093%2F8782a819-02ca-4982-bab0-0b46030346ed.jpg</url>
      <title>Forem: MNLKuzmin</title>
      <link>https://forem.com/mnlkuzmin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mnlkuzmin"/>
    <language>en</language>
    <item>
      <title>My experience at Flatiron School</title>
      <dc:creator>MNLKuzmin</dc:creator>
      <pubDate>Mon, 22 May 2023 02:22:35 +0000</pubDate>
      <link>https://forem.com/mnlkuzmin/my-experience-at-flatiron-school-4gj</link>
      <guid>https://forem.com/mnlkuzmin/my-experience-at-flatiron-school-4gj</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hTRPB5Ws--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8dlrxnuufgwz1xys1p8i.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hTRPB5Ws--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8dlrxnuufgwz1xys1p8i.jpeg" alt="Image description" width="800" height="119"&gt;&lt;/a&gt;&lt;br&gt;
Disclaimer: since Flatiron is so awesome, they keep reviewing their data and are constantly rethinking and restructuring the different courses and services offered, adapting to what helps and what doesn’t, what works and what doesn’t.&lt;br&gt;
Some of the things I mention here might not be structured the same way anymore, even if the core content of the classes and the structure of the course itself are the same. Visit &lt;a href="https://flatironschool.com/"&gt;flatironschool.com&lt;/a&gt; to see what the school offers and what has changed since this post and since I enrolled.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I loved
&lt;/h3&gt;

&lt;p&gt;I loved the things I learnt. I loved the flexibility and support I received. I loved all the different tools offered (even if I didn’t get to use them all). I loved the material that was presented, and that in a year's time (it could have been less if your schedule is not as busy as mine) I was ready for a whole new, beautiful, challenging and competitive career... starting almost from zero!&lt;/p&gt;

&lt;h3&gt;
  
  
  I was coming from…
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JYdvFR9c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3tvgkwnl9k1ay3u3i4w8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JYdvFR9c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3tvgkwnl9k1ay3u3i4w8.jpg" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
A background in physics, including a master's degree focusing on solid-state matter. Three years of work experience as an office manager in the healthcare industry. Three years spent at home with my sons.&lt;br&gt;
When my younger one was almost ready to start school, I began to wonder: what do I want to do once he’s in school and I’m able to work again?&lt;br&gt;
I knew I wanted to reconnect with my science background, but I wasn’t sure how (unfortunately, searching “physicist” on Indeed doesn’t really bring impressive results). I had taken one coding class (C++) during my bachelor's degree, and I didn’t like it because it felt abstract. It wasn’t connected to anything concrete that I was interested in.&lt;/p&gt;

&lt;p&gt;Fast forward roughly 8 years after completing my studies, having spent three years in a healthcare role and three years home as a stay-at-home mom (the most underestimated job out there, by the way - I strongly believe it should count as at least a master's degree).&lt;br&gt;
Now I wonder: what would I love to do? A friend gives me a great suggestion: start by looking up positions on job listing websites, read as many as I can, and select my 10 favorites. Guess what? I do it, and 7 out of 10 are data science positions.&lt;br&gt;
And interestingly, reading the position descriptions, coding sounded much more interesting than the idea I had of it back in college.&lt;br&gt;
At that point I was actually excited to start coding, if it meant looking at real data about real things and people and actually learning something from it (and possibly even helping others because of it!) - learning from what we have seen so far, in every direction and every field, to offer better services.&lt;/p&gt;

&lt;p&gt;I got excited about data science. Very. Around the same time, I found out from a close friend that a mutual friend of ours was switching careers to data science!&lt;br&gt;
He also had no prior experience. He was actually a high school theology teacher, with three little kids and a wife who works full time. And he was able to start and finish the bootcamp at Flatiron School remotely while still working full time and being a present and loving husband and dad.&lt;br&gt;
That’s it. If he could do it, so can I!&lt;br&gt;
I reached out and we met up. He was super excited to share his experience of the program and answered all my questions. To say the least, he was very, very satisfied with Flatiron and strongly recommended it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WrR9TMez--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6tjxzxpw0efesyck1nme.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WrR9TMez--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6tjxzxpw0efesyck1nme.jpg" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Flatiron School has to offer
&lt;/h3&gt;

&lt;p&gt;Here are just some of the pros I found:&lt;br&gt;
A new cohort starts every month. They offer in-person classes or self-paced remote programs.&lt;br&gt;
They have a campus in Manhattan and one in Denver.&lt;br&gt;
They offer classes in software engineering, data science, cybersecurity and product design.&lt;br&gt;
The school offers different payment plans, including the option to start paying for the course only once you complete it.&lt;br&gt;
They offer a flexible schedule, in which you are always followed and assisted by a coach. Meanwhile, all the material is accessible to you remotely, and you can do everything on your own time and at your own pace. There are lessons, labs, videos of recorded lectures, quizzes and projects.&lt;/p&gt;

&lt;p&gt;Flatiron also puts you in touch with a whole community of students, and there are all sorts of channels to join with meetings organized via Zoom for students to meet and help each other with the work, or simply mingle and get to know new people in your field.&lt;br&gt;
If you ever have a problem or a question - besides reaching out to classmates or your instructor - there are also technical coaches available after-hours.&lt;br&gt;
So if you are like I was in the first months - always working in the evenings and on weekends, when supposedly no one else does - there is always someone available to help and answer your questions (and here I have to officially thank Ashley for the three evenings she spent with me trying to make MongoDB run on my machine).&lt;br&gt;
Other services I didn’t get to take advantage of but that sounded very helpful: student advisors (for any non-technical problem or concern you might have), guided practices and office hours.&lt;br&gt;
I was lucky: having studied in Italy, I have extensive experience with oral examinations and am thus used to presenting and public speaking. But since the 5 projects with which you are evaluated include a non-technical presentation, the school offers practice sessions for students who want to improve their communication skills. You do not even need to pre-book: these practice sessions and office hours are offered as Zoom calls that you just hop into, and someone is there to help you.&lt;/p&gt;

&lt;p&gt;After understanding all the support the school has to offer, on top of the high level material and wonderful reputation, I decided this is for me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xnJVSAlx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ur602wv23ol2cr8h866.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xnJVSAlx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ur602wv23ol2cr8h866.jpg" alt="Image description" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  My journey
&lt;/h3&gt;

&lt;p&gt;I officially applied to the school, looked into their different financial options, took the entry test, had an entry interview and got accepted!&lt;br&gt;
I also actually won a scholarship for women in tech (and they offer other scholarships as well).&lt;br&gt;
There is some pre-work to do, mostly in math and logic and an intro to coding in Python.&lt;br&gt;
I chose the remote self-paced program because I was still home with my young son. That’s right: my son had not started school yet, and if I had wanted to go faster I could simply have waited for him to start school. But the truth is, once I found Flatiron I was so excited that I enrolled in April instead of waiting until September. I simply couldn’t wait to start this new adventure.&lt;br&gt;
When classes began, I had an introductory call with the classmates who were starting on the same date. While some in my cohort went at a much more rapid pace, I ended up becoming close to those on a similar timeline - especially my girl Eva, who has been with me all along, even presenting her final capstone project on the exact same day as me!&lt;br&gt;
We covered so many topics throughout this course; here is a quick summary, though it is only a glance at the breadth of the curriculum:&lt;/p&gt;

&lt;h3&gt;
  
  
  The data science curriculum
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Phase 1:
&lt;/h4&gt;

&lt;p&gt;Basic Python&lt;br&gt;
Terminal and Git&lt;br&gt;
Pandas&lt;br&gt;
SQL&lt;br&gt;
MongoDB&lt;br&gt;
Folium&lt;br&gt;
NumPy&lt;br&gt;
Matplotlib&lt;br&gt;
Seaborn&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 2:
&lt;/h4&gt;

&lt;p&gt;Combinatorics and probability&lt;br&gt;
Distributions&lt;br&gt;
Hypothesis Testing&lt;br&gt;
Bayesian Statistics&lt;br&gt;
Simple Linear Regression&lt;br&gt;
Multiple Linear Regression&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 3:
&lt;/h4&gt;

&lt;p&gt;Object Oriented Programming&lt;br&gt;
Linear Algebra&lt;br&gt;
Calculus&lt;br&gt;
Regularization&lt;br&gt;
Logistic Regression&lt;br&gt;
Classification Metrics&lt;br&gt;
Decision Trees&lt;br&gt;
K Nearest Neighbors&lt;br&gt;
Bayesian Classification&lt;br&gt;
Model Tuning and Pipelines&lt;br&gt;
Ensemble Methods:&lt;br&gt;
Bagging and Random Forests&lt;br&gt;
GridSearchCV&lt;br&gt;
Gradient Boosting&lt;br&gt;
XGBoost&lt;/p&gt;

&lt;p&gt;My project was a classification model, built using Random Forests, to identify the most determining factors in COVID-19 hospitalizations.&lt;br&gt;
Find the full project on my repository:&lt;br&gt;
&lt;a href="https://github.com/MNLKuzmin/Covid"&gt;https://github.com/MNLKuzmin/Covid&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 4:
&lt;/h4&gt;

&lt;p&gt;PCA&lt;br&gt;
Recommendation Systems&lt;br&gt;
Clustering&lt;br&gt;
Time Series&lt;br&gt;
Big Data and Pyspark&lt;br&gt;
Natural Language Processing&lt;br&gt;
Neural Networks&lt;br&gt;
Amazon Web Services&lt;/p&gt;

&lt;p&gt;My project was a time series analysis of electrical energy production from natural gas in the US.&lt;br&gt;
Find the whole project at the following link:&lt;br&gt;
&lt;a href="https://github.com/MNLKuzmin/USEnergy_Generation"&gt;https://github.com/MNLKuzmin/USEnergy_Generation&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Phase 5 capstone project:
&lt;/h4&gt;

&lt;p&gt;This was a Convolutional Neural Network to detect skin cancer. &lt;br&gt;
See this &lt;a href="https://dev.to/mnlkuzmin/skin-cancer-detection-with-convolutional-neural-networks-3mh1"&gt;link&lt;/a&gt; for my blog post and for the GitHub repo of my project:&lt;br&gt;
&lt;a href="https://github.com/MNLKuzmin/SkinCancerDetection"&gt;https://github.com/MNLKuzmin/SkinCancerDetection&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally I would like to thank a few people who helped me and accompanied me during this amazing journey:&lt;br&gt;
Eva, for being there with me in this adventure, with all the ups and downs that came along the way!&lt;br&gt;
Other classmates I met along the way: Heath, Joshua, Emily, Luke and David, for our Zoom calls, bouncing ideas off each other and supporting each other while navigating this great journey.&lt;br&gt;
Matt and Morgan, my coaches, for their help, support and patience with all my questions!&lt;br&gt;
Mark for the great reviews of my projects. For always giving detailed and helpful feedback and always wanting to teach me something new!&lt;br&gt;
Thank you Flatiron School for this amazing opportunity.&lt;br&gt;
I can’t wait to see what lies ahead!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>coding</category>
      <category>career</category>
    </item>
    <item>
      <title>Explainability in black-box models with LIME and activation layers</title>
      <dc:creator>MNLKuzmin</dc:creator>
      <pubDate>Fri, 19 May 2023 13:44:27 +0000</pubDate>
      <link>https://forem.com/mnlkuzmin/skin-cancer-detection-with-convolutional-neural-networks-3mh1</link>
      <guid>https://forem.com/mnlkuzmin/skin-cancer-detection-with-convolutional-neural-networks-3mh1</guid>
      <description>&lt;h2&gt;
  
  
  From the project Skin cancer detection with Convolutional Neural Networks
&lt;/h2&gt;

&lt;p&gt;Skin cancer is by far the most common type of cancer.&lt;br&gt;
According to one estimate, about 5.4 million cases of skin cancer are diagnosed among 3.3 million people each year (many people are diagnosed with more than one spot of skin cancer at the same time).&lt;br&gt;
1 out of 5 Americans will develop skin cancer by the time they are 70.&lt;/p&gt;

&lt;p&gt;Causes: Skin cancer occurs when errors (mutations) occur in the DNA of skin cells. The mutations cause the cells to grow out of control and form a mass of cancer cells.&lt;/p&gt;

&lt;p&gt;Risk factors: Fair skin, light hair, freckling, moles, history of sunburns, excessive sun exposure, sunny or high-altitude climates, precancerous skin lesions, weakened immune system, etc.&lt;/p&gt;

&lt;p&gt;If you have skin cancer, it is important to know which type you have because it affects your treatment options and your outlook (prognosis).&lt;br&gt;
If you aren’t sure which type of skin cancer you have, it is recommended that you ask your doctor so you are properly aware.&lt;/p&gt;

&lt;p&gt;A doctor will usually do an examination looking at all the skin moles, growths and abnormalities to understand which ones are at risk for being cancerous.&lt;/p&gt;

&lt;p&gt;But what if the doctor is not sure?&lt;br&gt;
What if we could develop a tool that could help the doctor decide with more confidence and ensure more safety for every patient?&lt;br&gt;
What if, to make a decision about a patient, the doctor could have the support of advanced technology and a model that makes its determination based on a direct comparison with thousands of other cases?&lt;br&gt;
This is what we were trying to achieve in this project.&lt;/p&gt;

&lt;p&gt;When the model is ready to be used in the field, an app can be developed from it.&lt;br&gt;
This app would use the cell phone camera and it would return the type of skin anomaly and the percentage probability of it being cancerous. The threshold above which a case is considered cancerous or at risk can be adjusted by the user, to decrease the chance of false negative cases.&lt;br&gt;
The app would also allow the user to upload the images to a database so the model will continue to improve in accuracy over time.&lt;/p&gt;

&lt;p&gt;The app will also show which parts of the images the model focuses on to make its determination, and show the filters the model applied to the image.&lt;br&gt;
In this way, since the level of detail a computer can scan is finer than what the human eye can see, the model might catch details that the doctor did not, allowing a more informed assessment.&lt;br&gt;
The app cannot substitute the critical judgment of a human being, but given the power this technology offers, we feel it could be a very useful tool to support a doctor in his decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The black box model and LIME
&lt;/h2&gt;

&lt;p&gt;One of the main issues with Convolutional Neural Networks (and Neural Networks in general) is that even though they are very powerful and efficient, they are hard to understand from the outside.&lt;br&gt;
They are what is usually called a "black-box model": we provide them with a structure and an input, and they produce a result that is often very accurate, but we have no way from the outside to see exactly how the model arrived at that result.&lt;br&gt;
The calculations are not explicit, and often not very intelligible anyway, so it is hard for us to trust the model - or, once it makes a mistake, to understand why it did and what went wrong.&lt;br&gt;
This is why tools like LIME, which help us understand more about the model, are becoming more and more popular.&lt;/p&gt;

&lt;p&gt;With LIME we can see which parts of the image most heavily influenced our model toward believing the picture belonged to the cancerous or the benign class.&lt;/p&gt;

&lt;p&gt;This can be extremely useful to doctors using our app: they don't need to trust our model blindly. For each image they can see which part of the picture led the model to its conclusion; whether the model focused on the wrong part or read the image correctly, the physician can draw his own conclusions and make a more informed decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary:
&lt;/h2&gt;

&lt;p&gt;Our data consisted of 2357 pictures of skin anomalies belonging to 9 different classes.&lt;br&gt;
The goal of the project was to build a model that could classify the images into their 9 native classes; a second model was then built to classify the images as cancerous or benign.&lt;/p&gt;

&lt;p&gt;Our models are all Convolutional Neural Networks, built with TensorFlow using the Keras API.&lt;br&gt;
We built sequential models with densely connected layers, and took advantage of regularizers and constraints to tune the models.&lt;/p&gt;

&lt;p&gt;For validation, cross-validation was run at every fitting of the model, setting aside a 20% validation set.&lt;br&gt;
During the grid searches, 3-fold cross-validation was used.&lt;/p&gt;
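&lt;p&gt;The pattern looks roughly like this (a sketch with a scikit-learn classifier and synthetic data for brevity - the project itself tuned Keras CNNs):&lt;/p&gt;

```python
# Sketch of the validation setup described above: a 20% holdout split,
# plus 3-fold cross-validation inside a grid search.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Set aside a 20% validation set, as at every fitting of the model.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 3-fold cross-validation during the grid search.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_val, y_val), 2))
```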

&lt;p&gt;Finally, LIME and visualization of activation layers were used for model explainability.&lt;/p&gt;

&lt;p&gt;The best 9-class model reached a mean accuracy (calculated over 10 samples) between 70% and 80% on the training set.&lt;br&gt;
When evaluated on the holdout test set, the results were an average accuracy between 15% and 20%.&lt;br&gt;
The binary classification model had a mean recall of 80% and f1 of 85% on the training set.&lt;br&gt;
On the test set we obtained a recall around 65% and an f1 around 70%, both with a classification threshold of 50%.&lt;br&gt;
When we moved the threshold to 30%, the model reached a recall of about 85% on the test set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Understanding:
&lt;/h3&gt;

&lt;p&gt;Let us dig deeper into what each of these classes is; we will preview one image per class to get a visual sense of what our model is going to be studying.&lt;/p&gt;

&lt;p&gt;In particular, we will divide the classes into two macro-classes, benign and malignant, since we will also build a model to determine whether an image is ultimately of benign or cancerous nature.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8cnONBSM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/weoju8n9a3e0fow5aut8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8cnONBSM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/weoju8n9a3e0fow5aut8.png" alt="Image description" width="800" height="635"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first five classes (Dermatofibroma, Pigmented benign keratosis, Seborrheic keratosis, Nevus and Vascular lesion) are benign, while the other four (Actinic keratosis, Basal cell carcinoma, Melanoma and Squamous cell carcinoma) are malignant.&lt;/p&gt;

&lt;p&gt;The distribution of the 9 classes is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F1K2njHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ls208o7fc751edu47ix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F1K2njHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ls208o7fc751edu47ix.png" alt="Image description" width="357" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To divide into the 2 classes, benign and cancerous, we grouped together the benign classes in one class, and all the malignant classes in another. We obtained this distribution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xbyqmR6S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p4g929hl3oeyvv2dqd1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xbyqmR6S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p4g929hl3oeyvv2dqd1v.png" alt="Image description" width="302" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Models:
&lt;/h3&gt;

&lt;p&gt;First we created a model to identify which of the 9 classes an image belongs to.&lt;br&gt;
All the models used are Convolutional Neural Networks (see the notebooks for details about their structure).&lt;br&gt;
We started with a naive model, with images of size 8x8 pixels and only one convolutional layer, one pooling layer and one dense layer.&lt;br&gt;
We increased the size of the images to 32x32 and then to 64x64, and continued with this size for the rest of the modeling.&lt;br&gt;
After that, we normalized the pixel values and started to tune the models, running grid searches on the following parameters: number of epochs, batch size, optimization algorithm, learning rate of the optimization algorithm, neuron activation function, and number of neurons. We also did some tuning to avoid overfitting, using L2 regularization and dropout.&lt;br&gt;
We used accuracy and loss as metrics to keep track of the progress of our model.&lt;br&gt;
The model we selected as the best one turned out to be the one created right after tuning the number of neurons.&lt;/p&gt;

&lt;p&gt;This model has the following parameters:&lt;br&gt;
activation function = relu for each layer except the last one, which uses softmax&lt;br&gt;
optimizer = Adam with learning rate 0.001&lt;br&gt;
neurons = 5 for each layer except the last one, which has 9&lt;/p&gt;
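&lt;p&gt;In Keras, a model with these parameters can be sketched as follows (the exact number and order of layers is not given here, so the layout - one convolutional, one pooling and one dense hidden layer, as in the naive model - is an assumption, not the project's exact code):&lt;/p&gt;

```python
# Sketch of a CNN with the stated parameters: relu hidden layers, a softmax
# output over 9 classes, 5 neurons per hidden layer, Adam with lr 0.001,
# and 64x64 RGB input images.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),            # 64x64 RGB images
    layers.Conv2D(5, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(5, activation="relu"),        # 5 neurons per hidden layer
    layers.Dense(9, activation="softmax"),     # one output per class
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"])
```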

&lt;p&gt;This model reached a mean accuracy between 70% and 80% and a mean loss between 0 and 2 on the training set.&lt;br&gt;
On the test set, the results were an average accuracy between 15% and 20%, with a loss on average between 6 and 14.&lt;/p&gt;

&lt;p&gt;With this model we obtained the following confusion matrix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o71i1Yo---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/idgp2q6uac9f2oof8jop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o71i1Yo---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/idgp2q6uac9f2oof8jop.png" alt="Image description" width="800" height="747"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next we reorganized the images into 2 classes instead of 9: 'benign' and 'malignant'.&lt;br&gt;
We tuned the model using grid searches over the same parameters as for the first model.&lt;br&gt;
In this case we chose two different metrics to evaluate the model: while still keeping an eye on accuracy and loss, we defined functions to extract the recall and f1 of our model.&lt;br&gt;
We chose recall as our metric because we wanted to minimize false negatives, and we kept monitoring f1 as well to make sure the overall performance of our model remained good.&lt;/p&gt;
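&lt;p&gt;As a reminder of what these two functions compute, here is a plain NumPy sketch of the formulas (the actual project functions operated on the model's outputs, but the math is the same):&lt;/p&gt;

```python
# Recall = TP / (TP + FN): how many true positives the model caught.
# F1 = harmonic mean of precision and recall: overall performance check.
import numpy as np

def recall(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)

def f1(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * precision * rec / (precision + rec)

y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1])
print(recall(y_true, y_pred), f1(y_true, y_pred))  # 0.75 0.75
```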

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xTAcTmuH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ehc8suoilzwv2kyxk3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xTAcTmuH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ehc8suoilzwv2kyxk3x.png" alt="Image description" width="349" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The performance of the binary classification model was a little unstable, but we found a way to select the best-performing one, obtaining a recall of around 80% and an f1 of roughly 85% on the training set.&lt;br&gt;
On the test set we obtained a recall of around 65% and an f1 of around 70%.&lt;br&gt;
When we lowered the classification threshold to 30%, we obtained a recall of about 70% on the test set.&lt;/p&gt;

&lt;h2&gt;
  
  
  LIME
&lt;/h2&gt;

&lt;p&gt;LIME stands for Local Interpretable Model-Agnostic Explanations. Local means it focuses on some of the results locally, rather than trying to understand the whole model.&lt;br&gt;
Interpretable, as we said, because it makes the model more interpretable; Model-Agnostic because it works with any machine learning classifier; Explanations because it returns an explanation of why the model made the classification it did and returned that specific result.&lt;br&gt;
In general, what LIME does is break the image into interpretable components called superpixels (clusters of contiguous pixels). It then generates a data set of perturbed instances by turning some of the interpretable components “off” (in this case, making some of the superpixels in our picture gray).&lt;/p&gt;

&lt;p&gt;For each perturbed instance, we get the probability that the skin growth is cancerous according to the model. We then learn a simple (linear) model on this data set, which is locally weighted — that is, we care more about making mistakes in perturbed instances that are more similar to the original image. In the end, we present the superpixels with highest positive weights as an explanation, graying out everything else.&lt;/p&gt;
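&lt;p&gt;Here is a toy NumPy sketch of that perturb-and-fit idea, with 4 "superpixels" and a stand-in model that only looks at superpixel 2 (a real run would use the lime package on the trained CNN, and actual image segmentation):&lt;/p&gt;

```python
# Toy version of LIME's core loop: perturb superpixels on/off, query the
# model, fit a locally weighted linear model, read off the top superpixel.
import numpy as np
from itertools import product

# All 16 on/off masks over 4 superpixels (1 = kept, 0 = grayed out).
masks = np.array(list(product([0, 1], repeat=4)), dtype=float)

# Stand-in "model": probability of cancer depends only on superpixel 2.
probs = masks[:, 2]

# Weight perturbed instances by similarity to the original (all-on) image:
# the fewer superpixels turned off, the higher the weight.
weights = np.exp(-np.sum(1 - masks, axis=1))

# Weighted least squares: solve (X^T W X) beta = X^T W y.
W = np.diag(weights)
beta = np.linalg.solve(masks.T @ W @ masks, masks.T @ W @ probs)
print(np.argmax(np.abs(beta)))  # superpixel 2 gets the largest weight
```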

&lt;p&gt;Example of image categorized correctly by the model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hy1Wc9cL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iu5iqnhzs9mxr17g3cfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hy1Wc9cL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iu5iqnhzs9mxr17g3cfi.png" alt="Image description" width="542" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this image, the parts that are blocked off, in black and red, are the parts of the image the model ignored in making its determination. The parts shown in natural color and green are the parts the model relied on most heavily.&lt;br&gt;
We can see that the model correctly identified the part of the image to focus on: the skin anomaly.&lt;/p&gt;

&lt;p&gt;Examples of images categorized incorrectly by the model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ntu1osnq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r0g4cl7jjm2r32wqz3u6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ntu1osnq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r0g4cl7jjm2r32wqz3u6.png" alt="Image description" width="619" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rzOMPSyk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e3vtzvo6lvrgovuhm4b4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rzOMPSyk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e3vtzvo6lvrgovuhm4b4.png" alt="Image description" width="798" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see here that the model did not correctly identify the parts of the image to focus on, and made its determination based on the skin around the lesion rather than the lesion itself.&lt;br&gt;
Seeing this image, the doctor will be warned that the model most likely made a mistake in determining the class of this image.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing Activation Layers
&lt;/h2&gt;

&lt;p&gt;One more thing we can offer to the AAD, to make clearer for doctors what led the model to its decision, is visualizing activation layers.&lt;br&gt;
This is part of how a Convolutional Neural Network works in order to make its determination and classify an image. We can visualize the intermediate hidden layers within our CNN to uncover what sorts of features our deep network is learning through its various filters.&lt;br&gt;
As mentioned before, to learn about an image a CNN applies different filters, and each new representation of the image is called a feature map. When we visualize activation layers, we look at the feature maps, each of which has a number of channels; we can visualize each of these channels by slicing the tensor along that axis.&lt;/p&gt;
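&lt;p&gt;In Keras this can be sketched as follows (the architecture and the choice of layer to tap are illustrative, not the project's exact model):&lt;/p&gt;

```python
# Extract one layer's feature maps with a second model that outputs the
# intermediate activation, then slice the tensor along the channel axis.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Build a small CNN with the functional API so intermediate tensors are handy.
inputs = keras.Input(shape=(64, 64, 3))
conv1 = layers.Conv2D(8, (3, 3), activation="relu")(inputs)
pooled = layers.MaxPooling2D((2, 2))(conv1)
outputs = layers.Dense(9, activation="softmax")(layers.Flatten()(pooled))
model = keras.Model(inputs, outputs)

# A second model whose output is the first conv layer's activation.
activation_model = keras.Model(inputs, conv1)

image = np.random.rand(1, 64, 64, 3).astype("float32")
feature_maps = activation_model.predict(image)  # shape (1, 62, 62, 8)

# Slice along the last (channel) axis to look at each feature map in turn;
# each slice could be shown with e.g. plt.imshow(channel, cmap="viridis").
for ch in range(feature_maps.shape[-1]):
    channel = feature_maps[0, :, :, ch]
```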

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UuKA_zmn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ob914x30oknmuk67zu3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UuKA_zmn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ob914x30oknmuk67zu3v.png" alt="Image description" width="675" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also visualize all of the channels from one activation function, with a loop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yqycqh6g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ilbjsoql9uzg11md7bi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yqycqh6g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ilbjsoql9uzg11md7bi.png" alt="Image description" width="581" height="732"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;We built two different models. The first identified images belonging to 9 different classes of skin anomalies.&lt;br&gt;
The best model we chose had the following characteristics:&lt;br&gt;
image size 64x64&lt;br&gt;
epochs 10, batch size 10&lt;br&gt;
optimization algorithm Adam, with learning rate 0.001&lt;br&gt;
activation function 'relu', number of neurons 5&lt;/p&gt;

&lt;p&gt;This model reached an average accuracy between 70% and 80% and a loss between 0 and 2 on the training set.&lt;br&gt;
When evaluated on the holdout test set, the average accuracy was between 15% and 20% and the loss between 6 and 14.&lt;/p&gt;

&lt;p&gt;The second model we built was a binary classification model, trying to identify whether an image belonged to the 'benign' or 'cancerous' class.&lt;br&gt;
This model was tuned just like the previous one in terms of image size, number of epochs, batch size, optimization algorithm, activation function, number of neurons, regularization and dropout.&lt;br&gt;
The performance of this model was a little unstable, but we found a way to select the best performing one (which changes, given the stochastic nature of NNs). In general we were able to obtain, on the training set, a recall of around 80% and an F1 score of roughly 85%.&lt;br&gt;
On the test set we obtained a recall of around 65% and an F1 score of around 70%.&lt;br&gt;
When we lowered the classification threshold to 30%, the recall on the test set rose to about 70%.&lt;/p&gt;

&lt;p&gt;We also used tools like LIME and the visualization of activation layers to make the model more explainable, so that if a physician is uncertain about the model's result, or wants to dig deeper for other reasons, they have the chance to see in more depth how the model processed the image and made its determination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Given the stochastic nature of Neural Networks, we were not able to obtain fully reproducible results. We hope that by building a broader database of images, and retraining the model whenever the database gets updated, we will be able to obtain more stable models with higher accuracy.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There might be limitations to uploading the images in a database for patient privacy reasons, so a HIPAA form would have to be provided and signed by the patient to be able to use the images of their skin.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The set of images for the 9 classes was not balanced, which might have brought the first model to recognize better the more populated classes versus the less populated ones. With a more extensive database to train the model, this issue could be solved and the model could improve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We had some technical limitations in terms of the running time of the code for which we could not run more grid searches or expand the ranges swept even more, or increase the number of layers or neurons of the model. With more computational power or taking advantage of one of the cloud services higher accuracy could be achieved.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Recommendations
&lt;/h3&gt;

&lt;p&gt;We know that black box models are scary and it can be hard to trust a computer with a patient's health. But we are not trying to substitute the physician, with their skill and critical thinking. Rather, we believe our app can be a useful tool to support the doctor in situations of uncertainty, using the power of always-evolving technology.&lt;br&gt;
Use the app to its full potential: look at the LIME explanations and activation layers, and set the threshold for images to be considered cancerous.&lt;br&gt;
Whenever a patient agrees to it, upload the images of their skin anomalies to keep our database growing and help us constantly improve our model.&lt;br&gt;
Report whenever there is a doubt or an error, so that the model can be trained better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;To improve our model and for a more in-depth study we could also:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Perfectly balance the classes in the 9-class model through image augmentation, to obtain better results.&lt;/li&gt;
&lt;li&gt;Utilize more powerful tools, like the pre-trained models available through Transfer Learning.&lt;/li&gt;
&lt;li&gt;Create a function that selects only the images classified incorrectly and runs them through the model again, or through another more powerful model (Transfer Learning).&lt;/li&gt;
&lt;li&gt;Flag images with uncertain probability. Most likely the images closest to error are the ones where the prediction is close to 0.5. We can select a range, say from 0.4 to 0.6, where the image, instead of being classified, gets flagged as uncertain and sent through the model again or through a more powerful model.&lt;/li&gt;
&lt;li&gt;Create the app that physicians can use, with the possibility to add images to the dataset, and periodically retrain and improve the model.&lt;/li&gt;
&lt;li&gt;Take a whole other set of skin lesion images and train our same model on them, to increase its accuracy and flexibility.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For More Information
&lt;/h3&gt;
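The flagging of uncertain predictions described above might look like this in code, with made-up probabilities (multiplying the two boolean masks acts as an elementwise AND):

```python
import numpy as np

# Made-up predicted probabilities from the binary model.
y_prob = np.array([0.95, 0.52, 0.41, 0.12, 0.58, 0.77, 0.45])

# Flag the band around 0.5 (here 0.4 to 0.6) as uncertain instead of
# classifying it; the product of the two boolean masks is their AND.
uncertain = (y_prob >= 0.4) * (0.6 >= y_prob)

flagged = y_prob[uncertain]                    # send through a stronger model
confident = y_prob[np.logical_not(uncertain)]  # classify as usual

print(flagged)    # [0.52 0.41 0.58 0.45]
print(confident)  # [0.95 0.12 0.77]
```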

&lt;p&gt;You can find my whole project on the following link on &lt;a href="https://github.com/MNLKuzmin/SkinCancerDetection"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Sources:&lt;br&gt;
&lt;a href="//Cancer.org/skin-cancer"&gt;American Cancer Society&lt;/a&gt;&lt;br&gt;
&lt;a href="//Skincancer.org"&gt;Skin Cancer Foundation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.mayoclinic.org/diseases-conditions/skin-cancer/symptoms-causes/syc-20377605"&gt;Mayo Clinic&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>github</category>
    </item>
    <item>
      <title>Use of polynomials in linear regression analysis - extension to linear models</title>
      <dc:creator>MNLKuzmin</dc:creator>
      <pubDate>Tue, 28 Mar 2023 14:24:15 +0000</pubDate>
      <link>https://forem.com/mnlkuzmin/use-of-polynomials-in-linear-regression-analysis-extension-to-linear-models-2636</link>
      <guid>https://forem.com/mnlkuzmin/use-of-polynomials-in-linear-regression-analysis-extension-to-linear-models-2636</guid>
      <description>&lt;h3&gt;
  
  
  We are talking about linear regression:
&lt;/h3&gt;

&lt;p&gt;We are taking two concepts that are very common in linear regression, and combining them to offset the limitations of both, to obtain better models and better results.&lt;/p&gt;

&lt;p&gt;The two concepts are interactions and polynomials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Interaction:
&lt;/h3&gt;

&lt;p&gt;This is how the concept of interaction was introduced to me (more or less):&lt;br&gt;
An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.&lt;br&gt;
Or in other words: when building a linear regression model, if the effect of a feature on the target is influenced by another feature, there is an interaction between them.&lt;br&gt;
This means that we can multiply the two variables and obtain a term that expresses their ‘interaction’. By adding this term to our linear regression equation, our model accounts for the interaction between these two features.&lt;br&gt;
It makes a lot of sense and it sounds pretty cool!&lt;br&gt;
We can think of a few examples where this would apply: the yield of crops can depend on many factors that interact with each other, like humidity and temperature, the presence of certain nutrients in the soil, exposure to sunlight, etc.&lt;br&gt;
Or, in determining the risk of diabetes for an individual, certain factors might interact, like age and hypertension or BMI.&lt;br&gt;
More on interactions in linear regression &lt;a href="http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/interaction.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What was a little odd to me was that, given this possibility, we seem to be left to guess what interactions might exist. The idea would be to try out a few interactions by multiplying some terms together more or less at random, and by trial and error find the interactions that are real, looking at our R squared to see which ones improve it. Even if we inferred an interaction correctly, we would still be left to guess the exact term that describes it (is the nature of the interaction multiplicative? To which power for each variable? What if it makes sense to include more variables in that interaction as well? And what coefficient should it have once it’s included in the equation?)&lt;/p&gt;

&lt;h3&gt;
  
  
  Polynomials
&lt;/h3&gt;

&lt;p&gt;On to the second tool.&lt;br&gt;
The other possibility we have, if we think that the relationship between our dependent variable and the independent ones might not be linear, is to include polynomial terms in our equation.&lt;br&gt;
The idea is that you can transform your input variable, e.g. by squaring it. The corresponding model would then be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cu-0ervq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/co7879sfnaabhfso0tc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cu-0ervq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/co7879sfnaabhfso0tc7.png" alt="Image description" width="344" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The squared x at that point becomes a new variable to add to the equation.&lt;/p&gt;

&lt;p&gt;Here below are some graphs that show higher orders of polynomials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HiDgl0v_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w68p2qokov2zh6u232v3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HiDgl0v_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w68p2qokov2zh6u232v3.png" alt="Image description" width="337" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can do this with higher orders using &lt;code&gt;PolynomialFeatures&lt;/code&gt; from &lt;code&gt;sklearn.preprocessing&lt;/code&gt;, which calculates for us all the terms of all our independent variables multiplied together, up to a polynomial degree that we get to set.&lt;br&gt;
The problem with this approach is that once we use all the terms from our polynomial features, our model tends to overfit. All those terms will probably describe our curve very well, most likely too well, picking up the real trends of the relationship between the variables, but also picking up some random noise.&lt;br&gt;
Let us see this with a visual example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AGyI--Xj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jun10nqhug1r47otjbtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AGyI--Xj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jun10nqhug1r47otjbtl.png" alt="Image description" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an example of when the equation in our model follows our sample so precisely that it ends up overfitting: the model has picked up on the noise as well as the signal in the data.&lt;br&gt;
These models will perform extremely well on our train data (the data we used to build the model) but very poorly on any new data.&lt;/p&gt;

&lt;p&gt;What we aim for our model to do is to pick up the real trend, without describing the noise in every detail.&lt;br&gt;
These types of models might score slightly lower on the train data, because they don’t fit it perfectly, but they will also perform very well on new, unseen data, because they describe well the relationship between the dependent and independent variables.&lt;br&gt;
This is what this type of model would look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3No8WO_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0h7gty0otut3jrc07l6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3No8WO_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0h7gty0otut3jrc07l6.png" alt="Image description" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is clear that our model 'has understood' the general relationship between the target and the input, and that once it is fed data that is different, but describes the same phenomenon, it should be able to identify that relationship again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Usually...
&lt;/h3&gt;

&lt;p&gt;What is usually recommended, when it’s clear that the polynomial relationship at a certain degree is overfitting, is to lower the degree of the polynomials.&lt;br&gt;
But this might (or at least it did in my case) end up making the polynomial terms not very relevant, since reducing the degree even just by one cuts down many terms (especially if we have many variables).&lt;br&gt;
The quality of the fit dramatically decreases, making the whole use of polynomial terms almost pointless.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting these two together
&lt;/h3&gt;

&lt;p&gt;The interesting fact is that the terms of the polynomials are the same thing as interaction terms, just with higher degrees. And &lt;code&gt;PolynomialFeatures&lt;/code&gt; does us the favor of calculating all of those terms, with two simple lines of code.&lt;br&gt;
The idea I had at this point was not to randomly test a few interaction terms, but to take the terms from the polynomial features and add those to the equation in my linear model.&lt;br&gt;
What I did in practice was select a high polynomial degree, one that would usually lead to overfitting, add those terms to my OLS model and run it normally.&lt;br&gt;
At this point, thanks to &lt;code&gt;statsmodels&lt;/code&gt;, we have a chance to see the summary and extract the coefficients.&lt;br&gt;
It was simple to sort those values, select the top 5 (or as many as we want) and study those.&lt;br&gt;
One thing we can do is look at these terms to see if there is something interesting there, since they can reveal some sort of interaction or correlation that explains the model better to us. The top terms are the ones that are most influential in determining the outcome of the dependent variable.&lt;br&gt;
Looking at these terms can therefore tell us a whole lot about the problem we are trying to solve and what our target variable depends on.&lt;br&gt;
The other thing we can do, which can definitely improve our results, is to include these top terms in our model, without including all the other ones produced by &lt;code&gt;PolynomialFeatures&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Why does this make a lot of sense?&lt;/strong&gt;&lt;br&gt;
Because the terms with heavier weights are probably the ones that describe the main part of the relationship, being the heaviest in the equation. Most likely these terms capture the general trend and the overall shape of the curve that describes the relationship (especially if we have done preprocessing and scaling appropriately).&lt;br&gt;
The other terms, the ones with lower weights that we are leaving out, are most likely the ones picking up the noise! Since noise is usually randomly distributed, it would be described by terms with lower weights, terms that maybe fit the curve well, but only in small parts of it. That sounds a lot like noise, doesn’t it?&lt;/p&gt;
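A hedged sketch of that procedure: fit, extract the coefficients, sort by absolute value, keep the top terms. The data and term names below are made up; the article works from a statsmodels OLS summary, while plain least squares via numpy yields the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one true polynomial relationship plus a little noise.
x1 = rng.uniform(-1, 1, 200)
x2 = rng.uniform(-1, 1, 200)
y = 3.0 * x1 ** 2 + 2.0 * x1 * x2 - 1.5 * x2 + rng.normal(0, 0.1, 200)

# All degree-3 polynomial and interaction terms, as PolynomialFeatures
# would build them; the names let us read the result like an OLS summary.
terms = {
    'x1': x1, 'x2': x2,
    'x1^2': x1 ** 2, 'x1*x2': x1 * x2, 'x2^2': x2 ** 2,
    'x1^3': x1 ** 3, 'x1^2*x2': x1 ** 2 * x2,
    'x1*x2^2': x1 * x2 ** 2, 'x2^3': x2 ** 3,
}
X = np.column_stack(list(terms.values()))

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sort the terms by the absolute value of their coefficient, keep the top 3.
order = np.argsort(np.abs(coef))[::-1]
names = list(terms.keys())
top = [(names[i], round(float(coef[i]), 2)) for i in order[:3]]
print(top)  # the heavy terms recover the true relationship
```

On this toy data the top terms come out as x1^2, x1*x2 and x2, i.e. exactly the terms of the true relationship; the near-zero coefficients of the other six terms are the ones that would only be fitting noise.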

&lt;p&gt;I tried this myself in my project, and it worked great.&lt;br&gt;
When I used the model with the third degree polynomials, the train performed pretty well with an R squared of 0.72, but the R squared for the test was a disastrous -0.93.&lt;/p&gt;

&lt;p&gt;When instead I used the polynomials of third degree, but only kept the top 5 terms with the highest coefficients, the R squared for both the train and the test reached 0.8!&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This technique can be used extensively, by increasing the degree of the polynomials included and also increasing the number of terms that we choose to keep: in my case it was 5, but more can be included, provided we keep an eye on the R squared of the test to avoid overfitting.&lt;br&gt;
We can learn a great deal from these terms about the nature of the relationship between the variables and their interaction, while improving our model’s results considerably without falling into the trap of overfitting.&lt;/p&gt;

&lt;p&gt;I learned later that there are techniques that bring us to a similar result, selecting only some of the features or increasing and decreasing their weights to improve the results of the model (Ridge and Lasso regularization, select k best, wrapper methods…), but I feel that these methods lean too much toward the black box model, moving from inferential statistics toward predictive modeling, where we won’t necessarily be able to see what the coefficients are, what weights they have or how these methods picked them.&lt;br&gt;
It would probably be more efficient, but I personally found a lot of value and satisfaction in trying this selection of features by hand, even just one time. I wonder if this is the type of thing in which machines ultimately won’t be able to fully replace us (I hope!), because they won't be able to grasp the information that we can, understanding why a certain relationship between two variables makes sense, or exploring further into what it entails.&lt;br&gt;
Even high volumes of calculations performed at super high speed ultimately can’t beat the judgement and understanding of a human being, who can see a piece of information, make connections and apply critical thinking to what they have just learned.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>linearregression</category>
      <category>biasvariancetradeoff</category>
    </item>
    <item>
      <title>A defense of those poor outliers...</title>
      <dc:creator>MNLKuzmin</dc:creator>
      <pubDate>Mon, 25 Jul 2022 10:41:25 +0000</pubDate>
      <link>https://forem.com/mnlkuzmin/a-defense-of-those-poor-outliers-4a8d</link>
      <guid>https://forem.com/mnlkuzmin/a-defense-of-those-poor-outliers-4a8d</guid>
      <description>&lt;h3&gt;
  
  
  Why the outliers?
&lt;/h3&gt;

&lt;p&gt;When I started studying data science, one of the things that struck me the most from the beginning was the concept of outliers, and in particular the way they are usually dealt with during an analysis of data.&lt;br&gt;
What I found to be a common procedure for handling outliers clashed immediately with what I had learned in the past.&lt;br&gt;
From that moment on I had a desire to explore this topic more, and bring up a few points, even just to put a little doubt in our minds as data scientists when we so lightly decide to “get rid of the outliers”...&lt;/p&gt;

&lt;h4&gt;
  
  
  A little bit of background
&lt;/h4&gt;

&lt;p&gt;For whoever is not familiar with the concept of outliers, when you are observing a certain phenomenon, or doing an analysis on a set of data, the outliers are the results that lie at the extremes of your spectrum of values, or significantly outside of range, compared to the bulk of the results.&lt;br&gt;
If this doesn’t explain it, let me borrow a definition: ”In statistics, an outlier is a data point that differs significantly from other observations.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ONL5xWac--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u19hba72zcgqr07xhpjc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ONL5xWac--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u19hba72zcgqr07xhpjc.jpg" alt="Image description" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is how the paragraph continues:&lt;br&gt;
"An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses." (See this &lt;a href="https://en.wikipedia.org/wiki/Outlier"&gt;link&lt;/a&gt; for further reading about outliers)&lt;/p&gt;

&lt;p&gt;As the article mentions, outliers can be (and I would say tend to be) excluded from the dataset and therefore excluded from the statistical analysis. &lt;br&gt;
In other words, since these values are extremes, they represent “exceptions” and therefore they tend to be ignored so that the EDA can bring more consistent results.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why I want to defend the outliers
&lt;/h4&gt;

&lt;p&gt;From the very first time I read this definition and learned that it is common procedure to cut out the outliers, something inside of me said “no, this is not right!”.&lt;br&gt;
Forgive me, but this is the physicist in me, in particular an experimental physicist who learned as part of her training that &lt;u&gt;no point, no result, no observation should be excluded from an analysis,&lt;/u&gt; unless we are certain that it was due to a mistake.&lt;/p&gt;

&lt;h4&gt;
  
  
  The physicist’s point of view
&lt;/h4&gt;

&lt;p&gt;The physics approach to outliers is based on the fact that sometimes we do not expect to observe a certain behavior, a certain result from an experiment, but that particular result (granted that it’s not due to a mistake in the procedure) could actually be revealing a new aspect, a new characteristic or even a whole new phenomenon that we didn’t expect.&lt;br&gt;
I want to bring up an example, one of my favorite experiments in modern physics, which explains this concept very well.&lt;br&gt;
It is the discovery of the nucleus, by Ernest Rutherford:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;”At the time of the experiment, in 1909, the electron was the only known atomic particle. Physicists had thought of a number of possible arrangements for electrons inside the atom. These arrangements or atomic models, had to account for the fact that matter, generally speaking, is electrically neutral and therefore, to counteract the negatively charged electron the atom must contain positive electricity in some form.&lt;br&gt;
A positively charged particle comparable to the electron was not known to exist; it was possible that the positive electricity took a different form; perhaps a fluid. [...]&lt;br&gt;
Alpha particles are smaller than atoms, but heavy, therefore they could serve as high - energy bullets for probing atoms at a time [...].&lt;br&gt;
In his laboratory, beams of alpha particles were sent through atoms [of a gold foil specifically] to hit a detecting screen, so that one could observe whether or not the particles had been knocked off course during their intra-atomic journey.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H--hkGWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5l5166pgmir6iwuocx1r.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H--hkGWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5l5166pgmir6iwuocx1r.jpeg" alt="Image description" width="606" height="612"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This method of exploration has been likened to shooting bullets through a bale of hay in which a very small piece of platinum has been hidden. Most of the bullets would encounter nothing but hay and these would pass right through the bale and out to the other side. But should a bullet chance to hit the platinum, it would ricochet at some angle. And if an enormous number of bullets were shot at the bale, so that the hidden nugget was hit often, ricochets in various directions could reveal the nugget’s location and shape.&lt;br&gt;
This was the research that Rutherford suggested: to bombard atoms with alpha particles to see whether any were scattered at a wide angle and there was every reason to believe that they would not find one. Still it should be tried. [...]&lt;br&gt;
Contrary to all expectations, they found that out of the thousands of alpha particles he had tracked through a gold foil some, a very few, had been deflected at wide angles.&lt;br&gt;
Of these, one or two had been turned aside by more than 90 degrees; they had come out of the target on the same side they had entered it.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the book &lt;em&gt;Men who made a new physics&lt;/em&gt;, by Barbara Lovett Cline, which I strongly recommend as an easy, fun reading about modern physics science discoveries.&lt;/p&gt;

&lt;p&gt;This discovery was the main point that brought Rutherford to come up with his &lt;a href="https://www.britannica.com/science/Rutherford-model"&gt;Rutherford model&lt;/a&gt;, which described (correctly) the internal structure of the atom, up until that point unknown.&lt;/p&gt;

&lt;p&gt;What made me think about the gold foil experiment when learning about outliers is the fact that the result Rutherford found, of alpha particles being scattered at wide angles, is what would be considered an outlier.&lt;br&gt;
It was a tremendously smaller percentage of observations compared to all the rest of the observations that were collected.&lt;br&gt;
It was also significantly different from the bulk of the results, where the scattering happened at very small angles, if at all.&lt;br&gt;
But if those few measurements had not been taken seriously, and had been discarded as “mistakes” or “outliers”, we wouldn’t know the atom as we know it today, and physics might have advanced in its research building on a mistake (like assuming that there is a positively charged fluid inside each atom).&lt;br&gt;
This is the reason why I was taught, as a physicist, not to discard observations and to include all the measurements and all the results in all of my analyses.&lt;/p&gt;

&lt;h4&gt;
  
  
  The data science’s point of view
&lt;/h4&gt;

&lt;p&gt;Now, I know very well that in an EDA there are good reasons to ignore the outliers: especially the ones that lie far outside the range of the bulk of the observations tend to influence very heavily statistical measures like the mean, which is one of the most indicative and most used values in a statistical analysis (see this &lt;a href="https://www.statology.org/how-do-outliers-affect-the-mean/"&gt;link&lt;/a&gt; for further explanation).&lt;br&gt;
I am also aware that when we conduct this type of analysis, especially if it is meant to provide some sort of business suggestion, what we are most interested in is the bulk of the results, the majority of the population, since we have to give a prediction about what we can reasonably expect in the future. And what we can expect most realistically is the average, not an outlier.&lt;br&gt;
What Amazon wants to know is what &lt;u&gt;most&lt;/u&gt; customers would buy, not what one particular unpredictable person is going to purchase on a particular day. The general market is looking for high-volume, predictable and safe trends in how the general population behaves, and this model has no use for outliers.&lt;br&gt;
But hear me out, one last point for our poor outcasts…&lt;/p&gt;

&lt;h4&gt;
  
  
  We were all outliers in the past two years
&lt;/h4&gt;

&lt;p&gt;I want to bring up one more example to reflect on the importance of outliers:&lt;br&gt;
the way we all led our lives in the past two years, as a consequence of the Covid-19 pandemic.&lt;br&gt;
If we look at our behavior, many of the things that became usual for us during the pandemic are things we almost never did before 2020.&lt;br&gt;
Consider the trends of people wearing masks, isolating and socially distancing, working remotely, regularly using hand sanitizer or gloves or face shields (outside of a medical environment), getting constantly tested for a virus...&lt;br&gt;
All these behaviors, which during the past two years became the norm for most of us, would have been outliers in a pre-pandemic study.&lt;br&gt;
Situations can arise that shift our habits, even radically, and in a way, not cutting out what is “strange”, “outside the range” or “unexpected” can teach us something, can make us more prepared, or can open us up to new types of studies.&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;Sometimes it is not possible to predict things with a statistical model; sometimes reality surprises us because of other people’s freedom (like when my son says yes to cleaning up his toys without complaining… totally outlier behavior) and sometimes it hits us hard in ways we could never have imagined.&lt;br&gt;
But I think that sometimes not looking only at the bulk of results, not ignoring the observation we didn’t expect, can teach us something new and open up our minds to the unexpected.&lt;br&gt;
Probably for statistics’ sake we will still all need to get rid of our outliers, but I hope this different point of view will make you a little more curious, when you are deleting the outliers, about what happened there: what was the out-of-the-ordinary decision that person made and why, and what we could have learnt from it. And maybe you will feel, like me, that slight twist of the stomach when you run &lt;code&gt;df.drop( )&lt;/code&gt;...&lt;/p&gt;




&lt;p&gt;Resources and Further Reading:&lt;br&gt;
&lt;a href="https://www.amazon.com/Men-Who-Made-New-Physics/dp/0226110273"&gt;https://www.amazon.com/Men-Who-Made-New-Physics/dp/0226110273&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.khanacademy.org/science/chemistry/electronic-structure-of-atoms/history-of-atomic-structure/a/discovery-of-the-electron-and-nucleus#:%7E:text=Rutherford's%20gold%20foil%20experiment%20showed,nuclear%20model%20of%20the%20atom"&gt;https://www.khanacademy.org/science/chemistry/electronic-structure-of-atoms/history-of-atomic-structure/a/discovery-of-the-electron-and-nucleus#:~:text=Rutherford's%20gold%20foil%20experiment%20showed,nuclear%20model%20of%20the%20atom&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.britannica.com/science/Rutherford-model"&gt;https://www.britannica.com/science/Rutherford-model&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.statology.org/how-do-outliers-affect-the-mean/"&gt;https://www.statology.org/how-do-outliers-affect-the-mean/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>datascience</category>
      <category>codenewbie</category>
    </item>
  </channel>
</rss>
