<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Clébio Júnior</title>
    <description>The latest articles on Forem by Clébio Júnior (@juniorcl).</description>
    <link>https://forem.com/juniorcl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F455526%2F10a9ce5c-fc9e-46a6-83b1-d4eb20be931a.jpeg</url>
      <title>Forem: Clébio Júnior</title>
      <link>https://forem.com/juniorcl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/juniorcl"/>
    <language>en</language>
    <item>
      <title>Going Beyond Accuracy: Understanding the Balanced Accuracy, Precision, Recall and F1-score.</title>
      <dc:creator>Clébio Júnior</dc:creator>
      <pubDate>Wed, 22 Oct 2025 17:01:08 +0000</pubDate>
      <link>https://forem.com/juniorcl/going-beyond-accuracy-understanding-the-balanced-accuracy-precision-recall-and-f1-score-4kl2</link>
      <guid>https://forem.com/juniorcl/going-beyond-accuracy-understanding-the-balanced-accuracy-precision-recall-and-f1-score-4kl2</guid>
<description>&lt;p&gt;&lt;em&gt;A tutorial about metrics used to validate machine learning models. The metrics covered are balanced accuracy, precision, recall and F1-score. A Portuguese version of this tutorial is available &lt;a href="https://medium.com/data-hackers/indo-al%C3%A9m-da-acur%C3%A1cia-entendo-a-acur%C3%A1cia-balanceada-precis%C3%A3o-recall-e-f1-score-c895e55a9753" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the most anticipated steps of a data science project is the development of a machine learning model. In this step the model is trained and validated, and one of the most widely used validation metrics is &lt;strong&gt;accuracy&lt;/strong&gt;. But how well does accuracy really show how effective the model is at classifying two or more classes?&lt;/p&gt;

&lt;p&gt;In this post, other metrics will be described. They will give you additional perspectives on the performance of your model, especially with imbalanced datasets, that is, datasets with many more examples of one class than the others. The metrics covered here are &lt;strong&gt;balanced accuracy&lt;/strong&gt;, &lt;strong&gt;precision&lt;/strong&gt;, &lt;strong&gt;recall&lt;/strong&gt; and &lt;strong&gt;F1-score&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All the metrics described in this post produce a score between 0 and 1, where 0 is the worst result and 1 is an excellent result. However, each metric has a different interpretation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Confusion Matrix
&lt;/h2&gt;

&lt;p&gt;Before understanding how the metrics work, you need to know what a confusion matrix is, because it is the basis for the calculation of each metric. This matrix shows the predictions for each class, "yes" or "no", where the rows are the true classes and the columns are the predictions. With it, it is possible to see how each class was predicted. Table 1 shows this kind of matrix.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;No&lt;/th&gt;
&lt;th&gt;Yes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TN&lt;/td&gt;
&lt;td&gt;FP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FN&lt;/td&gt;
&lt;td&gt;TP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Table 1:&lt;/strong&gt; &lt;em&gt;Confusion matrix where the "no" and "yes" classes are related to the predictions made by a machine learning model. TN, FP, FN, and TP are acronyms that stand for "true negative", "false positive", "false negative" and "true positive" respectively.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That said, the correct classifications of the “no” class are defined as true negatives (TN), while the correct classifications of the “yes” class are called true positives (TP). Misclassifications of “no” as “yes” are referred to as false positives (FP), and incorrect classifications of “yes” as “no” are known as false negatives (FN).&lt;/p&gt;

&lt;p&gt;Table 2 shows the same matrix as Table 1, now with example values to illustrate a machine learning model from a data science project for predicting fraudulent banking transactions. The values 101,668, 3, 36, and 95 represent, respectively, TN, FP, FN, and TP. For more information about the referenced data science project, visit this &lt;a href="https://github.com/juniorcl/transaction-fraud-detection" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;No&lt;/th&gt;
&lt;th&gt;Yes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;101668&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Table 2:&lt;/strong&gt; &lt;em&gt;Confusion matrix with the results of a machine learning model for predicting fraudulent banking transactions. The values 101,668, 3, 36, and 95 represent, respectively, the number of TN, FP, FN, and TP.&lt;/em&gt;&lt;/p&gt;
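&lt;p&gt;&lt;em&gt;The four cells of a confusion matrix can be counted directly from the true labels and the model's predictions. Below is a minimal sketch in plain Python; the &lt;code&gt;y_true&lt;/code&gt; and &lt;code&gt;y_pred&lt;/code&gt; lists are made-up values for illustration, not the data from the fraud project.&lt;/em&gt;&lt;/p&gt;

```python
# Count confusion matrix cells for a binary problem
# where 1 means "yes" and 0 means "no".
y_true = [0, 0, 0, 1, 1, 1, 0, 1]  # hypothetical true labels
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]  # hypothetical predictions

tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(tn, fp, fn, tp)  # 3 1 1 3
```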

&lt;h2&gt;
  
  
  Balanced Accuracy
&lt;/h2&gt;

&lt;p&gt;Accuracy basically calculates all correct predictions (TP and TN) divided by the total number of predictions, that is, all correct and incorrect ones (TP + TN + FP + FN), as shown in Equation 1. However, when there is a highly imbalanced class, accuracy is not a good metric to use. As can be seen in the equation, the high number of TN classifications can mask the low number of TP classifications — giving a misleading impression that the model is performing well in classifying the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73kcufq9hatp1jp0dax9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73kcufq9hatp1jp0dax9.png" alt="Accuracy Equation" width="367" height="51"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Equation 1:&lt;/strong&gt; &lt;em&gt;Accuracy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An alternative to accuracy is &lt;strong&gt;balanced accuracy&lt;/strong&gt;, which is not affected by class imbalance because it is calculated based on the &lt;strong&gt;true positive rate&lt;/strong&gt; and the &lt;strong&gt;true negative rate&lt;/strong&gt;, as shown in &lt;strong&gt;Equation 2&lt;/strong&gt;. This approach provides a more reliable measure of the model’s performance across both classes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxau3k0taz14b4s02vrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxau3k0taz14b4s02vrx.png" alt="Balanced Accuracy Equation" width="530" height="57"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Equation 2:&lt;/strong&gt; &lt;em&gt;Balanced Accuracy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To illustrate, the values of accuracy and balanced accuracy will be calculated using the data from Table 2. The resulting accuracy is 0.9996, which might initially suggest that the model correctly classified almost all instances and is performing exceptionally well. However, most of the correct predictions come from the majority class, which skews the result.&lt;/p&gt;

&lt;p&gt;When we use balanced accuracy, which gives equal weight to the performance on each class, the value is 0.8626. This provides a more realistic measure of how well the model performs across both classes.&lt;/p&gt;
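&lt;p&gt;&lt;em&gt;Both values can be reproduced directly from the counts in Table 2; a minimal sketch in plain Python:&lt;/em&gt;&lt;/p&gt;

```python
# Counts from Table 2
tn, fp, fn, tp = 101668, 3, 36, 95

accuracy = (tp + tn) / (tp + tn + fp + fn)

tpr = tp / (tp + fn)  # true positive rate
tnr = tn / (tn + fp)  # true negative rate
balanced_accuracy = (tpr + tnr) / 2

print(round(accuracy, 4))           # 0.9996
print(round(balanced_accuracy, 4))  # 0.8626
```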

&lt;p&gt;Even with balanced accuracy, we still only get a global view of overall correctness, so we cannot see how well the model performed on a specific class of interest. In our example, how well did the model identify fraudulent transactions? What percentage of the “yes” class was classified correctly?&lt;/p&gt;

&lt;h2&gt;
  
  
  Precision
&lt;/h2&gt;

&lt;p&gt;We now understand balanced accuracy and how it provides a global view of the model’s performance across all classes. However, it is also important to examine the model’s classification ability in more detail. In our fraud detection example, how well can the model correctly identify a transaction as truly fraudulent?&lt;/p&gt;

&lt;p&gt;The metric used to answer this question is precision, which measures the percentage of correct positive predictions made by the model. This metric relates the number of true positives (TP) to the sum of TP and false positives (FP), as shown in Equation 3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5by1e2z366yyr8p3budb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5by1e2z366yyr8p3budb.png" alt="Precision Equation" width="237" height="51"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Equation 3:&lt;/strong&gt; &lt;em&gt;Precision.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To better interpret this metric, imagine you have a distant target to hit. If you take 100 shots and hit the target 70 times, your precision is 70%. The same logic applies to interpreting the precision of a machine learning model. In our example from Table 2, the precision is 0.9694 or 96.94%. This means that for every 100 positive predictions, the model correctly identifies approximately 97.&lt;/p&gt;
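&lt;p&gt;&lt;em&gt;The precision above can be reproduced from the counts in Table 2; a minimal sketch in plain Python:&lt;/em&gt;&lt;/p&gt;

```python
# Counts from Table 2
tn, fp, fn, tp = 101668, 3, 36, 95

precision = tp / (tp + fp)
print(round(precision, 4))  # 0.9694
```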

&lt;h2&gt;
  
  
  Recall
&lt;/h2&gt;

&lt;p&gt;In addition to precision, which shows how reliable the model's positive predictions are, it is also important to know how many fraudulent transactions were correctly identified in our example from Table 2. For this reason, we consider the recall metric, also known as sensitivity. This metric measures how well a model can recognize instances of a specific class. Recall is calculated by dividing the number of true positives (TP) by the sum of TP and false negatives (FN) — in other words, the “yes” instances that were incorrectly classified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festaqln5bzbmwak698w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Festaqln5bzbmwak698w7.png" alt="Recall Equation" width="204" height="51"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Equation 4:&lt;/strong&gt; &lt;em&gt;Recall&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our example from Table 2, the recall value is 0.7252 or 72.52%. This result shows that the model correctly classified approximately 73% of the “yes” instances. Therefore, this metric can be used to report the percentage of fraudulent transactions that the model is able to correctly identify.&lt;/p&gt;
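&lt;p&gt;&lt;em&gt;The recall above can be reproduced from the counts in Table 2; a minimal sketch in plain Python:&lt;/em&gt;&lt;/p&gt;

```python
# Counts from Table 2
tn, fp, fn, tp = 101668, 3, 36, 95

recall = tp / (tp + fn)
print(round(recall, 4))  # 0.7252
```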

&lt;h2&gt;
  
  
  F1 Score
&lt;/h2&gt;

&lt;p&gt;After reviewing precision and recall, you might be thinking that these metrics are important for evaluating model performance. After all, the better a model can differentiate between classes and correctly predict the “yes” class of interest, the better its overall performance.&lt;/p&gt;

&lt;p&gt;So, how can we take both metrics into account when assessing a model’s performance?&lt;/p&gt;

&lt;p&gt;In this context, one metric we can use is the F1 score. The F1 score is basically the harmonic mean of precision and recall, as shown in Equation 5. In the example from Table 2, the F1 score is 0.8297. This metric can be particularly useful when developing new models to determine which one achieves the best performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F926h5p58com4mjtzuali.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F926h5p58com4mjtzuali.png" alt="F1 Score Equation" width="606" height="51"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Equation 5:&lt;/strong&gt; &lt;em&gt;F1 Score&lt;/em&gt;&lt;/p&gt;
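&lt;p&gt;&lt;em&gt;The F1 score from Table 2 can be reproduced by combining the same counts into the harmonic mean of precision and recall; a minimal sketch in plain Python:&lt;/em&gt;&lt;/p&gt;

```python
# Counts from Table 2
tn, fp, fn, tp = 101668, 3, 36, 95

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8297
```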

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we explored several metrics commonly used to evaluate the performance of a machine learning model. We learned that &lt;strong&gt;accuracy&lt;/strong&gt; is not always the best validation metric and can sometimes give a misleading impression of the model’s effectiveness.&lt;/p&gt;

&lt;p&gt;For imbalanced classes, a more appropriate metric is &lt;strong&gt;balanced accuracy&lt;/strong&gt;, which provides a global view of the model’s performance across all classes.&lt;/p&gt;

&lt;p&gt;To get a more detailed view of individual classes, we rely on &lt;strong&gt;precision&lt;/strong&gt;, which shows how many of the model’s positive predictions were correct, and &lt;strong&gt;recall&lt;/strong&gt;, which indicates how many instances of a particular class were correctly identified by the model. To evaluate both metrics simultaneously, we use the &lt;strong&gt;F1 score&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Finally, I hope you enjoyed this post and gained a better understanding of alternative metrics to assess your models. See you next time!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>metrics</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why it's not a good idea to start with complex machine learning models: a personal experience</title>
      <dc:creator>Clébio Júnior</dc:creator>
      <pubDate>Sun, 23 Aug 2020 16:46:15 +0000</pubDate>
      <link>https://forem.com/juniorcl/why-it-s-not-a-good-idea-start-with-complex-models-244i</link>
      <guid>https://forem.com/juniorcl/why-it-s-not-a-good-idea-start-with-complex-models-244i</guid>
<description>&lt;p&gt;When I started studying data science, I became fascinated by neural networks and their power in complex applications, such as computer vision and natural language processing (NLP). Because of that power, I wanted to use them on every single problem. But I had to calm down! Sometimes a simple model can achieve a good score.&lt;/p&gt;

&lt;p&gt;In this post I walk you through my experience as a beginner in my first data science challenge and how it helped me grow as a student and a data scientist. I'll never forget the power of a simple linear regression model!&lt;/p&gt;

&lt;h1&gt;
  
  
  The Challenge
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://codenation.dev/" rel="noopener noreferrer"&gt;Condenation&lt;/a&gt; is a website which sometimes organizes challenges as a first step towards accelerating in different areas, one of them is about data science. The last data science challenge was about a prediction of ENEM (brazilian exam to get in the public university) student score in math.&lt;/p&gt;

&lt;p&gt;I started it so excited! But I was blind, because I didn't try any model other than a Random Forest and a Neural Network to predict the math score. I did some preprocessing to replace some NaN values and selected some features with high correlation. After that, I worked hard with &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html" rel="noopener noreferrer"&gt;RandomizedSearchCV&lt;/a&gt; to select the best parameters. Despite all the hard work, I was unable to reach 90% and join Codenation. So I got frustrated and gave up on myself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj7nreodmqxzlijbfftf.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj7nreodmqxzlijbfftf.gif" alt="Alt Text" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;A blessing in disguise..&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recently I came across the same dataset on &lt;a href="https://www.kaggle.com/davispeixoto/codenation-enem2" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;. It had been a while since I accepted the challenge, so I tried it again. Below I show a new approach to the challenge and how I had judged a simple model as weak without even trying it. That was a big mistake and a great learning experience.&lt;/p&gt;

&lt;h1&gt;
  
  
  A New Approach
&lt;/h1&gt;

&lt;p&gt;Here I won't describe everything I did, for example regarding data preprocessing. But if you want to see my notebook, you can access it on &lt;a href="https://www.kaggle.com/juniorcl/mathenemscores-linearregression" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First of all, I checked the dataset for NaN values. These values were replaced with 0, because I treated them as student dropouts. Afterwards, I noticed that there were some correlations among the features. My idea was to take the most highly correlated features and use them to predict the math score. The heatmap below shows these correlations using the Pearson coefficient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzxsl4295g3lpqc422wut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzxsl4295g3lpqc422wut.png" alt="Alt Text" width="650" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, these features have high correlations. So I decided to use them as predictor features in a simple linear regression model, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_train_filled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NU_NOTA_COMP5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NU_NOTA_MT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_train_filled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NU_NOTA_MT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using a simple train and test split, we got an R² score of 90% (the &lt;code&gt;score&lt;/code&gt; method of a regressor returns R², not accuracy). This score was better than what the random forest and neural network models achieved. But maybe you are wondering: "Did you just use part of the dataset? For a complete evaluation, you need cross-validation!". Ok, ok.. You're right! I did that too, as you can see below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# making scores
&lt;/span&gt;&lt;span class="n"&gt;mae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_scorer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_scorer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;cvs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mae&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mae&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used the mean absolute error and the R² score to evaluate the model. The mean scores were 50.027 and 0.902, respectively. So maybe this model can predict the math score with an R² of about 90% on test data. So I can be happy and make my submission! No, no..&lt;/p&gt;
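&lt;p&gt;&lt;em&gt;For reference, cross_validate returns a dict with one array of per-fold scores per metric (keys test_mae and test_r2 for the scoring dict above), and the reported means are just the fold averages. A minimal sketch with made-up fold values, not the real ones:&lt;/em&gt;&lt;/p&gt;

```python
from statistics import fmean

# Hypothetical per-fold results, shaped like the dict cross_validate returns
cvs = {
    "test_mae": [49.8, 50.1, 50.2],  # made-up values
    "test_r2": [0.90, 0.91, 0.89],   # made-up values
}

mean_mae = fmean(cvs["test_mae"])
mean_r2 = fmean(cvs["test_r2"])
print(round(mean_mae, 2), round(mean_r2, 2))
```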

&lt;p&gt;Unfortunately, it is no longer possible to make a submission on Kaggle or on the original website, since the Codenation challenge closed a long time ago. So I have to wait for a new chance in the future.&lt;/p&gt;

&lt;h1&gt;
  
  
  However, what can we learn from it?
&lt;/h1&gt;

&lt;p&gt;It's important to note that even with the random forest and neural network models I could have done better preprocessing or selected other features and reached a good score. Yes, that's right! But this experience was important to me, because I learned from it and became a better data scientist.&lt;/p&gt;

&lt;p&gt;Even if you think a model is too simple for a hard task, you should give it a chance. Maybe it won't reach a high score or result. However, it can become a starting point to verify whether other models are actually improving the score.&lt;/p&gt;

&lt;p&gt;I hope this post helps you!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
