So, you’ve trained a classification machine learning model. Now what? How do you evaluate it? That’s where machine learning evaluation metrics for classification come in. This article brings you the top 10 metrics you must know, implemented primarily for binary classification problems. Multi-class classification datasets might require you to tweak the formulas slightly. We’ll first train a logistic regression model, and then we’ll go over each metric in detail.

After reading, you’ll have no trouble picking out the right set of machine learning evaluation metrics for classification datasets. You’ll know what each one stands for, what ranges you can expect the metric value to be in, and what it all means for your model’s predictive power. So without further ado, let’s get started!

Data going into a machine learning model has to be preprocessed adequately – Make sure you know how to do this step in R.

Table of contents:

- Let’s Train a Binary Classification Machine Learning Model in R
- Machine Learning Evaluation Metrics for Classification – Theory, Math, and Code
- Summing Up Machine Learning Evaluation Metrics for Classification

## Let’s Train a Binary Classification Machine Learning Model in R

We’ll start this article by training a binary classification model using logistic regression. The dataset of choice will be *Titanic*, as it’s readily available in R through the `titanic` package and requires only minor data preprocessing operations before modeling. Let’s begin by loading the dataset and inspecting what it looks like.

### Dataset Loading

Many of the metrics you’ll see today are built into various R packages, hence, we’ll need many imports at the start of the script. Here’s everything you’ll need – feel free to install any you might not have via the `install.packages("<package-name>")` command:

```
library(titanic)
library(dplyr)
library(tidyr)
library(caret)
library(mlbench)
library(pROC)
library(MLmetrics)
```

The *Titanic* dataset is part of the `titanic` package, so we’re good to go. We’ll only use the training subset and split it later into two parts.

The following code snippet loads the dataset and prints the first couple of rows:

```
# Load the Titanic dataset
data(titanic_train)
df <- titanic_train
# Show the first few rows
head(df)
```

It’s a good quality dataset but has some missing values and other formatting issues which a machine learning model won’t like. Let’s handle these next.

### Dataset Preprocessing

The data preprocessing part for this dataset could be an extensive article in itself, but we’ll keep things lightweight today since this isn’t the main talking point. In this section, we’ll:

- **Drop unnecessary columns** – Columns that carry no meaningful information (e.g., `PassengerId`), and columns that would take too much time and code to preprocess adequately (e.g., `Name`, `Ticket`, and `Cabin`).
- **Impute missing values** – Median imputation for `Age`, and constant imputation for `Embarked`. Learn more about missing value imputation in R with our extensive guide.
- **Convert categorical variables to factors** – This makes it easy for a machine learning model to understand the intra-variable relationships without creating dummy columns.

If you prefer code over text, here’s the snippet for you:

```
# Drop unnecessary columns
df <- select(df, -c(PassengerId, Name, Ticket, Cabin))
```

```
# Missing value imputation
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
# Missing ports are stored as empty strings in titanic_train, not as NA
df$Embarked[is.na(df$Embarked) | df$Embarked == ""] <- "S"
```

```
# Convert categorical variables to factors
df$Pclass <- factor(df$Pclass)
df$Sex <- factor(df$Sex)
df$Embarked <- factor(df$Embarked)
head(df)
```

The dataset is now much more condensed but still carries almost all of the useful predictive information.

### Train/Test Split

The last step before training a machine learning model is to split the dataset into training and testing subsets. We’ll use the `caret` package for the task, and stick to the traditional 80:20 split:

```
# Split the data into training and test sets
set.seed(42)
index <- createDataPartition(df$Survived, p = 0.8, list = FALSE)
train <- df[index, ]
test <- df[-index, ]
```

Here’s how many rows are in each subset:

```
dim(train)
dim(test)
```

That’s it! Let’s train the model next.

### Training a Classification Machine Learning Model

There are many classification algorithms you can choose from, but logistic regression is the one we’ll use today. It strikes a good balance between being easy to understand and offering good predictive performance.

As always in R, you can train a model by writing the model formula. In short, every dataset feature in the training set will be used to predict the `Survived` target variable:

```
set.seed(42)
model <- glm(Survived ~ ., data = train, family = "binomial")
summary(model)
```

It looks like passenger class, age, gender, and number of siblings/spouses on board have the most impact on the target variable, indicated by extremely low p-values. On the other hand, the port of embarkation shows no statistically significant effect, as you could reasonably assume.

Up next, let’s make actual predictions on previously unseen data.

### Calculating Prediction Probabilities and Classes

Some classification metrics require predicted classes (e.g., 0 or 1), while others require prediction probabilities (e.g., 0.7891 chance of belonging to a positive class). For that reason, we’ll calculate both.

The probabilities are first, and you can obtain them by calling the `predict()` function and passing in our model and the test set, along with `type = "response"`:

```
predict_probs <- predict(model, newdata = test, type = "response")
predict_probs
```

And now, if the predicted probability is 0.5 or higher, we’ll assign it a class of 1 (survived), or 0 otherwise (not survived):

```
predict_classes <- ifelse(predict_probs >= 0.5, 1, 0)
predict_classes
```

That’s everything we need to start evaluating our classification model with machine learning evaluation metrics for classification.

## Machine Learning Evaluation Metrics for Classification – Theory, Math, and Code

We’ve tried our best to keep the previous section short and sweet, and now it’s time to dive into the good part. You’ll learn the best machine learning evaluation metrics for classification. Let’s start with the first one, which is a must-have for any machine learning project.

### 1. Confusion Matrix

You can think of the confusion matrix as a special type of table used to evaluate the performance of a classification model. In terms of binary classification, a confusion matrix is a 2×2 matrix that shows the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These four values are used extensively when calculating other metrics, such as accuracy, precision, and recall.

Down below you’ll see the confusion matrix “formula”. Take this term lightly, since there’s no calculation involved. It’s just a summation of actual vs. predicted values:
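As a sketch, assuming the layout `caret` prints (predictions in rows, actual values in columns, positive class first), the four counts sit in the table like this:

$$
\begin{array}{l|cc}
 & \text{Actual positive} & \text{Actual negative} \\
\hline
\text{Predicted positive} & TP & FP \\
\text{Predicted negative} & FN & TN
\end{array}
$$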

To implement the confusion matrix in R, refer to the snippet below. It uses predicted classes instead of probabilities:

```
CONFUSION_MATRIX <- confusionMatrix(factor(predict_classes), factor(test$Survived))
CONFUSION_MATRIX$table
```

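Long story short, you want the numbers on the top-left to bottom-right diagonal to be as large as possible. On the other hand, the elements on the top-right to bottom-left diagonal should be minimal, or close to zero.

While we’re here, let’s extract the values for TP, FP, TN, and FN. Keep in mind that `confusionMatrix()` treats the first factor level as the positive class by default, which is `0` (not survived) here, so the counts below describe how well the model identifies passengers who didn’t survive: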

```
TP <- CONFUSION_MATRIX$table[1, 1]
FP <- CONFUSION_MATRIX$table[1, 2]
TN <- CONFUSION_MATRIX$table[2, 2]
FN <- CONFUSION_MATRIX$table[2, 1]
```

We’ll need these for the upcoming classification metrics.
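If positional indexing feels easy to get wrong, the same counts can be pulled out by label instead. The sketch below uses base R’s `table()` on made-up labels (the vectors `actual` and `predicted` are illustrative, not the Titanic predictions) and treats `1` as the positive class:

```r
# Made-up labels for illustration; 1 is treated as the positive class
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Rows are predictions, columns are actual values
cm <- table(
  Prediction = factor(predicted, levels = c(0, 1)),
  Reference  = factor(actual, levels = c(0, 1))
)

# Index by label name so the positive class is explicit
tp <- cm["1", "1"]  # predicted 1, actually 1
fp <- cm["1", "0"]  # predicted 1, actually 0
tn <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1

cm
```

Indexing by name keeps working even if the factor levels end up in a different order.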

### 2. Accuracy

Accuracy measures the proportion of correct predictions to the total number of predictions. It’s a widely used metric, as it reports the overall predictive performance of your model. But keep in mind – this metric can be misleading when classes are imbalanced. For example, if 99% of records belong to one class, a model that always predicts that class reaches 99% accuracy without learning anything.

Anyhow, here’s the formula:
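In terms of the four confusion matrix counts, accuracy can be written as:

$$
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
$$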

Since we already have the values for TP, TN, FP, and FN, accuracy calculation in R is as easy as it can be:

```
ACCURACY <- (TP + TN) / (TP + FP + TN + FN)
ACCURACY
```

76% isn’t too bad for a couple of minutes of work in data preprocessing. But the classes aren’t perfectly balanced, so other classification metrics might be more relevant for our use case.
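To see why accuracy alone can mislead on imbalanced data, here’s a minimal sketch on made-up labels (990 negatives and 10 positives, not the Titanic data). The "model" simply predicts the majority class every time:

```r
# Made-up labels: 990 negatives and 10 positives (99% vs. 1%)
actual <- c(rep(0, 990), rep(1, 10))

# A "model" that always predicts the majority class
predicted <- rep(0, length(actual))

# Accuracy looks excellent...
accuracy <- mean(predicted == actual)

# ...but not a single positive case is found
recall <- sum(predicted == 1 & actual == 1) / sum(actual == 1)

accuracy
recall
```

The accuracy is high purely because of the class distribution, which is why the remaining metrics look at the two classes separately.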

### 3. Precision

Precision measures the proportion of true positives (TP) to the total number of positive predictions made. It’s a useful metric when false positives are more costly than false negatives, for example in spam filtering, where flagging a legitimate email as spam is worse than letting an occasional spam message through. If precision is high, it means the model is making few false positive predictions.

Here’s the formula:
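Using the same counts as before, precision can be written as:

$$
\text{Precision} = \frac{TP}{TP + FP}
$$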

Let’s implement Precision in R. Once again, the implementation is trivial since we already have all the values:

```
PRECISION <- TP / (TP + FP)
PRECISION
```

We’re up to 0.8, which isn’t too bad. Let’s see what recall has to say about it.

### 4. Recall

Recall is the ratio of the number of true positives (TP) to the sum of true positives (TP) and false negatives (FN). This metric measures the percentage of all positive instances in the dataset that are correctly classified by the model. If the recall is high, it means that the model is making few false negative predictions.

Here’s the recall formula:
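Expressed with the confusion matrix counts, recall is:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$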

Let’s implement it in R and check the score:

```
RECALL <- TP / (TP + FN)
RECALL
```

Recall is higher than precision, which means the model makes fewer false negatives than false positives.

### 5. F1-Score

Now you might be wondering, is there a way to strike the balance between precision and recall? That’s where F1 score comes in.

F1-score is the harmonic mean of precision and recall. It’s a useful metric when you need a single number that accounts for both, since the harmonic mean is pulled toward the lower of the two values. It ranges from 0 to 1, with higher values indicating better performance.

Here’s the formula you can use for the calculation:
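Written as the harmonic mean of the two metrics defined earlier, F1 is:

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$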

Once again, R implementation is fairly straightforward:

```
F1 <- 2 * ((PRECISION * RECALL) / (PRECISION + RECALL))
F1
```

Seems right. The value sits between precision and recall, as a harmonic mean always does, which makes F1 a sensible metric to optimize for when false positives and false negatives are about equally costly.

### 6. AUC Score

AUC, or the *Area Under the Receiver Operating Characteristic* curve, measures how well a binary classifier distinguishes between positive and negative classes. Traditionally, you would plot the ROC curve, which shows the true positive rate against the false positive rate across all classification thresholds, and the AUC measures the area under that curve. Higher AUC means better performance, and vice-versa.

The formula includes integrals since we’re calculating the area under the curve:
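One common way to write it, treating the true positive rate (TPR) as a function of the false positive rate $f$ along the ROC curve, is:

$$
\text{AUC} = \int_{0}^{1} \text{TPR}(f) \, df
$$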

Unlike other metrics, AUC needs prediction probabilities for calculation:

```
AUC_SCORE <- AUC(predict_probs, test$Survived)
AUC_SCORE
```

AUC ranges from 0 to 1, so a score of 0.834 sounds good. For reference, a score of 1 would mean the model is perfectly capable of distinguishing between classes, which is almost never the case in practice. On the other end, an AUC score of 0.5 means the model is no better than a random guess. Overall, there’s still some room for improvement, but we’re far from an unusable model.
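AUC also has a handy interpretation: it’s the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. Here’s a minimal base R sketch of that rank-based calculation on made-up scores (`rank_auc()` is a hypothetical helper written for this illustration, not a function from `MLmetrics` or `pROC`):

```r
# Made-up scores and labels for illustration
probs  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   0,   1,   0,   0)

# Rank-based (Mann-Whitney) AUC: the share of positive/negative pairs
# in which the positive example gets the higher score
rank_auc <- function(probs, labels) {
  r <- rank(probs)  # average ranks, so tied scores count as half
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_toy <- rank_auc(probs, labels)
auc_toy
```

Here 13 of the 16 positive/negative pairs are ordered correctly, so the sketch returns 13/16.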

### 7. Specificity

Specificity measures how well a model is able to correctly identify negative samples (TN) out of all negative samples in the dataset. In other words, it measures the proportion of actual negative cases that were correctly classified as negative by the model.

This metric is widely used in areas such as medical diagnosis. In this field, a low specificity indicates that the model is incorrectly identifying negative cases as positive, which leads to false alarms. Missed diagnoses, on the other hand, are captured by recall (also called sensitivity).

The formula is once again as simple as it can be:
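With the confusion matrix counts, specificity is:

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$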

And so is the R implementation:

```
SPECIFICITY <- TN / (TN + FP)
SPECIFICITY
```

A result of 0.625 isn’t something to brag about, and there’s definitely room for improvement.

### 8. Balanced Accuracy

Let’s take a step back and discuss accuracy once again. As we said previously, the vanilla accuracy metric isn’t the most representative when classes are imbalanced. That’s where balanced accuracy comes into play.

It’s a useful metric when the dataset is imbalanced, and it provides a more accurate evaluation of the model’s performance.

Anyhow, here’s how to calculate it:
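Balanced accuracy is the average of the true positive rate (recall) and the true negative rate (specificity):

$$
\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)
$$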

R implementation requires us to calculate the true positive rate (TPR) and the true negative rate (TNR) first:

```
TPR <- TP / (TP + FN)
TNR <- TN / (TN + FP)
BAL_ACCURACY <- (TPR + TNR) / 2
BAL_ACCURACY
```

So, taking class imbalance into account, our model scores a balanced accuracy of 73.4%. There’s definitely room for improvement.

### 9. Matthews Correlation Coefficient (MCC)

Matthews Correlation Coefficient is a metric that takes into account all four confusion matrix counts: TP, TN, FP, and FN. It measures the correlation between the predicted and actual classes, and it stays informative when the classes are imbalanced, which is the case with the Titanic dataset, where non-survivors outnumber survivors.

MCC ranges from -1 to +1. If you see a value of +1, it indicates a perfect classification, 0 indicates a random classification, and -1 indicates an entirely wrong classification.

Here’s the math formula for MCC:
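In terms of the confusion matrix counts, MCC is:

$$
\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$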

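We don’t have to calculate it manually since MCC is built into the `mltools` R package (install it with `install.packages("mltools")` if you don’t have it, as it wasn’t part of the imports above):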

```
MCC <- mltools::mcc(predict_classes, test$Survived)
MCC
```

A score of 0.478 isn’t something to write home about, but it definitely proves our model is far from a random classification.
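If you’d rather not install another package, MCC is easy to compute from the four counts directly. Below is a minimal base R sketch; `mcc_from_counts()` is a hypothetical helper written for this illustration, and the counts passed to it are made up:

```r
# Hypothetical helper: MCC from the four confusion matrix counts
mcc_from_counts <- function(tp, fp, tn, fn) {
  # Convert to double so the product below can't overflow integer range
  tp <- as.numeric(tp); fp <- as.numeric(fp)
  tn <- as.numeric(tn); fn <- as.numeric(fn)
  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  # Common convention: return 0 when a row or column of the matrix is empty
  if (denom == 0) return(0)
  (tp * tn - fp * fn) / denom
}

# Made-up counts for illustration
mcc_toy <- mcc_from_counts(tp = 3, fp = 1, tn = 3, fn = 1)
mcc_toy
```

Calling `mcc_from_counts(TP, FP, TN, FN)` with the counts extracted earlier should agree with the `mltools` result, since MCC doesn’t depend on which class is labeled as positive.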

### 10. Logarithmic Loss

And finally, let’s discuss logarithmic loss or log loss for short. It measures the performance of a probabilistic classifier by penalizing predictions more heavily the more confident and wrong they are. Log loss is commonly used in multiclass classification problems, but there’s no one stopping us from using it on a binary dataset.

Unlike the other metrics, log loss has no upper bound: it starts at 0 for perfect predictions and grows without limit. A lower log loss score indicates better performance, but how low is low enough? It’s impossible to answer when evaluating a single model, so use this metric to compare multiple models instead.

Here’s the log loss formula:
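For a binary problem with $N$ observations, true labels $y_i \in \{0, 1\}$, and predicted probabilities $p_i$ of the positive class, log loss is:

$$
\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
$$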

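The function for calculating log loss in R comes with the `MLmetrics` package, so we don’t have to implement it manually. Keep in mind that log loss is defined on predicted probabilities, so pass in `predict_probs` rather than `predict_classes`:

```
LOG_LOSS <- LogLoss(predict_probs, test$Survived)
LOG_LOSS
```

Passing the hard 0/1 classes instead would yield a misleadingly high 8.15, since every misclassified row gets penalized as a maximally confident mistake. So is the value you get from probabilities good or bad? It’s impossible to tell without training a couple more machine learning models and comparing the results. Do this as a homework assignment and report back which model yielded the lowest log loss value.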
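To see why log loss should be fed probabilities rather than hard classes, here’s a minimal base R sketch on made-up data. The `log_loss()` helper is hypothetical and written for this illustration; the clipping constant `eps = 1e-15` is an assumption chosen to keep `log()` finite:

```r
# Hypothetical helper: binary log loss with clipping
log_loss <- function(probs, labels, eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so log() stays finite
  p <- pmin(pmax(probs, eps), 1 - eps)
  -mean(labels * log(p) + (1 - labels) * log(1 - p))
}

# Made-up labels and predictions for illustration
labels <- c(1, 0, 1, 0)
probs  <- c(0.9, 0.1, 0.8, 0.4)

# Probabilities: moderate penalties that reflect the model's confidence
ll_probs <- log_loss(probs, labels)

# Hard classes with a single mistake: that row gets the maximum penalty
ll_classes <- log_loss(c(1, 0, 1, 1), labels)

ll_probs
ll_classes
```

A single wrong hard prediction dominates the average, which is why scores computed from classes end up so much larger than scores computed from probabilities.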

## Summing Up Machine Learning Evaluation Metrics for Classification

To recap, these 10 machine learning evaluation metrics for classification should be all you need 99% of the time. You’re likely to use only a few, such as the confusion matrix, and optimize the model for precision, recall, or overall accuracy.

That being said, it doesn’t hurt to know the other evaluation metrics you have at your disposal. We hope this article gave you a clear picture of how easy it is to evaluate machine learning models in R, and that you now understand these metrics on a deeper level.

*Do you have a favorite classification evaluation metric? What do you prefer when classes are imbalanced?* Make sure to let us know in the comment section below. Or even better – reach out on Twitter – @appsilon. We’d love to hear from you.
