Top 10 Machine Learning Evaluation Metrics for Classification - Implemented in R

So, you've trained a classification machine learning model. Now what? How do you evaluate it? That's where machine learning evaluation metrics for classification come in. This article brings you the top 10 metrics you must know, implemented primarily for binary classification problems. Multi-class classification datasets might require you to tweak the formulas slightly. We'll first train a <a href="https://appsilon.com/r-logistic-regression/" target="_blank" rel="noopener">logistic regression</a> model, and then we'll go over each metric in detail. After reading, you'll have no trouble picking out the right set of machine learning evaluation metrics for classification datasets. You'll know what each one stands for, what range of values to expect, and what it all means for your model's predictive power. So without further ado, let's get started!

<blockquote>Data going into a machine learning model has to be preprocessed adequately - <a href="https://appsilon.com/data-cleaning-in-r/" target="_blank" rel="noopener">Make sure you know how to do this step in R</a>.</blockquote>

Table of contents:

<ul><li><a href="#model">Let's Train a Binary Classification Machine Learning Model in R</a></li><li><a href="#metrics">Machine Learning Evaluation Metrics for Classification - Theory, Math, and Code</a></li><li><a href="#summary">Summing Up Machine Learning Evaluation Metrics for Classification</a></li></ul>

<hr />

<h2 id="model">Let's Train a Binary Classification Machine Learning Model in R</h2>

We'll start this article by training a binary classification model using logistic regression. The dataset of choice will be <i>Titanic</i>, as it's easy to load in R and requires only minor data preprocessing before modeling. Let's begin by loading the dataset and inspecting what it looks like.

<h3>Dataset Loading</h3>

Many of the metrics you'll see today are built into various R packages, so we'll need quite a few imports at the start of the script. Here's everything you'll need - feel free to install any you might not have via the <code>install.packages("&lt;package-name&gt;")</code> command:

<pre><code class="language-r">library(titanic)
library(dplyr)
library(tidyr)
library(caret)
library(mlbench)
library(pROC)
library(MLmetrics)</code></pre>

The <i>Titanic</i> dataset is part of the <code>titanic</code> package, so we're good to go. We'll only use the training subset and split it later into two parts. The following code snippet loads the dataset and prints the first couple of rows:

<pre><code class="language-r"># Load the Titanic dataset
data(titanic_train)
df &lt;- titanic_train

# Show the first few rows
head(df)</code></pre>

<img class="size-full wp-image-18566" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7ae9f12c4dca380d423d4_90a5a1ee_1-2.webp" alt="Image 1 - Head of the Titanic dataset" width="3130" height="422" />

Image 1 - Head of the Titanic dataset

It's a good-quality dataset, but it has some missing values and other formatting issues that a machine learning model won't like. Let's handle these next.

<h3>Dataset Preprocessing</h3>

The data preprocessing for this dataset could be an extensive article in itself, but we'll keep things lightweight today since this isn't the main talking point.
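Before cleaning anything, it helps to see where the gaps actually are. Here's a quick check - a minimal sketch that isn't part of the main workflow - counting <code>NA</code> values per column. Note that in this copy of the data, <code>Embarked</code> appears to mark its missing entries with empty strings rather than <code>NA</code>, so we count those separately:

<pre><code class="language-r"># Count NA values in every column
colSums(is.na(df))

# Embarked marks its missing entries with empty strings, not NA
sum(df$Embarked == "", na.rm = TRUE)</code></pre>

Age should come out as the column with by far the most missing values, which is why it gets the median imputation treatment below.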
In this section, we'll:

<ul><li><b>Drop unnecessary columns</b> - Columns that carry no meaningful information (e.g., <code>PassengerId</code>), and columns that would take too much time and code to preprocess adequately (e.g., <code>Name</code>, <code>Ticket</code>, and <code>Cabin</code>).</li><li><b>Impute missing values</b> - Median imputation for <code>Age</code>, and constant imputation for <code>Embarked</code>. <a href="https://appsilon.com/imputation-in-r/" target="_blank" rel="noopener">Learn more about missing value imputation in R with our extensive guide</a>.</li><li><b>Convert categorical variables to factors</b> - This lets the model treat these columns as categorical without us having to create dummy columns by hand.</li></ul>

If you prefer code over text, here's the snippet for you:

<pre><code class="language-r"># Drop unnecessary columns
df &lt;- select(df, -c(PassengerId, Name, Ticket, Cabin))</code></pre>

<pre><code class="language-r"># Missing value imputation
df$Age[is.na(df$Age)] &lt;- median(df$Age, na.rm = TRUE)
# Embarked marks its missing entries with empty strings rather than NA
df$Embarked[is.na(df$Embarked) | df$Embarked == ""] &lt;- "S"</code></pre>

<pre><code class="language-r"># Convert categorical variables to factors
df$Pclass &lt;- factor(df$Pclass)
df$Sex &lt;- factor(df$Sex)
df$Embarked &lt;- factor(df$Embarked)

head(df)</code></pre>

<img class="size-full wp-image-18568" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea0ef0ba20db044c4ca_e0287229_2-2.webp" alt="Image 2 - Head of the Titanic dataset after data preparation" width="1264" height="418" />

Image 2 - Head of the Titanic dataset after data preparation

The dataset is now much more condensed but carries almost all of the original predictive information.

<h3>Train/Test Split</h3>

The last step before training a machine learning model is to split the dataset into training and testing subsets. We'll use the <code>caret</code> package for the task, and stick to the traditional 80:20 split:

<pre><code class="language-r"># Split the data into training and test sets
set.seed(42)
index &lt;- createDataPartition(df$Survived, p = 0.8, list = FALSE)
train &lt;- df[index, ]
test &lt;- df[-index, ]</code></pre>

Here's how many rows are in each subset:

<pre><code class="language-r">dim(train)
dim(test)</code></pre>

<img class="size-full wp-image-18570" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea0b0d3c725e2ed7de3_332886d5_3-2.webp" alt="Image 3 - Train/test set dimensionality" width="290" height="204" />

Image 3 - Train/test set dimensionality

That's it! Let's train the model next.

<h3>Training a Classification Machine Learning Model</h3>

There are many classification algorithms you can choose from, but <a href="https://appsilon.com/r-logistic-regression/" target="_blank" rel="noopener">logistic regression</a> is the one we'll use today. It strikes a good balance between being easy to understand and offering solid predictive performance. As always in R, you can train a model by writing the model formula.
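Before writing it out, it's worth checking how balanced the <code>Survived</code> target is in each subset - this will come up again when we compare plain accuracy with balanced accuracy. A quick look, not part of the original walkthrough:

<pre><code class="language-r"># Class proportions of the target in each subset
prop.table(table(train$Survived))
prop.table(table(test$Survived))</code></pre>

Roughly 60% of passengers in each subset did not survive, so the classes are moderately imbalanced - something to keep in mind when interpreting the metrics later on.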
In short, every dataset feature in the training set will be used to predict the <code>Survived</code> target variable:

<pre><code class="language-r">set.seed(42)

model &lt;- glm(Survived ~ ., data = train, family = "binomial")
summary(model)</code></pre>

<img class="size-full wp-image-18572" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea1c65c7fcddd319009_c8b2c729_4-2.webp" alt="Image 4 - Summary of a logistic regression model" width="1406" height="1664" />

Image 4 - Summary of a logistic regression model

It looks like passenger class, age, gender, and the number of siblings/spouses on board have the biggest impact on the predictions, as indicated by their very low p-values. On the other hand, the port of embarkation has no significant impact on the target variable, as you might reasonably expect. Up next, let's make predictions on previously unseen data.

<h3>Calculating Prediction Probabilities and Classes</h3>

Some classification metrics require predicted classes (e.g., 0 or 1), while others require prediction probabilities (e.g., a 0.7891 chance of belonging to the positive class). For that reason, we'll calculate both. The probabilities come first - you can obtain them by calling the <code>predict()</code> function and passing in our model and the test set, along with <code>type = "response"</code>:

<pre><code class="language-r">predict_probs &lt;- predict(model, newdata = test, type = "response")
predict_probs</code></pre>

<img class="size-full wp-image-18574" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea22d2575d7bd72090f_305710be_5-2.webp" alt="Image 5 - Prediction probabilities" width="2850" height="700" />

Image 5 - Prediction probabilities

And now, if the predicted probability is 0.5 or higher, we'll assign a class of 1 (survived), or 0 otherwise (not survived):

<pre><code class="language-r">predict_classes &lt;- ifelse(predict_probs &gt;= 0.5, 1, 0)
predict_classes</code></pre>

<img class="size-full wp-image-18576" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea2227b18adf5ce1c51_5d4049e1_6-2.webp" alt="Image 6 - Predicted classes" width="2952" height="670" />

Image 6 - Predicted classes

That's everything we need to start evaluating our classification model with machine learning evaluation metrics for classification.

<h2 id="metrics">Machine Learning Evaluation Metrics for Classification - Theory, Math, and Code</h2>

We've tried our best to keep the previous section short and sweet, and now it's time to dive into the good part: the most useful machine learning evaluation metrics for classification. Let's start with the first one, which is a must-have for any machine learning project.

<h3>1. Confusion Matrix</h3>

You can think of the confusion matrix as a special type of table used to evaluate the performance of a classification model. For binary classification, the confusion matrix is a 2x2 matrix that shows the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These four values are used extensively when calculating other metrics, such as accuracy, precision, and recall. Down below you'll see the confusion matrix "formula". Take this term lightly, since there's no real calculation involved - it's just a cross-tabulation of actual vs.
predicted values:

<img class="size-full wp-image-18578" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea4c5eb5b7ebba0b1ba_0b3a6c6f_7-2.webp" alt="Image 7 - Confusion matrix &quot;formula&quot;" width="399" height="120" />

Image 7 - Confusion matrix "formula"

To implement the confusion matrix in R, refer to the snippet below. It uses predicted classes instead of probabilities:

<pre><code class="language-r">CONFUSION_MATRIX &lt;- confusionMatrix(factor(predict_classes), factor(test$Survived))
CONFUSION_MATRIX$table</code></pre>

<img class="size-full wp-image-18580" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea42573d59db41bbeda_c3344655_8-2.webp" alt="Image 8 - Confusion matrix results" width="698" height="254" />

Image 8 - Confusion matrix results

Long story short, you want the numbers on the top-left to bottom-right diagonal to be as large as possible, while the elements on the top-right to bottom-left diagonal should be as close to zero as possible. While we're here, let's extract the values for TP, FP, TN, and FN:

<pre><code class="language-r">TP &lt;- CONFUSION_MATRIX$table[1, 1]
FP &lt;- CONFUSION_MATRIX$table[1, 2]
TN &lt;- CONFUSION_MATRIX$table[2, 2]
FN &lt;- CONFUSION_MATRIX$table[2, 1]</code></pre>

One thing to keep in mind: by default, <code>caret</code> treats the first factor level ("0", i.e., not survived) as the positive class, and the extraction above follows that convention. If you'd rather treat survival as the positive outcome, pass <code>positive = "1"</code> to <code>confusionMatrix()</code> and swap the cells accordingly. Either way, we'll need these four values for the upcoming classification metrics.

<h3>2. Accuracy</h3>

Accuracy measures the proportion of correct predictions out of the total number of predictions. It's a widely used metric, as it reports the overall predictive performance of your model. But keep in mind - it's only meaningful when the classes are reasonably balanced. If 99% of the records belong to one class, a model that always predicts that class scores 99% accuracy without learning anything. Anyhow, here's the formula:

<img class="size-full wp-image-18582" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea561a49425c805a49b_84be0dac_9-2.webp" alt="Image 9 - Accuracy formula" width="764" height="106" />

Image 9 - Accuracy formula

Since we already have the values for TP, TN, FP, and FN, the accuracy calculation in R is as easy as it gets:

<pre><code class="language-r">ACCURACY &lt;- (TP + TN) / (TP + FP + TN + FN)
ACCURACY</code></pre>

<img class="size-full wp-image-18584" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea6227b18adf5ce1fc7_444a3e58_10-2.webp" alt="Image 10 - Accuracy results" width="1004" height="160" />

Image 10 - Accuracy results

76% isn't too bad for a couple of minutes of work in data preprocessing. But the classes aren't perfectly balanced, so other classification metrics might be more relevant for our use case.

<h3>3. Precision</h3>

Precision measures the proportion of true positives (TP) among all positive predictions made. It's a useful metric when false positives are more costly than false negatives - for example in spam filtering, where flagging a legitimate email is worse than letting the occasional spam message through. If precision is high, the model is making few false positive predictions. Here's the formula:

<img class="size-full wp-image-18586" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7aea7537ef68dd1d0872c_99dbc80a_11-2.webp" alt="Image 11 - Precision formula" width="494" height="106" />

Image 11 - Precision formula

Let's implement Precision in R.
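Before we do, note that <code>caret</code> already computes most of the per-class metrics covered below - the <code>confusionMatrix()</code> object exposes them under <code>$byClass</code>, and overall accuracy under <code>$overall</code>. This makes for a handy cross-check of the manual calculations that follow, keeping in mind that the values are reported for caret's default positive class:

<pre><code class="language-r"># Metrics caret computes for us (relative to the default positive class)
CONFUSION_MATRIX$byClass[c("Precision", "Recall", "Specificity", "F1", "Balanced Accuracy")]
CONFUSION_MATRIX$overall["Accuracy"]</code></pre>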
Once again, the implementation is trivial since we already have all the values:

<pre><code class="language-r">PRECISION &lt;- TP / (TP + FP)
PRECISION</code></pre>

<img class="size-full wp-image-18588" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29facff63ee9722ceba27_12-2.webp" alt="Image 12 - Precision results" width="662" height="160" />

Image 12 - Precision results

We're up to 0.8, which isn't too bad. Let's see what recall has to say about it.

<h3>4. Recall</h3>

Recall is the ratio of the number of true positives (TP) to the sum of true positives (TP) and false negatives (FN). It measures the percentage of all positive instances in the dataset that the model classifies correctly. If recall is high, the model is making few false negative predictions. Here's the recall formula:

<img class="size-full wp-image-18590" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29fadf72efad0b6034609_13-2.webp" alt="Image 13 - Recall formula" width="424" height="106" />

Image 13 - Recall formula

Let's implement it in R and check the score:

<pre><code class="language-r">RECALL &lt;- TP / (TP + FN)
RECALL</code></pre>

<img class="size-full wp-image-18592" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29faec2f41db1e74822d3_14-2.webp" alt="Image 14 - Recall results" width="580" height="156" />

Image 14 - Recall results

Recall is higher than precision, which means the model makes fewer false negatives than false positives.

<h3>5. F1-Score</h3>

Now you might be wondering - is there a way to strike a balance between precision and recall? That's where the F1-score comes in. The F1-score is the harmonic mean of precision and recall. It ranges from 0 to 1, with higher values indicating better performance, and it's a useful metric whenever you want a single number that accounts for both false positives and false negatives. Here's the formula you can use for the calculation:

<img class="size-full wp-image-18594" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29fae3230f605361b8e77_15-2.webp" alt="Image 15 - F1-score formula" width="577" height="113" />

Image 15 - F1-score formula

Once again, the R implementation is fairly straightforward:

<pre><code class="language-r">F1 &lt;- 2 * ((PRECISION * RECALL) / (PRECISION + RECALL))
F1</code></pre>

<img class="size-full wp-image-18596" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29faf4a2eaf755d927de8_16-2.webp" alt="Image 16 - F1-score results" width="1260" height="158" />

Image 16 - F1-score results

Seems right. The value sits between precision and recall, which makes F1 a good metric to optimize for when you don't need to favor either false positives or false negatives.

<h3>6. AUC Score</h3>

AUC, or the <i>Area Under the Receiver Operating Characteristic curve</i>, measures how well a binary classifier distinguishes between the positive and negative classes. Traditionally, you would plot the ROC curve, and the AUC measures the area under it. A higher AUC means better performance, and vice-versa.
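Since we've already loaded <code>pROC</code>, drawing the curve takes just a couple of lines. This is a minimal sketch: <code>roc()</code> takes the actual labels first and the predicted probabilities second:

<pre><code class="language-r"># Build the ROC object (labels first, predicted probabilities second)
roc_obj &lt;- roc(test$Survived, predict_probs)

# Plot the curve and print the area under it
plot(roc_obj, main = "ROC Curve - Logistic Regression")
auc(roc_obj)</code></pre>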
The formula includes integrals since we're calculating the area under a curve:

<img class="size-full wp-image-18598" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f8b60d941b765934c6e_17-1.webp" alt="Image 17 - ROC AUC formula" width="825" height="115" />

Image 17 - ROC AUC formula

Unlike most of the metrics covered here, AUC needs prediction probabilities rather than predicted classes:

<pre><code class="language-r">AUC_SCORE &lt;- AUC(predict_probs, test$Survived)
AUC_SCORE</code></pre>

<img class="size-full wp-image-18600" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f8b59639c0bf4571203_18.webp" alt="Image 18 - ROC AUC results" width="782" height="100" />

Image 18 - ROC AUC results

AUC ranges from 0 to 1, so a score of 0.834 sounds good. For reference, a score of 1 would mean the model perfectly distinguishes between the classes, which is almost never the case in practice. On the other end, an AUC of 0.5 means the model is no better than random guessing. Overall, there's still some room for improvement, but we're far from an unusable model.

<h3>7. Specificity</h3>

Specificity measures how well the model identifies negative samples (TN) out of all the negative samples in the dataset. In other words, it's the proportion of actual negative cases that were correctly classified as negative. This metric is widely used in areas such as medical diagnosis, where low specificity means the model labels too many negative cases as positive, leading to false alarms and unnecessary follow-up tests. The formula is once again as simple as it can be:

<img class="size-full wp-image-18602" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29fb0c94ac8aeb3b3dc50_19-1.webp" alt="Image 19 - Specificity formula" width="530" height="106" />

Image 19 - Specificity formula

And so is the R implementation:

<pre><code class="language-r">SPECIFICITY &lt;- TN / (TN + FP)
SPECIFICITY</code></pre>

<img class="size-full wp-image-18604" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29fb13234cb0c91cde405_20-1.webp" alt="Image 20 - Specificity results" width="700" height="158" />

Image 20 - Specificity results

A result of 0.625 isn't something to brag about, and there's definitely room for improvement.

<h3>8. Balanced Accuracy</h3>

Let's take a step back and discuss accuracy once again. As we said previously, the vanilla accuracy metric isn't very informative when the classes are imbalanced. That's where balanced accuracy comes into play - it averages the model's performance on each class separately, so it paints a fairer picture on imbalanced datasets.
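To make the difference concrete, consider a naive baseline that always predicts "not survived". Its plain accuracy looks respectable on this dataset, but its balanced accuracy collapses to 0.5. A quick illustration, not part of the original walkthrough:

<pre><code class="language-r"># A naive baseline that always predicts "not survived"
baseline_classes &lt;- rep(0, nrow(test))

# Plain accuracy looks respectable...
mean(baseline_classes == test$Survived)

# ...but the true positive rate is 0 and the true negative rate is 1,
# so balanced accuracy drops to 0.5
baseline_tpr &lt;- sum(baseline_classes[test$Survived == 1] == 1) / sum(test$Survived == 1)
baseline_tnr &lt;- sum(baseline_classes[test$Survived == 0] == 0) / sum(test$Survived == 0)
(baseline_tpr + baseline_tnr) / 2</code></pre>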
Anyhow, here's how to calculate it:

<img class="size-full wp-image-18606" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29fb1300ce9856e50de71_21-1.webp" alt="Image 21 - Balanced accuracy formula" width="549" height="141" />

Image 21 - Balanced accuracy formula

The R implementation requires us to calculate the true positive rate (TPR) and the true negative rate (TNR) first:

<pre><code class="language-r">TPR &lt;- TP / (TP + FN)
TNR &lt;- TN / (TN + FP)

BAL_ACCURACY &lt;- (TPR + TNR) / 2
BAL_ACCURACY</code></pre>

<img class="size-full wp-image-18608" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01a869566d0143c010d19_22-1.webp" alt="Image 22 - Balanced accuracy results" width="744" height="156" />

Image 22 - Balanced accuracy results

So, once class imbalance is taken into account, our model is only 73.4% accurate. There's definitely room for improvement.

<h3>9. Matthews Correlation Coefficient (MCC)</h3>

The Matthews Correlation Coefficient is a metric that takes all four values - TP, TN, FP, and FN - into account. It measures the correlation between the predicted and actual classes while accounting for class imbalance and misclassification rates, which makes it particularly useful when the classes are imbalanced - as they are, moderately, in the Titanic dataset. MCC ranges from -1 to +1: a value of +1 indicates perfect classification, 0 indicates performance no better than random guessing, and -1 indicates total disagreement between predictions and actual values. Here's the math formula for MCC:

<img class="size-full wp-image-18610" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01a87299b8d6d84264962_23-1.webp" alt="Image 23 - Matthews correlation coefficient formula" width="1265" height="124" />

Image 23 - Matthews correlation coefficient formula

We don't have to calculate it manually, since MCC is built into the <code>mltools</code> R package (install it with <code>install.packages("mltools")</code> if needed):

<pre><code class="language-r">MCC &lt;- mltools::mcc(predict_classes, test$Survived)
MCC</code></pre>

<img class="size-full wp-image-18612" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29fb3629a6d212138c872_24.webp" alt="Image 24 - Matthews correlation coefficient results" width="1028" height="110" />

Image 24 - Matthews correlation coefficient results

A score of 0.478 isn't something to write home about, but it definitely proves our model is far from a random classifier.

<h3>10. Logarithmic Loss</h3>

And finally, let's discuss logarithmic loss, or log loss for short. It measures the performance of a probabilistic classifier by penalizing confident but wrong predictions. Log loss is commonly used in multiclass classification problems, but there's nothing stopping us from using it on a binary dataset. Unlike the other metrics, there's no hard range defined for it. A lower log loss indicates better performance, but how low is low enough? That's impossible to answer when evaluating a single model, so use this metric to compare multiple models instead.
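One thing to note before the formula: log loss is defined on predicted probabilities, not on hard 0/1 labels. The snippet further below feeds it the predicted classes, which effectively gives every misclassification the maximum penalty and inflates the score. For reference, here's what the probability-based call looks like with <code>MLmetrics</code> - a sketch that should come out noticeably lower:

<pre><code class="language-r"># Log loss computed on predicted probabilities (predictions first, labels second)
LogLoss(predict_probs, test$Survived)</code></pre>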
Here's the log loss formula:

<img class="size-full wp-image-18614" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01a899d0ea11e187021e2_25-1.webp" alt="Image 25 - Logarithmic loss formula" width="743" height="150" />

Image 25 - Logarithmic loss formula

The function for calculating log loss in R comes with the <code>MLmetrics</code> package, so we don't have to implement it manually:

<pre><code class="language-r">LOG_LOSS &lt;- LogLoss(predict_classes, test$Survived)
LOG_LOSS</code></pre>

<img class="size-full wp-image-18616" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01a8bb556fdf5a54de3b6_26-1.webp" alt="Image 26 - Logarithmic loss results" width="914" height="98" />

Image 26 - Logarithmic loss results

Is 8.15 good or bad? On its own, it's impossible to tell - and remember that this value is inflated because we passed hard classes rather than probabilities - so train a couple more machine learning models and compare the results. Do this as a homework assignment and report back which model yielded the lowest log loss value.

<hr />

<h2 id="summary">Summing Up Machine Learning Evaluation Metrics for Classification</h2>

To recap, these 10 machine learning evaluation metrics for classification should be all you need 99% of the time. You're likely to use only a few of them regularly, such as the confusion matrix, and to optimize the model for precision, recall, or overall accuracy. That being said, it doesn't hurt to know about the other evaluation metrics you have at your disposal. We hope this article gave you a clear picture of how easy it is to evaluate machine learning models in R, and that you now understand these metrics on a deeper level.

<i>Do you have a favorite classification evaluation metric? What do you prefer when classes are imbalanced?</i> Make sure to let us know in the comment section below. Or even better - reach out on Twitter - <a href="http://twitter.com/appsilon" target="_blank" rel="noopener">@appsilon</a>. We'd love to hear from you.

<blockquote>Deep Learning in R with... Keras? <a href="https://appsilon.com/r-keras-mnist/">Train an MNIST digit classifier with TensorFlow's high-level API</a>.</blockquote>
