Machine Learning with R: A Complete Guide to Logistic Regression

Updated: July 13, 2022.

<h2>Logistic Regression with R</h2>
Logistic regression is one of the most fundamental algorithms from statistics, commonly used in machine learning. It's rarely used to produce state-of-the-art models but can serve as an excellent baseline for binary classification problems.

<blockquote>Interested in machine learning for beginners? <a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer">Check our detailed guide on Linear Regression with R</a>.</blockquote>

Today you'll learn how to implement the logistic regression model in R and also improve your data cleaning, preparation, and feature engineering skills.

Navigate to a section:
<ul><li><a href="#introduction">Introduction to R Logistic Regression</a></li><li><a href="#dataset">Dataset Loading and Exploration</a></li><li><a href="#preparation">Feature Engineering and Handling Missing Data</a></li><li><a href="#modeling">Predictive Modeling</a></li><li><a href="#predictions">Generating Predictions</a></li><li><a href="#feature-importance">R Logistic Regression Feature Importance</a></li><li><a href="#conclusion">Summary of R Logistic Regression</a></li></ul>

<hr />

<h2 id="introduction">Introduction to Logistic Regression</h2>
Logistic regression is an algorithm used both in statistics and machine learning. Machine learning engineers frequently use it as a baseline model - a model which other algorithms have to outperform. It's also commonly used first because it's easily interpretable.

In a way, logistic regression is similar to linear regression - but the former is not used to predict continuous values (such as age or height). Instead, it's used to predict binary classes - has the client churned or not, has the person survived or not, or is the disease malignant or benign.

To simplify, logistic regression is used to predict the Yes/No type of response. That's not entirely true.
Logistic regression tells us the probability that the response is Yes, and we then use a predefined threshold to assign classes. For example, if the probability is greater than 0.5, the assigned class is Yes, and otherwise No. Evaluating performance with different thresholds can reduce the number of false positives or false negatives, depending on which type of error matters more for your problem.

As you would assume, logistic regression can work with both continuous and categorical data. This means your dataset can contain any sort of data, as long as it is adequately prepared.

You can use logistic regression models to examine feature importances. You'll see how to do it later through hands-on examples. Knowing which features are important enables you to build simpler, lower-dimensional models. As a result, the predictions and the model are more interpretable.

And that's all you need for a basic intuition behind logistic regression. Let's get our hands dirty next.

<h2 id="dataset">Dataset Loading and Exploration</h2>
One of the best-known binary classification datasets is the Titanic dataset. The goal is to predict whether the passenger has survived the accident based on many input features, such as age, passenger class, and others.

You don't have to download the dataset, as there's a dedicated package for it in R. You'll use only the training dataset throughout the article, so you don't have to do the preparation and feature engineering twice.
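Before loading anything, it can help to check which packages are already installed. The following is a minimal sketch that uses only base R; the package names are the ones loaded with library() in this article, and the variable names are made up for illustration. The install call is commented out on purpose, so nothing changes on your system unless you enable it:

<pre><code class="language-r"># Packages loaded with library() somewhere in this article
required_pkgs <- c("titanic", "Amelia", "dplyr", "modeest", "ggplot2",
                   "cowplot", "mice", "caTools", "caret", "Boruta")

# Compare against what is already installed in the current library paths
installed_pkgs <- rownames(installed.packages())
missing_pkgs <- setdiff(required_pkgs, installed_pkgs)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them from your configured CRAN mirror:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}</code></pre>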
The following snippet loads in every required package, stores the training dataset to a variable called <code>df</code>, and prints its structure:

<pre><code class="language-r">library(titanic)
library(Amelia)
library(dplyr)
library(modeest)
library(ggplot2)
library(cowplot)
library(mice)
library(caTools)
library(caret)

df <- titanic_train
str(df)</code></pre>

Here's the corresponding structure:

<img class="size-full wp-image-6385" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395c485545dccd9c00e5d_1-1.webp" alt="Image 1 - Titanic dataset structure" width="1418" height="532" /> Image 1 - Titanic dataset structure

There's a lot of work required. For example, missing values in some columns are marked with empty strings instead of <code>NA</code>. This issue is easy to fix, and once you fix it, you can plot a missingness map. It will show you where the missing values are located:

<pre><code class="language-r"># Replace empty strings with proper missing values
df$Cabin[df$Cabin == ""] <- NA
df$Embarked[df$Embarked == ""] <- NA

missmap(obj = df)</code></pre>

The missingness map is shown below:

<img class="size-full wp-image-6386" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270aa580c4700e356c1b5_2-1.webp" alt="Image 2 - Missingness map" width="1500" height="998" /> Image 2 - Missingness map

The first three columns on the map (<code>Cabin</code>, <code>Age</code>, and <code>Embarked</code>) contain missing data. You'll see how to fix that in the next section.

<h2 id="preparation">Feature Engineering and Handling Missing Data</h2>
You need feature engineering because the default features either aren't formatted correctly or don't display information in the best way. Just take a look at the <code>Name</code> column in Image 1 - an algorithm can't process it in the default format.

But this feature is quite useful. You can extract the passenger title from it (e.g., Miss, Sir, and so on). As a final step, you can check if a passenger has a rare title (e.g., Dona, Lady, Major, and so on).
The following snippet does just that:

<pre><code class="language-r"># Keep only the text between the comma and the first period of the name
df$Title <- gsub("(.*, )|(\\..*)", "", df$Name)
rare_titles <- c("Dona", "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer")

# Merge equivalent titles and group the uncommon ones
df$Title[df$Title == "Mlle"] <- "Miss"
df$Title[df$Title == "Ms"] <- "Miss"
df$Title[df$Title == "Mme"] <- "Mrs"
df$Title[df$Title %in% rare_titles] <- "Rare"

unique(df$Title)</code></pre>

You can see all of the unique titles we have now in the following image:

<img class="size-full wp-image-6387" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270ab4a6434e3a86348b8_3-1.webp" alt="Image 3 - Unique passenger titles" width="740" height="66" /> Image 3 - Unique passenger titles

You can apply similar logic to the <code>Cabin</code> column. It's useless by default but can be used to extract the deck, which is the first character of the cabin code. Here's how:

<pre><code class="language-r"># The first character of the cabin code identifies the deck
df$Deck <- factor(sapply(df$Cabin, function(x) strsplit(x, NULL)[[1]][1]))

unique(df$Deck)</code></pre>

The unique deck values are shown in the following image:

<img class="size-full wp-image-6388" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270acfcac124f318a5812_4-1.webp" alt="Image 4 - Unique deck numbers" width="736" height="108" /> Image 4 - Unique deck numbers

You've now done some feature engineering, which means the original columns can be deleted. The snippet below deletes these two, but also <code>PassengerId</code> and <code>Ticket</code>, because these provide no meaningful information:

<pre><code class="language-r">df <- df %>%
  select(-c(PassengerId, Name, Cabin, Ticket))</code></pre>

Finally, you can shift the focus to the missing values. Two approaches will be used - mode and MICE imputation.

You'll use mode (most frequent value) imputation on the <code>Embarked</code> column because it contains only a couple of missing values. MICE imputation will require a bit more work.
Converting categorical variables to factors is a must, and the imputation is done by leaving the target variable out. Here's the entire code snippet for imputing missing values:

<pre><code class="language-r"># Mode (most frequent value) imputation for Embarked
df$Embarked[is.na(df$Embarked)] <- mlv(df$Embarked, method = "mfv")

# Imputing with MICE
factor_vars <- c("Pclass", "Sex", "SibSp", "Parch", "Embarked", "Title")
df[factor_vars] <- lapply(df[factor_vars], function(x) as.factor(x))

# Leave the target variable out of the imputation
impute_mice <- mice(df[, !names(df) %in% c("Survived")], method = "rf")
result_mice <- complete(impute_mice)</code></pre>

As a sanity check, you can plot the density plots of continuous variables before and after imputation. Doing so will show you if the imputation skewed the distribution or not. <code>Age</code> is the only continuous variable with missing values, so let's make a before and after density plot:

<pre><code class="language-r">density_before <- ggplot(df, aes(x = Age)) +
  geom_density(fill = "#e74c3c", alpha = 0.6) +
  labs(title = "Age: Before Imputation") +
  theme_classic()

density_after <- ggplot(result_mice, aes(x = Age)) +
  geom_density(fill = "#2ecc71", alpha = 0.6) +
  labs(title = "Age: After Imputation") +
  theme_classic()

plot_grid(density_before, density_after)</code></pre>

The visualization is shown below:

<img class="size-full wp-image-6389" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395c7bf649dee317bcea3_5-1.webp" alt="Image 5 - Density plot of Age before and after imputation" width="1500" height="1095" /> Image 5 - Density plot of Age before and after imputation

Some changes are visible, sure, but the overall distribution stayed roughly the same. There were a lot of missing values in this variable, so some changes in distribution are inevitable.
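A density plot is a visual check, and it can be complemented with a few summary statistics. Here's a minimal sketch in base R, assuming a toy age vector and a placeholder imputation (random sampling from observed values, which is not MICE). The helper name <code>summarise_age()</code> and all data below are made up for illustration:

<pre><code class="language-r"># Toy stand-in for the Age column: 200 values, 40 of them missing
set.seed(42)
age_before <- round(rnorm(200, mean = 30, sd = 14))
age_before[sample(200, 40)] <- NA

# Placeholder imputation (NOT MICE): fill gaps by sampling observed values
age_after <- age_before
n_missing <- sum(is.na(age_before))
age_after[is.na(age_after)] <- sample(age_before[!is.na(age_before)],
                                      n_missing, replace = TRUE)

# Count, mean, standard deviation, and quartiles, ignoring missing values
summarise_age <- function(x) {
  c(n = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE))
}

comparison <- rbind(before = summarise_age(age_before),
                    after = summarise_age(age_after))
round(comparison, 2)</code></pre>

With the objects from this article, the same comparison would be <code>rbind(before = summarise_age(df$Age), after = summarise_age(result_mice$Age))</code>. Large shifts in the mean or quartiles would suggest the imputation distorted the variable.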
Finally, you can assign the imputation results to the original dataset and convert <code>Deck</code> to factor:

<pre><code class="language-r">df$Age <- result_mice$Age
df$Deck <- result_mice$Deck
df$Deck <- as.factor(df$Deck)</code></pre>

You now have everything needed to start with predictive modeling - so let's do that next.

<h2 id="modeling">Modeling</h2>
Before proceeding with modeling, you'll need to split your dataset into training and testing subsets. These are available from the start with the Titanic dataset, but you'll have to do the split manually as we've only used the training dataset.

The following snippet splits the data randomly in a 70:30 ratio. Don't forget to set the seed value to 42 if you want the same split:

<pre><code class="language-r">set.seed(42)

# Stratified 70:30 split on the target variable
sample_split <- sample.split(Y = df$Survived, SplitRatio = 0.7)
train_set <- subset(x = df, sample_split == TRUE)
test_set <- subset(x = df, sample_split == FALSE)</code></pre>

You can now train the model on the training set. R uses the <code>glm()</code> function to apply logistic regression. The formula syntax is the same as for linear regression with <code>lm()</code>. You'll need to put the target variable on the left and features on the right, separated with the <code>~</code> sign. If you want to use all features, put a dot (.) instead of feature names.

Also, don't forget to specify <code>family = "binomial"</code>, as this is required for logistic regression:

<pre><code class="language-r">logistic <- glm(Survived ~ ., data = train_set, family = "binomial")
summary(logistic)</code></pre>

Here's the summary of the model:

<img class="size-full wp-image-6390" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270ad5f867c2801dc8b83_6-1.webp" alt="Image 6 - Summary of a logistic regression model" width="1060" height="1598" /> Image 6 - Summary of a logistic regression model

The most interesting thing here is the P-values, displayed in the <code>Pr(>|z|)</code> column.
Each of those values is the probability of seeing a coefficient estimate at least this far from zero if the variable had no real effect on the outcome. It's common to use a 5% significance threshold, so if a P-value is 0.05 or below, the variable is usually treated as statistically significant.

You've built and explored the model so far, but it hasn't been put to use yet. The next section shows you how to generate predictions on previously unseen data and evaluate the model.

<h2 id="predictions">Generating Predictions</h2>
As mentioned in the introduction section, logistic regression is based on probabilities. If the probability is greater than some threshold (commonly 0.5), you can treat this instance as positive.

The most common way of evaluating machine learning models is by examining the confusion matrix. It's a square matrix showing you how many predictions were correct (true positives and true negatives), how many were negative but classified as positive (false positives), and how many were positive but classified as negative (false negatives). In our case, positive refers to a passenger who survived the accident.

The snippet below shows how to obtain probabilities and classes, and how to print the confusion matrix:

<pre><code class="language-r"># Predicted probabilities of survival on the test set
probs <- predict(logistic, newdata = test_set, type = "response")

# Assign classes with a 0.5 threshold
pred <- ifelse(probs > 0.5, 1, 0)

confusionMatrix(factor(pred), factor(test_set$Survived), positive = as.character(1))</code></pre>

And here are the corresponding results:

<img class="wp-image-6392 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b39589455ba7eb54dbe896_8-1.webp" alt="Image 7 - Confusion matrix of a logistic regression model" width="686" height="990" /> Image 7 - Confusion matrix of a logistic regression model

221 of 268 records were classified correctly, resulting in an accuracy of 82.5%. There are 26 false positives and 21 false negatives.
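The balance between those two error types depends on the cutoff. As a minimal sketch with made-up probabilities and labels (not the output of the Titanic model), the hypothetical helper <code>confusion_counts()</code> below tallies the four confusion matrix cells at several thresholds using base R only:

<pre><code class="language-r"># Made-up predicted probabilities and true labels (1 = survived)
probs_toy  <- c(0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05)
actual_toy <- c(1,    1,    0,    1,    1,    0,    0,    1,    0,    0)

# Count the four confusion matrix cells for a given threshold
confusion_counts <- function(probs, actual, threshold) {
  pred <- ifelse(probs > threshold, 1, 0)
  c(threshold = threshold,
    TP = sum(pred == 1 & actual == 1),
    FP = sum(pred == 1 & actual == 0),
    TN = sum(pred == 0 & actual == 0),
    FN = sum(pred == 0 & actual == 1))
}

# One row per threshold
thresholds <- c(0.3, 0.5, 0.7)
sweep <- t(sapply(thresholds, confusion_counts,
                  probs = probs_toy, actual = actual_toy))
sweep</code></pre>

In this toy data, raising the threshold from 0.3 to 0.7 removes both false positives but increases the false negatives from 1 to 3. The same helper should work with the objects from this article, for example <code>confusion_counts(probs, test_set$Survived, 0.4)</code>.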
You can play around with classification thresholds (0.5 now) and see how these misclassifications change.

Next, let's explore two ways you can calculate the feature importance of a logistic regression model in R.

<h2 id="feature-importance">R Logistic Regression Feature Importance</h2>
The Titanic dataset is what you'd call a perfect dataset. It has just enough features that you don't have to care about reduction. Real-world datasets couldn't be more different. You'll often encounter hundreds or even thousands of columns where only 10 of them are relevant. This section is aimed toward these datasets, even though we'll apply the logic to Titanic.

<h3>Feature importance with the varImp() function</h3>
The easiest way to calculate the feature importance of an R logistic regression model is with the <code>varImp()</code> function. Here's how to obtain the ten most important features, sorted in descending order:

<pre><code class="language-r">importances <- varImp(logistic)

importances %>%
  arrange(desc(Overall)) %>%
  top_n(10)</code></pre>

The features are shown below:

<img class="wp-image-6391 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395c93b364f1252b8afb5_7-1.webp" alt="Image 8 - Feature importances of a logistic regression model" width="324" height="464" /> Image 8 - Feature importances of a logistic regression model

The drawback of this approach is that you get back the importance of individual factor levels (for example, one row per title) rather than of whole features. Maybe that's what you want, but most of the time it isn't. For a more sophisticated approach, you'll have to bring out the big guns.

<h3>Feature importance with the Boruta package</h3>
Boruta is a feature ranking and selection algorithm based on the Random Forests algorithm. In plain English, it shows you if a variable is important or not. You can tweak the “strictness” by adjusting the P-value and other parameters, but that’s a topic for another time.
The <code>Boruta()</code> function accepts the same formula interface as <code>glm()</code>. It’s a formula with the target variable on the left side and the predictors on the right side. The additional <code>doTrace</code> parameter is there to limit the amount of output printed to the console - setting it to 0 will remove it altogether:

<pre><code class="language-r">library(Boruta)

boruta_output <- Boruta(Survived ~ ., data = train_set, doTrace = 0)
boruta_output</code></pre>

The output is nothing but pure text, which isn't the most useful thing in the world:

<img class="size-full wp-image-14611" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5972c3f301b9386501c_6f5a975a_9-3.webp" alt="Image 9 - Results of a Boruta algorithm" width="1270" height="156" /> Image 9 - Results of a Boruta algorithm

Long story short, all attributes are important. That's something you can expect from datasets that are made for learning and have nothing to do with the real world.

If you want to extract all the significant attributes from the output, use the following code:

<pre><code class="language-r"># Resolve attributes left as tentative, then list the confirmed ones
rough_fix_mod <- TentativeRoughFix(boruta_output)
boruta_signif <- getSelectedAttributes(rough_fix_mod)
boruta_signif</code></pre>

<img class="size-full wp-image-14613" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5980ca3d363ce968e34_34355cb5_10-3.webp" alt="Image 10 - Important features" width="1568" height="76" /> Image 10 - Important features

From here, you can use the following code snippet to get the importance scores and sort them in descending order:

<pre><code class="language-r">importances <- attStats(rough_fix_mod)
importances <- importances[importances$decision != "Rejected", c("meanImp", "decision")]
importances[order(-importances$meanImp), ]</code></pre>

<img class="alignnone size-full wp-image-14615" 
src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d59813d4a2a1c0fb20e9_3486e30b_11-3.webp" alt="Image 11 - Importance scores" width="454" height="382" /> Image 11 - Importance scores

Now, this makes much more sense. As it turns out, the saying "Women and children first" clearly has some impact. It's also expected for passengers from higher classes to survive more often, which is closely tied to passenger title and fare price.

You can also show these importances visually:

<pre><code class="language-r">plot(boruta_output, cex.axis = 0.7, las = 2, xlab = "", main = "Feature importance")</code></pre>

<img class="size-full wp-image-14617" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d599f7aa9374eb248748_0c867a42_12-3.webp" alt="Image 12 - Feature importance plot" width="1694" height="1622" /> Image 12 - Feature importance plot

Making sense of this chart is relatively easy. You should care the most about color. Green means the feature is important, red means it isn't, and blue represents the shadow attributes Boruta uses as a reference to determine importance, so these can be ignored. The higher the box and whiskers on the chart, the more important the feature is.

<blockquote>But what is a box and whiskers plot? <a href="https://appsilon.com/ggplot2-boxplots/">Here's our complete guide to get you up and running</a>.</blockquote>

And that's more than enough to get you started with logistic regression and classification in general. Let's wrap things up in the next section.

<hr />

<h2 id="conclusion">Summary of R Logistic Regression</h2>
Logistic regression is often used as a baseline binary classification model. More sophisticated algorithms (tree-based or neural networks) have to outperform it to be useful.

Today you've learned how to approach data cleaning, preparation, and feature engineering in a hopefully easy-to-follow and understandable way.
You've also learned how to apply binary classification modeling with logistic regression, and how to evaluate classification models. If you want to implement machine learning in your organization, you can always reach out to <a class="editor-rtfLink" href="https://wordpress.appsilon.com/" target="_blank" rel="noopener noreferrer">Appsilon</a> for help. <h3>Learn more:</h3><ul><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer">Machine Learning with R: A Complete Guide to Linear Regression</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer">What Can I Do With R? 6 Essential R Packages for Programmers</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/image-classification-tutorial/" target="_blank" rel="noopener noreferrer">Getting Started With Image Classification: fastai, ResNet, MobileNet, and More</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/ai-for-wildlife-image-classification-appsilon-ai4g-project-receives-google-grant/" target="_blank" rel="noopener noreferrer">AI for Good: ML Wildlife Image Classification to Analyze Camera Trap Datasets</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO Algorithm and YOLO Object Detection: An Introduction</a></li></ul>