Machine Learning with R: A Complete Guide to Logistic Regression

<em><strong>Updated</strong>: July 13, 2022.</em>

<h2>Logistic Regression with R</h2>

Logistic regression is one of the most fundamental algorithms from statistics, commonly used in machine learning. It's not used to produce SOTA models but can serve as an excellent baseline for binary classification problems.

<blockquote><strong>Interested in machine learning for beginners? <a href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer">Check our detailed guide on Linear Regression with R</a>.</strong></blockquote>

Today you'll learn how to implement the logistic regression model in R and also improve your data cleaning, preparation, and feature engineering skills.

Navigate to a section:

<ul><li><a href="#introduction">Introduction to R Logistic Regression</a></li><li><a href="#dataset">Dataset Loading and Exploration</a></li><li><a href="#preparation">Feature Engineering and Handling Missing Data</a></li><li><a href="#modeling">Predictive Modeling</a></li><li><a href="#predictions">Generating Predictions</a></li><li><a href="#feature-importance">R Logistic Regression Feature Importance</a></li><li><a href="#conclusion">Summary of R Logistic Regression</a></li></ul>

<hr />

<h2 id="introduction">Introduction to Logistic Regression</h2>

Logistic regression is an algorithm used both in statistics and machine learning. Machine learning engineers frequently use it as a baseline model - a model which other algorithms have to outperform. It's also commonly tried first because it's easily interpretable.

Logistic regression is similar to linear regression in many ways, but unlike linear regression, it's not used to predict continuous values (such as age or height). Instead, it's used to predict <strong>binary classes</strong> - has the client churned or not, has the person survived or not, is the disease malignant or benign. To simplify, logistic regression predicts a <em>Yes/No</em> type of response.

Strictly speaking, that's not entirely true. Logistic regression tells us the <strong>probability</strong> that the response is <em>Yes</em>, and we then use a predefined threshold to assign classes. For example, if the probability is greater than 0.5, the assigned class is <em>Yes</em>, and otherwise <em>No</em>. Evaluating performance at different thresholds can reduce the number of false positives or false negatives, depending on which type of error is more costly in your application.
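To make the probability-and-threshold idea concrete, here's a minimal sketch. The intercept and coefficient below are made up purely for illustration - they're not taken from any fitted model:

<pre><code class="language-r"># How logistic regression maps a linear combination of features to a
# probability, and how a threshold maps that probability to a class.
# The intercept (0.8) and slope (1.2) are hypothetical values.
sigmoid &lt;- function(z) 1 / (1 + exp(-z))

feature &lt;- c(-2, -0.5, 0.1, 1.7)
probs &lt;- sigmoid(0.8 + 1.2 * feature)
classes &lt;- ifelse(probs &gt; 0.5, "Yes", "No")

data.frame(probability = round(probs, 3), class = classes)</code></pre>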
As you would assume, logistic regression can work with both continuous and categorical data. This means your dataset can contain any sort of data, as long as it is adequately prepared.

You can also use logistic regression models to examine <strong>feature importances</strong> - you'll see how to do that later through hands-on examples. Knowing which features are important enables you to build simpler, lower-dimensional models. As a result, both the model and its predictions are more interpretable.

And that's all you need for a basic intuition behind logistic regression. Let's get our hands dirty next.

<h2 id="dataset">Dataset Loading and Exploration</h2>

One of the best-known binary classification datasets is the Titanic dataset. The goal is to predict whether a passenger survived the disaster based on many input features, such as age, passenger class, and others.

You don't have to download the dataset, as there's a dedicated R package for it. You'll use only the training dataset throughout the article, so you don't have to do the preparation and feature engineering twice.

The following snippet loads every required package, stores the training dataset in a variable called <code>df</code>, and prints its structure:

<pre><code class="language-r">library(titanic)
library(Amelia)
library(dplyr)
library(modeest)
library(ggplot2)
library(cowplot)
library(mice)
library(caTools)
library(caret)

df &lt;- titanic_train
str(df)</code></pre>

Here's the corresponding structure:

<img class="size-full wp-image-6385" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395c485545dccd9c00e5d_1-1.webp" alt="Image 1 - Titanic dataset structure" width="1418" height="532" />

Image 1 - Titanic dataset structure

There's a lot of cleaning work required. For example, missing values in some columns are marked with empty strings instead of <code>NA</code>. This issue is easy to fix, and once you fix it, you can plot a <strong>missingness map</strong>. It will show you where the missing values are located:

<pre><code class="language-r"># Empty strings are really missing values - recode them as NA
df$Cabin[df$Cabin == ""] &lt;- NA
df$Embarked[df$Embarked == ""] &lt;- NA

missmap(obj = df)</code></pre>

The missingness map is shown below:

<img class="size-full wp-image-6386" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270aa580c4700e356c1b5_2-1.webp" alt="Image 2 - Missingness map" width="1500" height="998" />

Image 2 - Missingness map

The first three columns contain missing data. You'll see how to fix that in the next section.
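If you prefer exact numbers over a visual overview, a quick <code>sapply()</code> call gives you the missing-value count per column:

<pre><code class="language-r"># Count missing values per column - a numeric complement to missmap()
sapply(df, function(x) sum(is.na(x)))</code></pre>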
<h2 id="preparation">Feature Engineering and Handling Missing Data</h2>

You need feature engineering because the default features either aren't formatted correctly or don't present their information in the best way. Just take a look at the <code>Name</code> column in <em>Image 1</em> - an algorithm can't process it in its default format.

But this feature is quite useful. You can extract the passenger title from it (e.g., <em>Miss</em>, <em>Sir</em>, and so on). As a final step, you can check whether a passenger has a rare title (e.g., <em>Dona</em>, <em>Lady</em>, <em>Major</em>, and so on).

The following snippet does just that:

<pre><code class="language-r"># The title is the part of the name between the comma and the period
df$Title &lt;- gsub("(.*, )|(\\..*)", "", df$Name)

rare_titles &lt;- c("Dona", "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer")

# Consolidate French titles and group the rare ones
df$Title[df$Title == "Mlle"] &lt;- "Miss"
df$Title[df$Title == "Ms"] &lt;- "Miss"
df$Title[df$Title == "Mme"] &lt;- "Mrs"
df$Title[df$Title %in% rare_titles] &lt;- "Rare"

unique(df$Title)</code></pre>

You can see all of the unique titles we have now in the following image:

<img class="size-full wp-image-6387" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270ab4a6434e3a86348b8_3-1.webp" alt="Image 3 - Unique passenger titles" width="740" height="66" />

Image 3 - Unique passenger titles

You can apply similar logic to the <code>Cabin</code> column. It's useless in its default form, but you can extract the deck from it - the first letter of the cabin code. Here's how:

<pre><code class="language-r"># The first character of the cabin code is the deck
df$Deck &lt;- factor(sapply(df$Cabin, function(x) strsplit(x, NULL)[[1]][1]))

unique(df$Deck)</code></pre>

The unique decks are shown in the following image:

<img class="size-full wp-image-6388" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270acfcac124f318a5812_4-1.webp" alt="Image 4 - Unique decks" width="736" height="108" />

Image 4 - Unique decks

You've now done some feature engineering, which means the original columns can be deleted. The snippet below deletes these two, but also <code>PassengerId</code> and <code>Ticket</code>, because they provide no meaningful information:

<pre><code class="language-r">df &lt;- df %&gt;%
  select(-c(PassengerId, Name, Cabin, Ticket))</code></pre>
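Before moving on, it's worth a quick sanity check that the engineered <code>Title</code> feature actually carries signal. A simple cross-tabulation against the target shows how survival counts vary by title:

<pre><code class="language-r"># Survival counts per engineered title
# (rows are titles, columns are Survived: 0 = no, 1 = yes)
table(df$Title, df$Survived)</code></pre>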
Finally, you can shift the focus to the missing values. Two approaches will be used - mode and MICE imputation.

You'll use mode (most frequent value) imputation on the <code>Embarked</code> column because it contains only a couple of missing values. MICE imputation will require a bit more work. Converting categorical variables to factors is a must, and the imputation is done with the target variable left out, so it can't influence the imputed values.

Here's the entire code snippet for imputing missing values:

<pre><code class="language-r"># Mode imputation for Embarked
df$Embarked[is.na(df$Embarked)] &lt;- mlv(df$Embarked, method = "mfv")

# MICE imputation - categorical variables must be factors first
factor_vars &lt;- c("Pclass", "Sex", "SibSp", "Parch", "Embarked", "Title")
df[factor_vars] &lt;- lapply(df[factor_vars], function(x) as.factor(x))

impute_mice &lt;- mice(df[, !names(df) %in% c("Survived")], method = "rf")
result_mice &lt;- complete(impute_mice)</code></pre>

As a sanity check, you can plot the density of a continuous variable before and after imputation. Doing so shows you whether the imputation skewed the distribution. <code>Age</code> is the only continuous variable with missing values, so let's make a before-and-after density plot:

<pre><code class="language-r">density_before &lt;- ggplot(df, aes(x = Age)) +
  geom_density(fill = "#e74c3c", alpha = 0.6) +
  labs(title = "Age: Before Imputation") +
  theme_classic()

density_after &lt;- ggplot(result_mice, aes(x = Age)) +
  geom_density(fill = "#2ecc71", alpha = 0.6) +
  labs(title = "Age: After Imputation") +
  theme_classic()

plot_grid(density_before, density_after)</code></pre>

The visualization is shown below:

<img class="size-full wp-image-6389" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395c7bf649dee317bcea3_5-1.webp" alt="Image 5 - Density plot of Age before and after imputation" width="1500" height="1095" />

Image 5 - Density plot of Age before and after imputation

Some changes are visible, sure, but the overall distribution stayed roughly the same. There were a lot of missing values in this variable, so some change in the distribution is inevitable.

Finally, you can assign the imputation results to the original dataset and convert <code>Deck</code> to a factor:

<pre><code class="language-r">df$Age &lt;- result_mice$Age
df$Deck &lt;- result_mice$Deck
df$Deck &lt;- as.factor(df$Deck)</code></pre>

You now have everything needed to start with predictive modeling - so let's do that next.

<h2 id="modeling">Modeling</h2>

Before proceeding with modeling, you'll need to split your dataset into training and testing subsets. These are available separately for the Titanic dataset, but you'll have to do the split manually since we've only used the training dataset.

The following snippet splits the data randomly in a 70:30 ratio. Don't forget to set the seed value to 42 if you want the same split:

<pre><code class="language-r">set.seed(42)

sample_split &lt;- sample.split(Y = df$Survived, SplitRatio = 0.7)
train_set &lt;- subset(x = df, sample_split == TRUE)
test_set &lt;- subset(x = df, sample_split == FALSE)</code></pre>

You can now train the model on the training set. R uses the <code>glm()</code> function to fit a logistic regression model. The syntax is identical to linear regression's: put the target variable on the left and the features on the right, separated by the <code>~</code> sign. If you want to use all features, put a dot (.) instead of the feature names.

Also, don't forget to specify <code>family = "binomial"</code>, as this is what makes <code>glm()</code> fit a logistic regression:

<pre><code class="language-r">logistic &lt;- glm(Survived ~ ., data = train_set, family = "binomial")
summary(logistic)</code></pre>

Here's the summary of the model:

<img class="size-full wp-image-6390" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270ad5f867c2801dc8b83_6-1.webp" alt="Image 6 - Summary of a logistic regression model" width="1060" height="1598" />

Image 6 - Summary of a logistic regression model

The most interesting part here is the P-values, displayed in the <code>Pr(&gt;|z|)</code> column. They indicate how likely it is you'd observe a coefficient at least this large if the variable had no real effect on the target. It's common to use a 5% significance threshold, so if a P-value is 0.05 or below, you can treat the variable as significant for the analysis.
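Another way to read the summary is through odds ratios. Exponentiating the coefficients tells you how the odds of survival change with a one-unit increase in a feature (or relative to a factor's reference level):

<pre><code class="language-r"># Exponentiated coefficients are odds ratios: values above 1 increase
# the odds of survival, values below 1 decrease them
exp(coef(logistic))</code></pre>

For example, a hypothetical odds ratio of 0.96 for <code>Age</code> would mean each additional year of age multiplies the odds of survival by 0.96 - the exact numbers will depend on your split.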
You've built and explored the model so far, but there's no use in it yet. The next section shows you how to generate predictions on previously unseen data and evaluate the model.

<h2 id="predictions">Generating Predictions</h2>

As mentioned in the introduction section, logistic regression is based on probabilities. If the probability is greater than some threshold (commonly 0.5), you can treat the instance as positive.

The most common way of evaluating machine learning classifiers is by examining the <strong>confusion matrix</strong>. It's a square matrix showing how many predictions were correct (true positives and true negatives), how many negatives were classified as positive (false positives), and how many positives were classified as negative (false negatives). In our case, <em>positive</em> refers to a passenger who survived.

The snippet below shows how to obtain probabilities and classes, and how to print the confusion matrix:

<pre><code class="language-r">probs &lt;- predict(logistic, newdata = test_set, type = "response")
pred &lt;- ifelse(probs &gt; 0.5, 1, 0)

confusionMatrix(factor(pred), factor(test_set$Survived), positive = "1")</code></pre>

And here are the corresponding results:

<img class="wp-image-6392 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b39589455ba7eb54dbe896_8-1.webp" alt="Image 7 - Confusion matrix of a logistic regression model" width="686" height="990" />

Image 7 - Confusion matrix of a logistic regression model

221 of 268 records were classified correctly, resulting in an accuracy of 82.5%. There are 26 false positives and 21 false negatives. You can play around with the classification threshold (0.5 here) and see how the misclassifications change.
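To see how the threshold trades false positives against false negatives, you can sweep a few candidate values and compare the resulting accuracy. This is a quick sketch - a more thorough analysis would look at an ROC curve instead:

<pre><code class="language-r"># Accuracy at a few classification thresholds. Lowering the threshold
# catches more positives (fewer false negatives) at the cost of more
# false positives.
thresholds &lt;- c(0.3, 0.4, 0.5, 0.6, 0.7)
sapply(thresholds, function(th) {
  pred_t &lt;- ifelse(probs &gt; th, 1, 0)
  mean(pred_t == test_set$Survived)
})</code></pre>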
Next, let's explore two ways you can calculate the feature importance of a logistic regression model in R.

<h2 id="feature-importance">R Logistic Regression Feature Importance</h2>

The Titanic dataset is what you'd call a perfect dataset. It has few enough features that you don't have to worry about dimensionality reduction. Real-world datasets couldn't be more different - you'll often encounter hundreds or even thousands of columns where only a handful are relevant. This section is aimed at such datasets, even though we'll apply the logic to Titanic.

<h3>Feature importance with the varImp() function</h3>

The easiest way to calculate the feature importance of an R logistic regression model is with the <code>varImp()</code> function. Here's how to obtain the ten most important features, sorted in descending order:

<pre><code class="language-r">importances &lt;- varImp(logistic)

importances %&gt;%
  arrange(desc(Overall)) %&gt;%
  top_n(10)</code></pre>

The features are shown below:

<img class="wp-image-6391 size-full" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395c93b364f1252b8afb5_7-1.webp" alt="Image 8 - Feature importances of a logistic regression model" width="324" height="464" />

Image 8 - Feature importances of a logistic regression model
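As a side note, for a <code>glm</code> model <code>varImp()</code> is - to the best of my knowledge - based on the absolute value of each coefficient's z-statistic, so you can reproduce essentially the same ranking straight from the model summary:

<pre><code class="language-r"># Absolute z-statistics from the model summary - this should closely
# match what varImp() reports for a glm (assumption: caret uses the
# absolute test statistic for (generalized) linear models)
z_vals &lt;- abs(summary(logistic)$coefficients[, "z value"])
z_vals &lt;- z_vals[names(z_vals) != "(Intercept)"]
sort(z_vals, decreasing = TRUE)[1:10]</code></pre>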
The drawback of this approach is that you get back the importance of individual factor levels (dummy variables) rather than whole features. Maybe that's what you want, but most of the time it isn't. For a more sophisticated approach, you'll have to bring out the big guns.

<h3>Feature importance with the Boruta package</h3>

Boruta is a feature ranking and selection algorithm built on top of the Random Forests algorithm. In plain English, it tells you whether each variable is important or not. You can tweak the "strictness" by adjusting the P-value and other parameters, but that's a topic for another time.

The <code>Boruta()</code> function takes the same kind of formula as <code>glm()</code>: the target variable on the left side and the predictors on the right. The additional <code>doTrace</code> parameter limits the amount of output printed to the console - setting it to 0 removes it altogether:

<pre><code class="language-r">library(Boruta)

boruta_output &lt;- Boruta(Survived ~ ., data = train_set, doTrace = 0)
boruta_output</code></pre>

The output is nothing but plain text, which isn't the most useful thing in the world:

<img class="size-full wp-image-14611" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5972c3f301b9386501c_6f5a975a_9-3.webp" alt="Image 9 - Results of a Boruta algorithm" width="1270" height="156" />

Image 9 - Results of a Boruta algorithm

Long story short, all attributes are important. That's something you can expect from datasets that were made for learning and have little to do with the real world. If you want to extract all the significant attributes from the output, use the following code:

<pre><code class="language-r"># Resolve any tentative attributes, then pull out the confirmed ones
rough_fix_mod &lt;- TentativeRoughFix(boruta_output)
boruta_signif &lt;- getSelectedAttributes(rough_fix_mod)
boruta_signif</code></pre>

<img class="size-full wp-image-14613" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5980ca3d363ce968e34_34355cb5_10-3.webp" alt="Image 10 - Important features" width="1568" height="76" />

Image 10 - Important features

From here, you can use the following snippet to get the importance scores and sort them in descending order:

<pre><code class="language-r">importances &lt;- attStats(rough_fix_mod)
importances &lt;- importances[importances$decision != "Rejected", c("meanImp", "decision")]
importances[order(-importances$meanImp), ]</code></pre>

<img class="alignnone size-full wp-image-14615" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d59813d4a2a1c0fb20e9_3486e30b_11-3.webp" alt="Image 11 - Importance scores" width="454" height="382" />

Image 11 - Importance scores

Now, this makes much more sense. As it turns out, the saying "women and children first" clearly had some impact. It's also expected for passengers from higher classes to survive more often, which is closely tied to passenger title and fare price. You can also show these importances visually:

<pre><code class="language-r">plot(boruta_output, cex.axis = 0.7, las = 2, xlab = "", main = "Feature importance")</code></pre>

<img class="size-full wp-image-14617" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d599f7aa9374eb248748_0c867a42_12-3.webp" alt="Image 12 - Feature importance plot" width="1694" height="1622" />

Image 12 - Feature importance plot

Making sense of this chart is relatively easy. Color matters most: green means the feature is important, red means it isn't, and blue marks the shadow attributes Boruta creates internally as an importance benchmark, so you can ignore those. The higher a feature's box and whiskers sit on the chart, the more important it is.

<blockquote>But what is a box and whiskers plot? <a href="https://appsilon.com/ggplot2-boxplots/">Here's our complete guide to get you up and running</a>.</blockquote>
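As a closing experiment, you can act on these importance scores and refit the model with only the top-ranked features. The feature list below is an assumption based on the ranking above - adjust it to match your own output:

<pre><code class="language-r"># Refit using only the highest-ranked features and compare the test
# accuracy with the full model (the feature list is assumed from the
# importance ranking and may differ on your split)
logistic_small &lt;- glm(Survived ~ Title + Sex + Pclass + Fare + Age,
                      data = train_set, family = "binomial")

probs_small &lt;- predict(logistic_small, newdata = test_set, type = "response")
pred_small &lt;- ifelse(probs_small &gt; 0.5, 1, 0)
mean(pred_small == test_set$Survived)  # accuracy of the reduced model</code></pre>

If the accuracy stays close to the full model's 82.5%, you've gained a simpler, more interpretable model at little cost.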
And that's more than enough to get you started with logistic regression and classification in general. Let's wrap things up in the next section.

<hr />

<h2 id="conclusion">Summary of R Logistic Regression</h2>

Logistic regression is often used as a baseline binary classification model. More sophisticated algorithms (tree-based models or neural networks) have to outperform it to be worth using.

Today you've learned how to approach data cleaning, preparation, and feature engineering in a hopefully easy-to-follow and understandable way. You've also learned how to apply binary classification modeling with logistic regression, and how to evaluate classification models.

<strong>If you want to implement machine learning in your organization, you can always reach out to <a href="https://wordpress.appsilon.com/" target="_blank" rel="noopener noreferrer">Appsilon</a> for help.</strong>

<h3>Learn more:</h3>

<ul><li><a href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer">Machine Learning with R: A Complete Guide to Linear Regression</a></li><li><a href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer">What Can I Do With R? 6 Essential R Packages for Programmers</a></li><li><a href="https://wordpress.appsilon.com/image-classification-tutorial/" target="_blank" rel="noopener noreferrer">Getting Started With Image Classification: fastai, ResNet, MobileNet, and More</a></li><li><a href="https://wordpress.appsilon.com/ai-for-wildlife-image-classification-appsilon-ai4g-project-receives-google-grant/" target="_blank" rel="noopener noreferrer">AI for Good: ML Wildlife Image Classification to Analyze Camera Trap Datasets</a></li><li><a href="https://wordpress.appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO Algorithm and YOLO Object Detection: An Introduction</a></li></ul>
