Machine Learning with R: A Complete Guide to Logistic Regression
Updated: July 13, 2022.
Logistic Regression with R
Logistic regression is one of the most fundamental algorithms from statistics and is commonly used in machine learning. It won’t produce state-of-the-art (SOTA) models, but it serves as an excellent baseline for binary classification problems.
Interested in machine learning for beginners? Check our detailed guide on Linear Regression with R.
Today you’ll learn how to implement the logistic regression model in R and also improve your data cleaning, preparation, and feature engineering skills.
Navigate to a section:
- Introduction to R Logistic Regression
- Dataset Loading and Exploration
- Feature Engineering and Handling Missing Data
- Predictive Modeling
- Generating Predictions
- R Logistic Regression Feature Importance
- Summary of R Logistic Regression
Introduction to Logistic Regression
Logistic regression is an algorithm used both in statistics and machine learning. Machine learning engineers frequently use it as a baseline model – a model which other algorithms have to outperform. It’s also commonly used first because it’s easily interpretable.
In a way, logistic regression is similar to linear regression – but it isn’t used to predict continuous values (such as age or height). Instead, it’s used to predict binary classes – has the client churned or not, has the person survived or not, or is the disease malignant or benign. To simplify, logistic regression is used to predict the Yes/No type of response.
Strictly speaking, that’s a simplification. Logistic regression gives you the probability that the response is Yes, and you then use a predefined threshold to assign classes. For example, if the probability is greater than 0.5, the assigned class is Yes, and otherwise No. Raising or lowering the threshold trades false positives against false negatives, so it’s worth evaluating performance at several thresholds to match the kind of error you most want to avoid.
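To make the probability-then-threshold idea concrete, here’s a minimal base-R sketch. The `linear_predictor` values are made up for illustration (in a fitted model they would come from the intercept plus the weighted features), and the logistic function turns them into probabilities:

```r
# Hypothetical linear predictor values (not taken from the Titanic model)
linear_predictor <- c(-2, -0.5, 0, 0.5, 2)

# The logistic (sigmoid) function squeezes any real number into (0, 1)
probability <- 1 / (1 + exp(-linear_predictor))

# Assign classes with a 0.5 threshold; a probability of exactly 0.5 becomes "No"
predicted_class <- ifelse(probability > 0.5, "Yes", "No")

data.frame(linear_predictor, probability, predicted_class)
```

Only the last two rows end up as Yes, because only their probabilities exceed 0.5.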
As you would assume, logistic regression can work with both continuous and categorical data. This means your dataset can contain any sort of data, as long as it is adequately prepared.
You can use logistic regression models to examine feature importances. You’ll see how to do it later through hands-on examples. Knowing which features are important enables you to build simpler, lower-dimensional models. As a result, the predictions and the model are more interpretable.
And that’s all you need for a basic intuition behind logistic regression. Let’s get our hands dirty next.
Dataset Loading and Exploration
One of the best-known binary classification datasets is the Titanic dataset. The goal is to predict whether the passenger has survived the accident based on many input features, such as age, passenger class, and others.
You don’t have to download the dataset, as there’s a dedicated package for it in R. You’ll use only the training dataset throughout the article, so you don’t have to do the preparation and feature engineering twice.
The following snippet loads in every required package, stores the training dataset to a variable called df, and prints its structure:
# Data, missing-value tools, plotting, and modeling packages
library(titanic)
library(Amelia)
library(dplyr)
library(modeest)
library(ggplot2)
library(cowplot)
library(mice)
library(caTools)
library(caret)

df <- titanic_train
str(df)
Here’s the corresponding structure:
There’s a lot of work required. For example, missing values in some columns are marked with empty strings instead of NA. This issue is easy to fix, and once you fix it, you can plot a missingness map. It will show you where the missing values are located:
# Mark empty strings as proper missing values
df$Cabin[df$Cabin == ""] <- NA
df$Embarked[df$Embarked == ""] <- NA
missmap(obj = df)
The missingness map is shown below:
Three columns contain missing data: Cabin, Age, and Embarked. You’ll see how to fix that in the next section.
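If you’d rather have numbers than a plot, a quick count per column does the job. The sketch below uses a small made-up data frame called `toy` (a stand-in for df, not the real data) to show the same empty-string-to-NA fix followed by a count:

```r
# Hypothetical stand-in for df with the same kind of problem
toy <- data.frame(
  Age = c(22, NA, 26, 35),
  Cabin = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Empty strings are not NA yet, so convert them first
toy$Cabin[toy$Cabin == ""] <- NA
toy$Embarked[toy$Embarked == ""] <- NA

# Number of missing values in each column
missing_counts <- colSums(is.na(toy))
missing_counts
```

Running the same `colSums(is.na(df))` call on the real dataset should point at the same columns as the missingness map.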
Feature Engineering and Handling Missing Data
You need feature engineering because the default features either aren’t formatted correctly or don’t present information in the most useful way. Just take a look at the Name column in Image 1 – an algorithm can’t process it in the default format.
But this feature is quite useful. You can extract the passenger title from it (e.g., Miss, Sir, and so on). As a final step, you can check if a passenger has a rare title (e.g., Dona, Lady, Major, and so on).
The following snippet does just that:
# Keep only the text between ", " and the first "."
df$Title <- gsub("(.*, )|(\\..*)", "", df$Name)
rare_titles <- c("Dona", "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer")

# Merge equivalent titles, then group the rare ones
df$Title[df$Title == "Mlle"] <- "Miss"
df$Title[df$Title == "Ms"] <- "Miss"
df$Title[df$Title == "Mme"] <- "Mrs"
df$Title[df$Title %in% rare_titles] <- "Rare"
unique(df$Title)
You can see all of the unique titles we have now in the following image:
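To see what the regular expression does to a single value, it helps to run it on a few names written in the dataset’s "Surname, Title. First names" format. The `sample_names` vector below is just a small illustration:

```r
# A few names in the "Surname, Title. First names" format
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Heikkinen, Miss. Laina",
  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)"
)

# "(.*, )" removes everything up to the comma and space,
# "(\\..*)" removes the first period and everything after it
titles <- gsub("(.*, )|(\\..*)", "", sample_names)
titles
```

What remains is the title alone: "Mr", "Miss", and "the Countess".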
You can apply similar logic to the Cabin column. It’s useless by default but can be used to extract the deck, which is the first letter of the cabin code. Here’s how:
# Take the first character of each cabin code (NA stays NA)
df$Deck <- factor(sapply(df$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
unique(df$Deck)
The unique deck letters are shown in the following image:
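A vectorized alternative to the sapply() call is substr(), which grabs the first character of every string in one go. This is a sketch on a made-up `cabins` vector, not a change to the article’s approach:

```r
# Hypothetical cabin codes, including a missing one and a multi-cabin entry
cabins <- c("C85", NA, "E46", "B57 B59 B63 B66")

# First character of each code; NA stays NA
deck <- factor(substr(cabins, 1, 1))
deck
levels(deck)
```

Both versions should give the same deck for every passenger; substr() simply avoids splitting each string into characters first.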
You’ve now done some feature engineering, which means the original columns can be deleted. The snippet below deletes these two (Name and Cabin), along with PassengerId and Ticket, because they provide no meaningful information:
df <- df %>% select(-c(PassengerId, Name, Cabin, Ticket))
Finally, you can shift the focus to the missing values. Two approaches will be used – mode and MICE imputation.
You’ll use mode (most frequent value) imputation on the Embarked column because it contains only a couple of missing values. MICE imputation will require a bit more work. Converting categorical variables to factors is a must, and the imputation is done by leaving the target variable out.
Here’s the entire code snippet for imputing missing values:
# Imputing with the mode (most frequent value)
df$Embarked[is.na(df$Embarked)] <- mlv(df$Embarked, method = "mfv")

# Imputing with MICE
factor_vars <- c("Pclass", "Sex", "SibSp", "Parch", "Embarked", "Title")
df[factor_vars] <- lapply(df[factor_vars], function(x) as.factor(x))
impute_mice <- mice(df[, !names(df) %in% c("Survived")], method = "rf")
result_mice <- complete(impute_mice)
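If you’re curious what mode imputation boils down to, here’s a rough base-R sketch. The `most_frequent()` helper and the `embarked` vector are hypothetical and only illustrate the idea; they aren’t how modeest implements it:

```r
# Hypothetical helper: the most frequent non-missing value of a vector
most_frequent <- function(x) {
  counts <- table(x)               # table() drops NA by default
  names(counts)[which.max(counts)] # on ties, the first level wins
}

# Made-up stand-in for df$Embarked
embarked <- c("S", "C", "S", NA, "Q", "S", NA)

# Replace every missing value with the mode
embarked[is.na(embarked)] <- most_frequent(embarked)
embarked
```

Both missing entries become "S", the value that appears most often.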
As a sanity check, you can plot the density plots of continuous variables before and after imputation. Doing so will show you if the imputation skewed the distribution or not.
Age is the only continuous variable with missing values, so let’s make a before and after density plot:
density_before <- ggplot(df, aes(x = Age)) +
  geom_density(fill = "#e74c3c", alpha = 0.6) +
  labs(title = "Age: Before Imputation") +
  theme_classic()
density_after <- ggplot(result_mice, aes(x = Age)) +
  geom_density(fill = "#2ecc71", alpha = 0.6) +
  labs(title = "Age: After Imputation") +
  theme_classic()

# Show both plots side by side
plot_grid(density_before, density_after)
The visualization is shown below:
Some changes are visible, sure, but the overall distribution stayed roughly the same. There were a lot of missing values in this variable, so some changes in distribution are inevitable.
Finally, you can assign the imputation results to the original dataset and convert Deck to factor:
df$Age <- result_mice$Age
df$Deck <- result_mice$Deck
df$Deck <- as.factor(df$Deck)
You now have everything needed to start with predictive modeling – so let’s do that next.
Predictive Modeling
Before proceeding with modeling, you’ll need to split your dataset into training and testing subsets. These are available from the start with the Titanic dataset, but you’ll have to do the split manually as we’ve only used the training dataset.
The following snippet splits the data randomly in a 70:30 ratio. Don’t forget to set the seed value to 42 if you want the same split:
set.seed(42)

# TRUE marks rows for the training set, FALSE for the test set
sample_split <- sample.split(Y = df$Survived, SplitRatio = 0.7)
train_set <- subset(x = df, sample_split == TRUE)
test_set <- subset(x = df, sample_split == FALSE)
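As far as the caTools documentation describes it, sample.split() keeps the ratio of class labels roughly the same in both subsets. The base-R sketch below shows one way such a stratified 70:30 split could be done by hand. The label vector `y` is made up, and this is an illustration of the idea rather than the package’s actual implementation:

```r
set.seed(42)

# Hypothetical labels standing in for df$Survived: 60 zeros and 40 ones
y <- rep(c(0, 1), times = c(60, 40))

# Within each class, mark about 70% of the rows as training rows
in_train <- logical(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  n_train <- round(0.7 * length(idx))
  in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
}

# Rows: class label, columns: test (FALSE) vs. train (TRUE)
table(y, in_train)
```

Each class contributes 70% of its rows to the training set, so the class balance is the same on both sides of the split.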
You can now train the model on the training set. R uses the glm() function to apply logistic regression. The syntax is the same as for linear regression: put the target variable on the left and the features on the right, separated by the ~ sign. If you want to use all features, put a dot (.) instead of feature names.
Also, don’t forget to specify family = "binomial", as this is required for logistic regression:
# Survived is the target, the dot means "all other columns"
logistic <- glm(Survived ~ ., data = train_set, family = "binomial")
summary(logistic)
Here’s the summary of the model:
The most interesting thing here is the P-values, displayed in the Pr(>|z|) column. A P-value is the probability of seeing an estimate at least this extreme if the variable had no real effect on the outcome. It’s common to use a 5% significance threshold, so if a P-value is 0.05 or below, you can treat the variable as statistically significant.
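You don’t have to read the P-values off the printed summary. The sketch below pulls them out of the coefficient table and keeps the rows at or below the 5% threshold. To stay self-contained it fits a toy model on mtcars (a dataset that ships with R) instead of the Titanic data, so `toy_model` is only a stand-in for the article’s model:

```r
# Toy stand-in model: transmission type (am, 0/1) explained by car weight
toy_model <- glm(am ~ wt, data = mtcars, family = "binomial")

# Coefficient table: estimate, standard error, z value, and P-value
coefs <- summary(toy_model)$coefficients
colnames(coefs)

# Keep only the terms that are significant at the 5% level
significant <- coefs[coefs[, "Pr(>|z|)"] <= 0.05, , drop = FALSE]
significant
```

The same indexing should work on summary(logistic)$coefficients, since every binomial glm() summary exposes the same four columns.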
You’ve built and explored the model so far, but there’s no use in it yet. The next section shows you how to generate predictions on previously unseen data and evaluate the model.
Generating Predictions
As mentioned in the introduction section, logistic regression is based on probabilities. If the probability is greater than some threshold (commonly 0.5), you can treat this instance as positive.
The most common way of evaluating machine learning models is by examining the confusion matrix. It’s a square matrix showing you how many predictions were correct (true positives and true negatives), how many were negative but classified as positive (false positives), and how many were positive but classified as negative (false negatives). In our case, positive refers to a passenger who survived the accident.
The snippet below shows how to obtain probabilities and classes, and how to print the confusion matrix:
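# Predicted probabilities of survival for the test set
probs <- predict(logistic, newdata = test_set, type = "response")

# Turn probabilities into classes with a 0.5 threshold
pred <- ifelse(probs > 0.5, 1, 0)
confusionMatrix(factor(pred), factor(test_set$Survived), positive = as.character(1))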
And here are the corresponding results:
221 of 268 records were classified correctly, resulting in an accuracy of 82.5%. There are 26 false positives and 21 false negatives. You can play around with the classification threshold (0.5 now) and see how these misclassifications change.
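The accuracy figure follows directly from those counts, and the effect of moving the threshold is easy to try out on a small example. In the sketch below, `toy_probs` and `toy_actual` are made-up values, not the model’s output:

```r
# Accuracy from the reported counts: 268 records, 26 FP, 21 FN
accuracy <- (268 - 26 - 21) / 268
round(accuracy, 3)  # 0.825

# Hypothetical predicted probabilities and true labels
toy_probs  <- c(0.92, 0.81, 0.67, 0.55, 0.48, 0.35, 0.22, 0.10)
toy_actual <- c(1,    1,    0,    1,    1,    0,    0,    0)

# Count both error types for a given threshold
count_errors <- function(threshold) {
  toy_pred <- ifelse(toy_probs > threshold, 1, 0)
  c(false_pos = sum(toy_pred == 1 & toy_actual == 0),
    false_neg = sum(toy_pred == 0 & toy_actual == 1))
}

thresholds <- c(0.3, 0.5, 0.7)
errors <- sapply(thresholds, count_errors)
colnames(errors) <- thresholds
errors
```

On this toy data, a threshold of 0.3 gives two false positives and no false negatives, while 0.7 flips that to no false positives and two false negatives.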
Next, let’s explore two ways you can calculate the feature importance of a logistic regression model in R.
R Logistic Regression Feature Importance
The Titanic dataset is what you’d call a perfect dataset. It has just enough features that you don’t have to care about reduction. Real-world datasets couldn’t be more different. You’ll often encounter hundreds or even thousands of columns where only 10 of them are relevant. This section is aimed toward these datasets, even though we’ll apply the logic to Titanic.
Feature importance with the varImp() function
The easiest way to calculate the feature importance of an R logistic regression model is with the varImp() function. Here’s how to obtain the ten most important features, sorted in descending order:
importances <- varImp(logistic)
importances %>%
  arrange(desc(Overall)) %>%
  top_n(10)
The features are shown below:
What’s wrong with this approach is that you get back the importance of individual factor levels (for example, a single title or deck) rather than of whole variables. Maybe that’s what you want, but most of the time it isn’t. For a more sophisticated approach, you’ll have to bring out the big guns.
Feature importance with the Boruta package
Boruta is a feature ranking and selection algorithm based on the Random Forests algorithm. In plain English, it shows you if a variable is important or not. You can tweak the “strictness” by adjusting the P-value and other parameters, but that’s a topic for another time.
The Boruta() function takes a formula just like glm(), with the target variable on the left side and the predictors on the right side. The additional doTrace parameter controls the amount of output printed to the console – setting it to 0 turns the progress messages off:
library(Boruta)

boruta_output <- Boruta(Survived ~ ., data = train_set, doTrace = 0)
boruta_output
The output is nothing but pure text, which isn’t the most useful thing in the world:
Long story short, all attributes are important. That’s something you can expect from datasets that are made for learning and have nothing to do with the real world.
If you want to extract all the significant attributes from the output, use the following code:
# Resolve any attributes still marked as tentative
rough_fix_mod <- TentativeRoughFix(boruta_output)
boruta_signif <- getSelectedAttributes(rough_fix_mod)
boruta_signif
From here, you can use the following code snippet to get the importance scores and sort them in descending order:
importances <- attStats(rough_fix_mod)

# Drop rejected attributes and sort by mean importance, highest first
importances <- importances[importances$decision != "Rejected", c("meanImp", "decision")]
importances[order(-importances$meanImp), ]
Now, this makes much more sense. As it turns out, the saying “Women and children first” clearly has some impact. It’s also expected for passengers from higher classes to survive, which is closely tied to passenger title and fare price.
You can also show these importances visually:
plot(boruta_output, cex.axis = 0.7, las = 2, xlab = "", main = "Feature importance")
Making sense of this chart is relatively easy. You should care the most about color. Green means the feature is important, red means it isn’t, and blue represents variables used by Boruta to determine importance, so these can be ignored. The higher the box and whiskers on the chart, the more important the feature is.
But what is a box and whiskers plot? Here’s our complete guide to get you up and running.
And that’s more than enough to get you started with logistic regression and classification in general. Let’s wrap things up in the next section.
Summary of R Logistic Regression
Logistic regression is often used as a baseline binary classification model. More sophisticated algorithms (tree-based or neural networks) have to outperform it to be useful.
Today you’ve learned how to approach data cleaning, preparation, and feature engineering in a hopefully easy-to-follow and understandable way. You’ve also learned how to apply binary classification modeling with logistic regression, and how to evaluate classification models.
If you want to implement machine learning in your organization, you can always reach out to Appsilon for help.
- Machine Learning with R: A Complete Guide to Linear Regression
- What Can I Do With R? 6 Essential R Packages for Programmers
- Getting Started With Image Classification: fastai, ResNet, MobileNet, and More
- AI for Good: ML Wildlife Image Classification to Analyze Camera Trap Datasets
- YOLO Algorithm and YOLO Object Detection: An Introduction