
Imputation in R: Top 3 Ways for Imputing Missing Data


Real-world data is often messy and full of missing values. As a result, data scientists spend much of their time cleaning and preparing data, and have less time to focus on predictive modeling and machine learning. If there’s one thing nearly every data preparation workflow shares, it’s dealing with missing data. Today we’ll make this process a bit easier for you by introducing three ways to impute missing data in R.

After reading this article, you’ll know several approaches for imputation in R and for tackling missing values in general. Choosing the best approach often boils down to experimentation and domain knowledge, so an article like this can only take you so far.



Introduction to Imputation in R

In the simplest terms, imputation is the process of replacing the missing (NA) values in your dataset with values that can be processed, analyzed, or passed into a machine learning model. There are numerous ways to perform imputation in the R programming language, and choosing the best one usually boils down to domain knowledge.

Picture this – there’s a column in your dataset that stands for the amount the user spends on a phone service X. Values are missing for some clients, but what’s the reason? Can you impute them with a simple mean? Well, you can’t, at least not without asking a business question first – Why are these values missing?

Most likely, the user isn’t using that phone service, so the true amount spent is effectively zero. Imputing those missing values with the mean would be a terrible idea.

Let’s examine our data for today. We’ll use the training portion of the Titanic dataset and try to impute missing values for the Age column:

Imports:

library(ggplot2)
library(dplyr)
library(titanic)
library(cowplot)

titanic_train$Age

You can see some of the possible values below:

Image 1 – Possible Age values of the Titanic dataset
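Before deciding how to impute, it helps to know how much is actually missing. Here is a minimal base R sketch. The helper name na_summary and the toy_age vector are made up for illustration, and you would pass titanic_train$Age instead once the titanic package is loaded:

```r
# Summarise how many values are missing in a vector.
# `na_summary` is a hypothetical helper, not part of any package.
na_summary <- function(x) {
  c(
    total = length(x),
    missing = sum(is.na(x)),
    pct_missing = round(100 * mean(is.na(x)), 1)
  )
}

# Toy stand-in for an age column; swap in titanic_train$Age in practice.
toy_age <- c(22, 38, NA, 35, NA, 54, 2, 27)
na_summary(toy_age)
```

The percentage is a quick sanity check: the larger the share of missing values, the more any single-value imputation will reshape the column.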

There’s a fair amount of NA values, and it’s our job to impute them. They’re most likely missing because the creator of the dataset had no information on the person’s age. If you were to build a machine learning model on this dataset, the best way to evaluate an imputation technique would be to measure classification metrics (accuracy, precision, recall, F1) after training the model.
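If you want to go down the model-based evaluation route, the four metrics can be computed by hand in base R. The sketch below assumes binary labels coded as 0 and 1. The function name classification_metrics and the two toy vectors are illustrative only, and the predictions would normally come from a model trained on the imputed data:

```r
# Compute accuracy, precision, recall, and F1 for 0/1 labels.
# `classification_metrics` is a hypothetical helper name.
classification_metrics <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)  # true positives
  fp <- sum(actual == 0 & predicted == 1)  # false positives
  fn <- sum(actual == 1 & predicted == 0)  # false negatives
  tn <- sum(actual == 0 & predicted == 0)  # true negatives

  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)

  c(
    accuracy = (tp + tn) / length(actual),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall)
  )
}

# Made-up labels and predictions, just to show the call.
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)
classification_metrics(actual, predicted)
```

Running the same function once per imputation technique gives you a like-for-like comparison, provided the model and the train/test split stay fixed.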

But before diving into the imputation, let’s visualize the distribution of our variable:

ggplot(titanic_train, aes(Age)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  ggtitle("Variable distribution") +
  theme_classic() +
  theme(plot.title = element_text(size = 18))

The histogram is displayed in the figure below:

Image 2 – Distribution of the Age variable

So, why is this important? It’s a good idea to compare variable distribution before and after imputation. You don’t want the distribution to change significantly, and a histogram is a good way to check that.

Don’t know the first thing about histograms? Our detailed guide with ggplot2 has you covered.

We’ll now explore a suite of basic techniques for imputation in R.

Simple Value Imputation in R with Built-in Functions

You don’t actually need an R package to impute missing values. You can do the whole thing manually, provided the imputation techniques are simple. We’ll cover constant, mean, and median imputations in this section and compare the results.

The value_imputed variable will store a data.frame of the imputed ages. The imputation itself boils down to replacing the NA entries of a column with a value of our choice. We’ll try three replacement values:

  • Zero: constant imputation, feel free to change the value.
  • Mean (average): the average age, calculated after all NAs are removed.
  • Median: the median age, calculated after all NAs are removed.

Here’s the code:

value_imputed <- data.frame(
  original = titanic_train$Age,
  imputed_zero = replace(titanic_train$Age, is.na(titanic_train$Age), 0),
  imputed_mean = replace(titanic_train$Age, is.na(titanic_train$Age), mean(titanic_train$Age, na.rm = TRUE)),
  imputed_median = replace(titanic_train$Age, is.na(titanic_train$Age), median(titanic_train$Age, na.rm = TRUE))
)
value_imputed

We now have a dataset with four columns representing the age:

Image 3 – Results of the basic value imputation

Let’s take a look at the variable distribution changes introduced by imputation on a 2×2 grid of histograms:

h1 <- ggplot(value_imputed, aes(x = original)) +
  geom_histogram(fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original distribution") +
  theme_classic()
h2 <- ggplot(value_imputed, aes(x = imputed_zero)) +
  geom_histogram(fill = "#15ad4f", color = "#000000", position = "identity") +
  ggtitle("Zero-imputed distribution") +
  theme_classic()
h3 <- ggplot(value_imputed, aes(x = imputed_mean)) +
  geom_histogram(fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mean-imputed distribution") +
  theme_classic()
h4 <- ggplot(value_imputed, aes(x = imputed_median)) +
  geom_histogram(fill = "#ad8415", color = "#000000", position = "identity") +
  ggtitle("Median-imputed distribution") +
  theme_classic()

plot_grid(h1, h2, h3, h4, nrow = 2, ncol = 2)

Here’s the output:

Image 4 – Distributions after the basic value imputation

All three imputation methods severely distort the distribution. There are a lot of missing values (177 of the 891 ages), so filling them all with a single constant value doesn’t make much sense. Zero imputation is the worst, as it’s highly unlikely for 177 passengers to be zero years old.
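The distortion can also be checked numerically rather than only by eye. The sketch below uses a small made-up vector (toy_age is an assumption, not the Titanic data) to show the typical effect of mean imputation: the mean is preserved, but the spread shrinks because every filled-in value sits exactly at the center.

```r
# Toy stand-in for an age column with missing values.
toy_age <- c(22, 38, NA, 35, NA, 54, 2, 27)

# Mean imputation, same replace() pattern as in the article.
mean_imputed <- replace(toy_age, is.na(toy_age), mean(toy_age, na.rm = TRUE))

# Compare the center and the spread before and after imputation.
comparison <- data.frame(
  version = c("observed only", "mean-imputed"),
  mean = c(mean(toy_age, na.rm = TRUE), mean(mean_imputed)),
  sd = c(sd(toy_age, na.rm = TRUE), sd(mean_imputed))
)
comparison
```

A noticeably smaller standard deviation after imputation is a warning sign that the technique is flattening the variable.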

Maybe mode imputation would provide better results, but we’ll leave that up to you.
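If you want to try it, keep in mind that base R has no built-in function for the statistical mode (the mode() function reports the storage type instead). Here is a minimal sketch with a hypothetical stat_mode helper and a made-up vector:

```r
# Most frequent non-missing value; ties resolve to the value seen first.
# `stat_mode` is a hypothetical helper, not a base R function.
stat_mode <- function(x) {
  observed <- x[!is.na(x)]
  values <- unique(observed)
  values[which.max(tabulate(match(observed, values)))]
}

# Toy stand-in for an age column; swap in titanic_train$Age in practice.
toy_age <- c(22, 38, NA, 22, NA, 54, 2, 22, 38)
imputed_mode <- replace(toy_age, is.na(toy_age), stat_mode(toy_age))
imputed_mode
```

Mode imputation is still a single-constant fill, so expect the same kind of spike in the histogram, only at a different position.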

Impute Missing Values in R with MICE

MICE stands for Multivariate Imputation by Chained Equations, and the mice package is one of the most widely used imputation packages among R users. It assumes the missing values are missing at random (MAR).

The basic idea behind the algorithm is to treat each variable that has missing values as a dependent variable in regression and treat the others as independent (predictors). You can learn more about MICE in this paper.
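To make the idea concrete, here is a single regression-based imputation step written in base R. This is a deliberately simplified sketch and not what mice does internally: the real algorithm cycles through all incomplete variables, repeats for several iterations, and adds randomness to the imputed values. The built-in airquality data is used as a stand-in so the snippet runs without extra packages.

```r
# One regression-based imputation step on the built-in airquality data.
data(airquality)
missing_ozone <- is.na(airquality$Ozone)

# Treat the incomplete variable as the response and two complete
# variables as predictors. lm() drops rows with a missing response.
fit <- lm(Ozone ~ Wind + Temp, data = airquality)

# Fill the gaps with the model's predictions; observed values stay as-is.
imputed <- airquality
imputed$Ozone[missing_ozone] <- predict(fit, newdata = airquality[missing_ozone, ])

summary(imputed$Ozone)
```

Because the predictions depend on the other columns, the imputed values vary from row to row instead of collapsing onto one constant.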

The R mice package provides many univariate imputation methods, but we’ll use only a handful. First, let’s import the package and subset only the numerical columns to keep things simple. Only the Age attribute contains missing values:

library(mice)

titanic_numeric <- titanic_train %>%
  select(Survived, Pclass, SibSp, Parch, Age)

md.pattern(titanic_numeric)

The md.pattern() function gives us a visual representation of missing values:

Image 5 – Missing map

Onto the imputation now. We’ll use the following MICE imputation methods:

  • pmm: Predictive mean matching.
  • cart: Classification and regression trees.
  • lasso.norm: Lasso linear regression.

Once again, the results will be stored in a data.frame:

mice_imputed <- data.frame(
  original = titanic_train$Age,
  imputed_pmm = complete(mice(titanic_numeric, method = "pmm"))$Age,
  imputed_cart = complete(mice(titanic_numeric, method = "cart"))$Age,
  imputed_lasso = complete(mice(titanic_numeric, method = "lasso.norm"))$Age
)
mice_imputed

Let’s take a look at the results:

Image 6 – Results of MICE imputation
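Two details are worth knowing when you rerun this. By default, mice() creates five imputed datasets (m = 5) and complete() returns only the first one, and the imputations involve randomness, so results change between runs unless you fix a seed. The sketch below shows both, using the small nhanes dataset that ships with mice as a stand-in for titanic_numeric. The seed value is arbitrary.

```r
library(mice)

# nhanes ships with mice; it stands in for titanic_numeric here.
# seed makes the run reproducible, printFlag = FALSE silences the log.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 42, printFlag = FALSE)

# complete(imp) is the same as complete(imp, 1): the first imputed dataset.
first_completed <- complete(imp, 1)

# Any of the m datasets can be extracted by its index.
third_completed <- complete(imp, 3)

summary(first_completed$bmi)
summary(third_completed$bmi)
```

Comparing a couple of the completed datasets gives a feel for how much the imputed values vary from one draw to the next.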

It’s hard to judge from the table data alone, so we’ll draw a grid of histograms once again (copy and modify the code from the previous section):

Image 7 – Distributions after the MICE imputation
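If you’d rather not copy and paste four nearly identical plot definitions, a small helper function keeps things tidy. This is a sketch under a few assumptions: plot_age_hist is a made-up helper name, and demo_imputed is a tiny stand-in with invented values that only mimics the column names of mice_imputed. Pass mice_imputed instead to reproduce the grid.

```r
library(ggplot2)
library(cowplot)

# Hypothetical helper: one histogram for a single column of a data frame.
plot_age_hist <- function(data, column, fill, title) {
  ggplot(data, aes(x = .data[[column]])) +
    geom_histogram(fill = fill, color = "#000000", bins = 30, na.rm = TRUE) +
    ggtitle(title) +
    theme_classic()
}

# Stand-in with the same column names as mice_imputed; values are made up.
demo_imputed <- data.frame(
  original = c(22, 38, NA, 35, NA, 54, 2, 27),
  imputed_pmm = c(22, 38, 27, 35, 54, 54, 2, 27),
  imputed_cart = c(22, 38, 35, 35, 22, 54, 2, 27),
  imputed_lasso = c(22, 38, 31.4, 35, -1.2, 54, 2, 27)
)

panels <- list(
  plot_age_hist(demo_imputed, "original", "#ad1538", "Original distribution"),
  plot_age_hist(demo_imputed, "imputed_pmm", "#15ad4f", "PMM-imputed distribution"),
  plot_age_hist(demo_imputed, "imputed_cart", "#1543ad", "CART-imputed distribution"),
  plot_age_hist(demo_imputed, "imputed_lasso", "#ad8415", "Lasso-imputed distribution")
)

# Arrange the four histograms on a 2x2 grid.
plot_grid(plotlist = panels, nrow = 2, ncol = 2)
```

Setting bins explicitly avoids the default binning message, and na.rm = TRUE silences the warning about the NA values in the original column.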

The imputed distributions overall look much closer to the original one. The CART-imputed age distribution probably looks the closest. Also, take a look at the last histogram – the age values go below zero. This doesn’t make sense for a variable such as age, so you will need to correct the negative values manually if you opt for this imputation technique.
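One simple manual correction is to clamp the imputed values at zero with pmax(). The vector below is made up for illustration. In practice you would apply the same call to the lasso-imputed column:

```r
# Made-up imputed ages, including two impossible negative values.
imputed_lasso <- c(24.1, -3.7, 41.0, -0.4, 18.5)

# pmax() compares element-wise, so every value below 0 becomes 0.
clamped <- pmax(imputed_lasso, 0)
clamped
```

Keep in mind that clamping piles the corrected values up at zero, which is its own distortion. Predictive mean matching sidesteps the issue, since it fills gaps with values borrowed from observed cases and therefore stays within the observed range.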

That covers MICE, so let’s take a look at another R imputation approach – Miss Forest.

Imputation with R missForest Package

The Miss Forest imputation technique is based on the Random Forest algorithm. It’s a non-parametric imputation method, which means it doesn’t make explicit assumptions about the functional form of the relationship between variables, but instead tries to estimate that relationship in a way that fits the data points most closely.

In other words, it builds a random forest model for each variable and then uses the model to predict missing values. You can learn more about it by reading the article by Oxford Academic.

Let’s see how it works for imputation in R. We’ll apply it to the entire numerical dataset and only extract the age:

library(missForest)

missForest_imputed <- data.frame(
  original = titanic_numeric$Age,
  imputed_missForest = missForest(titanic_numeric)$ximp$Age
)
missForest_imputed

There’s no option for different imputation techniques with Miss Forest, as it always uses the random forest algorithm:

Image 8 – Results of the missForest imputation

Finally, let’s visualize the distributions:

Image 9 – Distributions after the missForest imputation

It looks like Miss Forest gravitated towards a constant value imputation since a large portion of values is around 35. The distribution is quite different from the original one, which means Miss Forest isn’t the best imputation technique we’ve seen today.
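Besides comparing histograms, you can look at the error estimate that missForest() returns alongside the imputed data. The sketch below is a minimal example on the built-in airquality data, used as a stand-in so it runs without the Titanic subset. The seed value is arbitrary and only there because random forests are not deterministic.

```r
library(missForest)

# Random forests involve randomness, so fix a seed for repeatable results.
set.seed(42)

# airquality has missing values in Ozone and Solar.R.
result <- missForest(airquality)

# Out-of-bag estimate of the imputation error
# (normalized RMSE for continuous variables; lower is better).
result$OOBerror

# The completed data frame lives in the ximp element.
head(result$ximp)
```

A high out-of-bag error suggests the other columns carry little information about the missing variable, which would be consistent with the imputed values bunching around a single number.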


Summary of Imputation in R

And that does it for three ways to impute missing values in R. You now have several new techniques in your toolbelt, and these should simplify your data preparation and cleaning work. The right imputation approach is almost always tied to domain knowledge of the problem you’re trying to solve, so make sure to ask the right business questions when needed.

For a homework assignment, we would love to see you build a classification machine learning model on the Titanic dataset, and use one of the discussed imputation techniques in the process. Which one yields the most accurate model? Which one makes the most sense? Feel free to share your insights in the comment section below and to reach us on Twitter – @appsilon. We’d love to hear from you.

Looking for more guidance on Data Cleaning in R? Start with these two packages.