Data Cleaning in R: 2 R Packages to Clean and Validate Datasets
Real-world datasets are messy. Unless a dataset was created for teaching purposes, you'll likely spend hours – or even tens of hours – cleaning it before you can show it on a dashboard. That's where two packages for data cleaning in R come into play – janitor and data.validator. Today you'll learn how to use them together.
If you're a software engineer, think of data cleaning and validation as writing and testing code. Data cleaning is like coding an app – it takes a huge amount of time to get it working correctly. On the other hand, you can't be sure it works as expected until you've tested it properly (validation). They're not two separate concepts; rather, one is an extension of the other.
Regardless of whether you're a software engineer or a data scientist, combining the two is the way to go.
Table of contents:
- Data Cleaning in R with the Janitor Package
- Data Validation in R with the data.validator Package
Data Cleaning in R with the Janitor Package
So, what is janitor? Put simply, it's an R package with simple functions for examining and cleaning dirty data. It can format data frame column names, isolate duplicate and partially duplicate records, isolate empty and constant data, and much more! We'll use janitor extensively throughout this section to clean custom datasets and to isolate duplicates in the well-known Iris dataset.
Cleaning column names
Imagine you had a dataset with terribly-formatted column names. Would you clean them by hand? Well, that’s an option if you only have a couple of them. Real-world datasets oftentimes have hundreds of columns, so the by-hand approach is a no-go.
The janitor package has a nifty clean_names() function that reformats data frame column names.
The snippet below creates a data frame with inconsistent column names – some are blank, duplicated, or contain unwanted characters – and janitor cleans them instantly:
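A minimal sketch of the idea – the column names below are made up for illustration:

```r
library(janitor)

# A hypothetical data frame with messy column names:
# inconsistent casing, a near-duplicate, a special character, and a blank
messy_df <- data.frame(
  a = c("John", "Jane"),
  b = c(1, 2),
  c = c(50, 75),
  d = c("a", "b")
)
names(messy_df) <- c("firstName", "first name", "% success", "")

clean_df <- clean_names(messy_df)
names(clean_df)
# "first_name" "first_name_2" "percent_success" "x"
```

Note how clean_names() converts everything to snake_case, spells out the percent sign, suffixes duplicates, and fills the blank name with "x".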
Cleaning column names – Approach #2
There's another way you could approach cleaning data frame column names – and it's by piping an existing dataset straight into clean_names().
The snippet below shows a tibble of the Iris dataset:
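A minimal sketch – converting iris to a tibble makes its dot-separated column names easy to inspect:

```r
library(tibble)

# iris ships with R; its columns use dots as word separators
as_tibble(iris)
# A tibble: 150 x 5
# Columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
```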
Separating words with a dot can lead to messy or unreadable R code – underscores are preferred. janitor can make the change automatically for you:
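A sketch of that conversion, using the base R pipe:

```r
library(janitor)
library(tibble)

# Convert iris to a tibble, then rewrite the dotted names as snake_case
iris_clean <- iris |>
  as_tibble() |>
  clean_names()

names(iris_clean)
# "sepal_length" "sepal_width" "petal_length" "petal_width" "species"
```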
The column names are now much more consistent with what you’d find in other datasets.
Removing empty data
It’s not rare to get a dataset with missing values. Filling them isn’t always straightforward. Approaches like average value imputation are often naive, and you should have good domain knowledge before using them.
Sometimes it's best to remove missing values altogether. The remove_empty() function does just that – for rows, columns, or both. Take a look at an example:
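A sketch with a small hypothetical data frame containing one fully empty row and one fully empty column:

```r
library(janitor)

df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),   # entirely NA column
  c = c("x", NA, "z")
)
# Row 2 is entirely NA as well

remove_empty(df, which = c("rows", "cols"))
# Row 2 and column b are dropped; partially filled rows/columns are kept
```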
Easy, right? Feel free to experiment with different options for the which parameter to get the hang of it.
Removing constant data
A column with only one unique value is just useless. It provides no value for analysis, visualization, or even training machine learning models, so it's best to remove such columns entirely. Use the remove_constant() function for the task:
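A sketch with a made-up data frame whose source column holds a single repeated value:

```r
library(janitor)

df <- data.frame(
  id     = 1:4,
  value  = c(10, 20, 30, 40),
  source = rep("sensor_a", 4)  # constant - only one unique value
)

remove_constant(df)
# The constant 'source' column is dropped; 'id' and 'value' remain
```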
Keep in mind: Only remove constant data if you’re 100% certain other values are not possible. For example, maybe you’re looking at a small sample of a larger dataset that originally has multiple values in the given column. Be extra careful.
Isolating duplicate data
Missing data is no fun, but duplicates can be even worse! Two rows with identical values convey the same information, and if they're in the dataset by accident, they might skew your analysis and models if left untouched.
janitor comes with a get_dupes() function you can use to check for them. The code snippet below considers a row a duplicate only if the values in all columns are identical:
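A sketch of the full-row check on iris, which happens to contain one fully duplicated observation:

```r
library(janitor)

# With no columns specified, a row counts as a duplicate only when
# every column matches another row exactly
get_dupes(iris)
# Returns the duplicated rows plus a dupe_count column
```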
You can also specify the columns that will be used when checking for duplicates:
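For example, restricting the check to two columns (chosen here for illustration) flags far more rows:

```r
library(janitor)

# A row is now a duplicate if it matches another row
# on Species and Petal.Width alone
get_dupes(iris, Species, Petal.Width)
```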
As you can see, we get a much larger set of duplicates the second time, simply because fewer columns were used for the check.
Janitor package – summary
The janitor package is extremely powerful when it comes to data cleaning in R. We've explored the basic functionality, which is enough to clean most datasets. So, what's the next step?
As mentioned earlier, the next step is data validation – making sure your cleaned dataset passes every check you define.
Data Validation in R with the data.validator Package
Appsilon’s data.validator is a go-to package for scalable and reproducible data validation. You can use it to validate a dataset in the IDE, and you can even export an interactive report. You’ll learn how to do both.
For simplicity’s sake, we’ll use the Iris dataset for validation. You’re free to use any dataset and any validation condition.
You'll have to start by creating a report object and then using validation functions, such as validate_cols(), to verify conditions:
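The exact checks from the original snippet aren't shown here, so the sketch below assumes a range check of [2, 4] on Sepal.Width – a range that happens to fail for exactly three rows of iris (values 4.1, 4.2, and 4.4):

```r
library(assertr)          # supplies predicates such as in_set() and within_bounds()
library(data.validator)

report <- data_validation_report()

validate(iris, name = "Iris validation") %>%
  validate_cols(in_set(c("setosa", "versicolor", "virginica")), Species,
                description = "Species takes one of the three expected values") %>%
  validate_cols(within_bounds(2, 4), Sepal.Width,
                description = "Sepal.Width is between 2 and 4") %>%
  add_results(report)

print(report)
```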
It looks like one validation failed with three violations. Unfortunately, you can't see more details in the console – but you can create an HTML report instead:
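A sketch, again assuming the hypothetical [2, 4] range check on Sepal.Width; save_report() writes validation_report.html to the working directory by default:

```r
library(assertr)
library(data.validator)

report <- data_validation_report()

validate(iris, name = "Iris validation") %>%
  validate_cols(within_bounds(2, 4), Sepal.Width,
                description = "Sepal.Width is between 2 and 4") %>%
  add_results(report)

# Render the results as an interactive HTML report
save_report(report)
```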
Unlike the console output, the report lets you click the Show button to get detailed insight into why a validation check failed:
As you can see, the Sepal.Width column was outside the given range in three instances, so the validation check failed.
Want to learn more about data.validator? Read our complete guide on the Appsilon blog.
Summary of Data Cleaning in R
Long story short – it's crucial to clean and validate your dataset before moving on to analysis, visualization, or predictive modeling. Today you've seen how to approach the task with two highly capable R packages. They work best when used together – janitor for data cleaning and data.validator for validation.
For a homework assignment, we recommend downloading any messy dataset of your choice and using the two discussed packages for cleaning and validation. Share your results with us on Twitter – @appsilon. We'd love to see what you come up with.
Are you completely new to R? These are 6 R packages you must learn as a beginner.