Even the most sophisticated machine learning methods, the most beautiful visualizations, and perfect datasets can be useless if you ignore the context of your code execution. In this article, I will describe a few situations that can destroy your reproducibility. Then I will explain how to resolve these problems and avoid them in the future.
There are three main areas where things can go wrong if you don’t pay attention: the R session context, the operating system (OS) context, and data versioning.
In this blog post, I will focus on the first two areas. They are quite similar because both are related to the software context of research execution. While more and more data scientists pay attention to what happens in their R session, not many are thinking about OS context, which can also significantly influence their research.
Data versioning is a huge topic in itself, and I will cover it in a separate article in the future. For now, let’s examine some examples of where things can go wrong when you have different R session contexts or different operating system configurations.
The basic elements of an R session context that data scientists should be aware of are:
R version
package versions
set.seed() for reproducible randomization
floating-point arithmetic
Not many people monitor the R language changelog on a regular basis, but sometimes crucial changes are made that can lead to a failure in your code. One such situation is described here. A change in R 3.2.1 for “nested arima model has higher log-likelihood” led to inconsistent results for arima calculations with d >= 1.
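One way to notice such a change early is to fail fast when the script runs under a different R version than the one it was developed with. A minimal sketch, assuming a hypothetical helper named assert_r_version (it is not part of base R or any package):

```r
# Hypothetical helper: stop when the running R version differs from the
# version the analysis was developed and validated with.
assert_r_version <- function(expected, current = getRversion()) {
  expected <- package_version(expected)
  if (current != expected) {
    stop(sprintf("Expected R %s but running R %s",
                 as.character(expected), as.character(current)),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Usage at the top of an analysis script (the version is illustrative):
# assert_r_version("3.2.1")
```

An exact match is deliberately strict. Whether to allow a range of versions instead is a project decision.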
Now, let’s see an example with a changing package version. I am going to run the same code with two versions of dplyr, an older release and 0.5.0:
Whoa! Version 0.5.0 introduced functionality-breaking changes, and our result is now completely different.
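A silent package upgrade like this can be caught by checking the installed version before the analysis runs. A minimal sketch, assuming a hypothetical helper named assert_package_version (not part of any package); it relies only on utils::packageVersion():

```r
# Hypothetical helper: stop when the installed version of a package is not
# the one the analysis was written against.
assert_package_version <- function(pkg, expected) {
  installed <- utils::packageVersion(pkg)
  if (installed != package_version(expected)) {
    stop(sprintf("Package '%s': expected version %s, found %s",
                 pkg, expected, as.character(installed)),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Usage (the pinned version is illustrative):
# assert_package_version("dplyr", "0.5.0")
```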
When your code uses randomization and you want reproducible results, it is important to remember to set a seed.
…and with setting seed:
Random numbers returned by the generator are in fact a sequence of numbers that can be reproduced. Setting a seed informs the generator where to start.
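The effect can be sketched with base R alone. The seed value 42 and the variable names are arbitrary choices for illustration:

```r
# Without a seed, two draws will almost certainly differ.
unseeded_a <- runif(3)
unseeded_b <- runif(3)

# With the same seed set before each draw, the generator restarts from the
# same point in its sequence, so the draws are identical.
set.seed(42)
first <- sample(1:10, 3)
set.seed(42)
second <- sample(1:10, 3)

identical(first, second)  # TRUE
```

Note that a seed reproduces results only for a given generator. R 3.6.0 changed the default algorithm used by sample(), so the same seed can give different samples across R versions.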
Pay attention to how expressions on floating-point numbers are implemented. Let’s assume you want to implement an algorithm described in a recently published scientific paper. You get unexpected results in some cases and try to debug them. And you eventually get this:
What? It can be solved by testing for “near equality”:
More about floating-point arithmetic can be found here.
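A minimal sketch of the problem and the “near equality” fix, using the classic 0.1 + 0.2 case as an illustrative choice:

```r
x <- 0.1 + 0.2

# Exact comparison fails because neither 0.1, 0.2 nor 0.3 has an exact
# binary floating-point representation.
exact_match <- x == 0.3                   # FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a character description when the values differ.
near_match <- isTRUE(all.equal(x, 0.3))   # TRUE
```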
When your research is shared with others and you can’t predict which operating system someone is going to use, many things can go wrong. Any difference in elements such as system package versions or the system locale can result in different execution results.
Many R packages depend on system packages. You can see these dependencies in a package’s DESCRIPTION file under the SystemRequirements field. Let’s look at an example with the png package, running it on two different operating systems.
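The field can also be read from R. A minimal sketch, assuming a hypothetical helper named read_system_requirements; it uses base R’s read.dcf(), which parses the DESCRIPTION format:

```r
# Hypothetical helper: return the SystemRequirements field of a DESCRIPTION
# file, or NA when the field is absent.
read_system_requirements <- function(description_path) {
  fields <- read.dcf(description_path, fields = "SystemRequirements")
  unname(fields[1, "SystemRequirements"])
}

# Usage for an installed package (assumes the png package is installed):
# read_system_requirements(system.file("DESCRIPTION", package = "png"))
```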
png DESCRIPTION file:
Let’s assume we wrote a custom function, savePlotToPNG, that saves a plot to a file. We want to write a unit test for it, and the idea is to compare the produced result with a previously generated reference image. We generate reference plots on Ubuntu 16.04:
…and then we run the tests on Debian 8 Jessie, using exactly the same version of the png package.
…plots don’t match! This is caused by different libpng system package versions.
…okayyy 🙂 String sorting in R is based on the locale, which by default differs between Windows and Linux systems. Read more about setting locales here.
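One way to get the same order on every system is to sort without consulting the locale. A minimal sketch, assuming R 3.3.0 or later, where sort() accepts method = "radix" and orders character vectors as in the C locale; the input vector is an illustrative choice:

```r
words <- c("B", "a", "C")

# Locale-dependent: the result depends on the LC_COLLATE setting of the
# machine running the code.
sorted_default <- sort(words)

# Locale-independent: radix sort uses C-locale ordering, in which uppercase
# ASCII letters come before lowercase ones.
sorted_c <- sort(words, method = "radix")   # "B" "C" "a"

# Recording the collation locale alongside results helps explain differences.
collation <- Sys.getlocale("LC_COLLATE")
```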
Data scientists should be aware of the traps described above. They are the foundations of reproducible research. Below is a list of tools that help manage package versions:
Packrat – a popular package that lets you create and manage a private library attached to your project.
Switchr – a handy abstraction over lib.loc and other lower-level library settings. Helpful in creating and managing libraries.
Checkpoint – helpful when you need to find older versions of a specific package.
The best approach is to do your research in an isolated environment that can be shared as an image. Inside this environment, you can also use any tool for R session level reproducibility, like Packrat, Switchr or Checkpoint.
To create an isolated environment, you can use virtualization or containerization. There are many tools for that: Vagrant, Docker, rkt and many more. I recommend Docker because it’s easy to use and has good community support.
Example structure of your project:
Let’s look at a sample Dockerfile that describes the steps necessary to recreate an isolated environment:
UPDATE: I simplified the Dockerfile following Carl Boettiger’s comment.
install_libraries.R contains code that reproduces an R environment. It can be either some Packrat code or something like this:
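A minimal sketch of such a script, under stated assumptions: the helper name install_pinned, the pinned versions, and the repository URL are illustrative, and the default installer is remotes::install_version(), which requires the remotes package to be installed in the image:

```r
# Illustrative pins: package name -> exact version to install.
pinned_packages <- c(dplyr = "0.5.0")

# Hypothetical helper: install each package at its pinned version.
# The installer is a parameter so the logic can be exercised without
# touching the network; the default is only evaluated when it is used.
install_pinned <- function(pins,
                           installer = remotes::install_version,
                           repos = "https://cloud.r-project.org") {
  for (pkg in names(pins)) {
    installer(pkg, version = pins[[pkg]], repos = repos)
  }
  invisible(names(pins))
}

# In install_libraries.R, the last line would be the actual call:
# install_pinned(pinned_packages)
```

Pinning exact versions keeps the image rebuildable later with the same package set, as long as those versions remain available from the repository.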
After that it’s enough to run the following command to build an image:
Using a previously saved Docker image you can start a container with the following:
Because the image is based on rocker/rstudio, you can access an RStudio session in your browser (the default user name is rstudio). Your R/ directory will be available inside the running container under the /R path. Awesome, isn’t it?!
With a Docker image like this, you have control over the whole software context needed for reproducible research. You can easily extend it with more packages and share it without fear of losing reproducibility. I recommend this approach for any data science project whose results must be reproduced on more than one machine, and also when working in a bigger data science team.