Even the most sophisticated machine learning methods, the most beautiful visualizations, and perfect datasets can be useless if you ignore the context in which your code executes. In this article, I will describe a few situations that can destroy your reproducibility. Then I will explain how to resolve these problems and avoid them in the future.
3 danger zones
There are three main areas where things can go wrong if you don’t pay attention:
- R session context
- Operating System (OS) context
- Data versioning
In this blog post, I will focus on the first two areas. They are quite similar because both concern the software context in which research is executed. While more and more data scientists pay attention to what happens in their R session, not many think about the OS context, which can also significantly influence their research.
Data versioning is a huge topic itself and I will cover it in a separate article in the future. For now, let’s examine some examples of where things can go wrong when you have different R session contexts or various Operating System configurations.
R session context
The basic elements of an R session context that data scientists should be aware of are:
- R version
- Package versions
- Using set.seed() for reproducible randomization
- Floating-point accuracy
R version
Not many people monitor the R language changelog on a regular basis, but sometimes crucial changes are made that can break your code. One such situation is described here: a change in R 3.2.1, fixing the issue “nested arima model has higher log-likelihood”, led to inconsistent results for arima calculations with d >= 1.
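To illustrate the kind of comparison that was affected, here is a short sketch; the dataset and model orders are illustrative, and the exact numbers depend on your R version:
# Fit two ARIMA models with d = 1; the first is nested in the second
# (it is the special case with the AR coefficient fixed at zero).
fit.small <- arima(USAccDeaths, order = c(0, 1, 1))
fit.large <- arima(USAccDeaths, order = c(1, 1, 1))
# A model should never have a lower log-likelihood than a model nested
# inside it; before R 3.2.1 this invariant could be violated for d >= 1.
logLik(fit.small)
logLik(fit.large)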
Package versions
Now, let’s see an example with a changing package version. I am going to use dplyr in versions 0.4.0 and 0.5.0:
dplyr 0.4.0
library(dplyr)
df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)
result <- df %>% dplyr::distinct(x) %>% sum
# result == 8 (distinct() kept all columns, so both x and y values are summed)
dplyr 0.5.0
library(dplyr)
df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)
result <- df %>% dplyr::distinct(x) %>% sum
# result == 3 (distinct() now keeps only x, so just 1 + 2 is summed)
Wow! Version 0.5.0 introduced functionality-breaking changes and our result is now completely different.
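A cheap safeguard against this class of problems is to assert, at the top of your script, the package version your code was written against:
# Fail fast if the installed dplyr version is not the one this
# analysis was developed and tested with.
stopifnot(packageVersion("dplyr") == "0.5.0")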
set.seed()
When your code uses randomization and you want reproducible results, it is important to remember to set a seed.
> sample(1:10, 5)
[1] 6 7 2 5 9
> sample(1:10, 5) # Different result
[1] 5 7 8 2 3
…and with setting seed:
> set.seed(42)
> sample(1:10, 5)
[1] 10 9 3 6 4
> set.seed(42)
> sample(1:10, 5) # Result is the same
[1] 10 9 3 6 4
Random numbers returned by the generator are in fact a deterministic sequence that can be reproduced. Setting a seed tells the generator where in the sequence to start.
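Note that the generated sequence also depends on which generator is active, so for strict reproducibility it can be worth recording and pinning the RNG kind as well (a small sketch; the reported kinds depend on your R version):
# Check which random number generators are currently active.
RNGkind()
# e.g. [1] "Mersenne-Twister" "Inversion"
# Pin the generator explicitly when setting the seed.
set.seed(42, kind = "Mersenne-Twister")
sample(1:10, 5)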
Floating-point accuracy
Pay attention to how expressions involving floating-point numbers are implemented. Let’s assume you want to implement an algorithm described in a recently published scientific paper. You get unexpected results in some cases and try to debug them. Eventually you get this:
> x <- 3
> y <- 2.9
> x - y == 0.1
[1] FALSE
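What happened? Neither 2.9 nor 0.1 has an exact binary floating-point representation, and printing more digits makes the tiny discrepancy visible (the exact trailing digits may vary by platform):
> print(x - y, digits = 17)
[1] 0.10000000000000009
> print(0.1, digits = 17)
[1] 0.10000000000000001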
This can be solved by testing for “near equality”:
> x <- 3
> y <- 2.9
> all.equal(x - y, 0.1)
[1] TRUE
More about floating-point arithmetic can be found here.
Operating system context
When your research is shared with others and you can’t predict which operating system someone is going to use, many things can go wrong. Any difference in one of the following elements:
- System package versions
- System locale
- Environment variables
can result in different execution results.
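Before chasing such differences, it helps to know how to inspect the OS-level context from inside R, for example:
# Operating system name and kernel release of the machine running R.
Sys.info()[c("sysname", "release")]
# Current locale settings (collation, character classification, etc.).
Sys.getlocale()
# Locale-related environment variables that can differ between machines.
Sys.getenv(c("LANG", "LC_ALL"))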
Many R packages depend on system packages. You can see this in a package’s DESCRIPTION file under the SystemRequirements field. Let’s look at an example with the png package, running it on two different operating systems.
Package png DESCRIPTION file:
Package: png
Version: 0.1-7
...
SystemRequirements: libpng
...
Let’s assume we wrote a custom function savePlotToPNG that saves a plot to a file. We want to write a unit test for it: the idea is to compare the produced result with a previously generated reference image.
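The exact body of savePlotToPNG does not matter; a minimal sketch, assuming base graphics and a placeholder plot, could look like this:
# Hypothetical helper: render a plot into a PNG file at the given path.
savePlotToPNG <- function(path, width = 480, height = 480) {
  png(filename = path, width = width, height = height)
  plot(cars)  # placeholder: any deterministic plot will do for this example
  dev.off()
}
We generate the reference plots on Ubuntu 16.04: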
reference.png.path <- file.path("referencePlots", "ref0001.png")
savePlotToPNG(reference.png.path)
…and then we run the tests on Debian 8 Jessie, using exactly the same version of the png package:
reference.png.path <- file.path("referencePlots", "ref0001.png")
test.plot.path <- tempfile(fileext = ".png")
savePlotToPNG(test.plot.path)
identical(x = png::readPNG(test.plot.path),
y = png::readPNG(reference.png.path))
# FALSE! The test fails because Ubuntu 16.04 and Debian 8 Jessie have
# different libpng system package versions.
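If bit-for-bit identity turns out to be too strict for your use case, one pragmatic workaround (a sketch, not a full solution) is to compare the decoded pixel values with a numeric tolerance:
# Compare decoded pixel arrays with a tolerance instead of exact identity.
isTRUE(all.equal(png::readPNG(test.plot.path),
                 png::readPNG(reference.png.path),
                 tolerance = 1e-6))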
Locale and system variables
String sorting is a surprising example: sort() on character vectors follows the locale’s collation rules, so the same call can return different orderings on different systems.
Ubuntu:
> x <- c("-110", "-1", "-10", "10", "0", "110", "1")
> sort(x)
[1] "0" "1" "-1" "10" "-10" "110" "-110"
Windows:
> x <- c("-110", "-1", "-10", "10", "0", "110", "1")
> sort(x)
[1] "-1" "-10" "-110" "0" "1" "10" "110"
Tools for R session-level reproducibility
Data scientists should be aware of the traps described above; they are the foundations of reproducible research. Below is a list of tools that help with managing package versions:
Packrat – a popular package that allows you to create and manage a private library attached to your project.
Switchr – a handy abstraction over .libPaths(), lib.loc, and other lower-level library settings. Helpful in creating and managing libraries.
Checkpoint – helpful when you need to find older versions of a specific package; see the sketch below.
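For instance, a minimal checkpoint workflow pins all CRAN packages in your project to a snapshot date (the date below is illustrative):
library(checkpoint)
# Scan the project and install/use CRAN packages as of this snapshot date.
checkpoint("2017-01-15")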
Tools for OS-level reproducibility
The best approach is to do your research in an isolated environment that can be shared as an image. Inside this environment, you can also use any tool for R session-level reproducibility, like Packrat, Switchr, or Checkpoint.
To create an isolated environment you can use virtualization or containerization. There are many tools that can be used for that: Vagrant, Docker, rkt, and many more. I recommend using Docker because it’s easy to use and has good community support.
Example structure of your ~/project:
Dockerfile
install_libraries.R
R/ <-- contains your research R scripts
Let’s look at a sample Dockerfile that describes the steps necessary to recreate an isolated environment:
UPDATE: I simplified the Dockerfile according to Carl Boettiger’s comment.
# Pin the base image to an exact R version.
FROM rocker/rstudio:3.3.2
MAINTAINER Paweł Przytuła "join@wordpress.appsilon.com"
# Copy and run the script that installs pinned package versions.
COPY install_libraries.R /code/install_libraries.R
RUN R -f /code/install_libraries.R
install_libraries.R contains code that reproduces the R environment. It can be either some Packrat code or something like this:
install.packages("devtools")
devtools::install_version("dplyr", version = "0.5.0")
devtools::install_version("stringr", version = "1.1.0")
devtools::install_version("forecast", version = "7.3")
Once the Dockerfile and install_libraries.R are ready, it’s enough to run the following command to build an image:
docker build -t myimagename:1.0.0 .
When the image is ready, you can either save it as a file using docker save -o myimagename.img myimagename:1.0.0 or upload it to Docker Hub – steps described here.
Using a previously saved Docker image you can start a container with the following:
docker run -d -p 8787:8787 -v ~/project/R:/R myimagename:1.0.0
Because the image is based on rocker/rstudio, you can access an RStudio session via http://localhost:8787 (login rstudio, password rstudio). Your R/ directory will be available inside the running container under the /R path. Awesome, isn’t it?!
With a Docker image like this, you have control over the whole software context needed for reproducible research. You can easily extend it with more packages and share it without fear of losing reproducibility. I recommend this approach for any data science project whose results must be reproduced on more than one machine, as well as when working in a bigger data science team.