---


 2 March, 2017

R is a vector-oriented language and is optimised for most of its standard uses. However, doing something less typical, like finding a specific element in a dataset, might prove more challenging. R does contain the necessary functionality, but it might not be sufficient when working with a large dataset comprising several million rows – lookups may be extremely slow. This can be particularly problematic in the case of Shiny apps, which must respond to user input in fractions of a second. Slow responsiveness can leave your users frustrated!

In this post we are going to compare four different methods that can be used to improve lookup times in R:

  • dplyr::filter
  • Built-in operators
  • Hash tables
  • Ordered indexes in data.table

In our case, we were able to speed up lookups 25-fold by using data.table indexes.

First, let’s generate a sample dataset with 10 million rows:

library(tibble)

random_string_column <- function(n) {
  stringi::stri_rand_strings(n = n, length = 8)
}

random_data_frame <- function(n) tibble(
  col1 = random_string_column(n),
  col2 = random_string_column(n)
)

data <- random_data_frame(10^7)

Let’s now create a simple benchmarking routine to compare the efficiency of different methods:

# Evaluate an expression and return the user CPU time it took
time <- function(...) {
  time_measurement <- system.time(eval(...))
  time_measurement[["user.self"]]
}

# Run an expression n times and summarise the measured times
benchmark <- function(..., n = 100) {
  times <- replicate(n, ...)
  c(min = min(times), max = max(times), mean = mean(times))
}

These routines benchmark a piece of code by running it multiple times. In each run we can include preparation steps that do not count towards the measured execution time – only the code wrapped in a call to time(...) is measured.
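The benchmarks below also use a small select_random_key() helper. One possible minimal implementation – a sketch that simply draws a random existing key from the generated dataset – could look like this:

# A simple way to pick a random key that is guaranteed to exist in the
# dataset, so that every benchmarked lookup actually finds a match.
select_random_key <- function() {
  sample(data$col1, 1)
}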

dplyr::filter and data.table filtering

At Appsilon, we often use Hadley’s dplyr package for data manipulations. To find a specific element in a data frame (or, more precisely, a tibble) one can use dplyr’s filter function.
Let’s see how long it takes to find an element with a specific key using dplyr.

library(dplyr)

benchmark({
  key_to_lookup <- select_random_key()
  time(data %>% filter(col1 == key_to_lookup))
})
##    min    max   mean 
## 0.0960 0.2440 0.1124 

It turns out that finding a single element on my machine using filter takes 112 ms on average. This might not seem like a lot, but in fact it is very slow – especially if you want to do several such operations at a time.

Why is filter so slow? dplyr does not know anything about our dataset. The only way for it to find matching elements is to go row by row and check whether the criterion is matched. When the dataset gets large, this simply must take time.

Filtering using data.table is not fast either:

library(data.table)

data_table <- data.table(data)

benchmark({
  key_to_lookup <- select_random_key()
  time(data_table[col1 == key_to_lookup])
})
##     min     max    mean 
## 0.00100 5.99400 0.06194 

In the worst case it took almost 6 seconds to find an element!

Built-in operators

Perhaps it’s just dplyr and data.table? Perhaps filtering the old-school way, with built-in operators, is faster?

benchmark({
  key_to_lookup <- select_random_key()
  time(data[data$col1 == key_to_lookup,])
})
##     min     max    mean 
## 0.09800 0.22300 0.11424

It does not seem like it. How about using which for selecting rows?

benchmark({
  key_to_lookup <- select_random_key()
  time(data[which(data$col1 == key_to_lookup),])
})
##     min     max    mean 
## 0.09200 0.15800 0.10471 

Not much improvement. Data frame filtering and which also use a linear search, so, as one could expect, they cannot be noticeably faster than dplyr::filter.

Hash tables to the rescue?

The problem of efficient lookups is not specific to R. One of the alternative approaches is to use a hash table. Without delving into the details, a hash table is a very efficient data structure with constant lookup time. It is implemented in most modern programming languages and widely used in many areas. If you’d like to read more about how hash tables work, there are lots of good articles about that.

R-bloggers has a great series of articles about hash tables in R: part 1, part 2, part 3.

The main conclusion of those articles is that if you need a hash table in R, you can use one of its built-in data structures – environments. Environments are used to keep the bindings of variables to values. Internally, they are implemented as a hash table. There is even a CRAN package hash, which wraps environments for general-purpose usage.
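
For illustration, here is a minimal sketch of the hash package’s basic key-value interface (a toy example, separate from our benchmark):

library(hash)

# Toy example: build a small hash from a vector of keys and a vector of
# values, then retrieve a single value by key.
h <- hash(c("alpha", "beta"), c("first", "second"))
h[["alpha"]]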
However, these articles analyse the performance of a hash table with just 30 thousand elements… which is relatively small. Before we can use a hash table, we need to create it. Building an environment with 10^7 elements is extremely slow: it takes more than an hour. For 10^6 elements it takes about 30 seconds.

# Try it yourself:

environment_source <- setNames(as.list(data$col2), data$col1)
time(as.environment(environment_source))

This makes using environments as hash tables virtually impossible if your dataset contains upwards of a million elements. For instance, building a hash table at startup of a Shiny app would force the user to wait ages before the app actually starts. Such startup time is not acceptable in commercial solutions. Apparently, environments were never intended for storing that many elements. In fact, their basic role is to store variables that are created by executing the R code. There are not many scenarios that would require creating this many variables in the code.
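
That said, if the dataset is small enough for the construction step to be feasible, the lookup itself is straightforward and fast – a quick sketch, assuming an environment built as above:

# Lookup in an environment used as a hash table (construction as above,
# which is only practical for smaller datasets).
hash_env <- as.environment(environment_source)
get(data$col1[[1]], envir = hash_env, inherits = FALSE)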

One workaround could be to build the environment once and load it from file later, but (quite surprisingly) this turns out to be even slower.

We did not manage to find a good hash table package for R that does not use environments underneath – please let us know if you have come across such a tool! We can expect that an efficient hash table package for R will be created in the future, but until then we need to make do with the available solutions.

data.table indexes

It’s unfortunate that we can’t use hash tables for large datasets. However, there is yet another option when the column you are searching by is a number or a string. In that case, you can use ordered indexes from the data.table package.

An index keeps a sorted version of our column, which leads to much faster lookups. Finding an element in a sorted set with binary search takes logarithmic time – a complexity of O(log n). Indexes can be implemented in different ways, but building one usually takes O(n log n) time, which in practice is often fast enough.

Let’s measure the time needed to build the index for our dataset:

time(setkey(data_table, col1))
## 5.645

Almost 6 seconds – this is not negligible, but it is something we can accept. Let’s check the lookup times:

benchmark({
  key_to_lookup <- select_random_key()
  time(data_table[.(key_to_lookup), nomatch = 0L])
})
##     min     max    mean 
## 0.00200 0.01200 0.00455 

This is exactly what we wanted – a mean lookup time of about 5 milliseconds! Finally, we’ve managed to get efficient lookups in our dataset.
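
As a side note, instead of setkey() you can also create a secondary index with setindex(), which keeps the table in its original row order; lookups then use the on argument. A sketch of the equivalent query (not benchmarked here):

# Secondary index: the table is not physically reordered
setindex(data_table, col1)
key_to_lookup <- select_random_key()
data_table[.(key_to_lookup), on = "col1", nomatch = 0L]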

Summary

  • dplyr with its filter function will be slow if you are searching for a single element in a dataset.
  • The same is true for classic data frame filtering with built-in R operators and for regular filtering with data.table.
  • Using an environment as a hash table gives you fast lookups, but building it for a large dataset takes far too long.

If you need efficient lookups in a large dataset in R, check out data.table’s ordered indexes – this appears to be the best option out there!

If you run into any issues, our team at Appsilon, an RStudio Full Certified Partner, is ready to answer your questions about working with large datasets and other topics related to R Shiny, Data Analytics, and Machine Learning. Let us know and we would be happy to chat!
