Fast data lookups in R: dplyr vs data.table


Updated: May 23, 2022

R is a vector-oriented language and is optimized for most of its standard uses. However, something less typical, like finding a specific element in a dataset, might prove more challenging. R does contain the necessary functionality, but it might not be sufficient when working with a large dataset comprising several million rows: data lookups in R may be extremely slow. This can be particularly problematic in the case of R Shiny apps that must respond to user input in fractions of a second. Slow responsiveness will leave your users frustrated!

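In this post, we are going to compare four different methods that can be used to improve lookup times in R: dplyr::filter and regular data.table filtering, built-in operators, hash tables, and data.table ordered indexes.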

Spoiler alert – We were able to improve lookup speed 25 times by using data.table indexes.


Fast Data Lookups in R – Getting Started

First, let’s generate a sample dataset with 10 million rows:
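A minimal sketch of such a dataset, assuming one column of random 8-letter string identifiers and one numeric column. The helper name `make_lookup_data` and the column names `id` and `val` are illustrative assumptions, not the original listing. The example call uses a smaller `n` so that it runs quickly; pass `1e7` to match the size discussed in this post.

```r
# Hypothetical helper: builds a data frame with n rows of random string ids.
make_lookup_data <- function(n, id_length = 8, seed = 123) {
  set.seed(seed)
  # One character vector per letter position, pasted together element-wise.
  letters_by_position <- replicate(
    id_length,
    sample(LETTERS, n, replace = TRUE),
    simplify = FALSE
  )
  data.frame(
    id = do.call(paste0, letters_by_position),
    val = runif(n),
    stringsAsFactors = FALSE
  )
}

# The article works with 10^7 rows; 10^5 keeps this sketch quick.
data <- make_lookup_data(1e5)
str(data)
```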

Let’s now create a simple benchmarking routine to compare the efficiency of different methods:

These routines benchmark a piece of code by running it multiple times. In each run, we can include preparation steps that do not count towards the measured execution time – we will only measure the execution time of code wrapped with a call to time(...).
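One possible shape for such routines is sketched below. This is an assumption about how they could be written, not the original code: `time()` returns the elapsed seconds of a single expression, and `benchmark()` re-evaluates a whole block `n` times, expecting the block to end with a `time(...)` call.

```r
# Returns the elapsed time (in seconds) of evaluating `expr`.
# Note: this masks stats::time for the rest of the session.
time <- function(expr) {
  system.time(expr)[["elapsed"]]
}

# Evaluates `expr` n times in the caller's environment. Each evaluation is
# expected to return a single timing, typically the result of time(...).
benchmark <- function(expr, n = 100) {
  expr <- substitute(expr)
  env <- parent.frame()
  times <- vapply(seq_len(n), function(i) eval(expr, env), numeric(1))
  c(min = min(times), max = max(times), mean = mean(times))
}

x <- runif(1e6)
result <- benchmark({
  target <- sample(x, 1)       # preparation step, not timed
  time(which(x == target))     # only this call is timed
}, n = 10)
print(result)
```

Keeping the preparation outside `time(...)` means that picking a random target does not inflate the measured lookup time.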

Data Lookups with dplyr::filter and data.table filtering

At Appsilon, we often use Hadley’s dplyr package for data manipulations. To find a specific element in a data frame (or, more precisely, a tibble) you can use dplyr’s filter function.
Let’s see how long it takes to find an element with a specific key using dplyr.
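As a rough sketch, a single lookup could be timed as follows. The column names `id` and `val` and the reduced row count are assumptions made for illustration; timings will differ from the figures quoted in this post.

```r
library(dplyr)

set.seed(1)
n <- 1e5  # the article uses 10^7 rows
data <- tibble(id = sprintf("id%07d", sample.int(n)), val = runif(n))

# Pick an id that is known to exist, then time a single lookup.
target <- data$id[sample.int(n, 1)]
elapsed <- system.time(
  hit <- dplyr::filter(data, id == target)
)[["elapsed"]]

print(hit)
print(elapsed)
```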

It turns out that finding a single element on my machine using filter takes 112 ms on average. This might not seem like a lot, but in fact, it’s painfully slow, especially if you want to do several such operations at a time.

Why is filter so slow? dplyr does not know anything about our dataset. The only way for it to find matching elements is to look row by row and check if the criteria are matched. When the dataset gets large, this simply must take time.

Filtering using data.table is not fast either:
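A sketch of regular data.table filtering, under the same assumed column names (`id`, `val`). One caveat worth labelling: data.table may build an index automatically the first time a column is subset with `==`, so this sketch switches that off through the `datatable.auto.index` option in order to time a plain scan. Whether the original measurement did the same is not known.

```r
library(data.table)

# Assumption: disable automatic index creation so that `==` is a plain scan.
old_opts <- options(datatable.auto.index = FALSE)

set.seed(1)
n <- 1e5  # the article uses 10^7 rows
dt <- data.table(id = sprintf("id%07d", sample.int(n)), val = runif(n))

target <- dt$id[sample.int(n, 1)]
elapsed <- system.time(
  hit <- dt[id == target]
)[["elapsed"]]

options(old_opts)  # restore the previous setting
print(hit)
```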

In the worst case, it took almost 6 seconds to find an element!

Built-in Operators

Perhaps it’s just dplyr and data.table? Maybe filtering the old-school way with built-in operators is faster?

It does not seem like it. How about using which for selecting rows?

Not much of an improvement. Data frame filtering and which also use a linear search, so, as one could expect, they cannot be noticeably faster than dplyr::filter.
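For reference, the two base R variants could look like this (column names and row count are again illustrative assumptions). Both compare every element of the column with the target, which is why neither can beat a linear scan.

```r
set.seed(1)
n <- 1e5  # the article uses 10^7 rows
data <- data.frame(
  id = sprintf("id%07d", sample.int(n)),
  val = runif(n),
  stringsAsFactors = FALSE
)
target <- data$id[sample.int(n, 1)]

# Logical subsetting: builds a logical vector of length n, then selects rows.
hit_logical <- data[data$id == target, ]

# which(): the same full comparison, converted to row positions first.
hit_which <- data[which(data$id == target), ]

print(hit_which)
```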

Hash tables to the Rescue?

The problem of efficient lookups is not specific to R. One of the alternative approaches is to use a hash table. Without delving into the details, a hash table is a very efficient data structure with constant average lookup time. It is implemented in most modern programming languages and it is widely used in many areas. If you’d like to read more about how hash tables work, there are lots of good articles about that.

R-bloggers has a great series of articles about hash tables in R: Part 1, Part 2, Part 3.

The main conclusion of those articles is that if you need a hash table in R, you can use one of its built-in data structures – environments. These are used to keep the bindings of variables to values. Internally, they are implemented as a hash table. There is even a CRAN package hash, which wraps environments for general-purpose usage.
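A minimal sketch of an environment used as a hash table, with a deliberately small number of entries. The variable names are illustrative; only base R functions (`new.env`, `assign`, `get`, `exists`) are used.

```r
set.seed(1)
n <- 1e4
ids <- sprintf("id%07d", sample.int(n))  # unique string keys
vals <- runif(n)

# Create a hashed environment and insert one binding per key.
lookup <- new.env(hash = TRUE, size = as.integer(n))
for (i in seq_len(n)) {
  assign(ids[i], vals[i], envir = lookup)
}

# Lookup by key; inherits = FALSE keeps the search inside this environment.
target <- ids[42]
found <- get(target, envir = lookup, inherits = FALSE)

# Membership test for a key that was never inserted.
has_missing <- exists("no-such-id", envir = lookup, inherits = FALSE)
```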

However, these articles analyze the performance of a hash table with just 30 thousand elements, which is tiny when compared to our dataset. Before we can use a hash table, we need to create it. Building an environment with 10^7 elements is extremely slow: it takes more than an hour. For 10^6 elements it takes about 30 seconds.

This makes using environments as hash tables virtually impossible if your dataset contains upwards of a million elements. For instance, building a hash table at the startup of a Shiny app would force the user to wait for ages before the app actually starts. Such a startup time is not acceptable in commercial solutions. Apparently, environments were never intended for storing that many elements. In fact, their basic role is to store variables that are created by executing R code. There are not many scenarios that would require creating this many variables in the code.

One workaround could be to build the environment once and load it from a file later, but (quite surprisingly) this turns out to be even slower.
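The workaround can be sketched with `saveRDS` and `readRDS`, which is one way to persist an environment; this is an assumption, as the exact mechanism used for the measurement is not stated here. The single entry below only demonstrates the round trip, not the timing on a large table.

```r
# Build a (tiny) environment-backed table.
lookup <- new.env(hash = TRUE)
assign("id0000001", 3.14, envir = lookup)

# Persist it once...
path <- tempfile(fileext = ".rds")
saveRDS(lookup, path)

# ...and restore it later, for example at app startup.
restored <- readRDS(path)
value <- get("id0000001", envir = restored, inherits = FALSE)

unlink(path)  # clean up the temporary file
```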

We did not manage to find a good hash table package for R that was not using environments underneath – please let me know if you come across such a tool! We can expect that an efficient hash table R package will be created in the future, but until then we need to make use of available solutions.

Data.table Ordered Indexes

It’s unfortunate that we can’t use hash tables for large datasets. However, there is yet another option when the column you are searching by contains numbers or strings. In that case, you may use ordered indexes from the data.table package.

The index keeps a sorted version of our column, which leads to much faster lookups. Finding an element in a sorted set with binary search takes logarithmic time, O(log n). An index can be implemented in different ways, but the time needed to build it is usually O(n log n), which is often fast enough in practice.
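To make the idea concrete, here is a hand-written binary search over a sorted copy of a column in base R. It only illustrates the principle and is not how data.table implements its indexes internally; the function name `binary_search` is hypothetical.

```r
# Returns the position of `target` in the sorted vector, or NA if absent.
binary_search <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- lo + (hi - lo) %/% 2L
    if (sorted[mid] == target) return(mid)
    if (sorted[mid] < target) lo <- mid + 1L else hi <- mid - 1L
  }
  NA_integer_
}

set.seed(1)
ids <- sprintf("id%07d", sample.int(1e5))

# The "index": original row positions arranged in sorted-key order.
ord <- order(ids)
sorted_ids <- ids[ord]

target <- ids[12345]
pos <- binary_search(sorted_ids, target)
row <- ord[pos]  # map the sorted position back to the original row

# Worst-case number of halving steps for 10^7 rows.
max_steps <- ceiling(log2(1e7))  # 24
```

Building `ord` is the O(n log n) part; each lookup afterwards needs at most a couple of dozen comparisons instead of up to ten million.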

Let’s measure the time needed to build the index for our dataset:
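A sketch of the build step using `setkey`, which sorts the table by the chosen column and marks it as keyed. This is one way to obtain an ordered index in data.table; `setindex` is an alternative that leaves the row order unchanged. Column names and the reduced size are assumptions.

```r
library(data.table)

set.seed(1)
n <- 1e5  # the article uses 10^7 rows
dt <- data.table(id = sprintf("id%07d", sample.int(n)), val = runif(n))

# Time the one-off cost of sorting the table by id.
build_time <- system.time(setkey(dt, id))[["elapsed"]]

print(build_time)
print(key(dt))
```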

6 seconds – this is not negligible, but this is something we can accept. Let’s check the lookup times:
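A keyed lookup could then be written as below. Again, this is a sketch under assumed column names rather than the original listing. Passing the value inside `.()` asks data.table to match it against the key with binary search instead of scanning the column.

```r
library(data.table)

set.seed(1)
n <- 1e5  # the article uses 10^7 rows
dt <- data.table(id = sprintf("id%07d", sample.int(n)), val = runif(n))
setkey(dt, id)

target <- dt$id[sample.int(n, 1)]

# Keyed lookup; nomatch = 0L drops the row when the id is not present.
lookup_time <- system.time(
  hit <- dt[.(target), nomatch = 0L]
)[["elapsed"]]

# A key that does not exist returns an empty table rather than an NA row.
miss <- dt[.("no-such-id"), nomatch = 0L]

print(hit)
```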

This is exactly what we wanted – a mean lookup time of 5 milliseconds! Finally, we’ve managed to get efficient lookups in our dataset.

Summary of Fast Data Lookups in R

dplyr with its filter function will be slow if you search for a single element in a dataset. The same is true for classic data frame filtering with built-in R operators and for regular filtering using data.table. Using an environment as a hash table gives you fast lookups, but building it for a large dataset takes a very long time.

If you need efficient lookups in a large dataset in R, check out data.table’s ordered indexes – this appears to be the best option out there!

If you run into any issues, as an RStudio Full Certified Partner, our team at Appsilon is ready to answer your questions about working with large datasets and other topics related to R Shiny, Data Analytics, and Machine Learning. Let us know and we would be happy to chat!