R is a vector-oriented language and is optimised for most of its standard uses. However, doing something less typical, like finding a specific element in a dataset might prove more challenging. R does contain the necessary functionalities, but these might not be sufficient when working with a large dataset comprising several million rows – the lookups may be extremely slow. This can be particularly problematic in case of Shiny apps that must respond to user input in fractions of a second. Slow responsiveness can leave your users frustrated!

In this post we are going to compare four different methods that can be used to improve lookup times in R:

- dplyr::filter
- Built-in operators
- Hash tables
- Ordered indexes in data.table

In our case, we were able to improve lookup speed 25 times by using data.table indexes.

First, let’s generate a sample dataset with 10 million rows:

Let’s now create a simple benchmarking routine to compare the efficiency of different methods:

These routines benchmark a piece of code by running it multiple times. In each run, we can include preparation steps that do not count towards the measured execution time – we will only measure execution time of code wrapped with a call to `time(...)`

.

At Appsilon, we often use Hadley’s dplyr package for data manipulations. To find a specific element in a data frame (or, more precisely, a tibble){:target=”_blank”} one can use dplyr’s `filter`

function.

Let’s see how long it takes to find an element with a specific key using dplyr.

It turns out that finding a single element on my machine using filtertakes 112 ms on average. This might not seem like a lot, but in fact it is very slow – especially if you want to do several such operations at a time.

Why is `filter`

so slow? *dplyr* does not know anything about our dataset. The only way for it to find matching elements is to look row by row and check if the criteria is matched. When the dataset gets large, this simply must take time.

Filtering using `data.table`

is not fast either:

In the worst case it took almost 6 seconds to find an element!

Perhaps it’s just *dplyr* and *data.table*? Perhaps filtering the old-school way with builtin operators and filtering is faster?

It does not seem like it. How about using `which`

for selecting rows?

Not much improvement. Dataframe filtering and `which`

also use linear search, so as one could expect they cannot be noticeably faster than `dplyr::filter`

.

The problem of efficient lookups is not specific to R. One of the altervnative approaches is to use a **hash table**. Without delving into the details, a hash table is a very efficient data structure with a constant lookup time. It is implemented in most modern programming languages and it is widely utilised in many areas. If you’d like to read more about how hash tables work, there are lots of good articles about that.

R-bloggers has a great series of articles about hash tables in R: part 1, part 2, part 3.

The main conclusion of those articles is that if you need a hash table in R, you can use one of its built in data structures – ** environments**.

`Environments`

are used to keep the bindings of variables to values. Internally, they are implemented as a hash table. There is even a CRAN package hash, which wraps environments for general-purpose usage.This makes using environments as hash tables virtually impossible if your dataset contains upwards of a million elements. For instance, building a hash table at startup of a Shiny app would force the user to wait ages before the app actually starts. Such startup time is not acceptable in commercial solutions. Apparently, environments were never intended for storing that many elements. In fact, their basic role is to store variables that are created by executing the R code. There are not many scenarios that would require creating this many variables in the code.

One workaround could be to build the environment once and load it from file later, but (quite surprisingly) this turns out to be even slower.

We did not manage to find a good hash tables package for R that was not using environments underneath – please let me know if you came across such a tool! We can expect that an efficient hash table R package will be created in the future, but until then we need to make use of available solutions.

It’s unfortunate that we can’t use hash tables for large datasets. However, there is yet another option when when the column you are searching by is a number or a string. In that case, you may use **ordered indexes** from `data.table`

package.

Index keeps a sorted version of our column, which leads to much faster lookups. Finding an element in a sorted set with binary search has a logarithmic execution time – complexity of O(log n). In practice index can be implemented in different ways, but usually the time needed to build up an index is O(n*log n), which in practice is often fast enough.

Let’s measure the time needed to build the index for out dataset:

6 seconds – this is not negligible, but this is something we can accept. Let’s check the lookup times:

This is exactly what we wanted – a mean lookup time of 5 milliseconds! Finally, we’ve managed to get efficient lookups in our dataset.

Dplyr with its `filter`

method will be slow if you are searching for a single element in a dataset.

The same is true for classic data frame filtering with builtin R operators and for regular filtering using `data.table`

.

Using `environment`

as a hash table gives you fast lookups, but building it for a large dataset takes very long.

If you need efficient lookups in a large dataset in R, check out data.table’s ordered indexes – this appears to be the best option out there!

If you run into any issues, as an RStudio Full Certified Partner, our team at Appsilon is ready to answer your questions about working with large datasets and other topics related to R Shiny, Data Analytics, and Machine Learning. Let us know and we would be happy to chat!

- Follow @Appsilon on Twitter
- Follow Appsilon on LinkedIn
- Learn more about our R Shiny open source packages