Fast data lookups in R: dplyr vs data.table

Reading time:

time

min

March 2, 2017

<em><strong>Updated</strong>: May 23, 2022</em> R is a vector-oriented language and is optimized for most of its standard uses. However, doing something less typical, like finding a specific element in a dataset might prove more challenging. R does contain the necessary functionalities, but these might not be sufficient when working with a large dataset comprising several million rows – data lookups in R may be extremely slow. This can be particularly problematic in the case of R Shiny apps that must respond to user input in fractions of a second. Slow responsiveness will leave your users frustrated! In this post we are going to compare four different methods that can be used to improve lookup times in R: <ul><li><a href="#filtering">Data Lookups in R with dplyr::filter</a></li><li><a href="#built-in-operators">Built-in Operators</a></li><li><a href="#hash-tables">Hash Tables for Fast Data Lookups in R</a></li><li><a href="#ordered-indexes">Ordered Indexes in data.table</a></li></ul> Spoiler alert - We were able to improve lookup speed 25 times by using data.table indexes. <hr /> <h2>Fast Data Lookups in R - Getting Started</h2> First, let’s generate a sample dataset with 10 million rows: <figure class="highlight"><script src="https://gist.github.com/darioappsilon/a82de75188ce77b54821bf786429d51b.js"></script></figure> Let’s now create a simple benchmarking routine to compare the efficiency of different methods: <figure class="highlight"><script src="https://gist.github.com/darioappsilon/4eda971250646290de379fc01a8081a9.js"></script></figure> These routines benchmark a piece of code by running it multiple times. In each run, we can include preparation steps that do not count towards the measured execution time – we will only measure the execution time of code wrapped with a call to <code class="highlighter-rouge">time(...)</code>. <h2 id="filtering">Data Lookups with dplyr::filter and data.table filtering</h2> At Appsilon, we often use Hadley’s <a href="https://github.com/hadley/dplyr">dplyr</a> package for data manipulations. To find a specific element in a data frame (or, more precisely, a tibble) you can use dplyr’s <code class="highlighter-rouge">filter</code> function. Let’s see how long it takes to find an element with a specific key using dplyr. <figure class="highlight"><script src="https://gist.github.com/darioappsilon/872038f2a7285fbb53aec9397df2b240.js"></script></figure> It turns out that finding a single element on my machine using a filter takes 112 ms on average. This might not seem like a lot, but in fact, it's painfully slow – especially if you want to do several such operations at a time. Why is <code class="highlighter-rouge">filter</code> so slow? <em>dplyr</em> does not know anything about our dataset. The only way for it to find matching elements is to look row by row and check if the criteria are matched. When the dataset gets large, this simply must take time. Filtering using <code class="highlighter-rouge">data.table</code> is not fast either: <figure class="highlight"><script src="https://gist.github.com/darioappsilon/96f1ab9ad61b51adbffcf635fc966891.js"></script></figure> In the worst case, it took almost 6 seconds to find an element! <h2 id="built-in-operators">Built-in Operators</h2> Perhaps it’s just <em>dplyr</em> and <em>data.table</em>? Maybe filtering the old-school way with built-in operators and filtering is faster? <figure class="highlight"><script src="https://gist.github.com/darioappsilon/1cb2d5843e2f7b1934a4cd553909f049.js"></script></figure> It does not seem like it. How about using <code class="highlighter-rouge">which</code> for selecting rows? <figure class="highlight"><script src="https://gist.github.com/darioappsilon/0b8440d8e00b37f734f7b053f54229cb.js"></script></figure> Not much of an improvement. Dataframe filtering and <code class="highlighter-rouge">which</code> also use a linear search, so as one could expect they cannot be noticeably faster than <code class="highlighter-rouge">dplyr::filter</code>. <h2 id="hash-tables">Hash tables to the Rescue?</h2> The problem of efficient lookups is not specific to R. One of the alternative approaches is to use a <b>hash table</b>. Without delving into the details, a hash table is a very efficient data structure with a constant lookup time. It is implemented in most modern programming languages and it is widely utilized in many areas. If you’d like to read more about how hash tables work, there are lots of good articles about that. <blockquote>R-bloggers has a great series of articles about hash tables in R: <a href="https://www.r-bloggers.com/hash-table-performance-in-r-part-i/">Part 1</a>, <a href="https://www.r-bloggers.com/hash-table-performance-in-r-part-iiin-part-i-of-this-series-i-explained-how-r-hashed/">Part 2</a>, <a href="https://www.r-bloggers.com/hash-table-performance-in-r-part-iiiin-part-iof-this-series-i-explained-how-r-hashed/" target="_blank" rel="noopener noreferrer">Part 3</a>.</blockquote> The main conclusion of those articles is that if you need a hash table in R, you can use one of its built-in data structures – <code class="highlighter-rouge">environments</code>. These are used to keep the bindings of variables to values. Internally, they are implemented as a hash table. There is even a CRAN package hash, which wraps environments for general-purpose usage. <b>However</b>, these articles analyze the performance of a hash table with just 30 thousand elements - which is tiny when compared to our dataset. Before we can use a hash.table, we need to create it. Building an environment with 10^7 elements is <i>extremely </i>slow: it takes more than an hour. For 10^6 elements it takes about 30 seconds. <figure class="highlight"><script src="https://gist.github.com/darioappsilon/0c65b9ae628dbec5481168614f35ddb5.js"></script></figure> This makes using environments as hash tables virtually impossible if your dataset contains upwards of a million elements. For instance, building a hash table at the startup of a Shiny app would force the user to wait for ages before the app actually starts. Such startup time is not acceptable in commercial solutions. Apparently, environments were never intended for storing that many elements. In fact, their basic role is to store variables that are created by executing the R code. There are not many scenarios that would require creating this many variables in the code. One workaround could be to build the environment once and load it from the file later, but (quite surprisingly) this turns out to be even slower. We did not manage to find a good hash tables package for R that was not using environments underneath - please let me know if you come across such a tool! We can expect that an efficient hash table R package will be created in the future, but until then we need to make use of available solutions. <h2 id="ordered-indexes">Data.table Ordered Indexes</h2> It’s unfortunate that we can’t use hash tables for large datasets. However, there is yet another option when the column you are searching for is a number or a string. In that case, you may use <strong>ordered indexes</strong> from <code class="highlighter-rouge">data.table</code> package. The index keeps a sorted version of our column, which leads to much faster lookups. Finding an element in a sorted set with binary search has a logarithmic execution time - complexity of <em>O(log n)</em>. In practice, an index can be implemented in different ways, but usually, the time needed to build up an index is <em>O(n*log n)</em>, which in practice is often fast enough. Let’s measure the time needed to build the index for our dataset: <figure class="highlight"><script src="https://gist.github.com/darioappsilon/939926c4e96adc9003d1cc3407d44663.js"></script></figure> 6 seconds - this is not negligible, but this is something we can accept. Let’s check the lookup times: <figure class="highlight"><script src="https://gist.github.com/darioappsilon/b59e538021c59e6c7680b4099fa903c2.js"></script></figure> This is exactly what we wanted - a mean lookup time of 5 milliseconds! Finally, we’ve managed to get efficient lookups in our dataset. <h2 id="summary">Summary of Fast Data Lookups in R</h2> Dplyr with its <code class="highlighter-rouge">filter</code> method will be slow if you search for a single element in a dataset. The same is true for classic data frame filtering with builtin R operators and for regular filtering using <code class="highlighter-rouge">data.table</code>. Using <code class="highlighter-rouge">environment</code> as a hash table gives you fast lookups, but building it for a large dataset takes very long. If you need efficient lookups in a large dataset in R, check out data.table’s ordered indexes – this appears to be the best option out there! If you run into any issues, as an <a class="c-link" href="https://wordpress.appsilon.com/appsilon-data-science-is-now-an-rstudio-full-service-certified-partner/" target="_blank" rel="noopener noreferrer" aria-describedby="slack-kit-tooltip">RStudio Full Certified Partner</a>, our team at Appsilon is ready to answer your questions about working with large datasets and other topics related to R Shiny, Data Analytics, and Machine Learning. Let us know and we would be happy to chat! <ul><li>Follow <a href="https://twitter.com/appsilon">@Appsilon</a> on Twitter</li><li>Follow Appsilon on <a href="https://www.linkedin.com/company/appsilon">LinkedIn</a></li><li>Learn more about our R Shiny <a href="https://appsilon.com/opensource/">open source</a> packages</li></ul>

Fast data lookups in R: dplyr vs data.table

Open source, pharma, and AI insights - once a week.

Share Your Data Goals with Us