R Data Processing Frameworks: How To Speed Up Your Data Processing Pipelines up to 20 Times

Reading time:
time
min
By:
Dario Radečić
December 5, 2023

Picture this - the data science team you manage primarily uses R and heavily relies on <code>dplyr</code> for implementing data processing pipelines. All is good, but then out of the blue you're working with a client that has a massive dataset, and all of a sudden <code>dplyr</code> becomes the bottleneck. You want a faster way to process data with minimum to no changes to the source code. You're wondering, how much effort will it take to speed up your data processing pipelines 2 times, 5 times, 10 times, or even up to 20 times? The answer might surprise you - <b>far less than you think</b>. This article will show you why <code>dplyr</code> fails to perform on larger datasets, and which alternative options you have in the context of R data processing frameworks. Let's dig in! <blockquote>R is full of things you didn't know are possible - <a href="https://appsilon.com/excel-functions-in-r/" target="_blank" rel="noopener">Here's a couple of advanced Excel functions in R for data manipulation</a>.</blockquote> <h3>Table of contents:</h3><ul><li><strong><a href="#dplyr-alternatives">Dplyr Alternatives - Top R Data Processing Frameworks</a></strong></li><li><strong><a href="#experiment">Dplyr vs. Arrow vs. DuckDB - R Data Processing Framework Test</a></strong></li><li><strong><a href="#summary">Summing up R Data Processing Framework Benchmarks</a></strong></li></ul> <hr /> <h2 id="dplyr-alternatives">Dplyr Alternatives - Top R Data Processing Frameworks</h2> This section will introduce you to two <code>dplyr</code> alternatives that use the same interface - <code>arrow</code> and <code>duckdb</code>. We wanted to focus on these two specifically because they come up with minimal code changes, as you'll see from the examples later. They provide the best "bang for the buck" if your data science team doesn't have the time to learn a new data processing framework from scratch. <h3>Which R dplyr Alternatives Can You Use?</h3> Sure, everyone loves <code>dplyr</code>. We've even dedicated a full article for <a href="https://appsilon.com/r-dplyr-tutorial/" target="_blank" rel="noopener">beginner-level data analysis techniques with dplyr</a>. The package has a user-friendly syntax and is super easy to use for data transformation tasks. But guess what - so are the other two alternatives we'll use today. In addition, they are usually up to 20 times faster. <b>Arrow</b> is a cross-language development platform for in-memory and larger-than-memory data. The R package exposes an interface to the Arrow C++ library, allowing you to benefit from an R programming language syntax, and access to a C++ library API through a set of well-known <code>dplyr</code> backend functions. Arrow also provides zero-copy data sharing between R and Python, which might be appealing for language-agnostic data science teams. <blockquote>You can learn more about Arrow for R <a href="https://arrow.apache.org/docs/r/" target="_blank" rel="noopener">here</a>.</blockquote> <b>DuckDB</b> is an open-source, embedded, in-process, relational OLAP DBMS. Its description contains pretty much every buzzword you could imagine, but it being an OLAP database means the data is organized by columns and is optimized for complex data queries (think joins, aggregations, groupings, and so on). The good thing about <code>duckdb</code> is that it comes with an R API, meaning you can use R to point to a local (or remote) instance of your database with <code>DBI</code>. Further, <code>duckdb</code> R package uses <code>dplyr</code>-like syntax, which means code changes coming from a vanilla <code>dplyr</code> will be minimal to non-existent. <blockquote>You can learn more about DuckDB <a href="https://duckdb.org/" target="_blank" rel="noopener">here</a> and about its R API <a href="https://duckdb.org/docs/api/r.html" target="_blank" rel="noopener">here</a>.</blockquote> So, these are the <code>dplyr</code> alternatives we'll use to perform a series of benchmarks with the goal of comparing R data processing frameworks. They should both be faster than <code>dplyr</code> in most cases, but that's where the actual tests come in. More on that in the following section. <h2 id="experiment">Dplyr vs. Arrow vs. DuckDB - R Data Processing Framework Test</h2> To kick things off, let's talk about data. We have several small to somewhat large Parquet files and a single DuckDB database (not publicly available) that has other files stored inside. The <code>dplyr</code> and <code>arrow</code> benchmarks will be based on the Parquet files, and the DuckDB benchmarks will be connected to the local database: <img class="size-full wp-image-22244" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c83bc28ea2f066dc9d8_Image-1-Datasets-used.webp" alt="Image 1 - Datasets used" width="1214" height="420" /> Image 1 - Datasets used Next, we'll discuss how the benchmarks were configured, R package versions, and the hardware used to run the tests. <h3>Benchmark Setup and Information</h3> As for R itself, we've used <b>R version 4.3.1</b>. The most important packages were installed with the following version numbers: <ul><li><code>dplyr</code> - 1.1.3</li><li><code>arrow</code> - 13.0.0</li><li><code>duckdb</code> - 0.8.1-3</li></ul> Each of the benchmark tests you'll see below was run <b>3 times</b> for each R data processing framework. The hardware of choice was a <b>14" M2 Pro MacBook Pro</b> with a 12-core CPU and 16 GB of RAM. For working on <code>dplyr</code> and <code>arrow</code> benchmarks, we've imported the following R packages: <pre><code class="language-r">library(tidyverse) library(arrow)</code></pre> The <code>arrow</code> benchmark results have had the following option configured: <pre><code class="language-r">options(arrow.pull_as_vector = TRUE)</code></pre> Working with <code>duckdb</code> required a couple of extra dependencies: <pre><code class="language-r">library(DBI) library(duckdb)</code></pre> In both of these, the following code was used to load the datasets: <pre><code class="language-r">badges &lt;- read_parquet("../../data/badges.parquet") posts &lt;- read_parquet("../../data/posts.parquet") tags &lt;- read_parquet("../../data/tags.parquet") users &lt;- read_parquet("../../data/users.parquet") votes &lt;- read_parquet("../../data/votes.parquet") wiki &lt;- read_parquet("../../data/wiki.parquet")</code></pre> In order to connect to a DuckDB database and extract the datasets, we've used the following code: <pre><code class="language-r">con &lt;- dbConnect(duckdb::duckdb("./data.duckdb")) <br>badges &lt;- tbl(con, "badges") posts &lt;- tbl(con, "posts") tags &lt;- tbl(con, "tags") users &lt;- tbl(con, "users") votes &lt;- tbl(con, "votes") wiki &lt;- tbl(con, "wiki")</code></pre> Finally, to actually run benchmarks, we decided to declare a <code>benchmark()</code> function which takes another function as an argument: <pre><code class="language-r">benchmark &lt;- function(fun) {    start &lt;- Sys.time()    res &lt;- fun()    end &lt;- Sys.time()    print(end - start)    res }</code></pre> Each of the 8 benchmarks you'll see below wraps the logic inside a separate function and then calls <code>benchmark()</code> and passes itself as an argument. So with that out of the way, let's go over our first benchmark! <h3>Benchmark #1: Finding the Article with the Most External Entries</h3> The goal of the first benchmark was to use the <code>wiki</code> dataset to find the article with the most external entries on English Wikipedia in March 2022. You'll find the code for all three R packages below. The only difference from vanilla <code>dplyr</code> is in passing some additional arguments to <code>summarise()</code> and <code>slice_max()</code> functions. Everything else is identical: <pre><code class="language-r">b1_dplyr &lt;- function() {  wiki %&gt;%    filter(type == "external") %&gt;%    group_by(curr) %&gt;%    summarise(total = sum(n)) %&gt;%    slice_max(total, n = 3) %&gt;%    pull(curr) } <br>b1_arrow &lt;- function() {  wiki %&gt;%    filter(type == "external") %&gt;%    group_by(curr) %&gt;%    summarise(total = sum(n)) %&gt;%    slice_max(total, n = 3, with_ties = FALSE) %&gt;%    pull(curr) } <br>b1_duckdb &lt;- function() {  wiki %&gt;%    filter(type == "external") %&gt;%    group_by(curr) %&gt;%    summarise(total = sum(n, na.rm = TRUE)) %&gt;%    slice_max(total, n = 3, with_ties = FALSE) %&gt;%    pull(curr) }</code></pre> Here are the results: <img class="size-full wp-image-22246" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c83ec0a4ec2a911e306_Image-2-Benchmark-1-results-scaled.webp" alt="Image 2 - Benchmark #1 results" width="2560" height="1152" /> Image 2 - Benchmark #1 results As you can see, both <code>arrow</code> and <code>duckdb</code> were faster, by 10 and 7 times, respectively. <h3>Benchmark #2: Finding Properties Across Multiple Datasets</h3> The second test was a combination of two calculations - the first one had the task of finding the <code>DisplayName</code> property of a user that has the most badges, while the second one had to find the <code>Location</code> property for the same condition. Most of the logic is implemented in the first portion of the calculation, where the ID of the user was found, and then in the second portion, only the desired properties were extracted. As before, most of the code differences boil down to the extra arguments in a couple of functions and calling <code>collect()</code> at the end of the calculation: <pre><code class="language-r">b2_dplyr &lt;- function() {  tid &lt;- badges %&gt;%    left_join(users, by = join_by(UserId == Id)) %&gt;%    group_by(UserId, DisplayName) %&gt;%    summarise(NBadges = n()) %&gt;%    ungroup() %&gt;%    slice_max(NBadges, n = 1) %&gt;%    pull(UserId)  top_user &lt;- users %&gt;%    filter(Id == tid) %&gt;%    select(DisplayName, Location) <br>  top_user } <br> b2_arrow &lt;- function() {  tid &lt;- badges %&gt;%    left_join(users, by = join_by(UserId == Id)) %&gt;%    group_by(UserId, DisplayName) %&gt;%    summarise(NBadges = n()) %&gt;%    ungroup() %&gt;%    slice_max(NBadges, n = 1, with_ties = FALSE) %&gt;%    pull(UserId) <br>  top_user &lt;- users %&gt;%    filter(Id == tid) %&gt;%    select(DisplayName, Location) <br>  top_user %&gt;% collect() } <br> b2_duckdb &lt;- function() {  tid &lt;- badges %&gt;%    left_join(users, by = join_by(UserId == Id)) %&gt;%    group_by(UserId, DisplayName) %&gt;%    summarise(NBadges = n()) %&gt;%    ungroup() %&gt;%    slice_max(NBadges, n = 1, with_ties = FALSE) %&gt;%    pull(UserId) <br>  top_user &lt;- users %&gt;%    filter(Id == tid) %&gt;%    select(DisplayName, Location) <br>  top_user %&gt;% collect() }</code></pre> These are the results we got: <img class="size-full wp-image-22248" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c84cd075c1c912cb834_Image-3-Benchmark-2-results-scaled.webp" alt="Image 3 - Benchmark #2 results" width="2560" height="1152" /> Image 3 - Benchmark #2 results The truth is - <code>dplyr</code> wasn't the slowest one here. <code>arrow</code> was still almost twice as fast, but <code>duckdb</code> was three times slower on average. <h3>Benchmark #3: Finding the Number of Entries</h3> The third test used the <code>wiki</code> dataset to find the number of entries on the article about the city from the previous benchmark on English Wikipedia in March 2022. Both <code>dplyr</code> alternatives introduce the <code>collect()</code> method at the end of the calculation and also some additional arguments to the <code>summarise()</code> function: <pre><code class="language-r">b3_dplyr &lt;- function() {  city &lt;- str_split(top_user %&gt;% pull(Location), ",")[[1]][[1]]  wiki %&gt;%    filter(curr == city) %&gt;%    summarise(sum(n)) } <br>br_arrow &lt;- function() {  city &lt;- str_split(top_user %&gt;% pull(Location), ",")[[1]][[1]]  wiki %&gt;%    filter(curr == city) %&gt;%    summarise(sum(n)) %&gt;%    collect() } <br>b3_duckdb &lt;- function() {  city &lt;- str_split(top_user %&gt;% pull(Location), ",")[[1]][[1]]  wiki %&gt;%    filter(curr == city) %&gt;%    summarise(sum(n, na.rm = TRUE)) %&gt;%    collect() }</code></pre> Here's what we got out of the tests: <img class="size-full wp-image-22250" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c85ab162dee282ad946_Image-4-Benchmark-3-results-scaled.webp" alt="Image 4 - Benchmark #3 results" width="2560" height="1152" /> Image 4 - Benchmark #3 results In absolute terms, the difference isn't that large, but speaking relatively, <code>arrow</code> proved to be around 18 times faster, while <code>duckdb</code> was 14 times faster. Impressive! <h3>Benchmark #4: Finding the Most Common Words with a Given Condition</h3> The fourth test was once again a combination of two aggregations, one on the <code>posts</code> dataset, and the other on <code>wiki</code> dataset. The first one had the task of finding the most common work with at least 8 characters. The second one then found the number of occurrences of the most common word with at least 8 characters in all posts. This is where we see a couple of significant syntax differences among different R data processing frameworks. The <code>dplyr</code> package requires you to write the least amount of code, as you can see below: <pre><code class="language-r">b4_dplyr &lt;- function() {  theword &lt;- posts %&gt;%    select(Body) %&gt;%    mutate(Body = gsub("&lt;.*?&gt;", "", Body)) %&gt;%    mutate(Body = gsub("\n", " ", Body)) %&gt;%    separate_rows(Body, sep = " ") %&gt;%    rename(Words = Body) %&gt;%    filter(nchar(Words) &gt; 7) %&gt;%    count(Words) %&gt;%    slice_max(n, n = 1) %&gt;%    pull(Words) <br>  sum_of_n &lt;- wiki %&gt;%    filter(curr == str_to_title(theword)) %&gt;%    summarize(sum_n = sum(n)) %&gt;%    pull()  paste(theword, sum_of_n) } <br>b4_arrow &lt;- function() {  theword &lt;- posts %&gt;%    select(Body) %&gt;%    mutate(Body = gsub("&lt;.*?&gt;", "", Body)) %&gt;%    mutate(Body = gsub("\n", " ", Body)) %&gt;%    collect() %&gt;%    separate_rows(Body, sep = " ") %&gt;%    as_arrow_table() %&gt;%    rename(Words = Body) %&gt;%    filter(nchar(Words) &gt; 7) %&gt;%    count(Words) %&gt;%    slice_max(n, n = 1, with_ties = FALSE) %&gt;%    pull(Words) <br>  sum_of_n &lt;- wiki %&gt;%    filter(curr == str_to_title(theword)) %&gt;%    summarize(sum_n = sum(n)) %&gt;%    pull() <br>  paste(theword, sum_of_n) } <br>b4_duckdb &lt;- function() {  theword &lt;- posts %&gt;%    select(Body) %&gt;%    mutate(Body = regexp_replace(Body, "&lt;.*?&gt;", "", "g")) %&gt;%    mutate(Body = regexp_replace(Body, "\n", " ")) %&gt;%    mutate(Body = string_split(Body, " ")) %&gt;%    mutate(Body = unnest(Body)) %&gt;%    mutate(Body = lower(Body)) %&gt;%    rename(Words = Body) %&gt;%    filter(nchar(Words) &gt; 7) %&gt;%    count(Words) %&gt;%    slice_max(n, n = 1, with_ties = FALSE) %&gt;%    pull(Words)  theword &lt;- str_to_title(theword)  sum_of_n &lt;- wiki %&gt;%    filter(curr == theword) %&gt;%    summarize(sum_n = sum(n, na.rm = TRUE)) %&gt;%    pull() <br>  paste(theword, sum_of_n) }</code></pre> But at what cost? Let's examine the benchmark results next: <img class="size-full wp-image-22252" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c860ac309af73f7af40_Image-5-Benchmark-4-results-scaled.webp" alt="Image 5 - Benchmark #4 results" width="2560" height="1152" /> Image 5 - Benchmark #4 results This time, <code>duckdb</code> was a clear winner, outperforming <code>dplyr</code> by a factor of 9. <code>arrow</code> was only slightly faster than <code>dplyr</code>, by around 6 and a half seconds or 25%. <h3>Benchmark #5: Finding the Largest Difference in Multiple Datasets</h3> This test used the <code>votes</code> and <code>posts</code> datasets to first find the post with the highest difference between upvotes and downvotes, and then find the <code>DisplayName</code> property along with the actual difference between upvotes and downvotes. It's a lot to take in at once, but most of it boils down to running multiple computations sequentially, and there aren't many code differences between our three data processing frameworks. Feel free to take a look at the code and decide by yourself: <pre><code class="language-r">b5_dplyr &lt;- function() {  upvotes &lt;- votes %&gt;%    filter(VoteTypeId == 2) %&gt;%    group_by(PostId) %&gt;%    summarize(UpVotes = n()) %&gt;%    ungroup() <br>  downvotes &lt;- votes %&gt;%    filter(VoteTypeId == 3) %&gt;%    group_by(PostId) %&gt;%    summarize(DownVotes = n()) %&gt;%    ungroup() <br>  posts2 &lt;- posts %&gt;%    left_join(upvotes, by = c("Id" = "PostId")) %&gt;%    left_join(downvotes, by = c("Id" = "PostId")) %&gt;%    mutate(      UpVotes = coalesce(UpVotes, 0),      DownVotes = coalesce(DownVotes, 0)    ) %&gt;%    mutate(UpVoteRatio = UpVotes - DownVotes) <br>  posts2 %&gt;%    inner_join(users, by = c("OwnerUserId" = "Id")) %&gt;%    arrange(desc(UpVoteRatio)) %&gt;%    slice(1) %&gt;%    select(Score, DisplayName) } <br>b5_arrow &lt;- function() {  upvotes &lt;- votes %&gt;%    filter(VoteTypeId == 2) %&gt;%    group_by(PostId) %&gt;%    summarize(UpVotes = n()) %&gt;%    ungroup() <br>  downvotes &lt;- votes %&gt;%    filter(VoteTypeId == 3) %&gt;%    group_by(PostId) %&gt;%    summarize(DownVotes = n()) %&gt;%    ungroup() <br>  posts2 &lt;- posts %&gt;%    left_join(upvotes, by = c("Id" = "PostId")) %&gt;%    left_join(downvotes, by = c("Id" = "PostId")) %&gt;%    mutate(      UpVotes = coalesce(UpVotes, 0),      DownVotes = coalesce(DownVotes, 0)    ) %&gt;%    mutate(UpVoteRatio = UpVotes - DownVotes) <br>  posts2 %&gt;%    inner_join(users, by = c("OwnerUserId" = "Id")) %&gt;%    slice_max(UpVoteRatio, n = 1, with_ties = FALSE) %&gt;%    select(Score, DisplayName) %&gt;%    collect() } <br>b5_duckdb &lt;- function() {  upvotes &lt;- votes %&gt;%    filter(VoteTypeId == 2) %&gt;%    group_by(PostId) %&gt;%    summarize(UpVotes = n()) %&gt;%    ungroup() <br>  downvotes &lt;- votes %&gt;%    filter(VoteTypeId == 3) %&gt;%    group_by(PostId) %&gt;%    summarize(DownVotes = n()) %&gt;%    ungroup() <br>  posts2 &lt;- posts %&gt;%    left_join(upvotes, by = c("Id" = "PostId")) %&gt;%    left_join(downvotes, by = c("Id" = "PostId")) %&gt;%    mutate(      UpVotes = coalesce(UpVotes, 0),      DownVotes = coalesce(DownVotes, 0)    ) %&gt;%    mutate(UpVoteRatio = UpVotes - DownVotes) <br>  posts2 %&gt;%    inner_join(users, by = c("OwnerUserId" = "Id")) %&gt;%    slice_max(UpVoteRatio, n = 1, with_ties = FALSE) %&gt;%    select(Score, DisplayName) %&gt;%    collect() }</code></pre> This is what we got in the end: <img class="size-full wp-image-22254" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c887ddb1a202449a550_Image-6-Benchmark-5-results-scaled.webp" alt="Image 6 - Benchmark #5 results" width="2560" height="1152" /> Image 6 - Benchmark #5 results Once again, there is not a huge difference in absolute terms. But relatively speaking, <code>arrow</code> was 2.5 faster and <code>duckdb</code> was 7 times faster when compared to <code>dplyr</code>. <h3>Benchmark #6: Finding the Month with the Most Posts Created</h3> Our next test has a simple task of finding the month in which the most posts were created. That's it! The code differences are almost negligible here - both <code>arrow</code> and <code>duckdb</code> call the <code>collect()</code> method at the end and the order of operation is somewhat different between all three. Nothing you couldn't change for yourself in a couple of minutes: <pre><code class="language-r">b6_dplyr &lt;- function() {  votes %&gt;%    mutate(CreationDateDT = as.POSIXct(CreationDate)) %&gt;%    arrange(CreationDateDT) %&gt;%    group_by(Month = floor_date(CreationDateDT, "month")) %&gt;%    summarize(Count = n()) %&gt;%    slice_max(Count, n = 1) } <br>b6_arrow &lt;- function() {  votes %&gt;%    arrange(CreationDate) %&gt;%    collect() %&gt;% # seems to be some bug that requires this    group_by(Month = floor_date(CreationDate, "month")) %&gt;%    summarize(Count = n()) %&gt;%    slice_max(Count, n = 1, with_ties = FALSE) %&gt;%    collect() } <br>b6_duckdb &lt;- function() {  votes %&gt;%    group_by(Month = floor_date(CreationDate, "month")) %&gt;%    summarize(Count = n()) %&gt;%    slice_max(Count, n = 1) %&gt;%    collect() }</code></pre> Here's the outcome: <img class="size-full wp-image-22256" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c8ac4fda458e39214b1_Image-7-Benchmark-6-results-scaled.webp" alt="Image 7 - Benchmark #6 results" width="2560" height="1152" /> Image 7 - Benchmark #6 results <code>duckdb</code> seems to be twice as fast on average when compared to <code>dplyr</code>. On the other hand, <code>arrow</code> was slower on average by a hundredth of a second. <h3>Benchmark #7: Finding the Month with the Biggest Decrease in the Amount of Posts</h3> The goal of our next test was to find the month in which there was the biggest decrease in the amount of created posts. Simple and straightforward, just like with the previous one. <code>arrow</code> was the most verbose framework in this test, requiring a couple of extra lines of code and calling the <code>collect()</code> function twice: <pre><code class="language-r">b7_dplyr &lt;- function() {  votes %&gt;%    mutate(CreationDateDT = as.POSIXct(CreationDate)) %&gt;%    group_by(Month = floor_date(CreationDateDT, "month")) %&gt;%    summarize(VoteCount = n(), .groups = "drop") %&gt;%    mutate(Diff = VoteCount - lag(VoteCount)) %&gt;%    select(Month, Diff) %&gt;%    slice_min(Diff, n = 1) } <br>b7_arrow &lt;- function() {  votes %&gt;%    group_by(Month = floor_date(CreationDate, "month")) %&gt;%    summarize(VoteCount = n(), .groups = "drop") %&gt;%    collect() %&gt;%    mutate(Diff = VoteCount - lag(VoteCount)) %&gt;% # lag not supported in arrow    as_arrow_table() %&gt;%    arrange(Diff) %&gt;%    select(Month, Diff) %&gt;%    slice_head(n = 1) %&gt;%    collect() } <br>b7_duckdb &lt;- function() {  votes %&gt;%    group_by(Month = floor_date(CreationDate, "month")) %&gt;%    summarize(VoteCount = n(), .groups = "drop") %&gt;%    mutate(Diff = VoteCount - lag(VoteCount)) %&gt;%    select(Month, Diff) %&gt;%    slice_min(Diff, n = 1) %&gt;%    collect() }</code></pre> This is what we got: <img class="size-full wp-image-22258" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c8bb3438b06fc68caf3_Image-8-Benchmark-7-results-scaled.webp" alt="Image 8 - Benchmark #7 results" width="2560" height="1152" /> Image 8 - Benchmark #7 results Despite the added verbosity, <code>arrow</code> was still twice faster than <code>dplyr</code>. The <code>duckdb</code> framework was somewhere in the middle, still having a slight edge over the vanilla <code>dplyr</code>. <h3>Benchmark #8: Finding Common Tags Along Posts by Location</h3> For the final test, the goal was to find the most common tag along posts created by users from Poland. To get this information, the <code>Location</code> column in the <code>posts</code> dataset should contain "Poland" or "Polska". The syntax is slightly different between our three data processing frameworks, with <code>arrow</code> being the most verbose one once again: <pre><code class="language-r">b8_dplyr &lt;- function() {  tags &lt;- posts %&gt;%    left_join(users, by = c("OwnerUserId" = "Id")) %&gt;%    filter(str_detect(Location, "Poland|Polska")) %&gt;%    select(Tags) %&gt;%    mutate(Tags = str_replace_all(Tags, "[&lt;&gt;]", " ")) %&gt;%    separate_rows(Tags, sep = " ") %&gt;%    filter(Tags != "") <br>  tags %&gt;%    count(Tags) %&gt;%    arrange(desc(n)) %&gt;%    slice(1) } <br>b8_arrow &lt;- function() {  tags &lt;- posts %&gt;%    left_join(users, by = c("OwnerUserId" = "Id")) %&gt;%    filter(str_detect(Location, "Poland|Polska")) %&gt;%    select(Tags) %&gt;%    mutate(Tags = str_replace_all(Tags, "[&lt;&gt;]", " ")) %&gt;%    collect() %&gt;%    separate_rows(Tags, sep = " ") %&gt;%    as_arrow_table() %&gt;%    filter(Tags != "") <br>  tags %&gt;%    count(Tags) %&gt;%    slice_max(n, n = 1, with_ties = FALSE) %&gt;%    collect() } <br>b8_duckdb &lt;- function() {  tags &lt;- posts %&gt;%    left_join(users, by = c("OwnerUserId" = "Id")) %&gt;%    filter(str_detect(Location, "Poland|Polska")) %&gt;%    select(Tags) %&gt;%    mutate(Tags = str_replace_all(Tags, "[&lt;&gt;]", " ")) %&gt;%    mutate(Tags = string_split(Tags, " ")) %&gt;%    mutate(Tags = unnest(Tags)) %&gt;%    filter(Tags != "") <br>  tags %&gt;%    count(Tags) %&gt;%    slice_max(n, n = 1, with_ties = FALSE) %&gt;%    collect() }</code></pre> In the end, these are the time differences: <img class="size-full wp-image-22260" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c8cda2fb30f04bd397c_Image-9-Benchmark-8-results-scaled.webp" alt="Image 9 - Benchmark #8 results" width="2560" height="1152" /> Image 9 - Benchmark #8 results But this time, it was <code>duckdb</code> that was almost two times faster than <code>dplyr</code>. The <code>arrow</code> framework provided somewhat of a negligible 10% runtime decrease when compared to <code>dplyr</code>. <h3>Conclusion: Which R Data Processing Framework Should You Use?</h3> In the end, let's imagine all of our 8 benchmarks make a single data processing pipeline. The question is - <b>Which R data processing framework wins in total?</b> Total here being the summation of average (of 3) runtimes for each benchmark. Here are the results: <img class="size-full wp-image-22262" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b00c8d384da7c6c41cbc4c_Image-10-Total-runtime-comparison-scaled.webp" alt="Image 10 - Total runtime comparison" width="2560" height="1152" /> Image 10 - Total runtime comparison It's clear to say that <code>duckdb</code> won by a huge margin - it's 7.4 times faster when compared to <code>dplyr</code> and 4.1 times faster when compared to <code>arrow</code>. This is mostly because of benchmark #4 results, in which <code>duckdb</code> won by 20-30 seconds in absolute terms. Still, it makes sense to compare all three in your work environment to find out which data processing framework is the fastest for your specific needs. Now you know how, so you shouldn't have any trouble cutting down the data processing runtime by a significant factor. <hr /> <h2 id="summary">Summing up R Data Processing Framework Benchmarks</h2> Long story short - it takes minimal effort (code changes) to massively speed up your data processing pipelines in R. The <code>dplyr</code> package is just fine when you're just starting out, but you should look into the alternatives mentioned today when speed is of the essence. Spoiler alert, it always is. But switching between R data processing frameworks doesn't have to be a long an painful experience. Packages like <code>arrow</code> and <code>duckdb</code> use the same <code>dplyr</code> interface but provide much faster results. You can change the backend in a matter of minutes, or hours in a worst-case scenario if you have a lot of data processing pipelines. <i>What's your go-to way of speeding up <code>dplyr</code>? Do you use the packages mentioned today or something else entirely?</i> Make sure to let us know in the comment section below. <blockquote>Looking to automate data quality reporting in R and R Shiny? <a href="https://appsilon.com/automated-r-data-quality-reporting/" target="_blank" rel="noopener">Look no further than R's data.validator package</a>.</blockquote>

Have questions or insights?

Engage with experts, share ideas and take your data journey to the next level!

Is Your Software GxP Compliant?

Download a checklist designed for clinical managers in data departments to make sure that software meets requirements for FDA and EMA submissions.
Explore Possibilities

Share Your Data Goals with Us

From advanced analytics to platform development and pharma consulting, we craft solutions tailored to your needs.

Talk to our Experts
r
dplyr