Apache Arrow in R - Supercharge Your R Shiny Dashboards with 10X Faster Data Loading

Reading time:

time

min

October 12, 2021

Nobody likes a slow dashboard. Heck, I’m closing it and never coming back if it takes more than a couple of seconds to load. It’s a common problem — the datasets are getting bigger and our attention span is getting shorter. Balancing both as a developer, well, sucks. But there might be a silver lining. Most of the available datasets have one thing in common — the file extension. It’s typically a CSV, and it’s what’s giving you headaches. Don’t get me wrong — CSV’s have a lot going for them. You can easily open CSV files, see what’s inside, and even modify data on the fly. It’s a human-readable format, and tools like Excel can open it without questions asked. The problem is — CSV’s aren’t compressed, and take 6 times longer to read, 12 times longer to write, and 4 times the size on disk, provided a relatively small 50 MB dataset. Things only go south for CSV format as the file size increases. That’s why today you’ll learn about an alternative — Apache Arrow — and see how to use it in R and R Shiny dashboards. Table of contents: <ul><li><a href="#crash-course">Apache Arrow Crash Course</a></li><li><a href="#the-basics">How to Work With Apache Arrow in R</a></li><li><a href="#benchmarks">Benchmarks — Vanilla R vs. Apache Arrow</a></li><li><a href="#shiny">Apache Arrow in R Shiny — How to Speed Up Your Dashboards</a></li><li><a href="#conclusion">Conclusion</a></li></ul> <hr /> <h2 id="crash-course">Apache Arrow Crash Course</h2> Apache Arrow is a cross-language development platform for in-memory data. It’s designed for efficiency and moves away from your typical row-orientated storage (think CSV) to columnar storage. This change allows for adequate compression to be applied, so you end up with a smaller file that’s easier to read and write. Columnar storage also allows you to skip irrelevant data quickly, which results in faster queries and aggregations. <blockquote><strong>Better performance doesn't have to be difficult. Explore Appsilon's <a href="https://appsilon.com/speeding-up-r-shiny/" target="_blank" rel="noopener noreferrer">definitive guide to speeding up R Shiny.</a></strong></blockquote> Apache Arrow supports CSV, JSON, Parquet, and Feather file-formats locally, and Amazon S3 buckets in the cloud. Everything is implemented through a familiar <code>dplyr</code> syntax. That’s right — you won’t have to learn a new library from scratch. The library comes with two data structures: <ul><li><code>Table</code> — A tabular, column-orientated data structure capable of storing and processing large amounts of data more efficiently than R’s built-in <code>data.frame</code>.</li><li><code>Dataset</code> — a data structure similar to <code>Table</code>, but capable to work on larger-than-memory datasets.</li></ul> Apache Arrow comes with a bunch of functions for reading data: <code>read_parquet()</code>, <code>read_feather()</code>, <code>read_delim_arrow()</code>, <code>read_csv_arrow()</code>, <code>read_tsv_arrow()</code>, <code>read_json_arrow()</code>, and all of these will return an R <code>data.frame</code> by default. You can specify an additional argument <code>as_data_frame = FALSE</code> to return an Arrow Table. The options for writing data are more limited. You can only write CSV, Parquet, and Feather files, alongside the raw format and IPC stream, which you won’t use that often. The functions are named just as you would expect — <code>write_csv_arrow()</code>, <code>write_feather()</code>, and <code>write_parquet()</code>. Arrow also uses lazy evaluation, which means you’ll need to call either <code>compute()</code> or <code>collect()</code> to perform computations. The first one returns an Arrow Table, and the second one an R <code>data.frame</code>. You can install Apache Arrow just like any other package: <script src="https://gist.github.com/darioappsilon/a04d18f5160906c66558dff27655ae15.js"></script> Please do so if you plan to follow along. We’ll start by downloading and exploring a simple 50 MB dataset in the next section. <hr /> <h2 id="the-basics">How to Work With Apache Arrow in R</h2> Let’s start by downloading the <a href="https://www.kaggle.com/dgawlik/nyse?select=prices.csv" target="_blank" rel="noopener noreferrer">NYSE prices dataset</a> from Kaggle: <img class="size-full wp-image-8366" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01fff9dc2dda17fc19431_dataset.webp" alt="Image 1 - The NYSE dataset" width="3084" height="1784" /> Image 1 - The NYSE dataset You’ll only need the <code>prices.csv</code> file, so don’t waste your time by downloading the others. Once downloaded, create a new R script and you’re good to go. You can start by changing the working directory to where your CSV file is located and importing the libraries. You’ll need <code>arrow</code> and <code>dplyr</code>: <script src="https://gist.github.com/darioappsilon/faefaa3241628bb50dc8419fee9ebbef.js"></script> The <code>warn.conflicts = FALSE</code> will make sure you don’t get unnecessary warning messages printed to the console. You can now use the <code>read_csv_arrow()</code> function to read in the dataset. Keep in mind — specifying <code>as_data_frame = FALSE</code> will return an Arrow Table, which is what you want for faster computations: <script src="https://gist.github.com/darioappsilon/772afe3f7ffd177822ff4f53f50ba8d5.js"></script> And that’s the only change you need to make when reading data. The <code>stock_prices</code> is stored as a Table, but you can still do any data manipulation you want with <code>dplyr</code>. For instance, the following code snippet keeps only the Netflix data and gets rid of a couple of columns: <script src="https://gist.github.com/darioappsilon/c9f285003f0d0463d9e76ede0d57be63.js"></script> The problem is — Arrow uses lazy evaluation, so no computation is done until you call either <code>compute()</code> or <code>collect()</code>. You can verify that by printing out <code>nflx</code> to the console: <img class="size-full wp-image-8358" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b02000ab40efae1ca98b00_arrow_in_r_1.webp" alt="Image 2 - Working with Arrow in R (1)" width="870" height="462" /> Image 2 - Working with Arrow in R (1) If you call the <code>compute()</code> function, yet another Table object is returned: <script src="https://gist.github.com/darioappsilon/74dfe50a6ec1b1188a75ab33b4950cda.js"></script> <img class="size-full wp-image-8359" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0200246af19f939a8d7ad_arrow_in_r_2.webp" alt="Image 3 - Working with Arrow in R (2)" width="500" height="372" /> Image 3 - Working with Arrow in R (2) And if you call <code>collect()</code>, you’ll get the filtered dataset in the R <code>data.frame</code> format: <script src="https://gist.github.com/darioappsilon/de89305940eda0efead7db3fa36f0496.js"></script> <img class="size-full wp-image-8360" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0200387a667cfae9022f2_arrow_in_r_3.webp" alt="Image 4 - Working with Arrow in R (3)" width="984" height="776" /> Image 4 - Working with Arrow in R (3) To summarize, the changes you need to make are minimal — an additional line of code here and there, but the speed and disk improvements are worth it. If you’re still skeptical, check the following benchmark section. <hr /> <h2 id="benchmarks">Benchmarks — Vanilla R vs. Apache Arrow</h2> You’ll now get the idea of just how much vanilla R is slower for reading/writing datasets than Apache Arrow. Execution times were measured with <code>Sys.time()</code> function in R. The <code>prices.csv</code> file was used for all tests. <h3>Benchmark 1 — Reading CSV files</h3> Here, a comparison was made between R’s <code>read.csv()</code> function and Arrow’s <code>read_csv_arrow()</code> function. The results are summarized on the following chart: <img class="size-full wp-image-8361" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020057a16ac39e34bf1cb_benchmark_1.webp" alt="Image 5 - R vs. Apache Arrow in CSV read time (R: 1.85s; Arrow: 0.3s)" width="4540" height="1446" /> Image 5 - R vs. Apache Arrow in CSV read time (R: 1.85s; Arrow: 0.3s) That’s more than a 6x speed decrease when reading CSV files — insane. In addition, if <code>as_data_frame = FALSE</code> was specified in the Apache Arrow function call, the execution took only 0.13 seconds, which is a 14x speed decrease. <h3>Benchmark 2 — Writing files to disk</h3> You’ll want to use Arrow for working with Parquet and Feather files, which are extremely efficient, both in runtime and size on disk. The following chart shows you the runtime difference between R’s <code>write.csv()</code> and Arrow’s <code>write_parquet()</code> and <code>write_feather()</code> functions: <img class="size-full wp-image-8362" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b02005ab40efae1ca98cbb_benchmark_2.webp" alt="Image 6 - R vs. Apache Arrow in write time (R CSV: 2.25s; Arrow Parquet: 0.2s; Arrow Feather: 0.18s)" width="4544" height="1440" /> Image 6 - R vs. Apache Arrow in write time (R CSV: 2.25s; Arrow Parquet: 0.2s; Arrow Feather: 0.18s) And just like that, a 12x decrease in speed when writing datasets to disk. But how much smaller is the file? Let’s answer that next. <h3>Benchmark 3 — File size on disk</h3> The results are nothing short of astonishing. Take a look for yourself: <img class="size-full wp-image-8363" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020075cab2da46e977947_benchmark_3.webp" alt="Image 7 - CSV vs. Parquet vs. Feather file size (CSV: 51.7 MB; Parquet: 12.8 MB; Feather: 24 MB)" width="4538" height="1442" /> Image 7 - CSV vs. Parquet vs. Feather file size (CSV: 51.7 MB; Parquet: 12.8 MB; Feather: 24 MB) Parquet is 4x smaller on disk when comparing a relatively small dataset. Scale that to hundreds of gigabytes of data and take it to the cloud — your wallet will be forever thankful. <hr /> <h2 id="shiny">Apache Arrow in R Shiny — How to Speed Up Your Dashboards</h2> The question now remains — how the heck do I implement Apache Arrow in my R Shiny dashboards? That’s what this section will demystify. We’ll need a dashboard to start, so let’s create one quickly based on the NYSE prices dataset. The dashboard will allow you to select a ticker and view the first 15 rows of data in a table. You’ll be able to pick between the first 250 tickers, as leaving all of them would likely crash the app. <blockquote><strong>Speed up your Shiny dashboard with <a href="https://appsilon.com/r-shiny-faster-updateinput-css-javascript/" target="_blank" rel="noopener noreferrer">updateInput, CSS, JavaScript, and more.</a></strong></blockquote> It’s a simple dashboard indeed, and you can create it with the following source code: <script src="https://gist.github.com/darioappsilon/f357adbae30f2cd0883cd66cfb47112f.js"></script> Here’s how it looks like: <img class="size-full wp-image-8364" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020085cab2da46e977b9b_dashboard_1.gif" alt="Image 8 - Default R Shiny dashboard" width="1022" height="664" /> Image 8 - Default R Shiny dashboard And here’s the source code for the dashboard that uses Apache Arrow: <script src="https://gist.github.com/darioappsilon/f4fb80e23e24b39a86b570ba888aa491.js"></script> In a nutshell, you’ll have to import an additional library and use a different function to read the data. You can still read it from a CSV file, but the example below leverages the advantages of Parquet. Finally, you’ll have to call the <code>collect()</code> function before displaying the table. And that’s it! Here’s how this version of the dashboard looks like: <img class="size-full wp-image-8365" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020091acc439bc25a7e10_dashboard_2.gif" alt="Image 9 - R Shiny dashboard that uses Apache Arrow" width="1022" height="664" /> Image 9 - R Shiny dashboard that uses Apache Arrow Only the date column has changed. It’s now displayed as a timestamp, but that’s an easy fix. We’ll leave it up to you. <hr /> <h2 id="conclusion">Conclusion</h2> And there you have it — a bulletproof guide on using Apache Arrow in R and R Shiny. You’ve learned a lot, from the basic difference of row and columnar storage, reasons why Parquet and Feather files are better if you care for speed and disk space, to a fully working dashboard built on the Apache Arrow package. <blockquote><strong>How strong is your Shiny App UX? Follow Appsilon's <a href="https://appsilon.com/ux-design-of-shiny-apps-7-steps-to-design-dashboards-people-love/" target="_blank" rel="noopener noreferrer">7 steps to design dashboards that people love.</a></strong></blockquote> The further you scale, the more benefits of Apache Arrow you’ll see. Parquet or Feather should be your go-to data format if you don’t plan to edit data on the fly. A comparison with CSV looks unreal, to say the least. <em>Have you implemented Apache Arrow in your projects? Did you find the transition easy?</em> Please let us know, and subscribe to our newsletter for more stories like this.

Apache Arrow in R - Supercharge Your R Shiny Dashboards with 10X Faster Data Loading

Open source, pharma, and AI insights - once a week.

Share Your Data Goals with Us