Fast Data Loading from Files to R

April 11, 2017

<em><strong>Updated</strong>: June 6, 2022.</em> Loading large data frames when building Shiny apps can have a significant impact on app initialization time. When we ran into this issue in a recent project, we decided to review the available methods for reading data from CSV files (as provided by our client) into <a href="https://appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer">R</a>. In this article, we benchmark these methods to identify the most efficient one and explain the workflow we settled on.
<blockquote>Want to use R and Python together in your Project? <a href="https://appsilon.com/use-r-and-python-together/">Our complete guide has you covered</a>.</blockquote>
We will compare the following:
<ol><li><code class="highlighter-rouge">read.csv</code> from <code class="highlighter-rouge">utils</code>, which used to be the standard way of reading <strong>CSV</strong> files into R in RStudio,</li><li><code class="highlighter-rouge">read_csv</code> from <code class="highlighter-rouge">readr</code>, which replaced it as the standard way of doing so in RStudio,</li><li><code class="highlighter-rouge">load</code> and <code class="highlighter-rouge">readRDS</code> from <code class="highlighter-rouge">base</code>, and</li><li><code class="highlighter-rouge">read_feather</code> from <code class="highlighter-rouge">feather</code> and <code class="highlighter-rouge">fread</code> from <code class="highlighter-rouge">data.table</code>.</li></ol>
<hr />
<h2 id="data">R Fast Data Loading - The Dataset</h2>
To kick things off, we generate a fairly large random dataset with 1.5 million rows, 10 integer columns, and 10 string columns:
<pre><code class="language-r">set.seed(123)

df <- data.frame(
  # 10 integer columns with 1.5 million random values each
  replicate(10, sample(0:2000, 15 * 10^5, replace = TRUE)),
  # 10 string columns: 1,000 random 5-character strings each,
  # recycled by data.frame() to fill all rows
  replicate(10, stringi::stri_rand_strings(1000, 5))
)

head(df)</code></pre>
For reference, this is what the dataset looks like: <img class="size-full wp-image-13277" 
src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b27064b2d248d6c9eda86d_1.webp" alt="Image 1 - Artificially created dataset" width="1760" height="306" /> Image 1 - Artificially created dataset
Next, we define variables that hold the save locations for all four file formats - CSV, Feather, RData, and RDS:
<pre><code class="language-r">path_csv <- "./assets/data/fast_load/df.csv"
path_feather <- "./assets/data/fast_load/df.feather"
path_rdata <- "./assets/data/fast_load/df.RData"
path_rds <- "./assets/data/fast_load/df.rds"</code></pre>
From here, we can load the required R packages and write the dataset to disk in each format:
<pre><code class="language-r">library(feather)
library(data.table)

# Make sure the output directory exists before writing
dir.create(dirname(path_csv), recursive = TRUE, showWarnings = FALSE)

write.csv(df, file = path_csv, row.names = FALSE)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)</code></pre>
Next, we can check the resulting file sizes:
<pre><code class="language-r">files <- c(path_csv, path_feather, path_rdata, path_rds)
info <- file.info(files)

# Convert bytes to megabytes
info$size_mb <- info$size / (1024 * 1024)
print(subset(info, select = c("size_mb")))</code></pre>
Here's the output from the <code>print</code> statement: <img class="size-full wp-image-13279" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270659551644329c1c7ae_2.webp" alt="Image 2 - File size comparison" width="714" height="228" /> Image 2 - File size comparison
The CSV and Feather files take up much more storage space than the RDS and RData files: the CSV file is roughly 6 times larger and the Feather file roughly 4 times larger. This is expected, because <code>save</code> and <code>saveRDS</code> compress their output by default.
<blockquote>Can you write R in... Excel? 
<a href="https://appsilon.com/r-and-excel/">Without any trouble - Here's our detailed guide</a>.</blockquote>
<h2 id="benchmark">R Fast Data Loading - Benchmark and Results</h2>
We will use the <code class="highlighter-rouge">microbenchmark</code> package to compare read times over 10 runs of each of the following functions:
<ul><li>utils::read.csv</li><li>readr::read_csv</li><li>data.table::fread</li><li>base::load</li><li>base::readRDS</li><li>feather::read_feather</li></ul>
Here's the entire code snippet you need to run the benchmark:
<pre><code class="language-r">library(microbenchmark)

benchmark <- microbenchmark(
  readCSV = utils::read.csv(path_csv),
  readrCSV = readr::read_csv(path_csv, progress = FALSE),
  fread = data.table::fread(path_csv, showProgress = FALSE),
  loadRdata = base::load(path_rdata),
  readRds = base::readRDS(path_rds),
  readFeather = feather::read_feather(path_feather),
  times = 10  # run each expression 10 times
)
print(benchmark, signif = 2)</code></pre>
Below you'll find the results obtained on an M1 Pro 16" MacBook Pro: <img class="size-full wp-image-13281" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b27066ebfca615e7a81a41_3.webp" alt="Image 3 - Benchmark results" width="824" height="342" /> Image 3 - Benchmark results
And the <strong>winner</strong> is… Feather! <code class="highlighter-rouge">load</code> and <code class="highlighter-rouge">readRDS</code> take second and third place in terms of speed and have the added benefit of storing a smaller, compressed file. Keep in mind that all three options require converting the CSV file to the respective format first. When it comes to reading directly from CSV, <code class="highlighter-rouge">fread</code> significantly beats <code class="highlighter-rouge">read_csv</code> and <code class="highlighter-rouge">read.csv</code>, and is therefore the best option for reading a CSV file. 
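The two base R functions behave differently once you use them outside of a benchmark, which is worth knowing before picking one. The sketch below is a minimal illustration that uses a small hypothetical data frame (<code>small_df</code>) and temporary files rather than the benchmark dataset: <code>load</code> restores the object under its original name and only returns that name, while <code>readRDS</code> returns the object itself, so you can assign it to any name you like.
<pre><code class="language-r"># Hypothetical toy data; temporary files stand in for path_rdata / path_rds
small_df <- data.frame(id = 1:3, label = c("a", "b", "c"))
tmp_rdata <- tempfile(fileext = ".RData")
tmp_rds <- tempfile(fileext = ".rds")

save(small_df, file = tmp_rdata)
saveRDS(small_df, tmp_rds)

# load() recreates `small_df` in the target environment and
# (invisibly) returns the names of the restored objects
env <- new.env()
restored_names <- load(tmp_rdata, envir = env)

# readRDS() returns the object itself
from_rds <- readRDS(tmp_rds)

# Both round trips should give back the original data frame
same_after_load <- identical(env[[restored_names]], small_df)
same_after_rds <- identical(from_rds, small_df)</code></pre>
Loading into a separate environment, as done here with <code>envir = env</code>, is one way to avoid silently overwriting an existing object that has the same name.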
<blockquote>Supercharge your R Shiny dashboards with <a href="https://appsilon.com/apache-arrow-in-r-supercharge-r-shiny-dashboards/" target="_blank" rel="noopener noreferrer">10x faster data loading with Apache Arrow in R</a>.</blockquote>
Ultimately, we chose to work with Feather files. The CSV to Feather conversion is quick, and we did not have a strict limit on storage space. Had storage been constrained, the RDS or RData format would probably have been a more appropriate choice. The final workflow was:
<ol><li>Reading a CSV file provided by our customer using <code class="highlighter-rouge">fread</code>,</li><li>Writing it to Feather using <code class="highlighter-rouge">write_feather</code>, and</li><li>Loading a Feather file on app initialization using <code class="highlighter-rouge">read_feather</code>.</li></ol>
The first two steps were done once, outside of the Shiny app context.
There is also quite an interesting benchmark done by <a href="https://gist.github.com/hadley/6353939" target="_blank" rel="noopener noreferrer">Hadley on reading complete files to R</a>. Please note that the functions defined there return a character-type object, so you will have to apply string manipulations to obtain a regular <strong>data frame</strong>.
As an <a class="c-link" href="https://wordpress.appsilon.com/appsilon-data-science-is-now-an-rstudio-full-service-certified-partner/" target="_blank" rel="noopener noreferrer" aria-describedby="slack-kit-tooltip">RStudio Full Service Certified Partner</a>, our team at Appsilon is ready to help if you run into any issues, and to answer your questions about loading data into R and other topics related to R Shiny, Data Analytics, and Machine Learning. We're experts in this area, and we'd love to chat - <a href="https://appsilon.com/#contact">you can reach out to us at any time</a>. 
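As a recap, the three-step workflow described above could be sketched roughly as follows. This is a minimal illustration rather than our production code: the paths, the toy data, and the variable names (<code>csv_path</code>, <code>feather_path</code>, <code>client_data</code>, <code>app_data</code>) are hypothetical, and temporary files stand in for the CSV file delivered by the client.
<pre><code class="language-r">library(data.table)
library(feather)

# Hypothetical paths; a tiny generated CSV stands in for the client's file
csv_path <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")
write.csv(data.frame(x = 1:5, y = letters[1:5]), csv_path, row.names = FALSE)

# Steps 1 and 2: run once, outside of the Shiny app
client_data <- fread(csv_path, showProgress = FALSE)
write_feather(client_data, feather_path)

# Step 3: run on app initialization
app_data <- read_feather(feather_path)</code></pre>
Keeping the conversion in a separate one-off script means the app itself only ever pays the cost of the fastest read.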
