How to Analyze Data with R: A Complete Beginner Guide to dplyr
<em><strong>Updated</strong>: September 1, 2022.</em> <span data-preserver-spaces="true">Datasets often require many work hours to understand fully. R makes this process as easy as possible through the <code>dplyr</code> package - the easiest solution for code-based data analysis. You'll learn how to use it today.</span> <blockquote><span data-preserver-spaces="true">Are you completely new to R? </span><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Here's our beginner R guide for programmers</span></a><span data-preserver-spaces="true">.</span></blockquote> <span data-preserver-spaces="true">You'll use the </span><a class="editor-rtfLink" href="https://cran.r-project.org/web/packages/gapminder/index.html" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Gapminder dataset</span></a><span data-preserver-spaces="true"> throughout the article. It's available through CRAN, so make sure to install it. Here's how to load in all required packages:</span> <pre><code class="language-r">library(dplyr) library(gapminder) <br>head(gapminder)</code></pre> <span data-preserver-spaces="true">Here's what the first couple of rows of the Gapminder dataset looks like:</span> <img class="size-full wp-image-6353" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958285545dccd9bfcfb6_1.webp" alt="Image 1 - Gapminder dataset head" width="625" height="211" /> Image 1 - Gapminder dataset head <span data-preserver-spaces="true">And that's all you need to start analyzing.</span> <span data-preserver-spaces="true">Today you'll learn about:</span> <ul><li><a href="#column-selection">Column Selection</a></li><li><a href="#filtering">Data Filtering</a></li><li><a href="#ordering">Data Ordering</a></li><li><a href="#derived-columns">Creating Derived Columns</a></li><li><a href="#summary-statistics">Calculating Summary Statistics</a></li><li><a href="#grouping">Grouping</a></li><li><a href="#data-slicing">Data Slicing</a></li></ul> <hr /> <h2 id="column-selection"><span data-preserver-spaces="true">Column Selection in R dplyr</span></h2> <span data-preserver-spaces="true">More often than not, you don't need all dataset columns for your analysis. R's <code>dplyr</code> provides a couple of ways to select columns of interest. The first one is more obvious - you pass the column names inside the <code>select()</code> function.</span> <span data-preserver-spaces="true">Here's how to use this syntax to select a couple of columns:</span> <pre><code class="language-r">gapminder %>% select(country, year, pop)</code></pre> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6354" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b39583002b24af79ba3a11_2.webp" alt="Image 2 - Column selection method 1" width="330" height="333" /> Image 2 - Column selection method 1 <span data-preserver-spaces="true">But what if you have dozens of columns and want to select all but a few? There's a better way - specify the columns you don't need with a minus sign (-) as a prefix:</span> <pre><code class="language-r">gapminder %>% select(-continent)</code></pre> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6355" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b39584002b24af79ba3a61_3.webp" alt="Image 3 - Column selection method 2 " width="526" height="336" /> Image 3 - Column selection method 2 <span data-preserver-spaces="true">As you can see, the </span><em><span data-preserver-spaces="true">continent</span></em><span data-preserver-spaces="true"> column is the only one that isn't shown. And that's all you should know about column selection. Let's proceed with data filtering.</span> <h2 id="filtering"><span data-preserver-spaces="true">Data Filtering</span></h2> <span data-preserver-spaces="true">Filtering datasets is one of the most common operations you'll do on your job. Not all data is relevant at a given time. Sometimes you need values for a particular product or its sales figures in Q1. Or both. That's where the <code>filter()</code> function comes in handy.</span> <span data-preserver-spaces="true">Here's how to display results only for 2007:</span> <pre><code class="language-r">gapminder %>% select(country, year, lifeExp) %>% filter(year == 2007)</code></pre> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6356" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b39584002b24af79ba3a94_4.webp" alt="Image 4 - Data filtering example - year = 2007" width="321" height="333" /> Image 4 - Data filtering example - year = 2007 <span data-preserver-spaces="true">You can nest multiple filter conditions inside a single <code>filter()</code> function. Just make sure to separate the conditions by a comma. Here's how to select a record for Poland in 2007:</span> <pre><code class="language-r">gapminder %>% select(country, year, lifeExp) %>% filter(year == 2007, country == "Poland")</code></pre> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6357" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b39585d772ec361ef22f46_5.webp" alt="Image 5 - Data filtering example - year = 2007, country = Poland" width="262" height="77" /> Image 5 - Data filtering example - year = 2007, country = Poland <span data-preserver-spaces="true">But what if you want results for multiple countries? You can use the <code>%in%</code> keyword for the task. The snippet below shows records for 2007 for Poland and Croatia:</span> <pre><code class="language-r">gapminder %>% select(country, year, lifeExp) %>% filter(year == 2007, country %in% c("Poland", "Croatia"))</code></pre> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6358" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395aaf32c7f090d66188b_6.webp" alt="Image 6 - Data filtering example - year = 2007, country = (Poland, Croatia)" width="257" height="98" /> Image 6 - Data filtering example - year = 2007, country = (Poland, Croatia) <span data-preserver-spaces="true">If you understand these examples, you understand data filtering. Let's continue with data ordering.</span> <h2 id="ordering"><span data-preserver-spaces="true">Data Ordering</span></h2> <span data-preserver-spaces="true">Sometimes you want your data ordered by a specific column(s) value. For example, you might want to sort users by age or students by score, either in ascending or descending order. You can easily implement this behavior with <code>dplyr</code> - with its built-in <code>arrange()</code> function.</span> <span data-preserver-spaces="true">Here's how to arrange the results by life expectancy:</span> <pre><code class="language-r">gapminder %>% select(country, year, lifeExp) %>% filter(year == 2007) %>% arrange(lifeExp)</code></pre> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6359" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395aae537a8d80f3646bd_7.webp" alt="Image 7 - Data ordering example 1 " width="456" height="334" /> Image 7 - Data ordering example 1 <span data-preserver-spaces="true">As you can see, data is ordered by the </span><em><span data-preserver-spaces="true">lifeExp</span></em><span data-preserver-spaces="true"> column ascendingly. Most cases require descending ordering. Here's how you can implement it:</span> <pre><code class="language-r">gapminder %>% select(country, year, lifeExp) %>% filter(year == 2007) %>% arrange(desc(lifeExp))</code></pre> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6360" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395abd1554b1f3f91958e_8.webp" alt="Image 8 - Data ordering example 2 " width="368" height="336" /> Image 8 - Data ordering example 2 <span data-preserver-spaces="true">Sometimes you want only a couple of rows returned. The <code>top_n()</code> function lets you specify how many rows should be displayed. Here's an example:</span> <pre><code class="language-r">gapminder %>% select(country, year, lifeExp) %>% filter(year == 2007) %>% arrange(desc(lifeExp)) %>% top_n(5)</code></pre> <span data-preserver-spaces="true">The results are shown in the following image:</span> <img class="size-full wp-image-6361" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395abbd4d8987eae3973c_9.webp" alt="Image 9 - Data ordering example 9" width="354" height="180" /> Image 9 - Data ordering example 9 <span data-preserver-spaces="true">And that's it with regards to the ordering. Next up - derived columns.</span> <h2 id="derived-columns"><span data-preserver-spaces="true">Creating Derived Columns in R dplyr</span></h2> <span data-preserver-spaces="true">With R <code>dplyr</code>, you can use the <code>mutate()</code> function to create new attributes. The new attribute name is put on the left side of the equal sign, and the contents on the right - just as if you were to declare a variable.</span> <span data-preserver-spaces="true">The example below calculates GDP as a product of population and GDP per capita and stores it in a dedicated column. Some other transformations are made along the way:</span> <pre><code class="language-r">gapminder %>% select(country, year, pop, gdpPercap) %>% filter(year == 2007) %>% mutate(gdp = pop * gdpPercap) %>% arrange(desc(gdp)) %>% top_n(5)</code></pre> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6362" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958be537a8d80f362dfb_10.webp" alt="Image 10 - Calculating GDP as (population * GDP per capita)" width="548" height="182" /> Image 10 - Calculating GDP as (population * GDP per capita) <span data-preserver-spaces="true">Instead of <code>mutate()</code>, you can also use <code>transmute()</code>. There's one severe difference - <code>transmute()</code> keeps only the derived column. Let's use it in the example from above:</span> <pre><code class="language-r">gapminder %>% select(country, year, pop, gdpPercap) %>% filter(year == 2007) %>% transmute(gdp = pop * gdpPercap) %>% arrange(desc(gdp)) %>% top_n(5)</code></pre> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6363" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d485545dccd9c01d0f_11.webp" alt="Image 11 - Calculating GDP with transmute() - all other columns are dropped" width="108" height="180" /> Image 11 - Calculating GDP with transmute() - all other columns are dropped <span data-preserver-spaces="true">You'll use <code>mutate()</code> more often, but knowing additional functions can't hurt.</span> <h2 id="summary-statistics"><span data-preserver-spaces="true">Calculating Summary Statistics</span></h2> <span data-preserver-spaces="true">Summary statistics don't need any introduction. In many cases, you need to calculate a simple average of a column. Here's how to calculate average life expectancy among the entire dataset:</span> <pre><code class="language-r">gapminder %>% summarize(avgLifeExp = mean(lifeExp))</code></pre> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6365" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d4c148bf3d53e86246_12.webp" alt="Image 12 - Calculating average life expectancy of the entire dataset" width="140" height="76" /> Image 12 - Calculating average life expectancy of the entire dataset <span data-preserver-spaces="true">As you would imagine, you can chain other functions to calculate summary statistics only on a subset. Here's how to calculate the average life expectancy in 2007 in Europe:</span> <pre><code class="language-r">gapminder %>% filter(year == 2007, continent == "Europe") %>% summarize(avgLifeExp = mean(lifeExp))</code></pre> <span data-preserver-spaces="true">The results are shown in the following image:</span> <img class="size-full wp-image-6364" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d5c3804661abf9d610_13.webp" alt="Image 13 - Calculating average life expectancy for Europe in 2007" width="142" height="76" /> Image 13 - Calculating average life expectancy for Europe in 2007 <span data-preserver-spaces="true">You can do much more with summary statistics, but that requires some grouping knowledge. Let's cover that next.</span> <h2 id="grouping"><span data-preserver-spaces="true">Grouping in R dplyr</span></h2> <span data-preserver-spaces="true">Summary statistics become much more powerful when combined with grouping. For example, you can use the <code>group_by()</code> function to calculate the average life expectancy per continent. Here's how:</span> <pre><code class="language-r">gapminder %>% filter(year == 2007) %>% group_by(continent) %>% summarize(avgLifeExp = mean(lifeExp))</code></pre> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6366" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d6ebbe9e2d100c85e1_14.webp" alt="Image 14 - Calculating average life expectancy per continent" width="244" height="177" /> Image 14 - Calculating average life expectancy per continent <span data-preserver-spaces="true">You can also use the previously discussed ordering functions to arrange the dataset by average life expectancy. Here's how to do so in a descending way:</span> <pre><code class="language-r">gapminder %>% filter(year == 2007) %>% group_by(continent) %>% summarize(avgLifeExp = mean(lifeExp)) %>% arrange(desc(avgLifeExp))</code></pre> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6367" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d661538160aca2dcab_15.webp" alt="Image 15 - Ordering dataset by average life expectancy per continent" width="249" height="182" /> Image 15 - Ordering dataset by average life expectancy per continent <span data-preserver-spaces="true">One other powerful function is <code>if_else()</code>. You can use it when creating new columns whose value depends on some conditions.</span> <span data-preserver-spaces="true">For example, here's how to create a column named </span><em><span data-preserver-spaces="true">over75</span></em><span data-preserver-spaces="true">, which has a value of </span><em><span data-preserver-spaces="true">Y</span></em><span data-preserver-spaces="true"> if the average life expectancy for a continent is over 75, and </span><em><span data-preserver-spaces="true">N</span></em><span data-preserver-spaces="true"> otherwise:</span> <pre><code class="language-r">gapminder %>% filter(year == 2007) %>% group_by(continent) %>% summarize(avgLifeExp = mean(lifeExp)) %>% mutate(over75 = if_else(avgLifeExp > 75, "Y", "N"))</code></pre> <span data-preserver-spaces="true">The results are shown in the following image:</span> <img class="size-full wp-image-6368" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d7a3cfbe0cd7a8bf23_16.webp" alt="Image 16 - Using if_else() upon attribute creation" width="323" height="180" /> Image 16 - Using if_else() upon attribute creation <span data-preserver-spaces="true">And that's all you should know about grouping! Let's cover data slicing next.</span> <h2 id="data-slicing">Data Slicing</h2> This is yet another essential concept in R dplyr. Essentially, it allows you to index the dataset rows using a bunch of convenient helper functions. For example, you can use the raw <code>slice()</code> function to select the first N rows, as shown below: <pre><code class="language-r">gapminder %>% slice(1)</code></pre> Here are the results - it's just the first row of the dataset: <img class="size-full wp-image-15435" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d7c1aa2c675c9d5b4b_17.webp" alt="Image 17 - Slicing one row" width="1014" height="130" /> Image 17 - Slicing one row You can also use the <code>slice_head()</code> function to be a bit more verbose. The function displays the first N rows of the dataset, hence the "head" name: <pre><code class="language-r">gapminder %>% slice_head(n = 3)</code></pre> <img class="size-full wp-image-15437" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d8e916ea5c5e9a1472_18.webp" alt="Image 18 - Slicing the first 3 rows" width="1038" height="210" /> Image 18 - Slicing the first 3 rows The <code>slice_tail()</code> function does the same, but for the other end of the dataset. It will display the bottom N rows instead: <pre><code class="language-r">gapminder %>% slice_tail(n = 3)</code></pre> <img class="size-full wp-image-15439" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d8455ba7eb54dc1f70_19.webp" alt="Image 19 - Slicing the last 3 rows" width="1000" height="210" /> Image 19 - Slicing the last 3 rows If your dataset is sorted by default, neither of the previous two functions will give a representative sample. There's another slicing function in R dplyr - <code>slice_sample()</code>, and it'll display a random sample of N rows: <pre><code class="language-r">gapminder %>% slice_sample(n = 5)</code></pre> <img class="size-full wp-image-15447" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270c009eda01ce1846287_20-1.webp" alt="Image 20 - Slicing a random sample of 5 rows" width="1178" height="294" /> Image 20 - Slicing a random sample of 5 rows There's also a handy <code>slice_min()</code> function. Its job is to extract the N rows with the lowest value for a specified column. Here's an example - the dataset is firstly filtered to keep only the most recent records, and then 5 rows with the lowest life expectancy are extracted: <pre><code class="language-r">gapminder %>% filter(year == 2007) %>% slice_min(lifeExp, n = 5)</code></pre> <img class="size-full wp-image-15441" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395d90a558ace7ac06f2d_20.webp" alt="Image 21 - Slicing the bottom 5 rows by life expectancy" width="1052" height="298" /> Image 21 - Slicing the bottom 5 rows by life expectancy As you would expect, there's also a <code>slice_max()</code> function. It does just the opposite - we think you can get the gist: <pre><code class="language-r">gapminder %>% filter(year == 2007) %>% slice_max(lifeExp, n = 5)</code></pre> <img class="size-full wp-image-15443" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395da87211fe5116f69a5_21.webp" alt="Image 22 - Slicing the top 5 rows by life expectancy" width="1140" height="294" /> Image 22 - Slicing the top 5 rows by life expectancy And that's how you can slice your dataset in R dplyr. Let's wrap things up next. <hr /> <h2><span data-preserver-spaces="true">Summary of R dplyr</span></h2> <span data-preserver-spaces="true">Today you've learned how to analyze data with R's <code>dplyr</code>. It's one of the most developer-friendly packages out there, way simpler than it's Python competitor - Pandas. </span> <span data-preserver-spaces="true">You should be able to analyze and prepare any type of dataset after reading this article. You can do more advanced things, of course, but often these are just combinations of the things you've learned today.</span> <h2><span data-preserver-spaces="true">Learn More</span></h2><ul><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">What Can I Do With R? 6 Essential R Packages for Programmers</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Machine Learning with R: A Complete Guide to Linear Regression</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/ggplot2-bar-charts/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">How to Make Stunning Bar Charts in R: A Complete Guide with ggplot2</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/ggplot2-line-charts/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">How to Make Stunning Line Charts in R: A Complete Guide with ggplot2</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/ggplot-scatter-plots/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">How to Make Stunning Scatter Plots in R: A Complete Guide with ggplot2</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/leaflet-geomaps/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">How to Make Stunning Geomaps in R: A Complete Guide with Leaflet</span></a></li></ul>