Hands-on R and dplyr - Analyzing the Gapminder Dataset

<h2><span data-preserver-spaces="true">Exploratory Data Analysis With dplyr</span></h2> <span data-preserver-spaces="true">When it comes to data analysis in R, you should look no further than the <code>dplyr</code> package. It's an excellent all-rounder - providing you with extensive drill-down abilities while keeping the coding clean and minimal.</span> <blockquote><span data-preserver-spaces="true">Are you completely new to R? </span><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Check out what you can do with the language</span></a><span data-preserver-spaces="true">.</span></blockquote> <span data-preserver-spaces="true">Today you'll learn how to do exploratory data analysis on the well-known </span><em><span data-preserver-spaces="true">Gapminder</span></em><span data-preserver-spaces="true"> dataset. It contains historical (1952-2007) data on various indicators, such as life expectancy and GDP, for countries worldwide.</span> <span data-preserver-spaces="true">The article is structured as follows:</span> <ul><li><a href="#data-loading">Dataset Loading and Basic Exploration</a></li><li><a href="#summaries">Data Summaries</a></li><li><a href="#derived-columns">Creating Derived Variables and Testing Assumptions</a></li><li><a href="#advanced">Advanced Analysis</a></li><li><a href="#conclusion">Conclusion</a></li></ul> <h2 id="data-loading"><span data-preserver-spaces="true">Dataset Loading and Basic Exploration</span></h2> <span data-preserver-spaces="true">If you're following along, you'll need to have two packages installed - <code>dplyr</code> and <code>gapminder</code>. 
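</span> <span data-preserver-spaces="true">If the packages aren't on your machine yet, a minimal sketch of the installation step could look like the snippet below. It assumes you install from CRAN with your default repository settings, which the article itself doesn't specify:</span>
<pre><code class="language-r"># One-time setup (assumption: both packages are installed from CRAN)
install.packages(c("dplyr", "gapminder"))
</code></pre>
<span data-preserver-spaces="true">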
Once installed, you can import them with the following code:</span> <script src="https://gist.github.com/darioappsilon/0a21b79621951fc12ffd877ae784d767.js"></script> <span data-preserver-spaces="true">A call to the <code>head()</code> function will show the first six rows of the dataset:</span> <img class="size-full wp-image-6548" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d577303d514718a0c32f_98e2be90_1.webp" alt="Image 1 - First six rows of the Gapminder dataset" width="904" height="302" /> Image 1 - First six rows of the Gapminder dataset <span data-preserver-spaces="true">You now have everything loaded, which means you can begin with the analysis. </span> <span data-preserver-spaces="true">Let's start with something simple. For example, let's say you want to get the records for the United States for 1997, 2002, and 2007. To get these, you'll have to filter the dataset by continent, country, and year. It can all be done in a single <code>filter()</code> call:</span> <script src="https://gist.github.com/darioappsilon/a4abc15f38fd72e58fe34eef9b2ac873.js"></script> <span data-preserver-spaces="true">The results are shown in the following image:</span> <img class="size-full wp-image-6549" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5780ca3d363ce9671fd_9ff6f5ed_2.webp" alt="Image 2 - United States records for 1997, 2002, and 2007" width="956" height="186" /> Image 2 - United States records for 1997, 2002, and 2007 <span data-preserver-spaces="true">So, what happened here? As you can see, you can use the <code>filter()</code> function to keep only the records of interest. If you need an exact match, use the <code>==</code> operator. If a column may match any of several values, use the <code>%in%</code> operator. 
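</span> <span data-preserver-spaces="true">In case the embedded snippets don't render for you, here is a rough sketch of what the loading and filtering steps could look like. It isn't necessarily identical to the original gists; it only relies on the column names that ship with the <code>gapminder</code> package (<code>country</code>, <code>continent</code>, <code>year</code>):</span>
<pre><code class="language-r">library(dplyr)
library(gapminder)

# Preview the first six rows of the dataset
head(gapminder)

# Exact matches use ==, a set of allowed values uses %in%
gapminder %>%
  filter(
    continent == "Americas",
    country == "United States",
    year %in% c(1997, 2002, 2007)
  )
</code></pre>
<span data-preserver-spaces="true">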
As simple as that.</span> <h2 id="summaries"><span data-preserver-spaces="true">Data Summaries</span></h2> <span data-preserver-spaces="true">Summary statistics are a great starting point in any exploratory data analysis. They enable you to find a value that best describes a sample of data or a list of values that best represents each subset of the sample.</span> <span data-preserver-spaces="true">A simple average is a good place to start. Here's how you can find the average life expectancy in the United States for 2007:</span> <script src="https://gist.github.com/darioappsilon/4c3998c59247097f953e0f769312d5e3.js"></script> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6550" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d579a6541c6fef54de30_258d8170_3.webp" alt="Image 3 - Average life expectancy for the United States in 2007" width="292" height="114" /> Image 3 - Average life expectancy for the United States in 2007 <span data-preserver-spaces="true">Let's take this a step further and calculate the average life expectancy per continent in 2007. You'll need to use the <code>group_by()</code> function to do so:</span> <script src="https://gist.github.com/darioappsilon/ca215d5cfce452d347053055984752be.js"></script> <span data-preserver-spaces="true">The results are shown in the following image:</span> <img class="size-full wp-image-6551" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d579b6953e1bff27272a_f6f10381_4.webp" alt="Image 4 - Average life expectancy per continent in 2007" width="366" height="266" /> Image 4 - Average life expectancy per continent in 2007 <span data-preserver-spaces="true">If you're anything like me, you'll find the above information useful but not presented in the best way. We're dealing with average life expectancy - meaning higher is better. 
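</span> <span data-preserver-spaces="true">The two averages above could be computed roughly as follows. This is a sketch rather than the exact code from the embedded gists, and the summary column name <code>avgLifeExp</code> is an assumption made for illustration:</span>
<pre><code class="language-r">library(dplyr)
library(gapminder)

# Average life expectancy in the United States for 2007 (a single row)
gapminder %>%
  filter(country == "United States", year == 2007) %>%
  summarize(avgLifeExp = mean(lifeExp))

# Average life expectancy per continent in 2007 (one row per continent)
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp))
</code></pre>
<span data-preserver-spaces="true">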
With that in mind, it's good practice to sort the results in descending order.</span> <span data-preserver-spaces="true">Let's see how with a slightly different example. The code below sorts continents by their total population:</span> <script src="https://gist.github.com/darioappsilon/4fe0e4290480ed53cc21191a12107b17.js"></script> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6552" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d57a376b1fc40b735cbe_75546b13_5.webp" alt="Image 5 - Total population per continent" width="366" height="262" /> Image 5 - Total population per continent <span data-preserver-spaces="true">You now know how to calculate basic summary statistics - an essential part of any data analysis. Next, you'll learn how to create derived columns and test assumptions.</span> <h2 id="derived-columns"><span data-preserver-spaces="true">Creating Derived Variables and Testing Assumptions</span></h2> <span data-preserver-spaces="true">A derived column is a column that isn't part of the original data - it's usually computed by combining values from several existing columns. For example, you could calculate the total GDP of a country by multiplying GDP per capita by the country's population.</span> <span data-preserver-spaces="true">Let's do just that in code. The <code>mutate()</code> function is used to calculate derived columns. 
It uses the following syntax: <code>newColumn = your_calculation</code>:</span> <script src="https://gist.github.com/darioappsilon/dd04f085543a0467b0c051463aae27ed.js"></script> <span data-preserver-spaces="true">The results are shown in the image below:</span> <img class="size-full wp-image-6553" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d57bbebaed0e3b8bfad8_ffe8727b_6.webp" alt="Image 6 - Total GDP per country/year combination" width="1156" height="494" /> Image 6 - Total GDP per country/year combination <span data-preserver-spaces="true">Let's apply this knowledge to something useful - testing assumptions. We assume that higher GDP per capita values lead to higher life expectancy. Keep in mind that we're not doing formal hypothesis testing here - but instead examining the results and eyeballing if they make sense for our assumption.</span> <span data-preserver-spaces="true">To test the assumption, you'll calculate the percentiles from the <code>lifeExp</code> column. This will tell you what percentage of the countries have an identical or lower life expectancy than the current country:</span> <script src="https://gist.github.com/darioappsilon/d4cd9692f58868a2749772a1c84a4a4e.js"></script> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6554" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d57b88e64fe97f78afeb_35a8098e_7.webp" alt="Image 7 - Life expectancy percentile sorted descendingly by GDP per capita" width="1188" height="494" /> Image 7 - Life expectancy percentile sorted descendingly by GDP per capita <span data-preserver-spaces="true">From the above image, you can see countries sorted by GDP per capita and their respective life expectancy percentile on the right. All of the countries are well above the median (50th percentile), with the lowest one being at the 68th percentile. 
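</span> <span data-preserver-spaces="true">As a rough sketch, the derived columns discussed in this section could be built as shown below. The percentile is computed here with <code>cume_dist()</code>, which returns the share of values that are identical or lower; the original gist may well use a different function. The restriction to 2007 and the column names <code>totalGDP</code> and <code>percentile</code> are assumptions made for illustration:</span>
<pre><code class="language-r">library(dplyr)
library(gapminder)

# Derived column: total GDP = GDP per capita * population
gapminder %>%
  mutate(totalGDP = gdpPercap * pop)

# Life expectancy percentile (assumed: within 2007 only),
# shown for the ten countries with the highest GDP per capita
gapminder %>%
  filter(year == 2007) %>%
  mutate(percentile = cume_dist(lifeExp) * 100) %>%
  arrange(desc(gdpPercap)) %>%
  select(country, gdpPercap, lifeExp, percentile) %>%
  head(10)
</code></pre>
<span data-preserver-spaces="true">Swapping <code>arrange(desc(gdpPercap))</code> for <code>arrange(gdpPercap)</code> shows the other end of the ranking.</span>
<span data-preserver-spaces="true">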
</span> <span data-preserver-spaces="true">Before you can "verify" the above claim, you'll have to look at the other end - are countries with the lowest GDP per capita located near the lowest percentiles? </span> <span data-preserver-spaces="true">You'll only need to sort the dataset in ascending order:</span> <script src="https://gist.github.com/darioappsilon/95736435136ab74506607098a80c9668.js"></script> <span data-preserver-spaces="true">The results are shown in the image below:</span> <img class="size-full wp-image-6555" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d57c14d9d20d74931148_aa56b61e_8.webp" alt="Image 8 - Life expectancy percentile sorted ascendingly by GDP per capita" width="1296" height="494" /> Image 8 - Life expectancy percentile sorted ascendingly by GDP per capita <span data-preserver-spaces="true">Yes - the results are consistent with our assumption. Once again, this wasn't a formal hypothesis test, but rather a quick sanity check of a simple assumption.</span> <h2 id="advanced"><span data-preserver-spaces="true">Advanced Analysis</span></h2> <span data-preserver-spaces="true">The term "advanced" is a bit abstract in data analysis, to say the least. If you're fluent in R and <code>dplyr</code> and have a couple of years of experience, there's virtually nothing you can't do, so nothing seems to be advanced. On the other hand, even the most basic filtering and aggregating may seem like a big deal if you're starting out.</span> <span data-preserver-spaces="true">For that reason, this section treats the term "advanced" as providing the complete answer to a more complicated question - so multiple operations are required.</span> <span data-preserver-spaces="true">For example, let's say you have to find out the top 10 countries in the 90th percentile regarding life expectancy in 2007. 
You can reuse some of the logic from the previous sections, but answering this question alone requires several filtering and subsetting steps:</span> <script src="https://gist.github.com/darioappsilon/142dca14256d20a52f64ce6f73c949f9.js"></script> <span data-preserver-spaces="true">As you can see, the <code>filter()</code> function was used twice - the first time to select the year, and the second time to remove the records that are below the 90th percentile, since you're only interested in the top 10. The <code>top_n()</code> function keeps the n rows with the highest values of the column passed to the <code>wt</code> argument.</span> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6556" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d57c16b3ff510e383c83_4ca0bf64_9.webp" alt="Image 9 - Top 10 countries above the 90th percentile (life expectancy)" width="774" height="454" /> Image 9 - Top 10 countries above the 90th percentile (life expectancy) <span data-preserver-spaces="true">But what if you had to calculate the opposite - the worst 10 countries below the 10th percentile? The syntax is quite similar, except for the second filter and the <code>top_n()</code> call, where n is prefixed with a minus sign:</span> <script src="https://gist.github.com/darioappsilon/ec0e783b39f402585e1a4c7d2e288a19.js"></script> <span data-preserver-spaces="true">The minus prefix ensures the bottom 10 records are shown instead of the top 10:</span> <img class="size-full wp-image-6557" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d57d215de09d850dbdc5_9a9e930f_10.webp" alt="Image 10 - Worst 10 countries below the 10th percentile (life expectancy)" width="904" height="454" /> Image 10 - Worst 10 countries below the 10th percentile (life expectancy) <span data-preserver-spaces="true">And that's just enough for today. 
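</span> <span data-preserver-spaces="true">Before moving on, here is a rough sketch of the top 10 query described in this section. It isn't guaranteed to match the embedded gist: the percentile is assumed to come from <code>cume_dist()</code>, and the column name <code>percentile</code> is made up for illustration:</span>
<pre><code class="language-r">library(dplyr)
library(gapminder)

gapminder %>%
  filter(year == 2007) %>%                          # first filter: pick the year
  mutate(percentile = cume_dist(lifeExp) * 100) %>% # assumed percentile definition
  filter(percentile >= 90) %>%                      # second filter: 90th percentile and up
  top_n(10, wt = percentile) %>%                    # keep the 10 highest (ties may add rows)
  arrange(desc(percentile))
</code></pre>
<span data-preserver-spaces="true">For the bottom 10, the second filter keeps the rows under the 10th percentile instead, and <code>top_n()</code> receives <code>-10</code>. In dplyr 1.0.0 and later, <code>top_n()</code> is superseded by <code>slice_max()</code> and <code>slice_min()</code>, so <code>slice_max(percentile, n = 10)</code> is an alternative worth considering.</span>
<span data-preserver-spaces="true">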
Let's wrap things up in the next section.</span> <h2 id="conclusion"><span data-preserver-spaces="true">Conclusion</span></h2> <span data-preserver-spaces="true">Today you've learned how to use the <code>dplyr</code> package for exploratory data analysis. The quality of the analysis depends much on the quality of your questions, so make sure to ask the right questions first. If you know how to do that, analysis shouldn't be too much of a trouble.</span> <span data-preserver-spaces="true">If you want to learn more about data analysis and everything R-related, stay tuned to the Appsilon blog. Also, make sure to subscribe to our newsletter, so you never miss an update.</span>  <h2><span data-preserver-spaces="true">Learn More</span></h2><ul><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-dplyr-tutorial/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">How to Analyze Data with R: A Complete Beginner Guide to dplyr</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/introduction-to-sql/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Introduction to SQL: 5 Key Concepts Every Data Professional Must Know</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-rest-api/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">How to Make REST APIs with R: A Beginners Guide to Plumber</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Machine Learning with R: A Complete Guide to Linear Regression</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-logistic-regression/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Machine Learning with R: A Complete Guide to Logistic Regression</span></a></li></ul>