Python Pandas vs. R dplyr - Which Data Analysis Library is the best for 2022

<h2><span data-preserver-spaces="true">Pandas vs. dplyr</span></h2> <strong>Updated</strong>: March 9, 2022. <span data-preserver-spaces="true">Python vs. R? Pandas vs. dplyr? It's difficult to find the ultimate go-to library for data analysis. Both R and Python provide excellent options, so the question quickly becomes "which data analysis library is the most convenient". Today's article aims to answer this question, assuming you're equally skilled in both languages.</span> <blockquote><span data-preserver-spaces="true">Looking for more Python and R comparisons? </span><a class="editor-rtfLink" href="https://appsilon.com/dash-vs-shiny/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Check out our Python Dash vs. R Shiny comparison</span></a><span data-preserver-spaces="true">.</span></blockquote> <span data-preserver-spaces="true">As mentioned earlier, this article assumes you are equally skilled in both R and Python. If that's not the case, it's likely your decision will be biased, as people tend to approve more of the familiar technologies. We'll try to provide a completely unbiased opinion based on facts and code comparisons.</span> <span data-preserver-spaces="true">Table of contents:</span> <ul><li><a href="#data-loading">Data Loading</a></li><li><a href="#filtering">Filtering</a></li><li><a href="#summary">Summary Statistics</a></li><li><a href="#derived">Creating Derived Columns</a></li><li><a href="#plotting">Plotting</a></li><li><a href="#conclusion">Conclusion</a></li></ul> <hr /> <h2 id="data-loading"><span data-preserver-spaces="true">Data Loading</span></h2> <span data-preserver-spaces="true">There's no data analysis without data. Both Pandas and dplyr can connect to virtually any data source, and read from any file format. 
That's why we won't spend any time exploring connection options and will use a built-in dataset instead.</span> <span data-preserver-spaces="true">Here's how you can load the Pandas library and the Gapminder dataset in Python:</span> <script src="https://gist.github.com/darioappsilon/46cdef2fd7d1e71a839c2bf6e1574444.js"></script> <span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-12288" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020a26f892f05d57fc1e0_library-and-dataset-loading-with-pandas.webp" alt="Image 1 - Library and dataset loading with Pandas" width="834" height="320" /> Image 1 - Library and dataset loading with Pandas <span data-preserver-spaces="true">And here's how you can do the same with R and dplyr:</span> <script src="https://gist.github.com/darioappsilon/40de69a0f008aab789ddd19a5358134c.js"></script> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-12286" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020a3e939cce381961792_library-and-dataset-loading-with-dplyr.webp" alt="Image 2 - Library and dataset loading with dplyr" width="906" height="298" /> Image 2 - Library and dataset loading with dplyr <span data-preserver-spaces="true">There's no winner in this Pandas vs. dplyr comparison, as the syntax is nearly identical in both libraries.</span> <strong><span data-preserver-spaces="true">Winner - tie</span></strong><span data-preserver-spaces="true">.</span> <h2 id="filtering"><span data-preserver-spaces="true">Filtering</span></h2> <span data-preserver-spaces="true">This is where things get a bit more interesting. The dplyr package is well-known for its pipe operator (<code>%>%</code>), which you can use to chain operations. This operator makes data drill-downs both easy to write and to read. 
Pandas, on the other hand, has no dedicated pipe operator and typically relies on method chaining instead.</span> <span data-preserver-spaces="true">Let's go through three problems and see how both libraries compare.</span> <strong><span data-preserver-spaces="true">Problem 1</span></strong><span data-preserver-spaces="true"> - find records for the most recent year (2007).</span> <span data-preserver-spaces="true">Here's how to do so with Pandas:</span> <script src="https://gist.github.com/darioappsilon/aaf10292ae3a1e1a22ff9969bca8cb19.js"></script> <img class="size-full wp-image-12298" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020a43558cfc06df18ced_records-from-2007-pandas.webp" alt="Image 3 - Records from 2007 (Pandas)" width="1010" height="732" /> Image 3 - Records from 2007 (Pandas) <span data-preserver-spaces="true">And here's how to do the same with dplyr:</span> <script src="https://gist.github.com/darioappsilon/c922fd623fca9e9ebcbbce11028ac29c.js"></script> <img class="size-full wp-image-12296" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020a62d3baed3c5aa0707_records-from-2007-dplyr.webp" alt="Image 4 - Records from 2007 (dplyr)" width="938" height="490" /> Image 4 - Records from 2007 (dplyr) <span data-preserver-spaces="true">As you can see, both libraries are nearly equal when it comes to simple filtering. It's common to use the <code>filter()</code> function with dplyr and bracket notation with Pandas. There are other options, sure, but these are the ones you'll see most often.</span> <strong><span data-preserver-spaces="true">Problem 2</span></strong><span data-preserver-spaces="true"> - find records from the most recent year (2007) only for the Americas (North and South America).</span> <span data-preserver-spaces="true">Still a pretty simple task, but let's see the differences in code. 
Pandas comes first:</span> <script src="https://gist.github.com/darioappsilon/44b7f1082931b63914876bfe203be113.js"></script> <img class="size-full wp-image-12302" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020a7fdb8ad62e9cf2d39_records-from-2007-for-North-and-South-Americas-pandas.webp" alt="Image 5 - Records from 2007 for North and South Americas (Pandas)" width="1012" height="572" /> Image 5 - Records from 2007 for North and South Americas (Pandas) <span data-preserver-spaces="true">And here's how to do the same with dplyr:</span> <script src="https://gist.github.com/darioappsilon/daff9298f209247228b3987b0eb79b56.js"></script> <img class="size-full wp-image-12300" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020a918c3ea014df85ee3_records-from-2007-for-North-and-South-Americas-dplyr.webp" alt="Image 6 - Records from 2007 for North and South Americas (dplyr)" width="1042" height="492" /> Image 6 - Records from 2007 for North and South Americas (dplyr) <span data-preserver-spaces="true">Applying multiple filters is much easier with dplyr than with Pandas. You can separate conditions with a comma inside a single <code>filter()</code> function. Pandas requires more typing and produces code that's harder to read.</span> <strong><span data-preserver-spaces="true">Problem 3</span></strong><span data-preserver-spaces="true"> - find records from the most recent year (2007) only for the United States.</span> <span data-preserver-spaces="true">Let's add yet another filter condition. 
The Pandas library comes first:</span> <script src="https://gist.github.com/darioappsilon/afcb167d1842b7c27483db4e8a900ec0.js"></script> <img class="size-full wp-image-12306" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020aa72d4ec73045976ca_Records-from-2007-for-United-States-pandas.webp" alt="Image 7 - Records from 2007 for the United States (Pandas)" width="928" height="102" /> Image 7 - Records from 2007 for the United States (Pandas) <span data-preserver-spaces="true">And here's how to do the same with dplyr:</span> <script src="https://gist.github.com/darioappsilon/fa9ba152e51ecf9aa8673aa8fc8ec399.js"></script> <img class="size-full wp-image-12304" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020ac92a26c843b640df9_records-from-2007-for-United-States-dplyr.webp" alt="Image 8 - Records from 2007 for the United States (dplyr)" width="958" height="106" /> Image 8 - Records from 2007 for the United States (dplyr) <span data-preserver-spaces="true">In a nutshell, the Pandas version still takes more effort to write, but you can put every filter condition on a separate line to make it easier to read.</span> <strong><span data-preserver-spaces="true">Winner - dplyr</span></strong><span data-preserver-spaces="true">. A no-brainer for this Pandas vs. dplyr test. Filtering in dplyr is more intuitive and easier to read.</span> <h2 id="summary"><span data-preserver-spaces="true">Summary Statistics</span></h2> <span data-preserver-spaces="true">One of the most common data analysis tasks is calculating summary statistics, such as the sample mean. This section compares Pandas and dplyr on these tasks through three problems.</span> <strong><span data-preserver-spaces="true">Problem 1</span></strong><span data-preserver-spaces="true"> - calculate the average (mean) life expectancy worldwide in 2007.</span> <span data-preserver-spaces="true">It sounds like a trivial problem - and it is. 
Let's see how Pandas handles it.</span> <script src="https://gist.github.com/darioappsilon/27f716c25f6d716b7ff4fb775133e672.js"></script> <img class="size-full wp-image-12280" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020acbce935d21a916cb7_avg-life-expectancy-pandas.webp" alt="Image 9 - Average life expectancy worldwide in 2007 (Pandas)" width="294" height="50" /> Image 9 - Average life expectancy worldwide in 2007 (Pandas) <span data-preserver-spaces="true">Let's do the same with dplyr:</span> <script src="https://gist.github.com/darioappsilon/752f4fb74db83fc00501ea609bc9a320.js"></script> <img class="size-full wp-image-12278" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020adc8798807239f3101_avg-life-expectancy-dplyr.webp" alt="Image 10 - Average life expectancy worldwide in 2007 (dplyr)" width="298" height="122" /> Image 10 - Average life expectancy worldwide in 2007 (dplyr) <span data-preserver-spaces="true">As you can see, dplyr uses the <code>summarize()</code> function to calculate summary statistics, while Pandas relies on calling the aggregation method (such as <code>mean()</code>) directly on the column(s) of interest.</span> <strong><span data-preserver-spaces="true">Problem 2</span></strong><span data-preserver-spaces="true"> - calculate the average (mean) life expectancy in 2007 for every continent.</span> <span data-preserver-spaces="true">A slightly trickier problem, but nothing you can't handle. The solution requires a group-by operation on the column of interest. 
Here's how to do the calculation with Pandas:</span> <script src="https://gist.github.com/darioappsilon/4eb38269f9eb35d5c9fda01434f00be8.js"></script> <img class="size-full wp-image-12284" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020ae9fb8f3e5f412b523_avg-life-expectancy-per-continent-pandas.webp" alt="Image 11 - Average life expectancy per continent in 2007 (Pandas)" width="478" height="226" /> Image 11 - Average life expectancy per continent in 2007 (Pandas) <span data-preserver-spaces="true">Let's do the same with dplyr:</span> <script src="https://gist.github.com/darioappsilon/6943b00babed15be6b0e102bfc795236.js"></script> <img class="size-full wp-image-12282" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b0e939cce381961895_avg-life-expectancy-per-continent-dplyr.webp" alt="Image 12 - Average life expectancy per continent in 2007 (dplyr)" width="454" height="264" /> Image 12 - Average life expectancy per continent in 2007 (dplyr) <span data-preserver-spaces="true">As you can see, both libraries use some sort of grouping functions - <code>groupby()</code> with Pandas, and <code>group_by()</code> with dplyr, which results in a similar-looking syntax.</span> <strong><span data-preserver-spaces="true">Problem 3</span></strong><span data-preserver-spaces="true"> - calculate the total population per continent in 2007 and sort the results in descending order.</span> <span data-preserver-spaces="true">Yet another relatively simple task to do. 
Let's see how to solve it with Pandas first:</span> <script src="https://gist.github.com/darioappsilon/d44f8df8317aa2aa940637cf3825eb0b.js"></script> <img class="size-full wp-image-12318" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b184868bab9cc4b455_total-pop-per-continent-in-2007-pandas.webp" alt="Image 13 - Total population per continent in 2007 (Pandas)" width="386" height="218" /> Image 13 - Total population per continent in 2007 (Pandas) <span data-preserver-spaces="true">Let's do the same with dplyr:</span> <script src="https://gist.github.com/darioappsilon/c6682bf764d6d2d4b18dc390d64feac3.js"></script> <img class="size-full wp-image-12316" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b3f75d40a9af724544_total-pop-per-continent-in-2007-dplyr.webp" alt="Image 14 - Total population per continent in 2007 (dplyr)" width="368" height="258" /> Image 14 - Total population per continent in 2007 (dplyr) <span data-preserver-spaces="true">The sorting was the only new part of this problem. Pandas uses the <code>sort_values()</code> function with an optional <code>ascending</code> argument, while dplyr uses the <code>arrange()</code> function.</span> <strong><span data-preserver-spaces="true">Winner - tie</span></strong><span data-preserver-spaces="true">. Declaring a winner in this Pandas vs. dplyr test boils down to personal preference. Pandas seems to be a bit more cluttered, but that's due to the initial filtering. Calculating summary statistics in both is easy.</span> <h2 id="derived"><span data-preserver-spaces="true">Creating Derived Columns</span></h2> <span data-preserver-spaces="true">This is the last series of tasks in today's comparison. We'll explore how easy it is to do feature engineering in both libraries. 
There are only two problems this time.</span> <strong><span data-preserver-spaces="true">Problem 1</span></strong><span data-preserver-spaces="true"> - calculate the total GDP by multiplying population and GDP per capita.</span> <span data-preserver-spaces="true">This should be easy enough to do. Let's see the Pandas implementation first:</span> <script src="https://gist.github.com/darioappsilon/cdd790264a09e5abbdb59277c5b95b9c.js"></script> <img class="size-full wp-image-12314" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b3952f213de6d4a9b1_total-country-GDP-pandas.webp" alt="Image 15 - Calculating total country GDP (Pandas)" width="1034" height="326" /> Image 15 - Calculating total country GDP (Pandas) <span data-preserver-spaces="true">And now let's do the same with dplyr:</span> <script src="https://gist.github.com/darioappsilon/f3346c3473d892a61df579c5e9cbd948.js"></script> <img class="size-full wp-image-12312" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b5a9c531fb3be69fd9_total-country-GDP-dplyr.webp" alt="Image 16 - Calculating total country GDP (dplyr)" width="1122" height="486" /> Image 16 - Calculating total country GDP (dplyr) <span data-preserver-spaces="true">The call to the <code>head()</code> function in Pandas isn't part of the solution. It's there only to print the first few rows instead of the entire dataset. The implementation in both libraries was straightforward, to say the least.</span> <strong><span data-preserver-spaces="true">Problem 2</span></strong><span data-preserver-spaces="true"> - print the top ten countries in the 90th percentile with regard to GDP per capita.</span> <span data-preserver-spaces="true">This one is a bit trickier, but nothing you can't handle. 
Let's see how to solve it with Pandas:</span> <script src="https://gist.github.com/darioappsilon/34eb9b7ce099b60075382b195d754d32.js"></script> <img class="size-full wp-image-12310" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b683f54e0b0d35296b_top-10-countries-in-the-90th-perc-wrt-gdp-per-capita-pandas.webp" alt="Image 17 - Top 10 countries in the 90th percentile wrt GDP per capita (Pandas)" width="1308" height="586" /> Image 17 - Top 10 countries in the 90th percentile wrt GDP per capita (Pandas) <span data-preserver-spaces="true">And now let's do the same with dplyr:</span> <script src="https://gist.github.com/darioappsilon/8dd1458081fd470becc05e01296076b2.js"></script> <img class="size-full wp-image-12308" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b87554f052a38d06ca_top-10-countries-in-the-90th-perc-wrt-gdp-per-capita-dplyr.webp" alt="Image 18 - Top 10 countries in the 90th percentile wrt GDP per capita (dplyr)" width="1200" height="450" /> Image 18 - Top 10 countries in the 90th percentile wrt GDP per capita (dplyr) <span data-preserver-spaces="true">We've created an additional data frame in Pandas for convenience's sake. Still, the implementation in dplyr is much simpler and easier to read, making R's dplyr the winner of this section.</span> <strong><span data-preserver-spaces="true">Winner - dplyr</span></strong><span data-preserver-spaces="true">. Another no-brainer Pandas vs. dplyr comparison. The syntax of dplyr is much cleaner and easier to read.</span> <h2 id="plotting">Plotting</h2> We'll now take a look at how good each contestant is at visualizing data. We won't compare external libraries, e.g., Matplotlib in Python and ggplot2 in R - but only what Pandas and dplyr themselves have to offer. Let's start with Pandas. 
We'll create a small subset that contains only the data for Poland - <code>pl_gapminder</code>: <script src="https://gist.github.com/darioappsilon/e1b9b2dc12ec7e7408928af8261f4d67.js"></script> Pandas makes it easy to plot basic charts. You can call the <code>plot()</code> function on a DataFrame and Pandas will take care of the rest. <strong>Keep in mind</strong> - when the DataFrame has multiple columns, set the one you want on the x-axis as the index: <script src="https://gist.github.com/darioappsilon/4ab16186d201bbcda729b025e97a1b4d.js"></script> <img class="size-full wp-image-12292" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b8790b9e1ad88f8814_life-expectancy-over-time-in-poland.webp" alt="Image 19 - Life expectancy over time in Poland" width="942" height="664" /> Image 19 - Life expectancy over time in Poland It's not the best-looking visualization, nor is it high resolution. The plotting capabilities of Pandas are quite limited, and you're better off using a dedicated data visualization library. You can <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html" target="_blank" rel="noopener noreferrer">tweak the basics</a>, such as figure size, title, font size, and the type of the chart - but that's pretty much it: <script src="https://gist.github.com/darioappsilon/5bd226a4e2014d463f49f79fe8b4aad0.js"></script> <img class="size-full wp-image-12290" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b020b9487b2c34e797cd5d_life-expectancy-in-poland-as-a-bar-chart.webp" alt="Image 20 - Life expectancy in Poland as a bar chart" width="1780" height="1052" /> Image 20 - Life expectancy in Poland as a bar chart That's still much more than what R's dplyr has to offer. dplyr is a pure data analysis and manipulation library and has no data visualization functionality. For that reason, we have to declare Pandas the winner of this section. <strong>Winner - Pandas</strong>. But does it matter? 
You'll almost always use a dedicated data visualization library. <h2 id="conclusion"><span data-preserver-spaces="true">Conclusion</span></h2> <span data-preserver-spaces="true">According to the tests in this article, dplyr is the clear winner. Does that mean you should abandon Pandas once and for all? Well, no.</span> <span data-preserver-spaces="true">How well you solve data analysis tasks depends largely on how familiar you are with the library. If you're a big-time Pandas user, solving tasks with dplyr might feel unnatural and take more time. Use the library that'll save you the most time.</span> <span data-preserver-spaces="true">If you're equally skilled in both, these tests point to dplyr as the more convenient choice.</span> <h2><span data-preserver-spaces="true">Learn More</span></h2><ul><li><a class="editor-rtfLink" href="https://appsilon.com/7-data-science-skills/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">7 Must-Have Skills to Get a Job as a Data Scientist</span></a></li><li><a class="editor-rtfLink" href="https://appsilon.com/introduction-to-sql/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Introduction to SQL: 5 Key Concepts Every Data Professional Must Know</span></a></li><li><a class="editor-rtfLink" href="https://appsilon.com/r-dplyr-gapminder/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Hands-on R and dplyr - Analyzing the Gapminder Dataset</span></a></li><li><a class="editor-rtfLink" href="https://appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Machine Learning with R: A Complete Guide to Linear Regression</span></a></li><li><a class="editor-rtfLink" href="https://appsilon.com/r-logistic-regression/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Machine Learning with R: A Complete Guide to Logistic 
Regression</span></a></li></ul>
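If the embedded code snippets don't load for you, here is a minimal sketch of the Pandas side of the filtering, summary statistics, and derived column tasks covered in this article. It is not the code from the original snippets. It assumes a tiny hand-made DataFrame that only borrows the Gapminder column names (<code>country</code>, <code>continent</code>, <code>year</code>, <code>lifeExp</code>, <code>pop</code>, <code>gdpPercap</code>); the country labels and all numbers are placeholders, not real Gapminder values.

```python
import pandas as pd

# Placeholder rows with Gapminder-style column names. The country labels
# and numbers are made up for illustration and are NOT real Gapminder data.
df = pd.DataFrame({
    "country":   ["A", "B", "C", "A", "B", "C"],
    "continent": ["Americas", "Americas", "Europe",
                  "Americas", "Americas", "Europe"],
    "year":      [2002, 2002, 2002, 2007, 2007, 2007],
    "lifeExp":   [70.0, 72.0, 78.0, 71.0, 74.0, 80.0],
    "pop":       [100, 200, 300, 110, 210, 310],
    "gdpPercap": [1000.0, 2000.0, 3000.0, 1100.0, 2200.0, 3300.0],
})

# Filtering: combine conditions with & and wrap each one in parentheses.
recent_americas = df[(df["year"] == 2007) & (df["continent"] == "Americas")]

# Summary statistics: mean life expectancy per continent in 2007.
mean_life_exp = df[df["year"] == 2007].groupby("continent")["lifeExp"].mean()

# Total population per continent in 2007, sorted in descending order.
total_pop = (
    df[df["year"] == 2007]
    .groupby("continent")["pop"]
    .sum()
    .sort_values(ascending=False)
)

# Derived column: total GDP = population * GDP per capita.
df["totalGDP"] = df["pop"] * df["gdpPercap"]

# Rows at or above the 90th percentile of GDP per capita in 2007.
latest = df[df["year"] == 2007]
top_gdp = latest[latest["gdpPercap"] >= latest["gdpPercap"].quantile(0.9)]

print(recent_americas)
print(mean_life_exp)
print(total_pop)
print(top_gdp)
```

Wrapping each condition in parentheses is required because <code>&</code> binds more tightly than <code>==</code> in Python, which is part of why multi-condition filters feel heavier in Pandas than a comma-separated <code>filter()</code> call in dplyr.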