Intermediate SQL for Data Science - Analytical Functions Deep Dive
<h2><span data-preserver-spaces="true">Intermediate SQL for Data Science</span></h2> <span data-preserver-spaces="true">Running data queries in the database can offer significant speed improvements over doing so in R or Python. There's no need to drag the entire dataset to memory and run the calculations once the loading completes. The runtime differences can be drastic, depending on the dataset size.</span> <span data-preserver-spaces="true">That's why SQL is your best friend when it comes to larger datasets. And there's no better way to improve your SQL skills than going over a couple of intermediate concepts, wrapped in a little something called </span><strong><span data-preserver-spaces="true">analytical functions</span></strong><span data-preserver-spaces="true">. </span><span data-preserver-spaces="true">These functions perform computations over multiple rows, but they also return multiple rows. We'll go over a couple of them in this tutorial.</span> <span data-preserver-spaces="true">This article is structured as follows:</span> <ul><li><a href="#database">Database and Data Introduction</a></li><li><a href="#running-averages">Running Averages</a></li><li><a href="#first-values">First Values</a></li><li><a href="#leads-lags">Leads and Lags</a></li><li><a href="#ranking">Ranking</a></li><li><a href="#conclusion">Conclusion</a></li></ul> <h2 id="database"><span data-preserver-spaces="true">Database and Data Introduction</span></h2> <span data-preserver-spaces="true">This article assumes you have the PostgreSQL database installed and configured. It also assumes basic knowledge of SQL, so words like <code>SELECT</code>, <code>WHERE</code>, <code>BETWEEN</code>, <code>FROM</code>, and the others shouldn't feel new.</span> <span data-preserver-spaces="true">With regards to the data, we'll use a small table called ORDERS which you can download from </span><a class="editor-rtfLink" href="https://www.w3resource.com/sql/sql-table.php" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">here</span></a><span data-preserver-spaces="true">. The previous URL contains SQL code for generating and populating three tables, so please execute it before proceeding.</span> <span data-preserver-spaces="true">If you did everything correctly, the following <code>SELECT</code> statement:</span> <script src="https://gist.github.com/darioappsilon/1d36626b9f47607dbbae22e1d13b41bd.js"></script> <span data-preserver-spaces="true">Should yield the following dataset:</span> <img class="size-full wp-image-6808" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5515beedd53eb025709_5881e069_1.webp" alt="Image 1 - Orders table data" width="1027" height="290" /> Image 1 - Orders table data <span data-preserver-spaces="true">Which means you're ready to proceed.</span> <h2 id="running-averages"><span data-preserver-spaces="true">Running Averages</span></h2> <span data-preserver-spaces="true">If you have any experience with SQL, it's likely you're familiar with aggregation functions such as <code>SUM</code>, <code>AVG</code>, <code>MIN</code>, and <code>MAX</code>. It's also likely you've used them in the <code>GROUP BY</code> clause. As it turns out, you can also use them in <code>ORDER BY</code> to obtain a running total, average, minimum, or maximum.</span> <span data-preserver-spaces="true">Let's go over a concrete example to make this more clear. You want to monitor the states of your sales agent, and want to see their performance in the third quarter of 2008. To do so, you can calculate the running average revenue and the total revenue obtained.</span> <span data-preserver-spaces="true">Here's the code:</span> <script src="https://gist.github.com/darioappsilon/49b8365bd87844ff7fdb1f71ed692a5f.js"></script> <span data-preserver-spaces="true">And here are the results:</span> <img class="size-full wp-image-6809" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d552eb35c6618a0e0882_0c13f143_2.webp" alt="Image 2 - Running average and total revenue per agent in the third quarter of 2008" width="658" height="328" /> Image 2 - Running average and total revenue per agent in the third quarter of 2008 <span data-preserver-spaces="true">And that's how easy it is! Let's proceed to the next one.</span> <h2 id="first-values"><span data-preserver-spaces="true">First Values</span></h2> <span data-preserver-spaces="true">In Postgres, you can use the <code>FIRST_VALUE</code> analytical function to return the value of a specified column from the first row of the window frame. Similarly, you can use the <code>LAST_VALUE</code> and <code>NTH_VALUE</code> functions.</span> <span data-preserver-spaces="true">In our Orders table example. you could use the <code>FIRST_VALUE</code> function to check the date gap between the first and the next purchase per customer. Here's the code for doing so:</span> <script src="https://gist.github.com/darioappsilon/2a6792d1ee837f19e5c052d0af963722.js"></script> <span data-preserver-spaces="true">Here are the results:</span> <img class="size-full wp-image-6810" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5522a88faf6f30e7d2b_9a6c9a79_3.webp" alt="Image 3 - Gaps between first and the next purchase per customer" width="498" height="330" /> Image 3 - Gaps between first and the next purchase per customer <span data-preserver-spaces="true">Take a moment and think of all use cases when functionality like this could be useful. More than a handful, I'm sure.</span> <span data-preserver-spaces="true">Let's proceed to the next one.</span> <h2 id="leads-lags"><span data-preserver-spaces="true">Leads and Lags</span></h2> <span data-preserver-spaces="true">As the name suggests, the <code>LEAD</code> function fetches the value of a specific attribute from the next row and returns it in the current row. It takes two arguments:</span> <ul><li><span data-preserver-spaces="true"><code>COLUMN_NAME</code> - name of the attribute from which the next value is fetched</span></li><li><span data-preserver-spaces="true"><code>INDEX</code> - number of rows relative to the current one</span></li></ul> <span data-preserver-spaces="true">On the other hand, the <code>LAG</code> function does the opposite. It fetches the value for a column of interest from the previous <code>INDEX</code> rows.</span> <span data-preserver-spaces="true">Here's an example - we want to find out what is the last highest amount for which an order has been sold. Here's the code:</span> <script src="https://gist.github.com/darioappsilon/6ff7e6bbf35528fc569af95b846f7409.js"></script> <span data-preserver-spaces="true">And here are the results:</span> <img class="size-full wp-image-6811" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d553f9a88e2eece18b61_8a3fef6b_4.webp" alt="Image 4 - Last highest amount for which an order has been sold per agent" width="485" height="329" /> Image 4 - Last highest amount for which an order has been sold per agent <span data-preserver-spaces="true">Let's cover one more analytical function before calling it a day.</span> <h2 id="ranking"><span data-preserver-spaces="true">Ranking</span></h2> <span data-preserver-spaces="true">In PostgreSQL, you can use <code>RANK</code> and <code>DENSE_RANK</code> as numbering functions. They are here to assign an integer value to a row and are particularly useful when you have to find the nth highest or lowest record from the table.</span> <span data-preserver-spaces="true">The two functions are a bit different when it comes to assigning integer values. <code>DENSE_RANK</code> will return consecutive ranks, while <code>RANK</code> will return ranking in such a way where a rank is skipped in case of a tie.</span> <span data-preserver-spaces="true">For example, ranking with <code>DENSE_RANK</code> would return (1, 2, 2, 3), while ranking with <code>RANK</code> would return (1, 2, 2, 4) - hence a skipped rank due to a tie.</span> <span data-preserver-spaces="true">Let's see this in action - we want to find the second highest order values for each month. Here's the code:</span> <script src="https://gist.github.com/darioappsilon/a8b6b3da46b55fdddad4411428673bd0.js"></script> <span data-preserver-spaces="true">And here are the results:</span> <img class="size-full wp-image-6812" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5532a88faf6f30e7df9_2d535893_5.webp" alt="Image 5 - Second highest order values for each month" width="561" height="327" /> Image 5 - Second highest order values for each month <span data-preserver-spaces="true">And that's just enough for today. Let's wrap things up in the next section.</span> <h2 id="conclusion"><span data-preserver-spaces="true">Conclusion</span></h2> <span data-preserver-spaces="true">And there you have it - a handful of analytical functions to take your SQL and database knowledge to the next level. These are particularly useful in data science, as most of the time the worthy insights are hidden, and the only way to obtain them is through some creative data manipulation.</span> <span data-preserver-spaces="true">Analytical functions provide a perfect way for doing so. </span> <h2><span data-preserver-spaces="true">Learn More</span></h2><ul><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">What Can I Do With R? 6 Essential R Packages for Programmers</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-dplyr-tutorial/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">How to Analyze Data with R: A Complete Beginner Guide to dplyr</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-dplyr-gapminder/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Hands-on R and dplyr - Analyzing the Gapminder Dataset</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/7-data-science-skills/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">7 Must-Have Skills to Get a Job as a Data Scientist</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/tableau-vs-r-shiny/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Tableau vs. R Shiny: Which Excel Alternative is Right For You?</span></a></li></ul> <a href="https://appsilon.com/careers/" target="_blank" rel="noopener noreferrer"><img class="aligncenter size-large wp-image-6541" src="https://wordpress.appsilon.com/wp-content/uploads/2021/01/appsilon.hiring.20-1024x576.jpg" alt="" width="1024" height="576" /></a> <p style="text-align: center;"><strong><span data-preserver-spaces="true">Appsilon is hiring for remote roles! See our </span></strong><a class="editor-rtfLink" href="https://wordpress.appsilon.com/careers/" target="_blank" rel="noopener noreferrer"><strong><span data-preserver-spaces="true">Careers</span></strong></a><strong><span data-preserver-spaces="true"> page for all open positions, including a <a href="https://appsilon.com/careers/job-offer/?job=senior-react-developer-freelancer" target="_blank" rel="noopener noreferrer">Senior React Developer</a> and </span></strong><a class="editor-rtfLink" href="https://wordpress.appsilon.com/careers/job-offer/?job=r-shiny-developer" target="_blank" rel="noopener noreferrer"><strong><span data-preserver-spaces="true">Senior R Shiny Developers</span></strong></a><strong><span data-preserver-spaces="true">. Join Appsilon and work on groundbreaking projects with the world's most influential Fortune 500 companies.</span></strong></p>