ANOVA in R - How To Implement One-Way ANOVA From Scratch

If you dive deep into inferential statistics, you’re likely to come across the acronym ANOVA. It comes in many different flavors, such as one-way, two-way, multivariate, factorial, and so on. We’ll cover the simplest, one-way ANOVA today. We’ll do so from scratch, and then you’ll see how to use a built-in function to implement ANOVA in R. Let’s start with the theory and light math behind ANOVA first. Table of contents: <ul><li><a href="#anova-theory">ANOVA Theory</a></li><li><a href="#anova-from-scratch">ANOVA in R From Scratch</a></li><li><a href="#anova-in-r">ANOVA in R With a Built-In Function</a></li><li><a href="#conclusion">Conclusion</a></li></ul> <hr /> <h2 id="anova-theory">ANOVA Theory</h2> ANOVA stands for <em>Analysis of variance</em>, and it allows you to compare the means of more than two groups (levels of a factor) at the same time to determine whether any of them differ. Think of ANOVA as a T-test on steroids. A T-test allows you to compare only two groups to see if there’s any difference in the means. ANOVA extends that comparison to three or more groups. There are two main types of ANOVA: <ul><li><strong>One-way ANOVA</strong> - It evaluates the impact of a single factor (grouping variable) on a single response variable. By doing so, it determines whether all group means are the same or not. One-way ANOVA is quite limited, as it will tell you that at least one group differs from the others, but won’t tell you which one. You can extend it with a post-hoc test such as the <em>Least Significant Difference</em> test for further inspection.</li><li><strong>Two-way ANOVA</strong> - It evaluates the impact of two factors on a single response variable. You can use two-way ANOVA when you have two categorical variables (groups or factors) and a single quantitative outcome.</li></ul> Other, more advanced variations exist, such as multivariate ANOVA (MANOVA) and factorial ANOVA, but we’ll cover these some other time. 
In a nutshell, one-way ANOVA boils down to a simple hypothesis test: <ul><li><strong>H0</strong> - All group means are equal, or they don’t differ significantly.</li><li><strong>H1</strong> - At least one of the group means is different from the rest.</li></ul> We can test the hypothesis with an F-test, but doing so requires a couple of calculations. ANOVA takes into account three types of variation - the <strong>total sum of squares</strong> (SST), the <strong>sum of squares within groups</strong> (SSW), and the <strong>sum of squares between groups</strong> (SSB). You only need to calculate two, as SST = SSW + SSB. For that reason, we can define the calculation formulas as follows. <h3>SST</h3> First, we have SST which tells you how much variation there is in the dependent variable: <img class="size-full wp-image-11860" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f5eef34dd9b4d12ddca_sum-of-squares-total-formula.webp" alt="Image 1 - The sum of squares total formula" width="706" height="264" /> Image 1 - The sum of squares total formula It measures the spread of every observation around the overall mean, regardless of which group the observation belongs to. <h3>SSW</h3> Next, we have SSW - it sums the squared deviations of each observation from the mean of its own group: <img class="size-full wp-image-11862" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f5f99f56874983726de_sum-of-squares-within-groups-formula.webp" alt="Image 2 - The sum of squares within groups formula" width="610" height="256" /> Image 2 - The sum of squares within groups formula <h3>SSB</h3> And finally, we have SSB. 
Since the total sum of squares (SST) equals SSW + SSB, we can calculate SSB using simple algebra: <img class="size-full wp-image-11858" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f5f00b05468d71bdb4b_sum-of-squares-between.webp" alt="Image 3 - The sum of squares between" width="554" height="120" /> Image 3 - The sum of squares between In addition, you also need to know the values for <strong>degrees of freedom.</strong> With SSB, the degrees of freedom are simply the number of distinct groups minus one. So if you have three groups, you have two degrees of freedom. It’s a different story for SSW. Its degrees of freedom are calculated as the total number of observations in all groups minus the number of groups. For example, if you have five observations per group and three groups in total, you have a total of 15 observations. Hence, the number of degrees of freedom is 12. With that in mind, you can proceed to the final calculations. You can now calculate the F-value with the following formula: <img class="size-full wp-image-11846" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f60c99b38c07a088c1f_F-value-formula.webp" alt="Image 4 - F-value formula" width="508" height="320" /> Image 4 - F-value formula Finally, you can compare the F-value with the F-critical value from the F-distribution. If your F-value is higher than the critical value, you can reject the null hypothesis in favor of the alternative hypothesis. It means that at least one of the group means differs from the rest. Are you still trying to wrap your head around it? Don’t sweat it - we’ll go over the from-scratch implementation of ANOVA in R next. <blockquote><strong>Build better applications with these <a href="https://appsilon.com/best-practices-for-durable-r-code/" target="_blank" rel="noopener noreferrer">best practices for durable R code</a>. 
</strong></blockquote> <h2 id="anova-from-scratch">ANOVA in R From Scratch</h2> If you prefer code over math, this section is for you. We’ll cover everything discussed previously in code on actual data. We’ll create a synthetic dataset simulating how long car parts last, in months, when bought from different vendors. It’s a dummy example, so don’t think too much of it. It translates well to real data, which is what you should care about. First things first, let’s declare a dataset as a data frame: <script src="https://gist.github.com/darioappsilon/550835bbc291abd314e129e7ab23f009.js"></script> Here’s what it looks like: <img class="size-full wp-image-11840" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f61ce6734eff837600a_dummy-synthetic-dataset.webp" alt="Image 5 - A dummy synthetic dataset" width="248" height="304" /> Image 5 - A dummy synthetic dataset This format works well for calculating ANOVA by hand in R but isn’t something you’ll often see in the real world. Individual groups aren’t usually separated into individual columns, so you can use R’s <code>stack()</code> function to change that. Here’s an example: <script src="https://gist.github.com/darioappsilon/41b07d0c85a8ada3488633080b3b365c.js"></script> <img class="size-full wp-image-11856" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f62c99b38c07a088ccc_stacked-dummy-dataset.webp" alt="Image 6 - Stacked dummy dataset" width="328" height="828" /> Image 6 - Stacked dummy dataset Now that’s the format you’re more familiar with. It’s easier to calculate SSW in the original form, so we’ll stick with it, and transform the dataset as needed. You can visualize three distinct groups with a boxplot. 
It gives you a visual intuition of whether any individual group differs from the rest: <script src="https://gist.github.com/darioappsilon/ccec5244e8cdd868111c9f15402b322b.js"></script> <img class="size-full wp-image-11838" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f633be1a2252421ce7c_boxplot-of-individual-sellers.webp" alt="Image 7 - Boxplot of individual retailers" width="1906" height="1490" /> Image 7 - Boxplot of individual retailers <blockquote><strong>Want to learn more about boxplots? <a href="https://appsilon.com/ggplot2-boxplots/" target="_blank" rel="noopener noreferrer">Check our complete guide to stunning boxplots with R</a>.</strong></blockquote> The second group stands out. One-way ANOVA won’t be able to tell us which group differs, but we should expect strong evidence against the null hypothesis. From here, we can proceed with SSW calculations. <h3>SSW - Sum of Squares Within Groups</h3> To start, we’ll have to square the differences between the individual values in each group and the mean of that group. We have three groups, so it will require three lines of code: <script src="https://gist.github.com/darioappsilon/5160736023b862eea0d150ab720c3b94.js"></script> Here’s what the dataset looks like now: <img class="size-full wp-image-11854" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f65a63ce91a2407a25f_SSW-calculation.webp" alt="Image 8 - SSW calculation (1)" width="964" height="302" /> Image 8 - SSW calculation (1) To calculate SSW, simply sum the values of the three newly created columns: <script src="https://gist.github.com/darioappsilon/10933ef1167b043503631a5193ea7897.js"></script> <img class="size-full wp-image-11852" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f656f892f05d57ee251_SSW-calculation-2.webp" alt="Image 9 - SSW calculation (2)" width="322" height="60" /> Image 9 - SSW calculation (2) One down, two to go. Let’s cover SST next. 
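If you can’t load the gists, here is a minimal sketch of the same SSW idea. The data frame <code>parts</code> and its column names are hypothetical stand-ins, not the article’s dataset, so the result will not match the screenshots above.

```r
# Hypothetical data: part longevity in months for three vendors
# (illustrative values only, NOT the dataset used in the article)
parts <- data.frame(
  vendor_a = c(10, 12, 11, 13, 14),
  vendor_b = c(20, 22, 21, 23, 24),
  vendor_c = c(11, 13, 12, 14, 15)
)

# Squared deviation of every value from the mean of its own column (group).
# sapply() returns a 5 x 3 matrix, one column per vendor.
sq_dev <- sapply(parts, function(x) (x - mean(x))^2)

# SSW is the sum of all squared within-group deviations
ssw <- sum(sq_dev)
print(ssw)
```

Using <code>sapply()</code> over the columns is just one way to avoid writing one line per group. The three-line version described above is equivalent.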
<h3>SST - Total Sum of Squares</h3> To calculate SST, we first need to stack the dataset. Then, you can calculate the squared deviations by subtracting the overall mean from each data point and squaring the results. SST is the sum of the squared results: <script src="https://gist.github.com/darioappsilon/49f0c71e3bb020b487f465938ad93935.js"></script> <img class="size-full wp-image-11850" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f67afc0ff7c657a3339_SST-calculation.webp" alt="Image 10 - SST calculation" width="620" height="62" /> Image 10 - SST calculation Only one remaining! <h3>SSB - Sum of Squares Between Groups</h3> Since SST = SSW + SSB, simple algebra tells us that SSB = SST - SSW. You’re free to do the calculations manually, but this approach is much simpler: <script src="https://gist.github.com/darioappsilon/80adaad564df42a3dfd4eb1c4eb8cafe.js"></script> <img class="size-full wp-image-11848" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f68c8798807239ea8f1_SSB-calculation.webp" alt="Image 11 - SSB calculation" width="626" height="62" /> Image 11 - SSB calculation The hard part is now over. The only thing left to do is to calculate degrees of freedom for SSB and SSW and get the F-value from there. <h3>Calculate Degrees of Freedom</h3> For SSB, degrees of freedom are calculated as the number of distinct groups (3) minus one. For SSW, you’ll need to subtract the number of distinct groups (3) from the total number of observations (15): <script src="https://gist.github.com/darioappsilon/9d0b153e19f494e8b2191cc1e5a30aee.js"></script> Finally, let’s see how to calculate the F-value. 
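The SST, SSB, and degrees of freedom steps above can be sketched together as follows. This again assumes a hypothetical <code>parts</code> data frame rather than the article’s data. The <code>ssb_direct</code> line is an extra sanity check that computes SSB from the group means instead of by subtraction.

```r
# Hypothetical data (illustrative values only, NOT the article's dataset)
parts <- data.frame(
  vendor_a = c(10, 12, 11, 13, 14),
  vendor_b = c(20, 22, 21, 23, 24),
  vendor_c = c(11, 13, 12, 14, 15)
)

# stack() reshapes to long format with columns `values` and `ind`
long <- stack(parts)
grand_mean <- mean(long$values)

# SST: squared deviations of every observation from the overall mean
sst <- sum((long$values - grand_mean)^2)

# SSW: squared deviations of every observation from its own group mean
ssw <- sum(sapply(parts, function(x) sum((x - mean(x))^2)))

# SSB by subtraction, as in the text
ssb <- sst - ssw

# SSB computed directly: group size times squared distance of each
# group mean from the overall mean (should agree with the line above)
ssb_direct <- sum(sapply(parts, function(x) length(x) * (mean(x) - grand_mean)^2))

# Degrees of freedom
df_between <- ncol(parts) - 1           # groups - 1
df_within  <- nrow(long) - ncol(parts)  # observations - groups

print(c(sst = sst, ssw = ssw, ssb = ssb, ssb_direct = ssb_direct))
print(c(df_between = df_between, df_within = df_within))
```

If <code>ssb</code> and <code>ssb_direct</code> disagree, one of the sums of squares was most likely computed against the wrong mean.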
<h3>Calculate F-value</h3> The formula from the previous section states we have to divide SSB and SSW by their corresponding degrees of freedom, and then divide the first result by the second: <script src="https://gist.github.com/darioappsilon/b2537a3c4b41440a806730be316947ac.js"></script> <img class="size-full wp-image-11844" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f697a92f47f699b55c1_F-value-calculation.webp" alt="Image 12 - F-value calculation" width="494" height="62" /> Image 12 - F-value calculation Anytime you have a statistic this large, it’s likely you’ll reject the null hypothesis in favor of the alternative one. The only way to be certain is by comparing it to the F critical value. <h3>Do We Reject The Null Hypothesis?</h3> There are many ways to get the critical value for the combination of 2 and 12 degrees of freedom. You could look at the F distribution table, but that’s too manual. The alternative is to use R’s <code>qf()</code> function: <script src="https://gist.github.com/darioappsilon/723e3ff28711749939cd0bf72fe1d3d7.js"></script> <img class="size-full wp-image-11842" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f6a369f5360b2ade092_F-critical-value.webp" alt="Image 13 - F critical value" width="898" height="60" /> Image 13 - F critical value Our test statistic is higher than the critical value, which means we can reject the null hypothesis in favor of the alternative one. The one-way ANOVA test told us that at least one group differs significantly from the others, but we have no idea which one. Our boxplot suggests the second group differs significantly, but we don’t have any concrete proof. Here comes the best part - you can calculate ANOVA in R with a single function call! Let’s see how next. 
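As a rough sketch of the last two steps, the snippet below computes the F-value and compares it with the critical value from <code>qf()</code>. It uses a hypothetical <code>parts</code> data frame and assumes a significance level of 0.05. The exact arguments used in the gist above may differ. The <code>pf()</code> call is an addition that shows the equivalent P-value.

```r
# Hypothetical data (illustrative values only, NOT the article's dataset)
parts <- data.frame(
  vendor_a = c(10, 12, 11, 13, 14),
  vendor_b = c(20, 22, 21, 23, 24),
  vendor_c = c(11, 13, 12, 14, 15)
)

all_values <- unlist(parts)
k <- ncol(parts)          # number of groups
n <- length(all_values)   # total number of observations

# Sums of squares
sst <- sum((all_values - mean(all_values))^2)
ssw <- sum(sapply(parts, function(x) sum((x - mean(x))^2)))
ssb <- sst - ssw

# Degrees of freedom
df_between <- k - 1
df_within  <- n - k

# F-value: mean square between divided by mean square within
f_value <- (ssb / df_between) / (ssw / df_within)

# Critical value at an ASSUMED significance level of 0.05
alpha <- 0.05
f_critical <- qf(1 - alpha, df1 = df_between, df2 = df_within)

# Equivalent P-value: upper-tail probability of the F distribution
p_value <- pf(f_value, df1 = df_between, df2 = df_within, lower.tail = FALSE)

# TRUE means the null hypothesis is rejected at the chosen alpha
reject_null <- f_value > f_critical
print(c(f_value = f_value, f_critical = f_critical, p_value = p_value))
print(reject_null)
```

Comparing <code>f_value</code> with <code>f_critical</code> and comparing <code>p_value</code> with <code>alpha</code> always lead to the same decision, since both are derived from the same F distribution.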
<h2 id="anova-in-r">ANOVA in R With a Built-In Function</h2> We’ll work with the same dataset - three car part retailers with five part longevity values each: <script src="https://gist.github.com/darioappsilon/bbc9f497451a565a564b91d8647b0e1b.js"></script> <img class="size-full wp-image-11840" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f61ce6734eff837600a_dummy-synthetic-dataset.webp" alt="Image 5 and 10 - A dummy synthetic dataset" width="248" height="304" /> Image 5 - A dummy synthetic dataset Unlike with the manual approach, now it’s mandatory to stack the individual groups in a single column: <script src="https://gist.github.com/darioappsilon/aabdf49665bd2b0db74bf3f72b4c2eae.js"></script> <img class="size-full wp-image-11856" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f62c99b38c07a088ccc_stacked-dummy-dataset.webp" alt="Image 6 and 15 - Stacked dummy dataset" width="328" height="828" /> Image 6 - Stacked dummy dataset To calculate ANOVA in R, you can use the <code>aov()</code> function. You’re modeling <code>values</code> as a function of <code>ind</code>, so keep that in mind. The syntax is similar to fitting a regression model in R: <script src="https://gist.github.com/darioappsilon/ca500d062457798f4b2b5f5d561064e8.js"></script> Let’s check the summary: <img class="size-full wp-image-11836" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01f6dc8798807239eae73_ANOVA-in-R-summary.webp" alt="Image 16 - ANOVA in R summary" width="1374" height="264" /> Image 16 - ANOVA in R summary Degrees of freedom, the sum of squares, and the F value match our from-scratch calculations, which is an excellent sign! The <code>aov()</code> summary reports a P-value instead of comparing the F-value to the critical value directly. 
It’s just another way to interpret the results - commonly, if a P-value is below 0.05, we can say we’re rejecting the null hypothesis in favor of the alternative one at the 5% significance level. <blockquote><strong>Need more support services for your RStudio infrastructure? <a href="https://appsilon.com/appsilon-data-science-is-now-an-rstudio-full-service-certified-partner/" target="_blank" rel="noopener noreferrer">Appsilon is an RStudio Full Service Certified Partner</a>. Find out how we can help your team grow. </strong></blockquote> <hr /> <h2 id="conclusion">Conclusion</h2> And there you have it! Your guide to ANOVA in R. The calculations aren’t difficult to do manually, as everything boils down to plugging values into formulas. At the very least, you now understand one-way ANOVA and know it’s not rocket science. It’s just a combination of basic math and stats with a fancy name. <h3>Short recap</h3> In a nutshell, you need SST, SSW, SSB, and corresponding degrees of freedom in order to calculate the F-value. Once you have it, simply compare it to the F critical value at corresponding degrees of freedom, either through an F distribution table or R’s <code>qf()</code> function. If the F-value is larger than the critical value, you can reject the null hypothesis and state that at least one group differs significantly from the rest. That’s it! If you want to learn more about ANOVA and other statistical tests, stay tuned to <a href="https://appsilon.com/blog/" target="_blank" rel="noopener noreferrer">Appsilon’s blog</a>. To get frequent updates, be sure to subscribe to our newsletter via the contact form below. If you’re looking for more R/Shiny-specific content, check out the <a href="https://appsilon.com/shiny-weekly-announcement/" target="_blank" rel="noopener noreferrer">Shiny Weekly newsletter</a>.
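As a closing sketch of the built-in route, the snippet below runs <code>aov()</code> on a stacked version of a hypothetical <code>parts</code> data frame (not the article’s dataset) and pulls the F-value and P-value out of the summary table. The column names <code>values</code> and <code>ind</code> are the defaults produced by <code>stack()</code>.

```r
# Hypothetical data (illustrative values only, NOT the article's dataset)
parts <- data.frame(
  vendor_a = c(10, 12, 11, 13, 14),
  vendor_b = c(20, 22, 21, 23, 24),
  vendor_c = c(11, 13, 12, 14, 15)
)

# aov() expects long format: one column of values, one factor of group labels
long <- stack(parts)

# Model `values` as a function of the grouping factor `ind`
fit <- aov(values ~ ind, data = long)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(fit)[[1]]
print(anova_table)

# First row is the between-groups term, second row is the residuals (within)
f_builtin <- anova_table[["F value"]][1]
p_builtin <- anova_table[["Pr(>F)"]][1]
print(c(f_builtin = f_builtin, p_builtin = p_builtin))
```

The <code>Df</code> and <code>Sum Sq</code> columns of the table correspond to the degrees of freedom and to SSB and SSW from the manual approach, which makes it a handy cross-check for a from-scratch calculation.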