'ANOVA in R from scratch'

ANOVA in R – How To Implement One-Way ANOVA From Scratch


If you dive deep into inferential statistics, you’re likely to see an acronym ANOVA. It comes in many different flavors, such as one-way, two-way, multivariate, factorial, and so on. We’ll cover the simplest, one-way ANOVA today. We’ll do so from scratch, and then you’ll see how to use a built-in function to implement ANOVA in R.

Let’s start with the theory and light math behind ANOVA first.

ANOVA Theory

ANOVA stands for analysis of variance. It lets you compare the means of more than two groups (the levels of a factor) at the same time to determine whether at least one of them differs from the others. Think of ANOVA as a t-test on steroids: a t-test compares the means of only two groups, while ANOVA extends the comparison to three or more.

There are two main types of ANOVA:

  • One-way ANOVA – It evaluates the impact of a single factor (grouping variable) on a single response variable. By doing so, it determines whether all group means are equal. One-way ANOVA is quite limited, as it will tell you that at least two groups differ, but not which ones. You can extend it with a post-hoc test, such as the Least Significant Difference test, for further inspection.
  • Two-way ANOVA – It evaluates the impact of two factors on a single response variable. You can use two-way ANOVA when you have two categorical variables (factors) and a single quantitative outcome.

Other, more advanced variations exist, such as multivariate ANOVA (MANOVA) and factorial ANOVA, but we’ll cover these some other time.

In a nutshell, one-way ANOVA boils down to a simple hypothesis test:

  • H0 – All group means are equal.
  • H1 – At least one group mean differs from the rest.

We can test the hypothesis with an F-test, but doing so requires a couple of calculations. ANOVA takes into account three types of variation: the total sum of squares (SST), the sum of squares within groups (SSW), and the sum of squares between groups (SSB).

You only need to calculate two, as SST = SSW + SSB. For that reason, we can define the calculation formulas as follows.

SST

First, we have SST which tells you how much variation there is in the dependent variable:

Image 1 – The sum of squares total formula

It treats all observations as one pooled sample, ignoring which group each one belongs to.
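Since the formula image isn’t reproduced here, this is the standard textbook definition of SST. The notation is an assumption of this sketch: k is the number of groups, n_i the number of observations in group i, x_ij the j-th observation in group i, and x-bar the mean of all observations.

```latex
SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x} \right)^2
```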

SSW

Next, we have SSW – it sums the squared deviations of each observation from the mean of its own group:

Image 2 – The sum of squares within groups formula
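In place of the missing image, here is the standard definition of SSW. The notation is assumed for this sketch: x_ij is the j-th observation in group i, x-bar_i is the mean of group i, k is the number of groups, and n_i is the size of group i.

```latex
SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_i \right)^2
```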

SSB

And finally, we have SSB. Since the total sum of squares (SST) equals SSW + SSB, we can calculate SSB using simple algebra:

Image 3 – The sum of squares between
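The image isn’t reproduced here, so the relationship is written out below. The right-hand side is the standard direct formula for SSB, which gives the same number as the subtraction. The notation is assumed for this sketch: n_i is the size of group i, x-bar_i its mean, x-bar the mean of all observations, and k the number of groups.

```latex
SSB = SST - SSW = \sum_{i=1}^{k} n_i \left( \bar{x}_i - \bar{x} \right)^2
```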

In addition, you need to know the degrees of freedom. With SSB, the degrees of freedom are simply the number of distinct groups minus one. So if you have three groups, there are two degrees of freedom.

It’s a different story for SSW. Degrees of freedom are calculated as the total number of observations in all groups minus the number of groups. For example, if you have five observations per group and three groups in total, you have a total of 15 observations. Hence, the number of degrees of freedom is 12.

With that in mind, you can proceed to the final calculations. You can now calculate the F-value with the following formula:

Image 4 – F-value formula
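As the image is missing, here is the standard form of the F statistic for one-way ANOVA. The symbols k (number of groups) and N (total number of observations) are notation assumed for this sketch, so k - 1 and N - k are the degrees of freedom described above.

```latex
F = \frac{SSB / (k - 1)}{SSW / (N - k)}
```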

Finally, you can compare the F-value with the critical value from the F-distribution. If the critical value is lower than your F-value, you can reject the null hypothesis in favor of the alternative hypothesis. It means that at least one of the group means differs from the rest.

Are you still trying to wrap your head around it? Don’t sweat it – we’ll go over the from-scratch implementation of ANOVA in R next.

ANOVA in R From Scratch

If you prefer code over math, this section is for you. We’ll cover everything discussed previously in code on actual data. We’ll create a synthetic dataset simulating how long car parts last (in months) when bought from different vendors. It’s a dummy example, so don’t read too much into it. The approach translates well to real data, which is what matters.

First things first, let’s declare a dataset as a data frame:

Here’s what it looks like:

Image 5 – A dummy synthetic dataset
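The original code and values aren’t reproduced here, so the snippet below is only a stand-in. It assumes three vendors with five observations each; the column names and every number are made up for illustration.

```r
# Hypothetical part lifetimes in months. These values are invented for
# illustration and are NOT the article's original dataset.
parts <- data.frame(
  vendor_a = c(15, 18, 17, 16, 19),
  vendor_b = c(22, 25, 24, 23, 26),
  vendor_c = c(14, 17, 16, 15, 18)
)

# One column per vendor, one row per observation
print(parts)
```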

This wide format works well for the manual calculation, but it isn’t what you’ll usually see in the real world. Groups are rarely separated into individual columns; more often, one column holds the values and another holds the group labels. You can use R’s stack() function to convert to that long format. Here’s an example:

Image 6 – Stacked dummy dataset
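A minimal sketch of the stacking step, using made-up values rather than the article’s data. stack() returns a data frame with two columns: values (the numbers) and ind (a factor holding the original column names).

```r
# Hypothetical wide-format data (values invented for illustration)
parts <- data.frame(
  vendor_a = c(15, 18, 17, 16, 19),
  vendor_b = c(22, 25, 24, 23, 26),
  vendor_c = c(14, 17, 16, 15, 18)
)

# Long format: 15 rows, with columns `values` and `ind`
parts_long <- stack(parts)
print(head(parts_long))
```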

Now that’s the format you’re more familiar with. It’s easier to calculate SSW in the original form, so we’ll stick with it, and transform the dataset as needed.

You can visualize the three distinct groups with a boxplot. It will give you a visual intuition of whether any individual group differs from the rest:

Image 7 – Boxplot of individual retailers
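One way to draw such a plot is with base R’s boxplot() and a formula on the stacked data. The data, axis labels, and title below are assumptions for illustration, not the article’s originals.

```r
# Hypothetical data (values invented for illustration)
parts <- data.frame(
  vendor_a = c(15, 18, 17, 16, 19),
  vendor_b = c(22, 25, 24, 23, 26),
  vendor_c = c(14, 17, 16, 15, 18)
)
parts_long <- stack(parts)

# One box per vendor; boxplot() also returns the plotted statistics invisibly
bp <- boxplot(
  values ~ ind,
  data = parts_long,
  xlab = "Vendor",
  ylab = "Part lifetime (months)",
  main = "Part lifetime by vendor"
)
```

With these made-up numbers, the middle box sits clearly above the other two, which is the kind of visual hint the text describes.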

The second group stands out. One-way ANOVA in R won’t be able to tell us which group differs, but it should give us strong evidence against the null hypothesis. From here, we can proceed with SSW calculations.

SSW – Sum of Squares Within Groups

To start, we’ll have to square the differences between the individual values in each group and that group’s mean. We have three groups, so it will require three lines of code:

Here’s what the dataset looks like now:

Image 8 – SSW calculation (1)

To calculate SSW, simply sum the values of the three newly created columns:

Image 9 – SSW calculation (2)
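Both SSW steps can be sketched as follows. The data and the sq_* column names are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Hypothetical data (values invented for illustration)
parts <- data.frame(
  vendor_a = c(15, 18, 17, 16, 19),
  vendor_b = c(22, 25, 24, 23, 26),
  vendor_c = c(14, 17, 16, 15, 18)
)

# Step 1: squared deviation of each value from its own group's mean
parts$sq_a <- (parts$vendor_a - mean(parts$vendor_a))^2
parts$sq_b <- (parts$vendor_b - mean(parts$vendor_b))^2
parts$sq_c <- (parts$vendor_c - mean(parts$vendor_c))^2

# Step 2: SSW is the sum of all squared deviations across the three groups
ssw <- sum(parts$sq_a, parts$sq_b, parts$sq_c)
print(ssw)  # 30 for this made-up data
```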

One down, two to go. Let’s cover SST next.

SST – Total Sum of Squares

To calculate SST, we first need to stack the dataset. Then, we can calculate the squared deviations by subtracting the overall mean from each data point and squaring the results. SST is the sum of the squared results:

Image 10 – SST calculation
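A minimal sketch of the SST step on made-up data (not the article’s values): stack the columns, subtract the overall mean from every value, square, and sum.

```r
# Hypothetical data (values invented for illustration)
parts <- data.frame(
  vendor_a = c(15, 18, 17, 16, 19),
  vendor_b = c(22, 25, 24, 23, 26),
  vendor_c = c(14, 17, 16, 15, 18)
)
parts_long <- stack(parts)

# Overall (grand) mean across all 15 observations
grand_mean <- mean(parts_long$values)

# SST: sum of squared deviations from the grand mean
sst <- sum((parts_long$values - grand_mean)^2)
print(sst)  # 220 for this made-up data
```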

Only one remaining!

SSB – Sum of Squares Between Groups

Since SST = SSW + SSB, simple algebra tells us that SSB = SST – SSW. You’re free to calculate SSB directly from the group means, but the subtraction is much simpler:

Image 11 – SSB calculation
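The subtraction can be sketched like this, again on made-up data. The direct formula is included only as a cross-check, and it assumes every group has the same number of observations.

```r
# Hypothetical data (values invented for illustration)
parts <- data.frame(
  vendor_a = c(15, 18, 17, 16, 19),
  vendor_b = c(22, 25, 24, 23, 26),
  vendor_c = c(14, 17, 16, 15, 18)
)
parts_long <- stack(parts)
grand_mean <- mean(parts_long$values)

# SST and SSW, computed compactly
sst <- sum((parts_long$values - grand_mean)^2)
ssw <- sum(sapply(parts, function(x) sum((x - mean(x))^2)))

# SSB by subtraction
ssb <- sst - ssw
print(ssb)  # 190 for this made-up data

# Cross-check: group size times squared distance of each group mean
# from the grand mean (equal group sizes assumed)
ssb_direct <- sum(nrow(parts) * (colMeans(parts) - grand_mean)^2)
```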

The hard part is now over. The only thing left to do is to calculate degrees of freedom for SSB and SSW and get the F-score from there.

Calculate Degrees of Freedom

For SSB, degrees of freedom are calculated as the number of distinct groups (3) minus one. For SSW, you’ll need to subtract the number of distinct groups (3) from the total number of observations (15):
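A short sketch of that arithmetic, assuming a wide data frame with one column per group and the same number of rows in each (the data is made up for illustration):

```r
# Hypothetical data (values invented for illustration)
parts <- data.frame(
  vendor_a = c(15, 18, 17, 16, 19),
  vendor_b = c(22, 25, 24, 23, 26),
  vendor_c = c(14, 17, 16, 15, 18)
)

n_groups <- ncol(parts)                # 3 distinct groups
n_total  <- nrow(parts) * ncol(parts)  # 15 observations in total

df_ssb <- n_groups - 1        # 2
df_ssw <- n_total - n_groups  # 12
```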

Finally, let’s see how to calculate the F-value.

Calculate F-value

The formula from the previous section states we have to divide SSB and SSW by their corresponding degrees of freedom, and then divide the first result by the second:

Image 12 – F-value calculation
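As a sketch, the calculation is two divisions followed by a third. The sums of squares below are placeholder values from a made-up dataset of three groups with five observations each, not the article’s numbers.

```r
# Placeholder inputs from a hypothetical dataset (not the article's values)
ssb <- 190    # sum of squares between groups
ssw <- 30     # sum of squares within groups
df_ssb <- 2   # 3 groups - 1
df_ssw <- 12  # 15 observations - 3 groups

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between <- ssb / df_ssb
ms_within  <- ssw / df_ssw

# F-value: ratio of the two mean squares
f_value <- ms_between / ms_within
print(f_value)  # 38 for these placeholder inputs
```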

Anytime you have a statistic this large, it’s likely you’ll reject the null hypothesis in favor of the alternative one. The only way to be certain is by comparing it to the F critical value.

Do We Reject The Null Hypothesis?

There are many ways to get the critical values for the combination of 2 and 12 degrees of freedom. You could look at the F distribution table, but that’s too manual. The alternative is to use R’s qf() function:

Image 13 – F critical value
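A minimal sketch of the lookup with qf(). The 5% significance level is an assumption here (the text doesn’t restate the level used), and the F-value is a placeholder from made-up data.

```r
alpha   <- 0.05  # assumed significance level
f_value <- 38    # placeholder F-value from a hypothetical dataset

# Critical value: the 95th percentile of the F distribution with 2 and 12 df
f_critical <- qf(1 - alpha, df1 = 2, df2 = 12)
print(f_critical)  # roughly 3.89

# Reject H0 when the observed F-value exceeds the critical value
reject_null <- f_value > f_critical

# Equivalent check: upper-tail probability (the p-value) of the observed F
p_value <- pf(f_value, df1 = 2, df2 = 12, lower.tail = FALSE)
```

Comparing the F-value to the critical value and comparing the p-value to alpha always lead to the same decision.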

It looks like our test statistic is higher than the critical value, which means we can reject the null hypothesis in favor of the alternative one. The one-way ANOVA test told us that at least one group differs significantly from the others, but not which one. Our boxplot suggests the second group is the one that differs, but we don’t have any concrete proof.

Here comes the best part – you can calculate ANOVA in R with a single function call! Let’s see how next.

ANOVA in R With a Built-In Function

We’ll work with the same dataset – three car part retailers with five part-longevity values each:

Image 5 – A dummy synthetic dataset

Unlike with the manual approach, now it’s mandatory to stack the individual groups in a single column:

Image 6 – Stacked dummy dataset

To calculate ANOVA in R, you can use the aov() function. You’re modeling the values column as a function of the ind column (the names stack() produces), so keep that in mind. The code is similar to fitting a regression model in R:

Let’s check the summary:

Image 16 – ANOVA in R summary
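The call can be sketched as below. The data is made up for illustration, so the printed numbers will differ from the article’s summary; the variable name model is also just a choice for this example.

```r
# Hypothetical data (values invented for illustration)
parts <- data.frame(
  vendor_a = c(15, 18, 17, 16, 19),
  vendor_b = c(22, 25, 24, 23, 26),
  vendor_c = c(14, 17, 16, 15, 18)
)
parts_long <- stack(parts)

# Model the measurements (`values`) by group membership (`ind`)
model <- aov(values ~ ind, data = parts_long)

# ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
anova_table <- summary(model)
print(anova_table)
```

For this made-up data, the table should show 2 and 12 degrees of freedom, sums of squares of 190 and 30, and an F value of 38, the same numbers the manual steps produce on the same data.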

Degrees of freedom, the sum of squares, and the F value match our from-scratch calculations, which is an excellent sign! Instead of comparing the F-value to a critical value, the summary of aov() reports a P-value. It’s just another way to interpret the results: commonly, if the P-value is below 0.05, we reject the null hypothesis in favor of the alternative one at the 5% significance level.


Conclusion

And there you have it! Your guide to ANOVA in R. The calculations aren’t difficult to do manually, as everything boils down to plugging values into formulas. At the very least, you now understand one-way ANOVA and know it’s not rocket science. It’s just a combination of basic math and stats with a fancy name.

Short recap

In a nutshell, you need SST, SSW, SSB, and the corresponding degrees of freedom in order to calculate the F-value. Once you have it, simply compare it to the F critical value at the corresponding degrees of freedom, either through an F distribution table or R’s qf() function. If the F-value is larger than the critical value, you can reject the null hypothesis and state that at least one group mean differs significantly from the rest. That’s it!