Are your data visualizations an eyesore? It’s a common problem, so don’t worry too much about it. The solution is easier than you think, as R provides countless ways to make stunning visuals. Today you’ll learn how to create impressive boxplots with R and the ggplot2 package.
This article demonstrates how to make stunning boxplots with ggplot based on any dataset. We’ll start simple with a brief introduction and interpretation of boxplots and then dive deep into visualizing and styling ggplot boxplots.
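A boxplot is one of the simplest ways of representing a distribution of a continuous variable. It consists of two parts: the box, which extends from the first to the third quartile (Q1 to Q3) with a line in the middle marking the median, and the whiskers, which extend from both ends of the box to the most extreme values within 1.5 times the interquartile range (IQR = Q3 - Q1). Anything beyond the whiskers is drawn as an individual point and treated as an outlier.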
Take a look at the following visual representation:
In short, boxplots provide a ton of information for a single chart. They give you a quick sense of whether the distribution looks roughly symmetric or is skewed in either direction. You can also easily spot the outliers, which always helps.
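To make those numbers concrete, here’s a minimal sketch that computes the values behind a boxplot of mpg by hand. It assumes the common 1.5 * IQR rule for the whiskers, which is the default in ggplot2; the variable names are our own.

```r
# Values behind a boxplot of mpg (built-in mtcars dataset)
mpg <- mtcars$mpg

# Q1, median, and Q3 define the box
q <- quantile(mpg, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(mpg)

# Fences: whiskers stop at the most extreme points inside these limits
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Points outside the fences are drawn as outliers
outliers <- mpg[mpg < lower_fence | mpg > upper_fence]

print(q)
print(outliers)
```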
Let’s see how you can use R and ggplot to visualize boxplots.
R has many datasets built-in, one of them being mtcars. It’s a small and easy-to-explore dataset we’ll use today to draw boxplots. You’ll need only ggplot2 installed to follow along.
We’ll visualize boxplots for the mpg (Miles per gallon) variable among different cyl (Number of cylinders) options in most of the charts. You’ll have to convert the cyl variable to a factor beforehand. Here’s how:
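A minimal sketch of the conversion, assuming you work on a copy of the dataset. The name df is our own choice:

```r
library(ggplot2)

# Work on a copy so the built-in mtcars stays untouched (df is an assumed name)
df <- mtcars

# cyl is stored as a number; boxplots need it as a categorical variable
df$cyl <- as.factor(df$cyl)

levels(df$cyl)
```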
The head() function prints the first six rows of the dataset:
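Something along these lines should do, again assuming the converted copy is called df:

```r
# Recreate the copy with cyl as a factor (df is an assumed name)
df <- mtcars
df$cyl <- as.factor(df$cyl)

# head() returns the first six rows by default
first_rows <- head(df)
print(first_rows)
```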
From these rows alone, you can see that mpg is continuous and cyl is categorical. That’s the variable-type combination you’re looking for when working with boxplots.
You can make ggplot boxplots look stunning with a bit of work, but starting out they’ll look pretty plain. Think of this as a blank canvas to paint your beautiful boxplot story. The geom_boxplot() function is used in ggplot2 to draw boxplots. Here’s how to use it to make a default-looking boxplot of the miles per gallon variable:
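A minimal sketch, assuming the dataset copy is named df as before:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# A single boxplot: map the continuous variable to one axis only
p <- ggplot(df, aes(x = mpg)) +
  geom_boxplot()
p
```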
And boy is it ugly. We’ll deal with the stylings later after we go over the basics.
Every so often, you’ll want to visualize multiple boxplots on a single chart — each representing a distribution of the variable with some filter condition applied. For example, we can visualize the distribution of miles per gallon for every possible cylinder value. The latter is already converted to a factor, so you’re ready to go.
Here’s the code:
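This is a sketch under the same assumptions as before (a copy named df with cyl converted to a factor):

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# One boxplot per cylinder count: categorical variable on x, continuous on y
p <- ggplot(df, aes(x = cyl, y = mpg)) +
  geom_boxplot()
p
```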
It makes sense: the more cylinders a car has, the fewer miles per gallon it gets. There are outliers for cars with eight cylinders, represented by dots above and below the whiskers.
You can change the orientation of the chart if you find this one hard to look at. Just call the coord_flip() function when coding the chart:
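For example, a sketch that flips the previous chart (df is our assumed name for the dataset copy):

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# coord_flip() swaps the x and y axes, giving horizontal boxplots
p <- ggplot(df, aes(x = cyl, y = mpg)) +
  geom_boxplot() +
  coord_flip()
p
```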
We’ll stick with the default orientation moving forward. Let’s say you want to display every data point on the boxplot. The mtcars dataset is relatively small, so it might actually be a good idea. You’ll have to call the geom_dotplot() function to do so:
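Here’s one way it could look. The binaxis, stackdir, and dotsize values are our own picks, so tweak them to taste:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# Overlay every observation on top of the boxes
p <- ggplot(df, aes(x = cyl, y = mpg)) +
  geom_boxplot() +
  geom_dotplot(
    binaxis = "y",       # bin along the mpg axis
    stackdir = "center", # stack dots symmetrically around each box
    dotsize = 0.5        # shrink the dots so they don't hide the box
  )
p
```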
Be extra careful if you’re doing this for a larger dataset. Outliers are a bit harder to spot and it’s easy to get overwhelmed.
Let’s explore how you can make boxplots more appealing to the eye.
Let’s start with the outline color. It might just be enough to give your visualization an extra punch. You can map a variable to the color attribute in the call to aes(), and then use the scale_color_manual() function to provide a vector of colors:
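A minimal sketch of that approach. The three hex codes are placeholders we picked for illustration:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# color = cyl maps the outline color to the cylinder count;
# scale_color_manual() supplies one color per factor level
p <- ggplot(df, aes(x = cyl, y = mpg, color = cyl)) +
  geom_boxplot() +
  scale_color_manual(values = c("#0099f8", "#e74c3c", "#2ecc71"))
p
```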
There are other ways to specify the color palette, but we find this option to be the most customizable.
If you want to change the fill color instead, you have options. You can pass a color to the fill parameter inside geom_boxplot() if you want all boxplots to have the same color:
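For instance, with an arbitrary blue of our choosing:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# A fixed fill goes outside aes(), so every box gets the same color
p <- ggplot(df, aes(x = cyl, y = mpg)) +
  geom_boxplot(fill = "#0099f8")
p
```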
The alternative is to apply the same logic we used for the outline color: a variable controls which color is applied where, and you can use the scale_fill_manual() function to change the colors:
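Because the variable is mapped to the fill aesthetic here, the matching scale is scale_fill_manual(). A sketch with placeholder colors:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# fill = cyl maps the box fill to the cylinder count;
# scale_fill_manual() supplies one color per factor level
p <- ggplot(df, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#0099f8", "#e74c3c", "#2ecc71"))
p
```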
Now we’re getting somewhere. The only thing we haven’t addressed is that horrendous background color. You can get rid of it by changing the theme. For example, adding theme_classic() will make your chart a bit more modern and minimalistic:
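A sketch that builds on the filled boxplots, using the same placeholder colors as an assumption:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# theme_classic() drops the gray panel and grid lines, keeping only the axes
p <- ggplot(df, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#0099f8", "#e74c3c", "#2ecc71")) +
  theme_classic()
p
```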
Style boils down to personal preference, but this one is much easier to look at in our opinion.
There’s still one gigantic elephant in the room left to discuss — titles and labels. No one knows what your ggplot boxplot represents without them.
Let’s start with text labels. It’s somewhat unusual to add them to boxplots, as they’re usually used on charts where exact values are displayed (bar, line, etc.). Nevertheless, you can display any text you want with ggplot boxplots. You’ll just have to get a bit more creative.
For example, if you want to display the number of observations, mean, and median above every boxplot, you’ll first have to declare a function that fetches that information.
Discover more Boxplot arguments in the ggplot2 boxplot documentation.
You can now pass it to the stat_summary() function when drawing boxplots:
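Here’s a sketch of both steps together. The helper name box_stats_label, the label position, and the colors are all our own assumptions; the only requirement is that the function takes a numeric vector and returns a data frame with y and label columns:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# Hypothetical helper: builds a text label for one group's values.
# upper_limit controls how high above the boxes the label is placed.
box_stats_label <- function(y, upper_limit = max(df$mpg) * 1.15) {
  data.frame(
    y = 0.95 * upper_limit,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n",
      "Median =", round(median(y), 2), "\n"
    )
  )
}

# stat_summary() calls the helper once per group and draws the result as text
p <- ggplot(df, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#0099f8", "#e74c3c", "#2ecc71")) +
  stat_summary(fun.data = box_stats_label, geom = "text", hjust = 0.5, vjust = 0.9) +
  theme_classic()
p
```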
Neat, right? Much better than displaying values directly on the chart.
Let’s cover titles and axis labels next. These are mandatory for production-ready charts, as without them, the users don’t know what they’re looking at. You can use the following code snippet to add a title, subtitle, caption, x-axis label, and y-axis label:
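A minimal sketch using labs(). All of the text strings are placeholders we made up for illustration:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# labs() sets the title, subtitle, caption, and both axis labels in one call
p <- ggplot(df, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#0099f8", "#e74c3c", "#2ecc71")) +
  labs(
    title = "Miles per gallon among different cylinder options",
    subtitle = "Made with ggplot2",
    caption = "Source: mtcars dataset",
    x = "Number of cylinders",
    y = "Miles per gallon"
  ) +
  theme_classic()
p
```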
If you think these look a bit plain, you’re not alone. You can use the theme() function to style them. Be aware, your custom styles will be ignored if you call theme_classic() after declaring custom styles:
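One possible styling, with theme_classic() placed before theme() so the custom styles survive. The font faces, sizes, label text, and the blue title color are our own choices:

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)

p <- ggplot(df, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#0099f8", "#e74c3c", "#2ecc71")) +
  labs(
    title = "Miles per gallon among different cylinder options",
    subtitle = "Made with ggplot2",
    caption = "Source: mtcars dataset",
    x = "Number of cylinders",
    y = "Miles per gallon"
  ) +
  theme_classic() + # complete theme first, otherwise it resets the tweaks below
  theme(
    plot.title = element_text(color = "#0099f8", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(face = "bold.italic", hjust = 0.5),
    plot.caption = element_text(face = "italic")
  )
p
```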
Much better — assuming you like the blue color. Everything covered so far is just enough to get you on the right track when making ggplot boxplots, so we’ll stop here.
Looking for more examples of Boxplots? Check out the r-bloggers boxplot feed to see what the R community has to say.
Today you’ve learned what boxplots are, how to draw them with R and the ggplot2 library, and how to make them aesthetically pleasing by changing colors, adding text, titles, and axis labels. It’s enough to style boxplots however you want. You know what to tweak, and now it’s up to you to pick fonts and colors. When creating data visualizations with R, you’re only limited by your creativity (and R knowledge). If you need help finding inspiration or tools, be sure to check out what can be achieved with advanced R programming.
At Appsilon, we’ve used the ggplot2 package frequently when developing enterprise R Shiny dashboards for Fortune 500 companies. If you have a keen eye for design and know a thing or two about R and Shiny, reach out. We have several R Shiny developer positions available.