Time Series Analysis in R: How to Read and Understand Time Series Data
If there's one type of data no company has a shortage of, it has to be time series data. Yet, many <a href="https://appsilon.com/how-to-start-a-career-as-an-r-shiny-developer/" target="_blank" rel="noopener">beginner and intermediate R developers</a> struggle to wrap their heads around basic R time series concepts, such as manipulating datetime values, visualizing data over time, and handling missing date values. Lucky for you, that will all be a thing of the past in a couple of minutes. This article brings you a <b>basic introduction to the world of R time series analysis</b>. We'll cover many concepts, from the key characteristics of time series datasets and loading such data in R, to <a href="https://appsilon.com/effective-shiny-dashboards-best-practices/" target="_blank" rel="noopener">visualizing</a> it, and even some basic operations such as smoothing the curve and plotting a trendline. We have a lot of work to do, so let's jump straight in! <blockquote>Looking to start a career as an R/Shiny Developer? 
<a href="https://appsilon.com/how-to-start-a-career-as-an-r-shiny-developer/" target="_blank" rel="noopener">We have an ultimate guide for landing your first job in the industry</a>.</blockquote> <h3>Table of contents:</h3><ul><li><strong><a href="#key-characteristics">Key Characteristics of Time Series Datasets</a></strong></li><li><strong><a href="#load">Loading an R Time Series Dataset</a></strong></li><li><strong><a href="#datetime">Manipulating Datetime Values of an R Time Series Dataset</a></strong></li><li><strong><a href="#visualize">Visualizing R Time Series Datasets</a></strong></li><li><strong><a href="#missing-values">Handling Missing Dates and Values</a></strong></li><li><strong><a href="#basic-operations">Basic Time Series Operations: Data Smoothing and Trendlines</a></strong></li><li><a href="#summary"><strong>Summing up R Time Series Analysis</strong></a></li></ul> <hr /> <h2 id="key-characteristics">Key Characteristics of Time Series Datasets</h2> Time series datasets are always characterized by at least two features - a point in time and a numeric value. Together they describe an event, such as the Microsoft stock value on November 16th, 2023 at 3 PM. Those are the basics, but this section will dive into the core characteristics of time series datasets, and provide you with a foundational understanding of their nature and behaviour. By recognizing these, you will be able to more effectively interpret, analyze, and make predictions based on time series data. Here's the list of all key characteristics you need to know: <ul><li><b>Datetime information</b> - In all time series datasets, one or more columns are dedicated to showing datetime information. You could have the date and time stored in separate columns, or you can have them combined as a single feature. 
This column (or set of columns) serves as an index, marking the exact time at which each observation was recorded.</li><li><b>Measurement of variable(s) over time</b> - Time series datasets usually measure one or more variables over time, e.g., the aforementioned Microsoft stock price. The key aspect to remember is that these <b>measurements are taken at regular intervals</b>, be it hourly, daily, monthly, or yearly. It's this regularity that allows you to spot patterns, fluctuations, and general changes over time.</li><li><b>Seasonality</b> - Seasonality refers to the occurrence of regular and predictable patterns or cycles in a time series dataset over specific intervals. These patterns are often tied to time-related variables like days of the week, months, or quarters. For example, demand for airplane tickets peaks in the summer months when people go on vacation.</li><li><b>Trend</b> - A trend in time series data is observed when there's a long-term increase or decrease in the data. It doesn’t have to be linear; trends can take various forms, such as exponential or logarithmic, so keep that in mind. For example, a company might see a gradual increase in sales over several years, indicating a positive trend. On the other hand, the exact opposite can happen, indicating a downward trend in sales.</li></ul> If you understand these key characteristics, you'll be one step closer to gaining valuable insights from time series datasets. This will allow you and your business to understand your data and make better predictions and informed decisions. But, <b>how do you actually load a time series dataset in R?</b> Let's explore that in the following section. <h2 id="load">Loading an R Time Series Dataset</h2> In a nutshell, time series datasets are no different from other types of datasets you're used to. They're also typically stored in CSV/Excel files or in databases, which means you can use your existing R knowledge to load these files into memory. 
The dataset of choice for today will be <a href="https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv" target="_blank" rel="noopener noreferrer">Airline passengers</a>, showing the number of passengers in thousands from 1949 to 1960, at monthly intervals. Assuming you have the dataset downloaded, here's the R code you can use to load it: <pre><code class="language-r">data <- read.csv("airline-passengers.csv")</code></pre> The dataset is now in memory, which means you can use the convenient <code>head()</code> function to display the first couple of rows. Let's go with 12 since the dataset shows monthly totals: <pre><code class="language-r">head(data, 12)</code></pre> This is what you'll see printed out: <img class="size-full wp-image-21957" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01974d29d4bcb9683014d_tg_image_1471001930.webp" alt="Image 1 - Loading a time series dataset" width="1140" height="732" /> Image 1 - Loading a time series dataset What makes Airline passengers a time series dataset is the fact that it has a <b>time-related column at regular intervals</b>, and also has a <b>numeric value attached to every time interval</b>. These are the two basic premises described in the previous section. As for trend and seasonality, we'll explore these later in the visualization section. <h2 id="datetime">Manipulating Datetime Values of an R Time Series Dataset</h2> You've probably spotted in the previous section that the date column isn't formatted as a full date. The current values are in the form of "year-month". 
Adding insult to injury, it also looks like the column has a character data type: <img class="size-full wp-image-21960" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b019743b97b66bbdcfae6a_tg_image_4093796769.webp" alt="Image 2 - Column data types" width="1408" height="206" /> Image 2 - Column data types This section will show you how to fix the data type, and also how to convert the date to a month-end format, just in case you don't like the default month-start format. We'll use the <code>lubridate</code> package throughout this section, so make sure you have it installed. <h3>Convert String to Datetime</h3> The <code>lubridate</code> package ships with a <code>ym()</code> function which converts a string date representation in the format of "year-month" to a proper date object. You don't have to apply this function to each row manually - you can <b>pass the entire column instead</b>: <pre><code class="language-r">library(lubridate)

# Parse "year-month" strings into Date objects (first day of each month)
data$date <- ym(data$Month)
head(data)</code></pre> This is what the dataset looks like now: <img class="size-full wp-image-21963" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0197694ab187560d67b3f_tg_image_2525110729.webp" alt="Image 3 - Dataset after adding the datetime column" width="926" height="418" /> Image 3 - Dataset after adding the datetime column If you check the data types with <code>str()</code> again, you'll see the following: <img class="size-full wp-image-21969" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b019774367c4d011900f52_tg_image_3862606251.webp" alt="Image 4 - Column data types" width="1388" height="266" /> Image 4 - Column data types Which means we now have a proper date column at our disposal. <h3>Change the Date Formatting</h3> The next thing you might want to do is to change how the date column is formatted. 
Maybe you prefer to see the <b>last day of the month</b> instead of the first - the change is really easy to implement: <pre><code class="language-r"># Round up to the first day of the next month, then step back one day
data$date_mth_end <- ceiling_date(data$date, "month") - days(1)
head(data)</code></pre> This is what you will see: <img class="size-full wp-image-21971" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b019788cff4e061341ff8c_tg_image_3964699981.webp" alt="Image 5 - Changing the date format" width="982" height="424" /> Image 5 - Changing the date format There are many ways you can represent the date column. Whether you anchor each month to its first or last day is mostly <b>personal preference</b>, as long as you use the same convention for every row. We now have some proper data to visualize. Let's explore how in the following section. <h2 id="visualize">Visualizing R Time Series Datasets</h2> We humans aren't the best at spotting patterns from tabular data. But it's a whole different story when the same data is visualized. This section will show you how to make a basic time series data visualization with <code>ggplot2</code>, and also how to make it somewhat aesthetically pleasing. Before creating the chart, you should make sure your <b>datetime column is a Date object</b>, and not just a string representation of it. Also, make sure the <b>count column is numeric</b>, and not just a number wrapped in quotes. Lucky for you - both conditions are met if you've followed through the previous section! Time series data is often visualized as a line chart. It makes sense since the data is ordered in time and sampled at identical intervals. 
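The two preconditions above (a Date column and a numeric count column) are easy to verify in code. The following is a minimal sketch in base R, assuming a small hand-made stand-in for the dataset: the three rows and the deliberately text-typed `Passengers` column are illustrative and are not read from the CSV, and `as.Date(paste0(...))` is used here only as a base R substitute for `ym()`.

```r
# Hypothetical stand-in for the dataset; values are illustrative only
data <- data.frame(
  Month = c("1949-01", "1949-02", "1949-03"),
  Passengers = c("112", "118", "132"),  # numbers deliberately stored as text
  stringsAsFactors = FALSE
)

# Base R substitute for lubridate::ym(): append a day and parse
data$date <- as.Date(paste0(data$Month, "-01"))

# Check both preconditions before plotting
is_date <- inherits(data$date, "Date")
is_num <- is.numeric(data$Passengers)

# Convert the count column if it arrived as text
if (!is_num) {
  data$Passengers <- as.numeric(data$Passengers)
}

str(data)
```

If either check fails on your own data, fix the column type first. Otherwise `ggplot2` will treat the x-axis as categorical text rather than as a time axis.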
Here's the code you'll need to make the most basic line chart with <code>ggplot2</code>: <pre><code class="language-r">library(ggplot2)

ggplot(data, aes(x = date, y = Passengers)) +
  geom_line()</code></pre> This is what the chart looks like: <img class="size-full wp-image-21973" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0197a7b599ca0447e4253_tg_image_2033167642.webp" alt="Image 6 - Basic line chart" width="2292" height="1292" /> Image 6 - Basic line chart It's not the prettiest, but it gets the job done. You can see a clear <b>upward trend</b> and a <b>strong seasonality, with peaks in the summer months</b>. That's something we'll explore later. The issue that requires immediate attention is the style of this chart. It's nowhere near ready to show to your client or boss, since the title is missing, the axis labels could do with some retouching, and the overall theme is awful. Here's a code snippet that will fix all of the listed issues: <pre><code class="language-r"># Note: ggplot2 3.4.0+ uses `linewidth` for lines; older versions use `size`
ggplot(data, aes(x = date, y = Passengers)) +
  geom_line(color = "#0099f9", linewidth = 1.4) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 15),
    plot.title = element_text(size = 18, face = "bold")
  ) +
  labs(
    title = "Airline Passengers Dataset",
    x = "Time period",
    y = "Number of passengers in 000"
  )</code></pre> This is what the chart looks like now: <img class="size-full wp-image-21975" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0197a8cff4e0613420353_tg_image_610246812.webp" alt="Image 7 - Styled line chart" width="2294" height="1296" /> Image 7 - Styled line chart Now we're talking! We've gone from plain to stunning in just a couple of lines of code. <blockquote>Are you new to data visualization with ggplot2? 
<a href="https://appsilon.com/ggplot2-line-charts/" target="_blank" rel="noopener">This article will teach you how to make stunning line charts</a>.</blockquote> Up next, let's discuss an issue present in many time series datasets (but not in Airline passengers) - missing values. <h2 id="missing-values">Handling Missing Dates and Values</h2> When working with time series datasets, it's crucial that you have a complete picture in front of you, meaning a dataset that has no missing dates or values. There are ways of dealing with missing values, but dates are a lot trickier. Let's explore them first. <h3>Time Series with Missing Dates</h3> To demonstrate the point, we'll create a dummy time series dataset containing monthly sampled data for 2023. But here's the thing - <b>there are no records for March, July, and August</b>: <pre><code class="language-r">ts <- data.frame(
  date = c("2023-01-01", "2023-02-01", "2023-04-01", "2023-05-01", "2023-06-01",
           "2023-09-01", "2023-10-01", "2023-11-01", "2023-12-01"),
  value = c(145, 212, 265, 299, 345, 278, 256, 202, 176)
)
# Convert strings to Date objects (ymd() comes from lubridate)
ts$date <- ymd(ts$date)
ts</code></pre> You can clearly see the records are missing in the following image: <img class="size-full wp-image-21977" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0197d2a228de4b21aa747_tg_image_107726200.webp" alt="Image 8 - Time series dataset with rows missing" width="438" height="524" /> Image 8 - Time series dataset with rows missing <b>So, what can you do?</b> The usual operating procedure is the following: <ol><li>Create a new <code>data.frame</code> that has an entire sequence of dates. 
Use base R's <code>seq()</code>, which works on <code>Date</code> objects, for the task instead of implementing the logic manually.</li><li>Merge the new <code>data.frame</code> with the one that contains missing records - this will essentially add the missing records to the right place and set the value to <code>NA</code>.</li><li>Replace <code>NA</code> values with something appropriate - zeros will do fine for now.</li></ol> If you prefer code over text, here's a snippet for you: <pre><code class="language-r"># 1. Create a new data.frame that has a full sequence of dates
full_date_df <- data.frame(
  date = seq(min(ts$date), max(ts$date), by = "month")
)

# 2. Merge with the old one on the `date` column
new_ts <- merge(full_date_df, ts, by = "date", all.x = TRUE)

# 3. Some values are now missing - replace them with 0
new_ts$value[is.na(new_ts$value)] <- 0

new_ts</code></pre> This is what the reformatted R time series dataset looks like: <img class="size-full wp-image-21979" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0197dd29d4bcb968309e1_tg_image_2797547700.webp" alt="Image 9 - Time series dataset with added rows" width="756" height="730" /> Image 9 - Time series dataset with added rows Great, that takes care of missing dates, <b>but what about values?</b> That's what we'll cover next. <h3>Time Series with Missing Values</h3> Code-wise, missing values are a lot easier to deal with. It's best if you can find out why the values are missing in the first place, but if you can't, <b>there are various statistical methods</b> available for imputing them. With missing values in time series datasets, you usually have the date column fully populated, and the value field is set to <code>NA</code>. 
Here's an example of one such dataset: <pre><code class="language-r">ts <- data.frame(
  date = seq(ymd("20230101"), ymd("20231231"), by = "months"),
  value = c(145, 212, NA, 265, 299, 345, NA, NA, 278, 256, 202, 176)
)
ts</code></pre> This is what it looks like: <img class="size-full wp-image-21981" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0197e93c80263c782e9e7_tg_image_66051650.webp" alt="Image 10 - Time series dataset with missing values" width="636" height="726" /> Image 10 - Time series dataset with missing values This time, all dates are present, but the values for March, July, and August are missing. We'll show you four techniques for imputing them: <ul><li><b>Mean value</b> - All missing values will be replaced with the simple average of the non-missing values in the series.</li><li><b>Forward fill</b> - The missing value at the point T is filled with the last non-missing value before it (the value at T-1, or an earlier one if T-1 is also missing).</li><li><b>Backward fill</b> - The missing value at the point T is filled with the next non-missing value after it (the value at T+1, or a later one if T+1 is also missing).</li><li><b>Linear interpolation</b> - The missing value at the point T is placed on a straight line between the nearest non-missing values on either side. For a single-point gap, that's the average of the values at T-1 and T+1.</li></ul> This is how you can implement all of them in code: <pre><code class="language-r">library(zoo)

# 1. Mean value imputation
mean_value <- mean(ts$value, na.rm = TRUE)
ts$mean <- ifelse(is.na(ts$value), mean_value, ts$value)

# 2. Forward fill
ts$ffill <- na.locf(ts$value, na.rm = FALSE)

# 3. Backward fill
ts$bfill <- na.locf(ts$value, fromLast = TRUE, na.rm = FALSE)

# 4. Linear interpolation (na.rm = FALSE keeps the original length
#    even if the series starts or ends with NA)
ts$interpolated <- na.approx(ts$value, na.rm = FALSE)

ts</code></pre> And here's what the dataset looks like afterward: <img class="size-full wp-image-21983" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0197f93c80263c782ea2a_tg_image_2825624235.webp" alt="Image 11 - Time series dataset after missing value imputation" width="1090" height="728" /> Image 11 - Time series dataset after missing value imputation That covers handling missing dates and values. Up next, you'll learn how to add a couple of useful visualizations to your existing time series charts. <h2 id="basic-operations">Basic Time Series Operations: Data Smoothing and Trendlines</h2> In this final section, you'll learn two basic but vital time series tasks - smoothing the data curve via moving averages and calculating trendlines. As for the dataset, <b>we're back to Airline passengers</b>. We've loaded the whole thing from scratch, just to have a clean start. <pre><code class="language-r">library(lubridate)

data <- read.csv("airline-passengers.csv")
# This time, overwrite the Month column with proper Date objects
data$Month <- ym(data$Month)
head(data)</code></pre> Here's what the data looks like: <img class="size-full wp-image-21985" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01980299b8d6d842598fb_tg_image_2159841952.webp" alt="Image 12 - Airline passengers dataset" width="810" height="426" /> Image 12 - Airline passengers dataset First, let's see what smoothing the values curve brings us. <h3>Smoothing the Data Curve</h3> Okay, so, <b>moving averages</b> - what are they? Think of them as a technique that allows you to smooth out short-term fluctuations in a time series dataset. Doing so enables you to shift the focus from extremes to the overall shape of the data. Calculating moving averages involves taking an average of a subset of the total data points at different time intervals, which then "moves" along with the data. 
In a nutshell, each point on a moving average line <b>represents the average value of the data over a fixed-size window of neighboring observations</b>. One important parameter worth discussing with moving averages is the <code>window size</code>. In plain English, it determines the number of consecutive data points used to calculate each point in the moving average. Using different window sizes, such as 3, 6, or 12, changes the smoothness of the resulting average and its sensitivity to changes in the data. Now onto the code. We'll use the <code>zoo</code> package to calculate moving averages with window sizes of 3, 6, and 12: <pre><code class="language-r">library(zoo)

# fill = NA pads positions where a full window isn't available
data$Passengers_MA3 <- rollmean(data$Passengers, 3, fill = NA)
data$Passengers_MA6 <- rollmean(data$Passengers, 6, fill = NA)
data$Passengers_MA12 <- rollmean(data$Passengers, 12, fill = NA)

head(data, 12)</code></pre> This is what the dataset looks like after the calculation: <img class="size-full wp-image-21987" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b019822a228de4b21aaa44_tg_image_2068330021.webp" alt="Image 13 - Airline passenger dataset with moving averages" width="1540" height="730" /> Image 13 - Airline passenger dataset with moving averages Some values at each end of the dataset are missing. That's because <code>rollmean()</code> centers its window by default, so the first and last few data points don't have enough neighbors on both sides to fill a complete window - the larger the window, the more values are missing. These missing values are irrelevant for the point we're trying to make. 
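To make the window mechanics concrete, here is a minimal sketch in base R that computes a 3-point moving average two ways: centered (what `rollmean()` does by default via `align = "center"`) and trailing (the equivalent of `align = "right"`, where each point only looks backward). The vector `x` is made up for illustration, and `stats::filter()` is used here as a base R substitute for `zoo`, not as the method used elsewhere in this article.

```r
# Illustrative values only (not taken from the Airline passengers file)
x <- c(10, 20, 30, 40, 50, 60)
k <- 3              # window size
w <- rep(1 / k, k)  # equal weights, so the filter is a plain average

# Centered window: each point averages itself and its neighbors on both sides
ma_centered <- as.numeric(stats::filter(x, w, sides = 2))

# Trailing window: each point averages itself and the k - 1 points before it
ma_trailing <- as.numeric(stats::filter(x, w, sides = 1))

data.frame(x, ma_centered, ma_trailing)
```

The centered version leaves one `NA` at each end, while the trailing version leaves two at the start and none at the end. Which one you want depends on the use case: centered windows line up better with the original curve on a chart, while trailing windows only use information available up to each point in time.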
You'll see why moving averages are useful as soon as you visualize them: <pre><code class="language-r"># Note: ggplot2 3.4.0+ uses `linewidth` for lines; older versions use `size`
ggplot(data, aes(x = Month)) +
  geom_line(aes(y = Passengers), color = "black", linewidth = 1) +
  geom_line(aes(y = Passengers_MA3), color = "red", linewidth = 1) +
  geom_line(aes(y = Passengers_MA6), color = "green", linewidth = 1) +
  geom_line(aes(y = Passengers_MA12), color = "blue", linewidth = 1) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 15),
    plot.title = element_text(size = 18, face = "bold"),
    legend.position = "bottom"
  ) +
  labs(
    title = "Airline Passengers Dataset with Moving Averages",
    x = "Time period",
    y = "Number of passengers in 000"
  )</code></pre> This is the chart you'll end up with: <img class="size-full wp-image-21989" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01983ec1e575073db2aea_tg_image_4077137759.webp" alt="Image 14 - Original Airline passengers dataset with moving averages" width="2228" height="1390" /> Image 14 - Original Airline passengers dataset with moving averages Overall, the larger the window size, the smoother the data curve. To conclude, <b>moving averages allow you to see a generalized pattern in your data, rather than focusing on short-term fluctuations</b>. Up next, let's go over trendlines. <h3>Plotting a Trendline</h3> A trendline does just what the name suggests - <b>it shows a general trend of your data</b> - either neutral, positive, or negative. To calculate a trendline, you'll want to fit a <a href="https://appsilon.com/r-linear-regression/" target="_blank" rel="noopener">linear regression model</a> on a derived numeric feature, and then use the same feature to calculate predictions. The result is a <b>line of best fit</b>, that is, the line that best describes the data, also known as a <b>trendline</b>. 
Here's the code needed to fit a linear regression model: <pre><code class="language-r"># Create a numeric feature (a Date converts to days since 1970-01-01)
data$Month_num <- as.numeric(data$Month)

# Fit a linear regression model
model <- lm(Passengers ~ Month_num, data = data)

# Get predictions
data$Trend_Line <- predict(model, newdata = data)

head(data[c("Month", "Passengers", "Trend_Line")], 12)</code></pre> The code also prints the first 12 rows of the dataset: <img class="size-full wp-image-21991" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01985410a00724850385e_tg_image_830375096.webp" alt="Image 15 - Airline passengers dataset with a trendline" width="1244" height="732" /> Image 15 - Airline passengers dataset with a trendline It doesn't make much sense numerically, so let's visualize it. You know the drill by now: <pre><code class="language-r"># Note: ggplot2 3.4.0+ uses `linewidth` for lines; older versions use `size`
ggplot(data, aes(x = Month)) +
  geom_line(aes(y = Passengers), color = "#0099f9", linewidth = 1) +
  geom_line(aes(y = Trend_Line), color = "orange", linewidth = 1.4) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 15),
    plot.title = element_text(size = 18, face = "bold"),
    legend.position = "bottom"
  ) +
  labs(
    title = "Airline Passengers Dataset with a Trend Line",
    x = "Time period",
    y = "Number of passengers in 000"
  )</code></pre> This is the chart you'll end up with: <img class="size-full wp-image-21993" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01987299b8d6d84259c77_tg_image_508567742.webp" alt="Image 16 - Visualized Airline passengers dataset with a trendline" width="2106" height="1378" /> Image 16 - Visualized Airline passengers dataset with a trendline Long story short, a trendline is just a <b>straight line that best describes the general movement, or trend, of your data</b>. 
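One detail worth knowing when reading the model's coefficient: converting a Date with `as.numeric()` yields the number of days since 1970-01-01, so the fitted slope is expressed in units per day. The sketch below uses a synthetic series with a known slope (not the Airline passengers data) to show how the slope can be converted to a per-year figure. All variable names are hypothetical.

```r
# Synthetic monthly series with a known trend; illustrative only
dates <- seq(as.Date("2020-01-01"), by = "month", length.out = 36)
day_num <- as.numeric(dates)  # days since 1970-01-01

# Build values that grow by exactly 0.1 units per day
values <- 100 + 0.1 * (day_num - day_num[1])

# Fit the same kind of model as in the trendline section
model <- lm(values ~ day_num)

# The slope is "units per day"; scale it for a more readable figure
slope_per_day <- unname(coef(model)["day_num"])
slope_per_year <- slope_per_day * 365.25

slope_per_day
slope_per_year
```

On real data, the per-year figure is usually the easier one to communicate, for example "the trend adds roughly N thousand passengers per year".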
You could also try fitting a polynomial regression model to this dataset if you suspect the trend isn't linear, but that's a topic for some other time. <hr /> <h2 id="summary">Summing up R Time Series Analysis</h2> And there you have it - pretty much everything a newcomer to time series analysis needs to get started. We've covered a lot of ground today, and you've learned how to load time series datasets, visualize them, work with missing values, and even something a bit more advanced - moving averages and trendlines. The natural next step is to take a closer look at <b>time series forecasting</b>. That's the topic we'll cover in a follow-up article, so make sure to stay tuned to the <a href="https://appsilon.com/blog/" target="_blank" rel="noopener">Appsilon blog</a> so you don't miss it. <blockquote>What else can you do with R? <a href="https://appsilon.com/r-for-programmers/" target="_blank" rel="noopener">Here are 7 essential and beginner-friendly packages you must know</a>.</blockquote>