Machine Learning with R: A Complete Guide to Linear Regression

Estimated time:

time

min

<h2>Linear Regression with R</h2> Updated: July 12, 2022. Chances are you had some prior exposure to machine learning and statistics. Basically, that's all R linear regression is - a simple statistics problem. <blockquote>Need help with Machine Learning solutions? <a class="editor-rtfLink" href="https://wordpress.appsilon.com/" target="_blank" rel="noopener noreferrer">Reach out to Appsilon</a>.</blockquote> Today you'll learn the different types of R linear regression and how to implement all of them in R. You'll also get a glimpse into feature importance - a concept used everywhere in machine learning to determine which features have the most predictive power. Table of contents: <ul><li><a href="#intro">Introduction to R Linear Regression</a></li><li><a href="#simple-lr">Simple Linear Regression from Scratch</a></li><li><a href="#multiple-lr">Multiple Linear Regression with R</a></li><li><a href="#feature-importance">R Linear Regression Feature Importance</a></li><li><a href="#conclusion">Summary of R Linear Regression</a></li></ul> <hr /> <h2 id="intro">Introduction to Linear Regression</h2> Linear regression is a simple algorithm developed in the field of statistics. As the name suggests, linear regression assumes a linear relationship between the input variable(s) and a single output variable. Needless to say, the output variable (what you're predicting) has to be continuous. The output variable can be calculated as a linear combination of the input variables. There are two types of linear regression: <ul><li>Simple linear regression - only one input variable</li><li>Multiple linear regression - multiple input variables</li></ul> You'll implement both today - simple linear regression from scratch and multiple linear regression with built-in R functions. You can use a linear regression model to learn which features are important by examining coefficients. If a coefficient is close to zero, the corresponding feature is considered to be less important than if the coefficient was a large positive or negative value. That's how the linear regression model generates the output. Coefficients are multiplied with corresponding input variables, and in the end, the bias (intercept) term is added. There's still one thing we should cover before diving into the code - assumptions of a linear regression model: <ul><li>Linear assumption — model assumes that the relationship between variables is linear</li><li>No noise — model assumes that the input and output variables are not noisy — so remove outliers if possible</li><li>No collinearity — model will overfit when you have highly correlated input variables</li><li>Normal distribution — the model will make more reliable predictions if your input and output variables are normally distributed. If that's not the case, try using some transforms on your variables to make them more normal-looking</li><li>Rescaled inputs — use scalers or normalizers to make more reliable predictions</li></ul> You should be aware of these assumptions every time you're creating linear models. We'll ignore most of them for the purpose of this article, as the goal is to show you the general syntax you can copy-paste between the projects. <h2 id="simple-lr">Simple Linear Regression from Scratch</h2> If you have a single input variable, you're dealing with simple linear regression. It won't be the case most of the time, but it can't hurt to know. A simple linear regression can be expressed as: <img class="size-full wp-image-6292" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958bd772ec361ef23270_1-4.webp" alt="Image 1 - Simple linear regression formula (line equation)" width="358" height="98" /> Image 1 - Simple linear regression formula (line equation) As you can see, there are two terms you need to calculate beforehand - betas. You'll first see how to calculate Beta1, as Beta0 depends on it. Here's the formula: <img class="size-full wp-image-6293" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958cf32c7f090d65fd07_2-4.webp" alt="Image 2 - Beta1 equation" width="720" height="180" /> Image 2 - Beta1 equation And here's the formula for Beta0: <img class="size-full wp-image-6294" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958e8ac01edf5bf546b8_3-4.webp" alt="Image 3 - Beta0 equation" width="412" height="88" /> Image 3 - Beta0 equation These x's and y's with the bar over them represent the mean (average) of the corresponding variables. Let's see how all of this works in action. The code snippet below generates X with 300 linearly spaced numbers between 1 and 300 and generates Y as a value from the normal distribution centered just above the corresponding X value with a bit of noise added. Both X and Y are then combined into a single data frame and visualized as a scatter plot with the <code>ggplot2</code> package: <pre><code class="language-r">library(ggplot2) # Generate synthetic data with a clear linear relationship x <- seq(from = 1, to = 300) y <- rnorm(n = 300, mean = x + 2, sd = 25) # Convert to dataframe simple_lr_data <- data.frame(x, y) # Visualize as scatter plot ggplot(data = simple_lr_data, aes(x = x, y = y)) + geom_point(size = 3, color = "#0099f9") + theme_classic() + labs( title = "Dataset for simple linear regression", subtitle = "A clear linear relationship is visible" )</code></pre> <img class="size-full wp-image-6295" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958e51a4dda06614175b_4-4.webp" alt="Image 4 - Input data as a scatter plot" width="1276" height="1003" /> Image 4 - Input data as a scatter plot Onto the coefficient calculation now. The coefficients for Beta0 and Beta1 are obtained first, and then wrapped into a <code>simple_lr_predict()</code> function that implements the line equation. The predictions can then be obtained by applying the <code>simple_lr_predict()</code> function to the vector X - they should all line on a single straight line. Finally, input data and predictions are visualized with the <code>ggplot2</code> package: <pre><code class="language-r"># Calculate coefficients b1 <- (sum((x - mean(x)) * (y - mean(y)))) / (sum((x - mean(x))^2)) b0 <- mean(y) - b1 * mean(x) # Define function for generating predictions simple_lr_predict <- function(x) { return(b0 + b1 * x) } # Apply simple_lr_predict() to input data simple_lr_predictions <- sapply(x, simple_lr_predict) simple_lr_data$yhat <- simple_lr_predictions # Visualize input data and the best fit line ggplot(data = simple_lr_data, aes(x = x, y = y)) + geom_point(size = 3, color = "#0099f9") + geom_line(aes(x = x, y = yhat), size = 2) + theme_classic() + labs( title = "Applying simple linear regression to data", subtitle = "Black line = best fit line" )</code></pre> <img class="size-full wp-image-6296" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01c78051092b3b541c409_5-4.webp" alt="Image 5 - Input data as a scatter plot with predictions (best-fit line)" width="1276" height="1003" /> Image 5 - Input data as a scatter plot with predictions (best-fit line) And that's how you can implement simple linear regression in R from scratch! Next, you'll learn how to handle situations when there are multiple input variables. <h2 id="multiple-lr">Multiple Linear Regression with R</h2> You'll use the <a class="editor-rtfLink" href="https://www.kaggle.com/aungpyaeap/fish-market" target="_blank" rel="noopener noreferrer">Fish Market</a> dataset to build your model. To start, the goal is to load the dataset and check if some of the assumptions hold. Normal distribution and outlier assumptions can be checked with boxplots. <blockquote>Want to create better data visualizations? Learn how to make <a href="https://appsilon.com/ggplot2-boxplots/" target="_blank" rel="noopener">stunning boxplots with ggplot2</a>.</blockquote> The code snippet below loads in the dataset and visualizes box plots for every feature (not the target): <pre><code class="language-r">library(reshape) # Load in th dataset df <- read.csv("Fish.csv") # Remove target variable temp_df <- subset(df, select = -c(Weight)) melt_df <- melt(temp_df) # Draw boxplot boxplot(data = melt_df, value ~ variable)</code></pre> <img class="size-full wp-image-6297" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01cd59dc2dda17fbf3ced_6-4.webp" alt="Image 6 - Boxplots of the input features" width="1276" height="936" /> Image 6 - Boxplots of the input features <blockquote>Do you find Box Plots confusing? <a href="https://appsilon.com/ggplot2-boxplots/">Here's our complete guide to get you started</a>.</blockquote> A degree of skew seems to be present in all input variables, and the first three contain a couple of outliers. We'll keep this article strictly machine learning-based, so we won't do any data preparation and cleaning. Train/test split is the obvious next step once you're done with preparation. The <code>caTools</code> package is the perfect candidate for the job. You can train the model on the training set after the split. R has the <code>lm</code> function built-in, and it is used to train linear models. Inside the <code>lm</code> function, you'll need to write the target variable on the left and input features on the right, separated by the <code>~</code> sign. If you put a dot instead of feature names, it means you want to train the model on all features. After the model is trained, you can call the <code>summary()</code> function to see how well it performed on the training set. Here's a code snippet for everything discussed so far: <pre><code class="language-r">library(caTools) set.seed(42) # Train/Test split in 70:30 ratio sample_split <- sample.split(Y = df$Weight, SplitRatio = 0.7) train_set <- subset(x = df, sample_split == TRUE) test_set <- subset(x = df, sample_split == FALSE) # Fit the model and obtain summary model <- lm(Weight ~ ., data = train_set) summary(model)</code></pre> <img class="size-full wp-image-6298" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395b140c61fb73053a2be_7-3.webp" alt="Image 7 - Summary statistics of a multiple linear regression model" width="692" height="711" /> Image 7 - Summary statistics of a multiple linear regression model The most interesting thing here is the P-values, displayed in the <code>Pr(>|t|)</code> column. Those values indicate the probability of a variable not being important for prediction. It's common to use a 5% significance threshold, so if a P-value is 0.05 or below, we can say that there's a low chance it is not significant for the analysis. Let's make a residual plot next. As a general rule, if a histogram of residuals looks normally distributed, the linear model is as good as it can be. If not, it means you can improve it somehow. Here's the code for visualizing residuals: <pre><code class="language-r"># Get residuals lm_residuals <- as.data.frame(residuals(model)) # Visualize residuals ggplot(lm_residuals, aes(residuals(model))) + geom_histogram(fill = "#0099f9", color = "black") + theme_classic() + labs(title = "Residuals plot")</code></pre> <img class="size-full wp-image-6299" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395b10ab70dc239c896ee_8-3.webp" alt="Image 8 - Residuals plot of a multiple linear regression model" width="1276" height="1004" /> Image 8 - Residuals plot of a multiple linear regression model As you can see, there's a bit of skew present due to a large error on the far right. And now it's time to make predictions on the test set. You can use the <code>predict()</code> function to apply the model to the test set. As an additional step, you can combine actual values and predictions into a single data frame, just so the evaluation becomes easier. Here's how: <pre><code class="language-r"># Make predictions on the test set predictions <- predict(model, test_set) # Convert to dataframe eval <- cbind(test_set$Weight, predictions) colnames(eval) <- c("Y", "Yhat") eval <- as.data.frame(eval) head(eval)</code></pre> <img class="size-full wp-image-6300" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270af09eda01ce1845527_9-3.webp" alt="" width="206" height="238" /> Image 9 - Dataset comparing actual values and predictions for the test set If you want a more concrete way of evaluating your regression models, look no further than RMSE (Root Mean Squared Error). This metric will inform you how wrong your model is on average. In this case, it reports back the average number of weight units the model is wrong: <pre><code class="language-r"># Evaluate model mse <- mean((eval$Y - eval$Yhat)^2) rmse <- sqrt(mse)</code></pre> The <code>rmse</code> variable holds the value of 83.60, indicating the model is on average wrong by 83.60 units of weight. <h2 id="feature-importance">R Linear Regression Feature Importance</h2> Most of the time, not all features are relevant for predictive modeling and making predictions. If you were to exclude some of them, you're likely to get a better-performing model or at least an identical model that's simple to understand and interpret. That's where feature importance comes in. Now, linear regression models are highly interpretable out of the box, but you can take them to the next level. That's where the <code>Boruta</code> package comes in. It's a feature ranking and selection algorithm based on the Random Forests algorithm. It clearly shows you if a variable is important or not. You can tweak the "strictness" by adjusting the P-value and other parameters, but that's a topic for another time. The most simple function call will do for today. To get started, first install the package: <pre><code class="language-r">install.packages("Boruta")</code></pre> The <code>boruta()</code> function takes in the same parameters as <code>lm()</code>. It's a formula with the target variable on the left side and the predictors on the right side. The additional <code>doTrace</code> parameter is there to limit the amount of output printed to the console - setting it to 0 will remove it altogether: <pre><code class="language-r">library(Boruta) boruta_output <- Boruta(Weight ~ ., data = train_set, doTrace = 0) boruta_output</code></pre> Here are the contents of <code>boruta_output:</code> <img class="size-full wp-image-14574" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270b06402741d1b156f74_10-3.webp" alt="Image 10 - Results of a Boruta algorithm" width="762" height="190" /> Image 10 - Results of a Boruta algorithm Not too useful immediately, but we can extract the importance with a bit more manual labor. First, let's extract all the attributes that are significant: <pre><code class="language-r">rough_fix_mod <- TentativeRoughFix(boruta_output) boruta_signif <- getSelectedAttributes(rough_fix_mod) boruta_signif</code></pre> <img class="size-full wp-image-14576" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395cbdf484ee4e661d73c_11-3.webp" alt="Image 11 - Important features" width="988" height="80" /> Image 11 - Important features Now we can get to the importance scores and sort them in descending order: <pre><code class="language-r">importances <- attStats(rough_fix_mod) importances <- importances[importances$decision != "Rejected", c("meanImp", "decision")] importances[order(-importances$meanImp), ]</code></pre> <img class="size-full wp-image-14578" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270b29c5dda5a10df55e2_12-3.webp" alt="Image 12 - Importance scores" width="716" height="304" /> Image 12 - Importance scores Those are all the highly-significant features of our Fish market dataset. If you want, you can also show these importance scores graphically. The <code>bortua_output</code> can be passed in directly to the <code>plot</code> function, resulting in the following chart: <pre><code class="language-r">plot(boruta_output, ces.axis = 0.7, las = 2, xlab = "", main = "Feature importance")</code></pre> <img class="size-full wp-image-14580" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270da6452a923f6ff9d9b_13-3.webp" alt="Image 13 - Feature importance plot" width="1548" height="1756" /> Image 13 - Feature importance plot To read this chart, remember only one thing - The green columns are confirmed to be important and the others aren't. We don't see any red features, but those wouldn't be important if you were to see them. Blue box plots represent variables used by Boruta to determine importance, and you can ignore them. <hr /> <h2 id="conclusion">Summary of R Linear Regression</h2> Today you've learned how to train linear regression models in R. You've implemented a simple linear regression model entirely from scratch, and a multiple linear regression model with a built-in function on the real dataset. You've also learned how to evaluate the model through summary functions, residual plots, and various metrics such as MSE and RMSE. If you want to implement machine learning in your organization, you can always reach out to <a class="editor-rtfLink" href="https://wordpress.appsilon.com/" target="_blank" rel="noopener noreferrer">Appsilon</a> for help. <h3>Learn More</h3><ul><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer">What Can I Do With R? 6 Essential R Packages for Programmers</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/image-classification-tutorial/" target="_blank" rel="noopener noreferrer">Getting Started With Image Classification: fastai, ResNet, MobileNet, and More</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/ai-for-wildlife-image-classification-appsilon-ai4g-project-receives-google-grant/" target="_blank" rel="noopener noreferrer">AI for Good: ML Wildlife Image Classification to Analyze Camera Trap Datasets</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO Algorithm and YOLO Object Detection: An Introduction</a></li></ul>