Machine Learning with R: A Complete Guide to Linear Regression

By Dario Radečić, December 24, 2020

<h2>Linear Regression with R</h2>

<em><strong>Updated</strong>: July 12, 2022.</em>

Chances are you've had some prior exposure to machine learning and statistics. At its core, that's all R linear regression is - a simple statistics problem.

<blockquote>Need help with Machine Learning solutions? <a href="https://wordpress.appsilon.com/" target="_blank" rel="noopener noreferrer">Reach out to Appsilon</a>.</blockquote>

Today you'll learn the different types of linear regression and how to implement all of them in R. You'll also get a glimpse into feature importance - a concept used everywhere in machine learning to determine which features have the most predictive power.

Table of contents:

<ul><li><a href="#intro">Introduction to R Linear Regression</a></li><li><a href="#simple-lr">Simple Linear Regression from Scratch</a></li><li><a href="#multiple-lr">Multiple Linear Regression with R</a></li><li><a href="#feature-importance">R Linear Regression Feature Importance</a></li><li><a href="#conclusion">Summary of R Linear Regression</a></li></ul>

<hr />

<h2 id="intro">Introduction to Linear Regression</h2>

Linear regression is a simple algorithm developed in the field of statistics. As the name suggests, it assumes a linear relationship between the input variable(s) and a single output variable. Needless to say, the output variable (what you're predicting) has to be continuous, and it is calculated as a linear combination of the input variables.

There are two types of linear regression:

<ul><li><strong>Simple linear regression</strong> - only one input variable</li><li><strong>Multiple linear regression</strong> - multiple input variables</li></ul>

You'll implement both today - simple linear regression from scratch and multiple linear regression with built-in R functions.

You can use a linear regression model to learn which features are important by examining its <strong>coefficients</strong>. If a coefficient is close to zero, the corresponding feature is considered less important than if the coefficient were a large positive or negative value.

That's also how the linear regression model generates the output: each coefficient is multiplied by its corresponding input variable, and in the end, the bias (intercept) term is added.
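To make that mechanism concrete, here's a minimal sketch with made-up coefficient values - these numbers don't come from any model trained in this article:

<pre><code class="language-r"># Hypothetical coefficients - for illustration only
bias &lt;- 1.5   # intercept term
w1 &lt;- 0.8     # coefficient for the first input variable
w2 &lt;- -0.2    # coefficient for the second input variable

# A single observation with two input variables
x1 &lt;- 10
x2 &lt;- 25

# Prediction = bias + sum of (coefficient * input)
prediction &lt;- bias + w1 * x1 + w2 * x2
prediction  # 4.5</code></pre>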
There's still one thing we should cover before diving into the code - the assumptions of a linear regression model:

<ul><li><strong>Linear assumption</strong> - the model assumes that the relationship between variables is linear</li><li><strong>No noise</strong> - the model assumes that the input and output variables are not noisy - so remove outliers if possible</li><li><strong>No collinearity</strong> - the model will overfit when you have highly correlated input variables</li><li><strong>Normal distribution</strong> - the model will make more reliable predictions if your input and output variables are normally distributed. If that's not the case, try transforming your variables to make them more normal-looking</li><li><strong>Rescaled inputs</strong> - use scalers or normalizers to make more reliable predictions</li></ul>

You should be aware of these assumptions every time you're creating linear models. We'll ignore most of them for the purpose of this article, as the goal is to show you the general syntax you can copy-paste between projects - but the sketch below shows how you might quickly eyeball a few of them.
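A minimal sketch of such quick checks, assuming a data frame of numeric inputs (the column names below are invented for the example):

<pre><code class="language-r"># Invented numeric data frame - replace with your own inputs
inputs &lt;- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = runif(100))

# No collinearity - look for pairs with high absolute correlation
cor(inputs)

# Normal distribution - eyeball a histogram per variable
hist(inputs$x1)

# Rescaled inputs - standardize to zero mean and unit variance
inputs_scaled &lt;- as.data.frame(scale(inputs))
summary(inputs_scaled)</code></pre>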
<h2 id="simple-lr">Simple Linear Regression from Scratch</h2>

If you have a single input variable, you're dealing with simple linear regression. That won't be the case most of the time, but it can't hurt to know. A simple linear regression can be expressed as:

<img class="size-full wp-image-6292" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958bd772ec361ef23270_1-4.webp" alt="Image 1 - Simple linear regression formula (line equation)" width="358" height="98" /> Image 1 - Simple linear regression formula (line equation)

As you can see, there are two terms you need to calculate beforehand - the betas. You'll first see how to calculate Beta1, as Beta0 depends on it. Here's the formula:

<img class="size-full wp-image-6293" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958cf32c7f090d65fd07_2-4.webp" alt="Image 2 - Beta1 equation" width="720" height="180" /> Image 2 - Beta1 equation

And here's the formula for Beta0:

<img class="size-full wp-image-6294" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958e8ac01edf5bf546b8_3-4.webp" alt="Image 3 - Beta0 equation" width="412" height="88" /> Image 3 - Beta0 equation

The x's and y's with a bar over them represent the mean (average) of the corresponding variables.

Let's see how all of this works in action. The code snippet below generates <strong>X</strong> as 300 evenly spaced values between 1 and 300, and <strong>Y</strong> as draws from a normal distribution centered just above the corresponding X value, with a bit of noise added. Both X and Y are then combined into a single data frame and visualized as a scatter plot with the <code>ggplot2</code> package:

<pre><code class="language-r">library(ggplot2)

# Generate synthetic data with a clear linear relationship
x &lt;- seq(from = 1, to = 300)
y &lt;- rnorm(n = 300, mean = x + 2, sd = 25)

# Convert to dataframe
simple_lr_data &lt;- data.frame(x, y)

# Visualize as scatter plot
ggplot(data = simple_lr_data, aes(x = x, y = y)) +
  geom_point(size = 3, color = "#0099f9") +
  theme_classic() +
  labs(
    title = "Dataset for simple linear regression",
    subtitle = "A clear linear relationship is visible"
  )</code></pre>

<img class="size-full wp-image-6295" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b3958e51a4dda06614175b_4-4.webp" alt="Image 4 - Input data as a scatter plot" width="1276" height="1003" /> Image 4 - Input data as a scatter plot

Onto the coefficient calculation now. Beta1 and Beta0 are computed first and then wrapped into a <code>simple_lr_predict()</code> function that implements the line equation.

The predictions are obtained by applying <code>simple_lr_predict()</code> to the vector X - they should all lie on a single straight line. Finally, the input data and predictions are visualized with the <code>ggplot2</code> package:

<pre><code class="language-r"># Calculate coefficients
b1 &lt;- (sum((x - mean(x)) * (y - mean(y)))) / (sum((x - mean(x))^2))
b0 &lt;- mean(y) - b1 * mean(x)

# Define function for generating predictions
simple_lr_predict &lt;- function(x) {
  return(b0 + b1 * x)
}

# Apply simple_lr_predict() to input data
simple_lr_predictions &lt;- sapply(x, simple_lr_predict)
simple_lr_data$yhat &lt;- simple_lr_predictions

# Visualize input data and the best fit line
ggplot(data = simple_lr_data, aes(x = x, y = y)) +
  geom_point(size = 3, color = "#0099f9") +
  geom_line(aes(x = x, y = yhat), size = 2) +
  theme_classic() +
  labs(
    title = "Applying simple linear regression to data",
    subtitle = "Black line = best fit line"
  )</code></pre>

<img class="size-full wp-image-6296" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01c78051092b3b541c409_5-4.webp" alt="Image 5 - Input data as a scatter plot with predictions (best-fit line)" width="1276" height="1003" /> Image 5 - Input data as a scatter plot with predictions (best-fit line)

And that's how you can implement simple linear regression in R from scratch!
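As a quick sanity check that isn't part of the original walkthrough, you can compare the hand-computed coefficients against R's built-in <code>lm()</code> function - the estimates should match up to floating-point precision:

<pre><code class="language-r"># Fit the same model with the built-in lm() function
builtin_model &lt;- lm(y ~ x, data = simple_lr_data)

coef(builtin_model)  # (Intercept) and slope from lm()
c(b0, b1)            # the from-scratch coefficients - should be identical</code></pre>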
Next, you'll learn how to handle situations when there are multiple input variables.

<h2 id="multiple-lr">Multiple Linear Regression with R</h2>

You'll use the <a href="https://www.kaggle.com/aungpyaeap/fish-market" target="_blank" rel="noopener noreferrer">Fish Market</a> dataset to build your model. To start, the goal is to load the dataset and check whether some of the assumptions hold. The normal distribution and outlier assumptions can be checked with boxplots.

<blockquote>Want to create better data visualizations? Learn how to make <a href="https://appsilon.com/ggplot2-boxplots/" target="_blank" rel="noopener">stunning boxplots with ggplot2</a>.</blockquote>

The code snippet below loads the dataset and visualizes box plots for every feature (not the target):

<pre><code class="language-r">library(reshape)

# Load in the dataset
df &lt;- read.csv("Fish.csv")

# Remove target variable
temp_df &lt;- subset(df, select = -c(Weight))
melt_df &lt;- melt(temp_df)

# Draw boxplot
boxplot(value ~ variable, data = melt_df)</code></pre>

<img class="size-full wp-image-6297" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01cd59dc2dda17fbf3ced_6-4.webp" alt="Image 6 - Boxplots of the input features" width="1276" height="936" /> Image 6 - Boxplots of the input features

A degree of skew seems to be present in all input variables, and the first three contain a couple of outliers. We'll keep this article strictly machine learning-based, so we won't do any data preparation or cleaning.

A train/test split is the obvious next step once you're done with preparation. The <code>caTools</code> package is the perfect candidate for the job.

You can train the model on the training set after the split. R has the <code>lm()</code> function built in, and it is used to train linear models. Inside <code>lm()</code>, you write the target variable on the left and the input features on the right, separated by the <code>~</code> sign. If you put a dot instead of feature names, it means you want to train the model on all features. After the model is trained, you can call the <code>summary()</code> function to see how well it performed on the training set.
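For illustration, here are a few equivalent ways to spell out that formula against the Fish Market data frame (the subset of features picked here is arbitrary):

<pre><code class="language-r"># All other columns as predictors - the dot expands to every remaining feature
lm(Weight ~ ., data = df)

# Only a hand-picked subset of features
lm(Weight ~ Length1 + Height + Width, data = df)

# Every feature except one
lm(Weight ~ . - Species, data = df)</code></pre>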
Here's a code snippet for everything discussed so far:

<pre><code class="language-r">library(caTools)
set.seed(42)

# Train/Test split in 70:30 ratio
sample_split &lt;- sample.split(Y = df$Weight, SplitRatio = 0.7)
train_set &lt;- subset(x = df, sample_split == TRUE)
test_set &lt;- subset(x = df, sample_split == FALSE)

# Fit the model and obtain summary
model &lt;- lm(Weight ~ ., data = train_set)
summary(model)</code></pre>

<img class="size-full wp-image-6298" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395b140c61fb73053a2be_7-3.webp" alt="Image 7 - Summary statistics of a multiple linear regression model" width="692" height="711" /> Image 7 - Summary statistics of a multiple linear regression model

The most interesting part here is the P-values, displayed in the <code>Pr(&gt;|t|)</code> column. A P-value tells you how likely it is to observe a coefficient at least this extreme if the variable actually had no effect on the target. It's common to use a 5% significance threshold, so if a P-value is 0.05 or below, the variable can be considered significant for the analysis.
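If you'd rather grab those P-values programmatically than read them off the printout, here's a small sketch that pulls them out of the summary's coefficient table:

<pre><code class="language-r"># The summary's coefficient table is a matrix with a P-value column
coef_table &lt;- coef(summary(model))

# Extract the P-values
p_values &lt;- coef_table[, "Pr(&gt;|t|)"]

# Names of terms at or below the 5% threshold
names(p_values[p_values &lt;= 0.05])</code></pre>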
Let's make a residual plot next. As a general rule, if the histogram of residuals looks normally distributed, the linear model is as good as it can be; if not, it means you can still improve it somehow. Here's the code for visualizing the residuals:

<pre><code class="language-r"># Get residuals
lm_residuals &lt;- as.data.frame(residuals(model))

# Visualize residuals
ggplot(lm_residuals, aes(residuals(model))) +
  geom_histogram(fill = "#0099f9", color = "black") +
  theme_classic() +
  labs(title = "Residuals plot")</code></pre>

<img class="size-full wp-image-6299" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395b10ab70dc239c896ee_8-3.webp" alt="Image 8 - Residuals plot of a multiple linear regression model" width="1276" height="1004" /> Image 8 - Residuals plot of a multiple linear regression model

As you can see, there's a bit of skew present due to a large error on the far right.

And now it's time to make predictions on the test set. You can use the <code>predict()</code> function to apply the model to the test set. As an additional step, you can combine the actual values and the predictions into a single data frame, just so the evaluation becomes easier. Here's how:

<pre><code class="language-r"># Make predictions on the test set
predictions &lt;- predict(model, test_set)

# Convert to dataframe
eval &lt;- cbind(test_set$Weight, predictions)
colnames(eval) &lt;- c("Y", "Yhat")
eval &lt;- as.data.frame(eval)
head(eval)</code></pre>

<img class="size-full wp-image-6300" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270af09eda01ce1845527_9-3.webp" alt="Image 9 - Dataset comparing actual values and predictions for the test set" width="206" height="238" /> Image 9 - Dataset comparing actual values and predictions for the test set

If you want a more concrete way of evaluating your regression models, look no further than RMSE (Root Mean Squared Error). Roughly speaking, this metric tells you how wrong your model is on average. Since we're predicting Weight, it reports the typical number of weight units the model is off by:

<pre><code class="language-r"># Evaluate model
mse &lt;- mean((eval$Y - eval$Yhat)^2)
rmse &lt;- sqrt(mse)</code></pre>

The <code>rmse</code> variable holds the value of 83.60, indicating the model is, on average, wrong by 83.60 units of weight.
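Since you'll likely compute this metric for every regression model you build, it can be handy to wrap it in a small helper function - a minimal sketch (named <code>calc_rmse</code> to avoid clashing with the <code>rmse</code> variable above):

<pre><code class="language-r"># Reusable RMSE helper
calc_rmse &lt;- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

calc_rmse(eval$Y, eval$Yhat)  # same value as the rmse variable above</code></pre>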
<h2 id="feature-importance">R Linear Regression Feature Importance</h2>

Most of the time, not all features are relevant for making predictions. If you were to exclude some of them, you'd likely get a better-performing model, or at least an identical one that's simpler to understand and interpret. That's where feature importance comes in.

Linear regression models are highly interpretable out of the box, but you can take them to the next level with the <code>Boruta</code> package. It's a feature ranking and selection algorithm based on the Random Forests algorithm, and it clearly shows you whether a variable is important or not. You can tweak the "strictness" by adjusting the P-value and other parameters, but that's a topic for another time. The simplest function call will do for today. To get started, first install the package:

<pre><code class="language-r">install.packages("Boruta")</code></pre>

The <code>Boruta()</code> function takes the same formula interface as <code>lm()</code>: the target variable on the left side and the predictors on the right. The additional <code>doTrace</code> parameter limits the amount of output printed to the console - setting it to 0 removes it altogether:

<pre><code class="language-r">library(Boruta)

boruta_output &lt;- Boruta(Weight ~ ., data = train_set, doTrace = 0)
boruta_output</code></pre>

Here are the contents of <code>boruta_output</code>:

<img class="size-full wp-image-14574" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270b06402741d1b156f74_10-3.webp" alt="Image 10 - Results of a Boruta algorithm" width="762" height="190" /> Image 10 - Results of a Boruta algorithm

Not too useful immediately, but we can extract the importances with a bit more manual labor. First, let's extract all the attributes that are significant:

<pre><code class="language-r">rough_fix_mod &lt;- TentativeRoughFix(boruta_output)
boruta_signif &lt;- getSelectedAttributes(rough_fix_mod)
boruta_signif</code></pre>

<img class="size-full wp-image-14576" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b395cbdf484ee4e661d73c_11-3.webp" alt="Image 11 - Important features" width="988" height="80" /> Image 11 - Important features

Now we can get to the importance scores and sort them in descending order:

<pre><code class="language-r">importances &lt;- attStats(rough_fix_mod)
importances &lt;- importances[importances$decision != "Rejected", c("meanImp", "decision")]
importances[order(-importances$meanImp), ]</code></pre>

<img class="size-full wp-image-14578" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270b29c5dda5a10df55e2_12-3.webp" alt="Image 12 - Importance scores" width="716" height="304" /> Image 12 - Importance scores

Those are all the highly significant features of our Fish Market dataset. If you want, you can also show these importance scores graphically. The <code>boruta_output</code> object can be passed directly to the <code>plot()</code> function, resulting in the following chart:

<pre><code class="language-r">plot(boruta_output, cex.axis = 0.7, las = 2, xlab = "", main = "Feature importance")</code></pre>

<img class="size-full wp-image-14580" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b270da6452a923f6ff9d9b_13-3.webp" alt="Image 13 - Feature importance plot" width="1548" height="1756" /> Image 13 - Feature importance plot

To read this chart, remember only one thing: the green box plots are confirmed to be important, and the others aren't. We don't see any red features here, but those would be the unimportant ones if they appeared. The blue box plots represent the shadow variables Boruta uses to determine importance, and you can ignore them.

<hr />

<h2 id="conclusion">Summary of R Linear Regression</h2>

Today you've learned how to train linear regression models in R. You've implemented a simple linear regression model entirely from scratch, and a multiple linear regression model with a built-in function on a real dataset.

You've also learned how to evaluate the model through the summary function, residual plots, and various metrics such as MSE and RMSE.

<strong>If you want to implement machine learning in your organization, you can always reach out to <a href="https://wordpress.appsilon.com/" target="_blank" rel="noopener noreferrer">Appsilon</a> for help.</strong>

<h3>Learn More</h3>
<ul><li><a href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer">What Can I Do With R? 6 Essential R Packages for Programmers</a></li><li><a href="https://wordpress.appsilon.com/image-classification-tutorial/" target="_blank" rel="noopener noreferrer">Getting Started With Image Classification: fastai, ResNet, MobileNet, and More</a></li><li><a href="https://wordpress.appsilon.com/ai-for-wildlife-image-classification-appsilon-ai4g-project-receives-google-grant/" target="_blank" rel="noopener noreferrer">AI for Good: ML Wildlife Image Classification to Analyze Camera Trap Datasets</a></li><li><a href="https://wordpress.appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO Algorithm and YOLO Object Detection: An Introduction</a></li></ul>
