Complete Guide to Gradient Boosting and XGBoost in R
<em><strong>Updated</strong>: August 22, 2022.</em> <h2><span data-preserver-spaces="true">R XGBoost and Gradient Boosting</span></h2> <span data-preserver-spaces="true">Gradient boosting is one of the most effective techniques for building machine learning models. It works by combining many weak learners (models whose individual predictive power is low) into a single strong model. Today you'll learn how to work with XGBoost in R, from data preparation and visualization to the feature importance of predictive models.</span> <blockquote><span data-preserver-spaces="true">Do you want to learn more about machine learning with R? </span><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-decision-treees/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Check our complete guide to decision trees</span></a><span data-preserver-spaces="true">.</span></blockquote> <span data-preserver-spaces="true">Table of contents:</span> <ul><li><a href="#gradient-boosting">Introduction to Gradient Boosting</a></li><li><a href="#xgboost">Introduction to R XGBoost</a></li><li><a href="#dataset">Dataset Loading and Preparation</a></li><li><a href="#modeling">Predictive Modeling with R XGBoost</a></li><li><a href="#predictions">Predictions and Evaluations</a></li><li><a href="#conclusion">Summary of R XGBoost</a></li></ul> <hr /> <h2 id="gradient-boosting"><span data-preserver-spaces="true">Introduction to Gradient Boosting</span></h2> <span data-preserver-spaces="true">The general idea behind gradient boosting is to combine weak learners to produce a more accurate model. These "weak learners" are typically shallow decision trees, and gradient boosting adds them one after another, with each new tree trained to correct the errors of the trees before it. 
</span> <span data-preserver-spaces="true">Boosting first saw wide practical success with </span><strong><span data-preserver-spaces="true">AdaBoost</span></strong><span data-preserver-spaces="true"> (Adaptive Boosting). This algorithm combines multiple single-split decision trees (decision stumps). AdaBoost puts more emphasis on observations that are difficult to classify: it increases their weights so that each new weak learner focuses on them.</span> <span data-preserver-spaces="true">In a nutshell, gradient boosting consists of three elements:</span> <ul><li><strong><span data-preserver-spaces="true">Weak Learners</span></strong><span data-preserver-spaces="true"> - simple decision trees that are constructed based on purity scores (e.g., </span><em><span data-preserver-spaces="true">Gini</span></em><span data-preserver-spaces="true">).</span></li><li><strong><span data-preserver-spaces="true">Loss Function</span></strong><span data-preserver-spaces="true"> - a differentiable function you want to minimize. In regression, this could be the </span><em><span data-preserver-spaces="true">mean squared error</span></em><span data-preserver-spaces="true">, and in classification, it could be the </span><em><span data-preserver-spaces="true">log loss</span></em><span data-preserver-spaces="true">. </span></li><li><strong><span data-preserver-spaces="true">Additive Models</span></strong><span data-preserver-spaces="true"> - trees are added one at a time while the existing trees stay unchanged, and a functional gradient descent procedure is used to minimize the loss when adding trees.</span></li></ul> <span data-preserver-spaces="true">You now know the basics of gradient boosting. 
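</span> <span data-preserver-spaces="true">To make the three elements above concrete, here is a minimal sketch of gradient boosting for regression in base R. It is an illustration under simplifying assumptions, not how XGBoost is implemented: the weak learner is a single-split stump, the loss is the squared error, and the helper names (<code>fit_stump()</code>, <code>predict_stump()</code>), the toy data, and the values of <code>eta</code> and <code>n_rounds</code> are all made up for this example.</span>
<pre><code class="language-r"># Toy gradient boosting for regression with squared-error loss (base R only).
# Every name and value below is illustrative, not part of any package.
set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# Weak learner: a single-split tree (stump) chosen to minimize squared error
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between sorted x values
  sse <- sapply(splits, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

eta <- 0.1       # learning rate (shrinkage)
n_rounds <- 100  # number of boosting rounds
pred <- rep(mean(y), length(y))  # start from a constant model
mse <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  residuals <- y - pred                         # negative gradient of 0.5 * (y - pred)^2
  stump <- fit_stump(x, residuals)              # fit the weak learner to the residuals
  pred <- pred + eta * predict_stump(stump, x)  # additive update
  mse[m] <- mean((y - pred)^2)
}

# Training error before boosting, after one round, and after the last round
round(c(start = mean((y - mean(y))^2), first = mse[1], last = mse[n_rounds]), 4)</code></pre>
<span data-preserver-spaces="true">Each round fits a stump to the current residuals, which are the negative gradient of the squared-error loss, and adds a shrunken copy of it to the running prediction. With this setup, the training error can only go down or stay flat from one round to the next.</span> <span data-preserver-spaces="true">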
The following section will introduce the most popular gradient boosting algorithm - XGBoost.</span> <h2 id="xgboost"><span data-preserver-spaces="true">Introduction to R XGBoost</span></h2> <span data-preserver-spaces="true">XGBoost stands for </span><em><span data-preserver-spaces="true">eXtreme Gradient Boosting</span></em><span data-preserver-spaces="true">, an algorithm that has been behind many winning Kaggle solutions, particularly on tabular data. It is designed to deliver state-of-the-art results fast.</span> <span data-preserver-spaces="true">XGBoost is used both in regression and classification as a go-to algorithm. As the name suggests, it utilizes the </span><em><span data-preserver-spaces="true">gradient boosting</span></em><span data-preserver-spaces="true"> technique, adding weak learners round after round, either for a fixed number of rounds or until an evaluation metric stops improving. </span> <span data-preserver-spaces="true">Today you'll learn how to use the XGBoost algorithm with R by modeling one of the simplest and best-known datasets - the Iris dataset - starting from the next section.</span> <h2 id="dataset"><span data-preserver-spaces="true">Dataset Loading and Preparation</span></h2> <span data-preserver-spaces="true">As mentioned earlier, the Iris dataset will be used to demonstrate how the XGBoost algorithm works. Let's start simple with a necessary first step - library and dataset imports. 
You'll need only a few, and the dataset is built into R:</span>
<pre><code class="language-r">library(xgboost)  # gradient boosting
library(caTools)  # sample.split()
library(dplyr)    # select(), %>%, as_tibble()
library(cvms)     # plot_confusion_matrix()
library(caret)    # confusionMatrix()

head(iris)</code></pre>
<span data-preserver-spaces="true">Here's what the first few rows look like:</span> <img class="size-full wp-image-6623" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d55a842ddef80cb8561e_00b8cab1_1-4.webp" alt="Image 1 - The first six rows of the Iris dataset" width="962" height="258" /> Image 1 - The first six rows of the Iris dataset <span data-preserver-spaces="true">We'll skip further exploration, since the dataset is well known: 150 flowers described by four numeric measurements, with 50 rows for each of the three species. </span> <span data-preserver-spaces="true">The next step is splitting the dataset into training and testing subsets. The following code snippet splits the dataset in a 70:30 ratio and then further splits each subset into features (X) and target (y). The species labels are converted to zero-based integers, because XGBoost expects class labels in the range from 0 to <code>num_class - 1</code>:</span>
<pre><code class="language-r">set.seed(42)

# 70:30 split that preserves the class proportions of the target
sample_split <- sample.split(Y = iris$Species, SplitRatio = 0.7)
train_set <- subset(x = iris, sample_split == TRUE)
test_set <- subset(x = iris, sample_split == FALSE)

# Zero-based integer labels: setosa = 0, versicolor = 1, virginica = 2
y_train <- as.integer(train_set$Species) - 1
y_test <- as.integer(test_set$Species) - 1
X_train <- train_set %>% select(-Species)
X_test <- test_set %>% select(-Species)</code></pre>
<span data-preserver-spaces="true">You now have everything needed to start with the training process. Let's do that in the next section.</span> <h2 id="modeling"><span data-preserver-spaces="true">Predictive Modeling with R XGBoost</span></h2> <span data-preserver-spaces="true">XGBoost uses something known as a DMatrix to store data. A DMatrix is a data structure that stores data in a way optimized for both memory efficiency and training speed. 
</span> <span data-preserver-spaces="true">Besides the DMatrix, you'll also have to specify the parameters for the XGBoost model. You can learn more about all the <a href="https://xgboost.readthedocs.io/en/latest/parameter.html" target="_blank" rel="noopener noreferrer">available parameters</a> in the official documentation, but we'll stick to a subset of the most basic ones.</span> <span data-preserver-spaces="true">The following snippet shows you how to construct DMatrix data structures for both training and testing subsets and how to build a list of parameters:</span>
<pre><code class="language-r">xgb_train <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
xgb_test <- xgb.DMatrix(data = as.matrix(X_test), label = y_test)
xgb_params <- list(
  booster = "gbtree",
  eta = 0.01,                    # learning rate
  max_depth = 8,                 # maximum depth of each tree
  gamma = 4,                     # minimum loss reduction required to split
  subsample = 0.75,              # fraction of rows sampled per tree
  colsample_bytree = 1,          # fraction of columns sampled per tree
  objective = "multi:softprob",  # output one probability per class
  eval_metric = "mlogloss",
  num_class = length(levels(iris$Species))
)</code></pre>
<span data-preserver-spaces="true">Now you have everything needed to build a model. Here's how:</span>
<pre><code class="language-r">xgb_model <- xgb.train(
  params = xgb_params,
  data = xgb_train,
  nrounds = 5000,
  verbose = 1
)
xgb_model</code></pre>
<span data-preserver-spaces="true">The results of calling <code>xgb_model</code> are displayed below:</span> <img class="size-full wp-image-6624" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d55b4b197a9285803c73_d9b325c0_2-4.webp" alt="Image 2 - XGBoost model after training" width="1598" height="610" /> Image 2 - XGBoost model after training <h3>XGBoost Feature Importance</h3> You now have the model, but what's the underlying logic behind it? Which features does it rely on most when making predictions? We can find that out by exploring feature importance. Luckily, XGBoost comes with this functionality built in, so we don't have to use any external libraries. 
The first step is to construct an importance matrix. This is done with the <code>xgb.importance()</code> function, which accepts two parameters - column names and the XGBoost model itself. Here's the code snippet:
<pre><code class="language-r">importance_matrix <- xgb.importance(
  feature_names = colnames(xgb_train),
  model = xgb_model
)
importance_matrix</code></pre>
Let's inspect what it contains: <img class="size-full wp-image-15292" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d55cf9a88e2eece18edf_7f866aa0_3-5.webp" alt="Image 3 - XGBoost Feature Importances" width="1024" height="226" /> Image 3 - XGBoost Feature Importances There's a lot of information present in the table, so how can we simplify it? Easily, with the power of data visualization. The <code>xgb.plot.importance()</code> function allows you to use the importance matrix to produce a bar chart:
<pre><code class="language-r">xgb.plot.importance(importance_matrix)</code></pre>
<img class="size-full wp-image-15294" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d55df7aa9374eb245792_5ee5b0fa_4-3.webp" alt="Image 4 - XGBoost Feature Importances Chart" width="1878" height="1430" /> Image 4 - XGBoost Feature Importances Chart As you can see, the Petal Length feature is the most important for making predictions. With that out of the way, let's actually make some predictions. <h2 id="predictions"><span data-preserver-spaces="true">Predictions and Evaluations</span></h2> <span data-preserver-spaces="true">You can use the <code>predict()</code> function to make predictions with the XGBoost model, just as with any other model. 
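</span> <span data-preserver-spaces="true">Before looking at the real output, it may help to see what these predictions are. With the <code>multi:softprob</code> objective, the model produces one raw score per class and turns the scores into probabilities with the softmax function. The sketch below uses base R and made-up scores only. <code>softmax()</code> is a helper defined here for illustration, not an xgboost function. It also shows how a flat vector of probabilities can be arranged into one row per observation, which is the layout that <code>reshape = TRUE</code> asks for.</span>
<pre><code class="language-r"># Illustrative only: how per-class probabilities arise from raw scores
softmax <- function(z) {
  e <- exp(z - max(z))  # subtract the max for numerical stability
  e / sum(e)
}

species <- c("setosa", "versicolor", "virginica")

# Made-up raw scores for two flowers, three classes each
flat <- c(softmax(c(2.0, 0.5, -1.0)), softmax(c(-1.0, 0.2, 1.5)))

# One row per observation, one column per class
prob_matrix <- matrix(flat, ncol = length(species), byrow = TRUE,
                      dimnames = list(NULL, species))

rowSums(prob_matrix)  # each row sums to 1

# The predicted class is the column with the highest probability
predicted <- species[apply(prob_matrix, 1, which.max)]
predicted</code></pre>
<span data-preserver-spaces="true">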
The next step is to convert the predictions to a data frame and assign column names, as the predictions are returned in the form of probabilities:</span>
<pre><code class="language-r"># reshape = TRUE returns one row per observation and one column per class
# (an argument of the xgboost 1.x R API used in this guide)
xgb_preds <- predict(xgb_model, as.matrix(X_test), reshape = TRUE)
xgb_preds <- as.data.frame(xgb_preds)
colnames(xgb_preds) <- levels(iris$Species)
xgb_preds</code></pre>
<span data-preserver-spaces="true">Here's what the above code snippet produces:</span> <img class="size-full wp-image-6625" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d55e0ca3d363ce965aa3_7f2cf096_3-4.webp" alt="Image 5 - Prediction probabilities for every flower species" width="580" height="266" /> Image 5 - Prediction probabilities for every flower species <span data-preserver-spaces="true">As you would imagine, these probabilities add up to 1 for a single row. The column with the highest probability is the flower species predicted by the model.</span> <span data-preserver-spaces="true">Still, it would be nice to have two additional columns. The first one represents the predicted class (the class with the highest predicted probability). The other represents the actual class, so we can estimate how well the model performs on unseen data. </span> <span data-preserver-spaces="true">The following snippet does just that:</span>
<pre><code class="language-r"># For every row, pick the name of the column with the highest probability
xgb_preds$PredictedClass <- apply(xgb_preds, 1, function(y) colnames(xgb_preds)[which.max(y)])
# Map the zero-based integer labels back to species names
xgb_preds$ActualClass <- levels(iris$Species)[y_test + 1]
xgb_preds</code></pre>
<span data-preserver-spaces="true">The results are displayed in the following figure:</span> <img class="size-full wp-image-6626" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d55e3e4ce0965306b31a_3101cbf4_4-4.webp" alt="Image 6 - Predicted class vs. actual class on the test set" width="998" height="362" /> Image 6 - Predicted class vs. actual class on the test set <span data-preserver-spaces="true">Things look promising, to say the least, but that's no reason to jump to conclusions. Next, we can calculate the overall accuracy score as the number of instances where predicted and actual classes match divided by the total number of rows:</span>
<pre><code class="language-r">accuracy <- sum(xgb_preds$PredictedClass == xgb_preds$ActualClass) / nrow(xgb_preds)
accuracy</code></pre>
<span data-preserver-spaces="true">Executing the above code prints out 0.9333 to the console, indicating we have a 93% accurate model on previously unseen data.</span> <span data-preserver-spaces="true">While we're here, we can also print the confusion matrix to see exactly what the model misclassified. Note that caret's <code>confusionMatrix()</code> takes the predictions first and the actual (reference) classes second:</span>
<pre><code class="language-r"># Predictions first (data), actual classes second (reference)
confusionMatrix(data = factor(xgb_preds$PredictedClass), reference = factor(xgb_preds$ActualClass))</code></pre>
<span data-preserver-spaces="true">The results are shown below:</span> <img class="size-full wp-image-6627" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d55fe385862d35c6e325_48878c62_5-4.webp" alt="Image 7 - Confusion matrix for XGBoost model on the test set" width="1114" height="1138" /> Image 7 - Confusion matrix for XGBoost model on the test set So, what's actually important here? The confusion matrix itself is shown at the top of the output, but it would be much easier to look at it visually. The <code>plot_confusion_matrix()</code> function from the <code>cvms</code> package does just that. It requires the matrix formatted as a tibble, so keep that in mind. 
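"Formatted as a tibble" here means long format: one row per combination of predicted and actual class, plus a column of counts. The base R sketch below shows that shape on four made-up labels. The vectors <code>actual</code> and <code>predicted</code> are hypothetical and unrelated to the model above, and a plain data frame stands in for the tibble.
<pre><code class="language-r"># Made-up labels, just to show the shape of the data
species <- c("setosa", "versicolor", "virginica")
actual <- factor(c("setosa", "versicolor", "virginica", "virginica"), levels = species)
predicted <- factor(c("setosa", "versicolor", "versicolor", "virginica"), levels = species)

# Wide form: a 3 x 3 table with predictions in rows and actual classes in columns
wide <- table(Prediction = predicted, Reference = actual)

# Long form: one row per (Prediction, Reference) pair plus a count column n
long <- as.data.frame(wide, responseName = "n")
long</code></pre>
The long table has nine rows, one per cell of the 3 x 3 matrix, and its column names match the ones passed to the plotting function.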
The snippet below converts caret's table to a tibble and plots it:
<pre><code class="language-r"># Predictions first (data), actual classes second (reference)
cm <- confusionMatrix(data = factor(xgb_preds$PredictedClass), reference = factor(xgb_preds$ActualClass))
cfm <- as_tibble(cm$table)  # long format with columns Prediction, Reference, n
plot_confusion_matrix(cfm, target_col = "Reference", prediction_col = "Prediction", counts_col = "n")</code></pre>
<img class="size-full wp-image-15296" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d560842ddef80cb8574b_a480733a_8-2.webp" alt="Image 8 - Confusion matrix plot" width="1458" height="1468" /> Image 8 - Confusion matrix plot <span data-preserver-spaces="true">As you can see, the only errors are three </span><em><span data-preserver-spaces="true">virginica</span></em><span data-preserver-spaces="true"> flowers that were classified as </span><em><span data-preserver-spaces="true">versicolor</span></em><span data-preserver-spaces="true">. There were no misclassifications in the </span><em><span data-preserver-spaces="true">setosa</span></em><span data-preserver-spaces="true"> species. </span> <span data-preserver-spaces="true">And that's how you can train and evaluate XGBoost models with R. Let's wrap things up in the next section.</span> <hr /> <h2 id="conclusion"><span data-preserver-spaces="true">Summary of R XGBoost</span></h2> <span data-preserver-spaces="true">XGBoost is a complex state-of-the-art algorithm for both classification and regression - thankfully, with a simple R API. Entire books are written on this single algorithm alone, so cramming everything into a single article isn't possible. </span> <span data-preserver-spaces="true">You've still learned a lot - from the basic theory and intuition to implementation and evaluation in R. If you want to learn more, please stay tuned to the Appsilon blog. 
More guides on the topic are expected in the following weeks.</span> <strong><span data-preserver-spaces="true">If you want to implement machine learning in your organization, you can always reach out to </span></strong><a class="editor-rtfLink" href="https://wordpress.appsilon.com/" target="_blank" rel="noopener noreferrer"><strong><span data-preserver-spaces="true">Appsilon</span></strong></a><strong><span data-preserver-spaces="true"> for help.</span></strong> <h2><span data-preserver-spaces="true">Learn More</span></h2><ul><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Machine Learning with R: A Complete Guide to Linear Regression</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-logistic-regression/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Machine Learning with R: A Complete Guide to Logistic Regression</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-decision-treees/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Machine Learning with R: A Complete Guide to Decision Trees</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-mnist-random-forests/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">How to Build a Handwritten Digit Classifier with R and Random Forests</span></a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">YOLO Algorithm and YOLO Object Detection: An Introduction</span></a></li></ul>