How to Build a Handwritten Digit Classifier with R and Random Forests

February 14, 2021

<h2>Build an MNIST Classifier With Random Forests</h2>

Simple image classification tasks don't require deep learning models. Today you'll learn how to build a handwritten digit classifier from scratch with R and Random Forests, and which "gotchas" to watch for along the way.

<blockquote>Are you completely new to machine learning? <a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer">Start with this guide on linear regression</a>.</blockquote>

Today's article is structured as follows:

<ul><li><a href="#introduction">Dataset Introduction</a></li><li><a href="#loading">Dataset Loading</a></li><li><a href="#training">Model Training</a></li><li><a href="#evaluation">Model Evaluation</a></li><li><a href="#conclusion">Conclusion</a></li></ul>

<h2 id="introduction">Dataset Introduction</h2>

MNIST is the "hello world" of image classification datasets. It contains 70,000 images of handwritten digits (60,000 for training and 10,000 for testing), ranging from zero to nine. Each image is 28x28 pixels. The following image displays a few handwritten digits from the dataset:

<img class="size-full wp-image-6613" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d560109e5d60f59d3f97_d85d6fc4_1-3.webp" alt="Image 1 - MNIST dataset sample (source) " width="594" height="361" />

Image 1 - MNIST dataset sample (<a href="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" target="_blank" rel="noopener noreferrer">source</a>)

As you can see, these images should be relatively easy to classify. The most common approach is to use a neural network with a couple of convolutional layers (to detect patterns), followed by a couple of fully connected layers and an output layer (with ten nodes), but there's a simpler approach. MNIST is a toy dataset, so you can replace the neural network architecture with something simpler, like random forests. This requires flattening each image from a 28x28 grid into a single row of 784 values (1x784).
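To make the flattening step concrete, here is a minimal base R sketch. The pixel values are made up, and flattening row by row is an assumption about the layout rather than something the dataset guarantees:

```r
# A hypothetical 28x28 "image" with made-up pixel intensities between 0 and 255
set.seed(42)
img <- matrix(sample(0:255, 28 * 28, replace = TRUE), nrow = 28, ncol = 28)

# R stores matrices column by column, so transpose first to flatten row by row
flat <- as.vector(t(img))
length(flat)

# The pixel at (row_idx, col_idx) ends up at position (row_idx - 1) * 28 + col_idx
row_idx <- 3
col_idx <- 5
flat[(row_idx - 1) * 28 + col_idx] == img[row_idx, col_idx]

# The flat vector still holds every pixel value: the grid can be rebuilt from it
restored <- matrix(flat, nrow = 28, byrow = TRUE)
identical(restored, img)
```

The values survive the round trip, but a model that sees only the 784 separate columns has no built-in notion of which pixels were neighbors in the original grid.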
In a nutshell, you'll end up with a tabular dataset of 784 columns (one for each pixel). More on the pros and cons of this approach in a bit. Let's load the dataset next and talk strategy afterward.

<h2 id="loading">Dataset Loading</h2>

You can download both the training and testing sets from <a class="editor-rtfLink" href="https://pjreddie.com/projects/mnist-in-csv/" target="_blank" rel="noopener noreferrer">this link</a>. The data comes in CSV format instead of PNG, so you can skip the image-to-table conversion. Keep in mind that the CSVs don't contain column names, so you'll have to specify <code>col_names = FALSE</code> when loading the files.

The following code snippet loads both sets and extracts the labels (the actual digit class). It also prints the first 20 labels from the training set:

<script src="https://gist.github.com/darioappsilon/614208d72e504c870749df00ba9bbd42.js"></script>

The results are shown in the following image:

<img class="size-full wp-image-6614" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d561b6953e1bff271727_ef21712b_2-3.webp" alt="Image 2 - First 20 digit labels from the training set" width="712" height="82" />

Image 2 - First 20 digit labels from the training set

Note the Levels line in the output. It's there because you've converted the digits to factors with the <code>as.factor()</code> function. Finally, let's see how many records there are for each digit:

<script src="https://gist.github.com/darioappsilon/d3c79cc20b1863826e5725e91df27401.js"></script>

Here are the results:

<img class="size-full wp-image-6615" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d5623e4ce0965306b3cd_ebf65ee8_3-3.webp" alt="Image 3 - Record count per digit class" width="798" height="74" />

Image 3 - Record count per digit class

As you can see, the classes aren't perfectly balanced. The counts range from around 5400 to 6700 per digit, but that shouldn't be too big of an issue for the classifier. Next, let's see how you can train the model.
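The loading steps above can be sketched without downloading anything. This is an illustration, not the article's own snippet. It writes a tiny made-up CSV to a temporary file, assumes the label sits in the first column with pixels after it, and uses base R's <code>read.csv(header = FALSE)</code> as a stand-in for the <code>col_names = FALSE</code> option mentioned earlier:

```r
# Write a tiny header-less CSV so the sketch runs on its own.
# Assumed layout: first column = digit label, remaining columns = pixels
# (4 made-up pixel columns here instead of 784).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "5,0,12,255,0",
  "0,0,0,128,64",
  "4,9,0,0,200",
  "5,30,30,0,0"
), csv_path)

# header = FALSE tells base R that the first row is data, not column names
train <- read.csv(csv_path, header = FALSE)

# Extract the label column as a factor and keep only the pixel columns
labels <- as.factor(train[[1]])
pixels <- train[, -1]

print(labels)         # printing a factor also prints its Levels line
print(table(labels))  # number of records per digit class
```

With the real files, the same pattern applies to all 785 columns: one label column followed by 784 pixel columns.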
Spoiler alert: it requires only a single line of code.

<h2 id="training">Model Training</h2>

You'll use the Random Forests algorithm to build a handwritten digit classifier. As discussed before, this has some pros and cons compared to neural network classifiers. The biggest pro is the training speed. Training finishes in a minute or so on a CPU, whereas training a neural network can take anywhere from minutes (GPU) to hours (CPU), depending on the model architecture and your hardware.

The downside of using Random Forests (or any other algorithm that works on flattened, tabular input) is the loss of 2D information. When you flatten the image (go from 28x28 to 1x784), you lose the information about which pixels surround each other. A convolution operation is the go-to approach for any more demanding image classification problem. Still, the Random Forest classifier should suit you fine on the MNIST dataset.

The following code snippet shows you how to import the library, train the model, and print the results. The execution will take a minute or so, depending on your hardware:

<script src="https://gist.github.com/darioappsilon/629cf33a76c8daf52a8fa0857dbf3dde.js"></script>

The results are shown in the image below:

<img class="size-full wp-image-6616" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d56313d4a2a1c0faf6af_35b18a9f_4-3.webp" alt="Image 4 - Results of a random forests model" width="1302" height="720" />

Image 4 - Results of a random forests model

The image above shows the confusion matrix for the training set, alongside the classification error for each class. We'll talk more about model evaluation in the next section.

<h2 id="evaluation">Model Evaluation</h2>

The first metric you'll check is the overall accuracy. The random forest model gives you access to the error rate among all of the classes, so you can calculate the mean error rate and subtract it from 1 to get the accuracy.
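Here is a minimal sketch of the train-then-score flow described above, not the article's exact code. It assumes the <code>randomForest</code> package is installed. It uses the built-in <code>iris</code> data as a small stand-in for the 784 pixel columns, and the two accuracy calculations are two possible readings of "mean error rate":

```r
library(randomForest)

set.seed(42)

# Stand-in data: iris replaces the pixel columns (x) and digit labels (y)
x <- iris[, 1:4]
y <- iris$Species  # a factor, so randomForest performs classification

# The training call itself is a single line
model <- randomForest(x = x, y = y, ntree = 100)

# Confusion matrix with a class.error column, based on out-of-bag predictions
print(model$confusion)

# err.rate has one row per tree; its columns are "OOB" plus one per class.
# Reading 1: average every entry of the matrix, then subtract from 1
accuracy_mean <- 1 - mean(model$err.rate)

# Reading 2: take the out-of-bag error after the last tree
accuracy_oob <- 1 - model$err.rate[model$ntree, "OOB"]

print(c(mean_based = accuracy_mean, oob_based = accuracy_oob))
```

The two numbers are usually close. The out-of-bag figure after the last tree describes the finished forest, while the matrix-wide mean also folds in the weaker early trees.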
You can use the following code snippet to get the overall accuracy:

<script src="https://gist.github.com/darioappsilon/2e15272f309519727e5f989eb291be79.js"></script>

The results are shown in the following image:

<img class="size-full wp-image-6617" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d563d48dce80f00e11cd_48f7d149_5-3.webp" alt="Image 5 - Overall accuracy on the training set" width="220" height="42" />

Image 5 - Overall accuracy on the training set

As you can see, the accuracy is around 95% overall. Not bad for a random forest classifier. Next, let's explore the error rate for every digit. Some digits may be easier to classify than others, so let's find out. You'll need the <code>dplyr</code> package for this calculation. You'll use it to select the appropriate columns and then calculate their means with the <code>colMeans()</code> function. Here's the entire code snippet:

<script src="https://gist.github.com/darioappsilon/e177cdd87ab4f33dc408ca74ff8fecdd.js"></script>

The results are shown below:

<img class="size-full wp-image-6618" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d564cba09eb75280d890_d00caaba_6-3.webp" alt="Image 6 - Average error rate for each digit" width="874" height="154" />

Image 6 - Average error rate for each digit

As you can see, zeros and ones seem to be the easiest to classify, and fives and eights the hardest. That makes sense if you think about it: zeros and ones have simple, distinctive shapes, while fives and eights plausibly share more strokes with other digits.

<h2 id="conclusion">Conclusion</h2>

This article demonstrated how you can use a simple machine learning algorithm for image classification. Keep in mind that this shouldn't be your go-to approach for more complex images. Just imagine you had 512x512 images. Flattening them would result in a dataset with more than 260K columns.
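The column count is easy to verify with a few lines of arithmetic. The memory estimate at the end is only a rough illustration. It assumes single-channel (grayscale) images, one 8-byte double per pixel, and a hypothetical set of 60,000 images:

```r
# Number of columns after flattening a square, single-channel image
flattened_columns <- function(side) side * side

mnist_cols <- flattened_columns(28)   # 784
large_cols <- flattened_columns(512)  # 262144

# How many times wider the flattened table becomes
growth <- large_cols / mnist_cols

# Assumption: 8 bytes per pixel value and 60,000 images held in memory at once
bytes_per_image <- large_cols * 8
total_gb <- bytes_per_image * 60000 / 1e9

cat("Columns (28x28):  ", mnist_cols, "\n")
cat("Columns (512x512):", large_cols, "\n")
cat("Growth factor:    ", round(growth), "\n")
cat("Approx. size (GB):", round(total_gb), "\n")
```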
For more complex image classification problems, convolutional models are usually the better choice, as convolutions detect the features of objects more accurately than a model trained on flattened pixels. If you want to implement machine learning in your organization, you can always reach out to <a class="editor-rtfLink" href="https://wordpress.appsilon.com/" target="_blank" rel="noopener noreferrer">Appsilon</a> for help.

<h1>Learn More</h1>

<ul><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-linear-regression/" target="_blank" rel="noopener noreferrer">Machine Learning with R: A Complete Guide to Linear Regression</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-logistic-regression/" target="_blank" rel="noopener noreferrer">Machine Learning with R: A Complete Guide to Logistic Regression</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-decision-treees/" target="_blank" rel="noopener noreferrer">Machine Learning with R: A Complete Guide to Decision Trees</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/r-for-programmers/" target="_blank" rel="noopener noreferrer">What Can I Do With R? 6 Essential R Packages for Programmers</a></li><li><a class="editor-rtfLink" href="https://wordpress.appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO Algorithm and YOLO Object Detection: An Introduction</a></li></ul>
