Top 5 Data Science Take-Home Challenges in R Programming Language
Data science take-home challenges will definitely get you out of your comfort zone. But you know what? That's a good thing - you only learn something once you're outside of it, after all. Taming the take-home challenges is necessary for both beginner and advanced R programmers, but the challenges you'll work on vary significantly depending on your skill level. <blockquote>Need to buff your Python skills? Get started with our free <a href="https://github.com/Appsilon/datascience-python" target="_blank" rel="noopener">Introduction to Data Science in Python course</a>.</blockquote> Today we bring you 5 data science take-home challenges in the R programming language, aimed at both complete beginners and those with a couple of years of industry experience. None of the challenges requires any domain knowledge, so you can rest assured you'll know how to solve them. Only technical skills are required. <blockquote>Are you new to the R programming language? <a href="https://appsilon.com/r-for-programmers/" target="_blank" rel="noopener">Here are 6 essential packages you should know</a>.</blockquote> Here are the challenges: <ul><li><a href="#challenge-1">Titanic - Machine Learning from Disaster</a></li><li><a href="#challenge-2">Store Sales - Time Series Forecasting</a></li><li><a href="#challenge-3">Digit Recognizer</a></li><li><a href="#challenge-4">TensorFlow - Help Protect the Great Barrier Reef</a></li><li><a href="#challenge-5">Bag of Words Meets Bags of Popcorn</a></li></ul> <hr /> <h2 id="challenge-1">Titanic - Machine Learning from Disaster</h2> Yes, it's the most well-known data science challenge, and for a good reason. It provides just enough data preprocessing to keep you wondering what you can do better, and it's relatively simple in terms of machine learning. It's a binary classification problem, after all. 
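To give a feel for what a binary classification workflow can look like in base R, here is a minimal sketch. It does not use the real Kaggle files: the <code>passengers</code> data frame, its column names, and the survival probabilities are all made up for illustration, and logistic regression via <code>glm()</code> is just one simple choice of model.

```r
# Synthetic stand-in for passenger data (all values are made up).
set.seed(1)
n <- 200
passengers <- data.frame(
  sex = sample(c("female", "male"), n, replace = TRUE),
  age = round(runif(n, 1, 70)),
  stringsAsFactors = FALSE
)
passengers$age[sample(n, 20)] <- NA  # simulate missing ages

# Assumed survival probabilities, chosen only so the model has a signal to find.
p <- ifelse(passengers$sex == "female", 0.7, 0.2)
passengers$survived <- rbinom(n, 1, p)

# Handle missing data: fill missing ages with the median age.
passengers$age[is.na(passengers$age)] <- median(passengers$age, na.rm = TRUE)

# Encode categorical data: convert the text column to a factor.
passengers$sex <- factor(passengers$sex)

# Train a binary classifier (logistic regression).
model <- glm(survived ~ sex + age, data = passengers, family = binomial)

# Turn predicted probabilities into 0/1 labels and measure training accuracy.
predicted <- as.integer(predict(model, type = "response") > 0.5)
accuracy <- mean(predicted == passengers$survived)
accuracy
```

With real data you would swap the synthetic data frame for one loaded with <code>read.csv()</code> and evaluate on held-out rows rather than on the training set.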
<img class="size-full wp-image-12481" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2ab8fa9eddab8e508b4ef_1-2.webp" alt="Image 1 - Titanic challenge on Kaggle" width="3148" height="2314" /> Image 1 - Titanic challenge on Kaggle The goal is to determine who had the highest chance of surviving. Was it women and children? Does socio-economic class play a role? Or was it just an element of luck? It's up to you to find out. It really is a basic data science take-home challenge. You'll need to know how to read comma-separated data (CSV), handle missing data, encode categorical data, visualize data, and train machine learning models (binary classification). The best part is that you can do everything in the R programming language. As of April 2022, 14.5k competitors have submitted more than 54k entries to the challenge on Kaggle. There's no doubt the competition is fierce, so don't expect to be the first on the leaderboard if you're just starting out. Focus on learning on your own and from other submissions - that's the only thing that matters in the long run. We recommend the following resources to get your feet wet: <ul><li><a href="https://www.kaggle.com/competitions/titanic/data" target="_blank" rel="noopener">Official Kaggle challenge overview and data</a></li><li><a href="https://www.kaggle.com/code/startupsci/titanic-data-science-solutions/notebook" target="_blank" rel="noopener">Sample solution overview (86.7% accuracy)</a></li></ul> <h2 id="challenge-2">Store Sales - Time Series Forecasting</h2> Let's face the facts - all companies collect data through time. For you, as a data scientist, that means a lot of work involving time series analysis and forecasting, so it's best to get your hands dirty early on. 
<img class="size-full wp-image-12483" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abba5fd20a97337601ae_2-2.webp" alt="Image 2 - Store Sales time series forecasting challenge on Kaggle" width="3148" height="2314" /> Image 2 - Store Sales time series forecasting challenge on Kaggle The goal of this data science take-home challenge is to create a model that can best estimate the unit sales for thousands of items sold at different stores of Corporación Favorita - a large grocery retailer in Ecuador. The dataset(s) contain a lot of supplementary data, such as holiday information, store details, promotions, and so on, so you won't run out of things to tweak any time soon. As you get familiar with time series analysis and forecasting, you'll realize one thing - time series data is tricky to work with. You can use any algorithm from simple moving averages to state-of-the-art deep learning LSTM variations. In addition, any time series problem can be framed as a supervised problem, so any regression algorithm will work as well. The thing is - there are dozens of algorithms you can test and tweak, which will take a lot of your time. <blockquote>Prepping for SQL interview questions? Practice these <a href="https://appsilon.com/data-science-sql-interview-questions/" target="_blank" rel="noopener">top 10 Data Science SQL questions (with answers)</a>.</blockquote> As of now, 1.36k competitors have submitted 3.4k entries, so there's a decent amount of competition. As with the first take-home challenge, aim to learn - not to be the first on the leaderboard. Ready to get started? 
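Before diving into the resources, here is a minimal sketch of the supervised framing mentioned above, using base R only. The <code>sales</code> values are made up, the choice of three lags is arbitrary, and <code>lm()</code> stands in for whichever regression algorithm you prefer.

```r
# Hypothetical daily unit sales for a single item (made-up numbers).
sales <- c(12, 15, 14, 18, 21, 19, 23, 25, 24, 28)

# embed() places each value next to its predecessors:
# column 1 is the current value, the remaining columns are lag 1, lag 2, ...
n_lags <- 3
lagged <- as.data.frame(embed(sales, n_lags + 1))
names(lagged) <- c("y", paste0("lag", 1:n_lags))

# The series is now an ordinary regression table, so any regressor applies.
fit <- lm(y ~ lag1 + lag2 + lag3, data = lagged)

# Forecast the next value from the three most recent observations.
latest <- data.frame(lag1 = sales[10], lag2 = sales[9], lag3 = sales[8])
forecast <- predict(fit, newdata = latest)
forecast
```

The same lagged table could also carry extra columns, such as holiday or promotion flags, which is where the supplementary data becomes useful.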
Here are some resources you could find useful: <ul><li><a href="https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview" target="_blank" rel="noopener">Challenge overview and datasets</a></li><li><a href="https://medium.com/@sebastianmo/store-sales-time-series-forecasting-faa6612cc8f1" target="_blank" rel="noopener">A Medium article to get you started</a></li></ul> <h2 id="challenge-3">Digit Recognizer</h2> Now, this wouldn't be a complete beginner-friendly data science take-home challenge list without the MNIST dataset. It's the de facto "Hello World" dataset of computer vision. The version we'll show you comes with a twist. <img class="size-full wp-image-12485" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abbbe8ea1473e19b7f72_3-2.webp" alt="Image 3 - MNIST image classification challenge" width="3148" height="2314" /> Image 3 - MNIST image classification challenge Classifying images can feel daunting to beginners. What is an image anyway? It's just a collection of pixels spread over 1 or 3 color channels (grayscale or colored image). But what is a pixel? It's a number ranging from 0 to 255. The higher the value is, the more of that color is present in the pixel. The MNIST dataset contains tens of thousands of 28x28 pixel images in a single color channel. That means there are 784 pixels per image. These 784 features determine what makes a number 5 a number 5, and not a number 7. But, <b>are all pixels useful?</b> Not likely, as most of them will be unnecessary padding around the digit. The possibilities here are endless. You could treat these 784 pixels as distinct features for a classification model and solve the task with logistic regression. You could reduce the dimensionality beforehand. Or, you could turn a 1-dimensional array of 784 pixels into a 2-dimensional 28x28 matrix and apply a convolutional model to it. 
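To make the last of these options concrete, here is a minimal base R sketch of the reshaping step. The pixel values are randomly generated rather than read from the dataset, and the row-by-row pixel order (<code>byrow = TRUE</code>) is an assumption you should check against the data description.

```r
# Synthetic stand-in for one image: 784 pixel values between 0 and 255.
set.seed(42)
pixels <- sample(0:255, size = 28 * 28, replace = TRUE)

# Reshape the flat vector into a 28x28 matrix.
# byrow = TRUE assumes the pixels are stored row by row (an assumption here).
img <- matrix(pixels, nrow = 28, ncol = 28, byrow = TRUE)

# Scale values to the [0, 1] range, a common step before training a network.
img_scaled <- img / 255

# Convolutional models usually expect an explicit channel dimension as well.
img_tensor <- array(img_scaled, dim = c(28, 28, 1))

dim(img)         # 28 28
dim(img_tensor)  # 28 28 1
```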
The last option will likely yield the best results, but we'll leave the experimentation up to you. MNIST is a well-known dataset, so 2.2k competitors have submitted 8.4k entries on Kaggle. There's a lot of competition, but it's easy to approach 100% accuracy once you get the gist of it (hint: <a href="https://appsilon.com/transfer-learning-introduction/" target="_blank" rel="noopener">transfer learning</a>). We recommend the following resources to get you started: <ul><li><a href="https://www.kaggle.com/competitions/digit-recognizer/overview" target="_blank" rel="noopener">Kaggle challenge overview and dataset</a></li><li><a href="https://appsilon.com/r-keras-mnist/" target="_blank" rel="noopener">Appsilon's guide to training a digit classification model in R and TensorFlow</a></li></ul> <h2 id="challenge-4">TensorFlow - Help Protect the Great Barrier Reef</h2> The Great Barrier Reef in Australia is the largest coral reef system in the world. It has been under threat recently, in part because of the overpopulation of COTS - the coral-eating crown-of-thorns starfish. That's where you come in. <img class="size-full wp-image-12487" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2ac018644ca583e461068_4-2.webp" alt="Image 4 - Great Barrier Reef object detection challenge" width="3148" height="2314" /> Image 4 - Great Barrier Reef object detection challenge The goal of this data science take-home challenge is to identify starfish in real-time by building an object detection model trained on underwater videos of coral reefs. It's not an easy task, as it assumes you're comfortable with object detection, which is an advanced computer vision technique. The way you approach the challenge is ultimately up to you. There is a total of 23.5k training images extracted from three videos, and it's your job to come up with a model that can accurately detect objects of interest. 
There's a separate <code>train.csv</code> file that contains annotations - bounding box coordinates around the object(s) of interest for every image. Before you embark on this challenge, make sure you have a decent hardware configuration. Training object detection models is far faster on a GPU. Anything recent from NVIDIA will do (RTX or better). If you don't have that configuration at your disposal, you could try training the model on Google Colab. As of now, 2.6k competitors have submitted almost 61k entries, so there's definitely a lot of interest in this challenge. It closed 2 months ago, but you can still work on it for fun and experience. Ready to get started? Here are a couple of resources you might find useful: <ul><li><a href="https://www.kaggle.com/competitions/tensorflow-great-barrier-reef/overview" target="_blank" rel="noopener">Challenge description and dataset</a></li><li><a href="https://appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener">Appsilon's introduction to the YOLO object detection algorithm</a></li><li><a href="https://ghost.amsterdamintelligence.com/how-we-ranked-in-the-top-5-of-tensorflows-kaggle-competition/" target="_blank" rel="noopener">How this team ranked in the top 5% of the challenge</a></li></ul> <h2 id="challenge-5">Bag of Words Meets Bags of Popcorn</h2> It wouldn't be a complete list of data science take-home challenges without an NLP-based task. That's where Bags of Popcorn comes into play. The goal is to get started with natural language processing, using both basic NLP techniques and a deep-learning-based approach (Word2Vec). 
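As a rough illustration of the bag-of-words idea from the challenge title, here is a minimal base R sketch. The two reviews are made up, the preprocessing is deliberately crude, and the helper name <code>count_words</code> is hypothetical. Dedicated text-mining packages would handle real data far better.

```r
# Two made-up reviews (illustrative only, not taken from the dataset).
reviews <- c("A great movie, great cast!", "Boring plot and a boring cast.")

# Basic preprocessing: lowercase, drop everything except letters and spaces,
# then split each review into words.
tokens <- strsplit(gsub("[^a-z ]", "", tolower(reviews)), " +")

# Vocabulary: the sorted unique words across all reviews.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in one review.
count_words <- function(words) {
  as.integer(table(factor(words, levels = vocab)))
}

# Bag-of-words matrix: one row per review, one column per vocabulary word.
bow <- t(vapply(tokens, count_words, integer(length(vocab))))
colnames(bow) <- vocab
bow
```

Each row of <code>bow</code> is a numeric feature vector, so it can be fed to any classifier to predict the sentiment label.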
<img class="size-full wp-image-12489" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2abb4df603677fbfcdc4b_5-1.webp" alt="Image 5 - NLP movie review challenge on Kaggle" width="3148" height="2314" /> Image 5 - NLP movie review challenge on Kaggle The dataset consists of 50,000 IMDB movie reviews, specially curated for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating below 5 gets a sentiment score of 0, and a rating of 7 or above gets a sentiment score of 1. Keep in mind that you'll be working with raw text, so text preprocessing and transformation skills are mandatory. The Kaggle page for this challenge comes with four tutorials on working with bag-of-words, word vectors, and comparing deep learning and non-deep learning methods for working with text data. It's an excellent place to start if you have limited prior experience. The examples are in Python, but you can easily translate them into R. As of April 2022, 659 competitors have submitted 4.3k entries. There's not as much competition as with the other take-home challenges, even though the original posting is 7 years old. Ready to tackle your first NLP challenge? Here are some resources to get you started: <ul><li><a href="https://www.kaggle.com/competitions/word2vec-nlp-tutorial/overview" target="_blank" rel="noopener">Official challenge description and data</a></li><li><a href="https://github.com/akash1309/Bag-Of-Words-Meets-Bags-of-Popcorn" target="_blank" rel="noopener">A sample solution on GitHub</a></li></ul> <hr /> <h2>Summary of Data Science Take-Home Challenges in R</h2> And there you have it - five data science take-home challenges that you can work on right now. Sure, most of the code examples are written in Python, but most of them can be translated to R without too much trouble. <blockquote>Can't find a matching R and Python library? 
<a href="https://appsilon.com/use-r-and-python-together/" target="_blank" rel="noopener">Avoid this problem by using R and Python together</a>.</blockquote> Working on any of these challenges is a time-consuming process. Don't expect to come up with a fully working and highly accurate solution in a day. Entire teams spend weeks and even months before they have something presentable. Cut yourself some slack and remember to have fun in the process. If you decide to give these take-home challenges a go, make sure to let us know. Share your results with us on Twitter - <a href="https://twitter.com/appsilon" target="_blank" rel="noopener">@appsilon</a>. We'd love to see what you come up with.