ML Data Versioning with DVC: How to manage machine learning data

Machine learning projects are a beautiful medley of the code used to build models and the data used to train them. ML models are complex beasts that change in myriad ways: the datasets used, the way they're transformed, and the code itself. Keeping your head on straight amid the huge number of models you've created, or could have created, with each version is no simple task. How do you track what you've done? Share it? Reproduce it?!? What you're searching for is data versioning, and specifically ML data versioning with <a href="https://dvc.org/" target="_blank" rel="noopener noreferrer">DVC</a>.

Data versioning with DVC isn't the light at the end of the tunnel, it's the headlight beams keeping you on track. At Appsilon we work with a lot of machine learning data, models, and parameters. We know the importance of keeping everything well organized. By managing data and version control we're able to <a href="https://appsilon.com/computer-vision/" target="_blank" rel="noopener noreferrer">deliver high-quality ML solutions</a> for complex, fast-paced projects.

Continue reading as we share how to start organizing your data in a project using DVC.
<ul><li><a href="#anchor-1" rel="noopener noreferrer">Common problems with ML data versioning</a></li><li><a href="#anchor-2" rel="noopener noreferrer">What is DVC?</a></li><li><a href="#anchor-3" rel="noopener noreferrer">How does ML data versioning with DVC work?</a></li><li><a href="#anchor-4" rel="noopener noreferrer">Use case example</a></li><li><a href="#anchor-5" rel="noopener noreferrer">The solution</a></li></ul>
<h2 id="anchor-1">Common problems with ML data versioning</h2>
There's a common trap I often see when a project starts as a small PoC. The process usually begins with gathering some data and creating a model in a Jupyter notebook. Some resampling here and a dash of data preprocessing there, and sure enough you've made some decent models. As time progresses, this PoC turns out to be a stepping stone into a project worth pursuing. And so we put in more time and more effort. And we create more models, better models. But soon enough this so-far-so-smooth process hits a snag.

Stop me if you've heard these before:

"Was it in <code>model_3final.pth</code> or <code>model_3last.pth</code> that I used a bigger learning rate?"

"When did I start using data preprocessing, during <code>model_2a.pth</code> or <code>model_2aa.pth</code>?"

"Is <code>model_7.pth</code> trained on the new dataset or on the old one?"

"Oh, gosh, which set of parameters and data have I used to train <code>model_2.pth</code>? It was pretty good in the end..."

If you feel attacked, just know you're not alone. We all have these problems. And as you begin incorporating others into your team they will only snowball.
<blockquote><strong>Ensure clean, well-formatted data for your ML model by using <a href="https://appsilon.com/data-validation-with-data-validator-an-open-source-package-from-appsilon/" target="_blank" rel="noopener noreferrer">data.validator for your data validation</a>.</strong></blockquote>
But fear not. There are solutions to these problems. In some cases, a few of them can be solved using <code>Git</code> alone. However, pushing a 2 GB+ dataset into your Git repository has its hang-ups. You need a tool that adapts Git from a traditional software development manager into one that can handle ML projects. And that's exactly what DVC does.
<h2 id="anchor-2">What is DVC?</h2>
DVC (Data Version Control) is an open-source tool that makes data science and machine learning projects easy to reproduce and share. It handles large datasets and ML models, and lets ML engineers incorporate best practices into their workflows. You can use it alongside Git to track data, parameters, and other aspects of your ML project.

It's important to recognize that with DVC you can easily store code in a Git repository and data/models in AWS, GCP, Azure, or other external storage. Git is fairly flexible, but it doesn't handle large ML datasets very well. DVC, however, creates small metafiles that are committed to Git while supporting external, remote caches. You can push your data to external storage and switch between dataset versions. Basically, DVC stores copies of the data in the backend, while Git tracks the changes to that data. DVC is not a replacement for Git, but rather a way to enable smoother cooperation between the two tools.
<h3 id="anchor-3">How does ML data versioning work with DVC?</h3>
Under the hood, DVC hashes every file in the <code>data</code> directory, adds the directory to <code>.gitignore</code>, and creates a small file, <code>data.dvc</code>, that is committed to Git. By comparing hashes, DVC knows when files change and which version to restore.
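As an illustration, the <code>data.dvc</code> metafile for a tracked data directory is just a small YAML snippet along these lines (the hash, size, and file count here are purely illustrative, and the exact fields depend on your DVC version):
<pre class="language-r"><code class="language-r">outs:
- md5: 3f2b1c0d9e8a7b6c5d4e3f2a1b0c9d8e.dir
  size: 104857600
  nfiles: 100
  path: data</code></pre>
&nbsp;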
<blockquote><strong>Interested in image classification? Create your first <a href="https://appsilon.com/image-classification-tutorial/" target="_blank" rel="noopener noreferrer">image classification project with Appsilon's image classification tutorial. </a></strong></blockquote>
<h3 id="anchor-4">Use case example</h3>
Let's take a look at an example. Suppose you have the following files in your directory:
<pre class="language-r"><code class="language-r">├── README.md
├── data
│   ├── secret_appsilon_data_p001.csv
│   ├── secret_appsilon_data_p002.csv
│   └── secret_appsilon_data_p100.csv
├── model.py
├── params.yaml
├── test.py
├── train.py
└── utils.py</code></pre>
&nbsp;
Everything seems to be working fine. But after a few iterations your directory might look more like this:
<pre class="language-r"><code class="language-r">├── README.md
├── data
│   ├── secret_appsilon_data_p001.csv
│   ├── secret_appsilon_data_p002.csv
│   ├── secret_appsilon_data_p100.csv
│   ├── secret_appsilon_data_p001_2.csv
│   ├── secret_appsilon_data_p002_2.csv
│   └── secret_appsilon_data_p100_2.csv
├── model.py
├── models
│   ├── good_params_model3.yaml
│   ├── model.pt
│   ├── model2.pt
│   ├── model3.pt
│   ├── model3a.pt
│   └── model5.pt
├── params.yaml
├── test.py
├── train.py
└── utils.py</code></pre>
&nbsp;
Even if you prefer to work with organized chaos, any additional team members will likely have no clue which model was trained when, on which set of parameters, on which version of the data, and so on. Frankly, it's a mess and we could use some help. Although DVC has a few features dedicated to experiments, let's start with the easiest way to use it.
<blockquote><strong>What is the YOLO algorithm? <a href="https://appsilon.com/object-detection-yolo-algorithm/" target="_blank" rel="noopener noreferrer">YOLO object detection algorithm made simple</a>.</strong></blockquote>
<h3 id="anchor-5">A solution to ML data versioning with DVC</h3>
Going back to the start, we initialize the repository:
<pre class="language-r"><code class="language-r">
git init
dvc init</code></pre>
&nbsp;
Let's begin by adding the data directory.
<pre class="language-r"><code class="language-r">dvc add data</code></pre>
&nbsp;
The output:
<pre class="language-r"><code class="language-r">100% Adding...|████████████████████████████████████████|1/1 [00:00, 2.09file/s]</code></pre>
&nbsp;
To track the changes with Git, run:
<pre class="language-r"><code class="language-r">git add data.dvc .gitignore</code></pre>
&nbsp;
DVC itself prints the commands needed to start tracking the changes in Git. Instead of adding the whole <code>data</code> directory to the repository, you only add the small <code>data.dvc</code> file. You can also check that the <code>data</code> directory has been added to <code>.gitignore</code>.
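For instance, a quick check could look like this (the <code>/data</code> entry is what DVC typically writes; the exact content may differ in your setup):
<pre class="language-r"><code class="language-r">cat .gitignore
# /data</code></pre>
&nbsp;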

Next, we add the rest of the files to Git.
<pre class="language-r"><code class="language-r">
git add README.md model.py params.yaml test.py train.py utils.py
git commit -m "Init repo"
git push</code></pre>
&nbsp;
The code has now been pushed to the repository, but the data is still only on your machine. As DVC remote storage you can use either a dedicated directory on a disk or a cloud service such as S3, GCS, or Google Drive. To configure an S3 bucket as the remote, run:
<pre class="language-r"><code class="language-r">
dvc remote add -d myremote s3://mybucket/path
git add .dvc/config
git commit -m "Configure remote storage"</code></pre>
&nbsp;
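If you'd rather not use a cloud bucket, a directory on a local or mounted disk also works as a DVC remote. A minimal sketch, with an illustrative path:
<pre class="language-r"><code class="language-r">
dvc remote add -d localremote /mnt/shared/dvc-storage
git add .dvc/config
git commit -m "Configure local remote storage"</code></pre>
&nbsp;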
Now you can run:
<pre class="language-r"><code class="language-r">
git push
dvc push</code></pre>
&nbsp;
Congratulations! Your code and data are now versioned in the repository.
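From here, a teammate (or you, on another machine) can recreate the exact same workspace. A minimal sketch, with an illustrative repository URL:
<pre class="language-r"><code class="language-r">
git clone git@github.com:your-org/your-ml-project.git
cd your-ml-project
dvc pull</code></pre>
&nbsp;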

If you change data, just run:
<pre class="language-r"><code class="language-r">
dvc add data
git add data.dvc
git commit -m "New data added"
git push
dvc push</code></pre>
&nbsp;
You can treat models the same way you treat data. It's preferable to keep a single file, <code>model.pt</code>, that changes along with new data and alternative parameters. If you need to go back in time and check how the data looked in a previous version, use <code>git checkout XXX</code> followed by <code>dvc checkout</code>.
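For example, to inspect the data as it was one commit ago, you could run something like this (the revision is illustrative; any commit hash, tag, or branch works):
<pre class="language-r"><code class="language-r">
git checkout HEAD~1   # or a specific commit hash / tag
dvc checkout          # restore the data files recorded in that data.dvc</code></pre>
&nbsp;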

However, this is not an ideal way to track your experiments. For tracking experiments specifically, DVC provides a dedicated interface. You can learn more in the <a href="https://dvc.org/doc/command-reference/exp" target="_blank" rel="noopener noreferrer">DVC documentation for experiments</a>.
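As a small taste of that interface, and assuming you have a training stage defined in <code>dvc.yaml</code>, an experiment run might look roughly like this (the parameter name is illustrative):
<pre class="language-r"><code class="language-r">
dvc exp run --set-param train.lr=0.001
dvc exp show</code></pre>
&nbsp;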
<blockquote><strong>Automate deployment on RStudio Connect by building a <a href="https://appsilon.com/build-a-ci-cd-pipeline-for-shiny-apps/" target="_blank" rel="noopener noreferrer">CI/CD pipeline for Shiny apps using Gitlab-CI</a>.</strong></blockquote>
<h2>Integrating ML data versioning with DVC in your workflow</h2>
Working with large datasets can be challenging. Whether you're moving them, cleaning them, or tracking them, they're difficult to keep on top of, especially in the context of an ML project. But by using DVC to version your data alongside Git, together with other DataOps best practices, you don't have to risk losing productivity or quality.

Keep your team on track and your deadlines on target by establishing workflows, version-controlling models, and creating reproducible experiments. Make life simpler for both you and your team by incorporating DVC into your ML projects. If you need help, consider reaching out to Appsilon. Our <a href="https://appsilon.com/computer-vision/" target="_blank" rel="noopener noreferrer">ML and Computer Vision team</a> has experience building custom ML solutions and can adapt existing solutions to fit your needs. Explore some of our <a href="https://appsilon.com/computer-vision/#use-cases" target="_blank" rel="noopener noreferrer">case studies</a> and discover the possibilities of ML and computer vision.
