Data Validation With data.validator: An Open-Source Package from Appsilon


<h2 id="anchor-1">Why Data Validation</h2> Data validation is a crucial step in any data science project. It ensures that data is clean, well formatted, and ready to feed the pipelines behind ML models and dashboards. Clean data also minimizes errors further down the line: functions and model training pipelines often throw errors when presented with missing values, incorrect data types, or out-of-range data. Validation checks that must pass before the data reaches the program help avoid the resulting waste of time and money. <blockquote><a href="https://appsilon.com/data-quality/" target="_blank" rel="noopener noreferrer">Data Quality Case Studies: How We Saved Clients Real Money Thanks to Data Validation</a></blockquote> Data validation is also not a one-off task. Updating an ML model requires new data, and the volume of input will likely change, so scalable, automated validation needs to run in the workflow with every update. The question now becomes: how do you achieve all of this? 
Enter: data.validator <ol><li><a href="#anchor-1" target="_blank" rel="noopener noreferrer">Why Data Validation</a></li><li><a href="#anchor-2" target="_blank" rel="noopener noreferrer">Data Validation With data.validator</a></li><li><a href="#anchor-3" target="_blank" rel="noopener noreferrer">Getting Started</a></li><li><a href="#anchor-4" target="_blank" rel="noopener noreferrer">Pipeline</a></li><li><a href="#anchor-5" target="_blank" rel="noopener noreferrer">Custom HTML Reporting</a></li><li><a href="#anchor-6" target="_blank" rel="noopener noreferrer">Example of data.validator in Production</a></li><li><a href="#anchor-7" target="_blank" rel="noopener noreferrer">Conclusion</a></li></ol> <h2 id="anchor-2">Data Validation With data.validator</h2> Today we will look at <a href="https://github.com/Appsilon/data.validator" target="_blank" rel="noopener noreferrer">data.validator</a>, an R package that offers scalable and reproducible data validation in a user-friendly way. It goes beyond simple checks of structure and format: its reporting tools support preventative maintenance and make it easier to identify and track the story behind the data. 
Some features of data.validator include: <ul><li style="font-weight: 400;" aria-level="1">Validation in %>% pipelines with the functions validate_if(), validate_cols(), and validate_rows()</li><li style="font-weight: 400;" aria-level="1">Support for predicate functions from the assertr package, such as in_set() and within_bounds()</li><li style="font-weight: 400;" aria-level="1">Functions for creating user-friendly reports that can be sent by email, stored in a logs folder, or generated automatically with RStudio Connect</li><li aria-level="1">Customizable HTML reports</li></ul> <img class="aligncenter size-full wp-image-7350" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0205372a288e80b4e6881_semantic_report_example.gif" alt="data validator" width="1029" height="626" /> <h3 id="anchor-3">Getting Started</h3> There are two ways to install the package. The release version from <a href="https://cran.r-project.org/web/packages/data.validator/index.html" target="_blank" rel="noopener noreferrer">CRAN</a>: <script src="https://gist.github.com/MicahAppsilon/f08ee274b7cf15ef45fb39dbca0ca4fa.js"></script> Or the latest development version: <script src="https://gist.github.com/MicahAppsilon/14d137ca6d80e140b16bc583175d6a66.js"></script> <h3 id="anchor-4">Pipeline</h3> Step 1. First, create a blank report object: <script src="https://gist.github.com/MicahAppsilon/09e0171fd9c7e5f8617245af394d4d40.js"></script> Step 2. Next, load your dataset and prepare it for validation. We will use the standard mtcars dataset for this demonstration. With the empty report object in place, we start the validation by calling the validate() function with the dataset and a name as arguments. <script src="https://gist.github.com/MicahAppsilon/d9a1c499dc7b9b49810d5f09a23941e2.js"></script> Step 3. 
After the validate() call, we can chain the validate_*() functions and predicates with the %>% operator to validate the data. <script src="https://gist.github.com/MicahAppsilon/c69d27f1e2181a527576505ad7665947.js"></script> Step 4. We can also add custom predicates by first defining a function and then using it inside the validate_*() functions. <script src="https://gist.github.com/MicahAppsilon/0f4ca68d93179600178bc661a60a3780.js"></script> Step 5. Once all the validations are done, we call add_results(report_name) to add the validation results to the report created in Step 1. <script src="https://gist.github.com/MicahAppsilon/74116087288c29b4f49848f7a55e41fe.js"></script> Step 6. Finally, we print the report or generate an HTML document. <script src="https://gist.github.com/MicahAppsilon/7dc7936951111b16132567e2cb129090.js"></script> <script src="https://gist.github.com/MicahAppsilon/a9d7bd0a472ae40ae4ed75ffdb65624e.js"></script> We can turn off certain parts of the report like this: <script src="https://gist.github.com/MicahAppsilon/6cba6ea3dac50d64df90230d30e9f094.js"></script> We can also view the raw report like this: <script src="https://gist.github.com/MicahAppsilon/7ae0ab9de8cc07bf34eff3cad8993948.js"></script> data.validator provides other ways of saving the report: <script src="https://gist.github.com/MicahAppsilon/413ef1c4d5832791a0a8ffc61a5b8295.js"></script> <h3 id="anchor-5">Custom HTML Reporting</h3> data.validator also supports custom report templates. Results can be shown with various interactive elements (e.g., a leaflet map). The example below shows the validation results for a predicate function that checks whether Polish district populations fall within 3 standard deviations of the mean: <code>assertr::within_n_sds(3)</code>. 
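The pipeline steps and the within_n_sds(3) check described above can be sketched in a few lines of R. This is a minimal illustration, not the code from the embedded gists: the validation name, the descriptions, and the between() helper are made up for the example, and mtcars stands in for the district population data.

```r
library(assertr)         # predicates such as in_set() and within_n_sds()
library(magrittr)        # provides the %>% pipe
library(data.validator)

# Step 1: create an empty report object.
report <- data_validation_report()

# Step 4 (defined up front): a hypothetical custom predicate generator.
# It returns a function that checks whether a value lies in [a, b].
between <- function(a, b) {
  function(x) a <= x & x <= b
}

# Steps 2-5: start the chain with validate(), add checks, store the results.
validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "drat is positive") %>%
  validate_cols(in_set(c(0, 1)), vs, am,
                description = "vs and am are 0 or 1") %>%
  validate_cols(within_n_sds(3), mpg,
                description = "mpg within 3 standard deviations") %>%
  validate_cols(between(3, 8), cyl,
                description = "cyl between 3 and 8") %>%
  add_results(report)

# Step 6: print a summary and pull the raw results as a data frame.
print(report)
results <- get_results(report)

# save_report(report) would write an HTML report to disk instead;
# it is commented out here because it needs rmarkdown and pandoc.
```

Because the report object is passed to add_results() at the end of each chain, several datasets can be validated separately and collected into the same report.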
<img class="aligncenter size-full wp-image-7356" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b0205331cb3f3921071ca6_custom_report_example_leaflet.gif" alt="leaflet data validator" width="1029" height="626" /> <script src="https://gist.github.com/MicahAppsilon/6b67402b35dbf13ea40e8786b670efdd.js"></script> You can find a predefined report template <a href="https://github.com/Appsilon/data.validator/blob/master/inst/rmarkdown/templates/standard/skeleton/skeleton.Rmd" target="_blank" rel="noopener noreferrer">here</a>. To use the template as a base, load the package in RStudio and go to File &gt; New File &gt; R Markdown &gt; From template &gt; Simple structure for HTML report summary. From there you can modify the template with custom titles and graphics. <h3 id="anchor-6">Example of data.validator in Production</h3> A production workflow with data.validator can be implemented as follows: <ol><li>The RStudio Connect scheduler runs daily.</li><li>The scheduler sources the data from a PostgreSQL table and validates it against predefined rules.</li><li>Based on the validation results, a new data.validator report is created.</li><li>Data response and action:<ul><li>If a violation occurs:<ul><li>the data provider and the person responsible for data quality receive a report via email</li><li>thanks to <code>assertr</code> functionality, the report is easy to understand for both technical and non-technical personnel</li><li>the data provider makes the required data fixes</li></ul></li><li>If the data passes inspection:<ul><li>a specific trigger is sent to reload the Shiny data</li></ul></li></ul></li></ol> <h2 id="anchor-7">Conclusion</h2> Whether your dataset was built internally or pulled from external sources, you need to check that it meets the expectations you have defined. Detecting incomplete, duplicate, corrupt, or irrelevant data can be a huge undertaking, but left unaddressed, such data can negatively impact your analysis. 
That's why Appsilon developed the data.validator package: to make it easy to compose and integrate validation rules, scale with fluctuating volumes of data, and deliver clear, customizable reports. If you need assistance with your project, consider reaching out to the <a href="https://appsilon.com/computer-vision/" target="_blank" rel="noopener noreferrer">Appsilon Data Science Machine Learning</a> team. Our data science professionals deliver modern ML and computer vision solutions for Fortune 500 companies. If you are a public sector institution, NGO, academic institution, or public benefit corporation working on ML projects to solve climate change and environmental degradation issues, please reach out to us through our <a href="https://appsilon.com/ai-for-good/" target="_blank" rel="noopener noreferrer">Data for Good</a> initiative. <blockquote>Appsilon is a <a href="https://www.rstudio.com/certified-partners/" target="_blank" rel="noopener noreferrer">Full Service Certified RStudio Partner</a>: discover how we can support you in your project <a href="https://appsilon.com/appsilon-data-science-is-now-an-rstudio-full-service-certified-partner/" target="_blank" rel="noopener noreferrer">here</a></blockquote> <h2>We Need Your Help!</h2> At Appsilon, our Tech Team Members regularly contribute to open source packages as part of our commitment to positively impacting the world through technology. If you find our packages useful, please consider dropping a star on your favorite <a href="https://shiny.tools/" target="_blank" rel="noopener noreferrer">shiny packages</a> on our <a href="https://github.com/Appsilon" target="_blank" rel="noopener noreferrer">GitHub</a>. It lets us know we’re on the right track. 
And if you have any comments or questions, swing by our feedback threads, like the ongoing <a href="https://github.com/Appsilon/shiny.fluent/discussions/24" target="_blank" rel="noopener noreferrer">discussion</a> at our new <a href="https://github.com/Appsilon/shiny.fluent" target="_blank" rel="noopener noreferrer">shiny.fluent package</a>. We love to hear from the community.
