R data.validator - How to Create Automated Data Quality Reports in R and Shiny
Every data science project needs a <strong>data validation</strong> step. It's a crucial part, especially when feeding data into machine learning models. You don't want errors or unexpected behaviors in a <strong>production environment</strong>. Data validation is a way to check the data before it touches the model and ensure it's not corrupted. And yes, you can <strong>automate</strong> <strong>data quality reports</strong>! Today you'll learn how to work with datasets in R in order to create automated R data quality reports. The best part - you'll build a UI around the data validation logic, so you can easily upload CSV files for examination and validation, and R Shiny will handle the rest. Let's dive straight in!
<blockquote>Looking to create reports with R Markdown? <a href="https://appsilon.com/r-markdown-tips/" target="_blank" rel="noopener">Read our top tips on code, images, comments, tables, and more</a>.</blockquote>
Table of contents:
<ul><li><a href="#how-to">How to Approach R Data Quality Checks for Reporting</a></li><li><a href="#data-validator">Introduction to the data.validator Package</a></li><li><a href="#r-shiny">Data Validation and Automated Reporting in R Shiny</a></li><li><a href="#summary">Summing up R Data Quality</a></li></ul>
<hr />
<h2 id="how-to">How to Approach R Data Quality Checks for Reporting</h2>
In machine learning, you typically train and optimize a model once and then deploy it. What happens from there is more or less the Wild West because you can't know how the user will use your model. For example, if you create a web interface around it and allow for data entry, you can expect some users to enter values that have nothing to do with the data your model was trained on. If a feature value typically ranges from 0 to 10, but the user enters 10000, things will go wrong. Some of these risks can be avoided with basic form validation, but sooner or later an unexpected request will go through. That's where data quality checks play a huge role.
<b>So, what are your options?</b> Several R data quality packages are available, but today we'll focus on Appsilon's <a href="https://github.com/Appsilon/data.validator" target="_blank" rel="nofollow noopener">data.validator</a>. It's a package we developed for scalable and reproducible data validations, and it includes a ton of functions for adding checks to the data. Other options for automated data quality reports with R exist, such as <a href="https://github.com/rstudio/pagedown" target="_blank" rel="nofollow noopener">pagedown</a> and <a href="https://cran.r-project.org/web/packages/officer/index.html" target="_blank" rel="nofollow noopener">officer</a>, and you're free to use them. We found these alternatives capable of report automation, but nowhere near as interactive and scalable as <code>data.validator</code>. Let's dive into the examples next.
<h2 id="data-validator">Introduction to the data.validator Package</h2> The package is available on <a href="https://cran.rstudio.com/web/packages/data.validator/index.html" target="_blank" rel="nofollow noopener">CRAN</a>, which means you can install it easily through the R console: <pre><code class="language-bash">install.packages("data.validator")</code></pre> Alternatively, you can install the latest development version: <pre><code class="language-bash">remotes::install_github("Appsilon/data.validator")</code></pre> We'll work with the <a href="https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv" target="_blank" rel="nofollow noopener">Iris dataset</a> for R data quality reports, and we recommend you download the CSV version instead of using the one built into R. You'll see reasons why in the following section. For now, just load these couple of libraries and read the dataset: <pre><code class="language-r">library(assertr) library(dplyr) library(data.validator) <br>df_iris <- read.csv("/path/to/iris.csv") head(df_iris)</code></pre> Here's what it looks like: <img class="size-full wp-image-16982" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f7f0518ba4862326362_1-2.webp" alt="Image 1 - Head of the Iris dataset" width="758" height="238" /> Image 1 - Head of the Iris dataset Now, how can you implement data quality checks on this dataset? Just imagine you were to build a machine learning model on it, and you want to somehow regulate the constraints of user-entered data going into the forecasting. First, you need to think of the conditions that have to be satisfied. Let's mention a few: <ul><li>Values can't be missing (NA)</li><li>Column value has to be in a given range, let's say between 0 and 10 for <code>sepal.width</code> and <code>sepal.length</code></li></ul> You can add these and many other conditions to a data validation report. Here's how: <pre><code class="language-r"># A custom function that returns a predicate between <- function(a, b) { function(x) { a <= x & x <= b } } <br># Initialize the report report <- data_validation_report() # Add validation to the report validate(data = df_iris, description = "Iris Dataset Validation Test") %>% validate_cols(predicate = in_set(c("Setosa", "Virginica", "Versicolor")), variety, description = "Correct species category") %>% validate_cols(predicate = not_na, sepal.length:variety, description = "No missing values") %>% validate_cols(predicate = between(0, 10), sepal.length, description = "Column sepal.length between 0 and 10") %>% validate_cols(predicate = between(0, 10), sepal.width, description = "Column sepal.width between 0 and 10") %>% add_results(report = report) <br># Print the report print(report)</code></pre> The <code>between()</code> function is a user-defined one, allowing you to check if a value is within range. You can define your custom functions in a similar manner, or use the ones <a href="https://github.com/Appsilon/data.validator" target="_blank" rel="nofollow noopener">built into data.validator</a>. 
Here's what the printed report looks like:
<img class="size-full wp-image-16984" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f6d629a6d2121388e78_2-1.webp" alt="Image 2 - Printed data.validator report" width="996" height="448" />
Image 2 - Printed data.validator report
<h3>Saving Reports Locally - HTML, CSV, and TXT</h3>
You can also save the report locally in HTML format and open it by running the following code:
<pre><code class="language-r">save_report(report)
browseURL("validation_report.html")</code></pre>
<img class="size-full wp-image-16986" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f6e7518486cd0c6c201_3-1.webp" alt="Image 3 - data.validator report in HTML format" width="1648" height="1076" />
Image 3 - data.validator report in HTML format
In case you prefer a simpler, flatter design without colors, change the value of the <code>ui_constructor</code> parameter:
<pre><code class="language-r">save_report(report, ui_constructor = render_raw_report_ui)
browseURL("validation_report.html")</code></pre>
<img class="size-full wp-image-16988" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f6ff69f7a8684510254_4-1.webp" alt="Image 4 - Raw data.validator report" width="1648" height="1076" />
Image 4 - Raw data.validator report
You're not limited to saving R data quality reports to HTML. After all, it's not the most straightforward file format to automatically parse and see if any validation checks failed. For that reason, we include an option for saving reports as CSV files:
<pre><code class="language-r">save_results(report, "results.csv")</code></pre>
<img class="size-full wp-image-16990" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f6f4bdfb4891c180aca_5-1.webp" alt="Image 5 - R data quality report in CSV format" width="1570" height="715" />
Image 5 - R data quality report in CSV format
It's not as visually attractive, sure, but you can easily write scripts that read these CSV files if you want to automate data quality checks - you'll find a short sketch of such a script just before the app's sample data below. And finally, there's an option to save the data as a plain text document:
<pre><code class="language-r">save_summary(report, "validation_log.txt")</code></pre>
<img class="size-full wp-image-16992" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f623230f605361b53ec_6-2.webp" alt="Image 6 - data.validator report in TXT format" width="888" height="555" />
Image 6 - data.validator report in TXT format
The report now looks like the one from the R console, which may be the format you prefer.
<blockquote>Want to dive deeper into <code>data.validator</code>? <a href="https://appsilon.com/data-validation-with-data-validator-an-open-source-package-from-appsilon/" target="_blank" rel="noopener">This article further explores validation functions</a>.</blockquote>
You now know the basic idea behind R data quality checks and reports, so next, we'll take this a step further by introducing R Shiny.
<h2 id="r-shiny">Data Validation and Automated Reporting in R Shiny</h2>
By now you've created a data validation report with <code>data.validator</code>, so why bring R Shiny into the mix? The answer is simple - it allows you to build an application around the data validation logic, which further simplifies data quality checks for non-technical users. The app you'll create in a minute lets the user upload a CSV file, for which a validation report is displayed.
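Before we build the app, here's the script sketch promised a moment ago - a minimal example of consuming the saved <code>results.csv</code> in an automated pipeline. The column names are an assumption on our part: inspect your own <code>results.csv</code> (Image 5) and adjust them. The snippet assumes a <code>type</code> column that reads "success" for passing checks and a <code>description</code> column matching the descriptions you defined:
<pre><code class="language-r"># Minimal sketch of an automated gate around the saved validation results.
# Assumption: results.csv contains "type" and "description" columns - check
# your own file (Image 5) and rename accordingly.
results <- read.csv("results.csv", stringsAsFactors = FALSE)

failed <- results[results$type != "success", ]

if (nrow(failed) > 0) {
  message("Failed validations: ", paste(failed$description, collapse = "; "))
  stop("Data quality checks failed - aborting the pipeline.")
}

message("All data quality checks passed.")</code></pre>
Now, on to the app.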
We recommend you save the following snippet in a separate CSV file. It contains a handful of Iris rows, most of which will fail the data validation test:
<pre><code class="language-csv">"sepal.length","sepal.width","petal.length","petal.width","variety"
5.1,3.5,1.4,.2,"Setosa"
100,3,1.4,.2,"Setosa"
4.7,3.2,47,.2,"Setosa"
4.6,3.1,1.5,.2,"Sertosa"
5,NA,1.4,.2,"Setosa"</code></pre>
As you can see, in each of the failing rows either the species is misspelled, a value is missing, or a value is out of range. Our Shiny app will have a sidebar that allows for CSV file upload and a main section that renders the head of the uploaded CSV file and its validation report. <b>Keep in mind:</b> R Shiny already has its own <code>validate()</code> function, so we have to explicitly write <code>data.validator::validate()</code> to avoid confusion and errors:
<pre><code class="language-r">library(shiny)
library(data.validator)
library(assertr)
library(dplyr)

# data.validator helper function
between <- function(a, b) {
  function(x) {
    ifelse(!is.na(x), a <= x & x <= b, FALSE)
  }
}

ui <- fluidPage(
  titlePanel("Appsilon's data.validator Shiny Example"),
  sidebarLayout(
    sidebarPanel(
      fileInput(inputId = "dataFile", label = "Choose CSV File", multiple = FALSE, accept = c(".csv")),
      checkboxInput(inputId = "header", label = "Data has a Header row", value = TRUE)
    ),
    mainPanel(
      tableOutput(outputId = "datasetHead"),
      uiOutput(outputId = "validation")
    )
  )
)

server <- function(input, output, session) {
  # Store the dataset as a reactive value
  data <- reactive({
    req(input$dataFile)

    tryCatch(
      {
        df <- read.csv(file = input$dataFile$datapath, header = input$header)
      },
      error = function(e) {
        stop(safeError(e))
      }
    )
  })

  # Render the table with the first 5 rows
  output$datasetHead <- renderTable({
    return(head(data(), 5))
  })

  # Render the data validation report
  output$validation <- renderUI({
    report <- data_validation_report()
    data.validator::validate(data(), description = "Iris Dataset Validation Test") %>%
      validate_cols(in_set(c("Setosa", "Virginica", "Versicolor")), variety, description = "Correct species category") %>%
      validate_cols(predicate = not_na, sepal.length:variety, description = "No missing values") %>%
      validate_cols(predicate = between(0, 10), sepal.length, description = "Column sepal.length between 0 and 10") %>%
      validate_cols(predicate = between(0, 10), sepal.width, description = "Column sepal.width between 0 and 10") %>%
      validate_cols(predicate = between(0, 10), petal.length, description = "Column petal.length between 0 and 10") %>%
      validate_cols(predicate = between(0, 10), petal.width, description = "Column petal.width between 0 and 10") %>%
      add_results(report)

    render_semantic_report_ui(get_results(report = report))
  })
}

shinyApp(ui = ui, server = server)</code></pre>
Here's what the app looks like:
<img class="size-full wp-image-16994" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29ff9300ce9856e5114e3_7.gif" alt="Image 7 - R Shiny app built around data.validator" width="1538" height="1164" />
Image 7 - R Shiny app built around data.validator
And that's how easy it is to build a UI around the data validation pipeline. You can (and should) add more checks, especially for custom datasets with many attributes. The overall procedure will be identical, only the validation part will get longer.
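One natural extension is letting users download the HTML report straight from the app. Here's a minimal sketch of how that could look. Two assumptions to flag: <code>build_report()</code> is a hypothetical helper that wraps the <code>validate_cols()</code> chain from the app above, and the <code>output_file</code>/<code>output_dir</code> arguments of <code>save_report()</code> should be confirmed against your installed version (see <code>?save_report</code>):
<pre><code class="language-r"># Sketch of a "Download report" extension for the app above
# (build_report() is a hypothetical wrapper around the validation chain)
build_report <- function(df) {
  report <- data_validation_report()
  data.validator::validate(df, description = "Iris Dataset Validation Test") %>%
    validate_cols(predicate = not_na, sepal.length:variety, description = "No missing values") %>%
    add_results(report)
  report
}

# Add to sidebarPanel() in the UI:
#   downloadButton(outputId = "downloadReport", label = "Download HTML report")

# Add inside server():
#   output$downloadReport <- downloadHandler(
#     filename = "validation_report.html",
#     content = function(file) {
#       # assumes save_report() accepts output_file and output_dir
#       save_report(build_report(data()), output_file = basename(file), output_dir = dirname(file))
#     }
#   )</code></pre>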
Let's make a short recap next.
<hr />
<h2 id="summary">Summing Up Automated R Data Quality Reporting</h2>
In data science, it's essential to stay on top of your data. You never know what the user may enter into a form or how the data may change over time, so that's where data validation and automated data quality checks come in handy. You should create constraints around the data going into a machine learning model if you want to guarantee reasonable predictions. Quality data in - quality prediction out. Appsilon's <code>data.validator</code> package simplifies data quality checks with a ton of built-in functions. You can also declare custom ones, so there's no hard limit on the checks you want to make. The package can also save data quality reports in HTML, CSV, and TXT formats, and is fully compatible with R Shiny. What more do you need?
<i>What are your thoughts on <code>data.validator</code> and automated data quality checks in general? Which tools/packages do you use daily?</i> Let us know in the comment section below, and don't hesitate to reach out on Twitter - <a href="http://twitter.com/appsilon">@appsilon</a>. We'd love to hear from you.
<blockquote>Does R Shiny seem more interesting by the day? <a href="https://appsilon.com/how-to-start-a-career-as-an-r-shiny-developer/" target="_blank" rel="noopener">Our detailed guide shows you how to make a career out of it</a>.</blockquote>