Data validation is a crucial aspect of developing robust and reliable R Shiny applications. It ensures that the data being processed and displayed in the app is accurate, consistent, and meets the required criteria.
Looking for an R package that offers scalable and reproducible data validation in a user-friendly way? Learn more about our open-source package, data.validator.
In this blog, we'll explore why data validation is essential in app development and how it can streamline the development process for teams.
👉 Why Data Validation Matters in R/Shiny Apps
1. Ensuring Data Integrity ✅
Data validation acts as a gatekeeper for your Shiny app, preventing erroneous or inconsistent data from compromising the application's functionality. By implementing validation checks, you can catch input errors early in the process, maintain data consistency throughout the app and prevent downstream issues caused by invalid data.
For further optimization tips, check out our guide to enhancing Shiny application performance.
2. Improving App Reliability 🤝
By validating data at various stages of your Shiny app, you can reduce the likelihood of crashes or unexpected behavior, ensure that calculations and visualizations are based on valid data and increase the overall reliability and trustworthiness of your application.
Explore more in our blog on validating R Shiny applications in pharma.
3. Accelerating Team Development with Data Validation 🚀
Implementing robust data validation practices can significantly speed up the process of application development, especially in a team setting. Here's how:
4. Faster Debugging and Testing 🧪
With proper data validation in place, identifying and fixing issues becomes much quicker. Error messages are more specific and actionable, the source of data-related problems is easier to locate and testing scenarios can be more focused and efficient.
5. Modular Development ⚙️
Data validation encourages a modular approach to app development, allowing team members to work on different components simultaneously:
- Input validation can be developed independently of data processing logic
- Different team members can focus on specific validation rules or data types
- Integration of various app components becomes smoother
For more on team development and Shiny, refer to our article on creating custom modules in Shiny for R.
👉 Introducing the pointblank package
The pointblank package is designed to streamline the process of data validation by providing a structured approach for defining and enforcing rules about the data. With pointblank, it's easy to methodically validate your data, whether it lives in data frames or in database tables. On top of the validation toolset, the package gives you the means to provide, and keep up to date, the information that defines your tables.
Its validation workflows offer six different approaches that can be used depending on your requirements. Here's a brief explanation of each workflow:
- VALID-I: Data Quality Reporting: This workflow uses an "agent-based" approach where an agent is created, validation functions are applied, and a detailed validation report is generated.
- VALID-II: Pipeline Data Validation: This workflow is designed for repeated data-quality checks in a data-transformation pipeline. It can warn the user of data integrity problems or stop the pipeline if issues are detected.
- VALID-III: Expectations in Unit Tests: This workflow lets you test data using a suite of expect_*() functions that are analogous to testthat's expectations but with simplified interfaces.
- VALID-IV: Data Tests for Conditionals: This workflow uses a suite of test_*() functions to evaluate data and produce logical output (TRUE/FALSE), which can be used to alter code paths based on the validation results (a short sketch of both workflows follows this list).
- VALID-V: Table Scan: This workflow is used to scan and describe a target table, providing information about its dimensions, column statistics, and missingness.
- VALID-VI: R Markdown Document Validation: This workflow is used to incorporate validation elements from the other workflows (such as pipeline validation or agent-based reporting) directly into an R Markdown document, providing a way to validate data and display the results in the rendered output.
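As a quick aside, the unit-test and conditional workflows often boil down to one-liners. Here is a minimal sketch, assuming the expect_*() and test_*() counterparts of the validation functions used elsewhere in this post (check the pointblank documentation for the full list available in your version):
# VALID-III: meant for testthat test files; raises an error if the expectation fails
expect_col_is_factor(datasets::iris, Species)
# VALID-IV: returns TRUE/FALSE, handy for branching logic in app code
if (test_col_vals_not_null(datasets::iris, Species)) {
  message("Species column is complete - safe to proceed")
}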
To keep things interesting and your mind fresh, let's focus only on the validation workflows that usually make people scratch their heads.
Example for VALID-I: Data Quality Reporting
library(pointblank)
# Define `action_levels`
al <- action_levels(warn_at = 0.2, stop_at = 0.8)
# Create a pointblank `agent` object, with the iris dataset as the
# target table. Use validation functions, then, `interrogate()`.
# The agent will then have the validation information.
agent <- datasets::iris |>
  create_agent(
    tbl_name = "Edgar Anderson's Iris Data",
    label = "Example data validation",
    actions = al
  )
interrogation_result <- agent |>
  col_vals_between(Sepal.Length, 4, 5) |>
  col_is_integer(Petal.Length) |>
  col_is_factor(Species) |>
  interrogate()
print(interrogation_result)
⁉️ So what just happened? 🤔
First, let’s understand the action_levels() function.
While working with data, it's often useful to establish acceptable failure thresholds for your data validation steps. This allows you to balance the need for data integrity with the practical realities of real-world data sources.
For example, you may decide that it's acceptable for up to 5% of your validation tests to fail at a given point in time. Or, you might find it helpful to have multiple levels of data quality, such as grouping failing test units into 0-5%, 5-10%, and 10%+ bands.
The action_levels() function is used to specify these failure threshold levels. It generates an action_levels object that can then be passed to the actions parameter of the create_agent() function, effectively setting default thresholds for all your validation steps.
When defining the thresholds, you can use relative values (real numbers between 0 and 1) to represent the acceptable fraction of failing test units for the WARN and STOP conditions.
So, action_levels(warn_at = 0.2, stop_at = 0.8) defines that the WARN condition is met when 20% of test units fail and the STOP condition is met when 80% of test units fail.
After defining the action levels, we create an agent using the create_agent() function, then define the validation checks to be performed using 3 validation functions: col_vals_between(), col_is_integer() and col_is_factor(). Lastly, we use the interrogate() function to start executing the checks.
From the validation report, we get the following information:
- 79% of the data in the Sepal.Length column are not within our specified range of values, i.e. [4, 5]. This validation step has met the WARN condition as the failed test unit fraction is greater than 0.2. This is prominently indicated by the yellow color strip at the left of the STEP column.
- The Petal.Length column does not contain integer values and it has met the STOP condition as the failing test unit fraction is greater than 0.8. The STOP condition is also indicated by the red color strip.
- The Species column has successfully passed the validation check, which is indicated by the green color strip.
The generated validation report is both intuitive and visually attractive. Check out this documentation to explore the other validation functions available.
⁉️ What if we want to customize the generated validation report? 😯
We can use get_agent_report() to customize the validation report as needed.
get_agent_report(agent, title = "Example Validation Report", arrange_by = "severity")
In this report, we have made the following changes:
- Provided a custom title to our validation report
- Ordered the report table rows based on their severity, i.e. the steps that met the STOP condition appear at the top, followed by the steps meeting the WARN condition and lastly, the steps that have successfully passed the validation checks.
There are many more customization options available. To explore them all, check out the documentation.
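For instance, here is a small sketch (assuming the display_table argument behaves as documented in your installed version): setting display_table = FALSE returns the report as a plain tibble instead of an HTML table, which is handy for programmatic checks.
# Get the validation summary as a tibble rather than an HTML report
report_tbl <- get_agent_report(agent, display_table = FALSE)
head(report_tbl)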
Example for VALID-II: Pipeline Data Validation
This workflow method involves directly applying validation functions to the data tables, without the need to create an agent object. By calling the validation functions directly on the data, you can easily integrate data quality checks at various stages of your data transformation pipeline. This can be particularly useful when importing new data or performing critical data transformations, as it allows you to quickly identify and address any issues that may compromise the integrity of your application's data.
There is no validation report generated in this method as there is no agent involved. Instead, the focus is on the side effects of the validation process - triggering warnings, raising errors, or writing logs when specific failure thresholds are exceeded.
datasets::iris |>
  col_vals_lt(Sepal.Length, 5, actions = warn_on_fail(warn_at = 0.2)) |>
  col_vals_gt(Sepal.Width, 3, actions = warn_on_fail(warn_at = 0.5)) |>
  col_vals_not_null(Species)
🔎 Now let us explore what is going on 👀
In the above code, we directly applied the validation functions on the data without using the create_agent() and interrogate() functions. One notable difference in this method is that the validation functions operate immediately on the data, acting as a sort of filter. If no failure threshold is exceeded, the incoming data is passed through as output; if a STOP threshold is exceeded, execution halts and no data comes out the other end.
Notice that we have used the warn_on_fail() function, which is a helper function for setting the WARN threshold value. The stop_on_fail(stop_at = value) helper, as the name implies, is used to set the STOP threshold value.
In the result, we see that the data has passed the validation pipeline and has been returned with 2 warnings. The warnings are self-explanatory, indicating the columns that met the WARN condition.
datasets::iris |>
  col_vals_lt(Sepal.Length, 5, actions = stop_on_fail()) |>
  col_vals_gt(Sepal.Width, 3, actions = warn_on_fail(warn_at = 0.5)) |>
  col_vals_not_null(Species)
In this example, we use the stop_on_fail() helper function to define the action level. The default stop_at value is 1 and it is important to note that the value 1 does not indicate 100% of test units.
❗Absolute values, starting from 1, specify the number of failing test units that trigger the WARN or STOP condition. For example, warn_at = 10 or stop_at = 10 implies that if 10 or more test units fail, the corresponding condition is triggered.
So, in our output, we see that one of the validation steps has met the STOP condition, so the data does not pass through the validation pipeline and no data is returned in this case!
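To make the relative-versus-absolute distinction concrete, here is a minimal sketch (the thresholds are arbitrary and chosen purely for illustration):
# Relative threshold: warn if more than 20% of test units fail
datasets::iris |>
  col_vals_lt(Sepal.Length, 5, actions = warn_on_fail(warn_at = 0.2))
# Absolute threshold: warn if 10 or more test units fail
datasets::iris |>
  col_vals_lt(Sepal.Length, 5, actions = warn_on_fail(warn_at = 10))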
Example for VALID-V: Table Scan
The Table Scan workflow is the simplest of the six, yet it provides a powerful bird's-eye view of your data. This initial step can save you time and help you tailor your validation strategy more effectively.
It has a single scan_data() function that produces a rich HTML report divided into several key sections:
- Overview: A snapshot of your table, including its dimensions, duplicate row counts, and column types.
- Variables: Summarizes each table variable. The variable type determines the statistics and summaries provided, giving you a nuanced view of your data.
- Interactions: Visualizes the relationships between variables with an intuitive matrix plot.
- Correlations: For numerical variables, explore correlation matrix plots to understand potential relationships.
- Missing Values: Quickly assess data completeness with a summary figure showing the extent of missing data across variables.
- Sample: View the first 5 and last 5 rows of your dataset to get a feel for its structure and content.
scan_data(datasets::iris)
The sections parameter in the scan_data() function can be used to display only the required sections in the generated report. To explore more, check out this documentation.
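As a small sketch (assuming the single-letter section codes described in the scan_data() documentation, where O, V, I, C, M and S stand for the sections listed above), restricting the scan to the Overview, Variables, and Missing Values sections might look like this:
# Scan only selected sections: O = Overview, V = Variables, M = Missing Values
scan_data(datasets::iris, sections = "OVM")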
👉 Data Validation in Healthcare: A Real-World Application
In healthcare, data validation is not just important, it's critical. A recent project I worked on highlighted the power of data validation in maintaining the reliability and effectiveness of healthcare analytics.
We developed a Shiny Dashboard that pulled data from a Microsoft SQL Server database. Ensuring the data quality was crucial, especially considering future database updates. This led us to create a robust data validation system using the pointblank package.
For insights on connecting R with databases like PostgreSQL, check out our comprehensive guide.
Our validation app focused on four key areas:
- Database Structure Verification: We implemented checks to confirm the existence of all required tables in the database. This ensured that no essential data sources were missing or renamed without the administrator’s knowledge.
- Field Completeness: For each table, we verified the presence of all necessary fields. This step was crucial in catching any structural changes that could break the dashboard's functionality.
- Data Type Consistency: We set up validations to check that each field maintained its proper data type. This was particularly important for fields used in calculations or data visualizations.
- Relational Integrity: Given that our dashboard relied on joining multiple tables, we implemented checks to ensure data availability and consistency across these joins.
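To give a flavor of what such checks can look like, here is a simplified, hypothetical sketch, not the actual production code: the DSN, table names, column names, and thresholds are all placeholders used purely for illustration.
# Hypothetical sketch of database-backed validation with pointblank
library(DBI)
library(pointblank)

con <- DBI::dbConnect(odbc::odbc(), dsn = "hospital_dw")  # placeholder connection

# 1. Database structure: confirm that all required tables exist
required_tables <- c("patients", "admissions", "lab_results")
missing_tables <- setdiff(required_tables, DBI::dbListTables(con))
stopifnot(length(missing_tables) == 0)

# 2-4. Field completeness, data types, and join-key availability for one table
admissions <- dplyr::tbl(con, "admissions")

agent <- create_agent(
  tbl = admissions,
  tbl_name = "admissions",
  actions = action_levels(warn_at = 0.01, stop_at = 0.05)
) |>
  col_exists(c(patient_id, admit_date, ward)) |>  # required fields are present
  col_is_numeric(patient_id) |>                   # fields keep their expected types
  col_vals_not_null(patient_id) |>                # join key populated for relational checks
  interrogate()

DBI::dbDisconnect(con)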
For a deeper dive into data validation practices in healthcare, see our post on good software engineering practices in FDA submissions.
By implementing these validation steps, we created a safety net that not only caught potential issues early but also provided peace of mind for both our development team and healthcare professionals relying on the dashboard.
This experience underscored the importance of proactive data validation in healthcare analytics. It's not just about having data—it's about having data you can trust. Tools like pointblank make this process more manageable and reliable, ultimately contributing to better decision-making in critical healthcare environments.
👉 Summing Up Data Validation with pointblank
While we've explored some key features of pointblank in this post, it's worth noting that this powerful package has even more to offer. The capabilities we've discussed only scratch the surface of what pointblank can do for your data validation needs in R Shiny app development.
For instance,
- pointblank provides robust tools for Information Management, allowing you to create a comprehensive snapshot of your data tables. You can collect details about individual columns, the entire table, and any other relevant information through a series of functions.
- It also offers built-in functionality for email notifications, which can be invaluable for keeping team members informed about validation results.
- Another notable feature is the logging system, which helps you maintain a detailed record of validation activities over time. This can be crucial for debugging and auditing purposes.
- For those dealing with complex data ecosystems, the multiagent feature allows you to manage multiple validation agents simultaneously, streamlining the process of validating data across various sources or stages of your pipeline.
- The pointblank package also supports YAML configuration, enabling you to define validation rules in a YAML file. This can greatly enhance the maintainability and portability of your validation setup.
and much much more…😮
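As one example, here is a minimal sketch of the YAML workflow, assuming yaml_write() and yaml_agent_interrogate() behave as documented in your installed version of pointblank:
# Define a validation plan on a table supplied as a formula (so it can be re-read later)
agent <- create_agent(tbl = ~ datasets::iris, tbl_name = "iris") |>
  col_vals_between(Sepal.Length, 4, 8) |>
  col_is_factor(Species)

# Write the plan to YAML, then re-run it later (or on another machine)
yaml_write(agent, filename = "iris-validation.yml")
new_agent <- yaml_agent_interrogate(filename = "iris-validation.yml")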
These additional features demonstrate the depth and flexibility of pointblank, making it a comprehensive solution for data validation. As you delve deeper into pointblank, you'll likely discover even more ways it can enhance your development workflow and ensure the integrity of your data.
Ready to elevate your data validation processes? Start using the pointblank package in your R Shiny projects today and experience seamless, reliable data checks. For more resources, checklists, and guides, explore our Appsilon resource page.