Testing and Validation
Testing and Validation Made Easy
Testing and validating data processing workflows is notoriously difficult. The typical approach involves making changes in a fully deployed environment and then performing manual tests. This process is not only time-consuming but also disrupts other users who rely on the shared environment.
In other areas of software engineering, unit testing is the standard practice. Engineers write small, focused tests that can be run locally on their machines, without needing to deploy to a shared environment or database. A suite of unit tests is maintained and automatically executed whenever code changes are made, preventing the integration of any code that breaks existing tests.
However, applying this approach to data processing is challenging, and it becomes nearly impossible when dealing with distributed systems like Apache Spark.
The b.well Open Source SDK addresses this by providing a testing framework for data processing code. Engineers define a set of sample input files along with the expected output files. The SDK automatically loads the input data, runs the specified data processing code in Spark, and compares the results against the expected outputs. If the actual output doesn't match the expected results, the unit test fails.
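The snippet below is a minimal, hand-written sketch of that pattern using plain PySpark and pytest; it is not the SDK's actual API. The transformer `clean_patients` and the fixture paths under `tests/fixtures/` are hypothetical placeholders, and the SDK automates the loading and comparison steps that this test performs explicitly.

```python
from pyspark.sql import SparkSession, DataFrame


def clean_patients(df: DataFrame) -> DataFrame:
    """Hypothetical transformation under test: keep only active patients."""
    return df.filter(df["active"]).select("id", "name")


def test_clean_patients() -> None:
    # A local SparkSession is enough; no cluster or database is required.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("clean_patients_test")
        .getOrCreate()
    )
    try:
        # Sample input and expected output live in version-controlled fixture files.
        input_df = spark.read.json("tests/fixtures/input/patients.json")
        expected_df = spark.read.json("tests/fixtures/expected/patients.json")

        actual_df = clean_patients(input_df)

        # Align column order, then fail the test if the row sets differ.
        expected_df = expected_df.select(*actual_df.columns)
        assert actual_df.count() == expected_df.count()
        assert actual_df.subtract(expected_df).count() == 0
        assert expected_df.subtract(actual_df).count() == 0
    finally:
        spark.stop()
```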
These data processing unit tests run directly on developers' machines without connecting to any databases. They can also be integrated into the continuous integration pipeline, so every pull request is validated against them and faulty code cannot be merged into the main branch.
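One common way to keep such tests self-contained, shown here as an assumed setup rather than the SDK's prescribed one, is a shared pytest fixture that builds a local SparkSession. Running against `local[*]` keeps everything on the developer's machine or the CI runner, with no external services involved.

```python
# conftest.py -- a sketch of a shared local Spark fixture for the test suite.
from typing import Iterator

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> Iterator[SparkSession]:
    """Provide one local SparkSession for the whole test run."""
    session = (
        SparkSession.builder
        .master("local[*]")
        .appName("data-processing-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```

The same `pytest` command that developers run locally can then be executed as a CI step, so a failing comparison blocks the pull request before merge.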
