In a world powered by data, the concept of 'garbage in, garbage out' is no idle warning. Welcome to a domain where the quality of your data shapes the foundation of your business insights. Here, we embark on an odyssey to demystify data quality's pivotal role. Enter dbt and Airflow, the architects of seamless data quality checks. But that's just the start. Journey with us to uncover Elementary, the vigilant overseer of your data quality landscape. Are anomalies causing unrest? Slack alerts have your back. Join us as we navigate the terrain of data quality assurance and arm you with expertise to ensure every byte counts.
Why checking and monitoring your data quality is fundamental
Data isn't always perfect - to say the least. In fact, unless you work on some specific clean, static, curated Kaggle datasets, you will always encounter more or less dirty data. The tricky part is that most businesses rely on it to make decisions. Imagine a company targeting a marketing campaign with wrong customer information, or messing up its forecasts because of supply chain data errors. This happens because data isn't always accurate or consistent.
Having clean data is crucial, not just for technical reasons, but because it's the foundation of trust. If data is dirty, even the smartest tools can't help. Think about it: if a recipe uses the wrong ingredients, the food won't taste right no matter how skilled the cook is. That's the core principle of Machine Learning and data analysis: "Garbage in, garbage out".
This principle implies that if you want maximum control over the output, you must evaluate the quality of your input data and keep evaluating it throughout your whole data transformation process. In the context of an ETL (Extract, Transform, Load) pipeline, ingesting erroneous source data leads to erroneous business data, or may even break the pipeline, which can be very costly for your company.
To counter those issues, the best solution so far is to check the quality of your data at each step of your pipeline, to monitor the results regularly, and to be alerted if low-quality data has been ingested or has been created by one of your transformations - while it's impossible to make a chef's meal with rubbish ingredients, it's very easy to make a rubbish meal with high-quality products!
Now that, I hope, we are all convinced of the importance of data quality checks, we'll go through a technical note exploring dbt, Apache Airflow and Elementary as a solution for a high-quality data pipeline. Note that this article is the second part of a previous article explaining how to ensure robust data transformation with dbt and Airflow. I recommend reading it if you are unfamiliar with either dbt or Airflow. Otherwise, let's jump into it!
Make Data Quality Checks Efficient with dbt and Airflow
Using dbt for Data Validation
dbt plays a crucial role in ensuring data quality. It not only allows data transformation but also data quality checks. To define a test in dbt, we just have to write a .sql file in the tests folder. There are two different classes of tests: singular and generic. Singular tests are specific to one table and set of columns. Generic tests, on the other hand, can be applied to different tables and columns; uniqueness and non-nullity, for instance, are generic tests. Both kinds of tests succeed if the SQL query returns no record and fail otherwise. Note that we can set up thresholds to raise either a warning or an error based on the number of failing records. So if we want to check that a column is not null, the SQL query should select only the null values: the filtering condition has to be the opposite of what we expect.
Let's write a singular and a generic test, starting with the singular one. Let's say that, for the table table1, we want the first column to always be at least ten times higher than the second one.
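As a minimal sketch, assuming the columns are simply named col1 and col2, the test is just a SQL file in the tests folder that selects the rows violating the rule:

```sql
-- tests/assert_col1_at_least_ten_times_col2.sql
-- Singular test: returns (and therefore fails on) every row of table1
-- where col1 is NOT at least ten times higher than col2.
select
    col1,
    col2
from {{ ref('table1') }}
where col1 < 10 * col2
```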
Pretty easy, right? Now the generic test: we want to check that one column is higher than another by a factor x.
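A possible sketch is shown below; the test name higher_than and the compared_column and factor arguments are illustrative choices, not imposed by dbt:

```sql
-- tests/generic/higher_than.sql
{% test higher_than(model, column_name, compared_column, factor) %}

-- Generic test: fails on every row where column_name is not at least
-- `factor` times higher than compared_column.
select
    {{ column_name }},
    {{ compared_column }}
from {{ model }}
where {{ column_name }} < {{ factor }} * {{ compared_column }}

{% endtest %}
```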
So what's the difference between those two tests? The generic test is actually a macro! We have parametrized the model, the columns, and the other variables. So how can we apply this test to the tables and columns we want? As for model definitions, everything is configured in the .yaml files.
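Here is a hedged sketch of such a configuration, reusing the hypothetical higher_than test and the my_dbt_model, col1 and col2 names, and also showing the optional warning/error thresholds mentioned earlier:

```yaml
# models/schema.yml
version: 2

models:
  - name: my_dbt_model
    columns:
      - name: col1
        tests:
          - higher_than:
              compared_column: col2
              factor: 10
              config:
                # Thresholds are illustrative: warn on any failing row,
                # fail the test only above 100 failing rows.
                warn_if: ">0"
                error_if: ">100"
      - name: col2
        tests:
          - not_null
```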
Note that we did not need to set the model and column_name variables explicitly, since they are inferred by dbt during the build stage.
With this configuration, the columns col1 and col2 from the table my_dbt_model will be checked.
To run our brand new tests:
```bash
dbt test
```
Automating and Scheduling with Airflow
Now that we have data quality checks, let's integrate them into our pipeline. Say we want to check the quality of the data after transformation: we simply have to add one task to our DAG, after the dbt run task.
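As a minimal sketch, assuming Airflow 2.x, that both steps are run with BashOperator, and that the dbt project lives under /opt/airflow/dbt (all of which are assumptions for illustration):

```python
# dags/dbt_pipeline.py
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Path to the dbt project; an assumption for this sketch, adapt it to your setup.
DBT_PROJECT_DIR = "/opt/airflow/dbt"

with DAG(
    dag_id="dbt_transformation_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Run the dbt transformations.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )

    # Run the data quality checks right after the transformations.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )

    dbt_run >> dbt_test
```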
Now we have a robust transformation pipeline with data quality checks. Finally, we need to monitor the quality of the data and be warned when it degrades. For that, let me introduce the dbt package Elementary.
Monitor Data Quality Effectively with Elementary
Elementary: Enhancing Data Observability
Elementary is a dbt package designed for data observability. This tool extends dbt's testing capabilities, allowing for more complex checks on data quality, and provides a user-friendly interface to monitor the history of your data quality checks. Elementary can track data changes over time, alerting you to potential quality issues like unexpected spikes or drops in your data.
Here's a quick glimpse of their interface:
You can also check their GitHub page, where you'll find a demo that lets you get hands-on with their interface.
Elementary does not only provide an interface to monitor your data quality checks; it also allows you to implement more complex data quality checks called anomaly tests. Anomaly tests check that a specific metric stays relatively stable over time and has not suddenly dropped or spiked. These checks help ensure that your data is not drifting too much. Elementary provides several built-in anomaly tests.
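For instance, the built-in volume_anomalies and column_anomalies tests can be attached in the .yaml files like any other dbt test. A sketch reusing the hypothetical my_dbt_model and col1 names (the exact monitors to track are up to you):

```yaml
# models/schema.yml
version: 2

models:
  - name: my_dbt_model
    tests:
      # Warn if the row count of the table deviates from its usual trend.
      - elementary.volume_anomalies
    columns:
      - name: col1
        tests:
          # Warn if column-level metrics shift abnormally over time.
          - elementary.column_anomalies:
              column_anomalies:
                - null_count
                - average
```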
To install Elementary in your project, please refer to their installation documentation and to their CLI installation documentation.
Integration of Elementary in Airflow
The last step before having a perfectly robust pipeline is to integrate Elementary report generation into our DAG, after the data quality checks. For that, we add a task to our DAG.
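Continuing the hypothetical DAG sketched above, a BashOperator calling the Elementary CLI (edr) could look like this:

```python
    # Inside the `with DAG(...)` block from the earlier sketch.
    # Generate the Elementary HTML report once the data quality checks have run.
    elementary_report = BashOperator(
        task_id="elementary_report",
        # Run from the dbt project directory (path reuses the assumption above).
        bash_command=f"cd {DBT_PROJECT_DIR} && edr report",
    )

    dbt_test >> elementary_report
```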
This task will generate an index.html file, which you can then expose however you wish!
Integration with Notification Systems
Elementary can be integrated with messaging systems like Slack to provide real-time alerts on data issues. This feature ensures that you are promptly informed about any anomalies or quality concerns, enabling quick responses to maintain data integrity.
To set this up, please refer to their documentation. Then let's add this alerting step to our DAG. We can run it in parallel with the report generation since the two are not dependent on each other.
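Still continuing the same hypothetical DAG, and assuming the Slack credentials have been configured as described in the Elementary documentation, the alerting task could look like this:

```python
    # Inside the same `with DAG(...)` block.
    # Send alerts (e.g. to Slack) for failed tests and detected anomalies.
    elementary_alerts = BashOperator(
        task_id="elementary_alerts",
        bash_command=f"cd {DBT_PROJECT_DIR} && edr monitor",
    )

    # Alerts run right after the checks, in parallel with the report generation.
    dbt_test >> elementary_alerts
```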
We finally have a complete, robust data transformation pipeline with dbt, Elementary and Airflow!
What’s next?
This article presents an over-simplified data transformation pipeline, but it allowed us to go through the basics of dbt tests and data monitoring with Elementary, and to explore how to make them work together with Apache Airflow.
There are many data quality technologies out there that could fit your needs and constraints, Sifflet for instance, if dbt and Elementary are not your perfect match. You can read Lucie Martin's article explaining how to achieve reliable data quality with Sifflet.
If you are looking for Data Experts then, don't hesitate to contact us!