Want to take your data transformation pipeline to the next level? Look no further than dbt and Airflow! With version control, monitoring, and orchestration capabilities, this dynamic duo is the key to unlocking the full potential of your data.
In this article, we will see why and how to use this duo to master your data transformation through an end-to-end tutorial.
dbt as your data transformation pipeline
What is dbt?
dbt, which stands for data build tool, is an open-source SQL-first data transformation tool that allows data analysts and data engineers to transform, test and document data pipelines. To simplify, dbt is the “T” of “ETL” (or “ELT”).
There are two ways of using dbt: dbt Cloud and dbt Core. dbt Cloud is, as its name suggests, the cloud version of dbt. It offers an online IDE that enables developers to implement data transformations with git versioning and to set up CI/CD to test, deploy, and automate them. This solution is not free (cf. https://www.getdbt.com/pricing/) and may be limiting if you want to add other services to your data pipelines. That's why, in this article, we will focus on dbt Core, which is free, in tandem with Apache Airflow to schedule and run your pipelines automatically.
The power of dbt
dbt is establishing itself as the standard for data transformation for several good reasons. The two main ones are that it brings software engineering best practices into a SQL ecosystem and that it keeps the configuration and maintenance of your code simple. In fact, in most of your future dbt projects, you will mostly write .sql and .yml files, and that is pretty much it!
So how does dbt combine SQL development and software best practices? The short answer: Jinja and git versioning. Jinja is a templating engine that lets you write Python-like code that generates SQL. It then becomes possible to define "SQL functions", called macros, which take parameters such as a table, a column, or any other variable, and generate SQL code. This lets you refactor your transformations so that they become easier to develop and maintain! Also, since you write your transformations, tests, and macros in .sql files, you can version your project's files to track different versions of your code and collaborate with other developers!
Now that we have seen why using dbt is a good idea, let’s see how to configure it.
How to configure your projects?
Install dbt
Before writing SQL models and tests with dbt, you must install it in your environment. The easiest way is to install dbt with pip:
pip install dbt-<adapter>
You must replace the keyword adapter with the one you are using. For instance, if you want to connect dbt with PostgreSQL, the command becomes:
pip install dbt-postgres
If you don't have pip installed yet or, for any reason, don't want to install dbt with pip, a few other options are available in the dbt documentation.
You can check that dbt is correctly installed by running the following command:
dbt --version
It should display something like this:
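The exact output depends on your dbt and adapter versions; with dbt-postgres it looks roughly like this (version numbers will differ):

Core:
  - installed: 1.5.0
  - latest:    1.5.0 - Up to date!

Plugins:
  - postgres: 1.5.0 - Up to date!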
Configuration of the project
Now that dbt is installed, let's create a new project using dbt init. Note that you can also plug dbt into an existing project (cf. the dbt documentation)!
dbt init
It will ask you to enter a name for your project and to specify which adapter you want to use:
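The exact prompts depend on your dbt version and the adapters you have installed, but the dialogue looks roughly like this (the answers below are illustrative):

Enter a name for your project (letters, digits, underscore): demo_dbt_airflow
Which database would you like to use?
[1] postgres

Enter a number: 1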
dbt creates a folder called demo_dbt_airflow with different subfolders where you will write your transformations, macros, and tests, as well as a configuration file named dbt_project.yml. This configuration file tells dbt where your code lives and lets you specify general configurations for your models.
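For reference, here is a trimmed version of the dbt_project.yml generated for a project named demo_dbt_airflow (your file may differ slightly depending on the dbt version):

name: 'demo_dbt_airflow'
version: '1.0.0'
config-version: 2

profile: 'demo_dbt_airflow'

model-paths: ["models"]
macro-paths: ["macros"]
test-paths: ["tests"]

models:
  demo_dbt_airflow:
    example:
      +materialized: view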
Finally, you need to connect dbt to your database. To do so, you must create a new configuration file named profiles.yml at the project's root folder:
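Here is a minimal sketch of a profiles.yml for PostgreSQL; the profile name must match the one declared in dbt_project.yml, and the environment variable names other than HOST, as well as the database name and schema, are my own placeholders:

demo_dbt_airflow:
  target: dev
  outputs:
    dev:
      type: postgres
      host: "{{ env_var('HOST') }}"
      port: 5432
      user: "{{ env_var('DB_USER') }}"
      password: "{{ env_var('DB_PASSWORD') }}"
      dbname: demo
      schema: public
      threads: 4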
Did you recognize the "{{ env_var('HOST') }}" part? That's Jinja! Here, it simply retrieves the value of the HOST environment variable. You can use Jinja in configuration files to avoid hardcoding sensitive values, declaring them instead in a dedicated .env file.
To check if dbt is correctly plugged into your database, run the following command:
dbt debug
dbt will check the dbt_project.yml and profiles.yml files and the database connection. If you see "All checks passed" then everything is all right, and you can finally start developing your data transformations!
Let’s jump into it!
The models in dbt
If you go to models/, you will see two SQL transformations, called models, and a configuration file. The configuration file is where you define your sources and models, and where you write your documentation. The model named my_first_dbt_model.sql is just plain SQL, so let's dive into my_second_dbt_model.sql:
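Here is the starter model as generated by dbt init (reproduced from memory, so treat it as indicative):

-- models/example/my_second_dbt_model.sql

-- Use the `ref` function to select from other models

select *
from {{ ref('my_first_dbt_model') }}
where id = 1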
As you can see, there is Jinja here too. As mentioned in the file, the ref function allows us to select from other models. It lets dbt connect the two models in a DAG, so that dbt runs my_first_dbt_model.sql before my_second_dbt_model.sql. The ref function is a key component of dbt, alongside the source function. The source function takes two arguments: a source name (which typically maps to a schema) and a table name.
Like the ref function, it lets dbt connect a table in your database to your SQL model.
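As an illustration (the source and table names below are my own), a source is first declared in a .yml file and then referenced in a model with the source function:

# models/schema.yml
version: 2

sources:
  - name: public              # the schema in the database
    tables:
      - name: raw_customers

-- models/stg_customers.sql
select *
from {{ source('public', 'raw_customers') }}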
But how does dbt run the query with those Jinja functions against your database? Well, to answer this question, let me quickly explain how macros work.
A quick word about dbt macros
A macro is a Jinja function that produces SQL code. Macros are defined in the macros folder as .sql files and are written in a mix of SQL and Jinja. As an example, let's build a macro that concatenates multiple columns. A macro starts with its signature:
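A possible signature for our example (the macro name and the separator argument are my own choices) opens with a macro tag and closes with an endmacro tag:

{% macro concat_columns(columns, separator='-') %}
{% endmacro %}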
Between those two tags, you can write the body of your macro:
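Here is a minimal sketch of the full macro, looping over the column list and only inserting the separator between two columns:

{% macro concat_columns(columns, separator='-') %}
    {%- for column in columns -%}
        {{ column }}
        {%- if not loop.last %} || '{{ separator }}' || {% endif -%}
    {%- endfor -%}
{% endmacro %}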
As you can see, you can write if statements and for loops. This is very useful since we don't know in advance how many columns we need to concatenate. You can then call this macro in your model like this:
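For example, in a model named call_macro.sql (the customers model it selects from is hypothetical):

-- models/call_macro.sql
select
    {{ concat_columns(['first_name', 'last_name']) }} as full_name
from {{ ref('customers') }}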
As I mentioned earlier, a macro is a function that generates SQL code. So when you run
dbt run --select call_macro
the actual SQL query is the following:
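Assuming the sketch above, the compiled query would look like this (the exact relation name depends on your database and schema configuration):

select
    first_name || '-' || last_name as full_name
from "demo"."public"."customers"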
We have just seen the basics of dbt and how to implement your data transformation pipelines. Let’s see how to orchestrate this pipeline with Apache Airflow.
Orchestrate your data transformation with Airflow
What is Airflow?
Apache Airflow, or simply Airflow, is a Python-based open-source tool to create, schedule, and monitor workflows, in our case data pipelines. Airflow has established itself as one of the standard technologies for orchestration, and several other articles cover it in depth. I recommend the article written by my colleague Mahaut Gérard, which introduces Apache Airflow, its basics, and how to install it.
Key components for your projects
In our project, we will focus on a specific Python package that provides dbt operators: airflow_dbt. First, install this package by running:
pip install airflow-dbt
This package provides the dbt operators we need to run our data pipeline: DbtRunOperator and DbtDocsGenerateOperator.
Now, that we have all our tools, let’s build our data pipeline!
How to build a DAG?
To build a DAG, we first create a .py file. Let's call it dbt_pipeline.py. Then, we import the packages we need:
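Assuming Airflow 2.x and the airflow-dbt package, the imports could look like this:

# dags/dbt_pipeline.py
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow_dbt.operators.dbt_operator import DbtDocsGenerateOperator, DbtRunOperator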
Once that is done, we can create our DAG. To do so, we start with the DAG definition:
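A minimal sketch of the DAG definition; the DAG id, schedule, start date, and paths are placeholders to adapt to your setup:

default_args = {
    # Illustrative paths: point them at your dbt project and at the folder containing profiles.yml
    "dir": "/opt/airflow/dbt/demo_dbt_airflow",
    "profiles_dir": "/opt/airflow/dbt/demo_dbt_airflow",
}

dag = DAG(
    dag_id="dbt_pipeline",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)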
The "dir" parameter is mandatory for the dbt operators: it tells dbt where our project is. Depending on how you deploy dbt, you may want to write an absolute path.
We can now implement the different tasks in our data pipeline. Let's create a "dbt run" task and a "dbt docs generate" task. Once that's done, we must chain the tasks together to form our DAG. Airflow provides a useful function called chain that simplifies this chaining:
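A sketch of the two tasks and their chaining (the model selected by dbt_run is the starter model from earlier; swap in your own):

dbt_run = DbtRunOperator(
    task_id="dbt_run",
    select="my_second_dbt_model",  # which model(s) to run
    dag=dag,
)

dbt_docs_generate = DbtDocsGenerateOperator(
    task_id="dbt_docs_generate",
    dag=dag,
)

# Run the models first, then regenerate the documentation
chain(dbt_run, dbt_docs_generate)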
We can see that the DbtRunOperator needs four parameters: task_id, which is common to all tasks, and three dbt-related ones. dir and profiles_dir are the paths to our dbt project and profiles file, passed through the default_args dictionary. The most interesting parameter is select: it allows us to tell dbt which model(s) to run. Hence, you can define complex DAGs that run specific models first, then others. Finally, we define a DbtDocsGenerateOperator task that simply runs the command
dbt docs generate
to ensure that the dbt documentation that you wrote in your configuration files is always up-to-date.
Run your data transformation pipeline
Now that we have built our data transformation pipeline, let's run it!
We have two easy ways to do it:
1. Execute the Python file:
python dags/dbt_pipeline.py
2. Run the DAG in the Airflow UI
You are now armed to master your data transformation pipelines with dbt and Airflow!
What’s next?
This article walked through an oversimplified data transformation pipeline, but it allowed us to cover the basics of dbt and Airflow and to explore how to make them work together. However, to build an even more robust data transformation pipeline, we will see in the next article how to ensure data quality and generate reports and alerts in our pipeline, once more using dbt, with the promising dbt elementary package, and Airflow!
Meanwhile, to go further in integrating dbt and Apache Airflow with Kubernetes, I invite you to check out the article DBT & Kubernetes by my colleague Erwan Benkara.
If you are looking for Data Experts then, don't hesitate to contact us!