Do you want to manage your BigQuery data pipelines automatically? Airflow is a great option to schedule, execute, and monitor your workflows since it is very easy to configure and set up.
If you are new to Airflow, this Sicara article presents an excellent introduction to Airflow and its components. Otherwise, we can start our tutorial on how to run BigQuery in Airflow DAGs locally using Docker!
Prerequisites
Why is Docker with Airflow awesome?
To launch Airflow on your machine, we recommend using Docker. One of the major advantages of Docker is application portability: Docker encapsulates everything your application requires to run in containers, so you can guarantee that your application will run in any UNIX environment without having to worry about differences in software versions or configurations. Moreover, Docker containers provide a level of isolation that makes it easier to manage multiple Airflow instances, which makes switching between Airflow projects with different requirements very easy. Otherwise, you would need to reset Airflow every time you switch projects, which is painful.
How to run Airflow with Docker Compose?
To set up Airflow with Docker Compose, you can rely on this article. It presents a Docker Compose file that defines a multi-container Airflow deployment, with a PostgreSQL database, a scheduler, a web server, and a utility container for initialization.
Tutorial requirements
Before following the tutorial, please ensure that you have the following requirements fulfilled:
- Docker Desktop installed on your machine: you can visit the official Docker website and download it from there.
- A Python working folder containing an empty folder named dags, a docker-compose.yml file, and a .env file given in this article.
NB:
- The new version of YAML causes the error “Map keys must be unique”. To fix that, put common and depends-on on the same line: <<: [ *common, *depends-on ].
- For this tutorial, you should upgrade the Airflow image version to 2.5.1.
After these modifications, the new docker-compose.yml file should look like the following:
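Since the full file is quite long, here is a trimmed sketch showing only where these two changes land. The anchor names (common, depends-on) and the surrounding keys are assumed to match the file from the article referenced above:

```yaml
# Trimmed, assumed excerpt of docker-compose.yml -- only the parts touched by the two fixes
x-common: &common
  image: apache/airflow:2.5.1        # upgraded Airflow image version
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags

x-depends-on: &depends-on
  depends_on:
    postgres:
      condition: service_healthy
    airflow-init:
      condition: service_completed_successfully

services:
  scheduler:
    <<: [ *common, *depends-on ]     # common and depends-on merged on the same line
    container_name: airflow-scheduler
    command: scheduler
```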
Create your GCP project
The first step is to create a Google Cloud Platform project where the BigQuery datasets and tables will be created (at the time of writing this article, Google is offering a 90-day, $300 free trial for every new user). For that, follow these steps:
- Go to the project management page in the Google Cloud console.
- Click “Create Project”
- Enter the “project name” and take note of the “project ID”; this is important for later. In our case, the project ID associated with our project is “airflow-project-test-385219”.
- Click “Create” to create your GCP project
Authenticate Google client libraries
First, install the Google Cloud SDK according to the official documentation, which provides detailed instructions for various operating systems.
Thereafter, you can authenticate Google client libraries to interact with Google APIs. To do this, you can run this command:
gcloud auth application-default login
Your browser will open, and you need to choose the Google account to authenticate.
A file ~/.config/gcloud/application_default_credentials.json will then be created on your machine.
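If you want a quick sanity check that the credentials work, you can run a small script on your machine. This is optional and not part of the tutorial itself; it assumes the google-cloud-bigquery package is installed locally:

```python
# Optional sanity check -- assumes `pip install google-cloud-bigquery` locally
from google.cloud import bigquery

# The client picks up the application default credentials created above.
client = bigquery.Client(project="airflow-project-test-385219")  # replace with your project ID
print([d.dataset_id for d in client.list_datasets()])  # should print without an authentication error
```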
Now that the project creation and the authentication are done, we can create our first BigQuery DAG.
Prepare your DAG
For this tutorial, we are using a simple DAG with a few BigQuery operators to demonstrate how to run some queries. You'll need to put the DAG and the query files in the dags directory.
The aim of this code is to create an Airflow DAG with two tasks (sketched below):
- Create a dataset test_dataset in the EU location using the BigQueryCreateEmptyDatasetOperator
- Create a table test_table in the test_dataset. For that, we use the BigQueryInsertJobOperator, which calls an SQL file create_bq_table.sql containing the table creation query.
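The full DAG code is in the GitHub repository linked at the end of the article; the snippet below is a minimal sketch of such a DAG, where the DAG id, schedule, and the assumption that create_bq_table.sql sits next to the DAG file are ours:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="bigquery_dag",            # assumed name
    start_date=datetime(2023, 1, 1),
    schedule=None,                    # triggered manually from the Airflow UI
    catchup=False,
) as dag:
    # Task 1: create the test_dataset dataset in the EU location
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create_dataset",
        dataset_id="test_dataset",
        location="EU",
        gcp_conn_id="google_cloud_default",    # the connection configured in the next section
    )

    # Task 2: create test_table by running the query stored in create_bq_table.sql,
    # which lives next to the DAG file so Jinja can include it.
    create_table = BigQueryInsertJobOperator(
        task_id="create_table",
        gcp_conn_id="google_cloud_default",
        location="EU",                          # run the job where the dataset lives
        configuration={
            "query": {
                "query": "{% include 'create_bq_table.sql' %}",
                "useLegacySql": False,
            }
        },
    )

    create_dataset >> create_table
```

The create_bq_table.sql file then simply contains the CREATE TABLE statement for test_table.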
Set up the Google Cloud connection in Airflow
First, you need to mount a volume for the Google Cloud credentials. In the docker-compose.yml file, map the local Google Cloud credentials file ~/.config/gcloud/application_default_credentials.json on your machine to /home/airflow/.config/gcloud/application_default_credentials.json inside the Docker container.
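For example, in the section shared by the Airflow services (names assumed from the compose excerpt above), the volumes block gains one line:

```yaml
# Assumed placement, inside the section shared by the Airflow services
x-common: &common
  volumes:
    - ./dags:/opt/airflow/dags
    # mount the local credentials file into the container, read-only
    - ~/.config/gcloud/application_default_credentials.json:/home/airflow/.config/gcloud/application_default_credentials.json:ro
```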
Then, you need to declare the Google Cloud environment variables, as sketched after this list:
- Define the connection id GOOGLE_CLOUD_DEFAULT with the connection type google-cloud-platform.
- Set the value of GOOGLE_APPLICATION_CREDENTIALS to the credentials file path inside the Docker container.
- Set the value of GOOGLE_CLOUD_PROJECT to the ID of your Google Cloud project.
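Here is a sketch of how these could be declared in the environment block of the shared section; the exact placement is an assumption (you could equally put them in the .env file), and the connection is passed to Airflow as an AIRFLOW_CONN_* variable:

```yaml
x-common: &common
  environment:
    # connection id google_cloud_default, type google-cloud-platform, given as a connection URI
    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: "google-cloud-platform://"
    # credentials file path inside the container (the volume mounted above)
    GOOGLE_APPLICATION_CREDENTIALS: /home/airflow/.config/gcloud/application_default_credentials.json
    # replace with your own project ID
    GOOGLE_CLOUD_PROJECT: airflow-project-test-385219
```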
Test it out
Now you can launch your DAG!
- Run docker compose up -d to run Airflow in detached mode.
- Open http://0.0.0.0:8080 in your web browser (username: airflow, password: airflow).
- In DAGs/Actions, click “Trigger DAG” and ensure that the dataset and the table creations are successful.
In the Google Cloud console, in the BigQuery service, verify that the dataset and the table have been created in your project.
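If you prefer checking from code instead of the console, here is a quick sketch, again assuming the google-cloud-bigquery package is installed locally:

```python
# Optional check from your machine -- assumes `pip install google-cloud-bigquery` locally
from google.cloud import bigquery

client = bigquery.Client(project="airflow-project-test-385219")  # your project ID
table = client.get_table("test_dataset.test_table")  # raises NotFound if the DAG did not create it
print(f"{table.full_table_id} exists with {len(table.schema)} column(s)")
```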
Finally, run docker compose stop to stop the containers, or docker compose down to shut down the containers and remove them.
You can find the code for this tutorial in this GitHub repository.
What’s Next
There is much more you can do with Airflow to manage BigQuery tables, run queries and validate the data. You can find different BigQuery operators in this Airflow documentation.
To learn more about Airflow and its use cases, I recommend this Sicara article about dbt transformation, which explains how to use dbt and Airflow to master your data transformations.