Airflow is a platform used to programmatically declare ETL workflows. Learn how to leverage hooks for uploading a file to AWS S3 with it.
This article is a step-by-step tutorial that will show you how to upload a file to an S3 bucket thanks to an Airflow ETL (Extract, Transform, Load) pipeline. ETL pipelines are defined by a set of interdependent tasks.
A task might be "download data from an API" or "upload data to a database", for example. A dependency would be "wait for the data to be downloaded before uploading it to the database". After an introduction to ETL tools, you will discover how to upload a file to S3 thanks to boto3.
A bit of context around Airflow
Airflow is a platform composed of a web interface and a Python library. The project was initiated by Airbnb in January 2015 and has been incubated by The Apache Software Foundation since March 2016 (version 1.8). The Airflow community is very active, with more than 690 contributors and over 10k stars on the repository.
The Apache Software Foundation also recently announced Airflow as a top-level project, which gives a good measure of the health of the community and of the project management so far.
Build your pipeline step by step
Step 1: Install Airflow
As with every Python project, create a folder for your project and a virtual environment.
# Create your virtual environment
virtualenv venv
source venv/bin/activate
# Create your Airflow project folder
mkdir airflow_project
cd airflow_project
You also need to export an additional environment variable, as mentioned in the November 21st announcement.
export SLUGIFY_USES_TEXT_UNIDECODE=yes
Finally, run the commands from the Getting Started section of the documentation, pasted below.
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow
# install from pypi using pip
pip install apache-airflow
# initialize the database
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
# start the scheduler
airflow scheduler
# visit localhost:8080 in the browser and enable the example dag in the home page
Congratulations! You now have access to the Airflow UI at http://localhost:8080 and you are all set to begin this tutorial.
Note: the Airflow home folder will be used to store important files (configuration, logs, and the database, among others).
Step 2: Build your first DAG
A DAG is a Directed Acyclic Graph that represents the chaining of the tasks of your workflow. Here is the first DAG you are going to build in this tutorial.
On this schematic, we see that the task upload_file_to_S3 may be executed only once dummy_start has been successful.
Note: Our ETL is composed of only an L (Load) step in this example
As you can see in $AIRFLOW_HOME/airflow.cfg, the value of the dags_folder entry indicates that your DAG must be declared in the folder $AIRFLOW_HOME/dags. We will call upload_file_to_S3.py the file in which we are going to implement our DAG:
# Create the folder containing your DAG definitions
mkdir $AIRFLOW_HOME/dags
# Create your DAG definition file
touch $AIRFLOW_HOME/dags/upload_file_to_S3.py
# then open this file with your favorite IDE
First, import the required operators from airflow.operators. Then, declare two tasks and attach them to your DAG my_dag thanks to the dag parameter. Using the context manager allows you not to duplicate the dag parameter in each operator. Finally, set a dependency between them with >>.
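Here is a minimal sketch of what upload_file_to_S3.py could look like (the start_date, schedule_interval, and placeholder callable are illustrative assumptions, and the import paths are the Airflow 1.x ones):
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def placeholder_callable(**kwargs):
    # Temporary no-op, replaced by the real upload helper in Step 3
    pass

# The context manager attaches every operator declared inside it to my_dag
with DAG(
    dag_id='my_dag',
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
) as my_dag:
    dummy_start = DummyOperator(task_id='dummy_start')
    upload_to_S3_task = PythonOperator(
        task_id='upload_file_to_S3',
        python_callable=placeholder_callable,
    )
    # upload_file_to_S3 runs only once dummy_start has succeeded
    dummy_start >> upload_to_S3_task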
Now that we have the spine of our DAG, let’s make it useful. To do so, we will write a helper that uploads a file from your machine to an S3 bucket thanks to boto3.
Step 3: Use boto3 to upload your file to AWS S3
boto3 is a Python library that lets you communicate with AWS. In this tutorial, we will use it to upload a file from your local computer to your S3 bucket.
Install boto3 and fill ~/.aws/credentials and ~/.aws/config with your AWS credentials, as mentioned in the Quick Start. More information about the authentication mechanisms is given in the boto3 Credentials documentation.
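For reference, here is roughly what this setup looks like (all values below are placeholders, and eu-west-1 is just an example region):
# Install boto3 in your virtual environment
pip install boto3
# ~/.aws/credentials
[default]
aws_access_key_id = <your access key id>
aws_secret_access_key = <your secret access key>
# ~/.aws/config
[default]
region = eu-west-1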
All you need to do now is implement this little helper, which uploads a file to S3, and call it in your Python upload task.
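A possible implementation of this helper is sketched below, assuming it takes the local file path, the destination S3 key, and the bucket name as arguments (the names are illustrative):
import boto3

def upload_file_to_S3(filename, key, bucket_name):
    # boto3 reads your credentials from ~/.aws/credentials by default
    s3 = boto3.client('s3')
    s3.upload_file(filename, bucket_name, key)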
Now, make your DAG task upload_to_S3_task call this helper thanks to the argument python_callable:
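For example (the file path, key, and bucket name below are placeholders), op_kwargs passes the keyword arguments to the helper at execution time:
upload_to_S3_task = PythonOperator(
    task_id='upload_file_to_S3',
    python_callable=upload_file_to_S3,
    op_kwargs={
        'filename': '/path/to/a/local/file.csv',
        'key': 'my_S3_file.csv',
        'bucket_name': 'my-s3-bucket',
    },
)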
Launch your DAG. When it is finished, you should see your file in your S3 bucket.
Note: A good tip for launching your DAG is to clear the first step with the Downstream option checked. The scheduler will then relaunch it automatically.
Step 4: Do more by doing less, use Airflow hooks!
Do you remember the little helper we wrote to upload a file to S3? Well, all of this is already implemented. To use it, you will have to create a Hook linked to a connection you have defined in Airflow. One possibility to create a connection is through the UI:
Once you have created your new connection, all there is to do is fill in the two following fields, Conn Id and Conn Type, and click Save.
Now that your Airflow S3 connection is set up, you are ready to create an S3 hook to upload your file. Your hook will be linked to your connection through its aws_conn_id argument.
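Here is a sketch of such a hook-based helper, assuming the connection you just created has the Conn Id my_S3_conn (an illustrative name) and using the Airflow 1.x import path:
from airflow.hooks.S3_hook import S3Hook

def upload_file_to_S3_with_hook(filename, key, bucket_name):
    # 'my_S3_conn' is the Conn Id defined in the Airflow UI
    hook = S3Hook(aws_conn_id='my_S3_conn')
    hook.load_file(filename, key, bucket_name)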
Replace the python_callable helper in upload_to_S3_task with upload_file_to_S3_with_hook and you are all set.
If you read the AWS hooks' source code, you will see that they use boto3. They add an abstraction layer over boto3 and provide an improved implementation of what we did in Step 3 of this article.
Note: Although you did not specify your credentials in your Airflow connection, the process worked. This is because, when Airflow creates a boto3 session with aws_access_key_id=None and aws_secret_access_key=None, boto3 authenticates you with your ~/.aws/credentials information. If you did not configure your AWS profile locally, you can also fill in your AWS credentials directly in the Airflow UI through login/password or through the Extra field.
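For instance, the Extra field accepts a JSON payload; to the best of my knowledge the AWS hook reads keys such as the following, but check the hook documentation of your Airflow version:
{"aws_access_key_id": "<your access key id>", "aws_secret_access_key": "<your secret access key>"}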
Hooks add great value to Airflow since they allow you to connect your DAG to your environment. There are already numerous hooks ready to be used, like HttpHook, MySqlHook, HiveHook, SlackHook and many others, so make sure to check out Airflow hooks and Airflow contribution hooks before establishing a connection to an external service.
Conclusion
Thanks to this tutorial, you should know how to:
- Install and configure Airflow;
- Make your first Airflow DAG with a Python task;
- Use boto3 to upload a file to AWS S3;
- Use hooks to connect your DAG to your environment;
- Manage authentication to AWS via Airflow connections.
Thanks to Florian Carra, Pierre Marcenac, and Tanguy Marchand.
If you are looking for Data Engineering experts, don't hesitate to contact us!