As your project grows, the process of installing Python packages with the correct versions can become quite slow. When this happens in your Continuous Integration (CI) pipeline, the feedback loop can seem endless. This guide aims to help you reduce your feedback loop time with GitlabCI and Docker. Poetry is now a widely adopted packaging and dependency management tool for Python. With its robust dependency resolver, it helps developers install and maintain their dependencies effortlessly.
Here are the key steps and lessons I discovered to streamline GitlabCI during a recent project. It all started with my struggle in managing dependencies and then transitioning all my jobs to Docker to expedite the entire pipeline.
Dependencies, GitlabCI Executors, and Docker
First, let's take a moment to consider what dependencies are. Dependencies are essentially relationships between software components: one piece of software relies on the functionalities provided by another to work correctly. In Python, it's as simple as the airflow module needing the Flask module. However, some Python packages also require system-specific packages. For instance, the PostgreSQL client package, psycopg2, needs the system package libpq-dev to function. Since it's system-specific, it can vary from your local machine to your CI machine, and even to your production machine.
Speaking of Airflow, we have some great articles about data transformation and BigQuery data pipelines if you want to learn more!
In Gitlab, you have several types of executors to run your CI scripts. It's important to note that the default executor in GitlabCI is the Shell Executor, where your scripts run directly on the machine, specifically in a dedicated per-job shell. So, if you come across a dependency package with a system requirement, you can install it manually. However, this may not be the best approach.
Firstly, it can lead to conflicts with other software. Secondly, your primary concern should be isolation and reproducibility, which is precisely what Docker is designed for. GitlabCI offers other executors, including the Docker Executor. Each script runs in a dedicated Docker container, allowing you to control precisely which dependencies (both system-wide and Python-wide) you want to install in your image.
The isolation and reproducibility come at a minimal cost. Here's what you need:
- A Dockerfile that defines your image and installs, for example, libpq-dev if needed
- A Docker registry accessible to your CI runner for both pushing and pulling images
Dockerizing Your Python & Poetry Environment: The Simple Approach
With these steps, you can safely replicate your environment from your local machines to the production environment. Now, let's focus on the Python and Poetry specific aspects.
At first glance, Dockerizing Python and Poetry may appear straightforward. Even though there's no official Poetry image on the Docker Hub, it only takes a few lines to build an image.
- You start from a base Python image. The Python version you set in your tag must match the version you specify in your pyproject.toml file.
- Then you install Poetry with the official installer.
- As a good measure, you can set the POETRY_HOME environment variable to control where Poetry will be installed.
And that’s it! You can now build, tag, and push your image to your registry and use it from GitlabCI. Just don’t forget to install your dependencies with Poetry.
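As a sketch, such a Dockerfile could look like the following. The Python version and the POETRY_HOME path are assumptions; pin the base image tag to whatever version your pyproject.toml declares. The installer URL is Poetry's official one.

```dockerfile
# Base image: the tag must match the Python version in pyproject.toml (3.11 assumed here)
FROM python:3.11-slim

# Control where Poetry is installed and make it available on PATH
ENV POETRY_HOME=/opt/poetry
ENV PATH="${POETRY_HOME}/bin:${PATH}"

# Install Poetry with the official installer (curl is only needed at build time)
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && curl -sSL https://install.python-poetry.org | python3 - \
    && apt-get purge -y curl && rm -rf /var/lib/apt/lists/*
```

You would then build, tag, and push it to a registry of your choice (the registry name below is a placeholder), e.g. `docker build -t registry.example.com/my-project/poetry-base:latest . && docker push registry.example.com/my-project/poetry-base:latest`.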
Yes, it works. But if, like me, you use this image to run your tests in GitlabCI, you may notice it is terribly slow, much slower than with our good old Shell Executor. In my case, we went from 2 minutes to more than 5 minutes to run our 4 test jobs. If you dig a bit, you’ll find out that the poetry install step is the one at fault. Let’s speed that up.
Optimizing Python Dependency Management with Poetry
We won’t exactly speed up poetry install itself; that would make no sense. The key point is that you don’t have to install your dependencies on every run. By nature, you don’t add or modify dependencies on every commit; your pyproject.toml only changes here and there. Thus, installing your dependencies directly inside your Docker image is a good idea.
The best practice is to use the poetry.lock file. It lists the exact versions of your dependencies, as is very well explained in the Poetry documentation. It can speed up the installation because the version resolution is already done.
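As a sketch, the dependency-installing layers could look like this. The base image tag and paths are assumptions; --no-root tells Poetry to install only the dependencies, not your own package, which is all a CI base image needs.

```dockerfile
# Start from the Poetry-equipped image built earlier (hypothetical tag)
FROM registry.example.com/my-project/poetry-base:latest

WORKDIR /app

# Copy only the dependency manifests so Docker's layer cache
# invalidates this step only when they actually change
COPY pyproject.toml poetry.lock ./

# Install the locked dependency versions; --no-root skips installing the project itself
RUN poetry install --no-root
```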
Either way, you can now use a pre-built image that contains all your dependencies in your GitlabCI. You can remove the poetry install line from your .gitlab-ci.yml. Yet, you now have to solve one last issue.
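A test job in .gitlab-ci.yml can then use the pre-built image directly, with no installation step. This is a sketch; the image name is a hypothetical placeholder, and the script assumes a pytest test suite:

```yaml
# .gitlab-ci.yml (sketch): the job runs inside the pre-built dependency image
test:
  image: registry.example.com/my-project/deps:latest
  script:
    # No `poetry install` needed: dependencies are already baked into the image
    - poetry run pytest
```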
What happens if I want to add another dependency?
You can’t use the current image you have in your registry; you have to build it again. It can be painful if you have to do it every time you change your pyproject.toml.
The best way is to build it only when you need it, thanks to an additional job in your CI.
Let’s call this job Deps. This job should be triggered whenever there is a change to either:
- your pyproject.toml / poetry.lock, or
- your Dockerfile
You can do this using rules and changes in GitlabCI.
In simple cases, you can set the image tag to something fixed, like latest. But if your team is large, you may face issues like teammates overriding your image. In that case, you can compute an image tag by hashing these three files with the md5sum command-line utility. Note that you can potentially exclude poetry.lock.
To sum up, you end up with a workflow where the Deps job rebuilds the dependency image only when one of these files changes, and every other job simply pulls the pre-built image.
TL;DR
- Dependencies can be both system-wide and Python-wide
- Docker ensures isolation and reproducibility - dependencies are the same from one environment to another
- Install your dependencies - with Poetry - in an image you can reuse in GitlabCI to save time
- Optimize your CI by installing dependencies only when you detect a change
Looking for Data experts? Feel free to reach out!