In this article, I show how to automate useful actions that should occur before a DVC pipeline execution (dvc pull the input files, check that the current git workspace is not dirty) or after the execution (git commit the dvc.lock file, dvc push the produced data) with a Makefile.
Disclaimer: this article assumes you are familiar with DVC (Data Version Control) and DVC Pipelines.
At Sicara, we love DVC (Data Version Control) and use it in most of our machine learning projects. Personally, I think DVC is a great tool because 1) it has a limited scope (tracking the data) and does it very well, and 2) it is very easy to integrate with other tools.
If you are interested in DVC integrated with other tools, you may want to read the article I wrote about DVC + Streamlit (another very cool tech for ML!).
Lean methodology is part of Sicara’s DNA. In practice, we have a tech guild dedicated to ML tooling that meets on a weekly basis to do Yokoten. Yokoten literally means “horizontal deployment” in Japanese. It consists of sharing a good practice learned from a project with other projects.
This article is a Yokoten I presented to our guild on September 21. It shares what we learned on my project to improve how DVC pipelines are executed.
A Pipeline to Compute Model Metrics
In my project, we have many DVC Pipelines to do different kinds of things. Some of them are executed just a few times (sometimes even once), for instance when we explore new ideas (iterations on the model, a new way to train it, etc.). On the other hand, some of them are re-executed on a regular basis, e.g., training pipelines and evaluation pipelines.
Let’s take an example: we have a metrics pipeline that looks like this:
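To give a rough idea of its shape, here is a minimal sketch of what a dvc.yaml for such a pipeline can contain; the stage name and paths are hypothetical, only the structure matters:

stages:
  compute_metrics:
    cmd: python src/evaluation/compute_metrics.py
    deps:
      - src/evaluation/compute_metrics.py
      - model
      - dataset/testset
    metrics:
      - metrics/metrics.json:
          cache: false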
Whenever the code (model logic, evaluation scripts) or the data (model weights, the test data) is modified, we need to compute the model metrics to ensure the model performs as expected, i.e., that its performance is better than before and that edge cases are covered.
To do so, we simply launch:
dvc repro --force metrics/dvc.yaml
Note the --force option: metrics are critical, hence we prefer to force the execution of all stages even if it takes more computation time.
What Would We Like to Automate?
Several actions are required before launching a pipeline with dvc repro:
- check that the current git workspace is not dirty: we do not want to execute the pipeline if there are changes not staged for commit;
- pull the input data: dvc repro will automatically restore the data of intermediate stages from the local cache, but it will not pull input data from remote storage. This may cause the pipeline execution to fail or not to be up-to-date.
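Outside of any automation, these two pre-flight actions boil down to a couple of commands; the paths are the ones from our project, so adapt them to yours:

git diff --quiet HEAD          # exits with a non-zero status if tracked files have unstaged changes
dvc pull -Rf model             # pull the pipeline inputs from the DVC remote
dvc pull -Rf dataset/testset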
Other actions should be done after the pipeline execution:
- save the results in a commit: this commit is special as it should normally contain only changes to the dvc.lock file. Automating the git commit allows standardizing the commit message so that it is easier to identify later on;
- make the results available to everyone in the team: a common mistake we used to make was to forget to dvc push the data. As a consequence, a dvc pull run afterwards fails and you have to ask the data scientist who launched the pipeline to manually dvc push the data - if they still have it!
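Done by hand, this wrap-up is again only a handful of commands; the metrics/ path matches the layout used in the Makefile below:

git add metrics/dvc.lock
git commit -m "[DVC] Update model metrics"
dvc push -R metrics
git push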
Sometimes, the pipeline execution takes some time - about two hours for our metrics pipeline. Thus, for convenience, it’d be great to have the following:
- send a message to the team in case of success/failure: we want to be informed of what is going on without looking at the pipeline logs all the time;
- make the pipeline execution asynchronous: in our case, we run the metrics on a remote GPU instance, so we’d like to launch the execution in “detached” mode and exit the SSH session right after launching the pipeline.
Let’s Do Automation!
To automate the aforementioned actions, we wrote a Makefile like this:
_compute_metrics:
	# Check workspace is not dirty
	git diff --quiet HEAD
	# Pull the input data
	dvc pull -Rf model
	dvc pull -Rf dataset/testset
	# Compute metrics
	dvc repro --force metrics/dvc.yaml
	# Commit the metrics
	git add metrics
	git commit -m "[DVC] Update model metrics"
	# Push the metrics
	dvc push -R metrics
	git push
	# Notify metrics computation is done!
	make send_success MESSAGE="Metrics done!"

compute_metrics:
	(nohup make _compute_metrics || make send_failure MESSAGE="Metrics failed :(") &
A few remarks on this Makefile:
- git diff --quiet HEAD makes the execution fail before the pipeline runs if some changes are not staged;
- send_success / send_failure are just curl commands that send a message to the team Slack channel (see a tutorial here);
- nohup launches the pipeline execution in the background. You can grab the pipeline logs with tail -f nohup.out.
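As an illustration, the two notification targets could be implemented as follows, assuming the Slack incoming-webhook URL is provided through a SLACK_WEBHOOK_URL variable; both the variable name and the exact payload are placeholders, not part of the original setup:

send_success:
	curl -X POST -H "Content-type: application/json" --data '{"text": ":white_check_mark: $(MESSAGE)"}' $(SLACK_WEBHOOK_URL)

send_failure:
	curl -X POST -H "Content-type: application/json" --data '{"text": ":x: $(MESSAGE)"}' $(SLACK_WEBHOOK_URL)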
Then, it becomes very easy to compute the metrics:
- SSH to a remote (GPU) instance
- Launch make compute_metrics and exit the remote instance
- Wait for the message on the Slack channel!
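Concretely, a typical session boils down to something like this (the instance name is just an example):

ssh my-gpu-instance
make compute_metrics
exit
# later, if needed, reconnect and follow the logs:
tail -f nohup.out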
These three simple steps automate the required actions before/after dvc repro. They ensure dvc push is not forgotten and that the sequence of actions (including, e.g., the commit message) is exactly the same for each pipeline execution, whoever in the team launches it.
Alternatives
DVC proposes three git hooks that you can install by running dvc install.
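If you have never used it, dvc install wires DVC into the repository’s git hooks; according to the DVC documentation, the mapping is:

dvc install
# installs in .git/hooks:
#   post-checkout -> dvc checkout  (restore tracked data after a git checkout)
#   pre-commit    -> dvc status    (report data out of sync before each commit)
#   pre-push      -> dvc push      (upload tracked data before each git push)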
The Makefile I proposed covers somewhat similar needs:
- post-checkout hook: the dvc pull commands just before dvc repro ensure the data is up-to-date;
- pre-commit hook: git add / git commit immediately follow dvc repro, hence dvc status becomes pointless;
- pre-push hook: the dvc push command comes just before the git push command.
I think both approaches (git hooks or Makefile) may be relevant depending on your use case. In my project, git hooks are a bit too long to execute because we have a lot of data and many pipelines tracked by DVC, which makes every commit painful. One advantage of the Makefile approach is that dvc pull/push are run only when necessary, i.e., before/after dvc repro.
Conclusion
I hope this article was useful and gave you food for thought! Do not hesitate to leave comments; I am convinced there is a lot to improve!
If you want to know more, don't hesitate to contact us!