In this article, I show how to automate useful actions that should occur before a DVC pipeline execution (dvc pull the input files, check that the current git workspace is not dirty) or after the execution (git commit the dvc.lock file, dvc push the produced data) with a Makefile.
Disclaimer: this article assumes you are familiar with DVC (Data Version Control) and DVC Pipelines.
At Sicara, we love DVC (Data Version Control) and use it in most of our machine learning projects. Personally, I think DVC is a great tool because 1) it has a limited scope (tracking the data) and does it very well, and 2) it is very easy to integrate with other tools.
If you are interested in DVC integrated with other tools, you may want to read the article I wrote about DVC + Streamlit (another very cool tech for ML!).
Lean methodology is part of Sicara’s DNA. In practice, we have a tech guild dedicated to ML tooling that meets on a weekly basis to do Yokoten. Yokoten literally means “horizontal deployment” in Japanese. It consists of sharing a good practice learned from a project with other projects.
This article is a Yokoten I presented to our guild on September 21. It shares what we learned on my project to improve how DVC pipelines are executed.
A Pipeline to Compute Model Metrics
In my project, we have many DVC Pipelines to do different kinds of things. Some of them are executed just a few times (sometimes even once), for instance when we explore new ideas (iterations on the model, a new way to train it, etc.). On the other hand, some of them are re-executed on a regular basis, e.g., training pipelines and evaluation pipelines.
Let’s take an example: we have a metrics pipeline that looks like this:
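To give a rough idea of its shape, here is a minimal sketch of what a dvc.yaml for such a pipeline can contain; the stage name and paths are hypothetical, only the structure matters:

stages:
  compute_metrics:
    cmd: python src/evaluation/compute_metrics.py
    deps:
      - src/evaluation/compute_metrics.py
      - model
      - dataset/testset
    metrics:
      - metrics/metrics.json:
          cache: false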
Whenever the code (model logic, evaluation scripts) or the data (model weights, the test data) is modified, we need to compute the model metrics to ensure the model performs as expected, i.e., that its performance is better than before and that edge cases are covered.
To do so, we simply launch:
dvc repro --force metrics/dvc.yaml
Note the --force option: metrics are critical, hence we prefer to force the execution of all stages even if it takes more computation time.
What Would We Like to Automate?
Several actions are required before launching a pipeline with dvc repro:
- check that the current git workspace is not dirty: we do not want to execute the pipeline if there are changes not staged for commit;
- pull the input data: dvc repro will automatically restore the data of intermediate stages from the local cache, but it will not pull input data from remote storage. This may cause the pipeline execution to fail or not to be up-to-date.
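Outside of any automation, these two pre-flight actions boil down to a couple of commands; the paths are the ones from our project, so adapt them to yours:

git diff --quiet HEAD          # exits with a non-zero status if tracked files have unstaged changes
dvc pull -Rf model             # pull the pipeline inputs from the DVC remote
dvc pull -Rf dataset/testset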
Other actions should be done after the pipeline execution:
- save the results in a commit: this commit is special as it should normally contain only changes to the dvc.lock file. Automating the git commit allows standardizing the commit message so that it is easier to identify later on;
- make the results available to everyone in the team: a common mistake we used to make was to forget to dvc push the data. As a consequence, a dvc pull run afterwards fails and you have to ask the data scientist who launched the pipeline to manually dvc push the data - if they still have it!
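Done by hand, this wrap-up is again only a handful of commands; the metrics/ path matches the layout used in the Makefile below:

git add metrics/dvc.lock
git commit -m "[DVC] Update model metrics"
dvc push -R metrics
git push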
Sometimes, the pipeline execution takes some time - about two hours for our metrics pipeline. Thus, for convenience, it’d be great to have the following:
- send a message to the team in case of success/failure: we want to be informed of what is going on without looking at the pipeline logs all the time;
- make the pipeline execution asynchronous: in our case, we run the metrics on a remote GPU instance, so we’d like to launch the execution in “detached” mode and exit the SSH session right after launching the pipeline.
Let’s Do Automation!
To automate the aforementioned actions, we wrote a Makefile like this:
_compute_metrics:
	# Check workspace is not dirty
	git diff --quiet HEAD
	# Pull the input data
	dvc pull -Rf model
	dvc pull -Rf dataset/testset
	# Compute metrics
	dvc repro --force metrics/dvc.yaml
	# Commit the metrics
	git add metrics
	git commit -m "[DVC] Update model metrics"
	# Push the metrics
	dvc push -R metrics
	git push
	# Notify metrics computation is done!
	make send_success MESSAGE="Metrics done!"

compute_metrics:
	(nohup make _compute_metrics || make send_failure MESSAGE="Metrics failed :(") &
A few remarks on this Makefile:
- git diff --quiet HEAD makes the execution fail before the pipeline runs if some changes are not staged;
- send_success / send_failure are just curl commands that send a message to the team Slack channel (see a tutorial here);
- nohup launches the pipeline execution in the background. You can grab the pipeline logs with tail -f nohup.out.
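As an illustration, the two notification targets could be implemented as follows, assuming the Slack incoming-webhook URL is provided through a SLACK_WEBHOOK_URL variable; both the variable name and the exact payload are placeholders, not part of the original setup:

send_success:
	curl -X POST -H "Content-type: application/json" --data '{"text": ":white_check_mark: $(MESSAGE)"}' $(SLACK_WEBHOOK_URL)

send_failure:
	curl -X POST -H "Content-type: application/json" --data '{"text": ":x: $(MESSAGE)"}' $(SLACK_WEBHOOK_URL)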
Then, it becomes very easy to compute the metrics:
- SSH to a remote (GPU) instance
- Launch make compute_metrics and exit the remote instance
- Wait for the message on the Slack channel!
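Concretely, a typical session boils down to something like this (the instance name is just an example):

ssh my-gpu-instance
make compute_metrics
exit
# later, if needed, reconnect and follow the logs:
tail -f nohup.out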
These three simple steps automate the required actions before/after dvc repro. They ensure dvc push is not forgotten and that the sequence of actions (including, e.g., the commit message) is exactly the same for each pipeline execution, whoever in the team launches it.
Alternatives
DVC proposes three git hooks that you can install by running dvc install.
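If you have never used it, dvc install wires DVC into the repository’s git hooks; according to the DVC documentation, the mapping is:

dvc install
# installs in .git/hooks:
#   post-checkout -> dvc checkout  (restore tracked data after a git checkout)
#   pre-commit    -> dvc status    (report data out of sync before each commit)
#   pre-push      -> dvc push      (upload tracked data before each git push)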
The Makefile I proposed covers somewhat similar needs:
- post-checkout hook: the dvc pull commands just before dvc repro ensure the data is up-to-date;
- pre-commit hook: git add / git commit immediately follow dvc repro, hence dvc status becomes pointless;
- pre-push hook: the dvc push command comes just before the git push command.
I think both approaches (git hooks or Makefile) may be relevant depending on your use case. In my project, git hooks are a bit too long to execute because we have a lot of data and many pipelines tracked by DVC, which makes every commit painful. One advantage of the Makefile approach is that dvc pull/push are run only when necessary, i.e., before/after dvc repro.
Conclusion
I hope this article was useful and gave you food for thought! Do not hesitate to leave comments; I am convinced there is a lot to improve!
If you want to know more, don't hesitate to contact us!