In Machine Learning (ML) projects, versioning code is standard practice. Versioning data is far less common, which often leads to the frustration of being unable to retrieve a specific dataset or reproduce a high-performing model. Data Version Control (DVC) was created to close this gap. Launched in 2017 by Iterative, DVC is an open-source Python library designed to prevent such setbacks.
DVC versions data files alongside Git, a popular version control system. The data itself is stored in a remote storage solution of your choice, such as Google Cloud Storage or Amazon S3, while small metadata files that reference it are versioned through Git. Large data files are therefore handled efficiently without bloating the Git repository.
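The split between bulky data and lightweight metadata can be illustrated with a minimal Python sketch. This is not DVC's implementation or API: the `track` and `restore` helpers, the `.pointer.json` file name, the local `remote` directory standing in for cloud storage, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, storage_dir: Path) -> Path:
    """Copy data_file into hash-addressed storage and write a pointer file.

    Returns the small pointer file, which is the only thing Git would version.
    """
    checksum = file_sha256(data_file)
    # Store the payload under its hash; identical content is stored only once.
    stored = storage_dir / checksum[:2] / checksum[2:]
    stored.parent.mkdir(parents=True, exist_ok=True)
    if not stored.exists():
        shutil.copyfile(data_file, stored)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(
        json.dumps({"sha256": checksum, "path": data_file.name}, indent=2)
    )
    return pointer


def restore(pointer: Path, storage_dir: Path, target: Path) -> None:
    """Rebuild the data file from storage using only the hash in the pointer."""
    checksum = json.loads(pointer.read_text())["sha256"]
    shutil.copyfile(storage_dir / checksum[:2] / checksum[2:], target)


# Usage: a local folder plays the role of the remote storage (hypothetical setup).
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    dataset = workspace / "train.csv"
    dataset.write_text("id,label\n1,cat\n2,dog\n")
    pointer = track(dataset, workspace / "remote")
    dataset.unlink()  # simulate a fresh clone that only contains the pointer
    restore(pointer, workspace / "remote", dataset)
    print(dataset.read_text())
```

Because the pointer file is only a few lines long, committing it to Git is cheap, and checking out an older commit gives back the hash needed to fetch the matching version of the data.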
DVC also lets you create, run, and version data pipelines. This feature is crucial for tracking how datasets and models evolve: it shows which steps produced each result and makes it possible to reproduce them.
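One idea behind reproducible pipelines is to record a fingerprint of each stage's inputs and to rerun a stage only when that fingerprint changes. The sketch below illustrates the principle in plain Python. It is a simplification under stated assumptions, not DVC's actual mechanism: the `run_stage` helper and the `pipeline.lock.json` file are hypothetical names, and only dependencies (not outputs or parameters) are checked.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def fingerprint(paths: List[Path]) -> Dict[str, str]:
    """Map each dependency path to the SHA-256 of its current contents."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}


def run_stage(
    name: str, deps: List[Path], action: Callable[[], None], lock_file: Path
) -> bool:
    """Run `action` only if the stage's dependencies changed since the last run.

    Returns True when the stage was executed, False when it was skipped.
    """
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = fingerprint(deps)
    if lock.get(name) == current:
        return False  # inputs unchanged: the recorded result is still valid
    action()
    # Record the inputs that produced this result; this file would go into Git.
    lock[name] = current
    lock_file.write_text(json.dumps(lock, indent=2, sort_keys=True))
    return True


# Usage: a one-stage pipeline that upper-cases a raw text file.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    raw = workspace / "raw.txt"
    clean = workspace / "clean.txt"
    lock_path = workspace / "pipeline.lock.json"
    raw.write_text("hello")

    def clean_stage() -> None:
        clean.write_text(raw.read_text().upper())

    print(run_stage("clean", [raw], clean_stage, lock_path))  # first run executes
    print(run_stage("clean", [raw], clean_stage, lock_path))  # unchanged input is skipped
    raw.write_text("new data")
    print(run_stage("clean", [raw], clean_stage, lock_path))  # changed input reruns
```

Committing such a lock file together with the code ties every result to the exact inputs that produced it, which is what makes an earlier outcome traceable and repeatable.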
Other data versioning and pipeline systems exist, such as Pachyderm, but DVC distinguishes itself by being easy to set up and use. Likewise, tools like MLflow and Weights & Biases handle ML experiment tracking, but DVC integrates more seamlessly into the Git workflow, so code, data, and experiment iterations are tracked together. This integration makes project history easier to explore and avoids embedding tracking operations directly in the code, a common requirement with other platforms.
For those seeking to visually explore their experiments, several options complement DVC:
- Iterative Studio, a web app developed by DVC's creators, is the most comprehensive option, though pricing starts at $50 per user per month beyond two users.
- The DVC extension for Visual Studio Code is a free, albeit less extensive, alternative for collaboration within VS Code.
- A custom dashboard, created by integrating DVC with a visualization tool such as Streamlit, allows for a tailored exploration of project experiments, presenting precise information as needed.
Our Perspective
Adopting DVC is much like adopting Git: daunting at first, but soon indispensable. At Sicara, we recommend combining DVC with Streamlit for a flexible, cost-effective approach to experiment visualization, as demonstrated by our Sicarator tool.
Note, however, that while DVC excels at managing experimentation workflows, it is not designed for operational pipelines in production environments. For those, we recommend a more specialized tool like Airflow. DVC remains a powerful companion for data scientists during the experimentation and development stages.