At the start of a Machine Learning project, there are two main options for the technical stack:
1. Use an end-to-end ML platform, whose pre-tested, approved off-the-shelf components save time. However, this approach comes with the typical drawbacks of managed solutions: cost, "black box" functionality, limited customization, integration challenges with other tools, and potential vendor lock-in.
2. Assemble a bespoke stack from open-source tools and custom code. This option sidesteps the issues of managed solutions, but it requires an upfront investment both in making the technical decisions and in implementing them.
To streamline the latter option, we've developed Sicarator, a project generator. In just a few minutes, it creates a high-quality code base for a Machine Learning project built on recent open-source technologies. Initially developed for internal use in 2022, Sicarator was open-sourced a year later, after proving its effectiveness on around twenty projects.
The tool delivers on this promise by generating a project that adheres to identified best practices, such as:
• Continuous integration with multiple quality checks (e.g., unit tests, linting, typing)
• Data visualization using a Streamlit dashboard
• Data and experiment tracking and visualization through the combination of DVC and Streamlit (illustrated in the sketch below)
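To make the DVC + Streamlit combination concrete, here is a minimal, illustrative sketch of a dashboard page that displays metrics produced by a DVC pipeline. It is not the code Sicarator actually generates; the metrics/evaluation.json path and the metric names are assumptions made for the example.

```python
"""Minimal sketch: a Streamlit page visualizing metrics tracked with DVC."""
import json
from pathlib import Path

import streamlit as st

# Hypothetical path to a metrics file written by a DVC pipeline stage
# (e.g. produced by `dvc repro` and declared under `metrics` in dvc.yaml).
METRICS_FILE = Path("metrics/evaluation.json")

st.title("Experiment results")

if METRICS_FILE.exists():
    metrics = json.loads(METRICS_FILE.read_text())
    # Show each scalar metric, e.g. {"accuracy": 0.91, "f1": 0.88}.
    for name, value in metrics.items():
        st.metric(label=name, value=round(float(value), 3))
    # Raw view, handy for copy-pasting into reports.
    st.json(metrics)
else:
    st.warning(f"No metrics found at {METRICS_FILE}; run the DVC pipeline first.")
```

Such a page would typically be served with `streamlit run`, while `dvc repro` re-runs the pipeline and refreshes the metrics file it reads.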
The generated code comes with the documentation needed to use it easily. By adopting a code-centric approach, the tool gives data scientists and ML engineers maximum control. It strives to keep pace with evolving best practices in the ecosystem; recent updates include adopting Ruff as the code linter/formatter, replacing PyLint and Black.
However, it offers a less comprehensive solution than advanced platforms and requires additional setup work. For instance, automated launching of model-training instances is not yet integrated.
OUR PERSPECTIVE
This blip covers both the Sicarator tool and our convictions about the technical stack it sets up. We invite you to try it and to discuss the choices made and potential future features. We use Sicarator on every Python-based ML project, even those relying on end-to-end ML platforms, to benefit from the Python development best practices embedded in the generator. It delivers the most value, however, on projects that combine open-source technologies such as DVC, Streamlit, and FastAPI. We therefore recommend Sicarator for initializing Python-based ML projects to any code-proficient AI team looking to build open-source-oriented ML tooling.