Algorithm Decision Record

The array of algorithmic options available to data scientists is extensive, encompassing the selection of algorithms themselves, their implementation, and training strategies. Simultaneously, the constraints encountered between an initial Proof of Concept (PoC) and the deployment of an algorithm in production often differ. Factors such as data volume, data quality, and available hardware may vary significantly. Moreover, the landscape of available solutions evolves rapidly, with state-of-the-art methods undergoing radical changes within 6 to 12 months, as demonstrated by the recent advancements in Generative AI models.

In research and development (R&D) contexts, it is essential for data scientists to possess a clear understanding of the constraints associated with the developed product and the implications of each algorithmic choice, both initially and in subsequent iterations. Additionally, it is crucial to be able to communicate effectively with other project stakeholders regarding these decisions and their consequences.

To tackle these challenges, we drew inspiration from a standard tool in the development realm: Architecture Decision Records. The essential components of an Algorithm Decision Record include:

• Context and the problem being addressed

• Key decision criteria, encompassing typical aspects of Architecture Decision Records as well as algorithm-specific considerations such as the availability of reliable implementations and hardware constraints, and the characteristics of the data (e.g., volume, quality)

• Selection of options and recommendations, involving the identification of at least three alternatives, possibly leveraging aggregators like Papers With Code or HuggingFace

• Impacts of the model choice on the product
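To make the structure concrete, here is a minimal sketch of an ADR captured as a Python dataclass; the class, field names, and example content are ours for illustration, not a prescribed format:

```python
from dataclasses import dataclass, field

@dataclass
class AlgorithmDecisionRecord:
    """Minimal structure mirroring the four sections above (field names are ours)."""
    context: str                # problem being addressed
    criteria: list[str]         # decision criteria (data volume, hardware, ...)
    options: dict[str, str]     # at least three alternatives -> short assessment
    recommendation: str         # chosen option
    impacts: list[str] = field(default_factory=list)  # consequences for the product

    def validate(self) -> None:
        # Enforce the "at least three alternatives" guideline
        if len(self.options) < 3:
            raise ValueError("An ADR should compare at least three options")

# Invented example content:
adr = AlgorithmDecisionRecord(
    context="Classify invoice pages in incoming emails",
    criteria=["~10k labeled pages", "CPU-only inference", "maintained implementation"],
    options={
        "fine-tuned layout model": "best accuracy, needs GPU for training",
        "TF-IDF + gradient boosting": "fast, CPU-friendly baseline",
        "zero-shot LLM": "no training data needed, higher latency and cost",
    },
    recommendation="TF-IDF + gradient boosting",
    impacts=["CPU-only deployment", "re-train when document layouts drift"],
)
adr.validate()
```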



Algorithm Decision Records have become standard practice for our data science teams. An observed positive outcome is that, beyond their immediate impact on individual projects, they foster collaboration between teams and facilitate knowledge sharing. Similar to Architecture Decision Records, the use of Algorithm Decision Records is advised primarily for significant decisions where reversing course is not straightforward.


Business metrics

In machine learning projects, models are constructed to address specific business challenges. Data scientists typically rely on metrics derived from research to evaluate these models. However, these metrics may not always accurately reflect the tool's value within a particular business context. Consider a tool designed to automatically identify errors in invoices received via email. This tool first categorizes document pages to locate the invoice and purchase order before conducting checks to detect potential defects.

While metrics like overall accuracy for classification and F1-score for defect detection are commonly used to evaluate performance, in this business scenario, a single algorithmic error in an email may necessitate human intervention. It could be argued that intervening for two errors in the same email isn't significantly more impactful than intervening for just one error. Therefore, to gauge the business value of our tool effectively, we require a metric that treats any error in an email as an indication of poor performance. A potential final metric could be the percentage of error-free emails processed.

This "business" metric enables us to assess whether the tool streamlines tasks for accountants. Moreover, we can estimate the time saved by the tool based on the percentage of emails processed without human intervention, thereby quantifying real business benefits.

However, devising such a metric often involves simplifications. A business metric may thus contain biases that must be recognized and acknowledged whenever it is discussed. In our example, the tool's performance might seem disproportionately low compared to the individual performance of each model: if the defect detector achieves 80% accuracy, a single error on any page is enough to penalize the whole email in the overall metric. Hence, it is crucial to also maintain more specific metrics to monitor technical progress in the models, even when it has not yet translated into tangible business value.
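A toy sketch of the two metrics side by side, with invented per-page results grouped by email:

```python
# Each email is a list of booleans: was each page/check handled correctly?
# (Data is invented for illustration.)
emails = [
    [True, True, True],    # error-free email
    [True, False, True],   # one error -> needs human intervention
    [False, False, True],  # two errors -> still just one intervention
    [True, True],          # error-free
]

pages = [ok for email in emails for ok in email]
page_accuracy = sum(pages) / len(pages)                      # per-prediction metric
error_free_rate = sum(all(e) for e in emails) / len(emails)  # business metric

print(f"page accuracy:     {page_accuracy:.0%}")   # 73%
print(f"error-free emails: {error_free_rate:.0%}")  # 50%

# Why the business metric looks harsher: with 80% page accuracy and
# independent errors, a 5-page email is error-free only 0.8**5 ~= 33% of the time.
```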


A fundamental aspect of designing a machine learning project is collaborating with stakeholders to define this metric. This ensures that we can consistently track it throughout the project lifecycle and ascertain that our iterations contribute genuine value to the business.



GitHub Copilot

More than 90% of developers integrate AI into their development workflow, automating code completion tasks. One of the primary technologies facilitating this is GitHub Copilot. Launched in 2021 and powered by a language model derived from GPT-3, Copilot aims to serve as a virtual pair programmer. It significantly expedites developers' work by:

• Offering accurate syntax suggestions for function usage

• Generating tests

• Proposing refactoring recommendations

The value added by Copilot extends beyond time saved; it also enhances development comfort. The quality of its suggestions alleviates the tedium of these tasks, enabling developers to focus on essential aspects such as development logic.

However, the tool has its limitations and may occasionally suggest code that doesn't meet expectations. Data scientists must remain vigilant and consistently review the proposed content. Additionally, less experienced developers may spend less time comprehending the generated code, impeding their learning of best practices. In this context, the role of the Tech Leader becomes crucial as they ensure quality assurance and facilitate the technical team's training. It's not surprising that Copilot is less effective for emerging technologies with limited training data, such as Polars or LangChain in 2023.

Moreover, Copilot requires online operation; projects with stringent security requirements may not be able to utilize the tool, even with options to withhold generated content or training data from GitHub. In such cases, alternatives like Tabnine, which generates suggestions locally, or open-source models like StarCoder, can be considered.


All our developers have adopted Copilot and attest to its usefulness. The comfort it provides and the time it saves are akin to transitioning from a basic text editor to a fully-fledged IDE. The gains in productivity far outweigh the license cost. We deem the use of GitHub Copilot essential, especially for Python-based machine learning projects. However, we acknowledge the potential impact on junior developers' progression and remain open to evaluating any unintended consequences.


Short AI iterations

In traditional development projects, the focus is often on standard features with estimable complexity. Iterations can be swift, allowing for rapid feedback, and the value of each iteration lies in the features developed. In contrast, AI projects tend to involve exploratory topics. Understanding the data, processing it, selecting the appropriate model, training it, and evaluating its effectiveness are all part of the process to determine the project's direction. How can we maintain a pragmatic approach and provide regular visibility in this uncertain journey with an ambiguous outcome?

At Sicara, we've successfully experimented with segmenting these exploratory topics into two distinct categories:

• Proof of Concept (POC): This primarily consists of investigation tasks, which are not estimated in terms of complexity but rather in time budget. The objective may be to establish a technical strategy through an Algorithm Decision Record, explore data, analyze algorithm results, etc. Unlike standard development, where the created value is a feature delivered to end-users, the value created by an investigation is learning. The outcome of a "time-boxed" investigation is a decision: either to cease exploration, continue investigation, or proceed to implementation.

• Implementation: This comprises standard tasks with estimated complexity, allowing for the construction of a solution based on the insights gained from the investigation.

However, it's crucial to avoid extremes: a chaotic or vague POC on one hand, or a bureaucratic approach that documents every decision, even minor ones, on the other.


We recognize that enforcing shorter iterations and breaking down tasks can be challenging, especially in AI projects. Nonetheless, we believe the effort is worthwhile, as it ultimately accelerates the team's learning pace and enhances alignment with business objectives. We encourage giving it a try and are open to further discussions on this topic.


Test-Driven Machine Learning

The conventional approach in machine learning involves evaluating the algorithm's performance based on a performance metric measured on a test dataset. However, this method overlooks the algorithm's specific performance evolution on sub-problems: an enhancement in the metric can conceal a decline in performance on a critical subset of data.

To address this need for granularity, we adopt Test-Driven Machine Learning (TDML), an AI development approach inspired by Test-Driven Development (TDD).

TDD is a software development practice where tests are written before the code. Developers initially create an automated test matching the functional specifications, then write the necessary code to pass this test, repeating the process as needed. This approach offers several advantages:

• Development is guided by a clear objective, enhancing efficiency.

• Short cycles enable gradual problem discovery.

• Tests serve as “living” documentation integrated into the codebase.
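In miniature, and before any ML specifics, the TDD loop looks like this (the function names are invented for illustration):

```python
# Step 1: the test is written first, from the functional specification.
def test_is_invoice_page():
    assert is_invoice_page("INVOICE #123 - Total due: 450 EUR") is True
    assert is_invoice_page("Weather report for tomorrow") is False

# Step 2: the minimal code needed to make the test pass is written afterwards.
def is_invoice_page(text: str) -> bool:
    return "invoice" in text.lower()

test_is_invoice_page()  # this call would fail before step 2 existed
```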

TDML adapts TDD principles to the realm of machine learning:

• The concept of functional specification corresponds to the model's performance on a specific subset of inputs, defined by business metrics comprehensible to stakeholders. A threshold determines whether the test passes or fails.

• The iterative process of adding new tests corresponds to the notion of slices: the test set is divided into subsets called slices, each constituting a test. New slices can be added by expanding the overall test set and/or introducing new divisions of the existing test set.

In AI, part of the functional logic is provided by a model obtained through a training process. TDML enables the control of output quality and model non-regression after each training run.

TDML also offers benefits specific to machine learning:

• Adding new slices from production data can detect data drift, i.e., a decline in model performance over time.

• It addresses the challenge of generalizing AI models to real-life cases: new slices covering these cases guide the work of data scientists.
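A minimal sketch of slice-based tests, assuming a classification test set with a per-slice accuracy threshold (the dataset, slices, and thresholds below are all invented):

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: int
    pred: int
    source: str  # e.g. "scanned" vs "digital" documents

def accuracy(examples):
    return sum(e.pred == e.label for e in examples) / len(examples)

# Toy test set with model predictions already attached.
test_set = [
    Example("invoice A", 1, 1, "digital"),
    Example("invoice B", 1, 1, "digital"),
    Example("invoice C", 0, 0, "digital"),
    Example("invoice D", 1, 0, "scanned"),
    Example("invoice E", 0, 0, "scanned"),
]

# Each slice is a (filter, threshold) pair; each one is a pass/fail test.
slices = {
    "all":     (lambda e: True,                  0.75),
    "scanned": (lambda e: e.source == "scanned", 0.50),
    "digital": (lambda e: e.source == "digital", 0.90),
}

results = {}
for name, (keep, threshold) in slices.items():
    subset = [e for e in test_set if keep(e)]
    results[name] = (accuracy(subset), accuracy(subset) >= threshold)

for name, (acc, passed) in results.items():
    print(f"{name}: {acc:.0%} -> {'PASS' if passed else 'FAIL'}")
```

A new slice (e.g. a batch of hard production cases) is added by appending examples and one more entry to `slices`; a drop below its threshold then fails the suite, like a regression test.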



While we view TDML as a healthy practice, we're currently piloting it due to limited available theoretical resources (TDML is primarily an internal practice at Sicara) and the absence of frameworks/tools for its implementation and integration into existing ML ecosystems. We reserve our recommendation for a knowledgeable audience and deem it particularly relevant during iteration phases on a validated algorithm, albeit less so during the Proof of Concept phase.



Sicarator

At the onset of a Machine Learning project, two primary options exist regarding the technical stack:

1. Use an end-to-end ML platform, where pre-tested, approved off-the-shelf components save time. However, this approach comes with the typical drawbacks of managed solutions: cost, "black box" functionalities, limited customization, integration challenges with other tools, and potential vendor lock-in.

2. Employ open-source tools and custom code to construct a bespoke stack. While this option sidesteps the issues of managed solutions, it necessitates an initial investment in both technical decisions and their implementation.

To streamline the latter option, we've developed Sicarator, a project generator. It enables the creation of a high-quality code base for a Machine Learning project incorporating recent open-source technologies in just a few minutes. Initially developed for internal use in 2022, Sicarator was open-sourced a year later after demonstrating its effectiveness across approximately twenty projects.

The tool aims to fulfill the promise of generating a project that adheres to identified best practices, such as:

• Continuous integration with multiple quality checks (e.g., unit tests, linting, typing)

• Data visualization using a Streamlit dashboard

• Data and experiment tracking and visualization via a combination of DVC and Streamlit

The generated code includes necessary documentation for ease of use. Adopting a code-centric approach, the tool empowers data scientists and ML engineers with maximum control. It strives to reflect evolving best practices in the ecosystem, with recent updates including the adoption of Ruff as the code linter/formatter, replacing PyLint and Black.

However, it may offer a less comprehensive solution compared to advanced platforms, requiring additional setup work. For instance, automated model training instance launching is not presently integrated.



This blip represents both the Sicarator tool and our beliefs regarding the technical stack it establishes. We welcome you to test it and engage in discussions regarding the choices made and potential future features. We employ Sicarator in any Python-based ML project, even those utilizing end-to-end ML platforms, to leverage the Python development best practices embedded in the generator. However, its maximum value is realized in projects aiming to combine open-source technologies like DVC, Streamlit, FastAPI, etc. Therefore, we recommend Sicarator for initializing Python-based ML projects to all AI teams proficient in code seeking to implement an open-source-oriented ML tooling. 


MTEB: Massive Text Embedding Benchmark

The semantic information of text is typically represented through fixed-size vectors known as embeddings. Various deep-learning models enable the calculation of these embeddings from raw text. However, selecting the most suitable model for a specific use case can be challenging due to the plethora of off-the-shelf options available.
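As a toy illustration, embeddings are compared with cosine similarity; the three-dimensional vectors below are invented, whereas a real embedding model would produce vectors with hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented embeddings; a real model computes these from the raw text.
emb_invoice = [0.9, 0.1, 0.0]  # "Please find attached the invoice"
emb_bill    = [0.8, 0.2, 0.1]  # "Here is the bill for your order"
emb_weather = [0.0, 0.2, 0.9]  # "It will rain tomorrow"

print(cosine_similarity(emb_invoice, emb_bill))     # high: related meanings
print(cosine_similarity(emb_invoice, emb_weather))  # low: unrelated
```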

The Massive Text Embedding Benchmark (MTEB) addresses this challenge by facilitating the comparison of different text embedding models. It achieves this by aggregating 129 diverse datasets (as of early 2024) to enhance generalization across eight distinct tasks.

However, MTEB has notable limitations.

1. It focuses solely on three languages: English, Chinese, and Polish.

2. There is a lack of cross-language benchmarking. Models like "text-embedding-ada-002" from OpenAI or those from Cohere offer advantages in efficiently handling multilingual datasets without necessitating prior translation, which often results in information loss.

3. The open-source nature of the datasets used in MTEB presents both advantages and disadvantages. While it promotes accessibility and community usage, it may also allow models with non-public training datasets (e.g., Cohere and OpenAI) to specifically train on these benchmarks to inflate their evaluation scores, potentially without reflecting genuine improvements in performance.


MTEB serves as a valuable reference for selecting a text embedding model when dealing with monolingual data in English, Chinese, or Polish. However, if project constraints permit, a more reliable albeit time-consuming approach involves constructing a labeled dataset representative of the task to evaluate different models.
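A sketch of such a task-specific evaluation, here recall@1 on a tiny labeled retrieval set; the "models" are stubs returning precomputed, invented vectors, where a real setup would call the candidate embedding models:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Each query is labeled with its single relevant document.
docs = {"d1": "refund policy", "d2": "delivery delay", "d3": "invoice error"}
labeled = [("how do I get my money back?", "d1"),
           ("my parcel is late", "d2")]

# Stub "models": precomputed vectors keyed by text.
vectors_a = {  # stronger model
    "refund policy": [1.0, 0.0], "delivery delay": [0.0, 1.0],
    "invoice error": [0.5, 0.5],
    "how do I get my money back?": [0.9, 0.1], "my parcel is late": [0.2, 0.8],
}
vectors_b = dict(vectors_a)  # weaker model: confuses the refund query
vectors_b["how do I get my money back?"] = [0.4, 0.5]

def recall_at_1(vectors):
    hits = 0
    for query, relevant in labeled:
        best = max(docs, key=lambda d: cosine(vectors[query], vectors[docs[d]]))
        hits += best == relevant
    return hits / len(labeled)

print("model A recall@1:", recall_at_1(vectors_a))
print("model B recall@1:", recall_at_1(vectors_b))
```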




LLM-as-a-judge

Evaluating Large Language Models (LLMs) presents unique challenges compared to other models, primarily due to the indefinite number of correct responses. This complexity prohibits simple automated metrics based on checking equality between the model's response and the expected label.

The concept of "LLM-as-a-judge" involves employing an LLM to evaluate the response of another LLM. This approach is justified in three scenarios:

1. When the judging LLM exhibits superior performance compared to the LLM being evaluated.

2. When the judging LLM employs advanced reasoning (e.g., chain-of-thought) that would be prohibitively costly, in time and/or resources, to run at inference time.

3. When the expected response closely aligns with a manual label, enabling the judging LLM to ascertain its correctness.

The primary advantage of this method is its capability to automatically evaluate a large number of examples.
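A minimal sketch of the pattern, where `call_llm` is a stub standing in for a real LLM API call and the prompt is purely illustrative:

```python
JUDGE_PROMPT = """You are a strict evaluator.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Does the candidate convey the same information as the reference?
Reply with exactly PASS or FAIL."""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM API call; a real judge model
    # would read the prompt and answer PASS or FAIL itself.
    candidate_part = prompt.split("Candidate answer:")[1]
    return "PASS" if "Paris" in candidate_part else "FAIL"

def judge(question: str, reference: str, candidate: str) -> bool:
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    return call_llm(prompt) == "PASS"

print(judge("What is the capital of France?", "Paris", "The capital is Paris."))
```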

However, this research area is still nascent and immature. Its main drawbacks include:

• Using AI to evaluate: the judging tool is susceptible to errors, raising questions about the need for an evaluation pipeline for the evaluation pipeline itself.

• Maintaining a new logic: any changes to the evaluation pipeline necessitate corresponding updates to the evaluation prompts used.

• Cost implications, especially when complex reasoning processes are involved.

The primary alternative to this method is manual evaluation, which offers greater reliability and a deeper understanding of the data and model behavior. However, manual evaluation is time-consuming and may restrict the number of iterations and examples evaluated.


LLM-as-a-judge presents an opportunity to expedite model development and broaden the scope of evaluated responses. However, it is not without its flaws. We recommend exercising caution when employing this approach and suggest complementing LLM evaluations with regular manual evaluations. This ensures that any drift is identified and facilitates a more comprehensive analysis of the model's behavior.



Metric-free iterations on LLM-based models

Metrics play a vital role in data science, yet the ease of achieving decent results with Large Language Models (LLMs) can lead to this crucial step being overlooked, often hindering progress towards production. Unlike other deep learning models, LLMs typically do not require fine-tuning for specific tasks, making it possible to build a Proof of Concept (POC) quickly without collecting an evaluation dataset.

However, transitioning to production regularly demands improved performance, necessitating specific adaptations. In such cases, collecting an evaluation dataset (preferably accompanied by a test dataset) becomes essential. This ensures that iterations are relevant, generalizable across the entire dataset, aids in prioritizing necessary improvements, and provides insights into the model's performance.

Nevertheless, evaluating LLMs is more complex than for other models due to the indefinite number of correct responses, rendering automated metrics based on response-label equality impractical. Manual evaluation emerges as the most reliable practice, albeit time-consuming and limiting in terms of the number of evaluations and data assessed. Alternatively, experimental methods involve using LLMs to evaluate themselves.


We advocate for rapid iteration without an evaluation dataset in two scenarios:

• When performance is non-critical (e.g., for personal-use tools)

• To de-risk significant product directions early

In other cases, evaluation is imperative. When automated evaluation through labels is unfeasible, we recommend manual evaluation for its reliability. It can be supplemented with LLM-as-a-judge to accelerate iterations.