Co-written by Sixte De Maupeou
Launching experiments on the Databricks UI is a painful process that requires manual management of notebooks and clusters.
In this article, we explain how to address this issue by launching experiments from your local environment, i.e. running code written locally on a cluster, within a job, using the Databricks CLI. To do so, we will create a workflow, automated in a bash script, that will:
- Move your local Python code to Databricks notebooks.
- Launch it within a job (creating a new cluster for each experiment or reusing the same cluster, depending on your needs).
- Install a branch of your git repository as a Python package on the notebook to access your source code within an experiment (NB: you can have a different branch on each notebook within the same job).
- Send you notifications on failure/success.
- Improve the tracking and management of your experiments.
Disclaimer:
- We believe launching experiments in notebooks is a bad practice, as it encourages code duplication and harms reproducibility. However, the way Databricks is designed forces us to do so. Hence, we recommend using Databricks notebooks as an interface with minimal code, only to make calls to a source repository.
- The workflow described here may seem a bit hacky, but it works well and it is the best way we have found to launch experiments on Databricks efficiently.
1/ Install the Databricks CLI
The Databricks CLI allows you to interact with Databricks from your local environment. Its installation and configuration are quick and easy. First, install the CLI with pip:
pip install databricks-cli
Then set up authentication for the CLI with:
databricks configure --token
A prompt will appear, asking for your Databricks host and a personal access token, which you can find or generate in the Databricks UI. More details can be found in the official documentation.
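For reference, this stores your credentials in a ~/.databrickscfg file that looks roughly like the following (the host and token values are placeholders):

```
[DEFAULT]
host = https://<your-workspace-url>
token = <your-personal-access-token>
```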
2/ Create a Databricks job
First, we need to define our job in a job.json file:
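Here is a minimal sketch of what this file could look like. The job name, the cluster spec (Spark version, node type, number of workers) and the e-mail address are assumptions to adapt to your workspace:

```json
{
  "name": "experiment-job",
  "new_cluster": {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1
  },
  "email_notifications": {
    "on_success": ["ml-experiments@yourcompany.slack.com"],
    "on_failure": ["ml-experiments@yourcompany.slack.com"]
  },
  "notebook_task": {
    "notebook_path": "/Shared/experiment_notebook"
  }
}
```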
Let's examine this JSON in a bit more detail:
- new_cluster: Cluster config for the job. Each experiment will be launched on its own cluster. If you want to use the same cluster for all experiments, create it beforehand and use:
"existing_cluster_id": "<cluster_id>"
instead.
- email_notifications can be used to receive notifications on job start, success, or failure. TIP: Slack and Teams channels can have an email address, and hence are convenient places to receive such notifications.
- notebook_task: Specify a path within the Databricks workspace (e.g. Shared/experiment_notebook). In each experiment, your local notebook will overwrite the file in this location to allow your job to access it.
Finally, you can create the job with the following command. It will output a job id that you will need in the next step.
databricks jobs create --json-file job.json
💡 You can also specify Python libraries for the cluster in the JSON, but that forces you to use the same ones for every run of the job. We recommend installing libraries directly in your notebooks instead, as detailed in the next section.
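For completeness, such cluster-wide libraries would be declared with a libraries entry in job.json, for example (the package name and version are placeholders):

```json
"libraries": [
  { "pypi": { "package": "scikit-learn==1.1.0" } }
]
```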
3/ Run the job (repeatedly)
Running a single bash script is all you need to start an experiment! A sketch of such a script is shown after this list. It will:
- Take a local notebook and upload it to Databricks, at the "notebook_path" of the job created in the previous step.
- Start the job (and hence run the notebook). A branch name can be specified, which is retrieved in your notebook to install the correct branch of your code. However, to use your source code within your job, you first need to make your git repository an installable Python package, using poetry or setuptools.
- To retrieve the branch name and install your repo, add a few lines of code at the beginning of your notebook. An example for a public repo using setup.py is shown after the script sketch below.
NB: You may need to insert a personal access token within the repo's path if you are using a private repository (e.g. on GitHub).
- Launch the bash script and voilà:
./run-experiment.sh
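Here is a minimal sketch of what run-experiment.sh could look like. The file names, job id and notebook path are assumptions to adapt to your setup; in particular, the target path must match the "notebook_path" of the job created in the previous step:

```bash
#!/bin/bash
set -euo pipefail

BRANCH="${1:-main}"                        # branch of your repo to install in the notebook
JOB_ID="<your-job-id>"                     # id returned by `databricks jobs create`
LOCAL_NOTEBOOK="experiment_notebook.py"    # notebook written locally, next to your source code
REMOTE_PATH="/Shared/experiment_notebook"  # must match "notebook_path" in job.json

# Overwrite the workspace notebook with the local one
databricks workspace import --language PYTHON --overwrite "$LOCAL_NOTEBOOK" "$REMOTE_PATH"

# Start the job, passing the branch name as a notebook parameter
databricks jobs run-now --job-id "$JOB_ID" --notebook-params "{\"branch\": \"$BRANCH\"}"
```

And here is a sketch of the cell to put at the top of the notebook, assuming a public, setup.py-based repository hosted on GitHub (the organisation and repository names are placeholders):

```python
# Retrieve the branch name passed by run-experiment.sh
# (falls back to "main" when the notebook is run interactively).
dbutils.widgets.text("branch", "main")
branch = dbutils.widgets.get("branch")

# Install that branch of the repository as a Python package on the driver.
import subprocess
import sys

subprocess.check_call([
    sys.executable, "-m", "pip", "install", "--upgrade",
    f"git+https://github.com/<your-org>/<your-repo>.git@{branch}",
])
```

If the package is also needed on the worker nodes (e.g. inside UDFs), Databricks' %pip magic can be used instead so the library is installed across the whole cluster.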
💡 To avoid re-uploading the notebook at each iteration, you can keep your notebook generic and use widgets to parametrize it.
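For instance, a hypothetical learning-rate parameter could be exposed as a widget and passed via --notebook-params, exactly like the branch name above:

```python
# Define a widget with a default value, then read it back
# (a value passed via --notebook-params takes precedence over the default).
dbutils.widgets.text("learning_rate", "0.001")
learning_rate = float(dbutils.widgets.get("learning_rate"))
```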
Conclusion
In this article, we have seen how to leverage the Databricks CLI to create a job, deploy a notebook to the Databricks workspace, install your code on it and finally run the notebook as part of the job.
Along the way, we have provided solutions to a number of problems faced by Databricks users:
- Launching experiments directly from your IDE makes your life easier: no more context switching, and fewer errors since notebooks are written in an environment with a linter and proper syntax highlighting.
- Teams or Slack alerting is extremely useful, especially when a long-running ML training job fails unexpectedly.
- Storing all the experiments as runs of a single job makes their history easier to track.
- Overwriting a single workspace notebook, while each job run keeps its own versioned copy, is tidier than having copies of notebooks all over the place.
- Finally, launching several experiments at the same time on different clusters with different branches saves you a lot of time.
If you want to know more, contact us!