You surely know that Deep Learning models need a tremendous amount of data to get good results. Object Detection models are no different.
To train a model like YOLOv5 to automatically detect the object of your choice, your favorite toy for example, you will need to take thousands of images of your toy in many different contexts. And for each image, you will need to create a text file containing the toy's position in the image.
This is obviously very time consuming.
This article proposes using Image Segmentation & Stable Diffusion to automatically generate an Object Detection dataset for any kind of class.
The pipeline to generate an object detection dataset is composed of four steps:
- Find a dataset of the same instance as our `toy cat` (dogs, for example)
- Use image segmentation to generate a mask of the dog
- Fine-tune the Stable Diffusion Inpainting Pipeline from the 🧨Diffusers library
- Run the Stable Diffusion Inpainting Pipeline using our dataset and the generated masks
Image Segmentation: Generate mask images
The Stable Diffusion Inpainting Pipeline takes as input a prompt, an image and a mask image. The pipeline generates new content from the prompt only where the mask image has white pixels.
PixelLib helps us do image segmentation in just a few lines of code. In this example we will use the PointRend model to detect our dog. This is the code for image segmentation.
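Below is a minimal sketch of what this looks like, assuming the PointRend weights have been downloaded as `pointrend_resnet50.pkl` and the dog picture is stored as `dog.jpg` (both file names are placeholders):

```python
from pixellib.torchbackend.instance import instanceSegmentation

# Load the PointRend instance segmentation model
ins = instanceSegmentation()
ins.load_model("pointrend_resnet50.pkl")

# segmentImage returns a (results, output) tuple
results, output = ins.segmentImage(
    "dog.jpg",                             # image of our dog
    show_bboxes=True,                      # draw bounding boxes on the output image
    output_image_name="dog_segmented.jpg"  # save the blended output image
)
```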
The `segmentImage` function returns a tuple:
- `results`: a dict containing information about 'boxes', 'class_ids', 'class_names', 'object_counts', 'scores', 'masks' and 'extracted_objects'
- `output`: the original image blended with the masks and the bounding boxes (if `show_bboxes` is set to `True`)
Create a mask image
We create the mask containing only black or white pixels. We will make the mask bigger than the original dog in order to give room for Stable Diffusion to inpaint our `toy cat`.
To do so, we will translate the mask 10 pixels to the left, right, top & bottom and add these translated masks to the original mask.
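Here is a minimal sketch of this step with NumPy, assuming `dog_mask` is the 2-D boolean mask of the dog taken from the PixelLib results (the exact indexing of `results["masks"]` may differ depending on the PixelLib version):

```python
import numpy as np
from PIL import Image

# 2-D boolean mask of the detected dog (the indexing below is an assumption,
# adapt it to the shape returned by your PixelLib version)
dog_mask = np.array(results["masks"])[0].astype(bool)

def enlarge_mask(mask: np.ndarray, offset: int = 10) -> np.ndarray:
    """Union of the mask with copies shifted by `offset` pixels in each direction."""
    enlarged = mask.copy()
    enlarged[:, :-offset] |= mask[:, offset:]   # shifted left
    enlarged[:, offset:]  |= mask[:, :-offset]  # shifted right
    enlarged[:-offset, :] |= mask[offset:, :]   # shifted up
    enlarged[offset:, :]  |= mask[:-offset, :]  # shifted down
    return enlarged

# Save as a black & white image: the white pixels are the ones Stable Diffusion will inpaint
mask_image = Image.fromarray(enlarge_mask(dog_mask).astype(np.uint8) * 255)
mask_image.save("dog_mask.png")
```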
And voilà! We now have our dog's original image and its corresponding mask.
Fine-tune the Stable Diffusion Inpainting Pipeline
Dreambooth is a technique to fine-tune Stable Diffusion. With very few photos we can teach new concepts to the model. We are going to use this technique to fine-tune the Inpainting Pipeline. The train_dreambooth_inpaint.py script shows how to fine-tune the Stable Diffusion model on your own dataset. Just a few images (e.g. 5) are needed to train the model.
Hardware Requirements for Fine-tuning
Using `gradient_checkpointing` and `mixed_precision`, it should be possible to fine-tune the model on a single 24GB GPU. For a higher `batch_size` and faster training, it’s better to use GPUs with more than 30 GB of GPU memory.
Installing the dependencies
Before running the scripts, make sure to install the library’s training dependencies:
```bash
pip install git+https://github.com/huggingface/diffusers.git
pip install -U -r requirements.txt
```
And initialize an 🤗Accelerate environment with:
```bash
accelerate config
```
You have to be a registered user on the Hugging Face Hub, and you’ll also need to use an access token for the code to work. For more information on access tokens, please refer to this section of the documentation.
Run the following command to authenticate your token:
```bash
huggingface-cli login
```
Fine-tuning Example
Hyperparameter tuning is key when running these computationally expensive trainings. Try different parameters depending on the machine you’re running the training on, but I recommend using the ones below.
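The sketch below shows what the launch command could look like. The paths, the instance prompt and the `sks` identifier are placeholders, and the flag names follow the diffusers `train_dreambooth_inpaint.py` example script, so check them against your version of the script:

```bash
export MODEL_NAME="runwayml/stable-diffusion-inpainting"
export INSTANCE_DIR="path_to_toy_cat_images"   # the ~5 photos of the toy cat
export OUTPUT_DIR="path_to_saved_model"

accelerate launch train_dreambooth_inpaint.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of a sks toy cat" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --gradient_checkpointing
```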
Run the Stable Diffusion Inpainting pipeline
Stable Diffusion Inpainting is a text-to-image diffusion model capable of generating photo-realistic images from any text input, by inpainting the regions of a picture selected by a mask.
The 🧨Diffusers library makes it really easy to run the pipeline.
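For example, loading the fine-tuned model (the output directory of the training above) and running it on the dog image with its mask could look like this; the paths and the prompt are assumptions:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load the fine-tuned inpainting model (the output_dir of the Dreambooth training)
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "path_to_saved_model",
    torch_dtype=torch.float16,
).to("cuda")

# Resize to 512x512, the resolution the model was trained at
image = Image.open("dog.jpg").convert("RGB").resize((512, 512))
mask_image = Image.open("dog_mask.png").convert("RGB").resize((512, 512))

# Only the white pixels of the mask are regenerated from the prompt
result = pipe(
    prompt="a photo of a sks toy cat",
    image=image,
    mask_image=mask_image,
).images[0]
result.save("toy_cat.png")
```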
Conclusion
To summarize, we have:
- Generated a mask image using image segmentation with `pixellib`, on a dog image.
- Fine-tuned the `runwayml/stable-diffusion-inpainting` model to make it learn a new `toy cat` class.
- Run the `StableDiffusionInpaintPipeline` with our fine-tuned model on our dog image with the generated mask.
Final results
After all these steps, we have generated a new image of a `toy cat` located at the same place as the dog, so the same bounding box can be used for both images.
We can now generate new images for all the images of our dataset!
Limitations
Stable Diffusion does not always output convincing results. Some cleaning will be necessary at the end of the dataset generation.
Note that this pipeline is very computationally expensive: the fine-tuning of Stable Diffusion needs a 24GB GPU machine, and at inference, even if a lot of improvements have been made, we still need a few GB of GPU memory to run the pipeline.
This way of creating datasets is interesting when the images needed for the dataset are hard (or impossible) to obtain. For example, if, like Pyronear (a French open-source project), you want to detect the start of forest fires, it is preferable to use this technique rather than burning trees, obviously. But keep in mind that the standard way of labeling datasets is less energy-consuming.