Do you have a lot of unlabelled data on which you would like to train your model? Let me show you how to make your image annotation pipeline faster and easier!
Introduction
There are cases where you have very little data, and you might then turn to few-shot learning. However, you might also have a lot of unlabelled data that you need in order to start your project and do extensive model training. Image annotation can be painful and costly if you do it from scratch.
However, you don’t have to label from scratch! Indeed, you can use the fact that images from the same class are often very similar to each other. I will show you how to use existing trained models and clustering to do your first image annotation for your project.
For this, we will have to define: the dataset to label, the Computer Vision model computing the image representations, and the clustering method. Finally, I will show you how to use Streamlit to combine clustering and annotation with easy user interaction.
Preparations
To follow this article, I recommend you pull the code and install dependencies:
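The exact commands depend on where the code is hosted, but a typical setup would look like this (the repository URL and folder are placeholders, and I assume a requirements.txt file is provided):

git clone <repository-url>
cd <repository-folder>
pip install -r requirements.txt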
Loading the dataset
I will use the tf_flowers dataset from Tensorflow Datasets. You can explore this dataset in further detail on its Know Your Data web page.
You can load the tf_flowers dataset with the following command:
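A minimal version of the loading code could look like this (the variable names are mine):

import tensorflow_datasets as tfds

# Load the complete training split of tf_flowers, along with its metadata
dataset, info = tfds.load("tf_flowers", split="train", with_info=True)
print(info.features["label"].names)  # the 5 flower classes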
We load the complete training dataset. This is a small dataset of 3,670 items separated into 5 classes: dandelion, daisy, tulips, sunflowers, and roses.
Getting image representation as clustering data
Let us now define the data we will use for clustering. We cannot just use the raw image pixels, as they would be a weak representation of the image. We want something more intelligent, like an embedding of the image in a latent space.
Therefore, we will use the image features computed by a Computer Vision model. I wanted a model already trained to recognize plants, so I will use this Tensorflow Hub model: an InceptionV3 trained on the iNaturalist dataset. This gives us the image embedding part of a pre-trained model, which we can load as follows:
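Here is a sketch of the loading code; the exact TF Hub handle and version may differ from the ones used in the repository:

import tensorflow_hub as hub

# Image embedding module: InceptionV3 trained on iNaturalist (feature vector)
embedding_model = hub.KerasLayer(
    "https://tfhub.dev/google/inaturalist/inception_v3/feature_vector/5"
)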
However, the loaded model limits what kind of inputs are possible, as shown on the model Tensorflow Hub webpage:
The input images are expected to have color values in the range [0,1], following the common image input conventions. The expected size of the input images is height x width = 299 x 299 pixels by default, but other input sizes are possible (within limits).
This tells us that we have to preprocess our dataset, which you can do with the following code:
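A possible implementation of this preprocessing, reusing the dataset loaded above (the 299 x 299 size and [0,1] scaling come straight from the model documentation):

import tensorflow as tf

def preprocess(sample):
    # Resize to the expected 299 x 299 input size
    image = tf.image.resize(sample["image"], (299, 299))
    # Scale pixel values from [0, 255] to [0, 1]
    return tf.cast(image, tf.float32) / 255.0

preprocessed_dataset = dataset.map(preprocess).batch(32)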
Finally, we can predict our image embeddings:
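With the model and the preprocessed dataset defined above, computing the embeddings could look like this:

import numpy as np

# Run the embedding model on each batch and stack the results
embeddings = np.concatenate(
    [embedding_model(batch).numpy() for batch in preprocessed_dataset]
)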
Now we’re ready to cluster our data!
Clustering the dataset
I will use Agglomerative clustering to divide our images to annotate into groups. This is a hierarchical clustering method, which means that it outputs a dendrogram: a tree showing how clusters are progressively merged, from individual samples at the bottom to a single all-encompassing cluster at the top.
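If you want to plot such a dendrogram yourself, SciPy provides this out of the box; here is a minimal sketch using the embeddings computed above:

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the hierarchy with Ward linkage, then draw the merge tree
plt.figure(figsize=(12, 5))
dendrogram(linkage(embeddings, method="ward"))
plt.show()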
In Agglomerative clustering, this dendrogram is computed as follows:
- First, each sample is considered as a separate cluster;
- Then, at each step, you merge the two clusters which give you the smallest merge criterion value;
- The algorithm stops when only one cluster is left, containing all samples.
We need to define which merge criterion to use. We could use the minimum or maximum distance between two clusters, but I chose the Ward criterion here, which is very close in spirit to how KMeans works:
- Temporarily merge the two clusters under consideration;
- Compute the mean point of this merged cluster, a.k.a. its centroid for the Euclidean distance;
- The Ward criterion value for merging these 2 clusters is the sum of the squared distances from each point of the merged cluster to this centroid (see the sketch below).
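To make this concrete, here is a rough NumPy sketch of the criterion as described above (illustrative only, not scikit-learn's actual implementation):

import numpy as np

def ward_criterion(cluster_a, cluster_b):
    # Temporarily merge the two clusters
    merged = np.concatenate([cluster_a, cluster_b])
    # Centroid: the mean point of the merged cluster
    centroid = merged.mean(axis=0)
    # Sum of squared Euclidean distances from each point to the centroid
    return np.sum((merged - centroid) ** 2)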
You can import Agglomerative clustering from sklearn:
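For example (the variable names are mine; Ward linkage is scikit-learn's default):

from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(n_clusters=20, linkage="ward")
cluster_ids = clustering.fit_predict(embeddings)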
By putting it all together in cluster_images.py, we can compute from 1 to 20 clusters with the following command:
python cluster_images.py --n-clusters-min 1 --n-clusters-max 20
The results are saved in CSV files, waiting for you to select the optimal number of clusters for your problem.
The automated part of image annotation is finally done! Hang on for a bit of human interaction, and the annotations are yours.
Selecting the optimal number of clusters to separate your data into coherent groups
Most of the difficulty of clustering is selecting the parameter values that create coherent groups while keeping the clusters as big as possible. This selection can be done automatically, for instance by using the Silhouette score, but the results can be a bit off from what you would expect. Indeed, depending on the problem, you might want to cluster your data at a finer or coarser resolution.
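For reference, an automatic selection based on the Silhouette score could be sketched like this (assuming the embeddings computed earlier):

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# A higher silhouette score means denser, better-separated clusters
for n_clusters in range(2, 21):
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings)
    print(n_clusters, silhouette_score(embeddings, labels))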
As we are working on images, it is really easy to decide whether a cluster is coherent with only a few glances. Therefore, the easy solution is to let a human select the number of clusters needed for the dataset. To handle human interaction and image display without much effort, we will use Streamlit, which makes these kinds of problems a piece of cake.
You can launch the cluster selection Streamlit with the following command:
streamlit run st_visualize_clusters.py
You should see the following page in your browser:
Now you just have to select a number of clusters with the slider and explore the clusters to see if the configuration fits your needs. In our case, we will select the full 20 clusters, because the model used was not trained specifically on this kind of data, so it has difficulty separating the flower classes. We can still get very coherent clusters:
Once your selection is done, you can click on the “Save selected clusters” button to save your selection in the selected_clusters.csv file.
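For illustration, the core of such a Streamlit page could look like the sketch below; the CSV naming and column names are assumptions on my part, not the actual contents of st_visualize_clusters.py:

import pandas as pd
import streamlit as st

# Pick a cluster count; the corresponding clustering was precomputed earlier
n_clusters = st.sidebar.slider("Number of clusters", 1, 20, value=20)
clusters = pd.read_csv(f"clusters_{n_clusters}.csv")  # hypothetical file naming

# Show each cluster's images so the user can judge its coherence
for cluster_id, cluster in clusters.groupby("cluster_id"):
    st.header(f"Cluster {cluster_id}")
    st.image(cluster["image_path"].tolist(), width=150)

if st.button("Save selected clusters"):
    clusters.to_csv("selected_clusters.csv", index=False)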
The final part: annotating your images
Finally, we can annotate our dataset. Thanks to our preparation, we only need to annotate the clusters, which will make the image annotation task much faster.
However, your clusters are almost certain to contain a few outliers, so you should mark them as well. The simple strategy used here is to filter out these images; if you want to annotate them too, you will have to do it image by image. This shows that you should select your clusters wisely, so as not to lose too many annotations.
The cluster annotation is also done in Streamlit, with the following command:
streamlit run st_label_clusters_and_outliers.py
You should see something like this in your browser:
On the left sidebar, you can select the cluster ID you want to look at. Eventually, you will have to go through all of them.
You can see all the images of a given cluster on a single page. Therefore, it is easy to determine which class the cluster corresponds to. You can set the cluster label at the top of the page.
In the example image, you can see a daisy among the dandelions. This is an outlier, which you can mark with the checkbox below the image. This gives the image the “KO” status, allowing you to treat this picture later.
Sometimes, there are just too many images of different classes in a cluster to determine which class it should correspond to. In such cases, just click on the “Delete cluster?” checkbox at the top of the page to ignore this cluster. This will give it the special label “DELETE”, allowing you to treat this cluster later as you want.
Finally, when you are done labelling a cluster, you can click on the button at the bottom to update the generated annotations with your labels.
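Again for illustration, a stripped-down version of this annotation page might look like the following sketch (widgets, column names, and statuses are my assumptions, not the exact code of st_label_clusters_and_outliers.py):

import pandas as pd
import streamlit as st

clusters = pd.read_csv("selected_clusters.csv")
cluster_id = st.sidebar.selectbox("Cluster ID", sorted(clusters["cluster_id"].unique()))
cluster = clusters[clusters["cluster_id"] == cluster_id].copy()

delete = st.checkbox("Delete cluster?")
label = st.text_input("Cluster label")

# Mark individual outliers with a "KO" status
statuses = []
for _, row in cluster.iterrows():
    st.image(row["image_path"], width=150)
    is_outlier = st.checkbox("Outlier?", key=row["image_path"])
    statuses.append("KO" if is_outlier else "OK")

if st.button("Update annotations"):
    cluster["label"] = "DELETE" if delete else label
    cluster["status"] = statuses
    # A real version would merge with previously saved annotations
    cluster.to_csv("labelled_clusters.csv", index=False)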
The final result is stored in the labelled_clusters.csv file.
Wrapping up
By combining clustering and Streamlit for human interactivity, you can do image annotation much more easily and quickly than if you just labelled each image separately.
Sure, some annotations are missing because of messy clusters and outliers, but the goal here is mostly to save time and quickly get a dataset for your needs.
In a later article I will show you how to improve this whole task with a few tricks, so please stay tuned for articles published on this blog!
If you want to know more about Streamlit, I suggest you look at this article about an object detection dataset.
Are you looking for Computer Vision experts? Don't hesitate to contact us!