Few-Shot Learning is the subfield of Machine Learning in which we assume that we only have access to a few labeled examples. The field is a few years old and now has its own research community, with its own evaluation processes and its own benchmarks. Sadly, we find that these benchmarks are unrealistic, which gives us a distorted idea of how well our models actually perform.
In this article, we are going to assume that you are familiar with Few-Shot Learning. If you are not, don’t worry! Just follow this tutorial and come back here when you’re done.
We are going to consider the two most widely used Few-Shot-Learning benchmarks: tieredImageNet and miniImageNet. Since 2018, these two benchmarks combined have been used more than a thousand times in peer-reviewed papers.
The standard evaluation process in Few-Shot Learning is to sample hundreds of small few-shot tasks from the test set, compute the accuracy of the model on each task, and report the mean and standard deviation of these accuracies. But we never look at the tasks individually, only at the aggregated results.
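To make this protocol concrete, here is a minimal sketch of it. The data structure `classes_to_images` (a mapping from class names to lists of images) and the callable `evaluate_on_task` (which returns a model's accuracy on one task) are assumptions of this sketch, not part of any benchmark's actual code.

```python
import random
import statistics


def sample_task(classes_to_images, n_way=5, n_shot=5, n_query=15):
    """Draw one few-shot task: n_way classes sampled uniformly at random,
    then n_shot support and n_query query images per class."""
    task = {}
    for label in random.sample(sorted(classes_to_images), n_way):
        images = random.sample(classes_to_images[label], n_shot + n_query)
        task[label] = {"support": images[:n_shot], "query": images[n_shot:]}
    return task


def benchmark(evaluate_on_task, classes_to_images, n_tasks=600):
    """The standard protocol: accuracy on each sampled task, then mean and std."""
    accuracies = [
        evaluate_on_task(sample_task(classes_to_images)) for _ in range(n_tasks)
    ]
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```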
So what is really in these tasks? Exactly what kind of problem did these hundreds of research papers try to solve? Does it reflect real-world problems?
Uniformly sampled tasks do not reflect real-world use cases for Few-Shot Learning
Few-Shot Learning benchmarks such as miniImageNet or tieredImageNet evaluate methods on hundreds of Few-Shot Classification tasks. These tasks are sampled uniformly at random from the set of all possible tasks.
This induces a huge bias towards tasks composed of classes that have nothing to do with one another: classes that you would probably never have to distinguish in any real use case.
If you want to generate more examples of absurd tasks, check out the companion dashboard!
Build better benchmarks with Semantic Task Sampling
The classes of tieredImageNet are part of the WordNet semantic graph. We can use this graph to define a semantic distance between classes, which measures how far apart the concepts defined by the classes are: a hotdog, for instance, is closer to a cheeseburger than it is to a house. We can then define the coarsity of a task as the mean squared semantic distance between the classes that compose it.
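As an illustration, here is a minimal sketch of this coarsity measure using NLTK's WordNet interface, with the shortest-path distance between synsets standing in for the semantic distance (the distance used in the paper may be defined differently). The two example tasks use a couple of hand-picked synsets as stand-ins for actual tieredImageNet classes.

```python
from itertools import combinations

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")


def coarsity(synsets):
    """Mean squared pairwise semantic distance between the classes of a task.

    Here the distance is WordNet's shortest-path distance between synsets.
    """
    pairs = list(combinations(synsets, 2))
    return sum(a.shortest_path_distance(b) ** 2 for a, b in pairs) / len(pairs)


# A fine-grained task (dog breeds) vs. a coarse one (unrelated concepts)
fine_task = [
    wn.synset(name) for name in ("golden_retriever.n.01", "beagle.n.01", "poodle.n.01")
]
coarse_task = [
    wn.synset(name) for name in ("school_bus.n.01", "cheeseburger.n.01", "house.n.01")
]

print(coarsity(fine_task))    # small value: the concepts are close in WordNet
print(coarsity(coarse_task))  # much larger value
```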
Thanks to this measure, the figure above confirms that with uniform task sampling (as is usually done in the literature), we practically never get a task composed of classes that are semantically close to each other.
But these tasks are not unreachable! We can actually force our task sampler to put semantically close classes together, producing low-coarsity tasks. That's the pink histogram. The pink histogram makes the impossible possible: it can reach coarsities that the blue histogram would never even dream of.
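One straightforward way to build such a sampler is rejection sampling on top of the `coarsity` function sketched above. This is only the simplest option, not the sampler used in the paper, which may work differently; the threshold value below is arbitrary.

```python
import random


def sample_low_coarsity_classes(class_synsets, n_way=5, max_coarsity=20.0, max_tries=10_000):
    """Rejection sampling: redraw n_way classes uniformly at random until the
    resulting task's coarsity falls below a threshold."""
    for _ in range(max_tries):
        candidate = random.sample(class_synsets, n_way)
        if coarsity(candidate) <= max_coarsity:
            return candidate
    raise RuntimeError(f"Found no task with coarsity below {max_coarsity}")
```

Rejection sampling gets slow for very low thresholds, precisely because uniform sampling almost never lands there; guiding the sampling itself (for example, drawing each new class close to the ones already drawn) scales better.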
OK, but what does it really mean for a few-shot learning task to have a low coarsity? I used the slider in the companion dashboard to generate such tasks.
It seems that when you choose a low coarsity, you get a task composed of classes that are semantically close to each other. For instance, with the lowest coarsity (8.65), you get the task of discriminating between 5 breeds of dogs.
On the other hand, when you increase the coarsity, the classes seem to get more distant from one another.
Another way to see these distances is directly on the WordNet graph. Below you can see the subgraph of WordNet spanned by the classes of the few-shot learning benchmark tieredImageNet. The pink dots are the classes. I highlighted some of them so you can see the distance between specific concepts. Again, if you want to play with the graph yourself, check out the dashboard!
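If you want to rebuild such a graph yourself, here is a sketch using NetworkX: it collects the hypernym ancestors of each class synset and connects them along WordNet's hypernym paths. The two synsets at the end are illustrative stand-ins; the actual tieredImageNet classes would be mapped to synsets through their WordNet IDs.

```python
import networkx as nx
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")


def wordnet_subgraph(class_synsets):
    """Subgraph of WordNet spanned by a set of classes: each class, its
    hypernym ancestors, and the edges along the hypernym paths."""
    graph = nx.DiGraph()
    for synset in class_synsets:
        for path in synset.hypernym_paths():  # each path goes from the root down to the synset
            graph.add_edges_from(zip(path[:-1], path[1:]))
    return graph


classes = [wn.synset(name) for name in ("golden_retriever.n.01", "school_bus.n.01")]
print(wordnet_subgraph(classes))  # prints the number of nodes and edges in the spanned subgraph
```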
Realistic tasks are harder for Few-Shot Learning models
As you might have guessed, the performance of Few-Shot Learning models highly depends on the coarsity of the task. This means that if you take a model that has been tested on the standard tieredImageNet or miniImageNet benchmarks (with very coarse tasks) and apply it to a real-life use case (most likely involving more fine-grained tasks), you will suffer a huge drop in performance.
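If you log per-task results yourself, a few lines are enough to expose this effect: bin each task by its coarsity (reusing the `coarsity` function sketched earlier) and average the accuracies inside each bin. The variable names here are assumptions; plug in whatever your evaluation loop produces.

```python
import statistics
from collections import defaultdict


def accuracy_by_coarsity(task_class_synsets, task_accuracies, bin_width=10.0):
    """Average accuracy per coarsity bin, to show how performance changes
    as tasks get more fine-grained (lower coarsity)."""
    bins = defaultdict(list)
    for synsets, accuracy in zip(task_class_synsets, task_accuracies):
        bins[coarsity(synsets) // bin_width].append(accuracy)
    return {
        bin_index * bin_width: statistics.mean(accs)
        for bin_index, accs in sorted(bins.items())
    }
```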
Going deeper into Few-Shot Learning...
This little article is meant to highlight that common Few-Shot Learning benchmarks are strongly biased toward tasks composed of classes that are very distant from each other.
At Sicara, we have seen a wide variety of industrial applications of Few-Shot Learning, but we have never encountered a scenario that is well represented by benchmarks with this type of bias. In fact, in our experience, most applications involve discriminating between classes that are semantically close to each other: plates from plates, tools from tools, carpets from carpets, parts of cars from parts of cars, etc.
There are other benchmarks dedicated to fine-grained classification. And it's OK that some benchmarks contain tasks that are very coarse-grained. But today, tieredImageNet and miniImageNet are widely used in the literature, and it's important to know what's in there, and how to restore the balance.
If you want to know more about the biases of classical Few-Shot Learning benchmarks and about semantic task sampling, check out our paper Few-Shot Image Classification Benchmarks are Too Far From Reality: Build Back Better with Semantic Task Sampling (presented at the Vision Datasets Understanding Workshop at CVPR 2022). Finally, if you’re interested in Few-Shot Learning in general and want to dive into the code, you can get started with the EasyFSL library.