The tremendous success of ImageNet-trained deep features on a wide range of transfer tasks raises the question:
What is it about the ImageNet dataset that makes the learnt features as good as they are?
This work provides an empirical investigation into various facets of this question,
such as the importance of the number of examples, the number of classes, the balance between images per class and number of classes, and the role of fine- and coarse-grained recognition.
We pre-train CNN features on various subsets of the ImageNet dataset and evaluate transfer performance on a variety of standard vision tasks.
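The subset construction described above can be sketched in a few lines. This is a minimal, hypothetical illustration assuming the dataset is indexed as a `{class_name: [image_paths]}` dictionary; it is not the authors' actual pipeline.

```python
import random

def subsample_imagenet(index, images_per_class=None, keep_classes=None, seed=0):
    """Build a pre-training subset by limiting images per class and/or
    restricting to a subset of classes.

    `index` is a hypothetical {class_name: [image_paths]} mapping standing
    in for the real ImageNet file list.
    """
    rng = random.Random(seed)  # fixed seed so subsets are reproducible
    subset = {}
    for cls, paths in index.items():
        if keep_classes is not None and cls not in keep_classes:
            continue  # drop classes outside the chosen class subset
        if images_per_class is not None and len(paths) > images_per_class:
            paths = rng.sample(paths, images_per_class)  # cap images per class
        subset[cls] = list(paths)
    return subset

# Toy index: 4 classes with 1000 images each (ImageNet-like scale per class).
toy = {f"class_{i}": [f"img_{i}_{j}.jpg" for j in range(1000)] for i in range(4)}
half = subsample_imagenet(toy, images_per_class=500)          # 500 images/class
fewer = subsample_imagenet(toy, keep_classes={"class_0", "class_1"})  # 2 classes
```

The resulting subsets would then be used to pre-train a CNN from scratch before transferring to the target task.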
Our overall findings suggest that most changes in the choice of pre-training data, long thought to be critical, do not significantly affect transfer performance.
How many pre-training ImageNet examples are sufficient?
Pre-training with only half of the ImageNet data (500 images per class instead of 1000)
results in only a small drop in transfer learning performance (1.5 mAP drop on PASCAL-DET).
This drop is much smaller than the drop on the ImageNet classification task itself.
How many pre-training ImageNet classes are sufficient?
Pre-training with an order of magnitude fewer classes (127 classes instead of 1000)
results in only a small drop in transfer learning performance (drop of 2.8 mAP on PASCAL-DET).
Quite interestingly, we also found that for some transfer tasks,
pre-training with fewer classes leads to better performance.
Given the same budget of pre-training images, should we have more classes or more images per class?
Training with fewer classes but more images per class performs slightly better than training with more classes but fewer images per class.
Is more data always helpful?
We found that training using 771 ImageNet classes that exclude
all PASCAL VOC classes achieves nearly the same performance on PASCAL-DET as training on complete ImageNet.
Further experiments confirm that blindly adding more training data does not always lead to better performance and can sometimes hurt performance.
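The class-exclusion experiment amounts to filtering the pre-training index. A minimal sketch, assuming the same toy `{class_name: [image_paths]}` index as a stand-in; the actual experiment identifies the 771 retained classes by matching PASCAL VOC categories to ImageNet synsets through the WordNet hierarchy.

```python
def exclude_classes(index, excluded):
    """Keep only classes not in the excluded set (a simplification of
    matching transfer-task categories via the WordNet hierarchy)."""
    return {cls: paths for cls, paths in index.items() if cls not in excluded}

# Hypothetical example: drop PASCAL-VOC-like categories before pre-training.
toy = {"dog": ["d1.jpg"], "cat": ["c1.jpg"], "mushroom": ["m1.jpg"]}
no_pascal = exclude_classes(toy, excluded={"dog", "cat"})
```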
Does training with coarse classes induce features relevant for fine-grained recognition, and vice versa?
We have observed that while attempting to distinguish between relatively coarse classes,
the network implicitly discovers features capable of distinguishing finer classes.
Similarly, when training on fine classes, the network learns features relevant for distinguishing between coarse semantic classes,
when the fine classes within the coarse classes are visually similar.