Yusuf Aytar, Carl Vondrick, Antonio Torralba
Massachusetts Institute of Technology
We capitalize on large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound, and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval or transferring classifiers between modalities. Moreover, although our network is only trained with image+text and image+sound pairs, it can transfer between text and sound as well, a transfer the network never observed during training. Visualizations of our representation reveal many hidden units which automatically emerge to detect concepts, independent of the modality.
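As a rough illustration of how such an aligned representation can be trained, here is a minimal PyTorch sketch of a margin ranking loss applied only to image+sound and image+text pairs; the encoder architecture, feature dimensions, and margin below are placeholder assumptions for illustration, not the paper's exact model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy modality-specific encoder projecting features into the shared space."""
    def __init__(self, in_dim, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, embed_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-length embeddings

def ranking_loss(anchor, positive, margin=0.2):
    """Pull matched pairs closer than mismatched pairs in the batch by a margin."""
    sim = anchor @ positive.t()             # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)           # similarity of each matched pair
    hinge = F.relu(margin + sim - pos)      # violation for each mismatched pair
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return hinge[off_diag].mean()

# Placeholder input features (e.g. precomputed image, spectrogram, and word features).
image_enc, sound_enc, text_enc = Encoder(4096), Encoder(1024), Encoder(300)
images, sounds, texts = torch.randn(32, 4096), torch.randn(32, 1024), torch.randn(32, 300)
img, snd, txt = image_enc(images), sound_enc(sounds), text_enc(texts)

# Only image+sound and image+text pairs are supervised; any sound/text
# alignment has to emerge through the shared image anchor.
loss = ranking_loss(img, snd) + ranking_loss(img, txt)
loss.backward()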
We visualize some units in the upper shared layers to analyze what they respond to. We found several units that automatically emerge and activate on certain objects, independent of the modality.
Each video shows images, text, and sounds that activate one particular hidden unit. Click a video to hear its sound!
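A plausible way to produce such a visualization, sketched below: take the activations of one hidden unit in a shared layer and keep, for each modality separately, the examples that activate it most strongly. The helper and array shapes are illustrative assumptions, not the paper's code.

import torch

def top_activating(activations, unit, k=6):
    """Indices of the k examples with the strongest response for `unit`."""
    return activations[:, unit].topk(k).indices

# Placeholder activations of the same shared layer on each modality
# (examples x hidden units; shapes are assumed).
img_act = torch.randn(5000, 4096)
snd_act = torch.randn(5000, 4096)
txt_act = torch.randn(5000, 4096)

unit = 42   # an arbitrary unit in the shared layer
top_images = top_activating(img_act, unit)
top_sounds = top_activating(snd_act, unit)
top_texts  = top_activating(txt_act, unit)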
We show examples of cross-modal retrieval between sounds, images, and text. At the top of each panel we show the input, and below it the top retrieved outputs across the other modalities.
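With every modality embedded in the same space, cross-modal retrieval reduces to nearest-neighbour search by cosine similarity. The sketch below assumes embeddings produced by encoders like the ones above; the data here is placeholder.

import torch
import torch.nn.functional as F

def retrieve(query_emb, candidate_embs, k=5):
    """Return the indices of the k candidates most similar to the query."""
    sims = F.normalize(candidate_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
    return sims.topk(k).indices

# Placeholder embeddings; in practice these come from the trained encoders.
sound_query = torch.randn(512)            # embedding of a query sound clip
image_bank  = torch.randn(10000, 512)     # embeddings of candidate images
text_bank   = torch.randn(10000, 512)     # embeddings of candidate sentences

top_image_ids = retrieve(sound_query, image_bank)
top_text_ids  = retrieve(sound_query, text_bank)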
Although our network was trained using only image/sound and image/text pairs, a strong enough alignment emerges that it can also transfer between sound and text.
Check out some other recent work in cross-modal learning: