accessibility

See, Hear, and Read: Deep Aligned Representations

Yusuf Aytar, Carl Vondrick, Antonio Torralba
Massachusetts Institute of Technology

We capitalize on large amounts of readily-available, synchronous data to learn a deep discriminative representations shared across three major natural modalities: vision, sound and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval or transferring classifiers between modalities. Moreover, although our network is only trained with image+text and image+sound pairs, it can transfer between text and sound as well, a transfer the network never observed during training. Visualizations of our representation reveal many hidden units which automatically emerge to detect concepts, independent of the modality.

Hidden Unit Visualizations

We visualize some units in the upper shared layers to analyze to what they respond. We found several units that automatically emerge to activate on some objects independent of the modality.

The videos show images, texts, and sounds that activate one particular hidden unit. Click them to hear their sounds!

Baby-like

Church-like

Dog-like

Engine-like

Music-like

Sports-like

Water-like

Wind-like

Example Retrievals

We show examples for cross-modal retrieval between sounds, images, and text. On the top of each panel, we show the input, and below it we show the top retrieved outputs across different modalities.

Although our network was trained using only image/sound and image/text pairs, a strong enough alignment emerges that it can transfer between sound/text.

Retrieval 1

Sound Input

Sound Output

Image Output

Text Output

a green soccer field
A soccer field.
the soccer field
big soccer field covered in grass

Retrieval 2

Sound Input

Sound Output

Image Output

Text Output

Water is crashing down
water is color green
Waterfall
this is a nature setting

Retrieval 3

Sound Input

Sound Output

Image Output

Text Output

people in a concert
he holds on to the bar
people at a concert event
two men at night

Retrieval 4

Sound Input

Sound Output

Image Output

Text Output

three walk signals are green
the blurred people in the audience
a big screen on stage
group posing at night time

Retrieval 5

Sound Input

Sound Output

Image Output

Text Output

a man in a suit stands near a table with computer equipment on it
a man in a suit standing next to a control board and computer
man in suit giving a speech
african american male speaking at a press conference

Retrieval 6

Text Input

the choppy water the man is riding

Sound Output

Image Output

Text Output

a person stands on water skis in the water
a couple of kayakers paddling through the water
a paddle boarder is navigating his paddle board through the surf
a man in the water with a sail and skis

Retrieval 7

Text Input

an airplane wing in the air with a lot of clouds

Sound Output

Image Output

Text Output

photo of airplane wing as the plane flies in the blue sky
looking out from a jet plane over the mountains
an airplane ascends through a cloud into a blue sky
a cloudy sky and an airplane wing with its tip up is flying by

Retrieval 8

Text Input

some baseball players a pitcher catcher and an umpire

Sound Output

Image Output

Text Output

three baseball players a batter a catcher and an umpire
a batter swinging at a ball with a catcher and umpire behind the plate
a pitcher is throwing the ball in a baseball game
a baseball player swinging at a pitch with catcher and umpire behind him

Retrieval 9

Text Input

brown bear in pool of water

Sound Output

Image Output

Text Output

a brown bear in the water
brown bear in the water
two bears are in the water
A brown bear in the water

Retrieval 10

Image Input

Sound Output

Image Output

Text Output

darkness of night sky
white and orange curtains behind the stage
Fireworks in the night sky
It is rather dark room.

Retrieval 11

Image Input

Sound Output

Image Output

Text Output

The dog belongs to the homeowner
the dog is on the beach
The dog is guarding the house
The dog is guarding the house

Retrieval 12

Image Input

Sound Output

Image Output

Text Output

A baby chewing on a tooth brush
A baby chewing on a tooth brush
A baby chewing on a tooth brush
a young baby chewing on a yellow hair brush

Related Work

Check out some other recent work in cross-modal learning: