NOTE: our audio results are best experienced using good speakers, preferably headphones.


This supplementary material shows results of our technique. In each example, we take a video of a material while either playing pre-recorded audio through a speaker or having a male speaker speak near the object. There are four audio source types: Chirp, Live_Speech, Mary, and Recorded_Speech. In Chirp, we play a five-second chirp sweeping from 100 Hz to 1000 Hz through a speaker. In Live_Speech, a male speaker speaks near the objects. In Mary, we play a MIDI recording of "Mary had a little lamb". In Recorded_Speech, we play an audio clip from the TIMIT database, which contains clips of English speakers of different sexes and dialects [1]. For Live_Speech and Recorded_Speech, we denoise using speech enhancement. For non-speech clips, we denoise using spectral subtraction. The exception is the Chirp section, where we do not denoise the recovered audio, as denoising can remove interesting frequency bands that may correspond to material properties.
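As a rough illustration of two steps mentioned above, the following Python sketch generates a five-second 100 Hz to 1000 Hz chirp and applies a basic magnitude spectral subtraction. It is a minimal sketch under stated assumptions, not the processing used to produce our results: the sweep is assumed to be linear, and the function names (linear_chirp, spectral_subtraction), the 44.1 kHz sample rate, the frame sizes, and the spectral floor are all hypothetical choices made for the example.

```python
import numpy as np

def linear_chirp(f0=100.0, f1=1000.0, duration=5.0, sample_rate=44100):
    """Sine sweep whose frequency rises linearly from f0 to f1 (assumed shape)."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    # Phase is the integral of 2*pi*f(t), with f(t) = f0 + (f1 - f0) * t / duration.
    phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / duration * t**2)
    return np.sin(phase)

def spectral_subtraction(signal, noise, frame=512, hop=256, floor=0.05):
    """Subtract the average noise magnitude spectrum from each frame of `signal`.

    Both inputs are assumed to be at least `frame` samples long. The first and
    last half frames of the output are tapered by the window, and any samples
    past the last full frame are left at zero.
    """
    # Periodic Hann window: copies shifted by frame/2 sum to exactly 1.
    n = np.arange(frame)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame)

    def windowed_frames(x):
        count = 1 + (len(x) - frame) // hop
        return np.stack([x[i * hop : i * hop + frame] * window for i in range(count)])

    # Noise estimate: mean magnitude spectrum of a noise-only recording.
    noise_mag = np.abs(np.fft.rfft(windowed_frames(noise), axis=1)).mean(axis=0)

    spec = np.fft.rfft(windowed_frames(signal), axis=1)
    mag = np.abs(spec)
    # Keep a small fraction of the original magnitude to limit "musical noise".
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    # Reuse the noisy phase; only the magnitude is modified.
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=frame, axis=1)

    # Overlap-add the processed frames back into one signal.
    out = np.zeros(len(signal))
    for i, fr in enumerate(clean_frames):
        out[i * hop : i * hop + frame] += fr
    return out

# Demo on synthetic data: a 440 Hz tone buried in white noise.
rng = np.random.default_rng(0)
rate = 8000
t = np.arange(rate) / rate
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.standard_normal(rate)
noise_only = 0.3 * rng.standard_normal(rate)  # separate noise-only recording
denoised = spectral_subtraction(noisy, noise_only)
```

Because the subtraction attenuates any band whose energy is close to the noise estimate, it can also suppress weak but genuine content, which is consistent with leaving the Chirp results undenoised.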

For the Live_Speech audio clips, the input is a sound file recorded by a nearby conventional microphone. For the Recorded_Speech examples, we also provide a comparison recording, laser_vibrometer.wav, captured with a laser Doppler vibrometer (a laser microphone) aimed at the object.

We provide input files at their original sampling rate, but for a fair comparison, we also provide a version of each input downsampled to the sampling rate of the video. All spectrograms shown in the supplementary material use a logarithmic scale in dB. The specific scale for each plot is shown next to it. We also provide examples of our synthetic data: Run 1, Run 2. The motions of the membrane have been exaggerated so that the viewer can see them clearly.
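The downsampling and dB spectrogram steps described above can be sketched in Python as follows. This is only an illustrative sketch: the FFT-based resampler, the function names (fft_downsample, spectrogram_db), the 2000 Hz target rate, the frame sizes, and the choice of normalizing each spectrogram to its own peak with an 80 dB range are assumptions made for the example, not the settings used for our plots.

```python
import numpy as np

def fft_downsample(x, rate_in, rate_out):
    """Band-limit and resample `x` by truncating its spectrum.

    Treats the signal as periodic, which is acceptable for a short sketch.
    """
    n_in = len(x)
    n_out = int(round(n_in * rate_out / rate_in))
    spectrum = np.fft.rfft(x)
    # Discard everything above the new Nyquist frequency (ideal low-pass).
    kept = spectrum[: n_out // 2 + 1]
    # Rescale so that amplitudes are preserved after the shorter inverse FFT.
    return np.fft.irfft(kept, n=n_out) * (n_out / n_in)

def spectrogram_db(x, frame=256, hop=64, floor_db=-80.0):
    """Power spectrogram in dB relative to its own peak, shape (freq, time)."""
    window = np.hanning(frame)
    count = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop : i * hop + frame] * window for i in range(count)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    ref = power.max()
    # Clamp very small values so the log stays finite and the range is bounded.
    clamped = np.maximum(power, ref * 10 ** (floor_db / 10))
    return (10 * np.log10(clamped / ref)).T

# Demo: a 100 Hz tone sampled at 8000 Hz, reduced to a hypothetical 2000 Hz video rate.
audio_rate, video_rate = 8000, 2000
tone = np.sin(2 * np.pi * 100 * np.arange(audio_rate) / audio_rate)
low = fft_downsample(tone, audio_rate, video_rate)
db = spectrogram_db(low)
```

Comparing at a common sampling rate matters because any content above half the video rate cannot appear in the recovered signal, so leaving it in the input would make the comparison look worse than it is.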

[1] Fisher, William M.; Doddington, George R. and Goudie-Marshall, Kathleen M. (1986). "The DARPA Speech Recognition Research Database: Specifications and Status". Proceedings of DARPA Workshop on Speech Recognition. pp. 93–99.