Tony Ezzat, Gadi Geiger, and Tomaso Poggio. MIT Center for Biological and Computational Learning. In memory of Christian Benoit.
We describe how to create a generative, videorealistic speech
animation module using machine learning techniques. A human
subject is first recorded with a video camera as he or she utters a
predetermined speech corpus. After processing the corpus
automatically, a visual speech module is learned from the data; it
is capable of synthesizing the human subject's mouth uttering
entirely novel utterances that were not recorded in the original
video. The synthesized utterance is re-composited onto a
background sequence which contains natural head and eye
movement. The final output is videorealistic in the sense that it
looks like a video camera recording of the subject. At run time,
the input to the system can be either real audio sequences or
synthetic audio produced by a text-to-speech system, as long as
they have been phonetically aligned.
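To make the pipeline concrete, the following toy sketch (not the authors' code; all names, the prototype-blending scheme, and the smoothing are illustrative stand-ins) mimics its run-time stages: a morphable mouth model rendered as a blend of prototype images, a trajectory synthesizer that smooths per-phoneme targets into per-frame parameters, and a compositing step that pastes the synthesized mouth onto a background frame.

```python
# Toy, illustrative sketch of the run-time pipeline; every name and choice here
# (ToyMMM, the EMA smoothing, the fixed paste location) is an assumption for
# illustration, not the system described in the paper.
import numpy as np

class ToyMMM:
    """Renders a mouth image as a convex combination of prototype images."""
    def __init__(self, prototypes: np.ndarray):
        self.prototypes = prototypes                      # shape (K, H, W)

    def render(self, weights) -> np.ndarray:
        w = np.clip(np.asarray(weights, dtype=float), 0, None)
        w = w / (w.sum() + 1e-8)                          # normalize blend weights
        return np.tensordot(w, self.prototypes, axes=1)   # weighted sum of prototypes

def synthesize_trajectory(phone_targets, frames_per_phone=5, lam=0.5):
    """Crude stand-in for trajectory synthesis: repeat each phoneme's target
    weights for a few frames, then smooth with an exponential moving average."""
    raw = np.repeat(np.asarray(phone_targets, dtype=float), frames_per_phone, axis=0)
    traj = np.empty_like(raw)
    traj[0] = raw[0]
    for t in range(1, len(raw)):
        traj[t] = lam * traj[t - 1] + (1 - lam) * raw[t]
    return traj

def composite(background: np.ndarray, mouth: np.ndarray, top=40, left=30):
    """Paste the synthesized mouth region onto a background frame."""
    out = background.copy()
    h, w = mouth.shape
    out[top:top + h, left:left + w] = mouth
    return out

# Usage: 3 mouth prototypes (16x16), a 4-phoneme utterance, blank 64x64 background.
mmm = ToyMMM(np.random.rand(3, 16, 16))
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]    # per-phoneme blend weights
trajectory = synthesize_trajectory(targets)
frames = [composite(np.zeros((64, 64)), mmm.render(w)) for w in trajectory]
print(len(frames), frames[0].shape)                        # 20 frames of 64x64
```

The data flow matches the description above: phonetically aligned input in, a per-frame trajectory of mouth parameters, rendered mouth images, and frames composited onto background footage containing natural head and eye movement.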
The two key contributions of this work are (1) a variant of the
multidimensional morphable model (MMM) that synthesizes new,
previously unseen mouth configurations from a small set of mouth
image prototypes, and (2) a regularization-based trajectory
synthesis technique, trained automatically from the recorded
corpus, that generates trajectories in MMM space corresponding to
any desired utterance.
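As a purely illustrative sketch of what a regularization-based trajectory synthesis can look like (a generic form under assumed symbols, not necessarily the paper's exact cost), one can minimize a trade-off between fidelity to per-phoneme targets and smoothness of the trajectory:

$$
E(y_1,\dots,y_T) \;=\; \sum_{t=1}^{T} \bigl(y_t - \mu_{\phi(t)}\bigr)^{\top} \Sigma_{\phi(t)}^{-1} \bigl(y_t - \mu_{\phi(t)}\bigr) \;+\; \lambda \sum_{t=1}^{T-1} \bigl\lVert y_{t+1} - y_t \bigr\rVert^{2}
$$

Here $y_t$ denotes the MMM parameter vector at frame $t$, $\phi(t)$ the phoneme active at that frame, $\mu$ and $\Sigma$ per-phoneme target statistics estimated from the corpus, and $\lambda$ a smoothness weight; minimizing $E$ yields trajectories that pass near the learned targets while varying smoothly between them.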
Psychophysics experiments indicate that human subjects are unable
to distinguish between real and synthetic versions of the same
utterance at a rate significantly above chance.