At the heart of our visual speech synthesis approach is the
multidimensional morphable model (MMM) representation, which is a
generative model of video capable of morphing between various lip
images to synthesize new, previously unseen lip configurations.
The basic underlying assumption of the MMM is that the complete set
of mouth images associated with human speech lies in a low-dimensional
space whose axes represent mouth appearance variation and
mouth shape variation. Mouth appearance is represented in
the MMM as a set of prototype images extracted from the recorded
corpus. Mouth shape is represented in the MMM as a set of optical flow
vectors computed automatically from the recorded corpus. In the work
presented here, 46 images are extracted and 46 optical flow
correspondences are computed. The low-dimensional MMM space is
parameterized by shape parameters alpha and appearance parameters
beta.
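As a concrete illustration, the sketch below shows one plausible way to hold such a model in memory. The array shapes, the choice of a dense per-pixel flow representation, and the class name MMM are assumptions made for illustration, not details of our implementation.
\begin{verbatim}
import numpy as np

class MMM:
    """Minimal container for the morphable model (illustrative layout).

    prototypes : (N, H, W) float array of prototype mouth images
    flows      : (N, H, W, 2) float array of optical flow fields, each
                 mapping the reference prototype to prototype i
    """
    def __init__(self, prototypes, flows):
        self.prototypes = np.asarray(prototypes, dtype=np.float64)
        self.flows = np.asarray(flows, dtype=np.float64)
        self.N = self.prototypes.shape[0]   # N = 46 in the corpus used here
\end{verbatim}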
The MMM may be viewed as a ``black box'' capable of performing two
tasks. First, given as input a set of parameters alpha, beta,
the MMM is capable of synthesizing an image of the subject's
face with that shape-appearance configuration. Synthesis is performed
by morphing the various prototype images to produce novel, previously
unseen mouth images which correspond to the input parameters alpha,
beta.
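A simplified sketch of this synthesis step is given below, using the same hypothetical array layout as above. It linearly combines the prototype flow fields with the shape parameters alpha, linearly blends the prototype images with the appearance parameters beta, and warps the blend along the combined flow. The full morphing procedure warps each prototype individually and handles flow re-orientation, which is omitted here for brevity.
\begin{verbatim}
import numpy as np
from scipy.ndimage import map_coordinates

def synthesize(alpha, beta, prototypes, flows):
    """Simplified MMM synthesis sketch (illustrative, not the full method).

    alpha      : (N,) shape parameters
    beta       : (N,) appearance parameters
    prototypes : (N, H, W) prototype mouth images
    flows      : (N, H, W, 2) flow fields (dy, dx) from the reference
                 prototype to each prototype image
    Returns an (H, W) synthesized mouth image.
    """
    # Shape: combine the prototype flows into a single synthesis flow.
    flow = np.tensordot(alpha, flows, axes=1)        # (H, W, 2)
    # Appearance: blend the prototype images.
    blend = np.tensordot(beta, prototypes, axes=1)   # (H, W)
    # Warp the blended image along the synthesis flow
    # (simple backward warping here).
    H, W = blend.shape
    yy, xx = np.mgrid[0:H, 0:W].astype(np.float64)
    coords = np.stack([yy - flow[..., 0], xx - flow[..., 1]])
    return map_coordinates(blend, coords, order=1, mode='nearest')
\end{verbatim}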
Conversely, the MMM can also perform analysis: given an input
lip image, the MMM computes shape and appearance parameters alpha,
beta that represent the position of that input image in MMM space.
In this manner, it is possible to project the entire recorded
corpus onto the constructed MMM, and produce a time series of
alpha, beta parameters that represent trajectories of
mouth motion in MMM space. We term this operation analyzing the
recorded corpus.
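The analysis direction can be sketched, under the same hypothetical array layout, as two linear least-squares fits: one for alpha against the prototype flow fields (given the optical flow from the reference prototype to the input image, computed by any standard flow routine), and one for beta against the prototype images. This is only a rough approximation of the estimation procedure, intended to make the roles of alpha and beta concrete.
\begin{verbatim}
import numpy as np

def analyze(novel_image, novel_flow, prototypes, flows):
    """Sketch of MMM analysis via two least-squares fits (illustrative).

    novel_image : (H, W) input mouth image
    novel_flow  : (H, W, 2) flow from the reference prototype to it
    prototypes  : (N, H, W) prototype images
    flows       : (N, H, W, 2) prototype flow fields
    Returns shape parameters alpha and appearance parameters beta.
    """
    N = prototypes.shape[0]
    # Shape: fit the novel flow as a combination of prototype flows.
    A = flows.reshape(N, -1).T                       # (H*W*2, N)
    alpha, *_ = np.linalg.lstsq(A, novel_flow.ravel(), rcond=None)
    # Appearance: fit the novel image as a blend of the prototypes
    # (the full method first warps each prototype into the novel shape).
    B = prototypes.reshape(N, -1).T                  # (H*W, N)
    beta, *_ = np.linalg.lstsq(B, novel_image.ravel(), rcond=None)
    return alpha, beta
\end{verbatim}
Applying such a step to every frame of the recorded corpus yields the time series of alpha, beta parameters described above.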