Multimodal Interaction With Gestures (& Speech)


The main objective of this work is to design a perceptual user interface that provides:

  • the full body pose of a user (location and orientation of the arms, head, and hands)

  • the detection of gestures (e.g. waving, pointing)

and to combine the tracking/gesture output with additional modalities (e.g. speech) for Human-Robot or Human-Computer Interaction, e.g. virtual-world navigation and interaction, video games, or interaction with an avatar.


In order to estimate body pose, we developed vision-based techniques for tracking body movements and detecting gestures in real-time. Gestures are interpreted in context and used in combination with speech recognition results (provided by an off-the-shelf speech recognition system) for Human-Computer interaction.



The following videos show the integration of our multimodal user interface with various applications (our system has also been integrated with other systems developed at NASA and Toyota, but videos of these are not yet available).


Multi-modal studio: Demo

An application similar to Bolt’s Put-That-There. It allows a user to create and manipulate various 3D geometric shapes in a virtual world using speech and gesture commands.
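To make the Put-That-There style of interaction concrete, here is a minimal sketch of how a deictic speech command can be grounded with a pointing gesture. The function names (`resolve_that`, `resolve_there`) and the selection rules are hypothetical illustrations, not the system's actual implementation: "that" picks the object nearest the pointing ray, and "there" intersects the ray with the ground plane.

```python
import numpy as np

def resolve_that(ray_origin, ray_dir, object_centers):
    """Resolve a deictic 'that': pick the object whose center lies
    closest to the pointing ray (hypothetical selection rule)."""
    d = ray_dir / np.linalg.norm(ray_dir)
    best, best_dist = None, np.inf
    for i, c in enumerate(object_centers):
        v = c - ray_origin
        t = max(v @ d, 0.0)               # project onto the ray (clamp behind user)
        dist = np.linalg.norm(v - t * d)  # perpendicular distance to the ray
        if dist < best_dist:
            best, best_dist = i, dist
    return best

def resolve_there(ray_origin, ray_dir, ground_z=0.0):
    """Resolve 'there': intersect the pointing ray with the plane z = ground_z."""
    d = ray_dir / np.linalg.norm(ray_dir)
    if abs(d[2]) < 1e-9:
        return None                       # ray parallel to the ground
    t = (ground_z - ray_origin[2]) / d[2]
    return ray_origin + t * d if t > 0 else None
```

A "put that there" utterance would then call `resolve_that` with the pointing ray captured when "that" was spoken, and `resolve_there` with the ray captured at "there".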


Virtual navigation: Demo


This application uses body pose estimation to control navigation in a virtual world/video game. The user moves through the virtual world by moving/tilting their body or by pointing in a desired direction.
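As an illustration of this kind of body-driven navigation, the sketch below maps the torso's displacement from a calibrated neutral stance to a planar navigation velocity. The mapping, the dead zone, and the gains are hypothetical values chosen for the example, not the parameters of the actual system.

```python
import numpy as np

def lean_to_velocity(torso_pos, neutral_pos, dead_zone=0.05, gain=2.0, v_max=1.0):
    """Map torso displacement from a calibrated neutral stance to a planar
    navigation velocity (hypothetical mapping; meters in, meters/second out).
    A dead zone suppresses jitter when the user stands still."""
    offset = np.asarray(torso_pos[:2]) - np.asarray(neutral_pos[:2])  # x: side, y: forward
    mag = np.linalg.norm(offset)
    if mag < dead_zone:
        return np.zeros(2)                # inside the dead zone: no motion
    scaled = gain * (mag - dead_zone)     # proportional response beyond the dead zone
    return offset / mag * min(scaled, v_max)
```

Pointing-based steering could be handled analogously, by converting the pointing direction estimated by the tracker into a heading.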




Tracking view

(Virtual) top view  


Control of virtual objects (e.g. desktop windows/applications/mouse cursor) and real objects (e.g. lamps/projectors/sound system) using speech and gesture commands. This system was an outcome of the OXYGEN project and resulted from the integration of multiple components developed at MIT CSAIL (the SLS Galaxy speech recognition system, a large-scale microphone array, and AIRE’s Metaglue multi-agent platform).



Tracking videos (no integration)

User sitting Demo

User standing/turning Demo






Our latest tracking approach fits a 3D (CAD) model of the tracked person to range data (obtained, for instance, from stereo cameras). ELMO, the tracking algorithm used to perform pose estimation, combines two techniques:


  • ICP (Iterative Closest Point), a local optimization technique developed for 3D surface registration


  • LSH (Locality-Sensitive Hashing), a global technique developed for fast (sub-linear) nearest-neighbor search


The resulting technique runs in real-time (15 Hz on a Pentium IV) and does not require any initialization. In addition, the use of stereo data makes it robust to varying illumination conditions.
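As a rough sketch of the local half of this combination, here is a minimal rigid-body ICP loop on point clouds. ELMO itself fits an articulated body model, and uses LSH in place of the exact nearest-neighbor search below, so this is only an illustration of the ICP idea under simplified (rigid, brute-force) assumptions, with hypothetical function names.

```python
import numpy as np

def icp_rigid(src, dst, n_iter=20):
    """Minimal rigid ICP: align point set `src` to `dst`.
    Brute-force exact nearest-neighbor search stands in for LSH here."""
    src = src.copy()
    dim = src.shape[1]
    R_tot, t_tot = np.eye(dim), np.zeros(dim)
    for _ in range(n_iter):
        # 1. correspondences: closest point in dst for each src point
        d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        match = dst[d2.argmin(axis=1)]
        # 2. best rigid transform via SVD of the cross-covariance (Kabsch)
        mu_s, mu_m = src.mean(0), match.mean(0)
        U, _, Vt = np.linalg.svd((src - mu_s).T @ (match - mu_m))
        S = np.eye(dim)
        S[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid reflections
        R = Vt.T @ S @ U.T
        t = mu_m - R @ mu_s
        # 3. apply the increment and accumulate the total transform
        src = src @ R.T + t
        R_tot, t_tot = R @ R_tot, R @ t_tot + t
    return R_tot, t_tot
```

In this simplified setting, ICP converges only from a nearby initial pose; in ELMO, the LSH-based global lookup is what supplies good pose hypotheses, which is why the combined system needs no manual initialization.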


Gesture recognition

Gesture recognition is performed on the output of the vision-based tracking system. We recently developed the HCRF (Hidden Conditional Random Field), a new graphical model. It has been shown to outperform standard graphical models such as Hidden Markov Models (HMMs) on various gesture recognition tasks.
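The following is a highly simplified sketch of the HCRF idea: each gesture class receives a score obtained by summing (in log-space, via the forward algorithm) over all hidden-state sequences, and class posteriors come from normalizing those scores. For brevity this sketch gives every class its own emission and transition weights, a departure from the actual HCRF formulation, where hidden states and potentials are shared and parameterized jointly; all names and shapes here are illustrative.

```python
import numpy as np

def logsumexp(a, axis=0):
    m = a.max(axis=axis)
    return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

def hcrf_class_posteriors(X, W, T):
    """Simplified HCRF scoring: p(y|x) is proportional to the sum over
    hidden-state sequences h of exp(sum_t W[y][h_t].x_t + sum_t T[y][h_{t-1},h_t]).
    Hidden states are marginalized out (unlike an HMM's Viterbi decode).
    X: (frames, D) features; W: (Y, H, D) emissions; T: (Y, H, H) transitions."""
    Y = W.shape[0]
    scores = np.empty(Y)
    for y in range(Y):
        alpha = W[y] @ X[0]                   # (H,) log-potentials at t = 0
        for t in range(1, len(X)):
            # alpha'[h'] = W[y][h'].x_t + logsumexp_h(alpha[h] + T[y][h, h'])
            alpha = W[y] @ X[t] + logsumexp(alpha[:, None] + T[y], axis=0)
        scores[y] = logsumexp(alpha, axis=0)  # marginalize the final hidden state
    scores -= scores.max()                    # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()
```

In a trained model, the weights would be learned from labeled gesture sequences by maximizing the conditional likelihood; here they are just arbitrary arrays of the right shape.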

More details on HCRF can be found here.



D. Demirdjian, L. Taycher, G. Shakhnarovich, K. Grauman and T. Darrell. Avoiding the `Streetlight Effect': Tracking by Exploring Likelihood Modes. In ICCV’05. [PDF].

S. Wang, A. Quattoni, L. Morency, D. Demirdjian, T. Darrell, Hidden Conditional Random Fields for Gesture Recognition, Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, 2006. [PDF]

L. Taycher, G. Shakhnarovich, D. Demirdjian, T. Darrell, Conditional Random People: Tracking Humans with CRFs and Grid Filters, Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2006. [PDF]

D. Demirdjian, T. Ko and T. Darrell. Untethered Gesture Acquisition and Recognition for Virtual World Manipulation. In Virtual Reality, 2005. [PDF]

D. Demirdjian. Combining Geometric- and View-Based Approaches for Articulated Pose Estimation. In Proceedings of ECCV’04, Prague, Czech Republic, May 2004. [PDF]


D. Demirdjian and T. Darrell. 3-D Articulated Pose Tracking for Untethered Deictic Reference. In Proceedings of ICMI’02, October 2002, Pittsburgh, Pennsylvania. [PDF]