Multimodal recognition

This work is a collaboration between Artur Arsenio and Paul Fitzpatrick.

Tools are often used with a repeated motion: consider hammers, saws, brushes, and files. This repetition can help a robot perceive these objects robustly. Our approach is for the robot to detect simple repeated events at frequencies relevant for human interaction, using both visual and acoustic perception. Combining rhythmic information across these two modalities is useful because they have complementary properties. Since sound waves disperse more readily than light, vision retains more spatial structure. For the same reason, vision is sensitive to occlusion and to the relative angle of the robot's sensors, while auditory perception is quite robust to these factors.
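As a rough illustration of detecting a repeated event within a band of frequencies, the sketch below estimates the dominant repetition rate of a one-dimensional signal (for example, the position of a tracked point, or the energy of the sound over time) using autocorrelation. This is only one plausible way to do it, not necessarily the method used in this work. The function name `dominant_frequency` and the 0.5 to 5 Hz band are assumptions chosen for illustration.

```python
import math

def dominant_frequency(signal, sample_rate, min_hz=0.5, max_hz=5.0):
    """Estimate the strongest repetition rate (Hz) within [min_hz, max_hz].

    Hypothetical helper: uses plain autocorrelation. The frequency band
    is an assumed stand-in for "frequencies relevant for human interaction".
    Returns None if the signal has no variation or no positive correlation.
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]          # remove the constant offset
    energy = sum(v * v for v in x)
    if energy == 0:
        return None                         # flat signal, nothing repeats

    # Convert the frequency band into a range of candidate lags (in samples).
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(n - 1, int(sample_rate / min_hz))

    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Normalized autocorrelation at this lag.
        score = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if score > best_score:
            best_lag, best_score = lag, score

    return None if best_lag is None else sample_rate / best_lag

# Example: a synthetic 2 Hz oscillation sampled at 50 Hz for 4 seconds.
samples = [math.sin(2 * math.pi * 2.0 * i / 50.0) for i in range(200)]
print(dominant_frequency(samples, 50.0))
```

The same routine could be applied separately to a visual signal and an acoustic signal; agreement between the two estimated rates would be one simple cue that they belong to the same event.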


The relationship between object motion and the sound generated varies in an object-specific way. A hammer makes sound when it changes direction after striking an object. A bell typically makes sound at either extreme of its motion. A toy truck makes sound while moving rapidly with its wheels spinning, and is quiet when changing direction.
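To make the distinction concrete, the following sketch labels whether sound tends to occur near turning points of the motion (as with the hammer and bell) or while the object is moving quickly (as with the toy truck). It assumes position and sound energy have already been sampled on a common time base. The function name `sound_motion_relation`, the `loud_fraction` threshold, and the 0.5 cut-off are all hypothetical choices for illustration, not values from this work.

```python
import math

def sound_motion_relation(positions, sound_energy, loud_fraction=0.5):
    """Label when sound occurs relative to motion (illustrative sketch).

    positions and sound_energy are equal-length lists on the same time base.
    Returns "at_turning_points", "while_moving", or None if undecidable.
    """
    # Central-difference speed for the interior samples.
    speeds = [abs(positions[i + 1] - positions[i - 1]) / 2.0
              for i in range(1, len(positions) - 1)]
    energy = sound_energy[1:-1]             # align with the speed samples

    peak_speed = max(speeds)
    peak_energy = max(energy)
    if peak_speed == 0 or peak_energy == 0:
        return None                         # no motion or no sound

    # Relative speed (0..1) at the moments when the sound is loud.
    loud = [s / peak_speed for s, e in zip(speeds, energy)
            if e >= loud_fraction * peak_energy]
    mean_relative_speed = sum(loud) / len(loud)

    # Assumed cut-off: slow when loud means sound near direction changes.
    return "at_turning_points" if mean_relative_speed < 0.5 else "while_moving"

# Synthetic back-and-forth motion with a period of 40 samples.
pos = [math.sin(2 * math.pi * i / 40.0) for i in range(200)]
bell_like = [p * p for p in pos]            # loud at the extremes of motion
truck_like = [math.cos(2 * math.pi * i / 40.0) ** 2 for i in range(200)]  # loud at high speed

print(sound_motion_relation(pos, bell_like))
print(sound_motion_relation(pos, truck_like))
```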


The spatial trajectory of a moving object can be recovered quite straightforwardly from visual analysis, but not from sound. However, the trajectory in itself is not very revealing about the nature of the object. We use the trajectory to extract visual and acoustic features -- patches of pixels, and sound frequency bands -- that are likely to be associated with the object. Both can be used for recognition. Sound features are easier to use since they are relatively insensitive to spatial parameters such as the relative position and pose of the object and the robot.


Related publication:
  • Artur Arsenio and Paul Fitzpatrick. Exploiting cross-modal rhythm for robot perception of objects. Accepted for publication at the 2nd International Conference on Computational Intelligence, Robotics, and Autonomous Systems, Singapore, December 15-18, 2003.