There are many methods for reliable recognition of frontal face images and, more recently, for recognition from profile gait sequences. However, in many practical settings the subjects to be identified do not conform to these pose and motion constraints. For a recognition method to work under such conditions, it needs to be view-independent. Furthermore, once a moving subject is observed by the camera(s), it is appealing to combine the recognition results from multiple modalities to obtain a more robust recognition scheme. We are currently developing techniques for view-independent, integrated face-and-gait recognition. The experiments use an implementation of the Image-Based Visual Hulls (IBVH) system installed at the MIT AI Lab.
An example of typical input to our system is shown in this AVI file. At times the subject faces none of the cameras, nor does he present his profile to any of them. With two dynamic cameras, one staying in front of the subject and viewing his face and another keeping a desired distance and viewing perpendicular to his motion direction, one could obtain data suitable for view-dependent recognition methods. We do not have such cameras. Instead, we have a few (four, in our case) widely spaced static cameras, providing information on the 3D structure of the observed scene and its appearance (texture).
Using IBVH, we can reconstruct the 3D visual hull (VH) of the person. Then, for any desired camera position, we can project this 3D hull onto the image plane and obtain a synthetic silhouette of the VH viewed from that viewpoint. Choosing virtual viewpoints whose viewing direction is perpendicular to the subject's motion vector at a given time yields synthetic profiles (from both left and right), which can be used for gait recognition. An example of such a profile sequence, corresponding to the input above, is available.
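The placement of the profile virtual cameras can be sketched as follows. This is a minimal illustration, not the project's code: the coordinate convention (z up, motion in the ground plane) and the camera standoff distance are assumptions.

```python
import numpy as np

def profile_viewpoints(centroid, motion_dir, distance=3.0):
    """Given the VH centroid and the subject's motion direction, return the
    positions of two virtual cameras (left and right profile) whose viewing
    directions are perpendicular to the motion. Units are hypothetical
    world-coordinate meters; each camera looks back toward the centroid."""
    d = np.asarray(motion_dir, dtype=float)
    d[2] = 0.0                        # assume motion lies in the ground plane
    d /= np.linalg.norm(d)
    # Horizontal perpendicular: rotate d by 90 degrees about the z-axis.
    perp = np.array([-d[1], d[0], 0.0])
    left = np.asarray(centroid, float) + distance * perp
    right = np.asarray(centroid, float) - distance * perp
    return left, right

# Example: subject at (0, 0, 1) walking along the x-axis.
l, r = profile_viewpoints([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], distance=3.0)
print(l, r)   # cameras offset along +/- y
```

Rendering the VH silhouette from each returned position then gives the left and right synthetic profile views.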
Clearly, face recognition requires more than a silhouette. The visual hull provides a 3D approximation of the face surface. As long as this surface is visible to at least one camera, a texture can be mapped onto the VH, and the result can be rendered from any virtual viewpoint using standard computer graphics techniques. Thus, we can obtain a synthetic frontal view of the face by placing the virtual camera in front of the face, looking towards it. To determine the face orientation, we use the novel fast face detection method described in . We can dramatically reduce the search space: given the VH, and assuming a generally upright posture, we need to search only over different views of the head (the top part of the VH). An AVI clip demonstrates such a search space and the face images detected as closest to frontal. Heuristic assumptions, such as a face orientation roughly parallel to the motion vector, provide further cues about the initial spatial angle in which the frontal face is sought.
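The motion-vector heuristic for seeding the frontal-face search can be sketched like this. The fan width and angular step below are illustrative values, not parameters from the system; the sketch only enumerates candidate viewing directions, while the actual scoring is done by the face detector on rendered head views.

```python
import numpy as np

def candidate_face_directions(motion_dir, yaw_range_deg=60.0, step_deg=10.0):
    """Enumerate candidate horizontal viewing directions for the frontal-face
    search, centered on the subject's motion vector (people tend to face the
    way they walk). Returns unit vectors in the ground plane."""
    d = np.asarray(motion_dir, dtype=float)
    d[2] = 0.0
    d /= np.linalg.norm(d)
    base = np.arctan2(d[1], d[0])     # heading angle of the motion vector
    dirs = []
    for off in np.arange(-yaw_range_deg, yaw_range_deg + 1e-9, step_deg):
        a = base + np.deg2rad(off)
        dirs.append(np.array([np.cos(a), np.sin(a), 0.0]))
    return dirs

views = candidate_face_directions([1.0, 0.0, 0.0])
print(len(views))  # 13 candidate directions spanning a 120-degree fan
```

Each candidate direction would then be used to render the top part of the VH and run the face detector, keeping the view scored as closest to frontal.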
To estimate the motion of the subject, we track the centroid of the VH, which is in turn an estimate of the body's center of mass. This is a coarse estimate, susceptible to errors from both measurement noise (imperfect silhouette extraction distorts the VH) and dynamic changes in body shape (such as swinging arms and legs). However, we found it reasonably precise for subjects who walk across the scene at normal walking speed.
A simple Kalman filter is applied to the observed centroids of the VH to produce an estimate of the motion direction in each frame. Once this vector is obtained, we can compute the viewpoints needed for synthetic face and gait data. Synthetic face images can be rendered and recognized on the fly, while the gait feature vector can be computed once a sufficient number of synthetic profiles has been collected.
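A minimal version of such a filter, assuming a constant-velocity model over the 2D ground-plane track of the centroid, might look like this. The process and measurement noise levels are placeholder values, not the ones used in the system.

```python
import numpy as np

def estimate_motion_directions(centroids, dt=1.0, q=1e-2, r=1e-1):
    """Constant-velocity Kalman filter over 2D centroid observations.
    State is [x, y, vx, vy]; only position is observed. Returns a unit
    motion-direction vector for each frame after the first."""
    F = np.eye(4); F[0, 2] = F[1, 3] = dt           # state transition
    H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1.0   # observe position only
    Q = q * np.eye(4); R = r * np.eye(2)
    x = np.array([centroids[0][0], centroids[0][1], 0.0, 0.0])
    P = np.eye(4)
    dirs = []
    for z in centroids[1:]:
        # Predict with the constant-velocity model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with the observed centroid.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.asarray(z, float) - H @ x)
        P = (np.eye(4) - K @ H) @ P
        v = x[2:]
        n = np.linalg.norm(v)
        dirs.append(v / n if n > 1e-9 else v)
    return dirs

# Example: a slightly noisy walk along the x-axis.
track = [(t * 0.5 + 0.01 * ((-1) ** t), 0.0) for t in range(20)]
d = estimate_motion_directions(track)
print(d[-1])   # close to [1, 0]
```

The per-frame direction vectors are what the viewpoint computation above consumes.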
Since the face and gait recognition results are independent given the constructed VH, one can potentially improve recognition performance by combining them. We are exploring a probabilistic framework for such integration. The ad-hoc rule currently in use assigns a confidence measure to each possible ID under each modality. These confidence levels are set according to the distance between the feature vector computed from the observed data and those of the individuals represented in the database. The final classification is obtained by averaging the confidence levels assigned by the face and gait classifiers and picking the ID with the maximal confidence.
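A sketch of this fusion rule is below. The text only says that confidence decreases with distance to the database entries, so the softmax-style distance-to-confidence conversion here is an assumption chosen for illustration.

```python
import numpy as np

def fuse(face_dist, gait_dist):
    """Ad-hoc fusion sketch: convert each modality's distances to the gallery
    into normalized confidences, average the two, and pick the ID with the
    highest combined confidence. Inputs are per-ID feature-space distances."""
    def confidence(dist):
        d = np.asarray(dist, dtype=float)
        w = np.exp(-d)                # smaller distance -> higher confidence
        return w / w.sum()            # normalize to sum to 1
    combined = 0.5 * (confidence(face_dist) + confidence(gait_dist))
    return int(np.argmax(combined)), combined

# Example with a hypothetical 3-person gallery (distances are made up):
best, conf = fuse(face_dist=[0.2, 1.5, 2.0], gait_dist=[0.9, 0.4, 2.1])
print(best)   # ID 0 wins on the averaged confidences
```

Averaging treats the two modalities as equally reliable; a probabilistic framework would instead weight them by their estimated error rates.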
Our experiments involved 12 individuals. The table below shows error rates obtained with leave-one-out cross-validation over 54 sequences:
|                        | With VH view-normalization | Without view-normalization |
|------------------------|----------------------------|----------------------------|
| Combined face and gait | 0.09                       | 0.36                       |