Contextual guidance of eye movements and attention in real-world scenes: The role of global features on object search

Antonio Torralba, Aude Oliva, Monica Castelhano, John Henderson

Psychological Review, Vol. 113, No. 4. (October 2006), pp. 766-786.

Abstract: Many experiments have shown that the human visual system makes extensive use of contextual information for facilitating object search in natural scenes. However, the question of how to formally model contextual influences is still open. Based on a Bayesian framework, we present an original approach of attentional guidance by global scene context. Two parallel pathways comprise the model; one pathway computes local features (saliency) and the other computes global (scene-centered) features. The Contextual Guidance model of attention combines bottom-up saliency, scene context and top-down mechanisms at an early stage of visual processing, and predicts the image regions likely to be fixated by human observers performing natural search tasks in real world scenes.

Downloads
paper.pdf
citation
Images.zip (Browse the images on LabelMe)
Fixation data.zip (Run showData.m)

Training datasets
The images used to train the context model are part of the LabelMe dataset:

1) people search (see all images; the page shows thousands of thumbnails)

2) painting search (see all images)

3) mug search (see all images)


These datasets are available via the LabelMe website.

Summary of results
The full model presented here incorporates scene priors that modulate the salient regions according to the expected location of the target given its scene context. In the people search task, the two factors (saliency and scene context) are combined, resulting in a saliency map modulated by the task. To evaluate the performance of the models, we compared the locations fixated by 8 participants with a thresholded map.
The next figure illustrates how the task modulates the salient regions. The same image was used in two tasks: painting search and mug search. In this example, the results show that the scene context predicts which regions will be fixated and how the task changes the fixations.
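
The evaluation summarized above (combine saliency with a scene prior, threshold the resulting map, and compare it with fixated locations) can be sketched roughly as follows. This is a minimal illustration, not the code released with the paper: the function names, the `gamma` exponent, the multiplicative combination, and the choice to threshold by a fixed fraction of the image area are all assumptions made for the example.

```python
import numpy as np

def combine_maps(saliency, context_prior, gamma=1.0):
    """Modulate a saliency map by a contextual prior over target locations.

    Assumption: the two maps are combined multiplicatively, with a
    hypothetical exponent `gamma` weighting the saliency term.
    """
    return (saliency ** gamma) * context_prior

def top_fraction_mask(score_map, fraction):
    """Boolean mask selecting the highest-scoring `fraction` of pixels."""
    flat = score_map.ravel()
    k = max(1, int(round(fraction * flat.size)))
    # Indices of the k largest values (order within the k is irrelevant).
    idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(score_map.shape)

def fixation_hit_rate(mask, fixations):
    """Fraction of fixations (row, col) that fall inside the mask."""
    rows, cols = np.asarray(fixations).T
    return float(mask[rows, cols].mean())

# Toy 4x4 example: uniform saliency, prior concentrated on row 1.
saliency = np.ones((4, 4))
prior = np.zeros((4, 4))
prior[1, :] = 1.0

combined = combine_maps(saliency, prior)
mask = top_fraction_mask(combined, fraction=0.25)   # 4 of 16 pixels
fixations = [(1, 0), (1, 3), (3, 2), (0, 0)]        # hypothetical data
rate = fixation_hit_rate(mask, fixations)
```

In this toy case the thresholded region is exactly the row favored by the prior, so the hit rate is the share of the hypothetical fixations that land on that row. With real data, the fixations would come from the fixation data archive listed under Downloads, in whatever format it provides.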
