Many computational models of visual attention, built from a wide variety of approaches, have been created to predict where people look in images. Each model is usually introduced by demonstrating its performance on new images, which makes immediate comparisons between models difficult. To alleviate this problem, we propose a benchmark data set containing 300 natural images with eye tracking data from 39 observers for comparing model performance. This is the largest data set with this many viewers per image. We calculate how well many models predict the ground truth fixations using three different metrics: the area under the receiver operating characteristic (ROC) curve, a similarity metric, and the Earth Mover's Distance. We post the results here and provide a way for people to submit new models for evaluation.
Paper
This benchmark is released in conjunction with the paper "A Benchmark of Computational Models of Saliency to Predict Human Fixations" by Tilke Judd, Frédo Durand and Antonio Torralba, available as a January 2012 MIT technical report.
@InProceedings{Judd_2012,
  author    = {Tilke Judd and Fr{\'e}do Durand and Antonio Torralba},
  title     = {A Benchmark of Computational Models of Saliency to Predict Human Fixations},
  booktitle = {MIT Technical Report},
  year      = {2012}
}
Images
300 benchmark images (the fixations from the 39 viewers per image are not public, so that no model can be trained using this data set).
A comparison of images and saliency maps for several models.
Model Performances
Model Name | Related content | Area under ROC* curve (higher is better) | Similarity* (higher is better) | Earth mover's distance* (lower is better) |
Humans** | code | 0.922 | 1 | 0 |
Boolean Map based Saliency (BMS) [Zhang and Sclaroff ICCV 2013] | website, paper | 0.8257 | 0.5364 | 3.2620 |
Ensembles of Deep Networks (eDN) [Vig et al 2014 CVPR] | website with paper and code | 0.8192 | 0.5123 | 3.0129 |
Judd Model [Judd et al. ICCV 2009] | website with paper and code | 0.811 | 0.506 | 3.13 |
CovSal [Erdem and Erdem JOV 2013] | paper, website | 0.8056 | 0.5018 | 3.1092 |
Tavakoli et al. 2011 | paper, website | 0.8033 | 0.4952 | 3.3488 |
Region Contrast [Cheng et al. 2011] | website with paper and code | 0.7922 | 0.4705 | 3.4180 |
Graph Based Visual Saliency (GBVS) | code | 0.801 | 0.472 | 3.574 |
Multi-Resolution AIM [Advani et al. 2013] | paper | 0.7719 | 0.4711 | 3.3635 |
CWS model | Unpublished work | 0.7896 | 0.4638 | 3.7448 |
MKL-based model [Kavak et al. 2013] | website | 0.7836 | 0.4465 | 3.3212 |
Center** | code | 0.783 | 0.451 | 3.719 |
Saliency for Image Manipulation | website, paper, code | 0.774 | 0.439 | 4.137 |
RARE2012 | website | 0.7719 | 0.4363 | 4.1019 |
LMF | website (unpublished work) | 0.7703 | 0.4360 | 4.8645 |
Random Center Surround Saliency (Narayan model) | code, paper | 0.753 | 0.42 | 3.465 |
Bruce and Tsotsos AIM | code (look for AIM.zip) | 0.751 | 0.39 | 4.236 |
Itti&Koch2 | code from the GBVS package | 0.75 | 0.405 | 4.560 |
Region saliency based on global context [Lingfeng Yuan] | coming soon | 0.7416 | 0.4016 | 4.5172 |
Context-Aware saliency | code | 0.742 | 0.39 | 4.90 |
NARFI saliency [Chen et al.] | paper, IEEE link | 0.7263 | 0.4202 | 4.6974 |
Quantum Cuts salient objects | website | 0.690 | 0.398 | 4.992 |
Torralba | code | 0.684 | 0.343 | 4.715 |
Hou & Zhang | code | 0.682 | 0.319 | 5.368 |
SUN saliency | code, website | 0.672 | 0.34 | 5.088 |
Itti&Koch | code from the Saliency Toolbox | 0.562 | 0.284 | 5.067 |
Achanta | code | 0.523 | 0.297 | 6.854 |
Chance** | code | 0.503 | 0.327 | 6.352 |
** Baseline models that we compare against
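For intuition, the sketch below shows one common way to compute the first two metrics from a saliency map and the recorded fixations. It is an illustrative approximation, not the Matlab scoring code used for this benchmark; the Earth Mover's Distance is omitted because it additionally requires an optimal-transport solver.

```python
import numpy as np

def auc_judd(saliency_map, fixation_map):
    """Judd-style AUC: each fixated pixel is a positive; every fixated
    saliency value serves as a threshold on the normalized map."""
    s = saliency_map.astype(float).ravel()
    f = fixation_map.astype(bool).ravel()
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)      # normalize to [0, 1]
    n_fix, n_pix = f.sum(), s.size
    tp, fp = [0.0], [0.0]
    for t in np.sort(s[f])[::-1]:                        # descending thresholds
        above = s >= t
        tp.append((above & f).sum() / n_fix)             # true positive rate
        fp.append((above & ~f).sum() / (n_pix - n_fix))  # false positive rate
    tp.append(1.0)
    fp.append(1.0)
    return np.trapz(tp, fp)                              # area under the ROC curve

def similarity(saliency_map, fixation_density):
    """Histogram intersection of two maps normalized to sum to 1
    (1 = identical distributions, 0 = no overlap)."""
    a = saliency_map.astype(float) / (saliency_map.sum() + 1e-12)
    b = fixation_density.astype(float) / (fixation_density.sum() + 1e-12)
    return float(np.minimum(a, b).sum())
```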
Submit a new model
Instructions:
1) Download our 300 images (IMAGES.zip).
2) Run your model to create saliency maps of each image. The saliency maps should be .jpg images of the same size and name as the original images (a minimal sketch of this step follows these instructions).
3) Submit your maps to tjudd@csail.mit.edu (as a zip or tar folder).
4) We run the scoring metrics to compare how well your saliency maps predict where 39 observers looked on the images. Because we do not make the fixations public (to prevent any model from being trained on them), it is not possible to score the model on your own. For reference, you can see the Matlab code we use to score models.
5) We post your score and model details on this page.
6) Let us know if you have a publication, website, or publicly available code for your model that we can link to your score in the chart above.
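As a concrete illustration of step 2, the sketch below writes one .jpg saliency map per benchmark image, matching the size and file name of the original. The directory names and the my_saliency_model function are placeholders for your own setup, not part of the benchmark.

```python
import os
import numpy as np
from PIL import Image

IMAGE_DIR = "BenchmarkIMAGES"    # wherever you extracted IMAGES.zip (path is an assumption)
OUTPUT_DIR = "SaliencyMaps"      # folder to zip up and email
os.makedirs(OUTPUT_DIR, exist_ok=True)

def my_saliency_model(rgb_array):
    """Placeholder: return a 2D float saliency map for an H x W x 3 image array."""
    raise NotImplementedError

for fname in sorted(os.listdir(IMAGE_DIR)):
    if not fname.lower().endswith((".jpg", ".jpeg")):
        continue
    image = Image.open(os.path.join(IMAGE_DIR, fname)).convert("RGB")
    smap = my_saliency_model(np.asarray(image))
    # Rescale to 0-255 and resize the map back to the original image dimensions.
    smap = 255 * (smap - smap.min()) / (smap.max() - smap.min() + 1e-12)
    out = Image.fromarray(smap.astype(np.uint8)).resize(image.size, Image.BILINEAR)
    out.save(os.path.join(OUTPUT_DIR, fname))   # same file name as the original image
```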
If you would also like to know your model's score with optimized blur and center weight, submit saliency maps of these 100 images from our ICCV MIT data set (listed below), which we will use to determine the optimal blur parameter and optimal center weight for your model.
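To make these two parameters concrete, the sketch below applies a Gaussian blur of width sigma and blends in a simple center prior with weight center_weight. The exact form of the center prior and the parameter grid are assumptions for illustration only; the scoring itself happens on our side, since the fixations are not public.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def apply_blur_and_center(saliency_map, sigma, center_weight):
    """Blur the map, normalize it to [0, 1], and blend in a centered Gaussian prior.
    center_weight = 0 leaves the (blurred) model output unchanged."""
    h, w = saliency_map.shape
    blurred = gaussian_filter(saliency_map.astype(float), sigma)
    blurred = (blurred - blurred.min()) / (blurred.max() - blurred.min() + 1e-12)
    ys, xs = np.mgrid[0:h, 0:w]
    # Isotropic center prior; the quarter-image standard deviation is an assumption.
    center = np.exp(-((xs - w / 2) ** 2 / (2 * (w / 4) ** 2)
                      + (ys - h / 2) ** 2 / (2 * (h / 4) ** 2)))
    return (1 - center_weight) * blurred + center_weight * center

# Conceptually, the optimization is a search over (sigma, center_weight) against the
# held-out fixations, e.g. a grid such as:
#   candidates = [(s, c) for s in (1, 2, 4, 8) for c in (0.0, 0.25, 0.5)]
# where each candidate map is scored with the metrics above.
```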
Other data sets
[If you have another fixation data set that you would like to list here, email tjudd@csail.mit.edu with a link and a description.]
Other fixation data sets
MIT data set [Judd et al. 2009] has fixations from 15 viewers who free-viewed 1003 natural indoor and outdoor images. It was created under conditions similar to the saliency benchmark data set above and can be used to train new models of saliency.
OSIE data set [Xu et al. 2014]. The Object and Semantic Images and Eye-tracking data set provides data for object and semantic saliency. It includes (a) 700 images with semantic objects (a large portion have multiple dominant objects in the same image, allowing a direct comparison of different objects/attributes in the saliency context), (b) eye-tracking data from 15 subjects free-viewing all images, (c) 5,551 segmented objects with fine contours, and (d) annotations of 12 semantic attributes for each of the 5,551 objects.
VIP data set [Keng-Teck et al. 2013] has fixations from 75 viewers who were either tasked to free-view or detect anomalies in 150 images. The images are randomly selected from the NUSEF dataset. Demographic and personality traits of the viewers are available for training of trait-specific saliency models.
NUSEF data set [Subramanian et al. 2010] has fixations from ~25 observers free-viewing 758 images containing semantically affective objects/scenes such as expressive faces, nudes, unpleasant concepts, and interactive actions.
Toronto data set [Bruce and Tsotsos, 2009] contains data from 11 subjects free-viewing 120 color images of outdoor and indoor scenes. A large portion of images here do not contain particular regions of interest.
Ehinger data set [Ehinger et al. 2009] has fixations from 14 observers as they performed a search task (person detection) on 912 outdoor scenes.
FIFA [Cerf et al. 2009] has fixation data collected from 8 subjects doing a free-viewing task on 180 color outdoor and indoor images. Observers were asked to rate how interesting each image was. Images include salient objects and many different types of faces. This data set was originally used to establish that human faces are very attractive to observers and to test models of saliency that included face detectors.
DOVES (a Database Of Visual Eye movementS) [Linde et al. 2008] is a collection of eye movements from 29 human observers as they viewed 101 natural calibrated images. The images are black and white and show natural outdoor scenes with no strongly salient objects.
Le Meur [Le Meur et al. 2006] has fixations of up to 40 observers who free-viewed 27 color images with strongly salient objects for 15 seconds.
IVC data sets: The Images and Video Communications (IVC) team of the IRCCyN lab provides several image and video databases that include eye movement recordings. Some of the databases are based on a free-viewing task, others on a quality evaluation task.
Other saliency-related data sets
MIT Low-resolution data set [Judd et al. 2011] has 168 natural images and 25 pink noise images at 8 different resolutions. 64 observers were distributed across the different resolutions of the images such that there are 8 viewers per image. Useful for studying fixations on low-resolution images.
Regional Saliency Dataset (RSD) [Li, Tian, Huang, Gao 2009] (paper) is a dataset for evaluating visual saliency in video.
MSRA Salient Object Database [Liu et al. 2007] is a database of 20,000 images with hand-labeled rectangles marking the principal salient object, provided by 3 users.