6.870 Grounding Object Recognition and Scene Understanding, Fall 2011



This class will cover current approaches to object recognition and scene understanding in computer vision and their relation to other disciplines. The goal is to provide an in-depth presentation of computer vision techniques for recognizing objects, scenes, materials, actions, and so on, placing them in the framework of concrete tasks.

The class is addressed to students from any discipline, not just vision, who are interested in learning about computer vision techniques that can be applied to their research. We will cover state-of-the-art object recognition and scene understanding techniques and how they relate to robotics, language, computer graphics, crowdsourcing, human-computer interaction, etc. For students in computer vision, this class offers a chance to explore new tasks and scene representations, beyond labeling objects in images for its own sake.

Topics include bag-of-words models, part-based models, classifier-based models, multiclass object recognition and transfer learning, concurrent recognition and segmentation, context models for object recognition, representations for scene understanding, and large datasets for semi-supervised and unsupervised discovery of object and scene categories. We will read a mixture of computer vision papers and influential works from cognitive psychology and other disciplines.



Wednesdays 1pm-4pm

Room 13-1143

Instructor: Antonio Torralba, Email: torralba csail mit edu

No prerequisites

Student presentations

Each registered student will give a class presentation that complements one of the lectures. The presentation can cover two papers related to that week's lecture topic, or a paper from your own research if you feel it can be connected with the class material. Email me with suggestions for topics you would like to present and I will assign you to one of the lectures.

Class projects

Projects can be done individually or in groups. An ideal group consists of 2 or 3 students from different areas. Important dates:
- Project presentations: December 7th. Each group will have 30 minutes to present their work. All members of the team should participate in the presentation.
- Papers: due on December 14th. Papers should be 4-6 pages long. Each member of the group should submit a separate copy by email (pdf format). The papers from members of the same group can be nearly identical, but each member should state which part of the project he or she focused on and expand on it. Each paper should include the names of all the team members.


Date / Topic / Lecture / Invited speaker / Links to papers and code
Sept. 7 Class goals and a short introduction Antonio  

Lecture1 (ppt)

-P. Cavanagh, Vision is getting easier every day, Perception 1996
Sept. 14 Edges, textures, ... Antonio   Lecture2 (ppt)

Sept. 21 The importance of data Antonio

Boris Katz

Carl Vondrick

Lecture3 (ppt)
Boris (ppt)
Carl (ppt)

-LabelMe (website, paper.pdf)
-Watson (paper.pdf)
-START (system website, paper.pdf)
-Video annotation

Sept. 28 Object recognition Antonio

Seth Teller

David Hayden

Lecture4 (ppt)

-Felzenszwalb, McAllester and Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR 2008. (code)
- Manipulation (paper.pdf)
- Natural language commands (paper.pdf)

Oct. 5 Object recognition in context Antonio

Nicholas Roy

Ryan Schoen

Lecture5 (ppt)

- tellex11.pdf
- hri10-tk.pdf
- icra09-tk.pdf

Oct. 12 Human vision Antonio

Aude Oliva

Deborah Hanus

Lecture6 (ppt)

- A. Oliva, A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 2001. (gist code)
- S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006. (code)

Oct. 19 Words and pictures Antonio

Regina Barzilay

Yevgeni Berzak

Lecture7 (ppt)

Gestural Cohesion for Discourse Segmentation
Jacob Eisenstein, Regina Barzilay, Randall Davis
Proceedings of ACL, 2008

Modeling Gesture Salience as a Hidden Variable for Coreference Resolution and Keyframe Extraction
Jacob Eisenstein, Regina Barzilay, Randall Davis
Journal of Artificial Intelligence Research, 2008

Turning Lectures into Comic Books with Linguistically Salient Gestures
Jacob Eisenstein, Regina Barzilay, Randall Davis
Proceedings of AAAI, 2007

Oct. 26 Multiclass models and transfer learning Antonio

Daniela Rus

Sudeep Pillai

Lecture8 (ppt)

Nov. 2 No class      

Nov. 9 No class (ICCV)

Nov. 16 Vision and the brain Antonio

Jim DiCarlo

Ha Hong

Lecture9 (ppt)

Jim DiCarlo's papers

Nov. 23 HCI Antonio Students:
Mike Fleder
Jeremy Scott
Yafim Landa
Lecture10 (ppt)

Nov. 30 3D scenes Antonio Students:
Emily Zhao
Xiaodan Jia
Lecture11 (ppt)

Dec. 7 Project presentations Antonio    

Dec. 14 Projects Antonio Last day of classes  


Related courses:



Other resources:



Links to useful code for low-level and mid-level vision tasks:

Other useful code: