Yale Song

Senior Staff Research Scientist and Manager, Google
<first_name><last_name> AT gmail DOT com

Scholar | Github

Latest News

Starting a new chapter at Google (July, 2025)
Invited talk at Large-Scale Holistic Video Understanding at CVPR 2024 (June, 2024)
Invited talk at Synthetic Data for Computer Vision at CVPR 2024 (June, 2024)
Invited talk at Large-Scale Holistic Video Understanding at CVPR 2022 (June, 2022)
Panel at T4V: Transformers for Vision at CVPR 2022 (June, 2022)
Starting a new chapter at Meta AI! (Apr, 2022)
Invited talk at Stanford Computer Vision and Learning Lab (Mar, 2022)
Panel at UofT AI Conference (Feb, 2022)
Featured on deeplearning.ai where I talk about foundation vision models [Article]
Invited talk at New in ML Workshop at NeurIPS 2021
Invited talk at Large-Scale Holistic Video Understanding at ICCV 2021 [Video / Slides]
Co-organizing a tutorial on Efficient Video Understanding at ICCV 2021 [Video / Slides]
ACAV100M: A new dataset of 100 million clips (31 years in total) for video self-supervised learning

Professional Service

Organizing Committee: CVPR 2025/2024/2023 (Socials Chair), ICLR 2023/2021 (Paper Award Committee), NeurIPS 2021/2020 (Expo Chair), ICMI 2021 (Sponsorship Chair), ICMI 2019 (Program Chair)
Area Chair: NeurIPS 2025/2024/2023/2022/2021/2020, ICLR 2026/2024/2023/2021, ICML 2026/2024/2023, CVPR 2026*/2025*/2024/2020, ECCV 2024*/2022, ICCV 2025/2023/2021, WACV 2023/2020/2018, FG 2018, ICMI 2018/2017/2016, ACL 2019 (*: Lead AC)
Journal Editor: Transactions on Machine Learning Research (2022-Current)
Outstanding Area Chair: ICLR 2023
Outstanding Reviewer: ICML 2021/2020, NeurIPS 2019, CVPR 2017

Publications (see also at Google Scholar, DBLP)

2026

GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim
CVPR 2026

Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
Ce Zhang, Yale Song, Ruta Desai, Michael Louis Iuzzolino, Joseph Tighe, Gedas Bertasius, Satwik Kottur
WACV 2026

2025

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Jang Hyun Cho^†, Andrea Madotto^†, Effrosyni Mavroudi^†, Triantafyllos Afouras^†, Tushar Nagarajan^†, Muhammad Maaz^†, Yale Song^†, Tengyu Ma^†, Shuming Hu^†, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Suyog Jain, Miguel Martin, Huiyu Wang, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Piotr Dollar, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer (†: joint first author)
NeurIPS 2025 (Spotlight) [Preprint] [Model] [Code] [Data]

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
Shraman Pramanick, Effrosyni Mavroudi, Yale Song, Rama Chellappa, Lorenzo Torresani, Triantafyllos Afouras
ICCV 2025 (Highlight)

Streaming VideoLLMs for Real-Time Procedural Video Understanding
Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Lal Bhatnagar, Necati Cihan Camgoz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, Fadime Sener
ICCV 2025

VITED: Video Temporal Evidence Distillation
Yujie Lu, Yale Song, Lorenzo Torresani, William Yang Wang, Tushar Nagarajan
CVPR 2025 [Preprint]

2024

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
CVPR 2024 (Oral) [Preprint] [Blog] [Data]

2023

Ego4D Goal-Step: Toward Hierarchical Understanding of Procedural Activities
Yale Song, Gene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, Lorenzo Torresani
NeurIPS 2023 Datasets & Benchmarks (Spotlight) [Paper] [Data]

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang
ICCV 2023 (Highlight) [Preprint]

Egocentric Video Task Translation
Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
CVPR 2023 (Highlight) [Preprint]

Scaling Novel Object Detection with Weakly Supervised Detection Transformers
Tyler LaBonte, Yale Song, Xin Wang, Vibhav Vineet, Neel Joshi
WACV 2023 [Preprint] [Poster]

2022

Neural-Sim: Learning to Generate Training Data with NeRF
Yunhao Ge, Harkirat Behl, Jiashu Xu, Suriya Gunasekar, Neel Joshi, Yale Song, Xin Wang, Laurent Itti, Vibhav Vineet
ECCV 2022

COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems
Shuang Ma, Sai Vemprala, Wenshan Wang, Jayesh K. Gupta, Yale Song, Daniel McDuff, Ashish Kapoor
IROS 2022 [Preprint] [Blog] [Code]

Visual Attention Emerges from Recurrent Sparse Reconstruction
Baifeng Shi, Yale Song, Neel Joshi, Trevor Darrell, Xin Wang
ICML 2022 [Preprint] [Slides] [Code]

Robust Contrastive Learning against Noisy Views
Ching-Yao Chuang, R Devon Hjelm, Xin Wang, Vibhav Vineet, Neel Joshi, Antonio Torralba, Stefanie Jegelka, Yale Song
CVPR 2022 [Preprint] [Code]

CausalCity: Complex Simulations with Agency for Causal Discovery and Reasoning
Daniel McDuff, Yale Song, Jiyoung Lee, Vibhav Vineet, Sai Vemprala, Nicholas Alexander Gyde, Hadi Salman, Shuang Ma, Kwanghoon Sohn, Ashish Kapoor
CLeaR 2022 [Preprint] [Project and code] [Blog]

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents
Tsu-Jui Fu, William Yang Wang, Daniel McDuff, Yale Song
AAAI 2022 [Preprint] [Project and dataset]

Anomaly Detection in Time Series with Robust Variational Quasi-Recurrent Autoencoders
Tung Kieu, Razvan Cirstea, Yan Zhao, Bin Yang, Chenjuan Guo, Yale Song, Christian Jensen
ICDE 2022

2021

Contrastive Learning of Global and Local Video Representations
Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
NeurIPS 2021 [Paper] [Code]

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song
ICCV 2021 [Preprint] [ACAV100M Dataset] [MSR Blog]

Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song
ICLR 2021 [Preprint] [Code] [MSR Blog]

Self-Supervised Learning of Compressed Video Representations
Youngjae Yu, Sangho Lee, Gunhee Kim, Yale Song
ICLR 2021 [Preprint]

Active Contrastive Learning of Audio-Visual Video Representations
Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
ICLR 2021 [Preprint] [Code]

2020

Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval
Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai
ICPR 2020 (Oral)

Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song
INTERSPEECH 2020 [Preprint]

Phans, Stans and Cishets: Self-Presentation Effects on Content Propagation in Tumblr
Michael M. Yoder, Qinlan Shen, Alex Coda, Yunseok Jang, Yale Song, Kapil Thadani, Carolyn P. Rose
WebSci 2020

Image to Video Domain Adaptation Using Web Supervision
Andrew Kae and Yale Song
WACV 2020 [Preprint]

2019

Characterizing Bias in Classifiers using Generative Models
Daniel McDuff, Shuang Ma, Yale Song, Ashish Kapoor
NeurIPS 2019 [Preprint]

Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck
Shuang Ma, Daniel McDuff, Yale Song
ICCV 2019 [Paper] [Project]

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
Yale Song and Mohammad Soleymani
CVPR 2019 [Paper] [Project] [Code and Dataset]

Neural TTS Stylization with Adversarial and Collaborative Games
Shuang Ma, Daniel McDuff, Yale Song
ICLR 2019 [Paper] [Code] [Press]

Visual Question Answering with Spatio-Temporal Reasoning
Yunseok Jang, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, Gunhee Kim
IJCV 2019 [Paper] [Code and dataset]

2018

Video Prediction with Appearance and Motion Conditions
Yunseok Jang, Gunhee Kim, Yale Song
ICML 2018 [PDF] [Project] [Code]

Image2GIF: Generating Cinemagraphs using Recurrent Deep Q-Networks
Yipin Zhou, Yale Song, Tamara L. Berg
WACV 2018 [arxiv] [project]

2017

Learning from Noisy Labels with Distillation
Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, Jia Li
ICCV 2017 [arxiv] [slides] [YFCC100M Entity Dataset]

ElasticPlay: Interactive Video Summarization with Dynamic Time Budget
Haojian Jin, Yale Song, Koji Yatani
ACM Multimedia 2017 (Oral) [arxiv] [demo] [video]

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, Gunhee Kim
CVPR 2017 (Spotlight) [arxiv] [Code and dataset]

Improving Pairwise Ranking for Multi-label Image Classification
Yuncheng Li, Yale Song, Jiebo Luo
CVPR 2017 [arxiv]

2016

Real-Time Video Highlights for Yahoo Esports
Yale Song
NIPS Workshop, LSCVS 2016, [arxiv]
In production at Yahoo eSports (Match Highlights)

To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos
Yale Song, Miriam Redi, Jordi Vallmitjana, Alejandro Jaimes
CIKM 2016, [arxiv] [Slides] [Code] [Dataset]
In production at Tumblr and Flickr (Thumbnails from user-generated videos)

TGIF: A New Dataset and Benchmark on Animated GIF Description
Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, Jiebo Luo
CVPR 2016 (Spotlight), [arxiv] [Dataset] [Project]

Video2GIF: Automatic Generation of Animated GIFs from Video
Michael Gygli, Yale Song, Liangliang Cao
CVPR 2016, [arxiv] [Demo] [Code] [Dataset]
Press coverage: Yahoo, Motherboard, Le Monde (French)

Fast, Cheap, and Good: Why Animated GIFs Engage Us
Saeideh Bakhshi, David A. Shamma, Lyndon Kennedy, Yale Song, Paloma de Juan, Joseph 'Jofish' Kaye
CHI 2016, [PDF] [Dataset] [Video]

Balancing Appearance and Context in Sketch Interpretation
Yale Song, Randall Davis, Kaichen Ma, Dana L. Penney
IJCAI 2016, [arxiv]

Mouse Activity as an Indicator of Interestingness in Video
Gloria Zen, Paloma de Juan, Yale Song, Alejandro Jaimes
ICMR 2016 (Long paper), [PDF] [Dataset]

2015

TVSum: Summarizing Web Videos using Titles
Yale Song, Jordi Vallmitjana, Amanda Stent, Alejandro Jaimes
CVPR 2015, [PDF] [Poster] [TVSum50 Dataset]

Video Co-summarization: Video Summarization by Visual Co-occurrence
Wen-Sheng Chu, Yale Song, Alejandro Jaimes
CVPR 2015, [PDF] [Poster] [Project]

Continuous Body and Hand Gesture Recognition for Natural Human-Computer Interaction
Yale Song, Randall Davis
IJCAI 2015 Journal Track, [PDF]

Exploiting Sparsity and Co-occurrence Structure for Action Unit Recognition
Yale Song*, Daniel McDuff*, Deepak Vasisht, Ashish Kapoor (* equal contribution)
FG 2015, [PDF] [Project] [Code]

2014

#FluxFlow: Visual Analysis of Anomalous Information Spreading on Social Media
Jian Zhao, Nan Cao, Zhen Wen, Yale Song, Yu-Ru Lin, Christopher Collins
IEEE Trans. Visual. Comput. Graphics (VAST 2014), [PDF] [Video]
Honorable Mention Award (3 out of 146 submissions)

2013

Action Recognition by Hierarchical Sequence Summarization
Yale Song, Louis-Philippe Morency, Randall Davis
CVPR 2013, [PDF] [Code]

One-Class Conditional Random Fields for Sequential Anomaly Detection
Yale Song, Zhen Wen, Ching-Yung Lin, Randall Davis
IJCAI 2013, [PDF]

Distribution-Sensitive Learning for Imbalanced Datasets
Yale Song, Louis-Philippe Morency, Randall Davis
FG 2013, [PDF]

Learning a Sparse Codebook of Facial and Body Microexpressions for Emotion Recognition
Yale Song, Louis-Philippe Morency, Randall Davis
ICMI 2013, [PDF] [Slides]

2012

Multi-View Latent Variable Discriminative Models for Action Recognition
Yale Song, Louis-Philippe Morency, Randall Davis
CVPR 2012, [PDF] [Project] [Code]

Multimodal Human Behavior Analysis: Learning Correlation and Interaction Across Modalities
Yale Song, Louis-Philippe Morency, Randall Davis
ICMI 2012, [PDF] [Slides]

Continuous Body and Hand Gesture Recognition for Natural Human-Computer Interaction
Yale Song, David Demirdjian, Randall Davis
ACM Trans. Interact. Intell. Syst. 2(1), 2012, [PDF]
Press coverage: MIT News, Economist, The Verge, CNET, Gizmodo, DailyBRINK

2011

Tracking Body and Hands For Gesture Recognition: NATOPS Aircraft Handling Signals Database
Yale Song, David Demirdjian, Randall Davis
FG 2011, [PDF] [Dataset]

Multi-Signal Gesture Recognition Using Temporal Smoothing Hidden Conditional Random Fields
Yale Song, David Demirdjian, Randall Davis
FG 2011, [PDF]

Theses

Structured Video Content Analysis: Learning Spatio-Temporal and Multimodal Structures
Yale Song
PhD Thesis, Massachusetts Institute of Technology, 2014 [DSpace@MIT]

Multi-Signal Gesture Recognition using Body and Hand Poses
Yale Song
SM Thesis, Massachusetts Institute of Technology, 2010 [DSpace@MIT]

Talks (selected)

Life in Industrial Research Labs
NeurIPS 2021 Workshop on New in ML, Dec 2021 [Video]

Towards Self-Supervised Holistic Video Representations
ICCV Tutorial on Holistic Video Understanding, Oct 2021 [Video] [Slides]

An Introduction to Learning from Unlabeled Video
ICCV Tutorial on Efficient Video Understanding, Oct 2021 [Video] [Slides]

Learning from Unlabeled Video
UBC Topics in Artificial Intelligence (guest lecture), Apr 2021
SNU Data Science Seminar, Dec 2020 [Video] [Slides]

Interns / Students

Ching-Yao Chuang (MIT/Stefanie Jegelka and Antonio Torralba), Microsoft Research, 2021
Sharath Girish (UMD/Abhinav Shrivastava), Microsoft Research, 2021
Chandler Squires (MIT/Caroline Uhler and David Sontag), Microsoft Research, 2021
Xiaolong Li (VT/Lynn Abbott), Microsoft Research, 2021
Tsu-Jui Ray Fu (UCSB/William Yang Wang), Microsoft Research, 2020
Yuan-Ting Hu (UIUC/Alexander Schwing), Microsoft Research, 2020
Julia Gong (Stanford/Serena Yeung), Microsoft Research, 2020
Sangho Lee (SNU/Gunhee Kim), Microsoft Research, 2020
Shuang Ma (SUNY Buffalo/Chang Wen Chen), Microsoft Research, 2018
Youngjae Yu (SNU/Gunhee Kim), Microsoft Research, 2018
Chris Thomas (Univ. Pittsburgh/Adriana Kovashka), Yahoo Research, 2017
Keith Maki (CMU/Carolyn Penstein Rose), Yahoo Research, 2017
Yunseok Jang (SNU/Gunhee Kim), Yahoo Research, 2015-2017
Yipin Zhou (UNC Chapel Hill/Tamara Berg), Yahoo Research, 2016
Yuncheng Li (Univ. Rochester/Jiebo Luo), Yahoo Research, 2015-2016
Michael Gygli (ETH Zurich/Luc Van Gool), Yahoo Research, 2015
Wen-Sheng Chu (CMU/Fernando de la Torre), Yahoo Research, 2014
Gloria Zen (Univ. Trento/Nicu Sebe), Yahoo Research, 2014
Kyle Knoblock, MIT 6.UAP, 2011
Kenneth M. Siebert, MIT UROP, 2011
Chang-Yoon Park, MIT UROP, 2010

Etc.

My YouTube Channel: I have been playing guitar since 1996. See me playing in a band!