Yale Song

Senior Researcher
Microsoft Research
One Microsoft Way, Redmond, WA 98052
<first_name><last_name> AT microsoft DOT com

Bio | Scholar | GitHub | Curriculum Vitae

Who am I?

I am a member of the Computer Vision Group at Microsoft Research in Redmond. I work in computer vision and machine learning. Most recently, my research has focused on learning from unlabeled and noisy data. I am particularly interested in unsupervised/self-supervised learning from video data, leveraging spatio-temporal and multimodal structures. Before joining Microsoft in 2018, I spent 4 years at Yahoo Research in NYC. I obtained my Master's and PhD degrees in Computer Science from MIT in 2010 and 2014, respectively, where I was a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). My dissertation investigated learning from structured data and its applications to video understanding. I was lucky to have Randall Davis, Bill Freeman, John Fisher, and Louis-Philippe Morency on my committee.

I am looking for motivated PhD students who are passionate about visual and multimodal representation learning from unlabeled videos. Reach out if you are interested in doing a research internship with me at MSR Redmond.

Professional Service

    • Program Chair: ICMI 2019
    • Organizing Committee: NeurIPS 2021/2020 (Expo Chair), ICLR 2021 (Paper Award Committee), ICMI 2021 (Sponsorship Chair)
    • Area Chair: ICCV 2021, ICLR 2021, NeurIPS 2021/2020, CVPR 2020, ACL 2019, WACV 2020/2018, FG 2018, ICMI 2018/2017/2016
    • Reviewer: CVPR, ECCV, ICCV, NeurIPS, ICML, WACV, FG, ICMI
    • Journal Reviewer: TPAMI, TIP, TAFF, TKDE, TiiS, CVIU
    • Outstanding Reviewer: ICML 2020, NeurIPS 2019, CVPR 2017

Publications (see also Google Scholar, DBLP)


  1. Parameter Efficient Multimodal Transformers for Video Representation Learning
    Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song
    ICLR 2021 [Preprint]

  2. Self-Supervised Learning of Compressed Video Representations
    Youngjae Yu, Sangho Lee, Gunhee Kim, Yale Song
    ICLR 2021 [Preprint]

  3. Active Contrastive Learning of Audio-Visual Video Representations
    Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
    ICLR 2021 [Preprint]

2020

  5. Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval
    Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, Shang-Hong Lai
    ICPR 2020 (Oral)

  6. Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
    Matt Whitehill, Shuang Ma, Daniel McDuff, Yale Song
    INTERSPEECH 2020 [Preprint]

  7. Phans, Stans and Cishets: Self-Presentation Effects on Content Propagation in Tumblr
    Michael M. Yoder, Qinlan Shen, Alex Coda, Yunseok Jang, Yale Song, Kapil Thadani, Carolyn P. Rose
    WebSci 2020

  8. Image to Video Domain Adaptation Using Web Supervision
    Andrew Kae and Yale Song
    WACV 2020 [Preprint]

2019

  10. Characterizing Bias in Classifiers using Generative Models
    Daniel McDuff, Shuang Ma, Yale Song, Ashish Kapoor
    NeurIPS 2019 [Preprint]

  11. Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck
    Shuang Ma, Daniel McDuff, Yale Song
    ICCV 2019 [Paper] [Project]

  12. Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
    Yale Song and Mohammad Soleymani
    CVPR 2019 [Paper] [Project] [Code and Dataset]

  13. Neural TTS Stylization with Adversarial and Collaborative Games
    Shuang Ma, Daniel McDuff, Yale Song
    ICLR 2019 [Paper] [Code] [Press]

  14. Visual Question Answering with Spatio-Temporal Reasoning
    Yunseok Jang, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, Gunhee Kim
    IJCV 2019 [Paper] [Code and dataset]

2018

  16. Video Prediction with Appearance and Motion Conditions
    Yunseok Jang, Gunhee Kim, Yale Song
    ICML 2018 [PDF] [Project] [Code]

  17. Image2GIF: Generating Cinemagraphs using Recurrent Deep Q-Networks
    Yipin Zhou, Yale Song, Tamara L. Berg
    WACV 2018 [arxiv] [project]

2017

  19. Learning from Noisy Labels with Distillation
    Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, Jia Li
    ICCV 2017 [arxiv] [slides] [YFCC100M Entity Dataset]

  20. ElasticPlay: Interactive Video Summarization with Dynamic Time Budget
    Haojian Jin, Yale Song, Koji Yatani
    ACM Multimedia 2017 (Oral) [arxiv] [demo] [video]

  21. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
    Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, Gunhee Kim
    CVPR 2017 (Spotlight) [arxiv] [Code and dataset]

  22. Improving Pairwise Ranking for Multi-label Image Classification
    Yuncheng Li, Yale Song, Jiebo Luo
    CVPR 2017 [arxiv]

2016

  24. Real-Time Video Highlights for Yahoo Esports
    Yale Song
    NIPS Workshop, LSCVS 2016, [arxiv]
    In production at Yahoo eSports (Match Highlights)

  25. To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos
    Yale Song, Miriam Redi, Jordi Vallmitjana, Alejandro Jaimes
    CIKM 2016, [arxiv] [Slides] [Code] [Dataset]
    In production at Tumblr and Flickr (Thumbnails from user-generated videos)

  26. TGIF: A New Dataset and Benchmark on Animated GIF Description
    Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, Jiebo Luo
    CVPR 2016 (Spotlight), [arxiv] [Dataset] [Project]

  27. Video2GIF: Automatic Generation of Animated GIFs from Video
    Michael Gygli, Yale Song, Liangliang Cao
    CVPR 2016, [arxiv] [Demo] [Code] [Dataset]
    Press coverage: Yahoo, Motherboard, Le Monde (French)

  28. Fast, Cheap, and Good: Why Animated GIFs Engage Us
    Saeideh Bakhshi, David A. Shamma, Lyndon Kennedy, Yale Song, Paloma de Juan, Joseph 'Jofish' Kaye
    CHI 2016, [PDF] [Dataset] [Video]

  29. Balancing Appearance and Context in Sketch Interpretation
    Yale Song, Randall Davis, Kaichen Ma, Dana L. Penney
    IJCAI 2016, [arxiv]

  30. Mouse Activity as an Indicator of Interestingness in Video
    Gloria Zen, Paloma de Juan, Yale Song, Alejandro Jaimes
    ICMR 2016 (Long paper), [PDF] [Dataset]

2015

  32. TVSum: Summarizing Web Videos using Titles
    Yale Song, Jordi Vallmitjana, Amanda Stent, Alejandro Jaimes
    CVPR 2015, [PDF] [Poster] [TVSum50 Dataset]

  33. Video Co-summarization: Video Summarization by Visual Co-occurrence
    Wen-Sheng Chu, Yale Song, Alejandro Jaimes
    CVPR 2015, [PDF] [Poster] [Project]

  34. Continuous Body and Hand Gesture Recognition for Natural Human-Computer Interaction
    Yale Song, Randall Davis
    IJCAI 2015 Journal Track, [PDF]

  35. Exploiting Sparsity and Co-occurrence Structure for Action Unit Recognition
    Yale Song*, Daniel McDuff*, Deepak Vasisht, Ashish Kapoor (* equal contribution)
    FG 2015, [PDF] [Project] [Code]

2014

  37. #FluxFlow: Visual Analysis of Anomalous Information Spreading on Social Media
    Jian Zhao, Nan Cao, Zhen Wen, Yale Song, Yu-Ru Lin, Christopher Collins
    IEEE Trans. Visual. Comput. Graphics (VAST 2014), [PDF] [Video]
    Honorable Mention Award (3 out of 146 submissions)

2013

  39. Action Recognition by Hierarchical Sequence Summarization
    Yale Song, Louis-Philippe Morency, Randall Davis
    CVPR 2013, [PDF] [Code]

  40. One-Class Conditional Random Fields for Sequential Anomaly Detection
    Yale Song, Zhen Wen, Ching-Yung Lin, Randall Davis
    IJCAI 2013, [PDF]

  41. Distribution-Sensitive Learning for Imbalanced Datasets
    Yale Song, Louis-Philippe Morency, Randall Davis
    FG 2013, [PDF]

  42. Learning a Sparse Codebook of Facial and Body Microexpressions for Emotion Recognition
    Yale Song, Louis-Philippe Morency, Randall Davis
    ICMI 2013, [PDF] [Slides]

2012

  44. Multi-View Latent Variable Discriminative Models for Action Recognition
    Yale Song, Louis-Philippe Morency, Randall Davis
    CVPR 2012, [PDF] [Project] [Code]

  45. Multimodal Human Behavior Analysis: Learning Correlation and Interaction Across Modalities
    Yale Song, Louis-Philippe Morency, Randall Davis
    ICMI 2012, [PDF] [Slides]

  46. Continuous Body and Hand Gesture Recognition for Natural Human-Computer Interaction
    Yale Song, David Demirdjian, Randall Davis
    ACM Trans. Interact. Intell. Syst. 2(1), 2012, [PDF]
    Press coverage: MIT News, Economist, The Verge, CNET, Gizmodo, DailyBRINK

2011

  48. Tracking Body and Hands For Gesture Recognition: NATOPS Aircraft Handling Signals Database
    Yale Song, David Demirdjian, Randall Davis
    FG 2011, [PDF] [Dataset]

  49. Multi-Signal Gesture Recognition Using Temporal Smoothing Hidden Conditional Random Fields
    Yale Song, David Demirdjian, Randall Davis
    FG 2011, [PDF]


  1. Structured Video Content Analysis: Learning Spatio-Temporal and Multimodal Structures
    Yale Song
    PhD Thesis, Massachusetts Institute of Technology, 2014 [DSpace@MIT]

  2. Multi-Signal Gesture Recognition using Body and Hand Poses
    Yale Song
    SM Thesis, Massachusetts Institute of Technology, 2010 [DSpace@MIT]

Interns / Students