David Harwath
Assistant Professor, University of Texas at Austin Computer Science Department
As of September 2020, I have joined the computer science department at UT Austin as an assistant professor. You can find my new homepage here. I am looking for prospective graduate students interested in machine learning applied to speech, audio, and natural language, especially within a multimodal context (e.g., in conjunction with vision). If you would like to pursue your graduate studies with me, please apply to UTCS and mention your interest in working with my group in your statement of purpose.
My research interests are in the area of machine learning for speech and language processing. The ultimate goal of my work is to discover the algorithmic mechanisms that would enable computers to learn and use spoken language the way that humans do. My approach emphasizes the multimodal and grounded nature of human language, and thus has a strong connection to other machine learning disciplines such as computer vision.
While modern machine learning techniques such as deep learning have made impressive progress across a variety of domains, it is doubtful that existing methods can fully capture the phenomenon of language. State-of-the-art deep learning models for tasks such as speech recognition are extremely data hungry, requiring many thousands of hours of speech recordings that have been painstakingly transcribed by humans. Even then, they are highly brittle when used outside of their training domain, breaking down when confronted with new vocabulary, accents, or environmental noise. Because of its reliance on massive training datasets, the technology we do have is completely out of reach for all but several dozen of the 7,000 human languages spoken worldwide.
In contrast, human toddlers are able to grasp the meaning of new word forms from only a few spoken examples, and they learn to carry on a meaningful conversation long before they are able to read and write. Critical aspects of language are currently missing from our machine learning models. Human language is inherently multimodal; it is grounded in embodied experience; it holistically integrates information from all of our sensory organs into our rational capacity; and it is acquired via immersion and interaction, without the kind of heavy-handed supervision relied upon by most machine learning models. My research agenda revolves around finding ways to bring these aspects into the fold.
I hold a B.S. in electrical and computer engineering from the University of Illinois at Urbana-Champaign (2010), an S.M. in computer science from MIT (2013), and a Ph.D. in computer science from MIT (2018).
Model training code for “Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input”
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
David Harwath*, Wei-Ning Hsu*, and James Glass
To appear at the International Conference on Learning Representations (ICLR), 2020
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Wei-Ning Hsu, David Harwath, and James Glass
Proc. Interspeech, 2019
Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio
Emmanuel Azuh, David Harwath, and James Glass
Proc. Interspeech, 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
To appear in the International Journal of Computer Vision
Learning Words by Drawing Images
Dídac Surís, Adrià Recasens, David Bau, David Harwath, James Glass, and Antonio Torralba
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019
Towards Visually Grounded Sub-Word Speech Unit Discovery
David Harwath and James Glass
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
Grounding Spoken Words in Unlabeled Video
Angie Boggust, Kartik Audhkhasi, Dhiraj Joshi, David Harwath, Samuel Thomas, Rogerio Feris, Danny Gutfreund, Yang Zhang, Antonio Torralba, Michael Picheny, and James Glass
Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass
Proc. European Conference on Computer Vision (ECCV), 2018
Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech
David Harwath, Galen Chuang, and James Glass
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
Learning Modality-Invariant Representations for Speech and Images
Kenneth Leidal, David Harwath, and James Glass
Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2017
Learning Word-Like Units from Joint Audio-Visual Analysis
David Harwath and James R. Glass
Proc. Association for Computational Linguistics (ACL), 2017
Unsupervised Learning of Spoken Language with Visual Context
David Harwath, Antonio Torralba, and James R. Glass
Proc. Neural Information Processing Systems (NeurIPS), 2016
Look, Listen, and Decode: Multimodal Speech Recognition with Images
Felix Sun, David Harwath, and James R. Glass
Proc. IEEE Workshop on Spoken Language Technology (SLT), 2016
On the Use of Acoustic Unit Discovery for Language Recognition
Stephen H. Shum, David Harwath, Najim Dehak, and James R. Glass
IEEE/ACM Transactions on Audio, Speech, and Language Processing, September 2016, Vol. 24, No. 9, pp. 1665-1676
Deep Multimodal Semantic Embeddings for Speech and Images
David Harwath and James R. Glass
Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015 (nominated for best paper award)
Speech Recognition Without a Lexicon - Bridging the Gap Between Graphemic and Phonetic Systems
David Harwath and James R. Glass
Proc. Interspeech, 2014
Choosing Useful Word Alternates for Automatic Speech Recognition Correction Interfaces
David Harwath, Alexander Gruenstein, and Ian McGraw
Proc. Interspeech, 2014
Zero Resource Spoken Audio Corpus Analysis
David Harwath, Timothy J. Hazen, and James R. Glass
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013
A Summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, Mike Seltzer, Pascal Clark, Ian McGraw, Balakrishnan Varadarajan, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-ying Lee, Keith Levin, Atta Norouzian, Vijay Peddinti, Rachael Richardson, Thomas Schatz, Samuel Thomas
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013
Topic Identification Based Extrinsic Evaluation of Summarization Techniques Applied to Conversational Speech
David Harwath and Timothy J. Hazen
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
Phonetic Landmark Detection for Automatic Language Identification
David Harwath and Mark Hasegawa-Johnson
Proc. Speech Prosody, 2010
Teaching assistant for 6.345: Automatic Speech Recognition, Spring 2017
Teaching assistant for 6.036: Introduction to Machine Learning, Spring 2016
Ph.D., Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2018
S.M., Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2013
B.S., Electrical Engineering, University of Illinois at Urbana-Champaign, 2010