Stephen H. Shum

sshum (at) csail (dot) mit (dot) edu

MIT Stata Center
32 Vassar Street #32-G424
Cambridge, MA 02139

Hello World!

In June 2016, I completed my Ph.D. in Electrical Engineering and Computer Science (EECS) at MIT.
For the last seven years, I have been lucky to be a part of the Spoken Language Systems (SLS) group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). My advisors are Jim Glass and Najim Dehak.

The general scope of my research is in signal processing and machine learning as applied to speech, as well as other forms of audio. More specifically, I am interested in the use of both unsupervised and semi-supervised methods to perform statistical inference on an audio signal. For the last few years, I have been working primarily in the realm of language identification, speaker recognition, and speaker diarization, but I continue to be very interested in techniques from computational auditory scene analysis (CASA) that can be applied to more general, non-speech audio.

In May 2009, I graduated from UC Berkeley. Go Bears!

Current/Recent Work

On May the 4th, 2016, I (finally) defended my Ph.D. thesis! Here's a link to my presentation slides.
- Photo credit: Carrie Cai.
Things have been pretty quiet for the last year or so; I've been working on something a little different while trying to wrap up my thesis by June 2016. We're in the process of putting it all together now, but as a teaser, we use Jackie's work on acoustic unit discovery to obtain a DNN-based feature representation from untranscribed multi-lingual data, which can then be used for language identification.
- UPDATE (July 2016): A write-up of our work has been accepted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing. You can find it here.
At one of our weekly meetings in 2014, Jim randomly asked me how much labeled data I thought was necessary to build a state-of-the-art speaker recognition system. One thing led to another, and this question resulted in my brief foray into active learning. We put together an Interspeech 2014 paper outlining a greedy heuristic algorithm for obtaining pairwise labels from an otherwise unlabeled dataset, which works surprisingly well. In my thesis, I hope to incorporate additional mathematical rigor and results from real-world, crowd-sourcing experiments.
During the summer of 2013, I (almost) re-lived my experiences from five years prior, when I spent a summer in Baltimore working at the Summer Workshop 2008 (WS08) put on by the Center for Language and Speech Processing (CLSP) at the Johns Hopkins University. This time, I was at the Human Language Technology Center of Excellence (HLTCOE) working on Robust Speaker Recognition for Real Data at SCALE 2013 (Summer Camp for Applied Language Exploration) with Daniel Garcia-Romero, Alan McCree, and Douglas Reynolds.
- UPDATE (June 2014): Our work resulted in a couple of publications and, to our pleasant surprise, best paper awards in 2014 (see below) at Odyssey: The Speaker and Language Recognition Workshop, which was hosted by the University of Eastern Finland under the midnight sun in the gorgeous lake-filled region of Joensuu, Finland.
The folks from the Human Language Technology Group at the MIT Lincoln Laboratory were kind enough to let me work with them for the summer of 2012. With Bill Campbell and Doug Reynolds, I explored the use of graph-based clustering methods for speaker clustering on large-scale speech corpora. This work turned into a paper and poster presentation at ICASSP 2013 in the gorgeous -- albeit rainy -- Canadian city of Vancouver, B.C.
I spent three years or so working on the problem of speaker diarization ("who spoke when"). In particular, this work utilizes a factor analysis-based framework to extract speaker-specific features on speech segments. We explored different ways to cluster these segments and determine the number of speakers present in the audio stream. At the moment, we have settled on an integrated and iterative method that involves spectral analysis and variational inference. Check out my Master's thesis/poster or Interspeech 2011 and 2012 papers for more details.
- UPDATE (May 2013): We've since put everything together into a publication with IEEE Transactions. One of the figures even made it on the cover of the Sept./Oct. 2013 issue! The paper (and cover) can be found here and is cited below.
- UPDATE (October 2013): We also received some coverage from the MIT News.
During Interspeech 2011, Najim and I presented a tutorial on factor analysis-based methods for extracting speaker-specific features on speech segments.
My summer of 2011 was spent working as an intern in Dick Lyon's Machine Hearing Research Group at Google. With my host, Tom Walters, I was involved with developing and extending some new techniques in audio fingerprinting for robust melody summarization and cover song detection. Our vision was to try extending our approaches to query-by-humming, but alas, we ran out of time. That said, it was an awesome experience nonetheless.

Publications (Google Scholar)

S. Shum, David F. Harwath, Najim Dehak, and James R. Glass, "On the Use of Acoustic Unit Discovery for Language Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, September 2016. [paper]
S. Shum, "Overcoming Resource Limitations in the Processing of Unlimited Speech: Applications to Speaker and Language Recognition," Ph.D. Thesis, MIT Department of Electrical Engineering and Computer Science, June 2016. [thesis] [slides]
S. Shum, Najim Dehak, and James R. Glass, "Limited Labels for Unlimited Data: Active Learning for Speaker Recognition," in Proceedings of Interspeech, Singapore, September 2014. [paper] [slides]
S. Shum, Douglas A. Reynolds, Daniel Garcia-Romero, and Alan McCree, "Unsupervised Clustering Approaches for Domain Adaptation in Speaker Recognition Systems," in Proceedings of Odyssey, Joensuu, Finland, June 2014. [paper] [slides]
- Best Student Paper Award
Daniel Garcia-Romero, Alan McCree, S. Shum, Niko Brummer, and Carlos Vaquero, "Unsupervised Domain Adaptation for i-vector Speaker Recognition," in Proceedings of Odyssey, Joensuu, Finland, June 2014. [paper]
- Best Paper Award
S. Shum, Najim Dehak, Reda Dehak, and James R. Glass, "Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 21, No. 10, October 2013, pp. 2015-2028. [paper]
S. Shum, William M. Campbell, and Douglas A. Reynolds, "Large-Scale Community Detection on Speaker Content Graphs," in Proceedings of ICASSP, Vancouver, B.C., Canada, May 2013. [paper] [poster]
S. Shum, Najim Dehak, and Jim Glass, "On the Use of Spectral and Iterative Methods for Speaker Diarization," in Proceedings of Interspeech, Portland, Oregon, September 2012. [paper] [poster]
S. Shum, Najim Dehak, Ekapol Chuangsuwanich, Douglas Reynolds, and Jim Glass, "Exploiting Intra-Conversation Variability for Speaker Diarization," in Proceedings of Interspeech, Florence, Italy, August 2011. [paper] [slides]
S. Shum, "Unsupervised Methods for Speaker Diarization," S.M. Thesis, MIT Department of Electrical Engineering and Computer Science, June 2011. [thesis] [poster]
- William A. Martin Memorial S.M. Thesis Award
D. Sturim, W. Campbell, N. Dehak, Z. Karam, A. McCree, D. Reynolds, F. Richardson, S. Shum, and P. Torres-Carrasquillo, "The MIT LL 2010 Speaker Recognition Evaluation System: Scalable Language-Independent Speaker Recognition," in Proceedings of ICASSP, Prague, Czech Republic, May 2011.
S. Shum, Najim Dehak, Reda Dehak, and Jim Glass, "Unsupervised Speaker Adaptation Based on the Cosine Similarity for Text-Independent Speaker Verification," in Proceedings of IEEE Odyssey, Brno, Czech Republic, June 2010. [paper]
George Chen, John Kua, S. Shum, Nikhil Naikal, Matthew Carlberg, and Avideh Zakhor, "Indoor Localization Algorithms for a Human-Operated Backpack System," in the Fifth International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), Paris, France, May 2010.
W. Spiegl, G. Stemmer, E. Lasarcyk, V. Kolhatkar, A. Cassidy, B. Potard, S. Shum, Y. Song, P. Xu, P. Beyerlein, J. Harnsberger, and E. Noeth, "Analyzing Features for Automatic Age Estimation on Cross-Sectional Data," in Proceedings of Interspeech, Brighton, United Kingdom, September 2009.
P. Beyerlein, A. Cassidy, V. Kolhatkar, E. Lasarcyk, E. Noeth, B. Potard, S. Shum, Y. Song, W. Spiegl, G. Stemmer, and P. Xu, "Vocal Aging Explained by Vocal Tract Modeling: 2008 JHU Summer Workshop Final Report," Technical Report, August 2008.

Presentations

S. Shum, "Overcoming Resource Limitations in the Processing of Unlimited Speech: Applications to Speaker and Language Recognition," Ph.D. Thesis Defense, Massachusetts Institute of Technology, May 4, 2016. [slides]
S. Shum, "From Vectors Representing Speech to Graphs Representing Corpora: Reconciling how far we've come with how far we still have to go," Chinese University of Hong Kong, Hong Kong, November 2013. (Host: Prof. Helen Meng) [slides]
- While on a trip to Hong Kong, I paid a visit to Helen's lab at CUHK and had a lovely time exploring the campus of my mother's alma mater. Thanks to Helen for being such a gracious host!
- In this seminar, I presented a broad overview of the work that is outlined in more detail at ICASSP 2013 and Odyssey 2014. A fair amount of this material also contributed to my Ph.D. thesis proposal, which I (finally) handed in in July 2014.
S. Shum, "Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach," Friedrich-Alexander Universitat Erlangen-Nuremberg, Germany, November 2012. (Host: Prof. Elmar Noeth) [poster] [slides]
- I spent a few days traveling around Austria and Germany and decided to pay a visit to Elmar, who is one of the first people I ever worked with in speech processing. He kindly hosted this seminar -- I really appreciated the generous hospitality -- and we had a great time catching up.
Najim Dehak and S. Shum, "Data-Driven Factor Analysis for Characterizing Spoken Audio," Information Science, Signal Processing, and their Applications (ISSPA) Tutorial, July 2012. [slides]
- Because Najim's graduate university, Ecole de Technologie Superieure, was hosting this year's ISSPA, they invited him to give another version of our Interspeech 2011 tutorial. As a result, I was also roped into enjoying myself in Montreal for a few days. Bummer! :)
S. Shum, "An Introduction to Audio Fingerprinting," SLS Group Meeting, October 2011. [slides]
- This is a talk I gave during one of our group meetings; it is based somewhat on my summer experiences and some general readings.
Najim Dehak and S. Shum, "Low-dimensional Speech Representation Based on Factor Analysis and its Applications," Interspeech Tutorial, August 2011. [slides]
- This was Najim's and my first and (what we thought would be) last tutorial about i-vectors and their application to speaker recognition, language identification, and speaker diarization.

Undergraduate Projects

During my last semester as an EECS undergraduate at UC Berkeley, I worked with George Chen (now at MIT) and Prof. Avideh Zakhor in the Video and Image Processing (VIP) Lab.
I participated as an Undergraduate Researcher on the Vocal Aging Explained Team at the Johns Hopkins University, Center for Language and Speech Processing Summer Workshop 2008. When I came back, I gave a Lunch Talk to the Speech Group at ICSI about how much I learned, how much I did, and, of course, how much I enjoyed the east coast weather (...that I now have to experience on a daily basis!).
My initiation into research occurred in January 2008 when I was given an opportunity to join the International Computer Science Institute (ICSI) and work under Prof. Nelson Morgan and Arlo Faria. In my time there, I learned A LOT and dabbled with building an automatic karaoke system (lyrics + music/video alignment) and working on some voice conversion [CS281A, EE225D]. By the end of it all, I had honestly accomplished nothing, but still had a lot of fun.
- For some reason, it seems as though my course project paper, entitled "A GMM-STRAIGHT Approach to Voice Conversion," comes up as a pretty high hit on Google's search results for papers pertaining to this topic. My apologies in advance, however, as I no longer have access to the source code that was used to build this project. Unfortunately, my initial implementation was actually a bit buggy, and I guess I moved on before taking the time to properly clean up and/or publish what I had. Sorry to disappoint; it'd be great to get back into this topic someday, but I really don't have enough cycles to spare right now!

Fun Stuff

June 2014: The 2014 edition of Odyssey: The Speaker and Language Recognition Workshop was held at the University of Eastern Finland in the town of Joensuu. On top of a fulfilling technical program, a number of us (me and Pavel, photo taken by Pasi) also managed to take advantage of the outdoors and its breathtaking scenery. There were trails to run and cold lakes to step into -- an absolutely amazing time all around!
September 2012: Check out the awesomely-sized glazed donut - from Voodoo Doughnuts - that Ian and I shared in Portland, OR, during Interspeech 2012! There were other good ones too, such as a maple bacon bar that I had the pleasure of taking a bite of... (and until you get the chance to try it out for yourself, you'll just have to take my word for it: the complementary taste of salty+sweet was rather delicious!)

Last updated: 28 July 2016