Stephen H. Shum
sshum (at) csail (dot) mit (dot) edu
MIT Stata Center
32 Vassar Street #32-G424
Cambridge, MA 02139
Hello World!
In June 2016, I completed my Ph.D. in Electrical Engineering and
Computer Science (EECS)
at MIT.
For the last seven years, I have been lucky to be a part of the Spoken
Language
Systems (SLS) group
in the MIT Computer Science and Artificial Intelligence Laboratory
(CSAIL).
My advisors
are Jim Glass
and Najim
Dehak.
The general scope of my research is in signal processing
and machine learning as applied to speech, as well as other forms of
audio. More specifically, I am interested in the use of both
unsupervised and semi-supervised methods to perform statistical
inference on an audio signal. For the last few years, I have been
working primarily in the realm of language identification, speaker
recognition, and speaker diarization, but I continue to be very
interested in techniques from computational auditory scene
analysis (CASA)
that can be applied to more general, non-speech audio.
In May 2009, I graduated from UC
Berkeley. Go Bears!
Related Documents (as of June 2016)
Current/Recent Work
- On May the 4th, 2016,
I (finally) defended my Ph.D.
thesis! Here's a link to my
presentation slides.
- Photo credit: Carrie Cai.
- Things have been pretty quiet for the last year or so; I've been
working on something a little different while trying to wrap up my
thesis by June 2016. We're in the process of putting it all together
now, but as a teaser, we use
Jackie's work
on acoustic unit discovery to obtain a DNN-based feature
representation from untranscribed multi-lingual data, which can
then be used
for language
identification.
- UPDATE (July 2016): A write-up of our work has been
accepted to the IEEE/ACM Transactions on Audio, Speech, and
Language Processing. You can find
it here.
- At one of our weekly meetings in 2014, Jim randomly asked me how
much labeled data I thought was necessary to build a
state-of-the-art speaker recognition system. One thing led to
another, and this question resulted in my brief foray into active
learning. We put together an Interspeech
2014 paper
outlining a greedy heuristic algorithm for obtaining pairwise labels
from an otherwise unlabeled dataset, which works surprisingly well.
In my thesis, I hope to incorporate additional mathematical
rigor and results from real-world crowdsourcing experiments.
- During the summer of 2013, I (almost) re-lived my experiences from
five years prior, when I spent a summer in Baltimore working at the
Summer Workshop 2008 (WS08) put on by
the Center for Language and Speech Processing
(CLSP) at the Johns Hopkins
University. This time, I was at the Human Language Technology Center
of Excellence (HLTCOE) working
on Robust Speaker Recognition for Real Data
at SCALE
2013 (Summer Camp for Applied Language Exploration) with Daniel
Garcia-Romero, Alan McCree, and Douglas Reynolds.
- UPDATE (June 2014): Our work resulted in a couple of
publications and, to our pleasant surprise, best paper awards in
2014 (see below) at Odyssey: The Speaker and Language
Recognition Workshop, which was hosted by the University of
Eastern Finland under the midnight sun in the gorgeous
lake-filled region of Joensuu, Finland.
- The folks from
the Human Language
Technology Group at the MIT Lincoln Laboratory were kind enough
to let me work with them for the summer of
2012. With
Bill Campbell
and Doug
Reynolds, I explored the use of graph-based clustering methods
for speaker clustering on large-scale speech corpora. This work
turned into a paper
and poster presentation
at ICASSP 2013 in the gorgeous -- albeit rainy -- Canadian city of
Vancouver, B.C.
- I spent three years or so working on the problem of speaker
diarization ("who spoke when"). In particular, this work uses
a factor analysis-based framework to extract speaker-specific
features from speech segments. We explored different ways to cluster
these segments and determine the number of speakers present in the
audio stream. At the moment, we have settled on an integrated and
iterative method that involves spectral analysis and variational
inference. Check out my
Master's thesis/poster
or Interspeech 2011 and
2012 papers
for more details.
- UPDATE (May 2013): We've since put everything
together into a publication with IEEE Transactions. One of the
figures even made it on the cover of the Sept./Oct. 2013
issue! The paper (and cover) can be
found here and is
cited below.
- UPDATE (October 2013): We also received
some coverage
from the MIT News.
- During Interspeech 2011, Najim and I presented
a tutorial
on factor analysis-based methods for extracting speaker-specific
features from speech segments.
- My summer of 2011 was spent working as an intern
in Dick
Lyon's Machine Hearing Research Group at Google. With my
host, Tom
Walters, I was involved with developing and extending some new
techniques in audio fingerprinting for robust melody summarization
and cover song detection. We hoped to extend our
approaches to query-by-humming, but alas, we ran out of time. It was
an awesome experience nonetheless.
Publications
- S. Shum, David F. Harwath, Najim Dehak, and James
R. Glass, "On the Use of Acoustic Unit Discovery for Language
Recognition," IEEE/ACM Transactions on Audio, Speech, and
Language Processing, September
2016. [paper]
- S. Shum, "Overcoming Resource Limitations in the
Processing of Unlimited Speech: Applications to Speaker and
Language Recognition," Ph.D. Thesis, MIT Department of Electrical
Engineering and Computer Science, June
2016. [thesis] [slides]
- S. Shum, Najim Dehak, and James R. Glass, "Limited Labels
for Unlimited Data: Active Learning for Speaker Recognition,"
in Proceedings of Interspeech, Singapore, September
2014. [paper]
[slides]
- S. Shum, Douglas A. Reynolds, Daniel Garcia-Romero, and
Alan McCree, "Unsupervised Clustering Approaches for Domain
Adaptation in Speaker Recognition Systems," in Proceedings of
Odyssey, Joensuu, Finland, June
2014. [paper]
[slides]
- Daniel Garcia-Romero, Alan McCree, S. Shum, Niko Brummer,
and Carlos Vaquero, "Unsupervised Domain Adaptation for i-vector
Speaker Recognition," in Proceedings of Odyssey, Joensuu,
Finland, June
2014. [paper]
- S. Shum, Najim Dehak, Reda Dehak, and James R. Glass,
"Unsupervised Methods for Speaker Diarization: An Integrated and
Iterative Approach," IEEE Transactions on Audio, Speech, and
Language Processing, Vol. 21, No. 10, October 2013,
pp. 2015-2028. [paper]
- S. Shum, William M. Campbell, and Douglas A. Reynolds,
"Large-Scale Community Detection on Speaker Content Graphs,"
in Proceedings of ICASSP, Vancouver, B.C., Canada, May
2013. [paper] [poster]
- S. Shum, Najim Dehak, and Jim Glass, "On the Use of
Spectral and Iterative Methods for Speaker Diarization,"
in Proceedings of Interspeech, Portland, Oregon, September
2012. [paper] [poster]
- S. Shum, Najim Dehak, Ekapol Chuangsuwanich, Douglas
Reynolds, and Jim Glass, "Exploiting Intra-Conversation Variability
for Speaker Diarization," in Proceedings of Interspeech,
Florence, Italy, August
2011. [paper] [slides]
- S. Shum, "Unsupervised Methods for Speaker
Diarization," S.M. Thesis, MIT Department of Electrical Engineering
and Computer Science, June
2011. [thesis] [poster]
- William A. Martin Memorial S.M. Thesis Award
- D. Sturim, W. Campbell, N. Dehak, Z. Karam, A. McCree,
D. Reynolds, F. Richardson, S. Shum, and P. Torres-Carrasquillo,
"The MIT LL 2010 Speaker Recognition Evaluation System: Scalable
Language-Independent Speaker Recognition," in Proceedings of ICASSP,
Prague, Czech Republic, May 2011.
- S. Shum, Najim Dehak, Reda Dehak, and Jim Glass,
"Unsupervised Speaker Adaptation Based on the Cosine Similarity for
Text-Independent Speaker Verification," in Proceedings of IEEE
Odyssey, Brno, Czech Republic, June
2010. [paper]
- George Chen, John Kua, S. Shum, Nikhil Naikal, Matthew
Carlberg, and Avideh Zakhor, "Indoor Localization Algorithms for a
Human-Operated Backpack System," in the Fifth International
Symposium on 3D Data Processing, Visualization, and Transmission
(3DPVT), Paris, France, May 2010.
- W. Spiegl, G. Stemmer, E. Lasarcyk, V. Kolhatkar, A. Cassidy,
B. Potard, S. Shum, Y. Song, P. Xu, P. Beyerlein, J. Harnsberger,
and E. Noeth, "Analyzing Features for Automatic Age Estimation on
Cross-Sectional Data," in Proceedings of Interspeech, Brighton,
United Kingdom, September 2009.
- P. Beyerlein, A. Cassidy, V. Kolhatkar, E. Lasarcyk, E. Noeth,
B. Potard, S. Shum, Y. Song, W. Spiegl, G. Stemmer, and P. Xu,
"Vocal Aging Explained by Vocal Tract Modeling: 2008 JHU Summer
Workshop Final Report," Technical Report, August
2008.
Presentations
- S. Shum, "Overcoming Resource Limitations in the
Processing of Unlimited Speech: Applications to Speaker and Language
Recognition," Ph.D. Thesis Defense, Massachusetts Institute of
Technology, May 4,
2016. [slides]
- S. Shum, "From Vectors Representing Speech to Graphs
Representing Corpora: Reconciling how far we've come with how far we
still have to go," Chinese University of Hong Kong, Hong
Kong, November 2013. (Host: Prof. Helen
Meng) [slides]
- While on a trip to Hong Kong, I paid a visit to Helen's lab
at CUHK and had a lovely time exploring the campus of my
mother's alma mater. Thanks to Helen for being such a gracious
host!
- In this seminar, I presented a broad overview of the work
that is outlined in more detail at ICASSP 2013 and Odyssey
2014. A fair amount of this material also contributed to my
Ph.D. thesis proposal, which I (finally) submitted in July 2014.
- S. Shum, "Unsupervised Methods for Speaker Diarization:
An Integrated and Iterative Approach,"
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany,
November 2012. (Host: Prof. Elmar
Noeth) [poster] [slides]
- I spent a few days traveling around Austria and Germany and
decided to pay a visit to Elmar, who is one of the first people
I ever worked with in speech processing. He kindly hosted this
seminar -- I really appreciated the generous hospitality -- and
we had a great time catching up.
- Najim Dehak and S. Shum, "Data-Driven Factor
Analysis for Characterizing Spoken Audio," Information Science,
Signal Processing, and their Applications (ISSPA) Tutorial, July
2012. [slides]
- Because Najim's graduate university, École de Technologie
Supérieure, was hosting this year's ISSPA, they invited him to
give another version of our Interspeech 2011 tutorial. As a
result, I was also roped into enjoying myself in Montreal for a
few days. Bummer! :)
- S. Shum, "An Introduction to Audio Fingerprinting," SLS
Group Meeting, October
2011. [slides]
- This is a talk I gave during one of our group meetings; it is
based somewhat on my summer experiences and some general
readings.
- Najim Dehak and S. Shum, "Low-dimensional Speech
Representation Based on Factor Analysis and its Applications,"
Interspeech Tutorial, August
2011. [slides]
- This was the first and (what we thought would be) last tutorial
that Najim and I gave about i-vectors and their application to speaker
recognition, language identification, and speaker diarization.
Undergraduate Projects
- During my last semester as an EECS undergraduate at UC Berkeley, I
worked with George
Chen (now at MIT) and Prof. Avideh Zakhor in the Video and Image
Processing (VIP)
Lab.
- I participated as an Undergraduate Researcher on the
Vocal Aging Explained Team at the Johns Hopkins University,
Center for Language and Speech Processing
Summer Workshop 2008.
When I came back, I gave a
Lunch Talk to the Speech Group at ICSI about how much I learned,
how much I did, and, of course, how much I enjoyed the east coast
weather (...that I now have to experience on a daily basis!).
- My initiation into research occurred in January 2008 when I was
given an opportunity to join the International Computer Science
Institute (ICSI) and work
under Prof. Nelson Morgan
and Arlo
Faria. In my time there, I learned A LOT and dabbled in building
an automatic karaoke system (lyrics + music/video alignment) and
working on some voice conversion
[CS281A, EE225D]. By
the end of it all, I had honestly accomplished nothing, but still
had a lot of fun.
- For some reason, it seems as though my course project paper,
entitled "A GMM-STRAIGHT Approach to Voice Conversion," comes up
as a pretty high hit on Google's search results for papers
pertaining to this topic. My apologies in advance, however, as
I no longer have access to the source code that was used to
build this project. Unfortunately, my initial implementation
was actually a bit buggy, and I guess I moved on before taking
the time to properly clean up and/or publish what I had.
Sorry to disappoint; it'd be great to get back into this topic
someday, but I really don't have enough cycles to spare right
now!
Fun Stuff
- June 2014: The 2014 edition of Odyssey: The Speaker and
Language Recognition Workshop was held
at the University of Eastern Finland in the town of
Joensuu. On top of a fulfilling technical program, a number
of us
(me and Pavel, photo taken by Pasi)
also managed to take advantage of the
outdoors
and its
breathtaking scenery. There
were trails to run and cold
lakes to step into -- an
absolutely amazing time all around!
- September 2012: Check out the
awesomely-sized glazed
donut - from Voodoo Doughnuts -
that Ian and I shared in Portland, OR, during Interspeech 2012!
There were other good ones too, such as a maple bacon bar
that I had the pleasure of taking a bite of... (and until you get
the chance to try it out for yourself, you'll just have to take my
word for it: the complementary taste of salty+sweet was rather
delicious!)
Last updated: 28 July 2016