Yung-Sung Chuang

MIT EECS PhD Student @ CSAIL
Office: 32-G436

Yung-Sung Chuang (莊永松) How to pronounce?

Hi! I'm a fourth-year PhD student in Electrical Engineering and Computer Science at Massachusetts Institute of Technology, where I work with Jim Glass at CSAIL.

My research primarily focuses on natural language processing and large language models (LLMs), with a particular interest in improving their factuality and reliability. In DoLa, we proposed a decoding strategy that enhances LLM factuality by contrasting the knowledge across different transformer layers. In Lookback Lens, we introduced a method that detects and mitigates contextual hallucinations in LLMs using attention maps. Most recently, in SelfCite, we developed a self-supervised framework that enables LLMs to generate fine-grained, sentence-level citations by leveraging context ablation as a reward signal.

I also explore retrieval-based approaches to strengthen LLM by grounding answers in real documents. For instance, in Expand, Rerank, and Retrieve, we proposed query reranking to achieve more accurate retrieval results for open-domain QA. In DiffCSE, we built a contrastive learning method based on the differences between similar sentences to further boost the quality of sentence embeddings.

I was fortunate to intern at FAIR Meta, Microsoft, and MIT-IBM Watson AI Lab. Before joining MIT, I was an undergraduate student in Electrical Engineering at National Taiwan University, where I worked with Hung-Yi Lee, Yun-Nung (Vivian) Chen, and Lin-shan Lee. Here is my Curriculum Vitae.

Email: yungsung [AT] mit.edu

Links: [CV] [Twitter] [Github] [Google Scholar] [DBLP] [Blog] [Linkedin] [Instagram]

Talks

Reducing Hallucinations in LLMs via Decoding, Detection, and Citation
MIT CSAIL: EI Seminar, April 24, 2025

For my other talks at the conferences, please see https://www.youtube.com/@yung-sung

Publications

For a full list of papers, see my Google Scholar.

2025

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
To appear in ICML, 2025.
[bib] [abstract] [pdf] [code]

We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.

Learning to Attribute with Attention
Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry
arXiv preprint arXiv:2504.13752, 2025.
[bib] [abstract] [pdf] [code]

Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality.

2024

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Yung-Sung Chuang Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, James Glass
In The Conference on Empirical Methods in Natural Language Processing, 2024.
[bib] [abstract] [pdf] [code]

When asked to summarize articles or answer questions given a passage, large language models (LLMs) can hallucinate details and respond with unsubstantiated answers that are inaccurate with respect to the input context. This paper describes a simple approach for detecting such contextual hallucinations. We hypothesize that contextual hallucinations are related to the extent to which an LLM attends to information in the provided context versus its own generations. Based on this intuition, we propose a simple hallucination detection model whose input features are given by the ratio of attention weights on the context versus newly generated tokens (for each attention head). We find that a linear classifier based on these lookback ratio features is as effective as a richer detector that utilizes the entire hidden states of an LLM or a text-based entailment model. The lookback ratio-based detector -- Lookback Lens -- is found to transfer across tasks and even models, allowing a detector that is trained on a 7B model to be applied (without retraining) to a larger 13B model. We further apply this detector to mitigate contextual hallucinations, and find that a simple classifier-guided decoding approach is able to reduce the amount of hallucination, for example by 9.6% in the XSum summarization task.

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He
In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[bib] [abstract] [pdf] [code]

Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLMs has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves the truthfulness across multiple choices tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts.

Curiosity-driven Red-teaming for Large Language Models
Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal
In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[bib] [abstract] [pdf] [code]

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a red team of human testers to design input prompts that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. Current RL methods only generate a small number of effective test cases, leading to low coverage of the span of prompts that elicit undesirable responses from the target LLM. We propose a method of curiosity-driven red teaming (CRT) that optimizes for novelty to increase the coverage of test cases while maintaining or increasing their effectiveness compared to existing methods. CRT successfully provokes toxic responses from the LLaMA2 model, even after it has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at https://github.com/Improbable-AI/curiosity_redteam.

Semstamp: A semantic watermark with paraphrastic robustness for text generation
Abe Bohan Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, Yulia Tsvetkov
In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
[bib] [abstract] [pdf] [code]

Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design. To address this issue, we propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH), which partitions the semantic space of sentences. The algorithm encodes and LSH-hashes a candidate sentence generated by an LLM, and conducts sentence-level rejection sampling until the sampled sentence falls in watermarked partitions in the semantic embedding space. A margin-based constraint is used to enhance its robustness. To show the advantages of our algorithm, we propose a "bigram" paraphrase attack using the paraphrase that has the fewest bigram overlaps with the original sentence. This attack is shown to be effective against the existing token-level watermarking method. Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on both common and bigram paraphrase attacks, but also is better at preserving the quality of generation.

Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning
Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, James Glass
In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Findings, 2024.
[bib] [abstract] [pdf] [code]

How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning? We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic reasoning, natural language understanding, and instruction following tasks. Our approach prompts a language model to generate full Python programs that define functions over data structures which contain natural language representations of structured knowledge. A Python interpreter then executes the generated code and prints the output. Despite using a task-general prompt, we find that this approach can improve upon strong baselines across a range of different tasks including math and symbolic reasoning, text classification, question answering, and instruction following. We further find the generated programs are often interpretable and enable post-hoc verification of the intermediate reasoning steps.

Joint Inference of Retrieval and Generation for Passage Re-ranking
Wei Fang, Yung-Sung Chuang, James Glass
In The 18th Conference of the European Chapter of the Association for Computational Linguistics: Findings, 2024.
[bib] [abstract] [pdf]

Passage retrieval is a crucial component of modern open-domain question answering (QA) systems, providing information for downstream QA components to generate accurate and transparent answers. This study focuses on passage re-ranking, proposing a simple yet effective method, Joint Passage Re-ranking (JPR), that optimizes the mutual information between query and passage distributions, integrating both cross-encoders and generative models in the re-ranking process. Experimental results demonstrate that JPR outperforms conventional re-rankers and language model scorers in both open-domain QA retrieval settings and diverse retrieval benchmarks under zero-shot settings.

Joint Dual Learning With Mutual Information Maximization for Natural Language Understanding and Generation in Dialogues
Shang-Yu Su, Yung-Sung Chung, Yun-Nung Chen
In IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 32), pages 2445 - 2452, March 2024.
[bib] [abstract] [pdf]

Modular conversational systems heavily rely on the performance of their natural language understanding (NLU) and natural language generation (NLG) components. NLU focuses on extracting core semantic concepts from input texts, while NLG constructs coherent sentences based on these extracted semantics. Inspired by information theory in digital communication, we introduce a one-way communication model that mirrors human conversations, comprising two distinct phases: (1) the conversion of thoughts into messages, similar to NLG, and (2) the comprehension of received messages, similar to NLU. This paper presents a novel algorithm that trains NLU and NLG collaboratively by concatenating their models and maximizing mutual information between inputs and outputs. This approach efficiently facilitates the transmission of semantics, leading to enhanced learning performance for both components. Our experimental results, based on three benchmark datasets, consistently demonstrate significant improvements for both NLU and NLG tasks, highlighting the practical promise of our proposed method.

SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering
Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Lin-shan Lee
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
[bib] [abstract] [pdf]

Spoken Question Answering (SQA) is essential for machines to reply to user's question by finding the answer span within a given spoken passage. SQA has been previously achieved without ASR to avoid recognition errors and Out-of-Vocabulary (OOV) problems. However, the real-world problem of Open-domain SQA (openSQA), in which the machine needs to first retrieve passages that possibly contain the answer from a spoken archive, was never considered. This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. SpeechDPR learns a sentence-level semantic representation by distilling knowledge from the cascading model of unsupervised ASR (UASR) and text dense retriever (TDR). No manually transcribed speech data is needed. Initial experiments showed performance comparable to the cascading model of UASR and TDR, and significantly better when UASR was poor, verifying that this approach is more robust to speech recognition errors.

A Large-Scale Evaluation of Speech Foundation Models
Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee
In IEEE/ACM Transactions on Audio, Speech, and Language Processing, page(s) 1 - 16, April 2024.
[bib] [abstract] [pdf]

The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific data collection and modeling. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. To bridge this gap, we establish the Speech processing Universal PERformance Benchmark (SUPERB). SUPERB represents an ecosystem designed to evaluate foundation models across a wide range of speech processing tasks, facilitating the sharing of results on an online leaderboard and fostering collaboration through a community-driven benchmark database that aids in new development cycles. We present a unified learning framework for solving the speech processing tasks in SUPERB with the frozen foundation model followed by task-specialized lightweight prediction heads. Combining our results with community submissions, we verify that the framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models and the statistical significance and robustness of the benchmark.

2023

Audio-Visual Neural Syntax Acquisition
Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass
In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) .
[bib] [abstract] [pdf]

We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text. By training on paired images and spoken captions, AV-NSL exhibits the capability to infer meaningful phrase structures that are comparable to those derived by naturally-supervised text parsers, for both English and German. Our findings extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and present one approach to bridge the gap between the two topics.

On Robustness-Accuracy Characterization of Large Language Models using Synthetic Datasets
Ching-Yun Ko, Pin-Yu Chen, Payel Das, Yung-Sung Chuang, Luca Daniel
In Workshop on Efficient Systems for Foundation Models@ ICML2023 , July 2023.
[bib] [abstract] [pdf]

Despite the impressive capability of large language models (LLMs) in solving different downstream tasks, new concerns about proper performance evaluation have been raised, especially for test-data leakage caused by accidentally including them during pretraining, or by indirectly exposing them through API calls for evaluation. Motivated by these, in this paper, we propose a new evaluation workflow that generates steerable synthetic language datasets and proxy tasks for benchmarking the performance of pertained LLMs on sentence classification tasks. This approach allows for better characterization of the joint analysis on the robustness and accuracy of LLMs without risking sensitive information leakage. Verified on various pretrained LLMs, the proposed approach demonstrates promising high correlation with real downstream performance.

Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS
Cheng-Han Chiang, Yung-Sung Chuang, James Glass, Hung-yi Lee
In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023) , 2023.
[bib] [abstract] [pdf] [data]

Existing sentence textual similarity benchmark datasets only use a single number to summarize how similar the sentence encoder's decision is to humans'. However, it is unclear what kind of sentence pairs a sentence encoder (SE) would consider similar. Moreover, existing SE benchmarks mainly consider sentence pairs with low lexical overlap, so it is unclear how the SEs behave when two sentences have high lexical overlap. We introduce a high-quality SE diagnostic dataset, HEROS. HEROS is constructed by transforming an original sentence into a new sentence based on certain rules to form a minimal pair, and the minimal pair has high lexical overlaps. The rules include replacing a word with a synonym, an antonym, a typo, a random word, and converting the original sentence into its negation. Different rules yield different subsets of HEROS. By systematically comparing the performance of over 60 supervised and unsupervised SEs on HEROS, we reveal that most unsupervised sentence encoders are insensitive to negation. We find the datasets used to train the SE are the main determinants of what kind of sentence pairs an SE considers similar. We also show that even if two SEs have similar performance on STS benchmarks, they can have very different behavior on HEROS. Our result reveals the blind spot of traditional STS benchmarks when evaluating SEs.

Expand, Rerank, and Retrieve: Query Reranking for Open-Domain Question Answering
Yung-Sung Chuang, Wei Fang, Shang-Wen Li, Wen-tau Yih, James Glass
In The 61st Annual Meeting of the Association for Computational Linguistics: Findings, 2023.
[bib] [abstract] [pdf] [code]

We propose EAR, a query Expansion And Reranking approach for improving passage retrieval, with the application to open-domain question answering. EAR first applies a query expansion model to generate a diverse set of queries, and then uses a query reranker to select the ones that could lead to better retrieval results. Motivated by the observation that the best query expansion often is not picked by greedy decoding, EAR trains its reranker to predict the rank orders of the gold passages when issuing the expanded queries to a given retriever. By connecting better the query expansion model and retriever, EAR significantly enhances a traditional sparse retrieval method, BM25. Empirically, EAR improves top-5/20 accuracy by 3-8 and 5-10 points in in-domain and out-of-domain settings, respectively, when compared to a vanilla query expansion model, GAR, and a dense retrieval model, DPR.

SAIL: Search-Augmented Instruction Learning
Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass
In Findings of the Association for Computational Linguistics: EMNLP, 2023.
[bib] [abstract] [pdf] [code] [demo] [web]

Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information. In this work, we propose search-augmented instruction learning (SAIL), which grounds the language generation and instruction following abilities on complex search results generated by in-house and external search engines. With an instruction tuning corpus, we collect search results for each training case from different search APIs and domains, and construct a new search-grounded training set containing \textit{(instruction, grounding information, response)} triplets. We then fine-tune the LLaMA-7B model on the constructed training set. Since the collected results contain unrelated and disputing languages, the model needs to learn to ground on trustworthy search results, filter out distracting passages, and generate the target response. The search result-denoising process entails explicit trustworthy information selection and multi-hop reasoning, since the retrieved passages might be informative but not contain the instruction-following answer. Experiments show that the fine-tuned SAIL-7B model has a strong instruction-following ability, and it performs significantly better on transparency-sensitive tasks, including open-ended question answering and fact checking.

Interpretable Unified Language Checking
Hongyin Luo, Tianhua Zhang, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen Meng, James Glass
In arXiv preprint, 2023.
[bib] [abstract] [pdf] [code]

Despite recent concerns about undesirable behaviors generated by large language models (LLMs), including non-factual, biased, and hateful language, we find LLMs are inherent multi-task language checkers based on their latent representations of natural and social knowledge. We present an interpretable, unified, language checking (UniLC) method for both human and machine-generated language that aims to check if language input is factual and fair. While fairness and fact-checking tasks have been handled separately with dedicated models, we find that LLMs can achieve high performance on a combination of fact-checking, stereotype detection, and hate speech detection tasks with a simple, few-shot, unified set of prompts. With the ``1/2-shot'' multi-task language checking method proposed in this work, the GPT3.5-turbo model outperforms fully supervised baselines on several language tasks. The simple approach and results suggest that based on strong latent knowledge representations, an LLM can be an adaptive and explainable tool for detecting misinformation, stereotypes, and hate speech.

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[bib] [abstract] [pdf] [code & dataset]

Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for languages other than English still lags. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross entropy based objective which forces the distribution over the student’s text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also conducted an analysis on the effectiveness of different multilingual text models as teachers.

Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Ming Y Lu, Bowen Chen, Andrew Zhang, Drew FK Williamson, Richard J Chen, Tong Ding, Long Phi Le, Yung-Sung Chuang, Faisal Mahmood
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[bib] [abstract] [pdf] [supplementary material]

Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and have been designed to perform downstream tasks involving only small to medium sized-images, neither of which are applicable to the emerging field of computational pathology where there are limited publicly available paired image-text datasets and each image can span up to 100,000 x 100,000 pixels in dimensions. In this paper we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models to gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pretrain our text encoder. By effectively leveraging strong pretrained encoders, our best model pretrained on over 33k histopathology image-caption pairs achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks. Our code is available at: https://github. com/mahmoodlab/MI-Zero.

2022

Recent Advances in Pre-trained Language Models: Why Do They Work and How Do They Work
Cheng-Han Chiang, Yung-Sung Chuang, Hung-Yi Lee
In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts, 2022.
[bib] [abstract] [pdf] [tutorial page] [video]

Pre-trained language models (PLMs) are language models that are pre-trained on large-scaled corpora in a self-supervised fashion. These PLMs have fundamentally changed the natural language processing community in the past few years. In this tutorial, we aim to provide a broad and comprehensive introduction from two perspectives: why those PLMs work, and how to use them in NLP tasks. The first part of the tutorial shows some insightful analysis on PLMs that partially explain their exceptional downstream performance. The second part first focuses on emerging pre-training methods that enable PLMs to perform diverse downstream tasks and then illustrates how one can apply those PLMs to downstream tasks under different circumstances. These circumstances include fine-tuning PLMs when under data scarcity, and using PLMs with parameter efficiency. We believe that attendees of different backgrounds would find this tutorial informative and useful.

On the interplay between sparsity, naturalness, intelligibility, and prosody in speech synthesis
Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass.
In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[bib] [abstract] [pdf] [code] [video]

Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed by large-scale subjective tests and objective measures. Code and pruned models are made available to facilitate future research on efficiency in TTS.

DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings
Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Scott Wen-tau Yih, Yoon Kim, James Glass.
In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022. (Oral Paper)
[bib] [abstract] [pdf] [code] [video]

We propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings. DiffCSE learns sentence embeddings that are sensitive to the difference between the original sentence and an edited sentence, where the edited sentence is obtained by stochastically masking out the original sentence and then sampling from a masked language model. We show that DiffSCE is an instance of equivariant contrastive learning (Dangovski et al., 2021), which generalizes contrastive learning and learns representations that are insensitive to certain types of augmentations and sensitive to other "harmful" types of augmentations. Our experiments show that DiffCSE achieves state-of-the-art results among unsupervised sentence representation learning methods, outperforming unsupervised SimCSE by 2.3 absolute points on semantic textual similarity tasks.

DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering
Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shuyan Dong, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Lin-shan Lee.
In Interspeech, 2022.
[bib] [abstract] [pdf] [code]

Spoken Question Answering (SQA) is to find the answer from a spoken document given a question, which is crucial for personal assistants when replying to the queries from the users. Existing SQA methods all rely on Automatic Speech Recognition (ASR) transcripts. Not only does ASR need to be trained with massive annotated data that are time and cost-prohibitive to collect for low-resourced languages, but more importantly, very often the answers to the questions include name entities or out-of-vocabulary words that cannot be recognized correctly. Also, ASR aims to minimize recognition errors equally over all words, including many function words irrelevant to the SQA task. Therefore, SQA without ASR transcripts (textless) is always highly desired, although known to be very difficult. This work proposes Discrete Spoken Unit Adaptive Learning (DUAL), leveraging unlabeled data for pre-training and fine-tuned by the SQA downstream task. The time intervals of spoken answers can be directly predicted from spoken documents. We also release a new SQA benchmark corpus, NMSQA, for data with more realistic scenarios. We empirically showed that DUAL yields results comparable to those obtained by cascading ASR and text QA model and robust to real-world data. Our code and model will be open-sourced.

2021

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, James Glass.
In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021. (Spotlight Paper)
[bib] [abstract] [pdf]

Recent work on speech self-supervised learning (speech SSL) demonstrated the benefits of scale in learning rich and transferable representations for Automatic Speech Recognition (ASR) with limited parallel data. It is then natural to investigate the existence of sparse and transferrable subnetworks in pre-trained speech SSL models that can achieve even better low-resource ASR performance. However, directly applying widely adopted pruning methods such as the Lottery Ticket Hypothesis (LTH) is suboptimal in the computational cost needed. Moreover, contrary to what LTH predicts, the discovered subnetworks yield minimal performance gain compared to the original dense network. In this work, we propose Prune-Adjust- Re-Prune (PARP), which discovers and finetunes subnetworks for much better ASR performance, while only requiring a single downstream finetuning run. PARP is inspired by our surprising observation that subnetworks pruned for pre-training tasks only needed to be slightly adjusted to achieve a sizeable performance boost in downstream ASR tasks. Extensive experiments on low-resource English and multi-lingual ASR show (1) sparse subnetworks exist in pre-trained speech SSL, and (2) the computational advantage and performance gain of PARP over baseline pruning methods. On the 10min Librispeech split without LM decoding, PARP discovers subnetworks from wav2vec 2.0 with an absolute 10.9%/12.6% WER decrease compared to the full model. We demonstrate PARP mitigates performance degradation in cross-lingual mask transfer, and investigate the possibility of discovering a single subnetwork for 10 spoken languages in one run.

SUPERB: Speech processing Universal PERformance Benchmark
Shu-wen Yang, Po-Han Chi*, Yung-Sung Chuang*, Cheng-I Jeff Lai*, Kushal Lakhotia*, Yist Y. Lin*, Andy T. Liu*, Jiatong Shi*, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-Yi Lee.
In Interspeech, 2021.
[bib] [abstract] [pdf] [code] [video] [leaderboard]

Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.

Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation
Shun-Po Chuang*, Yung-Sung Chuang*, Chih-Chiang Chang*, Hung-Yi Lee.
In The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Findings, 2021.
[bib] [abstract] [pdf] [code] [video]

We study the possibilities of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC), and use CTC-based automatic speech recognition as an auxiliary task to improve the performance. CTC's success on translation is counter-intuitive due to its monotonicity assumption, so we analyze its reordering capability. Kendall's tau distance is introduced as the quantitative metric, and gradient-based visualization provides an intuitive way to take a closer look into the model. Our analysis shows that transformer encoders have the ability to change the word order and points out the future research direction that worth being explored more on non-autoregressive speech translation.

Mitigating Biases in Toxic Language Detection through Invariant Rationalization
Yung-Sung Chuang, Mingye Gao, Hongyin Luo, James Glass, Hung-Yi Lee, Yun-Nung Chen, Shang-Wen Li.
In The 5th Workshop on Online Abuse and Harms at The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (WOAH@ACL-IJCNLP), 2021.
[bib] [abstract] [pdf] [code] [video]

Automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse. However, biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. The biases make the learned models unfair and can even exacerbate the marginalization of people. Considering that current debiasing methods for general natural language understanding tasks cannot effectively mitigate the biases in the toxicity detectors, we propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns (e.g., identity mentions, dialect) to toxicity labels. We empirically show that our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining
Cheng-I Lai, Yung-Sung Chuang, Hung-Yi Lee, Shang-Wen Li, James Glass.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[bib] [abstract] [pdf] [code] [video]

Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. In this paper, we propose a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data. We study two semi-supervised settings for the ASR component: supervised pretraining on transcribed speech, and unsupervised pretraining by replacing the ASR encoder with self-supervised speech representations, such as wav2vec. In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation. Experiments on ATIS show that our SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even though environmental noise is present and a limited amount of labeled semantics data is available for training.

2020

Lifelong Language Knowledge Distillation
Yung-Sung Chuang, Shang-Yu Su, Yun-Nung Chen.
In The Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
[bib] [abstract] [pdf] [code] [video]

It is challenging to perform lifelong language learning (LLL) on a stream of different tasks without any performance degradation comparing to the multi-task counterparts. To address this issue, we present Lifelong Language Knowledge Distillation (L2KD), a simple but efficient method that can be easily applied to existing LLL architectures in order to mitigate the degradation. Specifically, when the LLL model is trained on a new task, we assign a teacher model to first learn the new task, and pass the knowledge to the LLL model via knowledge distillation. Therefore, the LLL model can better adapt to the new task while keeping the previously learned knowledge. Experiments show that the proposed L2KD consistently improves previous state-of-the-art models, and the degradation comparing to multi-task models in LLL tasks is well mitigated for both sequence generation and text classification tasks.

Dual Inference for Improving Language Understanding and Generation
Yung-Sung Chuang*, Shang-Yu Su*, Yun-Nung Chen.
In The Conference on Empirical Methods in Natural Language Processing: Findings (EMNLP), 2020.
[bib] [abstract] [pdf] [code]

Natural language understanding (NLU) and Natural language generation (NLG) tasks hold a strong dual relationship, where NLU aims at predicting semantic labels based on natural language utterances and NLG does the opposite. The prior work mainly focused on exploiting the duality in model training in order to obtain the models with better performance. However, regarding the fast-growing scale of models in the current NLP area, sometimes we may have difficulty retraining whole NLU and NLG models. To better address the issue, this paper proposes to leverage the duality in the inference stage without the need of retraining. The experiments on three benchmark datasets demonstrate the effectiveness of the proposed method in both NLU and NLG, providing the great potential of practical usage.

SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering
Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee, Lin-shan Lee.
In Interspeech, 2020. (Interspeech 2020 Travel Grant)
[bib] [abstract] [pdf] [video]

While various end-to-end models for spoken language understanding tasks have been explored recently, this paper is probably the first known attempt to challenge the very difficult task of end-to-end spoken question answering (SQA). Learning from the very successful BERT model for various text processing tasks, here we proposed an audio-and-text jointly learned SpeechBERT model. This model outperformed the conventional approach of cascading ASR with the following text question answering (TQA) model on datasets including ASR errors in answer spans, because the end-to-end model was shown to be able to extract information out of audio data before ASR produced errors. When ensembling the proposed end-to-end model with the cascade architecture, even better performance was achieved. In addition to the potential of end-to-end SQA, the SpeechBERT can also be considered for many other spoken language understanding tasks just as BERT for many text processing tasks.

2019

Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation
Alexander Te-Wei Shieh*, Yung-Sung Chuang*, Shang-Yu Su, Yun-Nung Chen.
In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019) at The 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, 2019.
[bib] [abstract] [pdf] [code]

Randomized controlled trials (RCTs) represent the paramount evidence of clinical medicine. Using machines to interpret the massive amount of RCTs has the potential of aiding clinical decision-making. We propose a RCT conclusion generation task from the PubMed 200k RCT sentence classification dataset to examine the effectiveness of sequence-to-sequence models on understanding RCTs. We first build a pointer-generator baseline model for conclusion generation. Then we fine-tune the state-of-the-art GPT-2 language model, which is pre-trained with general domain data, for this new medical domain task. Both automatic and human evaluation show that our GPT-2 fine-tuned models achieve improved quality and correctness in the generated conclusions compared to the baseline pointer-generator model. Further inspection points out the limitations of this current approach and future directions to explore.

Honors

Dean’s list (4 times), Electrical Engineering Dept. at NTU, Spring ’18, Spring ’19, Fall ’19, Spring ’20
Irving T. Ho Memorial Scholarship (2 times), EECS at NTU, Fall ’18, Fall ’19
Travel Grant, INTERSPEECH 2020 conference, Sep. 2020
Appier Best Application Award, 2020 NTU CSIE Undergrad Special Research Exhibition, Jun. 2020
2nd Place & Appier 1st Award, 2019 NTU CSIE Undergrad Special Research Exhibition, Jun. 2019
2nd Place, 2019 NTUEE Undergraduate Innovation Award, Jun. 2019
1st Place, 2018 H. Spectrum Demo Day (out of 21 teams), Jul. 2018
1st Place, NCTS Health Hackathon 2018 (out of 18 teams), Jun. 2018
Top 8 Finalist, Microsoft Imagine Cup Taiwan National Final 2018, Apr. 2018
Best Tech Award & Microsoft Enterprise Award, MakeNTU 2018 (out of 50 teams), Mar. 2018
1st place of Dept. of Transportation, HackNTU 2017 (out of 100+ teams), Jul. 2017

Services

Reviewer:
NeurIPS 2021, 2022, 2023, 2024
ICLR 2022, 2023, 2024, 2025
ICML 2022, 2023, 2024, 2025
ACL ARR 2023, 2024, 2025
EMNLP 2022, 2023
ACL 2023
AAAI 2023
ICASSP 2022, 2023
TASL 2023, 2024, 2025

This website is built from the source code of Nelson F. Liu's awesome website (https://nelsonliu.me).