See my Google Scholar page for an up-to-date list of publications.
2023
Efficiently Controlling Multiple Risks with Pareto Testing
Bracha Laufer-Goldshtein,
Adam Fisch,
Regina Barzilay,
and Tommi Jaakkola
In the 11th International Conference on Learning Representations (ICLR)
Machine learning applications frequently come with multiple diverse objectives and constraints that can change over time. Accordingly, trained models can be tuned with sets of hyper-parameters that affect their predictive behavior (e.g., their run-time efficiency versus error rate). As the number of constraints and hyper-parameter dimensions grow, naively selected settings may lead to sub-optimal and/or unreliable results. We develop an efficient method for calibrating models such that their predictions provably satisfy multiple explicit and simultaneous statistical guarantees (e.g., upper-bounded error rates), while also optimizing any number of additional, unconstrained objectives (e.g., total run-time cost). Building on recent results in distribution-free, finite-sample risk control for general losses, we propose Pareto Testing: a two-stage process which combines multi-objective optimization with multiple hypothesis testing. The optimization stage constructs a set of promising combinations on the Pareto frontier. We then apply statistical testing to this frontier only to identify configurations that have (i) high utility with respect to our objectives, and (ii) guaranteed risk levels with respect to our constraints, with specifiable high probability. We demonstrate the effectiveness of our approach to reliably accelerate the execution of large-scale Transformer models in natural language processing (NLP) applications. In particular, we show how Pareto Testing can be used to dynamically configure multiple inter-dependent model attributes – including the number of layers computed before exiting, number of attention heads pruned, or number of text tokens considered – to simultaneously control and optimize various accuracy and cost metrics.
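For concreteness, below is a minimal sketch of the testing stage only, under simplifying assumptions: losses are binary (0/1), `ordered_configs` is assumed to already be ordered by the optimization stage, and `loss_fn` and `calib_data` are hypothetical inputs. Each configuration is certified with a one-sided binomial test under fixed-sequence testing; this illustrates the idea rather than reproducing the paper's exact procedure.

```python
import numpy as np
from scipy import stats

def certify_configs(ordered_configs, loss_fn, calib_data, alpha=0.1, delta=0.05):
    """Toy testing stage: walk configurations in the order proposed by the
    Pareto-front optimization stage and keep those whose risk is certified
    to be <= alpha via a one-sided binomial test. Fixed-sequence testing
    stops at the first failure, which controls the family-wise error rate
    at delta without any multiplicity correction."""
    certified = []
    for cfg in ordered_configs:
        losses = np.array([loss_fn(cfg, x) for x in calib_data])  # 0/1 losses
        n, k = len(losses), int(losses.sum())
        # H0: risk >= alpha; reject when the observed losses are unusually low.
        p_value = stats.binom.cdf(k, n, alpha)
        if p_value <= delta:
            certified.append(cfg)
        else:
            break  # stop at the first non-rejected hypothesis
    return certified
```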
2022
Calibrated Selective Classification
Adam Fisch,
Tommi Jaakkola,
and Regina Barzilay
Transactions on Machine Learning Research (TMLR)
Selective classification allows models to abstain from making predictions (e.g., say "I don’t know") when in doubt in order to obtain better effective accuracy. While typical selective models can succeed at producing more accurate predictions on average, they may still allow for wrong predictions that have high confidence, or skip correct predictions that have low confidence. Providing calibrated uncertainty estimates alongside predictions—probabilities that correspond to true frequencies—can be as important as having predictions that are simply accurate on average. Uncertainty estimates, however, can sometimes be unreliable. In this paper, we develop a new approach to selective classification in which we propose a method for rejecting examples with “uncertain” uncertainties. By doing so, we aim to make predictions with well-calibrated uncertainty estimates over the distribution of accepted examples, a property we call selective calibration. We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model. In particular, our work focuses on achieving robust calibration, where the model is intentionally designed to be tested on out-of-domain data. We achieve this through a training strategy inspired by distributionally robust optimization, in which we apply simulated input perturbations to the known, in-domain training data. We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.
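As an illustration of the selective calibration notion (an assumed metric sketch, not the paper's trainable objective), the snippet below computes an expected calibration error restricted to the examples a selector accepts; all names are hypothetical.

```python
import numpy as np

def selective_ece(confidences, correct, accept_mask, num_bins=10):
    """Expected calibration error computed only over accepted examples,
    i.e., a simple notion of selective calibration error."""
    conf = confidences[accept_mask]
    corr = correct[accept_mask].astype(float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece, n = 0.0, len(conf)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(conf[in_bin].mean() - corr[in_bin].mean())
            ece += (in_bin.sum() / n) * gap
    return ece
```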
Confident Adaptive Language Modeling
Tal Schuster*,
Adam Fisch*,
Jai Gupta,
Mostafa Dehghani,
Dara Bahri,
Vinh Q. Tran,
Yi Tay,
and Donald Metzler
In the 36th Conference on Neural Information Processing Systems (NeurIPS)
* = Equal contribution.
Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models’ size, potentially leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models’ full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute—a speedup of up to 3×—while provably maintaining high performance.
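A minimal sketch of the early-exit idea, assuming a list of decoder `layers`, a shared `output_head`, and a pre-calibrated confidence `threshold` (all hypothetical names). The released CALM implementation differs in how the threshold is calibrated against sequence-level constraints and how skipped hidden states are handled.

```python
import torch

def early_exit_step(x, layers, output_head, threshold=0.9):
    """One decoding step with per-token early exit: stop stacking layers as
    soon as the output head is confident enough. `layers` are the decoder
    blocks, `output_head` maps hidden states to vocabulary logits, and
    `threshold` would be calibrated to meet a sequence-level constraint."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        probs = torch.softmax(output_head(h), dim=-1)
        if probs.max() >= threshold:      # confident: exit before the top layer
            return probs, i + 1           # number of layers actually computed
    return probs, len(layers)             # fell through: used the full model
```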
2021
Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao*,
Adam Fisch*,
and Danqi Chen
In the 59th Annual Meeting of the Association for Computational Linguistics (ACL)
* = Equal contribution.
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF (better few-shot fine-tuning of language models), a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low-resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
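To make the prompt-based setup concrete, here is an illustrative template-plus-verbalizer sketch for a binary sentiment task; the specific strings and the `build_prompt` helper are assumptions for this example, and LM-BFF searches for templates and label words automatically rather than fixing them by hand.

```python
# Hypothetical template, label words, and demonstration for a sentiment task.
template = "{sentence} It was [MASK]."
verbalizer = {"positive": "great", "negative": "terrible"}
demonstration = "A delightful, heartfelt film. It was great."

def build_prompt(sentence):
    """Recast classification as filling [MASK], with one demonstration
    appended to the input context."""
    return template.format(sentence=sentence) + " " + demonstration

print(build_prompt("The plot never quite comes together."))
```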
Consistent Accelerated Inference via Confident Adaptive Transformers
Tal Schuster*,
Adam Fisch*,
Tommi Jaakkola,
and Regina Barzilay
In the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)
* = Equal contribution.
We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase efficiency, but can come with unpredictable performance costs. In this work, we present CATs – Confident Adaptive Transformers – in which we simultaneously increase computational efficiency, while guaranteeing a specifiable degree of consistency with the original model with high confidence. Our method trains additional prediction heads on top of intermediate layers, and dynamically decides when to stop allocating computational effort to each input using a meta consistency classifier. To calibrate our early prediction stopping rule, we formulate a unique extension of conformal prediction. We demonstrate the effectiveness of this approach on four classification and regression tasks.
Few-shot Conformal Prediction with Auxiliary Tasks
Adam Fisch,
Tal Schuster,
Tommi Jaakkola,
and Regina Barzilay
In the 38th International Conference on Machine Learning (ICML)
We develop a novel approach to conformal prediction when the target task has limited data available for training. Conformal prediction identifies a small set of promising output candidates in place of a single prediction, with guarantees that the set contains the correct answer with high probability. When training data is limited, however, the predicted set can easily become unusably large. In this work, we obtain substantially tighter prediction sets while maintaining desirable marginal guarantees by casting conformal prediction as a meta-learning paradigm over exchangeable collections of auxiliary tasks. Our conformalization algorithm is simple, fast, and agnostic to the choice of underlying model, learning algorithm, or dataset. We demonstrate the effectiveness of this approach across a number of few-shot classification and regression tasks in natural language processing, computer vision, and computational chemistry for drug discovery.
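For reference, the basic split conformal construction that the paper builds on (and extends to the few-shot, auxiliary-task setting) looks roughly like the sketch below; the function and argument names are illustrative.

```python
import numpy as np

def split_conformal_set(cal_scores, test_scores, alpha=0.1):
    """Standard split conformal prediction: `cal_scores` are nonconformity
    scores of the true labels on a held-out calibration set, and
    `test_scores` maps each candidate label to its score for the test input.
    Returns the set of labels retained at coverage level 1 - alpha."""
    n = len(cal_scores)
    # Finite-sample corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(cal_scores, min(q_level, 1.0), method="higher")
    return [y for y, s in test_scores.items() if s <= q_hat]
```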
Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence
Tal Schuster,
Adam Fisch,
and Regina Barzilay
In the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)
Typical fact verification models use retrieved written evidence to verify claims. Evidence sources, however, often change over time as more information is gathered and revised. In order to adapt, models must be sensitive to subtle differences in supporting evidence. We present VitaminC, a benchmark infused with challenging cases that require fact verification models to discern and adjust to slight factual changes. We collect over 100,000 Wikipedia revisions that modify an underlying fact, and leverage these revisions, together with additional synthetically constructed ones, to create a total of over 400,000 claim-evidence pairs. Unlike previous resources, the examples in VitaminC are contrastive, i.e., they contain evidence pairs that are nearly identical in language and content, with the exception that one supports a given claim while the other does not. We show that training using this design increases robustness – improving accuracy by 10% on adversarial fact verification and 6% on adversarial natural language inference (NLI). Moreover, the structure of VitaminC leads us to define additional tasks for fact-checking resources: tagging relevant words in the evidence for verifying the claim, identifying factual revisions, and providing automatic edits via factually consistent text generation.
Efficient Conformal Prediction via Cascaded Inference with Expanded Admission
Adam Fisch*,
Tal Schuster*,
Tommi Jaakkola,
and Regina Barzilay
In the 9th International Conference on Learning Representations (ICLR)
* = Equal contribution.
In this paper, we present a novel approach for conformal prediction (CP), in which we aim to identify a set of promising prediction candidates – in place of a single prediction. This set is guaranteed to contain a correct answer with high probability, and is well-suited for many open-ended classification tasks. In the standard CP paradigm, the predicted set can often be unusably large and also costly to obtain. This is particularly pervasive in settings where the correct answer is not unique, and the number of total possible answers is high. We first expand the CP correctness criterion to allow for additional, inferred "admissible" answers, which can substantially reduce the size of the predicted set while still providing valid performance guarantees. Second, we amortize costs by conformalizing prediction cascades, in which we aggressively prune implausible labels early on by using progressively stronger classifiers – again, while still providing valid performance guarantees. We demonstrate the empirical effectiveness of our approach for multiple applications in natural language processing and computational chemistry for drug discovery.
2020
CapWAP: Captioning with a Purpose
Adam Fisch,
Kenton Lee,
Ming-Wei Chang,
Jonathan Clark,
and Regina Barzilay
In the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
The traditional image captioning task uses generic reference captions to provide textual information about images. Different user populations, however, will care about different visual aspects of images. In this paper, we propose a new task, Captioning with a Purpose (CapWAP). Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population, rather than merely provide generic information about an image. In this task, we use question-answer (QA) pairs—a natural expression of information need—from users, instead of reference captions, for both training and post-inference evaluation. We show that it is possible to use reinforcement learning to directly optimize for the intended information need, by rewarding outputs that allow a question answering model to provide correct answers to sampled user questions. We convert several visual question answering datasets into CapWAP datasets, and demonstrate that under a variety of scenarios our purposeful captioning system learns to anticipate and fulfill specific information needs better than its generic counterparts, as measured by QA performance on user questions from unseen images, when using the caption alone as context.
2019
MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension
Adam Fisch,
Alon Talmor,
Robin Jia,
Minjoon Seo,
Eunsol Choi,
and Danqi Chen
In the 2nd Workshop on Machine Reading for Question Answering (MRQA) at EMNLP
We present the results of the Machine Reading for Question Answering (MRQA) 2019 shared task on evaluating the generalization capabilities of reading comprehension systems. In this task, we adapted and unified 18 distinct question answering datasets into the same format. Among them, six datasets were made available for training, six datasets were made available for development, and the final six were hidden for final evaluation. Ten teams submitted systems, which explored various ideas including data sampling, multi-task learning, adversarial training and ensembling. The best system achieved an average F1 score of 72.5 on the 12 held-out datasets, 10.7 absolute points higher than our initial baseline based on BERT.
Working Hard or Hardly Working: Challenges of Integrating Typology into Neural Dependency Parsers
Adam Fisch,
Jiang Guo,
and Regina Barzilay
In the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)
This paper explores the task of leveraging typology in the context of cross-lingual dependency parsing. While this linguistic information has shown great promise in pre-neural parsing, results for neural architectures have been mixed. The aim of our investigation is to better understand this state-of-the-art. Our main findings are as follows: 1) The benefit of typological information is derived from coarsely grouping languages into syntactically-homogeneous clusters rather than from learning to leverage variations along individual typological dimensions in a compositional manner; 2) Typology consistent with the actual corpus statistics yields better transfer performance; 3) Typological similarity is only a rough proxy of cross-lingual transferability with respect to parsing.
2018
StarSpace: Embed All The Things!
Ledell Wu,
Adam Fisch,
Sumit Chopra,
Keith Adams,
Antoine Bordes,
and Jason Weston
In the 32nd AAAI Conference on Artificial Intelligence (AAAI)
We present StarSpace, a general-purpose neural embedding model that can solve a wide variety of problems: labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings. In each case the model works by embedding those entities comprised of discrete features and comparing them against each other – learning similarities dependent on the task. Empirical results on a number of tasks show that StarSpace is highly competitive with existing methods, whilst also being generally applicable to new cases where those methods are not.
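A toy sketch of the core idea: entities are bags of discrete features embedded as the sum of their feature vectors, and pairs are compared with a similarity such as the dot product. Sizes and helper names here are assumptions, and the real model is trained with a ranking loss over positive and negative pairs.

```python
import torch

# Every entity is a bag of discrete feature ids; its embedding is the sum of
# the feature embeddings, and entities are compared with a dot product.
embedding = torch.nn.Embedding(num_embeddings=50_000, embedding_dim=100)

def embed_entity(feature_ids):
    return embedding(torch.tensor(feature_ids)).sum(dim=0)

def similarity(a_ids, b_ids):
    return torch.dot(embed_entity(a_ids), embed_entity(b_ids))
```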
2017
Reading Wikipedia to Answer Open-Domain Questions
Danqi Chen,
Adam Fisch,
Jason Weston,
and Antoine Bordes
In the 55th Annual Meeting of the Association for Computational Linguistics (ACL)
This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
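The retrieval stage can be approximated as follows. This sketch uses scikit-learn's TfidfVectorizer as a stand-in for the hashed bigram TF-IDF index in the released DrQA code, with a tiny made-up corpus in place of the full set of Wikipedia articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in corpus; in the paper this is the full set of Wikipedia articles.
wiki_articles = [
    "Paris is the capital and most populous city of France.",
    "The Nile is a major north-flowing river in northeastern Africa.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
doc_matrix = vectorizer.fit_transform(wiki_articles)

def retrieve(question, k=1):
    """Return indices of the k articles with the highest TF-IDF similarity."""
    scores = linear_kernel(vectorizer.transform([question]), doc_matrix).ravel()
    return scores.argsort()[::-1][:k]

print(retrieve("What is the capital of France?"))
```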
ParlAI: A Dialog Research Software Platform
Alexander H. Miller,
Will Feng,
Adam Fisch,
Jiasen Lu,
Dhruv Batra,
Antoine Bordes,
Devi Parikh,
and Jason Weston
In the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)
We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai. Its goal is to provide a unified framework for sharing, training and testing of dialog models, integration of Amazon Mechanical Turk for data collection, human evaluation, and online/reinforcement learning; and a repository of machine learning models for comparing with others’ models, and improving upon existing architectures. Over 20 tasks are supported in the first release, including popular datasets such as SQuAD, bAbI tasks, MCTest, WikiQA, QACNN, QADailyMail, CBT, bAbI Dialog, Ubuntu, OpenSubtitles and VQA. Several models are integrated, including neural models such as memory networks, seq2seq and attentive LSTMs.
2016
Key-Value Memory Networks for Directly Reading Documents
Alexander H. Miller,
Adam Fisch,
Jesse Dodge,
Amir-Hossein Karimi,
Antoine Bordes,
and Jason Weston
In the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Directly reading documents and being able to answer questions from them is an unsolved challenge. To avoid its inherent difficulty, question answering (QA) has been directed towards using Knowledge Bases (KBs) instead, which has proven effective. Unfortunately, KBs often suffer from being too restrictive, as the schema cannot support certain types of answers, and too sparse, e.g. Wikipedia contains much more information than Freebase. In this work we introduce a new method, Key-Value Memory Networks, that makes reading documents more viable by utilizing different encodings in the addressing and output stages of the memory read operation. To compare using KBs, information extraction or Wikipedia documents directly in a single framework, we construct an analysis tool, WikiMovies, a QA dataset that contains raw text alongside a preprocessed KB, in the domain of movies. Our method reduces the gap between all three settings. It also achieves state-of-the-art results on the existing WikiQA benchmark.
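A minimal sketch of a single key-value memory read: address memory slots with the keys, then respond with an attention-weighted sum of the values. This shows only the core operation under assumed tensor shapes, not the full multi-hop model.

```python
import torch

def kv_memory_read(query, keys, values):
    """One key-value memory read. query: (d,); keys, values: (num_slots, d).
    The addressing stage scores each slot's key against the query, and the
    output stage returns the attention-weighted sum of the values."""
    attn = torch.softmax(keys @ query, dim=0)   # addressing stage
    return values.T @ attn                      # output stage
```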
2015
MARVIN: Multimodally Advantaged Robotic Vehicle for Improved Navigation
Adam Fisch,
and Max Shatkhin
Supervisors: Clarence Rowley and Andrew Houck
(Calvin Dodd MacCracken Senior Thesis Award)
Hybrid robots leverage the advantages of multiple types of locomotion. More specifically, wheel-legged hybrid robots aim to capture the speed, stability, and power efficiency of wheeled robots as well as the ability to traverse robust natural terrain that legged robots provide. Effective hybrid designs are able to capitalize on both sets of advantages without compromising the overall effectiveness of the machine. Here, we present a design and implementation of MARVIN, a wheel-legged hybrid robot that emphasizes three key features: a quick transition mechanism, a well-defined wheel and leg mode, and the capacity for flexible control through continuously variable leg length. We demonstrate how the two clearly defined modes of legs/wheels in MARVIN capitalize on their respective advantages. Furthermore, in realizing the tradeoff between modes specific to this robot, we derive a hybrid path-planning algorithm using an empirically driven cost function, which we found by collecting data in real-terrain experiments. We discuss our mechanical, electronic, and software design approaches in building a prototype of the proposed design. We also review our experimental methods. Lastly, we point out lessons learned from the operation of our prototype robot, identifying directions for future upgrades.