Rebecca Boiarsky

PhD Student in EECS at MIT

Biography

Hello! I’m a 5th-year PhD student in EECS at MIT, co-advised by David Sontag in MIT CSAIL and Gaddy Getz at the Broad Institute. I develop machine learning tools to better understand health and disease. My main focus is on understanding and leveraging RNA-sequencing data, especially to improve precision medicine in cancer. Currently, I’m interested in deep representation learning, generative models, and multimodal integration/translation for biological data.

Education

PhD in Electrical Engineering & Computer Science, expected 2025
Massachusetts Institute of Technology

MSc in Biomedical Engineering, 2016
Columbia University

BA in Physics, 2014
Yeshiva University

Featured Publications

Rebecca Boiarsky, Nalini M Singh, Alejandro Buendia, Gad Getz, David Sontag

January, 2023 bioRxiv

A Deep Dive into Single-Cell RNA Sequencing Foundation Models

Large-scale foundation models, which are pre-trained on massive, unlabeled datasets and subsequently fine-tuned on specific tasks, have recently achieved unparalleled success on a wide array of applications, including in healthcare and biology. In this paper, we explore two foundation models recently developed for single-cell RNA sequencing data, scBERT and scGPT. Focusing on the fine-tuning task of cell type annotation, we explore the relative performance of pre-trained models compared to a simple baseline, L1-regularized logistic regression, including in the few-shot setting. We perform ablation studies to understand whether pre-training improves model performance and to better understand the difficulty of the pre-training task in scBERT. Finally, using scBERT as an example, we demonstrate the potential sensitivity of fine-tuning to hyperparameter settings and parameter initializations. Taken together, our results highlight the importance of rigorously testing foundation models against well-established baselines, establishing challenging fine-tuning tasks on which to benchmark foundation models, and performing deep introspection into the embeddings learned by the model in order to more effectively harness these models to transform single-cell data analysis. Code is available at https://github.com/clinicalml/sc-foundation-eval.

Rebecca Boiarsky, Nicholas J Haradhvala, Jean-Baptiste Alberge, Romanos Sklavenitis-Pistofidis, Tarek H Mouhieddine, Oksana Zavidij, Ming-Chieh Shih, Danielle Firer, Mendy Miller, Habib El-Khoury, et al.

January, 2022 Nature Communications

Single cell characterization of myeloma and its precursor conditions reveals transcriptional signatures of early tumorigenesis

Multiple myeloma is a plasma cell malignancy almost always preceded by precursor conditions, but low tumor burden of these early stages has hindered the study of their molecular programs through bulk sequencing technologies. Here, we generate and analyze single cell RNA-sequencing of plasma cells from 26 patients at varying disease stages and 9 healthy donors. In silico dissection and comparison of normal and transformed plasma cells from the same bone marrow biopsy enables discovery of patient-specific transcriptional changes. Using Non-Negative Matrix Factorization, we discover 15 gene expression signatures which represent transcriptional modules relevant to myeloma biology, and identify a signature that is uniformly lost in abnormal cells across disease stages. Finally, we demonstrate that tumors contain heterogeneous subpopulations expressing distinct transcriptional patterns. Our findings characterize transcriptomic alterations present at the earliest stages of myeloma, providing insight into the molecular underpinnings of disease initiation.

Rohan S Kodialam, Rebecca Boiarsky, Justin Lim, Neil Dixit, Aditya Sai, David Sontag

January, 2021 Proceedings of the AAAI Conference on Artificial Intelligence

Deep Contextual Clinical Prediction with Reverse Distillation

Healthcare providers are increasingly using machine learning to predict patient outcomes to make meaningful interventions. However, despite innovations in this area, deep learning models often struggle to match performance of shallow linear models in predicting these outcomes, making it difficult to leverage such techniques in practice. In this work, motivated by the task of clinical prediction from insurance claims, we present a new technique called Reverse Distillation which pretrains deep models by using high-performing linear models for initialization. We make use of the longitudinal structure of insurance claims datasets to develop Self Attention with Reverse Distillation, or SARD, an architecture that utilizes a combination of contextual embedding, temporal embedding and self-attention mechanisms and most critically is trained via reverse distillation. SARD outperforms state-of-the-art methods on multiple clinical prediction outcomes, with ablation studies revealing that reverse distillation is a primary driver of these improvements. Code is available at https://github.com/clinicalml/omop-learn.

Rebecca Peyser, Scott MacDonnell, Yinglin Gao, Luis Cheng, Yong Kim, Theodore Kaplan, Qin Ruan, Yi Wei, Min Ni, Christina Adler, et al.

January, 2019 American Journal of Respiratory Cell and Molecular Biology

Defining the activated fibroblast population in lung fibrosis using single-cell sequencing

Recent & Upcoming Talks

A Deep Dive into Single-Cell RNA Sequencing Foundation Models

Oral presentation at MLCB 2023 (top 15% of submissions).

Dec 1, 2023 9:30 AM University of Washington, Seattle, WA, USA

Contact

rboiar [at] mit [dot] edu