Peter Szolovits' Research
Health care in the United States, as well as in most other
developed countries, is undergoing a dramatic transformation from
a craftsman to an engineering culture. When I began my own studies
in this field over forty years ago, as a specialist in artificial
intelligence research interested in a challenging and useful
application domain, medical expertise consisted of textbook
knowledge of anatomy, physiology, pharmacology, pathology, etc.,
augmented by the practical experience of the individual clinician
through exposure to the cases he or she saw during training and
practice. As a result, variations in practice were common, schools
of thought differed and were often centered on an influential
mentor, and diagnostic and therapeutic success varied depending on
the expertise of one’s doctor. For over three decades now, the
“evidence-based medicine” movement has proposed and partially
succeeded in substituting clinical trial results for individual
experience as guides to clinical decision making. However, the
vast majority of decisions to be made have had no clinical trials
to compare alternatives, and most trials focus on relatively
simple cases of disease that do not necessarily tell us what to do
in complex cases where a patient suffers from multiple
simultaneous disorders. Since the beginning of large-scale
electronic health record implementations about twenty years ago,
we have realized that the non-trial data generated in the course
of routine health care can also serve as very useful evidence to
improve the practice of medicine. Such data do, however, raise
many difficult technical problems. Among these is the need to
allow a computer to interpret clinical narratives, because most
observational data from doctors and nurses are recorded in that
form. Furthermore, statistical sophistication in interpreting such
data is also needed to overcome problems of selection bias,
missing data, and erroneous data—after all, these data are not
collected under pristine experimental protocols. Nevertheless,
there are many orders of magnitude more such data than trials
data, so the impetus to exploit these is huge, and in many efforts
we and our colleagues have shown that clinically important
questions can be addressed using such data. If this vision comes
to fruition in what the National Academy of Medicine calls the
“learning healthcare system”, then every medical encounter will be
viewed as part of an ongoing experiment that will inform how
future cases are to be best treated.
My research interests broadly include much of biomedical informatics. Throughout my career, I have avoided a tightly focused concentration on a single topic. Instead, I have tried to define my research interests by the demands of health care and how they could be satisfied by computing approaches. My graduate work in the early 1970’s focused on development of specialized application languages to support the computing needs of various disciplines.
When I joined the MIT faculty in 1974, I met a group of doctors from Tufts/New England Medical Center who were trying to elucidate the thought processes of expert doctors as they performed diagnostic and therapeutic reasoning, and to build advisory programs that could help all doctors work as well as the best. I found that the challenges of this field provided an excellent focus for my interests in artificial intelligence, knowledge representation and reasoning. I therefore made the commitment to learn some medicine (I now know enough jargon to be able to play a doctor on TV!) and to understand how AI techniques can help computers “think through” complex medical decision problems. My doctoral students made excellent contributions to such AI challenges: automated generation of explanations, qualitative modeling of physiology, meta-level control of computation, expert systems based on pathophysiologic models at different levels of detail, reasoning about individual patient preferences, and temporal reasoning. We contributed to the work that became identified as the “expert systems” boom of the 1980’s, enriching the applicable techniques.
In medical AI, my first effort in 1975 was to re-engineer a previously-developed very inefficient diagnosis program for renal (kidney) diseases in order to make experimentation with it feasible. With my colleague Steve Pauker, I also wrote what turned out to be a very influential survey paper of various AI methods of reasoning in medical diagnosis. It explains different reasoning methods adopted by some of the early medical AI programs and points out challenges in more clearly defining the semantics of the knowledge they purport to contain.
At this time, I became convinced that medical practice innately relies on feedback from what has been done before, so programs that advise human practitioners must work similarly. They must repeatedly and incrementally re-assess diagnostic and therapeutic plans as time passes, underlying pathologies evolve, and therapies take effect, and thus new observations are acquired. My other insight from these early analyses was that very rarely does an interesting patient (one for whom help from a computer might be useful) suffer from a single, untreated disease. Techniques that match symptoms to generic disease descriptions are rarely adequate, no matter whether they are described by rules, templates, or frames, because they fail to account for interactions.
A full pathophysiological theory could, in principle, represent all the interactions, say in the form of hundreds of differential equations and thousands of parameters. Alas, it cannot yield a useful clinical model because collecting the data needed to estimate all these parameters is infeasible. We need instead a set of models that lie between simple symptom-disease associations and such differential equation models. Patil’s thesis on diagnosis of acid-base and electrolyte disorders pioneered such an approach in the early 1980’s, and Long’s clinical-level heart disease models extended it. We also explored case-based reasoning approaches as a way to generalize from good solutions to specific problems to a broader repertoire of reasoning techniques that could be combined to solve difficult new clinical problems. Artificial Intelligence in Medicine, a book I edited in 1982, gives a good picture of the approaches to these problems, both by my own group and colleagues in the field.
The growing availability of real data, which was foreseen in the 1980's, began to be real in the 1990's, and threatens to inundate us in the 21st century, has made a huge change in my own work and that of my students. We (and our computers) can now learn even complex associations between observables and patient states from large corpora of labeled data. Therefore, much of our research today focuses on finding novel ways to characterize huge collections of data and to develop predictive algorithms that do a good job of anticipating changes in a patient’s state from his or her previous condition and treatments being applied.
We began to pick lots of low-hanging fruit by applying existing statistical and machine learning methods to the analysis of such data. However, new methods are also needed to deal effectively with the great complexity of what happens to seriously ill patients. For example, in this early work, we tended to build predictive models using a set of features of the case. These normally included aspects of the patient’s medical history, their current problems, current and recent laboratory measurements, drugs and procedures. Additional features were designed to summarize the timeline over which the actual data had evolved, by computing trend lines, averages and spreads over various time periods, etc. These provided a useful but hardly comprehensive account of how the patient’s illness had changed over time or how it had responded to previous attempts at treatment.
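The timeline features described above (trend lines, averages and spreads over a time period) can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used in our studies; the function name `summarize_window`, the feature names, and the creatinine values are all hypothetical.

```python
from statistics import mean, stdev

def summarize_window(times, values, start, end):
    """Summarize measurements whose time lies in [start, end].

    Returns the mean, the spread (sample standard deviation) and the
    least-squares trend (slope per unit time), or None when the window
    holds fewer than two measurements.
    """
    points = [(t, v) for t, v in zip(times, values) if start <= t <= end]
    if len(points) < 2:
        return None
    ts = [t for t, _ in points]
    vs = [v for _, v in points]
    t_bar, v_bar = mean(ts), mean(vs)
    # Ordinary least-squares slope; zero if all times coincide.
    denom = sum((t - t_bar) ** 2 for t in ts)
    slope = sum((t - t_bar) * (v - v_bar) for t, v in points) / denom if denom else 0.0
    return {"mean": v_bar, "spread": stdev(vs), "trend": slope}

# Hypothetical creatinine values (mg/dL) at hours since admission.
hours = [0, 6, 12, 18, 24]
creatinine = [1.0, 1.2, 1.4, 1.6, 1.8]
features = summarize_window(hours, creatinine, start=0, end=24)
```

Computing the same summary over several windows (say, the last 6, 24 and 72 hours) gives a fixed-length feature vector, which is convenient for standard learners but discards most of the shape of the trajectory.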
In the 1990’s we tried to develop Markov process and dynamic systems models of such phenomena, but the complexity of the models coupled with the computational difficulty of solving them limited their effectiveness to very small example problems. Intuition suggested that we try more sophisticated models again, so that they could exploit innately time-dependent phenomena such as pharmacokinetics. We began to focus on methods that cluster similar patient states and identify common "trajectories" that patients traverse in the development and resolution of their illnesses, and how these relate to the basic data that can be measured routinely. Also, because virtually every patient’s condition shows the effects of combining therapeutic interventions with often multiple disease processes, partially observable Markov decision processes can provide a more nuanced interpretation of data than simpler feature-based predictive models.
Machine learning techniques rely on having good, informative features from which to learn. Temporal organization of data, as suggested above, can provide one rich source of such features. Another is to use medical knowledge, much as we used earlier in construction of expert systems, to identify useful, more abstract patterns among the data and to use these as features. Modern machine learning research has also created numerous unsupervised learning methods that simply exploit similarities among the data to form clusters, vector-space embeddings, deep neural networks, etc., that capture such similarities in a parsimonious representation, and such approaches have characterized our more recent research.
In 1974, I had estimated that by the early 1980’s most large hospitals would have switched their practice to electronic collection, storage and retrieval of their medical records. I was clearly wrong: the change took about thirty years longer, and came about only when medical institutions were "bribed" into installing such systems by the HITECH Act, enacted as a response to the financial meltdown of 2008. I had based my guess on a simple extrapolation of the costs of keeping paper records vs. the costs of electronic storage. I think those two cost curves did indeed cross in the early 1980’s, but I had not thought through the enormous costs (financial, institutional and human) of making such a switch. For example, Partners Healthcare (now called Mass General Brigham) is said to have spent nearly $2 billion in the mid-2010s to install a comprehensive new commercial EHR system. Most of that investment was spent on customizing what they purchased to their specific needs and on training their staff to use the new system safely and effectively, and this seems to be the common experience of most hospitals. By the early 1990’s, I decided that I needed to develop a research focus on medical record systems, to better understand why it was so difficult to put them in place. This led to three different lines of work in my group.
First, we realized as soon as the World Wide Web was created that this formed the most appropriate technical basis for presenting and using medical information. With our colleagues at Boston Children's Hospital and others in the Boston area, we formed a collaborative effort that demonstrated by 1994 the ability to view medical records from Children's anywhere on the internet (with suitable security) and by 1995 the ability to aggregate data from multiple institutions to present a longitudinal view of all data about a patient, even if collected at different hospitals and clinics. We also implemented and published consensus methods for assuring the protection of the confidentiality of patients’ clinical records.
Second, also in 1994, we outlined a vision of life-long personal health care supported by a computer system (called Guardian Angel) that kept comprehensive records, educated patients about their health, provided them decision support, and served to connect them to providers and peer groups. Although the vision is still far from being implemented, it did lead us to develop the earliest approaches to personally controlled health records, which have been influential on subsequent commercial developments such as Dossia, Google Health and Microsoft HealthVault. The Health Record Banking movement derives from this vision, recognizing that putting individuals' comprehensive medical records under their own control may be the best way to assure legitimate access to those records by the health care system. With the advent of new health care delivery and financing models adopted under "Obamacare," this vision is still relevant and cries out for continued research.
Third, through debates around 1993 about the propriety of
adopting the social security number as a national health
identifier, I felt challenged to study broader issues of patient
privacy. We proposed cryptographic identification schemes that
permitted aggregation of clinical data about a patient only via
the participation of the patient. We demonstrated the risks of
naive de-identification methods, which leave in place enough
unique data about individual patients to make them
re-identifiable. We also demonstrated that properly pseudonymized
data could still be used effectively to support secondary uses of
those data without casually revealing the identities of the
patients. I also served on a National Research Council committee
that reported on the poor state of protection of electronic health
records in 1997 and inspired some provisions of the HIPAA privacy
protections. Later, I also served on an Institute of Medicine
committee that helped to define the role of institutional review
boards in protecting patient confidentiality in data studies. Later
still, I served on another NRC committee that laid out strategic
directions for how information technology could better serve the
needs of clinicians by aligning record keeping with their mental
models of patients, diseases and clinical processes. Much
remains to be done, both technically and in policy, to protect
patients, encourage data sharing, and create effective decision
support.
Since the early 2000s, I have focused on developing techniques to
extract codified clinical data from narrative text and speech
conversations. This has proven important because of the Willie
Sutton principle—that is where a lot of the data are.
Practitioners are able and willing to describe clinical phenomena
with great sophistication in natural language, but not in formal
representations. It remains a challenge, however, to translate
that narrative text to rich formal representations. Even just
identifying the various ways of expressing the same facts and
translating these into a terminology such as SNOMED or ICD remains
difficult. Dealing with more subtle statements, such as accounts
of how findings support or dispute diagnostic hypotheses, or
contingent plans, seems beyond the state of the art. My colleagues
and I have worked on a highly flexible language processing
framework to support research in this area. We have also extended
the dictionary of a popular parser with medical terminology, built
pattern matching methods to find descriptions of medications and
dosages in text, identified signs, symptoms, diseases, treatments,
tests and their results using both pattern-based and statistical
methods, identified temporal and likelihood indicators about such
facts, determined some relations among facts mentioned together,
etc. We have also developed several different approaches to
de-identification of clinical data for research purposes, first
one based on dictionaries and patterns, then later ones on
statistical machine learning and deep learning techniques. These
have been used to enable the re-use of large clinical datasets for
research purposes in many projects.
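As a rough illustration of the pattern-matching approach to finding medications and dosages mentioned above, here is a minimal Python sketch. The regular expression, the unit and frequency lists, and the sample note are assumptions made for this example; they are not the patterns used in our systems, which also rely on drug lexicons to avoid matching arbitrary words that happen to precede a number.

```python
import re

# Illustrative pattern only: a word (assumed to be a drug name) followed by
# a numeric dose and a unit, optionally followed by a frequency abbreviation.
DOSE_PATTERN = re.compile(
    r"\b(?P<drug>[A-Za-z][A-Za-z-]+)\s+"
    r"(?P<dose>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mcg|mg|g|mL|units)\b"
    r"(?:\s+(?P<freq>daily|bid|tid|qid|prn))?",
    re.IGNORECASE,
)

def find_medications(text):
    """Return one dict per match with drug, dose, unit and freq (or None)."""
    return [m.groupdict() for m in DOSE_PATTERN.finditer(text)]

# Hypothetical sentence from a clinical note.
note = "Started metoprolol 25 mg bid and continued furosemide 40mg daily."
mentions = find_medications(note)
```

Patterns like this are easy to write and inspect, but they break on the many other ways clinicians express the same facts, which is one reason statistical methods were added alongside them.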
With the rise of deep neural network approaches to language
processing, NLP approaches have turned toward more end-to-end
prediction models, where the actual language used in reports, or
vector space embeddings of the words and concepts become inputs to
artificial neural net models that can directly combine NLP data
with information from tabular data such as labs, diagnostic and
treatment codes, etc., to build models that predict outcomes and
future interventions. When combined with image embeddings, such
approaches can also help to improve image interpretation models to
identify key features that might appear in a radiological image
that are also mentioned in corresponding radiology reports.
Although we began this type of work using models such as word2vec,
more recent contextual embedding models based on transformer
architectures and language models fine-tuned on biomedical and
clinical data yield better performance. Overall, it seems that
contemporary language modeling can capture grammatical and fluency
characteristics of text; the major open challenge is how to better
capture meaning, which will require some deeper notion of semantic
representations that underlie what language intends to say.
The advent of the genomic revolution in the 1990’s promised to
usher in an era of personalized medicine, where measurements of
single nucleotide polymorphisms (or, eventually, sequencing of
individual genomes), measurements of transcription, proteomic
measurements of gene products, methylation patterns that control
gene expression, the role of intestinal and other flora,
environmental exposures, etc., could combine with clinical facts
to provide highly customized diagnostic tools and methods to
choose and optimize therapy for individual patients. We have
contributed to this effort in the Partners Healthcare-based i2b2 project
both through our natural language work and through efforts to
bring together modeling techniques from statistics with those from
artificial intelligence. With notable exceptions in identifying
populations for whom certain drugs are or are not likely to be
effective, the optimistic expectations that such approaches would
easily "solve" many diseases have been disappointed. The current
consensus is that even well-recognized diseases are heterogeneous
at the genetic level, and if private variants drive disease, then
it is hard to know how statistical methods that rely on large
cohorts of very similar patients will provide adequate insights.
Instead of focusing just on analysis of measured data,
experimental methods based on creating model organisms and
modifying genetics with techniques such as CRISPR-Cas9 seem more
likely to pay near-term dividends. Nevertheless, even if what have
been thought of as diseases turn out actually to be
syndromes—common manifestations of different underlying
pathologies—we may find valuable clinical results from studying
such surface-level manifestations.
Probably the best way to gain an understanding of the accomplishments of our research group is to look at our publications and students' theses. I list some of the specific formal projects we are currently working on.
To improve assessment of pulmonary edema in chest x-rays, we create a joint embedding of the x-ray images processed by a CNN and the corresponding text processed by a transformer architecture, training with a penalty that draws the embeddings closer together for matched image/text pairs and pushes them farther apart when the image and report are from distinct cases. Our trained models improve on the ability of the image classifier to estimate degree of pulmonary edema when a report is not yet available. This work is in collaboration with Wistron Corp.
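The training penalty described above can take several forms, and the description does not pin one down. The sketch below shows one common choice, a margin-based ranking loss over cosine similarities, purely as an assumption for illustration. The vectors stand in for the outputs of the image and text encoders, and `ranking_loss` and `margin` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ranking_loss(image_vecs, text_vecs, margin=0.5):
    """Average margin penalty; image_vecs[i] and text_vecs[i] are a matched pair."""
    n = len(image_vecs)
    total = 0.0
    for i in range(n):
        matched = cosine(image_vecs[i], text_vecs[i])
        for j in range(n):
            if j == i:
                continue
            # Penalize a report from another case that scores within
            # `margin` of the matched report for image i ...
            total += max(0.0, margin - matched + cosine(image_vecs[i], text_vecs[j]))
            # ... and an image from another case that scores within
            # `margin` of the matched image for report i.
            total += max(0.0, margin - matched + cosine(image_vecs[j], text_vecs[i]))
    return total / n

# Toy embeddings: the first batch is perfectly aligned, the second is swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = ranking_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = ranking_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is zero when every matched pair is more similar than every mismatched pair by at least the margin, and it grows as mismatched images and reports move closer together in the shared space.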
The aims of this project are to identify patients who are unlikely to respond well to conventional therapy for Ulcerative Colitis or Crohn's Disease, to predict disease progression among IBD patients, and to help with earlier identification of patients developing IBD. This work is in collaboration with Takeda.
Using liver imaging data from patients with non-alcoholic fatty liver disease (NAFLD) and non-alcoholic steatohepatitis (NASH) plus associated clinical data, we try to predict which patients will progress to more advanced stages of fibrosis and try to identify factors that contribute to such danger. This work is also in collaboration with Takeda.
Using joint analysis of rich available data from drug trials, identify phenotypes that vary according to the patients' clinical conditions and responses to interventions. Determine whether it is possible to predict from such data which patients will show improved clinical features and outcomes. We also use these data to identify useful abstractions for future studies. This project is in collaboration with Bayer.
We use longitudinal multi-variable health data to develop novel models that can accurately impute missing data by exploiting both the autoregressive nature of measurements and their covariance with other measured data. We plan to extend these methods to understand the biases due to healthcare process variables, and to take advantage of information not only from structured clinical data but also clinical narratives. This is joint work with Northwestern University and Mass General Hospital, funded by NIH.
This is a selection of some of our interesting past projects.
As part of a five-year collaborative project between MIT and Philips' North American research laboratory (in Cambridge, MA), our group worked with a partner group at Philips to improve natural language processing capabilities in order to build more accurate predictive models. Most predictive modeling, including much from our group, has favored data coming from well-coded data sources such as laboratory tests, medication orders, bedside measurements and monitoring instruments over data recorded only in narrative notes. Two foci of this work were: (1) improve automated de-identification so that it becomes easier for groups to use narrative notes without violating the privacy of the patients mentioned in those notes, and (2) build new models that more comprehensively take into account all the available clinical data in order to predict useful outcome measures.
This was a very large Harvard Medical School-led effort to create methods and a toolkit to support research access to a federated database of a broad variety of information about individual patients across clinical domains, data sources, and data types. It respected constraints of patient privacy while allowing integration of these data from sources that should not divulge the identity of individual patients. In addition, the project developed high-performance computational techniques to support analysis and modeling of these data to do predictive modeling on a "big data" scale. The project was part of NIH's BD2K ("big data to knowledge") effort.
An ambitious neuro-psychiatric project whose goal was to provide a genomic approach to a new understanding of psychiatric disorders. This was another large Harvard Medical School project. Its most clever idea was to grow nerve cells from induced pluripotent stem cells derived from skin fibroblasts, put them down a neurodevelopmental pathway and thus produce proxies for nerve cells that might have been harvested directly from patients' brains (a much less practical approach). Hard data from these induced neurons can then help characterize patients according to the Research Domain Criteria (RDoC) advocated by NIMH as a new basis for understanding psychiatric diseases. We investigated the genetics and gene expression levels of these cells in order to try to modify their activity by the application of a large variety of drugs. If these modify expression levels to more closely approximate the behavior of nerve cells from controls, this suggests that those drugs may be helpful to the specific patient.
Much of the subtlety of detail in psychiatric cases resides in careful notes made by doctors and nurses as they care for patients. To make the study of psychopathology more systematic, we extracted meaning from such notes and related those meaningful concepts to the RDoC criteria. We developed new forms of topic modeling to provide features useful for predicting hospital readmission and length of stay among psychiatric and non-psychiatric hospitalized patients.
In collaboration with pathologist colleagues at MGH, we investigated the possibility of exploiting covariances among different laboratory measurements to identify costly tests that may not need to be performed because their results may be predicted to sufficient accuracy from the results of other tests. In addition, the same relationships among test results may be used to shrink the confidence range of test results. Technical challenges include modeling missing data appropriately, because not ordering a test is itself informative. We demonstrated that whether a serum ferritin test will yield normal or abnormal results is accurately predictable, and extended the analysis to a larger number of relationships.
In collaboration with Dr. Polina Golland's group and Wistron Corp., we studied the "wellness" of patients recovering from serious illness requiring hospitalization. The goal is to provide expectations for what should happen after discharge, and to predict the rate at which patients may recover from serious illness. Because of the lack of good follow-up data thus far, we have focused on data available during hospitalization.
We were part of MIT/CSAIL's Big Data center and the associated Intel Science and Technology Center. Our role was principally to provide interesting challenges to big data science and technology arising from biomedical problems. Among the characteristics of big data (volume, velocity and variety), it is the last that is especially rich in the medical world.
The most accessible source of large sets of data in clinical medicine comes from national or private health insurance, because such organizations collect huge volumes of data on the care of many millions of individuals. If properly pseudonymized, this can serve as a valuable research resource. However, these data are frustrating from a clinical viewpoint because they document the process of medical care, not its details or outcome. For example, the fact that lab tests were done is recorded, but what the results of the test were is not. We worked with a group of epidemiologists in Taiwan who have access to a vast trove of national health insurance data to understand how to use these data not just for financial management and policy purposes but to learn something about the practice of medicine.
Create a system that captures primary medical data mentioned during an encounter between a health care provider and a patient. We use speech-to-text technology to create an approximate transcript of both sides of such a conversation, use natural language processing and machine learning methods to extract relevant clinical content from the transcripts, organize these according to medical conventions, and display the data to both provider and patient to allow them to correct mistakes made by this process. We are applying this in the Pediatric Environmental Health Clinic at Children’s Hospital Boston.
Although preliminary data showed that this effort should be feasible, when we began to collect recordings of conversations between doctors and patients, we found that the speech-to-text component worked so poorly that subsequent stages of processing succumbed to GIGO. Two different speech systems we used were unable to interpret two-sided conversations, were misled by background noises in the clinic, and could not cope well without speaker-specific training that was not available for patients. Additionally, most such systems are trained on dictation data, and we found that the language models that describe conversational speech differ very significantly from those for dictation, so the systems were mis-trained for our application. It is possible that newer speech systems that form the basis for Siri, Alexa, Cortana, etc., would do better, but they were not available to us and may still pose privacy issues because they ship digitized speech to a server that does not comply with mandated patient privacy rules.
Develop techniques to collect, interpret, analyze and disseminate multi-channel data from intensive care collated with clinical notes and other patient data. The foci of our group's efforts are to extract meaningful data from textual records and to build algorithms that make sense of the clinical condition of the patient.
We built models that predict 30-day mortality with AUC of up to 0.90, a surprisingly accurate result. We also built models that predict the success of possible clinical interventions such as weaning from a ventilator or intra-aortic balloon pump or from use of vasopressor drugs, and additional models that predicted degradations in the patient's condition such as sepsis, hypotension or renal failure. These predictions achieve AUC between 0.75 and 0.85, which is quite respectable. This work continues under the auspices of several of our current projects.
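For readers unfamiliar with the AUC figures quoted above: the area under the ROC curve equals the probability that a model assigns a higher risk score to a randomly chosen positive case than to a randomly chosen negative one. The following Python sketch computes it directly from that definition; the labels and scores are made-up values, not results from our models.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes (1 = died within 30 days) and predicted risks.
died = [1, 0, 1, 0, 0]
risk = [0.9, 0.2, 0.4, 0.5, 0.1]
value = auc(died, risk)
```

An AUC of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking. This pairwise version is quadratic in the number of cases, so on large cohorts a rank-based implementation is preferable.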
Develop a scalable informatics framework that will bridge clinical research data and the vast data banks arising from basic science research in order to better understand the genetic bases of complex diseases. This knowledge will facilitate the design of targeted therapies for individual patients with diseases having genetic origins.
This ten-year project was highly successful, leading to a very broadly disseminated set of software tools that are now commonly used by many U.S. and international hospitals to manage and explore their data warehouses. We also made progress on understanding the genetic correlates (if not causes) of rheumatoid arthritis, irritable bowel disease and other conditions. The i2b2 project also sponsored a series of important shared task challenge workshops that have helped to advance the state of the art of clinical natural language processing. The Personal Information Commons project is an outgrowth of this effort.
As part of a national collaborative group headed by Mayo Clinic, we built tools to make it possible to re-use clinical data for purposes other than the patient care for which they were collected. Our efforts include natural language processing to identify salient facts and relationships in narrative textual data, defining classification models that can identify specific phenotypes from patient records, and defining ontologies to organize the relevant medical knowledge needed for these tasks.
Although this project ended abruptly because of funding changes, it made significant progress on its goals and has led to numerous follow-on efforts.