The Value of Predictive Models
5/5/2009
Caleb Hug defended his thesis on April 30, 2009. His work led me to some thoughts about the clinical utility of predictive models.
Caleb Hug’s thesis, titled “Detecting Hazardous Intensive Care
Patient Episodes Using Real-time Mortality Models,” develops a set
of models from the tens of thousands of intensive care unit
patients whose records are in the MIMIC II database. The purpose
of these models is to compute and evaluate a real-time varying
acuity score for each patient, which is in a sense a dynamic
indicator of how sick a patient is, and whether he is improving or
getting worse. Because we have no “gold standard” estimate
of acuity, he instead uses the likelihood of patients’ dying
as a proxy for how sick they are.
Some of these models predict the risk of mortality within a month,
using a large variety of objective data recorded about the patient,
including lab values, instrument settings, intravenous infusions,
etc. Among these mortality models, some are based on aggregate
data from the patient’s multiple days of stay in the ICU, some
depend on data from each specific ICU day, and one relies on
dynamically changing data and is recalculated every time new
relevant data are measured. He also computed an approximation to the
commonly used SAPS II score as a point of comparison. Each of his
acuity models, roughly independent of the span of time it was
trained on, succeeded about equally well in its prediction task, as
measured by area under the ROC curve on an independent test set of
cases. And on our data, they all performed better than his
approximation to the SAPS II score. This is good news for
the overall goal, because it shows that the real-time model does
accurately track the likelihood of eventual outcome for the patient,
and therefore appears to be a good proxy for how sick the patient
is.
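To make that comparison concrete, here is a small Python sketch of the kind of evaluation involved: computing area under the ROC curve for two risk models on a held-out test set, using scikit-learn. The outcomes and risk scores below are invented placeholders, not Caleb’s models or results.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # Hypothetical held-out test set: 1 = died within 30 days, 0 = survived.
    y_true = rng.integers(0, 2, size=500)

    # Hypothetical risk estimates from an acuity model and a SAPS II-style
    # baseline; noisy stand-ins constructed only so the example runs.
    acuity_risk = np.clip(0.6 * y_true + rng.normal(0.30, 0.25, size=500), 0, 1)
    saps2_risk = np.clip(0.4 * y_true + rng.normal(0.35, 0.30, size=500), 0, 1)

    # Higher AUC means better discrimination between survivors and deaths.
    print("acuity model AUC:    ", roc_auc_score(y_true, acuity_risk))
    print("SAPS II baseline AUC:", roc_auc_score(y_true, saps2_risk))

One virtue of AUC for this purpose is that it measures discrimination without committing to any particular alarm threshold.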
In addition to the mortality models, Caleb also developed a number
of more focused models that predict whether specific events will
happen within the next several hours. These include weaning from
administration of vasopressors, weaning from an intra-aortic balloon
pump, progression from infection to septic shock, and acute kidney
injury. Somewhat as we expected, these specialized models make more
accurate predictions for their target conditions than one can obtain
from any of the mortality models. This suggests that with sufficient experiential
data it is possible to make reasonably accurate predictions about
when either good or bad events can be anticipated during a patient’s
ICU stay.
After Caleb presented the same results to our Biomedical Research
Partnership (BRP) group today, Roger Mark, the PI of this project,
and I got into an interesting discussion about the value of
predictive models, how one can go about evaluating them, and how (or
whether) to put them into clinical use. This led me to the
following thoughts:
- Even if we had a perfect predictive model (i.e., 100%
sensitive and specific), how would we use it? For example,
suppose that we could develop a model that examines the health
state of an ICU patient and tells us, with absolute accuracy,
whether that patient would survive his stay in the ICU and for
at least the next month thereafter. What would be the use
of such a model? If patient Smith were predicted to die,
would we therefore immediately “pull the plug” on him? If
patient Jones were predicted to live, would that mean we could
pay less attention to him? Ultimately, these scenarios are
unappealing and not realistic. It’s impossible to make 100%
accurate predictions, in part because the true outcome depends
on the behavior of clinicians who will be influenced by those
predictions.
- So what about an imperfect prediction? For much shorter-term
uses, such as alarms sounded based on physiological
measures, we generally think that some combination of
sufficiently high sensitivity and specificity, even if
considerably short of 100%, is acceptable. For example, an
asystole alarm may be excused even if its positive predictive
value is only 20%, so long as it is highly sensitive (see the
worked example after this list). This is
because the costs of a false positive and a false negative are
dramatically different. In the first case, nurses get annoyed by
a spurious alarm, whereas in the second, the patient may die
unnoticed. Nevertheless, in practice we hear of ICU nurses
turning off important alarms just because the cacophony of
frequent alarm sounds overwhelms their ability to respond
sensibly.
- In the case of much longer-term predictions such as a high
likelihood that a patient will not survive for the next 30 days,
it’s hard to know how clinicians will or should respond even if
the prediction is reasonably accurate. It seems reasonable
to assume that they would devote extra attention to the case, be
more aggressive in treatment, or try some alternative approach
if warned that the current path is downhill. However, it’s
hard to know whether that extra attention or change of direction
can be sustained over such a long duration as that of the
prediction. These predictions don’t have the immediacy of
the asystole alarm.
- Roger suggests that, despite the predictive model’s roots in
mortality prediction, we should think of it instead as what we
mean it to be, namely a dynamically changing needle showing the
current aggregate health state of the patient. Caleb also
suggests that it is really a measure of how this patient is
doing, based on a comparison to previous patients in similar
situations. This seems like a more sustainable view, but it
raises two questions of its own:
- I don’t know how to create a gold standard for health
state. So, we build and calibrate the model on the task of
predicting mortality, but then use it as a measure of health
state. This is certainly our approach, but is it legitimate?
- We can certainly provide such a measure as a clinical
indicator to ICU clinicians and then see whether it alters
their behavior. If yes, then we can see if it leads to better
outcomes for their patients. Again, if yes, then it is
clearly a useful intervention. Is there a simpler way to
assess this hypothesis? Roger suggests asking ICU staff
to rate patients’ health state on a subjective scale, and then
studying the degree to which Caleb’s acuity score correlates
with this (a sketch of that computation appears after this
list). I suspect that this is the most practical
current approach.
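The asystole alarm example above is worth a line of arithmetic: positive predictive value follows from sensitivity, specificity, and prevalence by Bayes’ rule. The numbers below (1% prevalence, 96% specificity) are illustrative assumptions chosen to land at a 20% PPV, not measured alarm statistics.

    def ppv(sensitivity, specificity, prevalence):
        """Positive predictive value via Bayes' rule."""
        true_pos = sensitivity * prevalence
        false_pos = (1.0 - specificity) * (1.0 - prevalence)
        return true_pos / (true_pos + false_pos)

    # A nearly perfectly sensitive alarm for an event present in about 1% of
    # monitored epochs still raises four false alarms for every true one:
    print(ppv(sensitivity=0.99, specificity=0.96, prevalence=0.01))  # 0.20

This is why rare events doom even quite specific alarms to low PPV, which is exactly the cacophony problem the nurses face.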
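And here is a minimal sketch of the validation Roger suggests, assuming we had paired observations of Caleb’s acuity score and a clinician’s subjective rating for the same patient at the same moment. Spearman’s rank correlation seems the natural statistic, since the subjective ratings are ordinal; all the data below are fabricated placeholders that only show the computation.

    from scipy.stats import spearmanr

    # Hypothetical paired observations (same patient, same moment):
    acuity_score = [0.12, 0.35, 0.40, 0.55, 0.61, 0.72, 0.80, 0.91]
    nurse_rating = [1, 2, 2, 3, 3, 4, 4, 5]  # e.g., 1 = stable ... 5 = critical

    rho, p = spearmanr(acuity_score, nurse_rating)
    print("Spearman rho = %.2f (p = %.3f)" % (rho, p))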