MIMIC-III is the largest publicly available clinical dataset.
Here we seek to predict in-hospital mortality using data about the patient's admission to the ICU.
import datetime
import numpy as np
import pandas as pd
import psycopg2
Let's set up our connection to our MIMIC postgres server. For more information on installing MIMIC, see the official documentation.
sqluser = 'dsontag' # change this to whatever your username is
dbname = 'mimic'
schema_name = 'mimiciii'
# Connect to local postgres version of mimic
con = psycopg2.connect(dbname=dbname, user=sqluser)
cur = con.cursor()
cur.execute('SET search_path to ' + schema_name)
query = \
"""
select
*
from admissions
"""
admissions_df = pd.read_sql_query(query, con)
admissions_df.head()
1) Much of the information we are interested in, such as the date of birth, is in the patients table, so we need to load it. Call the loaded patients table 'patients_df'.
query = \
"""
select * from patients
"""
patients_df = pd.read_sql_query(query, con)
patients_df.head()
2) Next we need to merge the two tables on the field subject_id, which is shared across both tables.
len(patients_df)
len(admissions_df)
combined_df = admissions_df.merge(patients_df, on='subject_id')
combined_df.head()
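To see what this merge does, here is a toy illustration (made-up data, not MIMIC): admissions is many-to-one with patients, since one subject_id can have several hospital admissions, so each admission row keeps its own hadm_id and gains the matching patient-level columns.

```python
import pandas as pd

# Toy tables mimicking the admissions/patients relationship
adm = pd.DataFrame({'subject_id': [1, 1, 2], 'hadm_id': [100, 101, 102]})
pat = pd.DataFrame({'subject_id': [1, 2], 'gender': ['F', 'M']})

# Many-to-one merge: subject 1's gender is repeated for both admissions
merged = adm.merge(pat, on='subject_id')
print(merged)
```

The merged frame has one row per admission (3 rows here), not one per patient.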
If we want to see the patient's age at admission, then we need to subtract the date of birth from the admission time. Working with dates and times is tricky in Python, so we write a function to compute the age. Note that subtracting two datetimes gives a timedelta; total_seconds() converts it to seconds, which we then divide to get years.
def get_age(dob, admittime):
    # timedelta -> seconds -> years (365.25 days per year on average)
    diff = (admittime - dob).total_seconds() / (3600 * 24 * 365.25)
    return diff
combined_df['age'] = combined_df.apply(lambda x: get_age(x['dob'], x['admittime']), axis=1)
combined_df['age'].head()
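One caveat: in MIMIC-III, dates of birth for patients older than 89 are shifted for de-identification, so their computed ages come out around 300 years. A common workaround (an assumption on our part, not something done above) is to cap those ages at 90, sketched here on toy values:

```python
import pandas as pd

# Toy ages including the ~300-year artifact produced by the date shift
ages = pd.Series([54.2, 71.9, 301.3])

# Keep ages <= 89 as-is; replace the shifted ones with 90
capped = ages.where(ages <= 89, 90)
print(capped.tolist())  # [54.2, 71.9, 90.0]
```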
3) We now define the features. Let's use 'admission_type', 'admission_location', 'insurance', 'marital_status', 'ethnicity', and 'gender', plus the numeric 'age'.
Note that because the other features are all categorical, we need to binarize them using get_dummies, which transforms a categorical feature x = ['a', 'b', 'a'] into x = [[1, 0], [0, 1], [1, 0]], where the columns correspond to the values a and b.
Combine the features into a single data frame called X.
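The get_dummies transformation described above can be checked directly on the tiny example (passing dtype=int so the indicators print as 0/1 rather than booleans):

```python
import pandas as pd

x = pd.Series(['a', 'b', 'a'])
# One indicator column per distinct value
print(pd.get_dummies(x, dtype=int))
#    a  b
# 0  1  0
# 1  0  1
# 2  1  0
```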
X1 = pd.get_dummies(combined_df['admission_type'], prefix='adm')
X2 = pd.get_dummies(combined_df['admission_location'], prefix='loc')
X3 = pd.get_dummies(combined_df['insurance'], prefix='insur')
X4 = pd.get_dummies(combined_df['marital_status'], prefix='marital')
X5 = pd.get_dummies(combined_df['ethnicity'], prefix='eth')
X6 = pd.get_dummies(combined_df['gender'], prefix='gender')
X7 = combined_df['age']
X = pd.concat([X1, X2, X3, X4, X5, X6, X7], axis=1)
X.head()
4) Define the outcome y to be hospital_expire_flag, which is 1 if the patient died during the admission and 0 otherwise. Note that we don't differentiate between types of death (in-hospital vs. in-ICU) and we are working only at the admission level.
y = combined_df['hospital_expire_flag']
y.head()
y.describe()
5) Lastly we train a logistic regression using standard ML practice, splitting the data into train and test sets. We are interested in the area under the ROC curve (AUC), so we need the predicted probability of death for each test admission to compare against the true labels.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8, random_state=0)
# The l1 penalty requires a solver that supports it, such as liblinear
clf = LogisticRegression(random_state=0, penalty='l1', C=.1, solver='liblinear')
clf.fit(Xtrain, ytrain)
from sklearn.metrics import roc_auc_score
ypred = clf.predict_proba(Xtest)[:,1]
roc_auc_score(ytest, ypred)
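As a sanity check on what roc_auc_score measures: the AUC equals the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. A toy example (made-up scores):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```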
6) How does each feature contribute to a person's likelihood of dying in the hospital? We can examine the LR coefficients for that.
for feature, coef in sorted(zip(X.columns, clf.coef_[0]), key=lambda x: x[1]):
    if coef != 0:
        print(feature, coef)