Assigned: September 13, 2017
Due: September 20, 2017
Responsible: Peter Szolovits
In this homework assignment, you are asked to learn some tools that are essential to the analysis of clinical data sets and to begin to explore a very rich collection of data from the MIMIC-III (aka MIMIC3) project, which we will be using in a number of subsequent assignments (and which provides excellent data sources for class projects). This assignment, therefore, covers an introduction to three topics:
A statistical analysis language. Data analyses were once programmed by hand (in FORTRAN!),
but today there are many sophisticated data analysis languages
that are interactive, contain high-quality implementations of
common (and not so common) statistical packages and
visualization tools, and thus give great power in analyzing
data. I typically use the R
language, which is free and quite popular among biomedical
informatics (and especially bioinformatics) researchers. Other
options would include MATLAB, SciPy,
Mathematica, Maple, S, SAS, ... . I will describe R methods for
analysis, so if you use other tools, you will need to figure out
how to make them work.

A database of clinical data. The MIMIC project, led by
Prof. Roger Mark of MIT, has collected rich, detailed data on
nearly 60,000 patient stays in the various intensive care
units (ICUs) at the Beth Israel/Deaconess Medical Center in
Boston. These data include tabular data reporting the results
of laboratory tests, prescriptions, clinical orders, billing
codes, minute-by-minute summary data from bedside monitors,
etc., narrative text that includes nursing and doctors' notes,
discharge summaries, radiology and pathology reports, etc.,
and high-resolution waveform data on a fraction of these
patients for whom such data were offloaded from the monitors
and successfully matched to their clinical records. The entire
database has been de-identified, so that it is (supposed to
be) impossible to link the records back to specific real
individuals. One may access the entire data set by passing the
COUHES training course in responsible human subjects research
and signing a limited data use agreement in which you promise
not to further redistribute the data or try to contact
patients if you discover flaws in our de-identification
algorithms. You should have completed these steps for the
first homework assignment. This database is loaded in MySQL on
the server safar.csail.mit.edu, and its current
version is called mimic3v14
.
There is also a smaller subset of 4,000 patients, all of whom
have died, which may be accessed as the database mimic2
without having to do the COUHES training. (Under U.S. laws,
dead people have reduced privacy rights.) The structure of the
data sets is described in an on-line document, but
please note that these data are based on an earlier version of
MIMIC-II and are structurally somewhat different than the
current version. I do not recommend using these older data,
which are primarily to allow developers to demonstrate MIMIC.
You can access the current data through a command line
interface on your own system if you have installed at least
the client side of MySQL via an invocation such as
mysql -h safar.csail.mit.edu -u mimic3 -p mimic3v14
You should then be asked for the password that we will provide
you in class, and then be connected to the database.
Most of your effort in this assignment is learning the tools we describe, familiarizing yourself with the content of the database, and just a bit of analysis of the data.
Here are abbreviated instructions on how to install the components you will need. This investment of effort will help not only with the current homework but with subsequent work as well.
If you wish to run your own MySQL database server, you will need
to install it on some machine for which you have administrative
privileges. The main benefit of doing this over using the data
available from my server is that my server provides read-only
access. Therefore, you will not be able to store results back into
the database or add other data to it if you use mine. The Github
repository has a script to load the data into MySQL; it is not
actively maintained, but it worked a year ago. (I created it, so I can
help if needed.)
MySQL's "Get
Started with MySQL" instructions provide useful pointers on
how to install the database on Windows, Mac OS X or Linux. For
example, on my Mac, I simply downloaded an installer package and
ran it, giving my password to allow the installer to put its
components into system-protected places. As of now, the latest
stable version of the Community (i.e., free) Edition of MySQL is
5.7.19, though for our purposes any version > 5.0 should work.
Some of the Linux package managers lag behind in supporting newer
releases. For example, on my server, which runs CSAIL's standard
Ubuntu Linux 14.04.5 LTS, if you run
apt-get install mysql-server
you should get a reasonably current version of MySQL. Safar runs version 5.7.14. Massively extensive documentation on MySQL is available from the developer.
As an alternative to running your own server, you can connect to
the MIMIC database via the MySQL client as described above, after you have
installed the client, of course. This is probably the easiest way to access the data, though it is subject to congestion if you are all using the same server at the same time.
R can be run via (a) a command line interface, (b) a slightly fancier command line interface embedded in a GUI that supports menu commands for things like installing/updating packages and allows viewing of graphs and documentation, or (c) a full GUI interface called RStudio that supports "literate programming" in which you can intermix markdown text with executable R and then knit them into an output document where results of running the R code are embedded in the document. This is like IPython (now Jupyter) and other modern interfaces, and is the one I have adopted.
You can download installers for R from the Comprehensive R Archive Network (CRAN), or use your favorite package manager to install it. This supports running methods (a) and (b). For (c), visit RStudio.com and download and install the appropriate ("community" = free) version for your OS. I believe RStudio now uses the version of R that you have installed from CRAN.
To allow access from R to MySQL databases, I used the Package
Installer to install the packages DBI and RMySQL, which support
import of data from any MySQL database. (There are similar packages such as RPostgreSQL.) To do this from the system
shell or from the R command line, see the relevant
R documentation. For many this will simply require running install.packages("RMySQL")
in R.
(NOTE: this may additionally require installing the MySQL
Connector; see the MySQL documentation. It once did, but I believe it no longer does, though I have not been able to test this.)
After installing the needed packages, you can do the following, using the password we supply instead of "xxx":
library(RMySQL)
con <- dbConnect(MySQL(), user="mimic3", password="xxx",
dbname="mimic3v14", host="safar.csail.mit.edu")
dbListTables(con)
pt <- dbReadTable(con, "patients")
dbGetQuery(con, "select gender, count(*) as n from patients group by gender")
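Two other standard DBI calls may come in handy, sketched here assuming the connection `con` created above: fetching a large result incrementally rather than all at once, and releasing the connection when you are done.

```r
# Sketch: fetch a big query in chunks, then clean up
res <- dbSendQuery(con, "select subject_id, itemid from chartevents")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 100000)  # retrieve 100,000 rows at a time
  # ... process chunk here ...
}
dbClearResult(res)
dbDisconnect(con)                    # close the connection when finished
```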
Full documentation on the RMySQL package is available, but I have been able to use it successfully without digging into the details. A few useful notes:
Dates in the database are returned to R as character strings. The
as.Date function parses them and returns a Date object, which drops
the time component but has useful methods for finding, e.g., the day
of the week. Applying as.numeric to such a Date yields the number of
days since 1970-01-01, and is thus useful for date comparisons and
duration calculations. Some fields (such as any involving events) also
include times, and for those the conversion should be done by
as.POSIXct, which produces a representation of both the date and
time. There are limits to the date ranges for various representations,
which will occasionally bite you. See below for some examples.

Because both SQL and R have computational facilities, there are often
alternative ways to achieve the same effect. For example, to select
data on only the male patients, you could do either
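As a quick illustration of these conversions (the values are made up; recall that MIMIC dates are deliberately time-shifted into the future):

```r
# Sketch of R date handling, not tied to the database
d <- as.Date("2101-07-04")                # parse a date string
weekdays(d)                               # day of the week for a Date object
as.numeric(d)                             # days since 1970-01-01
t <- as.POSIXct("2101-07-04 13:45:00", tz = "UTC")  # date plus time
difftime(t, as.POSIXct("2101-07-01 00:00:00", tz = "UTC"), units = "days")
```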
male.pt <- dbGetQuery(con, "select * from patients where gender='M'")
summary(male.pt)
or
male.pt2 <- pt[pt$GENDER=="M",]
summary(male.pt2)
after having retrieved all patients into pt using dbReadTable, as
above. It is important to
figure out where your calculations are to take place, because
MySQL's language will be much more efficient than R's and if
you can select appropriate data in your SQL statements, you
will minimize the time to load a lot of extraneous data into
R. Also, unless you are running an R instance with 50GB of RAM
or more, you will probably exhaust memory if you simply try to
load all of the MIMIC tables into R. Not to speak of the
waveform data, which occupies terabytes.
Note that MySQL tends to be case-insensitive unless you work at making it
case-sensitive. R, however, tends to be case-sensitive. Consequently,
when you retrieved pt
in the example above, the column labels were in upper-case, so the natural idea of typing
male.pt2 <- pt[pt$gender=="M",]
would give an error. I often convert all column names to lower-case, as in
names(pt) <- tolower(names(pt))
If you prefer to use a different environment than R, you are
welcome to do so, but our ability to support it is much less. Many people at CSAIL are familiar with Python-based tools to do this kind of data analysis, and you may consult them if you want to use Python.
The MIMIC-III database is available through the server above. Note that the data are mostly highly normalized, and therefore
requires joins among multiple tables in order to bring together
data that you may wish to see simultaneously.
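For example (a sketch, using column names from the MIMIC-III schema), associating each ICU stay with the patient's gender requires a join between icustays and patients on subject_id:

```r
# Sketch: a join across normalized tables, run over the connection above
stays <- dbGetQuery(con, "
  select i.subject_id, i.first_careunit, p.gender
  from icustays i join patients p on i.subject_id = p.subject_id")
head(stays)
```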
If you are not familiar with relational databases, please consult these two chapters from an older edition of C. J. Date's Database textbook: Intro to Relational Databases, and The SQL Language. His description is not identical in detail to the SQL used in MySQL or Postgres, but conceptually the same. Consult the systems' documentation for the precise details. In a later homework assignment, we will study issues of database design and normalization in more depth.
As suggested in the MIMIC-III documentation, investigate the amount of data in the various database tables. For example, you should find that there are 46,520 patients in the database:
mysql> select count(*) from patients;
+----------+
| count(*) |
+----------+
|    46520 |
+----------+
1 row in set (0.01 sec)
Answer each of the following questions:

How many patients have at least one stay recorded in icustays?
(Include your query.)

How many icustays are there in total? (Include your query.)

The chartevents
table records measurements and
observations that are part of a patient's (electronic) chart. Its
itemid field gives a numeric code that identifies the particular
type of observation recorded in each entry, and those codes are
translated to readable text descriptions in the d_items table.
Write a SQL statement that shows the 20 most frequently occurring
types of chart events and the number of occurrences. HINT: I used
a group by. To get not only the itemid but the definition of that
item, I also needed both a join and a nested select.
You should have observed that two of these labels are "Heart Rate", one from CareVue and one from MetaVision. The hospital switched ICU information systems from the first to the second during the time the MIMIC data were being collected, and so far it has proven difficult to fully integrate the two data sets.
Plot the distributions of the two items corresponding to heart rate measurements that you found in the last problem. You will note that the distributions are quite different! Can you suggest an explanation of why this might be the case? HINT: take a look at the age distributions of the patients who are having their heart rates measured. HINT to the HINT: Age is not stored for each heart rate measurement, but can be computed from the charttime of the measurement and the patient's dob, both of which should have been time-shifted equally.
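One way to compute such an age in R, sketched on made-up values:

```r
# Sketch: age in years from a chart time and a date of birth
charttime <- as.POSIXct("2101-07-04 13:45:00", tz = "UTC")
dob       <- as.POSIXct("2045-03-01 00:00:00", tz = "UTC")
age.years <- as.numeric(difftime(charttime, dob, units = "days")) / 365.25
age.years
```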
Using a combination of SQL and R facilities, show a histogram of
the number of chart events associated with each patient. You
may also find it instructive to consider the log10 transform of
these numbers. Other helpful R visualization tools include sort,
plot, and boxplot.
Try them and see how to best characterize the distribution of the
number of chart events per patient.
Submit the figure you think best characterizes the distribution
and explain why you chose that figure. Simple counts such as this have sometimes proven to be a useful approximate measure of a patient's disease severity—the more measurements, the sicker the patient.
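On synthetic counts (the real ones come from chartevents), the kind of comparison intended here might look like:

```r
# Sketch with simulated per-patient event counts (heavy right tail)
set.seed(1)
counts <- round(rlnorm(1000, meanlog = 6, sdlog = 1.5))
hist(counts)             # raw counts: dominated by a few very large values
hist(log10(counts + 1))  # log10 transform spreads out the bulk of the data
boxplot(counts)          # another view of the same distribution
plot(sort(counts))       # sorted counts reveal the shape of the tail
```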
Create a visualization that compares the number of chart events per patient stratified by the first care unit the patient is admitted to during a stay in the ICU.
Submit the visualization that you think best illustrates this comparison and explain any problems of data interpretation you faced.
Prediction is an important clinical task. In this part, you will build a simple model to predict in-hospital mortality. In doing so, you will: extract features, extract outcomes, split the data into a training and testing set, build a model, and evaluate its performance. Include your code as part of the submission.
Using only the first icustay for each patient, extract the following
features: the patient's age at ICU admission (computed from intime
and dob), gender, and the first care unit.

Create and submit a histogram for each of these variables.
Explain anything that looks surprising in these figures and any
corrections you make as a result.
HINT: to get the earliest of something like a date, you can group rows using group by and take min of the relevant date field within each group. E.g., for the first icustay intake time for each patient, you can use something like this:

select subject_id, min(intime) as mintime from icustays group by subject_id
This is only part of what needs to be done, as you also need to get other information for this ICU stay.
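One way to finish the job in R, sketched on a toy data frame standing in for icustays: compute each patient's earliest intime, then merge it back against the full table to recover the rest of that stay's row.

```r
# Toy stand-in for icustays; real data would come from dbGetQuery
stays <- data.frame(
  subject_id     = c(1, 1, 2),
  intime         = as.POSIXct(c("2101-01-05", "2101-03-09", "2102-06-01"),
                              tz = "UTC"),
  first_careunit = c("MICU", "SICU", "CCU"))
first <- aggregate(intime ~ subject_id, data = stays, FUN = min)
first.stays <- merge(stays, first)  # keeps only rows whose intime is the minimum
first.stays
```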
Extract the following outcome for each of the patients for which features have been extracted: in-hospital mortality (hospital_expire_flag in admissions).

What percentage of patients die in the hospital?
Split your data into a training (70%) and testing (30%) set.
What percentage of patients die in the hospital for the training set?
What percentage of patients die in the hospital for the testing set? Should you make sure that these are closely balanced? What would you do if your data set were very small?
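A common way to make such a split in R, sketched on a synthetic data frame:

```r
# Sketch: 70/30 train/test split of a data frame d
set.seed(42)
d <- data.frame(age  = rnorm(1000, 65, 15),
                died = rbinom(1000, 1, 0.1))
train.idx <- sample(nrow(d), size = round(0.7 * nrow(d)))
train <- d[train.idx, ]
test  <- d[-train.idx, ]
mean(train$died)  # fraction dying in the training set
mean(test$died)   # fraction dying in the testing set
```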
Build a logistic regression model to predict the outcome (in-hospital mortality) using this limited set of features (age, gender, admitting care unit) on the training data. Then use this model to predict outcome in the testing data.
For R users, glm will make building this model simple. There are a
number of resources that illustrate working with glm, such as How to
Perform a Logistic Regression in R and the R Cookbook. Notably, since
glm can fit a large variety of models, be sure to specify the correct
family (i.e., binomial(link="logit")) for logistic regression.
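A minimal sketch of the fit-and-predict cycle on synthetic data (the real model would use the features extracted above):

```r
# Sketch: logistic regression with glm, then prediction on held-out data
set.seed(7)
n <- 1000
d <- data.frame(age    = rnorm(n, 65, 15),
                gender = factor(sample(c("M", "F"), n, replace = TRUE)))
p <- plogis(-5 + 0.05 * d$age)  # simulated mortality depending on age
d$died <- rbinom(n, 1, p)
train <- d[1:700, ]
test  <- d[701:n, ]
m <- glm(died ~ age + gender, data = train,
         family = binomial(link = "logit"))
test$pred <- predict(m, newdata = test, type = "response")  # probabilities
summary(m)
```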
What is the AUC of this model?
To get an estimate of how certain your AUC value is, you could try a variety of train/test splits (typically n-fold cross-validation) and examine the distribution of AUC values.
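AUC can be computed with a package such as pROC, or directly from the rank statistic: AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A sketch of the latter in base R on synthetic scores:

```r
# Sketch: AUC from ranks (Mann-Whitney formulation)
auc <- function(score, label) {   # label: 1 = died, 0 = survived
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
set.seed(3)
label <- rbinom(500, 1, 0.2)
score <- label * 0.3 + runif(500)  # scores mildly related to the outcome
auc(score, label)
```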