Assigned: September 13, 2017
Due: September 20, 2017
Responsible: Peter Szolovits
In this homework assignment, you are asked to learn some tools that are essential to the analysis of clinical data sets and to begin to explore a very rich collection of data from the MIMIC-III (aka MIMIC3) project, which we will be using in a number of subsequent assignments (and which provides excellent data sources for class projects). This assignment, therefore, covers an introduction to three topics:
A statistical analysis language. Data analyses were once programmed by hand (in FORTRAN!),
but today there are many sophisticated data analysis languages
that are interactive, contain high-quality implementations of
common (and not so common) statistical packages and
visualization tools, and thus give great power in analyzing
data. I typically use the R
language, which is free and quite popular among biomedical
informatics (and especially bioinformatics) researchers. Other
options would include MATLAB, SciPy,
Mathematica, Maple, S, SAS, ... . I will describe R methods for
analysis, so if you use other tools, you will need to figure out
how to make them work.

A database of clinical data. The MIMIC project, led by
Prof. Roger Mark of MIT, has collected rich, detailed data on
nearly 60,000 patient stays in the various intensive care
units (ICUs) at the Beth Israel/Deaconess Medical Center in
Boston. These data include tabular data reporting the results
of laboratory tests, prescriptions, clinical orders, billing
codes, minute-by-minute summary data from bedside monitors,
etc., narrative text that includes nursing and doctors' notes,
discharge summaries, radiology and pathology reports, etc.,
and high-resolution waveform data on a fraction of these
patients for whom such data were offloaded from the monitors
and successfully matched to their clinical records. The entire
database has been de-identified, so that it is (supposed to
be) impossible to link the records back to specific real
individuals. One may access the entire data set by passing the
COUHES training course in responsible human subjects research
and signing a limited data use agreement in which you promise
not to further redistribute the data or try to contact
patients if you discover flaws in our de-identification
algorithms. You should have completed these steps for the
first homework assignment. This database is loaded in MySQL on
the server safar.csail.mit.edu, and its current
version is called mimic3v14
.
There is also a smaller subset of 4,000 patients, all of whom
have died, which may be accessed as the database mimic2
without having to do the COUHES training. (Under U.S. laws,
dead people have reduced privacy rights.) The structure of the
data sets is described in an on-line document, but
please note that these data are based on an earlier version of
MIMIC-II and are structurally somewhat different than the
current version. I do not recommend using these older data,
which are primarily to allow developers to demonstrate MIMIC.
You can access the current data through a command line
interface on your own system if you have installed at least
the client side of MySQL via an invocation such as
mysql -h safar.csail.mit.edu -u mimic3 -p mimic3v14
You should then be asked for the password that we will provide
you in class, and then be connected to the database.
Most of your effort in this assignment is learning the tools we describe, familiarizing yourself with the content of the database, and just a bit of analysis of the data.
Here are abbreviated instructions on how to install the components you will need. This investment of effort will help not only with the current homework but with subsequent work as well.
If you wish to run your own MySQL database server, you will need
to install it on some machine for which you have administrative
privileges. The main benefit of doing this over using the data
available from my server is that my server provides read-only
access. Therefore, you will not be able to store results back into
the database or add other data to it if you use mine. The Github
repository has a script to load the data into MySQL; it is not
actively maintained, but it worked a year ago. (I created it, so I can
help if needed.)
MySQL's "Get
Started with MySQL" instructions provide useful pointers on
how to install the database on Windows, Mac OS X or Linux. For
example, on my Mac, I simply downloaded an installer package and
ran it, giving my password to allow the installer to put its
components into system-protected places. As of now, the latest
stable version of the Community (i.e., free) Edition of MySQL is
5.7.19, though for our purposes any version > 5.0 should work.
Some of the Linux package managers lag behind in supporting newer
releases. For example, on my server, which runs CSAIL's standard
Ubuntu Linux 14.04.5 LTS, if you run
apt-get install mysql-server
you should get a reasonably current version of MySQL. Safar runs version 5.7.14. Massively extensive documentation on MySQL is available from the developer.
As an alternative to running your own server, you can connect to
the MIMIC database via the MySQL client as described above, after you have
installed the client, of course. This is probably the easiest way to access the data, though it is subject to congestion if you are all using the same server at the same time.
R can be run via (a) a command line interface, (b) a slightly fancier command line interface embedded in a GUI that supports menu commands for things like installing/updating packages and allows viewing of graphs and documentation, or (c) a full GUI interface called RStudio that supports "literate programming" in which you can intermix markdown text with executable R and then knit them into an output document where results of running the R code are embedded in the document. This is like IPython (now Jupyter) and other modern interfaces, and is the one I have adopted.
You can download installers for R from the Comprehensive R Archive Network (CRAN), or use your favorite package manager to install it. This supports running methods (a) and (b). For (c), visit RStudio.com and download and install the appropriate ("community" = free) version for your OS. I believe RStudio now uses the version of R that you have installed from CRAN.
To allow access from R to MySQL databases, I used the Package
Installer to install the packages DBI and RMySQL, which support
import of data from any MySQL database. (There are similar packages such as RPostgreSQL.) To do this from the system
shell or from the R command line, see the relevant
R documentation. For many this will simply require running install.packages("RMySQL")
in R.
(NOTE: this may additionally require installing the MySQL
Connector; see the MySQL documentation. It once did, but I believe it no longer does, though I have not been able to test this.)
After installing the needed packages, you can do the following, using the password we supply instead of "xxx":
library(RMySQL)
con <- dbConnect(MySQL(), user="mimic3", password="xxx",
dbname="mimic3v14", host="safar.csail.mit.edu")
dbListTables(con)
pt <- dbReadTable(con, "patients")
dbGetQuery(con, "select gender, count(*) as n from patients group by gender")
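Two other standard DBI calls may come in handy, sketched here assuming the connection `con` created above: fetching a large result incrementally rather than all at once, and releasing the connection when you are done.

```r
# Sketch: fetch a big query in chunks, then clean up
res <- dbSendQuery(con, "select subject_id, itemid from chartevents")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 100000)  # retrieve 100,000 rows at a time
  # ... process chunk here ...
}
dbClearResult(res)
dbDisconnect(con)                    # close the connection when finished
```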
Full documentation on the RMySQL package is available, but I have been able to use it successfully without digging into the details. A few useful notes:
Dates in the database are returned to R as character strings. The
as.Date function parses them and returns a Date object, which drops
the time component but has useful methods for finding, e.g., the day
of the week. Applying as.numeric to such a Date yields the number of
days since 1970-01-01, and is thus useful for date comparisons and
duration calculations. Some fields (such as any involving events) also
include times, and for those the conversion should be done by
as.POSIXct, which produces a representation of both the date and
time. There are limits to the date ranges for various representations,
which will occasionally bite you. See below for some examples.

Because both SQL and R have computational facilities, there are often
alternative ways to achieve the same effect. For example, to select
data on only the male patients, you could do either
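As a quick illustration of these conversions (the values are made up; recall that MIMIC dates are deliberately time-shifted into the future):

```r
# Sketch of R date handling, not tied to the database
d <- as.Date("2101-07-04")                # parse a date string
weekdays(d)                               # day of the week for a Date object
as.numeric(d)                             # days since 1970-01-01
t <- as.POSIXct("2101-07-04 13:45:00", tz = "UTC")  # date plus time
difftime(t, as.POSIXct("2101-07-01 00:00:00", tz = "UTC"), units = "days")
```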
male.pt <- dbGetQuery(con, "select * from patients where gender='M'")
summary(male.pt)
or
male.pt2 <- pt[pt$GENDER=="M",]
summary(male.pt2)
after having retrieved all patients into pt using dbReadTable, as
above. It is important to
figure out where your calculations are to take place, because
MySQL's language will be much more efficient than R's and if
you can select appropriate data in your SQL statements, you
will minimize the time to load a lot of extraneous data into
R. Also, unless you are running an R instance with 50GB of RAM
or more, you will probably exhaust memory if you simply try to
load all of the MIMIC tables into R. Not to speak of the
waveform data, which occupies terabytes.
Note that MySQL tends to be case-insensitive unless you work at making it
case-sensitive. R, however, tends to be case-sensitive. Consequently,
when you retrieved pt
in the example above, the column labels were in upper-case, so the natural idea of typing
male.pt2 <- pt[pt$gender=="M",]
would give an error. I often convert all column names to lower-case, as in
names(pt) <- tolower(names(pt))
If you prefer to use a different environment than R, you are
welcome to do so, but our ability to support it is much less. Many people at CSAIL are familiar with Python-based tools to do this kind of data analysis, and you may consult them if you want to use Python.
The MIMIC-III database is available through the server above. Note that the data are mostly highly normalized, and therefore
requires joins among multiple tables in order to bring together
data that you may wish to see simultaneously.
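For example (a sketch, using column names from the MIMIC-III schema), associating each ICU stay with the patient's gender requires a join between icustays and patients on subject_id:

```r
# Sketch: a join across normalized tables, run over the connection above
stays <- dbGetQuery(con, "
  select i.subject_id, i.first_careunit, p.gender
  from icustays i join patients p on i.subject_id = p.subject_id")
head(stays)
```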
If you are not familiar with relational databases, please consult these two chapters from an older edition of C. J. Date's Database textbook: Intro to Relational Databases, and The SQL Language. His description is not identical in detail to the SQL used in MySQL or Postgres, but conceptually the same. Consult the systems' documentation for the precise details. In a later homework assignment, we will study issues of database design and normalization in more depth.
As suggested in the MIMIC-III documentation, investigate the amount of data in the various database tables. For example, you should find that there are 46,520 patients in the database:
mysql> select count(*) from patients;
+----------+
| count(*) |
+----------+
|    46520 |
+----------+
1 row in set (0.01 sec)
Answer each of the following questions:

How many patients have at least one stay recorded in icustays?
(Include your query.)

How many icustays are there in total? (Include your query.)

The chartevents
table records measurements and
observations that are part of a patient's (electronic) chart. Its
itemid field gives a numeric code that identifies the particular
type of observation recorded in each entry, and those codes are
translated to readable text descriptions in the d_items table.
Write a SQL statement that shows the 20 most frequently occurring
types of chart events and the number of occurrences. HINT: I used
a group by. To get not only the itemid but the definition of that
item, I also needed both a join and a nested select.
You should have observed that two of these labels are "Heart Rate", one from CareVue and one from MetaVision. The hospital switched ICU information systems from the first to the second during the time the MIMIC data were being collected, and so far it has proven difficult to fully integrate the two data sets.
Plot the distributions of the two items corresponding to heart rate measurements that you found in the last problem. You will note that the distributions are quite different! Can you suggest an explanation of why this might be the case? HINT: take a look at the age distributions of the patients who are having their heart rates measured. HINT to the HINT: Age is not stored for each heart rate measurement, but can be computed from the charttime of the measurement and the patient's dob, both of which should have been time-shifted equally.
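One way to compute such an age in R, sketched on made-up values:

```r
# Sketch: age in years from a chart time and a date of birth
charttime <- as.POSIXct("2101-07-04 13:45:00", tz = "UTC")
dob       <- as.POSIXct("2045-03-01 00:00:00", tz = "UTC")
age.years <- as.numeric(difftime(charttime, dob, units = "days")) / 365.25
age.years
```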
Using a combination of SQL and R facilities, show a histogram of
the number of chart events associated with each patient. You
may also find it instructive to consider the log10 transform of
these numbers. Other helpful R visualization tools include sort,
plot, and boxplot.
Try them and see how to best characterize the distribution of the
number of chart events per patient.
Submit the figure you think best characterizes the distribution
and explain why you chose that figure. Simple counts such as this have sometimes proven to be a useful approximate measure of a patient's disease severity—the more measurements, the sicker the patient.
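On synthetic counts (the real ones come from chartevents), the kind of comparison intended here might look like:

```r
# Sketch with simulated per-patient event counts (heavy right tail)
set.seed(1)
counts <- round(rlnorm(1000, meanlog = 6, sdlog = 1.5))
hist(counts)             # raw counts: dominated by a few very large values
hist(log10(counts + 1))  # log10 transform spreads out the bulk of the data
boxplot(counts)          # another view of the same distribution
plot(sort(counts))       # sorted counts reveal the shape of the tail
```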
Create a visualization that compares the number of chart events per patient stratified by the first care unit the patient is admitted to during a stay in the ICU.
Submit the visualization that you think best illustrates this comparison and explain any problems of data interpretation you faced.
Prediction is an important clinical task. In this part, you will build a simple model to predict in-hospital mortality. In doing so, you will: extract features, extract outcomes, split the data into a training and testing set, build a model, and evaluate its performance. Include your code as part of the submission.
Using only the first icustay for each patient, extract the following
features: the patient's age at ICU admission (computed from intime
and dob), gender, and the first care unit.

Create and submit a histogram for each of these variables.
Explain anything that looks surprising in these figures and any
corrections you make as a result.
HINT: to get the earliest of something like a date, you can group rows using group by and take min of the relevant date field within each group. E.g., for the first icustay intake time for each patient, you can use something like this:

select subject_id, min(intime) as mintime from icustays group by subject_id
This is only part of what needs to be done, as you also need to get other information for this ICU stay.
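One way to finish the job in R, sketched on a toy data frame standing in for icustays: compute each patient's earliest intime, then merge it back against the full table to recover the rest of that stay's row.

```r
# Toy stand-in for icustays; real data would come from dbGetQuery
stays <- data.frame(
  subject_id     = c(1, 1, 2),
  intime         = as.POSIXct(c("2101-01-05", "2101-03-09", "2102-06-01"),
                              tz = "UTC"),
  first_careunit = c("MICU", "SICU", "CCU"))
first <- aggregate(intime ~ subject_id, data = stays, FUN = min)
first.stays <- merge(stays, first)  # keeps only rows whose intime is the minimum
first.stays
```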
Extract the following outcome for each of the patients for which features have been extracted: in-hospital mortality (hospital_expire_flag in admissions).

What percentage of patients die in the hospital?
Split your data into a training (70%) and testing (30%) set.
What percentage of patients die in the hospital for the training set?
What percentage of patients die in the hospital for the testing set? Should you make sure that these are closely balanced? What would you do if your data set were very small?
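A common way to make such a split in R, sketched on a synthetic data frame:

```r
# Sketch: 70/30 train/test split of a data frame d
set.seed(42)
d <- data.frame(age  = rnorm(1000, 65, 15),
                died = rbinom(1000, 1, 0.1))
train.idx <- sample(nrow(d), size = round(0.7 * nrow(d)))
train <- d[train.idx, ]
test  <- d[-train.idx, ]
mean(train$died)  # fraction dying in the training set
mean(test$died)   # fraction dying in the testing set
```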
Build a logistic regression model to predict the outcome (in-hospital mortality) using this limited set of features (age, gender, admitting care unit) on the training data. Then use this model to predict outcome in the testing data.
For R users, glm will make building this model simple. There are a
number of resources that illustrate working with glm, such as How to
Perform a Logistic Regression in R and the R Cookbook. Notably, since
glm can fit a large variety of models, be sure to specify the correct
family (i.e., binomial(link="logit")) for logistic regression.
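A minimal sketch of the fit-and-predict cycle on synthetic data (the real model would use the features extracted above):

```r
# Sketch: logistic regression with glm, then prediction on held-out data
set.seed(7)
n <- 1000
d <- data.frame(age    = rnorm(n, 65, 15),
                gender = factor(sample(c("M", "F"), n, replace = TRUE)))
p <- plogis(-5 + 0.05 * d$age)  # simulated mortality depending on age
d$died <- rbinom(n, 1, p)
train <- d[1:700, ]
test  <- d[701:n, ]
m <- glm(died ~ age + gender, data = train,
         family = binomial(link = "logit"))
test$pred <- predict(m, newdata = test, type = "response")  # probabilities
summary(m)
```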
What is the AUC of this model?
To get an estimate of how certain your AUC value is, you could try a variety of train/test splits (typically n-fold cross-validation) and examine the distribution of AUC values.
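AUC can be computed with a package such as pROC, or directly from the rank statistic: AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A sketch of the latter in base R on synthetic scores:

```r
# Sketch: AUC from ranks (Mann-Whitney formulation)
auc <- function(score, label) {   # label: 1 = died, 0 = survived
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
set.seed(3)
label <- rbinom(500, 1, 0.2)
score <- label * 0.3 + runif(500)  # scores mildly related to the outcome
auc(score, label)
```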