We systematically explore the data in the MIMIC III database, focusing on labevents
and chartevents
where the values are numeric. To simplify and speed up this process, we read all the relevant data into R’s memory. This only works on machines with very large RAM provisioning. I believe it requires a system with 64GB of RAM, and have verified that it does not work on a system with “only” 32GB.
library(tidyverse)
library(RMySQL)
library(lubridate)
con <- dbConnect(MySQL(), dbname = "mimic3v14", username="mimic3", password=pwd, host=host)
in_clause <- function(items, col="itemid") {
if (class(items)=="numeric")
paste(col, " in (", paste(items, collapse=","), ")", sep="")
else if (class(items)=="character")
paste(col, "in (", paste0(sprintf("'%s'", items), collapse = ", "), ")")
}
itemid
We have previously checked to see whether there is any overlap between the same itemid
s appearing in different tables that use them. There are no overlaps in itemid
s among the following tables, so at least for this version of MIMIC, we do not need to check again.
labevents
chartevents
datetimeevents
inputevents_cv
inputevents_mv
outputevents
procedureevents_mv
microbiologyevents
Note that for microbiologyevents
, itemid
s are used in three fields, spec_itemid
, org_itemid
, and ab_itemid
. They also do not overlap each other.
Because all the data require a large memory, we can debug the code by reading only a subset. In order to sample uniformly from the available data, we take only every 5th sample from labevents
and every 30th from chartevents
.
First, all the labs
# labs <- dbReadTable(con, "labevents")
sql_labs <- "select subject_id, hadm_id, itemid, charttime, valuenum, valueuom, flag from labevents where valuenum is not null"
# sql_labs <- "select subject_id, hadm_id, itemid, charttime, valuenum, valueuom, flag from labevents where valuenum is not null and mod(row_id, 5)=0"
labs <- dbGetQuery(con, sql_labs)
labs$charttime <- ymd_hms(labs$charttime)
labs$valueuom <- factor(labs$valueuom)
labs$flag <- factor(labs$flag)
summary(labs)
subject_id hadm_id itemid charttime valuenum valueuom flag
Min. : 2 Min. :100001 Min. :50801 Min. :2096-02-26 01:00:00 Min. : -414 mEq/L :5326273 abnormal: 9740677
1st Qu.:11362 1st Qu.:125117 1st Qu.:50882 1st Qu.:2126-07-26 03:54:00 1st Qu.: 4 mg/dL :5038921 delta : 64571
Median :22951 Median :149888 Median :50971 Median :2151-02-12 03:01:00 Median : 18 % :3942299 NA's :15127587
Mean :32129 Mean :149976 Mean :51046 Mean :2151-06-04 04:16:10 Mean : 79 K/uL :1533311
3rd Qu.:49998 3rd Qu.:175035 3rd Qu.:51249 3rd Qu.:2176-07-19 04:44:00 3rd Qu.: 60 IU/L :1025101
Max. :99999 Max. :199999 Max. :51555 Max. :2210-08-24 05:53:00 Max. :14272000 (Other):7059667
NA's :4817521 NA's :1007263
Then the chart events
# charts <- dbReadTable(con, "chartevents")
sql_charts <- "select subject_id, hadm_id, icustay_id, itemid, charttime, storetime, cgid, valuenum, valueuom, warning, error, resultstatus, stopped from chartevents where valuenum is not null"
# sql_charts <- "select subject_id, hadm_id, icustay_id, itemid, charttime, storetime, cgid, valuenum, valueuom, warning, error, resultstatus, stopped from chartevents where valuenum is not null and mod(row_id, 30)=0"
charts <- dbGetQuery(con, sql_charts)
charts$charttime <- ymd_hms(charts$charttime)
charts$storetime <- ymd_hms(charts$storetime)
charts$valueuom <- factor(charts$valueuom)
charts$warning <- factor(charts$warning)
charts$error <- factor(charts$error)
charts$resultstatus <- factor(charts$resultstatus)
charts$stopped <- factor(charts$stopped)
summary(charts)
subject_id hadm_id icustay_id itemid charttime storetime cgid valuenum
Min. : 2 Min. :100001 Min. :200001 Min. : 2 Min. :2100-06-07 20:00:00 Min. :2100-06-07 20:00:00 Min. :14010 Min. : -10069
1st Qu.:11043 1st Qu.:125400 1st Qu.:225957 1st Qu.: 646 1st Qu.:2127-10-14 20:00:00 1st Qu.:2126-09-27 19:44:00 1st Qu.:15449 1st Qu.: 7
Median :22378 Median :149029 Median :250823 Median : 5814 Median :2152-12-04 15:00:00 Median :2151-06-24 23:07:00 Median :17474 Median : 40
Mean :30873 Mean :149782 Mean :250393 Mean : 75025 Mean :2152-03-26 20:51:45 Mean :2151-07-07 05:16:42 Mean :17624 Mean : 72
3rd Qu.:46092 3rd Qu.:174660 3rd Qu.:275019 3rd Qu.:220181 3rd Qu.:2177-07-29 19:00:00 3rd Qu.:2176-11-14 15:51:00 3rd Qu.:19648 3rd Qu.: 96
Max. :99999 Max. :199999 Max. :299999 Max. :228444 Max. :2209-08-07 16:52:00 Max. :2209-08-07 16:52:00 Max. :21570 Max. :10000000
NA's :147506 NA's :6312672 NA's :6312672
valueuom warning error resultstatus stopped
mmHg :30642461 0 :45893601 0 :47922529 Final : 8739290 NotStopd:98323835
BPM :15386992 1 : 2091540 1 : 62612 Manual: 665998 NA's :48180333
% :13896202 NA's:98519027 NA's:98519027 NA's :137098880
bpm : 8317335
kg : 6633378
(Other):41584934
NA's :30042866
Then we see how often each of the lab and chart items occur in our data
d_labitems <- dbReadTable(con, "d_labitems")
names(d_labitems) <- tolower(names(d_labitems))
d_labitems$row_id <- NULL
d_labitems$category <- tools::toTitleCase(tolower(d_labitems$category)) # fixes a few capitalization errors in data
# d_labitems$label <- factor(d_labitems$label)
d_labitems$fluid <- factor(d_labitems$fluid)
d_labitems$category <- factor(d_labitems$category)
d_labitems <- labs %>%
group_by(itemid) %>%
summarize(n=n(), n_subjects=n_distinct(subject_id), n_adm=n_distinct(hadm_id)) %>%
arrange(desc(n)) %>%
ungroup() %>%
inner_join(d_labitems, by="itemid")
d_labitems$desc <-
sprintf("%d (%d) %s in %s%s",
d_labitems$itemid, d_labitems$n, d_labitems$label, d_labitems$fluid,
ifelse(d_labitems$category=="Blood Gas", " {BG}",
ifelse(d_labitems$category=="Chemistry", " {Chem}",
" {Hem}")))
d_labitems$linksto <- factor("labevents")
summary(d_labitems)
itemid n n_subjects n_adm label fluid category loinc_code
Min. :50801 Min. : 1.0 Min. : 1.00 Min. : 1.0 Length:494 Blood :250 Blood Gas : 29 Length:494
1st Qu.:50961 1st Qu.: 31.5 1st Qu.: 24.25 1st Qu.: 17.0 Class :character Urine : 64 Chemistry :226 Class :character
Median :51106 Median : 426.0 Median : 285.50 Median : 221.5 Mode :character Other Body Fluid : 49 Hematology:239 Mode :character
Mean :51134 Mean : 50471.3 Mean : 5258.53 Mean : 5925.8 Ascites : 36
3rd Qu.:51299 3rd Qu.: 4697.2 3rd Qu.: 2228.25 3rd Qu.: 1738.0 Pleural : 34
Max. :51555 Max. :881653.0 Max. :45383.00 Max. :57103.0 Cerebrospinal Fluid (CSF): 27
(Other) : 34
desc linksto
Length:494 labevents:494
Class :character
Mode :character
Correlate the above counts with information on each of the items from the d_items
and d_labitems
tables
d_items <- dbReadTable(con, "d_items")
names(d_items) <- tolower(names(d_items))
d_items$row_id <- NULL
d_items$conceptid <- NULL # this column is all nulls
# d_items$label <- factor(d_items$label)
d_items$abbreviation <- factor(d_items$abbreviation)
d_items$dbsource <- factor(d_items$dbsource)
d_items$linksto <- factor(d_items$linksto)
d_items$category <- factor(d_items$category)
d_items$unitname <- factor(d_items$unitname)
d_items$param_type <- factor(d_items$param_type)
d_items <- charts %>%
group_by(itemid) %>%
summarize(n=n(), n_subjects=n_distinct(subject_id), n_adm=n_distinct(hadm_id)) %>%
arrange(desc(n)) %>%
ungroup() %>%
inner_join(d_items, by="itemid")
d_items$desc <-
sprintf("%d-%s (%d) %s %s",
d_items$itemid,
ifelse(d_items$dbsource=="hospital", "hosp",
ifelse(d_items$dbsource=="carevue", "cv",
ifelse(d_items$dbsource=="metavision", "mv", "???"))),
d_items$n, d_items$label,
ifelse(is.na(d_items$category), "", paste0("{", d_items$category, "}", sep="")))
summary(d_items)
itemid n n_subjects n_adm label abbreviation dbsource
Min. : 2 Min. : 1 Min. : 1.0 Min. : 1.0 Length:2884 14 Gauge Dressing Occlusive : 1 carevue :2111
1st Qu.: 2024 1st Qu.: 8 1st Qu.: 1.0 1st Qu.: 1.0 Class :character 14 Gauge placed in outside facility: 1 hospital : 0
Median : 5578 Median : 64 Median : 5.0 Median : 5.0 Mode :character 14 Gauge placed in the field : 1 metavision: 773
Mean : 63171 Mean : 50799 Mean : 1770.6 Mean : 2039.7 16 Gauge Dressing Occlusive : 1
3rd Qu.:220671 3rd Qu.: 4369 3rd Qu.: 510.8 3rd Qu.: 537.5 16 Gauge placed in outside facility: 1
Max. :228444 Max. :5179363 Max. :29898.0 Max. :34901.0 (Other) : 768
NA's :2111
linksto category unitname param_type desc
chartevents :2884 Labs : 98 None : 196 Numeric : 450 Length:2884
datetimeevents : 0 Access Lines - Invasive: 74 mmHg : 72 Checkbox : 190 Class :character
inputevents_cv : 0 Respiratory : 67 cm : 35 Text : 100 Mode :character
inputevents_mv : 0 Chemistry : 56 % : 30 Numeric with tag: 33
microbiologyevents: 0 Scores - APACHE IV (2) : 53 L/min : 21 Date time : 0
outputevents : 0 (Other) : 621 (Other): 129 (Other) : 0
procedureevents_mv: 0 NA's :1915 NA's :2401 NA's :2111
Combine items and labitems
d_all <- rbind(
d_labitems[,c("itemid", "n", "n_subjects", "n_adm", "label", "category", "linksto", "desc")],
d_items[,c("itemid", "n", "n_subjects", "n_adm", "label", "category", "linksto", "desc")]
) %>% arrange(desc(n))
summary(d_all)
itemid n n_subjects n_adm label category linksto desc
Min. : 2 Min. : 1 Min. : 1.0 Min. : 1.0 Length:3378 Chemistry : 282 chartevents :2884 Length:3378
1st Qu.: 2358 1st Qu.: 9 1st Qu.: 1.0 1st Qu.: 1.0 Class :character Hematology : 254 labevents : 494 Class :character
Median : 6539 Median : 81 Median : 10.0 Median : 9.5 Mode :character Labs : 98 datetimeevents : 0 Mode :character
Mean : 61410 Mean : 50751 Mean : 2280.7 Mean : 2608.0 Access Lines - Invasive: 74 inputevents_cv : 0
3rd Qu.: 51428 3rd Qu.: 4398 3rd Qu.: 796.8 3rd Qu.: 771.8 Respiratory : 67 inputevents_mv : 0
Max. :228444 Max. :5179363 Max. :45383.0 Max. :57103.0 (Other) : 688 microbiologyevents: 0
NA's :1915 (Other) : 0
all <- rbind(labs %>% select(subject_id, hadm_id, itemid, charttime, valuenum, valueuom),
charts %>% select(subject_id, hadm_id, itemid, charttime, valuenum, valueuom))
We assume that data elements with identical labels may be related to each other, and will investigate this by comparing distributions of values. In some instances, the same label may identify measurements from different types of samples (e.g., blood vs. urine) or by different techniques, so we do not expact all the distributions to align.
d_bylabels <- d_all %>%
group_by(label) %>%
summarize(ntot=sum(n),
ambig=n(),
ns=paste(n, collapse=", "),
itemids=paste0(itemid, collapse=", "),
categories=paste0(category, collapse=", "),
# dbsources=paste0(dbsource, collapse=", "),
linksto=paste0(linksto, collapse=", ")
# units=paste0(unitname, collapse=", "),
# param_types=paste0(param_type, collapse=", ")
) %>%
ungroup() %>%
arrange(label)
d_bylabels_ambig <- d_bylabels %>% filter(ambig>1) %>% arrange(label)
d_bylabels_ambig
There are 3190 distinct labels for events, of which 119 are associated with multiple itemid
s. We consider the distribution of numerical values of each of these sets, though we limit examination to labels with a total count of at least a threshold number. We also eliminate outliers by choosing some fraction of the extremes of each distribution to omit. (We begin by eliminating only 1%, i.e., 0.5% at each extreme. This seems to get rid of most artifact values such a negative lab measurements or ridiculously high ones.)
For each distinct label
associated with multiple itemid
s, we plot distributions of their values separately and overplotted on each other, plus violin plots and cumulative distributions to allow different ways to visualize the relationships between the distributions of each itemid
’s data.
total_count_threshold = 1000
outliers_fraction = 0.01
check_for_duplicates <- TRUE
d_chart_ambig_common <- d_bylabels_ambig %>% filter(ntot>=total_count_threshold) %>% arrange(desc(ntot))
print(d_chart_ambig_common)
bin_width <- function(dat) {
log_range <- log10(max(dat) - min(dat))
binw <- ifelse(log_range < 0.0, .01,
ifelse(log_range < 1.0, 0.1,
ifelse(log_range < 2.2, 1,
ifelse(log_range < 3.2, 10, 100))))
}
rid_outliers <- function(d, outliers_fraction = 0.01) {
if (outliers_fraction != 0.0) {
limits <- quantile(d$valuenum, probs=c(outliers_fraction/2, 1-outliers_fraction/2), na.rm=TRUE)
d %>% filter(valuenum <= limits[2] & valuenum >= limits[1])
} else {
d
}
}
dups <- NULL
for (l in 1:nrow(d_chart_ambig_common)) {
# print(l)
lbldata <- d_chart_ambig_common[l,]
items <- (d_all %>% filter(label==lbldata$label))$itemid
# print(items)
dat <- all %>% filter(itemid %in% items) %>%
left_join(d_all, by="itemid")
if (outliers_fraction != 0.0) {
limits <- quantile(dat$valuenum, probs=c(outliers_fraction/2, 1-outliers_fraction/2), na.rm=TRUE)
dat <- dat %>% filter(valuenum <= limits[2] & valuenum >= limits[1])
}
binw <- bin_width(dat$valuenum)
print(dat %>%
ggplot(aes(valuenum, fill=desc)) +
geom_histogram(binwidth = binw) +
facet_wrap(~ desc) +
theme(legend.position="none") +
ggtitle(lbldata$label))
print(dat %>%
ggplot(aes(valuenum, fill=desc)) +
geom_histogram(binwidth = binw, alpha=0.5) +
ggtitle(lbldata$label) +
theme(legend.position="bottom", legend.direction="vertical"))
print(dat %>%
ggplot(aes(x=desc, y=valuenum, color=desc)) +
geom_violin() +
coord_flip() +
theme(legend.position="none") +
ggtitle(lbldata$label))
print(dat %>%
ggplot(aes(valuenum, color=desc)) +
stat_ecdf(aes(linetype = desc)) +
theme(legend.position="bottom", legend.direction="vertical") +
ggtitle(sprintf("CDF of %s", lbldata$label)))
# Collect duplication data while producing plots
if (check_for_duplicates) {
mults <- dat %>%
group_by(subject_id, charttime) %>%
summarize(
itemids = paste(unique(itemid), collapse=", "),
nitemids = n_distinct(itemid),
nvals = n(),
ndistinct = n_distinct(valuenum),
vals = paste(valuenum, collapse=", "),
val_mean = mean(valuenum),
val_low = min(valuenum),
val_high = max(valuenum)
) %>%
ungroup() %>%
filter(nvals > 1)
if (nrow(mults) > 0) {
dups <- rbind(dups, mults)
}
}
}
We have observed that sometimes among the itemid
s associated with a particular label
, there are several measurements recorded at the same time for the same patient. In most cases, the values of these measurements are also identical. This suggests that some of the data are duplicated. This may occur if, for example, lab data are copied to the chart as chart data, if some chart itemid
is copied to entries under a different itemid
, or various other possible reasons. Here we investigate how frequently this occurs, and identify groups of itemid
s where such duplication is common.
# dups_equal <- dups %>% filter(ndistinct==1) %>% arrange(itemids)
# dups_unequal <- dups %>% filter(ndistinct > 1) %>% arrange(itemids)
dups_summary <- dups %>%
group_by(itemids, nitemids) %>%
summarize(ntot = n(),
nequal = sum(ifelse(ndistinct==1, 1, 0)),
nunequal = sum(ifelse(ndistinct > 1, 1, 0))
) %>%
arrange(desc(nequal))
items <- c()
for (i in 1:nrow(dups_summary)) {
items <- append(items, as.numeric(strsplit(as.character(dups_summary[i, "itemids"]), ", ")[[1]]))
}
item_names <- left_join(data.frame(itemid = unique(items)),
d_all,
by="itemid") %>%
select(-desc) %>%
arrange(itemid)
if (nrow(dups_summary) > 0) {
kable(dups_summary, caption="Summary of identical data under different itemids")
}
itemids | nitemids | ntot | nequal | nunequal |
---|---|---|---|---|
51221, 813 | 2 | 229198 | 228692 | 506 |
51222, 814 | 2 | 171632 | 171567 | 65 |
50971, 1535 | 2 | 161085 | 160946 | 139 |
50983, 1536 | 2 | 152601 | 152524 | 77 |
50902, 1523 | 2 | 149106 | 149062 | 44 |
50960, 1532 | 2 | 146959 | 146923 | 36 |
50912, 1525 | 2 | 145734 | 145701 | 33 |
50931, 1529 | 2 | 145093 | 145036 | 57 |
51275, 1533 | 2 | 109798 | 109736 | 62 |
51274, 1286 | 2 | 105154 | 105124 | 30 |
50809, 1529 | 2 | 82307 | 82301 | 6 |
50912, 220615 | 2 | 46325 | 46308 | 17 |
51265, 227457 | 2 | 45663 | 45595 | 68 |
50960, 220635 | 2 | 44796 | 44779 | 17 |
51222, 220228 | 2 | 40358 | 40302 | 56 |
51275, 227466 | 2 | 30673 | 30611 | 62 |
50862, 1521 | 2 | 23194 | 23188 | 6 |
50867, 775 | 2 | 20270 | 20256 | 14 |
50802, 74 | 2 | 9382 | 9382 | 0 |
50821, 3837 | 2 | 9350 | 9350 | 0 |
50818, 3835 | 2 | 9329 | 9329 | 0 |
51200, 3754 | 2 | 8323 | 8317 | 6 |
51144, 3738 | 2 | 8022 | 8022 | 0 |
50862, 227456 | 2 | 7132 | 7130 | 2 |
224828, 776 | 2 | 5102 | 5067 | 35 |
50912, 220615, 1525 | 3 | 4694 | 4692 | 2 |
225624, 1162 | 2 | 4660 | 4659 | 1 |
50960, 220635, 1532 | 3 | 4607 | 4606 | 1 |
220546, 1542 | 2 | 4267 | 4250 | 17 |
51222, 220228, 814 | 3 | 4255 | 4236 | 19 |
225677, 1534 | 2 | 4143 | 4141 | 2 |
51275, 227466, 1533 | 3 | 3605 | 3586 | 19 |
227467, 1530 | 2 | 3224 | 3214 | 10 |
50956, 225672 | 2 | 3123 | 3123 | 0 |
225667, 816 | 2 | 2625 | 2615 | 10 |
50917, 801 | 2 | 2417 | 2417 | 0 |
50867, 220581 | 2 | 2292 | 2291 | 1 |
225668, 1531 | 2 | 1806 | 1795 | 11 |
51007, 1541 | 2 | 1690 | 1689 | 1 |
50863, 3728 | 2 | 1633 | 1633 | 0 |
50914, 792 | 2 | 1471 | 1471 | 0 |
51196, 1526 | 2 | 1225 | 1225 | 0 |
220587, 770 | 2 | 1106 | 1105 | 1 |
220644, 769 | 2 | 1105 | 1105 | 0 |
50909, 227463 | 2 | 1093 | 1066 | 27 |
50862, 227456, 1521 | 3 | 868 | 867 | 1 |
220632, 817 | 2 | 716 | 716 | 0 |
50811, 814 | 2 | 785 | 603 | 182 |
50966, 826 | 2 | 577 | 577 | 0 |
225642, 799 | 2 | 575 | 575 | 0 |
225639, 796 | 2 | 573 | 573 | 0 |
225640, 797 | 2 | 572 | 572 | 0 |
225641, 798 | 2 | 567 | 567 | 0 |
50867, 220581, 775 | 3 | 561 | 561 | 0 |
50971 | 1 | 562 | 477 | 85 |
50983 | 1 | 517 | 471 | 46 |
50902 | 1 | 486 | 457 | 29 |
50931 | 1 | 515 | 444 | 71 |
51221 | 1 | 493 | 419 | 74 |
227468, 1528 | 2 | 382 | 377 | 5 |
51007, 225695 | 2 | 367 | 367 | 0 |
50917, 227440 | 2 | 346 | 346 | 0 |
225638, 795 | 2 | 340 | 340 | 0 |
50912 | 1 | 369 | 335 | 34 |
51265 | 1 | 385 | 330 | 55 |
50820 | 1 | 298 | 295 | 3 |
50818 | 1 | 291 | 288 | 3 |
51222 | 1 | 350 | 288 | 62 |
50802 | 1 | 288 | 287 | 1 |
50821 | 1 | 289 | 286 | 3 |
220546 | 1 | 285 | 285 | 0 |
225624 | 1 | 281 | 281 | 0 |
225677 | 1 | 259 | 259 | 0 |
50960 | 1 | 270 | 256 | 14 |
50866, 220580 | 2 | 247 | 247 | 0 |
225674, 823 | 2 | 229 | 226 | 3 |
220228 | 1 | 217 | 217 | 0 |
227457 | 1 | 216 | 216 | 0 |
220615 | 1 | 211 | 211 | 0 |
220635 | 1 | 205 | 205 | 0 |
51274 | 1 | 215 | 192 | 23 |
225684, 836 | 2 | 181 | 180 | 1 |
50915, 225636 | 2 | 175 | 175 | 0 |
227467 | 1 | 160 | 160 | 0 |
224828 | 1 | 156 | 156 | 0 |
227466 | 1 | 121 | 121 | 0 |
225668 | 1 | 119 | 119 | 0 |
51275 | 1 | 147 | 118 | 29 |
50863 | 1 | 114 | 101 | 13 |
220587 | 1 | 98 | 98 | 0 |
220644 | 1 | 98 | 98 | 0 |
51146 | 1 | 104 | 97 | 7 |
51244 | 1 | 105 | 97 | 8 |
51254 | 1 | 105 | 97 | 8 |
51200 | 1 | 102 | 96 | 6 |
225693, 1540 | 2 | 90 | 90 | 0 |
50819, 505 | 2 | 99 | 90 | 9 |
50966, 227460 | 2 | 79 | 79 | 0 |
50922, 804 | 2 | 78 | 78 | 0 |
50809 | 1 | 76 | 76 | 0 |
225667 | 1 | 74 | 74 | 0 |
225639 | 1 | 73 | 73 | 0 |
225640 | 1 | 73 | 73 | 0 |
225641 | 1 | 73 | 73 | 0 |
225642 | 1 | 73 | 73 | 0 |
220632 | 1 | 71 | 71 | 0 |
51007, 225695, 1541 | 3 | 64 | 64 | 0 |
51491 | 1 | 79 | 64 | 15 |
220603, 1524 | 2 | 59 | 59 | 0 |
50811, 51222 | 2 | 698 | 57 | 641 |
50811, 220228 | 2 | 123 | 55 | 68 |
50862 | 1 | 57 | 52 | 5 |
50819 | 1 | 46 | 46 | 0 |
50811, 51222, 814 | 3 | 411 | 44 | 367 |
50826 | 1 | 43 | 42 | 1 |
225638 | 1 | 41 | 41 | 0 |
50922 | 1 | 43 | 41 | 2 |
227462, 805 | 2 | 40 | 40 | 0 |
227456 | 1 | 38 | 38 | 0 |
50917, 227440, 801 | 3 | 38 | 38 | 0 |
50811, 51222, 220228 | 3 | 225 | 35 | 190 |
50956 | 1 | 35 | 35 | 0 |
225672 | 1 | 30 | 30 | 0 |
220581 | 1 | 24 | 24 | 0 |
50867 | 1 | 24 | 21 | 3 |
220210, 618 | 2 | 135 | 20 | 115 |
50956, 820 | 2 | 21 | 20 | 1 |
51144 | 1 | 21 | 20 | 1 |
51196, 225636, 1526 | 3 | 19 | 19 | 0 |
224639 | 1 | 19 | 17 | 2 |
220045, 211 | 2 | 150 | 16 | 134 |
50809, 50931 | 2 | 347 | 16 | 331 |
50811 | 1 | 17 | 16 | 1 |
51143 | 1 | 17 | 16 | 1 |
51251 | 1 | 17 | 16 | 1 |
51255 | 1 | 16 | 16 | 0 |
220650, 1539 | 2 | 15 | 15 | 0 |
51196, 225636 | 2 | 14 | 14 | 0 |
227468 | 1 | 13 | 13 | 0 |
51493 | 1 | 18 | 13 | 5 |
51516 | 1 | 17 | 11 | 6 |
227470, 835 | 2 | 9 | 9 | 0 |
51094, 51491 | 2 | 24 | 9 | 15 |
225674 | 1 | 7 | 7 | 0 |
225684 | 1 | 7 | 7 | 0 |
225693 | 1 | 7 | 7 | 0 |
50809, 50931, 1529 | 3 | 136 | 7 | 129 |
50815, 223834 | 2 | 10 | 7 | 3 |
50866 | 1 | 8 | 7 | 1 |
211 | 1 | 17 | 6 | 11 |
5815 | 1 | 6 | 6 | 0 |
618 | 1 | 10 | 6 | 4 |
227463 | 1 | 5 | 5 | 0 |
50935 | 1 | 5 | 5 | 0 |
220228, 814 | 2 | 4 | 4 | 0 |
50811, 51222, 220228, 814 | 4 | 22 | 4 | 18 |
50931, 51478 | 2 | 1036 | 4 | 1032 |
8549 | 1 | 6 | 4 | 2 |
227462 | 1 | 3 | 3 | 0 |
227470 | 1 | 3 | 3 | 0 |
50815 | 1 | 3 | 3 | 0 |
50917 | 1 | 3 | 3 | 0 |
220603 | 1 | 2 | 2 | 0 |
220650 | 1 | 2 | 2 | 0 |
225695 | 1 | 2 | 2 | 0 |
50966, 227460, 826 | 3 | 2 | 2 | 0 |
51269 | 1 | 2 | 2 | 0 |
51478 | 1 | 2 | 2 | 0 |
51516, 1542 | 2 | 356 | 2 | 354 |
586 | 1 | 2 | 2 | 0 |
220615, 1525 | 2 | 1 | 1 | 0 |
220635, 1532 | 2 | 1 | 1 | 0 |
227440 | 1 | 1 | 1 | 0 |
50866, 774 | 2 | 1 | 1 | 0 |
51007 | 1 | 2 | 1 | 1 |
51094 | 1 | 1 | 1 | 0 |
51116, 51427 | 2 | 5 | 1 | 4 |
51125, 51436 | 2 | 5 | 1 | 4 |
51125, 51455 | 2 | 7 | 1 | 6 |
51196 | 1 | 1 | 1 | 0 |
51200, 51347, 3754 | 3 | 2 | 1 | 1 |
51352 | 1 | 27 | 1 | 26 |
51375, 51427 | 2 | 1 | 1 | 0 |
763 | 1 | 2 | 1 | 1 |
50809, 51478 | 2 | 13 | 0 | 13 |
50820, 51491 | 2 | 182 | 0 | 182 |
50909 | 1 | 8 | 0 | 8 |
50914 | 1 | 2 | 0 | 2 |
50931, 51478, 1529 | 3 | 77 | 0 | 77 |
50978 | 1 | 2 | 0 | 2 |
51110, 51440 | 2 | 1 | 0 | 1 |
51114, 51200 | 2 | 3 | 0 | 3 |
51114, 51444 | 2 | 2 | 0 | 2 |
51116 | 1 | 3 | 0 | 3 |
51116, 51244 | 2 | 9 | 0 | 9 |
51116, 51446 | 2 | 7 | 0 | 7 |
51117 | 1 | 2 | 0 | 2 |
51117, 51428 | 2 | 1 | 0 | 1 |
51120 | 1 | 3 | 0 | 3 |
51120, 51254 | 2 | 9 | 0 | 9 |
51125 | 1 | 2 | 0 | 2 |
51143, 51440 | 2 | 1 | 0 | 1 |
51144, 51344, 3738 | 3 | 1 | 0 | 1 |
51144, 51441 | 2 | 1 | 0 | 1 |
51200, 51347 | 2 | 1 | 0 | 1 |
51200, 51368 | 2 | 1 | 0 | 1 |
51244, 51375 | 2 | 1 | 0 | 1 |
51244, 51427 | 2 | 2 | 0 | 2 |
51244, 51446 | 2 | 8 | 0 | 8 |
51254, 51355 | 2 | 10 | 0 | 10 |
51254, 51379 | 2 | 1 | 0 | 1 |
51343 | 1 | 3 | 0 | 3 |
51344 | 1 | 5 | 0 | 5 |
51346 | 1 | 1 | 0 | 1 |
51347 | 1 | 6 | 0 | 6 |
51351 | 1 | 222 | 0 | 222 |
51351, 3770 | 2 | 6 | 0 | 6 |
51355 | 1 | 183 | 0 | 183 |
51358 | 1 | 15 | 0 | 15 |
51360 | 1 | 177 | 0 | 177 |
51360, 3791 | 2 | 6 | 0 | 6 |
51360, 51455 | 2 | 1 | 0 | 1 |
51361 | 1 | 1 | 0 | 1 |
51375 | 1 | 2 | 0 | 2 |
51376 | 1 | 1 | 0 | 1 |
51379 | 1 | 1 | 0 | 1 |
51382 | 1 | 2 | 0 | 2 |
51427 | 1 | 3 | 0 | 3 |
51428 | 1 | 2 | 0 | 2 |
51431 | 1 | 1 | 0 | 1 |
51436 | 1 | 3 | 0 | 3 |
51444 | 1 | 2 | 0 | 2 |
51446 | 1 | 3 | 0 | 3 |
51450 | 1 | 3 | 0 | 3 |
51450, 3779 | 2 | 1 | 0 | 1 |
51455 | 1 | 2 | 0 | 2 |
51455, 3791 | 2 | 1 | 0 | 1 |
51478, 1529 | 2 | 1 | 0 | 1 |
51493, 833 | 2 | 377 | 0 | 377 |
51516, 220546 | 2 | 156 | 0 | 156 |
51516, 220546, 1542 | 3 | 8 | 0 | 8 |
if (nrow(dups_summary) > 0) {
kable(item_names, caption="Legend for duplicated data")
}
itemid | n | n_subjects | n_adm | label | category | linksto |
---|---|---|---|---|---|---|
74 | 9446 | 1343 | 1362 | Base Excess | NA | chartevents |
211 | 5179363 | 29898 | 34901 | Heart Rate | NA | chartevents |
505 | 350192 | 1833 | 1855 | PEEP | NA | chartevents |
586 | 2762 | 844 | 863 | QTc | NA | chartevents |
618 | 3384989 | 22312 | 27195 | Respiratory Rate | NA | chartevents |
763 | 47241 | 16409 | 19352 | Daily Weight | NA | chartevents |
769 | 41351 | 10660 | 12293 | ALT | Enzymes | chartevents |
770 | 41218 | 10622 | 12251 | AST | Enzymes | chartevents |
774 | 12 | 8 | 8 | Ammonia | Chemistry | chartevents |
775 | 21145 | 6868 | 7652 | Amylase | Enzymes | chartevents |
776 | 262053 | 15084 | 17229 | Arterial Base Excess | ABG | chartevents |
792 | 1489 | 165 | 204 | Cyclosporin | Drug Level | chartevents |
795 | 13582 | 6215 | 6916 | Differential-Bands | Hematology | chartevents |
796 | 21574 | 8916 | 10259 | Differential-Basos | Hematology | chartevents |
797 | 21575 | 8916 | 10260 | Differential-Eos | Hematology | chartevents |
798 | 21576 | 8916 | 10260 | Differential-Lymphs | Hematology | chartevents |
799 | 21575 | 8916 | 10260 | Differential-Monos | Hematology | chartevents |
801 | 2476 | 950 | 1034 | Digoxin | Drug Level | chartevents |
804 | 89 | 81 | 81 | Ethanol | Drug Level | chartevents |
805 | 1847 | 273 | 359 | FK506 | Drug Level | chartevents |
813 | 274142 | 22405 | 27410 | Hematocrit | Hematology | chartevents |
814 | 187843 | 22188 | 27141 | Hemoglobin | Hematology | chartevents |
816 | 147034 | 13268 | 14630 | Ionized Calcium | Chemistry | chartevents |
817 | 23238 | 7990 | 9090 | LDH | Enzymes | chartevents |
820 | 77 | 63 | 63 | Lipase | Enzymes | chartevents |
823 | 42282 | 4725 | 4979 | Mixed Venous O2% Sat | Blood Gases | chartevents |
826 | 585 | 81 | 95 | Phenobarbital | Drug Level | chartevents |
833 | 175859 | 22167 | 27110 | RBC | Hematology | chartevents |
835 | 747 | 638 | 653 | Sed Rate | Hematology | chartevents |
836 | 8095 | 1768 | 1816 | Serum Osmolality | Chemistry | chartevents |
1162 | 151974 | 17295 | 21108 | BUN | NA | chartevents |
1286 | 106153 | 15700 | 18796 | PT | NA | chartevents |
1521 | 24310 | 8003 | 9159 | Albumin | Chemistry | chartevents |
1523 | 151252 | 17273 | 21078 | Chloride | Chemistry | chartevents |
1524 | 1794 | 1697 | 1711 | Cholesterol | Chemistry | chartevents |
1525 | 152641 | 17299 | 21113 | Creatinine | Chemistry | chartevents |
1526 | 1245 | 900 | 921 | D-Dimer | Coags | chartevents |
1528 | 15575 | 5560 | 5852 | Fibrinogen | Coags | chartevents |
1529 | 283798 | 17332 | 21147 | Glucose | Chemistry | chartevents |
1530 | 106359 | 15715 | 18817 | INR | Coags | chartevents |
1531 | 63233 | 9790 | 11149 | Lactic Acid | Chemistry | chartevents |
1532 | 153447 | 16941 | 20668 | Magnesium | Chemistry | chartevents |
1533 | 114460 | 15844 | 18935 | PTT | Coags | chartevents |
1534 | 134235 | 16954 | 20585 | Phosphorous | Chemistry | chartevents |
1535 | 242648 | 17346 | 21167 | Potassium | Chemistry | chartevents |
1536 | 172037 | 17306 | 21114 | Sodium | Chemistry | chartevents |
1539 | 802 | 697 | 712 | Total Protein | Chemistry | chartevents |
1540 | 2964 | 2371 | 2430 | Triglyceride | Chemistry | chartevents |
1541 | 1788 | 862 | 886 | Uric Acid | Chemistry | chartevents |
1542 | 138090 | 17247 | 21034 | WBC | Hematology | chartevents |
3728 | 1650 | 796 | 813 | Alkaline Phosphatase | Chemistry | chartevents |
3738 | 8049 | 5135 | 5180 | Bands | Heme/Coag | chartevents |
3754 | 8356 | 5315 | 5361 | Eosinophils | Heme/Coag | chartevents |
3770 | 8153 | 5188 | 5234 | Lymphs | Heme/Coag | chartevents |
3779 | 8153 | 5188 | 5234 | Monos | CSF | chartevents |
3791 | 8153 | 5188 | 5234 | Polys | CSF | chartevents |
3835 | 9450 | 1343 | 1362 | pCO2 | ABG'S | chartevents |
3837 | 9450 | 1343 | 1362 | pO2 | ABG'S | chartevents |
5815 | 1804455 | 14416 | 17386 | HR Alarm [Low] | NA | chartevents |
8549 | 1804988 | 14414 | 17384 | HR Alarm [High] | NA | chartevents |
50802 | 490527 | 31427 | 37357 | Base Excess | Blood Gas | labevents |
50809 | 196596 | 24029 | 26676 | Glucose | Blood Gas | labevents |
50811 | 89712 | 21698 | 23637 | Hemoglobin | Blood Gas | labevents |
50815 | 12307 | 7054 | 7449 | O2 Flow | Blood Gas | labevents |
50818 | 490504 | 31424 | 37348 | pCO2 | Blood Gas | labevents |
50819 | 86911 | 12630 | 13831 | PEEP | Blood Gas | labevents |
50820 | 530657 | 32477 | 38918 | pH | Blood Gas | labevents |
50821 | 490522 | 31424 | 37351 | pO2 | Blood Gas | labevents |
50826 | 83805 | 13738 | 15001 | Tidal Volume | Blood Gas | labevents |
50862 | 146652 | 28055 | 30997 | Albumin | Chemistry | labevents |
50863 | 207837 | 30060 | 33574 | Alkaline Phosphatase | Chemistry | labevents |
50866 | 3868 | 2143 | 2069 | Ammonia | Chemistry | labevents |
50867 | 63657 | 18362 | 18605 | Amylase | Chemistry | labevents |
50902 | 795412 | 41203 | 52853 | Chloride | Chemistry | labevents |
50909 | 13522 | 5429 | 5634 | Cortisol | Chemistry | labevents |
50912 | 797231 | 39349 | 50997 | Creatinine | Chemistry | labevents |
50914 | 8565 | 297 | 354 | Cyclosporin | Chemistry | labevents |
50915 | 1093 | 797 | 672 | D-Dimer | Chemistry | labevents |
50917 | 8152 | 2386 | 2620 | Digoxin | Chemistry | labevents |
50922 | 2626 | 1633 | 1671 | Ethanol | Chemistry | labevents |
50931 | 748836 | 38798 | 50396 | Glucose | Chemistry | labevents |
50935 | 10197 | 5916 | 5631 | Haptoglobin | Chemistry | labevents |
50956 | 65361 | 18636 | 20477 | Lipase | Chemistry | labevents |
50960 | 664079 | 38385 | 49763 | Magnesium | Chemistry | labevents |
50966 | 1767 | 190 | 225 | Phenobarbital | Chemistry | labevents |
50971 | 845365 | 41208 | 52877 | Potassium | Chemistry | labevents |
50978 | 3181 | 191 | 155 | Rapamycin | Chemistry | labevents |
50983 | 808328 | 41209 | 52860 | Sodium | Chemistry | labevents |
51007 | 18355 | 4261 | 3305 | Uric Acid | Chemistry | labevents |
51094 | 1157 | 838 | 566 | pH | Chemistry | labevents |
51110 | 160 | 132 | 114 | Atypical Lymphocytes | Hematology | labevents |
51114 | 566 | 376 | 325 | Eosinophils | Hematology | labevents |
51116 | 3346 | 1247 | 1306 | Lymphocytes | Hematology | labevents |
51117 | 2280 | 928 | 939 | Macrophage | Hematology | labevents |
51120 | 3345 | 1247 | 1306 | Monocytes | Hematology | labevents |
51125 | 3344 | 1247 | 1306 | Polys | Hematology | labevents |
51143 | 59334 | 19376 | 19935 | Atypical Lymphocytes | Hematology | labevents |
51144 | 75587 | 23119 | 24612 | Bands | Hematology | labevents |
51146 | 172093 | 37758 | 44247 | Basophils | Hematology | labevents |
51196 | 2877 | 1960 | 1565 | D-Dimer | Hematology | labevents |
51200 | 172095 | 37758 | 44247 | Eosinophils | Hematology | labevents |
51221 | 881653 | 45383 | 57103 | Hematocrit | Hematology | labevents |
51222 | 752277 | 45297 | 56950 | Hemoglobin | Hematology | labevents |
51244 | 172100 | 37759 | 44248 | Lymphocytes | Hematology | labevents |
51251 | 59206 | 19338 | 19902 | Metamyelocytes | Hematology | labevents |
51254 | 172099 | 37759 | 44248 | Monocytes | Hematology | labevents |
51255 | 59158 | 19331 | 19896 | Myelocytes | Hematology | labevents |
51265 | 778163 | 45303 | 56975 | Platelet Count | Hematology | labevents |
51269 | 2489 | 1228 | 1064 | Promyelocytes | Hematology | labevents |
51274 | 468914 | 37792 | 48226 | PT | Hematology | labevents |
51275 | 473449 | 37698 | 48064 | PTT | Hematology | labevents |
51343 | 310 | 268 | 235 | Atypical Lymphocytes | Hematology | labevents |
51344 | 260 | 241 | 218 | Bands | Hematology | labevents |
51346 | 16 | 10 | 8 | Blasts | Hematology | labevents |
51347 | 655 | 500 | 449 | Eosinophils | Hematology | labevents |
51351 | 4732 | 2840 | 2566 | Lymphs | Hematology | labevents |
51352 | 1138 | 869 | 785 | Macrophage | Hematology | labevents |
51355 | 4701 | 2840 | 2566 | Monocytes | Hematology | labevents |
51358 | 332 | 233 | 193 | Other | Hematology | labevents |
51360 | 4689 | 2840 | 2566 | Polys | Hematology | labevents |
51361 | 5 | 4 | 4 | Promyelocytes | Hematology | labevents |
51368 | 83 | 73 | 32 | Eosinophils | Hematology | labevents |
51375 | 683 | 468 | 320 | Lymphocytes | Hematology | labevents |
51376 | 236 | 192 | 124 | Macrophage | Hematology | labevents |
51379 | 683 | 468 | 320 | Monocytes | Hematology | labevents |
51382 | 683 | 468 | 320 | Polys | Hematology | labevents |
51427 | 1967 | 1529 | 1489 | Lymphocytes | Hematology | labevents |
51428 | 1078 | 912 | 877 | Macrophage | Hematology | labevents |
51431 | 1964 | 1529 | 1489 | Monos | Hematology | labevents |
51436 | 1967 | 1529 | 1489 | Polys | Hematology | labevents |
51440 | 159 | 150 | 124 | Atypical Lymphocytes | Hematology | labevents |
51441 | 121 | 115 | 107 | Bands | Hematology | labevents |
51444 | 778 | 627 | 507 | Eosinophils | Hematology | labevents |
51446 | 2613 | 1826 | 1638 | Lymphocytes | Hematology | labevents |
51450 | 2613 | 1826 | 1638 | Monos | Hematology | labevents |
51455 | 2612 | 1826 | 1638 | Polys | Hematology | labevents |
51478 | 11138 | 5714 | 5738 | Glucose | Hematology | labevents |
51491 | 116034 | 32430 | 37653 | pH | Hematology | labevents |
51493 | 43664 | 18708 | 18383 | RBC | Hematology | labevents |
51516 | 43097 | 18680 | 18310 | WBC | Hematology | labevents |
220045 | 2762225 | 17714 | 21924 | Heart Rate | Routine Vital Signs | chartevents |
220210 | 2737105 | 17707 | 21913 | Respiratory Rate | Respiratory | chartevents |
220228 | 137117 | 19253 | 23983 | Hemoglobin | Labs | chartevents |
220546 | 133356 | 19254 | 23983 | WBC | Labs | chartevents |
220580 | 694 | 503 | 521 | Ammonia | Labs | chartevents |
220581 | 8685 | 3497 | 3754 | Amylase | Labs | chartevents |
220587 | 37598 | 9775 | 11482 | AST | Labs | chartevents |
220603 | 1567 | 1484 | 1497 | Cholesterol | Labs | chartevents |
220615 | 153010 | 19313 | 24063 | Creatinine | Labs | chartevents |
220632 | 23125 | 7644 | 8789 | LDH | Labs | chartevents |
220635 | 147786 | 18788 | 23446 | Magnesium | Labs | chartevents |
220644 | 37625 | 9787 | 11494 | ALT | Labs | chartevents |
220650 | 785 | 670 | 694 | Total Protein | Labs | chartevents |
223834 | 202279 | 14550 | 17644 | O2 Flow | Respiratory | chartevents |
224639 | 46452 | 10152 | 11496 | Daily Weight | General | chartevents |
224828 | 140198 | 11601 | 13134 | Arterial Base Excess | Labs | chartevents |
225624 | 152475 | 19312 | 24062 | BUN | Labs | chartevents |
225636 | 764 | 535 | 542 | D-Dimer | Labs | chartevents |
225638 | 8610 | 3760 | 4135 | Differential-Bands | Labs | chartevents |
225639 | 20425 | 8774 | 10193 | Differential-Basos | Labs | chartevents |
225640 | 20424 | 8774 | 10193 | Differential-Eos | Labs | chartevents |
225641 | 20425 | 8774 | 10193 | Differential-Lymphs | Labs | chartevents |
225642 | 20425 | 8774 | 10193 | Differential-Monos | Labs | chartevents |
225667 | 80126 | 9894 | 10821 | Ionized Calcium | Labs | chartevents |
225668 | 69432 | 11936 | 13766 | Lactic Acid | Labs | chartevents |
225672 | 9766 | 4073 | 4401 | Lipase | Labs | chartevents |
225674 | 9867 | 2909 | 3024 | Mixed Venous O2% Sat | Labs | chartevents |
225677 | 135440 | 18061 | 22617 | Phosphorous | Labs | chartevents |
225684 | 6746 | 1578 | 1621 | Serum Osmolality | Labs | chartevents |
225693 | 3062 | 2367 | 2438 | Triglyceride | Labs | chartevents |
225695 | 1656 | 566 | 590 | Uric Acid | Labs | chartevents |
227440 | 1403 | 613 | 695 | Digoxin | Labs | chartevents |
227456 | 22945 | 8464 | 9673 | Albumin | Labs | chartevents |
227457 | 137454 | 19258 | 23984 | Platelet Count | Labs | chartevents |
227460 | 251 | 38 | 49 | Phenobarbital | Labs | chartevents |
227462 | 2159 | 283 | 407 | FK506 | Labs | chartevents |
227463 | 2911 | 1528 | 1628 | Cortisol | Labs | chartevents |
227466 | 99769 | 16967 | 20645 | PTT | Labs | chartevents |
227467 | 93040 | 17060 | 20787 | INR | Labs | chartevents |
227468 | 14339 | 5496 | 5784 | Fibrinogen | Labs | chartevents |
227470 | 700 | 589 | 619 | Sed Rate | Labs | chartevents |
if (!is.null(dups)) {
print(dups)
}
overlap_summary <- data.frame(table(dups_summary$nitemids))
names(overlap_summary) <- c("nitemids", "count")
There are 241 sets of items that share label
s which contain some duplicated data (by our criterion: patient x charttime), amounting to 1985258 instances. Of these instances, the large majority appear to be entries for the same measured value at the same time under different itemid
s, i.e., duplicates. In fact, there are 1977973 instances where the value are the same, and 7285 where they differ. For sets such as 51221, 813, any analysis using these data should probably eliminate the duplicated values and merge those itemid
s if in fact their their overall value distributions look similar.
Among these sets, this is the distribution of the number of itemid
s in each set. If there are singletons, that means that some data are clearly duplicated within the same itemid
.
if (!is.null(dups)) {
dups_within_item <- dups %>% filter(nitemids==1) %>%
group_by(itemids) %>%
summarize(instances=n()) %>%
arrange(desc(instances))
if (nrow(dups_within_item) > 0) {
print(dups_within_item)
}
}
The analysis above is based on the assumption that it is most helpful to compare distributions of variables that have the same label. However, we know that the same label does not always mean that the values should be expected to be the same (e.g., the same substance being measured in different body fluids), and, conversely, that sometimes different labels represent the same measurement but expressed differently (e.g., various way to say arterial blood pressure). The latter is described on GitHub at phys_acuity_modelling/resources/itemid_to_variable_map.csv in what is so far a private repository. I have also copied that file to the current R project directory, but of course that version will not be updated automatically from changes made on GitHub.
itemid_to_variable <- read.csv("itemid_to_variable_map.csv") %>% filter(LEVEL2 != "")
compare_distributions <- function(itemids, label="") {
dat <- all %>% filter(itemid %in% itemids) %>%
rid_outliers(outliers_fraction) %>%
left_join(d_all, by="itemid")
if (nrow(dat) == 0) {
print(sprintf("No data for %s", paste(itemids, collapse=", ")))
} else {
if (label=="") {
label <- dat$label[1]
}
print(sprintf("%d rows for '%s' (%s)", nrow(dat), label, paste(itemids, collapse=",")))
binw <- bin_width(dat$valuenum)
print(dat %>%
ggplot(aes(valuenum, fill=desc)) +
geom_histogram(binwidth = binw) +
facet_wrap(~ desc) +
theme(legend.position="none") +
ggtitle(label))
print(dat %>%
ggplot(aes(valuenum, fill=desc)) +
geom_histogram(binwidth = binw, alpha=0.5) +
ggtitle(label) +
theme(legend.position="bottom", legend.direction="vertical"))
print(dat %>%
ggplot(aes(x=desc, y=valuenum, color=desc)) +
geom_violin() +
coord_flip() +
theme(legend.position="none") +
ggtitle(label))
print(dat %>%
ggplot(aes(valuenum, color=desc)) +
stat_ecdf(aes(linetype = desc)) +
theme(legend.position="bottom", legend.direction="vertical") +
ggtitle(sprintf("CDF of %s", label)))
# Collect duplication data while producing plots
if (check_for_duplicates) {
mults <- dat %>%
group_by(subject_id, charttime) %>%
summarize(
itemids = paste(unique(itemid), collapse=", "),
nitemids = n_distinct(itemid),
nvals = n(),
ndistinct = n_distinct(valuenum),
vals = paste(valuenum, collapse=", "),
val_mean = mean(valuenum),
val_low = min(valuenum),
val_high = max(valuenum)
) %>%
ungroup() %>%
filter(nvals > 1)
if (nrow(mults) > 0) {
# print(sprintf("%s has %d duplicate rows:", label, nrow(mults)))
# print(table(mults$ndistinct))
smry <- mults %>% group_by(itemids) %>%
summarize(n = n(),
nsame = sum(ifelse(ndistinct==1, 1, 0)),
ndiff = sum(ifelse(ndistinct==1, 0, 1)))
kable(smry, caption="Duplicate rows")
}
}
}
}
We examine each “Level 2” concept from this list in turn:
compare_distributions(c(50861,769,220644))
[1] "295756 rows for 'Alanine Aminotransferase (ALT)' (50861,769,220644)"
itemids | n | nsame | ndiff |
---|---|---|---|
220644 | 82 | 82 | 0 |
50861 | 98 | 83 | 15 |
50861, 220644 | 11728 | 11724 | 4 |
50861, 220644, 769 | 1104 | 1104 | 0 |
50861, 769 | 39545 | 39531 | 14 |
compare_distributions(c(50862,772,1521,227456,3727))
[1] "223304 rows for 'Albumin' (50862,772,1521,227456,3727)"
itemids | n | nsame | ndiff |
---|---|---|---|
227456 | 38 | 38 | 0 |
50862 | 57 | 52 | 5 |
50862, 227456 | 7124 | 7122 | 2 |
50862, 227456, 772, 1521 | 868 | 867 | 1 |
50862, 3727 | 542 | 542 | 0 |
50862, 772 | 6410 | 6410 | 0 |
50862, 772, 1521 | 23185 | 23179 | 6 |
772, 1521 | 65 | 65 | 0 |
compare_distributions(c(50863,773,225612,3728))
[1] "283868 rows for 'Alkaline Phosphatase' (50863,773,225612,3728)"
itemids | n | nsame | ndiff |
---|---|---|---|
225612 | 79 | 79 | 0 |
50863 | 92 | 80 | 12 |
50863, 225612 | 11453 | 11450 | 3 |
50863, 225612, 773 | 1053 | 1052 | 1 |
50863, 3728 | 1631 | 1631 | 0 |
50863, 773 | 38804 | 38789 | 15 |
itemid_to_variable_map.csv
.# library(pander)
# library(purrr)
vargroups <- itemid_to_variable %>% split(.$LEVEL2)
for (i in seq_along(vargroups)) {
g <- vargroups[[i]]
if (nrow(g) > 0) {
# pandoc.header(list(as.character(g$LEVEL2[1])))
print(g$LEVEL2[1])
compare_distributions(g$ITEMID)
}
}
[1] Alanine aminotransferase
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "295756 rows for 'Alanine Aminotransferase (ALT)' (50861,769,220644)"
[1] Albumin
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "227690 rows for 'Albumin' (50862,772,1521,227456,3727,50835,51025,51046,51069)"
[1] Alkaline phosphate
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "283868 rows for 'Alkaline Phosphatase' (50863,773,225612,3728)"
[1] Anion gap
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "930089 rows for 'Anion Gap' (50868,227073,3732)"
[1] Asparate aminotransferase
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "295943 rows for 'Asparate Aminotransferase (AST)' (50878,770,220587)"
[1] Basophils
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "171877 rows for 'Basophils' (51146,51442,51345,51112,51387,51367)"
[1] Bicarbonate
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "935150 rows for 'Bicarbonate' (50882,227443,51076,50803)"
[1] Bilirubin
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "499067 rows for 'Bilirubin, Direct' (51028,50883,803,225651,51012,50838,51049,50885,1538,848,225690,50884,3765,51464,51465)"
[1] Blood culture
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1 rows for 'Blood Cultures' (3333,50886)"
[1] Blood urea nitrogen
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1281200 rows for 'Urea Nitrogen' (51006,781,1162,225624,3737)"
[1] Calcium
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1254630 rows for 'Calcium, Total' (50808,786,1522,3746,51066,51029,50893,51077,225625)"
[1] Calcium ionized
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "224904 rows for 'Ionized Calcium' (816,225667)"
[1] Calcium Ionized
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "93 rows for 'Ion Calcium' (3766)"
[1] Capillary refill rate
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "No data for 3348, 115, 8377"
[1] Chloride
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1373552 rows for 'Chloride' (788,1523,3747,4193,50839,51030,51013,51050,220602,51062,51078,226536,50902,50806)"
[1] Cholesterol
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "61506 rows for 'Cholesterol, HDL' (789,1524,220603,3748,50840,51031,50904,50905,50906,51051,50907)"
[1] CO2 (ETCO2, PCO2, etc.)
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "954269 rows for 'Calculated Total CO2' (50804,777,225698,3810,3808,857,3809,4199,223679,226062,3773,3830,3832)"
[1] Creatinine
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1328144 rows for 'Creatinine' (791,1525,220615,3750,51067,50841,51032,51052,51081,51082,51106,51080,50912)"
[1] Diastolic blood pressure
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "6301729 rows for 'Arterial Blood Pressure diastolic' (8368,220051,225310,8555,8441,220180,8502,8440,8503,8504,8507,8506,224643)"
[1] Eosinophils
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "10985 rows for 'Eosinophils' (3754,51474,51444,51347,51419,51114,51368)"
[1] Fraction inspired oxygen
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1469730 rows for 'Inspired O2 Fraction' (3420,223835,3422,189,727)"
[1] Glascow coma scale eye opening
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1521073 rows for 'Eye Opening' (184,220739)"
[1] Glascow coma scale motor response
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1514681 rows for 'Motor Response' (454,223901)"
[1] Glascow coma scale total
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "945427 rows for 'GCS Total' (198)"
[1] Glascow coma scale verbal response
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1516822 rows for 'Verbal Response' (723,223900)"
[1] Glucose
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "2497714 rows for 'Glucose' (50931,807,811,1529,50809,51478,3745,225664,220621,226537)"
[1] Heart Rate
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "7868558 rows for 'Heart Rate' (211,220045)"
[1] Height
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "23864 rows for 'Height' (226707,226730,1394)"
[1] Hematocrit
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1410213 rows for 'Hematocrit' (813,3761,220545,51221,50810)"
[1] Hemoglobin
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1172056 rows for 'Hemoglobin' (814,220228,50852,50855,51222,50811)"
[1] INR
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "467604 rows for 'INR(PT)' (51237)"
[1] Lactate
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "185828 rows for 'Lactate' (50813)"
[1] Lactate dehydrogenase
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "108530 rows for 'Lactate Dehydrogenase (LD)' (50954,51054)"
[1] Lactic acid
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "204302 rows for 'Lactic Acid' (818,225668,1531)"
[1] Lymphocytes
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "244286 rows for 'Atypical Lymphocytes' (51244,51116,51446,51427,51375,51133,51143,51343,51110,51440,51385,51365,51245)"
[1] Magnesium
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1151841 rows for 'Magnesium' (821,1532,220635,50960)"
[1] Mean blood pressure
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "6298103 rows for 'Arterial Blood Pressure mean' (52,220052,225312,224,6702,224322,456,220181,3312,3314,3316,3322,3320)"
[1] Mean corpuscular hemoglobin
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "740209 rows for 'MCH' (51248)"
[1] Mean corpuscular hemoglobin concentration
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "741254 rows for 'MCHC' (51249)"
[1] Mean corpuscular volume
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "741200 rows for 'MCV' (51250)"
[1] Monocytes
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "175943 rows for 'Monocytes' (51254,51355)"
[1] Neutrophils
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "171274 rows for 'Neutrophils' (51256)"
[1] Oxygen saturation
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "7976868 rows for 'Oxygen Saturation' (834,50817,8498,220227,646,220277)"
[1] Partial pressure of carbon dioxide
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "923584 rows for 'pCO2' (50818,3835,3836,3784,778,220235,4201)"
[1] Partial pressure of oxygen
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "270215 rows for 'Arterial PaO2' (779,490)"
[1] Partial thromboplastin time
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "829240 rows for 'PTT' (825,1533,227466,51275)"
[1] Peak inspiratory pressure
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "149193 rows for 'PIP' (507)"
[1] pH
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1387343 rows for 'pH' (50820,51491,3839,1673,50831,51094,780,1126,223830,4753,4202,860,220274)"
[1] Phosphate
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "585016 rows for 'Phosphate' (50970)"
[1] Platelets
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1104267 rows for 'Platelet Count' (51265,828,227457,3789)"
[1] Positive end-expiratory pressure
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "478153 rows for 'PEEP' (505,50819,224700)"
[1] Potassium
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1815358 rows for 'Potassium' (829,1535,3792,227442,227464,50971,50822)"
[1] Prothrombin time
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1127001 rows for 'PT' (815,824,1530,1286,227467,227465,51274)"
[1] Pupillary response left
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "No data for 8466, 3593"
[1] Pupillary response right
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "No data for 584, 8530"
[1] Pupillary size left
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "No data for 8467"
[1] Pupillary size right
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "No data for 585"
[1] Red blood cell count
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "993516 rows for 'Red Blood Cells' (51279,833,51493,3799,4197,51127,51362,51278,51383,51438,51457)"
[1] Respiratory rate
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "9294868 rows for 'Respiratory Rate' (618,220210,3603,224689,614,651,224422,615,224690)"
[1] Sodium
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1455574 rows for 'Sodium' (837,1536,3803,220645,226534,50983,50824)"
[1] Systolic blood pressure
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "6306342 rows for 'Arterial Blood Pressure systolic' (51,220050,225309,6701,455,220179,3313,3315,442,3317,3323,3321,224167,227243)"
[1] Temperature
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "3386638 rows for 'Temperature Fahrenheit' (3655,677,676,223762,3654,678,223761,679)"
[1] Troponin-I
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "8277 rows for 'Troponin I' (51002,851)"
[1] Troponin-T
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "85471 rows for 'Troponin T' (51003,227429)"
[1] Urine Appearance
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "No data for 8480"
[1] Urine Color
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "No data for 706"
[1] Urine output
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "No data for 40055, 40056, 40057, 40061, 40065, 40069, 40085, 40086, 40094, 40096, 40405, 40428, 40473, 40651, 40715, 43175, 226557, 226558, 226559, 226560, 226561, 226563, 226564, 226565, 226567, 226584, 226627, 227510"
[1] Weight
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "4531802 rows for 'Admission Weight (Kg)' (763,224639,226512,3580,3693,3581,226531,3582,581,3723,3583,3692,580,733,225124,4183,226846)"
[1] White blood cell count
70 Levels: Alanine aminotransferase Albumin Alkaline phosphate Anion gap Asparate aminotransferase Basophils Bicarbonate Bilirubin Blood culture ... White blood cell count
[1] "1418537 rows for 'White Blood Cells' (861,1127,1542,220546,51516,4200,51301,51300)"