Benjamin Iriarte, Yongjin Park, and Manolis Kellis
ypp@csail.mit.edu and manoli@mit.edu
VCFtools
--missing --geno 1 --maf 0.05 --remove-indels --remove-filtered-all
Stored in DATA/genotype/Chr1.mat
...
Stored in DATA/expr/Tis1.mat
...
Retain a gene if at least 10% of individuals have RPKM > 0.1
in every retained tissue
retained number of genes = 12945
retained number of tissues = 48
retained number of samples = 8495
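The gene retention rule above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `retain_genes`, the dictionary layout, and the toy values are hypothetical, and "10% of individuals" is read here as "at least 10%".

```python
def retain_genes(expr_by_tissue, rpkm_min=0.1, min_frac=0.10):
    """Return indices of genes that pass the expression filter in every tissue.

    expr_by_tissue maps a tissue name to a list of per-gene rows; each row
    holds the RPKM values of that gene across the tissue's individuals.
    (Hypothetical layout, chosen only for this sketch.)
    """
    tissues = list(expr_by_tissue.values())
    n_genes = len(tissues[0])

    def passes(row):
        # Fraction of individuals whose RPKM exceeds the threshold.
        return sum(v > rpkm_min for v in row) / len(row) >= min_frac

    # A gene is kept only if it passes in all tissues.
    return [g for g in range(n_genes) if all(passes(mat[g]) for mat in tissues)]


# Toy data: 2 genes, 2 tissues (values are made up for illustration).
toy = {
    "TisA": [[0.0, 0.5, 0.0, 0.0],   # gene 0: 25% of individuals above 0.1
             [0.0, 0.0, 0.0, 0.0]],  # gene 1: never expressed in TisA
    "TisB": [[1.0, 2.0],
             [3.0, 0.0]],
}
kept = retain_genes(toy)
print(kept)
```

Gene 1 is dropped because it fails in one tissue, even though it is expressed in the other.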
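Stored in RESULT/STEP02/Tis1.mat
and so on; RESULT/STEP02/GENE_NAME.mat
for gene names
Here is the list of tissues (index, tissue, number of samples). The retained counts above (48 tissues, 8495 samples) match this list once the five tissues with the fewest samples (Bladder, Cervix - Ectocervix, Cervix - Endocervix, Fallopian Tube, Kidney - Cortex) are excluded: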
1 Adipose - Subcutaneous 350
2 Adipose - Visceral (Omentum) 227
3 Adrenal Gland 145
4 Artery - Aorta 224
5 Artery - Coronary 133
6 Artery - Tibial 332
7 Bladder 11
8 Brain - Amygdala 72
9 Brain - Anterior cingulate cortex (BA24) 84
10 Brain - Caudate (basal ganglia) 117
11 Brain - Cerebellar Hemisphere 105
12 Brain - Cerebellum 125
13 Brain - Cortex 114
14 Brain - Frontal Cortex (BA9) 108
15 Brain - Hippocampus 94
16 Brain - Hypothalamus 96
17 Brain - Nucleus accumbens (basal ganglia) 113
18 Brain - Putamen (basal ganglia) 97
19 Brain - Spinal cord (cervical c-1) 71
20 Brain - Substantia nigra 63
21 Breast - Mammary Tissue 214
22 Cells - EBV-transformed lymphocytes 118
23 Cells - Transformed fibroblasts 284
24 Cervix - Ectocervix 6
25 Cervix - Endocervix 5
26 Colon - Sigmoid 149
27 Colon - Transverse 196
28 Esophagus - Gastroesophageal Junction 153
29 Esophagus - Mucosa 286
30 Esophagus - Muscularis 247
31 Fallopian Tube 6
32 Heart - Atrial Appendage 194
33 Heart - Left Ventricle 218
34 Kidney - Cortex 32
35 Liver 119
36 Lung 320
37 Minor Salivary Gland 57
38 Muscle - Skeletal 430
39 Nerve - Tibial 304
40 Ovary 97
41 Pancreas 171
42 Pituitary 103
43 Prostate 106
44 Skin - Not Sun Exposed (Suprapubic) 250
45 Skin - Sun Exposed (Lower leg) 357
46 Small Intestine - Terminal Ileum 88
47 Spleen 104
48 Stomach 193
49 Testis 172
50 Thyroid 323
51 Uterus 83
52 Vagina 96
53 Whole Blood 393
PEER factors 5, 10, 15, 20, 30
PEER_setNmax_iterations(peer.model,1000)
PEER_setTolerance(peer.model,0.01)
PEER_setVarTolerance(peer.model,0.0001)
stored in RESULT/STEP03/PEER/${NF}/Tis${TIS}.mat
MFGL with a maximum of 30 factors; note: we use the expectation E[C], not C directly
[B,C,lambda,sigma,tau,lambda] = mfgl_gibbs(NormalizedTis${TIS},30,1000,1000);
stored in RESULT/STEP03/MFGL/Tis${TIS}.mat
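Using E[C] rather than a single C amounts to averaging the factor matrix over the retained Gibbs draws. The sketch below shows only that averaging step. It assumes the draws are already available as a list of equally shaped matrices; `posterior_mean` is a hypothetical helper, and nothing here reflects the internals of `mfgl_gibbs` or the meaning of its arguments.

```python
def posterior_mean(samples):
    """Element-wise average of equally shaped matrices (lists of lists)."""
    n = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    return [[sum(s[r][c] for s in samples) / n for c in range(cols)]
            for r in range(rows)]


# Two made-up draws of a 2-factor x 2-individual matrix C.
draws = [
    [[1.0, 2.0], [0.0, 0.0]],  # second factor switched off in this draw
    [[3.0, 4.0], [0.0, 2.0]],
]
expected_C = posterior_mean(draws)
print(expected_C)
```

A factor that is zero in some draws and active in others ends up shrunk in the average, rather than being fully kept or fully dropped.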
Difference between PEER and MFGL
Prior on B and C:
PEER: $B_{gj} \sim \exp(-0.5\,\beta_j^2 B_{gj}^2)$ and $C_{ji} \sim \exp(-0.5\,C_{ji}^2)$, where the $\beta_j$ values are also inferred, with a Gamma prior (with fixed hyper-parameters).
MFGL: $B_{gj} \sim \exp(-0.5\,\beta^2 B_{gj}^2)$ but $C \sim \exp(-\lambda \lVert c_j \rVert)$, where $\beta = 0.01m$ is fixed; the regularization parameter $\lambda$ is sampled from a Gamma prior.
Inference: PEER uses variational Bayes; MFGL uses Gibbs sampling on the Bayesian group lasso model (Kyung et al. 2010). For a matrix factorization problem of moderate size, we do not need a variational approximation. Gibbs sampling lets us average the model over many possible values of λ, which implicitly selects the right model complexity.
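To make the difference between the two priors on C concrete, the sketch below evaluates the corresponding negative log priors (up to constants) on two toy matrices. It assumes rows of C are factors (c_j is row j) and that the group lasso term sums the row norms over factors; both function names and the toy values are illustrative, not taken from either package.

```python
import math


def gaussian_penalty(C):
    """PEER-style term: 0.5 * sum of squared entries of C."""
    return 0.5 * sum(v * v for row in C for v in row)


def group_lasso_penalty(C, lam):
    """MFGL-style term: lam * sum over factors j of the Euclidean norm of row c_j."""
    return lam * sum(math.sqrt(sum(v * v for v in row)) for row in C)


# Same total energy, distributed differently across two factors.
one_factor = [[3.0, 4.0], [0.0, 0.0]]   # second factor entirely zero
two_factors = [[3.0, 0.0], [0.0, 4.0]]  # both factors active

print(gaussian_penalty(one_factor), gaussian_penalty(two_factors))
print(group_lasso_penalty(one_factor, 2.0), group_lasso_penalty(two_factors, 2.0))
```

The Gaussian term cannot tell the two matrices apart, while the group lasso term is smaller when a whole factor is zero. That is the sense in which the group lasso prior favors switching off entire factors.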
Results on the other tissues can be found here.
We tested from K = 1 to K = 5 and found that MFGL-impute accurately estimates both the missing values and the dimensionality.
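The notes do not describe the evaluation protocol. One common way to check imputation accuracy is to hide known entries, impute them, and measure the error on the hidden cells. The sketch below shows such a harness under that assumption; `holdout_rmse` is a hypothetical helper, and `row_mean_impute` is only a placeholder imputer, not MFGL-impute.

```python
import math
import random


def row_mean_impute(X):
    """Placeholder imputer: fill each missing entry (None) with its row's observed mean."""
    out = []
    for row in X:
        obs = [v for v in row if v is not None]
        mean = sum(obs) / len(obs) if obs else 0.0
        out.append([mean if v is None else v for v in row])
    return out


def holdout_rmse(X, impute, frac=0.2, seed=0):
    """Hide a random fraction of entries, impute them, return RMSE on the hidden cells."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(X)) for c in range(len(X[0]))]
    hidden = rng.sample(cells, max(1, int(frac * len(cells))))
    masked = [list(row) for row in X]
    for r, c in hidden:
        masked[r][c] = None
    filled = impute(masked)
    sq_err = sum((filled[r][c] - X[r][c]) ** 2 for r, c in hidden)
    return math.sqrt(sq_err / len(hidden))


# Toy matrix with constant rows, so the row mean recovers every hidden value.
X = [[1.0] * 5, [2.0] * 5, [3.0] * 5]
rmse = holdout_rmse(X, row_mean_impute)
print(rmse)
```

Any imputer with the same call shape (matrix with `None` for missing cells in, filled matrix out) can be passed in place of the placeholder.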
An example of an imputed gene: