# mod-eQTL discovery in the GTEx dataset

Benjamin Iriarte, _Yongjin Park_, and Manolis Kellis

- Contact: `ypp@csail.mit.edu` and `manoli@mit.edu`
- Equal contribution of BI and YP.


# Experimental procedure

1. __STEP1__ Data collection
   * Get genotype matrix per each chromosome using `VCFtools`
   * MAF $\ge$ 0.05; no missing data; no in/dels `--missing --geno 1 --maf 0.05 --remove-indels --remove-filtered-all`
   * Stored in `DATA/genotype/Chr1.mat` ...

   * Get the list of protein coding genes (GenCode V18)
   * Collect RPKM matrix for each tissue (on the same set of genes)
   * Collapse RPKM _per_ gene by simply summing them up ($\approx$ same length)
   * Stored in `DATA/expr/Tis1.mat` ...

2. __STEP2__ Expression quality control and normalization

   * Consider tissues with $\ge$ 50 samples
   * Retain genes if 10% of individuals have `RPKM > 0.1` in all retained tissues
```
retained number of genes = 12945
retained number of tissues = 48
retained number of samples = 8495
```
   * Take log(1+RPKM) and perform quantile normalization
   * Rescale each gene's expression (row vector) to standard Gaussian ~ $N(0,1)$ to prevent outliers
     from biasing downstream analyses.
   * saved in `RESULT/STEP02/Tis1.mat` and so on; `RESULT/STEP02/GENE_NAME.mat` for gene names
   * Here are the list of retained tissues:

```
1	Adipose - Subcutaneous	350
2	Adipose - Visceral (Omentum)	227
3	Adrenal Gland	145
4	Artery - Aorta	224
5	Artery - Coronary	133
6	Artery - Tibial	332
7	Bladder	11
8	Brain - Amygdala	72
9	Brain - Anterior cingulate cortex (BA24)	84
10	Brain - Caudate (basal ganglia)	117
11	Brain - Cerebellar Hemisphere	105
12	Brain - Cerebellum	125
13	Brain - Cortex	114
14	Brain - Frontal Cortex (BA9)	108
15	Brain - Hippocampus	94
16	Brain - Hypothalamus	96
17	Brain - Nucleus accumbens (basal ganglia)	113
18	Brain - Putamen (basal ganglia)	97
19	Brain - Spinal cord (cervical c-1)	71
20	Brain - Substantia nigra	63
21	Breast - Mammary Tissue	214
22	Cells - EBV-transformed lymphocytes	118
23	Cells - Transformed fibroblasts	284
24	Cervix - Ectocervix	6
25	Cervix - Endocervix	5
26	Colon - Sigmoid	149
27	Colon - Transverse	196
28	Esophagus - Gastroesophageal Junction	153
29	Esophagus - Mucosa	286
30	Esophagus - Muscularis	247
31	Fallopian Tube	6
32	Heart - Atrial Appendage	194
33	Heart - Left Ventricle	218
34	Kidney - Cortex	32
35	Liver	119
36	Lung	320
37	Minor Salivary Gland	57
38	Muscle - Skeletal	430
39	Nerve - Tibial	304
40	Ovary	97
41	Pancreas	171
42	Pituitary	103
43	Prostate	106
44	Skin - Not Sun Exposed (Suprapubic)	250
45	Skin - Sun Exposed (Lower leg)	357
46	Small Intestine - Terminal Ileum	88
47	Spleen	104
48	Stomach	193
49	Testis	172
50	Thyroid	323
51	Uterus	83
52	Vagina	96
53	Whole Blood	393
```

3. __STEP3__ Tissue-by-tissue confounder correction/identification of hidden covariates
$$
X = B C
$$
where $X$ is gene by samples.
    * PEER factors 5, 10, 15, 20, 30
```
PEER_setNmax_iterations(peer.model,1000)
PEER_setTolerance(peer.model,0.01)
PEER_setVarTolerance(peer.model,0.0001)
```
      stored in `RESULT/STEP03/PEER/${NF}/Tis${TIS}.mat`
   
    * MFGL factors with max 30 factors; _note: we use expected $E[C]$ not $C$ directly_
```
[B,C,lambda,sigma,tau,lambda] = mfgl_gibbs(NormalizedTis${TIS},30,1000,1000);
```
      stored in `RESULT/STEP03/MFGL/Tis${TIS}.mat`

    * Difference between PEER and MFGL

        1. Prior on $B$ and $C$:

            - PEER: $B_{gj} \sim \exp(-0.5 \beta_{j}^2 B_{gj}^2)$ and $C_{ji} \sim \exp(-0.5 C_{ji}^2)$,
            where $\beta_{j}$ values are also inferred with Gamma prior (with fixed hyper-parameters).

            - MFGL: $B_{gj} \sim \exp(-0.5 \beta^2 B_{gj}^2)$ but $C \sim \exp(- \lambda \|c_{j}\|)$,
            where $\beta = 0.01 m$ fixed; regularization parameter $\lambda$ is sampled from Gamma prior.

        2. Inference : PEER uses variationa Bayes; MFGL Gibbs sampling on Bayesian group lasso model
        (Kyung _et al._ 2010).  For matrix factorization problem (of moderate size), we do not need
        to use variational approximation.  Using Gibbs sampling we are allowed to take model-average
        over many possible choices of $\lambda$, which then implicitly allows to select right model
        complexity.

    * ![](FIG/STEP03/Tis1.png) [Fig1] shows confounding factors found in Tissue #1 (Adipose - Subcutaneous).
    * Top 3-4 factors explain much of variation.
    * ![](FIG/STEP03/Tis19.png) In [Fig2] we see different patterns (Tissue #19; Brain - Spinal
      cord).  The very first factor dominates and segregate population, and inference algorithm
      picks up the right amount of factors.
    * Results on the other tissues can be found in [here](FIG/STEP03/)


4. __STEP4__ Imputation of each gene's tissue x individual matrix
* Demonstration of imputation method:
    - Generate $X = U V$ where the ranks of $U$ and $V$ are both $3$ (in this example).
    - The matrix $X$ is 50 x 450.
    - Factored matrices $U$ and $V$ are from from $N(0,1)$.
    - $X$ was row-wise normalized.
    - Remove 80% of elements from $X$.
    - We can perform imputation while estimating MFGL by sampling missing values given current $U$
    and $V$.
    - We term this MFGL-impute.
    - Our matrix MFGL-impute accurately impute missing values:
    ![](FIG/STEP04/demo/FigScatterK3.png)
    - We also recover correct dimensionality:
    ![](FIG/STEP04/demo/FigTauK3.png)
    - Penalty parameter also reached equilibrium:
    ![](FIG/STEP04/demo/FigLambdaK3.png)
* We tested from $K=1$ to $K=5$ and found MFGL-impute accurately estimate missing values and dimensionality.

* An example of imputed gene:

<!-- 4. __STEP4__ Confounder correction -->
<!--    * Let $X(t)$ be a gene by individual matrix of a tissue $t$; we identify -->
<!--    $$ -->
<!--    X(t) = B C(t)^\top -->
<!--    $$ -->
<!--    where $B$ is gene by factor loading matrix and $C$ is sample by factor confounder matrix. -->
<!--    * Note that loading factors are shared across tissues to prevent from overfitting; -->
<!--    * The loading factors are meant to capture tissue-invariant batch effects. -->
<!--    * Note that training $B$ on the concatenated $[X(1),\ldots,X(n_T)]$ wouldn't work as well. -->
   
5. __STEP5__ Vanilla eQTL calling per tissue
   * MatrixEQTL
   * Number of eQTL discovery


[Fig1]: FIG/STEP03/Tis1.pdf
[Fig2]: FIG/STEP03/Tis19.pdf