The general theme of my research is to develop computational frameworks and machine learning algorithms that effectively integrate heterogeneous genome-wide measurements to recognize meaningful biological patterns associated with specific phenotypic traits, such as cancer or other complex diseases.

Fine-mapping and causal variant inference from genome-wide association studies
Genome-wide association studies (GWAS) can yield numerous insights into the genetic basis of complex diseases and ultimately contribute to personalized risk prediction and precision medicine. When the regions of association contain protein-altering variants, the path from GWAS hits to disease mechanism, and eventually to therapeutic development, can start with the disrupted gene as a candidate target. However, over 90% of disease-associated loci consist exclusively of non-coding variants, which hinders interpretation of their function. Moreover, genome-wide significant loci explain only a small fraction of the phenotypic variance attributable to genetics, a gap often referred to as the "missing heritability". Finally, because of linkage disequilibrium (LD), many significant variants are not causal but merely linked to causal variants within the same LD block, which may span from a few kilobases to hundreds of kilobases. My current main research interest is to develop statistical inference models that computationally fine-map causal variants by rigorously integrating various sources of information, and that infer the causal mechanisms underlying related genetic traits or diseases.


  1. Li, Y. & Kellis, M. (2016). Joint Bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases. Nucleic Acids Research, 16(2), 1-13. doi:10.1093/nar/gkw627


Electronic health record modeling

Electronic health records (EHR) contain extremely rich information about a patient and are currently being explored by various methods (Jensen et al., 2012). Many large hospitals routinely generate EHR data for millions of patients annually, often in the form of International Classification of Diseases (ICD) codes, which define the universe of diseases, disorders, injuries and other related health conditions in a comprehensive and hierarchical fashion. In principle, this implies that diseases are conditionally independent of genetics given the mediating phenotypes, which can provide crucial information about the underlying disease mechanisms. To leverage EHR data in a systematic way, I am developing Bayesian models that impute missing health information by modeling the data-generating process (manuscript in preparation).

Learning regulatory potential from functional genomic data

Disruption and aberrant coordination of gene expression often sit high in the hierarchy of causes of complex human diseases. To improve our ability to interpret non-coding sequence, various functional genomic data have recently been generated from ChIP-seq, massively parallel reporter assays (MPRA), and Hi-C. To harness these data for inferring causal eQTL/GWAS SNPs, I am interested in developing supervised and semi-supervised learning strategies that jointly learn the underlying regulatory properties implicated by both functional genomic data and GWAS/eQTL signals (manuscript in preparation).

Inferring microRNA regulatory networks

MicroRNAs (miRNAs) are noncoding RNA species approximately 22 nucleotides long whose regulatory roles have important implications in development and disease. Functional characterization of miRNAs requires accurate identification of their RNA targets, which has been a challenging computational task because of various confounding factors centering on the combinatorial co-regulatory relationships between miRNAs and mRNAs. Earlier sequence-based methods rely mostly on seed match, phylogenetic conservation, and binding energy. Recently, there has been a paradigm shift from sequence-based binary classification to more quantitative, expression-based and network-focused approaches. This shift has been driven largely by the increasing amount of expression profiling data for mRNAs and miRNAs across various experimental conditions. One of my research interests is to infer cancer-specific miRNA regulatory networks that can characterize cancer phenotypes and/or facilitate the development of prognostic biomarkers.


  1. Li, Y. and Zhang, Z. (2015). Computational Biology in microRNA. WIREs RNA. 4(7097)

  2. Li, Y., Zhang, Z. (2014). Potential microRNA-mediated oncogenic intercellular communication revealed by pan-cancer analysis. Scientific Reports. 4(7097)

  3. Li, Y.*, Liang, C.*, Wong, KC., Luo, J., Zhang, Z. (2014). Mirsynergy: detecting synergistic miRNA regulatory modules by overlapping neighbourhood expansion. Bioinformatics. 30(18), 2627-2635. doi: 10.1093/bioinformatics/btu373.

  4. Li, Y., Liang, C., Wong, KC, Jin, K., and Zhang, Z. (2014) Inferring probabilistic miRNA-mRNA interaction signatures in cancers: a role-switch approach. Nucleic Acids Research, 42(9), e76. doi: 10.1093/nar/gku182

  5. Li, Y., Goldenberg, A., Wong, KC., Zhang, Z. (2013). A probabilistic approach to explore human miRNA targetome by integrating miRNA-overexpression data and sequence information. Bioinformatics, 30(5), 621–628. doi:10.1093/bioinformatics/btt599

  6. Li, Y. Computational methods of inferring microRNA regulatory networks. Ph.D. thesis (2014)


RNA epigenetics

N6-methyladenosine (m6A) is the most prevalent endogenous methylation in RNA. Recently, Dominissini et al. (2012) and Meyer et al. (2012) demonstrated a novel NGS protocol, m6A-seq, that interrogates transcriptome-wide m6A methylation by antibody-mediated capture followed by massively parallel sequencing. Despite being implicated in the regulation of gene expression, the functional roles of m6A are still largely unknown. In collaboration with Prof. Crystal Zhao, we are exploring the fundamental biology of m6A in mammalian development more deeply, using a combined experimental and computational approach.


  1. Wang, Y., Li, Y., Toth, J. I., Petroski, M. D., Zhang, Z., & Zhao, J. C. (2014). N6-methyladenosine modification destabilizes developmental regulators in embryonic stem cells. Nature Cell Biology, 16(2), 1-10. doi:10.1038/ncb2902


Detection of protein-associated noncoding RNA from RIP-seq, CLIP-seq, and PAR-CLIP experiments

Comprehensive transcriptome analyses suggest that only 1%-2% of the human or mouse genome is protein coding, whereas 70%-90% is transcriptionally active but does not code for proteins; these transcripts are therefore denoted non-coding RNA (ncRNA) (ENCODE Project Consortium, 2007). Mounting evidence suggests that many of these ncRNAs are evolutionarily conserved, functionally interact with transcription factors and/or chromatin regulators, and participate in gene regulation. NGS-based protocols such as PAR-CLIP and RIP-seq enable unbiased genome-wide identification of these ncRNAs and thus promise to reveal unique aspects of molecular biology. We are working closely with biologists from CCBR to construct protein-protein and protein-ncRNA interaction networks using these technologies.


  1. Zhao, D., Li, Y., Greenblatt, J., & Zhang, Z. (2014). ncRNA–Protein Interactions in Development and Disease from the Perspective of High-Throughput Studies. In A. Emili, J. Greenblatt, & S. Wodak (Eds.), Systems Analysis of Chromatin-Related Protein Complexes in Cancer (pp. 87-115). Springer New York. doi:10.1007/978-1-4614-7931-4_5

  2. Li, Y., Zhao, D. Y., Greenblatt, J. F., & Zhang, Z. (2013). RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments. Nucleic Acids Research, 41(8), e94. doi:10.1093/nar/gkt142


Identification of differential DNA methylation and copy number variation in cancer

In DNA methylation, a hydrogen atom of the cytosine base of the DNA is replaced by a methyl group. This change typically induces a locally more compact chromatin structure, repressing the activity of nearby genes. Copy number variation (CNV), on the other hand, corresponds to the deletion or duplication of large regions of the genome relative to normal subjects. Abnormal DNA methylation patterns and CNVs in cancer have been reported in many studies. In a collaborative project with Prof. Art Petronis from The Krembil Family Epigenetics Laboratory, we use tiling arrays to interrogate both phenomena in sera from large cohorts of colorectal cancer patients in order to establish prominent (epi-)genetic signatures. In the future, we will use (bisulfite) NGS to confirm and extend our current findings.


Kinome analysis

We proposed and implemented a computational pipeline to analyze peptide array kinome data (Li et al., 2012). This work, which formed my B.Sc. Honours thesis, was carried out under the supervision of Dr. Anthony Kusalik and in collaboration with immunologists (co-authors) from the Vaccine and Infectious Disease Organization (VIDO) at the University of Saskatchewan. To our knowledge, the proposed pipeline is the first integrative approach that addresses kinome-specific computational challenges in microarray analyses. In particular, our statistical testing for differentially phosphorylated kinase peptides takes into account both the technical variation inherent to the technology and the biological variation arising from dynamic kinase activities between biological replicates. Compared with existing methods, our approach is more sensitive in detecting kinases involved in well-defined signaling pathways activated by the selected stimuli. The central roles of kinases in immune defence make them promising therapeutic targets. Rigorous detection of subtle changes in treatment-specific kinase activities on a powerful platform such as the kinome microarray may facilitate pharmaceutical design against diseases.

  1. Arsenault, R. J., Li, Y., Maattanen, P., Scruten, E., Doig, K., Potter, A., Griebel, P., Kusalik, A., and Napper, S. (2013) Altered Toll-like receptor 9 signaling in Mycobacterium avium subsp. paratuberculosis infected bovine monocytes reveals potential therapeutic targets. Infection and immunity, 81(1), 226-237.

  2. Arsenault, R. J., Li, Y., Potter, A., Griebel, P. J., Kusalik, A., and Napper, S. (2012). Induction of ligand-specific PrPC signaling in human neuronal cells. Prion, 6(5), 477-488.

  3. Arsenault, R. J., Li, Y., Bell, K., Doig, K., Potter, A., Griebel, P. J., Kusalik, A., and Napper, S. (2012). Mycobacterium avium subsp. paratuberculosis Inhibits Interferon Gamma-Induced Signaling in Bovine Monocytes: Insights into the Cellular Mechanisms of Johne’s Disease. Infection and immunity, 80, 3039–3048.

  4. Li, Y., Arsenault, R. J., Trost, B., Slind, J., Griebel, P. J., Napper, S., and Kusalik, A. (2012). A Systematic Approach for Analysis of Peptide Array Kinome Data. Science Signaling, 5(220), pl2.