Research

Advances in biotechnology and engineering have revolutionized biology and related scientific disciplines by generating massive datasets. Next-generation sequencing (NGS) machines, which can generate billions of reads per day, are widely used for disease studies, personalized medicine, functional genomics and systems biology. Other high-throughput experimental methods have also been developed for proteomics and protein-protein interactions. Two key challenges in this "big data" era are, first, how to handle these datasets efficiently and, second, how to make sense of the data. To address these challenges, we need smart algorithms that can speed up analysis of large-scale data, and intelligent models that are capable of mining useful information from the data.

My current research focuses on the design and application of both efficient algorithms and effective modeling techniques, especially for processing, integrating and analyzing the vast datasets in genomics, systems biology and molecular biology. My doctoral research was on statistical models for structural bioinformatics, with a focus on statistical inference for protein structure modeling.

1. Algorithms for large-scale genomics in the era of NGS and big data

As high-throughput sequencing technologies generate ever more genomic data at lower cost, processing power and storage resources are becoming the real bottleneck. In the past decade, genomic sequencing capabilities have increased exponentially, outstripping advances in computing power. For example, large-scale international genome projects (e.g., ENCODE, the 1000 Genomes Project, the Genome 10K Project) currently require analysis of sequence data from hundreds or thousands of individuals, and sequencing centers around the world currently produce more than 10 terabytes of genomic data per day. Similarly, the sizes of protein sequence databases and chemical compound genomic libraries have increased exponentially in recent years. Extracting new insights from the datasets currently being generated will require not only faster computers and larger storage devices, but also smarter algorithms. Fortunately, the amount of non-redundant data in these datasets is not growing nearly as fast; compressive approaches can exploit this fact to greatly reduce the computational cost of information retrieval.

Compressive approaches for NGS read mapping. In no other problem domain are we currently running up against the limits of data storage and analysis as much as we are with NGS read data. Large-scale genome projects, such as 1000 Genomes and Genome 10K, currently require analysis of sequencing data from thousands of individuals. Therefore, to utilize the full power of large NGS datasets, algorithms that scale sublinearly in the size of the read data (i.e., those that reduce the effective size or ignore most of the data) are required. In joint work with Deniz Yorukoglu and Prof. Bonnie Berger, we have developed a novel compressive framework, ISLAND, to achieve sublinear-time/space read mapping. To do so, we employ innovative succinct data structures for compressed read representation and novel algorithms for rapid on-demand retrieval of mapping information. As proof of principle for the underlying idea of compressive mapping, we have implemented compressive ISLAND versions of BWA, Bowtie and mrsfast that find all possible mapping locations in time proportional to the size of the compressed read data, and sublinear in the size of the raw input read data. Even at a coverage depth as low as 15X, the ISLAND mappers run 10 times faster than the original read mappers with almost identical accuracy. This work will be submitted shortly for publication in Nature Biotechnology, as a follow-up to the Berger lab's Compressive Genomics paper. An NSF-NIH big data grant (PI: Bonnie Berger) based on these ideas has recently been awarded.
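The core intuition behind compressive mapping can be illustrated with a deliberately simplified sketch. It is not the ISLAND framework: it only collapses exactly identical reads so that the underlying mapper runs once per distinct read, whereas the succinct data structures described above are not reproduced here. The function names (find_all, compressive_map) and the naive exact-match mapper are assumptions made for illustration.

```python
from collections import defaultdict


def find_all(reference, read):
    """Toy exact-match mapper: every start position of `read` in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def compressive_map(reference, reads, mapper=find_all):
    """Map each distinct read once, then share the result among its copies.

    Returns (results, mapper_calls), where results[i] lists the mapping
    positions of reads[i] and mapper_calls counts invocations of `mapper`.
    """
    # "Compress" the read set: group read indices by identical sequence.
    groups = defaultdict(list)
    for idx, read in enumerate(reads):
        groups[read].append(idx)

    results = [None] * len(reads)
    mapper_calls = 0
    for sequence, indices in groups.items():
        hits = mapper(reference, sequence)  # work is done once per distinct read
        mapper_calls += 1
        for idx in indices:
            results[idx] = list(hits)  # copy so callers can mutate safely
    return results, mapper_calls
```

In this toy setting the mapping cost grows with the number of distinct reads rather than the raw read count, which is the property that makes redundancy in high-coverage data useful.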

Compressive-accelerated BLAST for protein databases. Identification of evolutionarily related sequences for a given protein is a fundamental problem in computational biology, with applications to orthology prediction, functional annotation, and structure prediction. Almost all protein sequence analysis programs rely on BLASTp and PSI-BLAST, which identify homologous sequences from a given protein sequence database. The runtime of these tools scales linearly in the size of the protein sequence database. While the volume of protein sequence data doubles roughly every two years (reaching roughly 40 billion amino acids thus far), much of the newly discovered protein sequence data is highly similar to existing data. In joint work with Noah Daniels and Prof. Bonnie Berger, we have developed a compressive-accelerated search algorithm, CaBLASTP, which compresses the database and performs sequence search directly on the compressed data. First, we preprocess the protein sequence database by identifying short segments with high similarity. We compress the database by storing only the "unique" segments, maintaining the locations of all similar regions in a linked table, and recording the differences between them and the "unique" segments. The resulting "coarse" database is therefore non-redundant and typically much smaller than the original sequence database. After this "compressed" database is constructed, we apply a coarse-to-fine search algorithm for homology search. The query sequence is initially searched against the "coarse" database by BLASTp or PSI-BLAST. We then use the linked table to find all other similar regions, reconstruct the corresponding sequences, and refine the result. This compressive-accelerated approach scales sublinearly in the size of the database being searched and almost linearly in the size of the "unique" data. CaBLASTP compressively "boosts" current sequence analysis programs such as BLASTp and PSI-BLAST, speeding them up by a factor of at least three.
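The compress-then-search idea can be sketched in a few lines under strong simplifying assumptions. This is not CaBLASTP itself: whole equal-length sequences stand in for short segments, Hamming distance stands in for BLAST similarity, and all function names and the radius and max_dist parameters are hypothetical. The coarse pass uses a relaxed threshold (max_dist + radius) so that, by the triangle inequality, no true hit is lost before the fine pass.

```python
def hamming(a, b):
    """Number of mismatches; sequences of different length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def compress(database, radius):
    """Build a coarse database of representatives plus a link table.

    links[i] holds (sequence_id, diffs) for every sequence within `radius`
    mismatches of reps[i]; diffs are (position, residue) edits to the rep.
    """
    reps, links = [], []
    for seq_id, seq in enumerate(database):
        for rep_idx, rep in enumerate(reps):
            if hamming(seq, rep) <= radius:
                diffs = [(pos, ch) for pos, (ch, ref) in enumerate(zip(seq, rep))
                         if ch != ref]
                links[rep_idx].append((seq_id, diffs))
                break
        else:
            # No similar representative yet: this sequence is "unique".
            reps.append(seq)
            links.append([(seq_id, [])])
    return reps, links


def reconstruct(rep, diffs):
    """Rebuild an original sequence from its representative and edits."""
    chars = list(rep)
    for pos, ch in diffs:
        chars[pos] = ch
    return "".join(chars)


def coarse_to_fine_search(query, reps, links, max_dist, radius):
    """Return ids of all database sequences within max_dist of the query."""
    hits = []
    for rep, members in zip(reps, links):
        # Coarse pass: relaxed threshold guarantees no member is missed.
        if hamming(query, rep) > max_dist + radius:
            continue
        # Fine pass: reconstruct linked sequences and check them exactly.
        for seq_id, diffs in members:
            if hamming(query, reconstruct(rep, diffs)) <= max_dist:
                hits.append(seq_id)
    return sorted(hits)
```

The coarse pass touches only the representatives, so its cost tracks the amount of "unique" data; the fine pass is restricted to clusters that survived the coarse filter.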

2. Systems biology for protein chaperones and homeostasis

Protein chaperones are special proteins that aid in the folding, unfolding, assembly and disassembly of other protein structures. The cytosol of E. coli contains about 350 mg/mL of proteins. Protein molecules are effectively bumping into each other all the time and thus can become partially unfolded or misfolded. Newly synthesized proteins can also easily become misfolded, unfolded or aggregated without help from the cell. Protein misfolding and aggregation have been shown to be associated with many diseases, such as Alzheimer's and Parkinson's. Protein homeostasis (or proteostasis), a special mechanism in the cell, maintains the balance between synthesis, folding, aggregation and turnover of proteins. Protein chaperones, especially heat shock proteins (HSPs), lie at the hub of protein homeostasis. However, the details of cellular pathway signaling for protein homeostasis remain unclear.

Reconstruction of cellular pathways for protein homeostasis. Together with their specific co-factors, or co-chaperones, chaperones are central coordinators of the protein homeostasis network that profoundly influence many human diseases, from cancer to Mendelian disorders to neurodegeneration. Given the immense potential of proteostasis-modulating compounds in medicine, detailed characterization of the normal proteostasis network is a prerequisite for developing improved therapeutics. I have developed probabilistic modeling approaches to analyze mass spectrometry and quantitative high-throughput LUMIER assays to systematically characterize the chaperone/co-chaperone/client interaction network of human cells. I have designed a Gaussian Process Mixture model to normalize novel high-throughput LUMIER assays, integrate the mass spectrometry data (generated by the Lindquist lab at the Whitehead and Howard Hughes Institutes at MIT and the Gingras Lab from Mount Sinai Toronto), and assign confidence scores to interactions. I have also annotated protein structure and function in this network, using algorithms that I developed in the context of structural bioinformatics, to understand the role of different protein co-chaperones. These computational predictions have been validated by experiments performed by my collaborator Dr. Mikko Taipale from the Lindquist lab. In particular, the algorithms that I have developed enable us to broadly and comprehensively delineate the relationship between the Hsp70 and Hsp90 chaperone systems, uncover hundreds of novel chaperone clients, characterize their integration into specific co-chaperone complexes, and establish a surprisingly distinct network of protein::protein interactions for co-chaperones. We provide a comprehensive framework for understanding how the proteostasis network is wired and a rich resource for exploring how it changes in development and disease.

Deciphering the molecular determinants of HSP90::Kinase interactions, and implications for human cancers. Protein kinases, the basic building blocks of cellular signaling pathways, have proved to be the most effective drug targets for cancer therapies. Oncogenic mutations in kinases have been observed to be associated with the expression of Hsp90 in cancer cells. The interactions between protein kinases and Hsp90 are modulated by the co-chaperone Cdc37, which recruits partially folded or unfolded kinases to the dimeric Hsp90. New drugs have been developed with Hsp90 as the target, aimed at inhibiting the access of oncogenic protein kinases to the Hsp90-Cdc37 chaperone complex. However, the mechanism by which kinase clients are recognized and the determinants of client specificity for Hsp90-Cdc37 are not clear, nor has a structure of the complex been solved thus far. In collaboration with Dr. Mikko Taipale in the Lindquist lab, we have resolved the specificity question. We performed a quantitative and systematic assay of Hsp90-Cdc37/kinase interactions and analyzed the determinants of client specificity. To make sense of these data, I developed a computational pipeline to construct high-quality kinase-specific sequence alignments. Next, I designed a bootstrapping-based sparse learning algorithm to identify the putative structural motif and the relevant physicochemical determinants of this specificity. I found that a local motif of 10 residues, near the hinge between the N- and C-terminal lobes of the kinase structure, has strong predictive power for the specificity of a set of newly assayed kinases; in other words, we can accurately predict which kinases will bind Hsp90-Cdc37 complexes. Interestingly, this specificity is determined not only by the residues on the protein kinase interface, but also by a number of deeply buried hydrophobic core residues.
Based on this motif, I computationally designed mutations and chimeric sequences, which enabled us to successfully rewire Hsp90-Cdc37 specificity for several kinases in living cells, some by mutations on only the buried core residues; as a result, we have been able to change the intramolecular packing of kinase core regions in vivo. Further experimental and computational analyses have indicated that the specificity is indeed highly associated with the stability of the alpha-helical bundle in the kinase structure. Furthermore, our results establish a novel mechanism, suggesting that the Hsp90-Cdc37 chaperone system recognizes the intermediate kinase conformation by sensing the thermostability of its αE helix, as well as the exposed residues on a proximal loop. We anticipate that these results will aid our understanding of the role of Hsp90 in cancer drug development. For example, Hsp90 was used recently as a thermodynamic sensor to detect drug-kinase interactions for certain kinases in vivo. In addition, we analyzed missense kinase mutations from cancer cells and those from the 1000 Genomes Project; we found a profound difference in thermostability between mutations in cancer cells and those in normal individuals. The investigation of newly designed Hsp90 inhibitors for the regulation of oncogenic kinases will be critically related to such developments.
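The text does not specify the bootstrapping-based sparse learning algorithm, so the following is only a generic sketch of the idea under stated assumptions: an L1-penalized (lasso) regression is refit on bootstrap resamples, and features are ranked by how often they receive a nonzero coefficient. The lasso solver, the function names, and the parameters lam, sweeps and n_boot are all illustrative choices, not the method actually used in this work.

```python
import random


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; exactly zero inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=30):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, since w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


def selection_frequencies(X, y, lam, n_boot=30, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        w = lasso_coordinate_descent([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]
```

Features selected in nearly every resample are candidates for a stable motif, while features that appear only sporadically are more likely artifacts of a particular sample.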

3. Statistical inference for computational structural biology

Template-based modeling for protein structure prediction. Template-based modeling (TBM) is arguably the most successful approach for protein structure prediction to date. TBM methods identify similar protein structures for a query protein sequence, build alignments of the query sequence to putative structures, and use these structures as templates to build three-dimensional models. Although this approach can make predictions with reasonable accuracy, there is a large gap between the predictive ability of current TBM methods and the theoretical limit, especially when the evolutionary signal is insufficient to build accurate sequence alignments for query proteins. To build better alignments, I developed a tree-based graphical model for pairwise protein alignment. I used a conditional random field (CRF) model to represent the probability distribution of pairwise alignments between two proteins. I introduced a set of regression trees as the potential function for this CRF model to capture the complex dependencies between the evolutionary and structural similarities of proteins. Further, I introduced an information-theoretic measure for each protein to quantify the strength of the evolutionary signal implied by its homologous sequences, and used it to guide the training of the alignment model so as to adaptively exploit structural and evolutionary features (i.e., when the evolutionary information is insufficient, the model relies more on the structural features). This work is of particular importance for the structure prediction of proteins with insufficient evolutionary signal, which is recognized as the major challenge for the community. We implemented this algorithm in the RaptorX webserver, and evaluated it in the community-wide Critical Assessment of Protein Structure Prediction (CASP9) competition in 2010. RaptorX ranked No. 2 in CASP9 out of ~80 servers.
Remarkably, it achieved the best performance in the "hard" template-based modeling category. RaptorX was also voted by the CASP9 community as "most innovative method." Since January 2012, the RaptorX server has predicted structures for ~60,000 proteins submitted by 5500 users across more than 100 countries.
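The exact information-theoretic measure is not given here. One plausible form, shown purely as an assumption, is the exponential of the mean per-column Shannon entropy of the multiple sequence alignment of a protein's homologs: a value near 1 indicates nearly identical homologs (weak evolutionary signal), and larger values indicate more diverse homologs. The function names below are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log) of the residues in one alignment column.

    Gap characters ('-') are ignored; an all-gap column contributes zero.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((k / total) * math.log(k / total) for k in counts.values())


def effective_diversity(msa):
    """exp(mean column entropy) for an alignment given as equal-length strings."""
    if not msa:
        raise ValueError("alignment must contain at least one sequence")
    length = len(msa[0])
    if length == 0 or any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must share one nonzero length")
    mean_entropy = sum(
        column_entropy([seq[i] for seq in msa]) for i in range(length)
    ) / length
    return math.exp(mean_entropy)
```

A score of this kind could then be used to decide how much weight an alignment model places on evolutionary versus structural features, in the spirit of the adaptive training described above.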

Structure-based prediction for interactomes. Besides protein structure prediction, I have been working on predicting protein-protein and protein-RNA interactions from a structural perspective. Together with Raghavendra Hosur and Prof. Bonnie Berger, I have developed a structure-based method for predicting protein-protein interactions. By building a graphical model on the predicted interface of two proteins, we are able to evaluate the likelihood of the putative interaction with an efficient Markov chain Monte Carlo sampling technique. Our predictions for the MAPK interactome have been validated by Norbert Perrimon's lab at Harvard Medical School. I plan to apply structural modeling and statistical inference methods to protein-RNA interaction prediction. An NIH R01 grant (PI: Bonnie Berger) based on these ideas has recently been awarded.