Funding-supported research:
My research at PLU focuses mainly on Capstone projects with undergraduate students.
My research focuses on applying machine learning and data mining techniques to biomedical problems, such as gene function prediction from Hi-C contact data, protein function prediction, and quality assessment for protein tertiary structure prediction.
The Hi-C technique (Z. Wang, R. Cao, et al., 2013), which determines genome-wide chromosomal interaction (contact) data, was used to obtain spatial proximity profiles of healthy and malignant human B cells and cell lines. We used these data to generate spatial gene-gene interaction networks and compared the functional similarity of gene pairs that interact spatially with that of pairs that do not. We found that genes with strong spatial interactions tend to have highly similar functions in terms of the biological process, molecular function, and cellular component categories of the Gene Ontology. Although the level of gene-gene interaction generally has no or only weak correlation with either genomic distance or sequence identity between genes, interacting genes with high functional similarity tend to have stronger interactions, somewhat shorter genomic distances, and significantly higher sequence identity. We developed and evaluated a new gene function prediction method based on gene-gene interaction networks, which predicts gene function well for a large number of human genes. Details can be found in (R. Cao et al., 2015).
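To illustrate the general idea of predicting function from an interaction network, here is a minimal sketch of interaction-weighted neighbor voting. It is not the published method, whose details are in the cited reference; the function name `predict_functions`, the gene names, the interaction weights, and the annotations are all hypothetical, and the GO identifiers serve only as labels.

```python
from collections import defaultdict


def predict_functions(network, annotations, gene, top_k=3):
    """Rank GO terms for `gene` by weighted votes from its annotated neighbors.

    network:     {gene: {neighbor: interaction_weight}}  (hypothetical format)
    annotations: {gene: set of GO term labels}
    Returns up to `top_k` (term, score) pairs, where score is the fraction of
    annotated interaction weight that supports the term.
    """
    votes = defaultdict(float)
    total_weight = 0.0
    for neighbor, weight in network.get(gene, {}).items():
        terms = annotations.get(neighbor)
        if not terms:
            continue  # unannotated neighbors cannot vote
        total_weight += weight
        for term in terms:
            votes[term] += weight
    if total_weight == 0:
        return []  # no annotated neighbors, so no prediction
    ranked = sorted(
        ((term, score / total_weight) for term, score in votes.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:top_k]


# Toy data (assumed for illustration only).
network = {"geneX": {"geneA": 8.0, "geneB": 2.0, "geneC": 5.0}}
annotations = {
    "geneA": {"GO:0006955"},
    "geneB": {"GO:0008152"},
    "geneC": {"GO:0006955", "GO:0008152"},
}
predictions = predict_functions(network, annotations, "geneX")
```

In this toy case the term shared by the two most strongly interacting neighbors ranks first, which mirrors the observation that strongly interacting genes tend to share function.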
Functionally relevant biological information such as protein sequences, gene expression, and protein-protein interactions has mostly been used separately for protein function prediction. A major challenge is how to effectively integrate multiple sources of both traditional and new information, such as spatial gene-gene interaction networks generated from chromosomal conformation data, to improve protein function prediction. We developed three probabilistic scores (the MIS, SEQ, and NET scores) from protein sequence, function associations, and protein-protein interaction and spatial gene-gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. More details can be found in (R. Cao et al., 2015).
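The text above does not say how the three scores are combined, so the following is only a sketch of one plausible scheme, a weighted average of per-term probabilities. The function name `combine_scores`, the equal default weights, and the toy scores are assumptions and should not be read as the actual SMISS formula.

```python
def combine_scores(mis, seq, net, weights=(1.0, 1.0, 1.0)):
    """Combine three per-term probability scores with a weighted average.

    mis, seq, net: {term: probability} from the three evidence sources.
    A term missing from a source is treated as probability 0 for that source.
    Returns (term, combined_score) pairs sorted from best to worst.
    """
    total = sum(weights)
    terms = set(mis) | set(seq) | set(net)
    combined = {}
    for term in terms:
        weighted_sum = (
            weights[0] * mis.get(term, 0.0)
            + weights[1] * seq.get(term, 0.0)
            + weights[2] * net.get(term, 0.0)
        )
        combined[term] = weighted_sum / total
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


# Toy scores for one protein (assumed for illustration only).
mis_score = {"term_A": 0.9, "term_B": 0.3}
seq_score = {"term_A": 0.6}
net_score = {"term_B": 0.6, "term_C": 0.3}
ranking = combine_scores(mis_score, seq_score, net_score)
```

Unequal weights could be passed in to favor a more reliable evidence source; how such weights would be chosen is outside what the text describes.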
Quality assessment (QA) evaluates the quality of a protein model without knowledge of the native structure, which is crucial for protein tertiary structure prediction. We first evaluated our MULTICOM QA methods on CASP10. The results showed that pairwise model assessment methods worked better when a large portion of the models in the pool were of good quality, whereas single-model quality assessment methods performed better on some hard targets, where only a small portion of the models in the pool were of reasonable quality (see R. Cao et al., 2014). We then developed SMOQ, a machine learning QA tool that predicts the distance deviation of each residue in a single protein model (see R. Cao et al., 2014). In addition, we developed a novel large-scale model QA method that works in conjunction with model clustering to rank and select protein structural models. It applied an unprecedented 14 model QA methods to generate consensus model rankings, followed by model refinement based on model combination (i.e., averaging). Our experiments demonstrate that the large-scale model QA approach is more consistent and robust in selecting models of better quality than any individual QA method. It was officially ranked 3rd out of all 143 human and server predictors in CASP11 (see R. Cao et al., 2015; R. Cao et al., 2015).
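As a rough illustration of building a consensus ranking from several QA methods, the sketch below averages each model's rank across methods. The exact consensus scheme is not given in the text above, so mean rank is an assumption, as are the function name `consensus_ranking`, the method and model names, and the scores. The sketch also assumes that every method scores every model and that higher scores mean better quality.

```python
def consensus_ranking(qa_scores):
    """Rank models by their mean rank across several QA methods.

    qa_scores: {method_name: {model_name: score}}, higher score = better.
    Returns (model, mean_rank) pairs sorted from best (lowest mean rank).
    """
    rank_sums = {}
    for scores in qa_scores.values():
        # Rank 1 goes to the highest-scoring model; ties break by name.
        ordered = sorted(scores, key=lambda model: (-scores[model], model))
        for rank, model in enumerate(ordered, start=1):
            rank_sums[model] = rank_sums.get(model, 0) + rank
    n_methods = len(qa_scores)
    return sorted(
        ((model, total / n_methods) for model, total in rank_sums.items()),
        key=lambda pair: (pair[1], pair[0]),
    )


# Toy scores from three hypothetical QA methods.
qa_scores = {
    "method_1": {"model_a": 0.70, "model_b": 0.65, "model_c": 0.40},
    "method_2": {"model_a": 0.55, "model_b": 0.60, "model_c": 0.30},
    "method_3": {"model_a": 0.80, "model_b": 0.50, "model_c": 0.60},
}
ranking = consensus_ranking(qa_scores)
```

Here no single method agrees with the others on the full ordering, yet the averaged ranks still give a stable ordering, which is the kind of robustness a consensus over many QA methods is meant to provide.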