Entry for:The Bioinformatics Peer Prize
The main objective of RNA-seq experiments is to find genes that are differentially expressed (DE) between two or more biological conditions. However, most commonly used methods for RNA-seq DE analysis do not utilize prior knowledge of biological networks to detect DE genes. Incorporating prior information to the DE analysis can extend our understanding of gene functions, biological pathways and system-level properties of the organism. Using both simulated and real data sets, we propose a method which utilizes the prior knowledge of biological interaction networks offering biologists with a viable, more powerful alternative when detecting DE genes from RNA-seq data.
In order to drive complex biological processes, genes work in a coordinated fashion and often differentially express depending on the environmental factors. The gene DE states are not independent among genes as they have physical and functional relationships. We use Poisson-Gamma-Beta joint density to model the RNA-seq gene expression levels where Beta distribution captures the fold-change between two biological conditions. Moreover, we introduce a three-state Markov Random Field (MRF) model to model the dependency of the DE patterns on the gene network. We integrated known biological pathways and interactions data to our statistical model by forming the neighborhood structure, where we define genes to be first-order neighbors if they are recorded as having direct physical interaction within the interaction database. In general, any database that contains network/interaction information can be used to form the neighborhood structure.
More precisely, our proposed Poisson-Gamma-Beta MRF model assumes the probability of a gene being DE will increase if the gene is surrounded by a higher proportion of neighboring genes that are also DE. On the other hand, if the gene is surrounded by a higher proportion of equally expressed (EE) neighboring genes, then the probability of that gene being DE will also decrease.
Simulation studies demonstrate that our method outperforms the previously published two-state DE model for various sample sizes. The two-state DE formulation, does not distinguish between up- and down-regulated genes and all differentially expressed genes are placed into one category, regardless of their direction of change. Furthermore, for a comparable false discovery rate (FDR), our model has better sensitivity than the DESeq, EBSeq, edgeR and NOISeq R packages. In analyzing RNA-Seq colorectal cancer (GSE50760) and hepatocellular carcinoma (GSE77314) datasets, we identified several important Gene Ontology terms related to colorectal cancer and KEGG pathways related to hepatocellular carcinoma, which are not replicated by other commonly used methods such as the DESeq, EBSeq, edgeR and NOISeq. This highlights the need to treat up- and down-regulated as separate categories when performing DE analysis and the potential benefit of utilizing prior information in the form of known biological pathways and gene network for small RNA-seq studies. The proposed method is implemented in an R package pathDESeq which is available for download from Github.
In this paper, we have proposed a three-state model that uses known pathways and interactions to improve sensitivity, specificity and to reduce FDRs when detecting DE genes from RNA-seq data. Simulation studies and real data analysis demonstrate that our proposed method outperforms the previously published two-state DE model for various sample sizes and it has better sensitivity than the DESeq , EBSeq , edgeR and NOISeq. These findings clearly highlight the power of our proposed method relative to methods that do not utilize prior biological network information.
5. Future ideas/collaborators needed to further research?
In this paper, we have applied the proposed method only to two publicly available datasets with known biology, and we are looking forward to collaborating with researchers interested in integrating RNA-seq data with the interaction data to discover novel gene sets with biologically meaningful functional categories.
Furthermore, we are planning to improve the model performance by extending our gene network data from current first order to higher order of neighboring genes.
In future, we are planning to release our method as an R package in Bioconductor so that it reaches a wider scientific community.
6. Please share a link to your paper
Link to the paper describing this research:
Dona et al (2017): https://doi.org/10.1093/bioinformatics/btw833