Entry for: Bioinformatics Peer Prize III
RNA sequencing (RNA-seq) is used to measure gene expression levels across the transcriptome for a huge variety of samples. For example, RNA-seq has been applied to study gene expression in individuals with rare diseases, in hard-to-obtain tissues or for rare forms of cancer. Many analyses require quantifying expressed features, which is a computationally intensive processing step. Deposited data are provided as raw sequencing reads, which are costly for researchers in standard academic labs to analyze. We developed the recount2 project to standardize and share ready-to-analyze summaries of RNA-seq data.
Using the splice-aware Rail-RNA aligner, we processed over 70,000 publicly available human RNA-seq samples from the Sequence Read Archive (SRA), the Genotype-Tissue Expression (GTEx v6) and The Cancer Genome Atlas (TCGA) studies. Using the coverage bigWig files that Rail-RNA produces, we quantified genes and exons from the Gencode v25 database with the bwtool software. Additionally, we processed the exon-exon junction count data from Rail-RNA. Together, these provide ready-to-analyze count tables for three expression features (genes, exons and exon-exon junctions) in the form of RangedSummarizedExperiment objects and text tables. Coverage bigWigs can also be used to determine expressed regions using the derfinder method. The expressed regions and the exon-exon junction count data are two annotation-agnostic summaries that exemplify how the data in recount2 can be used for new analytical methods. The data are available via https://jhubiostatistics.shinyapps.io/recount/ and the recount Bioconductor package http://bioconductor.org/packages/recount.
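As a rough illustration of what quantifying genes and exons from coverage data involves, the toy Python sketch below sums per-base coverage over annotated exon intervals and adds the exon totals up per gene. It is not the bwtool or Rail-RNA implementation: the coverage vector, the gene names, the half-open coordinates and the helper functions are all hypothetical, and exons are assumed not to overlap.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # hypothetical half-open interval [start, end)


def sum_coverage(coverage: List[int], interval: Interval) -> int:
    """Sum per-base coverage over one half-open interval."""
    start, end = interval
    return sum(coverage[start:end])


def quantify(
    coverage: List[int], exons_by_gene: Dict[str, List[Interval]]
) -> Tuple[Dict[Tuple[str, Interval], int], Dict[str, int]]:
    """Return exon-level and gene-level coverage sums.

    Assumes the exons of a gene do not overlap; overlapping exons
    would be counted twice in the gene total.
    """
    exon_counts: Dict[Tuple[str, Interval], int] = {}
    gene_counts: Dict[str, int] = {}
    for gene, exons in exons_by_gene.items():
        total = 0
        for exon in exons:
            count = sum_coverage(coverage, exon)
            exon_counts[(gene, exon)] = count
            total += count
        gene_counts[gene] = total
    return exon_counts, gene_counts


# Toy per-base coverage for a 10 bp region (made-up numbers).
coverage = [0, 2, 2, 3, 0, 0, 1, 1, 4, 4]
# Toy annotation: geneA has one exon, geneB has two.
exons_by_gene = {"geneA": [(1, 4)], "geneB": [(6, 8), (8, 10)]}

exon_counts, gene_counts = quantify(coverage, exons_by_gene)
print(gene_counts)  # {'geneA': 7, 'geneB': 10}
```

In this sketch a gene's value is simply the sum of its exon-level sums, so the same coverage track yields both feature levels without realigning any reads.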
We compared the re-processed data in recount2 with the publicly available data from the GTEx project to demonstrate that our processing pipeline produced gene counts similar to other published methods. For protein-coding genes, the gene expression levels that we estimated using the recount2 pipeline had a median correlation of 0.987 with the GTEx v6 data. A differential expression analysis comparing colon and whole blood samples using the gene expression measurements from recount2 matched the results obtained using the v6 release from the GTEx portal (r² = 0.92 for the fold changes of protein-coding genes). We illustrated how recount2 can be used to investigate or validate cross-tissue differences using publicly available data by comparing colon and whole blood samples from healthy individuals across a total of 5 studies. We compared the gene-level differential expression results against results from applying the same models to data from GTEx and found that approximately 20% of the top 100 genes from the two analyses were concordant, which is higher than what is observed when comparing lung and colon GTEx data (5% concordance). We also provided an example of how data in recount2 can be used to evaluate expression differences across different feature levels: genes, exons, junctions and expressed regions.
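The concordance figures above compare the top-ranked genes from two differential expression analyses. The text does not specify how concordance was calculated, so the Python sketch below assumes one common definition: the fraction of the top n genes in one ranking that also appear in the top n of the other. The function name and the gene identifiers are hypothetical.

```python
from typing import Sequence


def top_n_concordance(ranked_a: Sequence[str], ranked_b: Sequence[str], n: int) -> float:
    """Fraction of genes shared between the top n of two ranked gene lists.

    Both lists are assumed to be sorted from most to least significant.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    top_a = set(ranked_a[:n])
    top_b = set(ranked_b[:n])
    return len(top_a & top_b) / n


# Toy rankings from two hypothetical analyses (made-up gene identifiers).
ranked_a = ["g1", "g2", "g3", "g4", "g5"]
ranked_b = ["g3", "g9", "g1", "g7", "g2"]

print(top_n_concordance(ranked_a, ranked_b, 3))  # 2 of 3 shared
print(top_n_concordance(ranked_a, ranked_b, 5))  # 3 of 5 shared
```

Under this definition, 20% concordance at n = 100 would mean that 20 genes appear in both top-100 lists.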
The recount2 project can be used for querying, downloading and analyzing large-scale human RNA-seq datasets across more than 70,000 samples, including all of GTEx, TCGA and the SRA. By removing a large number of data processing and quantification choices potentially made by researchers, recount2 reduces the number of “researcher degrees of freedom,” which can improve replication and reduce the potential for false positives created by processing pipeline differences. By providing an updateable resource of uniformly processed RNA-seq samples, together with R-based software for analysis, recount2 will enable studies that individual laboratories would otherwise not have the resources to undertake.
5. Future ideas/collaborators needed to advance research?
We are always looking for new collaborators. You can find recent projects related to recount2 at https://jhubiostatistics.shinyapps.io/recount/. If you are interested in doing a postdoc, please get in touch with any of the principal investigators of the project. There is more information at the end of the video available at https://lcolladotor.github.io/biopeerprize2018/.