Entry for: The Bioinformatics Peer Prize
Rapid advances in single-cell capture technology have driven increasing interest in single-cell-level studies, particularly in the field of transcriptomics. This is because single-cell RNA-Seq (scRNA-seq) provides a better understanding of the underlying transcriptional heterogeneity of individual cells, allowing cell subpopulations to be characterised more accurately.
A scRNA-seq experiment typically generates profiles for hundreds to thousands of cells. However, current tools are unable to handle this volume of data efficiently. To fully realise the potential of single-cell RNA-Seq, we need a scalable and efficient computational solution.
Falco is a cloud-based framework for parallelised processing of large-scale transcriptomic data, geared towards multi-sample analysis. Falco utilises the Big Data analysis frameworks Hadoop MapReduce and Apache Spark, industry-standard technologies that have been used in other fields to process terabytes and petabytes of data. Falco is currently designed for feature quantification of RNA-seq data using standard transcriptomics analysis tools such as STAR and HISAT2 for alignment, and HTSeq and featureCounts for quantification.
The basis of the Falco framework is a divide-and-conquer approach. We first divide the reads into multiple smaller files, or 'chunks', to increase the level of parallelism in the analysis and to normalise the performance of the tools. These chunks can then optionally be pre-processed using any pre-processing tools chosen by the user, to perform tasks such as removal of low-quality reads and adapter trimming. Finally, each read chunk is aligned and quantified using the tools selected by the user, and the resulting gene counts per chunk are merged to obtain the total read counts per sample.
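The final merge step above can be sketched as a simple summation of per-chunk count tables. This is a minimal illustrative sketch, not Falco's actual implementation; the gene names and count values are hypothetical stand-ins for per-chunk output from a quantification tool such as HTSeq or featureCounts.

```python
from collections import Counter

def merge_chunk_counts(chunk_counts):
    """Sum per-chunk gene counts into a single per-sample count table."""
    total = Counter()
    for counts in chunk_counts:
        total.update(counts)
    return dict(total)

# Hypothetical per-chunk gene counts for one sample split into three chunks
chunk_counts = [
    {"Gapdh": 120, "Actb": 85},
    {"Gapdh": 98, "Nanog": 12},
    {"Actb": 40, "Nanog": 7},
]

print(merge_chunk_counts(chunk_counts))
# {'Gapdh': 218, 'Actb': 125, 'Nanog': 19}
```

Because gene counts are additive over reads, splitting the input into chunks does not change the final per-sample totals, which is what makes this divide-and-conquer layout safe to parallelise.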
Falco was evaluated by comparing the performance of a popular RNA-seq pipeline (STAR and featureCounts) on two single-cell RNA-seq datasets (a mouse embryonic stem cell dataset with 869 samples totalling 1.02 TB of fastq.gz files, and a human brain dataset with 466 samples totalling 213.6 GB of fastq.gz files), with and without the Falco framework. Falco sped up the analysis of the mouse dataset from 18.5 hours without Falco (1 node, 16 processes) to just 2.8 hours (Falco, 40-node cluster). Similarly, analysis of the human brain dataset was reduced from 13.6 hours without Falco (1 node, 16 processes) to 1.1 hours (Falco, 40-node cluster).
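The speedup factors implied by the wall-clock times above can be computed directly; the run times are taken from the evaluation just described, and the dataset labels are shorthand.

```python
# Wall-clock times from the evaluation above, in hours:
# (1 node with 16 processes, Falco on a 40-node cluster)
runs = {
    "mouse ESC (1.02 TB)": (18.5, 2.8),
    "human brain (213.6 GB)": (13.6, 1.1),
}

for name, (single_node, falco) in runs.items():
    speedup = single_node / falco
    print(f"{name}: {speedup:.1f}x speedup")
# mouse ESC (1.02 TB): 6.6x speedup
# human brain (213.6 GB): 12.4x speedup
```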
Falco also allows for cost-effective analysis through the use of Amazon Web Services (AWS) spot instances - unused computing capacity offered at a reduced cost compared to the normal 'on-demand' price. Using spot instances for analysis provides savings of ~65% compared to using 'on-demand' instances.
Falco is a quick and flexible cloud-based feature quantification framework that can analyse large volumes of transcriptomic data. Falco simplifies the process of feature quantification and is customisable, allowing users to choose the analysis tools they prefer. Falco also enables low-cost analysis through the use of AWS spot instances.
Falco is available to download from https://github.com/VCCRI/Falco.
5. Future ideas/collaborators needed to further research?
We plan to add support for running on traditional infrastructure and on existing Hadoop clusters. Furthermore, we are in the process of implementing other types of transcriptomic analysis, such as alignment-only analysis, clustering of RNA-seq samples, and transcript reconstruction.
6. Please share a link to your paper
Andrian Yang is a PhD candidate in the Bioinformatics and Systems Biology Laboratory at the Victor Chang Cardiac Research Institute. His PhD research focus is on utilising cloud computing to create ...