Entry for: The Bioinformatics Peer Prize
Single-cell RNA sequencing (scRNA-seq) enables us to study heterogeneity among a population of individual cells, and to define cell types from a transcriptomic perspective. One prominent problem in scRNA-seq data analysis is the prevalence of dropouts, often caused by failures in amplification during the reverse-transcription step in the RNA-seq experiment. Dropouts manifest as an excess of zeros and near-zero counts in the data set, which has been shown to create difficulties in scRNA-seq data analysis. In particular, dropouts often increase the distance between two cells, and thus distort the clustering structure in the data.
Utilising the observation that a gene with low expression is more likely to be a dropout than a gene with high expression, we hypothesise that we can shrink this dropout-induced inflated distance by imputing the expression value of a dropout candidate with its expected value given any reasonable dropout probability distribution. This is the motivation behind our new method CIDR (Clustering through Imputation and Dimensionality Reduction). Using this intuition, CIDR computes a cell-to-cell dissimilarity matrix based on the 'implicitly imputed' expression values for all dropout candidates. CIDR then uses this dissimilarity matrix to perform dimensionality reduction (principal coordinates analysis) and hierarchical clustering.
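The intuition above can be illustrated with a minimal sketch. This is not CIDR's actual implementation (CIDR is an R package); it is a simplified Python illustration under stated assumptions. The function names, the fixed dropout-candidate threshold, and the logistic form of the dropout probability are all hypothetical choices made for the example. The sketch assumes log-transformed expression values and shows only the dissimilarity step; principal coordinates analysis and hierarchical clustering would then be applied to the resulting matrix.

```python
import math


def dropout_probability(expression, midpoint=1.0, steepness=2.0):
    """Hypothetical decreasing logistic curve: the lower the expression
    level, the more likely a gene at that level is a dropout."""
    return 1.0 / (1.0 + math.exp(steepness * (expression - midpoint)))


def imputed_dissimilarity(cell_a, cell_b, threshold=0.5):
    """Euclidean-style dissimilarity between two cells in which dropout
    candidates (values below `threshold`, an assumed cut-off) are
    replaced by an expected value before the difference is taken."""
    total = 0.0
    for a, b in zip(cell_a, cell_b):
        a_is_candidate = a < threshold
        b_is_candidate = b < threshold
        if a_is_candidate != b_is_candidate:
            # Exactly one cell has a dropout candidate for this gene.
            observed = b if a_is_candidate else a
            p = dropout_probability(observed)
            # Replacing the candidate value c by p * observed + (1 - p) * c
            # scales the gap (observed - c) by (1 - p), so the imputed
            # value never has to be stored explicitly.
            weight = 1.0 - p
        else:
            # Both expressed, or both candidates: no imputation.
            weight = 1.0
        total += (weight * (a - b)) ** 2
    return math.sqrt(total)


def dissimilarity_matrix(cells, threshold=0.5):
    """Symmetric cell-to-cell dissimilarity matrix as a list of lists."""
    n = len(cells)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = imputed_dissimilarity(cells[i], cells[j], threshold)
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Two cells that differ only by a zero in the first gene.
cells = [[1.5, 3.0, 0.0], [0.0, 3.0, 0.0]]
matrix = dissimilarity_matrix(cells)
```

In this toy example the dissimilarity between the two cells is smaller than their plain Euclidean distance, because the zero in the second cell is treated as a possible dropout of a lowly expressed gene. The amount of shrinkage depends entirely on the assumed dropout probability curve.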
CIDR is implemented as an open-source R package, and can be easily downloaded and installed from https://github.com/VCCRI/CIDR/. Using a range of simulated and real data, we show that CIDR improves on standard principal component analysis and outperforms state-of-the-art methods, namely t-SNE, ZIFA, and RaceID, in terms of clustering accuracy. CIDR typically completes within seconds when processing a data set of hundreds of cells, and within minutes for a data set of thousands of cells.
Dimensionality reduction and clustering are critical steps in scRNA-seq data analysis because accurately detecting clusters can greatly benefit subsequent analyses, including visualisation, normalisation, differential expression, and co-expression analysis. The improvement in clustering accuracy that CIDR offers over existing tools will therefore be of interest to both users and developers of scRNA-seq technology. CIDR's fast runtimes are particularly important given the rapid growth in the size of scRNA-seq data sets.
5. Future ideas/collaborators needed to further research?
CIDR can be readily used by anyone who analyses scRNA-seq data, so it can have an impact in almost all areas of biomedical and biotechnology research. In terms of our methodology, we believe 'implicit imputation' may have general application in other areas of data science, particularly those in which non-random missing data are prevalent. Our work may therefore inspire the development of other fast algorithms for big data analysis.
6. Please share a link to your paper
Dr Joshua Ho is the Head of Bioinformatics and Systems Medicine Laboratory at the Victor Chang Cardiac Research Institute. He is also an NHMRC Career Development Fellow and a Heart Foundation Futur...