CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data

1. Background

Single-cell RNA sequencing (scRNA-seq) enables us to study heterogeneity among a population of individual cells, and to define cell types from a transcriptomic perspective. One prominent problem in scRNA-seq data analysis is the prevalence of dropouts, often caused by failures in amplification during the reverse-transcription step of the RNA-seq experiment. Dropouts manifest as an excess of zero and near-zero counts in the data set, which has been shown to create difficulties in scRNA-seq data analysis. In particular, dropouts often increase the distance between two cells and thus distort the clustering structure in the data.

2. Method

Utilising the observation that a gene with low expression is more likely to be a dropout than a gene with high expression, we hypothesise that we can shrink this dropout-induced inflated distance by imputing the expression value of a dropout candidate with its expected value given any reasonable dropout probability distribution. This is the motivation behind our new method CIDR (Clustering through Imputation and Dimensionality Reduction). Using this intuition, CIDR computes a cell-to-cell dissimilarity matrix based on the 'implicitly imputed' expression values for all dropout candidates. CIDR then uses this dissimilarity matrix to perform dimensionality reduction (principal coordinates analysis) and hierarchical clustering.
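The dissimilarity step described above can be illustrated with a minimal sketch. This is not the CIDR package's implementation: the logistic dropout curve, its parameters, the candidate threshold, and all function names are assumptions made for illustration, and the expression values are assumed to be on a log scale. A dropout candidate is replaced by a weighted mean of its observed value and the other cell's value, with the weight given by the assumed dropout probability at the other cell's expression level. Dimensionality reduction and hierarchical clustering are not shown.

```python
import math


def dropout_probability(expression, midpoint=1.0, steepness=4.0):
    """Assumed dropout model: a decreasing logistic curve in expression level.

    The midpoint and steepness are illustrative values, not fitted to data.
    """
    return 1.0 / (1.0 + math.exp(steepness * (expression - midpoint)))


def imputed_dissimilarity(cell_a, cell_b, threshold=0.5):
    """Euclidean-style distance with dropout candidates implicitly imputed.

    A value at or below `threshold` (hypothetical cut-off) is treated as a
    dropout candidate when the other cell expresses the gene above it.
    """
    total = 0.0
    for a, b in zip(cell_a, cell_b):
        if a <= threshold < b:
            # a is the candidate: with probability p it is a dropout whose
            # value is taken from b, otherwise the observed value is kept.
            p = dropout_probability(b)
            a = (1.0 - p) * a + p * b
        elif b <= threshold < a:
            p = dropout_probability(a)
            b = (1.0 - p) * b + p * a
        total += (a - b) ** 2
    return math.sqrt(total)


def euclidean(cell_a, cell_b):
    """Plain Euclidean distance, for comparison with the imputed version."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cell_a, cell_b)))


def dissimilarity_matrix(cells):
    """Symmetric cell-to-cell matrix of imputed dissimilarities."""
    n = len(cells)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = imputed_dissimilarity(cells[i], cells[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    # Two made-up cells (log-scale expression for three genes).
    cell_1 = [0.0, 5.0, 0.0]
    cell_2 = [0.8, 5.0, 6.0]
    print("plain:", euclidean(cell_1, cell_2))
    print("imputed:", imputed_dissimilarity(cell_1, cell_2))
```

In this toy model the zero paired with a low value (0.8) is pulled most of the way towards that value, because a lowly expressed gene is a likely dropout. The zero paired with a high value (6.0) is left almost unchanged, because a highly expressed gene is unlikely to drop out. The resulting matrix could then be passed to any routine for principal coordinates analysis and hierarchical clustering.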

3. Results

CIDR is implemented as an open-source R package and can be downloaded and installed from https://github.com/VCCRI/CIDR/. Using a range of simulated and real data, we show that CIDR improves on standard principal component analysis and outperforms state-of-the-art methods, namely t-SNE, ZIFA, and RaceID, in terms of clustering accuracy. CIDR typically completes within seconds for a data set of hundreds of cells, and within minutes for a data set of thousands of cells.

4. Conclusions

Dimensionality reduction and clustering are critical steps in scRNA-seq data analysis because accurately detecting clusters can greatly benefit subsequent analyses, including visualisation, normalisation, differential expression, and co-expression analysis. The improvement in clustering accuracy and speed that CIDR offers over existing tools will therefore be of interest to both users and developers of scRNA-seq technology. CIDR's ultrafast runtimes are particularly important given the rapid growth in the size of scRNA-seq data sets.

5. Future ideas/collaborators needed to further research?

CIDR can be readily used by anyone who analyses scRNA-seq data, so it has the potential to make an impact in many areas of biomedical and biotechnology research. In terms of methodology, we believe 'implicit imputation' may have general application in other areas of data science, particularly those in which non-random missing data are prevalent. Our work may therefore inspire the development of other fast algorithms for big data analysis.

Dr Joshua Ho is the Head of Bioinformatics and Systems Medicine Laboratory at the Victor Chang Cardiac Research Institute. He is also an NHMRC Career Development Fellow and a Heart Foundation Futur...
