GEOracle: Mining perturbation experiments using free text metadata in Gene Expression Omnibus





1. Background

NCBI's Gene Expression Omnibus (GEO) contains >79,000 gene expression data sets (GSE). Those from perturbation experiments (e.g., gene knock-out, signalling or physical stimulation) are especially valuable because they allow identification of genes that are causally downstream of a perturbation agent. This has important applications in determining signalling pathway targets and gene regulatory networks.

There are likely tens of thousands of perturbation studies in GEO, containing millions of experimentally determined perturbation measurements. Nonetheless, there is no simple way to determine whether a GSE contains perturbation data, and it is not trivial to automatically match treatment samples with their respective control samples.

2. Method

A key insight is that a wealth of useful information regarding experimental design and sample description is stored as free text in GEO metadata. Although this free text is often readily interpretable by humans, there is no simple means to extract this information from GEO in an automated fashion. We reason that text mining and machine learning techniques can be used to identify GSE that contain perturbation data, and to identify and match the treatment and control samples within a perturbation data set. We posit that such an approach will allow us to extract a large amount of gene regulatory information that is already present in GEO.
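The matching step can be illustrated with a minimal Python sketch. It is not GEOracle's actual implementation, which is not described in this abstract. It assumes that sample titles carry the experimental design, that a small hand-picked keyword list (`CONTROL_TERMS`, hypothetical) marks control samples, and that a treatment group belongs with the control group whose title shares the most tokens with it (Jaccard similarity). All function names are illustrative.

```python
import re

# Hypothetical keyword list; a real tool would curate or learn these terms.
CONTROL_TERMS = {"control", "ctrl", "wt", "wildtype", "untreated",
                 "vehicle", "mock", "dmso"}


def tokenize(title):
    """Lower-case a sample title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def is_control(title):
    """Flag a sample as a control if any token is a known control term."""
    return any(tok in CONTROL_TERMS for tok in tokenize(title))


def strip_replicate(tokens):
    """Drop replicate markers such as 'rep1', 'replicate2' or a bare number."""
    return [t for t in tokens if not re.fullmatch(r"(rep|replicate)?\d*", t)]


def group_samples(titles):
    """Group replicates: {(is_control, token tuple): [titles]}."""
    groups = {}
    for title in titles:
        key = (is_control(title), tuple(strip_replicate(tokenize(title))))
        groups.setdefault(key, []).append(title)
    return groups


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match_controls(titles):
    """Pair each treatment group with its most similar control group."""
    groups = group_samples(titles)
    controls = [key for key in groups if key[0]]
    matches = []
    for key, members in groups.items():
        if key[0] or not controls:
            continue  # skip control groups; nothing to do without controls
        best = max(controls, key=lambda c: jaccard(c[1], key[1]))
        matches.append((members, groups[best]))
    return matches


titles = [
    "HeLa control rep1", "HeLa control rep2",
    "HeLa TNF rep1", "HeLa TNF rep2",
    "MCF7 control rep1", "MCF7 TNF rep1",
]
for treatment, control in match_controls(titles):
    print(treatment, "vs", control)
```

Keyword rules like these are brittle on real GEO titles, which is why a semi-automated workflow with user review is a sensible design choice.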

Using our R Shiny tool called GEOracle, we can quickly annotate many perturbation experiments from GEO in a semi-automated fashion with full user control. GEOracle then performs differential expression analysis to identify gene targets of the perturbation agent.
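To make the last step concrete, the following Python sketch computes a log2 fold change per gene between matched treatment and control samples. It is only a stand-in: the abstract does not specify GEOracle's differential expression method, and a real analysis would use a proper statistical test rather than a fold-change cutoff. The input layout, the `pseudocount`, the `threshold` and the function names are all assumptions made for illustration.

```python
import math
from statistics import mean


def log2_fold_changes(expression, treatment_ids, control_ids, pseudocount=1.0):
    """Return {gene: log2 ratio of mean treatment to mean control expression}.

    `expression` maps gene -> {sample_id: expression value}. The pseudocount
    (an assumption of this sketch) avoids division by zero.
    """
    result = {}
    for gene, values in expression.items():
        treated = mean(values[s] for s in treatment_ids)
        control = mean(values[s] for s in control_ids)
        result[gene] = math.log2((treated + pseudocount) / (control + pseudocount))
    return result


def candidate_targets(fold_changes, threshold=1.0):
    """Genes whose absolute log2 fold change meets the (arbitrary) threshold."""
    return sorted(g for g, lfc in fold_changes.items() if abs(lfc) >= threshold)


# Toy values, invented purely to exercise the functions.
expression = {
    "GENE_A": {"t1": 7, "t2": 7, "c1": 1, "c2": 1},
    "GENE_B": {"t1": 3, "t2": 3, "c1": 3, "c2": 3},
}
lfc = log2_fold_changes(expression, ["t1", "t2"], ["c1", "c2"])
print(candidate_targets(lfc))
```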

3. Results

GEOracle is freely available. To demonstrate GEOracle's application in biomedical research, we present two case studies that involve the discovery of conserved signalling pathway target genes and the reconstruction of an organ-specific gene regulatory network.

4. Conclusions

This work shows that free text metadata in GEO can be computationally mined to extract a large amount of perturbation data. This wealth of perturbation data can be used for discovering signalling pathway target genes and causal gene regulatory networks. While we believe it is important to push for better use of standard annotations in GEO metadata, GEOracle provides a powerful and practical tool to reuse the large amount of data that already exist in GEO.

5. Future ideas/collaborators needed to further research?

We expect that more advanced natural language processing techniques can extract additional information about each gene expression experiment from the metadata, such as cell type and experimental context. Such extracted information can then be mapped to standard ontologies (e.g., cell ontology).
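As a rough illustration of the mapping idea, the Python sketch below uses a plain dictionary lookup of labels and synonyms, which is far simpler than the natural language processing envisaged here. The vocabulary `TOY_ONTOLOGY` and its `EX:` identifiers are invented placeholders, not real Cell Ontology entries, and the function names are hypothetical.

```python
import re

# Invented placeholder vocabulary; the IDs are NOT real ontology identifiers.
TOY_ONTOLOGY = {
    "EX:0001": {"label": "cardiomyocyte",
                "synonyms": ["cardiac myocyte", "heart muscle cell"]},
    "EX:0002": {"label": "fibroblast",
                "synonyms": ["fibroblasts"]},
}


def build_index(ontology):
    """Map every lower-cased label and synonym to its term ID."""
    index = {}
    for term_id, entry in ontology.items():
        for name in [entry["label"], *entry["synonyms"]]:
            index[name.lower()] = term_id
    return index


def annotate(text, index):
    """Return the sorted IDs of terms whose name occurs as a whole phrase."""
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalized} "
    return sorted({tid for name, tid in index.items() if f" {name} " in padded})


index = build_index(TOY_ONTOLOGY)
print(annotate("Primary cardiac myocyte culture, co-cultured with fibroblasts", index))
```

Exact phrase matching misses spelling variants and plurals that are not listed as synonyms, which is one reason a more capable language model or curated synonym set would be needed in practice.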

Our long-term goal is to make the classification accuracy high enough that all GEO data sets can be processed fully automatically. This will allow us to build a rich resource of genetic and molecular perturbation data.

We always value collaboration with experts in natural language processing, machine learning, biological ontology, genetics and biomedical science.





Dr Joshua Ho is the Head of Bioinformatics and Systems Medicine Laboratory at the Victor Chang Cardiac Research Institute. He is also an NHMRC Career Development Fellow and a Heart Foundation Futur...

Round: Open Peer Voting
Category: Main Prize





