Entry for: The Bioinformatics Peer Prize II
The rapid growth in the number of metagenomic and genomic sequences has dramatically improved our understanding of life’s microbial diversity to an unprecedented level of detail. Yet inferring metabolic capabilities in large omic datasets remains biologically and computationally challenging. Here we propose a new Multigenomic Entropy Based Score (MEBS), which condenses the information derived from complex metabolic pathways into a single score. To test MEBS we focused on the biogeochemical sulfur cycle, because of the lack of studies aiming to integrate all of its microbiological and geochemical transformations and their corresponding metabolic pathways at a global scale.
MEBS is a software platform written in Bash, Perl and Python that has been tested under Linux environments. The first step of MEBS consists of the systematic manual acquisition and curation of the molecular and ecological information required to describe the metabolic machinery of interest, for example, sulfur metabolism. This information is represented by two input files: a list of microorganisms and a multi-FASTA file of proteins. MEBS then evaluates the presence/absence patterns of the input proteins in a genomic dataset (Gen) containing 2,107 non-redundant, completely sequenced genomes. Next, the expected versus observed pattern in the input organisms is obtained for each input protein using the mathematical framework of relative entropy (H’). The last step is the summation of the entropies of all input proteins present in the omic data to be evaluated (either genomes or metagenomes) to obtain the final Entropy Score. MEBS was thoroughly tested for its ability to capture the importance of the biogeochemical sulfur (S) cycle in 935 metagenomes and 2,107 genomes. The performance, reproducibility and robustness of MEBS were evaluated using several approaches, including a random sampling test, linear regression models and ROC curves.
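The entropy and summation steps described above can be illustrated with a minimal Python sketch. It is not the MEBS implementation: the text does not give the exact formula, so the sketch assumes that H’ is the Kullback-Leibler divergence (in bits) between the presence/absence frequency of a protein in the input organisms (observed) and in the whole genomic dataset (expected). The function names, the toy genome identifiers and the protein names are all hypothetical.

```python
import math


def relative_entropy(p_obs, q_exp, eps=1e-9):
    """Assumed form of H': KL divergence (bits) between two
    presence/absence distributions (observed vs expected)."""
    # Clamp the expected frequency so the logarithm is always defined.
    q_exp = min(max(q_exp, eps), 1.0 - eps)
    total = 0.0
    for p, q in ((p_obs, q_exp), (1.0 - p_obs, 1.0 - q_exp)):
        if p > 0.0:  # a term with p == 0 contributes nothing
            total += p * math.log2(p / q)
    return total


def protein_entropies(presence, all_genomes, input_organisms):
    """Compute H' for every input protein.

    presence: dict mapping protein name -> set of genome ids where it occurs.
    all_genomes: set of every genome id in the genomic dataset.
    input_organisms: subset of all_genomes (the curated organism list).
    """
    entropies = {}
    for protein, genomes in presence.items():
        p_obs = len(genomes & input_organisms) / len(input_organisms)
        q_exp = len(genomes & all_genomes) / len(all_genomes)
        entropies[protein] = relative_entropy(p_obs, q_exp)
    return entropies


def entropy_score(entropies, detected_proteins):
    """Sum the H' values of the input proteins detected in one sample."""
    return sum(h for protein, h in entropies.items()
               if protein in detected_proteins)


# Toy data (hypothetical): four genomes, two of them in the input list.
all_genomes = {"g1", "g2", "g3", "g4"}
input_organisms = {"g1", "g2"}
presence = {
    "protA": {"g1", "g2"},  # found only in the input organisms
    "protB": {"g1", "g3"},  # equally common inside and outside the list
}

entropies = protein_entropies(presence, all_genomes, input_organisms)
score = entropy_score(entropies, {"protA", "protB"})
```

In this toy case a protein restricted to the input organisms receives a high H’, while a protein that is equally frequent inside and outside the list receives an H’ of zero, so only informative proteins raise the final score of a sample.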
We present MEBS, a new open source software platform that quantitatively evaluates, compares and infers the metabolic machinery of interest in large ‘omic’ datasets, including complex metabolic pathways such as entire biogeochemical cycles. MEBS is free, open source and available at https://github.com/eead-csic-compbio/metagenome_Pfam_score. The curation effort reported here represents the first comprehensive inventory of the genes, enzymes, pathways, compounds and organisms involved in the sulfur cycle. The input protein domains enriched among sulfur-based microorganisms were obtained with the relative entropy (H’) mathematical framework. Clustering of the 112 H’ values of the input sulfur proteins, obtained in a large collection of non-redundant genomes, highlights the possibility of using 12 sulfur-informative domains as sulfur cycle marker genes in metagenomic data. Finally, the summation of the 112 H’ values in a given genome or metagenome dataset yields the final MEBS score (Sulfur Score, SS). The SS values in the genomic and metagenomic data collections strongly highlight the broad applicability of our proposed algorithm for accurately detecting the sulfur cycle metabolic machinery at large omic scale in a fast and simple manner.
Our sulfur cycle benchmark using the MEBS software platform indicates that using a single informative score to summarize the metabolic machinery of interest holds the potential to dramatically change how metabolic capabilities are inferred in the present omic era. We have demonstrated that MEBS accurately detects and classifies genomes and metagenomes known to be closely involved in the sulfur cycle, suggesting several applications, such as the prediction of metabolic capabilities in uncultivated or unexplored taxa and the generation of a measurable score for evaluating any given metabolic pathway or cycle at large metagenomic scale.
5. Future ideas/collaborators needed to further research?
In this study we focused on evaluating the sulfur cycle, but we are currently preparing the manuscript for the carbon, nitrogen, oxygen, phosphorus and iron cycles. Furthermore, we are working on improving the MEBS algorithm so that it requires only a list of microorganisms of interest, avoiding the exhaustive manual curation of the proteins involved in the metabolic pathway of interest. We look forward to collaborating with and helping other researchers interested in integrating this software platform into large-scale analyses (e.g. climate change and bioremediation studies).
6. A link to your paper
PhD student at Ecology Institute UNAM. Using 'omic' approaches to understand the individual reactions that make life possible on Earth, focusing on how the genes involved in biogeochemical cycle...
Round: Open Peer Voting
Category: Student Prize