Entry for: Bioinformatics Peer Prize III
High-throughput biology has led to a wealth of ‘omics data (e.g., transcriptomics, proteomics, metabolomics) generated from various technological platforms and sources. Extracting relevant information in a holistic manner from these large-scale datasets can lead to important biological discoveries. However, computational and statistical integrative methods are currently limited by the sheer complexity of these datasets. Current methods that seek molecular signatures are limited to the analysis of one type of ‘omics data. Univariate methods consider each molecule individually in the statistical model, and may miss crucial information by disregarding interactions between features.
We developed a suite of multivariate methods that model molecular features holistically and statistically integrate diverse types of data to offer an insightful picture of a biological system.
Our methods are computationally efficient enough to handle large biological datasets, where the number of biological features (usually thousands) is much larger than the number of samples (usually fewer than 50). They achieve dimension reduction to provide intuitive graphical visualisation and to identify key molecular features, of diverse types, that are tightly correlated. The methods are implemented in our R package mixOmics, which currently includes nineteen methodologies, thirteen of which are novel and were developed by our team of computational statisticians. This article aims to disseminate the package to a wide audience, and also to introduce our two latest frameworks for data integration: N-integration with DIABLO combines different ‘omics datasets measured on the same N samples or individuals; P-integration with MINT combines studies measured on the same P features (e.g., genes) but from independent cohorts of individuals. Both frameworks extend Projection to Latent Structures models for discriminant and integrative analysis, and for the identification of relevant and robust molecular signatures. The application of our methods is straightforward and is detailed on our website www.mixOmics.org.
Feature selection embedded in each of our methods enables us to suggest novel biological hypotheses, and the powerful visualisations allow for increased understanding of a biological system. In addition, their relaxed assumptions about data distribution make them highly flexible to answer topical questions across numerous biology-related fields, ranging from biomedical and molecular biology to ecological research.
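To give a flavour of what embedded feature selection can look like inside a Projection to Latent Structures component, here is a minimal Python sketch. It is not the mixOmics implementation (the package is written in R). It assumes a lasso-style soft-thresholding penalty on the loading vector, a single numeric outcome, and a toy dataset invented for illustration. The names soft_threshold, sparse_loading and lam are hypothetical.

```python
import math

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    return math.copysign(max(abs(value) - lam, 0.0), value)

def sparse_loading(X, y, lam):
    """Sketch of one sparse loading vector (hypothetical helper, not mixOmics code).

    X is a list of samples, each a list of feature values; y is one outcome
    value per sample. The covariance of each centred feature with the centred
    outcome is soft-thresholded, then the vector is rescaled to unit length.
    """
    n_samples = len(X)
    n_features = len(X[0])
    y_mean = sum(y) / n_samples
    y_centred = [v - y_mean for v in y]
    col_means = [sum(row[j] for row in X) / n_samples for j in range(n_features)]
    cov = [
        sum((row[j] - col_means[j]) * yc for row, yc in zip(X, y_centred))
        for j in range(n_features)
    ]
    shrunk = [soft_threshold(c, lam) for c in cov]
    norm = math.sqrt(sum(s * s for s in shrunk))
    return [s / norm for s in shrunk] if norm > 0 else shrunk

# Toy data (assumed for illustration): 4 samples, 5 features.
# Feature 0 rises with the outcome, feature 3 falls, the rest are near-constant.
X = [
    [1.0, 0.2, 0.5, 3.0, 0.1],
    [1.2, 0.1, 0.4, 2.8, 0.3],
    [3.0, 0.3, 0.5, 1.0, 0.2],
    [3.2, 0.2, 0.6, 0.9, 0.1],
]
y = [0, 0, 1, 1]

loading = sparse_loading(X, y, lam=0.5)
# Features with a non-zero loading form the selected "signature".
selected = [j for j, w in enumerate(loading) if w != 0]
# Latent component: one score per sample, used for low-dimensional plots.
scores = [sum(w * x for w, x in zip(loading, row)) for row in X]
print(selected)
```

In this sketch a larger lam sets more loadings to exactly zero, so fewer features are kept. The scores give each sample a coordinate on a single latent component, which is the kind of quantity a sample plot would display.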
In this article we provide examples of analyses integrating transcriptomics, proteomics, methylation and miRNA data, and extensions to microbiome data. DIABLO is a powerful integrative framework to extract highly connected multi-‘omics signatures that explain an outcome of interest, whilst MINT offers the promise of increasing sample size by combining related but independent studies, enabling benchmarking across studies generated from different laboratories and highlighting robust signatures. Both frameworks provide unsupervised (exploratory) and supervised (discriminant) analyses. As a software and methods development team, we are committed to early access to, and open-source release of, our methods, which are tested, quickly adopted and validated by many of our users around the world. The package was downloaded by 20K users in 2016 and 29K in 2017 (R CRAN unique IP downloads), and benefits our community of computational biologists and bioinformaticians who wish to make sense of ‘omics data.
mixOmics’ methods are effective for various types of high-throughput data and have been applied to a wide range of biological problems where integration from a data-driven perspective is required to make sense of ‘omics data generated every day. The uniquely embedded feature selection process in the methods is particularly appealing to identify molecular signatures across multiple data sets and to refine biological hypotheses for further experimental validations. mixOmics is a well-designed, user-friendly package with attractive visualisations. It represents a significant contribution to the field of computational biology which has a strong need for such toolkits to mine and integrate datasets.
5. Future ideas/collaborators needed to advance research?
Our mixOmics team is a fervent advocate of open source and open science. Methods are available in pilot release to our users for testing and we encourage feedback from our community. Since 2014 we have taught 15 multiple-day workshops across Europe and Australia to more than 350 participants, with the aim of providing computational biologists with useful tools to make sense of their data. mixOmics is a highly collaborative project between computational biologists, statisticians, software developers and bioinformaticians, with key members from Australia, France and Canada. Our developments are guided by cutting-edge biological problems and questions from our collaborators. We welcome, and actively seek out, scientific and industry collaborations, as these will help us make considerable progress towards our ultimate goal: gaining a comprehensive understanding of molecular interactions in biology.
6. A link to your paper
We are an enthusiastic team of computational statisticians developing novel methods in open access software and tools for data mining, data exploration and data integration. Our innovative methods ...