This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Sep 18:2023.07.03.547607. Originally published 2023 Jul 4. [Version 2] doi: 10.1101/2023.07.03.547607

Multi-view integration of microbiome data for identifying disease-associated modules

Efrat Muller, Itamar Shiryan, Elhanan Borenstein
PMCID: PMC10349976  PMID: 37461534

Abstract

Machine learning (ML) has become a widespread strategy for studying complex microbiome signatures associated with disease. To this end, metagenomics data are often processed into a single “view” of the microbiome, such as its taxonomic (species) or functional (gene) composition, which in turn serves as input to such ML models. When further omics are available, such as metabolomics, these can be analyzed as additional complementary views. Following training and evaluation, the resulting model can be explored to identify informative features, generating hypotheses regarding underlying mechanisms. Importantly, however, using a single view generally offers relatively limited hypotheses, failing to capture simultaneous shifts or dependencies across multiple microbiome layers that likely play a role in microbiome-host interactions.

In this work, inspired by the broad domain of multi-view learning, we aimed to investigate the impact of various integration approaches on the ability to predict disease state based on multiple microbiome views, and primarily to generate multifaceted biological hypotheses linking features across views. We first implemented an “early integration” ML pipeline to serve as a baseline for disease predictability and informative feature detection, and compared the resulting multi-view models to naïve single-view models. Using a dataset collection of 25 case-control metagenomic studies, we found that multi-view models typically offer prediction accuracy that is comparable to that of the best-performing single-view model, yet further provide a mixed set of informative features from different views while accounting for between-view dependencies. Such early-integration multi-view models, however, are still limited in their ability to provide well-defined disease-associated multi-view modules that can be used to generate mechanistic hypotheses concerning the disease in question. To address this challenge, we developed a novel “intermediate integration” framework termed MintTea, based on sparse generalized canonical correlation analysis (CCA), to identify multi-view modules of features, highlighting shared disease-associated trends in the data expressed by the different views. We showed that this framework identified multiple modules that are both highly predictive of the disease and exhibit strong within-module associations across features from different views. We further demonstrated that MintTea has significantly lower false discovery rates compared to other CCA-based approaches. We accordingly advocate for using multi-view models to capture multifaceted microbiome signatures that likely better reflect the complex mechanisms underlying microbiome-disease associations.
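As a rough illustration of what “early integration” usually means in multi-view learning, the sketch below concatenates per-view feature tables into a single sample-by-feature matrix before any model is trained. It is not the authors' pipeline: the function name `early_integration`, the view names, the toy values, and the choice to fill missing features with 0.0 are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each view maps sample ID -> {feature name: value}.
View = Dict[str, Dict[str, float]]


def early_integration(
    views: Dict[str, View],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Concatenate per-view feature tables into one sample-by-feature matrix."""
    # Keep only samples that are present in every view, in sorted order.
    samples = sorted(set.intersection(*(set(v) for v in views.values())))

    # Collect (view, feature) columns, one view after another.
    columns = []
    for view_name, table in views.items():
        feats = sorted({f for s in samples for f in table[s]})
        columns.extend((view_name, f) for f in feats)

    # Prefix each feature with its view so it can be traced back later.
    names = [f"{v}:{f}" for v, f in columns]

    # Assumption: a feature missing for a sample is treated as 0.0.
    matrix = [[views[v][s].get(f, 0.0) for v, f in columns] for s in samples]
    return samples, names, matrix


# Toy data (invented values, for illustration only).
views = {
    "taxonomy": {
        "s1": {"E_coli": 0.2, "B_fragilis": 0.1},
        "s2": {"E_coli": 0.05},
    },
    "pathways": {
        "s1": {"butyrate_synthesis": 1.5},
        "s2": {"butyrate_synthesis": 0.3},
        "s3": {"butyrate_synthesis": 0.9},  # dropped: absent from "taxonomy"
    },
}

samples, names, matrix = early_integration(views)
```

The combined matrix could then be passed to any standard classifier. Because each column name keeps its view prefix, informative features reported by such a model can be attributed to their originating view, but nothing in this step groups them into cross-view modules, which is the gap an intermediate-integration approach is meant to address.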

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
