
This is a preprint.

It has not yet been peer reviewed by a journal.


bioRxiv [Preprint]. 2026 Jan 26:2026.01.24.701521. [Version 1] doi: 10.64898/2026.01.24.701521

Community Curation of Microbial Metabolites Enables Biological Insights of Metabolomics Data

Helena Mannochio-Russo 1,†,*, Wilhan D Gonçalves Nunes 1,, Shipei Xing 1, Fernanda de Oliveira 1,2, Andrés Mauricio Caraballo-Rodríguez 1, Paulo Wender Portal Gomes 1,3, Vincent Charron-Lamoureux 1, Julius Agongo 1, Nicole E Avalon 4,5,6, Tammy Bui 1, Lucia Cancelada 7,8, Marc G Chevrette 9,10, Andrés Cumsille 9,10, Moysés B de Araújo Júnior 11,12, Marilyn De Graeve 1,13, Victoria Deleray 1, Mohamed S Donia 14, Mutsawashe B Dzveta 15, Yasin El Abiead 1, Ronald J Ellis 16,17, Donald Franklin Jr 18,17, Neha Garg 19,20, Harsha Gouda 1, Claude Y Hamany Djande 15, Anastasia Hiskia 21, Benjamin N Ho 1, Chambers C Hughes 22,23,24, Sunghoon Hwang 14, Sofia Iliakopoulou 25, Jennifer E Iudicello 18,17, Alan K Jarmusch 1,26, Triantafyllos Kaloudis 25,21, Irina Koester 7,27, Robert Konkel 28, Hector H F Koolen 11, Kine Eide Kvitne 1, Sabina Leanti La Rosa 29, Anny Lam 1, Santosh Lamichhane 30, Motseoa Lephatsi 15, Scott Letendre 17,31, Sarolt Magyari 1,32, Hanna Mazur-Marzec 28, Daniel McDonald 33, Ipsita Mohanty 1, Mónica Monge-Loría 19, David J Moore 18,17, Thiago André Moura Veiga 34, Musiwalo S Mulaudzi 15, Lerato Nephali 35, Griffith Nguyen 1, Martin Orságh 36,37, Abubaker Patan 1, Tomáš Pluskal 36, Phillip B Pope 29,38, Lívia Soman de Medeiros 34, Paolo Stincone 39, Andrej Tekel 36,37, Sydney Thomas 1, Ralph R Torres 7, Shirley M Tsunoda 1, Fidele Tugizimana 15, Martijn van Faassen 40, Felipe Vasquez-Castro 1, Giovanni A Vitale 39, Berenike C Wagner 39, Crystal X Wang 18,17, Sevasti-Kiriaki Zervou 21, Haoqi Nina Zhao 1, Simone Zuffa 1, Daniel Petras 39,41, Laura-Isobel McCall 42, Rob Knight 33,43,44,45,46,47, Mingxun Wang 48, Pieter C Dorrestein 1,49,50,51,*
PMCID: PMC12874021  PMID: 41659575

Abstract

Microbial metabolites play a critical role in regulating ecosystems, including the human body and its microbiota. However, understanding the physiologically relevant role of these molecules, especially through liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based untargeted metabolomics, poses significant challenges and often requires manual parsing of a large amount of literature, databases, and webpages. To address this gap, we established the Collaborative Microbial Metabolite Center knowledgebase (CMMC-KB), a platform that fosters collaborative efforts within the scientific community to curate knowledge about microbial metabolites. The CMMC-KB aims to collect comprehensive information about molecules arising from microbial biosynthesis and from microbial metabolism of drugs, exposure-related molecules, food, and host-derived molecules, together with their known activities whenever available. Molecules from other sources, including host-produced, dietary, and pharmaceutical compounds, are also included. By enabling direct integration of this knowledgebase with downstream analytical tools, including molecular networking, we can deepen insights into microbiota and their metabolites, ultimately advancing our understanding of microbial ecosystems.

Introduction:

Of the thousands of molecules detectable by liquid chromatography-tandem mass spectrometry (LC-MS/MS) in typical biospecimens, the host-associated microbiome modifies 15–70% of them depending on the specific organ or biofluid analyzed1. In a typical untargeted metabolomics profile from humans, only about 10% of the acquired spectra can be annotated, and among these, an even smaller portion can be directly traced to microbial origins. Humans have three major sources of microbial metabolites: 1) microbial metabolism of host-derived metabolites2; 2) microbial metabolism of molecules from food and beverages3; and 3) microbial metabolites assembled de novo using proteins encoded by genetic elements often arranged as gene clusters (in bacteria, archaea, fungi, and, as recently discovered, widespread in phages)4. Additionally, microbial metabolites found in humans originate from microbial processing of xenobiotics other than food, such as plasticizers, pollutants, medications, and environmental molecules absorbed through the skin or inhaled5–7.

Despite the critical importance of microbiome-derived metabolites to human health – including those involved in the microbe-gut-brain8 and microbe-diet-host axes9 – and other ecosystems, there is no centralized knowledgebase where the scientific community can deposit, curate, access, and reuse that knowledge. Existing resources have assessed how the microbiome influences the consumption and production of about 900 largely primary microbial metabolites10, and have compiled literature-curated information about 3,269 microbiome-derived metabolites11, but most of these metabolites are not unique to microorganisms or have been curated from metabolic models, which tend to capture mainly primary metabolism rather than specialized metabolites, which can be more biologically relevant for host-microbiome interactions12,13. In addition, some targeted commercial metabolite platforms claim to capture up to ~140 microbial molecules, but many of those could also come from diet or the host, highlighting the challenge in the field of accurately understanding microbiome-derived metabolites14. microbeMASST, our recent tool that enables searching a fragmentation spectrum against a reference microbial metabolomics database, allows direct connection between bacteria and fungi and the microbially-derived molecules they produce15. However, microbial metabolites that have been recently discovered (or are yet to be discovered), the organisms and the genes responsible for their production, and/or their related activities are not yet systematically cataloged. Therefore, reusing this information is a bottleneck for the community that aims to mechanistically understand the microbiome.

To complement existing microbial metabolite resources, as well as to enable annotation of structurally uncharacterized metabolites (captured as MS/MS spectra), we have created the Collaborative Microbial Metabolite Center knowledgebase (CMMC-KB). Leveraging the Global Natural Product Social Molecular Networking (GNPS)16 mass spectrometry analysis ecosystem, the CMMC-KB enables collaboration within the scientific community to curate knowledge on microbial metabolites or metabolites that might influence microbial metabolites (drugs, food, etc). The goal of this initiative is to facilitate biological interpretations of microbiome-derived molecules. With downstream molecular networking integration, the CMMC-KB allows users to visualize MS/MS spectra of compounds classified as microbial metabolites within molecular networks (grouped by MS/MS spectral similarity), even if their structures remain unknown. Furthermore, it provides information on microbial producers, the sources of the molecules, associated genes or sequences, and their biological activities, if known. For a broader investigation of the metabolome, molecules from other sources, such as endogenous molecules, compounds ingested through diet, and drugs, among others, are also included as part of this resource. The CMMC-KB is a user-accessible, collaboratively curated, and continuously evolving microbiome resource. Further, to encourage data deposition, we offer web-based analysis tools, including accessible web applications, that benefit both the data contributors and the broader community. In alignment with the FAIR data principles, we are committed to building this central knowledge hub in collaboration with the scientific community.

Results and discussion:

The CMMC-KB (https://cmmc-kb.gnps2.org/) is a knowledgebase derived from contributions by the scientific community and comprises spectral (MS/MS data) and structural (chemical structures) information about microbially-derived compounds, as well as dietary, host-derived, and other exposure-related compounds. Contributions to the CMMC-KB are facilitated through a dedicated workflow in GNPS2 (a second major implementation of the GNPS ecosystem), enabling users to upload information organized into eight main sections: 1) MS/MS data selection, 2) metabolite identification, 3) taxonomy/phylogeny selection, 4) biosynthesis, 5) activity, 6) references, 7) funding information, and 8) additional comments (Figure 1a). The community can contribute to this resource by uploading knowledge for a single molecule at a time or in batches of molecules. A comprehensive documentation page is available to guide users on the recommended inputs (https://cmmc.gnps2.org/deposition_documentation/). While MS/MS spectra are recommended, they are not required, and users may deposit the molecular structure alone. Additionally, the molecules deposited can be classified as confirmed (e.g., observed experimentally in microbial cultures17, observed in colonized but not in germ-free mice, etc.) or predicted to be microbial (e.g., MS/MS of synthetic compounds with matches against other microbial resources like microbeMASST15). As of December 2025, the knowledgebase comprises 80,201 MS/MS spectra from 4,998 compounds that were linked to 2,722 microorganisms. These numbers reflect the collective efforts of more than 30 researchers who have contributed to the development of this resource to date17–21. Among the compounds deposited, their molecular sources were mainly classified as microbial, drug, or diet-related (Figure 1e). The majority of compounds had a known molecular origin, such as drugs, de novo microbial biosynthesis, or diet, with 25.9% classified as unknown/undefined (Figure 1f).
Since diet, drugs, and host-derived molecules can act as confounders, and because they often influence microbial metabolite production, the CMMC-KB includes and annotates these non-microbial compounds within a single, comprehensive resource.

Figure 1. Overview of the CMMC-KB capabilities and depositions.


(a) Inputs accepted for community depositions and current numbers as of December 2025. (b) CMMC enrichment workflow in GNPS2, which annotates molecular networks (generated from Classical or Feature-Based Molecular Networking25,26) by matching experimental spectra to the CMMC-KB and retrieving associated metadata. (c) Download options as MGF and TSV files, enabling reuse in third-party software and in-house workflows. (d) The CMMC-Dashboard is a web application that enables users to utilize outputs from the FBMN and CMMC enrichment workflows, along with uploaded metadata, to generate visualizations for exploring matches to the CMMC-KB (e.g., boxplots for statistical evaluation, structure cards, UpSet-style overviews, and microbeMASST integration). Distribution of the deposited compounds (December 2025) by (e) molecule source and (f) molecule origin. Icons were obtained from Bioicons.com.

To facilitate the use of information deposited in the CMMC-KB, there are three ways to access and leverage the knowledgebase. First, data are available for direct download in TSV/CSV and MGF formats from the website, allowing integration into customized in-house or third-party workflows. Second, we developed a workflow within the GNPS2 ecosystem that enables downstream enrichment of molecular networks (CMMC enrichment) with information from the CMMC-KB. Finally, we created an interactive web application22, CMMC-Dashboard (https://cmmc-dashboard.gnps2.org/), which allows users to visually explore and interpret the data in an accessible and user-friendly manner.
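As an illustration of how the MGF download could be reused in an in-house workflow, the following minimal Python sketch parses MGF-formatted text into a list of spectra. It assumes only the generic MGF layout (BEGIN IONS / END IONS blocks, KEY=VALUE header lines, and whitespace-separated m/z and intensity pairs). The example record and its field values are hypothetical and do not represent the actual CMMC-KB export schema.

```python
from typing import Any, Dict, List


def parse_mgf(text: str) -> List[Dict[str, Any]]:
    """Parse MGF-formatted text into a list of {'params', 'peaks'} records."""
    spectra: List[Dict[str, Any]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header line: split on the first '=' only, since values
                # (for example SMILES strings) may themselves contain '='.
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity.
                parts = line.split()
                if len(parts) >= 2:
                    current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


# Hypothetical record used only to exercise the parser.
EXAMPLE_MGF = """BEGIN IONS
PEPMASS=393.2999
CHARGE=1+
NAME=example compound
85.03 120.0
355.26 1000.0
END IONS
"""

spectra = parse_mgf(EXAMPLE_MGF)
print(len(spectra), spectra[0]["params"]["NAME"], spectra[0]["peaks"])
```

The parsed records can then be joined to the accompanying TSV/CSV metadata on whichever identifier column the downloaded files share.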

Many compounds deposited as microbial metabolites may also come from other sources. For example, some amino acids and fatty acids can be synthesized by microorganisms, ingested through diet, and also produced by host cells. To address this issue, we refined source annotations in the CMMC-KB by reanalyzing four publicly available datasets that contained tissues or biofluids of germ-free (GF) and colonized mice (MSV000079949 (ref. 1), MSV000088040 (ref. 23), MSV000097485, and MSV000090974 (ref. 24)), also considering metabolomics data from the mouse diet (chow), when available. We ran feature-based molecular networking (FBMN)25 followed by CMMC enrichment in GNPS2. Entries initially labelled as “microbial” were selected in the CMMC-Dashboard, and boxplots were plotted for GF vs. colonized mice (and also vs. diet, if available). We defined a metabolite as “microbial-only” when it was absent in the GF group but present in colonized mice (Supplementary Figure S2a-c), and added the labels “diet” and/or “host” when the metabolite was detected in GF mice and/or chow. This classification may include both microbially-produced metabolites and microbe-induced host metabolites, which cannot be distinguished without additional experimental validation. This targeted curation expanded the information available in the CMMC-KB by providing additional classifications for 88 metabolites (1.76% of the compounds deposited).
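The labelling rule described above can be sketched in a few lines of Python. This is an illustrative reading, not the curation code used for the knowledgebase: the function name is hypothetical, detection is reduced to a boolean per group (the curation itself relied on inspecting boxplots), and the mapping of "detected in GF" to the "host" label and "detected in chow" to the "diet" label is an assumption about how the two labels were assigned.

```python
from typing import Optional, Set


def refine_source_labels(
    detected_colonized: bool,
    detected_germ_free: bool,
    detected_chow: Optional[bool] = None,
) -> Set[str]:
    """Return refined source labels for an entry initially labelled 'microbial'.

    detected_chow is None when no diet data exist for the dataset.
    """
    labels: Set[str] = set()
    # Definition from the text: absent in GF but present in colonized mice.
    if detected_colonized and not detected_germ_free:
        labels.add("microbial-only")
    # Assumption: a signal in GF animals points to a non-microbial (host) origin.
    if detected_germ_free:
        labels.add("host")
    # Assumption: a signal in chow points to a dietary origin.
    if detected_chow:
        labels.add("diet")
    return labels


print(refine_source_labels(True, False))        # absent in GF, present in colonized
print(refine_source_labels(True, True, True))   # also seen in GF mice and in chow
```

A feature absent from GF animals still cannot be assigned to microbial biosynthesis on this basis alone, since microbe-induced host metabolites would receive the same label.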

To illustrate how the CMMC-KB can benefit researchers, we used this resource to investigate microbial metabolites in a subset of the American Gut Project (n = 1,993 files), a citizen-science cohort with participation open to the general population (primarily from the United States, the United Kingdom, and Australia)27. In this example, FBMN was performed, followed by CMMC enrichment to annotate features based on spectral matches to the knowledgebase. The source distribution of matched metabolites revealed a diverse chemical landscape, with compounds classified across multiple categories, including microbial, host-derived, and xenobiotic sources (Figure 2a). By overlaying this information onto the molecular network, one can rapidly visualize regions enriched in specific source categories (Figure 2b). This network-based visualization facilitates hypothesis generation by revealing which networks of structurally related compounds share common sources. Zooming into specific network regions (Figure 2c-e) demonstrates the utility of the tool for detailed exploration of individual molecular families, where users can have an integrated view of the source annotations, structural relationships, and associated metadata for compounds of interest. With such an overview, users can target the investigation of specific classes of compounds with important biological functions. For instance, microbially-derived bile acids play crucial roles in immune regulation,28 and have been implicated in conditions ranging from inflammatory bowel disease to metabolic disorders and neurocognitive function29,30. Similarly, N-acyl lipids serve as signaling molecules involved in immune homeostasis, energy metabolism, and gut-brain axis communication31,32. 
The ability to identify and annotate these metabolite families (along with their potential microbial, dietary, or host origins) enables researchers to formulate targeted hypotheses about microbiome-host interactions and prioritize investigations into specific microbial producers, dietary influences, or disease associations. This analysis exemplifies how the CMMC-KB, combined with molecular networking, provides an efficient workflow to survey complex metabolomic datasets and identify features warranting further mechanistic investigation. Importantly, while the biological roles of bile acids and N-acyl lipids in gut-microbiome interactions were previously established, the CMMC-KB workflow enabled their rapid annotation and source classification in the American Gut Project cohort – a process that would have required extensive manual literature curation. This cross-cohort validation demonstrates that known metabolite-microbiome relationships can be efficiently detected across diverse population studies using this framework.

Figure 2. Application of CMMC-KB enrichment to fecal metabolomics data from the American Gut Project.


(a) Source distribution of metabolites matched to the CMMC-KB from a subset of the American Gut Project (n = 1,993 samples)27. The UpSet plot was generated using the CMMC-Dashboard. (b) Molecular network visualization with nodes colored by metabolite source annotation from the CMMC-KB. Each node represents a unique mass spectral feature, and edges connect features with similar MS/MS spectra (cosine similarity threshold set to 0.5). (c-e) Zoomed-in views of selected molecular networks with distinct source annotations. These subnetworks illustrate the tool’s capability to rapidly identify and visualize structurally related compounds sharing common sources within complex metabolomic datasets. The colors of the nodes in b-e match the UpSet plot colors in a.

Beyond this specific use case, the CMMC-KB has been applied to diverse biological contexts that demonstrate its versatility in addressing complex research questions, ranging from the human microbiome to natural products and environmental fields (Supplementary Material). In clinical settings, this resource enabled mapping drug metabolism across multiple biofluids in people with HIV, revealing that while antiretrovirals like ritonavir undergo extensive microbial transformation in the gut, these derivatives remain largely absent from plasma and cerebrospinal fluid (Supplementary Figure S1). Comparisons between germ-free and colonized mice facilitated the annotation and refinement of microbial metabolites, including bile acid conjugates and N-acyl lipids, illustrating the dynamic, community-driven nature of the knowledgebase as new data emerge (Supplementary Figure S2). Environmental applications include the detection of bioactive cyanobacterial metabolites in Lake Marathon water samples, providing actionable information for water safety management (Supplementary Figure S3). In disease contexts, the tool identified a microbiome-derived bile acid conjugate altered by Leishmania infection in hamster tissues, linking microbial metabolism to parasite-induced disturbances (Supplementary Figure S4). Finally, in coral holobiont research, the CMMC-KB successfully disentangled bacterial versus zooxanthellae metabolic contributions in synthetic communities, revealing siderophore-mediated interactions that would have been difficult to assign using traditional approaches alone (Supplementary Figure S5).

When using the CMMC-KB, users should be aware of two key limitations. First, spectral matches are based on cosine similarity33 or modified cosine similarity26, which cannot easily distinguish isomers that share very similar MS/MS patterns. As a result, isomeric compounds, including those originating from different sources, may have spectra with a high cosine similarity (e.g., deoxycholic acid is a microbial metabolite, chenodeoxycholic acid is host-derived, and their MS/MS cosine similarity is >0.9; Supplementary Figure S2d). Consequently, users may obtain spectral matches to metabolites of incorrect biological origin, which highlights the need for follow-up experiments and analyses for validation. Whenever possible, users should acquire orthogonal data (e.g., UV-vis, retention time, ion mobility collision cross section (CCS)) obtained from authentic chemical standards for confirmation. Second, the microbial origin of metabolites also requires additional experimental validation beyond spectral matching. Users can employ complementary approaches such as pure culture studies, co-culture experiments with isotope tracing (e.g., 13C-labeled substrates), comparisons between germ-free and colonized animal models, or spatial metabolomics to confirm not only the accuracy of the annotation but also the microbial biosynthesis or transformation of the detected compounds. As entries and curated knowledge continue to grow with future studies and depositions, the CMMC-KB will increasingly empower researchers to gain biological insights on the role of the microbiome in human health and diverse ecosystems.
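To make the first limitation concrete, the sketch below computes a simplified cosine score between two peak lists. It is not the GNPS implementation: it uses square-root intensity scaling, a fixed m/z tolerance, and greedy one-to-one peak pairing, all of which are assumptions chosen for brevity. The two spectra are toy data, not measured bile acid spectra; they share every fragment m/z and differ only in relative intensities, as closely related isomers often do.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def cosine_similarity(spec_a: List[Peak], spec_b: List[Peak],
                      tolerance: float = 0.02) -> float:
    """Greedy cosine score between two centroided MS/MS peak lists."""

    def normalize(peaks: List[Peak]) -> List[Peak]:
        # Square-root scaling, then scale the intensity vector to unit length.
        scaled = [(mz, math.sqrt(inten)) for mz, inten in peaks]
        norm = math.sqrt(sum(inten * inten for _, inten in scaled))
        return [(mz, inten / norm) for mz, inten in scaled] if norm > 0 else []

    a, b = normalize(spec_a), normalize(spec_b)
    # All peak pairs within the m/z tolerance, largest intensity product first.
    candidates = sorted(
        ((ia * ib, x, y)
         for x, (ma, ia) in enumerate(a)
         for y, (mb, ib) in enumerate(b)
         if abs(ma - mb) <= tolerance),
        reverse=True,
    )
    used_a, used_b, score = set(), set(), 0.0
    for product, x, y in candidates:
        # Each peak may contribute to at most one matched pair.
        if x not in used_a and y not in used_b:
            used_a.add(x)
            used_b.add(y)
            score += product
    return score


# Toy spectra: identical fragments, different relative intensities.
isomer_1 = [(100.0, 100.0), (200.0, 400.0), (300.0, 900.0)]
isomer_2 = [(100.0, 400.0), (200.0, 400.0), (300.0, 900.0)]
score = cosine_similarity(isomer_1, isomer_2)
print(round(score, 3))
```

Because the score depends only on shared fragments and their relative intensities, two compounds with different biological origins can exceed a typical matching threshold, which is why orthogonal evidence such as retention time or CCS is needed.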

Methods:

CMMC-KB development

The CMMC knowledge portal was developed using the FAIR (Findable, Accessible, Interoperable, and Reusable) principles as a guideline34. It incorporates a series of Python workflows designed to process deposition files and generate visualization tables for all deposited information (Findable). In addition, the CMMC-KB server compiles the files required for molecular networking enrichment workflows, including the MGF for the spectral database and structural information for deposited metabolites (Accessible). The KB server provides programmatic access through API endpoints (Interoperable) to download the database files, enabling seamless integration and reuse of information within custom workflows (Reusable). The database is automatically compiled daily to ensure the workflows use the most up-to-date information available in the KB.

Each compound with an associated structure in the CMMC-KB is assigned a unique URL, enabling seamless cross-linking to external resources such as NPAtlas.35,36 The structure page provides users with tools to explore the molecular structure of metabolites and access all available information for a given molecule and mass spectra available in the knowledgebase. From this interface, users can also contribute additional data by being redirected through a URL to a prepopulated deposition page containing the USI, molecule name, and SMILES/InChI, where further information can be added.

The CMMC-KB Statistics page is a public, daily refreshed summary of the knowledgebase that reports coverage (total unique mass spectra), composition (distributions by metabolite source/origin), and temporal dynamics (new deposits over time), alongside contributor activity.

CMMC-KB deposition workflow

The CMMC-KB deposition workflow is implemented as a Nextflow-based pipeline37 on GNPS2, which runs a series of Python scripts to validate and process user submissions. The deposition workflow supports both single (one molecule) and batch (multiple entries) deposition modes. In single-deposition mode, parameters are provided through a workflow form or YAML file, while in batch depositions, the input is provided through a TSV file. Each entry is checked against controlled vocabularies (e.g., source and origin) and must include valid spectral and structural identifiers: spectra are verified via the Metabolomics Spectrum Resolver API38 using USIs, and chemical structures (SMILES or InChI) are validated with the GNPS2 ChemicalStructureWebService API. Following validation, all data is submitted to the CMMC-KB server via POST requests. The necessary templates, including the TSV file and deposition instructions, are fully documented and available at https://cmmc.gnps2.org/deposition_documentation/.
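The kind of local checks a batch deposition could pass through before submission can be sketched as follows. This is a minimal illustration, not the workflow's code: the column names (`molecule_name`, `usi`, `source`) and the vocabulary values are hypothetical placeholders (the real template is described in the deposition documentation), and the remote validation steps (the Metabolomics Spectrum Resolver and the GNPS2 ChemicalStructureWebService) are deliberately not called. The only format rule used is that Universal Spectrum Identifiers begin with the `mzspec:` prefix.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical controlled vocabulary and required columns, for illustration only.
SOURCE_VOCAB = {"microbial", "host", "diet", "drug", "unknown"}
REQUIRED_COLUMNS = ("molecule_name", "source")


def validate_batch(tsv_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, message) pairs for rows that fail local checks."""
    errors: List[Tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                errors.append((line_number, f"missing {column}"))
        usi = (row.get("usi") or "").strip()
        # Spectra are optional, so only check the USI when one is given.
        if usi and not usi.startswith("mzspec:"):
            errors.append((line_number, "invalid USI prefix"))
        source = (row.get("source") or "").strip()
        if source and source not in SOURCE_VOCAB:
            errors.append((line_number, f"source '{source}' not in vocabulary"))
    return errors


# Placeholder rows; the USI is a dummy string, not a resolvable identifier.
EXAMPLE_TSV = (
    "molecule_name\tusi\tsource\n"
    "compound A\tmzspec:EXAMPLE:placeholder\tmicrobial\n"
    "compound B\tnot-a-usi\tplant\n"
)

errors = validate_batch(EXAMPLE_TSV)
print(errors)
```

Collecting every error with its line number, rather than stopping at the first failure, lets a depositor correct a whole batch file in one pass.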

Network Enrichment workflow

The enrichment workflow is implemented as a Nextflow pipeline available within the GNPS2 ecosystem, and can be launched as a downstream analysis from the Classical or Feature-Based Molecular Networking results. This design enables users to annotate molecular networks with microbial information through a single-click integration. The workflow retrieves molecular networking outputs from both GNPS1 and GNPS2 jobs, including the network (.graphml) and associated spectral (.MGF) files. The retrieved spectra are matched against those available in the CMMC-KB by cosine similarity, and the matches are further enriched with additional metadata if available, including microbial producers, taxonomy, chemical structure, biosynthetic gene clusters, molecular origin, activities, and compound classifications predicted using NPClassifier.39 The outputs include a library match TSV table and a new .graphml file with overlaid compound metadata information from the CMMC-KB matches. Additional visualizations, such as the producer lineage and taxonomic distribution, are generated from the NCBI Taxonomy IDs linked to each deposited spectrum. Documentation for the network enrichment workflow is available at https://cmmc.gnps2.org/network_enrichment/.
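The metadata overlay step can be pictured as a join between spectral matches and knowledgebase records. The sketch below is an assumption-laden illustration rather than the workflow's code: the field names (`feature_id`, `kb_id`, `cosine`, `name`, `source`), the 0.7 score cutoff, and the choice to keep only the best-scoring match per feature are all hypothetical.

```python
from typing import Any, Dict, List


def enrich_nodes(matches: List[Dict[str, Any]],
                 kb_metadata: Dict[str, Dict[str, Any]],
                 min_cosine: float = 0.7) -> Dict[str, Dict[str, Any]]:
    """Map each network feature to the metadata of its best-scoring KB match."""
    best: Dict[str, Dict[str, Any]] = {}
    for match in matches:
        if match["cosine"] < min_cosine:
            continue  # discard matches below the (assumed) score cutoff
        previous = best.get(match["feature_id"])
        if previous is None or match["cosine"] > previous["cosine"]:
            best[match["feature_id"]] = match
    # Merge the match score with whatever metadata the KB holds for that entry.
    return {
        feature_id: {"cmmc_cosine": match["cosine"],
                     **kb_metadata.get(match["kb_id"], {})}
        for feature_id, match in best.items()
    }


# Hypothetical matches and knowledgebase records.
matches = [
    {"feature_id": "12", "kb_id": "KB1", "cosine": 0.91},
    {"feature_id": "12", "kb_id": "KB2", "cosine": 0.75},
    {"feature_id": "40", "kb_id": "KB2", "cosine": 0.55},
]
kb_metadata = {
    "KB1": {"name": "compound A", "source": "microbial"},
    "KB2": {"name": "compound B", "source": "diet"},
}

annotated = enrich_nodes(matches, kb_metadata)
print(annotated)
```

The resulting dictionary has the shape of per-node attributes, which is the form in which annotations are typically written back onto a network file such as .graphml.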

CMMC-Dashboard web application

The CMMC Analysis Dashboard (https://cmmc-dashboard.gnps2.org/) was implemented as a web application using the Streamlit Python package to provide interactive access to results from the CMMC-KB enrichment workflow in combination with the FBMN data. The dashboard integrates directly with GNPS2 through Task IDs provided by the user. Task IDs from enrichment and FBMN workflows allow the application to fetch processed files, including enrichment results, FBMN quantification tables, molecular networks, and associated metadata. The dashboard can also be launched directly as a downstream analysis from the enrichment workflow results page, in which case the required inputs are prepopulated in the dashboard interface. Complete documentation for this tool can be found at https://wang-bioinformatics-lab.github.io/GNPS2_Documentation/metaboapp_CMMC_dashboard/.

After the inputs are specified, the dashboard merges enrichment outputs with quantification tables and metadata for downstream analyses. Statistical functionality includes the generation of box plots to compare metabolite abundances across groups, with options for stratification and multiple statistical tests. Overlaps of metabolite sources or origins can be visualized using UpSet plots40 derived from the enrichment results. Molecular network exploration is supported through interactive Plotly visualizations that highlight selected features within networks, incorporate delta-mass annotations for network edges, and enable export of figures. The dashboard further integrates microbeMASST15, allowing users to perform spectral searches based on a Universal Spectrum Identifier (USI) or feature ID, returning exact or analog matches with compounds from microbial cultures. This allows for taxonomically informed results with corresponding downloadable taxonomic trees.
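The counts behind an UpSet-style overview of metabolite sources can be derived by grouping features by their exact combination of source labels. The following sketch shows that tabulation only; it is not the dashboard's code, and the feature IDs and labels are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Tuple


def upset_counts(sources_by_feature: Dict[str, Set[str]]
                 ) -> List[Tuple[FrozenSet[str], int]]:
    """Count features per exact (exclusive) combination of source labels."""
    counts = Counter(
        frozenset(labels)
        for labels in sources_by_feature.values()
        if labels  # skip features without any source annotation
    )
    # Largest intersections first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], sorted(item[0])))


# Hypothetical annotations for four matched features.
sources_by_feature = {
    "f1": {"microbial"},
    "f2": {"microbial", "host"},
    "f3": {"host", "microbial"},
    "f4": {"diet"},
}

table = upset_counts(sources_by_feature)
for combination, n_features in table:
    print(sorted(combination), n_features)
```

Counting exclusive combinations, rather than per-label totals, is what distinguishes an UpSet view from a bar chart of sources: a feature labelled both "microbial" and "host" is counted once, in its own column.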


Acknowledgements:

We acknowledge support from the NIH (NIDDK) for the Collaborative Microbial Metabolite Center (U24DK133658), from NSF CAREER award #2047235 to N.G., and from the Research Foundation Flanders (FWO) [V406123N] to M.D.G. The Gordon and Betty Moore Foundation, GBMF12120 and https://doi.org/10.37807/GBMF12120, provided support to P.C.D. and A.M.C.-R. This research was supported in part by the National Center for Complementary and Integrative Health of the NIH under award number F32AT011475 to N.E.A. S.L. was supported by the Research Council of Finland funding (grant no. 363417). T.P. was supported by the Czech Science Foundation (GA CR) grant 21-11563M and by the European Union’s Horizon Europe program (ERC, TerpenCode, 101170268). F.O. was supported by FAPESP (2021/09175-4 and 2022/14603-8). H.H.F.K. was supported by FAPEAM, CNPq (443823/2024-3), and FINEP. L.-I.M. acknowledges the Burroughs Wellcome Fund Investigators in the Pathogenesis of Infectious Disease. R.J.E., D.F.Jr., J.E.I., S.L., and D.J.M. were supported by NIH P30 MH062512. R.J.E., D.F.Jr., and S.L. were supported by NIH N01 MH22005 and R01 MH125720. This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Institute of Environmental Health Sciences (ZIC ES103363). The contributions of the NIH author(s) were made as part of their official duties as NIH federal employees, are in compliance with agency policy requirements, and are considered Works of the United States Government. However, the findings and conclusions presented in this paper are those of the author(s) and do not necessarily reflect the views of the NIH or the U.S. Department of Health and Human Services.

Disclosures:

P.C.D. is an advisor to and holds equity in Cybele, BileOmix, and Sirenas, and is a scientific co-founder of, advisor to, holds equity in, and/or has received income from Ometa, Enveda, and Arome, with prior approval by UC-San Diego. P.C.D. also consulted for DSM animal health in 2023. M.W. is a co-founder of Ometa Labs LLC. D.M. is a consultant for and has equity in BiomeSense, Inc. The terms of these arrangements have been reviewed and approved by the University of California, San Diego, in accordance with its conflict-of-interest policies. R.K. is a scientific advisory board member of and consultant for BiomeSense, Inc., has equity, and receives income. He is a scientific advisory board member of and has equity in GenCirq. He has equity in and acts as a consultant for Cybele. The terms of these arrangements have been reviewed and approved by the University of California, San Diego, in accordance with its conflict-of-interest policies. S.T. is currently employed by Ometa Labs; this work was completed prior to that employment, and Ometa Labs had no role in the study design, data collection, analysis, or decision to publish. T.P. is a co-founder of mzio GmbH and a consultant for Novogaia, Inc. M.S.D. is a scientific co-founder of Pragma Bio. The work described here is unrelated to the work conducted at Pragma Bio.

Data availability:

All the datasets used in this work as use cases of the CMMC-KB are available in MassIVE (massive.ucsd.edu). Raw data files from the American Gut Project, used in Figure 2, are deposited at MSV000080673. The feature finding step was performed in MZmine3, following the previous parameters used for this dataset19. Feature-Based Molecular Networking analysis and CMMC enrichment analysis for the American Gut Project use case can be found at https://gnps2.org/status?task=553c08a0e2274572a4edd2ba2d669668 and https://gnps2.org/status?task=2ac40effdb0f404fa6a045a580ff5430, respectively. Additional relevant dataset accessions are provided together with their description in the Supplementary Material. Owing to human volunteer protection constraints, the sample metadata for the HIV cohorts will be provided upon request to HNRC: https://hnrp.hivresearch.ucsd.edu/index.php/hnrc-home.

Code availability:

The code used for creating and implementing the CMMC enrichment workflow within the GNPS2 ecosystem is available at https://github.com/Wang-Bioinformatics-Lab/CMMC_GNPSNetwork_Enrichment_Workflow. The code used to create the CMMC-Dashboard MetaboApp is available at: https://github.com/wilhan-nunes/streamlit_CMMC_analysis-dashboard. The code used for dataset analyses can be found at: https://github.com/helenamrusso/CMMC-KB_manuscript.

References:

  • 1. Quinn R. A. et al. Global chemical effects of the microbiome include new bile-acid conjugations. Nature 579, 123–129 (2020).
  • 2. Tremaroli V. & Bäckhed F. Functional interactions between the gut microbiota and host metabolism. Nature 489, 242–249 (2012).
  • 3. Oliphant K. & Allen-Vercoe E. Macronutrient metabolism by the human gut microbiome: major fermentation by-products and their impact on host health. Microbiome 7, 91 (2019).
  • 4. Hadadi N., Berweiler V., Wang H. & Trajkovski M. Intestinal microbiota as a route for micronutrient bioavailability. Curr. Opin. Endocr. Metab. Res. 20, 100285 (2021).
  • 5. Wilson I. D. & Nicholson J. K. Gut microbiome interactions with drug metabolism, efficacy, and toxicity. Transl. Res. 179, 204–222 (2017).
  • 6. Chiu K., Warner G., Nowak R. A., Flaws J. A. & Mei W. The impact of environmental chemicals on the gut microbiome. Toxicol. Sci. 176, 253–284 (2020).
  • 7. Lindell A. E., Zimmermann-Kogadeeva M. & Patil K. R. Multimodal interactions of drugs, natural compounds and pollutants with the gut microbiota. Nat. Rev. Microbiol. 20, 431–443 (2022).
  • 8. Cryan J. F. et al. The Microbiota-gut-brain axis. Physiol. Rev. 99, 1877–2013 (2019).
  • 9. Corbin K. D. et al. Host-diet-gut microbiome interactions influence human energy balance: a randomized clinical trial. Nat. Commun. 14, 3161 (2023).
  • 10. Han S. et al. A metabolomics pipeline for the mechanistic interrogation of the gut microbiome. Nature 595, 415–420 (2021).
  • 11. Shiroma H. et al. Enteropathway: the metabolic pathway database for the human gut microbiota. Brief. Bioinform. 25, (2024).
  • 12. Wishart D. S. et al. MiMeDB: The Human Microbial Metabolome Database. Nucleic Acids Res. 51, D611–D620 (2023).
  • 13. Kruger R. et al. MiMeDB 2.0: The human Microbial Metabolome Database for 2026. Nucleic Acids Res. (2025) doi: 10.1093/nar/gkaf1272.
  • 14. Wortmann E., Adam G. & Limonciel A. Biocrates. MxP® Quant 1000 in microbiome research. Preprint at https://biocrates.com/wp-content/uploads/2025/05/Application-note-Quant-1000-in-microbiome-research.pdf (2025).
  • 15. Zuffa S. et al. microbeMASST: a taxonomically informed mass spectrometry search tool for microbial metabolomics data. Nat. Microbiol. 9, 336–345 (2024).
  • 16. Wang M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
  • 17. Caraballo-Rodríguez A. M. et al. The undiscovered natural product potential of Actinomycetes. J. Antibiot. (Tokyo) (2025) doi: 10.1038/s41429-025-00876-x.
  • 18. Poynton E. F. et al. The Natural Products Atlas 3.0: extending the database of microbially derived natural products. Nucleic Acids Res. 53, D691–D699 (2025).
  • 19. Zhao H. N. et al. A resource to empirically establish drug exposure records directly from untargeted metabolomics data. Nat. Commun. 16, 10600 (2025).
  • 20. Patan A. et al. Charting the undiscovered metabolome with synthetic multiplexing. bioRxiv (2025) doi: 10.1101/2025.11.18.689170.
  • 21. Mannochio-Russo H. et al. The microbiome diversifies long- to short-chain fatty acid-derived N-acyl lipids. Cell (2025) doi: 10.1016/j.cell.2025.05.015.
  • 22. Mannochio-Russo H. et al. Bridging complexity and accessibility in metabolomics with MetaboApps. ChemRxiv (2025) doi: 10.26434/chemrxiv-2025-3nq29.
  • 23. Wu M. et al. Gut complement induced by the microbiota combats pathogens and spares commensals. Cell 187, 897–913.e18 (2024).
  • 24. Won T. H. et al. Host metabolism balances microbial regulation of bile acid signalling. Nature 638, 216–224 (2025).
  • 25. Nothias L.-F. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).
  • 26. Watrous J. et al. Mass spectral molecular networking of living microbial colonies. Proc. Natl. Acad. Sci. U. S. A. 109, E1743–52 (2012).
  • 27. McDonald D. et al. American gut: An open platform for citizen science microbiome research. mSystems 3, (2018).
  • 28. Mohanty I. et al. The changing metabolic landscape of bile acids - keys to metabolism and immune regulation. Nat. Rev. Gastroenterol. Hepatol. 21, 493–516 (2024).
  • 29. Jia M. et al. Gut microbiota dysbiosis promotes cognitive impairment via bile acid metabolism in major depressive disorder. Transl. Psychiatry 14, 503 (2024).
  • 30. Fogelson K. A., Dorrestein P. C., Zarrinpar A. & Knight R. The gut microbial bile acid modulation and its relevance to digestive health and diseases. Gastroenterology 164, 1069–1085 (2023).
  • 31. Mann A. et al. Palmitoyl Serine: An Endogenous Neuroprotective Endocannabinoid-Like Entity After Traumatic Brain Injury. J. Neuroimmune Pharmacol. 10, 356–363 (2015).
  • 32. Long J. Z. et al. The secreted enzyme PM20D1 regulates lipidated amino acid uncouplers of mitochondria. Cell 166, 424–435 (2016).
  • 33. Wan K. X., Vidavsky I. & Gross M. L. Comparing similar spectra: from similarity index to spectral contrast angle. J. Am. Soc. Mass Spectrom. 13, 85–88 (2002).
  • 34. Wilkinson M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
  • 35. van Santen J. A. et al. The Natural Products Atlas 2.0: a database of microbially-derived natural products. Nucleic Acids Res. 50, D1317–D1323 (2022).
  • 36. van Santen J. A. et al. The Natural Products Atlas: An open access knowledge base for microbial natural products discovery. ACS Cent. Sci. 5, 1824–1833 (2019).
  • 37. Di Tommaso P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
  • 38. Bittremieux W. et al. Universal MS/MS Visualization and Retrieval with the Metabolomics Spectrum Resolver Web Service. bioRxiv 2020.05.09.086066 (2020) doi: 10.1101/2020.05.09.086066.
  • 39. Kim H. W. et al. NPClassifier: A deep neural network-based structural classification tool for natural products. J. Nat. Prod. 84, 2795–2807 (2021).
  • 40. Lex A., Gehlenborg N., Strobelt H., Vuillemot R. & Pfister H. UpSet: Visualization of intersecting sets. IEEE Trans. Vis. Comput. Graph. 20, 1983–1992 (2014).

Associated Data

Data Availability Statement

All datasets used in this work as use cases of the CMMC-KB are available in MassIVE (massive.ucsd.edu). Raw data files from the American Gut Project, used in Figure 2, are deposited under accession MSV000080673. Feature finding was performed in MZmine3 with the parameters previously used for this dataset (ref. 19). The Feature-Based Molecular Networking analysis and the CMMC enrichment analysis for the American Gut Project use case can be found at https://gnps2.org/status?task=553c08a0e2274572a4edd2ba2d669668 and https://gnps2.org/status?task=2ac40effdb0f404fa6a045a580ff5430, respectively. Additional relevant dataset accessions are provided, together with their descriptions, in the Supplementary Material. Owing to human volunteer protection constraints, the sample metadata for the HIV cohorts will be provided upon request to the HNRC: https://hnrp.hivresearch.ucsd.edu/index.php/hnrc-home.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
