Abstract
Metabolite annotation in untargeted metabolomics remains challenging due to the vast structural diversity of metabolites. Network-based approaches have emerged as powerful strategies, particularly for annotating metabolites lacking chemical standards. Here, we develop a two-layer interactive networking topology that integrates data-driven and knowledge-driven networks to enhance metabolite annotation. A comprehensive metabolic reaction network is curated using graph neural network-based prediction of reaction relationships, enhancing both coverage and network connectivity. Experimental data are pre-mapped onto this network via sequential MS1 matching, reaction relationship mapping, and MS2 similarity constraints. The generated networking topology enables interactive annotation propagation with over 10-fold improved computational efficiency. In common biological samples, it annotates over 1600 seed metabolites with chemical standards and >12,000 putatively annotated metabolites through network-based propagation. Notably, two previously uncharacterized endogenous metabolites absent from human metabolome databases have been discovered. Overall, this strategy significantly improves the coverage, accuracy, and efficiency of metabolite annotation and is freely available as MetDNA3.
Subject terms: Mass spectrometry, Metabolomics
Accurate metabolite annotation remains a major challenge in untargeted metabolomics. Here, the authors present MetDNA3, a framework that integrates knowledge and data-driven two-layer networking to improve both the accuracy and coverage of known metabolite annotation, while also enabling the discovery of uncharacterized metabolites.
Introduction
Untargeted metabolomics aims to comprehensively profile endogenous metabolites within biological systems, offering critical insights into cellular metabolism, disease mechanisms, and biomarker discovery1–4. Over the past two decades, significant advancements in liquid chromatography–mass spectrometry (LC–MS)-based untargeted metabolomics have improved data acquisition and processing5–8. However, metabolite annotation remains a major challenge due to the vast chemical diversity and structural complexity of metabolites9–12. Standard library-based spectral matching remains the gold standard for metabolite annotation, but is limited to known metabolites with available reference spectra13–16. Therefore, the efficient annotation of unknown metabolites remains a significant hurdle in the field. To address this, network-based approaches have emerged as powerful and complementary strategies, particularly for annotating metabolites without available chemical standards, including both knowns and unknowns17–22.
Network-based strategies for metabolite annotation can be categorized into data-driven and knowledge-driven networking. In data-driven networks, nodes represent experimental MS features, while edges denote relationships between features, such as MS2 spectral similarity17,18, intensity correlation22,23, and mass differences22,24. Data-driven networking employs unsupervised modeling to uncover latent associations among features, enabling structural elucidation and metabolite annotation. A prominent example is molecular networking (MN) within the GNPS ecosystem17,18, which connects experimental features based on MS2 spectral similarity and utilizes fragment-based substructure information for the discovery and annotation of unknown metabolites. However, the complexity of LC–MS data often results in highly intricate network structures that require advanced interpretation tools22–28. For instance, feature-based molecular networking (FBMN)23 integrates MS1 features to improve isomer differentiation, while ion identity molecular networking (IIMN)24 optimizes and consolidates different ion species of the same molecule. Additionally, NetID22 employs integer linear programming to enhance feature annotation accuracy. More recently, Martin et al.29 integrate biotransformation rules into data-driven networks, highlighting the advantages of knowledge augmentation in improving annotation efficiency.
Alternatively, knowledge-driven networks use nodes to represent metabolites and edges to define relationships such as metabolic reactions19–21 or structural similarities30. This approach leverages supervised modeling to integrate established biochemical knowledge with experimental data, enabling targeted metabolite annotation and enhancing annotation efficiency. A notable example is MetDNA19,20, developed by our group, which uses a metabolic reaction network (MRN) to guide MS2 spectral similarity-based annotation, enabling automated and recursive metabolite annotation from complex LC–MS data. While knowledge-driven networking offers high-confidence annotations, its effectiveness is constrained by the limited coverage of metabolite databases. The lack of comprehensive metabolite databases and reaction relationships results in sparse network structures with low topological connectivity. Most importantly, metabolites not covered by these knowledge networks remain unannotated, limiting both annotation coverage and the discovery of novel metabolites.
Combining data-driven and knowledge-driven networks for metabolite annotation leverages the strengths of both approaches, improving annotation accuracy and coverage. Data-driven networks can uncover previously unrecognized relationships, while knowledge-driven networks provide efficient metabolite annotations by integrating established biochemical knowledge. Despite progress in integrating these two network types, challenges persist due to inherent topological differences between them. Furthermore, the increased complexity of integrated networks necessitates optimization strategies to enhance computational efficiency. The lack of efficient cross-network interaction strategies between the data and knowledge layers limits the propagation of metabolite annotation, thereby constraining both annotation coverage and efficiency. Therefore, addressing these challenges is crucial for advancing network-based metabolite annotation and improving its applicability in untargeted metabolomics.
In this work, we develop a two-layer interactive networking topology that integrates data-driven and knowledge-driven networks to enhance metabolite annotation in untargeted metabolomics. First, a comprehensive metabolic reaction network is constructed using graph neural network (GNN)-based prediction of reaction relationships, enhancing both coverage and network connectivity. Next, we establish a two-layer network topology by pre-mapping experimental data onto the knowledge-based metabolic reaction network using sequential MS1 m/z matching, reaction relationship mapping, and MS2 similarity constraints. By leveraging cross-network interactions, recursive metabolite annotation propagation is achieved with over 10-fold improved computational efficiency. In a set of common biological samples, it annotates over 1600 seed metabolites with chemical standards and more than 12,000 putatively annotated metabolites using network-based propagation. Notably, two previously uncharacterized endogenous metabolites absent from human metabolome databases are discovered. Overall, the two-layer interactive networking topology significantly improves the coverage, accuracy, and efficiency of metabolite annotation in untargeted metabolomics. This data analysis strategy has been implemented in the latest version of MetDNA3 and is freely available at http://metdna.zhulab.cn/.
Results
Curation of the metabolic reaction network for knowledge-driven networking
Existing metabolite knowledge databases, such as KEGG31, MetaCyc32, and HMDB33, often lack comprehensive reaction relationships, leading to sparse network structures with low topological connectivity (Supplementary Fig. 1a and b). To address this limitation, we curated a comprehensive metabolic reaction network by integrating multiple metabolite knowledge databases (KEGG, MetaCyc, and HMDB) with network reconstruction and expansion (Fig. 1a). Specifically, we retrieved metabolite reaction pairs (or reactant pairs, RPs) with and without known reaction relationships from knowledge databases to train a graph neural network-based model. This model predicts potential reaction relationships between any two metabolites in the databases (Fig. 1b, c, Supplementary Fig. 1c–h, and Supplementary Fig. 2) by learning reaction rules from known RPs and extending them to structurally similar pairs. To control potential false positives, a two-step pre-screening strategy was applied prior to the prediction (Supplementary Fig. 1i). Furthermore, unknown metabolites were generated using the previously reported BioTransformer tool34 to enhance metabolite coverage (Fig. 1a; see Methods). A set of metabolite examples for reported, predicted, and BioTransformer-generated reaction pairs was given in Fig. 1d. Structural similarity analysis further validated the reliability of predicted reaction pairs, as their Tanimoto coefficient distributions aligned closely with reported relationships (Fig. 1e).
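The Tanimoto comparison behind Fig. 1e can be illustrated with a minimal sketch. It assumes each metabolite is represented by the set of "on" bit indices of a binary structural fingerprint; the fingerprint type used for the published analysis is not stated in this section, and the bit sets below are invented for illustration.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two fingerprint bit sets."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


# Hypothetical fingerprints of the two metabolites in one reaction pair
fp_substrate = {1, 4, 7, 9, 15}
fp_product = {1, 4, 7, 9, 21}

# 4 shared bits out of 6 distinct bits
print(round(tanimoto(fp_substrate, fp_product), 3))
```

Computing this value for every reported, predicted, and BioTransformer-generated pair and comparing the three distributions is one way to reproduce the kind of check described above.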
Fig. 1. Curation of the comprehensive metabolic reaction network (MRN).
a Schematic illustration of curation of the comprehensive metabolic reaction network from knowledge databases. b Dataset curation for training and evaluation of graph neural network (GNN)-based prediction of reaction relationships. c Receiver operating characteristic (ROC) curve for the model prediction of reaction pairs (RPs) in the testing dataset (n = 2995). d Metabolite examples for reported, predicted, and BioTransformer-generated reaction pairs. e Distributions of Tanimoto structural similarity for different types of reaction pairs (Reported, n = 14,974; Predicted, n = 1,432,469; BioTransformer, n = 990,441). Number of metabolites (f), and number of reaction pairs (g) in knowledge databases and curated MRN. Global clustering coefficient of network (h), and degree distribution of nodes (i) in knowledge databases and curated MRN. The global clustering coefficient is calculated as the ratio of the number of closed triplets to the total number of connected triplets within the network. The degree of a node refers to the number of directly connected neighbor metabolites. Box plots show the median (center line), 25th and 75th percentiles (box bounds). The whiskers extend to the most extreme data points within 1.5 × the interquartile range (IQR) from the lower and upper quartiles. Data points beyond the whiskers are plotted individually as outliers. The large dot inside each box indicates the mean value. Source data are provided as a Source Data file for Fig. 1c and Fig. 1e–i. Illustrations in (a) and (b) created in BioRender. Zhu, Z. (2025) https://BioRender.com/aeyidih.
The curated metabolic reaction network substantially enhanced both the coverage and topological connectivity of the knowledge databases. In comparison to knowledge databases, the curated MRN comprised a total of 765,755 metabolites and 2,437,884 potential reaction pairs (Fig. 1f, g). Moreover, both known and unknown metabolites in the curated MRN exhibited high concordance in spatial distribution and chemical classification (Supplementary Fig. 3a, b), indicating that the MRN represents a metabolite-like structural framework with biologically relevant properties. In contrast, compounds in PubChem showed significant spatial deviation compared to our MRN (Supplementary Fig. 3c, d). Compared to the MRN curated from KEGG implemented in MetDNA220, the newly curated MRN significantly improved network coverage while maintaining a similar distribution of chemical space (Supplementary Fig. 3e–g). Additionally, evaluation of the topological properties of the curated MRN revealed a higher global clustering coefficient and an improved degree distribution relative to knowledge databases (Fig. 1h, i, and Supplementary Table 1). Degree distribution describes the number of nodes in the network corresponding to a given degree value. For example, in MRN, 5892 metabolites (nodes) have a degree of 10, whereas the knowledge databases-based network has only 39 such nodes (Fig. 1i). Notably, the curated MRN was constructed based on predictive models and has not yet been experimentally validated. Its primary purpose is to enhance network-based annotation propagation rather than to directly discover novel reactions. Collectively, these results demonstrated that our curated MRN possesses superior metabolite coverage and topological connectivity, effectively addressing the limitations of existing knowledge databases and boosting the potential for knowledge-driven metabolite annotation.
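The two topological measures used here can be computed from an edge list as in the following sketch. It follows the definitions given in the Fig. 1 legend (closed triplets over connected triplets; degree as the number of directly connected neighbors). The function names and the toy network are illustrative assumptions and are not taken from MetDNA3.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency map {node: set(neighbors)}; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def global_clustering_coefficient(adj):
    """Ratio of closed triplets to all connected triplets."""
    closed = total = 0
    for neighbors in adj.values():
        # Every pair of neighbors forms one connected triplet centered here
        for u, w in combinations(neighbors, 2):
            total += 1
            if w in adj[u]:
                closed += 1  # the triplet is closed by an edge u-w
    return closed / total if total else 0.0


def degree_distribution(adj):
    """Counter mapping degree value -> number of nodes with that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


# Toy network: one triangle (A, B, C) plus a pendant node D attached to C
adj = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(global_clustering_coefficient(adj))  # 3 closed / 5 connected triplets = 0.6
print(degree_distribution(adj))
```

In this toy case two nodes have a degree of 2, one has a degree of 3, and one has a degree of 1, which is the same kind of count reported for degree 10 in Fig. 1i.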
Two-layer interactive networking topology for recursive metabolite annotation
Combining data-driven and knowledge-driven networks for metabolite annotation leverages the strengths of both approaches, but remains challenging. Here, we developed a two-layer interactive networking topology that integrates data-driven and knowledge-driven networks to facilitate recursive metabolite annotation in untargeted metabolomics (Fig. 2a, b). The workflow comprises two major steps: curation of two-layer network topology through data and knowledge pre-mapping, and recursive-based metabolite annotation propagation. This analysis workflow has been implemented in the latest version of MetDNA3.
Fig. 2. Two-layer interactive networking topology for recursive metabolite annotation.
a The curation of two-layer network topology through data and knowledge pre-mapping. b The recursive-based metabolite annotation propagation within two networks. Comparison of numbers of metabolites (c) and reaction pairs (d) between the original MRN, the MS1-constrained MRN, and the data-constrained MRN. Comparison between MetDNA2 and MetDNA3 in terms of the number of neighbor metabolites searched during the recursive process (e) and the number of feature pairs evaluated for MS2 similarity (f). Data shown in panels a–f were retrieved from the processing of NIST human urine dataset (positive MS mode with HILIC separation; HILIC–MS(+)). g Comparison of computational times between MetDNA2 and MetDNA3 across three biological samples (12 datasets in total). Box plots show the median (center line), 25th and 75th percentiles (box bounds). The whiskers extend to the most extreme data points within 1.5 × the interquartile range (IQR) from the lower and upper quartiles. Data points beyond the whiskers are plotted individually as outliers. The large dot inside each box indicates the mean value. Source data are provided as a Source Data file for Fig. 2c–g.
In the first step (Fig. 2a and Supplementary Fig. 4a), experimental data (metabolic features) were pre-mapped onto the knowledge-based metabolic reaction network through sequential MS1 m/z matching, reaction relationship mapping, and MS2 similarity constraints. This procedure established a two-layer network topology, with the MRN representing the knowledge layer and the experimental features forming the data layer. Specifically, to establish connectivity within and between the two layers, experimental features were first matched to metabolites in the MRN based on MS1 m/z matching, effectively reducing redundant nodes and edges to form an MS1-constrained MRN. Subsequently, reaction relationships within the MS1-constrained MRN were mapped onto the data layer, guiding the construction of the feature network. Within this feature network, MS2 similarity between features was calculated and applied as a filtering constraint to eliminate unwanted nodes, further refining the network structure. Finally, the topological connectivity of the knowledge-constrained feature network was mapped back to the knowledge layer, resulting in a data-constrained MRN. Importantly, this interactive pre-mapping between data and knowledge layers established refined metabolite-metabolite links and feature-feature links within each of the two networks, and direct metabolite-feature relationships between the two networks, ensuring consistent network topologies across both layers. By integrating experimental constraints, the two-layer network retained structural coherence while eliminating redundant nodes and edges. For example, in the NIST human urine dataset (positive MS mode with HILIC separation; HILIC–MS(+)), the application of experimental data constraints reduced the MRN from 765,755 metabolites to 2993 (~0.4%) and reaction pairs from 2,437,884 to 55,674 (~2.3%), demonstrating the effectiveness of this approach in refining large-scale metabolic networks (Fig. 2c, d).
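The pre-mapping sequence can be sketched as follows. This is a simplified illustration under stated assumptions, not the MetDNA3 implementation: only the [M+H]+ adduct is considered, MS2 similarities are supplied as a precomputed lookup, and the tolerance (`ppm`), the similarity cutoff (`min_sim`), and all names and values are hypothetical.

```python
PROTON = 1.007276  # mass shift for [M+H]+; the only adduct considered in this sketch


def ms1_match(metabolites, features, ppm=10.0):
    """Link each metabolite to the features whose m/z fits its [M+H]+ ion."""
    links = {}
    for met, mass in metabolites.items():
        target = mass + PROTON
        hits = {f for f, mz in features.items()
                if abs(mz - target) / target * 1e6 <= ppm}
        if hits:
            links[met] = hits
    return links


def premap(metabolites, reaction_pairs, features, ms2_sim, ppm=10.0, min_sim=0.5):
    """Return (cross_links, feature_edges, kept_reaction_pairs)."""
    links = ms1_match(metabolites, features, ppm)
    # MS1-constrained MRN: reaction pairs whose two metabolites both match a feature
    ms1_pairs = [(a, b) for a, b in reaction_pairs if a in links and b in links]
    cross_links, feature_edges, kept_pairs = set(), set(), []
    for a, b in ms1_pairs:
        supported = False
        # Map the reaction pair onto the data layer as candidate feature pairs
        for fa in links[a]:
            for fb in links[b]:
                if fa == fb:
                    continue
                # MS2 similarity constraint on the candidate feature pair
                if ms2_sim.get(frozenset((fa, fb)), 0.0) >= min_sim:
                    feature_edges.add(frozenset((fa, fb)))
                    cross_links.update({(a, fa), (b, fb)})
                    supported = True
        # Map back: keep the reaction pair only if a feature pair supports it
        if supported:
            kept_pairs.append((a, b))
    return cross_links, feature_edges, kept_pairs


# Toy inputs (all values invented for illustration)
metabolites = {"met_A": 100.0000, "met_B": 157.0215, "met_C": 200.0000}
features = {"feat_1": 101.0073, "feat_2": 158.0288, "feat_3": 201.0073, "feat_4": 350.0000}
reaction_pairs = [("met_A", "met_B"), ("met_B", "met_C"), ("met_A", "met_D")]
ms2_sim = {frozenset(("feat_1", "feat_2")): 0.82, frozenset(("feat_2", "feat_3")): 0.12}

cross_links, feature_edges, kept_pairs = premap(metabolites, reaction_pairs, features, ms2_sim)
print(kept_pairs)
```

In the toy data, the pair met_B/met_C passes MS1 matching but is removed by the MS2 constraint, and met_A/met_D is removed because met_D has no matching feature. The surviving metabolite-feature links are what the propagation step later consults instead of recomputing m/z matches and spectral similarities.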
In the second step (Fig. 2b and Supplementary Fig. 4b), experimental features were annotated as seed metabolites through matching their experimental values, such as MS1, retention time (RT), and MS2 spectra, to the standard libraries. The annotated seed metabolites formed metabolite-feature pairs, and were input into the knowledge layer and data layer of the two-layer network topology. In the first round of recursive metabolite annotation, reaction-paired neighbor metabolites for each seed metabolite were searched and retrieved from the knowledge layer. Simultaneously, corresponding neighbor features with MS2 similarity to the seed feature were also searched and retrieved in the data layer. If the retrieved neighbor metabolites and experimental features in both networks had pre-mapped cross-network links (i.e., metabolite-feature links in step 1), they were retained as new annotations. Then, the newly annotated metabolite-feature pairs served as new seeds for subsequent recursive annotations, followed by the retrieval of reaction-paired neighbor metabolites, the retrieval of MS2-similarity-paired neighbor features, and verification of pre-mapped cross-network links. This recursive annotation process continued iteratively until no new metabolite-feature links were found between two networks, ultimately maximizing metabolite annotation coverage.
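The recursive propagation described above can be read as a breadth-first traversal over metabolite-feature pairs. The sketch below is an illustrative assumption of how such a loop might look, not the MetDNA3 code; the adjacency maps, the `cross_links` set, and all identifiers are hypothetical toy inputs.

```python
from collections import deque


def propagate(seeds, knowledge_adj, data_adj, cross_links):
    """Return {(metabolite, feature): round} for all pairs reached from the seeds.

    knowledge_adj: metabolite -> reaction-paired neighbor metabolites
    data_adj:      feature -> MS2-similar neighbor features
    cross_links:   set of pre-mapped (metabolite, feature) links
    """
    annotated = {}
    queue = deque()
    for pair in seeds:
        annotated[pair] = 0  # seeds are round 0
        queue.append(pair)
    while queue:  # stops when no new metabolite-feature link is found
        met, feat = queue.popleft()
        rnd = annotated[(met, feat)]
        for n_met in knowledge_adj.get(met, ()):
            for n_feat in data_adj.get(feat, ()):
                pair = (n_met, n_feat)
                # Keep the pair only if a pre-mapped cross-network link exists
                if pair in cross_links and pair not in annotated:
                    annotated[pair] = rnd + 1
                    queue.append(pair)  # new annotation becomes a seed
    return annotated


# Toy two-layer network: a chain A-B-C in each layer
knowledge_adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
data_adj = {"f1": {"f2"}, "f2": {"f1", "f3"}, "f3": {"f2"}}
cross_links = {("A", "f1"), ("B", "f2"), ("C", "f3"), ("C", "f9")}

result = propagate([("A", "f1")], knowledge_adj, data_adj, cross_links)
print(sorted(result.items(), key=lambda kv: kv[1]))
```

Here the seed A/f1 yields B/f2 in round 1 and C/f3 in round 2, while the link C/f9 is never used because f9 is not a neighbor of any annotated feature. Each step needs only neighbor lookups and a set membership test.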
Leveraging the two-layer network topology, recursive metabolite annotation propagation was achieved solely through the search of neighboring metabolites and features and the verification of pre-mapped cross-network links. This eliminated redundant m/z matching and MS2 similarity calculations, significantly improving the computational efficiency of metabolite annotation, particularly in large and complex two-layer networks. In contrast, conventional approaches like MetDNA2 required multiple steps for recursive metabolite annotation, including neighbor metabolite search, feature m/z matching, and MS2 similarity calculations, resulting in substantial computational costs, especially in large and complex networks. Importantly, many of these calculations were redundantly repeated. We quantitatively compared numbers of searched neighbor metabolites and MS2 similarity calculations between MetDNA2 and MetDNA3, revealing significant reductions in MetDNA3 (Fig. 2e, f). Furthermore, we evaluated computational efficiency across three biological samples (12 datasets in total) and found that MetDNA3 reduced the mean computation time per dataset from 1082 minutes to 77 minutes, achieving over a 14-fold improvement in efficiency (Fig. 2g).
Improved coverage and correct rate of metabolite annotation
The two-layer interactive networking strategy has been integrated into MetDNA3 for metabolite annotation in untargeted metabolomics, achieving significant improvements in coverage and correct rate. Here, we comprehensively evaluated its performance using a diverse set of biological samples, including cultured BV2 cells, mouse brain tissue, mouse liver tissue, NIST human plasma, and NIST human urine. Each sample type had four datasets acquired using two LC columns and two MS ion polarities: HILIC–MS(+), HILIC–MS(-), RP–MS(+), and RP–MS(-) (Fig. 3a). Additionally, two MS instrument types, Orbitrap and Astral, were tested. Across all datasets, MetDNA3 successfully annotated 1652 unique level 1 metabolites by matching MS1, RT, and MS2 spectra to the chemical standards, with approximately 600–1000 level 1 metabolites identified per biological sample on average (Fig. 3b). Furthermore, network-based annotation propagation expanded the annotated metabolome to 12,508 metabolites (level 3 annotations), including 9410 known and 3098 unknown metabolites (Fig. 3c), highlighting the substantial increase in metabolite coverage. Detailed metabolite annotations for all samples were provided in Supplementary Figs. 5 and 6, and Supplementary Data 1. Notably, data acquired using the Astral instrument demonstrated superior annotation coverage compared to the Orbitrap (Fig. 3b, c). Additionally, the Astral data exhibited a significantly higher annotation rate (26.9%) compared to the Orbitrap (13.2%) (Supplementary Fig. 6).
Fig. 3. Improved coverage and correct rate of metabolite annotation.
a A diverse set of biological samples, including cultured BV2 cells (n = 6 biological replicates), mouse brain tissues (n = 6 biological replicates), mouse liver tissues (n = 6 biological replicates), NIST human plasma (n = 6 technical replicates), and NIST human urine (n = 6 technical replicates), were used for benchmarking MetDNA3. Number of level 1 metabolite annotations (b) and network-based metabolite annotations (c) for different biological samples in Orbitrap and Astral datasets. “*” indicates the removal of duplicate metabolite annotations. d Schematic overview of the experimental design for evaluating the coverage and correct rate of network-based annotation in MetDNA3. Comparative evaluations of annotation coverage (e) and correct rate (Top 3 annotations) (f) of network-based annotation between MetDNA2 and MetDNA3 (20 datasets per group). g Correct rate of Top N annotations in network-based annotation achieved by MetDNA3 (20 datasets in total). Box plots show the median (center line), 25th and 75th percentiles (box bounds). The whiskers extend to the most extreme data points within 1.5 × the interquartile range (IQR) from the lower and upper quartiles. Data points beyond the whiskers are plotted individually as outliers. The large dot inside each box indicates the mean value. Source data are provided as a Source Data file for Fig. 3b, c, and e–g. Illustrations in a created in BioRender. Zhu, Z. (2025) https://BioRender.com/aeyidih.
To evaluate the coverage and correct rate of MetDNA3 in network-based annotation propagation, we analyzed 20 datasets acquired using the Orbitrap instrument (Fig. 3a–c) and followed the validation framework outlined in Fig. 3d. Specifically, 30% of the level 1 metabolite annotations were designated as seed metabolites for network-based annotation propagation, while the remaining 70% were reserved for validation. The detailed metabolite numbers are summarized in Supplementary Table 2. This process was repeated 10 times for each dataset using a 10-fold cross-validation approach (Fig. 3d). Detailed results are provided in Supplementary Data 2. Compared to MetDNA2, MetDNA3 exhibited significant enhancements in both coverage and correct rate. Among the validation metabolites, the average annotation coverage through network-based propagation increased from 39.9% to 68.1% (Fig. 3e), while the average annotation correct rate improved from 65.6% to 84.4% (Top 3 annotations; Fig. 3f). In terms of Top N correct rate, MetDNA3 achieved annotation correct rates of 68.0%, 84.4%, and 91.0% for N = 1, 3, and 10, respectively (Fig. 3g). Across different recursive propagation rounds, the annotation correct rates remained stable, with a mean value of 82.4% for Top 3 annotations (Supplementary Fig. 7). False discovery rate (FDR) was also evaluated, yielding 15.6% for Top 3 annotations (Supplementary Fig. 8).
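The two evaluation metrics can be made concrete with a short sketch. The definitions used here are an assumed reading of the text: coverage is the fraction of held-out validation features that receive any network-based annotation, and the Top N correct rate is the fraction of those annotated features whose true metabolite appears among the first N ranked candidates. The data structures and values are hypothetical.

```python
def coverage(ranked_candidates, truth):
    """Fraction of validation features that received at least one annotation."""
    if not truth:
        return 0.0
    return sum(f in ranked_candidates for f in truth) / len(truth)


def top_n_correct_rate(ranked_candidates, truth, n):
    """Among annotated validation features, fraction correct within the top n."""
    evaluated = [f for f in truth if f in ranked_candidates]
    if not evaluated:
        return 0.0
    hits = sum(truth[f] in ranked_candidates[f][:n] for f in evaluated)
    return hits / len(evaluated)


# Held-out level 1 annotations (feature -> true metabolite), invented for illustration
truth = {"f1": "A", "f2": "B", "f3": "C", "f4": "D"}
# Network-based candidates per feature, best first; f4 was not reached
ranked = {"f1": ["A", "X"], "f2": ["Y", "B", "Z"], "f3": ["P", "Q", "R", "C"]}

print(coverage(ranked, truth))  # 3 of 4 validation features annotated
for n in (1, 3, 10):
    print(n, top_n_correct_rate(ranked, truth, n))
```

Under this reading, the Top N correct rate can only increase with N, which matches the ordering of the values reported for N = 1, 3, and 10.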
The improved performance in network-based metabolite annotation was attributed to the enhanced topological connectivity (Fig. 1i), enabled by the graph neural network-based prediction of reaction relationships in the curated MRN. We examined the significance of predicted reaction pairs in driving the network-based recursive annotation and propagation. Taking the NIST human urine sample (HILIC–MS(+)) as an example, the predicted reaction pairs accounted for most of the annotation propagation (68.0%) (Supplementary Fig. 9a). During propagation, the structural similarity and MS2 spectral similarity between predicted metabolite reaction pairs remained high, comparable to those observed in reported metabolite reaction pairs (Supplementary Fig. 9b and c). Further analyses revealed the structural classification of annotated metabolites during propagation (Supplementary Fig. 9d). Notably, 83.9% of the propagation occurred between metabolite pairs within the same superclass of structural classification (Supplementary Fig. 9e). Consistent results were also shown in the negative mode dataset of the human urine sample (Supplementary Fig. 9f–j). Together, the concordance in metabolite structural classification supports the rationale that network-based metabolite annotation was driven by structural and chemical similarities between metabolite pairs.
Corroboration of network-based metabolite annotations
Multiple bioinformatic strategies have been developed for metabolite annotation in untargeted metabolomics, and integrating multiple approaches provides deeper insights. In particular, most level 3 annotations generated by MetDNA3 lack available chemical standards, making validation using chemical standards infeasible. To this end, we employed four widely used metabolite annotation tools (CFM-ID35, MetFrag36, MS-FINDER37, and SIRIUS14) to corroborate the consistency of network-based metabolite annotations in MetDNA3 (Fig. 4a). None of these tools rely on network-based annotation strategies. In the 20 Orbitrap datasets, a total of 4302 features with level 3 annotations were further analyzed using the four independent bioinformatic tools for structural annotation and compared against MetDNA3 results (Fig. 4b and Supplementary Data 3). The consistency of individual tools with MetDNA3 ranged from 60% to 80% across the 4302 features (Fig. 4b). In total, 3874 features (90.1%) had at least one consistent annotation between MetDNA3 and one of the four tools. Notably, 1680 features (39.1%) were consistently supported by all four tools, whereas only 428 features (9.9%) lacked corroboration from any tool (Fig. 4c). These findings indicate that the majority of network-based metabolite annotations in MetDNA3 could be corroborated by independent tools. While discrepancies among different annotation tools are inevitable, employing a multi-tool corroboration strategy enhances confidence in metabolite annotations.
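The per-feature tally summarized in Fig. 4c can be sketched as below. The criterion for a "consistent" annotation is not defined in this section; the sketch assumes a tool is consistent when the MetDNA3 annotation appears among that tool's candidate structures for the same feature. The tool names come from the text, while the data layout and values are hypothetical.

```python
from collections import Counter

TOOLS = ("CFM-ID", "MetFrag", "MS-FINDER", "SIRIUS")


def tally_support(metdna3, tool_results):
    """Count features by the number of tools agreeing with the MetDNA3 annotation.

    metdna3:      feature -> annotated metabolite
    tool_results: tool -> {feature -> set of candidate metabolites}
    """
    tally = Counter()
    for feature, annotation in metdna3.items():
        n_tools = sum(
            annotation in tool_results.get(tool, {}).get(feature, ())
            for tool in TOOLS
        )
        tally[n_tools] += 1
    return tally


# Toy annotations, invented for illustration
metdna3 = {"f1": "A", "f2": "B", "f3": "C"}
tool_results = {
    "CFM-ID": {"f1": {"A"}, "f2": {"B", "X"}},
    "MetFrag": {"f1": {"A"}},
    "MS-FINDER": {"f1": {"A", "Z"}},
    "SIRIUS": {"f1": {"A"}, "f2": {"Q"}},
}

tally = tally_support(metdna3, tool_results)
supported = sum(count for n_tools, count in tally.items() if n_tools >= 1)
print(dict(tally), supported / len(metdna3))
```

Because every feature falls into exactly one bin, the count supported by at least one tool and the count supported by none must sum to the total number of features analyzed.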
Fig. 4. Corroboration of network-based metabolite annotations.
a Schematic illustration of corroboration between MetDNA3 and other bioinformatic tools for metabolite annotations. b Numbers of annotated metabolites corroborated between different bioinformatic tools and MetDNA3. c Statistics on the number of bioinformatic tools that generate consistent annotations with MetDNA3. d Annotation of an unknown metabolite γ-glutamyl-threoninyl-glycine (γ-Glu-Thr-Gly) in mouse liver tissue HILIC–MS(+) dataset through the network-based annotation propagation. e Validation of γ-Glu-Thr-Gly using the synthesized standard by matching RT and MS2 spectra. f Annotation of an unknown metabolite N-glycolyltaurine in mouse brain tissue HILIC–MS(-) dataset through the network-based annotation propagation. g Validation of N-glycolyltaurine using the synthesized standard by matching RT and MS2 spectra. Source data are provided as a Source Data file for Fig. 4b, c, e, g. Illustrations in (a) created in BioRender. Zhu, Z. (2025) https://BioRender.com/aeyidih.
We further validated some metabolite annotations in MetDNA3 using chemical standards (Fig. 4d–g and Supplementary Fig. 10). To prioritize candidates for validation, we applied a stepwise filtering strategy that integrated corroborated annotations from multiple in silico tools and checked their recurrence across multiple biological datasets (see Methods). The most notable result is the discovery of two metabolites that are absent from human metabolome databases. One example is a γ-glutamyl tripeptide, γ-glutamyl-threoninyl-glycine (γ-Glu-Thr-Gly), which is commonly found in mouse liver and brain tissues. In the knowledge layer, γ-Glu-Thr-Gly was paired with a known metabolite, γ-glutamyl-threonine (γ-Glu-Thr), through an in silico reaction relationship. In the data layer, the experimental feature of γ-Glu-Thr (M249T416) was paired with M306T438 based on MS2 spectral similarity. Consequently, in network-based annotation propagation, the known metabolite-feature pair (γ-Glu-Thr/M249T416) was propagated to form the new pair of γ-Glu-Thr-Gly/M306T438 (Fig. 4d). This demonstrated the effectiveness of the two-layer network strategy in extending metabolite annotations. The annotation was further validated using the synthesized standard by matching MS1, RT, and MS2 spectra (Fig. 4e).
Another previously unreported metabolite, N-glycolyltaurine, was discovered in mouse brain and liver tissues. In the knowledge layer, N-glycolyltaurine was paired with the known metabolite N-acetyltaurine (Fig. 4f). In the data layer, the experimental feature of M166T175 was paired with M182T222 based on MS2 spectral similarity. Through network-based annotation propagation, the known metabolite-feature pair (N-acetyltaurine/M166T175) was propagated to form the new pair N-glycolyltaurine/M182T222 (Fig. 4f). This annotation of N-glycolyltaurine was also validated using the synthesized standard (Fig. 4g). As of March 2025, neither γ-Glu-Thr-Gly nor N-glycolyltaurine is recorded in major metabolite databases, including KEGG, HMDB, and GNPS. Overall, our results demonstrated that network-based metabolite annotation facilitates the discovery of previously unreported metabolites, offering a more comprehensive characterization of the “dark matter” in the metabolome. Other examples of known metabolites annotated through network-based annotation, but lacking standard spectra in public databases, were also provided and validated using chemical standards (Supplementary Fig. 10).
Benchmarking different knowledge-driven network topologies
The impact of network topological properties (such as connectivity and edge density) on network-based metabolite annotation remains underexplored. To demonstrate the superiority of our curated MRN for metabolite annotation, we compared it with two other knowledge-driven networks with distinct topologies: a structure-guided molecular network (SMN) and a fully connected network (FCN) (Fig. 5a). All three networks shared the same metabolite database (n = 53,583 known metabolites curated in Fig. 1). SMN establishes metabolite-metabolite connections based on structural similarity calculations, whereas FCN represents an extreme case where all metabolites are universally connected within the network. Clearly, FCN exhibited the highest network connectivity and density, followed by SMN, while MRN had the lowest connectivity and density (Supplementary Fig. 11 and Supplementary Table 1). We implemented the three networks in MetDNA3 for metabolite annotation and applied them to 20 biological datasets from the Orbitrap instrument (Fig. 5a).
Fig. 5. Benchmarking different knowledge-driven network topologies.
a Schematic illustration of metabolite annotation propagation guided by different networks in MetDNA3: MRN, metabolic reaction network; SMN, structure-guided molecular network; FCN, fully connected network. b Proportions of metabolites propagated within the same metabolite superclass for different network-based metabolite annotation. Superclass taxonomy was calculated using ClassyFire44. c Schematic workflow for evaluating annotation performances using experimental and decoy MS2 spectra. d–f Comparative evaluations of accuracy (d), estimated false discovery rate (FDR) (e), and true positive rate (TPR)-false positive rate (FPR) distribution (f) for different networks in network-based annotation propagation (20 datasets in total). Box plots show the median (center line), 25th and 75th percentiles (box bounds). The whiskers extend to the most extreme data points within 1.5 × the interquartile range (IQR) from the lower and upper quartiles. Data points beyond the whiskers are plotted individually as outliers. The large dot inside each box indicates the mean value. Source data are provided as a Source Data file for Fig. 5b and Fig. 5d–f. Illustrations in (a) created in BioRender. Zhu, Z. (2025) https://BioRender.com/aeyidih.
First, we assessed the specificity of annotation propagation by quantifying the proportion of metabolites propagated within the same metabolite superclass. MRN demonstrated the highest specificity at 80.8%, compared to 62.6% for SMN and 24.9% for FCN (Fig. 5b). These results indicate that excessive network density and indiscriminate connectivity compromise the specificity of network-based annotation propagation. Next, we adopted the spectrum-based method proposed by Scheubert et al.38 to generate decoy MS2 spectra by randomizing the experimental MS2 spectra (Supplementary Fig. 12). The annotations obtained from decoy spectra served as negative controls for evaluating the accuracy and estimated false discovery rate (FDR) of network-based annotation propagation (Fig. 5c and Supplementary Data 4). Based on the generated decoy MS2 spectra, network-based annotation in MetDNA3 achieved 73.3% accuracy for Top 3 annotations, corresponding to an FDR of 15.8% (Fig. 5d and e). In addition, we used the same decoy MS2 spectra to evaluate the performance of SMN and FCN. As network connectivity increased (from MRN to SMN to FCN), annotation accuracy decreased from 73.3% to 65.1% and then to 58.3% (Fig. 5d). Correspondingly, the estimated FDR increased from 15.8% to 36.4% and then to 44.5% (Fig. 5e). The performance of FCN closely resembled that of m/z + RT matching, suggesting that FCN-based annotation was primarily driven by m/z + RT matches rather than network-based propagation. From the distribution of TPR and FPR, it is evident that our curated MRN exhibits both lower TPR and FPR compared to other methods (Fig. 5f). These results highlight that indiscriminately increasing network connectivity does not improve annotation quality. Instead, carefully curating knowledge networks with high specificity is essential for accurate network-based metabolite annotation and propagation.
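The specificity measure described above can be illustrated with a minimal sketch. The function name, the toy propagation list, and the superclass labels are hypothetical and are not taken from MetDNA3; the sketch only assumes that each propagation step links a source metabolite to a target metabolite and that both carry a superclass label.

```python
def same_superclass_fraction(propagations, superclass):
    """Fraction of (source, target) propagation steps whose two
    metabolites share the same superclass label."""
    # Only score steps for which both metabolites have a label.
    scored = [(s, t) for s, t in propagations
              if s in superclass and t in superclass]
    if not scored:
        return 0.0
    same = sum(1 for s, t in scored if superclass[s] == superclass[t])
    return same / len(scored)

# Toy data (hypothetical labels, for illustration only)
superclass = {"M1": "Organic acids", "M2": "Organic acids",
              "M3": "Lipids", "M4": "Lipids"}
propagations = [("M1", "M2"), ("M1", "M3"), ("M3", "M4"), ("M2", "M4")]
fraction = same_superclass_fraction(propagations, superclass)
```

In this toy case two of the four propagation steps stay within one superclass, so a denser network that adds cross-class edges would lower the fraction.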
Discussion
The integration of data-driven and knowledge-driven networking for metabolite annotation in untargeted metabolomics remains challenging. Key issues include topological differences between two networks, the structural complexity of integrated networks, and the absence of effective algorithms for cross-network interaction and metabolite annotation propagation. To address these challenges, we developed a two-layer interactive networking topology that integrates data-driven and knowledge-driven networks to enhance metabolite annotation. By leveraging cross-network interactions, recursive-based metabolite annotation was achieved with significant improvements in efficiency, coverage and accuracy. Notably, the method facilitated the discovery of two previously uncharacterized endogenous metabolites absent from human metabolome databases. The key aspects of our strategy rely on the proper curation of a knowledge network and the introduction of a new two-layer network topology to support network-based metabolite annotation.
The proper curation of a knowledge network is crucial for supporting network-based metabolite annotation. We curated a comprehensive metabolic reaction network using graph neural network-based predictions of reaction relationships, which enhanced network coverage, connectivity, and specificity. Our results demonstrated that the predicted reaction pairs in MRN significantly improved annotation propagation, exhibiting performance comparable to reported reaction pairs (Supplementary Fig. 9). However, the direct prediction of reaction pairs is likely to include a considerable number of false positives. To mitigate this, we employed a two-step pre-screening strategy aimed at reducing potential false positives in the MRN (Supplementary Fig. 1i). Nevertheless, due to the lack of clearly defined non-RPs under these screening conditions, and the incompatibility of applying GNN-based performance metrics to the filtered candidate set, we refrain from reporting an FDR for the MRN. This limitation underscores the need for cautious interpretation of the predicted reaction pairs and highlights several directions for future methodological improvement, including refining the GNN model, expanding the training set, and improving the selection and definition of negative examples. While we recognize these challenges, we consider the MRN a valuable exploratory resource for identifying potential biochemical relationships, and emphasize the need for ongoing efforts to rigorously assess and enhance its predictive accuracy. It should be noted that, in a broader conceptual sense, a reaction pair (or reactant pair) can also be interpreted as a metabolite pair, which represents the relationship between two metabolites, particularly in the context of knowledge networks or metabolite annotation methods that rely on such relationships.
Furthermore, we benchmarked three knowledge networks with distinct topologies (MRN, SMN, and FCN), and revealed that curating knowledge networks with high specificity is essential for guiding accurate network-based metabolite annotation and propagation (Fig. 5). Networks with high connectivity but low specificity tend to introduce redundant and less accurate annotations during propagation. Recently, Wang et al.30 leveraged structural similarity for network reconstruction, achieving remarkable connectivity and annotation coverage. However, their approach did not assess network specificity or redundancy, which are crucial for ensuring annotation accuracy. Additionally, it lacked the capability to annotate unknown metabolites that are absent from existing knowledge databases, limiting its applicability for novel metabolite discovery. In addition, many existing knowledge-driven networks tend to retain reaction or metabolite pairs with high structural similarity as the basis for network construction. This approach may overlook metabolite pairs that are structurally dissimilar but exhibit high MS2 spectral similarity. In the future, it will be important to explore more specific and tailored methods for identifying reaction or metabolite pairs that are better suited for network-based annotation propagation.
A network topology was developed to integrate both data-driven and knowledge-driven networks for network-based metabolite annotation and propagation. Previous network-based annotation propagation approaches, whether data-driven or knowledge-driven, like NetID22, NAP39 and MetDNA19, primarily operated within a single network layer, establishing localized network topologies. While these methods maintained computational efficiency for small datasets and simple networks, they led to excessive redundancy in searches and computations, becoming increasingly inefficient when handling large-scale datasets and complex network topologies. Through data and knowledge pre-mapping, our strategy refines metabolite-metabolite links and feature-feature links within each layer, as well as direct metabolite-feature relationships between the two layers, ensuring consistent network topologies across both layers. By incorporating experimental constraints, this two-layer network topology preserves structural coherence while eliminating redundant nodes and edges. When applied to network-based metabolite annotation, it improves annotation efficiency by more than 10-fold. In common biological datasets, metabolite annotation can be completed within approximately one hour. In comparison, NetID uses integer linear programming to optimize peak annotation at the formula level within a data-driven network on a global scale. However, its computational efficiency can vary significantly, ranging from minutes to several days, depending on the number of variables and constraints in the model. In some cases, optimization using NetID may fail due to overly complex network topologies that hinder convergence. Overall, our approach offers a more efficient, scalable, and reliable solution for network-based metabolite annotation, overcoming the limitations of previous methods and making it highly suitable for high-throughput metabolomics studies.
In addition to network-based metabolite annotation, other bioinformatic tools, such as CFM-ID35 and SIRIUS14, perform in-silico predictions of MS/MS spectra and/or molecular fingerprints. These methods rely on machine learning-based predictive models to establish relationships between MS2 spectra and metabolite structures, substructures, molecular fingerprints, or other relevant features. These tools have been increasingly adopted to enhance the structural elucidation of metabolites. To evaluate the accuracy of these methods, we also used the validation metabolites (n = 3301, from 20 Orbitrap datasets, Supplementary Table 2) as a benchmark set to assess the performance of CFM-ID, MetFrag, MS-FINDER, and SIRIUS (Supplementary Fig. 13). Most tools showed high accuracy, with Top 3 annotation rates ranging from 76.6% to 89.4%. We note that these tools may have included some of the validation metabolites in their training data, potentially contributing to the high accuracy. Together, while cross-tool concordance may overestimate performance due to the lack of chemical standards, it remains a reasonable and practical approach for assessing annotation reliability. However, unlike network-based approaches, these methods do not incorporate network topology, meaning they do not account for the MS2 spectral and metabolite structure similarities across the entire untargeted metabolomics dataset. We hypothesize that integrating structural elucidation tools, such as SIRIUS or de novo structure generation tools like MSNovelist16, with network-based annotation methods could significantly improve the accuracy of metabolite annotation and propagation. This improvement can come from the filtering and refinement of candidate structures during annotation propagation, as well as from enhancing the coverage and connectivity of the knowledge network.
However, it should be noted that while MS2 spectrum-based match is widely considered the gold standard for metabolite identification, it has inherent limitations. For small molecules, MS2 spectra often lack sufficient fragment information, and isomeric metabolites may be indistinguishable based on MS2 spectra alone. Consequently, both network-based annotation propagation and validation steps that rely on MS2 spectral matching may lead to optimistic FDR estimates. This limitation warrants careful consideration, and strategies to address it should be explored in future work. Therefore, incorporating orthogonal approaches such as retention time (RT) and collision cross section (CCS) is increasingly important for improving small molecule discrimination. Integrating these complementary data into network-based annotation propagation can also help control false positives, ultimately enhancing the accuracy of metabolite annotation in untargeted metabolomics.
In this work, two previously uncharacterized endogenous metabolites, absent from human metabolome databases, were successfully annotated and validated. This capability is enabled by generating new metabolite structures using the previously reported tool BioTransformer34. BioTransformer employs known metabolic reaction rules to make in-silico predictions of new metabolites. While it has proven effective in certain cases, such as γ-Glu-Thr-Gly and N-glycolyltaurine in our study, the success rate remains limited. Moreover, BioTransformer-based generation of unknown metabolites may be limited by its predefined reaction types, potentially introducing species-specific biases. For example, in MetDNA, unknown metabolite generation relied heavily on mammalian-derived reactions, which may bias results toward human or mammalian metabolism. We recommend that users interpret annotation results with caution and consider constructing customized unknown metabolite databases and metabolic reaction networks tailored to their specific biological context. We hypothesize that the generation rules could be further improved to better reflect the vast chemical space and complex reactions occurring in living biological systems. The emergence of generative AI models presents new opportunities for curating knowledge databases of unknown metabolites. Recently, a language model-based generative model, DeepMet40, has been developed to learn the latent biosynthetic logic embedded within the structures of known metabolites and apply it to generate novel metabolite structures. Moving forward, combining advanced generative models with network-based metabolite annotation will be crucial for expanding our understanding of the metabolome and accelerating the discovery of unknown metabolites.
Methods
Chemicals and standards
LC–MS grade methanol (MeOH), 2-propanol (IPA), and water (H2O) were purchased from Honeywell (Muskegon, MI, USA). LC–MS grade acetonitrile (ACN) was purchased from Merck (Darmstadt, Germany). Ammonium hydroxide (NH4OH) and ammonium acetate (NH4OAc) were purchased from Sigma-Aldrich (St. Louis, MO, USA). Metabolite chemical standards were purchased from Sigma-Aldrich (St. Louis, MO, USA), J&K (Shanghai, China), TopScience (Shanghai, China), TCI (Tokyo, Japan), TRC (Toronto, Canada), Aladdin (Shanghai, China), and InnoChem (Beijing, China).
Sample preparation
The NIST human urine (SRM 3667) and NIST human plasma (SRM 1950) reference materials were purchased from Ango Biotechnology Co. (Shanghai, China). Human plasma or urine samples (100 µL) were extracted using MeOH (300 µL). The samples were vortexed for 30 seconds and subjected to sonication for 15 minutes, and then were incubated for 1 hour at −20 °C, followed by centrifugation at 17,000 × g and 4 °C for 15 minutes to precipitate proteins. The resulting supernatants were transferred into LC sample vials and stored at −80 °C prior to LC–MS analysis.
Six-week-old male mice (C57BL/6J; n = 6) were group-housed in a barrier facility at a room temperature of 22 °C with 50% humidity and 12 h light/12 h dark cycles. The C57BL/6J strain was obtained from Vital River Laboratories (Beijing, China). For mouse brain and liver tissues, 20 mg of tissue was homogenized in 200 μL of H2O under low-temperature conditions. Subsequently, 800 μL of MeOH:ACN (1:1, v/v) was added to 200 μL of the homogenate for metabolite extraction. The mixture was vortexed for 30 s and sonicated for 10 min in a 4 °C water bath. To facilitate protein precipitation, the samples were incubated at −20 °C for 1 h, followed by centrifugation at 17,000 × g for 15 min at 4 °C. The supernatant was collected and evaporated to dryness at 4 °C using a vacuum concentrator. Dried samples were stored at −80 °C.
BV2 cells were plated in 6-cm dishes at ~2 × 10⁶ cells/dish, and cultured in DMEM medium containing FBS (10%) and penicillin/streptomycin (1%). Extraction solution ACN/MeOH/H2O (2:2:1, v/v/v) was pre-cooled at −80 °C for 1 h prior to use, ensuring no ice formation. Once cells reached the target number, the culture medium was aspirated completely, and cells were washed rapidly with 1 mL of PBS. The dish was placed on dry ice, and 1000 μL of pre-cooled extraction solution was added, followed by incubation at −80 °C for at least 40 min. During transfer, the dish was maintained on dry ice. Cells were then scraped and transferred into a 2-mL EP tube. An additional 500 μL of extraction solution was used to rinse the dish, and the combined extract (1.5 mL total) was collected. The samples were vortexed for 1 min at 4 °C, followed by centrifugation at 17,000 × g for 15 min at 4 °C to pellet insoluble materials. The supernatant was transferred to a new EP tube and evaporated to dryness at 4 °C using a vacuum concentrator. Dried samples were stored at −80 °C.
Prior to LC–MS analysis, dried extracts of mouse brain tissues, mouse liver tissues, and BV2 cells were reconstituted in 100 μL of ACN:H2O (1:1, v/v), vortexed for 30 s, and sonicated for 10 min at 4 °C in a water bath. The reconstituted samples were centrifuged at 17,000 × g for 15 min at 4 °C, and the supernatant was transferred into LC sample vials for LC–MS analysis.
Data acquisition
The LC–MS-based untargeted metabolomics data acquisition was performed using a Vanquish UHPLC system coupled to an Orbitrap mass spectrometer (Exploris 480, Thermo Fisher Scientific) or an Astral mass spectrometer (Thermo Fisher Scientific). LC separation was performed using a Waters ACQUITY UPLC BEH Amide column (1.7 μm particle size, 100 mm length × 2.1 mm inner diameter) for HILIC separation, and a Kinetex C18 column (2.6 μm, 2.1 × 100 mm) for reverse phase (RP) separation. The column temperature was maintained at 25 °C. A 2 μL injection volume was commonly used. For HILIC separation, mobile phase A consisted of 25 mM ammonium hydroxide (NH4OH) and 25 mM ammonium acetate (NH4OAc) in water, and B consisted of acetonitrile (ACN), in both the positive (ESI+) and negative (ESI−) modes. The flow rate was set at 0.5 mL/min, and the gradient was programmed as follows: 0−0.5 min, 95% B; 0.5−7 min, 95% B to 65% B; 7−8 min, 65% B to 40% B; 8−9 min, 40% B; 9−9.1 min, 40% B to 95% B; and 9.1−12 min, 95% B. For RP separation, mobile phase A was 0.01% acetic acid in water, and B was a mixture of isopropyl alcohol and acetonitrile (1:1), in both the positive (ESI+) and negative (ESI−) modes. The flow rate was 0.3 mL/min, and the gradient was programmed as follows: 0−1 min, 1% B; 1−8 min, 99% B; 8−9 min, 99% B; 9.0−9.1 min, 99% B to 1% B; and 9.1−12 min, 1% B.
For the Orbitrap instrument, LC–MS data acquisition was operated in full scan mode with positive/negative ion polarity switching, and ddMS2 scans were applied to acquire MS/MS spectra for all samples. ESI source parameters were set as follows: spray voltage, 3500 V or −2800 V, in positive or negative mode, respectively; sheath gas, 50 arbs; aux gas, 15 arbs; sweep gas, 2 arbs; ion transfer tube temperature, 350 °C; vaporizer temperature, 400 °C or 350 °C, in HILIC or RPLC analysis, respectively. The full MS-scan was set as: Orbitrap resolution, 60,000; AGC target, 1e6; maximum injection time, 100 ms; scan range, 70–1200 Da. The ddMS2 scan was set as: Orbitrap resolution, 30,000; AGC target, 1e5; maximum injection time, 60 ms; scan range, 50–1200 Da; top N setting, 6; isolation width, 1.0 m/z; collision energy mode, stepped; collision energy type, normalized; HCD collision energies (%), 20-30-40; exclusion duration, 4 s.
For the Astral instrument, LC–MS data acquisition was performed in full scan mode to collect MS1 ion information and in ddMS2 mode to acquire MS/MS spectra for all samples in both positive and negative ionization modes. ESI source parameters were set as follows: spray voltage, 3500 V or −3000 V, in positive or negative mode, respectively; sheath gas, 50 arbs; aux gas, 10 arbs; sweep gas, 1 arb; ion transfer tube temperature, 350 °C; vaporizer temperature, 350 °C. The full MS-scan was set as: Orbitrap resolution, 60,000; AGC target, standard; maximum injection time, 100 ms; scan range, 70–1050 Da; data dependent mode, cycle time; cycle time, 0.6 s. The ddMS2 scan was set as: detector type, Astral; data type, centroid; AGC target, standard; maximum injection time, 5 ms; scan range, 50–1050 Da; isolation width, 1.5 m/z; collision energy type, normalized; HCD collision energies (%), 30; exclusion duration, 5 s.
Data pre-processing and metabolite annotation
For Orbitrap datasets, the raw LC−MS data (.raw) of all samples were converted to .mzXML (for full scan mode) and .mgf (for ddMS2 mode) format using ProteoWizard MSConvert (version 3.0.20360). The mzXML data files of samples were grouped for data preprocessing, including peak detection, retention time correction, and peak grouping, using the R package “xcms” (version 3.12.0). The key parameters were set as follows: method: “centWave”; ppm: 10; snthr: 3; peakwidth: c(5,30); minfrac: 0.5. For Astral datasets, the raw LC−MS data (.raw) files were converted to .mzXML (for full scan mode) and .mgf (for ddMS2 mode) format using ProteoWizard MSConvert (version 3.0.24214). The mzXML data files of samples were grouped and imported into Met4DX41 (version 2.2.0, http://met4dx.zhulab.cn/) for data preprocessing, including peak detection, retention time correction, and peak grouping. The key parameters were set as follows: Method: centWave; Peak width: min = 5, max = 30; S/N threshold: 6; Minimum fraction: 0.5. After data preprocessing, the feature tables and MS2 spectral files were imported into MetDNA (version 3.1, http://metdna.zhulab.cn/) for metabolite annotation. MetDNA parameters were adjusted according to the liquid chromatography mode: “HILIC” or “RP”. Collision energy settings of “SNCE_20_30_40%” were applied. Metabolite annotation was conducted separately for the positive (ESI+) and negative (ESI−) ionization modes.
Curation of knowledge-based metabolic reaction network
The metabolic reaction network was constructed based on the knowledge databases with network reconstruction and expansion (Fig. 1a). The knowledge databases integrated metabolites from KEGG (downloaded on September 11, 2023), MetaCyc (downloaded on September 16, 2023), and HMDB (downloaded on September 9, 2023). In-silico generated lipid structures lacking CAS numbers were removed from HMDB. Inorganic compounds and elemental substances were also excluded. Metabolites were merged using the first layer of InChIKey (14 characters) to unify molecular identities. The final knowledge databases comprised 53,583 metabolites, including 14,355 from KEGG, 16,112 from MetaCyc, and 38,834 from HMDB (Supplementary Fig. 1a and b). Metabolite reaction pairs (RPs) were retrieved from the KEGG reaction pair database (KEGG RCLASS) and MetaCyc, with inorganic compounds and elemental substances removed. Among the 53,583 metabolites, 11,532 were associated with a total of 14,974 reported reaction pairs in these knowledge databases, representing the original MRN directly curated from knowledge databases (Supplementary Table 1).
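The merging step can be sketched as follows. This is a minimal illustration, assuming each database record is available as a (database, InChIKey) tuple; the function name and the three example records are hypothetical and do not reproduce the authors' pipeline.

```python
from collections import defaultdict

def merge_by_inchikey_block(records):
    """Group records by the first 14 characters of the InChIKey
    (the connectivity block), so that entries differing only in
    stereochemistry or charge layers collapse into one metabolite."""
    merged = defaultdict(set)
    for source_db, inchikey in records:
        merged[inchikey[:14]].add(source_db)
    return dict(merged)

# Illustrative records: L- and D-alanine share the first block, glycine does not.
records = [
    ("KEGG", "QNAYBMKLOCPYGJ-REOHSZPYSA-N"),   # L-alanine
    ("HMDB", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),   # D-alanine
    ("MetaCyc", "DHMQDGOQFOQNFH-UHFFFAOYSA-N"),  # glycine
]
merged = merge_by_inchikey_block(records)
```

Because the per-database counts (14,355 + 16,112 + 38,834) exceed the merged total of 53,583, a grouping of this kind is what removes entries shared between databases.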
For network reconstruction, a graph neural network-based model for reaction relationship prediction was trained and then applied to identify potential metabolic reaction relationships between metabolites. The methodological details of the model training are provided in the “Graph neural network-based model for reaction relationship prediction” section. Specifically, we applied this model to predict potential reaction pairs within the metabolite database containing 53,583 metabolites. Exhaustively pairing all metabolite combinations would yield 1,435,542,153 (1.44 billion) possible pairs. However, most of these are likely non-reaction pairs (non-RPs), and direct prediction on this full set would result in a high number of false positives. To address this, we implemented a two-step pre-screening strategy to substantially reduce the number of candidate metabolite pairs (Supplementary Fig. 1i). First, we matched the mass difference (Δmass) between two metabolites to the frequently observed mass differences derived from the 14,974 reported reaction pairs in the metabolite knowledge databases (KEGG and MetaCyc), encompassing 2,045 unique Δmass values. Only metabolite pairs with matched Δmass values were retained. This filtering step reduced the number of candidate pairs to 52,963,442 (52.9 million). Second, we further filtered these pairs by the structural overlap coefficient of each pair (calculated using RDKit). Pairs with a coefficient ≥ 0.7 were retained, resulting in a final set of 1,665,846 (1.66 million) pairs. These pairs were then used as input for the graph neural network (GNN)-based model to predict reaction relationships. Among them, 1,432,469 metabolite pairs were predicted as possible reaction pairs.
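The two-step pre-screening can be sketched in plain Python. Several details here are assumptions made for illustration: fingerprints are represented as sets of on-bits rather than RDKit objects, the overlap coefficient is taken to be the Szymkiewicz-Simpson form (shared bits divided by the smaller set), and the Δmass tolerance `tol` is hypothetical because the text does not state one.

```python
from math import comb

def overlap_coefficient(bits_a, bits_b):
    """Overlap coefficient of two fingerprint bit sets (assumed definition)."""
    if not bits_a or not bits_b:
        return 0.0
    return len(bits_a & bits_b) / min(len(bits_a), len(bits_b))

def prescreen(pairs, masses, fingerprints, known_dmass,
              tol=0.001, min_overlap=0.7):
    """Step 1: keep pairs whose mass difference matches a known reaction Δmass.
    Step 2: keep pairs whose structural overlap coefficient is >= min_overlap."""
    kept = []
    for a, b in pairs:
        dmass = abs(masses[a] - masses[b])
        if not any(abs(dmass - d) <= tol for d in known_dmass):
            continue  # fails the Δmass filter
        if overlap_coefficient(fingerprints[a], fingerprints[b]) >= min_overlap:
            kept.append((a, b))
    return kept

# Number of exhaustive pairs among the 53,583 metabolites
n_all_pairs = comb(53583, 2)

# Toy data (hypothetical masses and fingerprints)
masses = {"A": 180.0634, "B": 162.0528, "C": 200.0000}
fingerprints = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {9}}
kept = prescreen([("A", "B"), ("A", "C"), ("B", "C")],
                 masses, fingerprints, known_dmass=[18.0106])
```

The count `n_all_pairs` reproduces the 1,435,542,153 exhaustive pairs quoted above; in the toy data only the A-B pair (a water-loss mass difference with fully shared bits) survives both filters.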
Finally, we combined the 1,432,469 GNN-predicted reaction pairs with the 14,974 reported reaction pairs from databases, resulting in a curated MRN comprising a total of 47,748 metabolites and 1,447,443 reaction pairs, including both those from knowledge databases and those predicted by the GNN model (Supplementary Table 1). For network expansion, known metabolites were used as input for BioTransformer (version 3.0) to generate potential unknown metabolites. The parameter settings are detailed in Supplementary Table 3. All curated unknown metabolites were merged using the first layer of InChIKey (14 characters) to remove stereoisomeric redundancies. The elemental composition of unknown metabolites was restricted to CHONPS, and reaction pairs involving unknown metabolites were retained only if they exhibited a Dice structural similarity greater than 0.7. Unpaired unknown metabolites were discarded. This process resulted in 718,007 unknown metabolites and 990,441 BioTransformer-generated reaction pairs, derived from 258 chemical reactions and 101 enzymes. Finally, the curated MRN, which includes both known and unknown metabolites along with their reaction relationships, consists of 765,755 metabolites (47,748 known and 718,007 unknown metabolites) and 2,437,884 reaction pairs (14,974 reported RPs, 1,432,469 predicted RPs, and 990,441 BioTransformer-generated RPs; Supplementary Table 1).
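As a quick consistency check, the node and edge totals reported above can be re-derived from their components. The variable names are illustrative, and the check assumes the three edge sets are counted as disjoint, as the reported totals imply.

```python
# Node counts reported for the curated MRN
known_metabolites = 47_748
unknown_metabolites = 718_007

# Edge (reaction pair) counts by origin
reported_rps = 14_974
predicted_rps = 1_432_469
biotransformer_rps = 990_441

# Known-metabolite MRN: reported plus GNN-predicted pairs
known_mrn_edges = reported_rps + predicted_rps

# Full MRN after expansion with unknown metabolites
total_nodes = known_metabolites + unknown_metabolites
total_edges = known_mrn_edges + biotransformer_rps
```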
Graph neural network-based model for reaction relationship prediction
A graph neural network-based model was trained to identify potential metabolic relationships between metabolites from knowledge databases. The training dataset comprised reaction pairs (RPs) and non-reaction pairs (non-RPs). Specifically, RPs (n = 14,974) were retrieved from the knowledge databases, while non-RPs (n = 14,974) were randomly sampled from metabolites without reaction relationships in the knowledge databases. The combined dataset of RPs and non-RPs was randomly split into training, validation, and testing sets at a ratio of 8:1:1 for model training and evaluation (Fig. 1b). The model architecture is illustrated in Supplementary Fig. 2. The model integrated multiple molecular features of a pair of two metabolites to predict their reaction relationship. The input consisted of two SMILES representations of two metabolites. Graph-based features were generated from the molecular graph using a graph convolutional network. The encoding of the molecular graph and the design of graph convolutional layers followed that described in our previous publication42. Additionally, molecular fingerprints and Tanimoto structural similarity were computed to represent structural similarities. Reaction class information and frequency were incorporated from the knowledge databases. All features were concatenated and processed through fully connected and dense layers to estimate the probability of classification, determining whether a potential reaction relationship exists between the two metabolites.
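Two of the ingredients described above, the Tanimoto similarity feature and the 8:1:1 split, can be sketched without any chemistry library. The fingerprints are represented as sets of on-bits, the split is assumed to be a plain (unstratified) random shuffle, and the function names and seed are hypothetical.

```python
import random

def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprint bit sets."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def split_8_1_1(items, seed=0):
    """Shuffle and split items into training, validation, and testing
    sets at a ratio of 8:1:1 (remainder goes to the testing set)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 14,974 RPs plus 14,974 non-RPs, represented here by integer indices
train, val, test = split_8_1_1(range(2 * 14974))
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
```

Sampling as many non-RPs as RPs gives a balanced binary classification problem, which is why the combined set has 29,948 pairs.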
During model optimization, ReLU was employed as the activation function for both graph convolutional and dense layers, with L2 regularization applied to the layer weights. The optimizer was Adam, and Huber loss was selected as the loss function. Model training was conducted on a server equipped with four RTX 2080 Ti GPUs (CUDA version 11.6) and 192 GB of memory. The implementation utilized Python (version 3.9.7) along with TensorFlow (version 2.8.0), Bayesian-optimization (version 1.3.1), pandas (version 1.4.1), numpy (version 1.22.2), and RDKit (version 2022.9.1). For hyperparameter tuning, Bayesian optimization (10 initializations and 100 iterations) was employed using the training set, while the validation set was used to monitor Huber loss (Supplementary Table 4). The optimized parameters were then used for final model training, and probability calibration was performed using isotonic regression (Supplementary Fig. 1f). To determine the model threshold for distinguishing RPs from non-RPs, the weighted Youden index was computed, with 0.7 selected as the threshold (Supplementary Fig. 1g and h). The weighted Youden index43 was calculated as Eq. 1:
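$$J_w = 2\left[\,w \cdot \mathrm{sensitivity} + (1 - w) \cdot \mathrm{specificity}\,\right] - 1 \qquad (1)$$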
where w represents the weight, set to 0.2 in this study.
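Threshold selection with a weighted Youden index can be sketched as below. The sketch assumes the commonly used definition J_w = 2[w × sensitivity + (1 − w) × specificity] − 1, in which w weights sensitivity; the function names, candidate thresholds, and toy probabilities are hypothetical and do not reproduce the model outputs.

```python
def weighted_youden(sensitivity, specificity, w=0.2):
    """Weighted Youden index (assumed definition); w = 0.5 gives the
    classic index sensitivity + specificity - 1."""
    return 2.0 * (w * sensitivity + (1.0 - w) * specificity) - 1.0

def best_threshold(probs, labels, thresholds, w=0.2):
    """Return (threshold, index) maximizing the weighted Youden index.
    Labels are 1 for reaction pairs and 0 for non-reaction pairs."""
    best = None
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= t)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = weighted_youden(sensitivity, specificity, w)
        if best is None or j > best[1]:
            best = (t, j)
    return best

# Toy calibrated probabilities (hypothetical)
probs = [0.95, 0.85, 0.75, 0.60, 0.80, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
chosen, score = best_threshold(probs, labels, [0.5, 0.7, 0.9])
```

Under this definition, w = 0.2 places more weight on specificity than on sensitivity, which favors stricter thresholds and therefore fewer false-positive reaction pairs.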
Calculation of network topological properties
To characterize the structural and functional features of the networks, a series of topological properties were computed using the R package igraph (version 2.0.3) and Python package networkit (version 11.1). These properties were categorized into three groups: Information, Connectivity, and Community. Information-level metrics include the number of nodes and edges, the number of connected components, and the number of nodes in the largest connected component. Network components were obtained using the components() function in R. Connectivity-level metrics include average degree, network density, global clustering coefficient, average shortest path length, and network diameter. Node degree was computed using the degree() function, and the average degree was calculated as the mean degree across all nodes. Network density was computed using the edge_density() function in R. The global clustering coefficient was calculated as the ratio of the number of closed triplets to the total number of triplets in the network, where a triplet is defined as three nodes connected by at least two edges. It was calculated using the transitivity() function with the setting type = “global” in R. The average shortest path length was computed using the mean_distance() function, and the network diameter was obtained via the diameter() function in R. Community-level metrics were derived using the Louvain method, a modularity-based community detection algorithm, implemented via networkit.community.detectCommunities() function in Python.
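For readers without igraph at hand, three of the connectivity metrics can be reproduced in a few lines of plain Python. This is a minimal sketch for simple undirected graphs, not the authors' code; the function name and the two toy graphs are illustrative.

```python
from itertools import combinations

def topology_metrics(nodes, edges):
    """Density, average degree, and global clustering coefficient
    of a simple undirected graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    avg_degree = 2 * m / n if n else 0.0
    triplets = 0  # paths of length two centered on a node
    closed = 0    # triplets whose two end nodes are also connected
    for nb in adj.values():
        for a, b in combinations(nb, 2):
            triplets += 1
            if b in adj[a]:
                closed += 1
    transitivity = closed / triplets if triplets else 0.0
    return {"density": density, "average_degree": avg_degree,
            "transitivity": transitivity}

nodes = ["a", "b", "c", "d"]
fully_connected = topology_metrics(nodes, list(combinations(nodes, 2)))
path_graph = topology_metrics(nodes, [("a", "b"), ("b", "c"), ("c", "d")])
```

A fully connected network such as FCN reaches the maximum of 1.0 for both density and clustering, whereas a sparse chain of the same nodes has density 0.5 and no closed triplets.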
Two-layer interactive networking topology strategy implemented in MetDNA3
The two-layer interactive networking topology, which integrates knowledge-driven and data-driven approaches to achieve recursive-based metabolite annotation propagation, has been implemented in the latest version of MetDNA3. The detailed workflow of MetDNA3 is as follows:
Annotation of seed metabolites. Seed metabolites were first annotated through matching their experimental values such as MS1, retention time, and MS2 spectra to the standard libraries. These seed metabolites were classified as level 1 annotation in MetDNA3. The match tolerances were set as MS1 match, 15 ppm; RT match, 25 s; MS2 spectral match, 0.8 (dot product score). MS2 spectral matching has no restriction on the number of matched fragment ions. The adducts of protonation and deprotonation were used in seed annotation in positive and negative modes, respectively. The curation of in-house standard libraries was carried out following the methodology outlined in our previous publication on MetDNA2.
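The three seed-matching criteria can be sketched as follows. The tolerances (15 ppm, 25 s, score ≥ 0.8) come from the text; everything else is an assumption for illustration, including the plain cosine-type dot product with greedy peak matching, the fragment tolerance `mz_tol`, and the dictionary layout of features and standards. MetDNA3's actual scoring function may differ.

```python
import math

def ppm_error(mz_obs, mz_ref):
    """Absolute m/z error in parts per million."""
    return abs(mz_obs - mz_ref) / mz_ref * 1e6

def dot_product(spec_a, spec_b, mz_tol=0.01):
    """Cosine-type score between two centroided spectra given as
    lists of (m/z, intensity); each peak is matched at most once."""
    used = set()
    cross = 0.0
    for mz_a, int_a in spec_a:
        best_j, best_d = None, mz_tol
        for j, (mz_b, _) in enumerate(spec_b):
            d = abs(mz_a - mz_b)
            if j not in used and d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            cross += int_a * spec_b[best_j][1]
    norm = math.sqrt(sum(i * i for _, i in spec_a) *
                     sum(i * i for _, i in spec_b))
    return cross / norm if norm else 0.0

def is_seed_match(feature, standard, ppm_tol=15.0, rt_tol=25.0, ms2_cutoff=0.8):
    """True if MS1 m/z, RT (s), and MS2 similarity all pass."""
    return (ppm_error(feature["mz"], standard["mz"]) <= ppm_tol
            and abs(feature["rt"] - standard["rt"]) <= rt_tol
            and dot_product(feature["ms2"], standard["ms2"]) >= ms2_cutoff)

# Toy feature and library entry (hypothetical values)
standard = {"mz": 180.0634, "rt": 310.0, "ms2": [(60.001, 100.0), (85.002, 50.0)]}
feature = {"mz": 180.0640, "rt": 300.0, "ms2": [(60.000, 100.0), (85.000, 50.0)]}
late_feature = dict(feature, rt=400.0)
```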
Two-layer interactive networking topology for recursive metabolite annotation.
Step 1: curation of two-layer network topology through data and knowledge pre-mapping
The curated MRN was pre-mapped to the experimental data (metabolic features) through MS1 m/z matching and predicted RT matching, with tolerances of 15 ppm for m/z and 30% for RT. Because RT prediction accuracy is still limited, we adopted a relatively wide RT tolerance to account for variability across platforms and conditions. This process produced pre-mapped cross-network links (metabolite-feature links) between the knowledge layer and the data layer. RT prediction was consistent with MetDNA2: a random forest model was trained on the retention times of seed metabolites together with their molecular descriptors.
Subsequently, the reaction relationships derived from the MRN (edges of the knowledge layer) were mapped to the data layer to generate feature pairs, constructing the feature network. Specifically, for a given feature, its linked metabolites in the knowledge layer were first retrieved (guided by the metabolite-feature links). Then, all reaction-paired neighboring metabolites of these metabolites were obtained (guided by the knowledge layer). Finally, all features linked to these neighboring metabolites were retrieved, again guided by the metabolite-feature links, thereby forming the feature network. The MS2 spectral similarity between linked features in the feature network was then calculated and applied as a constraint (dot product score ≥ 0.5), resulting in a knowledge-constrained feature network. MS2 spectral similarity was calculated with a modified dot-product function, following the method described in our previous MetDNA publication19. Specifically, when the precursor m/z of the seed feature is greater than that of the neighboring feature, fragment ions in the seed MS2 spectrum with m/z values exceeding the precursor m/z of the neighbor are excluded. Conversely, when the neighbor has the higher precursor m/z, fragment ions in the neighbor MS2 spectrum with m/z values greater than the precursor m/z of the seed are removed. The resulting MS2 spectral similarity-based edge relationships were then mapped back to the knowledge layer, generating a data-constrained MRN.
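The pre-mapping in Step 1 can be illustrated with a minimal Python sketch. It is not the MetDNA3 implementation (which is written in R). The function names, the dictionary layout, the use of a relative RT error, the fragment tolerance of 0.01, and the plain cosine score are all assumptions made for illustration. The modified dot product of the MetDNA publication is only approximated here by truncating both spectra at the smaller precursor m/z.

```python
from itertools import product

PPM_TOL = 15.0   # MS1 m/z tolerance stated in the text
RT_TOL = 0.30    # RT tolerance (30%), assumed relative to the predicted RT
MS2_MIN = 0.5    # MS2 similarity threshold stated in the text


def link_metabolites_to_features(metabolites, features):
    """Return the (metabolite, feature) links that pass both tolerances."""
    links = set()
    for met_id, met in metabolites.items():
        for feat_id, feat in features.items():
            ppm = abs(feat["mz"] - met["mz"]) / met["mz"] * 1e6
            rt_err = abs(feat["rt"] - met["pred_rt"]) / met["pred_rt"]
            if ppm <= PPM_TOL and rt_err <= RT_TOL:
                links.add((met_id, feat_id))
    return links


def truncated_cosine(spec_a, prec_a, spec_b, prec_b, frag_tol=0.01):
    """Cosine score after dropping fragments above the smaller precursor m/z."""
    limit = min(prec_a, prec_b)
    a = [(mz, inten) for mz, inten in spec_a if mz <= limit]
    b = [(mz, inten) for mz, inten in spec_b if mz <= limit]
    used, dot = set(), 0.0
    for mz_a, int_a in a:
        best = None
        for j, (mz_b, _) in enumerate(b):
            if j in used or abs(mz_a - mz_b) > frag_tol:
                continue
            if best is None or abs(mz_a - mz_b) < abs(mz_a - b[best][0]):
                best = j
        if best is not None:  # each fragment in b is matched at most once
            used.add(best)
            dot += int_a * b[best][1]
    norm_a = sum(i * i for _, i in a) ** 0.5
    norm_b = sum(i * i for _, i in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_feature_network(links, reaction_edges, similarity):
    """Map reaction edges onto features; keep pairs with similarity >= MS2_MIN."""
    feats_of = {}
    for met_id, feat_id in links:
        feats_of.setdefault(met_id, set()).add(feat_id)
    edges = set()
    for met_u, met_v in reaction_edges:
        pairs = product(feats_of.get(met_u, ()), feats_of.get(met_v, ()))
        for f_u, f_v in pairs:
            if f_u != f_v and similarity(f_u, f_v) >= MS2_MIN:
                edges.add(frozenset((f_u, f_v)))
    return edges


# Toy data (hypothetical values, for illustration only).
metabolites = {"M1": {"mz": 148.0604, "pred_rt": 100.0},
               "M2": {"mz": 130.0499, "pred_rt": 120.0}}
features = {
    "F1": {"mz": 148.0605, "rt": 110.0,
           "ms2": [(56.05, 30.0), (84.04, 100.0), (131.03, 40.0)]},
    "F2": {"mz": 130.0500, "rt": 115.0,
           "ms2": [(56.05, 20.0), (84.04, 100.0)]},
}
links = link_metabolites_to_features(metabolites, features)
feature_edges = build_feature_network(
    links,
    reaction_edges=[("M1", "M2")],
    similarity=lambda u, v: truncated_cosine(
        features[u]["ms2"], features[u]["mz"],
        features[v]["ms2"], features[v]["mz"]),
)
```

In the toy data, the fragment at m/z 131.03 in F1 lies above the precursor m/z of F2 and is therefore ignored when the two spectra are compared.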
Step 2: recursive-based metabolite annotation propagation
The annotations of seed metabolites formed metabolite-feature pairs, which were input into the knowledge layer and the data layer of the two-layer network topology. The recursive metabolite annotation propagation was driven entirely by topological connections. Specifically, for a given seed metabolite, its neighbor metabolites were searched within the knowledge layer. Simultaneously, the corresponding feature of the seed metabolite was used to retrieve its neighbor features within the data layer. All retrieved neighbor metabolites and features were then systematically paired, generating all possible metabolite-feature pair combinations. Only those metabolite-feature pairs present in the metabolite-feature links were retained as propagated annotations. These propagated annotations (metabolite-feature pairs) subsequently served as new seeds for the next round of recursive annotation. This iterative process continued until no further annotations could be generated. The recursive annotations adhered to the four constraints established in previous MetDNA publications: MS1 m/z, predicted RT, MS2 similarity, and metabolic reaction transformation. Specifically, the MS1 m/z and RT constraints were applied during the initial matching of MS1 m/z and predicted RT, the MS2 similarity constraint was enforced within the feature network, and metabolic reaction transformations were integrated into the MRN.
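The propagation loop described above can be sketched as a breadth-first traversal over metabolite-feature pairs. This is a simplified illustration in Python, not the MetDNA3 code. The function name `propagate` and the dictionary-based adjacency inputs are hypothetical, and a queue replaces the round-by-round bookkeeping, which should yield the same final set of annotations.

```python
from collections import deque


def propagate(seeds, met_neighbors, feat_neighbors, links):
    """Propagate annotations over the two-layer topology.

    seeds: iterable of (metabolite, feature) pairs
    met_neighbors: dict metabolite -> neighbor metabolites (knowledge layer)
    feat_neighbors: dict feature -> neighbor features (data layer)
    links: set of pre-mapped (metabolite, feature) pairs
    Returns a dict mapping each annotated pair to its propagation round
    (0 for seeds).
    """
    annotated = {pair: 0 for pair in seeds}
    queue = deque(annotated)
    while queue:  # stops when no new annotation can be generated
        met, feat = queue.popleft()
        for n_met in met_neighbors.get(met, ()):
            for n_feat in feat_neighbors.get(feat, ()):
                pair = (n_met, n_feat)
                # Keep only pairs supported by a pre-mapped cross-layer link.
                if pair in links and pair not in annotated:
                    annotated[pair] = annotated[(met, feat)] + 1
                    queue.append(pair)  # the new annotation becomes a seed
    return annotated
```

Because every step is a dictionary or set lookup on the pre-built topology, no m/z, RT, or MS2 computation is repeated during propagation.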
Step 3: redundancy removal of recursive-based annotations
To refine the results obtained from recursive-based metabolite annotation and propagation, we implemented a stepwise redundancy removal process. First, for features annotated with multiple metabolite candidates, priority was given to seed metabolites, followed by known metabolites (level 3.1), while all other candidates were discarded. Next, the experimental RTs and molecular descriptors of annotated metabolites were used to retrain the RT prediction model. For cases where multiple features were assigned to the same metabolite, the annotation with the minimal RT match error was retained. When a single feature was annotated with multiple metabolites, only the top 5 ranked candidates were retained based on the total score.
Finally, the global peak correlation network developed in MetDNA2 was used to annotate all possible ion forms of each annotated metabolite in the feature table. Metabolite annotation confidence levels and reporting followed MetDNA2.
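One possible reading of the candidate filtering in Step 3 is sketched below in Python. The record layout (`feature`, `metabolite`, `kind`, `rt_error`, `score`) and the function name are hypothetical. The sketch assumes that, per feature, only the highest-priority class of candidates that is present is kept, and it does not cover the retraining of the RT model or the ion-form annotation.

```python
PRIORITY = {"seed": 0, "known": 1, "unknown": 2}  # lower value is preferred


def remove_redundancy(annotations, top_n=5):
    """Filter candidate annotations; returns dict feature -> ranked candidates."""
    # 1) Per feature, keep only the highest-priority class that is present.
    best_kind = {}
    for a in annotations:
        rank = PRIORITY[a["kind"]]
        best_kind[a["feature"]] = min(rank, best_kind.get(a["feature"], rank))
    kept = [a for a in annotations
            if PRIORITY[a["kind"]] == best_kind[a["feature"]]]

    # 2) Per metabolite, keep the feature with the smallest RT match error.
    best_rt = {}
    for a in kept:
        current = best_rt.get(a["metabolite"])
        if current is None or a["rt_error"] < current["rt_error"]:
            best_rt[a["metabolite"]] = a

    # 3) Per feature, keep the top-N candidates ranked by total score.
    by_feature = {}
    for a in best_rt.values():
        by_feature.setdefault(a["feature"], []).append(a)
    return {
        feat: sorted(cands, key=lambda a: a["score"], reverse=True)[:top_n]
        for feat, cands in by_feature.items()
    }
```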
Evaluation of coverage and correct rate
Twenty Orbitrap-acquired datasets were employed to evaluate the coverage and correct rate of network-based annotation propagation, following the validation framework outlined in Fig. 3d. Specifically, 30% of the level 1 metabolite annotations were designated as seed metabolites for network-based annotation propagation, while the remaining 70% were reserved for validation. This process was repeated 10 times for each dataset using a 10-fold cross-validation approach. “Coverage” refers to whether a feature receives an annotation, regardless of the correctness of that annotation. It is calculated as shown in Eq. 2:
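| $$\mathrm{Coverage} = \frac{N_{\mathrm{annotated\ features}}}{N_{\mathrm{validation\ features}}} \times 100\%$$ | 2 |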
“Correct rate” refers to the proportion of annotated features for which the correct structure appears within the Top N ranked annotations. The calculation is as shown in Eq. 3:
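| $$\mathrm{Correct\ rate} = \frac{N_{\mathrm{annotated\ features\ with\ correct\ structure\ in\ Top}\ N}}{N_{\mathrm{annotated\ features}}} \times 100\%$$ | 3 |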
The “Top N” annotations were selected based on the rank of annotation scores. The top N ranked structures for each feature were retained as Top N annotations. The score was calculated by a combination of m/z match score, RT match score and MS2 dot product similarity score using Eq. 4:
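| $$\mathrm{Score} = w_{m/z} \cdot \mathrm{Score}_{m/z} + w_{\mathrm{RT}} \cdot \mathrm{Score}_{\mathrm{RT}} + w_{\mathrm{spec}} \cdot \mathrm{Score}_{\mathrm{spec}}$$ | 4 |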
| 5 |
| 6 |
| 7 |
where $\mathrm{Score}_{m/z}$ represents the m/z match score and is calculated as indicated in Eq. 5, $\mathrm{Score}_{\mathrm{RT}}$ represents the RT match score and is calculated using Eq. 6, and $\mathrm{Score}_{\mathrm{spec}}$ represents the MS2 spectral similarity score and is calculated as indicated in Eq. 7. A modified dot-product function was used to calculate MS2 spectral similarity, following the method described in our previous MetDNA publication19. $w$ represents a weight; the default values of $w_{m/z}$, $w_{\mathrm{RT}}$, and $w_{\mathrm{spec}}$ are 0.25, 0.25, and 0.5, respectively. The scoring function follows the previous MetDNA publication19.
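As an illustration of how the total score and the Top N evaluation fit together, the Python sketch below assumes that Eq. 4 is a weighted sum of the three component scores, as the default weights suggest, and that each component score lies between 0 and 1. The component scores themselves (Eqs. 5–7) are not reproduced. The function names, the input layout, and the use of the number of held-out features as the coverage denominator are assumptions.

```python
def total_score(score_mz, score_rt, score_spec,
                w_mz=0.25, w_rt=0.25, w_spec=0.5):
    """Weighted sum of the m/z, RT, and MS2 similarity scores (assumed form)."""
    return w_mz * score_mz + w_rt * score_rt + w_spec * score_spec


def evaluate(candidates, truth, n=3):
    """Return (coverage, correct_rate) as fractions for held-out features.

    candidates: dict feature -> list of (structure, total score)
    truth: dict feature -> correct structure, one entry per held-out feature
    """
    annotated = [f for f in truth if candidates.get(f)]
    hits = 0
    for f in annotated:
        ranked = sorted(candidates[f], key=lambda c: c[1], reverse=True)[:n]
        hits += any(structure == truth[f] for structure, _ in ranked)
    coverage = len(annotated) / len(truth) if truth else 0.0
    correct_rate = hits / len(annotated) if annotated else 0.0
    return coverage, correct_rate
```

With the default weights, the MS2 similarity score contributes as much to the total as the m/z and RT scores combined.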
Specifically, we define a true positive (TP) as a correct annotation ranked within the Top N candidates, and a false positive (FP) as an annotation that does not match the ground truth within the Top N. The FDR is then calculated using Eq. 8:
| $$\mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP}}$$ | 8 |
Corroborations of annotated metabolites using other bioinformatic tools
Four bioinformatic tools were employed to corroborate the metabolites annotated by MetDNA3: CFM-ID (version 4.0), MetFrag (version 2.4.6-CL), MS-FINDER (version 3.61), and SIRIUS (version 6.0.6).
For known metabolites (level 3.1; structures derived from KEGG, MetaCyc, and HMDB), each tool used the same standardized compound database as MetDNA3, comprising 53,583 unique metabolites integrated from KEGG, MetaCyc, and HMDB. Data formatting and parameter settings were adjusted according to the specifications of each tool. Detailed parameter configurations are provided in Supplementary Table 5. A total of 4302 features with level 3.1 annotations were collected from 20 Orbitrap datasets and subjected to corroboration using the above tools (Supplementary Data 3).
For unknown metabolites annotated by MetDNA3 (level 3.2; structures generated from BioTransformer), we corroborated the annotations using multiple in silico tools. A stepwise filtering strategy was applied to prioritize unknown candidates for validation. First, cross-validation was performed using multiple in silico tools, retaining candidates with MetFrag scores ≥ 0.5, MS-FINDER scores ≥ 5, and SIRIUS scores ≥ −200. Next, candidates recurrently detected across multiple biological datasets were prioritized, as such recurrence suggests higher biological relevance and reduces the likelihood of false positives. We collected all annotated unknown metabolites from the 20 Orbitrap datasets, resulting in a total of 1456 annotation candidates. The filtering process narrowed the list from 1456 to 5 candidate metabolites (Supplementary Table 6). We checked their MS2 spectra and synthesized two standards for validation: γ-Glu-Thr-Gly and N-glycolyltaurine.
Decoy MS2 spectra-based assessment of false positives in network-based metabolite annotation propagation
Decoy MS2 spectra were generated from experimental MS2 spectra using the spectrum-based method proposed by Scheubert et al.38 and incorporated into the MetDNA3 data layer for annotation propagation. To ensure that the decoy MS2 spectra adequately simulate incorrect spectra, the dot product similarity between each decoy MS2 spectrum and its original MS2 spectrum was required to be less than 0.5. For both experimental and decoy MS2 spectra, 30% were used as seed metabolites to initiate propagation, and the remaining 70% were used to validate the performance of network-based annotation propagation. For experimental MS2 spectra, annotation results were classified as true positives (TP) if annotated within rank ≤ 3, and false negatives (FN) otherwise. For decoy MS2 spectra, the same propagation process was applied, and results were considered false positives (FP) if annotated within rank ≤ 3, and true negatives (TN) otherwise (Supplementary Fig. 12). Notably, since MS2 similarity is computed between pairwise features in the network, a traditional null distribution of scores cannot be directly constructed. Therefore, a ranking-based approach was adopted to identify false positives and estimate the FDR (Supplementary Fig. 14).
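The rank-based classification above can be written down compactly. The following Python sketch is illustrative only: the function names are hypothetical, and `rank` is assumed to be the rank at which the expected structure was annotated for a held-out spectrum, or `None` when it was not annotated at all.

```python
from collections import Counter


def classify(rank, is_decoy, max_rank=3):
    """Label one held-out spectrum as TP, FN, FP, or TN."""
    hit = rank is not None and rank <= max_rank
    if is_decoy:
        return "FP" if hit else "TN"
    return "TP" if hit else "FN"


def confusion_counts(results, max_rank=3):
    """Count labels over an iterable of (rank, is_decoy) tuples."""
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for rank, is_decoy in results:
        counts[classify(rank, is_decoy, max_rank)] += 1
    return counts
```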
Benchmark of different knowledge-driven networks
The performance of recursive metabolite annotation propagation was benchmarked using three types of knowledge networks in MetDNA3: the metabolic reaction network (MRN), the structure-guided molecular network (SMN), and the fully connected network (FCN) (Supplementary Table 1). The knowledge databases consisted of 53,583 known metabolites. The MRN was constructed as previously described, excluding unknown metabolites and removing metabolites without reaction pairs. The SMN was generated following the approach outlined in publication30, with a Dice structural similarity threshold of 0.4. The FCN was a fully connected graph in which each metabolite was linked to all other metabolites. The MRN, SMN, and FCN were individually integrated into MetDNA3 for recursive metabolite annotation propagation (Fig. 5c). An additional annotation method, “m/z + RT match”, was applied for comparison. The match tolerances for MS1 m/z and RT were set to 15 ppm and 30%, respectively. The correctness of these annotations was assessed by computing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The accuracy, true positive rate, false positive rate, and estimated false discovery rate were calculated using Eqs. 9–12:
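| $$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$ | 9 |
| $$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$ | 10 |
| $$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$$ | 11 |
| $$\mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP}}$$ | 12 |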
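For reference, the four benchmark metrics can be computed from the confusion counts as in the short Python sketch below. It assumes the standard confusion-matrix definitions of accuracy, true positive rate, false positive rate, and false discovery rate. The function name and the zero-denominator handling are choices made here, not taken from MetDNA3.

```python
def benchmark_metrics(tp, tn, fp, fn):
    """Accuracy, TPR, FPR, and FDR from confusion counts (standard definitions)."""
    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "tpr": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "fdr": ratio(fp, fp + tp),
    }
```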
Ethical Statement
The animal experiments were compliant with the ethical guidelines of the Institutional Animal Care and Use Committees of Interdisciplinary Research Center on Biology and Chemistry, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences (approval research project number: ECSIOC_2023-23).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
The work was supported by National Key R&D Program of China (2022YFC3400700), National Natural Science Foundation of China (22425404 and 92357308), Shanghai Key Laboratory of Aging Studies (19DZ2260400), Shanghai Municipal Science and Technology Major Project, and Shanghai Basic Research Pioneer Project.
Author contributions
Z.-J.Z. and H.Z. conceived the idea and designed the algorithm and software. H.Z. developed the workflow of two-layer networking topology and MetDNA3 package. H.Z. performed the sample preparation, data acquisition, processing and analysis. H.Z. and X.Z. contributed to the validation of annotation results. H.Z. and Y.Y. contributed to the deployment of MetDNA3 webserver. H.Z. and Z.-J.Z. wrote the manuscript. Z.-J.Z. supervised the project.
Peer review
Peer review information
Nature Communications thanks Ricardo da Silva, who co-reviewed with Gabriel Arini; and Roland Nilsson for their contribution to the peer review of this work. A peer review file is available.
Data availability
The MS2 spectra of the characterized compounds can be accessed at MoNA database [https://mona.fiehnlab.ucdavis.edu/] via IDs from MoNA_0003569 to MoNA_0003577. The raw data files of BV2 cells, mouse brain tissue, mouse liver tissue, NIST human plasma, and NIST human urine from the Orbitrap instrument can be accessed at the MassIVE repository [https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?accession=MSV000097913] or the National Omics Data Encyclopedia under Accession Code OEP00006095. The raw data files of mouse liver tissue, NIST human plasma, and NIST human urine from the Astral instrument can be accessed at the MassIVE repository [https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?accession=MSV000097914] or National Omics Data Encyclopedia under Accession Code OEP00006083. Source data are provided with this paper.
Code availability
The algorithm of the two-layer networking topology was mainly developed in R and is executed in MetDNA3. The source code is available on GitHub [https://github.com/ZhuMetLab/MrnAnnoAlgo3] under the CC BY-NC-ND 4.0 License. The complete functions are provided in the MetDNA3 webserver [http://metdna.zhulab.cn/] after free registration. The GNN-based model for predicting reaction relationships was developed in Python. The source code is available on GitHub [https://github.com/ZhuMetLab/ReactionPredictor] under the CC BY-NC-ND 4.0 License.
Competing interests
Z.-J.Z. and H.Z. are inventors on a patent application (CN202510593995.5, applicant: Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences) covering the GNN-based model for predicting reaction relationships and the algorithm of the two-layer networking topology described in this study. The remaining authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-025-63536-6.
References
- 1. Patti, G. J., Yanes, O. & Siuzdak, G. Metabolomics: the apogee of the omics trilogy. Nat. Rev. Mol. Cell Biol. 13, 263–269 (2012).
- 2. Wishart, D. S. Emerging applications of metabolomics in drug discovery and precision medicine. Nat. Rev. Drug Discov. 15, 473–484 (2016).
- 3. Zamboni, N., Saghatelian, A. & Patti, G. J. Defining the metabolome: size, flux, and regulation. Mol. Cell 58, 699–706 (2015).
- 4. Vinaixa, M. et al. Mass spectral databases for LC/MS- and GC/MS-based metabolomics: state of the field and future prospects. TrAC Trends Anal. Chem. 78, 23–35 (2016).
- 5. Alseekh, S. et al. Mass spectrometry-based metabolomics: a guide for annotation, quantification and best reporting practices. Nat. Methods 18, 747–756 (2021).
- 6. Perez De Souza, L., Alseekh, S., Scossa, F. & Fernie, A. R. Ultra-high-performance liquid chromatography high-resolution mass spectrometry variants for metabolomics research. Nat. Methods 18, 733–746 (2021).
- 7. Tsugawa, H. et al. MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis. Nat. Methods 12, 523–526 (2015).
- 8. Schmid, R. et al. Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nat. Biotechnol. 41, 447–449 (2023).
- 9. Wolfender, J.-L., Nuzillard, J.-M., Van Der Hooft, J. J. J., Renault, J.-H. & Bertrand, S. Accelerating metabolite identification in natural product research: toward an ideal combination of liquid chromatography–high-resolution tandem mass spectrometry and NMR profiling, in silico databases, and chemometrics. Anal. Chem. 91, 704–742 (2019).
- 10. Cai, Y., Zhou, Z. & Zhu, Z.-J. Advanced analytical and informatic strategies for metabolite annotation in untargeted metabolomics. TrAC Trends Anal. Chem. 158, 116903 (2023).
- 11. Xing, S., Shen, S., Xu, B., Li, X. & Huan, T. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nat. Methods 20, 881–890 (2023).
- 12. Novoa-del-Toro, E. M. & Witting, M. Navigating common pitfalls in metabolite identification and metabolomics bioinformatics. Metabolomics 20, 103 (2024).
- 13. Kind, T. et al. Identification of small molecules using accurate mass MS/MS search. Mass Spectrom. Rev. 37, 513–532 (2018).
- 14. Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
- 15. Stancliffe, E., Schwaiger-Haber, M., Sindelar, M. & Patti, G. J. DecoID improves identification rates in metabolomics through database-assisted MS/MS deconvolution. Nat. Methods 18, 779–787 (2021).
- 16. Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).
- 17. Watrous, J. et al. Mass spectral molecular networking of living microbial colonies. Proc. Natl. Acad. Sci. 109, E1743–E1752 (2012).
- 18. Wang, M. et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34, 828–837 (2016).
- 19. Shen, X. et al. Metabolic reaction network-based recursive metabolite annotation for untargeted metabolomics. Nat. Commun. 10, 1516 (2019).
- 20. Zhou, Z. et al. Metabolite annotation from knowns to unknowns through knowledge-guided multi-layer metabolic networking. Nat. Commun. 13, 6656 (2022).
- 21. Aguilar-Mogas, A., Sales-Pardo, M., Navarro, M., Guimerà, R. & Yanes, O. iMet: a network-based computational tool to assist in the annotation of metabolites from tandem mass spectra. Anal. Chem. 89, 3474–3482 (2017).
- 22. Chen, L. et al. Metabolite discovery through global annotation of untargeted metabolomics data. Nat. Methods 18, 1377–1385 (2021).
- 23. Nothias, L.-F. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).
- 24. Schmid, R. et al. Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. Nat. Commun. 12, 3832 (2021).
- 25. Van Der Hooft, J. J. J., Wandy, J., Barrett, M. P., Burgess, K. E. V. & Rogers, S. Topic modeling for untargeted substructure exploration in metabolomics. Proc. Natl. Acad. Sci. 113, 13738–13743 (2016).
- 26. Ernst, M. et al. MolNetEnhancer: enhanced molecular networks by integrating metabolome mining and annotation tools. Metabolites 9, 144 (2019).
- 27. Naake, T. & Fernie, A. R. MetNet: metabolite network prediction from high-resolution mass spectrometry data in R aiding metabolite annotation. Anal. Chem. 91, 1768–1772 (2019).
- 28. Wang, X. et al. Network topology evaluation and transitive alignments for molecular networking. J. Am. Soc. Mass Spectrom. 35, 2165–2175 (2024).
- 29. Martin, M. R., Bittremieux, W. & Hassoun, S. Molecular structure discovery for untargeted metabolomics using biotransformation rules and global molecular networking. Anal. Chem. 97, 3213–3219 (2025).
- 30. Wang, X. et al. A structure-guided molecular network strategy for global untargeted metabolomics data annotation. Anal. Chem. 95, 11603–11612 (2023).
- 31. Kanehisa, M. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
- 32. Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes - a 2019 update. Nucleic Acids Res. 48, D445–D453 (2020).
- 33. Wishart, D. S. et al. HMDB 5.0: the human metabolome database for 2022. Nucleic Acids Res. 50, D622–D631 (2022).
- 34. Wishart, D. S. et al. BioTransformer 3.0—a web server for accurately predicting metabolic transformation products. Nucleic Acids Res. 50, W115–W123 (2022).
- 35. Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 93, 11692–11700 (2021).
- 36. Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).
- 37. Lai, Z. et al. Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics. Nat. Methods 15, 53–56 (2018).
- 38. Scheubert, K. et al. Significance estimation for large scale metabolomics annotations by spectral matching. Nat. Commun. 8, 1494 (2017).
- 39. Da Silva, R. R. et al. Propagating annotations of molecular networks using in silico fragmentation. PLOS Comput. Biol. 14, e1006089 (2018).
- 40. Qiang, H. et al. Language model-guided anticipation and discovery of unknown metabolites. Preprint at 10.1101/2024.11.13.623458 (2024).
- 41. Yin, Y., Luo, M. & Zhu, Z.-J. Met4DX: a unified and versatile data processing tool for multidimensional untargeted metabolomics data. J. Am. Soc. Mass Spectrom. 35, 2960–2968 (2024).
- 42. Zhang, H. et al. AllCCS2: curation of ion mobility collision cross-section atlas for small molecules using comprehensive molecular representations. Anal. Chem. 95, 13913–13921 (2023).
- 43. Li, D.-L., Shen, F., Yin, Y., Peng, J.-X. & Chen, P.-Y. Weighted youden index and its two-independent-sample comparison based on weighted sensitivity and specificity. Chin. Med. 126, 1150–1154 (2013).
- 44. Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).