Briefings in Bioinformatics. 2024 Feb 28;25(2):bbae056. doi: 10.1093/bib/bbae056

Ion entropy and accurate entropy-based FDR estimation in metabolomics

Shaowei An 1,2,3, Miaoshan Lu 4,5,6, Ruimin Wang 7,8,9, Jinyin Wang 10,11,12, Hengxuan Jiang 13, Cong Xie 14, Junjie Tong 15, Changbin Yu 16
PMCID: PMC10939419  PMID: 38426325

Abstract

Accurate metabolite annotation and false discovery rate (FDR) control remain challenging in large-scale metabolomics. Recent progress leveraging proteomics experiences and interdisciplinary inspirations has provided valuable insights. While target–decoy strategies have been introduced, generating reliable decoy libraries is difficult due to metabolite complexity. Moreover, continuous bioinformatics innovation is imperative to improve the utilization of expanding spectral resources while reducing false annotations. Here, we introduce the concept of ion entropy for metabolomics and propose two entropy-based decoy generation approaches. Assessment of public databases validates ion entropy as an effective metric to quantify ion information in massive metabolomics datasets. Our entropy-based decoy strategies outperform current representative methods in metabolomics and achieve superior FDR estimation accuracy. Analysis of 46 public datasets provides instructive recommendations for practical application.

Keywords: entropy, metabolomics, mass spectrometry, FDR estimation, metabolite annotation

INTRODUCTION

Metabolomics is the profiling of metabolites in biofluids, cells and tissues [1]. It provides a functional readout of cellular biochemistry and is routinely applied as a tool for biomarker discovery [2]. Among the analytical techniques for untargeted metabolomics, liquid chromatography coupled to accurate tandem mass spectrometry (LC-MS/MS) has become predominant because it allows the acquisition of thousands of metabolite signals from a single sample [3, 4]. The increased availability and innovation of instrumental techniques have dramatically improved the detection and identification of metabolites [5]. To annotate and identify detected signals, tandem mass spectrometry (MS/MS) spectra are collected from experimental samples and compared with reference spectral libraries built from authentic standards [6–8]. This procedure is commonly implemented with spectrum–spectrum matching methods such as the dot product similarity score [9]. However, researchers face challenges in setting appropriate scoring criteria and ascertaining the false discovery rate (FDR) [10], which can ultimately result in uncontrolled false-positive annotations and unreliable results. The FDR is a cornerstone of the quantification of annotation quality in genomics [11] and proteomics [12], but its immaturity in metabolomics impedes the standardization of metabolomics data processing [13].

To address analysis challenges in metabolite annotation, the metabolomics community has leveraged experience from the proteomics field, where statistical assessment and false discovery calculations for annotations are common practice [14]. The representative method, FDR estimation by the target–decoy approach, has long been prevalent in proteomics research but is relatively new to metabolomics because of the difficulty of generating decoy metabolomics libraries [10, 12, 14]. Despite the intrinsic complexity of metabolite fragmentation, several decoy library construction methods have been proposed and validated. The fragmentation tree–based method [10] uses a re-rooted fragmentation tree [15] to generate decoy libraries. It was compared with the empirical Bayes approach [16] and two other decoy methods: the naïve and spectrum-based methods. The fragmentation tree–based method outperformed the other methods and was shown to provide confidence measures in large-scale metabolomics projects. This is considered important progress in applying the target–decoy approach to metabolomics. However, this method is limited to certain metabolomics scenarios because it can only be used on fragmentation tree–filtered spectral libraries. Another decoy generation method, XY-Meta, was presented later and demonstrated better performance than the fragmentation tree–based method in a comprehensive comparison [17]. This method is an optimized form of the random selection method and generates forged spectra by preserving the original reference signals to simulate the presence of metabolite isomers. Despite these efforts, continuous methodological progress is necessary to increase the utilization of rich spectral resources while removing or reducing match redundancy [4], because numerous databases are generally queried to maximize metabolome coverage and public spectral databases are growing in scale [3, 18].

In addition to drawing inspiration from proteomics, another attempt to address the difficulty of spectrum–spectrum matching has applied information entropy theory [19] to metabolomics. Spectral entropy was introduced as a suitable measure of the total information content of an MS/MS spectrum, and entropy similarity scores were shown to improve the accuracy of MS-based annotations with high robustness [20]. This successful cross-disciplinary investigation demonstrates the suitability of information theory for massive MS/MS spectral databases. However, in large-scale metabolomics analysis or downstream pathway analysis, most attention is paid to the spectrum level, while the basic component of a spectrum, the ion, is rarely considered in macro-level statistics.

Here, we introduce ion entropy as a measure for the information content of fragment ions in MS/MS spectral libraries. We used ion entropy to assess widely used public spectral databases including MassBank of North America (MassBank-MoNA), MassBank-Europe [21] and GNPS [6] to reflect its applications. Based on the observed characteristics of ion entropy and previously proposed spectral entropy, we devised two entropy-based decoy spectra generation methods: the spectral entropy–based method and the ion entropy–based method. A comprehensive evaluation validated that both methods are feasible strategies to provide FDR estimation in metabolomics with high accuracy. The ion entropy-based method achieved the best FDR estimation performance when compared with other representative decoy strategies. We further analyzed 46 public metabolomics datasets using our constructed workflow to validate the FDR estimation methods in real-world metabolome annotation and provided instructive suggestions. Overall, this work demonstrates ion entropy as an effective novel metric to advance false discovery rate estimation and improve metabolomics annotation quality.

EXPERIMENTAL SECTION

Ion entropy and its calculation

Entropy, with origins in thermodynamics as a measure of disorder within systems [22], aptly captures the variability of fragment ion distributions in mass spectral libraries. Drawing from information theory, we propose ion entropy–a new metric leveraging Shannon entropy to quantify the information content of fragment ions in MS/MS spectra. Ion entropy enables quantitative evaluation of the richness and diversity of fragmentation patterns, providing novel informatics insights into the complexity of fragment ion populations in metabolomics.

The ion entropy of a given fragment ion was calculated as follows. First, all MS/MS spectra whose precursor m/z lay within 10 ppm of the precursor m/z associated with the target fragment ion were retrieved from the reference database. These spectra were normalized by dividing fragment ion intensities by their respective precursor ion intensities. Fragment ions within 10 ppm m/z tolerance of the target fragment ion were then selected from this normalized collection, with the number of collected fragment ions denoted as p. The intensities of these extracted fragment ions were rescaled to sum to 1 by dividing by their total combined intensity. Using these final normalized relative ion intensities ($I_p$), the ion entropy ($S_{ion}$) was computed according to the following equation:

$$S_{ion} = -\sum_{p} I_p \ln I_p$$

Normalized ion entropy can be calculated by the equation:

$$\text{Normalized ion entropy} = \frac{S_{ion}}{\ln p}$$
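The calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the natural logarithm (as in spectral entropy [20]), takes the already collected, precursor-normalized intensities of one fragment ion as input, and treats the normalized ion entropy of an ion found only once (p = 1) as 0. The function names and the example values are hypothetical.

```python
import math

def ion_entropy(intensities):
    """Shannon entropy (natural log) of intensities rescaled to sum to 1."""
    positive = [x for x in intensities if x > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("at least one positive intensity is required")
    rel = [x / total for x in positive]
    return -sum(r * math.log(r) for r in rel)

def normalized_ion_entropy(intensities):
    """Ion entropy divided by ln(p); taken as 0 here when p <= 1 (assumption)."""
    p = sum(1 for x in intensities if x > 0)
    if p <= 1:
        return 0.0
    return ion_entropy(intensities) / math.log(p)

# Hypothetical precursor-normalized intensities of one fragment ion
# collected from four library spectra sharing a precursor m/z.
equal = [0.20, 0.20, 0.20, 0.20]
skewed = [0.90, 0.05, 0.03, 0.02]
print(ion_entropy(equal), normalized_ion_entropy(equal))    # about ln(4) and 1
print(ion_entropy(skewed), normalized_ion_entropy(skewed))  # lower on both scales
```

Identical intensities give the largest entropy for a given p, which is why a normalized value of 1 corresponds to an ion with the same relative intensity in every collected spectrum.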

Spectral libraries and preprocessing

We utilized three publicly available spectral libraries to evaluate our methods: MassBank-MoNA (2023.07 release, only experimental LC-MS/MS spectra), MassBank-Europe (2023.09 release) and GNPS (2023.12 release). When assessing ion entropy as a new metric, the following preprocessing steps were applied to remove low-quality data. Spectra were eliminated if they (i) lacked key information such as m/z values, intensity values or precursor m/z; (ii) contained no ions within 10 ppm of the precursor m/z; or (iii) had a spectrum type other than MS2. In addition, ions with zero intensity were removed.

In the evaluation of the proposed FDR estimation methods, more stringent preprocessing steps were applied to all query and target spectral libraries to ensure maximal homogeneity. Spectra were removed if they had (a) missing key information (m/z values, intensity values, precursor m/z, ion mode, InChIKey); (b) fewer than five peaks with relative intensity above 2%; (c) no ion within 10 ppm of the precursor m/z; (d) a precursor m/z (or exact mass) over 1000; (e) a spectrum type other than MS2; or (f) a negative or unknown ion mode. In addition, ions with relative intensity <1% were treated as noise signals and discarded. The numbers of spectra used are summarized in Supplementary Table 1 (available online at http://bib.oxfordjournals.org/).
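The stringent filter (a) to (f) can be written as a single pass over each spectrum. The sketch below is illustrative only: the `Spectrum` record and its field names are hypothetical, and relative intensity is assumed to be measured against the base peak, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Spectrum:                      # hypothetical record layout
    peaks: List[Tuple[float, float]]  # (m/z, intensity)
    precursor_mz: Optional[float]
    ion_mode: Optional[str]
    inchikey: Optional[str]
    ms_level: int = 2

def within_ppm(mz, ref, ppm=10.0):
    return abs(mz - ref) <= ref * ppm * 1e-6

def clean_spectrum(s: Spectrum) -> Optional[Spectrum]:
    """Return the noise-filtered spectrum, or None if any criterion fails."""
    if not s.peaks or s.precursor_mz is None or not s.ion_mode or not s.inchikey:
        return None                                    # (a) key information missing
    if s.ms_level != 2:
        return None                                    # (e) not an MS2 spectrum
    if s.ion_mode.lower() != "positive":
        return None                                    # (f) negative or unknown mode
    if s.precursor_mz > 1000:
        return None                                    # (d) precursor m/z over 1000
    if not any(within_ppm(mz, s.precursor_mz) for mz, _ in s.peaks):
        return None                                    # (c) no precursor ion peak
    base = max(i for _, i in s.peaks)
    if base <= 0:
        return None
    if sum(1 for _, i in s.peaks if i / base > 0.02) < 5:
        return None                                    # (b) fewer than five peaks > 2%
    # Peaks below 1% relative intensity are treated as noise and dropped.
    kept = [(mz, i) for mz, i in s.peaks if i / base >= 0.01]
    return Spectrum(kept, s.precursor_mz, s.ion_mode, s.inchikey, s.ms_level)
```

Returning `None` for rejected spectra keeps the caller simple, for example `[c for c in map(clean_spectrum, library) if c is not None]`.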

Fragmentation tree filtering

The fragmentation tree–based decoy generation method necessitates a noise-filtered target spectral library. To fulfill this prerequisite, we implemented the following noise filtering procedure: For each target MS/MS spectrum, a fragmentation tree was constructed using SIRIUS 5.6 [23] with default parameters, annotating a subset of hypothetical fragment ions with molecular formulas. Only these annotated fragment ions were retained from each spectrum, preserving their original intensities. This filtering step provided the required clean target library as input for the fragmentation tree decoy approach.

Spectral matching function

Two spectral matching functions were implemented in this study for comprehensive evaluation.

  • (i) Spectral entropy similarity. To calculate the similarity between spectrum A and spectrum B, their individual spectral entropies were first computed as $S_A$ and $S_B$. The two spectra were then combined, and the spectral entropy of the mixed spectrum ($S_{AB}$) was determined. The spectral entropy similarity score was calculated as
    $$\text{Entropy similarity} = 1 - \frac{2S_{AB} - S_A - S_B}{\ln 4}$$

Details about this score function can be found in its original publication [20].

  • (ii) Dot product similarity. Let the test spectrum q be a collection of peaks $(M_{q,1}, I_{q,1}), \ldots, (M_{q,i}, I_{q,i})$ and the library spectrum r be a collection of peaks $(M_{r,1}, I_{r,1}), \ldots, (M_{r,i}, I_{r,i})$, where M denotes m/z and I denotes intensity. The dot product similarity is given by the equation
    [Equation 4: dot product similarity of q and r]

Both the spectral matching functions used 0.05 Da as m/z tolerance.
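To make the two scores concrete, here is a minimal sketch under stated assumptions. Peaks are paired greedily within the 0.05 Da tolerance; the entropy similarity follows the unweighted form 1 − (2S_AB − S_A − S_B)/ln 4 and omits the intensity weighting described in the original publication [20]; the second function is a plain cosine score, used here as a stand-in because the exact dot product formula of this study may differ (for example by m/z weighting). All function names are hypothetical.

```python
import math

def normalize(peaks):
    """Sort peaks by m/z and rescale intensities to sum to 1 (total must be > 0)."""
    total = sum(i for _, i in peaks)
    return [(mz, i / total) for mz, i in sorted(peaks)]

def entropy(intensities):
    return -sum(i * math.log(i) for i in intensities if i > 0)

def align(a, b, tol=0.05):
    """Greedy two-pointer pairing of m/z-sorted peaks; returns (Ia, Ib) pairs."""
    pairs, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if abs(a[i][0] - b[j][0]) <= tol:
            pairs.append((a[i][1], b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            pairs.append((a[i][1], 0.0))
            i += 1
        else:
            pairs.append((0.0, b[j][1]))
            j += 1
    pairs += [(x[1], 0.0) for x in a[i:]]
    pairs += [(0.0, x[1]) for x in b[j:]]
    return pairs

def entropy_similarity(a, b, tol=0.05):
    pairs = align(normalize(a), normalize(b), tol)
    s_a = entropy(x for x, _ in pairs)
    s_b = entropy(y for _, y in pairs)
    s_ab = entropy((x + y) / 2 for x, y in pairs)   # 1:1 mixed spectrum
    return 1 - (2 * s_ab - s_a - s_b) / math.log(4)

def cosine_similarity(a, b, tol=0.05):
    pairs = align(normalize(a), normalize(b), tol)
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)
```

With this form, identical spectra score 1 and spectra with no shared peaks score 0 under both functions, since the mixed spectrum of two disjoint spectra has entropy (S_A + S_B)/2 + ln 2.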

Other decoy strategies for comparison

Three prevailing decoy methods were compared with entropy-based methods in this study. The naïve method [10] uses all possible fragment ions from the target library and then randomly adds these ions to the decoy spectral library, until each decoy spectrum reaches the desired number of fragment ions that mimics the corresponding library spectrum. The fragmentation tree–based method [10] generates decoy libraries using a re-rooted fragmentation tree [15]. The XY-Meta method [17], based on the naïve approach, exhibits better performance on specific datasets. Since the naïve method and XY-Meta method are well described and straightforward, we implemented these methods in our analysis pipeline, closely following the published protocols. Additionally, the fragmentation tree-based decoy method was conducted using SIRIUS 5.6 [23] with default parameters.
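For orientation, the core idea of the naïve method as summarized above can be sketched as follows. This is a simplified reading, not the published protocol [10]: it draws whole (m/z, intensity) peaks at random from a pool of all library peaks and ignores any additional rules of the original method. The function name and data layout are hypothetical.

```python
import random

def naive_decoy_library(library, rng):
    """For each target spectrum, draw the same number of peaks from the pooled library."""
    pool = [peak for spectrum in library for peak in spectrum]
    # rng.sample draws without replacement, so a decoy never reuses one pool entry twice.
    return [rng.sample(pool, len(spectrum)) for spectrum in library]

# Hypothetical toy library of three spectra given as (m/z, intensity) lists.
library = [
    [(91.05, 30.0), (119.05, 100.0), (153.02, 45.0)],
    [(65.04, 12.0), (121.03, 100.0)],
    [(77.04, 80.0), (105.03, 100.0), (133.03, 20.0), (151.04, 5.0)],
]
decoys = naive_decoy_library(library, random.Random(0))
print([len(d) for d in decoys])  # same peak counts as the targets
```

Because peaks are drawn without regard to the source spectrum, nothing constrains the decoy's intensity distribution, which is one reason its spectral entropy can drift away from that of the target.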

FDR calculation

We used the separated target–decoy search (STDS) strategy [24] for FDR estimation, under which all library hits above a specified score threshold are reported. The estimated FDR is the number of above-threshold decoy library hits ($N_D$) divided by the number of above-threshold target library hits ($N_T$):

$$\mathrm{FDR} = \frac{N_D}{N_T}$$
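The estimate is a simple ratio of counts, as the sketch below shows. It assumes that one best score per query is available from the target search and, separately, from the decoy search, and that a hit counts as above threshold when its score is greater than or equal to the cutoff; both function names and the toy scores are hypothetical.

```python
def estimated_fdr(target_scores, decoy_scores, threshold):
    """N_D / N_T for hits at or above the threshold (0.0 if no target hits)."""
    n_t = sum(s >= threshold for s in target_scores)
    n_d = sum(s >= threshold for s in decoy_scores)
    return n_d / n_t if n_t else 0.0

def threshold_for_fdr(target_scores, decoy_scores, alpha):
    """Lowest observed target score whose estimated FDR does not exceed alpha."""
    for t in sorted(set(target_scores)):
        if estimated_fdr(target_scores, decoy_scores, t) <= alpha:
            return t
    return None

# Hypothetical best-hit scores from the two separate searches.
target = [0.9, 0.8, 0.7, 0.4]
decoy = [0.75, 0.3, 0.2, 0.1]
print(estimated_fdr(target, decoy, 0.7))     # 1 decoy hit / 3 target hits
print(threshold_for_fdr(target, decoy, 0.05))
```

Scanning thresholds from low to high returns the most permissive cutoff that still satisfies the requested FDR, which maximizes the number of reported annotations.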

RESULTS AND DISCUSSION

Ion entropy measurement on public spectral databases

Public MS/MS spectral databases accumulate spectra recorded at multiple collision energies to cover a broad range of characteristic fragments [25–27]. Low collision energies mostly preserve the precursor ion, while high collision energies increase fragment ion abundances toward low m/z ranges and lower the precursor ion abundance (Figure 1A). Some ions arise in all recorded energies, while others only arise within a certain collision energy range. Fragment ions appearing at a wide range of collision energies might indicate that they have relatively stable substructures that are hard to dissociate.

Figure 1. Assessment of public spectral databases using ion entropy. (A) Spectra of deoxyuridine recorded at different collision energies in the Human Metabolome Database (HMDB) with ID of HMDB0000012. (B) Fractions of ions with normalized ion entropy equal to 0 or 1 in the MassBank-Europe, MassBank-MoNA and GNPS databases. (C) Distribution of ion entropies in the MassBank-Europe, MassBank-MoNA and GNPS databases.

We assessed three preprocessed public spectral databases, GNPS [6], MassBank-Europe [21] and MassBank-MoNA, using ion entropy. After normalizing the ion entropy for all ions in the databases, two special ion types emerged: ions with normalized ion entropy of either 0 or 1. Ions of the former type are unique with respect to their precursor m/z values in the database. These ions can have more diagnostic value than other ions because they reflect unique signals. However, the scarcity of these ions indicates they may have a lower confidence level when applied in accurate metabolite annotation. Ions of the latter type occur in every spectrum with the corresponding precursor m/z value and exhibit identical relative intensity. This consistent fragmentation pattern suggests these ions may be stable during fragmentation, as the spectra in public databases are recorded across a wide range of collision energies.

We calculated the fractions of these two special ion types in public spectral databases (Figure 1B). Ions with normalized ion entropy equal to 0 accounted for at least 30%, while ions with normalized ion entropy equal to 1 accounted for no more than 10% in all databases. These basic indicators could be used to characterize other spectral libraries in the future. In addition, ion entropies were calculated for all other ions, and their distributions in these databases are shown (Figure 1C and Supplementary Figure 1, available online at http://bib.oxfordjournals.org/). Sharp increases in density were observed just before every $\ln N$ value (where N is a natural number), and the apex density decreased as the ion entropy increased. This can be explained by the composition of the spectral databases (Figure 1A). Because ions overlap among spectra recorded under different collision energy conditions, ion entropy reaches its relative maximal value ($\ln N$ for N spectra) if an ion has the same relative intensity in all spectra. Moreover, ions present in fewer spectra are more numerous than those present in more spectra. This theoretical conjecture is consistent with the observed phenomenon. Overall, ion entropy provides a valuable metric to assess the information content, stability or enrichment of fragment ions within spectral databases. Ion entropy can also be used to track the continuous evolution of databases as new reference spectra are added over time (Supplementary Figure 2, available online at http://bib.oxfordjournals.org/).
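The position of these density peaks can be checked with a short worked calculation, assuming the natural-logarithm form of Shannon entropy. If an ion is collected from N spectra with identical precursor-normalized intensity, each rescaled intensity equals 1/N, so

```latex
\[
S_{ion} \;=\; -\sum_{p=1}^{N} \frac{1}{N}\,\ln\frac{1}{N} \;=\; \ln N ,
\qquad\text{and}\qquad
S_{ion} < \ln N \ \text{ whenever the } N \text{ intensities are unequal.}
\]
```

Ions found in N spectra therefore have entropies at or just below $\ln N$ (for example $\ln 2 \approx 0.693$ and $\ln 3 \approx 1.099$), which is consistent with density rising sharply immediately before each of these values.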

Decoy library construction by entropy-based methods

Information entropy can serve as a key concept in generating decoy MS/MS spectral libraries. It is believed that the construction of decoy spectra should mimic real spectra as closely as possible but not correspond to MS/MS spectra of any true metabolites present in the sample [10]. We interpret this proposal through the lens of information entropy theory and propose that decoy and target spectra should have the same or similar spectral entropy, reflecting comparable states of chaos or disorder. This further implies decoys are produced under collision conditions matching the original spectrum [20].

Two entropy-based methods were developed to create dependable decoy spectral libraries (Figure 2), termed the spectral entropy–based method and the ion entropy-based method. The spectral entropy–based method randomly shuffles the ion intensities within a target spectrum to generate a decoy spectrum that preserves the same spectral entropy. This approach is akin to the spectral permutation method used in SMfinder [28], but represents the first time that the mechanism is grounded in an entropy-based principle. The ion entropy–based method follows this principle through a more complicated process. First, each ion in a target spectrum is assigned and sorted by an ion entropy value. Ion intensities are then sequentially exchanged by rank, where high-entropy ions receive intensities from lower-entropy ions. Ions with zero ion entropy exclusively exchange intensities within their group, aiming to maintain their integrity and avoid external perturbations. Ultimately, the target ions with new intensities comprise the decoy spectrum. This method utilizes existing tandem mass spectral libraries to learn intrinsic fragmentation patterns, then rearranges fragment ion intensities to construct decoy spectra. The overall concept of intensity shuffling is analogous to previous approaches such as the method in DIAMetAlyzer [29]. However, the newly introduced ion entropy metric provides a unique way to reorder ions that better captures the inherent complexity of fragmentation.

Figure 2. Decoy library construction schema. (A) The spectral entropy–based decoy method. (B) The ion entropy–based method.
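The two decoy constructions can be sketched as follows. The first function is a direct reading of the spectral entropy–based method: permuting intensities over fixed m/z values leaves the multiset of intensities, and hence the spectral entropy, unchanged. The second function is only one possible interpretation of the rank-wise exchange in the ion entropy–based method, and it is an assumption: ions with positive ion entropy are ranked from high to low and each takes the intensity of the next lower-ranked ion (the lowest wraps around to the highest), while zero-entropy ions are shuffled among themselves. Function names and the per-ion entropy input are hypothetical.

```python
import math
import random

def spectral_entropy(intensities):
    total = sum(intensities)
    return -sum((i / total) * math.log(i / total) for i in intensities if i > 0)

def spectral_entropy_decoy(peaks, rng):
    """Shuffle intensities across the m/z values of one target spectrum."""
    mzs = [mz for mz, _ in peaks]
    intensities = [i for _, i in peaks]
    rng.shuffle(intensities)
    return list(zip(mzs, intensities))

def ion_entropy_decoy(peaks, ion_entropies, rng):
    """Assumed reading: rotate intensities along the entropy ranking."""
    zero = [k for k, e in enumerate(ion_entropies) if e == 0]
    ranked = sorted((k for k, e in enumerate(ion_entropies) if e > 0),
                    key=lambda k: ion_entropies[k], reverse=True)
    new_int = [i for _, i in peaks]
    # Each ranked ion receives the intensity of the next lower-entropy ion.
    for pos, k in enumerate(ranked):
        donor = ranked[(pos + 1) % len(ranked)]
        new_int[k] = peaks[donor][1]
    # Zero-entropy ions exchange intensities only within their own group.
    group = [peaks[k][1] for k in zero]
    rng.shuffle(group)
    for k, value in zip(zero, group):
        new_int[k] = value
    return [(mz, i) for (mz, _), i in zip(peaks, new_int)]
```

Both constructions only permute the intensities of a target spectrum, so the decoy keeps the target's m/z values and its spectral entropy exactly (up to floating-point summation order).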

Decoy spectral database validation

Accurate control of spectral entropy during the generation of decoy spectral libraries improves the ability to mimic reference target libraries. We evaluated the spectral entropy distribution of decoy libraries generated from the GNPS database using various decoy strategies (Figure 3). The entire GNPS library was either preprocessed or filtered before decoy generation. All methods except the entropy-based methods altered the original spectral entropy distribution of the target library. The naïve and XY-Meta methods caused significant shifts in the distribution shape toward lower spectral entropies. The fragmentation tree–based method produced a distribution broadly similar to the original for the filtered library, while the entropy-based methods maintained distributions fully consistent with the target library under both filtered and unfiltered conditions. Although it generates decoy libraries using a mechanism unrelated to entropy, the fragmentation tree–based method obtained results similar to those of the entropy-based methods.

Figure 3. Spectral entropy distributions of decoy libraries generated by different decoy strategies. (A) Using the fragmentation tree–filtered GNPS library and (B) the unfiltered library to generate decoy libraries.

P-value estimation [30] further validated the entropy-based methods by assessing decoy database quality. Using the MassBank-MoNA library as query and the GNPS library as target, we performed a separated target–decoy search with dot product similarity. The two entropy-based strategies were used to generate decoy libraries from the GNPS library. True and false annotations can be accurately distinguished by the first part of the InChIKeys. The P-value of a spectrum match is defined as the probability of randomly drawing a result of this or better quality under the null hypothesis that the spectrum has been randomly generated [10]. We then investigated whether the P-values of the false annotations estimated by our methods are uniformly distributed. Both entropy-based methods produced highly uniform P-value distributions for false annotations (Figure 4), which agrees with the distribution of P-values under the null hypothesis. Our decoy strategies were thus validated to generate high-quality libraries by both spectral entropy distributions and P-value estimation.
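One common way to turn decoy scores into P-values is the empirical tail fraction, sketched below. This is an assumption made for illustration and is not necessarily the estimator of [30]: the P-value of a match is taken as the share of decoy scores at least as high as its score, with a +1 correction so that no P-value is exactly zero. The function name and the toy scores are hypothetical.

```python
from bisect import bisect_left

def empirical_p_values(match_scores, decoy_scores):
    """P = (1 + number of decoy scores >= s) / (1 + number of decoy scores)."""
    null = sorted(decoy_scores)
    n = len(null)
    # bisect_left counts the decoy scores strictly below s.
    return [(n - bisect_left(null, s) + 1) / (n + 1) for s in match_scores]

# Hypothetical decoy (null) scores and three match scores to evaluate.
decoy_scores = [0.1, 0.2, 0.3, 0.4]
print(empirical_p_values([0.35, 0.05, 0.9], decoy_scores))
```

If the decoys are a good null model, the P-values computed this way for false annotations should be spread roughly evenly over the interval from 0 to 1, which is the property inspected in Figure 4.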

Figure 4. Distributions of P-values for annotations. The MassBank-MoNA query spectra were used to search the GNPS library using dot product similarity. The decoy libraries were generated by the spectral entropy–based method (A–C) and the ion entropy–based method (D–F).

Accurate FDR estimation and performance comparison

Our methods were evaluated by comparing how well the estimated FDR aligns with the actual FDR. We selected the MassBank-MoNA and GNPS libraries as the query and target libraries to conduct the STDS analysis. Decoy libraries were generated both with and without fragmentation tree filtering of the target data. Since the similarity score function critically influences FDR estimation, we first investigated how different score functions impact actual FDR (Supplementary Figure 3, available online at http://bib.oxfordjournals.org/). The results revealed that the spectral entropy similarity outperformed the dot product similarity, especially at high score thresholds (corresponding to low actual FDRs). Additionally, removing precursor ions prior to spectral matching enhanced annotation accuracy. Based on these observations, we employed the spectral entropy similarity in the subsequent analyses.
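Because both libraries carry InChIKeys, the actual FDR at a given score threshold can be computed directly from labelled hits. The sketch below assumes that a hit is correct when the first block of the query and library InChIKeys agree, as stated for the P-value analysis; the function names are hypothetical, and the InChIKeys in the example are made-up placeholders rather than real compounds.

```python
def same_structure(inchikey_a, inchikey_b):
    """Compare only the first InChIKey block (the part before the first hyphen)."""
    return inchikey_a.split("-")[0] == inchikey_b.split("-")[0]

def actual_fdr(hits, threshold):
    """hits: iterable of (score, query_inchikey, library_inchikey) tuples."""
    accepted = [(q, l) for score, q, l in hits if score >= threshold]
    if not accepted:
        return 0.0
    false_hits = sum(not same_structure(q, l) for q, l in accepted)
    return false_hits / len(accepted)

# Placeholder InChIKeys for illustration only.
hits = [
    (0.95, "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    (0.90, "BBBBBBBBBBBBBB-UHFFFAOYSA-N", "BBBBBBBBBBBBBB-QQQQQQQQSA-N"),
    (0.85, "CCCCCCCCCCCCCC-UHFFFAOYSA-N", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
    (0.60, "EEEEEEEEEEEEEE-UHFFFAOYSA-N", "FFFFFFFFFFFFFF-UHFFFAOYSA-N"),
]
print(actual_fdr(hits, 0.8))  # one false hit among three accepted
```

Plotting this quantity against the estimated FDR over a range of thresholds gives the kind of comparison used to judge how closely a decoy strategy follows the expected diagonal.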

We benchmarked our entropy-based decoy methods against other approaches under various criteria to evaluate FDR estimation accuracy (Figure 5). The ion entropy–based method most closely traced the expected curve under unfiltered conditions, demonstrating superior accuracy. Optimal FDR estimation was achieved by combining ion entropy–based decoys and spectral entropy similarity without precursor removal on the unfiltered library. Both entropy-based strategies excelled at estimating the critical low FDR range (0–0.1) under all tested situations, outperforming other methods. Under filtered conditions, the XY-Meta method estimated FDR more accurately from 0.2 to 0.5, particularly with precursor removal. Overall, the strengths of the entropy-based approaches, especially ion entropy–based method, were demonstrated across diverse search conditions. Comparable results were obtained in negative mode (Supplementary Figure 4, available online at http://bib.oxfordjournals.org/), where the entropy-based methods consistently showed accurate FDR estimation at the key low FDR range (0–0.15). Additionally, since the ion entropy–based method is affected by the number of spectra in the database, we investigated the impact of spectrum accumulation (Supplementary Figure 5, available online at http://bib.oxfordjournals.org/). The results demonstrated that a curated library has a positive effect on the ion entropy–based method.

Figure 5. The MassBank-MoNA query spectra were searched against the GNPS target database and associated decoys to calculate the estimated FDR. Searches were performed using both fragmentation tree–filtered and –unfiltered GNPS libraries. Spectral entropy similarity was utilized, with and without precursor ion removal prior to matching.

We then selected examples of compounds with multiple spectra and isomers to evaluate the performance of the ion entropy–based method for identifying correct hits. An apigenin spectrum from MassBank-MoNA was searched against GNPS under various conditions. The precursor m/z 271.06062 (±10 ppm) matched 366 library spectra in GNPS, comprising different isomers such as genistein and galangin with the same formula C15H10O5. For decoy methods, FDR <0.05 was used as the annotation threshold. Matching score >0.7 was the criterion for traditional similarity-based approaches. Among all search conditions, the ion entropy–based method showed impressive performance, with only one false library match, while other methods had substantially more false hits from isomeric compounds (Figure 6). This demonstrates the ion entropy–based method’s strong ability to accurately differentiate correct annotations from incorrect isomeric hits. The results highlight the potential of the ion entropy approach for sensitive and specific metabolite annotation when multiple candidate spectra and isomers are present. In summary, these comprehensive benchmarking evaluations highlight the advantages of our entropy-based decoy methods, specifically the ion entropy–based approach, for precise FDR estimation across varied search parameters.
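The reason so many candidates compete for one query is visible from the precursor window alone. The sketch below computes the ±10 ppm window around m/z 271.06062 and checks that the theoretical protonated mass of C15H10O5 falls inside it; the [M+H]+ adduct is an assumption (the text does not name the adduct), and the monoisotopic masses are standard rounded values. Function names are hypothetical.

```python
def ppm_window(mz, ppm=10.0):
    """Lower and upper m/z bounds for a symmetric ppm tolerance."""
    delta = mz * ppm * 1e-6
    return mz - delta, mz + delta

# Monoisotopic masses (rounded standard values) and the proton mass.
MASS = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}
PROTON = 1.00727647

def protonated_mz(formula_counts):
    """[M+H]+ m/z for an element-count dictionary (adduct is an assumption)."""
    return sum(MASS[el] * n for el, n in formula_counts.items()) + PROTON

low, high = ppm_window(271.06062)
mh = protonated_mz({"C": 15, "H": 10, "O": 5})
print(f"window: {low:.5f} to {high:.5f}")
print(f"[M+H]+ of C15H10O5: {mh:.5f}, inside window: {low <= mh <= high}")
```

Apigenin, genistein and galangin share this formula and therefore this precursor m/z, so the precursor filter cannot separate them and the decision rests entirely on the fragment-level score and its FDR threshold.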

Figure 6. Frequency of false library hits of an apigenin spectrum across various conditions. The example spectrum was obtained from MassBank-MoNA and searched against GNPS.

Public metabolomics datasets validation

To demonstrate real-world application, we applied our workflow for FDR estimation on 46 public datasets against the preprocessed GNPS library. The workflow incorporated the ion entropy–based decoy library and spectral entropy similarity scoring to enable accurate FDR control. In total, 5 736 862 experimental spectra were collected for annotation. Spectrum match fractions were analyzed at 0.01, 0.05 and 0.1 FDR thresholds (Figure 7). At 0.01 FDR, all datasets had <10% annotated spectra. Fractions increased markedly at 0.05 FDR, improving utilization, yet many spectra remained unannotated even at 0.1 FDR. Since the ion entropy–based method enables accurate FDR control (Figure 5), these results highlight the critical need for precise FDR estimation in metabolite annotation, given the high false-positive risk. By providing robust FDR-controlled annotation, our workflow ensures reliable annotation of experimental data.

Figure 7. Metabolite annotation on public datasets. The ion entropy–based decoy method and the spectral entropy similarity scoring were used for accurate FDR estimation.

DISCUSSION

Ion entropy was introduced as a metric for measuring ion information in ever-growing MS/MS spectral libraries. Unlike the relatively broad definition of spectral entropy, the calculation of ion entropy considers only fragment ions from spectra with the same precursor m/z, which reduces the influence of unrelated ions and better characterizes specific fragments. In LC-MS/MS metabolomics, ion information aids annotation and compound identification for observed m/z values. However, ion statistics remain underutilized for biological discovery in metabolomics. As spectral libraries grow through data sharing, mining accumulated spectra may reveal new relationships. Elucidating connections between ion entropy and molecular fragmentation could enable enhanced methods for reliable spectra prediction or library quality assessment. Overall, ion entropy provides a valuable statistical perspective to extract new knowledge from metabolomics data as libraries expand.

One of the main challenges in generating metabolomics decoy spectra is that small molecules are structurally diverse and therefore difficult to shuffle or reverse. The ion entropy–based method addresses this challenge by providing a quantifiable characteristic of fragment ions based on the evaluation of spectral databases, distinguishing diverse ions beyond just m/z values. Ions can therefore be shuffled according to defined patterns. Notably, low-quality spectral libraries may impede the accurate calculation of ion entropy because of spectrum repetition or narrow collision energy ranges. As ion entropy depends on library quality, we recommend applying the entropy-based decoy methods to large, authentic spectral libraries for precise FDR control. While dependency on library quality is a consideration, ion entropy provides a valuable statistical approach to overcome the inherent metabolomics challenge of structural diversity, enabling effective decoy generation from fragmentation patterns in empirical data.

There are some potential variations of the proposed methods worth noting. For instance, in the decoy generation design, retaining the precursor ion intensity while shuffling other peak intensities may produce decoys even closer to the original spectra, suitable for certain datasets. Additionally, during ion entropy calculation, fragment intensities were normalized to the precursor intensity. This normalization could instead utilize the base peak intensity, which would broaden the range of spectra eligible for ion entropy calculation. While the current approaches were optimized for overall performance, exploring modifications like specialized intensity handling may improve adaptability to different data characteristics. As with any methodology, iterating on the techniques and customizing aspects for particular applications represent important future directions.

In summary, our work extended the application of information entropy in metabolomics by proposing a quantity to measure the information content of fragment ions in spectral libraries. The proposed spectral and ion entropy approaches enable accurate control of entropy distributions during decoy construction to closely mimic the properties of experimental spectra. Comprehensive evaluations showed the ion entropy–based method in particular outperformed established techniques for FDR estimation across various search conditions. By preserving entropy characteristics of real MS/MS data, the entropy-based strategies produce high-quality decoy libraries for robust target–decoy analysis. Application to public data highlighted the critical need for precise FDR control in metabolite annotation. Overall, this work establishes entropy as an effective framework to address the challenges of designing optimal decoys and controlling FDR in large-scale metabolomics annotations.

Key Points

  • The concept of ion entropy is introduced to metabolomics and two entropy-based decoy methods are proposed.

  • Assessment of public databases validates that ion entropy is an effective metric to quantify ion information in large metabolomics datasets.

  • The proposed entropy-based decoy strategies outperform current representative decoy methods in metabolomics and achieve better FDR estimation accuracy.

  • Analysis of 46 real public datasets provides recommendations for the practical application of the proposed approaches.

Supplementary Material

supplementary_bbae056.docx (559.6 KB, docx)

Author Biographies

Shaowei An is a PhD candidate at the College of Life Sciences, Westlake University, where he specializes in artificial intelligence (AI)–empowered metabolomics and multi-omics data analysis.

Miaoshan Lu is a PhD candidate at the College of Engineering, Westlake University.

Ruimin Wang is a PhD candidate at the College of Engineering, Westlake University.

Jinyin Wang is a PhD candidate at the College of Life Sciences, Westlake University.

Hengxuan Jiang is a PhD candidate at the College of Artificial Intelligence and Big Data for Medical Sciences, Shandong First Medical University.

Cong Xie is a PhD candidate at the College of Artificial Intelligence and Big Data for Medical Sciences, Shandong First Medical University.

Junjie Tong is a PhD candidate at the College of Artificial Intelligence and Big Data for Medical Sciences, Shandong First Medical University.

Changbin Yu, PhD (Australian National University), SMIEEE, IEAust, serves as Chair Professor and the Founding Dean of the College of Artificial Intelligence and Big Data for Medical Sciences at Shandong First Medical University. His research focuses on AI algorithms for clinical and biological data, as well as the control of cyber-physical systems.

Contributor Information

Shaowei An, Shandong First Medical University & Central Hospital Affiliated to Shandong First Medical University, 6699 Qingdao Road, Jinan 271016, Shandong, China; Westlake University, 18 Shilongshan Road, Hangzhou 310024, Zhejiang, China; Fudan University, 220 Handan Road, Shanghai 200433, China.

Miaoshan Lu, Shandong First Medical University & Central Hospital Affiliated to Shandong First Medical University, 6699 Qingdao Road, Jinan 271016, Shandong, China; Westlake University, 18 Shilongshan Road, Hangzhou 310024, Zhejiang, China; Zhejiang University, 866 Yuhangtang Road, Hangzhou 310009, Zhejiang, China.

Ruimin Wang, Shandong First Medical University & Central Hospital Affiliated to Shandong First Medical University, 6699 Qingdao Road, Jinan 271016, Shandong, China; Westlake University, 18 Shilongshan Road, Hangzhou 310024, Zhejiang, China; Fudan University, 220 Handan Road, Shanghai 200433, China.

Jinyin Wang, Shandong First Medical University & Central Hospital Affiliated to Shandong First Medical University, 6699 Qingdao Road, Jinan 271016, Shandong, China; Westlake University, 18 Shilongshan Road, Hangzhou 310024, Zhejiang, China; Zhejiang University, 866 Yuhangtang Road, Hangzhou 310009, Zhejiang, China.

Hengxuan Jiang, Shandong First Medical University & Central Hospital Affiliated to Shandong First Medical University, 6699 Qingdao Road, Jinan 271016, Shandong, China.

Cong Xie, Shandong First Medical University & Central Hospital Affiliated to Shandong First Medical University, 6699 Qingdao Road, Jinan 271016, Shandong, China.

Junjie Tong, Shandong First Medical University & Central Hospital Affiliated to Shandong First Medical University, 6699 Qingdao Road, Jinan 271016, Shandong, China.

Changbin Yu, Shandong First Medical University & Central Hospital Affiliated to Shandong First Medical University, 6699 Qingdao Road, Jinan 271016, Shandong, China.

AUTHOR CONTRIBUTIONS

Shaowei An conceived the study, developed the software and drafted the manuscript. Miaoshan Lu and Ruimin Wang assisted in software development. Jinyin Wang critically revised the manuscript. Hengxuan Jiang devised certain algorithms. Cong Xie and Junjie Tong evaluated the methods using various datasets. Changbin Yu, as a corresponding author, supervised the overall study conception, design and execution and critically revised the manuscript.

FUNDING

This work was supported by the Shandong Provincial Natural Science Fund (2022HWYQ-081).

DATA AVAILABILITY

https://github.com/anshaowei/MetaPhoenix

References

  • 1. Johnson CH, Ivanisevic J, Siuzdak G. Metabolomics: beyond biomarkers and towards mechanisms. Nat Rev Mol Cell Biol 2016;17(7):451–9.
  • 2. Patti GJ, Yanes O, Siuzdak G. Metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol 2012;13(4):263–9.
  • 3. Blaženović I, Kind T, Ji J, Fiehn O. Software tools and approaches for compound identification of LC-MS/MS data in metabolomics. Metabolites 2018;8(2):31.
  • 4. Schrimpe-Rutledge AC, Codreanu SG, Sherrod SD, McLean JA. Untargeted metabolomics strategies—challenges and emerging directions. J Am Soc Mass Spectrom 2016;27(12):1897–905.
  • 5. Schymanski EL, Jeon J, Gulde R, et al. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol 2014;48(4):2097–8.
  • 6. Wang M, Carver JJ, Phelan VV, et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat Biotechnol 2016;34(8):828–37.
  • 7. Sumner LW, Amberg A, Barrett D, et al. Proposed minimum reporting standards for chemical analysis: chemical analysis working group (CAWG) metabolomics standards initiative (MSI). Metabolomics 2007;3:211–21.
  • 8. An SW, Wang R, Lu M, et al. MetaPro: a web-based metabolomics application for LC-MS data batch inspection and library curation. Metabolomics 2023;19(6):57.
  • 9. Stein SE, Scott DR. Optimization and testing of mass spectral library search algorithms for compound identification. J Am Soc Mass Spectrom 1994;5(9):859–66.
  • 10. Scheubert K, Hufsky F, Petras D, et al. Significance estimation for large scale metabolomics annotations by spectral matching. Nat Commun 2017;8(1):1494.
  • 11. Goeman JJ, Solari A. Multiple hypothesis testing in genomics. Stat Med 2014;33(11):1946–78.
  • 12. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007;4(3):207–14.
  • 13. Palmer A, Phapale P, Chernyavsky I, et al. FDR-controlled metabolite annotation for high-resolution imaging mass spectrometry. Nat Methods 2017;14(1):57–60.
  • 14. Keich U, Kertesz-Farkas A, Noble WS. Improved false discovery rate estimation procedure for shotgun proteomics. J Proteome Res 2015;14(8):3148–61.
  • 15. Böcker S, Dührkop K. Fragmentation trees reloaded. J Cheminform 2016;8:5.
  • 16. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 2001;96(456):1151–60.
  • 17. Li D, Liu B, Zheng H, et al. XY-meta: a high-efficiency search engine for large-scale metabolome annotation with accurate FDR estimation. Anal Chem 2020;92(8):5701–7.
  • 18. Chen W, Gong L, Guo Z, et al. A novel integrated method for large-scale detection, identification, and quantification of widely targeted metabolites: application in the study of rice metabolomics. Mol Plant 2013;6(6):1769–80.
  • 19. Shannon CE. A mathematical theory of communication. Bell Syst Tech J 1948;27(3):379–423.
  • 20. Li Y, Kind T, Folz J, et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat Methods 2021;18(12):1524–31.
  • 21. Horai H, Arita M, Kanaya S, et al. MassBank: a public repository for sharing mass spectral data for life sciences. J Mass Spectrom 2010;45(7):703–14.
  • 22. Wehrl A. General properties of entropy. Rev Mod Phys 1978;50:221–60.
  • 23. Dührkop K, Fleischauer M, Ludwig M, et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods 2019;16(4):299–302.
  • 24. Käll L, Storey JD, MacCoss MJ, Noble WS. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res 2008;7(1):29–34.
  • 25. Kind T, Tsugawa H, Cajka T, et al. Identification of small molecules using accurate mass MS/MS search. Mass Spectrom Rev 2018;37(4):513–32.
  • 26. Wishart DS, Guo AC, Oler E, et al. HMDB 5.0: the human metabolome database for 2022. Nucleic Acids Res 2022;50(D1):D622–31.
  • 27. Guijas C, Montenegro-Burke JR, Domingo-Almenara X, et al. METLIN: a technology platform for identifying knowns and unknowns. Anal Chem 2018;90(5):3156–64.
  • 28. Martano G, Leone M, D’Oro P, et al. SMfinder: small molecules finder for metabolomics and lipidomics analysis. Anal Chem 2020;92(13):8874–82.
  • 29. Alka O, Shanthamoorthy P, Witting M, et al. DIAMetAlyzer allows automated false-discovery rate-controlled analysis for data-independent acquisition in metabolomics. Nat Commun 2022;13(1):1347.
  • 30. Granholm V, Noble WS, Käll L. On using samples of known protein content to assess the statistical calibration of scores assigned to peptide-spectrum matches in shotgun proteomics. J Proteome Res 2011;10(5):2671–8.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
