Author manuscript; available in PMC: 2025 Jun 6.
Published in final edited form as: Annu Int Conf IEEE Eng Med Biol Soc. 2023 Jul;2023:1–4. doi: 10.1109/EMBC40787.2023.10341007

Deep Learning Based Metabolite Annotation

Hoi Yan Katharine Chau 1, Hongyu Ao 1, Xinran Zhang 1, Shijinqiu Gao 1, Rency S Varghese 1, Habtom W Ressom 1
PMCID: PMC12143282  NIHMSID: NIHMS2080187  PMID: 38082953

Abstract

Metabolite annotation is a major bottleneck in untargeted metabolomics studies by liquid chromatography coupled with mass spectrometry (LC-MS). This is in part due to the limited publicly available spectral libraries, which consist of tandem mass spectrometry (MS/MS) data acquired from just a fraction of known compounds. Machine learning and deep learning methods provide the opportunity to predict molecular fingerprints based on MS/MS data. The predicted molecular fingerprints can then be used to help rank candidate metabolite IDs obtained based on predicted formula or measured precursor m/z of the unknown metabolite. This approach is particularly useful to help annotate metabolites whose corresponding MS/MS spectra cannot be matched with those in spectral libraries. We previously reported application of a convolutional neural network (CNN) for molecular fingerprint prediction using MS/MS spectra obtained from the MoNA repository and NIST 20. In this paper, we investigate high-dimensional representation of the spectral data and molecular fingerprints to improve accuracy in molecular fingerprint prediction.

I. Introduction

Liquid chromatography coupled with mass spectrometry (LC-MS) is one of the most common technologies used to evaluate the levels of small-molecule metabolites in biological samples. However, metabolite annotation continues to be a major bottleneck in untargeted LC-MS metabolomics studies. One approach to metabolite annotation is spectral matching of experimental tandem mass spectrometry (MS/MS) data against spectral libraries. The reach of this approach is limited because the MS/MS spectra available in publicly accessible spectral libraries represent only a small fraction of known compounds [1] [2]. A further challenge is that the instrument methods used to acquire library spectra often differ from those used by researchers seeking to annotate unknown metabolites.

Machine learning and deep learning methods offer the opportunity to improve metabolite annotation [1] [3] [4] because they can learn complex relationships between patterns in MS/MS spectra and properties of the compounds from which the spectra are derived. For example, tools such as SIMPLE and CSI:FingerID use mathematical models that are trained to learn the relationship between MS/MS spectra and molecular fingerprints [5] [6] [7]. The trained model is used to predict the molecular fingerprint of an unknown compound from its MS/MS spectrum. Potential candidates can then be ranked by comparing their molecular fingerprints against the fingerprint predicted by the model.

We previously introduced the application of a convolutional neural network (CNN), a deep learning model, for compound fingerprint prediction, together with a workflow for ranking metabolite candidates [8]. The CNN model was trained using 650,553 MS/MS spectra from MoNA and NIST 20 representing 35,683 compounds. We reported improved performance by the CNN compared to other models including logistic regression, multi-layer perceptron, and support vector machines [8] [9].

In this paper, we represent each MS/MS spectrum with a larger number of bins that are only 0.01 Da wide and add more fingerprints from ECFP [10], PubChem [11], and Klekota-Roth [12] through PyFingerprint (available from http://github.com/hcji/Pyfingerprint) to improve the CNN’s accuracy in compound fingerprint prediction. Specifically, we increased the number of bins from 1,174 to 40,088 and the number of fingerprints from 528 to 5,618 compared to our previous implementation [8].

The updated CNN model is implemented as a Python package, MetFID, and evaluated using the CASMI 2016 and CASMI 2022 benchmark datasets. Spectra representing the compounds in CASMI 2016 and CASMI 2022 were excluded from the training data so that these benchmark datasets serve as testing spectra that are independent of the training spectra. Following the prediction of the molecular fingerprints and the generation of candidates considering different adducts and ionization modes, the top-k candidates are identified and evaluated. The top-k ranking performance of MetFID is compared against random ranking and against CSI:FingerID using candidates derived from multiple sources.

II. Method

A. Workflow Overview

Figure 1 depicts the steps involved in developing a deep learning model for metabolite annotation: MS/MS data processing, molecular fingerprint prediction, model training, candidate retrieval, and performance evaluation.

Figure 1.


Workflow for developing a deep learning-based method for molecular fingerprint prediction based on MS/MS spectra and its application for ranking metabolite candidates.

B. MS/MS Data Processing

We performed the following MS/MS data processing steps to prepare the spectra prior to training a CNN model.

  • Downloaded MS/MS spectra acquired by LC-MS/MS from libraries available in the MoNA repository including Vaniya/Fiehn Natural Products Library, GNPS, RIKEN PlaSMA, MassBank, HMDB, MetaboBASE, Pathogen Box, Fiehn HILIC, etc. [13].

  • Obtained the NIST 20 library from one of NIST’s MS/MS library distributors.

  • Scaled peak list from each spectrum such that the peak intensity values range between 0 and 100.

  • Removed spectra that consisted of fewer than five peaks with relative intensity above 2%.

  • Removed peaks whose m/z values were larger than the precursor mass.

  • Excluded peaks that fell outside the mass range between 100 and 1010 Da.

  • Selected spectra acquired via instrument types such as Orbitrap, QqQ, Q-TOF, or ion trap (IT) to make the training data as homogeneous as possible.

  • Selected 79,404 LC-MS/MS spectra in positive mode and 32,269 LC-MS/MS spectra in negative mode from MoNA.

  • Selected 401,985 LC-MS/MS spectra in positive mode and 136,895 LC-MS/MS spectra in negative mode from NIST 20.

  • Obtained a total of 650,553 MS/MS spectra representing 35,683 compounds as training set.

  • Merged peak lists from multiple spectra if the spectra belong to the same compound using the first part of InChIKey as compound identifier.

  • Transformed the resulting peak lists into vectors of equal length by binning them into pre-specified bins. A bin size of 0.01 Da yielded a total of 117,330 bins.

  • Calculated the accumulated peak intensities within each bin.

  • Removed bins that were zero across all spectra, leaving 40,088 bins to be used as input to a deep learning model.
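The peak-level steps in this list can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the MetFID implementation: the function names are hypothetical, a spectrum is assumed to be a list of (m/z, intensity) pairs, and the bin edges (bins centred on multiples of 0.01 Da, counted from 100 Da) are a guess, since the text does not specify them.

```python
from collections import defaultdict

BIN_WIDTH = 0.01                 # Da, as stated in the text
MZ_MIN, MZ_MAX = 100.0, 1010.0   # mass range kept, as stated in the text

def preprocess(peaks, precursor_mz):
    """Scale, filter, and trim one spectrum; return None if it is rejected."""
    if not peaks:
        return None
    top = max(intensity for _, intensity in peaks)
    if top <= 0:
        return None
    # Scale intensities so that the largest peak equals 100.
    scaled = [(mz, 100.0 * intensity / top) for mz, intensity in peaks]
    # Reject spectra with fewer than five peaks above 2% relative intensity.
    if sum(1 for _, intensity in scaled if intensity > 2.0) < 5:
        return None
    # Drop peaks above the precursor mass or outside the 100-1010 Da range.
    return [(mz, intensity) for mz, intensity in scaled
            if mz <= precursor_mz and MZ_MIN <= mz <= MZ_MAX]

def bin_spectrum(peaks):
    """Accumulate intensities per 0.01 Da bin (bin placement is an assumption)."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(round((mz - MZ_MIN) / BIN_WIDTH))] += intensity
    return dict(bins)

def nonzero_bins(binned_spectra):
    """Sorted bin indices that are non-zero in at least one spectrum."""
    return sorted({index for spectrum in binned_spectra for index in spectrum})
```

Keeping each binned spectrum as a sparse dictionary and building the dense input vector only over `nonzero_bins(...)` mirrors the final step of the list, where all-zero bins are discarded.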

C. Molecular Fingerprint Prediction

Molecular fingerprints for each compound in the training set were calculated using PyFingerprint and OpenBabel [14] [15]. Specifically, the MACCS, FP3, FP4, PubChem, ECFP, and Klekota-Roth fingerprints were computed and concatenated into a vector of 7,293 binary entries [16] [10] [11] [12]. Entries that were all 0’s or all 1’s across the entire training set were removed, leaving 5,618 binary entries to be used for training a deep learning model.
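The removal of constant entries can be illustrated with a short sketch. It assumes the fingerprints are already available as equal-length lists of 0/1 values, one per compound; the function names are hypothetical and no fingerprint library is called.

```python
def informative_bits(fingerprints):
    """Indices of bits that are neither all 0 nor all 1 across the compounds."""
    n_compounds = len(fingerprints)
    column_sums = [sum(column) for column in zip(*fingerprints)]
    return [j for j, total in enumerate(column_sums) if 0 < total < n_compounds]

def select_bits(fingerprint, keep):
    """Project one fingerprint onto the retained bit positions."""
    return [fingerprint[j] for j in keep]
```

The same list of retained indices would have to be applied to the candidate fingerprints at ranking time so that predicted and calculated fingerprints stay aligned.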

D. Model Training

After MS/MS data processing and molecular fingerprint calculation, we trained a deep learning model to learn the relationships between spectral patterns, represented by peak intensities in 40,088 bins, and compound fingerprints, represented by 5,618 binary entries. Fig. 2 depicts the architecture of a one-dimensional CNN (1D CNN) consisting of twelve layers (a Sequential layer, an Embedding layer with 32 nodes, two Convolution1D layers, two MaxPooling1D layers, a Dropout layer, a Flatten layer, and four Hidden layers) following the input layer. The 1D CNN model was trained using the Keras Python package with TensorFlow as the back end [17]. The trained CNN model was implemented as a Python package, MetFID.

Figure 2.


Architecture of the convolutional neural network (CNN) model.

E. Performance Evaluation

To evaluate the performance of MetFID, we used the CASMI 2016 and CASMI 2022 benchmark datasets, which consist of 208 and 250 challenges, respectively. Spectra acquired from the compounds included in these challenges were excluded from the training data. As a result, the benchmark datasets are structurally disjoint from our training set (i.e., no compound in the training set has the same first part of InChIKey as a compound in the testing set), and thus can serve as an independent testing set.

To measure the fingerprint prediction performance of MetFID, we calculated the Tanimoto similarity score (1) and the F1 score (2), where TN, TP, FN and FP represent the number of true negatives, true positives, false negatives, and false positives, respectively. Precision and recall are calculated as TP/(TP+FP) and TP/(TP+FN), respectively. Positions where both the predicted fingerprint and the true fingerprint are zero (true negatives) are excluded from the Tanimoto similarity score because zeros greatly outnumber ones in the fingerprints.

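$$\text{Tanimoto similarity score} = \frac{TP}{TP + FP + FN} \tag{1}$$

$$\text{F1 score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2}$$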
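As a worked illustration of equations (1) and (2), the sketch below computes both scores for a pair of binary fingerprints. The function names are hypothetical, and returning 0 when a denominator is zero is an assumption made here for convenience; the text does not say how such cases are handled.

```python
def confusion_counts(true_fp, pred_fp):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(true_fp, pred_fp) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_fp, pred_fp) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_fp, pred_fp) if t == 1 and p == 0)
    return tp, fp, fn

def tanimoto_score(true_fp, pred_fp):
    """Equation (1): TP / (TP + FP + FN); true negatives are ignored."""
    tp, fp, fn = confusion_counts(true_fp, pred_fp)
    denominator = tp + fp + fn
    return tp / denominator if denominator else 0.0

def f1_score(true_fp, pred_fp):
    """Equation (2): harmonic mean of precision and recall."""
    tp, fp, fn = confusion_counts(true_fp, pred_fp)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For a true fingerprint `[1, 1, 0, 0, 1, 0]` and a prediction `[1, 0, 0, 1, 1, 0]`, TP = 2, FP = 1 and FN = 1, so the Tanimoto score is 2/4 = 0.5 and the F1 score is 2/3.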

To evaluate the performance of MetFID in metabolite annotation, we retrieved metabolite candidates together with compound names, InChIKeys, formulas, and SMILES from compound databases such as HMDB, MMCD, METLIN, LIPID MAPS, and KEGG based on the precursor m/z with a 10 ppm tolerance and considering six adducts ([M+H]+, [M+NH4]+, [M+Na]+, [M+Cl]−, [M+FA-H]− and [M-H]−). If the formula of a compound is known, the candidate list is shortened by excluding all candidates whose formula differs from the known one.
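A minimal sketch of mass-based candidate retrieval with a ppm tolerance follows. It is not the MetFID code: the function names and the in-memory `database` of (name, formula, monoisotopic mass) tuples are hypothetical, and the adduct mass shifts are commonly tabulated monoisotopic values for singly charged ions rather than numbers taken from the paper.

```python
# Mass shift (Da) added to the neutral monoisotopic mass for each adduct.
# Commonly tabulated values; an assumption, not taken from the paper.
ADDUCTS = {
    "[M+H]+":    ("positive", 1.007276),
    "[M+NH4]+":  ("positive", 18.033823),
    "[M+Na]+":   ("positive", 22.989218),
    "[M+Cl]-":   ("negative", 34.969402),
    "[M+FA-H]-": ("negative", 44.998201),
    "[M-H]-":    ("negative", -1.007276),
}

def within_ppm(observed_mz, theoretical_mz, ppm=10.0):
    """True if the observed m/z lies within the ppm tolerance."""
    return abs(observed_mz - theoretical_mz) <= theoretical_mz * ppm * 1e-6

def retrieve_candidates(precursor_mz, mode, database, ppm=10.0, formula=None):
    """Return (name, adduct) pairs whose adduct m/z matches the precursor.

    database: iterable of (name, formula, monoisotopic_mass) tuples.
    formula:  if given, candidates with a different formula are skipped.
    """
    hits = []
    for name, candidate_formula, mono_mass in database:
        if formula is not None and candidate_formula != formula:
            continue
        for adduct, (adduct_mode, shift) in ADDUCTS.items():
            if adduct_mode == mode and within_ppm(precursor_mz, mono_mass + shift, ppm):
                hits.append((name, adduct))
    return hits
```

For example, glucose (C6H12O6, monoisotopic mass 180.06339 Da) gives a theoretical [M+Na]+ m/z of about 203.05261, so a positive-mode precursor at m/z 203.0526 would retrieve it as a sodium adduct.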

We calculated the molecular fingerprints for all candidates and ranked the candidates based on the Tanimoto similarity score between the calculated molecular fingerprints and those predicted by MetFID. The accuracy of the model is evaluated based on the top-k ranking of true candidates in the annotation results.
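The ranking step can be sketched as follows, assuming the predicted fingerprint has already been converted to binary values and that candidates are stored in a dictionary keyed by a compound identifier. Both the data layout and the function names are assumptions for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by bits that are on in either fingerprint."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0

def rank_candidates(predicted_fp, candidates):
    """Return candidate ids ordered from most to least similar."""
    return sorted(candidates,
                  key=lambda cid: tanimoto(predicted_fp, candidates[cid]),
                  reverse=True)

def rank_of_truth(ranked_ids, true_id):
    """1-based rank of the true compound, or None if it is not a candidate."""
    return ranked_ids.index(true_id) + 1 if true_id in ranked_ids else None
```

A challenge counts as a top-k hit when `rank_of_truth` returns a value no greater than k.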

III. Results and Discussion

A. Performance Evaluation of MetFID vs. CSI:FingerID

Before using MetFID for ranking metabolite candidates, we evaluated the trained CNN model in predicting molecular fingerprints of the compounds in the benchmark datasets based on the binned MS/MS spectra. The model yielded a Tanimoto similarity score of 46% and an F1 score of 61% between the true and predicted fingerprints of the compounds in CASMI 2016. The Tanimoto similarity and F1 scores for the CASMI 2022 dataset were 20% and 32%, respectively.

Table I shows the performance of MetFID in ranking the metabolite candidates it generated based on the MS/MS spectra obtained from the CASMI 2016 and CASMI 2022 benchmark datasets. The table compares the top-k ranking results obtained by MetFID and CSI:FingerID for the metabolite candidates generated by the respective tools. The ranking accuracy in Table I refers to the percentage of testing cases in which the correct metabolite appears in the top-k of the ranked candidate list. The percentage in parentheses is calculated by excluding the benchmark spectra for which the true compound is missing from the candidate list. This second figure accounts for cases in which the target compound cannot be found in the compound databases associated with each tool. With mass-based candidate retrieval, MetFID ranked the correct compound first in 50% and 32% of the cases for the CASMI 2016 and CASMI 2022 datasets, respectively.
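The two accuracy figures described above differ only in their denominator. A small sketch, assuming each challenge is summarised by the 1-based rank of the true compound or `None` when the true compound is absent from the candidate list (the function name and the example ranks are made up for illustration):

```python
def top_k_accuracy(ranks, k, exclude_missing=False):
    """Fraction of challenges whose true compound is ranked within the top k.

    ranks: 1-based rank of the true compound per challenge, or None if the
           true compound is not in the candidate list.
    exclude_missing: if True, challenges with rank None are dropped from the
           denominator (the value reported in parentheses).
    """
    if exclude_missing:
        ranks = [r for r in ranks if r is not None]
    if not ranks:
        return 0.0
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)
```

With the illustrative ranks `[1, 3, None, 7, 2]`, the top-3 accuracy is 3/5 = 60% over all challenges and 3/4 = 75% once the challenge with a missing true compound is excluded.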

TABLE I.

Performance comparison between MetFID and CSI:FingerID based on the CASMI 2016 and CASMI 2022 benchmark datasets.

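| Dataset | CASMI 2016 | CASMI 2016 | CASMI 2016 | CASMI 2016 | CASMI 2022 | CASMI 2022 | CASMI 2022 | CASMI 2022 |
| Candidates | Mass-Based | Mass-Based | Formula-Based | Formula-Based | Mass-Based | Mass-Based | Formula-Based | Formula-Based |
| Ranking | MetFID | CSI:FingerID | MetFID | CSI:FingerID | MetFID | CSI:FingerID | MetFID | CSI:FingerID |
| Top 1 | 50% (50%) | 31% (39%) | 69% (69%) | 46% (51%) | 32% (32%) | 46% (51%) | 47% (48%) | 10% (17%) |
| Top 3 | 72% (72%) | 38% (48%) | 88% (88%) | 58% (64%) | 54% (54%) | 58% (64%) | 69% (69%) | 20% (36%) |
| Top 5 | 83% (83%) | 39% (50%) | 93% (93%) | 60% (66%) | 66% (67%) | 60% (66%) | 78% (79%) | 24% (44%) |
| Top 10 | 91% (91%) | 41% (52%) | 97% (97%) | 62% (69%) | 80% (81%) | 62% (69%) | 86% (86%) | 28% (50%) |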

Note: the percentage in parentheses is calculated by excluding the benchmark spectra for which the true compound is missing from the candidate list.

Assuming that the formulas of the compounds in the benchmark datasets are known, MetFID ranked the correct compound first in 69% and 47% of the cases for the CASMI 2016 and CASMI 2022 datasets, respectively. Both tools show better performance on formula-based ranking than on mass-based ranking. For these datasets, we observed that the use of formula information helps shrink the average length of candidate lists from a large number to about 6 to 28 candidates. While MetFID performed better than CSI:FingerID in ranking the candidates for the CASMI 2016 dataset, CSI:FingerID performed far better in ranking the candidates for the CASMI 2022 dataset.

B. Random Ranking of Candidates from MetFID

For further evaluation, we randomly shuffled the candidate list 100 times without using the deep learning model. Table II shows the top-k ranking result. The purpose of this exercise is to assess the improvement achieved by deep learning compared to ranking the same candidates randomly. This result demonstrates the benefit of using the deep learning-based ranking of metabolite candidates.
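To make the random baseline concrete: when a list of n candidates containing the true compound is shuffled uniformly, the true compound lands in the top k with probability min(k, n)/n. The sketch below compares that closed form with a simulation of 100 shuffles. The function names, the seed, and the example list length are illustrative choices, not details from the paper.

```python
import random

def expected_top_k(n_candidates, k):
    """Chance that a uniformly shuffled list places the true compound in the top k."""
    return min(k, n_candidates) / n_candidates

def simulated_top_k(n_candidates, k, n_shuffles=100, seed=0):
    """Estimate the same chance by shuffling the candidate list repeatedly."""
    rng = random.Random(seed)
    candidates = list(range(n_candidates))  # index 0 stands for the true compound
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(candidates)
        if candidates.index(0) < k:
            hits += 1
    return hits / n_shuffles
```

Because the random top-k rate depends only on list length, short candidate lists (as in formula-based retrieval) yield a high baseline even without any model, which is worth keeping in mind when reading the formula-based columns.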

TABLE II.

Random Ranking of Candidates.

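| Dataset | CASMI 2016 | CASMI 2016 | CASMI 2022 | CASMI 2022 |
| Ranking | Mass-Based | Formula-Based | Mass-Based | Formula-Based |
| Top 1 | 18% | 59% | 18% | 41% |
| Top 3 | 41% | 80% | 35% | 59% |
| Top 5 | 56% | 87% | 47% | 69% |
| Top 10 | 76% | 94% | 66% | 81% |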

C. Evaluation of the Impact of the Candidate List

The performance of MetFID and CSI:FingerID was further evaluated by interchanging the candidate lists obtained by each prediction tool to assess the impact of the candidate list on top-k ranking. Switching the candidate lists allows a more direct comparison of the ability of both tools to rank the true candidate. We obtained 1,211,645 candidates for CASMI 2016 and 1,166,820 candidates for CASMI 2022 from CSI:FingerID. After removing the duplicate compounds, we retrieved fingerprints and compound information for 1,530,674 CSI:FingerID candidates in total. In contrast, only 3,444 and 5,781 candidates were retrieved by MetFID for CASMI 2016 and CASMI 2022, respectively. Table III shows that the top-k ranking performance of MetFID dropped substantially when using the long candidate list obtained from CSI:FingerID. This implies that further optimization of the CNN model in MetFID is needed to improve its ability to rank candidates derived from larger compound databases. In contrast, MetFID performed well when presented with a relatively short list of candidates, which suggests using a more focused compound database based on prior knowledge.

TABLE III.

Top-k ranking performance of MetFID using the candidate list obtained from CSI:FingerID and top-k ranking performance of CSI:FingerID using the candidate list obtained from MetFID with the CASMI 2016 and CASMI 2022 datasets.

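| Prediction Tool | MetFID | MetFID | MetFID | MetFID | CSI:FingerID | CSI:FingerID | CSI:FingerID | CSI:FingerID |
| Candidate List Source | CSI:FingerID | CSI:FingerID | CSI:FingerID | CSI:FingerID | MetFID | MetFID | MetFID | MetFID |
| Dataset | CASMI 2016 | CASMI 2016 | CASMI 2022 | CASMI 2022 | CASMI 2016 | CASMI 2016 | CASMI 2022 | CASMI 2022 |
| Ranking | Mass-based | Formula-based | Mass-based | Formula-based | Mass-based | Formula-based | Mass-based | Formula-based |
| Top 1 | 3% (4%) | 4% (4%) | 1% (3%) | 2% (4%) | 63% (70%) | 70% (85%) | 29% (38%) | 41% (47%) |
| Top 3 | 10% (15%) | 15% (15%) | 3% (6%) | 7% (10%) | 73% (82%) | 80% (89%) | 50% (65%) | 61% (70%) |
| Top 5 | 14% (16%) | 20% (20%) | 5% (10%) | 9% (14%) | 79% (89%) | 84% (94%) | 56% (73%) | 67% (77%) |
| Top 10 | 17% (20%) | 27% (27%) | 11% (22%) | 17% (26%) | 84% (94%) | 88% (98%) | 64% (83%) | 75% (86%) |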

Note: the percentage in parentheses is calculated by excluding the benchmark spectra for which the true compound is missing from the candidate list.

IV. Conclusion

In this paper, we introduced an updated workflow for our deep learning-based metabolite annotation tool, MetFID. Using the CASMI 2016 and CASMI 2022 benchmark datasets, we evaluated the performance of MetFID in compound fingerprint prediction and in ranking metabolite candidates. While increasing the number of bins and fingerprints has improved the performance of MetFID, we observed that the performance deteriorated substantially when a large number of candidates was presented to MetFID for ranking. Thus, future work will focus on optimizing the deep learning model so that it ranks more accurately when a large number of candidates is involved. We also plan to expand the compound database that MetFID uses so that it can retrieve more candidates. Finally, we seek to implement methods that predict formulas based on mass spectral data, since in most cases the formulas of analytes in untargeted metabolomics are unknown.

Acknowledgment

This study is supported by the National Institute of General Medical Sciences under Award Number R35GM141944 awarded to H.W.R.

References
