Editor’s note:
This article has been peer reviewed.
To the Editor - Tandem mass spectrometry (MS2) data provide high-confidence identification of known molecules and preliminary characterization of novel, unknown molecules (unknowns). However, for databases to be an effective resource, broad coverage of chemical space is necessary. Consequently, we have created METLIN (http://metlin.scripps.edu), a highly annotated and structurally diverse database of over 850,000 molecular standards. METLIN’s tandem mass spectral library covers almost 1% of PubChem’s 93 million compounds, a number that can be regarded as an approximation of the currently known chemical space.
MS2 data acquisition is especially advantageous when coupled with liquid chromatography-mass spectrometry (LC-MS) analysis of complex samples1, 2. Such datasets typically have tens of thousands of features3, and while accurate mass measurements (MS1) of precursor molecular ions can provide putative identifications, these MS1 measurements alone are not sufficient to distinguish structurally among the many compounds that have identical or similar molecular weights. Characterizing every feature in an LC-MS dataset is therefore challenging and currently not possible. However, MS2 provides structural information that greatly increases the confidence of molecular identifications2. The recent expansion of the METLIN MS2 database of molecular standards offers an opportunity to quantify this improved confidence (Figure 1). METLIN now hosts over 850,000 molecular standards with MS2 data generated in both positive and negative ionization modes at multiple collision energies, collectively containing over 3,000,000 curated high-resolution tandem mass spectra. The size of METLIN thus makes molecular annotation and identification more feasible. In comparison, the NIST MS2 database, the next largest molecular standards database, contains 15,000 standards (Figure 1a).
The combination of METLIN’s molecular standards and systematically acquired experimental data allows us to examine the impact that MS2 data have on the identification of known molecules. For example, when METLIN is searched with precursor m/z values at varying parts-per-million (ppm) error tolerances, the number of hits typically ranges from tens to hundreds of compounds. With the addition of MS2 data, however, the candidates can be narrowed to only a few compounds, minimizing false positives. Beyond providing molecular identification through its multiple search capabilities (e.g. MS2, batch, name, and elemental composition), METLIN’s expansion will facilitate similarity searching. The similarity searching algorithm was originally developed to aid in the discovery and identification of novel molecules (unknowns)6. It operates by using fragment ion data to align an unknown molecule with database compounds that have similar fragmentation data, which helps to further identify and characterize the unknown6. METLIN’s fragment ion similarity search (FISS) and neutral loss similarity search (NLSS) are applied in identifying endogenous metabolites, drugs, and drug metabolites, as well as in characterizing the biotransformation of xenobiotics7. METLIN thus facilitates the identification of both endogenous and exogenous compounds. For example, it contains MS2 data for over 60,000 pyrimidine analogues and over 6,000 purine analogues (Figure 1b), among others.
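To make the two-stage narrowing described above concrete, the following is a minimal sketch in Python, not METLIN's actual search code. The library entries, m/z values, fragment tolerance, and the rule of counting shared fragment ions are all illustrative assumptions; the only definition taken as given is the standard ppm error, (observed - reference) / reference × 10^6.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Standard:
    """A hypothetical library entry: one precursor m/z and its fragment ions."""
    name: str
    precursor_mz: float
    fragments: Tuple[float, ...]


def ppm_error(observed_mz: float, reference_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def ms1_candidates(library: Sequence[Standard], observed_mz: float,
                   tol_ppm: float) -> List[Standard]:
    """Stage 1: keep every standard whose precursor lies within tol_ppm."""
    return [s for s in library
            if abs(ppm_error(observed_mz, s.precursor_mz)) <= tol_ppm]


def shared_fragments(query: Sequence[float], standard: Standard,
                     tol_da: float = 0.01) -> int:
    """Count query fragments that match any library fragment within tol_da."""
    return sum(any(abs(q - f) <= tol_da for f in standard.fragments)
               for q in query)


def ms2_filter(candidates: Sequence[Standard], query: Sequence[float],
               min_shared: int, tol_da: float = 0.01) -> List[Standard]:
    """Stage 2: keep candidates sharing at least min_shared fragment ions."""
    return [s for s in candidates
            if shared_fragments(query, s, tol_da) >= min_shared]


# Made-up toy library (values are illustrative, not taken from METLIN).
LIBRARY = [
    Standard("standard_A", 181.0707, (163.0601, 145.0495, 85.0284)),
    Standard("standard_B", 181.0712, (135.0441, 107.0491)),
    Standard("standard_C", 181.0720, (163.0601, 119.0350)),
    Standard("standard_D", 182.0812, (165.0546,)),
]

ms1_hits = ms1_candidates(LIBRARY, observed_mz=181.0709, tol_ppm=10.0)
ms2_hits = ms2_filter(ms1_hits, query=(163.0603, 145.0497, 85.0281),
                      min_shared=2)
print([s.name for s in ms1_hits])
print([s.name for s in ms2_hits])
```

In this toy case the precursor search alone returns three near-isobaric candidates, and requiring two shared fragment ions leaves one. A production search would score intensities as well as m/z agreement; simple counting is used here only to keep the sketch short.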
The expanded METLIN database will enable new types of analyses. First, we expect that an MS2 library of this size can significantly reduce the number of false positives generated when molecular identification is based solely on molecular ion values. Second, while very high accuracy MS2 data are useful, they do not significantly enhance identification confidence. Therefore, low-resolution instrumentation can be more broadly used for relatively sophisticated experiments by chemists and biologists who do not have access to high-end equipment8. Finally, METLIN can be applied to the identification of unknown compounds via fragment ion and neutral loss similarity searching (FISS and NLSS). For example, synthetic chemists can apply METLIN to the structure elucidation of unexpected products, while biochemists can use it to identify the plethora of bacterial and human metabolites in microbiome and exposome-related studies. It also has unexplored potential in the chemical, toxicological, and pharmaceutical sciences9, 10.
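The idea behind fragment ion and neutral loss similarity searching can be illustrated with a short Python sketch. This is an assumption-laden toy, not the FISS or NLSS implementation in METLIN: the function names, the 0.01 Da tolerance, the greedy one-to-one matching, and all m/z values are hypothetical. A neutral loss is taken here as the precursor m/z minus a fragment m/z, so two molecules with different precursor masses can still share losses such as water.

```python
from typing import List, Sequence, Tuple


def neutral_losses(precursor_mz: float,
                   fragments: Sequence[float]) -> List[float]:
    """Mass differences between the precursor and each lighter fragment."""
    return [precursor_mz - f for f in fragments if f < precursor_mz]


def count_matches(query: Sequence[float], reference: Sequence[float],
                  tol_da: float = 0.01) -> int:
    """Greedy one-to-one count of query values matched within tol_da."""
    used = set()
    matched = 0
    for q in query:
        for i, r in enumerate(reference):
            if i not in used and abs(q - r) <= tol_da:
                used.add(i)
                matched += 1
                break
    return matched


def similarity_counts(query_precursor: float, query_fragments: Sequence[float],
                      ref_precursor: float, ref_fragments: Sequence[float],
                      tol_da: float = 0.01) -> Tuple[int, int]:
    """Return (shared fragment ions, shared neutral losses)."""
    fragment_hits = count_matches(query_fragments, ref_fragments, tol_da)
    loss_hits = count_matches(
        neutral_losses(query_precursor, query_fragments),
        neutral_losses(ref_precursor, ref_fragments),
        tol_da,
    )
    return fragment_hits, loss_hits


# Illustrative values only: an unknown and a library compound with different
# precursor masses, one common fragment, and two common neutral losses.
unknown = (300.1000, (282.0894, 256.1100, 121.0500))
reference = (250.0800, (232.0694, 206.0902, 121.0502))

counts = similarity_counts(unknown[0], unknown[1], reference[0], reference[1])
print(counts)
```

Here the two spectra would not be linked by a precursor search, since their precursor masses differ by about 50 Da, yet the shared fragment and the shared losses suggest related substructures. That is the kind of evidence a similarity search contributes for unknowns.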
Acknowledgements
This research was partially funded by National Institutes of Health grants R35 GM130385 (G.S.), P30 MH062261 (G.S.), P01 DA026146 (G.S.), and U01 CA235493 (G.S.) and by Ecosystems and Networks Integrated with Genes and Molecular Assemblies (ENIGMA), a Scientific Focus Area Program at Lawrence Berkeley National Laboratory for the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, under contract number DE-AC02-05CH11231 (G.S.).
Footnotes
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Data availability
The data in the METLIN database are available at http://metlin.scripps.edu. The data in other databases mentioned in this study were obtained from their websites [accessed in February 2020] or from published papers: MONA (https://mona.fiehnlab.ucdavis.edu/), mzCloud (https://www.mzcloud.org/), GNPS (https://gnps.ucsd.edu/; ref4), HMDB (https://hmdb.ca/; ref5), and NIST 17 (https://chemdata.nist.gov/).
References
- 1. Guijas C et al. Anal. Chem. 90, 3156–3164 (2018).
- 2. Tautenhahn R et al. Nat. Biotechnol. 30, 826–828 (2012).
- 3. Kafader JO et al. Nat. Methods 17, 391–394 (2020).
- 4. Wang M et al. Nat. Biotechnol. 38, 23–26 (2020).
- 5. Wishart DS et al. Nucleic Acids Res. 46, D608–D617 (2018).
- 6. Benton HP, Wong DM, Trauger SA & Siuzdak G Anal. Chem. 80, 6382–6389 (2008).
- 7. Flasch M et al. ACS Chem. Biol. 15, 970–981 (2020).
- 8. Xue J et al. Anal. Chem. 92, 6051–6059 (2020).
- 9. Quinn RA et al. Nature 579, 123–129 (2020).
- 10. Clayton TA et al. Proc. Natl. Acad. Sci. USA 106, 14728–14733 (2009).