Despite the more than 100-y history of mass spectrometry, it remains challenging to link the large volume of known chemical structures to the data obtained with mass spectrometers. Presently, only 1.8% of spectra in an untargeted metabolomics experiment can be annotated. This means that the vast majority of information collected by metabolomics is “dark matter,” chemical signatures that remain uncharacterized (Fig. 1). For a genomic comparison, 80% of predicted genes in the Escherichia coli genome are known. In a bacteriophage metagenome, a well-known frontier of biological dark matter, the fraction of known genes is 1–30%, depending on the sample (1). Thus, one could argue that we know more about the genetics of uncultured phage than we do about the chemistry within our own bodies. Much of the chemical dark matter may include known structures, but they remain unidentified because their reference spectra are not available in mass spectrometry databases. The only way to overcome this challenge is through the development of computational solutions. In PNAS, Dührkop et al. describe the development of such a computational tool, called CSI (compound structure identification):FingerID (2). The tool is designed to aid in the annotation of chemistries that can be observed by mass spectrometry. CSI:FingerID uses fragmentation trees to connect tandem MS (MS/MS) data to chemical structures found in public chemistry databases. Tools such as this can allow metabolomics with mass spectrometry to become as commonly used and scientifically productive as sequencing technologies have become in the field of genomics.
There are >60 million molecules in PubChem, yet only 220,000 MS/MS spectra, representing about 20,000 molecules, are accessible for untargeted metabolomics experiments (3). Chemists and biologists attempting to identify a mass spectrum without a match in a reference database, such as GNPS, Metlin, NIST, MassBank, and others, must often resort to Googling the parent mass or manually entering it into PubChem or similar chemical databases, hoping to find a match (3–5). The alternative is complete structure elucidation de novo, an even more laborious task that can require years of work and high-level expertise to isolate and determine the structure of a single molecule. To put this in perspective, a modern-day metabolomics experiment with hundreds to thousands of independent samples can easily contain 1 million unique spectra. Assuming that spectral matching takes a trained eye approximately 10 min per spectrum, a gross underestimate, it would take 19 y of nonstop data analysis to complete a single project. This is obviously an unrealistic endeavor, especially considering that mass spectrometers will become even faster and more sensitive in the future.
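The estimate above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (1 million spectra, 10 min per spectrum, 20,000 molecules with reference spectra, 60 million PubChem molecules); the assumption of a 365-day year of uninterrupted work is ours.

```python
# Back-of-the-envelope check of the manual-annotation estimate.
SPECTRA = 1_000_000           # unique spectra in one large experiment (from the text)
MINUTES_PER_SPECTRUM = 10     # manual matching time per spectrum (from the text)
MINUTES_PER_YEAR = 60 * 24 * 365  # nonstop analysis, assuming a 365-day year

years = SPECTRA * MINUTES_PER_SPECTRUM / MINUTES_PER_YEAR
print(f"{years:.1f} years of nonstop analysis")

# Share of PubChem molecules that have accessible reference MS/MS spectra.
# PubChem holds more than 60 million molecules, so this is an upper bound.
coverage = 20_000 / 60_000_000
print(f"at most {coverage:.3%} of PubChem molecules have reference spectra")
```

The first figure reproduces the 19 y quoted above; the second shows that reference spectra exist for only a few hundredths of a percent of the known structures.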
The method presented by Dührkop et al. (2) is divided into three phases. In the first phase, called the learning phase, a database of tandem mass spectra of reference compounds is used to train a set of predictors for known molecular properties (the fingerprint). Using the data from these reference spectra, the method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. The tree assigns molecular formulas to the corresponding fragment peaks in the MS/MS spectrum, and fragments are connected by the assumed losses. The algorithm then tries to recover the identity and connectivity of the atoms in a molecular structure. With the predicted structure from the fragmentation tree, the method uses multiple similarity measures for molecular structural comparisons (called kernels) to improve the performance of molecular fingerprint prediction. The fingerprint of a molecule encodes its molecular properties, which are retrieved from publicly available known structures (e.g., in PubChem or the literature).
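To make the idea of a fragmentation tree concrete, the following minimal sketch represents each annotated peak as a node carrying a molecular formula and derives the loss on each edge as the difference between parent and child formulas. It illustrates the data structure only; it does not compute the tree that best explains a spectrum. The names `FragmentNode`, `neutral_loss`, `fmt`, and `describe`, and the C6H12O6 example, are hypothetical and not taken from CSI:FingerID.

```python
from collections import Counter
from dataclasses import dataclass, field


def neutral_loss(parent: Counter, child: Counter) -> Counter:
    """Formula of the loss that turns the parent fragment into the child."""
    diff = Counter(parent)
    diff.subtract(child)
    if any(count < 0 for count in diff.values()):
        raise ValueError("child formula is not contained in parent formula")
    return Counter({el: count for el, count in diff.items() if count > 0})


def fmt(formula: Counter) -> str:
    """Render a formula in alphabetical element order, e.g. 'CH2O'."""
    return "".join(
        f"{el}{count if count > 1 else ''}" for el, count in sorted(formula.items())
    )


@dataclass
class FragmentNode:
    formula: Counter                               # formula assigned to one peak
    children: list = field(default_factory=list)  # (loss, FragmentNode) pairs

    def add_fragment(self, child: "FragmentNode") -> "FragmentNode":
        """Attach a child fragment; the edge stores the implied loss."""
        self.children.append((neutral_loss(self.formula, child.formula), child))
        return child


def describe(node: FragmentNode, depth: int = 0) -> list:
    """List every edge of the tree as 'parent -(loss)-> child'."""
    lines = []
    for loss, child in node.children:
        lines.append(
            "  " * depth + f"{fmt(node.formula)} -({fmt(loss)})-> {fmt(child.formula)}"
        )
        lines.extend(describe(child, depth + 1))
    return lines


# Hypothetical example: a C6H12O6 precursor with two successive losses.
root = FragmentNode(Counter(C=6, H=12, O=6))
first = root.add_fragment(FragmentNode(Counter(C=6, H=10, O=5)))
first.add_fragment(FragmentNode(Counter(C=5, H=8, O=4)))
print("\n".join(describe(root)))
```

Storing the loss on the edge rather than recomputing it keeps the two kinds of information in the tree, fragments and losses, available for later comparison between trees.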
Next, still within the learning phase, a support vector machine classifier is trained using the kernel similarities to separate molecular structures into a class that has a given molecular property and a class that does not. This classification is repeated for every molecular property present in the fingerprint. With the classifiers built in this step, the method proceeds to the prediction phase. Here, given the MS/MS spectrum of an unknown compound, the task is to calculate its kernel similarities against all compounds in the reference dataset. A fragmentation tree is again built and the result is a predicted fingerprint of the unknown compound. Dührkop et al. (2) point out that the machine-learning basis of the method allows for improvement in performance with additional reference MS/MS data. In the metabolomics and natural product community, the benefit from publicly available annotated reference spectra is becoming increasingly evident. One such resource is a part of the Global Natural Product Social Molecular Networking effort at gnps.ucsd.edu, which the authors extensively used for the development of CSI:FingerID. Such reference collections are crucial for the development of search tools, because machine-learning methods perform better with more comprehensive training sets. Studies such as this one will hopefully stimulate groups that isolate and characterize specific molecules to share their data. Data-sharing will facilitate the prediction and detection of new structures within the same molecular class, which will be enormously beneficial to both the mass spectrometry and life sciences communities. Dührkop et al. (2) refer to the use of orthogonal information (retention time, infrared and UV spectroscopy, and so forth) as a way to “manually” refine the best spectral match. There are several automated methods for using such orthogonal information, but most of them are limited to a specific experimental setup (6, 7). The availability of datasets covering different organisms and experimental procedures will allow the use of the full informational content of a mass spectrum, resulting in improved identification scores.
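The per-property prediction described above can be pictured with a deliberately simplified sketch. Two substitutions are made and should be read as assumptions: a toy peak-counting similarity stands in for the fragmentation-tree kernels, and a kernel-weighted vote stands in for the trained support vector machines. The function names, m/z values, and fingerprints are all hypothetical.

```python
def shared_peaks(a, b, tol=0.01):
    """Toy similarity: number of peaks in spectrum a with a partner in b."""
    return sum(any(abs(x - y) <= tol for y in b) for x in a)


def predict_fingerprint(query, references, n_props):
    """Predict one bit per molecular property for an unknown spectrum.

    references is a list of (spectrum, fingerprint) pairs, where each
    fingerprint is a tuple of 0/1 flags of length n_props.
    """
    # Similarity of the unknown spectrum to every reference compound.
    sims = [shared_peaks(query, spectrum) for spectrum, _ in references]
    predicted = []
    for p in range(n_props):
        # One independent decision per property: references that have the
        # property vote +1, the others -1, each weighted by similarity.
        vote = sum(
            s * (1 if fp[p] else -1) for s, (_, fp) in zip(sims, references)
        )
        predicted.append(1 if vote > 0 else 0)
    return tuple(predicted)


# Made-up reference spectra (lists of m/z values) with 3-bit fingerprints.
references = [
    ([100.0, 150.0, 200.0], (1, 0, 1)),
    ([100.0, 120.0, 300.0], (1, 1, 0)),
    ([50.0, 60.0, 70.0], (0, 1, 0)),
]
query = [100.0, 150.0, 200.0]
print(predict_fingerprint(query, references, n_props=3))
```

Because every property is decided separately, adding reference spectra improves each decision independently, which is consistent with the authors' point that performance should grow with the size of the training data.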
The final stage of CSI:FingerID is the scoring phase. With the predicted fingerprint of an unknown molecule, one can retrieve from a structure database all structures that match the same molecular formula. For each candidate molecular structure, its fingerprint is scored against the predicted fingerprint. Dührkop et al. (2) benchmarked their tool and found an enormous improvement in the scoring function compared with similar algorithms (8, 9). In the last few years, computational methods for structural assessment of metabolomics data have seen significant development. For the two large-scale MS/MS datasets that were tested, the method achieved more correct identifications than the next-best available search algorithm. Dührkop et al.’s (2) method also provides fivefold more unique and correct identifications. The CSI:FingerID tool is available as a web server, providing an easy-to-use tool for wet laboratory scientists. The next steps in the tool’s evolution will be the ability to process multiple spectra at the same time in a batch process and a standalone version that runs on the user’s own computer. These options will speed the analysis workflow of complex metabolomics datasets. The method has the potential to improve identification in metabolomics experiments by expanding the search space beyond that available in spectral libraries. Dührkop et al. (2) also point to the potential to search databases containing hypothetical simulated compounds, expanding the search space by an order of millions (10). Matching spectra in a metabolomics experiment to molecules whose structure has not yet been elucidated may well be within reach in the next few years.
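As an illustration of the scoring step, the sketch below ranks candidate structures by how many fingerprint positions agree with the predicted fingerprint. Counting agreeing bits is an assumption chosen for simplicity and is not the scoring function benchmarked by Dührkop et al. (2); the candidate names and fingerprints are hypothetical.

```python
def agreement(predicted, candidate):
    """Number of fingerprint positions where both fingerprints agree."""
    return sum(p == c for p, c in zip(predicted, candidate))


def rank_candidates(predicted, candidates):
    """Return candidate names sorted from best to worst agreement.

    candidates maps a structure name to its fingerprint (tuple of 0/1).
    """
    return sorted(
        candidates,
        key=lambda name: agreement(predicted, candidates[name]),
        reverse=True,
    )


# Hypothetical candidates sharing the molecular formula of the unknown.
predicted = (1, 0, 1, 1)
candidates = {
    "candidate_A": (1, 0, 1, 0),
    "candidate_B": (0, 1, 0, 0),
    "candidate_C": (1, 0, 1, 1),
}
for name in rank_candidates(predicted, candidates):
    print(name, agreement(predicted, candidates[name]))
```

In practice a scoring function would also have to account for how reliably each property is predicted, which a plain agreement count ignores.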
As tools such as CSI:FingerID begin to illuminate more of the chemical dark matter, some form of a chemical ontology must be agreed upon to better classify and bin structures into groups of related compounds. A classification hierarchy will allow the research community to link metabolites to their associated biological processes, whether or not the specific metabolite in question is biologically characterized. Such an ontology would greatly benefit from biological information about where a particular molecule or molecular family comes from and what it does. Many compounds in structure databases are chemically synthesized and not produced naturally. Although these compounds broaden the molecular space of these databases, they are most often not clearly differentiated from natural products. For CSI:FingerID, Dührkop et al. (2) enrich the molecular property information with molecules that have known biological activity (11) and weight these signatures with higher scores in their identifications. This is crucial to avoid cluttering the search with synthetic compounds (12), because strategies that differentiate signatures of metabolites from those of synthetic compounds improve the quality of results from search tools (13, 14). Databases with biologically relevant chemical information are becoming available, including the ChEBI database, the Kyoto Encyclopedia of Genes and Genomes, and others (15, 16). These will be extremely useful to chemists and biologists as they apply computational tools to more complex systems with increasingly complex chemistry.
The field of genomics was made possible by the development of algorithms for comparing nucleic acid sequences to identify relatedness in genetic information. In the late 1980s and early 1990s hundreds of these algorithms were developed, including the basic local alignment search tool (BLAST) (17). Since then the field has exploded and technologies for sequencing millions of nucleic acids have been developed to capture the genetic information in our biological world. In metabolomics, the technological advances are already in place. Mass spectrometers are incredible machines capable of identifying the mass of molecules with unprecedented accuracy, on a massive scale, in timeframes of less than a second. However, computational resources analogous to BLAST and the NCBI’s GenBank database are only in their infancy. This year is the 25th anniversary of the release of BLAST. CSI:FingerID is an example of the type of tool required to expand the power of metabolomics and catch up to the successes of genomics. These tools are fundamental to harnessing mass spectral information and similar in their synthesis and setting to the early tools developed for genomics that revolutionized the field of biology. CSI:FingerID and other algorithms will help metabolomics catch up to the field of genomic bioinformatics, despite its 25-y head start, and begin to illuminate the diverse chemistry in our biological world.
Acknowledgments
R.R.d.S. is supported by the São Paulo Research Foundation (FAPESP-2015/03348-3).
Footnotes
The authors declare no conflict of interest.
See companion article on page 12580.
References
- 1. Mokili JL, Rohwer F, Dutilh BE. Metagenomics and future perspectives in virus discovery. Curr Opin Virol. 2012;2(1):63–77. doi: 10.1016/j.coviro.2011.12.004.
- 2. Dührkop K, Shen H, Meusel M, Rousu J, Böcker S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci USA. 2015;112:12580–12585. doi: 10.1073/pnas.1509788112.
- 3. Johnson SR, Lange BM. Open-access metabolomics databases for natural product research: present capabilities and future potential. Front Bioeng Biotechnol. 2015;3:22. doi: 10.3389/fbioe.2015.00022.
- 4. Vaniya A, Fiehn O. Using fragmentation trees and mass spectral trees for identifying unknown compounds in metabolomics. Trends Analyt Chem. 2015;69:52–61. doi: 10.1016/j.trac.2015.04.002.
- 5. Bouslimani A, Sanchez LM, Garg N, Dorrestein PC. Mass spectrometry of natural products: Current, emerging and future technologies. Nat Prod Rep. 2014;31(6):718–729. doi: 10.1039/c4np00044g.
- 6. Pluskal T, Uehara T, Yanagida M. Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal Chem. 2012;84(10):4396–4403. doi: 10.1021/ac3000418.
- 7. Stanstrup J, Gerlich M, Dragsted LO, Neumann S. Metabolite profiling and beyond: Approaches for the rapid processing and annotation of human blood serum mass spectrometry data. Anal Bioanal Chem. 2013;405(15):5037–5048. doi: 10.1007/s00216-013-6954-6.
- 8. Heinonen M, Shen H, Zamboni N, Rousu J. Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics. 2012;28(18):2333–2341. doi: 10.1093/bioinformatics/bts437.
- 9. Shen Y, Yin C, Su M, Tu J. Rapid, sensitive and selective liquid chromatography-tandem mass spectrometry (LC-MS/MS) method for the quantification of topically applied azithromycin in rabbit conjunctiva tissues. J Pharm Biomed Anal. 2010;52(1):99–104. doi: 10.1016/j.jpba.2009.12.001.
- 10. Kind T, Fiehn O. Advances in structure elucidation of small molecules using mass spectrometry. Bioanal Rev. 2010;2(1-4):23–60. doi: 10.1007/s12566-010-0015-9.
- 11. Klekota J, Roth FP. Chemical substructures that enrich for biological activity. Bioinformatics. 2008;24(21):2518–2525. doi: 10.1093/bioinformatics/btn479.
- 12. Allen F, Greiner R, Wishart D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics. 2014;11(1):98–110.
- 13. Peironcely JE, Reijmers T, Coulier L, Bender A, Hankemeier T. Understanding and classifying metabolite space and metabolite-likeness. PLoS One. 2011;6(12):e28966. doi: 10.1371/journal.pone.0028966.
- 14. Ruttkies C, Gerlich M, Neumann S. Tackling CASMI 2012: Solutions from MetFrag and MetFusion. Metabolites. 2013;3(3):623–636. doi: 10.3390/metabo3030623.
- 15. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27.
- 16. Hastings J, et al. The ChEBI reference database and ontology for biologically relevant chemistry: Enhancements for 2013. Nucleic Acids Res. 2013;41(Database issue):D456–D463. doi: 10.1093/nar/gks1146.
- 17. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2.