Bioinformatics. 2022 May 20;38(13):3484–3487. doi: 10.1093/bioinformatics/btac331

MINE 2.0: enhanced biochemical coverage for peak identification in untargeted metabolomics

Jonathan Strutz, Kevin M Shebek, Linda J Broadbelt, Keith E J Tyo
Editor: Zhiyong Lu
PMCID: PMC9237697  PMID: 35595247

Abstract

Summary

Although advances in untargeted metabolomics have made it possible to gather data on thousands of cellular metabolites in parallel, identification of novel metabolites from these datasets remains challenging. To address this need, Metabolic in silico Network Expansions (MINEs) were developed. A MINE is an expansion of known biochemistry which can be used as a list of potential structures for unannotated metabolomics peaks. Here, we present MINE 2.0, which utilizes a new set of biochemical transformation rules that covers 93% of MetaCyc reactions (compared to 25% in MINE 1.0). This results in a 17-fold increase in database size and a 40% increase in MINE database compounds matching unannotated peaks from an untargeted metabolomics dataset. MINE 2.0 is thus a significant improvement to this community resource.

Availability and implementation

The MINE 2.0 website can be accessed at https://minedatabase.ci.northwestern.edu. The MINE 2.0 web API documentation can be accessed at https://mine-api.readthedocs.io/en/latest/. The data and code underlying this article are available in the MINE-2.0-Paper repository at https://github.com/tyo-nu/MINE-2.0-Paper. MINE 2.0 source code can be accessed at https://github.com/tyo-nu/MINE-Database (MINE construction), https://github.com/tyo-nu/MINE-Server (backend web API) and https://github.com/tyo-nu/MINE-app (web app).

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Although advances in untargeted metabolomics have made it possible to gather data on thousands of cellular metabolites in parallel, identification of novel metabolites (and their associated reactions) from these datasets remains challenging. This is primarily due to the wide space of possible chemical structures that could exist for a given metabolomics peak, where an accurate m/z reveals the likely molecular formula, but not the structure. Often, PubChem is used to narrow down this chemical space, as it contains only known structures (Kim et al., 2021). However, many of these candidate structures have natural product (NP) likeness scores that fall outside the range observed for known biomolecules, indicating non-biological origin (Jeffryes et al., 2015; Sorokina and Steinbeck, 2019). In addition, structures that have not yet been observed, which a database of known structures cannot supply, are reasonably likely to occur in biological systems (Lopez et al., 2021; Sindelar and Patti, 2020).

To address this need, reaction prediction-based tools such as BioTransformer, PROXIMAL, MyCompoundID and Pickaxe have been developed (Amin et al., 2019; Djoumbou-Feunang et al., 2019a; Huan et al., 2015; Jeffryes et al., 2015). These tools are most often used to aid compound discovery for a specific organism and/or dataset and can be combined with machine learning approaches (Djoumbou-Feunang et al., 2019a; Hassanpour et al., 2020). However, they can also be used to create large, easily accessible databases of predicted biological compounds for use by the community at large (Huan et al., 2015; Jeffryes et al., 2015). The construction of these types of databases is important because they make these tools' predictions more accessible, especially to scientists with a non-computational background or those who lack computing resources.

The Metabolic in silico Network Expansion (MINE) database remains the largest public database of predicted biological compounds one reaction step away from known metabolism (Jeffryes et al., 2015). A MINE is meant to serve as a list of candidate structures (MINE compounds) used for the annotation of unannotated metabolomics peaks (Fig. 1). Unlike PubChem, MINEs contain predicted as well as known structures, and they reduce the chemical space to only those compounds that are predicted to be produced from known metabolites by known biochemical reaction rules, yielding higher-confidence candidate structures for unknown peaks. MINE databases are searchable by mass, MS2 (predicted in silico), formula, substructure and chemical similarity score (Wang et al., 2021). MINE 1.0 has been used in a variety of applications including annotating epimetabolites (Lai et al., 2017), analyzing predicted pathways to MINE compounds (Asplund-Samuelsson et al., 2018; Vila-Santa et al., 2021) and bolstering other metabolomics databases with MINE compounds (Gil de la Fuente et al., 2018; Lai et al., 2018; Laponogov et al., 2018). The original MINE 1.0 compound databases were limited, however, as they utilized a small, curated set of reaction rules that covers only 25% of known biochemical reactions in MetaCyc (Caspi et al., 2018; Hatzimanikatis et al., 2005).

Fig. 1.

MINE databases are searchable databases of computationally predicted enzyme products. They can be used to suggest potential candidate structures for unannotated peaks in untargeted metabolomics datasets. If desired, candidates can be ranked by MS2 spectral similarity against MS2 spectra predicted in silico for each candidate compound (not shown in figure).

2 Results

A set of reaction rules covering all known biochemical reactions in MetaCyc was recently reported (Ni et al., 2021). The 500 most generalizable rules of this ruleset, which cover 93% of known biochemical reactions in MetaCyc, are used to recreate the KEGG, YMDB and EcoCyc MINE databases by taking compounds from those databases and applying the 500 reaction rules. The increase in database size from MINE 1.0 to MINE 2.0, as well as the increase in coverage on two metabolomics datasets, is demonstrated below.

2.1 MINE 2.0 expansions contain an order of magnitude more compounds

Because the number of reaction operators used is now 500 (compared to 198 for MINE 1.0) and because the operators were specifically designed to maximize potential promiscuity while maintaining the fundamental chemical reaction, MINE 2.0 contains many more compounds than MINE 1.0. Only compounds with molecular weight less than 600 Da were included as reactants in these MINE 2.0 expansions for computational efficiency. The number of predicted compounds in MINE 2.0 increased for all three MINE databases by roughly one order of magnitude compared to MINE 1.0 (Table 1). We also found that the percentage of MINE compounds that also exist in PubChem decreased for all three MINEs, indicating that MINEs now contain an even higher percentage of novel structures.

Table 1.

Size comparison of MINE 1.0 and MINE 2.0

| Database | No. of Cpds in Source Database (a) | No. of Final Cpds in MINE | Fold Increase (Source → MINE) | Fold Increase (MINE 1.0 → 2.0) | No. of MINE Cpds Found in PubChem (b) | % MINE Cpds Found in PubChem (b) |
|---|---|---|---|---|---|---|
| KEGG MINE 1.0 | 13 307 | 571 368 | 43 | - | 57 550 | 10.07% |
| KEGG MINE 2.0 | 12 688 | 9 562 940 | 754 | 16.7 | 171 445 | 1.79% |
| EcoCyc MINE 1.0 | 1832 | 54 719 | 30 | - | 8799 | 16.08% |
| EcoCyc MINE 2.0 | 1880 | 906 086 | 482 | 16.6 | 41 432 | 4.57% |
| YMDB MINE 1.0 | 1978 | 100 755 | 51 | - | 8963 | 8.90% |
| YMDB MINE 2.0 | 992 | 575 262 | 580 | 5.7 | 36 049 | 6.27% |

Note: Three source databases (KEGG, EcoCyc, YMDB) are used as starting compound sets. From these source databases, MINEs are built containing predicted products of reactions consuming the source compounds. The number of compounds for MINE 1.0 and MINE 2.0 for each source database is compared.

(a) Compounds in 2.0 source databases use the current (2021) versions of KEGG, EcoCyc and YMDB, whereas 1.0 source databases used the 2015 versions, thus increasing the number of compounds from 1.0 to 2.0. However, compounds in 2.0 source databases are also limited to be less than 600 Da, decreasing the number of starting compounds such that there can be fewer starting compounds in the 2.0 source database.

(b) Stereochemistry is not considered when searching PubChem because MINE compounds are not defined with explicit stereochemistry.
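The fold increases reported in Table 1 follow directly from the compound counts. As a quick arithmetic check, the short Python sketch below recomputes them; the counts are copied from Table 1, while the dictionary layout and function name are only illustrative and are not part of the MINE codebase.

```python
# Compound counts copied from Table 1: (source compounds, final MINE compounds).
TABLE_1 = {
    "KEGG": {"1.0": (13307, 571368), "2.0": (12688, 9562940)},
    "EcoCyc": {"1.0": (1832, 54719), "2.0": (1880, 906086)},
    "YMDB": {"1.0": (1978, 100755), "2.0": (992, 575262)},
}


def fold_increases(counts):
    """Return fold increases: source -> MINE 1.0, source -> MINE 2.0, MINE 1.0 -> 2.0."""
    source_1, mine_1 = counts["1.0"]
    source_2, mine_2 = counts["2.0"]
    return mine_1 / source_1, mine_2 / source_2, mine_2 / mine_1


for name, counts in TABLE_1.items():
    f1, f2, f12 = fold_increases(counts)
    print(f"{name}: source->1.0 = {f1:.0f}x, source->2.0 = {f2:.0f}x, 1.0->2.0 = {f12:.1f}x")
```

Note that the MINE 1.0 → 2.0 ratio compares final database sizes, so it is affected both by the new rule set and by the change in source database versions described in footnote (a).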

We also investigated which types of chemistries are more highly represented in the MINE 2.0 ruleset compared to the ruleset of MINE 1.0 and likely most responsible for the difference in MINE size (Supplementary Section S1 and Supplementary Fig. S1). Overall, we find that MINE 2.0 reaction rules cover significantly more known BRENDA database reactions across nearly all EC classes, with the most significant improvements in the transferase (EC 2.x.x.x), hydrolase (EC 3.x.x.x) and isomerase (EC 5.x.x.x) reaction classes (Chang et al., 2021). Note that bimolecular reactions with two different substrates are currently not predicted due to the computationally expensive calculations that would be required to enumerate all possible substrate combinations, although they are included in these coverage calculations (Supplementary Section S1). We plan on adding this feature to the next version of MINE.

2.2 MINE 2.0 expansions contain structures that match significantly more metabolomics peaks

Two datasets were used to compare the performance of MINE 1.0 versus MINE 2.0 on metabolomics annotation. The first dataset consists only of knowns from MassBank and was originally used to validate MINE 1.0 (Horai et al., 2010; Jeffryes et al., 2015). The second dataset is from an untargeted metabolomics experiment using E. coli and is more representative of a typical untargeted metabolomics dataset, as more than 75% of peaks in this dataset remain unannotated (Sévin et al., 2017). Each dataset is filtered to only those peaks with measured m/z < 600 Da. For each dataset, the percentage of peaks with at least one matching MINE structure (based on m/z and specified adducts) is calculated for each MINE database (see Supplementary Section S2 for more details). For the MassBank dataset of known compounds, the percentage of peaks with an exact match in the MINE database is also calculated (stereochemistry not taken into account). Finally, the median number of candidate structures per peak is reported.
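To make the m/z and adduct matching described above concrete, the following is a minimal Python sketch of how a single peak could be compared against a candidate list. It is not the MINE implementation: the function name, the toy compound list and the 5 ppm tolerance are assumptions chosen for illustration (the actual matching parameters are given in Supplementary Section S2). The adduct mass shifts are the standard values for protonation, sodiation and deprotonation.

```python
# Mass shifts (Da) for a few common singly charged adducts.
ADDUCT_SHIFTS = {"[M+H]+": 1.007276, "[M+Na]+": 22.989218, "[M-H]-": -1.007276}


def candidates_for_peak(peak_mz, compounds, adducts, ppm=5.0):
    """Return (compound_id, adduct) pairs whose adduct m/z is within `ppm` of peak_mz.

    `compounds` maps a compound id to its monoisotopic mass in Da.
    The 5 ppm default is an assumed tolerance, not a value from the paper.
    """
    hits = []
    for compound_id, mono_mass in compounds.items():
        for adduct in adducts:
            theoretical_mz = mono_mass + ADDUCT_SHIFTS[adduct]
            error_ppm = abs(peak_mz - theoretical_mz) / theoretical_mz * 1e6
            if error_ppm <= ppm:
                hits.append((compound_id, adduct))
    return hits


# Toy candidate list (hypothetical database): monoisotopic masses of known metabolites.
toy_db = {"glucose": 180.06339, "fructose": 180.06339, "alanine": 89.04768}

hits = candidates_for_peak(181.0707, toy_db, ["[M+H]+", "[M+Na]+"])
print(hits)
```

In this toy example both glucose and fructose match the same peak, which illustrates why an accurate m/z narrows the molecular formula but not the structure, and why a peak counts as annotated as soon as at least one candidate matches.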

On the MassBank dataset of knowns, MINE 2.0 is found to perform slightly better overall. The percentage of exact matches increases from 66.6% to 72.7% from KEGG MINE 1.0 to KEGG MINE 2.0 (Table 2). While KEGG MINE 2.0 returns a median of 537 candidates per peak, too many to examine manually, the use of MS2 spectral matching (using in silico-predicted spectra) to rank candidates results in a median rank of 7 (best-performing MassBank spectra) to 126 (worst-performing MassBank spectra) for the exact match (see Supplementary Section S3 for more details). While MS2 in silico-predicted spectra are not always of high enough quality to make meaningful comparisons, their use on this dataset ranks most exact matches highly, suggesting that MS2 spectral matching against in silico spectral predictions is a useful filter, consistent with previous work (Dührkop et al., 2015; Laponogov et al., 2018; Ruttkies et al., 2016). However, it should be noted that if a compound's spectral predictions (or experimental spectra) are of poor quality, the correct compound may be filtered out by MS2 spectral matching.
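As an illustration of ranking candidates by MS2 spectral similarity, the sketch below scores a query spectrum against predicted spectra using cosine similarity on binned peaks. Cosine similarity is a commonly used spectral score, but it is an assumption here: this section does not state which scoring function MINE uses, and the bin width, spectra and candidate names are invented for the example.

```python
import math


def binned(spectrum, bin_width=0.01):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumed value)."""
    bins = {}
    for mz, intensity in spectrum:
        key = round(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two spectra given as lists of (m/z, intensity)."""
    a, b = binned(spec_a, bin_width), binned(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query, predicted):
    """Return candidate ids sorted from most to least similar to the query spectrum."""
    return sorted(predicted, key=lambda cid: cosine_similarity(query, predicted[cid]), reverse=True)


# Hypothetical query spectrum and in silico spectra for three candidates.
query = [(85.03, 100.0), (127.04, 40.0), (163.06, 20.0)]
predicted = {
    "cand_A": [(85.03, 90.0), (127.04, 50.0), (163.06, 10.0)],
    "cand_B": [(85.03, 10.0), (99.04, 100.0)],
    "cand_C": [(57.03, 100.0)],
}

ranking = rank_candidates(query, predicted)
print(ranking)
```

The rank of the true compound in such a list is the quantity summarized above (median rank of 7 to 126). A poorly predicted spectrum for the true compound would lower its score, which is the failure mode noted at the end of the paragraph.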

Table 2.

Comparison of MassBank test dataset annotation

| | KEGG | KEGG MINE 1.0 | KEGG MINE 2.0 | PubChem |
|---|---|---|---|---|
| % Peaks annotated (a) | 89.4% | 99.5% | 100.0% | 100.0% |
| % Peaks with correct annotation present (b, c) | 56.5% | 66.6% | 72.7% | 100.0% |
| Median number of candidate molecules | 3 | 48 | 537 | 15 274 |
| Total number of unique chemical formulas | 9195 | 60 490 | 220 934 | 4 253 738 |

Note: Peak m/z values in a MassBank dataset are searched against the KEGG, KEGG MINE 1.0 and 2.0 and PubChem databases.

(a) A peak is considered annotated if it has at least one candidate with a matching m/z in the respective database (Supplementary Section S2).

(b) Each peak in this dataset is associated with a known compound, so the percentage of peaks for which the known existed and was found in the database was calculated.

(c) Known found based on connectivity block of InChI Key (stereochemistry is not considered).
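Matching on the connectivity block, as described in footnote (c), can be sketched as follows. A standard InChIKey consists of three hyphen-separated blocks, and the first 14-character block encodes atom connectivity without stereochemistry. The helper names are hypothetical, and the keys below (L-alanine, alanine without stereochemistry, and glycine) are included only as examples.

```python
def connectivity_block(inchikey):
    """Return the first (14-character) block of a standard InChIKey."""
    return inchikey.split("-")[0]


def is_exact_match(known_key, candidate_keys):
    """True if any candidate shares the known compound's connectivity block."""
    target = connectivity_block(known_key)
    return any(connectivity_block(key) == target for key in candidate_keys)


# Example keys: a stereo-specific known compound and two stereo-free candidates.
l_alanine = "QNAYBMKLOCPYGJ-REOHSZSYSA-N"
candidates = [
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # alanine, no stereochemistry defined
    "DHMQDGOQFOQNFH-UHFFFAOYSA-N",  # glycine
]

print(is_exact_match(l_alanine, candidates))
```

Because MINE compounds are not defined with explicit stereochemistry, comparing only this first block lets a stereo-specific MassBank compound count as found when its stereo-free form is present in the database.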

To test MINE 2.0 on a more realistic dataset, a large untargeted metabolomics dataset from the literature is used (Sévin et al., 2017). This dataset contains 3099 peaks after filtering to those <600 Da, 2402 of which remained unannotated in the original work. The KEGG MINE 2.0 is able to suggest candidate compounds for 66.4% of these peaks, compared to 47.5% for KEGG MINE 1.0 (Table 3). In addition, MINE 2.0 suggests a larger number of candidates per peak to examine manually (median of 22 for 2.0 versus 1 for 1.0). These same trends hold for EcoCyc MINE 2.0 with a 45.1% annotation rate compared to 15.3% for EcoCyc MINE 1.0.

Table 3.

Comparison of Sévin et al. test dataset annotation

| | KEGG | KEGG MINE 1.0 | KEGG MINE 2.0 | EcoCyc MINE 1.0 | EcoCyc MINE 2.0 | PubChem |
|---|---|---|---|---|---|---|
| % All peaks annotated (a) | 25.0% | 55.3% | 71.9% | 24.2% | 53.5% | 89.4% |
| % Annotated peaks (b) annotated (a) | 54.9% | 81.9% | 91.1% | 54.9% | 82.4% | 99.1% |
| % Unannotated peaks (c) annotated (a) | 16.3% | 47.5% | 66.4% | 15.3% | 45.1% | 86.6% |
| Median number of candidates | 0 | 1 | 22 | 0 | 1 | 893 |

Note: Peak m/z values in a literature dataset (Sévin et al., 2017) are searched against the KEGG, KEGG MINE 1.0 and 2.0, EcoCyc 1.0 and 2.0 and PubChem databases.

(a) A peak is considered annotated if it has at least one candidate with a matching m/z in the respective database (Supplementary Section S2).

(b) ‘Annotated peaks’ are those that were putatively annotated (but not experimentally verified) in Sévin et al.

(c) ‘Unannotated peaks’ are those that were not putatively annotated in Sévin et al.
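The summary statistics in Table 3 can be reproduced from a list of per-peak candidate counts. The sketch below is illustrative only: the function name and the toy counts are invented, and it assumes that the median is taken over all peaks, including those with no candidates (which would be consistent with the medians of 0 reported where fewer than half of the peaks are annotated). The peak totals in the last lines are the figures quoted in the text for the Sévin et al. dataset.

```python
from statistics import median


def annotation_summary(candidate_counts):
    """Return (% of peaks with at least one candidate, median candidates per peak)."""
    annotated = sum(1 for n in candidate_counts if n > 0)
    return 100.0 * annotated / len(candidate_counts), median(candidate_counts)


# Hypothetical number of matching structures for eight peaks.
toy_counts = [0, 0, 3, 12, 1, 0, 40, 7]
pct_annotated, median_candidates = annotation_summary(toy_counts)
print(f"{pct_annotated:.1f}% annotated, median {median_candidates} candidates per peak")

# Share of peaks left unannotated in the original work: 2402 of 3099 peaks below 600 Da.
unannotated_share = 100.0 * 2402 / 3099
print(f"{unannotated_share:.1f}% of peaks were unannotated")
```

The last figure agrees with the statement that more than 75% of peaks in this dataset remain unannotated.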

While, overall, this is an improvement for peaks that previously had few or no candidates based on MS1 data alone, the use of a larger reaction rule set also results in some peaks with hundreds of candidates. Thus, MS/MS is even more important in MINE 2.0 to narrow down candidates for these peaks. However, we feel that this tradeoff in MINE 2.0 is justified, as candidates can now be accessed for significantly more peaks, and we anticipate that in silico MS/MS prediction tools which can filter large candidate sets will only continue to improve (Ni et al., 2021; Wang et al., 2021). As in MINE 1.0, candidates can also be filtered by organism (specifically, a user inputs a KEGG GENOME code such as ‘hsa’ for Homo sapiens), where only compounds produced from known compounds in a specific organism are highlighted within the list of search results.

Other notable improvements to MINE 2.0 include recalculation of predicted MS2 spectra for MINE compounds using CFM-ID 4.0 (Djoumbou-Feunang et al., 2019b; Wang et al., 2021) as well as thermodynamic predictions for MINE compounds using eQuilibrator (Beber et al., 2021; Noor et al., 2014) (Supplementary Section S4). While a user cannot build their own MINE through the MINE web interface, one could do so locally using the provided MINE-Database codebase (see Availability and implementation).

3 Conclusions

MINE 2.0 builds on MINE 1.0 by utilizing more known biochemical transformations to increase database size and the coverage of unannotated metabolomics peaks. These larger databases can be efficiently searched by ranking candidates using MS2 spectral similarity if desired. Overall, this work provides a valuable improvement to a tool often utilized by the metabolomics community (Asplund-Samuelsson et al., 2018; Gil de la Fuente et al., 2018; Lai et al., 2017; Vila-Santa et al., 2021).

Supplementary Material

btac331_Supplementary_Data

Acknowledgements

The authors acknowledge James Jeffryes and Joseph Ni for helpful discussions. In addition, this research was supported in part through the computational resources and staff contributions provided for the Quest high performance computing facility at Northwestern University which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology.

Funding

This work was supported by the National Science Foundation [MCB-1614953]; the U.S. Department of Energy [DE-SC0018249]; and the National Institutes of Health [T32-GM008449-23].

Conflict of Interest: none declared.

Data Availability

The data underlying this article are available in the MINE 2.0 databases, at https://minedatabase.ci.northwestern.edu, and in the GitHub repository, tyo-nu/MINE-2.0-Paper, at https://github.com/tyo-nu/MINE-2.0-Paper. The datasets were derived from sources in the public domain: KEGG (https://www.genome.jp/kegg/), MetaCyc (https://metacyc.org/), EcoCyc (https://ecocyc.org/), and YMDB (http://www.ymdb.ca/).

Contributor Information

Jonathan Strutz, Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA; Center for Synthetic Biology, Northwestern University, Evanston, IL 60208, USA; Chemistry of Life Processes Institute, Northwestern University, Evanston, IL 60208, USA.

Kevin M Shebek, Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA; Center for Synthetic Biology, Northwestern University, Evanston, IL 60208, USA; Chemistry of Life Processes Institute, Northwestern University, Evanston, IL 60208, USA.

Linda J Broadbelt, Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA; Center for Synthetic Biology, Northwestern University, Evanston, IL 60208, USA.

Keith E J Tyo, Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA; Center for Synthetic Biology, Northwestern University, Evanston, IL 60208, USA; Chemistry of Life Processes Institute, Northwestern University, Evanston, IL 60208, USA.

References

1. Amin S.A. et al. (2019) Towards creating an extended metabolic model (EMM) for E. coli using enzyme promiscuity prediction and metabolomics data. Microb. Cell Fact., 18, 109.
2. Asplund-Samuelsson J. et al. (2018) Thermodynamic analysis of computed pathways integrated into the metabolic networks of E. coli and Synechocystis reveals contrasting expansion potential. Metab. Eng., 45, 223–236.
3. Beber M.E. et al. (2021) eQuilibrator 3.0 – a platform for the estimation of thermodynamic constants estimation. Nucleic Acids Res., 50, D603–D609.
4. Caspi R. et al. (2018) The MetaCyc database of metabolic pathways and enzymes. Nucleic Acids Res., 46, D633–D639.
5. Chang A. et al. (2021) BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res., 49, D498–D508.
6. Djoumbou-Feunang Y. et al. (2019a) BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification. J. Cheminform., 11, 2.
7. Djoumbou-Feunang Y. et al. (2019b) CFM-ID 3.0: significantly improved ESI-MS/MS prediction and compound identification. Metabolites, 9, 72.
8. Dührkop K. et al. (2015) Searching molecular structure databases with tandem mass spectra using CSI: FingerID. Proc. Natl. Acad. Sci. USA, 112, 12580–12585.
9. Gil de la Fuente A. et al. (2018) Knowledge-based metabolite annotation tool: CEU mass mediator. J. Pharm. Biomed. Anal., 154, 138–149.
10. Hassanpour N. et al. (2020) Biological filtering and substrate promiscuity prediction for annotating untargeted metabolomics. Metabolites, 10, 160.
11. Hatzimanikatis V. et al. (2005) Exploring the diversity of complex metabolic networks. Bioinformatics, 21, 1603–1609.
12. Horai H. et al. (2010) MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom., 45, 703–714.
13. Huan T. et al. (2015) MyCompoundID MS/MS search: metabolite identification using a library of predicted fragment-ion-spectra of 383,830 possible human metabolites. Anal. Chem., 87, 10619–10626.
14. Jeffryes J.G. et al. (2015) MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics. J. Cheminform., 7, 44.
15. Kim S. et al. (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res., 49, D1388–D1395.
16. Lai Z. et al. (2018) Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics. Nat. Methods, 15, 53–56.
17. Lai Z. et al. (2017) Using accurate mass gas chromatography-mass spectrometry with the MINE database for epimetabolite annotation. Anal. Chem., 89, 10171–10180.
18. Laponogov I. et al. (2018) ChemDistiller: an engine for metabolite annotation in mass spectrometry. Bioinformatics, 34, 2096–2102.
19. Lopez L. et al. (2021) Identification of bioprivileged molecules: expansion of a computational approach to broader molecular space. Mol. Syst. Des. Eng., 6, 445–460.
20. Ni Z. et al. (2021) Curating a comprehensive set of enzymatic reaction rules for efficient novel biosynthetic pathway design. Metab. Eng., 65, 79–87.
21. Noor E. et al. (2014) Pathway thermodynamics highlights kinetic obstacles in central metabolism. PLoS Comput. Biol., 10, e1003483.
22. Ruttkies C. et al. (2016) MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform., 8, 3.
23. Sévin D.C. et al. (2017) Nontargeted in vitro metabolomics for high-throughput identification of novel enzymes in Escherichia coli. Nat. Methods, 14, 187–194.
24. Sindelar M., Patti G.J. (2020) Chemical discovery in the era of metabolomics. J. Am. Chem. Soc., 142, 9097–9105.
25. Sorokina M., Steinbeck C. (2019) NaPLeS: a natural products likeness scorer—web application and database. J. Cheminform., 11, 1–7.
26. Vila-Santa A. et al. (2021) Prospecting biochemical pathways to implement microbe-based production of the new-to-nature platform chemical levulinic acid. ACS Synth. Biol., 10, 724–736.
27. Wang F. et al. (2021) CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem., 93, 11692–11700. doi: 10.1021/acs.analchem.1c01465.

Articles from Bioinformatics are provided here courtesy of Oxford University Press