Journal of Cheminformatics. 2022 Sep 22;14:64. doi: 10.1186/s13321-022-00587-7

An initial investigation of accuracy required for the identification of small molecules in complex samples using quantum chemical calculated NMR chemical shifts

Yasemin Yesiltepe 1,2, Niranjan Govind 2, Thomas O Metz 1, Ryan S Renslow 1,2
PMCID: PMC9499888  PMID: 36138446

Abstract

The majority of primary and secondary metabolites in nature have yet to be identified, representing a major challenge for metabolomics studies that currently require reference libraries from analyses of authentic compounds. Using currently available analytical methods, complete chemical characterization of metabolomes is infeasible for both technical and economic reasons. For example, unambiguous identification of metabolites is limited by the availability of authentic chemical standards, which, for the majority of molecules, do not exist. Computationally predicted or calculated data are a viable solution to expand the currently limited metabolite reference libraries, if such methods are shown to be sufficiently accurate. For example, determining nuclear magnetic resonance (NMR) spectroscopy spectra in silico has shown promise in the identification and delineation of metabolite structures. Many researchers have been taking advantage of density functional theory (DFT), a computationally inexpensive yet reputable method for the prediction of carbon and proton NMR spectra of metabolites. However, such methods are expected to have some error in predicted 13C and 1H NMR spectra with respect to experimentally measured values. This leads us to the question–what accuracy is required in predicted 13C and 1H NMR chemical shifts for confident metabolite identification? Using the set of 11,716 small molecules found in the Human Metabolome Database (HMDB), we simulated both experimental and theoretical NMR chemical shift databases. We investigated the level of accuracy required for identification of metabolites in simulated pure and impure samples by matching predicted chemical shifts to experimental data. We found 90% or more of molecules in simulated pure samples can be successfully identified when errors of 1H and 13C chemical shifts in water are below 0.6 and 7.1 ppm, respectively, and below 0.5 and 4.6 ppm in chloroform solvation, respectively. 
In simulated complex mixtures, as the complexity of the mixture increased, greater accuracy of the calculated chemical shifts was required, as expected. However, if the number of molecules in the mixture is known, e.g., when NMR is combined with MS and sample complexity is low, the likelihood of confident molecular identification increased by 90%.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13321-022-00587-7.

Keywords: Metabolomics, Small molecules, NMR, DFT, Quantum chemistry

Introduction

Metabolomics and exposomics involve the large-scale study of small molecules found in biological and environmental samples, including endogenous and exogenous chemicals, and their molecular breakdown products [1–3]. For human studies, understanding the active metabolic pathways and fate of exogenous chemicals is a major focus area for improving health through precision medicine, as well as an important tool for researching and understanding the state of environmental and agricultural conditions [4–7]. Biological and environmental samples typically comprise numerous molecules and are often in a complex matrix. It is not practical to develop or apply sample preparation methods for isolation of individual constituents (whether because of concentration limits, separation difficulty, or project cost limitations). The ability to comprehensively characterize such complex samples would result in significant advances in multiple scientific fields and enable solutions to currently irresolvable problems in the understanding of metabolic pathways and biological systems such as active phenotypes/functions, industrial reactions such as those related to (bio)fuels and high-value (bio)products, environmental processes, human actions in society, and even earth systems and climate.

Using nuclear magnetic resonance (NMR) [8–10], mass spectrometry (MS) [11–13], and other tools [14–16], a wide range of molecules have been identified and extensively documented in the literature [17–21]. Hundreds of thousands of metabolites are now known and their MS/MS or NMR data are electronically available on public and commercial chemical databases [22, 23] such as PubChem [24], Royal Society of Chemistry ChemSpider [25], ChEMBL by European Molecular Biology Laboratory [26, 27], Chemical Entities of Biological Interest (ChEBI) [28, 29], DrugBank [30, 31], Biological Magnetic Resonance Bank (BMRB) [32], Human Metabolome Database (HMDB) [33], GDB13 [34], The Small Molecule Pathway Database (SMPDB) [35, 36], Distributed Structure-Searchable Toxicity (DSSTox) Database [37], E. coli Metabolome Database (ECMDB) [38, 39], EcoCyc E. coli Database [40], Food Component Database (FooDB) [41], LIPID MAPS In-Silico Structure Database (LMISSD) [42], MetaCyc Metabolic Pathway Database [43], MolMall [44], Super Natural II [45], The Toxin and Toxin Target Database (T3DB) [46, 47], ToxCast [48], The Universal Natural Products Database (UNPD) [49], and ZINC [50]. However, the vast majority of molecules that are found in complex biological and environmental samples are not represented in current identification libraries (across multiple analytical platforms) [51, 52]. For example, the largest mass spectral libraries, the Wiley Registry and NIST libraries, contain more than 1 million mass spectra [53, 54]. HMDB (ver. 4.0) describes 114,260 metabolites, and of the molecules described in HMDB, only a small portion are available for purchase as authentic reference material [55–57]. ZINC 15, a database of ~1.8 B compounds, currently has 81,519 endogenous human metabolite structures, and of these, 9490 (12%) are immediately available for purchase [58, 59]. Furthermore, it is hypothesized that 10^60 or more molecules are structurally feasible (for molecules < 1000 Da) [60–62], and far fewer than 1% are available in molecular identification reference libraries [63–65]. Thus, one cause of the currently restricted size of small molecule identification libraries is the limited number of molecules available for purchase as authentic reference material. Even if all molecules were known and available for purchase, the time and cost to analyze these for building reference libraries would be prohibitive [66, 67]. The fields of metabolomics and exposomics, and small molecule identification generally, must overcome a significant, longstanding obstacle: the absence of analytical methods for comprehensive and unambiguous identification of small molecules without reliance on reference data obtained from analysis of chemical standards [68–70].

For molecular properties that are consistently calculable with a known (preferably low) error, it is possible to create in silico reference libraries in order to reduce reliance on authentic chemical standards [70]. Several analytical methodologies, such as those based on chromatography coupled with MS [71–73] and NMR [74], have demonstrated feasibility for compound identification based on predicted properties. Because NMR is non-destructive and easily quantifiable, it is a unique tool for identifying novel compounds and handling complex metabolite mixtures without the need for chemical separation [75]. For example, MS/MS spectra yield reasonable accuracy for predictions of molecular properties and can be coupled with machine learning methods [76], but are limited to short lists of small molecules [77, 78]. Quantum chemical applications such as infrared spectra [79], molecular collisional cross sections (CCS) [80, 81], and NMR chemical shifts [82–85] are promising for the calculation of molecular attributes. For example, coupling calculated mass and CCS has contributed to successful chemical identification of cis/trans isomers [86, 87], as well as isomers in complex synthetic samples [88]. For studies specifically using NMR, quantum chemical simulations for the prediction of spectra have been a valuable tool for the community. In the last two decades, density functional theory (DFT), an exceptionally well-established approach for high-throughput chemical calculations that offers high performance at low computational cost, has been widely applied to predict NMR chemical shifts [89–91] of molecules and conformers [92–94] in different custom solvent conditions [95–97]. Furthermore, structural elucidation is one of the most practical uses of NMR, and it is common to utilize NMR chemical shift calculations along with experimental shifts to identify compound mixtures [98–100] and to aid reassignment of structures or stereostructure assignment [101–103].

Currently, the use and acceptance of predicted NMR chemical shifts is limited due to an incomplete understanding of the required accuracy of such predictions for confident molecular identification. It has already been demonstrated that heuristic/empirical approaches for chemical shift predictions are generally of low accuracy compared to quantum chemical calculation-based methods (e.g. DFT) [83, 104, 105]. For DFT approaches, the factors that significantly affect the accuracy of predicted 13C and/or 1H NMR chemical shifts are the optimization level of molecular geometry [106–108], the use of different DFT theories [109–112], implicit and explicit solvation models [113–115], unique molecular properties of metabolites [116–121], etc. Agreement between predicted and experimental chemical shifts can be improved when (i) the basis set is enlarged [104, 122], (ii) the quality of the method is improved for geometry optimization [123, 124], (iii) a scaling procedure is employed [125, 126], (iv) conformational sampling is applied [127], and (v) solvation is taken into account appropriately [128, 129]. However, the question of what level of accuracy is required for calculated NMR chemical shifts when using these as reference spectra for molecular identification remains largely unexplored.

In this study, we investigate the accuracy and/or level of confidence in predicted NMR chemical shifts required to identify small molecules using reference libraries of varying size. Specifically, we present a detailed study on the role of accuracy in the prediction of 13C and 1H NMR spectra for confident metabolite identification in solution phase using a chloroform and water continuum model. We estimate the minimum and maximum error limits which hinder or enable 13C and 1H NMR chemical shift predictions to unambiguously identify molecular structures. In this study, we discuss two cases—simple and complex samples—using 11,716 small molecules taken from the HMDB [18]. We cover different chemical functional groups and explore the results to provide statistics for libraries of different sizes.

Materials and methods

Molecule sets

Two sets of molecules taken from HMDB 4.0 [56] and distinguished by their reported partition coefficients were simulated, one in water (Set I) and a second in chloroform (Set II) as the solvent. The included compounds were not in salt forms, consist only of C, H, O, N, P and S atoms, and are in the molecular weight range of 27 to 500 Da. Set I, the water solvated set, contains 2,723 molecules (29,489 carbon and 45,426 hydrogen nuclei in total across all molecules) and spans a wide range of structure-based chemical classes and chemical functionalities including organic acids, organonitrogen compounds, nucleosides, nucleotides, organoheterocyclic compounds, carboxylic acids, organooxygen compounds, and benzenoids as determined by the hierarchical chemical classification scheme, ClassyFire [130]. Set II, the chloroform solvated set, contains 8,990 molecules (138,535 carbon and 191,327 hydrogen nuclei in total across all molecules) and also spans a broad range of chemical functionalities including organic compounds, organic acids, lipids, benzenoids, and organoheterocyclic compounds. Figure 1 compares the number of molecules containing a given amount of carbons and hydrogens for Sets I and II. The molecules and their geometries in both Sets are provided in the Additional file 1.

Fig. 1. Histograms depicting the number of molecules in each set for a given number of carbon atoms in (a) Set I and (b) Set II, and hydrogen atoms in (c) Set I and (d) Set II

Computational details

The NMR chemical shifts for all molecules in this study were calculated using the In Silico Chemical Library Engine (ISiCLE) [131] (see github.com/pnnl/isicle for the latest version of ISiCLE). ISiCLE is an automated pipeline for high-accuracy chemical property calculation, implemented using the Snakemake workflow management system [132]. This pipeline takes SMILES [133] (a line notation representation of molecule structure) as input, generates initial 3D molecular conformations, and subsequently optimizes this initial structure and calculates chemical properties through quantum chemistry via NWChem [134] (an open-source, high-performance computational chemistry software developed at PNNL). For this study, all molecules were initially optimized in solvent using the computationally inexpensive B3LYP [135, 136] with the 3-21G basis set [137–139]. We chose this level of theory due to our available computational resources, particularly considering the cost of geometry optimization for over 11,000 molecules. It is known that the 3-21G basis set for geometry optimization is not adequate to obtain high accuracy in NMR chemical shift calculations [127, 140, 141], but in this study it is only used to simulate NMR spectral data in order to obtain a reasonable representative distribution of (likely moderate accuracy) chemical shifts. Assessment of the best computational approaches to maximize accuracy of NMR chemical shift calculations is beyond the scope of this study. To test whether the NMR spectral data are statistically affected by the choice of DFT method, the isotropic shielding values of 5 randomly chosen molecules of different shapes and sizes from Sets I and II were calculated using 3 different DFT methods. The shielding values were observed to be shifted in the same direction following the same pattern. Further details are given in Additional file 2. Solvent effects are included via the COnductor-like Screening MOdel (COSMO) [142]. NMR isotropic shieldings were calculated for all optimized molecules at the B3LYP/cc-pVDZ [139, 143] level of theory. Based on our previous assessment [131], this method provides reliable chemical shifts [112] and yields isotropic shieldings with a reasonably low computational cost [144]. The gauge-invariant atomic orbital (GIAO) approach [145] was used to compute 13C and 1H NMR chemical shifts. The computed chemical shifts are provided in Additional file (available upon author request).

Algorithm

Various scoring approaches have been proposed for the analysis of chemical shifts and comparisons of DFT methods. The most common criteria in the literature quantifying the agreement between calculated and experimental data are mean absolute error (MAE) (Eq. 1), root mean square error (RMSE) (Eq. 2), corrected mean absolute error (CMAE) (Eq. 3), and correlation coefficients (e.g., the Pearson correlation coefficient).

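$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| \delta_{\mathrm{exp},i} - \delta_{\mathrm{calc},i} \right| \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( \delta_{\mathrm{exp},i} - \delta_{\mathrm{calc},i} \right)^{2}} \tag{2}$$

$$\mathrm{CMAE} = \frac{1}{N}\sum_{i=1}^{N} \left| \delta_{\mathrm{exp},i} - \frac{\delta_{\mathrm{calc},i} - b}{m} \right| \tag{3}$$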

where δexp is the experimental chemical shift, δcalc is the calculated chemical shift, N is the number of nuclei, and m and b denote slope and intercept of the calculated shifts with respect to experimental shifts.
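The three error metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the slope m and intercept b are assumed here to come from an ordinary least-squares fit of calculated shifts against experimental shifts, which is one common reading of the definition above.

```python
import math

def mae(exp, calc):
    """Mean absolute error between paired shift lists (Eq. 1)."""
    return sum(abs(e - c) for e, c in zip(exp, calc)) / len(exp)

def rmse(exp, calc):
    """Root mean square error between paired shift lists (Eq. 2)."""
    return math.sqrt(sum((e - c) ** 2 for e, c in zip(exp, calc)) / len(exp))

def linear_fit(x, y):
    """Least-squares slope m and intercept b for y ~ m*x + b (assumed fit)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, mean_y - m * mean_x

def cmae(exp, calc):
    """Corrected MAE (Eq. 3): scale calc by (calc - b) / m, then take the MAE."""
    m, b = linear_fit(exp, calc)  # calculated shifts regressed on experimental
    scaled = [(c - b) / m for c in calc]
    return mae(exp, scaled)
```

A purely systematic error (for example, every calculated shift being twice the experimental value plus 1 ppm) gives a large MAE but a CMAE of essentially zero, which is why scaling is often applied before comparing methods.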

To identify the compounds in a mixture, our approach follows the steps in the flowchart presented in Fig. 2. In Step I, NMR chemical shifts of all molecules are calculated as described above. Since we do not have experimental NMR data for the 11 thousand molecules in our two sets, in Step II we create representative NMR data for comparisons: the calculated NMR spectra (generated in Step I) are considered as surrogate experimental shift data, and new lists of chemical shifts are created synthetically by adding Gaussian distributed noise. Although the error distributions of NMR chemical shifts were reported to also obey a Student's t-distribution in other studies [131, 146–149], we assume errors for both carbons and protons follow a Gaussian distribution [144, 150] with mean µ and standard deviation σ. Unless otherwise stated, the mean is assigned as 0, since the errors of scaled 13C and 1H NMR chemical shifts are equally likely to be positive or negative [144, 147]. In this study, σ is taken in the range of 0.5–50 ppm and 0.1–10 ppm for 13C and 1H chemical shifts with increment of 0.05 ppm and 0.01 ppm, respectively. Simply, we assume that our initial (non-noise-added) calculated chemical shifts (“surrogate experimental data”) represent the distribution, but not necessarily the accuracy, of authentic experimental chemical shifts, and that the addition of zero-mean Gaussian noise to create synthetic data with a defined error allows us to explore how the accuracy of real calculated chemical shifts can affect identification rates. This approach is similar to that taken in other successful studies [151–153].
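Step II can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the published workflow: the function name `add_gaussian_noise` and the example shift values are hypothetical, and only the zero-mean Gaussian noise model comes from the text.

```python
import random

def add_gaussian_noise(shifts, sigma, mu=0.0, rng=None):
    """Return a synthetic shift list: each shift plus Gaussian noise N(mu, sigma).

    shifts: surrogate experimental chemical shifts in ppm
    sigma:  standard deviation of the simulated prediction error in ppm
    rng:    optional random.Random instance, for reproducible runs
    """
    rng = rng or random.Random()
    return [s + rng.gauss(mu, sigma) for s in shifts]

# Hypothetical 13C shifts (ppm) for one molecule, perturbed with sigma = 3.0 ppm.
surrogate_c13 = [18.2, 62.5, 128.4, 171.0]
synthetic_c13 = add_gaussian_noise(surrogate_c13, sigma=3.0, rng=random.Random(1))
```

Passing a seeded `random.Random` makes each repeat of an experiment reproducible, which is convenient when the same σ value is sampled several times and the results are averaged.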

Fig. 2. Flowchart for the identification of the compounds in a mixture

In Step III, each molecule taken from the computed data is searched back against the surrogate experimental data. First, the experimental chemical shifts of an unknown molecule are matched to the computed chemical shifts of every single molecule to find the best match, based on minimizing the distance between two sets of chemical shifts. To do this, we used the Munkres assignment algorithm [154–156], which gives the minimum distance score (i.e. error) of two sets, within a feasible computational time bounded by a polynomial expression [157]. The Munkres algorithm minimizes the total error or summation of squared differences between each assignment. It is based on the following principle:

Let S1 and S2 be two separate lists of chemical shifts consisting of M and N elements, respectively. Let us construct an M-by-N matrix

$$\begin{pmatrix} (s_1 - b_1)^2 & \cdots & (s_1 - b_N)^2 \\ \vdots & \ddots & \vdots \\ (s_M - b_1)^2 & \cdots & (s_M - b_N)^2 \end{pmatrix}$$

where s_i is the ith element of S1, b_j is the jth element of S2, and M ≤ N. We have M elements to be assigned to N elements on a one-to-one basis, where the assignments constitute an independent set of the M-by-N matrix. The Munkres algorithm then solves this assignment problem, returning the least sum of matrix elements when only one element is chosen from each row and at most one from each column. In our case, this indicates the best possible matching, which will be used in the next step.
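The assignment problem can be made concrete with a small Python sketch. Note that this is not the Munkres algorithm itself: for clarity it finds the same minimum-cost one-to-one assignment by exhaustive search, which is only practical for very short lists. A polynomial-time solver (Munkres, or a library routine such as SciPy's `linear_sum_assignment`) would replace the loop in real use. The function names are hypothetical.

```python
from itertools import permutations

def cost_matrix(s, b):
    """Squared differences (s_i - b_j)^2 with one row per element of s."""
    return [[(si - bj) ** 2 for bj in b] for si in s]

def min_cost_assignment(s, b):
    """Brute-force minimum-cost one-to-one assignment of s onto b.

    Requires len(s) <= len(b). Returns (total_cost, cols), where cols[i]
    is the index in b assigned to s[i]. The optimum equals what the
    Munkres algorithm would return, but the cost here grows factorially.
    """
    if len(s) > len(b):
        raise ValueError("s must not be longer than b")
    cost = cost_matrix(s, b)
    best_total, best_cols = float("inf"), ()
    for cols in permutations(range(len(b)), len(s)):
        total = sum(cost[i][j] for i, j in enumerate(cols))
        if total < best_total:
            best_total, best_cols = total, cols
    return best_total, list(best_cols)

# Two observed shifts matched against three library shifts (ppm, hypothetical).
total, cols = min_cost_assignment([1.0, 5.0], [5.2, 0.9, 9.0])
```

Here the first observed shift is paired with the library shift at 0.9 ppm and the second with the one at 5.2 ppm, leaving the 9.0 ppm shift unassigned.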

In Step IV, for each molecule, to determine which set of experimental data best matches the computed one, the similarity of two sets of assigned chemical shifts is quantified by a distance score. There is no perfect score (i.e. zero error) between two sets (e.g., in practice, there is always some amount of error expected between experimental and computed shifts). A critical issue is finding a method to quantify the error such that it always yields the best match at the top when the list of scores is sorted from most to least likely. In addition to the most popular ways to express chemical shift errors (i.e. MAE and RMSE), we believe that an indication of how confident a matching set is can be expressed better in terms of RMSE and probability. Smith et al. performed a sophisticated systematic study addressing the issue of the best parameter, and proposed DP4 [147], which is used when experimental NMR data is to be used to identify one molecule out of an arbitrarily large library of many possible structures. DP4 is based on conditional probability and/or Bayes' theorem, the key factor increasing the certainty of results. While we found DP4 to give slightly better rankings for pure samples than RMSE, we also found it to be computationally much more intensive than RMSE. We also believe DP4 is not convenient for ranking matches in impure/complex samples. Therefore, we use RMSE in this study. Further details are given in Additional file 1.

Note that the RMSE ranges differ for carbon and proton. For the cases when carbon and proton are used together for identifying molecules, each RMSE is calculated separately and their geometric mean is taken to get a single score for the molecule. The geometric mean is used to normalize the RMSEs, so the error associated with carbon does not dominate that of the proton for cases where both nuclei are used together.
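A minimal sketch of the combined score, assuming the geometric mean of exactly two values (one 13C RMSE and one 1H RMSE); the function name is hypothetical.

```python
import math

def combined_score(rmse_c13, rmse_h1):
    """Geometric mean of the 13C and 1H RMSEs (both in ppm)."""
    return math.sqrt(rmse_c13 * rmse_h1)

# Doubling either RMSE changes the score by the same factor, sqrt(2),
# even though 13C errors are typically about ten times larger than 1H errors.
base = combined_score(4.0, 0.25)
worse_c13 = combined_score(8.0, 0.25)
worse_h1 = combined_score(4.0, 0.50)
```

With an arithmetic mean, by contrast, the larger 13C RMSE would contribute almost all of the score, so a relative change in the 1H error would barely move it.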

Finally, in Step V, all resulting scores are sorted in ascending order, yielding a list of molecules starting from the most likely to the least likely to be found in the mixture. The ranks and scores of each molecule are reported. In this study, a rank of 1 (top of list) is synonymous with positive molecule identification.
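Step V and the Top-k statistics reported later can be sketched as follows. The data structures (a dictionary from molecule name to score) and the function names are illustrative assumptions, not the layout used in the authors' scripts.

```python
def rank_candidates(scores):
    """Sort candidates by ascending score; return (rank, name, score) tuples.

    scores: dict mapping candidate name -> distance score (lower is better).
    """
    ordered = sorted(scores.items(), key=lambda item: item[1])
    return [(rank, name, score)
            for rank, (name, score) in enumerate(ordered, start=1)]

def rank_of(target, scores):
    """Rank of the true molecule in the sorted candidate list (1 = identified)."""
    for rank, name, _ in rank_candidates(scores):
        if name == target:
            return rank
    raise KeyError(target)

def top_k_rate(ranks, k=1):
    """Fraction of query molecules whose correct match is within the top k hits."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical scores for three library entries against one query spectrum.
example_scores = {"mol_A": 0.40, "mol_B": 0.10, "mol_C": 0.90}
```

In this toy case `rank_of("mol_B", example_scores)` is 1, so a query whose true identity is `mol_B` would count as a positive identification.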

For this study, we considered the case when (1) proton chemical shifts are used alone for identification, (2) carbon chemical shifts are used alone, and finally, (3) when both nuclei are used together.

The automated workflow and all scripts, written in Python, are provided in Additional file 4.

Results and discussion

Robust and comprehensive metabolite identification using calculated NMR chemical shifts requires that the accuracy of the in silico approaches be assessed and that their error ranges be validated. We investigated the level of accuracy required to identify small molecules in NMR libraries. We performed a comprehensive analysis on the extent of accuracy in the predicted 13C and 1H NMR chemical shifts using 11,716 small molecules taken from the HMDB. We analyzed the limits (upper and lower) of error for confident metabolite identification in two solution phases: chloroform and water. We discuss the error ranges in predicted NMR chemical shifts that allow reasonably confident identification in two types of samples: (i) a pure, uniform sample and (ii) a complex sample. We performed our runs for 190 different error ranges (i.e. σ, Gaussian standard deviation) and repeated the experiments 16 times for each case. Unless otherwise stated, all analyses were performed for each molecule in the two sets. We report the average results for (i) 13C chemical shifts alone, (ii) 1H chemical shifts alone, and (iii) 13C and 1H chemical shifts used together for identification. We report the average percentage of molecules successfully identified (i.e. rank is 1) for Set I (water soluble molecules) and Set II (chloroform soluble molecules).

Case I: Pure sample

In this case, let us assume we have a spectrum from a single compound and an array of carbon (13C) and/or proton (1H) NMR chemical shifts. This case involves selecting only the molecules having the exact number of carbon and/or proton chemical shifts from the database to match the experimental spectrum. This narrows the list of candidate molecules.

Figures 3 and 4 show the identification results of 90% to 100% of the molecules of both sets in the top 10 hits (Top 10) for carbons and protons used independently. As an example, for identifying 90% of the molecules in the first hit (Top 1), 13C chemical shift errors should be below 3.2 and 3.6 ppm for Set I and Set II, respectively. Likewise, for 1H chemical shifts, when the MAE is at most 0.38 ppm for both sets, there is a 90% chance that the correct identification will be made as the first hit. It is possible to correctly identify 99% of the molecules when the noise is at most 1.1–1.2 and 0.16–0.17 ppm for 13C and 1H chemical shifts, respectively. The molecule of interest has a chance to be among the first two candidate matches (Top 2) when 13C and 1H chemical shift errors are 0.53 ppm and 0.21 ppm, and 0.14 ppm and 0.02 ppm for Set I and Set II, respectively. However, for these sets of molecules, 100% identification is not possible when 13C and 1H chemical shifts are used alone. Higher quality versions of Figs. 3 and 4, and the full list covering 50–100% identification, are given in the Additional file (available upon author request).

Fig. 3. Averaged percentages of molecules being identified within the first 1, 2, 5, and 10 candidate molecules at different Gaussian standard deviation (σ) values (ppm) for Set I and Set II when 13C NMR chemical shifts are used alone

Fig. 4. Averaged percentages of molecules being identified within the first 1, 2, 5, and 10 candidate molecules at different Gaussian standard deviation (σ) values (ppm) for Set I and Set II when 1H NMR chemical shifts are used alone

Figure 5 shows where the molecules rank in identification lists for a comprehensive identification analysis for Set I and Set II, plotted against carbon and proton errors when 13C and 1H data are used together for identification. The plots show how the probability of a molecule being correctly identified changes with chemical shift errors. The contour lines represent different levels of identification with respect to carbon (y-axis) and proton (x-axis) errors. The color bars show the ranking distributions along the ranges of carbon (0–50 ppm) and proton errors (0–10 ppm). The contour lines follow a reciprocal relationship (Eq. 4) (see Additional file 3 for further information). Therefore, each contour line corresponds to a list of combinations of carbon and proton errors. For example, for 90% identification, the carbon and proton errors (ppm) could be (3 and 10), (5 and 0.92), or (6 and 0.7), respectively, out of many combinations. This reciprocal relationship also gives a trade-off between the carbon and proton errors, such that highly accurate 1H chemical shifts can compensate for less accurate (and less expensive) 13C chemical shifts, and vice versa.

$$\mathrm{MAE}_{^{13}\mathrm{C}} = \frac{a}{\mathrm{MAE}_{^{1}\mathrm{H}} - b} + c \tag{4}$$

Fig. 5. Mean of ranks with respect to the carbon and proton errors and contour lines for the different levels of identification ratios when carbons and protons are used together for (a, b) Set I (water soluble molecules) and (c, d) Set II (chloroform soluble molecules). Panels b and d are the zoomed versions of a and c, respectively. The color bars represent the rankings

Each contour line has an optimum point which represents a trade-off point (reported in Table 1). At these points on the curves, the cumulative errors of carbon and proton are minimum (note that ranges for carbon and proton errors are normalized). A fascinating but not unexpected observation here is that the chances of molecules being successfully identified are doubled when 13C and 1H chemical shifts are used together. Thus, compared to the previous case when 13C and 1H chemical shifts are used independently, using more information increases the chance of successful identification. The full list of trade-off points including 50–99% is reported in the Additional file (available upon author request).
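One way to locate such a trade-off point can be sketched in Python. Everything specific here is an assumption made for illustration: the contour is taken to have the reciprocal form MAE(13C) = a / (MAE(1H) - b) + c with a > 0, the errors are normalized by the plotted axis ranges (10 ppm for 1H, 50 ppm for 13C), and the values of a, b, and c are hypothetical rather than fitted parameters from the study.

```python
import math

def contour_c13(mae_h1, a, b, c):
    """13C error on a reciprocal contour for a given 1H error (assumed form)."""
    return a / (mae_h1 - b) + c

def normalized_sum(mae_h1, a, b, c, h_range=10.0, c_range=50.0):
    """Sum of the two errors after dividing each by its axis range."""
    return mae_h1 / h_range + contour_c13(mae_h1, a, b, c) / c_range

def trade_off_point(a, b, c, h_range=10.0, c_range=50.0):
    """Point on the contour that minimizes the normalized error sum.

    Setting the derivative of normalized_sum to zero gives
    (mae_h1 - b)^2 = a * h_range / c_range, valid for a > 0 and mae_h1 > b.
    """
    mae_h1 = b + math.sqrt(a * h_range / c_range)
    return mae_h1, contour_c13(mae_h1, a, b, c)

# Hypothetical contour parameters, for illustration only.
h_opt, c_opt = trade_off_point(a=5.0, b=0.0, c=0.0)
```

Moving along the contour in either direction from this point raises the normalized sum, which is the sense in which the point is an optimum under the stated normalization.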

Table 1. Optimum trade-off MAEs at different Gaussian standard deviation (σ) values (ppm) for Set I and Set II when 13C and 1H NMR chemical shifts are used together for identification

Percentile (%)   Set I (water soluble molecules), σ (ppm)   Set II (chloroform soluble molecules), σ (ppm)
                 13C      1H                                 13C      1H
99               2.02     0.30                               1.64     0.43
95               4.21     0.57                               4.44     0.53
90               6.16     0.72                               5.82     0.70

It is observed that the ranks range from 1 to 7 for Set I and from 1 to 16 for Set II. The difference in ranges stems from the different sizes of the molecule sets and from differences in the standard deviation and variance of the 13C and 1H chemical shifts. The standard deviations of ranks are shown in Additional file 2.

Case II: Impure sample

A continuing grand challenge for NMR-based metabolomics is dealing with the spectral complexity in analysis of mixtures. An NMR spectrum can contain thousands of distinct resonances belonging either to the main compound or to impurities. Here, we used an approach very similar to a quantitative metabolomics approach in which identification and quantification are based on the underlying assumption that any given sample spectrum is the sum of individual spectra of pure metabolites found in the mixture. The spectrum of interest is compared to a library of pure compound spectra by properly matching and fitting the reference peaks. The reference libraries need to be prepared from NMR spectra of pure metabolites at a precisely known and controlled pH and temperature. In particular, peaks of water and of some endogenous metabolites are sensitive to pH, temperature, and salt concentration, which frequently leads to errors. In this study, we disregarded the effects of pH and temperature, as well as distortions, artifacts, and noise in signals. We performed our analysis based on the assumption that the spectrum of every single compound in the mixture is a sub-spectrum stored in the reference database.

Let us assume we have an impure sample consisting of an unknown number of compounds, and carbon and/or proton NMR chemical shift data for the sample. In contrast to Case I, here we consider an n-tuple of molecules to be a candidate for a sample consisting of n molecules. Unlike Case I, the set of chemical shifts to be matched need not be the same size as that of a single candidate; instead, any molecule in the reference library having an equal or smaller number of 13C and 1H NMR chemical shifts has a chance to be a candidate. For instance, if we have a sample of 2 molecules with c1 and c2 carbons and h1 and h2 protons, respectively, only the pairs having a sum of c1 + c2 carbons and h1 + h2 protons are candidates, and each chemical shift can belong to only one of the two candidates.
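The candidate selection for a two-compound sample can be sketched as below. The library layout (a dictionary from molecule name to its numbers of 13C and 1H shifts), the molecule names, and the function name are hypothetical, chosen only to illustrate the counting rule described above.

```python
from itertools import combinations

def candidate_pairs(library, n_carbon, n_proton):
    """List pairs of distinct library molecules whose shift counts sum to the sample's.

    library:  dict mapping name -> (number of 13C shifts, number of 1H shifts)
    n_carbon: total number of 13C shifts observed for the sample (c1 + c2)
    n_proton: total number of 1H shifts observed for the sample (h1 + h2)
    """
    pairs = []
    for (name_a, (c_a, h_a)), (name_b, (c_b, h_b)) in combinations(
            sorted(library.items()), 2):
        if c_a + c_b == n_carbon and h_a + h_b == n_proton:
            pairs.append((name_a, name_b))
    return pairs

# Hypothetical four-molecule library and a sample with 5 carbon and 14 proton shifts.
toy_library = {"m1": (2, 6), "m2": (3, 8), "m3": (4, 10), "m4": (1, 4)}
pairs = candidate_pairs(toy_library, n_carbon=5, n_proton=14)
```

Even in this toy library two different pairs satisfy the counts, which hints at why the candidate list grows quickly with library size and with the number of compounds in the mixture.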

Compared to Case I, not only does the list of candidate molecules expand, but matching two sets of data of different size is also not straightforward, making identification even more challenging. Because of this, we did not examine this case for different Gaussian noise levels in detail as we did in Case I. We performed our runs for mixtures of 2 and 3 compounds. We report the results of this case only for a specific set of Gaussian noise levels (the optimum trade-off MAEs of 13C and 1H NMR chemical shifts reported for Set I in Table 1). Unless otherwise specified, we refer to the mixtures of 2 and 3 compounds as pairs and triplets, respectively. In Fig. 6, the averaged ranks are shown for molecule pairs and triplets for all the optimum MAEs. Compared to the case of pure samples (Case I), the probability of identification decreases from 95% to 0% (pairs) and 6% (triplets) when the 13C and 1H NMR chemical shift errors are 4.41 ppm and 0.6 ppm, respectively. So, the identification chance is quite low (green and purple lines in Fig. 6) even when the 13C and 1H NMR chemical shift errors are low. We then investigated what happens if the number of compounds in the sample is known. At first this seems counterintuitive, but the probability of identification is increased to 84% (from 0% for pairs) and 68% (from 6% for triplets) when the 13C and 1H NMR chemical shift errors are 4.41 ppm and 0.6 ppm, respectively. The average identification chances increase by 83% and 91% (blue and black lines in Fig. 6). Determining the number of compounds in a sample may be possible using additional orthogonal data. For example, multidimensional NMR experiments or MS may aid in determining the number of high concentration molecular candidates in a sample. Integrating NMR and MS can provide improved identification and quantification of a larger number of metabolites, as in Case II [158, 159]. This is still, however, less than Case I by 84% and 93% for pairs and triplets, respectively (red line in Fig. 6). Standard deviations of ranks and computational times of runs are given in Additional file 1.

Fig. 6. Average ranks of Case I and Case II with known/unknown number of molecules in samples at optimum points for Set I (water soluble molecules)

Case II was performed only for the smaller molecule set, Set I (water soluble molecules), and not for the larger set, Set II (chloroform soluble molecules), due to the high computational time demands.
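One way to see why mixtures are computationally demanding is that, when the number of compounds k is known, a brute-force search must score every k-compound subset of the library. The sketch below is an assumed illustration, not the authors' algorithm: the library contents are invented, and it simplifies by requiring that the pooled candidate peaks match the mixture peak count exactly (real mixtures can have overlapping peaks).

```python
from itertools import combinations
from math import comb


def pooled_score(mixture, member_shift_lists):
    """Mean absolute difference between sorted mixture peaks and pooled candidate peaks."""
    pooled = sorted(s for shifts in member_shift_lists for s in shifts)
    if not pooled or len(pooled) != len(mixture):
        return float("inf")
    return sum(abs(m - p) for m, p in zip(sorted(mixture), pooled)) / len(pooled)


def best_combination(mixture, library, k):
    """Exhaustively score every k-compound subset of the library; return the best one."""
    return min(
        combinations(sorted(library), k),
        key=lambda names: pooled_score(mixture, [library[n] for n in names]),
    )


# Hypothetical 13C shifts in ppm, invented for illustration only.
library = {
    "mol_A": [20.1, 55.3, 172.4],
    "mol_B": [22.0, 60.8, 175.0],
    "mol_C": [30.5, 128.2, 140.9],
    "mol_D": [18.7, 64.2, 71.5],
}

# A noise-free "pair" sample built from two library entries.
mixture = library["mol_A"] + library["mol_C"]
print(best_combination(mixture, library, 2))

# Number of subsets a brute-force search must score for pairs and triplets.
print(comb(len(library), 2), comb(len(library), 3))
```

The subset count grows as C(n, k), so moving from pairs to triplets, or to a larger library, increases the number of candidates to score very quickly. If k is unknown, the search must additionally cover several values of k.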

NMR spectroscopy is one of the main methods used for identifying the structure of metabolites. Besides the usual parameters (i.e., 13C and 1H NMR chemical shifts), other major NMR parameters (e.g., spin–spin coupling constants and the chemical shifts of 15N, 17O, and other nuclei) can also be used for structure identification. We believe the use of any additional property will significantly improve molecular identification. In this initial study, we did not test the effect of using additional information that can be collected using NMR (e.g., J-couplings and peak shape). However, most currently available databases provide only 13C and 1H NMR chemical shifts, and J-couplings, multi-dimensional spectra, etc. are missing for many molecules. There is rapid progress in the use of 2D NMR experiments (e.g., COSY, HSQC, and HMBC), which aid interpretation of spectra, reduce ambiguity in spectral assignments, and allow more reliable identification. 2D NMR techniques can overcome the problems of insufficient spectral resolution and spectral redundancy. They provide additional information (i.e., couplings between magnetic nuclei) and resolve overlapping peaks, and thus allow identification of metabolites that would otherwise remain undetected. Multi-dimensional spectra can be predicted using spin dynamics simulation libraries (e.g., SPINACH [160]) coupled with DFT calculations. We are currently assessing the present limits of such automated workflows for accelerating confident, accurate, and fast metabolite identification.

Conclusion

Global comprehensive compound identification in complex samples will revolutionize understanding of the role of important compounds in chemical, environmental, and biological studies. A major limitation is that the vast majority of metabolites are neither available in current identification libraries nor available for purchase as authentic reference material. It is not economically or practically feasible to identify hundreds of thousands of metabolites in laboratories to establish small molecule reference libraries. In silico small molecule libraries are therefore currently the only reasonable way to move toward comprehensive identification of all molecules in complex samples.

We performed an extensive statistical analysis of the effect of 13C and 1H NMR chemical shift calculation errors, in water and chloroform solvents, on the ability to make correct identifications from in silico libraries. For pure samples, the required accuracy levels are feasible, which is promising for the establishment of large-scale metabolomic NMR in silico libraries. Ninety percent or more of these molecules in a pure sample can be successfully identified when errors of 13C and 1H NMR chemical shifts are below 6 ppm and 0.5 ppm, respectively. This shows the great potential and reliability of predicted NMR chemical shifts for future use in molecule identification for pure samples.

Compared to pure samples, complex samples may require complementary information in order to correctly identify constituent compounds. The water-soluble molecules in a complex sample have a chance of 84% and 68% (compared with 95% for pure samples) of being identified for pairs and triplets, respectively, when errors of 13C and 1H NMR chemical shifts are below 4.41 ppm and 0.6 ppm and the number of molecules is known beforehand. Knowing this number increases the possibility of identification by 90%, corroborating other findings of significant potential for parallel MS analysis [161]. This increased confidence in our results indicates the value of adding multiple molecular or chemical properties and using additional measured or accurately predicted information for comprehensive identification of metabolites.

This study provides valuable insight into the practicality and applicability of potential in silico small molecule NMR databases. Rapid innovation in metabolite identification, driven by recent advances in computation and data integration across NMR and combined MS/NMR analytical and computational methods, will aid assignment of the full metabolome composition in complex samples.

Supplementary Information

13321_2022_587_MOESM1_ESM.xlsx (905.7KB, xlsx)

Additional file 1. Set I - Water Soluble Molecules.

13321_2022_587_MOESM2_ESM.xlsx (2.9MB, xlsx)

Additional file 2. Set II - Chloroform Soluble Molecules.

13321_2022_587_MOESM4_ESM.docx (2.7MB, docx)

Additional file 4. Supplementary Information Document.

Acknowledgements

Not applicable.

Authors’ contributions

YY performed all calculations, analyzed the results, created the figures, and was the primary author. NG assisted with the DFT calculations. TOM and RSR oversaw the research, helped with analysis, and provided the funding for the study. All authors contributed to writing.

Funding

This work was supported by the Microbiomes in Transition (MinT) Initiative as part of the Laboratory Directed Research and Development Program at PNNL. Additional support was provided by the National Institutes of Health, National Institute of Environmental Health Sciences Grant no. U2CES030170. PNNL is a multi‐program national laboratory operated by Battelle for the DOE under contract DE‐AC05‐76RLO 1830.

Availability of data and materials

Molecule MOL files and DFT output files are included in the Additional files, along with Python processing code. Any other data is freely available upon request. The Additional files are available from the authors, upon request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. German JB, Hammock BD, Watkins SM. Metabolomics: building on a century of biochemistry to guide human health. Metabolomics. 2005;1(1):3–9. doi: 10.1007/s11306-005-1102-8.
  • 2. Wishart DS. Current progress in computational metabolomics. Brief Bioinform. 2007;8(5):279–293. doi: 10.1093/bib/bbm030.
  • 3. Shulaev V. Metabolomics technology and bioinformatics. Brief Bioinform. 2006;7(2):128–139. doi: 10.1093/bib/bbl012.
  • 4. Kosmides AK, et al. Metabolomic fingerprinting: challenges and opportunities. Crit Rev Biomed Eng. 2013;41(3):205–221. doi: 10.1615/CritRevBiomedEng.2013007736.
  • 5. Nicholson JK, Wilson ID. Opinion: understanding 'global' systems biology: metabonomics and the continuum of metabolism. Nat Rev Drug Discov. 2003;2(8):668–676. doi: 10.1038/nrd1157.
  • 6. Winnike JH, et al. Use of pharmaco-metabonomics for early prediction of acetaminophen-induced hepatotoxicity in humans. Clin Pharmacol Ther. 2010;88(1):45–51. doi: 10.1038/clpt.2009.240.
  • 7. Holmes E, et al. Human metabolic phenotype diversity and its association with diet and blood pressure. Nature. 2008;453(7193):396–400. doi: 10.1038/nature06882.
  • 8. Beckonert O, et al. Metabolic profiling, metabolomic and metabonomic procedures for NMR spectroscopy of urine, plasma, serum and tissue extracts. Nat Protoc. 2007;2(11):2692–2703. doi: 10.1038/nprot.2007.376.
  • 9. Nicholson JK, Lindon JC, Holmes E. 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica. 1999;29(11):1181–1189. doi: 10.1080/004982599238047.
  • 10. Nicholson JK, et al. 750 MHz 1H and 1H–13C NMR spectroscopy of human blood plasma. Anal Chem. 1995;67(5):793–811. doi: 10.1021/ac00101a004.
  • 11. Smith CA, et al. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem. 2006;78(3):779–787. doi: 10.1021/ac051437y.
  • 12. Dettmer K, Aronov PA, Hammock BD. Mass spectrometry-based metabolomics. Mass Spectrom Rev. 2007;26(1):51–78. doi: 10.1002/mas.20108.
  • 13. Want EJ, Cravatt BF, Siuzdak G. The expanding role of mass spectrometry in metabolite profiling and characterization. ChemBioChem. 2005;6(11):1941–1951. doi: 10.1002/cbic.200500151.
  • 14. Dunn WB, Bailey NJ, Johnson HE. Measuring the metabolome: current analytical technologies. Analyst. 2005;130(5):606–625. doi: 10.1039/b418288j.
  • 15. Hollywood K, Brison DR, Goodacre R. Metabolomics: current technologies and future trends. Proteomics. 2006;6(17):4716–4723. doi: 10.1002/pmic.200600106.
  • 16. Moco S, et al. Metabolomics technologies and metabolite identification. Trac-Trends Anal Chem. 2007;26(9):855–866. doi: 10.1016/j.trac.2007.08.003.
  • 17. Smith CA, et al. METLIN: a metabolite mass spectral database. Ther Drug Monit. 2005;27(6):747–751. doi: 10.1097/01.ftd.0000179845.53213.39.
  • 18. Wishart DS, et al. HMDB 3.0–the human metabolome database in 2013. Nucleic Acids Res. 2013;41(Database issue):D801–D807. doi: 10.1093/nar/gks1065.
  • 19. Ulrich EL, et al. BioMagResBank. Nucleic Acids Res. 2008;36(Database):D402–D408. doi: 10.1093/nar/gkm957.
  • 20. Pence HE, Williams A. ChemSpider: an online chemical information resource. J Chem Educ. 2010;87(11):1123–1124. doi: 10.1021/ed100697w.
  • 21. Tautenhahn R, et al. XCMS online: a web-based platform to process untargeted metabolomic data. Anal Chem. 2012;84(11):5035–5039. doi: 10.1021/ac300698c.
  • 22. Williams AJ. A perspective of publicly accessible/open-access chemistry databases. Drug Discov Today. 2008;13(11–12):495–501. doi: 10.1016/j.drudis.2008.03.017.
  • 23. Sitzmann M, Filippov IV, Nicklaus MC. Internet resources integrating many small-molecule databases. SAR QSAR Environ Res. 2008;19(1–2):1–9. doi: 10.1080/10629360701843540.
  • 24. Kutzler FW, et al. Charge-density and bonding in (5,10,15,20-tetramethylporphyrinato)nickel(II): a combined experimental and theoretical study. J Am Chem Soc. 1983;105(10):2996–3004. doi: 10.1021/ja00348a012.
  • 25. Stimpson DI, Cann JR. A combined theoretical and experimental study of the interaction of metrizamide with proteins. Arch Biochem Biophys. 1981;211(1):403–412. doi: 10.1016/0003-9861(81)90471-9.
  • 26. Cripps SC, Orton RS, Carroll JE. Combined theoretical and experimental studies of a push-pull trapatt circuit. Int J Electron. 1974;37(1):1–21. doi: 10.1080/00207217408900490.
  • 27. Gaulton A, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012;40(Database issue):D1100–D1107. doi: 10.1093/nar/gkr777.
  • 28. Izgi T, et al. FT-IR and NMR investigation of 2-(1-cyclohexenyl)ethylamine: a combined experimental and theoretical study. Spectrochimica Acta Part A Mol Biomol Spectrosc. 2007;68(1):55–62. doi: 10.1016/j.saa.2006.10.050.
  • 29. de Matos P, et al. Chemical entities of biological interest: an update. Nucleic Acids Res. 2010;38:D249–D254. doi: 10.1093/nar/gkp886.
  • 30. Kwan EE, Liu RY. Enhancing NMR prediction for organic compounds using molecular dynamics. J Chem Theory Comput. 2015;11(11):5083–5089. doi: 10.1021/acs.jctc.5b00856.
  • 31. Knox C, et al. DrugBank 3.0: a comprehensive resource for 'Omics' research on drugs. Nucleic Acids Res. 2011;39:D1035–D1041. doi: 10.1093/nar/gkq1126.
  • 32. Ulrich EL, et al. BioMagResBank. Nucleic Acids Res. 2008;36:D402–D408. doi: 10.1093/nar/gkm957.
  • 33. Wishart DS, et al. HMDB: a knowledgebase for the human metabolome. Nucleic Acids Res. 2009;37(Database issue):D603–D610. doi: 10.1093/nar/gkn810.
  • 34. Blum LC, Reymond JL. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J Am Chem Soc. 2009;131(25):8732–8733. doi: 10.1021/ja902302h.
  • 35. Jewison T, et al. SMPDB 2.0: big improvements to the small molecule pathway database. Nucleic Acids Res. 2014;42(Database issue):D478–D484. doi: 10.1093/nar/gkt1067.
  • 36. Frolkis A, et al. SMPDB: the small molecule pathway database. Nucleic Acids Res. 2010;38(Database issue):D480–D487. doi: 10.1093/nar/gkp1002.
  • 37. Richard AM, Williams CR. Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat Res. 2002;499(1):27–52. doi: 10.1016/S0027-5107(01)00289-5.
  • 38. Guo AC, et al. ECMDB: the E. coli metabolome database. Nucleic Acids Res. 2013;41(Database issue):D625–D630. doi: 10.1093/nar/gks992.
  • 39. Sajed T, et al. ECMDB 2.0: a richer resource for understanding the biochemistry of E. coli. Nucleic Acids Res. 2016;44(D1):D495–501. doi: 10.1093/nar/gkv1060.
  • 40. Keseler IM, et al. The EcoCyc database: reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Res. 2017;45(D1):D543–D550. doi: 10.1093/nar/gkw1003.
  • 41. Scalbert A, et al. Databases on food phytochemicals and their health-promoting effects. J Agric Food Chem. 2011;59(9):4331–4348. doi: 10.1021/jf200591d.
  • 42. Fahy E, et al. Update of the LIPID MAPS comprehensive classification system for lipids. J Lipid Res. 2009;50:S9–S14. doi: 10.1194/jlr.R800095-JLR200.
  • 43. Caspi R, et al. The MetaCyc database of metabolic pathways and enzymes. Nucleic Acids Res. 2018;46(D1):D633–D639. doi: 10.1093/nar/gkx935.
  • 44. MolMall. [cited 2019 8/1]; http://www.molmall.net/.
  • 45. Banerjee P, et al. Super Natural II-a database of natural products. Nucleic Acids Res. 2015;43(D1):D935–D939. doi: 10.1093/nar/gku886.
  • 46. Wishart D, et al. T3DB: the toxic exposome database. Nucleic Acids Res. 2015;43(Database issue):D928–D934. doi: 10.1093/nar/gku1004.
  • 47. Lim E, et al. T3DB: a comprehensively annotated database of common toxins and their targets. Nucleic Acids Res. 2010;38:D781–D786. doi: 10.1093/nar/gkp934.
  • 48. Richard AM, et al. ToxCast chemical landscape: paving the road to 21st century toxicology. Chem Res Toxicol. 2016;29(8):1225–1251. doi: 10.1021/acs.chemrestox.6b00135.
  • 49. Gu JY, et al. Use of natural products as chemical library for drug discovery and network pharmacology. PLoS ONE. 2013;8(4):e62839. doi: 10.1371/journal.pone.0062839.
  • 50. Sterling T, Irwin JJ. ZINC 15-ligand discovery for everyone. J Chem Inf Model. 2015;55(11):2324–2337. doi: 10.1021/acs.jcim.5b00559.
  • 51. Wishart DS. Advances in metabolite identification. Bioanalysis. 2011;3(15):1769–1782. doi: 10.4155/bio.11.155.
  • 52. Xiao JF, Zhou B, Ressom HW. Metabolite identification and quantitation in LC-MS/MS-based metabolomics. Trac-Trends Anal Chem. 2012;32:1–14. doi: 10.1016/j.trac.2011.08.009.
  • 53. NIST 17 MS/MS Library. [cited 2019 05.01]. https://www.sisweb.com/software/nist-msms.htm.
  • 54. The NIST 17 Mass Spectral Library. June 2017 [cited 2019 05.01]. https://www.sisweb.com/software/ms/nist.htm#stats.
  • 55. The Human Metabolome Library (HML). [cited 2019 05.01]. http://www.hmdb.ca/hml.
  • 56. Wishart DS, et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 2018;46(D1):D608–D617. doi: 10.1093/nar/gkx1089.
  • 57. Wishart DS, et al. HMDB: the human metabolome database. Nucleic Acids Res. 2007;35(Database issue):D521–D526. doi: 10.1093/nar/gkl923.
  • 58. ZINC 15, a free database of commercially-available compounds. [cited 2019 05.01]. http://zinc15.docking.org/.
  • 59. Sterling T, Irwin JJ. ZINC 15–ligand discovery for everyone. J Chem Inf Model. 2015;55(11):2324–2337. doi: 10.1021/acs.jcim.5b00559.
  • 60. Styczynski MP, et al. Systematic identification of conserved metabolites in GC/MS data for metabolomics and biomarker discovery. Anal Chem. 2007;79(3):966–973. doi: 10.1021/ac0614846.
  • 61. Staniek A, Woerdenbag HJ, Kayser O. Endophytes: exploiting biodiversity for the improvement of natural product-based drug discovery. J Plant Interact. 2008;3(2):75–93. doi: 10.1080/17429140801886293.
  • 62. Tulp M, Bohlin L. Functional versus chemical diversity: is biodiversity important for drug discovery? Trends Pharmacol Sci. 2002;23(5):225–231. doi: 10.1016/S0165-6147(02)02007-2.
  • 63. Sumner LW, et al. Proposed minimum reporting standards for chemical analysis. Metabolomics. 2007;3(3):211–221. doi: 10.1007/s11306-007-0082-2.
  • 64. DeHaven CD, et al. Organization of GC/MS and LC/MS metabolomics data into chemical libraries. J Cheminformatics. 2010;2:1–12. doi: 10.1186/1758-2946-2-9.
  • 65. Dobson CM. Chemical space and biology. Nature. 2004;432(7019):824–828. doi: 10.1038/nature03192.
  • 66. Patti GJ, et al. A view from above: cloud plots to visualize global metabolomic data. Anal Chem. 2013;85(2):798–804. doi: 10.1021/ac3029745.
  • 67. Weckwerth W. Metabolomics in systems biology. Annu Rev Plant Biol. 2003;54:669–689. doi: 10.1146/annurev.arplant.54.031902.135014.
  • 68. Salek RM, et al. The role of reporting standards for metabolite annotation and identification in metabolomic studies. Gigascience. 2013;2:2047–2217. doi: 10.1186/2047-217X-2-13.
  • 69. Fiehn O, et al. The metabolomics standards initiative (MSI). Metabolomics. 2007;3(3):175–178. doi: 10.1007/s11306-007-0070-6.
  • 70. Beisken S, Eiden M, Salek RM. Getting the right answers: understanding metabolomics challenges. Expert Rev Mol Diagn. 2015;15(1):97–109. doi: 10.1586/14737159.2015.974562.
  • 71. Di Stefano V, et al. Applications of liquid chromatography-mass spectrometry for food analysis. J Chromatogr A. 2012;1259:74–85. doi: 10.1016/j.chroma.2012.04.023.
  • 72. Garcia A, Barbas C. Gas chromatography-mass spectrometry (GC-MS)-based metabolomics. Methods Mol Biol. 2011;708:191–204. doi: 10.1007/978-1-61737-985-7_11.
  • 73. Schymanski EL, et al. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol. 2014;48(4):2097–2098. doi: 10.1021/es5002105.
  • 74. Tang HR, et al. Use of relaxation-edited one-dimensional and two dimensional nuclear magnetic resonance spectroscopy to improve detection of small metabolites in blood plasma. Anal Biochem. 2004;325(2):260–272. doi: 10.1016/j.ab.2003.10.033.
  • 75. Nicholson JK, Wilson ID. Understanding 'global' systems biology: metabonomics and the continuum of metabolism. Nat Rev Drug Discov. 2003;2(8):668–676. doi: 10.1038/nrd1157.
  • 76. Kangas LJ, et al. In silico identification software (ISIS): a machine learning approach to tandem mass spectral identification of lipids. Bioinformatics. 2012;28(13):1705–1713. doi: 10.1093/bioinformatics/bts194.
  • 77. Allen F, Greiner R, Wishart D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics. 2015;11(1):98–110. doi: 10.1007/s11306-014-0676-4.
  • 78. Wolf S, et al. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics. 2010;11:1–12. doi: 10.1186/1471-2105-11-148.
  • 79. Bouteiller Y, et al. Transferable specific scaling factors for interpretation of infrared spectra of biomolecules from density functional theory. J Phys Chem A. 2008;112(46):11656–11660. doi: 10.1021/jp805854q.
  • 80. Colby SM, et al. ISiCLE: a quantum chemistry pipeline for establishing in silico collision cross section libraries. Anal Chem. 2019;91(7):4346–4356. doi: 10.1021/acs.analchem.8b04567.
  • 81. Nuñez JR, et al. Advancing standards-free methods for the identification of small molecules in complex samples. arXiv preprint arXiv:1810.07367. 2018.
  • 82. Casabianca LB, De Dios AC. Ab initio calculations of NMR chemical shifts. J Chem Phys. 2008;128(5):052201. doi: 10.1063/1.2816784.
  • 83. Lodewyk MW, Siebert MR, Tantillo DJ. Computational prediction of 1H and 13C chemical shifts: a useful tool for natural product, mechanistic, and synthetic organic chemistry. Chem Rev. 2012;112(3):1839–1862. doi: 10.1021/cr200106v.
  • 84. Hill DE, Vasdev N, Holland JP. Evaluating the accuracy of density functional theory for calculating H-1 and C-13 NMR chemical shifts in drug molecules. Comput Theor Chem. 2015;1051:161–172. doi: 10.1016/j.comptc.2014.11.007.
  • 85. Lomas JS. H-1 NMR spectra of alcohols in hydrogen bonding solvents: DFT/GIAO calculations of chemical shifts. Magn Reson Chem. 2016;54(1):28–38. doi: 10.1002/mrc.4312.
  • 86. Zheng XY, et al. Structural elucidation of cis/trans dicaffeoylquinic acid photoisomerization using ion mobility spectrometry-mass spectrometry. J Phys Chem Lett. 2017;8(7):1381–1388. doi: 10.1021/acs.jpclett.6b03015.
  • 87. Zheng XY, et al. Enhancing glycan isomer separations with metal ions and positive and negative polarity ion mobility spectrometry-mass spectrometry analyses. Anal Bioanal Chem. 2017;409(2):467–476. doi: 10.1007/s00216-016-9866-4.
  • 88. Nunez JR, et al. Evaluation of in silico multi-feature libraries for providing evidence for the presence of small molecules in synthetic blinded samples. J Chem Inf Model. 2019;59(9):4052–4060. doi: 10.1021/acs.jcim.9b00444.
  • 89. Forsyth DA, Sebag AB. Computed C-13 NMR chemical shifts via empirically scaled GIAO shieldings and molecular mechanics geometries. Conformation and configuration from C-13 shifts. J Am Chem Soc. 1997;119(40):9483–9494. doi: 10.1021/ja970112z.
  • 90. Auer AA, Gauss J, Stanton JF. Quantitative prediction of gas-phase C-13 nuclear magnetic shielding constants. J Chem Phys. 2003;118(23):10407–10417. doi: 10.1063/1.1574314.
  • 91. Mothana B, Ban FQ, Boyd RJ. Validation of a computational scheme to study N-15 and C-13 nuclear shielding constants. Chem Phys Lett. 2005;401(1–3):7–12. doi: 10.1016/j.cplett.2004.10.145.
  • 92. Saito H. Conformation-dependent C-13 chemical-shifts: a new means of conformational characterization as obtained by high-resolution solid-state C-13 NMR. Magn Reson Chem. 1986;24(10):835–852. doi: 10.1002/mrc.1260241002.
  • 93. Jaime C, et al. C-13 NMR chemical-shifts: a single rule to determine the conformation of Calix[4]Arenes. J Org Chem. 1991;56(10):3372–3376. doi: 10.1021/jo00010a036.
  • 94. Yannoni CS, et al. C-13 NMR-study of the C60 cluster in the solid-state: molecular-motion and carbon chemical-shift anisotropy. J Phys Chem. 1991;95(1):9–10. doi: 10.1021/j100154a005.
  • 95. Malkin VG, et al. Solvent effect on the NMR chemical shieldings in water calculated by a combination of molecular dynamics and density functional theory. Chem Eur J. 1996;2(4):452–457. doi: 10.1002/chem.19960020415.
  • 96. Casanovas J, et al. Calculated and experimental NMR chemical shifts of p-menthane-3,9-diols. A combination of molecular dynamics and quantum mechanics to determine the structure and the solvent effects. J Org Chem. 2001;66(11):3775–3782. doi: 10.1021/jo0016982.
  • 97. Benzi C, et al. Reliable NMR chemical shifts for molecules in solution by methods rooted in density functional theory. Magn Reson Chem. 2004;42:S57–S67. doi: 10.1002/mrc.1447.
  • 98. Kiamco MM, et al. Structural and metabolic responses of Staphylococcus aureus biofilms to hyperosmotic and antibiotic stress. Biotechnol Bioeng. 2018;115(6):1594–1603. doi: 10.1002/bit.26572.
  • 99. Dreyer DR, et al. Elucidating the structure of poly(dopamine). Langmuir. 2012;28(15):6428–6435. doi: 10.1021/la204831b.
  • 100. Xin DY, et al. Development of a C-13 NMR chemical shift prediction procedure using B3LYP/cc-pVDZ and empirically derived systematic error correction terms: a computational small molecule structure elucidation method. J Org Chem. 2017;82(10):5135–5145. doi: 10.1021/acs.joc.7b00321.
  • 101. Garcellano RC, et al. Isolation of tryptanthrin and reassessment of evidence for its isobaric isostere wrightiadione in plants of the wrightia genus. J Nat Prod. 2018;82(3):440–448. doi: 10.1021/acs.jnatprod.8b00567.
  • 102. Kutateladze AG, Reddy DS. High-throughput in silico structure validation and revision of halogenated natural products is enabled by parametric corrections to DFT-computed 13C NMR chemical shifts and spin-spin coupling constants. J Org Chem. 2017;82(7):3368–3381. doi: 10.1021/acs.joc.7b00188.
  • 103. Kutateladze AG, Krenske EH, Williams CM. Reassignments and corroborations of oxo-bridged natural products directed by OSE and DU8+ NMR computation. Angew Chem Int Ed Engl. 2019;58(21):7107–7112. doi: 10.1002/anie.201902777.
  • 104. Jain R, Bally T, Rablen PR. Calculating accurate proton chemical shifts of organic molecules with density functional methods and modest basis sets. J Org Chem. 2009;74(11):4017–4023. doi: 10.1021/jo900482q.
  • 105. Perez M, et al. Accuracy vs time dilemma on the prediction of NMR chemical shifts: a case study (chloropyrimidines). J Org Chem. 2006;71(8):3103–3110. doi: 10.1021/jo0600149.
  • 106. Barone G, et al. Determination of the relative stereochemistry of flexible organic compounds by ab initio methods: conformational analysis and Boltzmann-averaged GIAO C-13 NMR chemical shifts. Chem Eur J. 2002;8(14):3240–3245. doi: 10.1002/1521-3765(20020715)8:14<3240::AID-CHEM3240>3.0.CO;2-G.
  • 107. Barone G, et al. Structure validation of natural products by quantum-mechanical GIAO calculations of C-13 NMR chemical shifts. Chem Eur J. 2002;8(14):3233–3239. doi: 10.1002/1521-3765(20020715)8:14<3233::AID-CHEM3233>3.0.CO;2-0.
  • 108. Remya K, Suresh CH. Which density functional is close to CCSD accuracy to describe geometry and interaction energy of small non-covalent dimers? A benchmark study using gaussian09. J Comput Chem. 2013;34(15):1341–1353. doi: 10.1002/jcc.23263.
  • 109. Zhao Y, Truhlar DG. Improved description of nuclear magnetic resonance chemical shielding constants using the M06-L meta-generalized-gradient-approximation density functional. J Phys Chem A. 2008;112(30):6794–6799. doi: 10.1021/jp804583d.
  • 110. Magyarfalvi G, Pulay P. Assessment of density functional methods for nuclear magnetic resonance shielding calculations. J Chem Phys. 2003;119(3):1350–1357. doi: 10.1063/1.1581252.
  • 111. Cimino P, et al. Comparison of different theory models and basis sets in the calculation of C-13 NMR chemical shifts of natural products. Magn Reson Chem. 2004;42:S26–S33. doi: 10.1002/mrc.1410.
  • 112. Tormena CF, da Silva GVJ. Chemical shifts calculations on aromatic systems: a comparison of models and basis sets. Chem Phys Lett. 2004;398(4–6):466–470. doi: 10.1016/j.cplett.2004.09.103.
  • 113. Cramer CJ, Truhlar DG. Implicit solvation models: equilibria, structure, spectra, and dynamics. Chem Rev. 1999;99(8):2161–2200. doi: 10.1021/cr960149m.
  • 114. Wiitala KW, Hoye TR, Cramer CJ. Hybrid density functional methods empirically optimized for the computation of C-13 and H-1 chemical shifts in chloroform solution. J Chem Theory Comput. 2006;2(4):1085–1092. doi: 10.1021/ct6001016.
  • 115. Reddy G, Yethiraj A. Implicit and explicit solvent models for the simulation of dilute polymer solutions. Macromolecules. 2006;39(24):8536–8542. doi: 10.1021/ma061176+.
  • 116. Smirnov SN, et al. Hydrogen deuterium isotope effects on the NMR chemical shifts and geometries of intermolecular low-barrier hydrogen-bonded complexes. J Am Chem Soc. 1996;118(17):4094–4101. doi: 10.1021/ja953445+.
  • 117. Benedict H, et al. Hydrogen/deuterium isotope effects on the N-15 NMR chemical shifts and geometries of low-barrier hydrogen bonds in the solid state. J Mol Struct. 1996;378(1):11–16. doi: 10.1016/0022-2860(95)09143-2.
  • 118. Gidley MJ, Bociek SM. C-13 CP/MAS NMR-studies of amylose inclusion complexes, cyclodextrins, and the amorphous phase of starch granules: relationships between glycosidic linkage conformation and solid-state C-13 chemical-shifts. J Am Chem Soc. 1988;110(12):3820–3829. doi: 10.1021/ja00220a016.
  • 119. Buckingham AD. Chemical shifts in the nuclear magnetic resonance spectra of molecules containing polar groups. Can J Chem Revue Canadienne De Chimie. 1960;38(2):300–307. doi: 10.1139/v60-040.
  • 120. Osmialowski B, Kolehmainen E, Gawinecki R. GIAO/DFT calculated chemical shifts of tautomeric species 2-Phenacylpyridines and (Z)-2-(2-hydroxy-2-phenylvinyl)pyridines. Magn Reson Chem. 2001;39(6):334–340. doi: 10.1002/mrc.856.
  • 121. Gauss J. Effects of electron correlation in the calculation of nuclear-magnetic-resonance chemical-shifts. J Chem Phys. 1993;99(5):3629–3643. doi: 10.1063/1.466161.
  • 122. Gao HW, et al. Comparison of different theory models and basis sets in the calculations of structures and C-13 NMR spectra of [Pt(en)(CBDCA-O, O')], an analogue of the antitumor drug carboplatin. J Phys Chem B. 2010;114(11):4056–4062. doi: 10.1021/jp912005a.
  • 123. Wu A, et al. Systematic studies on the computation of nuclear magnetic resonance shielding constants and chemical shifts: the density functional models. J Comput Chem. 2007;28(15):2431–2442. doi: 10.1002/jcc.20641.
  • 124. Giesen DJ, Zumbulyadis N. A hybrid quantum mechanical and empirical model for the prediction of isotropic C-13 shielding constants of organic molecules. Phys Chem Chem Phys. 2002;4(22):5498–5507. doi: 10.1039/B206245C.
  • 125. Hoffmann F, et al. Improved quantum chemical NMR chemical shift prediction of metabolites in aqueous solution toward the validation of unknowns. J Phys Chem A. 2017;121(16):3071–3078. doi: 10.1021/acs.jpca.7b01954.
  • 126. Aliev AE, Courtier-Murias D, Zhou S. Scaling factors for carbon NMR chemical shifts obtained from DFT B3LYP calculations. J Mol Struct Theochem. 2009;893(1–3):1–5. doi: 10.1016/j.theochem.2008.09.021.
  • 127. Willoughby PH, Jansma MJ, Hoye TR. A guide to small-molecule structure assignment through computation of (H-1 and C-13) NMR chemical shifts. Nat Protoc. 2014;9(3):643–660. doi: 10.1038/nprot.2014.042.
  • 128. Pierens GK. H-1 and C-13 NMR scaling factors for the calculation of chemical shifts in commonly used solvents using density functional theory. J Comput Chem. 2014;35(18):1388–1394. doi: 10.1002/jcc.23638.
  • 129. Caputo MC, Provasi PF, Sauer SPA. The role of explicit solvent molecules in the calculation of NMR chemical shifts of glycine in water. Theor Chem Accounts. 2018;137(7):1–8. doi: 10.1007/s00214-018-2261-9.
  • 130.Feunang YD, et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J Cheminformatics. 2016;8:1–20. doi: 10.1186/s13321-016-0174-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131.Yesiltepe Y, et al. An automated framework for NMR chemical shift calculations of small organic molecules. J Cheminformatics. 2018;10:1–16. doi: 10.1186/s13321-018-0305-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132.Koster J, Rahmann S. Snakemake-a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2522. doi: 10.1093/bioinformatics/bts480. [DOI] [PubMed] [Google Scholar]
  • 133.Weininger D. Smiles, a chemical language and information-system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28(1):31–36. doi: 10.1021/ci00057a005. [DOI] [Google Scholar]
  • 134.Valiev M, et al. NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Comput Phys Commun. 2010;181(9):1477–1489. doi: 10.1016/j.cpc.2010.04.018. [DOI] [Google Scholar]
  • 135.Lee CT, Yang WT, Parr RG. Development of the colle-salvetti correlation-energy formula into a functional of the electron-density. Phys Rev B. 1988;37(2):785–789. doi: 10.1103/PhysRevB.37.785. [DOI] [PubMed] [Google Scholar]
  • 136.Becke AD. A new mixing of hartree-fock and local density-functional theories. J Chem Phys. 1993;98(2):1372–1377. doi: 10.1063/1.464304. [DOI] [Google Scholar]
  • 137.Binkley JS, Pople JA, Hehre WJ. Self-consistent molecular-orbital methods. 21. Small split-valence basis-sets for 1st-row elements. J Am Chem Soc. 1980;102(3):939–947. doi: 10.1021/ja00523a008. [DOI] [Google Scholar]
  • 138.Gordon MS, et al. Self-consistent molecular-orbital methods. 22. Small split-valence basis-sets for 2nd-row elements. J Am Chem Soc. 1982;104(10):2797–2803. doi: 10.1021/ja00374a017. [DOI] [Google Scholar]
  • 139.Schuchardt KL, et al. Basis set exchange: a community database for computational sciences. J Chem Inf Model. 2007;47(3):1045–1052. doi: 10.1021/ci600510j. [DOI] [PubMed] [Google Scholar]
  • 140.Saielli G, et al. Addressing the stereochemistry of complex organic molecules by density functional theory-NMR: vannusal B in retrospective. J Am Chem Soc. 2011;133(15):6072–6077. doi: 10.1021/ja201108a. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 141.Tantillo DJ. Walking in the woods with quantum chemistry—applications of quantum chemical calculations in natural products research. Nat Prod Rep. 2013;30(8):1079–1086. doi: 10.1039/c3np70028c. [DOI] [PubMed] [Google Scholar]
  • 142.Klamt A, Schüürmann G. COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J Chem Soc Perkin Trans. 1993;2(5):799–805. doi: 10.1039/P29930000799. [DOI] [Google Scholar]
  • 143.Feller D. The role of databases in support of computational chemistry calculations. J Comput Chem. 1996;17(13):1571–1586. doi: 10.1002/(SICI)1096-987X(199610)17:13&#x0003c;1571::AID-JCC9&#x0003e;3.0.CO;2-P. [DOI] [Google Scholar]
  • 144.Xin D, et al. Development of a (13)C NMR chemical shift prediction procedure using B3LYP/cc-pVDZ and empirically derived systematic error correction terms: a computational small molecule structure elucidation method. J Org Chem. 2017;82(10):5135–5145. doi: 10.1021/acs.joc.7b00321. [DOI] [PubMed] [Google Scholar]
  • 145.Ditchfield R. Self-consistent perturbation-theory of diamagnetism. 1. Gauge-invariant Lcao method for Nmr chemical-shifts. Mol Phys. 1974;27(4):789–807. doi: 10.1080/00268977400100711. [DOI] [Google Scholar]
  • 146.Oliveira FM, et al. Evaluation of some density functional methods for the estimation of hydrogen and carbon chemical shifts of phosphoramidates. Comput Theor Chem. 2016;1090:218–224. doi: 10.1016/j.comptc.2016.06.025. [DOI] [Google Scholar]
  • 147.Smith SG, Goodman JM. Assigning stereochemistry to single diastereoisomers by GIAO NMR calculation: the DP4 probability. J Am Chem Soc. 2010;132(37):12946–12959. doi: 10.1021/ja105035r. [DOI] [PubMed] [Google Scholar]
  • 148.Grimblat N, Zanardi MM, Sarotti AM. Beyond DP4: an Improved probability for the stereochemical assignment of isomeric compounds using quantum chemical calculations of NMR shifts. J Org Chem. 2015;80(24):12526–12534. doi: 10.1021/acs.joc.5b02396. [DOI] [PubMed] [Google Scholar]
  • 149.Navarro-Vazquez A. State of the art and perspectives in the application of quantum chemical prediction of H-1 and C-13 chemical shifts and scalar couplings for structural elucidation of organic compounds. Magn Reson Chem. 2017;55(1):29–32. doi: 10.1002/mrc.4502. [DOI] [PubMed] [Google Scholar]
  • 150.Ermanis K, et al. Doubling the power of DP4 for computational structure elucidation. Org Biomol Chem. 2017;15(42):8998–9007. doi: 10.1039/C7OB01379E. [DOI] [PubMed] [Google Scholar]
  • 151.Renslow RS, et al. A biofilm microreactor system for simultaneous electrochemical and nuclear magnetic resonance techniques. Water Sci Technol. 2014;69(5):966–973. doi: 10.2166/wst.2013.802. [DOI] [PubMed] [Google Scholar]
  • 152.Sutovich KJ, et al. Simultaneous quantification of Bronsted- and Lewis-acid sites in a USY zeolite. J Catal. 1999;183(1):155–158. doi: 10.1006/jcat.1998.2379. [DOI] [Google Scholar]
  • 153.Mueller LJ. Chemical exchange in nuclear magnetic resonance. California Institute of Technology; 1997. [Google Scholar]
  • 154.Munkres J. Algorithms for the assignment and transportation problems. J Soc Ind Appl Math. 1957;5(1):32–38. doi: 10.1137/0105003. [DOI] [Google Scholar]
  • 155.Kuhn HW. The Hungarian method for the assignment problem. Naval Res Logist Q. 1955;2(1):83–97. doi: 10.1002/nav.3800020109. [DOI] [Google Scholar]
  • 156.Kuhn HW. Variants of the Hungarian method for assignment problems. Naval Res Logist Q. 1956;3(4):253–258. doi: 10.1002/nav.3800030404. [DOI] [Google Scholar]
  • 157.Cui H, et al (2016) Solving large-scale assignment problems by Kuhn-Munkres algorithm. In: Proceedings of the 2nd international conference on advances in mechanical engineering and industrial informatics (Ameii 2016), vol 73, pp 822–827.
  • 158.NaganaGowda GA, Raftery D. Recent advances in NMR-based metabolomics. Anal Chem. 2017;89(1):490–510. doi: 10.1021/acs.analchem.6b04420. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 159.Bingol K. Recent advances in targeted and untargeted metabolomics by NMR and MS/NMR methods. High Throughput. 2018;7(2):9. doi: 10.3390/ht7020009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 160.Hogben HJ, et al. Spinach–a software library for simulation of spin dynamics in large spin systems. J Magn Reson. 2011;208(2):179–194. doi: 10.1016/j.jmr.2010.11.008. [DOI] [PubMed] [Google Scholar]
  • 161.Bingol K, et al. Metabolomics beyond spectroscopic databases: a combined MS/NMR strategy for the rapid identification of new metabolites in complex mixtures. Anal Chem. 2015;87(7):3864–3870. doi: 10.1021/ac504633z. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

13321_2022_587_MOESM1_ESM.xlsx (905.7KB, xlsx)

Additional file 1. Set I - Water Soluble Molecules.

13321_2022_587_MOESM2_ESM.xlsx (2.9MB, xlsx)

Additional file 2. Set II - Chloroform Soluble Molecules.

13321_2022_587_MOESM4_ESM.docx (2.7MB, docx)

Additional file 4. Supplementary Information Document.

Data Availability Statement

Molecule MOL files and DFT output files are included in the Additional files, along with Python processing code. Any other data is freely available upon request. The Additional files are available from the authors, upon request.


Articles from Journal of Cheminformatics are provided here courtesy of BMC
