Abstract
Identification of compounds with minimal ambiguity remains a central challenge in mass spectrometry-based metabolomics. Conventional compound identification relies on comparing analytical signatures (e.g., mass-to-charge ratio, collision cross section, tandem mass spectra) against reference data obtained from measurements of authentic chemical standards. The breadth of annotatable compounds using this approach is necessarily limited by availability of authentic standards, analytical throughput, and resolving power of the separations that underlie the measurements. The maturation of computational methods, both theory-driven and artificial intelligence/machine learning-based, for prediction of various molecular properties relevant to multidimensional mass spectrometry measurements has opened the door to a new “reference-free” paradigm of compound annotation. Through augmenting existing reference data for molecular properties with computational predictions, the universe of identifiable chemical species can be expanded significantly beyond its current limits. An unexplored aspect of this novel approach is understanding how to gauge confidence in resulting annotations, especially as the compound search space is expanded. Intuitively, the confidence of a compound annotation is related to the inherent discriminatory power of the molecular properties used for identification, as well as the precision with which the properties are measured or predicted. In this work, we characterize this relationship between measurement precision and identification probability in a systematic and quantitative fashion for a defined region of chemical space that includes organic small molecule metabolites. Importantly, this work establishes a framework for conducting metabolite identification probability analysis that enables others to quantify this relationship for their own compounds and properties of interest.


Introduction
Unambiguous identification of compounds from measured chemical properties remains an elusive goal in mass spectrometry (MS)-based metabolomics analyses. Conventionally, approaches to compound identification compare measured properties for unknowns against reference databases populated with data from measurements of authentic reference standards, ideally with corroboration from more than one orthogonal property, and these annotations have represented the gold standard in the field. The number of compounds that can be annotated under this approach is limited primarily by the availability of authentic reference standards, analytical throughput, and to a lesser extent the resolution of the methods used to produce the measurements. As a result of these limitations, the majority of metabolites observed in metabolomics studies are left unannotated or otherwise excluded from downstream analyses and interpretation, which can significantly limit the conclusions drawn from such studies. Even when reference data exists for compounds of interest, annotations cannot always be unambiguously assigned due to the complexity of small molecule chemical space and the limited extent to which any given molecular property captures this diversity. The significant material costs and effort required to expand the coverage of reference measurements compared to the extent of metabolite chemical space means that these limitations are inherent to the conventional paradigm of reference-based compound identification and unlikely to be solved at a broad scale given the current state of the field.
Recent advancements in the ability to accurately predict molecular properties using computational methods have set the stage for a new paradigm in the annotation of unknown compounds from metabolomics data, under which reference databases with experimentally determined molecular properties are augmented with computationally predicted values. This so-called “reference-free” paradigm promises to expand the identifiable molecular universe far beyond the current bounds of existing reference data to cover all of the chemical space for which molecular properties can be accurately predicted. Computational prediction of molecular properties is significantly faster than traditional measurement of reference compounds, making it orders of magnitude more efficient in terms of expanding molecular property coverage. Computational property prediction can also be used to augment existing reference values through the addition of orthogonal properties, enabling higher-confidence compound annotation through use of multidimensional analytical signatures.
Computational methods for molecular property prediction may be based on theoretical principles, empirical trends, or a mixture of both, and they may be applicable to either broad or narrow swaths of chemical space depending on the details of their construction. Molecular properties including accurate mass-to-charge ratio (m/z), chromatographic retention time (RT), collision cross section (CCS), and tandem mass spectra (MS/MS or MS2) are particularly important for compound annotation from multidimensional MS-based metabolomics. While accurately computing m/z is trivial given known molecular structure and ionization state, the more complex properties RT, CCS, and MS/MS require significant theory and/or experimental data to support accurate prediction.
The confidence level of compound annotations, whether made by conventional or reference-free approaches, is intrinsically tied to the resolution of the measurements or accuracy of predicted molecular properties used for identification. Intuitively, we expect the number of potential annotations for an unknown feature (i.e., the inverse of identification probability) to be proportional to the level of precision with which the molecular properties forming the basis for annotation can be measured or predicted. For measured values, precision is determined by instrumental capabilities (most importantly resolving power, Rp), which translate to specific search tolerances within the context of compound annotation. Figure 1 depicts ranges of typical search tolerances for m/z, CCS, and RT, with general ranges for relevant instrumental platforms added where appropriate. We also expect annotation confidence level to increase when matching is performed based on multidimensional signatures, as this should reduce the breadth and complexity of the search space. While these relationships are intuitive, there is at present no quantitative understanding of these concepts, which is necessary for the practical application of gauging annotation confidence in reference-free compound identification.
Figure 1.

Assessment of molecular property search tolerances and instrumental capabilities. For properties m/z and CCS, search tolerances are specified in relative units (ppm or %). For CCS, hatched regions represent resolving powers only achievable using multipass SLIM or cIM technologies, neither of which currently have protocols for robust determination of CCS values. For RT, the effective Rp that a given search tolerance corresponds to is dependent on the separation length (gray dashed lines). Abbreviations: FT-ICR–Fourier transform ion cyclotron resonance, Q-ToF–quadrupole time-of-flight, SLIM–structures for lossless ion manipulations, IMS–ion mobility spectrometry, TIMS–trapped IMS, DTIMS–drift tube IMS, cIMS–cyclic IMS, TWIMS–traveling wave IMS, Rp–resolving power.
In this work, we set out to quantitatively characterize the relationship between measurement precision and identification probability (i.e., the probability of correctly identifying an annotated metabolite) in the context of multidimensional MS-based metabolomics. To this end, we curated a large database of reference molecular property data from a variety of literature sources and expanded the coverage of this database using a computational pipeline assembled using multiple individual molecular property prediction tools. Using this combined database, we performed systematic identification probability analyses to gain insight into the level of annotation confidence that is possible using both conventional reference-based and reference-free approaches.
Experimental Section
Molecular Property Database
The molecular property database is implemented as a SQLite3 database, which has extensive support across multiple platforms and programming languages. The database tables and their contents are described in detail in the Supporting Information. We implemented a Python interface for interacting with the database (Figure S2) that also includes utilities for initializing and building the database from existing literature data. RDKit was used for all molecular structure-related calculations. All molecular property database-related code (including database schema) is collected into a single Python package, idpp, which is open source and freely available at https://github.com/pnnl/idpp_main.
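As a rough illustration of how a SQLite-backed property database can be queried from Python, the sketch below builds a tiny in-memory database with a compound → adduct → property layout. The table names, column names, helper functions, and inserted values are all hypothetical and are not taken from the idpp package; the actual schema is described in the Supporting Information.

```python
import sqlite3

# Hypothetical three-level layout: compound -> adduct -> property value.
# Table and column names are illustrative only (not the idpp schema).
SCHEMA = """
CREATE TABLE compounds (cmpd_id INTEGER PRIMARY KEY, name TEXT, smiles TEXT);
CREATE TABLE adducts (
    adduct_id INTEGER PRIMARY KEY,
    cmpd_id INTEGER NOT NULL REFERENCES compounds(cmpd_id),
    adduct TEXT,
    mz REAL
);
CREATE TABLE ccs (
    ccs_id INTEGER PRIMARY KEY,
    adduct_id INTEGER NOT NULL REFERENCES adducts(adduct_id),
    ccs REAL,
    source TEXT
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up example entry."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.execute("INSERT INTO compounds VALUES (1, 'example compound', 'CCO')")
    con.execute("INSERT INTO adducts VALUES (1, 1, '[M+H]+', 47.0491)")
    con.execute("INSERT INTO ccs VALUES (1, 1, 120.0, 'made-up source')")
    con.commit()
    return con


def compounds_in_mz_window(con, mz, ppm):
    """Return distinct compound IDs with an adduct m/z within +/- ppm of mz."""
    tol = mz * ppm * 1e-6
    rows = con.execute(
        "SELECT DISTINCT cmpd_id FROM adducts WHERE mz BETWEEN ? AND ? "
        "ORDER BY cmpd_id",
        (mz - tol, mz + tol),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `compounds_in_mz_window(build_demo_db(), 47.0491, 5.0)` returns the single demo compound. Selecting distinct compound IDs mirrors the idea that matches are counted per compound rather than per adduct or per property entry.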
Initializing Database from Literature Measurement Data
The molecular property database was initialized from several collections of measured properties sourced primarily from the literature. A current version of the Human Metabolome Database (HMDB) was downloaded and used to populate the database with an initial set of compound annotations and corresponding metadata (such as SMILES and InChI structures). A collection of experimental MS/MS spectra was downloaded from MassBank of North America (MoNA, https://mona.fiehnlab.ucdavis.edu) and added to the database. We also included MS/MS spectra from the NIST20 library (https://www.sisweb.com/software/ms/nist.htm), limiting the included entries to those with collision gas listed as N2 and precursor type from a specified list of adducts. We used the RepoRT repository as a source for LC retention times. The individual data sets that compose the RepoRT collection were grouped according to chromatographic method details. The three sources of CCS values are the Unified CCS compendium from the McLean research group, the CCSbase and dmCCS databases from the Xu group, and the recent METLIN-CCS data set.
Computational Pipeline for Molecular Property Prediction
To expand the coverage of the molecular property database and therefore better characterize the relationship between identification probability and measurement precision, we developed a computational molecular property prediction pipeline. The pipeline consists of a collection of tools capable of predicting various molecular properties relevant to multidimensional mass spectrometry measurements, including chromatographic retention time (RT), collision cross section (CCS), and tandem mass spectra (MS2). Each tool has distinct dependencies, input specification, and output format, making it difficult to set up a singular environment to accommodate all tools. To address these limitations, we used the Python-based Snakemake workflow management system to implement the pipeline, enabling prediction of multiple molecular properties using many individual tools with disparate requirements (Figure S3). The pipeline and constituent prediction tools are described in greater detail in the Supporting Information and the code is freely available on GitHub (https://github.com/pnnl/idpp_workflow).
Identification Probability Analysis
The identification probability concept is discussed in depth elsewhere, but at a high level, the identification probability analysis used in this work consists of quantifying the number of identifications that result from querying on combinations of molecular properties using different search tolerances. This process is repeated for many combinations of molecular properties and search tolerances to build up a data set from which the relationship between identification probability (i.e., the number of identifications) and precision (i.e., search tolerance) can be determined, both for individual properties and combinations of properties. To improve efficiency of queries, we implemented a workflow (Figure S4) for conducting the identification probability analysis that generates intermediate data structures for each of the targeted molecular properties (m/z, RT, CCS, MS2) and then combines the results of queries from individual properties to obtain results corresponding to different combinations of properties.
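The core counting step described above can be sketched for a single property. The example below is a minimal sketch, not the idpp implementation: it assumes one m/z value per compound and uses a sorted array with binary search to count how many entries fall inside a relative (ppm) window around each query. The function name `mz_match_counts` is hypothetical.

```python
import numpy as np


def mz_match_counts(mz_values, ppm):
    """Count, for each entry, how many entries lie within +/- ppm of it.

    Counts are returned in order of ascending m/z. Each query matches
    itself, so every count is at least 1.
    """
    mz = np.sort(np.asarray(mz_values, dtype=float))
    tol = mz * ppm * 1e-6
    # Binary search for the first and last entries inside each window.
    lo = np.searchsorted(mz, mz - tol, side="left")
    hi = np.searchsorted(mz, mz + tol, side="right")
    return hi - lo


# Made-up m/z values: two near-isobaric pairs and one isolated entry.
demo_mz = [100.0000, 100.0002, 100.0010, 200.0000]
narrow = mz_match_counts(demo_mz, ppm=5.0)
wide = mz_match_counts(demo_mz, ppm=20.0)
```

Repeating this for a series of tolerances yields one match count distribution per tolerance, from which summary statistics such as the median can be taken. Widening the window from 5 to 20 ppm pulls the third entry into the cluster near m/z 100, which is the behavior the analysis is designed to quantify.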
Results and Discussion
Curation of Measured Molecular Properties from Literature
Exploring the relationship between identification probability and measurement precision in reference-free compound identification requires a molecular data set that is both broad in terms of the chemical space it spans and deep in terms of its coverage of relevant molecular properties. Another critical requirement for this data set is that any ambiguity or redundancy with respect to compound annotations must be minimized, as these will skew the results of the identification probability analysis. We achieved this using a hierarchical representation of multidimensional molecular property data (Figure S5), where compounds are mapped to ionized adducts that are in turn mapped to individual molecular properties. We sought to assemble a comprehensive molecular data set by first curating as many relevant molecular property measurements (RT, CCS, MS/MS) as possible from literature sources, then augmenting those measurements with computationally predicted values (using a variety of published tools) to expand the property coverage of all included compounds.
We first curated a so-called “reference-only” database using various publicly available collections of measured molecular properties sourced from the literature. After the initial collections of measured properties were added into the database, we performed a series of data cleaning tasks aimed at reducing factors that could negatively impact the results of the identification probability analysis, such as redundancy or ambiguity among compound annotations. The data cleaning process involved removing lipids, harmonizing compound names by removing extraneous characters (e.g., identifiers or punctuation), and remapping compound identifiers to remove duplicates erroneously introduced during cleaning. Lipids were removed from the data set due to the disproportionately high number of isomeric species and long-term issues around consistency in lipid annotation and reporting, which complicate assessment of identification probability. The final “reference-only” database after initialization and cleaning contained 168,078 compound entries, with 173,893 associated adduct entries, mapping to a total of over 1.1 M experimentally measured property entries (170,653 RT entries, 79,617 CCS entries, and 882,820 MS/MS spectra before spectrum combination).
The coverage of different molecular properties in the database is summarized in Figure 2A, with contributions from different data sources for each property depicted in Figure 2B–D. While the scale of the measured molecular property database is impressive, there is a relatively small proportion of compounds with coverage across multiple properties. The vast majority of compounds (157,199 or 93.5%) in the database have only single associated measured properties (RT, CCS or MS/MS). This lack of compound coverage across multiple properties significantly limits the utility of this data collection for annotating unknowns from multidimensional metabolomics experiments and also limits the extent to which identification probability can be characterized within a multidimensional context. For this reason, we chose to only conduct in-depth identification probability analysis with m/z alone and combinations of m/z + single properties (i.e., RT, CCS or MS2) for this data set.
Figure 2.
Summary of characteristics for the “reference-only” molecular property database. (A) Molecular property coverage of all compounds in the database (N = 168,078). All compounds have associated m/z, and the bars reflect counts of compounds with different combinations of additional molecular properties. The dots below each bar represent which properties are included in the combination. (B–D) Pie charts representing the different sources for molecular properties (RT, CCS, and MS2 spectra, respectively). Source label abbreviations are defined in Table S1. (E,F) Histograms of m/z and CCS values in the database, respectively. (G) Histograms for sets of RTs from the database grouped by chromatographic method and normalized to index values between 0 and 1. Groups are annotated with their corresponding reference number († = RepoRT data sets 374/375, ‡ = RepoRT data sets 366/367). (H) 2D histogram representing MS2 spectra in the database. The x axis represents precursor m/z and the y axis represents fragment m/z, and the counts across all bins are normalized to a scale of 0 to 1 (i.e., density). The plot also includes histograms of the data collapsed along either the x or y dimension. (I) Binned PCA projections based on 1024 bit chemical fingerprints for all compounds in the database.
The distributions of m/z values and CCS values in the database are depicted in Figure 2E,F, respectively, with most adducts falling within an m/z range of 100–500 and CCS range of 150–225 Å². As evident from these ranges, the general chemical composition of the database is dominated by organic small molecule metabolites. Indeed, the heavy atom (i.e., non-hydrogen) counts for compounds in the database mostly fall between 20 and 35 with the majority being C (15–25), followed by N and O (both 1–6), and much smaller contributions from other heavy elements (Figure S6).
Examining the distribution of RTs in the database is complicated by the fact that RT is not an intrinsic molecular property; rather, RT depends on experimental factors such as the type/phase of chromatography (principally in this data set: reversed-phase, RP; or hydrophilic interaction chromatography, HILIC) as well as the solvent composition and gradient details. In order to get a rough idea of the distribution of RT values in the database, we grouped entries by chromatographic method and indexed the retention times within each group to a fixed relative scale (0 to 1). Figure 2G depicts the resulting distributions for groups containing at least 1000 entries, of which four were from RP methods and one from a HILIC method. From these distributions we can see that the largest RP data set only has values above a relative RT of ∼0.3, while other large data sets have more uniform distributions of relative RT. This heterogeneity, combined with the distinctness of RTs obtained using different phases and methods, highlights the need to consider RTs derived from different experiments as entirely separate properties from one another. Harmonization, indexing, and other related strategies for making chromatographic RTs more comparable have been explored elsewhere, but the application of such methods in the context of identification probability analysis deserves more dedicated and focused efforts that we deemed outside of the scope of the present work.
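The per-method indexing of retention times to a 0 to 1 scale can be illustrated as follows. This is a minimal sketch that assumes simple min-max scaling within each chromatographic method group; the text does not state exactly how the indexing was done, and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def index_retention_times(entries):
    """Rescale RTs to [0, 1] within each chromatographic method group.

    entries: iterable of (method_id, rt) pairs.
    Min-max scaling is an assumption made for illustration.
    """
    entries = list(entries)
    by_method = defaultdict(list)
    for method_id, rt in entries:
        by_method[method_id].append(rt)
    # Precompute the RT range observed for each method group.
    bounds = {m: (min(v), max(v)) for m, v in by_method.items()}
    indexed = []
    for method_id, rt in entries:
        lo, hi = bounds[method_id]
        span = hi - lo
        # A group with a single distinct RT has no span; map it to 0.
        indexed.append((method_id, (rt - lo) / span if span > 0 else 0.0))
    return indexed


# Made-up retention times (minutes) for two hypothetical methods.
demo = index_retention_times([("A", 2.0), ("A", 4.0), ("A", 6.0), ("B", 10.0)])
```

Because each group is scaled independently, indexed values from different methods share a range but are still not directly comparable, consistent with treating RTs from different methods as separate properties.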
In contrast to RT, MS/MS spectra acquired using different methods are in principle comparable to one another to some degree and thus better resemble intrinsic molecular properties. However, the primary challenge with characterizing the MS/MS spectra contained in the database is that they are not represented as scalar values but rather as vectors of m/z and abundance values for variable numbers of fragments. In order to better make sense of the composition of the database in terms of MS/MS spectra, we constructed a 2D histogram where the first dimension is the precursor m/z and the second is the fragment m/z, accumulated over all MS/MS fragments in the database (Figure 2H). The distribution of precursor m/z for the MS/MS spectra matches the overall trend among all adduct m/z in the database, with most precursors falling in the range of 200–400 m/z. Fragment m/z follows a similar but proportionally lower distribution than the precursors, mostly spanning 50–200 m/z. What is also apparent from the distribution of MS/MS fragments is the presence of regular diagonal and horizontal elements, which respectively indicate presence of common neutral losses or fragments among the spectra.
Finally, we aimed to coarsely characterize the chemical space spanned by the compounds included in this data set. We used RDKit to compute 1024-bit topological fingerprints for all compounds with SMILES structures in the database, and performed principal components analysis (PCA) on the fingerprint data to visualize how the compounds distribute in chemical space via their topological fingerprints. Figure 2I depicts the density of compounds from the database in the projected space from the first two components of the PCA. Most compounds fall within a fairly dense region in this space, indicating that the chemical space spanned by compounds in the database is likely somewhat limited but the sampling within this region is reasonably dense. This observation tracks with earlier observations that the database is composed primarily of small organic metabolites. When heavy atom count is mapped onto the same projections (Figure S7), we can see an association with the overall pattern of compound density. However, for larger heavy atom counts we see less concentration in the spatial distribution, with sampling across both the dense region as well as the sparser region near the top of the plot, indicating that the larger compounds in the data set have greater structural diversity.
Identification Probability Analysis Using Reference-Only Database
We first performed a set of identification probability analyses using the “reference-only” database in order to understand the levels of confidence in metabolite annotations that are achievable when relying solely on experimental measurements for molecular properties. The analyses started with querying based on MS adduct m/z alone, which was performed on the complete database (N = 165,923 compounds) for a set of trials using logarithmically spaced search tolerances spanning a wide range (0.6, 1.0, 1.8, 3.2, 5.6, 10.0, 17.8, and 31.6 ppm). Each trial produced a distribution of matches from all compounds in the database, and example distributions from the 0.6, 3.2, and 17.8 ppm trials are presented in Figure 3A–C, respectively. With increasing search tolerance, the distribution of match counts shifts to larger values, with the observed median match count steadily increasing in these example plots from 6 matches in the 0.6 ppm trial to 44 for the 17.8 ppm trial. These distributions had consistent trends that were well approximated by an exponential function.
Figure 3.
Identification probability analysis using “reference-only” molecular property database. (A–C) Distributions of match counts from m/z-only trials performed using search tolerances of 0.6, 3.2, and 17.8 ppm, respectively. Black traces represent the observed distributions and blue traces represent fits with an exponential function. The observed median and median estimated from the fit parameters are included as vertical dashed lines (black and blue, respectively). (D) Summary of median match counts (observed and estimated from exponential fits in black and blue, respectively) across all m/z-only trials as a function of search tolerance. The blue dashed line represents a fit of the median values with respect to search tolerance using a power function. (E) 2D contour plot depicting median match counts from trials with matching based on m/z + CCS. The light open circles represent the locations of individual combinations of m/z and CCS search tolerances, from which the contours are interpolated. (F) A 2D contour plot as in E, but from trials with matching based on m/z + MS2 spectra.
The quality of the fit generally decreases slightly as the search tolerance is increased, which is likely attributable to the fact that m/z values are not distributed uniformly across the total m/z range but instead form groups that cluster together around specific values. By fitting the observed distributions with this exponential function and deriving the median from the fit parameters, we obtain a continuous estimate of the median that neatly summarizes the match count distributions. This estimated value is more useful than the observed median in that (1) it is less sensitive to noise or outliers in the raw distribution, such as with larger search tolerances as discussed above, and (2) it captures shifts in the distribution with a finer increment than the observed median, which can only take on whole numbers. As evident in the examples from Figure 3A–C, the estimated median match count generally has good agreement with the observed median. Figure 3D depicts the observed and fit medians from all m/z-only trials, which demonstrate that the estimated medians have a slight tendency toward underprediction (1–2 fewer matches) relative to the observed value for very small search tolerances. The estimated medians from exponential fits also follow a consistent trend with respect to search tolerance, and indeed they are well captured by a power function (Figure 3D). The impact of this observation is that the expected median match count, a stand-in for bulk identification probability at the database level, can be confidently predicted for arbitrary m/z search tolerances both within and outside of the range of tolerances characterized in this set of trials. Projecting the fitted curve to 0 ppm search tolerance, we can estimate that the theoretical minimum median match count would be about 2.12, or an effective identification probability of 0.47, reinforcing the intuitive assumption that m/z alone is not capable of supporting unambiguous identification regardless of search tolerance.
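To make the fit-then-summarize step concrete, the sketch below fits a match count distribution with a simple exponential decay and derives a median from the fitted rate. The parameterization used here, frequency = A·exp(−k·n) with median = ln(2)/k, is an assumption chosen for illustration; the exact functional form and median expression used in this work are not reproduced here. The last helper shows the arithmetic behind converting a median match count into an effective identification probability (its reciprocal), which reproduces the 2.12 → 0.47 conversion quoted above.

```python
import numpy as np


def fit_exponential(match_counts):
    """Fit frequency = A * exp(-k * n) to a set of integer match counts.

    Uses a log-linear least-squares fit over the observed count values.
    The functional form is an assumed parameterization.
    """
    values, freq = np.unique(np.asarray(match_counts), return_counts=True)
    # log(freq) = log(A) - k * n, so a degree-1 polynomial fit suffices.
    slope, intercept = np.polyfit(values, np.log(freq), 1)
    return float(np.exp(intercept)), float(-slope)


def median_from_rate(k):
    """Median of a continuous exponential distribution with rate k."""
    return float(np.log(2) / k)


def identification_probability(median_match_count):
    """Effective identification probability as 1 / median match count."""
    return 1.0 / median_match_count


# Synthetic counts whose frequencies halve with each additional match.
demo_counts = np.repeat([1, 2, 3, 4, 5], [64, 32, 16, 8, 4])
A, k = fit_exponential(demo_counts)
```

Because the fitted median is a continuous quantity, it can shift by fractions of a match between trials, whereas the observed median of integer counts cannot.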
We next sought to examine the trends in identification probability for identifications made by matching on combinations of molecular properties. With the focus of the present work being on MS-based metabolomics, we chose specifically to characterize identification probability associated with matching based on m/z combined with other properties (RT, CCS, and MS2), as these represent the property combinations produced from common experimental configurations.
We started by performing a set of trials with matching based on m/z and CCS using the same set of m/z tolerances as before and a similar logarithmically spaced wide range of CCS search tolerances (0.06, 0.10, 0.18, 0.32, 0.56, 1.00, 1.78, 3.16, and 5.62%). Separate trials were performed for all combinations of m/z and CCS tolerance using the subset of compounds in the database with associated CCS values (N = 68,419). The resulting estimated median match counts from exponential fits from all trials are depicted in Figure 3E. The median match count increases systematically with increases in search tolerance for either property, indicating that both provide useful information for distinguishing between potential matches. Interestingly, the change in median match count with respect to search tolerance is effectively the same for both properties, except that CCS tolerance is approximately 1 log unit higher in absolute scale, as evidenced by the almost completely diagonal contour lines in the plot. It is also worth noting that the minimum median match count observed using m/z alone is around 3 matches (at 0.6 ppm search tolerance), but this minimum reduces to 1 match when CCS is considered (with search tolerance below 1%). Importantly, these results include a range of CCS tolerances extending to a precision level significantly greater than the limits of typical experimental and computational CCS determination methods (on the order of ∼1%).
While tolerances smaller than this range arguably do not provide meaningful insights due to the precision of existing CCS measurements, we note that (1) there is increasing development and application of both instrumentation and computational methods capable of determining CCS values with far greater precision than what was used to produce existing reference CCS databases and (2) assuming that similar accuracy is maintained in producing these higher precision CCS values, the results from trials in this regime of search tolerances represent what is possible once these experimental and computational approaches mature and become standardized.
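The tolerance grid and the two-property matching described above can be sketched as follows. The quoted tolerance lists are consistent with quarter-decade logarithmic spacing, which the snippet reproduces. The matching function is a brute-force pairwise sketch suitable only for small inputs, with one (m/z, CCS) pair per entry assumed; it is not the workflow used in this work, and its name is hypothetical.

```python
import numpy as np

# Quarter-decade log spacing; rounding recovers the values quoted in the text.
mz_tols_ppm = 10.0 ** np.linspace(-0.25, 1.5, 8)    # ~0.56 to ~31.6 ppm
ccs_tols_pct = 10.0 ** np.linspace(-1.25, 0.75, 9)  # ~0.056 to ~5.62 %


def mz_ccs_match_counts(mz, ccs, ppm, pct):
    """Count entries matching each query on both m/z (ppm) and CCS (%).

    Row i of each boolean matrix marks the entries inside query i's window.
    Memory use is O(N^2), so this is for illustration only.
    """
    mz = np.asarray(mz, dtype=float)
    ccs = np.asarray(ccs, dtype=float)
    mz_ok = np.abs(mz[None, :] - mz[:, None]) <= mz[:, None] * ppm * 1e-6
    ccs_ok = np.abs(ccs[None, :] - ccs[:, None]) <= ccs[:, None] * pct * 1e-2
    # A match must satisfy both tolerances simultaneously.
    return (mz_ok & ccs_ok).sum(axis=1)


# Made-up near-isobaric entries with differing CCS values.
demo_mz = [300.0000, 300.0003, 300.0003]
demo_ccs = [170.0, 170.5, 180.0]
tight = mz_ccs_match_counts(demo_mz, demo_ccs, ppm=5.0, pct=1.0)
loose = mz_ccs_match_counts(demo_mz, demo_ccs, ppm=5.0, pct=10.0)
```

In this toy case, all three entries are indistinguishable by m/z at 5 ppm, and a 1% CCS tolerance separates the third entry from the other two, while a 10% CCS tolerance adds no discrimination.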
We next performed a set of trials with matching based on m/z and MS/MS spectra. We repeated trials over a range of similarity thresholds (0.99, 0.95, 0.9, 0.8, 0.65, 0.5, 0.25, 0.1), which function much like normal search tolerances in the context of identification probability analysis, testing all combinations with the same set of m/z tolerances as before. Due to time and computing constraints, we did not perform exhaustive queries using all precursors and spectra in the database, but instead relied on a random sampling approach. In these trials, an average of 4629 precursors (∼9% of the 50,276 total compounds with associated MS/MS spectra) were sampled and their query results included in the match count distributions. The resulting estimated median match counts from exponential fits from all trials are depicted as a contour plot in Figure 3F. While sampling rate may influence match count distributions, in initial testing we found that median match counts remained relatively stable and only the smoothness of the match count distribution increased with sampling rate. Similar to the results from the m/z + CCS trials, we found that the inclusion of MS/MS for compound identification systematically and significantly reduces the median match count, but to a stronger extent than CCS. For instance, with inclusion of MS/MS in searches and using the lowest similarity threshold evaluated (0.1) and an m/z tolerance of 10 ppm, the median match count is reduced by half relative to a similar search using m/z alone (9.6 vs 20.7). Interestingly, the median match counts do not change much in the region where m/z tolerances are low and MS2 thresholds are high, which may suggest a point of diminishing returns in terms of identification probability with high stringency. These results support the expectation that consideration of MS/MS spectra enables compound identification with high confidence, especially in comparison to other properties commonly measured in metabolomics experiments.
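How a similarity threshold can act like a search tolerance is illustrated below. The sketch assumes cosine similarity between unit-width binned spectra, which is one common choice; the similarity measure actually used in this work is not specified in this section, and all function names, the bin width, and the example spectra are hypothetical.

```python
import numpy as np


def bin_spectrum(peaks, bin_width=1.0, max_mz=1000.0):
    """Convert (m/z, intensity) peaks into a fixed-length intensity vector."""
    vec = np.zeros(int(max_mz / bin_width) + 1)
    for mz, intensity in peaks:
        if 0 <= mz <= max_mz:
            vec[int(mz / bin_width)] += intensity
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two intensity vectors (0 if either is empty)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def ms2_matches(query_peaks, library, threshold):
    """Names of library spectra whose similarity to the query meets threshold."""
    q = bin_spectrum(query_peaks)
    return [
        name
        for name, peaks in library.items()
        if cosine_similarity(q, bin_spectrum(peaks)) >= threshold
    ]


# Made-up spectra: "a" and "b" share nominal fragments, "c" does not.
demo_library = {
    "a": [(100.2, 10.0), (150.1, 5.0)],
    "b": [(100.4, 10.0), (150.3, 5.0)],
    "c": [(300.0, 1.0)],
}
hits = ms2_matches(demo_library["a"], demo_library, threshold=0.9)
```

Lowering the threshold admits progressively less similar spectra as matches, in the same way that widening an m/z or CCS tolerance admits more candidates.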
We performed similar trials with matching based on m/z and RT, with RTs measured using different chromatographic methods being treated as separate molecular properties and considered in separate trials. We selected a few sets of RTs from the database that were grouped according to shared reversed-phase chromatographic conditions consisting of at least 1000 measured values (Figure 2G). We used the same set of m/z tolerances as in previous analyses, with different RT tolerance ranges for the different methods based on their durations. As in the MS/MS trials described above, we employed a random sampling approach for the identification probability analysis using m/z + RT in order to reduce computational time. Details about the different RT data sets, the corresponding search tolerances, and sampling information are included in Table S1. After running the first iteration of the sampling, we checked the preliminary match counts for all 4 RT data sets and found that very few distributions had estimated median match counts above 1, even in trials with very large m/z and RT search tolerances (Figure S8). Upon closer inspection of the database, we found the source of this observation to be presence of very few isobaric compounds within these RT data sets. Thus, even when there were many m/z-based matches from trials using high m/z tolerance, individual queries tended to return single matches despite using very large RT tolerances. With this being the case, we were not able to extract further useful insights from searches that included RT.
Expanding Molecular Property Database with Computationally Predicted Properties
The molecular property database assembled from literature sources had very few compounds with coverage across multiple properties (RT, CCS, MS/MS), limiting the extent to which identification probability analysis can be applied to searches involving combinations of these properties. The reference-free paradigm of compound annotation addresses this limitation by not only broadening the annotatable chemical space but also deepening the coverage of molecular properties, in turn enabling high-confidence annotations based on multidimensional signatures. To better understand the impact of the latter concept in the context of identification probability analysis, we expanded the “reference-only” database characterized above using a computational pipeline consisting of multiple tools for molecular property prediction. We specifically sought to increase the number of compounds with multiple associated molecular properties, enabling the characterization of matches from multidimensional searches.
Database expansion resulted in the addition of 69,100 RT values, 178,596 CCS values, and 137,334 MS/MS spectra and an increase from 168,078 to 200,161 compound entries. Notably, since identification probability analysis matches are aggregated at the compound level, these additions only broaden property coverage of the database to the extent that the newly added values contain properties for compounds that did not previously have them. However, in cases where the newly added properties do not expand database coverage, they still provide important value by adding multiple observations of the same property for a given compound entry. Multiple observations of the same property from different sources can elucidate the inherent uncertainty in the determination of a property introduced by different measurement or prediction methods. Figure A depicts the coverage across different molecular properties in the expanded database, with contributions from different data sources for each property depicted in Figure B–D. Most importantly, we see a significant increase in compound coverage for combinations of molecular properties, with 27,473 compounds having either RT + CCS, RT + MS/MS, or CCS + MS/MS and 27,314 having coverage across all three properties. The expanded database enables characterizing identification probability using combinations of multiple molecular properties. After expansion, a large proportion of compounds in the database still only had single molecular properties associated, which can mostly be attributed to limitations of the prediction tools used for expansion.
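Because matches are aggregated at the compound level, repeated observations of a property do not change coverage, whereas a property that is new for a compound does. A minimal sketch of this bookkeeping is given below; the `(compound, property)` record layout, the function name, and the toy records are assumptions for illustration only.

```python
from collections import Counter


def coverage_by_combination(records):
    """Count compounds per combination of associated properties.

    records: iterable of (compound_id, property_name) pairs, where the
    same pair may appear several times (multiple observations).
    """
    props = {}
    for compound, prop in records:
        # A set ignores repeated observations of the same property.
        props.setdefault(compound, set()).add(prop)
    return Counter(frozenset(p) for p in props.values())


# Hypothetical toy records.
records = [
    ("cmpd1", "CCS"), ("cmpd1", "CCS"),  # two observations, one property
    ("cmpd1", "MS/MS"),
    ("cmpd2", "RT"), ("cmpd2", "CCS"), ("cmpd2", "MS/MS"),
    ("cmpd3", "CCS"),
]

coverage = coverage_by_combination(records)
for combo, n in sorted(coverage.items(), key=lambda kv: sorted(kv[0])):
    print(" + ".join(sorted(combo)), n)
```

Keying the counter on a `frozenset` of property names gives one count per property combination, the same kind of summary shown in the coverage bar chart.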
Figure 4. Summary of characteristics for the expanded molecular property database. (A) Molecular property coverage of all compounds in the database (N = 200,161). All compounds have associated m/z, and the bars reflect counts of compounds with different combinations of additional molecular properties. The dots below each bar represent which properties are included in the combination. (B–D) Pie charts representing the different sources for molecular properties (RT, CCS, and MS2 spectra, respectively). Source label abbreviations are defined in Table S1. (E,F) Histograms of m/z and CCS values in the database, respectively. Open traces overlaid on each histogram represent the histogram from the original database. (G) Histograms for sets of RTs from the database (including the added set of predicted RTs) grouped by chromatographic method and normalized to index values between 0 and 1. Groups are annotated with their corresponding reference number († = RepoRT data sets 374/375, ‡ = RepoRT data sets 366/367). (H) 2D histogram representing MS2 spectra in the expanded database. The x axis represents precursor m/z and the y axis represents fragment m/z, and the counts across all bins are normalized to a scale of 0 to 1 (i.e., density). The plot also includes histograms of the data collapsed along either the x or y dimension.
The distributions of m/z values and CCS values in the expanded database are depicted in Figure E,F, respectively, with the distributions from the original database overlaid for comparison. As in the original database, most adducts fall within an m/z range of 100–500, but the total count of entries is slightly increased after expansion. The CCS distribution from the expanded database is slightly broader than the original, with a particular increase in values between 120 and 150 Å². There is also an increase in entries for the 150–225 Å² range, which had the densest coverage in the original database.
In terms of RT, the database expansion effectively only added a single predicted data set rather than expanding the coverage of any existing data sets. As discussed previously, these different RT data sets essentially function as separate properties, meaning that this expansion likely had less impact on database coverage relative to other properties. Nevertheless, the distribution of indexed RTs from this predicted data set is shown along with the previously characterized RT data sets in Figure G.
Lastly, we examined the expanded database’s MS/MS coverage. Figure H depicts the 2D histogram of precursors and fragments for all MS/MS spectra in the expanded database. When compared to the same plot from the original database (Figure H), the MS/MS spectra in the expanded database have higher coverage in the lower m/z range for both precursors and fragments. The distributions of precursor and fragment m/z are slightly more uniform, with fewer obvious features (i.e., diagonal or horizontal bands) in the 2D space, suggesting more diverse fragment coverage in the expanded database.
Identification Probability Analysis Using Expanded Database
After database expansion, we repeated the same set of identification probability analyses using the previously specified property combinations and search tolerances. The m/z-only trials were consistent with the results from the original database, with the distribution of match counts increasing systematically with increasing search tolerance (Figure A–C). Likewise, the observed and estimated median match counts for all trials show a systematic trend with search tolerance (Figure D). Interestingly, the trend between estimated median match count and search tolerance for the expanded database (blue dashed line) is essentially identical to the one estimated from the original database (gray dashed line), suggesting that database expansion increased coverage of molecular properties but did not significantly affect the breadth of encompassed chemical space.
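The two fitting steps behind these summaries (an exponential fit to each match count distribution, then a power function relating the median to search tolerance) can be sketched as follows. This is a minimal illustration under stated assumptions: the exponential rate is estimated here by maximum likelihood rather than by whatever curve-fitting procedure the authors used, the power function is fitted by least squares in log-log space, and the medians are synthetic values generated from a known power law rather than results from the paper. Only the three tolerances (0.6, 3.2, and 17.8 ppm) are taken from the text.

```python
import math


def exponential_median(counts):
    """Median of an exponential distribution fitted by maximum likelihood."""
    rate = len(counts) / sum(counts)  # MLE of the rate parameter
    return math.log(2) / rate


def fit_power_law(tolerances, medians):
    """Least-squares fit of median = a * tol**b in log-log space."""
    xs = [math.log(t) for t in tolerances]
    ys = [math.log(m) for m in medians]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


# Synthetic medians that follow median = 2 * tol**0.5 exactly (not real data).
tolerances = [0.6, 3.2, 17.8]
medians = [2.0 * t ** 0.5 for t in tolerances]

a, b = fit_power_law(tolerances, medians)
print(round(a, 3), round(b, 3))
```

Because the synthetic medians lie exactly on a power law, the fit recovers the generating parameters (a = 2, b = 0.5). With real match counts the recovered parameters would depend on the database and property being searched.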
Figure 5. Identification probability analysis using expanded molecular property database. (A–C) Distributions of match counts from m/z-only trials performed using search tolerances of 0.6, 3.2, and 17.8 ppm, respectively. Black traces represent the observed distributions and blue traces represent fits with an exponential function. The observed median and median estimated from the fit parameters are included as vertical dashed lines (black and blue, respectively). (D) Summary of median match counts (observed and estimated from exponential fits in black and blue, respectively) across all m/z-only trials as a function of search tolerance. The blue dashed line represents a fit of the median values with respect to search tolerance using a power function, and the gray dashed line represents the same fit from the “reference-only” database. (E) 2D contour plot depicting median match counts from trials with matching based on m/z + CCS. The light open circles represent the locations of individual combinations of m/z and CCS search tolerances, from which the contours are interpolated. (F) A 2D contour plot as in E, but from trials with matching based on m/z + MS2 spectra.
We next ran a set of trials with matching based on m/z and CCS with the expanded database. The subset of compounds in the database with associated CCS values was more than doubled (161,321 vs 68,419), which we expect to produce a more accurate estimation of confidence for annotations based on m/z and CCS. The resulting median match count estimates from these trials display a similar systematic trend as was observed for the original data set. From the contour plots (Figure E), we observe that while the overall trend is similar to that observed for the original data set, the magnitude of estimated median match counts is consistently higher for the same combination of search tolerances. This observation is directly in line with what was observed from the m/z-only trials, and thus the same interpretation applies.
Next, we performed trials with matching based on m/z and MS/MS spectra as described for the original data set. As with m/z + CCS, the expanded database had significantly more entries with m/z + MS/MS spectra than the original database (70,693 vs 50,276). An average of 2961 compounds (∼4.2%) were sampled for each trial. In contrast to the m/z + CCS trials, with searching based on m/z + MS/MS we see little difference between the trends observed from the original and expanded databases, despite the greater compound coverage (Figure F). We attribute this observation to MS/MS spectra offering much more discriminating power than CCS values in compound annotation. As a result, the newly added MS/MS spectra in the expanded database are distinct enough from the existing entries that they do not significantly increase ambiguity in the resulting annotations.
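The m/z + CCS contour plot summarizes median match counts over a grid of tolerance pairs. A minimal sketch of such a scan is shown below. It assumes that the CCS tolerance is expressed as a relative percentage and that every entry is used as a query against the full set; the entry layout, function names, tolerance values, and toy data are hypothetical and chosen only to make the behavior easy to follow.

```python
import statistics


def n_matches(query, entries, mz_tol_ppm, ccs_tol_pct):
    """Count distinct compounds within both the m/z and CCS tolerances."""
    mz_delta = query["mz"] * mz_tol_ppm * 1e-6
    ccs_delta = query["ccs"] * ccs_tol_pct / 100.0
    return len({
        e["compound"]
        for e in entries
        if abs(e["mz"] - query["mz"]) <= mz_delta
        and abs(e["ccs"] - query["ccs"]) <= ccs_delta
    })


def median_grid(entries, mz_tols_ppm, ccs_tols_pct):
    """Median match count for every (m/z tolerance, CCS tolerance) pair."""
    return {
        (mz_tol, ccs_tol): statistics.median(
            n_matches(q, entries, mz_tol, ccs_tol) for q in entries
        )
        for mz_tol in mz_tols_ppm
        for ccs_tol in ccs_tols_pct
    }


# Hypothetical toy entries (CCS in square angstroms).
entries = [
    {"compound": "A", "mz": 300.0000, "ccs": 170.0},
    {"compound": "B", "mz": 300.0010, "ccs": 171.0},
    {"compound": "C", "mz": 300.0020, "ccs": 180.0},
    {"compound": "D", "mz": 450.0000, "ccs": 210.0},
]

grid = median_grid(entries, mz_tols_ppm=[1, 10], ccs_tols_pct=[0.5, 10])
for (mz_tol, ccs_tol), med in sorted(grid.items()):
    print(f"{mz_tol} ppm, {ccs_tol}% CCS: median matches = {med}")
```

In the toy data, the median match count rises above 1 only when both tolerances are wide, since a tight tolerance in either dimension is enough to separate the three near-isobaric entries. Each key of `grid` corresponds to one of the open circles from which a contour surface would be interpolated.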
Interpretation of trials performed with matching based on m/z + RT for the expanded database is limited by the same factors discussed for the original database. Accordingly, when we performed trials using m/z + RT from the expanded database, grouped by chromatographic method as previously described, we observed no meaningful change in the results for the previously characterized subsets. Furthermore, we did not gain useful insights from the newly predicted values (Figure S9). These results, and the results from the original database, highlight a need for more focused efforts to characterize the discriminative power of chromatographic retention time in the context of compound annotation, likely using a more sophisticated strategy for harmonizing RTs from different methods.
Given the coverage of compounds having multiple associated molecular properties in the expanded database and an intuition around the benefits of multidimensional signatures for increasing the confidence in compound annotations, we were interested in performing identification probability analysis trials using combinations of three or more properties. However, observations from trials using combinations of fewer properties suggest a limited potential for garnering useful insights from such trials. The estimated median match counts were already somewhat low for most trials, so it would be difficult to meaningfully capture changes in annotation confidence with inclusion of more dimensions using the current approach. This limitation can be attributed to several factors, but the relatively low chemical diversity and coverage of the molecular property database, even after expansion, are likely significant contributors.
Conclusion
Overall, the present work demonstrates that the relationship between identification probability and measurement precision within the context of compound annotation in multidimensional MS-based metabolomics is systematic but also dependent on a number of factors. The systematic nature of this relationship means that it can be characterized and even predicted with relatively high confidence, at least within the context of a particular experiment and molecular database, and these insights can provide a quantitative basis for judging annotation confidence. While these trends are systematic in nature, their absolute scale is determined by factors including the size and breadth of chemical space covered by the molecular database used for annotations as well as the specific molecular properties used for annotation. More concretely, given a reference database spanning a defined chemical space and fixed search tolerances for molecular properties of interest, the present work suggests that the number of matches (which is inversely related to identification probability) from queries for a set of measured features will center around a median value that changes systematically with the search tolerances. This knowledge becomes useful in a practical setting once this relationship between search tolerances and identification probabilities has been characterized for a specific reference database of interest over a range of relevant search tolerances. Subsequently, individual annotations can be assigned an estimated identification probability score based on the reported errors in each of the dimensions used for compound annotation. In other words, the present work provides a framework for assigning quantitative scores to annotations based on empirical characterization of the specific reference database and molecular properties used for compound annotation.
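To make the scoring idea concrete, the sketch below converts a reported m/z error into an estimated identification probability, given a previously characterized power function for the median match count. Everything specific here is an assumption: the reciprocal mapping from match count to probability, the floor at one match, and the parameter values `A` and `B` are hypothetical and would need to be replaced by values characterized for the reference database actually in use.

```python
def estimated_match_count(tol_ppm, a, b):
    """Median match count predicted by a power function of the tolerance."""
    return a * tol_ppm ** b


def identification_probability(error_ppm, a, b):
    """Reciprocal of the expected match count, floored at one match."""
    expected = estimated_match_count(abs(error_ppm), a, b)
    return 1.0 / max(1.0, expected)


# Hypothetical power-function parameters for a single property (m/z).
A, B = 0.8, 0.6

for err in (0.5, 5.0, 20.0):
    print(err, round(identification_probability(err, A, B), 3))
```

Under these assumptions the score equals 1 whenever the expected match count is at or below one and decreases monotonically as the reported error grows. Extending the sketch to several properties would require a characterized multidimensional relationship in place of the single power function.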
We anticipate that application of this framework to consensus databases with relevant molecular properties for compounds within particular biological contexts (e.g., human metabolome, soil microbiome) will provide quantitative metrics for annotation confidence that can be used by others in the field in a more universal fashion and without the need to perform full identification probability analysis separately for each individual study. Development of these resources will be critical to the broader application of a reference-free approach to compound annotation, as this approach is difficult to validate at a large scale using traditional (i.e., reference material-based) means. Finally, with the proliferation of AI-based tools that support compound annotation, our approach will enable users to quantify the accuracy needed to make identifications with confidence levels acceptable to their use cases and will motivate the development of tools that better support confident annotations. It is also likely that AI/ML could take the place of explicit function fitting within the identification probability analysis framework itself if trained on sufficiently large and diverse data sets.
Supplementary Material
Acknowledgments
This work was supported by the PNNL Laboratory Directed Research and Development (LDRD) program and is a contribution of the m/q Initiative. Additional support was provided by the LDRD program via the Predictive Phenomics Initiative. PNNL is operated by Battelle for the DOE under contract DE-AC05-76RL01830.
Code related to the molecular property prediction pipeline is housed in the repository: https://github.com/pnnl/idpp_workflow. Code related to the chromatographic retention time prediction model is housed in the repository: https://github.com/pnnl/idpp_rtp. All other code relating to the molecular property database and identification probability analysis is housed in the repository: https://github.com/pnnl/idpp_main.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.analchem.5c01067.
Additional method details, results and discussion from benchmarking and other tests, and additional figures are included in the Supporting Information (PDF).
Conceptualization: DHR, TOM. Funding acquisition: DHR, CHC, TOM, BJMWR, RGE. Data curation: CHC, SCS, DHR. Software: DHR, CHC, AKI, SCS. Methodology: DHR, CHC. Formal analysis: DHR. Writing – original draft: DHR, CHC. Writing – review and editing: All authors.
The authors declare no competing financial interest.
References
- Creek D. J., Dunn W. B., Fiehn O., Griffin J. L., Hall R. D., Lei Z. T., Mistrik R., Neumann S., Schymanski E. L., Sumner L. W.. et al. Metabolite identification: are you sure? And how do your peers gauge your confidence? Metabolomics. 2014;10(3):350–353. doi: 10.1007/s11306-014-0656-8.
- Sumner L. W., Lei Z. T., Nikolau B. J., Saito K., Roessner U., Trengove R.. Proposed quantitative and alphanumeric metabolite identification metrics. Metabolomics. 2014;10(6):1047–1049. doi: 10.1007/s11306-014-0739-6.
- Schrimpe-Rutledge A. C., Codreanu S. G., Sherrod S. D., McLean J. A.. Untargeted Metabolomics Strategies-Challenges and Emerging Directions. J. Am. Soc. Mass Spectrom. 2016;27(12):1897–1905. doi: 10.1007/s13361-016-1469-y.
- Sumner L. W., Amberg A., Barrett D., Beale M. H., Beger R., Daykin C. A., Fan T. W., Fiehn O., Goodacre R., Griffin J. L.. et al. Proposed minimum reporting standards for chemical analysis: Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics. 2007;3(3):211–221. doi: 10.1007/s11306-007-0082-2.
- Schymanski E. L., Jeon J., Gulde R., Fenner K., Ruff M., Singer H. P., Hollender J.. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ. Sci. Technol. 2014;48(4):2097–2098. doi: 10.1021/es5002105.
- Kind T., Fiehn O.. Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm. BMC Bioinformatics. 2006;7:234. doi: 10.1186/1471-2105-7-234.
- Frainay C., Schymanski E. L., Neumann S., Merlet B., Salek R. M., Jourdan F., Yanes O.. Mind the Gap: Mapping Mass Spectral Databases in Genome-Scale Metabolic Networks Reveals Poorly Covered Areas. Metabolites. 2018;8(3):51. doi: 10.3390/metabo8030051.
- Critch-Doran O., Jenkins K., Hashemihedeshi M., Mommers A. A., Green M. K., Dorman F. L., Jobst K. J.. Toward Part-per-Million Precision in the Determination of an Ion’s Collision Cross Section Using Multipass Cyclic Ion Mobility. J. Am. Soc. Mass Spectrom. 2024;35(4):775–783. doi: 10.1021/jasms.4c00003.
- Haddad P. R., Taraji M., Szücs R.. Prediction of Analyte Retention Time in Liquid Chromatography. Anal. Chem. 2021;93(1):228–256. doi: 10.1021/acs.analchem.0c04190.
- de Cripan S. M., Arora T., Olomi A., Canela N., Siuzdak G., Domingo-Almenara X.. Predicting the Predicted: A Comparison of Machine Learning-Based Collision Cross-Section Prediction Models for Small Molecules. Anal. Chem. 2024;96(22):9088–9096. doi: 10.1021/acs.analchem.4c00630.
- Russo F. F., Nowatzky Y., Jaeger C., Parr M. K., Benner P., Muth T., Lisec J.. Machine learning methods for compound annotation in non-targeted mass spectrometry – A brief overview of fingerprinting, in silico fragmentation and de novo methods. Rapid Commun. Mass Spectrom. 2024;38(20):e9876. doi: 10.1002/rcm.9876.
- Ross D. H., Bhotika H., Zheng X. Y., Smith R. D., Burnum-Johnson K. E., Bilbao A.. Computational tools and algorithms for ion mobility spectrometry-mass spectrometry. Proteomics. 2024;24(12–13):e2200436. doi: 10.1002/pmic.202200436.
- Metz T. O., Chang C. H., Gautam V., Anjum A., Tian S., Wang F., Colby S. M., Nunez J. R., Blumer M. R., Edison A. S.. et al. Introducing “Identification Probability” for Automated and Transferable Assessment of Metabolite Identification Confidence in Metabolomics and Related Studies. Anal. Chem. 2025;97:1. doi: 10.1021/acs.analchem.4c04060.
- Ross D. H., Lee J. Y., Gao Y., Hollerbach A. L., Bilbao A., Shi T., Ibrahim Y. M., Smith R. D., Zheng X.. Evaluation of a Reference-Free Collision Cross Section Calibration Strategy for Proteomics Using SLIM-Based High-Resolution Ion Mobility Spectrometry-Mass Spectrometry. J. Am. Soc. Mass Spectrom. 2024;35(7):1539–1549. doi: 10.1021/jasms.4c00141.
- Hupatz H., Rahu I., Wang W. C., Peets P., Palm E. H., Kruve A.. Critical review on in silico methods for structural annotation of chemicals detected with LC/HRMS non-targeted screening. Anal. Bioanal. Chem. 2025;417:473. doi: 10.1007/s00216-024-05471-x.
- Wishart D. S., Guo A. C., Oler E., Wang F., Anjum A., Peters H., Dizon R., Sayeeda Z., Tian S. Y., Lee B. L.. et al. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 2022;50(D1):D622–D631. doi: 10.1093/nar/gkab1062.
- Kretschmer F., Harrieder E. M., Hoffmann M. A., Böcker S., Witting M.. RepoRT: a comprehensive repository for small molecule retention times. Nat. Methods. 2024;21:153. doi: 10.1038/s41592-023-02143-z.
- Picache J. A., Rose B. S., Balinski A., Leaptrot K. L., Sherrod S. D., May J. C., McLean J. A.. Collision cross section compendium to annotate and predict multi-omic compound identities. Chem. Sci. 2019;10(4):983–993. doi: 10.1039/C8SC04396E.
- Ross D. H., Cho J. H., Xu L. B.. Breaking Down Structural Diversity for Comprehensive Prediction of Ion-Neutral Collision Cross Sections. Anal. Chem. 2020;92(6):4548–4557. doi: 10.1021/acs.analchem.9b05772.
- Ross D. H., Seguin R. P., Krinsky A. M., Xu L. B.. High-Throughput Measurement and Machine Learning-Based Prediction of Collision Cross Sections for Drugs and Drug Metabolites. J. Am. Soc. Mass Spectrom. 2022;33(6):1061–1072. doi: 10.1021/jasms.2c00111.
- Baker E. S., Hoang C., Uritboonthai W., Heyman H. M., Pratt B., MacCoss M., MacLean B., Plumb R., Aisporna A., Siuzdak G.. METLIN-CCS: an ion mobility spectrometry collision cross section database. Nat. Methods. 2023;20(12):1836–1837. doi: 10.1038/s41592-023-02078-5.
- Colby S. M., Nuñez J. R., Hodas N. O., Corley C. D., Renslow R. R.. Deep Learning to Generate Chemical Property Libraries and Candidate Molecules for Small Molecule Identification in Complex Samples. Anal. Chem. 2020;92(2):1720–1729. doi: 10.1021/acs.analchem.9b02348.
- Plante P. L., Francovic-Fontaine É., May J. C., McLean J. A., Baker E. S., Laviolette F., Marchand M., Corbeil J.. Predicting Ion Mobility Collision Cross-Sections Using a Deep Neural Network: DeepCCS. Anal. Chem. 2019;91(8):5191–5199. doi: 10.1021/acs.analchem.8b05821.
- Guo R. F., Zhang Y. J., Liao Y. X., Yang Q., Xie T., Fan X. Q., Lin Z. L., Chen Y., Lu H. M., Zhang Z. M.. Highly accurate and large-scale collision cross sections prediction with graph neural networks. Commun. Chem. 2023;6(1):139. doi: 10.1038/s42004-023-00939-w.
- Murphy M., Jegelka S., Fraenkel E., Kind T., Healey D., Butler T.. Efficiently predicting high resolution mass spectra with graph neural networks. International Conference on Machine Learning, 2023.
- Köster J., Rahmann S.. Snakemake-a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2522. doi: 10.1093/bioinformatics/bts480.
- Liebisch G., Vizcaino J. A., Kofeler H., Trotzmuller M., Griffiths W. J., Schmitz G., Spener F., Wakelam M. J. O.. Shorthand notation for lipid structures derived from mass spectrometry. J. Lipid Res. 2013;54(6):1523–1530. doi: 10.1194/jlr.M033506.
- Stanstrup J., Neumann S., Vrhovsek U.. PredRet: Prediction of Retention Time by Direct Mapping between Multiple Chromatographic Systems. Anal. Chem. 2015;87(18):9421–9428. doi: 10.1021/acs.analchem.5b02287.
- Aalizadeh R., Alygizakis N. A., Schymanski E. L., Krauss M., Schulze T., Ibáñez M., McEachran A. D., Chao A., Williams A. J., Gago-Ferrero P.. et al. Development and Application of Liquid Chromatographic Retention Time Indices in HRMS-Based Suspect and Nontarget Screening. Anal. Chem. 2021;93(33):11601–11611. doi: 10.1021/acs.analchem.1c02348.
- Liu Y., Yang Y., Chen W. D., Shen F., Xie L. H., Zhang Y. Y., Zhai Y. J., He F. C., Zhu Y. P., Chang C.. DeepRTAlign: toward accurate retention time alignment for large cohort mass spectrometry data analysis. Nat. Commun. 2023;14(1):8188. doi: 10.1038/s41467-023-43909-5.
- Naylor C. N., Nagy G.. Recent advances in high-resolution traveling wave-based ion mobility separations coupled to mass spectrometry. Mass Spectrom. Rev. 2025;44:581. doi: 10.1002/mas.21902.
- Domingo-Almenara X., Guijas C., Billings E., Montenegro-Burke J. R., Uritboonthai W., Aisporna A. E., Chen E., Benton H. P., Siuzdak G.. The METLIN small molecule dataset for machine learning-based retention time prediction. Nat. Commun. 2019;10(1):5811. doi: 10.1038/s41467-019-13680-7.
- Huber C., Muller E., Schulze T., Brack W., Krauss M.. Improving the Screening Analysis of Pesticide Metabolites in Human Biomonitoring by Combining High-Throughput In Vitro Incubation and Automated LC-HRMS Data Processing. Anal. Chem. 2021;93(26):9149–9157. doi: 10.1021/acs.analchem.1c00972.