2024 Aug 14;417(3):473–493. doi: 10.1007/s00216-024-05471-x

Critical review on in silico methods for structural annotation of chemicals detected with LC/HRMS non-targeted screening

Henrik Hupatz 1,2,#, Ida Rahu 1,✉,#, Wei-Chieh Wang 1, Pilleriin Peets 3, Emma H Palm 4, Anneli Kruve 1,2,5
PMCID: PMC11700063  PMID: 39138659

Abstract

Non-targeted screening with liquid chromatography coupled to high-resolution mass spectrometry (LC/HRMS) is increasingly leveraging in silico methods, including machine learning, to obtain candidate structures for structural annotation of LC/HRMS features and their further prioritization. Candidate structures are commonly retrieved based on the tandem mass spectral information either from spectral or structural databases; however, the vast majority of the detected LC/HRMS features remain unannotated, constituting what we refer to as a part of the unknown chemical space. Recently, the exploration of this chemical space has become accessible through generative models. Furthermore, the evaluation of the candidate structures benefits from the complementary empirical analytical information such as retention time, collision cross section values, and ionization type. In this critical review, we provide an overview of the current approaches for retrieving and prioritizing candidate structures. These approaches come with their own set of advantages and limitations, as we showcase in the example of structural annotation of ten known and ten unknown LC/HRMS features. We emphasize that these limitations stem from both experimental and computational considerations. Finally, we highlight three key considerations for the future development of in silico methods.

Supplementary Information

The online version contains supplementary material available at 10.1007/s00216-024-05471-x.

Keywords: Untargeted screening, Suspect screening, Non-targeted analysis, Non-targeted screening, Machine learning, Generative modeling

Introduction

Non-targeted screening (NTS) is a theoretical concept of detecting and identifying a wide range of chemicals in complex samples with minimal prior information, increasingly applied in environmental monitoring. In practice, the identifiable chemicals are restricted by the ones extracted with the sample preparation method, retained and separated in liquid chromatography (LC) and potentially in ion mobility (IM), and detected by high-resolution mass spectrometry (HRMS) (Fig. 1) [1]. The success depends on the data processing workflows: peak detection, alignment, blank correction, etc. [2]. Finally, empirical analytical information can be used to obtain candidate structures for structural annotation and prioritize them, while identification can be achieved by directly comparing the acquired analytical information with that of analytical standards. Underperformance in any of these steps is detrimental to the outcome of NTS. This review particularly focuses on in silico methods for candidate structure retrieval and prioritization.

Fig. 1.

Experimental workflow for analyzing an environmental sample using LC/IM/HRMS experiment with electrospray ionization (ESI). Dark brown indicates experimental analytical features (RT tI, CCS aI, and m/z mI) of an unknown structure, while light brown marks its MS2 features (pI1, pI2, and pI3). CCS values are derived from arrival time distributions (ATD). The schematical table with analytical information will appear in subsequent figures to highlight the in silico structural annotation workflow for LC/HRMS features

NTS commonly detects thousands of LC/HRMS features, i.e., a combination of the accurate mass and retention time (RT), with or without a tandem mass spectrum (MS2). Matching the experimental spectral information with a structure poses a significant challenge and becomes overwhelming if manual data curation is required. Workflows for the structural annotation of the chemicals start with obtaining the candidate structures based on the full scan (MS1) and MS2 either by matching the spectra of the unknown LC/HRMS features with the experimental or in silico spectra in spectral libraries (“Library MS2 spectra matching” and “In silico MS2 spectra matching” sections), automated interpretation of the MS2 spectra (“Structural library matching based on extracted information from MS2 spectra” and “Similar structure search for aiding annotation” sections), or, most recently, generating structures based on the MS2 spectra or matching generated structures to the MS2 spectra (“In silico candidate structure generation” section). However, some candidate structures are more likely than others and can be prioritized based on experimental analytical characteristics (“Complementary features for prioritizing the candidate structures” section). For this purpose, machine learning (ML) methods have recently emerged to streamline the prediction of RT, collision cross section (CCS) values, adduct formation, ionizability, etc., and will be scrutinized in this review. Further, we showcase some of these approaches in the example of structural annotation of LC/HRMS features detected in a wastewater sample (“Experimental structural annotation of LC/HRMS features from a wastewater sample” section) and conclude by providing insights into future developments (“Future perspective” section). 
We encourage the readers to see previously published reviews on implementing analytical methods for NTS [3], chemical space detectable by NTS [4, 5], data processing and quality control workflows for NTS [6, 7], structural annotation from the metabolomics perspective [8, 9], and tools for quantification and toxicity assessment of detected chemicals [10] for a comprehensive overview of the field.

Structural annotation based on MS1 and MS2 data

Library MS2 spectra matching

Community and commercial efforts to assemble libraries of experimental tandem mass spectra enable tentative annotation of the chemicals with the highest confidence (Fig. 2), corresponding to level 2b according to the scale suggested by Schymanski et al. [11]. Some widely used libraries include MassBank Europe (MassBank) [12], MassBank of North America (MoNA) [13], National Institute of Standards and Technology (NIST) [14], METLIN [15], and Global Natural Product Social Molecular Networking (GNPS) [16].

Table 1.

Overview of in silico components (models, algorithms, databases, metrics, and software) discussed in this review and schematically highlighted in Figs. 2, 4, and 5

In silico tool | Examples | Explanation
Spectral MS2 database | MassBank [12], MoNA [13], NIST [14], METLIN [15], GNPS [16] | Database containing experimental MS2 spectra for chemicals measured at different conditions
Spectral matching metric | Cosine similarity [17], Spectral entropy [18], MS2DeepScore [19] | Mathematical metric used to compare the similarity between two MS2 spectra
Software to predict MS2 spectra | MetFrag [20], CFM-ID [21], GrAFF-MS [22] | Algorithm or ML model predicting MS2 spectra for a given chemical structure
Structural database | ZINC, PubChemLite [23], NORMAN SusDat [24] | Database containing chemical structures
Software to predict structural information | BUDDY [25], SIRIUS+CSI:FingerID [26], MIST [27] | Algorithm or ML model predicting structural information such as sum formula or molecular fingerprints from experimental MS2 spectra
Structural matching metric | Tanimoto similarity | Mathematical metric used to compare the similarity of two structures based on their structural fingerprints
Generative model | Mass2SMILES [28], JTVAE [29], Spec2Mol [30], MassGenie [31], MS2Mol [32], MSNovelist [33] | ML model for generating chemical structures corresponding to experimental MS2 spectra
Empirical analytical information prediction | RTI [34], CCSbase [35], AllCCS [36] | ML model to predict a chemical property for an input structure
Empirical analytical information database | METLIN SMRT [37], RepoRT [38], UJI CCS Library [39], Unified CCS Compendium [40], METLIN-CCS [41] | Database containing experimental empirical analytical information of chemicals

Fig. 2.

In silico approaches for retrieving candidate structures, depicted as SMILES (simplified molecular input line entry system) notations, from MS2 spectra. The shown MS2 data are arbitrarily generated and do not correspond to any specific LC/HRMS feature or structure. Brown arrows indicate that candidate structures for the same LC/HRMS feature can be obtained with all four approaches. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently in the following figures

Nevertheless, the efficiency and accuracy of the spectral library matching depend on several factors. Typically, the spectra of the unknown and reference are measured in different laboratories on different instruments with different experimental conditions. Most notably, the collision energy applied for the fragmentation of parent ions affects the extent of fragmentation [42]. Furthermore, recent studies reveal that the mobile phase composition may affect the structure of the parent ion and, consequently, the formed fragments [43, 44]. Moreover, spectral quality [45, 46] and the number of LC/HRMS features for which the MS2 spectra are acquired [45, 47] depend on the data acquisition method. In addition to experimental considerations, the choice of similarity metric significantly influences spectral matching (Fig. 2) [18, 48].

While cosine similarity [17] is commonly used, it may yield high similarity scores even if only one fragment matches; hence, alternative metrics, such as spectral entropy [18] or MS2DeepScore [19], have been suggested. For the spectral entropy approach, the entropy difference between the two MS2 spectra is calculated, whereas MS2DeepScore uses two identical neural networks to predict the structural similarity of unknown and reference chemicals as Tanimoto similarity from the respective MS2 spectra.
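The contrast between the two metrics can be illustrated with a minimal Python sketch. It assumes centroided spectra given as (m/z, intensity) pairs, a simple greedy peak matching within a fixed m/z tolerance, and the unweighted form of the entropy similarity; the published metric [18] includes further refinements that are omitted here. The function names, the tolerance, and the example peaks are illustrative assumptions, not taken from the cited tools.

```python
import math


def align(spec_a, spec_b, tol=0.01):
    """Pair peaks whose m/z differ by at most `tol`; unmatched peaks pair with 0."""
    pairs, used_b = [], set()
    for mz_a, int_a in spec_a:
        match = None
        for j, (mz_b, _) in enumerate(spec_b):
            if j not in used_b and abs(mz_a - mz_b) <= tol:
                match = j
                break
        if match is None:
            pairs.append((int_a, 0.0))
        else:
            used_b.add(match)
            pairs.append((int_a, spec_b[match][1]))
    # Peaks of spec_b that found no partner in spec_a
    pairs.extend((0.0, i_b) for j, (_, i_b) in enumerate(spec_b) if j not in used_b)
    return pairs


def cosine_similarity(spec_a, spec_b, tol=0.01):
    pairs = align(spec_a, spec_b, tol)
    dot = sum(a * b for a, b in pairs)
    norm_a = math.sqrt(sum(a * a for a, _ in pairs))
    norm_b = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shannon_entropy(probabilities):
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def entropy_similarity(spec_a, spec_b, tol=0.01):
    """Unweighted entropy similarity: 1 - (2*S_AB - S_A - S_B) / ln(4)."""
    pairs = align(spec_a, spec_b, tol)
    total_a = sum(a for a, _ in pairs)
    total_b = sum(b for _, b in pairs)
    p_a = [a / total_a for a, _ in pairs]
    p_b = [b / total_b for _, b in pairs]
    mixed = [(x + y) / 2 for x, y in zip(p_a, p_b)]  # 1:1 mixture of both spectra
    gap = 2 * shannon_entropy(mixed) - shannon_entropy(p_a) - shannon_entropy(p_b)
    return 1 - gap / math.log(4)


# Hypothetical spectra sharing only one dominant fragment
query = [(91.054, 100.0), (65.039, 5.0), (119.049, 3.0)]
reference = [(91.054, 100.0), (77.039, 8.0), (105.070, 6.0)]
print(cosine_similarity(query, reference), entropy_similarity(query, reference))
```

In this constructed case the single shared base peak keeps the cosine score close to 1, while the entropy-based score is noticeably lower because the unmatched minor fragments still contribute to the mixed-spectrum entropy.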

The number of chemicals with experimental MS2 spectra in these libraries ranges from thousands to tens of thousands, e.g., MassBank 2023.11 and NIST contain reference spectra corresponding to 8215 and 47,494 chemicals, respectively, determined based on unique hashed International Chemical Identifiers (InChIKeys). However, the chemicals that have not yet been measured or added to the libraries will remain unannotated. Our evaluation of the libraries based on the first 14 characters of the InChIKey revealed that between 1.60% (MassBank) and 6.33% (NIST) of the exposure-relevant chemicals from PubChemLite [23] can be theoretically annotated with library search (Fig. 3). Thus, additional in silico approaches are needed to extend the annotations toward the remaining chemical space.
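A coverage estimate of this kind can be sketched in a few lines of Python. The sketch assumes that both the library and the reference list are available as plain lists of InChIKeys and compares only the first 14 characters (the connectivity block), so that stereoisomers and different charge or protonation layers collapse onto one entry. The keys below are hypothetical placeholders, not real identifiers, and the function names are illustrative.

```python
def first_block(inchikey):
    """Return the first 14 characters (connectivity block) of an InChIKey."""
    return inchikey.strip().upper()[:14]


def library_coverage(library_keys, reference_keys):
    """Fraction of reference chemicals whose first block occurs in the library."""
    library_blocks = {first_block(key) for key in library_keys}
    reference_blocks = {first_block(key) for key in reference_keys}
    if not reference_blocks:
        return 0.0
    return len(library_blocks & reference_blocks) / len(reference_blocks)


# Hypothetical placeholder keys (not real InChIKeys)
spectral_library = [
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "BBBBBBBBBBBBBB-JTQLQIEISA-N",  # one stereoisomer of chemical B
]
exposure_list = [
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",  # same skeleton as B, no stereo layer
    "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
    "DDDDDDDDDDDDDD-UHFFFAOYSA-N",
]
print(f"{library_coverage(spectral_library, exposure_list):.2%}")
```

Matching on the first block is a deliberately permissive choice: it counts a chemical as covered even if only a stereoisomer has a reference spectrum, which suits MS2-based annotation where stereoisomers are rarely distinguishable.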

Fig. 3.

Uniform Manifold Approximation and Projection (UMAP) plots illustrating the chemical space coverage of datasets widely used for LC/HRMS feature annotation (MassBank [12] and SIRIUS [26]) and for training ML models (RTI [34] and CCSBase v1.2 [35]) applied for predicting empirical analytical information used to prioritize candidate structures. The latent space of all relevant chemicals in environmental analysis was learned based on the SIRIUS+CSI:FingerID positive mode fingerprint (3878 bits) calculated from the SMILES representation of 370,167 chemicals in the PubChemLite 0.3.0 dataset. The resulting UMAP embedding was applied to all the datasets (4310 chemicals from MassBank, 21,188 chemicals from SIRIUS+CSI:FingerID positive mode training data, 1426 chemicals from RTI training data, and 4771 chemicals from CCSBase training data). For additional details, refer to Supplementary Information 1 (SI1) Section S6

In silico MS2 spectra matching

Predicting MS2 spectra from a known structure bridges the gap in the availability of reference MS2 spectra (Fig. 3). In silico library matching is based on predicting the MS2 spectra from the structure by employing known fragmentation rules [49], combinatorial fragmentation [50], or competitive fragmentation modeling [51], where the latter two are widely employed by the community through MetFrag [20] and CFM-ID [21], respectively.

Despite the wide usage, in silico approaches have several known shortcomings, e.g., a large number of predicted unlikely fragments reduces the spectral similarity. Recently, Bremer et al. [52] evaluated the performance of CFM-ID 4.0 on spectra from NIST2020 and MoNA. The average dot product was below 700 out of a maximum of 1000 for most chemical classes containing heteroatoms, though the performance improved at higher collision energies where more fragments are observed. However, fine-tuning generic models for a specific class of chemicals with transfer learning has been shown to improve prediction accuracy vastly [53].

Deep learning models have accelerated acquiring fragmentation rules from the fragmentation spectra of known chemicals. For example, Young et al. [54] proposed graph transformers to predict the MS2 spectra, yet this approach predicts binned spectra, leading to a loss of resolution achieved by employing HRMS. To address this limitation, Murphy et al. [22] developed GrAFF-MS, where they introduced a set of molecular formulas of commonly observed fragments and neutral losses. Subsequently, a chemical structure represented as a molecular graph is mapped to the predefined molecular formulas of fragments and neutral losses, thereby preserving the spectral resolution. The quality of the predicted spectra directly impacts the annotation rates observed in environmental LC/HRMS analysis. For instance, Albergamo et al. [55] tentatively identified 884 and 550 of the 3764 and 3845 prioritized LC/HRMS features in a riverbank filtration system with MetFrag in positive and negative electrospray ionization (ESI) modes, respectively. Nevertheless, only 106 and 139 of the annotations yielded high annotation scores and were considered for further verification. Analytical standards of 42 candidate structures were tested, leading to the confirmation of 25 tentative annotations.
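The loss of resolution caused by binning can be made concrete with a small Python sketch. It assumes two fragment ions of the same nominal mass, C7H5O+ (m/z 105.0335) and C8H9+ (m/z 105.0699), and compares unit-width bins with 0.01-wide bins; the bin widths, intensities, and function name are illustrative assumptions rather than settings from the cited models.

```python
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of all peaks falling into the same m/z bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


# Two isobaric fragments that HRMS resolves easily (intensities are made up)
peaks = [(105.0335, 40.0), (105.0699, 60.0)]

unit_bins = bin_spectrum(peaks, bin_width=1.0)    # both peaks share one bin
fine_bins = bin_spectrum(peaks, bin_width=0.01)   # peaks stay separate
print(len(unit_bins), len(fine_bins))
```

With unit-width bins the two fragments merge into a single intensity value, so a model trained on such input cannot tell which elemental composition produced the signal.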

Structural library matching based on extracted information from MS2 spectra

In an alternative approach to in silico fragmentation, the molecular formula and structural information of the candidates can be extracted from structural libraries based on the MS2 spectra (Fig. 2). For instance, BUDDY [25] facilitates molecular formula annotation, while SIRIUS+CSI:FingerID [26] and Metabolite Inference with Spectrum Transformer (MIST) [27] additionally provide structural annotation.

BUDDY uses the fragment and neutral loss pair information from MS2 for formula annotation, focusing on biochemically feasible formulas while allowing the prediction of molecular formulas beyond currently known chemicals. BUDDY annotated 193 and 53 molecular formulas absent from PubChem in plasma and urine samples, respectively, outperforming SIRIUS top 1 formula annotation in all tested datasets. SIRIUS [26] suggests formulas by considering mathematically plausible formulas based on the precursor’s m/z, isotope pattern information from MS1, and fragmentation from MS2. In the process, SIRIUS creates fragmentation trees [56] associating the formulas of fragment ions and neutral losses from MS2 spectra. To enable structural annotation, CSI:FingerID further predicts probabilistic molecular fingerprints based on fragmentation trees (see chemical space coverage for the training set in Fig. 3). Such fingerprints can be matched to a database of chemical structures, expanding the searchable space, for instance, to 100 million chemicals from PubChem.
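The first step of such formula annotation, listing the formulas that are mathematically compatible with the precursor m/z, can be sketched as follows. This is a brute-force illustration under strong assumptions: only C, H, N, and O are considered, the ion is assumed to be [M+H]+, and no isotope pattern, fragmentation tree, or chemical plausibility check is applied, all of which SIRIUS and BUDDY use to rank or reject candidates. Function names, element limits, and the 5 ppm tolerance are illustrative choices.

```python
# Monoisotopic masses (Da) of the most abundant isotopes
MONO_MASS = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}
PROTON_MASS = 1.007276466  # mass added on forming [M+H]+


def format_formula(counts):
    """Turn (C, H, N, O) counts into a string such as 'C8H10N4O2'."""
    return "".join(
        element + (str(count) if count > 1 else "")
        for element, count in zip("CHNO", counts)
        if count
    )


def candidate_formulas(mz, ppm=5.0, max_c=20, max_h=40, max_n=6, max_o=8):
    """Enumerate CHNO formulas whose [M+H]+ m/z lies within `ppm` of `mz`."""
    neutral_mass = mz - PROTON_MASS
    tolerance = mz * ppm * 1e-6
    hits = []
    for c in range(1, max_c + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                remainder = (neutral_mass - c * MONO_MASS["C"]
                             - n * MONO_MASS["N"] - o * MONO_MASS["O"])
                if remainder < -tolerance:
                    break  # adding more oxygen only overshoots further
                h = round(remainder / MONO_MASS["H"])
                error = remainder - h * MONO_MASS["H"]
                if 0 <= h <= max_h and abs(error) <= tolerance:
                    hits.append((format_formula((c, h, n, o)), error / mz * 1e6))
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Example: protonated caffeine (C8H10N4O2), [M+H]+ at m/z 195.08765
for formula, error_ppm in candidate_formulas(195.08765):
    print(f"{formula}\t{error_ppm:+.2f} ppm")
```

Even for this small ion several formulas may pass the mass filter, which is why the additional MS1 isotope and MS2 fragment information described above is needed to select among them.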

Recently, MIST, a CSI:FingerID-inspired molecular fingerprint prediction tool, has been developed. Unlike other methods that rely on expert knowledge, MIST utilizes a deep learning approach and incorporates domain knowledge into its architecture. While the initial publication [27] acknowledges a limitation regarding the dependency of correct structure annotation on accurate chemical formula assignment, the subsequently introduced MIST-CF [57] improved formula annotation. MIST-CF uses an energy-based modeling framework and offers shorter calculation times than CSI:FingerID [58], which becomes important while annotating chemicals with m/z above 800 Da, where SIRIUS calculation can become prohibitively long. SIRIUS [26] has a graphical user interface, while MIST [27] is an open-source Python package. In metabolite annotation, MIST yielded results comparable to SIRIUS; however, its application and evaluation in environmental monitoring remain to be investigated, whereas SIRIUS has already gained widespread usage [59, 60].

Similar structure search for aiding annotation

The end goal of environmental monitoring is to identify risk-posing chemicals; however, if the structural annotation is unfeasible due to a lack of candidate structures in databases, prioritization strategies based on chemical class or structural similarity to known chemicals can be employed. This approach is particularly valuable for annotating metabolites and transformation products, which are often absent from libraries but can be identified by retrieving structures similar to the detected chemicals [61]. Firstly, molecular networks enable linking chemicals based on their MS2 spectra [62]. The GNPS [16] platform facilitates this by connecting MS2 spectra of unknown LC/HRMS features to annotated MS2 spectra from all the samples available on the platform. Zhou et al. [63] increased the efficiency of structural annotation for metabolites by appending molecular networks with known reactions. This approach holds promise for annotating chemicals formed through known environmental transformations, but its application in environmental monitoring requires further exploration.
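The linking step of a molecular network can be sketched in Python as follows, assuming that pairwise MS2 similarity scores have already been computed. Features are connected when their score reaches a threshold, and each connected group is reported together, so that unknown features inherit context from annotated neighbors. The feature names, the scores, and the 0.7 threshold are hypothetical; this is not the GNPS implementation.

```python
from collections import defaultdict


def network_components(similarities, threshold=0.7):
    """Group features linked by MS2 similarity scores >= threshold."""
    graph = defaultdict(set)
    nodes = set()
    for (a, b), score in similarities.items():
        nodes.update((a, b))
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)

    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical pairwise scores between two unknowns, one annotated reference,
# and one unrelated feature
scores = {
    ("unknown_1", "reference_A"): 0.85,
    ("unknown_1", "unknown_2"): 0.75,
    ("unknown_3", "reference_A"): 0.20,
    ("unknown_2", "unknown_3"): 0.10,
}
for component in network_components(scores):
    print(sorted(component))
```

Here unknown_2 never matches reference_A directly but ends up in the same group via unknown_1, which is the property that makes networks useful for transformation products.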

Secondly, MS2Query [64] enables the search for analog structures by matching experimental MS2 data with library MS2 spectra of similar structures using MS2DeepScore [19]. Similarly, Qemistree [65] utilizes molecular fingerprints predicted with SIRIUS+CSI:FingerID to obtain chemical trees, connecting chemicals with similar structures.

Lastly, ClassyFire [66], integrated into SIRIUS software, allows automated assignment of chemical taxonomy on up to 11 levels (Kingdom, SuperClass, Class, SubClass, etc.). Furthermore, SIRIUS enables describing detected chemicals with a CANOPUS vector, where each bit corresponds to a chemical class. These chemical classes can be used for a general description of the sample. For example, in the Earth Microbiome Project [67], ClassyFire pathway classes were used to differentiate the composition of biomes, while Sha et al. [68] tested ClassyFire’s suitability on per- and poly-fluorinated substances (PFASs) categorization, and Aurich et al. [69] grouped chemicals associated with the exposome. Spec2Class [70], on the other hand, predicts chemical superclasses directly from MS2 spectra, though it is currently limited to classes of plant secondary metabolites.

The approaches covered in the “Library MS2 spectra matching” to “Similar structure search for aiding annotation” sections provide numerous options to annotate LC/HRMS features with structures present in spectral or structural libraries or with structures similar to these. Nonetheless, a considerable number of LC/HRMS features in environmental samples remain unannotated [4, 32]. One of the reasons may be that novel structures, absent from libraries, remain inaccessible with these tools. We will refer to these experimental LC/HRMS features without reliable structural annotation as the “unknown chemical space.”

In silico candidate structure generation

Exploring the unknown chemical space has recently become possible with the advent of generative models (GMs) in chemistry [71]. Since 2018, GMs have leveraged deep neural networks to learn the structure-to-property distribution of large chemical datasets [72]. By treating molecular representations, such as SMILES, as a language, ML concepts from natural language processing can be adapted for property-guided in silico structure generation [71, 72]. Deep GMs, including recurrent neural networks (RNN), variational autoencoders (VAE), or transformers, have the potential to incorporate empirical analytical information from observed LC/HRMS features to guide candidate structure generation. The main obstacle to successfully employing deep GMs is the need for a sufficiently large training dataset. Usually, over 500,000 unique chemicals are considered necessary for training [33, 73], far exceeding the data available in MS2 repositories. Next, the recent proof-of-concept studies that have explored approaches to circumvent this problem will be discussed (Fig. 4).
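Treating SMILES as a language starts with splitting each string into tokens that a sequence model can consume. The following Python sketch uses a simplified regular expression for this purpose; the pattern is an assumption made for illustration and does not reproduce the tokenizer of any model discussed here. Bracket atoms, two-letter halogens, and two-digit ring closures are kept as single tokens.

```python
import re

# Simplified SMILES token pattern (illustrative, not exhaustive)
TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [C@@H] or [NH4+]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|%\d{2}"               # two-digit ring closures
    r"|[A-Za-z]"             # single-letter atoms, including aromatic ones
    r"|\d"                   # single-digit ring closures
    r"|[=#$:/\\\-+().@~*]"   # bonds, branches, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Untokenizable characters in {smiles!r}")
    return tokens


def build_vocabulary(smiles_list):
    """Map every token seen in the corpus to an integer index."""
    tokens = sorted({tok for smi in smiles_list for tok in tokenize(smi)})
    return {tok: idx for idx, tok in enumerate(tokens)}


corpus = ["CC(=O)Cl", "C[C@@H](N)C(=O)O", "c1ccccc1Br"]
vocabulary = build_vocabulary(corpus)
encoded = [[vocabulary[tok] for tok in tokenize(smi)] for smi in corpus]
print(tokenize("CC(=O)Cl"), len(vocabulary))
```

The integer sequences in `encoded` are the kind of input an RNN or transformer decoder would be trained on, with the analytical information supplied as the conditioning signal.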

Fig. 4.

Training strategies employed by various GMs developed for candidate structure generation based on HRMS data, addressing the sparsity of training data. MassGenie [31] and MS2Mol [32] (blue) employ in silico and experimental databases for training. MSNovelist [33] (brown) is trained on the molecular fingerprints of chemicals from structural databases. The decoders of Spec2Mol [30] and JTVAE [29] (violet) are pre-trained on SMILES-to-SMILES translation. Mass2SMILES [28] (red) utilizes only experimental databases. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently throughout the figures of the manuscript

Darkchem [74] represents the pioneering effort in utilizing deep GM to generate structures from LC/HRMS features. Developed by the Renslow group, Darkchem employs a VAE that takes m/z and CCS values as input. Transfer learning techniques were applied to address the sparse training data issue, enabling the GM to be utilized for analog search. While this method does not produce entirely novel structures and does not consider MS2 data, it allows the generation of structurally similar chemicals with specified properties.

In general, deep GMs trained directly on experimental MS2 spectra have shown limited success on validation datasets (Fig. 4) [28–30]. For instance, Mass2SMILES [28], a transformer GM, correctly annotated 1% of validation structures and yielded structures with a Tanimoto similarity score above 0.9 for 2% of cases. Spec2Mol [30] consists of a decoder trained in SMILES-to-SMILES translation on 135 million chemicals from PubChem and ZINC-12, coupled with an encoder trained on MS2 spectra for over 30,000 chemicals from NIST2020 (Fig. 4). On the CASMI2017 challenge dataset, Spec2Mol yielded correct structures for 7% of MS2 spectra, compared to 67% for SIRIUS [26]. However, in cases where SIRIUS provided incorrect structures, on average, Spec2Mol suggested candidate structures with higher fingerprint similarity to the correct structure.

Data sparsity can be overcome by training GMs on experimental and in silico generated MS2 spectra (Fig. 4), as demonstrated in MassGenie [31]. For training, it exploits in silico binned MS2 of chemicals from the ZINC library generated by a method based on MetFrag [23] and correctly predicted 53% of the structures with m/z below 500 Da in the CASMI2017 challenge dataset. MS2Mol [32] employs a similar approach, utilizing a transformer trained on experimental spectra for 50,083 chemicals, augmented with in silico spectra predicted with CFM-ID [21] for chemicals of the LOTUS dataset. Notably, MS2Mol uses tokenized MS2 spectra as input, retaining high-resolution information from MS2 compared to binned spectra. Although CSI:FingerID outperformed MS2Mol on established datasets such as CASMI2022, evaluating GMs solely on known chemical space does not adequately test their ability to generate unknown structures and leads to the overestimation of the performance of the models trained on familiar data. To mitigate this, the authors of MS2Mol evaluated its performance on an EnvedaDark dataset of unknown structures. MS2Mol (21.4%) outperformed both CSI:FingerID (11.0%) and library search (7.2%) on close matches (Tanimoto similarity > 0.675). These findings highlight the challenges of providing a suitable validation set for GMs.
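The close-match criterion used in such evaluations can be sketched in Python as follows. Fingerprints are represented here as sets of "on" bit indices, and a prediction counts as a close match when its Tanimoto similarity to the true structure exceeds 0.675, the threshold quoted above. The fingerprints are made-up bit sets rather than output of a cheminformatics toolkit, and returning 0 for two empty fingerprints is a convention chosen for this sketch.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def close_match_rate(pairs, threshold=0.675):
    """Fraction of (predicted, true) fingerprint pairs above the threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for predicted, true in pairs if tanimoto(predicted, true) > threshold)
    return hits / len(pairs)


# Hypothetical fingerprints: (predicted structure, true structure)
evaluation_pairs = [
    ({1, 2, 3, 4}, {2, 3, 4, 5}),   # 3 shared / 5 total bits
    ({1, 2, 3}, {1, 2, 3, 4}),      # 3 shared / 4 total bits
]
print(close_match_rate(evaluation_pairs))
```

Because the score depends on the fingerprint type and length, close-match rates from different studies are only comparable when the same fingerprint definition is used.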

MSNovelist [33] leverages molecular fingerprints calculated from MS2 spectra with CSI:FingerID [58] and assembles them into candidate structures using an RNN. This workaround enabled training MSNovelist on 1,000,000 fingerprints of known structures without the need for MS2 spectra matching (Fig. 4). However, this method’s effectiveness depends on the quality of the fingerprints obtained with CSI:FingerID. While MSNovelist generated 45% correct structures for a GNPS-based test set compared to 75% with fingerprint database matching, it proposed new metabolites for a bryophyte dataset, outperforming the best database candidates.

Employing an RNN, DarkNPS adopts a more targeted approach to construct a structural library of novel psychoactive chemicals [53, 75]. The GM is trained to generate structures similar to known psychoactive chemicals, and the outputted structures are prioritized based on experimentally observed MS1 and MS2, leveraging in silico MS2 spectra generated with CFM-ID [21]. However, no mass spectrometric data is involved in the model training. This method exhibits promising results within a suspected chemical space, yet limitations remain in its applicability for unknown chemical space.

Due to the novelty of the field and limited code availability, applying GMs to environmental samples remains unexplored. Additionally, all deep GMs described require substantial computational resources for training, though employing pre-trained models for structure generation requires less computational effort. Furthermore, evaluating the performance of GMs is challenging, necessitating benchmarking datasets of unknown structures and posing difficulties in comparing methods due to differing applicability domains.

Complementary features for prioritizing the candidate structures

The methods discussed earlier utilize MS1 and MS2 data to obtain a candidate list. Consequently, complementary experimental information is essential to deprioritize unlikely and prioritize more plausible candidate structures. Empirical analytical information, such as RT from LC and CCS from IM measurements, is commonly used for this purpose. Moreover, insights into ion formation via ESI and sample-related details prove invaluable. While this information can serve as expert knowledge for prioritization, such as considering seasonal disease risks to identify potential chemicals in municipal wastewater [76] and thereby selecting more probable candidates, relying solely on this approach is time-consuming and fraught with challenges. Therefore, in silico methods for removing false positives from the candidate list have emerged.

These methods operate under the assumption that the complementary information available for prioritization is determined by the structure of the chemical and, consequently, its physicochemical properties. For example, the carbon chain length of PFASs affects the RT [77]; thus, the relationship between RTs and structures of PFASs can provide support for identification within a group of homologs. In silico techniques, including ML models, are developed to uncover the relationships between structural and empirical information (Fig. 5). Afterward, these models are employed to predict empirical analytical properties for each candidate structure, and poor agreement with experimental data suggests lower plausibility of the candidate structures.

Fig. 5.

Computational workflow illustrating the training process of an empirical analytical information (EAI) prediction model using RT as an example. The model is trained by utilizing molecular fingerprints and/or descriptors, followed by empirical analytical information prediction for candidate structures. The brown arrow indicates that retention times can be predicted for each candidate structure. Circled icons represent in silico components for structural annotation and prioritization; examples of these are shown in Table 1. Dark green highlights the major step of each approach, and all icons are used consistently throughout the figures of the manuscript

The potential of in silico approaches was demonstrated by Song et al. [78], who developed an RT prediction model for contaminants of emerging concern and applied it in NTS of wastewater. The group annotated 719 LC/HRMS features with varying confidence levels through comparison with reference standards and matched the obtained MS2 spectra with experimental and in silico MS2 spectra from libraries. Subsequently, the developed RT model was employed to improve the confidence of 234 LC/HRMS features. The annotation confidence of 153 out of these features could be enhanced by leveraging the predicted RT values. In another study by Bijlsma et al. [79], the number of candidate structures was reduced by between 43 and 66% by applying predicted RT and CCS values. In the following sections, we will delve into the in silico methods and provide details of their applicability, potential, and practical implementation in NTS workflows.

Leveraging retention time in NTS workflows

RT stands as the most prevalent complementary information employed within NTS workflows. This is because even slight structural differences can result in notable variations in RT, aiding in the differentiation between isobaric and isomeric chemicals. A straightforward approach would involve comparing the RT obtained experimentally via LC/HRMS with the RT of candidate structures stored in databases. One of the most widely known and largest datasets of RT values is the METLIN small molecule retention time (SMRT) dataset [37], comprising information about 80,038 small molecules analyzed on reversed-phase LC. However, this approach encounters challenges due to the wide variety of chromatographic conditions (column, mobile phase, gradient, etc.) used, making it practically impossible to obtain exhaustive RT libraries. Consequently, in silico models have been developed to predict RTs.

These models utilize chromatographic conditions [80], physicochemical properties [81], molecular descriptors [82–84], molecular fingerprints [37], or their combination [85] as input and leverage various ML approaches, such as generalized additive models [80, 86], multiple linear regression, support vector machines, random forest, gradient boosting, and artificial neural networks [87–92]. However, their applicability is often hindered by the variability in RTs across different chromatographic systems. To mitigate this challenge, models have been developed to project known RTs from one chromatographic condition onto another [93]. Another hurdle in developing robust and generalizable models is the scarcity of experimental datasets required for model training and the insufficient overlap of the chemical space between the training data and the applicability domain. For instance, the RT models trained on metabolites will likely be less accurate for PFASs. To address this issue, the RepoRT [38] data repository has been recently established. This extensive metadata-rich collection comprises 373 datasets containing 88,325 RT values for 8809 unique chemicals measured using 49 different chromatographic columns, in addition to the METLIN SMRT dataset. Moreover, innovative methods integrating deep learning and transfer learning have emerged to tackle the constraints of RT models to specific chromatographic conditions and the availability of pertinent data [94–98].

Besides the models that directly predict the RT values, the retention time index (RTI) system [34], with RTI values ranging from 1 to 1000 and based on the elution patterns of calibrants, has been proposed to harmonize RTs across LC studies. Unlike gas chromatography, where RTI usage is well-established, its adoption in LC has been hindered by the diverse range of chemicals and properties affecting separation [34]. However, the external validation of the RTI system and its accessibility through the University of Athens platform (http://rti.chem.uoa.gr/) for RTI calculations, coupled with insights into uncertainty and applicability domain (see Fig. 3 for coverage of chemical space), have gained considerable recognition within the scientific community.
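How calibrants can place RTs from different systems on a common index scale is sketched below in Python. The sketch assumes piecewise-linear interpolation between the two calibrants that bracket a measured RT; this interpolation scheme, the calibrant RTs, and the assigned index values are assumptions made for illustration and are not taken from the RTI platform [34].

```python
from bisect import bisect_right


def rt_to_rti(rt, calibrants):
    """Convert an RT to an index value by interpolating between calibrants.

    `calibrants` is a list of (rt, rti) pairs sorted by increasing RT.
    """
    rts = [cal_rt for cal_rt, _ in calibrants]
    if not rts[0] <= rt <= rts[-1]:
        raise ValueError("RT lies outside the calibrated range")
    upper = min(bisect_right(rts, rt), len(rts) - 1)
    (rt_low, rti_low), (rt_high, rti_high) = calibrants[upper - 1], calibrants[upper]
    return rti_low + (rt - rt_low) * (rti_high - rti_low) / (rt_high - rt_low)


# Hypothetical calibrant RTs (min) on two LC systems, same index assignments
system_a = [(1.0, 1.0), (5.0, 400.0), (12.0, 1000.0)]
system_b = [(2.0, 1.0), (9.0, 400.0), (20.0, 1000.0)]

# The same chemical elutes at different RTs on the two systems
print(rt_to_rti(3.0, system_a), rt_to_rti(5.5, system_b))
```

Although the raw RTs differ by several minutes, both measurements map to nearly the same index value, which is what allows predictions or library values expressed as RTI to be compared across laboratories.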

Despite significant advancements and the development of hundreds of models for predicting RTs of small molecules, their accuracy remains imperfect. Moreover, there are cases where accurate RT prediction is particularly challenging, especially in the presence of what we term “RT cliffs”, where molecules with very similar structures exhibit notably different RTs (e.g., cis- and trans-isomers). Since models rely on structural information, capturing subtle structural variations that influence RTs can be exceedingly difficult, if not impossible, using conventional input data such as fingerprints or molecular descriptors. This highlights the ongoing need for research and innovation to enhance the accuracy and reliability of RT prediction models, particularly in complex chromatographic systems.

Leveraging ion mobility in NTS workflows

IM separates ions based on size, shape, and charge as they travel through a buffer gas in an electric field [99]. This provides an additional dimension of separation and allows deconvoluting MS2 spectra in data-independent acquisition (DIA) where coeluting chemicals within a specific m/z range would otherwise be fragmented together, leading to complex MS2 spectra [100]. Furthermore, it can facilitate the separation of isomeric and isobaric chemicals [101], and CCS values, derived from the drift time of the chemical, can increase structural annotation confidence. The latter can be achieved by matching experimental CCS values with either known CCS values of candidate structures from databases or with predicted values [39]. For library matching, databases such as the UJI CCS Library [39], the Unified CCS Compendium [40], and the most recent METLIN-CCS [41] contain CCS values for over 27,000 structures.

Predicted CCS values can be obtained through computational modeling or ML techniques. Modeling relies on computing the lowest energy conformer of the detected ions followed by CCS determination with different algorithms [102], such as IMoS or MobCal [103]. Conversely, ML approaches, such as CCSbase [35] and AllCCS [36], rely on molecular descriptors [36, 83, 92]. However, including m/z as one of the most significant descriptors in these models reduces their orthogonality to MS1 [36], leading to varying levels of success in NTS applications. Although the efficiency largely depends on the LC/HRMS feature, Asef et al. [104] found that applying predicted CCS values reduced the candidate list by an average of 28%. The low efficiency in deprioritizing candidate structures is mainly attributed to the accuracy of predicted and measured CCS values. Typically, predicted and experimental CCS values are considered to match within a difference of 3% [104, 105], while the measured CCS values of isomeric chemicals can differ by less than 1% [106]. This is also highlighted by Akhlaqi et al. [107], who found that the confidence interval of CCS values needs to reach 0.04–0.15% depending on the dataset to reduce the number of candidate structures by 95%. Additionally, predicted CCS values for structures out of the ML model’s applicability domain (see Fig. 3 for coverage of chemical space of commonly applied CCSbase as an example) may increase the uncertainty.
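
The 3% matching criterion can be sketched as a simple filter over a candidate list. This is a minimal sketch, not the implementation of any specific tool: the function names and candidate values are hypothetical, and computing the difference relative to the experimental value is an assumption.

```python
def ccs_relative_difference(predicted, experimental):
    """Relative difference (%) between predicted and experimental CCS (Å²)."""
    return abs(predicted - experimental) / experimental * 100.0

def filter_by_ccs(candidates, experimental_ccs, tolerance_pct=3.0):
    """Keep candidates whose predicted CCS lies within the tolerance.

    Candidates without a prediction (None) are returned separately, since a
    model may fail for structures outside its applicability domain.
    """
    kept, unpredicted = [], []
    for name, predicted in candidates:
        if predicted is None:
            unpredicted.append(name)
        elif ccs_relative_difference(predicted, experimental_ccs) < tolerance_pct:
            kept.append(name)
    return kept, unpredicted

# Hypothetical candidate list: (identifier, predicted CCS in Å²).
candidates = [("cand_A", 150.0), ("cand_B", 158.0), ("cand_C", 146.0), ("cand_D", None)]
kept, unpredicted = filter_by_ccs(candidates, experimental_ccs=150.0)
print(kept, unpredicted)
```

With a 3% window, two isomers whose CCS values differ by less than 1% would both pass the filter, which illustrates why this criterion removes only a modest share of candidates.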

Additional experimental information for prioritization in NTS

Although RT and CCS values are already used for evaluating candidate structures, confidence can be increased by considering further analytical information. For instance, the ion type observed in ESI/HRMS spectra can be highly indicative of the functional groups and their positions in the detected chemicals (for an experimental case study, see Fig. 6). For example, deprotonated molecules are observed in ESI negative mode for chemicals with acidic functionality, whereas protonated molecules are observed in positive mode for weak and strong bases. Thus, Lowe et al. [108] proposed a concept of ESI/HRMS amenability leveraging the probability of observing [M+H]+ or [M−H]− for specific candidate structures. Our group has shown that the ionization efficiency of both ion types can be predicted from the chemical structure [109], suggesting the possibility of using the relative response in these two modes.

Fig. 6.

Three chemicals sharing the same molecular formula (C10H10O4) can exhibit distinct retention times and be detectable with different ESI modes, influenced by their polarity and acid–base properties. A The peak corresponding to dimethyl phthalate (violet) is magnified by a factor of 10 for enhanced visibility of other chromatographic peaks. B Additionally, adduct formation and in-source fragmentation may offer supplementary insights into the localization of functional groups

Moreover, sodium, potassium, and ammonium adducts are frequently observed in ESI positive mode, and the formation probability of specific adducts may become beneficial in the structural annotation. Our group [110], as well as Broeckling et al. [111], have investigated the possibility of predicting adduct formation through classification and regression, respectively. Additionally, physicochemical properties, e.g., equilibrium partition ratios between organic solvents and water, have been utilized to differentiate isomers. Abrahamsson et al. [112] used peak intensities in water and eight different organic solvent systems to predict molecular fingerprints, enabling database searches for chemical structure.

In addition to analytical techniques commonly coupled to HRMS, spectroscopic techniques, such as gas phase infrared spectroscopy, can provide complementary information and improve the annotation accuracy [113]. In combination with MS2 and domain knowledge, this technique has recently enabled the identification of organic micropollutants in wastewater [114]. Nevertheless, access to such complementary tools is limited, as they require non-standard instrumentation.

Experimental structural annotation of LC/HRMS features from a wastewater sample

To illustrate the NTS workflows, we investigated structural annotation and prioritization of 10 spiked chemicals treated as unknowns, alongside 10 truly unknown experimental LC/HRMS features from the wastewater sample. We conducted experimental and in silico spectra matching with MassBank [12] and MetFrag [20], as well as employed SIRIUS+CSI:FingerID [26] and Spec2Mol [30] to retrieve candidate structures from MS2 data (for experimental and data processing details, see SI1 Sections S2–S4). We considered the top 10 candidate structures per LC/HRMS feature from each of the four methods, excluding the chemically invalid structures, resulting in 515 unique candidate structures, with 253 for spiked chemicals and 262 for truly unknown features. The candidates obtained for three randomly selected spiked and three randomly selected unknown LC/HRMS features and their following prioritization are visualized schematically in Fig. 7A (for other features, see SI1 Section S8). Additionally, the structures of all the candidates can be found in the SI1 Section S9.

Fig. 7.

Visualization of the structural annotation and candidate structure prioritization results for the six LC/HRMS features out of the 20 LC/HRMS features studied (remaining features are provided in the SI1 Section S8). A The number of candidate structures obtained from experimental and in silico spectra matching with MassBank and MetFrag, and by employing SIRIUS+CSI:FingerID and Spec2Mol. Each candidate structure is represented by a colored circle, with the order indicating its rank within the annotation approach. Dual-colored circles represent candidate structures suggested by two methods. The middle panel illustrates the number of candidate structures prioritized based on predicted RT obtained by utilizing the RTI model and CCS obtained by employing the CCSbase model. For features corresponding to the spiked chemicals, the correct structure is highlighted with a dark blue exclamation mark. B Visualization of the candidate structures in the chemical space using the UMAP embedding of PubChemLite (Fig. 3). All points are transparent, resulting in a darker color when data points are overlaid

Employing the standard practice in evaluating the performance of NTS workflows, we first assessed the applicability and effectiveness of the methods on the annotation of spiked chemicals (Table 2). MassBank accurately annotated all spiked chemicals within the top 3 candidates, while SIRIUS+CSI:FingerID correctly annotated half of the features (always as top 1). In the other five cases, SIRIUS+CSI:FingerID also annotated the sum formulas incorrectly. For example, as seen in Fig. 7A, for the spiked feature with m/z 163.04, dimethyl phthalate, detected as an in-source fragment in MS1, was correctly annotated by matching with MassBank, but not by SIRIUS+CSI:FingerID. This underscores the advantage of parent mass independent library matching, which is becoming beneficial in all-ion fragmentation (AIF) or DIA analyses. Moreover, neither MetFrag nor Spec2Mol suggested correct structures within the top 10 candidates for any of the spiked chemicals. The underperformance of Spec2Mol may be attributed to the fact that the authors observed the best results when utilizing MS2 data of both positive and negative ionization modes [30], but only positive mode data was considered here. However, evaluating and comparing structural annotation methods using only spiked chemicals can lead to biased conclusions, as these chemicals originate from the known chemical space within databases and training sets used by the models/tools. Thus, we suggest evaluating NTS workflows additionally on unknown features, although, due to the absence of ground truth, this evaluation requires different approaches or additional analytical experiments, such as isolation and NMR analysis.

Table 2.

Evaluation of the structural annotation and candidate prioritization of spiked chemicals

1 Ratio of wrong candidate structures prioritized from all false candidates for one LC/HRMS feature

Furthermore, we analyzed the candidates obtained from the four annotation tools by comparing the average Tanimoto similarity scores (Fig. 8A), unveiling that all methods propose structurally similar candidates within one LC/HRMS feature. Moreover, when scrutinizing the similarities of candidates suggested by different methods for the same feature, it becomes apparent that employing SIRIUS+CSI:FingerID and library matching with MassBank yield similar candidates. This aligns well with the observed good performance of both methods on the spiked chemicals. Candidate structures between different LC/HRMS features vary most for SIRIUS+CSI:FingerID. In contrast, the chemically valid structures generated with Spec2Mol exhibit general similarity across all the LC/HRMS features. Comparable observations can be deduced from the visualizations, where each candidate structure per individual LC/HRMS feature is mapped to the chemical space (Fig. 7B and SI1 Section S8). These findings suggest that the candidate structures generated by Spec2Mol overlap with the known chemical space, which could imply that the experimental LC/HRMS features originate from the known chemical space. Alternatively, this overlap may highlight the limitations in the applicability of current GMs, as discussed in the “In silico candidate structure generation” section.
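
The average pairwise Tanimoto similarity within one LC/HRMS feature can be sketched as follows. This is a minimal illustration in which fingerprints are represented as plain sets of on-bit indices; the fingerprint type, the example bit sets, and the function names are assumptions and not necessarily those used for Fig. 8A.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto similarity over all unordered candidate pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return None  # fewer than two candidates: similarity is undefined
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Hypothetical on-bit sets for three candidates of one LC/HRMS feature.
fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 7, 8}]
score = mean_pairwise_tanimoto(fps)
print(round(score, 3))
```

The same function applied to the union of candidates from two methods, restricted to cross-method pairs, would give a between-method similarity of the kind discussed above.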

Fig. 8.

A Heatmap illustrating the structural similarities among candidates suggested by four methods employed for the structural annotation of 20 LC/HRMS features (SI1 Section S7). Each small colored square represents the similarity of candidate structures, calculated as the average of all pairwise Tanimoto similarities between all the suggested candidates within one LC/HRMS feature. The LC/HRMS features are sorted based on their m/z values. Brown indicates higher similarity, while green indicates lower similarity among candidate structures (the white midpoint of the colorbar (0.22) denotes the average similarity across all the suggested candidate structures, calculated as the mean using all the pairwise Tanimoto similarities of candidates). Light blue indicates that a specific LC/HRMS feature did not yield any candidate structures from a particular method. Numbers inside the larger squares represent the overall average similarity scores within or between the methods. B Experimentally obtained retention time (RT) values for 20 LC/HRMS features plotted against the predicted RT values from the RTI model for all candidate structures corresponding to each LC/HRMS feature. Candidates prioritized using the cutoff criterion of ± 2 standard residuals are highlighted in light brown. C Experimentally obtained CCS values for 20 LC/HRMS features plotted against the predicted CCS from the CCSbase model for all candidate structures corresponding to each LC/HRMS feature. Candidates prioritized using the criterion of difference between predicted and experimental values less than 3% are highlighted in light turquoise

For prioritizing candidate structures, we utilized predicted RT and CCS values by employing RTI [34] with the cutoff criterion of ± 2 standard residuals and CCSbase [35] with the difference between predicted and experimental CCS below 3% (see SI1 Section S5 for details). Out of 515 candidates, 255 (depicted in light brown in Fig. 8B) were prioritized based on RT, and 208 (depicted in light turquoise in Fig. 8C) based on CCS (Table 2). However, the significant number of remaining candidates underscores the inefficiency in prioritization due to the high uncertainty in RT and CCS prediction, highlighting the need for further advancements in in silico methods. Additionally, the RTI model was unable to predict RT for two candidate structures and CCSbase for four structures, indicating the limitations in applicability, possibly stemming from the insufficient coverage of chemical space by the data used for model training (Fig. 3). For instance, the RTI model’s inability to predict the RT for a boron-containing candidate can be attributed to the absence of boron chemicals in the training data. Applying both methods for prioritization resulted in 66 candidates for 9 spiked chemicals and 43 for 8 truly unknown LC/HRMS features. Consequently, for one spiked chemical and two truly unknown features, all candidate structures were deprioritized (Fig. 7A, SI1 Section S8). Furthermore, the previously mentioned example of a spiked feature with m/z 163.04, an in-source fragment of dimethyl phthalate, highlights the challenges faced by in silico methods in accurately prioritizing candidates for unknown structures observed as unusual fragments or adducts. Here, the correct structure was deprioritized due to a mismatch in CCS values, as the experimental CCS corresponded to the in-source fragment rather than the parent ion of the correct candidate structure. 
Similarly, the correct structures for the spiked features m/z 284.14, 297.06, and 748.49 were deprioritized due to discrepancies in CCS for the first and mismatching RT for the latter two (Table 2 and SI1 Section S8). This underscores that the high measurement and prediction uncertainty of current methods impedes reliable candidate prioritization.
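
The combined use of both criteria can be sketched as below. This is a simplified illustration under stated assumptions: the residual standard deviation is taken as a given model parameter, candidates lacking a prediction are simply skipped, and all names and values are hypothetical (the actual procedure is described in SI1 Section S5).

```python
def standardized_residual(experimental_rt, predicted_rt, residual_sd):
    """Residual scaled by the standard deviation of the model's residuals."""
    return (experimental_rt - predicted_rt) / residual_sd

def prioritize(candidates, experimental_rt, experimental_ccs,
               residual_sd, rt_cutoff=2.0, ccs_tolerance_pct=3.0):
    """Return names of candidates passing both the RT and the CCS criterion."""
    kept = []
    for name, predicted_rt, predicted_ccs in candidates:
        if predicted_rt is None or predicted_ccs is None:
            continue  # no prediction available, so the candidate cannot be checked
        z = standardized_residual(experimental_rt, predicted_rt, residual_sd)
        ccs_diff = abs(predicted_ccs - experimental_ccs) / experimental_ccs * 100.0
        if abs(z) <= rt_cutoff and ccs_diff < ccs_tolerance_pct:
            kept.append(name)
    return kept

# Hypothetical candidates: (identifier, predicted RT in min, predicted CCS in Å²).
candidates = [
    ("cand_A", 6.1, 151.0),   # passes both criteria
    ("cand_B", 9.5, 150.5),   # fails the RT criterion
    ("cand_C", 6.3, 160.0),   # fails the CCS criterion
    ("cand_D", None, 150.0),  # RT model gave no prediction
]
kept = prioritize(candidates, experimental_rt=6.0,
                  experimental_ccs=150.0, residual_sd=1.0)
print(kept)
```

A filter of this kind compares predictions for the intact candidate with measurements of whatever ion was detected, so a correct structure observed as an in-source fragment can be removed, as in the dimethyl phthalate example.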

Future perspective

Despite significant advancements in in silico annotation tools in recent years, further improvements are required to increase the annotation rates, accuracy, and reproducibility. Below, we outline three main considerations that arose from compiling this review.

Chemical space coverage

The annotation efficiency of the spectral library and in silico spectra matching is determined by the extent to which the chemical space of the employed database overlaps with the chemical space of the sample. Moreover, the accuracy of ML models hinges on the alignment of the sample’s chemical space with the models’ training set [115]. However, achieving comprehensive coverage across a vast spectrum of environmentally relevant chemicals is challenging due to variations in sample sources (e.g., water vs soil, residential area vs industrial dumping site). Additionally, empirical analytical information, such as retention time [38, 116], strongly depends on the analytical method used for NTS. Consequently, predictions often require calibration or transfer to the system used for analysis. Yet, applicability is ambiguous if the conditions differ significantly from those of the training data. Unfortunately, most tools do not explicitly indicate the similarity between candidate structures and training data, making it challenging to assess whether unknown or tentatively identified chemicals fall within the ML models’ applicability domain. Thus, evaluating the overlap of candidate structures’ chemical space with the training set or database can offer insights into structural annotation accuracy.
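
One simple way to flag candidates that may lie outside a model's applicability domain is to report their highest Tanimoto similarity to any training structure. The sketch below is an illustration only: fingerprints are plain sets of on-bit indices, and the threshold of 0.5 and all names are hypothetical choices rather than an established standard.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def max_similarity_to_training(candidate_fp, training_fps):
    """Highest similarity between a candidate and any training structure."""
    return max((tanimoto(candidate_fp, fp) for fp in training_fps), default=0.0)

def in_applicability_domain(candidate_fp, training_fps, threshold=0.5):
    """Flag a candidate as covered if it resembles at least one training structure."""
    return max_similarity_to_training(candidate_fp, training_fps) >= threshold

# Hypothetical training-set and candidate fingerprints.
training_fps = [{1, 2, 3, 4}, {10, 11, 12}]
close_candidate = {1, 2, 3, 5}   # shares most bits with the first training structure
distant_candidate = {20, 21}     # shares no bits with the training set

print(in_applicability_domain(close_candidate, training_fps))
print(in_applicability_domain(distant_candidate, training_fps))
```

Reporting the nearest-neighbor similarity alongside each prediction would let users judge, for example, whether a prediction for a boron-containing candidate rests on any comparable training chemicals.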

Model evaluation

Most ML models are evaluated based on conventional training metrics, such as root mean square error for regression tasks. Nonetheless, these metrics may not precisely measure the efficiency in prioritizing candidate structures [104, 107]. Namely, candidate structures obtained from one MS2 spectrum often share the same chemical formula and highly similar structure. Hence, we encourage the developers to invest additional effort in customizing evaluation metrics, such as integrating uncertainty, to suit specific use cases and to incorporate them when benchmarking the models. Furthermore, curated datasets separate from those used for model training should be employed for evaluation. The latter is particularly challenging for MS2 annotation strategies, as most public libraries have been utilized, directly or indirectly, in developing and optimizing currently available annotation approaches. In this matter, EnvedaDark [32] is the first such library; however, careful measures are necessary to prevent the leakage of such libraries into model training.
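
A use-case-oriented metric can be as simple as reporting, per LC/HRMS feature of known identity, whether the correct structure survives a prioritization step and what share of the false candidates survives with it. The following is a minimal sketch with hypothetical names and data, not a proposed community standard.

```python
def prioritization_summary(candidates, correct, prioritized):
    """Summarize a prioritization step for one feature with known identity.

    candidates  : all candidate identifiers for the feature
    correct     : identifier of the true structure
    prioritized : identifiers that passed the filter
    """
    false_candidates = [c for c in candidates if c != correct]
    false_kept = [c for c in false_candidates if c in prioritized]
    ratio = len(false_kept) / len(false_candidates) if false_candidates else 0.0
    return {
        "correct_retained": correct in prioritized,
        "false_kept_ratio": ratio,
    }

# Hypothetical feature with five candidates, of which "A" is correct.
summary = prioritization_summary(
    candidates=["A", "B", "C", "D", "E"],
    correct="A",
    prioritized={"A", "B"},
)
print(summary)
```

Unlike root mean square error, these two numbers directly reflect the trade-off between losing the correct candidate and retaining false ones.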

Combining in silico methods

Combining multiple structural annotation strategies can be advantageous, as consensus among different methods may reinforce the plausibility of a candidate structure. This was also evidenced in our experiments, where alignment between candidate structures from library matching with MassBank and SIRIUS+CSI:FingerID occurred for the correct candidate structure for spiked chemicals. Furthermore, various empirical analytical characteristics described in the “Complementary features for prioritizing the candidate structures” section can aid in prioritizing structure identification. However, there is currently a lack of community-wide strategies for managing predictions, particularly in combining predictions from different empirical analytical characteristics with varying uncertainty and accuracy. Nonetheless, the integration of such predictions is highly desirable due to the complementary nature of retention time, CCS values, adduct formation, and mass spectrometric information.
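
Consensus among annotation methods can be sketched as counting how many methods propose the same candidate for one LC/HRMS feature. The identifiers and the helper name below are hypothetical; in practice, candidates would need to be compared through a canonical structure identifier.

```python
from collections import defaultdict

def consensus_candidates(suggestions, min_methods=2):
    """Return candidates proposed by at least `min_methods` annotation methods.

    `suggestions` maps a method name to the candidate identifiers it proposed
    for one LC/HRMS feature.
    """
    support = defaultdict(set)
    for method, candidates in suggestions.items():
        for candidate in candidates:
            support[candidate].add(method)
    return {c: sorted(m) for c, m in support.items() if len(m) >= min_methods}

# Hypothetical top candidates per method for a single feature.
suggestions = {
    "MassBank": ["cand_1", "cand_2"],
    "SIRIUS+CSI:FingerID": ["cand_1", "cand_3"],
    "MetFrag": ["cand_4"],
    "Spec2Mol": ["cand_5"],
}
shared = consensus_candidates(suggestions)
print(shared)
```

Such a count treats all methods as equally reliable; weighting by method-specific uncertainty is precisely the open question raised above.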

Conclusion

In this critical review, we have outlined and compared methods for retrieving candidate structures to annotate unknown LC/HRMS features. As expected, spectral library matching offers the most reliable results when the sought structure is available in the library. However, the currently publicly available spectral databases cover only a fraction of the chemical space. GMs offer access to yet-unknown structures, but their effectiveness depends on the comprehensiveness of the training data. Additionally, the significance of methods leveraging complementary empirical information for candidate structure prioritization cannot be overstated. Inaccurate predictions of properties for candidate structures beyond the model’s applicability domain may result in the loss of correct candidates. Importantly, some candidate structures may be disregarded even before applying these methods if the adduct type cannot be definitively determined or if the deconvolution of data-independent tandem mass spectra produces noisy results.

Supplementary Information

Below is the link to the electronic supplementary material.

216_2024_5471_MOESM1_ESM.docx (22.3MB, docx)

Supplementary Information 1 (DOCX 22825 KB)

216_2024_5471_MOESM2_ESM.csv (122.1KB, csv)

Supplementary Information 2 (CSV 122 KB)

Acknowledgements

Sam Mauritz Badii and Emma Apelgren are acknowledged for their contributions to the sample preparation, and Claudia Möckel is acknowledged for her help in the raw data processing. The influent water samples were kindly provided by the IVL Swedish Environmental Research Institution (Stockholm, Sweden).

Author contribution

H.H.: conceptualization, formal analysis, data curation, writing — original draft, visualization, writing — review and editing; I.R.: conceptualization, software, formal analysis, data curation, writing — original draft, visualization, writing — review and editing; W.-C.W.: investigation, writing — original draft; P.P.: writing — original draft; E.H.P.: writing — original draft; A.K.: conceptualization, formal analysis, writing — original draft, writing — review and editing, funding acquisition.

Funding

Open access funding provided by Stockholm University. H.H. was supported by Stockholm University Center for Circular and Sustainable Systems (SUCCeSS) postdoc funding, P.P. by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2051 – Project-ID 390713860, E.H.P. by European Union’s Horizon 2020 research and innovation program under grant agreement No. 101036756, project ZeroPM: Zero pollution of persistent, mobile substances, W.-C.W. by Swedish Research Council grant 2021-03917, I.R. by Carl Trygger Foundation project 22:2336. The project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program grant agreement No. 101124488, LearningStructurE: Machine Learning and Mass Spectrometry for Structural Elucidation of Novel Toxic Chemicals.

Code and data availability

The Supplementary Information 1 includes details on experimental methods, data processing, and figures. Additionally, a table containing all extracted candidates and their prioritization can be found as Supplementary Information 2. The code for data processing and visualization is available at the GitHub repository: https://github.com/kruvelab/NTS_LC_HRMS.

Declarations

Conflict of interest

The authors declare no competing interests.

Footnotes

Published in the topical collection New Trends in the Analysis of Environmental Pollutants with guest editors Miren López de Alda, Susan D. Richardson, Kevin V. Thomas, and Nikolaos S. Thomaidis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Henrik Hupatz and Ida Rahu contributed equally to this work.

Contributor Information

Ida Rahu, Email: ida.rahu@mmk.su.se.

Anneli Kruve, Email: anneli.kruve@su.se.

References

  • 1. Black G, Lowe C, Anumol T, Bade J, Favela K, Feng Y-L, Knolhoff A, Mceachran A, Nuñez J, Fisher C, Peter K, Quinete NS, Sobus J, Sussman E, Watson W, Wickramasekara S, Williams A, Young T. Exploring chemical space in non-targeted analysis: a proposed ChemSpace tool. Anal Bioanal Chem. 2023;415:35–44. 10.1007/s00216-022-04434-4.
  • 2. Renner G, Reuschenbach M. Critical review on data processing algorithms in non-target screening: challenges and opportunities to improve result comparability. Anal Bioanal Chem. 2023;415:4111–23. 10.1007/s00216-023-04776-7.
  • 3. Hollender J, Schymanski EL, Ahrens L, Alygizakis N, Béen F, Bijlsma L, Brunner AM, Celma A, Fildier A, Fu Q, Gago-Ferrero P, Gil-Solsona R, Haglund P, Hansen M, Kaserzon S, Kruve A, Lamoree M, Margoum C, Meijer J, Merel S, Rauert C, Rostkowski P, Samanipour S, Schulze B, Schulze T, Singh RR, Slobodnik J, Steininger-Mairinger T, Thomaidis NS, Togola A, Vorkamp K, Vulliet E, Zhu L, Krauss M. NORMAN guidance on suspect and non-target screening in environmental monitoring. Environ Sci Eur. 2023;35:75. 10.1186/s12302-023-00779-4.
  • 4. Hulleman T, Turkina V, O’Brien JW, Chojnacka A, Thomas KV, Samanipour S. Critical Assessment of the Chemical Space Covered by LC–HRMS Non-Targeted Analysis. Environ Sci Technol. 2023;57:14101–12. 10.1021/acs.est.3c03606.
  • 5. Manz KE, Feerick A, Braun JM, Feng Y-L, Hall A, Koelmel J, Manzano C, Newton SR, Pennell KD, Place BJ, Godri Pollitt KJ, Prasse C, Young JA. Non-targeted analysis (NTA) and suspect screening analysis (SSA): a review of examining the chemical exposome. J Expo Sci Environ Epidemiol. 2023;33:524–36. 10.1038/s41370-023-00574-6.
  • 6. Vosough M, Schmidt TC, Renner G. Non-target screening in water analysis: recent trends of data evaluation, quality assurance, and their future perspectives. Anal Bioanal Chem. 2024;416:2125–36. 10.1007/s00216-024-05153-8.
  • 7. Minkus S, Bieber S, Letzel T. Spotlight on mass spectrometric non-target screening analysis: Advanced data processing methods recently communicated for extracting, prioritizing and quantifying features. Anal Sci Adv. 2022;3:103–12. 10.1002/ansa.202200001.
  • 8. Cai Y, Zhou Z, Zhu Z-J. Advanced analytical and informatic strategies for metabolite annotation in untargeted metabolomics. TrAC, Trends Anal Chem. 2023;158:116903. 10.1016/j.trac.2022.116903.
  • 9. Liebal UW, Phan ANT, Sudhakar M, Raman K, Blank LM. Machine Learning Applications for Mass Spectrometry-Based Metabolomics. Metabolites. 2020;10:243. 10.3390/metabo10060243.
  • 10. Sepman H, Malm L, Peets P, Kruve A. Scientometric review: Concentration and toxicity assessment in environmental non-targeted LC/HRMS analysis. Trends Environ Anal Chem. 2023;40:e00217. 10.1016/j.teac.2023.e00217.
  • 11. Schymanski EL, Jeon J, Gulde R, Fenner K, Ruff M, Singer HP, Hollender J. Identifying Small Molecules via High Resolution Mass Spectrometry: Communicating Confidence. Environ Sci Technol. 2014;48:2097–8. 10.1021/es5002105.
  • 12. Horai H, Arita M, Kanaya S, Nihei Y, Ikeda T, Suwa K, Ojima Y, Tanaka K, Tanaka S, Aoshima K, Oda Y, Kakazu Y, Kusano M, Tohge T, Matsuda F, Sawada Y, Hirai MY, Nakanishi H, Ikeda K, Akimoto N, Maoka T, Takahashi H, Ara T, Sakurai N, Suzuki H, Shibata D, Neumann S, Iida T, Tanaka K, Funatsu K, Matsuura F, Soga T, Taguchi R, Saito K, Nishioka T. MassBank: a public repository for sharing mass spectral data for life sciences. J Mass Spectrom. 2010;45:703–14. 10.1002/jms.1777.
  • 13. MassBank of North America. https://mona.fiehnlab.ucdavis.edu/. Accessed 30 Apr 2024.
  • 14. Mass Spectrometry Data Center, NIST. https://chemdata.nist.gov/. Accessed 30 Apr 2024.
  • 15. Xue J, Guijas C, Benton HP, Warth B, Siuzdak G. METLIN MS2 molecular standards database: a broad chemical and biological resource. Nat Methods. 2020;17:953–4. 10.1038/s41592-020-0942-5.
  • 16. Wang M, Carver JJ, Phelan VV, Sanchez LM, Garg N, Peng Y, Nguyen DD, Watrous J, Kapono CA, Luzzatto-Knaan T, Porto C, Bouslimani A, Melnik AV, Meehan MJ, Liu W-T, Crüsemann M, Boudreau PD, Esquenazi E, Sandoval-Calderón M, Kersten RD, Pace LA, Quinn RA, Duncan KR, Hsu C-C, Floros DJ, Gavilan RG, Kleigrewe K, Northen T, Dutton RJ, Parrot D, Carlson EE, Aigle B, Michelsen CF, Jelsbak L, Sohlenkamp C, Pevzner P, Edlund A, McLean J, Piel J, Murphy BT, Gerwick L, Liaw C-C, Yang Y-L, Humpf H-U, Maansson M, Keyzers RA, Sims AC, Johnson AR, Sidebottom AM, Sedio BE, Klitgaard A, Larson CB, Boya PCA, Torres-Mendoza D, Gonzalez DJ, Silva DB, Marques LM, Demarque DP, Pociute E, O’Neill EC, Briand E, Helfrich EJN, Granatosky EA, Glukhov E, Ryffel F, Houson H, Mohimani H, Kharbush JJ, Zeng Y, Vorholt JA, Kurita KL, Charusanti P, McPhail KL, Nielsen KF, Vuong L, Elfeki M, Traxler MF, Engene N, Koyama N, Vining OB, Baric R, Silva RR, Mascuch SJ, Tomasi S, Jenkins S, Macherla V, Hoffman T, Agarwal V, Williams PG, Dai J, Neupane R, Gurr J, Rodríguez AMC, Lamsa A, Zhang C, Dorrestein K, Duggan BM, Almaliti J, Allard P-M, Phapale P, Nothias L-F, Alexandrov T, Litaudon M, Wolfender J-L, Kyle JE, Metz TO, Peryea T, Nguyen D-T, VanLeer D, Shinn P, Jadhav A, Müller R, Waters KM, Shi W, Liu X, Zhang L, Knight R, Jensen PR, Palsson BØ, Pogliano K, Linington RG, Gutiérrez M, Lopes NP, Gerwick WH, Moore BS, Dorrestein PC, Bandeira N. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat Biotechnol. 2016;34:828–37. 10.1038/nbt.3597.
  • 17. Stein SE, Scott DR. Optimization and testing of mass spectral library search algorithms for compound identification. J Am Soc Mass Spectrom. 1994;5:859–66. 10.1016/1044-0305(94)87009-8.
  • 18. Li Y, Kind T, Folz J, Vaniya A, Mehta SS, Fiehn O. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat Methods. 2021;18:1524–31. 10.1038/s41592-021-01331-z.
  • 19. Huber F, Van Der Burg S, Van Der Hooft JJJ, Ridder L. MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra. J Cheminform. 2021;13:84. 10.1186/s13321-021-00558-4.
  • 20. Ruttkies C, Schymanski EL, Wolf S, Hollender J, Neumann S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J Cheminform. 2016;8:3. 10.1186/s13321-016-0115-9.
  • 21. Wang F, Liigand J, Tian S, Arndt D, Greiner R, Wishart DS. CFM-ID 4.0: More Accurate ESI-MS/MS Spectral Prediction and Compound Identification. Anal Chem. 2021;93:11692–700. 10.1021/acs.analchem.1c01465.
  • 22. Murphy M, Jegelka S, Fraenkel E, Kind T, Healey D, Butler T (2023) Efficiently predicting high resolution mass spectra with graph neural networks. 10.48550/ARXIV.2301.11419.
  • 23. Schymanski EL, Kondić T, Neumann S, Thiessen PA, Zhang J, Bolton EE. Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag. J Cheminform. 2021;13:19. 10.1186/s13321-021-00489-0.
  • 24. NORMAN Network, Aalizadeh R, Alygizakis N, Schymanski E, Slobodnik J, Fischer S, Cirka L, Mohammed Taha H (2024) S0 | SUSDAT | Merged NORMAN Suspect List: SusDat (NORMAN-SLE-S0.0.5.1) [Data set]. Zenodo. 10.5281/zenodo.10510477
  • 25. Xing S, Shen S, Xu B, Li X, Huan T. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nat Methods. 2023;20:881–90. 10.1038/s41592-023-01850-x.
  • 26. Dührkop K, Fleischauer M, Ludwig M, Aksenov AA, Melnik AV, Meusel M, Dorrestein PC, Rousu J, Böcker S. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods. 2019;16:299–302. 10.1038/s41592-019-0344-8.
  • 27. Goldman S, Wohlwend J, Stražar M, Haroush G, Xavier RJ, Coley CW. Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nat Mach Intell. 2023;5:965–79. 10.1038/s42256-023-00708-3.
  • 28. Elser D, Huber F, Gaquerel E (2023) Mass2SMILES: deep learning based fast prediction of structures and functional groups directly from high-resolution MS/MS spectra. 10.1101/2023.07.06.547963.
  • 29. Kutuzova S, Igel C, Nielsen M, McCloskey D (2021) Bi-modal variational autoencoders for metabolite identification using tandem mass spectrometry. 10.1101/2021.08.03.454944.
  • 30. Litsa EE, Chenthamarakshan V, Das P, Kavraki LE. An end-to-end deep learning framework for translating mass spectra to de-novo molecules. Commun Chem. 2023;6:132. 10.1038/s42004-023-00932-3.
  • 31. Shrivastava AD, Swainston N, Samanta S, Roberts I, Wright Muelas M, Kell DB. MassGenie: A transformer-based deep learning method for identifying small molecules from their mass spectra. Biomolecules. 2021;11:1793. 10.3390/biom11121793.
  • 32. Butler T, Frandsen A, Lightheart R, Bargh B, Kerby T, West K, Davison J, Taylor J, Krettler C, Bollerman T, Voronov G, Moon K, Kind T, Dorrestein P, Allen A, Colluru V, Healey D (2023) MS2Mol: A transformer model for illuminating dark chemical space from mass spectra. 10.26434/chemrxiv-2023-vsmpx-v4.
  • 33. Stravs MA, Dührkop K, Böcker S, Zamboni N. MSNovelist: de novo structure generation from mass spectra. Nat Methods. 2022;19:865–70. 10.1038/s41592-022-01486-3.
  • 34. Aalizadeh R, Alygizakis NA, Schymanski EL, Krauss M, Schulze T, Ibáñez M, McEachran AD, Chao A, Williams AJ, Gago-Ferrero P, Covaci A, Moschet C, Young TM, Hollender J, Slobodnik J, Thomaidis NS. Development and application of liquid chromatographic retention time indices in HRMS-based suspect and nontarget screening. Anal Chem. 2021;93:11601–11. 10.1021/acs.analchem.1c02348.
  • 35. Ross DH, Cho JH, Xu L. Breaking down structural diversity for comprehensive prediction of ion-neutral collision cross sections. Anal Chem. 2020;92:4548–57. 10.1021/acs.analchem.9b05772.
  • 36. Zhou Z, Luo M, Chen X, Yin Y, Xiong X, Wang R, Zhu Z-J. Ion mobility collision cross-section atlas for known and unknown metabolite annotation in untargeted metabolomics. Nat Commun. 2020;11:4334. 10.1038/s41467-020-18171-8.
  • 37. Domingo-Almenara X, Guijas C, Billings E, Montenegro-Burke JR, Uritboonthai W, Aisporna AE, Chen E, Benton HP, Siuzdak G. The METLIN small molecule dataset for machine learning-based retention time prediction. Nat Commun. 2019;10:5811. 10.1038/s41467-019-13680-7.
  • 38.Kretschmer F, Harrieder E-M, Hoffmann MA, Böcker S, Witting M. RepoRT: a comprehensive repository for small molecule retention times. Nat Methods. 2024;21:153–5. 10.1038/s41592-023-02143-z. [DOI] [PubMed] [Google Scholar]
  • 39.Celma A, Sancho JV, Schymanski EL, Fabregat-Safont D, Ibáñez M, Goshawk J, Barknowitz G, Hernández F, Bijlsma L. Improving target and suspect screening high-resolution mass spectrometry workflows in environmental analysis by ion mobility separation. Environ Sci Technol. 2020;54:15120–31. 10.1021/acs.est.0c05713. [DOI] [PubMed] [Google Scholar]
  • 40.Picache JA, Rose BS, Balinski A, Leaptrot KL, Sherrod SD, May JC, McLean JA. Collision cross section compendium to annotate and predict multi-omic compound identities. Chem Sci. 2019;10:983–93. 10.1039/C8SC04396E. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Baker ES, Hoang C, Uritboonthai W, Heyman HM, Pratt B, MacCoss M, MacLean B, Plumb R, Aisporna A, Siuzdak G. METLIN-CCS: an ion mobility spectrometry collision cross section database. Nat Methods. 2023;20:1836–7. 10.1038/s41592-023-02078-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Xu R, Lee J, Chen L, Zhu J. Enhanced detection and annotation of small molecules in metabolomics using molecular-network-oriented parameter optimization. Mol Omics. 2021;17:665–76. 10.1039/D1MO00005E. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Sepman H, Tshepelevitsh S, Hupatz H, Kruve A. Protomer Formation Can Aid the Structural Identification of Caffeine Metabolites. Anal Chem. 2022;94:10601–9. 10.1021/acs.analchem.2c00257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Wang J, Aubry A, Bolgar MS, Gu H, Olah TV, Arnold M, Jemal M. Effect of mobile phase pH, aqueous-organic ratio, and buffer concentration on electrospray ionization tandem mass spectrometric fragmentation patterns: implications in liquid chromatography/tandem mass spectrometric bioanalysis. Rapid Comm Mass Spectrometry. 2010;24:3221–9. 10.1002/rcm.4748. [DOI] [PubMed] [Google Scholar]
  • 45.Tokiyoshi K, Matsuzawa Y, Takahashi M, Takeda H, Hasegawa M, Miyamoto J, Tsugawa H. Using Data-Dependent and -Independent Hybrid Acquisitions for Fast Liquid Chromatography-Based Untargeted Lipidomics. Anal Chem. 2024;96:991–6. 10.1021/acs.analchem.3c04400. [DOI] [PubMed] [Google Scholar]
  • 46.Oberacher H, Sasse M, Antignac J-P, Guitton Y, Debrauwer L, Jamin EL, Schulze T, Krauss M, Covaci A, Caballero-Casero N, Rousseau K, Damont A, Fenaille F, Lamoree M, Schymanski EL. A European proposal for quality control and quality assurance of tandem mass spectral libraries. Environ Sci Eur. 2020;32:43. 10.1186/s12302-020-00314-9. [Google Scholar]
  • 47.Guo J, Huan T. Comparison of Full-Scan, Data-Dependent, and Data-Independent Acquisition Modes in Liquid Chromatography-Mass Spectrometry Based Untargeted Metabolomics. Anal Chem. 2020;92:8072–80. 10.1021/acs.analchem.9b05135. [DOI] [PubMed] [Google Scholar]
  • 48.De Jonge NF, Mildau K, Meijer D, Louwen JJR, Bueschl C, Huber F, Van Der Hooft JJJ. Good practices and recommendations for using and benchmarking computational metabolomics metabolite annotation tools. Metabolomics. 2022;18:103. 10.1007/s11306-022-01963-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Kerber A, Laue R, Meringer M, Rucker C. Molecules in Silico: The Generation of Structural Formulae and Its Applications. J Comput Chem Jpn. 2004;3:85–96. 10.2477/jccj.3.85. [Google Scholar]
  • 50.Wolf S, Schmidt S, Müller-Hannemann M, Neumann S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinforma. 2010;11:148. 10.1186/1471-2105-11-148. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Allen F, Greiner R, Wishart D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics. 2015;11:98–110. 10.1007/s11306-014-0676-4. [Google Scholar]
  • 52.Bremer PL, Vaniya A, Kind T, Wang S, Fiehn O. How Well Can We Predict Mass Spectra from Structures? Benchmarking Competitive Fragmentation Modeling for Metabolite Identification on Untrained Tandem Mass Spectra. J Chem Inf Model. 2022;62:4049–56. 10.1021/acs.jcim.2c00936. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Wang F, Pasin D, Skinnider MA, Liigand J, Kleis J-N, Brown D, Oler E, Sajed T, Gautam V, Harrison S, Greiner R, Foster LJ, Dalsgaard PW, Wishart DS. Deep Learning-Enabled MS/MS Spectrum Prediction Facilitates Automated Identification Of Novel Psychoactive Substances. Anal Chem. 2023;95:18326–34. 10.1021/acs.analchem.3c02413. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Young A, Röst H, Wang B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat Mach Intell. 2024. 10.1038/s42256-024-00816-8. [Google Scholar]
  • 55.Albergamo V, Schollée JE, Schymanski EL, Helmus R, Timmer H, Hollender J, De Voogt P. Nontarget Screening Reveals Time Trends of Polar Micropollutants in a Riverbank Filtration System. Environ Sci Technol. 2019;53:7584–94. 10.1021/acs.est.9b01750. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Böcker S, Dührkop K. Fragmentation trees reloaded. J Cheminform. 2016;8:5. 10.1186/s13321-016-0116-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Goldman S, Xin J, Provenzano J, Coley CW. MIST-CF: Chemical Formula Inference from Tandem Mass Spectra. J Chem Inf Model. 2024;64:2421–31. 10.1021/acs.jcim.3c01082. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Dührkop K, Shen H, Meusel M, Rousu J, Böcker S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci USA. 2015;112:12580–5. 10.1073/pnas.1509788112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Bojko B, Onat B, Boyaci E, Psillakis E, Dailianis T, Pawliszyn J. Application of in situ solid-phase microextraction on mediterranean sponges for untargeted exometabolome screening and environmental monitoring. Front Mar Sci. 2019;6:632. 10.3389/fmars.2019.00632. [Google Scholar]
  • 60.Li X, Ma W, Yang B, Tu M, Zhang Q, Li H. Impurity profiling of dinotefuran by high resolution mass spectrometry and SIRIUS tool. Molecules. 2022;27:5251. 10.3390/molecules27165251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Wang Z, Walker GW, Muir DCG, Nagatani-Yoshida K. Toward a global understanding of chemical pollution: A first comprehensive analysis of national and regional chemical inventories. Environ Sci Technol. 2020;54:2575–84. 10.1021/acs.est.9b06379. [DOI] [PubMed] [Google Scholar]
  • 62.Xia J, Si H, Huang X, Chen X, Fu X, Li G, Lai Q, Li F, Wang W, Shao Z. Metabolomics and molecular networking-guided screening of bacillus -derived bioactive compounds against a highly lethal vibrio species. Anal Chem. 2024;96:4359–68. 10.1021/acs.analchem.3c02958. [DOI] [PubMed] [Google Scholar]
  • 63.Zhou Z, Luo M, Zhang H, Yin Y, Cai Y, Zhu Z-J. Metabolite annotation from knowns to unknowns through knowledge-guided multi-layer metabolic networking. Nat Commun. 2022;13:6656. 10.1038/s41467-022-34537-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.De Jonge NF, Louwen JJR, Chekmeneva E, Camuzeaux S, Vermeir FJ, Jansen RS, Huber F, Van Der Hooft JJJ. MS2Query: reliable and scalable MS2 mass spectra-based analogue search. Nat Commun. 2023;14:1752. 10.1038/s41467-023-37446-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Tripathi A, Vázquez-Baeza Y, Gauglitz JM, Wang M, Dührkop K, Nothias-Esposito M, Acharya DD, Ernst M, Van Der Hooft JJJ, Zhu Q, McDonald D, Brejnrod AD, Gonzalez A, Handelsman J, Fleischauer M, Ludwig M, Böcker S, Nothias L-F, Knight R, Dorrestein PC. Chemically informed analyses of metabolomics mass spectrometry data with Qemistree. Nat Chem Biol. 2021;17:146–51. 10.1038/s41589-020-00677-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Djoumbou Feunang Y, Eisner R, Knox C, Chepelev L, Hastings J, Owen G, Fahy E, Steinbeck C, Subramanian S, Bolton E, Greiner R, Wishart DS. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J Cheminform. 2016;8:61. 10.1186/s13321-016-0174-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Shaffer JP, Nothias L-F, Thompson LR, Sanders JG, Salido RA, Couvillion SP, Brejnrod AD, Lejzerowicz F, Haiminen N, Huang S, Lutz HL, Zhu Q, Martino C, Morton JT, Karthikeyan S, Nothias-Esposito M, Dührkop K, Böcker S, Kim HW, Aksenov AA, Bittremieux W, Minich JJ, Marotz C, Bryant MM, Sanders K, Schwartz T, Humphrey G, Vásquez-Baeza Y, Tripathi A, Parida L, Carrieri AP, Beck KL, Das P, González A, McDonald D, Ladau J, Karst SM, Albertsen M, Ackermann G, DeReus J, Thomas T, Petras D, Shade A, Stegen J, Song SJ, Metz TO, Swafford AD, Dorrestein PC, Jansson JK, Gilbert JA, Knight R, the Earth Microbiome Project 500 (EMP500) Consortium, Angenant LT, Berry AM, Bittleston LS, Bowen JL, Chavarría M, Cowan DA, Distel D, Girguis PR, Huerta-Cepas J, Jensen PR, Jiang L, King GM, Lavrinienko A, MacRae-Crerar A, Makhalanyane TP, Mappes T, Marzinelli EM, Mayer G, McMahon KD, Metcalf JL, Miyake S, Mousseau TA, Murillo-Cruz C, Myrold D, Palenik B, Pinto-Tomás AA, Porazinska DL, Ramond J-B, Rowher F, RoyChowdhury T, Sandin SA, Schmidt SK, Seedorf H, Shade A, Shipway JR, Smith JE, Stegen J, Stewart FJ, Tait K, Thomas T, Tucker Y, U’Ren JM, Watts PC, Webster NS, Zaneveld JR, Zhang S. Standardized multi-omics of Earth’s microbiomes reveals microbial and metabolite diversity. Nat Microbiol. 2022;7:2128–50. 10.1038/s41564-022-01266-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Sha B, Schymanski EL, Ruttkies C, Cousins IT, Wang Z. Exploring open cheminformatics approaches for categorizing per- and polyfluoroalkyl substances (PFASs). Environ Sci: Processes Impacts. 2019;21:1835–51. 10.1039/C9EM00321E. [DOI] [PubMed] [Google Scholar]
  • 69.Aurich D, Diderich P, Helmus R, Schymanski EL. Non-target screening of surface water samples to identify exposome-related pollutants: a case study from Luxembourg. Environ Sci Eur. 2023;35:94. 10.1186/s12302-023-00805-5. [Google Scholar]
  • 70.Poltorak V, Shachaf N, Aharoni A, Zeevi D (2024) Spec2Class: Accurate prediction of plant secondary metabolite class using deep learning. 10.1101/2024.03.17.585408.
  • 71.Anstine DM, Isayev O. Generative Models as an Emerging Paradigm in the Chemical Sciences. J Am Chem Soc. 2023;145:8736–50. 10.1021/jacs.2c13467. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Sanchez-Lengeling B, Aspuru-Guzik A. Inverse molecular design using machine learning: Generative models for matter engineering. Science. 2018;361:360–5. 10.1126/science.aat2663. [DOI] [PubMed] [Google Scholar]
  • 73.Segler MHS, Kogej T, Tyrchan C, Waller MP. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent Sci. 2018;4:120–31. 10.1021/acscentsci.7b00512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Colby SM, Nuñez JR, Hodas NO, Corley CD, Renslow RR. Deep learning to generate in Silico chemical property libraries and candidate molecules for small molecule identification in complex samples. Anal Chem. 2020;92:1720–9. 10.1021/acs.analchem.9b02348. [DOI] [PubMed] [Google Scholar]
  • 75.Skinnider MA, Wang F, Pasin D, Greiner R, Foster LJ, Dalsgaard PW, Wishart DS. A deep generative model enables automated structure elucidation of novel psychoactive substances. Nat Mach Intell. 2021;3:973–84. 10.1038/s42256-021-00407-x. [Google Scholar]
  • 76.Papageorgiou M, Kosma C, Lambropoulou D. Seasonal occurrence, removal, mass loading and environmental risk assessment of 55 pharmaceuticals and personal care products in a municipal wastewater treatment plant in Central Greece. Sci Total Environ. 2016;543:547–69. 10.1016/j.scitotenv.2015.11.047. [DOI] [PubMed] [Google Scholar]
  • 77.Guardian MGE, Antle JP, Vexelman PA, Aga DS, Simpson SM. Resolving unknown isomers of emerging per- and polyfluoroalkyl substances (PFASs) in environmental samples using COSMO-RS-derived retention factor and mass fragmentation patterns. J Hazard Mater. 2021;402:123478. 10.1016/j.jhazmat.2020.123478. [DOI] [PubMed] [Google Scholar]
  • 78.Song D, Tang T, Wang R, Liu H, Xie D, Zhao B, Dang Z, Lu G. Enhancing compound confidence in suspect and non-target screening through machine learning-based retention time prediction. Environ Pollut. 2024;347:123763. 10.1016/j.envpol.2024.123763. [DOI] [PubMed] [Google Scholar]
  • 79.Bijlsma L, Berntssen MHG, Merel S. a refined nontarget workflow for the investigation of metabolites through the prioritization by in silico prediction tools. Anal Chem. 2019;91:6321–8. 10.1021/acs.analchem.9b01218. [DOI] [PubMed] [Google Scholar]
  • 80.Stanstrup J, Neumann S, Vrhovšek U. PredRet: Prediction of retention time by direct mapping between multiple chromatographic systems. Anal Chem. 2015;87:9421–8. 10.1021/acs.analchem.5b02287. [DOI] [PubMed] [Google Scholar]
  • 81.Kern S, Fenner K, Singer HP, Schwarzenbach RP, Hollender J. Identification of transformation products of organic contaminants in natural waters by computer-aided prediction and high-resolution mass spectrometry. Environ Sci Technol. 2009;43:7039–46. 10.1021/es901979h. [DOI] [PubMed] [Google Scholar]
  • 82.Bonini P, Kind T, Tsugawa H, Barupal DK, Fiehn O. Retip: Retention time prediction for compound annotation in untargeted metabolomics. Anal Chem. 2020;92:7515–22. 10.1021/acs.analchem.9b05765. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Celma A, Bade R, Sancho JV, Hernandez F, Humphries M, Bijlsma L. Prediction of retention time and collision cross section (CCS H+, CCS H–, and CCS Na+ ) of emerging contaminants using multiple adaptive regression splines. J Chem Inf Model. 2022;62:5425–34. 10.1021/acs.jcim.2c00847. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Bouwmeester R, Martens L, Degroeve S. Generalized calibration across liquid chromatography setups for generic prediction of small-molecule retention times. Anal Chem. 2020;92:6571–8. 10.1021/acs.analchem.0c00233. [DOI] [PubMed] [Google Scholar]
  • 85.Falchi F, Bertozzi SM, Ottonello G, Ruda GF, Colombano G, Fiorelli C, Martucci C, Bertorelli R, Scarpelli R, Cavalli A, Bandiera T, Armirotti A. Kernel-based, partial least squares quantitative structure-retention relationship model for UPLC retention time prediction: A useful tool for metabolite identification. Anal Chem. 2016;88:9510–7. 10.1021/acs.analchem.6b02075. [DOI] [PubMed] [Google Scholar]
  • 86.Low DY, Micheau P, Koistinen VM, Hanhineva K, Abrankó L, Rodriguez-Mateos A, Da Silva AB, Van Poucke C, Almeida C, Andres-Lacueva C, Rai DK, Capanoglu E, Tomás Barberán FA, Mattivi F, Schmidt G, Gürdeniz G, Valentová K, Bresciani L, Petrásková L, Dragsted LO, Philo M, Ulaszewska M, Mena P, González-Domínguez R, Garcia-Villalba R, Kamiloglu S, De Pascual-Teresa S, Durand S, Wiczkowski W, Bronze MR, Stanstrup J, Manach C. Data sharing in PredRet for accurate prediction of retention time: Application to plant food bioactive compounds. Food Chem. 2021;357:129757. 10.1016/j.foodchem.2021.129757. [DOI] [PubMed] [Google Scholar]
  • 87.Feng C, Xu Q, Qiu X, Jin Y, Ji J, Lin Y, Le S, She J, Lu D, Wang G. Evaluation and application of machine learning-based retention time prediction for suspect screening of pesticides and pesticide transformation products in LC-HRMS. Chemosphere. 2021;271:129447. 10.1016/j.chemosphere.2020.129447. [DOI] [PubMed] [Google Scholar]
  • 88.Liapikos T, Zisi C, Kodra D, Kademoglou K, Diamantidou D, Begou O, Pappa-Louisi A, Theodoridis G. Quantitative structure retention relationship (QSRR) modelling for Analytes’ retention prediction in LC-HRMS by applying different Machine Learning algorithms and evaluating their performance. J Chromatogr B. 2022;1191:123132. 10.1016/j.jchromb.2022.123132. [DOI] [PubMed] [Google Scholar]
  • 89.Aicheler F, Li J, Hoene M, Lehmann R, Xu G, Kohlbacher O. Retention time prediction improves identification in nontargeted lipidomics approaches. Anal Chem. 2015;87:7698–704. 10.1021/acs.analchem.5b01139. [DOI] [PubMed] [Google Scholar]
  • 90.Bade R, Bijlsma L, Miller TH, Barron LP, Sancho JV, Hernández F. Suspect screening of large numbers of emerging contaminants in environmental waters using artificial neural networks for chromatographic retention time prediction and high resolution mass spectrometry data analysis. Sci Total Environ. 2015;538:934–41. 10.1016/j.scitotenv.2015.08.078. [DOI] [PubMed] [Google Scholar]
  • 91.Aalizadeh R, Nika M-C, Thomaidis NS. Development and application of retention time prediction models in the suspect and non-target screening of emerging contaminants. J Hazard Mater. 2019;363:277–85. 10.1016/j.jhazmat.2018.09.047. [DOI] [PubMed] [Google Scholar]
  • 92.Mollerup CB, Mardal M, Dalsgaard PW, Linnet K, Barron LP. Prediction of collision cross section and retention time for broad scope screening in gradient reversed-phase liquid chromatography-ion mobility-high resolution accurate mass spectrometry. J Chromatogr A. 2018;1542:82–8. 10.1016/j.chroma.2018.02.025. [DOI] [PubMed] [Google Scholar]
  • 93.Zhang Y, Liu F, Li XQ, Gao Y, Li KC, Zhang QH. Generic and accurate prediction of retention times in liquid chromatography by post–projection calibration. Commun Chem. 2024;7:54. 10.1038/s42004-024-01135-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Osipenko S, Nikolaev E, Kostyukevich Y. retention time prediction with message-passing neural networks. Separations. 2022;9:291. 10.3390/separations9100291. [Google Scholar]
  • 95.Ju R, Liu X, Zheng F, Lu X, Xu G, Lin X. Deep neural network pretrained by weighted autoencoders and transfer learning for retention time prediction of small molecules. Anal Chem. 2021;93:15651–8. 10.1021/acs.analchem.1c03250. [DOI] [PubMed] [Google Scholar]
  • 96.Fedorova ES, Matyushin DD, Plyushchenko IV, Stavrianidi AN, Buryak AK. Deep learning for retention time prediction in reversed-phase liquid chromatography. J Chromatogr A. 2022;1664:462792. 10.1016/j.chroma.2021.462792. [DOI] [PubMed] [Google Scholar]
  • 97.Yang Q, Ji H, Lu H, Zhang Z. Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification. Anal Chem. 2021;93:2200–6. 10.1021/acs.analchem.0c04071. [DOI] [PubMed] [Google Scholar]
  • 98.Xue J, Wang B, Ji H, Li W. RT-Transformer: retention time prediction for metabolite annotation to assist in metabolite identification. Bioinformatics. 2024;40:btae084. 10.1093/bioinformatics/btae084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Gabelica V, Marklund E. Fundamentals of ion mobility spectrometry. Curr Opin Chem Biol. 2018;42:51–9. 10.1016/j.cbpa.2017.10.022. [DOI] [PubMed] [Google Scholar]
  • 100.Celma A, Ahrens L, Gago-Ferrero P, Hernández F, López F, Lundqvist J, Pitarch E, Sancho JV, Wiberg K, Bijlsma L. The relevant role of ion mobility separation in LC-HRMS based screening strategies for contaminants of emerging concern in the aquatic environment. Chemosphere. 2021;280:130799. 10.1016/j.chemosphere.2021.130799. [DOI] [PubMed] [Google Scholar]
  • 101.Harvey DJ, Crispin M, Bonomelli C, Scrivens JH. Ion mobility mass spectrometry for ion recovery and clean-up of MS and MS/MS spectra obtained from low abundance viral samples. J Am Soc Mass Spectrom. 2015;26:1754–67. 10.1007/s13361-015-1163-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Haack A, Ieritano C, Hopkins WS. MobCal-MPI 2.0: an accurate and parallelized package for calculating field-dependent collision cross sections and ion mobilities. Analyst. 2023;148:3257–73. 10.1039/D3AN00545C. [DOI] [PubMed] [Google Scholar]
  • 103.Shrivastav V, Nahin M, Hogan CJ, Larriba-Andaluz C. Benchmark comparison for a multi-processing ion mobility calculator in the free molecular regime. J Am Soc Mass Spectrom. 2017;28:1540–51. 10.1007/s13361-017-1661-8. [DOI] [PubMed] [Google Scholar]
  • 104.Asef CK, Rainey MA, Garcia BM, Gouveia GJ, Shaver AO, Leach FE, Morse AM, Edison AS, McIntyre LM, Fernández FM (2023) Unknown Metabolite Identification Using Machine Learning Collision Cross-Section Prediction and Tandem Mass Spectrometry. Anal Chem acs.analchem. 2c03749. 10.1021/acs.analchem.2c03749. [DOI] [PMC free article] [PubMed]
  • 105.Hinnenkamp V, Balsaa P, Schmidt TC. Target, suspect and non-target screening analysis from wastewater treatment plant effluents to drinking water using collision cross section values as additional identification criterion. Anal Bioanal Chem. 2022;414:425–38. 10.1007/s00216-021-03263-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106.Wu Q, Wang J-Y, Han D-Q, Yao Z-P. Recent advances in differentiation of isomers by ion mobility mass spectrometry. TrAC, Trends Anal Chem. 2020;124:115801. 10.1016/j.trac.2019.115801. [Google Scholar]
  • 107.Akhlaqi M, Wang W-C, Möckel C, Kruve A. Complementary methods for structural assignment of isomeric candidate structures in non-target liquid chromatography ion mobility high-resolution mass spectrometric analysis. Anal Bioanal Chem. 2023;415:5247–59. 10.1007/s00216-023-04852-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Lowe CN, Isaacs KK, McEachran A, Grulke CM, Sobus JR, Ulrich EM, Richard A, Chao A, Wambaugh J, Williams AJ. Predicting compound amenability with liquid chromatography-mass spectrometry to improve non-targeted analysis. Anal Bioanal Chem. 2021;413:7495–508. 10.1007/s00216-021-03713-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Liigand P, Kaupmees K, Haav K, Liigand J, Leito I, Girod M, Antoine R, Kruve A. Think negative: Finding the best electrospray ionization/MS mode for your analyte. Anal Chem. 2017;89:5665–8. 10.1021/acs.analchem.7b00096. [DOI] [PubMed] [Google Scholar]
  • 110.Costalunga R, Tshepelevitsh S, Sepman H, Kull M, Kruve A. Sodium adduct formation with graph-based machine learning can aid structural elucidation in non-targeted LC/ESI/HRMS. Anal Chim Acta. 2022;1204:339402. 10.1016/j.aca.2021.339402. [DOI] [PubMed] [Google Scholar]
  • 111.Broeckling CD, Ganna A, Layer M, Brown K, Sutton B, Ingelsson E, Peers G, Prenni JE. Enabling efficient and confident annotation of LC−MS metabolomics data through MS1 spectrum and time prediction. Anal Chem. 2016;88:9226–34. 10.1021/acs.analchem.6b02479. [DOI] [PubMed] [Google Scholar]
  • 112.Abrahamsson D, Siddharth A, Young TM, Sirota M, Park J-S, Martin JW, Woodruff TJ. In Silico structure predictions for non-targeted analysis: From physicochemical properties to molecular structures. J Am Soc Mass Spectrom. 2022;33:1134–47. 10.1021/jasms.1c00386. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Karunaratne E, Hill DW, Dührkop K, Böcker S, Grant DF. Combining experimental with computational infrared and mass spectra for high-throughput nontargeted chemical structure identification. Anal Chem. 2023;95:11901–7. 10.1021/acs.analchem.3c00937. [DOI] [PubMed] [Google Scholar]
  • 114.Houthuijs KJ, Horn M, Vughs D, Martens J, Brunner AM, Oomens J, Berden G. Identification of organic micro-pollutants in surface water using MS-based infrared ion spectroscopy. Chemosphere. 2023;341:140046. 10.1016/j.chemosphere.2023.140046. [DOI] [PubMed] [Google Scholar]
  • 115.Hu J, Liu D, Fu N, Dong R. Realistic material property prediction using domain adaptation based machine learning. Digit Discov. 2024;3:300–12. 10.1039/D3DD00162H. [Google Scholar]
  • 116.Souihi A, Mohai MP, Palm E, Malm L, Kruve A. MultiConditionRT: Predicting liquid chromatography retention time for emerging contaminants for a wide range of eluent compositions and stationary phases. J Chromatogr A. 2022;1666:462867. 10.1016/j.chroma.2022.462867. [DOI] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplementary Information 1: 216_2024_5471_MOESM1_ESM.docx (DOCX, 22.3 MB)

Supplementary Information 2: 216_2024_5471_MOESM2_ESM.csv (CSV, 122 KB)

Data Availability Statement

Supplementary Information 1 includes details on the experimental methods, data processing, and figures. A table of all extracted candidates and their prioritization is provided as Supplementary Information 2. The code for data processing and visualization is available in the GitHub repository https://github.com/kruvelab/NTS_LC_HRMS.


Articles from Analytical and Bioanalytical Chemistry are provided here courtesy of Springer
