In silico structure predictions for non-targeted analysis: From physicochemical properties to molecular structures

Dimitri Abrahamsson; Adi Siddharth; Thomas M Young; Marina Sirota; June-Soo Park; Jonathan Martin; Tracey Woodruff

doi:10.1021/jasms.1c00386

Author manuscript; available in PMC: 2023 Jul 6.

Published in final edited form as: J Am Soc Mass Spectrom. 2022 Jun 1;33(7):1134–1147. doi: 10.1021/jasms.1c00386


PMCID: PMC9365522 NIHMSID: NIHMS1826615 PMID: 35649165

Abstract

While important advances have been made in high-resolution mass spectrometry (HRMS) and its applications in non-targeted analysis (NTA), the number of identified compounds in biological and environmental samples often does not exceed 5% of the detected chemical features. Our aim was to develop a computational pipeline that leverages data from HRMS, but also incorporates physicochemical properties (equilibrium partition ratios between organic solvents and water; K_{solvent-water}) and can propose molecular structures for detected chemical features. As these physicochemical properties are often sufficiently different across isomers, when put together they can form a unique profile for each isomer, which we describe as the “physicochemical fingerprint”. In our study, we used a comprehensive database (n~20,000) of compounds that have been previously reported in human blood and collected their K_{solvent-solvent} values for 129 partitioning systems. We used RDKit to calculate the number of RDKit fragments and the number of RDKit bits per molecule. We then developed and trained an artificial neural network, which used as input the physicochemical fingerprint of a chemical feature and predicted the number and types of RDKit fragments and RDKit bits present in that structure. These were then used to search the database and propose chemical structures. The average success rate of predicting the right chemical structure ranged from 60 to 86% for the training set and from 48 to 81% for the testing set. These observations suggest that physicochemical fingerprints can assist in the identification of compounds with NTA and substantially improve the number of identified compounds.


1. INTRODUCTION

Recent technological advances in high-resolution mass spectrometry (HRMS) have enabled the non-targeted analysis (NTA) of environmental and biological samples for a very broad spectrum of chemicals that would previously remain undetected with conventional targeted techniques. These compounds may be endogenous metabolites that are associated with a particular disease¹ (metabolome), environmental contaminants that are risk factors for disease or indicators of pollution² (exposome), dietary components (foodome), or pharmaceutical drugs and their transformation by-products³ (pharmacome).

Targeted analysis involves the pre-selection of analytes and the development of analytical methods for the analysis of these chemicals in a given matrix, and usually focuses on a limited number of compounds within a chemical class for which analytical standards are available.^4,5 Identification is conducted by comparing retention times and MS or MS/MS spectra of the sample and authentic standards.

Although less quantitative than targeted analysis, HRMS-based NTA allows for the screening of biological and environmental samples for thousands of compounds without deciding a priori which chemicals to look for. Depending on the study design and application, NTA can facilitate the discovery of molecules associated with a particular disease,^6–8 the understanding of sources of environmental pollution, and the linking of environmental exposures to adverse health outcomes.² While NTA is a promising technology, there are important challenges that prevent us from fully leveraging its potential. One critical challenge in NTA is the difficulty in obtaining definitive molecular structures for the majority of the detected chemical features.

HRMS benchtop instruments, such as quadrupole time-of-flight (Q-TOF) and Orbitrap, are the most commonly used instruments in NTA,^2,9,10 and there are important differences in terms of their mass spectral resolution and mass accuracy. A 6546 Q-TOF (Agilent) can achieve resolving power of 40,000 for mass to charge ratio (m/z) of 400,¹¹ whereas for the same m/z a Q Exactive Orbitrap (Thermo Scientific) can achieve a resolving power of 175,000.¹² Higher resolution instruments, such as Fourier Transform Ion Cyclotron Resonance (FT-ICR) instruments can achieve even higher resolving power in the range of 1,000,000, but these are very large, often requiring an entire room, and substantially more expensive. In addition, FT-ICR instruments are not fast scanning, as higher resolution requires more time, and thus, they do not pair as well with chromatography.¹³

Compact benchtop instruments, such as Q-TOF and Orbitrap can provide sufficiently high resolving power for routine analysis, and due to their relatively high mass accuracy (< 5 ppm error for most instruments) can be used to assign unambiguous molecular formulas to thousands of chemical features detected in a sample. A molecular formula provides the atomic composition (i.e., how many C, H, O, F etc.), but not how these atoms are arranged into a molecular structure. Some of the molecular features in a typical non-targeted acquisition will also have an associated data-rich MS/MS spectrum which can be used to annotate probable structures by matching to spectral databases, such as METLIN, MassBank, NIST, MoNA¹⁴ (experimental and in silico generated spectra) and CFM-ID¹⁵ and MetFrag¹⁶ (in silico generated spectra). While important advances have been made in producing large spectral databases, existing databases with experimental HRMS spectra are limited compared to the number of features often detected in biological samples, and in silico generated spectra can often be inaccurate, thus restricting the number of features that can be confidently annotated with a probable or confirmed structure.

In addition to spectral databases, computational tools have been developed to assist in the interpretation of MS/MS spectra from sample analysis. One example of these tools is SIRIUS¹⁷, which is a collection of MS/MS spectra interpretation tools that includes CSI:FingerID (with COSMIC; annotations and scores), ZODIAC (de novo molecular formula annotation) and CANOPUS (compound classes from MS/MS data).¹⁷ Other noteworthy computational tools are MS-DIAL¹⁸ and MS-FINDER¹⁹, which have been developed for processing NTA data and interpreting MS/MS spectra. Smaller scale tools, as scripts without user interfaces, have also been published as part of NTA processing workflows in previous studies, e.g., in one of our previous NTA studies.²

One of the limitations with MS/MS spectra, and thus with MS/MS interpretation tools, is that MS/MS fragmentation can only produce high quality spectra if the parent MS peak is at sufficiently high abundance and if the compound is ionizable. Lower abundance peaks produce noisy MS/MS spectra where the fragments are not fully distinguishable from the baseline. Similarly, compounds that have low ionization efficiency can also produce spectra where only the parent ion is showing, and the generated fragments (if any) are indistinguishable from the baseline.

While there have been great strides in improving the interpretation of MS/MS spectra by computational tools such as SIRIUS¹⁷ and MS-DIAL/MS-FINDER^18,19, current methods and workflows can confirm with analytical standards only about 5% of the detected features in a given sample.^2,9,10,20 There is, thus, a need to further explore and develop computational methods that can help us interpret NTA data to elucidate molecular structures for unknown features. The current proposed methodology leverages data from MS and MS/MS, but also incorporates physicochemical property measurements and is meant to complement existing approaches by narrowing the list of possible structures that may correspond to a detected chemical feature.

One of the main challenges in identifying chemicals through NTA is distinguishing between structural isomers (chemicals with the same formula but varying structures). For example, searching the chemical substance benzoic acid and its corresponding formula (C₇H₆O₂) on EPA’s Chemicals Dashboard²¹, we find 32 structural isomers with very different physicochemical properties. It should be noted, though, that because EPA’s Chemicals Dashboard lists primarily chemicals of environmental relevance, there could be additional relevant isomers beyond the 32 listed there. Searching PubChem for the same formula gives us 392 structural isomers. In Fig. 1, we show three examples of structural isomers of C₇H₆O₂ with eight selected physicochemical properties²² (log K equilibrium partition ratios between organic solvents and water). These physicochemical properties are often sufficiently different across isomers, and when put together they can form a unique profile for each isomer (Fig. 1), which we describe as the “physicochemical fingerprint”. Theoretically, by conducting partitioning experiments in the lab (Fig. 2), one could confidently identify which isomer of a given formula, e.g., C₇H₆O₂, is the one detected in the sample (Fig. 1D). Furthermore, as the physicochemical fingerprints are related to molecular structure, these could together be used to train machine learning algorithms to predict structural characteristics or functional groups (e.g., alcohol, ether or ester groups) that can eventually be used in combination with MS1 data (i.e., molecular masses and formulas) to search databases for matching molecular structures. It is important to note that the structures in Fig. 1 are meant to serve as an example that allows us to illustrate the principle of our approach. Measuring partitioning ratios of complex structures and distinguishing between similar molecular structures may be more challenging than illustrated in this example.

Figure 1. — (A), (B) and (C): Selected structural isomers for the formula C₇H₆O₂ with their equilibrium partition ratios (log scale) in selected organic solvents: log K_{solvent-water}. Also showing (D) a hypothetical feature detected in a blood sample with the same unambiguous formula whose physicochemical fingerprint matches isomer 1. Solvents on the x-axis include CD: carbon disulfide, CB: chlorobenzene, CH: cyclohexane, DM: dichloromethane, FB: fluorobenzene, DE: diethyl ether, EA: ethyl acetate, O: octanol.

Figure 2: — Workflow for obtaining K_{solvent-water} measurements for detected chemical features in a sample, converting the physicochemical fingerprints to RDKit fragments/bits and finally searching the Blood Exposome database (n ~20,000 compounds; described under “Data collection”) and matching to molecular structure. Given that the concentrations of detected features in NTA are unknown, our workflow follows the assumption that K_{solvent-water} values can be accurately calculated in partitioning experiments in the lab using the peak areas of the detected features instead of concentrations. TR: triolein, HX: n-hexane, OC: n-octane, UN: n-undecane, CH: cyclohexane, DC: dichloromethane, OA: oleyl alcohol, TO: toluene, TM: trichloromethane, BA: butyl acetate.

The aim of our study was to develop a computational pipeline for in silico structure prediction which can be used to propose probable structures for chemical features (chemical formulas and retention times) detected during NTA of blood samples (with MS1 and/or MS2) that have multiple structural isomers. As input data for the algorithm, we used physicochemical properties, which can be measured in the lab and can be used to identify and disaggregate structural isomers. We should note that the experimental part of the workflow is not evaluated in this paper but will be evaluated in our follow-up study.

2. METHODS

2.1. From physicochemical properties to chemical structures

Our proposed method (Fig. 2) employs the following steps: i) from a concentrated sample extract, a small volume is transferred to 8–10 partitioning systems containing different organic solvents and water (using a concentrated sample extract helps to ensure that there are detectable levels for most analytes after partitioning), ii) the tubes are then shaken and the chemicals are left to equilibrate, iii) the two phases are then separated using a centrifuge and analyzed with HRMS (or, to simplify the analytical workflow, one could also analyze only the aqueous phase and the original sample and back-calculate the peak area of the analytes in the solvent from the difference of the two peak areas), iv) using the peak areas for each chemical in the two phases, the K_{solvent-water} is calculated as the ratio of the two areas, v) after collecting the K_{solvent-water} values for all detectable features, these values are used to create a physicochemical fingerprint for each chemical, vi) the physicochemical fingerprints are then converted into RDKit fragments^23,24 or RDKit bits^23,25 using a trained machine learning algorithm, and finally vii) the RDKit fragments or bits are used to search a database for chemicals that contain these fragments or bits.

Conventionally, the K_{solvent-water} are calculated as:

K_{solvent - water} = \frac{C_{solvent}}{C_{water}}

(1)

where C is the concentration of the analyte in water and solvent. However, since the concentrations of detected features are not known, the equation can be modified as:

K_{solvent - water} = \frac{C_{solvent}}{C_{water}} = \frac{\frac{A_{solvent}}{RRF}}{\frac{A_{water}}{RRF}} = \frac{A_{solvent}}{A_{water}}

(2)

where A is the peak area of the analyte as measured by the HRMS instrument and RRF is the relative response factor of the analyte, which we assume to be equal or nearly equal in water and solvent or in water and original sample extract.
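As a minimal sketch of Eq. 2, the snippet below computes K_{solvent-water} from peak areas, both directly and by back-calculating the solvent signal from the original extract and the water phase. The function names, the example peak areas and the volume handling are illustrative assumptions and are not part of the published workflow; in particular, the back-calculation assumes that the original extract has the same volume as the water phase and that analyte is lost only to the solvent.

```python
def k_from_areas(area_solvent, area_water):
    """Eq. 2: K_solvent-water as the ratio of peak areas (equal RRF assumed)."""
    if area_solvent < 0 or area_water <= 0:
        raise ValueError("peak areas must be non-negative, and positive in water")
    return area_solvent / area_water


def k_from_depletion(area_original, area_water, v_solvent=1.0, v_water=1.0):
    """Back-calculate K from the original extract and the water phase only.

    Assumption (not from the paper): the original extract has the same volume
    as the water phase and analyte is lost only to the solvent.
    """
    if area_water <= 0 or area_original < area_water:
        raise ValueError("expected 0 < area_water <= area_original")
    # Mass balance: signal lost from water, rescaled to the solvent volume.
    area_solvent_equiv = (area_original - area_water) * v_water / v_solvent
    return k_from_areas(area_solvent_equiv, area_water)


# Hypothetical peak areas (arbitrary units) for one detected feature.
print(k_from_areas(3.0e6, 1.0e6))      # both phases measured
print(k_from_depletion(4.0e6, 1.0e6))  # water phase and original extract only
```

With equal phase volumes, both routes give the same ratio for these hypothetical areas, because the signal missing from the water phase is attributed entirely to the solvent.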

It is important to note that the revised equation assumes that the levels of the analytes in the sample are in the linear range of the calibration curve. As this may not always be the case for detected features, it would be good practice to analyze the sample in a series of dilutions when obtaining the peak areas of the analytes and to calculate the average K_{solvent-water}. In addition, it should be noted that there may be matrix effects associated with the different solvents that could influence the measured peak areas of the analytes. This issue could be addressed by analyzing only the aqueous phase and the original sample and back-calculating the peak area of the analytes in the solvent from the difference of the two peak areas.

Another issue that needs to be acknowledged is that there may be cases where chemicals partition strongly to one phase of the partitioning system (solvent or water), leaving only immeasurable amounts in the other phase. It is therefore good practice in partitioning experiments to adjust the volumes of the two phases to ensure measurable quantities in both phases. Ideally, the amount of the chemical in water or in the solvent should be in the range of 20–80% of the total amount in order to confidently calculate its partitioning ratio. This would need to be evaluated on a case-by-case basis depending on the properties of the detected chemicals. Since we are aiming to use this workflow in NTA, we would first start with the chemicals that appear to be in the 20–80% range for all or most solvents and then we would continue with the chemicals that appear to partition at ratios that exceed that range. For these chemicals, we would use at least 3 different volumes for every solvent and examine the differences in the partitioning. We anticipate that by doing so we will be able to capture some of these chemicals. However, there might be cases where some chemicals partition very strongly to one solvent and require trace amounts of solvent in order to be able to measure the concentrations in the aqueous phase. These cases would have to be excluded from our annotations due to lack of sufficient measurements.
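The 20–80% guideline can be illustrated with a simple mass balance: for a partition ratio K and phase volumes V_solvent and V_water, the amount in the solvent relative to the amount in water is K × V_solvent / V_water. The sketch below uses this relation to estimate the fraction of a chemical in the solvent and the range of volume ratios that keeps that fraction between 20% and 80%. The function names and the example K value are hypothetical and serve only as an illustration.

```python
def solvent_fraction(k, v_solvent, v_water):
    """Fraction of the total analyte amount found in the solvent at equilibrium."""
    amount_ratio = k * v_solvent / v_water  # amount in solvent / amount in water
    return amount_ratio / (1.0 + amount_ratio)


def volume_ratio_window(k, low=0.2, high=0.8):
    """Range of V_solvent / V_water keeping the solvent fraction in [low, high]."""
    # A fraction f in the solvent corresponds to an amount ratio of f / (1 - f).
    return (low / (1.0 - low) / k, high / (1.0 - high) / k)


# Hypothetical chemical with K_solvent-water = 100.
k = 100.0
print(solvent_fraction(k, v_solvent=1.0, v_water=1.0))  # equal volumes
print(volume_ratio_window(k))                           # V_solvent / V_water range
```

For this hypothetical K of 100, equal volumes leave about 1% of the chemical in water, and the solvent volume would need to be reduced to roughly 0.25–4% of the water volume to bring both phases into the 20–80% range.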

Finally, another issue that needs to be addressed is the effect of pH on the K_{solvent-water} for ionic compounds. Let’s take for example the compound 1-(2-Pyridylazo)-2-naphthol, which is expected to have 3 species when dissolved in water at pH 7. These 3 species are i) the neutral form (M), ii) the negatively charged ion (M⁻) after losing an H from the OH group, and iii) a positively charged ion (M⁺) after gaining an H in the pyridine group. In a partitioning system that contains octanol and water, we would expect that only M will partition to the organic phase and that M⁻ and M⁺ will remain in the aqueous phase. Although the partitioning of charged molecules to the organic phase is not impossible, it is negligible compared to the partitioning of the neutral species.^26–28 So if one were to measure the K_OW of 1-(2-Pyridylazo)-2-naphthol, that K_OW would be different at different pHs, as the distributions of M, M⁻ and M⁺ will vary with pH based on the dissociation constants (pKa) of 1-(2-Pyridylazo)-2-naphthol. This could be a problem if we were measuring only one K_{solvent-water} and if our model was generating potential structures from one K_{solvent-water} for one partitioning system. However, as described below in the modeling section, our algorithm is designed to take as input a set of 10 different K_{solvent-water} measurements for different organic solvents and generate predictions not from the absolute measurements of K_{solvent-water} but from the differences of the various K_{solvent-water} measurements across the different organic solvents. Since the distributions of M, M⁻ and M⁺ are determined by the pH of the water, the same effect that we observe in octanol, we would also observe in other organic solvents, such as triolein and ethyl acetate. Thus, while the absolute values of a set of K_{solvent-water} values would be different at pH = 2 and pH = 7, the differences of the K_{solvent-water} values between the different solvents would remain the same.
In the algorithm, this is controlled by using a standard scaler²⁹ that standardizes the K_{solvent-water} measurements to zero mean and unit variance before they are used to train and test the model. Standardization of a dataset is also a common requirement for many machine learning models, as they often assume approximately normally distributed data.
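The pH argument can be checked numerically with a small sketch. It assumes, as stated above, that only the neutral species partitions into the organic phase, so that the apparent log K in every solvent is shifted by the same amount, log10 of the neutral fraction. For simplicity the example uses a hypothetical monoprotic acid rather than the amphoteric compound discussed in the text; the pKa, the log K values and the per-fingerprint centering are illustrative assumptions and do not reproduce the scaler used in our code.

```python
import math


def neutral_fraction_acid(ph, pka):
    """Neutral fraction of a monoprotic acid at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def apparent_log_k(log_k_neutral, ph, pka):
    """Apparent log K per solvent when only the neutral species partitions."""
    shift = math.log10(neutral_fraction_acid(ph, pka))  # same for every solvent
    return [lk + shift for lk in log_k_neutral]


def centered(values):
    """Subtract the fingerprint mean, keeping only between-solvent differences."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical log K values of the neutral species in four solvents.
log_k = [3.1, 1.9, 0.7, 2.4]
at_ph2 = apparent_log_k(log_k, ph=2.0, pka=4.2)
at_ph7 = apparent_log_k(log_k, ph=7.0, pka=4.2)

print(at_ph2, at_ph7)                      # absolute values differ with pH
print(centered(at_ph2), centered(at_ph7))  # between-solvent pattern is shared
```

Under these assumptions the absolute values drop by almost three log units between pH 2 and pH 7, while the centered fingerprints agree to within floating-point error.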

To evaluate the feasibility of our computational approach we followed a multi-step process (Fig. 3) consisting of four main steps: 1) data collection, 2) converting molecular structures into arrays of information, 3) training an artificial neural network, 4) testing and validating the trained model with simulations and experimental data. Each section is described in detail below.

Figure 3: — Flowchart describing the individual steps from data collection to model development and evaluation.

2.2. Data collection

Here we used the curated version of the Blood Exposome database found on EPA’s Chemicals Dashboard (n = 19867 compounds). The Blood Exposome database was compiled by screening the scientific literature for all organic and inorganic compounds that have been previously reported in human blood or serum. We then used the simplified molecular-input line-entry system (SMILES) of the compounds in the database and downloaded their partition ratios between organic solvents and water (K_{solvent-water}), as well as between two non-aqueous solvents (K_{solvent-solvent}) for 129 partitioning systems from the UFZ-LSER database. We only considered partition ratios that can be determined experimentally and excluded theoretical partition ratios that may be informative but not feasible to measure in the lab due to miscibility (e.g., K_{methanol-water}). It is important to note that UFZ-LSER cannot calculate partition ratios for charged molecules, for surfactants and for molecules over 1000 Da. This is presented as an error message for these particular molecules in the downloaded data and these molecules were removed from the dataset (final n after removing = 18973 compounds). The partition ratios in UFZ-LSER are calculated from experimentally determined or calculated Abraham descriptors using poly-parameter linear free-energy relationships (PP-LFERs).^30,31 However, since experimentally determined descriptors are often limited and not available for large datasets, we often have to rely on calculated descriptors. For the purposes of our modeling exercise, we used partition ratios estimated with calculated Abraham descriptors.

Finally, we used RDKit³² to calculate the numbers of atoms per molecule (e.g., n of C, H, N etc.), number of RDKit fragments (e.g., ether groups, ester groups etc.), and the number of RDKit bits. RDKit bits differ from fragments in the sense that RDKit bits are pieces of information or substructures that do not always correspond to one functional group. For example, an RDKit bit can be C-C=C-OH, whereas a fragment represents an OH group.

2.3. Converting molecules to arrays of information

One important challenge in computational chemistry is presenting molecular structures as bits of information so that they can be processed by the computer and be used to train a model. We considered two different approaches. In the first, we represented each molecule as a collection of RDKit fragments with count values for each fragment (e.g., 2 fused carbon rings, 2 double-bonded oxygen atoms, 2 primary amine groups etc.). This is then described as an array of integers (e.g., [2, 2, 2… 0, 1, 3]), where each number represents the number of fragments for each fragment type. In the second, we represented each molecule as a collection of binary values, but instead of using RDKit fragments, we used RDKit bits. This was also described as an array of integers (0s and 1s), where 0s denote the absence of a particular bit and 1s the presence of a particular bit (e.g., [0, 1, 1… 1, 0, 0]). The reason why we chose a binary system for this scenario is that computers work with binary; when there are only two options, in this case, 1s and 0s, we minimize the possibility of a signal being misinterpreted and thus minimize potential errors downstream.

2.4. Building and training the model

For the model, we chose an artificial neural network (ANN) that we built with TensorFlow³³, using Python³⁴ as the programming language. Other machine learning models, such as random forest and support vector machine, could also be successful but were not applied in the scope of the current work. The model takes as inputs the partition ratios of each chemical and outputs the number of RDKit fragments or RDKit bits. The network was composed of 1 input layer; 10 hidden layers with 500 nodes each, with rectified linear unit (ReLU) as the activation function for scenario 1 (RDKit fragments) and sigmoid as the activation function for scenario 2 (RDKit bits); 1 dropout layer to control for overfitting; 1 final hidden layer with 500 nodes and exponential as the activation function for scenario 1 (RDKit fragments) and sigmoid as the activation function for scenario 2 (RDKit bits); and 1 output layer. The optimizer was Adamax and the learning rate (optimizing step) was set to 0.001. The model was compiled and run for 200 epochs. We evaluated the model by splitting the dataset into training and testing sets with an 80/20 split (n training set = 13,342 and n testing set = 3,336) and following a shuffle-split 5-fold cross-validation. The model was optimized by minimizing the mean absolute error (MAE) for the predictions in the training and testing sets. The model was evaluated for overfitting by examining the MAE in the training and testing sets across the training epochs (Figure S1). The model was tested on the two scenarios described in the section above. The code and the underlying data are all available on GitHub under https://github.com/dimitriabrahamsson/turbo-chem.
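The shuffle-split scheme can be sketched with the Python standard library alone. The example below is not the code from the repository; it is a minimal illustration in which the function name, the seed and the use of index lists are assumptions. The sample count of 16,678 is simply the sum of the training and testing set sizes stated above.

```python
import random


def shuffle_split(n_samples, test_fraction=0.2, n_splits=5, seed=0):
    """Return n_splits independent (train_idx, test_idx) random partitions."""
    rng = random.Random(seed)
    n_train = int(n_samples * (1.0 - test_fraction))
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random order for every split
        splits.append((idx[:n_train], idx[n_train:]))
    return splits


# 13,342 training + 3,336 testing compounds, as stated in the text.
splits = shuffle_split(n_samples=13342 + 3336)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Unlike a classical k-fold scheme, each split here is drawn independently, so a compound can appear in the testing set of more than one split.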

2.5. Evaluating different combinations of partitioning systems

We applied a permutation analysis to evaluate whether different combinations of solvent partitioning systems are likely to yield different accuracies in the predictions of RDKit fragments or RDKit bits. We randomly selected 10 partition systems and trained and tested the model based on the description in the section above. We repeated the process 5 times, and we evaluated the different permutations by comparing the overall accuracy in predictions for both the training and the testing sets. We evaluated the models and the two scenarios by examining the predictions using MAE and the cross-validation coefficient of determination (Q²) for the RDKit fragments (scenario 1), and predictive accuracy (ACC) for RDKit bits (scenario 2) defined as:

ACC = \frac{TP + TN}{TP + TN + FP + FN}

(3)

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.

To further clarify these metrics, for scenario 1 (RDKit fragments) the MAE and Q² are metrics of the deviation from the right number of fragments. For example, if for a given RDKit fragment in a molecule the true number is 5 and the predicted number is 1 or 10, then that would result in a high MAE and a low Q². Conversely, the closer the predicted number of RDKit fragments is to the true number of RDKit fragments, the lower the MAE and the higher the Q². For scenario 2 (RDKit bits), TP is the number of times the model predicts 1 when the true value is 1 (accurately predicting the presence of an RDKit bit); TN is the number of times the model predicts 0 when the true value is 0 (accurately predicting the absence of an RDKit bit); FP is the number of times the model predicts 1 when the true value is 0 (failing to predict the absence of an RDKit bit); and FN is the number of times the model predicts 0 when the true value is 1 (failing to predict the presence of an RDKit bit).
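The three metrics can be written out in a few lines. The sketch below is an illustration rather than the evaluation code from the repository. ACC follows Eq. 3 and MAE is the usual mean absolute deviation; for Q² we assume the common form 1 − SS_res/SS_tot computed on held-out predictions, which is an assumption because the exact formula is not given in the text. All input arrays are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Eq. 3: (TP + TN) / (TP + TN + FP + FN) for binary bit vectors."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (tp + tn) / (tp + tn + fp + fn)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted fragment counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def q2(y_true, y_pred):
    """Assumed form of Q2: 1 - SS_res / SS_tot on held-out predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical bit predictions: one present bit is missed (a false negative).
print(accuracy([1, 0, 1, 0], [1, 0, 0, 0]))
# Hypothetical fragment counts for three compounds.
print(mean_absolute_error([5, 2, 0], [4, 2, 1]), q2([5, 2, 0], [4, 2, 1]))
```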

2.6. Model testing by simulating database searching

The ultimate goal of our study was to produce a computational workflow that can be used to propose molecular structures based on the predictions made for RDKit fragments and RDKit bits for detected chemical features derived from their physicochemical fingerprints. It is, thus, important to evaluate not only the accuracy of the model to predict RDKit fragments or RDKit bits, but also the likelihood of proposing the right molecular structure after searching in a database for compounds that match the predicted RDKit fragments or RDKit bits. As mentioned earlier, the model takes as inputs the partition ratios of each chemical and outputs the number of RDKit fragments or RDKit bits. Then these RDKit fragments and bits, together with molecular masses, formulas and isotopic patterns, are used to match to candidate structures in the Blood Exposome database.

As a first step in the model validation we conducted a simulation of database searching by applying the following steps: i) created a subset of 100 randomly selected compounds from the dataset, ii) calculated their partition ratios from the UFZ-LSER database, iii) predicted their RDKit fragments and RDKit bits using our machine learning model, iv) searched the Blood Exposome database and matched RDKit fragments and RDKit bits to best matching structures, and v) evaluated the model by comparing the matched structures to the true structures. The process was repeated once for the training set and once for the testing set by selecting 50 random chemical compounds from the training set and 50 chemical compounds from the testing set (total n = 100). Searching the Blood Exposome database and matching of detected chemical features to chemical structures was done using a linear regression model and ranking the candidates based on their similarity using the calculated r-values. The regression model compared the array containing the predicted RDKit fragments and RDKit bits for a detected chemical feature (e.g., [0, 3, 4, 0 … 0] for RDKit fragments or [0, 1, 0, 0, … 1] for RDKit bits) to arrays of chemicals in the database that had the same formula as the detected chemical feature. The candidates were then ranked based on their r-values and the top 1–5 candidates were used for proposing molecular structures.
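The matching step can be sketched as follows: candidates are restricted to database entries with the same formula as the detected feature, each candidate array is compared to the predicted array through its Pearson r-value, and the top candidates are returned. The toy database, the entry names, the four-element arrays and the function names are all hypothetical placeholders and do not correspond to real RDKit output or to entries in the Blood Exposome database.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length arrays."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant array carries no pattern to correlate
    return sxy / math.sqrt(sxx * syy)


def rank_candidates(predicted, database, formula, top_n=5):
    """Rank same-formula database entries by r-value against the prediction."""
    scored = [
        (pearson_r(predicted, entry["vector"]), entry["name"])
        for entry in database
        if entry["formula"] == formula
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical database entries with made-up fragment arrays.
database = [
    {"name": "isomer_A", "formula": "C7H6O2", "vector": [1, 0, 1, 0]},
    {"name": "isomer_B", "formula": "C7H6O2", "vector": [0, 1, 1, 0]},
    {"name": "isomer_C", "formula": "C7H6O2", "vector": [0, 0, 1, 1]},
    {"name": "other_formula", "formula": "C8H8O2", "vector": [1, 0, 1, 0]},
]
predicted = [0.9, 0.1, 1.1, 0.0]  # hypothetical model output for one feature

for r_value, name in rank_candidates(predicted, database, "C7H6O2"):
    print(name, round(r_value, 3))
```

Filtering by formula first keeps the comparison within one set of isomers, which is where the predicted fragments or bits are expected to add information beyond the MS1 data.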

2.7. Testing the model with experimental data

As the final step of model validation, we evaluated the performance of the computational workflow in accurately proposing molecular structures from partition ratios calculated with experimentally determined Abraham descriptors. This was done by applying the following steps: i) created a subset of 100 chemical compounds by randomly selecting 100 compounds from our dataset, ii) downloaded partition ratios calculated with experimentally determined Abraham descriptors for as many chemicals as we could find experimental data for from the UFZ-LSER database²², iii) predicted their RDKit fragments and RDKit bits with our machine learning model, iv) searched the Blood Exposome database and matched RDKit fragments and RDKit bits to the best matching structures, and v) evaluated the model by comparing the matched structures to the true structures.

2.8. Evaluating uncertainty in structure predictions

In order to provide an estimate of uncertainty for the predicted structures from our algorithm, we developed a scoring function that aims to inform the user about the expected confidence in a particular structure. The developed function takes into account (i) the goodness of fit between the predicted RDKit fragments and the RDKit fragments in the Blood Exposome database; and (ii) the expected errors of the ANN at predicting a particular fragment. Our scoring function is described as:

S = R_{i}^{2} - \frac{\sum (MAE_{f} \times n_{f})}{10}

(4)

where,

S is the calculated score (higher values denote higher confidence);

$R_{i}^{2}$ is the coefficient of determination between the predicted RDKit fragments/bits and the matched RDKit fragments/bits from the Blood Exposome database for a given compound (i);

MAE_f is the mean absolute error of the ANN model at predicting a particular fragment/bit (f) based on the calculations for the testing set;

and n_f is the number of occurrences of a given fragment/bit (f) in a given compound (i)

The sum of MAE_f × n_f is divided by a factor of 10 to moderate the effect of $\frac{\sum (MAE_{f} \times n_{f})}{10}$ on S and to avoid generating negative values for S.
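A worked instance of Eq. 4 is given below as a minimal sketch. The function name and all numerical inputs (the R² value, the per-fragment MAE values and the fragment counts) are hypothetical and only illustrate how the two terms combine.

```python
def confidence_score(r2, mae_per_fragment, fragment_counts):
    """Eq. 4: S = R_i^2 - sum(MAE_f * n_f) / 10."""
    if len(mae_per_fragment) != len(fragment_counts):
        raise ValueError("one MAE value is needed per fragment type")
    penalty = sum(m * n for m, n in zip(mae_per_fragment, fragment_counts))
    return r2 - penalty / 10.0


# Hypothetical candidate: good fit (R^2 = 0.95) and three fragment types.
score = confidence_score(
    r2=0.95,
    mae_per_fragment=[0.04, 0.25, 0.10],  # testing-set MAE per fragment type
    fragment_counts=[2, 1, 0],            # occurrences in the candidate
)
print(score)
```

In this example the penalty is (0.04 × 2 + 0.25 × 1 + 0.10 × 0) / 10 = 0.033, so a candidate containing fragments that the ANN predicts poorly receives a lower score than an equally well-fitting candidate built from well-predicted fragments.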

3. RESULTS

3.1. Permutation analysis

As mentioned earlier in the methods section, we used a permutation analysis to examine whether different combinations of partitioning systems influence the accuracy of the model. In that analysis we examined both scenario 1 (RDKit fragments) and scenario 2 (RDKit bits). Both scenarios showed small differences in prediction errors across the 5 permutations (Fig. S2–S5) for both the training and the testing sets. Overall, the prediction errors for scenario 1 were lower than the errors in scenario 2. For scenario 1, permutation D (Fig. S2D and S3D) showed the lowest MAE values in both the training and testing sets. For scenario 2, permutation E (Fig. S4E and S5E) showed the lowest MAE in both the training and the testing sets. The partitioning systems that were generated during the permutation analysis (organic solvents and water) are shown in Table S1.

3.2. Accuracy of predicting RDKit fragments

When examining the accuracy of the predictions for the RDKit fragments (Fig. 4 and 5), among the fragments that were predicted with high accuracy (Fig 4) were halogens, alcohols, tertiary amines, benzene rings and phenols. For these fragments, the Q² ranged from 0.99 to 0.94 for the training set, and from 0.99 to 0.79 for the testing set. For the same fragments, the MAE ranged from 0.03 to 0.09 for the training set, and from 0.04 to 0.25 for the testing set (Fig. 4). Among the fragments that were predicted with lowest accuracy were para-hydroxyl groups, primary amines, methoxy groups, fused carbon rings (bi-cyclic or higher), and anilines. For these fragments, the Q² ranged from 0.86 to 0.52 for the training set and from 0.56 to 0.14 for the testing set (Fig. 5). For the same fragments, the MAE ranged from 0.06 to 0.15 for the training set and from 0.15 to 0.50 for the testing set.

Figure 4: Examples of the RDKit fragments that were predicted with the highest accuracy in terms of Q² and MAE for the training and testing sets (scenario 1). Higher Q² and lower MAE indicate better performance, whereas lower Q² and higher MAE indicate poorer performance. The figure shows all datapoints for the 5-fold cross-validation for permutation 4. Q²: cross-validation coefficient of determination; MAE: mean absolute error; N: number of compounds in the training and testing sets of the 5-fold cross-validation containing the specific fragment shown in the plot.

Figure 5: Examples of the RDKit fragments that were predicted with the lowest accuracy in terms of Q² and MAE for the training and testing sets (scenario 1). Higher Q² and lower MAE indicate better performance, whereas lower Q² and higher MAE indicate poorer performance. The figure shows all datapoints for the 5-fold cross-validation for permutation 4. Q²: cross-validation coefficient of determination; MAE: mean absolute error; N: number of compounds in the training and testing sets of the 5-fold cross-validation containing the specific fragment shown in the plot.

3.3. Accuracy of predicting RDKit bits

Among the RDKit bits that were predicted with the highest accuracy were substructures of aromatic rings. Such RDKit bits were well represented in the dataset: the number of chemicals containing these RDKit bits was about twice as high as the number of chemicals without them (Fig. 6). For these RDKit bits (e.g., aromatic rings; Fig. 6), the accuracy ranged from 0.97 to 0.94 for the training set and from 0.95 to 0.92 for the testing set. Among the RDKit bits that were predicted with the lowest accuracy were aromatic substructures containing heteroatoms, such as O and N. Such RDKit bits were not well represented in the dataset, as only approximately 10% of the chemicals in the dataset contained these RDKit bits (Fig. 7). For these RDKit bits (e.g., aromatic substructures containing heteroatoms; Fig. 7), the accuracy ranged from 0.90 to 0.87 for the training set and from 0.90 to 0.86 for the testing set. While these RDKit bits were predicted with the lowest accuracy, it is important to note that their accuracy (Fig. 7) was only marginally lower than that of the highest accuracy bits (Fig. 6). Finally, the issue with the RDKit bits that were predicted with the lowest accuracy (Fig. 7) was that the model failed to predict their presence in a given compound, resulting in an elevated number of false negatives.
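To make the point about false negatives concrete, the sketch below computes accuracy and the true positive rate for one bit from a confusion matrix. It assumes accuracy is (TP + TN) / N, which is the conventional definition and may differ from the one given in the methods section; the bit vectors are hypothetical and chosen so that the bit is present in 10% of compounds.

```python
def confusion_counts(true_bits, predicted_bits):
    """Count true/false positives and negatives for one fingerprint bit."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_bits, predicted_bits):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of compounds for which the bit was predicted correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def true_positive_rate(tp, fn):
    """Fraction of compounds carrying the bit that the model found."""
    return tp / (tp + fn)


# Hypothetical rare bit: present in 2 of 20 compounds, never predicted.
true_bits = [1, 1] + [0] * 18
predicted_bits = [0] * 20

tp, fp, tn, fn = confusion_counts(true_bits, predicted_bits)
acc = accuracy(tp, fp, tn, fn)
tpr = true_positive_rate(tp, fn)
print(f"ACC = {acc:.2f}, TPR = {tpr:.2f}, false negatives = {fn}")
```

In this toy case a model that never predicts the bit still reaches an accuracy of 0.90, which shows why accuracy alone can look only marginally lower for poorly represented bits even when most of their occurrences are missed.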

Figure 6: Examples of the RDKit bits that were predicted with the highest accuracy in terms of true positive and true negative rates for the training and testing sets (scenario 2). ACC: predictive accuracy calculated as described in the methods section.

Figure 7: Examples of the RDKit bits that were predicted with the lowest accuracy in terms of true positive and true negative rates for the training and testing sets (scenario 2). ACC: predictive accuracy calculated as described in the methods section.

3.4. Simulating database searching with fragments and bits predicted from in silico generated fingerprints

In addition to the accuracy of the two scenarios at predicting the correct number of RDKit fragments or presence of RDKit bits, we also evaluated their predictive power in finding the right structure when searching the Blood Exposome database. Comparing the two scenarios for their predictive power in the simulations with in silico generated fingerprints (Fig. 8A–D), the model built with RDKit fragments (scenario 1) showed a higher success rate for predicting the correct isomer than the model built with RDKit bits (scenario 2). The average success rate for the model built with RDKit fragments ranged from 76 to 99% for the training set (Fig. 8A) and from 67 to 93% for the testing set (Fig. 8B). The average success rate for the model built with RDKit bits ranged from 67 to 95% for the training set (Fig. 8C) and from 58 to 93% for the testing set (Fig. 8D).

Figure 8: Database searching (Blood Exposome) with *in silico* generated fingerprints (A-D) and experimentally generated fingerprints (E-H). The figure shows the % of correct matches per sample and per number of matched isomers from the database (n~20,000, curated version of the Blood Exposome database described in the methods section), starting from matching to the top 1 isomer only up to the top 5 isomers. Searching the Blood Exposome database and matching detected chemical features to chemical structures was done using a linear regression model and ranking the candidates based on their similarity using the calculated r-values. The regression model compared the array containing the predicted RDKit fragments or RDKit bits for a detected chemical feature (e.g., [0, 3, 4, 0 … 0] for RDKit fragments or [0, 1, 0, 0, … 1] for RDKit bits) to the arrays of chemicals in the Blood Exposome database that had the same formula as the detected chemical feature. The candidates were then ranked based on their r-values and the top 1–5 candidates were used for proposing molecular structures. Increasing the number of matched isomers from the Blood Exposome database increases the likelihood that one of these isomers will be the correct isomer, which explains the upward trend of the curves.
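The ranking step described in the caption above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it uses a hand-written Pearson correlation in place of the linear regression model, and the database entries, compound names, and fragment arrays are all hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length arrays."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # r is undefined for a constant array; treat it as no similarity
        # (an assumption made for this sketch).
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_candidates(predicted, formula, database, top_k=5):
    """Rank database entries sharing the feature's formula by r-value."""
    same_formula = [e for e in database if e["formula"] == formula]
    scored = [(pearson_r(predicted, e["fragments"]), e["name"]) for e in same_formula]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical database: three isomers plus one compound with another formula.
database = [
    {"name": "isomer A", "formula": "C12H11N", "fragments": [1, 1, 2, 0, 0]},
    {"name": "isomer B", "formula": "C12H11N", "fragments": [0, 0, 2, 1, 0]},
    {"name": "isomer C", "formula": "C12H11N", "fragments": [1, 1, 2, 0, 1]},
    {"name": "unrelated", "formula": "C6H6O", "fragments": [1, 1, 2, 0, 0]},
]

# Hypothetical model output for one detected feature with formula C12H11N.
predicted = [0.9, 1.1, 2.0, 0.1, 0.0]

ranking = rank_candidates(predicted, "C12H11N", database, top_k=3)
for r_value, name in ranking:
    print(f"{name}: r = {r_value:.3f}")
```

Restricting the comparison to entries with the same formula keeps the candidate list short, so the predicted fragments only need to separate isomers rather than the whole database.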

3.5. Searching the database with fragments and bits predicted from fingerprints generated with experimental data

Furthermore, we evaluated the predictive power of the two scenarios at finding the right molecular structure when searching the Blood Exposome database with RDKit fragments and RDKit bits (model outputs) from experimentally generated fingerprints (model inputs). As mentioned earlier, the partition ratios in the UFZ-LSER database are calculated using either experimentally determined or calculated Abraham descriptors. Comparing the two scenarios (RDKit fragments and RDKit bits) when searching the Blood Exposome database, we observed an expected increase in variability and a decrease in predictive power for both models. However, the model built with RDKit fragments was more resilient to the noise introduced by experimental data than the model built with RDKit bits. The average success rate for the model built with RDKit fragments ranged from 63 to 88% for the training set (Fig. 8E) and from 48 to 84% for the testing set (Fig. 8F). The average success rate for the model built with RDKit bits ranged from 41 to 91% for the training set (Fig. 8G) and from 33 to 88% for the testing set (Fig. 8H).

4. Discussion

4.1. Comparing scenarios 1 and 2

When comparing scenarios 1 and 2 for their ability to make accurate predictions of chemical structures, scenario 1 showed higher predictive power. The difference between the two scenarios was clearer in the database search with RDKit fragments and RDKit bits that were generated from experimental physicochemical fingerprints.

4.2. Evaluating the accuracy in predicting chemical structures

Our findings showed that the expected success rate of the developed algorithm using RDKit fragments ranges from 63 to 88% for chemicals in the training set and from 48 to 84% for chemicals in the testing set. These numbers vary depending on how many matches the user wishes to generate from the Blood Exposome database. If the user chooses to match only to the top matched compound, then the expected overall accuracy is 60% for chemicals in the training set and 48% for chemicals in the testing set. If the user chooses to match to the top 5 compounds in the Blood Exposome database, then the overall accuracy is 86% for chemicals in the training set and 81% for chemicals in the testing set. This practically means that, assuming an overall average accuracy of 60–70%, if one were to follow the proposed workflow and purchase analytical standards for 100 chemical compounds, 60–70% of these compounds would be expected to be identified correctly. This is a substantial improvement considering that in NTA studies the number of compounds confirmed with analytical standards often does not exceed 5–10% of the detected chemical features.^2,9,10
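The dependence of the success rate on the number of matches can be illustrated with a short top-k calculation. The sketch below is only an example of the bookkeeping involved; the ranked candidate lists and compound labels are hypothetical.

```python
def top_k_success_rate(ranked_candidates, true_names, k):
    """Fraction of features whose true structure is among the top k candidates."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_candidates, true_names)
        if truth in ranked[:k]
    )
    return hits / len(true_names)


# Hypothetical ranked candidate lists for four detected features.
ranked_candidates = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
]
# Hypothetical true identities; "z" never appears among the candidates.
true_names = ["a", "e", "i", "z"]

rates = {k: top_k_success_rate(ranked_candidates, true_names, k) for k in (1, 2, 3)}
for k, rate in rates.items():
    print(f"top {k}: {rate:.0%} correct")
```

Because the top k + 1 candidates always include the top k, the success rate can only stay the same or increase as more candidates are accepted, at the cost of more analytical standards to purchase per feature.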

4.3. Examples of successful and failed matches

In an effort to understand where the model built with RDKit fragments succeeds and fails in finding the correct chemical structures, we examined some examples from the searches with the experimental data for both the training and the testing sets. In the first case, we examined two chemical compounds from the training set (Fig. 9), fenpropimorph (Fig. 9A) and praziquantel (Fig. 9B). Fenpropimorph was correctly predicted as fenpropimorph, while praziquantel was predicted as 2-diethylamino-3’-benzyloxyacetanilide. For fenpropimorph, the model correctly predicted the NH0, benzene and ether groups in the molecule, but incorrectly suggested that the molecule also had a bicyclic group with two or more fused carbon rings. For praziquantel, while the model correctly predicted the presence of most RDKit fragments, it failed to predict that the molecule had 3 bicyclic groups. It is important to note that, as shown earlier in Figure 5, bicyclic groups are among the RDKit fragments that are predicted with the lowest accuracy by the model, so it would be expected that the model would fail for molecules with multiple bicyclic groups.

Figure 9: Examples of successful and failed matches using the model built with the RDKit fragments, for two chemicals from the database search with fragments predicted from experimental fingerprints. The two chemicals, fenpropimorph (A) and praziquantel (B), are from the training set of the model. Fenpropimorph (A) was correctly predicted as fenpropimorph (C), while praziquantel (B) was predicted as 2-diethylamino-3’-benzyloxyacetanilide (D). R²: coefficient of determination between predicted RDKit fragments and matched RDKit fragments from the Blood Exposome database. S: score (the calculation is described in the methods section).

When comparing the R² and S values for the two examples, we observed that the successful match (Fig. 9A and 9C) showed higher R² and S values (R² = 0.71 and S = 0.60) compared to the failed match (Fig. 9B and 9D; R² = 0.53 and S = 0.31). In these two examples, both R² and S values point in the right direction in terms of confidence in the predicted chemical structure.

In addition to these two chemicals, we also examined two examples from the testing set: 1-(2-pyridylazo)-2-naphthol and 2-aminobiphenyl (Fig. 10). 1-(2-Pyridylazo)-2-naphthol was correctly predicted as 1-(2-pyridylazo)-2-naphthol, while 2-aminobiphenyl was predicted as 3-aminobiphenyl. For 1-(2-pyridylazo)-2-naphthol, the model predicted correctly, with small discrepancies in the numbers, the aromatic N, the aromatic OH, the NH0 group, the benzene rings, the bicyclic group, and the phenol group with no hydrogen bonding in the ortho position. The discrepancies were: i) the model predicted 2 aromatic N atoms as opposed to 1, ii) 2 NH0 groups as opposed to 3, and iii) an OH group in the para position of one of the benzene rings as opposed to the ortho position. For 2-aminobiphenyl, the model predicted correctly the NH2, aniline and benzene groups. It is important to note that one of the true RDKit fragments for 2-aminobiphenyl is an OH group in the para position of the benzene ring. This appears to be an error generated by RDKit, as 2-aminobiphenyl does not have an OH group and the NH2 group is in the ortho position of the benzene ring. This is an interesting observation, indicating that there may be other similar small errors in the training set. While these errors can generate more noise for the model and worsen predictions, it is notable that the two structures of 2-aminobiphenyl and 3-aminobiphenyl are identical with the exception of the position of the NH2 group.

Figure 10: Examples of successful and failed matches using the model built with the RDKit fragments, for two chemicals from the database search with fragments predicted from experimental fingerprints. The two chemicals, 1-(2-pyridylazo)-2-naphthol (A) and 2-aminobiphenyl (B), are from the testing set of the model. 1-(2-Pyridylazo)-2-naphthol (A) was correctly predicted as 1-(2-pyridylazo)-2-naphthol (C), while 2-aminobiphenyl (B) was predicted as 3-aminobiphenyl (D). R²: coefficient of determination between predicted RDKit fragments and matched RDKit fragments from the Blood Exposome database. S: score (the calculation is described in the methods section).

When comparing the R² and S values for the two examples, we observed that both the successful match (Fig. 10A and 10C) and the failed match (Fig. 10B and 10D) showed high R² values (0.78 and 1, respectively), indicating high confidence in terms of matching. However, in this case, the successful match scored lower than the failed match. This is due to the large errors that are expected for certain RDKit fragments, such as the number of bicyclic groups and the number of OH groups in the para position of the benzene group (Fig. 5).

4.4. Limitations and future considerations

One limitation that needs to be acknowledged is that while the curated version of the Blood Exposome database contains approximately 20,000 compounds, the actual number of distinct molecules in human blood is likely larger. There may be many exogenous compounds, endogenous compounds and endogenously produced transformation products of exogenous compounds whose chemical structures have not yet been determined. These are commonly referred to as the “dark matter” of the metabolome/exposome.^35–37 Another limitation is that, while our models showed good performance at predicting chemical structures that were not included in the training set, the results were limited to the space of the 20,000 compounds that were in the Blood Exposome database. Future studies will focus on expanding the training set of the model by using larger databases, such as EPA’s Chemicals Dashboard²¹ and PubChemLite for Exposomics³⁸, and evaluating the extent to which a trained model can be used to search larger databases. Finally, our study focuses on chemical compounds in blood samples, as these samples are important for exposome-focused or environmental health studies. Expanding the training set from blood samples to a larger database, such as EPA’s Chemicals Dashboard, will allow for the application of this method to environmental samples, including surface water, air, soil and dust.

Another limitation is that our study uses calculated partition ratios that are derived using poly-parameter linear free-energy relationships (PP-LFERs).^30,31 There are some uncertainties associated with these calculations and, in some cases, the difference between experimentally measured partition ratios and partition ratios calculated using PP-LFERs can be over 1 log unit; however, the overall average errors appear to be smaller. For example, in the study of Tülp et al.³¹, comparing experimentally measured partition ratios (n=75) between octanol and water (K_OW) to their PP-LFER calculated K_OW showed a root-mean-squared error (RMSE) of 0.72 log units. The Abraham descriptors used to calculate partition ratios using PP-LFERs can be either experimentally determined or predicted. Experimentally determined Abraham descriptors have been shown to produce estimates of partition ratios with uncertainties of less than 1 log unit.³¹ Calculated Abraham descriptors have been shown to produce estimates of partition ratios with uncertainties that can be over 2 log units depending on the molecular structure; however, the overall average errors appear to be smaller.³⁹ Stenzel et al.³⁹ evaluated the prediction errors of calculated Abraham descriptors for a set of chemicals (n=159) for the partition ratio between polydimethylsiloxane and water (K_PDMS/w) and measured an RMSD of 0.95 log units. We should also note that while these errors are critical in the predictions of partition ratios, for the purposes of our study these errors are relatively benign. The purpose of our workflow is not to provide definitive identifications or to replace identification with analytical standards, but rather to narrow down the list of potential candidates for detected features through non-targeted analysis, so that they can be later confirmed with analytical standards. Misassigned structures are also a common occurrence in MS/MS fragmentation and matching with spectral data, but since confirmation can be done only with analytical standards, these misassignments are not of critical importance.
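The distinction drawn above between the largest individual error and the overall average error can be illustrated with a small RMSE calculation in log units. The log K values below are hypothetical and are not taken from the cited studies.

```python
import math


def rmse(measured, calculated):
    """Root-mean-squared error between two equal-length lists of log K values."""
    squared_errors = [(m - c) ** 2 for m, c in zip(measured, calculated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical log K values for four compounds in one partitioning system.
measured_log_k = [2.0, 3.5, 1.2, 4.8]
calculated_log_k = [2.5, 3.0, 1.2, 5.8]

error_rmse = rmse(measured_log_k, calculated_log_k)
largest_error = max(abs(m - c) for m, c in zip(measured_log_k, calculated_log_k))
print(f"RMSE = {error_rmse:.2f} log units, largest error = {largest_error:.2f} log units")
```

In this toy set one compound is off by a full log unit while the RMSE is about 0.6 log units, which mirrors the pattern of occasional large deviations alongside smaller average errors.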

Our study presents a novel approach for proposing molecular structures in NTA by leveraging MS1 data and physicochemical properties. Our approach can be used in combination with other tools used in structure elucidation, such as the retention time index (RTI) tools⁴⁰ that have also shown promise in narrowing down the list of candidate structures for detected chemical features. Future efforts will examine the potential of integrating MS/MS spectra and RTI tools into our workflow and how they can improve our ability to determine molecular structures. Future efforts will also focus on incorporating additional physicochemical properties, such as the acid dissociation constant (K_a), often expressed as pK_a (−log₁₀ K_a). Measurements of pK_a will allow us to determine which of the detected chemical features are charged (ions). As charged molecules are not expected to interact with non-polar solvents, these features will partition only to the aqueous phase, thus creating unique fingerprints that would be specific to charged molecules.
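As an illustration of how pK_a could indicate whether a feature is charged, the sketch below applies the Henderson–Hasselbalch relationship to estimate the ionized fraction of a compound at a given pH. It assumes a single ionizable group per molecule, and the pK_a values and the pH of 7.4 are hypothetical choices made for the example rather than conditions from this study.

```python
def fraction_ionized_acid(pka, ph):
    """Ionized (deprotonated) fraction of a monoprotic acid at a given pH."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


def fraction_ionized_base(pka_conjugate_acid, ph):
    """Ionized (protonated) fraction of a monoprotic base at a given pH."""
    return 1.0 / (1.0 + 10 ** (ph - pka_conjugate_acid))


# Hypothetical aqueous phase pH and pKa values.
ph = 7.4
strong_acid_fraction = fraction_ionized_acid(pka=4.0, ph=ph)
weak_acid_fraction = fraction_ionized_acid(pka=10.0, ph=ph)
base_fraction = fraction_ionized_base(pka_conjugate_acid=9.5, ph=ph)

print(f"acid, pKa 4.0:  {strong_acid_fraction:.4f} ionized")
print(f"acid, pKa 10.0: {weak_acid_fraction:.4f} ionized")
print(f"base, pKa 9.5:  {base_fraction:.4f} ionized")
```

Under these assumptions, an acid with a pK_a well below the pH of the aqueous phase is almost fully ionized, whereas one with a pK_a well above it remains almost entirely neutral.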

Our approach can be used in combination with other computational tools, such as SIRIUS¹⁷, MS-DIAL/MS-FINDER^18,19, and MetFrag, to assist in narrowing down the list of potential structures for a detected chemical feature. Our approach is complementary to existing approaches as it offers a new angle in structure annotation for non-targeted analysis.

Supplementary Material

supporting information

NIHMS1826615-supplement-supporting_information.docx (581.9 KB, docx)

Acknowledgements

This study was funded by NIH/NIEHS grant numbers K99ES032892, P30ES030284, UG3OD023272, UH3OD023272, P01ES022841, and R01ES027051 and by the US EPA grant numbers RD83543301 and RD83564301.

Footnotes

Supporting information

Figures S1–S7, Tables S1–S2, Text: S1

Conflict of interest

The authors have no known conflict of interest.

6. Data and code availability

All datasets and code developed in this study are available on GitHub under https://github.com/dimitriabrahamsson/turbo-chem

References

(1) Petrick LM; Schiffman C; Edmands WMB; Yano Y; Perttula K; Whitehead T; Metayer C; Wheelock CE; Arora M; Grigoryan H; Carlsson H; Dudoit S; Rappaport SM Metabolomics of Neonatal Blood Spots Reveal Distinct Phenotypes of Pediatric Acute Lymphoblastic Leukemia and Potential Effects of Early-Life Nutrition. Cancer Letters 2019, 452, 71–78. 10.1016/j.canlet.2019.03.007.
(2) Abrahamsson D; Wang A; Jiang T; Wang M; Siddharth A; Morello-Frosch R; Park J-S; Sirota M; Woodruff TJ A Comprehensive Non-Targeted Analysis Study of the Prenatal Exposome. 2021. 10.26434/chemrxiv.13093457.v2.
(3) Li Z; Maier MP; Radke M Screening for Pharmaceutical Transformation Products Formed in River Sediment by Combining Ultrahigh Performance Liquid Chromatography/High Resolution Mass Spectrometry with a Rapid Data-Processing Method. Analytica Chimica Acta 2014, 810, 61–70. 10.1016/j.aca.2013.12.012.
(4) Fromme H; Albrecht M; Appel M; Hilger B; Völkel W; Liebl B; Roscher E PCBs, PCDD/Fs, and PBDEs in Blood Samples of a Rural Population in South Germany. International Journal of Hygiene and Environmental Health 2015, 218 (1), 41–46. 10.1016/j.ijheh.2014.07.004.
(5) Mørck TA; Nielsen F; Nielsen JKS; Siersma VD; Grandjean P; Knudsen LE PFAS Concentrations in Plasma Samples from Danish School Children and Their Mothers. Chemosphere 2015, 129, 203–209. 10.1016/j.chemosphere.2014.07.018.
(6) Trushina E; Mielke MM Recent Advances in the Application of Metabolomics to Alzheimer’s Disease. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 2014, 1842 (8), 1232–1239. 10.1016/j.bbadis.2013.06.014.
(7) Metabolomics Reveals Metabolic Biomarkers of Crohn’s Disease https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006386 (accessed 2021-12-09).
(8) Petrick LM; Schiffman C; Edmands WMB; Yano Y; Perttula K; Whitehead T; Metayer C; Wheelock CE; Arora M; Grigoryan H; Carlsson H; Dudoit S; Rappaport SM Metabolomics of Neonatal Blood Spots Reveal Distinct Phenotypes of Pediatric Acute Lymphoblastic Leukemia and Potential Effects of Early-Life Nutrition. Cancer Letters 2019, 452, 71–78. 10.1016/j.canlet.2019.03.007.
(9) Newton SR; McMahen RL; Sobus JR; Mansouri K; Williams AJ; McEachran AD; Strynar MJ Suspect Screening and Non-Targeted Analysis of Drinking Water Using Point-of-Use Filters. Environmental Pollution 2018, 234, 297–306. 10.1016/j.envpol.2017.11.033.
(10) Moschet C; Anumol T; Lew BM; Bennett DH; Young TM Household Dust as a Repository of Chemical Accumulation: New Insights from a Comprehensive High-Resolution Mass Spectrometric Study. Environ. Sci. Technol. 2018, 52 (5), 2878–2887. 10.1021/acs.est.7b05767.
(11) 6546 LC/Q-TOF, high resolution Q-TOF LC/MS, suspect screening | Agilent https://www.agilent.com/en/product/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-instruments/quadrupole-time-of-flight-lc-ms/6546-lc-q-tof (accessed 2021-12-08).
(12) Orbitrap LC-MS - US https://www.thermofisher.com/us/en/home/industrial/mass-spectrometry/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-systems/orbitrap-lc-ms.html (accessed 2021-12-08).
(13) What’s in an Oil Drop? - MagLab https://nationalmaglab.org/education/magnet-academy/learn-the-basics/stories/what-s-in-an-oil-drop (accessed 2021-12-07).
(14) Horai H; Arita M; Kanaya S; Nihei Y; Ikeda T; Suwa K; Ojima Y; Tanaka K; Tanaka S; Aoshima K; Oda Y; Kakazu Y; Kusano M; Tohge T; Matsuda F; Sawada Y; Hirai MY; Nakanishi H; Ikeda K; Akimoto N; Maoka T; Takahashi H; Ara T; Sakurai N; Suzuki H; Shibata D; Neumann S; Iida T; Tanaka K; Funatsu K; Matsuura F; Soga T; Taguchi R; Saito K; Nishioka T MassBank: A Public Repository for Sharing Mass Spectral Data for Life Sciences. Journal of Mass Spectrometry 2010, 45 (7), 703–714. 10.1002/jms.1777.
(15) Allen F; Pon A; Wilson M; Greiner R; Wishart D CFM-ID: A Web Server for Annotation, Spectrum Prediction and Metabolite Identification from Tandem Mass Spectra. Nucleic Acids Research 2014, 42 (W1), W94–W99. 10.1093/nar/gku436.
(16) Ruttkies C; Schymanski EL; Wolf S; Hollender J; Neumann S MetFrag Relaunched: Incorporating Strategies beyond in Silico Fragmentation. Journal of Cheminformatics 2016, 8 (1), 3. 10.1186/s13321-016-0115-9.
(17) SIRIUS | Lehrstuhl Bioinformatik Jena.
(18) CompMS | MS-DIAL http://prime.psc.riken.jp/compms/msdial/main.html (accessed 2021-12-08).
(19) CompMS | MS-FINDER http://prime.psc.riken.jp/compms/msfinder/main.html (accessed 2021-12-08).
(20) Wang A; Gerona RR; Schwartz JM; Lin T; Sirota M; Morello-Frosch R; Woodruff TJ A Suspect Screening Method for Characterizing Multiple Chemical Exposures among a Demographically Diverse Population of Pregnant Women in San Francisco. Environmental Health Perspectives 2018, 126 (7), 077009. 10.1289/EHP2920.
(21) U.S. Environmental Protection Agency. Chemistry Dashboard https://comptox.epa.gov/dashboard/ (accessed 2021-03-09).
(22) UFZ - LSER Database https://www.ufz.de/index.php?en=31698&contentonly=1&m=0&lserd_data[mvc]=Public/start (accessed 2020-02-17).
(23) Getting Started with the RDKit in Python — The RDKit 2021.09.1 documentation https://www.rdkit.org/docs/GettingStartedInPython.html (accessed 2021-12-16).
(24) rdkit.Chem.Fragments module — The RDKit 2021.09.1 documentation http://rdkit.org/docs/source/rdkit.Chem.Fragments.html (accessed 2021-12-16).
(25) Landrum G RDKit: Using the New Fingerprint Bit Rendering Code. RDKit, 2018.
(26) Jafvert CT; Westall JC; Grieder E; Schwarzenbach RP Distribution of Hydrophobic Ionogenic Organic Compounds between Octanol and Water: Organic Acids. Environ. Sci. Technol. 1990, 24 (12), 1795–1803. 10.1021/es00082a002.
(27) Westall JC; Leuenberger C; Schwarzenbach RP Influence of pH and Ionic Strength on the Aqueous-Nonaqueous Distribution of Chlorinated Phenols. Environ. Sci. Technol. 1985, 19 (2), 193–198. 10.1021/es00132a014.
(28) Sigmund G; Arp HPH; Aumeier BM; Bucheli TD; Chefetz B; Chen W; Droge STJ; Endo S; Escher BI; Hale SE; Hofmann T; Pignatello J; Reemtsma T; Schmidt TC; Schönsee CD; Scheringer M Sorption and Mobility of Charged Organic Compounds: How to Confront and Overcome Limitations in Their Assessment. Environ. Sci. Technol. 2022. 10.1021/acs.est.2c00570.
(29) sklearn.preprocessing.StandardScaler https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (accessed 2022-03-30).
(30) Zissimos AM; Abraham MH; Barker MC; Box KJ; Tam KY Calculation of Abraham Descriptors from Solvent–Water Partition Coefficients in Four Different Systems; Evaluation of Different Methods of Calculation. J. Chem. Soc., Perkin Trans. 2 2002, No. 3, 470–477. 10.1039/B110143A.
(31) Tülp HC; Goss K-U; Schwarzenbach RP; Fenner K Experimental Determination of LSER Parameters for a Set of 76 Diverse Pesticides and Pharmaceuticals. Environ. Sci. Technol. 2008, 42 (6), 2034–2040. 10.1021/es702473f.
(32) Landrum G RDKit: Using the New Fingerprint Bit Rendering Code. RDKit, 2018.
(33) TensorFlow https://www.tensorflow.org/ (accessed 2020-02-17).
(34) Welcome to Python.org https://www.python.org/ (accessed 2020-02-20).
(35) Varki A Account for the “dark Matter” of Biology. Nature 2013, 497 (7451), 565–565. 10.1038/497565a.
(36) Peisl BYL; Schymanski EL; Wilmes P Dark Matter in Host-Microbiome Metabolomics: Tackling the Unknowns–A Review. Analytica Chimica Acta 2018, 1037, 13–27. 10.1016/j.aca.2017.12.034.
(37) da Silva RR; Dorrestein PC; Quinn RA Illuminating the Dark Matter in Metabolomics. PNAS 2015, 112 (41), 12549–12550. 10.1073/pnas.1516878112.
(38) Bolton E; Schymanski E; Kondic T; Thiessen P; Zhang J PubChemLite for Exposomics, 2020. 10.5281/zenodo.4183801.
(39) Stenzel A; Goss K-U; Endo S Prediction of Partition Coefficients for Complex Environmental Contaminants: Validation of COSMOtherm, ABSOLV, and SPARC. Environmental Toxicology and Chemistry 2014, 33 (7), 1537–1543. 10.1002/etc.2587.
(40) Aalizadeh R; Alygizakis NA; Schymanski EL; Krauss M; Schulze T; Ibáñez M; McEachran AD; Chao A; Williams AJ; Gago-Ferrero P; Covaci A; Moschet C; Young TM; Hollender J; Slobodnik J; Thomaidis NS Development and Application of Liquid Chromatographic Retention Time Indices in HRMS-Based Suspect and Nontarget Screening. Anal. Chem. 2021, 93 (33), 11601–11611. 10.1021/acs.analchem.1c02348.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supporting information

NIHMS1826615-supplement-supporting_information.docx^{(581.9KB, docx)}

Data Availability Statement

All datasets and code developed in this study are available on GitHub under https://github.com/dimitriabrahamsson/turbo-chem

[R1] (1).Petrick LM; Schiffman C; Edmands WMB; Yano Y; Perttula K; Whitehead T; Metayer C; Wheelock CE; Arora M; Grigoryan H; Carlsson H; Dudoit S; Rappaport SM Metabolomics of Neonatal Blood Spots Reveal Distinct Phenotypes of Pediatric Acute Lymphoblastic Leukemia and Potential Effects of Early-Life Nutrition. Cancer Letters 2019, 452, 71–78. 10.1016/j.canlet.2019.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] (2).Abrahamsson D; Wang A; Jiang T; Wang M; Siddharth A; Morello-Frosch R; Park J-S; Sirota M; Woodruff TJ A Comprehensive Non-Targeted Analysis Study of the Prenatal Exposome. 2021. 10.26434/chemrxiv.13093457.v2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] (3).Li Z; Maier MP; Radke M Screening for Pharmaceutical Transformation Products Formed in River Sediment by Combining Ultrahigh Performance Liquid Chromatography/High Resolution Mass Spectrometry with a Rapid Data-Processing Method. Analytica Chimica Acta 2014, 810, 61–70. 10.1016/j.aca.2013.12.012. [DOI] [PubMed] [Google Scholar]

[R4] (4).Fromme H; Albrecht M; Appel M; Hilger B; Völkel W; Liebl B; Roscher E PCBs, PCDD/Fs, and PBDEs in Blood Samples of a Rural Population in South Germany. International Journal of Hygiene and Environmental Health 2015, 218 (1), 41–46. 10.1016/j.ijheh.2014.07.004. [DOI] [PubMed] [Google Scholar]

[R5] (5).Mørck TA; Nielsen F; Nielsen JKS; Siersma VD; Grandjean P; Knudsen LE PFAS Concentrations in Plasma Samples from Danish School Children and Their Mothers. Chemosphere 2015, 129, 203–209. 10.1016/j.chemosphere.2014.07.018. [DOI] [PubMed] [Google Scholar]

[R6] (6).Trushina E; Mielke MM Recent Advances in the Application of Metabolomics to Alzheimer’s Disease. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 2014, 1842 (8), 1232–1239. 10.1016/j.bbadis.2013.06.014. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] (7).Metabolomics Reveals Metabolic Biomarkers of Crohn’s Disease https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006386 (accessed 2021 −12 −09). [DOI] [PMC free article] [PubMed]

[R8] (8).Petrick LM; Schiffman C; Edmands WMB; Yano Y; Perttula K; Whitehead T; Metayer C; Wheelock CE; Arora M; Grigoryan H; Carlsson H; Dudoit S; Rappaport SM Metabolomics of Neonatal Blood Spots Reveal Distinct Phenotypes of Pediatric Acute Lymphoblastic Leukemia and Potential Effects of Early-Life Nutrition. Cancer Letters 2019, 452, 71–78. 10.1016/j.canlet.2019.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] (9).Newton SR; McMahen RL; Sobus JR; Mansouri K; Williams AJ; McEachran AD; Strynar MJ Suspect Screening and Non-Targeted Analysis of Drinking Water Using Point-of-Use Filters. Environmental Pollution 2018, 234, 297–306. 10.1016/j.envpol.2017.11.033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] (10).Moschet C; Anumol T; Lew BM; Bennett DH; Young TM Household Dust as a Repository of Chemical Accumulation: New Insights from a Comprehensive High-Resolution Mass Spectrometric Study. Environ. Sci. Technol 2018, 52 (5), 2878–2887. 10.1021/acs.est.7b05767. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] (11).6546 LC/ Q-TOF, high resolution Q-TOF LC/MS, suspect screening | Agilent https://www.agilent.com/en/product/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-instruments/quadrupole-time-of-flight-lc-ms/6546-lc-q-tof (accessed 2021 −12 −08).

[R12] (12).Orbitrap LC-MS - US https://www.thermofisher.com/us/en/home/industrial/mass-spectrometry/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-systems/orbitrap-lc-ms.html (accessed 2021-12-08).

[R13] (13).What’s in an Oil Drop? - MagLab https://nationalmaglab.org/education/magnet-academy/learn-the-basics/stories/what-s-in-an-oil-drop (accessed 2021-12-07).

[R14] (14).Horai H; Arita M; Kanaya S; Nihei Y; Ikeda T; Suwa K; Ojima Y; Tanaka K; Tanaka S; Aoshima K; Oda Y; Kakazu Y; Kusano M; Tohge T; Matsuda F; Sawada Y; Hirai MY; Nakanishi H; Ikeda K; Akimoto N; Maoka T; Takahashi H; Ara T; Sakurai N; Suzuki H; Shibata D; Neumann S; Iida T; Tanaka K; Funatsu K; Matsuura F; Soga T; Taguchi R; Saito K; Nishioka T MassBank: A Public Repository for Sharing Mass Spectral Data for Life Sciences. Journal of Mass Spectrometry 2010, 45 (7), 703–714. 10.1002/jms.1777.

[R15] (15).Allen F; Pon A; Wilson M; Greiner R; Wishart D CFM-ID: A Web Server for Annotation, Spectrum Prediction and Metabolite Identification from Tandem Mass Spectra. Nucleic Acids Research 2014, 42 (W1), W94–W99. 10.1093/nar/gku436.

[R16] (16).Ruttkies C; Schymanski EL; Wolf S; Hollender J; Neumann S MetFrag Relaunched: Incorporating Strategies beyond in Silico Fragmentation. Journal of Cheminformatics 2016, 8 (1), 3. 10.1186/s13321-016-0115-9.

[R17] (17).SIRIUS | Lehrstuhl Bioinformatik Jena.

[R18] (18).CompMS | MS-DIAL http://prime.psc.riken.jp/compms/msdial/main.html (accessed 2021-12-08).

[R19] (19).CompMS | MS-FINDER http://prime.psc.riken.jp/compms/msfinder/main.html (accessed 2021-12-08).

[R20] (20).Wang A; Gerona RR; Schwartz JM; Lin T; Sirota M; Morello-Frosch R; Woodruff TJ A Suspect Screening Method for Characterizing Multiple Chemical Exposures among a Demographically Diverse Population of Pregnant Women in San Francisco. Environmental Health Perspectives 2018, 126 (7), 077009. 10.1289/EHP2920.

[R21] (21).U.S. Environmental Protection Agency. Chemistry Dashboard https://comptox.epa.gov/dashboard/ (accessed 2021-03-09).

[R22] (22).UFZ - LSER Database https://www.ufz.de/index.php?en=31698&contentonly=1&m=0&lserd_data[mvc]=Public/start (accessed 2020-02-17).

[R23] (23).Getting Started with the RDKit in Python — The RDKit 2021.09.1 documentation https://www.rdkit.org/docs/GettingStartedInPython.html (accessed 2021-12-16).

[R24] (24).rdkit.Chem.Fragments module — The RDKit 2021.09.1 documentation http://rdkit.org/docs/source/rdkit.Chem.Fragments.html (accessed 2021-12-16).

[R25] (25).Landrum G RDKit: Using the New Fingerprint Bit Rendering Code. RDKit, 2018.

[R26] (26).Jafvert CT; Westall JC; Grieder E; Schwarzenbach RP Distribution of Hydrophobic Ionogenic Organic Compounds between Octanol and Water: Organic Acids. Environ. Sci. Technol 1990, 24 (12), 1795–1803. 10.1021/es00082a002.

[R27] (27).Westall JC; Leuenberger C; Schwarzenbach RP Influence of pH and Ionic Strength on the Aqueous-Nonaqueous Distribution of Chlorinated Phenols. Environ. Sci. Technol 1985, 19 (2), 193–198. 10.1021/es00132a014.

[R28] (28).Sigmund G; Arp HPH; Aumeier BM; Bucheli TD; Chefetz B; Chen W; Droge STJ; Endo S; Escher BI; Hale SE; Hofmann T; Pignatello J; Reemtsma T; Schmidt TC; Schönsee CD; Scheringer M Sorption and Mobility of Charged Organic Compounds: How to Confront and Overcome Limitations in Their Assessment. Environ. Sci. Technol 2022. 10.1021/acs.est.2c00570.

[R29] (29).sklearn.preprocessing.StandardScaler https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (accessed 2022-03-30).

[R30] (30).Zissimos AM; Abraham MH; Barker MC; Box KJ; Tam KY Calculation of Abraham Descriptors from Solvent–Water Partition Coefficients in Four Different Systems; Evaluation of Different Methods of Calculation. J. Chem. Soc., Perkin Trans 2 2002, No. 3, 470–477. 10.1039/B110143A.

[R31] (31).Tülp HC; Goss K-U; Schwarzenbach RP; Fenner K Experimental Determination of LSER Parameters for a Set of 76 Diverse Pesticides and Pharmaceuticals. Environ. Sci. Technol 2008, 42 (6), 2034–2040. 10.1021/es702473f.

[R32] (32).Landrum G RDKit: Using the New Fingerprint Bit Rendering Code. RDKit, 2018.

[R33] (33).TensorFlow https://www.tensorflow.org/ (accessed 2020-02-17).

[R34] (34).Welcome to Python.org https://www.python.org/ (accessed 2020-02-20).

[R35] (35).Varki A Account for the “Dark Matter” of Biology. Nature 2013, 497 (7451), 565. 10.1038/497565a.

[R36] (36).Peisl BYL; Schymanski EL; Wilmes P Dark Matter in Host-Microbiome Metabolomics: Tackling the Unknowns–A Review. Analytica Chimica Acta 2018, 1037, 13–27. 10.1016/j.aca.2017.12.034.

[R37] (37).da Silva RR; Dorrestein PC; Quinn RA Illuminating the Dark Matter in Metabolomics. PNAS 2015, 112 (41), 12549–12550. 10.1073/pnas.1516878112.

[R38] (38).Bolton E; Schymanski E; Kondic T; Thiessen P; Zhang J PubChemLite for Exposomics, 2020. 10.5281/zenodo.4183801.

[R39] (39).Stenzel A; Goss K-U; Endo S Prediction of Partition Coefficients for Complex Environmental Contaminants: Validation of COSMOtherm, ABSOLV, and SPARC. Environmental Toxicology and Chemistry 2014, 33 (7), 1537–1543. 10.1002/etc.2587.

[R40] (40).Aalizadeh R; Alygizakis NA; Schymanski EL; Krauss M; Schulze T; Ibáñez M; McEachran AD; Chao A; Williams AJ; Gago-Ferrero P; Covaci A; Moschet C; Young TM; Hollender J; Slobodnik J; Thomaidis NS Development and Application of Liquid Chromatographic Retention Time Indices in HRMS-Based Suspect and Nontarget Screening. Anal. Chem 2021, 93 (33), 11601–11611. 10.1021/acs.analchem.1c02348.
