Abstract
In the field of environment and health studies, recent trends have focused on the identification of contaminants of emerging concern (CEC). This is a complex, challenging task, as resources such as compound databases (DBs) and mass spectral libraries (MSLs) covering these compounds are scarce. This is particularly true for semi-polar organic contaminants that have to be derivatized prior to gas chromatography-mass spectrometry (GC-MS) analysis with electron impact ionization (EI), for which it is barely possible to find any records. In particular, there is a severe lack of datasets of GC-EI-MS spectra generated and made publicly available for the purpose of development, validation and performance evaluation of cheminformatics-assisted compound structure identification (CSI) approaches, including novel cutting-edge machine learning (ML)-based approaches [1].
We set out to fill this gap and support machine learning-assisted compound identification, thus aiding cheminformatics-assisted identification of silylated derivatives in GC-MS laboratories working in the field of environment and health. To this end, we have generated 12 datasets of GC-EI-MS spectra, six of which contain GC-EI-MS spectra of trimethylsilyl (TMS) derivatives and six GC-EI-MS spectra of tert-butyldimethylsilyl (TBDMS) derivatives. Four of these datasets, named testing datasets, contain mass spectra acquired by the authors. They are available in full, together with the corresponding metadata. Eight datasets, named training datasets, were derived from mass spectra in the NIST 17 Mass Spectral Library. For these, we have made only the metadata publicly available, for licensing reasons.
For each type of derivative, two testing datasets were generated by acquiring and processing GC-EI-MS spectra, such that they include raw and processed GC-EI-MS spectra of TMS and TBDMS derivatives of CECs, along with their corresponding metadata. The metadata contain the IUPAC name, exact mass, molecular formula, InChI, InChIKey, SMILES and PubChem ID of each CEC and CEC-TMS or CEC-TBDMS derivative, where available. Eight GC-EI-MS training datasets were generated by using the National Institute of Standards and Technology (NIST)/U.S. Environmental Protection Agency (EPA)/National Institutes of Health (NIH) 17 Mass Spectral Library. For each derivative type (TMS and TBDMS), four datasets are given: the original dataset obtained from NIST/EPA/NIH 17 and three variants thereof, obtained after each of the filtering steps of the procedure described below. Only the metadata about the training datasets are available, describing the corresponding NIST/EPA/NIH 17 entries: these include the compound name, CAS Registry number, InChIKey, exact mass, Mw, NIST number and ID number.
The datasets we present here were used to train and test predictive models for identification of silylated derivatives built with ML approaches [2]. The models were built by using data curated from the NIST Mass Spectral Library 17 [3] and the machine learning approach of CSI:Output Kernel Regression (CSI:OKR) [2]. Data from the NIST Mass Spectral Library 17 are commercially available from the National Institute of Standards and Technology (NIST)/U.S. Environmental Protection Agency (EPA)/National Institutes of Health (NIH) and thus cannot be made publicly available. This highlights the need for publicly available GC-EI-MS spectra, which we address by releasing in full the four testing datasets.
Keywords: Silylation, Derivative, Identification, Machine learning, GC-MS, Mass spectrometry
Specifications Table
| Subject | Analytical chemistry, Omics: General |
| Specific subject area | Generation of mass spectral datasets for testing and training of ML-based CSI approaches using mass spectra |
| Type of data | Raw and Table |
| How the data were acquired | The mass spectra in the testing datasets were acquired using an Agilent 7890B/5977A series GC-MSD (Agilent Technologies, USA), in electron impact ionization (EI) mode. Chromatographic separation was achieved on an Agilent DB-5MS UI fused-silica capillary column (30 m x 0.25 mm x 0.25 μm; Agilent Technologies, USA). Data were processed using Mass Hunter Qualitative Analysis v.B.07 (Agilent Technologies, USA). The training datasets were generated from the NIST/EPA/NIH Mass Spectral Library 17 [3] using the accompanying NIST Mass Spectral Search Program (version 2.3) and LIB2NIST converter (NIST, 2011). |
| Data format | The testing GC-EI-MS spectral datasets are given in .txt, .msp and .mgf formats and the accompanying metadata is given in .xlsx format. The metadata regarding the training GC-EI-MS spectral datasets is given in .xlsx format. |
| Description of data collection | The GC-EI-MS spectral datasets that we used for testing ML-based CSI models were generated in the full scan range of m/z 50-800 amu for the TMS derivatives and m/z 50-1000 amu for the TBDMS derivatives. Raw instrument data were reduced to two-dimensional peak lists (m/z, abundance) using Mass Hunter Qualitative Analysis v.B.07 (Agilent Technologies, USA), in which background subtraction was also performed. The GC-EI-MS spectral datasets intended for training of ML-based CSI models were generated from the NIST/EPA/NIH Mass Spectral Library 17: a constrained search was performed using the NIST Mass Spectral Search Program (version 2.3), followed by further processing according to the step-wise procedure described below. |
| Data source location | • Institution: Jožef Stefan Institute, Department of Environmental Sciences • City/Town/Region: Ljubljana • Country: Slovenia |
| Data accessibility | • Repository name: Mendeley Data • Data identification number: DOI:10.17632/j3z5bmvmnd.6 • Direct URL to data: https://data.mendeley.com/datasets/j3z5bmvmnd/6 |
| Related research article | Ljoncheva M., Stepišnik T., Kosjek T., Džeroski S., Machine learning for identification of silylated derivatives from mass spectra (2022) Journal of Cheminformatics 14(1):62. doi: 10.1186/s13321-022-00636-1. |
Value of the Data
• The generated GC-EI-MS datasets provide a comprehensive collection of GC-EI-MS spectra of TMS and TBDMS derivatives of structurally and chemically diverse environmental contaminants, given along with their metadata in universal ready-to-use formats (.txt, .msp, .mgf) for further cheminformatics-based processing.
• The generated GC-EI-MS datasets are of value for environmental and exposomics researchers, as well as for the CSI and ML communities interested in the development of new CSI tools.
• Few datasets of mass spectra are publicly available. This is especially true for GC-EI-MS spectra, which makes the generated data even more valuable.
• Both the testing and the training data can be further used on their own or as part of larger datasets, for training, testing and validation in the development of novel CSI approaches, for challenging existing approaches, and for performance comparison of novel and existing CSI approaches, especially ML-based ones.
• The data can be used as a stand-alone database (or joined with other in-house databases of GC-EI-MS spectra), serving as a valuable reference during suspect screening and non-targeted environmental analysis.
1. Introduction
The NIST/EPA/NIH 17 Mass Spectral Library [3] was used to generate training GC-EI-MS spectral datasets of TMS and TBDMS derivatives, which were then used for building ML-based CSI approaches. As the NIST/EPA/NIH 17 Mass Spectral Library [3] is commercially available and licensed under the United States Department of Commerce Copyright, the training GC-EI-MS spectral datasets themselves cannot be made publicly available. Instead, we provide metadata files for each dataset, containing the name, InChIKey, CAS Registry number, exact mass, molecular weight (Mw), NIST number and ID number for each GC-EI-MS spectrum. For each derivative type (TMS and TBDMS), four GC-EI-MS datasets were generated: the first ones (TMS_0.1 and TBDMS_0.1) contain the GC-EI-MS spectra initially extracted from the NIST/EPA/NIH 17 Mass Spectral Library (“Metadata_training_TMS_0.1”, “Metadata_training_TBDMS_0.1”), followed by TMS_1.3 and TBDMS_1.3, resulting from the first filtering step of the approach described below (“Metadata_training_TMS_1.3”, “Metadata_training_TBDMS_1.3”), TMS_2.3 and TBDMS_2.3, resulting from the second filtering step (“Metadata_training_TMS_2.3”, “Metadata_training_TBDMS_2.3”), and TMS_3.3 and TBDMS_3.3, resulting from the final, third filtering step (“Metadata_training_TMS_3.3”, “Metadata_training_TBDMS_3.3”). Using the given metadata and the procedure described below (and in more detail by Ljoncheva et al. [2]), the training GC-EI-MS spectral datasets can be reconstructed.
Standard solutions of selected environmental contaminants (104 compounds), listed in Table 1, were used for generating the TMS and TBDMS test datasets. The presented data consist of .txt, .msp and .mgf data files for each of the four MS datasets (Test dataset_TMS_RAW, Test dataset_TMS_BS, Test dataset_TBDMS_RAW and Test dataset_TBDMS_BS), in which each of the GC-EI-MS spectra is recorded with the compound name, InChIKey, Mw, molecular formula (MF), CAS Registry number and a list of peaks represented as m/z and intensities. Metadata of the TMS/TBDMS derivatives and the corresponding parent CEC are given in .xlsx files containing the IUPAC name, exact mass, MF, InChI, InChIKey, SMILES and PubChem ID, when available (“Metadata_test_TMS derivatives.xlsx” and “Metadata_test_TBDMS derivatives.xlsx”). The four datasets were used to test predictive models for identification of silylated derivatives, built with ML approaches.
Table 1.
CEC for generation of the TMS and TBDMS datasets.
| CEC included in TMS datasets only | |
|---|---|
| nitroxoline; 17α-hydroxyprogesterone; 6β-hydroxypregnenolone; 5-androstene-3β,17β-diol; 5α-dehydrotestosterone (boldenone); 11α-hydroxytestosterone; 11α-hydroxyandrostenedione; methamphetamine; nylidrin; (-)-quinic acid | shikimic acid; meso-erythritol; amphetamine; 6-nitroguaiacol; estriol; codeine; L-tyrosine; L-ascorbic acid; cannabidiolic acid; bisphenol FL; butylated hydroxytoluene |
| CEC included in TBDMS datasets only | |
| 4,6-dinitroguaiacol | mycophenolic acid |
| CEC included in both TMS and TBDMS datasets | |
| bisphenol A; benzoic acid; mecoprop; 4,4′-biphenol; 4,4′-dihydroxydiphenyl ether; 4,4′-isopropylidenebis(2,6-dimethylphenol); 2,4′-dihydroxydiphenylmethane (24BPF); bisphenol AF; bisphenol AP; bisphenol C; bisphenol E; bisphenol F; bisphenol M; bisphenol BP; bisphenol P; bisphenol S; bisphenol Z; 2,2′-methylenediphenol (22BPF); dihydrotestosterone (stanolone); (±)-11-hydroxy-Δ9-tetrahydrocannabinol; (±)-11-nor-9-carboxy-Δ9-tetrahydrocannabinol; sulfanilamide; adipic acid; 4-tert-octylphenol; 9-hydroxyfluorene; L-leucine; L-serine; (-)-Δ9-tetrahydrocannabinol; (-)-Δ9-tetrahydrocannabinolic acid; trans-3′-hydroxycotinine; benzoylecgonine; bisphenol CL; bisphenol PH; 8-hydroxyquinoline; 2-anilinophenylacetic acid; 4-nitroguaiacol; 5-nitroguaiacol; catechol; 3-methylcatechol; 3-methyl-5-nitrocatechol | 2-benzyl-4-chlorophenol; citric acid monohydrate; 4-cumylphenol; 2,4-dihydroxybenzophenone; estrone; 17β-estradiol; m-coumaric acid; p-coumaric acid; o-coumaric acid; triclosan; (+)-cannabidiol; cannabinol; cannabichromene; morphine; 6-monoacetylmorphine; carbamazepine; isopropylparaben; bisphenol B; 17α-ethynyl estradiol; 4-hydroxybenzophenone; 2,2′-dihydroxy-4-methoxybenzophenone (BP-8); clofibric acid; ibuprofen; naproxen; ketoprofen; diclofenac; methylparaben; ethylparaben; propylparaben; butylparaben; isobutylparaben; benzylparaben; 4-nonylphenol; phenylacetic acid; resorcinol; salicylic acid; urea; 4-nitrocatechol; syringol; 4-nitrosyringol; etofylline |
Note that for some of the spectra in the test datasets, spectra of the corresponding compounds also appear in the training datasets. Evaluation of the models generated by ML was conducted separately for compounds whose spectra appear in the training datasets and for those whose spectra do not, and the results are reported separately by Ljoncheva et al. [2]. In the metadata files for the testing datasets (“Metadata_test_TMS derivatives.xlsx” and “Metadata_test_TBDMS derivatives.xlsx”), an extra column (the last one) indicates whether a spectrum of the compound at hand also appears in the corresponding training dataset and the respective metadata file (“Metadata_training_TMS_3.3”, “Metadata_training_TBDMS_3.3”).
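The overlap flag can also be cross-checked directly from the metadata. The following is a minimal sketch, assuming the metadata sheets have been exported to lists of row dictionaries; the column name `InChIKey` and the function name are illustrative assumptions, not the actual column headers of the released files.

```python
def split_by_training_overlap(test_rows, training_rows, key="InChIKey"):
    """Split test metadata rows into (seen, unseen) lists.

    A test row counts as "seen" when its InChIKey also occurs in the
    training metadata. The column name is an assumption for illustration.
    """
    training_keys = {
        (row.get(key) or "").strip() for row in training_rows
    }
    training_keys.discard("")  # ignore rows without an InChIKey

    seen, unseen = [], []
    for row in test_rows:
        test_key = (row.get(key) or "").strip()
        if test_key and test_key in training_keys:
            seen.append(row)
        else:
            unseen.append(row)
    return seen, unseen
```

Rows without an InChIKey are treated as unseen here, which is a design choice of this sketch rather than something stated in the article.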
The predictive models for CSI of silylated derivatives were built by ML approaches from training datasets of GC-EI-MS spectra of TMS and TBDMS derivatives, which are not publicly available, as they were curated from the commercially available NIST/EPA/NIH Mass Spectral Library 17 [3], licensed under the United States Department of Commerce Copyright. NIST's end-user license for the NIST 17 MSL restricts its use to a single computer that is not accessible by more than one person. While the training datasets themselves cannot be made publicly available, we make available the corresponding metadata: with licensed access to NIST MSL 17, they can be used to reconstruct the training datasets by following the workflow summarized below and described by Ljoncheva et al. [2].
The ML approach used to build models for the identification of silylated derivatives from these data was CSI:OKR [2].
2. Experimental Design, Materials and Methods
2.1. Experimental design and generation of training datasets
Initial versions of the TMS and TBDMS datasets (TMS_0.1 and TBDMS_0.1) were generated by extracting all GC-EI-MS spectra of TMS and TBDMS derivatives of small molecules, respectively, from the NIST/EPA/NIH 17 Mass Spectral Library [3]. The first constrained search for GC-EI-MS TMS spectra, using the constraints “name fragment: trimethylsilyl” and “elements allowed: Si”, resulted in a collection of 9958 entries, while for GC-EI-MS TBDMS spectra, the constraints “name fragment: tertbutyldimethylsilyl” and “elements allowed: Si” resulted in an initial dataset of 2238 entries. Entries were extracted in .msp file format and subsequently converted to .txt format using the LIB2NIST conversion tool (NIST, 2011). Each GC-EI-MS entry included the compound name, InChIKey, MF, Mw, exact mass, CAS number, NIST ID and MS peak list. GC-EI-MS spectra of TMS/TBDMS derivatives with erroneous metadata (name, molecular formula, InChIKey) that did not correspond to the analyzed compound were excluded from the dataset.
The TMS/TBDMS GC-EI-MS spectral datasets were further filtered using a three-step spectral filtering process:
1) Exclusion of chemical irregularities. The GC-EI-MS spectra of TMS or TBDMS derivatives of compounds not susceptible to derivatization, defined by the absence of functional group(s) amenable to silylation, were filtered out. The functional groups amenable to silylation are those containing an active hydrogen, i.e., carboxyl, hydroxyl, amine and thiol.
2) Exclusion of high-molecular-mass TMS or TBDMS derivatives. The GC-EI-MS spectra of TMS or TBDMS derivatives of CEC with molecular mass ≥ m/z 1000 were eliminated, since they lie above the working linear range of the GC-MS instruments.
3) Exclusion of insufficient-quality GC-EI-MS spectra. The following GC-EI-MS spectra were excluded:
- GC-EI-MS spectra not acquired up to an upper m/z of at least the Mw of the derivative + 10 amu;
- GC-EI-MS spectra that do not contain both the molecular ion [M]+ peak and at least one of the isotope peaks, such as the 13C isotope peak;
- GC-EI-MS spectra that contain neither peaks of fragment ions specific for TMS groups (m/z 73, 147, 221 and 295, corresponding to one, two, three and four TMS groups, respectively) nor for TBDMS groups (m/z 115, 230 and 345, corresponding to one, two and three TBDMS groups, respectively); and
- GC-EI-MS spectra not containing at least five fragment ion peaks.
As a result, the final version of the TMS dataset consists of 4648 TMS GC-EI-MS spectra, while the final version of the TBDMS dataset consists of 1883 GC-EI-MS spectra. For each of the GC-EI-MS spectra in the final TMS and TBDMS datasets, the m/z range extended from m/z 50 to the Mw of the derivative ± 10 amu. For data parsing, all fragment ions with intensity 0 were removed from the refined TMS and TBDMS datasets.
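The third filtering step lends itself to a programmatic check. The sketch below is one possible reading of the four quality criteria, not the authors' implementation. It assumes unit-mass peak lists given as `{m/z: intensity}` dictionaries, a nominal integer Mw, an explicitly supplied upper scan limit, isotope peaks at M+1 or M+2, and fragment ions counted as peaks below the molecular ion. All function and parameter names are hypothetical.

```python
# Marker ions named in the filtering procedure.
TMS_MARKERS = (73, 147, 221, 295)
TBDMS_MARKERS = (115, 230, 345)


def passes_quality_filter(peaks, mw, scan_upper_mz, markers, min_fragments=5):
    """Return True if a spectrum survives the four quality criteria.

    peaks         -- dict mapping integer m/z to intensity (assumed layout)
    mw            -- nominal molecular weight of the derivative
    scan_upper_mz -- upper limit of the acquisition range (assumed known)
    markers       -- silyl-specific fragment ions for the derivative type
    """
    # Zero-intensity entries are ignored, mirroring their removal from the data.
    present = {mz for mz, intensity in peaks.items() if intensity > 0}

    # 1. Acquisition range must reach at least Mw + 10.
    if scan_upper_mz < mw + 10:
        return False

    # 2. Molecular ion and at least one isotope peak (M+1 or M+2 assumed).
    has_isotope = any(mz in present for mz in (mw + 1, mw + 2))
    if mw not in present or not has_isotope:
        return False

    # 3. At least one silyl-specific fragment ion.
    if not any(mz in present for mz in markers):
        return False

    # 4. At least five fragment ions (peaks below the molecular ion, assumed).
    fragments = {mz for mz in present if mz < mw}
    return len(fragments) >= min_fragments
```

A TMS spectrum would be checked with `TMS_MARKERS` and a TBDMS spectrum with `TBDMS_MARKERS`; how the original workflow determined the upper acquisition limit of each library entry is not stated, so it is passed in explicitly here.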
2.2. Chemical analysis for generation of test datasets
2.2.1. Chemicals and materials
From the in-house pool of reference standards, 104 CEC were selected as environmentally relevant, according to the criteria of the Regulation (EC) No.1907/2006 of the European Parliament and the Council of 18 December 2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), Annex III [4]. These are listed in Table 1.
The selected compounds had to satisfy at least three of the following five criteria: 1) Positioning: the compound is present in the US EPA Comptox Chemistry Dashboard (CCD) [5], the most comprehensive repository of EE constituents; 2) Persistence: compound's half-life in fresh or estuarine water > 40 days; 3) Bioaccumulation: BAF and/or BCF > 2000, or in absence of such data, logKow ≥ 5.0; 4) Mobility: compound's water solubility ≥ 0.15 mg/L and log Koc ≤ 4.0, i.e. between -10.0 and 4.0; and 5) EcoToxicity: long-term no-observed-effect concentration (NOEC) for marine or freshwater organisms < 0.01 mg/L. Further details of the selection procedure are given by Ljoncheva et al. [2].
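The "at least three of five" rule can be expressed compactly in code. The following is a hedged sketch of that rule using the thresholds quoted above; the `CompoundProperties` container and its field names are hypothetical, and treating a missing value as "criterion not met" is an assumption of this sketch, not a statement about the original selection procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompoundProperties:
    """Hypothetical container for the properties used in the selection."""
    in_comptox: bool = False
    half_life_water_days: Optional[float] = None
    baf: Optional[float] = None
    bcf: Optional[float] = None
    log_kow: Optional[float] = None
    water_solubility_mg_l: Optional[float] = None
    log_koc: Optional[float] = None
    noec_mg_l: Optional[float] = None


def criteria_met(p):
    """Evaluate the five criteria; missing data counts as not met (assumption)."""
    persistence = p.half_life_water_days is not None and p.half_life_water_days > 40

    # BAF and/or BCF > 2000; log Kow is used only when neither is available.
    if p.baf is not None or p.bcf is not None:
        bioaccumulation = any(v is not None and v > 2000 for v in (p.baf, p.bcf))
    else:
        bioaccumulation = p.log_kow is not None and p.log_kow >= 5.0

    mobility = (
        p.water_solubility_mg_l is not None
        and p.water_solubility_mg_l >= 0.15
        and p.log_koc is not None
        and -10.0 <= p.log_koc <= 4.0
    )
    ecotoxicity = p.noec_mg_l is not None and p.noec_mg_l < 0.01

    return {
        "positioning": bool(p.in_comptox),
        "persistence": persistence,
        "bioaccumulation": bioaccumulation,
        "mobility": mobility,
        "ecotoxicity": ecotoxicity,
    }


def is_selected(p, minimum=3):
    """A compound is selected when at least `minimum` criteria are satisfied."""
    return sum(criteria_met(p).values()) >= minimum
```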
Individual stock solutions of each CEC at a concentration of approximately 150 μg/mL were prepared in acetonitrile (ACN), ethyl acetate (EtAc) or methanol (MeOH), depending on CEC solubility. From these, individual working solutions (IWS) at a concentration of 1 µg/mL were prepared and used within 7 days. The CEC included in each GC-EI-MS dataset are listed in Table 1.
2.2.2. Derivatization and analysis
For each CEC, 470 µL EtAc and 30 µL of derivatization agent were added to 500 µL of IWS: N,O-bis(trimethylsilyl)trifluoroacetamide with 1% trimethylchlorosilane (BSTFA + 1% TMCS) for generation of TMS derivatives, and N-tert-butyldimethylsilyl-N-methyltrifluoroacetamide (MTBSTFA) with 1% TMCS (MTBSTFA + 1% TMCS) for TBDMS derivatives. For CECs whose IWS are in ACN or MeOH, the 500 µL IWS was dried under N2 flow and reconstituted in 970 µL EtAc, to which 30 µL of the appropriate derivatization agent was added; derivatization proceeded at a predefined reaction temperature and duration. In total, 106 TMS derivatives of 102 CEC (4 CEC resulted in two TMS derivatives each, namely salicylic acid, dihydrotestosterone (stanolone), sulfanilamide and 5α-androsten-3β, 17β-diol) and 85 TBDMS derivatives of 83 CEC (two with two TBDMS derivatives each, namely sulfanilamide and L-serine) were generated, with molecular weights of derivatives ranging up to 650 amu.
GC-EI-MS spectra were acquired on Agilent 7890B/5977A series GC-MSD (Agilent Technologies, USA). Separation was achieved on Agilent DB-5MS UI fused-silica capillary column (30 m x 0.25 mm x 0.25 μm; Agilent Technologies, USA). He of 99.99999% purity at the flow rate of 1.2 mL/min was used as a carrier gas. The manifold, ion source and transfer line temperatures were set at 230°C, 150°C and 250°C, respectively. Injections (1 µL) were performed in the splitless mode. Depending upon compound properties, one of the following column oven temperature programs was used for generation of GC-EI-MS spectra of TMS derivatives: (1) initial temperature 70 °C (held 1 min), ramped at 15 °C/min to 280 °C (held 1 min); total runtime: 16 min; (2) initial temperature 70 °C (held 1 min), ramped at 20 °C/min to 240 °C (held 1 min), at 12 °C/min to 310 °C (held 2 min); total runtime: 18.3 min; (3) initial temperature 70 °C (held 1 min), ramped at 20 °C/min to 240 °C (held 1 min), at 12 °C/min to 310 °C (held 4 min); total runtime: 20.3 min. For generation of GC-EI-MS spectra of TBDMS derivatives, the following column oven temperature program was used: initial temperature 70 °C (held 1 min), ramped at 10 °C/min to 240 °C (held 1 min), at 10°C/min to 310 °C (held 5 min); total runtime: 31 min.
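The total runtimes quoted for the oven programs follow from the holds and ramps. As a small worked check, the sketch below sums the initial hold, each ramp duration (temperature difference divided by ramp rate) and each subsequent hold; the function name and the tuple layout are illustrative choices.

```python
def oven_runtime(initial_temp, initial_hold, ramps):
    """Total runtime in minutes of a GC oven temperature program.

    ramps is a list of (rate_c_per_min, target_temp_c, hold_min) tuples,
    applied in order starting from initial_temp.
    """
    total = initial_hold
    current = initial_temp
    for rate, target, hold in ramps:
        total += (target - current) / rate + hold
        current = target
    return total


# Programs as described in the text: (initial temp, initial hold, ramps).
TMS_PROGRAM_1 = (70, 1, [(15, 280, 1)])
TMS_PROGRAM_2 = (70, 1, [(20, 240, 1), (12, 310, 2)])
TMS_PROGRAM_3 = (70, 1, [(20, 240, 1), (12, 310, 4)])
TBDMS_PROGRAM = (70, 1, [(10, 240, 1), (10, 310, 5)])

for name, program in [
    ("TMS (1)", TMS_PROGRAM_1),
    ("TMS (2)", TMS_PROGRAM_2),
    ("TMS (3)", TMS_PROGRAM_3),
    ("TBDMS", TBDMS_PROGRAM),
]:
    print(f"{name}: {oven_runtime(*program):.1f} min")
```

Rounded to one decimal, the computed values are 16.0, 18.3, 20.3 and 31.0 min, in agreement with the stated total runtimes.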
The MSD was operated in EI mode (70 eV) by scanning over the mass range of m/z 50-800 amu for TMS derivatives and m/z 50-1000 amu for TBDMS derivatives. Between the acquisitions of the derivatized standards, EtAc was run as a solvent check to assess potential background interferences and was used for background subtraction as part of the post-acquisition processing of the GC-EI-MS spectra.
The retention times (Rt) of the TMS and TBDMS derivatives are given in Table 2 and Table 3, respectively.
Table 2.
Rts of CEC-TMS derivatives.
| TMS derivative | Rt (min) | TMS derivative | Rt (min) |
|---|---|---|---|
| GC-MS acquisition method (1) | |||
| benzoic acid TMS | 6.669 | propylparaben TMS | 10.300 |
| methylparaben TMS | 8.930 | isobutylparaben TMS | 10.730 |
| salicylic acid-bis TMS | 9.020 | butylparaben TMS | 11.080 |
| ethylparaben TMS | 9.520 | shikimic acid TMS | 11.260 |
| isopropylparaben TMS | 9.760 | quinic acid TMS | 11.620 |
| ibuprofen TMS | 9.960 | triclosan TMS | 13.560 |
| mecoprop TMS | 10.040 | benzylparaben TMS | 13.840 |
| cannabidiol TMS | 14.200 | ||
| GC-MS acquisition method (2) | |||
| resorcinol TMS | 6.750 | carbamazepine TMS | 12.360 |
| clofibric acid TMS | 8.070 | diclofenac TMS | 12.530 |
| 9-hydroxyfluorene TMS | 9.440 | cannabichromene TMS | 12.805 |
| 4-cumylphenol TMS | 9.883 | Δ9-tetrahydrocannabinol TMS | 13.041 |
| naproxen TMS | 11.070 | cannabinol TMS | 13.630 |
| 4-nonylphenol TMS | 10.010 | Δ9-tetrahydrocannabinolic acid TMS | 14.882 |
| 2,4-dihydroxybenzophenone TMS | 11.130 | estrone 2TMS | 14.882 |
| 4,4′-biphenol 2TMS | 11.490 | estradiol TMS | 15.083 |
| ketoprofen TMS | 11.510 | ethinylestradiol TMS | 15.142 |
| sulfanilamide 2TMS | 11.910 | estriol 3TMS | 16.145 |
| 2,2′-dihydroxy-4-methoxybenzophenone TMS | 12.108 | ||
| GC-MS acquisition method (3) | |||
| phenylacetic acid TMS | 6.203 | bisphenol E 2TMS | 11.692 |
| | | bisphenol A 2TMS | 11.874 |
| catechol 2TMS | 6.329 | L-serine TMS | 11.902 |
| L-serine 3TMS | 6.571 | bisphenol C 2TMS | 11.932 |
| syringol TMS | 6.908 | bisphenol B 2TMS | 12.462 |
| 3-methylcatechol TMS | 6.928 | benzoylecgonine TMS | 12.582 |
| urea 2TMS | 7.124 | methamphetamine TMS | 12.590 |
| erythritol 4TMS | 7.542 | 4,4′-isopropylidenebis(2,6-dimethylphenol) TMS | 13.432 |
| adipic acid TMS | 7.632 | bisphenol CL 2TMS | 13.571 |
| 8-hydroxyquinoline TMS | 7.803 | nylidrin TMS | 13.579 |
| 6-nitroguaiacol TMS | 8.202 | codeine TMS | 13.811 |
| 4-octylphenol TMS | 8.450 | morphine 2TMS | 14.112 |
| 4-nitroguaiacol TMS | 8.623 | bisphenol Z 2TMS | 14.462 |
| 5-nitroguaiacol TMS | 8.728 | 6-monoacetylmorphine TMS | 14.562 |
| 4-nitrocatechol TMS | 9.044 | 5-androstene-3β,17β-diol TMS | 14.599 |
| citric acid TMS | 9.330 | quinic acid TMS | 14.623 |
| p-coumaric acid TMS | 9.359 | 5-androstene-3β,17β-diol 2TMS | 14.641 |
| 4-nitrosyringol TMS | 9.475 | 11-hydroxytetrahydrocannabinol 2TMS | 14.652 |
| 3-methyl-5-nitrocatechol TMS | 9.538 | stanolone TMS | 14.757 |
| L-leucine TMS | 9.629 | stanolone 2TMS | 14.873 |
| m-coumaric acid TMS | 9.728 | bisphenol S 2TMS | 14.971 |
| butylated hydroxytoluene TMS | 9.722 | bisphenol AP 2TMS | 15.183 |
| trans-3’-hydroxycotinine TMS | 9.842 | boldenone TMS | 15.420 |
| clorophene TMS | 10.002 | 11-nor-9-tetrahydrocannabinol 2TMS | 15.553 |
| o-coumaric acid TMS | 10.085 | L-tyrosine TMS | 15.704 |
| nitroxoline TMS | 10.149 | L-ascorbic acid TMS | 15.846 |
| 2,2-bisphenol F 2TMS | 10.231 | 11α-hydroxyandrostenedione TMS | 15.988 |
| bisphenol AF TMS | 10.492 | 11α-hydroxytestosterone TMS | 16.188 |
| amphetamine TMS | 10.551 | bisphenol M 2TMS | 16.193 |
| 2,4-bisphenol F 2TMS | 10.863 | 6β-hydroxypregnenolone TMS | 16.693 |
| 2-anilinophenylacetic acid TMS | 10.885 | 17α-hydroxyprogesterone TMS | 17.135 |
| 2,4-dihydroxybenzophenone TMS | 11.132 | bisphenol P 2TMS | 17.242 |
| shikimic acid TMS | 11.271 | bisphenol BP 2TMS | 17.695 |
| etofylline TMS | 11.253 | bisphenol PH 2TMS | 17.832 |
| 4,4′-dihydroxydiphenyl ether 2TMS | 11.423 | cannabidiolic acid TMS | 18.269 |
| bisphenol F 2TMS | 11.512 | bisphenol FL 2TMS | 18.902 |
Table 3.
Rts of CEC-TBDMS derivatives.
| TBDMS derivative | Rt (min) | TBDMS derivative | Rt (min) |
|---|---|---|---|
| benzoic acid TBDMS | 8.139 | benzoylecgonine TBDMS | 17.724 |
| benzeneacetic acid TBDMS | 8.433 | cannabichromene TBDMS | 17.851 |
| methylparaben TBDMS | 10.327 | citric acid TBDMS | 17.956 |
| 8-hydroxyquinoline TBDMS | 10.738 | tetrahydrocannabinol TBDMS | 18.040 |
| clofibric acid TBDMS | 11.011 | 3,4-dehydroTBDMS | 18.093 |
| ethylparaben TBDMS | 11.032 | cannabidiol TBDMS | 18.566 |
| resorcinol TBDMS | 11.074 | DHDPE TBDMS | 18.587 |
| isopropylparaben TBDMS | 11.316 | bisphenol F TBDMS | 18.619 |
| ibuprofen TBDMS | 11.401 | bisphenol E TBDMS | 18.756 |
| mecoprop TBDMS | 11.506 | 4,4’-bisphenol TBDMS | 18.787 |
| 4-tertoctylphenol TBDMS | 11.727 | bisphenol 8 TBDMS | 18.798 |
| propylparaben TBDMS | 12.032 | cannabinol TBDMS | 18.819 |
| salicylic acid TBDMS | 12.379 | bisphenol A TBDMS | 18.977 |
| adipic acid TBDMS | 12.442 | morphine TBDMS | 19.134 |
| isobutylparaben TBDMS | 12.590 | bisphenol B TBDMS | 19.566 |
| butylparaben TBDMS | 13.063 | bisphenol C TBDMS | 19.639 |
| 4-nonylphenol TBDMS | 14.462 | 6-monoacetylmorphine TBDMS | 20.008 |
| 9-hydroxyfluorene TBDMS | 13.463 | estrone TBDMS | 20.534 |
| erythritol TBDMS | 13.610 | bisphenol CL TBDMS | 20.839 |
| trans-3’-hydroxycotinine TBDMS | 14.042 | BP26DM TBDMS | 20.860 |
| 4-cumylphenol TBDMS | 14.273 | 11-hydroxytetrahydrocannabinol TBDMS | 21.144 |
| chlorophene TBDMS | 14.389 | ethinyl estradiol TBDMS | 21.144 |
| 4-hydroxybenzophenone TBDMS | 15.725 | tetrahydrocannabinolic acid TBDMS | 21.249 |
| naproxen TBDMS | 15.757 | bisphenol Z TBDMS | 21.733 |
| triclosan TBDMS | 16.199 | 11-nor-9-tetrahydrocannabinol TBDMS | 22.154 |
| sulfanilamide TBDMS | 16.399 | bisphenol S TBDMS | 22.607 |
| 2,2-bisphenol F TBDMS | 16.578 | bisphenol AP TBDMS | 22.649 |
| ketoprofen TBDMS | 16.883 | estradiol TBDMS | 23.059 |
| benzylparaben TBDMS | 16.925 | bisphenol M TBDMS | 24.185 |
| bisphenol AF TBDMS | 17.114 | bisphenol P TBDMS | 26.479 |
| carbamazepine TBDMS | 17.356 | bisphenol PH TBDMS | 27.079 |
| 2,4-bisphenol F TBDMS | 17.630 | bisphenol BP TBDMS | 27.457 |
| diclofenac TBDMS | 17.693 | ||
2.3. Data processing
GC-EI-MS data acquisition resulted in the generation of multiple (≥15) GC-EI-MS spectra for most of the TMS and TBDMS derivatives. Exceptions are L-ascorbic acid TMS, L-leucine TMS and L-serine TMS, with three GC-EI-MS spectra each, and the TBDMS derivatives of L-serine, 4-nitroguaiacol, 5-nitroguaiacol, catechol, 3-methylcatechol, 3-methyl-5-nitrocatechol, syringol, 4-nitrosyringol, 4-nitrocatechol, p-coumaric acid, m-coumaric acid, o-coumaric acid, mycophenolic acid, 4,6-dinitroguaiacol, etofylline and urea, each with one GC-EI-MS spectrum in the test TBDMS datasets. All GC-EI-MS spectra were processed using Mass Hunter Qualitative Analysis v.B.07 (Agilent Technologies, USA), which reduced raw instrument data to two-dimensional peak lists (m/z, abundance), exported in .txt format. This software was also used to perform background subtraction, in order to remove constantly present background signals, such as m/z 149 as a typical phthalate interference, m/z 282, m/z 256 and m/z 284 for oleic, palmitic and stearic acid, and m/z 207, m/z 281 and m/z 327 of common polysiloxanes resulting from GC column stationary phase degradation. The presence of these signals was confirmed a priori in the multiple EtAc solvent runs acquired between the acquisitions of CEC silyl derivatives, and these solvent runs were used for background subtraction.
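For users who want to compare the raw and background-subtracted test datasets, the idea can be illustrated generically. The sketch below is not the algorithm implemented in the Mass Hunter software; it assumes unit-mass peak lists stored as `{m/z: abundance}` dictionaries and simply subtracts the blank abundance at each m/z. The function names are hypothetical, while the background ions are those named in the text.

```python
# Background ions named in the text, with their attributed origin.
KNOWN_BACKGROUND_IONS = {
    149: "phthalate",
    256: "palmitic acid",
    282: "oleic acid",
    284: "stearic acid",
    207: "polysiloxane",
    281: "polysiloxane",
    327: "polysiloxane",
}


def subtract_background(sample, blank):
    """Subtract blank abundances from a sample peak list.

    Peaks whose residual abundance is not positive are dropped.
    """
    cleaned = {}
    for mz, abundance in sample.items():
        residual = abundance - blank.get(mz, 0)
        if residual > 0:
            cleaned[mz] = residual
    return cleaned


def flag_known_background(peaks):
    """Return the peaks that coincide with the known background ions."""
    return {
        mz: KNOWN_BACKGROUND_IONS[mz]
        for mz in peaks
        if mz in KNOWN_BACKGROUND_IONS
    }
```

Flagging is kept separate from subtraction because an ion such as m/z 207 or 281 may also be a genuine fragment of a silylated compound, so automatic removal by m/z alone would be unsafe.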
The .txt data were transformed into .mgf format by a Python script, which starts each new spectrum with the syntax required by the .mgf format (i.e., “BEGIN IONS”), then lists the exact mass of the compound (e.g., “MASS=194.076”), taken from the appropriate metadata file, followed by a row with the charge (“CHARGE=1+”). The introductory part of the data record finishes with a line starting with “TITLE= InChIKey:”, followed by the InChIKey of the compound, then “Name:” and the name of the compound, where the InChIKey and the name of the compound are taken from the .txt file, followed by an empty line. The peaks are then copied verbatim from the .txt file, line after line, and the data record finishes with the row “END IONS”, preceded by an empty line. Each spectrum entry from the .txt files is thus converted into .mgf format, and the spectra are listed in the same order as in the .txt file.
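The record layout described above can be sketched as follows. This is an illustration of the layout only, not the authors' conversion script: the function name, its arguments, and the exact spacing between the InChIKey and the “Name:” label on the TITLE line are assumptions.

```python
def format_mgf_record(exact_mass, inchikey, name, peak_lines):
    """Build one .mgf record in the layout described in the text.

    exact_mass -- exact mass of the compound (from the metadata file)
    inchikey   -- InChIKey of the compound
    name       -- compound name
    peak_lines -- peak lines, copied verbatim (one string per peak)
    """
    lines = [
        "BEGIN IONS",
        f"MASS={exact_mass}",
        "CHARGE=1+",
        # Spacing within the TITLE line is assumed for illustration.
        f"TITLE= InChIKey: {inchikey} Name: {name}",
        "",  # empty line closing the introductory part
        *peak_lines,
        "",  # empty line preceding END IONS
        "END IONS",
    ]
    return "\n".join(lines) + "\n"


def format_mgf_file(records):
    """Concatenate records, preserving the order in which they are given."""
    return "".join(format_mgf_record(*record) for record in records)
```

Keeping the peak lines as opaque strings mirrors the statement that peaks are copied verbatim, so no rounding or reformatting of m/z or intensity values is introduced.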
The .mgf files can be read by the ProteoWizard MSConvert software (version 3.0.22153-da6d3d1) [6] and subsequently converted into a number of other formats that are in use by the MS community (such as mzML or mzXML). Unfortunately, these formats do not include the .msp format. The .msp format of the data was therefore generated from the above-described .mgf format by using the matchms Python library, available at https://github.com/matchms/matchms.
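Dedicated tools such as those named above are the natural choice for reading the files. For quick inspection without external dependencies, records in the described layout can also be parsed with plain Python. The sketch below is a minimal reader under the assumption that each peak line holds an m/z value and an intensity separated by whitespace; the function name and the returned structure are illustrative.

```python
def parse_mgf(text):
    """Parse .mgf text into a list of {"params": dict, "peaks": list} entries.

    Empty lines inside a record are skipped, since the described layout
    places empty lines before the peak list and before END IONS.
    """
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is None:
            continue  # ignore anything outside a record
        elif "=" in line and not line[0].isdigit():
            # Header line such as MASS=..., CHARGE=... or TITLE=...
            key, value = line.split("=", 1)
            current["params"][key.strip().upper()] = value.strip()
        else:
            parts = line.split()
            if len(parts) >= 2:
                current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra
```

The result can be used, for example, to look up spectra by the InChIKey stored in the `TITLE` parameter when the data serve as a small stand-alone reference collection.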
Ethics Statements
The authors declare that the manuscript meets all the rules and conditions described in the “Ethics in publishing” section standards (https://www.elsevier.com/journals/data-in-brief/2352-3409/guide-for-authors). The work did not include any investigations involving animal experiments, human participants or data collected from social media platforms.
The training GC-EI-MS spectral datasets were curated from the commercially available NIST/EPA/NIH 17 Mass Spectral Library. Explicit permission to release the metadata about the training datasets was obtained by the authors from NIST. Due to NIST's individual license, which restricts use to a single computer that is not accessible by more than one person, the training datasets cannot be made available to the public by the authors. However, with licensed access to the NIST 17 MSL, the training data can be reconstructed from the available metadata files by following the detailed description of data preparation given above.
CRediT authorship contribution statement
Milka Ljoncheva: Investigation, Formal analysis, Data curation, Software, Writing – original draft, Writing – review & editing. Sintija Stevanoska: Data curation, Software. Tina Kosjek: Conceptualization, Resources, Validation, Supervision, Writing – review & editing. Sašo Džeroski: Conceptualization, Resources, Validation, Supervision, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors acknowledge the instrumental and computational resources provided by the Jožef Stefan Institute. This work was supported by the Slovenian Research Agency (ARRS) through the programs P1-0143 and P2-0103. M.L. was funded by the Public Scholarship, Development, Disability and Maintenance Fund of the Republic of Slovenia (contract no. 11011-85/2016).
Data Availability
References
- 1. Ljoncheva M., Stepišnik T., Džeroski S., Kosjek T. Cheminformatics in MS-based environmental exposomics: current achievements and future directions. Trends Environ. Anal. Chem. 2020;28. doi: 10.1016/j.teac.2020.e00099.
- 2. Ljoncheva M., Stepišnik T., Kosjek T., Džeroski S. Machine learning for identification of silylated derivatives from mass spectra. J. Cheminform. 2022;14. doi: 10.1186/s13321-022-00636-1.
- 3. National Institute of Standards and Technology, NIST/EPA/NIH Mass Spectral Library 2017, Wiley.Com. (2017). https://www.wiley.com/en-ai/NIST+EPA+NIH+Mass+Spectral+Library+2017-p-9781119750291 (accessed August 15, 2022).
- 4. European Commission, Regulation (EC) No.1907/2006 of the European Parliament and of the Council on the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), OJEC. 396 (2021) 1–552.
- 5. Williams A.J., Grulke C.M., Edwards J., McEachran A.D., Mansouri K., Baker N.C., Patlewicz G., Shah I., Wambaugh J.F., Judson R.S., Richard A.M. The CompTox Chemistry Dashboard: a community data resource for environmental chemistry. J. Cheminform. 2017;9. doi: 10.1186/s13321-017-0247-6.
- 6. Adusumilli R., Mallick P. In: Proteomics. Comai L., Katz J.E., Mallick P., editors. Springer New York; New York, NY: 2017. Data Conversion with ProteoWizard msConvert; pp. 339–368.
