Skip to main content
ACS AuthorChoice logoLink to ACS AuthorChoice
. 2024 Aug 7;96(33):13652–13662. doi: 10.1021/acs.analchem.4c02589

Quantum Chemistry-Based Prediction of Electron Ionization Mass Spectra for Environmental Chemicals

Helge Hecht †,*, Wudmir Y Rojas †,*, Zargham Ahmad †,*, Aleš Křenek ‡,*, Jana Klánová †,*, Elliott J Price †,*
PMCID: PMC11339729  PMID: 39110763

Abstract

graphic file with name ac4c02589_0010.jpg

There is a lack of experimental electron ionization high-resolution mass spectra available to assist compound identification. The in silico generation of mass spectra by quantum chemistry can aid annotation workflows, in particular to support the identification of compounds that lack experimental reference spectra, such as environmental chemicals. We present an open-source, semiautomated workflow for the in silico prediction of electron ionization high-resolution mass spectra at 70 eV based on the QCxMS software. The workflow was applied to predict the spectra of 367 environmental chemicals, and the accuracy was evaluated by comparison to experimental reference spectra acquired. The molecular flexibility, number of rotatable bonds, and number of electronegative atoms of a compound were negatively correlated with prediction accuracy. Few analytes are predicted to sufficient accuracy for the direct application of predicted spectra in spectral matching workflows (overall average score 428). The m/z values of the top 5 most abundant ions of predicted spectra rarely match ions in experimental spectra, evidencing the disconnect between simulated fragmentation pathways and empirical reaction mechanisms.

Introduction

Electron ionization (EI) mass spectra are widely used for the structural annotation of small molecules through comparison of experimentally acquired spectra with mass spectral libraries. EI is the most common ionization method coupled with gas chromatography– mass spectrometry (GC–MS). However, there is a lack of EI high-resolution mass spectral libraries available, especially for environmental chemicals.1 To date, the Thermo Contaminants Library v1.5 containing 951 analytes is the largest commercially available Orbitrap library of this kind. Community curated databases such as MassBank25 and GNPS6 currently do not contain any spectra acquired on GC-[Orbitrap] MS instruments. Additionally, spectra acquired using high-resolution time-of-flight (TOF) and Orbitrap instruments differ from those obtained through low-resolution single quadrupole instruments. The disparity leads to relatively lower match scores when comparing high-resolution spectra to nominal mass spectra, limiting the applicability of those libraries.7,8 Compounding this challenge, many chemicals are not readily available commercially, and existing experimental libraries exhibit limited chemical diversity.

As an alternative approach, spectra can be generated in silico and used for identification. Methods like machine learning (ML) or quantum chemistry (QC) calculations are employed for this purpose.9 For example, the neural electron–ionization mass spectrometry (NEIMS) software uses a multilayer perceptron (MLP) architecture to generate low-resolution EI mass spectra from molecular fingerprints.10 The specific model presented in the paper was trained in a supervised manner on the NIST/EPA/NIH Mass Spectral Library 2017 database. State of the art ML methods, such as graph neural networks (GNNs) and transformers, have demonstrated superior performance compared to MLP-based models11 and have been applied in various ways for mass spectra prediction tasks.1214 In particular, Zhu and Jonas15 utilize a representation learning approach to predict the likelihood of subformulae within a predetermined depth of bond breakages. They then leverage these probabilities to scale the ion intensities in substructure spectra derived from isotopic patterns. Their study successfully predicted EI high-resolution mass spectra, including exact peaks, for molecules from PubChem, using an artificial high-resolution data set. However, despite these advancements, ML approaches typically require substantial amounts of high-quality training data. Consequently, the broader applicability of ML methods is hindered by the scarcity of available EI high-resolution mass spectral libraries, primarily limiting their usage to the prediction of low-resolution spectra.

In contrast, generating mass spectra through QC calculations does not rely on empirical rules or experimental data. Such approaches offer insights into fragmentation processes and reaction mechanisms. The quantum chemical electron ionization mass spectrometry (QCEIMS) software,16 later renamed QCxMS after the addition of collision-induced dissociation (CID) ionization and published as open-source,17 simulates the ionization and fragmentation process by employing QC principles to generate mass spectra in silico. Besides ab initio calculations,16,18 the package supports semiempirical quantum mechanical (SQM) methods19,20 for increased throughput. SQM modeling provides ionization potentials for high-temperature molecular dynamics (MD) and mass spectra modeling. The simplicity in parametrization, using hybrid density functional theory (DFT) reference data, enhances SQM’s accessibility and efficiency. Despite the potential overestimation of electron delocalization, SQM methods, such as GFN1-xTB, excel in handling metallic systems and covalent bond dissociation. However, GNF1-xTB requires adjustments to optimize directional electrostatic effects, such as halogen bonds.21 Conversely, GFN2-xTB does not necessitate these modifications for handling halogen bonds.22 A standout feature is SQM’s easily adaptable, element-specific parameters, enriching method versatility. Whereas DFT remains the most accurate, SQM-based MD is more than 100 times faster than pure DFT-based MD.

Several studies evaluate the performance of the QCxMS package and related SQM methods across multiple compound categories (Table 1). These include small data sets of (i) organic and inorganic small molecules used for method development validations,23,24 (ii) a restricted set of pollutants consisting of 27 halogenated compounds and 8 organophosphorus flame retardants (ODTs),25 and (iii) purines and pyrimidines.26 Notably, QCxMS has been recently applied to two larger data sets consisting of (iv) 451 small organic molecules27 and (v) 816 trimethylsilylated analytes.28

Table 1. Overview of Related Studies Using QCxMS with Semiempirical Calculations.

study classes molecules masses method reference atom types
this study 56 367 108–715 GFN1-xTB RECETOX Exposome HR-[EI+]-MS library33 C H O N F P S Cl Br Si
Wang et al.27 43 451 26–358 OM2 NIST17 C H N O
Wang et al.24 NA 41 55–333 OM2/CISD + GFN1-xTB NIST17 C H N O F
Wang et al.28 10 816 115–299 GFN1-xTB NIST17 C H N O Si
Schreckenbach et al.25 NA 35 74–959 GFN1-xTB NIST14 C H N O P Cl Br
Ásgeirsson et al.23 NA 23 86–505 GFN1-xTB NIST/SDBSa C H N O P B Sb S Cl Bs Ge Te Ni Cu Cr Fe F Al Si Sn
Lee et al.26 12 80 120–207 OM2/OM3 NIST17 C H N O

These studies demonstrate QCxMS’s ability to predict EI mass spectra, but the chemical space covered in the larger studies (containing ≥100 molecules) is limited to C, H, O, N, and Si atoms and molecular weight ≤400 Da. Notably, previous studies do not compare the predicted spectra against experimental high-resolution spectra acquired from analytical standards but against low-resolution commercial libraries, limiting the assessment of predictive accuracy.

Concerning the expansion of QCxMS applicability, it is crucial to highlight its limited use in investigating large diverse data sets including, for example, environmental chemicals21 (Table 2). In this Article, we introduce an open-source workflow for large-scale prediction of EI mass spectra, leveraging the QCxMS software and using the extended tight-binding GFN1-xTB21 method. The choice of GFN1-xTB aligns with methodologies used in prior studies on environmental chemicals.25 Although the OMx SQM methods perform better,16 they only support C, H, N, O, and F elements,32 limiting the applicable chemical space. We demonstrate the workflow through application to predict EI mass spectra for a previously published set of environmental compounds.1 Furthermore, we then investigate the applicability of this methodology on our chemically diverse data set by comparing the spectra to the accompanying high-resolution EI mass spectral library33 and outline aspects influencing prediction accuracy.

Table 2. Summary of Molecular Properties Investigated in Comparison to Previous Worka.

    this study Wang et al.27 Wang et al.24 Wang et al.28 Schreckenbach et al.25 Ásgeirsson et al.23 Lee et al.26
atoms mean 33.01 21.95 22.82 33.67 31.03 19.76 18.28
  min 12 7 8 17 11 6 12
  max 80 59 56 58 74 49 24
aromatic nitrogens mean 0.45 0.06 0 0.13 0.03 0 1.39
  min 0 0 0 0 0 0 0
  max 3 3 0 4 1 0 4
molecular complexity mean 0.73 0.47 0.47 0.53 0.59 0.27 0.76
  min 0.38 0.12 0.27 0.26 0.35 0 0.67
  max 1.18 0.8 0.77 0.84 0.85 0.69 0.82
molecular flexibility mean 0.36 0.33 0.3 0.63 0.51 0.4 0.04
  min 0 0 0 0.22 0 0 0
  max 0.85 0.86 0.69 0.91 0.9 0.89 0.23
rotatable bonds mean 3.26 2.65 1.7 4.41 4.22 1.43 0
  min 0 0 0 1 0 0 0
  max 21 10 8 14 16 5 0
stereo centers mean 0.75 0.66 0.38 0.33 0.59 0.29 0
  min 0 0 0 0 0 0 0
  max 9 6 3 4 8 6 0
electronegative atoms mean 4.74 1.78 1.32 2.4 5.94 1.76 5.49
  min 0 0 0 1 2 0 4
  max 14 8 5 6 12 8 8
a

Row-wise maxima are printed in bold. Molecular properties were computed using DataWarrior.29 Any molecules failing computation were excluded from the comparison. Structure databases in SDF format have been generated from the respective publications and are deposited on Zenodo.30 Additional chemical identifiers were collected using MSMetaEnhancer.31

Methods

Workflow

The developed semiautomated workflow for predicting mass spectra integrates various functions to streamline the process, significantly enhancing researchers’ capacity to conduct high-throughput mass spectra simulations. To the best of our knowledge, this addresses a deficiency in available open-source end-to-end processing solutions. It includes approaches such as managing input file preparation through templates and a global parameter file for every stage of the workflow on a high-performance computing (HPC) cluster. These are implemented as accompanying scripts in bash and python for batch job submission, progress monitoring (i.e., to obtain statistics on job success or failure), error handling (e.g., scripts to check GAMESS conformer generation or to resubmit jobs), result collection, and data analysis. This ensures comprehensive monitoring and reliability of the computational process.

Starting with the structure data file (SDF) input, the workflow progresses through these steps: (i) Creating and Structuring Files: arranging files for molecular geometry optimization with GAMESS,34 handling molecule reading, directory creation, 3D conformation generation, input file writing, and optimization job script setup; (ii) Molecular Optimization: dispatches optimization tasks to the HPC cluster and gathers optimized molecular inputs for spectral simulations; (iii) QCxMS Spectral Prediction: submits neutral MD, prepares production runs, and executes them on the HPC cluster using batch job processing; and (iv) MSPs Generation and Analysis: retrieves spectral results and formats them into an .msp file. Additionally, the workflow includes postprocessing tools for summarizing simulations and removing unnecessary files. It offers visualization and analysis tools for enhanced usability. A high-level overview of this workflow is illustrated in Figure 1.

Figure 1.

Figure 1

Semiautomatic EI spectra prediction workflow. From an SDF input, the workflow creates all input files required for batch spectral prediction. The workflow is open-source and publicly available online.30

Data Set

The RECETOX Exposome HR-[EI+]-MS library33 was used as the data source. The library contains 56 distinct chemical classes, covering molecules ranging from 12 to 80 atoms, representing a diverse range of compounds in terms of both size and elemental composition1 (Table 2). Isotopically labeled analytes were removed from the library as those are not handled by QCxMS, leaving 367 structures.

3D Conformer Generation

Molecular structures and other metadata, for example, ChemOnt chemical class,35 were extracted from the RECETOX Exposome HR-[EI+]-MS library33 in SDF using the RDKIT package.36 Subsequently, three-dimensional conformers were created from simplified molecular input line entry system (SMILES) descriptors using a customized variant of the rdconf code (see https://github.com/dkoes/rdkit-scripts/blob/master/rdconf.py), which leverages RDKIT functions while employing the universal force field (UFF) method. Molecular structures were optimized using GAMESS via a self-consistent field (SCF) wave function calculation employing the Restricted Hartree–Fock (RHF) method at the 6-31G level of theory, with the optimization employing a gradient convergence criterion of 5 × 10–4 prior to QCxMS analysis. While this workflow for conformer generation and geometry optimization can be applied to molecules with a broad range of elements in the periodic table (up to Z = 86), this may not hold true for molecules with complex electronic structures.

Spectra Prediction Using QCxMS

QCxMS calculations were performed using default parameters (electron energy, 70 eV; excess energy, 0.6 eV/atom; initial temperature, 500 K; simulation time, 10 ps). The ground-state trajectories and production runs were performed using the SQM GFN1-xTB method. After fragmentation, the ionization energies (IEs) were calculated using GFN1-xTB, and the fragment with the lowest IE obtained the positive charge. A default number of trajectories were computed (25× the number of atoms in the molecule) for every molecule.

Postprocessing

Results were collected and converted into a library in the .msp format, and the predicted spectra underwent further filtering using the matchms37 package. Filtering steps were tailored to ensure reliable comparison between predicted spectra and reference spectra: (i) the m/z values were restricted to the range of 70–700, (ii) peak intensities were normalized to the peak of maximum intensity, and (iii) peaks with intensities below 1% of the maximum peak intensity were removed. A second data set containing the top 5 intensity ions from both predicted and experimental spectra was created because, in practice, only minimal spectral information is often available for annotation of compounds in experimental data sets. Chemical properties related to atomic composition and molecular structure were computed using DataWarrior29 (see Table 2). Statistical analysis was performed using scipy38 and pandas.39

Spectral Matching

Spectral matching was performed using matchms. The CosineHungarian score was used with a tolerance of 0.0035 Da (i.e., 5 ppm at 700 m/z), intensity power of 1, and m/z power of 0. In spectral similarity computation, only peaks that fall within the specified tolerance are taken into account. After this filtering process, spectrum vectorization can be understood as the intersection between the compared spectra. Therefore, peaks not present in both spectra do not negatively impact the scoring. This is different from other commonly used scoring methods (e.g., forward and reverse cosine similarity,40 hybrid search41 as employed by NIST MSSearch or spectral entropy similarity42), which might be affected by the presence or absence of peaks. Spectral entropy scores calculated were found to correlate with the matchms cosine score-based matching (Pearson correlation of 0.87).

As matchms does not retain entries with 0 matching ions in the outputs, these were systematically integrated into the score tables, assigning scores and match values of zero for pairs missing in the output files. The workflow for spectral matching and related data sets are available online (i.e., top 5 intensity43 and all peaks44 spectral similarities). For more details regarding the different scoring methods, we refer the reader to the online resources.30

Results

All computations were conducted on the metacentrum HPC cluster (MetaCloud; https://www.metacentrum.cz/en/cloud/).

Geometry Optimization

A subset of 48 molecules did not converge using the previously described geometry optimization parameters (see failed_optimized_mols.csv30). This is most likely due to the generated conformer resulting in an unfavorable initial configuration for geometry optimization. The computational framework allows users to adjust key parameters such as basis sets, gradient convergence tolerance, and the maximum number of steps to address convergence challenges. However, our primary focus at this stage is to enable the generation of molecular structures for spectral predictions, rather than achieving high-level molecular optimization. Therefore, users also have the flexibility to employ alternative tools for refining these molecular structures. In this context, we leveraged the xTB code45 to address the molecular optimization of the aforementioned molecules.

Accuracy of Predicted Spectra

A comparison between the predicted and experimentally acquired spectra, based on intensity ratio similarity (cosine score) and the number of matching ions, does not show any direct trend indicating the failure of the method to achieve meaningful predictions, with the scores and ion matches heterogeneously distributed (Figure 2).

Figure 2.

Figure 2

Cosine similarity scores (x-axis) versus the absolute number of matched ions (y-axis) for predicted and experimental spectra. The color scale and size represent ion matches in % normalized by the number of peaks contained in the experimental (query, size) and predicted spectrum (reference, color), respectively.

Physicochemical Properties

Investigating the correlations of spectral matching scores and computed chemical properties reveals a weak positive correlation between cosine similarity score and number of ion matches. Furthermore, there is a negative correlation between scores and molecular flexibility, the number of rotatable bonds, the number of atoms, and the number of electronegative atoms. These properties are cocorrelated. The number of matching ions follows a similar trend. However, it shows weaker correlation with the number of atoms. Notably, molecular complexity does not correlate with the cosine similarity score, nor the number of matched ions (Figure 3).

Figure 3.

Figure 3

Pearson’s correlations of chemical properties and spectral matching results. Chemical properties were computed using DataWarrior. Molecular complexity is used as an aggregate measure.46 The number of ion matches is given as absolute number as well as normalized by the number of peaks in the predicted and reference spectra.

Chemical Class

Comparisons were performed based on ChemOnt35 chemical class hierarchies to investigate prediction accuracy among structurally related groups (Figure 4).

Figure 4.

Figure 4

Spectral matching of predicted to experimental spectra at the superclass level. The color scale for counts has been aligned across histograms.

At the superclass level, only the benzenoids and organoheterocyclic compounds have a median spectral cosine score greater than 500 and an interquartile range (IQR) exceeding 800. However, the prediction accuracy for lipid and lipid-like molecules, organohalogen compounds, phenylpropanols, and polyketides, as well as for organic acids and derivatives, is comparatively lower, with all having median scores below 400. The lipid and lipid-like molecules are characterized by a high number of rotatable bonds (on average ≥7) and contain additional atoms beyond C, O, and H such as P and S. The organic acids superclass shows the lowest median cosine similarity score of 110.5, encompassing most S- and P-containing molecules in the data set. Low spectral similarity scores have previously been demonstrated for ODTs.25 Particularly for those structures containing S and P, the SQM methods show poor accuracy in simulating fragmentation pathways.

Overall, when considering only the top 5 intensity ions, the spectral match between predicted and experimental spectra is even lower (Figure 4b). For 71 spectra, not a single of the top 5 experimentally measured ions is predicted at the correct m/z. In practice, three or more ions are often used to corroborate annotation, yet three or more ions are only predicted correctly for 95 of the 367 molecules.

On the class level (see Figure 5), two groups stand out for their high prediction accuracy: (i) phenanthrenes and pyrenes (benzenoids), and (ii) benzofurans, benzodioxins, and benzimidazoles (organoheterocyclic compounds), totaling 34 molecules. These molecules, characterized by their relatively planar geometry (molecular flexibility ≤0.4), were reliably predicted with median scores ≥800. However, certain individual classes showed lower prediction performance, indicated by median scores ≤350. These include (i) phenol ethers (benzenoids), (ii) alkyl halides (organohalogen compounds), and (iii) azoles, benzothiazoles, and dizianaphthalenes (organoheterocyclic compounds).

Figure 5.

Figure 5

Boxplot of spectral comparisons between predicted and reference EI mass spectra per chemical class within each superclass. The bar denotes the median, while outliers are depicted as circles with spectral similarity score (blue, right y-axis) and percentage of ions matching the reference spectra (yellow, left y-axis) displayed. Superclasses and classes (x-axis) represented only by a single molecule are excluded. Tables summarizing the mean and median scores, matches, and physicochemical properties are available on Zenodo.30

In comparison to other benzenoids, the phenol ethers in the data set consistently exhibit at least three rotatable bonds, with some also containing aromatic nitrogens or being halogenated. Similarly, the organoheterocyclic classes possess a higher count of aromatic nitrogens and rotatable bonds (both ≥2), along with electronegative atoms (≥5). Additionally, benzothiazoles, which contain S, are not effectively characterized by the SQM methods.

As benzenes and substituted derivatives consist of 158 molecules, we further inspected those at the subclass level (Figure 6). Biphenyls and derivatives constitute the largest subclass, with 41 molecules, and are most accurately predicted, with a median cosine similarity above 800. The outliers such as bitertanol (3,3-dimethyl-1-(4-phenylphenoxy)-1-(1,2,4-triazol-1-yl)butan-2-ol) and bifenazate (propan-2-yl N-(2-methoxy-5-phenylanilino)carbamate), each having six rotatable bonds, are observed within this subclass. Accordingly, subclasses containing molecules with greater structural flexibility, such as phenyl methylcarbamates, phenyl methylcarbamic acids, and diphenylethers, are predicted less accurately.

Figure 6.

Figure 6

Boxplots for subclasses of benzene and substituted derivatives. The bar denotes the median, while outliers are depicted as circles. Subclasses with less than three molecules have been removed from the figure.

Elemental Composition

The class-based analysis reveals a trend that spectra prediction of molecules containing certain atom types, especially aromatic nitrogens present in compounds like azoles, as well as the presence of P, is less accurate. Therefore, in addition to the class-based analysis, we also analyzed the spectra matching results regarding the elemental composition of each molecule.

To isolate the influence resulting from the presence of N in the molecule, we directly compared molecules containing the same atom types but with and without N (Figure 7). Median scores and ion match rates are lower for every group of molecules, except for those containing the chemical compositions (i) Br,C,H,(N),O; (ii) C,H,(N),O,S; and (iii) C,Cl,H,(N),O,S, when N is part of the chemical composition.

Figure 7.

Figure 7

Comparison of spectra prediction accuracy for molecules of selected chemical compositions based on the presence of N. The selected groups of molecules contain all of the form X → X + N present in the data set.

Similarly, the presence of P negatively influences prediction accuracy using our chosen methodology (Figure 8). P often serves as a central atom, resulting in a more flexible structure. Within our data set, P containing molecules have an average number of rotatable bonds of 8.4 over 2.9 for all other molecules, respectively. This is further supported by the negative correlation of the rotatable bond count and cosine similarity and number of matching ions (Figure 3).

Figure 8.

Figure 8

Results for molecule groups containing P and/or S atoms. Groups consisting only of a single molecule have been removed from the boxplot.

Discussion

Our results show that performance of QCxMS using the GFN1-xTB methods for the prediction of EI mass spectra varies significantly across the diverse data set. Only few chemical groups (e.g., pyrenes, phenanthrenes, benzofuranes, benzodioxins, benzimidazoles, and biphenyls and derivatives, totaling 75 molecules) are predicted with sufficient accuracy (i.e., average spectral similarity scores above 800) for direct use in spectral matching-based annotation workflows. In particular, for the vast majority of analytes, even the most abundant ions were not accurately predicted (i.e., wrong m/z), limiting the current potential to incorporate predicted spectra into suspect annotation workflows.

The presence of atoms other than C, O, and H leads to less accurate spectral prediction, as can be seen on the results of P- and/or S-containing molecules depicted in Figure 8 and the comparison of N-containing groups and their respective N-free counterparts in Figure 7. Also, Wang et al.27 report comparably lower scores even for O-containing molecules. Additionally, the extensive presence of electronegative atoms such as halogens (e.g., in alkyl and vinyl halides, see Figure 5c) negatively affects the simulations, as halogen bonds typically require atom pairwise corrections.21 We repeated predictions of the alkyl halides using the GFN2-xTB method that models anisotropic electrostatic interactions and does not employ specific element pair corrections. Although it reportedly improves predictions of molecules containing bonds between molecules with large differences in electronegativity,22 we did not observe any improvement in spectra quality for our subset.48 Beyond chemical composition, the related three-dimensional molecular structure is the main determinant. While planar geometries are predominantly predicted with high accuracy, central atoms with many rotatable bonds (e.g., P in organic acids, see Figure 5d) and other stereo centers lead to lower prediction accuracy for the QCxMS method.

These findings are in line with previous studies reporting very low matching scores for ODTs25 and lower scores obtained for aliphatic molecules over aromatic ones.28 In contrast to Wang et al.,27 we found molecular flexibility to negatively correlate with prediction accuracy. Although a similar trend (Pearson correlation of −0.3 between the dot product score and the number of rotational bonds) appears in their data as well (see comparisons.ipynb30), this might be related to the differences between the GFN1-xTB and OM2 SQM methods. Previous research relied on differing metrics to evaluate spectra prediction quality, typically comparing data solely against low-resolution mass spectra.2328 To put the results from the previous larger studies into context, we conducted the same classification-based analysis (Figure 9).

Figure 9.

Figure 9

Distributions of scores across studies (a) and from other large studies27,28 based on the same ChemOnt classification (b,c). SMILES identifiers were collected using MSMetaEnhancer, and molecules were classified using ClassyFire. Tables used for the plots are available under analysis/data/reference online.30

The observed limitations with regard to the chemical space for which accurate predictions are possible hinder application for many parent environmental chemicals, which are enriched with halogens and/or S compared to endogenous analytes. In addition, the spectral prediction for many metabolites is likely to also be poor due to their higher proportion of P and rotatable bonds.49 However, previous investigations to predict spectra of endogenous analytes rarely included P (Table 1). Simply tuning simulation parameters may not significantly enhance predictions; instead, it is proposed that accurate calculation of potential energy surfaces or the incorporation of excited-state MD may enhance accuracy within the semiempirical framework.50 Recently, Wang et al.24 incorporated excited states into the MD steps of EI spectra using the binary-encounter-Bethe (BEB) model, achieving higher accuracy, although limitations still exist, particularly for nonorganic molecules.

The semiempirical xTB method enables systematic processing of larger sets of molecules, although true high throughput processing would require a speedup in several orders of magnitude. Including initial testing, we scheduled 524 334 compute jobs with a total usage of 43 201 central processing unit (CPU) days on our HPC cluster. The files containing logs, structures, and computed trajectories require ∼2 TB storage space. Considering that such infrastructure and capacity might not be generally available to researchers further highlights the need for more flexible, accessible, and efficient computational methods.51,52 For similar reasons, current works are limited to the use of SQM over DFT-based methods. Future work should consider the application of such ab initio MD for molecules not well characterized by the SQM-based methods.

Conclusions

We present an open-source workflow for the prediction of high-resolution EI mass spectra and performance assessment based on the QCxMS, matchms, RDKIT, xTB, and GAMESS software packages. Additionally, we provide the largest set of predicted mass spectra of environmental chemicals so far and perform an unbiased analysis based on structural taxonomy classifications and chemical element composition. Our results show that further developments of SQM-based MD to improve prediction accuracy for molecules containing electronegative atoms (e.g., halogens, nitrogen) with high molecular flexibility (e.g., multiple rotatable bonds and stereo centers) are crucial. Current methods are insufficient for practical suspect screening applications, which require at least three characteristic (i.e., highly abundant) ions for correct identification, as these conditions were not met for the vast majority of predicted spectra when compared with experimental high-resolution spectra. Furthermore, improvements with regard to computational efficiency and accessibility are essential to advance the field of QC-based in silico mass spectra prediction. This especially holds true considering the need for additional comprehensive large-scale studies, which are required to characterize the applicable chemical space and potential pitfalls of these methodologies.

Acknowledgments

H.H., W.Y.R., Z.A., J.K., and E.J.P. thank the RECETOX Research Infrastructure (LM2023069) financed by the Ministry of Education, Youth and Sports, and the Operational Programme Research, Development and Education (the CETOCOEN EXCELLENCE project no. CZ.02.1.01/0.0/0.0/17 _043/0009632) for supportive background. This work was supported from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 857560. Computational resources were provided by the project e-INFRA CZ (LM2018140). This presentation reflects only the author’s view, and the European Commission is not responsible for any use that may be made of the information it contains. We acknowledge the support of the Freiburg Galaxy Team and Björn Grüning, Bioinformatics, University of Freiburg (Germany), funded by the Collaborative Research Centre 992 Medical Epigenetics (DFG grant SFB 992/1 2012) and the German Federal Ministry of Education and Research BMBF grant 031 A538A de.NBI-RBC.

Glossary

Glossary

BEB

binary-encounter-Bethe

CID

collision-induced dissociation

CPU

central processing unit

Da

dalton

DFT

density functional theory

EI

electron ionization

GC–MS

gas chromatography–mass spectrometry

GNN

graph neural network

HPC

high-performance computing

IE

ionization energy

IQR

interquartile range

MD

molecular dynamics

ML

machine learning

MLP

multilayer perceptron

NEIMS

neural electron–ionization mass spectrometry

ODT

organophosphorus flame retardant

QC

quantum chemistry

QCEIMS

quantum chemical electron ionization mass spectrometry

RHF

restricted Hartree–Fock

SCF

self-consistent field

SDF

structure data file

SMILES

simplified molecular input line entry system

SQM

semiempirical quantum mechanical

TOF

time-of-flight

UFF

universal force field

Data Availability Statement

All data and scripts used in this work are hosted and archived on Zenodo.30,33 Mirrorplots for all corresponding pairs of predicted and experimental spectra are also available on Zenodo.53 Spectral similarity computations and related results that were created using Galaxy are archived as Galaxy histories, which can be imported into any Galaxy instance.43,44 The code repository containing the workflow as well as the scripts used to generate tables and figures is publicly available at https://github.com/RECETOX/ei_spectra_predictions. Note that this repository is subject to further development, so please refer to the Zenodo archive for information specifically related to this publication.

Author Contributions

H.H., data curation, formal analysis, software, visualization, writing–original draft, writing–review, and editing; W.Y.R., data curation, formal analysis, investigation, software, writing–original draft, writing–review, and editing; Z.A., data curation, software, visualization, writing–review, and editing; A.K., resources, writing–review, and editing; J.K., funding acquisition, writing–review, and editing; E.J.P., conceptualization, formal analysis, resources, supervision, writing–original draft, writing–review, and editing.

The authors declare no competing financial interest.

References

  1. Price E. J.; Palát J.; Coufaliková K.; Kukučka P.; Codling G.; Vitale C. M.; Koudelka Š.; Klánová J. Open High-Resolution EI+ Spectral Library of Anthropogenic Compounds. Front. Public Health 2021, 9, 622558. 10.3389/fpubh.2021.622558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. MassBank Consortium MassBank EU; http://www.massbank.eu/Contents.
  3. MassBank Consortium MassBank Japan; http://www.massbank.jp/Contents.
  4. Fiehn laboratory at UC Davis MassBank of North America; https://massbank.us/spectra/statistics.
  5. Horai H.; et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 2010, 45, 703–714. 10.1002/jms.1777. [DOI] [PubMed] [Google Scholar]
  6. Wang M.; et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 2016, 34, 828–837. 10.1038/nbt.3597. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Stettin D.; Poulin R. X.; Pohnert G. Metabolomics Benefits from Orbitrap GC–MS—Comparison of Low- and High-Resolution GC–MS. Metabolites 2020, 10, 143. 10.3390/metabo10040143. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Vinaixa M.; Schymanski E. L.; Neumann S.; Navarro M.; Salek R. M.; Yanes O. Mass spectral databases for LC/MS- and GC/MS-based metabolomics: State of the field and future prospects. TrAC Trends in Anal. Chem. 2016, 78, 23–35. 10.1016/j.trac.2015.09.005. [DOI] [Google Scholar]
  9. Krettler C. A.; Thallinger G. G. A map of mass spectrometry-based in silico fragmentation prediction and compound identification in metabolomics. Briefings Bioinf 2021, 22, 1–25. 10.1093/bib/bbab073. [DOI] [PubMed] [Google Scholar]
  10. Wei J. N.; Belanger D.; Adams R. P.; Sculley D. Rapid Prediction of Electron–Ionization Mass Spectrometry Using Neural Networks. ACS Cent. Sci. 2019, 5, 700–708. 10.1021/acscentsci.9b00085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Zhu H.; Liu L.; Hassoun S.. Using Graph Neural Networks for Mass Spectrometry Prediction. Machine Learning in Computational Biology, 2020.
  12. Young A.; Röst H.; Wang B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat. Mach. Intell. 2024, 6, 404–416. 10.1038/s42256-024-00816-8. [DOI] [Google Scholar]
  13. Murphy M.; Jegelka S.; Fraenkel E.; Kind T.; Healey D.; Butler T.. Efficiently predicting high resolution mass spectra with graph neural networks. Proceedings of the 40th International Conference on Machine Learning; 2023; pp 25549–25562. [Google Scholar]
  14. Goldman S.; Bradshaw J.; Xin J.; Coley C. Prefix-Tree Decoding for Predicting Mass Spectra from Molecules. Advances in Neural Information Processing Systems 2023, 48548–48572. [Google Scholar]
  15. Zhu R. L.; Jonas E. Rapid Approximate Subset-Based Spectra Prediction for Electron Ionization-Mass Spectrometry. Anal. Chem. 2023, 95, 2653–2663. 10.1021/acs.analchem.2c02093. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Grimme S. Towards First Principles Calculation of Electron Impact Mass Spectra of Molecules. Angew. Chem., Int. Ed. 2013, 52, 6306–6312. 10.1002/anie.201300158. [DOI] [PubMed] [Google Scholar]
  17. Koopman J.; Grimme S. From QCEIMS to QCxMS: A Tool to Routinely Calculate CID Mass Spectra Using Molecular Dynamics. J. Am. Soc. Mass Spectrom. 2021, 32, 1735–1751. 10.1021/jasms.1c00098. [DOI] [PubMed] [Google Scholar]
  18. Bauer C. A.; Grimme S. How to Compute Electron Ionization Mass Spectra from First Principles. J. Phys. Chem. A 2016, 120, 3755–3766. 10.1021/acs.jpca.6b02907. [DOI] [PubMed] [Google Scholar]
  19. Bauer C. A.; Grimme S. First principles calculation of electron ionization mass spectra for selected organic drug molecules. Org. Biomol. Chem. 2014, 12, 8737–8744. 10.1039/C4OB01668H. [DOI] [PubMed] [Google Scholar]
  20. Koopman J.; Grimme S. Calculation of Electron Ionization Mass Spectra with Semiempirical GFNn-xTB Methods. ACS Omega 2019, 4, 15120–15133. 10.1021/acsomega.9b02011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Grimme S.; Bannwarth C.; Shushkov P. A Robust and Accurate Tight-Binding Quantum Chemical Method for Structures, Vibrational Frequencies, and Noncovalent Interactions of Large Molecular Systems Parametrized for All spd-Block Elements (Z = 1–86). J. Chem. Theory Comput. 2017, 13, 1989–2009. 10.1021/acs.jctc.7b00118. [DOI] [PubMed] [Google Scholar]
  22. Bannwarth C.; Ehlert S.; Grimme S. GFN2-xTB—An Accurate and Broadly Parametrized Self-Consistent Tight-Binding Quantum Chemical Method with Multipole Electrostatics and Density-Dependent Dispersion Contributions. J. Chem. Theory Comput. 2019, 15, 1652–1671. 10.1021/acs.jctc.8b01176. [DOI] [PubMed] [Google Scholar]
  23. Ásgeirsson V.; Bauer C. A.; Grimme S. Quantum chemical calculation of electron ionization mass spectra for general organic and inorganic molecules. Chem. Sci. 2017, 8, 4879–4895. 10.1039/C7SC00601B. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Wang S.; Kind T.; Bremer P. L.; Tantillo D. J.; Fiehn O. Beyond the Ground State: Predicting Electron Ionization Mass Spectra Using Excited-State Molecular Dynamics. J. Chem. Inf. Model. 2022, 62, 4403–4410. 10.1021/acs.jcim.2c00597. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Schreckenbach S. A.; Anderson J. S.; Koopman J.; Grimme S.; Simpson M. J.; Jobst K. J. Predicting the Mass Spectra of Environmental Pollutants Using Computational Chemistry: A Case Study and Critical Evaluation. J. Am. Soc. Mass Spectrom. 2021, 32, 1508–1518. 10.1021/jasms.1c00078. [DOI] [PubMed] [Google Scholar]
  26. Lee J.; Kind T.; Tantillo D. J.; Wang L.-P.; Fiehn O. Evaluating the Accuracy of the QCEIMS Approach for Computational Prediction of Electron Ionization Mass Spectra of Purines and Pyrimidines. Metabolites 2022, 12, 68. 10.3390/metabo12010068. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Wang S.; Kind T.; Tantillo D. J.; Fiehn O. Predicting in silico electron ionization mass spectra using quantum chemistry. J. Cheminf. 2020, 12, 63. 10.1186/s13321-020-00470-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Wang S.; Kind T.; Bremer P. L.; Tantillo D. J.; Fiehn O. Quantum Chemical Prediction of Electron Ionization Mass Spectra of Trimethylsilylated Metabolites. Anal. Chem. 2022, 94, 1559–1566. 10.1021/acs.analchem.1c02838. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Sander T.; Freyss J.; von Korff M.; Rufener C. DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis. J. Chem. Inf. Model. 2015, 55, 460–473. 10.1021/ci500588j. [DOI] [PubMed] [Google Scholar]
  30. Rojas W. Y.; Hecht H.; Ahmad Z.. RECETOX/ei_spectra_predictions: v0.5; 2024, https://github.com/RECETOX/ei_spectra_predictions.
  31. Troják M.; Hecht H.; Čech M.; Price E. J. MSMetaEnhancer: A Python package for mass spectra metadata annotation. Journal of Open Source Software 2022, 7, 4494. 10.21105/joss.04494. [DOI] [Google Scholar]
  32. Dral P. O.; Wu X.; Spörkel L.; Koslowski A.; Weber W.; Steiger R.; Scholten M.; Thiel W. Semiempirical Quantum-Chemical Orthogonalization-Corrected Methods: Theory, Implementation, and Parameters. J. Chem. Theory Comput. 2016, 12, 1082–1096. 10.1021/acs.jctc.5b01046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Price E. J.; Palát J.; Coufaliková K.; Kukučka P.; Codling G.; Vitale C. M.; Koudelka Š.; Klánová J.. RECETOX Exposome HR-[EI+]-MS library; 2021, 10.5281/zenodo.4471217. [DOI] [PMC free article] [PubMed]
  34. Barca G. M. J.; et al. Recent developments in the general atomic and molecular electronic structure system. J. Chem. Phys. 2020, 152, 154102. 10.1063/5.0005188. [DOI] [PubMed] [Google Scholar]
  35. Djoumbou Feunang Y.; Eisner R.; Knox C.; Chepelev L.; Hastings J.; Owen G.; Fahy E.; Steinbeck C.; Subramanian S.; Bolton E.; Greiner R.; Wishart D. S. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminf. 2016, 8, 61. 10.1186/s13321-016-0174-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Landrum G.; et al. rdkit/rdkit: 2023_09_3 (Q3 2023) Release; 2023.
  37. Huber F.; Verhoeven S.; Meijer C.; Spreeuw H.; Castilla E.; Geng C.; van der Hooft J.; Rogers S.; Belloum A.; Diblen F.; Spaaks J. matchms - processing and similarity evaluation of mass spectrometry data. Journal of Open Source Software 2020, 5, 2411. 10.21105/joss.02411. [DOI] [Google Scholar]
  38. Virtanen P.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272. 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. The pandas development team pandas-dev/pandas: Pandas; https://github.com/pandas-dev/pandas.
  40. Stein S. E. Chemical substructure identification by mass spectral library searching. J. Am. Soc. Mass Spectrom. 1995, 6, 644–655. 10.1016/1044-0305(95)00291-K. [DOI] [PubMed] [Google Scholar]
  41. Cooper B. T.; Yan X.; Simón-Manso Y.; Tchekhovskoi D. V.; Mirokhin Y. A.; Stein S. E. Hybrid Search: A Method for Identifying Metabolites Absent from Tandem Mass Spectrometry Libraries. Anal. Chem. 2019, 91, 13924–13932. 10.1021/acs.analchem.9b03415. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Li Y.; Kind T.; Folz J.; Vaniya A.; Mehta S. S.; Fiehn O. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat. Methods 2021, 18, 1524–1531. 10.1038/s41592-021-01331-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Rojas W. Y.RECETOX Spectral Similarity Top 5 Peaks Galaxy Workflow and History; 2024, https://zenodo.org/records/10842560.
  44. Rojas W. Y.RECETOX Spectral Similarity All Peaks Galaxy Workflow and History; 2024, https://zenodo.org/records/10842462.
  45. Bannwarth C.; Caldeweyher E.; Ehlert S.; Hansen A.; Pracht P.; Seibert J.; Spicher S.; Grimme S. Extended tight-binding quantum chemistry methods. WIREs Comput. Mol. Sci. 2021, 11, 1–49. 10.1002/wcms.1493. [DOI] [Google Scholar]
  46. von Korff M.; Sander T.. About Complexity and Self-Similarity of Chemical Structures in Drug Discovery. Chaos and Complex Systems; Springer-Verlag: Berlin, Heidelberg, 2013; pp 301–306. [Google Scholar]
  47. Hecht H.QCxMS prediction of alkyl halides comparison of GFN1-xTB and GFN2-xTB; 2024, https://zenodo.org/records/10839047.
  48. Khanna V.; Ranganathan S. Physiochemical property space distribution among human metabolites, drugs and toxins. BMC Bioinf 2009, 10, 1–18. 10.1186/1471-2105-10-S15-S10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Nelson T. R.; White A. J.; Bjorgaard J. A.; Sifain A. E.; Zhang Y.; Nebgen B.; Fernandez-Alberti S.; Mozyrsky D.; Roitberg A. E.; Tretiak S. Non-adiabatic Excited-State Molecular Dynamics: Theory and Applications for Modeling Photophysics in Extended Molecular Materials. Chem. Rev. 2020, 120, 2215–2287. 10.1021/acs.chemrev.9b00447. [DOI] [PubMed] [Google Scholar]
  50. Sarojini D.; Burrows-Schilling C.; Thomas K.; Mizumoto C.. Towards Developing a Guide to Choosing National High-Performance Computing Resources. Practice and Experience in Advanced Research Computing; New York, 2023; pp 382–385. [Google Scholar]
  51. IDC Corporate High Performance Computing in the EU: Progress on the Implementation of the European HPC Strategy; 2015; pp 1–137.
  52. RECETOX Mirrorplots for “Quantum chemistry based prediction of electron ionization mass spectra for environmental chemicals”; 2024, https://zenodo.org/records/12784293. [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

  1. Rojas W. Y.RECETOX Spectral Similarity Top 5 Peaks Galaxy Workflow and History; 2024, https://zenodo.org/records/10842560.
  2. Rojas W. Y.RECETOX Spectral Similarity All Peaks Galaxy Workflow and History; 2024, https://zenodo.org/records/10842462.
  3. Hecht H.QCxMS prediction of alkyl halides comparison of GFN1-xTB and GFN2-xTB; 2024, https://zenodo.org/records/10839047.
  4. RECETOX Mirrorplots for “Quantum chemistry based prediction of electron ionization mass spectra for environmental chemicals”; 2024, https://zenodo.org/records/12784293. [DOI] [PMC free article] [PubMed]

Data Availability Statement

All data and scripts used in this work are hosted and archived on Zenodo.30,33 Mirrorplots for all corresponding pairs of predicted and experimental spectra are also available on Zenodo.53 Spectral similarity computations and related results that were created using Galaxy are archived as Galaxy histories, which can be imported into any Galaxy instance.43,44 The code repository containing the workflow as well as the scripts used to generate tables and figures is publicly available at https://github.com/RECETOX/ei_spectra_predictions. Note that this repository is subject to further development, so please refer to the Zenodo archive for information specifically related to this publication.


Articles from Analytical Chemistry are provided here courtesy of American Chemical Society

RESOURCES