Atomic structures, conformers and thermodynamic properties of 32k atmospheric molecules

Vitus Besel; Milica Todorović; Theo Kurtén; Patrick Rinke; Hanna Vehkamäki

doi:10.1038/s41597-023-02366-x

2023 Jul 12;10:450. doi: 10.1038/s41597-023-02366-x

PMCID: PMC10338534 PMID: 37438370

Abstract

Low-volatile organic compounds (LVOCs) drive key atmospheric processes, such as new particle formation (NPF) and growth. Machine learning tools can accelerate studies of these phenomena, but extensive and versatile LVOC datasets relevant for the atmospheric research community are lacking. We present the GeckoQ dataset with atomic structures of 31,637 atmospherically relevant molecules resulting from the oxidation of α-pinene, toluene and decane. For each molecule, we performed comprehensive conformer sampling with the COSMOconf program and calculated thermodynamic properties with density functional theory (DFT) using the Conductor-like Screening Model (COSMO). Our dataset contains the geometries of the 7 million conformers we found and their corresponding structural and thermodynamic properties, including saturation vapor pressures (p_Sat), chemical potentials and free energies. The p_Sat values were compared to values calculated with the group contribution method SIMPOL. To validate the dataset, we explored the relationship between structural and thermodynamic properties, and then demonstrated a first machine-learning application with Gaussian process regression.

Subject terms: Atmospheric science, Environmental sciences, Theoretical chemistry

Background & Summary

With climate change accelerating, humanity faces unprecedented social, ecological and economic changes¹. While data-driven research is emerging in atmospheric science^2–5, open research data is not yet as readily available as in many other fields^6–10. We present our contribution to data-driven atmospheric science in the form of the GeckoQ dataset, which provides molecular data relevant for aerosol particle growth and formation.

Aerosol particles and clouds affect the climate by absorbing and reflecting sunlight in the atmosphere, but their impact on global warming is still poorly understood¹¹. Aerosol particles can also act as cloud condensation nuclei. They are either emitted or grow from gaseous molecules in the atmosphere, a process known as new particle formation (NPF). NPF is estimated to be responsible for 40–70% of all cloud condensation nuclei¹². Recently, organic molecules have been identified as major contributors to initial aerosol particle growth and formation up to sizes where the particles can act as condensation nuclei^13–16. A key molecular property related to aerosol particle growth is the saturation vapor pressure (p_Sat), a measure of a molecule’s ability to condense to the liquid phase. Thus, molecules with a low p_Sat, low-volatile organic compounds (LVOCs), are of special interest for NPF research. However, LVOCs are difficult to study experimentally as there are millions of potential LVOC structures in the atmosphere. Due to the large number of LVOC species and their low volatilities, the gas phase concentration of any single compound is often far below the instrumental detection limit¹⁷.

Computational tools, such as density-functional theory (DFT), offer a complementary approach to study LVOCs^18–20. However, due to its computational expense, DFT has not yet been widely used to generate datasets in atmospheric science. Wang et al.²¹ compiled a dataset of 3414 molecules extracted from the Master Chemical Mechanism^22–24. They computed the saturation vapor pressure (p_Sat) at the same level of theory used in this work, but the dataset size is relatively small for meaningful machine learning^25,26. Krüger et al.⁵ trained deep learning models on 103,040 quinones, but did not extend their study beyond this single molecular class. Finally, Isaacman-VanWertz and Aumont²⁷ studied the p_Sat of 182,000 atmospheric species with computationally efficient group contribution methods, but did not apply more accurate DFT methods. The dataset presented in this article is derived from the latter study: we extend it with rigorous conformer search and thermodynamic calculations.

In this article, we introduce the GeckoQ dataset encompassing 31,637 carefully curated LVOCs. To ensure atmospheric relevance, we generated the molecules with the chemical mechanism GECKO-A²⁸, which simulates the oxidation of hydrocarbon emissions (in the following referred to as parent species), following previous work²⁷. To provide an accurate p_Sat, we used the well-established approach of conducting a conformer search with the COSMOconf program and then calculating p_Sat with the COSMOtherm program^21,29,30. For each molecule, GeckoQ features important thermodynamic properties calculated with DFT: the saturation vapor pressure [Pa] (p_Sat), the chemical potential [kJ/mol], the free energy of a molecule in mixture [kJ/mol], and the heat of vaporisation [kJ/mol]. GeckoQ also contains the optimized geometries of all conformers that were included in the calculations, summing up to 7,259,598 structures and associated total energies for the whole dataset, thereby exceeding even the NablaDFT dataset³¹ in size.

Figure 1 presents a general overview of the GeckoQ dataset. Panel (a) depicts typical GeckoQ molecules. They consist of carbon backbones derived from the parent species decane, toluene and α-pinene to which various functional groups are attached. In some cases ring structures persist from the original α-pinene and toluene, but new rings involving oxygen have also formed. The number of conformers depends on the number of functional groups and the length of the carbon backbone, and thus on the size of the molecule. The median number of conformers per molecule is 173, but we found up to 1750 conformers for a single molecule (see Fig. 1b). The molecular size distribution in GeckoQ (cf. Fig. 1c) peaks around 25 atoms and is slightly skewed towards larger molecules. The smallest molecule is formaldehyde with 4 atoms and the largest molecules have 41 atoms. The molecules contain only carbon, oxygen, nitrogen and hydrogen, and frequently have more oxygen than carbon atoms and a maximum of two nitrogen atoms (see Fig. 1d). Finally, p_Sat is approximately normally distributed on a log10-scale ranging from 10⁻¹⁴ to 10⁶ Pa (see Fig. 1e).

Fig. 1 — A general overview of the data: (a) Sample molecules for small (S), medium (M) and large (L) sizes in terms of number of atoms. For one M sized molecule four conformers of overall 140 conformers are depicted, with carbon in green, nitrogen in blue, oxygen in red and hydrogen in blue. (b) A boxplot of the number of conformers found per molecule (median 173). (c) The distribution of the molecule size in terms of the number of atoms. (d) Boxplots for the different atomic species present in the data, excluding hydrogen. (e) The histogram of the p_Sat values in the data.

Next, we review the frequency and type of functional groups in the dataset. Figure 2 provides an overview of all functional groups in the dataset, as detected by the APRL-substructure finder³². The most common groups, hydroperoxides, ketones and hydroxyls, have a large impact on p_Sat as they increase the molecule’s ability to engage in intermolecular interactions in the liquid phase. Generally, a large number of functional groups correlates with a low p_Sat. These relationships will be explored in more detail in the Technical validation section. The molecules in the GeckoQ dataset have a median of five functional groups, usually more than three and fewer than eight groups.

Fig. 2 — Number and type of functional groups in the data: (a) The frequency of occurrence of functional groups per molecule. Four molecules with five ketone groups and six molecules with five hydroxyl (alkyl) groups are not depicted (for clarity). (b) A histogram of the number of functional groups per molecule.

Our objective was to compile the structural and thermodynamic properties of LVOCs together into a single dataset of high accuracy. Studying the relationship between these properties is necessary to understand the behaviour of molecules in the atmosphere based merely on their structure. The GeckoQ dataset can be used to train machine learning models in atmospheric science, and to study particle formation processes and the role of different conformers in them. Here, we demonstrate such models with Gaussian process regression and the topological fingerprint descriptor. Beyond that, GeckoQ can be used for data-driven studies in (organic) chemistry. We anticipate that GeckoQ will facilitate new atmospheric research, as the QM9³³ and OE62³⁴ datasets did in chemistry and materials science.

Methods

Dataset curation

The dataset curation involves the selection of relevant molecules and quality checks for duplicates and outliers. We used an initial set of molecules²⁷ generated with the GECKO-A²⁸ program from the parent species α-pinene, decane and toluene. These three were chosen to ensure that GeckoQ covers a diverse range of atmospheric compounds. α-pinene is the main monoterpene emitted by vegetation. Monoterpenes are a major source of biogenic emissions and prevalent in large enough concentrations to drive NPF. Conversely, toluene and decane are examples of anthropogenic aromatic and aliphatic emissions.

In composing our dataset, we removed molecules for which the group contribution method SIMPOL³⁵, as implemented in GECKO-A, predicted a p_Sat lower than 10⁻⁸ Pa, since they likely react only in the condensed phase, whereas GECKO-A only includes gas phase reactions²⁷. We further note that autoxidation reactions, which can form very low-volatility products also in the gas phase, are currently missing from GECKO-A. While the specific structures of autoxidation products are likely to differ somewhat from the molecules included in the GeckoQ dataset, they still contain the same types of functional groups, albeit with a larger percentage of peroxides and hydroperoxides. This limits the possible p_Sat range and the chemical phase space of GeckoQ, but ensures chemical consistency because all data stems from modelling well-known reaction types in the gas phase.

We next refined the list of 180k molecules and corresponding SMILES strings that GECKO-A produced³⁶. We removed duplicates that we identified at two stages of data processing. First, we purged 33,827 duplicates based on their SMILES strings in the initial GECKO-A output. Then we removed molecules with identical topological fingerprint (TopFP) descriptors. Of these, 3870 molecules were structural duplicates and redundant, but 223 molecules did exhibit structural differences. We nonetheless chose to remove this small group of molecules to avoid descriptor ambiguities in the future.
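The two-stage duplicate removal described above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the record layout (a SMILES string plus a fingerprint tuple), the helper name `drop_duplicates`, and the choice to keep the first molecule per key are all assumptions made for the example.

```python
from typing import Callable, Hashable, Iterable, List, Tuple


def drop_duplicates(items: Iterable, key: Callable[[object], Hashable]) -> Tuple[List, int]:
    """Keep the first item seen for each key and count how many were dropped."""
    seen = set()
    kept = []
    removed = 0
    for item in items:
        k = key(item)
        if k in seen:
            removed += 1
        else:
            seen.add(k)
            kept.append(item)
    return kept, removed


# Hypothetical records: (SMILES string, fingerprint bits as a tuple).
molecules = [
    ("CCO", (1, 0, 1)),
    ("CCO", (1, 0, 1)),   # identical SMILES string: dropped in stage 1
    ("OCC", (1, 0, 1)),   # different string, same fingerprint: dropped in stage 2
    ("CC=O", (0, 1, 1)),
]

# Stage 1: duplicates by SMILES string. Stage 2: duplicates by fingerprint.
stage1, n_smiles_duplicates = drop_duplicates(molecules, key=lambda m: m[0])
stage2, n_fp_duplicates = drop_duplicates(stage1, key=lambda m: m[1])
```

Running the fingerprint stage after the string stage catches molecules whose SMILES strings differ while their descriptors coincide, which is the case the second stage in the text targets.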

We furthermore removed molecules with three or more nitrogen atoms. In GeckoQ nitrogen atoms only occur in the context of nitrate and nitro groups. These groups have only a small effect on p_Sat despite their large mass and are thus less interesting for particle formation. From the remaining 157k molecules, we randomly selected 31,640 for COSMOconf and COSMOtherm calculations.

Further, we inspected all molecules with a p_Sat lower than 10⁻¹³ Pa or higher than 10⁴ Pa for outliers. In three cases we found that the molecular structure and the calculated p_Sat were inconsistent with each other: molecules that differed by only a single functional group had vastly different p_Sat values, although the contribution of a single functional group to p_Sat cannot be arbitrarily large.

We removed the three molecules from the data, resulting in an overall dataset of 31,637 molecules.

Computation of thermodynamic properties

We focused on atmospherically relevant thermodynamic properties. We computed p_Sat [Pa] and the heat of vaporisation [kJ/mol] which are related to the equilibrium between the liquid and the gas phase and therefore describe the likelihood of a molecule to contribute to particle formation and growth. We also calculated the chemical potential [kJ/mol] in the liquid of each molecule and the “free energy of a molecule in mixture” [kJ/mol]. The calculation required a set of conformer structures optimized for the liquid phase and a corresponding set optimized in the gas phase for each molecule. We included the liquid environment implicitly using the Conductor-like Screening Model for Real Solvents^37,38 (COSMO-RS), which is a continuum solvation model. The final thermodynamic properties were calculated with COSMOtherm^37,38 taking all aforementioned molecular conformers into account.

We managed the calculations for GeckoQ with the workflow manager Merlin (https://merlin.readthedocs.io). For each molecule, we carried out the four steps illustrated in Fig. 3. Each step is described in detail in the following.

Fig. 3 — Workflow for the data label calculation. “S&C” stands for a step of clustering and sorting the conformers, “DFT:SP” is a single point DFT calculation, and “DFT:OPT” stands for a DFT structure optimisation.

Conformer sampling and refinement

We used COSMOconf (www.3ds.com/products-services/biovia/) to find the low energy conformers for each molecule. The full COSMOconf “job template” is provided in the GeckoQ data repository and contains technical details on all steps. The “Input” (cf. Fig. 3) to COSMOconf is one arbitrary 3-dimensional structure generated from the molecular SMILES string with the BALLOON³⁹ conformer generator. Initially, the COSMOconf program executed a conformer search: it generated and optimized 10,000 conformers employing the “distance geometry” method⁴⁰ implemented in rdkit⁴¹ (energy threshold 200 kcal/mol, RMSD threshold 0.0⁴¹) and it generated an additional 600 conformers with the genetic algorithm implemented in BALLOON³⁹ (with the default parameters given in the COSMOconf user guide 2021, p. 39), optimizing them with the MMFF94⁴² force field.

Most of the generated conformers were structurally similar and were removed with COSMOconf’s “CLUSTER_GEOCHECK” and “CLUSTER_MU” routines, which cluster them according to their geometry and energy, respectively, and remove non-unique structures (“S&C” in Fig. 3). The “CLUSTER_GEOCHECK” routine maps conformer structures onto each other. If mapped atom types are different, or if the weighted local similarity measure between all atoms exceeds a threshold of 0.5 Å or 20°, then the conformers are considered different. Further details can be found in the COSMOconf manual. The “CLUSTER_MU” routine clusters conformers with respect to their chemical potential in mixture, where conformers with a potential difference larger than 0.2 kcal/mol are considered different.

All DFT calculations were carried out with Turbomole⁴³ using the multipole accelerated RI-approximation⁴⁴ and employed the Becke-Perdew (BP86)^45,46 exchange-correlation functional. To save computational time, the conformer search was hierarchically structured as displayed in Fig. 3. First, the DFT energy of all conformers generated with COSMOconf was calculated with a cheap SV(P) basis set (DFT:SP1). This set was reduced in a clustering and sorting step (“S&C” in Fig. 3). The geometry of the remaining conformers was optimized with the same basis set (DFT:OPT1). A subsequent S&C step reduced the conformer set further. To increase the accuracy, we repeated the geometry optimization with a larger def-TZVP basis set (DFT:OPT2). The final energy was calculated with the def2-TZVPD basis set (DFT:SP2). All of these calculations involve the COSMO-RS^37,38 model, which provides a discrete charged surface surrounding the molecule for each conformer, and we will refer to the resulting structures as “liquid phase conformers”. This surface is utilized later by COSMOtherm. The gas phase conformers were obtained from the liquid phase conformers by repeating the geometry optimization of each liquid phase conformer, but without the implicit solvation model. We again performed the geometry optimization with a slightly cheaper basis set (TZVP; DFT:OPT3) than the final energy calculation (TZVPD; DFT:SP3).

Conformer selection and property calculation

Because COSMOtherm overestimates the impact of intramolecular H-bonds²⁹, we selected only the conformers with a minimal number of these H-bonds for calculating the thermodynamic properties. First, we performed an initial COSMO-RS calculation with the pr_steric keyword, which identifies the number of intramolecular H-bonds for each conformer. We proceeded to choose the energetically lowest conformers with zero intramolecular H-bonds up to a maximum of 40 conformers, following the example of previous work³⁰. If there were no conformers with zero intramolecular H-bonds, we chose conformers with one intramolecular H-bond, and if there were none of those, two, or three.
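The selection rule above can be written as a short routine. The sketch below assumes each conformer is available as an `(energy, n_intramolecular_hbonds)` tuple; the function name and the data layout are hypothetical and are not taken from the GeckoQ scripts.

```python
def select_conformers(conformers, max_conformers=40, max_hbonds=3):
    """Pick the lowest-energy conformers with the fewest intramolecular H-bonds.

    conformers: list of (energy, n_intramolecular_hbonds) tuples (assumed layout).
    Tries 0 H-bonds first, then 1, 2 and 3, and returns at most
    max_conformers entries sorted by rising energy.
    """
    for n_hbonds in range(max_hbonds + 1):
        candidates = [c for c in conformers if c[1] == n_hbonds]
        if candidates:
            return sorted(candidates, key=lambda c: c[0])[:max_conformers]
    return []  # no conformer with at most max_hbonds intramolecular H-bonds


# Hypothetical example: two conformers without intramolecular H-bonds exist,
# so the lower-energy conformers with one or two H-bonds are ignored.
example = [(-10.0, 1), (-9.5, 0), (-9.8, 0), (-11.0, 2)]
selected = select_conformers(example)
```

Note that the H-bond count takes priority over the energy: a conformer with an intramolecular H-bond is never selected while any conformer without one exists, even if it is lower in energy.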

We utilized the selected conformers to compute thermodynamic properties with COSMOtherm^37,38. In COSMOtherm the p_Sat is calculated for each single conformer of a molecule with the assumption of a pure “solvent” consisting of that same conformer. The solvent is constructed with the discrete charged surfaces provided by the COSMO-RS DFT calculations. All the conformer p_Sat values are then weighted according to their overall population, which is determined by the Boltzmann distribution of states with different free energies, resulting in a single p_Sat. The calculations were conducted at a standard temperature of 298.15 K. The files we provide allow for a re-calculation of the properties at a different temperature (see section Usage notes).
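To make the population weighting concrete, the following sketch computes Boltzmann weights from conformer free energies and uses them to average per-conformer p_Sat values. It is an illustration under stated assumptions only: COSMOtherm's internal procedure is not reproduced here, the linear population-weighted average is an assumption, and all function names are hypothetical.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ/(mol K)


def boltzmann_weights(free_energies_kjmol, temperature_k=298.15):
    """Normalized Boltzmann populations for conformer free energies in kJ/mol."""
    g_min = min(free_energies_kjmol)  # shift by the minimum for numerical stability
    rt = R_KJ_PER_MOL_K * temperature_k
    factors = [math.exp(-(g - g_min) / rt) for g in free_energies_kjmol]
    total = sum(factors)
    return [f / total for f in factors]


def weighted_psat(psat_values, free_energies_kjmol, temperature_k=298.15):
    """Population-weighted average of per-conformer p_Sat (assumed linear average)."""
    weights = boltzmann_weights(free_energies_kjmol, temperature_k)
    return sum(w * p for w, p in zip(weights, psat_values))


# Hypothetical conformers: the one that is 10 kJ/mol higher in free energy
# contributes only a small share at 298.15 K.
weights = boltzmann_weights([0.0, 10.0])
```

Because the weights depend on the temperature through RT, the same free energies give a flatter population at higher temperatures.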

SIMPOL

The p_Sat of a molecule can also be computed with group contribution methods such as SIMPOL³⁵, which is frequently employed by the atmospheric research community^21,27,30. As a form of validation, we compare the DFT p_Sat to the SIMPOL p_Sat (cf. Technical Validation section). SIMPOL is based on

\log_{10} p_{\mathrm{Sat}} = \sum_{k} \nu_{k} b_{k},

where ν_k is the number of functional groups of type k found in a molecule and b_k is a group-specific parameter that has been fitted to reference data. The APRL Substructure Search Program (APRL-SSP)³² is the only publicly available program that can extract the SIMPOL ν_k values from SMILES strings (cf. Fig. 2). We found that APRL-SSP does not count carbonyl groups attached to a carbon that is also attached to a peroxy group, and corrected the number of ketones and aldehydes accordingly. After correction, we calculated p_Sat with our own MATLAB SIMPOL implementation.
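The group contribution sum is simple to evaluate once the group counts are known. The sketch below is not the authors' MATLAB implementation; the parameter values are placeholders chosen for illustration and are NOT the fitted SIMPOL coefficients, which must be taken from the SIMPOL reference³⁵.

```python
def simpol_log10_psat(group_counts, group_params):
    """Evaluate log10(p_Sat) as the sum over groups of count * parameter."""
    return sum(count * group_params[name] for name, count in group_counts.items())


# Placeholder b_k values for illustration only (not fitted SIMPOL parameters).
placeholder_params = {
    "ketone": -1.0,
    "hydroxyl (alkyl)": -2.0,
    "hydroperoxide": -2.5,
}

# Hypothetical molecule with two ketone groups and one alkyl hydroxyl group.
counts = {"ketone": 2, "hydroxyl (alkyl)": 1}
log10_psat = simpol_log10_psat(counts, placeholder_params)
psat = 10.0 ** log10_psat
```

Since the contributions add up on a log10 scale, each additional group multiplies p_Sat by a fixed factor, which is why a single functional group cannot change p_Sat by an arbitrary amount in this model.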

Structural descriptor: topological fingerprint

For machine learning, molecules need to be represented in a machine readable format, a so-called descriptor⁴⁷. In previous work²⁵, some of us had investigated a variety of molecular descriptors for learning atmospherically relevant thermodynamic properties: the Coulomb Matrix⁴⁸, the Many-Body-Tensor-Representation (MBTR)⁴⁹, the MACCS structural key⁵⁰, the Topological Fingerprint (TopFP)^41,51, and the Morgan Fingerprint^41,52. MBTR and TopFP provided the highest accuracy for a kernel-ridge regression based model²⁵. For the machine-learning model of our GeckoQ data, we therefore chose the TopFP, because it produces the same descriptor for all conformers of a molecule and is thus not sensitive to the precise atomic structure of the GeckoQ molecules. Additionally, it is computationally inexpensive compared to MBTR. The TopFP hyperparameters had to be optimised and adjusted to the current dataset. The optimized hyperparameters we found were a size of 8192 for the descriptor array, a minimum path of 1, a maximum path of 9, and 6 bits per hash.

Gaussian process regression

The GeckoQ dataset is intended to facilitate the application of machine learning methods in the field of atmospheric research. To demonstrate a first use case, we employ Gaussian process regression (GPR), a kernel-based probabilistic tool for supervised machine learning⁵³, to predict the p_Sat of molecules from their geometry.

In GPR, a prior belief of the outcome is combined with the data in Bayes’ rule to perform the regression and compute the GP posterior. The mean of the GP posterior constitutes the prediction and its variance is a measure for the reliability of the result. The model covariance is encoded into a kernel function. We deployed an uninformative GP prior and a product kernel, where a constant θ_s multiplies the Radial Basis Function (RBF) kernel:

K_{\mathrm{RBF},s}(x_{1}, x_{2}) = \theta_{s} \exp\left(-\frac{1}{2} (x_{1} - x_{2})^{T} \theta_{l}^{-2} (x_{1} - x_{2})\right).

The kernel function contains the signal lengthscale θ_l and the function amplitude θ_s as hyperparameters. We optimized θ_l and θ_s by maximizing the log marginal likelihood during data fitting. To avoid local optima, we restarted each log marginal likelihood maximization six times. The resulting θ_l and θ_s values both lie in the range of 2000 to 6000, depending on the training data size. The data noise was also treated as a hyperparameter and consistently optimized to a value of 0.23. We used the PyTorch Python package⁵⁴ for all GPR calculations. Since p_Sat varies across many orders of magnitude, we learned log10 p_Sat. All data were normalized prior to machine learning.
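For a scalar lengthscale, the quadratic form in the kernel reduces to the squared Euclidean distance divided by θ_l². The sketch below evaluates the kernel in plain Python for two short binary vectors; it is only an illustration of the formula, not the GPR code used for GeckoQ, and the example vectors and hyperparameter values are hypothetical.

```python
import math


def rbf_kernel(x1, x2, theta_s, theta_l):
    """Scaled RBF kernel: theta_s * exp(-0.5 * |x1 - x2|^2 / theta_l^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return theta_s * math.exp(-0.5 * sq_dist / theta_l ** 2)


# Hypothetical 6-bit fingerprints; real TopFP vectors are much longer.
fp_a = [1, 0, 1, 1, 0, 0]
fp_b = [1, 1, 0, 1, 0, 0]

k_same = rbf_kernel(fp_a, fp_a, theta_s=2.0, theta_l=3.0)  # equals theta_s
k_diff = rbf_kernel(fp_a, fp_b, theta_s=2.0, theta_l=3.0)  # smaller than theta_s
```

For binary fingerprints the squared distance is the number of differing bits, so the kernel value depends only on how many bits two molecules do not share.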

To assess GPR performance, we divided the data into two subsets, a test set and a training set. We calculated the mean absolute error (MAE) between the predictions and the actual p_Sat as a measure of accuracy. The MAE was chosen for continuity with previous work²⁵. To assess learning success, we computed a learning curve by training a series of GP models for training set sizes from 2000 to 28,000 molecules in steps of 2000 and evaluated the models with a test set of 2000 molecules. We applied 5-fold cross validation to obtain five separate models, and averaged the resulting MAEs to account for statistical fluctuations.
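The evaluation procedure can be sketched without any machine learning library. The MAE is computed here as the mean of absolute differences on the log10 scale, and the fold construction is one common way to build a 5-fold split; how the folds were actually drawn for GeckoQ is not specified in the text, so the shuffling, the seed and the function names are assumptions.

```python
import random


def mean_absolute_error(y_true, y_pred):
    """Mean of |y_true - y_pred| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k disjoint shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


# Hypothetical log10(p_Sat) values and predictions.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 3.0, 5.0])

# Each sample ends up in exactly one test fold.
splits = list(k_fold_indices(10, k=5))
```

Averaging the MAE over the five test folds gives the single number per training set size that a learning curve plots.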

Data Records

The main GeckoQ dataframe (Dataframe.csv) consists of 31,637 rows with entries for the identifiers, attributes, labels and functional groups for each molecule. Table 1 provides a detailed break-down of the Dataframe.csv. For completeness, we also included the topological fingerprints and the RDkit objects for each molecule in GeckoQ. The corresponding TopFP and RDkit objects are stored in separate files, TopFP.jl and RDkitObjects.jl, respectively, and are labelled according to the index of the molecules.

Table 1.

Detailed description of all the columns in the Dataframe.csv file.

No.	Column name	Unit	Description
1	index	—	A unique molecule index used in naming files, see Table 2.
2	SMILES	—	The canonical SMILES string as provided by GECKO-A.
3	InChIKey	—	The standard InChIKey of the molecule.
4	pSat_Pa	Pa	The p_Sat of the molecule calculated by COSMOtherm.
5	ChemPot_kJmol	kJ/mol	The chemical potential of the molecule calculated by COSMOtherm.
6	FreeEnergy_kJmol	kJ/mol	The free energy of the molecule calculated by COSMOtherm.
7	HeatOfVap_kJmol	kJ/mol	The heat of vaporisation of the molecule calculated by COSMOtherm.
8	MW	g/mol	The molecular weight of the molecule.
9	NumOfAtoms	—	The number of atoms in the molecule.
10	NumOfC	—	The number of carbon atoms in the molecule.
11	NumOfO	—	The number of oxygen atoms in the molecule.
12	NumOfN	—	The number of nitrogen atoms in the molecule.
13	NumHBondDonors	—	The number of hydrogen bond donors in the molecule, i.e. hydrogens bound to an oxygen.
14	NumOfConf	—	The number of stable conformers found and successfully calculated by COSMOconf.
15	NumOfConfUsed	—	The number of conformers that have been used to calculate the thermodynamic properties. The selection of these conformers is discussed in more detail in Sec. Conformer selection and property calculation.
16	parentspecies	—	Either “decane”, “toluene”, “apin” for α-pinene, or a combination of these connected by an underscore to indicate ambiguous descent. In 243 cases the parent species is “None”, because it was not possible to retrieve it.
17	C=C (non-aromatic)	—	The number of non-aromatic C=C bonds found in the molecule.
18	C=C-C=O in non-aromatic ring	—	The number of C=C-C=O structures found in non-aromatic rings in the molecule.
19	hydroxyl (alkyl)	—	The number of the alkylic hydroxyl groups found in the molecule.
20	aldehyde	—	The number of aldehyde groups found in the molecule.
21	ketone	—	The number of ketone groups found in the molecule.
22	carboxylic acid	—	The number of carboxylic acid groups found in the molecule.
23	ester	—	The number of ester groups found in the molecule.
24	ether (alicyclic)	—	The number of alicyclic ether groups found in the molecule.
25	nitrate	—	The number of nitrate groups found in the molecule.
26	nitro	—	The number of nitro groups found in the molecule.
27	aromatic hydroxyl	—	The number of aromatic hydroxyl groups found in the molecule.
28	carbonylperoxynitrate	—	The number of carbonylperoxynitrate groups found in the molecule.
29	peroxide	—	The number of peroxide groups found in the molecule.
30	hydroperoxide	—	The number of hydroperoxide groups found in the molecule.
31	carbonylperoxyacid	—	The number of carbonylperoxyacid groups found in the molecule.
32	nitroester	—	The number of nitroester groups found in the molecule.
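The column names in Table 1 can be used directly when reading Dataframe.csv. The sketch below parses an in-memory stand-in for the file with Python's standard csv module and converts p_Sat to the log10 scale. The comma delimiter, the sample rows and the index values are assumptions made for the example; the real file should be checked against the README.md in the repository.

```python
import csv
import io
import math


def load_log10_psat(csv_file):
    """Map each molecule index to log10 of its pSat_Pa value."""
    reader = csv.DictReader(csv_file)
    return {row["index"]: math.log10(float(row["pSat_Pa"])) for row in reader}


# Hypothetical stand-in for Dataframe.csv with only three of the 32 columns.
sample = io.StringIO(
    "index,SMILES,pSat_Pa\n"
    "mol_a,CCO,100.0\n"
    "mol_b,CC=O,0.001\n"
)
log_psat = load_log10_psat(sample)
```

With the real file, the function would be called on an open file handle, for example `with open("Dataframe.csv", newline="") as f: load_log10_psat(f)`.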

GeckoQ includes 7,259,598 conformer structures. Dataframe_conformerE.csv contains liquid phase and gas phase energies for all conformers. These energies were extracted from each conformer’s “.cosmo” file (liquid phase conformer) and “.energy” file (gas phase conformer), which are also available. In addition, we collected various input, output and intermediate files generated in the label calculation process and compressed them in a separate zip archive for each molecule. The different file types are explained in Table 2. We have further grouped the zip archives of every 800 molecules into a tar archive. A list of the resulting 40 tar archives and their molecular indices is provided in the README.md of the data repository.

Table 2.

Details of all the files that can be found in the data repository for each molecule.

file name	type	description
$id.sdf	structure	A structure created from the SMILES string, used as the COSMOconf input file for molecule $id.
$id_c$i.cosmo	structures	The COSMOconf output file. Contains the structure and energy for the liquid phase of conformer $i of molecule $id. Conformers are ranked according to rising energy (“Total energy [a.u.]”) and $id_c0.cosmo is the most stable conformer. In some cases, some conformers were removed due to computational errors or non-convergence.
$id_c$i.energy	structures	The structure and energy file for the gas phase conformers of molecule $id.
$id-h-bonds.inp	input	The input file for the pr_steric calculation to determine the number of intramolecular hydrogen bonds for each conformer. It accepts an input file with the list of all conformers, which can be reconstructed from the entry “$id-h-bonds-confs.txt”.
$id-h-bonds.out	output	The output file of the pr_steric calculation. It contains electrostatic and steric information for each conformer. It is possible to retrieve the number of intramolecular H-bonds by checking the overlap of donor groups with neighbouring acceptor groups.
$id-h-bonds-confs.txt	list	A list of all conformers and corresponding numbers of “partial” H-bonds and “full” H-bonds.
COSMOFILES-lt$MinHBondsbonds.txt	list	A list of all conformers with a minimum number of H-bonds, $MinHBonds. Details in Section Conformer selection and property calculation. This file is required by the COSMOtherm calculation input.
lt$MinHBondsbonds.inp	input	The input file for the COSMOtherm calculation using only conformers with $MinHBonds H-bonds.
lt$MinHBondsbonds.out	output	The output file for the COSMOtherm calculation using only conformers with $MinHBonds H-bonds. It contains all the thermodynamic labels we calculate.

The file names contain variables, where $id refers to the entry of the molecule in the “index” column, $i is the number of a conformer and $MinHBonds is the minimum number of H-bonds found for any conformer of a molecule. Files of the “structure” type contain a molecular 3D xyz structure.
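The naming patterns in Table 2 can be expanded programmatically when looking for files inside a molecule's archive. The sketch below only builds the expected names; the helper names and the example index are hypothetical, and because some conformers were removed after failed calculations, the numbering in a real archive may contain gaps, so the generated names should be checked against the files actually present.

```python
def conformer_files(mol_id, n_conformers):
    """Expected (.cosmo, .energy) file name pairs, assuming numbering starts at 0."""
    return [
        (f"{mol_id}_c{i}.cosmo", f"{mol_id}_c{i}.energy")
        for i in range(n_conformers)
    ]


def cosmotherm_files(min_hbonds):
    """Expected COSMOtherm list, input and output names for a given $MinHBonds."""
    return {
        "list": f"COSMOFILES-lt{min_hbonds}bonds.txt",
        "input": f"lt{min_hbonds}bonds.inp",
        "output": f"lt{min_hbonds}bonds.out",
    }


# Hypothetical molecule index "1234" with two conformers and no H-bonds.
pairs = conformer_files("1234", 2)
names = cosmotherm_files(0)
```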

GeckoQ is freely available for download from its Fairdata.fi Etsin repository: 10.23729/022475cc-e527-41a9-bbc0-0113923cf04c⁵⁵. The data is organised as follows:

README.md: General information regarding the data, also provided in this section, “Usage Notes”, and “Figures and Tables”.
Dataframe.csv: Properties and attributes of the GeckoQ molecules (see Table 1).
Dataframe_conformerE.csv: Liquid phase (“totE_liq” column) and gas phase energies (“totE_gas” column) of the GeckoQ conformers.
TopFP.jl: The index and the topological fingerprint for each molecule in GeckoQ for machine learning.
RDkitObjects.jl: A data frame with the indices and an rdkit object for each molecule to facilitate quick and simple visualization, or calculation of different attributes and descriptors.
Data entries/: 40 tar-archives each with 800 zip files, one for each molecule, containing the files specified in Table 2.
Code/: A directory with a jupyter notebook containing instructions on how to load, transform and visualize the data, and with a bash script containing instructions on handling the data files (see Sec. “Usage Notes”). The directory also contains the applied COSMOconf job template.

Technical Validation

To validate GeckoQ, we review the applied p_Sat calculation procedure to check for convergence and for physical and chemical consistency. In addition, we show that the computed p_Sat are consistent with simpler models, and demonstrate a first machine learning application.

First we survey the computational uncertainty of the COSMO-RS model applied in the combined COSMOconf and COSMOtherm approach. The COSMO-RS model was originally parametrized with experimental values of 217 molecules³⁸ and later refined with another 310 molecules⁵⁶. These reference molecules include a diverse range of organic molecular classes and contain the elements H, C, N, O, and Cl with F, S, Br and I added in the refinement. The following accuracies were reported: maximum of 0.566 log units (vapor pressures), 0.451 log(max/γ^∞) (activity coefficients), and 1.2 kJ/mol (Gibbs free-energies). Relative to the p_Sat range of twenty orders of magnitude in GeckoQ, the expected COSMO vapor pressure error and the inherent data noise σ of 0.23 log(σ/Pa) estimated by GPR are in good agreement. Further, we review the quality of our COSMOconf settings. In the initial conformer search for the generation of GeckoQ data we applied the most accurate settings that COSMOconf offers. We utilized the BALLOON program as well as rdkit for conformer generation to ensure that we captured all relevant structures. We furthermore conducted the final DFT calculations at the highest fidelity level included in COSMOconf and COSMOtherm (BP86/def2-TZVPD). We also monitored resulting conformer structures. Failure of conformer calculations to complete correctly (resulting in unphysical energies), or dissociation of a conformer during the structure optimization was indicated by warnings in the COSMOtherm output. In such cases, we removed the few conformers in question and repeated the thermodynamics calculation with all the remaining conformers.

As noted in Section Conformer selection and property calculation, p_Sat is obtained from a weighted average over multiple conformers. For this reason, we checked how sensitive the value of p_Sat is to changes in the conformer selection. We considered all molecules with at least 40 conformers, then randomly chose a subset of 110 for this test. First, the p_Sat was calculated using only the most stable conformer, then we added the next most stable conformer to the selection and computed the result again. This was repeated until we reached 40 conformers. For each of the interim p_Sat values, we computed the ratio to the p_Sat obtained with the maximum number of 40 conformers. Figure 4a displays the mean and standard deviation of this ratio over all 110 molecules. The figure illustrates that the p_Sat for a single conformer deviates by a factor of 3.9 from the converged result. As more conformers are added, this discrepancy decreases and converges to a 1:1 ratio at 32 conformers. The drop of the standard deviation at 32 conformers is caused by a single outlier, where the addition of the 32nd conformer changed the vapor pressure ratio by 1.3 orders of magnitude. Based on the average of these 110 molecules, we conclude that the p_Sat values in the GeckoQ data are not sensitive to conformer numbers higher than 32. Thus we can confirm that our choice of a maximum number of 40 conformers was adequate for good precision of p_Sat. For molecules with fewer than 32 conformers, all conformers were included.
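The convergence check can be expressed as a ratio series followed by a search for the point after which the ratio stays close to one. The sketch below assumes the interim p_Sat values are already available as a list ordered by the number of conformers used; the 5% tolerance, the function names and the example values are assumptions for illustration and are not taken from the analysis behind Fig. 4a.

```python
def convergence_ratios(psat_by_n):
    """Ratio of each interim p_Sat to the value with the maximum conformer count."""
    reference = psat_by_n[-1]
    return [p / reference for p in psat_by_n]


def converged_from(ratios, tolerance=0.05):
    """Smallest conformer count (1-based) from which all ratios stay within tolerance of 1."""
    n = len(ratios)
    for i in range(len(ratios) - 1, -1, -1):
        if abs(ratios[i] - 1.0) > tolerance:
            break
        n = i + 1
    return n


# Hypothetical interim p_Sat values in Pa for 1 to 5 conformers.
interim = [3.9e-3, 2.0e-3, 1.2e-3, 1.02e-3, 1.0e-3]
ratios = convergence_ratios(interim)
n_converged = converged_from(ratios)
```

Scanning from the largest conformer count backwards ensures that a late outlier, such as a single conformer that shifts the ratio strongly, moves the reported convergence point to after that outlier.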

Next, we examine the relationship between the COSMOtherm p_Sat and molecular structural properties. We focus on the molecular weight (MW), as a universal measure of molecule size, and on functional groups, since functional groups have the largest influence on p_Sat. For example, functional groups can establish intermolecular as well as intramolecular interactions. Intermolecular interactions lead to a stabilization of the molecule in the liquid phase, i.e., a low p_Sat, whereas intramolecular interactions stabilize the molecule in the gas phase and lead to a high p_Sat.

In Fig. 4b we plotted p_Sat as a function of the MW. For small molecules p_Sat is high. It decreases with increasing MW before leveling out at approximately 220 g/mol and a p_Sat of roughly 10⁻⁴ Pa. The decrease of p_Sat is consistent with Fig. 4c, which shows that a higher number of functional groups decreases p_Sat. Beyond 220 g/mol the abundance of nitrate (62 g/mol) and nitro (46 g/mol) groups (see Fig. 2) dominates; the largest molecules without any nitrate or nitro groups have a MW of 282 g/mol. These groups have a comparatively large mass, but their contribution to a lower p_Sat is small, which explains the saturation of p_Sat at a low value. Note also that during dataset curation, molecules with very low saturation vapor pressure were removed, which provides another reason for p_Sat leveling out with increasing MW.

Next we present the SIMPOL consistency check. Extraction of the functional groups for SIMPOL is not a trivial problem. The molecules in GeckoQ contain a large number and high density of functional groups. To verify APRL-SSP’s extraction accuracy including our correction for “carbonyl-peroxides”, we visually inspected 100 randomly chosen molecules, ensuring that all possible functional groups were present. All functional groups that we found were also found by APRL-SSP. However, we cannot exclude the possibility that combinations of functional groups (such as the “carbonyl-peroxide” group) exist that make the results unreliable, albeit very infrequently. Overall, we report an accuracy of more than 99% for the GeckoQ functional groups.

Figure 5a shows the ratio of the COSMOtherm p_Sat to SIMPOL’s as a function of the number of functional groups N_FG in the corresponding molecules. Ideally, the median would lie around one, but we find it to increase with N_FG. For molecules with 2–5 functional groups, the SIMPOL p_Sat is higher than COSMOtherm’s. The behaviour reverses for 6–9 functional groups. This trend is consistent with the difference between SIMPOL and COSMOtherm. Both methods account for intermolecular interactions, but only COSMOtherm accounts also for intramolecular interactions, which lead to a higher p_Sat. Intramolecular interactions become more important for large N_FG, for which we observed the ratio reversal and larger deviation between SIMPOL and COSMOtherm in line with previous such comparisons^30,57.
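A comparison of this kind can be reproduced with a short grouping step. The sketch below is an illustration under stated assumptions, not the code behind Fig. 5a: the function name, the record layout `(n_fg, p_cosmo, p_simpol)` and the example values are hypothetical. It works with the log10 of the ratio, so a median of zero corresponds to the ideal ratio of one, negative values mean that the SIMPOL p_Sat is higher, and positive values mean that the COSMOtherm p_Sat is higher.

```python
import math
from collections import defaultdict
from statistics import median


def median_log_ratio_by_nfg(records):
    """Group molecules by their number of functional groups and return the
    median of log10(p_Sat COSMOtherm / p_Sat SIMPOL) for each group.

    `records` is an iterable of (n_fg, p_cosmo, p_simpol) tuples with both
    pressures in the same unit (hypothetical layout, for illustration).
    """
    groups = defaultdict(list)
    for n_fg, p_cosmo, p_simpol in records:
        groups[n_fg].append(math.log10(p_cosmo / p_simpol))
    return {n_fg: median(values) for n_fg, values in sorted(groups.items())}


# Made-up example values (not taken from GeckoQ).
example = [
    (2, 1.0e-3, 1.0e-2),
    (2, 1.0e-4, 1.0e-2),
    (2, 1.0e-2, 1.0e-2),
    (7, 1.0e-5, 1.0e-6),
]
for n_fg, value in median_log_ratio_by_nfg(example).items():
    print(f"N_FG = {n_fg}: median log10 ratio = {value:+.2f}")
```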

Finally, we applied a GPR to the GeckoQ data to map the molecular structures to p_Sat. The resulting learning curve is depicted in Fig. 5b. The MAE for the minimal training set size of 2k is 1.02 log(MAE/Pa). With more training data, the error reduces to 0.82 log(MAE/Pa) at a training set size of 28k. The learning rate is not constant. For small training sets, the GPR learns slightly faster than for larger ones, where we suspect that our machine-learning model finds less diversity and more redundancy in GeckoQ.

The final MAE of 0.82 log(MAE/Pa) at 28k is similar to the data uncertainties reported for COSMOtherm of 0.5 log units⁵⁶, and the inferred data noise of 0.23 log(σ/Pa). This error is smaller than the average deviation between SIMPOL and COSMOtherm. The MAE of the GPR could be further reduced with more data. By extrapolating the learning curve, we estimate that a training set with 2 million molecules would be needed to bring the MAE down to 0.5 log units.
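One common way to extrapolate a learning curve is to fit a power law, MAE = a · N^(−b), by linear regression in log-log space and then solve for the training set size N that reaches a target MAE. The text does not state which functional form was used for the estimate above, so the power law is an assumption here, as are the function names. The intermediate points of the learning curve in Fig. 5b are not given in the text, so the example uses synthetic values and is not expected to reproduce the estimate of 2 million molecules.

```python
import math


def fit_power_law(sizes, maes):
    """Least-squares fit of MAE = a * N**(-b) in log-log space.
    Returns the tuple (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(m) for m in maes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


def size_for_target(a, b, target_mae):
    """Training set size at which the fitted power law reaches target_mae."""
    return (a / target_mae) ** (1.0 / b)


# Synthetic learning curve (assumption, not the data behind Fig. 5b).
sizes = [1000, 4000, 16000]
maes = [2.0 * n ** (-0.1) for n in sizes]

a, b = fit_power_law(sizes, maes)
print(f"fitted a = {a:.3f}, b = {b:.3f}")
print(f"size needed for MAE 0.6: {size_for_target(a, b, 0.6):.3g}")
```

Because the learning rate reported above is not constant, a single power law fitted to the whole curve can be too optimistic or too pessimistic; fitting only the large-N tail is a frequent alternative.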

Previous work²⁵ also employed the TopFP descriptor in a kernel ridge regression machine learning model to learn p_Sat as a function of atomic structure for a different atmospheric dataset. They obtained a MAE of 0.31 log(MAE/Pa). Their dataset contained a comparatively narrow range of molecules, with a median of N_FG = 3 and a range of 1–6 functional groups. The GeckoQ molecules are larger, more complex and more diverse and thus harder to learn, which manifests in a higher MAE²⁶.

Usage Notes

The Etsin repository with the data contains the directory Code/ with a Jupyter notebook BasicDataprocessing.ipynb. The notebook includes basic instructions on how to load, analyze, and search the data, and how to transform the data to a descriptor and target array for machine learning. Further, it can also serve as a guide for using the rdkit toolbox to calculate molecule properties or to create new descriptors, and it additionally contains our correction for the counts of ketone and aldehyde groups. The Jupyter notebook is also provided as an HTML file. The Dataframe.csv and Dataframe_conformerE.csv can be loaded by any programming language for statistical computing, whereas TopFP.jl and RDkitObjects.jl need to be loaded with the joblib Python package.

When machine learning methods are applied to labels such as the p_Sat, we recommend transforming the values to their log10, because machine learning algorithms are usually sensitive to the scale of the data. Code/ also contains the bash script FileProcessing.sh with code for processing the conformer files or COSMOtherm output. It also demonstrates how to unzip only the single energetically most stable conformers and how to quickly extract values from the COSMOtherm output. Moreover, FileProcessing.sh contains all steps for recalculating a molecule's properties for a different temperature.
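The log10 transformation recommended above can be sketched as follows. This is a minimal example with made-up pressures; the function names are hypothetical and no column names from the GeckoQ files are assumed. It also shows how a mean absolute error is evaluated on the transformed labels, which is the sense in which errors in log units are reported in the Technical Validation.

```python
import math


def to_log10(pressures_pa):
    """Transform strictly positive pressures (Pa) to log10 labels."""
    return [math.log10(p) for p in pressures_pa]


def from_log10(log_values):
    """Invert the transformation to recover pressures in Pa."""
    return [10.0 ** v for v in log_values]


def mean_absolute_error(predicted, reference):
    """MAE between two equally long lists of (log10) labels."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Made-up saturation vapor pressures in Pa (not taken from GeckoQ).
psat = [1.0e-4, 1.0, 100.0]
labels = to_log10(psat)
print(labels)

# Hypothetical model predictions in log10 units.
predictions = [-4.5, 0.5, 2.0]
print(mean_absolute_error(predictions, labels))
```

Training on the transformed labels means that an error of one unit corresponds to one order of magnitude in p_Sat, regardless of where in the pressure range a molecule lies.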

Acknowledgements

We thank Bernard Aumont for sharing the original raw GECKO-A data. Further, we are grateful to Noora Hyttinen for providing the COSMOconf files and general support with the COSMO programs, and we acknowledge Pontus Roldin for sharing his code for the SIMPOL calculation. This work was supported by the CSC - IT Center for Science by providing access to the Mahti computer cluster via the “ACTIVE ATMOS” Mahti Grand Challenge project, as well as EuroHPC for facilitating our work on the LUMI platform. This study received financial support from the Academy of Finland through its flagship program, the Atmosphere and Climate Competence Center (Grant No. 337549), and the Centers of Excellence Program (CoE VILMA, Grant No. 346368).

Author contributions

V.B. conducted all calculations, postprocessing, and drafted the manuscript. M.T. and V.B. planned the calculations and the technical validation. V.B., M.T., P.R., T.K. and H.V. designed and guided the study, and participated in the manuscript preparation.

Code availability

Custom code written for data generation mainly consists of scripts for pre- and postprocessing steps linking together the software mentioned below. These scripts are executed through a Merlin workflow. All these scripts are publicly available in a GitHub repository: https://github.com/Supervitux/COSMO_on_Merlin⁵⁸. GECKO-A is available at its website http://geckoa.lisa.u-pec.fr/. COSMOconf 4.3 and COSMOtherm 2021 and their licenses were purchased from Dassault Systemes (https://www.3ds.com/). We provide our custom COSMOconf job template (COSMOConfProtocol.xml) in the repository. Merlin version 1.7.5 is freely available from https://merlin.readthedocs.io/en/latest/index.html#.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. IPCC. Summary For Policymakers, 3–32 (Cambridge University Press, 2021).
2. Döscher R, et al. The ec-earth3 earth system model for the coupled model intercomparison project 6. Geoscientific Model Development. 2022;15:2973–3020. doi: 10.5194/gmd-15-2973-2022.
3. Boucher O, et al. Presentation and evaluation of the ipsl-cm6a-lr climate model. Journal of Advances in Modeling Earth Systems. 2020;12:e2019MS002010. doi: 10.1029/2019MS002010.
4. Giorgi F. Thirty years of regional climate modeling: Where are we and where are we going next? J. Geophys. Res. Atmos. 2019;124:5696–5723.
5. Krüger M, et al. Convolutional neural network prediction of molecular properties for aerosol chemistry and health effects. Natural Sciences. 2022;2:e20220016. doi: 10.1002/ntls.20220016.
6. Borne K. Astroinformatics: Data-oriented astronomy research and education. Earth Sci. Inform. 2010;3:5–17. doi: 10.1007/s12145-010-0055-2.
7. Wierling C, Lehrach H, Herwig R, Kamburov A. Consensuspathdb–a database for integrating human functional interaction networks. Nucleic Acids Res. 2008;37:D623–D628. doi: 10.1093/nar/gkn698.
8. Berman H, Henrick K, Nakamura H. Announcing the worldwide protein data bank. Nat. Struct. Biol. 2003;10:980. doi: 10.1038/nsb1203-980.
9. Himanen L, Geurts A, Foster AS, Rinke P. Data-driven materials science: Status, challenges, and perspectives. Adv. Sci. 2019;6:1900808. doi: 10.1002/advs.201900808.
10. Liebal UW, Phan ANT, Sudhakar M, Raman K, Blank LM. Machine learning applications for mass spectrometry-based metabolomics. Metabolites. 2020;10:243. doi: 10.3390/metabo10060243.
11. Arias, P. et al. Climate Change 2021: The Physical Science Basis. Contribution Of Working Group I To The Sixth Assessment Report Of The Intergovernmental Panel On Climate Change: Technical Summary, 33–144 (Cambridge University Press, 2021).
12. Merikanto, J., Spracklen, D., Mann, G., Pickering, S. & Carslaw, K. Impact of nucleation on global CCN. Atmos. Chem. Phys. 9 (2009).
13. Metzger A, et al. Evidence for the role of organics in aerosol particle formation under atmospheric conditions. Proceedings of the National Academy of Sciences. 2010;107:6646–6651. doi: 10.1073/pnas.0911330107.
14. Kerminen V-M, et al. Atmospheric new particle formation and growth: review of field observations. Environmental Research Letters. 2018;13:103003. doi: 10.1088/1748-9326/aadf3c.
15. Kupc A, et al. The potential role of organics in new particle formation and initial growth in the remote tropical upper troposphere. Atmos. Chem. Phys. 2020;20:15037–15060. doi: 10.5194/acp-20-15037-2020.
16. Zhang R, et al. Atmospheric new particle formation enhanced by organic acids. Science. 2004;304:1487–1490. doi: 10.1126/science.1095139.
17. Seinfeld JH, Pankow JF. Organic atmospheric particulate material. Annual Review of Physical Chemistry. 2003;54:121–140. doi: 10.1146/annurev.physchem.54.011002.103756.
18. Lee, B. H. et al. Ring-opening yields and auto-oxidation rates of the resulting peroxy radicals from OH-oxidation of α-pinene and β-pinene. Environ. Sci.: Atmos. – (2023).
19. Crounse JD, Nielsen LB, Jørgensen S, Kjaergaard HG, Wennberg PO. Autoxidation of organic compounds in the atmosphere. Journal of Physical Chemistry Letters. 2013;4:3513–3520. doi: 10.1021/jz4019207.
20. Wang Z, et al. Unraveling the structure and chemical mechanisms of highly oxygenated intermediates in oxidation of organic compounds. Proceedings of the National Academy of Sciences. 2017;114:13102–13107. doi: 10.1073/pnas.1707564114.
21. Wang C, et al. Uncertain henry’s law constants compromise equilibrium partitioning calculations of atmospheric oxidation products. Atmos. Chem. Phys. 2017;17:7529–7540. doi: 10.5194/acp-17-7529-2017.
22. Saunders SM, Jenkin ME, Derwent RG, Pilling MJ. Protocol for the development of the master chemical mechanism, MCM v3 (part a): tropospheric degradation of non-aromatic volatile organic compounds. Atmos. Chem. Phys. 2003;3:161–180. doi: 10.5194/acp-3-161-2003.
23. Bloss C, et al. Development of a detailed chemical mechanism (MCMv3.1) for the atmospheric oxidation of aromatic hydrocarbons. Atmos. Chem. Phys. 2005;5:641–664. doi: 10.5194/acp-5-641-2005.
24. Jenkin ME, Young JC, Rickard AR. The MCM v3.3.1 degradation scheme for isoprene. Atmos. Chem. Phys. 2015;15:11433–11459. doi: 10.5194/acp-15-11433-2015.
25. Lumiaro E, Todorović M, Kurten T, Vehkamäki H, Rinke P. Predicting gas–particle partitioning coefficients of atmospheric molecules with machine learning. Atmos. Chem. Phys. 2021;21:13227–13246. doi: 10.5194/acp-21-13227-2021.
26. Stuke A, et al. Chemical diversity in molecular orbital energy predictions with kernel ridge regression. Journal of Chemical Physics. 2019;150:204121. doi: 10.1063/1.5086105.
27. Isaacman-VanWertz G, Aumont B. Impact of organic molecular structure on the estimation of atmospherically relevant physicochemical parameters. Atmos. Chem. Phys. 2021;21:6541–6563. doi: 10.5194/acp-21-6541-2021.
28. Aumont B, Szopa S, Madronich S. Modelling the evolution of organic carbon during its gas-phase tropospheric oxidation: development of an explicit model based on a self generating approach. Atmos. Chem. Phys. 2005;5:2497–2517. doi: 10.5194/acp-5-2497-2005.
29. Kurtén T, Hyttinen N, D’Ambro EL, Thornton J, Prisle NL. Estimating the saturation vapor pressures of isoprene oxidation products C5H12O6 and C5H10O6 using COSMO-RS. Atmos. Chem. Phys. 2018;18:17589–17600. doi: 10.5194/acp-18-17589-2018.
30. Hyttinen N, et al. Comparison of saturation vapor pressures of α-pinene + o3 oxidation products derived from COSMO-RS computations and thermal desorption experiments. Atmos. Chem. Phys. 2022;22:1195–1208. doi: 10.5194/acp-22-1195-2022.
31. Khrabrov K, et al. nabladft: Large-scale conformational energy and hamiltonian prediction benchmark and dataset. Phys. Chem. Chem. Phys. 2022;24:25853–25863. doi: 10.1039/D2CP03966D.
32. Ruggeri G, Takahama S. Technical note: Development of chemoinformatic tools to enumerate functional groups in molecules for organic aerosol characterization. Atmos. Chem. Phys. 2016;16:4401–4422. doi: 10.5194/acp-16-4401-2016.
33. Ramakrishnan R, Dral PO, Rupp M, von Lilienfeld OA. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data. 2014;1:140022. doi: 10.1038/sdata.2014.22.
34. Stuke A, et al. Atomic structures and orbital energies of 61,489 crystal-forming organic molecules. Scientific Data. 2020;7:58. doi: 10.1038/s41597-020-0385-y.
35. Pankow, J. & Asher, W. SIMPOL.1: A simple group contribution method for predicting vapor pressures and enthalpies of vaporization of multifunctional organic compounds. Atmos. Chem. Phys. 8 (2008).
36. Aumont, B. personal communication (2020).
37. Klamt A, Schüürmann G. Cosmo: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J. Chem. Soc., Perkin Trans. 1993;2:799–805. doi: 10.1039/P29930000799.
38. Klamt A, Jonas V, Bürger T, Lohrenz JCW. Refinement and parametrization of cosmo-rs. Journal of Physical Chemistry A. 1998;102:5074–5085. doi: 10.1021/jp980017s.
39. Vainio MJ, Johnson MS. Generating conformer ensembles using a multiobjective genetic algorithm. Journal of Chemical Information and Modeling. 2007;47:2462–2474. doi: 10.1021/ci6005646.
40. Blaney, J. M. & Dixon, J. S. Distance Geometry In Molecular Modeling, 299–335 (John Wiley & Sons, Ltd, 1994).
41. Landrum G, 2023. rdkit/rdkit: 2023_03_2 (q1 2023) release. Zenodo.
42. Halgren TA. Merck molecular force field. I. basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry. 1996;17:490–519. doi: 10.1002/(SICI)1096-987X(199604)17:5/6<490::AID-JCC1>3.0.CO;2-P.
43. Balasubramani, S. G. et al. TURBOMOLE: Modular program suite for ab initio quantum-chemical and condensed-matter simulations. Journal of Chemical Physics 152 (2020).
44. Sierka M, Hogekamp A, Ahlrichs R. Fast evaluation of the coulomb potential for electron densities using multipole accelerated resolution of identity approximation. Journal of Chemical Physics. 2003;118:9136–9148. doi: 10.1063/1.1567253.
45. Perdew JP. Density-functional approximation for the correlation energy of the inhomogeneous electron gas. Phys. Rev. B. 1986;33:8822–8824. doi: 10.1103/PhysRevB.33.8822.
46. Becke AD. Density-functional exchange-energy approximation with correct asymptotic behavior. Phys. Rev. A. 1988;38:3098–3100. doi: 10.1103/PhysRevA.38.3098.
47. Langer MF, Goeßmann A, Rupp M. Representations of molecules and materials for interpolation of quantum-mechanical simulations via machine learning. npj Computational Materials. 2022;8:41. doi: 10.1038/s41524-022-00721-x.
48. Rupp M, Tkatchenko A, Müller K-R, von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 2012;108:058301. doi: 10.1103/PhysRevLett.108.058301.
49. Huo H, Rupp M. Unified representation of molecules and crystals for machine learning. Machine Learning: Science and Technology. 2022;3:045017.
50. Durant J, Leland B, Henry D, Nourse J. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002;42:1273–80. doi: 10.1021/ci010132r.
51. Nilakantan R, Bauman N, Dixon JS, Venkataraghavan R. Topological torsion: a new molecular descriptor for sar applications. comparison with other descriptors. Journal of Chemical Information and Computer Sciences. 1987;27:82–85. doi: 10.1021/ci00054a008.
52. James, C. & Weininger, D. Daylight Theory Manual: Daylight Version 4.9, (Daylight Chemical Information Systems, Inc., 2011).
53. Schulz E, Speekenbrink M, Krause A. A tutorial on gaussian process regression: Modelling, exploring, and exploiting functions. Journal of Mathematical Psychology. 2018;85:1–16. doi: 10.1016/j.jmp.2018.03.001.
54. Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 8024–8035 (Curran Associates, Inc., 2019).
55. Besel V, Todorović M, Kurtén T, Rinke P, Vehkamäki H. 2023. GeckoQ: Atomic structures, conformers and thermodynamic properties of 32k atmospheric molecules. Etsin.
56. Eckert F, Klamt A. Fast solvent screening via quantum chemistry: Cosmo-rs approach. AIChE Journal. 2002;48:369–385. doi: 10.1002/aic.690480220.
57. Hyttinen N, et al. Gas-to-particle partitioning of cyclohexene- and α-pinene-derived highly oxygenated dimers evaluated using cosmotherm. Journal of Physical Chemistry A. 2021;125:3726–3738. doi: 10.1021/acs.jpca.0c11328.
58. Besel V, 2023. Supervitux/cosmo_on_merlin: 1.0. Zenodo.
