Abstract
Data science and machine learning in materials science require large datasets of technologically relevant molecules or materials. Currently, publicly available molecular datasets with realistic molecular geometries and spectral properties are rare. We here supply a diverse benchmark spectroscopy dataset of 61,489 molecules extracted from organic crystals in the Cambridge Structural Database (CSD), denoted OE62. Molecular equilibrium geometries are reported at the Perdew-Burke-Ernzerhof (PBE) level of density functional theory (DFT) including van der Waals corrections for all 62 k molecules. For these geometries, OE62 supplies total energies and orbital eigenvalues at the PBE and the PBE hybrid (PBE0) functional level of DFT for all 62 k molecules in vacuum as well as at the PBE0 level for a subset of 30,876 molecules in (implicit) water. For 5,239 molecules in vacuum, the dataset provides quasiparticle energies computed with many-body perturbation theory in the G0W0 approximation with a PBE0 starting point (denoted GW5000 in analogy to the GW100 benchmark set (M. van Setten et al. J. Chem. Theory Comput. 12, 5076 (2016))).
Subject terms: Electronic structure, Chemical physics, Scientific data, Electronic structure of atoms and molecules
Measurement(s) | organic molecule |
Technology Type(s) | digital curation • spectroscopy |
Factor Type(s) | computational method |
Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.11689347
Background & Summary
Consistent and curated datasets have facilitated progress in the natural sciences. High-quality reference data sets were, for example, essential in the development of accurate computational methodology, in particular in quantum chemistry. With the rise of machine learning, datasets have increased in size and have transformed from reference status to a primary source of data for predictions1–7 and discovery8–12.
In this article we present a new dataset for molecular spectroscopy applications. Spectroscopy is ubiquitous in science as one of the primary ways of determining a material’s or molecule’s properties. However, publicly available spectroscopic datasets for technologically relevant molecules are rare. Examples include a dataset of chemical shifts for structures taken from the CSD13,14, the Harvard Clean Energy Project15 as well as the QM816,17 and QM918 datasets. The QM8 database offers optical spectra computed with time-dependent density functional theory (TDDFT) for 22 k organic molecules, while QM9, widely known as one of the standard benchmark sets for machine learning in chemistry, provides a variety of properties for 134 k organic molecules computed with density functional theory (DFT)19,20, including energy levels for the highest occupied and the lowest unoccupied molecular orbitals (HOMO and LUMO, respectively). Although QM8 and QM9 are of unprecedented size compared to previous, common benchmark sets in quantum chemistry of several hundred to thousands of molecules, they still contain only small molecules with restricted elemental diversity (H, C, N, O and F) and with simple bonding patterns6. They lack larger, more complex molecules with, e.g., extended heteroaromatic backbones and attached functional groups, as commonly targeted in organic synthesis21,22 and applied in (opto-)electronic23–26 or pharmaceutical research22,27,28.
We have based the spectroscopic dataset presented in this article on a diverse collection of 64,725 organic crystals that were extracted from the Cambridge Structural Database (CSD)29 by Schober et al.30,31. This 64 k dataset of experimental crystal structures gathered from a variety of application areas was originally compiled to optimize the charge carrier mobility for applications in organic electronics. For our OE62 dataset, we used 61,489 unique organic molecular structures, extracted from the respective organic crystals. All extracted geometries were then relaxed in the gas phase with density-functional theory (DFT).
The molecules in OE62 cover a considerable part of chemical space, as illustrated in Fig. 1. The dataset contains molecules with up to 174 (or 92 non-hydrogen) atoms and a diverse composition of 16 different elements. A large number of different scaffolds and functional groups are included, representing a multifaceted sample of the design space available in organic chemistry6,30,32,33.
To go into more detail, all molecules in OE62 are fully relaxed at the Perdew-Burke-Ernzerhof (PBE)34 level of DFT including Tkatchenko-Scheffler van der Waals (TS-vdW) corrections35. For these equilibrium structures, we then report molecular orbital energies at the PBE and PBE hybrid (PBE0)36,37 level, and refer to this part in the following as the 62 k set. Partial charges and total energies from the DFT calculations are also included. In two subsets, randomly drawn to span more than half (31 k) and more than 5000 (5 k) of the molecular structures, we provide additional computational results. The influence of solvation, in this case by implicit water, on the energy levels is addressed at the PBE0 level for a subset of 30,876 molecules. For the second subset of 5,239 molecules, we computed the quasiparticle energies with many-body perturbation theory in the G0W0 approximation19,38,39 and extrapolated them to the complete basis set (CBS) limit. Figure 2 gives a schematic overview of the dataset nesting in OE62, while Table 1 lists computational settings and computed properties. Figure 3(a,b) illustrates the HOMO level and solvation free energy distributions of the 5 k subset.
Table 1.
Set | Method | Computed properties | Access to data records on NOMAD (refs.) |
---|---|---|---|
62 k | DFT PBE + vdW (vacuum), Tier2 basis set, tight settings | relaxed geometry; occupied & unoccupied MO energies; total energy; Hirshfeld charges | 61–67 |
62 k | DFT PBE0 (vacuum), Tier2 basis set, tight settings | geometry fixed at the PBE + vdW level; occupied & unoccupied MO energies; total energy; Hirshfeld charges | 68 |
31 k | DFT PBE0 (water), Tier2 basis set, tight settings, MPE implicit solvation | geometry fixed at the PBE + vdW level; occupied & unoccupied MO energies; total energy; Hirshfeld charges | 69 |
5 k | DFT PBE0 (vacuum), def2-TZVP & def2-QZVP basis sets (see text), tight settings | geometry fixed at the PBE + vdW level; occupied & unoccupied MO energies; total energy | 70 |
5 k | G0W0@PBE0 (vacuum), def2-TZVP & def2-QZVP basis sets (see text), tight settings | geometry fixed at the PBE + vdW level; occupied & unoccupied MO energies; CBS energies of occupied & unoccupied MOs | 71 |
We refer to the 5 k subset of G0W0 quasiparticle energies as GW5000 in analogy to the GW100 benchmark set40. GW100 was a landmark dataset of 100 atoms and molecules that for the first time demonstrated the high numerical accuracy of the computationally costly G0W0 approach. GW100 quickly became the standard reference for GW code development and validation. The GW5000 subset in OE62 is of the same high numeric quality as GW100, but extends the set of reference molecules by a factor of 50. To illustrate the value of multi-level computational results we present a first, preliminary finding in Fig. 3. Panel c) shows the correlation between the G0W0@PBE0 quasiparticle HOMO energies and the DFT HOMO eigenvalues for the GW5000 subset. The correlation is to first approximation linear with PBE0 having a lower variance than PBE. This linear relation (slope of 1.195 and intercept of −0.492 for PBE0) could now be used to predict G0W0 quasiparticle energies from the computationally cheaper PBE0 method without having to perform G0W0 calculations. Applying this linear correction to the PBE0 results yields quasiparticle energy predictions with a root mean square error (RMSE) of only 0.17 eV to the respective GW5000 values.
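As a minimal sketch of how such a linear correction could be applied, the snippet below uses the slope and intercept quoted above for PBE0. The function names are hypothetical, and the intercept is assumed to be given in eV; the small `rmse` helper only illustrates how an error like the quoted 0.17 eV would be evaluated against reference values.

```python
import math

# Slope and intercept quoted in the text for PBE0 (intercept assumed to be in eV).
SLOPE_PBE0 = 1.195
INTERCEPT_PBE0 = -0.492


def predict_gw_homo(e_homo_pbe0):
    """Estimate a G0W0@PBE0 HOMO energy (eV) from a PBE0 HOMO eigenvalue (eV)."""
    return SLOPE_PBE0 * e_homo_pbe0 + INTERCEPT_PBE0


def rmse(predicted, reference):
    """Root mean square error between two equally long sequences of energies."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))
```

For a PBE0 HOMO eigenvalue of −6.0 eV, for instance, this relation gives an estimated quasiparticle energy of about −7.66 eV.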
Given the high-quality computational results from different levels of theory, the (sub)sets included in OE62 can be used to develop, train and evaluate machine learning algorithms, facilitating the search and discovery of diverse molecular structures with improved properties. In the following, we first describe the procedure used to compute molecular structures and properties, followed by a full description of the dataset format and content as well as by a validation of our DFT and G0W0 results. OE62 is freely available as a download from the Technical University of Munich. The input and output files of all calculations performed for OE62 can be downloaded from the Novel Materials Discovery (NOMAD) laboratory (https://nomad-repository.eu).
Methods
All crystal structures collected from the CSD for the 64 k dataset are mono-molecular, i.e. they contain only a single type of molecule per unit-cell. A single molecular structure (conformer) from each crystal was extracted by a custom Python code30,31. This 64 k dataset of molecular structures provides the starting point for the dataset published here. A fraction of the crystals contained in the CSD have polymorphic forms or were added multiple times, coming e.g. from different experimental sources. Although they occur in different crystalline entries in the 64 k dataset, the same molecular structure could enter our molecular database multiple times. First, the SMILES identifiers were computed for the 64 k dataset30,31 from a combination of Open Babel41 (www.openbabel.org) and RDKit (www.rdkit.org)42. We subsequently excluded all extracted molecules whose non-isomeric, canonical SMILES identifier occurred multiple times, keeping only one case each. Further, molecules with an odd number of electrons were removed. After these filtering steps 61,539 molecules remained.
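The two filtering steps described above can be sketched as follows. This is an illustration only, not the authors' script: the record keys `smiles` and `elements` are hypothetical, the molecules are assumed to be neutral, and the atomic-number table covers just a few elements.

```python
# Atomic numbers for a few elements; a full table would be needed in practice.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9, "S": 16, "Cl": 17}


def electron_count(elements, charge=0):
    """Total number of electrons for a list of element symbols."""
    return sum(ATOMIC_NUMBERS[el] for el in elements) - charge


def filter_molecules(records):
    """Keep one record per canonical SMILES and drop odd-electron species.

    `records` is a list of dicts with the hypothetical keys "smiles"
    (non-isomeric canonical SMILES) and "elements" (element symbols).
    """
    seen = set()
    kept = []
    for rec in records:
        if rec["smiles"] in seen:
            continue  # duplicate identifier: keep only the first case
        seen.add(rec["smiles"])
        if electron_count(rec["elements"]) % 2 == 1:
            continue  # odd number of electrons: removed
        kept.append(rec)
    return kept


molecules = [
    {"smiles": "O", "elements": ["O", "H", "H"]},
    {"smiles": "O", "elements": ["O", "H", "H"]},        # duplicate entry
    {"smiles": "[CH3]", "elements": ["C", "H", "H", "H"]},  # 9 electrons
    {"smiles": "C", "elements": ["C", "H", "H", "H", "H"]},
]
filtered = filter_molecules(molecules)
```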
We relaxed the geometries of all molecules at the PBE + vdW level of theory, as implemented in the FHI-aims all-electron code43–45. We chose the PBE + vdW functional for three reasons: 1) It is an all-purpose functional with a favorable accuracy/computational cost ratio that is implemented in all the major electronic structure codes. 2) We would like to stay consistent with previous work6,46, in which PBE + vdW was also used for molecular structure optimization of large molecular data sets. 3) While there might be more accurate semi-local functionals than PBE47, the addition of vdW corrections makes PBE + vdW appropriate for organic compounds. For organic crystals, for which highly accurate, low-temperature experimental geometries are available, PBE + vdW yields excellent agreement with typical root-mean-squared deviations of only 0.005–0.01 Å per atom48–50.
Given that slightly differing bond assignments in the newly obtained low-energy geometries might change some of the molecular identifiers, we generated new InChI51 (‘IUPAC International Chemical Identifier’) and canonical SMILES identifiers using Open Babel (Version 2.4.1 2016), and report these in our dataset. We then checked these representations for duplicates and removed any that we found. In addition, 6 molecules were removed for which geometry optimization or single point calculations had failed. In total, 61,489 unique molecules remained, which form the basis of the OE62 set.
From the OE62 set we generated two subsets: For the 31 k subset we randomly picked 30,876 molecules. The same was done for the 5 k set by randomly picking 5,239 molecules from the 31 k subset with the additional constraint that the largest molecule should not exceed 100 atoms. The size and element distributions of all three sets are shown in Fig. 1.
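A nested random draw of this kind could look like the sketch below. The function name, the seed and the use of Python's `random` module are assumptions made for illustration; the G0W0-based filtering described under Technical Validation is not included here.

```python
import random


def draw_nested_subsets(n_atoms, size_31k, size_5k, max_atoms=100, seed=0):
    """Return (idx_31k, idx_5k), where idx_5k is drawn from idx_31k.

    `n_atoms` holds the atom count of every molecule in the full set.
    Molecules with more than `max_atoms` atoms are excluded from the
    smaller subset only.
    """
    rng = random.Random(seed)
    idx_31k = rng.sample(range(len(n_atoms)), size_31k)
    eligible = [i for i in idx_31k if n_atoms[i] <= max_atoms]
    idx_5k = rng.sample(eligible, size_5k)
    return idx_31k, idx_5k


# Toy example with eight molecules, two of which exceed 100 atoms.
atom_counts = [10, 150, 20, 30, 120, 40, 50, 60]
idx_31k, idx_5k = draw_nested_subsets(atom_counts, size_31k=6, size_5k=2)
```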
In the following we explain the data and additional subsets we created and provide the computational settings. All settings are also listed in Table 1.
62 k set: DFT PBE + vdW (vacuum)
We pre-relaxed all molecular geometries at the PBE level of theory. For structure relaxation, we used the trust radius enhanced variant of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm as implemented in FHI-aims with a maximum atomic residual force criterion of fmax < 0.01 eV Å−1. The electronic wave functions were expanded in a Tier1 basis set at light integration settings43. Since our database only contains closed-shell molecules, we performed spin-restricted DFT calculations. Dispersive forces were included in the geometry relaxations using the Tkatchenko-Scheffler (TS)35 method, while relativistic effects were treated on the level of the atomic zero-order regular approximation (atomic ZORA)43. The DFT self-consistency cycle was treated as converged when changes of total energy, sum of eigenvalues and charge density were found below 10−6 eV, 10−3 eV and 10−5 e Å−3, respectively. Starting from these pre-relaxed structures, we obtained the final geometries by performing a new relaxation with Tier2 basis sets, tight integration settings and a convergence criterion of fmax < 0.001 eV Å−1. The eigenvalues of the molecular states are then stored in our dataset alongside the molecular geometries. We refer to this part of the dataset as PBE + vdW (vacuum).
62 k set: DFT PBE0 (vacuum)
Using the relaxed geometries obtained at the PBE + vdW (vacuum) level of theory, we further carried out single point calculations for all structures using the PBE0 hybrid functional. Computational settings as described before were used, employing again the Tier2 basis set with a tight integration grid. Note that the tabulated total energies obtained at this level also include the vdW contribution computed through the TS method, while “vdW” was dropped from the name to emphasize the single point character of these computations. We correspondingly refer to this set as PBE0 (vacuum).
31 k subset: DFT PBE0 (water)
To study the influence of solvation, here by water, on the PBE0 results, we performed calculations using the Multipole Expansion (MPE) implicit solvation method as implemented in FHI-aims52 for the 31 k subset. The MPE method facilitates an efficient treatment of the solvation effects on a solute by using a continuum model of the solvent around it. In detail, the solute molecule is placed within a cavity with the dielectric permittivity of vacuum. The position of the cavity surface is determined by an iso-value of the solute’s electronic density. Outside of this cavity the dielectric constant of water was applied52. The density isovalue as well as the α and β parameters for non-electrostatic contributions to the solvation free energy were taken from the published SPANC parameter-set52.
In the MPE method, the solvation cavity is discretized using a large number of points homogeneously distributed at the density iso-surface. Sampling of these points was achieved using an inexpensive pseudo-dynamical optimisation, allowing up to 1000 optimisation steps and removing the worst 0.1% of walkers at each neighbor-list update step52, to account for the more complex molecules included in the 62 k dataset. To obtain highly converged eigenvalues, we increased the reaction field- and polarization potential expansion orders lmax,R and lmax,O to 14 and 8, respectively, and the degree of overdetermination dod to 16, keeping all other parameters at their default values52. Note that the molecular geometries were not further relaxed in the presence of the water solvent. We kept the structures fixed at the PBE + vdW level. Tabulated total energies again include the vdW contribution obtained by the TS method. The resulting data is referred to as PBE0 (water).
5 k subset: G0W0@PBE0 (vacuum)
For the 5 k subset, the relaxed PBE + vdW structures in vacuum were used as input for the G0W019,39,53 calculations, using the FHI-aims G0W0 implementation based on the analytic continuation44. The PBE0 hybrid functional was used for the underlying DFT calculation (G0W0@PBE0) in combination with the atomic ZORA approximation.
In these G0W0 and PBE0 calculations, we employed the def2 triple-zeta valence plus polarization (def2-TZVP) and the def2 quadruple-zeta valence plus polarization (def2-QZVP) basis sets54. The def2-TZVP and def2-QZVP basis sets are contracted Gaussian orbitals, treated numerically to be compliant with the numeric atom-centered orbital (NAO) technology in FHI-aims43. They are fully all-electron for all elements and do not contain effective core potentials. The def2 basis sets are available from the EMSL database55,56, except for iodine (see Supplementary Information). Note that a basis set of def2-TZVP quality is not available for I and all def2-TZVP calculations for iodine-containing molecules were correspondingly performed with def2-QZVP for I and with def2-TZVP for all other elements.
Since G0W0 calculations converge slowly with respect to basis set size39, we extrapolated the quasiparticle energies to the complete basis set (CBS) limit. Following the procedure for the GW100 benchmark set40, the extrapolated values are calculated from the def2-TZVP and def2-QZVP results by a linear regression against the inverse of the total number of basis functions (see Technical Validation).
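With only two basis sets, the linear regression against the inverse number of basis functions reduces to a straight line through two points, whose intercept at 1/N → 0 is the CBS estimate. A minimal sketch, with hypothetical function and argument names:

```python
def extrapolate_cbs(e_tzvp, n_tzvp, e_qzvp, n_qzvp):
    """Two-point linear extrapolation of an energy to the CBS limit.

    Assumes E(N) = E_CBS + k / N, where N is the total number of basis
    functions, and returns E_CBS from the TZVP and QZVP values.
    """
    x_t, x_q = 1.0 / n_tzvp, 1.0 / n_qzvp
    slope = (e_tzvp - e_qzvp) / (x_t - x_q)
    return e_qzvp - slope * x_q


# Toy numbers (not taken from the dataset): energies in eV, N = basis functions.
e_cbs = extrapolate_cbs(e_tzvp=-7.5, n_tzvp=200, e_qzvp=-7.8, n_qzvp=500)
```

In this toy case the QZVP value lies below the TZVP value, so the line has a positive slope and the extrapolated energy (−8.0 eV) is lower than both inputs.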
The G0W0 self-energy elements were calculated for a set of imaginary frequencies and then analytically continued to the real frequency axis using a Padé approximant57 with 16 parameters. The numerical integration along the imaginary frequency axis was performed using a modified Gauss-Legendre grid44 with 200 grid points. The same grid was employed for the set of frequencies for which the self-energy is computed. The analytic continuation in combination with the Padé model yields accurate results for valence states40, but is not reliable for core and semi-core states58. Therefore, we included only occupied states with quasiparticle energies larger than −30 eV in the data set; see Technical Validation for more details.
Data Records
The curated data for all 61,489 molecules is publicly available from two sources:
The dataset and related files can be freely downloaded from the media repository of the Technical University of Munich (mediaTUM) under 10.14459/2019mp1507656 (ref. 59). The dataset is provided as JSON output data of Pandas60 DataFrames. Within Python, these dataframes allow structured access to data in a tabular format, where each molecule is stored in a row of the dataframe, while the data is organized in columns. The content of the dataframe is summarized and explained in Table 2. We also provide a tutorial file, which explains loading, filtering and data extraction from dataframes within Python. On mediaTUM, the dataset is distributed under a Creative Commons licence (https://creativecommons.org/licenses/by-sa/4.0/).
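The following sketch illustrates the JSON round trip of a Pandas DataFrame with list-valued cells. It is not taken from the distributed tutorial: the toy values are invented, and `orient="split"` is an assumption about the serialization layout. For the real files, the tutorial in the archive is the authoritative guide.

```python
import io

import pandas as pd

# Toy stand-in for one of the distributed dataframes (values are invented).
toy = pd.DataFrame(
    {
        "refcode_csd": ["AAAAAA", "BBBBBB"],
        "number_of_atoms": [3, 5],
        "energies_occ_pbe": [[-25.1, -7.2], [-18.9, -9.4]],
    }
)

# Serialize to JSON text and read it back; the orient value is an assumption.
json_text = toy.to_json(orient="split")
df = pd.read_json(io.StringIO(json_text), orient="split")

# Each row is one molecule; the last occupied eigenvalue is the HOMO energy.
homo_first = df.loc[0, "energies_occ_pbe"][-1]
```

Reading a downloaded file would follow the same pattern, with the file path passed to `pd.read_json` in place of the in-memory buffer.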
The input and output files of all performed calculations can be downloaded from NOMAD. Due to the size of OE62 we provide an individual DOI for each applied computational method61–71.
Table 2.
No. | Column name | Unit | Method | Dataframes | Description |
---|---|---|---|---|---|
1 | refcode_csd | — | — | 62 k, 31 k, 5 k | CSD reference code, unique identifier for the crystal from which the molecule was extracted |
2 | canonical_smiles | — | Open Babel | 62 k, 31 k, 5 k | Molecular string representations derived from DFT PBE + vdW relaxed geometries. |
3 | inchi | — | Open Babel | 62 k, 31 k, 5 k | As in column 2. |
4 | number_of_atoms | — | — | 62 k, 31 k, 5 k | Number of atoms in the molecule |
5 | xyz_pbe_relaxed | Å | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | String in XYZ-file format of DFT PBE + vdW relaxed geometry. Line 1 contains the number of atoms. Line 2 is empty. The remaining lines contain atomic type and coordinate (x, y, z). |
6 | energies_occ_pbe | eV | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | List of eigenvalues of occupied molecular Kohn-Sham orbitals. Given in ascending order, the last value is the HOMO energy. |
7 | energies_occ_pbe0_vac_tier2 | eV | PBE0 (vacuum) | 62 k, 31 k, 5 k | Same format as column 6. |
8 | energies_occ_pbe0_water | eV | PBE0 (water) | 31 k, 5 k | Same format as column 6. |
9 | energies_occ_pbe0_vac_tzvp | eV | PBE0 (vacuum) | 5 k | Same format as column 6. |
10 | energies_occ_pbe0_vac_qzvp | eV | PBE0 (vacuum) | 5 k | Same format as column 6. |
11 | energies_occ_gw_tzvp | eV | G0W0@PBE0 (vacuum) | 5 k | Same format as column 6. |
12 | energies_occ_gw_qzvp | eV | G0W0@PBE0 (vacuum) | 5 k | Same format as column 6. |
13 | cbs_occ_gw | eV | G0W0@PBE0 (vacuum) | 5 k | List of CBS energies of occupied states computed from the G0W0@PBE0 TZVP and QZVP energies in columns 11 and 12. Same order as lists described above. |
14 | energies_unocc_pbe | eV | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | List of eigenvalues of virtual (unoccupied) molecular Kohn-Sham orbitals. Given in ascending order, the first value is the LUMO energy. Only virtual states below the vacuum level (i.e. with negative eigenvalue) are listed. If the LUMO energy is positive, only the LUMO energy is listed. If column 20 has more negative eigenvalues than column 19, we also include positive eigenvalues in column 19 so that the lists in columns 19 and 20 have equal length. |
15 | energies_unocc_pbe0_vac_tier2 | eV | PBE0 (vacuum) | 62 k, 31 k, 5 k | Same format as column 14. |
16 | energies_unocc_pbe0_water | eV | PBE0 (water) | 31 k, 5 k | Same format as column 14. |
17 | energies_unocc_pbe0_vac_tzvp | eV | PBE0 (vacuum) | 5 k | Same format as column 14. |
18 | energies_unocc_pbe0_vac_qzvp | eV | PBE0 (vacuum) | 5 k | Same format as column 14. |
19 | energies_unocc_gw_tzvp | eV | G0W0@PBE0 (vacuum) | 5 k | Same format as column 14. |
20 | energies_unocc_gw_qzvp | eV | G0W0@PBE0 (vacuum) | 5 k | Same format as column 14. |
21 | cbs_unocc_gw | eV | G0W0@PBE0 (vacuum) | 5 k | List of CBS energies of unoccupied states computed from the G0W0@PBE0 TZVP and QZVP energies in columns 19 and 20. Same order as lists described above. |
22 | total_energy_pbe | eV | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | Total energy of the DFT calculations. Note: for consistency with column 22, columns 23 and 24 also include the vdW contribution to the total energy. Columns 25 and 26 do not include it. |
23 | total_energy_pbe0_vac_tier2 | eV | PBE0 (vacuum) | 62 k, 31 k, 5 k | As in column 22 (includes the vdW contribution). |
24 | total_energy_pbe0_water | eV | PBE0 (water) | 31 k, 5 k | As in column 22 (includes the vdW contribution). |
25 | total_energy_pbe0_vac_tzvp | eV | PBE0 (vacuum) | 5 k | As in column 22, but without the vdW contribution. |
26 | total_energy_pbe0_vac_qzvp | eV | PBE0 (vacuum) | 5 k | As in column 22, but without the vdW contribution. |
27 | hirshfeld_pbe | qe | PBE + vdW (vacuum) | 62 k, 31 k, 5 k | List of Hirshfeld partial charges on atoms. Same order as atoms in xyz_pbe_relaxed. |
28 | hirshfeld_pbe0_vac_tier2 | qe | PBE0 (vacuum) | 62 k, 31 k, 5 k | As in column 27. |
29 | hirshfeld_pbe0_water | qe | PBE0 (water) | 31 k, 5 k | As in column 27. |
Columns 1 to 3 contain molecular identifiers. Columns 5 to 29 contain molecular properties computed at the respective level of theory. All energies are given in eV.
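Because the occupied lists end with the HOMO and the unoccupied lists start with the LUMO, frontier levels follow directly from list indexing. The helper below is a small sketch with a hypothetical name; it works on any matching pair of occupied and unoccupied lists from Table 2.

```python
def frontier_levels(energies_occ, energies_unocc):
    """Return (HOMO, LUMO, gap) in eV from two ascending energy lists."""
    homo = energies_occ[-1]   # last occupied value is the HOMO energy
    lumo = energies_unocc[0]  # first unoccupied value is the LUMO energy
    return homo, lumo, lumo - homo


# Toy values in eV (not taken from the dataset).
homo, lumo, gap = frontier_levels([-20.0, -6.5], [-1.5, -0.25])
```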
Dataframe format
We provide three dataframes: df_62 k, df_31 k and df_5 k. For each molecule in these dataframes, we provide three identifiers (refcode_csd, canonical_smiles and inchi in columns 1 to 3). In column 5, atomic coordinates of PBE + vdW (vacuum) relaxed structures are stored as a string in a standard XYZ format (xyz_pbe_relaxed): The structure information contains a header line specifying the number of atoms na, an empty comment line and na lines containing element type and relaxed atomic coordinates, one atom per line. The structure of all three dataframes is summarized in Table 2.
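A string in this XYZ layout can be split into element symbols and coordinates with a few lines of Python. The parser below is a minimal sketch that assumes whitespace-separated fields and exactly the three-part layout described above; the function name is hypothetical.

```python
def parse_xyz(xyz_string):
    """Parse an XYZ-format string into a list of (symbol, x, y, z) tuples."""
    lines = xyz_string.splitlines()
    n_atoms = int(lines[0].strip())        # header line: number of atoms
    atoms = []
    for line in lines[2:2 + n_atoms]:      # skip the empty comment line
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match the header line")
    return atoms


# Toy geometry in Å (not taken from the dataset).
example = "3\n\nO 0.000 0.000 0.117\nH 0.000 0.757 -0.469\nH 0.000 -0.757 -0.469\n"
atoms = parse_xyz(example)
```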
The following list provides a brief overview over the three dataframes:
Dataframe df_5 k includes 5,239 structures with results for all molecular properties in columns 5 to 29.
Dataframe df_31 k accommodates 30,876 structures, including all structures from df_5 k. G0W0@PBE0 results are only available for molecules from its 5 k subset, while respective columns are left blank for the remaining molecules in df_31 k.
Dataframe df_62 k contains all 61,489 structures, including all structures from df_31 k and df_5 k. PBE0 (water) results are only available for molecules from its 31 k subset, while respective columns are left blank for the remaining molecules in df_62 k. The same applies for G0W0@PBE0 results for the structures from the 5 k subset. The dataframe is ordered, such that the molecules included in the 5 k subset are included first, while the remaining molecules of 31 k and 62 k subsets follow subsequently. This data structure facilitates the filtering of the dataframe by single lines of code, as shown in the tutorial.
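Since results that exist only for a subset are left blank elsewhere, the subsets can be recovered from the full dataframe by testing for missing values. The sketch below uses a toy dataframe and assumes that blank cells are loaded as `None`/NaN; the distributed tutorial may use a different filter.

```python
import pandas as pd

# Toy stand-in for the full dataframe (values are invented).
df_62k = pd.DataFrame(
    {
        "refcode_csd": ["AAAAAA", "BBBBBB", "CCCCCC"],
        "number_of_atoms": [12, 30, 45],
        "energies_occ_pbe0_water": [[-8.8, -6.4], [-9.0, -6.1], None],
        "cbs_occ_gw": [[-9.1, -7.0], None, None],
    }
)

# Rows with PBE0 (water) results form the 31 k subset,
# rows with G0W0 CBS results form the 5 k subset.
df_31k = df_62k[df_62k["energies_occ_pbe0_water"].notna()]
df_5k = df_62k[df_62k["cbs_occ_gw"].notna()]

# A further one-line filter, e.g. on molecular size.
small = df_62k[df_62k["number_of_atoms"] <= 30]
```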
In addition, a spreadsheet file is provided in the distributed archive which contains the total energies of all atomic species of the dataset. They are computed for the respective levels of theory using similar computational settings, so that atomization energies for all molecules can be computed from the available molecular total energies.
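An atomization energy can then be obtained by subtracting the molecular total energy from the sum of the atomic total energies at the same level of theory. The sketch below assumes this sign convention (positive values for bound molecules); the atomic energies shown are placeholders, not values from the spreadsheet.

```python
from collections import Counter


def atomization_energy(e_molecule, elements, atomic_energies):
    """Atomization energy in eV: sum of atomic energies minus molecular energy.

    `elements` lists the element symbol of every atom in the molecule and
    `atomic_energies` maps element symbols to atomic total energies (eV)
    computed at the same level of theory as `e_molecule`.
    """
    counts = Counter(elements)
    e_atoms = sum(n * atomic_energies[el] for el, n in counts.items())
    return e_atoms - e_molecule


# Placeholder numbers for illustration only.
atomic_energies = {"H": -13.0, "O": -2040.0}
e_at = atomization_energy(-2076.0, ["O", "H", "H"], atomic_energies)
```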
Finally, future updated versions of the dataset on mediaTUM will be distributed through the versioned DOI given above. In such cases, updated descriptions will be provided in the distributed archive alongside the dataset.
Technical Validation
Validation of relaxed geometries
To quantify the degree to which relaxation in vacuum changes the geometry of the structures compared to their crystalline form, we computed the distance between the two Coulomb matrices72,73 of the original crystal geometry and the PBE + vdW relaxed geometry for each of the 62 k molecules. The distribution of these Coulomb matrix distances is shown in Fig. 4(a). Small distances signify small changes and large distances signify large differences. Most molecules exhibit only small changes in geometry during relaxation, where bond lengths are shifted by a small amount, as illustrated for the example of molecule 1. In some rare cases we find significant shifts in geometry caused by the environmental change from intermolecular interactions in the crystal to intramolecular interactions in vacuum, as shown for molecule 2. The crystal-extracted structure is shaped according to intermolecular van der Waals interactions that were present in the crystal. After relaxation, the intramolecular interactions cause a contraction of the molecular structure.
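As an illustration of such a comparison, the sketch below builds Coulomb matrices in their commonly used form (diagonal 0.5 Z^2.4, off-diagonal Z_i Z_j / |R_i − R_j|) and measures their difference with the Frobenius norm. The choice of norm, the identical atom ordering in both geometries and the coordinate units are assumptions; the text does not specify them.

```python
import math

# Atomic numbers for a few elements; extend as needed.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}


def coulomb_matrix(atoms):
    """Coulomb matrix for a list of (symbol, x, y, z) tuples."""
    n = len(atoms)
    matrix = [[0.0] * n for _ in range(n)]
    for i, (sym_i, *pos_i) in enumerate(atoms):
        z_i = ATOMIC_NUMBERS[sym_i]
        for j, (sym_j, *pos_j) in enumerate(atoms):
            if i == j:
                matrix[i][j] = 0.5 * z_i ** 2.4
            else:
                matrix[i][j] = z_i * ATOMIC_NUMBERS[sym_j] / math.dist(pos_i, pos_j)
    return matrix


def coulomb_matrix_distance(atoms_a, atoms_b):
    """Frobenius norm of the difference between two Coulomb matrices."""
    m_a, m_b = coulomb_matrix(atoms_a), coulomb_matrix(atoms_b)
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(m_a, m_b) for a, b in zip(row_a, row_b))
    )


# Toy example: H2 with two different bond lengths.
h2_short = [("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 1.0)]
h2_long = [("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 2.0)]
d = coulomb_matrix_distance(h2_short, h2_long)
```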
To validate that the chemical integrity of the majority of the 62 k molecules is preserved during the PBE + vdW relaxation, we performed a consistency check similar to that of ref. 18. We generated InChI strings from the relaxed PBE + vdW geometries and compared them to those obtained from the initial crystal-extracted cartesian coordinates. For 284 pairs, the two InChI strings did not match. Such mismatches can, for example, be caused by specifics in the implementation, in which Open Babel assigns different InChI strings to molecules with the same topology, possibly caused by changes in bond lengths, bond angles or dihedral angles. Examples are shown in Fig. 4(b) with molecule 3 exhibiting a small Coulomb matrix distance or molecule 5, which exhibits a large Coulomb matrix distance due to stronger relaxation. Here, stereoassignments change in the molecular structure, causing the different InChI identifiers. Conversely, the mismatch can also be caused by changes in molecular topology during relaxation. This is the case for molecule 4, for which an intramolecular ring-closure takes place. Compared to 3,054 such inconsistencies found during the collection of the 134 k molecules for the QM9 database18, the number of 284 found here is comparatively small. The reason is most likely that our molecular starting geometries were derived from experimentally observed, well-resolved solid-form conformers.
Validation of DFT atomization and orbital energies
For PBE and PBE0 calculations, the Tier2 basis set of FHI-aims typically provides converged results for both the atomization energy as well as for molecular orbital energies74,75. The Tier2 basis set has also been used in other large molecular datasets46,72. We here illustrate the convergence for four selected cases featured in Fig. 5(a) for PBE0 vacuum calculations at tight settings. As expected, HOMO energies at the Tier2 level are well-converged, here estimated within 0.01 eV around reference values obtained with the largest standard basis set included in FHI-aims (Tier4), see Fig. 5(b). The lower lying orbital energies exhibit a similar convergence behavior (not shown).
A further quality assessment of predicted HOMO-energies comes from the comparison of Tier2 and QZVP basis set results, as contained in the 5 k subset, see Fig. 5(c). We find only a small RMSE of 0.009 eV between the Tier2 and the much larger QZVP basis sets. Figure 5 also shows the convergence of the atomization energy of the four molecules in panel d). Again, at the Tier2 level, convergence to better than 0.1 eV with respect to Tier4 is observed. This is consistent with results found in a previous benchmark study74.
Validation of G0W0 quasiparticle energies
Figure 6(a) shows the convergence of the G0W0@PBE0 quasi-particle energies with respect to basis set size and their extrapolation to the CBS limit for the four molecules displayed in Fig. 5(a). In all four cases, the G0W0 energies are not converged even with the largest basis set and CBS extrapolation is required. The slow convergence is typical for the whole 5 k set, as demonstrated in Fig. 6(b), which reports the deviation of the HOMO G0W0 energies computed at the TZVP and QZVP level from the CBS limit for all molecules of the 5 k subset. The distributions displayed in Fig. 6(b) are centered around −0.38 eV (TZVP) and −0.17 eV (QZVP) with a standard deviation of 0.02 eV (TZVP) and 0.01 eV (QZVP) from the median values. Similar results are obtained by including all occupied states above −30 eV in the analysis. In this case, the median value amounts to −0.35 eV for TZVP and −0.15 eV for QZVP. Respective distributions for the deviations of all occupied states from the CBS limit can be found in the Supporting Information.
The quasiparticle energies at the QZVP level are typically lower in energy than the TZVP values, i.e., the straight line determined from the linear extrapolation to the CBS limit has a positive slope, see Fig. 6(a). This empirical observation was already made in the GW100 benchmark study40 for the HOMO level and we also observed it here in our GW5000 study for the valence states. There is no proof that for a given basis set the slope has to be positive. In fact, for ~4% of the energy levels above −30 eV we find negative slopes, as shown in Fig. 6(c). This percentage increases considerably in the semi-core energy region between −50 and −30 eV. Such an increase is indicative of either 1) a failure of the analytic continuation used to continue the G0W0 self-energy from the imaginary to the real frequency axis or 2) the insufficiency of the def2-TZVP basis set to converge the deeper occupied states at the DFT level. Based on our analysis in Fig. 6(c), we therefore include only states with energies larger than −30 eV in the 5 k set. Figure 6(d) confirms that the spectral weight averaged over the whole 5 k subset is located mostly between −30 and −5 eV and thus, not much spectral information is lost by setting the cutoff threshold to −30 eV.
G0W0 calculations were initially run for 5,500 structures randomly drawn from the 31 k set. From these 5,500 molecules, we filtered out molecules for which the analytic continuation of the G0W0 self-energy is inaccurate or breaks down completely. In FHI-aims the quasiparticle equation is solved iteratively to determine the quasiparticle energies. For some molecules, the pole structure of the self-energy gives rise to multiple solutions and the iterative solution does not converge. We excluded all molecules from the dataset for which at least one TZVP or QZVP level did not converge. Moreover, large differences between the TZVP and QZVP quasiparticle energies are an indication of further problems in the G0W0 calculation, since the median difference between TZVP and QZVP is only 0.21 eV (see Fig. 6(b)). We thus excluded molecules for which at least one level exceeded a QZVP/TZVP difference of 0.8 eV. This leaves 5,239 molecules in the 5 k set.
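The exclusion criteria and the −30 eV cutoff can be expressed as two small checks. This is a sketch under stated assumptions: non-converged levels are marked with `None` (a hypothetical convention), and both lists hold the same states in the same order.

```python
def keep_molecule(e_tzvp, e_qzvp, max_diff=0.8):
    """Return True if a molecule passes both G0W0 quality criteria.

    `e_tzvp` and `e_qzvp` are equally ordered lists of quasiparticle
    energies (eV); a non-converged level is marked as None.
    """
    for e_t, e_q in zip(e_tzvp, e_qzvp):
        if e_t is None or e_q is None:
            return False  # at least one level did not converge
        if abs(e_q - e_t) > max_diff:
            return False  # QZVP/TZVP difference exceeds the threshold
    return True


def valence_states(energies, cutoff=-30.0):
    """Keep only states with energies larger than the cutoff (eV)."""
    return [e for e in energies if e > cutoff]


# Toy values in eV (not taken from the dataset).
ok = keep_molecule([-12.0, -7.0], [-12.2, -7.2])
bad_diff = keep_molecule([-12.0, -7.0], [-13.0, -7.2])
bad_conv = keep_molecule([-12.0, None], [-12.2, -7.2])
kept = valence_states([-45.0, -30.0, -12.2, -7.2])
```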
Acknowledgements
C.K., K.R. and H.O. acknowledge support from the Solar Technologies Go Hybrid initiative of the State of Bavaria and the Leibniz Supercomputing Centre for high-performance computing time at the SuperMUC facility. C.K. further acknowledges support by Deutsche Forschungsgemeinschaft (DFG) through the TUM International Graduate School of Science and Engineering (IGSSE), GSC 81. A.S., D.G., M.T. and P.R. gratefully acknowledge computing resources from the Aalto Science-IT project and the CSC Grand Challenge project. D.G. acknowledges support by the Academy of Finland through grant no. 316168. A.S. acknowledges support by the Magnus Ehrnrooth Foundation and the Finnish Cultural Foundation. This project has received funding from the European Union’s Horizon 2020 research and Innovation Programme under grant agreement No. 676580 with The Novel Materials Discovery (NOMAD) Laboratory, a European Center of Excellence. This work was furthermore supported by the Academy of Finland through its Centres of Excellence Programme 2015–2017 under project number 284621 as well as its Key Project Funding scheme under project number 305632. Further support was received by the Artificial Intelligence in Physical Sciences and Engineering scheme (project number 316601).
Author contributions
A.S. and C.K. curated the data, carried out the calculations and postprocessed the results. C.K. performed the calculations at the DFT-levels of theory. A.S. and D.G. conducted the calculations at the G0W0 level of theory. A.S., C.K., D.G. and J.M. validated the calculations. J.M. analyzed correlations between DFT- and G0W0-results. M.T., K.R., P.R. and H.O. conceived the original idea and designed the study. All authors cowrote the manuscript.
Code availability
All electronic structure data contained in this work was generated with the FHI-aims code43–45. The code is available for a license fee from https://aimsclub.fhi-berlin.mpg.de/aims_obtaining_simple.php. Parsing of outputs and data collection were performed with custom-made Python scripts, which will be available upon request. Finally, the published archive contains a tutorial detailing how to access the dataset.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Annika Stuke and Christian Kunkel.
Supplementary information is available for this paper at 10.1038/s41597-020-0385-y.
References
- 1. Bartók, A. P. et al. Machine learning unifies the modeling of materials and molecules. Sci. Adv. 3 (2017).
- 2. Schütt KT, Sauceda HE, Kindermans P-J, Tkatchenko A, Müller K-R. Schnet – a deep learning architecture for molecules and materials. J. Chem. Phys. 2018;148:241722. doi: 10.1063/1.5019779.
- 3. Faber FA, et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. J. Chem. Theory Comput. 2017;13:5255–5264. doi: 10.1021/acs.jctc.7b00577.
- 4. Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Comm. 8 (2017).
- 5. Tang Y-H, de Jong WA. Prediction of atomization energy using graph kernel and active learning. J. Chem. Phys. 2019;150:044107. doi: 10.1063/1.5078640.
- 6. Stuke A, et al. Chemical diversity in molecular orbital energy predictions with kernel ridge regression. J. Chem. Phys. 2019;150:204121. doi: 10.1063/1.5086105.
- 7. Ghosh K, et al. Deep learning spectroscopy: Neural networks for molecular excitation spectra. Adv. Sci. 2019;6:1801367. doi: 10.1002/advs.201801367.
- 8. Mansouri Tehrani A, et al. Machine learning directed search for ultraincompressible, superhard materials. J. Am. Chem. Soc. 2018;140:9844–9853. doi: 10.1021/jacs.8b02717.
- 9. Meredig B, et al. Combinatorial screening for new materials in unconstrained composition space with machine learning. Phys. Rev. B. 2014;89:094104. doi: 10.1103/PhysRevB.89.094104.
- 10. Meyer B, Sawatlon B, Heinen S, von Lilienfeld OA, Corminboeuf C. Machine learning meets volcano plots: computational discovery of cross-coupling catalysts. Chem. Sci. 2018;9:7069–7077. doi: 10.1039/C8SC01949E.
- 11. Goldsmith BR, Esterhuizen J, Liu J-X, Bartel CJ, Sutton C. Machine learning for heterogeneous catalyst design and discovery. AIChE Journal. 2018;64:2311–2323. doi: 10.1002/aic.16198.
- 12. Shandiz MA, Gauvin R. Application of machine learning methods for the prediction of crystal system of cathode materials in lithium-ion batteries. Comput. Mater. Sci. 2016;117:270–278. doi: 10.1016/j.commatsci.2016.02.021.
- 13. Paruzzo FM, et al. Chemical shifts in molecular solids by machine learning. Nat. Comm. 2018;9:2041–1723. doi: 10.1038/s41467-018-06972-x.
- 14. Paruzzo, F. M. et al. Chemical shifts in molecular solids by machine learning datasets. Materials Cloud Archive (2019).
- 15. Hachmann J, et al. The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the world community grid. J. Phys. Chem. Lett. 2011;2:2241–2251. doi: 10.1021/jz200866s.
- 16. Ramakrishnan R, Hartmann M, Tapavicza E, von Lilienfeld OA. Electronic spectra from TDDFT and machine learning in chemical space. J. Chem. Phys. 2015;143:084111. doi: 10.1063/1.4928757.
- 17. Ruddigkeit L, van Deursen R, Blum LC, Reymond J-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 2012;52:2864–2875. doi: 10.1021/ci300415d.
- 18. Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1 (2014).
- 19. Hedin L. New method for calculating the one-particle Green’s function with application to the electron-gas problem. Phys. Rev. 1965;139:A796–A823. doi: 10.1103/PhysRev.139.A796.
- 20. Kohn W. Nobel Lecture: Electronic structure of matter—wave functions and density functionals. Rev. Mod. Phys. 1999;71:1253–1266. doi: 10.1103/RevModPhys.71.1253.
- 21. Cabrele C, Reiser O. The modern face of synthetic heterocyclic chemistry. J. Org. Chem. 2016;81:10109–10125. doi: 10.1021/acs.joc.6b02034.
- 22. Ponra S, Majumdar KC. Brønsted acid-promoted synthesis of common heterocycles and related bio-active and functional molecules. RSC Adv. 2016;6:37784–37922. doi: 10.1039/C5RA27069C.
- 23. Wang C, Dong H, Hu W, Liu Y, Zhu D. Semiconducting π-conjugated systems in field-effect transistors: A material odyssey of organic electronics. Chem. Rev. 2012;112:2208–2267. doi: 10.1021/cr100380z.
- 24. Li, Y. Organic Optoelectronic Materials. Lecture Notes in Chemistry (Springer International Publishing, 2015).
- 25. Ostroverkhova O. Organic optoelectronic materials: Mechanisms and applications. Chem. Rev. 2016;116:13279–13412. doi: 10.1021/acs.chemrev.6b00127.
- 26. Ostroverkhova, O. Handbook of Organic Materials for Optical and (Opto)Electronic Devices: Properties and Applications. Woodhead Publishing Series in Electronic and Optical Materials (Elsevier Science, 2013).
- 27. Silverman, R. & Holladay, M. The Organic Chemistry of Drug Design and Drug Action (Elsevier Science, 2014).
- 28. Taylor AP, et al. Modern advances in heterocyclic chemistry in drug discovery. Org. Biomol. Chem. 2016;14:6611–6637. doi: 10.1039/C6OB00936K.
- 29. Allen FH. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallogr. B. 2002;58:380–388. doi: 10.1107/S0108768102003890.
- 30. Schober C, Reuter K, Oberhofer H. Virtual screening for high carrier mobility in organic semiconductors. J. Phys. Chem. Lett. 2016;7:3973–3977. doi: 10.1021/acs.jpclett.6b01657.
- 31. Schober, C. O. Ab Initio Charge Carrier Mobility and Computational Screening of Molecular Crystals for Organic Semiconductors. Dissertation, Technische Universität München, München (2017).
- 32. Kunkel C, Schober C, Margraf JT, Reuter K, Oberhofer H. Finding the right bricks for molecular legos: A data mining approach to organic semiconductor design. Chem. Mater. 2019;31:969–978. doi: 10.1021/acs.chemmater.8b04436.
- 33. Kunkel C, Schober C, Oberhofer H, Reuter K. Knowledge discovery through chemical space networks: the case of organic electronics. J. Mol. Model. 2019;25:87. doi: 10.1007/s00894-019-3950-6.
- 34. Perdew JP, Burke K, Ernzerhof M. Generalized gradient approximation made simple. Phys. Rev. Lett. 1996;77:3865–3868. doi: 10.1103/PhysRevLett.77.3865.
- 35. Tkatchenko A, Scheffler M. Accurate molecular van der Waals interactions from ground-state electron density and free-atom reference data. Phys. Rev. Lett. 2009;102:073005. doi: 10.1103/PhysRevLett.102.073005.
- 36. Adamo C, Barone V. Toward reliable density functional methods without adjustable parameters: The PBE0 model. J. Chem. Phys. 1999;110:6158–6170. doi: 10.1063/1.478522.
- 37. Perdew JP, Ernzerhof M, Burke K. Rationale for mixing exact exchange with density functional approximations. J. Chem. Phys. 1996;105:9982–9985. doi: 10.1063/1.472933.
- 38. Reining L. The GW approximation: content, successes and limitations. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2018;8:e1344. doi: 10.1002/wcms.1344.
- 39. Golze D, Dvorak M, Rinke P. The GW compendium: A practical guide to theoretical photoemission spectroscopy. Front. Chem. 2019;7:377. doi: 10.3389/fchem.2019.00377.
- 40. van Setten MJ, et al. GW100: Benchmarking G0W0 for molecular systems. J. Chem. Theory Comput. 2015;11:5665–5687. doi: 10.1021/acs.jctc.5b00453.
- 41. O’Boyle NM, et al. Open babel: An open chemical toolbox. J. Cheminformatics. 2011;3:33. doi: 10.1186/1758-2946-3-33.
- 42. Landrum, G. RDKit: Open-source cheminformatics (2018).
- 43. Blum V, et al. Ab initio molecular simulations with numeric atom-centered orbitals. Comput. Phys. Commun. 2009;180:2175–2196. doi: 10.1016/j.cpc.2009.06.022.
- 44. Ren, X. et al. Resolution-of-identity approach to Hartree–Fock, hybrid density functionals, RPA, MP2 and GW with numeric atom-centered orbital basis functions. New J. Phys. 14 (2012).
- 45. Zhang IY, Ren X, Rinke P, Blum V, Scheffler M. Numeric atom-centered-orbital basis sets with valence-correlation consistency from H to Ar. New J. Phys. 2013;15:123033. doi: 10.1088/1367-2630/15/12/123033.
- 46. Ropo, M., Schneider, M., Baldauf, C. & Blum, V. First-principles data set of 45,892 isolated and cation-coordinated conformers of 20 proteinogenic amino acids. Sci. Data 3 (2016).
- 47. Mardirossian N, Head-Gordon M. Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals. Mol. Phys. 2017;115:2315–2372. doi: 10.1080/00268976.2017.1333644.
- 48. Marom N, Tkatchenko A, Kapishnikov S, Kronik L, Leiserowitz L. Structure and formation of synthetic hemozoin: Insights from first-principles calculations. Cryst. Growth Des. 2011;11:3332–3341. doi: 10.1021/cg200409d.
- 49. Reilly AM, Tkatchenko A. Understanding the role of vibrations, exact exchange, and many-body van der Waals interactions in the cohesive properties of molecular crystals. J. Chem. Phys. 2013;139:024705. doi: 10.1063/1.4812819.
- 50. Hoja J, Tkatchenko A. First-principles stability ranking of molecular crystal polymorphs with the DFT+MBD approach. Faraday Discuss. 2018;211:253–274. doi: 10.1039/C8FD00066B.
- 51. Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D. InChI, the IUPAC international chemical identifier. J. Cheminformatics. 2015;7:23. doi: 10.1186/s13321-015-0068-4.
- 52. Sinstein M, et al. Efficient implicit solvation method for full potential DFT. J. Chem. Theory Comput. 2017;13:5582–5603. doi: 10.1021/acs.jctc.7b00297.
- 53. Aryasetiawan F, Gunnarsson O. The GW method. Rep. Prog. Phys. 1998;61:237–312. doi: 10.1088/0034-4885/61/3/002.
- 54. Weigend F, Ahlrichs R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys. 2005;7:3297–3305. doi: 10.1039/b508541a.
- 55. Feller D. The role of databases in support of computational chemistry calculations. J. Comp. Chem. 1996;17:1571–1586. doi: 10.1002/(SICI)1096-987X(199610)17:13<1571::AID-JCC9>3.0.CO;2-P.
- 56. Schuchardt KL, et al. Basis Set Exchange: A community database for computational sciences. J. Chem. Inf. Model. 2007;47:1045–1052. doi: 10.1021/ci600510j.
- 57. Vidberg HJ, Serene JW. Solving the Eliashberg equations by means of N-point Padé approximants. J. Low Temp. Phys. 1977;29:179–192. doi: 10.1007/BF00655090.
- 58. Golze D, Wilhelm J, van Setten MJ, Rinke P. Core-level binding energies from GW: An efficient full-frequency approach within a localized basis. J. Chem. Theory Comput. 2018;14:4856–4869. doi: 10.1021/acs.jctc.8b00458.
- 59. Stuke A, 2019. “OE62-dataset” of molecular orbital energies. mediaTUM.
- 60. McKinney, W. Data structures for statistical computing in Python. Proc. of the 9th Python in Science Conf. 51–56 (2010).
- 61. Stuke A, 2019. OE62 dataset: results of DFT PBE + vdW (vacuum) calculations - part 1. NOMAD repository.
- 62. Stuke A, 2019. OE62 dataset: results of DFT PBE + vdW (vacuum) calculations - part 2. NOMAD repository.
- 63. Stuke A, 2019. OE62 dataset: results of DFT PBE + vdW (vacuum) calculations - part 3. NOMAD repository.
- 64. Stuke A, 2019. OE62 dataset: results of DFT PBE + vdW (vacuum) calculations - part 4. NOMAD repository.
- 65. Stuke A, 2019. OE62 dataset: results of DFT PBE + vdW (vacuum) calculations - part 5. NOMAD repository.
- 66. Stuke A, 2019. OE62 dataset: results of DFT PBE + vdW (vacuum) calculations - part 6. NOMAD repository.
- 67. Stuke A, 2019. OE62 dataset: results of DFT PBE + vdW (vacuum) calculations - part 7. NOMAD repository.
- 68. Stuke A, 2019. OE62 dataset: results of DFT PBE0 (vacuum) calculations. NOMAD repository.
- 69. Stuke A, 2019. OE62 dataset: results of DFT PBE0 (water) calculations. NOMAD repository.
- 70. Stuke A, 2019. OE62 dataset: results of G0W0@PBE0 (vacuum) calculations with def2-TZVP basis set. NOMAD repository.
- 71. Stuke A, 2019. OE62 dataset: results of G0W0@PBE0 (vacuum) calculations with def2-QZVP basis set. NOMAD repository.
- 72. Rupp M, Tkatchenko A, Müller K-R, von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 2012;108:058301. doi: 10.1103/PhysRevLett.108.058301.
- 73. Himanen, L. et al. Dscribe: Library of descriptors for machine learning in materials science. Comput. Phys. Commun. 106949 (2019).
- 74. Jensen SR, et al. The elephant in the room of density functional theory calculations. J. Phys. Chem. Lett. 2017;8:1449–1457. doi: 10.1021/acs.jpclett.7b00255.
- 75. Lejaeghere, K. et al. Reproducibility in density functional theory calculations of solids. Science 351 (2016).