Skip to main content
ACS AuthorChoice logoLink to ACS AuthorChoice
. 2025 Jun 25;65(13):6644–6654. doi: 10.1021/acs.jcim.5c00665

Structural Bias in Three-Dimensional Autoregressive Generative Machine Learning of Organic Molecules

Zsuzsanna Koczor-Benda , Joe Gilkes †,, Francesco Bartucca , Abdulla Al-Fekaiki , Reinhard J Maurer †,§,*
PMCID: PMC12264931  PMID: 40556385

Abstract

A range of generative machine learning models for the design of novel molecules and materials have been proposed in recent years. Models that can generate three-dimensional structures are particularly suitable for quantum chemistry workflows, enabling direct property prediction. The performance of generative models is typically assessed based on their ability to produce novel, valid, and unique molecules. However, equally important is their ability to learn the prevalence of functional groups and certain chemical moieties in the underlying training data, that is, to faithfully reproduce the chemical space spanned by the training data. Here, we investigate the ability of the autoregressive generative machine learning model G-SchNet to reproduce the chemical space and property distributions of training data sets composed of large, functional organic molecules. We assess the elemental composition, size- and bond-length distributions, as well as the functional group and chemical space distribution of training and generated molecules. By principal component analysis of the chemical space, we find that the model leads to a biased generation that is largely unaffected by the choice of hyperparameters or the training data set distribution, producing molecules that are, on average, less saturated and contain more heteroatoms. Purely aliphatic molecules are mostly absent in the generation. We further investigate generation with functional group constraints and based on composite data sets, which can help to partially remedy the model generation bias. Decision tree models can recognize the generation bias in the models and discriminate between training and generated data, revealing key chemical differences between the two sets. The chemical differences we find affect the distributions of electronic properties such as the HOMO–LUMO gap, which is a common target for functional molecule design.


graphic file with name ci5c00665_0010.jpg


graphic file with name ci5c00665_0008.jpg

Introduction

Generative machine learning models (GMLMs) play an increasingly important role in chemistry. They support the exploration of the compositional and configurational space of molecules (chemical space) and the design of novel molecules and materials. Many variants of generative models for molecules exist based on a diverse range of architectures such as variational autoencoders (VAEs), generative adversarial networks, diffusion models, and autoregressive models. Typically, the performance of generative models is assessed based on their ability to generate novel, unique and valid molecules at a high rate, such that they can create diverse and realistic molecules. However, an equally important requirement is that the GMLM reproduces the chemical space and the structural distribution spanned by the training data. Good knowledge of the structural and functional group distribution that the GMLM produces, for example, is a prerequisite for inverse design and property-driven generative design applications, which typically aim to controllably find molecules beyond the original training data. In such applications, the original training data provides a starting point. This molecular distribution is then either biased , or constrained , toward a specific region of chemical space to increase the likelihood of generating molecules that satisfy certain conditions or properties.

Most GMLMs use text-based or image representations of molecules such as SMILES, , fingerprints, , or images as these are most suitable when combined with existing generative algorithms. However, many applications require molecular representations in three-dimensional (3D) Euclidean space that capture bond distances, angles, and molecular conformation. This is particularly so when studying optoelectronic and vibrational properties of molecules that are sensitive to molecular conformation or when integrating GMLMs with quantum chemistry methods where 3D Cartesian positions are a necessary input to calculations. While 3D structures can be reconstructed from generative design based on structural fingerprints, this adds an additional source of uncertainty and error. Several models have so far been proposed that directly generate 3D point cloud representations of molecules. ,− While models based on normalizing flows and diffusion currently top the leaderboards when it comes to their ability to generate unique, novel, and valid molecules, they also come with considerable downsides. Existing diffusion models can often generate disconnected fragments at a high rate for large molecules. , In addition, molecular generation comes with considerable computational effort as the diffusion model acts simultaneously on all input coordinates and a large number of steps is required to converge the diffusion process. Autoregressive generation algorithms, on the other hand, build molecular structures atom-by-atom based on conditional probabilities of finding a specific element at a certain position in the presence of previously placed atoms. G-SchNet captures the elemental composition and the chemical environment of the atoms using the SchNet descriptor embeddings. , G-SchNet has been successfully used for a variety of applications, particularly in the context of conditional and biased property-driven design. ,, For these applications, it is crucial that the generative model accurately represents the underlying distribution provided by the training data.

The ability of G-SchNet to reproduce chemical features and functional groups of its training database has been analyzed in previous publications. ,, Gebauer et al. studied the performance of G-SchNet when trained on the QM9 data set, which has a relatively restricted chemical space, containing elements H, C, N, O, F, and only up to 9 non-hydrogen atoms per molecule. It was found that the bond length distribution of the generated molecules well represents the training data distribution and that the generated molecules also reproduce the distribution of HOMO–LUMO gaps. Training sets containing more diverse and larger molecules have been used for scaffold-based design and property-driven design. , Westermayr et al. trained G-SchNet on the OE62 data set of 61,489 crystal-forming organic molecules. The OE62 data set features large chemical and structural diversity with molecules varying in size from 3 to over 150 atoms and containing up to 16 elements. Westermayr et al. reported differences in elemental composition and subtle deviations in the bond distance and angle distributions between the training data and the generated structures. Root mean square deviations between generated structures and DFT-relaxed structures were found to be small for most molecules. However, fundamental quasiparticle gaps predicted for the unbiased generated molecules differed from the original training data set by several electronVolt. Nevertheless, the property-driven design toward lower fundamental gaps based on iterative fine-tuning of G-SchNet was successful, as the independent property predictor that was employed remained largely valid for all generated molecules. Koczor-Benda et al. recently employed G-SchNet for the targeted design of molecules with optimal vibrational properties to act as THz upconverters in plasmonic nanojunctions, , requiring the presence of a gold–thiolate group and the simultaneous infrared and Raman activity of vibrations in the THz region. Unbiased generated molecules exhibited significant chemical differences from the training data set with the consequence that the employed predictor for THz activity was not transferable to the generated molecules without retraining.

Here, we perform an in-depth analysis of the ability of the G-SchNet GMLM to generate data sets of molecules with up to two hundred atoms with structural and chemical properties that are consistent with the underlying training data. We do this for G-SchNet trained on the QM9 data set, , the OE62 data set, and a data set of thiol molecules. , Largely independent of the hyperparameters, G-SchNet consistently generates more undersaturated molecules than the underlying data sets, which is reflected in the elemental and molecular weight distribution. As revealed by clustering analysis, the generated molecular distribution lacks specific molecular motifs, such as large aliphatic structures, which affects different molecular properties. We find that resampling the training data to provide a more equal representation of structural features and functional groups in the training data only partially remedies the observed discrepancy between the chemical space covered by the training data and the generated molecules. Based on this analysis, we study the role of functional group constraints and training data set composition on the chemical space distribution of generated molecules and how the structural bias during molecule generation affects property distributions. Finally, we show that bias in molecular generation can be used to discriminate original from generated molecules.

Methods

Generative Machine Learning

We use the schnetpack-gschnet package for all model training and molecule generation. In this work, we employ three databases: the QM9 database of 134k molecules with 9 non-hydrogen atoms, , the OE62 database of 61,489 crystal-forming organic molecules extracted from the Cambridge Crystal Structure Database, and a data set of about 3000 gold–thiolate molecules, , henceforth referred to as the ‘THz database’. All data sets are public and provide molecules in their relaxed, equilibrium geometry as predicted by density functional theory (DFT). For QM9, we use 50,000 (50k) molecules for training, 5k for validation, and 76k for testing. We employ default hyperparameter settings of the G-SchNet QM9 experiment unless stated otherwise. We set the number of molecules to generate as 60k, and the maximum number of atoms per molecule to 35 (following ref ). For the learning curve, the same validation set (5k) and test set (76k) were used across all models. For the OE62 database, we use 45k molecules for training, 4.5k for validation, and 12k for testing. The placement cutoff and the covalent radius factor in G-SchNet are increased (to 2.6 and 1.3 Å, respectively), to accommodate the increased variety of chemical elements. As in ref , we reduce the number of random atom placement trajectories sampled to 5, to reduce computational cost. We show later that this does not significantly affect the generation. The parameters used for generation are the same as for QM9, except for the maximum number of atoms that was set to 200, to account for the larger size of training molecules. For the constrained gold–thiolate molecule generation, we combine the OE62 data set with the THz data set using the same settings as for OE62-only training and generation. Further details on the hyperparameters of G-SchNet for training and generation are provided in the Supporting Information (SI).

Molecular Analysis

Generated molecules are filtered using the approach from ref , which removes duplicates and disconnected structures. Unless stated otherwise, only these filtered generated molecules are included in the molecular analysis. Molecules generated by G-SchNet when trained on OE62 are additionally downsampled, such that the distribution of atom counts in generated data sets matches the distribution of atom counts in OE62. Details of the downsampling procedure can be found in the Supporting Information. All molecules from the QM9 and OE62 training databases are included in the analysis. For elemental composition analysis, pairwise distance distribution analysis and generation of canonical SMILES representations, the Open Babel package was used. Ring types and counts, as well as functional groups were identified with the RDKit package. We note that in many cases, the canonical SMILES strings did not convert to valid RDKit molecules, even for the training databases, therefore these molecules are not included in further analysis. Further information is provided in the Supporting Information. HOMO–LUMO gaps were predicted for the training and generated molecules using the SchNet+H model.

Clustering in Latent Chemical Space

Visual representations of the chemical space spanned by molecules within training and generated databases are created through the application of principal component analysis (PCA) on two high-dimensional molecular descriptors, following the same formalism introduced in ref . The first descriptor, referred to as the structural descriptor, is an averaged SOAP (smooth overlap of atomic positions) descriptor, obtained with the DScribe package. This results in a 57,792-dimensional encoding of the average atomic environment around each atom for each molecule in a given database. The second descriptor, referred to as the bonding descriptor, is a bespoke descriptor formed from features relating to molecular bonding extracted by the Open Babel and RDKit software packages. , 592 of these features make up the bonding descriptor, ranging from simple quantities such as the number of each element in a molecule, to more complex measures such as the number of aromatic rings of a given size in each molecule.

After calculating these descriptors for every molecule in a database, independent PCAs are fit for each descriptor and used to compare training and generated data sets. In cases where multiple databases need to span the same latent chemical space for comparison, these are fit across the concatenation of these databases. Molecules are plotted in the latent space formed by the first principal component (PC) from each descriptor to evaluate the extent to which data sets align (e.g., the same PC vectors are used across Figures and ). We have carefully assessed that both descriptor spaces efficiently capture their respective covariances in a single PC, providing more than 66% covariance in the first principal component. We apply a 2D kernel density estimate (KDE) to the molecules in these latent spaces to identify areas of localization. To aid with the analysis of these densely packed latent spaces which often contain tens of thousands of molecules, we cluster molecules within this latent space. Again, when comparison between data sets is required, this is performed over the concatenation of these data sets such that clustering applies across data sets, and molecules in a given cluster within one data set are similar in structure and bonding to molecules in the same cluster in another data set. As in ref , we use a mixture of the BIRCH and agglomerative clustering algorithms, as implemented in the scikit-learn package, to allow for finding clusters with different sizes.

4.

4

Latent chemical space covered by the molecules of the OE62 data set and the generated molecules (downsampled according to OE62’s atom count distribution), formed by PCA on descriptors encoding chemical structure and bonding. Percentages on axis labels indicate the proportion of the overall variance in each descriptor captured by the respective PC. (a) and (d) are KDE heatmaps of the training and generated molecules respectively, where darker areas correspond to a higher population of molecules. (b) and (c) represent the same molecules, divided into clusters representing similar areas of the latent space. Eight molecules, spread across the training and generated data sets, are shown to highlight similarities and differences in regions of the space covered by the data sets.

6.

6

KDE heatmaps of the latent chemical space covered by the molecules of the OE62 and THz training data sets, and the generated molecules from unconstrained and constrained G-SchNet models (sampled according to OE62’s atom count distribution), formed by PCA on structural and bonding descriptors of molecules from these data sets. Percentages on axis labels indicate the proportion of the overall variance in each descriptor captured by the respective PC.

We later resample the OE62 training data set within the latent space by dividing the space created by the maximum and minimum bounds of the first PCs from each descriptor into an evenly spaced grid, then randomly sampling a single molecule from each grid rectangle. The number of returned molecules is controlled by the resolution of this grid (how many divisions are created along each axis), n grid, chosen by fitting a 2D KDE to the sampled molecules and optimizing for a lack of spatial localization.

Decision Tree Discriminators

We train decision trees to discriminate between training and generated molecules. The DecisionTreeClassifier from scikit-learn was used for classifying molecules, using balanced class weights. 20% of molecules were used as test set, while the remaining molecules were split in 90%:10% for train and validation sets for finding the optimal depth of decision trees. As feature vectors, we tested Morgan fingerprint vectors with different radii and number of bits (see Supporting Information), generated with the RDKit package. We also tested adding total heavy atom counts and individual element counts to the feature vectors (see Supporting Information), which increases accuracy scores in the case of QM9, but has no notable effect in the case of OE62. Consequently, these were used only for QM9. The tree depth parameter values were optimized for each data set (see Supporting Information) in the range of 3–17 to achieve the highest accuracy score on the validation set. For testing different G-SchNet models trained on OE62 with different G-SchNet hyperparameter settings, we train decision trees using the same type of feature vector (Morgan fingerprints only) and tree depth parameter (9) that were found to be optimal for the optimized G-SchNet model used to generate molecules from OE62 (see Supporting Information).

Results and Discussion

Small Molecule Generation

Molecules generated by G-SchNet trained on the QM9 database have already been analyzed in the original G-SchNet publication. Here we only briefly review and expand that analysis to emphasize some key points and provide more detailed insights that will enable us to compare models trained on data sets of larger and more complex molecules in the following sections.

Trained on QM9, G-SchNet reproduces the atom number distribution per molecule well (Figure S2), generating only slightly larger molecules than those present in QM9 (18.6 vs 18.0 atoms on average). However, the molecular weight distributions reveal that a significant number of generated molecules have larger molecular weight (extra peak at 140 g/mol, see Figure S2) than molecules in QM9. Resampling the generated data set to match the atom number distribution of the original training data set does not have a significant impact on the molecular weight distribution (Figure S2), suggesting that the proportion of heavier elements is inherently larger in generated molecules. By analyzing the elemental composition, we find that this is mainly due to oxygen and nitrogen atoms being slightly overexpressed while hydrogens and carbons are slightly under-expressed (Figures b and S3). This behavior was already noted in the original G-SchNet paper.

2.

2

(a) G-SchNet learning curve on the QM9 data set, using 5 randomly sampled atom placement trajectories per molecule or all possible trajectories. (b) Elemental composition of the generated molecules compared to QM9.

The radial distribution functions of C–C and C–O distances have also been reported in ref . Here we want to point out that although the overall agreement is impressive for general atom-pair distances, some discrepancies become apparent (Figure a) when looking at only bond distances. It becomes clear that the model cannot fully reproduce the distribution of single, double, and triple bonds. For example, short C–C distances, especially C–C triple bonds are underrepresented in the generated database. Functional group analysis shows that C–C triple bonds are present in 14% of molecules in QM9, but only in 3% of generated molecules. Similarly, C–C aromatic bonds are present in 13% of QM9, but only in 2% of generated molecules.

1.

1

Distribution of bonded C–C distances for (a) the QM9 and (b) the OE62 training set and corresponding generated molecules. Generated molecules are sampled to match the atom number distribution of the respective training sets.

We trained decision tree models to classify molecules as training or generated based on molecular fingerprints and element counts (Figure S5). Our best model achieves an accuracy score of 0.74 on previously unseen molecules. Among the most important features to classify molecules as either generated or belonging to the training data (top five most important features are shown in the SI, Figure S6) are the number of non-hydrogen atoms, the number of nitrogen atoms, and the number of carbons involved in a triple bond. The first is easy to understand: since only the total number of atoms was restricted, molecules with more than 9 non-hydrogen atoms are present in the generated set; this clearly sets them apart from QM9, which contains at most 9 non-hydrogen atoms. Inaccuracies in elemental composition have also been discussed in the previous paragraphs. The presence of a triple bonded carbon in molecules with ≤ 9 heavy atoms likely results in a classification as training, in line with our observations that C–C triple bonds are underrepresented in generated molecules.

Average ring counts of 3–6 membered rings per molecule have been reported in ref . These show an increase in overall ring count, due to an increase in 3 and 4-membered rings. In Figure S4, we additionally analyze the proportion of molecules having different numbers of aromatic, unsaturated aliphatic and saturated rings. Fewer of the generated molecules have aromatic and unsaturated aliphatic rings, while the presence of saturated rings increased noticeably, in particular, the proportion of molecules containing 2 or 3 saturated rings increased.

For total dipole moment, isotropic polarizability, and electronic spatial extent, Gebauer et al. reported SchNet property predictions for training and generated molecules, while HOMO–LUMO gaps were calculated by DFT. The distributions for the generated molecules and QM9 are quite similar for all four properties, indicating that the minor chemical differences between generated and training molecules do not affect property distributions significantly.

Finally, we investigate how quickly G-SchNet learns from training data (3D molecular structures), by analyzing models trained on 10k, 30k and 50k molecules from QM9. As the number of possible atom placement trajectories scales quadratically with the number of atoms, to limit the cost of training, we restrict the sampling trajectories during training to five. To test how a reduction of the number of trajectories sampled during training influences molecule generation, we performed an ablation study with QM9. We set the number of randomly sampled atom placement trajectories to five and investigated different numbers of training molecules. The resulting learning curve is shown in Figure a. When limiting the number of trajectories, the validation and test losses are also calculated for the same number of random trajectories. While losses are increased with fewer trajectories, the elemental composition of the generated molecules is not significantly affected.

Large and Chemically Diverse Molecules

As expected, the wider variety of elements and molecular sizes in the OE62 database presents a challenge for G-SchNet. Already the proportion of successful molecule generation to requested number of molecules is significantly lower for OE62 than for QM9. The details on the number of connected/unique/valid molecules, and how many can be converted to RDKit molecules, are given in Table S4.

In contrast to the QM9 case, the size distribution of generated molecules is significantly different from that of the OE62 database. The generated molecules are typically much larger than the molecules contained in the training data set and also contain a smaller proportion of hydrogen atoms (Figures a and S8), indicating that heavier atoms and unsaturated bonds are overexpressed. To match the size distribution of the training set as closely as possible, it is necessary to either limit the maximum number of atoms to a significantly smaller number than the maximum number of atoms in OE62 or to downsample the generated molecules according to the atom count distribution of the training database. We perform this downsampling, reducing the data set to 9482 generated molecules, and we use this sampled data set for all further analysis. Figure b shows that even after downsampling, the molecular weight distribution of the generated set is broader and shifted to larger values than that of OE62. This suggests that the generated molecules are more likely to contain heavier atoms, which is in line with observations for QM9. We confirmed that this difference in molecular weight distribution is statistically significant by applying Welch’s t test and a two-sample Kolmogorov–Smirnov (KS) test. Both tests returned negligible p-values, indicating that, respectively, the means of the training and generated distributions are not equal and that samples drawn from each distribution cannot be drawn from the same distribution.

3.

3

Analysis of the OE62 training database and corresponding generated molecules: (a) Hydrogen atom count vs total atom count, (b) Molecular weight distributions, (c) HOMO–LUMO gap distributions as predicted by the SchNet + H model, (d) Type and number of rings in molecules, (e) Elemental composition. In (b–e) the generated database is sampled to match the atom count distribution of the training set.

The elemental composition (Figure e) reveals that N, O, F, S, and Cl are all overexpressed. Interestingly, in contrast to our observations for the QM9 database, in this case, the longer bond distances corresponding to single C–C bonds (Figure S8) are underrepresented compared to unsaturated bonds. This is also reflected in the increased proportion of molecules that contain multiple unsaturated aliphatic rings compared to OE62 (Figure d).

To check if these effects are inherent to G-SchNet or are an artifact of filtering the generated molecules for validity and uniqueness, we performed the same analysis for raw generated molecules (Figures S7–S9). Filtering has a negligible effect across the investigated distribution and does not affect any of the above-mentioned trends.

The latent chemical space occupied by the molecules of the training and generated data sets, formed by PCA fitted to the structural and bonding descriptors of the molecules of these data sets, is shown in Figure . Within this latent space, the generated molecules mimic the majority of the training set molecules, following the edges of the space outlined by OE62 and capturing the area of highest density (Figure a,d). However, there are some differences between the two distributions; while the area between −2.5 and 1.5 in both PCs is replicated well, the generated molecules do not extend into the same high-value areas of either PC that are achieved in the training data set. This is somewhat sensible, as the well-replicated regions lie close to the high-density region of the latent space shown in Figure a, while the unreplicated regions lie in spaces of comparatively low density. We verified that this lack of coverage is not caused by the downsampling of generated molecules based on the size distribution of the training database by creating a similar latent space map of the unsampled generated molecules (Figure S10).

Clustering analysis on these latent spaces was carried out, as shown in Figure b,c. A selection of molecules from some of these clusters is also shown to highlight similarities and differences between regions of the latent spaces.

Molecules (1) and (6), from the training and generated data sets, respectively, show that the clustering approach successfully identifies similar clusters across both data sets; by probing the latent space at the same location in either data set, structurally and chemically similar molecules can be extracted. This cluster in either data set was characterized by small molecules with a high proportion of desaturation and aromaticity, as well as a significant proportion of heteroatoms per molecule. This is contrasted in molecules (2) and (7), which come from a cluster that consists of molecules with aliphatic rings, high saturation and few heteroatoms. G-SchNet found the generation of these types of molecules particularly difficult, echoing our earlier observations regarding the increased prevalence of heteroatoms and unsaturated bonds across the generated data set.

Clusters around molecule (3) were not represented at all in the generated data set; this is likely due to molecules in this area of the latent space possessing repeated structural motifs and functional groups, in addition to lines of symmetry. These molecules are usually synthesized through repeated applications of the same processes, making them difficult to create with G-SchNet as it generates molecules stochastically and is unlikely to repeat the same generative path multiple times per molecule. The vast majority of generated molecules from the high-density region shown in Figure d were instead similar to molecule (8), containing a handful of aromatic or otherwise unsaturated rings and 4–8 heteroatomsusually N, O, F, S, and Cl.

Within the replicated area of the latent space, the overlap is not perfect however; some generated molecules at values of Bonding PC1 between 1.0–3.75 exist at lower values of Structural PC1 than in the training data set, giving rise to an area of the latent space that is covered in Figure c but not in Figure b. Molecule (5) is from this extended region, while molecule (4) is one of the closest molecules to this area in the training data set. The clusters from which these molecules originate generally encompass a variety of large, extended molecules with a high degree of whole-molecule conjugation, but G-SchNet’s tendency to overexpress heteroatoms leads to highly substituted molecules such as (5), which are not present in the training database due to the complexity of such molecules. The entries in OE62 are real, crystal-forming molecules, while the molecules from this region of the generated data set are so complex that they are unlikely to be formed either as natural products or through manual synthesis. There is potential to filter out these molecules by sampling generated databases with respect to the molecular weight distribution of the training data set, rather than its size distribution, as shown in Figure S11.

We have also trained a decision tree discriminator for OE62, which has a much higher accuracy score than the one trained for QM9. (Table ). The most important features that discriminate generated molecules from the OE62 data set are mainly carbon atoms involved in different, e.g., unsaturated, aromatic or aliphatic bonds (see Figure S15 for a depiction of the five most important Morgan fingerprint bits), which is in line with our observations that C–C bond distance distributions are not captured well by the model. The hydroxyl group is also among the most important features, and our analysis shows that this group is significantly enriched in generated molecules (present in 67% of generated vs 22% of OE62 valid RDKit molecules).

1. Test Losses and Accuracy Scores of G-SchNet Models Trained with Different Parameter Settings .

  distance loss element type loss combined loss decision tree score
original 0.234 0.161 0.394 0.892
20 Å cutoffs 0.234 0.166 0.400 0.900
10 trajectories 0.216 0.152 0.367 0.908
1:3 loss ratio 0.271 0.162 0.189 0.911
a

The original model uses 10 Å model and prediction cutoffs, 5 randomly sampled trajectories, and a distance:element type loss ratio of 1:1

In contrast to QM9, when trained on OE62, G-SchNet produces molecules with significantly different property distributions. The HOMO–LUMO gap distribution of generated molecules (Figure c) is shifted to smaller values by about 2 eV (see also ref . for electron affinities and HOMO–LUMO gaps), which is explained by the lower level of saturation for generated molecules. Again, Welch’s t test and a KS test returned negligible p-values, allowing us to assert that the differences in training and generated distributions of HOMO–LUMO gaps were statistically significant.

To see if the model can be improved, we tested different parameter settings for training G-SchNet as well as for generating molecules with the trained models (Table ). This included changing the loss function from the one available in G-SchNet that is a sum of element type loss and distance loss with 1:1 weight ratio by default.

From these tests, only increasing the number of randomly sampled trajectories resulted in improvement in the test losses (both in distance and element type loss). Increasing the model and prediction cutoff parameters does not affect the results, while changing the loss ratio to 1:3, thus giving more weight to the element type loss, only results in an increase of distance loss on the test set with no improvement in element type loss.

How these parameters affect the molecular weight distribution, elemental composition and bond distance distribution of generated molecules is reported in Figures S12–S13. In general, there are minor differences in elemental composition between the molecules generated with different models when compared to the discrepancy with the OE62 training set. The C–C distance distributions are practically the same with the different models, irrespective of considering all C–C distances or only bond distances, suggesting that this variation in distance loss is insignificant. The balanced accuracy scores of decision trees trained to discriminate each generated database from the OE62 database are also very similar (Table ). These findings show that tweaking the hyperparameters of the G-SchNet model cannot correct its inability to fully represent the chemical distribution of its training database.

Other sources of error are also possible; the SchNet descriptor, used internally within G-SchNet to represent local atomic environments with rotational, translational and permutational invariance, may be at fault. Stark et al. previously showed that even after iterative refinement of a training set, SchNet models using this descriptor failed to produce a smooth potential energy surface (PES) for the dissociative adsorption of H2 on a Cu(111) surface. In contrast, the equivariant PaiNN model yielded a smooth PES with the same data. While the molecules generated here are chemically very different to the tested systems, the inability of the SchNet descriptor to provide smooth representations with atomic coordinates may affect the ability of G-SchNet to recognize similar atomic groups when constructing molecules.

Role of the Training Data Distribution

Structural and chemical discrepancies between the training data and the generated molecules may also arise by providing a biased and imbalanced training data set. Figure a shows many of the training molecules in OE62 localize around a relatively small area of the latent space created by PCA. This localization makes up around 75% of the training data, such that when G-SchNet learns conditional probability distributions for the next atom type, distance and position, they are inherently biased toward emulating the atoms in these molecules. This may explain why the generated molecules in Figure d are similarly localized. Molecules in the unexplored areas of the latent space are likely not present because their generation would require multiple improbable samplings of these distributions.

To investigate if this effect can be mitigated, we ‘flatten’ the training distribution, such that the model is trained on a more balanced distribution of molecules (as defined by the latent space). This may increase the relative likelihood of generating molecules in underrepresented regions of chemical space. OE62 was therefore flattened by sampling the principal component subspace on a regular grid with n grid = 350 grid points to yield a new training database of 15,552 molecules. A new G-SchNet model was trained on this flattened data set, and was used to generate 19,976 new molecules. This was also downsampled according to the atom count distribution of the flattened OE62 training set, resulting in a data set of 5112 generated molecules.

KDE plots for the flattened training and generated molecules are shown in Figure a,b, alongside the elemental distributions for these databases in Figure c. We generated PCs for these databases by passing their structural and bonding descriptors through the same PCA transformations as were used previously, enabling direct comparison with the original OE62 results. Comparing these spatial distributions of molecules to those in Figure , the training and generated distributions cover the PCA space more equally. The high-density region is larger, and more molecules are generated with Structural PC1 > 1. As shown previously, this area of the latent space corresponds mostly to aliphatic molecules. Comparing the elemental distributions to those in Figure reveals near-identical trends between the training and generated databases. Despite the flattened training database exhibiting a greater ratio of hydrogen atoms per molecule than the full OE62 database, the generated molecules still exhibit a significantly different hydrogen-to-carbon ratio.

5.

5

Analysis of the spatially flattened OE62 training database and corresponding generated molecules (after downsampling to match the training data set’s atom count distribution). Panels (a) and (b) show plots of the latent chemical space occupied by the databases, using the same PCA fits as in Figure . Panel (c) shows the difference in elemental composition between the two databases.

Overall, while providing a data set that is more evenly distributed in terms of the structural space and the bonding environments in molecules does slightly reduce the deviations in the structures of training and generated molecules, for the most part, large deviations remain. This affects the ability to sample novel molecules from the same space as the training data and will likely affect the ability to recover the associated distribution of molecular properties.

Generation with Functional Group Constraints

Often, generative molecular design is based on property targets, but also on constraints with respect to the type and frequency of functional groups present in candidate molecules. For example, Koczor-Benda et al. have generated molecules with optimal THz upconversion efficiency when placed inside plasmonic nanojunction devices. Thiol groups ensure that the molecules strongly bind within the nanojunction via a gold–thiolate bond, making the presence of such a functional group a design prerequisite. To reliably judge the THz upconversion efficiency of candidate molecules, their properties need to be predicted based on quantum chemistry calculations in the presence of the gold–thiolate group. Scaffold-based constraints can be imposed with G-SchNet by providing a filter that places the thiol group at the starting token of the autoregressive structure generation. However, most molecular data sets would contain few thiol groups and therefore it is highly likely that data sets will not provide a robust representation of relevant chemical environments. We address this issue by combining two chemically different databases; a data set of about 3000 molecules containing gold–thiolate bonds, , which provides the required chemical functionality, and OE62, which covers a wider chemical space and other potentially desirable chemical properties. We will call this combined data set OE62 + THz.

When training an unconstrained G-SchNet model on the OE62 + THz data set (where the THz data set contributes only about 5% of the training molecules) and generating a new data set of 17,332 molecules (after downsampling with respect to the OE62 + THz data set’s atom count distribution) with this model, we observe similar trends in the data distributions as for OE62, i.e., a smaller proportion of H and C and an increased proportion of heteroatoms when compared to the training data set. We similarly train a G-SchNet model with the scaffold constraint applied, yielding a data set of 21,924 molecules after downsampling. The latent chemical space covered by the molecules in these data sets (Figure ) is obtained by fitting a new PCA transformation to the combined molecules of the OE62 and THz training sets, as well as the unconstrained and scaffold-constrained generated sets. It shows that the THz subset of the data set covers a significantly smaller region of the latent chemical space than OE62, with high concentrations of molecules located in different areas of latent space. Molecule generation based on the OE62 + THz trained model explores a much smaller region than is covered by OE62, but extends significantly beyond the region covered by the THz data. The generated molecules explore areas of the latent space not covered by either of the training data sets.

Without imposing the scaffold constraint, about 4% of generated molecules are gold–thiolates, whereas, with the constraint, 100% of molecules have a gold–thiolate group. This is not fully visible in the latent space maps, likely because a single thiol group does not contribute much to either of the high-dimensional bonding and structural descriptors, resulting in the region covered by the molecules of the THz data set being entirely contained within that of the molecules of OE62. We note that, in principle, a generated molecule can contain more than one thiol group, but we have not observed the generation of such a molecule.

The constraint leads to the generation of smaller molecules that are closer to those of the THz data set than those of OE62. This is reflected in the latent space plots, where, although the extent of the regions covered by the unconstrained and constrained generated data sets is very similar, the high-density regions are entirely different. While the majority of unconstrained molecules are generated within the same high-density region of chemical space as was emphasized in OE62, most constrained molecules reflect the high-density region from the THz data set instead. We emphasize that the constraint does not prevent G-SchNet from learning from OE62, as the lower-density regions of the constrained data set still explore the same area as was also explored in the unconstrained case.

The same clustering analysis that was performed for the OE62-only trained model above was repeated with this new latent space to better identify trends in the generated molecules of the combined OE62 + THz data set, shown in Figure . In this figure, molecule (1) represents part of the high-density region of OE62, which is characterized by multiple aromatic rings conjugated across the majority of each medium-sized (20–35 atoms) molecule, often with 3–7 heteroatoms. This cluster, and many surrounding it, are present in both the THz data set and the constrained generated molecules, but its population is vastly reduced. Molecule (2), and others like it, are not present in the THz data set or the constrained molecules; this is a highly aliphatic molecule with few heteroatoms though, and we previously established that G-SchNet struggles to generate such molecules. It is therefore not a surprise to see this region of the latent space sparsely covered in the generated data set.

7.

7

Clustering of molecules from (a) the OE62 and (b) the THz training databases, in addition to (c) the constrained generated database, within the same latent chemical space formed by PCA as was shown in Figure . Examples of molecules from a subset of clusters are depicted.

Molecules (3) and (4) represent characteristic examples from the high-density regions of the THz data set and the constrained generated molecules, respectively. Both are smaller than molecule (1), reinforcing that the constrained model is, on average, generating smaller molecules. Of note is the high proportion of heteroatoms compared to carbon atoms in this area of the latent space, which turns out to be characteristic of the left edge (Structural PC1 < 0.0) of the space. This is echoed in molecules such as (5), which is much larger, but also contains many more heteroatoms. As in the earlier OE62-only model, molecules like (5) exist in an area of the latent space that was not covered by the molecules of the training set, as they are created by G-SchNet’s tendency to overexpress heteroatoms.

In summary, we find that by merging a database with a specific chemical functionality with a database that covers a wide area of chemical space, molecule generation with functional group constraints can be achieved for a larger area of chemical space, resulting in more diverse molecules. This is observed despite the remaining shortcomings of G-SchNet which lead to the insufficient coverage of chemical space by the generated molecules.

Conclusions and Outlook

We investigated the ability of the autoregressive three-dimensional molecular structure generation model G-SchNet to generate molecules that faithfully represent the structural distribution spanned by the training data set. We provide a detailed analysis of the elemental distribution, the functional group composition and latent chemical space maps, which reveal that generated molecules are systematically biased compared to the training data set. The bias leads to the overexpression of unsaturated carbon and heteroatoms compared to the training data and the suppression of aliphatic structural motifs. When visualized in latent chemical space maps, the molecules only cover a small region of chemical space compared to the training data. In addition, G-SchNet does not reliably represent the molecular size distribution of the training data set, typically leading to the generation of many molecules larger than those in the training data set, requiring additional subsampling of generated molecules with respect to the size or molecular weight distribution of the training data set.

While structural bias effects in generated molecules are relatively subtle in small molecule data sets such as QM9, only minimally affecting the elemental distribution and the property distribution of the generated molecules, the effects are significant in larger molecules (up to 200 atoms) as represented in the OE62 data set. Even without conditioning or fine-tuning, molecules generated by G-SchNet do not cover the same latent chemical space as the training data set. Simple decision-tree-based classifier models can identify such bias and discriminate between training and generated molecules based on the presence of certain functional groups or the size of molecules. We introduce a strategy to ensure broader chemical space sampling despite the imposition of functional group constraints. Data sets with specific chemical functionalities that only cover a narrow region of chemical space can be complemented with chemically unrelated data sets to create diverse molecules that reliably satisfy functional group constraints. By combining a small database of gold–thiolate-containing molecular structures optimized at DFT level (3k molecules) with a comprehensive DFT database such as OE62, we demonstrate that G-SchNet is capable of learning features from both databases simultaneously while generating strictly thiol-containing molecules only.

While inaccuracies in reproducing size and elemental distributions of the training set might be negligible for some fields of applications, specific property prediction applications, such as electronic properties (e.g., HOMO–LUMO gaps), and property-driven design workflows require generative models to recover the training data distribution if a complementary molecular property prediction model is to be used to screen generated molecules. If this is not the case, property prediction models that may be trained and validated for the training data set may become invalid for the generated molecules as they sit outside of the training data distribution. This can lead to uncontrolled errors in property prediction and results in a lack of control in property-driven workflows. This challenge has recently been discussed by Koczor-Benda et al. in the context of property-driven design of molecules with optimal vibrational spectroscopic properties for THz detection.

In some cases, the tendency of G-SchNet to overgenerate unsaturated and functionalized molecules might work in favor of generating molecules with desired properties even without conditioning or iterative biasing, such as for molecules with small HOMO–LUMO gaps. In other cases, this may disqualify G-SchNet entirely for the task. While we here cannot provide an unambiguous conclusion of the causes of the inherent generation bias in the autoregressive model G-SchNet, we can confirm that an imbalance in the structural space spanned by the training data set is likely only a small contributing factor. Insufficient expressivity of the descriptor will likely be a key factor that may be addressed in the future with the introduction of equivariant or vector-based learnable atom embeddings and latent feature representations in generative models.

A critical question that remains is whether the bias we have identified in molecular generation with G-SchNet is also present in other state-of-the-art generative models based on other architectures, such as VAEs and normalizing flow diffusion. The current literature standard is to analyze molecule generation mostly based on molecular uniqueness, novelty, and validity, which may not be sufficient to fully assess generative design algorithms for their suitability to be employed in property-driven design workflows.

Supplementary Material

ci5c00665_si_001.pdf (3.9MB, pdf)

Acknowledgments

The authors acknowledge funding through the UKRI Future Leaders Fellowship programme (MR/X023109/1), a UKRI frontier research grant (EP/X014088/1), and the EPSRC Centre for Doctoral Training in Modelling of Heterogeneous Systems (EP/S022848/1) at the University of Warwick. High-performance computing resources were provided via the Scientific Computing Research Technology Platform of the University of Warwick, the EPSRC-funded HPC Midlands+ computing centre for access to Sulis (EP/P020232/1), and the Northern Ireland High Performance Computing (NI-HPC) service for access to Kelvin2 (EP/T022175/1).

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.5c00665.

  • Additional computational details, machine learning model parameters, and analysis results (PDF)

Z.K.-B.: Conceptualization, supervision, data curation, formal analysis, investigation, writing. F.B.: Data curation, methodology, and formal analysis. A. A.-F.: Data curation, formal analysis, methodology, investigation. J.G.: Data curation, formal analysis, investigation, writing. R.J.M.: Conceptualization, supervision, writing.

The authors declare no competing financial interest.

Code for the extraction of bonding features from molecular databases and obtaining the principal components of the structural/bonding descriptors has been released in our GSchNetTools package, available at https://github.com/maurergroup/GSchNetTools. Training and generated databases of molecules, along with principal components in both descriptor spaces, are available on figshare at10.6084/m9.figshare.28661504.

Published as part of Journal of Chemical Information and Modeling special issue “Chemical Compound Space Exploration by Multiscale High-Throughput Screening and Machine Learning”.

References

  1. Elton D. C., Boukouvalas Z., Fuge M. D., Chung P. W.. Deep learning for molecular design - A review of the state of the art. Mol. Syst. Des. Eng. 2019;4:828–849. doi: 10.1039/C9ME00039A. [DOI] [Google Scholar]
  2. Schwalbe-Koda, D. ; Gómez-Bombarelli, R. . Machine Learning Meets Quantum Physics; Schütt, K. T. ; Chmiela, S. ; von Lilienfeld, O. A. ; Tkatchenko, A. ; Tsuda, K. ; Müller, K.-R. , Eds.; Springer International Publishing: Cham, 2020; pp 445–467. [Google Scholar]
  3. Bilodeau C., Jin W., Jaakkola T., Barzilay R., Jensen K. F.. Generative models for molecular discovery: Recent advances and challenges. WIREs Comput. Mol. Sci. 2022;12:e1608. doi: 10.1002/wcms.1608. [DOI] [Google Scholar]
  4. Anstine D. M., Isayev O.. Generative Models as an Emerging Paradigm in the Chemical Sciences. J. Am. Chem. Soc. 2023;145:8736–8750. doi: 10.1021/jacs.2c13467. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Gebauer, N. ; Gastegger, M. ; Schütt, K. . Symmetry-Adapted Generation of 3d Point Sets for the Targeted Discovery of Molecules; Advances in Neural Information Processing Systems 32 (NeurIPS 2019); NIPS, 2019. [Google Scholar]
  6. Westermayr J., Gilkes J., Barrett R., Maurer R. J.. High-throughput property-driven generative design of functional organic molecules. Nat. Comput. Sci. 2023;3:139–148. doi: 10.1038/s43588-022-00391-1. [DOI] [PubMed] [Google Scholar]
  7. Gebauer N. W. A., Gastegger M., Hessmann S. S., Müller K.-R., Schütt K. T.. Inverse design of 3d molecular structures with conditional generative neural networks. Nat. Commun. 2022;13:973. doi: 10.1038/s41467-022-28526-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Zhang J., Chen H.. De Novo Molecule Design Using Molecular Generative Models Constrained by Ligand–Protein Interactions. J. Chem. Inf. Model. 2022;62:3291–3306. doi: 10.1021/acs.jcim.2c00177. [DOI] [PubMed] [Google Scholar]
  9. Gómez-Bombarelli R., Wei J. N., Duvenaud D., Hernández-Lobato J. M., Sánchez-Lengeling B., Sheberla D., Aguilera-Iparraguirre J., Hirzel T. D., Adams R. P., Aspuru-Guzik A.. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018;4:268–276. doi: 10.1021/acscentsci.7b00572. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Putin E., Asadulaev A., Ivanenkov Y., Aladinskiy V., Sanchez-Lengeling B., Aspuru-Guzik A., Zhavoronkov A.. Reinforced Adversarial Neural Computer for de Novo Molecular Design. J. Chem. Inf. Model. 2018;58:1194–1204. doi: 10.1021/acs.jcim.7b00690. [DOI] [PubMed] [Google Scholar]
  11. Kadurin A., Nikolenko S., Khrabrov K., Aliper A., Zhavoronkov A.. druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico. Mol. Pharmaceutics. 2017;14:3098–3104. doi: 10.1021/acs.molpharmaceut.7b00346. [DOI] [PubMed] [Google Scholar]
  12. Tazhigulov, R. N. ; Schiller, J. ; Oppenheim, J. ; Winston, M. . Molecular Fingerprints for Robust and Efficient ML-Driven Molecular Generation. 2022. arXiv:2211.09086. arXiv.org e-Printarchive. 10.48550/arXiv.2211.09086. [DOI]
  13. Wang Z., Chen Y., Ma P., Yu Z., Wang J., Liu Y., Ye X., Sakurai T., Zeng X.. Image-based generation for molecule design with SketchMol. Nat. Mach. Intell. 2025;7:244–255. doi: 10.1038/s42256-025-00982-3. [DOI] [Google Scholar]
  14. Fung V., Jia S., Zhang J., Bi S., Yin J., Ganesh P.. Atomic structure generation from reconstructing structural fingerprints. Mach. Learn.: Sci. Technol. 2022;3:045018. doi: 10.1088/2632-2153/aca1f7. [DOI] [Google Scholar]
  15. Hoogeboom, E. ; Satorras, V. G. ; Vignac, C. ; Welling, M. . Equivariant Diffusion for Molecule Generation in 3D; Proceedings of the 39th International Conference on Machine Learning; PMLR, 2022; pp 8867–8887. [Google Scholar]
  16. Xu, M. ; Powers, A. S. ; Dror, R. O. ; Ermon, S. ; Leskovec, J. . Geometric Latent Diffusion Models for 3D Molecule Generation; Proceedings of the 40th International Conference on Machine Learning; ACM, 2023; pp 38592–38610. [Google Scholar]
  17. Hua, C. ; Luan, S. ; Xu, M. ; Ying, Z. ; Fu, J. ; Ermon, S. ; Precup, D. . MUDiff: Unified Diffusion for Complete Molecule Generation; Proceedings of the Second Learning on Graphs Conference; PMLR, 2024; pp 33:1–33:26. [Google Scholar]
  18. Cornet, F. ; Bartosh, G. ; Schmidt, M. N. ; Naesseth, C. A. . Equivariant Neural Diffusion for Molecule Generation; Advances in Neural Information Processing Systems 37 (NeurIPS 2024); NIPS, 2024; pp 49429–49460. [Google Scholar]
  19. Jo, J. ; Lee, S. ; Hwang, S. J. . Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations; Proceedings of the 39th International Conference on Machine Learning; PMLR, 2022; pp 10362–10383. [Google Scholar]
  20. Schütt, K. T. ; Kindermans, P. J. ; Sauceda, H. E. ; Chmiela, S. ; Tkatchenko, A. ; Müller, K. R. . SchNet: A Continuous-Filter Convolutional Neural Network for Modeling Quantum Interactions; Advances in Neural Information Processing Systems 30 (NIPS 2017); NIPS, 2017; pp 992–1002. [Google Scholar]
  21. Schütt K. T., Sauceda H. E., Kindermans P. J., Tkatchenko A., Müller K. R.. SchNet - A deep learning architecture for molecules and materials. J. Chem. Phys. 2018;148:241722. doi: 10.1063/1.5019779. [DOI] [PubMed] [Google Scholar]
  22. Koczor-Benda, Z. ; Chaudhuri, S. ; Gilkes, J. ; Bartucca, F. ; Li, L. ; Maurer, R. J. . Generative design of functional organic molecules for terahertz radiation detection. 2025. arXiv:2503.14748. arXiv.org e-Printarchive. 10.48550/arXiv.2503.14748. [DOI]
  23. Joshi R. P., Gebauer N. W., Bontha M., Khazaieli M., James R. M., Brown J. B., Kumar N.. 3D-scaffold: a deep learning framework to generate 3D coordinates of drug-like molecules with desired scaffolds. J. Phys. Chem. B. 2021;125:12166–12176. doi: 10.1021/acs.jpcb.1c06437. [DOI] [PubMed] [Google Scholar]
  24. Ramakrishnan R., Dral P. O., Rupp M., Anatole von Lilienfeld O.. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data. 2014;1:140022. doi: 10.1038/sdata.2014.22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Stuke A., Kunkel C., Golze D., Todorović M., Margraf J. T., Reuter K., Rinke P., Oberhofer H.. Atomic structures and orbital energies of 61,489 crystal-forming organic molecules. Sci. Data. 2020;7:58. doi: 10.1038/s41597-020-0385-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Westermayr J., Maurer R. J.. Physically inspired deep learning of molecular excitations and photoemission spectra. Chem. Sci. 2021;12:10755–10764. doi: 10.1039/D1SC01542G. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Xomalis A., Zheng X., Chikkaraddy R., Koczor-Benda Z., Miele E., Rosta E., Vandenbosch G. A. E., Martínez A., Baumberg J. J.. Detecting mid-infrared light by molecular frequency upconversion in dual-wavelength nanoantennas. Science. 2021;374:1268–1271. doi: 10.1126/science.abk2593. [DOI] [PubMed] [Google Scholar]
  28. Chen W., Roelli P., Hu H., Verlekar S., Amirtharaj S. P., Barreda A. I., Kippenberg T. J., Kovylina M., Verhagen E., Martínez A., Galland C.. Continuous-wave frequency upconversion with a molecular optomechanical nanocavity. Science. 2021;374:1264–1267. doi: 10.1126/science.abk3106. [DOI] [PubMed] [Google Scholar]
  29. Ruddigkeit L., van Deursen R., Blum L. C., Reymond J.-L.. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012;52:2864–2875. doi: 10.1021/ci300415d. [DOI] [PubMed] [Google Scholar]
  30. Koczor-Benda Z., Boehmke A. L., Xomalis A., Arul R., Readman C., Baumberg J. J., Rosta E.. Molecular Screening for Terahertz Detection with Machine-Learning-Based Methods. Phys. Rev. X. 2021;11:041035. doi: 10.1103/PhysRevX.11.041035. [DOI] [Google Scholar]
  31. Gebauer, N. ; Schütt, K. T. . Conditional G-SchNet Extension for SchNetPack 2.0 - A Generative Neural Network for 3d Molecules; GitHub. https://github.com/atomistic-machine-learning/schnetpack-gschnet. [Google Scholar]
  32. Weininger D.. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988;28:31–36. doi: 10.1021/ci00057a005. [DOI] [Google Scholar]
  33. O’Boyle N. M., Banck M., James C. A., Morley C., Vandermeersch T., Hutchison G. R.. Open Babel: An open chemical toolbox. J. Cheminform. 2011;3:33. doi: 10.1186/1758-2946-3-33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Landrum, G. RDKit: Open-source cheminformatics; RDKit. http://www.rdkit.org/. [Google Scholar]
  35. Bartók A. P., Kondor R., Csányi G.. On representing chemical environments. Phys. Rev. B. 2013;87:184115. doi: 10.1103/PhysRevB.87.184115. [DOI] [Google Scholar]
  36. Himanen L., Jäger M. O., Morooka E. V., Canova F. F., Ranawat Y. S., Gao D. Z., Rinke P., Foster A. S.. DScribe: Library of descriptors for machine learning in materials science. Comput. Phys. Commun. 2020;247:106949. doi: 10.1016/j.cpc.2019.106949. [DOI] [Google Scholar]
  37. Zhang T., Ramakrishnan R., Livny M.. BIRCH: ANew Data Clustering Algorithm and Its Applications. Data Min. Knowl. Discovery. 1997;1:141–182. doi: 10.1023/A:1009783824328. [DOI] [Google Scholar]
  38. Schubert E., Sander J., Ester M., Kriegel H. P., Xu X.. DBSCAN Revisited, Revisited. ACM Trans. Database Syst. 2017;42:1–21. doi: 10.1145/3068335. [DOI] [Google Scholar]
  39. Pedregosa F.. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011;23:2825–2830. [Google Scholar]
  40. Stark W. G., Westermayr J., Douglas-Gallardo O. A., Gardner J., Habershon S., Maurer R. J.. Machine Learning Interatomic Potentials for Reactive Hydrogen Dynamics at Metal Surfaces Based on Iterative Refinement of Reaction Probabilities. J. Phys. Chem. C. 2023;127:24168–24182. doi: 10.1021/acs.jpcc.3c06648. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Schütt, K. T. ; Unke, O. T. ; Gastegger, M. . Equivariant message passing for the prediction of tensorial properties and molecular spectra. 2021.arXiv:2102.03150. arXiv.org e-Printarchive. 10.48550/arXiv.2102.03150. [DOI]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

ci5c00665_si_001.pdf (3.9MB, pdf)

Articles from Journal of Chemical Information and Modeling are provided here courtesy of American Chemical Society

RESOURCES