Significance
The search for a method to convert the discrete set of all synthesizable molecules into a continuous mathematical space, navigable through simple optimization procedures, has been at the forefront of machine learning efforts in chemistry since the field's inception. A pathway to achieve such a digital molecular space was developed in this work using sigma profiles, a continuous set of molecular descriptors derived from quantum chemistry. This digital space is shown to be easily navigable through machine-learning-enabled algorithms, namely gradient search and Bayesian optimization, boosting molecular design and discovery. Within this unique framework, molecules with a desired property can be obtained, without any prior knowledge or data, with as few as 11 iterations (i.e., experimental measurements).
Keywords: AI, Gaussian processes, Bayesian optimization, latent space, autoencoders
Abstract
This work establishes a new paradigm for digital molecular spaces and their efficient navigation by exploiting sigma profiles. To do so, the remarkable capability of Gaussian processes (GPs), a type of stochastic machine learning model, to correlate and predict physicochemical properties from sigma profiles is demonstrated, outperforming state-of-the-art neural networks previously published. The amount of chemical information encoded in sigma profiles eases the learning burden of machine learning models, permitting the training of GPs on small datasets which, due to their negligible computational cost and ease of implementation, are ideal models to be combined with optimization tools such as gradient search or Bayesian optimization (BO). Gradient search is used to efficiently navigate the sigma profile digital space, quickly converging to local extrema of target physicochemical properties. While this requires the availability of pretrained GP models on existing datasets, such limitations are eliminated with the implementation of BO, which can find global extrema with a limited number of iterations. A remarkable example is the application of BO to boiling temperature optimization. Holding no knowledge of chemistry except for the sigma profile and boiling temperature of carbon monoxide (the worst possible initial guess), BO finds the global maximum of the available boiling temperature dataset (over 1,000 molecules encompassing more than 40 families of organic and inorganic compounds) in just 15 iterations (i.e., 15 property measurements), cementing sigma profiles as a powerful digital chemical space for molecular optimization and discovery, particularly when little to no experimental data is initially available.
The design and discovery of new molecules and materials have traditionally been carried out through a combination of heuristics, theories, models, and extensive trial-and-error efforts (1–3). However, with the growth of accessible computational power, combined with the collection and compilation of vast property datasets in the open literature, data-driven methods are becoming prevalent in chemistry-related areas, gradually outperforming and replacing conventional design frameworks (4). These data-driven methods involve training machine learning models on large datasets, aiming at relating molecular structures, compositions, and operational conditions with target material properties. Examples of such machine learning models include support vector machines (5), deep neural networks (6), and graph neural networks (7).
Machine learning is not just a tool to generate correlations between molecular descriptors and properties. Its introduction into chemistry has led to the emergence of unconventional theoretical concepts, such as the notion of a digital molecular space (8, 9). The set of all possible molecular structures is discrete and requires complex and prohibitively expensive algorithms to be navigated (10, 11). However, this set (or subsets) can be converted to an abstract, low-dimensional, and continuous mathematical space, which can then be efficiently probed and navigated using simple computational tools, such as nonparametric Gaussian process (GP) models (12). In fact, owing to their simplicity, low computational cost, and stochastic nature (13, 14), GPs are an ideal instrument to interpolate such digital molecular spaces, allowing for a multitude of powerful navigation schemes, namely active learning (15) and Bayesian optimization (BO) (16).
Digital molecular spaces can be constructed using autoencoders, in which case they are commonly denoted as latent spaces (17–19). Briefly, a machine learning model (encoder) is trained to convert molecular structures or descriptors to an abstract low-dimensional latent space. Simultaneously, another machine learning model (decoder) is trained to convert latent space data points back to molecular structures. Finally, a third machine learning model (e.g., GPs) may then be used to relate latent space data points to physicochemical properties and to perform optimization tasks on that space. However, notwithstanding their usefulness, latent spaces obtained through autoencoders are often too abstract, nonintuitive, and encode little chemical information beyond atomic connectivity. Consequently, relating such abstract spaces to target properties requires extensive datasets. For example, in the seminal work of Gómez-Bombarelli et al. (18), neural networks were successfully trained on two autoencoder-generated latent spaces to predict assorted quantum chemistry properties relying on 108,000 or 250,000 data points, respectively. Unfortunately, these dataset sizes are rarely available for many applications of interest.
Dataset size limitations can be minimized by constraining latent spaces to specific families of target compounds. For instance, Liu et al. (20) were able to train machine learning models on an ionic liquid latent space to predict carbon dioxide solubility using just 130 unique compound data points. They were then able to explore the latent space and find novel ionic liquid structures for carbon dioxide capture. Another way of minimizing dataset size limitations while maintaining molecular generality is to eliminate the abstractness of digital molecular spaces. This can be achieved by generating spaces based on existing molecular descriptors that are known to perform well on small datasets, instead of relying on autoencoders. The ideal molecular descriptor for this task should encode a great deal of chemical and molecular information, possess a low number of dimensions, and represent a continuous mathematical space. This rules out popular descriptors such as SMILES, SELFIES, fingerprints, Coulomb matrices, graphs, bags of bonds, or atomic coordinates.
Sigma profiles are unnormalized histograms of the screened surface charges of molecules (21). Obtained from quantum chemistry calculations, sigma profiles encode a great deal of chemical and molecular information, particularly polarity (22). Furthermore, their dimensionality (i.e., number of histogram bins) is small (51 for a common sigma profile calculation scheme) and is not affected by the size (i.e., number of atoms) of the molecules being described. These reasons, among others, have made sigma profiles popular molecular descriptors for machine learning applications (23–26). For example, we have shown that convolutional neural networks (CNNs) trained using sigma profiles are able to accurately predict assorted physicochemical properties for organic and inorganic compounds (molar mass, boiling temperature, vapor pressure, density, refractive index, and aqueous solubility), requiring small (hundreds of molecules) datasets to be properly fitted (23). The performance of sigma profiles as molecular descriptors was remarkable, both due to the breadth of molecules (organic and inorganic) employed and due to the multitude of properties described. In a way, sigma profiles shorten the information gap between input (molecular descriptor) and output (property), decreasing the learning burden of machine learning models and, thus, drastically reducing training dataset sizes.
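Conceptually, a sigma profile is assembled by binning the screened surface-charge densities of a molecule's surface segments into an area-weighted histogram. The sketch below illustrates this idea assuming the common 51-bin discretization over σ ∈ [−0.025, 0.025] e/Å²; the randomly generated segment charges and areas are hypothetical stand-ins for actual quantum chemistry output:

```python
import numpy as np

def sigma_profile(charges, areas, n_bins=51, sigma_min=-0.025, sigma_max=0.025):
    """Bin screened surface-charge densities into an area-weighted histogram.
    The result is an unnormalized profile whose entries sum to the total
    surface area of the segments falling inside the binning range."""
    edges = np.linspace(sigma_min, sigma_max, n_bins + 1)
    profile, _ = np.histogram(charges, bins=edges, weights=areas)
    return profile

# Hypothetical apolar molecule: segment charges clustered around sigma = 0
rng = np.random.default_rng(0)
charges = rng.normal(0.0, 0.004, size=200)  # screened charges, e/A^2
areas = np.full(200, 0.2)                   # segment surface areas, A^2
p = sigma_profile(charges, areas)
```

An apolar molecule like the one simulated here concentrates its area near the center bins, while a polar molecule would also populate bins toward the extremities of the σ range.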
Although neural networks are often the model of choice to relate sigma profiles with properties, there is limited evidence in the literature suggesting that GPs may be a feasible alternative. For instance, Sanchez-Lengeling et al. (27) utilized GPs to accurately correlate Hansen solubility parameters against a set of several molecular descriptors that included sigma profiles. Another example is the work by Salahshoori et al. (28), where GPs were explored to correlate carbon dioxide solubility in deep eutectic solvents against molecular descriptors derived from sigma profiles. Both results suggest that GPs can replace neural networks in the description of physicochemical properties from sigma profiles. If true, this would then allow for the use of several GP-dependent navigation strategies (e.g., BO) to navigate the sigma profile space. Furthermore, we have also developed a graph convolutional network framework that can predict sigma profiles from molecular structures, functioning as a molecular encoder and bypassing quantum chemistry calculations, thus greatly decreasing the computational cost of generating large sigma profile databases (29). All the above strengthens the hypothesis that sigma profiles are a viable digital molecular space.
The objective of this work is to establish sigma profiles as an intuitive, low-dimensional, and easily navigable digital molecular space. To do so, the ability of GPs to predict physicochemical properties from sigma profiles was initially assessed, and their performance was compared against results obtained using neural networks. Then, the sigma profile digital molecular space was characterized, followed by the introduction of two GP-based tools to efficiently navigate it, namely gradient search and BO. These methods were explored from the perspective of molecular optimization, with the objective of maximizing a given physicochemical property.
GP Performance
To demonstrate that sigma profiles (which are defined in SI Appendix, Eq. S1, with examples for water, methane, and propane being depicted in SI Appendix, Fig. S1) can function as a digital molecular space navigable through GP-dependent methods, it is first necessary to show that GPs can accurately describe and predict physicochemical properties from sigma profiles. To do so, the performance of GPs was benchmarked against that displayed by state-of-the-art CNNs using previously published property datasets (SI Appendix, Fig. S2 and Table S1). The impact of machine learning design variables, namely data normalization and GP kernels, was carefully studied and the results are discussed at length in Supporting Information. This initial assessment indicates that the best tradeoff between GP simplicity and performance is attained when using radial basis function and white noise kernels with a trainable Gaussian likelihood (totaling four model hyperparameters), no feature normalization, and label normalization based on the natural distribution of the property (SI Appendix, Tables S2 and S3). Thus, GPs following those guidelines were fitted to training sets for all properties studied and evaluated against the corresponding testing sets. To allow for a direct and fair comparison between GPs and CNNs, training and testing sets for the properties of interest were taken without modification from our previous CNN work (23), and the same sigma profile database reported by Mullins et al. (30) was used. These results are depicted in Fig. 1 for boiling temperature and SI Appendix, Fig. S3 for the remaining properties. The resulting GP hyperparameters for each case are listed in SI Appendix, Table S4 showing, surprisingly, little variance across different properties.
Fig. 1.

Comparison between the performance and computational cost of state-of-the-art CNNs against GPs in the prediction of physicochemical properties (boiling temperatures) from sigma profiles. Includes a representation of all (1,432 compounds) sigma profiles examined in this work (Left), depictions of the overall framework of each model (Middle), and predicted vs. experimental plots (red circles and blue triangles represent training and testing datasets) for each case (Right). Note how the development and training of state-of-the-art CNNs takes approximately one week in dedicated high-performance computing clusters, while the same procedure for GPs takes less than 10 s on a personal computer and leads to improved accuracy.
Fig. 1 and SI Appendix, Fig. S3 showcase the extraordinary ability of GPs to predict physicochemical properties by interpolating the sigma profile space. Using the coefficient of determination for the testing set as the metric of interest, which is a surrogate for the generalization (or predictive) capability of the models, GPs outperform CNNs in the prediction of five of the six properties studied: molar mass (0.96 vs. 0.94), boiling temperature (0.95 vs. 0.90), vapor pressure (0.81 vs. 0.80), density (0.74 vs. 0.68), and refractive index (0.79 vs. 0.63). GPs attain a predictive capability comparable to that of CNNs for the remaining property, aqueous solubility (0.88 vs. 0.90). Such performance is particularly remarkable given the extensive property ranges being described (e.g., 11 orders of magnitude for aqueous solubility), as listed in SI Appendix, Table S1 and the breadth of molecules and physical states involved (e.g., density and refractive index datasets include both solids and liquids). Note that the reason behind the near-perfect training coefficients of determination is a peculiarity of GPs and is explained at length in SI Appendix, with SI Appendix, Fig. S4 illustrating the role of the white noise kernel by reproducing SI Appendix, Fig. S3 but assuming noise-free data, which leads to a significant drop in GP performance.
The benchmark results reported above are surprising given that GPs are much simpler models than CNNs, with their training and hyperparameter tuning having negligible computational cost. In fact, the production of each plot in SI Appendix, Fig. S3, which included label normalization, GP training and hyperparameter tuning, computing predictions for both training and testing sets, and label unnormalization, took between 1 to 10 s on a regular laptop computer. In stark contrast, the hyperparameter tuning of each CNN took roughly a week on specialized high-performance computer clusters, while their final fitting (after hyperparameter tuning) can take hours on a regular laptop computer (23). This dramatic reduction in computational cost, together with an increase in prediction performance, is a major advantage of GPs over neural-network-type machine learning models for sigma profile exploitation.
Notwithstanding their different mathematical origins and conceptual natures, GPs and CNNs appear to display similar issues with outliers. The best example of this can be found in the density testing set, where both models display a severe outlying prediction at an experimental density around 3 g/mL. Other instances of similar outlying behavior include predictions for low vapor pressure and low aqueous solubility experimental values, as well as boiling temperature values around 50 to 100 °C. This implies that both GPs and CNNs are learning or describing the same underlying relationship between physicochemical properties and sigma profiles. In turn, this suggests that both models are learning an existing chemical relationship found in nature between properties and molecule polarities (i.e., sigma profiles).
The computational cost of GPs does not grow linearly with the amount of training data (N), scaling approximately as O(N³) instead. This is mostly due to the computation of the inverse of the covariance matrix of the training data (SI Appendix, Eqs. S2–S9). As such, while the paragraphs above highlight the remarkable performance of GPs, it must be reinforced that their usage is made possible only because sigma profiles are employed as molecular descriptors. As explained in the previous section, sigma profiles encode a great deal of physicochemical-relevant information, allowing machine learning models to learn property–molecule relationships using small and scarce datasets (the amount of training data used above is between 294 and 1,289 molecules, as listed in SI Appendix, Table S1). Thus, such small datasets permit the usage of GPs rather than more complex and costly neural network models which, in turn, allows for the exploration of the space with GP-dependent optimization schemes, as will be demonstrated below.
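To make this scaling concrete, the core of exact GP regression can be written in a few lines of NumPy: the Cholesky factorization of the N × N training covariance matrix is the O(N³) step. The RBF-plus-white-noise kernel mirrors the setup used in this work, but the hyperparameter values and toy data below are purely illustrative:

```python
import numpy as np

def rbf(A, B, var=1.0, ell=1.0):
    # Radial basis function (squared-exponential) kernel matrix
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, Xs, noise=1e-4, var=1.0, ell=1.0):
    """Posterior mean/variance of an exact GP with RBF + white-noise kernel.
    The Cholesky factorization of the N x N covariance matrix is the O(N^3)
    step that limits exact GPs to small training sets."""
    K = rbf(X, X, var, ell) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                            # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y via solves
    Ks = rbf(X, Xs, var, ell)
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var_s = var - (v ** 2).sum(axis=0) + noise
    return mean, var_s

# Toy check: with tiny noise, the GP interpolates its training data
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
mean, var_s = gp_predict(X, y, X, noise=1e-6, ell=0.2)
```

Because the covariance matrix is only factorized once per fit, training on a few hundred sigma profiles takes seconds, consistent with the timings reported above.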
Related to the previous paragraph, the training and testing sets for each property were taken without modification from our previous work employing CNNs. However, it is worth noting that when GPs are used in lieu of CNNs, training set sizes can be decreased without any significant losses in model performance or generalizability. This is illustrated in SI Appendix, Fig. S5, where the performance of GPs for each property set is depicted as a function of training set size. With the exception of density and refractive index, where underfitting and overfitting appear to be occurring, GP performance plateaus at roughly half the training set sizes employed in this section.
Digital Molecular Space
The previous section demonstrated the ability of GPs to interpolate the sigma profile space, with performances rivaling or even surpassing those of the much more complex and costly CNN models. Now, sigma profiles are examined and interpreted from the perspective of a digital molecular space.
SI Appendix, Fig. S1 provides sigma profile examples for water, methane, and propane. These illustrate the type of intuitive information that is readily available from sigma profiles, namely the various degrees of polarity of a molecule and their surface area extent. For instance, both methane and propane display two sigma profile peaks around a σ value of zero, albeit of different sizes, illustrating their apolar nature and surface area differences. Water, on the other hand, displays peaks for large positive and negative σ values, indicating a highly polar molecule.
Inspecting individual sigma profiles while inferring chemical characteristics is a straightforward task (22). However, representing and examining the full sigma profile space is not trivial. Each sigma profile used in this work contains 51 bins, thus forming a continuous digital molecular space of 51 dimensions. Two different ways of visualizing this space can be conceived. The first is to simply represent all sigma profiles simultaneously. This offers information about the distribution of each sigma profile bin across all available molecules, particularly their minimum and maximum bounds. A second form of visualization is to perform dimensionality reduction, aiming at projecting the sigma profile space into an abstract, two-dimensional space. This was performed in this work using principal component analysis (PCA) (31), through the method published by Tipping and Bishop (32). Both forms of representation are depicted in Fig. 2, with different families of compounds highlighted in SI Appendix, Fig. S6.
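The projection itself amounts to standard PCA. The sketch below uses a plain SVD-based variant (the Tipping and Bishop method used in this work is the probabilistic formulation, which reduces to something similar in the noiseless limit); the random matrix stands in for a set of 51-bin sigma profiles:

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components.
    Returns the 2D scores and the fraction of variance each axis explains."""
    Xc = X - X.mean(axis=0)                      # center each of the 51 bins
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                       # (n_samples, 2) projection
    explained = S[:2] ** 2 / (S ** 2).sum()      # explained variance ratio
    return scores, explained

# Random stand-ins for a database of 51-bin sigma profiles
rng = np.random.default_rng(0)
profiles = rng.normal(size=(100, 51))
scores, explained = pca_2d(profiles)
```

With real sigma profiles, the two leading components would correspond to the PCA1/PCA2 axes discussed below, used purely for visualization.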
Fig. 2.
Illustration of the sigma profile space (Top Left corner) as the collection of all sigma profiles available in the Mullins et al. (30) database, its conversion to an abstract, two-dimensional space (Lower Left corner) for visualization purposes through PCA, and the exploration of the sigma profile digital space through gradient search, aiming at maximizing boiling temperature. The three-dimensional surface (Top Middle) represents the GP-predicted boiling temperature for each sigma profile point, superimposed on the two-dimensional abstract PCA space. Highlighted sigma profiles in all depictions represent successive iterations (i.e., molecules) of gradient search, starting at carbon monoxide (blue) and ending at tetraphenylmethane (yellow).
Despite the breadth of chemical families included in the Mullins et al. (30) database, both forms of representation of the sigma profile space reveal the existence of gaps. Large sigma profile peaks are often found around σ values of zero, with σ values closer to the extremities being only scarcely populated. In other words, the apolar region of sigma profiles is much more populated than those regions corresponding to higher polarities. This is due to the tetrahedral nature of carbon chemistry, and the stability of branched carbon chains. Larger organic molecules always include a carbon backbone (which registers as sizable sigma profile peaks centered around a σ value of zero) onto which polar functional groups are added, explaining the overpopulation of the center of the sigma profile space.
While representing all sigma profiles in a single figure provides no significant information or intuition about the sigma profile space, its two-dimensional projection reveals a great deal of information. For example, as depicted in SI Appendix, Fig. S6, the n-alkane series is located in a narrow range of the second PCA dimension (PCA2), specifically between −18 and −14. However, it spans nearly the entire range of the first PCA dimension (PCA1), from −24 to 195. This suggests that PCA1 and PCA2 capture apolar and polar sigma profile segments, respectively, which is further supported by the location of the n-alcohol series that is parallel to the n-alkane series but at a higher PCA2 (between −7 and 1). Other interesting landscape characteristics include the narrow clustering of terpenes, the clustering of polyols at large PCA2 and low PCA1 values, the edge position of water, and the distribution of fluorinated hydrocarbons across the lower quadrant of the projection. Note, however, that the PCA-derived two-dimensional sigma profile space is used below and in Fig. 2 only as a visualization tool. All optimization procedures are still performed on the actual sigma profile digital molecular space.
Having developed well-behaved and computationally inexpensive surrogate functions (GPs) for the relationship between physicochemical properties and sigma profiles (Fig. 1), as well as having constructed an intuitive visualization approach for the sigma profile space through PCA, we now proceed to explore this space. The simplest way to do so is through local optimization tools such as gradient search. Gradient search is an iterative optimization algorithm that updates an initial domain guess (σ_0) following the steepest descent (or ascent) path, which is given by its gradient. To do so, the gradient of the surrogate function, ∇f(σ_i), is computed at any given iteration, and the domain point is updated following the rule:
σ_{i+1} = argmax_{σ ∈ S} cos(σ − σ_i, α∇f(σ_i))    [1]
where σ_i and σ_{i+1} represent the sigma profiles at iterations i and i + 1, respectively, ∇f(σ_i) is the gradient of the surrogate function f at σ_i, α is the direction of optimization (+1 or −1 depending on whether local maxima or minima are desired, respectively), S is the set of available sigma profiles to probe, and cos(a, b) represents the cosine of the angle between vectors a and b (also known as cosine similarity). The algorithm stops when the predicted target property, f(σ_i), does not improve across a given number of iterations, with this number being known as patience.
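One way to realize this update over a discrete library of candidate sigma profiles is to step to the library entry whose displacement from the current profile best aligns, by cosine similarity, with the direction of steepest ascent predicted by the surrogate, stopping after a patience number of non-improving iterations. The sketch below implements that reading in NumPy; the function names, the toy one-dimensional surrogate, and the tie-breaking behavior are assumptions of this example:

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between vectors, guarded against zero norms
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def gradient_search(f, grad_f, library, x0_idx, alpha=+1, patience=3):
    """Discrete gradient search: from the current profile, step to the library
    entry whose displacement best aligns with alpha * grad_f, stopping after
    `patience` iterations without improvement in the predicted property f."""
    i = x0_idx
    best, stall, path = f(library[i]), 0, [x0_idx]
    while stall < patience:
        g = alpha * grad_f(library[i])
        # Candidate maximizing cosine similarity between displacement and gradient
        sims = [cos(x - library[i], g) if j != i else -np.inf
                for j, x in enumerate(library)]
        j = int(np.argmax(sims))
        path.append(j)
        val = f(library[j])
        stall = 0 if val > best else stall + 1
        best = max(best, val)
        i = j
    return path, best

# Toy surrogate peaking at the last library entry
library = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
f = lambda x: -float(np.sum((x - 4.0) ** 2))
grad_f = lambda x: -2.0 * (x - 4.0)
path, best = gradient_search(f, grad_f, library, x0_idx=0)
```

Starting from the worst library entry, the toy search walks monotonically toward the maximizer before the patience criterion halts it, mirroring the carbon monoxide example discussed below.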
As an initial case study for local optimization, consider the task of optimizing molecular structures toward maximizing boiling temperature. Gradient search was performed on the sigma profile space using the GP trained for boiling temperatures. The initial guess was carbon monoxide, which is the molecule that exhibits the lowest boiling temperature of the available dataset and, thus, should provide the largest challenge to optimize. This optimization procedure is illustrated in Fig. 2, including the three-dimensional GP surface for boiling temperature imposed on the two-dimensional PCA space and the molecular structure of each gradient search iteration. A schematic of the algorithm followed, along with the results obtained for this case study are also depicted in SI Appendix, Fig. S7.
As Fig. 2 reveals, despite starting from the molecule with the lowest boiling temperature (carbon monoxide), gradient search converges to tetraphenylmethane in just five iterations. As expected given the nature of the gradient search algorithm, this compound represents a local, rather than global, maximum. However, it is still noteworthy that it possesses the fifth-largest boiling temperature of the dataset studied. Moreover, note the similarity between the sigma profiles of the intermediate iterations (namely the molecules containing hydrogen bond donors and acceptors) despite their differences in chemical structure. Likewise, despite being structurally very different, the positions of the sigma profile peaks for carbon monoxide (initial sigma profile guess) and tetraphenylmethane (optimized sigma profile) are similar, indicating two relatively apolar molecules, with the absolute values of their sigma profile peaks indicating a significant size difference between both molecules. This supports the notion of crucial chemical information being encoded in the sigma profile space that is closely related to physicochemical properties, with the possibility of molecules that are structurally very different being, nevertheless, near each other in the sigma profile space, facilitating molecular optimization schemes.
To further support the usefulness of gradient search, this algorithm was applied to the remaining properties studied in this work. In each case, the initial sigma profile guess corresponded to the global minimum of the corresponding dataset, with the gradient search task being to maximize the property. These results are reported in SI Appendix, Fig. S8 and reveal how gradient search can quickly converge to molecules representing local maxima. In fact, gradient search converged, for example, to the third-largest molar mass, second-largest density, and the largest (global maximum) refractive index value. Note, though, that in this last example (refractive index) the global minimum and maximum of the property are relatively close to each other in the sigma profile digital molecular space.
Starting from different initial domain points naturally leads gradient search to converge to different local maxima. This is exemplified in SI Appendix, Fig. S9 by performing gradient search toward maximizing boiling temperature starting from different initial guesses, namely trifluoroacetic acid or squalane. The former converges to didecyl phthalate, which is the global maximum of the function, with a boiling temperature of 463 °C, while the latter converges in a single iteration to one of its closest neighbors, heptacosane, which displays the third-highest boiling temperature of the dataset.
BO
While gradient search allows for efficient molecular optimization toward maximizing or minimizing physicochemical properties of interest, it requires an existing, previously trained GP model. Such a surrogate model may not be available, particularly when the properties of interest have not been measured before or datasets are not readily available. BO resolves this major drawback of gradient search by performing both GP training and sigma profile optimization simultaneously. Thus, BO can perform molecular optimization without any previously available data.
Starting from an initial known sigma profile/property data point, a GP model is trained on that guess and an acquisition function (AF) is computed. Then, the sigma profile that maximizes this AF is chosen as the next digital space point to probe (i.e., to obtain its property value). The algorithm then retrains the GP model on the newly available data point and continues until the maximum of the AF is below a given threshold. In this work, the AF of choice was expected improvement (EI) (33), defined as
EI(σ) = (ŷ(σ) − y⁺) Φ(z) + ŝ(σ) φ(z),   z = (ŷ(σ) − y⁺)/ŝ(σ),   σ ∈ S    [2]
where S is the set of available sigma profiles to probe, y⁺ is the maximum property value observed in the previous algorithm iterations, ŷ(σ) and ŝ(σ) are the GP-predicted mean and SD (SI Appendix, Eq. S9), and Φ and φ represent the cumulative distribution and probability density functions of the standard normal distribution.
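For reference, the EI expression can be evaluated in a few lines with SciPy. The helper below is a generic EI for maximization; the guard against a vanishing predicted SD is an assumption of this sketch, not part of the original algorithm:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, y_best):
    """Expected improvement for a maximization task, given the GP-predicted
    mean and standard deviation at each candidate and the best value observed."""
    std = np.maximum(std, 1e-12)          # guard against zero predicted SD
    z = (mean - y_best) / std
    return (mean - y_best) * norm.cdf(z) + std * norm.pdf(z)

# Two candidates: one at the current best, one below it (unit predicted SD)
ei = expected_improvement(np.array([1.0, 0.0]), np.array([1.0, 1.0]), 1.0)
```

Note that EI is always nonnegative and rewards both a high predicted mean (exploitation) and a high predicted SD (exploration), which is what drives the search behavior described next.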
To test its performance on the sigma profile digital space, BO was performed with the objective of maximizing boiling temperature. The algorithm was employed as described above, schematized in SI Appendix, Fig. S10, and portrayed in Fig. 3. Akin to the previous section, the starting point of the algorithm was carbon monoxide, which is the molecule with the lowest boiling temperature available. Furthermore, to improve the convergence performance of the GP hyperparameters, initial guesses were provided as the average of those listed in SI Appendix, Table S4.
Fig. 3.

Illustration of the BO algorithm used in this work for molecular optimization toward boiling temperature maximization, including the BO perspective (Left) where the space is seen in terms of EI (related to the probability of finding a molecule with a larger boiling temperature than that of the current iteration) and the goal is to find global extrema, and the GP perspective (Right) where the space is seen in terms of predicted boiling temperature values and the goal is to minimize the uncertainty of the model in its full domain.
The results shown in Fig. 3 and detailed in SI Appendix, Fig. S10 are astounding. Holding no knowledge of chemistry except for the sigma profile and boiling temperature of carbon monoxide, BO finds the global maximum of the available boiling temperature dataset in just 15 iterations (i.e., 15 boiling temperature measurements, one for each newly requested molecule), spending twelve more iterations exploring the sigma profile digital space and increasing its confidence (measured by Eq. 2 in the form of EI) about the position of the global maximum. This type of performance is remarkable, particularly considering that i) BO was provided with the worst possible initial guess (carbon monoxide), ii) the boiling temperature dataset includes more than a thousand molecules encompassing over 40 families of organic and inorganic compounds, and iii) BO starts from scratch, without any pretrained surrogate model, unlike the gradient search algorithm presented in the previous section.
As illustrated in Fig. 3, BO quickly finds a global maximum because, unlike the GP regression results discussed in the previous sections, it probes new data points based on an EI metric (Eq. 2). In other words, while GP regression aims to learn the full relationship between the molecular space and physicochemical properties, BO aims only to find the requested extrema while optimizing molecules through their sigma profiles, with its underlying GP surrogate model being trained solely on the vicinity of expected positive results. Of course, the great performance and efficiency of BO are made possible due to the underlying sigma profile space that encodes a great deal of chemical information directly connected to the physicochemical properties of interest.
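The loop described above can be condensed into a short sketch: fit a GP surrogate to every point probed so far, score all candidates in a discrete library by EI, and probe the highest-scoring one next. This is a stripped-down illustration, not the implementation used in this work: the kernel hyperparameters are held fixed (whereas here they are re-optimized at every iteration), and the candidate library and objective are toy stand-ins:

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, var=1.0, ell=0.2):
    # RBF kernel matrix between row vectors of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell**2)

def bayes_opt(objective, candidates, start, n_iter, noise=1e-4):
    """Discrete BO: GP surrogate + expected improvement over a candidate set."""
    probed, values = [start], [objective(candidates[start])]
    for _ in range(n_iter):
        X, y = candidates[probed], np.array(values)
        K = rbf(X, X) + noise * np.eye(len(X))      # white-noise jitter
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        Ks = rbf(X, candidates)
        mean = Ks.T @ alpha                          # GP posterior mean
        v = np.linalg.solve(L, Ks)
        std = np.sqrt(np.maximum(1.0 - (v ** 2).sum(0) + noise, 1e-12))
        z = (mean - y.max()) / std
        ei = (mean - y.max()) * norm.cdf(z) + std * norm.pdf(z)
        ei[probed] = -np.inf                         # never re-probe a point
        nxt = int(np.argmax(ei))
        probed.append(nxt)
        values.append(objective(candidates[nxt]))
    return probed, values

# Toy run: 12 candidates, starting from a poor guess (objective = 0)
candidates = np.linspace(0.0, 1.0, 12)[:, None]
objective = lambda x: float(np.sin(3.0 * x[0]))
probed, values = bayes_opt(objective, candidates, start=0, n_iter=11)
```

On a real sigma profile library, `objective` would be an experimental measurement, so every iteration of this loop corresponds to one new measurement, exactly the accounting used in the 15-iteration result above.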
To further demonstrate the versatility of the sigma profile digital molecular space and the ease of its navigation and optimization, even when little to no experimental data is available, BO was employed for the remaining five properties under study in this work. As was the case in the previous section, the initial guess for each property corresponded to its global minimum, thus the worst possible guess for the BO algorithm. These results are reported in SI Appendix, Fig. S11. BO was able to find the global maximum for molar mass (11 iterations) and vapor pressure (28 iterations), while finding the third-largest refractive index (27 iterations) and second-largest aqueous solubility (22 iterations). Of course, these results can be further refined by redefining the stopping criterion of the algorithm, at the expense of more iterations being required for convergence.
Both in the result for boiling temperature (Fig. 3) and those of the remaining properties (SI Appendix, Fig. S11), BO takes an extreme approach to space exploration, appearing to attempt to maximize the distance between sigma profile points probed. This type of behavior is expected from a universal interpolator but may also hint at the existence of certain sigma profiles that provide the most chemical information to a model starved of data. Another important point to consider is the fact that BO is usually only applicable to low-dimensional spaces (up to 20 dimensions) (34). Interestingly, this limitation is not experienced in the results reported herein, which may be connected to the fact that the dimensions of the sigma profile space are highly correlated, being the result of binning screened charges. Finally, note again the advantage of GPs performing well without feature normalization, which allows the sigma profile space to be explored without its characteristics being altered throughout the iterative process by normalization based on statistically insignificant amounts of data. All the above shows that BO on the sigma profile digital space may be the ideal tool to perform molecular optimization and discovery on scarce datasets.
Computational Methods
The datasets and computational methods used in this work are briefly summarized here, while a detailed description is provided in SI Appendix. All datasets and Python code necessary to reproduce the results reported throughout this work are available publicly and free of charge in the following GitHub repository: https://github.com/MaginnGroup/Sigmaussian.
The sigma profiles used in this work were taken from the database developed by Mullins et al. (30), which is the same database used in our previous CNN work (23). This database contains the sigma profiles of 1,432 molecules (one sigma profile per molecule), encompassing over 40 different families of organic compounds, as well as some inorganic molecules, namely water, gases, acids, and bases. Some of the organic families represented in the database include saturated and unsaturated hydrocarbons, aromatic compounds, alcohols, amines, carboxylic acids, ethers, esters, ketones, aldehydes, and halogenated compounds. Training and testing datasets pertaining to six different physicochemical properties, namely molar mass, normal boiling temperature, vapor pressure at 25 °C, density at 20 °C, refractive index at 20 °C, and aqueous solubility at 25 °C, were taken without modification from our previous CNN work (23).
All GP-related calculations reported in this work were performed using the Python packages GPflow (V. 2.5.2) (35) and TensorFlow (V. 2.10.0) (36, 37). GPs were always fitted to properties (y) and sigma profiles (x), such that y = GP(x). Each model possesses four or five hyperparameters (two or three from the main kernel, the variance of the white noise kernel, and the Gaussian likelihood variance). These were optimized by maximizing the log marginal likelihood of each GP using the L-BFGS-B algorithm (38).
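To make the fitting objective concrete, the following sketch evaluates the log marginal likelihood of a zero-mean GP with an RBF kernel and selects the hyperparameters that maximize it. It is only an illustration of the objective described above: a coarse grid search stands in for the L-BFGS-B optimizer, the kernel is an assumption, and the function names are not from the actual GPflow-based code.

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    """log p(y | X, theta) for a zero-mean GP with an RBF kernel."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = signal_var * np.exp(-0.5 * d2 / lengthscale**2)
    K += noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return (-0.5 * y @ alpha                 # data-fit term
            - np.log(np.diag(L)).sum()       # -0.5 log|K|
            - 0.5 * len(X) * np.log(2 * np.pi))

def fit_hyperparameters(X, y, grid):
    """Pick the hyperparameter tuple (lengthscale, signal_var, noise_var)
    with the highest log marginal likelihood (grid search stands in for
    the gradient-based L-BFGS-B optimization used in practice)."""
    return max(grid, key=lambda th: log_marginal_likelihood(X, y, *th))
```

The log marginal likelihood automatically balances data fit (the quadratic term) against model complexity (the log-determinant term), which is why it is the standard objective for GP hyperparameter selection.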
Supplementary Material
Appendix 01 (PDF)
Progression of the optimization path during Bayesian optimization on the sigma profile digital molecular space for boiling temperature maximization, projected on the PCA visualization space. Note the discovery of large molecules by the algorithm after the sixth iteration, which become the norm in subsequent iterations.
Progression of the optimization path during Bayesian optimization on the sigma profile digital molecular space for vapor pressure maximization, projected on the PCA visualization space. Note the discovery of halogenated molecules by the algorithm after the third iteration, which then become prevalent in subsequent iterations.
Acknowledgments
This work was supported by the US Department of Energy via subcontract 630340 from Los Alamos National Laboratory, Materials and Chemical Sciences Division, and Breakthrough Electrolytes for Energy Storage Systems, an Energy Frontier Research Center funded by the US Department of Energy, Office of Science, Basic Energy Sciences, under award DE-SC0019409. The authors acknowledge the Center for Research Computing at the University of Notre Dame for providing computational resources. D.O.A. also thanks the support of the Patrick and Jana Eilers Graduate Student Fellowship for Energy Related Research.
Author contributions
D.O.A., E.J.M., and Y.J.C. designed research; D.O.A. performed research; and D.O.A. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
This article is a PNAS Direct Submission.
Data, Materials, and Software Availability
Datasets and Python code have been deposited in GitHub (https://github.com/MaginnGroup/Sigmaussian) (39).
References
- 1. Seh Z. W., et al., Combining theory and experiment in electrocatalysis: Insights into materials design. Science 355, eaad4998 (2017).
- 2. Liu J., et al., Advanced energy storage devices: Basic principles, analytical methods, and rational materials design. Adv. Sci. 5, 1700322 (2018).
- 3. Saparov B., Mitzi D. B., Organic-inorganic perovskites: Structural versatility for functional materials design. Chem. Rev. 116, 4558–4596 (2016).
- 4. Butler K. T., Davies D. W., Cartwright H., Isayev O., Walsh A., Machine learning for molecular and materials science. Nature 559, 547–555 (2018).
- 5. Heikamp K., Bajorath J., Support vector machines for drug discovery. Expert Opin. Drug Discov. 9, 93–104 (2014).
- 6. Goh G. B., Hodas N. O., Vishnu A., Deep learning for computational chemistry. J. Comput. Chem. 38, 1291–1307 (2017).
- 7. Jiang D., et al., Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J. Cheminform. 13, 12 (2021).
- 8. von Lilienfeld O. A., Müller K.-R., Tkatchenko A., Exploring chemical compound space with quantum-based machine learning. Nat. Rev. Chem. 4, 347–358 (2020).
- 9. Restrepo G., Chemical space: Limits, evolution and modelling of an object bigger than our universal library. Digital Discov. 1, 568–585 (2022).
- 10. Venkatasubramanian V., Chan K., Caruthers J. M., Computer-aided molecular design using genetic algorithms. Comput. Chem. Eng. 18, 833–844 (1994).
- 11. Brown N., McKay B., Gilardoni F., Gasteiger J., A graph-based genetic algorithm and its application to the multiobjective evolution of median molecules. J. Chem. Inf. Comput. Sci. 44, 1079–1087 (2004).
- 12. Deringer V. L., et al., Gaussian process regression for materials and molecules. Chem. Rev. 121, 10073–10141 (2021).
- 13. Rasmussen C. E., Williams C. K. I., Gaussian Processes for Machine Learning (The MIT Press, ed. 2, 2006).
- 14. Rasmussen C. E., "Gaussian processes in machine learning" in Advanced Lectures on Machine Learning (ML 2003), Bousquet O., von Luxburg U., Rätsch G., Eds. (Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2004), vol. 3176, pp. 63–71.
- 15. Mohr B., et al., Data-driven discovery of cardiolipin-selective small molecules by computational active learning. Chem. Sci. 13, 4498–4511 (2022).
- 16. Blaschke T., Olivecrona M., Engkvist O., Bajorath J., Chen H., Application of generative autoencoder in de novo molecular design. Mol. Inform. 37, 1700123 (2018).
- 17. Walters W. P., Barzilay R., Applications of deep learning in molecule generation and molecular property prediction. Acc. Chem. Res. 54, 263–270 (2021).
- 18. Gómez-Bombarelli R., et al., Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
- 19. Kadurin A., Nikolenko S., Khrabrov K., Aliper A., Zhavoronkov A., druGAN: An advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Mol. Pharm. 14, 3098–3104 (2017).
- 20. Liu X., et al., Machine learning-based design of ionic liquids at the atomic scale for highly efficient CO2 capture. ACS Sustainable Chem. Eng. 11, 8978–8987 (2023).
- 21. Klamt A., The COSMO and COSMO-RS solvation models. WIREs Comput. Mol. Sci. 1, 699–709 (2011).
- 22. Klamt A., COSMO-RS: From Quantum Chemistry to Fluid Phase Thermodynamics and Drug Design (Elsevier, ed. 1, 2005).
- 23. Abranches D. O., Zhang Y., Maginn E. J., Colón Y. J., Sigma profiles in deep learning: Towards a universal molecular descriptor. Chem. Commun. 58, 5630–5633 (2022).
- 24. Boublia A., et al., Multitask neural network for mapping the glass transition and melting temperature space of homo- and co-polyhydroxyalkanoates using σ profiles molecular inputs. ACS Sustainable Chem. Eng. 11, 208–227 (2023).
- 25. Awaja N. E., et al., Molecular-based artificial neural networks for selecting deep eutectic solvents for the removal of contaminants from aqueous media. Chem. Eng. J. 476, 146429 (2023).
- 26. Fan D., et al., Application of interpretable machine learning models to improve the prediction performance of ionic liquids toxicity. Sci. Total Environ. 908, 168168 (2024).
- 27. Sanchez-Lengeling B., et al., A Bayesian approach to predict solubility parameters. Adv. Theory Simul. 2, 1800069 (2019).
- 28. Salahshoori I., Baghban A., Yazdanbakhsh A., Novel hybrid QSPR-GPR approach for modeling of carbon dioxide capture using deep eutectic solvents. RSC Adv. 13, 30071–30085 (2023). (Retracted.)
- 29. Abranches D. O., Maginn E. J., Colón Y. J., Boosting graph neural networks with molecular mechanics: A case study of sigma profile prediction. J. Chem. Theory Comput. 19, 9318–9328 (2023).
- 30. Mullins E., et al., Sigma-profile database for using COSMO-based thermodynamic methods. Ind. Eng. Chem. Res. 45, 4389–4415 (2006).
- 31. Bro R., Smilde A. K., Principal component analysis. Anal. Methods 6, 2812–2831 (2014).
- 32. Tipping M. E., Bishop C. M., Mixtures of probabilistic principal component analyzers. Neural Comput. 11, 443–482 (1999).
- 33. Frazier P. I., Wang J., "Bayesian optimization for materials design" in Information Science for Materials Discovery and Design, Lookman T., Alexander F., Rajan K., Eds. (Springer Series in Materials Science, Springer, Cham, 2016), vol. 225, pp. 45–75.
- 34. Frazier P. I., A tutorial on Bayesian optimization. arXiv [Preprint] (2018). 10.48550/arXiv.1807.02811 (Accessed 6 March 2024).
- 35. Matthews A. G. d. G., et al., GPflow: A Gaussian process library using TensorFlow. J. Mach. Learn. Res. 18, 1–6 (2017).
- 36. TensorFlow, 10.5281/zenodo.7604243 (2023). Accessed 6 March 2024.
- 37. Abadi M., et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv [Preprint] (2016). 10.48550/arXiv.1603.04467 (Accessed 6 March 2024).
- 38. Byrd R. H., Lu P., Nocedal J., Zhu C., A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16, 1190–1208 (1995).
- 39. Abranches D. O., Maginn E. J., Colón Y. J., Sigmaussian. GitHub. https://github.com/MaginnGroup/Sigmaussian. Deposited 6 March 2024.