Proceedings of the National Academy of Sciences of the United States of America. 2019 Feb 7;116(9):3401–3406. doi: 10.1073/pnas.1816132116

Accurate molecular polarizabilities with coupled cluster theory and machine learning

David M Wilkins a, Andrea Grisafi a, Yang Yang b, Ka Un Lao b, Robert A DiStasio Jr b,1, Michele Ceriotti a,1
PMCID: PMC6397574  PMID: 30733292

Significance

The dipole polarizability of molecules and materials is central to several physical phenomena, modeling techniques, and the interpretation of many experiments. Its accurate evaluation from first principles requires quantum chemistry methods that are often too demanding for routine use. The highly accurate calculations reported herein provide a much-needed benchmark of the accuracy of hybrid density functional theory (DFT) as well as training data for a machine-learning model that can predict the polarizability tensor with an error that is about 50% smaller than DFT. This framework provides an accurate, inexpensive, and transferable strategy for estimating the polarizabilities of molecules containing dozens of atoms, and therefore removes a considerable obstacle to accurate and reliable atomistic-based modeling of matter.

Keywords: dipole polarizability, machine learning, coupled cluster theory, density functional theory, Gaussian process regression

Abstract

The molecular dipole polarizability describes the tendency of a molecule to change its dipole moment in response to an applied electric field. This quantity governs key intra- and intermolecular interactions, such as induction and dispersion; plays a vital role in determining the spectroscopic signatures of molecules; and is an essential ingredient in polarizable force fields. Compared with other ground-state properties, an accurate prediction of the molecular polarizability is considerably more difficult, as this response quantity is quite sensitive to the underlying electronic structure description. In this work, we present highly accurate quantum mechanical calculations of the static dipole polarizability tensors of 7,211 small organic molecules computed using linear response coupled cluster singles and doubles theory (LR-CCSD). Using a symmetry-adapted machine-learning approach, we demonstrate that it is possible to predict the LR-CCSD molecular polarizabilities of these small molecules with an error that is an order of magnitude smaller than that of hybrid density functional theory (DFT) at a negligible computational cost. The resultant model is robust and transferable, yielding molecular polarizabilities for a diverse set of 52 larger molecules (including challenging conjugated systems, carbohydrates, small drugs, amino acids, nucleobases, and hydrocarbon isomers) at an accuracy that exceeds that of hybrid DFT. The atom-centered decomposition implicit in our machine-learning approach offers some insight into the shortcomings of DFT in the prediction of this fundamental quantity of interest.


The last decade has seen great progress in the first principles evaluation of the structures, stabilities, and properties of molecules and materials. Kohn–Sham density functional theory (DFT) has played a pivotal role in this endeavor by providing ground-state properties with an accuracy that is sufficient for many useful applications at a manageable computational cost (1–3). However, DFT is not equally accurate for every property of interest. For instance, an accurate and reliable description of the molecular dipole polarizability α, a tensor that describes how the molecular dipole changes in the presence of an applied electric field E, can be quite difficult to obtain (4). This is primarily because α is a response property that is particularly sensitive to the quantum mechanical description of the underlying electronic structure. As such, nontrivial electron correlation effects and basis set incompleteness error must be simultaneously accounted for when determining α. For these reasons and in light of the fact that α is a fundamental quantity of interest that underlies induction and dispersion interactions (5–7) and Raman and sum frequency generation spectroscopy (8–11), and represents a key ingredient in the development of next generation polarizable force fields (12–16), it is important to provide benchmark values for α beyond the accuracy of DFT. In this regard, linear response coupled cluster singles and doubles theory (LR-CCSD) (17–19) has been shown to provide considerably more accurate and reliable predictions for α when used in conjunction with a sufficiently large (diffuse) one-particle basis set (20–23). However, such a prediction is accompanied by a substantially larger computational cost (scaling with the sixth power of the system size), which can become quite prohibitive even when treating molecules with as few as 10–15 atoms.

In the last few years, machine learning (ML) has gained traction as an alternative approach to the prediction of molecular properties, substituting or complementing electronic structure methods (24–26). In particular, it has been shown that accuracy on par with (or even better than) DFT can be achieved in the prediction of many molecular properties (27, 28) and that DFT (29) or coupled cluster (30) accuracy can be reached more easily when using a less accurate but more computationally efficient electronic structure method as a stepping stone. The polarizability, however, poses an additional challenge to ML. Due to its tensorial nature, the predicted α must transform according to the symmetries of the SO(3) rotation group. For rigid molecules, this is easily achieved by learning the components of the tensor written in the reference frame of the molecule (31, 32). However, to obtain a transferable model that would also be suitable for flexible molecules, as well as different compounds, this line of thought would require a cumbersome and inelegant fragment decomposition. To avoid these complications, a symmetry-adapted Gaussian process regression (SA-GPR) scheme has recently been derived to naturally incorporate this SO(3) covariance into an ML scheme that is suitable to predict tensorial quantities of arbitrary order (33). In this paper, we present comprehensive coupled cluster-level benchmarks for the polarizabilities of the roughly 7,000 small organic molecules contained in the QM7b database (34). We use these reference calculations to assess the accuracy of different hybrid DFT schemes and train an SA-GPR–based ML scheme (AlphaML) that can accurately predict the polarizability tensor with nominal computational cost.
We then test the extrapolative prediction capabilities of AlphaML on a showcase dataset composed of 52 larger molecules and demonstrate that this approach provides a viable alternative to state-of-the-art and computationally prohibitive electronic structure methods for predicting molecular polarizabilities.

Results

Electronic Structure Calculations.

The QM7b database (26, 34, 35) includes N = 7,211 molecules containing up to seven “heavy” atoms (i.e., C, N, O, S, Cl) with varying levels of H saturation. This dataset is based on a systematic enumeration of small organic compounds (35) and contains a rich diversity of chemical groups, making it a challenging test of the accuracy associated with DFT and quantum chemical methodologies. DFT-based molecular polarizabilities were obtained by (numerical) differentiation of the molecular dipole moment, μ, with respect to an external electric field E using the hybrid B3LYP (36, 37) and SCAN0 (38) functionals. Reference molecular polarizabilities were obtained using LR-CCSD. To account for basis set incompleteness error, which can be even more important than higher-order electron correlation effects in an accurate and reliable determination of α (21–23, 39), we used the d-aug-cc-pVDZ basis set (39) for all calculations herein. Although this double-ζ basis set has only a moderate number of polarization functions, augmentation with an additional set of diffuse functions almost always improves the convergence of α with respect to aug-cc-pVDZ (21, 22, 39–41). The alternative choice of retaining a single set of diffuse functions and simply increasing the angular momentum by using the slightly larger aug-cc-pVTZ basis set yields α values of comparable quality to d-aug-cc-pVDZ (SI Appendix) (21, 22, 39–41), albeit with a significant increase in the computational effort required to treat the entire QM7b dataset. A more detailed description of the electronic structure calculations performed in this work is in Materials and Methods.

To enable comparisons between molecules of different sizes, all error estimates (explicit expressions for which are given in Materials and Methods) are computed based on molecular polarizabilities divided by the number of atoms, $n_i$, contained within a given molecule. On the QM7b database, the popular B3LYP hybrid DFT functional predicts α with a mean signed error (MSE) of 0.259 a.u., a mean absolute error (MAE) of 0.302 a.u., and a root mean square error (RMSE) of 0.404 a.u. with respect to the reference LR-CCSD values. These errors, which include both scalar and anisotropic contributions, are quite substantial and correspond to 18.3% of the intrinsic variability within the QM7b database, defined as $\sigma_{\mathrm{CCSD}} = \left[\frac{1}{N}\sum_i \|\alpha_i^{(\mathrm{CCSD})} - \langle\alpha^{(\mathrm{CCSD})}\rangle\|_F^2 / n_i^2\right]^{1/2}$. The large MSE value obtained with B3LYP indicates a systematic overestimation of α by this functional (4, 42); results from the SCAN0 hybrid functional show a substantially reduced MSE of 0.059 a.u. Despite the smaller systematic overestimation of α in comparison with B3LYP, the statistical errors obtained with SCAN0 are still quite large, with computed MAE (RMSE) values of 0.217 (0.316) a.u. From the ML point of view, the AlphaML model presented herein performs almost equally well for B3LYP and SCAN0. For this reason, we focus our discussion on the B3LYP and LR-CCSD results, which will be referred to as DFT and coupled cluster singles and doubles theory (CCSD), respectively, throughout the remainder of the manuscript.

Improved SA-GPR.

The formalism underlying the SA-GPR scheme in general and the λ-SOAP (smooth overlap of atomic positions) descriptors on which our model is based have been introduced elsewhere (33) and are summarized in Materials and Methods. In this work, we include several substantial improvements that increase the accuracy and speed of the SA-GPR model, and these are worth a separate discussion. For one, evaluation of the λ-SOAP representation is greatly accelerated by choosing the most significant few hundred spherical harmonic components (of several tens of thousands) using farthest point sampling (FPS) (43). The calculation of the kernel in Eq. 1 can be carried out with essentially the same result as if all components were retained but with a much lower computational cost. A second improvement is the generalization of the λ-SOAP kernels beyond the linear kernels used in ref. 33. It has been shown that, in many cases, taking an integer power of the scalar SOAP kernel improves the performance of the associated ML model. This can be understood in terms of the order (two body, three body, …) of the interatomic correlations that are described by different kernels (44, 45). In the tensorial case, one should be careful, as the linear nature of the kernel is essential to ensure the correct covariant behavior. To include nonlinearity and increase the order of the model without affecting the symmetry properties, we multiplied the λ>0 kernels by the scalar λ=0 kernel raised to the power of ζ−1 as in Eq. 2. Finally, we combined multiple kernels computed with different environment radii, $r_c$, which have been shown to be beneficial in the scalar case (30). Together, these improvements halve the error on QM7b as discussed in detail in SI Appendix.
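The reason that multiplying a covariant kernel by a power of the invariant scalar kernel leaves the symmetry properties intact can be illustrated with a toy example. The sketch below is not the λ-SOAP implementation: it assumes a hypothetical λ=1 (Cartesian vector) feature and hypothetical invariant features built from neighbor positions, and all function names are illustrative only.

```python
import numpy as np

def covariant_vector(env):
    # Toy lambda=1 feature (an assumption, not lambda-SOAP): a radially
    # weighted sum of neighbor vectors, which rotates with the environment.
    r = np.linalg.norm(env, axis=1, keepdims=True)
    return (env * np.exp(-r)).sum(axis=0)

def invariant_features(env):
    # Toy rotationally invariant features built from neighbor distances only.
    r = np.linalg.norm(env, axis=1)
    return np.array([np.exp(-r).sum(), np.exp(-2.0 * r).sum()])

def scalar_kernel(env_a, env_b):
    # Normalized lambda=0 kernel; unchanged by any rotation of either argument.
    fa, fb = invariant_features(env_a), invariant_features(env_b)
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb)))

def linear_tensor_kernel(env_a, env_b):
    # 3x3 kernel block that is linear in the covariant features.
    return np.outer(covariant_vector(env_a), covariant_vector(env_b))

def nonlinear_tensor_kernel(env_a, env_b, zeta=2):
    # Covariant block times the invariant scalar kernel to the power zeta - 1.
    return linear_tensor_kernel(env_a, env_b) * scalar_kernel(env_a, env_b) ** (zeta - 1)

def rotation_z(angle):
    # Rotation matrix about the z axis.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```

In this toy setting, rotating the first environment by R multiplies the kernel block by R from the left for any ζ, because the extra factor is a rotationally invariant scalar; ζ=1 recovers the linear kernel.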

Learning on the QM7b Database.

These highly accurate reference CCSD calculations and the SA-GPR scheme lay the foundation for a transferable model to predict molecular polarizabilities. In this first incarnation of the AlphaML model, we use the reference DFT and CCSD calculations on the QM7b set for training (34). As a first verification of its performance, we computed learning curves for the DFT and CCSD polarizabilities of the QM7b dataset. We used up to 5,400 structures for training with subsequent assessment of the accuracy and reliability of the AlphaML model in the prediction of α for the 1,811 structures that were not included in the training set. The structures were added to the training set according to their FPS order (43) (i.e., starting from the most diverse configurations). This procedure is representative of an efficient learning strategy that aims to obtain uniform accuracy with the minimum number of reference calculations (30). Using the best kernel hyperparameters (as described in SI Appendix), we trained a model to learn the CCSD polarizabilities. We report ML errors in terms of the percentage of the intrinsic variability of the CCSD dataset (σCCSD=2.216 a.u. per atom) so as to provide a direct measure of the learning performance. As illustrated by the learning curves in Fig. 1, using up to 75% of the QM7b database for training yields a 2.5% RMSE with respect to σCCSD in predicting CCSD polarizabilities.
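The FPS ordering used to build the training set can be sketched as a greedy selection that repeatedly picks the point farthest from everything already chosen. The version below is a minimal illustration that assumes each structure is represented by a hypothetical feature vector and that diversity is measured with the Euclidean distance; the metric and starting point used for the actual calculations are not specified here.

```python
import numpy as np

def farthest_point_sampling(features, n_select, start=0):
    """Return indices in greedy farthest-point order (Euclidean distance assumed)."""
    features = np.asarray(features, dtype=float)
    n_points = len(features)
    n_select = min(n_select, n_points)
    if n_select <= 0:
        return []
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    for _ in range(1, n_select):
        nxt = int(np.argmax(min_dist))  # the point farthest from the current selection
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

A learning curve is then obtained by training on the first n entries of this ordering for increasing n, so that the most diverse structures enter the training set first.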

Fig. 1.

Learning curves for the per atom polarizabilities of the molecules in the QM7b database calculated using either CCSD or DFT as well as for the difference (Δ) between the two. The testing set consists of 1,811 molecules, and the right-hand axis shows the RMSE as a fraction of the intrinsic variability of the CCSD polarizability, σCCSD.

To get a clearer idea of the accuracy associated with these ML-based predictions, one can compare these values against hybrid DFT. Using the same metric, the intrinsic error of DFT is 18% of σCCSD in the prediction of CCSD polarizabilities. This demonstrates that an ML model based on SA-GPR can yield polarizabilities with an accuracy that is approximately one order of magnitude greater than DFT. At the same time, the corresponding DFT polarizabilities can be learned with an error of 3.2% of σCCSD. As seen in other cases (29, 30), highly accurate quantum chemistry calculations are smoother and slightly easier to learn than more approximate methods, like DFT.

The AlphaML model can also be trained to evaluate the correction between different levels of theory, an approach commonly referred to as Δ learning that is often found to result in a much smaller error than learning the raw quantity itself (29, 30). For instance, the use of DFT as a baseline to learn CCSD polarizabilities reduces the error by an additional factor of 2 relative to the direct learning of αCCSD (Fig. 1). Δ learning therefore provides a way to further reduce the prediction error at the cost of performing a baseline DFT calculation. In SI Appendix, we demonstrate that the performance of AlphaML is rather insensitive to the details of the target electronic structure method, showing similar accuracy for SCAN0 as that observed for B3LYP.
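The structure of a Δ-learning prediction can be sketched with ordinary scalar kernel ridge regression: the model is fitted to the difference between the expensive and the baseline values, and the baseline is added back at prediction time. The example below is a deliberately simplified stand-in, assuming a one-dimensional input, a Gaussian kernel, and a hypothetical `baseline` callable; it does not reproduce the tensorial SA-GPR model.

```python
import numpy as np

def gaussian_kernel(xa, xb, length=0.5):
    # Gaussian kernel matrix between two sets of 1D inputs (toy assumption).
    d = np.subtract.outer(np.asarray(xa, dtype=float), np.asarray(xb, dtype=float))
    return np.exp(-0.5 * (d / length) ** 2)

def fit_krr(x_train, y_train, reg=1e-8, length=0.5):
    # Kernel ridge regression weights: solve (K + reg * I) w = y.
    K = gaussian_kernel(x_train, x_train, length)
    return np.linalg.solve(K + reg * np.eye(len(K)), np.asarray(y_train, dtype=float))

def predict_krr(x_test, x_train, weights, length=0.5):
    return gaussian_kernel(x_test, x_train, length) @ weights

def delta_predict(x_test, x_train, y_high_train, baseline, reg=1e-8, length=0.5):
    # Learn only the correction (high level minus baseline), then add the
    # baseline back, which must be evaluated for every test point.
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    delta = np.asarray(y_high_train, dtype=float) - baseline(x_train)
    weights = fit_krr(x_train, delta, reg, length)
    return baseline(x_test) + predict_krr(x_test, x_train, weights, length)
```

The design choice is visible in `delta_predict`: the regression target is the correction rather than the full quantity, so accuracy improves when the correction is smoother or smaller than the quantity itself, at the price of one baseline calculation per prediction.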

Extrapolation to Larger Molecules.

Our definition of the kernel between two molecules as an average of environmental kernels means that the polarizabilities predicted by AlphaML are given as a sum of predicted polarizabilities for each environment (30). This feature allows one to predict α for larger molecules. To test the behavior of AlphaML in this extrapolative regime, we trained this model on the entire QM7b database and then predicted the polarizabilities in a showcase dataset of 52 large molecules, which includes amino acids, nucleobases, drug molecules, carbohydrates, and 23 isomers of C8Hn (the molecule key is in SI Appendix). As discussed in SI Appendix, many of these molecules are at the periphery of the portion of chemical compound space spanned by the QM7b dataset and therefore constitute a challenging test for AlphaML.

In Table 1, we show the RMSEs in predicting α for the showcase molecules using AlphaML as well as the error made when using DFT to approximate CCSD. Table 1 also breaks down the error into the λ=0 and λ=2 components of α; with an error in the anisotropic response comparable with that in the trace, this demonstrates that AlphaML learns both components with similar efficiency. As seen in the previous section, we again note that using the AlphaML model to predict CCSD polarizabilities is more accurate than simply using DFT. However, the use of DFT as the baseline in the Δ-learning sense leads to an additional reduction of 20–30% in the error. In SI Appendix, we further discuss the behavior of the model when using the SCAN0 functional, which is similar to that observed here for B3LYP. While AlphaML predicts CCSD polarizabilities of the showcase molecules with better than DFT accuracy, we observe a substantial decrease in accuracy relative to QM7b, which is to be expected when the model is extrapolated to the larger molecules in the showcase dataset.

Table 1.

RMSE in the prediction of the per atom polarizabilities of 52 showcase molecules

Method              RMSE     RMSE (λ=0)    RMSE (λ=2)
CCSD/DFT            0.573    0.348         0.456
CCSD/ML             0.244    0.120         0.212
DFT/ML              0.302    0.143         0.266
Δ(CCSD-DFT)/ML      0.181    0.083         0.161

CCSD/DFT denotes the discrepancy between CCSD and DFT values, while CCSD/ML and DFT/ML give the errors in predicting CCSD and DFT polarizabilities using AlphaML. Δ(CCSD-DFT)/ML gives the error in predicting the differences between the CCSD and DFT polarizabilities. All ML predictions are based on training on the full QM7b database. The total RMSE is expressed in atomic units (a.u.) per atom and broken down into the errors associated with the scalar (λ=0) and tensorial (λ=2) components of α.

We can investigate the performance of AlphaML in more detail by analyzing the errors of individual molecules in the showcase dataset. Fig. 2 shows that the errors are actually very small for most molecules. Large errors occur predominantly for highly polarizable compounds, particularly those that show a large degree of conjugation, such as long-chain alkenes and the purine nucleobases. For these systems, the underlying electronic structure is characterized by a high degree of delocalization, which requires larger cutoffs and more complex reference molecules to ensure accurate predictions. The ML predictions for the tensorial component of the polarizability, α(2), tend to be slightly less accurate than the DFT reference except for the highly polarizable alkenes, for which AlphaML dramatically outperforms DFT. Sulfur-containing structures, which are poorly represented in QM7b, also exhibit comparatively large errors.

Fig. 2.

RMSE made in approximating the λ=0 (Lower) and λ=2 (Upper) components of the per atom polarizability in the showcase dataset. The x axis corresponds to the numerical indices provided in the showcase molecule key in SI Appendix, and the vertical lines show the partitioning of the dataset into the different groups outlined in the same figure. Red squares show the ML error, blue circles show the error made in using DFT to approximate CCSD, and black crosses show the error made when Δ learning the CCSD correction with respect to DFT.

The large discrepancy between DFT, CCSD, and AlphaML observed for alkenes (like octatetraene) reflects the nonlocal and collective nature of the underlying physics in these systems as well as the inherent structure of the AlphaML model. For DFT and CCSD, the narrowing HOMO-LUMO (highest occupied molecular orbital-lowest unoccupied molecular orbital) gaps in conjugated hydrocarbons lead to near-metallic states, which are known to exhibit strong multireference character (46). As such, these systems represent a significant challenge for electronic structure methods (like DFT and CCSD) that are not explicitly based on a multireference wavefunction. In practice, this leads to divergent polarizabilities (47, 48), and methods like CCSD are no longer reliable as the source of reference quantum chemical data for ML. An ML framework like AlphaML, which relies on local atomic environments to represent structures, tacitly disregards any collective (nonlocal) behavior that extends beyond the range of the local domains and the size of the molecules included in the training set. As shown in Fig. 3, the per carbon polarizabilities predicted by AlphaML therefore saturate to a constant value for the s-trans alkenes and acenes that are larger than those included in the QM7b dataset (i.e., hexatriene and benzene, respectively). Although this is a limitation when trying to learn collective and nonlocal physics, the local structure of AlphaML is also instrumental for obtaining the accurate and transferable predictions that we demonstrated on the showcase dataset.

Fig. 3.

Polarizability per C atom for the series of s-trans alkenes (from C6H8 to C22H24) and acenes (from benzene to pentacene) as well as fullerene (C60). The reference CCSD results for anthracene and tetracene were taken from ref. 49, and the reference CCSD result for C60 was taken from ref. 50. The green squares (and error bars) indicate the experimental measurements for C60 (51). Results are provided from DFT and CCSD calculations as well as the corresponding AlphaML models.

Even when it comes to challenging conjugated systems with a vanishing HOMO-LUMO gap, the predictions of AlphaML are stable and completely avoid the unphysical and divergent predictions of costlier (but far from reference) quantum mechanical methods, like DFT and CCSD. For molecules with a sizable gap (like C60), the nonlocality is less pathological, and AlphaML performs remarkably well. For this prototypical nanotechnological system, ML predictions are within 10% of DFT and CCSD results and within the range of experimental values, despite the extrapolation to a system size that is one order of magnitude larger than the molecules in the training set.

Atom-Centered Environmental Polarizabilities.

The atom-centered structure of AlphaML provides a natural additive decomposition of α into a sum of local terms, $\sum_i \alpha_i$, which can be used to better understand how different functional groups contribute to the molecular polarizability. Unlike other methods for decomposing the polarizability, such as an atoms-in-molecules scheme (52) or a self-consistent decomposition (53), the approach used in this section does not require any additional calculations on top of the molecular polarizability, as the atom-centered polarizabilities are obtained as a by-product of the local nature of the SA-GPR scheme. When interpreting the $\alpha_i$, one should keep in mind that each term corresponds to the contribution from the entire atom-centered environment, and the way that the polarizability is split between neighboring atoms is entirely inductive, reflecting the interplay between data, structure (as represented by the kernels), and regression rather than explicit physicochemical considerations. For instance, a few atoms within the showcase dataset (in particular, several H environments) have $\alpha_i$ with negative eigenvalues, which reflects the fact that they reduce the dielectric response of the functional group to which they belong.

With this in mind, one can recognize physically meaningful features in the magnitude and anisotropy of the αi. Fig. 4 depicts eight representative examples. Comparing saturated and unsaturated hydrocarbons (e.g., 2,2-dimethylhexane, cis-4-octene, and octatetraene), one sees that AlphaML predicts the contribution from the unsaturated carbon atoms to be large and very anisotropic, which is consistent with the higher degree of electron delocalization along conjugated molecules. Similarly large and anisotropic contributions are associated with aromatic systems as seen, for example, in guanine and the indole ring of tryptophan. Oxygen atoms are associated with a very anisotropic αi; a large fraction of the polarizability of −OH and −COOH groups is assigned to the environments centered around nearby H and C atoms, but O atoms systematically contribute another anisotropic term oriented perpendicularly to the highly polarizable lone pairs (e.g., fructose as well as the carboxyl group in the amino acids). The sulfur-centered environments in cysteine and methionine have the largest contribution to the total polarizability in the showcase set and exhibit a strongly anisotropic response. All of these examples suggest that AlphaML can use relatively local structural information to determine an atom-centered decomposition of α that encodes nontrivial quantum mechanical contributions from each functional group (or moiety) contained within a given molecule. It is this ability to predict such an environment-dependent decomposition of α that underlies the observed better than DFT performance of AlphaML when faced with the often insurmountable challenge of transferability to a sector of chemical compound space that contains molecules that are quite distinct and notably larger than those included in the training set. 
A similar atom-centered decomposition can also be performed in the context of Δ learning, revealing the molecular features that are associated with the most substantial errors of the approximate methods. As shown in SI Appendix, this approach reveals how the large discrepancy between DFT and CCSD for alkenes is associated primarily with the extended conjugate system.
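The analysis of atom-centered contributions can be sketched as follows: the molecular tensor is the sum of the environment tensors, and each environment tensor is diagonalized to obtain its principal axes and the sign of each eigenvalue. The function names are hypothetical, and taking the square root of the absolute eigenvalue for the axis length is an assumption made here so that negative eigenvalues can still be drawn.

```python
import numpy as np

def molecular_polarizability(atomic_alphas):
    # Additive model: the molecular tensor is the sum of atom-centered terms.
    return np.sum(np.asarray(atomic_alphas, dtype=float), axis=0)

def ellipsoid_axes(alpha_i):
    """Principal axes of one atom-centered 3x3 contribution.

    Returns eigenvalues (ascending), eigenvectors (as columns), axis lengths
    proportional to sqrt(|eigenvalue|) (an assumption for negative values),
    and a boolean mask that flags the negative eigenvalues.
    """
    alpha_i = np.asarray(alpha_i, dtype=float)
    sym = 0.5 * (alpha_i + alpha_i.T)  # enforce symmetry before diagonalizing
    eigvals, eigvecs = np.linalg.eigh(sym)
    lengths = np.sqrt(np.abs(eigvals))
    return eigvals, eigvecs, lengths, eigvals < 0
```

With made-up contributions such as `np.diag([4.0, 1.0, 1.0])` and `np.diag([-1.0, 0.25, 0.25])`, the second environment would be flagged as having one negative principal value, which is the situation described for some H environments.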

Fig. 4.

Predicted atomic contributions to the total CCSD polarizability tensor for a selection of showcase molecules. The ellipsoids are aligned along the principal axes of αi, and their extent is proportional to the square root of the corresponding eigenvalue. The principal axes are shown, and they are colored based on whether the corresponding eigenvalues are positive (black) or negative (red). The figure key (which is not drawn to scale) has additional details.

Discussion

Polarizability calculations with traditional quantum chemical methods have always implied a tradeoff between accuracy and computational cost. While CCSD calculations give more accurate predictions for the polarizabilities of molecules (especially large molecules) than DFT with various functionals (4, 21), the associated computational cost can be prohibitive. In our case, the CCSD calculations for the largest molecules in the showcase dataset required thousands of central processing unit (CPU) hours and approximately 500 GB of RAM. In this paper, we have demonstrated that the AlphaML framework, which combines SA-GPR with λ-SOAP kernels and CCSD reference calculations on small molecules, allows us to sidestep these expensive calculations and obtain results with an accuracy that almost always exceeds DFT but at a fraction of the computational cost. Although this model was trained on a database of small organic molecules, it can be used to predict larger compounds with an accuracy that rivals DFT and can be systematically improved by extending the training set. The atom-centered decomposition of the ML predictions of α can be interpreted in terms of physicochemical considerations, revealing for instance the large and anisotropic contributions that originate from delocalized π systems. In doing so, however, one should keep in mind that these contributions correspond to chemical environments rather than an atoms-in-molecules decomposition scheme.

Having shown the promise of the AlphaML framework by learning polarizabilities of small molecules, future work will focus on extensions of the training set to include larger molecules and oligomers, improvements in the accuracy of the underlying reference calculations, incorporation of methods to estimate the uncertainty in the predictions, as well as more efforts to include collective and nonlocal physics into the model. The need for these fundamental developments is underscored by an analysis of the behavior of the challenging series of conjugated alkenes, which are predicted by DFT and CCSD to have divergent polarizabilities due to vanishing HOMO-LUMO gaps. These improvements will make it possible to predict the polarizability for increasingly complex molecular systems and eventually, condensed phases. The availability of inexpensive atom-centered estimates of the fully anisotropic α will be useful to design more accurate polarizable force fields for atomistic simulations as well as to computationally evaluate Raman and sum frequency generation spectroscopies, thereby improving the predictive power of simulations and increasing the insight that can be obtained from experiments.

Materials and Methods

First Principles Calculations.

In this work, DFT calculations with the B3LYP functional (36, 37) and all LR-CCSD calculations were performed with Psi4 v1.1 (54), while DFT calculations with the SCAN0 functional (38) were performed with Q-Chem v5.0 (55). All of the molecular geometries used for ML were taken from the QM7b database (34, 35). All 52 showcase molecules (as well as the alkene series, acene series, and fullerene molecule in Fig. 3) were relaxed following the protocol used for QM7b (34). All DFT polarizabilities were computed with the finite-field method using a central difference formula with a step size of $\delta E = 1.8897261250 \times 10^{-5}$ a.u. All CCSD polarizabilities were calculated using LR-CCSD, except for those of the eight largest molecules in the showcase dataset (which include molecules 18, 19, 20, 21, 23, 25, 26, and 28 as listed in SI Appendix). CCSD polarizabilities for these molecules were obtained using the (orbital unrelaxed) finite-field method due to the prohibitively large computational resources (memory and disk) needed by LR-CCSD. The frozen core approximation and direct scf_type were used during all CCSD calculations. In the cases where the finite-field method was used, CCSD polarizabilities were obtained as $\alpha = -\partial^2 U/\partial E^2$. Additional details of the calculations are given in SI Appendix. The polarizability data generated can be found in Yang et al. (56).
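The finite-field central difference for a dipole-based polarizability can be sketched as below. This is an illustration only: `dipole_fn` is a hypothetical callable standing in for an electronic structure calculation that returns the dipole vector at a given applied field, the default step of order 10^-5 a.u. is an assumed small value, and the toy linear-response dipole is used purely to check the arithmetic.

```python
import numpy as np

def finite_field_polarizability(dipole_fn, step=1.8897261250e-5):
    """Central-difference estimate of alpha_ij = d mu_i / d E_j (atomic units)."""
    alpha = np.zeros((3, 3))
    for j in range(3):
        field = np.zeros(3)
        field[j] = step
        mu_plus = np.asarray(dipole_fn(field), dtype=float)
        mu_minus = np.asarray(dipole_fn(-field), dtype=float)
        # Column j holds the response of all dipole components to a field along j.
        alpha[:, j] = (mu_plus - mu_minus) / (2.0 * step)
    return alpha

def make_linear_response_dipole(mu0, alpha):
    # Toy model (assumption): mu(E) = mu0 + alpha @ E, with no higher-order terms.
    mu0 = np.asarray(mu0, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return lambda field: mu0 + alpha @ np.asarray(field, dtype=float)
```

Each tensor column costs two dipole evaluations, so six field-perturbed calculations are needed per molecule in this scheme; for a real system, the step must be small enough to stay in the linear regime yet large enough to avoid numerical noise.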

Error Assessment.

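We use the Frobenius norm, defined as $\|\alpha\|_F^2 = \sum_{i,j \in \{x,y,z\}} \alpha_{ij}^2$, to assess the accuracy of a polarizability estimate α in a way that is rotationally invariant and includes both scalar and anisotropic components. Given two sets of polarizabilities, $\alpha_i$ and $\alpha'_i$, for N structures (each containing $n_i$ atoms), we define the following quantities: $\mathrm{MSE} \equiv \frac{1}{N}\sum_i (\|\alpha_i\|_F - \|\alpha'_i\|_F)/n_i$; $\mathrm{MAE} \equiv \frac{1}{N}\sum_i \|\alpha_i - \alpha'_i\|_F / n_i$; and $\mathrm{RMSE} \equiv \left[\frac{1}{N}\sum_i \|\alpha_i - \alpha'_i\|_F^2 / n_i^2\right]^{1/2}$. Errors are defined on a per atom basis to simplify the comparison between molecules of different sizes.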
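These per atom error measures can be sketched in a few lines. The sketch assumes that the first set of tensors is the estimate and the second is the reference, that the MSE compares Frobenius norms while the MAE and RMSE take the norm of the tensor difference, and that the arrays are stacked with shape (N, 3, 3); the function name is illustrative.

```python
import numpy as np

def per_atom_errors(alpha_est, alpha_ref, n_atoms):
    """Return (MSE, MAE, RMSE) per atom for stacks of 3x3 tensors."""
    a = np.asarray(alpha_est, dtype=float)   # shape (N, 3, 3)
    b = np.asarray(alpha_ref, dtype=float)   # shape (N, 3, 3)
    n = np.asarray(n_atoms, dtype=float)     # shape (N,)
    norm_a = np.linalg.norm(a, axis=(1, 2))  # Frobenius norm of each tensor
    norm_b = np.linalg.norm(b, axis=(1, 2))
    diff = np.linalg.norm(a - b, axis=(1, 2))
    mse = np.mean((norm_a - norm_b) / n)     # signed: positive means overestimation
    mae = np.mean(diff / n)
    rmse = np.sqrt(np.mean(diff ** 2 / n ** 2))
    return mse, mae, rmse
```

Dividing the RMSE by the intrinsic variability of the reference set then gives the percentage figures quoted for the learning curves.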

SA-GPR.

The SA-GPR framework used herein to build an ML model for the polarizability is based on the following steps. (i) Each polarizability tensor, α, is decomposed into its irreducible (real spherical) components: the scalar α(0)=αxx+αyy+αzz/3 and the five-vector α(2)=2αxy,αyz,αxz,2αzzαxxαyy23,αxxαyy2. One can compute the RMSE separately on these two components, since αF2=|α(0)|F2+|α(2)|F2. (ii) λ-SOAP vector components αnlαnl|Xj,λμ(2) are computed for each environment Xj and describe interatomic correlations within a prescribed cutoff radius, rc, of the central atom j. The definition of these components is given in ref. 33. (iii) The base kernel between two environments, suitable to learn tensor components of order λ, is then defined as

kμj,μkλkμμλ(Xj,Xk)=JXj,λμ|JJ|Xk,λμ, [1]

where we use the shorthand {J} to indicate a subset of the possible spherical harmonic components of the descriptors, αnlαnl, that are automatically selected with a farthest-point sampling procedure (43). (iv) The linear SOAP kernel can describe atomic correlations up to three-body terms. Many-body correlations can be introduced by normalizing it and then raising it to an integer power. To preserve the linear nature of the λ-SOAP kernels, which is crucial to enforce the correct symmetry properties, we use

kμμλ,ζ(Xj,Xk)kμμλ(Xj,Xk)k000(Xj,Xk)ζ1;kj,kλkj,kλ/kj,jλFkk,kλF. [2]

(v) For each component of α, we build a kernel ridge regression model with weights wkμ that are determined by optimizing the loss

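$$\ell^2 = \sum_{\mu,\, A \in N} \left| \alpha^{(\lambda)}_{\mu}(A) - \sum_{k \in M} \sum_{j \in A} \sum_{\mu'} w_{k\mu'} \left(\mathbf{k}^{\lambda}_{j,k}\right)_{\mu\mu'} \right|^2 + \sigma^2\, \mathbf{w}^{T} \mathbf{K}_{MM} \mathbf{w},$$ [3]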

in which $N$ is the training set, $M$ is a (possibly sparse) set of representative environments used as the basis, and $\mathbf{K}_{MM}$ is the matrix of kernels between representative environments. An online prediction tool for $\alpha$, based on the AlphaML framework, is also available at http://alphaml.org.
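Steps (i), (iv), and (v) above can be sketched numerically as follows. This is a hedged illustration under stated assumptions, not the SA-GPR implementation: all function names are hypothetical, and the $\lambda$-SOAP kernel blocks are taken as given inputs rather than computed. The spherical components are normalized so that the scalar and five-vector parts together reproduce the squared Frobenius norm of a symmetric tensor. The weights are obtained from the closed-form minimizer of a loss of the form $|y - K_{NM} w|^2 + \sigma^2 w^T K_{MM} w$, where each row of `K_NM` is assumed to already contain the sum over the atoms of one training structure.

```python
import numpy as np

def to_spherical(alpha):
    """Split a symmetric 3x3 tensor into scalar (lambda=0) and five-vector (lambda=2) parts."""
    a = np.asarray(alpha, dtype=float)
    xx, yy, zz = a[0, 0], a[1, 1], a[2, 2]
    alpha0 = (xx + yy + zz) / np.sqrt(3.0)
    alpha2 = np.sqrt(2.0) * np.array([
        a[0, 1],
        a[1, 2],
        a[0, 2],
        (2.0 * zz - xx - yy) / (2.0 * np.sqrt(3.0)),
        (xx - yy) / 2.0,
    ])
    return alpha0, alpha2

def normalize_block(k_jk, k_jj, k_kk):
    """Divide a kernel block by sqrt(|k_jj|_F * |k_kk|_F) so self-kernels have unit norm."""
    return k_jk / np.sqrt(np.linalg.norm(k_jj) * np.linalg.norm(k_kk))

def zeta_kernel(k_lambda, k_scalar, zeta):
    """Multiply a tensorial block by the scalar (lambda=0) kernel raised to zeta - 1.

    The tensorial block enters only once, so the result stays linear in it.
    """
    return k_lambda * k_scalar ** (zeta - 1)

def fit_weights(K_NM, K_MM, y, sigma):
    """Minimize |y - K_NM w|^2 + sigma^2 w^T K_MM w in closed form."""
    lhs = K_NM.T @ K_NM + sigma**2 * K_MM
    rhs = K_NM.T @ y
    return np.linalg.solve(lhs, rhs)

# Toy check of the norm identity for a made-up symmetric tensor
alpha = np.array([[10.0, 0.5, 0.2],
                  [0.5, 8.0, -0.3],
                  [0.2, -0.3, 12.0]])
a0, a2 = to_spherical(alpha)
print(np.linalg.norm(alpha) ** 2, a0**2 + a2 @ a2)
```

In this sketch the many-body character comes entirely from the scalar kernel factor, which is why the tensorial block is multiplied in only once instead of being raised to a power itself.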

Supplementary Material

Supplementary File

Acknowledgments

The authors thank Felix Musil and Michael Willatt for helpful discussions. D.M.W. and M.C. acknowledge support from European Research Council Horizon 2020 Grant 677013-HBMAP. A.G. acknowledges funding from the Max Planck–École Polytechnique Fédérale de Lausanne (MPG-EPFL) Center for Molecular Nanoscience. Y.Y., K.U.L., and R.A.D. acknowledge partial support from Cornell University through startup funding. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract DE-AC02-06CH11357. This work was also supported by a grant from the Swiss National Supercomputing Center under Project ID s843 and by the EPFL Scientific Computing Center.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: Coordinates and dipole polarizabilities of the QM7b database, the showcase dataset, and the alkenes and acenes have been deposited in the Materials Cloud Archive (https://archive.materialscloud.org/2019.0002/v1).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1816132116/-/DCSupplemental.

References

  • 1. Engel E, Dreizler RM. Density Functional Theory: An Advanced Course. Springer; Berlin: 2011.
  • 2. Burke K. Perspective on density functional theory. J Chem Phys. 2012;136:150901. doi: 10.1063/1.4704546.
  • 3. Lejaeghere K, et al. Reproducibility in density functional theory calculations of solids. Science. 2016;351:145–152. doi: 10.1126/science.aad3000.
  • 4. Hait D, Head-Gordon M. How accurate are static polarizability predictions from density functional theory? An assessment over 132 species at equilibrium geometry. Phys Chem Chem Phys. 2018;20:19800–19810. doi: 10.1039/c8cp03569e.
  • 5. Stone A. The Theory of Intermolecular Forces. International Series of Monographs on Chemistry. Clarendon; Oxford, United Kingdom: 1997.
  • 6. Hermann J, DiStasio RA Jr, Tkatchenko A. First-principles models for van der Waals interactions in molecules and materials: Concepts, theory, and applications. Chem Rev. 2017;117:4714–4758. doi: 10.1021/acs.chemrev.6b00446.
  • 7. Grimme S. Dispersion interaction and chemical bonding. In: Frenking G, Shaik S, editors. The Chemical Bond: Chemical Bonding Across the Periodic Table. Wiley-VCH; Hoboken, NJ: 2014. pp. 477–500.
  • 8. Shen YR. Surface properties probed by second harmonic and sum-frequency generation. Nature. 1989;337:519–525.
  • 9. Luber S, Iannuzzi M, Hutter J. Raman spectra from ab initio molecular dynamics and its application to liquid s-methyloxirane. J Chem Phys. 2014;141:094503. doi: 10.1063/1.4894425.
  • 10. Morita A, Hynes JT. A theoretical analysis of the sum frequency generation spectrum of the water surface. Chem Phys. 2000;258:371–390.
  • 11. Medders GR, Paesani F. Dissecting the molecular structure of the air/water interface from quantum simulations of the sum-frequency generation spectrum. J Am Chem Soc. 2016;138:3912–3919. doi: 10.1021/jacs.6b00893.
  • 12. Sprik M, Klein ML. A polarizable model for water using distributed charge sites. J Chem Phys. 1988;89:7556–7560.
  • 13. Fanourgakis GS, Xantheas SS. Development of transferable interaction potentials for water. V. Extension of the flexible, polarizable, Thole-type model potential (TTM3-F, v. 3.0) to describe the vibrational spectra of water clusters and liquid water. J Chem Phys. 2008;128:074506. doi: 10.1063/1.2837299.
  • 14. Ponder JW, et al. Current status of the AMOEBA polarizable force field. J Phys Chem B. 2010;114:2549–2564. doi: 10.1021/jp910674d.
  • 15. Medders GR, Babin V, Paesani F. Development of a “first-principles” water potential with flexible monomers. III. Liquid phase properties. J Chem Theory Comput. 2014;10:2906–2910. doi: 10.1021/ct5004115.
  • 16. Bereau T, DiStasio RA Jr, Tkatchenko A, von Lilienfeld OA. Non-covalent interactions across organic and biological subsets of chemical space: Physics-based potentials parametrized from machine learning. J Chem Phys. 2018;148:241706. doi: 10.1063/1.5009502.
  • 17. Monkhorst HJ. Calculation of properties with the coupled-cluster method. Int J Quantum Chem. 1977;12:421–432.
  • 18. Koch H, Jørgensen P. Coupled cluster response functions. J Chem Phys. 1990;93:3333–3344.
  • 19. Christiansen O, Jørgensen P, Hättig C. Response functions from Fourier component variational perturbation theory applied to a time-averaged quasienergy. Int J Quantum Chem. 1998;68:1–52.
  • 20. Christiansen O, Gauss J, Stanton JF. Frequency-dependent polarizabilities and first hyperpolarizabilities of CO and H2O from coupled cluster calculations. Chem Phys Lett. 1999;305:147–155.
  • 21. Hammond JR, de Jong WA, Kowalski K. Coupled-cluster dynamic polarizabilities including triple excitations. J Chem Phys. 2008;128:224102. doi: 10.1063/1.2929840.
  • 22. Hammond JR, Govind N, Kowalski K, Autschbach J, Xantheas SS. Accurate dipole polarizabilities for water clusters n=2-12 at the coupled-cluster level of theory and benchmarking of various density functionals. J Chem Phys. 2009;131:214103. doi: 10.1063/1.3263604.
  • 23. Lao KU, Jia J, Maitra R, DiStasio RA Jr. On the geometric dependence of the molecular dipole polarizability in water: A benchmark study of higher-order electron correlation, basis set incompleteness error, core electron effects, and zero-point vibrational contributions. J Chem Phys. 2018;149:204303. doi: 10.1063/1.5051458.
  • 24. Behler J, Parrinello M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys Rev Lett. 2007;98:146401. doi: 10.1103/PhysRevLett.98.146401.
  • 25. Bartók AP, Payne MC, Kondor R, Csányi G. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys Rev Lett. 2010;104:136403. doi: 10.1103/PhysRevLett.104.136403.
  • 26. Rupp M, Tkatchenko A, Müller KR, von Lilienfeld OA. Fast and accurate modeling of molecular atomization energies with machine learning. Phys Rev Lett. 2012;108:058301. doi: 10.1103/PhysRevLett.108.058301.
  • 27. De S, Bartók AP, Csányi G, Ceriotti M. Comparing molecules and solids across structural and alchemical space. Phys Chem Chem Phys. 2016;18:13754–13769. doi: 10.1039/c6cp00415f.
  • 28. Faber FA, et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. J Chem Theory Comput. 2017;13:5255–5264. doi: 10.1021/acs.jctc.7b00577.
  • 29. Ramakrishnan R, Dral PO, Rupp M, von Lilienfeld OA. Big data meets quantum chemistry approximations: The Δ-machine learning approach. J Chem Theory Comput. 2015;11:2087–2096. doi: 10.1021/acs.jctc.5b00099.
  • 30. Bartók AP, et al. Machine learning unifies the modeling of materials and molecules. Sci Adv. 2017;3:e1701816. doi: 10.1126/sciadv.1701816.
  • 31. Bereau T, Andrienko D, von Lilienfeld OA. Transferable atomic multipole machine learning models for small organic molecules. J Chem Theory Comput. 2015;11:3225–3233. doi: 10.1021/acs.jctc.5b00301.
  • 32. Liang C, et al. Solvent fluctuations and nuclear quantum effects modulate the molecular hyperpolarizability of water. Phys Rev B. 2017;96:041407.
  • 33. Grisafi A, Wilkins DM, Csányi G, Ceriotti M. Symmetry-adapted machine learning for tensorial properties of atomistic systems. Phys Rev Lett. 2018;120:036002. doi: 10.1103/PhysRevLett.120.036002.
  • 34. Montavon G, et al. Machine learning of molecular electronic properties in chemical compound space. New J Phys. 2013;15:095003.
  • 35. Blum LC, Reymond JL. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J Am Chem Soc. 2009;131:8732–8733. doi: 10.1021/ja902302h.
  • 36. Becke AD. Density-functional thermochemistry. III. The role of exact exchange. J Chem Phys. 1993;98:5648–5652.
  • 37. Stephens PJ, Devlin FJ, Chabalowski CF, Frisch MJ. Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. J Phys Chem. 1994;98:11623–11627.
  • 38. Hui K, Chai JD. SCAN-based hybrid and double-hybrid density functionals from models without fitted parameters. J Chem Phys. 2016;144:044114. doi: 10.1063/1.4940734.
  • 39. Woon DE, Dunning TH Jr. Gaussian basis sets for use in correlated molecular calculations. IV. Calculation of static electrical response properties. J Chem Phys. 1994;100:2975–2988.
  • 40. Christiansen O, Hättig C, Gauss J. Polarizabilities of CO, N2, HF, Ne, BH, and CH+ from ab initio calculations: Systematic studies of electron correlation, basis set errors and vibrational contributions. J Chem Phys. 1998;109:4745–4757.
  • 41. Reis H, Papadopoulos MG, Avramopoulos A. Calculation of the microscopic and macroscopic linear and nonlinear optical properties of acetonitrile. I. Accurate molecular properties in the gas phase and susceptibilities of the liquid in Onsager’s reaction-field model. J Phys Chem A. 2003;107:3907–3917.
  • 42. Karne AS, et al. Systematic comparison of DFT and CCSD dipole moments, polarizabilities and hyperpolarizabilities. Chem Phys Lett. 2015;635:168–173.
  • 43. Imbalzano G, et al. Automatic selection of atomic fingerprints and reference configurations for machine-learning potentials. J Chem Phys. 2018;148:241730. doi: 10.1063/1.5024611.
  • 44. Bartók AP, Kondor R, Csányi G. On representing chemical environments. Phys Rev B. 2013;87:184115.
  • 45. Glielmo A, Zeni C, De Vita A. Efficient nonparametric n-body force fields from machine learning. Phys Rev B. 2018;97:184307.
  • 46. Voloshina E, Paulus B. First multireference correlation treatment of bulk metals. J Chem Theory Comput. 2014;10:1698–1706. doi: 10.1021/ct401040t.
  • 47. Smith SM, et al. Static and dynamic polarizabilities of conjugated molecules and their cations. J Phys Chem A. 2004;108:11063–11072.
  • 48. Grüning M, Gritsenko OV, Baerends EJ. Exchange potential from the common energy denominator approximation for the Kohn–Sham Green’s function: Application to (hyper)polarizabilities of molecular chains. J Chem Phys. 2002;116:6435–6442.
  • 49. Huzak M, Deleuze MS. Benchmark theoretical study of the electric polarizabilities of naphthalene, anthracene, and tetracene. J Chem Phys. 2013;138:024319. doi: 10.1063/1.4773018.
  • 50. Kowalski K, Hammond JR, de Jong WA, Sadlej AJ. Coupled cluster calculations for static and dynamic polarizabilities of C60. J Chem Phys. 2008;129:226101. doi: 10.1063/1.3028541.
  • 51. Sabirov DS. Polarizability as a landmark property for fullerene chemistry and materials science. RSC Adv. 2014;4:44996.
  • 52. Laidig KE, Bader RFW. Properties of atoms in molecules: Atomic polarizabilities. J Chem Phys. 1990;93:7213–7224.
  • 53. Applequist J, Carl JR, Fung KK. Atom dipole interaction model for molecular polarizability. Application to polyatomic molecules and determination of atom polarizabilities. J Am Chem Soc. 1972;94:2952–2960.
  • 54. Parrish RM, et al. Psi4 1.1: An open-source electronic structure program emphasizing automation, advanced libraries, and interoperability. J Chem Theory Comput. 2017;13:3185–3197. doi: 10.1021/acs.jctc.7b00174.
  • 55. Shao Y, et al. Advances in molecular quantum chemistry contained in the Q-Chem 4 program package. Mol Phys. 2015;113:184–215.
  • 56. Yang Y, et al. Coupled-cluster polarizabilities in the QM7b and a showcase database. Materials Cloud Archive. 2019. doi: 10.24435/materialscloud:2019.0002/v1.
